## DNA Sequencing With Machine Learning

In this notebook, I train a classification model that predicts a gene's function from the DNA of its coding sequence alone.

## 1. Importing the required modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

In [None]:
human_data_old = pd.read_table('human_data.txt')
human_data_old.head()

In [None]:
human_data_old['sequence'][0]

In [None]:
dog_data_old = pd.read_table('dog_data.txt')
dog_data_old.head()

## 2. Data preprocessing

In [None]:
human_data_old.info()

In [None]:
dog_data_old.info()

In [None]:
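# Drop the trailing rows (last 2380 human rows, last 320 dog rows).
# .copy() makes independent DataFrames, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
human_data = human_data_old.iloc[:-2380, :].copy()
dog_data = dog_data_old.iloc[:-320, :].copy()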

In [None]:
dog_data.info()

In [None]:
# Convert a sequence string into overlapping k-mer "words" (default size = 5)
def get_kmers(sequence, size=5):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]

In [None]:
# Replace each raw sequence with its list of k-mer words
human_data['words'] = human_data['sequence'].apply(get_kmers)
human_data = human_data.drop('sequence', axis=1)
dog_data['words'] = dog_data['sequence'].apply(get_kmers)
dog_data = dog_data.drop('sequence', axis=1)

In [None]:
human_data.head()
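
To make the k-mer step concrete, here is a minimal sketch on a made-up 9-base sequence. The names `kmers_sketch`, `toy_sequence`, `toy_words` and `toy_sentence` are illustrative only and are not part of the notebook's data.

```python
def kmers_sketch(sequence, size=5):
    """Return all overlapping k-mers of length `size`, lowercased."""
    return [sequence[i:i + size].lower() for i in range(len(sequence) - size + 1)]

toy_sequence = "ATGCATGCA"  # made-up sequence, 9 bases long
toy_words = kmers_sketch(toy_sequence)
print(toy_words)
# ['atgca', 'tgcat', 'gcatg', 'catgc', 'atgca']

# Joining the k-mers with spaces gives the "sentence" form used for text vectorizers
toy_sentence = " ".join(toy_words)
print(toy_sentence)
```

A sequence of length n yields n - size + 1 overlapping k-mers, so the 9-base toy sequence gives 5 words.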

In [None]:
# Join each list of k-mers into one space-separated "sentence" per gene
human_texts = [' '.join(words) for words in human_data['words']]
y_data = human_data['class'].values  # class labels

In [None]:
print(human_texts[2])

In [None]:
y_data

In [None]:
# Same conversion for the dog data
dog_texts = [' '.join(words) for words in dog_data['words']]
y_dog = dog_data['class'].values  # class labels

## Applying the bag-of-words model with CountVectorizer

In [None]:
# Creating the Bag of Words model using CountVectorizer()
# This is equivalent to k-mer counting
# The n-gram size of 4 was previously determined by testing
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(4,4))
X = cv.fit_transform(human_texts)
X_dog = cv.transform(dog_texts)

In [None]:
print(X.shape)
print(X_dog.shape)
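
As a rough illustration of what `ngram_range=(4,4)` does to k-mer sentences, the sketch below vectorizes two made-up sentences. The names `toy_texts`, `toy_cv` and `toy_X` are hypothetical and exist only for this example. Each feature is a run of four consecutive words, so with overlapping 5-mers a single feature spans 8 bases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "sentences" of overlapping 5-mers (not taken from the real data)
toy_texts = [
    "atgca tgcat gcatg catgc atgca",
    "ggcta gctag ctagc tagct agcta",
]

toy_cv = CountVectorizer(ngram_range=(4, 4))  # count runs of 4 consecutive words
toy_X = toy_cv.fit_transform(toy_texts)

# Five words per sentence give two 4-word n-grams per sentence
print(sorted(toy_cv.vocabulary_))
print(toy_X.toarray())
print(toy_X.shape)
```

Calling `transform` on a second corpus, as done for the dog data, reuses the vocabulary learned from the first; n-grams that were never seen during `fit_transform` are ignored.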

### Looking at the class balance, we can see that the dataset is relatively balanced.

In [None]:
human_data['class'].value_counts().sort_index().plot.bar()
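
The bar chart gives a visual impression. One way to put a number on the balance is to compare class proportions. The sketch below uses made-up labels (`toy_labels` is hypothetical, not the real `class` column), so the printed values say nothing about the actual dataset.

```python
import pandas as pd

# Made-up class labels, for illustration only
toy_labels = pd.Series([0, 1, 1, 2, 2, 2, 3, 3], name="class")

# Fraction of samples in each class
proportions = toy_labels.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced
imbalance_ratio = proportions.max() / proportions.min()
print(imbalance_ratio)
```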

## 3. Model Building

Two naive Bayes classifiers will be created: a multinomial one and a Bernoulli one. In earlier parameter tuning, I found that an n-gram size of 4 (reflected in the CountVectorizer() instance) and a model alpha of 0.1 performed best.
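
The tuning run itself is not shown in this notebook. As a rough sketch of how a search over alpha could look, the example below uses scikit-learn's `GridSearchCV` on a small synthetic count matrix. `X_demo`, `y_demo` and the candidate alpha values are assumptions made for illustration and are not the data or grid used to arrive at 0.1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-ins for a count matrix and its labels (not the gene data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(60, 20))  # non-negative counts
y_demo = np.arange(60) % 3                  # three evenly sized classes

# Try a few smoothing strengths with 3-fold cross-validation
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
print(best_alpha, search.best_score_)
```

Tuning the n-gram size as well would need the vectorizer inside the search, for example by wrapping `CountVectorizer` and the classifier in a `Pipeline`.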

In [None]:

# Splitting the human dataset into the training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_data, test_size=0.20, random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [None]:
mnb = MultinomialNB(alpha=0.1)
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)

In [None]:
bnb = BernoulliNB(alpha=0.1)
bnb.fit(X_train, y_train)
y2_pred = bnb.predict(X_test)

### Let's look at some model performance metrics: the confusion matrix, accuracy, precision, recall, and F1 score.

In [None]:
print('Result Box for MultinomialNB: \n') 
print("Confusion matrix")
print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y_pred, name='Predicted')))
# Helper: accuracy plus support-weighted precision, recall and F1
def get_metrics(y_test, y_predicted):
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1
accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))
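
The scores above use `average='weighted'`, which averages the per-class scores weighted by how many true samples each class has. A minimal sketch on made-up labels (`y_true` and `y_hat` are hypothetical) checks this against a manual computation:

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels: four samples of class 0, two of class 1
y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 0])

# Per-class precision: class 0 -> 3/4, class 1 -> 1/2
per_class = precision_score(y_true, y_hat, average=None)

# Weight each class by its number of true samples (its support)
support = np.bincount(y_true)
manual = np.sum(per_class * support) / support.sum()

weighted = precision_score(y_true, y_hat, average="weighted")
print(per_class, manual, weighted)
```

Here the weighted precision is (0.75 * 4 + 0.5 * 2) / 6 = 2/3, so larger classes contribute more to the final score.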

In [None]:
print('Result Box for BernoulliNB: \n')
print("Confusion matrix")
print(pd.crosstab(pd.Series(y_test, name='Actual'), pd.Series(y2_pred, name='Predicted')))
# Helper: accuracy plus support-weighted precision, recall and F1
def get_metrics(y_test, y_predicted):
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1
# Score the Bernoulli predictions (y2_pred), not the multinomial ones
accuracy, precision, recall, f1 = get_metrics(y_test, y2_pred)
print("accuracy = %.3f \nprecision = %.3f \nrecall = %.3f \nf1 = %.3f" % (accuracy, precision, recall, f1))

## Comparison between the two outputs:

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

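# Rows are actual classes, columns are predicted classes
cm_alg1 = confusion_matrix(y_test, y_pred)
cm_alg2 = confusion_matrix(y_test, y2_pred)

# fmt='d' prints the counts as plain integers instead of scientific notation
sns.heatmap(cm_alg1, annot=True, fmt='d', cmap="Greens")
plt.title("Multinomial Confusion Matrix")
plt.show()

sns.heatmap(cm_alg2, annot=True, fmt='d', cmap='Blues')
plt.title("Bernoulli Confusion Matrix")
plt.show()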
