### Introduction

Promoters are DNA regions, located near the start of a gene, where transcription begins. Mutations in promoter regions have been linked to many human diseases, including diabetes and cancer, so classifying promoters has attracted considerable attention from researchers in bioinformatics.
This project classifies DNA sequences as promoter or non-promoter.

### Dataset

This project uses a dataset from the UCI Machine Learning Repository that contains 106 DNA sequences, each 57 nucleotides (“base pairs”) long.
The dataset can be downloaded from: https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/


##### Loading and Exploring the dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'

names = ['Class','id','Sequence']
data = pd.read_csv(url, names = names)
data

Unnamed: 0,Class,id,Sequence
0,+,S10,\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
1,+,AMPC,\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaat...
2,+,AROH,\t\tgtactagagaactagtgcattagcttatttttttgttatcat...
3,+,DEOP2,\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
4,+,LEU1_TRNA,\ttcgataattaactattgacgaaaagctgaaaaccactagaatgc...
...,...,...,...
101,-,799,\t\tcctcaatggcctctaaacgggtcttgaggggttttttgctga...
102,-,987,\t\tgtattctcaacaagattaaccgacagattcaatctcgtggat...
103,-,1226,\t\tcgcgactacgatgagatgcctgagtgcttccgttactggatt...
104,-,794,\t\tctcgtcctcaatggcctctaaacgggtcttgaggggtttttt...


In [3]:
data.shape

(106, 3)

In [4]:
data.iloc[3]

Class                                                       +
id                                                      DEOP2
Sequence    \taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaa...
Name: 3, dtype: object

##### Data Preprocessing

As the output above shows, the raw data is not in a format we can use directly for analysis: each sequence is a single string with leading tab characters.
We need to build a customized DataFrame from this information.

*Each column in a DataFrame is called a Series. We will extract a Series for each column that we need, i.e., the classes and the sequences. We don't require the id column.*

In [5]:
classes = data.loc[:,'Class']
classes

0      +
1      +
2      +
3      +
4      +
      ..
101    -
102    -
103    -
104    -
105    -
Name: Class, Length: 106, dtype: object

In [6]:
#generate list of DNA sequences

sequences = list(data.loc[:,'Sequence'])
sequences[0:5]

['\t\ttactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt',
 '\t\ttgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa',
 '\t\tgtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg',
 '\taattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaatactaacaaactc',
 '\ttcgataattaactattgacgaaaagctgaaaaccactagaatgcgcctccgtggtag']

##### In these sequences, we need to remove the tab characters at the start and split each sequence into individual nucleotides.

In [7]:
dataset = {}
for i, seq in enumerate(sequences):
    # get individual nucleotides and remove tab
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']
    
    # append class assignment
    nucleotides.append(classes[i])
    
    # add to dataset
    dataset[i] = nucleotides
    

In [8]:
print(dataset[0])

['t', 'a', 'c', 't', 'a', 'g', 'c', 'a', 'a', 't', 'a', 'c', 'g', 'c', 't', 't', 'g', 'c', 'g', 't', 't', 'c', 'g', 'g', 't', 'g', 'g', 't', 't', 'a', 'a', 'g', 't', 'a', 't', 'g', 't', 'a', 't', 'a', 'a', 't', 'g', 'c', 'g', 'c', 'g', 'g', 'g', 'c', 't', 't', 'g', 't', 'c', 'g', 't', '+']
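
The same cleanup can also be written without a dictionary and a later transpose. The following is a minimal sketch, assuming a small hand-made frame (`raw`, with shortened sequences) in place of the downloaded data; `nucleotide_df` is a name introduced here for illustration only.

```python
import pandas as pd

# Two raw rows in the same shape as the `Sequence` column (leading tabs included).
# The sequences are shortened to 10 nucleotides for illustration.
raw = pd.DataFrame({
    "Class": ["+", "-"],
    "Sequence": ["\t\ttactagcaat", "\tcctcaatggc"],
})

# Strip the surrounding whitespace (tabs included), then split each string
# into single characters.
cleaned = raw["Sequence"].str.strip()
nucleotide_df = pd.DataFrame([list(seq) for seq in cleaned])

# Attach the label as the last column; rows are already one per sequence.
nucleotide_df["Class"] = raw["Class"].values

print(nucleotide_df)
```

Because each inner list becomes one row, the result already has one sequence per row and one nucleotide position per column.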


##### Now we will create our final dataframe from this dataset

In [9]:
df1 = pd.DataFrame(dataset)
df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,96,97,98,99,100,101,102,103,104,105
0,t,t,g,a,t,a,c,t,c,t,...,c,c,t,a,g,c,g,c,c,t
1,a,g,t,a,c,g,a,t,g,t,...,c,g,a,g,a,c,t,g,t,a
2,c,c,a,t,g,g,g,t,a,t,...,g,c,t,a,g,t,a,c,c,a
3,t,t,c,t,a,g,g,c,c,t,...,a,t,g,g,a,c,t,g,g,c
4,a,a,t,g,t,g,g,t,t,a,...,g,a,a,g,g,a,t,a,t,a


In [10]:
#take transpose of dataframe
df = df1.transpose()
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,t,a,c,t,a,g,c,a,a,t,...,g,c,t,t,g,t,c,g,t,+
1,t,g,c,t,a,t,c,c,t,g,...,c,a,t,c,g,c,c,a,a,+
2,g,t,a,c,t,a,g,a,g,a,...,c,a,c,c,c,g,g,c,g,+
3,a,a,t,t,g,t,g,a,t,g,...,a,a,c,a,a,a,c,t,c,+
4,t,c,g,a,t,a,a,t,t,a,...,c,c,g,t,g,g,t,a,g,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101,c,c,t,c,a,a,t,g,g,c,...,g,a,a,c,t,a,t,a,t,-
102,g,t,a,t,t,c,t,c,a,a,...,t,c,a,a,c,a,t,t,g,-
103,c,g,c,g,a,c,t,a,c,g,...,a,a,g,g,c,t,t,c,c,-
104,c,t,c,g,t,c,c,t,c,a,...,a,g,g,a,g,g,a,a,c,-


In [11]:
df.rename(columns = {57:'Class'}, inplace=True)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,Class
0,t,a,c,t,a,g,c,a,a,t,...,g,c,t,t,g,t,c,g,t,+
1,t,g,c,t,a,t,c,c,t,g,...,c,a,t,c,g,c,c,a,a,+
2,g,t,a,c,t,a,g,a,g,a,...,c,a,c,c,c,g,g,c,g,+
3,a,a,t,t,g,t,g,a,t,g,...,a,a,c,a,a,a,c,t,c,+
4,t,c,g,a,t,a,a,t,t,a,...,c,c,g,t,g,g,t,a,g,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101,c,c,t,c,a,a,t,g,g,c,...,g,a,a,c,t,a,t,a,t,-
102,g,t,a,t,t,c,t,c,a,a,...,t,c,a,a,c,a,t,t,g,-
103,c,g,c,g,a,c,t,a,c,g,...,a,a,g,g,c,t,t,c,c,-
104,c,t,c,g,t,c,c,t,c,a,...,a,g,g,a,g,g,a,a,c,-


##### Our data is now ready!

Exploring the data more:

In [12]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,Class
count,106,106,106,106,106,106,106,106,106,106,...,106,106,106,106,106,106,106,106,106,106
unique,4,4,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,2
top,t,a,a,c,a,a,a,a,a,a,...,c,c,c,t,t,c,c,t,t,+
freq,38,34,30,30,36,42,38,34,33,36,...,36,42,31,33,35,32,29,29,34,53


In [13]:
# describe() does not tell us enough, since the attributes are text.
# Record the value counts for each column (sequence position).
series = []
for name in df.columns:
    series.append(df[name].value_counts())
    
info = pd.DataFrame(series)
details = info.transpose()
print(details)

      0     1     2     3     4     5     6     7     8     9  ...    48  \
t  38.0  26.0  27.0  26.0  22.0  24.0  30.0  32.0  32.0  28.0  ...  21.0   
c  27.0  22.0  21.0  30.0  19.0  18.0  21.0  20.0  22.0  22.0  ...  36.0   
a  26.0  34.0  30.0  22.0  36.0  42.0  38.0  34.0  33.0  36.0  ...  23.0   
g  15.0  24.0  28.0  28.0  29.0  22.0  17.0  20.0  19.0  20.0  ...  26.0   
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   NaN   

     49    50    51    52    53    54    55    56  Class  
t  22.0  23.0  33.0  35.0  30.0  23.0  29.0  34.0    NaN  
c  42.0  31.0  32.0  21.0  32.0  29.0  29.0  17.0    NaN  
a  24.0  28.0  27.0  25.0  22.0  26.0  24.0  27.0    NaN  
g  18.0  24.0  14.0  25.0  22.0  28.0  24.0  28.0    NaN  
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  

[6 rows x 58 columns]
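
The NaN rows in the table above appear because the nucleotide columns and the `Class` column are counted together. A possible alternative, sketched here on a toy frame (`toy`, `position_counts`, and `class_counts` are illustrative names, not part of the notebook), counts the two separately:

```python
import pandas as pd

# Toy frame: three sequences of length two plus a label column.
toy = pd.DataFrame({0: ["a", "a", "t"], 1: ["c", "g", "g"], "Class": ["+", "+", "-"]})

# Count nucleotides per position, leaving the label out so no all-NaN rows appear.
# Missing combinations become 0 instead of NaN.
position_counts = (
    toy.drop(columns="Class")
       .apply(pd.Series.value_counts)
       .fillna(0)
       .astype(int)
)

# Count the labels separately to check the class balance.
class_counts = toy["Class"].value_counts()

print(position_counts)
print(class_counts)
```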


##### We can't run machine learning algorithms on data in string format, so we need to convert it to numerical data. This is easily done with the pd.get_dummies() function, which converts categorical variables into dummy/indicator variables.

In [14]:
numerical_df = pd.get_dummies(df)
numerical_df

Unnamed: 0,0_a,0_c,0_g,0_t,1_a,1_c,1_g,1_t,2_a,2_c,...,55_a,55_c,55_g,55_t,56_a,56_c,56_g,56_t,Class_+,Class_-
0,0,0,0,1,1,0,0,0,0,1,...,0,0,1,0,0,0,0,1,1,0
1,0,0,0,1,0,0,1,0,0,1,...,1,0,0,0,1,0,0,0,1,0
2,0,0,1,0,0,0,0,1,1,0,...,0,1,0,0,0,0,1,0,1,0
3,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
4,0,0,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101,0,1,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
102,0,0,1,0,0,0,0,1,1,0,...,0,0,0,1,0,0,1,0,0,1
103,0,1,0,0,0,0,1,0,0,1,...,0,1,0,0,0,1,0,0,0,1
104,0,1,0,0,0,0,0,1,0,1,...,1,0,0,0,0,1,0,0,0,1


In [15]:
numerical_df.columns

Index(['0_a', '0_c', '0_g', '0_t', '1_a', '1_c', '1_g', '1_t', '2_a', '2_c',
       ...
       '55_a', '55_c', '55_g', '55_t', '56_a', '56_c', '56_g', '56_t',
       'Class_+', 'Class_-'],
      dtype='object', length=230)
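
To see what the encoding does, here is a minimal sketch on a toy frame with two positions. The frame and the names `toy`, `features`, and `labels` are assumptions made for illustration. Passing `dtype=int` is a deliberate choice: recent pandas versions return boolean columns by default, whereas the output above shows 0/1 integers.

```python
import pandas as pd

# Toy frame: two positions, three sequences, plus the label column.
toy = pd.DataFrame({0: ["a", "c", "a"], 1: ["g", "g", "t"], "Class": ["+", "-", "+"]})

# One-hot encode only the feature columns; dtype=int keeps 0/1 instead of booleans.
features = pd.get_dummies(toy.drop(columns="Class"), dtype=int)

# Map the label straight to 1 (promoter) / 0 (non-promoter),
# so no redundant second class column is created.
labels = (toy["Class"] == "+").astype(int)

print(features.columns.tolist())
print(features)
print(labels.tolist())
```

Each original column becomes one indicator column per value that actually occurs in it, which is why 57 positions with 4 nucleotides each yield 228 feature columns.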

In [16]:
# We don't need both class columns. Let's drop one, then rename the other to simply 'Class'.

df = numerical_df.drop(columns = ['Class_-'])  #overwriting the previous df
df.rename(columns = {'Class_+':'Class'},inplace=True)
df

Unnamed: 0,0_a,0_c,0_g,0_t,1_a,1_c,1_g,1_t,2_a,2_c,...,54_t,55_a,55_c,55_g,55_t,56_a,56_c,56_g,56_t,Class
0,0,0,0,1,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,1
1,0,0,0,1,0,0,1,0,0,1,...,0,1,0,0,0,1,0,0,0,1
2,0,0,1,0,0,0,0,1,1,0,...,0,0,1,0,0,0,0,1,0,1
3,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,1
4,0,0,0,1,0,1,0,0,0,0,...,1,1,0,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101,0,1,0,0,0,1,0,0,0,0,...,1,1,0,0,0,0,0,0,1,0
102,0,0,1,0,0,0,0,1,1,0,...,1,0,0,0,1,0,0,1,0,0
103,0,1,0,0,0,0,1,0,0,1,...,1,0,1,0,0,0,1,0,0,0
104,0,1,0,0,0,0,0,1,0,1,...,0,1,0,0,0,0,1,0,0,0


### Machine Learning Algorithms

We can now apply different classification algorithms. Because it is relatively easy to test multiple models, we will compare the performance of the following algorithms:
- **K-Nearest Neighbors:** K-Nearest Neighbors (K-NN) is one of the most basic yet essential classification algorithms in machine learning. It is a supervised learning method. K-NN assumes that similar cases lie close together, so it assigns a new case to the category that is most common among the most similar available cases.

    Step-1: Select the number K of the neighbors
    
    Step-2: Calculate the Euclidean distance from the new data point to every training point.
    
    Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
    
    Step-4: Among these k neighbors, count the number of the data points in each category.
    
    Step-5: Assign the new data point to the category with the largest number of neighbors.
- **Multilayer Perceptron:** The multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers and each layer is fully connected to the following one. The nodes of the layers are neurons using nonlinear activation functions, except for the nodes of the input layer. There can be one or more non-linear hidden layers between the input and the output layer.
- **Decision Tree:** A decision tree is a supervised learning technique that can be used for both classification and regression problems, although it is mostly used for classification. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome. A decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes test a feature and have multiple branches, whereas leaf nodes hold the output of those decisions and have no further branches. The tests are performed on the features of the given dataset.
- **Random Forests:** Random forest is a popular supervised learning algorithm. It can be used for both classification and regression problems. It is based on ensemble learning, which combines multiple classifiers to solve a complex problem and improve the performance of the model.
    A random forest trains a number of decision trees on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on one decision tree, it takes the prediction from each tree and predicts the final output by majority vote.
- **Gaussian Naive Bayes:** Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for classification problems. It is one of the simplest and most effective classification algorithms, and it helps build fast machine learning models that make quick predictions.
    The Gaussian variant assumes that the features follow a normal distribution: when predictors take continuous rather than discrete values, the model assumes these values are sampled from a Gaussian distribution.
- **Support Vector Machine:** The goal of the SVM algorithm is to find the best decision boundary that separates n-dimensional space into classes, so that new data points can be placed in the correct category. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help define the hyperplane. These extreme cases are called support vectors, hence the name Support Vector Machine.
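
The K-NN steps listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch on a tiny made-up data set, not the scikit-learn implementation used later in the notebook; `knn_predict`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Steps 4 and 5: count labels among the neighbors, return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny illustrative data set: two well-separated clusters in two dimensions.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5])))  # near the first cluster
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5])))  # near the second cluster
```

With k=3, each query point has all three of its nearest neighbors inside one cluster, so the vote is unanimous in both cases.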

In [17]:
#Creating X and y variables
from sklearn.model_selection import train_test_split

X = np.array(df.drop(['Class'], axis=1))
y = np.array(df['Class'])

In [18]:
X_train, X_test , y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state=1) 

In [19]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(84, 228)
(22, 228)
(84,)
(22,)
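
With only 22 test samples, a purely random split can leave the two classes unevenly represented; the classification reports further down show 15 negatives and 7 positives in the test set. One option is a stratified split. The sketch below uses synthetic stand-in data of the same shape (`X_demo`, `y_demo` are assumptions, not the real encoded sequences) to show the effect of the `stratify` argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's shape: 106 rows, 228 binary features,
# 53 positive and 53 negative labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(106, 228))
y_demo = np.array([1] * 53 + [0] * 53)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Class counts (label 0, label 1) in the training and test splits.
print(np.bincount(y_tr), np.bincount(y_te))
```

Whether stratification is appropriate depends on the goal; it changes which samples land in the test set, so the scores would not be directly comparable with those reported in this notebook.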


##### Importing necessary libraries for various ML algorithms

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import KFold, cross_val_score


In [25]:
scoring = 'accuracy'

names = ['KNN',
         'Neural Networks', 
         'Decision Tree',
         'Random Forests',
         'Naive Bayes',
         'SVM Linear',
         'SVM RBF',
         'SVM Sigmoid'
                ]

classifiers = [KNeighborsClassifier(n_neighbors = 3),
                MLPClassifier(hidden_layer_sizes=100, max_iter=400,activation = 'relu',alpha=1),
                DecisionTreeClassifier(max_depth=5),
                RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
                GaussianNB(),
                SVC(kernel = 'linear'), 
                SVC(kernel = 'rbf'),
                SVC(kernel = 'sigmoid')] 

models = list(zip(names, classifiers))  # a list, so the pairs can be iterated more than once
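
The notebook imports `KFold` and `cross_val_score` and defines `scoring`, which suggests a cross-validation step on the training data. The following is a minimal sketch of how such a step might look; it is an assumption rather than the author's code, and it runs on synthetic stand-in data with random labels (`X_demo`, `y_demo`, `demo_models`, and `results` are illustrative names), so the printed scores carry no meaning for the promoter data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (84 rows, 228 binary features).
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 2, size=(84, 228))
y_demo = rng.integers(0, 2, size=84)

scoring = "accuracy"
demo_models = [
    ("KNN", KNeighborsClassifier(n_neighbors=3)),
    ("SVM Linear", SVC(kernel="linear")),
]

results = {}
for name, model in demo_models:
    # 10 shuffled folds: each fold is held out once while the rest is used for fitting.
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring=scoring)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Averaging over folds gives a less noisy estimate than a single 22-sample test split, at the cost of fitting each model ten times.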

In [26]:
#fitting the model and making predictions

for name, model in models:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    print(name)
    print(accuracy_score(y_test,y_pred))
    print(classification_report(y_test,y_pred))
    

KNN
0.7727272727272727
              precision    recall  f1-score   support

           0       1.00      0.67      0.80        15
           1       0.58      1.00      0.74         7

    accuracy                           0.77        22
   macro avg       0.79      0.83      0.77        22
weighted avg       0.87      0.77      0.78        22

Neural Networks
0.9090909090909091
              precision    recall  f1-score   support

           0       1.00      0.87      0.93        15
           1       0.78      1.00      0.88         7

    accuracy                           0.91        22
   macro avg       0.89      0.93      0.90        22
weighted avg       0.93      0.91      0.91        22

Decision Tree
0.8181818181818182
              precision    recall  f1-score   support

           0       0.92      0.80      0.86        15
           1       0.67      0.86      0.75         7

    accuracy                           0.82        22
   macro avg       0.79      0.83    

### Conclusion
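On this train/test split, the MLP classifier, Naive Bayes, and SVC performed best. For example, the MLP reached about 91% accuracy on the 22 held-out sequences, compared with about 77% for KNN and 82% for the decision tree.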