## Identification of Gene Splice Sites

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data

For this exercise, I will use a k-nearest neighbors (KNN) classifier to identify splice sites in nucleotide sequences.

Goals:
- Split each nucleotide string into individual per-position features
- Train a KNN model, testing multiple hyperparameters

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score, recall_score

In [2]:
df = pd.read_csv(r'C:\Users\Taylor\Documents\01082018_Clas_Proj\splice_junctions.csv', header = None, names = ['Class', 'Name', 'Sequence'])
df.head()

Unnamed: 0,Class,Name,Sequence
0,EI,ATRINS-DONOR-521,CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...
1,EI,ATRINS-DONOR-905,AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...
2,EI,BABAPOE-DONOR-30,GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...
3,EI,BABAPOE-DONOR-867,GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...
4,EI,BABAPOE-DONOR-2817,GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...


In [3]:
# Strip leading and trailing whitespace from each sequence string
df['Sequence'] = df['Sequence'].str.strip()

In [4]:
df.drop('Name', axis = 1, inplace=True)

In [10]:
def make_single_nuc_feature_enc(df):
    """One-hot encode every position of every sequence into its own block of columns."""
    # Fit the label encoder on every symbol that occurs in any sequence
    le = LabelEncoder()
    le.fit(list("".join(df['Sequence'])))
    print(le.classes_)
    # Convert each sequence into a list of integer codes, one code per position
    seq_lists = [list(le.transform(list(seq))) for seq in df['Sequence']]
    # n_values is no longer accepted by newer scikit-learn releases; instead, pass
    # the full code range for every position so each position expands to the
    # same number of columns (8 here), even if a symbol never occurs there
    n_positions = len(seq_lists[0])
    codes = list(range(len(le.classes_)))
    ohe = OneHotEncoder(categories=[codes] * n_positions)
    seq_lists = ohe.fit_transform(seq_lists).toarray()
    return pd.DataFrame(seq_lists, index=df.index)
make_single_nuc_feature_enc(df)

['A' 'C' 'D' 'G' 'N' 'R' 'S' 'T']


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,470,471,472,473,474,475,476,477,478,479
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
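
The 480 columns above are 60 sequence positions times 8 symbols: columns 0 to 7 describe the first nucleotide, columns 8 to 15 the second, and so on. The following is a minimal pure-Python sketch of that layout, not the scikit-learn implementation. The helper names `one_hot` and `decode` are hypothetical, and `ALPHABET` is assumed to be in the same order as the printed `le.classes_`.

```python
# Symbol order assumed to match the le.classes_ output shown above
ALPHABET = ['A', 'C', 'D', 'G', 'N', 'R', 'S', 'T']

def one_hot(seq, alphabet=ALPHABET):
    """Encode a sequence as one flat row with len(alphabet) columns per position."""
    row = []
    for nuc in seq:
        block = [0.0] * len(alphabet)
        block[alphabet.index(nuc)] = 1.0  # mark the column for this symbol
        row.extend(block)
    return row

def decode(row, alphabet=ALPHABET):
    """Recover the sequence by finding the hot column inside each block."""
    width = len(alphabet)
    symbols = []
    for start in range(0, len(row), width):
        block = row[start:start + width]
        symbols.append(alphabet[block.index(max(block))])
    return "".join(symbols)

row = one_hot("CCAG")   # first four nucleotides of the first sequence
print(len(row))         # 4 positions * 8 symbols
print(row[:8])          # block for the first position
print(decode(row))
```

In row 0 of the output above, column 1 is set, which corresponds to `C`, the first nucleotide of the first sequence.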


In [None]:
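# Attach the one-hot features to the labels, then drop the raw sequence string;
# without this step X would contain no feature columns
df = pd.concat([df, make_single_nuc_feature_enc(df)], axis=1)
df.drop('Sequence', axis=1, inplace=True)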

In [None]:
y = df.pop('Class')
X = df
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
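
If the classes are unbalanced, a plain random split can leave the train and test sets with different class proportions. One option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (`y_demo`) and placeholder features (`X_demo`) rather than the real data, so the counts are illustrative only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up labels standing in for the Class column (not the real class counts)
y_demo = ['EI'] * 25 + ['IE'] * 25 + ['N'] * 50
X_demo = [[i] for i in range(len(y_demo))]  # placeholder features

# stratify=y_demo keeps the class proportions the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```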

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
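
The model above uses a single setting, `n_neighbors=3`. To test multiple hyperparameters, one possible approach is to score each candidate with `cross_val_score` on the training data. This is a sketch under assumptions: `search_k` is a hypothetical helper, the candidate values are arbitrary, and `X_demo` / `y_demo` are random toy data standing in for `X_train` / `y_train`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Random binary toy features standing in for the one-hot training matrix
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(120, 16)).astype(float)
# Toy labels derived from the first two columns (illustrative only)
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 1, 'EI', 'N')

def search_k(X, y, k_values=(1, 3, 5, 7, 9), weights=('uniform', 'distance'), cv=5):
    """Return the mean cross-validated accuracy for each (k, weights) pair."""
    results = {}
    for k in k_values:
        for w in weights:
            knn = KNeighborsClassifier(n_neighbors=k, weights=w)
            results[(k, w)] = cross_val_score(knn, X, y, cv=cv).mean()
    return results

scores = search_k(X_demo, y_demo)
best = max(scores, key=scores.get)  # pair with the highest mean accuracy
print(best, round(scores[best], 3))
```

With the real data, the call would be `search_k(X_train, y_train)`, keeping `X_test` untouched until the final evaluation.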

In [None]:
y_pred = knn.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
# Accuracy: 65% of the predicted y values matched the actual labels
print(accuracy_score(y_test, y_pred))
# Recall, tp / (tp + fn): of the samples that truly belong to a class, the proportion predicted as that class
print(recall_score(y_test, y_pred, average=None))
# Precision, tp / (tp + fp): of the samples predicted as a class, the proportion that truly belong to it
print(precision_score(y_test, y_pred, average=None))
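
With `average=None`, the metric functions return one value per class, in sorted label order. They take the true labels first and the predictions second, and swapping the two arguments swaps per-class precision and recall. The small hand-made example below (the labels are invented, not model output) illustrates this.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels for illustration; classes sort as ['EI', 'IE', 'N']
y_true_demo = ['EI', 'EI', 'IE', 'N', 'N', 'N']
y_pred_demo = ['EI', 'N',  'IE', 'N', 'N', 'N']

recall = recall_score(y_true_demo, y_pred_demo, average=None)
precision = precision_score(y_true_demo, y_pred_demo, average=None)
# Same call with the arguments reversed: this is really the precision
swapped = recall_score(y_pred_demo, y_true_demo, average=None)

print(recall)     # one EI sample was missed, so recall for EI is 0.5
print(precision)  # one of four N predictions was wrong, so precision for N is 0.75
print(swapped)
```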