## Identification of Gene Splice Sites

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/splice-junction-gene-sequences/splice.data

For this exercise, I will use a k-nearest neighbors (KNN) classifier to identify splice sites in sequences of nucleotides.
Goals:
- Split each string of nucleotides into individual per-position features
- Train a KNN model, testing multiple hyperparameters

In [24]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score, recall_score

In [2]:
df = pd.read_csv(r'C:\Users\Taylor\Documents\01082018_Clas_Proj\splice_junctions.csv', header = None, names = ['Class', 'Name', 'Sequence'])
df.head()

Unnamed: 0,Class,Name,Sequence
0,EI,ATRINS-DONOR-521,CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...
1,EI,ATRINS-DONOR-905,AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...
2,EI,BABAPOE-DONOR-30,GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...
3,EI,BABAPOE-DONOR-867,GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...
4,EI,BABAPOE-DONOR-2817,GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...


In [3]:
df['Sequence'] = [seq.strip() for seq in df['Sequence']]

In [4]:
df.drop('Name', axis = 1, inplace=True)

In [5]:
def make_single_nuc_feature_enc(df):
    """Split each sequence into one integer-encoded column per nucleotide position."""
    seq_lists = []
    for seq in df['Sequence']:
        seq_list = list(seq)
        # The encoder is refit on every row, so integer codes follow the sorted
        # order of the characters present in that row. If a row contains a
        # different set of characters (for example an ambiguity code such as N),
        # the same letter can map to a different integer in that row.
        le = LabelEncoder()
        seq_lists.append(le.fit_transform(seq_list))
    return pd.DataFrame(seq_lists)
df = pd.concat([df, make_single_nuc_feature_enc(df)], axis=1)
df.drop('Sequence', axis=1, inplace=True)
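
A label encoding like the one above is only comparable across rows if every row shares the same letter-to-integer mapping. As a minimal sketch, assuming toy sequences, a hypothetical helper name `encode_positions`, and an assumed alphabet that would need to list every character occurring in the real data, one fixed mapping can be applied to all rows:

```python
import pandas as pd

# Assumed alphabet: the four bases plus N as a stand-in for ambiguity codes.
ALPHABET = "ACGTN"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_positions(sequences):
    """Return a DataFrame with one integer column per sequence position."""
    # strip() removes surrounding whitespace before splitting into characters.
    rows = [[CODE[ch] for ch in seq.strip()] for seq in sequences]
    return pd.DataFrame(rows)

# Toy sequences for illustration only, not taken from the dataset.
toy = pd.Series(["ACGT", "TTNA", " GGCA "])
features = encode_positions(toy)
print(features)
```

Because KNN distances treat these integers as ordered quantities, one-hot encoding each position might be worth trying as an alternative.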

In [6]:
y = df.pop('Class')
X = df
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)

In [8]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
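
The goal of testing multiple hyperparameters could be approached with `cross_val_score`, which is already imported. The following is only a sketch on synthetic data: `X_demo`, `y_demo`, and the candidate values for `n_neighbors` and `weights` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(42)
# Synthetic stand-in for the encoded sequences: 300 rows, 60 positions, codes 0-3.
X_demo = rng.randint(0, 4, size=(300, 60))
# Synthetic 3-class labels derived from two positions, so there is some signal.
y_demo = (X_demo[:, 29] + X_demo[:, 30]) % 3

cv_scores = {}
for k in (1, 3, 5, 9, 15):
    for weights in ("uniform", "distance"):
        knn_k = KNeighborsClassifier(n_neighbors=k, weights=weights)
        # Mean accuracy over 5 cross-validation folds.
        cv_scores[(k, weights)] = cross_val_score(knn_k, X_demo, y_demo, cv=5).mean()

best_params = max(cv_scores, key=cv_scores.get)
print(best_params, round(cv_scores[best_params], 3))
```

In the notebook itself, `X_train` and `y_train` would take the place of the synthetic arrays, so the test set stays untouched until a final configuration is chosen.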

In [27]:
y_pred = knn.predict(X_test)
# Accuracy: about 65% of the predicted labels match the actual labels
print(accuracy_score(y_test, y_pred))
# Recall, tp / (tp + fn): of the samples that truly belong to a class, the proportion predicted as that class
print(recall_score(y_test, y_pred, average=None))
# Precision, tp / (tp + fp): of the samples predicted as a class, the proportion that truly belong to it
print(precision_score(y_test, y_pred, average=None))

0.6551724137931034
[0.8266129  0.5787037  0.60243408]
[0.5163728  0.61576355 0.83193277]
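
scikit-learn's metric functions expect their arguments in the order `(y_true, y_pred)`. Accuracy is symmetric, but for per-class precision and recall, swapping the two arguments exchanges the metrics. A small hand-checkable sketch with made-up labels (the `_demo` names are hypothetical) illustrates this:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only.
y_true_demo = ["EI", "EI", "IE", "IE", "N", "N"]
y_pred_demo = ["EI", "IE", "IE", "IE", "N", "EI"]
labels = ["EI", "IE", "N"]

# Correct argument order: true labels first, predictions second.
recall = recall_score(y_true_demo, y_pred_demo, labels=labels, average=None)
precision = precision_score(y_true_demo, y_pred_demo, labels=labels, average=None)

# Swapped argument order: this call returns the precision values instead.
swapped = recall_score(y_pred_demo, y_true_demo, labels=labels, average=None)

print(recall)      # per-class recall: EI 1/2, IE 2/2, N 1/2
print(precision)   # per-class precision: EI 1/2, IE 2/3, N 1/1
print(np.allclose(swapped, precision))
```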
