# Predicting Behavior from Protein Expression Using an SVM

Context

Expression levels of 77 proteins were measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

Content

The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic (Down syndrome) mice, for a total of 72 mice. In the experiments, 15 measurements of each protein were recorded per sample/mouse. Therefore, there are 38 x 15 = 570 measurements for control mice and 34 x 15 = 510 measurements for trisomic mice, giving a total of 1080 measurements per protein. Each measurement can be considered an independent sample/mouse.

The eight classes of mice are defined by three features: genotype, behavior and treatment. By genotype, mice are either control or trisomic. By behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context). By treatment, some mice have been injected with the drug memantine and others with saline, in order to assess the effect of memantine in recovering the ability to learn in trisomic mice.

Classes:

    c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)

    c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)

    c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)

    c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)

    t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)

    t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)

    t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)

    t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
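
The per-class counts above can be cross-checked against the totals quoted in the description. This is a small arithmetic sketch; the dictionary simply restates the class list, and the variable names are chosen here for illustration.

```python
# Number of mice in each class, as listed above
mice_per_class = {
    "c-CS-s": 9, "c-CS-m": 10, "c-SC-s": 9, "c-SC-m": 10,
    "t-CS-s": 7, "t-CS-m": 9, "t-SC-s": 9, "t-SC-m": 9,
}
MEASUREMENTS_PER_MOUSE = 15

# Class labels starting with "c-" are control mice, "t-" are trisomic mice
control_mice = sum(n for label, n in mice_per_class.items() if label.startswith("c-"))
trisomic_mice = sum(n for label, n in mice_per_class.items() if label.startswith("t-"))
total_mice = control_mice + trisomic_mice

control_rows = control_mice * MEASUREMENTS_PER_MOUSE
trisomic_rows = trisomic_mice * MEASUREMENTS_PER_MOUSE
total_rows = total_mice * MEASUREMENTS_PER_MOUSE

print(control_mice, trisomic_mice, total_mice)   # 38 34 72
print(control_rows, trisomic_rows, total_rows)   # 570 510 1080
```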

Attribute Information

[1] Mouse ID

[2:78] Values of expression levels of 77 proteins; the names of proteins are followed by _N, indicating that they were measured in the nuclear fraction. *For example: DYRK1A_N*

[79] Genotype: control (c) or trisomy (t)

[80] Treatment type: memantine (m) or saline (s)

[81] Behavior: context-shock (CS) or shock-context (SC); these appear as C/S and S/C in the Behavior column of the data

[82] Class: c-CS-s, c-CS-m, c-SC-s, c-SC-m, t-CS-s, t-CS-m, t-SC-s, t-SC-m
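
The class label in column 82 is the combination of the three attributes before it. As a minimal sketch (the variable names are illustrative, not taken from the dataset), the eight labels can be generated from the genotype, behavior and treatment codes:

```python
from itertools import product

# Codes as given in the attribute description
genotypes = ["c", "t"]      # control, trisomy
behaviors = ["CS", "SC"]    # context-shock, shock-context
treatments = ["s", "m"]     # saline, memantine

# Each class label has the form genotype-behavior-treatment
classes = [f"{g}-{b}-{t}" for g, b, t in product(genotypes, behaviors, treatments)]
print(classes)
```

Since the notebook predicts only Behavior, the eight classes collapse into two targets: every `*-CS-*` class maps to one value and every `*-SC-*` class to the other.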

In [179]:
import pandas as pd
import numpy as np

In [186]:
#reading data from file
data = pd.read_csv('data.csv')

In [187]:
#Display all the columns
data.columns

Index(['MouseID', 'DYRK1A_N', 'ITSN1_N', 'BDNF_N', 'NR1_N', 'NR2A_N', 'pAKT_N',
       'pBRAF_N', 'pCAMKII_N', 'pCREB_N', 'pELK_N', 'pERK_N', 'pJNK_N',
       'PKCA_N', 'pMEK_N', 'pNR1_N', 'pNR2A_N', 'pNR2B_N', 'pPKCAB_N',
       'pRSK_N', 'AKT_N', 'BRAF_N', 'CAMKII_N', 'CREB_N', 'ELK_N', 'ERK_N',
       'GSK3B_N', 'JNK_N', 'MEK_N', 'TRKA_N', 'RSK_N', 'APP_N', 'Bcatenin_N',
       'SOD1_N', 'MTOR_N', 'P38_N', 'pMTOR_N', 'DSCR1_N', 'AMPKA_N', 'NR2B_N',
       'pNUMB_N', 'RAPTOR_N', 'TIAM1_N', 'pP70S6_N', 'NUMB_N', 'P70S6_N',
       'pGSK3B_N', 'pPKCG_N', 'CDK5_N', 'S6_N', 'ADARB1_N', 'AcetylH3K9_N',
       'RRP1_N', 'BAX_N', 'ARC_N', 'ERBB4_N', 'nNOS_N', 'Tau_N', 'GFAP_N',
       'GluR3_N', 'GluR4_N', 'IL1B_N', 'P3525_N', 'pCASP9_N', 'PSD95_N',
       'SNCA_N', 'Ubiquitin_N', 'pGSK3B_Tyr216_N', 'SHH_N', 'BAD_N', 'BCL2_N',
       'pS6_N', 'pCFOS_N', 'SYP_N', 'H3AcK18_N', 'EGR1_N', 'H3MeK4_N',
       'CaNA_N', 'Genotype', 'Treatment', 'Behavior', 'class'],
      dtype='object')

In [188]:
#Drop columns that are used neither as features nor as the target
data = data.drop(['MouseID','Treatment', 'Genotype', 'class'],axis=1)
#Keep only the columns with fewer than 10 missing values
#(i.e. drop every column with 10 or more missing values)
data = data.loc[:, data.isnull().sum() < 10]

In [189]:
data.columns

Index(['DYRK1A_N', 'ITSN1_N', 'BDNF_N', 'NR1_N', 'NR2A_N', 'pAKT_N', 'pBRAF_N',
       'pCAMKII_N', 'pCREB_N', 'pELK_N', 'pERK_N', 'pJNK_N', 'PKCA_N',
       'pMEK_N', 'pNR1_N', 'pNR2A_N', 'pNR2B_N', 'pPKCAB_N', 'pRSK_N', 'AKT_N',
       'BRAF_N', 'CAMKII_N', 'CREB_N', 'ERK_N', 'GSK3B_N', 'JNK_N', 'MEK_N',
       'TRKA_N', 'RSK_N', 'APP_N', 'SOD1_N', 'MTOR_N', 'P38_N', 'pMTOR_N',
       'DSCR1_N', 'AMPKA_N', 'NR2B_N', 'pNUMB_N', 'RAPTOR_N', 'TIAM1_N',
       'pP70S6_N', 'NUMB_N', 'P70S6_N', 'pGSK3B_N', 'pPKCG_N', 'CDK5_N',
       'S6_N', 'ADARB1_N', 'AcetylH3K9_N', 'RRP1_N', 'BAX_N', 'ARC_N',
       'ERBB4_N', 'nNOS_N', 'Tau_N', 'GFAP_N', 'GluR3_N', 'GluR4_N', 'IL1B_N',
       'P3525_N', 'pCASP9_N', 'PSD95_N', 'SNCA_N', 'Ubiquitin_N',
       'pGSK3B_Tyr216_N', 'SHH_N', 'pS6_N', 'SYP_N', 'CaNA_N', 'Behavior'],
      dtype='object')
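
To make the threshold concrete, here is a toy check of the boundary case. The frame and its column names are made up for illustration; the point is only that a column with exactly 10 missing values is dropped while one with 9 is kept.

```python
import numpy as np
import pandas as pd

# Toy frame with 12 rows: 'a' has 0, 'b' has 9 and 'c' has 10 missing values
toy = pd.DataFrame({
    "a": np.arange(12, dtype=float),
    "b": [np.nan] * 9 + [1.0] * 3,
    "c": [np.nan] * 10 + [1.0] * 2,
})

missing_counts = toy.isnull().sum()
# Same rule as the notebook: keep columns with fewer than 10 missing values
kept = toy.loc[:, missing_counts < 10]

print(missing_counts.to_dict())
print(list(kept.columns))
```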

In [190]:
#Replace blank values with NaN
columns = data.columns
#Copy the feature columns so the replacement does not act on a slice of data
X_data = data[columns[:-1]].copy()
y_data = data[columns[-1]]
X_data = X_data.replace('',np.nan)


In [191]:
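#Fill missing values with the column mean
#Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer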

In [192]:
imputer = Imputer()

In [193]:
imputer.fit(X_data)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [194]:
X_data = pd.DataFrame(columns=X_data.columns,data=imputer.transform(X_data))

In [195]:
X_data.isnull().sum()

DYRK1A_N           0
ITSN1_N            0
BDNF_N             0
NR1_N              0
NR2A_N             0
pAKT_N             0
pBRAF_N            0
pCAMKII_N          0
pCREB_N            0
pELK_N             0
pERK_N             0
pJNK_N             0
PKCA_N             0
pMEK_N             0
pNR1_N             0
pNR2A_N            0
pNR2B_N            0
pPKCAB_N           0
pRSK_N             0
AKT_N              0
BRAF_N             0
CAMKII_N           0
CREB_N             0
ERK_N              0
GSK3B_N            0
JNK_N              0
MEK_N              0
TRKA_N             0
RSK_N              0
APP_N              0
                  ..
TIAM1_N            0
pP70S6_N           0
NUMB_N             0
P70S6_N            0
pGSK3B_N           0
pPKCG_N            0
CDK5_N             0
S6_N               0
ADARB1_N           0
AcetylH3K9_N       0
RRP1_N             0
BAX_N              0
ARC_N              0
ERBB4_N            0
nNOS_N             0
Tau_N              0
GFAP_N       
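
For readers on a recent scikit-learn, where `Imputer` is no longer available, the same mean imputation can be sketched with `SimpleImputer`. The toy frame and its column names are assumptions made for this example; they are not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (illustrative names)
toy = pd.DataFrame({"A_N": [1.0, np.nan, 3.0], "B_N": [4.0, 8.0, np.nan]})

# strategy="mean" replaces each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
print(filled.isnull().sum())
```

Here the missing entry in `A_N` becomes 2.0 (the mean of 1 and 3) and the one in `B_N` becomes 6.0 (the mean of 4 and 8).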

In [196]:
#Split the data into train and test sets
from sklearn.model_selection  import train_test_split

In [197]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2)
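
A random row-wise split can place measurements from the same mouse in both the training and the test set. As a hedged alternative, the sketch below uses `GroupShuffleSplit` to keep all measurements of one animal on the same side of the split. It runs on synthetic data, and it assumes that the part of `MouseID` before the underscore identifies the animal; to use it on the real data, that column would need to be saved before it is dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 6 hypothetical mice with 15 measurements each
rng = np.random.default_rng(0)
mouse_ids = [f"{mouse}_{rep}" for mouse in range(301, 307) for rep in range(1, 16)]
X_demo = pd.DataFrame(rng.normal(size=(len(mouse_ids), 3)), columns=["p1", "p2", "p3"])
y_demo = pd.Series(["C/S" if int(m.split("_")[0]) % 2 else "S/C" for m in mouse_ids])

# Assumption: the prefix before "_" is the animal, the suffix is the measurement number
groups = pd.Series(mouse_ids).str.split("_").str[0]

# Split by mouse rather than by row
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=groups))

train_mice = set(groups.iloc[train_idx])
test_mice = set(groups.iloc[test_idx])
print("mice in train:", sorted(train_mice))
print("mice in test: ", sorted(test_mice))
```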

In [198]:
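from sklearn.svm import SVC
#sklearn.grid_search was removed in scikit-learn 0.20; RandomizedSearchCV lives in model_selection
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats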

In [201]:
clf = SVC()

In [204]:
#Randomized search to optimize hyperparameters
#Note: gamma is ignored whenever the linear kernel is sampled
s = RandomizedSearchCV(clf, param_distributions={
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf','linear']})

In [205]:
#Train model
s.fit(X_train,y_train)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f4f62d69ba8>, 'kernel': ['rbf', 'linear'], 'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f4f556bbbe0>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
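
The search above can also be wrapped in a pipeline so that imputation is fitted only on the training folds. This is a sketch under stated assumptions rather than the notebook's method: it uses synthetic data from `make_classification`, the newer `SimpleImputer`, and fixed `random_state` values chosen only to make the run repeatable.

```python
import numpy as np
from scipy.stats import expon
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic two-class data with a few values knocked out
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# make_pipeline names the steps "simpleimputer" and "svc"
pipe = make_pipeline(SimpleImputer(strategy="mean"), SVC())
param_distributions = {
    "svc__C": expon(scale=100),
    "svc__gamma": expon(scale=0.1),
    "svc__kernel": ["rbf", "linear"],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=3, random_state=0)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print("Test accuracy:", test_accuracy)
```

Because the imputer is a pipeline step, each cross-validation fold computes its column means from that fold's training portion only.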

In [206]:
#Train score
print("Train score ",s.score(X_train,y_train))

Train score  1.0


In [207]:
#Test score
print("Test score ",s.score(X_test,y_test))

Test score  1.0


In [208]:
#Best hyperparameters
s.best_params_

{'C': 28.968207987811734, 'gamma': 0.057383230603662155, 'kernel': 'linear'}

In [211]:
#Predict first 10 values
s.predict(X_test[:10])

array(['C/S', 'S/C', 'C/S', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'C/S'], dtype=object)

In [212]:
#Actual first 10 classes
y_test[:10]

103     C/S
267     S/C
314     C/S
498     S/C
495     S/C
815     S/C
534     S/C
1000    S/C
1015    S/C
144     C/S
Name: Behavior, dtype: object
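
The two outputs above can be compared element by element. The lists below are copied from the printed predictions and the printed actual labels; the variable names are chosen for this sketch.

```python
# First 10 predictions and actual labels, as printed above
predicted = ['C/S', 'S/C', 'C/S', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'C/S']
actual = ['C/S', 'S/C', 'C/S', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'S/C', 'C/S']

# Count positions where the prediction equals the true label
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(f"{matches}/{len(actual)} correct, accuracy = {accuracy:.1f}")
```

All ten predictions agree with the actual labels, which is consistent with the test score of 1.0 reported earlier.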