In this script, we will use pandas to read and process the data, and sklearn to apply logistic regression and test the model.

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedShuffleSplit

First, we will load the df_coef dataframe created in the previous script. It contains the slope coefficients obtained by fitting a linear regression to the neurite area values of each sample over time.

In [3]:
coef = pd.read_pickle('df_coef.pkl')
coef

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141
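Each value in the coef column is the slope of a straight line fitted to one sample's neurite area over time. The previous script is not shown here, so the following is only a minimal sketch of that idea. It assumes np.polyfit as the fitting routine, and the arrays time_points and neurite_area are synthetic values invented for illustration.

```python
import numpy as np

# Hypothetical example data: neurite area measured at five time points.
# These numbers are synthetic and are not taken from the experiment.
time_points = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
neurite_area = np.array([10.0, 95.0, 170.0, 260.0, 330.0])

# Fit area = slope * time + intercept by ordinary least squares.
slope, intercept = np.polyfit(time_points, neurite_area, deg=1)

# The slope is the per-sample growth rate used as the classification feature.
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```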


We will use the coefficients as a feature to classify the samples as wt, ko, or oe. Logistic regression is applied here to two classes at a time, so we will create three dataframes, each containing one pair of classes: wt-ko, wt-oe, and ko-oe.

In [4]:
wt_ko = coef.drop(coef[coef['class']=='oe'].index)
wt_ko

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141


In [5]:
wt_oe = coef.drop(coef[coef['class']=='ko'].index)
wt_oe

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141


In [6]:
ko_oe = coef.drop(coef[coef['class']=='wt'].index)
ko_oe

Unnamed: 0,samples,coef,class,color_column
12,ko_9,156.84079,ko,blue
13,ko_13,127.592621,ko,blue
14,ko_14,153.181776,ko,blue
15,ko_18,136.275938,ko,blue
16,ko_20,191.435503,ko,blue
17,ko_2,151.537653,ko,blue
18,ko_22,147.357002,ko,blue
19,ko_16,115.692426,ko,blue
20,ko_21,162.133378,ko,blue
21,ko_8,151.965995,ko,blue
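The three drop calls above each remove one class. An equivalent way to build the same pairwise subsets, sketched here under the assumption that the class column holds exactly the labels wt, ko, and oe, is to keep the rows of each pair with a boolean mask. The toy dataframe and the pairs dictionary are hypothetical names with made-up values.

```python
from itertools import combinations

import pandas as pd

# Synthetic stand-in for the coef dataframe (values are made up).
toy = pd.DataFrame({
    "samples": ["wt_1", "wt_2", "ko_1", "ko_2", "oe_1", "oe_2"],
    "coef": [85.0, 95.0, 150.0, 160.0, 25.0, 35.0],
    "class": ["wt", "wt", "ko", "ko", "oe", "oe"],
})

# Build one dataframe per pair of classes, keyed by a name such as "wt_ko".
pairs = {
    f"{a}_{b}": toy[toy["class"].isin([a, b])]
    for a, b in combinations(["wt", "ko", "oe"], 2)
}

for name, subset in pairs.items():
    print(name, sorted(subset["class"].unique()))
```

Keeping rows with isin avoids repeating the drop logic for every pair and extends naturally if another class is added.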


We will reserve 20% of the samples in each dataframe for testing the model. The rand_split function randomly divides a dataframe into training and test sets with a stratified shuffle split, so that both sets keep the class proportions of the whole dataframe (tested in the cell below).

In [8]:
def rand_split(df):
    # One stratified 80/20 split; random_state=0 makes it reproducible
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    x = df['coef'].values.reshape(-1, 1)  # single feature as a 2D column
    y = df['class'].values
    # n_splits=1, so take the only (train, test) pair of index arrays
    train_index, test_index = next(sss.split(x, y))
    res = {'X_train': x[train_index], 'X_test': x[test_index],
           'y_train': y[train_index], 'y_test': y[test_index]}
    return res
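To see what stratification does, the sketch below splits a synthetic label array with a 1:2 class imbalance and counts the classes on each side. It uses train_test_split with the stratify argument, swapped in here as a shorter route to a single stratified shuffle split. The arrays x_toy and y_toy are invented for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 1:2 class imbalance (made up for illustration).
y_toy = np.array(["wt"] * 10 + ["ko"] * 20)
x_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)

# stratify=y_toy keeps the wt:ko ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```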

In [9]:
rand_split(wt_ko)

{'X_train': array([[168.90759922],
        [151.9659947 ],
        [149.00509093],
        [ 86.39646807],
        [162.13337754],
        [153.18177648],
        [147.3570016 ],
        [ 90.03892116],
        [ 94.45682197],
        [171.61071452],
        [104.2709601 ],
        [115.69242588],
        [151.53765312],
        [127.59262061],
        [ 85.09674612],
        [191.43550346],
        [177.10379506],
        [135.7669118 ],
        [174.21003043],
        [ 95.19407753],
        [ 79.14166479],
        [141.0878327 ],
        [ 82.52257108],
        [ 82.93487407]]),
 'X_test': array([[141.46546361],
        [ 83.70939451],
        [143.69428275],
        [156.84078958],
        [ 80.82016949],
        [104.44259961],
        [136.27593812]]),
 'y_train': array(['ko', 'ko', 'ko', 'wt', 'ko', 'ko', 'ko', 'wt', 'wt', 'ko', 'wt',
        'ko', 'ko', 'ko', 'wt', 'ko', 'ko', 'ko', 'ko', 'wt', 'wt', 'ko',
        'wt', 'wt'], dtype=object),
 'y_test': array(['ko', 'wt', 'ko', 

We will first fit a dummy classifier to each dataframe as a baseline. It always predicts the most frequent class in the training set, so it shows the accuracy we would get without using the coefficient at all.

In [10]:
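def dummy_class(df):
    # Split once; random_state=0 in rand_split gives the same split on every call
    split = rand_split(df)
    x, y, x_test = split['X_train'], split['y_train'], split['X_test']
    # Baseline that always predicts the most frequent training class
    dummy_clf = DummyClassifier(strategy="most_frequent")
    dummy_clf.fit(x, y)
    dummy_pred = dummy_clf.predict(x_test)
    # Note: this is the accuracy on the training set, not on the test set
    dummy_score = dummy_clf.score(x, y)
    res = {'dummy score': dummy_score, 'check prediction': [x_test, dummy_pred]}
    return res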

In [11]:
dummy_class(wt_oe)

{'dummy score': 0.64,
 'check prediction': [array([[ 33.31958616],
         [ 83.70939451],
         [ 21.3529343 ],
         [ 28.21299402],
         [ 80.82016949],
         [104.44259961],
         [ 29.68087281]]),
  array(['oe', 'oe', 'oe', 'oe', 'oe', 'oe', 'oe'], dtype='<U2')]}

In [12]:
dummy_class(wt_ko)

{'dummy score': 0.625,
 'check prediction': [array([[141.46546361],
         [ 83.70939451],
         [143.69428275],
         [156.84078958],
         [ 80.82016949],
         [104.44259961],
         [136.27593812]]),
  array(['ko', 'ko', 'ko', 'ko', 'ko', 'ko', 'ko'], dtype='<U2')]}

In [13]:
dummy_class(ko_oe)

{'dummy score': 0.5161290322580645,
 'check prediction': [array([[ 43.21983581],
         [136.27593812],
         [ 37.55928167],
         [141.46546361],
         [143.69428275],
         [ 21.91959085],
         [ 24.49157557],
         [156.84078958]]),
  array(['oe', 'oe', 'oe', 'oe', 'oe', 'oe', 'oe', 'oe'], dtype='<U2')]}
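The dummy scores above are simply the fraction of training samples that belong to the majority class. For wt_ko, for example, 15 of the 24 training labels are ko, which gives 0.625. The sketch below reproduces that arithmetic on a synthetic label array with the same 15:9 class counts. The feature values in x_toy are arbitrary placeholders, since the most_frequent strategy ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with 15 ko and 9 wt; the feature values are placeholders.
y_toy = np.array(["ko"] * 15 + ["wt"] * 9)
x_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)

baseline = DummyClassifier(strategy="most_frequent").fit(x_toy, y_toy)

# The feature is ignored: every prediction is the majority class.
predictions = baseline.predict(np.array([[0.0], [1000.0]]))
majority_fraction = np.mean(y_toy == "ko")

print(predictions)
print(baseline.score(x_toy, y_toy), majority_fraction)
```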

We can now apply the logistic regression model to our data and test it through the log_reg function. For a given dataframe, it returns the model's accuracy on the training set (score) and the predicted classes for the x_test values, so that we can check that the held-out samples were correctly classified.

In [18]:
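def log_reg(df):
    # Split once; random_state=0 in rand_split gives the same split on every call
    split = rand_split(df)
    x, y, x_test = split['X_train'], split['y_train'], split['X_test']
    res = {}
    model = LogisticRegression()
    model.fit(x, y)
    # Note: this is the accuracy on the training set, not on the test set
    score = model.score(x, y)
    # Predicted classes for the held-out samples
    y_pred = model.predict(x_test)
    res['score'] = score
    res['check prediction'] = [x_test, y_pred]
    return res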

In [19]:
log_reg(wt_oe)

{'score': 1.0,
 'check prediction': [array([[ 33.31958616],
         [ 83.70939451],
         [ 21.3529343 ],
         [ 28.21299402],
         [ 80.82016949],
         [104.44259961],
         [ 29.68087281]]),
  array(['oe', 'wt', 'oe', 'oe', 'wt', 'wt', 'oe'], dtype=object)]}

In [20]:
log_reg(wt_ko)

{'score': 1.0,
 'check prediction': [array([[141.46546361],
         [ 83.70939451],
         [143.69428275],
         [156.84078958],
         [ 80.82016949],
         [104.44259961],
         [136.27593812]]),
  array(['ko', 'wt', 'ko', 'ko', 'wt', 'wt', 'ko'], dtype=object)]}

In [21]:
log_reg(ko_oe)

{'score': 1.0,
 'check prediction': [array([[ 43.21983581],
         [136.27593812],
         [ 37.55928167],
         [141.46546361],
         [143.69428275],
         [ 21.91959085],
         [ 24.49157557],
         [156.84078958]]),
  array(['oe', 'ko', 'oe', 'ko', 'ko', 'oe', 'oe', 'ko'], dtype=object)]}
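The score values above are computed on the training samples, while the test predictions are checked by eye. A minimal sketch of how the held-out accuracy and the decision threshold on the slope could be computed directly is given below. It runs on synthetic, clearly separated values rather than the experimental data, and the names x_toy, y_toy, and threshold, as well as the max_iter=1000 setting, are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, clearly separated slopes (made up; not the experimental data).
x_toy = np.concatenate(
    [np.linspace(80, 105, 10), np.linspace(125, 190, 20)]
).reshape(-1, 1)
y_toy = np.array(["wt"] * 10 + ["ko"] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# max_iter is raised because the feature is not standardized.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)  # held-out samples only

# With one feature, the predicted class flips where coef * x + intercept = 0.
threshold = -model.intercept_[0] / model.coef_[0, 0]

print(f"train accuracy = {train_accuracy:.2f}, test accuracy = {test_accuracy:.2f}")
print(f"decision threshold on the slope = {threshold:.1f}")
```

The threshold is the slope value at which the model switches from one class to the other, which is one way to express a "fast" versus "slow" cutoff as a single number.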

In conclusion, we can fit a linear regression model to the growth of neuronal cultures. The slope coefficient obtained can then be used as a feature to identify the absence or overexpression of the protein of interest. More generally, it can be used to classify cultures as "fast" or "slow" growing, and thus to quickly screen the effect of different genetic and pharmacological manipulations on neuronal growth.