In this script, we will use pandas to read and process the data, and sklearn to apply logistic regression and test the model.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.dummy import DummyClassifier

First, we will open the df_coef dataframe created in the last script. It contains the slope coefficients obtained by applying linear regression to the neurite area values of each sample over time.

In [2]:
coef = pd.read_pickle('df_coef.pkl')
coef

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141
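
As a reminder of what the coef column represents, here is a minimal sketch of how a slope could be obtained for one sample. The previous script is not shown here, so the use of np.polyfit, the variable names, and the numbers are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical time points (days) and neurite area values for one sample;
# these numbers are invented for illustration only.
days = np.array([1, 2, 3, 4, 5], dtype=float)
area = np.array([100.0, 185.0, 260.0, 350.0, 430.0])

# Degree-1 least-squares fit: returns [slope, intercept].
slope, intercept = np.polyfit(days, area, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The slope is the growth rate of the neurite area per time unit, which is the single number kept per sample in df_coef.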


We will use the slope coefficient as the single feature to classify the samples as wt, ko, or oe. We are mainly interested in deviations from normal growth (wt) towards slower (oe) or faster (ko) neuronal growth, which is what the slope coefficient captures. Therefore, we will create two dataframes: one containing only the wt and ko samples, and one containing only the wt and oe samples.

In [3]:
coef_ko = coef.drop(coef[coef['class']=='oe'].index)
coef_ko

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141


In [4]:
coef_oe = coef.drop(coef[coef['class']=='ko'].index)
coef_oe

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
3,wt_18,82.934874,wt,#CE1141
4,wt_15,90.038921,wt,#CE1141
5,wt_14,86.396468,wt,#CE1141
6,wt_22,104.4426,wt,#CE1141
7,wt_19,83.709395,wt,#CE1141
8,wt_24,104.27096,wt,#CE1141
9,wt_9,95.194078,wt,#CE1141


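We will reserve three samples per category for testing the model. To do so, the split_df function (tested below) divides a dataframe into a test part and a train part. It takes the first three and the last three rows as the test part, so it assumes that the rows are ordered by class.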

In [5]:
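def split_df(df):
    # Assumes rows are ordered by class: wt first, the other class last.
    testw = df[:3]   # first three rows (wt)
    test2 = df[-3:]  # last three rows (ko or oe)
    res = {}
    res['test'] = pd.concat([testw, test2])
    # Everything not reserved for testing is used for training.
    res['train'] = df.drop(res['test'].index)
    return res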

In [6]:
split_df(coef_ko)['test']

Unnamed: 0,samples,coef,class,color_column
0,wt_7,82.522571,wt,#CE1141
1,wt_8,94.456822,wt,#CE1141
2,wt_13,79.141665,wt,#CE1141
28,ko_17,141.087833,ko,blue
29,ko_19,168.907599,ko,blue
30,ko_7,135.766912,ko,blue
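
Taking the first and last rows only works while the dataframe stays sorted by class. A possible alternative, sketched here on an invented toy dataframe, is to draw the test samples at random within each class. The function name split_df_random and its parameters are hypothetical, and the sketch assumes pandas 1.1 or later for groupby(...).sample.

```python
import pandas as pd

# Toy dataframe with the same column names as coef_ko; values are invented.
toy = pd.DataFrame({
    "samples": [f"wt_{i}" for i in range(6)] + [f"ko_{i}" for i in range(6)],
    "coef": [82.5, 94.4, 79.1, 82.9, 90.0, 86.4,
             141.0, 168.9, 135.7, 150.2, 128.3, 160.1],
    "class": ["wt"] * 6 + ["ko"] * 6,
})

def split_df_random(df, n_test=3, seed=0):
    # Draw n_test rows from every class, independent of row order.
    test = df.groupby("class").sample(n=n_test, random_state=seed)
    # Everything that was not drawn is kept for training.
    train = df.drop(test.index)
    return {"train": train, "test": test}

parts = split_df_random(toy)
print(parts["test"]["class"].value_counts())
```

Fixing the seed keeps the split reproducible between runs.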


We can now fit the model and test it through the log_reg function. For a given dataframe, it returns the model score (the accuracy on the training samples) and the predictions for the test samples. If the model is working, the predictions should be wt for the first three samples and ko or oe for the others, depending on the dataframe.

In [7]:
def log_reg(df):
    parts = split_df(df)
    train_df, test_df = parts['train'], parts['test']
    # scikit-learn expects a 2D feature array, hence reshape(-1, 1).
    x = train_df['coef'].values.reshape(-1, 1)
    y = train_df['class'].values
    model = LogisticRegression()
    model.fit(x, y)
    # Note: this score is the accuracy on the training data.
    score = model.score(x, y)
    y_pred = model.predict(test_df['coef'].values.reshape(-1, 1))
    return score, y_pred

In [8]:
log_reg(coef_oe)

(1.0, array(['wt', 'wt', 'wt', 'oe', 'oe', 'oe'], dtype=object))

In [9]:
log_reg(coef_ko)

(1.0, array(['wt', 'wt', 'wt', 'ko', 'ko', 'ko'], dtype=object))
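
The score above is computed on the training samples, while the test samples are only checked by eye. A minimal sketch of how the accuracy on held-out samples could also be reported as a number is shown below. All values are invented stand-ins for the train and test parts, and max_iter=1000 is an assumption added to give the solver room to converge on unscaled values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented slope values standing in for the train and test parts.
x_train = np.array([82.9, 90.0, 86.4, 104.4, 83.7,
                    150.2, 128.3, 160.1, 145.6, 139.9]).reshape(-1, 1)
y_train = np.array(["wt"] * 5 + ["ko"] * 5)
x_test = np.array([82.5, 94.5, 79.1, 141.1, 168.9, 135.8]).reshape(-1, 1)
y_test = np.array(["wt"] * 3 + ["ko"] * 3)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Accuracy on the data the model was fitted on.
train_acc = model.score(x_train, y_train)
# Accuracy on samples the model has never seen.
test_acc = model.score(x_test, y_test)
print(f"train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```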

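Finally, we will train a dummy classifier on the same data as a baseline, to check that its score and predictions differ from those of the logistic regression model. With strategy="most_frequent", it ignores the feature and always predicts the most frequent class in the training samples.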

In [10]:
def dummy_class(df):
    parts = split_df(df)
    train_df, test_df = parts['train'], parts['test']
    x = train_df['coef'].values.reshape(-1, 1)
    y = train_df['class'].values
    # Baseline: always predicts the most frequent training class.
    dummy_clf = DummyClassifier(strategy="most_frequent")
    dummy_clf.fit(x, y)
    dummy_pred = dummy_clf.predict(test_df['coef'].values.reshape(-1, 1))
    # Accuracy on the training data, as in log_reg.
    dummy_score = dummy_clf.score(x, y)
    return dummy_score, dummy_pred

In [11]:
dummy_class(coef_oe)

(0.6538461538461539, array(['oe', 'oe', 'oe', 'oe', 'oe', 'oe'], dtype='<U2'))

In [12]:
dummy_class(coef_ko)

(0.64, array(['ko', 'ko', 'ko', 'ko', 'ko', 'ko'], dtype='<U2'))
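
The score of a most_frequent baseline is simply the share of the majority class in the labels it is scored on. The sketch below illustrates this with invented labels: 9 wt and 16 ko is an assumption chosen so that the share is 16/25 = 0.64, not a count taken from the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented training labels: 9 wt and 16 ko (25 samples in total).
y = np.array(["wt"] * 9 + ["ko"] * 16)
# most_frequent ignores the features, so placeholder values are enough.
x = np.zeros((len(y), 1))

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(x, y)

# Share of the majority class among the labels.
majority_fraction = 16 / 25
print(dummy_clf.score(x, y), majority_fraction)
```

Any useful model has to beat this number, which is why it serves as a reference for the logistic regression score.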

In conclusion, we can fit a linear regression model to the growth of neuronal cultures. The slope coefficient obtained can then be used as a feature to identify the absence or overexpression of the protein of interest. More generally, it can be used to classify cultures as "fast" or "slow" growing, and thus to quickly screen the effect of different genetic and pharmacological manipulations on neuronal growth.
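
One way to turn a fitted model into a "fast" or "slow" rule is to read off its decision boundary: with a single feature, the predicted probability crosses 0.5 where intercept + coefficient * slope = 0. The sketch below shows this on invented values, not on the notebook's data, and max_iter=1000 is an assumption added to help the solver converge.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented slope coefficients; not the values stored in df_coef.pkl.
slopes = np.array([82.9, 90.0, 86.4, 104.4, 83.7,
                   150.2, 128.3, 160.1, 145.6, 139.9])
labels = np.array(["wt"] * 5 + ["ko"] * 5)

model = LogisticRegression(max_iter=1000)
model.fit(slopes.reshape(-1, 1), labels)

# With one feature, p = 0.5 where intercept + coef * slope = 0.
threshold = -model.intercept_[0] / model.coef_[0, 0]
print(f"decision threshold: slope = {threshold:.1f}")
```

Cultures with a slope above this threshold would be labelled as the faster-growing class, and those below it as the slower-growing one.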