In [1]:
import numpy as np
import pandas as pd
import sklearn

## Data preparation

In [2]:
df_train = pd.read_csv('data/data_set_ALL_AML_train.csv')
df_test = pd.read_csv('data/data_set_ALL_AML_independent.csv')
df_labels = pd.read_csv('data/actual.csv')

The data come from the [Gene Expression Kaggle dataset](https://www.kaggle.com/datasets/crawford/gene-expression/code).
It consists of gene expression intensity values measured for 72 patients, each diagnosed with one of two types of leukemia: acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL).

The patients are split into a training set of 38 and a test set of 34. The goal is to train a classifier on the training set that correctly predicts the leukemia type of the patients in the test set.

In [3]:
ntrain = 38
ntest = 34

### Training set

In [4]:
df_train.head(3)

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A


The dataframe has features on the rows and patients on the columns. Since it is common practice to work with samples as rows, we will transpose it.

We will also only keep columns labelled by numbers 1 to 38 and for the moment disregard the information in the "call" columns.

In [5]:
cols = [str(i+1) for i in range(ntrain)]
df_train = df_train[cols].T
df_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25
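The column selection and transpose above can be tried on a small made-up frame. This is only a sketch: the gene names and values in `raw` are invented for illustration, and `n_patients` plays the role of `ntrain`.

```python
import pandas as pd

# Toy stand-in for the raw file: genes on rows, one intensity column and
# one "call" column per patient. All values are made up for illustration.
raw = pd.DataFrame({
    "Gene Accession Number": ["g1", "g2", "g3"],
    "1": [10, 20, 30],
    "call": ["A", "P", "A"],
    "2": [11, 21, 31],
    "call.1": ["A", "A", "P"],
})

n_patients = 2
# Patient columns are named by the patient number as a string.
cols = [str(i + 1) for i in range(n_patients)]

# Keep only the intensity columns, then transpose so each row is a patient.
samples = raw[cols].T
print(samples.shape)
print(samples)
```

After the transpose, the row index holds the patient numbers as strings and the columns are the integer positions of the genes, which matches the layout of `df_train` shown above.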


### Test set

In [6]:
df_test.head(3)

Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,...,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,...,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,...,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,...,-5,A,63,A,-46,A,-124,A,-81,A


Again, we keep only the relevant columns and transpose.

In [7]:
cols = [str(i+ntrain+1) for i in range(ntest)]
df_test = df_test[cols].T
df_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
39,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
40,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21
41,-62,-23,-7,142,-233,-284,-167,-97,-12,-70,...,129,383,46,104,15,245,164,84,100,-18
42,22,-153,17,276,-211,-250,55,-141,0,500,...,413,399,16,558,24,893,297,6,1971,-42
43,86,-36,-141,252,-201,-384,-420,-197,-60,-468,...,341,91,-84,615,-52,1235,9,7,1545,-81


### Labels

In [8]:
df_labels.head()

Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL


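Convert the labels (ALL, AML) to 0 and 1. Then split them into training and test labels, kept as separate dataframes whose rows line up with those of `df_train` and `df_test`. The split is positional, so it assumes that `actual.csv` lists the patients in order from 1 to 72.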

In [9]:
df_labels['cancer'] = pd.factorize(df_labels['cancer'])[0]
df_labels.head()

Unnamed: 0,patient,cancer
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
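As a small illustration of what `pd.factorize` does here: it assigns integer codes in order of first appearance, so ALL becomes 0 only because the first patient in the file is ALL. The short label list below is made up for the example.

```python
import pandas as pd

# Made-up label sequence; the real one comes from actual.csv.
toy_labels = pd.Series(["ALL", "ALL", "AML", "ALL", "AML"])

# factorize returns the integer codes and the unique values they refer to.
codes, uniques = pd.factorize(toy_labels)

print(codes.tolist())   # integer code per sample
print(list(uniques))    # position in this list = code

# The uniques can be used to map codes back to the original names.
decoded = [uniques[c] for c in codes]
```

Keeping `uniques` around is a convenient way to translate predictions back into class names later on.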


In [43]:
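# Sanity check: after reset_index, labels and features share the same 0..37 index.
# Only the last expression is displayed in the notebook output.
df_labels['cancer'][:ntrain].reset_index(drop=True).index
df_train.reset_index(drop=True).index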

RangeIndex(start=0, stop=38, step=1)

In [11]:
labels_train = df_labels['cancer'][:ntrain].reset_index(drop=True).to_frame()
labels_train.head()

Unnamed: 0,cancer
0,0
1,0
2,0
3,0
4,0


In [12]:
labels_test = df_labels['cancer'][ntrain:].reset_index(drop=True).to_frame()
labels_test.head()

Unnamed: 0,cancer
0,0
1,0
2,0
3,0
4,0
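The positional slices above rely on the label file being sorted by patient number. A slightly more defensive variant, sketched here on made-up data, looks labels up by patient ID instead. The frames `toy_labels` and `toy_features` are hypothetical stand-ins for `df_labels` and a transposed feature frame.

```python
import pandas as pd

# Made-up labels, deliberately not sorted by patient number.
toy_labels = pd.DataFrame({
    "patient": [3, 1, 2, 4],
    "cancer": [1, 0, 1, 0],
})

# Made-up transposed features: the index holds patient numbers as strings,
# as it does after df[cols].T in this notebook.
toy_features = pd.DataFrame([[5, 6], [7, 8]], index=["1", "2"])

# Look up each row's label by patient ID rather than by position.
by_patient = toy_labels.set_index("patient")["cancer"]
patient_ids = [int(i) for i in toy_features.index]
y_aligned = by_patient.loc[patient_ids].to_numpy()

print(y_aligned.tolist())
```

With this lookup the labels follow the row order of the feature frame even if the label file were shuffled.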


## Classification

In [13]:
nfeatures = len(df_train.columns)
print(f'We have {ntrain} samples and {nfeatures} features')

We have 38 samples and 7129 features


Let us try some simple classification algorithms from the `scikit-learn` library and see how well they do.

In [90]:
# Preview the feature matrix (X is only defined as a NumPy array in the next cell)
df_train.to_numpy()

array([[-214, -153,  -58, ...,   36,  191,  -37],
       [-139,  -73,   -1, ...,   11,   76,  -14],
       [ -76,  -49, -307, ...,   41,  228,  -41],
       ...,
       [-213, -252,  136, ...,   26,  246,   23],
       [ -25,  -20,  124, ...,   12, 3193,  -33],
       [ -72, -139,   -1, ...,   21, 2520,    0]], shape=(38, 7129))

In [91]:
X = df_train.to_numpy()
y = labels_train['cancer'].to_numpy()
Xtest = df_test.to_numpy()
ytest = labels_test['cancer'].to_numpy()

In [92]:
def train_and_predict(model, X, Xtest, y, ytest):
    """Fit model on (X, y) and return (training accuracy, test accuracy)."""
    model.fit(X, y)
    return model.score(X, y), model.score(Xtest, ytest)

In [93]:
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = [
    LogisticRegression(random_state=0, penalty='l2'),
    LogisticRegression(random_state=0, penalty='l1', solver='liblinear'),
    Perceptron(random_state=0),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
]

In [94]:
for model in models:
    acc_train, acc_test = train_and_predict(model, X, Xtest, y, ytest)
    print(model, acc_train, acc_test)

LogisticRegression(random_state=0) 1.0 0.9411764705882353
LogisticRegression(penalty='l1', random_state=0, solver='liblinear') 1.0 1.0
Perceptron() 1.0 0.8529411764705882
DecisionTreeClassifier() 1.0 0.9117647058823529
RandomForestClassifier() 1.0 0.8529411764705882
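Because the test set has only 34 patients, it can help to read these accuracies as counts of misclassified patients. The snippet below simply redoes that arithmetic from the test accuracies printed above; the short model names used as dictionary keys are labels chosen for this example.

```python
n_test = 34

# Test accuracies copied from the output above.
test_acc = {
    "LogisticRegression (L2)": 0.9411764705882353,
    "LogisticRegression (L1)": 1.0,
    "Perceptron": 0.8529411764705882,
    "DecisionTreeClassifier": 0.9117647058823529,
    "RandomForestClassifier": 0.8529411764705882,
}

# accuracy * n_test is the number of correct predictions.
errors = {name: n_test - round(acc * n_test) for name, acc in test_acc.items()}

for name, n_err in errors.items():
    print(f"{name}: {n_err} of {n_test} misclassified")
```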


Logistic regression with an L1 penalty obtains the best test accuracy, classifying all 34 test patients correctly.
This is not surprising: with far more features (7129) than training samples (38), some regularization is expected to be needed in order to generalize well. The L1 penalty encourages the model to give exactly zero weight to a large portion of the features, which amounts to feature selection. Note that the test set is small and that the decision tree and random forest were created without a `random_state`, so their scores can change from run to run.
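The sparsity induced by the L1 penalty can be checked by counting non-zero coefficients. The sketch below uses synthetic data from `make_classification` rather than the leukemia data, so the numbers it prints say nothing about this dataset; the sizes (38 samples, 500 features) are assumptions chosen to mimic the many-features regime.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: few samples, many features (not the leukemia data).
X_syn, y_syn = make_classification(
    n_samples=38, n_features=500, n_informative=10, random_state=0
)

# Same L1 configuration as in the model list above.
l1_model = LogisticRegression(random_state=0, penalty='l1', solver='liblinear')
l1_model.fit(X_syn, y_syn)

# Default (L2) logistic regression for comparison.
l2_model = LogisticRegression(random_state=0, max_iter=1000)
l2_model.fit(X_syn, y_syn)

# Count how many features each model actually uses.
n_l1 = int(np.count_nonzero(l1_model.coef_))
n_l2 = int(np.count_nonzero(l2_model.coef_))
print(f'L1 keeps {n_l1} of {X_syn.shape[1]} features, L2 keeps {n_l2}')
```

The same count can be obtained for the fitted model in this notebook with `np.count_nonzero(models[1].coef_)`, which gives the number of genes the L1 model relies on.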