In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Problem 2: Leukemia Diagnosis

The following dataset contains cancer gene-expression levels (16,063 genes, 144 samples) from *Multiclass cancer diagnosis using tumor gene expression signatures* (Golub et al., 2001).

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Theory/master/Data/leukemia.csv'
data = pd.read_csv(url)
data.head(5)

Unnamed: 0,gene 0,gene 1,gene 2,gene 3,gene 4,gene 5,gene 6,gene 7,gene 8,gene 9,...,gene 16054,gene 16055,gene 16056,gene 16057,gene 16058,gene 16059,gene 16060,gene 16061,gene 16062,label
0,-73.0,-69.0,-48.0,13.0,-86.0,-147.0,-65.0,-71.0,-32.0,100.0,...,352.0,-67.0,121.0,-5.0,-11.0,-21.0,-41.0,-967.0,-120.0,1
1,-16.0,-63.0,-97.0,-42.0,-91.0,-164.0,-53.0,-77.0,-17.0,122.0,...,244.0,-15.0,119.0,-32.0,4.0,-14.0,-28.0,-205.0,-31.0,1
2,4.0,-45.0,-112.0,-25.0,-85.0,-127.0,56.0,-110.0,81.0,41.0,...,163.0,-14.0,7.0,15.0,-8.0,-104.0,-36.0,-245.0,34.0,1
3,-31.0,-110.0,-20.0,-50.0,-115.0,-113.0,-17.0,-40.0,-17.0,80.0,...,625.0,18.0,59.0,-10.0,32.0,-2.0,10.0,-495.0,-37.0,1
4,-33.0,-39.0,-45.0,14.0,-56.0,-106.0,73.0,-34.0,18.0,64.0,...,398.0,38.0,215.0,-2.0,44.0,3.0,68.0,-293.0,-34.0,1


In [3]:
X = data.iloc[:,0:16063].to_numpy() # gene expression levels (first 16,063 columns)
y = data['label'].to_numpy() # cancer labels

Cancer classes are labelled as follows:

| Label | Description |
| :- | -: |
| 1 | Breast |
| 2 | Prostate |
| 3 | Lung |
| 4 | Colorectal |
| 5 | Lymphoma |
| 6 | Bladder |
| 7 | Melanoma |
| 8 | Uterus |
| 9 | Leukemia |
| 10 | Renal |
| 11 | Pancreas |
| 12 | Ovary |
| 13 | Mesothelioma |
| 14 | Central nervous system |

The goal is to fit a linear regression model to classify cancers as either **leukemia** (label 9) or **not-leukemia**.

Notice that the matrix $X$ has more columns than rows:

In [4]:
X.shape

(144, 16063)

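As a consequence, the $16063\times 16063$ matrix $X^TX$ has rank at most 144 and is therefore **not invertible**, so the least-squares normal equations do not have a unique solution.
This is a situation where it is necessary to **add regularization**.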
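The rank argument above can be checked numerically on a small stand-in matrix. The sketch below is only an illustration: `X_small`, its shape, and the value of `lam` are hypothetical choices, not part of the dataset. It shows that the Gram matrix of a wide matrix is rank-deficient, while adding a multiple of the identity restores full rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: 5 samples, 8 features (more columns than rows)
X_small = rng.normal(size=(5, 8))

gram = X_small.T @ X_small            # 8 x 8 matrix
rank = np.linalg.matrix_rank(gram)    # cannot exceed the number of rows (5)

lam = 1.0                             # arbitrary regularization strength, for illustration only
gram_reg = gram + lam * np.eye(8)     # regularized matrix
rank_reg = np.linalg.matrix_rank(gram_reg)

print("rank of X^T X:", rank)
print("rank of X^T X + lam*I:", rank_reg)
```

Because $X^TX$ is positive semidefinite, every eigenvalue of $X^TX+\lambda I$ is at least $\lambda>0$, which is why the regularized matrix is always invertible.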

## Part 1

Build a (0,1)-vector y_leuk whose $i$th entry is given by

$$
\left\{ \begin{array}{ll} 1 & \mbox{ if cancer $i$ is leukemia}\\
0 & \mbox{ if cancer $i$ is not leukemia}. \end{array}\right.
$$
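One possible way to build such an indicator vector is a boolean comparison followed by a cast to integers. The sketch below uses a short hypothetical label vector `y_demo` in place of the real `y`; the constant name `LEUKEMIA_LABEL` is also introduced here for illustration.

```python
import numpy as np

# Hypothetical label vector standing in for y (labels between 1 and 14)
y_demo = np.array([1, 9, 3, 9, 14])

LEUKEMIA_LABEL = 9  # leukemia is label 9 in the table of cancer classes

# Elementwise comparison gives booleans; astype(int) turns them into 0/1
y_leuk_demo = (y_demo == LEUKEMIA_LABEL).astype(int)

print(y_leuk_demo)  # [0 1 0 1 0]
```

Applied to the notebook's own variables, the same expression would presumably read `y_leuk = (y == 9).astype(int)`.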

## Part 2

Use the data (X,y_leuk) to fit a linear regression model that predicts the vector y_leuk.
Add regularization to your model.
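A minimal sketch of one way to do this, assuming ridge (L2) regularization, is shown below. The problem statement does not prescribe a particular penalty, so this is one choice among several. The function name `ridge_fit`, the parameter `lam`, the synthetic data, and the decision to handle the intercept by centering are all assumptions made for illustration. Since there are far more genes than samples, the sketch uses the identity $(X^TX+\lambda I)^{-1}X^Ty = X^T(XX^T+\lambda I)^{-1}y$, so that only an $n\times n$ system needs to be solved, where $n$ is the number of samples.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via the n x n (dual) system; returns (weights, intercept)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                 # center features so the intercept is not penalized
    yc = y - y_mean
    n = X.shape[0]
    # Solve (Xc Xc^T + lam I) alpha = yc, then map back to feature space
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), yc)
    w = Xc.T @ alpha
    b = y_mean - x_mean @ w
    return w, b

# Synthetic stand-in for (X, y_leuk): 20 samples, 50 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 50))
y_demo = (rng.random(20) < 0.3).astype(float)

lam = 10.0                          # arbitrary value; in practice it would be tuned
w, b = ridge_fit(X_demo, y_demo, lam)
y_pred = X_demo @ w + b             # fitted values on the training data
print(w.shape, y_pred.shape)
```

With the real data, the $144\times 144$ system is much cheaper to solve than the $16063\times 16063$ one. The value of `lam` would normally be chosen by validation rather than fixed in advance.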

## Part 3

Use your linear regression model to classify all the cancers in the following test set. 

In [7]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Theory/master/Data/leukemia_test.csv'
data_test = pd.read_csv(url)
data_test.head(5)

Unnamed: 0,gene 0,gene 1,gene 2,gene 3,gene 4,gene 5,gene 6,gene 7,gene 8,gene 9,...,gene 16054,gene 16055,gene 16056,gene 16057,gene 16058,gene 16059,gene 16060,gene 16061,gene 16062,label
0,-44.0,-254.0,-166.0,-55.0,-130.0,-133.0,80.0,-62.0,1.0,22.0,...,715.0,25.0,47.0,-270.0,-265.0,-290.0,-235.0,-826.0,-262.0,1
1,-13.0,-124.0,-206.0,-29.0,-117.0,-114.0,111.0,-20.0,-25.0,18.0,...,695.0,10.0,57.0,-50.0,-45.0,-61.0,-22.0,-263.0,-52.0,1
2,-64.3,-184.9,-334.4,-102.2,-289.0,-145.2,-251.9,-150.9,-184.0,179.1,...,1324.6,9.8,127.7,244.0,76.5,100.1,-69.6,-1444.3,-121.6,1
3,-22.0,-169.0,-164.0,-32.0,-49.0,-141.0,0.0,-22.0,16.0,67.0,...,170.0,20.0,35.0,-275.0,-230.0,-136.0,-228.0,-404.0,-259.0,2
4,-28.0,-167.0,-158.0,100.0,-129.0,-232.0,148.0,-48.0,14.0,164.0,...,1627.0,76.0,119.0,-6.0,68.0,-22.0,27.0,-1088.0,-58.0,2


In [8]:
data_test.shape

(54, 16064)

Use the rule:

$$
\mbox{classify cancer $i$ as }\left\{ \begin{array}{ll} 
\mbox{leukemia} & \mbox{ if predicted y_leuk_test}[i]>0.5 \\
\mbox{not leukemia} & \mbox{ if predicted y_leuk_test}[i]\leq 0.5
\end{array}\right.
$$
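The thresholding rule can be sketched as a single vectorized comparison. The predicted values and true labels below are made up for illustration; `y_leuk_test_pred` and `y_leuk_test_true` are hypothetical names standing in for the model output and the indicator vector built from the test labels.

```python
import numpy as np

# Hypothetical predictions from a regression model on five test samples
y_leuk_test_pred = np.array([0.92, 0.10, 0.50, -0.20, 0.51])

# Strict inequality: a prediction of exactly 0.5 is classified as not leukemia
is_leukemia = (y_leuk_test_pred > 0.5).astype(int)
print(is_leukemia)  # [1 0 0 0 1]

# Hypothetical true indicators, used to measure how often the rule is right
y_leuk_test_true = np.array([1, 0, 0, 0, 0])
accuracy = np.mean(is_leukemia == y_leuk_test_true)
print(accuracy)     # 0.8
```

Comparing the thresholded predictions against the true indicators gives a simple accuracy estimate. Leukemia is only one of 14 classes, so it may also be worth checking how many leukemia samples are actually detected.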