In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# Problem 2: Leukemia Diagnosis

In [26]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/leukemia.csv'
data = pd.read_csv(url)
data.shape

(198, 16064)

This dataset contains cancer gene-expression levels (16,063 genes, 198 samples) from the paper [*Multiclass cancer diagnosis using tumor gene expression signatures (Ramaswamy et al., 2001)*](http://cbcl.mit.edu/publications/ps/rifkin-pnas-2001.pdf). The first 16,063 columns hold the expression levels, and the last column, `label`, holds the cancer class.

Cancer classes are labelled as follows:


- 1: breast
- 2: prostate
- 3: lung
- 4: colorectal
- 5: lymphoma
- 6: bladder
- 7: melanoma
- 8: uterus
- 9: **leukemia**
- 10: renal
- 11: pancreas
- 12: ovary
- 13: mesothelioma
- 14: CNS


Your goal is to train a regression model and use its output to classify each cancer as either leukemia or not leukemia.

**Part 1:** Replace the `label` values with:

$$
\left\{ \begin{array}{ll} 1 & \mbox{ if cancer is leukemia}\\
0 & \mbox{ if cancer is not leukemia}. \end{array}\right.
$$

In [27]:
data.label.unique()

def LeukemiaMap(classification):
    # class 9 is leukemia; every other class maps to 0
    return 1 if classification == 9 else 0

data['label'] = data.label.map(LeukemiaMap)

In [28]:
# number of leukemia samples (the label is now 0/1, so the sum counts the 1s)
data.label.sum()

30

**Part 2:** Define X and y from the DataFrame, and then split X and y into training and testing sets

In [29]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [30]:
y

0      0
1      0
2      0
3      0
4      0
      ..
193    0
194    0
195    0
196    0
197    0
Name: label, Length: 198, dtype: int64

In [31]:
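import numpy as np

# features: every column except the label; target: the binary leukemia label
X = data.drop('label', axis=1)
#XShortened = X.astype(np.uint8) # make values take up less memory so I can run this. Hopefully doesn't break anything
y = data.label
# default split: 75% train, 25% test (no random_state, so the split changes between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y)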
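
Only 30 of the 198 samples are leukemia, so a purely random split can leave very few positives in the test set. A minimal sketch of a stratified, reproducible split follows. It runs on a synthetic stand-in (the names `X_toy` and `y_toy`, the feature count, and the `random_state` value are assumptions for illustration, not part of the original notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 198 samples, 5 features, 30 positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(198, 5))
y_toy = np.zeros(198, dtype=int)
y_toy[:30] = 1

# stratify=y_toy keeps the fraction of positives roughly equal in both splits;
# random_state fixes the shuffle so the split is the same on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print('train size:', len(y_tr), 'positives:', y_tr.sum())
print('test size: ', len(y_te), 'positives:', y_te.sum())
```

The same two arguments could be passed to the `train_test_split(X, y)` call above, if a reproducible split is wanted.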

**Part 3:** Use the training set to train a **Lasso regression model**. 
Plot the model's coefficients. How many coefficients are equal to 0?

In [32]:
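from sklearn.linear_model import Ridge, Lasso

# regression pipeline: scale each gene to [0, 1], then fit a regularized linear model
# (Ridge is used here, although Part 3 asks for Lasso)
pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    # ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('regressor', Ridge(alpha=0.001))
])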


In [33]:
pipe.fit(X_train,y_train)
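
Part 3 asks for a Lasso model and a count of its zero coefficients. The sketch below shows one way that could look. It uses synthetic data rather than the leukemia set, and the names `lasso_pipe`, `X_toy`, `y_toy` and the value `alpha=0.01` are illustrative assumptions, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 100 samples, 50 features, only the first 3 are informative
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 50))
y_toy = (X_toy[:, 0] + X_toy[:, 1] - X_toy[:, 2] > 0).astype(int)

# Same pipeline shape as above, with Lasso as the regressor
lasso_pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('regressor', Lasso(alpha=0.01)),  # alpha chosen only for illustration
])
lasso_pipe.fit(X_toy, y_toy)

# The fitted coefficients live on the final pipeline step
coefs = lasso_pipe.named_steps['regressor'].coef_
n_zero = int(np.sum(coefs == 0))
print('Coefficients equal to 0:', n_zero, 'out of', coefs.size)
```

In the notebook, `plt.plot(coefs, 'o')` would display the coefficients. The L1 penalty tends to set many of them exactly to 0, which is why the count is meaningful for Lasso but not for Ridge.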

**Part 4:**  Use your regression model to classify all the cancers in the test set.  
Use the rule:

$$
\mbox{classify cancer as }\left\{ \begin{array}{ll} 
\mbox{leukemia} & \mbox{ if y_test_pred}>0.5 \\
\mbox{not leukemia} & \mbox{ if y_test_pred}\leq 0.5
\end{array}\right.
$$



In [59]:
# continuous regression outputs for the test set
y_test_pred = pipe.predict(X_test)

myTable = pd.DataFrame({'y_test_pred': y_test_pred})

# apply the 0.5 threshold: True means "classified as leukemia"
myTable['predicted_leukemia'] = myTable['y_test_pred'] > 0.5

How many cancers are misclassified? 

In [85]:
myTable['actual_leukemia'] = y_test.tolist()

In [91]:
correctLabelCount = len(myTable.loc[myTable['actual_leukemia'] == myTable['predicted_leukemia']])
labelAmount = len(y_test)

In [96]:
# misclassified = total test samples minus correctly labelled ones
misclassifications = labelAmount - correctLabelCount
print('Number of misclassifications:', misclassifications)

Number of misclassifications: 0
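
A single misclassification count does not say whether the errors are missed leukemias or false alarms. As a hedged sketch, the thresholding rule and a confusion matrix can be combined as shown below. The arrays `y_true` and `y_score` are made-up values for illustration only, not results from the leukemia data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and regression outputs (not from the real data)
y_true = np.array([0, 0, 1, 0, 1, 0])
y_score = np.array([0.1, 0.7, 0.9, 0.2, 0.4, 0.0])

# Apply the 0.5 rule from Part 4
y_pred = (y_score > 0.5).astype(int)

# Count disagreements directly
n_misclassified = int(np.sum(y_pred != y_true))

# Rows are true classes, columns are predicted classes (order: 0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print('misclassified:', n_misclassified)  # prints: misclassified: 2
print(cm)
```

In the notebook, `y_test` and `y_test_pred` would take the place of the toy arrays.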
