# Personalized Medicine: Redefining Cancer Treatment

## Competition Info/Resources

Competition homepage: https://www.kaggle.com/c/msk-redefining-cancer-treatment

Exploratory Data Analysis: https://www.kaggle.com/headsortails/personalised-medicine-eda-with-tidy-r

High-level insight: https://www.kaggle.com/dextrousjinx/brief-insight-on-genetic-variations

## Plan

1. Process and organize data
2. Display example data to get an idea of what's going on
3. Train simple Keras model
4. Word2Vec? RNN? We shall see

## Data Loading

Imports and path setup

In [1]:
# Import utility libraries
import gc
import os
import sys
from IPython.core.debugger import Tracer

import numpy as np
import pandas as pd
import cv2 # OpenCV (Open Source Computer Vision Library). Image manipulation, for our purposes
import matplotlib.image as mpimg
from skimage import io
from tqdm import tqdm # Progress bars

# Allow importing utils, Vgg, etc. from the parent directory
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from utils import *

%matplotlib inline

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [2]:
current_dir = os.getcwd()
NOTEBOOK_DIR = current_dir
# Data lives in a "data" directory next to the notebook directory
DATA_DIR = os.path.join(os.path.dirname(current_dir), "data", "cancer-treatment")

In [3]:
# Sample data

# Training data

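The competition archives should already be downloaded to $DATA_DIR at this point. Unzip them from inside that directory (the wildcard is quoted so that `unzip` itself expands it and handles every archive):

```
unzip '*.zip'
```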

In [6]:
%mkdir -p $DATA_DIR
%cd $DATA_DIR

# Set up sample data
# !mkdir -p sample-jpg/train
# !find train-jpg -type f | shuf -n 1000 | xargs -I {} cp "{}" sample-jpg/train
# !mkdir -p sample-jpg/valid
# !find train-jpg -type f | shuf -n 250 | xargs -I {} cp "{}" sample-jpg/valid

# Set up validation data (n.b. we `mv` files here instead of `cp` since we don't want overlap between training and validation data)
# !mkdir -p valid-jpg
# !find train-jpg -type f | shuf -n 8000 | xargs -I {} mv "{}" valid-jpg

!mkdir -p results

%cd $NOTEBOOK_DIR

/home/ubuntu/nbs/data/cancer-treatment
/home/ubuntu/nbs/cancer-treatment


## Looking at data

> Both, training and test, data sets are provided via two different files. One (training/test_variants) provides the information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that our human experts used to classify the genetic mutations. Both are linked via the ID field.

training_variants - a comma-separated file containing the description of the genetic mutations used for training. Fields are ID (the ID of the row, used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation), Class (1-9, the class this genetic mutation has been classified as)

In [30]:
df_train = pd.read_csv(DATA_DIR + "/training_variants", nrows=10)
df_train.head()

| | ID | Gene | Variation | Class |
|---|---|---|---|---|
| 0 | 0 | FAM58A | Truncating Mutations | 1 |
| 1 | 1 | CBL | W802* | 2 |
| 2 | 2 | CBL | Q249E | 2 |
| 3 | 3 | CBL | N454D | 3 |
| 4 | 4 | CBL | L399V | 4 |


training_text - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the ID of the row, used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)

In [28]:
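# The separator is a regex, so pass it as a raw string and select the Python parsing engine explicitly
df_train_text = pd.read_csv(DATA_DIR + "/training_text", sep=r"\|\|", engine="python", nrows=10)
df_train_text.head()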


| | ID | Text |
|---|---|---|
| 0 | 0 | Cyclin-dependent kinases (CDKs) regulate a var... |
| 1 | 1 | Abstract Background Non-small cell lung canc... |
| 2 | 2 | Abstract Background Non-small cell lung canc... |
| 3 | 3 | Recent evidence has demonstrated that acquired... |
| 4 | 4 | Oncogenic mutations in the monomeric Casitas B... |
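The text files are awkward to parse because the header line and the data rows do not use the same delimiter. Below is a minimal sketch of a loader that skips the header and names the columns explicitly. It assumes the header line is the literal `ID,Text` and that the evidence text itself never contains `||`. The helper name `load_text` and the two-row in-memory sample are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd


def load_text(path_or_buffer, nrows=None):
    """Read a '||'-delimited text file into a frame with columns ID and Text."""
    return pd.read_csv(
        path_or_buffer,
        sep=r"\|\|",           # regex separator: a literal double pipe
        engine="python",       # the C engine does not support regex separators
        header=None,           # do not parse the comma-separated header line
        skiprows=1,            # skip that header line instead
        names=["ID", "Text"],  # supply the column names ourselves
        nrows=nrows,
    )


# Hypothetical stand-in for training_text (assumed layout: "ID,Text" header, then "id||text" rows)
sample = io.StringIO(
    "ID,Text\n"
    "0||Cyclin-dependent kinases (CDKs) regulate a variety of processes.\n"
    "1||Abstract Background Non-small cell lung cancer is common.\n"
)

df_text = load_text(sample)
print(df_text)
```

With the real data, the same call would be something like `load_text(DATA_DIR + "/training_text", nrows=10)`.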


test_variants - a comma-separated file containing the description of the genetic mutations used for testing. Fields are ID (the ID of the row, used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation)

In [31]:
df_test = pd.read_csv(DATA_DIR + "/test_variants", nrows=10)
df_test.head()

| | ID | Gene | Variation |
|---|---|---|---|
| 0 | 0 | ACSL4 | R570S |
| 1 | 1 | NAGLU | P521L |
| 2 | 2 | PAH | L333F |
| 3 | 3 | ING1 | A148D |
| 4 | 4 | TMEM216 | G77A |


test_text - a double-pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the ID of the row, used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)

In [32]:
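# The separator is a regex, so pass it as a raw string and select the Python parsing engine explicitly
df_test_text = pd.read_csv(DATA_DIR + "/test_text", sep=r"\|\|", engine="python", nrows=10)
df_test_text.head()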


| | ID,Text |
|---|---|
| 0 | 2. This mutation resulted in a myeloproliferat... |
| 1 | Abstract The Large Tumor Suppressor 1 (LATS1)... |
| 2 | Vascular endothelial growth factor receptor (V... |
| 3 | Inflammatory myofibroblastic tumor (IMT) is a ... |
| 4 | Abstract Retinoblastoma is a pediatric retina... |
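Since the variants and text files are linked via the ID field, a natural next step is to join them into one frame per split. The sketch below does this on a tiny in-memory sample. The rows are taken from the truncated previews above, and the names `variants`, `text` and `merged` are illustrative only.

```python
import pandas as pd

# Small stand-ins for training_variants and training_text (values copied from the previews)
variants = pd.DataFrame({
    "ID": [0, 1, 2],
    "Gene": ["FAM58A", "CBL", "CBL"],
    "Variation": ["Truncating Mutations", "W802*", "Q249E"],
    "Class": [1, 2, 2],
})
text = pd.DataFrame({
    "ID": [0, 1, 2],
    "Text": [
        "Cyclin-dependent kinases (CDKs) regulate a var...",
        "Abstract Background Non-small cell lung canc...",
        "Abstract Background Non-small cell lung canc...",
    ],
})

# Inner join on ID; validate raises if an ID is duplicated on either side
merged = variants.merge(text, on="ID", how="inner", validate="one_to_one")
print(merged[["ID", "Gene", "Variation", "Class"]])
```

The test split can be joined the same way. It has no Class column, so the merged test frame would carry only ID, Gene, Variation and Text.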


submissionSample - a sample submission file in the correct format

In [35]:
df_submission = pd.read_csv(DATA_DIR + "/submissionFile", nrows=10)
df_submission.head()

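| | ID | class1 | class2 | class3 | class4 | class5 | class6 | class7 | class8 | class9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |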
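The sample file has one row per ID and one column per class, with a single class marked per row. Here is a rough sketch of building a frame in that layout from an array of per-class scores. The helper `make_submission` is hypothetical, and the labels are made up to mirror the first three rows of the sample. A trained model would presumably supply probabilities rather than hard one-hot rows.

```python
import numpy as np
import pandas as pd

NUM_CLASSES = 9


def make_submission(ids, scores):
    """Build an ID + class1..class9 frame from an (n_rows, 9) score array."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape != (len(ids), NUM_CLASSES):
        raise ValueError("scores must have shape (len(ids), %d)" % NUM_CLASSES)
    columns = ["class{}".format(i) for i in range(1, NUM_CLASSES + 1)]
    submission = pd.DataFrame(scores, columns=columns)
    submission.insert(0, "ID", list(ids))
    return submission


# Illustrative labels (1-9) mirroring the first three sample rows
labels = np.array([6, 2, 6])
one_hot = np.eye(NUM_CLASSES)[labels - 1]  # row i has a 1.0 in column labels[i] - 1

submission = make_submission([0, 1, 2], one_hot)
print(submission)
```

Writing it out would then be a call such as `submission.to_csv("results/submission.csv", index=False)`, where the file name is an assumption.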


## Keras Model

## Vocab

* TF-IDF: Term Frequency-Inverse Document Frequency; a heuristic score measuring how frequent a word is in one context (here: a certain Class) relative to how common it is across the larger collection (here: all Classes). You can understand it as a normalisation of a word's frequency within one Class by its frequency across all Classes. Words that are characteristic of a specific Class then stand out, which is pretty much what we want in order to train a model.
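To make the definition concrete, here is a small standard-library sketch that treats the pooled text of each Class as one "document". It uses the plain textbook form, tf = count / total words in the Class and idf = log(number of Classes / number of Classes containing the word). The toy corpus and the function name `tf_idf_per_class` are made up for illustration, and library implementations (for example scikit-learn's `TfidfVectorizer`) use smoothed variants, so their numbers would differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: the pooled text of each Class is treated as one document
class_texts = {
    1: "truncating mutation loss of function",
    2: "kinase mutation gain of function",
    7: "kinase amplification mutation kinase",
}


def tf_idf_per_class(class_texts):
    """Return {class_label: {term: tf-idf score}} using unsmoothed tf and idf."""
    counts = {label: Counter(text.lower().split()) for label, text in class_texts.items()}
    n_classes = len(counts)
    # Document frequency: in how many Classes does each term occur at least once?
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = {
            term: (n / total) * math.log(n_classes / doc_freq[term])
            for term, n in c.items()
        }
    return scores


scores = tf_idf_per_class(class_texts)
for label, term_scores in scores.items():
    top_term = max(term_scores, key=term_scores.get)
    print(label, top_term, round(term_scores[top_term], 3))
```

A word such as "mutation", which occurs in every Class of the toy corpus, gets a score of zero, while words confined to one Class score highest. That is the "standing out" effect described above.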