# Personalized Medicine: Redefining Cancer Treatment

## Competition Info/Resources

Competition homepage: https://www.kaggle.com/c/msk-redefining-cancer-treatment

Exploratory Data Analysis: https://www.kaggle.com/headsortails/personalised-medicine-eda-with-tidy-r

High-level insight: https://www.kaggle.com/dextrousjinx/brief-insight-on-genetic-variations

## Plan

1. Process and organize data
2. Display example data to get an idea of what's going on
3. Train simple Keras model
4. Word2Vec? RNN? We shall see

### TODO

* Attempt RCNN architecture with word2vec
* Look at leaked data and class data

## Data Loading

Import libraries and set up directory paths

In [1]:
# Import utility libraries
import os, sys
from IPython.core.debugger import Tracer

import numpy as np
import pandas as pd
import gc
import cv2 # OpenCV (Open Source Computer Vision Library). Image manipulation, for our purposes
import matplotlib.image as mpimg
import pickle
from skimage import io
from tqdm import tqdm # Progress bars

# Allow importing utils, Vgg, etc. from the parent directory
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from utils import *

%matplotlib inline

RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa

ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: numpy.core.multiarray failed to import


 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using Theano backend.


In [2]:
current_dir = os.getcwd()
NOTEBOOK_DIR = current_dir
DATA_DIR = os.path.dirname(current_dir) + "/data/cancer-treatment"
RESULTS_DIR = DATA_DIR + "/results"

In [3]:
# Sample data

# Training data

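The competition archives should already be downloaded and extracted at this point. From `$DATA_DIR` (quoting the pattern lets `unzip` expand it itself, which matters when there is more than one archive):

```
unzip '*.zip'
```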

In [4]:
%mkdir -p $DATA_DIR
%cd $DATA_DIR

# Set up sample data
# !mkdir -p sample-jpg/train
# !find train-jpg -type f | shuf -n 1000 | xargs -I {} cp "{}" sample-jpg/train
# !mkdir -p sample-jpg/valid
# !find train-jpg -type f | shuf -n 250 | xargs -I {} cp "{}" sample-jpg/valid

# Set up validation data (n.b. we `mv` files here instead of `cp` since we don't want overlap between training and validation data)
# !mkdir -p valid-jpg
# !find train-jpg -type f | shuf -n 8000 | xargs -I {} mv "{}" valid-jpg

!mkdir -p results

%cd $NOTEBOOK_DIR

/home/ubuntu/nbs/data/cancer-treatment
/home/ubuntu/nbs/cancer-treatment


## Looking at data

> Both, training and test, data sets are provided via two different files. One (training/test_variants) provides the information about the genetic mutations, whereas the other (training/test_text) provides the clinical evidence (text) that our human experts used to classify the genetic mutations. Both are linked via the ID field.

training_variants - a comma separated file containing the description of the genetic mutations used for training. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation), Class (1-9, the class this genetic mutation has been classified as)

In [None]:
df_train = pd.read_csv(DATA_DIR + "/training_variants", nrows=10)
df_train.head()

training_text - a double pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)

In [None]:
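# Regex separators need the Python parser engine; skip the header line and name the columns explicitly
df_train = pd.read_csv(DATA_DIR + "/training_text", sep=r"\|\|", engine="python", header=None, skiprows=1, names=["ID", "Text"], nrows=10)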
df_train.head()

test_variants - a comma separated file containing the description of the genetic mutations used for testing. Fields are ID (the id of the row used to link the mutation to the clinical evidence), Gene (the gene where this genetic mutation is located), Variation (the amino acid change for this mutation)

In [None]:
df_test = pd.read_csv(DATA_DIR + "/test_variants", nrows=10)
df_test.head()

test_text - a double pipe (||) delimited file that contains the clinical evidence (text) used to classify genetic mutations. Fields are ID (the id of the row used to link the clinical evidence to the genetic mutation), Text (the clinical evidence used to classify the genetic mutation)

In [None]:
df_test = pd.read_csv(DATA_DIR + "/test_text", sep=r"\|\|", engine="python", header=None, skiprows=1, names=["ID", "Text"], nrows=10)
df_test.head()

submissionSample - a sample submission file in the correct format

In [None]:
df_submission = pd.read_csv(DATA_DIR + "/submissionFile", nrows=10)
df_submission.head()

## Data Preparation (TF-IDF + SVD)

Import libraries to read data

In [5]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer



Read in data

In [6]:
train_variants = pd.read_csv(DATA_DIR + "/training_variants")
test_variants = pd.read_csv(DATA_DIR + "/test_variants")
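# Regex separators need the Python parser engine; skip the header line and name the columns explicitly
train_text = pd.read_csv(DATA_DIR + "/training_text", sep=r"\|\|", engine="python", header=None, skiprows=1, names=["ID", "Text"])
test_text = pd.read_csv(DATA_DIR + "/test_text", sep=r"\|\|", engine="python", header=None, skiprows=1, names=["ID", "Text"])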


Merge text and variant files into one dataset

In [7]:
train = pd.merge(train_variants, train_text, how='left', on='ID')
train_y = train['Class'].values # Extract label as the expected output
train_x = train.drop('Class', axis=1) # Remove labels from the input
print(train_x.head())
print("\n%s training examples with shape %s" % (len(train_x), train_x.shape))

   ID    Gene             Variation  \
0   0  FAM58A  Truncating Mutations   
1   1     CBL                 W802*   
2   2     CBL                 Q249E   
3   3     CBL                 N454D   
4   4     CBL                 L399V   

                                                Text  
0  Cyclin-dependent kinases (CDKs) regulate a var...  
1   Abstract Background  Non-small cell lung canc...  
2   Abstract Background  Non-small cell lung canc...  
3  Recent evidence has demonstrated that acquired...  
4  Oncogenic mutations in the monomeric Casitas B...  

3321 training examples with shape (3321, 4)


In [8]:
test_x = pd.merge(test_variants, test_text, how='left', on='ID') # Left join on ID keeps every variant row
len(test_x)

5668

Combine all data into one array in order to get the entire corpus of text

In [9]:
# Stack the rows; pd.concat keeps column names and dtypes. DataFrame docs: https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe
all_data = pd.concat([train_x, test_x], ignore_index=True)
all_data.head()

   ID    Gene             Variation                                               Text
0   0  FAM58A  Truncating Mutations  Cyclin-dependent kinases (CDKs) regulate a var...
1   1     CBL                 W802*    Abstract Background Non-small cell lung canc...
2   2     CBL                 Q249E    Abstract Background Non-small cell lung canc...
3   3     CBL                 N454D  Recent evidence has demonstrated that acquired...
4   4     CBL                 L399V  Oncogenic mutations in the monomeric Casitas B...


Perform TF-IDF vectorization (see Vocab). This converts the raw documents into a sparse document-term matrix of TF-IDF features, with one row per text and one column per term. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [10]:
sentences = all_data["Text"]
print(sentences.shape)

(8989,)


In [11]:
vectorizer = TfidfVectorizer(stop_words='english')
sentence_vectors = vectorizer.fit_transform(sentences)

In [12]:
print(type(sentence_vectors))
print(sentence_vectors.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(8989, 169129)
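
To make the weighting concrete, here is a minimal pure-Python sketch of the textbook TF-IDF formula on a hypothetical three-document corpus. It is an illustration only: `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The `tf_idf` helper and `toy_corpus` are made-up names for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed, unnormalized TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus standing in for the clinical texts
toy_corpus = [
    "kinase mutation activates kinase signaling",
    "mutation reduces binding",
    "mutation is neutral",
]
toy_weights = tf_idf(toy_corpus)
print(toy_weights[0])
```

A term that occurs in every document ("mutation") gets weight zero, while a term repeated within a single document ("kinase") gets the highest weight in that document.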


Perform truncated SVD over the sentence vectors for space/time/memory savings. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

In [13]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(500) # Output has 500 dimensions, rather than ~169,000
sentence_vectors = svd.fit_transform(sentence_vectors)

In [14]:
print(type(sentence_vectors))
print(sentence_vectors.shape)

<type 'numpy.ndarray'>
(8989, 500)
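
The reduction itself can be sketched with an exact SVD in NumPy on a small random matrix. This is an assumption-laden illustration, not what scikit-learn runs internally: `TruncatedSVD` uses a randomized solver by default, so its output is an approximation and may differ in sign. The matrix size and `k` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 4))  # 6 "documents" x 4 "terms" (hypothetical data)
k = 2                            # number of components to keep

# Full (thin) SVD: toy_matrix == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

# Keep the k largest singular values: each row is now a k-dimensional vector
reduced = U[:, :k] * s[:k]

# Mapping back gives the best rank-k approximation of the original matrix
approximation = reduced @ Vt[:k, :]

print(reduced.shape)        # (6, 2)
print(approximation.shape)  # (6, 4)
```

The reduced rows are the same as projecting the original rows onto the top `k` right singular vectors (`toy_matrix @ Vt[:k].T`), which is how new data would be transformed.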


Save processed sentence vectors to a file

In [15]:
np.save(RESULTS_DIR + "/sentence_vectors.npy", sentence_vectors)

Load sentence vectors from file (saves a lot of time!)

In [None]:
sentence_vectors = np.load(RESULTS_DIR + "/sentence_vectors.npy")
sentence_vectors.shape

## SciKit-Learn Keras Model

Import Keras libraries and classes

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils

Define a base fully connected model

In [None]:
def fully_connected_model():
    model = Sequential()
    model.add(Dense(512, input_shape=(500,), kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, kernel_initializer='glorot_normal', activation='relu'))
    model.add(Dense(9, kernel_initializer='glorot_normal', activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Encode labels for training data. http://scikit-learn.org/stable/modules/preprocessing_targets.html#preprocessing-targets

In [None]:
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
encoder.fit(train_y) 
print(encoder.classes_) # Comes up with classes 1-9

encoded_y = encoder.transform(train_y) # Transforms training labels to 0-8
np.unique(encoded_y)

In [None]:
onehot_y = np_utils.to_categorical(encoded_y)
onehot_y.shape
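
For intuition, the two encoding steps can be reproduced with plain NumPy. This is a sketch on hypothetical labels, not a replacement for `LabelEncoder` and `to_categorical`: sorting the unique classes mirrors what `LabelEncoder` does, and indexing an identity matrix yields the one-hot rows.

```python
import numpy as np

toy_labels = np.array([1, 4, 9, 4, 2])  # hypothetical class labels

# Sorted unique classes, plus the position of each label within them
classes, encoded = np.unique(toy_labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for encoded class i
onehot = np.eye(len(classes))[encoded]

# Going back: argmax recovers the encoded index, classes maps it to the label
decoded = classes[onehot.argmax(axis=1)]

print(encoded)  # [0 2 3 2 1]
print(onehot.shape)
```

Only four distinct classes appear in the toy labels, so the one-hot matrix has four columns; with all nine classes present it would have nine.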

Use the Keras scikit-learn wrappers (https://keras.io/scikit-learn-api/) to build the model, then fit it. More info: http://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/

In [None]:
train_sentence_vectors = sentence_vectors[0:len(train_x)]
print(train_sentence_vectors.shape)

In [None]:
num_epochs = 15

### Train model from scratch

In [None]:
estimator = KerasClassifier(build_fn=fully_connected_model, epochs=num_epochs, batch_size=64)

### Load latest model from file (Skip if you want to train from scratch)

In [None]:
latest_model_filename = '/estimator_%d.sav' % num_epochs
with open(RESULTS_DIR + latest_model_filename, 'rb') as f:
    estimator = pickle.load(f)

### Fit model

In [None]:
estimator.fit(train_sentence_vectors, onehot_y, validation_split=0.1)

In [None]:
print(type(estimator))

Save latest model to a file

In [None]:
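latest_model_filename = '/estimator_%d.sav' % num_epochs # Defined again here in case the load cell was skipped
with open(RESULTS_DIR + latest_model_filename, 'wb') as f:
    pickle.dump(estimator, f)
print("Saved latest model to %s" % (RESULTS_DIR + latest_model_filename))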

### Make predictions!

In [None]:
test_sentence_vectors = sentence_vectors[len(train_x):]
y_pred = estimator.predict_proba(test_sentence_vectors)

Submit predictions. https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation

In [None]:
submission = pd.DataFrame(y_pred)
submission['id'] = test_x['ID'].values
submission.columns = ['class1', 'class2', 'class3', 'class4', 'class5', 'class6', 'class7', 'class8', 'class9', 'id'] 

In [None]:
submission_file_name = "/submission.csv"
submission.to_csv(DATA_DIR + submission_file_name, index=False)
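
Before uploading, it can help to score predictions locally on held-out training rows. The sketch below assumes the metric described on the evaluation page linked above is multi-class logarithmic loss; `multiclass_log_loss` and the toy arrays are hypothetical names for this example, not part of the notebook.

```python
import numpy as np

def multiclass_log_loss(onehot_true, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalize so each row sums to 1
    probs = np.clip(predicted_probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return float(-np.mean(np.sum(onehot_true * np.log(probs), axis=1)))

# Hypothetical 3-class example: two samples, both predicted fairly well
toy_true = np.array([[1, 0, 0], [0, 1, 0]])
toy_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])
toy_loss = multiclass_log_loss(toy_true, toy_pred)

# Reference point: always predicting a uniform distribution over 9 classes
uniform_loss = multiclass_log_loss(np.eye(9), np.full((9, 9), 1.0 / 9))
print(toy_loss, uniform_loss)
```

Under that assumption, a model is only useful if its held-out loss is below the uniform baseline of ln(9), roughly 2.197.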

In [None]:
from IPython.display import FileLink

FileLink(os.path.relpath(DATA_DIR + submission_file_name, current_dir))

## RCNN Model

Based on the paper [Recurrent Convolutional Neural Networks for Text Classification](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9745/9552)

<img src="rcnn.png" />

Import libraries

In [38]:
import gensim

In [17]:
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model

Import pre-trained Word2Vec embeddings (see https://en.wikipedia.org/wiki/Word2vec for more info)

In [18]:
pretrained_embeddings_path = "~/nbs/data/word2vec/GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, binary=True)

Set up embeddings for my model

In [19]:
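# Append an all-zero row for unseen tokens at the end (index == vocabulary size),
# so that word2vec's own indices still point at their original rows
unseen_token_embedding = np.zeros((1, word2vec.syn0.shape[1]), dtype="float32")
embeddings = np.concatenate((word2vec.syn0, unseen_token_embedding))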
embeddings.shape

(3000001, 300)

Define metadata for model

In [20]:
MAX_TOKENS = word2vec.syn0.shape[0]
NUM_CLASSES = 9
hidden_layer_1_size = 200
hidden_layer_2_size = 100
embedding_size = word2vec.syn0.shape[1]

Define input layers

In [21]:
sentence_vector_input = Input(shape = (None, ), dtype="float32")
left_context = Input(shape = (None, ), dtype="float32")
right_context = Input(shape = (None, ), dtype="float32")

Create embeddings for inputs

In [22]:
embedder = Embedding(MAX_TOKENS + 1, embedding_size, weights = [embeddings], trainable = False)
sentence_vector_embedding = embedder(sentence_vector_input)
left_embedding = embedder(left_context)
right_embedding = embedder(right_context)

### Convert sentence vectors to Word2Vec

### Build model

In [25]:
# Create forwards and backwards RNNs to build left and right contexts, as described in the paper
forward_left_context_rnn = LSTM(hidden_layer_1_size, return_sequences = True)(left_embedding)
backward_right_context_rnn = LSTM(hidden_layer_1_size, return_sequences = True, go_backwards = True)(right_embedding)
sentence_vector_embedding_rnn = concatenate(
    [
        forward_left_context_rnn, 
        sentence_vector_embedding, 
        backward_right_context_rnn
    ],
    axis = 2
)

In [27]:
# Activation layer
latent_semantic_activation_layer = TimeDistributed(Dense(hidden_layer_2_size, activation = "tanh"))(sentence_vector_embedding_rnn)

In [33]:
# Max pooling layer
# Don't use built-in Keras for now for compatibility
# pool = MaxPooling1D(pool_size=10, strides=10)(latent_semantic_activation_layer) 

pool = Lambda(lambda x: backend.max(x, axis = 1), output_shape = (hidden_layer_2_size, ))(latent_semantic_activation_layer)
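
What the `Lambda` layer computes can be checked with NumPy on a tiny hypothetical activation tensor: taking the maximum over axis 1 (the token axis) collapses a variable-length sequence into one fixed-size vector per sentence. This is the same reduction a global max-pooling layer performs, and it differs from the windowed `MaxPooling1D` in the commented-out line.

```python
import numpy as np

# Hypothetical activations: 1 sentence, 3 tokens, 3 features per token
batch = np.array([
    [[ 0.1, -0.5,  0.3],
     [ 0.7,  0.2, -0.1],
     [-0.2,  0.4,  0.0]],
])

# Max over the token axis: keep the strongest activation of each feature
pooled = batch.max(axis=1)

print(batch.shape)   # (1, 3, 3)
print(pooled.shape)  # (1, 3)
print(pooled)        # [[0.7 0.4 0.3]]
```

Because the token axis disappears, the output size no longer depends on sentence length, which is what lets the final `Dense` layer take it as input.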

In [34]:
# Output layer
output = Dense(NUM_CLASSES, input_dim = hidden_layer_2_size, activation = 'softmax')(pool)

In [35]:
# Final model
model = Model(inputs = [sentence_vector_input, left_context, right_context], outputs = output)
model.compile(optimizer = "adam", loss = "categorical_crossentropy", metrics = ["accuracy"])

### Test Model

Generate test string

In [96]:
import string
import re

In [97]:
punctuation = '([.,!?()])'
text = "This is some example text."
text = re.sub(punctuation, r' \1 ', text)
text = re.sub(r'\s{2,}', ' ', text)
text

'This is some example text . '

Tokenize text

In [98]:
tokens = text.split()
tokens = [word2vec.vocab[token].index if token in word2vec.vocab else MAX_TOKENS for token in tokens]
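
The punctuation splitting and vocabulary lookup can be wrapped into small helpers. A minimal sketch, assuming a hypothetical toy vocabulary in place of `word2vec.vocab` and one reserved index for every out-of-vocabulary token; `tokenize`, `to_ids`, `toy_vocab` and `UNKNOWN_ID` are illustrative names only.

```python
import re

def tokenize(text):
    """Put spaces around punctuation, then split on whitespace."""
    spaced = re.sub(r'([.,!?()])', r' \1 ', text)
    return spaced.split()

def to_ids(tokens, vocab, unknown_id):
    """Map each token to its vocabulary index, or to unknown_id if unseen."""
    return [vocab.get(token, unknown_id) for token in tokens]

# Hypothetical vocabulary; the reserved unknown index sits right after it
toy_vocab = {"This": 0, "is": 1, "some": 2, "text": 3}
UNKNOWN_ID = len(toy_vocab)

toy_tokens = tokenize("This is some example text.")
toy_ids = to_ids(toy_tokens, toy_vocab, UNKNOWN_ID)

print(toy_tokens)  # ['This', 'is', 'some', 'example', 'text', '.']
print(toy_ids)     # [0, 1, 2, 4, 3, 4]
```

Both "example" and "." are missing from the toy vocabulary, so they share the reserved index.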

Convert text into input format

In [119]:
token_array = np.array([tokens])
left_context_token_array = np.array([np.append([MAX_TOKENS], token_array[0][:-1])]) # Left context of token i is token i-1: shift right and pad the front
right_context_token_array = np.array([np.append(token_array[0][1:], [MAX_TOKENS])]) # Right context of token i is token i+1: shift left and pad the end

print(token_array)
print(left_context_token_array)
print(right_context_token_array)

[[    105       4      78    1026    2986 3000000]]
[[3000000     105       4      78    1026    2986]]
[[      4      78    1026    2986 3000000 3000000]]
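
The same shifting can be packaged as a reusable function. This is a sketch under the assumption that a single padding index stands in for the missing neighbor at each end; `build_contexts` and `pad_id` are hypothetical names, and the token ids are the ones printed above.

```python
import numpy as np

def build_contexts(token_ids, pad_id):
    """Return (tokens, left, right) arrays, each with a leading batch axis of 1."""
    tokens = np.asarray(token_ids)
    # Left context of position i is token i-1; the first position gets padding
    left = np.concatenate(([pad_id], tokens[:-1]))
    # Right context of position i is token i+1; the last position gets padding
    right = np.concatenate((tokens[1:], [pad_id]))
    return tokens[None, :], left[None, :], right[None, :]

ids, left_ids, right_ids = build_contexts(
    [105, 4, 78, 1026, 2986, 3000000], pad_id=3000000
)
print(left_ids)
print(right_ids)
```

All three arrays keep the same length, so position `i` in each one lines up with the same token when they are fed to the three model inputs.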


Generate expected output

In [120]:
expected = np.array([NUM_CLASSES * [0]])
expected[0][2] = 1

Use Word2Vec embeddings in model on text

In [121]:
history = model.fit([token_array, left_context_token_array, right_context_token_array], expected, epochs = 1)
# loss = history.history["loss"][0]

Epoch 1/1


## Vocab

* TF-IDF: Term Frequency-Inverse Document Frequency; a weight that scores how characteristic a term is of one document within a larger corpus. The term frequency grows with how often the term appears in the document, and the inverse document frequency discounts terms that appear in many documents. Here each clinical-evidence text is one document and the corpus is all training and test texts, so words that are common everywhere get low weight while words specific to a few texts stand out, which is pretty much what we want in order to train a model.
* Truncated SVD: Truncated Singular Value Decomposition. Used to reduce the dimensionality of a dataset. It factorizes a matrix into three matrices and keeps only the k largest singular values and their vectors, giving a rank-k approximation. Closely related to PCA: PCA is equivalent to an SVD of the mean-centered data. Truncated SVD skips the centering, so it can work directly on sparse matrices such as TF-IDF output.
* CSR Matrix: Compressed Sparse Row matrix. Stores only the nonzero values, together with their column indices and per-row offsets, so a matrix that is mostly zeros (like the TF-IDF matrix) takes far less memory.
* scikit-learn vs. Keras: scikit-learn is for general machine learning (for example, we can use it to build a classifier model), whereas Keras is specifically for building and training neural networks (deep learning).
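
To illustrate the CSR entry above, here is a minimal pure-Python sketch that converts a small dense matrix into the three CSR arrays. The `to_csr` helper and `toy_dense` matrix are hypothetical; SciPy's `csr_matrix` exposes the corresponding arrays as `data`, `indices` and `indptr`.

```python
def to_csr(dense_rows):
    """Convert a list of rows into CSR arrays (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)      # the nonzero entry itself
                col_indices.append(col)   # the column it came from
        # Row r's entries live in values[row_ptr[r]:row_ptr[r + 1]]
        row_ptr.append(len(values))
    return values, col_indices, row_ptr

toy_dense = [
    [0, 0, 3, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]
values, col_indices, row_ptr = to_csr(toy_dense)

print(values)       # [3, 1, 2]
print(col_indices)  # [2, 0, 3]
print(row_ptr)      # [0, 1, 3, 3]
```

Twelve dense cells shrink to three stored values plus bookkeeping; an all-zero row costs only one repeated offset in `row_ptr`.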