# Personalized Medicine: Redefining Cancer Treatment

## Plan

1. Process data and display some example images to get an idea of what's going on
2. Train VGG model (multi-input?)

## Data Loading

Import libraries and configure paths.

In [2]:
# Import utility libraries
import os
import sys
from IPython.core.debugger import Tracer

import numpy as np
import pandas as pd
import gc
import cv2 # OpenCV (Open Source Computer Vision Library). Image manipulation, for our purposes
import matplotlib.image as mpimg
from skimage import io
from tqdm import tqdm # Progress bars

# Allow importing utils, Vgg, etc. from the parent directory
sys.path.insert(1, os.path.join(sys.path[0], '..'))

from utils import *

%matplotlib inline

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [3]:
current_dir = os.getcwd()
NOTEBOOK_DIR = current_dir
DATA_DIR = os.path.dirname(current_dir) + "/data/cancer_treatment"

In [4]:
# Sample data

# Training data

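At this point the data archives should already be downloaded into `$DATA_DIR`. Unzip them from that directory (the quotes let `unzip` expand the wildcard itself, which matters when there is more than one archive):

```
unzip '*.zip'
```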

In [5]:
%mkdir -p $DATA_DIR
%cd $DATA_DIR

# Set up sample data
# !mkdir -p sample-jpg/train
# !find train-jpg -type f | shuf -n 1000 | xargs -I {} cp "{}" sample-jpg/train
# !mkdir -p sample-jpg/valid
# !find train-jpg -type f | shuf -n 250 | xargs -I {} cp "{}" sample-jpg/valid

# Set up validation data (n.b. we `mv` files here instead of `cp` since we don't want overlap between training and validation data)
# !mkdir -p valid-jpg
# !find train-jpg -type f | shuf -n 8000 | xargs -I {} mv "{}" valid-jpg

!mkdir -p results

%cd $NOTEBOOK_DIR

/home/ubuntu/nbs/data/cancer_treatment
/home/ubuntu/nbs/cancer-treatment


## Looking at data

## VGG16 Model

## Keras Model

Initialize the lists that will collect training and test data (they are converted to NumPy arrays later)

In [7]:
x_train = [] # Training input
x_test = [] # Test input
y_train = [] # Expected training output

Read and inspect training data

In [8]:
df_train = pd.read_csv(DATA_DIR + "/train_v2.csv")
df_train.head()

| | image_name | tags |
|---|---|---|
| 0 | train_0 | haze primary |
| 1 | train_1 | agriculture clear primary water |
| 2 | train_2 | clear primary |
| 3 | train_3 | clear primary |
| 4 | train_4 | agriculture clear habitation primary road |


In [9]:
# Extract labels
flatten = lambda l: [item for sublist in l for item in sublist]
# Take all values of 'tags' key in df_train, split on spaces, flatten it, turn it into a set (extract unique values), and then back to a list
labels = list(set(flatten([l.split(' ') for l in df_train['tags'].values])))

print(labels)

['slash_burn', 'clear', 'blooming', 'primary', 'cloudy', 'conventional_mine', 'water', 'haze', 'cultivation', 'partly_cloudy', 'artisinal_mine', 'habitation', 'bare_ground', 'blow_down', 'agriculture', 'road', 'selective_logging']


In [10]:
label_map = {l: i for i, l in enumerate(labels)} # Map label to index
inv_label_map = {i: l for l, i in label_map.items()} # Map index to label

print(label_map)
print(inv_label_map)

{'selective_logging': 16, 'cultivation': 8, 'clear': 1, 'habitation': 11, 'conventional_mine': 5, 'cloudy': 4, 'primary': 3, 'water': 6, 'haze': 7, 'slash_burn': 0, 'partly_cloudy': 9, 'artisinal_mine': 10, 'blooming': 2, 'bare_ground': 12, 'blow_down': 13, 'agriculture': 14, 'road': 15}
{0: 'slash_burn', 1: 'clear', 2: 'blooming', 3: 'primary', 4: 'cloudy', 5: 'conventional_mine', 6: 'water', 7: 'haze', 8: 'cultivation', 9: 'partly_cloudy', 10: 'artisinal_mine', 11: 'habitation', 12: 'bare_ground', 13: 'blow_down', 14: 'agriculture', 15: 'road', 16: 'selective_logging'}
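The order of `labels` above comes from iterating over a `set`, which is not guaranteed to be stable. Depending on the Python version and hash randomization settings, the label indices could differ between runs, which would matter if a trained model or saved predictions were reused later. A minimal sketch of one way to pin the order, assuming sorting alphabetically is acceptable (`tag_strings` is a hypothetical stand-in for `df_train['tags'].values`):

```python
# Hypothetical stand-in for df_train['tags'].values
tag_strings = [
    "haze primary",
    "agriculture clear primary water",
    "clear primary",
]

# sorted() fixes the order, so the label -> index mapping is the same on every run
labels = sorted({tag for s in tag_strings for tag in s.split(' ')})

label_map = {label: i for i, label in enumerate(labels)}      # label -> index
inv_label_map = {i: label for label, i in label_map.items()}  # index -> label

print(labels)
print(label_map)
```

With sorted labels, column `i` of the target matrix always refers to the same tag.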


Assemble training data into inputs and outputs

In [11]:
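# NOTE: TRAIN_TIF_DIR is never defined in this notebook. It is assumed to hold the
# path of the training .tif directory, ending with a path separator.
for filename, tags in tqdm(df_train.values[:1000], miniters=1000):
    # Read the image and resize it to 32x32 pixels (channels are left unchanged)
    img = io.imread(TRAIN_TIF_DIR + "{}.tif".format(filename))
    resized_img = cv2.resize(img, (32, 32))
    # Build a multi-hot target vector: a 1 at the index of every tag on this image
    targets = np.zeros(17)
    for t in tags.split(' '):
        targets[label_map[t]] = 1
    x_train.append(resized_img)
    y_train.append(targets)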

x_train_full = np.array(x_train, np.float16)
y_train_full = np.array(y_train, np.uint8)

100%|██████████| 1000/1000 [00:00<00:00, 1685.93it/s]


In [12]:
print(x_train_full.shape)
print(y_train_full.shape)

(1000, 32, 32, 4)
(1000, 17)
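Each row of `y_train_full` is a multi-hot vector: one slot per label, set to 1 for every tag attached to the image. The sketch below isolates that encoding, together with the reverse step that is useful for reading predictions back as tags. The helper names `encode_tags` and `decode_tags` and the five-label map are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Illustrative five-label map; the notebook's real map has 17 entries
label_map = {'agriculture': 0, 'clear': 1, 'haze': 2, 'primary': 3, 'water': 4}
inv_label_map = {i: l for l, i in label_map.items()}

def encode_tags(tags, label_map):
    """Turn a space-separated tag string into a 0/1 vector."""
    targets = np.zeros(len(label_map), dtype=np.uint8)
    for t in tags.split(' '):
        targets[label_map[t]] = 1
    return targets

def decode_tags(targets, inv_label_map):
    """Turn a 0/1 vector back into a space-separated tag string."""
    return ' '.join(inv_label_map[int(i)] for i in np.flatnonzero(targets))

vec = encode_tags("clear primary", label_map)
print(vec)
print(decode_tags(vec, inv_label_map))
```

Decoding lists tags in index order, so the round trip preserves the set of tags but not necessarily the order they had in the CSV.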


Configure number of training inputs and validation inputs

In [13]:
NUM_TRAIN = 500
NUM_VALID = 150

Establish training and validation data sets

In [14]:
x_train = x_train_full[:NUM_TRAIN]
y_train = y_train_full[:NUM_TRAIN]

x_valid = x_train_full[NUM_TRAIN:NUM_TRAIN + NUM_VALID]
y_valid = y_train_full[NUM_TRAIN:NUM_TRAIN + NUM_VALID]

print(x_train.shape)
print(y_train.shape)
print(x_valid.shape)
print(y_valid.shape)

(500, 32, 32, 4)
(500, 17)
(150, 32, 32, 4)
(150, 17)
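The split above takes the first `NUM_TRAIN` rows for training and the next `NUM_VALID` for validation, which is fine as long as the CSV rows are in no meaningful order. If that is in doubt, shuffling before splitting is a common alternative. A minimal sketch, assuming NumPy arrays as inputs (`shuffled_split` and the demo arrays are hypothetical):

```python
import numpy as np

def shuffled_split(x, y, num_train, num_valid, seed=0):
    """Shuffle x and y with one shared permutation, then slice out two disjoint sets."""
    assert len(x) == len(y)
    assert num_train + num_valid <= len(x)
    rng = np.random.RandomState(seed)   # fixed seed keeps the split reproducible
    order = rng.permutation(len(x))
    train_idx = order[:num_train]
    valid_idx = order[num_train:num_train + num_valid]
    return x[train_idx], y[train_idx], x[valid_idx], y[valid_idx]

# Tiny demo data: row i of x_demo is [2*i, 2*i + 1], and y_demo[i] == i
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

x_tr, y_tr, x_va, y_va = shuffled_split(x_demo, y_demo, num_train=6, num_valid=3)
print(x_tr.shape, y_tr.shape, x_va.shape, y_va.shape)
```

Indexing both arrays with the same permutation keeps each image paired with its own target row.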


### Build model

In [17]:
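# Keras imports (redundant if `from utils import *` already provides these names)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()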
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 4)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(17, activation='sigmoid'))

model.compile(
    loss='binary_crossentropy', # Multi-label outputs (several tags per image): use sigmoid + binary crossentropy, not softmax + categorical crossentropy
    optimizer='adam',
    metrics=['accuracy']
)
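
The model ends in a sigmoid layer trained with binary crossentropy because each image can carry several tags at once. A sigmoid scores every label independently, whereas a softmax forces the scores to compete and sum to 1. The NumPy sketch below illustrates the difference on made-up logits. It is a hand-written illustration, not the Keras implementation, and the function names and numbers are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean per-label binary crossentropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Made-up logits for one image with three labels; labels 0 and 2 are present
logits = np.array([[2.0, -1.0, 3.0]])
y_true = np.array([[1, 0, 1]])

probs = sigmoid(logits)  # independent per-label probabilities
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs, probs.sum())      # sigmoid outputs can sum to more than 1
print(softmax, softmax.sum())  # softmax outputs always sum to 1
print(binary_crossentropy(y_true, probs))
```

One caveat when reading the `accuracy` metric in this setting: most entries of a multi-hot target are 0, so per-label accuracy can look high even when few tags are predicted correctly.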

### Fit model

In [18]:
model.fit(
    x_train,
    y_train,
    batch_size=64,
    epochs=10,
    verbose=1,
    validation_data=(x_valid, y_valid)
)

Train on 500 samples, validate on 150 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbe91e86910>

### Test model

A Kaggle forum tip notes that `fbeta_score` is available from `sklearn.metrics`: https://www.kaggle.com/anokas/fixed-f2-score-in-python/comments

In [25]:
from sklearn.metrics import fbeta_score

In [26]:
p_valid = model.predict(x_valid, batch_size=64)

In [27]:
print(y_valid)

[[0 0 0 ..., 0 0 0]
 [0 1 0 ..., 0 0 0]
 [0 1 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 1 0 ..., 1 0 0]]


In [28]:
print(p_valid)

[[ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  0.  0.]]


In [29]:
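# Treat a label as predicted when its probability exceeds 0.2, then average the per-image F2 scores
print(fbeta_score(y_valid, np.array(p_valid) > 0.2, beta=2, average='samples'))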

0.282485829564
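
To make the metric concrete: for each image, the F-beta score is `(1 + beta^2) * tp / ((1 + beta^2) * tp + beta^2 * fn + fp)`, and the `samples` average is the mean over images. With `beta=2`, missed tags (false negatives) cost more than spurious ones. The sketch below is a hand-rolled NumPy version intended to mirror that definition, followed by a small threshold sweep. The function name `f2_samples` and the toy data are assumptions, and an image with no true and no predicted tags is scored as 0 here.

```python
import numpy as np

def f2_samples(y_true, y_pred, beta=2.0):
    """Mean per-sample F-beta score for multi-hot targets and boolean predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=1).astype(float)
    fp = (~y_true & y_pred).sum(axis=1)
    fn = (y_true & ~y_pred).sum(axis=1)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Guard against division by zero for rows with no true and no predicted tags
    scores = np.where(denom > 0, (1 + b2) * tp / np.maximum(denom, 1), 0.0)
    return float(scores.mean())

# Toy data: two images, four labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0]])
probs = np.array([[0.90, 0.30, 0.10, 0.05],
                  [0.20, 0.60, 0.25, 0.10]])

# Sweep a few thresholds to see how the score responds
for threshold in (0.05, 0.2, 0.5):
    print(threshold, f2_samples(y_true, probs > threshold))
```

A sweep like this over `p_valid` is one way to check whether 0.2 is a reasonable cutoff for a given model.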
