# Deriving cancer genes from expression data

Test whether a deep neural network can be trained as a tumor/normal binary classifier using only the Toil TCGA, TARGET and GTEx expression datasets:

https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net

If so, check whether any of the early layers recapitulates the COSMIC cancer gene census:

http://cancer.sanger.ac.uk/census/
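
One possible way to make the census comparison concrete is to rank genes by the magnitude of their first-layer weights and measure how many of the top-ranked genes appear in the census. The sketch below is only an illustration on toy data: the helper names `top_genes_by_weight` and `census_overlap`, the gene names, and the census list are all hypothetical, and it assumes a first layer that takes one input per gene (which is not the case once PCA is placed in front of the network).

```python
import numpy as np


def top_genes_by_weight(weights, gene_names, k):
    """Return the k genes with the largest summed absolute outgoing weight.

    weights: array of shape (n_genes, n_hidden), e.g. a first-layer kernel.
    """
    importance = np.abs(weights).sum(axis=1)
    top = np.argsort(importance)[::-1][:k]
    return [gene_names[i] for i in top]


def census_overlap(ranked_genes, census_genes):
    """Return the ranked genes found in the census and their fraction."""
    census = set(census_genes)
    hits = [g for g in ranked_genes if g in census]
    return hits, len(hits) / len(ranked_genes)


# Toy data only: 20 made-up genes, 8 hidden units.
rng = np.random.RandomState(42)
gene_names = ["GENE%d" % i for i in range(20)]
weights = rng.normal(scale=0.01, size=(20, 8))
weights[[3, 7]] += 1.0  # make two genes clearly dominant

census_genes = ["GENE3", "GENE7", "GENE19"]  # hypothetical census list

top = top_genes_by_weight(weights, gene_names, k=2)
hits, fraction = census_overlap(top, census_genes)
print("top genes:", top, "fraction in census:", fraction)
```

Summed absolute weight is just one simple importance score; other choices would give different rankings.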

In [7]:
import os
import numpy as np
import pandas as pd

# fix random seed for reproducibility
np.random.seed(42)

In [8]:
from keras.utils.io_utils import HDF5Matrix
import h5py

input_file = "data/tumor_normal.h5"

with h5py.File(input_file, "r") as f:
    print("Datasets:", list(f.keys()))

# different size parameters if debugging or using full dataset
if os.getenv("DEBUG", "True") == "True":
    X_train = HDF5Matrix(input_file, "X_train", start=0, end=5000)
    X_test = HDF5Matrix(input_file, "X_test", start=0, end=1000)
    y_train = HDF5Matrix(input_file, "y_train", start=0, end=5000)
    y_test = HDF5Matrix(input_file, "y_test", start=0, end=1000)
    print("Training on partial dataset")
    epochs=1
    batch_size=128
else:
    X_train = HDF5Matrix(input_file, "X_train")
    X_test = HDF5Matrix(input_file, "X_test")
    y_train = HDF5Matrix(input_file, "y_train")
    y_test = HDF5Matrix(input_file, "y_test")
    print("Training on full dataset")
    epochs=16
    batch_size=128
    
print("X_train.shape:", X_train.shape, "epochs:", epochs, "batch_size:", batch_size)

Datasets: ['X_test', 'X_train', 'class_labels', 'classes_test', 'classes_train', 'features', 'genes', 'labels', 'y_test', 'y_train']
Training on partial dataset
X_train.shape: (5000, 60498) epochs: 1 batch_size: 256


In [9]:
# Try PCA to reduce dimensions before the network
from sklearn.decomposition import PCA
print("Computing PCA")
# Keep the fitted PCA so the same projection can be applied to X_test later
pca = PCA(n_components=5000)
X_train_pca = pca.fit_transform(X_train)

CPU times: user 1h 31min 9s, sys: 1h 27min 35s, total: 2h 58min 44s
Wall time: 10min 30s
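
For reference, a minimal sketch of the fit-on-train, transform-on-test pattern on small random data. The array names and sizes here are toy stand-ins, not the notebook's data. Note that `n_components` cannot exceed `min(n_samples, n_features)`, so with the 5000-sample debug subset 5000 is the largest value PCA will accept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-ins for the expression matrices (samples x genes).
rng = np.random.RandomState(42)
X_train_toy = rng.normal(size=(50, 200))
X_test_toy = rng.normal(size=(10, 200))

# n_components is capped by min(n_samples, n_features).
n_components = min(20, *X_train_toy.shape)

# Fit on the training data only, then reuse the same projection for test data.
pca = PCA(n_components=n_components)
X_train_toy_pca = pca.fit_transform(X_train_toy)
X_test_toy_pca = pca.transform(X_test_toy)

retained = pca.explained_variance_ratio_.sum()
print(X_train_toy_pca.shape, X_test_toy_pca.shape)
print("fraction of variance retained:", retained)
```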


In [10]:
"""
Trivial network to just verify all the pieces are working
"""

from keras.layers import Dense
from keras.models import Model, Sequential

classify = [
    Dense(1000, input_dim=X_train_pca.shape[1], activation='relu'),
    Dense(500, activation='relu'),
    Dense(1, activation='sigmoid')
]

model = Sequential(classify)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_pca, y_train, epochs=epochs, batch_size=batch_size, shuffle="batch")
# Evaluation is disabled: X_test would first need the same PCA projection as the training data
# model.evaluate(X_test, y_test)

Epoch 1/1


<keras.callbacks.History at 0x7f8e2c153278>
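
Because the network is trained on PCA components, its first-layer weights refer to components rather than genes. One way to get back to a per-gene view, sketched here under assumptions, is to multiply the PCA loading matrix by the first-layer kernel. This is exact for the linear pre-activation as long as PCA is not whitened. The sketch uses a plain NumPy SVD in place of scikit-learn so it is self-contained, and `W_pca` is a random stand-in for what `model.layers[0].get_weights()[0]` would return.

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_genes, n_components, n_hidden = 30, 100, 10, 4

# Toy expression matrix and its principal axes via SVD of the centered data.
X = rng.normal(size=(n_samples, n_genes))
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:n_components]  # (n_components, n_genes), like PCA.components_

# Random stand-in for the first Dense kernel trained on PCA inputs.
W_pca = rng.normal(size=(n_components, n_hidden))

# Effective gene-space weights: one row per gene, one column per hidden unit.
W_gene = components.T @ W_pca

# Both routes give the same pre-activation (bias omitted on both sides).
x = X[:5]
via_pca = ((x - mean) @ components.T) @ W_pca
via_gene = (x - mean) @ W_gene
print("routes agree:", np.allclose(via_pca, via_gene))

# A simple per-gene importance score derived from the back-projected weights.
gene_importance = np.abs(W_gene).sum(axis=1)
print("gene_importance shape:", gene_importance.shape)
```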

#### Results
Trained on the full dataset, with the input reduced to 5000 PCA components and fed into this network:

    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_4 (Dense)              (None, 1000)              5001000   
    _________________________________________________________________
    dense_5 (Dense)              (None, 500)               500500    
    _________________________________________________________________
    dense_6 (Dense)              (None, 1)                 501       
    =================================================================
    Total params: 5,502,001.0
    Trainable params: 5,502,001
    Non-trainable params: 0.0
    _________________________________________________________________

The final training epoch reports (training-set metrics, not test-set):

    Epoch 16/16
    15300/15300 [==============================] - 23s - loss: 0.6338 - acc: 0.9603
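
When reading these two numbers together, it may help to recall that binary cross-entropy measures how confident the predicted probabilities are, while accuracy only checks which side of 0.5 they fall on. The toy sketch below uses invented predictions, not output from this model, to show that a classifier can be right on every sample and still have a loss close to ln 2 (about 0.693, the loss of always predicting 0.5).

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob > threshold) == (y_true == 1)))


y_true = np.array([1, 1, 0, 0, 1, 0])

# Invented predictions: always on the correct side of 0.5, but only barely.
hesitant = np.where(y_true == 1, 0.53, 0.47)
# Invented predictions: correct and confident.
confident = np.where(y_true == 1, 0.99, 0.01)

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name, "acc:", accuracy(y_true, probs), "loss:", binary_crossentropy(y_true, probs))
```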