# Deriving cancer genes from expression data

Test whether a deep neural network can be trained as a tumor/normal binary classifier using only the Toil TCGA, TARGET and GTEx expression datasets:

https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net

If we can, then see whether any of the early layers recapitulate the COSMIC Cancer Gene Census:

http://cancer.sanger.ac.uk/census/
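One possible way to check whether an early layer recapitulates the census is to rank input genes by the magnitude of their first-layer weights and measure how many of the top-ranked genes are census genes. The sketch below is illustrative only: `rank_genes_by_weight` and `census_overlap` are hypothetical helper names, the summed absolute weight is an assumed importance score rather than an established method, and the demo uses synthetic weights and made-up gene names instead of the real data.

```python
import numpy as np


def rank_genes_by_weight(weights, genes):
    """Rank genes by the summed absolute weight of their outgoing connections.

    weights: array of shape (n_genes, n_hidden), e.g. a first Dense layer kernel.
    genes: sequence of n_genes identifiers, in the same order as the input columns.
    """
    importance = np.abs(weights).sum(axis=1)
    order = np.argsort(importance)[::-1]  # largest importance first
    return [genes[i] for i in order]


def census_overlap(ranked_genes, census_genes, top_n):
    """Fraction of the top_n ranked genes that appear in the census set."""
    top = ranked_genes[:top_n]
    return len(set(top) & set(census_genes)) / float(top_n)


# Synthetic demo: 6 made-up genes feeding 3 hidden units.
rng = np.random.RandomState(0)
demo_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
demo_weights = rng.normal(scale=0.01, size=(6, 3))
demo_weights[1] = 1.0   # make G2 strongly weighted
demo_weights[4] = -2.0  # make G5 even more strongly weighted (sign is ignored)

ranked = rank_genes_by_weight(demo_weights, demo_genes)
overlap = census_overlap(ranked, {"G2", "G5"}, top_n=2)
print(ranked[:2], overlap)
```

With a trained Keras model, the first-layer kernel could be obtained from `model.layers[0].get_weights()[0]`, and the gene identifiers would presumably come from the `genes` dataset in the HDF5 file.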

In [1]:
import os
import numpy as np
import pandas as pd

# fix random seed for reproducibility
np.random.seed(42)

In [8]:
from keras.utils.io_utils import HDF5Matrix
import h5py

input_file = "data/tumor_normal.h5"

with h5py.File(input_file, "r") as f:
    print("Datasets:", list(f.keys()))

# Use a small slice of the data when debugging (DEBUG defaults to "True"); otherwise use the full dataset
if os.getenv("DEBUG", "True") == "True":
    X_train = HDF5Matrix(input_file, "X_train", start=0, end=1000)
    X_test = HDF5Matrix(input_file, "X_test", start=0, end=200)
    y_train = HDF5Matrix(input_file, "y_train", start=0, end=1000)
    y_test = HDF5Matrix(input_file, "y_test", start=0, end=200)
    print("Training on partial dataset")
    epochs = 1
    batch_size = 256
else:
    X_train = HDF5Matrix(input_file, "X_train")
    X_test = HDF5Matrix(input_file, "X_test")
    y_train = HDF5Matrix(input_file, "y_train")
    y_test = HDF5Matrix(input_file, "y_test")
    print("Training on full dataset")
    epochs = 32
    batch_size = 128
    
print("X_train.shape:", X_train.shape, "epochs:", epochs, "batch_size:", batch_size)

Datasets: ['X_test', 'X_train', 'class_labels', 'classes_test', 'classes_train', 'features', 'genes', 'labels', 'y_test', 'y_train']
Training on partial dataset
X_train.shape: (1000, 60498) epochs: 1 batch_size: 256


In [9]:
"""
Trivial network to just verify all the pieces are working
"""

from keras.layers import Dense
from keras.models import Model, Sequential

classify = [
    Dense(1000, input_dim=X_train.shape[1], activation='relu'),
    Dense(500, activation='relu'),
    Dense(1, activation='sigmoid')
]

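model = Sequential(classify)
# Single sigmoid output, so binary cross-entropy is the matching loss
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# shuffle="batch" shuffles in batch-sized chunks, which suits HDF5-backed data
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, shuffle="batch")
# Returns [loss, accuracy] on the held-out test slice
model.evaluate(X_test, y_test)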

Epoch 1/1


KeyboardInterrupt: 
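As a rough sanity check on the size of the trivial network above, the parameter count of each `Dense` layer can be worked out by hand (inputs times units, plus one bias per unit). The sketch below assumes the 60498 input features reported by `X_train.shape` and the layer sizes 1000, 500 and 1 from the cell above; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 60498            # X_train.shape[1] from the earlier output
layer_units = [1000, 500, 1]  # Dense layer sizes in the trivial network

counts = []
n_in = n_features
for n_units in layer_units:
    counts.append(dense_params(n_in, n_units))
    n_in = n_units  # this layer's outputs feed the next layer

total_params = sum(counts)
first_layer_share = counts[0] / float(total_params)
print("per-layer:", counts, "total:", total_params)
```

Nearly all of the roughly 61 million parameters sit in the first layer, which may be part of why even a single epoch on the partial dataset is slow. `model.summary()` should report the same totals for the real model.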