# Feature Importance

Determine which features contribute the most to the output and see how well this correlates with the COSMIC Cancer Gene Census:

http://cancer.sanger.ac.uk/census/

## Resources

https://en.wikipedia.org/wiki/Feature_selection#Wrapper_method

https://stats.stackexchange.com/questions/250381/feature-selection-using-deep-learning

http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

https://arxiv.org/abs/1704.02685


In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the model from disk as trained on the GPU box
from keras.models import model_from_json

print("Loading model...")
with open("models/model.json", "r") as f:
    model = model_from_json(f.read())
print("Loading weights...")
model.load_weights("models/weights.h5")
print("Compiling model...")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Using TensorFlow backend.


Loading model...
Loading weights...
Compiling model...


In [3]:
# Check the accuracy using the test set
from keras.utils.io_utils import HDF5Matrix
import h5py

input_file = "data/tumor_normal.h5"
X_test = HDF5Matrix(input_file, "X_test")
y_test = HDF5Matrix(input_file, "y_test")

print(model.metrics_names, model.evaluate(X_test, y_test))

['loss', 'acc'] [0.706727260781157, 0.79952953479331135]


In [4]:
# Load the gene names (one per input feature) from the HDF5 file
import h5py
with h5py.File("data/tumor_normal.h5", "r") as f:
    genes = f["genes"][:]

In [5]:
"""
First do the simplest thing: sum the first-layer weights for each gene, sort,
and see whether the genes with the highest weights intersect with the COSMIC list.

Note: the code below takes the absolute value of the summed weights. Summing the
absolute values instead may be better, since both negative and positive weights
imply an effect.
"""
weights = model.layers[2].get_weights()
ranks = np.absolute(np.sum(weights[0], axis=1))
rankings = pd.DataFrame(ranks, index=genes.astype('U')).sort_values(by=0, ascending=False)
rankings.head()

| Gene | 0 |
|---|---|
| RP11-78A18 | 13.06272 |
| CTC-305H11 | 11.556487 |
| RP11-297L1 | 11.231505 |
| LINC00363 | 10.93818 |
| RP11-257I8 | 10.923991 |
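
The docstring in the cell above asks whether the score should be the sum of the absolute weights rather than the absolute value of the sum. The two can rank genes very differently, because large weights of opposite sign cancel out in a plain sum. A minimal sketch on a made-up weight matrix (`toy_weights` and `toy_genes` are hypothetical and not taken from the trained model):

```python
import numpy as np

# Toy first-layer weight matrix: rows are input features (genes),
# columns are hidden units. All values are invented for illustration.
toy_weights = np.array([
    [ 2.0, -2.0,  1.5, -1.5],   # gene_a: large weights that cancel out
    [ 0.5,  0.4,  0.3,  0.2],   # gene_b: small weights, all positive
    [-0.1,  0.2,  0.0,  0.0],   # gene_c: near-zero weights
])
toy_genes = np.array(["gene_a", "gene_b", "gene_c"])

# Score used in the notebook cell: absolute value of the per-gene sum.
abs_of_sum = np.abs(toy_weights.sum(axis=1))

# Alternative raised in the docstring: per-gene sum of absolute values.
sum_of_abs = np.abs(toy_weights).sum(axis=1)

# Rank genes from highest to lowest score under each scheme.
order_abs_of_sum = toy_genes[np.argsort(-abs_of_sum)]
order_sum_of_abs = toy_genes[np.argsort(-sum_of_abs)]

print("abs of sum:", list(order_abs_of_sum))
print("sum of abs:", list(order_sum_of_abs))
```

In this toy case `gene_a` has the largest weights but drops to the bottom under the absolute-value-of-the-sum score. Whether the same happens with the real model would need to be checked on its actual weights.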


In [6]:
rankings.index.values

array(['RP11-78A18', 'CTC-305H11', 'RP11-297L1', ..., 'MIR6888',
       'RP11-307C1', 'MIR4745'], dtype=object)

In [7]:
# How many of the top 10k weighted features overlap with COSMIC?
cosmic = pd.read_table("cancer_genes.tsv")["Gene Symbol"].values
np.intersect1d(rankings.index.values[0:10000], cosmic).shape

(69,)
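
A raw overlap count is hard to interpret without a baseline. One possible check is to compare it with the overlap expected if the top-k genes were picked at random, for example with a hypergeometric tail probability. The sketch below uses only the standard library and a made-up ten-gene example. The helper names (`overlap_at_k`, `expected_overlap`, `hypergeom_tail`) are hypothetical, and it assumes the reference count includes only reference genes that are present in the ranked list.

```python
from math import comb


def overlap_at_k(ranked_genes, reference_genes, k):
    """Count how many of the top-k ranked genes appear in the reference set."""
    return len(set(ranked_genes[:k]) & set(reference_genes))


def expected_overlap(n_total, n_reference, k):
    """Expected overlap if k genes were drawn at random from n_total genes."""
    return k * n_reference / n_total


def hypergeom_tail(n_total, n_reference, k, observed):
    """P(overlap >= observed) when k genes are drawn without replacement."""
    upper = min(k, n_reference)
    favourable = sum(
        comb(n_reference, i) * comb(n_total - n_reference, k - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(n_total, k)


# Toy data: ten ranked genes, three of which are in the reference list.
toy_ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
toy_reference = ["g1", "g2", "g9"]
k = 4

observed = overlap_at_k(toy_ranked, toy_reference, k)
expected = expected_overlap(len(toy_ranked), len(toy_reference), k)
p_value = hypergeom_tail(len(toy_ranked), len(toy_reference), k, observed)

print(observed, expected, p_value)
```

For the notebook's data, `toy_ranked` would be replaced by `rankings.index.values` and `toy_reference` by the COSMIC symbols that also occur in `genes`. Those counts are not shown here, so no result is claimed for the real overlap.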