# Feature Importance

Determine which features contribute most to the model's output and see how well they correlate with the COSMIC Cancer Gene Census:

http://cancer.sanger.ac.uk/census/

Resources

https://en.wikipedia.org/wiki/Feature_selection#Wrapper_method

https://stats.stackexchange.com/questions/250381/feature-selection-using-deep-learning

http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/


In [68]:
import numpy as np
import pandas as pd

In [67]:
# Load the model from disk as trained on the GPU box
from keras.models import model_from_json

print("Loading model...")
with open("models/model.json", "r") as f:
    model = model_from_json(f.read())
print("Loading weights...")
model.load_weights("models/weights.h5")
print("Compiling model...")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Loading model...
Loading weights...
Compiling model...


In [75]:
"""
Start with the simplest approach: sum each gene's first-layer weights, sort,
and check whether the genes with the highest sums appear in the COSMIC list.

Summing absolute values may be more appropriate: a large weight of either sign
implies an effect, and signed weights can cancel each other out.
"""
weights = model.layers[2].get_weights()
ranks = np.sum(weights[0], axis=1)
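The question raised in the docstring, signed sum versus sum of absolute values, can be illustrated on a small made-up weight matrix. This is only a sketch: `toy_weights` and the score names are hypothetical and are not taken from the trained model.

```python
import numpy as np

# Hypothetical first-layer weight matrix: 4 input genes x 3 hidden units.
toy_weights = np.array([
    [ 0.9, -0.9,  0.0],   # strong weights of opposite sign
    [ 0.1,  0.1,  0.1],   # weak weights of the same sign
    [-0.5, -0.4, -0.3],   # moderately strong negative weights
    [ 0.0,  0.0,  0.0],   # input that is not used at all
])

# Signed sum: opposite-signed weights cancel, negative weights rank last.
signed_score = np.sum(toy_weights, axis=1)

# Sum of absolute values: measures total weight magnitude regardless of sign.
abs_score = np.sum(np.abs(toy_weights), axis=1)

# Gene indices ordered from highest to lowest score (stable for ties).
signed_order = np.argsort(-signed_score, kind="stable")
abs_order = np.argsort(-abs_score, kind="stable")

print("signed scores:", signed_score, "order:", signed_order)
print("abs scores:   ", abs_score, "order:", abs_order)
```

Under the signed sum, the first gene scores the same as the unused one even though it carries the largest weights, which is the cancellation effect the docstring worries about.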

In [70]:
# Load the features into a dataframe
import h5py
with h5py.File("data/tumor_normal.h5", "r") as f:
    genes = f["genes"][:]

In [77]:
rankings = pd.DataFrame(ranks, index=genes)

In [96]:
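# Gene names ordered from highest to lowest summed weight, converted to unicode strings
r = rankings.sort_values(by=0, ascending=False).index.astype('U').values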

In [95]:
cosmic = pd.read_table("cancer_genes.tsv")["Gene Symbol"].values

In [99]:
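# Number of COSMIC genes present among the model's features
# (a set intersection over all genes, so the ranking order has no effect on this count)
np.intersect1d(r, cosmic).shape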

(589,)
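Because the intersection above covers every feature, it does not yet say whether highly ranked genes are more likely to be in COSMIC. One possible next step, sketched here on made-up data, is to restrict the comparison to the top N genes. The helper `top_n_overlap` and all `toy_*` values are hypothetical and are not part of the notebook or the COSMIC file.

```python
import numpy as np

def top_n_overlap(gene_names, scores, reference_genes, n):
    """Return the genes among the n highest-scoring ones that are also in reference_genes."""
    gene_names = np.asarray(gene_names)
    # Indices ordered from highest to lowest score (stable for ties).
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    top = gene_names[order[:n]]
    return top[np.isin(top, reference_genes)]

# Hypothetical toy data for illustration only.
toy_genes = ["TP53", "GENE_A", "KRAS", "GENE_B", "BRAF", "GENE_C"]
toy_scores = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5]
toy_reference = ["TP53", "KRAS", "BRAF"]

hits = top_n_overlap(toy_genes, toy_scores, toy_reference, n=3)
print(len(hits), "of the top 3 genes are in the reference list:", hits)
```

With the notebook's own variables, the analogous call would presumably take the gene names, `ranks`, and `cosmic`, and the overlap for a given N could then be compared with the fraction expected from the overall 589 matches.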