## BME230B Final Project Scoring

Final project scores are computed by the R script BME230_F1score_V2.R from your team's predictions for the expression file data/tcga_mutation_test_unlabeled.h5. This file is a 20% holdout of the TCGA expression data; the remaining 80% is in your data/tcga_mutation_train.h5 file.

Below is a simple example that trains a set of classifiers and then scores the results. For the competition, predict the mutation status and disease for every expression profile in data/tcga_mutation_test_unlabeled.h5 and write out a predictions.tsv file, which we will run through the R script to build a leaderboard.

In [1]:
import numpy as np
import pandas as pd

# So we can use multiple jobs/cores with sklearn
# https://stackoverflow.com/questions/40115043/no-space-left-on-device-error-while-fitting-sklearn-model
%env JOBLIB_TEMP_FOLDER=/tmp

env: JOBLIB_TEMP_FOLDER=/tmp


In [2]:
# Read in our training data and labels
X = pd.read_hdf("data/tcga_mutation_train.h5", "expression")
Y = pd.read_hdf("data/tcga_mutation_train.h5", "labels")

In [3]:
# Prune expression to only KEGG pathway genes.
# Each GMT line is: gene set name, description, then tab-separated gene symbols.
with open("data/c2.cp.kegg.v6.1.symbols.gmt") as f:
    genes_subset = list(set().union(*[line.strip().split("\t")[2:] for line in f]))
# Drop every expression column that is not a KEGG gene
X_pruned = X.drop(labels=(set(X.columns) - set(genes_subset)), axis=1, errors="ignore")

# Encode disease
from sklearn import preprocessing
disease_encoder = preprocessing.LabelEncoder()
disease_encoder.fit(Y["primary.disease.or.tissue"])
Y["disease_encoding"] = disease_encoder.transform(Y["primary.disease.or.tissue"])

# Divide up into train and test
import sklearn.model_selection
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
    X_pruned, Y, test_size=0.20, random_state=42)
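
The KEGG pruning step at the top of this cell relies on the layout of a GMT file: each line holds a gene set name, a description field, and then the member gene symbols, all tab-separated, which is why the code keeps the fields from index 2 onward. As a minimal sketch, the same parsing can be tried on a small in-memory example. The two gene sets and the helper name `read_gmt_genes` are made up for illustration and are not taken from the real KEGG file.

```python
import io

def read_gmt_genes(handle):
    """Return the set of all gene symbols listed in a GMT-formatted file handle."""
    genes = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the set name, fields[1] a description; genes start at index 2
        genes.update(g for g in fields[2:] if g)
    return genes

# Hypothetical two-line GMT content (not real KEGG data)
toy_gmt = io.StringIO(
    "SET_A\thttp://example.org/a\tTP53\tKRAS\n"
    "SET_B\thttp://example.org/b\tKRAS\tBRAF\n"
)
toy_genes = read_gmt_genes(toy_gmt)
print(sorted(toy_genes))
```

Genes that appear in several sets are counted once, which matches the `set().union(...)` call in the cell above.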

In [6]:
%%time
from sklearn.linear_model import LogisticRegression

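# A large C means very weak regularization (C is the inverse regularization strength)
tp53_model = LogisticRegression(C=1e5)
tp53_model.fit(X_train, Y_train.TP53_mutant)
# Note: .score() reports mean accuracy, not the F1 score that the R script computes
print("TP53 Score:", tp53_model.score(X_test, Y_test.TP53_mutant))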

kras_model = LogisticRegression(C=1e5)
kras_model.fit(X_train, Y_train.KRAS_mutant)
print("KRAS Score:", kras_model.score(X_test, Y_test.KRAS_mutant))

braf_model = LogisticRegression(C=1e5)
braf_model.fit(X_train, Y_train.BRAF_mutant)
print("BRAF Score:", braf_model.score(X_test, Y_test.BRAF_mutant))

TP53 Score: 0.7799295774647887
KRAS Score: 0.9448356807511737
BRAF Score: 0.9595070422535211
CPU times: user 2min 34s, sys: 2.64 s, total: 2min 37s
Wall time: 2min 35s
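
The scores printed by this cell are accuracies, while the leaderboard script reports F1 scores, and the two can diverge sharply when one class is rare. The sketch below uses made-up labels (not TCGA data) and a hypothetical helper `binary_f1` to show a classifier that never predicts a mutation: it reaches 90% accuracy but an F1 of 0 for the mutant class. How BME230_F1score_V2.R handles edge cases such as zero true positives is not shown here, so returning 0.0 in that case is an assumption.

```python
def binary_f1(actual, predicted, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0  # assumption: score 0 when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(actual, predicted):
    """Fraction of samples whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Toy labels: 2 mutants out of 20 samples
toy_actual = [1, 1] + [0] * 18
always_negative = [0] * 20      # never predicts a mutation
one_hit = [1, 0] + [0] * 18     # finds one of the two mutants

print(accuracy(toy_actual, always_negative), binary_f1(toy_actual, always_negative))
print(accuracy(toy_actual, one_hit), binary_f1(toy_actual, one_hit))
```

On these toy labels, finding just one of the two mutants raises accuracy only from 0.90 to 0.95 but lifts F1 from 0 to about 0.67, so it may be worth tracking F1 on your own test split rather than relying on `.score()` alone.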


In [9]:
%%time
from sklearn.multiclass import OneVsRestClassifier
disease_model = OneVsRestClassifier(LogisticRegression()).fit(X_train, Y_train["primary.disease.or.tissue"])
print("Disease Score:", disease_model.score(X_test, Y_test["primary.disease.or.tissue"]))

Disease Score: 0.9565727699530516
CPU times: user 5min 28s, sys: 20.5 s, total: 5min 49s
Wall time: 5min 41s


In [11]:
%%time
# Write out predictions and actuals and score
pd.DataFrame({
    "TumorTypePrediction": disease_model.predict(X_test),
    "TP53MutationPrediction": tp53_model.predict(X_test),
    "KRASMutationPrediction": kras_model.predict(X_test),
    "BRAFMutationPrediction": braf_model.predict(X_test),
}).to_csv("test_predictions.tsv", sep="\t")

pd.DataFrame({
    "primary.disease.or.tissue": Y_test["primary.disease.or.tissue"],
    "TP53_mutant": Y_test.TP53_mutant,
    "KRAS_mutant": Y_test.KRAS_mutant,
    "BRAF_mutant": Y_test.BRAF_mutant,
}).to_csv("test_actuals.tsv", sep="\t")

CPU times: user 3.48 s, sys: 6.79 s, total: 10.3 s
Wall time: 1.31 s


In [16]:
# Score our test predictions against actuals
!Rscript class/BME230_F1score_V2.R test_predictions.tsv test_actuals.tsv

[1] "Pheochromocytoma & Paraganglioma_F1_score: 0.96875"
[1] "Cervical & Endocervical Cancer_F1_score: 0.923076923076923"
[1] "Breast Invasive Carcinoma_F1_score: 0.997555012224939"
[1] "Lung Adenocarcinoma_F1_score: 0.951219512195122"
[1] "Lung Squamous Cell Carcinoma_F1_score: 0.91358024691358"
[1] "Colon Adenocarcinoma_F1_score: 0.865979381443299"
[1] "Rectum Adenocarcinoma_F1_score: 0.666666666666667"
[1] "Thyroid Carcinoma_F1_score: 1"
[1] "Kidney Clear Cell Carcinoma_F1_score: 0.976303317535545"
[1] "Esophageal Carcinoma_F1_score: 0.891566265060241"
[1] "Mesothelioma_F1_score: 0.96551724137931"
[1] "Ovarian Serous Cystadenocarcinoma_F1_score: 0.994011976047904"
[1] "Prostate Adenocarcinoma_F1_score: 0.994764397905759"
[1] "Brain Lower Grade Glioma_F1_score: 0.976303317535545"
[1] "Cholangiocarcinoma_F1_score: 0.533333333333333"
[1] "Liver Hepatocellular Carcinoma_F1_score: 0.96551724137931"
[1] "Bladder Urothelial Carcinoma_F1_score: 0.944444444444444"
[1] "Uterine Carcinosarcoma

In [13]:
%%time
# Predict on the holdout set for the competition and write out for the leaderboard
X_holdout = pd.read_hdf("data/tcga_mutation_test_unlabeled.h5", "expression")
X_holdout_pruned = X_holdout.drop(labels=(set(X_holdout.columns) - set(genes_subset)), axis=1, errors="ignore")

pd.DataFrame({
    "TumorTypePrediction": disease_model.predict(X_holdout_pruned),
    "TP53MutationPrediction": tp53_model.predict(X_holdout_pruned),
    "KRASMutationPrediction": kras_model.predict(X_holdout_pruned),
    "BRAFMutationPrediction": braf_model.predict(X_holdout_pruned),
}).to_csv("predictions.tsv", sep="\t")

CPU times: user 4.5 s, sys: 7.61 s, total: 12.1 s
Wall time: 1.94 s
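
Before submitting, it can be worth checking that predictions.tsv contains the four prediction columns written in the cell above. The following is a minimal sketch using only the standard library. The helper name `missing_prediction_columns` is hypothetical, and whether the R script requires exactly these columns or a particular row order is an assumption based on the files written in this notebook.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "TumorTypePrediction",
    "TP53MutationPrediction",
    "KRASMutationPrediction",
    "BRAFMutationPrediction",
]

def missing_prediction_columns(handle):
    """Return the expected column names that are absent from a TSV header."""
    header = next(csv.reader(handle, delimiter="\t"))
    return [name for name in EXPECTED_COLUMNS if name not in header]

# Toy stand-in for a predictions file (made-up row). pandas writes an empty
# first header field for the index column, which is mimicked here.
toy_tsv = io.StringIO(
    "\tTumorTypePrediction\tTP53MutationPrediction\tKRASMutationPrediction\n"
    "0\tThyroid Carcinoma\t0\t0\n"
)
toy_missing = missing_prediction_columns(toy_tsv)
print(toy_missing)
```

To check a real file, open it with `open("predictions.tsv", newline="")` and pass the handle to the same function; an empty list means all four columns are present.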
