# Multi-omics Enabled Sample Mislabeling Correction Challenge

This notebook applies several classifiers to detect mislabeled samples.

Details about this challenge: https://precision.fda.gov/challenges

## Solution

Import libraries

In [717]:
import os
import sys
import getopt
import re
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn import preprocessing

Load data

In [1188]:
labels = pd.read_csv("challenge_data/train_cli.tsv", sep="\t", index_col="sample")
proteins = pd.read_csv("challenge_data/train_pro.tsv", sep="\t")
# Transpose proteins matrix
proteins = proteins.T
misClassified = pd.read_csv("challenge_data/sum_tab_1.csv", sep=",")
# Replace missing values with median
proteins = proteins.fillna(proteins.median())
# Drop remaining columns with missing values
proteins = proteins.dropna(axis='columns')
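
The median fill above is computed over all samples, including those that are later held out for testing. As a hedged sketch of an alternative, the same median imputation can be fitted on the training split only with `SimpleImputer` (already imported above). The toy frame `demo` and its column names are invented for illustration and are not the challenge data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the samples-by-proteins matrix (assumption, not real data)
rng = np.random.RandomState(100)
demo = pd.DataFrame(rng.normal(size=(20, 4)), columns=["p1", "p2", "p3", "p4"])
demo.iloc[::5, 1] = np.nan  # knock out a few values in column p2

demo_train, demo_test = train_test_split(demo, test_size=0.3, random_state=100)

# Learn the per-column medians from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(demo_train),
                            index=demo_train.index, columns=demo_train.columns)
# Reuse the training medians to fill the held-out rows
test_filled = pd.DataFrame(imputer.transform(demo_test),
                           index=demo_test.index, columns=demo_test.columns)

print(imputer.statistics_)
```

With this arrangement the held-out samples never influence the values used to fill gaps in the training data.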

Select only the samples whose labels are known to be correct (`mismatch == 0`) for training

In [1189]:
matches = list(misClassified.query('mismatch==0').loc[:,"sample"])
x = proteins.loc[matches]
y = labels.loc[matches]

Classification function. Any scikit-learn classifier can be supplied.

In [1190]:
def classify(x, y, clf):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=100)
    lb = preprocessing.LabelBinarizer()
    # Separate models are trained for gender and MSI.
    # The encoder is fitted on the training labels and reused for the test labels.
    y_gender_train = lb.fit_transform(y_train.loc[:,"gender"]).ravel()
    y_gender_test = lb.transform(y_test.loc[:,"gender"]).ravel()
    y_msi_train = lb.fit_transform(y_train.loc[:,"msi"]).ravel()
    y_msi_test = lb.transform(y_test.loc[:,"msi"]).ravel()

    clf.fit(x_train, y_gender_train)

    y_gender_predict = clf.predict(x_train)
    print("Gender train accuracy:", accuracy_score(y_gender_train, y_gender_predict))
    # print("Gender train F1:", f1_score(y_gender_train, y_gender_predict))

    y_gender_predict = clf.predict(x_test)
    print("Gender test accuracy:", accuracy_score(y_gender_test, y_gender_predict))
    # print("Gender test F1:", f1_score(y_gender_test, y_gender_predict))

    clf.fit(x_train, y_msi_train)

    y_msi_predict = clf.predict(x_train)
    print("Msi train accuracy:", accuracy_score(y_msi_train, y_msi_predict))
    # print("Msi train F1:", f1_score(y_msi_train, y_msi_predict))

    y_msi_predict = clf.predict(x_test)
    print("Msi test accuracy:", accuracy_score(y_msi_test, y_msi_predict))
    # print("Msi test F1:", f1_score(y_msi_test, y_msi_predict))

Train classifiers.
TODO: Figure out the best parameters for them.
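
One possible way to approach the parameter search is `GridSearchCV`, which is already imported above. The following is only a sketch: the synthetic `x_demo`/`y_demo` data stands in for the protein matrix and one binary label, and the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption, not the challenge data)
x_demo, y_demo = make_classification(n_samples=68, n_features=50,
                                     n_informative=5, random_state=100)

# Hypothetical grid; the ranges would need tuning on the real data
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.001, 0.01]}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, scoring="f1")
search.fit(x_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

Scoring on F1 rather than accuracy is a choice made here because a model that always predicts the more frequent label can still reach a deceptively high accuracy.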


### SVM

* The SVM seems to need a high penalty (`C`); otherwise it assigns the more frequent label (female and low MSI) to everything.

In [1191]:
classify(x, y, SVC(C=100, kernel="rbf", gamma="scale", probability=True))

('Gender train accuracy:', 1.0)
('Gender test accuracy:', 0.6190476190476191)
('Msi train accuracy:', 1.0)
('Msi test accuracy:', 1.0)
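
Since the SVM can fall back to predicting the more frequent label for everything, it may help to compare test accuracy against a majority-class baseline. The sketch below uses `DummyClassifier` on synthetic, imbalanced data; `x_demo`, `y_demo`, and the 65/35 class ratio are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the data (assumption, not the challenge data)
x_demo, y_demo = make_classification(n_samples=68, n_features=50, n_informative=5,
                                     weights=[0.65, 0.35], random_state=100)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=100)

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(x_tr, y_tr)
svm = SVC(C=100, kernel="rbf", gamma="scale").fit(x_tr, y_tr)

baseline_pred = baseline.predict(x_te)
baseline_acc = accuracy_score(y_te, baseline_pred)
svm_acc = accuracy_score(y_te, svm.predict(x_te))

print("Majority-class baseline accuracy:", baseline_acc)
print("SVM test accuracy:", svm_acc)
```

A test accuracy close to the baseline suggests the classifier has learned little beyond the class frequencies.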


### Random Forest

In [1192]:
classify(x, y, RandomForestClassifier(n_estimators = 10))

('Gender train accuracy:', 0.9787234042553191)
('Gender test accuracy:', 0.5238095238095238)
('Msi train accuracy:', 1.0)
('Msi test accuracy:', 0.9523809523809523)


### How to Combine?

MSI seems to be a better indicator than gender. How do we take this into account?

* MSI does not match --> Mismatch label, no matter what gender says
* MSI matching, gender mismatch - what do we do?
* I propose to calculate a confidence score and use it in this case.
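
The rule proposed in the list above could be sketched as follows. The function name `flag_mismatch`, its arguments, and the 0.8 threshold are hypothetical choices made for illustration; `gender_conf` is assumed to be the model's probability for its predicted gender class.

```python
import numpy as np

def flag_mismatch(msi_pred, msi_label, gender_pred, gender_label,
                  gender_conf, conf_threshold=0.8):
    """Return a boolean array marking samples suspected to be mislabeled."""
    msi_pred, msi_label = np.asarray(msi_pred), np.asarray(msi_label)
    gender_pred, gender_label = np.asarray(gender_pred), np.asarray(gender_label)
    gender_conf = np.asarray(gender_conf)

    # Rule 1: an MSI disagreement is flagged regardless of gender
    msi_disagrees = msi_pred != msi_label
    # Rule 2: a gender disagreement counts only if the model is confident
    gender_disagrees = (gender_pred != gender_label) & (gender_conf >= conf_threshold)
    return msi_disagrees | gender_disagrees

# Four made-up samples covering each case
flags = flag_mismatch(
    msi_pred=[1, 1, 0, 0], msi_label=[1, 0, 0, 0],
    gender_pred=[0, 1, 1, 1], gender_label=[0, 1, 0, 0],
    gender_conf=[0.9, 0.6, 0.85, 0.55],
)
print(flags)  # [False  True  True False]
```

The threshold trades missed mismatches against false alarms, so it would need to be chosen on held-out data rather than fixed in advance.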


### Confidence Score
* Instead of reporting just the label, show the model's confidence score. This can help us decide what to do when MSI matches but gender does not.

In [1194]:
clf = SVC(C=100, kernel="rbf", gamma="scale", probability=True)
#clf = RandomForestClassifier(n_estimators=100)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=100)

# Encode labels as 0/1: fit on the training split, reuse the encoding for the test split
lb = preprocessing.LabelBinarizer()
y_gender_train = lb.fit_transform(y_train.loc[:,"gender"]).ravel()
y_gender_test = lb.transform(y_test.loc[:,"gender"]).ravel()
y_msi_train = lb.fit_transform(y_train.loc[:,"msi"]).ravel()
y_msi_test = lb.transform(y_test.loc[:,"msi"]).ravel()

clf.fit(x_train, y_gender_train)
y_gender_predict = clf.predict(x_test)
print("Gender test accuracy:", accuracy_score(y_gender_test, y_gender_predict))
probs = clf.predict_proba(x_test)
for i in range(len(probs)):
    print(probs[i] , y_gender_test[i])
print()

clf.fit(x_train, y_msi_train)
y_msi_predict = clf.predict(x_test)
print("Msi test accuracy:", accuracy_score(y_msi_test, y_msi_predict))
probs = clf.predict_proba(x_test)
for i in range(len(probs)):
    print(probs[i], y_msi_test[i])

('Gender test accuracy:', 0.6190476190476191)
(array([0.66867051, 0.33132949]), 1)
(array([0.68102202, 0.31897798]), 0)
(array([0.67024997, 0.32975003]), 1)
(array([0.67216623, 0.32783377]), 0)
(array([0.62350403, 0.37649597]), 1)
(array([0.47738122, 0.52261878]), 1)
(array([0.80211405, 0.19788595]), 1)
(array([0.6586943, 0.3413057]), 0)
(array([0.65746198, 0.34253802]), 1)
(array([0.72341851, 0.27658149]), 0)
(array([0.76553483, 0.23446517]), 1)
(array([0.66989072, 0.33010928]), 0)
(array([0.79354618, 0.20645382]), 0)
(array([0.65093283, 0.34906717]), 0)
(array([0.72705496, 0.27294504]), 0)
(array([0.64248199, 0.35751801]), 0)
(array([0.8371426, 0.1628574]), 1)
(array([0.65931179, 0.34068821]), 1)
(array([0.74153726, 0.25846274]), 0)
(array([0.56489859, 0.43510141]), 1)
(array([0.73691679, 0.26308321]), 0)
()
('Msi test accuracy:', 1.0)
(array([0.11766586, 0.88233414]), 1)
(array([0.02053533, 0.97946467]), 1)
(array([0.11965383, 0.88034617]), 1)
(array([0.05851661, 0.94148339]), 1)
(a