# Topological features for lung tumor type classification (SF/PA cohort)

In this notebook, we will study the use of topological features for lung tumor type classification. We start by setting the working directory and importing the required libraries.

In [1]:
# Set working directory (change accordingly)
workdir = "/home/robin/Documents/Stanford_VSR/NMI/Code"

import os
import sys
os.chdir(workdir)
sys.path.insert(0, os.path.join(workdir, "Functions"))

In [2]:
# Topological feature extraction
import TDAfeatures as tf

# Our custom functions to conduct our feature comparison experiments
import CV_SF_PA
from CV_SF_PA import run_exps_class

# Handling arrays and data frames
import numpy as np
import pandas as pd 

# Obtaining p-values
import scipy.stats

## Loading the data

We start by loading the metadata (including the annotated tumor type), the radiomic features, and the topological features.

### Loading the tumor types

In [3]:
meta = pd.read_excel(os.path.join("Features", "SF-PA", "Meta.xlsx"), index_col=0)
meta = meta.loc[[p for p in meta.index if p.startswith("PA") or p.startswith("SF")],:]
meta.loc[:,["Diagnosis"]].head(5)

Patient,Diagnosis
PA_001,Benign
PA_002,Benign
PA_003,Benign
PA_004,Benign
PA_006,Benign


### Loading the radiomic features

In [4]:
X_rad_PA = pd.read_csv(os.path.join("Features", "SF-PA", "Radiomic-PA.csv"), index_col=0)
X_rad_SF = pd.read_csv(os.path.join("Features", "SF-PA", "Radiomic-SF.csv"), index_col=0)

X_rad =  pd.concat([X_rad_PA, X_rad_SF], axis=0)

del X_rad_PA, X_rad_SF

X_rad.head(5)

Unnamed: 0,original_shape_Elongation,original_shape_Flatness,original_shape_LeastAxisLength,original_shape_MajorAxisLength,original_shape_Maximum2DDiameterColumn,original_shape_Maximum2DDiameterRow,original_shape_Maximum2DDiameterSlice,original_shape_Maximum3DDiameter,original_shape_MeshVolume,original_shape_MinorAxisLength,...,original_gldm_LargeDependenceLowGrayLevelEmphasis,original_gldm_LowGrayLevelEmphasis,original_gldm_SmallDependenceEmphasis,original_gldm_SmallDependenceHighGrayLevelEmphasis,original_gldm_SmallDependenceLowGrayLevelEmphasis,original_ngtdm_Busyness,original_ngtdm_Coarseness,original_ngtdm_Complexity,original_ngtdm_Contrast,original_ngtdm_Strength
PA_001,0.605672,0.434971,16.92444,38.909364,46.360721,30.162321,48.324465,51.338529,5546.264648,23.566325,...,0.033035,0.002518,0.307242,195.264436,0.00115,0.439791,0.001385,2049.893186,0.12536,1.339161
PA_002,0.629704,0.576293,5.730105,9.943039,9.656916,10.339174,9.583878,11.786427,225.80844,6.261175,...,0.052424,0.010117,0.354831,69.689027,0.007439,0.147176,0.019231,559.154932,0.44683,3.649947
PA_003,0.921953,0.736933,21.59944,29.309906,33.814821,35.744093,33.186376,37.770528,12025.904109,27.022362,...,0.096633,0.000953,0.159743,159.668231,0.000302,0.456699,0.000512,3964.548672,0.015585,3.700103
PA_006,0.730354,0.598227,5.912898,9.884036,11.319231,9.519716,11.288995,11.473475,214.920044,7.218842,...,0.075388,0.022753,0.468849,229.607131,0.01464,0.048106,0.022808,3464.115534,1.099773,15.651324
PA_007,0.769291,0.737189,9.945531,13.491164,14.419092,15.422968,13.356241,15.836129,999.739437,10.378631,...,0.124417,0.005763,0.274149,67.33527,0.003903,0.31732,0.004203,829.156209,0.24985,2.335415


### Extracting the topological features

The topological features are not stored. Instead, we compute them quickly from the stored persistence diagrams, which can also be used to explore other machine learning models. We start by loading these diagrams.

In [5]:
dgms = {"img_sub":{}, "img_sup":{}, "img_box_sub":{}, "img_box_sup":{}, "point_cloud":{}}

for patient in os.listdir(os.path.join("Diagram", "SF-PA")):
    
    if len(os.listdir(os.path.join("Diagram", "SF-PA", patient))) != 15:
        continue
    
    for dgm in os.listdir(os.path.join("Diagram", "SF-PA", patient)):
        
        dgmtype = "_".join(dgm.split("_")[:-1])
        dgmdim = dgm.split("_")[-1].replace(".npy", "")
        dgms[dgmtype].setdefault(patient, {})
        dgms[dgmtype][patient][dgmdim] = np.load(os.path.join("Diagram", "SF-PA", patient, dgm))
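
The parsing in the loop above depends on a file naming convention. As a small sketch, assuming files are named like `img_box_sub_dgm0.npy` (an assumption inferred from the feature names that appear later in this notebook, not confirmed by the stored data), the split into diagram type and dimension works as follows. The helper name `parse_diagram_filename` is hypothetical.

```python
def parse_diagram_filename(filename):
    """Split a diagram file name into (diagram type, dimension label).

    Hypothetical helper mirroring the string handling in the loop above.
    """
    stem = filename.replace(".npy", "")
    parts = stem.split("_")
    # Everything before the last underscore is the type, the rest the dimension.
    return "_".join(parts[:-1]), parts[-1]


# File names are assumed for illustration only.
for name in ["img_sub_dgm0.npy", "img_box_sup_dgm2.npy", "point_cloud_dgm1.npy"]:
    print(name, "->", parse_diagram_filename(name))
```

With five diagram types and three dimensions each, this convention would give the 15 files per patient that the loop checks for.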

We can now obtain our data of topological feature vectors as follows.

In [6]:
X_top = dict()

for dgm_type in dgms.keys():
    for patient in dgms[dgm_type].keys():
        
        X_top.setdefault(patient, {})
        
        for dim in dgms[dgm_type][patient].keys():
            
            features = tf.persistence_statistics(dgms[dgm_type][patient][dim])
            for key in features.keys():
                new_key = dgm_type + "_" + dim + "_" + key
                X_top[patient][new_key] = features[key]
            
            del features

X_top = pd.DataFrame(X_top).transpose()            
X_top.head(5)

Unnamed: 0,img_sub_dgm2_min_birth,img_sub_dgm2_no_infinite_lifespans,img_sub_dgm2_no_finite_lifespans,img_sub_dgm2_mean_finite_midlifes,img_sub_dgm2_mean_finite_lifespans,img_sub_dgm2_std_finite_midlifes,img_sub_dgm2_std_finite_lifespans,img_sub_dgm2_skew_finite_midlifes,img_sub_dgm2_skew_finite_lifespans,img_sub_dgm2_kurtosis_finite_midlifes,...,point_cloud_dgm2_kurtosis_finite_lifespans,point_cloud_dgm2_median_finite_midlifes,point_cloud_dgm2_median_finite_lifespans,point_cloud_dgm2_Q1_finite_midlifes,point_cloud_dgm2_Q1_finite_lifespans,point_cloud_dgm2_Q3_finite_midlifes,point_cloud_dgm2_Q3_finite_lifespans,point_cloud_dgm2_IQR_finite_midlifes,point_cloud_dgm2_IQR_finite_lifespans,point_cloud_dgm2_entropy_finite_lifespans
SF_138,-37.0,0.0,202.0,105.289604,16.20297,34.440457,16.414899,0.002625,2.224121,3.034862,...,22.474623,3.448817,0.179362,3.038327,0.069967,4.079365,0.292047,1.041038,0.22208,1.887704
SF_099,-564.0,8.0,762.0,64.19685,13.296588,32.72394,11.384417,-8.54239,1.542019,152.562225,...,13.750821,5.319809,0.264368,4.903261,0.102028,6.785983,0.582593,1.882722,0.480565,3.946821
PA_111,-22.0,1.0,479.0,36.387265,8.987474,15.33903,8.033324,-0.191061,1.562512,-0.125307,...,5.104158,3.479384,0.104443,3.258472,0.050184,3.764763,0.212081,0.50629,0.161897,0.394692
SF_140,-937.0,97.0,13212.0,14.552452,31.587345,191.644702,28.295495,-2.821279,2.119256,6.93224,...,104.28787,10.681124,0.676721,10.027158,0.313158,11.217973,1.058532,1.190815,0.745374,4.530362
PA_085,-979.0,0.0,11397.0,-67.292402,20.020444,312.422884,29.462102,-1.173713,4.354624,0.469028,...,-2.0,47.387635,25.49472,41.135733,12.807174,53.639536,38.182265,12.503803,25.375091,0.016549
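
To give an idea of what the columns above represent, here is a minimal sketch of how a few of these statistics could be computed from one diagram. It is not the implementation in `TDAfeatures`: the function name `persistence_summary` is hypothetical, and it assumes that a diagram is an `(n, 2)` array of (birth, death) pairs with `death >= birth` and `np.inf` marking an infinite lifespan.

```python
import numpy as np


def persistence_summary(dgm):
    """Summarize one persistence diagram given as (birth, death) pairs.

    Hypothetical stand-in for tf.persistence_statistics; the real function
    may use different definitions and returns more statistics.
    """
    dgm = np.asarray(dgm, dtype=float).reshape(-1, 2)
    finite = dgm[np.isfinite(dgm[:, 1])]
    lifespans = finite[:, 1] - finite[:, 0]
    midlifes = (finite[:, 0] + finite[:, 1]) / 2

    stats = {
        "min_birth": dgm[:, 0].min() if len(dgm) else np.nan,
        "no_infinite_lifespans": int(np.sum(np.isinf(dgm[:, 1]))),
        "no_finite_lifespans": len(finite),
        "mean_finite_lifespans": lifespans.mean() if len(finite) else np.nan,
        "mean_finite_midlifes": midlifes.mean() if len(finite) else np.nan,
    }

    # Persistent entropy: Shannon entropy of the normalized positive lifespans.
    positive = lifespans[lifespans > 0]
    if len(positive):
        p = positive / positive.sum()
        stats["entropy_finite_lifespans"] = float(-np.sum(p * np.log(p)))
    else:
        stats["entropy_finite_lifespans"] = np.nan
    return stats


# Toy diagram: two finite points and one point that never dies.
toy_dgm = np.array([[0.0, 2.0], [1.0, 3.0], [0.5, np.inf]])
print(persistence_summary(toy_dgm))
```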


Some columns of our topological features are known in advance to be constant. For example, the filtration constructed from the image with boundary pixels always ends with a single connected component and no higher-dimensional holes. We discard these columns from our topological features. Note that the columns discarded below are exactly those we know in advance to be constant.

In [7]:
constant_features = []

for col in X_top.columns:
    values = np.unique(X_top[col])
    values = values[~np.isnan(values)]
    if len(values) == 1:
        constant_features.append(col)
        
X_top = X_top.drop(columns=constant_features)

print("Discarded features (with constant values):")
print("\n")
for f in constant_features:
    print(f)

Discarded features (with constant values):


img_box_sub_dgm2_no_infinite_lifespans
img_box_sub_dgm0_no_infinite_lifespans
img_box_sub_dgm1_no_infinite_lifespans
img_box_sup_dgm1_no_infinite_lifespans
img_box_sup_dgm0_no_infinite_lifespans
img_box_sup_dgm2_no_infinite_lifespans
point_cloud_dgm0_min_birth
point_cloud_dgm0_no_infinite_lifespans
point_cloud_dgm1_no_infinite_lifespans
point_cloud_dgm2_no_infinite_lifespans
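
The same check can be written more compactly with `DataFrame.nunique`, which ignores missing values by default. The sketch below uses a made-up toy frame, not the study data.

```python
import numpy as np
import pandas as pd

# Toy frame with one varying column and two constant ones.
toy = pd.DataFrame({
    "varies": [1.0, 2.0, 3.0],
    "constant": [0.0, 0.0, 0.0],
    "constant_with_nan": [5.0, np.nan, 5.0],
})

# nunique ignores NaN by default, so a column counts as constant when
# exactly one distinct non-missing value remains.
n_distinct = toy.nunique()
constant_cols = n_distinct[n_distinct == 1].index.tolist()
print(constant_cols)  # ['constant', 'constant_with_nan']

toy = toy.drop(columns=constant_cols)
```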


### Final preparations

Finally, we ensure that the indexing of all dataframes is consistent.

In [8]:
missing_rad = np.setdiff1d(meta.index, X_rad.index)
missing_top = np.setdiff1d(meta.index, X_top.index)
patients = np.setdiff1d(meta.index, np.concatenate([missing_rad, missing_top]))

print("Number of patients: " + str(len(patients)) +
      " (" + str(len(meta.index) - len(patients)) + " omitted from study)")
print("Number of radiomic features: " + str(X_rad.shape[1]) + 
      " (missing rows for " + str(len(missing_rad)) + " patients)")
print("Number of topological features: " + str(X_top.shape[1]) 
      + " (missing rows for " + str(len(missing_top)) + " patients)")

meta = meta.loc[patients,:]
X_rad =  X_rad.loc[patients,:]
X_top = X_top.loc[patients,:]

Number of patients: 165 (27 omitted from study)
Number of radiomic features: 105 (missing rows for 27 patients)
Number of topological features: 290 (missing rows for 2 patients)
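
An equivalent way to align the tables is to intersect their indices directly. The following sketch uses small made-up frames (all names ending in `_toy` are hypothetical) and keeps only patients present in every table, in sorted order.

```python
import pandas as pd

meta_toy = pd.DataFrame({"Diagnosis": ["Benign", "Adenocarcinoma", "Benign"]},
                        index=["PA_001", "PA_002", "SF_001"])
rad_toy = pd.DataFrame({"f_rad": [0.1, 0.2]}, index=["PA_001", "SF_001"])
top_toy = pd.DataFrame({"f_top": [1.0, 2.0, 3.0]},
                       index=["SF_001", "PA_002", "PA_001"])

# Patients present in all three tables, sorted for a reproducible row order.
common = (meta_toy.index
          .intersection(rad_toy.index)
          .intersection(top_toy.index)
          .sort_values())

# Reindex every table with the same patient list.
meta_toy, rad_toy, top_toy = (df.loc[common] for df in (meta_toy, rad_toy, top_toy))

print(list(common))  # ['PA_001', 'SF_001']
```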


## Analysis of our outcome by contrast

Below we analyse the frequency of tumor types in our data. These will be used as our dependent variables. Note that some classes are overlapping, e.g., adenocarcinoma is a more specific case of malignant. We will also separate our analyses below for scans obtained with and without contrast material injected into the bloodstream.

In [9]:
all_classes = {"benign": dict(), "malignant": dict(), 
               "small cell": dict(), "non-small cell": dict(),
               "adeno": dict(), "squamous": dict(),
               "Total": dict()}
with_contrast = [p for p in meta.index if meta.loc[p, "Contrast"] == "Y"]
without_contrast = [p for p in meta.index if meta.loc[p, "Contrast"] == "N"]

all_classes["benign"]["Y"] = np.sum(meta.loc[with_contrast, "Diagnosis"] == "Benign")
all_classes["benign"]["N"] = np.sum(meta.loc[without_contrast, "Diagnosis"] == "Benign")
all_classes["malignant"]["Y"] = np.sum(meta.loc[with_contrast, "Diagnosis"] != "Benign")
all_classes["malignant"]["N"] = np.sum(meta.loc[without_contrast, "Diagnosis"] != "Benign")

all_classes["small cell"]["Y"] = np.sum(meta.loc[with_contrast, "Diagnosis"] == "Small cell lung cancer")
all_classes["small cell"]["N"] = np.sum(meta.loc[without_contrast, "Diagnosis"] == "Small cell lung cancer")
all_classes["non-small cell"]["Y"] = np.sum(~meta.loc[with_contrast, "Diagnosis"].isin(["Benign", "Small cell lung cancer"]))
all_classes["non-small cell"]["N"] = np.sum(~meta.loc[without_contrast, "Diagnosis"].isin(["Benign", "Small cell lung cancer"]))

all_classes["adeno"]["Y"] = np.sum(meta.loc[with_contrast, "Diagnosis"] == "Adenocarcinoma")
all_classes["adeno"]["N"] = np.sum(meta.loc[without_contrast, "Diagnosis"] == "Adenocarcinoma")
all_classes["squamous"]["Y"] = np.sum(meta.loc[with_contrast, "Diagnosis"] == "Squamous cell carcinoma")
all_classes["squamous"]["N"] = np.sum(meta.loc[without_contrast, "Diagnosis"] == "Squamous cell carcinoma")

all_classes["Total"]["Y"] = len(with_contrast)
all_classes["Total"]["N"] = len(without_contrast)

all_classes = pd.DataFrame(all_classes).transpose()
all_classes["Total"] = all_classes["Y"] + all_classes["N"]

all_classes

Unnamed: 0,Y,N,Total
benign,22,63,85
malignant,33,47,80
small cell,17,10,27
non-small cell,16,37,53
adeno,11,20,31
squamous,5,15,20
Total,55,110,165
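
For the non-overlapping diagnoses, a similar table can be obtained in one call with `pd.crosstab`. The sketch below runs on made-up data; overlapping classes such as "malignant" are then sums over several rows.

```python
import pandas as pd

# Made-up metadata, for illustration only.
toy_meta = pd.DataFrame({
    "Diagnosis": ["Benign", "Benign", "Adenocarcinoma",
                  "Small cell lung cancer", "Squamous cell carcinoma"],
    "Contrast": ["Y", "N", "N", "Y", "N"],
})

# Counts per diagnosis and contrast status, with row and column totals.
counts = pd.crosstab(toy_meta["Diagnosis"], toy_meta["Contrast"],
                     margins=True, margins_name="Total")
print(counts)

# "Malignant" is every diagnosis except benign.
malignant = counts.drop(index=["Benign", "Total"]).sum()
print(malignant)
```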


## Evaluating features for tumor type prediction

We now compare the effectiveness of radiomic features, topological features, and radiomic + topological features (combined through concatenation or ensemble models) for the binary classification problems that can be considered within our data. These correspond to the main distinctions between lung tumor types. All scores are evaluated through the ROC AUC. The following function will be used to derive binary outcomes.

In [10]:
def get_patients_with_outcome(I, class0, class1):
    
    new_I = [patient for patient in I if meta.loc[patient, "Diagnosis"] in class0 + class1]
    y = np.array([0 if meta.loc[patient, "Diagnosis"] in class0 else 1 for patient in new_I]).astype("int")
    return new_I, y
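
As a usage sketch, the variant below takes the metadata as an argument instead of reading the global `meta`, so it can be run on a toy table. The name `select_binary_outcome` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def select_binary_outcome(meta, patients, class0, class1):
    """Variant of the helper above that takes `meta` as an argument
    (hypothetical name; the notebook's version reads the global `meta`)."""
    keep = [p for p in patients if meta.loc[p, "Diagnosis"] in class0 + class1]
    y = np.array([0 if meta.loc[p, "Diagnosis"] in class0 else 1 for p in keep])
    return keep, y


toy_meta = pd.DataFrame(
    {"Diagnosis": ["Benign", "Adenocarcinoma", "Squamous cell carcinoma", "Benign"]},
    index=["PA_001", "PA_002", "SF_001", "SF_002"],
)

# Adeno vs. squamous: benign patients belong to neither class and are dropped.
keep, y = select_binary_outcome(toy_meta, toy_meta.index,
                                ["Adenocarcinoma"], ["Squamous cell carcinoma"])
print(keep, y)  # ['PA_002', 'SF_001'] [0 1]
```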

A dictionary will be used to distinguish between categories of features that should be evaluated.

In [11]:
X_dict = {"rad": X_rad, "top": X_top}
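
How the entries of `X_dict` are combined is implemented in `run_exps_class`, which is not shown in this notebook. As a rough sketch only, assuming that soft voting means averaging the predicted class-1 probabilities of one model per feature set, the combination and the ROC AUC could look as follows. The probabilities are made up, and `roc_auc` is a hypothetical helper based on the rank-sum identity.

```python
import numpy as np
from scipy.stats import rankdata


def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity."""
    y_true = np.asarray(y_true)
    ranks = rankdata(scores)  # average ranks handle ties
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


y_toy = np.array([0, 0, 1, 1])

# Made-up class-1 probabilities from two models, one per feature set.
proba = {"rad": np.array([0.2, 0.6, 0.3, 0.9]),
         "top": np.array([0.1, 0.3, 0.7, 0.6])}

# Soft voting (as assumed here): average the predicted probabilities.
proba["vote_soft"] = np.mean([proba["rad"], proba["top"]], axis=0)

for name, p in proba.items():
    print(name, roc_auc(y_toy, p))
```

Concatenation, by contrast, would presumably join the columns of both feature tables before fitting a single model.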

Finally, the function below will be used to apply coloring to columns for visualization.

In [12]:
def grey(s):
    color = "lightgrey"
    return "background-color: %s" % color

### Comparing features for Benign vs. Malignant classification

We compare the effectiveness of radiomic features, topological features, and radiomic + topological features, for **'benign'** vs. **'non-benign'** class prediction.

In [13]:
class0 = ["Benign"]
class1 = ["Small cell lung cancer", "Non-small cell lung cancer", "Adenocarcinoma", "Squamous cell carcinoma"]

#### Images with contrast 

In [14]:
I, y = get_patients_with_outcome(with_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Benign'], class1 = ['Small cell lung cancer', 'Non-small cell lung cancer', 'Adenocarcinoma', 'Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 5.660084124885129e-05


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std,stack score,stack std
LR,0.867048,0.086591,0.874571,0.109413,0.871524,0.107239,0.889381,0.091627,0.888524,0.09184
RF,0.856738,0.119588,0.879262,0.11009,0.874786,0.116265,0.888071,0.113846,0.866119,0.115742
KNN,0.839143,0.115059,0.871786,0.106304,0.878333,0.099574,0.879381,0.105863,0.83719,0.105378
SVM,0.84419,0.099208,0.849881,0.115988,0.84419,0.116707,0.88019,0.095443,0.870095,0.095538
GNB,0.848571,0.10317,0.858119,0.118772,0.867071,0.121429,0.876476,0.09479,0.872429,0.097959
XGB,0.819,0.128032,0.876095,0.122936,0.866286,0.134445,0.865548,0.116214,0.815095,0.145852
mean,0.845782,0.110499,0.868286,0.114557,0.867032,0.116978,0.879841,0.103713,0.858242,0.112958
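
The p-value printed above is obtained by converting a two-sided t-test p-value into a one-sided one. As a cross-check, and assuming SciPy 1.6 or later (where `ttest_ind` accepts an `alternative` argument), the same one-sided p-value can be requested directly. The scores below are randomly generated stand-ins, not results from this study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up cross-validation scores, for illustration only.
scores_rad = rng.normal(0.80, 0.05, size=30)
scores_vote = rng.normal(0.84, 0.05, size=30)

# Manual conversion of the two-sided p-value, as done in this notebook.
p_two_sided = stats.ttest_ind(scores_rad, scores_vote)[1]
if scores_vote.mean() > scores_rad.mean():
    p_manual = p_two_sided / 2
else:
    p_manual = 1 - p_two_sided / 2

# Direct one-sided test: the alternative is mean(rad) < mean(vote).
p_direct = stats.ttest_ind(scores_rad, scores_vote, alternative="less")[1]

print(p_manual, p_direct)
```

Both routes test the null hypothesis that the radiomic scores are at least as high as the soft-voting scores, which is what the label `rad >= vote` in the printed message refers to.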


#### Images without contrast 

In [15]:
I, y = get_patients_with_outcome(without_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Benign'], class1 = ['Small cell lung cancer', 'Non-small cell lung cancer', 'Adenocarcinoma', 'Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 1.030755413755527e-08


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std,stack score,stack std
LR,0.755774,0.084026,0.781974,0.096893,0.785895,0.096132,0.801132,0.08801,0.796214,0.089507
RF,0.732447,0.087133,0.752541,0.097828,0.76491,0.098194,0.786241,0.085062,0.695226,0.094462
KNN,0.739778,0.102338,0.757682,0.094093,0.751985,0.103954,0.79641,0.092216,0.731641,0.111014
SVM,0.750936,0.087415,0.76863,0.092457,0.773588,0.098372,0.796402,0.085676,0.790333,0.086154
GNB,0.780519,0.09065,0.749485,0.104436,0.752342,0.105587,0.778325,0.093047,0.77,0.104795
XGB,0.71487,0.094501,0.756983,0.096535,0.763015,0.098744,0.771609,0.091422,0.659942,0.108085
mean,0.74572,0.093468,0.761216,0.097737,0.765289,0.10092,0.788353,0.089926,0.740559,0.111371


### Comparing features for Small vs. Non-small classification

We compare the effectiveness of radiomic features, topological features, and radiomic + topological features, for **'small cell'** vs. **'non-small cell'** class prediction.

In [16]:
class0 = ["Small cell lung cancer"]
class1 = ["Non-small cell lung cancer", "Adenocarcinoma", "Squamous cell carcinoma"]

#### Images with contrast 

In [17]:
I, y = get_patients_with_outcome(with_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Small cell lung cancer'], class1 = ['Non-small cell lung cancer', 'Adenocarcinoma', 'Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 0.9447426989966621


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std,stack score,stack std
LR,0.797778,0.180708,0.625833,0.196701,0.621111,0.208667,0.756667,0.196463,0.744444,0.210819
RF,0.783889,0.188422,0.631667,0.196305,0.7375,0.177131,0.760278,0.194103,0.771944,0.199905
KNN,0.768333,0.173482,0.642222,0.195129,0.668333,0.192943,0.772222,0.189032,0.748333,0.191156
SVM,0.798333,0.181201,0.625556,0.183585,0.608333,0.207907,0.750556,0.19488,0.690556,0.256701
GNB,0.736667,0.205667,0.625278,0.183701,0.614722,0.183876,0.728333,0.191025,0.649444,0.264758
XGB,0.763611,0.171807,0.613889,0.220217,0.718333,0.205969,0.733333,0.171121,0.696667,0.213046
mean,0.774769,0.185153,0.627407,0.196502,0.661389,0.20302,0.750231,0.19024,0.716898,0.228312


#### Images without contrast 

In [18]:
I, y = get_patients_with_outcome(without_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Small cell lung cancer'], class1 = ['Non-small cell lung cancer', 'Adenocarcinoma', 'Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 0.0395000654001305


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std,stack score,stack std
LR,0.803929,0.214895,0.797768,0.192174,0.816071,0.2,0.825893,0.1972,0.81625,0.201929
RF,0.838482,0.188954,0.816607,0.192662,0.843661,0.194384,0.868393,0.173123,0.837679,0.176064
KNN,0.791071,0.184305,0.782946,0.190279,0.812946,0.186034,0.835893,0.192693,0.717857,0.239542
SVM,0.779107,0.214366,0.773482,0.202446,0.807679,0.21769,0.814286,0.207873,0.66125,0.299438
GNB,0.822679,0.196479,0.759464,0.242278,0.772143,0.2116,0.820982,0.191969,0.828571,0.197792
XGB,0.801696,0.201375,0.787857,0.223154,0.803393,0.201857,0.840893,0.174931,0.695268,0.27799
mean,0.806161,0.201355,0.786354,0.20884,0.809315,0.20329,0.83439,0.190836,0.759479,0.246588


### Comparing features for Adeno vs. Squamous classification

We compare the effectiveness of radiomic features, topological features, and radiomic + topological features, for **'adenocarcinoma'** vs. **'squamous cell carcinoma'** class prediction.

In [19]:
class0 = ["Adenocarcinoma"]
class1 = ["Squamous cell carcinoma"]

#### Images with contrast 

In [20]:
I, y = get_patients_with_outcome(with_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Adenocarcinoma'], class1 = ['Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 1.2420945946508668e-17


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std
LR,0.801667,0.280322,0.973333,0.10729,0.973333,0.10729,0.963333,0.160866
RF,0.668333,0.375829,0.983333,0.083333,0.983333,0.083333,0.91,0.219114
KNN,0.545,0.258849,0.896667,0.215742,0.871667,0.216801,0.86,0.250799
SVM,0.685,0.332202,0.943333,0.184421,0.97,0.118743,0.966667,0.152753
GNB,0.655,0.328722,0.708333,0.271442,0.655,0.295339,0.73,0.341793
XGB,0.678333,0.369688,0.968333,0.111667,0.951667,0.143459,0.868333,0.234574
mean,0.672222,0.335502,0.912222,0.199951,0.900833,0.211376,0.883056,0.24843


#### Images without contrast 

In [21]:
I, y = get_patients_with_outcome(without_contrast, class0, class1)

print("Calculating performances for tumor class prediction (class0 = {}, class1 = {})\n".format(class0, class1))

results, classifiers, results_summary = run_exps_class(X_dict={key:X_dict[key].loc[I,] for key in X_dict.keys()}, 
                                                       y=y, random_state=42)

p = scipy.stats.ttest_ind(results.loc[np.where(results["type"] == "rad")[0], "score"], 
                          results.loc[np.where(results["type"] == "vote_soft")[0], "score"])[1]
if results_summary.loc["mean", "vote_soft score"] > results_summary.loc["mean", "rad score"]:
    p /= 2
else:
    p = 1 - p/2

print("\np-value for difference in performance (rad >= vote): " + str(p))

results_summary.style.highlight_max(color="lightgreen", axis=1)\
    .applymap(grey, subset=[c for c in results_summary.columns if c.endswith("std")])

Calculating performances for tumor class prediction (class0 = ['Adenocarcinoma'], class1 = ['Squamous cell carcinoma'])

Calculating performances for model: LR
Calculating performances for model: RF
Calculating performances for model: KNN
Calculating performances for model: SVM
Calculating performances for model: GNB
Calculating performances for model: XGB

p-value for difference in performance (rad >= vote): 3.759765833643999e-05


Unnamed: 0,rad score,rad std,top score,top std,concat score,concat std,vote_soft score,vote_soft std,stack score,stack std
LR,0.629167,0.223646,0.719167,0.169109,0.706667,0.178333,0.720833,0.185639,0.699167,0.179368
RF,0.638333,0.255419,0.678333,0.139453,0.668333,0.176376,0.705833,0.158204,0.605833,0.231272
KNN,0.6425,0.194024,0.71,0.15365,0.698333,0.16278,0.741667,0.164359,0.679167,0.179554
SVM,0.659167,0.219799,0.714167,0.172605,0.716667,0.169353,0.715833,0.177108,0.6825,0.196301
GNB,0.7,0.22314,0.696667,0.169362,0.695,0.163605,0.75,0.217147,0.721667,0.216256
XGB,0.588333,0.244898,0.684167,0.141689,0.645833,0.166302,0.635,0.200444,0.518333,0.209689
mean,0.642917,0.230106,0.700417,0.158957,0.688472,0.17127,0.711528,0.188673,0.651111,0.214441
