# Topological features for the NSCLC cohort

In this notebook, we extract the topological features for the scans in the NSCLC cohort.

In [2]:
# Set working directory (change accordingly)
workdir = "/home/robin/Documents/Stanford_VSR/NMI/TDA_Lung_Histology"

import os
import sys
os.chdir(workdir)
sys.path.insert(0, os.path.join(workdir, "Functions"))

In [3]:
# Topological feature extraction
import TDAfeatures as tf

# Handling arrays and data frames
import numpy as np
import pandas as pd

## Extracting the topological features

The topological features themselves are not stored. Instead, we compute them quickly from the stored persistence diagrams, which can also be reused to explore other machine learning models. We start by loading these diagrams.

In [9]:
dgms = {"img_sub":{}, "img_sup":{}, "img_box_sub":{}, "img_box_sup":{}, "point_cloud":{}}

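# Each patient folder holds one .npy file per diagram, named <diagram type>_<dimension>.npy
for patient in os.listdir(os.path.join("Diagram", "NSCLC")):
    for dgm in os.listdir(os.path.join("Diagram", "NSCLC", patient)):

        # Everything before the last underscore is the diagram type, the rest is the dimension
        dgmtype = "_".join(dgm.split("_")[:-1])
        dgmdim = dgm.split("_")[-1].replace(".npy", "")
        dgms[dgmtype].setdefault(patient, {})
        dgms[dgmtype][patient][dgmdim] = np.load(os.path.join("Diagram", "NSCLC", patient, dgm))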
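
The loading loop depends on the file naming convention. Judging from the column names in the output further down, files appear to be named like `img_box_sub_dgm1.npy`; that exact name is an assumption used here for illustration. A minimal sketch of the parsing step in isolation, using a hypothetical helper `parse_diagram_filename`:

```python
def parse_diagram_filename(filename):
    """Split a diagram file name into (diagram type, dimension label).

    Assumes names of the form '<diagram type>_<dimension>.npy',
    e.g. 'img_box_sub_dgm1.npy' (hypothetical example name).
    """
    stem = filename.replace(".npy", "")
    parts = stem.split("_")
    # The last underscore-separated token is the dimension label,
    # everything before it is the diagram type.
    return "_".join(parts[:-1]), parts[-1]


example = parse_diagram_filename("img_box_sub_dgm1.npy")
print(example)
```

Because the diagram type is used directly as a key of `dgms`, a file whose name does not follow this pattern would raise a `KeyError` in the loop above.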

We can now obtain our data of topological feature vectors as follows.

In [13]:
X_top = dict()

for dgm_type in dgms.keys():
    for patient in dgms[dgm_type].keys():

        # One entry per patient, shared across all diagram types
        X_top.setdefault(patient, {})

        for dim in dgms[dgm_type][patient].keys():

            # Prefix each statistic with its diagram type and dimension
            features = tf.persistence_statistics(dgms[dgm_type][patient][dim])
            for key in features.keys():
                new_key = dgm_type + "_" + dim + "_" + key
                X_top[patient][new_key] = features[key]

# Rows are patients, columns are features
X_top = pd.DataFrame(X_top).transpose()
X_top.head(5)

Unnamed: 0,img_sub_dgm2_min_birth,img_sub_dgm2_no_infinite_lifespans,img_sub_dgm2_no_finite_lifespans,img_sub_dgm2_mean_finite_midlifes,img_sub_dgm2_mean_finite_lifespans,img_sub_dgm2_std_finite_midlifes,img_sub_dgm2_std_finite_lifespans,img_sub_dgm2_skew_finite_midlifes,img_sub_dgm2_skew_finite_lifespans,img_sub_dgm2_kurtosis_finite_midlifes,...,point_cloud_dgm2_kurtosis_finite_lifespans,point_cloud_dgm2_median_finite_midlifes,point_cloud_dgm2_median_finite_lifespans,point_cloud_dgm2_Q1_finite_midlifes,point_cloud_dgm2_Q1_finite_lifespans,point_cloud_dgm2_Q3_finite_midlifes,point_cloud_dgm2_Q3_finite_lifespans,point_cloud_dgm2_IQR_finite_midlifes,point_cloud_dgm2_IQR_finite_lifespans,point_cloud_dgm2_entropy_finite_lifespans
R01-003,46.996204,0.0,32.0,106.791064,29.390752,33.305599,26.05182,0.055856,1.434946,-1.061152,...,16.048142,1.319479,0.189469,1.319479,0.189469,1.319479,0.189469,0.0,0.0,1.729933
R01-017,-775.255676,0.0,111.0,-342.155521,69.271735,216.872513,74.89558,0.266959,2.103931,-1.02307,...,6.949115,2.473429,0.204302,2.393835,0.164118,3.980255,0.226933,1.58642,0.062815,1.10487
R01-089,-483.144012,1.0,4799.0,76.944033,17.65317,26.508828,18.402499,-1.985738,5.518228,96.309002,...,-0.666672,5.610717,0.118157,5.400659,0.090228,10.284398,9.102928,4.883739,9.012701,0.060452
R01-090,-606.450928,0.0,7972.0,165.718801,52.90042,50.111947,43.128684,-5.693903,1.235165,63.821045,...,4.122338,6.593709,0.468146,6.318488,0.266154,7.291438,0.582875,0.97295,0.316721,0.410139
R01-035,-511.76001,0.0,498.0,146.155279,46.597896,64.095599,40.14938,-3.791177,1.239231,31.930349,...,13.992937,3.348906,0.166731,3.219434,0.136743,3.515529,0.358425,0.296095,0.221682,1.171994
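
The column names suggest that each diagram is summarized by statistics of its finite lifespans and midlifes. The sketch below illustrates how a few such statistics could be computed. It is not the implementation of `tf.persistence_statistics`: the function name `sketch_persistence_statistics`, the `(birth, death)` array layout, the definitions lifespan = death - birth and midlife = (birth + death) / 2, and the use of the natural logarithm for the entropy are all assumptions.

```python
import numpy as np


def sketch_persistence_statistics(dgm):
    """Illustrative summary statistics of a persistence diagram.

    `dgm` is assumed to be an (n, 2) array of (birth, death) pairs,
    with death = inf for features that never die.
    """
    dgm = np.asarray(dgm, dtype=float).reshape(-1, 2)
    finite = dgm[np.isfinite(dgm[:, 1])]
    lifespans = finite[:, 1] - finite[:, 0]
    midlifes = (finite[:, 0] + finite[:, 1]) / 2

    stats = {
        "no_infinite_lifespans": int(np.sum(np.isinf(dgm[:, 1]))),
        "no_finite_lifespans": int(len(finite)),
    }
    if len(finite) > 0:
        stats["mean_finite_lifespans"] = float(lifespans.mean())
        stats["mean_finite_midlifes"] = float(midlifes.mean())
        # Persistent entropy: Shannon entropy of the normalized positive lifespans
        positive = lifespans[lifespans > 0]
        p = positive / positive.sum()
        stats["entropy_finite_lifespans"] = float(-np.sum(p * np.log(p)))
    return stats


# Toy diagram: two finite features and one that never dies
toy_dgm = np.array([[0.0, 1.0], [0.5, 2.5], [0.0, np.inf]])
toy_stats = sketch_persistence_statistics(toy_dgm)
print(toy_stats)
```

For superlevel diagrams, death values may be smaller than birth values, so the sign convention for lifespans is an additional assumption of this sketch.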


Some columns of our topological features are known in advance to be constant. For example, the filtration constructed from the image with boundary pixels always ends with a single connected component and no higher-dimensional holes. We discard these columns from our topological features. Note that only columns known in advance to be constant are in fact discarded below.

In [14]:
constant_features = []

for col in X_top.columns:
    # A column is constant if it has exactly one distinct value, ignoring NaN
    values = np.unique(X_top[col])
    values = values[~np.isnan(values)]
    if len(values) == 1:
        constant_features.append(col)
        
X_top = X_top.drop(columns=constant_features)

print("Discarded features (with constant values):")
print("\n")
for f in constant_features:
    print(f)

Discarded features (with constant values):


img_box_sub_dgm2_no_infinite_lifespans
img_box_sub_dgm0_no_infinite_lifespans
img_box_sub_dgm1_no_infinite_lifespans
img_box_sup_dgm1_no_infinite_lifespans
img_box_sup_dgm0_no_infinite_lifespans
img_box_sup_dgm2_no_infinite_lifespans
point_cloud_dgm0_min_birth
point_cloud_dgm0_no_infinite_lifespans
point_cloud_dgm1_no_infinite_lifespans
point_cloud_dgm2_no_infinite_lifespans
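
The rule applied above is that a column is discarded when it has exactly one distinct value once NaN entries are ignored. As a sketch, the same rule can be expressed with `DataFrame.nunique`, which ignores NaN by default. The toy data frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the cohort
toy = pd.DataFrame({
    "varies": [1.0, 2.0, 3.0],
    "constant": [0.0, 0.0, 0.0],
    "constant_with_nan": [5.0, np.nan, 5.0],
    "all_nan": [np.nan, np.nan, np.nan],
})

# Number of distinct non-NaN values per column
n_distinct = toy.nunique()

# Columns with exactly one distinct value are constant
constant_cols = n_distinct[n_distinct == 1].index.tolist()
print(constant_cols)

toy_reduced = toy.drop(columns=constant_cols)
```

Under this rule, a column consisting only of NaN has zero distinct values and is therefore kept, which matches the behavior of the loop above.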


Finally, we save the topological feature vectors for our experiments.

In [18]:
X_top.to_csv(os.path.join("Features", "NSCLC", "Topological.csv"))
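
When the saved file is read back in a later notebook, the patient identifiers are stored in the first, unnamed column of the CSV. A minimal sketch of the round trip, using an in-memory buffer and made-up values instead of the real file:

```python
import io

import pandas as pd

# Hypothetical toy features, indexed by patient identifier
toy = pd.DataFrame({"feature_a": [1.0, 2.0]}, index=["R01-003", "R01-017"])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the patient identifiers as the index
restored = pd.read_csv(buffer, index_col=0)
print(restored.index.tolist())
```

Without `index_col=0`, pandas reads the identifiers as an ordinary column and names it `Unnamed: 0`.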