This notebook explores radiomic features extracted with [pyradiomics](https://github.com/Radiomics/pyradiomics) from the CT scans of 813 lung cancer patients at Cleveland Clinic.

In [1]:
import pandas as pd
import numpy as np

In [2]:
features = pd.read_csv('results_4_with_names.csv')
features.set_index('Unnamed: 0', inplace=True)
features.index.name = 'Patient_ID'

In [7]:
features.shape

(813, 128)

*813 patients and 128 columns for each patient. 21 of these columns are 'diagnostic' columns, which are discussed below. That leaves 107 features for each patient.*
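The split between diagnostic columns and feature columns can be made by column-name prefix. A minimal sketch, using a small toy frame in place of `features` (the column names are taken from this notebook, but the shape values are placeholders and the frame is not the real data):

```python
import pandas as pd

# Toy stand-in for `features`: a few real column names, placeholder values.
toy = pd.DataFrame(
    {
        "diagnostics_Versions_Python": ["3.6.2", "3.6.2"],
        "diagnostics_Mask-original_VoxelNum": [706, 1153],
        "original_shape_Elongation": [0.5, 0.7],   # placeholder values
        "original_shape_Flatness": [0.3, 0.4],     # placeholder values
    },
    index=pd.Index(["P0002", "P0003"], name="Patient_ID"),
)

# Columns starting with 'diagnostics_' are bookkeeping; the rest are radiomic features.
is_diagnostic = toy.columns.str.startswith("diagnostics_")
diagnostic_cols = toy.columns[is_diagnostic]
feature_cols = toy.columns[~is_diagnostic]

print(len(diagnostic_cols), len(feature_cols))
```

Applied to the real `features` frame, the same two masks should give 21 diagnostic columns and 107 feature columns.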

#### The features are as follows:

In [3]:
list(features.columns)

['diagnostics_Versions_PyRadiomics',
 'diagnostics_Versions_Numpy',
 'diagnostics_Versions_SimpleITK',
 'diagnostics_Versions_PyWavelet',
 'diagnostics_Versions_Python',
 'diagnostics_Configuration_Settings',
 'diagnostics_Configuration_EnabledImageTypes',
 'diagnostics_Image-original_Hash',
 'diagnostics_Image-original_Spacing',
 'diagnostics_Image-original_Size',
 'diagnostics_Image-original_Mean',
 'diagnostics_Image-original_Minimum',
 'diagnostics_Image-original_Maximum',
 'diagnostics_Mask-original_Hash',
 'diagnostics_Mask-original_Spacing',
 'diagnostics_Mask-original_Size',
 'diagnostics_Mask-original_BoundingBox',
 'diagnostics_Mask-original_VoxelNum',
 'diagnostics_Mask-original_VolumeNum',
 'diagnostics_Mask-original_CenterOfMassIndex',
 'diagnostics_Mask-original_CenterOfMass',
 'original_shape_Elongation',
 'original_shape_Flatness',
 'original_shape_LeastAxisLength',
 'original_shape_MajorAxisLength',
 'original_shape_Maximum2DDiameterColumn',
 'original_shape_Maximum2DD

### Counting the Features in the Different Categories:

In [4]:
from collections import Counter

# Group the columns by their first two underscore-separated tokens, e.g. 'original_glcm'.
first_two_words_feature_names = ['_'.join(item.split('_')[:2]) for item in features.columns]
counter_dict = Counter(first_two_words_feature_names)
for key, value in counter_dict.items():
    print(key + ' ' + str(value))

diagnostics_Configuration 2
original_glcm 24
original_firstorder 18
diagnostics_Versions 5
original_glszm 16
original_ngtdm 5
diagnostics_Mask-original 8
original_glrlm 16
original_gldm 14
original_shape 14
diagnostics_Image-original 6


## The columns fall into 7 feature categories, plus the diagnostics group:

### **0) Diagnostics** (21)

These are not features, just 'diagnostics'. There are 21 of them in total, in four subgroups:

a) *diagnostics_Versions* (5): The same values for all patients. They can be useful for debugging if one needs to regenerate these numbers in the future, or to calculate the same features for other images using the same versions.

b) *diagnostics_Configuration* (2): Considered the same for all images; not treated as meaningful until further exploration.

c) *diagnostics_Image-original* (6): Characteristics of the image/file. Some are the same for all patients and some are not; for example, the size of the image series (how many images are in the series) changes from patient to patient.

d) *diagnostics_Mask-original* (8): Characteristics of the mask. I am curious whether these are also captured by the calculated features. For example, 'diagnostics_Mask-original_VoxelNum' seems to capture the number of voxels, but there may be a feature that has the same value. Check?
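One way to run the check raised in d) is to compare the diagnostic column against every numeric feature column. A sketch on toy data: the `toy_feature_*` names and their values are made up for illustration, and only the three `VoxelNum` values come from this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "diagnostics_Mask-original_VoxelNum": [706, 1153, 2293],
        "toy_feature_a": [706.0, 1153.0, 2293.0],   # hypothetical exact duplicate
        "toy_feature_b": [0.48, 0.41, 0.32],        # hypothetical unrelated feature
    }
)

target = "diagnostics_Mask-original_VoxelNum"
candidates = toy.drop(columns=[target]).select_dtypes(include="number")

# A column counts as a duplicate if every value matches the diagnostic value
# (up to floating-point tolerance).
target_values = toy[target].to_numpy(dtype=float)
duplicates = [
    col for col in candidates.columns
    if np.allclose(candidates[col].to_numpy(dtype=float), target_values)
]
print(duplicates)
```

An equality test like this would miss a rescaled copy of the column, so a correlation check is a useful complement.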




### **1) Original Shape Features** (14)

These describe the three-dimensional size and shape of the ROI.

### **2) Original First Order Features** (18)

First-order statistics describe the distribution of voxel intensities within the image region defined by the mask through commonly used and basic metrics.
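To make "first-order" concrete, here is a small sketch that computes a few such statistics on a toy 2D image and mask. It is a simplification, not pyradiomics' implementation: the image, mask and gray levels are invented, and the entropy is computed on the raw integer levels rather than on a binned histogram.

```python
import numpy as np

image = np.array([[1, 2, 5],
                  [3, 5, 1],
                  [1, 3, 5]])
mask = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [1, 1, 1]], dtype=bool)

voxels = image[mask].astype(float)   # intensities inside the ROI only

mean = voxels.mean()
energy = np.sum(voxels ** 2)         # sum of squared intensities
minimum, maximum = voxels.min(), voxels.max()

# Entropy (base 2) of the discrete gray-level distribution inside the mask.
_, counts = np.unique(voxels, return_counts=True)
p = counts / counts.sum()
entropy = -np.sum(p * np.log2(p))

print(mean, energy, minimum, maximum, entropy)
```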

### **3) "Gray Level Co-occurrence Matrix (GLCM)" Features** (24)

A *Gray Level Co-occurrence Matrix (GLCM)* of size $N_g \times N_g$ describes the second-order joint probability
  function of an image region constrained by the mask and is defined as $\textbf{P}(i,j|\delta,\theta)$.
  The $(i,j)^{\text{th}}$ element of this matrix represents the number of times the combination of
  levels $i$ and $j$ occur in two pixels in the image, that are separated by a distance of $\delta$
  pixels along angle $\theta$.
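The definition above can be illustrated by counting co-occurrences by hand. A minimal sketch for a toy 3×3 image with $N_g = 3$, $\delta = 1$ and $\theta = 0$ (one step to the right); the image is invented and the loop is written for clarity, not speed:

```python
import numpy as np

image = np.array([[1, 2, 2],
                  [1, 1, 3],
                  [3, 2, 1]])
n_levels = 3        # N_g: gray levels 1..3
offset = (0, 1)     # delta = 1 along theta = 0

glcm = np.zeros((n_levels, n_levels), dtype=int)
rows, cols = image.shape
for r in range(rows):
    for c in range(cols):
        r2, c2 = r + offset[0], c + offset[1]
        if 0 <= r2 < rows and 0 <= c2 < cols:
            i, j = image[r, c], image[r2, c2]
            glcm[i - 1, j - 1] += 1   # levels are 1-based, array indices 0-based

# Symmetric version: each pair is counted in both directions.
glcm_sym = glcm + glcm.T
p = glcm_sym / glcm_sym.sum()         # normalised to a joint probability

print(glcm)
print(glcm_sym)
```

Normalising the counts, as in the last step, turns the matrix into the joint probability function that the GLCM features are computed from.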


### **4) "Gray Level Dependence Matrix (GLDM)" Features** (14)

A Gray Level Dependence Matrix (GLDM) quantifies gray level dependencies in an image.
A gray level dependency is defined as the number of connected voxels within distance $\delta$ that are dependent on the center voxel. A neighbouring voxel with gray level $j$ is considered dependent on a center voxel with gray level $i$ if $|i-j| \le \alpha$.
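As an illustration, the sketch below builds a dependence matrix for a toy 3×3 image. The assumptions are mine: a neighbour counts as dependent when its gray level differs from the center by at most a cutoff `alpha` (0 here), the neighbourhood is the 8 surrounding pixels ($\delta = 1$), and column $j$ of the matrix holds voxels with exactly $j$ dependent neighbours. The column indexing convention in pyradiomics may differ.

```python
import numpy as np

image = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [3, 3, 2]])
alpha = 0       # neighbour is dependent if |centre - neighbour| <= alpha
delta = 1       # neighbourhood radius: the 8 surrounding pixels in 2D
n_levels = 3

rows, cols = image.shape
max_dependents = (2 * delta + 1) ** 2 - 1
# Row = gray level (1-based), column j = number of dependent neighbours.
gldm = np.zeros((n_levels, max_dependents + 1), dtype=int)

for r in range(rows):
    for c in range(cols):
        dependents = 0
        for dr in range(-delta, delta + 1):
            for dc in range(-delta, delta + 1):
                if dr == 0 and dc == 0:
                    continue            # skip the centre voxel itself
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < rows and 0 <= c2 < cols:
                    if abs(int(image[r, c]) - int(image[r2, c2])) <= alpha:
                        dependents += 1
        gldm[image[r, c] - 1, dependents] += 1

print(gldm)
```

Every voxel contributes exactly one count, so the matrix entries sum to the number of voxels in the region.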


### **5) "Gray Level Run Length Matrix (GLRLM)" Features** (16)

A Gray Level Run Length Matrix (GLRLM) quantifies gray level runs. A gray level run is defined as the length, in number of pixels, of consecutive pixels that have the same gray level value.
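A run-length matrix can be sketched by scanning a toy image row by row, i.e. along one direction only (horizontal). The image is invented, and a full implementation would repeat the scan for the other directions:

```python
import numpy as np

image = np.array([[1, 1, 2],
                  [3, 3, 3],
                  [2, 1, 1]])
n_levels = 3
max_run = image.shape[1]   # a horizontal run cannot be longer than a row

# Row = gray level (1-based), column = run length - 1.
glrlm = np.zeros((n_levels, max_run), dtype=int)
for row in image:
    run_value, run_length = row[0], 1
    for value in row[1:]:
        if value == run_value:
            run_length += 1
        else:
            glrlm[run_value - 1, run_length - 1] += 1
            run_value, run_length = value, 1
    glrlm[run_value - 1, run_length - 1] += 1   # close the last run of the row

print(glrlm)
```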

### **6) "Gray Level Size Zone Matrix (GLSZM)" Features** (16)

These quantify gray level zones in an image. A gray level zone is defined as the number of connected voxels that share the same gray level intensity.
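Zones can be found with a flood fill. The sketch below does this for a toy 3×3 image, assuming 8-connectivity in 2D (diagonal neighbours count as connected); the image and the helper names are illustrative only.

```python
import numpy as np
from collections import deque

image = np.array([[1, 1, 2],
                  [3, 3, 2],
                  [1, 3, 3]])
n_levels = 3

rows, cols = image.shape
visited = np.zeros(image.shape, dtype=bool)
zones = []   # one (gray level, zone size) entry per connected zone

for r in range(rows):
    for c in range(cols):
        if visited[r, c]:
            continue
        level = image[r, c]
        queue = deque([(r, c)])
        visited[r, c] = True
        size = 0
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            size += 1
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    y2, x2 = y + dy, x + dx
                    if (0 <= y2 < rows and 0 <= x2 < cols
                            and not visited[y2, x2] and image[y2, x2] == level):
                        visited[y2, x2] = True
                        queue.append((y2, x2))
        zones.append((int(level), size))

# Row = gray level (1-based), column = zone size - 1.
largest = max(zone_size for _, zone_size in zones)
glszm = np.zeros((n_levels, largest), dtype=int)
for level, zone_size in zones:
    glszm[level - 1, zone_size - 1] += 1

print(glszm)
```

Unlike the run-length matrix, this matrix does not depend on a scan direction, so only one matrix is built per region.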
  
### **7) "Neighbouring Gray Tone Difference Matrix (NGTDM)" Features** (5)

A Neighbouring Gray Tone Difference Matrix (NGTDM) quantifies the difference between a gray value and the average gray value of its neighbours within distance $\delta$.
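The per-gray-level sums behind this matrix can be sketched as follows for a toy 3×3 image with $\delta = 1$. The names `s` and `n` are my own: `s[i]` is the summed absolute difference between each voxel of level `i` and the mean of its neighbours, and `n[i]` is the number of voxels with that level. Border voxels simply use the neighbours that exist.

```python
import numpy as np

image = np.array([[1, 2, 1],
                  [2, 2, 2],
                  [3, 3, 1]], dtype=float)
delta = 1
levels = [1, 2, 3]

rows, cols = image.shape
s = {level: 0.0 for level in levels}   # summed absolute differences per gray level
n = {level: 0 for level in levels}     # number of voxels per gray level

for r in range(rows):
    for c in range(cols):
        r0, r1 = max(r - delta, 0), min(r + delta + 1, rows)
        c0, c1 = max(c - delta, 0), min(c + delta + 1, cols)
        window = image[r0:r1, c0:c1]
        # Mean of the neighbours, excluding the centre voxel itself.
        neighbour_mean = (window.sum() - image[r, c]) / (window.size - 1)
        level = int(image[r, c])
        s[level] += abs(image[r, c] - neighbour_mean)
        n[level] += 1

print(n)
print(s)
```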


##### To View Some Values for Each of the Feature Categories, change the feature_category variable's value and run the cell below:

In [5]:
feature_category = 'diagnostics_Mask-original'
features.loc[:, features.columns.str.startswith(feature_category)].head()

| Patient_ID | diagnostics_Mask-original_Hash | diagnostics_Mask-original_Spacing | diagnostics_Mask-original_Size | diagnostics_Mask-original_BoundingBox | diagnostics_Mask-original_VoxelNum | diagnostics_Mask-original_VolumeNum | diagnostics_Mask-original_CenterOfMassIndex | diagnostics_Mask-original_CenterOfMass |
|---|---|---|---|---|---|---|---|---|
| P0002 | 1d5d805a4c63753953bdfef2ab8e33d70262b862 | (1.171875, 1.171875, 3.0) | (512, 512, 129) | (151, 230, 54, 20, 29, 6) | 706 | 1 | (161.44192634560906, 242.9815864022663, 56.665... | (-110.8102425637394, 62.74404656515583, -582.0... |
| P0003 | f85cce9c5b06b65a3e2d31ddffc228618939ec03 | (1.171875, 1.171875, 3.0) | (512, 512, 131) | (212, 294, 94, 25, 18, 8) | 1153 | 1 | (224.1699913269731, 303.5021682567216, 97.1405... | (-37.300791413703394, 84.66660342584561, -611.... |
| P0005 | c7f00370a59b5efa8c3d8fbbcef707d84db9774c | (1.171875, 1.171875, 3.0) | (512, 512, 145) | (169, 295, 100, 26, 27, 10) | 2293 | 1 | (180.82250327082426, 307.5381596162233, 104.82... | (-88.09862897950282, 97.3962808002617, -700.01... |
| P0008 | 2f65743a563e2a20c039fa1cf45c30da16b87aa6 | (1.171875, 1.171875, 3.0) | (512, 512, 121) | (132, 265, 62, 63, 59, 23) | 33615 | 1 | (166.4716049382716, 293.3513907481779, 72.5782... | (-104.91608796296296, 124.77116103302097, -913... |
| P0015 | 35c48947bf004155943b31824e3d600a930ce68b | (1.171875, 1.171875, 3.0) | (512, 512, 137) | (197, 267, 81, 63, 60, 24) | 27947 | 1 | (229.14337853794683, 297.09596736680146, 91.68... | (-31.47260327584354, 137.15933675797044, 385.0... |


In [6]:
features.head()

| Patient_ID | diagnostics_Versions_PyRadiomics | diagnostics_Versions_Numpy | diagnostics_Versions_SimpleITK | diagnostics_Versions_PyWavelet | diagnostics_Versions_Python | diagnostics_Configuration_Settings | diagnostics_Configuration_EnabledImageTypes | diagnostics_Image-original_Hash | diagnostics_Image-original_Spacing | diagnostics_Image-original_Size | ... | original_glszm_SmallAreaHighGrayLevelEmphasis | original_glszm_SmallAreaLowGrayLevelEmphasis | original_glszm_ZoneEntropy | original_glszm_ZonePercentage | original_glszm_ZoneVariance | original_ngtdm_Busyness | original_ngtdm_Coarseness | original_ngtdm_Complexity | original_ngtdm_Contrast | original_ngtdm_Strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P0002 | 0+unknown | 1.14.1 | 1.1.0 | 1.0.1 | 3.6.2 | {'minimumROIDimensions': 2, 'minimumROISize': ... | {'Original': {}} | f9d241fb1777ecf697740ff73112fdda238e568e | (1.171875, 1.171875, 3.0) | (512, 512, 129) | ... | 300.98709 | 0.010867 | 6.15856 | 0.484419 | 14.20054 | 0.14467 | 0.008142 | 2068.860642 | 0.678496 | 4.2979 |
| P0003 | 0+unknown | 1.14.1 | 1.1.0 | 1.0.1 | 3.6.2 | {'minimumROIDimensions': 2, 'minimumROISize': ... | {'Original': {}} | 30a1544a237665296e593a6f8e7c22191d14e101 | (1.171875, 1.171875, 3.0) | (512, 512, 131) | ... | 378.437162 | 0.008456 | 5.724877 | 0.40503 | 167.957815 | 0.160603 | 0.005229 | 1457.029221 | 0.221854 | 4.117904 |
| P0005 | 0+unknown | 1.14.1 | 1.1.0 | 1.0.1 | 3.6.2 | {'minimumROIDimensions': 2, 'minimumROISize': ... | {'Original': {}} | 9f8641c7e7d1ab04ee8174bbec4e36203a9cafe1 | (1.171875, 1.171875, 3.0) | (512, 512, 145) | ... | 434.543541 | 0.004569 | 5.828763 | 0.317052 | 704.735551 | 0.295324 | 0.002109 | 1889.491792 | 0.223549 | 2.939912 |
| P0008 | 0+unknown | 1.14.1 | 1.1.0 | 1.0.1 | 3.6.2 | {'minimumROIDimensions': 2, 'minimumROISize': ... | {'Original': {}} | 1c1dc942aadcc74b3b319e0b6114e0378cc9bee8 | (1.171875, 1.171875, 3.0) | (512, 512, 121) | ... | 509.233039 | 0.002196 | 5.724402 | 0.085051 | 110068.690109 | 2.520202 | 0.000189 | 730.593673 | 0.009323 | 0.624311 |
| P0015 | 0+unknown | 1.14.1 | 1.1.0 | 1.0.1 | 3.6.2 | {'minimumROIDimensions': 2, 'minimumROISize': ... | {'Original': {}} | 980ff8a6a53f71f56245ba7980f54ee515d5040d | (1.171875, 1.171875, 3.0) | (512, 512, 137) | ... | 829.625744 | 0.002526 | 5.902157 | 0.064587 | 156935.277969 | 1.163708 | 0.000249 | 915.461762 | 0.006939 | 2.080878 |


Next Steps:

1) Check that all features exist for all patients (no NaNs).
2) Check the features for redundancy. For example:
   a) I got the warning "GLCM is symmetrical, therefore Sum Average = 2 * Joint Average, only 1 needs to be calculated".
   b) Check whether some diagnostic columns are duplicated among the actual features, mainly the first-order and shape features.
3) How could one assess the value of each of these features, for example its predictive power without overfitting? Logistic regression is one candidate.
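Steps 1 and 2 can be sketched with plain pandas. The frame below is synthetic: the column names are invented, and `sum_average` is built as exactly twice `joint_average` to mimic the redundancy named in the warning. The 0.99 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
joint_average = rng.normal(size=50)
toy = pd.DataFrame(
    {
        "joint_average": joint_average,
        "sum_average": 2 * joint_average,      # redundant by construction
        "independent": rng.normal(size=50),    # unrelated synthetic feature
    }
)

# Step 1: count missing values per column.
nan_counts = toy.isna().sum()

# Step 2: flag column pairs whose absolute Pearson correlation exceeds a threshold.
threshold = 0.99
corr = toy.corr().abs()
redundant_pairs = [
    (a, b)
    for idx, a in enumerate(corr.columns)
    for b in corr.columns[idx + 1:]
    if corr.loc[a, b] > threshold
]

print(nan_counts.sum(), redundant_pairs)
```

On the real data, the diagnostic columns would need to be dropped first, since most of them are not numeric.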