# Data analysis of the database (Fluoro-Crossed Data Points)

## Table of Contents
[Import Data and Basic preparation](#import)  
[Data Preparation (Ignore)](#dataprepi)  
[Data Preparation (Balancing)(Ignore)](#dataprepb)  
[Training Using Scikit Learn](#sklearn)  
..[Support Vector Machine](#svmk)  
..[Ensemble Learning](#ensemble)  
[Training Using Xgboost](#xgboost)  
[Training Using Tensorflow](#tf)  
[Summary of All Classifiers](#sklearn_sum)  
<br>
<br> 
This project focused on using data points in the database to construct the correlation between fluorescence and polarization signal. The main goal is to use the statistics of polarization signal to predict whether fluorescence signal of the deposit exists.  
There are four categories of data points: (naming->{f: fluorescence, c: crossed, t: Positive, f: negative})
1. ftct: Deposits are fluorescence positive and polarization positive. (num: 789)
2. ftcf: Deposits are fluorescence positive but polarization negative. (num: 20)
3. ffct: Deposits are fluorescence negative but polarization positive. (num: 131)
4. ffcf: Deposits are both negative in fluorescence and polarization signals. Since the number of deposits in this   category is too small, we use the background retina of ftct deposits as the data points for this category.


## Import Data and Basic preparation
<a id="import"></a>

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
import os
from math import pi


pd.options.display.max_columns = None
pd.options.display.max_rows = None
np.set_printoptions(precision=5)
#precision 5
pd.options.display.precision = 3
pd.set_option('precision',3)

In [15]:
datapath = os.getcwd() + "\\data\\dbt_m.csv"
df = pd.read_csv(datapath)

In [16]:
list(df)
# the fluorescence variable name: FluoroSignal

['RegionFolder',
 'Subject',
 'Species',
 'Age',
 'Gender',
 'CauseOfDeath',
 'TimeOfDeath',
 'MedicalHistory',
 'Diagnosis1Type',
 'Diagnosis1Level',
 'Is1PrimaryDiagnosis',
 'Diagnosis2Type',
 'Diagnosis2Level',
 'Is2PrimaryDiagnosis',
 'Diagnosis3Type',
 'Diagnosis3Level',
 'Is3PrimaryDiagnosis',
 'Braak_stage_tau',
 'NP_CERAD_Biel',
 'NP_FC_Biel',
 'NP_TC_Biel',
 'NP_PC_Biel',
 'NP_score',
 'DP_CERAD_Biel',
 'DP_score',
 'A_beta_thal',
 'Thal_Phase',
 'CAA_CR',
 'CAA_Abeta_Cb',
 'CAA_A_beta_and_CR',
 'ABC_score',
 'Likelihood_of_AD',
 'SubjectNotes',
 'EntryOrder',
 'MicroscopeMotors',
 'Sample',
 'EyeSource',
 'EyeInitialFixative',
 'EyeInitialFixativePercent',
 'EyeInitialFixingTime',
 'EyeDissectionDoneBy',
 'EyeMountingDoneBy',
 'EyeMountingDate_1',
 'EyeMountingDate_2',
 'EyeMountingDate_3',
 'EyeMountingDate_4',
 'EyeStain',
 'EyeMounting',
 'EyeQuarterPositions',
 'EyeNotes',
 'Region',
 'ImagingDoneBy',
 'XCoordinate',
 'YCoordinate',
 'RadialDistanceFromFovea',
 'ImageMagn

In [17]:
# Adjust some values in the table
# The Q metric
df[["Q_metric_Background_Mean", "Q_metric_Background_Std", "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std", 
    "Q_metric_Full_Mean", "Q_metric_Full_Std"]] = \
df[["Q_metric_Background_Mean", "Q_metric_Background_Std", "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std", 
    "Q_metric_Full_Mean", "Q_metric_Full_Std"]].divide(3) 
# The Linear retardance
df[["Retardance_Lin_Background_Mean","Retardance_Lin_Background_Std", 
    "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std", 
    "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std"]] = \
df[["Retardance_Lin_Background_Mean","Retardance_Lin_Background_Std", 
    "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std", 
    "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std"]].divide(180)


In [18]:
# The Circular retardance, Circular diattenuation , Circular polarizance
df[["Retardance_Circ_Background_Mean", "Retardance_Circ_Background_Std", 
    "Retardance_Circ_Deposit_Mean", "Retardance_Circ_Deposit_Std",
    "Retardance_Circ_Full_Mean", "Retardance_Circ_Full_Std"]] = \
(df[["Retardance_Circ_Background_Mean", "Retardance_Circ_Background_Std", 
    "Retardance_Circ_Deposit_Mean", "Retardance_Circ_Deposit_Std",
    "Retardance_Circ_Full_Mean", "Retardance_Circ_Full_Std"]]+ 180).divide(360)

In [19]:
# The Circular diattenuation
df[["Diattenuation_Circ_Background_Mean", "Diattenuation_Circ_Background_Std", 
    "Diattenuation_Circ_Deposit_Mean", "Diattenuation_Circ_Deposit_Std",
    "Diattenuation_Circ_Full_Mean", "Diattenuation_Circ_Full_Std"]] = \
(df[["Diattenuation_Circ_Background_Mean", "Diattenuation_Circ_Background_Std", 
    "Diattenuation_Circ_Deposit_Mean", "Diattenuation_Circ_Deposit_Std",
    "Diattenuation_Circ_Full_Mean", "Diattenuation_Circ_Full_Std"]]+ 1).divide(2)

In [20]:
# The Circular polarizance
df[["Polarizance_Circ_Background_Mean", "Polarizance_Circ_Background_Std", 
    "Polarizance_Circ_Deposit_Mean", "Polarizance_Circ_Deposit_Std",
    "Polarizance_Circ_Full_Mean", "Polarizance_Circ_Full_Std"]] = \
(df[["Polarizance_Circ_Background_Mean", "Polarizance_Circ_Background_Std", 
     "Polarizance_Circ_Deposit_Mean", "Polarizance_Circ_Deposit_Std",
     "Polarizance_Circ_Full_Mean", "Polarizance_Circ_Full_Std"]]+ 1).divide(2)

In [21]:
# The MMT parameters
df[["A_metric_Background_Mean", "A_metric_Background_Std",
    "A_metric_Deposit_Mean", "A_metric_Deposit_Std"]] = \
(df[["A_metric_Background_Mean", "A_metric_Background_Std",
    "A_metric_Deposit_Mean", "A_metric_Deposit_Std"]] + 1).divide(2)

df[["t_metric_Background_Mean", "t_metric_Background_Std",
    "t_metric_Deposit_Mean", "t_metric_Deposit_Std"]] = \
(df[["t_metric_Background_Mean", "t_metric_Background_Std",
    "t_metric_Deposit_Mean", "t_metric_Deposit_Std"]] + 1).divide(2)

df[["x_metric_Background_Mean", "x_metric_Background_Std",
    "x_metric_Deposit_Mean", "x_metric_Deposit_Std"]] = \
(df[["x_metric_Background_Mean", "x_metric_Background_Std",
    "x_metric_Deposit_Mean", "x_metric_Deposit_Std"]] + pi/4).divide(pi/2)

In [22]:
df_label = df[["RegionFolder","Subject", "FluoroSignal", "CrossedSignal"]]

In [23]:
# Statistics of the number 
print("  Number of each class \n"
      "  Fluo_Positive_Cross_Positive: " + str(sum(np.multiply(df_label["FluoroSignal"], df_label["CrossedSignal"]))) + "\n" + 
      "  Fluo_Positive_Cross_Negative: " + str(sum(np.multiply(df_label["FluoroSignal"] == 1, df_label["CrossedSignal"] == 0))) + "\n"+
      "  Fluo_Negative_Cross_Positive: " + str(sum(np.multiply(df_label["FluoroSignal"] == 0, df_label["CrossedSignal"] == 1))) + "\n"+
      "  Fluo_Negative_Cross_Negative: " + str(sum(np.multiply(df_label["FluoroSignal"] == 0, df_label["CrossedSignal"] == 0))) + "\n"
     )
# it's better to separate the class and fine tune the training examples

  Number of each class 
  Fluo_Positive_Cross_Positive: 789
  Fluo_Positive_Cross_Negative: 20
  Fluo_Negative_Cross_Positive: 131
  Fluo_Negative_Cross_Negative: 7



In [24]:
# df for Background
dfb = df[["RegionFolder", "Subject",
          "Depolarization_Power_Background_Mean", "Depolarization_Power_Background_Std", 
          "Q_metric_Background_Mean", "Q_metric_Background_Std",
          "Anisotropy_Lin_Background_Mean", "Anisotropy_Lin_Background_Std",
          "Anisotropy_Circ_Background_Mean", "Anisotropy_Circ_Background_Std",
          "Polarizance_Lin_Background_Mean", "Polarizance_Lin_Background_Std",
          "Polarizance_Circ_Background_Mean", "Polarizance_Circ_Background_Std",
          "Diattenuation_Lin_Background_Mean", "Diattenuation_Lin_Background_Std",
          "Diattenuation_Circ_Background_Mean", "Diattenuation_Circ_Background_Std",
          "Retardance_Lin_Background_Mean", "Retardance_Lin_Background_Std",
          "Retardance_Circ_Background_Mean", "Retardance_Circ_Background_Std",
          "A_metric_Background_Mean", "A_metric_Background_Std",
          "b_metric_Background_Mean", "b_metric_Background_Std",
          "t_metric_Background_Mean", "t_metric_Background_Std",
          "x_metric_Background_Mean", "x_metric_Background_Std",
          "FluoroSignal", "CrossedSignal"
        ]]
dfb.set_index(["RegionFolder", "Subject"], inplace=True)


In [36]:
dfb

Unnamed: 0_level_0,Unnamed: 1_level_0,Depolarization_Power_Background_Mean,Depolarization_Power_Background_Std,Q_metric_Background_Mean,Q_metric_Background_Std,Anisotropy_Lin_Background_Mean,Anisotropy_Lin_Background_Std,Anisotropy_Circ_Background_Mean,Anisotropy_Circ_Background_Std,Polarizance_Lin_Background_Mean,Polarizance_Lin_Background_Std,Polarizance_Circ_Background_Mean,Polarizance_Circ_Background_Std,Diattenuation_Lin_Background_Mean,Diattenuation_Lin_Background_Std,Diattenuation_Circ_Background_Mean,Diattenuation_Circ_Background_Std,Retardance_Lin_Background_Mean,Retardance_Lin_Background_Std,Retardance_Circ_Background_Mean,Retardance_Circ_Background_Std,A_metric_Background_Mean,A_metric_Background_Std,b_metric_Background_Mean,b_metric_Background_Std,t_metric_Background_Mean,t_metric_Background_Std,x_metric_Background_Mean,x_metric_Background_Std,FluoroSignal,CrossedSignal
RegionFolder,Subject,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 001\,NR15-203,0.182,0.017,0.675,0.028,0.161,0.03,0.003,0.003,0.064,0.014,0.491,0.503,0.048,0.014,0.498,0.503,0.023,0.006,0.502,0.505,0.585,0.528,0.783,0.022,0.534,0.511,0.329,0.615,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 002\,NR15-203,0.181,0.02,0.676,0.033,0.166,0.024,0.003,0.002,0.064,0.016,0.49,0.503,0.048,0.012,0.496,0.503,0.024,0.005,0.501,0.504,0.569,0.524,0.786,0.024,0.527,0.51,0.361,0.589,0,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 A\,NR15-203,0.172,0.015,0.689,0.024,0.184,0.024,0.005,0.005,0.067,0.013,0.492,0.503,0.052,0.013,0.498,0.503,0.027,0.005,0.506,0.504,0.575,0.524,0.789,0.02,0.53,0.51,0.307,0.593,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 B\,NR15-203,0.191,0.022,0.659,0.035,0.186,0.021,0.002,0.003,0.065,0.014,0.492,0.503,0.048,0.013,0.498,0.503,0.03,0.005,0.501,0.504,0.578,0.526,0.774,0.025,0.531,0.51,0.328,0.603,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 C\,NR15-203,0.198,0.032,0.646,0.051,0.199,0.019,0.002,0.002,0.061,0.012,0.492,0.503,0.051,0.013,0.498,0.503,0.035,0.006,0.501,0.504,0.563,0.525,0.77,0.029,0.524,0.51,0.293,0.59,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 D\,NR15-203,0.182,0.026,0.668,0.041,0.224,0.025,0.013,0.008,0.063,0.014,0.491,0.503,0.076,0.012,0.5,0.503,0.036,0.004,0.511,0.504,0.572,0.525,0.781,0.025,0.528,0.51,0.185,0.547,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 005\,NR15-203,0.211,0.036,0.622,0.054,0.232,0.023,0.013,0.01,0.058,0.014,0.504,0.504,0.076,0.015,0.511,0.503,0.045,0.007,0.511,0.505,0.574,0.529,0.741,0.037,0.528,0.511,0.135,0.641,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 007 C\,NR15-203,0.191,0.029,0.654,0.046,0.195,0.024,0.018,0.009,0.071,0.013,0.49,0.503,0.077,0.012,0.496,0.502,0.03,0.004,0.514,0.504,0.564,0.525,0.776,0.027,0.525,0.51,0.207,0.546,0,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 008\,NR15-203,0.175,0.014,0.684,0.023,0.201,0.023,0.006,0.004,0.076,0.015,0.483,0.504,0.066,0.014,0.491,0.504,0.027,0.005,0.503,0.505,0.573,0.527,0.786,0.021,0.529,0.511,0.345,0.587,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 013\,NR15-203,0.168,0.015,0.695,0.024,0.308,0.032,0.008,0.007,0.067,0.016,0.49,0.503,0.062,0.013,0.498,0.503,0.054,0.008,0.507,0.505,0.584,0.527,0.787,0.021,0.533,0.511,0.24,0.57,0,1


In [25]:
# df for deposits
dfd = df[["RegionFolder", "Subject",
          "Depolarization_Power_Deposit_Mean", "Depolarization_Power_Deposit_Std", 
          "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std",
          "Anisotropy_Lin_Deposit_Mean", "Anisotropy_Lin_Deposit_Std",
          "Anisotropy_Circ_Deposit_Mean", "Anisotropy_Circ_Deposit_Std",
          "Polarizance_Lin_Deposit_Mean", "Polarizance_Lin_Deposit_Std",
          "Polarizance_Circ_Deposit_Mean", "Polarizance_Circ_Deposit_Std",
          "Diattenuation_Lin_Deposit_Mean", "Diattenuation_Lin_Deposit_Std",
          "Diattenuation_Circ_Deposit_Mean", "Diattenuation_Circ_Deposit_Std",
          "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std",
          "Retardance_Circ_Deposit_Mean", "Retardance_Circ_Deposit_Std",
          "A_metric_Deposit_Mean", "A_metric_Deposit_Std",
          "b_metric_Deposit_Mean", "b_metric_Deposit_Std",
          "t_metric_Deposit_Mean", "t_metric_Deposit_Std",
          "x_metric_Deposit_Mean", "x_metric_Deposit_Std",
          "FluoroSignal", "CrossedSignal"
        ]]
dfd.set_index(["RegionFolder", "Subject"], inplace=True)





In [76]:
dfd

Unnamed: 0_level_0,Unnamed: 1_level_0,Depolarization_Power_Deposit_Mean,Depolarization_Power_Deposit_Std,Q_metric_Deposit_Mean,Q_metric_Deposit_Std,Anisotropy_Lin_Deposit_Mean,Anisotropy_Lin_Deposit_Std,Anisotropy_Circ_Deposit_Mean,Anisotropy_Circ_Deposit_Std,Polarizance_Lin_Deposit_Mean,Polarizance_Lin_Deposit_Std,Polarizance_Circ_Deposit_Mean,Polarizance_Circ_Deposit_Std,Diattenuation_Lin_Deposit_Mean,Diattenuation_Lin_Deposit_Std,Diattenuation_Circ_Deposit_Mean,Diattenuation_Circ_Deposit_Std,Retardance_Lin_Deposit_Mean,Retardance_Lin_Deposit_Std,Retardance_Circ_Deposit_Mean,Retardance_Circ_Deposit_Std,A_metric_Deposit_Mean,A_metric_Deposit_Std,b_metric_Deposit_Mean,b_metric_Deposit_Std,t_metric_Deposit_Mean,t_metric_Deposit_Std,x_metric_Deposit_Mean,x_metric_Deposit_Std,FluoroSignal,CrossedSignal
RegionFolder,Subject,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 001\,NR15-203,0.217,0.017,0.617,0.026,0.199,0.029,0.003,0.003,0.059,0.015,0.494,0.504,0.044,0.014,0.498,0.504,0.04,0.009,0.501,0.505,0.601,0.53,0.757,0.023,0.539,0.512,0.345,0.638,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 002\,NR15-203,0.22,0.017,0.611,0.026,0.204,0.022,0.003,0.002,0.064,0.016,0.49,0.503,0.045,0.012,0.494,0.503,0.039,0.005,0.501,0.505,0.56,0.525,0.752,0.022,0.523,0.51,0.371,0.602,0,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 A\,NR15-203,0.237,0.04,0.587,0.06,0.302,0.062,0.003,0.003,0.067,0.014,0.497,0.504,0.046,0.014,0.494,0.504,0.072,0.028,0.504,0.505,0.576,0.539,0.731,0.04,0.528,0.513,0.323,0.617,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 B\,NR15-203,0.327,0.061,0.465,0.077,0.259,0.061,0.002,0.002,0.058,0.016,0.499,0.505,0.037,0.014,0.493,0.504,0.085,0.033,0.501,0.505,0.652,0.57,0.664,0.052,0.551,0.522,0.352,0.665,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 C\,NR15-203,0.356,0.053,0.427,0.064,0.231,0.029,0.002,0.002,0.051,0.013,0.498,0.504,0.036,0.014,0.493,0.504,0.083,0.027,0.501,0.505,0.669,0.569,0.647,0.047,0.556,0.521,0.312,0.668,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 003 D\,NR15-203,0.298,0.045,0.495,0.06,0.246,0.021,0.009,0.006,0.057,0.014,0.496,0.504,0.057,0.012,0.495,0.504,0.07,0.019,0.511,0.504,0.581,0.555,0.687,0.04,0.527,0.517,0.165,0.598,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 005\,NR15-203,0.31,0.042,0.483,0.055,0.285,0.046,0.005,0.005,0.053,0.015,0.506,0.504,0.065,0.015,0.509,0.504,0.083,0.019,0.496,0.509,0.594,0.541,0.631,0.047,0.53,0.512,0.258,0.856,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 007 C\,NR15-203,0.246,0.014,0.57,0.02,0.204,0.016,0.017,0.009,0.065,0.013,0.495,0.503,0.073,0.012,0.495,0.502,0.044,0.004,0.515,0.504,0.597,0.526,0.731,0.019,0.536,0.509,0.208,0.549,0,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 008\,NR15-203,0.195,0.014,0.651,0.022,0.276,0.031,0.006,0.004,0.077,0.015,0.486,0.504,0.066,0.014,0.487,0.504,0.049,0.008,0.504,0.505,0.578,0.527,0.768,0.021,0.53,0.511,0.339,0.585,1,1
E:\Human AD Ex Vivo\NR15-203\Left Eye\Region 013\,NR15-203,0.195,0.013,0.65,0.021,0.379,0.022,0.008,0.007,0.066,0.016,0.491,0.503,0.058,0.014,0.494,0.503,0.079,0.007,0.508,0.505,0.567,0.527,0.758,0.02,0.526,0.51,0.237,0.578,0,1


In [11]:
# This is for Harry's train, test csv file

P = 0.8;
mm = dfd_used.shape[0]
idx_train_test = np.random.permutation(mm)

dfFd_used.iloc[idx_train_test[0:round(P*mm)]].to_csv(os.getcwd() + "\\data\\dbt_train.csv")

dfd_used.iloc[idx_train_test[round(P*mm):]].to_csv(os.getcwd() + "\\data\\dbt_test.csv")

NameError: name 'dfd_used' is not defined

In [12]:
# df for full stats
dff =  df[["RegionFolder", "Subject", 
          "Depolarization_Power_Full_Mean", "Depolarization_Power_Full_Std", 
          "Q_metric_Full_Mean", "Q_metric_Full_Std",
          "Anisotropy_Lin_Full_Mean", "Anisotropy_Lin_Full_Std",
          "Polarizance_Lin_Full_Mean", "Polarizance_Lin_Full_Std",
          "Diattenuation_Lin_Full_Mean", "Diattenuation_Lin_Full_Std",
          "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std",
          "FluoroSignal", "CrossedSignal"
        ]]
dff.set_index(["RegionFolder", "Subject"], inplace=True)


In [13]:
# exclude std for each metric 
# see if the results change

dfb_nostd = df[["RegionFolder", "Subject",
          "Depolarization_Power_Background_Mean",
          "Q_metric_Background_Mean", 
          "Anisotropy_Lin_Background_Mean", 
          "Polarizance_Lin_Background_Mean", 
          "Diattenuation_Lin_Background_Mean", 
          "Retardance_Lin_Background_Mean",  
          "FluoroSignal", "CrossedSignal"
        ]]
dfb_nostd.set_index(["RegionFolder", "Subject"], inplace=True)

dfd_nostd = df[["RegionFolder", "Subject", 
          "Depolarization_Power_Deposit_Mean", 
          "Q_metric_Deposit_Mean", 
          "Anisotropy_Lin_Deposit_Mean",
          "Polarizance_Lin_Deposit_Mean", 
          "Diattenuation_Lin_Deposit_Mean",
          "Retardance_Lin_Deposit_Mean", 
          "FluoroSignal", "CrossedSignal"
        ]]
dfd_nostd.set_index(["RegionFolder", "Subject"], inplace=True)

dff_nostd =  df[["RegionFolder", "Subject", 
          "Depolarization_Power_Full_Mean", 
          "Q_metric_Full_Mean", 
          "Anisotropy_Lin_Full_Mean", 
          "Polarizance_Lin_Full_Mean", 
          "Diattenuation_Lin_Full_Mean", 
          "Retardance_Lin_Full_Mean", 
          "FluoroSignal", "CrossedSignal"
        ]]
dff_nostd.set_index(["RegionFolder", "Subject"], inplace=True)



## Training Using Scikit Learn
<a id="sklearn"></a>

In [27]:
# import basic fucntions
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import BorderlineSMOTE
# Train lda, support vector machine, random forest
from sklearn import svm
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda

In [28]:
# select data_preparation scheme
def data_preparation_scheme(s_num):
    if s_num==1: # train with background data
        # (ftct deposits), (ffct deposits + background(ffcf)) 
        dfct = dfd[dfd["CrossedSignal"]==1].copy()
        #dfcf = dfd[dfd["CrossedSignal"]==0].copy()
        df_b = dfb.sample(n=658, random_state=42)
        # change the name of columns
        oldname = ["Depolarization_Power_Background_Mean", "Q_metric_Background_Mean",
                   "Anisotropy_Lin_Background_Mean", "Anisotropy_Lin_Background_Std",
                   "Polarizance_Circ_Background_Std","Diattenuation_Circ_Background_Std",
                   "Retardance_Lin_Background_Mean", "Retardance_Lin_Background_Std",
                   "Retardance_Circ_Background_Std","A_metric_Background_Std",
                   "b_metric_Background_Std","t_metric_Background_Std",
                   "x_metric_Background_Mean", 
                   "FluoroSignal", "CrossedSignal"]
        
        df_b = df_b[oldname] # added sentence
        
        
        newname = ["Depolarization_Power_Deposit_Mean", "Q_metric_Deposit_Mean",
                   "Anisotropy_Lin_Deposit_Mean", "Anisotropy_Lin_Deposit_Std",
                   "Polarizance_Circ_Deposit_Std","Diattenuation_Circ_Deposit_Std",
                   "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std",
                   "Retardance_Circ_Deposit_Std","A_metric_Deposit_Std",
                   "b_metric_Deposit_Std","t_metric_Deposit_Std",
                   "x_metric_Deposit_Mean",
                   "FluoroSignal", "CrossedSignal"]
        
        dfct = dfct[newname] # added sentence
        
        namedict = {oldname[i]: newname[i] for i in range(len(oldname))}
        # Inpalce changing the names of the columns
        df_b.rename(columns = namedict, inplace=True)
        df_b["FluoroSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_b["CrossedSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_r = pd.concat([dfct, df_b])
        return(df_r)
    
    if s_num==2: # downsampling the positive data
        dfct = dfd[dfd["CrossedSignal"]==1].copy()
        dftct = dfct[dfct["FluoroSignal"]==1].copy()
        dftct_use = dfct.sample(n=230,random_state=42)
        dffct = dfct[dfct["FluoroSignal"]==0].copy()
        df_r = pd.concat([dftct_use,dffct])
        return(df_r)
    
    if s_num==3:
        
        dfct = dfd_nostd[dfb_nostd["CrossedSignal"]==1].copy()
        
        df_b = dfb_nostd.sample(n=658, random_state=42)
        
        oldname = ["Depolarization_Power_Background_Mean", 
                   "Q_metric_Background_Mean", 
                   "Anisotropy_Lin_Background_Mean", 
                   "Polarizance_Lin_Background_Mean", 
                   "Diattenuation_Lin_Background_Mean", 
                   "Retardance_Lin_Background_Mean",  
                   "FluoroSignal", "CrossedSignal"]
        
        newname =  ["Depolarization_Power_Deposit_Mean", 
                   "Q_metric_Deposit_Mean", 
                   "Anisotropy_Lin_Deposit_Mean", 
                   "Polarizance_Lin_Deposit_Mean", 
                   "Diattenuation_Lin_Deposit_Mean", 
                   "Retardance_Lin_Deposit_Mean",
                   "FluoroSignal", "CrossedSignal"]
        
        namedict = {oldname[i]: newname[i] for i in range(len(oldname))}
        
        df_b.rename(columns = namedict, inplace=True)
        df_b["FluoroSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_b["CrossedSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_r = pd.concat([dfct, df_b])
        return(df_r)
        
        
        #dfct_ft = dfd[(dfd["FluoroSignal"]==1) & (dfd["CrossedSignal"]==1)].copy()
        #dfct_ff = dfd[(dfd["FluoroSignal"]==0) & (dfd["CrossedSignal"]==1)].copy()
        #dfct_ft = dfct_ft.sample(frac=0.5, random_state=42)[0:3*dfct_ff.shape[0]]
        #df_r = pd.concat([dfct_ft, dfct_ff])    
        #return(df_r)    

### Support vector machine
<a id="svm"></a>

In [29]:
def train_svm(data_train, label_train):
    wdict_svm = {0: 1, 1: 1}
    clf_svm = svm.SVC(class_weight=wdict_svm)
    # optimize the parameters
    param_dist = {"kernel": ["rbf", "poly"], 
                  "degree": [1, 2, 3],
                  "gamma": sp_randint(0, 10), 
                  "shrinking": [True, False]
                  }
    n_iter_search = 30
    rs_svm = RandomizedSearchCV(clf_svm, param_distributions=param_dist,
                                      n_iter=n_iter_search, cv=10, n_jobs=-1)
    # train
    rs_svm.fit(data_train, label_train)
    return rs_svm
    

### Ensemble learning (sklearn)
<a id="ensemble"></a>

In [30]:
def train_rf(data_train, label_train, std=False):
    wdict_rf = {0: 1, 1: 1}
    clf_rf = RandomForestClassifier(n_jobs=-1)
    param_dist = {"max_depth": [2, 3, 4, 5],
                  "n_estimators": [100, 500],
                  "bootstrap": [True, False],
                  "max_features": sp_randint(1, 6) if std else sp_randint(1,13),
                  "min_samples_split": sp_randint(2, 5) if std else sp_randint(2, 12),
                  "criterion": ["gini", "entropy"]}
    n_iter_search = 30
    rs_rf = RandomizedSearchCV(clf_rf, param_distributions=param_dist,
                                      n_iter=n_iter_search, cv=10, n_jobs=-1)
    # train
    rs_rf.fit(data_train, label_train)
    return rs_rf

### Linear discriminant analysis

In [31]:
def train_lda(data_train,label_train):
    clf_lda = lda(store_covariance=True)
    
    # train
    clf_lda.fit(data_train,label_train)
    lda_cv_score = cross_val_score(clf_lda, data_train, label_train, cv=10, scoring="accuracy")
    return clf_lda, lda_cv_score

## Summary of All Classifiers
<a id="sklearn_sum"></a>

### Using ADASYN

In [34]:
# function for preprocesing to genearte training and test data sets
def data_preprocessing(df_in):
    # Prepare the datamatrix and labels
    X_r = df_in.values[:, 0:(df_in.shape[1]-2)]
    y = df_in["FluoroSignal"].values
    # Standarize X
    scaler = preprocessing.StandardScaler().fit(X_r)
    X = scaler.transform(X_r)
    # split train set and test set
    X_train_, X_test_, y_train_, y_test_ = train_test_split(X, y, test_size=0.2, random_state=10) 
    return X_train_, X_test_, y_train_, y_test_, scaler

# Define a function for inputing stats from classifier to dataframe
def metric_scores(m_clf,mname, truevalt, predictvalt,LDA=False,lda_cv_score=None):
    
    trueval = truevalt.copy()
    predictval = predictvalt.copy()
    accuracy = accuracy_score(trueval, predictval)
    precision = precision_score(trueval, predictval)
    recall = recall_score(trueval, predictval)
    specificity = recall_score(trueval, predictval, pos_label=0)
    # CV score of the best_estimator
    cvscore = np.mean(lda_cv_score) if LDA else m_clf.best_score_
        
    df_scores = pd.DataFrame({"Method": mname, "Accuracy": [accuracy], "Precision": [precision], "Sensitivity (Recall)": [recall], 
                              "Specificity": [specificity], "Mean accuracy": [cvscore]})
    return df_scores

def train_models(X_train_in, X_test_in, y_train_in, y_test_in,std=False):  
    df_result = pd.DataFrame({"Method": [], "Accuracy": [], "Precision": [], "Sensitivity (Recall)": [], 
                              "Specificity": [], "Mean accuracy": []})
    # lda
    rs_lda_t,lda_cv_score = train_lda(X_train_in, y_train_in)
    y_lda_predict = rs_lda_t.predict(X_test_in)
    df_temp1 = metric_scores(rs_lda_t, "LDA", y_test_in, y_lda_predict,LDA=True,lda_cv_score=lda_cv_score)
    df_result = df_result.append(df_temp1)
    
    # svm
    rs_svm_t = train_svm(X_train_in, y_train_in)
    y_svm_predict = rs_svm_t.predict(X_test_in)
    df_temp2 = metric_scores(rs_svm_t, "SVM", y_test_in, y_svm_predict)
    df_result = df_result.append(df_temp2)
    # rf
    rs_rf_t = train_rf(X_train_in, y_train_in,std=std)
    y_rf_predict = rs_rf_t.predict(X_test_in)    
    df_temp3 = metric_scores(rs_rf_t, "RF", y_test_in, y_rf_predict)
    df_result = df_result.append(df_temp3)
    return df_result, rs_lda_t, y_lda_predict, rs_svm_t, y_svm_predict, rs_rf_t, y_rf_predict

def fimportance_dataframe(name, score):
    feature_importance_table = pd.DataFrame({"Metric": [], "Importance in percentage":[]})
    for name, score in zip(name,score):
        feature_importance_add = pd.DataFrame({"Metric": name, "Importance in percentage":[score]})
        feature_importance_table = feature_importance_table.append(feature_importance_add)
    
    feature_importance_table = feature_importance_table[['Metric','Importance in percentage']]
    feature_importance_table = feature_importance_table.sort_values('Importance in percentage', ascending=False)
    
    return feature_importance_table

In [42]:
dfd_used = dfd[dfd['CrossedSignal']==1]
metric_data = dfd_used.values[:,0:(dfd_used.shape[1]-2)]
metric_label = dfd_used["FluoroSignal"].values

ada = ADASYN(random_state=42)
[ada_data, ada_label] = ada.fit_resample(metric_data,metric_label)

X_ada_train, X_ada_test, y_ada_train, y_ada_test = train_test_split(ada_data, ada_label, test_size=0.2, random_state=42)

ada_result, ada_lda_model, ada_y_lda_pred, ada_svm_model, ada_y_svm_pred, ada_rf_model,ada_y_rf_pred = \
                                            train_models(X_ada_train, X_ada_test, y_ada_train, y_ada_test)

In [38]:
ada_result.set_index(["Method"], inplace=True)
ada_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.73,0.72,0.71,0.701,0.754
SVM,0.774,0.789,0.821,0.653,0.877
RF,0.824,0.834,0.805,0.816,0.83


ADASYN on all metric data

In [86]:
# ada result on all metric data
ada_result.set_index(["Method"], inplace=True)
ada_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.741,0.768,0.718,0.747,0.737
SVM,0.883,0.884,0.931,0.813,0.946
RF,0.877,0.851,0.894,0.84,0.91


ADASYN on original data (exclude the algorithm data)

In [39]:
ada_result_1, ada_lda_model_1, ada_y_lda_pred_1, ada_svm_model_1, ada_y_svm_pred_1, ada_rf_model_1,ada_y_rf_pred_1 = \
                                            train_models(X_ada_train, metric_data, y_ada_train, metric_label)

In [40]:
ada_result_1.set_index(["Method"], inplace=True)
ada_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.747,0.72,0.957,0.738,0.802
SVM,0.754,0.789,0.98,0.729,0.908
RF,0.89,0.838,0.986,0.885,0.924


In [87]:
import collections
collections.Counter(ada_label)

Counter({0: 792, 1: 789})

###  Using borderlineSMOTE

In [41]:
BS = BorderlineSMOTE(random_state=42)
[bs_data, bs_label] = BS.fit_resample(metric_data,metric_label)

X_bs_train, X_bs_test, y_bs_train, y_bs_test = train_test_split(bs_data, bs_label, test_size=0.2, random_state=42)

bs_result, bs_lda_model, bs_y_lda_pred, bs_svm_model, bs_y_svm_pred, bs_rf_model,bs_y_rf_pred = \
                                            train_models(X_bs_train, X_bs_test, y_bs_train, y_bs_test)

NameError: name 'metric_data' is not defined

In [43]:
bs_result.set_index(["Method"], inplace=True)
bs_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.813,0.782,0.86,0.785,0.847
SVM,0.839,0.848,0.942,0.75,0.944
RF,0.905,0.894,0.918,0.907,0.903


In [26]:
bs_result.set_index(["Method"], inplace=True)
bs_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.81,0.787,0.859,0.779,0.847
SVM,0.842,0.842,0.942,0.756,0.944
RF,0.918,0.892,0.94,0.907,0.931


BorderlineSMOTE on all metrics

In [89]:
bs_result.set_index(["Method"], inplace=True)
bs_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.807,0.815,0.858,0.773,0.847
SVM,0.911,0.903,0.934,0.901,0.924
RF,0.896,0.886,0.937,0.866,0.931


NameError: name 'bs_rf_model' is not defined

### Using original data

In [44]:
X_metric_train, X_metric_test, y_metric_train, y_metric_test = train_test_split(metric_data, metric_label, test_size=0.2, random_state=42)

metric_result, metric_lda_model, metric_y_lda_pred, metric_svm_model, metric_y_svm_pred, metric_rf_model,metric_y_rf_pred = \
                                            train_models(X_metric_train, X_metric_test, y_metric_train, y_metric_test)



In [45]:
metric_result.set_index(["Method"], inplace=True)
metric_result

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.886,0.893,0.914,0.955,0.5
SVM,0.875,0.889,0.88,0.987,0.25
RF,0.902,0.897,0.901,0.994,0.393


### Using background data

In [91]:
# Scheme 1 (separate ftct deposit and (ffct deposits + ffcf background) for fluorescence)
df_1 = data_preparation_scheme(s_num=1)

# add coefficient of variation in df_1

X_train_1, X_test_1, y_train_1, y_test_1, scaler_data_1 = data_preprocessing(df_1)
df_result_1, rs_lda_model_1, y_lda_pred_val_1, rs_svm_model_1, y_svm_pred_val_1, rs_rf_model_1, y_rf_pred_val_1 = \
                                            train_models(X_train_1, X_test_1, y_train_1, y_test_1)

# Scheme 2 (separate ftct and ffct deposits for fluorescence)
#df_2 = data_preparation_scheme(s_num=2)
#X_train_2, X_test_2, y_train_2, y_test_2, scaler_data_2 = data_preprocessing(df_2)
#df_result_2, rs_svm_model_2, y_svm_pred_val_2, rs_rf_model_2, y_rf_pred_val_2 = \
#                                           train_models(X_train_2, X_test_2, y_train_2, y_test_2)

#### On all metric data

In [92]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.924,0.909,0.95,0.904,0.946
SVM,0.94,0.922,0.94,0.946,0.933
RF,0.943,0.936,0.931,0.964,0.919


In [93]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:28]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Std,0.272
Anisotropy_Lin_Deposit_Std,0.265
Retardance_Lin_Deposit_Mean,0.151
Anisotropy_Lin_Deposit_Mean,0.058
t_metric_Deposit_Std,0.047
b_metric_Deposit_Std,0.027
A_metric_Deposit_Std,0.025
Diattenuation_Circ_Deposit_Std,0.017
Q_metric_Deposit_Mean,0.017
Retardance_Circ_Deposit_Std,0.013


In [35]:
df_2 = data_preparation_scheme(s_num=2)

In [36]:
X_train_2, X_test_2, y_train_2, y_test_2, scaler_data_2 = data_preprocessing(df_2)
df_result_2, rs_lda_model_2, y_lda_pred_val_2, rs_svm_model_2, y_svm_pred_val_2, rs_rf_model_2, y_rf_pred_val_2 = \
                                            train_models(X_train_2, X_test_2, y_train_2, y_test_2)

In [37]:
df_result_2.set_index(["Method"], inplace=True)
df_result_2

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.795,0.733,0.756,0.861,0.73
SVM,0.781,0.816,0.717,0.917,0.649
RF,0.822,0.792,0.767,0.917,0.73


In [33]:
df_result_2.set_index(["Method"], inplace=True)
df_result_2

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.698,0.771,0.643,0.45,0.848
SVM,0.755,0.804,0.652,0.75,0.758
RF,0.66,0.847,0.55,0.55,0.727


In [75]:
# train without std of each metric
df_3 = data_preparation_scheme(s_num=3)

X_train_3, X_test_3, y_train_3, y_test_3, scaler_data_3 = data_preprocessing(df_3)
df_result_3, rs_lda_model_3, y_lda_pred_val_3, rs_svm_model_3, y_svm_pred_val_3, rs_rf_model_3, y_rf_pred_val_3 = \
                                           train_models(X_train_3, X_test_3, y_train_3, y_test_3,std=True)

In [44]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.892,0.901,0.931,0.849,0.936
SVM,0.934,0.911,0.926,0.943,0.924
RF,0.949,0.928,0.928,0.975,0.924


df_result_1.set_index(["Method"], inplace=True)
df_result_1

In [31]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.915,0.901,0.94,0.887,0.943
SVM,0.937,0.916,0.943,0.931,0.943
RF,0.946,0.928,0.944,0.95,0.943


In [27]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.892,0.903,0.91,0.847,0.93
SVM,0.905,0.928,0.89,0.903,0.907
RF,0.921,0.935,0.889,0.944,0.901


In [22]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.911,0.898,0.937,0.892,0.933
SVM,0.924,0.917,0.918,0.94,0.906
RF,0.93,0.933,0.914,0.958,0.899


In [19]:
df_result_1.set_index(["Method"], inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.921,0.897,0.935,0.919,0.924
SVM,0.924,0.922,0.916,0.948,0.896
RF,0.934,0.933,0.917,0.965,0.896


In [45]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:12]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Std,0.232
Anisotropy_Lin_Deposit_Std,0.228
Retardance_Lin_Deposit_Mean,0.147
Anisotropy_Lin_Deposit_Mean,0.119
Q_metric_Deposit_Mean,0.058
Depolarization_Power_Deposit_Mean,0.05
Depolarization_Power_Deposit_Std,0.044
Diattenuation_Lin_Deposit_Std,0.038
Polarizance_Lin_Deposit_Std,0.033
Q_metric_Deposit_Std,0.033


In [32]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:12]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Std,0.482
Anisotropy_Lin_Deposit_Std,0.287
Retardance_Lin_Deposit_Mean,0.126
Anisotropy_Lin_Deposit_Mean,0.053
Q_metric_Deposit_Mean,0.026
Diattenuation_Lin_Deposit_Std,0.011
Depolarization_Power_Deposit_Mean,0.008
Polarizance_Lin_Deposit_Std,0.005
Q_metric_Deposit_Std,0.002
Depolarization_Power_Deposit_Std,0.002


In [28]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:12]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Std,0.324
Anisotropy_Lin_Deposit_Std,0.305
Retardance_Lin_Deposit_Mean,0.152
Anisotropy_Lin_Deposit_Mean,0.079
Q_metric_Deposit_Mean,0.045
Depolarization_Power_Deposit_Mean,0.029
Depolarization_Power_Deposit_Std,0.021
Q_metric_Deposit_Std,0.015
Diattenuation_Lin_Deposit_Std,0.011
Polarizance_Lin_Deposit_Std,0.01


In [24]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:12]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Anisotropy_Lin_Deposit_Std,0.423
Retardance_Lin_Deposit_Std,0.409
Retardance_Lin_Deposit_Mean,0.061
Polarizance_Lin_Deposit_Std,0.019
Diattenuation_Lin_Deposit_Std,0.017
Anisotropy_Lin_Deposit_Mean,0.013
Depolarization_Power_Deposit_Std,0.012
Q_metric_Deposit_Mean,0.011
Q_metric_Deposit_Std,0.011
Depolarization_Power_Deposit_Mean,0.01


In [74]:
feature_names = list(df_1.columns.values)
feature_names_1 = feature_names[:12]

feature_importance_1 = rs_rf_model_1.best_estimator_.feature_importances_
importance_table_1 = fimportance_dataframe(feature_names_1,feature_importance_1)


importance_table_1.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Std,0.26719
Anisotropy_Lin_Deposit_Std,0.23843
Retardance_Lin_Deposit_Mean,0.15641
Anisotropy_Lin_Deposit_Mean,0.1149
Q_metric_Deposit_Mean,0.05237
Depolarization_Power_Deposit_Std,0.03998
Q_metric_Deposit_Std,0.03886
Diattenuation_Lin_Deposit_Std,0.02992
Depolarization_Power_Deposit_Mean,0.02305
Polarizance_Lin_Deposit_Std,0.01697


In [76]:
df_result_3.set_index(["Method"], inplace=True)
df_result_3

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.88291,0.893,0.89941,0.88372,0.88194
SVM,0.91139,0.91204,0.8913,0.95349,0.86111
RF,0.91139,0.91997,0.8913,0.95349,0.86111


In [81]:
feature_names_3 = list(df_3.columns.values)
feature_names_3 = feature_names_3[:6]

feature_importance_3 = rs_rf_model_3.best_estimator_.feature_importances_
importance_table_3 = fimportance_dataframe(feature_names_3,feature_importance_3)


importance_table_3.set_index(["Metric"])

Unnamed: 0_level_0,Importance in percentage
Metric,Unnamed: 1_level_1
Retardance_Lin_Deposit_Mean,0.62644
Anisotropy_Lin_Deposit_Mean,0.21777
Q_metric_Deposit_Mean,0.08497
Polarizance_Lin_Deposit_Mean,0.02897
Depolarization_Power_Deposit_Mean,0.0259
Diattenuation_Lin_Deposit_Mean,0.01595


In [31]:
feature_names=list(df_1.columns.values)
feature_names = feature_names[:12]

In [16]:
print("Divide theoretical range: \n")
for name,score in zip(feature_names, rs_rf_model_1.best_estimator_.feature_importances_):
    print(name,score)

Divide theoretical range: 

Depolarization_Power_Deposit_Mean 0.008097302805251963
Depolarization_Power_Deposit_Std 0.020702249127373187
Q_metric_Deposit_Mean 0.029331071854892178
Q_metric_Deposit_Std 0.011650750233509758
Anisotropy_Lin_Deposit_Mean 0.08467855964956157
Anisotropy_Lin_Deposit_Std 0.2981347428684587
Polarizance_Lin_Deposit_Mean 0.007131264769197165
Polarizance_Lin_Deposit_Std 0.015340946828351673
Diattenuation_Lin_Deposit_Mean 0.006902481481991697
Diattenuation_Lin_Deposit_Std 0.01288562744378184
Retardance_Lin_Deposit_Mean 0.14102662105674998
Retardance_Lin_Deposit_Std 0.36411838188088


In [32]:
for name,score in zip(feature_names, rs_rf_model_1.best_estimator_.feature_importances_):
    print(name,score)

Depolarization_Power_Deposit_Mean 0.012447958449987638
Depolarization_Power_Deposit_Std 0.015958446728892253
Q_metric_Deposit_Mean 0.03292933324147724
Q_metric_Deposit_Std 0.009996611122886283
Anisotropy_Lin_Deposit_Mean 0.07746925839396057
Anisotropy_Lin_Deposit_Std 0.3249561491516812
Polarizance_Lin_Deposit_Mean 0.005717668596595977
Polarizance_Lin_Deposit_Std 0.013354590001299285
Diattenuation_Lin_Deposit_Mean 0.004448488773767493
Diattenuation_Lin_Deposit_Std 0.011696534949700127
Retardance_Lin_Deposit_Mean 0.17111383105074268
Retardance_Lin_Deposit_Std 0.3199111295390091


In [34]:
df_result_1.set_index(["Method"],  inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Precision,Recall,CV (mean)
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVM,0.92089,0.91525,0.94186,0.92393
RF,0.93038,0.91667,0.9593,0.93502


In [86]:
df_result_1.set_index(["Method"],  inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Precision,Recall,CV (mean)
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVM,0.93858,0.925,0.959259,0.912015
RF,0.942418,0.925532,0.966667,0.930937


In [16]:
# Scheme 1 (separate ftct deposit and (ffct deposits + ffcf background) for fluorescence)
df_result_1.set_index(["Method"],  inplace=True)
df_result_1

Unnamed: 0_level_0,Accuracy,Precision,Recall,CV (mean)
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVM,0.93858,0.925,0.959259,0.912015
RF,0.942418,0.925532,0.966667,0.929991


In [17]:
# Scheme 2 (separate ftct and ffct deposits for fluorescence)
#df_result_2.set_index(["Method"],  inplace=True)
#df_result_2

Unnamed: 0_level_0,Accuracy,Precision,Recall,CV (mean)
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
SVM,0.820809,0.859259,0.90625,0.849003
RF,0.83237,0.861314,0.921875,0.871795


In [18]:
rnd_clf = RandomForestClassifier(n_estimators=500,max_leaf_nodes=16, n_jobs=-1)

In [22]:
rnd_clf.fit(X_train_1,y_train_1)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=16,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [23]:
y_pred_rf = rnd_clf.predict(X_test_1)

In [24]:
rnd_clf.feature_importances_

array([0.02038201, 0.03848675, 0.05784372, 0.0342613 , 0.10280471,
       0.27417238, 0.00917664, 0.02079857, 0.00728367, 0.04203503,
       0.14365055, 0.24910466])

In [28]:
accuracy_score(y_test_1,y_pred_rf)

0.9404990403071017

In [42]:
rs_rf_model_1.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features=5, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [43]:
rs_rf_model_1.best_score_

0.9263074484944532

In [44]:
rs_rf_model_1.best_params_

{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 5,
 'max_features': 5,
 'min_samples_split': 10,
 'n_estimators': 200}

In [51]:
rs_rf_model_1.best_score_

0.93026941362916

In [52]:
rs_rf_model_1.best_params_

{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 4,
 'max_features': 6,
 'min_samples_split': 6,
 'n_estimators': 400}

In [56]:
rs_rf_model_1.best_score_

0.927892234548336

In [57]:
rs_rf_model_1.best_params_

{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': 5,
 'max_features': 10,
 'min_samples_split': 8,
 'n_estimators': 100}

In [60]:
rs_rf_model_1.best_score_

0.9350237717908082

In [61]:
rs_rf_model_1.best_params_

{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': 3,
 'max_features': 8,
 'min_samples_split': 9,
 'n_estimators': 100}

In [64]:
rs_rf_model_1.best_params_

{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 4,
 'max_features': 5,
 'min_samples_split': 6,
 'n_estimators': 500}

## Results after RF feature selection  

In [35]:
# Scheme 1 (separate ftct deposit and (ffct deposits + ffcf background) for fluorescence)
df_fs = data_preparation_scheme(s_num=1)

# add coefficient of variation in df_1

X_train_fs, X_test_fs, y_train_fs, y_test_fs, scaler_data_fs = data_preprocessing(df_fs)
df_result_fs, rs_lda_model_fs, y_lda_pred_val_fs, rs_svm_model_fs, y_svm_pred_val_fs, rs_rf_model_fs, y_rf_pred_val_fs = \
                                            train_models(X_train_fs, X_test_fs, y_train_fs, y_test_fs)

In [111]:
df_result_fs.set_index(["Method"],  inplace=True)
df_result_fs

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.918,0.9,0.943,0.898,0.94
SVM,0.943,0.922,0.916,0.982,0.899
RF,0.943,0.935,0.931,0.964,0.919


In [33]:
# 2nd test
df_result_fs.set_index(["Method"],  inplace=True)
df_result_fs

Unnamed: 0_level_0,Accuracy,Mean accuracy,Precision,Sensitivity (Recall),Specificity
Method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LDA,0.915,0.901,0.934,0.907,0.924
SVM,0.93,0.922,0.908,0.971,0.882
RF,0.934,0.935,0.917,0.965,0.896


In [38]:
len(y_train_fs)

1262