# Data analysis of the database (Fluoro-Crossed Data Points)

## Table of Contents
[Import Data and Basic preparation](#import)  
[Data Preparation (Ignore)](#dataprepi)  
[Data Preparation (Balancing)(Ignore)](#dataprepb)  
[Training Using Scikit Learn](#sklearn)  
..[Support Vector Machine](#svm)  
..[Ensemble Learning](#ensemble)  
[Training Using Xgboost](#xgboost)  
[Training Using Tensorflow](#tf)  
[Summary of All Classifiers](#sklearn_sum)  
<br>
<br> 
This project uses data points in the database to relate the fluorescence signal of a deposit to its polarization signal. The main goal is to use statistics of the polarization signal to predict whether the deposit shows a fluorescence signal.  
There are four categories of data points. Each name has the form `f?c?`, where `f` stands for fluorescence, `c` for crossed (polarization), and each `?` is `t` (positive) or `f` (negative):
1. ftct: Deposits are fluorescence positive and polarization positive. (num: 789)
2. ftcf: Deposits are fluorescence positive but polarization negative. (num: 20)
3. ffct: Deposits are fluorescence negative but polarization positive. (num: 131)
4. ffcf: Deposits are negative in both fluorescence and polarization signals. Because this category contains too few deposits (num: 7), we use the background retina of ftct deposits as the data points for this category.
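
The naming scheme above can be sketched in a few lines. This is only an illustration on made-up labels: the column names `FluoroSignal` and `CrossedSignal` come from the database, while the helper `category_code` and the toy frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels; the column names follow the database, the values are made up.
toy = pd.DataFrame({
    "FluoroSignal":  [1, 1, 0, 0],
    "CrossedSignal": [1, 0, 1, 0],
})

def category_code(fluoro, crossed):
    """Build the four-letter code, e.g. fluoro=1, crossed=0 gives 'ftcf'."""
    return "f" + ("t" if fluoro == 1 else "f") + "c" + ("t" if crossed == 1 else "f")

# One category code per row
toy["Category"] = [
    category_code(f, c) for f, c in zip(toy["FluoroSignal"], toy["CrossedSignal"])
]
print(toy)
```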


## Import Data and Basic preparation
<a id="import"></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
import os

In [2]:
# Build the path with os.path.join so it works on any operating system
datapath = os.path.join(os.getcwd(), "data", "dbt_m.csv")
df = pd.read_csv(datapath)

In [3]:
list(df)
# the fluorescence variable name: FluoroSignal

['RegionFolder',
 'Subject',
 'Species',
 'Age',
 'Gender',
 'CauseOfDeath',
 'TimeOfDeath',
 'MedicalHistory',
 'Diagnosis1Type',
 'Diagnosis1Level',
 'Is1PrimaryDiagnosis',
 'Diagnosis2Type',
 'Diagnosis2Level',
 'Is2PrimaryDiagnosis',
 'Diagnosis3Type',
 'Diagnosis3Level',
 'Is3PrimaryDiagnosis',
 'Braak_stage_tau',
 'NP_CERAD_Biel',
 'NP_FC_Biel',
 'NP_TC_Biel',
 'NP_PC_Biel',
 'NP_score',
 'DP_CERAD_Biel',
 'DP_score',
 'A_beta_thal',
 'Thal_Phase',
 'CAA_CR',
 'CAA_Abeta_Cb',
 'CAA_A_beta_and_CR',
 'ABC_score',
 'Likelihood_of_AD',
 'SubjectNotes',
 'EntryOrder',
 'MicroscopeMotors',
 'Sample',
 'EyeSource',
 'EyeInitialFixative',
 'EyeInitialFixativePercent',
 'EyeInitialFixingTime',
 'EyeDissectionDoneBy',
 'EyeMountingDoneBy',
 'EyeMountingDate_1',
 'EyeMountingDate_2',
 'EyeMountingDate_3',
 'EyeMountingDate_4',
 'EyeStain',
 'EyeMounting',
 'EyeQuarterPositions',
 'EyeNotes',
 'Region',
 'ImagingDoneBy',
 'XCoordinate',
 'YCoordinate',
 'RadialDistanceFromFovea',
 'ImageMagn

In [4]:
# Adjust some values in the table
# The Q metric
df[["Q_metric_Background_Mean", "Q_metric_Background_Std", "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std", 
    "Q_metric_Full_Mean", "Q_metric_Full_Std"]] = \
df[["Q_metric_Background_Mean", "Q_metric_Background_Std", "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std", 
    "Q_metric_Full_Mean", "Q_metric_Full_Std"]].divide(3) 
# The Linear retardance
df[["Retardance_Lin_Background_Mean","Retardance_Lin_Background_Std", 
    "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std", 
    "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std"]] = \
df[["Retardance_Lin_Background_Mean","Retardance_Lin_Background_Std", 
    "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std", 
    "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std"]].divide(180)

In [5]:
df_label = df[["RegionFolder", "Subject" ,  "FluoroSignal", "CrossedSignal"]]
df_label.set_index(["RegionFolder", "Subject"], inplace=True)
# Statistics of the number 
print("  Number of each class \n"
      "  Fluo_T_Cross_T: " + str(sum(np.multiply(df_label["FluoroSignal"], df_label["CrossedSignal"]))) + "\n" + 
      "  Fluo_T_Cross_F: " + str(sum(np.multiply(df_label["FluoroSignal"] == 1, df_label["CrossedSignal"] == 0))) + "\n"+
      "  Fluo_F_Cross_T: " + str(sum(np.multiply(df_label["FluoroSignal"] == 0, df_label["CrossedSignal"] == 1))) + "\n"+
      "  Fluo_F_Cross_F: " + str(sum(np.multiply(df_label["FluoroSignal"] == 0, df_label["CrossedSignal"] == 0))) + "\n"
     )
# it's better to separate the class and fine tune the training examples

  Number of each class 
  Fluo_T_Cross_T: 789
  Fluo_T_Cross_F: 20
  Fluo_F_Cross_T: 131
  Fluo_F_Cross_F: 7
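
The per-class counts printed above can also be obtained with a single cross-tabulation. The sketch below runs on a small made-up frame (`toy_label` is hypothetical and only stands in for `df_label`), so the numbers it prints are not the counts of the real database.

```python
import pandas as pd

# Hypothetical toy labels standing in for df_label.
toy_label = pd.DataFrame({
    "FluoroSignal":  [1, 1, 1, 0, 0, 1],
    "CrossedSignal": [1, 1, 0, 1, 0, 1],
})

# Rows: fluorescence label, columns: crossed (polarization) label.
counts = pd.crosstab(toy_label["FluoroSignal"], toy_label["CrossedSignal"])
print(counts)
```

Reading `counts.loc[1, 0]` then gives the number of fluorescence-positive, crossed-negative rows, and so on for the other three cells.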



In [6]:
# df for background
dfb = df[["RegionFolder", "Subject",
          "Depolarization_Power_Background_Mean", "Depolarization_Power_Background_Std", 
          "Q_metric_Background_Mean", "Q_metric_Background_Std",
          "Anisotropy_Lin_Background_Mean", "Anisotropy_Lin_Background_Std",
          "Polarizance_Lin_Background_Mean", "Polarizance_Lin_Background_Std",
          "Diattenuation_Lin_Background_Mean", "Diattenuation_Lin_Background_Std",
          "Retardance_Lin_Background_Mean", "Retardance_Lin_Background_Std", 
          "FluoroSignal", "CrossedSignal"
        ]].set_index(["RegionFolder", "Subject"])  # returns a new frame instead of modifying a slice of df
# df for deposits
dfd = df[["RegionFolder", "Subject", 
          "Depolarization_Power_Deposit_Mean", "Depolarization_Power_Deposit_Std", 
          "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std",
          "Anisotropy_Lin_Deposit_Mean", "Anisotropy_Lin_Deposit_Std",
          "Polarizance_Lin_Deposit_Mean", "Polarizance_Lin_Deposit_Std",
          "Diattenuation_Lin_Deposit_Mean", "Diattenuation_Lin_Deposit_Std",
          "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std",
          "FluoroSignal", "CrossedSignal"
        ]].set_index(["RegionFolder", "Subject"])  # returns a new frame instead of modifying a slice of df
# df for full stats
dff =  df[["RegionFolder", "Subject", 
          "Depolarization_Power_Full_Mean", "Depolarization_Power_Full_Std", 
          "Q_metric_Full_Mean", "Q_metric_Full_Std",
          "Anisotropy_Lin_Full_Mean", "Anisotropy_Lin_Full_Std",
          "Polarizance_Lin_Full_Mean", "Polarizance_Lin_Full_Std",
          "Diattenuation_Lin_Full_Mean", "Diattenuation_Lin_Full_Std",
          "Retardance_Lin_Full_Mean", "Retardance_Lin_Full_Std",
          "FluoroSignal", "CrossedSignal"
        ]].set_index(["RegionFolder", "Subject"])  # returns a new frame instead of modifying a slice of df


## Data Preparation (Ignore) 
<a id="dataprepi"></a>

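Make the data matrix more balanced:
1. Downsample the fluorescence-positive, crossed-positive (ftct) data points to a level similar to the other classes.
2. The number of fluorescence-negative, crossed-negative (ffcf) data points is too small (possibly use some background regions instead?).
3. Replace the deposit averages of the crossed-negative (Cross_F) data points with their full averages.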

In [218]:
# .copy() gives independent frames, so later assignments do not act on a slice of dfd
df_ftct = dfd.loc[(dfd["FluoroSignal"]==1) & (dfd["CrossedSignal"] == 1)].copy()
df_ftcf = dfd.loc[(dfd["FluoroSignal"]==1) & (dfd["CrossedSignal"] == 0)].copy()
df_ffct = dfd.loc[(dfd["FluoroSignal"]==0) & (dfd["CrossedSignal"] == 1)].copy()
df_ffcf = dfd.loc[(dfd["FluoroSignal"]==0) & (dfd["CrossedSignal"] == 0)].copy()

In [219]:
# Use K-means clustering to extract 100 representative points from the 789 ftct data points
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, max_iter=3000).fit(df_ftct.iloc[:, 0:12].values)
ftct_c = kmeans.cluster_centers_
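
Note that `cluster_centers_` are averages of the points in each cluster, not measured data points. If actual samples are preferred as representatives, one possible variant is to pick the sample closest to each centre. The sketch below is an assumption and not part of the original analysis; it uses random toy data and a smaller number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Random toy data standing in for the 12 deposit features (not the real data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 12))

kmeans_toy = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_toy)

# For each cluster centre, find the index of the closest row of X_toy.
nearest_idx, _ = pairwise_distances_argmin_min(kmeans_toy.cluster_centers_, X_toy)
representatives = X_toy[nearest_idx]
print(representatives.shape)
```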

In [220]:
# Replace the statistics of ftcf and ffcf deposits with their full average
dff_ftcf = dff.loc[(dff["FluoroSignal"]==1) & (dff["CrossedSignal"] == 0)]
dff_ffcf = dff.loc[(dff["FluoroSignal"]==0) & (dff["CrossedSignal"] == 0)]
df_ftcf.iloc[:, 0:12] = dff_ftcf.iloc[:, 0:12].values
df_ffcf.iloc[:, 0:12] = dff_ffcf.iloc[:, 0:12].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
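
The warning above appears when values are assigned to a frame that pandas regards as a slice of another frame. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here, is shown on a made-up frame (the column name `Feature_Mean` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the feature column name is illustrative only.
toy = pd.DataFrame({
    "Feature_Mean": [0.1, 0.2, 0.3, 0.4],
    "FluoroSignal": [1, 1, 0, 0],
})

# .copy() makes the subset an independent frame, so the assignment below
# modifies it unambiguously and leaves `toy` untouched.
subset = toy.loc[toy["FluoroSignal"] == 1].copy()
subset.loc[:, "Feature_Mean"] = np.array([9.0, 9.0])

print(subset)
print(toy)
```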


In [221]:
# Create the data matrix
X = np.concatenate((ftct_c, df_ftcf.iloc[:, 0:12].values, 
                    df_ffct.iloc[:, 0:12].values, df_ffcf.iloc[:, 0:12].values),
                    axis=0)
y_full = np.concatenate((np.tile([1,1], (ftct_c.shape[0], 1)), 
                         df_ftcf.iloc[:, 12:14].values,
                         df_ffct.iloc[:, 12:14].values, 
                         df_ffcf.iloc[:, 12:14].values), 
                         axis=0)

## Data Preparation (Balancing) (Ignore)
<a id="dataprepb"></a>

In [11]:
# Replace the statistics of ftcf and ffcf deposits with their full average
# dff_ftcf = dff.loc[(dff["FluoroSignal"]==1) & (dff["CrossedSignal"] == 0)]
# dff_ffcf = dff.loc[(dff["FluoroSignal"]==0) & (dff["CrossedSignal"] == 0)]
# df_ftcf.iloc[:, 0:12] = dff_ftcf.iloc[:, 0:12].values
# df_ffcf.iloc[:, 0:12] = dff_ffcf.iloc[:, 0:12].values

dfd.loc[(dff["FluoroSignal"]==1) & (dff["CrossedSignal"] == 0)] = dff.loc[(dff["FluoroSignal"]==1) & (dff["CrossedSignal"] == 0)].values.copy()
dfd.loc[(dff["FluoroSignal"]==0) & (dff["CrossedSignal"] == 0)] = dff.loc[(dff["FluoroSignal"]==0) & (dff["CrossedSignal"] == 0)].values.copy()     

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
  


In [12]:
# Adding an equal number of background averages to the data set as ffcf (fluorescence false crossed false)
# Extracting background of ftct data 
dfb_ffcf = dfb[(dfd["FluoroSignal"]==1) & (dfd["CrossedSignal"]==1)].copy()
# change the name of columns
oldname = ["Depolarization_Power_Background_Mean", "Depolarization_Power_Background_Std", 
           "Q_metric_Background_Mean", "Q_metric_Background_Std",
           "Anisotropy_Lin_Background_Mean", "Anisotropy_Lin_Background_Std",
           "Polarizance_Lin_Background_Mean", "Polarizance_Lin_Background_Std",
           "Diattenuation_Lin_Background_Mean", "Diattenuation_Lin_Background_Std",
           "Retardance_Lin_Background_Mean", "Retardance_Lin_Background_Std", 
           "FluoroSignal", "CrossedSignal"]
newname = ["Depolarization_Power_Deposit_Mean", "Depolarization_Power_Deposit_Std", 
           "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std",
           "Anisotropy_Lin_Deposit_Mean", "Anisotropy_Lin_Deposit_Std",
           "Polarizance_Lin_Deposit_Mean", "Polarizance_Lin_Deposit_Std",
           "Diattenuation_Lin_Deposit_Mean", "Diattenuation_Lin_Deposit_Std",
           "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std",
           "FluoroSignal", "CrossedSignal"]
namedict = {oldname[i]: newname[i] for i in range(len(oldname))}
# Rename the columns in place
dfb_ffcf.rename(columns = namedict, inplace=True)
dfb_ffcf["FluoroSignal"] = np.zeros(dfb_ffcf.shape[0], dtype=np.int32)
dfb_ffcf["CrossedSignal"] = np.zeros(dfb_ffcf.shape[0], dtype=np.int32)
dfd = pd.concat([dfd, dfb_ffcf])


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


## Training Using Scikit Learn
<a id="sklearn"></a>

In [119]:
# import basic functions
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix 
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
# Train support vector machine
from sklearn import svm
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier

In [120]:
# select data_preparation scheme
def data_preparation_scheme(s_num):
    if s_num==1:
        # (ftct deposits), (ffct deposits + background(ffcf)) 
        dfct = dfd[dfd["CrossedSignal"]==1].copy()
        df_b = dfb.sample(frac=0.7, random_state=42).iloc[0:658, :]
        # change the name of columns
        oldname = ["Depolarization_Power_Background_Mean", "Depolarization_Power_Background_Std", 
                   "Q_metric_Background_Mean", "Q_metric_Background_Std",
                   "Anisotropy_Lin_Background_Mean", "Anisotropy_Lin_Background_Std",
                   "Polarizance_Lin_Background_Mean", "Polarizance_Lin_Background_Std",
                   "Diattenuation_Lin_Background_Mean", "Diattenuation_Lin_Background_Std",
                   "Retardance_Lin_Background_Mean", "Retardance_Lin_Background_Std", 
                   "FluoroSignal", "CrossedSignal"]
        newname = ["Depolarization_Power_Deposit_Mean", "Depolarization_Power_Deposit_Std", 
                   "Q_metric_Deposit_Mean", "Q_metric_Deposit_Std",
                   "Anisotropy_Lin_Deposit_Mean", "Anisotropy_Lin_Deposit_Std",
                   "Polarizance_Lin_Deposit_Mean", "Polarizance_Lin_Deposit_Std",
                   "Diattenuation_Lin_Deposit_Mean", "Diattenuation_Lin_Deposit_Std",
                   "Retardance_Lin_Deposit_Mean", "Retardance_Lin_Deposit_Std",
                   "FluoroSignal", "CrossedSignal"]
        namedict = {oldname[i]: newname[i] for i in range(len(oldname))}
        # Rename the columns in place
        df_b.rename(columns = namedict, inplace=True)
        df_b["FluoroSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_b["CrossedSignal"] = np.zeros(df_b.shape[0], dtype=np.int32)
        df_r = pd.concat([dfct, df_b])
        return(df_r)
    else:
        dfct_ft = dfd[(dfd["FluoroSignal"]==1) & (dfd["CrossedSignal"]==1)].copy()
        dfct_ff = dfd[(dfd["FluoroSignal"]==0) & (dfd["CrossedSignal"]==1)].copy()
        dfct_ft = dfct_ft.sample(frac=0.5, random_state=9)[0:3*dfct_ff.shape[0]]
        df_r = pd.concat([dfct_ft, dfct_ff])    
        return(df_r)    
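
The background-to-deposit renaming used in scheme 1 can be written more compactly. The following is a sketch under the assumption that every background column differs from its deposit counterpart only by the substring `_Background_`; the toy frame and the names `toy_b` and `toy_renamed` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two background columns and the two label columns.
toy_b = pd.DataFrame({
    "Q_metric_Background_Mean": [0.1, 0.2],
    "Q_metric_Background_Std": [0.01, 0.02],
    "FluoroSignal": [1, 1],
    "CrossedSignal": [1, 1],
})

# Map every "..._Background_..." name to the matching "..._Deposit_..." name.
# Names without that substring (the label columns) map to themselves.
namedict = {c: c.replace("_Background_", "_Deposit_") for c in toy_b.columns}

# rename() without inplace returns a new frame.
toy_renamed = toy_b.rename(columns=namedict)
# Background rows are labelled fluorescence negative and crossed negative.
toy_renamed = toy_renamed.assign(FluoroSignal=0, CrossedSignal=0)
print(list(toy_renamed.columns))
```

Building the mapping from the column names avoids keeping two long, parallel lists in sync.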

### Support vector machine
<a id="svm"></a>

In [121]:
def train_svm(data_train, label_train):
    # Equal class weights (no re-weighting of the two classes)
    wdict_svm = {0: 1, 1: 1}
    clf_svm = svm.SVC(class_weight=wdict_svm)
    # Search space for the randomized hyperparameter search
    param_dist = {"kernel": ["rbf", "poly"], 
                  "degree": [1, 2, 3],          # only used by the "poly" kernel
                  "gamma": sp_randint(1, 10),   # start at 1: gamma=0 is not a usable kernel coefficient
                  "shrinking": [True, False]
                  }
    n_iter_search = 30
    rs_svm = RandomizedSearchCV(clf_svm, param_distributions=param_dist,
                                      n_iter=n_iter_search, cv=5)
    # train
    rs_svm.fit(data_train, label_train)
    return rs_svm
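
The search above draws `gamma` from a small set of integers. A possible alternative, shown here only as a sketch, is to draw `gamma` (and the regularization strength `C`) from log-uniform distributions, which covers several orders of magnitude. The use of `scipy.stats.loguniform`, the tuning of `C`, the chosen ranges, and the synthetic data are all assumptions that are not part of the original notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 12 polarization features (not the real data).
X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)

# Assumed search space: gamma and C drawn log-uniformly, both strictly positive.
param_dist = {
    "kernel": ["rbf"],
    "gamma": loguniform(1e-3, 1e1),
    "C": loguniform(1e-1, 1e2),
}
search = RandomizedSearchCV(SVC(), param_distributions=param_dist,
                            n_iter=10, cv=5, random_state=0)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```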
    

### Ensemble learning (sklearn)
<a id="ensemble"></a>

In [122]:
def train_rf(data_train, label_train):
    # Equal class weights (no re-weighting of the two classes)
    wdict_rf = {0: 1, 1: 1}
    clf_rf = RandomForestClassifier(class_weight=wdict_rf)
    param_dist = {"max_depth": [2, 3, 4, 5],
                  "n_estimators": [200, 500],
                  "bootstrap": [True, False],
                  "max_features": sp_randint(1, 12),
                  "min_samples_split": sp_randint(2, 11),
                  "criterion": ["gini", "entropy"]}
    n_iter_search = 30
    rs_rf = RandomizedSearchCV(clf_rf, param_distributions=param_dist,
                                      n_iter=n_iter_search, cv=5)
    # train
    rs_rf.fit(data_train, label_train)
    return rs_rf

## Training using Xgboost
<a id="xgboost"></a>

In [16]:
# import xgboost as xgb

# # # weight the classes
# # w = np.ones(y_train.shape[0])
# # w[y_train == 3] = 40
# # w[y_train == 4] = 6
# wt = 1
# # Use the scikit learn wrapper
# clf_xgb = xgb.XGBClassifier(max_depth=5, objective="binary:hinge", scale_pos_weight=wt, silent=1)
# # clf_xgb = xgb.XGBClassifier(max_depth=5, num_class=4, objective="multi:softmax")
# clf_xgb.fit(X_train, y_train)
# y_xgb_predict = clf_xgb.predict(X_test)
# df_temp = metric_scores(clf_xgb, "XGBoost", X_test, y_test, y_xgb_predict, X_train, y_train)

# df_sklearn_result = df_sklearn_result.append(df_temp)




## Training Using Tensorflow
<a id="tf"></a>

In [None]:
# import tensorflow as tf
# With only 12 features, a neural network is unlikely to give better results than the models above

## Summary of All Classifiers
<a id="sklearn_sum"></a>

In [123]:
# Function for preprocessing: generate the training and test data sets
def data_preprocessing(df_in):
    # Prepare the data matrix (all columns except the two label columns) and the labels
    X_r = df_in.iloc[:, 0:(df_in.shape[1]-2)].values
    y = df_in["FluoroSignal"].values
    # Standardize X (note: the scaler is fitted on all rows, before the train/test split)
    scaler = preprocessing.StandardScaler().fit(X_r)
    X = scaler.transform(X_r)
    X_train_, X_test_, y_train_, y_test_ = train_test_split(X, y, test_size=0.33, random_state=42) 
    return X_train_, X_test_, y_train_, y_test_, scaler

# Define a function that collects the scores of a fitted search object in a dataframe
def metric_scores(m_clf, mname, truevalt, predictvalt):
    # Work on copies of the true and predicted labels
    trueval = truevalt.copy()
    predictval = predictvalt.copy()
    accuracy = accuracy_score(trueval, predictval)
    precision = precision_score(trueval, predictval)
    recall = recall_score(trueval, predictval)
    # Mean cross-validation score of the best estimator found by the search
    cvscore = m_clf.best_score_    
    df_scores = pd.DataFrame({"Method": mname, "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], 
                              "CV (mean)": [cvscore]})
    return df_scores

def train_models(X_train_in, X_test_in, y_train_in, y_test_in):  
    # svm
    rs_svm_t = train_svm(X_train_in, y_train_in)
    y_svm_predict = rs_svm_t.predict(X_test_in)
    df_temp1 = metric_scores(rs_svm_t, "SVM", y_test_in, y_svm_predict)
    # rf
    rs_rf_t = train_rf(X_train_in, y_train_in)
    y_rf_predict = rs_rf_t.predict(X_test_in)    
    df_temp2 = metric_scores(rs_rf_t, "RF", y_test_in, y_rf_predict)
    # DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
    df_result = pd.concat([df_temp1, df_temp2])
    return df_result, rs_svm_t, y_svm_predict, rs_rf_t, y_rf_predict
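
In `data_preprocessing` the scaler is fitted on all rows before the train/test split, so statistics of the test rows influence the scaling. One common way to avoid this, sketched below on synthetic data, is to put the scaler and the classifier in a pipeline so that the scaler is refitted on the training rows only. The pipeline, the default `SVC` settings, and the toy data are assumptions and were not used to produce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 12 polarization features (not the real data).
X_toy, y_toy = make_classification(n_samples=300, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

# The scaler is fitted inside each cross-validation fold and on the training set only.
model = make_pipeline(StandardScaler(), SVC())
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

model.fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(cv_scores.mean(), test_acc)
```

The same pipeline object can be passed to `RandomizedSearchCV`; parameter names then take a step prefix such as `svc__gamma`.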

In [132]:
# Scheme 1 (separate ftct deposit and (ffct deposits + ffcf background) for fluorescence)
df_1 = data_preparation_scheme(s_num=1)
X_train_1, X_test_1, y_train_1, y_test_1, scaler_data_1 = data_preprocessing(df_1)
df_result_1, rs_svm_model_1, y_svm_pred_val_1, rs_rf_model_1, y_rf_pred_val_1 = \
                                            train_models(X_train_1, X_test_1, y_train_1, y_test_1)

# Scheme 2 (separate ftct and ffct deposits for fluorescence)
df_2 = data_preparation_scheme(s_num=2)
X_train_2, X_test_2, y_train_2, y_test_2, scaler_data_2 = data_preprocessing(df_2)
df_result_2, rs_svm_model_2, y_svm_pred_val_2, rs_rf_model_2, y_rf_pred_val_2 = \
                                            train_models(X_train_2, X_test_2, y_train_2, y_test_2)


In [133]:
# Scheme 1 (separate ftct deposit and (ffct deposits + ffcf background) for fluorescence)
df_result_1.set_index(["Method"],  inplace=True)
df_result_1

| Method | Accuracy | Precision | Recall | CV (mean) |
| --- | --- | --- | --- | --- |
| SVM | 0.93858 | 0.928058 | 0.955556 | 0.912015 |
| RF | 0.940499 | 0.925267 | 0.962963 | 0.929991 |


In [134]:
# Scheme 2 (separate ftct and ffct deposits for fluorescence)
df_result_2.set_index(["Method"],  inplace=True)
df_result_2

| Method | Accuracy | Precision | Recall | CV (mean) |
| --- | --- | --- | --- | --- |
| SVM | 0.809249 | 0.827586 | 0.9375 | 0.851852 |
| RF | 0.83237 | 0.866667 | 0.914062 | 0.868946 |
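
The Accuracy, Precision, and Recall columns reported above all derive from the four cells of a binary confusion matrix. The sketch below shows the relationship on made-up labels (`y_true` and `y_pred` are hypothetical and not taken from the experiments).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels, not taken from the experiments above.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # fraction of predicted positives that are correct
recall = tp / (tp + fn)                      # fraction of true positives that are found
print(accuracy, precision, recall)
```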
