## Machine Learning on AHEAD blood cancer patients data -- David Lin 20230705

### Terminology

#### FSC (Forward Scatter): light scattered in the forward direction, used to estimate cell size
- FSC-A, FSC-H, and FSC-W are parameters commonly used in flow cytometry to analyze and sort cells based on their physical and optical properties.

    - FSC-A : FSC-A (Forward Scatter Area) is the integrated area of the forward scatter pulse recorded as a cell passes through the laser beam. It is roughly proportional to the cell's size.

    - FSC-H : FSC-H (Forward Scatter Height) is the peak intensity of the forward scatter pulse. It also increases with the cell's size.

    - FSC-W : FSC-W (Forward Scatter Width) is the duration of the forward scatter pulse, which reflects how long the cell takes to cross the beam. Compared against area or height, it helps separate single cells from doublets and clumps.

- Together, these parameters are used to create a scatter plot known as a forward scatter (FSC) plot, which is used to identify and gate cells of interest for further analysis or sorting.
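
As one illustration of gating on forward scatter, single cells are commonly separated from doublets by comparing pulse area with pulse height. The sketch below assumes a pandas DataFrame with `FSC-A` and `FSC-H` columns. The function name `gate_singlets` and the ratio bounds are hypothetical and would need tuning for a real instrument.

```python
import pandas as pd

def gate_singlets(df, low=0.8, high=1.5):
    """Keep events whose FSC-A / FSC-H ratio lies within [low, high].

    The bounds are illustrative assumptions, not instrument defaults.
    """
    ratio = df["FSC-A"] / df["FSC-H"]
    return df[(ratio >= low) & (ratio <= high)]

# Toy events: the third one has a large area for its height (doublet-like)
events = pd.DataFrame({
    "FSC-A": [80000.0, 90000.0, 180000.0],
    "FSC-H": [70000.0, 75000.0, 80000.0],
})
singlets = gate_singlets(events)
print(len(singlets), "of", len(events), "events kept")
```

In practice the gate is usually drawn by eye on an FSC-A versus FSC-H plot; a fixed ratio is only a rough stand-in for that.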


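#### SSC (Side Scatter): light scattered to the side, used to analyze cell granularity. This can help tell whether a cell is healthy; wide-angle scatter reflects the granularity of the cell surface and the density of the cytoplasm inside the cell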

- SSC-A, SSC-H, and SSC-W are additional parameters used in flow cytometry to analyze and sort cells based on their physical and optical properties.

    - SSC-A : SSC-A (Side Scatter Area) measures the intensity of light scattered to the side as a cell passes through a laser beam. It is proportional to the granularity and complexity of the cell, as well as its refractive index.

    - SSC-H : SSC-H (Side Scatter Height) is the peak intensity of the side scatter pulse. It also increases with the granularity and internal complexity of the cell.

    - SSC-W : SSC-W (Side Scatter Width) is the duration of the side scatter pulse. Like FSC-W, it is mainly used together with area or height to separate single cells from doublets.

- Similar to FSC parameters, SSC parameters are used to create a scatter plot known as a side scatter (SSC) plot, which is used to identify and gate cells of interest for further analysis or sorting. Together, FSC and SSC parameters can provide a wealth of information about cells, including cell size, granularity, complexity, and morphology.
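
A simple way to combine the two scatter measurements is a rectangular gate on `FSC-A` and `SSC-A`. The sketch below is a minimal example on made-up events; the function name `rectangle_gate` and the range values are assumptions chosen for illustration, not thresholds taken from this dataset.

```python
import pandas as pd

def rectangle_gate(df, fsc_range, ssc_range):
    """Keep events whose FSC-A and SSC-A both fall inside the given ranges."""
    in_fsc = df["FSC-A"].between(*fsc_range)
    in_ssc = df["SSC-A"].between(*ssc_range)
    return df[in_fsc & in_ssc]

# Toy events (made-up values)
events = pd.DataFrame({
    "FSC-A": [30000.0, 85000.0, 90000.0, 200000.0],
    "SSC-A": [10000.0, 45000.0, 150000.0, 60000.0],
})
gated = rectangle_gate(events, fsc_range=(50000, 150000), ssc_range=(20000, 100000))
print(gated)
```

Only the second event passes here: the first and last fall outside the FSC range, and the third falls outside the SSC range.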

#### FJComp : prefix on the compensated fluorescence channels in these FCS files (it appears to come from FlowJo's compensated export rather than naming a standalone program)
- 'FJComp-APC-A'  
- 'FJComp-APC-H7-A'
- 'FJComp-APC-R700-A'
- 'FJComp-BB630-A'
- 'FJComp-BB660-P-A'
- 'FJComp-BB700-P-A'
- 'FJComp-BB790-P-A'
- 'FJComp-BUV395-A'
- 'FJComp-BUV496-A'
- 'FJComp-BUV563-A'
- 'FJComp-BUV615-P-A'
- 'FJComp-BUV661-A'
- 'FJComp-BUV737-A'
- 'FJComp-BUV805-A' 
- 'FJComp-BV421-A'
- 'FJComp-BV480-A'
- 'FJComp-BV570-A'
- 'FJComp-BV605-A'
- 'FJComp-BV650-A'
- 'FJComp-BV711-A'
- 'FJComp-BV750-P-A'
- 'FJComp-BV786-A'
- 'FJComp-BYG584-A'
- 'FJComp-BYG670-A'
- 'FJComp-BYG790-A'
- 'FJComp-FITC-A'
- 'FJComp-PE-CF594-A'
- 'FJComp-PE-Cy5.5-A'
- 'Time'
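
Because the fluorescence channels share the `FJComp-` prefix, they can be separated from the scatter and `Time` channels by name. The helper below is a hypothetical convenience function, shown on a shortened column list rather than the full set above.

```python
def split_channels(columns, prefix="FJComp-"):
    """Split column names into prefixed fluorescence channels and the rest."""
    fluorescence = [c for c in columns if c.startswith(prefix)]
    other = [c for c in columns if not c.startswith(prefix)]
    return fluorescence, other

# Shortened example list; the real files have many more channels
columns = ["FSC-A", "SSC-A", "FJComp-APC-A", "FJComp-FITC-A", "Time"]
fluorescence, other = split_channels(columns)
print(fluorescence)
print(other)
```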

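#### Fluorescence: the signal emitted by fluorescent labels on the cell

- In cells undergoing apoptosis, FSC decreases, while SSC first increases and then decreases. This pattern can be used to characterize cells qualitatively.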

## A. Testing One Patient's Data

### 1. Import the necessary packages and set the data path

In [113]:
import FlowCal
import pandas as pd

filename = "../raw_fcs/flowrepo_covid_EU_002_flow_001/export_COVID19 samples 23_04_20_ST3_COVID19_HC_005 ST3 230420_016_Live_cells.fcs"

### 2. Attach the feature names to the FCS data

In [114]:
def concat_FCSdata_FeatureName(filename):
    """Load an FCS file and return its events as a DataFrame with named columns."""
    # Load the FCS file (FCSData is a NumPy array subclass: one row per event)
    fcs_file = FlowCal.io.FCSData(filename)

    # Channel names stored in the file, e.g. 'FSC-A', 'SSC-A', ...
    list_feature = list(fcs_file.channels)

    # Convert the FCS data to a DataFrame and name its columns after the channels
    df = pd.DataFrame(fcs_file)
    df.columns = list_feature

    return df

df = concat_FCSdata_FeatureName(filename)
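
The conversion step itself does not depend on FlowCal. The sketch below repeats it on a plain NumPy array standing in for the loaded events, and adds a shape check so that a mismatch between the data and the channel names fails early. The function name `events_to_dataframe` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def events_to_dataframe(events, channels):
    """Build a DataFrame from an (n_events, n_channels) array and channel names."""
    events = np.asarray(events)
    if events.ndim != 2 or events.shape[1] != len(channels):
        raise ValueError(
            f"expected {len(channels)} columns, got array of shape {events.shape}"
        )
    return pd.DataFrame(events, columns=list(channels))

# Two toy events with three channels (made-up values)
toy_events = np.array([[88499.0, 60463.0, 119.0],
                       [86781.0, 48435.0, 47.0]])
toy_df = events_to_dataframe(toy_events, ("FSC-A", "SSC-A", "FJComp-APC-A"))
print(toy_df)
```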

### 3. Data analysis

In [119]:
# Analyze dataframe
display(df.describe())

# Check missing values (wrapped in display() so the result is shown mid-cell)
display(df.isna().sum())

def df_descripbe(df):
    df_summary = df.describe()
    series_max = df_summary.loc['max']
    series_min = df_summary.loc['min']
    series_mean = df_summary.loc['mean']
    series_std = df_summary.loc['std']
    
    # convert the Series to a DataFrame
    df_max_column = series_max.to_frame()
    df_min_column = series_min.to_frame()
    df_mean_column = series_mean.to_frame()
    df_std_column = series_std.to_frame()
    
    # transpose so each statistic becomes one row with one column per channel
    df_max = df_max_column.transpose()
    df_min = df_min_column.transpose()
    df_mean = df_mean_column.transpose()
    df_std = df_std_column.transpose()

    return df_max, df_min, df_mean, df_std

df_max, df_min, df_mean, df_std = df_descripbe(df)
display(df_max)
display(df_min)
display(df_mean)
display(df_std)

| | FSC-A | FSC-H | FSC-W | SSC-A | SSC-H | SSC-W | FJComp-APC-A | FJComp-APC-H7-A | FJComp-APC-R700-A | FJComp-BB630-A | ... | FJComp-BV711-A | FJComp-BV750-P-A | FJComp-BV786-A | FJComp-BYG584-A | FJComp-BYG670-A | FJComp-BYG790-A | FJComp-FITC-A | FJComp-PE-CF594-A | FJComp-PE-Cy5.5-A | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | ... | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 | 363314.0 |
| mean | 88499.132812 | 73202.773438 | 149273.890625 | 60463.1875 | 54483.792969 | 115949.4375 | 119.033829 | 21.407328 | 1851.710938 | 166.880081 | ... | 171.879883 | 111.18026 | 55.604519 | 293.2739 | 330.207611 | 142.243225 | 405.172821 | 116.856743 | 1378.702637 | 5332.581543 |
| std | 18322.863281 | 15825.029297 | 5798.940918 | 31521.603516 | 25982.580078 | 8930.786133 | 466.302155 | 61.23489 | 919.425415 | 737.956726 | ... | 290.328186 | 326.396332 | 152.417572 | 3963.492 | 1663.515137 | 273.108826 | 1215.315308 | 853.33667 | 3546.534424 | 3093.984863 |
| min | 34308.964844 | 27766.271484 | 121543.507812 | 9385.411133 | 8974.638672 | 86993.296875 | -93330.976562 | -383.226288 | -479.190582 | -118570.117188 | ... | -51060.292969 | -6400.95752 | -1003.2146 | -85025.84 | -70170.914062 | -296.121155 | -6568.705078 | -119198.28125 | -2116.27002 | 4.671663 |
| 25% | 78392.460938 | 64116.529297 | 145733.660156 | 40944.28418 | 38149.447266 | 110367.542969 | -44.070578 | -11.892114 | 1247.43573 | 52.590258 | ... | 47.279906 | -70.082777 | -44.090408 | 72.36685 | 2.168737 | 14.116482 | 54.44526 | -3.043472 | 27.167182 | 2654.671143 |
| 50% | 86781.671875 | 71337.152344 | 148671.148438 | 48435.183594 | 44922.826172 | 113756.277344 | 47.65797 | 17.212558 | 1634.454773 | 100.995693 | ... | 176.789856 | 45.10314 | 44.708937 | 186.9438 | 58.717863 | 64.899448 | 84.50153 | 77.916611 | 89.443344 | 5332.649414 |
| 75% | 95824.46875 | 79414.636719 | 151750.621094 | 64532.243164 | 58898.02832 | 119661.699219 | 156.157173 | 50.606909 | 2181.860229 | 175.944347 | ... | 313.858521 | 207.577816 | 145.387432 | 394.0524 | 147.579445 | 178.879021 | 183.105476 | 180.233746 | 182.419765 | 8012.412354 |
| max | 210290.15625 | 189928.171875 | 260692.96875 | 218799.9375 | 181892.03125 | 234528.984375 | 15816.052734 | 236.911606 | 22646.830078 | 172914.328125 | ... | 21963.839844 | 12576.303711 | 3405.897461 | 2271936.0 | 127783.210938 | 52144.199219 | 42998.929688 | 457087.375 | 40289.757812 | 10687.269531 |
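
The same four statistics can also be computed in one call with `DataFrame.agg`, which avoids slicing the output of `describe()`. This is an alternative sketch on a toy DataFrame, not the method used in the notebook; `toy`, `stats`, and `toy_max` are illustrative names.

```python
import pandas as pd

# Toy events with two channels (made-up values)
toy = pd.DataFrame({"FSC-A": [1.0, 2.0, 3.0], "SSC-A": [10.0, 20.0, 60.0]})

# One row per statistic, one column per channel
stats = toy.agg(["max", "min", "mean", "std"])

# Selecting with a list keeps the statistic as a one-row DataFrame
toy_max = stats.loc[["max"]]

print(stats)
print(toy_max)
```

As in `describe()`, `std` here is the sample standard deviation (`ddof=1`).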


### 4. Attach the label to the DataFrame and encode it

In [118]:
import pandas as pd

# Read the Excel label sheet into a DataFrame
EU_label = pd.read_excel('EU_label.xlsx')
display(EU_label)

# The patient ID is the name of the folder that contains the FCS file
Patient = filename.split('/')[-2]

# Define the mapping from string labels to integer labels
label_map = {
    'Sick': 1,
    'Healthy': 0,
}

# Attach the label from EU_label.xlsx to the matching patient.
# Iterating over IDs and labels together keeps each label tied to its own row.
for ID, patient_label in zip(EU_label['file_flow_id'], EU_label['label']):
    if Patient == ID:
        df_max.insert(0, 'Patient_ID', ID)
        df_max.insert(1, 'COVID19', patient_label)
        df_max['Label'] = df_max['COVID19'].map(label_map)
        print(' < Below : The df with Label > ')
        display(df_max)

| | file_flow_id | label |
|---|---|---|
| 0 | flowrepo_covid_EU_002_flow_001 | Healthy |
| 1 | flowrepo_covid_EU_003_flow_001 | Healthy |
| 2 | flowrepo_covid_EU_004_flow_001 | Healthy |
| 3 | flowrepo_covid_EU_005_flow_001 | Healthy |
| 4 | flowrepo_covid_EU_006_flow_001 | Healthy |
| 5 | flowrepo_covid_EU_007_flow_001 | Healthy |
| 6 | flowrepo_covid_EU_008_flow_001 | Healthy |
| 7 | flowrepo_covid_EU_009_flow_001 | Healthy |
| 8 | flowrepo_covid_EU_010_flow_001 | Healthy |
| 9 | flowrepo_covid_EU_011_flow_001 | Healthy |


 < Below : The df with Label > 


| | Patient_ID | COVID19 | FSC-A | FSC-H | FSC-W | SSC-A | SSC-H | SSC-W | FJComp-APC-A | FJComp-APC-H7-A | ... | FJComp-BV750-P-A | FJComp-BV786-A | FJComp-BYG584-A | FJComp-BYG670-A | FJComp-BYG790-A | FJComp-FITC-A | FJComp-PE-CF594-A | FJComp-PE-Cy5.5-A | Time | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| max | flowrepo_covid_EU_002_flow_001 | Healthy | 210290.15625 | 189928.171875 | 260692.96875 | 218799.9375 | 181892.03125 | 234528.984375 | 15816.052734 | 236.911606 | ... | 12576.303711 | 3405.897461 | 2271936.0 | 127783.210938 | 52144.199219 | 42998.929688 | 457087.375 | 40289.757812 | 10687.269531 | 0 |
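
The label lookup can also be written without a loop by indexing the label sheet by patient ID. The sketch below uses a toy stand-in for `EU_label.xlsx`; the second ID and its label are made up so that both classes appear, and `label_by_id` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for EU_label.xlsx; the second row is hypothetical
toy_labels = pd.DataFrame({
    "file_flow_id": ["flowrepo_covid_EU_002_flow_001", "hypothetical_patient_X"],
    "label": ["Healthy", "Sick"],
})
label_map = {"Sick": 1, "Healthy": 0}

# Index the labels by patient ID so each lookup is a single step
label_by_id = toy_labels.set_index("file_flow_id")["label"]

patient = "flowrepo_covid_EU_002_flow_001"
text_label = label_by_id[patient]
encoded = label_map[text_label]
print(patient, text_label, encoded)
```

A lookup on an ID that is missing from the sheet raises a `KeyError`, which surfaces mismatched folder names immediately instead of silently skipping the patient.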


## B. Merge All Patients' Data and Train the Model
- Clean up the code and check the result

In [109]:
import os
import FlowCal
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


def concat_FCSdata_FeatureName(filename):
    # Load the FCS file
    fcs_file = FlowCal.io.FCSData(filename)

    # View and store feature
    tuple_feature = fcs_file.channels
    list_feature = list(tuple_feature)

    # Convert the FCS data to DataFrame
    df = pd.DataFrame(fcs_file)

    # concatenate feature name with FCS data
    df.columns = list_feature
    #display(df)
    
    return df

def df_descripbe(df):
    df_summary = df.describe()
    series_max = df_summary.loc['max']
    series_min = df_summary.loc['min']
    series_mean = df_summary.loc['mean']
    series_std = df_summary.loc['std']
    
    # convert the Series to a DataFrame
    df_max_column = series_max.to_frame()
    df_min_column = series_min.to_frame()
    df_mean_column = series_mean.to_frame()
    df_std_column = series_std.to_frame()
    
    # transpose so each statistic becomes one row with one column per channel
    df_max = df_max_column.transpose()
    df_min = df_min_column.transpose()
    df_mean = df_mean_column.transpose()
    df_std = df_std_column.transpose()
    
    return df_max, df_min, df_mean, df_std


folder_path = '../raw_fcs/'

# Read the Excel label sheet into a DataFrame
EU_label = pd.read_excel('EU_label.xlsx')

# Define the mapping from string labels to integer labels
label_map = {
    'Sick': 1,
    'Healthy': 0,
}

# Collect one row of per-channel maxima for each patient
rows = []

for patientID, patient_label in zip(EU_label['file_flow_id'], EU_label['label']):
    path = os.path.join(folder_path, patientID)

    # 1. Find the FCS file in the patient's folder (one file per folder is assumed)
    list_fcs_files = sorted(f for f in os.listdir(path) if f.endswith('.fcs'))
    fcs_file_path = os.path.join(path, list_fcs_files[0])

    # 2. Summarize the FCS file
    df = concat_FCSdata_FeatureName(fcs_file_path)
    df_max, df_min, df_mean, df_std = df_descripbe(df)

    # 3. Attach the label and the patient ID
    df_max.insert(0, 'COVID19', patient_label)
    df_max.insert(1, 'Patient_ID', patientID)
    df_max['Label'] = df_max['COVID19'].map(label_map)
    rows.append(df_max)

# Combine all patients into one DataFrame.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
concatAll_df = pd.concat(rows)

x = concatAll_df.drop(['Label','COVID19','Patient_ID'], axis=1)
y = concatAll_df['Label'].to_frame()   
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# print the shapes of the training and testing sets
print(f'Training set shape: {X_train.shape}, {y_train.shape}')
print(f'Testing set shape: {X_test.shape}, {y_test.shape}')


# train a decision tree classifier on the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# make predictions on the testing data
y_pred = clf.predict(X_test)

# evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Training set shape: (28, 35), (28, 1)
Testing set shape: (12, 35), (12, 1)
Accuracy: 0.9166666666666666
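
With only 12 test patients, a single misclassification moves the accuracy by about 8 percentage points, so one train/test split gives a noisy estimate. A common remedy is stratified k-fold cross-validation. The sketch below runs it on synthetic data standing in for the 40-patient feature table; the random features, the 30/10 class split, and the fixed `random_state` values are all assumptions, not properties of the real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient feature table (random values, not real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))
y_toy = np.array([0] * 30 + [1] * 10)  # arbitrary class split for illustration

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf_cv = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf_cv, X_toy, y_toy, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

On the real data, `X_toy` and `y_toy` would be replaced by `x` and `concatAll_df['Label']`; the spread of the fold accuracies then indicates how much the single-split result can be trusted.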



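- Evaluate the performance of the model using mean squared error and R-squared. Both are regression metrics: on 0/1 labels the mean squared error equals the misclassification rate (1 - accuracy), so they add little beyond accuracy for this classifier.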


In [112]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Mean squared error:", mse)
print("R-squared:", r2)

Accuracy: 0.9166666666666666
Mean squared error: 0.08333333333333333
R-squared: 0.5555555555555556
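
To see how these numbers relate, the sketch below uses a hypothetical set of 12 test labels (3 positive, one of them misclassified), chosen only because it reproduces the three values reported above. The real test labels are not shown in this notebook, so the arrays are an assumption. A confusion matrix is added because it shows which class the error falls in.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_squared_error, r2_score)

# Hypothetical labels: 12 test patients, 3 positive, one positive missed
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_hat)
mse_toy = mean_squared_error(y_true, y_hat)
r2_toy = r2_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print("Accuracy:", acc)
print("Mean squared error:", mse_toy, "(equals 1 - accuracy)")
print("R-squared:", r2_toy)
print(cm)
```

For a classifier, per-class measures such as precision and recall (for example via `sklearn.metrics.classification_report`) are usually more informative than R-squared.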
