# Tutorial for Cancer-Biomarkers-from-Clinical-Data

## 1. Data loading and preprocessing

### 1.1. Import the necessary modules

In [1]:
# Library imports
import matplotlib.pyplot as plt

# Add the path to the src folder
import sys
sys.path.append('src')

# Import data preprocessing functions
from data_preprocessing import load_data, feature_label_split

### 1.2. Load the data

In [2]:
categories, dfs = load_data('data/clinical_cancer_data.xlsx')

list(enumerate(categories))

[(0, 'Breast'),
 (1, 'Colorectum'),
 (2, 'Esophagus'),
 (3, 'Liver'),
 (4, 'Lung'),
 (5, 'Normal'),
 (6, 'Ovary'),
 (7, 'Pancreas'),
 (8, 'Stomach')]
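
The indices printed above are the values passed as `cancer1_category_index` in the later calls. As a small convenience, a name-to-index lookup can stand in for the hard-coded numbers. The sketch below assumes `categories` is the plain list of strings shown in the output; `category_index` is a hypothetical helper name, not part of the repository.

```python
# Category names copied from the output above.
categories = ['Breast', 'Colorectum', 'Esophagus', 'Liver', 'Lung',
              'Normal', 'Ovary', 'Pancreas', 'Stomach']

# Hypothetical helper: map each category name to its position in the list.
category_index = {name: i for i, name in enumerate(categories)}

print(category_index['Ovary'])
print(category_index['Liver'])
```

With such a lookup, a call could pass `category_index['Ovary']` instead of the literal `6`.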

## 2. Random forest model for `Ovary`, `Pancreas` and `Liver` samples, taken with random subsamples of `Normal` samples

In this section, we look at three cancer types for which the list of biomarkers given by the random forest classifier contains a biomarker with a uniquely high level in that cancer type, along with other biomarkers whose Q3 (third quartile) values are in the top 2 among all cancer types. In each of these cancer types, we therefore obtain practically viable biomarkers.
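
The "top 2 by Q3" criterion can be sketched as follows. This is a minimal illustration, not the repository's code: the helper `q3_rank`, the biomarker name `X`, and the toy values are all hypothetical, and the sketch assumes each category has a pandas DataFrame with one column per biomarker.

```python
import pandas as pd

def q3_rank(frames, biomarker):
    """Rank categories by the third quartile (Q3) of one biomarker.

    `frames` maps a category name to a DataFrame with one column per
    biomarker (an assumed layout; the real `dfs` may differ).
    Returns a Series of Q3 values sorted from highest to lowest.
    """
    q3 = {name: df[biomarker].quantile(0.75) for name, df in frames.items()}
    return pd.Series(q3).sort_values(ascending=False)

# Toy, made-up values purely to show the mechanics.
toy_frames = {
    "Liver": pd.DataFrame({"X": [1.0, 5.0, 9.0, 13.0]}),
    "Ovary": pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0]}),
    "Stomach": pd.DataFrame({"X": [2.0, 4.0, 6.0, 8.0]}),
}

ranking = q3_rank(toy_frames, "X")
top2 = list(ranking.index[:2])
print(ranking)
print("Top 2 by Q3:", top2)
```

A biomarker would pass the "higher side filtering" for a cancer type when that type appears in `top2`.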

### 2.1. Import the necessary modules

In [3]:
from random_forest_model import rf_normal_cancers, plot_important_biomarkers

### 2.2. Random forest classification for `Normal + Ovary` samples

* Uniquely high levels: `CA-125`
* Higher side filtering: `Prolactin` (1st)

In [4]:
important_biomarkers_normal_ovary = rf_normal_cancers(categories = categories, 
                                                      dfs = dfs, 
                                                      cancer1_category_index = 6, 
                                                      iterations = 100,
                                                      threshold = 0.05)

Random forest classification: Normal + Ovary

Average Accuracy over 100 iterations: 0.9614

Biomarkers with Importance >= 0.05:
     Biomarker  Importance
3      CA-125     0.159995
29  Prolactin     0.109398
18       IL-6     0.099734
35       TGFa     0.094122
31       sFas     0.075817
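
The source of `rf_normal_cancers` is not shown in this tutorial, so the following is only a sketch of the general technique its output suggests: repeatedly subsample `Normal`, fit a random forest, and average accuracy and feature importances before applying the threshold. The function name `average_rf_importances`, the balanced subsample size, the 80/20 split, the number of trees, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def average_rf_importances(normal_df, cancer_df, iterations=10,
                           threshold=0.05, seed=0):
    """Average test accuracy and feature importances over random Normal subsamples."""
    rng = np.random.default_rng(seed)
    n = min(len(normal_df), len(cancer_df))
    accuracies = []
    importances = np.zeros(normal_df.shape[1])
    for _ in range(iterations):
        state = int(rng.integers(0, 2**31 - 1))
        # Assumption: the Normal subsample matches the cancer group in size.
        normal_sub = normal_df.sample(n=n, random_state=state)
        X = pd.concat([normal_sub, cancer_df], ignore_index=True)
        y = np.array([0] * len(normal_sub) + [1] * len(cancer_df))
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=state)
        model = RandomForestClassifier(n_estimators=50, random_state=state)
        model.fit(X_train, y_train)
        accuracies.append(model.score(X_test, y_test))
        importances += model.feature_importances_
    table = pd.DataFrame({"Biomarker": normal_df.columns,
                          "Importance": importances / iterations})
    # Keep only features at or above the importance threshold, highest first.
    table = table[table["Importance"] >= threshold]
    table = table.sort_values("Importance", ascending=False)
    return float(np.mean(accuracies)), table

# Synthetic data: only marker_a differs between the two groups.
data_rng = np.random.default_rng(42)
columns = ["marker_a", "marker_b", "marker_c"]
normal = pd.DataFrame(data_rng.normal(0.0, 1.0, size=(200, 3)), columns=columns)
cancer = pd.DataFrame(data_rng.normal(0.0, 1.0, size=(50, 3)), columns=columns)
cancer["marker_a"] += 5.0

accuracy, important = average_rf_importances(normal, cancer, iterations=5)
print(f"Average accuracy over 5 iterations: {accuracy:.4f}")
print(important)
```

Because the filtered table keeps its original row labels, the left-hand numbers in its printout are the column positions of the features, which would be consistent with the index column in the output above.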


### 2.3. Random forest classification for `Normal + Pancreas` samples

* Uniquely high levels: `CA19-9`
* Higher side filtering: `sHER2/sEGFR2/sErbB2` (1st)

In [5]:
important_biomarkers_normal_pancreas = rf_normal_cancers(categories = categories, 
                                                         dfs = dfs, 
                                                         cancer1_category_index = 7, 
                                                         iterations = 100, 
                                                         threshold = 0.05)

Random forest classification: Normal + Pancreas

Average Accuracy over 100 iterations: 0.9389

Biomarkers with Importance >= 0.05:
               Biomarker  Importance
5                CA19-9     0.142069
19                 IL-8     0.094102
18                 IL-6     0.080587
33  sHER2/sEGFR2/sErbB2     0.077110
27                  OPN     0.060258
15                GDF15     0.057668
23              Midkine     0.050770


### 2.4. Random forest classification for `Normal + Liver` samples

* Uniquely high levels: `AFP`
* Higher side filtering: `OPN` (1st), `Myeloperoxidase` (1st), `HGF` (2nd), `GDF15` (2nd)

`IL-8`, being important in regulating immune response and inflammation, is not specific to any one type of cancer. For example, it is among the most important biomarkers in separating `Normal + Breast`, `Normal + Colorectum`, and `Normal + Esophagus` samples, as shown in Sections 3 and 4. It can also be found at higher levels in `Esophagus` and `Stomach` samples.

`HGF` Q3 levels are close in `Liver` and `Esophagus` samples, and higher in `Stomach` samples. Since `HGF` Q3 level in `Liver` is in top 2 among cancer types, we consider it as a potential biomarker for `Liver` cancer.

`OPN` Q3 levels are the highest in `Liver` samples. Hence we consider it as a potential biomarker for `Liver` cancer.

The `GDF15` Q3 level in `Liver` is close to that in `Esophagus`, and it is higher in `Pancreas`. Since the `GDF15` Q3 level in `Liver` is in the top 2 among cancer types, we consider it as a potential biomarker for `Liver` cancer.

`Myeloperoxidase` Q3 level is the highest in `Liver` samples. Hence we consider it as a potential biomarker for `Liver` cancer.

`AFP` levels are the highest in `Liver` samples, and uniquely so. Hence we consider it as a potential biomarker for `Liver` cancer with uniquely high levels.

In [15]:
important_biomarkers_normal_liver = rf_normal_cancers(categories = categories, 
                                                      dfs = dfs,
                                                      cancer1_category_index = 3,
                                                      iterations = 100,
                                                      threshold = 0.05)

Random forest classification: Normal + Liver

Average Accuracy over 100 iterations: 0.9483

Biomarkers with Importance >= 0.05:
           Biomarker  Importance
19             IL-8     0.128577
17              HGF     0.122702
27              OPN     0.117051
15            GDF15     0.085102
24  Myeloperoxidase     0.062697
0               AFP     0.054613


## 3. Random forest model for `Normal`, `Breast` and `Colorectum` samples

Now we see two cancer types for which the important biomarkers given by the random forest classifier are not suitable in a practical scenario: they separate the cancer samples from normal samples, but they do not distinguish that cancer type from the others. None of the biomarkers displays a uniquely high level for the particular cancer type, and none has a Q3 level in the top 2 among all the cancer types.

### 3.1. Random forest classification for `Normal + Breast` samples

* Uniquely high levels: None
* Higher side filtering: None

As per the random forest classifier, the important biomarkers for distinguishing between `Normal` and `Breast` samples are the following:
* `TGFa` (35)
* `IL-8` (19)
* `IL-6` (18)
* `Prolactin` (29)
* `CYFRA 21-1` (8)

Note that `IL-8` and `IL-6`, being important in regulating immune response and inflammation, are not specific to any one type of cancer. For example, the same two biomarkers are the most important ones in separating `Normal` and `Colorectum` samples, as shown below. They can also be found at higher levels in some other cancer types, such as `Esophagus`. In fact, the `IL-6` and `IL-8` Q3 values are the lowest in `Breast` samples among all the cancer types.

Also, the descriptive statistics of `TGFa` show that its Q1, Q2 and Q3 values are very close, and in some cases virtually indistinguishable, in `Breast`, `Colorectum`, `Lung`, `Pancreas` and even `Normal` samples.
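
A quartile comparison of this kind can be produced with `describe()` from pandas. The sketch below is illustrative only: the helper `quartile_table`, the category names `A` and `B`, the biomarker name `X`, and the toy values are hypothetical, and it assumes one DataFrame per category with one column per biomarker.

```python
import pandas as pd

def quartile_table(frames, biomarker):
    """Return Q1, Q2 (median) and Q3 of one biomarker for each category."""
    rows = {
        name: df[biomarker].describe()[["25%", "50%", "75%"]]
        for name, df in frames.items()
    }
    # One row per category, one column per quartile.
    table = pd.DataFrame(rows).T
    return table.rename(columns={"25%": "Q1", "50%": "Q2", "75%": "Q3"})

# Toy, made-up values: two categories with nearly identical distributions.
toy_frames = {
    "A": pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0, 5.0]}),
    "B": pd.DataFrame({"X": [1.1, 2.1, 3.1, 4.1, 5.1]}),
}

table = quartile_table(toy_frames, "X")
print(table)
```

When the rows of such a table are nearly identical, as in this toy case, the biomarker does little to tell the categories apart.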

`Prolactin` levels are much higher in `Liver`, `Lung` and `Ovary` samples than in `Breast` samples.

`CYFRA 21-1` Q1, Q2 and Q3 values are very close in `Breast` and `Colorectum` samples, and `CYFRA 21-1` is found at higher levels in the other cancer types. In fact, the Q3 value of `CYFRA 21-1` is the lowest in `Breast` samples among all the cancer types.

In [8]:
important_biomarkers_normal_breast = rf_normal_cancers(categories = categories, 
                                                       dfs = dfs,
                                                       cancer1_category_index = 0,
                                                       iterations = 100,
                                                       threshold = 0.05)

Random forest classification: Normal + Breast

Average Accuracy over 100 iterations: 0.9630

Biomarkers with Importance >= 0.05:
      Biomarker  Importance
35        TGFa     0.100919
19        IL-8     0.087020
18        IL-6     0.085325
29   Prolactin     0.055615
8   CYFRA 21-1     0.052147


### 3.2. Random forest classification for `Normal + Colorectum` samples

* Uniquely high levels: None
* Higher side filtering: None

As per the random forest classifier, the important biomarkers for distinguishing between `Normal` and `Colorectum` samples are the following:
* `IL-8` (19)
* `IL-6` (18)
* `OPN` (27)
* `HGF` (17)
* `GDF15` (15)
* `sFas` (31)
* `Prolactin` (29)

Note that `IL-8` and `IL-6`, being important in regulating immune response and inflammation, are not specific to any one type of cancer. For example, the same two biomarkers are two of the most important ones in separating `Normal` and `Breast` samples, as shown above. They can also be found at higher levels in some other cancer types, such as `Esophagus`, `Liver` and `Lung`.

`OPN` Q3 levels are much higher in `Esophagus`, `Liver`, and `Stomach` samples than in `Colorectum` samples.

`HGF` Q3 levels are higher in `Pancreas` samples, and much higher in `Esophagus`, `Liver`, and `Stomach` samples than in `Colorectum` samples.

`GDF15` Q3 levels are close in `Colorectum` and `Ovary` samples, and higher in `Esophagus`, `Liver`, `Stomach` and `Pancreas` samples.

`sFas` Q3 levels are higher in `Breast`, `Esophagus`, `Liver`, `Lung`, `Pancreas`, `Stomach` and even `Normal` samples than in `Colorectum` samples.

`Prolactin` levels are much higher in `Liver`, `Lung` and `Ovary` samples than in `Colorectum` samples.


In [9]:
important_biomarkers_normal_colorectum = rf_normal_cancers(categories = categories, 
                                                           dfs = dfs,
                                                           cancer1_category_index = 1,
                                                           iterations = 100,
                                                           threshold = 0.05)

Random forest classification: Normal + Colorectum

Average Accuracy over 100 iterations: 0.9576

Biomarkers with Importance >= 0.05:
     Biomarker  Importance
19       IL-8     0.135958
18       IL-6     0.095141
27        OPN     0.090611
17        HGF     0.061261
15      GDF15     0.058960
31       sFas     0.057872
29  Prolactin     0.054625


### 3.3. Random forest classification for `Normal + Breast + Colorectum` samples

In [10]:
important_biomarkers_normal_breast_colorectum = rf_normal_cancers(categories = categories, 
                                                                  dfs = dfs,
                                                                  cancer1_category_index = 0,
                                                                  cancer2_category_index = 1,
                                                                  iterations = 100,
                                                                  threshold = 0.05)

Random forest classification: Normal + Breast + Colorectum

Average Accuracy over 100 iterations: 0.8930

Biomarkers with Importance >= 0.05:
   Biomarker  Importance
19     IL-8     0.088869
18     IL-6     0.064423
35     TGFa     0.057447
31     sFas     0.053016


## 4. Random forest model for `Normal` and `Esophagus`, `Lung`, `Stomach` samples

### 4.1. Random forest classification for `Normal + Esophagus` samples

* Uniquely high levels: None
* Higher side filtering: `TIMP-1` (1st), `OPN` (2nd), `Myeloperoxidase` (2nd)

`OPN` Q2 and Q3 levels are close in `Esophagus` and `Stomach` samples, and higher in `Liver` samples. Since `OPN` Q3 level in `Esophagus` is in top 2 among cancer types, we consider it as a potential biomarker for `Esophagus` cancer.

`HGF` Q3 levels are higher in `Liver` and `Stomach` samples.

`IL-6` and `IL-8`, being important in regulating immune response and inflammation, are not specific to any one type of cancer. For example, the same two biomarkers are two of the most important ones in separating `Normal` and `Breast` samples, and also `Normal` and `Colorectum` samples, as shown above. And they can be found in higher levels in some other cancer types, such as `Liver` and `Lung` samples.

`Myeloperoxidase` Q3 levels are higher only in `Liver` samples. Since `Myeloperoxidase` Q3 level in `Esophagus` is in top 2 among cancer types, we consider it as a potential biomarker for `Esophagus` cancer.

`GDF15` Q3 levels are higher in `Liver` and `Pancreas` samples.

Despite being present in high levels in `Liver`, `Ovary`, `Pancreas` and `Stomach` samples, `TIMP-1` Q3 levels are the highest in `Esophagus` samples. Hence, we consider it as a potential biomarker for `Esophagus` cancer.

In [12]:
important_biomarkers_normal_esophagus = rf_normal_cancers(categories = categories, 
                                                          dfs = dfs,
                                                          cancer1_category_index = 2,
                                                          iterations = 100,
                                                          threshold = 0.05)

Random forest classification: Normal + Esophagus

Average Accuracy over 100 iterations: 0.8617

Biomarkers with Importance >= 0.05:
           Biomarker  Importance
27              OPN     0.098270
17              HGF     0.096577
18             IL-6     0.078884
19             IL-8     0.068289
24  Myeloperoxidase     0.060353
15            GDF15     0.060257
37           TIMP-1     0.051771


### 4.2. Random forest classification for `Normal + Lung` samples

The `NSE` Q3 level in `Lung` samples is the lowest among all the cancer types, and it is even lower than in `Normal` samples.

In [13]:
important_biomarkers_normal_lung = rf_normal_cancers(categories = categories, 
                                                     dfs = dfs,
                                                     cancer1_category_index = 4,
                                                     iterations = 100,
                                                     threshold = 0.05)

Random forest classification: Normal + Lung

Average Accuracy over 100 iterations: 0.9769

Biomarkers with Importance >= 0.05:
     Biomarker  Importance
29  Prolactin     0.156559
19       IL-8     0.080851
25        NSE     0.079439
18       IL-6     0.074091
15      GDF15     0.052625
27        OPN     0.051229


### 4.3. Random forest classification for `Normal + Stomach` samples

In [14]:
important_biomarkers_normal_stomach = rf_normal_cancers(categories = categories, 
                                                        dfs = dfs,
                                                        cancer1_category_index = 8,
                                                        iterations = 100,
                                                        threshold = 0.05)

Random forest classification: Normal + Stomach

Average Accuracy over 100 iterations: 0.9304

Biomarkers with Importance >= 0.05:
   Biomarker  Importance
27      OPN     0.152077
18     IL-6     0.088584
19     IL-8     0.075116
15    GDF15     0.066546
17      HGF     0.063138
30    sEGFR     0.056581
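
The recurring remark that `IL-6` and `IL-8` are not specific to one cancer type can be checked by tallying how often each biomarker appears across the eight two-class models. The lists below are copied from the outputs printed in this tutorial; the variable names are illustrative.

```python
from collections import Counter

# Biomarkers with Importance >= 0.05, copied from the outputs above.
important = {
    "Ovary": ["CA-125", "Prolactin", "IL-6", "TGFa", "sFas"],
    "Pancreas": ["CA19-9", "IL-8", "IL-6", "sHER2/sEGFR2/sErbB2",
                 "OPN", "GDF15", "Midkine"],
    "Liver": ["IL-8", "HGF", "OPN", "GDF15", "Myeloperoxidase", "AFP"],
    "Breast": ["TGFa", "IL-8", "IL-6", "Prolactin", "CYFRA 21-1"],
    "Colorectum": ["IL-8", "IL-6", "OPN", "HGF", "GDF15", "sFas", "Prolactin"],
    "Esophagus": ["OPN", "HGF", "IL-6", "IL-8", "Myeloperoxidase",
                  "GDF15", "TIMP-1"],
    "Lung": ["Prolactin", "IL-8", "NSE", "IL-6", "GDF15", "OPN"],
    "Stomach": ["OPN", "IL-6", "IL-8", "GDF15", "HGF", "sEGFR"],
}

# Number of models in which each biomarker passes the threshold.
counts = Counter(m for markers in important.values() for m in markers)

# Biomarkers that pass the threshold in exactly one model.
only_here = {cancer: [m for m in markers if counts[m] == 1]
             for cancer, markers in important.items()}

for marker, n in counts.most_common():
    print(f"{marker}: {n}")
print(only_here)
```

Appearing in only one list is not the same as having uniquely high levels: `CYFRA 21-1`, for instance, appears only in the `Breast` list, yet its Q3 value is the lowest in `Breast` samples, so the quartile comparison is still needed.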
