# Glioma Grade Classification
### Approach
- Data cleaning and transformation
- XGBoost model
- 10-Fold cross validation while training 
### Description (from the website):

- Dataset link: https://archive-beta.ics.uci.edu/dataset/759/glioma+grading+clinical+and+mutation+features+dataset

Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on the histological/imaging criteria. Clinical and molecular/mutation factors are also crucial for the grading process. The molecular tests that help diagnose glioma patients accurately are expensive.

In this dataset, the most frequently mutated 20 genes and 3 clinical features are considered from TCGA-LGG and TCGA-GBM brain glioma projects.

The prediction task is to determine whether a patient is LGG or GBM given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading process to improve performance and reduce costs.

### Reference
Tasci, Erdal, Camphausen, Kevin, Krauze, Andra Valentina & Zhuge, Ying. (2022). Glioma Grading Clinical and Mutation Features Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5R62J.

### Importing libraries

In [87]:
import pandas as pd
import xgboost
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

### Reading data

In [88]:
grade_data = pd.read_csv('TCGA_GBM_LGG_Mutations_all.csv')
grade_data.head()

Unnamed: 0,Grade,Project,Case_ID,Gender,Age_at_diagnosis,Primary_Diagnosis,Race,IDH1,TP53,ATRX,...,FUBP1,RB1,NOTCH1,BCOR,CSMD3,SMARCA4,GRIN2A,IDH2,FAT4,PDGFRA
0,LGG,TCGA-LGG,TCGA-DU-8164,Male,51 years 108 days,"Oligodendroglioma, NOS",white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
1,LGG,TCGA-LGG,TCGA-QH-A6CY,Male,38 years 261 days,Mixed glioma,white,MUTATED,NOT_MUTATED,NOT_MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
2,LGG,TCGA-LGG,TCGA-HW-A5KM,Male,35 years 62 days,"Astrocytoma, NOS",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED
3,LGG,TCGA-LGG,TCGA-E1-A7YE,Female,32 years 283 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,MUTATED,NOT_MUTATED
4,LGG,TCGA-LGG,TCGA-S9-A6WG,Male,31 years 187 days,"Astrocytoma, anaplastic",white,MUTATED,MUTATED,MUTATED,...,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED,NOT_MUTATED


### Cleaning data

In [89]:
grade_data.columns

Index(['Grade', 'Project', 'Case_ID', 'Gender', 'Age_at_diagnosis',
       'Primary_Diagnosis', 'Race', 'IDH1', 'TP53', 'ATRX', 'PTEN', 'EGFR',
       'CIC', 'MUC16', 'PIK3CA', 'NF1', 'PIK3R1', 'FUBP1', 'RB1', 'NOTCH1',
       'BCOR', 'CSMD3', 'SMARCA4', 'GRIN2A', 'IDH2', 'FAT4', 'PDGFRA'],
      dtype='object')

In [90]:
len(grade_data)

862

Checking imbalance
- 499 of the 862 rows (58%) have the grade 'LGG'.
- The classes are mildly imbalanced, but the split is close enough to 50/50 that we do not rebalance the data.


In [91]:
grade_data['Grade'].value_counts()

Grade
LGG    499
GBM    363
Name: count, dtype: int64
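
The class shares can be computed directly from the counts printed above. The snippet below is a minimal sketch that rebuilds the counts by hand; `grade_counts`, `grade_share` and `majority_share` are illustrative names. On the real column, `grade_data['Grade'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
grade_counts = pd.Series({"LGG": 499, "GBM": 363}, name="count")

# Share of each class, rounded to three decimals
grade_share = (grade_counts / grade_counts.sum()).round(3)

# Accuracy of a trivial model that always predicts the majority class
majority_share = grade_share.max()

print(grade_share)
print("Majority-class baseline accuracy:", majority_share)
```

The majority share is a useful reference point: any trained model should beat the accuracy of always predicting 'LGG'.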

Checking for columns that can be dropped
- Columns which would not contribute to the model being trained can be dropped.
- These are typically columns that have a unique value for each row, for example unique identifiers.
- Here Case_ID and Project can be dropped: Case_ID has a unique value for each row, and Project is a field which may not be provided with the input.

In [92]:
grade_data['Project'].value_counts()

Project
TCGA-LGG    499
TCGA-GBM    363
Name: count, dtype: int64

In [93]:
len(grade_data['Case_ID'].unique())

862

In [94]:
grade_data.drop(columns=['Project','Case_ID'], inplace = True)
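
The two checks above can also be automated. The sketch below runs on a small made-up frame, not the real data: the case IDs are placeholders and `id_like` and `label_proxies` are hypothetical helper names. It flags columns that are unique per row, and columns whose values each map to exactly one grade. The second check is relevant here because the Project counts printed above (499 and 363) match the Grade counts exactly.

```python
import pandas as pd

# Tiny illustrative frame; the IDs are placeholders, not real TCGA cases
demo = pd.DataFrame({
    "Case_ID": ["TCGA-A", "TCGA-B", "TCGA-C", "TCGA-D"],
    "Project": ["TCGA-LGG", "TCGA-LGG", "TCGA-GBM", "TCGA-GBM"],
    "Grade": ["LGG", "LGG", "GBM", "GBM"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

# Columns with one distinct value per row behave like identifiers
id_like = [c for c in demo.columns if demo[c].nunique() == len(demo)]

# Non-identifier columns where every value corresponds to a single grade
label_proxies = [
    c for c in demo.columns.drop("Grade")
    if c not in id_like and demo.groupby(c)["Grade"].nunique().max() == 1
]

print("Identifier-like columns:", id_like)
print("Columns that mirror the label:", label_proxies)
```

Running the same two list comprehensions on `grade_data` before the drop is one way to confirm which columns carry no usable signal or restate the target.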

In [95]:
grade_data['Primary_Diagnosis'].value_counts()

Primary_Diagnosis
Glioblastoma                     360
Astrocytoma, anaplastic          129
Mixed glioma                     128
Oligodendroglioma, NOS           108
Oligodendroglioma, anaplastic     75
Astrocytoma, NOS                  58
--                                 4
Name: count, dtype: int64

In [96]:
grade_data['Gender'].value_counts()

Gender
Male      499
Female    359
--          4
Name: count, dtype: int64

In [97]:
grade_data['Race'].value_counts()

Race
white                               766
black or african american            59
not reported                         18
asian                                14
--                                    4
american indian or alaska native      1
Name: count, dtype: int64

In [98]:
(grade_data == '--').sum()

Grade                0
Gender               4
Age_at_diagnosis     5
Primary_Diagnosis    4
Race                 4
IDH1                 0
TP53                 0
ATRX                 0
PTEN                 0
EGFR                 0
CIC                  0
MUC16                0
PIK3CA               0
NF1                  0
PIK3R1               0
FUBP1                0
RB1                  0
NOTCH1               0
BCOR                 0
CSMD3                0
SMARCA4              0
GRIN2A               0
IDH2                 0
FAT4                 0
PDGFRA               0
dtype: int64

In [99]:
(grade_data == 'not reported').sum()

Grade                 0
Gender                0
Age_at_diagnosis      0
Primary_Diagnosis     0
Race                 18
IDH1                  0
TP53                  0
ATRX                  0
PTEN                  0
EGFR                  0
CIC                   0
MUC16                 0
PIK3CA                0
NF1                   0
PIK3R1                0
FUBP1                 0
RB1                   0
NOTCH1                0
BCOR                  0
CSMD3                 0
SMARCA4               0
GRIN2A                0
IDH2                  0
FAT4                  0
PDGFRA                0
dtype: int64

Dropping null values
- To find the values that need to be replaced, we use value_counts() on the columns, because null values are sometimes stored under a different label.
- In our case, they appear as '--' and 'not reported'.
- After checking the counts of '--' and 'not reported', we can safely drop the rows containing these values, since they make up less than 3% of the rows in the dataset.

In [100]:
grade_data = grade_data.replace('--',None)
grade_data = grade_data.replace('not reported',None)
grade_data.isnull().sum()

Grade                 0
Gender                4
Age_at_diagnosis      5
Primary_Diagnosis     4
Race                 22
IDH1                  0
TP53                  0
ATRX                  0
PTEN                  0
EGFR                  0
CIC                   0
MUC16                 0
PIK3CA                0
NF1                   0
PIK3R1                0
FUBP1                 0
RB1                   0
NOTCH1                0
BCOR                  0
CSMD3                 0
SMARCA4               0
GRIN2A                0
IDH2                  0
FAT4                  0
PDGFRA                0
dtype: int64

In [101]:
no_missing_data = grade_data.dropna()
no_missing_data.isnull().sum()

Grade                0
Gender               0
Age_at_diagnosis     0
Primary_Diagnosis    0
Race                 0
IDH1                 0
TP53                 0
ATRX                 0
PTEN                 0
EGFR                 0
CIC                  0
MUC16                0
PIK3CA               0
NF1                  0
PIK3R1               0
FUBP1                0
RB1                  0
NOTCH1               0
BCOR                 0
CSMD3                0
SMARCA4              0
GRIN2A               0
IDH2                 0
FAT4                 0
PDGFRA               0
dtype: int64

In [102]:
print(grade_data.shape)
print(no_missing_data.shape)

(862, 25)
(839, 25)
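
As an alternative to replacing the placeholder strings after loading, `pd.read_csv` accepts an `na_values` argument that marks them as missing while the file is read. The sketch below uses a three-row inline CSV as stand-in data; `csv_text`, `demo` and `dropped_fraction` are illustrative names.

```python
import io
import pandas as pd

# Stand-in data; the real file is TCGA_GBM_LGG_Mutations_all.csv
csv_text = """Grade,Gender,Race
LGG,Male,white
GBM,--,not reported
LGG,Female,asian
"""

# Treat both placeholder strings as missing values at read time
demo = pd.read_csv(io.StringIO(csv_text), na_values=["--", "not reported"])

missing_per_column = demo.isnull().sum()
cleaned = demo.dropna()
dropped_fraction = 1 - len(cleaned) / len(demo)

print(missing_per_column)
print(f"Dropped {dropped_fraction:.1%} of rows")
```

In the notebook itself the shapes above go from 862 to 839 rows, so 23 rows (about 2.7%) are dropped.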


### Data Transformation

- Since the 'Age_at_diagnosis' column contains string data, we need to transform it into numerical form so that it can be fed into the XGBoost model.

In [103]:
no_missing_data['Age_at_diagnosis'].head()

0    51 years 108 days
1    38 years 261 days
2     35 years 62 days
3    32 years 283 days
4    31 years 187 days
Name: Age_at_diagnosis, dtype: object

- To convert the data from the above string format to numerical form, we separate out the years and divide the number of days by 365 to get the fraction of a year that the days represent.

In [104]:
def age_transform(age):
    # Input looks like "51 years 108 days"; the days part may be absent
    split_age = age.split(" ")
    days = 0
    if len(split_age) > 2:
        days = float(split_age[2])/365
    years = float(split_age[0])
    return round(years + days,2)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
no_missing_data = no_missing_data.assign(
    Age_at_diagnosis=no_missing_data['Age_at_diagnosis'].map(age_transform))


In [105]:
no_missing_data['Age_at_diagnosis'].head()

0    51.30
1    38.72
2    35.17
3    32.78
4    31.51
Name: Age_at_diagnosis, dtype: float64
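
The same conversion can be written without a Python-level function by extracting the two numbers with a regular expression. This is a minimal sketch under the assumption that every value looks like "N years" optionally followed by "M days". The first three strings are from the output above; "60 years" is a made-up example of a value without a days part.

```python
import pandas as pd

ages = pd.Series([
    "51 years 108 days",
    "38 years 261 days",
    "35 years 62 days",
    "60 years",  # hypothetical value with no days part
])

# Named groups become columns; a missing days part is extracted as NaN
parts = ages.str.extract(r"(?P<years>\d+) years(?: (?P<days>\d+) days)?").astype(float)

# Years plus the fraction of a year represented by the days
age_in_years = (parts["years"] + parts["days"].fillna(0) / 365).round(2)

print(age_in_years.tolist())
```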

- We map the 'Grade' column according to the instructions on the website

In [106]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
no_missing_data = no_missing_data.assign(
    Grade=no_missing_data['Grade'].map({'GBM': 1, 'LGG': 0}))


- Separating the dependent feature (Y) from the independent features (X).

In [107]:
X = no_missing_data.drop('Grade', axis = 1)
Y = no_missing_data['Grade']

- We convert the features to one-hot encodings. Note that OneHotEncoder is applied to every column of X, including the numeric 'Age_at_diagnosis'.
- The result is a sparse matrix; to view it we convert it to dense form using todense().
- XGBoost handles sparse data well and can work with a large number of features, so we do not perform any further transformation such as dimensionality reduction.

In [108]:
one_hot_encoded_data = OneHotEncoder().fit_transform(X)
one_hot_encoded_data.todense()

matrix([[0., 1., 0., ..., 1., 0., 1.],
        [0., 1., 0., ..., 1., 0., 1.],
        [0., 1., 0., ..., 1., 0., 1.],
        ...,
        [1., 0., 0., ..., 1., 0., 1.],
        [0., 1., 0., ..., 1., 0., 1.],
        [0., 1., 0., ..., 1., 0., 1.]])

### K-Folds creation
- We split the data into 10 folds. In each iteration, nine folds form the training set and the remaining fold is the test set on which the model is evaluated.
- 10-fold cross-validation was suggested on the dataset website.

In [109]:
kf = KFold(n_splits=10,shuffle=True)
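
To see what this split produces, the sketch below counts the rows in each fold for a dataset of 839 rows, the number left after cleaning. The `random_state` argument is an addition for reproducibility and is not set in the notebook; `kf_demo`, `test_sizes` and `train_sizes` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 839  # rows left after dropping missing values

# random_state is an assumption added here so the shuffle is repeatable
kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

# Only the number of rows matters for the split, so a dummy array is enough
test_sizes = [len(test_idx) for _, test_idx in kf_demo.split(np.zeros((n_rows, 1)))]
train_sizes = [n_rows - size for size in test_sizes]

print("Test fold sizes: ", test_sizes)
print("Train fold sizes:", train_sizes)
```

Because 839 is not divisible by 10, the first nine test folds hold 84 rows and the last holds 83.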

### XGBoost Implementation
- We fit the model on the training folds and pass the test fold as the eval_set. Log-loss is used as the evaluation metric, and training stops early if it does not improve for 20 rounds.
- We did not tune any hyperparameters, since the model already fits the evaluation set well.
- Confusion matrices on the training and test splits are also computed for each fold to check the performance of the model.
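
When reading the confusion matrices, note that scikit-learn's `confusion_matrix` takes the true labels first and the predictions second. Rows then correspond to true classes and columns to predicted classes. The labels in this small sketch are invented purely to show the layout.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = LGG, 1 = GBM
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# Signature is confusion_matrix(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Swapping the two arguments transposes the matrix, which exchanges the false-positive and false-negative counts.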

In [110]:
model = xgboost.XGBClassifier(early_stopping_rounds=20, eval_metric='logloss')
scores = []
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold: {i + 1}")
    X_train = one_hot_encoded_data[train_index]
    Y_train = Y.iloc[train_index]
    X_test = one_hot_encoded_data[test_index]
    Y_test = Y.iloc[test_index]
    # The test fold doubles as the early-stopping evaluation set
    model.fit(X_train, Y_train, verbose=False,
              eval_set=[(X_test, Y_test)])
    # best_score is the lowest log-loss reached on the evaluation set
    scores.append(model.best_score)
    train_preds = model.predict(X_train)
    # confusion_matrix expects (y_true, y_pred)
    print("Training confusion matrix: \n", confusion_matrix(Y_train, train_preds))
    test_preds = model.predict(X_test)
    print("Testing confusion matrix: \n", confusion_matrix(Y_test, test_preds))

Fold: 1
Training confusion matrix: 
 [[434   0]
 [  0 321]]
Testing confusion matrix: 
 [[53  0]
 [ 0 31]]
Fold: 2
Training confusion matrix: 
 [[435   0]
 [  0 320]]
Testing confusion matrix: 
 [[52  0]
 [ 0 32]]
Fold: 3
Training confusion matrix: 
 [[438   0]
 [  0 317]]
Testing confusion matrix: 
 [[49  0]
 [ 0 35]]
Fold: 4
Training confusion matrix: 
 [[436   0]
 [  0 319]]
Testing confusion matrix: 
 [[51  0]
 [ 0 33]]
Fold: 5
Training confusion matrix: 
 [[437   0]
 [  0 318]]
Testing confusion matrix: 
 [[50  0]
 [ 0 34]]
Fold: 6
Training confusion matrix: 
 [[436   0]
 [  0 319]]
Testing confusion matrix: 
 [[51  0]
 [ 0 33]]
Fold: 7
Training confusion matrix: 
 [[444   0]
 [  0 311]]
Testing confusion matrix: 
 [[43  0]
 [ 0 41]]
Fold: 8
Training confusion matrix: 
 [[441   0]
 [  0 314]]
Testing confusion matrix: 
 [[46  0]
 [ 0 38]]
Fold: 9
Training confusion matrix: 
 [[442   0]
 [  0 313]]
Testing confusion matrix: 
 [[45  0]
 [ 0 39]]
Fold: 10
Training confusion matrix: 


Mean log-loss score over all folds

In [111]:
sum(scores)/len(scores)

0.002311360619515255

### Conclusion

- The mean log-loss over all folds is low (about 0.0023), which indicates that the model is performing well.
- The confusion matrices shown above support this: they contain no misclassifications on either the training or the test split.