# Predicting the degree of Alzheimer's disease using MRI data, socioeconomic score and mental test scores

## Introduction

In this report we use the dataset Magnetic Resonance Imaging Comparisons of Demented and Nondemented Adults to predict the degree of dementia with a decision tree classifier. The dataset was obtained from Kaggle: https://www.kaggle.com/jboysen/mri-and-alzheimers

The key variables used to predict the level of impairment/dementia are gender (M/F), age, education level (Educ), socioeconomic status (SES), Mini Mental State Examination score (MMSE), estimated total intracranial volume (eTIV) and normalized whole brain volume (nWBV). These are the predictors, or independent variables (X); the outcome, or dependent variable (y), is the Clinical Dementia Rating (CDR).

In [109]:
#import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#read csv file 
myfile= ('C:/Users/saksh/Downloads/oasis_cross-sectional.csv')
df=pd.read_csv(myfile)
df

Unnamed: 0,ID,M/F,Hand,Age,Educ,SES,MMSE,CDR,eTIV,nWBV,ASF,Delay
0,OAS1_0001_MR1,F,R,74,2.0,3.0,29.0,0.0,1344,0.743,1.306,
1,OAS1_0002_MR1,F,R,55,4.0,1.0,29.0,0.0,1147,0.810,1.531,
2,OAS1_0003_MR1,F,R,73,4.0,3.0,27.0,0.5,1454,0.708,1.207,
3,OAS1_0004_MR1,M,R,28,,,,,1588,0.803,1.105,
4,OAS1_0005_MR1,M,R,18,,,,,1737,0.848,1.010,
...,...,...,...,...,...,...,...,...,...,...,...,...
431,OAS1_0285_MR2,M,R,20,,,,,1469,0.847,1.195,2.0
432,OAS1_0353_MR2,M,R,22,,,,,1684,0.790,1.042,40.0
433,OAS1_0368_MR2,M,R,22,,,,,1580,0.856,1.111,89.0
434,OAS1_0379_MR2,F,R,20,,,,,1262,0.861,1.390,2.0


## Data cleaning

In [110]:
#drop irrelevant columns that are not involved in predicting Clinical Dementia Rating (CDR)
df.drop(columns=['Delay','ASF','Hand','ID'], inplace=True)
#drop rows with N/A values
grouped=df.dropna()
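
Because `dropna()` removes every row that has at least one missing value, it can be worth checking how many rows are lost before modelling. The following is a minimal sketch on a made-up frame; the values and the names `toy`, `missing_per_column`, `cleaned` and `rows_dropped` are illustrative assumptions, not part of the notebook or the OASIS file.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [74, 55, 28, 18],
    "SES": [3.0, 1.0, np.nan, np.nan],
    "CDR": [0.0, 0.5, np.nan, np.nan],
})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
cleaned = toy.dropna()                 # keeps only rows with no NaN at all
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

Running the same three lines on `df` would show how much of the original data survives the cleaning step.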

In [113]:
#encode M as 1 and F as 0, and recode CDR as consecutive integer class labels
#Four CDR groups: no dementia (0), very mild dementia (1), mild dementia (2) and moderate dementia (3)
grouped2=grouped.replace({'CDR': {0.5:1.0,1.0:2.0,2.0:3.0},'M/F': {'M':1,'F':0}})
grouped2.head()

Unnamed: 0,M/F,Age,Educ,SES,MMSE,CDR,eTIV,nWBV
0,0,74,2.0,3.0,29.0,0.0,1344,0.743
1,0,55,4.0,1.0,29.0,0.0,1147,0.81
2,0,73,4.0,3.0,27.0,1.0,1454,0.708
8,1,74,5.0,2.0,30.0,0.0,1636,0.689
9,0,52,3.0,2.0,30.0,0.0,1321,0.827


The grouped2 DataFrame is used for the first decision tree classifier model below.

In [114]:
#encode M as 1 and F as 0, and merge very mild dementia (CDR 0.5) into mild dementia (CDR 1)
#Three CDR groups: no dementia (0), mild dementia (1) and moderate dementia (2)
#the nested dict restricts each replacement to the named column
grouped3=grouped.replace({'CDR': {0.5:1.0},'M/F': {'M':1,'F':0}})
grouped3.head()

Unnamed: 0,M/F,Age,Educ,SES,MMSE,CDR,eTIV,nWBV
0,0,74,2.0,3.0,29.0,0.0,1344,0.743
1,0,55,4.0,1.0,29.0,0.0,1147,0.81
2,0,73,4.0,3.0,27.0,1.0,1454,0.708
8,1,74,5.0,2.0,30.0,0.0,1636,0.689
9,0,52,3.0,2.0,30.0,0.0,1321,0.827


The grouped3 DataFrame is used for the second decision tree classifier model below.
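
When recoding CDR, note that `DataFrame.replace` called with plain values applies the substitution to every column, whereas the nested-dictionary form restricts it to the named column. The sketch below illustrates the difference on a made-up frame; the 0.5 in the `nWBV` column is a hypothetical value chosen to make the effect visible, and `toy`, `frame_wide` and `scoped` are illustrative names.

```python
import pandas as pd

# Made-up values; the 0.5 in nWBV is hypothetical and only there for the demo.
toy = pd.DataFrame({
    "CDR": [0.0, 0.5, 1.0, 2.0],
    "nWBV": [0.743, 0.5, 0.708, 0.689],
})

# Plain-value form: 0.5 is replaced wherever it occurs, in any column.
frame_wide = toy.replace(0.5, 1.0)

# Nested-dict form: the replacement is limited to the CDR column.
scoped = toy.replace({"CDR": {0.5: 1.0}})

print(frame_wide)
print(scoped)
```

In the scoped result only CDR changes, which is the safer choice if a predictor column could ever contain the value being recoded.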

## Decision tree classifier model

In [100]:
#import machine learning packages on sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
#X is the independent variable (predictor) and y is the dependent variable (predicted)
X=grouped2.drop(columns=['CDR'])
y= grouped2['CDR']
#list of accuracy scores, one per random split
average_score=[]
#100 repeats to get an average of accuracy score
for i in range(100):
    #Split data into Training and test
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2) #test size is 0.2 
    model= DecisionTreeClassifier()
    model.fit(X_train,y_train)
    predictions=model.predict(X_test)
    score=accuracy_score(y_test,predictions)
    average_score.append(score)
mean_score=np.mean(average_score)
print('The mean accuracy score of Decision tree Classifier model is:',mean_score)

The mean accuracy score of Decision tree Classifier model is: 0.6688636363636362


<b>Therefore, this decision tree classifier model has a mean accuracy of about 67% in predicting whether a patient has no dementia (0), very mild dementia (1), mild dementia (2) or moderate dementia (3), based on MRI data, socioeconomic status and mental test scores.</b>

To improve the accuracy, it may be worth using fewer outcome categories, so that very mild dementia and mild dementia are grouped into one category. This leaves the outcome categories as no dementia (0), mild dementia (1) and moderate dementia (2). This is done below:
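
An accuracy figure is easier to interpret next to a simple reference point, such as the accuracy of always predicting the most frequent class. The sketch below computes that baseline in plain Python; the helper `majority_baseline_accuracy` and the label vector are hypothetical and do not reflect the real CDR distribution.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels, not the real CDR distribution.
example_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
baseline = majority_baseline_accuracy(example_labels)
print(baseline)  # 0.6
```

Calling the helper on the notebook's `y` would give the value that the model's mean accuracy needs to exceed to be informative.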

In [108]:
#X is the independent variable (predictor) and y is the dependent variable (predicted)
X=grouped3.drop(columns=['CDR'])
y= grouped3['CDR']
#list of accuracy scores, one per random split
average_score=[]
#100 repeats to get an average of accuracy score
for i in range(100):
    #Split data into Training and test
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2) #test size is 0.2 
    model= DecisionTreeClassifier()
    model.fit(X_train,y_train)
    predictions=model.predict(X_test)
    score=accuracy_score(y_test,predictions)
    average_score.append(score)
mean_score=np.mean(average_score)
print('The mean accuracy score of Decision tree Classifier model is:',mean_score)

The mean accuracy score of Decision tree Classifier model is: 0.77


<b>This leads to a more accurate decision tree classifier model, with a mean accuracy of about 77% in predicting whether a patient has no dementia (0), mild dementia (1) or moderate dementia (2), based on MRI data, socioeconomic status and mental test scores.</b>
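
The loops above call `train_test_split` and `DecisionTreeClassifier` without a `random_state`, so the reported mean will vary slightly between runs. A minimal sketch of a reproducible variant follows, which also reports the spread of the scores. It runs on synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo` and `scores`, and the choice of 20 repeats, are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's X and y (7 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, n_classes=3, random_state=0
)

scores = []
for seed in range(20):
    # Seeding the split and the tree makes every repeat reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print(f"mean accuracy: {np.mean(scores):.3f}, std: {np.std(scores):.3f}")
```

Reporting the standard deviation alongside the mean gives a sense of whether a difference such as 67% versus 77% is larger than the split-to-split variation.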