# Alzheimer's Disease Classification
## Overview:
* Load and clean data.
* Exploratory Data Analysis.
* Preprocess data.
* Train several models and determine the best.
* Visualize results with a confusion matrix.

# Getting Started
For this classification I will be using the OASIS longitudinal study data. The set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. Of the 150 subjects, 72 were characterized as nondemented throughout the study. Another 64 were characterized as demented at the time of their initial visit and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. The remaining 14 subjects were characterized as nondemented at their initial visit and as demented at a later visit.

### Description of Features

| Feature     | Description                         |
| ----------- | ----------------------------------- |
| ID          | Identification                      |
| Group       | Demented or Nondemented             |
| Visit       | The visit number                    |
| M/F         | Gender                              |
| Hand        | Dominant Hand                       |
| Age         | Age in years                        |
| EDUC        | Years of Education                  |
| SES         | Socioeconomic Status                |
| MMSE        | Mini Mental State Examination       |
| CDR         | Clinical Dementia Rating            |
| eTIV        | Estimated Total Intracranial Volume |
| nWBV        | Normalized Whole Brain Volume       |
| ASF         | Atlas Scaling Factor                |
| Delay       | Delay                               |

# Packages & Libraries


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

import warnings 
warnings.filterwarnings('ignore')


# Load Data

In [None]:
data = pd.read_csv('../input/mri-and-alzheimers/oasis_longitudinal.csv')
data.head(10)

# Data Cleaning

In [None]:
data = data.loc[data['Visit']==1]          #Only look at first visit
data = data.reset_index(drop=True)         #reset index after filtering first visit data
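As a quick sanity check on the first-visit filter, the sketch below applies the same two steps to a small made-up frame and confirms that each subject ends up with exactly one row. The `Subject ID` column name and every value in `toy` are assumptions for illustration only; they are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the longitudinal table; all values are invented.
toy = pd.DataFrame({
    'Subject ID': ['S1', 'S1', 'S2', 'S2', 'S2', 'S3'],
    'Visit':      [1, 2, 1, 2, 3, 1],
    'Age':        [70, 72, 65, 66, 68, 80],
})

# Keep only the first visit and rebuild a clean 0..n-1 index.
first_visit = toy.loc[toy['Visit'] == 1].reset_index(drop=True)

# After filtering, every subject should contribute exactly one row.
rows_per_subject = first_visit['Subject ID'].value_counts()
print(first_visit)
print('One row per subject:', bool((rows_per_subject == 1).all()))
```

Restricting to one row per subject also keeps repeated scans of the same person from ending up on both sides of the later train-test split.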

In [None]:
data.info()

In [None]:
#Keep only the modelling columns and give 'M/F' a clearer name
data = data[['Group', 'M/F', 'Age', 'EDUC', 'SES',
             'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF']]
data = data.rename(columns={'M/F': 'Gender'})
data.head()

In [None]:
#Check for missing values
data.isna().sum()

In [None]:
data['SES'].value_counts()

Socioeconomic Status (SES) is a categorical feature, so we will fill in its missing values with the mode (the most frequent value).

In [None]:
#mode() returns a Series, so take its first entry as the fill value
data['SES'] = data['SES'].fillna(data['SES'].mode()[0])

In [None]:
data.isna().sum().sum()

In [None]:
#Binary encode object columns
#Group: 1 = Demented, 0 = any other label; Gender: 1 = Male, 0 = Female
data['Group'] = data['Group'].apply(lambda x: 1 if x == 'Demented' else 0)
data['Gender'] = data['Gender'].apply(lambda x: 1 if x == 'M' else 0)

In [None]:
data.head(10)

In [None]:
data = data.astype('float64')
print(data.dtypes)

# Alzheimer's Exploratory Data Analysis

In [None]:
data.describe()

In [None]:
corr = data.corr()
plt.figure(figsize=(12,6))
sns.heatmap(corr, annot=True, vmin=-1)
plt.show()

### Relationship Between Gender and Dementia

In [None]:
#Label genders before counting so the bar labels do not depend on count order
#(Gender: 1 = Male, 0 = Female)
demented_group = (data.loc[data['Group']==1, 'Gender']
                  .map({1.0: 'Male', 0.0: 'Female'})
                  .value_counts())
demented_group.plot(kind='bar', figsize=(8,6))
plt.title('Gender vs Dementia', size=16)
plt.xlabel('Gender', size=14)
plt.ylabel('Patients with Dementia', size=14)
plt.xticks(rotation=0)
plt.show()

### Relationship Between Age and Normalized Whole Brain Volume

Group: 0 = Nondemented, 1 = Demented

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='Age', y='nWBV', data=data, hue='Group')
plt.title('Age vs Normalized Whole Brain Volume', size=16)
plt.xlabel('Age', size=14)
plt.ylabel('Normalized Whole Brain Volume', size=14)
plt.show()

### Relationship Between CDR and Dementia

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='Age', y='CDR', data=data, hue='Group')
plt.title('Clinical Dementia Rating vs Dementia', size=16)
plt.xlabel('Age', size=14)
plt.ylabel('Clinical Dementia Rating',size=14)
plt.show()

### Relationship between MMSE and Dementia

In [None]:
#print('Nondemented Group: \n',data.query('Group == 0')['MMSE'].value_counts().sort_values())
#print('Demented Group: \n',data.query('Group == 1')['MMSE'].value_counts().sort_values())

In [None]:
plt.figure(figsize=(10,6))
sns.kdeplot(x='MMSE', fill=True, hue='Group', data=data)   #fill replaces the deprecated shade argument
plt.title('Distribution of MMSE scores in Demented and Nondemented Patients', size=16)
plt.xlim(data['MMSE'].min(), data['MMSE'].max())
plt.xlabel('MMSE Score', size=14)
plt.ylabel('Density of Scores', size=14)
plt.show()

### Relationship Between Education Years and Dementia

In [None]:
plt.figure(figsize=(10,6))
sns.kdeplot(x='EDUC', fill=True, hue='Group', data=data)   #fill replaces the deprecated shade argument
plt.title('Years of Education vs Dementia', size=16)
plt.xlabel('Education (years)', size=14)
plt.ylabel('Density', size=14)
plt.show()

## EDA Conclusions
* In this dataset, dementia appears to be more common in males than in females.
* Normalized Whole Brain Volume (nWBV) is negatively correlated with age in general; however, this correlation seems more pronounced in dementia patients.
* Clinical Dementia Rating (CDR) showed clear distinctions between demented and nondemented patients. Regardless of age, CDR seems to be a robust measure of dementia, as almost all patients with dementia showed a CDR score >= 0.5. Similarly, MMSE scores also show a distinction between the two groups. The MMSE scores for demented patients were much more spread out, ranging from 17 to 26, while nondemented individuals showed little variation, with MMSE scores ranging from 26 to 30. Although both scores differ between demented and nondemented individuals, CDR seems to be much more robust than MMSE.
* Demented patients generally had fewer years of education than nondemented patients. These findings align with many studies which have also demonstrated that less education is associated with a greater risk of AD (Sharp & Gatz, 2011).
* The relationship between Atlas Scaling Factor (ASF) and estimated total intracranial volume (eTIV) was almost one-to-one. This is because ASF is the volume-scaling factor necessary to fit each individual to the atlas. The ASF should be proportional to TIV since atlas normalization equalizes head size (Buckner et al., 2004). Considering this, I do not feel ASF will be necessary for our model.
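The redundancy described in the last bullet can also be checked programmatically instead of by reading the heatmap. The following is a minimal sketch, not part of the original analysis: `highly_correlated_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only: each pair once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Synthetic demo: 'b' is an exact linear function of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a, 'b': -2.0 * a + 1.0, 'c': rng.normal(size=200)})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

On the notebook's own frame, a call such as `highly_correlated_pairs(data.drop(columns='Group'))` would be expected to flag the eTIV and ASF pair, which supports dropping one of the two before modelling.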

## Data Preprocessing

In [None]:
def preprocessing_inputs(df):
    df = df.copy()
    
    #split df into X and y
    y = df['Group']
    X = df.drop(['Group', 'ASF'], axis=1)
    
    #Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    #Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = preprocessing_inputs(data)

print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

In [None]:
X_train.head()
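The function above fits the scaler on the training rows only and then reuses it for the test rows. The sketch below illustrates what that implies, using synthetic features on two very different scales (the numbers are assumptions chosen only to resemble an age-like and a volume-like column): the scaled training split has mean 0 and standard deviation 1 by construction, while the test split is only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(loc=[70.0, 1500.0], scale=[8.0, 170.0], size=(150, 2))

X_tr, X_te = train_test_split(X, test_size=0.3, random_state=1)

# Fit on the training rows only, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

print('train mean:', X_tr_s.mean(axis=0).round(3), 'std:', X_tr_s.std(axis=0).round(3))
print('test  mean:', X_te_s.mean(axis=0).round(3), 'std:', X_te_s.std(axis=0).round(3))
```

Fitting the scaler on all rows would let test-set statistics influence the training features, so keeping the fit inside the training split is the safer default.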

# Model Training

In [None]:
models = {'         Logistic Regression': LogisticRegression(),
          '                         KNN': KNeighborsClassifier(),
          '    Decision Tree Classifier': DecisionTreeClassifier(),
          '              Neural Network': MLPClassifier(),
          '    Random Forest Classifier': RandomForestClassifier(),
          'Gradient Boosting Classifier': GradientBoostingClassifier()}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + ' trained.')
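All six models above are trained with their default hyperparameters, and `GridSearchCV` is imported but not used. As one possible next step, the sketch below tunes a single hyperparameter of a KNN classifier with cross-validation. The data is synthetic (`make_classification`), and the grid values are assumptions for illustration, not settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (illustration only).
X_demo, y_demo = make_classification(n_samples=105, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid; make_pipeline names the step 'kneighborsclassifier'.
param_grid = {'kneighborsclassifier__n_neighbors': [3, 5, 7, 9]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best CV F1: {:.3f}'.format(search.best_score_))
```

After fitting, `search.best_estimator_` is the pipeline refit on all of the data passed to `fit`, and it can be scored on a held-out test set in the same way as the models above.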

# Model Results

### Accuracy Score

In [None]:
for name, model in models.items():
    yhat = model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    print(name + ' Accuracy: {:.2%}'.format(acc))
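With 150 subjects and a 30% test split, each accuracy above rests on roughly 45 test rows, so small differences between models may be noise. A hedged alternative is to compare models with k-fold cross-validation. The sketch below does this for two of the model types on synthetic data; the dataset, the `candidates` dictionary and the fold count are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the full feature matrix (illustration only).
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # Scale inside the pipeline so every fold is standardized on its own training part.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results[name] = scores
    print('{}: {:.2%} +/- {:.2%}'.format(name, scores.mean(), scores.std()))
```

Reporting the spread across folds next to the mean makes it easier to judge whether one model is really ahead of another.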

### F1-Score

In [None]:
for name, model in models.items():
    yhat = model.predict(X_test)
    f1 = f1_score(y_test, yhat, pos_label=1)
    print(name + ' F1-Score: {:.5}'.format(f1))
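For reference, the F1-score is the harmonic mean of precision and recall for the positive class (here `pos_label=1`, i.e. Demented). The sketch below computes it by hand on a handful of made-up labels and compares the result with `f1_score`; the label vectors are invented for illustration.

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 1 = Demented, 0 = Nondemented.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # of the predicted demented, how many really are
recall = tp / (tp + fn)      # of the truly demented, how many were found
f1_manual = 2 * precision * recall / (precision + recall)

print('manual F1 :', f1_manual)
print('sklearn F1:', f1_score(y_true, y_pred, pos_label=1))
```

Unlike accuracy, F1 ignores true negatives, which makes it a useful second view when the two classes are not the same size.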

# Confusion Matrix

In [None]:
def plot_confusion_matrix(y_test, yhat):
    #Rows are true labels, columns are predicted labels
    cm = confusion_matrix(y_test, yhat)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax, fmt='g', cmap=plt.cm.Blues, cbar=False)
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix', size=16)
    ax.xaxis.set_ticklabels(['Nondemented', 'Demented'])
    ax.yaxis.set_ticklabels(['Nondemented', 'Demented'])

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
yhat = logreg.predict(X_test)
plt.figure(figsize=(8,6))
plot_confusion_matrix(y_test, yhat)
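The four cells of the confusion matrix can also be turned into sensitivity and specificity, which are common summaries for a screening-style task like this one. The sketch below shows the arithmetic on made-up labels (0 = Nondemented, 1 = Demented); the label vectors are assumptions for illustration, not results from the model above.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = Nondemented, 1 = Demented.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)   # share of demented cases that were detected
specificity = tn / (tn + fp)   # share of nondemented cases correctly cleared

print('Sensitivity: {:.2%}'.format(sensitivity))
print('Specificity: {:.2%}'.format(specificity))
```

The same two lines can be applied to `confusion_matrix(y_test, yhat)` from the cell above to summarize the logistic regression model.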

# References

1. Buckner, R. L., Head, D., Parker, J., Fotenos, A. F., Marcus, D., Morris, J. C., & Snyder, A. Z. (2004). A unified approach for morphometric and functional data analysis in young, old, and demented adults using automated atlas-based head size normalization: reliability and validation against manual measurement of total intracranial volume. NeuroImage, 23(2), 724–738. https://doi.org/10.1016/j.neuroimage.2004.06.018 (PMID: 15488422)

2. Sharp, E. S., & Gatz, M. (2011). Relationship between education and dementia: an updated systematic review. Alzheimer Disease and Associated Disorders, 25(4), 289–304. https://doi.org/10.1097/WAD.0b013e318211c83c