# Operation Mind Shield: Decoding Alzheimer's
#### **Full Name:** Karl Munroe
#### **Link to SDS Profile:** https://community.superdatascience.com/u/4157ad30

This is the solution file for the tasks defined in the Monthly Mission. It answers the questions and tasks outlined in the project documentation: [Link](https://github.com/SuperDataScience-Community-Missions/MM0001-Operation-Mind-Shield?tab=readme-ov-file#-how-to-contribute)

#### Import the necessary libraries

In [2]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme()

#### Load the data and examine the features. 
The PatientID column holds the unique ID of each patient. This column will be used as the index.

In [3]:
raw_data = pd.read_csv('../../data/alzheimers_disease_data.csv', index_col='PatientID')
raw_data.head()

PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
4751,73,0,0,2,22.927749,0,13.297218,6.327112,1.347214,9.025679,...,0,0,1.725883,0,0,0,1,0,0,XXXConfid
4752,89,0,0,0,26.827681,0,4.542524,7.619885,0.518767,7.151293,...,0,0,2.592424,0,0,0,0,1,0,XXXConfid
4753,73,0,3,1,17.795882,0,19.555085,7.844988,1.826335,9.673574,...,0,0,7.119548,0,1,0,1,0,0,XXXConfid
4754,74,1,0,1,33.800817,1,12.209266,8.428001,7.435604,8.392554,...,0,1,6.481226,0,0,0,0,0,0,XXXConfid
4755,89,0,0,0,20.716974,0,18.454356,6.310461,0.795498,5.597238,...,0,0,0.014691,0,0,1,1,0,0,XXXConfid


#### Look at summary statistics and the structure of the dataset

In [4]:
raw_data.describe() # summary statistics

,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,...,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis
count,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,...,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0,2149.0
mean,74.908795,0.506282,0.697534,1.286645,27.655697,0.288506,10.039442,4.920202,4.993138,7.051081,...,5.080055,0.208004,0.156817,4.982958,0.205212,0.158213,0.150768,0.158678,0.301536,0.353653
std,8.990221,0.500077,0.996128,0.904527,7.217438,0.453173,5.75791,2.857191,2.909055,1.763573,...,2.892743,0.405974,0.363713,2.949775,0.40395,0.365026,0.357906,0.365461,0.459032,0.478214
min,60.0,0.0,0.0,0.0,15.008851,0.0,0.002003,0.003616,0.009385,4.002629,...,0.00046,0.0,0.0,0.001288,0.0,0.0,0.0,0.0,0.0,0.0
25%,67.0,0.0,0.0,1.0,21.611408,0.0,5.13981,2.570626,2.458455,5.482997,...,2.566281,0.0,0.0,2.342836,0.0,0.0,0.0,0.0,0.0,0.0
50%,75.0,1.0,0.0,1.0,27.823924,0.0,9.934412,4.766424,5.076087,7.115646,...,5.094439,0.0,0.0,5.038973,0.0,0.0,0.0,0.0,0.0,0.0
75%,83.0,1.0,1.0,2.0,33.869778,1.0,15.157931,7.427899,7.558625,8.562521,...,7.546981,0.0,0.0,7.58149,0.0,0.0,0.0,0.0,1.0,1.0
max,90.0,1.0,3.0,3.0,39.992767,1.0,19.989293,9.987429,9.998346,9.99984,...,9.996467,1.0,1.0,9.999747,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
raw_data.info() # information on the data types and columns

<class 'pandas.core.frame.DataFrame'>
Index: 2149 entries, 4751 to 6899
Data columns (total 34 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        2149 non-null   int64  
 1   Gender                     2149 non-null   int64  
 2   Ethnicity                  2149 non-null   int64  
 3   EducationLevel             2149 non-null   int64  
 4   BMI                        2149 non-null   float64
 5   Smoking                    2149 non-null   int64  
 6   AlcoholConsumption         2149 non-null   float64
 7   PhysicalActivity           2149 non-null   float64
 8   DietQuality                2149 non-null   float64
 9   SleepQuality               2149 non-null   float64
 10  FamilyHistoryAlzheimers    2149 non-null   int64  
 11  CardiovascularDisease      2149 non-null   int64  
 12  Diabetes                   2149 non-null   int64  
 13  Depression                 2149 non-null   int64  

The output above indicates that there are no missing values in any of the columns in the dataset.

In [6]:
print('The number of duplicated rows: ',raw_data.duplicated().sum()) # count the duplicated values

The number of duplicated rows:  0
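
The two checks above (missing values and duplicated rows) can be sketched on a small toy frame. The values below are hypothetical and are not taken from the mission dataset. Note that `duplicated()` compares column values only, so with PatientID as the index, two patients with identical feature values would be counted as duplicates.

```python
import pandas as pd

# Toy frame with hypothetical values (not the mission dataset)
toy = pd.DataFrame(
    {"Age": [73, 89, 73, None], "BMI": [22.9, 26.8, 22.9, 30.1]},
    index=pd.Index([1, 2, 3, 4], name="PatientID"),
)

missing_per_column = toy.isna().sum()  # number of missing values in each column
n_duplicates = toy.duplicated().sum()  # rows whose column values repeat an earlier row

print(missing_per_column)
print("The number of duplicated rows: ", n_duplicates)
```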


Every row of the 'DoctorInCharge' column has the same value, 'XXXConfid'. The column adds no useful information to the analysis, so it will be dropped.

In [7]:
raw_data.drop('DoctorInCharge', axis=1, inplace=True)
raw_data.columns

Index(['Age', 'Gender', 'Ethnicity', 'EducationLevel', 'BMI', 'Smoking',
       'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality',
       'FamilyHistoryAlzheimers', 'CardiovascularDisease', 'Diabetes',
       'Depression', 'HeadInjury', 'Hypertension', 'SystolicBP', 'DiastolicBP',
       'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL',
       'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment',
       'MemoryComplaints', 'BehavioralProblems', 'ADL', 'Confusion',
       'Disorientation', 'PersonalityChanges', 'DifficultyCompletingTasks',
       'Forgetfulness', 'Diagnosis'],
      dtype='object')
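
Rather than spotting a constant column by eye, it could be detected programmatically. The following is a minimal sketch on a hypothetical toy frame; the name `constant_cols` is illustrative and does not appear in the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Age": [73, 89, 74],
    "Diagnosis": [0, 0, 1],
    "DoctorInCharge": ["XXXConfid"] * 3,
})

# A column with a single distinct value cannot help a model separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
cleaned = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", cleaned.columns.tolist())
```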

The checks above indicate that the dataset was provided in a clean state, so no further cleaning or wrangling is needed.
The analysis asks whether, given the features, a patient will be diagnosed with Alzheimer's Disease.

###  Preparing the features and targets

The dataset has three categorical variables: Gender, Ethnicity and EducationLevel. Gender is already encoded, but Ethnicity and EducationLevel will have to be encoded as dummy variables.

#### Encoding Ethnicity

In [8]:
raw_data.Ethnicity.unique()

array([0, 3, 1, 2], dtype=int64)

In [9]:
eth_dummies = pd.get_dummies(raw_data['Ethnicity'], drop_first=True, dtype=int) # encode as 0 & 1
eth_dummies.rename(columns={1: 'African American',2:'Asian', 3: 'Other'}, inplace=True)
raw_data = pd.concat([eth_dummies, raw_data], axis=1)
raw_data.drop(['Ethnicity'], axis=1, inplace=True)
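
The cells above encode Ethnicity only; EducationLevel is not shown being encoded. A minimal sketch of how both columns could be encoded in a single call, assuming EducationLevel is treated as nominal rather than ordinal (a modelling choice), is given below. The toy values are hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical values; category codes 0-3 as in the dataset
toy = pd.DataFrame({
    "Ethnicity": [0, 3, 1, 2],
    "EducationLevel": [2, 0, 1, 3],
    "Age": [73, 89, 73, 74],
})

# drop_first=True drops the indicator for code 0, which becomes the reference level
encoded = pd.get_dummies(
    toy,
    columns=["Ethnicity", "EducationLevel"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
```

With `drop_first=True`, a row whose three Ethnicity indicators are all 0 belongs to the reference category (code 0).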

In [10]:
from sklearn.linear_model import LogisticRegression # logistic regression classifier
from sklearn.model_selection import train_test_split # split into test and training sets
from sklearn.preprocessing import MinMaxScaler # scale features to the [0, 1] range


dataset = raw_data.copy() # make a copy of the data for feature scaling
y = dataset.iloc[:,-1].values # target (Diagnosis) in the last column
X = dataset.iloc[:,:-1].values # features in all other columns (PatientID is the index, so it is excluded)

display(y)
display(X)

array([0, 0, 0, ..., 1, 1, 0], dtype=int64)

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 1., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

#### Split the dataset into test and training set

In [11]:
# Hold out 20% of the rows as the test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
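
The split above is not seeded, so each run produces a different partition and therefore slightly different scores. A minimal sketch of a reproducible, class-balanced split follows. It uses synthetic stand-in data, and the `random_state=123` and `stratify` arguments are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 35% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 35 + [0] * 65)

# Same seed, same partition; stratify keeps the class ratio similar in both parts
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te), y_te.mean())
```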

#### Scale the training and test sets.

In [12]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # reuse the scaling learned from the training set
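
The difference between reusing the fitted scaler on the test set and refitting it there can be seen on a toy example. The numbers below are hypothetical. Refitting maps the test values onto their own minimum and maximum, so the same raw value would be scaled differently in training and in testing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler().fit(train)               # learns min=0 and max=10 from the training data
test_consistent = scaler.transform(test)         # applies the training mapping to the test data
test_refit = MinMaxScaler().fit_transform(test)  # relearns min and max from the test data

print(test_consistent.ravel())
print(test_refit.ravel())
```

In the consistent version a test value can fall outside [0, 1], which is expected when it lies outside the range seen in training.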

#### Build the logistic regression model and fit it. A fixed random state is set for reproducibility

In [13]:
classifier = LogisticRegression(random_state=123)
classifier.fit(X_train, y_train)

#### Make predictions using the test set

In [14]:

y_pred = classifier.predict(X_test)
y_pred

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
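
The 0/1 labels returned by `predict` correspond to thresholding the predicted probability of class 1 at 0.5. A minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the mission dataset) is given below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical stand-in)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=123)

model = LogisticRegression(random_state=123).fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)[:, 1]  # estimated probability of class 1
labels = model.predict(X_demo)             # hard 0/1 predictions
manual = (proba > 0.5).astype(int)         # thresholding the probabilities by hand

print("Labels match thresholded probabilities:", np.array_equal(labels, manual))
```

Working with `predict_proba` directly makes it possible to choose a different threshold when false negatives and false positives carry different costs.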

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = pd.DataFrame(confusion_matrix(y_test, y_pred)) # create the confusion matrix (rows: actual, columns: predicted)
cm.columns = ['Predicted 0', 'Predicted 1'] # label the predicted classes
cm.rename(index={0: 'Actual 0', 1: 'Actual 1'}, inplace=True) # label the actual classes
accuracy = accuracy_score(y_test, y_pred)

display(cm)
print('Accuracy of the model: ', accuracy)

,Predicted 0,Predicted 1
Actual 0,256,25
Actual 1,45,104


Accuracy of the model:  0.8372093023255814
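
The reported accuracy can be reproduced by hand from the four confusion matrix counts, along with a few related metrics. This is a worked sketch of the arithmetic only. Precision, recall and the majority-class baseline are not computed in the notebook and are included here as context for the accuracy figure.

```python
# Counts copied from the confusion matrix above
tn, fp = 256, 25   # actual 0: predicted 0, predicted 1
fn, tp = 45, 104   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total              # share of all test patients classified correctly
precision = tp / (tp + fp)                # share of predicted positives that are actual positives
recall = tp / (tp + fn)                   # share of actual positives that were detected
baseline = max(tn + fp, fn + tp) / total  # accuracy of always predicting the majority class

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

The recall is noticeably lower than the accuracy, which means a meaningful share of actual positive cases in this test split were missed.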


### Summary of the Question

The logistic regression model achieves an accuracy of 0.837 (83.7%) on the test set when predicting an Alzheimer's Disease diagnosis from all the features in the dataset.