### About Dementia & Alzheimer’s

![](https://i.ibb.co/djzsmNG/alzehemiers.jpg)

**Dementia** is a syndrome, not a disease. A syndrome is a group of symptoms that doesn’t have a definitive diagnosis. Dementia is a group of symptoms that affects mental cognitive tasks such as memory and reasoning. Dementia is an umbrella term that Alzheimer’s disease can fall under. It can occur due to a variety of conditions, the most common of which is Alzheimer’s disease.

**Dementia** is the term applied to a group of symptoms that negatively impact memory, but **Alzheimer’s** is a progressive disease of the brain that slowly causes impairment in memory and cognitive function. The exact cause is unknown and no cure is available.

The **World Health Organization** says that **47.5 million** people around the world are living with dementia.

The **National Institutes of Health** estimates that more than 5 million people in the United States have Alzheimer’s disease. Although younger people can and do get Alzheimer’s, the symptoms generally begin after age 60.

The time from diagnosis to death can be as little as three years in people over 80 years old. However, it can be much longer for younger people.

Damage to the brain begins years before symptoms appear. Abnormal protein deposits form plaques and tangles in the brain of someone with Alzheimer’s disease. Connections between cells are lost, and they begin to die. In advanced cases, the brain shows significant shrinkage.

It’s impossible to diagnose Alzheimer’s with complete accuracy while a person is alive. The diagnosis can only be confirmed when the brain is examined under a microscope during an autopsy.

### Alzheimer’s vs. Dementia symptoms

The symptoms of **Alzheimer’s** and **dementia** can overlap, but there can be some differences.

Both conditions can cause:

* a decline in the ability to think
* memory impairment
* communication impairment

The symptoms of Alzheimer’s include:

* difficulty remembering recent events or conversations
* apathy
* depression
* impaired judgment
* disorientation
* confusion
* behavioral changes
* difficulty speaking, swallowing, or walking in advanced stages of the disease

Reference: [Healthline.com](https://www.healthline.com/health/alzheimers-disease/difference-dementia-alzheimers#dementia)

## Problem Statement
As the world's population grows, the number of people with dementia will increase, and with it the need for prompt diagnosis and treatment. An accurate early, or timely, diagnosis of dementia, supported by data analysis in close coordination with the medical community, can help people access treatments that improve symptoms and slow the progress of the disease, obtain advice and support, and have time to prepare for the future and plan ahead.

So, early detection of dementia is crucial and is the need of the hour.

## Data Understanding
This dataset consists of **Cross-sectional MRI Data** and **Longitudinal MRI data**. The cross-sectional data is a collection of 416 subjects aged 18 to 96. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 100 of the included subjects over the age of 60 have been clinically diagnosed with very mild to moderate Alzheimer’s disease (AD). Additionally, a reliability data set is included containing 20 nondemented subjects imaged on a subsequent visit within 90 days of their initial session.

Longitudinal data is a collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.
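
Because each subject contributes several imaging sessions, the per-subject counts quoted above differ from per-row counts in the CSV. The following is a minimal sketch of that distinction on a small made-up table; the rows are illustrative only and are not taken from the dataset, and the column names are assumed to match the CSV used later in this kernel.

```python
import pandas as pd

# Hypothetical toy sessions: three subjects with two or three visits each
toy = pd.DataFrame({
    "Subject ID": ["S1", "S1", "S2", "S2", "S2", "S3", "S3"],
    "Group": ["Nondemented", "Nondemented", "Demented", "Demented",
              "Demented", "Converted", "Converted"],
    "Visit": [1, 2, 1, 2, 3, 1, 2],
})

# Counting rows counts imaging sessions
sessions_per_group = toy["Group"].value_counts()

# Counting unique subject IDs counts people
subjects_per_group = toy.groupby("Group")["Subject ID"].nunique()

print(sessions_per_group)
print(subjects_per_group)
```

Percentages computed directly on the rows of the longitudinal file therefore describe sessions, not subjects.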

As I am working only on the longitudinal data in this kernel, let's check out its attributes.

### Attributes:
It consists of 15 attributes, which are described as follows:

- **Subject ID** - Unique ID of the patient
- **MRI ID** - Unique ID generated after conducting an MRI on the patient
- **Group** - One of Converted (previously normal but developed dementia later), Demented and Nondemented (normal patients)
- **Visit** - Number of the visit at which dementia status was assessed
- **MR Delay** - Not known

**Demographics Info**

- **M/F** - Gender
- **Hand** - Handedness (actually all subjects were right-handed so I will drop this column)
- **Age** - Age in years
- **EDUC** - Years of education
- **SES** - Socioeconomic status as assessed by the Hollingshead Index of Social Position and classified into categories from 1 (highest status) to 5 (lowest status)

**Clinical Info**

- **MMSE** - Mini-Mental State Examination score (range is from 0 = worst to 30 = best)
- **CDR** - Clinical Dementia Rating (0 = no dementia, 0.5 = very mild AD, 1 = mild AD, 2 = moderate AD)

**Derived anatomic volumes**

- **eTIV** - Estimated total intracranial volume, mm3
- **nWBV** - Normalized whole-brain volume, expressed as a percent of all voxels in the atlas-masked image that are labeled as gray or white matter by the automated tissue segmentation process
- **ASF** - Atlas scaling factor (unitless). Computed scaling factor that transforms native-space brain and skull to the atlas target (i.e., the determinant of the transform matrix)

## Analysis Approach
My analysis approach is divided into the following steps:

**1. Data Cleaning**

**2. Univariate Analysis**

**3. Bivariate Analysis**

**4. Key Insights**

## Importing Essential Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# libraries for data wrangling
import pandas as pd
import numpy as np
# libraries for plotting
import matplotlib.pyplot as plt
%matplotlib inline  
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set(style="whitegrid")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Data Reading

In [None]:
# reading longitudinal data
df_long = pd.read_csv('../input/mri-and-alzheimers/oasis_longitudinal.csv')

In [None]:
# lets see first few entries of the dataset
df_long.head()

## 1. Data Cleaning

In this step, we will first look at the overall distribution of the categorical and numerical columns so that we can drop irrelevant features, i.e., those that have only one unique value throughout or that will not be helpful in our analysis.

Next, we will handle the missing values and data type mismatch cases.
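
As a complement to reading the summary tables by eye, columns with a single unique value can also be found programmatically. This is a minimal sketch on a hypothetical toy frame (the values are made up; only the column names mirror the dataset), not the procedure used in the cells below.

```python
import pandas as pd

# Hypothetical toy frame: "Hand" is constant, the other columns vary
toy = pd.DataFrame({
    "Hand": ["R", "R", "R", "R"],
    "Age": [75, 80, 68, 90],
    "M/F": ["M", "F", "F", "M"],
})

# Number of distinct non-missing values per column
n_unique = toy.nunique()

# Columns with at most one distinct value carry no information
constant_cols = n_unique[n_unique <= 1].index.tolist()
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.columns.tolist())
```

Identifier columns such as subject or scan IDs are the opposite case (nearly every value is unique) and still have to be dropped by judgment.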

In [None]:
# lets see the summary stats of numerical columns
df_long.describe(include=[np.number])

In [None]:
# lets see the summary of categorical columns
df_long.describe(include=['object'])

After seeing the summary stats, we observe that the **Hand** feature has only one unique value, whereas **Subject ID** and **MRI ID** are of no use in our analysis, so we will drop these three columns from the dataset.

In [None]:
# dropping irrelevant columns
df_long=df_long.drop(['Subject ID','MRI ID','Hand'],axis=1)

df_long.head()

### Missing value treatment

In [None]:
# checking missing values in each column
df_long.isna().sum()

In [None]:
# for better understanding lets check the percentage of missing values in each column
round(df_long.isnull().sum()/len(df_long.index), 2)*100

So, we have to impute the missing values in SES and MMSE. Let's analyze the SES column first.

In [None]:
# Helper to plot the histogram and the density of a numerical column
def univariate_mul(var):
    fig = plt.figure(figsize=(16, 12))
    ax1 = fig.add_subplot(221)
    ax2 = fig.add_subplot(212)

    # histogram in the top-left panel
    df_long[var].plot(kind='hist', ax=ax1, grid=True)
    ax1.set_title('Histogram of ' + var, fontsize=14)

    # kernel density estimate in the bottom panel
    # (sns.kdeplot replaces the deprecated sns.distplot(..., hist=False))
    sns.kdeplot(df_long[var].dropna(), ax=ax2)
    ax2.set_title('Distribution of ' + var)
    plt.show()

In [None]:
# lets see the distribution of SES to decide which value we can impute in place of missing values.
univariate_mul('SES')
df_long['SES'].describe()

Since SES takes integer values, imputing the fractional mean would introduce a value that is not a valid category. The median and the mean are very close here, so we impute the median, which is the most representative value of SES in this case.

In [None]:
# imputing missing value in SES with median
df_long['SES'] = df_long['SES'].fillna(df_long['SES'].median())
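
To illustrate why the median is preferred over the mean for an integer-coded column, here is a small sketch on a hypothetical series (the values are made up and are not the dataset's SES values).

```python
import numpy as np
import pandas as pd

# Hypothetical integer-coded scores with two missing entries
ses = pd.Series([1, 2, 2, 3, np.nan, 4, 2, np.nan], name="SES")

mean_value = ses.mean()      # fractional, not a valid category
median_value = ses.median()  # lands on an existing category here

# Assign the result back instead of filling in place
ses_filled = ses.fillna(median_value)

print(mean_value, median_value)
print(ses_filled.tolist())
```

With an even number of observations the median can still fall halfway between two categories, so it is worth checking the imputed value before relying on it.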

Next, we will analyze the other column with missing values, i.e., MMSE.

In [None]:
univariate_mul('MMSE')
df_long['MMSE'].describe()

MMSE also takes integer values, so we cannot impute a float. We will impute it with the median value as well.

In [None]:
# imputing MMSE with median values
df_long['MMSE'] = df_long['MMSE'].fillna(df_long['MMSE'].median())

In [None]:
# Now, lets check the percentage of missing values in each column
round(df_long.isnull().sum()/len(df_long.index), 2)*100

Great! There are no missing values now. Let's move on to Univariate Analysis.

## 2. Univariate Analysis

In [None]:
# Defining function to create pie chart and bar plot as subplots
def plot_piechart(var):
    counts = df_long[var].value_counts()

    plt.figure(figsize=(14, 7))
    plt.subplot(121)
    # take the labels from the counts themselves so that every wedge gets the right name
    # (unique() returns values in order of appearance, not in order of frequency)
    counts.plot.pie(autopct="%1.0f%%", colors=sns.color_palette("prism", 7), startangle=60,
                    labels=counts.index.tolist(),
                    wedgeprops={"linewidth": 2, "edgecolor": "k"}, shadow=True)
    plt.title("Distribution of " + var + " variable")

    plt.subplot(122)
    ax = counts.plot(kind="barh")

    # annotate each bar with its count
    for i, j in enumerate(counts.values):
        ax.text(.7, i, j, weight="bold", fontsize=20)

    plt.title("Count of " + var + " cases")
    plt.show()



First, we will analyze the categorical column named Group.

In [None]:
plot_piechart('Group')

**Observation:**
- As we can see from the above plot, around 39% of the records are Demented and 10% are Converted, so the majority of the data consists of Nondemented cases.
- So let's perform univariate analysis on the other features to see whether we find any patterns or interesting insights.
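
The shares read off the pie chart can also be computed directly. A minimal sketch on a hypothetical label column (made-up counts, chosen only to keep the arithmetic easy to follow):

```python
import pandas as pd

# Hypothetical labels: 5 Nondemented, 4 Demented, 1 Converted
group = pd.Series(["Nondemented"] * 5 + ["Demented"] * 4 + ["Converted"] * 1)

# normalize=True returns fractions; multiply by 100 for percentages
share = (group.value_counts(normalize=True) * 100).round(1)

print(share)
```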

We begin by analyzing the most important categorical feature, i.e., the **Clinical Dementia Rating (CDR)**.


In [None]:
df_long['CDR'].describe()

The CDR™ Scoring Table provides descriptive anchors that guide the clinician in making appropriate ratings based on interview data and clinical judgment. In addition to ratings for each domain, an overall CDR™ score may be calculated through the use of the CDR™ Scoring Algorithm. This score is useful for characterizing and tracking a patient’s level of impairment/dementia:

- 0 = Normal
- 0.5 = Very Mild Dementia or Questionable
- 1 = Mild Dementia
- 2 = Moderate Dementia
- 3 = Severe Dementia

Information was taken from [The Charles F. and Joanne Knight Alzheimer’s Disease Research Center website](http://alzheimer.wustl.edu/cdr/cdr.htm). 

Interpretations of CDR scores:
![CDR Scoring Table](https://i.ibb.co/LhhT9n5/CDR.png)

From the table above, every score other than 0 (Normal), including 0.5, is associated with dementia symptoms. Because it is crucial to detect dementia in its early stages, I am grouping cases with a score of 0 as Normal and all scores >= 0.5 as Dementia.

In [None]:
# Plotting the CDR status (Normal vs Dementia) against another variable
def univariate_percent_plot(cat):
    fig = plt.figure(figsize=(18, 12))
    cmap1 = plt.cm.coolwarm_r
    ax1 = fig.add_subplot(221)
    ax2 = fig.add_subplot(222)

    # left panel: proportion of Normal vs Dementia cases within each level of `cat`
    result = df_long.groupby(cat).apply(
        lambda group: (group.CDR == 'Normal').sum() / float(group.CDR.count())
    ).to_frame('Normal')
    result['Dementia'] = 1 - result.Normal
    result.plot(kind='bar', stacked=True, colormap=cmap1, ax=ax1, grid=True)
    ax1.set_title('Stacked bar plot of ' + cat + ' (proportion)', fontsize=14)
    ax1.set_ylabel('Proportion of cases (Normal vs Dementia)')
    ax1.legend(loc="lower right")

    # right panel: number of Normal vs Dementia cases within each level of `cat`
    group_by_stat = df_long.groupby([cat, 'CDR']).size()
    group_by_stat.unstack().plot(kind='bar', stacked=True, ax=ax2, grid=True)
    ax2.set_title('Stacked bar plot of ' + cat + ' (counts)', fontsize=14)
    ax2.set_ylabel('Number of cases')
    plt.show()



# Categorizing feature CDR
def cat_CDR(n):
    if n == 0:
        return 'Normal'
    else:  # every score >= 0.5 (there are no cases of severe dementia, CDR score = 3, in this data)
        return 'Dementia'

df_long['CDR'] = df_long['CDR'].apply(cat_CDR)
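
The same Normal vs Dementia grouping can be written without a row-wise function. This is a sketch of a vectorized alternative on hypothetical scores, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical CDR scores
cdr = pd.Series([0.0, 0.5, 1.0, 0.0, 2.0])

# 0 -> Normal, anything else -> Dementia (same rule as cat_CDR)
cdr_label = pd.Series(np.where(cdr == 0, "Normal", "Dementia"), index=cdr.index)

print(cdr_label.tolist())
```

As with `cat_CDR`, a missing score would end up labelled Dementia under this rule, so missing values should be handled before the grouping.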

In [None]:
plot_piechart('CDR')

As we can see, the majority of the cases are Normal, while the remaining cases (very mild, mild and moderate dementia) are grouped as Dementia.

Next we will analyse another feature named MMSE.

### About MMSE (Mini Mental State Examination)
**Mini-mental state**: a practical method for grading the cognitive state of patients for the clinician. The MMSE was designed as a screening test for evaluating cognitive impairment in older adults. It is a 30-point questionnaire that is used extensively in clinical and research settings to measure cognitive impairment.

#### Interpretations:
Any score of 24 or more (out of 30) indicates normal cognition. Below this, scores can indicate severe (≤9 points), moderate (10–18 points) or mild (19–23 points) cognitive impairment. Note that even a maximum score of 30 points can never rule out dementia. Low to very low scores correlate closely with the presence of dementia, although other mental disorders can also lead to abnormal findings on MMSE testing.

In [None]:
df_long['MMSE'].describe()

In [None]:
# Categorizing feature MMSE
def cat_MMSE(n):
    if n >= 24:
        return 'Normal'
    elif n <= 9:
        return 'Severe'
    elif 10 <= n <= 18:
        return 'Moderate'
    elif 19 <= n <= 23:
        return 'Mild'

df_long['MMSE'] = df_long['MMSE'].apply(cat_MMSE)
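
The score bands described above can also be expressed as bin edges. The sketch below uses `pd.cut` on hypothetical integer scores; the bin edges are my own translation of the bands (≤9, 10–18, 19–23, ≥24) and assume whole-number scores between 0 and 30.

```python
import pandas as pd

# Hypothetical MMSE scores covering every band boundary
mmse = pd.Series([30, 24, 23, 19, 18, 10, 9, 0])

# Right-closed intervals: (-1, 9], (9, 18], (18, 23], (23, 30]
bands = pd.cut(
    mmse,
    bins=[-1, 9, 18, 23, 30],
    labels=["Severe", "Moderate", "Mild", "Normal"],
)

print(bands.astype(str).tolist())
```

A side effect of this approach is that the result is an ordered categorical, so plots and group summaries keep the bands in order of severity rather than alphabetical order.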

In [None]:
plot_piechart('MMSE')

Here, too, the majority of cases show normal cognition, whereas very few cases show Mild, Moderate or Severe cognitive impairment.

In [None]:
univariate_percent_plot('MMSE')

As we can see from the above plot, around 40% of the cases with a Normal MMSE status are dementia cases according to CDR scoring.
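
The share of dementia cases within each MMSE band can also be tabulated directly. A minimal sketch with `pd.crosstab` on hypothetical labels (the rows are made up so that the Normal band contains 2 dementia cases out of 5):

```python
import pandas as pd

# Hypothetical MMSE bands and CDR labels for eight sessions
toy = pd.DataFrame({
    "MMSE": ["Normal", "Normal", "Normal", "Normal", "Normal",
             "Mild", "Mild", "Moderate"],
    "CDR": ["Normal", "Normal", "Normal", "Dementia", "Dementia",
            "Dementia", "Dementia", "Dementia"],
})

# normalize="index" makes every row (MMSE band) sum to 1
share = pd.crosstab(toy["MMSE"], toy["CDR"], normalize="index")

print(share)
```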

Next we will analyze Age feature to see how age is impacting the dementia status.

In [None]:
univariate_mul('Age')
df_long['Age'].describe()

Age in this dataset ranges from 60 years to 98 years.

In [None]:
# include_lowest=True keeps subjects aged exactly 60 in the first bin
df_long['age_group'] = pd.cut(df_long['Age'], [60, 70, 80, 90, 100], labels=['60-70', '70-80', '80-90', '90-100'], include_lowest=True)
df_long['age_group'].value_counts()
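
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside the first bin unless `include_lowest=True` is passed. A short sketch on hypothetical ages shows the difference:

```python
import pandas as pd

# Hypothetical ages, including one exactly on the lowest edge
ages = pd.Series([60, 61, 70, 71, 98])
bins = [60, 70, 80, 90, 100]
labels = ["60-70", "70-80", "80-90", "90-100"]

# Default: first interval is (60, 70], so age 60 is left unassigned
default_cut = pd.cut(ages, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```

Note also that an age of exactly 70 is assigned to the 60-70 group, since each interval includes its right edge.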

In [None]:
# Now plotting age group to see dementia distribution
univariate_percent_plot('age_group')

The largest share of Dementia cases is in the age group of **70-80 years** (around 45%), while the second-highest number of cases is in the **80-90 years** age group.

## 3. Bivariate Analysis

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="Age",hue="CDR",split=True, data=df_long)
plt.show()

As we can observe from the above plot, for males the dementia cases peak at around 80 years of age, while for females they peak at around 75 years. The male dementia distribution also extends further towards younger ages than the female one. Note that the violin tails are smoothed density estimates: all subjects in this dataset are aged 60 or older, so the part of the curve below 60 does not correspond to observed cases.
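
A violin plot draws a smoothed density, so its tails can extend past the observed values. One way to check a reading of the plot against the raw data is a grouped summary. The sketch below uses made-up rows; only the column names are assumed to match the dataset.

```python
import pandas as pd

# Hypothetical sessions with gender, CDR label and age
toy = pd.DataFrame({
    "M/F": ["M", "M", "M", "F", "F", "F"],
    "CDR": ["Dementia", "Dementia", "Normal", "Dementia", "Normal", "Normal"],
    "Age": [62, 80, 75, 66, 70, 88],
})

# Observed minimum, median and maximum age per gender and status
age_range = toy.groupby(["M/F", "CDR"])["Age"].agg(["min", "median", "max"])

print(age_range)
```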

Next we will analyze another important feature named eTIV.

### Estimated total intracranial volume (eTIV):
Intracranial volume (ICV) is an important normalization measure used in morphometric analyses to correct for head size in studies of Alzheimer’s disease (AD). The ICV measure, sometimes referred to as total intracranial volume (TIV), refers to the estimated volume of the cranial cavity as outlined by the supratentorial dura mater, or by the cerebral contour when the dura is not clearly detectable.
ICV is often used in studies that analyze the cerebral structure under different imaging modalities, such as Magnetic Resonance (MR).

In [None]:
df_long['eTIV'].describe()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="age_group", y="eTIV",hue="CDR",split=True, data=df_long)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="eTIV",hue="CDR",split=True, data=df_long)
plt.show()

**Normalized whole-brain volume**, expressed as a percent of all voxels in the atlas-masked image that are labeled as gray or white matter by the automated tissue segmentation process

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="nWBV",hue="CDR",split=True, data=df_long)
plt.show()


In [None]:
df_long['EDUC'].describe()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="EDUC",hue="CDR",split=True, data=df_long)
plt.show()


**Observation:**

As we can observe from the above plot, men with 10 to 17 years of education have the highest number of dementia cases, and dementia cases among men appear from an education level of about 4 years, whereas among women dementia cases appear from about 6 years of education and peak at about 13 years of education.

**SES - Socioeconomic status** as assessed by the Hollingshead Index of Social Position and classified into categories from 1 (highest status) to 5 (lowest status)

In [None]:
df_long['SES'].describe()

In [None]:
# Now plotting socio economic status to see dementia distribution
univariate_percent_plot('SES')

**Observation:**

At the lowest level of socioeconomic status, the probability of dementia is highest. This may be because poorer economic conditions lead to depression and hardship, which in turn may contribute to dementia.

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="SES",hue="CDR",split=True, data=df_long)
plt.show()


**Observation:**

- An interesting pattern emerges from the above plot: among men there are two peaks of dementia cases, one at 1 (highest status) and one at 4 (lower status), with fewer dementia cases in between, whereas among women the highest peak is at 2 and slightly fewer dementia cases are reported at 1 and 5.
- This suggests that women have a lower probability of dementia at the extreme higher and extreme lower levels of socioeconomic status, while men show the opposite pattern.

**ASF - Atlas scaling factor** (unitless). Computed scaling factor that transforms native-space brain and skull to the atlas target (i.e., the determinant of the transform matrix)

In [None]:
df_long['ASF'].describe()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="M/F", y="ASF",hue="CDR",split=True, data=df_long)
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="MMSE", y="ASF", data=df_long)
plt.show()

From the above plot we can get some intuition about ASF: for patients with normal cognition, the values of ASF are spread between 0.8 and 1.6, but as cognitive impairment appears (Mild, Moderate and Severe) the values concentrate around 1 to 1.1.

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="MMSE", y="nWBV", data=df_long)
plt.show()

**Observation:**

The same pattern is observed for nWBV: as the level of impairment increases, nWBV concentrates between 0.65 and 0.70.

In [None]:
plt.figure(figsize=(12, 8))
ax = sns.violinplot(x="MMSE", y="Visit", data=df_long)
plt.show()

**Observation:**

Severe cases start to be reported once the number of visits exceeds 3, whereas normal cases are also reported at more than 3 visits, but they are very few in number.

## Multicollinearity

In this step, we will see how strongly the variables are correlated with one another.

In [None]:
plt.figure(figsize=(14, 8))
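# correlate numerical columns only; Group, M/F, CDR and MMSE now hold text labels
sns.heatmap(df_long.select_dtypes(include='number').corr(), annot=True)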
plt.show()

As we can see, Visit and MR Delay are strongly correlated (0.92), but I am not dropping any correlated variable for now.
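
Instead of scanning the heatmap by eye, strongly correlated pairs can be listed programmatically. This is a minimal sketch on a hypothetical frame; the values are made up and the 0.9 threshold is an arbitrary choice for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric columns: "MR Delay" grows with "Visit", "Age" does not
toy = pd.DataFrame({
    "Visit": [1, 2, 3, 1, 2, 4],
    "MR Delay": [0, 400, 900, 0, 500, 1500],
    "Age": [70, 71, 73, 85, 86, 66],
})

corr = toy.corr()

# Visit each pair of columns once and keep those above the threshold
threshold = 0.9
strong = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}

print(strong)
```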

## 4. Key Insights
- Most dementia cases are observed in the age group of 70-80 years.
- Dementia cases among men extend to younger ages than among women. All subjects are aged 60 or older, so the apparent onset before 60 in the violin plot reflects the smoothing of the density curve rather than observed cases.
- Among men, dementia cases appear from an education level of 4 years, are most prevalent at 12 and 16 years, and extend beyond 20 years of education, while among women dementia cases appear after 5 years of education, are most prevalent around 12 to 13 years, and decrease as the education level increases.
- Dementia is prevalent among men with the highest and the lowest socioeconomic status, while women with medium socioeconomic status have more dementia cases.
- Lower values of ASF, close to 1, correspond to severe dementia cases.
- Severe dementia is diagnosed after a minimum of 3 visits.