<h1><font color='green'>Dementia Analysis</font></h1>

The following Jupyter notebook in Python covers our analysis of dementia data. All related files for this analysis can be found in the `dementia_project` folder.

The dataset combines data from three longitudinal studies, which are described in the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7274929/.

<center>
    <img src="img/three-studies.png" alt="three studies" width=600/>
</center>

<h2><font color='green'>Table of Contents</font></h2>

1. [Setup](#setup)
2. [Explore the dataset](#explore-the-dataset)
3. [Clean the dataset](#clean-the-dataset)

<h2><font color='green'>1. Setup</font></h2>

### Import libraries

In [2]:
import pandas as pd
import numpy as np

In [34]:
import pprint # pretty print
from collections import Counter # dict subclass for counting hashable objects

### Load the dataset

In [6]:
df = pd.read_csv("dementia_studies_data.csv", delimiter=",")

<h2><font color='green'>2. Explore the dataset</font></h2>

<div class="alert alert-block">
    <b>Shape:</b> First, let's find out the shape of the data.
</div>

In [8]:
df.shape

(1842, 22)

There are 1842 rows (the entities) and 22 columns (the features). Of these, 21 are treated as independent variables, and either `dementia` or `dementia_all` will serve as the dependent variable.
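
As a minimal sketch of how the shape maps onto features and a target, the snippet below unpacks `shape` and separates one target column from the rest. The `toy` frame and its values are hypothetical stand-ins for `df`, and choosing `dementia_all` as the target is an assumption made only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (3 rows x 3 columns).
toy = pd.DataFrame({
    "age": [52.67, 64.58, 74.92],
    "gender": ["male", "male", "male"],
    "dementia_all": [0, 0, 0],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
target = "dementia_all"      # assumed choice of dependent variable
features = [col for col in toy.columns if col != target]

print(f"{n_rows} rows, {n_cols} columns, {len(features)} candidate features")
```

The same two lines (`shape` unpacking and the list comprehension) should work unchanged on `df`.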

### Exploratory Data Analysis

Now, we can follow Cox's exploratory data analysis (EDA) process before starting any statistical analysis.

#### 1. Identify Data Types

<div class="alert alert-block">
    <b>Head:</b> Let's view the first three entities (rows) of the dataset. The output is easier to read when transposed.
</div>

In [11]:
df.head(3).T

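| | 0 | 1 | 2 |
|---|---|---|---|
| ID | 1 | 10 | 100 |
| age | 52.67 | 64.58 | 74.92 |
| gender | male | male | male |
| dementia | 0.0 | 0.0 | 0.0 |
| dementia_all | 0 | 0 | 0 |
| educationyears | 11.0 | 10.0 | 8.0 |
| EF | -2.403333 | 1.28 | -1.44 |
| PS | -1.29 | 0.36 | -1.52 |
| Global | -1.287 | 0.744 | -0.922 |
| diabetes | 0 | 0 | 0 |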


The dataset contains both quantitative and qualitative values. For instance, the IDs are discrete, age is continuous, gender is binary, and smoking status is an example of an ordinal value.

<div class="alert alert-block">
    <b>Data types:</b> Now, we show the various data types of the dataframe.
</div>

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1842 entries, 0 to 1841
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    1842 non-null   int64  
 1   age                   1842 non-null   float64
 2   gender                1842 non-null   object 
 3   dementia              1808 non-null   float64
 4   dementia_all          1842 non-null   int64  
 5   educationyears        1842 non-null   float64
 6   EF                    1634 non-null   float64
 7   PS                    1574 non-null   float64
 8   Global                1534 non-null   float64
 9   diabetes              1842 non-null   int64  
 10  smoking               1831 non-null   object 
 11  hypertension          1842 non-null   object 
 12  hypercholesterolemia  1842 non-null   object 
 13  lacunes_num           1842 non-null   object 
 14  fazekas_cat           1842 non-null   object 
 15  study                

<div class="alert alert-block alert-info">
    We see three data types: int64, float64, and object.
</div>
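
To work with the quantitative and qualitative columns separately, one option is `select_dtypes`. The sketch below uses a hypothetical `toy` frame in place of `df`; the smoking labels in it are assumed for illustration and may not match the actual encoding in the dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "ID": [1, 10, 100],
    "age": [52.67, 64.58, 74.92],
    "gender": ["male", "male", "male"],
    "smoking": ["never-smoker", "ex-smoker", None],
})

# Numeric columns (int64 and float64) versus everything else (text columns here).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Note that a numeric dtype does not imply a quantitative variable: `ID` is stored as an integer but is only an identifier.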

We can see that `dementia` contains $1842 - 1808 = 34$ null entries, whereas `dementia_all` contains none. Several other features also contain null entries, including:

- `EF`: Executive function (EF) refers to **a set of skills (i.e., working memory, planning, task switching, inhibitory control, and attention).**
- `PS`: Processing speed is **a measure of the time required to respond to and/or process information in one's environment.**
- `Global`: **Global cognitive function score as a function of health status and level of physical fitness.**
- `smoking`: Smoking status (current smoker, ex-smoker, or never-smoker).
- `SVD Simple Score`: A method used to assess the presence and severity of small vessel disease, a condition that affects the small arteries in the heart and brain.
- `SVD Amended Score`: A revised version of the `SVD Simple Score` that takes into account additional or different factors.
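
Rather than reading the null counts off `df.info()` by subtraction, they can be computed directly. The following is a minimal sketch on a hypothetical `toy` frame (made-up values); the same two lines should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({
    "dementia": [0.0, np.nan, 1.0, 0.0],
    "dementia_all": [0, 1, 1, 0],
    "EF": [-2.40, np.nan, np.nan, 1.28],
})

# Count nulls per column, keep only columns that have any, largest first.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

print(null_counts)
```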

<div class="alert alert-block">
    <b>Dementia feature:</b> Let's locate the 34 null entries in the dementia feature.
</div>

In [17]:
df[df['dementia'].isnull()]

Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,...,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
122,10,80.591376,male,,1,6.0,-1.731029,-0.805562,-1.187208,0,...,Yes,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
124,101,67.441478,male,,1,10.0,-0.873236,0.135702,-0.323025,0,...,Yes,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
134,110,72.5859,female,,1,15.0,-0.127388,0.448854,0.152974,0,...,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
193,165,76.202601,male,,1,10.0,,,,0,...,Yes,zero,2 to 3,rundmc,rundmc,1.0,3.0,3,Zero,0
203,174,78.978782,male,,1,10.0,,,,0,...,Yes,more-than-zero,2 to 3,rundmc,rundmc,2.0,3.0,2,1 to 2,0
205,176,76.925394,male,,1,17.0,0.028057,0.885542,0.044095,0,...,No,more-than-zero,0 to 1,rundmc,rundmc,2.0,3.0,1,1 to 2,>=1
225,196,75.841205,female,,1,10.0,,,,0,...,Yes,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
232,201,69.98768,male,,1,10.0,,,,1,...,Yes,more-than-zero,2 to 3,rundmc,rundmc,2.0,4.0,3,1 to 2,0
246,216,59.671458,male,,1,17.0,,,,0,...,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
267,236,65.752224,male,,1,7.0,,,,0,...,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0


We can investigate further why some of the `dementia` values are `NaN`. Notice that `dementia_all` is 1 in every row shown above where `dementia` is `NaN`. This suggests that `dementia_all` codes participants with a missing `dementia` value as 1, although the table alone does not tell us why.
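
The displayed table only shows some of the 34 rows, so the pattern is worth checking over all of them. Here is a sketch of such a check, run on a hypothetical `toy` frame with made-up values; on the real data, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two rows have a missing `dementia` value.
toy = pd.DataFrame({
    "dementia": [0.0, 1.0, np.nan, np.nan, 0.0],
    "dementia_all": [0, 1, 1, 1, 0],
})

missing = toy["dementia"].isnull()

# Distribution of dementia_all among rows where dementia is missing.
print(toy.loc[missing, "dementia_all"].value_counts())

# Where dementia is recorded, do the two columns agree?
agree = (toy.loc[~missing, "dementia"] == toy.loc[~missing, "dementia_all"]).all()
print("Columns agree where dementia is recorded:", agree)
```

If the first output contains only the value 1 and the second prints `True`, the two columns differ only in how the missing rows are coded.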

In [55]:
non_unique_IDs = df[df['ID'].duplicated(keep=False)]['ID'].tolist() # get all duplicated IDs
id_counter = Counter(non_unique_IDs) # count how many times each ID is duplicated
sorted_id_counter = dict(sorted(id_counter.items())) # sort the dictionary by key

print("Number of duplicated IDs: ", len(sorted_id_counter))
print(sorted_id_counter)
# pprint.pprint(sorted_id_counter, compact=True) # readable format

Number of duplicated IDs:  301
{1: 3, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 3, 8: 3, 9: 3, 10: 2, 11: 2, 12: 3, 13: 3, 14: 2, 15: 3, 16: 3, 17: 3, 18: 3, 19: 2, 20: 3, 22: 2, 23: 3, 25: 2, 26: 3, 27: 3, 28: 3, 29: 3, 30: 2, 31: 3, 32: 3, 33: 3, 34: 3, 35: 2, 36: 3, 37: 2, 38: 3, 39: 3, 40: 3, 41: 3, 42: 3, 43: 3, 44: 2, 45: 3, 46: 2, 47: 3, 48: 2, 49: 2, 50: 3, 51: 3, 53: 2, 54: 3, 55: 3, 56: 2, 57: 2, 58: 2, 59: 3, 60: 3, 61: 3, 62: 2, 63: 2, 64: 3, 65: 3, 66: 2, 67: 2, 68: 2, 69: 2, 70: 3, 71: 3, 72: 3, 73: 3, 74: 3, 75: 3, 76: 3, 77: 3, 78: 3, 79: 2, 80: 3, 81: 3, 82: 3, 83: 2, 84: 2, 85: 3, 86: 3, 87: 2, 88: 2, 89: 2, 90: 2, 91: 3, 92: 2, 93: 3, 95: 3, 96: 2, 97: 3, 98: 3, 99: 2, 100: 3, 101: 2, 102: 2, 103: 2, 104: 3, 105: 3, 106: 3, 107: 2, 108: 3, 109: 3, 110: 3, 111: 3, 112: 2, 113: 3, 114: 2, 115: 2, 116: 2, 117: 3, 120: 3, 121: 2, 122: 2, 123: 2, 124: 2, 125: 2, 126: 2, 127: 3, 128: 2, 129: 3, 130: 3, 131: 2, 132: 3, 133: 2, 134: 2, 135: 2, 137: 2, 139: 2, 140: 2, 143: 2, 144: 2, 

In [43]:
# Examples of ID 9's rows
df[df['ID'] == 9]

Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,...,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
111,9,67.33,male,0.0,0,14.0,-0.533333,-0.91,-0.248,0,...,Yes,more-than-zero,2 to 3,scans,scans,2.0,6.0,3,3 to 5,>=1
613,9,70.655715,male,0.0,0,7.0,,,,0,...,Yes,more-than-zero,2 to 3,rundmc,rundmc,3.0,4.0,2,1 to 2,>=1
1777,9,70.0,female,0.0,0,10.0,-1.45,0.45,-0.81,0,...,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,0,Zero,0


<div class="alert alert-block alert-warning">  
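    Here, we encounter something confusing. The different ages for ID 9 could be explained by the studies being longitudinal, but the difference in gender cannot. Note that the three rows come from three different studies (scans, rundmc, and ASPS-elderly) and also differ in education years, so the same ID may simply have been assigned to different participants in each study.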
</div>
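
One way to probe the repeated IDs is to check whether an ID ever repeats within a single study, since the three rows for ID 9 come from three different studies. The sketch below does this on a hypothetical `toy` frame (made-up rows modeled on the output above); whether the result carries over to `df` is something the real data would have to confirm.

```python
import pandas as pd

# Hypothetical toy frame: ID 9 appears once in each of three studies.
toy = pd.DataFrame({
    "ID": [9, 9, 9, 10],
    "study": ["scans", "rundmc", "ASPS-elderly", "rundmc"],
    "gender": ["male", "male", "female", "male"],
})

# Rows whose ID is repeated anywhere in the frame.
dup_by_id = toy["ID"].duplicated(keep=False).sum()

# Rows whose (study, ID) pair is repeated, i.e. repeats within one study.
dup_by_study_id = toy.duplicated(subset=["study", "ID"], keep=False).sum()

# Number of distinct studies each ID appears in.
studies_per_id = toy.groupby("ID")["study"].nunique()

print("Rows with a repeated ID:", dup_by_id)
print("Rows with a repeated (study, ID) pair:", dup_by_study_id)
print(studies_per_id)
```

If the second count is zero on the real data, the pair (`study`, `ID`) would be a safer row identifier than `ID` alone.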

<div class="alert alert-block">
    <b>Statistics:</b> Let's apply the describe method to display the descriptive statistics of the dementia dataframe.
</div>

In [12]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 1842.0 | 29897.929967 | 67056.874773 | 1.0 | 223.5 | 612.5 | 1713.25 | 211301.0 |
| age | 1842.0 | 65.952588 | 8.923488 | 38.0 | 60.0 | 66.151393 | 73.0 | 90.0 |
| dementia | 1808.0 | 0.045354 | 0.208137 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| dementia_all | 1842.0 | 0.062975 | 0.242984 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| educationyears | 1842.0 | 11.139522 | 2.983946 | 1.0 | 9.0 | 10.0 | 13.0 | 24.0 |
| EF | 1634.0 | -0.063088 | 0.785264 | -5.2 | -0.436896 | 0.13 | 0.484327 | 2.35 |
| PS | 1574.0 | -0.066649 | 0.871836 | -2.68 | -0.670805 | -0.02 | 0.55 | 2.73 |
| Global | 1534.0 | -0.02686 | 0.677071 | -2.42 | -0.476881 | 0.006667 | 0.423333 | 1.853333 |
| diabetes | 1842.0 | 0.122693 | 0.328173 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| SVD Simple Score | 1165.0 | 0.719313 | 0.932063 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
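
Because `dementia` and `dementia_all` are coded 0/1, each mean is the proportion of positive cases, so multiplying it by the count recovers the number of cases. The short calculation below uses the count and mean values copied from the output above; rounding is needed because the printed means are themselves rounded.

```python
# Values copied from the describe() output above.
dementia_count, dementia_mean = 1808, 0.045354
dementia_all_count, dementia_all_mean = 1842, 0.062975

# For a 0/1 column, count * mean is the number of rows equal to 1.
dementia_cases = round(dementia_count * dementia_mean)
dementia_all_cases = round(dementia_all_count * dementia_all_mean)
difference = dementia_all_cases - dementia_cases

print(dementia_cases, dementia_all_cases, difference)
```

The difference equals the number of null entries in `dementia`, which is consistent with those rows being coded as 1 in `dementia_all`.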


<h2><font color='green'>3. Clean the dataset</font></h2>