# Dementia Analysis

This Jupyter notebook, written in Python, covers our analysis of the dementia dataset. All files related to the analysis are in the `dementia_project` folder.

## Table of Contents

1. [Setup](#setup)
2. [Explore the dataset](#explore-the-dataset)
3. [Missing values](#missing-values)

## 1. Setup

### Import libraries

In [1]:
import pandas as pd
import numpy as np

### Load the dataset

In [2]:
dementia_dataset = pd.read_csv("dementia_studies_data.csv", delimiter=",")
dementia_dataset.head()

Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,...,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
0,1,52.67,male,0.0,0,11.0,-2.403333,-1.29,-1.287,0,...,Yes,more-than-zero,2 to 3,scans,scans,3.0,7.0,3,>5,>=1
1,10,64.58,male,0.0,0,10.0,1.28,0.36,0.744,0,...,Yes,more-than-zero,0 to 1,scans,scans,2.0,3.0,1,1 to 2,>=1
2,100,74.92,male,0.0,0,8.0,-1.44,-1.52,-0.922,0,...,Yes,more-than-zero,0 to 1,scans,scans,1.0,2.0,1,1 to 2,0
3,101,74.83,male,1.0,1,9.0,,-2.136271,-1.301102,0,...,Yes,more-than-zero,2 to 3,scans,scans,2.0,4.0,2,3 to 5,0
4,102,79.25,male,0.0,0,10.0,-0.92,-1.493333,-0.924,0,...,Yes,more-than-zero,2 to 3,scans,scans,2.0,3.0,2,1 to 2,0
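
As a small illustration of how `pd.read_csv` treats empty fields, the sketch below parses a two-row excerpt from an in-memory string instead of the real file. The excerpt keeps only four of the columns shown above, and the names `csv_text` and `sample` are made up for this example.

```python
import io

import pandas as pd

# Two rows taken from the output above, reduced to four columns.
# The second row has an empty EF field.
csv_text = """ID,age,gender,EF
1,52.67,male,-2.403333
101,74.83,male,
"""

# io.StringIO lets read_csv treat the string like a file on disk.
sample = pd.read_csv(io.StringIO(csv_text), delimiter=",")

print(sample.shape)                  # (2, 4)
print(sample["EF"].isna().tolist())  # [False, True]
```

The empty field is read as `NaN`, which is what the missing-value checks later in the notebook look for.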


## 2. Explore the dataset

In [4]:
dementia_dataset

Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,...,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
0,1,52.67,male,0.0,0,11.0,-2.403333,-1.290000,-1.287000,0,...,Yes,more-than-zero,2 to 3,scans,scans,3.0,7.0,3,>5,>=1
1,10,64.58,male,0.0,0,10.0,1.280000,0.360000,0.744000,0,...,Yes,more-than-zero,0 to 1,scans,scans,2.0,3.0,1,1 to 2,>=1
2,100,74.92,male,0.0,0,8.0,-1.440000,-1.520000,-0.922000,0,...,Yes,more-than-zero,0 to 1,scans,scans,1.0,2.0,1,1 to 2,0
3,101,74.83,male,1.0,1,9.0,,-2.136271,-1.301102,0,...,Yes,more-than-zero,2 to 3,scans,scans,2.0,4.0,2,3 to 5,0
4,102,79.25,male,0.0,0,10.0,-0.920000,-1.493333,-0.924000,0,...,Yes,more-than-zero,2 to 3,scans,scans,2.0,3.0,2,1 to 2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1837,989,67.00,female,0.0,0,13.0,-0.100000,-0.020000,-0.260000,0,...,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,0,Zero,0
1838,990,71.00,female,0.0,0,13.0,-0.370000,-1.060000,-1.003333,0,...,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,1,Zero,0
1839,991,55.00,male,0.0,0,10.0,0.460000,0.960000,0.610000,0,...,Yes,zero,0 to 1,ASPS-elderly,ASPS,0.0,0.0,0,Zero,0
1840,995,54.00,male,0.0,0,10.0,0.190000,0.030000,0.590000,0,...,Yes,zero,0 to 1,ASPS-elderly,ASPS,0.0,0.0,1,Zero,0


In [5]:
dementia_dataset.dtypes

ID                        int64
age                     float64
gender                   object
dementia                float64
dementia_all              int64
educationyears          float64
EF                      float64
PS                      float64
Global                  float64
diabetes                  int64
smoking                  object
hypertension             object
hypercholesterolemia     object
lacunes_num              object
fazekas_cat              object
study                    object
study1                   object
SVD Simple Score        float64
SVD Amended Score       float64
Fazekas                   int64
lac_count                object
CMB_count                object
dtype: object

In [6]:
dementia_dataset.columns


Index(['ID', 'age', 'gender', 'dementia', 'dementia_all', 'educationyears',
       'EF', 'PS', 'Global', 'diabetes', 'smoking', 'hypertension',
       'hypercholesterolemia', 'lacunes_num', 'fazekas_cat', 'study', 'study1',
       'SVD Simple Score', 'SVD Amended Score', 'Fazekas', 'lac_count',
       'CMB_count'],
      dtype='object')

In [7]:
dementia_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1842 entries, 0 to 1841
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    1842 non-null   int64  
 1   age                   1842 non-null   float64
 2   gender                1842 non-null   object 
 3   dementia              1808 non-null   float64
 4   dementia_all          1842 non-null   int64  
 5   educationyears        1842 non-null   float64
 6   EF                    1634 non-null   float64
 7   PS                    1574 non-null   float64
 8   Global                1534 non-null   float64
 9   diabetes              1842 non-null   int64  
 10  smoking               1831 non-null   object 
 11  hypertension          1842 non-null   object 
 12  hypercholesterolemia  1842 non-null   object 
 13  lacunes_num           1842 non-null   object 
 14  fazekas_cat           1842 non-null   object 
 15  study                

## 3. Missing values

To check whether the dataset contains missing values, we run the following code:

In [8]:
dementia_dataset.isnull().values.any()


True

Since the result is `True`, we know the dataset contains missing values. To see which columns are affected and how many rows in each column hold a value, we use the `count()` method:

In [9]:
dementia_dataset.count()

ID                      1842
age                     1842
gender                  1842
dementia                1808
dementia_all            1842
educationyears          1842
EF                      1634
PS                      1574
Global                  1534
diabetes                1842
smoking                 1831
hypertension            1842
hypercholesterolemia    1842
lacunes_num             1842
fazekas_cat             1842
study                   1842
study1                  1842
SVD Simple Score        1165
SVD Amended Score       1165
Fazekas                 1842
lac_count               1842
CMB_count               1842
dtype: int64
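
Because `count()` reports non-null rows, the number of missing rows per column is the total row count minus that figure (for `dementia`, 1842 - 1808 = 34). A minimal sketch of this arithmetic, assuming a small invented DataFrame named `toy` rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [52.67, 64.58, 74.92, 74.83],
    "EF": [-2.40, 1.28, np.nan, np.nan],
    "smoking": ["current-smoker", None, "ex-smoker", "never-smoker"],
})

non_null = toy.count()         # rows that hold a value, per column
missing = len(toy) - non_null  # rows without a value, per column

print({col: int(n) for col, n in missing.items()})
# {'age': 0, 'EF': 2, 'smoking': 1}

# Counting the missing entries directly gives the same numbers.
same_as_isna = bool((missing == toy.isna().sum()).all())
print(same_as_isna)  # True
```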

In [11]:
dementia_dataset.isna().sum()


ID                        0
age                       0
gender                    0
dementia                 34
dementia_all              0
educationyears            0
EF                      208
PS                      268
Global                  308
diabetes                  0
smoking                  11
hypertension              0
hypercholesterolemia      0
lacunes_num               0
fazekas_cat               0
study                     0
study1                    0
SVD Simple Score        677
SVD Amended Score       677
Fazekas                   0
lac_count                 0
CMB_count                 0
dtype: int64
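
Raw counts are easier to compare as a share of all rows. For instance, 677 missing values out of 1842 rows is roughly 37% for the two SVD score columns. The sketch below shows one way to compute such percentages, using an invented DataFrame `toy`. The name `missing_pct` is also just for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [52.67, 64.58, 74.92, 74.83],
    "EF": [-2.40, 1.28, np.nan, np.nan],
    "smoking": ["current-smoker", None, "ex-smoker", "never-smoker"],
})

# isna() yields booleans; the column mean is the fraction of missing rows.
missing_pct = toy.isna().mean() * 100

# Keep only columns that have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

On the toy data this leaves `EF` (50%) and `smoking` (25%), and drops `age` because it has no gaps.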

In [13]:
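# Show every column instead of eliding the middle ones with "..."
pd.set_option('display.max_columns', None)
# Select the rows where "smoking" is missing
dementia_dataset[dementia_dataset["smoking"].isnull()]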


Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
679,1097,78.0,female,0.0,0,10.0,0.28,-1.41,-0.7,0,,Yes,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,1,Zero,0
736,12,64.0,female,0.0,0,10.0,0.09,-0.31,-0.22,0,,Yes,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,1,Zero,0
878,1454,73.0,male,0.0,0,18.0,0.3,-0.24,-0.023333,1,,No,Yes,zero,0 to 1,ASPS-elderly,ASPS,0.0,0.0,0,Zero,0
965,161300,69.0,male,0.0,0,10.0,0.67,-0.06,0.393333,0,,Yes,Yes,zero,0 to 1,ASPS-family,ASPS,0.0,0.0,0,Zero,0
966,161301,50.0,male,0.0,0,13.0,0.75,0.62,1.273333,0,,Yes,Yes,zero,0 to 1,ASPS-family,ASPS,1.0,1.0,1,Zero,>=1
977,163402,65.0,female,0.0,0,9.0,-0.52,-0.76,-0.913333,0,,Yes,Yes,zero,0 to 1,ASPS-family,ASPS,0.0,0.0,1,Zero,0
1198,202200,76.0,female,0.0,0,10.0,-1.13,-1.08,-0.906667,1,,Yes,Yes,zero,2 to 3,ASPS-family,ASPS,1.0,1.0,2,Zero,0
1199,202202,50.0,male,0.0,0,18.0,0.57,0.48,1.416667,0,,No,Yes,zero,0 to 1,ASPS-family,ASPS,0.0,0.0,1,Zero,0
1261,205600,72.0,male,0.0,0,10.0,-1.76,-2.1,-1.77,1,,Yes,Yes,zero,0 to 1,ASPS-family,ASPS,0.0,0.0,1,Zero,0
1276,206401,49.0,male,0.0,0,10.0,0.52,1.02,0.88,0,,Yes,No,zero,0 to 1,ASPS-family,ASPS,0.0,0.0,1,Zero,0


In [14]:
dementia_dataset[dementia_dataset["Global"].isnull()]


Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
121,1,73.626283,male,0.0,0,10.0,,,,0,current-smoker,Yes,Yes,more-than-zero,2 to 3,rundmc,rundmc,3.0,6.0,3,3 to 5,>=1
126,103,56.134155,male,0.0,0,10.0,,,,0,current-smoker,No,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
131,108,76.358658,male,0.0,0,15.0,,,,0,ex-smoker,Yes,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
132,109,68.536619,female,0.0,0,8.0,,,,0,current-smoker,No,No,zero,0 to 1,rundmc,rundmc,0.0,1.0,1,Zero,0
135,111,71.501711,male,0.0,0,7.0,,,,0,ex-smoker,Yes,Yes,zero,2 to 3,rundmc,rundmc,1.0,2.0,2,Zero,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1794,921,53.000000,male,0.0,0,10.0,0.57,,,0,current-smoker,No,Yes,zero,0 to 1,ASPS-elderly,ASPS,0.0,0.0,0,Zero,0
1800,930,78.000000,female,0.0,0,9.0,-1.22,-1.08,,0,never-smoker,No,Yes,zero,0 to 1,ASPS-elderly,ASPS,,,1,Zero,0
1805,937,77.000000,male,0.0,0,10.0,0.19,,,0,never-smoker,Yes,Yes,zero,2 to 3,ASPS-elderly,ASPS,,,3,Zero,0
1814,950,77.000000,male,0.0,0,9.0,,-0.89,,0,never-smoker,No,No,zero,2 to 3,ASPS-elderly,ASPS,,,3,Zero,0
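
The rows above come from more than one study (`rundmc` and `ASPS-elderly` are both visible), so it may help to count the missing `Global` values per study. A minimal sketch on invented toy data follows. The names `toy` and `missing_by_study` are made up here, and the counts say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in: study labels follow the notebook, the values are invented.
toy = pd.DataFrame({
    "study": ["scans", "rundmc", "rundmc", "ASPS-elderly", "ASPS-elderly"],
    "Global": [-1.287, np.nan, np.nan, -0.26, np.nan],
})

# Flag the missing rows, then add up the flags within each study.
missing_by_study = toy["Global"].isna().groupby(toy["study"]).sum()
print(missing_by_study)
```

Grouping the boolean mask, rather than filtering first, keeps studies with no missing values in the result with a count of zero.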


In [15]:
dementia_dataset.isna().sum()

ID                        0
age                       0
gender                    0
dementia                 34
dementia_all              0
educationyears            0
EF                      208
PS                      268
Global                  308
diabetes                  0
smoking                  11
hypertension              0
hypercholesterolemia      0
lacunes_num               0
fazekas_cat               0
study                     0
study1                    0
SVD Simple Score        677
SVD Amended Score       677
Fazekas                   0
lac_count                 0
CMB_count                 0
dtype: int64