## 1. Setup 

### Import and Load the dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from numpy import asarray

dementia_data = pd.read_csv("data/dementia_data_cleaned_v1.csv", delimiter=",")

## 4. Correlation matrix

The correlation matrix visualizes the Pearson correlation between the variables in the dataset.

In [None]:
corr = dementia_data.corr(method='pearson', numeric_only=True)
corr
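
As a small illustration of what `corr(method='pearson')` computes, the sketch below calculates the Pearson coefficient by hand for two made-up columns and compares it with the pandas result. The frame `toy` and its values are assumptions for the example, not part of the dementia dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: y is roughly 2 * x, so the correlation should be close to 1.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.1, 5.9, 8.2]})

# Pearson r = covariance of the centered columns divided by the product of their norms.
x_c = toy["x"] - toy["x"].mean()
y_c = toy["y"] - toy["y"].mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same value taken from the pandas correlation matrix.
r_pandas = toy.corr(method="pearson").loc["x", "y"]

print(r_manual, r_pandas)
```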

In [None]:
fig, ax = plt.subplots(figsize=(50,50))
sns.set(font_scale=8.0)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu_r', annot=True, linewidth=0.5, ax=ax, annot_kws={"size":42},)


The Pearson correlation matrix, visualized as a heat map, shows the correlations between the features in the dataset. We use it to understand which features might have the most impact on dementia_all, as this is the feature we want to predict. Because `numeric_only=True` is passed, non-numeric columns are left out of this first matrix.

The dataset still contains nominal and categorical data types, so to use the nominal data for modeling we need to transform it to numerical data.

## <font color=green> 3. Cleaning the dataset 2.0 </font> ##

In [None]:
dementia_data

Our target feature is *dementia_all*. This is the feature we aim to predict with our model, given its correlation with the other features. *dementia_all* is a copy of the feature *dementia*. However, *dementia* contains 34 missing values (NaN), whereas these values have been assigned the value 1 in *dementia_all*.

We chose *dementia_all* as the target feature because we prefer a model that predicts a false positive rather than a false negative <font color = yellow> (source????) </font>. The same argument motivated creating the column in the first place and assigning the value 1 to the missing values instead of dropping the 34 rows that contain them.
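
The relationship between the two columns can be sketched on made-up data. The snippet below is only an assumption about how a column like *dementia_all* could be derived from *dementia* (missing labels filled with 1). The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: two of the five rows have no diagnosis recorded.
toy = pd.DataFrame({"dementia": [0.0, 1.0, np.nan, 0.0, np.nan]})

# Treat a missing diagnosis as positive (1), favoring false positives over false negatives.
toy["dementia_all"] = toy["dementia"].fillna(1).astype(int)

# Count the missing labels and the rows where the two columns disagree.
n_missing = int(toy["dementia"].isna().sum())
n_differs = int((toy["dementia"].fillna(-1) != toy["dementia_all"]).sum())

print(toy)
print(n_missing, n_differs)
```

On real data, a check like `n_missing == n_differs` is one way to confirm that the two columns differ only in the rows that were missing.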

To include the different features in our correlation matrix later on, we convert the features stored with the object data type to numerical data types.

In [None]:
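# Note: rename() returns a new DataFrame and the result is not assigned here,
# so dementia_data keeps the original column names used in the cells below.
dementia_data.rename(columns={"SVD Simple Score": "svd_simple_score", "SVD Amended Score": "svd_amended_score"})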
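
A minimal sketch of how `rename` behaves with and without assignment, using a made-up two-row frame (`toy` is an assumption for the example). It shows that the call only takes effect when its result is kept.

```python
import pandas as pd

# Toy frame with the same two column names (values are made up).
toy = pd.DataFrame({"SVD Simple Score": [0, 1], "SVD Amended Score": [0, 2]})

new_names = {"SVD Simple Score": "svd_simple_score",
             "SVD Amended Score": "svd_amended_score"}

# Without assignment, the renamed copy is discarded and `toy` is unchanged.
toy.rename(columns=new_names)
unchanged_columns = list(toy.columns)

# Assigning the result keeps the new names.
renamed = toy.rename(columns=new_names)
renamed_columns = list(renamed.columns)

print(unchanged_columns)
print(renamed_columns)
```

If the snake_case names were adopted in this notebook, every later cell that refers to `'SVD Simple Score'` or `'SVD Amended Score'` would need to change as well.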

Since we use *dementia_all* as the target feature, we no longer need *dementia*, so it is removed from the dataset. Furthermore, the two features *study* and *study1* duplicate the information about which study each record came from. *study1* adds no new information, so it is not relevant for developing a model to predict dementia. The cells below also drop *ID* and *study*.

In [None]:
dementia_data.drop("dementia", axis=1, inplace=True)

In [None]:
dementia_data.drop("study1", axis=1, inplace=True)
dementia_data.drop("ID", axis=1, inplace=True)
dementia_data.drop("study", axis=1, inplace=True)

In [None]:
dementia_data

In [None]:
dementia_data['lac_count'].unique()

In [None]:

dementia_data['fazekas_cat'].unique()

In [None]:
dementia_data['lacunes_num'].unique()

In [None]:
dementia_data['SVD Simple Score'].unique()

In [None]:
dementia_data['SVD Amended Score'].unique()

As we found out earlier, a significant number of rows (677) contain missing values (NaN) in the features *'SVD Simple Score'* and *'SVD Amended Score'*. These features summarize the results of a patient's MRI scan, so we assume that patients with NaN in these features have not been scanned, on the assumption that patients do not receive an MRI scan unless they show symptoms. This serves as the argument for filling the missing values with 0. The cell below fills with the most frequent value (the mode) of each feature, which is the same as filling with 0 only if 0 is the most frequent score.

In [None]:
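# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not update dementia_data.
dementia_data['SVD Simple Score'] = dementia_data['SVD Simple Score'].fillna(dementia_data['SVD Simple Score'].mode()[0])
dementia_data['SVD Amended Score'] = dementia_data['SVD Amended Score'].fillna(dementia_data['SVD Amended Score'].mode()[0])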
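
Whether filling with the mode matches the "not scanned means 0" argument depends on the data. The sketch below, on a made-up series, shows one way to check this before imputing. The values are assumptions, not the real scores.

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing entries; 0 is the most frequent observed value.
scores = pd.Series([0, 0, 1, np.nan, 2, np.nan, 0], name="score")

# mode() ignores NaN by default and returns the most frequent value(s), sorted.
most_frequent = scores.mode()[0]

filled_with_mode = scores.fillna(most_frequent)
filled_with_zero = scores.fillna(0)

# The two strategies agree only when the mode is 0.
same_result = filled_with_mode.equals(filled_with_zero)

print(most_frequent, same_result)
```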

Checking that the code worked: there should be no more NaN values in the feature.

In [None]:
dementia_data['SVD Simple Score'].unique()

For participants with a missing value in the smoking feature, we assume that they are never-smokers (0).

In [None]:
dementia_data['smoking'] = dementia_data['smoking'].fillna(0)
dementia_data.smoking.unique()

The features *EF*, *PS* and *Global* also have a noticeable number of missing values (208, 268, and 308 respectively). The argument above does not apply here: EF (Executive Function), PS (Processing Speed), and Global (Global Cognitive Score) cannot be set to 0, as that would imply that a patient has no executive function, processing speed, or global cognition. Instead, we replace the NaN values in these three features with the mean of the observed values in the respective feature.

In [None]:
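# Assign the filled columns back rather than using fillna(inplace=True) on a column selection
dementia_data['EF'] = dementia_data['EF'].fillna(dementia_data['EF'].mean())
dementia_data['PS'] = dementia_data['PS'].fillna(dementia_data['PS'].mean())
dementia_data['Global'] = dementia_data['Global'].fillna(dementia_data['Global'].mean())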
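
One property of mean imputation worth keeping in mind is that it preserves the column mean but reduces its spread. The sketch below illustrates this on a made-up series (`ef` is a hypothetical stand-in, not the real *EF* column).

```python
import numpy as np
import pandas as pd

# Made-up cognitive scores with two missing values.
ef = pd.Series([-1.0, 0.0, 1.0, np.nan, np.nan, 2.0])

# Fill the gaps with the mean of the observed values.
ef_filled = ef.fillna(ef.mean())

# The mean is unchanged, but the standard deviation shrinks because
# the imputed rows sit exactly at the center of the distribution.
print(ef.mean(), ef_filled.mean())
print(ef.std(), ef_filled.std())
```

With several hundred imputed rows per feature, this shrinkage can weaken the measured correlation between these features and the target.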

Checking whether we managed to exclude missing values. 

In [None]:
dementia_data.isna().sum()

In [None]:
dementia_data['lac_count'].unique()

In [None]:
dementia_data['CMB_count'].unique()

In [None]:
dementia_data['lacunes_num'].unique()

Replacing the categorical values in these columns with numerical values.

In [None]:
dementia_data.replace({"lacunes_num": {"zero": 0, "more-than-zero": 1}}, inplace=True)
dementia_data.replace({"fazekas_cat": {"0 to 1": 0, "2 to 3": 1}}, inplace=True)
dementia_data.replace({"lac_count": {"Zero": 0, "1 to 2": 1, "3 to 5": 3, ">5": 5}},inplace=True)
dementia_data.replace({"CMB_count": {"0": 0, ">=1": 1}}, inplace=True)
dementia_data.replace({"gender":{"female" : 0, "male" : 1}}, inplace=True)
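
`replace` leaves any value that is not in the dictionary untouched, so a misspelled category would silently stay a string. As a hedged alternative sketch, `Series.map` turns unmapped values into NaN, which makes them easy to detect. The frame `toy` and the name `lac_count_codes` are assumptions for the example.

```python
import pandas as pd

# Made-up categories using the same labels as the lac_count mapping above.
toy = pd.DataFrame({"lac_count": ["Zero", "1 to 2", "3 to 5", ">5", "Zero"]})
lac_count_codes = {"Zero": 0, "1 to 2": 1, "3 to 5": 3, ">5": 5}

# map() returns NaN for any label missing from the dictionary.
encoded = toy["lac_count"].map(lac_count_codes)
unmapped = toy.loc[encoded.isna(), "lac_count"].unique().tolist()

# A label with different capitalization is not matched and becomes NaN.
typo_encoded = pd.Series(["zero"]).map(lac_count_codes)

print(encoded.tolist(), unmapped)
print(typo_encoded.isna().all())
```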

In [None]:
dementia_data['lacunes_num'].unique()

In [None]:
dementia_data

In [None]:
dementia_data.dtypes

## <font color=green> 4. Visualization 2.0 </font> ##

In [None]:
corr = dementia_data.corr(method='pearson', numeric_only=True)
corr

In [None]:
fig, ax = plt.subplots(figsize=(50,50))
sns.set(font_scale=8.0)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu_r', annot=True, linewidth=0.5, ax=ax, annot_kws={"size":42},)

Redoing the heatmap after cleaning the data shows an even weaker correlation with our target feature *dementia_all*.

In [None]:
dementia_data

In [None]:
sns.set(font_scale=1.0)
dementia_data['dementia_all'].value_counts().plot(kind='pie', explode=[0,0.1], autopct='%1.1f%%', shadow=True)
plt.legend(dementia_data["dementia_all"].value_counts().index)
plt.show()

Only 6.3% of the participants in the dataset have been diagnosed with dementia. Looking at how the dataset is divided by gender, the split is almost 50/50.

In [None]:
ax = dementia_data['gender'].value_counts().sort_index().plot(kind='pie',rot=0, ylabel='Distribution in %',labels=['Women','Men'], 
                                                                   shadow=True, autopct='%1.1f%%',textprops={'horizontalalignment': 'center'})
plt.savefig("img/pie_chart_gender")
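
The shares shown in the two pie charts can also be computed as numbers. The sketch below does this on a made-up frame with the same column names and the 0/1 coding used above. The counts are assumptions and do not reproduce the real 6.3%.

```python
import pandas as pd

# Made-up data: 20 participants, one positive case, alternating gender codes.
toy = pd.DataFrame({
    "dementia_all": [0] * 19 + [1],
    "gender": [0, 1] * 10,
})

# Share of each class in the target column.
class_share = toy["dementia_all"].value_counts(normalize=True).sort_index()

# Share of each class within each gender (every row sums to 1).
by_gender = pd.crosstab(toy["gender"], toy["dementia_all"], normalize="index")

print(class_share)
print(by_gender)
```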

In [None]:
# plots 
distribution_features = ["age", "educationyears", "smoking"]
boxplot_feature = ["smoking", "educationyears"]
bar_feature = ["smoking"]
img_folder = "img/"



for feature in distribution_features: 
    sns.kdeplot(dementia_data, x=feature, hue="dementia_all")
    plt.title("Distribution for " + feature)
    plt.savefig(img_folder + "dis_" + feature)
    plt.show()

for feature in boxplot_feature:
    sns.boxplot(dementia_data, x="dementia_all", y=feature)
    plt.title("Boxplot for " + feature)
    plt.savefig(img_folder + "box_" + feature)
    plt.show()


for feature in bar_feature: 

    grouped_data = dementia_data.groupby(['smoking', 'dementia_all']).size().reset_index(name='count')

    # Pivot the data to prepare for plotting
    pivot_data = grouped_data.pivot(index='smoking', columns='dementia_all', values='count').fillna(0)
    pivot_data = pivot_data.reindex(columns=[1, 0])

    # Create the stacked bar plot
    ax = pivot_data.plot(kind='bar', stacked=True, width=0.75)
    for p in ax.patches:
        # Label only the small segments; taller bars are covered by the totals below.
        # Annotating inside the if-branch avoids using `text` before it is defined.
        if p.get_height() < 100:
            text = str(int(p.get_height()))
            ax.annotate(text, (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 6), textcoords='offset points')
    totals = pivot_data.sum(axis=1)
    for i, total in enumerate(totals):

        ax.text(i, total + 6, str(int(total)), ha='center', va='bottom')
    plt.xticks(range(len(pivot_data.index)), pivot_data.index.astype(int))
    plt.xlabel("Smoking")
    plt.ylabel("Sum")
    plt.legend(title='Dementia', loc='upper right')
    plt.tight_layout()
    plt.xticks(rotation=0)
    plt.savefig(img_folder + "bar_" + feature)
    plt.show()

In [None]:
dementia_data.drop('dementia_all', axis=1).corrwith(dementia_data.dementia_all).sort_values().plot(kind='barh',figsize=(10,10))
dementia_data

We select 13 features for model training, chosen by the absolute value of their correlation with *dementia_all*.

In [None]:
NUMBER_OF_FEATURES = 13
corr = dementia_data.drop('dementia_all', axis=1).corrwith(dementia_data.dementia_all).abs().sort_values(ascending=False)
n_features = corr.head(NUMBER_OF_FEATURES)
type(n_features)


In [None]:
model_features = n_features.index.to_numpy()
model_features
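
To show why the ranking uses absolute values, here is a minimal sketch on synthetic data. The column names `strong_pos`, `strong_neg` and `noise` and all values are assumptions. A strongly negative correlation is as informative as a strongly positive one, and taking `abs()` keeps both at the top.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic features: one positively related, one negatively related, one unrelated.
toy = pd.DataFrame({
    "target": target,
    "strong_pos": target + rng.normal(0, 0.1, n),
    "strong_neg": -target + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
})

# Rank features by the absolute Pearson correlation with the target.
ranked = (
    toy.drop("target", axis=1)
    .corrwith(toy["target"])
    .abs()
    .sort_values(ascending=False)
)
top_two = ranked.head(2).index.to_list()

print(ranked)
print(top_two)
```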

Save the dataframe to CSV so it can be used in a clean notebook for model training.

In [None]:
dementia_data.to_csv('data/dementia_data_cleaned_v2.csv', sep=',', index=False)