## <font color=green> 1. Setup <a name="setup"></a></font> ##

### Import libraries

In [None]:
import pandas as pd
import numpy as np

### Load the dataset

In [None]:
dementia_data = pd.read_csv("data/dementia_studies_data.csv", delimiter=",")
dementia_data.head()

## <font color=green> 2. Explore the dataset <a name="explore"></a></font> ##

<div class="alert alert-block alert-info">
    <b>Shape:</b> First, let's find out the shape of the data.
</div>

In [None]:
dementia_data.shape

There are 1842 rows in the dataset, one per entity, and 22 columns, the features. The features consist of 21 independent variables, and the feature <font color=green> **dementia** </font> or <font color=green> **dementia_all** </font> will be the dependent variable.
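
As a minimal sketch of how the shape and the split into independent and dependent variables fit together, the toy frame below stands in for the real data. The `age` column and all values are made up for illustration; only `hypertension`, `dementia` and `dementia_all` are names taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; "age" and all values are hypothetical.
toy = pd.DataFrame({
    "age": [70, 82, 65],
    "hypertension": ["Yes", "No", "Yes"],
    "dementia": [0, 1, 0],
    "dementia_all": [0, 1, 1],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# Both candidate targets are removed from the feature matrix,
# and one of them is chosen as the dependent variable.
target_cols = ["dementia", "dementia_all"]
X = toy.drop(columns=target_cols)
y = toy["dementia_all"]

print(n_rows, n_cols, list(X.columns))
```

Dropping both target columns from `X` avoids leaking one target into the prediction of the other.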

<font color=green> What is the difference between dementia and dementia_all? </font>


The difference between <font color=green> dementia </font> and <font color =green> dementia_all </font>  is 


In [None]:
dementia_data

In [None]:
dementia_data.dtypes

In [None]:
dementia_data.columns


In [None]:
dementia_data.info()

## 3. Exploring the dataset <a name="clean"></a>

### Missing values

To check whether the dataset contains missing values, we run the following code:

Importing various libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from numpy import asarray

In [None]:
dementia_data.isnull().values.any()

Since we get 'True', we know that there is missing data in the dataset. To check which columns are missing data and how many rows in each column have values, we use the count() method.

In [None]:
dementia_data.count()

In [None]:
dementia_data.isna().sum()
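
The counts above can also be combined into one small summary table. This is only a sketch on a toy frame with made-up values; the column names `smoking`, `Global` and `hypertension` are borrowed from the dataset, and `summary` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "smoking": ["never-smoker", np.nan, "ex-smoker", np.nan],
    "Global": [0.1, 0.2, np.nan, 0.4],
    "hypertension": ["Yes", "No", "No", "Yes"],
})

# Number of missing values per column
missing = toy.isna().sum()

# Combine absolute counts and percentages in one table
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have missing values, largest first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```

Seeing the percentage next to the count makes it easier to judge whether dropping rows is acceptable for a given column.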

In [None]:
pd.set_option('display.max_columns', None)
dementia_data[dementia_data["smoking"].isnull()]


In [None]:
dementia_data[dementia_data["Global"].isnull()]

In [None]:
dementia_data.isna().sum()

## <font color=green> 3. Cleaning the dataset </font> ##


(These are the changes we made in the exercise on April 19, 2024.)

In [None]:
dementia_data

In [None]:
dementia_data_type = dementia_data['hypertension'].dtype
print(dementia_data_type)


Instead of keeping the object datatype in the hypertension column, we change it to an integer by using the following code, and we do the same with the hypercholesterolemia column.

In [None]:
# Encode the Yes/No columns as 0/1 and the smoking status as 0/1/2
dementia_data.replace({"hypertension" : {"No": 0, "Yes": 1}}, inplace=True)
dementia_data.replace({"hypercholesterolemia" : {"No": 0, "Yes": 1}}, inplace=True)
dementia_data.replace({"smoking" : {"never-smoker": 0, "ex-smoker": 1, "current-smoker": 2}}, inplace=True)
dementia_data.head()

In [None]:
dementia_data.smoking

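When exploring the dataset in step 2, we found that there were 11 missing values within the feature 'smoking'. Because a missing value (NaN) is stored as a floating-point number, pandas cannot keep this feature as a regular integer column, which is why it has the datatype float64. The nullable integer type 'Int64' can hold missing values, so we convert the feature to that type.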
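
A small sketch of this behaviour on a made-up series (the values are illustrative, not taken from the dataset): a column of integer codes with one missing value is stored as float64, while the nullable 'Int64' type keeps the integers and marks the gap as missing.

```python
import numpy as np
import pandas as pd

# Integer codes with one missing value; NaN forces the float64 dtype
codes = pd.Series([0, 1, np.nan, 2])
print(codes.dtype)

# The nullable integer type stores the same values as integers plus <NA>
nullable = codes.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```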



In [None]:
dementia_data['smoking'] = dementia_data['smoking'].astype('Int64')

In [None]:
dementia_data.smoking

In [None]:
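# TODO: move this down to the other replace() calls, or move the others up here.
# Alternative to replace(): map the smoking labels to the same codes as above.
# Note: map() only works on the original string labels (any value not in the
# mapping becomes NaN), and the result is not assigned back to the dataframe.
mapping = {"never-smoker": 0, "ex-smoker": 1, "current-smoker": 2}
dementia_data["smoking"].map(mapping)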

In [None]:
dementia_data.smoking.unique()

We print the unique values of the smoking column to see how a missing value is represented after converting the feature (smoking) to an integer datatype.


<font color = yellow> We need to either remove the "na" values or convert them to 0 or 1, if smoking is a very important feature. </font>
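
The two options could look roughly like this. It is a sketch on made-up values, not a decision for the real data: filling with the most frequent code is an assumption chosen for illustration, and `smoking_was_missing` is a hypothetical helper column that records which rows were filled.

```python
import pandas as pd

# Toy frame with made-up values; smoking uses the nullable Int64 type
toy = pd.DataFrame({
    "smoking": pd.array([0, None, 2, 0, None], dtype="Int64"),
    "dementia_all": [0, 1, 0, 1, 0],
})

# Option 1: remove the rows where smoking is missing
dropped = toy.dropna(subset=["smoking"])

# Option 2: fill the gaps with the most frequent code (an assumption)
# and keep a flag so the filled rows stay identifiable
most_common = toy["smoking"].mode().iloc[0]
filled = toy.assign(
    smoking_was_missing=toy["smoking"].isna(),
    smoking=toy["smoking"].fillna(most_common),
)

print(len(dropped), filled["smoking"].tolist())
```

Dropping is simplest when few rows are affected, while filling keeps all rows at the cost of introducing guessed values.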

In [None]:
dementia_data

<font color = yellow> Consider removing the columns that show which study each row came from. </font>


## 4. Correlation matrix

The correlation matrix shows the Pearson correlation between the variables in the dataset.

In [None]:
corr = dementia_data.corr(method='pearson', numeric_only=True)
corr

In [None]:
fig, ax = plt.subplots(figsize=(50, 50))
sns.set(font_scale=8.0)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap='RdBu_r', annot=True, linewidth=0.5, ax=ax,
            annot_kws={"size": 42})

The Pearson correlation matrix, visualized as a heat map, shows the pairwise correlations between the numeric features in the dataset. We use it to understand which features might have the most impact on dementia_all, as this is the feature we want to be able to predict.
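
Besides reading the heat map, the column for dementia_all can be sorted to rank the features by the strength of their correlation. The sketch below uses a toy frame with made-up values; `age` and `study` are hypothetical column names, and `ranked` is introduced here for illustration.

```python
import pandas as pd

# Toy frame with made-up values; "age" and "study" are hypothetical columns
toy = pd.DataFrame({
    "age": [60, 65, 70, 75, 80, 85],
    "hypertension": [0, 1, 0, 1, 1, 1],
    "study": ["a", "a", "b", "b", "c", "c"],
    "dementia_all": [0, 0, 0, 1, 1, 1],
})

# Non-numeric columns such as "study" are left out by numeric_only=True
corr = toy.corr(method="pearson", numeric_only=True)

# Rank the remaining features by absolute correlation with the target
ranked = (
    corr["dementia_all"]
    .drop("dementia_all")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top as well. A high correlation only indicates a linear association, not that the feature causes the outcome.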

The dataset still contains nominal and categorical features; to use them for modeling, we need to transform them into numerical data.
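
One possible way to do this transformation is one-hot encoding. The sketch below uses `pd.get_dummies` as a lightweight stand-in for scikit-learn's `OneHotEncoder`; the `study` and `age` columns and their values are hypothetical.

```python
import pandas as pd

# Toy frame with a hypothetical nominal column "study"
toy = pd.DataFrame({
    "study": ["A", "B", "A"],
    "age": [70, 82, 65],
})

# Each category of "study" becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["study"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding suits nominal features without a natural order, whereas an ordered feature can keep a single integer-coded column.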

Save the cleaned data to CSV for further cleaning and training, to reduce clutter in the notebooks.

In [None]:
dementia_data.to_csv('data/dementia_data_cleaned_v1.csv', sep=',', index=False)
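
One thing to keep in mind when loading the saved file again: CSV does not store datatypes, so a nullable integer column with missing values is read back as float64 unless the type is passed explicitly. The sketch below shows this on a toy frame with made-up values and uses an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Toy frame with made-up values; smoking uses the nullable Int64 type
toy = pd.DataFrame({
    "smoking": pd.array([0, None, 2], dtype="Int64"),
    "dementia_all": [0, 1, 0],
})

# Write to an in-memory buffer with the same options as above
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)

# Default read: the missing value turns the column into float64
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing the dtype restores the nullable integer column
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"smoking": "Int64"})

print(default_read["smoking"].dtype, typed_read["smoking"].dtype)
```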