## Intro to Analysis

In this notebook I'll investigate the causes of heart disease by doing EDA. I've never worked with health datasets before, so this will be a completely new adventure for me, digging into the stories behind people's health.

### Unknown factors

* The description does not clearly define when the data were taken from the patient (e.g. all at the same time or at the last hospital visit)
* What kind of heart disease the patient had

### Guiding questions to start the analysis

* Are there any input variables that are strong indicators of heart disease?

For further information on the columns in the dataset, please visit the [main page](https://www.kaggle.com/ronitf/heart-disease-uci) of this dataset. The creator described all of them quite well.

### Load the data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
heart_disease = pd.read_csv('../input/heart.csv')

### Retrieve an Overview of the dataset

In [None]:
heart_disease.head()

In [None]:
heart_disease.info()

The dataset consists of 303 entries, and every column is completely filled in. We do not need to handle any missing values.
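Besides reading the `info()` output, the same thing can be checked directly. This is only a minimal sketch: `toy` is a small made-up frame standing in for `heart_disease` so that the cell runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for heart_disease; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [63, 37, 41, 56],
    'chol': [233, 250, 204, 236],
    'target': [1, 1, 0, 0],
})

# Number of missing values per column; all zeros means nothing has to be imputed.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One boolean answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data the same two calls would be applied to `heart_disease` instead of `toy`.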

In [None]:
heart_disease.describe()

To make the notebook more readable later on, I'll rename the columns.
We can see that most of the columns are categorical values encoded as numbers.

In [None]:
heart_disease.rename(columns = {
    'cp' : 'chest_pain_type',
    'trestbps' : 'resting_blood_pressure',
    'chol' : 'cholesteral',
    'fbs' : 'fasting_blood_sugar_higher_120',
    'restecg' : 'resting_cardiographic_results',
    'thalach' : 'max_heartrate',
    'exang' : 'induced_angina',
    'ca' : 'num_maj_vessels_flourosopy',
    'thal' : 'blood_characterization'
}, inplace = True)

### Understanding each column in the dataset

To get a better understanding, I'll divide the columns by their statistical type. 

As far as I can see, potential label columns have already been converted to numeric values. 
For example: 
* column **sex** is shown as 0 / 1, which stands for Female / Male. 
* column **blood_characterization** is shown as 0 / 1 / 2 / 3, which stands for different characterizations. 

All in all, those are numerical class codes and not numerical values with a potentially endless range. 
Imagine we're building a prediction model. Normally we would use something like LabelEncoder and OneHotEncoder for such columns. For this dataset, LabelEncoder is not necessary because the values are already stored as numbers.
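The one-hot step would still be relevant for columns with more than two categories. As a minimal sketch, I'm using pandas' `get_dummies` here instead of scikit-learn's OneHotEncoder, and a made-up frame `toy` instead of the real data:

```python
import pandas as pd

# Hypothetical toy data; the column names follow the renaming used in this notebook.
toy = pd.DataFrame({
    'sex': [0, 1, 1, 0],
    'blood_characterization': [1, 2, 3, 2],
})

# One-hot encode the multi-class column: each category becomes its own 0/1 column,
# so a model does not read an artificial order into the codes 1 < 2 < 3.
encoded = pd.get_dummies(toy, columns=['blood_characterization'], dtype=int)
print(encoded)
```

A binary column like `sex` can stay as it is, since 0 / 1 already behaves like a single one-hot column.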

### Let's summarize which column has which numerical type

* numerical categorical: sex, chest_pain_type, fasting_blood_sugar_higher_120, resting_cardiographic_results, induced_angina, blood_characterization, target

* numerical continuous: age, resting_blood_pressure, cholesteral, max_heartrate, oldpeak



## EDA 

#### Checking the age in relation to the target result

In [None]:
print(plt.hist(heart_disease['age'], histtype = 'step'))

We can clearly see that the majority of the samples are between 40 and around 65 years old. 
At this point let's add one more piece of information: clustered ages. 
The ages are going to be clustered in 5-year steps. I'll write a function that I can reuse for other columns later on (e.g. max_heartrate).

In [None]:
def define_cluster_groups(dataset, column, interval_indicator = 5):
    """Add a '<column>_cluster' column that groups `column` into fixed-width steps (in place)."""
    new_column = column + '_cluster'
    column_min = dataset[column].min()
    column_max = dataset[column].max()
    # create cluster
    cluster = 0
    dataset[new_column] = 0
    while cluster * interval_indicator < column_max:
        # +1 because otherwise we would overwrite the last value of the previous cluster
        min_cluster_value = column_min + cluster * interval_indicator + 1
        if cluster == 0:
            min_cluster_value -= 1 # the first cluster also contains the minimum itself
        # .loc accepts a boolean mask (.at only addresses a single cell);
        # rows above the next boundary are overwritten in the following iteration
        dataset.loc[dataset[column] >= min_cluster_value, new_column] = cluster
        cluster += 1

In [None]:
## create age_cluster
define_cluster_groups(heart_disease, 'age', interval_indicator = 5)

In [None]:
print(heart_disease['age_cluster'].value_counts())

In [None]:
### check the first rows for validation of our function
heart_disease[['age', 'age_cluster']].head(10)

For the further investigation I'll use the age_cluster instead of the real age. My reasoning is that I've reduced the many different values to 10 groups, which I can categorize much better than the raw age. Secondly, I assume that I'll get a better understanding of the dataset by categorizing some of its columns (blood pressure as well, for example).
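As a side note, pandas ships a built-in for this kind of grouping. The following is only a sketch with made-up ages; the bin edges (starting at 25) are my own assumption, so the boundaries do not match `define_cluster_groups` exactly:

```python
import pandas as pd

# Hypothetical ages, made up for illustration.
toy = pd.DataFrame({'age': [29, 34, 35, 41, 52, 58, 63, 77]})

# Fixed 5-year bins starting at 25; right=False makes each bin [start, start + 5).
bin_edges = list(range(25, 85, 5))  # 25, 30, ..., 80

# labels=False returns the integer position of the bin instead of an interval label.
toy['age_cluster'] = pd.cut(toy['age'], bins=bin_edges, right=False, labels=False)
print(toy)
```

The advantage of explicit edges is that the cluster boundaries no longer depend on the minimum value that happens to be in the data.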

In [None]:
heart_disease.groupby(['age_cluster', 'target'])['age'].count().reset_index()

In [None]:
sns.barplot(x = 'age_cluster', y = 'age', hue = 'target', data = heart_disease.groupby(['age_cluster', 'target'])['age'].count().reset_index())
## since I'm doing a count, 'age' can be replaced with any other column

It is really interesting to see that age clusters 1 to 4 contain more samples with heart disease than the older clusters 5 and 6. In clusters 8 and 9 the share of heart disease looks normal again.
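Because the clusters differ in size, raw counts are hard to compare. A row-normalised cross table shows the share of each target value per cluster instead. This is a minimal sketch on made-up values, not on the real dataset:

```python
import pandas as pd

# Hypothetical cluster / target pairs, made up for illustration.
toy = pd.DataFrame({
    'age_cluster': [1, 1, 1, 2, 2, 5, 5, 5],
    'target':      [1, 1, 0, 1, 0, 0, 0, 1],
})

# normalize='index' divides every row by its row total,
# so each row holds the share of target 0 and target 1 within that cluster.
share = pd.crosstab(toy['age_cluster'], toy['target'], normalize='index')
print(share)
```

With `heart_disease` in place of `toy`, the column for target 1 would give the share of heart disease per age cluster.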

I assume that the environment and the globalisation of our world might be an important trigger for the increase in heart disease. Unfortunately, it is not possible to verify this with the given data. 
As an example: 
* in 2019 people are trying to eat more healthily again
* around 2000 people ate lots of fast food and other unhealthy things

This also suggests that weight might be an important trigger for heart disease. 

Hypothesis: Lifestyle is one of the main factors indicating the potential for heart disease.

Furthermore, I'll remove the cluster groups 0, 8 and 9 because they contain too few samples.

In [None]:
heart_disease = heart_disease.loc[(heart_disease['age_cluster'] > 0) &
                                  (heart_disease['age_cluster'] < 8), :].reset_index(drop = True)
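Instead of hard-coding the cluster numbers, the small groups could also be derived from the counts. A minimal sketch on made-up data; the threshold `min_samples` is a hypothetical value I picked for illustration:

```python
import pandas as pd

# Hypothetical cluster assignments, made up for illustration.
toy = pd.DataFrame({'age_cluster': [0, 1, 1, 1, 2, 2, 2, 9]})

min_samples = 2  # assumption: clusters with fewer rows are considered too small

# Count rows per cluster and keep only the clusters that reach the threshold.
counts = toy['age_cluster'].value_counts()
clusters_to_keep = counts[counts >= min_samples].index

filtered = toy[toy['age_cluster'].isin(clusters_to_keep)].reset_index(drop=True)
print(filtered)
```

This way the filter keeps working if the cluster width changes and the sparse groups end up under different numbers.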

#### How does the heart rate correlate with the age group?

To answer this I'll follow the same approach as with the age.

In [None]:
plt.hist(heart_disease['max_heartrate'], histtype = 'step')

In [None]:
## create heart rate cluster
define_cluster_groups(heart_disease, 'max_heartrate', interval_indicator = 13)
# for clarification: I'm using the bin width of the step histogram as interval_indicator; you can see the bin edges in the second array of its output

In [None]:
sns.barplot(x = 'max_heartrate_cluster', y = 'age', hue = 'target', data = heart_disease.groupby(['max_heartrate_cluster', 'target'])['age'].count().reset_index())


We can see a clear trend in heart disease in relation to the maximum heart rate of our samples. 

Next step: let's check how the age_cluster correlates with the max_heartrate_cluster.

In [None]:
sns.jointplot(x = 'age_cluster', y = 'max_heartrate_cluster', data = heart_disease.loc[heart_disease['target'] == 1])

In [None]:
heart_disease.info()

#### Is the resting blood pressure related to the max_heartrate or fasting_blood_sugar_higher_120?

In [None]:
sns.jointplot(x = 'resting_blood_pressure', y = 'max_heartrate', data = heart_disease[heart_disease['target']==1], kind = 'kde')

We can see that heart disease is much more likely **if the sample has a higher resting_blood_pressure and a max_heartrate above 140**. 

It is said that people who exercise regularly have a lower resting blood pressure and also need a higher volume of exercise to raise their heart rate. 
It seems I should start doing sports ;-).

### Check the correlation of all non-categorical numerical features

This might help us understand how these features are related to each other, just like we already did with the heart rate.


In [None]:
sns.pairplot(heart_disease[['age', 'resting_blood_pressure', 'cholesteral', 'max_heartrate', 'oldpeak', 'target']], hue = 'target')

It looks like all of the given features are related to an increased risk of heart disease. Let's dig deeper and check how these features correlate with each other.

In [None]:
heart_disease[['age', 'resting_blood_pressure', 'cholesteral', 'max_heartrate', 'oldpeak']].corr()

This is really interesting. Looking at the correlation table, none of the features are highly correlated with each other. Let's compute two more correlation matrices: one for the samples with heart disease and one for those without.

In [None]:
heart_disease.loc[heart_disease['target'] == 1, ['age', 'resting_blood_pressure', 'cholesteral', 'max_heartrate', 'oldpeak']].corr()

In [None]:
heart_disease.loc[heart_disease['target'] == 0, ['age', 'resting_blood_pressure', 'cholesteral', 'max_heartrate', 'oldpeak']].corr()

This is absolutely interesting. Just by filtering on the target, the relationship between cholesteral and the other values changes noticeably. Just look at how the cholesteral / resting_blood_pressure correlation changes.
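Both filtered matrices can also be produced in one step, which makes them easier to compare side by side. A minimal sketch on made-up values, deliberately built so that the correlation flips between the two target groups:

```python
import pandas as pd

# Hypothetical values, constructed so the sign of the correlation differs by target.
toy = pd.DataFrame({
    'resting_blood_pressure': [120, 130, 140, 150, 120, 130, 140, 150],
    'cholesteral':            [200, 220, 240, 260, 260, 240, 220, 200],
    'target':                 [1, 1, 1, 1, 0, 0, 0, 0],
})

# One correlation matrix per target value, stacked into a single frame
# with a (target, column) MultiIndex.
corr_by_target = toy.groupby('target')[['resting_blood_pressure', 'cholesteral']].corr()
print(corr_by_target)
```

In this toy case the correlation over all rows together is zero although each group is perfectly correlated, which is the same effect as above in an exaggerated form.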

In [None]:
sns.jointplot(x = 'resting_blood_pressure', y = 'cholesteral', data = heart_disease[heart_disease['target'] == 1], kind = 'kde')

### Comparing sex with the already evaluated data

Before I go into the prediction, let's compare the sexes with the information we already have.

In [None]:
sex  = sns.FacetGrid(heart_disease, col="sex", hue="target")
sex.map(plt.scatter, "cholesteral", "resting_blood_pressure", alpha=.7)
sex.add_legend();

We're able to identify a clear difference between the two sexes. For sex **0** we can see a small trend, but for sex **1** the points are completely mixed. Sex seems to play a big role in identifying heart disease.

In [None]:
sex  = sns.FacetGrid(heart_disease, col="sex", hue="target")
sex.map(plt.scatter, "max_heartrate", "resting_blood_pressure", alpha=.7)
sex.add_legend();

On both sides we can see a clear trend. For sex **1** it is not as clear as for sex **0**.

## Summary

Each of the given factors has its place in *predicting* a potential heart disease. I found that *blood pressure*, *heart rate* and *cholesteral* have an interesting impact on a potential heart disease. 
Even though the gender split is not a perfect 50 / 50 in the given dataset, I could clearly see in the last visualization that gender does not have a different impact on heart disease. It might rather be that the type of heart disease differs, which results in small deviations. 

It would've been really interesting to know more about the patients' lifestyle. My assumption is that lifestyle as well as the psyche has a high impact on heart disease.

As always -- feedback is welcome :-)