![](https://2rdnmg1qbg403gumla1v9i2h-wpengine.netdna-ssl.com/wp-content/uploads/sites/3/2018/08/GettyImages-498686795-650x450.jpg)
# Introduction

## History
According to the World Health Organization (WHO), cardiovascular diseases, which include heart disease, are the number one cause of death globally, claiming an estimated 17.9 million lives every year. People with heart disease often present with raised blood pressure, lipids and glucose, as well as overweight and obesity. Identifying these high-risk factors helps ensure that patients receive appropriate medical care and helps prevent premature deaths.

## Understanding this study
We have the following information about our dataset:
- Age
- Sex: (1 = Male, 0 = Female)
- cp(chest pain type): 
    * 1 = typical angina
    * 2 = atypical angina
    * 3 = non-anginal pain 
    * 4 = asymptomatic
- trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- chol: Serum cholesterol in mg/dl
- fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: Resting electrocardiographic results
    * 0: Normal
    * 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalach: Maximum heart rate achieved 
- exang: Exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: The slope of the peak exercise ST segment
    * 1: Upsloping
    * 2: Flat
    * 3: Downsloping
- ca: Number of major vessels (0-3) colored by fluoroscopy
- thal: Thallium heart scan
    * 3: Normal
    * 6: Fixed defect
    * 7: Reversible defect
- target: Diagnosis of heart disease
    * 1: Yes
    * 0: No
    
## Objective
- Find any correlations between attributes
- Find correlations between each attribute and the diagnosis of heart disease

The first step is to import the required packages: numpy, pandas, matplotlib, seaborn and scipy.

In [None]:
# Importing packages
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
import scipy.stats # Needed to compute statistics for categorical data

Next, we load the dataset into the notebook and preview the first few rows.

In [None]:
# Importing dataset
heart_data = pd.read_csv('../input/heart-disease/heart.csv')
heart_data.head()

Now, let's check for any unknown, NaN or NULL values.

In [None]:
heart_data.isnull().sum()

Looking good! Some attributes, such as sex, slope and target, use numbers to encode categories. We will replace those codes with readable labels so that we don't have to keep referring back to the key above:
- For sex, we will change 1 to 'male' and 0 to 'female'.
- For cp (chest pain), we will change:
    * 0 to 'no_cp'
    * 1 to 'typical_ang'
    * 2 to 'atypical_ang'
    * 3 to 'non_anginal_pain'
    * 4 to 'asymptomatic'
- fbs (fasting blood sugar):
    * 1 to 'true'
    * 0 to 'false'
- restecg: 
    * 0 to 'normal' 
    * 1 to 'st_abnormality'
    * 2 to 'prob_lvh'
- exang (Exercise induced angina):
    * 1 to 'yes'
    * 0 to 'no'
- slope: The slope of the peak exercise ST segment
    * 0 to 'no_slope'
    * 1 to 'upsloping'
    * 2 to 'flat'
    * 3 to 'downsloping'
- thal: Thallium heart scan
    * 3 to 'normal'
    * 6 to 'fixed_def'
    * 7 to 'rev_def'
- target: 1 to 'yes', 0 to 'no'


In [None]:
heart_data['sex'] = heart_data.sex.replace([1,0], ['male', 'female'])
heart_data['cp'] = heart_data.cp.replace([0,1,2,3,4], ['no_cp','typical_ang', 'atypical_ang', 'non_anginal_pain', 'asymptomatic'])
heart_data['fbs'] = heart_data.fbs.replace([1,0], ['true', 'false'])
heart_data['restecg'] = heart_data.restecg.replace([0,1,2], ['normal', 'st_abnormality', 'prob_lvh'])
heart_data['exang'] = heart_data.exang.replace([0,1], ['no', 'yes'])
heart_data['slope'] = heart_data.slope.replace([0,1,2,3], ['no_slope','upsloping', 'flat', 'downsloping'])
heart_data['thal'] = heart_data.thal.replace([3,6,7], ['normal', 'fixed_def', 'rev_def'])
heart_data['target'] = heart_data.target.replace([1,0], ['yes', 'no'])
heart_data.head()
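
The relabelling above can also be driven by one lookup table, which keeps the codes and labels in a single place. The following is only a minimal sketch on a made-up toy frame; `toy`, `label_maps` and `labelled` are hypothetical names, not part of the notebook. It uses `Series.map`, which, unlike `replace`, turns any code missing from the mapping into NaN so that unexpected codes are easy to spot.

```python
import pandas as pd

# Toy frame standing in for heart_data (values are made up for illustration)
toy = pd.DataFrame({"sex": [1, 0, 1], "exang": [0, 1, 0]})

# Hypothetical lookup table: column name -> {code: label}
label_maps = {
    "sex": {1: "male", 0: "female"},
    "exang": {0: "no", 1: "yes"},
}

labelled = toy.copy()
for column, mapping in label_maps.items():
    # Codes that are absent from the mapping become NaN
    labelled[column] = labelled[column].map(mapping)

print(labelled)
# Any NaN here would point to a code that the mapping did not anticipate
print(labelled.isnull().sum())
```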

Here, we will use the `pairplot` function from seaborn to see the distributions of, and relationships among, the variables. Since a pair plot does not work well with categorical data, we only pick the numerical attributes for this step.

In [None]:
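# distplot is deprecated in recent seaborn releases, so the diagonal density plots are requested via diag_kind
# pairplot already adds the legend when hue is given
g = sns.pairplot(heart_data, vars=['age', 'trestbps', 'chol', 'thalach', 'oldpeak'], hue='target', diag_kind='kde')
g.fig.suptitle('Pair plot of numerical attributes', fontsize=20)
g.fig.subplots_adjust(top=0.9);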

## What do we see here?
- Other than resting blood pressure, we see distinct differences between heart disease patients and healthy patients in the selected attributes.
- For instance, heart disease patients are spread fairly evenly across ages, while healthy patients are concentrated towards the right (older ages).

## Let's look at correlations!

- Note: Correlation here is measured with Pearson's r, which is not defined for categorical labels. Hence, we need the categorical attributes back in numeric form for this analysis.
- We will simply reload the original dataset under a new variable name.

In [None]:
# Plotting correlation matrix
heart_data1 = pd.read_csv('../input/heart-disease/heart.csv')
corr = heart_data1.corr()
corr.style.background_gradient(cmap='RdBu_r')

From here, we can see that chest pain type, maximum heart rate achieved and the slope are the attributes most closely correlated with whether the patient is healthy or has heart disease. The rest of the attributes seem to show very weak correlation.
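
To read a correlation matrix more easily, it can help to pull out only the column for the diagnosis and sort it by absolute value. The snippet below is a sketch on made-up numbers; `toy`, `target_corr` and `ranked` are hypothetical names, and the values do not come from the dataset.

```python
import pandas as pd

# Made-up numeric data standing in for heart_data1
toy = pd.DataFrame({
    "thalach": [150, 170, 120, 160, 110, 180],
    "oldpeak": [1.0, 0.0, 2.5, 0.5, 3.0, 0.2],
    "target": [1, 1, 0, 1, 0, 1],
})

# Pearson's r of every attribute with the target, excluding the target itself
target_corr = toy.corr()["target"].drop("target")

# Order by strength of correlation, ignoring the sign
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

The sign still matters when interpreting the result: a negative value means the attribute tends to be lower when the target is 1.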

### Let's look closely into some attributes.

# Finding correlation between age and whether the patient has heart disease

Firstly, let's look at the distribution.

In [None]:
plt.figure(figsize=(10,4))
# The legend is created after plotting; calling plt.legend() on an empty figure only raises a warning
g = sns.countplot(data = heart_data, x = 'age', hue = 'target')
g.legend(title = 'Heart disease patient?', loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)

In [None]:
age_corr = ['age', 'target']
age_corr1 = heart_data[age_corr]
age_corr_y = age_corr1[age_corr1['target'] == 'yes'].groupby(['age']).size().reset_index(name = 'count')
age_corr_y.corr()

In [None]:
sns.regplot(data = age_corr_y, x = 'age', y = 'count').set_title("Correlation graph for Age vs heart disease patient")

In [None]:
age_corr_n = age_corr1[age_corr1['target'] == 'no'].groupby(['age']).size().reset_index(name = 'count')
age_corr_n.corr()

In [None]:
sns.regplot(data = age_corr_n, x = 'age', y = 'count').set_title("Correlation graph for Age vs healthy patient")

## What can we say about this?
- Can we say that older people are more susceptible to heart disease? Not really in this case. Heart disease patients are spread fairly evenly across all ages, and we even saw a positive correlation between age and the number of healthy patients. Sadly, this does not tell us anything significant: it reflects the age profile of the people who took part in the study rather than age as a precursor of heart disease.
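
The cells above correlate age with the number of patients at each age. A different technique, not used in this notebook, is the point-biserial correlation, which relates a numeric variable directly to a binary outcome. The sketch below uses `scipy.stats.pointbiserialr` on made-up values; `age` and `has_disease` are hypothetical arrays, with the outcome coded as 0/1.

```python
import numpy as np
from scipy import stats

# Made-up ages and a 0/1 outcome, for illustration only
age = np.array([41, 45, 52, 58, 60, 63, 67, 70])
has_disease = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Point-biserial correlation: Pearson's r with one binary variable
r, p_value = stats.pointbiserialr(has_disease, age)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

With the original data, this would require the numeric 0/1 coding of `target` rather than the 'yes'/'no' labels.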

# Correlation between sex and heart disease 

In [None]:
# Showing number of heart disease patients based on sex
sex_corr = ['sex', 'target']
sex_corr1 = heart_data[sex_corr]
sex_corr_y = sex_corr1[sex_corr1['target'] == 'yes'].groupby(['sex']).size().reset_index(name = 'count')
sex_corr_y

In [None]:
# Showing number of healthy patients based on sex 
sex_corr_n = sex_corr1[sex_corr1['target'] == 'no'].groupby(['sex']).size().reset_index(name = 'count')
sex_corr_n

In [None]:
g1 = sns.boxplot(data = heart_data, x = 'sex', y = 'age', hue = 'target',palette="Set3")
g1.legend(title = 'Heart disease patient?', loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1)
g1.set_title('Boxplot showing age vs sex')

Sex is a categorical variable, and so is target, which tells us whether the patient has heart disease. To test for an association between two categorical variables, we use the chi-square test of independence. We will use a significance level of 0.05, meaning that we reject the null hypothesis when the p-value is below 0.05.
- The null hypothesis is that sex and heart disease are independent.
- The alternative hypothesis is that they are associated in some way.

In [None]:
# Chi-sq test
cont = pd.crosstab(heart_data["sex"],heart_data["target"])
scipy.stats.chi2_contingency(cont)
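
The output of `chi2_contingency` is easier to read when it is unpacked into its four parts: the test statistic, the p-value, the degrees of freedom and the table of expected counts. The following sketch uses a made-up toy frame (`toy` and `table` are hypothetical names) rather than the real data.

```python
import pandas as pd
from scipy import stats

# Made-up labels, for illustration only
toy = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 6,
    "target": ["yes", "yes", "no", "no", "no", "no",
               "yes", "yes", "yes", "yes", "no", "no"],
})

table = pd.crosstab(toy["sex"], toy["target"])

# For 2x2 tables, Yates' continuity correction is applied by default
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
print("expected counts under independence:")
print(expected)
```

The expected counts are worth a glance: the chi-square approximation is usually considered unreliable when some of them are very small.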

## What can we say about this?
We performed the test and obtained a p-value < 0.05, so we can reject the hypothesis of independence. So is there truly an association between sex and heart disease? I am reluctant to accept this result, mainly for one reason: there are too few healthy females in the data. We only have 24 healthy female individuals. If that number were raised to, say, 94, we would get a much higher p-value. Hence, I feel that an association analysis is of limited value when the group sizes are this unbalanced.

# Correlation between types of chest pain and heart disease 

### An overview of types of chest pains in heart disease patients

In [None]:
# Showing number of heart disease patients based on cp
cp_corr = ['cp', 'target']
cp_corr1 = heart_data[cp_corr]
cp_corr_y = cp_corr1[cp_corr1['target'] == 'yes'].groupby(['cp']).size().reset_index(name = 'count')
cp_corr_y

In [None]:
# Showing number of healthy patients based on cp 
cp_corr_n = cp_corr1[cp_corr1['target'] == 'no'].groupby(['cp']).size().reset_index(name = 'count')
cp_corr_n

What we can see here is that heart disease patients tend to experience all 3 types of chest pain, while healthy patients generally do not experience any chest pain. Even before running a statistical test, this suggests an association between chest pain and heart disease.
- However, we still need to confirm this with the chi-square test.

In [None]:
# Chi-square test
cont1 = pd.crosstab(heart_data["cp"],heart_data["target"])
scipy.stats.chi2_contingency(cont1)
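
A chi-square test indicates whether an association exists, but not how strong it is. One common effect-size measure, which is not part of the original analysis, is Cramér's V, defined as $\sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for a table with $r$ rows, $c$ columns and $n$ observations. The sketch below uses a hypothetical helper `cramers_v` and made-up counts.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(table):
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    # The uncorrected chi-square statistic is used for the effect size
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))


# Made-up counts: rows are chest pain categories, columns are the diagnosis
toy_table = pd.DataFrame(
    [[30, 10], [12, 28], [8, 32]],
    index=["cp_a", "cp_b", "cp_c"],
    columns=["no", "yes"],
)

v = cramers_v(toy_table)
print(f"Cramér's V = {v:.3f}")
```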

# Correlation between resting blood pressure and heart disease

In [None]:
# Showing number of heart disease patients based on trestbps
restbp_corr = ['trestbps', 'target']
restbp_corr1 = heart_data[restbp_corr]
restbp_corr_y = restbp_corr1[restbp_corr1['target'] == 'yes'].groupby(['trestbps']).size().reset_index(name = 'count')
restbp_corr_y.corr()

In [None]:
sns.regplot(data = restbp_corr_y, x = 'trestbps', y = 'count').set_title('Correlation between resting blood pressure and heart disease patients')

In [None]:
restbp_corr_n = restbp_corr1[restbp_corr1['target'] == 'no'].groupby(['trestbps']).size().reset_index(name = 'count')
restbp_corr_n.corr()

In [None]:
sns.regplot(data = restbp_corr_n, x = 'trestbps', y = 'count').set_title('Correlation between resting blood pressure and healthy patients')

## What do we see here?
- We see weak correlation between resting blood pressure and whether the patient has heart disease.

# Correlation between serum cholesterol and heart disease

In [None]:
# Showing number of heart disease patients based on serum cholesterol
chol_corr = ['chol', 'target']
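# Work on a copy so that the assignment below does not trigger SettingWithCopyWarning
chol_corr1 = heart_data[chol_corr].copy()
# Round cholesterol to the nearest 10 mg/dl so that similar values fall into the same group
chol_corr1['chol'] = chol_corr1['chol'].round(decimals=-1)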
chol_corr_y = chol_corr1[chol_corr1['target'] == 'yes'].groupby(['chol']).size().reset_index(name = 'count')
chol_corr_y.corr()
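
Rounding with `decimals=-1` groups cholesterol readings to the nearest 10 before counting, which avoids having one group per distinct value. A small sketch on made-up readings (`chol`, `binned` and `counts` are hypothetical names) shows the effect:

```python
import pandas as pd

# Made-up cholesterol readings in mg/dl
chol = pd.Series([233, 229, 250, 204, 236, 354])

# A negative decimals value rounds to the left of the decimal point
binned = chol.round(decimals=-1)

# Number of readings that fall into each rounded group
counts = binned.value_counts().sort_index()

print(binned.tolist())
print(counts)
```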

In [None]:
sns.regplot(data = chol_corr_y, x = 'chol', y = 'count').set_title('Correlation between serum cholesterol and heart disease patients')

In [None]:
chol_corr_n = chol_corr1[chol_corr1['target'] == 'no'].groupby(['chol']).size().reset_index(name = 'count')
chol_corr_n.corr()

In [None]:
sns.regplot(data = chol_corr_n, x = 'chol', y = 'count').set_title('Correlation between serum cholesterol and healthy patients')

## What do we see here?
- We do not see a correlation between the level of serum cholesterol and heart disease. 

# Correlation between fasting blood sugar and heart disease 

In [None]:
# Showing number of heart disease patients based on fasting blood sugar
fbs_corr = ['fbs', 'target']
fbs_corr1 = heart_data[fbs_corr]
fbs_corr_y = fbs_corr1[fbs_corr1['target'] == 'yes'].groupby(['fbs']).size().reset_index(name = 'count')
fbs_corr_y

In [None]:
# Showing number of healthy patients based on fasting blood sugar
fbs_corr_n = fbs_corr1[fbs_corr1['target'] == 'no'].groupby(['fbs']).size().reset_index(name = 'count')
fbs_corr_n

Performing Chi-Sq test

In [None]:
# Chi-square test
cont3 = pd.crosstab(heart_data["fbs"],heart_data["target"])
scipy.stats.chi2_contingency(cont3)

## What do we see here?
- We obtained a p-value of 0.744. Therefore, we fail to reject the hypothesis of independence: the data show no evidence of an association between fasting blood sugar and heart disease.
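
The decision rule used throughout these chi-square tests can be written down explicitly. This is only a sketch: `interpret_p_value` is a hypothetical helper, and the default `alpha` is the 0.05 significance level. A p-value above the level means there is not enough evidence against independence, which is weaker than proving independence.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Return the decision for a test of independence at significance level alpha."""
    if p_value < alpha:
        return "reject independence"
    return "fail to reject independence"


# p-values reported in this notebook
print(interpret_p_value(0.744))    # fasting blood sugar
print(interpret_p_value(0.00666))  # resting ECG results
```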

# Correlation between resting ECG results and heart disease

In [None]:
# Showing number of heart disease patients based on resting ECG results
restecg_corr = ['restecg', 'target']
restecg_corr1 = heart_data[restecg_corr]
restecg_corr_y = restecg_corr1[restecg_corr1['target'] == 'yes'].groupby(['restecg']).size().reset_index(name = 'count')
restecg_corr_y

In [None]:
restecg_corr_n = restecg_corr1[restecg_corr1['target'] == 'no'].groupby(['restecg']).size().reset_index(name = 'count')
restecg_corr_n

In [None]:
# Chi-square test
cont4 = pd.crosstab(heart_data["restecg"],heart_data["target"])
scipy.stats.chi2_contingency(cont4)

## What do we see here?
- We obtained a p-value of 0.00666. This suggests an association between resting ECG results and heart disease. In particular, we see a large difference in the number of ST-T wave abnormality cases between healthy and heart disease patients.

# Correlation between maximum heart rate achieved and heart disease 

In [None]:
# Showing number of heart disease patients based on maximum heart rate
heartrate_corr = ['thalach', 'target']
heartrate_corr1 = heart_data[heartrate_corr]
heartrate_corr_y = heartrate_corr1[heartrate_corr1['target'] == 'yes'].groupby(['thalach']).size().reset_index(name = 'count')
heartrate_corr_y.corr()

In [None]:
sns.regplot(data = heartrate_corr_y, x = 'thalach', y = 'count').set_title('Correlation between maximum heart rate and heart disease patients')

In [None]:
heartrate_corr_n = heartrate_corr1[heartrate_corr1['target'] == 'no'].groupby(['thalach']).size().reset_index(name = 'count')
heartrate_corr_n.corr()

In [None]:
sns.regplot(data = heartrate_corr_n, x = 'thalach', y = 'count').set_title('Correlation between maximum heart rate and healthy patients')

## What do we see here?
- We do not see a strong correlation between maximum heart rate and heart disease. If we look into the distribution, we do see close similarity in maximum heart rate in both heart disease patients and healthy patients. 

# Summary
- We have tested most of the attributes, and the results suggest that both resting ECG results and chest pain type are associated with heart disease.
- Although the chi-square test on the sex attribute also indicates an association, the small number of healthy females in the data raises serious doubts about how reliable that result is.