## This notebook on the Heart Disease UCI dataset implements various methods and techniques progressively, from simple to complex, beginning with a comprehensive EDA.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Imports

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

my_palette = {0 : 'red' , 1 : 'black'}

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
data = pd.read_csv('../input/heart-disease-uci/heart.csv')
data.head()

In [None]:
data.shape

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(['target'], axis = 1), data['target'], test_size = 0.2, random_state = 21)

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
X_train.shape

In [None]:
y_train.shape

## Data Exploration

Understanding the variables:

1. cp -> chest pain type
   - 0: asymptomatic
   - 1: atypical angina
   - 2: non-anginal pain
   - 3: typical angina

2. restecg -> resting electrocardiographic results
   - 1: normal
   - 0 and 2: probable or definite left ventricular hypertrophy; ST-T wave abnormality

3. target -> 0 = disease; 1 = no disease

4. thal
   - 1: fixed defect
   - 2: normal
   - 3: reversible defect

5. slope -> the slope of the peak exercise ST segment
   - 0: downsloping
   - 1: flat
   - 2: upsloping

Other features:

Age, Sex, Resting Blood Pressure, Serum Cholesterol, Fasting Blood Sugar, Maximum Heart Rate Achieved, Exercise Induced Angina (1 = yes, 0 = no), OldPeak = ST depression induced by exercise relative to rest; Number of Major Vessels -> 0-3 colored by fluoroscopy; Thalassemia (in the original UCI coding: 3 = normal, 6 = fixed defect, 7 = reversible defect)
 

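Visualizing encoded values is difficult, so let's decode them to text. Note that the label mappings used in the cells that follow for cp, restecg, slope and thal use the original UCI ordering, which does not match the listing above in every case.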

In [None]:
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

dis_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

In [None]:
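# Map the numeric codes to labels in one step.
# Chained assignment such as data['sex'][mask] = ... may fail to update the frame.
data['sex'] = data['sex'].map({1: 'male', 0: 'female'})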

In [None]:
data.columns

In [None]:
# Decode chest pain type (same code-to-label mapping as before, without chained assignment)
data['cp'] = data['cp'].map({
    0: 'typical angina',
    1: 'atypical angina',
    2: 'non-anginal pain',
    3: 'asymptomatic',
})

In [None]:
# Fasting blood sugar is measured in mg/dl
data['fbs'] = data['fbs'].map({1: 'higher than 120 mg/dl', 0: 'lower than 120 mg/dl'})

In [None]:
# Decode the slope of the peak exercise ST segment
data['slope'] = data['slope'].map({0: 'upsloping', 1: 'flat', 2: 'downsloping'})

In [None]:
data.head()

In [None]:
# Decode the resting ECG result
data['restecg'] = data['restecg'].map({
    0: 'normal',
    1: 'ST-T wave abnormality',
    2: 'left ventricular hypertrophy',
})

In [None]:
# Decode exercise induced angina
data['exang'] = data['exang'].map({0: 'no', 1: 'yes'})

In [None]:
# Decode thal; codes 0 and 1 are both labelled 'normal', as in the original cell
data['thal'] = data['thal'].map({
    0: 'normal',
    1: 'normal',
    2: 'fixed defect',
    3: 'reversable defect',
})

In [None]:
X_train.columns

Are there any null or missing values?

In [None]:
data.isna().sum()

There are no null or missing values in the dataset.

Age

In [None]:
data['age']

In [None]:
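# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a histogram with a KDE overlay
sns.histplot(x = data['age'], kde = True); plt.show()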

**Which Age group is more vulnerable to heart disease?**

In [None]:
plt.figure(figsize = (10,10))
sns.boxplot(x = data['sex'], y = data['age'], hue = data['target'], palette = my_palette); plt.show()

For both sexes, the 45-60 age group seems the most vulnerable to heart disease.

**Which age group sex-wise is more vulnerable to heart disease?**

For men, the most vulnerable range is 45-55; for women, it is 46-60.
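The age ranges above are read off the box plot by eye. A rough way to check them numerically is to bin ages and compute the share of `target == 1` rows per bin. The sketch below uses a small made-up frame (`toy`) in place of the notebook's `data`, and the bin edges are an arbitrary assumption, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up for illustration).
toy = pd.DataFrame({
    "age":    [34, 41, 47, 52, 55, 58, 61, 63, 67, 70],
    "target": [0,  1,  1,  1,  0,  1,  0,  1,  0,  0],
})

# Bin ages into groups; right=False makes each bin [lower, upper).
bins = [30, 45, 60, 75]
labels = ["30-44", "45-59", "60-74"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Share of rows with target == 1 in each group, plus the group size.
summary = toy.groupby("age_group", observed=True)["target"].agg(["mean", "count"])
print(summary)
```

Adding `sex` to the `groupby` keys would give the same breakdown per sex. Group sizes matter here: a high share in a bin with only a few rows is weak evidence.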

Sex

In [None]:
data['sex']

In [None]:
data['sex'].value_counts()

Which Sex is prone to heart disease?

In [None]:
sns.countplot(x = data['sex'], hue = data['target'], palette = my_palette); plt.show()

As per the dataset, men are more prone to heart disease than women.
However, among women, those with heart disease outnumber those without.
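Raw counts can mislead when one sex has many more rows than the other, so it may help to compare proportions within each sex. The following is a minimal sketch on made-up data (`toy` is a hypothetical stand-in for `data`), using `pd.crosstab` with `normalize="index"`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up for illustration).
toy = pd.DataFrame({
    "sex":    ["male"] * 6 + ["female"] * 4,
    "target": [1, 0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# Absolute counts per (sex, target) pair: this is what the count plot shows.
counts = pd.crosstab(toy["sex"], toy["target"])

# Row-normalised shares: each row sums to 1, so the sexes become comparable.
shares = pd.crosstab(toy["sex"], toy["target"], normalize="index")

print(counts)
print(shares.round(2))
```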

CP

In [None]:
X_train['cp']

In [None]:
X_train['cp'].value_counts()

In [None]:
# Does chest pain indicate possible heart disease?

plt.figure(figsize = (10,10))
sns.countplot(x = data['cp'], hue = data['target'], palette = my_palette);plt.show()

The results are quite interesting. 
To begin with, in the asymptomatic group, people with the disease outnumber people without it. This shows that even when there is no chest pain, the chance of heart disease is high.
The same holds for non-anginal pain and atypical angina.
However, when the pain is identified as typical angina, the chance of heart disease drops drastically. This may be due to precautionary medication or treatment taken after angina is detected.

In [None]:
plt.figure(figsize = (10,10))
sns.boxplot(x = data['cp'], y = data['age'], hue = data['target'], palette = my_palette); plt.show()

Trestbps

In [None]:
data['trestbps']

In [None]:
def trestbps_bin(row):
    if row['trestbps'] <= 80:
        value = 'Low'
    elif row['trestbps'] > 120:
        value = 'High'
    else:
        value = "Normal"
    
    return value

In [None]:
data['trestbps'] = data.apply(trestbps_bin, axis = 1)

In [None]:
data['trestbps'].value_counts()

In [None]:
pd.crosstab(data['trestbps'], data['target']).plot(kind = 'bar', color = ['r','k']);plt.show()

It is evident that people with high "trestbps" (resting blood pressure) have a larger heart disease count than those with lower values. More precisely, high blood pressure indicates an increased chance of heart disease but is not a definitive indicator, since a similar number of patients have high blood pressure yet are disease free. In contrast, among the people with normal blood pressure, those with heart disease outnumber those without.
CONCLUSION -> High blood pressure can increase the chance of heart disease but is not a definitive indicator of it. People with high blood pressure may live without heart disease, and people without high blood pressure may have heart disease.
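The binning used for this analysis (at most 80 is "Low", above 120 is "High", everything else "Normal") can also be expressed with `pd.cut`. This is only a sketch on a made-up series; the name `trestbps_group` is hypothetical. Writing the result to a new name or column, rather than overwriting `trestbps`, keeps the numeric values available for later use.

```python
import numpy as np
import pandas as pd

# Made-up resting blood pressure readings, including the boundary values 80 and 120.
trestbps = pd.Series([78, 80, 94, 120, 121, 145], name="trestbps")

# Intervals are right-closed by default: (-inf, 80], (80, 120], (120, inf)
trestbps_group = pd.cut(
    trestbps,
    bins=[-np.inf, 80, 120, np.inf],
    labels=["Low", "Normal", "High"],
)

print(trestbps_group.tolist())
```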

Chol

In [None]:
data['chol']

In [None]:
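# distplot is deprecated; histplot plus rugplot gives the same histogram, KDE and rug
sns.histplot(x = data['chol'], kde = True); sns.rugplot(x = data['chol']); plt.show()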

Most cholesterol values lie between 150 and 350, with the bulk of them between 200 and 300.

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x = data['target'], y = data['chol'], palette = my_palette); plt.show()

This indicates that cholesterol level is not the sole deciding factor in whether a person gets heart disease. People with similar cholesterol levels appear both among those with heart disease and among those without. There is no obvious direct correlation between heart disease and the chol level.
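One way to put a number on "no direct correlation" is to compare the group means and compute the Pearson correlation between the numeric feature and the binary target (the point-biserial correlation). The sketch below uses made-up values chosen so that both groups have the same mean; it illustrates the check, not the result on the real data.

```python
import pandas as pd

# Made-up cholesterol values; both target groups are centred on 250.
toy = pd.DataFrame({
    "chol":   [200, 250, 300, 210, 250, 290],
    "target": [0,   0,   0,   1,   1,   1],
})

# Mean cholesterol per target group.
group_means = toy.groupby("target")["chol"].mean()

# Pearson correlation with a 0/1 variable is the point-biserial correlation.
r = toy["chol"].corr(toy["target"])

print(group_means)
print(f"correlation between chol and target: {r:.3f}")
```

A value near zero only rules out a linear relationship; the feature can still matter in combination with others.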

Let's take a deeper look into it.

In [None]:
gender_palette = {'male' : 'black', 'female' : 'red'}

In [None]:
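# Pass x and y as keywords; positional x/y is not accepted by recent seaborn versions
sns.violinplot(x = data['target'], y = data['chol'], hue = data['sex'], split = True)
sns.swarmplot(x = data['target'], y = data['chol'], color = 'black', hue = data['sex']); plt.show()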

The above plot confirms the inference made above. Let's explore the variable 'thalach'

Thalach

In [None]:
data['thalach']

In [None]:
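# distplot is deprecated; histplot plus rugplot gives the same histogram, KDE and rug
sns.histplot(x = data['thalach'], kde = True); sns.rugplot(x = data['thalach']); plt.show()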

In [None]:
sns.boxplot(x = data['target'], y = data['thalach'], palette = my_palette); plt.show()

People with higher thalach (maximum heart rate achieved) values appear more prone to heart disease. Let's explore the data by gender.

In [None]:
# Before getting into details by gender, let's look more closely at the distribution of the data points.
sns.boxenplot(x = data['target'], y = data['thalach'], palette = my_palette); plt.show()

The data points are concentrated between 130 and 150 for people with no heart disease, and between 155 and 175 for people with heart disease.
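The ranges quoted here can be read from the plot, but they can also be computed as per-group quartiles. A minimal sketch on made-up `thalach` values follows (`toy` is a hypothetical stand-in for `data`); the interquartile range is the span between the 0.25 and 0.75 columns.

```python
import pandas as pd

# Made-up maximum heart rate values for two target groups.
toy = pd.DataFrame({
    "thalach": [120, 130, 140, 150, 160, 145, 155, 165, 175, 185],
    "target":  [0,   0,   0,   0,   0,   1,   1,   1,   1,   1],
})

# Quartiles per target group; unstack turns the quantile level into columns.
quartiles = (
    toy.groupby("target")["thalach"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)

print(quartiles)
```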

OldPeak

In [None]:
data['oldpeak']

In [None]:
sns.barplot(x = 'target', y = 'oldpeak', data = data, palette = my_palette); plt.show()

In [None]:
sns.boxplot(x = 'target', y = 'oldpeak', data = data, hue = 'sex', palette = gender_palette); plt.show()

The plots suggest that people with heart disease usually have 'oldpeak' values between 0 and 1, while those without heart disease tend to have 'oldpeak' values above 1.

fbs

In [None]:
data['fbs']

In [None]:
data['fbs'].value_counts()

In [None]:
sns.countplot(x = data['fbs'], hue = data['target'], palette = my_palette); plt.show()

In [None]:
pd.crosstab(data['target'], data['fbs']).plot(kind = 'bar', color = ['r','k']); plt.show()

Restecg

In [None]:
data['restecg']

In [None]:
data['restecg'].value_counts()

In [None]:
plt.figure(figsize = (10,10))
sns.countplot(x = data['restecg'], hue = data['target'], palette = my_palette); plt.show()

In [None]:
sns.set_style('darkgrid')
pd.crosstab(data['target'], data['restecg']).plot(kind = 'bar', color = ['r','grey', 'k']); plt.show()

Exang

In [None]:
data['exang']

In [None]:
data['exang'].value_counts()

In [None]:
sns.countplot(x = data['exang'], hue = data['target'], palette = my_palette); plt.show()

In [None]:
pd.crosstab(data['target'], data['exang']).plot(kind = 'bar', color = ['r', 'k']); plt.show()

We can't predict heart disease from the exang value alone. However, when heart disease is detected, chances are high that the exang value is 'no'.

Slope

In [None]:
data['slope']

In [None]:
sns.countplot(x = data['slope'], hue = data['target'], palette = my_palette); plt.show()

In [None]:
pd.crosstab(data['target'], data['slope']).plot(kind = 'bar', color = ['r', 'grey', 'k']); plt.show()

Of the three slope categories, 'downsloping' is the one most strongly associated with heart disease.

ca

In [None]:
data['ca']

In [None]:
data['ca'].value_counts()

In [None]:
sns.countplot(x = data['ca'], hue = data['target'], palette = my_palette); plt.show()

The pattern here is fairly clear: the fewer the major vessels colored by fluoroscopy, the higher the chance of heart disease.

Thal

In [None]:
data['thal']

In [None]:
data['thal'].value_counts()

In [None]:
sns.countplot(x = data['thal'], hue = data['target'], palette = my_palette); plt.show()

The results are interesting. The 'normal' and 'reversable defect' categories are less prone to heart disease than 'fixed defect'.

We are done with the Exploratory Data Analysis. We looked at how each feature relates to the target variable on its own. Some features showed a clear association with the target. Others look ineffective in isolation but might have an effect when combined with other factors.

### Some Facts

* Cardiovascular disease is the leading cause of death in women in Australia with 90% of women having one risk factor.
* Risk factors include high blood pressure, high cholesterol, smoking, diabetes, weight and family history.
* A woman's risk also goes up if she's had a miscarriage or had her ovaries or uterus removed.
* Women's hearts are affected by stress and depression more than men's. Depression makes it difficult to maintain a healthy lifestyle.

# Data Modelling

Performing Feature Scaling

In [None]:
sc = StandardScaler()

In [None]:
X_train

In [None]:
X_train = sc.fit_transform(X_train)

In [None]:
X_test = sc.transform(X_test)

In [None]:
X_train

In [None]:
X_test
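The scaler is fitted on the training split only and then applied to the test split, which keeps test-set statistics out of training. The same pattern can be wrapped in a scikit-learn `Pipeline` so the fit/transform order is handled automatically. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the heart dataset), and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features, the same count as the notebook's X_train.
X, y = make_classification(n_samples=300, n_features=13, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=21)

# fit() scales with statistics from X_tr only; score() reuses them on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {accuracy:.3f}")
```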

Simple Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
lr.score(X_test, y_test)
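`score` returns plain accuracy, which hides how the errors split between the two classes. A confusion matrix and a per-class report show that breakdown. The sketch below uses hand-written labels (`y_true` and `y_pred` are made up) rather than the notebook's predictions; on the real model, `y_pred` would come from `lr.predict(X_test)`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted labels for ten test rows.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"accuracy: {acc:.2f}")
print(classification_report(y_true, y_pred))
```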

Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# 600 trees; naming the argument makes the intent explicit
clf = RandomForestClassifier(n_estimators = 600)

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)
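A fitted random forest also exposes `feature_importances_`, which can be set against the feature-by-feature impressions from the EDA. This is a minimal sketch on synthetic data; the feature names are hypothetical placeholders. In this notebook `X_train` becomes a NumPy array after scaling, so the original column names would have to be saved beforehand.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=21)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=21).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favour features with many distinct values, so they are best read as a rough ranking.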

XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

# n_jobs and random_state replace the older nthread and seed aliases
alg = XGBClassifier(learning_rate=0.01, n_estimators=2000, max_depth=8,
                    min_child_weight=0, gamma=0, subsample=0.52, colsample_bytree=0.6,
                    objective='binary:logistic', scale_pos_weight=1,
                    random_state=27, reg_alpha=5, reg_lambda=2, booster='gbtree',
                    n_jobs=-1, max_delta_step=0, colsample_bylevel=0.6, colsample_bynode=0.6)
alg.fit(X_train, y_train)

print('test accuracy', alg.score(X_test, y_test))
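With a test split of roughly 60 rows, a single accuracy figure can shift noticeably with the random seed. Cross-validation gives a steadier basis for comparing models. The sketch below compares two of the model types used above on synthetic data; the data, fold count and hyperparameters are all assumptions for illustration, and XGBoost is left out only to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=21)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=21),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```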

Improvements and additions to this notebook will be made gradually. This version covers the EDA and a few modelling methods: Logistic Regression, Random Forest and XGBoost.

## If you liked the notebook, please appreciate my efforts by giving an UPVOTE! Thanks!