# Prediction of heart disease

## Problem statement
Predict whether a patient has heart disease based on their clinical parameters.

## DATA
The dataset can be found here: https://www.kaggle.com/ronitf/heart-disease-uci



## Features
1. age - age in years<br>
2. sex - (1 = male; 0 = female)<br>

3. cp - chest pain type
- 0: Typical angina: chest pain related to decreased blood supply to the heart
- 1: Atypical angina: chest pain not related to heart
- 2: Non-anginal pain: typically esophageal spasms (non heart related)
- 3: Asymptomatic: chest pain not showing signs of disease


4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
anything above 130-140 is typically cause for concern

5. chol - serum cholesterol in mg/dl
serum = LDL + HDL + .2 * triglycerides
above 200 is cause for concern

6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
'>126' mg/dL signals diabetes

7. restecg - resting electrocardiographic results
- 0: Nothing to note
- 1: ST-T Wave abnormality
can range from mild symptoms to severe problems
signals non-normal heart beat
- 2: Possible or definite left ventricular hypertrophy
Enlarged heart's main pumping chamber

8. thalach - maximum heart rate achieved

9. exang - exercise induced angina (1 = yes; 0 = no)

10. oldpeak - ST depression induced by exercise relative to rest
looks at stress of heart during exercise
unhealthy heart will stress more

11. slope - the slope of the peak exercise ST segment
- 0: Upsloping: better heart rate with exercise (uncommon)
- 1: Flatsloping: minimal change (typical healthy heart)
- 2: Downsloping: signs of unhealthy heart

12. ca - number of major vessels (0-3) colored by fluoroscopy
colored vessel means the doctor can see the blood passing through
the more blood movement the better (no clots)

13. thal - thallium stress result
1,3: normal
6: fixed defect: used to be a defect but OK now
7: reversible defect: no proper blood movement when exercising

14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)
Note: No personally identifiable information (PII) can be found in the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
cd '/content/gdrive/My Drive/Colab Notebooks/DS projects'

In [None]:
# IMPORT packages, libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data split
from sklearn.model_selection import train_test_split

# ML models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# cross validation
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score,\
  precision_score, recall_score, f1_score, RocCurveDisplay

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load dataset
df=pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head()

## EDA

In [None]:
# Info
df.info()

All the features are already numerical, so no encoding is needed.

In [None]:
# Stats
df.describe().T

We can see here the stats for different features.

A few of them have values in the hundreds while others stay in single digits, so we will need to **scale our features**.

We can also see that all features have 303 records, i.e. no missing values. Still, let's check explicitly.

In [None]:
# missing values
df.isnull().sum()

In [None]:
# TARGET
df.target.value_counts()

This is a nearly balanced dataset.

### Comparing "Sex" feature with TARGET

In [None]:
df.sex.value_counts()

So we have more males (1) than females (0) in our data set.

In [None]:
pd.crosstab(df.sex, df.target)

So, around 75% of the females and 47% of the males in this dataset have heart disease.
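
Percentages like these can be read directly from a row-normalised crosstab instead of being worked out by hand. The sketch below uses a small made-up frame (`toy` is hypothetical and is not the real dataset) purely to show the `normalize="index"` option.

```python
import pandas as pd

# Toy data (hypothetical, NOT the real dataset): 4 females (0) and 6 males (1)
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# normalize="index" divides every row by its row total, so each row
# holds the share of target=0 and target=1 within that sex
rates = pd.crosstab(toy.sex, toy.target, normalize="index")
print(rates)
```

On the real data the equivalent call would be `pd.crosstab(df.sex, df.target, normalize="index")`.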

Let's plot it.

In [None]:
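pd.crosstab(df.sex, df.target).plot(kind='bar');
plt.ylabel('Count of people');
# Bars are grouped by sex on the x-axis; the legend distinguishes the target values
plt.legend(['No disease', 'Heart disease']);
plt.xticks(ticks=[0,1], labels=['Female', 'Male'], rotation=0);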

### Comparing "cp" feature with TARGET
cp - chest pain type
- 0: Typical angina: chest pain related to decreased blood supply to the heart
- 1: Atypical angina: chest pain not related to heart
- 2: Non-anginal pain: typically esophageal spasms (non heart related)
- 3: Asymptomatic: chest pain not showing signs of disease

In [None]:
df.cp.value_counts()

In [None]:
pd.crosstab(df.cp, df.target)

So, the share of patients with heart disease is higher for cp values 1 to 3 than for cp 0, and it is highest for people with **Atypical angina** and **Non-anginal pain**.

In [None]:
pd.crosstab(df.cp, df.target).plot(kind='bar')
plt.ylabel('Count of people');
plt.legend(['No disease', 'Heart disease'])
plt.xticks(ticks=[0,1,2,3],rotation=30,labels=['Typical angina', 'Atypical angina','Non-anginal pain','Asymptomatic']);

### Comparing age and cp

In [None]:
df.age.describe()

In [None]:
# AGE distribution among patients with heart disease
df[df.target==1].age.hist(bins=7);

So, most patients with heart disease in this dataset are aged 40-65.

In [None]:
# Comparing the AGE distribution for different chest pains.
fig, ax= plt.subplots(2,2)
ax[0,0].hist(df[(df.cp==0) & (df.target==1)].age, bins=5);
ax[0,0].set_title('cp=0');
ax[0,1].hist(df[(df.cp==1) & (df.target==1)].age, bins=5, color='red');
ax[0,1].set_title('cp=1');
ax[1,0].hist(df[(df.cp==2) & (df.target==1)].age, bins=5, color='orange');
ax[1,0].set_title('cp=2');
ax[1,1].hist(df[(df.cp==3) & (df.target==1)].age, bins=5, color='yellow');
ax[1,1].set_title('cp=3');
fig.tight_layout()

Here, patients aged 40-65 are the ones most likely to have chest pain types 1, 2 or 3, which carry a higher risk of heart disease.

**Age 40 to 65 risky**.

### Correlation matrix

In [None]:
fig, ax= plt.subplots(figsize=(12,8))
ax=sns.heatmap(df.corr());

**Positive correlation (target tends towards 1, i.e. disease, as these values increase)**
- cp
- thalach (max heart rate)
- slope (a more downsloping peak exercise ST segment increases risk)

**Negative correlation (risk reduces as these values increase)**
- exang (if the angina/chest pain is induced by exercise, the risk is lower)
- oldpeak
- ca (the more colored vessels seen during fluoroscopy, the lower the risk)
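
A heatmap can be hard to read precisely, so it may help to pull out just the target column of the correlation matrix and sort it. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); the sign of each entry is what the two lists above describe.

```python
import pandas as pd

# Toy data (hypothetical, NOT the real dataset)
toy = pd.DataFrame({
    "cp":     [0, 1, 2, 3, 0, 2, 1, 0],
    "exang":  [1, 0, 0, 0, 1, 0, 0, 1],
    "target": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Correlation of every feature with the target, most negative first
target_corr = toy.corr()["target"].drop("target").sort_values()
print(target_corr)
```

On the real data the same idea would be `df.corr()["target"].drop("target").sort_values()`.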

## Feature Engineering 

### Outlier detection
We will now try to find outliers in features with continuous values.

In [None]:
for x in ['age','trestbps','chol','thalach']:
  print(x.upper())
  sns.boxplot(data=df, y=x);
  plt.show()

We can see these features have outliers.
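
Boxplots flag points beyond 1.5 times the interquartile range from the quartiles. To put a number on what the plots show, the same rule can be applied directly. The helper name `count_iqr_outliers` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(values):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical cholesterol-like values with one extreme reading
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 260, 564])
n_outliers = count_iqr_outliers(toy_chol)
print(n_outliers)
```

On the real data this could be called as `count_iqr_outliers(df[x])` inside the loop over the four continuous features.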

### Checking distribution

In [None]:
df.columns

In [None]:
for x in ['age','trestbps','chol','thalach']:
  print(x.upper())
  sns.distplot(df[x], kde=True);
  plt.show()
  print('skew:', df[x].skew())
  print('--------------------------------------')

As the 'trestbps' and 'chol' features have an absolute skewness greater than 0.5, we need to transform them.

And then we will scale down the features.

## Feature transformation / Normalization

In [None]:
sns.distplot(np.log(df.trestbps), kde=True)

In [None]:
np.log(df.trestbps).skew()

In [None]:
np.log(df.chol).skew()

We can see that the log transform reduces the **skewness** (the left or right shift of the distribution) and brings the distribution closer to normal.
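
The effect is easy to reproduce on synthetic data. The sketch below draws a right-skewed (log-normal) sample as a stand-in for a feature such as `chol`; the sample and its parameters are assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed data (log-normal), a stand-in for a skewed feature
x = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=2000))

skew_before = x.skew()          # clearly positive for log-normal data
skew_after = np.log(x).skew()   # close to 0, since log(x) is normal here
print('skew before:', skew_before)
print('skew after :', skew_after)
```

Real features are rarely exactly log-normal, so the skewness after the transform should still be checked, as done in the cells above.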

## Methods for feature engineering
Here we define functions that apply outlier treatment, normalization and scaling to both the train set and the test set.

We first learn the parameters (outlier bounds and scaling statistics) from the train set into `outlier_dict` and `scale_dict`, and then reuse them on the test set rather than learning separate parameters from the test set.

We handle **outliers** before **normalization**, as capping outliers already helps normalize the data up to a certain extent.
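
The "learn on train, reuse on test" idea can be sketched in isolation for the outlier step. The helper names `fit_bounds` and `apply_bounds` are hypothetical, and `np.clip` is used here as a shorter equivalent of the nested `np.where` calls in the functions below.

```python
import numpy as np

def fit_bounds(train_values):
    """Learn the IQR-based capping bounds from the training values only."""
    q1, q3 = np.percentile(train_values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def apply_bounds(values, bounds):
    """Cap values to previously learned bounds (works for train or test)."""
    lower, upper = bounds
    return np.clip(values, lower, upper)

# Hypothetical values for a single feature
train = np.array([100.0, 110.0, 120.0, 130.0, 140.0])
test = np.array([50.0, 125.0, 300.0])

bounds = fit_bounds(train)           # learned from train only: (80.0, 160.0)
clipped = apply_bounds(test, bounds) # test is capped with the train bounds
print(bounds, clipped)
```

The test values 50 and 300 are capped to 80 and 160, bounds that come entirely from the training data.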

In [None]:
outlier_dict={}
scale_dict={}
def train_engg(X, feat=['age','trestbps','chol','thalach']):
  for f in feat:
    # print(f)
    # outlier
    first_quartile= np.percentile(X[f],25)
    third_quartile= np.percentile(X[f],75)
    iqr= third_quartile-first_quartile
    lowest= first_quartile-(1.5*iqr)
    highest= third_quartile+(1.5*iqr)
    # print(lowest, highest)
    outlier_dict[f]=[lowest,highest]
    X[f]= np.where(X[f]>highest, highest, np.where(X[f]<lowest, lowest, X[f]))
    
    # transformation/normalization
    X[f]=np.log(X[f])

    # scaling
    mean=X[f].mean()
    std=X[f].std()
    scale_dict[f]=[mean,std]
    X[f]=(X[f]-mean)/std
  return X


In [None]:
# Test set method for feature engg
def test_engg(X, feat=['age','trestbps','chol','thalach']):
  for f in feat:

    # outlier
    X[f]= np.where(X[f]>outlier_dict[f][1], outlier_dict[f][1],np.where(X[f]<outlier_dict[f][0], outlier_dict[f][0], X[f]))
    
    # transformation/normalization
    X[f]=np.log(X[f])

    # scaling
    X[f]=(X[f]-scale_dict[f][0])/scale_dict[f][1]
  return X

# X_test=test_engg(X_test)

## Data prep for modelling

In [None]:
X= df.drop(['target'], axis=1)

y= df.target

In [None]:
# Splitting the data
# Random seed for reproducibility
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)

In [None]:
len(X_test)

In [None]:
len(X_train)

## Creating models

In [None]:
# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
          "Logistic Regression": LogisticRegression(), 
          "Random Forest": RandomForestClassifier(),
          "Linear SVC": LinearSVC()}

In [None]:
# Function to train the models and record the scores
def fit_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # dict to store model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit/train the model
        
        X_train_new=train_engg(X_train.copy())

        model.fit(X_train_new, y_train)
        #print(name, model.get_params())
        
        # Evaluate the model and append its score to model_scores

        X_test_new=test_engg(X_test.copy())

        model_scores[name] = model.score(X_test_new, y_test) # accuracy
    return model_scores

In [None]:
model_scores= fit_score(models, X_train, X_test, y_train, y_test)
model_scores

Model Scores without feature engineering:

{'KNN': 0.6885245901639344,
 
 'Linear SVC': 0.47540983606557374,

 'Logistic Regression': 0.8852459016393442,

 'Random Forest': 0.8360655737704918}

In [None]:
plt.bar(x=model_scores.keys(), height=model_scores.values());
plt.ylabel('Accuracy'); 

## Hyper-parameter tuning with cross-validation

### GridSearchCV tuning
Grid search tries every combination of the hyperparameters in the grid and keeps the one with the best cross-validated score.

We use the recall score to tune our models, as this is a health-related problem where we generally want to reduce FALSE NEGATIVES.
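
Recall is the fraction of actual positives that the model catches, TP / (TP + FN), so every false negative lowers it. A small worked example on made-up labels (not results from this notebook):

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 4 actual positives, of which the model finds 3
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 1]

# Count true positives and false negatives by hand
tp = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true_toy, y_pred_toy) if t == 1 and p == 0)
manual_recall = tp / (tp + fn)        # 3 / (3 + 1) = 0.75

# Same number from scikit-learn
sk_recall = recall_score(y_true_toy, y_pred_toy)
print(manual_recall, sk_recall)
```

The two false positives in the example do not affect recall at all, which is why precision is also reported later on.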

#### Logistic Regression

In [None]:
np.logspace(-4,4,20)

In [None]:
# Logistic Regression
lr_grid={"C": [1,2,3,4,5,6,7,8,9], "solver": ['lbfgs','liblinear'], 'penalty':['l2']}

np.random.seed(42)
lr_gs= GridSearchCV(LogisticRegression(), param_grid= lr_grid, cv=5, verbose=True, scoring='recall')

X_train_new=train_engg(X_train.copy())
lr_gs.fit(X_train_new, y_train);

90 fits means: 9 C values x 2 solvers x 1 penalty x 5 CV folds = 90
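
The number of fits a grid search performs can be counted rather than multiplied out by hand. The sketch below repeats the `lr_grid` from the cell above and uses scikit-learn's `ParameterGrid` to enumerate the candidates.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used for Logistic Regression above
lr_grid = {"C": [1, 2, 3, 4, 5, 6, 7, 8, 9],
           "solver": ['lbfgs', 'liblinear'],
           'penalty': ['l2']}

n_candidates = len(ParameterGrid(lr_grid))  # 9 * 2 * 1 = 18 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds            # 18 * 5 = 90 model fits
print(n_candidates, n_fits)
```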

Let's see the best parameters and score.

In [None]:
lr_gs.best_params_

In [None]:
# model score
X_test_new= test_engg(X_test.copy())
lr_gs.score(X_test_new, y_test)

In [None]:
# TRAINING SCORE
lr_gs.score(X_train_new, y_train)

#### KNN

In [None]:
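# Scorer names accepted by `scoring` (sklearn.metrics.SCORERS was removed in scikit-learn 1.3)
from sklearn.metrics import get_scorer_names
get_scorer_names()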

In [None]:
# KNN
knn_grid={'n_neighbors':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]}

np.random.seed(42)
knn_gs= GridSearchCV(KNeighborsClassifier(), param_grid= knn_grid, cv=5, verbose=True, scoring='recall')

X_train_new=train_engg(X_train.copy())
knn_gs.fit(X_train_new, y_train);

In [None]:
knn_gs.best_params_

In [None]:
# model score
X_test_new= test_engg(X_test.copy())
knn_gs.score(X_test_new, y_test)

In [None]:
# TRAINING SCORE
knn_gs.score(X_train_new, y_train)

#### Linear SVC

In [None]:
# Linear SVC
svc_grid={'C':[0.01,0.05,0.1,0.5,1,2,3,4,5,6,7,8,9]}

np.random.seed(42)
svc_gs= GridSearchCV(LinearSVC(), param_grid= svc_grid, cv=5, verbose=True, scoring='recall')

X_train_new=train_engg(X_train.copy())
svc_gs.fit(X_train_new, y_train);

In [None]:
svc_gs.best_params_

In [None]:
# model score
X_test_new= test_engg(X_test.copy())
svc_gs.score(X_test_new, y_test)

In [None]:
# TRAINING SCORE
svc_gs.score(X_train_new, y_train)

#### Random Forest

In [None]:
rf_grid = {"n_estimators": np.arange(10, 400, 50),
           "max_depth": [None, 3, 5],
           "min_samples_split": np.arange(2, 7, 2),
           "min_samples_leaf": np.arange(15, 22, 2)}

In [None]:
np.random.seed(42)
rf_gs= GridSearchCV(RandomForestClassifier(), param_grid= rf_grid, cv=5, verbose=True, scoring='recall')

X_train_new=train_engg(X_train.copy())
rf_gs.fit(X_train_new, y_train);

In [None]:
rf_gs.best_params_

In [None]:
X_test_new= test_engg(X_test.copy())
rf_gs.score(X_test_new, y_test)

In [None]:
# TRAINING SCORE
rf_gs.score(X_train_new, y_train)

So, none of our models appear to be over-fitting, as their training and test scores are close to each other.

## Evaluating model
### KNN

In [None]:
X_test_new= test_engg(X_test.copy())
y_pred= knn_gs.predict(X_test_new)

In [None]:
y_test.values

In [None]:
pd.DataFrame({'Actual':y_test.values, 'Predicted':y_pred})

### Confusion matrix

In [None]:
# confusion_matrix puts actual labels on rows (y-axis) and predictions on columns (x-axis)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True)
plt.xlabel('Pred')
plt.ylabel('Actual');

Here we can see 4 **false positives** (predicted as heart patients but actually not) and 4 **false negatives** (predicted as not heart patients but actually are).
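
The four cells of a binary confusion matrix can also be unpacked into named counts, which avoids misreading the heatmap. This is a small sketch on made-up labels (not this notebook's predictions); scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes;
# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```

On the real predictions this would be `confusion_matrix(y_test, y_pred).ravel()`.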

## ROC curve and AUC score
The larger the area under the curve (AUC), the better the model.
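
AUC can be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. A tiny worked example with made-up scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

# Of the 4 (positive, negative) pairs, 3 are ranked correctly:
# 0.35 > 0.1, 0.8 > 0.1, 0.8 > 0.4, but 0.35 < 0.4, so AUC = 3/4
auc = roc_auc_score(y_true_toy, y_score_toy)

# A model that ranks every positive above every negative reaches 1.0
perfect_auc = roc_auc_score(y_true_toy, [0.1, 0.2, 0.7, 0.9])
print(auc, perfect_auc)
```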

In [None]:
# Import the ROC display class from the metrics module
# (plot_roc_curve was removed in scikit-learn 1.2)
from sklearn.metrics import RocCurveDisplay

# Plot ROC curve and calculate AUC metric
RocCurveDisplay.from_estimator(knn_gs, X_test_new, y_test);

## Classification report

In [None]:
print(classification_report(y_test, y_pred))

## Cross validation scores for different metrics

In [None]:
knn_gs.best_params_

In [None]:
clf= KNeighborsClassifier(n_neighbors=13)

In [None]:
X_new= test_engg(X.copy())
cv_acc = np.mean(cross_val_score(clf,
                         X_new,
                         y,
                         cv=5, # 5-fold cross-validation
                         scoring="accuracy")) # accuracy as scoring
cv_acc

In [None]:
# Cross-validated recall score
cv_recall = np.mean(cross_val_score(clf,
                                    X_new,
                                    y,
                                    cv=5, # 5-fold cross-validation
                                    scoring="recall")) # recall as scoring
cv_recall

In [None]:
# Cross-validated precision score
cv_precision = np.mean(cross_val_score(clf,
                                       X_new,
                                       y,
                                       cv=5, # 5-fold cross-validation
                                       scoring="precision")) # precision as scoring
cv_precision

In [None]:
# Cross-validated F1 score
cv_f1 = np.mean(cross_val_score(clf,
                                X_new,
                                y,
                                cv=5, # 5-fold cross-validation
                                scoring="f1")) # f1 as scoring
cv_f1
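
The four cells above each rerun the cross-validation for one metric. As a possible alternative, `cross_validate` accepts several scorers at once and fits each fold only once. The sketch below uses synthetic data from `make_classification` (an assumption, so that it runs on its own) and hypothetical names `X_toy`, `y_toy` and `toy_clf`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the heart disease dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_clf = KNeighborsClassifier(n_neighbors=13)
metrics = ["accuracy", "precision", "recall", "f1"]

# One pass over the 5 folds, scoring every metric on each fold
results = cross_validate(toy_clf, X_toy, y_toy, cv=5, scoring=metrics)

# Mean score per metric; result keys are prefixed with "test_"
summary = {m: results[f"test_{m}"].mean() for m in metrics}
print(summary)
```

With the notebook's data the call would take `clf`, `X_new` and `y` in place of the toy objects.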

## Feature importance
Let's have a look at the model coefficients to see how much importance the models give to different features.

As we do not get model coefficients for KNN, we will go for Logistic Regression and Linear SVC.

In [None]:
X_train_new= train_engg(X_train.copy())

In [None]:
def show_feat_imp(clf2):
  clf2.fit(X_train_new, y_train)
  # Pair each coefficient with its feature name (the feature matrix excludes 'target')
  features_dict = dict(zip(X_train_new.columns, clf2.coef_[0]))
  # Visualize feature importance
  features_df = pd.DataFrame(features_dict, index=[0])
  features_df.T.plot.bar(title="Feature Importance", legend=False); 

In [None]:
# Logistic Regression
show_feat_imp(LogisticRegression(C=1, penalty='l2', solver='liblinear'))

In [None]:
# Linear SVC
show_feat_imp(LinearSVC(C=0.1))

Here we can see that features like SEX, CP, EXANG, SLOPE, CA and THAL are given more importance by both models, and the signs of their coefficients (+ve or -ve) match their correlations with the target in the heatmap below.

In [None]:
sns.heatmap(df.corr())
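
When comparing coefficients, it can be easier to rank them by absolute size than to read them off a bar chart. The sketch below fits a logistic regression on synthetic data (an assumption; `X_toy`, `y_toy` and the column names `f0` to `f3` are hypothetical) and indexes the coefficients by the feature matrix's own columns so names and values stay aligned.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the heart disease dataset)
X_arr, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = LogisticRegression().fit(X_toy, y_toy)

# One coefficient per feature, labelled with the feature matrix's columns
coefs = pd.Series(model.coef_[0], index=X_toy.columns)

# Largest absolute coefficient first; the sign is kept for interpretation
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient sizes are only comparable across features when the features are on similar scales, which is one reason for the scaling step earlier.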