# Predicting Heart Disease Using Machine Learning 

This notebook uses various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes.

We're going to take the following approach:
1. Problem Definition.
2. Data.
3. Evaluation.
4. Features.
5. Modelling.
6. Experimentation.

## 1. Problem Definition

> In a statement: given clinical parameters about a patient, can we predict whether or not they have heart disease?

## 2. Data

The original data came from the Cleveland database in the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/heart+disease

There is also a version available on Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci

## 3. Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.

## 4. Features

**Data Dictionary**
* 1. age - age in years
* 2. sex - (1 = male; 0 = female)
* 3. cp - chest pain type
        0: Typical angina: chest pain related to decreased blood supply to the heart
        1: Atypical angina: chest pain not related to the heart
        2: Non-anginal pain: typically esophageal spasms (not heart related)
        3: Asymptomatic: chest pain not showing signs of disease
* 4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern
* 5. chol - serum cholesterol in mg/dl
        serum = LDL + HDL + 0.2 * triglycerides
        above 200 is cause for concern
* 6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
        '>126' mg/dL signals diabetes
* 7. restecg - resting electrocardiographic results
        0: Nothing to note
        1: ST-T wave abnormality
        can range from mild symptoms to severe problems
        signals a non-normal heartbeat
        2: Possible or definite left ventricular hypertrophy
        enlarged main pumping chamber of the heart
* 8. thalach - maximum heart rate achieved
* 9. exang - exercise-induced angina (1 = yes; 0 = no)
* 10. oldpeak - ST depression induced by exercise relative to rest; looks at the stress on the heart during exercise (an unhealthy heart will stress more)
* 11. slope - the slope of the peak exercise ST segment
        0: Upsloping: better heart rate with exercise (uncommon)
        1: Flat: minimal change (typical healthy heart)
        2: Downsloping: signs of an unhealthy heart
* 12. ca - number of major vessels (0-3) colored by fluoroscopy
        a colored vessel means the doctor can see the blood passing through
        the more blood movement the better (no clots)
* 13. thal - thallium stress test result
        1,3: normal
        6: fixed defect: used to be a defect but OK now
        7: reversible defect: no proper blood movement when exercising
* 14. target - has disease or not (1 = yes, 0 = no) (= the predicted attribute)
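
As a small illustration of how the coded columns in the data dictionary can be made readable during exploration, the sketch below maps the integer codes to labels with pandas. The `CP_LABELS` name and the toy rows are assumptions made for this example; they are not part of the dataset or of the notebook's own code.

```python
import pandas as pd

# Hypothetical label mapping, copied from the data dictionary above.
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}

# Toy rows for illustration only; these are not real patient records.
toy = pd.DataFrame({"cp": [0, 2, 3, 1], "sex": [1, 0, 1, 0]})

# Series.map replaces each code with its label (unknown codes become NaN).
toy["cp_label"] = toy["cp"].map(CP_LABELS)
toy["sex_label"] = toy["sex"].map({1: "male", 0: "female"})
print(toy)
```

The same mapping could be applied to the real `cp` column for plots and tables, while the numeric column is kept for modelling.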

## Preparing the tools

We're going to use pandas, NumPy, Matplotlib and seaborn for data analysis, manipulation and plotting, and scikit-learn for modelling and evaluation.

In [None]:
# Importing all the tools(libraries) we need.
# Regular EDA (Exploratory Data Analysis) and plotting libraries/tools.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation Tools
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
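try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    # plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator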

## Load Data and Initial Checks.

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

In [None]:
df.shape

## Data Exploration (Exploratory Data Analysis or EDA)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.target.value_counts()

In [None]:
df.target.value_counts().plot(kind = "bar"
                             ,color = ["salmon", "lightblue"]
                             ,figsize=(5,5))

plt.xlabel("1 = Heart Disease, 0 = No Heart Disease")
plt.ylabel("Amount")
plt.xticks(rotation=0);

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

## Exploring The Patterns in the Data

In [None]:
df.sex.value_counts()

In [None]:
pd.crosstab(df.sex, df.target)

Here we can see that about 75% of the women in the sample have heart disease, compared with about 50% of the men. Taking a simple average of the two (75% and 50%) gives roughly 60%, which we will use as a baseline: the model's accuracy should be at least above 60%.
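
The per-group percentages and the baseline discussed above can also be computed directly instead of being read off the table by eye. The sketch below uses a made-up toy frame (not the real dataset) and shows a row-normalised crosstab together with a majority-class baseline, which is one common way to define "the accuracy a model must beat".

```python
import pandas as pd

# Toy data for illustration only; the counts are invented.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# normalize="index" turns counts into the share of each target value per sex.
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rates)

# Majority-class baseline: accuracy of always predicting the most common class.
baseline = toy["target"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.2f}")
```

On the real data, replacing `toy` with `df` would give the actual per-sex rates and a baseline weighted by group size rather than a simple average.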

In [None]:
pd.crosstab(df.target, df.sex).plot(kind = "bar"
                                   ,color = ["salmon", "lightblue"]
                                   ,figsize=(10,10))

plt.title("Heart Disease Frequency by Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.xticks(rotation=0)
plt.legend(["Female", "Male"])
plt.show()

In [None]:
df.head()

In [None]:
df.thalach.value_counts()

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df.age[df.target == 1]
           ,df.thalach[df.target == 1]
           ,c="salmon")

plt.scatter(df.age[df.target == 0]
           ,df.thalach[df.target == 0]
           ,c="lightblue")

plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Thalach")
plt.legend(["Disease", "No Disease"])
plt.show()

In [None]:
df.age.plot.hist()
plt.show()

In [None]:
pd.crosstab(df.cp, df.target)

In [None]:
pd.crosstab(df.cp, df.target).plot(kind = "bar"
                                  ,figsize=(10,6)
                                  ,color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency per chest pain type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0)
plt.show()

In [None]:
df.corr()

In [None]:
correlation_matrix = df.corr()

fig, ax = plt.subplots(figsize=(15,10))

ax = sns.heatmap(correlation_matrix
             ,annot=True
             ,linewidth=2
             ,fmt=".2f"
             ,cmap="YlGnBu")


plt.yticks(rotation=0)
plt.show()

## Preparing Our Data for ML.

In [None]:
df.head()

In [None]:
X = df.drop("target", axis=1)
y = df.target

In [None]:
np.random.seed(42) # To reproduce our results

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
len(X_train), len(X_test)

In [None]:
len(y_train), len(y_test)
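
The split above relies on the global NumPy seed and does not stratify. As an alternative sketch, assuming we want the class balance preserved in both parts, `train_test_split` accepts `random_state` and `stratify` directly. The arrays below are synthetic stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 60% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

# random_state makes the split reproducible without a global seed;
# stratify keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Whether stratification matters here depends on how balanced `target` is; with a roughly balanced target the plain split is usually fine.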

In [None]:
models = {
    "Logistic Regression":LogisticRegression(solver='liblinear'),
    "KNN":KNeighborsClassifier(),
    "Random Forest":RandomForestClassifier()
}

In [None]:
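def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fit each model on the training data and score it on the test data.

    models: dict mapping a model name to an unfitted scikit-learn estimator
    Returns a dict mapping each model name to its test-set accuracy.
    """
    np.random.seed(42)  # For reproducible results

    model_scores = {}

    for model_name, model in models.items():
        model.fit(X_train, y_train)
        # For classifiers, .score() returns mean accuracy
        model_scores[model_name] = model.score(X_test, y_test)

    return model_scores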

In [None]:
model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
# We can see that Logistic Regression worked the best (even though it is not in the scikit-learn "choosing the right estimator" map).
# Now we will proceed with improving these models by hyperparameter tuning.

In [None]:
model_scores

### Model Comparison

In [None]:
model_comparison = pd.DataFrame(model_scores, index=["accuracy"])
model_comparison = model_comparison.T

In [None]:
model_comparison.plot.bar()
plt.xticks(rotation=0)
plt.legend()
plt.show()

### Our Focus 
- Hyperparameter tuning
- Feature Importance 
- Confusion Matrix
- Cross Validation
- Precision
- Recall
- F1 Score
- Classification Report
- ROC Curve
- Area under ROC Curve (AUC)
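
Several of the metrics in this list are derived from the confusion matrix. As a minimal sketch on made-up labels (not the notebook's predictions), the block below computes precision, recall and F1 by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

In a medical setting recall is often the metric to watch, since a false negative means a missed case of disease.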

In [None]:
train_scores = []
test_scores = []

neighbors = range(1, 21)
knn = KNeighborsClassifier(n_jobs=-1)

for i in neighbors:
    knn.set_params(n_neighbors=i)
    knn.fit(X_train, y_train)
    test_scores.append(knn.score(X_test, y_test))
    train_scores.append(knn.score(X_train, y_train))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors, train_scores, label="Train Score")
plt.plot(neighbors, test_scores, label="Test Score")
plt.xlabel("Number Of Neighbors")
plt.ylabel("KNN Model Score")
plt.legend()
plt.show()

In [None]:
print(f"The maximum score achieved was {max(test_scores)*100:.2f}%")

The maximum score obtained even after hyperparameter tuning is still lower than that of the other two models, so we will drop KNN.

In [None]:
log_reg_grid = {
    "C":np.logspace(-4, 4, 20),
    "solver":["liblinear"]
}

rf_grid={
    "n_estimators":np.arange(10,1000,50),
    "max_depth":[None, 3, 5 ,10],
    "min_samples_split":np.arange(2,20,2),
    "min_samples_leaf":np.arange(1,20,2)
}
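
To get a feel for why a randomized search is a reasonable first step here, it helps to count how many combinations the random forest grid contains. The sketch below rebuilds the same grid under the hypothetical name `rf_grid_demo` and counts the combinations with `ParameterGrid`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the random forest grid above, under a separate name.
rf_grid_demo = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

# len(ParameterGrid) is the product of the number of values per parameter.
n_combinations = len(ParameterGrid(rf_grid_demo))
print(n_combinations)        # 20 * 4 * 9 * 10 combinations
print(n_combinations * 5)    # model fits an exhaustive search with cv=5 would need
```

`RandomizedSearchCV` samples only `n_iter` of these combinations (10 by default), so it fits far fewer models than an exhaustive grid search would.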

In [None]:
np.random.seed(42)
log_reg_rs = RandomizedSearchCV(estimator=LogisticRegression(),
                               cv = 5,
                               verbose = 0,
                               param_distributions=log_reg_grid)

rf_rs = RandomizedSearchCV(estimator = RandomForestClassifier(),
                          cv=5,
                          verbose=0,
                          param_distributions=rf_grid)

In [None]:
log_reg_rs.fit(X_train, y_train)
rf_rs.fit(X_train, y_train)

In [None]:
log_reg_rs.best_params_

In [None]:
rf_rs.best_params_

In [None]:
log_reg_rs.score(X_test, y_test)

In [None]:
rf_rs.score(X_test, y_test)

In [None]:
gs_log_reg = GridSearchCV(LogisticRegression(), log_reg_grid, cv=5, verbose=2, n_jobs=-1)
gs_log_reg.fit(X_train, y_train)

In [None]:
y_preds = gs_log_reg.predict(X_test)

In [None]:
plot_roc_curve(gs_log_reg, X_test, y_test)
plt.show()
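
The ROC plot can be summarised by a single number, the area under the curve (AUC). The sketch below uses four made-up scores rather than the model's output and computes the AUC in two equivalent ways: directly from the scores, and by integrating the curve points.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities for illustration only.
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# AUC computed directly from the scores.
auc_direct = roc_auc_score(y_true_demo, y_score_demo)

# The same value obtained by integrating the ROC curve points.
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
auc_from_curve = auc(fpr, tpr)

print(auc_direct, auc_from_curve)
```

For the fitted search object, the scores would come from `predict_proba(X_test)[:, 1]`; an AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.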

In [None]:
print(confusion_matrix(y_test, y_preds))

In [None]:
sns.set(font_scale=1.5)

def plot_conf_matrix(y_test, y_preds):
    fig, ax = plt.subplots(figsize=(3,3))
    ax = sns.heatmap(confusion_matrix(y_test,y_preds),
                    annot=True,
                    cbar=False)

    plt.xlabel("Predicted Label")
    plt.ylabel("Actual Label")

In [None]:
plot_conf_matrix(y_test, y_preds)

In [None]:
print(classification_report(y_test, y_preds))

In [None]:
gs_log_reg.best_params_

In [None]:
model = LogisticRegression(C = 0.23357214690901212, solver = 'liblinear')

In [None]:
model.fit(X_train, y_train)

In [None]:
cv_acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
cv_precision = cross_val_score(model, X, y, cv=5, scoring="precision")
cv_recall = cross_val_score(model, X, y, cv=5, scoring="recall")
cv_f1 = cross_val_score(model, X, y, cv=5, scoring="f1")

cv_acc = np.mean(cv_acc)
cv_recall = np.mean(cv_recall)
cv_precision = np.mean(cv_precision)
cv_f1 = np.mean(cv_f1)
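
The four `cross_val_score` calls above refit the model once per metric. As an alternative sketch, `cross_validate` accepts a list of scorers and evaluates them all on the same folds in one pass. The data here is synthetic (`make_classification`), standing in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the heart disease features and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

clf = LogisticRegression(solver="liblinear")
scoring = ["accuracy", "precision", "recall", "f1"]

# One call, one set of 5 folds, all four metrics.
results = cross_validate(clf, X_demo, y_demo, cv=5, scoring=scoring)

# Each metric is stored under the key "test_<name>" as an array of fold scores.
mean_scores = {metric: results[f"test_{metric}"].mean() for metric in scoring}
print(mean_scores)
```

The resulting dictionary can be passed to `pd.DataFrame(mean_scores, index=[0])` for plotting in the same way as the individual means.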

In [None]:
# Use a separate name so the original `df` is not overwritten
cv_metrics = pd.DataFrame({"Accuracy":cv_acc,
                           "precision":cv_precision,
                           "recall":cv_recall,
                           "f1-score":cv_f1},
                          index=[0])

cv_metrics.T.plot.bar(legend=False)
plt.title("Cross Validated Scores")
plt.show()

### Feature Importance

In [None]:
gs_log_reg.best_params_

In [None]:
# Refit with the best hyperparameters found by the grid search
model = LogisticRegression(**gs_log_reg.best_params_)
model.fit(X_train, y_train)

model.coef_

In [None]:
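# Make sure `df` holds the original dataset for the crosstabs below
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
# Pair each feature name (the columns of X, which exclude the target) with its coefficient
feature_dict = dict(zip(X.columns, list(model.coef_[0])))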

In [None]:
feature_dict

In [None]:
feature_df = pd.DataFrame(feature_dict, index=[0])

In [None]:
feature_df.T.plot.bar(title="Feature Importance", legend=False)
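
One hedged way to read logistic regression coefficients is through their sign and through odds ratios. The sketch below fits a model on synthetic data with hypothetical column names `f0` to `f3`; it is an illustration of the idea, not a reanalysis of the heart disease data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
cols = ["f0", "f1", "f2", "f3"]
X_demo = pd.DataFrame(X_arr, columns=cols)

clf = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# A positive coefficient pushes predictions towards class 1, a negative one towards class 0.
coefs = pd.Series(clf.coef_[0], index=X_demo.columns)

# exp(coef) is the multiplicative change in the odds per one-unit increase in the feature.
odds_ratios = np.exp(coefs)

# Rank features by absolute coefficient size.
ranked = coefs.abs().sort_values(ascending=False)
print(pd.DataFrame({"coef": coefs, "odds_ratio": odds_ratios}).loc[ranked.index])
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so a ranking like this is best treated as a rough guide unless the inputs have been standardised.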

In [None]:

pd.crosstab(df.sex, df.target)

In [None]:
pd.crosstab(df.slope, df.target)

In [None]:
model.coef_[0]
