Predicting Heart Disease Using Machine Learning Models
Using the given data and attributes, we will implement a machine learning model.
The main objective of the model is to predict whether or not a person has heart disease. If the model predicts this well enough during the proof of concept, we will pursue the project.
We will try three different models:
- Logistic Regression
- K-Nearest Neighbours Classifier
- Random Forest Classifier
The original data came from the Cleveland data from the UCI Machine Learning Repository.
There is also a version of the data available on Kaggle.
Data attribute information :
1. age - age in years
2. sex - (1 = male; 0 = female)
3. cp - chest pain type
    0: Typical angina: chest pain related to decreased blood supply to the heart
    1: Atypical angina: chest pain not related to heart
    2: Non-anginal pain: typically esophageal spasms (non heart related)
    3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
    * anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl
    * serum = LDL + HDL + 0.2 * triglycerides
    * above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    * '>126' mg/dL signals diabetes
7. restecg - resting electrocardiographic results
    0: Nothing to note
    1: ST-T wave abnormality
        * can range from mild symptoms to severe problems
        * signals a non-normal heartbeat
    2: Possible or definite left ventricular hypertrophy
        * enlarged main pumping chamber of the heart
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
    * looks at the stress of the heart during exercise; an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    0: Upsloping: better heart rate with exercise (uncommon)
    1: Flatsloping: minimal change (typical healthy heart)
    2: Downsloping: signs of unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
    * a colored vessel means the doctor can see the blood passing through; the more blood movement the better (no clots)
13. thal - thallium stress result
    1,3: normal
    6: fixed defect: used to be a defect but OK now
    7: reversible defect: no proper blood movement when exercising
14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)
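
Because several attributes are stored as integer codes, it can help to attach readable labels before plotting. The sketch below is only an illustration: the lookup tables `CP_LABELS` and `SEX_LABELS` and the helper `add_readable_labels` are hypothetical names built from the attribute list above, and the tiny frame is made-up data standing in for the real file.

```python
import pandas as pd

# Hypothetical lookup tables copied from the attribute list above
CP_LABELS = {0: "typical angina", 1: "atypical angina",
             2: "non-anginal pain", 3: "asymptomatic"}
SEX_LABELS = {0: "female", 1: "male"}

def add_readable_labels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with extra *_label columns; the coded columns stay untouched."""
    out = frame.copy()
    out["cp_label"] = out["cp"].map(CP_LABELS)
    out["sex_label"] = out["sex"].map(SEX_LABELS)
    return out

# Tiny made-up frame standing in for the real data
toy = pd.DataFrame({"sex": [1, 0, 1], "cp": [3, 2, 0]})
labelled = add_readable_labels(toy)
print(labelled)
```

The numeric columns are kept as they are, so the models further down still receive the original codes.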

In [None]:
# 1. Regular EDA(Exploratory Data Analysis) and Plotting Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

# Make plots appear inside the notebook
%matplotlib inline 

# 2. Models from Scikit-learn 
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# 3. Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
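from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator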

# 4. Saving a Model
import pickle

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.info()

In [None]:
df.head()

#### Data Exploration (Exploratory Data Analysis(EDA)) :

In [None]:
df.describe()
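
Alongside `df.info()` and `df.describe()`, a few quick checks for missing values, duplicate rows, and class balance are a common part of EDA. The following is a minimal sketch, assuming a frame with a `target` column; `quick_checks` is a hypothetical helper and the small frame is made-up data, not the real dataset.

```python
import numpy as np
import pandas as pd

def quick_checks(frame: pd.DataFrame, target_col: str = "target") -> dict:
    """Summarise missing values, duplicate rows, and class balance."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "class_share": frame[target_col].value_counts(normalize=True).to_dict(),
    }

# Made-up rows: one missing age, and the last row repeats the first
toy = pd.DataFrame({"age": [63, 37, np.nan, 63],
                    "chol": [233, 250, 204, 233],
                    "target": [1, 1, 0, 1]})
report = quick_checks(toy)
print(report)
```

On the notebook's own data the call would be `quick_checks(df)`.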

#### Finding Patterns In Data :

In [None]:
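# Total number of samples in each class of the target column:
# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in matplotlib 3.6
whitegrid = 'seaborn-v0_8-whitegrid'
plt.style.use(whitegrid if whitegrid in plt.style.available else 'seaborn-whitegrid')  # default style for all charts
df['target'].value_counts().plot(kind = 'bar', color = ["salmon", "lightblue"], figsize = (10,6))
plt.title("Heart Disease Frequency")
plt.ylabel('Total Number')
plt.xlabel("1 = Yes, 0 = No")
plt.xticks(rotation = 0);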

In [None]:
# Heart disease frequency according to sex
pd.crosstab(df.target, df.sex).plot(kind = 'bar', color = ["salmon", 'lightblue'], figsize = (10,6))
plt.title("Heart Disease Frequency According To Sex")
plt.xlabel("0 = No, 1 = Yes")
plt.ylabel("Total Number")
plt.legend(['Female', 'Male'])
plt.xticks(rotation = 0);

In [None]:
# Age vs Max Heart Rate for Heart Disease
plt.figure(figsize=(10,6))

# Scatter plot with heart disease = 1 values:
plt.scatter(df.age[df.target == 1],
            df.thalach[df.target == 1],
            c = "salmon")
# Scatter plot with heart disease = 0 values:
plt.scatter(df.age[df.target == 0],
            df.thalach[df.target == 0],
            c = 'lightblue')

# Adding Information
plt.title("Heart Disease as a Function of Max Heart Rate and Age")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Heart Disease : Yes", "Heart Disease : No"]);

In [None]:
# Age Column Distribution 
df['age'].hist(figsize = (10,6));

In [None]:
# Heart Disease Frequency Per Chest Pain Type :
pd.crosstab(df.cp,df.target).plot(kind = 'bar', color = ("salmon", 'lightblue'), figsize = (10,6))
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Total Number")
plt.legend(["Heart Disease : No", "Heart Disease : Yes"])
plt.xticks(rotation = 0);

In [None]:
# Correlation Matrix Using Seaborn Heatmap
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize = (15,10))
ax = sns.heatmap(corr_matrix,
                 annot = True,
                 linewidths = 0.5,
                 fmt = ".2f",
                 cmap = "YlGnBu") # Yellow, Green, Blue

#### Splitting Data :

In [None]:
# Random Seed
np.random.seed(42)

# Splitting data into X and y
X = df.drop("target", axis = 1)
y = df["target"]

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2)
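
The split above relies on the global NumPy seed for reproducibility. A common alternative is to pass `random_state` directly and to stratify on the target so both sets keep similar class proportions. This is a sketch under assumptions, not the notebook's method: the data comes from `make_classification` as a stand-in for the heart-disease table, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, 13 features, slightly imbalanced classes
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     weights=[0.45, 0.55], random_state=42)

# stratify=y_demo keeps the class ratio (almost) equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          stratify=y_demo,
                                          random_state=42)

print(f"share of class 1 in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```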

#### Model Comparison :

In [None]:
# Putting Models in Dictionary :
models = {"Logistic Regression" : LogisticRegression(),
         "KNN" : KNeighborsClassifier(),
         "Random Forest Classifier" : RandomForestClassifier()}

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fit and Evaluate Machine Learning Models with our data
    """
    np.random.seed(42)
    model_scores = {}  # a dictionary to keep the model score 
    for name, model in models.items():
        model.fit(X_train, y_train) # fitting data to model
        model_scores[name] = model.score(X_test, y_test) # evaluating model
    return model_scores

In [None]:
baseline_model_score = fit_and_score(models = models,
                                    X_train = X_train,
                                    X_test = X_test,
                                    y_train = y_train,
                                    y_test = y_test)
baseline_model_score

In [None]:
# Visualization for model comparison :
model_comparison = pd.DataFrame(baseline_model_score, index = ["accuracy"])
model_comparison.T.plot(kind = 'barh'); # T = transpose
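
Logistic regression and KNN are both sensitive to feature scale, and the attributes here are measured in very different units (for example `chol` in mg/dl versus `oldpeak`). One way to test whether scaling changes the baseline comparison is to wrap each model in a pipeline with `StandardScaler`. The sketch below assumes synthetic data from `make_classification` in place of the real table; `scaled_models` and `scaled_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-disease features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, inside each pipeline
scaled_models = {
    "Logistic Regression (scaled)": make_pipeline(StandardScaler(),
                                                  LogisticRegression(max_iter=1000)),
    "KNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

scaled_scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te)
                 for name, model in scaled_models.items()}
print(scaled_scores)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.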

#### Let's look at the following :
    1. Hyperparameter Tuning
    2. Feature Importance
    3. Confusion Matrix
    4. Cross Validation
    5. Precision
    6. Recall
    7. F1 Score
    8. Classification Report
    9. Receiver Operating Characteristic Curve (ROC)
    10. Area Under the Curve (AUC)

In [None]:
# 1. Hyperparameter Tuning (KNN)
train_scores = []
test_scores = []

# List of values for n_neighbors
neighbors = range(1,21)

# KNN instance setup
knn = KNeighborsClassifier()

# Looping through the range of neighbors
for i in neighbors:
    knn.set_params(n_neighbors = i)
    knn.fit(X_train, y_train) # fitting the training data set
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

train_scores, test_scores

# Plotting On Graph
plt.figure(figsize=(10,6))
plt.plot(neighbors, train_scores, label = "Train Score")
plt.plot(neighbors, test_scores, label = "Test Score")
plt.title("KNN Score")
plt.xlabel("Number of Neighbors")
plt.ylabel("Model Score")
plt.legend();
print(f"Maximum KNN Score on the test data : {max(test_scores) * 100:.2f} %")

# We will not pursue this model further: even after tuning, its accuracy is still below Logistic Regression
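
The loop above picks the best `n_neighbors` by looking at the test score, which makes that score slightly optimistic. A hedged alternative is to choose `k` by cross-validation on the training set and touch the test set only once at the end. The sketch below assumes synthetic data from `make_classification`; `k_values`, `cv_means`, and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart-disease data
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

k_values = range(1, 21)

# Mean 5-fold accuracy on the training set for each candidate k
cv_means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
            for k in k_values]

best_k = k_values[int(np.argmax(cv_means))]

# The test set is used once, after k has been fixed
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(f"best k = {best_k}, test accuracy = {final_knn.score(X_te, y_te):.3f}")
```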

In [None]:
# Creating grids for hyperparameter tuning:

# Creating hyperparameter grid for Logistic Regression:
log_reg_grid = {'C' : np.logspace(-4,4,20),
               "solver" : ["liblinear"] }

# Creating hyperparameter grid for Random Forest Classifier:
rf_grid = {"n_estimators" : np.arange(10,1000,50),
          "max_depth" : [None, 3,  5, 10],
          "min_samples_split" : np.arange(2,20,2),
          "min_samples_leaf" : np.arange(1,22,2)}
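
To see how much of each grid a randomized search with `n_iter = 20` can cover, the number of combinations can be counted with scikit-learn's `ParameterGrid`. The grids below repeat the ones defined above; `n_rf` and `n_log_reg` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 22, 2)}

# Number of distinct parameter combinations in each grid
n_log_reg = len(ParameterGrid(log_reg_grid))   # 20 * 1
n_rf = len(ParameterGrid(rf_grid))             # 20 * 4 * 9 * 11

print(n_log_reg, n_rf)
print(f"share of the random forest grid covered by 20 draws: {20 / n_rf:.2%}")
```

With 20 iterations the logistic regression grid is covered completely, while the random forest search samples only a small fraction of its 7,920 combinations.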

In [None]:
# 1. Hyperparameter Tuning Using RandomizedSearchCV - Logistic Regression
np.random.seed(42)

rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                               param_distributions=log_reg_grid,
                               cv = 5,
                               n_iter= 20,
                               verbose= True)
rs_log_reg.fit(X_train, y_train)
rs_log_reg.score(X_test, y_test)

In [None]:
# Getting the best params
rs_log_reg.best_params_

In [None]:
# Score of our model, similar to the baseline score
rs_log_reg.score(X_test, y_test)

In [None]:
# 1. Hyperparameter Tuning Using RandomizedSearchCV - RandomForestClassifier

np.random.seed(42)

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                          param_distributions= rf_grid,
                          cv = 5,
                          n_iter=20,
                          verbose= True)
rs_rf.fit(X_train, y_train)

In [None]:
# best params for model 
rs_rf.best_params_

In [None]:
# Score 
rs_rf.score(X_test, y_test)

In [None]:
# As Logistic Regression performed best, we will move forward with it and improve it:
# 1. By using GridSearchCV
np.random.seed(42)

rs_log_grid = {"C" : np.logspace(-4,4,30),
              "solver" : ["liblinear"]}

gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid= rs_log_grid,
                          cv = 5,
                          verbose = True)

gs_log_reg.fit(X_train, y_train)

In [None]:
gs_log_reg.best_params_

In [None]:
gs_log_reg.score(X_test, y_test)
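
Besides `best_params_` and the final test score, a fitted search object exposes `cv_results_`, which shows how close the runner-up settings were. The following is a minimal sketch on synthetic data from `make_classification`; `search`, `results`, and `top5` are illustrative names, and the grid mirrors the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

search = GridSearchCV(LogisticRegression(solver="liblinear"),
                      param_grid={"C": np.logspace(-4, 4, 30)},
                      cv=5)
search.fit(X_demo, y_demo)

# One row per candidate C, with mean and spread of the fold scores
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
top5 = results.sort_values("rank_test_score").head()
print(top5)
```

If several values of `C` score within one standard deviation of each other, the exact "best" value matters less than the table suggests.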

#### Model Evaluation :
- ROC Curve and AUC score
- Confusion Matrix
- Classification Report
- Precision
- Recall
- F1 Score
- Use cross-validation wherever possible

In [None]:
# Making Predictions  ( Always Evaluate on test data sets )
y_preds = gs_log_reg.predict(X_test)

In [None]:
# ROC and AUC 

plot_roc_curve(gs_log_reg, X_test, y_test);
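
The plot shows the curve; the AUC can also be computed as a number from predicted probabilities with `roc_auc_score`. The sketch below uses synthetic data and a plain logistic regression as stand-ins, so the `_demo` and `model` names are assumptions rather than the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# ROC analysis needs scores or probabilities, not hard 0/1 predictions
proba = model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)   # points on the ROC curve
auc = roc_auc_score(y_te, proba)                # area under that curve
print(f"AUC = {auc:.3f}")
```

On the notebook's own objects the equivalent call would use `gs_log_reg.predict_proba(X_test)[:, 1]`.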

In [None]:
# Confusion Matrix 
sns.set(font_scale = 1.5)

def plot_conf_matrix(y_test, y_preds):
    """
    Plotting Confusion Matrix Using Seaborn's Heatmap
    """
    fig, ax = plt.subplots(figsize = (4,4))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                    annot= True,
                    cbar = False)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Confusion Matrix - Logistic Regression")

plot_conf_matrix(y_test, y_preds); # true labels and predicted labels
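
Precision, recall, and F1 all follow directly from the four cells of the confusion matrix. The worked example below uses ten made-up labels (not the notebook's predictions) and checks the hand-computed values against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, f1 = {f1:.3f}")
```

Here the counts are tn = 3, fp = 1, fn = 2, tp = 4, which gives a precision of 0.8, a recall of about 0.667, and an F1 of about 0.727.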

In [None]:
# Classification report on basis of one split that we have created above
log_reg_class_report = classification_report(y_test, y_preds)
print(log_reg_class_report)

In [None]:
# Classification report on basis of Cross Validation
# We will be using our best params for the same
gs_log_reg.best_params_

In [None]:
# Creating a new classifier with best params 
clf = LogisticRegression(C = 0.20433597178569418,
                        solver= 'liblinear')

In [None]:
# Cross Val Accuracy 
cv_acc = cross_val_score(clf,
                        X, y,
                        cv = 5,
                        scoring= "accuracy")
cv_acc_mean = cv_acc.mean()
cv_acc_mean

In [None]:
# Cross Val Precision
cv_prec = cross_val_score(clf, 
                          X,y,
                          cv =5,
                          scoring="precision")
cv_prec_mean = cv_prec.mean()
cv_prec_mean

In [None]:
# Cross Val Recall
cv_recall = cross_val_score(clf,
                           X, y,
                           cv =5,
                           scoring="recall")
cv_recall_mean = cv_recall.mean()
cv_recall_mean

In [None]:
# Cross Val F1 Score
cv_f1 = cross_val_score(clf,
                       X, y,
                       cv =5,
                       scoring="f1")
cv_f1_mean = cv_f1.mean()
cv_f1_mean

In [None]:
# Visualization of Cross Val Scores
cv_metrics = pd.DataFrame({"Accuracy" : cv_acc_mean,
                          "Precision" : cv_prec_mean,
                          "Recall" : cv_recall_mean,
                          "F1" : cv_f1_mean},
                         index = [0])
cv_metrics.T.plot.barh(title = "Cross Validated Classification Metrics", legend = False);
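
The four `cross_val_score` calls above each refit the model on the same folds. As an alternative sketch, `cross_validate` accepts a list of scorers and computes them all in one pass. The data here is synthetic and `clf_demo` uses `C = 0.2` purely as an illustrative value, not the tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

clf_demo = LogisticRegression(C=0.2, solver="liblinear")  # illustrative C value
metric_names = ["accuracy", "precision", "recall", "f1"]

# One 5-fold run that evaluates every metric on the same fitted models
scores = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metric_names)

# Results are stored under keys of the form "test_<metric>"
cv_means = {name: scores[f"test_{name}"].mean() for name in metric_names}
print(cv_means)
```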

#### Feature Importance :
- Which features contributed most to the model?
- How did they contribute to predicting the target?

In [None]:
# Fit an instance of Logistic Regression 
clf = LogisticRegression(C = 0.20433597178569418,
                        solver= 'liblinear')
clf.fit(X_train, y_train);

In [None]:
# Check Coef - Coefficient 
clf.coef_

In [None]:
# Matching coefficients of features to columns (X.columns excludes the target)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualization of feature importance
features_df = pd.DataFrame(feature_dict,index = [0])
features_df.T.plot.bar(title = 'Feature Importance - Logistic Regression', legend = False)
plt.xlim(-2,15)
plt.ylim(-1, 1);
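
Raw logistic regression coefficients depend on the units of each feature, so their sizes are easier to compare after standardizing the inputs. The sketch below is one possible approach, not the notebook's method: it uses synthetic data with placeholder column names (`feature_0` to `feature_12`) and ranks the coefficients by absolute value.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with placeholder column names
feature_names = [f"feature_{i}" for i in range(13)]
X_arr, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

# Standardize first so every coefficient refers to a one-standard-deviation change
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(X_demo, y_demo)

coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0],
                  index=X_demo.columns)

# Largest absolute effect first; the sign shows the direction of the effect
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

A positive coefficient pushes predictions towards class 1 and a negative one towards class 0.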