### Problem Definition

Given clinical parameters about a person, can we predict whether or not they have heart disease?

### I have also created an Inference Pipeline using Luigi and a Streamlit web app for real-time predictions.

You can try it at the links below:


https://heart-disease-diagnostics.herokuapp.com/

https://github.com/Nikhilkohli1/Heart-Disease-Diagnosis-Assistant



### What I will be doing in this notebook - 

1. EDA (Exploratory Data Analysis)
2. Data Pre-processing 
3. Predictive Modeling - I will train 4 different algorithms on 4 different feature sets after doing extensive feature selection. 
4. Model Selection 
5. Ensemble Max Vote - I will build a simple max-voting ensemble that makes predictions using the top 3 models



### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression


import warnings
warnings.filterwarnings("ignore")

### Load the UCI Heart Disease Dataset

In [None]:
df_heart = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
df_heart.head()

In [None]:
df_heart.shape

### Descriptive Statistics & Data Cleaning

In [None]:
df_heart = df_heart.rename(columns= {'cp':'chest_pain_type','trestbps':'resting_BP','chol':'serum_cholestoral','fbs':'fasting_blood_sugar','restecg':'resting_ECG',
                                     'thalach':'max_heart_rate','exang':'exercise_induced_angina',
                                     'ca':'major_vessels_count','thal':'thalium_stress'})
df_heart.columns

In [None]:
df_heart.info()


In [None]:
df_heart.isnull().sum()

No null values, lucky!

### Categorical Discrete & Continuous Variables

In [None]:
categorical_cols = []
continous_cols = []

for column in df_heart.columns:
    if(len(df_heart[column].unique()) <= 10):
        categorical_cols.append(column)
    else:
        continous_cols.append(column)

In [None]:
categorical_cols

In [None]:
continous_cols

In [None]:
df_heart_tmp = df_heart.copy()
for cols in categorical_cols:
    if(cols != 'target'):
        df_heart_tmp[cols] = df_heart_tmp[cols].astype('object')

In [None]:
df_heart_tmp.dtypes


In [None]:
df_heart_tmp.describe()


### Data Redundancy (Constant & Quasi-Constant Variables)

In [None]:
df_heart_tmp.describe(include='object')


There are no variables with only 1 unique value, and none that are quasi-constant (>99% of values identical), so there is no redundancy to remove.
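
The same check can be done programmatically instead of reading it off the `describe` output. Below is a minimal sketch, assuming the 99% threshold mentioned above; the helper name `quasi_constant_columns` and the toy DataFrame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def quasi_constant_columns(df, threshold=0.99):
    """Return columns whose most frequent value covers more than `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts in descending order, so the first entry is the top value's share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged

# Toy data: 'flag' is constant, 'sex' is evenly split
demo = pd.DataFrame({'flag': [1] * 200, 'sex': [0, 1] * 100})
print(quasi_constant_columns(demo))
```

On the real data this would be called as `quasi_constant_columns(df_heart)`, and an empty list would confirm the observation above.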

### Target Distribution

In [None]:
df_heart.target.value_counts()


In [None]:
sns.countplot(x='target', data=df_heart)  # pass the column by keyword; recent seaborn versions reject a positional Series


The data is not highly imbalanced, but we can try to balance it using SMOTE oversampling if we do not get good accuracy with it as-is.

## Exploratory Data Analysis

### Continuous Features

Let's check the distribution for continuous features

In [None]:
df_heart.hist(column=continous_cols, figsize=(12,12))


In [None]:
for index, column in enumerate(continous_cols):
    plt.figure(index)
    sns.histplot(df_heart[column], kde=True)  # distplot is deprecated in recent seaborn releases

Old Peak seems to be highly skewed, so let's look at the skew of each feature. The sign of the result shows a positive (right) or negative (left) skew; values closer to zero mean less skew.

In [None]:
df_heart[continous_cols].skew()

As evident from the skew values above, oldpeak & serum cholesterol are right-skewed. We can apply a log transformation to these variables while preprocessing the data for Machine Learning.
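
One practical detail: `oldpeak` contains zeros, so a plain `np.log` would give `-inf` for those rows. `np.log1p` (log(1 + x)) is one common way around that. Here is a small sketch on made-up values (not the real dataset) comparing the skew before and after:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values that include zeros, similar in spirit to oldpeak
demo = pd.DataFrame({'oldpeak': [0.0, 0.0, 0.2, 0.5, 1.0, 1.4, 2.0, 3.5, 6.2]})

# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
demo['oldpeak_log'] = np.log1p(demo['oldpeak'])

skew_before = demo['oldpeak'].skew()
skew_after = demo['oldpeak_log'].skew()
print(skew_before, skew_after)
```

Whether the transform actually helps the models here is something to verify by experiment; it is not applied in the rest of this notebook.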

### Distribution of Categorical Values

In [None]:
df_heart.hist(column=categorical_cols, figsize=(10,10))


### Associations & Correlation between variables

In [None]:
df_heart_tmp.describe().columns


In [None]:
sns.pairplot(df_heart_tmp[df_heart_tmp.describe().columns], hue='target')

In [None]:
df_heart.corr()


In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df_heart.corr(), annot=True, linewidths=1, linecolor='white', fmt=".2f",
                 cmap="YlGnBu")

In [None]:
df_heart.drop('target', axis=1).corrwith(df_heart.target).plot(kind='bar', grid=True, figsize=(10, 7), color='darkgreen')


Most features have a significant correlation with the target variable, except Fasting Blood Sugar, Resting ECG and Serum Cholesterol. Chest pain type and max heart rate have a high positive correlation with the target.

### Analysing Relationship between Continuous Variables & Target

In [None]:
for index,column in enumerate(continous_cols):
    plt.figure(index, figsize=(8,5))
    sns.boxplot(x=df_heart.target, y=column, data=df_heart, palette='rainbow',linewidth=1)
    plt.title('Relation of {} with target'.format(column), fontsize = 10)

There are some unusual or rare values, such as cholesterol (400-500) and resting BP (200), but these are possible values and not data collection errors. So we should not remove them even though they look like outliers.
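
If you want to list such points without removing them, one option is the 1.5 x IQR rule that boxplots use for their whiskers. This is a sketch with made-up cholesterol-like values; the helper `iqr_outlier_mask` is illustrative and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one extreme reading
chol_demo = pd.Series([180, 200, 210, 220, 230, 240, 250, 260, 270, 564])
flagged = chol_demo[iqr_outlier_mask(chol_demo)]
print(flagged.tolist())
```

Flagged rows can then be inspected by hand to decide whether they are plausible measurements, which is the judgment made above.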

### Swarmplots

Let me also look at the relationship between a few variables and the target with swarm plots.

This approach adjusts the points along the categorical axis using an algorithm that prevents them from overlapping. It can give a better representation of the distribution of observations, although it only works well for relatively small datasets.

In [None]:
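for index,column in enumerate(continous_cols):
    plt.figure(index,figsize=(7,5))
    # swarmplot is axes-level, so it draws on the figure created above
    # (catplot is figure-level and opens its own figure, leaving this one empty)
    sns.swarmplot(x='target', y=column, hue='sex', data=df_heart, palette='husl')
    plt.title('Relationship of {} with target for each sex'.format(column), fontsize = 10)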

### Observations -

***Age***
- On average, people above the age of 50 are at risk of having heart disease when combined with other factors. Age alone is not a good predictor of heart disease, as is evident from the box plot and swarm plot.

***Resting Blood pressure***
- Anything above 130-140 (in mm Hg) is a cause for concern.

***Serum Cholesterol***
- Cholesterol (LDL + HDL + Triglycerides) above 300 is definitely a concern; below that, it is a concern when combined with other factors.

***Thalach (Maximum Heart Rate)***
- There is a strong correlation between heart disease and max heart rate. People with a max heart rate above 150-160 are more likely to suffer from heart disease.

### Analysing Relationship between Categorical variables & Target

In [None]:
categorical_cols.remove('target')  # drop the label by name rather than by position


#### Number of Labels in each Categorical feature : Cardinality

In [None]:
for var in categorical_cols:
    print('Cardinality of {1} is {0}'.format(len(df_heart[var].unique()), var))

In [None]:
for index,column in enumerate(categorical_cols):
    plt.figure(index,figsize=(7,5))
    sns.countplot(x=column, hue='target', data=df_heart, palette='rainbow')
    plt.title('Relation of {} with target'.format(column), fontsize = 10)

### Observation for Categorical variables -

- Sex - Females are more likely to have a Heart Disease than Males.

- Chest Pain type - People with Chest Pain type 1,2,3 have more chance of having a Heart Disease.

- Resting ECG - People with value 1 for resting ECG(abnormal Heart beat) are more likely to have a heart disease.

- Exercise Induced Angina - People with no exercise induced angina (0) have heart disease more often than those who have angina due to exercise. This seems a little counterintuitive.

- Slope - People with a slope value of 2 are more likely to have heart disease than people with a slope value of 0 or 1.

- Major vessel Count - This has a negative relation with heart disease. The lower the number of major vessels, the higher the chance of heart disease.

- Thalium Stress ST Depression - People with value 2 or 3 are more likely to have heart disease.

### Data Preprocessing & Feature Engineering
We need to do two things as part of Data Preprocessing before we can build Machine Learning models for Classification -

- 1. One Hot Encoding - Creation of dummy variables for Categorical Variables with more than 2 classes

- 2. Feature Scaling - We will be using distance based algorithms as well like KNN, so scaling is required

Dummy Variable -
As Sex, Fasting Blood Sugar & Exercise Induced Angina contain only 2 unique values (0, 1), we do not need to create dummy variables for them. That leaves 5 variables which need to be encoded.

I will also drop the first dummy column of each encoded variable, since keeping all of them can cause the dummy variable trap.
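
To see why the first level is dropped: with all k dummy columns, every row sums to 1, so the columns are perfectly collinear with an intercept term. Here is a toy sketch (the column `cp` and its values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'cp': [0, 1, 2, 3, 1, 0]})

full = pd.get_dummies(toy, columns=['cp'])                      # k = 4 dummy columns
reduced = pd.get_dummies(toy, columns=['cp'], drop_first=True)  # k - 1 = 3 dummy columns

# Every row of the full encoding sums to 1, which is the redundancy drop_first removes
print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

With `drop_first=True`, the dropped level (here `cp_0`) becomes the implicit baseline that the remaining dummies are compared against.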

In [None]:
categorical_dummy = [
 'chest_pain_type',
 'resting_ECG',
 'slope',
 'major_vessels_count',
 'thalium_stress']

In [None]:
df_heart = pd.get_dummies(df_heart, columns=categorical_dummy, drop_first=True )
df_heart.columns

In [None]:
len(df_heart.columns)


We now have 23 columns (22 features plus the target). Let's continue with the preprocessing steps.

In [None]:
target = df_heart.target
features = df_heart.drop(columns=['target'])

### Feature Scaling
I have tried both Robust Scaling and Min-Max Scaling. Min-Max scaling works better for this problem, so I am using it.
Robust Scaler is robust to outliers.

In [None]:
from sklearn.preprocessing import RobustScaler, MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
features_SS = scaler.fit_transform(features)
features_SS = pd.DataFrame(features_SS, columns=features.columns)

### Train Test Splitting

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(features_SS, target, test_size=0.2, random_state=42)
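
A caveat on the ordering used here: the scaler is fitted on all rows before the split, so the test rows influence the min/max used for scaling. A leakage-free alternative is to split first and fit the scaler on the training part only. The sketch below uses random data and hypothetical variable names (`demo_features`, `Xtr`, `Xte`, ...) so it does not interfere with the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
demo_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
demo_target = pd.Series(rng.integers(0, 2, size=100))

# Split first, then learn the scaling parameters from the training rows only
Xtr, Xte, ytr, yte = train_test_split(demo_features, demo_target, test_size=0.2, random_state=42)
demo_scaler = MinMaxScaler().fit(Xtr)

# Apply the same fitted scaler to both parts, keeping column names and indices
Xtr_scaled = pd.DataFrame(demo_scaler.transform(Xtr), columns=Xtr.columns, index=Xtr.index)
Xte_scaled = pd.DataFrame(demo_scaler.transform(Xte), columns=Xte.columns, index=Xte.index)
print(Xtr_scaled.min().tolist(), Xtr_scaled.max().tolist())
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected. On a dataset this small the effect on accuracy is probably minor, but I have not measured it here.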

In [None]:
Y_train.value_counts()


In [None]:
X_train.head()


In [None]:
X_train.shape, X_test.shape


### Multicollinearity using VIF

In [None]:
# for each feature, calculate the VIF score
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['features'] = X_train.columns
vif.round(1)

Only the thalium stress features have a high VIF factor. We can remove one and calculate the VIF again.

In [None]:
X = X_train.drop(columns=['thalium_stress_3'])

In [None]:
# for each feature, calculate the VIF score
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
vif.round(1)

Now we have lower VIF scores for all features.
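
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the remaining features. Below is a rough sketch of that definition using scikit-learn on synthetic data. The helper `manual_vif` is illustrative only. Note that the statsmodels function does not add an intercept on its own, so its numbers can differ from this version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def manual_vif(df):
    """VIF per column: 1 / (1 - R^2) from regressing that column on all the others."""
    scores = {}
    for col in df.columns:
        others = df.drop(columns=[col])
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# x3 is almost exactly x1 + x2, so all three columns are nearly linearly dependent
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)

demo_vif = manual_vif(pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}))
print(demo_vif.round(1))
```

Dropping any one of the three synthetic columns brings the remaining VIFs back close to 1, which mirrors the effect of dropping one thalium stress dummy above.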

### Feature Selection
I will use 3 techniques to select features -

- Recursive feature Elimination with CV using Random Forest as estimator
- Recursive feature Elimination with CV using Logistic as estimator
- SelectFromModel using XGBoost

### Recursive Feature Elimination with Cross-validation

In [None]:
from sklearn.feature_selection import RFECV, SelectFromModel
from xgboost import XGBClassifier
import time

In [None]:
start = time.time()

rf = RandomForestClassifier(n_estimators=10, random_state=40)
rfe_rf = RFECV(estimator=rf, step=1, cv=5, n_jobs=-1)
rfe_rf.fit_transform(X_train, Y_train)

end = time.time()

In [None]:
print('Time Taken - {}'.format(str(end - start)))

In [None]:
rfe_rf


In [None]:
rfe_rf.support_


In [None]:
rfe_rf_ranks = rfe_rf.ranking_
rfe_rf_ranks

In [None]:
params = {'axes.labelsize': 280,'axes.titlesize':40, 'legend.fontsize': 18, 'xtick.labelsize': 40, 'ytick.labelsize': 50}
plt.figure(figsize=(50,25))
plt.rcParams.update(params)
ax = plt.bar(range(X_train.shape[1]), rfe_rf_ranks, color='green', align = 'center')
ax = plt.title('RFECV feature ranking (1 = selected)')
ax = plt.xticks(range(X_train.shape[1]), X_train.columns, rotation=90)
plt.show()

In [None]:
feature_idx = rfe_rf.support_
feature_names = X_train.columns[feature_idx]
feature_names

RFECV with Random Forest gives us 19 features which are significant for predicting heart disease. The 3 features it treats as less important are resting_ECG_2 and major_vessels_count_3 & 4. This makes sense: a vessel count of up to 2-3 is informative, while above 3 it matters little because heart disease is less likely in that case. Resting ECG, as we saw in the heatmap, had a very low correlation with the target.

I will also use SelectFromModel for selecting another set of features, and then see which one works best for this data.

In [None]:
start = time.time()

logit = LogisticRegression()
rfe_logit = RFECV(estimator=logit, step=1, cv=5, n_jobs=-1)
rfe_logit.fit_transform(X_train, Y_train)

end = time.time()

print('Time Taken - {}'.format(str(end - start)))
rfe_logit

In [None]:
rfe_logit.support_

In [None]:
rfe_logit_ranks = rfe_logit.ranking_
rfe_logit_ranks

In [None]:
params = {'axes.labelsize': 280,'axes.titlesize':40, 'legend.fontsize': 18, 'xtick.labelsize': 40, 'ytick.labelsize': 50}
plt.figure(figsize=(50,25))
plt.rcParams.update(params)
ax = plt.bar(range(X_train.shape[1]), rfe_logit_ranks, color='green', align = 'center')
ax = plt.title('RFECV feature ranking (1 = selected)')
ax = plt.xticks(range(X_train.shape[1]), X_train.columns, rotation=90)
plt.show()

In [None]:
feature_idx2 = rfe_logit.support_
feature_names2 = X_train.columns[feature_idx2]
feature_names2

#### SelectFromModel - This is a meta transformer for selecting features based on importance

In [None]:
xgb = XGBClassifier()
select_xg = SelectFromModel(estimator=xgb, threshold='median')
select_xg

In [None]:
select_xg.fit_transform(X_train, Y_train)

In [None]:
feature_idx3 = select_xg.get_support()
feature_names3 = X_train.columns[feature_idx3]
feature_names3
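
With `threshold='median'`, `SelectFromModel` keeps the features whose importance is at or above the median importance, so roughly half of them survive. Here is a sketch on synthetic data, using a random forest instead of XGBoost only to keep the example light on dependencies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, 4 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0), threshold='median')
selector.fit(X_demo, y_demo)

# Boolean mask over the input features: True means the feature is kept
mask = selector.get_support()
print(mask.sum(), 'of', X_demo.shape[1], 'features kept')
```

Other thresholds such as `'mean'` or a numeric cut-off are also accepted, and give a different trade-off between feature count and information retained.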

### Final Selected features

We have 4 sets of features now

- Default features without any feature selection
- Features selected from RFECV(Random Forest)
- Features selected from RFECV(Logistic)
- Features selected from meta transformer

Let's start with modeling now. I will train the algorithms on all of these feature sets and see which work best.

### Predictive Modeling & Hyperparameter Tuning
I will train the Machine Learning algorithms below to build models for classifying Heart Disease (binary classification) using the 4 feature sets above.

- Logistic Regression
- Support Vector Machine
- K-Nearest Neighbours
- Random Forest Classifier
- XGBoost Classifier

I will use Grid Search and CV to find the best Hyperparameters for each algorithm.

In [None]:
X_train_2 = X_train[feature_names]
X_train_3 = X_train[feature_names2]
X_train_4 = X_train[feature_names3]

X_test_2 = X_test[feature_names]
X_test_3 = X_test[feature_names2]
X_test_4 = X_test[feature_names3]

In [None]:
len(X_train.columns), len(X_train_2.columns), len(X_train_3.columns), len(X_train_4.columns)


### Baseline Classifiers

Let's quickly run some baseline Classification without any Tuning and using all the extracted features. After this I will use Grid Search and Cross-Validation to tune the Hyperparameters for all 5 algorithms

In [None]:
logit_clf = LogisticRegression()
logit_clf.fit(X_train, Y_train)

y_pred = logit_clf.predict(X_test)
print('Accuracy Score: ', str(accuracy_score(Y_test, y_pred)))
print('Classification Report: ')
print(classification_report(Y_test, y_pred))

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

y_pred = knn_clf.predict(X_test)
print('Accuracy Score: ', str(accuracy_score(Y_test, y_pred)))
print('Classification Report: ')
print(classification_report(Y_test, y_pred))

In [None]:
svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm_clf.fit(X_train, Y_train)

y_pred = svm_clf.predict(X_test)
print('Accuracy Score: ', str(accuracy_score(Y_test, y_pred)))
print('Classification Report: ')
print(classification_report(Y_test, y_pred))

In [None]:
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, Y_train)

y_pred = dt_clf.predict(X_test)
print('Accuracy Score: ', str(accuracy_score(Y_test, y_pred)))
print('Classification Report: ')
print(classification_report(Y_test, y_pred))

### Grid Search & Hyperparameter Tuning

In [None]:
def fit_model(X_train, Y_train, X_test, Y_test, classifier_name, classifier, gridSearchParam, cv, save_model=False):
    #setting the seed for reproducibility
    #np.random.seed(100)
    print('Training {} algorithm.........'.format(classifier_name))
    grid_clf = GridSearchCV(estimator=classifier,
                            param_grid=gridSearchParam, 
                            cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_res = grid_clf.fit(X_train, Y_train)
    best_params = grid_res.best_params_
    Y_pred = grid_res.predict(X_test)
    cm = confusion_matrix(Y_test, Y_pred)
    
    
    print(Y_pred)
    print("=====================================================================")
    print('Training Accuracy Score: ' + str(accuracy_score(Y_train, grid_res.predict(X_train))))
    print("---------------------------------------------------------------------")
    print('Test Accuracy Score: ' + str(accuracy_score(Y_test, Y_pred)))
    print("---------------------------------------------------------------------")
    print('Best HyperParameters: ', best_params)
    print("---------------------------------------------------------------------")
    print('Classification Report: ')
    print(classification_report(Y_test, Y_pred))
    print("---------------------------------------------------------------------")
    
    #fig, ax = plt.subplots(figsize=(7,7))
    ax= plt.subplot()
    #plt.figure(figsize=(6,6))
    sns.set(font_scale=1.0) # Adjust to fit
    label_font = {'size':'5'}
    plt.rcParams.update({'font.size': 14})
    sns.heatmap(cm, annot=True, ax = ax, fmt='g', cmap='Blues')
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels') 
    ax.set_title('Confusion Matrix') 
    ax.xaxis.set_ticklabels(['No Heart Disease', 'Heart Disease'])
    ax.yaxis.set_ticklabels(['No Heart Disease', 'Heart Disease'])
    print("=====================================================================")
    
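    if save_model:
        import pickle  # imported here so saving works even if the later import cell has not run yet
        file_name = classifier_name + '.pkl'
        # the with-block closes the file handle once the model is written
        with open(file_name, 'wb') as f:
            pickle.dump(grid_res, f)
        #joblib.dump(grid_res, file_name)
        print('Model is saved successfully!')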

### Logistic Regression

In [None]:
cv = 5 
hyper_params = {'C': [0.0001, 0.001, 0.1, 1, 10, 20],   #np.logspace(0, 4, 10),
               'penalty': ['l1','l2'],
               'solver': ['liblinear', 'saga']}

In [None]:
#Feature Set 1 
fit_model(X_train, Y_train, X_test, Y_test, 'Logistic Regression', LogisticRegression(), hyper_params, cv)

In [None]:
#Feature Set 2 
fit_model(X_train_2, Y_train, X_test_2, Y_test, 'Logistic Regression', LogisticRegression(), hyper_params, cv)

In [None]:
#Feature Set 3 
fit_model(X_train_3, Y_train, X_test_3, Y_test, 'Logistic Regression', LogisticRegression(), hyper_params, cv)

In [None]:
#Feature Set 4 
fit_model(X_train_4, Y_train, X_test_4, Y_test, 'Logistic Regression', LogisticRegression(), hyper_params, cv)

### Support Vector Machine

In [None]:
cv = 5 
hyper_params = {'C': [0.01, 0.1, 1, 10, 100, 1000],
                'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.1, 1, 3],
                'kernel': ['linear', 'rbf']}

In [None]:
#Feature Set 1 
fit_model(X_train, Y_train, X_test, Y_test, 'SVM Classifier', SVC(), hyper_params, cv)

In [None]:
#Feature Set 2
fit_model(X_train_2, Y_train, X_test_2, Y_test, 'SVM Classifier', SVC(), hyper_params, cv)

In [None]:
#Feature Set 3
fit_model(X_train_3, Y_train, X_test_3, Y_test, 'SVM Classifier', SVC(), hyper_params, cv)


In [None]:
#Feature Set 4
fit_model(X_train_4, Y_train, X_test_4, Y_test, 'SVM Classifier', SVC(), hyper_params, cv)

### K-Nearest Neighbours

In [None]:
cv = 5 
hyper_params = {'n_neighbors': list(range(1,20)),
                'leaf_size': list(range(1,15)),
                'p': [1,2]}

In [None]:
#Feature Set 1 
fit_model(X_train, Y_train, X_test, Y_test, 'KNN Classifier', KNeighborsClassifier(), hyper_params, cv)

In [None]:
#Feature Set 2
fit_model(X_train_2, Y_train, X_test_2, Y_test, 'KNN Classifier', KNeighborsClassifier(), hyper_params, cv)

In [None]:
#Feature Set 3 
fit_model(X_train_3, Y_train, X_test_3, Y_test, 'KNN Classifier', KNeighborsClassifier(), hyper_params, cv)

In [None]:
#Feature Set 4
fit_model(X_train_4, Y_train, X_test_4, Y_test, 'KNN Classifier', KNeighborsClassifier(), hyper_params, cv)

### Random Forest Classifier

In [None]:
cv = 5 
hyper_params = {'n_estimators': [10, 50, 100, 200, 500],
                'max_depth': [2, 4, 6, 10, 15, 20, 30],
                'min_samples_split': [2, 5, 10, 20],
                'min_samples_leaf': [1, 2, 5, 10]}

In [None]:
#Feature Set 1 
fit_model(X_train, Y_train, X_test, Y_test, 'Random Forest', RandomForestClassifier(), hyper_params, cv)

In [None]:
#Feature Set 2
fit_model(X_train_2, Y_train, X_test_2, Y_test, 'Random Forest', RandomForestClassifier(), hyper_params, cv)

In [None]:
#Feature Set 3 
fit_model(X_train_3, Y_train, X_test_3, Y_test, 'Random Forest', RandomForestClassifier(), hyper_params, cv)

### Best Model & features
Logistic Regression with feature set 4, SVM with feature set 4 & KNN with feature set 1 are the best models, with 90.16% test accuracy and 90% f1 score.

Based on the analysis, looking at training accuracy, test accuracy, precision & recall together, these 3 are our best estimators.

### Best features -

['sex', 'max_heart_rate', 'exercise_induced_angina', 'oldpeak', 'chest_pain_type_2',
       'chest_pain_type_3', 'slope_1', 'major_vessels_count_1',
       'major_vessels_count_2',
       'thalium_stress_2']

### Save Best Model for Inference Pipeline

In [None]:
import pickle


In [None]:
cv = 5 
hyper_params = {'C': [0.01, 0.1, 1, 10, 100, 1000],
                'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.1, 1, 3],
                'kernel': ['linear', 'rbf']}


#Saving SVM Best Model using feature Set 4
fit_model(X_train_4, Y_train, X_test_4, Y_test, 'SVMClassifier', SVC(), hyper_params, cv, save_model=True)

In [None]:
cv = 5 
hyper_params = {'n_neighbors': list(range(1,20)),
                'leaf_size': list(range(1,15)),
                'p': [1,2]}

#Saving KNN Best Model using feature Set 1
fit_model(X_train, Y_train, X_test, Y_test, 'KNN Classifier', KNeighborsClassifier(), hyper_params, cv, save_model=True)

In [None]:
cv = 5 
hyper_params = {'C': [0.0001, 0.001, 0.1, 1, 10, 20],   #np.logspace(0, 4, 10),
               'penalty': ['l1','l2'],
               'solver': ['liblinear', 'saga']}

#Saving Logistic Regression Best Model using feature Set 4
fit_model(X_train_4, Y_train, X_test_4, Y_test, 'Logistic Regression', LogisticRegression(), hyper_params, cv, save_model=True)

### Ensemble Technique for Prediction of Heart Disease
We have done extensive feature selection and run Machine Learning models on 4 different sets of features. After a lot of hyperparameter tuning and cross-validation, I got 3 good models with an f1 score of around 90% and a test accuracy of around 90.16%.

Now, I will create a max-voting ensemble of the 3 best models (KNN, Logistic Regression and SVM), saved with their best hyperparameters.

***Ensemble learning can make the predictions more generalized and reduce the bias which a single algorithm might have learnt***
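
For binary 0/1 predictions from three models, max voting reduces to checking whether at least two models predict 1. Here is a vectorised sketch with made-up predictions (the arrays below are illustrative, not outputs of the saved models):

```python
import numpy as np

# Made-up 0/1 predictions from three hypothetical models, one row per model
votes = np.array([
    [1, 0, 1, 0, 1],   # model A
    [1, 1, 0, 0, 1],   # model B
    [0, 1, 1, 0, 1],   # model C
])

# With three binary voters, the majority class is 1 when at least two vote 1
majority = (votes.sum(axis=0) >= 2).astype(int)
print(majority)
```

An odd number of voters guarantees there is never a tie, which is one reason to combine three models rather than two or four.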


In [None]:
svm = pickle.load(open('./SVMClassifier.pkl', 'rb'))
logit = pickle.load(open('./Logistic Regression.pkl', 'rb'))
knn = pickle.load(open('./KNN Classifier.pkl', 'rb'))

In [None]:
feature_set1 = ['age', 'sex', 'resting_BP', 'serum_cholestoral', 'fasting_blood_sugar',
       'max_heart_rate', 'exercise_induced_angina', 'oldpeak',
       'chest_pain_type_2', 'chest_pain_type_3', 'resting_ECG_1', 'slope_1',
       'slope_2', 'major_vessels_count_1', 'major_vessels_count_2',
       'thalium_stress_2', 'thalium_stress_3']

feature_set4 = ['sex', 'exercise_induced_angina', 'oldpeak', 'chest_pain_type_2',
       'chest_pain_type_3', 'slope_1', 'major_vessels_count_1',
       'major_vessels_count_2', 'major_vessels_count_3', 'thalium_stress_1',
       'thalium_stress_2']

In [None]:
pred_knn = knn.predict(X_test)
pred_logit = logit.predict(X_test_4)
pred_svm = svm.predict(X_test_4)

### Max Voting Ensemble learning 

In [None]:
import statistics


In [None]:
df_ensemble = pd.DataFrame()

In [None]:
df_ensemble['KNN'] = pred_knn
df_ensemble['Logistic'] = pred_logit
df_ensemble['SVM'] = pred_svm
df_ensemble.head(10)

In [None]:
def max_vote(x):
    vote = statistics.mode([int(x['KNN']), int(x['Logistic']), int(x['SVM'])])
    return vote

In [None]:
df_ensemble['Ensemble'] = df_ensemble.apply(max_vote, axis=1)
df_ensemble.head(10)

In [None]:
print("---------------------------------------------------------------------")
print('Test Accuracy Score: ' + str(accuracy_score(Y_test, df_ensemble.Ensemble.values)))
print("---------------------------------------------------------------------")
print('Classification Report: ')
print(classification_report(Y_test, df_ensemble.Ensemble.values))
print("---------------------------------------------------------------------")

cm = confusion_matrix(Y_test, df_ensemble.Ensemble.values)

ax= plt.subplot()
#plt.figure(figsize=(6,6))
sns.set(font_scale=1.0) # Adjust to fit
label_font = {'size':'5'}
plt.rcParams.update({'font.size': 14})
sns.heatmap(cm, annot=True, ax = ax, fmt='g', cmap='Blues')
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels') 
ax.set_title('Confusion Matrix') 
ax.xaxis.set_ticklabels(['No Heart Disease', 'Heart Disease'])
ax.yaxis.set_ticklabels(['No Heart Disease', 'Heart Disease'])
print("=====================================================================")

### Perfect! 


Please upvote the kernel if you like it! 