![Image](https://intermountainhealthcare.org/-/media/images/modules/blog/posts/2020/02/how-to-know-if-youre-having-a-heart-attack.jpg)

# Heart Attack Prediction with Pipelines and Column Trans

About this dataset
* age : age of the patient

* sex : sex of the patient

* exng : exercise-induced angina (1 = yes; 0 = no)

* caa : number of major vessels (0-3)

* cp : chest pain type

     Value 1: typical angina
     
     Value 2: atypical angina
     
     Value 3: non-anginal pain
     
     Value 4: asymptomatic

* trtbps : resting blood pressure (in mm Hg)

* chol : cholesterol in mg/dl fetched via BMI sensor

* fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

* restecg : resting electrocardiographic results

     Value 0: normal
     
     Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
     
     Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* thalachh : maximum heart rate achieved

* output : 0 = less chance of heart attack, 1 = more chance of heart attack


Content: 

* [Importing Analysis Libraries](#1)
* [Reading Data and Overview](#2)
* [Basic Data Visualization](#3)
* [Missing Values](#4)
* [Unique Values](#5)
* [Handle Outliers](#6)
* [Import Machine Learning Libraries](#7)
* [Train-Test Split](#8)
* [Column Trans](#9)
* [ML Models Testing with Pipeline](#10)
* [Importing Models into Pipeline](#11)
* [Setting Parameters](#12)
* [Select Best Model and Parameters](#13)
* [Final Model](#14)
* [Summary](#15)

<a id="1"></a><br>
## Importing Analysis Libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")


<a id="2"></a><br>
## Reading Data and Overview

In [None]:
df = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

<a id="3"></a><br>
## Basic Data Visualization

In [None]:
numeric_list = ["age","trtbps","chol","thalachh","oldpeak"]
categorical_list = ["sex", "cp","fbs","restecg","exng","slp","caa","thall","output"]
df_categoric= df.loc[:,categorical_list]
for i in categorical_list:
    plt.figure()
    sns.countplot(x = i, data = df_categoric, hue = "output")
    plt.title(i)

<a id="4"></a><br>
## Missing Values

As you can see, our dataset does not have any missing values. I almost cried.

In [None]:
df.isnull().sum()

<a id="5"></a><br>
## Unique Values

In [None]:
for i in list(df.columns):
    print("{} -- {}".format(i, df[i].value_counts().shape[0]))

<a id="6"></a><br>
## Handle Outliers

The following approaches can be used to deal with outliers once we've defined the boundaries for them: removing the observations, or imputation.

1. Remove the observations

We may explicitly delete outlier observation entries from our data so that they don't influence the training of our models. When dealing with a small dataset, however, eliminating the observations is not a good idea.

2. Imputation

To impute the outliers, we replace them with a substitute value, so that no rows are lost.
As the imputed value, we can choose between the mean, median, mode, and boundary values.
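To make the two options concrete, here is a minimal sketch on a made-up array (the values are hypothetical, not taken from the dataset). It assumes the same 2.5 * IQR boundaries used in this notebook and shows removal next to two imputation variants.

```python
import numpy as np

# Hypothetical toy values; 300 is the planted outlier.
values = np.array([110.0, 120.0, 125.0, 130.0, 135.0, 140.0, 300.0])

# Boundaries: 2.5 * IQR beyond the first and third quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 2.5 * iqr
upper_bound = q3 + 2.5 * iqr

is_outlier = (values < lower_bound) | (values > upper_bound)

# Option 1: remove the observations outside the boundaries.
removed = values[~is_outlier]

# Option 2a: impute with the boundary values (capping), so no row is lost.
capped = np.clip(values, lower_bound, upper_bound)

# Option 2b: impute with the median instead.
median_imputed = np.where(is_outlier, np.median(values), values)

print(removed)
print(capped)
print(median_imputed)
```

Removal shrinks the array, while both imputation variants keep its length and only change the outlying entry.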

I chose to remove the outliers here.

In [None]:
for i in numeric_list:

    # IQR of the current column
    Q1 = np.percentile(df[i], 25)
    Q3 = np.percentile(df[i], 75)
    IQR = Q3 - Q1

    print("Old shape: ", df.shape)

    # index labels of rows at or beyond 2.5 * IQR from the quartiles
    upper = df.index[df[i] >= (Q3 + 2.5 * IQR)]
    lower = df.index[df[i] <= (Q1 - 2.5 * IQR)]

    print("{} -- {}".format(list(upper), list(lower)))

    # drop by index label, so earlier drops cannot shift which rows are removed
    df.drop(upper.union(lower), inplace = True)

    print("New shape: ", df.shape)

<a id="7"></a><br>
## Import Machine Learning Libraries

In [None]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score

<a id="8"></a><br>
## Train-Test Split

In [None]:
df_copy = df.copy()
x = df_copy.drop("output", axis= 1)
y = df_copy["output"]
x_train, x_test, y_train , y_test = train_test_split(x, y, test_size=0.2, random_state = 69)

<a id="9"></a><br>
## Column Trans

make_column_transformer is a useful sklearn function that lets us apply different operations to different columns of the dataset.

For numeric features, I chose MinMaxScaler as the scaler because it preserves the shape of the original distribution and doesn't meaningfully change the information embedded in the original data. Note that MinMaxScaler doesn't reduce the influence of outliers.

For categorical features, I chose OneHotEncoder. One-hot encoding takes a column of categorical data that has been label encoded and splits it into multiple columns, one per category. The numbers are replaced by 1s and 0s, depending on which column has what value. This keeps a model from reading the category codes as an ordered quantity, which can help the accuracy of models such as logistic regression, KNN and SVC.
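As a small illustration of what the two transformers do, here is a sketch on a hypothetical three-row frame (the `toy` data and the `toy_trans` name are made up for this example). It passes `sparse_threshold=0` so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame: one numeric column and one label-encoded categorical column.
toy = pd.DataFrame({"age": [20, 40, 60], "cp": [0, 1, 0]})

toy_trans = make_column_transformer(
    (MinMaxScaler(), ["age"]),                         # scales age into [0, 1]
    (OneHotEncoder(handle_unknown="ignore"), ["cp"]),  # one output column per cp value
    sparse_threshold=0,                                # always return a dense array
)

transformed = toy_trans.fit_transform(toy)
print(transformed)
```

The first output column is the scaled age, and the remaining two are the 0/1 indicator columns for the two `cp` values.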

In [None]:
column_trans = make_column_transformer(
               
               (MinMaxScaler(), numeric_list),
               
               # on scikit-learn < 1.2 the sparse_output argument is named sparse
               (OneHotEncoder(sparse_output = False, handle_unknown="ignore"), categorical_list[:-1]),
                remainder = "passthrough")

<a id="10"></a><br>
## ML Models Testing with Pipeline

I think the make_pipeline function has a much easier syntax than the Pipeline class, because it names the steps automatically. That's why I prefer make_pipeline.
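For comparison, here is a minimal sketch of the same two-step pipeline built both ways. The step names `"scaler"` and `"clf"` are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Pipeline: every step needs an explicit (name, estimator) pair.
explicit = Pipeline(steps=[("scaler", MinMaxScaler()), ("clf", LogisticRegression())])

# make_pipeline: only the estimators are passed; names are generated from the class names.
automatic = make_pipeline(MinMaxScaler(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]

print(explicit_names)
print(automatic_names)
```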

In [None]:
logreg = LogisticRegression() 

knn = KNeighborsClassifier()

rfc = RandomForestClassifier()

svc = SVC()

models = [logreg, knn, rfc, svc]

for i in models:
    
    pipeline = make_pipeline(column_trans , i )
    
    print("{} = {}" .format(i,cross_val_score(pipeline, x_train, y_train, cv=5, scoring="accuracy").mean()))

<a id="11"></a><br>
## Importing Models into Pipeline

In [None]:
pipelines = []

for i in models:
    
    pipelines.append(make_pipeline(column_trans, i))

<a id="12"></a><br>
## Setting Parameters

When tuning a model inside a pipeline, we must prefix each parameter name with the step name defined in the pipeline, joined by a double underscore. make_pipeline names each step after its class in lowercase, so, for example, we should write "logisticregression__C" instead of "C".
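The naming rule can be checked directly with `set_params`. This is a small sketch on a throwaway pipeline, not the one used in this notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(MinMaxScaler(), LogisticRegression())

# The prefixed name reaches the LogisticRegression step inside the pipeline.
pipe.set_params(logisticregression__C=0.1)
tuned_C = pipe.named_steps["logisticregression"].C

# The bare name is not a parameter of the pipeline itself, so it is rejected.
try:
    pipe.set_params(C=0.1)
    bare_name_accepted = True
except ValueError:
    bare_name_accepted = False

print(tuned_C, bare_name_accepted)
```

The search classes resolve the keys of a parameter grid the same way, which is why the grids have to use the prefixed names.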

In [None]:

grid_logreg = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100], "logisticregression__penalty": ["l1","l2"], "logisticregression__solver": ["saga", 'liblinear']}

# n_neighbors must be at least 1, so the range starts at 1
grid_knn = {"kneighborsclassifier__n_neighbors": np.arange(1,11), "kneighborsclassifier__weights": ["uniform","distance"], "kneighborsclassifier__metric": ["euclidean", "manhattan"]}

# 'auto' was an alias of 'sqrt' for classifiers and is rejected by scikit-learn >= 1.3
grid_rfc = { 'randomforestclassifier__n_estimators': [200, 300, 400, 500], 'randomforestclassifier__max_features': ['sqrt', 'log2'], 'randomforestclassifier__max_depth' : np.arange(1,11), 'randomforestclassifier__criterion' : ['gini', 'entropy']}

grid_svc = {'svc__C': [0.1,1, 10, 100], 'svc__gamma': [1,0.1,0.01,0.001],'svc__kernel': ['rbf', 'poly', 'sigmoid']}

parameters = [grid_logreg, grid_knn, grid_rfc, grid_svc]

<a id="13"></a><br>
## Select Best Model and Parameters

There are two methods in sklearn for hyperparameter tuning: GridSearchCV and RandomizedSearchCV.

Grid search is a technique that looks for the right set of hyperparameters for a particular model by trying every combination. We simply build a model for every combination of the listed hyperparameter values and evaluate each one with cross-validation. The pattern is similar to a grid, where all the values are laid out in the form of a matrix. Once all the combinations are evaluated, the set of parameters that gives the highest accuracy is considered the best.

Random search is a technique where random combinations of the hyperparameters are tried instead of all of them. The drawback of random search is that its results have high variance: since the combinations are sampled completely at random and no intelligence is used to choose them, luck plays its part.

If the n_jobs parameter is set to -1, the search runs on all available CPU cores, which shortens the running time.
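To get a feel for the difference in cost, the sketch below counts the candidates each strategy would try for the SVC grid. It uses sklearn's `ParameterGrid` and `ParameterSampler` helpers, with `n_iter=10` matching the default of RandomizedSearchCV; the `random_state` value is arbitrary.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

svc_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
    "svc__kernel": ["rbf", "poly", "sigmoid"],
}

# Grid search evaluates every combination: 4 * 4 * 3 candidates.
all_candidates = list(ParameterGrid(svc_grid))

# Random search evaluates only n_iter randomly drawn combinations.
random_candidates = list(ParameterSampler(svc_grid, n_iter=10, random_state=0))

print(len(all_candidates), len(random_candidates))
```

With cv=5, each candidate is fitted five times, so the grid search needs far more fits than the random search for the same grid.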

In [None]:
for i, a in zip(pipelines, parameters):
    grid_search = GridSearchCV(i, a, cv= 5, n_jobs= -1, scoring = "accuracy")
    grid_search.fit(x_train, y_train)
    
    grid_random = RandomizedSearchCV(i, a, cv= 5, n_jobs= -1, scoring = "accuracy")
    grid_random.fit(x_train, y_train)
    
    print("best grid parameters for {} = {}".format(i, grid_search.best_params_))
    print("best grid score for {} = {}".format(i, grid_search.best_score_))
    print("best random parameters for {} = {}".format(i, grid_random.best_params_))
    print("best random score for {} = {} \n\n" .format(i, grid_random.best_score_))

It may seem a bit complicated, but it's not hard to find the accuracy and the best parameters. Now I will apply the best parameters I found to the final model.

<a id="14"></a><br>
## Final Model

In [None]:
model =  RandomForestClassifier( n_estimators = 300, max_features = "log2" , max_depth = 2, criterion = "gini" )
final_model = make_pipeline(column_trans , model )
final_model.fit(x_train, y_train)
pred = final_model.predict(x_test)
print("Final Model Accuracy = {}".format(accuracy_score(y_test, pred)))


<a id="15"></a><br>
## Summary

As you can see, you can do the same job much faster and with less code by using pipelines and column transformers. If you like my notebook, please don't forget to upvote. I would be happy to see your thoughts and suggestions about my notebook in the comments section. I hope everything goes the way you want. Good luck with your work!