<div style="text-align: center"><b><span style="color:#08838b; font-family:Georgia; font-size:2.5em;">Model Building and EDA  to Predict a Possible Heart Attack</span></b></div>



**Feedback and suggestions will be greatly appreciated.**

![](https://s.marketwatch.com/public/resources/images/MW-GC908_heart__ZG_20180201153921.jpg)

In [None]:
#importing libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv("../input/heart-disease-uci/heart.csv")

print("HEART DISEASE DATASET : ")
data

In [None]:
#array of all columns
print(data.columns)

### <div style="text-align: center"><b><span style="color:#08838b; font-family:Georgia; font-size:2.5em;">Description of Columns</span></b></div>



*  cp: chest pain type
* -- Value 1: typical angina
* -- Value 2: atypical angina
* -- Value 3: non-anginal pain
* -- Value 4: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg: resting electrocardiographic results
* -- Value 0: normal
* -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
* -- Value 1: upsloping
* -- Value 2: flat
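* -- Value 3: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (original UCI coding; the observations further down use the codes 1 = fixed defect, 2 = normal, 3 = reversible defect)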
* target: diagnosis of heart disease (angiographic disease status)
* -- Value 0: < 50% diameter narrowing
* -- Value 1: > 50% diameter narrowing

In [None]:
# checking total no. of rows and columns present in original dataset
data.shape

### <div style="text-align: center"><b><span style="color:#08838b; font-family:Georgia; font-size:2.5em;">Data Cleaning</span></b></div>

In [None]:
# getting information about total non-null values and datatypes
data.info()

In [None]:
# to get percentage of null values present in all columns respectively
round(data.isnull().sum()*100/len(data),2)

* **No null values present in dataset. We are good to go!**

### <div style="text-align: center"><b><span style="color:#08838b; font-family:Georgia; font-size:2.5em;">Data Analysis</span></b></div>

In [None]:
# firstly we will make a copy of our original dataset to perform further actions
data_to_use = data.copy()

## Visualising Numeric Variables
### Let's make a pairplot of all the numeric variables
* A pair plot draws every pairwise scatter plot of the numeric variables, which helps to spot the feature pairs that best explain a relationship or that form the most separated clusters of the target classes.

In [None]:
# Visualizing the continuous variables, colored by target
cont_data = data_to_use[["age","trestbps","chol","thalach","oldpeak","target"]].copy()
cont_data['target'] = cont_data['target'].replace({0: '< 50% diameter narrowing', 1: '> 50% diameter narrowing'})

sns.pairplot(cont_data ,hue = "target" )
plt.show()

### Let's make a heatmap of all the numeric variables
* A heatmap of the correlation matrix is another way to see how strongly each pair of features is related.
* It is rarely enough on its own to draw final conclusions, but the color scale makes strong and weak relationships between features easy to spot.
* Each cell is colored according to the correlation coefficient of the corresponding pair of variables.
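
Each cell of such a heatmap is a Pearson correlation coefficient. As a rough sketch of what `DataFrame.corr()` computes for a single pair of columns, the hypothetical helper `pearson_r` below works it out by hand on made-up toy values (they are not rows from heart.csv):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (assumed for illustration): heart rate falling as age rises
toy_age = [40, 50, 60, 70]
toy_thalach = [180, 170, 150, 140]
r = pearson_r(toy_age, toy_thalach)
print(round(r, 3))
```

A value close to -1 or +1 indicates a strong linear relationship, and a value near 0 indicates a weak one.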

In [None]:
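# target was mapped to text labels above, so leave it out of the correlation matrix
corr_matrix = cont_data.drop(columns="target").corr()
print(corr_matrix)
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix)
plt.show()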

## Visualising Categorical Variables
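As you might have noticed, there are a few categorical variables as well. Let's make boxplots of the numeric variables for some of them. A boxplot shows:
* Outliers, drawn as individual points beyond the whiskers.
* The “minimum” and “maximum” of each category (the ends of the whiskers), together with the median and the interquartile range (IQR).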
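
To make the outlier rule concrete, here is a minimal sketch of the usual 1.5 * IQR convention that boxplots follow by default. The helper name `boxplot_bounds` and the toy values are assumptions for illustration only:

```python
import numpy as np

def boxplot_bounds(values):
    """Return (q1, q3, lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy cholesterol-like values (made up); 400 is deliberately extreme
toy_chol = [200, 210, 220, 230, 240, 250, 400]
q1, q3, lower, upper = boxplot_bounds(toy_chol)

# Points outside the fences are the ones a boxplot draws as outliers
outliers = [v for v in toy_chol if v < lower or v > upper]
print(q1, q3, lower, upper, outliers)
```

The whiskers then extend to the most extreme data points that still lie inside the two fences.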

In [None]:
cont_data = cont_data.drop("target", axis = 1)

In [None]:
categ_data_to_use_in_boxplot = data_to_use[['sex','thal']].copy()
categ_data_list = categ_data_to_use_in_boxplot.columns.values.tolist()

cont_data_list = cont_data.columns.values.tolist()
data_modified = data.copy()
data_modified['sex'] = data_modified['sex'].replace({0: 'Female', 1: 'Male'})


plt.figure(figsize=(25,10))

# one subplot per (categorical, numeric) pair: 2 rows x 5 columns
a = 1
for cat_col in categ_data_list:
    for num_col in cont_data_list:
        plt.subplot(2,5,a)
        sns.boxplot(x=cat_col, y=num_col, data=data_modified)
        plt.xlabel(cat_col, fontsize = 15)
        plt.ylabel(num_col, fontsize = 15)
        a +=1
        
plt.subplots_adjust(hspace = 0.4)

plt.show()

#### Observations:
##### From the boxplots of sex against the numeric features:

* The maximum age is higher and the minimum age is lower for males than for females, so the male records in this dataset span a wider age range.

* The median resting blood pressure is about the same for both sexes, and both male and female resting blood pressure data contain outliers.

* **Cholesterol level** is slightly **higher in females** than in males.

* The interquartile range of maximum heart rate achieved is roughly **140 to 170 for males** and **140 to 160 for females**.

##### From the boxplots of thal (the thallium stress test result, which describes blood flow to the heart) against the numeric features:

* Thal **value 2** (**normal blood flow**) has the lowest minimum and the highest maximum age, and the largest IQR of all thal values. It is the most common thal value across all age groups.

* For thal **value 1** (**fixed defect**: no blood flow in some part of the heart), ages lie between **50 and 70 years.**

* For thal **value 3** (**reversible defect**: blood flow is observed but it is not normal), the median age is approximately **58 years**.

* In the **thal vs chol** plot, the median cholesterol level increases as the thal value increases. Because thal is a categorical code, this is a pattern in the medians rather than a correlation in the strict sense.

* In the **thal vs thalach (maximum heart rate achieved)** plot, the **median for thal value 2 is the highest**, which means that patients with normal blood flow tend to reach a higher maximum heart rate than patients with the other thal values.


In [None]:
plt.figure(figsize=(7,7))
# ten most frequent ages; percentages are shares within these ten, not of the whole dataset
age_counts = data["age"].value_counts().head(10)
textprops = {"fontsize":15}
age_counts.plot(kind='pie',autopct='%1.1f%%',textprops =textprops )

plt.title("Age Categories",fontsize = 20)
plt.show()

#### Observations:
* The most frequent age in the dataset is **58 years**, which makes up **14.1%** of the ten most common ages shown in the pie plot.
* The pie plot shows that most people in the dataset are between **50 and 60 years old.**
* **However, this is a very small dataset with a limited number of observations, so we can't rely on age as a deciding factor.**


## Let's visualize some important relations between the features and the target variable
### Count plot

In [None]:
plt.figure(figsize=(17,7))
ax = sns.countplot(x="age", hue="target", data=data_to_use)
plt.title("Angiographic Disease Status in All Age Group",fontsize = 20)
plt.legend(["< 50% diameter narrowing","> 50% diameter narrowing"],loc='upper right')
plt.xlabel("Age",fontsize = 15)
plt.show()

#### Observations:
If we draw conclusions from this plot alone, then for this dataset:
* The orange bars count people with > 50% diameter narrowing (higher risk of a heart attack) and the blue bars count people with < 50% diameter narrowing (lower risk).
* The **age group 51-60 years old** is at more risk.
* Because the dataset is small, the chart makes young people look more at risk than old people, which is not true in general.

In [None]:
x = ["Female","Male"]
values = range(len(x))

plt.figure(figsize=(17,7))
ax = sns.countplot(x="sex", hue="target", data=data_to_use)
for rect in ax.patches:
    ax.text (rect.get_x() + rect.get_width()  / 2,rect.get_height()+ 0.75,rect.get_height(),horizontalalignment='center', fontsize = 10)
plt.title("Angiographic Disease Status in Female and Male",fontsize = 20)
plt.legend(["< 50% diameter narrowing","> 50% diameter narrowing"],loc='upper right')
plt.xlabel("Sex",fontsize = 15)
plt.xticks(values,x)
plt.show()

#### Observations:
* *The difference between the male and female bars is partly due to the smaller number of female records in the dataset.*
* Among **females**, the ratio of records more susceptible to a possible heart attack to those less susceptible is **3:1**.
* Among **males**, the same ratio is almost **3:4**.
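
Ratios like these can be read off a contingency table instead of the bar labels. The sketch below uses `pd.crosstab` on a small made-up frame (the rows are assumptions chosen to reproduce 3:1 and 3:4, not the real data); in the notebook the same two lines could be applied to `data_to_use`:

```python
import pandas as pd

# Made-up rows for illustration only (sex: 0 = Female, 1 = Male)
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0],
})

# Rows are sexes, columns are target values, cells are record counts
counts = pd.crosstab(toy["sex"], toy["target"])

# Ratio of target == 1 to target == 0 within each sex
ratio = counts[1] / counts[0]
print(counts)
print(ratio)
```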

### Violin plot used for categorical data
Analyzing relationship between age and sex with target variable in single plot

In [None]:
data_modified['target'] = data_modified['target'].replace({0: '< 50% diameter narrowing', 1: '> 50% diameter narrowing'})
# catplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.catplot(x="age", y="sex",
            hue="target",
            data=data_modified,
            orient="h", height=5, aspect=1, palette="tab10",
            kind="violin", dodge=True, cut=0, bw=.2)

plt.show()

#### Observations:
* In the female plot where target is > 50% diameter narrowing (more susceptible to a heart attack), the age range is wider and evenly distributed, with flatter peaks.
* In the female plot where target is < 50% diameter narrowing (less susceptible to a heart attack), the age range is narrower, with higher peaks near 55 years and 62 years.
* There are very few outliers in the dataset.

### Probability Density Plot of Some More Features

In [None]:
plt.figure(figsize = (10,13))
a = 1
for i in ["chol","trestbps","thalach"]:

    plt.subplot(3,1,a)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(data_to_use[i], kde=True, stat="density",
                 bins=int(180/5), color = 'darkblue',
                 edgecolor='black',
                 line_kws={'linewidth': 4})
    if i=="chol":

        plt.title("Probability Density of Cholesterol Level",fontsize = 20)
        plt.xlabel("serum cholesterol in mg/dl",fontsize = 15)

    elif i == "thalach":
        plt.title("Probability Density of Maximum Heart Rate Achieved",fontsize = 20)
        plt.xlabel("Thalach",fontsize = 15)

    else:
        plt.title("Probability Density of Resting Blood Pressure ",fontsize = 20)
        plt.xlabel("Resting Blood Pressure ",fontsize = 15)
    a+=1
    
plt.subplots_adjust(hspace = 0.7)        
plt.show()

#### Observations:
* The most common cholesterol levels in the dataset are between **200 mg/dl and 240 mg/dl**, with the density peaking around **230 mg/dl**, which is considered **borderline to moderately elevated**.
* A few cholesterol records are near **400 mg/dl** and **550 mg/dl**, which is very high and risky.
* Resting blood pressure values of **120** and **125** have the highest probability density, about **0.05**.
* The probability density of **Thalach (maximum heart rate achieved)** is mostly evenly distributed between **140** and **175**.
* The highest probability density of **Thalach (maximum heart rate achieved)** is at nearly **165**.

### Line plot of cholesterol level and Thalach in males and females by age
* Serum cholesterol levels help in estimating the risk of developing heart disease.
* A heart rate above the age-predicted maximum heart rate (MHR = 220 - present age in years) is considered hazardous.
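
The MHR rule of thumb above is simple arithmetic, sketched here with two hypothetical helpers (`age_predicted_mhr` and `exceeds_mhr`) and made-up patients rather than rows from the dataset:

```python
def age_predicted_mhr(age_years):
    """Rule-of-thumb maximum heart rate: 220 minus age in years."""
    return 220 - age_years

def exceeds_mhr(age_years, thalach):
    """True when the recorded maximum heart rate is above the rule-of-thumb limit."""
    return thalach > age_predicted_mhr(age_years)

# Hypothetical patients as (age, thalach) pairs
patients = [(45, 170), (60, 165), (70, 150)]
flags = [exceeds_mhr(age, hr) for age, hr in patients]
print(flags)
```

On the notebook's frame the same check could be written as `data_to_use["thalach"] > 220 - data_to_use["age"]`, which yields one boolean per row.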

In [None]:
plt.figure(figsize = (10,7))
a = 1
y = ["chol","thalach"]
for feature in y:
    plt.subplot(2,1,a)
    sns.lineplot(data=data_to_use, x="age", y=feature,hue = "sex")
    plt.xlabel("Age",fontsize = 15)
    plt.legend(["Female", "Male"], loc ="upper right")
    
    if feature=="chol":
        plt.title("Cholesterol Level of Patients of Different Age ",fontsize = 20)
        plt.ylabel("serum cholesterol in mg/dl",fontsize = 15)
    else:
        plt.title("Maximum Heart Rate Achieved By Patients of Different Age ",fontsize = 20)
        plt.ylabel("thalach",fontsize = 15)
    a+=1
plt.subplots_adjust(hspace = 0.4)
plt.show()

#### Observations:
* Serum cholesterol level is higher in females than in males between the ages of 52 and 68 years.
* The thalach plot shows a downward trend as age increases, which indicates that the maximum heart rate achieved decreases as a person grows older.

## Data preparation

In [None]:
X = data_to_use.drop("target",axis =1)
y = data_to_use["target"]

datalist = data.columns.values.tolist()
datalist.remove("target")

<div style="text-align: center"><b><span style="color:#08838b; font-family:Georgia; font-size:2.5em;">Model Building -  Logistic Regression</span></b></div>

Before applying Logistic Regression:
* We will split the data into train and test sets.
* Then the features will be encoded and scaled.
* For continuous data, the normalization technique (MinMaxScaler) is used.
* For categorical data, the target mean encoding technique is used.
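
As a small illustration of the two transformations listed above, the sketch below applies target mean encoding and min-max scaling to made-up rows. The frames `train` and `test`, their values, and the fallback for unseen categories are assumptions for illustration, not the notebook's data:

```python
import pandas as pd

# Made-up training and test rows
train = pd.DataFrame({
    "cp":     [0, 0, 1, 1, 2],
    "chol":   [200, 250, 300, 220, 240],
    "target": [0, 1, 1, 1, 0],
})
test = pd.DataFrame({"cp": [1, 3], "chol": [260, 180]})

# Target mean encoding: replace each category by the mean target seen in training
cp_means = train.groupby("cp")["target"].mean()
fallback = train["target"].mean()  # used for categories absent from training
train_cp_enc = train["cp"].map(cp_means)
test_cp_enc = test["cp"].map(cp_means).fillna(fallback)

# Min-max scaling: (x - min) / (max - min), with min and max taken from training
chol_min, chol_max = train["chol"].min(), train["chol"].max()
train_chol_scaled = (train["chol"] - chol_min) / (chol_max - chol_min)
test_chol_scaled = (test["chol"] - chol_min) / (chol_max - chol_min)

print(test_cp_enc.tolist(), test_chol_scaled.tolist())
```

Test values can fall outside the 0 to 1 range, because the minimum and maximum come from the training rows only.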



#### Splitting the dataset into train and test dataset

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test= train_test_split(X,y, test_size= 0.25, random_state=120)

#### Scaling of Train Data 

In [None]:
# Target Mean Encoding for categorical columns in train data.
# The means are computed from the training rows only, so that
# test labels do not leak into the features.
X_train = X_train.copy()
train_rows = data_to_use.loc[X_train.index]

for i in ['age','cp','restecg','slope','ca','thal'] :
    mean_encoded_variable = train_rows.groupby([i])['target'].mean().to_dict()

    X_train[i] =  X_train[i].map(mean_encoded_variable)

# Normalization for continuous columns in train data

scaler = MinMaxScaler()
num_vars = ['trestbps', 'chol', 'thalach', 'oldpeak']

# Fit the scaler on the training data only

X_train[num_vars] = scaler.fit_transform(X_train[num_vars])

#### Training of the Model using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model= LogisticRegression()

model.fit(X_train, y_train)
trainscore =  model.score(X_train,y_train)

#### Scaling of Test Data and Applying Test Data on Model

In [None]:
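# Target Mean Encoding on categorical columns in test data,
# using means computed from the training rows only
X_test = X_test.copy()
train_rows = data_to_use.loc[X_train.index]
for i in ['age','cp','restecg','slope','ca','thal'] :
    mean_encoded_variable = train_rows.groupby([i])['target'].mean().to_dict()

    # categories not seen in training fall back to the overall training mean
    X_test[i] =  X_test[i].map(mean_encoded_variable).fillna(train_rows['target'].mean())

# Normalization for continuous columns in test data

X_test[num_vars] = scaler.transform(X_test[num_vars])

testscore =  model.score(X_test,y_test)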

#### Test Score and Train Score

In [None]:
print("test score: {} \ntrain score: {}".format(testscore*100,trainscore*100),'\n')

y_pred =  model.predict(X_test)

#### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

print("Confusion Matrix : \n",confusion_matrix(y_test, y_pred))

#### Different Classification Metrics

In [None]:
from sklearn.metrics import classification_report,accuracy_score,f1_score,precision_score,recall_score,roc_curve,roc_auc_score

print(' f1 score: ',f1_score(y_test, y_pred)*100,'\n')
print(' Accuracy: ',accuracy_score(y_test, y_pred)*100,'\n')
print(' precision score: ',precision_score(y_test, y_pred)*100,'\n')
print(' recall score: ',recall_score(y_test, y_pred)*100,'\n')
print(" Classification report: \n",classification_report(y_test, y_pred))

### Importance of Different Metrics

* Accuracy counts every correct prediction (True Positives and True Negatives) equally, while F1-score, the harmonic mean of precision and recall, is used when the False Negatives and False Positives are crucial.

* Accuracy can be used when the class distribution is similar, while F1-score is a better metric when the classes are imbalanced.

* In most real-life classification problems, imbalanced class distribution exists and thus **F1-score is a better metric** to evaluate our model on.

* In our model, the most important metric will be **F1-score**, because missing a patient who is at risk (a False Negative) is costly and F1-score drops when recall is low.
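
To show how these metrics relate to the confusion matrix, here is a minimal sketch that derives them from the four counts. The function name `classification_metrics` and the counts are hypothetical, not the notebook's actual results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts chosen for illustration
accuracy, precision, recall, f1 = classification_metrics(tp=30, fp=10, fn=5, tn=31)
print(accuracy, precision, recall, f1)
```

For a binary problem, scikit-learn's `confusion_matrix` lays the counts out as `[[tn, fp], [fn, tp]]`, so the four arguments can be read directly from its output.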

In [None]:
probabilityValues = model.predict_proba(X_test)[:,1]
# AUC is computed from the predicted probabilities, not from the hard 0/1 predictions
auc = roc_auc_score(y_test, probabilityValues)
print("AUC Score: ",auc*100)

#### ROC Curve

In [None]:
fpr,tpr, threshold =  roc_curve(y_test,probabilityValues)
plt.plot([0,1],[0,1], linestyle = '--')  # chance level
plt.plot(fpr,tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()