**Hello Visitor,**

**This is one of my attempts at making a detailed and well-thought-out kernel. I hope you gain some insights from it and find it useful! Do upvote and share it if you like it! :)**

**This kernel covers 4 topics:**

- Basic Introduction
- EDA
- Feature Engineering
- Model Building


The name of the dataset is **Heart Attack Analysis & Prediction Dataset**

The tag line is - **A dataset for heart attack classification**

# Understanding Heart Attacks

![Heart Attack](https://www.mayoclinic.org/-/media/kcms/gbs/patient-consumer/images/2013/08/26/10/08/ds00094_im00938_mcdc7_heartattackthu_jpg.jpg)

### Overview
A heart attack occurs when the flow of blood to the heart is blocked. The blockage is most often a buildup of fat, cholesterol and other substances, which form a plaque in the arteries that feed the heart (coronary arteries).

Sometimes, a plaque can rupture and form a clot that blocks blood flow. The interrupted blood flow can damage or destroy part of the heart muscle.

A heart attack, also called a myocardial infarction, can be fatal, but treatment has improved dramatically over the years.

### Symptoms
Common heart attack signs and symptoms include:

- Pressure, tightness, pain, or a squeezing or aching sensation in your chest or arms that may spread to your neck, jaw or back
- Nausea, indigestion, heartburn or abdominal pain
- Shortness of breath
- Cold sweat
- Fatigue
- Lightheadedness or sudden dizziness

### Heart attack symptoms vary
Not all people who have heart attacks have the same symptoms or have the same severity of symptoms. Some people have mild pain; others have more severe pain. Some people have no symptoms. For others, the first sign may be sudden cardiac arrest. However, the more signs and symptoms you have, the greater the chance you're having a heart attack.

Some heart attacks strike suddenly, but many people have warning signs and symptoms hours, days or weeks in advance. The earliest warning might be recurrent chest pain or pressure (angina) that's triggered by activity and relieved by rest. Angina is caused by a temporary decrease in blood flow to the heart.

### Causes
A heart attack occurs when one or more of your coronary arteries becomes blocked. Over time, a buildup of fatty deposits, including cholesterol, forms substances called plaques, which can narrow the arteries (atherosclerosis). This condition, called coronary artery disease, causes most heart attacks.

During a heart attack, a plaque can rupture and spill cholesterol and other substances into the bloodstream. A blood clot forms at the site of the rupture. If the clot is large, it can block blood flow through the coronary artery, starving the heart of oxygen and nutrients (ischemia).

You might have a complete or partial blockage of the coronary artery.

- A complete blockage means you've had an ST elevation myocardial infarction (STEMI).
- A partial blockage means you've had a non-ST elevation myocardial infarction (NSTEMI).

Diagnosis and treatment might be different depending on which type you've had.

Another cause of a heart attack is a spasm of a coronary artery that shuts down blood flow to part of the heart muscle. Using tobacco and illicit drugs, such as cocaine, can cause a life-threatening spasm.

Infection with COVID-19 also may damage your heart in ways that result in a heart attack.

# About the dataset

- **age** : Age of the patient

- **sex** : Sex of the patient

- **exng** : exercise-induced angina (1 = yes; 0 = no)

- **caa** : number of major vessels (0-3)

- **cp** : chest pain type

 - Value 1: typical angina
 - Value 2: atypical angina
 - Value 3: non-anginal pain
 - Value 4: asymptomatic
 

- **trtbps** : resting blood pressure (in mm Hg)

- **chol** : cholesterol in mg/dl fetched via BMI sensor

- **fbs** : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- **restecg** : resting electrocardiographic results

 - Value 0: normal
 - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
 - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria


- **thalachh** : maximum heart rate achieved

- **output** : 
 - 0 = less chance of heart attack,  
 - 1 = more chance of heart attack
 


**Types Of Features**

 
- Categorical Features:

A categorical variable is one that has two or more categories, and each value in that feature belongs to one of them. For example, gender is a categorical variable having two categories (male and female). We cannot sort or give any ordering to such variables. They are also known as Nominal Variables.
 
- Continuous Features:

A feature is said to be continuous if it can take any value between two points, that is, between the minimum and maximum values in the feature's column.

# Reading the dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
# df.head(1)

# Importing Libraries and packages

In [None]:
# Importing libs (duplicates removed, grouped by purpose)

# Standard library
import os
import datetime as dt
import warnings
from textwrap import wrap
warnings.filterwarnings("ignore")

# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib
import matplotlib.dates as mdates
import matplotlib.gridspec as grid_spec
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib import ticker
import seaborn as sns
import plotly as py
import plotly.graph_objs as go
import scikitplot as skplt
%matplotlib inline
py.offline.init_notebook_mode(connected = True)

# Preprocessing, resampling and data splitting
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score

# Data modeling
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Ensembling
from mlxtend.classifier import StackingCVClassifier

# Metrics
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score

background_color = '#f6f5f5'

# Understanding the data

## Basic tasks

- The shape of the data
- Preview of the first 2 rows of the data
- Checking the number of unique values in each column
- Separating the columns in categorical and continuous
- Missing values

In [None]:
print("The shape of the dataset is : ", df.shape)
print()

print("The first 2 rows of the dataset are ")
print(df.head(2))
print()

# Number of distinct values per column (avoids shadowing the built-in `dict`)
unique_counts = df.nunique()
print(unique_counts.to_frame(name="Unique Count"))

# Column groups used in the rest of the notebook
cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]
print()
print("The total number of categorical columns is : ",len(cat_cols))
print()
print("The total number of continuous columns is : ",len(con_cols))
print()
print("The missing values in each column for the dataset are ")
print(df.isnull().sum())
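
The hard-coded `cat_cols` and `con_cols` lists can also be derived from the data. The following is a minimal sketch, assuming a cut-off on the number of unique values; the helper name `split_columns` and the threshold of 5 are illustrative choices, not part of the original notebook, and the frame is made-up toy data rather than `heart.csv`.

```python
import pandas as pd

def split_columns(frame, target, max_categories=5):
    """Split feature columns into categorical and continuous by unique-value count.

    Hypothetical helper: a column with at most `max_categories` distinct values
    is treated as categorical, anything else as continuous.
    """
    features = frame.drop(columns=[target])
    n_unique = features.nunique()
    categorical = n_unique[n_unique <= max_categories].index.tolist()
    continuous = n_unique[n_unique > max_categories].index.tolist()
    return categorical, continuous

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 29],
    "sex": [1, 1, 0, 1, 0, 1, 1, 0],
    "chol": [233, 250, 204, 236, 354, 263, 199, 210],
    "output": [1, 1, 1, 1, 0, 0, 0, 1],
})
toy_cat, toy_con = split_columns(toy, target="output")
print(toy_cat, toy_con)
```

On the real data the threshold would need checking against the unique counts printed above, since a count-based rule can misclassify a numeric column that happens to have few distinct values.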

# Exploratory Data Analysis

### Univariate Analysis
Univariate analysis is the simplest form of statistical analysis: it looks at one variable at a time and does not deal with cause-and-effect relationships between variables. For example, in a survey of a classroom, an analyst might count the number of boys and girls in the room; the data describes a single variable. The main objective of univariate analysis is to describe the data and find patterns in it, typically by looking at the mean, mode, median, standard deviation, dispersion, etc.

The first thing we are going to check is the distribution of the target feature. It's important to know whether the classes are balanced. If they are not, we would probably have to handle the imbalance.
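
Alongside the count plot, the balance can be read off as numbers. This is a small sketch, assuming a binary label column like `output`; `class_balance` is a hypothetical helper name and the labels below are made up for illustration.

```python
import pandas as pd

def class_balance(labels):
    """Return the count and percentage share of each class in a label column."""
    counts = labels.value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": share})

# Made-up labels; in the notebook this would be df['output']
toy_labels = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 1, 0], name="output")
balance = class_balance(toy_labels)
print(balance)
```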

In [None]:
ax = sns.countplot(data=df, x='output',palette=['lightgrey','#eeb977'])
ax.set(xticklabels=['Less chance of Heart Attack', 'More chance of Heart Attack'], title="Target Distribution")
ax.tick_params(bottom=False)

In [None]:
data = df
continuous = data.loc[:,data.nunique()>4]
fig = plt.figure(figsize=(15, 3), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(1, 6)
gs.update(wspace=0.1, hspace=0.4)

# for plotting
df = data

run_no = 0
for row in range(0, 1):
    for col in range(0, 6):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', left=False)
        locals()["ax"+str(run_no)].get_yaxis().set_visible(False)
        for s in ["top","right","left"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for variable in continuous:
        # `fill=True` replaces the deprecated `shade=True`; x is passed by keyword
        sns.kdeplot(x=df[variable], ax=locals()["ax"+str(run_no)], color='#eeb977', ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
        locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
        locals()["ax"+str(run_no)].set_xlabel(variable)
        run_no += 1
        

Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.15), 'Numeric Variable Distribution', fontsize=20, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.05), 'Most numeric variables appear to have a positive skew', fontsize=13, fontweight='light', fontfamily='monospace')

plt.show()

In [None]:
fig = plt.figure(figsize=(15, 3), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(1, 6)
gs.update(wspace=0.1, hspace=0.4)

# for plotting
df = data
yes_c = '#eeb977'
no_c = 'lightgray'
run_no = 0
for row in range(0, 1):
    for col in range(0, 6):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', left=False)
        locals()["ax"+str(run_no)].get_yaxis().set_visible(False)
        locals()["ax"+str(run_no)].set_axisbelow(True)
        for s in ["top","right","left"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0

Yes = df[df['output'] == 1]
No = df[df['output'] == 0]

for variable in continuous:
        # `fill=True` replaces the deprecated `shade=True`; x is passed by keyword
        sns.kdeplot(x=Yes[variable], ax=locals()["ax"+str(run_no)], color=yes_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
        sns.kdeplot(x=No[variable], ax=locals()["ax"+str(run_no)], color=no_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
        locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
        locals()["ax"+str(run_no)].set_xlabel(variable)
        run_no += 1
        
Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.15), 'Numeric Variable Distribution with Condition', fontsize=20, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.05), 'There appear to be noticeable differences when patients have a heart condition.', fontsize=13, fontweight='light', fontfamily='monospace')

plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.11, hspace=0.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])


ax0.tick_params(axis='y', left=False)
ax0.get_yaxis().set_visible(False)
ax0.set_axisbelow(True)
ax1.tick_params(axis='y', left=False)
ax1.get_yaxis().set_visible(False)
ax1.set_axisbelow(True)
for s in ["top","right","left"]:
        ax0.spines[s].set_visible(False)
        ax1.spines[s].set_visible(False)

# ax0.set_facecolor(face_color)
# ax1.set_facecolor(face_color)


# `fill=True` replaces the deprecated `shade=True`; x is passed by keyword
sns.kdeplot(x=Yes['caa'], ax=ax1, color=yes_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
sns.kdeplot(x=No['caa'], ax=ax1, color=no_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
 
sns.kdeplot(x=Yes['thalachh'], ax=ax0, color=yes_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)
sns.kdeplot(x=No['thalachh'], ax=ax0, color=no_c, ec='black', fill=True, linewidth=1.5, alpha=0.9, zorder=3, legend=False)


ax0.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
ax1.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))


Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.2), 'Important Observations', fontsize=15, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.09), 'Max. HR Achieved & Num. Major Blood Vessels look to be highly indicative of heart disease.', fontsize=8, fontweight='light', fontfamily='monospace')

plt.show()

In [None]:
fig = plt.figure(figsize=(15, 3), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(2, 5)
gs.update(wspace=0.1, hspace=0.7)
categorical = data.loc[:,data.nunique()<=4]


run_no = 0
for row in range(0, 2):
    for col in range(0, 4):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', left=False)
        locals()["ax"+str(run_no)].get_yaxis().set_visible(False)
        locals()["ax"+str(run_no)].set_axisbelow(True)
        for s in ["top","right","left"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for variable in categorical:
        # Pass the column by name; a positional Series clashes with `data=` in newer seaborn
        sns.countplot(x=variable, data=df, ax=locals()["ax"+str(run_no)], palette=['lightgrey','#eeb977'], ec='black', linewidth=1.5, alpha=1)
        locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
        locals()["ax"+str(run_no)].set_xlabel(variable)
        run_no += 1


Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.6), 'Categorical Variable Distribution', fontsize=20, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.3), 'This gives an indication of what we might want to investigate.', fontsize=13, fontweight='light', fontfamily='monospace')


plt.show()

In [None]:
fig = plt.figure(figsize=(15, 3), dpi=150, facecolor=background_color)
gs = fig.add_gridspec(2, 5)
gs.update(wspace=0.1, hspace=0.7)


run_no = 0
for row in range(0, 2):
    for col in range(0, 4):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        locals()["ax"+str(run_no)].tick_params(axis='y', left=False)
        locals()["ax"+str(run_no)].get_yaxis().set_visible(False)
        locals()["ax"+str(run_no)].set_axisbelow(True)
        for s in ["top","right","left"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for variable in categorical:
        # Pass the column by name; a positional Series clashes with `data=` in newer seaborn
        sns.countplot(x=variable, data=df, ax=locals()["ax"+str(run_no)], hue='output', palette=['lightgrey','#eeb977'], ec='black', linewidth=1.5, alpha=1)
        locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
        locals()["ax"+str(run_no)].set_xlabel(variable)
        locals()["ax"+str(run_no)].get_legend().remove()
        run_no += 1


Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.85), 'Categorical Variable Distribution with Condition', fontsize=20, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.3), 'This is very informative; several variables look to be related to the presence\nof the condition.', fontsize=13, fontweight='light', fontfamily='monospace')


plt.show()
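
The grouped count plots can be backed by a table of rates per category level, which makes statements such as "level X has a higher chance" easier to check. This is a minimal sketch on made-up data; `positive_rate_by_level` is a hypothetical helper name, and the column names `cp` and `output` are simply reused from the dataset.

```python
import pandas as pd

def positive_rate_by_level(frame, column, target="output"):
    """Row count and share of target == 1 for every level of a categorical column."""
    return frame.groupby(column)[target].agg(n="size", positive_rate="mean")

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 2],
    "output": [0, 0, 1, 0, 1, 1, 1, 1, 0, 1],
})
rates = positive_rate_by_level(toy, "cp")
print(rates)
```

Reporting `n` next to the rate helps avoid over-reading levels that contain only a handful of patients.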

In [None]:
fig = plt.figure(figsize=(10, 5), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.11, hspace=0.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])


ax0.tick_params(axis='y', left=False)
ax0.get_yaxis().set_visible(False)
ax0.set_axisbelow(True)
ax1.tick_params(axis='y', left=False)
ax1.get_yaxis().set_visible(False)
ax1.set_axisbelow(True)
for s in ["top","right","left"]:
        ax0.spines[s].set_visible(False)
        ax1.spines[s].set_visible(False)

# ax0.set_facecolor(face_color)
# ax1.set_facecolor(face_color)


# Columns are passed by name; the redundant `color=` is dropped because `palette` sets the colours
sns.countplot(x='thall', hue='output', data=df, palette=[no_c,yes_c], ax=ax0, ec='black', linewidth=1.5, alpha=1)
 
sns.countplot(x='slp', hue='output', data=df, palette=[no_c,yes_c], ax=ax1, ec='black', linewidth=1.5, alpha=1)


ax0.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))
ax1.grid(which='major', axis='x', zorder=0, color='gray', linestyle=':', dashes=(1,5))


Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.text(Xstart, Yend+(Yend*0.3), 'Important Observations', fontsize=15, fontweight='bold', fontfamily='sans-serif')
ax0.text(Xstart, Yend+(Yend*0.09), 'Thalassemia & ST Slope values look to be highly indicative of heart disease, and indeed of being at lower risk\nin the case of some values.', fontsize=8, fontweight='light', fontfamily='monospace')

ax0.get_legend().remove()
ax1.get_legend().remove()

# ax0.annotate('Large differences', xy=(1.4, 84), xytext=(0.2, 84), xycoords='data', 
#             fontsize=8, ha='center', va='center',fontfamily='monospace',
#             bbox=dict(boxstyle='round', fc='firebrick'),
#             arrowprops=dict(arrowstyle='-[, widthB=4.6, lengthB=1', lw=1, color='black'), color='white')


plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.11, hspace=0.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])


# ax0.set_facecolor(face_color)
# ax1.set_facecolor(face_color)
# ax2.set_facecolor(face_color)
# ax3.set_facecolor(face_color)


cummulate_survival_ratio = []

for i in range(data['trtbps'].min(), data['trtbps'].max()):
    cummulate_survival_ratio.append(data[data['trtbps'] < i]['output'].sum() / len(data[data['trtbps'] < i]['output']))

sns.lineplot(data=cummulate_survival_ratio,color=yes_c,ax=ax0)


import matplotlib.patches as patches
    
Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()



# Create a Rectangle patch
rect = patches.Rectangle((Xstart-1, 0.5),Xend+100, Yend+10, linewidth=1,
                         edgecolor='lightgray', facecolor="#eeeeee")
  
# Add the patch to the Axes
ax0.add_patch(rect)




#ax0.text(Xstart,Yend+(Yend*0.1),'Resting Blood Pressure',fontfamily='serif',color='black',fontsize=10)

###################

cummulate_survival_ratio = []

for i in range(data['chol'].min(), data['chol'].max()):
    cummulate_survival_ratio.append(data[data['chol'] < i]['output'].sum() / len(data[data['chol'] < i]['output']))

sns.lineplot(data=cummulate_survival_ratio,color=yes_c,ax=ax1)

Xstart, Xend = ax1.get_xlim()
Ystart, Yend = ax1.get_ylim()

# Create a Rectangle patch
rect = patches.Rectangle((Xstart-1, 0.5),Xend+100, Yend, linewidth=1,
                         edgecolor='lightgray', facecolor="#eeeeee")
  
# Add the patch to the Axes
ax1.add_patch(rect)


#ax1.text(Xstart,Yend+(Yend*0.1),'Cholesterol',fontfamily='serif',color='black',fontsize=10)

###################


cummulate_survival_ratio = []

for i in range(data['thalachh'].min(), data['thalachh'].max()):
    cummulate_survival_ratio.append(data[data['thalachh'] < i]['output'].sum() / len(data[data['thalachh'] < i]['output']))

sns.lineplot(data=cummulate_survival_ratio,color=yes_c,ax=ax2)

Xstart, Xend = ax2.get_xlim()
Ystart, Yend = ax2.get_ylim()

# Create a Rectangle patch
rect = patches.Rectangle((Xstart-1, 0.5),Xend+100, Yend, linewidth=1,
                         edgecolor='lightgray', facecolor="#eeeeee")
  
# Add the patch to the Axes
ax2.add_patch(rect)

#ax2.text(Xstart,1.1,'Max. HR Acheived',fontfamily='serif',color='black',fontsize=10)


###################


cummulate_survival_ratio = []

for i in range(data['age'].min(), data['age'].max()):
    cummulate_survival_ratio.append(data[data['age'] < i]['output'].sum() / len(data[data['age'] < i]['output']))

sns.lineplot(data=cummulate_survival_ratio,color=yes_c,ax=ax3)

Xstart, Xend = ax3.get_xlim()
Ystart, Yend = ax3.get_ylim()

# Create a Rectangle patch
rect = patches.Rectangle((Xstart-1, 0.5),Xend+100, Yend+10, linewidth=1,
                         edgecolor='lightgray', facecolor="#eeeeee")
  
# Add the patch to the Axes
ax3.add_patch(rect)

#ax3.text(Xstart,Yend+(Yend*0.1),'Age',fontfamily='serif',color='black',fontsize=10)

###################


for s in ["top","right","left"]:
    ax0.spines[s].set_visible(False)
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)
    

ax0.set_yticks(np.arange(0, 1.25, 0.25))
ax1.set_yticks(np.arange(0, 1.25, 0.25))
ax2.set_yticks(np.arange(0, 1.25, 0.25))
ax3.set_yticks(np.arange(0, 1.25, 0.25))

ax0.tick_params(axis='both', which='major', labelsize=8)
# ax0.tick_params(axis='both', colors=sub_col)
ax0.tick_params(axis=u'both', which=u'both',length=0)

ax1.tick_params(axis='both', which='major', labelsize=8)
# ax1.tick_params(axis='both', colors=sub_col)
ax1.tick_params(axis=u'both', which=u'both',length=0)

ax2.tick_params(axis='both', which='major', labelsize=8)
# ax2.tick_params(axis='both', colors=sub_col)
ax2.tick_params(axis=u'both', which=u'both',length=0)

ax3.tick_params(axis='both', which='major', labelsize=8)
# ax3.tick_params(axis='both', colors=sub_col)
ax3.tick_params(axis=u'both', which=u'both',length=0)




###############
ax0.set_xlabel("Resting Blood Pressure",loc='left',fontsize=10,fontfamily='sans-serif')
ax1.set_xlabel("Cholesterol",loc='left',fontsize=10,fontfamily='sans-serif')
ax2.set_xlabel("Max. HR Achieved",loc='left',fontsize=10,fontfamily='sans-serif')
ax3.set_xlabel("Age",loc='left',fontsize=10,fontfamily='sans-serif')



#ax2.set_ylabel(" ",loc='top',fontsize=sub,color=sub_col)

#title
ax0.text(Xstart,1.4,'How does risk vary by each variable as it changes?',fontfamily='sans-serif',color='black',fontweight='bold',fontsize=15)
ax0.text(Xstart,1.25,
'''
The grey box denotes where risk is greater than 50%.'''
         
,fontfamily='monospace',fontsize=8)

ax1.set_yticklabels([])
ax3.set_yticklabels([])

plt.show()
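
The four panels above repeat one calculation: for each threshold `t`, the share of patients with a value below `t` whose `output` is 1. The sketch below writes that calculation as a single function. `cumulative_positive_rate` is a hypothetical name. Unlike the loops above, it returns NaN explicitly when no rows fall below the threshold, and it indexes the result by the threshold value itself rather than by loop position. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

def cumulative_positive_rate(frame, column, target="output"):
    """For each integer threshold t in [min, max), mean of target over rows with column < t."""
    lo, hi = int(frame[column].min()), int(frame[column].max())
    rates = {}
    for t in range(lo, hi):
        below = frame.loc[frame[column] < t, target]
        # No rows below the smallest value: report NaN instead of dividing by zero
        rates[t] = below.mean() if len(below) else np.nan
    return pd.Series(rates, name=f"rate_below_{column}")

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age":    [30, 31, 32, 33, 34],
    "output": [1, 0, 1, 1, 0],
})
curve = cumulative_positive_rate(toy, "age")
print(curve)
```

Plotting such a series would put the real variable values on the x-axis, which may make the grey "risk above 50%" region easier to interpret.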

In [None]:
fig = plt.figure(figsize=(10, 5), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(2, 3)
gs.update(wspace=0.3, hspace=0.5)
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[0, 2])



# ax0.set_facecolor(face_color)
# ax1.set_facecolor(face_color)
# ax2.set_facecolor(face_color)


sns.scatterplot(data=data,x=data['age'],y=data['trtbps'],hue=data['output'],ec='black',ax=ax0,palette=[no_c,yes_c])
sns.scatterplot(data=data,x=data['age'],y=data['thalachh'],hue=data['output'],ec='black',ax=ax1,palette=[no_c,yes_c])
sns.scatterplot(data=data,x=data['age'],y=data['chol'],hue=data['output'],ec='black',ax=ax2,palette=[no_c,yes_c])


ax0.tick_params(axis='both', which='major', labelsize=8)
# ax0.tick_params(axis='both', colors=sub_col)
ax0.tick_params(axis=u'both', which=u'both',length=0)

ax1.tick_params(axis='both', which='major', labelsize=8)
# ax1.tick_params(axis='both', colors=sub_col)
ax1.tick_params(axis=u'both', which=u'both',length=0)

ax2.tick_params(axis='both', which='major', labelsize=8)
# ax2.tick_params(axis='both', colors=sub_col)
ax2.tick_params(axis=u'both', which=u'both',length=0)





for s in ["top","right","left"]:
    ax0.spines[s].set_visible(False)
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)

    
###############
# ax0.set_xlabel("Age",loc='left',fontsize=10,color=sub_col)
# ax1.set_xlabel("",loc='left',fontsize=10,color=sub_col)
# ax2.set_xlabel("",loc='left',fontsize=10,color=sub_col)

# ax0.set_ylabel("Rest. BP.",loc='top',fontsize=10,color=sub_col)
# ax1.set_ylabel("Max. HR.",loc='top',fontsize=10,color=sub_col)
# ax2.set_ylabel("Chol.",loc='top',fontsize=10,color=sub_col)


ax0.text(20,254,'How do the variables interact?',fontsize=15,fontfamily='sans-serif',fontweight='bold')
ax0.text(20,237.5,'The strongest relationship appears to be between Age & Max HR.',fontsize=10,fontfamily='monospace')

ax0.get_legend().remove()
ax1.get_legend().remove()
ax2.get_legend().remove()

ax0.text(20,220,'Heart Disease',fontsize=8,fontfamily='sans-serif',color=yes_c)
ax0.text(40,220,'|',fontsize=8,fontfamily='serif')
ax0.text(41.5,220,'No Heart Disease',fontsize=8,fontfamily='sans-serif',color=no_c)

plt.show()


In [None]:
def age_band(num):
    """Map an age to its decade band, e.g. 63 -> '60 ~ 70'."""
    for i in range(1, 100):
        if num < 10*i:
            return f'{(i-1) * 10} ~ {i*10}'

data['Age band'] = data['age'].apply(age_band)
hr_age = data[['Age band', 'output','sex']].groupby('Age band')['output'].value_counts().sort_index().unstack().fillna(0)
hr_age['Condition rate'] = hr_age[1] / (hr_age[0] + hr_age[1]) * 100
# Use a separate name so the age_band() function above is not overwritten
age_band_counts = data['Age band'].value_counts().sort_index()
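
An alternative way to build decade bands is `pd.cut`. This is a sketch only, not the notebook's method: the bin edges (0 to 100) and label format are assumptions chosen to mirror the `'20 ~ 30'` style used above, and the ages are made up.

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([29, 35, 40, 49, 50, 61, 77], name="age")

# Left-closed decade bins [0, 10), [10, 20), ... to match the "num < 10*i" rule
edges = list(range(0, 101, 10))
labels = [f"{lo} ~ {lo + 10}" for lo in edges[:-1]]
bands = pd.cut(ages, bins=edges, right=False, labels=labels)
print(bands.tolist())
```

With these edges, ages of 100 or more would fall outside every bin and become NaN, so the edges would need extending for such data.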

In [None]:
age_sex_surv = data.groupby(['sex','Age band'])['output'].mean().unstack().T
fem_mean = age_sex_surv[0].mean()
male_mean = age_sex_surv[1].mean()

fig = plt.figure(figsize=(5, 4), dpi=150,facecolor=background_color)
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.2, hspace=0.8)
ax0 = fig.add_subplot(gs[0, 0])

# ax0.set_facecolor(face_color)

for s in ["right", "top","bottom","left"]:
    ax0.spines[s].set_visible(False)

my_range=range(1,len(age_sex_surv.index)+1)
 
ax0.hlines(y=my_range, xmin=age_sex_surv[1], xmax=age_sex_surv[0], color='gray', alpha=0.4)
# x and y are passed by keyword; positional vectors are not accepted by newer seaborn
sns.scatterplot(x=age_sex_surv[1], y=list(my_range), color=yes_c, ec='black', alpha=1, s=100, label='male', ax=ax0)

sns.scatterplot(x=age_sex_surv[0], y=list(my_range), color=no_c, ec='black', alpha=1, s=100, label='female', ax=ax0)
ax0.get_legend().remove()

Xstart, Xend = ax0.get_xlim()
Ystart, Yend = ax0.get_ylim()
ax0.set_xticks(np.arange(0, 1, 0.1))
# Label each row with its actual age band instead of a hard-coded list
ax0.set_yticks(list(my_range))
ax0.set_yticklabels(age_sex_surv.index)


ax0.tick_params(axis='x', which='major', labelsize=8)
# ax0.tick_params(axis='both', colors=sub_col)
ax0.tick_params(axis=u'both', which=u'both',length=0)

# ax0.set_xlabel("Risk of having heart disease",loc='left',fontsize=8,color=sub_col)


ax0.text(-0.04,7.6,'Condition rates by age & sex',fontsize=15,fontweight='bold',color='black',fontfamily='sans-serif')
ax0.text(-0.04,6.7,'The sex coding is not specified in the data, but sex looks to be \nan important factor, with one group being at higher risk in all \ncategories and having a higher mean risk.',fontsize=10,fontfamily='monospace')

#ax0.text(0,7,'Male',fontsize=8,fontweight='bold',color=yes_c,fontfamily='serif')
#ax0.text(0+0.037,7,'|',fontsize=8,fontweight='bold',color='black',fontfamily='serif')
#ax0.text(0+0.0436,7,'Female',fontsize=8,fontweight='bold',color=no_c,fontfamily='serif')

ax0.axvline(male_mean ,color=yes_c, linewidth=0.4, linestyle='dashdot')
ax0.axvline(fem_mean ,color=no_c, linewidth=0.4, linestyle='dashdot')


# Show the graph
plt.show()

### Observations : 
- There are a total of 303 people.
- 165 have a higher chance of suffering a heart attack, which is **54%** of the sample.
- 138 have a comparatively low chance of suffering a heart attack, which is **46%** of the sample.

- On average, approx. 3% of people in the USA are affected by heart disease, whereas here the share is above 50%. That raises questions as to what population of people we are looking at.

- Most numeric variables appear to have a positive skew.
- Thalassemia & ST Slope values look to be highly indicative of heart disease, and indeed of being at lower risk in the case of some values.
- Max. HR Achieved & Num. Major Blood Vessels look to be highly indicative of heart disease.
- As resting blood pressure increases, so too does risk of heart disease.
- Rising cholesterol does not appear to be a major indicator.
- A low Max HR achieved is a big warning sign.
- Risk of heart disease increases with age.
- The sex coding is not specified in the data, but sex looks to be an important factor, with one group being at higher risk in all categories and having a higher mean risk.
- People with Non-Anginal chest pain, that is with cp = 2, have higher chances of heart attack.
- People with 0 major vessels, that is with caa = 0, have a high chance of heart attack.
- People with sex = 1 have a higher chance of heart attack.
- People with thall = 2 have a much higher chance of heart attack.
- People with no exercise-induced angina, that is with exng = 0, have a higher chance of heart attack.


- The mean age is lower for the group with a higher chance of heart attack.

- Some features, like resting heart rate, appear unrelated to the chance of heart attack.

- Maximum heart rate is positively associated with the chance of heart attack.

- Oldpeak is negatively correlated with the output.

- For certain categories, the chance of heart attack was found to be high:

 - Age = 0
 - cp = 2,3
 - thall = 2
 - caa = 0,4
 - slp = 2
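
Statements such as "Oldpeak is negatively correlated with the output" can be checked numerically. The following is a minimal sketch, assuming a binary 0/1 target; the helper name `correlation_with_target` is hypothetical and the data is made up. For a binary target, the Pearson coefficient computed here is the point-biserial correlation.

```python
import pandas as pd

def correlation_with_target(frame, target="output"):
    """Pearson correlation of every numeric column with the target, sorted ascending."""
    numeric = frame.select_dtypes("number")
    return numeric.corr()[target].drop(target).sort_values()

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "oldpeak":  [0.0, 0.2, 0.5, 2.0, 2.5, 3.0],
    "thalachh": [180, 170, 165, 120, 115, 110],
    "output":   [1, 1, 1, 0, 0, 0],
})
corr = correlation_with_target(toy)
print(corr)
```

Correlation only captures linear association, so it complements the plots above rather than replacing them.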

# If you have any suggestions for the notebook, please comment; that would be helpful. Also, please upvote if you liked it! Thank you


Some of my other works:

https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model
https://www.kaggle.com/udbhavpangotra/heart-attacks-extensive-eda-and-visualizations
https://www.kaggle.com/udbhavpangotra/what-do-people-use-youtube-for-in-great-britain


Kernels I took help from:

https://www.kaggle.com/namanmanchanda/heart-attack-eda-prediction-90-accuracy
https://www.kaggle.com/kaamraankhan/heart-attack-analysis-and-prediction
https://www.kaggle.com/aishwaryajmp/heart-attack-prediction-analysis#Splitting-arrays-into-Training-and-Testing-Arrays
https://www.kaggle.com/joshuaswords/awesome-eda-predicting-heart-disease#Categorical-variables