# Welcome
# Heart Attack Analysis and Prediction Dataset

This dataset contains information about people and their chances of having a heart attack.


**Dataset Information:**





* age : age of the patient
* sex : sex of the patient
* exng : exercise-induced angina (1 = yes; 0 = no)
* caa : number of major vessels (0-3)
* cp : chest pain type
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl fetched via BMI sensor
* fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalachh : maximum heart rate achieved
* output :
    * 0 = less chance of heart attack
    * 1 = more chance of heart attack
    
    

**Objective:**




* Using the heart dataset provided, we analyse how the various features relate to the possibility of a heart attack, and then predict whether an individual is prone to a heart attack or not.
* The detailed analysis starts with exploratory data analysis (EDA).
* The classification for prediction can be done with various machine learning algorithms. We choose the best-suited model for heart attack analysis and finally save the model in a pickle (.pkl) file.
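
As a minimal sketch of the last objective above (saving the chosen model to a .pkl file), the example below fits a placeholder model on made-up data and saves it with joblib. The data, the variable names and the file name `heart_model.pkl` are assumptions for illustration only, not part of this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data (made up for illustration; not the heart dataset).
rng = np.random.default_rng(0)
X_demo = rng.random((40, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Fit any estimator; LogisticRegression is only a placeholder choice.
model = LogisticRegression().fit(X_demo, y_demo)

# Save to a .pkl file (hypothetical file name) and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "heart_model.pkl")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Saved to:", model_path)
print("Predictions match after reload:", same_predictions)
```

In the notebook itself, the same two calls (`joblib.dump` and `joblib.load`) would be applied to whichever fitted model is selected at the end.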


**Questions to be answered:**





* Does the age of a person contribute towards heart attack?
* Are the different types of chest pain related to each other, or to the possibility of getting a heart attack?
* Does high blood pressure increase the risk of heart attack?
* Does the cholesterol level contribute as a risk factor towards heart attack?
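
One possible way to approach the age question above is to group people into age bands and compare the share of output = 1 in each band. The sketch below uses a small made-up table; the band edges and labels are assumptions chosen for illustration, and the numbers are not results from the heart dataset.

```python
import pandas as pd

# Made-up toy records for illustration only.
toy = pd.DataFrame({
    "age":    [31, 38, 44, 47, 52, 55, 58, 63, 67, 72],
    "output": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Bin ages into bands; the edges 20/40/60/80 are an arbitrary choice.
toy["age_band"] = pd.cut(
    toy["age"], bins=[20, 40, 60, 80], labels=["21-40", "41-60", "61-80"]
)

# Share of rows with output == 1 in each band.
band_rate = toy.groupby("age_band", observed=True)["output"].mean()
print(band_rate)
```

The same pattern could be repeated with trtbps or chol in place of age to look at the blood pressure and cholesterol questions.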



In [None]:
# import packages
import os
import joblib
import numpy as np
import pandas as pd
import warnings

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:f}'.format)
warnings.filterwarnings('ignore')  # suppress library warnings in the output



In [None]:
# read heart.csv into a DataFrame
df = pd.read_csv("../input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
# looking at the first 5 rows of our data
df.head()

**Observation:**
 
 * We can see that all the columns are already of int or float data type.
 * Here, output is the outcome feature to predict.

In [None]:
df.tail()

In [None]:
print('Number of rows are :-',df.shape[0], ',and number of columns are :-',df.shape[1])

In [None]:
df.info()

**Observation:**

* We can see that there are no missing values in the entire dataset, so we do not need to fill or drop any value.
* All the columns except oldpeak (float) are of int data type.

In [None]:
df.isnull().sum()

**Observation:** There are no missing values.

Now we list all the feature names for further use.

In [None]:
df.columns

In [None]:
#counting duplicate 
df.duplicated().sum()

There is 1 duplicate row. Let's drop it!

In [None]:
df.drop_duplicates(inplace=True)
print('Number of rows are :',df.shape[0], ',and number of columns are :',df.shape[1])

In [None]:
df.describe().T

**Observation:**

* The median resting blood pressure (the 50% row of the table) is 130 mm Hg, whereas the maximum value goes up to 200.
* The median maximum heart rate of the group is about 152, and three quarters of the values fall between about 133 and 202.
* Age of the group varies from 29 to 77, and the median age is 55.5.

In [None]:
# print the unique values of selected columns
list_col = ['sex', 'chol', 'trtbps', 'cp', 'thall', 'exng']

for col in list_col:
    print('{}: {}'.format(col.upper(), df[col].unique()))

**Observation:**

* There are two sex categories: 0 and 1.
* The highest cholesterol level is 564 and the lowest is 126.
* Resting blood pressure of individuals varies between 94 and 200.
* There are 4 types of chest pain.
* Exercise-induced angina has 2 values (1 = yes; 0 = no).

# EDA

In [None]:
print(f'Number of people having sex as 0 are {df.sex.value_counts()[0]} and Number of people having sex as 1 are {df.sex.value_counts()[1]}')
plt.figure(figsize=(12,6))
ax=plt.axes()
ax.set_facecolor("green")
p = sns.countplot(data=df, x="sex", palette='pastel')


**Observation:** 

* There are 96 people in sex category 0 and 206 in sex category 1.
* Category 1 has more than double the number of people in category 0.

In [None]:
# count of records per chest pain type
sns.countplot(x='cp', data=df, palette='pastel')

**Observation:**

* cp : chest pain type

    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
    
* Chest pain category '0' has the highest count, whereas chest pain category '3' has the lowest.

In [None]:
sns.countplot(x='fbs', data=df, palette='pastel')

**Observation:** The number of people in fbs category 1 is less than 25% of the number in fbs category 0.
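
A ratio like the one above can be read off the plot, but it can also be computed directly. The sketch below shows one way to do that with `value_counts`; the toy column is made up for illustration and does not contain the real fbs counts.

```python
import pandas as pd

# Made-up toy column for illustration only (not the real fbs counts).
toy = pd.DataFrame({"fbs": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

counts = toy["fbs"].value_counts()                # absolute counts per category
shares = toy["fbs"].value_counts(normalize=True)  # share of all rows per category
ratio = counts.loc[1] / counts.loc[0]             # category 1 relative to category 0

print(counts.to_dict())
print(shares.to_dict())
print("fbs=1 as a fraction of fbs=0:", ratio)
```

Note the difference between the two figures: `shares` is relative to all rows, while `ratio` compares category 1 against category 0, which is what the observation describes.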

In [None]:
sns.countplot(x='thall', data=df, palette='pastel')

**Observation:** The thall count is highest for type 2 (165) and lowest for type 0 (2).


In [None]:
sns.countplot(x='restecg', data=df, palette='pastel')

**Observation:** 

* The ECG count is almost the same for types 0 and 1.
* It is almost negligible for type 2 in comparison to types 0 and 1.

In [None]:
plt.figure(figsize = (10,10))
sns.violinplot(x='caa',y='age',data=df)
sns.swarmplot(x=df['caa'],y=df['age'],hue=df['output'], palette='pastel')

**Observation:**

* This swarmplot gives us a lot of information.
* According to the figure, people belonging to caa category '0', irrespective of their age, are highly prone to getting a heart attack.
* There are very few people in caa category '4', but it seems that around 75% of them are in the heart attack group.
* People belonging to categories '1', '2' and '3' are more or less at similar risk.
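
Percentages like the ones read from the swarmplot above can be checked numerically with a groupby. The sketch below is a minimal example on a made-up table, so the printed rates are not results from the heart dataset; only the column names caa and output are taken from this notebook.

```python
import pandas as pd

# Made-up toy records for illustration only.
toy = pd.DataFrame({
    "caa":    [0, 0, 0, 0, 1, 1, 2, 2, 4, 4, 4, 4],
    "output": [1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# For each caa category: share of output == 1 and number of rows.
summary = toy.groupby("caa")["output"].agg(rate="mean", n="count")
print(summary)
```

Showing the row count next to the rate helps when a category has very few people, because a rate based on a handful of rows is not very reliable.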

### Unique values
*Counting the number of unique values of each feature, both as a count and as a share of the total number of observations.*

In [None]:
integer_features = ['age','chol','trtbps','cp','thall','exng']
unique_values_train = pd.DataFrame(df[integer_features].nunique())
unique_values_train = unique_values_train.reset_index(drop=False)
unique_values_train.columns = ['Features', 'Count']

unique_values_percent_train = pd.DataFrame(df[integer_features].nunique()/df.shape[0])
unique_values_percent_train = unique_values_percent_train.reset_index(drop=False)
unique_values_percent_train.columns = ['Features', 'Count']


In [None]:
plt.rcParams['figure.dpi'] = 400
fig = plt.figure(figsize=(6, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.4, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*6)

ax0 = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=unique_values_train['Features'], x=unique_values_train['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.set_xlabel("Unique Values",fontsize=4, weight='bold')
ax0_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax0_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -1.5, 'Unique Values - Train Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax0.text(0, -1, 'can be considered as classification features', fontsize=4, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 2  # small offset so the label sits just right of the bar
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
    
ax1 = fig.add_subplot(gs[0, 1])
for s in ["right", "top"]:
    ax1.spines[s].set_visible(False)
ax1.set_facecolor(background_color)
ax1_sns = sns.barplot(ax=ax1, y=unique_values_percent_train['Features'], x=unique_values_percent_train['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax1_sns.set_xlabel("Percentage Unique Values",fontsize=4, weight='bold')
ax1_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax1_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax1_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1.text(0, -1.5, 'Percentage Unique Values - Train Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax1.text(0, -1, 'can be considered as classification features', fontsize=4, ha='left', va='top')
# data label
for p in ax1.patches:
    value = f'{p.get_width():.2f}'
    x = p.get_x() + p.get_width() + 0.03
    y = p.get_y() + p.get_height() / 2 
    ax1.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

background_color = "#f6f5f5"
sns.set_palette(['#ff355d']*6)
    



### Observations:

It seems that features whose number of unique values is small compared with the total number of observations (see the percentage plot) can be treated as categorical features.
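
The idea above can be sketched in a few lines: compute the share of unique values per column and flag the columns below a cut-off. The toy table and the threshold of 0.5 are assumptions for illustration; a suitable cut-off for the real data would need to be chosen by looking at the plot.

```python
import pandas as pd

# Made-up toy table for illustration only.
toy = pd.DataFrame({
    "age":  [29, 35, 41, 47, 52, 58, 63, 70, 74, 77],
    "cp":   [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
    "exng": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Share of unique values relative to the number of rows.
unique_ratio = toy.nunique() / len(toy)

# Hypothetical cut-off: columns below it are treated as categorical.
threshold = 0.5
categorical_like = unique_ratio[unique_ratio < threshold].index.tolist()

print(unique_ratio)
print("Treat as categorical:", categorical_like)
```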


### Distribution
Showing the distribution of the continuous features in the dataset.

In [None]:
plt.title('Checking Outliers with histplot()')
# histplot replaces the deprecated distplot; stat='density' keeps a density y-axis
sns.histplot(df.trtbps, label='trtbps', kde=True, bins=10, color='green', stat='density')
plt.legend()

In [None]:
plt.title('Checking Outliers with histplot()')
# histplot replaces the deprecated distplot
sns.histplot(df.chol, label='chol', kde=True, color='red', stat='density')
plt.legend()

In [None]:
plt.title('Checking Outliers with histplot()')
# histplot replaces the deprecated distplot
sns.histplot(df['thalachh'], label='thalachh', kde=True, stat='density')
plt.legend()

**Observations:**

* trtbps and chol look roughly normally distributed, with a few high outliers that skew them towards the right.
* In the case of thalachh, the data is skewed towards the left, with a tail of low values.
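
Skew and outliers can also be quantified instead of judged by eye. The sketch below computes the sample skewness and applies the common 1.5 × IQR rule to a made-up series; the values are invented for illustration and the IQR rule is one conventional choice, not something this notebook prescribes.

```python
import pandas as pd

# Made-up values for illustration only (one deliberately high reading).
values = pd.Series([110, 115, 120, 120, 125, 130, 130, 135, 140, 200], name="trtbps")

# Positive skewness means a longer tail on the right, negative on the left.
skewness = values.skew()

# 1.5 * IQR rule: flag points far outside the middle 50% of the data.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("skewness:", round(skewness, 2))
print("bounds:", lower, upper)
print("outliers:", outliers.tolist())
```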

In [None]:
sns.pairplot(df,kind="kde",hue="output")

**From the pair plot we can see the data distribution and identify outliers.**

In [None]:
# splitting data into features X and target y

X=df.drop(["output"],axis=1)
y=df["output"]

**Using MinMaxScaler to bring all features onto the same scale**

**We scale all the data between 0 and 1**
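
A common variation on this step is to split the data first and fit the scaler on the training rows only, so that no information from the test rows influences the scaling. The sketch below shows that ordering on made-up data; the arrays and variable names are assumptions for illustration and this is not the approach used in the cells of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up toy data for illustration only.
rng = np.random.default_rng(30)
X_demo = rng.normal(loc=130, scale=15, size=(50, 2))
y_demo = rng.integers(0, 2, size=50)

# Split first, then scale.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30
)

scaler = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max on test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside the 0 to 1 range, which is expected.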

In [None]:

from sklearn.preprocessing import MinMaxScaler
scalerX = MinMaxScaler(feature_range=(0, 1))
X[X.columns] = scalerX.fit_transform(X[X.columns])



# Model building

In [None]:
#for model building
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 30)


In [None]:
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier()
ada.fit(X_train,y_train)
ada_pre=ada.predict(X_test)
acc_ada = accuracy_score(y_test,ada_pre)
acc_ada



In [None]:
key = ['LogisticRegression','KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','XGBClassifier']
value = [LogisticRegression(random_state=9), KNeighborsClassifier(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), xgb.XGBClassifier()]
models = dict(zip(key,value))

In [None]:
predicted =[]

In [None]:
for name,algo in models.items():
    model=algo
    model.fit(X_train,y_train)
    predict = model.predict(X_test)
    acc = accuracy_score(y_test, predict)
    predicted.append(acc)
    print(name,acc)

In [None]:
# confusion matrix for the KNN classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predict = knn.predict(X_test)
cf_matrix = confusion_matrix(y_test, knn_predict)
plt.figure(figsize=(7, 6))
sns.heatmap(cf_matrix, annot=True, fmt='d')

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x = predicted, y = key, palette='pastel')

**Observation:**  
From the above figure we can see that the **KNeighborsClassifier** model gives an accuracy greater than 90% on this test split.
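
Accuracy from a single train/test split of a small dataset can change noticeably with the random seed. One way to get a steadier comparison is k-fold cross-validation, sketched below on made-up data with two of the models used here. The data, the 5 folds and the pipeline with MinMaxScaler are assumptions for illustration, and the printed scores say nothing about the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up toy data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((120, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0.75).astype(int)

candidates = {
    "LogisticRegression": LogisticRegression(random_state=9),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

cv_scores = {}
for name, estimator in candidates.items():
    # Scaling inside the pipeline is refit on each training fold.
    pipe = make_pipeline(MinMaxScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the mean together with the standard deviation gives a better idea of whether one model is really ahead of another.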

# Conclusion: 


* Numeric Variables - No outliers were found!

* High blood pressure, high cholesterol and high heart rate lead to a higher chance of heart attack.

* The target counts show that the dataset has more records in the "more chance of heart attack" class.

* People aged 40-60 years have the highest chance of heart attack.

* Males have a higher chance of heart attack than females.

* Highly correlated factors in this dataset are:
    * Age and trtbps (blood pressure rate)
    * Age and chol (cholesterol level)
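
Statements about correlated factors like the ones above can be checked with a correlation matrix. The sketch below shows the calculation on a made-up table; the numbers are invented for illustration, so the printed correlations are not results for the heart dataset.

```python
import pandas as pd

# Made-up toy table for illustration only.
toy = pd.DataFrame({
    "age":    [35, 42, 48, 53, 57, 61, 66, 70],
    "trtbps": [118, 120, 130, 128, 135, 140, 138, 150],
    "chol":   [250, 210, 260, 230, 245, 225, 255, 240],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Correlation of each other column with age, strongest first.
age_corr = corr["age"].drop("age").sort_values(ascending=False)

print(corr.round(2))
print(age_corr.round(2))
```

On the real data, `df.corr()` followed by a seaborn heatmap would give the same view for all columns at once.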

# If you like this notebook, please upvote
**Thanks**