# Heart Attack Analysis and Prediction Dataset

This dataset contains information about people and their chances of having a heart attack.


**Dataset Information:**

* age : age of the patient
* sex : sex of the patient
* exng : exercise-induced angina (1 = yes; 0 = no)
* caa : number of major vessels (0-3)
* cp : chest pain type
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl fetched via BMI sensor
* fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalachh : maximum heart rate achieved
* output :
    * 0 = less chance of heart attack
    * 1 = more chance of heart attack

**Objective:**

* Using the heart dataset provided, analyse how the various features relate to the possibility of a heart attack, and then predict whether an individual is prone to a heart attack or not.
* The detailed analysis starts with exploratory data analysis (EDA).
* The classification can be done with various machine learning algorithms; choose the model best suited to heart attack prediction and finally save it to a pickle (.pkl) file.
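
The final step of the objective, saving the chosen model to a pickle file, could look roughly like the sketch below. It is only an illustration: the data is synthetic, the file name `heart_model.pkl` is an assumed name, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the heart dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "heart_model.pkl")  # assumed file name

    # Serialize the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later prediction script would.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

A pickle file should only be loaded from a trusted source and with a compatible scikit-learn version.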


**Questions to be answered:**

* Does the age of a person contribute towards heart attack?
* Are the different types of chest pain related to each other or to the possibility of getting a heart attack?
* Does high blood pressure increase the risk of heart attack?
* Does the cholesterol level contribute as a risk factor towards heart attack?



In [None]:
# Heart Attack Analysis and Prediction Dataset
# Date: April 26, 2021

In [None]:
import numpy as np #linear algebra
import pandas as pd #data processing

import matplotlib.pyplot as plt #data visualization
import seaborn as sns #data visualization

import warnings
warnings.filterwarnings("ignore") #to ignore the warnings

#for model building
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb


In [None]:
# Read heart.csv into the DataFrame df
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
# looking at the first 10 rows of our data
df.head(10)


In [None]:
df.shape

**Observation:**

* All the columns appear to be int or float data types already.
* If any columns were categorical (not int or float), we would have had to encode them numerically before model building.
* A few ways to do so are one-hot encoding (for example with pd.get_dummies) and label encoding (LabelEncoder).
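
As a small illustration of the encoding step mentioned above, here is a sketch of one-hot encoding with `pd.get_dummies`. The toy frame and its `chest_pain` column are hypothetical; the heart dataset itself is already numeric and does not need this.

```python
import pandas as pd

# Hypothetical toy data with one text-valued categorical column.
toy = pd.DataFrame({
    "age": [52, 61, 45],
    "chest_pain": ["typical", "asymptomatic", "typical"],
})

# Each category becomes its own indicator column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["chest_pain"])
print(encoded.columns.tolist())
```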

In [None]:
df.tail(25)

In [None]:
print('Number of rows are',df.shape[0], 'and number of columns are ',df.shape[1])

In [None]:
df.info()

**Observation:**

* There are no missing values in the entire dataset.
* All the columns except oldpeak (float) are of int data type.

In [None]:
df.isnull().sum()

**Observation:** There are no missing values.

In [None]:
df.columns

In [None]:
df.duplicated().sum()

**Observation:** There is 1 duplicate row. Let's drop it!

In [None]:
df.drop_duplicates(inplace=True)
print('Number of rows are',df.shape[0], 'and number of columns are ',df.shape[1])

In [None]:
df.describe().T

In [None]:
sns.pairplot(df)

**Observation:**

* The average resting blood pressure of an individual is 130, whereas the maximum value goes up to 200.
* The average heart rate of the group is 152, with most values between 133 and 202.
* The age of the group varies from 29 to 77 and the mean age is 55.5.

In [None]:
# Look at the unique values of a few selected columns
list_col = ['sex', 'chol', 'trtbps', 'cp', 'thall', 'exng']

for col in list_col:
    print(f'{col.upper()} : {df[col].unique()}')

In [None]:
sns.boxplot(x=df['trtbps'])


**Observation:**

* There are two sex categories: 0 and 1.
* The highest cholesterol level is 564 and the lowest is 126.
* The resting blood pressure of individuals varies between 94 and 200.
* There are 4 types of chest pain.
* Exercise-induced angina has 2 values (1 = yes; 0 = no).

In [None]:
sns.boxplot(x=df['chol'])

In [None]:
sns.boxplot(x=df['thalachh'])

In [None]:
sns.boxplot(x=df['oldpeak'])

In [None]:
print(np.where(df['trtbps']>175))
print(np.where(df['chol']>380))
print(np.where(df['oldpeak']>4))
print(np.where(df['thalachh']<80))

In [None]:
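# Drop the outlier rows by position (positions copied from the np.where output above)
outlier_positions = [28, 85, 96, 101, 110, 202, 203, 219, 220, 222, 245, 247, 249, 259, 265, 271, 290]
df.drop(df.index[outlier_positions], inplace=True)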
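
Hard-coded row positions go stale as soon as the data changes. A possible alternative, sketched here on a hypothetical toy frame, is to build a boolean mask from the same four thresholds used in the `np.where` calls and filter with it. This is a sketch of the idea, not a drop-in replacement verified against the real data.

```python
import pandas as pd

# Hypothetical toy rows; only the column names and thresholds come from the notebook.
toy = pd.DataFrame({
    "trtbps": [120, 180, 130, 140, 125],
    "chol": [200, 250, 400, 230, 210],
    "oldpeak": [1.0, 0.5, 2.0, 4.5, 0.0],
    "thalachh": [150, 160, 170, 75, 165],
})

# A row is treated as an outlier if it crosses any one of the thresholds.
outlier_mask = (
    (toy["trtbps"] > 175)
    | (toy["chol"] > 380)
    | (toy["oldpeak"] > 4)
    | (toy["thalachh"] < 80)
)

# Keep the remaining rows and renumber the index.
cleaned = toy[~outlier_mask].reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```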

In [None]:
df.shape

In [None]:
df.tail(15)

In [None]:
print(f'Number of people having sex as 0 are {df.sex.value_counts()[0]} and Number of people having sex as 1 are {df.sex.value_counts()[1]}')
p = sns.countplot(data=df, x="sex", palette='pastel')
plt.show()

**Observation:** 

* There are 96 people in sex category 0 and 206 in category 1.
* Category 1 has more than twice as many people as category 0.

In [None]:
sns.countplot(x='cp', data=df, palette='pastel')

**Observation:**

* cp : chest pain type

    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic

* Chest pain category '0' has the highest count, whereas category '3' has the lowest.

In [None]:
sns.countplot(x='fbs', data=df, palette='pastel')

**Observation:** The number of people in fbs category 1 is less than 25% of the number in fbs category 0.

In [None]:
sns.countplot(x='thall', data=df, palette='pastel')

**Observation:** The thall count is highest for type 2 (165) and lowest for type 0 (2).


In [None]:
sns.countplot(x='restecg', data=df, palette='pastel')

**Observation:** 

* The ECG count is almost the same for types 0 and 1.
* It is almost negligible for type 2 in comparison to types 0 and 1.

In [None]:
plt.figure(figsize = (10,10))
sns.swarmplot(x=df['caa'],y=df['age'],hue=df['output'], palette='pastel')

**Observation:**

* This swarmplot gives us a lot of information.
* According to the figure, people in caa category '0' are highly prone to a heart attack, irrespective of their age.
* Very few people belong to caa category '4', but around 75% of them appear to be in the heart attack group.
* People in categories '1', '2' and '3' are at more or less similar risk.

In [None]:
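sns.set_palette("pastel")
plt.title('Checking outliers with histplot()')
# histplot replaces the deprecated distplot; stat='density' keeps the y-axis on a density scale
sns.histplot(df.trtbps, label='trtbps', kde=True, bins=10, color='green', stat='density')
plt.legend()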

In [None]:
plt.title('Checking outliers with histplot()')
# histplot replaces the deprecated distplot
sns.histplot(df.chol, label='chol', kde=True, color='red', stat='density')
plt.legend()

In [None]:
plt.title('Checking outliers with histplot()')
# histplot replaces the deprecated distplot
sns.histplot(df['thalachh'], label='thalachh', kde=True, stat='density')
plt.legend()

**Observations:**

* trtbps and chol look roughly normally distributed, with some outliers forming a tail on the right.
* thalachh, in contrast, is skewed to the left, with its longer tail towards the low values.
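
The direction of a skew can also be checked numerically rather than by eye. The sketch below uses `Series.skew()` on two hypothetical toy series: a positive value indicates a longer right tail, a negative value a longer left tail. On the notebook's data the equivalent call would be something like `df['thalachh'].skew()`.

```python
import pandas as pd

# Hypothetical toy series with one large value, giving a long right tail.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 15])

# Mirroring the values flips the tail to the left.
left_tailed = -right_tailed

print("right-tailed skew:", right_tailed.skew())
print("left-tailed skew:", left_tailed.skew())
```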

In [None]:
plt.figure(figsize=(20,10))
sns.lineplot(x = df['age'], y = df['thall'],marker = '*', linestyle = '--', color = 'red')

plt.figure(figsize = (20,10))
sns.regplot(x=df['age'],y=df['oldpeak'])

**Observation:** 

* This hardly provides any information. 
* The relationship between age-oldpeak and age-thall is highly uncertain and varies significantly.

In [None]:
X = df.drop('output', axis = 1)
y = df['output']

**About StandardScaler:**

* scikit-learn provides the StandardScaler class to standardize feature values.

* Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

* StandardScaler subtracts each feature's mean and divides by its standard deviation, so every scaled feature has mean 0 and unit variance. It does not change the shape of the distribution, so skewed data stays skewed.
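
To make the transformation concrete, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on a small hypothetical column of values. Note that the scaler uses the population standard deviation (ddof = 0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, shaped (n_samples, n_features).
values = np.array([[120.0], [130.0], [140.0], [150.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Same computation by hand: subtract the mean, divide by the population std.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```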

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
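# Select the continuous columns by position (age, trtbps, chol, thalachh, oldpeak in heart.csv)
columns_to_scale = df.iloc[:, [0, 3, 4, 7, 9]]
columns_to_scale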

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the selected columns.
# Note: the scaler is fit on the full dataset here, and X_train/X_test above still hold the unscaled features.
ss = StandardScaler()
scaled_values = ss.fit_transform(columns_to_scale)
scaled_values = pd.DataFrame(scaled_values, columns=columns_to_scale.columns)
scaled_values


In [None]:
scaled_df = pd.concat([scaled_values,df.iloc[:,[1,2,5,6,8,10,11,12,13]]],axis=1)
scaled_df
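
One way to make sure the scaled features actually reach the model, and that the scaler only learns from the training rows, is to put the scaling inside a scikit-learn pipeline. The sketch below is an assumption about how that could be arranged, not the notebook's method. It uses synthetic data with a few of the same column names, and LogisticRegression is chosen only as an example estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, familiar column names).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "trtbps": rng.integers(95, 175, size=n),
    "sex": rng.integers(0, 2, size=n),
})
demo["output"] = (demo["age"] + rng.normal(0, 10, size=n) < 55).astype(int)

X_demo = demo.drop(columns="output")
y_demo = demo["output"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale only the continuous columns; pass the categorical ones through untouched.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "trtbps"])],
    remainder="passthrough",
)
clf = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# fit() learns the scaling statistics from the training rows only.
clf.fit(X_tr, y_tr)
fitted_scaler = clf.named_steps["columntransformer"].named_transformers_["scale"]
test_accuracy = clf.score(X_te, y_te)
print(round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, the same training-set mean and standard deviation are reused automatically when predicting on the test rows.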

In [None]:
key = ['LogisticRegression','KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier','XGBClassifier']
value = [LogisticRegression(random_state=9), KNeighborsClassifier(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), AdaBoostClassifier(), xgb.XGBClassifier()]
models = dict(zip(key,value))

In [None]:
predicted =[]

In [None]:
for name,algo in models.items():
    model=algo
    model.fit(X_train,y_train)
    predict = model.predict(X_test)
    acc = accuracy_score(y_test, predict)
    predicted.append(acc)
    print(name,acc)

In [None]:
plt.figure(figsize = (10,5))
sns.barplot(x = predicted, y = key, palette='pastel')

**Observation:** From the above figure we can see that none of the models gives an accuracy greater than 90%. Let us try another approach: vary the random_state of the train/test split for the AdaBoost classifier and see how the accuracy changes.

In [None]:
# Re-split the data with different seeds and record AdaBoost's test accuracy for each split
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
rs = []   # random_state values used for the split
acc = []  # test accuracy obtained for each split
for i in range(1, 100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
    ada.fit(X_train, y_train)
    acc.append(accuracy_score(y_test, ada.predict(X_test)))
    rs.append(i)

In [None]:
plt.figure(figsize=(10,10))
plt.plot(rs, acc)

In [None]:
for i in range(len(rs)):
    print(rs[i],acc[i])
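
The spread of accuracies across split seeds mostly reflects how noisy a single 80/20 split is on a small dataset, so the best seed is not by itself a better model. A common way to summarize this is k-fold cross-validation, sketched below on synthetic data from `make_classification` (an assumption for illustration; the fold count of 5 is also an arbitrary choice).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data with 13 features, like the heart dataset's feature count.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    AdaBoostClassifier(n_estimators=100, random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

# Report the mean and spread instead of a single best split.
print(scores.mean(), scores.std())
```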

# Conclusion:


* High blood pressure, high cholesterol and a high heart rate lead to a higher chance of heart attack.

* The counts of the target variable show that the dataset contains more records with a higher chance of heart attack.

* People aged 40-60 have a high chance of heart attack.

* Males have a higher chance of heart attack than females.

* Highly correlated factors in this dataset are:
    * Age and trtbps (resting blood pressure)
    * Age and chol (cholesterol level)
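
Correlation claims like the ones above can be checked with `DataFrame.corr()`, which returns the pairwise Pearson correlation matrix. The sketch below runs on synthetic data in which trtbps is deliberately constructed to depend on age and chol is not, so the numbers say nothing about the real dataset. On the notebook's data the equivalent call would be something like `df[['age', 'trtbps', 'chol']].corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical): trtbps depends on age, chol does not.
rng = np.random.default_rng(1)
age = rng.integers(30, 78, size=300)
demo = pd.DataFrame({
    "age": age,
    "trtbps": 100 + 0.5 * age + rng.normal(0, 15, size=300),
    "chol": rng.normal(245, 50, size=300),
})

# Pairwise Pearson correlations; values near 0 mean little linear relationship.
corr = demo.corr()
print(corr.round(2))
```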