### Heart Attack Analysis and Prediction

## TASK: Analyze and predict heart attacks based on age, gender, number of diseases a patient has, and other factors

### In this notebook I have implemented the above task.
### I am a beginner and have used simple machine learning algorithms to predict heart attacks among patients, considering the factors and information provided in the dataset.

### Feel free to share your thoughts and opinions.

In [None]:
# Age : Age of the patient

# Sex : Sex of the patient

# exng : exercise induced angina (1 = yes; 0 = no)
# Angina: a type of chest pain caused by reduced blood flow to the heart.

# caa : number of major vessels (0-3)

# cp : chest pain type

# Value 1: typical angina
# Value 2: atypical angina
# Value 3: non-anginal pain
# Value 4: asymptomatic

# trtbps : resting blood pressure (in mm Hg)

# chol : cholesterol in mg/dl fetched via BMI sensor

# fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

# restecg : resting electrocardiographic results

# Value 0: normal
# Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
# Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

# thalach : maximum heart rate achieved

# output : 0 = less chance of heart attack, 1 = more chance of heart attack
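
The coded columns above are easier to read in tables and plots once the integer codes are mapped to their labels. Here is a minimal sketch for `cp`, assuming the codes in the column match the 1-4 listing above. The `toy` frame and its values are made up for illustration; on the real data it is worth checking `df1['cp'].unique()` first, because any code missing from the mapping becomes NaN.

```python
import pandas as pd

# Toy stand-in for df1; the values are made up for illustration.
toy = pd.DataFrame({"cp": [1, 2, 3, 4, 2]})

# Labels copied from the data dictionary above (codes 1-4).
cp_labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Inspect the codes that are actually present before mapping.
print(sorted(toy["cp"].unique()))

# Codes that are not keys of cp_labels would be mapped to NaN.
toy["cp_label"] = toy["cp"].map(cp_labels)
print(toy)
```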

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
df1 = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
df1.sample(6)

In [None]:
df1.isnull().sum()

In [None]:
df1.info()

In [None]:
df1.duplicated().sum()

In [None]:
df1[df1.duplicated()]

In [None]:
df1 = df1.drop_duplicates()

In [None]:
df1.duplicated().sum()

In [None]:
df1.sample(10)

In [None]:
from matplotlib.colors import LinearSegmentedColormap
cmap = LinearSegmentedColormap.from_list('rg', ["black", "pink", "w"], N=256)
plt.figure(figsize = (14,10))
sns.heatmap(df1.corr(), cmap = cmap, annot = True)
plt.title('Correlation Matrix',pad = 15, fontsize = 15)
plt.show()

## From the heatmap we find that cp (chest pain type) and thalach (maximum heart rate achieved) are positively correlated with output.
## exng (exercise induced angina) and oldpeak (ST depression induced by exercise relative to rest) are negatively correlated with output.
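Reading individual values off a large heatmap is error-prone, so it can help to rank each column's correlation with `output` directly. This is a minimal sketch on a made-up `toy` frame (the numbers are synthetic and only mimic the direction of the relationships described above); in the notebook the same expression would be applied to `df1`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df1; every number here is made up.
rng = np.random.default_rng(0)
n = 200
output = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "output": output,
    # Built to rise with output (positive correlation).
    "thalach": 140 + 15 * output + rng.normal(0, 10, size=n),
    # Built to fall with output (negative correlation).
    "oldpeak": 1.5 - 1.0 * output + rng.normal(0, 0.5, size=n),
    # Built to be unrelated to output.
    "chol": rng.normal(245, 50, size=n),
})

# Pearson correlation of every column with output, sorted ascending.
corr_with_output = toy.corr()["output"].drop("output").sort_values()
print(corr_with_output)
```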

In [None]:
plt.figure(figsize = (10,6))
sns.set(rc = {'axes.facecolor': 'w', 'axes.grid': False,})

sex_data = df1.sex.map({1: 'Male', 0: 'Female'})
sns.countplot(x = sex_data, hue = df1.output, palette = 'bright', alpha = 0.8)  # pass the series as x=; recent seaborn treats the first positional argument as data
plt.title('Comparing Male and Female Patients')
plt.show()

In [None]:
plt.figure(figsize = (10,6))
sns.countplot(x = df1.sex, palette = 'hls', alpha = 0.8)
plt.show()

## Here 0 : Female, 1 : Male

## We can't conclude from the counts alone that male patients are at higher risk than female patients, because the dataset contains fewer female patients than male patients.
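Because the two groups have different sizes, comparing proportions is fairer than comparing raw counts. Here is a minimal sketch using `pd.crosstab` with `normalize="index"`, which turns each row into shares that sum to 1. The `toy` frame is made up; the column names `sex` and `output` follow the notebook.

```python
import pandas as pd

# Made-up example: 6 male rows and 4 female rows.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "output": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})
sex_label = toy["sex"].map({1: "Male", 0: "Female"})

# Raw counts per sex and output value.
counts = pd.crosstab(sex_label, toy["output"])

# Row-wise proportions: the share of each output value within each sex.
rates = pd.crosstab(sex_label, toy["output"], normalize="index")

print(counts)
print(rates)
```

In this toy data the male group has more rows overall, yet the share with `output = 1` is higher in the female group, which is exactly the kind of difference raw counts hide.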

In [None]:
plt.figure(figsize = (10,6))
sns.set(rc = {'axes.facecolor': 'w', 'axes.grid': False,})
sns.countplot(x = df1.output, palette = 'bright', alpha = 0.8)
plt.show()

In [None]:
df1['age'].value_counts()

In [None]:
df1['chol'].value_counts()

In [None]:
# Age
sns.set(rc = {'axes.facecolor': 'black', 'axes.grid': False,})
plt.figure(figsize = [10,6])

sns.histplot(df1['age'], kde=True, color='red')  # distplot is deprecated; histplot puts counts on the y-axis
plt.title('Distribution of Ages', fontsize=15, pad = 10)
plt.xlabel('Ages', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.show()

## From this plot we see that most patients in the dataset are between 45 and 65 years old (approximately). The plot covers all patients, not only those with output = 1.
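A visual estimate like "45 to 65" can be backed by a number. This is a minimal sketch that computes the share of rows inside an age band with `Series.between`; the ages are made up, and in the notebook `df1['age']` would take their place.

```python
import pandas as pd

# Made-up ages standing in for df1['age'].
ages = pd.Series([29, 41, 45, 52, 54, 58, 60, 63, 65, 71], name="age")

# between() is inclusive on both ends by default.
in_band = ages.between(45, 65)

# The mean of a boolean series is the fraction of True values.
share = in_band.mean()
print(f"{share:.0%} of patients are aged 45 to 65")

# Summary statistics give the same picture in numbers.
print(ages.describe())
```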

In [None]:
# Blood Pressure
sns.set(rc = {'axes.facecolor': 'black', 'axes.grid': False,})
plt.figure(figsize = [10,6])

sns.histplot(df1['trtbps'], kde=True, color='magenta')  # distplot is deprecated; histplot puts counts on the y-axis
plt.title('Distribution of Blood Pressure among patients', fontsize=15, pad = 10)
plt.xlabel('Blood Pressure', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.show()

In [None]:
# Cholesterol rate
plt.figure(figsize = [10,6])

sns.histplot(df1['chol'], kde=True, color='cyan')  # distplot is deprecated; histplot puts counts on the y-axis
plt.title('Distribution of Cholesterol among patients', fontsize=15, pad = 10)
plt.xlabel('Cholesterol', fontsize=10)
plt.ylabel('Count', fontsize=10)
plt.show()

## Now let's look at how output relates to the other columns

In [None]:
features = ['cp','fbs','restecg','exng','slp','caa','thall']

In [None]:
list(enumerate(features))

In [None]:
df1['restecg'].value_counts()

In [None]:
plt.figure(figsize = (13,30))
sns.set(rc = {'axes.facecolor': 'w', 'axes.grid': False,})

# One count plot per categorical feature, split by output
for idx, feature in enumerate(features):
    plt.subplot(6, 2, idx + 1)
    sns.countplot(x = feature, hue = 'output', data = df1)
plt.show()

## Patients who achieved a higher maximum heart rate (thalach) are more often in the higher-risk group (output = 1). Note that thall is a separate categorical feature, not the maximum heart rate.
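Count plots suit categorical columns; for a numeric column such as the maximum heart rate, comparing group statistics is a more direct check. This is a minimal sketch with made-up values. The column name `thalach` is taken from the data dictionary at the top of the notebook and may be spelled differently in the actual CSV.

```python
import pandas as pd

# Made-up values; "thalach" is assumed to be the max heart rate column.
toy = pd.DataFrame({
    "thalach": [172, 160, 150, 165, 120, 130, 140, 125],
    "output":  [1, 1, 1, 1, 0, 0, 0, 0],
})

# Mean, median and row count of max heart rate within each output group.
summary = toy.groupby("output")["thalach"].agg(["mean", "median", "count"])
print(summary)
```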

In [None]:
df1.head()

In [None]:
X = df1.iloc[: , :-1]
Y = df1.iloc[: , -1]

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)  # fixed seed so the split is reproducible
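
With a small dataset, a purely random split can leave the train and test sets with different class balances. Here is a minimal sketch of a stratified split, which keeps the share of each class roughly equal on both sides. The data is synthetic, and the `stratify` argument and the seed value are my additions rather than something the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and a 60/40 target; all values are made up.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
y_toy = pd.Series([0] * 60 + [1] * 40, name="output")

# stratify=y_toy preserves the class ratio in both splits;
# random_state makes the split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of the positive class in each split.
print(y_tr.mean(), y_te.mean())
```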


## Training Different ML models

In [None]:
# Hyperparameter Tuning
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear'),  # multi_class is deprecated in recent scikit-learn; 'auto' was the default
        'params': {
            'C': [1,5,10]
        }
    },
}

In [None]:
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df2 = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df2

## I chose Logistic Regression as the final model

In [None]:
model_lg = LogisticRegression(solver='liblinear')  # multi_class is deprecated in recent scikit-learn; 'auto' was the default
model_lg.fit(x_train, y_train)
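
The model fitted above uses the default `C`, so the value found by the grid search is not carried over. One possible way to reuse the tuned setting is the `best_estimator_` attribute of `GridSearchCV`, which holds the model refitted on the whole training set with the best parameters. This is a minimal sketch on synthetic data from `make_classification`; the dataset and the variable names are illustrative, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the heart dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Same grid as the notebook uses for logistic regression.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [1, 5, 10]},
    cv=5,
)
grid.fit(x_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained
# on all of x_tr with the best C found by cross-validation.
best_model = grid.best_estimator_
test_accuracy = best_model.score(x_te, y_te)

print(grid.best_params_, test_accuracy)
```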

In [None]:
pred = model_lg.predict(x_test)
pred

In [None]:
Y.head()

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test , pred)
cm

## Let's compute the other metrics as well

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, pred)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

In [None]:
from sklearn.metrics import precision_score
precision_score(y_test, pred)

In [None]:
from sklearn.metrics import recall_score
recall_score(y_test, pred)
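
All four metrics above can be derived from the entries of the confusion matrix, which makes it easier to see what each one measures. This is a worked sketch on a small, made-up pair of label lists; for binary labels `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Unpack the 2x2 matrix: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)

# The hand-computed values agree with scikit-learn's functions.
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

For a medical screening task, recall is often the metric to watch, since a false negative means a high-risk patient is missed.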

## Thank you and happy analyzing!