<br>
<h1 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #C66363 ; color : #E8D6D8; text-align: center; border-radius: 100px 100px;">INTRODUCTION </h1>
<br>

<img style ="margin-left: auto; margin-right: auto; margin-bottom: 20;" src="https://cdn.techexplorist.com/wp-content/uploads/2017/06/human-heart.jpg" alt="Heart" class="center" width="500">

  
  
####   A myocardial infarction (MI), commonly known as a heart attack, occurs when blood flow decreases or stops to a part of the heart, causing damage to the heart muscle. The most common symptom is chest pain or discomfort which may travel into the shoulder, arm, back, neck or jaw. Often it occurs in the center or left side of the chest and lasts for more than a few minutes. The discomfort may occasionally feel like heartburn. Other symptoms may include shortness of breath, nausea, feeling faint, a cold sweat or feeling tired. About 30% of people have atypical symptoms. Women more often present without chest pain and instead have neck pain, arm pain or feel tired. Among those over 75 years old, about 5% have had an MI with little or no history of symptoms. An MI may cause heart failure, an irregular heartbeat, cardiogenic shock or cardiac arrest.

<br>
<h1 style = "font-size:40px; font-family:Garamond ; font-weight : normal; background-color: #C66363 ; color : #E8D6D8; text-align: center; border-radius: 100px 100px;">CONTENT </h1>
<br>

* [Add Libraries](#1)
* [Load and Examine Data](#2)
    * Data Information
    * Examine Data     
* [Clean Data](#3)
    * Check Missing Values   
    * Find Outliers
* [Visualize Data](#4)
* [Process Data](#5)

<a id="1"> </a>
# Add Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from collections import Counter
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

<a id="2"> </a>
# Load and Examine Data

## Data Information

* age : age of the patient

* sex : sex of the patient

* exng : exercise-induced angina (1 = yes; 0 = no)

* caa : number of major vessels (0-3)

* cp : chest pain type
    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic

* trtbps : resting blood pressure (in mm Hg)

* chol : cholesterol in mg/dl fetched via BMI sensor

* fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

* restecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

* thalachh : maximum heart rate achieved

* output : 0 = less chance of heart attack; 1 = more chance of heart attack

In [None]:
data = pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

## Examine Data

In [None]:
data.info() #it looks like there are no empty values

In [None]:
data.describe()

In [None]:
data.head()

<a id="3"> </a>
# Clean Data

## Check Missing Values

In [None]:
data.isnull().sum() # there are no missing values

## Find Outliers

In [None]:
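def detect_outliers(df, features):
    """Return indices of rows that are IQR outliers in more than two of the given features."""
    outlier_indices = []

    for c in features:
        # Quartiles and interquartile range for this column
        Q1 = np.percentile(df[c], 25)
        Q3 = np.percentile(df[c], 75)
        IQR = Q3 - Q1
        outlier_step = IQR * 1.5
        # Rows lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)

    # Count how many features flag each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers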

In [None]:
data.loc[detect_outliers(data,["age","oldpeak","thalachh","chol","trtbps"])] # no row is an outlier in more than two features
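
The IQR rule used by `detect_outliers` can be checked by hand on a single column. The sketch below is only an illustration: the `toy` frame and its values are made up and are not taken from `heart.csv`. Note also that `detect_outliers` reports a row only when it is flagged in more than two features, whereas this example looks at one feature at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not real patient data
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 250, 600]})

# Same quartile computation as in detect_outliers
q1 = np.percentile(toy["chol"], 25)
q3 = np.percentile(toy["chol"], 75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outlier_rows = toy[(toy["chol"] < lower) | (toy["chol"] > upper)]
print(lower, upper)                   # 170.0 290.0
print(outlier_rows.index.tolist())    # [6]
```

Only the value 600 falls outside the fences, so only row 6 is flagged for this column.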

<a id="4"> </a>
# Visualize Data

In [None]:
g = sns.pairplot(data, diag_kind="kde", palette=["#F4F1DE","#AA3F22"])
g.map_lower(sns.kdeplot, levels=4, color=".4")

In [None]:
corr_df=data[["trtbps","age","chol","thalachh","oldpeak","output"]]
corrMatrix = corr_df.corr()
sns.set(rc={'figure.figsize':(15,15)}) # set the figure size before plotting so it applies to this heatmap
sns.heatmap(corrMatrix, annot=True)

In [None]:
sns.boxplot(x="output", y="age", hue="cp", data=data, palette="rocket")

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

fig.suptitle('Risk Analysis')

sns.boxplot(ax=axes[0, 0], data=data, x='output', y='age',palette="rocket")
sns.boxplot(ax=axes[0, 1], data=data, x='output', y='trtbps',palette="rocket")
sns.boxplot(ax=axes[0, 2], data=data, x='output', y='chol',palette="rocket")
sns.boxplot(ax=axes[1, 0], data=data, x='output', y='thalachh',palette="rocket")
sns.boxplot(ax=axes[1, 1], data=data, x='output', y='oldpeak',palette="rocket")
sns.boxplot(ax=axes[1, 2], data=data, x='sex', y='output',palette="rocket")

<a id="5"> </a>
# Process Data

In [None]:
data_c = data.copy()

category_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
numeric_cols = ["age","trtbps","chol","thalachh","oldpeak"]

data_c = pd.get_dummies(data_c, columns = category_cols, drop_first = True)

Y = data_c['output'] # 1-D target, as scikit-learn estimators expect
X = data_c.drop(['output'],axis=1)

scaler = StandardScaler()
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])


In [None]:
X.head()
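
To see what `pd.get_dummies(..., drop_first = True)` does to a categorical column, here is a minimal sketch on a made-up frame (the `toy` values are hypothetical and only mimic the `cp` and `age` columns). The first category of each encoded column is dropped, so it is represented by all remaining indicators being zero.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({"cp": [0, 1, 2, 1], "age": [63, 37, 41, 56]})

# One-hot encode "cp" and drop the indicator of its first category (cp = 0)
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

print(encoded.columns.tolist())   # ['age', 'cp_1', 'cp_2']
print(encoded)
```

Dropping one indicator per column avoids perfectly collinear features, which matters mostly for linear models such as logistic regression.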

In [None]:
Y.head()

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 42)
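
Note that the scaler in the processing cell was fitted on all rows before this split, so statistics from the test rows influence the scaling of the training rows. One possible alternative, shown here only as a sketch on synthetic data and not as the notebook's method, is to put the scaler inside a scikit-learn `Pipeline` so that it is fitted on the training part only. The names `X_demo`, `y_demo` and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption: 5 numeric features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted inside fit(), using the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# The learned means come from the training split, not from the full data
scaler_means = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_means, X_tr.mean(axis=0)))
print(round(pipe.score(X_te, y_te), 2))
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refitted on the training folds of every split.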

In [None]:
X_train.head()

In [None]:
test_index = pd.Series(Y_test.index, name = "index") # indices of the test rows, matched with the predictions when saving results

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
acc_log_train = round(logreg.score(X_train, Y_train)*100,2) 
acc_log_test = round(logreg.score(X_test,Y_test)*100,2)
print("Training Accuracy: {}%".format(acc_log_train))
print("Testing Accuracy: {}%".format(acc_log_test))

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
             SVC(random_state = random_state),
             RandomForestClassifier(random_state = random_state),
             LogisticRegression(random_state = random_state, solver = "liblinear"), # liblinear supports both the l1 and l2 penalties searched below
             KNeighborsClassifier()]

dt_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20,2)}

svc_param_grid = {"kernel" : ["rbf"],
                 "gamma": [0.001, 0.01, 0.1, 1],
                 "C": [1,10,50,100,200,300,1000]}

rf_param_grid = {"max_features": [1,3,10],
                "min_samples_split":[2,3,10],
                "min_samples_leaf":[1,3,10],
                "bootstrap":[False],
                "n_estimators":[100,300],
                "criterion":["gini"]}

logreg_param_grid = {"C":np.logspace(-3,3,7),
                    "penalty": ["l1","l2"]}

knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                 "weights": ["uniform","distance"],
                 "metric":["euclidean","manhattan"]}
classifier_param = [dt_param_grid,
                   svc_param_grid,
                   rf_param_grid,
                   logreg_param_grid,
                   knn_param_grid]

In [None]:
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
    clf.fit(X_train,Y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])
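
The loop above combines two ideas: `StratifiedKFold` keeps the class proportions roughly equal in every fold, and `GridSearchCV` scores every parameter combination on those folds. A reduced sketch on synthetic data, with a single hypothetical one-parameter grid, makes both visible. The names `X_demo`, `y_demo` and `search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of the positive class in each test fold; stratification keeps these close
fold_pos_rates = [y_demo[test_idx].mean() for _, test_idx in cv.split(X_demo, y_demo)]
print([round(r, 2) for r in fold_pos_rates])

# Every n_neighbors value is evaluated on the same five folds
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=cv,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best combination, which is the value collected in `cv_result` above.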

In [None]:
cv_results = pd.DataFrame({"Cross Validation Means":cv_result, "ML Models":["DecisionTreeClassifier", "SVM","RandomForestClassifier",
             "LogisticRegression",
             "KNeighborsClassifier"]})

g = sns.barplot(x = "Cross Validation Means", y = "ML Models", data = cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")

In [None]:
votingC = VotingClassifier(estimators = [("dt",best_estimators[0]),
                                        ("rfc",best_estimators[2]),
                                        ("lr",best_estimators[3])],
                                        voting = "soft", n_jobs = -1)
votingC = votingC.fit(X_train, Y_train)
print(accuracy_score(Y_test, votingC.predict(X_test)))
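
With `voting = "soft"`, the ensemble averages the class probabilities of its members and predicts the class with the highest mean, instead of counting one vote per model. The hand-made numbers below are hypothetical probabilities for a single sample from three models, chosen to show a case where the two schemes disagree.

```python
import numpy as np

# Hypothetical predict_proba outputs of three models for one sample:
# columns are P(class 0), P(class 1)
probs = np.array([
    [0.90, 0.10],   # model A is very confident in class 0
    [0.40, 0.60],   # model B leans towards class 1
    [0.45, 0.55],   # model C leans towards class 1
])

# Soft voting: average the probabilities, then take the most likely class
soft_mean = probs.mean(axis=0)
soft_choice = int(soft_mean.argmax())

# Hard voting: each model casts one vote for its own most likely class
votes = probs.argmax(axis=1)
hard_choice = int(np.bincount(votes).argmax())

print(soft_mean.round(3))
print(soft_choice, hard_choice)   # 0 1
```

Soft voting picks class 0 here because one confident model outweighs two weakly leaning ones, while hard voting picks class 1 by majority.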

In [None]:
test_output = pd.Series(votingC.predict(X_test), name = "output").astype(int)
results = pd.concat([test_index, test_output],axis = 1)
results.to_csv("heart_attack.csv", index = False)