# Healthcare: Predicting Heart Attack

## Description:
### Cardiovascular disease is one of the top contributors to mortality worldwide. Millions of dollars are spent on medical treatment, including medications, surgical intervention, and so forth. Instead of spending enormous amounts of money on treatment, preventive measures can significantly improve population health. People can benefit from early diagnosis before a heart attack, which gives them the opportunity to take preventive action.

## Project Objective:
### To predict whether a patient has a higher or lower chance of having a heart attack. The target feature, or y-variable, is "output", where 0 = less chance of heart attack and 1 = more chance of heart attack.

## Process:
### This project starts off with basic descriptive analysis, followed by data visualization and exploratory data analysis, data preparation, a train/test split, and model fitting. Finally, each model's performance is evaluated with a classification report and a confusion matrix.

## Potential Impact:
### A healthier population, less money spent on healthcare, and a reduced mortality rate due to cardiovascular disease.

# Dataset Description:


## Age : Age of the patient

## Sex : Sex of the patient

## exang: exercise induced angina (1 = yes; 0 = no)

## ca: number of major vessels (0-3)

## cp : chest pain type

* Value 1: typical angina 
* Value 2: atypical angina 
* Value 3: non-anginal pain 
* Value 4: asymptomatic

## trtbps : resting blood pressure (in mm Hg)

## chol : cholesterol in mg/dl fetched via BMI sensor

## fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

## rest_ecg : resting electrocardiographic results

* Value 0: normal 
* Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
* Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

## thalach : maximum heart rate achieved

## output : target variable (0 = less chance of heart attack; 1 = more chance of heart attack)

### Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
data = pd.read_csv("../input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
data.head()

### Checking for duplicates

In [None]:
duplicate_rows = data[data.duplicated()]
print("Number of duplicate rows: ", duplicate_rows.shape[0])  # shape[0] is the row count

In [None]:
data = data.drop_duplicates()

In [None]:
duplicate_rows = data[data.duplicated()]
print("Number of duplicate rows: ", duplicate_rows.shape[0])  # should now be 0
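
The duplicate check above can be illustrated on a tiny frame. This is only a sketch on made-up rows (the `dup_demo` frame and its values are hypothetical, not taken from heart.csv); it shows that `duplicated()` flags only the second and later copies of a row, so `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Hypothetical rows for illustration; row 1 is an exact copy of row 0.
dup_demo = pd.DataFrame({"age": [38, 38, 52], "chol": [175, 175, 199]})

# duplicated() marks repeats of an earlier row (keep="first" is the default).
n_dupes = int(dup_demo.duplicated().sum())

# drop_duplicates() removes exactly the rows that duplicated() marked.
deduped = dup_demo.drop_duplicates()

print("duplicates found:", n_dupes)
print("rows before/after:", len(dup_demo), len(deduped))
```

Summing the boolean mask gives a single count, which is easier to read than the shape tuple of the filtered frame.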

### The data types are integers and floats, with no null values

In [None]:
data.info()

In [None]:
data.isnull().sum()

### Basic descriptive analysis

In [None]:
data.describe().transpose()

### Exploratory Data Analysis

### The target feature "output" is roughly balanced

In [None]:
data["output"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x=data["output"], color="red", alpha=0.4)

In [None]:
print("The min value for age is: ", data["age"].min())
print("The mean value for age is: ", data["age"].mean())
print("The max value for age is: ", data["age"].max())

In [None]:
plt.figure(figsize=(12,7))
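# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(x=data["age"], bins=15, color="red", kde=True, stat="density")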

In [None]:
sns.jointplot(x="age", y="chol", data=data, hue="output")

### Noted outliers in the cholesterol feature. These extreme values will be dropped.

In [None]:
data["chol"].max()

In [None]:
plt.figure(figsize=(12,7))
sns.histplot(x=data["chol"], bins=10, color="green", kde=True, stat="density")

### Removing outliers

In [None]:
data = data[data["chol"] < 380]

In [None]:
data["chol"].max()
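
The cutoff of 380 above was picked by eye from the distribution plot. As an alternative, here is a minimal sketch of the common IQR (Tukey fence) rule, which derives the cutoff from the data itself. The helper `iqr_bounds` and the sample readings are hypothetical, and the multiplier `k=1.5` is just the conventional default, not something this notebook tuned.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) Tukey fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical cholesterol readings; the last value is an obvious outlier.
chol_demo = pd.Series([200, 210, 220, 230, 240, 250, 260, 564])

lower, upper = iqr_bounds(chol_demo)

# Keep only the readings that fall inside the fences.
chol_kept = chol_demo[(chol_demo >= lower) & (chol_demo <= upper)]

print("fences:", lower, upper)
print("max after filtering:", chol_kept.max())
```

On the real data the fences would likely differ from 380, so the number of dropped rows could change.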

### The gender proportion is imbalanced, with males accounting for ~69% and females ~31%.

In [None]:
data["sex"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="sex", palette="plasma", data=data, hue="output")
plt.title("Gender")
plt.legend(title="Target Variable")

In [None]:
gender_gb = data.groupby("output")["sex"]
gender_gb.value_counts()

# Fun Fact:
### Given the dataset, the probability of picking a female patient with a higher chance of heart attack is 23%

### Given the dataset, the probability of picking a male patient with a higher chance of heart attack is 31%

In [None]:
prob_female_total = 69/len(data["sex"])*100
prob_female_total

In [None]:
prob_male_total = 92/len(data["sex"])*100
prob_male_total

### In general, males have higher counts than females across all chest pain types, as shown in the graph below. This reflects the gender proportion in the dataset.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="cp", color="gold", data=data, hue="sex")
plt.xlabel("Chest Pain: 0 typical angina, 1 atypical angina, 2 non-anginal pain, 3 asymptomatic")

In [None]:
g = sns.FacetGrid(data, col="sex", hue="output")
g.map(plt.scatter, "age", "chol").add_legend()

### Drilling down into number of males and females having higher chance of heart attack

In [None]:
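# Computed from the data rather than from hard-coded counts
# (assumption: sex == 0 encodes female and sex == 1 encodes male)
high_risk_rate = data.groupby("sex")["output"].mean()*100
prob_female = high_risk_rate.loc[0]
prob_male = high_risk_rate.loc[1]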
print("Given the female population in the dataset, the probability of a female having higher chance of heart attack is: {:.2f}%".format(prob_female))
print("\n")
print("Given the male population in the dataset, the probability of a male having higher chance of heart attack is: {:.2f}%".format(prob_male))
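
Two different kinds of percentage appear in this section: the share of all patients who fall in a given sex and outcome group (a joint probability), and the share of patients within one sex who have the higher-risk outcome (a conditional probability). The sketch below shows how `pd.crosstab` can produce both. The toy frame is made up, and the coding `sex == 0` for female is an assumption.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names
# (assumption: sex == 0 is female, sex == 1 is male).
prob_demo = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "output": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Joint probability: each (sex, output) cell as a share of ALL patients.
joint = pd.crosstab(prob_demo["sex"], prob_demo["output"], normalize="all") * 100

# Conditional probability: each cell as a share of that row's sex group.
conditional = pd.crosstab(prob_demo["sex"], prob_demo["output"], normalize="index") * 100

print(joint)
print(conditional)
```

In this toy frame, 30% of all patients are higher-risk females, while 75% of females are higher-risk. The two numbers answer different questions, so it helps to label which one is being reported.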

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="cp", palette="Reds_r", data=data, hue="output")
plt.xlabel("Chest Pain: 0 typical angina, 1 atypical angina, 2 non-anginal pain, 3 asymptomatic")

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x="cp", y="trtbps", data=data, hue="output")
plt.xlabel("Chest Pain: 0 typical angina, 1 atypical angina, 2 non-anginal pain, 3 asymptomatic")
plt.ylabel("Resting blood pressure (in mm Hg)")
sns.set_context("paper", font_scale=1.5)

In [None]:
g = sns.FacetGrid(data, col="sex", hue="output")
g.map(plt.scatter, "trtbps", "thalachh").add_legend()

In [None]:
plt.figure(figsize=(12,7))
sns.histplot(x=data["trtbps"], bins=10, color="blue", kde=True, stat="density")

In [None]:
data = data[data["trtbps"] < 180]

In [None]:
data["trtbps"].max()

In [None]:
plt.figure(figsize=(12,7))
sns.histplot(x=data["thalachh"], bins=15, color="red", kde=True, stat="density")

In [None]:
data = data[data["thalachh"] > 100]

In [None]:
data["thalachh"].min()

In [None]:
data.columns

In [None]:
data["fbs"].value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(9,6))
sns.countplot(x=data["fbs"], color="red", alpha=0.4)

In [None]:
plt.figure(figsize=(12,7))
sns.boxplot(x="exng", y="thalachh", data=data, hue="output", palette="seismic")

In [None]:
data.columns

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x="slp", y="oldpeak", data=data, hue="output")

In [None]:
plt.figure(figsize=(15,7))
sns.violinplot(x="slp", y="oldpeak", data=data, hue="output", palette="plasma")

In [None]:
data["caa"].value_counts(normalize=True)*100

In [None]:
data["restecg"].value_counts(normalize=True)*100

In [None]:
data["restecg"].value_counts()

In [None]:
ekg_normal = len(data[data["restecg"] == 0])/len(data["restecg"])*100
ekg_normal

In [None]:
ekg1 = len(data[data["restecg"] == 1])/len(data["restecg"])*100
ekg2 = len(data[data["restecg"] == 2])/len(data["restecg"])*100
ekg_normal = len(data[data["restecg"] == 0])/len(data["restecg"])*100

print("Abnormal EKG: {:.2f}".format(ekg1))
print("\n")
print("Hypertrophy by Estes: {:.2f}".format(ekg2))
print("\n")
print("Normal EKG: {:.2f}".format(ekg_normal))


In [None]:
plt.figure(figsize=(12,7))
sns.countplot(x="output", data=data, hue="sex", palette="seismic")

In [None]:
data.corr()["output"].sort_values(ascending=False)

### The correlation heatmap below shows that cp, thalachh, and slp have the highest correlation with output, our target variable.

In [None]:
plt.figure(figsize=(18,7))
sns.heatmap(data.corr(method="pearson"), cmap="PuRd", annot=True, lw=0.1)

### At a quick glance, the graph below suggests that patients around age 50 with non-anginal chest pain have the highest risk of a heart attack.

In [None]:
plt.figure(figsize=(15,7))
sns.kdeplot(x="age", y="cp", data=data, hue="output", fill=True, palette="Reds")

In [None]:
data.columns

In [None]:
data.skew(axis=0, skipna=True)

### The feature fbs has skewness > 2. It also has the lowest correlation with the target feature, as shown in the heatmap, so it will be dropped.

### The oldpeak feature will be log-transformed, and caa will be collapsed into a binary feature (0 vs. 1 or more), in the hope of reducing skewness.

In [None]:
data["caa"].value_counts()

In [None]:
data["caa"] = data["caa"].replace([2,3,4],1)
data["caa"].value_counts()

In [None]:
data["oldpeak"].min()

In [None]:
data["oldpeak"] = data["oldpeak"].replace(0.0, 0.01)
data["oldpeak"].value_counts()

In [None]:
data["log_oldpeak"] = np.log10(data["oldpeak"])

In [None]:
data.skew(axis=0, skipna=True)
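
The log transform above needs a substitute value (0.01) because log10 is undefined at zero. One possible alternative, sketched here on made-up numbers, is `np.log1p`, which computes log(1 + x) and therefore handles zeros directly. The series `oldpeak_demo` is hypothetical and only meant to resemble a right-skewed feature with many zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values that include zeros, similar in spirit to oldpeak.
oldpeak_demo = pd.Series([0.0, 0.0, 0.0, 0.2, 0.5, 0.8, 1.0, 1.5, 2.5, 6.0])

# log1p(x) = log(1 + x) is defined at zero, so no substitute value is needed.
log1p_demo = np.log1p(oldpeak_demo)

skew_before = oldpeak_demo.skew()
skew_after = log1p_demo.skew()

print("skew before: {:.2f}".format(skew_before))
print("skew after:  {:.2f}".format(skew_after))
```

Whether this works better than the 0.01 substitution on the real feature would need to be checked with `data.skew()`; the two transforms treat the zero values very differently.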

In [None]:
data = data.drop("oldpeak", axis=1)

In [None]:
data.corr()["output"]

In [None]:
data = data.drop(["trtbps", "chol", "restecg", "fbs", "age", "sex"], axis=1)

In [None]:
data.corr()["output"]

In [None]:
data.head()

### Creating X and y variables for training the model

In [None]:
X = data.drop("output", axis=1)
y = data["output"]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
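
With only a few hundred rows, a random 80/20 split can leave the test set with a slightly different class mix than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both partitions. The data here is synthetic (a hypothetical 60/40 split), not the heart dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a 60/40 class split.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 60 + [0] * 40)

# stratify=y_demo keeps the 60/40 ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```

Adding `stratify=y` to the split above would be a small change, but it would also change which rows land in the test set, so the reported scores would shift.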

### Predictive Model: Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree_model = DecisionTreeClassifier(max_depth=15, random_state=42)

In [None]:
tree_model.fit(X_train, y_train)

In [None]:
tree_predict = tree_model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

### The precision for classes 0 and 1 is in the low 0.80s, which I think is acceptable but on the low end.

In [None]:
print(classification_report(y_test, tree_predict))
print("Confusion Matrix:")
print(confusion_matrix(y_test, tree_predict))
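
The precision and recall in the classification report come straight from the four cells of the confusion matrix. Here is a short sketch of that arithmetic on made-up labels (the arrays `y_true_demo` and `y_pred_demo` are hypothetical), cross-checked against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_1 = tp / (tp + fp)  # of the predicted 1s, how many are truly 1
recall_1 = tp / (tp + fn)     # of the true 1s, how many were predicted as 1

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision (class 1): {:.2f}".format(precision_1))
print("recall (class 1):    {:.2f}".format(recall_1))
```

For a screening task like this one, recall on class 1 is arguably the number to watch, since a false negative means a higher-risk patient is missed.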

### Predictive Model: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
random_model = RandomForestClassifier(n_estimators=80, bootstrap=True, random_state=42, criterion="entropy")

In [None]:
random_model.fit(X_train, y_train)

In [None]:
random_predict = random_model.predict(X_test)

### The Random Forest yields slightly better precision for classes 0 and 1, at ~0.82, which I think is acceptable but still on the low end. Too bad my favorite model is not doing so well on this dataset.

In [None]:
print(classification_report(y_test, random_predict))
print("Confusion Matrix:")
print(confusion_matrix(y_test, random_predict))

### Predictive Model: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_model = LogisticRegression(solver="liblinear", random_state=42)

In [None]:
log_model.fit(X_train, y_train)

In [None]:
log_predict = log_model.predict(X_test)

### The logistic regression model performs the best by far, with a precision of 0.95 on class 0 and a recall of 0.97 on class 1. The f1-score of 0.88 for class 1 is very good.

In [None]:
print(classification_report(y_test, log_predict))
print("Confusion Matrix:")
print(confusion_matrix(y_test, log_predict))
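
Because the test set is small, a single train/test split can make one model look better or worse by chance. A rough way to check stability is k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification` (an assumed stand-in of about 300 rows and 7 features, not the heart dataset), with the same logistic regression settings as above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared data (assumption: ~300 rows, 7 features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

# Five stratified folds: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_model = LogisticRegression(solver="liblinear", random_state=42)

# One accuracy score per fold; the spread hints at how stable the result is.
scores = cross_val_score(cv_model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(2))
print("mean +/- std: {:.2f} +/- {:.2f}".format(scores.mean(), scores.std()))
```

Running the same call with the notebook's `X` and `y` would show whether the logistic regression lead holds across folds or depends on this particular split.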

### Predictive Model: Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gaussian_model = GaussianNB()

In [None]:
gaussian_model.fit(X_train, y_train)

In [None]:
gaussian_predict = gaussian_model.predict(X_test)

### I had high hopes for the GaussianNB model, but it scores ~0.84 overall. The recall of 0.90 on class 1 is remarkable.

In [None]:
print(classification_report(y_test, gaussian_predict))
print("Confusion Matrix:")
print(confusion_matrix(y_test, gaussian_predict))

In [None]:
report = [["GaussianNB", 0.84, 0.84, 0.84, 0.84], ["Random Forest", 0.82, 0.83, 0.82, 0.82], 
          ["DecisionTreeClassifier", 0.81, 0.81, 0.81, 0.81],
          ["LogisticRegression", 0.88, 0.87, 0.86, 0.86]]
overall_result = pd.DataFrame(report, columns=["Model", "Accuracy Score", "Precision", "Recall", "F1-score"])
overall_result.sort_values("Accuracy Score", ascending=False)
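
The scores in the table above were typed in by hand from each classification report. A sketch of building the same kind of table programmatically follows, so the numbers cannot drift from the model outputs. It uses synthetic data and default model settings; the choice of `average="weighted"` is an assumption about which row of the report the table was copied from.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split the same way as in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=15, random_state=42),
    "LogisticRegression": LogisticRegression(solver="liblinear", random_state=42),
}

rows = []
for name, model in models.items():
    # Fit on the training split and score on the held-out split.
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy Score": accuracy_score(y_te, pred),
        "Precision": precision_score(y_te, pred, average="weighted"),
        "Recall": recall_score(y_te, pred, average="weighted"),
        "F1-score": f1_score(y_te, pred, average="weighted"),
    })

summary = pd.DataFrame(rows).sort_values("Accuracy Score", ascending=False)
print(summary.round(2))
```

The same loop could take the fitted `tree_model`, `random_model`, `log_model`, and `gaussian_model` from this notebook in place of the fresh models used here.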


# Overall, Logistic Regression yields the highest score across the board on this dataset. 