# Problem Statement : 
## HEART DISEASE PREDICTION 

### Importing necessary libraries and the dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

### Attributes and their meanings

age: The person's age in years

sex: The person's sex (1 = male, 0 = female)

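cp: The type of chest pain experienced. The original UCI documentation lists Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic; in this CSV the variable is coded 0 to 3 (see the value counts in the cp analysis further down)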

trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)

chol: The person's cholesterol measurement in mg/dl

fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

thalach: The person's maximum heart rate achieved

exang: Exercise induced angina (1 = yes; 0 = no)

oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)

slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

ca: The number of major vessels (0-3) colored by fluoroscopy

thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

target: Heart disease (0 = no, 1 = yes)

### Let's take a look at the dataset 

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df.info()

# Exploratory Data Analysis

In [None]:
sns.countplot(x='target',data=df)

In [None]:
df['target'].value_counts()

### Interpretations :
The above plot confirms the findings that -
1. There are 165 patients suffering from heart disease, and
2. There are 138 patients who do not have any heart disease.

### So this is a fairly balanced dataset
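
The balance can be put into numbers. The following is a minimal sketch that rebuilds the target column from the two counts reported above instead of loading the CSV; the variable names are illustrative only.

```python
import pandas as pd

# Rebuild the target column from the counts reported above
# (165 patients with heart disease, 138 without).
target = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns class proportions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of the majority class to the minority class; values near 1 mean balanced.
class_counts = target.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real DataFrame the same proportions should come from `df['target'].value_counts(normalize=True)`.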

In [None]:
df.groupby('sex')['target'].value_counts()

In [None]:
## We can visualize the value counts of the sex variable wrt target as follows -
ax = sns.catplot(y="target", col="sex", data=df, kind="count", height=5,  palette="Accent")

### Interpretations :
We can see that the values of target variable are plotted wrt sex : (1 = male; 0 = female).

target variable also contains two integer values 1 and 0 : (1 = Presence of heart disease; 0 = Absence of heart disease)

The above plot confirms our findings that -
1. Out of 96 females - 72 have heart disease and 24 do not have heart disease.
2. Similarly, out of 207 males - 93 have heart disease and 114 do not have heart disease.
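
Raw counts are hard to compare because the two groups differ in size, so it may help to convert them to within-group rates. A small sketch using only the counts quoted above (the column labels `no_disease` and `disease` are made up for readability):

```python
import pandas as pd

# Counts taken from the interpretation above:
# females (sex=0): 72 with disease, 24 without; males (sex=1): 93 with, 114 without.
counts = pd.DataFrame(
    {"no_disease": [24, 114], "disease": [72, 93]},
    index=pd.Index([0, 1], name="sex"),
)

# Divide each row by its total to get the share of each outcome within a sex.
rates = counts.div(counts.sum(axis=1), axis=0)
print(rates.round(3))
```

In this sample, 75% of the females and about 45% of the males have heart disease. On the full DataFrame, `pd.crosstab(df['sex'], df['target'], normalize='index')` should give the same table.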

In [None]:
plt.figure(figsize=(15,6))
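# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(df['age'], kde=True, stat='density')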
plt.show()

### Interpretation :
Age is approximately normally distributed.

### Effect of AGE on Heart Disease

In [None]:
plt.figure(figsize=(15,6))
sns.boxplot(x='target',y='age',data=df,palette='viridis')
plt.title('Heart Disease by Age',size=15)
plt.xlabel('No Heart Disease (0) vs. Heart Disease (1)',size=15)
plt.ylabel('Age',size=15);

### Interpretation :
The median age of patients without heart disease is higher.
The 50-60 age range is the most critical in terms of heart disease.

In [None]:
plt.figure(figsize=(20,10))
c= df.corr()
sns.heatmap(c,cmap="Blues",annot=True);

### Interpretations:
From this correlation matrix we can see that chol and fbs are only weakly correlated with target, so on their own they do not appear to have a major impact on cardiac problems.
Apart from these two features, all the other features are correlated with heart disease to a certain extent.
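
Reading the target column off a large heatmap is error-prone, so one option is to sort the features by the strength of their correlation with the target. The sketch below uses a hypothetical helper, `rank_by_target_correlation`, and toy data rather than the heart dataset.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target="target"):
    """Return features sorted by absolute Pearson correlation with the target."""
    corr = frame.corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data (not the heart dataset): "strong" tracks the target, "weak" is noise.
rng = np.random.default_rng(0)
toy_target = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "strong": toy_target + rng.normal(0, 0.3, size=200),
    "weak": rng.normal(0, 1, size=200),
    "target": toy_target,
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```

Calling `rank_by_target_correlation(df)` on the notebook's DataFrame would list chol and fbs near the bottom if the reading of the heatmap above is right. Keep in mind that Pearson correlation only captures linear association.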

## Analysis of target vs cp

In [None]:
##Now, I will view its frequency distribution as follows :
df['cp'].value_counts()
##It can be seen that cp is a categorical variable and it contains 4 types of values - 0, 1, 2 and 3.

In [None]:
##Visualize the frequency distribution of cp variable
plt.figure(figsize=(15,6))
sns.countplot(x="cp", data=df, palette="magma")
plt.show()

In [None]:
df.groupby('cp')['target'].value_counts()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='target',data=df,palette='magma',hue='cp')
plt.title('Heart Disease vs. No Heart Disease by Chest Pain Type',size=15)
plt.xlabel('Target (0 = no heart disease, 1 = heart disease)',size=15)
plt.ylabel('Count',size=15);


### Interpretations :
There are four types of chest pain (coded 0 to 3).
Chest pain type 0 had relatively few heart disease cases. It also had the most patients without heart disease.
Chest pain type 2 had the most heart disease cases.
Chest pain type 3 had the fewest heart disease cases.

So we can conclude that chest pain type 2 is the most common type among patients with heart disease. Chest pain is a symptom here, not a cause.

## Analysis of target vs thalach

In [None]:
sns.catplot(data = df, x = 'sex', y = 'thalach', palette = 'colorblind', hue = 'target')

In [None]:
sns.boxplot(x="target", y="thalach", data=df)
plt.show()

### Interpretation :
We can see that those people suffering from heart disease (target = 1) have relatively higher heart rate (thalach) as compared to people who are not suffering from heart disease (target = 0).

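A heart rate above the normal range is called tachycardia. Note that thalach records the maximum heart rate achieved rather than the resting rate.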

## Analysis of target vs exang

In [None]:
df.groupby('exang')['target'].value_counts()

In [None]:
sns.catplot(kind = 'bar', data = df, y = 'exang', x = 'sex', hue = 'target',palette='PuBuGn_r')
plt.show()

### Interpretation :
Exercise-induced angina (exang) is more common in men. This might be contributing to heart disease in men.

### Effect of FBS on Heart Disease

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='fbs',data=df,palette='Set3',hue='target')
plt.xlabel('0-> fbs <120 , 1-> fbs>120',size=15)
plt.ylabel('Count',size=15);


### Interpretation :
People having fbs < 120 have more chance of having heart disease than people having fbs > 120.

### Effect of RESTECG on Heart Disease

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='restecg',data=df,palette='cividis',hue='target')
plt.xlabel('resting electrocardiographic',size=15)
plt.ylabel('Count',size=15);


### Interpretation :
An electrocardiogram (ECG) is a test which measures the electrical activity of your heart to show whether or not it is working normally. An ECG records the heart's rhythm and activity on a moving strip of paper or a line on a screen. 
 
With the above graph as a reference, we can say that if the resting electrocardiographic result is 1, i.e. ST-T wave abnormality, then the person has a higher chance of suffering from heart disease.

### Effect of SLOPE on Heart Disease

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='slope',data=df,palette='BuGn',hue='target')
plt.xlabel('peak exercise ST segment',size=15)
plt.ylabel('Count',size=15);


### Interpretation :
The feature (the slope of the peak exercise ST segment) has three symbolic values (flat, downsloping, upsloping).

People with an upsloping ST segment are more prone to heart disease than those with a flat or downsloping one.

In [None]:
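# factorplot was renamed to catplot in seaborn 0.9; kind='point' matches its old default.
# catplot creates its own figure, so the size is set through height and aspect.
sns.catplot(kind='point',y='trestbps',data=df,x='cp',hue='target',palette='Set2',height=6,aspect=2.5)
plt.title('Trestbps V/S Chest Pain',size=15);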

### Interpretation :
As the chest pain type code increases, resting blood pressure also tends to increase, along with the chances of heart disease.

In [None]:
sns.catplot(data = df, x = 'sex', y = 'oldpeak', palette = 'colorblind', hue = 'target');

### Interpretation :
Based on the above plot, we can conclude that people with oldpeak below 2 have a higher chance of having heart disease.

# Model Building

# Logistic Regression :
Let us make our first model to predict the target variable. We will start with Logistic Regression.

Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.

Logistic regression models the logit, that is, the log of the odds in favor of the event, as a linear function of the independent variables.

Inverting the logit gives the sigmoid function, an S-shaped curve that maps the linear score to a probability estimate and closely resembles the required step function.
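
To make the relationship between the log of odds and the S-shaped curve concrete, here is a minimal NumPy sketch. The function names `logit` and `sigmoid` are defined here for illustration and are not part of the notebook's code.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real score z to a probability."""
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
p = sigmoid(z)
print(p)          # probabilities rise from near 0 to near 1, with 0.5 at z = 0
print(logit(p))   # recovers the original scores (up to floating-point error)
```

A fitted logistic regression computes a linear score from the features and passes it through this sigmoid; predicting class 1 when the probability exceeds 0.5 is the same as checking whether the score is positive.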


## Splitting Train and Test data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x = df.drop('target',axis=1)
y = df['target']

In [None]:
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.30, random_state=40)

In [None]:
x_train.shape, x_cv.shape, y_train.shape, y_cv.shape

### Standard Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(x_train)

X_test_scaled = scaler.transform(x_cv)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Initialize Model

In [None]:
model = LogisticRegression(random_state=40)

### Fitting train data into model

In [None]:
model.fit(X_train_scaled, y_train)

### Predicting Accuracy

In [None]:
pred_cv = model.predict(X_test_scaled)

In [None]:
accuracy_score(y_cv, pred_cv)

### Confusion Matrix

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_cv, pred_cv)
print(cm)

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, fmt="d")
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')

In [None]:
# import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_cv, pred_cv))
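
The figures in the classification report can be derived by hand from a confusion matrix. The sketch below uses made-up counts (not the notebook's actual output) in scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cm_example = np.array([[40, 5],
                       [3, 43]])
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a medical screening task, recall on the disease class is often the figure to watch, since a false negative means a missed patient.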

# KNN Classifier :
K-Nearest Neighbors (KNN) is one of the simplest algorithms used in Machine Learning for regression and classification problems.

KNN algorithms classify new data points based on similarity measures (e.g. a distance function) to the stored training data.

Classification is done by a majority vote of the nearest neighbours.
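
The distance-then-vote idea fits in a few lines. This is a from-scratch sketch on toy points, assuming Euclidean distance; the helper `knn_predict` is hypothetical and is not how scikit-learn's `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy points (not from the heart dataset): two small clusters.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1])))
print(knn_predict(X_toy, y_toy, np.array([2.9, 3.0])))
```

Because the vote depends on raw distances, features on larger scales dominate unless the data is standardized first, which is why scaling comes before KNN.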

### Standard Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
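features = df.drop('target',axis=1)
scaler = StandardScaler()
# Caution: the scaler is fitted on the full dataset before the split below,
# so the test rows influence the scaling (a mild form of data leakage).
scaled_features = scaler.fit_transform(features)
df_feat = pd.DataFrame(scaled_features,columns=features.columns)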


### Splitting Train and Test data

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
# No random_state is set here, so the split (and the accuracy below) varies between runs.
X_train, X_test, Y_train, Y_test = train_test_split(scaled_features,df['target'],
                                                    test_size=0.30)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [None]:
accuracy_rate = []

for i in range(1,41):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    score=cross_val_score(knn,df_feat,df['target'],cv=10)
    accuracy_rate.append(score.mean())


In [None]:
plt.figure(figsize=(10,6))

plt.plot(range(1,41),accuracy_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Accuracy Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy Rate')
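
Instead of reading the best K off the plot, it can be picked programmatically. The sketch below is one possible variant, not the notebook's method: it wraps the scaler and the classifier in a pipeline so that scaling is refitted inside each fold, and it runs on synthetic data from `make_classification` as a stand-in for the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

k_values = list(range(1, 41, 2))   # odd K avoids ties in a two-class vote
mean_scores = []
for k in k_values:
    # The pipeline refits the scaler on the training part of every fold.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores.append(cross_val_score(pipe, X_demo, y_demo, cv=10).mean())

best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, round(max(mean_scores), 3))
```

The accuracy curve is usually fairly flat around its maximum, so a nearby K with a similar score is an equally defensible choice.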

### Initialize Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=31)

### Fitting train data into model

In [None]:
knn.fit(X_train,Y_train)

### Predicting Accuracy

In [None]:
pred = knn.predict(X_test)
accuracy_score(Y_test, pred)


### Confusion Matrix

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(Y_test, pred)
print(cm)

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, fmt="d")
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')

In [None]:
# import classification_report
from sklearn.metrics import classification_report
print(classification_report(Y_test, pred))

# Random Forest Classifier :
RandomForest is a tree-based bagging (bootstrap aggregating) algorithm wherein a certain number of weak learners (decision trees) are combined to make a powerful prediction model.

For every individual learner, a random sample of rows and a few randomly chosen variables are used to build a decision tree model.

Final prediction can be a function of all the predictions made by the individual learners.

In case of a regression problem, the final prediction can be the mean of all the predictions; for classification, it is typically a majority vote across the trees.

There are some parameters worth exploring with the sklearn RandomForestClassifier:
n_estimators
max_depth

n_estimators = usually, the bigger the forest the better; adding trees carries little risk of overfitting, but the gains level off while training time keeps growing. We will use the value of 600.

max_depth = maximum depth of each tree (default None, leading to fully grown trees); reducing the maximum depth helps fight overfitting. We will limit it to 2.
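
The effect of limiting tree depth can be checked by comparing train and test accuracy. The following sketch uses synthetic data from `make_classification` as a stand-in for the heart dataset and a smaller forest (100 trees) to keep it quick; those choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

results = {}
for depth in (2, None):   # None lets every tree grow fully
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    # A large gap between train and test accuracy is a sign of overfitting.
    results[depth] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))
    print(depth, results[depth])
```

Fully grown trees tend to fit the training rows almost perfectly, so the train/test gap is the figure to compare between the two settings.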


### Splitting Train and Test data

In [None]:

from sklearn.model_selection import train_test_split

In [None]:
x = df.drop('target',axis=1)
y = df['target']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30,random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Initialize model

In [None]:
rf= RandomForestClassifier(n_estimators =600, random_state=42,max_depth=2) 

### Fitting train data into model

In [None]:
rf.fit(x_train,y_train) 


### Predicting Accuracy

In [None]:
pred = rf.predict(x_test)
accuracy_score(y_test, pred)

### Confusion Matrix

In [None]:
# import confusion_matrix
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, pred)
print(cm)

# f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(cm, annot=True, fmt="d")
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')

In [None]:
# import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

### Conclusion :
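After trying and testing 3 different algorithms, the best accuracy on the dataset is achieved by Logistic Regression (92.30%), followed by RandomForest (87.91%) and KNN Classifier (83.51%). Each model was evaluated on a different 70/30 split (random_state=40, random_state=42 and an unseeded split, respectively), so the ranking is indicative rather than a strict comparison.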
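
One way to put the three models on an equal footing is to score them on the same cross-validation folds. The sketch below is an assumption-laden illustration: it uses synthetic data instead of the heart dataset, 100 trees instead of 600 to keep it quick, and the dictionary keys are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# One fixed set of folds so every model is scored on the same splits.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=31)),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42),
}
cv_scores = {name: cross_val_score(est, X_demo, y_demo, cv=folds).mean()
             for name, est in candidates.items()}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

With only about 300 rows, differences of a few percentage points between models can come from the split alone, so averaged fold scores give a steadier picture than a single hold-out set.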