# 1. BUSINESS UNDERSTANDING

### Objective: Predict whether a person is prone to heart attack or not based on the information available.

* a) Perform Exploratory Data Analysis on the information / dataset available to gather insights around it. 
* b) Additionally, predict whether a person is prone to heart attack or not.

In [None]:
from IPython.display import Image
Image("../input/heartpredictionimage/HeartAttackPrediction_Image.png")

# 2. DATA UNDERSTANDING

#### There are 2 files provided as inputs.
* o2Saturation.csv
* heart.csv

#### The dataset features are described below.

* age : Age of the patient
* sex : Sex of the patient
    * 1: Male
    * 0: Female
* cp : Chest Pain type
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl fetched via BMI sensor
* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    * 1: True (i.e. Fasting Blood Sugar > 120mg/dl)
    * 0: False (i.e. Fasting Blood Sugar <= 120mg/dl)
* restecg : resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalachh : maximum heart rate achieved
* exng : exercise induced angina (1 = yes; 0 = no)
* oldpeak : ST depression induced by exercise relative to rest (listed as "previous peak" in the dataset description)
* slp: the slope of the peak exercise ST segment 
* caa: number of major vessels (0-4) - 0/1/2/3/4
* thall: thallium stress result - 0/1/2/3 etc
* output: (This is the TARGET variable)
    * 0= less chance of heart attack 
    * 1= more chance of heart attack 
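
The coded values listed above can be turned into readable labels when inspecting rows. The following is a minimal sketch, assuming a small hypothetical frame; the `CP_LABELS` and `SEX_LABELS` names are illustrative and not part of the dataset.

```python
import pandas as pd

# Label lookups copied from the feature list above.
CP_LABELS = {0: "typical angina", 1: "atypical angina", 2: "non-anginal pain", 3: "asymptomatic"}
SEX_LABELS = {0: "Female", 1: "Male"}

# Hypothetical rows for illustration only.
demo = pd.DataFrame({"sex": [1, 0, 1], "cp": [3, 1, 0]})

# Series.map replaces each code with its label (unknown codes become NaN).
readable = demo.assign(sex=demo["sex"].map(SEX_LABELS), cp=demo["cp"].map(CP_LABELS))
print(readable)
```

The numeric codes should still be used for modelling; the labels are only a convenience for reading tables and charts.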


# 2a. Get required libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandas_profiling as pp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time

%matplotlib inline
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 2b. Read from Datasets

### O2 Saturation Dataset

In [None]:
o2sat = pd.read_csv("../input/heart-attack-analysis-prediction-dataset/o2Saturation.csv")
o2sat.head()

In [None]:
o2sat.mean()

In [None]:
o2sat.value_counts()

In [None]:
o2sat.describe()

In [None]:
plt.figure(figsize=(14,6))
sns.histplot(data=o2sat['98.6'])

We can observe that approximately 89% of the values fall between 96.5 and 98.6. The most common value is 98.6.
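
The column is named `98.6` most likely because the CSV has no header row, so pandas promotes the first reading to a column name and drops it from the data. The sketch below illustrates this under that assumption; the sample values and the `o2_saturation` column name are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a header-less, single-column CSV of readings.
raw = "98.6\n98.6\n97.5\n96.5\n99.1\n"

# Default parsing treats the first reading as the column name.
inferred = pd.read_csv(io.StringIO(raw))

# header=None keeps every reading; names= supplies an explicit column label.
explicit = pd.read_csv(io.StringIO(raw), header=None, names=["o2_saturation"])

# Share of readings inside a closed interval, as a fraction of all rows.
share_in_range = explicit["o2_saturation"].between(96.5, 98.6).mean()

print(list(inferred.columns), len(inferred), len(explicit))
print(f"{share_in_range:.0%} of readings fall in [96.5, 98.6]")
```

With an explicit column name, a range share such as the one quoted above can be checked with `Series.between` instead of being read off the histogram.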

### Heart Attack Dataset

In [None]:
heart = pd.read_csv("../input/heart-attack-analysis-prediction-dataset/heart.csv")
heart.head()

# 2c. Profile Report Analysis to understand features, distributions & correlations

This is one example of AutoEDA (Automated Exploratory Data Analysis), used here for quick insights. Other AutoEDA libraries such as SweetViz, LUX, AutoViz, DataPrep, or DTale could be used in the same way if they suit the context.

In any case, these tools provide quick insights and save time. After that, we can look in more depth at specific areas, such as correlation or interaction between features, and act on the findings as part of the Data Preparation / Feature Engineering steps.

In [None]:
pp.ProfileReport(heart, explorative = True)

### Observations in the Heart Dataset:
* 14 columns/features and 303 rows/observations
* No missing values
* 1 duplicate row
* The "oldpeak" column has about 32.7% zero values
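
These observations can also be cross-checked directly with pandas, without the profile report. The following is a minimal sketch on a hypothetical miniature frame named `sample`; in the notebook the same calls would be applied to `heart`.

```python
import pandas as pd

# Hypothetical miniature frame; the real notebook would pass `heart` instead.
sample = pd.DataFrame({
    "age": [63, 37, 41, 41],
    "oldpeak": [2.3, 0.0, 1.4, 1.4],
    "output": [1, 1, 0, 0],
})

n_rows, n_cols = sample.shape
missing_per_column = sample.isna().sum()        # missing values per column
n_duplicates = int(sample.duplicated().sum())   # rows identical to an earlier row
zero_share = (sample["oldpeak"] == 0).mean()    # fraction of zero values

print(n_rows, n_cols, int(missing_per_column.sum()), n_duplicates, zero_share)
```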
    

#### The Profile report gave us a first view of the features and their distributions. Next, we do further EDA and Data Preparation (pre-processing plus more charts / visualizations) before making the data ready for the model development phase.


# 3. DATA PREPARATION 

# 3a. Since there are duplicates, let's remove them.

In [None]:
heart.shape # Get quick snapshot of number of rows and features.

In [None]:
heart[heart.duplicated()] #understand which row is duplicate

In [None]:
df_processed = heart.copy()

df_processed.drop_duplicates(inplace = True)
df_processed.reset_index(drop = True, inplace = True)
df_processed.shape

In [None]:
df_processed[df_processed.duplicated()]

#### So, we have removed the 1 duplicate. We will proceed with this dataframe. (df_processed)

# 3b. Let's visualize through a histogram

In [None]:
df_processed.hist(figsize=(18,10))
plt.show()

# 3b. Correlation using Heatmap

In [None]:
plt.figure(figsize=(16,8))
sns.heatmap(df_processed.corr(),annot=True,cmap="PuBuGn")

Overall, there is no strong correlation between the variables.

Output (target variable) is most strongly correlated with cp, thalachh, slp (positively) and with exng, oldpeak, caa (negatively).
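
Reading these relationships off a heatmap can be error-prone, so it may help to rank the correlations with the target directly. The following is a minimal sketch on synthetic data (the column names mirror the dataset, the values do not); in the notebook the same expression would be applied to `df_processed`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names mirror the dataset.
rng = np.random.default_rng(0)
n = 200
output = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "thalachh": 140 + 15 * output + rng.normal(0, 10, n),   # rises with the target
    "oldpeak": 2.0 - 1.0 * output + rng.normal(0, 0.5, n),  # falls with the target
    "chol": rng.normal(240, 40, n),                         # unrelated noise
    "output": output,
})

# Pearson correlation of every feature with the target, sorted by value.
target_corr = demo.corr()["output"].drop("output").sort_values()
print(target_corr)
```

The most negative features appear first and the most positive last, which makes both groups easy to list.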


# 3c. Analysis on "Sex" feature

In [None]:
df_processed.sex.value_counts(normalize=True)

* Male ~ 68.2%
* Female ~ 31.8%

# 3d. Analysis on "Age" feature

In [None]:
df_processed.age.hist(figsize=(16,8),bins=30)

In [None]:
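def age_category_values(df):
    # Count ages in 5-year bins; returns a list of ([b-5, b], count) pairs
    p=round(df.max()/5)*5
    q=round(df.min()/5)*5
    # Extend the upper edge by two steps so the maximum age always lands in a bin
    L=[i for i in range(q,p+10,5)]
    dicts={}
    M=[]
    for a in range(len(L)):
        dicts[L[a]]=0
    for j in df:
        # Assign each age to the first bin whose upper edge exceeds it
        for k in L:
            if j<k:
                dicts[k]+=1
                break
    for b in dicts:
        M.append(([b-5,b],dicts[b]))
    return M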

age_category_values(df_processed.age)

In [None]:
s=0
for i in df_processed.age:
    if 40 <= i:
        s+=1
x = len(df_processed)
print(100*s/x)

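#### Interpretation: Around 95% of the people in this dataset are aged 40 or above. Note that this describes the age distribution of the sample, not the heart attack rate among people over 40.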
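
The hand-written binning and counting above can also be expressed with `pandas.cut` and a boolean mean. The following is a minimal sketch on hypothetical ages; the bin edges are an assumption chosen to cover the sample.

```python
import pandas as pd

# Hypothetical ages; the notebook would use df_processed.age instead.
ages = pd.Series([29, 34, 41, 45, 52, 58, 61, 67, 77])

# Edges every 5 years; right=False makes each bin closed on the left, e.g. [30, 35).
edges = list(range(25, 85, 5))
age_bins = pd.cut(ages, bins=edges, right=False)

# Count per bin, kept in edge order rather than sorted by frequency.
bin_counts = age_bins.value_counts(sort=False)
print(bin_counts)

# Share of people aged 40 or older, in percent.
share_40_plus = (ages >= 40).mean() * 100
print(f"{share_40_plus:.1f}%")
```

Making the interval closure explicit avoids ambiguity about which bin a boundary age such as 40 belongs to.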

# 3e. Analysis on "CP" feature

### CP - Chest Pain Type

* Value 0: Typical angina: chest pain related to decreased blood supply to the heart
* Value 1: Atypical angina: chest pain not related to heart
* Value 2: Non-anginal pain: typically esophageal spasms (non heart related)
* Value 3: Asymptomatic: chest pain not showing signs of disease

In [None]:
df_processed.cp.value_counts()

In [None]:
df_processed.cp.hist()

In [None]:
df_processed.cp.value_counts(normalize=True)

# 3f. Analysis on "trtbps" feature

### trtbps - Resting Blood Pressure

In [None]:
df_processed.trtbps.hist()

In [None]:
def category_values(df, step):
    p = round(df.max()/step)*step
    q = round(df.min()/step)*step
    L=[i for i in range(q,p+(2*step),step)]
    dicts={}
    M=[]
    for a in range(len(L)):
        dicts[L[a]]=0
    for j in df:
        for k in L:
            if j<k:
                dicts[k]+=1
                break
    for b in dicts:
        M.append(([b-step,b],dicts[b]))
    return M

category_values(df_processed.trtbps,10)

In [None]:
sns.histplot(data=df_processed,x="trtbps", bins=(80,90,100,110,120,130,140,150,160,170,180,190,200))

# Processing with Dummy variables

### Some categorical variables need to be converted into dummy variables, and the numeric values need to be scaled, before training the Machine Learning models. We use the get_dummies method to create dummy columns for the categorical variables.
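
To make the effect of `get_dummies` concrete, here is a minimal sketch on a hypothetical three-row frame. Each listed column is replaced by one indicator column per distinct code.

```python
import pandas as pd

# Hypothetical miniature frame with two coded categorical columns.
demo = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "sex": [1, 1, 0],
})

# Each listed column is replaced by one indicator column per distinct code.
encoded = pd.get_dummies(demo, columns=["cp", "sex"])
print(encoded.columns.tolist())
```

Columns not listed in `columns` (here `age`) pass through unchanged.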

In [None]:
df_processed.info()

In [None]:
df1_processed = pd.get_dummies(df_processed, columns = ['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall'])

In [None]:
df1_processed.head()

# Scaling

* The following features/columns need to be normalized/scaled:
    * age
    * trtbps
    * chol
    * thalachh
    * oldpeak

In [None]:
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()

columns_for_scaling = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

df1_processed[columns_for_scaling] = standardScaler.fit_transform(df1_processed[columns_for_scaling])

In [None]:
df1_processed.head()

#### Now, we can see that the features are scaled appropriately.
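
One caveat: fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. A common alternative, sketched here on synthetic data as an assumption rather than as this notebook's method, wraps the scaler and the model in a scikit-learn pipeline so that scaling is re-fitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=124)

# The scaler is fitted only on the training part of every CV split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
cv_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=10)
print(round(cv_scores.mean(), 3))
```

On a dataset of this size the difference is usually small, but the pipeline keeps the cross-validated score an honest estimate.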

# 4. MODEL DEVELOPMENT

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

from sklearn.linear_model import LogisticRegression     # Logistic Regression
from sklearn.neighbors import KNeighborsClassifier      # KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier     # Random Forest
from sklearn.ensemble import GradientBoostingClassifier # GBM
import xgboost as xgb
from xgboost import XGBClassifier                       # XGBoost
from lightgbm import LGBMClassifier                     # Light GBM

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import ShuffleSplit, GridSearchCV
#from sklearn.metrics import mean_squared_error, r2_score

### Let's prepare Independent and Target variables 

In [None]:
SEED = 124
x = df1_processed.drop("output",axis=1)
target = df1_processed["output"]

x_train,x_test,y_train,y_test = train_test_split(x,target,test_size=0.25,random_state = SEED)

In [None]:
x_train.shape

In [None]:
x_test.shape

# 4a. Classification Models - kNearestNeighbor

In [None]:
# For us, x --> Independent Feature Set, target --> Target feature
# We will try with 10 fold Cross Validation of dataset

knn_scores = []
for k in range(1,21): 
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    score = cross_val_score(knn_classifier,x,target,cv=10)
    knn_scores.append(score.mean())

In [None]:
plt.plot([k for k in range(1, 21)], knn_scores, color = 'blue')
for i in range(1,21):
    plt.text(i, round(knn_scores[i-1],3), (i, round(knn_scores[i-1],2)))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')

In [None]:
# We are trying to execute with the best k value found above
knn_classifier = KNeighborsClassifier(n_neighbors = 10)
score = cross_val_score(knn_classifier,x,target,cv=10)

In [None]:
score.mean() ## Accuracy score from kNN

# 4b. Random Forest Classifier

In [None]:
rf_classifier= RandomForestClassifier(n_estimators = 10)

score = cross_val_score(rf_classifier,x,target,cv=10)

In [None]:
score.mean() ## Accuracy score from RF

# 4c. Light GBM Classifier

In [None]:
lgbm_classifier = LGBMClassifier()

score = cross_val_score(lgbm_classifier,x,target,cv=5)


In [None]:
score.mean() ## Accuracy score from LightGBM

# 5. MODEL TUNING

Let's fine-tune the hyperparameters and explore which combinations work best.
We will then re-fit each model with the best parameter values and evaluate its performance.

In [None]:
def print_score(classifier, x_train, y_train, x_test, y_test, train=True):
    if train:
        pred = classifier.predict(x_train)
        classifier_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n--------------------------------------------")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Classification Report:\n{classifier_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = classifier.predict(x_test)
        classifier_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n---------------------------------------------")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Classification Report:\n{classifier_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

In [None]:
x_train.head()

In [None]:
y_train.head()

# 5.1 Hyperparameter Tuning for kNN
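
The cell in this section fits kNN with a fixed `n_neighbors`. One way to search over that value instead is a grid search on the training split. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `knn_grid` are illustrative names, and in the notebook `x_train` / `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for x and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=124)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=124)

# Try every k from 1 to 20 with 5-fold cross-validation on the training split.
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    scoring="accuracy",
    cv=5,
)
knn_grid.fit(X_tr, y_tr)

best_k = knn_grid.best_params_["n_neighbors"]
test_accuracy = knn_grid.score(X_te, y_te)  # uses the refitted best estimator
print(best_k, round(test_accuracy, 3))
```

Selecting `k` on the training split only keeps the test set untouched until the final evaluation.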

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors=20)
knn_classifier.fit(x_train, y_train)

print_score(knn_classifier, x_train, y_train, x_test, y_test, train=True)
print_score(knn_classifier, x_train, y_train, x_test, y_test, train=False)

In [None]:
test_score = accuracy_score(y_test, knn_classifier.predict(x_test)) * 100
train_score = accuracy_score(y_train, knn_classifier.predict(x_train)) * 100

results_df = pd.DataFrame(data=[["Tuned k-Nearest Neighbors", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df

# 5.2 Hyperparameter Tuning for Logistic Regression

In [None]:
from sklearn.model_selection import GridSearchCV

params = {"C": np.logspace(-4, 4, 20),
          "solver": ["liblinear"]}

lr_classifier = LogisticRegression()

lr_cv = GridSearchCV(lr_classifier, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
lr_cv.fit(x_train, y_train)
best_params = lr_cv.best_params_
print(f"Best parameters: {best_params}")
lr_classifier = LogisticRegression(**best_params)

lr_classifier.fit(x_train, y_train)

print_score(lr_classifier, x_train, y_train, x_test, y_test, train=True)
print_score(lr_classifier, x_train, y_train, x_test, y_test, train=False)

In [None]:
test_score = accuracy_score(y_test, lr_classifier.predict(x_test)) * 100
train_score = accuracy_score(y_train, lr_classifier.predict(x_train)) * 100

tuning_results_df = pd.DataFrame(data=[["Tuned Logistic Regression", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

z = pd.concat([results_df, tuning_results_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
z

# 5.3 Hyperparameter Tuning for Random Forest

In [None]:
n_estimators = [100]
max_features = ['sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
max_depth = [5]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

params_rf = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

rf_classifier = RandomForestClassifier(random_state = SEED)

rf_cv = GridSearchCV(rf_classifier, params_rf, scoring="accuracy", cv=3, verbose=2, n_jobs=-1)


rf_cv.fit(x_train, y_train)
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")

rf_classifier = RandomForestClassifier(**best_params, random_state = SEED)  # keep the seed used during the search
rf_classifier.fit(x_train, y_train)

print_score(rf_classifier, x_train, y_train, x_test, y_test, train=True)
print_score(rf_classifier, x_train, y_train, x_test, y_test, train=False)

In [None]:
test_score = accuracy_score(y_test, rf_classifier.predict(x_test)) * 100
train_score = accuracy_score(y_train, rf_classifier.predict(x_train)) * 100

results_df = pd.DataFrame(data=[["Tuned Random Forest Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
z = pd.concat([z, results_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
z

# 5.4 Hyperparameter Tuning for XGBoost

In [None]:
n_estimators = [100]
max_depth = [2, 3, 5]
booster = ['gbtree', 'gblinear']
base_score = [0.99]
learning_rate = [0.05]
min_child_weight = [1, 2, 3]

params = {
    'n_estimators': n_estimators, 'max_depth': max_depth,
    'learning_rate' : learning_rate, 'min_child_weight' : min_child_weight, 
    'booster' : booster, 'base_score' : base_score
                      }

xgb_classifier = XGBClassifier()

xgb_cv = GridSearchCV(xgb_classifier, params, cv=3, scoring = 'accuracy',n_jobs =-1, verbose=1)


xgb_cv.fit(x_train, y_train)
best_params = xgb_cv.best_params_
print(f"Best parameters: {best_params}")

xgb_classifier = XGBClassifier(**best_params)
xgb_classifier.fit(x_train, y_train)

print_score(xgb_classifier, x_train, y_train, x_test, y_test, train=True)
print_score(xgb_classifier, x_train, y_train, x_test, y_test, train=False)

In [None]:
test_score = accuracy_score(y_test, xgb_classifier.predict(x_test)) * 100
train_score = accuracy_score(y_train, xgb_classifier.predict(x_train)) * 100

results_df = pd.DataFrame(data=[["Tuned XGBoost Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
z = pd.concat([z, results_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
z

#### The training and testing accuracy percentages of the tuned models are captured above.
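
A comparison table like this can also be built in one step by collecting one row per model and constructing the frame at the end. The following is a minimal sketch; the model names and accuracy figures are hypothetical placeholders, not results from this notebook.

```python
import pandas as pd

# Hypothetical accuracy figures, for illustration only.
rows = [
    {"Model": "Model A", "Training Accuracy %": 90.0, "Testing Accuracy %": 85.0},
    {"Model": "Model B", "Training Accuracy %": 99.0, "Testing Accuracy %": 82.0},
]

# Build the comparison table once from the collected rows.
comparison = pd.DataFrame(rows)

# A large train/test gap hints at overfitting.
comparison["Gap (p.p.)"] = comparison["Training Accuracy %"] - comparison["Testing Accuracy %"]
comparison = comparison.sort_values("Testing Accuracy %", ascending=False, ignore_index=True)
print(comparison)
```

Sorting by testing accuracy and showing the gap side by side makes it easier to spot a model that scores well on training data but generalizes less well.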

#### We will continue to explore further with multiple experiments.

# 6. Conclusion / Interpretations:

* Our objective was to predict whether a person is prone to heart attack based on the available dataset. We approached it with initial data analysis, followed by feature engineering and a few modelling methods.
* We analyzed a few algorithms and compared their accuracy. Both training and testing accuracy are reported to give a feel for how each model performs, though the testing accuracy is the figure of interest.
* We will experiment further with additional feature engineering and models to see what works better and why.
* In any business problem, the data and the context/need must be examined before stating which algorithm will perform better in a given scenario. Time also matters, so there is a trade-off between the time spent and how close to optimal the solution is.
* EDA and feature engineering are important and will always take most of the effort.