
The main objective is to design a machine learning solution that predicts the chance of a heart attack based on personal health information.

Our ML solution will help answer the following questions:

1. How can data analysis techniques be used to understand heart attack data?
2. What are the most significant risk factors for a heart attack?
3. How can we identify people with a low or high chance of a heart attack?




In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport


from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_recall_curve

from sklearn.model_selection import train_test_split
from mlxtend.evaluate import bias_variance_decomp

import warnings
warnings.filterwarnings("ignore")

# set the pandas dataframe option to display max number of columns
pd.set_option("display.max_columns", 101)

## 1. Data Preparation
#### 1.1 Read the Heart Attack data from the heart csv file

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# read the data from the heart file
heart_attack_df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

print('The shape of the heart_attack_df is : ', heart_attack_df.shape)
print('\n Display of the first 5 rows of the data \n')
print(heart_attack_df.head())

#### 1.2 Data Description 
Column Name	| Description
:-|:-
**age**| Age of the patient
**sex**| Sex of the patient
**cp**| Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic
**trtbps**| Resting blood pressure (in mm Hg)
**chol**| Cholesterol in mg/dl fetched via BMI sensor
**fbs**| (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False
**restecg**| Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy
**thalachh**| Maximum heart rate achieved
**oldpeak**| Previous peak (ST depression induced by exercise relative to rest)
**slp**| Slope of the peak exercise ST segment
**caa**| Number of major vessels
**thall**| Thallium Stress Test result ~ (0,3)
**exng**| Exercise induced angina ~ 1 = Yes, 0 = No
**output**| Target variable (0 = less chance of heart attack, 1 = more chance of heart attack)

In [None]:
# replace acronyms with full, human-friendly names
heart_attack_df.rename(columns = {'cp':'chest_pain_type', 'trtbps':'resting_blood_pressure',
                                  'chol':'cholestoral', 'fbs':'fasting_blood_sugar',
                                  'restecg': 'resting_ecg_results',
                                  'thalachh': 'maximum_heart_rate_achieved',
                                  'oldpeak': 'previous_peak',
                                  'slp': 'slope', 
                                  'caa': 'number_of_major_vessels',
                                  'thall': 'thalium_stress_test_result',
                                  'exng': 'exercise_induced_angina'}, inplace = True)

# Generate descriptive statistics
heart_attack_df.describe().transpose()

In [None]:
# Explore null values counts and column types
heart_attack_df.info()

### 2. Data Profiling
The main advantage of pandas profiling is that it automates the descriptive statistics for a dataset.

For each column, the following statistics are presented in an interactive HTML report:

* Type inference and essentials: type, unique values, missing values
* Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values and histogram
* Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* Missing values: matrix, count, heatmap and dendrogram of missing values
* Text analysis, file and image analysis

In [None]:
# Create the pandas profiling report
ProfileReport(heart_attack_df).to_widgets()

#### 2.1 Missing Values Treatment

There are no missing values in the Heart Attack data. Missing Values Treatment isn't required.

#### 2.2 Dropping duplicated rows

Duplicate records add no new information, so we drop them from the analysis.

In [None]:
print('Heart data dataframe shape before removing the duplicates: ',heart_attack_df.shape)
heart_attack_df.drop_duplicates(inplace=True) 
heart_attack_df.reset_index(drop=True, inplace=True)
print('Heart data dataframe shape after removing the duplicates: ',heart_attack_df.shape)

#### 2.3 Outlier Treatment

Data profiling visualizations help us detect outliers. A common rule flags any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR as an outlier. Another approach treats values outside the range between the 5th and 95th percentiles as outliers.

Here the dataset is small, and we cannot be sure that the values flagged by these detection techniques are errors rather than genuine readings. For that reason, we do not perform outlier treatment.
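
Although no outlier treatment is applied in this notebook, the IQR rule described above could be sketched as follows. The helper name `iqr_outlier_mask` and the sample readings are hypothetical and only illustrate the rule; they are not taken from heart.csv.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical cholesterol-like readings (illustrative values only)
sample = [180, 195, 200, 210, 220, 230, 240, 250, 560]
mask = iqr_outlier_mask(sample)
print(np.asarray(sample)[mask])  # values flagged by the rule
```

A mask like this only flags candidates; whether to drop, cap or keep them remains a judgment call, which is why the notebook leaves the data untouched.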

#### 2.4 Bi-Variate Analysis

Bi-Variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables.
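
As one possible illustration of testing association at a pre-defined significance level, the sketch below computes a Pearson chi-square statistic for two categorical variables. The contingency counts are hypothetical, the helper `chi_square_statistic` is not part of this notebook, and no continuity correction is applied.

```python
import numpy as np

def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table (no continuity correction)."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: (row total * column total) / grand total
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Hypothetical 2x2 counts: rows = a binary feature, columns = output (0, 1)
table = [[30, 10],
         [15, 25]]
chi2 = chi_square_statistic(table)

# Critical value of the chi-square distribution with 1 degree of freedom at alpha = 0.05
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841
print(f"chi2 = {chi2:.3f}, associated at 5% level: {chi2 > CRITICAL_VALUE_DF1_ALPHA_05}")
```

For a continuous variable against the binary target, a two-sample test on the two output groups would play the same role.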

In [None]:
# Define the continuous and categorical column names
conti_cols = [i for i in heart_attack_df.columns if heart_attack_df[i].nunique()>5]
cat_cols = [i for i in heart_attack_df.columns if heart_attack_df[i].nunique()<=5]

cat_cols = list(set(cat_cols) - set(['output']))
print('Continuous column names: ', conti_cols)
print('\nCategorical column names: ', cat_cols)

#### 2.4.1 Plot Categorical Columns vs Output(target variable) distributions using Countplot 

In [None]:
plt.figure(figsize=(20,30)).patch.set_facecolor("#e6f7ff") 
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)

# Pass the column as the x keyword: recent seaborn releases no longer accept it positionally
plt.subplot(4,2,1)
plt.title('Prevalence of Heart attack by Sex',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['sex'])

plt.subplot(4,2,2)
plt.title('Prevalence of Heart attack by Chest Pain',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['chest_pain_type'])

plt.subplot(4,2,3)
plt.title('Prevalence of Heart attack by fasting blood sugar > 120 mg/dl',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['fasting_blood_sugar'])

plt.subplot(4,2,4)
plt.title('Prevalence of Heart attack by restecg',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['resting_ecg_results'])

plt.subplot(4,2,5)
plt.title('Prevalence of Heart attack by Exercise induced angina',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['exercise_induced_angina'])

plt.subplot(4,2,6)
plt.title('Prevalence of Heart attack by slp',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['slope'])

plt.subplot(4,2,7)
plt.title('Prevalence of Heart attack by number of major vessels',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['number_of_major_vessels'])

plt.subplot(4,2,8)
plt.title('Prevalence of Heart attack by thall',fontsize=15)
sns.countplot(x=heart_attack_df['output'], hue=heart_attack_df['thalium_stress_test_result'])

#### 2.4.2 Plot Continuous Columns vs Output(target variable) Histograms using Histplot 

In [None]:
plt.figure(figsize=(20,30)).patch.set_facecolor("#e6f7ff") 
plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=0.9, 
                    wspace=0.4, 
                    hspace=0.4)

plt.subplot(3,2,1)
plt.title('Prevalence of Heart attack by age',fontsize=15)
sns.histplot(x = heart_attack_df['age'], hue = heart_attack_df['output'])

plt.subplot(3,2,2)
plt.title('Prevalence of Heart attack by Maximum heart rate achieved',fontsize=15)
sns.histplot(x = heart_attack_df['maximum_heart_rate_achieved'], hue = heart_attack_df['output'])

plt.subplot(3,2,3)
plt.title('Prevalence of Heart attack by cholestoral in mg/dl',fontsize=15)
sns.histplot(x = heart_attack_df['cholestoral'], hue = heart_attack_df['output'])

plt.subplot(3,2,4)
plt.title('Prevalence of Heart attack by previous peak',fontsize=15)
sns.histplot(x = heart_attack_df['previous_peak'],hue = heart_attack_df['output'])

plt.subplot(3,2,5)
plt.title('Prevalence of Heart attack by resting blood pressure',fontsize=15)
sns.histplot(x = heart_attack_df['resting_blood_pressure'],hue = heart_attack_df['output'])




#### 2.5 Variable Selection and Transformation

In [None]:
x = heart_attack_df.drop("output",axis=1).values
y = heart_attack_df["output"].values

# Splitting the data for model development
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 0)

StandardScaler removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way.

In [None]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
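
To make the feature-wise standardization concrete, here is a small NumPy sketch of the same arithmetic: subtract the training mean and divide by the training (population) standard deviation, then reuse those statistics for the test split. The feature values are hypothetical.

```python
import numpy as np

# Hypothetical training and test features (rows = patients, columns = age, cholesterol)
train = np.array([[40.0, 200.0], [50.0, 240.0], [60.0, 280.0]])
test = np.array([[50.0, 260.0]])

# Per-column statistics are estimated on the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the test split reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training split and only transforming the test split, as the cell above does, keeps information from the test data out of the scaling step.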

### 3. Model Development
#### 3.1 Comparing the Models
A list of classifiers for comparison.

* Linear Models : LogisticRegression
* Nearest Neighbors : KNeighborsClassifier
* Support Vector Machines : SVC
* Naive Bayes: GaussianNB
* Decision Trees : DecisionTreeClassifier
* Ensemble and Boosting methods:
    - RandomForestClassifier
    - GradientBoostingClassifier
    - AdaBoostClassifier
    - XGBClassifier
    - LGBMClassifier
* Neural Networks : MLPClassifier

Metrics for evaluation: The performance of classification models can be measured with metrics such as accuracy, area under the ROC curve (AUC-ROC), precision, recall, F-1 score and the confusion matrix. The F-1 score combines precision and recall into a single number, so we use it as the primary metric. During development, having a single metric speeds up decisions when we are selecting among a large number of classifiers.
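
As a worked example of how the F-1 score relates to precision and recall, the sketch below computes all three from confusion-matrix counts. The counts and the helper name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the positive class (output = 1)
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=5)
print('Precision: %.3f, Recall: %.3f, F-1: %.3f' % (precision, recall, f1))
```

Because it is a harmonic mean, F-1 stays low whenever either precision or recall is low, which is what makes it a reasonable single number for comparing classifiers.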

Bias - Variance tradeoff: Measure the bias and variance, Tune the model to minimize bias and variance. 



In [None]:
# Create the Dict with Model names and classifier functions 
model_names = ['LogisticRegression','KNeighborsClassifier',
       'SVC','GaussianNB',
       'DecisionTreeClassifier','RandomForestClassifier',
       'GradientBoostingClassifier','AdaBoostClassifier',
       'XGBClassifier','LGBMClassifier',
              'MLPClassifier']

model_functions = [LogisticRegression(),KNeighborsClassifier(),
                   SVC(),GaussianNB(),
                   DecisionTreeClassifier(),RandomForestClassifier(),
                   GradientBoostingClassifier(),AdaBoostClassifier(),
                   XGBClassifier(),LGBMClassifier(),MLPClassifier()]

models = dict(zip(model_names,model_functions))


predicted_f1_scores =[]
for name,algo in models.items():
    model=algo
    model.fit(x_train,y_train)
    predict = model.predict(x_test)
    acc_score = accuracy_score(y_test, predict)
    c_matrix = confusion_matrix(y_test, predict)
    prec_score = precision_score(y_test, predict)
    rec_score = recall_score(y_test, predict)
    f1 = f1_score(y_test, predict)
    mse, bias, var = bias_variance_decomp(model, x_train, y_train, x_test, y_test, loss='mse',
                                          num_rounds=200, random_seed=1)
    predicted_f1_scores.append(f1)

    print('Model name: ', name)
    print('Accuracy score: %.3f' % acc_score)
    print('Precision score: %.3f' % prec_score)
    print('Recall score: %.3f' % rec_score)
    print('f1 score: %.3f' % f1)
    print('Confusion Matrix',c_matrix)
    print('Bias: %.3f' % bias)
    print('Variance: %.3f' % var)
    
    print('#','-'*50,'#')

In [None]:
plt.figure(figsize = (15,5))
plt.title('Predicted F-1 scores',fontsize=15)
sns.barplot(x = predicted_f1_scores, y = model_names)

#### 3.2 Model Selection
The SVC model has the highest F-1 score (0.939) among the classifiers. It also has the lowest bias and variance scores.

It is plausible that the ensemble and boosting algorithms do not do well in this case: as a rough rule of thumb, they need a much larger dataset (at least 10,000 records).

SVC handles outliers better than KNN and GaussianNB. The run time for SVC is also drastically lower than for MLPClassifier.

Since the SVC model has the best score, let's work on tuning its hyperparameters.

#### 3.2.1 Hyperparameter Tuning

SVMs are a near-standard choice for reaching high accuracy on reasonably clean datasets. Improving them can be tricky, but some standard techniques help. Randomized search and grid search are the usual ways to optimize hyperparameters; to keep things simple, we use a for loop here.
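
For comparison, the grid search mentioned above could look roughly like the following sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification` as a stand-in for the heart data (the 303 x 13 shape is an assumption), and the pipeline, the 5-fold setting and the variable names are illustrative choices rather than part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (shape chosen for illustration only)
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline so each cross-validation fold is scaled independently
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 0.5, 1, 2, 5, 10, 100],
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated F1: %.3f" % search.best_score_)
print("Held-out F1: %.3f" % search.score(X_test, y_test))
```

Unlike the for loop, this selects parameters by cross-validation on the training split, so the test split is only used once for the final score.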



In [None]:
for c in [0.1, 0.5, 1, 2, 5, 10, 100]:
    for k in ['linear', 'poly', 'rbf', 'sigmoid']:
        model = SVC(C=c, kernel=k)
        model.fit(x_train,y_train)
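        predict = model.predict(x_test)

        # Recompute every metric for this C/kernel combination;
        # otherwise the prints below reuse stale values from the model comparison loop
        acc_score = accuracy_score(y_test, predict)
        c_matrix = confusion_matrix(y_test, predict)
        prec_score = precision_score(y_test, predict)
        rec_score = recall_score(y_test, predict)
        f1 = f1_score(y_test, predict)
        mse, bias, var = bias_variance_decomp(model, x_train, y_train, x_test, y_test, loss='mse',
                                              num_rounds=200, random_seed=1)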

        print('C: ', c, 'kernel:', k)
        print('Accuracy score: %.3f' % acc_score)
        print('Precision score: %.3f' % prec_score)
        print('Recall score: %.3f' % rec_score)
        print('f1 score: %.3f' % f1)
        print('Confusion Matrix',c_matrix)
        print('Bias: %.3f' % bias)
        print('Variance: %.3f' % var)

        print('#','-'*50,'#')

The F-1 score remains the same, even after training with multiple parameters. Let's develop the model with the best parameters.

In [None]:
# Create the model 
model = SVC(C=1, kernel='rbf')
model.fit(x_train, y_train)
predict = model.predict(x_test)
print('f1 score: %.3f' % f1_score(y_test, predict))

In [None]:
# Visualize the Predicted vs Actual scores
plt.figure(figsize = (15,5))
plt.title('Predicted vs Actual',fontsize=15)
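# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is used instead
sns.histplot(predict, kde=True, label = 'Predicted')
sns.histplot(y_test, kde=True, label = 'Actual')
plt.legend()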

For perfect prediction, we would have Predicted=Actual. In the above graph, the predicted values are pretty close to the actual values.

#### 3.2.2 Visualising Top Features 

In [None]:
# Extract the top features with weights and visualize
features_names = list(heart_attack_df.drop("output",axis=1))
svm = SVC(kernel='linear')
svm.fit(x_train, y_train)
imp,names = zip(*sorted(zip(abs(svm.coef_[0]), features_names)))
plt.figure(figsize = (15,5))
plt.title('Top Features in the Model',fontsize=15)
plt.barh(range(len(names)), imp, align='center')
plt.yticks(range(len(names)), names)
plt.show()




### Insights from the Heart Attack Chance data analysis

1. The heart attack dataset doesn't have missing/null/nan values.

2. We did not remove or replace the outliers in the continuous variables, because we are uncertain whether they are errors.

3. Correlation plots indicate that there is no multicollinearity between the variables. 

4. Only the cholesterol data is normally distributed, whereas the other variables are skewed.

5. It is intuitive that older people might have a higher chance of a heart attack, but the distribution plot of age vs. the output variable shows that the 40-60 age group has the higher chance.

6. People with sex = 1 have a higher chance of a heart attack compared to sex = 0.

7. People with non-anginal chest pain, that is with cp = 2, have a higher chance of a heart attack.

8. People with a higher maximum heart rate (> 150) are more likely to suffer a heart attack.

9. People with Resting Blood Pressure - range of 120 to 140 have higher chance of heart attack. 

10. People with thall = 2 have much higher chance of heart attack.

11. People with no exercise induced angina(exng = 0) have higher chance of heart attack.

12. People with a lower previous peak have a higher chance of a heart attack.

13. People with number of major vessels = 0 have a higher chance of a heart attack.

14. The Top Features plot tells us the feature significance.



