# **1. Introduction**

In this notebook I use the heart-attack-analysis-prediction dataset to predict whether a patient is at risk of a heart attack. The first step is to understand the problem and extract information about the data; then we train a model and evaluate it.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
data = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
warnings.filterwarnings('ignore') # Suppress warning messages

In [None]:
data.shape

The code above shows the size of the dataset: 303 rows and 14 columns.

In [None]:
data.columns

These are the column names. One column is called 'output'; this is the variable to predict. Next we look at the values it can take, to see whether this is a binary or a multiclass classification problem.

In [None]:
data['output'].unique()

The output variable only takes the values 0 and 1: 0 represents a low probability of suffering a heart attack and 1 a high probability. This is therefore a binary classification problem.

# **2. EDA**

We now know what type of problem we are facing, but we know nothing about the quality of the data, so the first job is to explore it. A good starting point is some general information about the dataset.

In [None]:
data.info()

The output above shows 303 entries and 14 columns (from its first two lines), which we already knew. It also lists each column's name, its number of non-null values and its data type. From this we can conclude:
- There are no null values, so we don't have to fill any gaps.
- All columns are numeric (integer or float), so we don't have to convert any column to a numeric type.
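
The same two checks can also be made explicitly rather than read off `info()`. The sketch below uses a small made-up DataFrame (the `toy` values are hypothetical, with one missing value planted on purpose) so that it runs on its own; on the real data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the heart dataset; one NaN is planted in 'chol'
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233.0, np.nan, 204.0],
})

# Number of missing values per column
null_counts = toy.isnull().sum()

# True only if every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(null_counts)
print("All numeric:", all_numeric)
```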

In [None]:
data.head()

The table above shows the columns and the first five rows. The columns have very different ranges: for example, the sex column only contains 0 or 1 (depending on whether the person is male or female), while chol (cholesterol in mg/dL) contains much larger values. We should therefore standardize the data (rescale every column to zero mean and unit variance) if we want to use algorithms that rely on distances to classify the data. With different ranges, the columns would not influence the output equally: the model would give more weight to the columns with larger values. The standardization itself will take place just before training. Before that, I want to check whether the features are correlated with each other and decide whether any column should be removed.
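
To make the idea of standardization concrete, here is a minimal sketch of the z-score formula, z = (x - mean) / std, applied column by column with NumPy. The numbers in `x_toy` are made up for illustration (a 0/1 column next to a cholesterol-like column) and are not taken from the dataset.

```python
import numpy as np

# Hypothetical values: column 0 is a 0/1 flag, column 1 is on a much larger scale
x_toy = np.array([
    [0.0, 180.0],
    [1.0, 240.0],
    [1.0, 300.0],
    [0.0, 210.0],
])

col_mean = x_toy.mean(axis=0)  # per-column mean
col_std = x_toy.std(axis=0)    # per-column population standard deviation (ddof=0)

# z-score: subtract the mean and divide by the standard deviation
z = (x_toy - col_mean) / col_std

print(z.mean(axis=0))  # close to 0 for every column
print(z.std(axis=0))   # close to 1 for every column
```

After the transformation both columns live on the same scale, so neither dominates a distance computation just because of its units.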

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12,6))
sns.heatmap(data.corr(),cmap='coolwarm', annot=True)

Looking at the correlations between columns, we can say that all columns are worth keeping: none of the correlations between features is close to 1 or -1. If two columns had a correlation around 1 or -1, we should remove one of them, because they would carry nearly the same information and such strong correlation can lead to bad results. Another thing to check is whether there are duplicated rows, since duplicates can also hurt performance.
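
Reading a heatmap by eye gets harder as the number of columns grows. As a rough sketch, the strongly correlated pairs can also be listed programmatically. The DataFrame `toy` and the 0.9 threshold below are assumptions chosen for illustration; column `b` is built to be almost a multiple of `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is nearly 2 * 'a', 'c' is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

threshold = 0.9  # arbitrary cut-off for "highly correlated"

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold  # NaN comparisons are False, so masked cells are skipped
]

print(high_pairs)
```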

In [None]:
data.duplicated().sum()

There is 1 duplicated row. This is not a problem, because we can remove it without losing much data.

In [None]:
data = data.drop_duplicates()
data.duplicated().sum()

Now there are no duplicated rows. The next step is to check whether the data is balanced; with an unbalanced dataset we could have trouble predicting the label with few samples. To check this, we count the entries with output equal to 1 and the entries with output equal to 0.

In [None]:
import seaborn as sns
sns.countplot(x='output', data=data) # Number of samples per class

There are around 140 samples labeled no heart attack and around 165 labeled heart attack. The dataset is balanced, because each class accounts for roughly 50% of the samples. If the split were 70-30 or more extreme, the dataset should be considered unbalanced, and the treatment of the data would be different.
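
The same check can be expressed as numbers instead of bars. This sketch rebuilds a label column from the approximate counts quoted above (140 and 165 are rounded readings of the plot, not exact dataset values) and applies the 70-30 rule of thumb mentioned in the text.

```python
import pandas as pd

# Approximate class counts read from the plot, used here only for illustration
labels = pd.Series([0] * 140 + [1] * 165, name="output")

# Share of each class in the dataset
shares = labels.value_counts(normalize=True)

# The minority class should hold more than 30% of the samples under the 70-30 rule
minority_share = float(shares.min())
is_balanced = minority_share > 0.30

print(shares)
print("Balanced:", is_balanced)
```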

Another interesting check is whether the data is balanced not only for the target but also for the sex column. This is good practice, because we could have trouble predicting heart attacks for the sex with few samples.

In [None]:
fig, ax3 = plt.subplots(figsize=(12,5))
grp2 = data.groupby(['sex', 'output'])['output'].count() # Number of samples per (sex, output) pair
grp2.plot.bar(ax = ax3)

As the chart shows, there are more samples of sex 1 than of sex 0, which could be a problem. I don't know whether the conditions for having a heart attack differ between men and women, so I look at the relationship between the probability of suffering a heart attack and sex.

In [None]:
import statsmodels.formula.api as sm
reg = sm.ols(formula='output ~ sex', data = data).fit()
print(reg.summary())

The table above shows the relationship between the dependent variable (output) and the independent variable (sex). To see whether sex matters for predicting the output, we look at the p-value (P>|t| in the table), which tells us whether the independent variable is statistically significant. Here it is reported as 0.000, well below the usual 0.05 threshold, so sex is statistically significant and we shouldn't remove it. As a consequence, we could have trouble predicting a low probability of heart attack for the sex labeled 0.

# **3. Training**

In [None]:
x = data.iloc[:, 0:13].values #Independent variables
y = data.iloc[:, 13].values #Dependent variable

In [None]:
#Data standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
x[1]

The data has been standardized, so differences in column scale will no longer distort each column's influence on the final result. As an example, the output shows the standardized values of one row (`x[1]`, the row at index 1).

Now we need to split the data into a subset to train the model and a subset to test its performance, using the train_test_split function. The split has to be applied to both the independent variables and the dependent variable.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state = 0)
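
One caveat about the order of the two steps: when the scaler is fitted on the full dataset before splitting, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows that order on synthetic data; `x_demo`, `y_demo` and the other names are hypothetical and do not touch the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, random binary labels
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first ...
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=0
)

# ... then learn the scaling parameters from the training rows only
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)  # reuses the training mean and std

print(x_tr_scaled.mean(axis=0))  # close to 0 by construction
print(x_te_scaled.mean(axis=0))  # usually near 0, but not exactly
```

With a dataset this small the difference is likely to be minor, so this is a refinement rather than a correction of the results.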

Now it's time to train the model, but there are many models to choose from, each with many hyperparameters. If we don't know which model will fit best, we can use grid search. This technique lets us train more than one classifier (or just one, if we prefer) over a set of values for their hyperparameters and keep the combination with the best performance.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier 

pipe = Pipeline(steps = [('estimator', DecisionTreeClassifier())])

params = [{
    'estimator' : [XGBClassifier()], 
    'estimator__max_depth' : range(2,8), 
    'estimator__learning_rate' : [0.11, 0.12, 0.13], 
    'estimator__n_estimators' : range(100,106),
    'estimator__min_child_weight': range(1,4),
    'estimator__min_split_loss' : range(1,4) 
    },
    {
    'estimator' : [RandomForestClassifier()],
    'estimator__n_estimators' : range(297, 303), 
    'estimator__max_depth' : range(3,7), 
    }]

model_grid = GridSearchCV(pipe, cv=5, param_grid = params)
model_grid.fit(x_train, y_train)
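
Grid search tries every combination in each sub-grid, so its cost grows multiplicatively with the number of values per hyperparameter. As a quick sanity check before launching a search, the number of model fits can be counted by hand. The sketch below copies the ranges from `params` and uses the same `cv=5`; the dictionary names are only for illustration.

```python
from math import prod

# Number of values per hyperparameter, copied from the two sub-grids in `params`
xgb_sizes = {
    "max_depth": len(range(2, 8)),
    "learning_rate": len([0.11, 0.12, 0.13]),
    "n_estimators": len(range(100, 106)),
    "min_child_weight": len(range(1, 4)),
    "min_split_loss": len(range(1, 4)),
}
rf_sizes = {
    "n_estimators": len(range(297, 303)),
    "max_depth": len(range(3, 7)),
}

cv_folds = 5

# Within a sub-grid the sizes multiply; across sub-grids the candidates add up
xgb_candidates = prod(xgb_sizes.values())
rf_candidates = prod(rf_sizes.values())
total_candidates = xgb_candidates + rf_candidates

# Each candidate is trained once per fold (the final refit is not counted)
total_fits = total_candidates * cv_folds

print(xgb_candidates, rf_candidates, total_fits)
```

Most of the work goes into the XGBoost sub-grid, which is worth keeping in mind when adding more values to it.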

In [None]:
model_grid.best_params_

We trained two different models, and the best one is Random Forest. The next step is to evaluate it with different metrics. The first metric to check is accuracy: the fraction of samples the model classifies correctly.

In [None]:
print("Accuracy train: {:.3f}".format(model_grid.score(x_train, y_train))) #See the accuracy of training set
print("Accuracy test: {:.3f}".format(model_grid.score(x_test, y_test))) #Print the accuracy of test set

In this case the accuracy is 89.6% on the training set (the model correctly predicts 89.6% of the samples it was trained on) and 82.4% on the test set (data the model has never seen). The two accuracies differ, which is quite normal, but there are two situations to watch for:
- The training accuracy is much higher than the test accuracy: this is overfitting. The model predicts well on data it has already seen but does not generalize to data it has never seen.

- The model also performs very poorly on the training set: this is underfitting.

Neither is the case here.

In [None]:
y_pred = model_grid.predict(x_test)

Another tool for evaluating the model is the confusion matrix: its diagonal holds the correct predictions, and the other cells hold the wrong ones.

In [None]:
from sklearn.metrics import confusion_matrix
cnf = confusion_matrix(y_test, y_pred)
cnf

There are 34 negative cases labeled as negative and 41 positive cases labeled as positive. But the model is not perfect (no model is in practice): 11 cases were labeled as positive but are actually negative. False positives are a problem in healthcare, because we predict that a healthy person has a disease, but the real problem is the false negatives, 5 in our case, because we predict that a person is healthy when in reality they are not.
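
The four counts can be turned into rates that separate the two kinds of error. This is a small arithmetic sketch using only the numbers quoted above (34, 11, 5 and 41); the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix discussed above
tn, fp, fn, tp = 34, 11, 5, 41

total = tn + fp + fn + tp

accuracy = (tp + tn) / total     # share of all predictions that are correct
sensitivity = tp / (tp + fn)     # share of real positives that are detected (recall)
specificity = tn / (tn + fp)     # share of real negatives that are detected
precision = tp / (tp + fp)       # share of predicted positives that are real positives

print(f"accuracy:    {accuracy:.3f}")
print(f"sensitivity: {sensitivity:.3f}")
print(f"specificity: {specificity:.3f}")
print(f"precision:   {precision:.3f}")
```

Sensitivity is the rate most directly tied to false negatives, which is the error the text singles out as the most serious one.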

For this reason accuracy alone could give us a false sense of security when we apply this model to a patient. A more informative tool for this type of problem is the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds.

In [None]:
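from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
RocCurveDisplay.from_estimator(model_grid, x_test, y_test)
plt.show()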

This is the ROC curve. The area under the curve (AUC) for our model is 0.92. This means that, given a randomly chosen positive case and a randomly chosen negative case, the model assigns the higher score to the positive one about 92% of the time.
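
One common way to read the area under the ROC curve is as a ranking probability: the chance that a randomly picked positive case receives a higher score than a randomly picked negative case. The sketch below computes that quantity directly by comparing every positive-negative pair. The scores are made up for illustration and `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for three positive and three negative cases
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

auc = pairwise_auc(pos_scores, neg_scores)
print(f"AUC: {auc:.3f}")
```

Only one of the nine pairs is ranked the wrong way round (0.4 against 0.7), which is why the value falls short of 1.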

# **4. Conclusions**
We trained a model to detect heart attack risk and achieved good results. This dataset is very easy to work with: it has no null values, no unbalanced classes, no categorical variables, and enough data to make good predictions. Real datasets are rarely like this, and rarely have so few columns. In real projects a large share of the time (often quoted as around 80%) goes into the EDA step; here we did not have to spend much time on it.


We could try other algorithms and other hyperparameters, but this is only an example of the power of these algorithms, and of what we could do with them if humans and AI worked together.