# Heart Attack Analysis and Prediction (Catboost) (beginner)

![](https://i.ibb.co/bH86zpn/EKG-Heart-concept-ML1701-ts484297336.png)

A heart attack occurs when the flow of blood to the heart is blocked.  

The blockage is most often a buildup of fat, cholesterol and other substances, which form a plaque in the arteries that feed the heart (coronary arteries).  


A heart attack, also called a **myocardial infarction**, can be **fatal**.

Let's analyze the data set and look for insights that help predict heart attacks.

## General information about data

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import seaborn as sns
sns.set_style("darkgrid")
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score
from catboost import CatBoostClassifier
from catboost import Pool, cv
import warnings
warnings.filterwarnings("ignore")


### Loading and previewing data

In [None]:
heart = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
o2_sat = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/o2Saturation.csv')

In [None]:
heart.head()


We have a DataFrame with the following features:

- `age`: age of the patient
- `sex`: sex of the patient *(1 = male; 0 = female)*
- `exang`: exercise-induced angina (1 = yes; 0 = no)
- `ca`: number of major vessels (0-3)
- `cp`: chest pain type

    -- Value 1: typical angina. 

    -- Value 2: atypical angina. 

    -- Value 3: non-anginal pain. 

    -- Value 4: asymptomatic. 

- `trtbps` : resting blood pressure (in mm Hg): *normal reading would be any blood pressure below 120/80 mm Hg and above 90/60 mm Hg in an adult*. 

- `chol` : cholesterol in mg/dl fetched via BMI sensor: *normal range is <200 mg/dL*. 

- `fbs` : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- `rest_ecg` : resting electrocardiographic results

    -- Value 0: normal

    -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

    -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

- `thalachh` : maximum heart rate achieved
- `output` : 0 = less chance of heart attack; 1 = more chance of heart attack

In [None]:
heart.info()

In [None]:
o2_sat.head()

In [None]:
o2_sat.info()

A normal level of oxygen is usually 95% or higher. Some people with chronic lung disease or sleep apnea can have normal levels around 90%.

In [None]:
o2_sat.hist()
plt.title('Oxygen saturation values')
plt.show()

In [None]:
o2_sat.value_counts()

We can see that the oxygen saturation table contains only normal blood oxygen values. From here on we will work only with the `heart` set.

Let's check for duplicates.

In [None]:
heart.duplicated().sum()

There is only one duplicated row here, so we can drop it.
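Before dropping anything, it can help to look at the duplicated rows themselves. The sketch below uses a small made-up frame (`toy` is hypothetical and not taken from `heart.csv`) to show the difference between flagging every copy with `keep=False` and counting only the repeats:

```python
import pandas as pd

# Hypothetical toy data: the second and fourth rows are identical.
toy = pd.DataFrame({
    "age": [63, 38, 45, 38],
    "chol": [233, 175, 210, 175],
    "output": [1, 1, 0, 1],
})

# keep=False marks every member of a duplicated group, not only the repeats.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# The default keep="first" flags only the later copies,
# which are the rows that drop_duplicates() removes.
n_repeats = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(deduped))
```

Looking at `all_copies` first makes it easy to confirm that the rows really are exact duplicates rather than two patients with similar values.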

In [None]:
df = heart.drop_duplicates().reset_index(drop=True).copy()

In [None]:
df.info()

In [None]:
df.describe().T

### Conclusion


At first glance, we do not see anomalies in the data. There are no missing values, and the duplicate has been removed. There is no need to convert data to other types. Before starting exploratory data analysis, let us divide the data into train and test sets and set the test data aside to prevent data snooping bias.

## Dividing data

To divide the data we will use `train_test_split` from the `scikit-learn` library. Alternatively, we could use the pandas method `sample()` with `frac` and `random_state`.


P.S. You can also do it this way:

`train = df.sample(frac=0.8, random_state=42).copy()`

`test = df[~df.index.isin(train.index)].copy()`
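As a quick check of the `sample()`-based approach, here is a sketch on a made-up frame (`toy` is an assumption for illustration only) showing that the two parts do not overlap and together cover every row:

```python
import pandas as pd

# Hypothetical toy frame with a default 0..9 integer index.
toy = pd.DataFrame({"x": range(10), "output": [0, 1] * 5})

# 80% of the rows, drawn at random but reproducibly.
train = toy.sample(frac=0.8, random_state=42).copy()
# Everything whose index label was not drawn goes to the test part.
test = toy[~toy.index.isin(train.index)].copy()

overlap = train.index.intersection(test.index)
covered = sorted(train.index.union(test.index))
print(len(train), len(test), len(overlap))
```

This relies on the index labels being unique, which is why resetting the index after dropping duplicates matters.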


We will work with `scikit-learn`.


In [None]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

Check the lengths:

In [None]:
len(train_set)

In [None]:
len(test_set)

In [None]:
len(train_set) + len(test_set) == len(df)

In my opinion it is better to set the test set aside now, before EDA. We have no NaN values and no duplicates, so I decided to hide the test set at this point to prevent snooping bias.  
Now we have two sets: train for our model, and test.


## Exploratory data analysis

In [None]:
train_set.head()

Let us plot histograms of the features to see the values we have.


In [None]:
train_set.hist(bins=50, figsize=(20,15), edgecolor='black', linewidth=2)
plt.show()

The `age` feature has many distinct values. It is easier to analyze when binned, so let us split it into five quantile-based groups.

In [None]:
train_set['age_group'] = pd.qcut(train_set['age'], 5)

In [None]:
train_set.groupby('age_group')['output'].agg(['count', 'mean']).sort_values(by='mean', ascending=False)
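To make the binning step concrete, here is a minimal sketch of `pd.qcut` followed by the same `count`/`mean` aggregation, run on made-up ages and outcomes (the `toy` values are hypothetical, not from the data set):

```python
import pandas as pd

# Hypothetical ages and outcomes, 12 rows.
toy = pd.DataFrame({
    "age": [29, 35, 41, 44, 48, 52, 55, 58, 61, 64, 67, 70],
    "output": [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})

# qcut places the bin edges at quantiles, so each bin holds
# roughly the same number of rows (here 12 rows / 4 bins = 3 each).
toy["age_group"] = pd.qcut(toy["age"], 4)

# count = rows per bin, mean = share of rows with output == 1.
summary = toy.groupby("age_group", observed=True)["output"].agg(["count", "mean"])
print(summary)
```

Because `output` is 0 or 1, the `mean` column reads directly as the share of positive cases in each age group.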

We see that the highest heart attack rate is in the 29-44 age group. This may be due to irregular work, stress, poor nutrition, fast food, smoking, and so on: a disease that strikes fairly young.  

Let us group by sex and age.

In [None]:
train_set.groupby(['age_group', 'sex'])['output'].agg(['count', 'mean'])

In [None]:
train_set.groupby(['sex'])['output'].agg(['count', 'mean'])

The main risk factor for the development of myocardial infarction is arterial hypertension, and this disease occurs quite often among women.

We can also see that the count of men with heart attacks is higher than that of women. One of the reasons for this is that fewer atherosclerotic plaques form in the vessels of the female body, so blood clots in the arteries of the heart are less likely.

In summary, heart attacks more often affect men, but with age they increasingly affect women.

In [None]:

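columns_list = ['trtbps', 'chol', 'thalachh']
title_list = ['Boxplot for resting blood pressure (in mm Hg)', 'Boxplot for chol values',
              'Boxplot for thalach']
color_list = ['steelblue', 'skyblue', 'cyan']

for column, title, color in zip(columns_list, title_list, color_list):
    # Use scalar quantiles: quantile([.75]) returns a Series, and subtracting
    # two Series with different index labels would give NaN.
    q25 = train_set[column].quantile(.25)
    q75 = train_set[column].quantile(.75)
    iqr = q75 - q25
    # 1.5 * IQR fences
    low_range = q25 - (1.5 * iqr)
    high_range = q75 + (1.5 * iqr)
    plt.figure(figsize=(15, 5))
    sns.boxplot(x=train_set[column], color=color)
    # Mark the fences rather than clipping the axis, so points beyond them stay
    # visible (assigning to plt.xlim would only overwrite the function).
    plt.axvline(low_range, color='gray', linestyle='--')
    plt.axvline(high_range, color='gray', linestyle='--')
    plt.title(title)
    plt.xlabel('')
    plt.show()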


There are no large outliers or anomalies.
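The 1.5 × IQR rule used in the loop above can also be applied numerically, which gives a count to go with the visual impression. A minimal sketch, assuming a hypothetical helper `iqr_outliers` and made-up readings:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    low, high = q25 - k * iqr, q75 + k * iqr
    return values[(values < low) | (values > high)]

# Hypothetical cholesterol readings with one extreme value.
sample = [180, 195, 200, 210, 220, 235, 240, 250, 560]
outliers = iqr_outliers(sample)
print(outliers)
```

On the real data the same helper could be called as `iqr_outliers(train_set['chol'])` to list the points that fall beyond the fences.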

In [None]:
train_set.head()

In [None]:
train_set['trtbps_group'] = pd.qcut(train_set['trtbps'], 5)

In [None]:
train_set.groupby(['trtbps_group'])['output'].agg(['count', 'mean'])

In [None]:
train_set['chol_group'] = pd.qcut(train_set['chol'], 5)

In [None]:
train_set.groupby(['chol_group'])['output'].agg(['count', 'mean'])

We can see that cholesterol level and resting blood pressure (in mm Hg) affect the frequency of heart attacks.

### Conclusion


The main risk factor for the development of myocardial infarction is arterial hypertension, and this disease occurs quite often among women.

We can also see that the count of men with heart attacks is higher than that of women. One of the reasons for this is that fewer atherosclerotic plaques form in the vessels of the female body, so blood clots in the arteries of the heart are less likely.

Heart attacks more often affect men, but with age they increasingly affect women.

Cholesterol level and resting blood pressure (in mm Hg) also affect the frequency of heart attacks.

In summary, age, sex, cholesterol level, and resting blood pressure all affect heart attacks.

## Feature engineering

Let's split our train set into features and target.

In [None]:
def split_data(data, target_column):
    """Return (features, target), with the target column separated out."""
    return data.drop(columns=[target_column]), data[target_column]

In [None]:
train_features, train_target = split_data(train_set, 'output')

In [None]:
train_features = train_features.drop(columns=['age_group', 'trtbps_group', 'chol_group'])

In [None]:
train_features.head()

## Train model and tune

Let us try to choose the model's parameters via `GridSearchCV`.

In [None]:
cat_model = CatBoostClassifier()
params = {'iterations': [100, 200, 500],
          'depth': [4, 5, 6],
          'loss_function': ['Logloss', 'CrossEntropy'],
          'l2_leaf_reg': np.logspace(-20, -19, 3),
          'leaf_estimation_iterations': [10],
          'logging_level':['Silent'],
          'random_seed': [42]
         }
scorer = make_scorer(accuracy_score)
clf_grid = GridSearchCV(estimator=cat_model, param_grid=params, scoring=scorer, cv=5)
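It is worth estimating how much work this grid implies before running it. The sketch below counts the combinations with plain `itertools`. The three `l2_leaf_reg` entries are placeholders, since only their number matters for the count:

```python
from itertools import product

# Same grid shape as the search above; "v1".."v3" are placeholder values.
grid = {
    "iterations": [100, 200, 500],
    "depth": [4, 5, 6],
    "loss_function": ["Logloss", "CrossEntropy"],
    "l2_leaf_reg": ["v1", "v2", "v3"],
    "leaf_estimation_iterations": [10],
    "logging_level": ["Silent"],
    "random_seed": [42],
}

# Every combination of one value per parameter is one candidate model.
n_candidates = len(list(product(*grid.values())))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

That is 54 candidates and 270 cross-validation fits. With the default `refit=True`, `GridSearchCV` then trains once more on the whole training set using the best combination.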

Train our model on the train set:

In [None]:
clf_grid.fit(train_features, train_target)
best_param = clf_grid.best_params_
best_param

Now we can create a model with the best parameters and train it on `train_pool` for cross-validation.

In [None]:
model = CatBoostClassifier(depth=5,
                           iterations=500,
                           l2_leaf_reg=1e-20,
                           leaf_estimation_iterations=10,
                           logging_level='Silent',
                           loss_function='Logloss',
                           random_seed=42)

In [None]:
# Index 0 refers to the first column of train_features, which is treated as categorical
cat_features = [0]
xtrain, xval, ytrain, yval = train_test_split(
                            train_features, train_target, 
                            train_size=0.8,random_state=42
                            )
train_pool = Pool(xtrain, ytrain, cat_features=cat_features)

params = {'depth': 5,
          'iterations': 500,
          'l2_leaf_reg': 1e-20,
          'leaf_estimation_iterations': 10,
          'logging_level': 'Silent',
          'loss_function': 'Logloss',
          'random_seed': 42}


# Unfortunately, plotting works only in Jupyter
# scores = cv(train_pool,
#             params,
#             fold_count=2, 
#             plot=True)

In [None]:
model.fit(train_pool, eval_set=(xval, yval))

Now it is time for our test set.

In [None]:
test_features, test_target = split_data(test_set, 'output')
test_features.head()

Now we can test our model on the test set.

In [None]:
test_predictions = model.predict(test_features)
test_acc = accuracy_score(test_target, test_predictions)
test_f1 = f1_score(test_target, test_predictions)


print("Accuracy")
print("Test set:", test_acc)
print("F1-score")
print("Test set:", test_f1)
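For readers new to these metrics, here is a short sketch of how accuracy and F1 follow from the counts of true/false positives and negatives. The labels are made up and `accuracy_and_f1` is a hypothetical helper, not part of `scikit-learn`:

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1

# Hypothetical labels: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

Accuracy counts all correct predictions, while F1 looks only at the positive class, so the two can differ noticeably when the classes are imbalanced.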

And a sanity check:

In [None]:
(df['output'].value_counts()/df.shape[0]).to_frame()

Our model is better than random. We can try it on another data set if we have one.
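The class shares computed above are a useful reference point: a trivial model that always predicts the most frequent class reaches an accuracy equal to the largest share. A minimal sketch with made-up labels (`majority_baseline_accuracy` is a hypothetical helper):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels: six positives, four negatives.
labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

A test accuracy clearly above this number is what makes the comparison meaningful.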

# Conclusion

Know your blood pressure. High blood pressure is usually not accompanied by any symptoms, but it is one of the main causes of sudden stroke or heart attack. Check your blood pressure and know your numbers. If your blood pressure is high, you need to change your lifestyle - switch to a healthy diet, reduce your salt intake, and increase your levels of physical activity. You may need to take medication to control your blood pressure.


Elevated blood cholesterol levels increase the risk of heart attacks and strokes. It is necessary to control blood cholesterol levels with a healthy diet and, if necessary, appropriate medications.

Take care of yourself and your loved ones.