# Heart Failure Prediction

Build a model that predicts the presence of heart failure from patient data.
Dataset from Kaggle: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

In [18]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

In [None]:
#!pip install catboost
import catboost

In [14]:
data = pd.read_csv('./heart.csv')

Let's divide the features into numerical and categorical.

In [15]:
num_cols = [
    'Age',
    'RestingBP',
    'Cholesterol',
    'FastingBS',
    'MaxHR',
    'Oldpeak'
]

cat_cols = [
     'Sex',
     'ChestPainType',
     'RestingECG',
     'ExerciseAngina',
     'ST_Slope'
]

feature_cols = num_cols + cat_cols
target_col = ['HeartDisease']
cols = feature_cols + target_col
data = data[cols]
data.head()


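|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | Sex | ChestPainType | RestingECG | ExerciseAngina | ST_Slope | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|-----|---------------|------------|----------------|----------|--------------|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | M | ATA | Normal | N | Up | 0 |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | F | NAP | Normal | N | Flat | 1 |
| 2 | 37 | 130 | 283 | 0 | 98 | 0.0 | M | ATA | ST | N | Up | 0 |
| 3 | 48 | 138 | 214 | 0 | 108 | 1.5 | F | ASY | Normal | Y | Flat | 1 |
| 4 | 54 | 150 | 195 | 0 | 122 | 0.0 | M | NAP | Normal | N | Up | 0 |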
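
The manual column lists above could also be derived from the column dtypes. The following is a minimal sketch, assuming a small hypothetical frame `df_demo` in place of `heart.csv`; the names `df_demo`, `num_demo` and `cat_demo` are illustrative only. A dtype-based split treats every integer column as numerical, so a binary flag such as `FastingBS` would land in the numerical list, just as in the manual lists.

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as heart.csv.
df_demo = pd.DataFrame({
    "Age": [40, 49, 37],
    "Oldpeak": [0.0, 1.0, 0.0],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0],
})

target = "HeartDisease"
features = df_demo.drop(columns=[target])

# Numeric dtypes become numerical features; everything else is treated as categorical.
num_demo = features.select_dtypes(include="number").columns.tolist()
cat_demo = features.select_dtypes(exclude="number").columns.tolist()

print(num_demo)
print(cat_demo)
```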


Check if there are missing values in the dataset.

In [5]:
data.isna().mean()

Age               0.0
RestingBP         0.0
Cholesterol       0.0
FastingBS         0.0
MaxHR             0.0
Oldpeak           0.0
Sex               0.0
ChestPainType     0.0
RestingECG        0.0
ExerciseAngina    0.0
ST_Slope          0.0
HeartDisease      0.0
dtype: float64

No missing values were found. Let's check the types of all the features.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   RestingBP       918 non-null    int64  
 2   Cholesterol     918 non-null    int64  
 3   FastingBS       918 non-null    int64  
 4   MaxHR           918 non-null    int64  
 5   Oldpeak         918 non-null    float64
 6   Sex             918 non-null    object 
 7   ChestPainType   918 non-null    object 
 8   RestingECG      918 non-null    object 
 9   ExerciseAngina  918 non-null    object 
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


The data types of all the features match the expected ones.

## LogisticRegressionCV

Let's prepare the data for learning.

In [7]:
# One-hot encode the categorical features, then point cat_cols at the new dummy columns.
data = pd.get_dummies(data, columns=cat_cols)
cat_cols_new = []
for col_name in cat_cols:
    # Match on the "<name>_" prefix that get_dummies adds, so unrelated columns are never picked up.
    cat_cols_new.extend(col for col in data.columns if col.startswith(col_name + '_'))
cat_cols = cat_cols_new

In [8]:
# Split the dataset into train and test parts, then standardize the numerical features.
X_train, X_test, y_train, y_test = train_test_split(data[num_cols + cat_cols], data[target_col], test_size=0.2)
# The scaler is fitted on the train part only, so that test-set statistics do not leak into training.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
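
The same encode, split and scale steps can also be bundled into one estimator, which keeps the fit-on-train-only rule automatic. This is a minimal sketch rather than the notebook's method: it assumes scikit-learn's `ColumnTransformer` and `Pipeline`, a plain `LogisticRegression`, and a hypothetical toy frame `toy` (the rows are illustrative, not the full dataset).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with two numerical columns and one categorical column.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58],
    "MaxHR": [172, 156, 98, 108, 122, 170, 170, 99],
    "Sex": ["M", "F", "M", "F", "M", "M", "F", "M"],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 0, 1],
})
num_demo, cat_demo = ["Age", "MaxHR"], ["Sex"]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[num_demo + cat_demo], toy["HeartDisease"],
    test_size=0.25, random_state=0, stratify=toy["HeartDisease"],
)

# Scaling and one-hot encoding are learned inside fit(), i.e. on the train part only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_demo),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_demo),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print(pred)
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation without refitting the scaler by hand for each fold.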

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                918 non-null    int64  
 1   RestingBP          918 non-null    int64  
 2   Cholesterol        918 non-null    int64  
 3   FastingBS          918 non-null    int64  
 4   MaxHR              918 non-null    int64  
 5   Oldpeak            918 non-null    float64
 6   HeartDisease       918 non-null    int64  
 7   Sex_F              918 non-null    uint8  
 8   Sex_M              918 non-null    uint8  
 9   ChestPainType_ASY  918 non-null    uint8  
 10  ChestPainType_ATA  918 non-null    uint8  
 11  ChestPainType_NAP  918 non-null    uint8  
 12  ChestPainType_TA   918 non-null    uint8  
 13  RestingECG_LVH     918 non-null    uint8  
 14  RestingECG_Normal  918 non-null    uint8  
 15  RestingECG_ST      918 non-null    uint8  
 16  ExerciseAngina_N   918 non

We will train the model using LogisticRegressionCV.

In [13]:
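C = [0.15, 0.1, 0.05]
clf = LogisticRegressionCV(Cs=C, scoring='accuracy', refit=True, cv=10, n_jobs=-1)
# y_train is a single-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning.
clf.fit(X_train, y_train.values.ravel())
y_pred = clf.predict(X_test)
# clf.scores_[1] holds the validation accuracy for every (fold, C) pair, so its maximum is
# the best single-fold cross-validation score, not the accuracy on the whole training set.
print('The best accuracy on train: ', clf.scores_[1].max())
print('The best hyperparameter C: ', clf.C_)
print('Accuracy on test: ', accuracy_score(y_test, y_pred))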


The best accuracy on train:  0.918918918918919
The best hyperparameter C:  [0.15]
Accuracy on test:  0.8532608695652174
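
The figure printed as the best accuracy on train is a maximum over individual validation folds, which tends to be optimistic; the mean across folds is a steadier estimate for each value of `C`. The sketch below illustrates the difference. It is an assumption-laden example: it uses synthetic data from `make_classification` and `cross_val_score` with a plain `LogisticRegression` instead of the attributes of `LogisticRegressionCV`, and the names `X_syn`, `y_syn` and `summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; this is not the heart dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

summary = {}
for C in [0.05, 0.1, 0.15]:
    # One accuracy value per validation fold for this regularization strength.
    fold_acc = cross_val_score(
        LogisticRegression(C=C, max_iter=1000), X_syn, y_syn, cv=10, scoring="accuracy"
    )
    summary[C] = (fold_acc.mean(), fold_acc.max())
    print(f"C={C}: mean over folds={fold_acc.mean():.3f}, best fold={fold_acc.max():.3f}")
```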



## CatBoost
Let's do the same with CatBoostClassifier.

In [16]:
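# Re-split the unscaled data with a fixed seed; tree-based models do not need feature scaling.
X_train, X_test, y_train, y_test = train_test_split(data[num_cols + cat_cols], data[target_col], test_size=0.2, random_state=42)

# cat_cols now holds the one-hot dummy columns created earlier; they are passed to CatBoost as categorical features.
boosting_model = catboost.CatBoostClassifier(logging_level='Silent', eval_metric='Accuracy', cat_features=cat_cols)
# 20 x 3 x 2 = 120 parameter combinations are evaluated; the "loss" column in the log is the eval metric (accuracy).
param_grid = {'l2_leaf_reg': np.linspace(0, 1, 20), 'learning_rate': [0.01, 0.03, 0.06], 'n_estimators': [200, 250]}
boosting_model.grid_search(param_grid, X=X_train, y=y_train, refit=True)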

0:	loss: 0.9115646	best: 0.9115646 (0)	total: 585ms	remaining: 1m 9s
1:	loss: 0.9251701	best: 0.9251701 (1)	total: 1.11s	remaining: 1m 5s
2:	loss: 0.9115646	best: 0.9251701 (1)	total: 1.68s	remaining: 1m 5s
3:	loss: 0.9047619	best: 0.9251701 (1)	total: 2.22s	remaining: 1m 4s
4:	loss: 0.9047619	best: 0.9251701 (1)	total: 2.77s	remaining: 1m 3s
5:	loss: 0.9251701	best: 0.9251701 (1)	total: 3.33s	remaining: 1m 3s
6:	loss: 0.9047619	best: 0.9251701 (1)	total: 4.01s	remaining: 1m 4s
7:	loss: 0.9115646	best: 0.9251701 (1)	total: 5.37s	remaining: 1m 15s
8:	loss: 0.8979592	best: 0.9251701 (1)	total: 6.71s	remaining: 1m 22s
9:	loss: 0.9183673	best: 0.9251701 (1)	total: 8.13s	remaining: 1m 29s
10:	loss: 0.8911565	best: 0.9251701 (1)	total: 9.35s	remaining: 1m 32s
11:	loss: 0.9115646	best: 0.9251701 (1)	total: 10.3s	remaining: 1m 32s
12:	loss: 0.9183673	best: 0.9251701 (1)	total: 10.9s	remaining: 1m 29s
13:	loss: 0.9115646	best: 0.9251701 (1)	total: 11.4s	remaining: 1m 26s
14:	loss: 0.9251701	bes

{'cv_results': defaultdict(list,
             {'iterations': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
               17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
               34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,

In [17]:
y_pred = boosting_model.predict(X_test)
print('The best score:', boosting_model.best_score_)
print('Accuracy on test: ', accuracy_score(y_test, y_pred))

The best score: {'learn': {'Accuracy': 0.9918256130790191, 'Logloss': 0.07612844401228881}}
Accuracy on test:  0.8858695652173914



CatBoost reaches a test accuracy about 3 percentage points higher than logistic regression (0.886 vs. 0.853). The hyperparameter grid was chosen arbitrarily, so a more careful search could improve the quality further.
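
Note that the logistic regression split was drawn without a fixed `random_state`, while the CatBoost split used `random_state=42`, so the two test accuracies come from different test sets. A like-for-like comparison would evaluate both models on one shared split. The sketch below shows the idea under stated assumptions: it uses synthetic data, and scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in for CatBoost so that the example needs no extra install. The names `models` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the heart dataset).
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)

# One split, fixed by random_state, shared by every model.
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Stand-in for CatBoostClassifier, not the same algorithm.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same train part and score it on the same test part.
results = {
    name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
print(results)
```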