![](https://cdn.studenti.stbm.it/images/2017/10/23/scorrimento-graduatorie-test-medicina-2017-orig.jpeg)

In [None]:
import numpy as np 
import pandas as pd
import plotly as py
import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier


pd.set_option('display.max_columns', None)
#########################################################
hrt = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')

# Basic information

In [None]:
hrt.head(3)

In [None]:
hrt.info()

**Let's explain what the abbreviated column names mean:**

1. age - age in years
2. sex - sex (1 = male; 0 = female)
3. cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 0 = asymptomatic)
4. trtbps - resting blood pressure (in mm Hg on admission to the hospital)
5. chol - serum cholesterol in mg/dl
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg - resting electrocardiographic results (1 = normal; 2 = having ST-T wave abnormality; 0 = hypertrophy)
8. thalachh - maximum heart rate achieved
9. exng - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
11. slp - the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)
12. caa - number of major vessels (0-4) colored by fluoroscopy
13. thall - 2 = normal; 1 = fixed defect; 3 = reversible defect
14. output - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)

**Conclusions on the basic information of the dataset:**
1. The dataset contains very few observations (patients). The analysis can certainly be done, but how well models can generalize from such a small sample is an open question.
2. The data does not require any preprocessing.
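
Both conclusions can be checked with a few lines of pandas rather than read off `hrt.info()` by eye. A minimal sketch, assuming a small made-up frame `demo` that stands in for `hrt` (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for `hrt`; the values are made up for illustration.
demo = pd.DataFrame({
    'age':    [63, 37, 41, 56, 57, 57],
    'chol':   [233, 250, 204, 236, 354, 354],
    'output': [1, 1, 1, 0, 0, 0],
})

n_rows = len(demo)                            # number of patients
n_missing = int(demo.isna().sum().sum())      # total number of missing cells
n_duplicates = int(demo.duplicated().sum())   # fully repeated rows
class_share = demo['output'].value_counts(normalize = True)

print(n_rows, n_missing, n_duplicates)
print(class_share)
```

On the real data the same calls would be applied to `hrt` right after loading, before any column is remapped.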

# EDA

In [None]:
reoutput = {0: 'No heart disease', 1: 'Heart disease'}
hrt['output'] = hrt['output'].map(reoutput)

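# Name the columns explicitly so the code does not depend on the pandas version
output_counts = hrt['output'].value_counts().rename_axis('output').reset_index(name = 'count')
fig = px.pie(output_counts, values = 'count', names = 'output', width = 700, height = 700)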
fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.8, 
                  marker = dict(colors = ['#8d230f','#1e434c'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = 'Distribution of <br> heart disease <br> in patients', 
                                      x = 0.5, y = 0.5, font_size = 28, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = 'black')],
                  showlegend = False)
                  
fig.show()

**Effect of age, resting blood pressure, cholesterol level, ST depression and maximum heart rate on the risk of heart disease**

In [None]:
plt.figure(figsize = (16, 10))
sns.set_style("white")
plt.title('Age distribution of patients depending on heart disease', size = 20, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0,  dashes = (1,7))
a = sns.kdeplot(hrt.query("output == 'No heart disease'")['age'], color = '#1e434c', fill = True, label = 'No heart disease', alpha = 0.8)
sns.kdeplot(hrt.query("output == 'Heart disease'")['age'], color = '#8d230f', fill = True, label = 'Heart disease', alpha = 0.8)
plt.ylabel('')
plt.xlabel('')
plt.xticks(fontname = 'monospace')
plt.yticks([])

for j in ['right', 'left', 'top']:
        a.spines[j].set_visible(False)
a.spines['bottom'].set_linewidth(1.2)

plt.figtext(0.65, 0.65, '''mean
56.6''', fontsize = 14, fontname = 'monospace', color = '#1e434c', ha = 'center')
plt.figtext(0.45, 0.55, '''mean
52.5''', fontsize = 14, fontname = 'monospace', color = '#8d230f', ha = 'center')

plt.show()
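
The mean ages in the two annotations above are typed in by hand. A minimal sketch of how such labels could be computed instead, assuming a made-up frame `demo` with the same `age` and `output` columns (the numbers are illustrative and do not reproduce 56.6 and 52.5):

```python
import pandas as pd

# Hypothetical stand-in for `hrt` after `output` has been mapped to text labels.
demo = pd.DataFrame({
    'age':    [50, 60, 70, 40, 50, 60],
    'output': ['No heart disease'] * 3 + ['Heart disease'] * 3,
})

# Mean age per group, rounded to one decimal as in the annotations.
mean_age = demo.groupby('output')['age'].mean().round(1)

# Text in the same two-line layout that is passed to plt.figtext.
label_no = f"mean\n{mean_age['No heart disease']}"
label_yes = f"mean\n{mean_age['Heart disease']}"
print(label_no)
print(label_yes)
```

The resulting strings could be passed to `plt.figtext` in place of the literals, so the annotation stays correct if the data changes.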

In [None]:
fig = plt.figure(figsize = (15, 15))

plt.subplot(221)
sns.set_style("white")
plt.title('Resting blood pressure', size = 17, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0,  dashes = (1,7))
a = sns.kdeplot(hrt.query("output == 'No heart disease'")['trtbps'], color = '#1e434c', fill = True, label = 'No heart disease', alpha = 0.8)
sns.kdeplot(hrt.query("output == 'Heart disease'")['trtbps'], color = '#8d230f', fill = True, label = 'Heart disease', alpha = 0.8)
plt.ylabel('')
plt.xlabel('')
plt.xticks(fontname = 'monospace')
plt.yticks([])
plt.legend(['No heart disease', 'Heart disease'], bbox_to_anchor = (1.4, -0.1), ncol = 1, borderpad = 3, frameon = False, fontsize = 11)

plt.subplot(222)
plt.title('Oldpeak', size = 17, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0,  dashes = (1,7))
b = sns.kdeplot(hrt.query("output == 'No heart disease'")['oldpeak'], color = '#1e434c', fill = True, alpha = 0.8)
sns.kdeplot(hrt.query("output == 'Heart disease'")['oldpeak'], color = '#8d230f', fill = True, alpha = 0.8)
plt.ylabel('')
plt.xlabel('')
plt.xticks(fontname = 'monospace')
plt.yticks([])

plt.subplot(223)
plt.title('Cholesterol level', size = 17, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0,  dashes = (1,7))
c = sns.kdeplot(hrt.query("output == 'No heart disease'")['chol'], color = '#1e434c', fill = True, alpha = 0.8)
sns.kdeplot(hrt.query("output == 'Heart disease'")['chol'], color = '#8d230f', fill = True, alpha = 0.8)
plt.ylabel('')
plt.xlabel('')
plt.xticks(fontname = 'monospace')
plt.yticks([])

plt.subplot(224)
plt.title('Maximum heart rate ', size = 17, y = 1.03, fontname = 'monospace')
plt.grid(color = 'gray', linestyle = ':', axis = 'x', alpha = 0.8, zorder = 0,  dashes = (1,7))
d = sns.kdeplot(hrt.query("output == 'No heart disease'")['thalachh'], color = '#1e434c', fill = True, alpha = 0.8)
sns.kdeplot(hrt.query("output == 'Heart disease'")['thalachh'], color = '#8d230f', fill = True, alpha = 0.8)
plt.ylabel('')
plt.xlabel('')
plt.xticks(fontname = 'monospace')
plt.yticks([])

for i in [a, b, c, d]:
    for j in ['right', 'left', 'top']:
        i.spines[j].set_visible(False)
    i.spines['bottom'].set_linewidth(1.2)
        
fig.tight_layout(pad = 1)

plt.figtext(0.02, -0.04, 'Conclusions:', fontsize = 18, fontname = 'monospace', color = '#8d230f')

plt.figtext(0.02, -0.1, '''Judging solely from this dataset, resting blood pressure appears unrelated to heart disease, which contradicts medical knowledge. 
High blood pressure provokes the development and progression of serious diseases such as coronary heart disease, chronic heart 
failure and various types of heart rhythm disorders.''', fontsize = 13, fontname = 'monospace')

plt.figtext(0.02, -0.14, '''ST depression induced by exercise relative to rest is measured by electrocardiography. Patients with a value close to 
zero have a higher risk of heart disease in this dataset.''', fontsize = 13, fontname = 'monospace')

plt.figtext(0.02, -0.2, '''Even without deep medical knowledge, many people know that an elevated cholesterol level clogs the walls of blood vessels, 
which leads to heart disease. However, based on the patients in this dataset, cholesterol level does not appear to affect 
heart disease.''', fontsize = 13, fontname = 'monospace')

plt.figtext(0.02, -0.23, '''Patients whose maximum achieved heart rate is high, around 175 beats per minute, have a greater risk of heart disease.''', fontsize = 13, fontname = 'monospace')

plt.figtext(0.02, -0.27, '''And of course, age: it is well known that the incidence of most diseases, including heart disease, increases with age. 
The available data suggests quite the opposite, which is not true in general.''', fontsize = 13, fontname = 'monospace')

plt.show()

In [None]:
resex = {0: 'female', 1: 'male'}
hrt['sex'] = hrt['sex'].map(resex)
sex_no = hrt.query("output == 'No heart disease'").groupby(['sex']).agg({'sex': 'count'}).rename(columns = {'sex': 'count'}).reset_index()
all_sex = hrt.groupby(['sex']).agg({'sex': 'count'}).rename(columns = {'sex': 'count'}).reset_index()
##########################
recp = {0: 'asymptomatic', 1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain'}
hrt['cp'] = hrt['cp'].map(recp)
cp_no = hrt.query("output == 'No heart disease'").groupby(['cp']).agg({'cp': 'count'}).rename(columns = {'cp': 'count'}).reset_index()
all_cp = hrt.groupby(['cp']).agg({'cp': 'count'}).rename(columns = {'cp': 'count'}).reset_index()
##########################
refbs = {0: 'No diabetes', 1: 'Diabetes'}
hrt['fbs'] = hrt['fbs'].map(refbs)
fbs_no = hrt.query("output == 'No heart disease'").groupby(['fbs']).agg({'fbs': 'count'}).rename(columns = {'fbs': 'count'}).reset_index()
all_fbs = hrt.groupby(['fbs']).agg({'fbs': 'count'}).rename(columns = {'fbs': 'count'}).reset_index()
##########################
rerestecg = {0: 'hypertrophy', 1: 'normal', 2: 'abnormality'}
hrt['restecg'] = hrt['restecg'].map(rerestecg)
restecg_no = hrt.query("output == 'No heart disease'").groupby(['restecg']).agg({'restecg': 'count'}).rename(columns = {'restecg': 'count'}).reset_index()
all_restecg = hrt.groupby(['restecg']).agg({'restecg': 'count'}).rename(columns = {'restecg': 'count'}).reset_index()
##########################
reexng = {0: 'No', 1: 'Yes'}
hrt['exng'] = hrt['exng'].map(reexng)
exng_no = hrt.query("output == 'No heart disease'").groupby(['exng']).agg({'exng': 'count'}).rename(columns = {'exng': 'count'}).reset_index()
all_exng = hrt.groupby(['exng']).agg({'exng': 'count'}).rename(columns = {'exng': 'count'}).reset_index()
##########################
reslp = {0: 'downsloping', 1: 'flat', 2: 'upsloping'}
hrt['slp'] = hrt['slp'].map(reslp)
slp_no = hrt.query("output == 'No heart disease'").groupby(['slp']).agg({'slp': 'count'}).rename(columns = {'slp': 'count'}).reset_index()
all_slp = hrt.groupby(['slp']).agg({'slp': 'count'}).rename(columns = {'slp': 'count'}).reset_index()
##########################
recaa = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four'}
hrt['caa'] = hrt['caa'].map(recaa)
caa_no = hrt.query("output == 'No heart disease'").groupby(['caa']).agg({'caa': 'count'}).rename(columns = {'caa': 'count'}).reset_index()
all_caa = hrt.groupby(['caa']).agg({'caa': 'count'}).rename(columns = {'caa': 'count'}).reset_index()
##########################
rethall = {1: 'fixed defect', 2: 'normal', 3: 'reversible defect'}
hrt['thall'] = hrt['thall'].map(rethall)
thall_no = hrt.query("output == 'No heart disease'").groupby(['thall']).agg({'thall': 'count'}).rename(columns = {'thall': 'count'}).reset_index()
all_thall = hrt.groupby(['thall']).agg({'thall': 'count'}).rename(columns = {'thall': 'count'}).reset_index()
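
The cell above repeats the same pair of group-by statements for every categorical column. One possible way to write the pattern once is a small helper. This is only a sketch: `count_by` is a hypothetical name that does not appear in the notebook, and `demo` is made-up stand-in data.

```python
import pandas as pd

def count_by(df, col, only_no_disease = False):
    """Return one row per category of `col` with a 'count' column."""
    if only_no_disease:
        # Keep only the patients without heart disease before counting.
        df = df[df['output'] == 'No heart disease']
    return df.groupby(col).size().reset_index(name = 'count')

# Hypothetical stand-in for `hrt` after the columns were mapped to labels.
demo = pd.DataFrame({
    'sex':    ['male', 'male', 'female', 'male', 'female'],
    'output': ['Heart disease', 'No heart disease', 'Heart disease',
               'No heart disease', 'No heart disease'],
})

all_sex_demo = count_by(demo, 'sex')
sex_no_demo = count_by(demo, 'sex', only_no_disease = True)
print(all_sex_demo)
print(sex_no_demo)
```

With a helper like this, each `all_*` and `*_no` frame would be a single call, and a new categorical column would need two lines instead of four.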

In [None]:
fig = plt.figure(figsize = (14, 70))
################################################################### 
plt.subplot(811)
sns.set_style("white")
plt.title('Sex', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_sex, x = 'count', y = 'sex', color = '#8d230f')
b = sns.barplot(data = sex_no, x = 'count', y = 'sex', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)
    
for p in b.patches[2:4]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
    
for p in range(2):
    width = b.patches[p].get_width() - sex_no['count'][p]
    plt.text((width + sex_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
###################################################################    
plt.subplot(812)
sns.set_style("white")
plt.title('Chest pain type', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_cp, x = 'count', y = 'cp', color = '#8d230f')
b = sns.barplot(data = cp_no, x = 'count', y = 'cp', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[4:9]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
    
for p in range(4):
    width = b.patches[p].get_width() - cp_no['count'][p]
    plt.text((width + cp_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
###################################################################   
plt.subplot(813)
sns.set_style("white")
plt.title('Fasting blood sugar', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_fbs, x = 'count', y = 'fbs', color = '#8d230f')
b = sns.barplot(data = fbs_no, x = 'count', y = 'fbs', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[2:4]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
    
for p in range(2):
    width = b.patches[p].get_width() - fbs_no['count'][p]
    plt.text((width + fbs_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
################################################################### 
plt.subplot(814)
sns.set_style("white")
plt.title('Electrocardiographic results', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_restecg, x = 'count', y = 'restecg', color = '#8d230f')
b = sns.barplot(data = restecg_no, x = 'count', y = 'restecg', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[3:7]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
    
for p in range(3):
    width = b.patches[p].get_width() - restecg_no['count'][p]
    plt.text((width + restecg_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
################################################################### 
plt.subplot(815)
sns.set_style("white")
plt.title('Exercise induced angina', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_exng, x = 'count', y = 'exng', color = '#8d230f')
b = sns.barplot(data = exng_no, x = 'count', y = 'exng', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[2:4]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
    
for p in range(2):
    width = b.patches[p].get_width() - exng_no['count'][p]
    plt.text((width + exng_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 30, color = 'white')
################################################################### 
plt.subplot(816)
sns.set_style("white")
plt.title('Slope of the peak exercise ST segment', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_slp, x = 'count', y = 'slp', color = '#8d230f')
b = sns.barplot(data = slp_no, x = 'count', y = 'slp', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[3:7]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
    
for p in range(3):
    width = b.patches[p].get_width() - slp_no['count'][p]
    plt.text((width + slp_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
################################################################### 
plt.subplot(817)
sns.set_style("white")
plt.title('Major vessels colored by fluoroscopy', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_caa, x = 'count', y = 'caa', color = '#8d230f')
b = sns.barplot(data = caa_no, x = 'count', y = 'caa', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[5:11]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
    
for p in range(5):
    width = b.patches[p].get_width() - caa_no['count'][p]
    plt.text((width + caa_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
################################################################### 
plt.subplot(818)
sns.set_style("white")
plt.title('Thall', fontsize = 25, fontname = 'monospace', x = 0.5, y = 1.05)
a = sns.barplot(data = all_thall, x = 'count', y = 'thall', color = '#8d230f')
b = sns.barplot(data = thall_no, x = 'count', y = 'thall', color = '#1e434c')
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14)
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)

for p in b.patches[3:7]:
    width = p.get_width()
    plt.text(width/2, p.get_y() + 0.55*p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
    
for p in range(3):
    width = b.patches[p].get_width() - thall_no['count'][p]
    plt.text((width + thall_no['count'][p]*2)/2, b.patches[p].get_y() + 0.55*b.patches[p].get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 17, color = 'white')
################################################################### 

fig.tight_layout(h_pad = 30)

plt.figtext(0.16, 0.918, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.908, '''In this dataset, women have a higher share of heart disease than men. However, it is well 
established that men are more prone to heart disease.''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.787, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.779, '''Patients with asymptomatic chest pain type have the lowest risk of heart disease.''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.655, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.637, '''In this dataset, diabetes does not appear to affect the risk of heart disease, which is an 
artifact of the small sample. According to the American Heart Association, a third of all deaths 
among patients with diabetes are associated with disorders of the cardiovascular system. 
Diabetics develop heart attacks and strokes faster - and this is a fact!''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.523, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.513, '''People with a normal result of electrocardiography are slightly more prone to heart disease - 
I think everyone understands that this is also not true, but data is data.''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.391, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.381, '''Under certain circumstances, physical activity can cause heart problems, but not necessarily. 
In our case, most of the patients did not experience exercise-induced angina.''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.259, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.252, '''Patients who have upsloping of the peak exercise ST segment have a higher risk of heart disease''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, 0.127, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, 0.117, '''Patients who have zero major vessels colored by fluoroscopy have a higher risk of 
heart disease.''', fontsize = 14, fontname = 'monospace')

plt.figtext(0.16, -0.006, 'Conclusion:', fontsize = 18, fontname = 'monospace', color = '#8d230f')
plt.figtext(0.16, -0.012, '''Patients who have a normal thall result have a higher risk of heart disease.''', fontsize = 14, fontname = 'monospace')

plt.show()
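
The stacked bars show counts, while the conclusions are about risk, which is a share within each category. A minimal sketch of how that share could be tabulated with `pd.crosstab`, assuming made-up stand-in data in `demo` (not the heart dataset):

```python
import pandas as pd

# Hypothetical stand-in for two mapped columns of `hrt`.
demo = pd.DataFrame({
    'sex':    ['female', 'female', 'female', 'female',
               'male', 'male', 'male', 'male'],
    'output': ['Heart disease', 'Heart disease', 'Heart disease', 'No heart disease',
               'Heart disease', 'No heart disease', 'No heart disease', 'No heart disease'],
})

# normalize = 'index' turns each row into shares that sum to 1.
rates = pd.crosstab(demo['sex'], demo['output'], normalize = 'index')
disease_share = rates['Heart disease'].round(2)
print(disease_share)
```

Comparing shares instead of raw counts avoids misreading a category as risky only because it contains many patients.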

# Preprocessing

In [None]:
reoutput = {'Heart disease': 1, 'No heart disease': 0}
hrt['output'] = hrt['output'].map(reoutput)

X = hrt.drop(['output'], axis = 1)
y = hrt['output']

num_cols = X.select_dtypes(include = ['int64', 'float64']).columns.to_list()
cat_cols = X.select_dtypes(include = ['object']).columns.to_list()

In [None]:
def label_encoder(df):
    for i in cat_cols:
        le = LabelEncoder()
        df[i] = le.fit_transform(df[i])
    return df

In [None]:
sc = StandardScaler()
X[num_cols] = sc.fit_transform(X[num_cols])

# Label encoding
X = label_encoder(X)

X.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 228)
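
In the cells above the scaler is fitted on all rows before the split, so the test rows contribute to the scaling statistics. A common alternative is to fit the scaler on the training part only, for example through a `Pipeline`. A minimal sketch on synthetic data from `make_classification` (not the heart dataset; all variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data of roughly the same size as the notebook's sample.
X_demo, y_demo = make_classification(n_samples = 300, n_features = 5, random_state = 228)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.2, random_state = 228)

# The pipeline fits the scaler on the training rows only, then the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state = 228))
pipe.fit(X_tr, y_tr)

# The learned means come from the training split, not from the full data.
scaler = pipe.named_steps['standardscaler']
print(np.allclose(scaler.mean_, X_tr.mean(axis = 0)))
print(pipe.predict(X_te).shape)
```

Passing such a pipeline to `cross_val_score` would also refit the scaler inside every fold, which keeps the cross-validation estimate free of the same leakage.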

In [None]:
results = pd.DataFrame(columns = ['LR', 'RF', 'LGBM'], index = range(4))

# Modeling

**Logistic regression**

In [None]:
lg = LogisticRegression(random_state = 228)
lg.fit(X_train, y_train)
y_pred = lg.predict(X_test)
y_prob = lg.predict_proba(X_test)[:,1]

# Metrics
results.iloc[0, 0] = round(precision_score(y_test, y_pred), 2)
results.iloc[1, 0] = round(recall_score(y_test, y_pred), 2)
results.iloc[2, 0] = round(f1_score(y_test, y_pred), 2)
results.iloc[3, 0] = round(roc_auc_score(y_test, y_prob), 3)
lg_cm = confusion_matrix(y_test, y_pred)

print(classification_report(y_test, y_pred))
print(f'ROC AUC score: {round(roc_auc_score(y_test, y_prob), 3)}')
print('')
print('-----------------------------------------------------')
print('')
print('Cross-validation scores with 5 folds:')
print('')
print(f"ROC AUC: {round(cross_val_score(lg, X_train, y_train, cv = 5, scoring = 'roc_auc').mean(), 3)}")
print(f"precision: {round(cross_val_score(lg, X_train, y_train, cv = 5, scoring = 'precision').mean(), 2)}")
print(f"recall: {round(cross_val_score(lg, X_train, y_train, cv = 5, scoring = 'recall').mean(), 2)}")
print(f"f1: {round(cross_val_score(lg, X_train, y_train, cv = 5, scoring = 'f1').mean(), 2)}")

# Visualize confusion matrix
plt.figure(figsize = (8, 5))
sns.heatmap(lg_cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15}, 
            yticklabels = ['No heart disease', 'Heart disease'], xticklabels = ['Predicted no heart disease', 'Predicted heart disease'])
plt.yticks(rotation = 0)
plt.show()

# Roc curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

sns.set_theme(style = 'white')
plt.figure(figsize = (8, 8))
plt.plot(false_positive_rate,true_positive_rate, color = '#b01717', label = 'AUC = %0.3f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], linestyle = '--', color = '#174ab0')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Feature importance
# Build the table directly from the column names and the absolute coefficients
f_imp = pd.DataFrame({'feature': X_train.columns.to_list(),
                      'importance (abs coef)': np.abs(lg.coef_[0])})
f_imp = f_imp.sort_values('importance (abs coef)', ascending = False)
f_imp[0:12].style.background_gradient(cmap = 'Blues')
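
The four `cross_val_score` calls in the cell above each refit the model on the same folds. `cross_validate` from scikit-learn accepts a list of scorers and computes them in one pass. A minimal sketch on synthetic data (not the heart dataset; the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples = 200, n_features = 5, random_state = 228)
model = LogisticRegression(random_state = 228)

# One call evaluates every metric on the same five folds.
scoring = ['roc_auc', 'precision', 'recall', 'f1']
cv_out = cross_validate(model, X_demo, y_demo, cv = 5, scoring = scoring)

# Results are stored under 'test_<metric>' keys, one value per fold.
cv_means = {name: round(cv_out[f'test_{name}'].mean(), 3) for name in scoring}
print(cv_means)
```

The same dictionary could be built for each of the three models and collected into one frame for comparison.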

**Random forest**

In [None]:
rf = RandomForestClassifier(random_state = 228, max_depth = 5)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:,1]

# Metrics
results.iloc[0, 1] = round(precision_score(y_test, y_pred), 2)
results.iloc[1, 1] = round(recall_score(y_test, y_pred), 2)
results.iloc[2, 1] = round(f1_score(y_test, y_pred), 2)
results.iloc[3, 1] = round(roc_auc_score(y_test, y_prob), 3)
rf_cm = confusion_matrix(y_test, y_pred)

print(classification_report(y_test, y_pred))
print(f'ROC AUC score: {round(roc_auc_score(y_test, y_prob), 3)}')
print('')
print('-----------------------------------------------------')
print('')
print('Cross-validation scores with 5 folds:')
print('')
print(f"ROC AUC: {round(cross_val_score(rf, X_train, y_train, cv = 5, scoring = 'roc_auc').mean(), 3)}")
print(f"precision: {round(cross_val_score(rf, X_train, y_train, cv = 5, scoring = 'precision').mean(), 2)}")
print(f"recall: {round(cross_val_score(rf, X_train, y_train, cv = 5, scoring = 'recall').mean(), 2)}")
print(f"f1: {round(cross_val_score(rf, X_train, y_train, cv = 5, scoring = 'f1').mean(), 2)}")

# Visualize confusion matrix
plt.figure(figsize = (8, 5))
sns.heatmap(rf_cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15},
           yticklabels = ['No heart disease', 'Heart disease'], xticklabels = ['Predicted no heart disease', 'Predicted heart disease'])
plt.yticks(rotation = 0)
plt.show()

# Roc curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

sns.set_theme(style = 'white')
plt.figure(figsize = (8, 8))
plt.plot(false_positive_rate,true_positive_rate, color = '#b01717', label = 'AUC = %0.3f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], linestyle = '--', color = '#174ab0')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Feature importance
# Build the table directly from the column names and the forest importances
f_imp2 = pd.DataFrame({'feature': X_train.columns.to_list(),
                       'importance': rf.feature_importances_})
f_imp2 = f_imp2.sort_values('importance', ascending = False)
f_imp2[0:12].style.background_gradient(cmap = 'Blues')

**LGBM**

In [None]:
lgbm = LGBMClassifier(random_state = 228, max_depth = 5, num_leaves = 50, n_estimators = 20, learning_rate = 0.1)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
y_prob = lgbm.predict_proba(X_test)[:,1]

# Metrics
results.iloc[0, 2] = round(precision_score(y_test, y_pred), 2)
results.iloc[1, 2] = round(recall_score(y_test, y_pred), 2)
results.iloc[2, 2] = round(f1_score(y_test, y_pred), 2)
results.iloc[3, 2] = round(roc_auc_score(y_test, y_prob), 3)
lgbm_cm = confusion_matrix(y_test, y_pred)

print(classification_report(y_test, y_pred))
print(f'ROC AUC score: {round(roc_auc_score(y_test, y_prob), 3)}')
print('')
print('-----------------------------------------------------')
print('')
print('Cross-validation scores with 5 folds:')
print('')
print(f"ROC AUC: {round(cross_val_score(lgbm, X_train, y_train, cv = 5, scoring = 'roc_auc').mean(), 3)}")
print(f"precision: {round(cross_val_score(lgbm, X_train, y_train, cv = 5, scoring = 'precision').mean(), 2)}")
print(f"recall: {round(cross_val_score(lgbm, X_train, y_train, cv = 5, scoring = 'recall').mean(), 2)}")
print(f"f1: {round(cross_val_score(lgbm, X_train, y_train, cv = 5, scoring = 'f1').mean(), 2)}")

# Visualize confusion matrix
plt.figure(figsize = (8, 5))
sns.heatmap(lgbm_cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15},
           yticklabels = ['No heart disease', 'Heart disease'], xticklabels = ['Predicted no heart disease', 'Predicted heart disease'])
plt.yticks(rotation = 0)
plt.show()

# Roc curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(false_positive_rate, true_positive_rate)

sns.set_theme(style = 'white')
plt.figure(figsize = (8, 8))
plt.plot(false_positive_rate,true_positive_rate, color = '#b01717', label = 'AUC = %0.3f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], linestyle = '--', color = '#174ab0')
plt.axis('tight')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Feature importance
# Build the table directly from the column names and the LGBM importances
f_imp3 = pd.DataFrame({'feature': X_train.columns.to_list(),
                       'importance': lgbm.feature_importances_})
f_imp3 = f_imp3.sort_values('importance', ascending = False)
f_imp3[0:12].style.background_gradient(cmap = 'Blues')

# Results and conclusion

**Results**

In [None]:
plt.figure(figsize = (7, 5))
sns.heatmap(results[results.columns.to_list()].astype(float), cmap = 'Blues', annot = True, linewidths = 2, cbar = False, annot_kws = {'fontsize': 15},
           yticklabels = ['Precision', 'Recall', 'F1', 'ROC AUC'])
sns.set(font_scale = 1.2)
plt.yticks(rotation = 0)
plt.show()

Logistic regression showed the best result on the test sample, but it has the worst cross-validation scores. Although I present the results for the test sample, we need to focus on cross-validation. Random forest showed the best precision, and LGBM the best recall.

Of the three models, I would recommend LGBM. It would be interesting to tune the models, but what is there to tune on a dataset of roughly 300 observations? :) That is a sure path to overfitting.
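
One way to see how unstable scores are on a sample of this size is to repeat cross-validation with different shuffles and look at the spread. A minimal sketch using `RepeatedStratifiedKFold` on synthetic data of a similar size (not the heart dataset; the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data with a similar number of rows and features.
X_demo, y_demo = make_classification(n_samples = 300, n_features = 13, random_state = 228)

# 5 folds repeated 10 times with different shuffles gives 50 scores.
cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 10, random_state = 228)
scores = cross_val_score(LogisticRegression(max_iter = 1000), X_demo, y_demo,
                         cv = cv, scoring = 'roc_auc')

print(f'ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds')
```

If the standard deviation is comparable to the gap between two models, the ranking of those models on a single split should not be trusted.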

**Conclusion**

Honestly, I don't know why this dataset is so popular. It contains many errors that the author does not mention, and there is no practical sense in building models on it. Firstly, the sample is very small. Secondly, judging by basic medical knowledge, the data shows completely incorrect trends, which I described in the analysis; all this is again due to the small size of the data. As a training dataset, to see how models learn on such a small sample, it is interesting, but, I repeat, this data has no practical value.