# **Introduction**

Hi everyone!

The objective of this notebook is to test the Kaggle notebook features and to play a little with Python visualization tools (especially seaborn) and sklearn.
I wanted to explore the variables of this "famous" dataset and build several models with the sklearn API.

After some basic exploration, some of the results and insights didn't make sense to me (I'm not an expert in the field, so I questioned myself a lot during this exercise). For example, exercise-induced angina (exang) showed a **lower probability of having heart disease if the patient had exercise-induced angina**, which was counterintuitive given the information I found. After digging into the discussion tab on the dataset page, I found the [CORRECT](https://www.kaggle.com/ronitf/heart-disease-uci/discussion/105877) description, so I copied the correct information into the dataset description below and corrected some values afterwards.

Any suggestions, feedback, or comments are welcome!

## **Content**

1. [**Loading the dataset.**](#section-one) 
2. [**Exploratory Data Analysis (EDA) and hypothesis testing.**](#section-two)
    1. [**Dataset description.** ](#section-two-subsection-one)
    2. [**Essential exploration** (dataset describe, null's statistics).](#section-two-subsection-two)
    3. [**Univariate, Bivariate and Multivariate analysis**](#section-two-subsection-three). The typical pipeline for exploring a dataset starts with the univariate analysis, continues with the bivariate one, and finishes with the multivariate one. From my experience, I prefer to perform the univariate analysis along with the bivariate one and, in specific cases, to bring in a third or even a fourth variable for a multivariate view.
3. [**Modelling.**](#section-three)
4. [**Evaluation.**](#section-four)

# **Loading the dataset**
<a id="section-one"></a>

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-cleveland-uci/heart_cleveland_upload.csv')

# **Exploratory Data Analysis (EDA)**
<a id="section-two"></a>

## **Dataset description**:
<a id="section-two-subsection-one"></a>

- Age **(age)**
- Sex **(sex)**:
    - 0: female
    - 1: male
- Chest Pain Type **(cp)**:
    - 0: Typical angina
    - 1: Atypical angina
    - 2: Non-Anginal pain
    - 3: Asymptomatic
- Resting blood pressure **(trestbps)**
- Serum cholesterol in mg/dl **(chol)**
- Fasting blood sugar **(fbs)**:
    - 0: < 120 mg/dl
    - 1: > 120 mg/dl
- Resting electrocardiographic results **(restecg)**
    - 0: normal
    - 1: ST-T wave abnormality
    - 2: Left ventricular hypertrophy by Estes' criteria (probable or definite)
- Maximum heart rate achieved **(thalach)**
- Exercise induced angina **(exang)**
    - 0: no
    - 1: yes
- ST depression induced by exercise relative to rest **(oldpeak)**
- The slope of the peak exercise ST segment **(slope)**
    - 0: upsloping
    - 1: flat
    - 2: downsloping
- Thalassemia: A blood disorder **(thal)**
    - 0: normal
    - 1: Fixed defect
    - 2: Reversible defect
- Number of Major Vessels: Number of major vessels colored by fluoroscopy **(ca)**
- Condition: target variable **(condition)**
    - 0: no disease
    - 1: disease

## **Essential exploration**
<a id="section-two-subsection-two"></a>

In [None]:
df.describe()

In [None]:
pd.DataFrame({
    'unique_values_per_column': df.nunique(),
    'percentage_of_unique_values': round((df.nunique()/len(df))*100, 3)
}).sort_values(by=['percentage_of_unique_values'], ascending=False)

In [None]:
# Fraction of missing values per column (the mean of a boolean mask is the share of True values)
df.isna().mean()

### **Initial findings**

- About 68% of the data points are from male patients, which may induce a systematic bias in the modelling part.
- After the cleaning performed in [this data source](https://www.kaggle.com/cherngs/heart-disease-cleveland-uci), there are no NaNs in the whole dataset.
- The target variable *condition* is fairly balanced (about 47% of the observations are 1), so there is no need for any balancing pre-processing before modelling.
- The variables **oldpeak (ST depression)**, **age** and **trestbps (resting blood pressure)** are continuous variables with few unique values (between 10% and 20% of the number of rows); in the modelling part they could perhaps be binned into ordinal variables.
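
The first three findings can be reproduced with a few lines of pandas. The cell below is only a sketch: the helper name `summarize_initial_findings` and the tiny `toy` frame are made up for illustration, and in the notebook the function would be called with `df` instead.

```python
import pandas as pd


def summarize_initial_findings(frame: pd.DataFrame) -> dict:
    """Return the share of males, the share of positive targets and the NaN count."""
    return {
        # sex and condition are coded 0/1, so the mean is the share of 1s
        'male_share': frame['sex'].mean(),
        'disease_share': frame['condition'].mean(),
        # total number of missing cells across all columns
        'total_nans': int(frame.isna().sum().sum()),
    }


# Made-up data, only to show the call; use `df` in the notebook.
toy = pd.DataFrame({'sex': [1, 1, 1, 0], 'condition': [1, 0, 1, 0]})
summary = summarize_initial_findings(toy)
print(summary)
```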

## **Univariate, Bivariate & Multivariate Analysis.**
<a id="section-two-subsection-three"></a>

**Most of the univariate analyses are accompanied by a bivariate analysis against the variable *condition*.**

### **Age**

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20, 16))

# First row
sns.histplot(data=df, x='age', kde=True, ax=ax[0, 0])
ax[0, 0].set_ylabel('Frequency')

sns.boxplot(x='age', data=df, orient='h',  width=.45, ax=ax[0, 1])

sns.stripplot(x="age", data=df, orient='h', size=4, color='.3', linewidth=0, ax=ax[0, 1])


# Second row
sns.histplot(data=df, x='age', hue='condition', kde=True, ax=ax[1, 0])
ax[1, 0].set_ylabel('Frequency')
sns.boxplot(x='age', y='condition', data=df, orient='h',  width=.45, ax=ax[1, 1])
sns.stripplot(x="age", y='condition', data=df, orient='h', size=4, color='.3', linewidth=0, ax=ax[1, 1])
ax[1, 1].set_yticklabels(['no-disease', 'disease'])
ax[1, 1].set_ylabel('')

fig.suptitle('Exploration of Age variable discriminated by condition factor level')
plt.show();

The ages cover a wide range of adult life, reaching values above 70.
The two condition groups are clearly separated by their mean age: the mass of the age distribution for patients with heart disease sits at higher ages (it is skewed to the left), while the distribution for patients with no heart disease looks roughly Gaussian.
One relevant insight is that the age distribution of the patients with heart disease is narrower (less variability) and has higher kurtosis. These graphs suggest that the probability of having heart disease is highest between the ages of 55 and 65; let's check.

In [None]:
fig, ax = plt.subplots(nrows=2, figsize=(20, 7), sharex=True)

sns.lineplot(data=df.groupby('age').agg(no_observations=pd.NamedAgg('condition', 'count')).reset_index(), x='age', y='no_observations', ax=ax[0])
ax[0].set_ylabel('Number of observations')
ax[0].set_yticks(list(range(0, 20, 2)))
ax[0].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False) 

sns.lineplot(data=df, x='age', y='condition', estimator='mean', ax=ax[1])
ax[1].set_ylabel('Estimated probability of\n having a heart disease')
ax[1].set_xticks(list(range(26, 80, 2)))

fig.suptitle('Age deep dive analysis')
plt.show()

As seen in the graphs above, the estimated probability of having heart disease is highest when the patient is between 55 and 62 years old (the top panel shows how many observations back each estimate). In the multivariate analysis, age could be crossed with the sex variable to check the interaction between the two and its effect on the probability of having heart disease.
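
Because a per-year estimate rests on very few patients, a binned version of the same calculation can be easier to read. The sketch below is not part of the original analysis: the function name `disease_rate_by_age_bin`, the bin edges and the `toy` frame are assumptions chosen for illustration.

```python
import pandas as pd


def disease_rate_by_age_bin(frame: pd.DataFrame, bin_edges: list) -> pd.DataFrame:
    """Count patients and average the 0/1 target inside left-closed age bins."""
    binned_age = pd.cut(frame['age'], bins=bin_edges, right=False)
    return (
        frame.groupby(binned_age, observed=False)['condition']
        .agg(n_patients='count', disease_rate='mean')
    )


# Made-up data, only to show the call; use `df` in the notebook.
toy = pd.DataFrame({
    'age': [40, 45, 58, 60, 61, 70],
    'condition': [0, 0, 1, 1, 0, 1],
})
age_table = disease_rate_by_age_bin(toy, bin_edges=[30, 55, 65, 80])
print(age_table)
```

Keeping `n_patients` next to `disease_rate` makes it harder to over-read a rate that comes from a handful of rows.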

### **Sex**
- 0: Female
- 1: Male

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 8))

sns.countplot(data=df, x='sex', order=df['sex'].value_counts().index, ax=ax[0])
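# Derive the labels from the plotted order instead of assuming males are the most frequent
ax[0].set_xticklabels(['Male' if i == 1 else 'Female' for i in df['sex'].value_counts().index])
ax[0].set_ylabel('Frequency')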

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )

sns.barplot(data=df, x='sex', y='condition', order=df['sex'].value_counts().index, ax=ax[1])
ax[1].set_ylabel('Probability of presence of heart disease given sex')
ax[1].set_xticklabels(['Male' if i == 1 else 'Female' for i in df['sex'].value_counts().index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])

for p in ax[1].patches:
    height = p.get_height()
    ax[1].text(
        (p.get_x() +  p.get_width()/1.5),
        height*1.05,
        '{:1.2f}%'.format(height * 100),
        ha='center'
    )

fig.suptitle('Sex variable deep dive')

plt.show();

As seen before, there are roughly twice as many men as women in the dataset. The important takeaway is that the estimated probability of heart disease is also roughly twice as high for men as for women, so sex will be an important variable in the modelling part.
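
The "roughly twice" comparison can be expressed as a ratio of group rates. This is a minimal sketch under the assumption that both columns are coded 0/1 as in the dataset description; `disease_rate_ratio` and the `toy` frame are hypothetical names and data, not part of the original notebook.

```python
import pandas as pd


def disease_rate_ratio(frame: pd.DataFrame, group_col: str = 'sex',
                       target_col: str = 'condition'):
    """Return the disease rate per group and the rate of group 1 divided by group 0."""
    rates = frame.groupby(group_col)[target_col].mean()
    return rates, rates.loc[1] / rates.loc[0]


# Made-up data, only to show the call; use `df` in the notebook.
toy = pd.DataFrame({
    'sex':       [1, 1, 1, 1, 0, 0, 0, 0],
    'condition': [1, 1, 0, 0, 1, 0, 0, 0],
})
rates, ratio = disease_rate_ratio(toy)
print(rates)
print(ratio)
```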

### **Chest pain (CP)**
- 0: Typical angina
- 1: Atypical angina
- 2: Non-Anginal pain
- 3: Asymptomatic

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))

mapped_values = {
    0: 'typical-angina',
    1: 'atypical-angina',
    2: 'non-anginal-pain',
    3: 'asymptomatic'
}

sns.countplot(data=df, x='cp', order=df['cp'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['cp'].value_counts().index])
ax[0].set_ylabel('Frequency')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )


sns.barplot(
    data=df,
    x='cp',
    y='condition',
    order=df.groupby('cp')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('cp')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])


fig.suptitle('Probability estimation of heart disease given chest pain')
plt.show();

The key insight here is that chest pain type is clearly indicative of the presence of heart disease. Later in the notebook I will investigate the interaction of this variable with other relevant variables to see whether it improves the discrimination.

### **Resting blood pressure**

In [None]:
fig, ax = plt.subplots(
    nrows=2,
    ncols=2,
    figsize=(25, 10),
    sharex=True,
    gridspec_kw={
        'width_ratios': [15, 15],
        'height_ratios': [7.5, 2.5]
    }
)

sns.histplot(data=df, x='trestbps', kde=True, ax=ax[0, 0])
ax[0, 0].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Resting blood pressure as single variable')
sns.boxplot(data=df, x='trestbps', orient='h', ax=ax[1, 0])
sns.stripplot(data=df, x='trestbps', orient='h', color=".25", ax=ax[1, 0])
ax[1, 0].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 0].set_xlabel('')


sns.histplot(data=df, x='trestbps',hue='condition', kde=True, ax=ax[0, 1])
ax[0, 1].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Resting blood pressure discriminated by condition')
sns.boxplot(data=df, x='trestbps', y='condition', orient='h', ax=ax[1, 1])
sns.stripplot(data=df, x='trestbps', y='condition', orient='h', color=".25", ax=ax[1, 1])
ax[1, 1].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 1].set_xlabel('')

fig.suptitle('Histogram and KDE for Resting blood pressure')
plt.show()

As the plots above show, resting blood pressure doesn't seem to discriminate much between the presence and absence of heart disease. Intuitively this seems a little odd, because blood pressure is usually monitored in patients with a heart condition. Simpson's paradox may be at play here, so later I will perform a multivariate analysis to see whether this variable discriminates once its interaction with other variables is taken into account.
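
One simple way to look for a Simpson-type effect is to compare the pooled group means with the same means computed inside each stratum of a third variable. The sketch below assumes sex as the stratifying variable purely as an example; `pooled_and_stratified_means` and the `toy` frame are hypothetical.

```python
import pandas as pd


def pooled_and_stratified_means(frame: pd.DataFrame, value_col: str,
                                strata_col: str, target_col: str = 'condition'):
    """Return mean(value) per target level, pooled and inside each stratum."""
    pooled = frame.groupby(target_col)[value_col].mean()
    stratified = frame.pivot_table(
        index=strata_col, columns=target_col, values=value_col, aggfunc='mean'
    )
    return pooled, stratified


# Made-up data, only to show the call; use `df` in the notebook.
toy = pd.DataFrame({
    'sex':       [0, 0, 1, 1],
    'condition': [0, 1, 0, 1],
    'trestbps':  [120, 130, 125, 140],
})
pooled, stratified = pooled_and_stratified_means(toy, 'trestbps', 'sex')
print(pooled)
print(stratified)
```

If the difference between the two `condition` columns changes size or sign from one stratum to another, the pooled comparison is hiding an interaction worth plotting.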

### **Serum cholesterol in mg/dl**

In [None]:
fig, ax = plt.subplots(
    nrows=2,
    ncols=2,
    figsize=(25, 10),
    sharex=True,
    gridspec_kw={
        'width_ratios': [15, 15],
        'height_ratios': [7.5, 2.5]
    }
)

sns.histplot(data=df, x='chol', kde=True, ax=ax[0, 0])
ax[0, 0].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Cholesterol (mg/dl)')
sns.boxplot(data=df, x='chol', orient='h', ax=ax[1, 0])
sns.stripplot(data=df, x='chol', orient='h', color=".25", ax=ax[1, 0])
ax[1, 0].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 0].set_xlabel('')


sns.histplot(data=df, x='chol',hue='condition', kde=True, ax=ax[0, 1])
ax[0, 1].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Cholesterol (mg/dl) discriminated by condition')
sns.boxplot(data=df, x='chol', y='condition', orient='h', ax=ax[1, 1])
sns.stripplot(data=df, x='chol', y='condition', orient='h', color=".25", ax=ax[1, 1])
ax[1, 1].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 1].set_xlabel('')

fig.suptitle('Histogram and KDE for Cholesterol (mg/dl)')
plt.show();

There is a slight difference in cholesterol between patients with and without heart disease; checking the interaction with another variable may discriminate better.

### **Fasting blood sugar > 120 mg/dl**

- 0: < 120 mg/dl
- 1: > 120 mg/dl

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))

mapped_values = {
    0: '< 120 mg/dl',
    1: '> 120 mg/dl'
}

sns.countplot(data=df, x='fbs', order=df['fbs'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['fbs'].value_counts().index])
ax[0].set_ylabel('Frequency')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='fbs',
    y='condition',
    order=df.groupby('fbs')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('fbs')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])

fig.suptitle('Probability estimation for Fasting blood sugar > 120 mg/dl')
plt.show();

### **Resting electrocardiographic results (discrete)**
- 0: normal
- 1: ST-T wave abnormality
- 2: Left ventricular hypertrophy by Estes' criteria (probable or definite)

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))


mapped_values = {
    0: 'normal',
    1: 'ST-T wave abnormality',
    2: 'Left ventricular hypertrophy'
}

sns.countplot(data=df, x='restecg', order=df['restecg'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['restecg'].value_counts().index])
ax[0].set_ylabel('Frequency')
ax[0].set_title('Frequency barplot')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='restecg',
    y='condition',
    order=df.groupby('restecg')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('restecg')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])
ax[1].set_title('Estimated probability barplot')

fig.suptitle('Probability estimation for Resting electrocardiographic results')
plt.show();

As expected, non-normal results tend to increase the likelihood of heart disease. It is essential to note, however, how few observations there are for the ST-T wave abnormality level, which, interestingly, is the factor level with the highest estimated probability of the three.
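
To make the small-sample caveat concrete, each estimated probability can be paired with an interval. The sketch below uses the Wilson score interval, computed by hand; this is my choice for illustration, not something the original analysis used, and `wilson_interval`, `rate_with_interval` and the `toy` frame are hypothetical.

```python
import math

import pandas as pd


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return (centre - half_width, centre + half_width)


def rate_with_interval(frame: pd.DataFrame, group_col: str,
                       target_col: str = 'condition') -> pd.DataFrame:
    """Disease rate per level of group_col, with its count and Wilson bounds."""
    stats = frame.groupby(group_col)[target_col].agg(n='count', positives='sum')
    bounds = [wilson_interval(row.positives, row.n) for row in stats.itertuples()]
    stats['rate'] = stats['positives'] / stats['n']
    stats['ci_low'] = [low for low, _ in bounds]
    stats['ci_high'] = [high for _, high in bounds]
    return stats


# Made-up data: 20 rows at level 0 (8 positives) and only 3 rows at level 1.
toy = pd.DataFrame({
    'restecg':   [0] * 20 + [1] * 3,
    'condition': [1] * 8 + [0] * 12 + [1] * 3,
})
interval_table = rate_with_interval(toy, 'restecg')
print(interval_table)
```

A level with only a few rows gets a visibly wider interval, which is the caveat that applies to the ST-T wave abnormality bar.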

### **Maximum heart rate achieved**

In [None]:
fig, ax = plt.subplots(
    nrows=2,
    ncols=2,
    figsize=(25, 10),
    sharex=True,
    gridspec_kw={
        'width_ratios': [15, 15],
        'height_ratios': [7.5, 2.5]
    }
)

sns.histplot(data=df, x='thalach', kde=True, ax=ax[0, 0])
ax[0, 0].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('Maximum heart rate achieved (BPM)')
sns.boxplot(data=df, x='thalach', orient='h', ax=ax[1, 0])
sns.stripplot(data=df, x='thalach', orient='h', color=".25", ax=ax[1, 0])
ax[1, 0].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 0].set_xlabel('')


sns.histplot(data=df, x='thalach',hue='condition', kde=True, ax=ax[0, 1])
ax[0, 1].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('Maximum heart rate achieved (BPM) discriminated by condition')
sns.boxplot(data=df, x='thalach', y='condition', orient='h', ax=ax[1, 1])
sns.stripplot(data=df, x='thalach', y='condition', orient='h', color=".25", ax=ax[1, 1])
ax[1, 1].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 1].set_xlabel('')

fig.suptitle('Histogram and KDE for Maximum heart rate achieved (BPM)')
plt.show();

This variable seems to discriminate well. It may be related to exercise-induced angina: a patient who suffers from angina during exercise may not be able to reach a higher heart rate. Later I will explore that assumption.
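
A first, rough check of that assumption is to compare the maximum heart rate between patients with and without exercise-induced angina. The sketch below is illustrative only; `summarize_by_group` and the `toy` frame are hypothetical, and a difference in means on its own would not establish the causal story.

```python
import pandas as pd


def summarize_by_group(frame: pd.DataFrame, value_col: str, group_col: str):
    """Return count/mean/median of value_col per group, plus the Pearson correlation."""
    summary = frame.groupby(group_col)[value_col].agg(['count', 'mean', 'median'])
    # With a 0/1 group column, Pearson correlation is the point-biserial correlation
    correlation = frame[value_col].corr(frame[group_col])
    return summary, correlation


# Made-up data, only to show the call; use `df` in the notebook.
toy = pd.DataFrame({
    'exang':   [0, 0, 0, 1, 1, 1],
    'thalach': [170, 160, 165, 130, 140, 135],
})
thalach_summary, thalach_corr = summarize_by_group(toy, 'thalach', 'exang')
print(thalach_summary)
print(thalach_corr)
```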

### **Exercise induced angina**
- 0: No
- 1: Yes

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))


mapped_values = {
    0: 'No',
    1: 'Yes',
}

sns.countplot(data=df, x='exang', order=df['exang'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['exang'].value_counts().index])
ax[0].set_ylabel('Frequency')
ax[0].set_title('Frequency barplot')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='exang',
    y='condition',
    order=df.groupby('exang')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('exang')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])
ax[1].set_title('Estimated probability barplot')

fig.suptitle('Probability estimation for Exercise induced angina')
plt.show();

As expected, angina produced by exercise seems to be an important symptom of heart disease.

### **ST depression induced by exercise relative to rest (oldpeak)**

In [None]:
fig, ax = plt.subplots(
    nrows=2,
    ncols=2,
    figsize=(25, 10),
    sharex=True,
    gridspec_kw={
        'width_ratios': [15, 15],
        'height_ratios': [7.5, 2.5]
    }
)

sns.histplot(data=df, x='oldpeak', kde=True, ax=ax[0, 0])
ax[0, 0].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 0].set_ylabel('Frequency')
ax[0, 0].set_title('ST depression induced by exercise')
sns.boxplot(data=df, x='oldpeak', orient='h', ax=ax[1, 0])
sns.stripplot(data=df, x='oldpeak', orient='h', color=".25", ax=ax[1, 0])
ax[1, 0].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 0].set_xlabel('')


sns.histplot(data=df, x='oldpeak',hue='condition', kde=True, ax=ax[0, 1])
ax[0, 1].tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax[0, 1].set_ylabel('Frequency')
ax[0, 1].set_title('ST depression induced by exercise discriminated by condition')
sns.boxplot(data=df, x='oldpeak', y='condition', orient='h', ax=ax[1, 1])
sns.stripplot(data=df, x='oldpeak', y='condition', orient='h', color=".25", ax=ax[1, 1])
ax[1, 1].tick_params(axis='y', which='both', left=False, top=False, labelleft=False)
ax[1, 1].set_xlabel('')

fig.suptitle('Histogram and KDE for ST depression induced by exercise')
plt.show();

This variable is highly skewed to the right and seems to discriminate well. If a regression model is used, reducing the skew of this variable may help; a log transformation is one option.
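
The effect of a log transformation on the skew can be measured before committing to it. The sketch below uses `np.log1p` (that is, log(1 + x)) rather than a plain log so that zero values stay defined; that choice, the name `skew_before_after_log1p` and the sample values are my assumptions for illustration.

```python
import numpy as np
import pandas as pd


def skew_before_after_log1p(values):
    """Return the sample skewness of the raw values and of log(1 + values)."""
    series = pd.Series(values, dtype=float)
    return series.skew(), np.log1p(series).skew()


# Made-up right-skewed values with several zeros; use df['oldpeak'] in the notebook.
toy_oldpeak = [0, 0, 0, 0.5, 1, 1.5, 2, 6.2]
skew_raw, skew_log = skew_before_after_log1p(toy_oldpeak)
print(skew_raw, skew_log)
```

A smaller skewness after the transformation suggests it is doing what was intended; whether it actually helps the model is something only the evaluation can answer.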

### **Slope: Slope of the peak exercise ST segment**
- 0: Upsloping
- 1: Flat
- 2: Downsloping

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))

mapped_values = {
    0: 'Upsloping',
    1: 'Flat',
    2: 'Downsloping'
}

sns.countplot(data=df, x='slope', order=df['slope'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['slope'].value_counts().index])
ax[0].set_ylabel('Frequency')
ax[0].set_title('Frequency barplot')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='slope',
    y='condition',
    order=df.groupby('slope')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('slope')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])
ax[1].set_title('Estimated probability barplot')

fig.suptitle('Probability estimation for Slope of the peak exercise ST segment')
plt.show();

### **Thalassemia (blood disorder)**
- 0: normal
- 1: Fixed defect
- 2: Reversible defect

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))

mapped_values = {
    0: 'normal',
    1: 'Fixed defect',
    2: 'Reversible defect'
}

sns.countplot(data=df, x='thal', order=df['thal'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values[i] for i in df['thal'].value_counts().index])
ax[0].set_ylabel('Frequency')
ax[0].set_title('Frequency barplot')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='thal',
    y='condition',
    order=df.groupby('thal')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values[i] for i in df.groupby('thal')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])
ax[1].set_title('Estimated probability barplot')

fig.suptitle('Probability estimation for Thalassemia (blood disorder)')
plt.show();

### **Number of major vessels colored by fluoroscopy**

This variable is not declared directly as a categorical one, but it has only 4 unique values, which are ordered, so it's an ordinal variable.

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20, 7))

mapped_values = {}

sns.countplot(data=df, x='ca', order=df['ca'].value_counts().index, ax=ax[0])
ax[0].set_xticklabels([mapped_values.get(i, i) for i in df['ca'].value_counts().index])
ax[0].set_ylabel('Frequency')
ax[0].set_title('Frequency barplot')

for p in ax[0].patches:
    height = p.get_height()
    ax[0].text(
        p.get_x() + p.get_width() / 2.,
        height + 2,
        '{:1.2f}%'.format((height / float(len(df))) * 100),
        ha='center'
    )



sns.barplot(
    data=df,
    x='ca',
    y='condition',
    order=df.groupby('ca')['condition'].mean().sort_values(ascending=False).index,
    ax=ax[1]
)
ax[1].set_ylabel('Estimated probability of disease')
ax[1].set_xticklabels([mapped_values.get(i, i)  for i in df.groupby('ca')['condition'].mean().sort_values(ascending=False).index])
ylabels = [i*100 for i in ax[1].get_yticks()]
ax[1].set_yticks(ax[1].get_yticks())
ax[1].set_yticklabels(['{:1.1f}%'.format(_) for _ in ylabels])
ax[1].set_title('Estimated probability barplot')

fig.suptitle('Probability estimation for Number of major vessels colored by fluoroscopy')
plt.show();

# **Modelling**
<a id="section-three"></a>

# **Evaluation**
<a id="section-four"></a>