# Download dataset
This notebook downloads the Heart Disease dataset from the UCI Machine Learning Repository ([dataset page](https://archive-beta.ics.uci.edu/ml/datasets/heart+disease)). The dataset is licensed under CC BY 4.0.

In [1]:
import os
import pandas as pd

### Extract the Cleveland dataset
In particular, the Cleveland database is the only one that has been used by ML researchers to date. The file `processed.cleveland.data` contains the processed Cleveland data, with the patients' names and social security numbers removed.

In [2]:
dataset_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'

cols_name = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 
             'slope', 'ca', 'thal', 'condition'
            ]

df = pd.read_csv(dataset_url, names=cols_name)

In [3]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,condition
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
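
To see what `names=` does without touching the network, here is a minimal sketch that parses two rows in the same headerless, comma-separated layout from an in-memory string. The `sample` text and the `demo` variable are illustrative and not part of the notebook; the values are copied from the first two rows shown above.

```python
import io
import pandas as pd

cols_name = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
             'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'condition']

# Two rows copied from the df.head() output above (illustrative sample only).
sample = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)

# The data has no header row, so names= supplies the column labels
# and the first line is parsed as data rather than as a header.
demo = pd.read_csv(io.StringIO(sample), names=cols_name)
print(demo.shape)
print(list(demo.columns) == cols_name)
```

Without `names=`, pandas would consume the first patient record as the header, so passing the column list explicitly is what keeps all rows as data.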


## Target variable
Experiments with the Cleveland database have concentrated on distinguishing the presence of heart disease (values 1, 2, 3, 4 in the `condition` column) from its absence (value 0). Since we implement a binary classifier, all non-zero labels are set to 1.

In [4]:
non_zero_labels = df.condition > 0

In [5]:
df.loc[non_zero_labels, 'condition'] = 1

In [6]:
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,condition
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
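
As a possible alternative to the mask-and-assign steps above, the same binarization can be sketched as a single vectorized expression. The `toy` frame below is made-up data used only to illustrate the idea; it is not the Cleveland dataset.

```python
import pandas as pd

# Toy labels covering every value the `condition` column can take (illustrative).
toy = pd.DataFrame({'condition': [0, 2, 1, 0, 4, 3]})

# Comparing against 0 gives a boolean Series; astype(int) maps True/False to 1/0.
toy['condition'] = (toy['condition'] > 0).astype(int)
print(toy['condition'].tolist())
```

Both forms should give the same labels; the one-liner simply avoids the intermediate mask variable.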


## Preprocessing non-numeric features
Check the dtype of every feature.

In [7]:
df.dtypes

age          float64
sex          float64
cp           float64
trestbps     float64
chol         float64
fbs          float64
restecg      float64
thalach      float64
exang        float64
oldpeak      float64
slope        float64
ca            object
thal          object
condition      int64
dtype: object

In [8]:
set(df.ca)

{'0.0', '1.0', '2.0', '3.0', '?'}

In [9]:
set(df.thal)

{'3.0', '6.0', '7.0', '?'}
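
Before replacing the `?` marker, it can help to know how many rows are affected. The sketch below counts the marker per column on a small made-up frame; the `toy` data and the `missing_counts` name are illustrative assumptions, so the counts say nothing about the real dataset.

```python
import pandas as pd

# Toy frame mimicking the string-typed columns (values are illustrative).
toy = pd.DataFrame({
    'ca':   ['0.0', '3.0', '?', '1.0'],
    'thal': ['6.0', '?', '7.0', '?'],
})

# Element-wise comparison yields booleans; summing counts the True values per column.
missing_counts = (toy == '?').sum()
print(missing_counts)
```

On the notebook's `df`, the same expression could be applied to `df[['ca', 'thal']]`.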

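### Preprocess `ca` and `thal` features
Both features are categorical, but pandas reads them as strings (`object` dtype) because missing values are marked with `?`. They need to be converted to **numeric** values.

Replace `?` with $max() + 1$, that is, one more than the largest valid value of the feature, so that missing entries form a category of their own.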

In [17]:
non_num_idx = df.ca == '?'

df.loc[non_num_idx, 'ca'] = 3 + 1 # numeric_max() + 1

Same for `thal` feature

In [16]:
non_num_idx = df.thal == '?'

df.loc[non_num_idx, 'thal'] = 7 + 1 # numeric_max() + 1

Cast both features to `float64`

In [19]:
df.ca = df.ca.astype(float)

df.thal = df.thal.astype(float)

Check the values of both features again

In [20]:
set(df.ca)

{0.0, 1.0, 2.0, 3.0, 4.0}

In [21]:
set(df.thal)

{3.0, 6.0, 7.0, 8.0}
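
The cells above hard-code the replacement values (`3 + 1` and `7 + 1`). As a hedged sketch, the same "max + 1" rule could be computed from the data instead. The helper name `encode_missing_as_new_level` is hypothetical, and it relies on `pd.to_numeric(..., errors='coerce')`, which turns every unparseable entry (not only `?`) into `NaN`.

```python
import pandas as pd

def encode_missing_as_new_level(series):
    """Hypothetical helper: map unparseable entries to (largest value + 1)."""
    # Entries that cannot be parsed as numbers, such as '?', become NaN.
    numeric = pd.to_numeric(series, errors='coerce')
    # max() skips NaN by default, so this is the largest valid value plus one.
    return numeric.fillna(numeric.max() + 1)

# Toy column in the same string format as `ca` (illustrative values).
toy_ca = pd.Series(['0.0', '3.0', '?', '1.0'])
encoded = encode_missing_as_new_level(toy_ca)
print(encoded.tolist())
print(encoded.dtype)
```

Applied to `df.ca` and `df.thal`, this should reproduce the value sets shown above, and the result is already a float column, so no separate cast is needed.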

## Save as CSV file

In [23]:
filename = 'heart_disease.csv'

path = os.path.join(os.getcwd(), filename)

df.to_csv(path, index=False)
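
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this on a small made-up frame in a temporary directory, so it does not touch the real `heart_disease.csv`; the `toy` and `reloaded` names are illustrative.

```python
import os
import tempfile
import pandas as pd

# Toy frame with the same kind of columns as the cleaned data (illustrative values).
toy = pd.DataFrame({'ca': [0.0, 4.0], 'thal': [3.0, 8.0], 'condition': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'heart_disease.csv')
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(reloaded.dtypes)
```

Because `index=False` is used when writing, the reloaded frame has no extra index column and its columns match the original ones.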