# Heart Disease
###### Donated on 6/30/1988

4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach

Dataset Characteristics: Multivariate

Subject Area: Health and Medicine

Associated Tasks: Classification

Feature Type: Categorical, Integer, Real

\# Instances: 303

\# Features: 13

### Dataset Information
##### Additional Information

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them.  In particular, the Cleveland database is the only one that has been used by ML researchers to date.  The "goal" field refers to the presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  
   
The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

One file has been "processed", that one containing the Cleveland database.  All four unprocessed files also exist in this directory.

To see Test Costs (donated by Peter Turney), please see the folder "Costs" 

##### Has Missing Values?

Yes 

### Introductory Paper
International application of a new probability algorithm for the diagnosis of coronary artery disease.
By R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher. 1989

Published in American Journal of Cardiology

##### 14 attributes on this processed dataset:
    age: age in years
    sex: sex (1 = male; 0 = female)
    cp: chest pain type
        - Value 1: typical angina
        - Value 2: atypical angina
        - Value 3: non-anginal pain
        - Value 4: asymptomatic
    trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    chol: serum cholesterol in mg/dl
    fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
    restecg: resting electrocardiographic results
        - Value 0: normal
        - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    thalach: maximum heart rate achieved
    exang: exercise induced angina (1 = yes; 0 = no)
    oldpeak: ST depression induced by exercise relative to rest
    slope: the slope of the peak exercise ST segment
        - Value 1: upsloping
        - Value 2: flat
        - Value 3: downsloping
    ca: number of major vessels (0-3) colored by fluoroscopy
    thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    heart_disease: diagnosis of heart disease (angiographic disease status)
        - Value 0: < 50% diameter narrowing
        - Value 1: > 50% diameter narrowing
        Note: this field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
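
The presence/absence split described for the target can be sketched as a simple threshold. This is a minimal illustration, not the notebook's own step: the name `has_disease` is hypothetical, and the five values are the `heart_disease` entries of the first five Cleveland rows printed further down in this notebook.

```python
import pandas as pd

# heart_disease values of the first five rows shown in this notebook's head() output
heart_disease = pd.Series([0, 2, 1, 0, 0], name="heart_disease")

# collapse severity levels 1-4 into one "presence" class (1); 0 stays "absence"
# `has_disease` is a hypothetical name used only for this sketch
has_disease = (heart_disease > 0).astype(int)

print(has_disease.tolist())
```

Keeping the original 0-4 column alongside the binary one leaves the severity information available if it is needed later.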

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

Note: Data is from the UCI Machine Learning Repository
Creators: Andras Janosi, William Steinbrunn, Matthias Pfisterer, Robert Detrano
Licence: This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

In [2]:
# data: https://archive.ics.uci.edu/dataset/45/heart+disease

# import csv file, adding column names
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heart_disease']
heart = pd.read_csv('processed.cleveland.csv', names=column_names)
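
One possible alternative at load time is to let `read_csv` treat the `?` placeholder as missing through its `na_values` parameter, so the affected columns are parsed as numbers directly. The sketch below assumes an in-memory sample instead of the real file: the three rows are the first rows printed in this notebook, and one `ca` value has been swapped for `?` purely for illustration.

```python
import io

import pandas as pd

column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heart_disease']

# illustrative sample: first three rows of the notebook output,
# with the `ca` value of the second row replaced by '?' (an assumption for the demo)
sample_csv = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
    "67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1\n"
)

# na_values='?' makes pandas parse the placeholder as NaN while reading
sample = pd.read_csv(sample_csv, names=column_names, na_values="?")

print(sample.ca.dtype)
print(sample.ca.isna().sum())
```

With this approach the column arrives as a float column with a proper NaN, so no separate string replacement or dtype conversion is needed for it.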

In [3]:
# print the shape and the first 5 rows of the dataframe
print(heart.shape)
heart.head()

(303, 14)


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [4]:
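# replace zeros with NaN
# caution: 0 is also a valid code here (sex = 0 is female, fbs = 0 is false,
# heart_disease = 0 is absence), so those entries become missing too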
heart = heart.replace(0, np.nan)

heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,,2.3,3.0,0.0,6.0,
1,67.0,1.0,4.0,160.0,286.0,,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2.0
2,67.0,1.0,4.0,120.0,229.0,,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1.0
3,37.0,1.0,3.0,130.0,250.0,,,187.0,,3.5,3.0,0.0,3.0,
4,41.0,,2.0,130.0,204.0,,2.0,172.0,,1.4,1.0,0.0,3.0,
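
The replacement above turns every 0 into NaN, including the 0 codes that the attribute list defines as valid (for example `sex`, `fbs`, `exang` and `heart_disease`). A hedged sketch of a narrower approach follows: count the zeros per column first, then replace them only where a zero is implausible. The list `zero_means_missing` is an assumption made for this example, not something the dataset documentation states, and the sample holds selected columns of the first five rows printed earlier.

```python
import numpy as np
import pandas as pd

# selected columns of the first five rows, before any replacement
sample = pd.DataFrame({
    "sex": [1.0, 1.0, 1.0, 1.0, 0.0],
    "trestbps": [145.0, 160.0, 120.0, 130.0, 130.0],
    "chol": [233.0, 286.0, 229.0, 250.0, 204.0],
    "fbs": [1.0, 0.0, 0.0, 0.0, 0.0],
    "exang": [0.0, 1.0, 1.0, 0.0, 0.0],
    "heart_disease": [0, 2, 1, 0, 0],
})

# audit: how many zeros does each column hold?
zero_counts = (sample == 0).sum()
print(zero_counts)

# assumption: only these measurements cannot legitimately be 0
zero_means_missing = ["trestbps", "chol"]

cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

# the valid 0 codes (e.g. sex = 0) are left untouched
print(cleaned.sex.tolist())
```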


In [5]:
# summary statistics including all columns
heart.describe(include='all')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
count,303.0,206.0,303.0,303.0,303.0,45.0,152.0,303.0,99.0,204.0,303.0,303.0,303.0,139.0
unique,,,,,,,,,,,,5.0,4.0,
top,,,,,,,,,,,,0.0,3.0,
freq,,,,,,,,,,,,176.0,166.0,
mean,54.438944,1.0,3.158416,131.689769,246.693069,1.0,1.973684,149.607261,1.0,1.544118,1.60066,,,2.043165
std,9.038662,0.0,0.960126,17.599748,51.776918,0.0,0.160602,22.875003,0.0,1.105746,0.616226,,,1.013464
min,29.0,1.0,1.0,94.0,126.0,1.0,1.0,71.0,1.0,0.1,1.0,,,1.0
25%,48.0,1.0,3.0,120.0,211.0,1.0,2.0,133.5,1.0,0.775,1.0,,,1.0
50%,56.0,1.0,3.0,130.0,241.0,1.0,2.0,153.0,1.0,1.4,2.0,,,2.0
75%,61.0,1.0,4.0,140.0,275.0,1.0,2.0,166.0,1.0,2.0,2.0,,,3.0


In [6]:
# check for missing values
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            206 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            45 non-null     float64
 6   restecg        152 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          99 non-null     float64
 9   oldpeak        204 non-null    float64
 10  slope          303 non-null    float64
 11  ca             303 non-null    object 
 12  thal           303 non-null    object 
 13  heart_disease  139 non-null    float64
dtypes: float64(12), object(2)
memory usage: 33.3+ KB
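
The non-null counts from `info()` can also be expressed directly as missing-value counts with `isna()`. A small sketch, using selected columns of the five rows shown after the zero replacement above; the names `missing_counts` and `missing_share` are hypothetical.

```python
import numpy as np
import pandas as pd

# selected columns of the first five rows as printed after the replacement
sample = pd.DataFrame({
    "sex": [1.0, 1.0, 1.0, 1.0, np.nan],
    "fbs": [1.0, np.nan, np.nan, np.nan, np.nan],
    "restecg": [2.0, 2.0, 2.0, np.nan, 2.0],
    "exang": [np.nan, 1.0, 1.0, np.nan, np.nan],
    "heart_disease": [np.nan, 2.0, 1.0, np.nan, np.nan],
})

# number of missing entries per column
missing_counts = sample.isna().sum()

# fraction of missing entries per column
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Note that `isna()` only detects real NaN values; string placeholders such as `?` are not counted until they have been converted.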


In [7]:
# inspect the unique values of the object-typed columns 'ca' and 'thal'
print(heart.ca.unique())
print(heart.thal.unique())

['0.0' '3.0' '2.0' '1.0' '?']
['6.0' '3.0' '7.0' '?']


In [8]:
# replace '?' with 'NaN'
heart = heart.replace('?', np.nan)

In [9]:
# re-check replaced missing values
heart.ca.unique()

array(['0.0', '3.0', '2.0', '1.0', nan], dtype=object)

In [10]:
# re-check missing values in dataframe
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            206 non-null    float64
 2   cp             303 non-null    float64
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            45 non-null     float64
 6   restecg        152 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          99 non-null     float64
 9   oldpeak        204 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    object 
 12  thal           301 non-null    object 
 13  heart_disease  139 non-null    float64
dtypes: float64(12), object(2)
memory usage: 33.3+ KB


In [11]:
# convert dtype of column 'ca' to float
heart.ca = heart.ca.astype('float')

In [12]:
# re-check dtype of column 'ca'
heart.ca.dtype

dtype('float64')
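
The `thal` column still has dtype `object` at this point. One way to convert such a column in a single step is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable (such as `?`) into NaN. The sketch below uses the unique `thal` values printed earlier; the variable names are hypothetical.

```python
import pandas as pd

# unique values of 'thal' as printed by the notebook, including the '?' placeholder
thal_raw = pd.Series(['6.0', '3.0', '7.0', '?'], name='thal')

# errors='coerce' converts unparseable strings to NaN instead of raising
thal_numeric = pd.to_numeric(thal_raw, errors='coerce')

print(thal_numeric.dtype)
print(thal_numeric.isna().sum())
```

Because coercion silently hides every unparseable value, it can be worth checking the unique raw values first, as the notebook does.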

In [13]:
# cp: chest pain type
#         - Value 1: typical angina
#         - Value 2: atypical angina
#         - Value 3: non-anginal pain
#         - Value 4: asymptomatic

# replace categorical numeric values with string values
# assign the result back to the column; calling replace(..., inplace=True) on
# heart.cp is chained assignment and may not update the dataframe in newer pandas
heart.cp = heart.cp.replace({1.0: 'typical angina', 2.0: 'atypical angina', 3.0: 'non-anginal pain', 4.0: 'asymptomatic'})
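
The same relabelling could also be sketched with `Series.map`, which differs from `replace` in one useful way: any code that is absent from the dictionary becomes NaN, so unexpected values are easy to spot. The example uses the `cp` values of the first five rows; `cp_labels` and `cp_named` are hypothetical names.

```python
import pandas as pd

# chest pain codes of the first five rows shown in this notebook
cp = pd.Series([1.0, 4.0, 4.0, 3.0, 2.0], name='cp')

# code-to-label mapping taken from the attribute description
cp_labels = {
    1.0: 'typical angina',
    2.0: 'atypical angina',
    3.0: 'non-anginal pain',
    4.0: 'asymptomatic',
}

# map() yields NaN for codes missing from the dictionary;
# 'category' stores the labels compactly as a nominal (unordered) variable
cp_named = cp.map(cp_labels).astype('category')

print(cp_named.tolist())
print(cp_named.isna().sum())
```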

In [14]:
# re-check dtypes
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            303 non-null    float64
 1   sex            206 non-null    float64
 2   cp             303 non-null    object 
 3   trestbps       303 non-null    float64
 4   chol           303 non-null    float64
 5   fbs            45 non-null     float64
 6   restecg        152 non-null    float64
 7   thalach        303 non-null    float64
 8   exang          99 non-null     float64
 9   oldpeak        204 non-null    float64
 10  slope          303 non-null    float64
 11  ca             299 non-null    float64
 12  thal           301 non-null    object 
 13  heart_disease  139 non-null    float64
dtypes: float64(12), object(2)
memory usage: 33.3+ KB


In [15]:
# display summary statistics for all columns
heart.describe(include = 'all')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease
count,303.0,206.0,303,303.0,303.0,45.0,152.0,303.0,99.0,204.0,303.0,299.0,301.0,139.0
unique,,,4,,,,,,,,,,3.0,
top,,,asymptomatic,,,,,,,,,,3.0,
freq,,,144,,,,,,,,,,166.0,
mean,54.438944,1.0,,131.689769,246.693069,1.0,1.973684,149.607261,1.0,1.544118,1.60066,0.672241,,2.043165
std,9.038662,0.0,,17.599748,51.776918,0.0,0.160602,22.875003,0.0,1.105746,0.616226,0.937438,,1.013464
min,29.0,1.0,,94.0,126.0,1.0,1.0,71.0,1.0,0.1,1.0,0.0,,1.0
25%,48.0,1.0,,120.0,211.0,1.0,2.0,133.5,1.0,0.775,1.0,0.0,,1.0
50%,56.0,1.0,,130.0,241.0,1.0,2.0,153.0,1.0,1.4,2.0,0.0,,2.0
75%,61.0,1.0,,140.0,275.0,1.0,2.0,166.0,1.0,2.0,2.0,1.0,,3.0


In [16]:
# display values of 'slope' column
heart.slope.head()

0    3.0
1    2.0
2    2.0
3    3.0
4    1.0
Name: slope, dtype: float64

In [17]:
# replace categorical numeric values with string values
heart.slope = heart.slope.replace({1.0: 'upsloping', 2.0: 'flat', 3.0: 'downsloping'})

# display converted values of column 'slope'
print(heart.slope.unique())
print(heart.slope.dtype)

['downsloping' 'flat' 'upsloping']
object


In [18]:
# convert 'slope' column to categorical data type
heart.slope = pd.Categorical(heart.slope, ['upsloping', 'flat', 'downsloping'], ordered=True)

# re-check dtypes for 'slope' column
print(heart.slope.dtype)

# encode 'slope' as integer codes that follow the category order
# (upsloping = 0, flat = 1, downsloping = 2)
heart_slope_encode = heart.slope.cat.codes

# display encoded values of 'slope' column
print(heart_slope_encode.unique())

category
[2 1 0]
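
One detail of `cat.codes` worth keeping in mind is that missing values are encoded as -1 rather than NaN. A minimal sketch follows, using the `slope` labels of the first five rows plus one NaN that is appended purely for illustration (the `slope` column in this notebook has no missing entries).

```python
import numpy as np
import pandas as pd

# slope labels of the first five rows, plus a hypothetical missing entry
slope_values = ['downsloping', 'flat', 'flat', 'downsloping', 'upsloping', np.nan]

slope_cat = pd.Series(
    pd.Categorical(slope_values,
                   categories=['upsloping', 'flat', 'downsloping'],
                   ordered=True)
)

# codes follow the declared category order; NaN is encoded as -1
codes = slope_cat.cat.codes

print(codes.tolist())
```

If the codes are passed on to a model, the -1 entries need separate handling, since they would otherwise be read as a real category value.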
