# Heart Disease Dataset


### Features description

Age: displays the age of the individual.

Sex: displays the gender of the individual: 1 = male, 0 = female

Chest-pain type: type of chest pain experienced by the individual: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic

Resting Blood Pressure: resting blood pressure value of an individual in mmHg (unit)

Serum Cholesterol: serum cholesterol in mg/dl (unit)

Fasting Blood Sugar: compares the fasting blood sugar value of an individual with 120mg/dl: If fasting blood sugar > 120mg/dl then : 1 (true), else : 0 (false)

Resting ECG: resting electrocardiographic results: 0 = normal, 1 = having ST-T wave abnormality, 2 = left ventricular hypertrophy

Max heart rate achieved: max heart rate achieved by an individual.

Exercise induced angina: 1 = yes, 0 = no

ST depression induced by exercise relative to rest: value (integer/float).

Peak exercise ST segment: 1 = upsloping, 2 = flat, 3 = downsloping

Number of major vessels (0–3) colored by fluoroscopy: value (integer/float).

Thal: displays the thalassemia: 3 = normal, 6 = fixed defect, 7 = reversible defect

Diagnosis of heart disease: individual suffering from heart disease or not: 0 = absence; 1, 2, 3, 4 = present.
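
The codings above can be applied directly in pandas. The following is a minimal sketch on a small made-up frame (the `sample` data, the `CP_LABELS` and `THAL_LABELS` dictionaries, and the `has_disease` column are illustrative names, not part of the dataset): it decodes two of the categorical features into readable labels and collapses the 0–4 diagnosis into a binary flag, assuming that any value above 0 counts as "present".

```python
import pandas as pd

# Code-to-label mappings taken from the feature description above.
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}

# Made-up rows for illustration only; integer codes are assumed here.
sample = pd.DataFrame({"cp": [1, 4, 3], "thal": [6, 3, 7], "num": [0, 2, 1]})

# Decode the integer codes into human-readable labels.
sample["cp_label"] = sample["cp"].map(CP_LABELS)
sample["thal_label"] = sample["thal"].map(THAL_LABELS)

# Binary diagnosis: 0 = absence, 1 = any of the stages 1-4.
sample["has_disease"] = (sample["num"] > 0).astype(int)

print(sample)
```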

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl.metadata (5.2 kB)
Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [2]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
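# extract variable names from the variables DataFrame
# (this list also contains the target variable, 'num')
feature_names = heart_disease.variables['name'].tolist()

# create a DataFrame; because 'num' is not among the features,
# it appears here as an empty (NaN) column and is dropped later
heart_dis = pd.DataFrame(data = heart_disease.data.features, columns = feature_names)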

# create a 'target' column with 'num' values
heart_dis['target'] = heart_disease.data.targets

# save DataFrame to CSV
heart_dis.to_csv('heart_disease_data.csv', index = False)

In [3]:
heart_dis.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num  target
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0  NaN       0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0  NaN       2
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0  NaN       1
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0  NaN       0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0  NaN       0


In [4]:
unique_values = heart_dis['target'].unique()
unique_values

array([0, 2, 1, 3, 4])

In [5]:
# drop the empty 'num' column; its values are already stored in 'target'
heart_dis = heart_dis.drop(columns=['num'], errors='ignore')

heart_dis.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  target
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0       0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0       2
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0       1
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0       0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0       0


In [6]:
# check for missing values in the DataFrame
missing_values = heart_dis.isnull().sum()

# count of missing values for each column
missing_values

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64

In [7]:
# identify rows with missing values in 'ca' and 'thal' columns
missing_rows_ca = heart_dis[heart_dis['ca'].isnull()]
missing_rows_thal = heart_dis[heart_dis['thal'].isnull()]

# display the rows with missing values
print("Rows with missing 'ca' values:")
print(missing_rows_ca)

print("\nRows with missing 'thal' values:")
print(missing_rows_thal)

Rows with missing 'ca' values:
     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
166   52    1   3       138   223    0        0      169      0      0.0   
192   43    1   4       132   247    1        2      143      1      0.1   
287   58    1   2       125   220    0        0      144      0      0.4   
302   38    1   3       138   175    0        0      173      0      0.0   

     slope  ca  thal  target  
166      1 NaN   3.0       0  
192      2 NaN   7.0       1  
287      2 NaN   7.0       0  
302      1 NaN   3.0       0  

Rows with missing 'thal' values:
     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
87    53    0   3       128   216    0        2      115      0      0.0   
266   52    1   4       128   204    1        0      156      1      1.0   

     slope   ca  thal  target  
87       1  0.0   NaN       0  
266      2  0.0   NaN       2  


#### Patients 166, 287, and 302 do not have heart disease, so I will impute their missing 'ca' values with the mean 'ca' of the healthy patients (target 0).
#### For patient 192, I will replace the missing 'ca' value with the mean 'ca' of the patients with stage 1 heart disease (target 1).
#### I will impute the missing 'thal' values analogously.
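
The per-class mean imputation described above can also be written once for any column. The sketch below uses a small made-up frame rather than the real data, and `impute_with_class_mean` is a hypothetical helper name introduced only for illustration; it fills each missing value with the mean of the rows that share the same `target` value.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) mimicking 'ca', 'thal' and 'target'.
toy = pd.DataFrame({
    "ca":     [0.0, 1.0, np.nan, 2.0, np.nan, 3.0],
    "thal":   [3.0, np.nan, 3.0, 7.0, 7.0, 6.0],
    "target": [0, 0, 0, 1, 1, 1],
})

def impute_with_class_mean(df, columns, by="target"):
    """Return a copy of df where NaNs in `columns` are replaced by the
    mean of that column within the same `by` group."""
    out = df.copy()
    for col in columns:
        # transform('mean') yields one value per row: the mean of its group
        # (NaNs are skipped when the mean is computed).
        class_means = out.groupby(by)[col].transform("mean")
        out[col] = out[col].fillna(class_means)
    return out

imputed = impute_with_class_mean(toy, ["ca", "thal"])
print(imputed)
```

Working on a copy keeps the original frame untouched, so the imputed and raw values can still be compared afterwards.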

In [8]:
# calculate the mean 'ca' value for rows with target 0
mean_ca_target_0 = heart_dis.loc[heart_dis['target'] == 0, 'ca'].mean()

# update 'ca' values for rows 166, 287, and 302 with the mean
rows_to_update = [166, 287, 302]
heart_dis.loc[rows_to_update, 'ca'] = heart_dis.loc[rows_to_update, 'ca'].fillna(mean_ca_target_0)

# verify that missing values have been filled
heart_dis.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          1
thal        2
target      0
dtype: int64