In [4]:
import os
from google.colab import drive
import pandas as pd

In [5]:
drive.mount('/content/drive')
folder_path = '/content/drive/My Drive/dfa_data/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Gender Bias in Cardiovascular Models
## Phase 3: Data Cleaning
Arman Syed (as3778), Srimoyee Mukhopadhyay (sm2437), Lancaster Wu (jw2555), Hongjin Quan (hq48)<br><br>
CS 5382 - Spring 2024


## Data Cleaning

We loaded the files containing the fully cleaned 14 features from each of the 4 locations.

We then conducted the following Pre-processing steps:

1. Feature xiv. `num` - renamed as `target` for clarity in this project. This is the target output, indicating heart disease diagnosis.
    * In some of the original datasets, this column is rated as an integer value from 0 (no presence of heart disease) to 1,2,3,4 (presence of heart disease in increasing severity); while other datasets only have 0 and 1. For uniformity between datasets, and to conform to the same method used in the original research, we preprocessed any integer >= 1 as 1 in this column.

2. For categorical features such as `cp`, `restecg`, and `slope`, the numerical values assigned seem to align with indication of increasing severity, so we kept them as such.

3. Our Sensitive Group is `sex` (female vs. male), so for each dataset we created two new objects, one per Sensitive Group: female (f) and male (m).

## Step 1: Read the Data In

In [8]:
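# The raw files have no header row and mark missing values with "?",
# so we disable header inference and parse "?" as NA.
clev_path = os.path.join(folder_path, "initial.cleveland.data")
clev_data = pd.read_csv(clev_path, header=None, na_values="?")

switz_path = os.path.join(folder_path, "initial.switzerland.data")
switz_data = pd.read_csv(switz_path, header=None, na_values="?")

hung_path = os.path.join(folder_path, "initial.hungarian.data")
hung_data = pd.read_csv(hung_path, header=None, na_values="?")

va_path = os.path.join(folder_path, "initial.va.data")
va_data = pd.read_csv(va_path, header=None, na_values="?")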


We used `pd.read_csv` to read in the large amounts of data and have it organized into dataframes automatically. The datasets all contained "?" where data was missing, so we specified that those values should be treated as NA.
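As a minimal sketch of the missing-value handling described above, the snippet below reads a small in-memory sample and counts the NA entries per column. The two sample rows are hypothetical stand-ins in the same comma-separated layout, not real patient records, and the `header=None` argument assumes the file has no header row.

```python
import io

import pandas as pd

# Hypothetical two-row sample (not real patient data); the "?" marks a missing value.
sample = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,?,0,2,108,1,1.5,2,3,3,2\n"
)

# na_values="?" tells pandas to parse "?" as NaN instead of keeping it as a string.
df = pd.read_csv(sample, header=None, na_values="?")

# Count missing entries per column and show only the columns that have any.
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Running a check like this on each location's dataframe gives a quick view of which features are sparsely recorded before any modeling.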

## Step 2: Labelling the Columns

In [9]:
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
clev_data.columns = cols
switz_data.columns = cols
hung_data.columns = cols
va_data.columns = cols

The raw data files did not label their columns; the meaning of each column was described in a separate file. To keep everything in one place and make the data easier to read and access, we set the feature names as the column headers of each dataframe.
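Because assigning to `.columns` fails only when the lengths differ, it can help to make that check explicit before relabeling. The sketch below uses a hypothetical helper, `label_columns`, which is not part of the original notebook, and a made-up single-row dataframe.

```python
import pandas as pd

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]


def label_columns(df, names):
    """Return a copy of df with the given column names (hypothetical helper)."""
    if df.shape[1] != len(names):
        raise ValueError(
            f"expected {len(names)} columns, found {df.shape[1]}"
        )
    out = df.copy()
    out.columns = names
    return out


# Made-up single row used only to illustrate the call.
toy = pd.DataFrame([[63, 1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 6, 0]])
labeled = label_columns(toy, cols)
print(labeled.columns.tolist())
```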

## Step 3: Convert All Target Values Above 1 to 1

In [10]:
clev_data.loc[clev_data['target'] > 1, 'target'] = 1
switz_data.loc[switz_data['target'] > 1, 'target'] = 1
hung_data.loc[hung_data['target'] > 1, 'target'] = 1
va_data.loc[va_data['target'] > 1, 'target'] = 1

In [11]:
# Save the original data files, now with column names, as CSV for better readability when grading
clev_data.to_csv("/content/drive/My Drive/dfa_data/processed_cleveland_data.csv", index=False)
switz_data.to_csv("/content/drive/My Drive/dfa_data/processed_switzerland_data.csv", index=False)
hung_data.to_csv("/content/drive/My Drive/dfa_data/processed_hungarian_data.csv", index=False)
va_data.to_csv("/content/drive/My Drive/dfa_data/processed_va_data.csv", index=False)

As mentioned above, the feature `num` was renamed to `target`; it records whether a patient had heart disease. Some of the datasets grade the disease on a scale of increasing severity, but we are only concerned with the binary case of whether heart disease is present. We therefore set every value above 0 to 1: any nonzero value simply means, for our purposes, that the patient had heart disease.
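The binarization can be illustrated on a toy column covering every severity level. This sketch uses a comparison followed by a cast rather than the `.loc` assignment in the cell above; for integer values 0 through 4 the two give the same result. The toy values are illustrative only.

```python
import pandas as pd

# Toy target column with one example of each severity level (0 = no disease).
toy = pd.DataFrame({"target": [0, 1, 2, 3, 4]})

# Any nonzero severity becomes 1; 0 stays 0.
toy["target"] = (toy["target"] > 0).astype(int)

print(toy["target"].tolist())
```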

## Step 4: Split the Data Based on Sex

In [12]:
# Conform to datatype
clev_data['sex'] = clev_data['sex'].astype(int)

# Sensitive Groups split: sex - female and male
clev_rows_f = clev_data.index[clev_data['sex'] == 0].tolist()
clev_data_f = clev_data.loc[clev_rows_f]
clev_data_m = clev_data.drop(clev_rows_f)
switz_rows_f = switz_data.index[switz_data['sex'] == 0].tolist()
switz_data_f = switz_data.loc[switz_rows_f]
switz_data_m = switz_data.drop(switz_rows_f)
hung_rows_f = hung_data.index[hung_data['sex'] == 0].tolist()
hung_data_f = hung_data.loc[hung_rows_f]
hung_data_m = hung_data.drop(hung_rows_f)
va_rows_f = va_data.index[va_data['sex'] == 0].tolist()
va_data_f = va_data.loc[va_rows_f]
va_data_m = va_data.drop(va_rows_f)

This was done because sex is the determining factor of our experiment. We want to show how the sheer lack of female patient data led to the biased, overgeneralized, and inaccurate conclusions drawn by the original study. To do this, we needed to separate the data by sex and then unbias it.
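A boolean mask is one compact way to express the same split and to see the group sizes at a glance. The sketch below assumes the encoding used in the cell above (`sex == 0` for female, everything else for male); `split_by_sex` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def split_by_sex(df):
    """Return (female, male) copies of df, assuming sex == 0 means female."""
    is_female = df["sex"] == 0
    # Rows that are not flagged female go to the male group,
    # mirroring the drop-based split in the notebook.
    return df[is_female].copy(), df[~is_female].copy()


# Made-up rows used only to illustrate the call.
toy = pd.DataFrame({"sex": [1, 0, 1, 1], "target": [1, 0, 0, 1]})
toy_f, toy_m = split_by_sex(toy)

print(len(toy_f), len(toy_m))
```

Comparing the two lengths for each location makes the imbalance between female and male records directly visible.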

In [13]:
#saving as CSV
clev_data_m.to_csv("/content/drive/My Drive/dfa_data/processed_cleveland_data_m.csv", index=False)
switz_data_m.to_csv("/content/drive/My Drive/dfa_data/processed_switzerland_data_m.csv", index=False)
hung_data_m.to_csv("/content/drive/My Drive/dfa_data/processed_hungarian_data_m.csv", index=False)
va_data_m.to_csv("/content/drive/My Drive/dfa_data/processed_va_data_m.csv", index=False)

clev_data_f.to_csv("/content/drive/My Drive/dfa_data/processed_cleveland_data_f.csv", index=False)
switz_data_f.to_csv("/content/drive/My Drive/dfa_data/processed_switzerland_data_f.csv", index=False)
hung_data_f.to_csv("/content/drive/My Drive/dfa_data/processed_hungarian_data_f.csv", index=False)
va_data_f.to_csv("/content/drive/My Drive/dfa_data/processed_va_data_f.csv", index=False)