# Data Cleaning Notebook

This notebook is responsible for preparing our raw UCI Adult dataset for further analysis. In this notebook, we will:
- Determine the project root directory.
- Construct the path to the raw data file.
- Check if `train.csv` exists; if not, create it using AIF360's AdultDataset, with a fallback to manual loading.
- Perform data cleaning by dropping rows with missing values and encoding categorical variables.
- Remove any leakage features and ensure that our target and protected attribute are numeric.
- Save the final cleaned dataset as `train_cleaned.csv`.

Let's begin by setting up our environment and defining file paths.


In [11]:
import os
import sys
import pandas as pd
from aif360.datasets import AdultDataset

# Get the current working directory (for debugging purposes)
project_root = os.getcwd()
print("Project root directory:", project_root)

# Construct the absolute path to the raw data file.
# Since this notebook is at the project root (or a subfolder), 
# the raw file is expected at: data/raw/adult/adult.data.
raw_data_file = os.path.join(project_root, "data", "raw", "adult", "adult.data")
print("Looking for raw data file at:", raw_data_file)


Project root directory: /Users/stay-c/Desktop/AI_Fairness_Project/notebooks
Looking for raw data file at: /Users/stay-c/Desktop/AI_Fairness_Project/notebooks/data/raw/adult/adult.data


## Define CSV Paths and Create train.csv

In this cell, we define the paths for our intermediate CSV files (`train.csv` and `train_cleaned.csv`).  
We then check if `train.csv` exists. If it doesn't, we attempt to create it from the raw Adult dataset.  
First, we try to use AIF360's `AdultDataset` – if that fails due to non-numeric values (e.g., the string "Black" in the 'race' column),  
we fall back to manual loading:
- We read the raw data using pandas (with no header) and assign column names as per the dataset documentation.
- We drop rows with missing values to ensure data integrity and prevent errors in downstream analysis.
- We encode categorical variables to numeric codes because many ML models (and AIF360) require numerical input.
- We create a binary target column (`income_binary`) based on the original income column.
Finally, we save the resulting DataFrame as `train.csv` for future steps.


In [12]:
import os
import sys
import pandas as pd
from aif360.datasets import AdultDataset

# Use the current working directory to determine the project root.
# In notebooks, __file__ is not defined, so we use os.getcwd() and go one level up.
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
print("Project root directory:", project_root)

# Construct the absolute path to the raw data file.
raw_data_file = os.path.join(project_root, "data", "raw", "adult", "adult.data")
print("Looking for raw data file at:", raw_data_file)

# Define file paths for the CSV files we will create.
train_csv = os.path.join(project_root, "data", "train.csv")
cleaned_csv = os.path.join(project_root, "data", "train_cleaned.csv")

# Check if train.csv already exists; if not, create it from the raw data.
if not os.path.exists(train_csv):
    print("train.csv not found. Creating train.csv from raw Adult dataset...")
    try:
        # Attempt to load using AIF360's AdultDataset (this requires all values to be numeric).
        dataset = AdultDataset(protected_attribute_names=['sex'], features_to_drop=['fnlwgt'])
        data_df, label_names, protected_attribute_names = dataset.convert_to_dataframe()
    except Exception as e:
        print("Error using AdultDataset:", e)
        print("Falling back to manual loading and encoding...")
        # Define column names based on the Adult dataset documentation.
        columns = [
            'age', 'workclass', 'fnlwgt', 'education', 'education-num',
            'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-per-year'
        ]
        # Load raw data with pandas; the file has no header so we supply the column names.
        data_df = pd.read_csv(raw_data_file, header=None, names=columns, na_values='?')
        # Drop rows with missing data because missing values can cause errors and reduce model reliability.
        data_df = data_df.dropna()
        # Encode categorical variables to numeric codes.
        # This step converts non-numeric text data (e.g., 'Black' in 'race') into numerical codes required by ML models.
        categorical_cols = ['workclass', 'education', 'marital-status', 'occupation',
                            'relationship', 'race', 'sex', 'native-country', 'income-per-year']
        for col in categorical_cols:
            data_df[col] = pd.Categorical(data_df[col]).codes
        # Create a binary target column.
        # We assume that a higher categorical code corresponds to a '>50K' income.
        data_df['income_binary'] = data_df['income-per-year'].apply(lambda x: 1 if x > 0 else 0)
        # Remove the original income column as it's now redundant.
        data_df = data_df.drop('income-per-year', axis=1)
        label_names = ['income_binary']
        protected_attribute_names = ['sex']
    # Save the resulting DataFrame as train.csv.
    data_df.to_csv(train_csv, index=False)
    print("train.csv created and saved at:", train_csv)
else:
    print("train.csv already exists.")


Project root directory: /Users/stay-c/Desktop/AI_Fairness_Project
Looking for raw data file at: /Users/stay-c/Desktop/AI_Fairness_Project/data/raw/adult/adult.data
train.csv already exists.


## Load and Finalize Data Cleaning

Now that we've created (or verified) the existence of `train.csv` from our raw Adult dataset, we will load it into a DataFrame and perform additional cleaning steps:
- Remove any leakage features (for example, a column like "14_ <=50K" if it exists).
- Verify and convert the target column (`income_binary`) and the protected attribute (`sex`) to numeric types.
- Save the final cleaned dataset as `train_cleaned.csv` for future processing.


In [13]:
# Load the intermediate dataset (train.csv)
data_df = pd.read_csv(train_csv)
print("Loaded train.csv with shape:", data_df.shape)

# Remove any leakage feature if it exists (for example, "14_ <=50K").
if "14_ <=50K" in data_df.columns:
    data_df = data_df.drop("14_ <=50K", axis=1)
    print("Leakage feature '14_ <=50K' removed.")
else:
    print("No leakage feature '14_ <=50K' found.")

# Determine which column to use as the target.
# We expect the target column to be 'income_binary'. If not, fallback to '14_ >50K'.
target_column = 'income_binary' if 'income_binary' in data_df.columns else '14_ >50K'
data_df[target_column] = data_df[target_column].astype(int)

# Ensure that the protected attribute 'sex' is numeric.
if 'sex' in data_df.columns:
    data_df['sex'] = data_df['sex'].astype(int)

# Save the final cleaned dataset.
data_df.to_csv(cleaned_csv, index=False)
print("Cleaned dataset saved at:", cleaned_csv)
print("Data sample:")
data_df.head()


Loaded train.csv with shape: (32561, 15)
No leakage feature '14_ <=50K' found.
Cleaned dataset saved at: /Users/stay-c/Desktop/AI_Fairness_Project/data/train_cleaned.csv
Data sample:


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


## Next Steps in Data Cleaning

At this point, our dataset has been fully cleaned:
- We have created `train.csv` from the raw data.
- We removed any leakage features and ensured that the target and protected attributes are in numeric format.
- The final cleaned dataset is saved as `train_cleaned.csv`.

This cleaned dataset is now ready for the next phases of the project, such as bias detection, bias mitigation, model training, and evaluation.
