# Train Model on Weighted Data

In this notebook, we train a machine learning model using the weighted data generated in the bias mitigation step. We will:
- Load the weighted dataset (`weighted_train.csv`)
- Split the data into training and testing sets
- Scale the features using StandardScaler
- Train a Logistic Regression model as our baseline classifier
- Evaluate the model's performance with accuracy and a classification report

Let's begin by setting up our environment and defining the file paths.


## 1. Environment Setup and Data Loading

We first construct the absolute path to the weighted data file. If the notebook is run from the `notebooks` folder, we move up one level to reach the project root; otherwise we treat the current working directory as the project root. Then we load the weighted dataset.



In [1]:
import os
import sys
import pandas as pd

# In a notebook, __file__ is not defined. We use os.getcwd() and move up one level if necessary.
current_dir = os.getcwd()
if os.path.basename(current_dir) == "notebooks":
    project_root = os.path.abspath(os.path.join(current_dir, ".."))
else:
    project_root = current_dir

print("Project root directory:", project_root)

# Construct the absolute path to the weighted data file.
weighted_csv_path = os.path.join(project_root, "data", "weighted_train.csv")
print("Looking for weighted data at:", weighted_csv_path)

# Check that the weighted data file exists; raising stops the cell with a clear traceback.
if not os.path.exists(weighted_csv_path):
    raise FileNotFoundError(
        f"Weighted data file not found at {weighted_csv_path}. "
        "Please run the bias_mitigation script to generate weighted_train.csv before proceeding."
    )

# Load the weighted data.
weighted_df = pd.read_csv(weighted_csv_path)
print("Weighted data loaded. Shape:", weighted_df.shape)


Project root directory: /Users/stay-c/Desktop/AI_Fairness_Project
Looking for weighted data at: /Users/stay-c/Desktop/AI_Fairness_Project/data/weighted_train.csv
Weighted data loaded. Shape: (32561, 15)
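
The `os.getcwd()` check above only covers the case where the notebook is launched from `notebooks` itself. As a hedged alternative, the sketch below walks upward from the working directory until it finds a folder that contains a `data` directory. The helper name `find_project_root` and its `marker` parameter are hypothetical, and the assumption that the project root is the nearest ancestor holding `data` is ours, not something the project guarantees.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is an assumed convention (the project root holds a `data` folder).
    Falls back to `start` itself if no ancestor matches.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start


project_root = find_project_root(Path.cwd())
weighted_csv_path = project_root / "data" / "weighted_train.csv"
print("Weighted data expected at:", weighted_csv_path)
```

`pd.read_csv` accepts a `Path` object directly, so the rest of the loading cell would work unchanged with this path.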


## 2. Data Preparation

Next, we prepare our data for model training:
- We assume the target column is named `income_binary`.
- We separate the features (X) from the target (y).
- We split the data into training (70%) and testing (30%) sets, fixing `random_state=42` so the split is reproducible.


In [2]:
from sklearn.model_selection import train_test_split

# Ensure the target column exists.
if 'income_binary' not in weighted_df.columns:
    raise KeyError("Expected target column 'income_binary' not found in weighted data.")

# Separate features and target.
X = weighted_df.drop(['income_binary'], axis=1)
y = weighted_df['income_binary']

# Split data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Weighted data split into training and testing sets.")


Weighted data split into training and testing sets.
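
The split above is purely random, so the share of positive labels can drift slightly between the training and testing sets. A minimal sketch of a stratified split follows, using the `stratify` argument of `train_test_split`. The data here is synthetic and only stands in for the real dataset; the `demo_` names and the roughly 24% positive rate are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data, with an imbalanced binary target.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"feature": rng.normal(size=1000)})
demo_y = pd.Series((rng.random(1000) < 0.24).astype(int), name="income_binary")

# stratify=demo_y keeps the class proportions (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

print("Positive rate, full: ", round(demo_y.mean(), 3))
print("Positive rate, train:", round(y_tr.mean(), 3))
print("Positive rate, test: ", round(y_te.mean(), 3))
```

Whether stratification is appropriate here depends on how later notebooks compare results, since changing the split changes which rows land in the test set.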


## 3. Feature Scaling and Model Training

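Feature scaling puts all input features on a comparable scale (zero mean and unit variance), so that no feature dominates the regularized fit merely because of its units, and the solver converges more reliably. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.  
In this cell, we: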
- Scale the features using StandardScaler.
- Train a Logistic Regression model on the scaled training data.
- Evaluate the model performance on the test data using accuracy and a classification report.
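
Before applying this to the real data, a small illustration of what standardization does may help. The toy arrays below are made up for the example: two features on very different scales, with the scaler fitted on the training rows and reused on a single test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (values are illustrative only).
train = np.array([[25.0, 0.0], [40.0, 5000.0], [55.0, 10000.0]])
test = np.array([[40.0, 2500.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from the training rows
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print("Training means after scaling:", train_scaled.mean(axis=0))
print("Training stds after scaling: ", train_scaled.std(axis=0))
print("Scaled test row:", test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, and the test row is expressed in units of the training standard deviation.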


In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Scale features using StandardScaler.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a Logistic Regression model on the weighted data.
model = LogisticRegression(max_iter=2000)
model.fit(X_train_scaled, y_train)
print("Model trained on weighted data.")

# Evaluate model performance on the test set.
predictions = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print("Weighted Model Accuracy:", accuracy)
print("Weighted Model Classification Report:")
print(classification_report(y_test, predictions))


Model trained on weighted data.
Weighted Model Accuracy: 0.8258777766403931
Weighted Model Classification Report:
              precision    recall  f1-score   support

         0.0       0.85      0.94      0.89      7455
         1.0       0.71      0.45      0.55      2314

    accuracy                           0.83      9769
   macro avg       0.78      0.70      0.72      9769
weighted avg       0.81      0.83      0.81      9769
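
The cell above calls `model.fit` without a `sample_weight` argument, and every column other than `income_binary` is treated as a feature. If the bias mitigation step stored per-row weights in a column of `weighted_train.csv`, those weights would need to be removed from the features and passed to `fit` explicitly for them to influence training. The sketch below shows one way to do that on synthetic data. The column name `sample_weight` is a hypothetical placeholder, as are the `demo_` and `_w` names; the real column name should be checked in `weighted_df.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

WEIGHT_COL = "sample_weight"  # hypothetical name; check weighted_df.columns for the real one

# Synthetic stand-in for weighted_df so the sketch runs on its own.
rng = np.random.default_rng(42)
n = 500
demo_df = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    WEIGHT_COL: rng.uniform(0.5, 1.5, size=n),
})
demo_df["income_binary"] = (demo_df["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Keep the weights out of the feature matrix and split them alongside X and y.
X_demo = demo_df.drop(columns=["income_binary", WEIGHT_COL])
y_demo = demo_df["income_binary"]
w_demo = demo_df[WEIGHT_COL]
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X_demo, y_demo, w_demo, test_size=0.3, random_state=42
)

scaler_w = StandardScaler()
X_tr_scaled = scaler_w.fit_transform(X_tr)
X_te_scaled = scaler_w.transform(X_te)

# sample_weight makes each row contribute to the loss in proportion to its weight.
weighted_model = LogisticRegression(max_iter=2000)
weighted_model.fit(X_tr_scaled, y_tr, sample_weight=w_tr)
print("Test accuracy on synthetic data:", weighted_model.score(X_te_scaled, y_te))
```

The accuracy printed here describes only the synthetic data and says nothing about the real dataset.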



## Conclusion

In this notebook, we:
- Set up the environment and located the weighted data file.
- Prepared the data by separating features and target, then splitting it into training and testing sets.
- Scaled the features and trained a Logistic Regression model on the weighted data.
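- Evaluated the model on the held-out test set: accuracy is about 0.83, but recall for the positive class (`1.0`) is only 0.45, so more than half of the positive cases are missed.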

This model will serve as our baseline for comparing performance and fairness in subsequent evaluations.
