**Note:** This notebook is designed for **Google Colab**.

If you see the Colab logo <span style='vertical-align:bottom;'><img src='https://colab.research.google.com/img/colab_favicon_256px.png' width='40' alt='Colab logo'></span> in the top-left corner, you're all set! Please **continue**.

If you don't see the logo (e.g., you are on GitHub), please click the button below to open it in the correct environment:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mparrott-at-wiris/aimodelshare/blob/master/notebooks/justice_and_equity_advance_notebook_en.ipynb)

# **Advanced Justice & Equity Challenge: Build & Submit Custom Models**

Welcome to the **Advanced Pathway** of the Ethics at Play (Ètica en Joc) Justice Challenge.

**Who is this for?**
This notebook is designed for participants with Python experience (e.g., Scikit-Learn, TensorFlow, PyTorch). Instead of using the gamified apps, you will build, train, and submit your own machine learning models directly to the competition leaderboard.

**The Goal:**
Train a model to predict recidivism risk (the likelihood of re-offending) using the COMPAS dataset, while balancing accuracy and fairness.
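
One way to make "balancing accuracy and fairness" concrete is to compare error rates across groups. The sketch below is an illustration only: the helper name `false_positive_rate_by_group` and the toy labels are assumptions, and it is not necessarily how the competition scores fairness. It computes the false positive rate (people predicted to re-offend who did not) separately for each group.

```python
from collections import defaultdict

def false_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: FPR}, where FPR = FP / (FP + TN) among true negatives.

    Hypothetical helper for illustration; not part of aimodelshare.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # only people who did not re-offend can be false positives
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / n for g, n in negatives.items() if n > 0}

# Toy data (made up): 1 = recidivism, 0 = no recidivism
y_true = [0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

fpr = false_positive_rate_by_group(y_true, y_pred, groups)
print(fpr)
```

A large gap between groups suggests the model's mistakes fall more heavily on one group, even if overall accuracy looks good.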

## 🚀 **Quick Start Guide**

To participate in the challenge, complete these 5 steps:

1.  **Install Libraries:** Run the setup cell to install `aimodelshare`.
2.  **Get the Data:** Run the data loading cell to retrieve the pre-split training and testing data.
3.  **Train Your Model:** Use the provided Scikit-Learn Pipeline example or write your own custom training code.
4.  **Connect:** Link this notebook to the Justice Challenge Leaderboard.
5.  **Submit:** Send your predictions to the leaderboard to see your score.

**Ready? Click the ▶ Play Button on the first cell below to get started.**

---
# **Step 1: Installation**

We need to install the `aimodelshare` library to connect to the competition backend.

In [None]:
# Install the aimodelshare library
print("Installing required libraries...")
!pip install aimodelshare --upgrade -q --no-warn-script-location > /dev/null 2>&1
print("✅ Installation complete!")

---
# **Step 2: Load Data**

We will load the training and testing data directly from the official competition URLs.

* **X_train:** Features for training.
* **y_train:** Target labels (did recidivism occur?) for training.
* **X_test:** Features for testing (you will generate predictions on this).

In [None]:
import pandas as pd

# 1. Load Data from URLs
X_train = pd.read_csv("https://raw.githubusercontent.com/AIModelShare/aimodelshare_tutorials/refs/heads/main/datasets/ethicsatplay/X_train.csv")
X_test = pd.read_csv("https://raw.githubusercontent.com/AIModelShare/aimodelshare_tutorials/refs/heads/main/datasets/ethicsatplay/X_test.csv")
y_train_labels = pd.read_csv("https://raw.githubusercontent.com/AIModelShare/aimodelshare_tutorials/refs/heads/main/datasets/ethicsatplay/y_train.csv")

# Ensure y_train is a 1D Series (required for sklearn)
y_train = y_train_labels.squeeze()

# 2. Define Feature Lists for the Pipeline
# These lists match the columns present in X_train and X_test
ALL_NUMERIC_COLS = ["juv_fel_count", "juv_misd_count", "juv_other_count", "days_b_screening_arrest", "age", "length_of_stay", "priors_count"]
ALL_CATEGORICAL_COLS = ["race", "sex", "c_charge_degree", "c_charge_desc"]

print("✅ Data loaded successfully!")
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
print("\nFirst 5 rows of training data:")
X_train.head()

✅ Data loaded successfully!
Training data shape: (7214, 11)
Testing data shape: (1000, 11)

First 5 rows of training data:


| | juv_fel_count | juv_misd_count | juv_other_count | days_b_screening_arrest | age | length_of_stay | priors_count | race | sex | c_charge_degree | c_charge_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -1.0 | 69 | 0.984468 | 0 | Other | Male | F | Aggravated Assault w/Firearm |
| 1 | 0 | 0 | 0 | -1.0 | 34 | 10.077384 | 0 | African-American | Male | F | Felony Battery w/Prior Convict |
| 2 | 0 | 0 | 1 | -1.0 | 24 | 1.085764 | 4 | African-American | Male | F | Possession of Cocaine |
| 3 | 0 | 1 | 0 | NaN | 23 | NaN | 1 | African-American | Male | F | Possession of Cannabis |
| 4 | 0 | 0 | 0 | NaN | 43 | NaN | 2 | Other | Male | F | arrest case no charge |
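
The preview above shows empty values in `days_b_screening_arrest` and `length_of_stay`. Before choosing an imputation strategy, it can help to see how much is missing per column. The following is a minimal sketch: `missing_summary` is a hypothetical helper, and the small `demo` frame is copied from the preview rows only, not from the full dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most-missing first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Stand-in frame built from the five preview rows (not the full training set)
demo = pd.DataFrame({
    "days_b_screening_arrest": [-1.0, -1.0, -1.0, np.nan, np.nan],
    "age": [69, 34, 24, 23, 43],
    "length_of_stay": [0.984468, 10.077384, 1.085764, np.nan, np.nan],
    "priors_count": [0, 0, 4, 1, 2],
})

summary = missing_summary(demo)
print(summary)
```

In the notebook, calling `missing_summary(X_train)` would give the same view for the real training data.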


---
# **Step 3: Train a Model with Your Preferred Library (Scikit-Learn, TensorFlow, PyTorch, etc.)**

We will use a **Scikit-Learn Pipeline** to streamline preprocessing and modeling.

This pipeline will:
1.  **Impute Missing Values** (Fill NaNs with median for numbers, most frequent for categories).
2.  **One-Hot Encode** categorical columns (Race, Sex, Charge Degree, Charge Desc).
3.  **Scale** numerical columns (Age, Priors, Length of Stay, etc.).
4.  **Train** a Logistic Regression classifier.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# 1. Define feature groups using constants from Step 2
# Numerical features will be imputed and scaled
numeric_features = ALL_NUMERIC_COLS

# Categorical features will be imputed and One-Hot Encoded
categorical_features = ALL_CATEGORICAL_COLS

# 2. Define Transformers with Imputation
# Numeric: Impute missing values with the median, then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical: Impute missing values with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# 3. Create Preprocessor using the transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 4. Create Pipeline (Preprocessor + Model)
# You can replace LogisticRegression with any other sklearn model (e.g., RandomForestClassifier)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# 5. Train the pipeline
pipeline.fit(X_train, y_train)

# 6. Generate predictions on the test set
predictions = pipeline.predict(X_test)

print(f"✅ Model Trained! Predictions generated for {len(predictions)} samples.")
print("Proceed to the next step to submit your predictions to the leaderboard.")

✅ Model Trained! Predictions generated for 1000 samples.
Proceed to the next step to submit your predictions to the leaderboard.
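
This notebook loads no labels for `X_test`, so the predictions above cannot be scored locally. A common way to estimate accuracy before submitting is cross-validation on the training data. The sketch below assumes synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and uses a simplified pipeline (`demo_pipeline`) so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): use X_train, y_train in the notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)

# Simplified pipeline for the demo; the notebook's `pipeline` can be passed instead
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated accuracy; the whole pipeline is refit on each fold,
# so preprocessing never sees the held-out rows
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With the real data, `cross_val_score(pipeline, X_train, y_train, cv=5)` would follow the same pattern and makes it easier to compare alternative classifiers before using a leaderboard submission.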


---
# **Step 4: Connect to the Leaderboard**

This step connects your notebook to the specific backend for the Justice & Equity Challenge.

*Note: You will be prompted to enter a username and password. If you don't have an account, create one at [modelshare.ai](https://www.modelshare.ai).*

In [None]:
from aimodelshare.aws import set_credentials
from aimodelshare.playground import Competition
import os

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# The specific Model Playground URL for the Justice Challenge
my_playground_url = "https://cf3wdpkg0d.execute-api.us-east-1.amazonaws.com/prod/m"

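# Set your credentials (you will be prompted for a username and password)
set_credentials(apiurl=my_playground_url)

# Read the session access token from the environment
# (this notebook expects set_credentials to have stored it as AWS_TOKEN)
token = os.getenv("AWS_TOKEN")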

# Connect to the competition
playground = Competition(my_playground_url)

Modelshare.ai Username:··········
Modelshare.ai Password:··········
Modelshare.ai login credentials set successfully.
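
The cell above reads the token with `os.getenv("AWS_TOKEN")`, which quietly returns `None` if the variable is not set. A small guard can surface that problem before the submission step. This is a sketch only: `require_env` is a hypothetical helper, and `DEMO_TOKEN` is a placeholder variable used so the example runs on its own.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty.

    Hypothetical helper for illustration; not part of aimodelshare.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Re-run the credentials cell before submitting."
        )
    return value

# Demo with a placeholder variable (not a real credential)
os.environ["DEMO_TOKEN"] = "example-token"
print(require_env("DEMO_TOKEN"))
```

In the notebook, `token = require_env("AWS_TOKEN")` would replace the plain `os.getenv` call.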


---
# **Step 5: Submit & Check Results**

Submit your predictions to the leaderboard.
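
Before submitting, it can be worth checking that the predictions have one entry per test row and contain only labels seen in training. The sketch below is an assumption-based sanity check, not the leaderboard's own validation: `check_predictions` is a hypothetical helper, and the toy inputs are made up.

```python
import numpy as np

def check_predictions(predictions, n_expected, allowed_labels):
    """Raise ValueError if predictions have the wrong shape, length, or labels.

    Hypothetical helper for illustration; not part of aimodelshare.
    """
    preds = np.asarray(predictions)
    if preds.ndim != 1:
        raise ValueError(f"Expected a 1D array, got shape {preds.shape}")
    if len(preds) != n_expected:
        raise ValueError(f"Expected {n_expected} predictions, got {len(preds)}")
    unexpected = set(np.unique(preds).tolist()) - set(allowed_labels)
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    return True

# Toy example (made up)
ok = check_predictions([0, 1, 1, 0], n_expected=4, allowed_labels={0, 1})
print(ok)
```

With the notebook's variables, the call would look like `check_predictions(predictions, len(X_test), set(y_train.unique()))`.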

In [None]:
# 1. Submit your predictions
# Note: We pass None for model and preprocessor because we are only submitting predictions for evaluation
playground.submit_model(
    model=None,
    preprocessor=None,
    prediction_submission=predictions,
    token=token,
    input_dict={
        "Team": "The Ethical Explorers", # Change team name manually as needed.
        "description": "Logistic Regression with Sklearn Pipeline",
        "tags": "sklearn, logistic_regression, advanced_pathway, pipeline"
    }
)

print("✅ Predictions submitted successfully!")

# 2. Check the leaderboard
print("Loading leaderboard...")
leaderboard = playground.get_leaderboard()
playground.stylize_leaderboard(leaderboard)