<a href="https://colab.research.google.com/github/rfaraz/shiftsc/blob/main/final/shiftsc_lawfinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏛 Law Track: COMPAS Revised
You are an employee at your city's judicial branch, and you are trying to improve the COMPAS system so that it makes better predictions. To do this, you will write a model that predicts a COMPAS score, much as the previous system produced a recidivism score that was used to decide how to sentence a defendant.

**But you need to watch out!** The previous system is heavily flawed and uses race as a factor to make its predictions, and you do not want to discriminate against defendants who come from different racial backgrounds. You must also make sure that defendants do not have their identity revealed. Finally, you should make sure your algorithm is safe from cyber attacks, as an attack would put the entire judicial system at risk.

So, let's get started!

---
# 📊 Your Data

To start you off, we have provided the old COMPAS records. To learn more about the complete data set, feel free to refer to this link: [COMPAS Recidivism Racial Bias](https://www.kaggle.com/datasets/danofer/compass)

The dataset has the following variables:
* Person_ID (Categorical)
* AssessmentID (Categorical)
* Case_ID (Categorical)
* Agency_Text (Categorical)
* LastName (Categorical)
* FirstName (Categorical)
* MiddleName (Categorical)
* Sex_Code_Text (Categorical)
* Ethnic_Code_Text (Categorical)
* DateOfBirth (Categorical)
* RecSupervisionLevel (Categorical)
* RecSupervisionLevelText (Categorical)
* Scale_ID (Categorical)
* DisplayText (Categorical)
* DecileScore (Numerical)
* ScoreText (Categorical)
* AssessmentType (Categorical)
* IsCompleted (Categorical)
* IsDeleted (Categorical)
* **Target Variable:** RawScore (Numerical)

It is up to you to determine which variables are needed to make an accurate prediction. Hint: You will most likely not need date of birth to determine the final prediction, so start by removing inputs like this. Remember, the goal is to make a strong model, so the way that you manipulate the data is up to you!



In [2]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = 'https://raw.githubusercontent.com/rfaraz/shiftsc/main/compas-scores-raw.csv'
df = pd.read_csv(url)  # Reads the dataset from the defined URL
df.head()

| | Person_ID | AssessmentID | Case_ID | Agency_Text | LastName | FirstName | MiddleName | Sex_Code_Text | Ethnic_Code_Text | DateOfBirth | ... | RecSupervisionLevel | RecSupervisionLevelText | Scale_ID | DisplayText | RawScore | DecileScore | ScoreText | AssessmentType | IsCompleted | IsDeleted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50844 | 57167 | 51950 | PRETRIAL | Fisher | Kevin | | Male | Caucasian | 12/05/92 | ... | 1 | Low | 7 | Risk of Violence | -2.08 | 4 | Low | New | 1 | 0 |
| 1 | 50844 | 57167 | 51950 | PRETRIAL | Fisher | Kevin | | Male | Caucasian | 12/05/92 | ... | 1 | Low | 8 | Risk of Recidivism | -1.06 | 2 | Low | New | 1 | 0 |
| 2 | 50844 | 57167 | 51950 | PRETRIAL | Fisher | Kevin | | Male | Caucasian | 12/05/92 | ... | 1 | Low | 18 | Risk of Failure to Appear | 15.0 | 1 | Low | New | 1 | 0 |
| 3 | 50848 | 57174 | 51956 | PRETRIAL | KENDALL | KEVIN | | Male | Caucasian | 09/16/84 | ... | 1 | Low | 7 | Risk of Violence | -2.84 | 2 | Low | New | 1 | 0 |
| 4 | 50848 | 57174 | 51956 | PRETRIAL | KENDALL | KEVIN | | Male | Caucasian | 09/16/84 | ... | 1 | Low | 8 | Risk of Recidivism | -1.5 | 1 | Low | New | 1 | 0 |


# 🔎 Exploring the Data
Before you begin manipulating the data, here are some commands that will help you understand it better. In particular, it helps to know how many distinct categories each categorical variable contains. This tells you how many groups exist within the data and which method would be best for encoding the categorical inputs.

In [None]:
# Inspect missing values and duplicates
df.info()
print(df.isnull().sum())  # Missing values per column
print("Duplicate rows:", df.duplicated().sum())

# Explore the number of categories within categorical variables (Variables that cannot be quantified by a number)

# Create some sort of visualization to tell you more about the data
# Use matplotlib to help you out
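
As a starting point for the two open steps above, here is a minimal sketch of counting categories and drawing one simple chart. It runs on a small made-up table (`demo`) rather than the real records, so the values are illustrative only; swap in `df` to apply it to the COMPAS data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the COMPAS records (values are illustrative only)
demo = pd.DataFrame({
    "Sex_Code_Text": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Ethnic_Code_Text": ["Caucasian", "African-American", "Hispanic",
                         "Caucasian", "African-American", "Caucasian"],
    "DisplayText": ["Risk of Violence", "Risk of Recidivism",
                    "Risk of Failure to Appear"] * 2,
    "DecileScore": [4, 2, 1, 2, 1, 3],
})

# Count exact duplicate rows
n_duplicates = int(demo.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# Number of distinct categories in each non-numeric column
categorical_cols = demo.select_dtypes(exclude="number").columns
category_counts = demo[categorical_cols].nunique().sort_values(ascending=False)
print(category_counts)

# Frequency of each value in one categorical column, drawn as a bar chart
ethnicity_counts = demo["Ethnic_Code_Text"].value_counts()
ax = ethnicity_counts.plot(kind="bar", title="Records per Ethnic_Code_Text value")
ax.set_ylabel("Number of records")
plt.tight_layout()
plt.show()
```

Columns with only a handful of categories are usually easy to one-hot encode, while columns with hundreds of distinct values (such as names or IDs) are candidates for dropping or grouping.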

# 🧼 Cleaning the Data
Below, we have included some commands you will need to prepare the data. We have already written the code to drop null (missing) values and to standardize all numerical values so they use the same scale. There are at least two more crucial steps you need to take to prepare your dataset:

* Encode categorical variables
* Drop unnecessary columns

**Encoding:** This step is a bit complicated, as some columns have far too many categories to be encoded effectively. There are ways to work around this, such as combining rare values into one category. This needs to be done carefully, however, as it will directly affect the accuracy of the final prediction. Use your exploration to guide your decisions.
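
One way to combine rare values before encoding is sketched below. The helper `collapse_rare` and its `min_count` threshold are hypothetical names introduced for this illustration, and the small `demo` table is made up; the right threshold for the real data depends on what your exploration shows.

```python
import pandas as pd

def collapse_rare(series, min_count):
    """Replace values that appear fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare_values = counts[counts < min_count].index
    # Keep a value where it is NOT rare; otherwise substitute 'Other'
    return series.where(~series.isin(rare_values), "Other")

# Toy column with two common and three rare categories (illustrative values)
demo = pd.DataFrame({
    "Ethnic_Code_Text": ["Caucasian", "Caucasian", "African-American",
                         "African-American", "Hispanic", "Asian", "Arabic"],
})

demo["Ethnic_Code_Text"] = collapse_rare(demo["Ethnic_Code_Text"], min_count=2)

# One-hot encode the reduced set of categories
encoded = pd.get_dummies(demo, columns=["Ethnic_Code_Text"])
print(encoded.columns.tolist())
```

Grouping reduces the number of encoded columns, but it also hides differences between the merged groups, which matters later when you evaluate fairness by ethnicity.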

As you continue working on improving the model, you may find the need to do some other data preprocessing, so feel free to add any other commands you feel are necessary.

In [None]:
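# Drop or fill missing values
# MiddleName is empty in every preview row above, so a bare df.dropna() could
# discard those rows. Drop that column first, then drop the remaining incomplete rows.
df = df.drop(columns=['MiddleName'], errors='ignore')
df = df.dropna()  # Or fill with mean/median as needed

# Normalize or scale numerical feature columns (leave the target, RawScore, unscaled)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include='number').columns.drop('RawScore', errors='ignore')
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])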

# Encode categorical variables (Change strings to numeric data)
# Use pd.get_dummies() or LabelEncoder/OneHotEncoder


# Drop any values that are unnecessary to the final model training


# 🔐 Defendant Privacy
Since we are dealing with sensitive data on defendants, we need to ensure that the privacy of each individual is protected. To do so, we have started you off with one method: removing personal identifiers. If you haven't done so already, uncomment the line that removes the name columns from the training data. (It is safe to assume that LastName, FirstName, and MiddleName do not influence the final prediction.)


When it comes to protecting privacy, this might not always be enough. If you combine other data like ethnicity and gender, it can become clear which defendant a record belongs to, so some other measures need to be put in place. Here are two more methods to implement:
* `Generalizing features`
* `Adding noise to data`

In [None]:
# Remove direct identifiers like name (Uncomment if you haven't done this step earlier)
# df = df.drop(columns=['FirstName', 'LastName', 'MiddleName'], errors='ignore')

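# Generalize features (e.g., age → age group)
# Note: 'age' is a placeholder; this dataset has DateOfBirth, so derive an age column first
# df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100], labels=['0-18', '19-35', '36-60', '60+'])
# df.drop(columns='age', inplace=True)

# Add noise to sensitive columns
# Note: 'income' is a placeholder; substitute a numeric column from this dataset
# def add_noise(col, epsilon=0.1):
#     return col + np.random.normal(0, epsilon, size=len(col))  # epsilon is the noise standard deviation
# df['income'] = add_noise(df['income'])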

X = df.drop(columns=['RawScore'])
y = df['RawScore']
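
The sketch below shows how the three privacy steps could fit together on a small made-up table. Several things here are assumptions for illustration: the `reference_date` used to compute age, the age bins, the choice of `DecileScore` as the column to perturb, and the noise scale. Note that adding Gaussian noise with a fixed scale is a simple perturbation, not calibrated differential privacy.

```python
import numpy as np
import pandas as pd

# Toy records in the same shape as the COMPAS columns (made-up values)
demo = pd.DataFrame({
    "FirstName": ["A", "B", "C"],
    "LastName": ["X", "Y", "Z"],
    "DateOfBirth": ["12/05/92", "09/16/84", "03/01/50"],
    "DecileScore": [4.0, 2.0, 1.0],
})

# Hypothetical date at which age is measured (an assumption for this sketch)
reference_date = pd.Timestamp("2014-01-01")

# 1) Remove direct identifiers
demo = demo.drop(columns=["FirstName", "LastName", "MiddleName"], errors="ignore")

# 2) Generalize: date of birth -> approximate age -> age group
# %y maps two-digit years 00-68 to 20xx, so shift any future date back a century
dob = pd.to_datetime(demo["DateOfBirth"], format="%m/%d/%y")
dob = dob.where(dob <= reference_date, dob - pd.DateOffset(years=100))
age = (reference_date - dob).dt.days // 365  # approximate age in whole years
demo["age_group"] = pd.cut(age, bins=[0, 18, 35, 60, 120],
                           labels=["0-18", "19-35", "36-60", "60+"])
demo = demo.drop(columns=["DateOfBirth"])

# 3) Add Gaussian noise to a numeric column
def add_noise(col, scale, rng):
    """Return col plus zero-mean Gaussian noise with the given standard deviation."""
    return col + rng.normal(0, scale, size=len(col))

rng = np.random.default_rng(0)
demo["DecileScore_noisy"] = add_noise(demo["DecileScore"], scale=0.1, rng=rng)
print(demo)
```

Larger noise gives more protection but degrades the signal the model learns from, so it is worth retraining and comparing error at a few noise levels.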

# 🏋 Training the Model

Great! Now that we have processed the data, we are finally ready to start training the model.

Implement a neural network model. We have provided you the `sklearn` library, which contains many strong models. Because the target, RawScore, is numerical, you will want a regressor rather than a classifier. Use the [documentation](https://scikit-learn.org/stable/) to guide you in defining, training, and using the model.



In [None]:
# Write a neural network
from sklearn.neural_network import MLPRegressor # Here is one example
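
Here is a minimal sketch of training and scoring an `MLPRegressor`. It uses synthetic data (`X_demo`, `y_demo`) so that it runs on its own; the layer sizes, iteration limit, and split ratio are arbitrary choices for illustration, not tuned settings. With the real data you would pass your processed `X` and `y` instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data standing in for the processed features and RawScore
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=400)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Small feed-forward network with two hidden layers
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Compare against a baseline that always predicts the training mean
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
baseline_mae = mean_absolute_error(y_test, np.full_like(y_test, y_train.mean()))
print(f"Model MAE: {mae:.3f}   Baseline MAE: {baseline_mae:.3f}")
```

Comparing against the mean-prediction baseline is a quick sanity check: a model that cannot beat it has not learned anything useful from the features.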

# ⚖️ Fairness Metrics
Now, compute some metrics to evaluate the fairness of your algorithm. One common approach is to measure the model's performance separately for each value of a sensitive variable. For example, to evaluate gender bias, you would compare performance for male defendants with performance for female defendants. Because RawScore is numerical, use a regression metric such as mean absolute error in place of accuracy. In our case, the sensitive variable is ethnicity.

We have already provided one approach. We also recommend going in and running your own fairness evaluations, such as the ones we went over in previous modules. If you find that there is some bias present in the system, go back to the model training step and implement the approaches you learned in Module 2: Bias and Fairness.

In [None]:
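# Split predictions by group and compare error or other metrics based on ethnicity

# Example pseudocode (assumes groups_test holds the Ethnic_Code_Text value of each
# test row, saved before that column was encoded or dropped):
# from sklearn.metrics import mean_absolute_error
# for group in groups_test.unique():
#     idx = (groups_test == group).to_numpy()
#     print(group)
#     print(mean_absolute_error(y_test[idx], y_pred[idx]))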

# You can also compute fairness metrics manually
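
As one example of computing metrics manually, the sketch below builds a per-group table of error and average prediction. The `results` table is made up, with group labels "A" and "B" standing in for values of the sensitive variable; which gaps count as unacceptable is a judgment call this sketch does not settle.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up test-set results: sensitive group, true score, predicted score
results = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "y_true": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y_pred": [1.0, 2.5, 3.0, 2.0, 3.0, 4.5],
})

rows = []
for group, part in results.groupby("group"):
    rows.append({
        "group": group,
        "n": len(part),
        # How far off predictions are, on average, for this group
        "mae": mean_absolute_error(part["y_true"], part["y_pred"]),
        # Positive values mean the model over-predicts for this group
        "mean_signed_error": (part["y_pred"] - part["y_true"]).mean(),
        "mean_prediction": part["y_pred"].mean(),
    })

per_group = pd.DataFrame(rows).set_index("group")
print(per_group)

# Largest difference in error between any two groups
mae_gap = per_group["mae"].max() - per_group["mae"].min()
print("MAE gap between groups:", mae_gap)
```

In this toy table the true scores are identical across groups, yet group B is consistently over-predicted; the signed error exposes that pattern even when overall error looks acceptable.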

# 🛟 Safety Measurements
Similar to the fairness metrics, go through and test out the safety of your model. Is it prone to adversarial attacks?

To test the robustness of your system, one common approach is adding noise to the test data. If the model's performance on the noisy data stays close to its performance on the clean data, that is evidence the system is robust to small perturbations.

We have already provided the first approach. Now, go ahead and add one more testing method. You can refer back to Module 3: Safety and Robustness for some more formal mathematical definitions. If you find that the system is failing these robustness tests, go back to your model training step and add measures such as randomized smoothing or noisier training data to improve the robustness of the model.

In [None]:
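# Add noise to test data and compare performance
# (X_test must contain only numeric columns; RawScore is numerical, so use a regression metric)
# from sklearn.metrics import mean_absolute_error
# def add_random_noise(X, epsilon=0.05):
#     return X + np.random.normal(0, epsilon, size=X.shape)

# X_test_noisy = add_random_noise(X_test)
# y_pred_noisy = model.predict(X_test_noisy)
# print(mean_absolute_error(y_test, model.predict(X_test)))  # Clean baseline
# print(mean_absolute_error(y_test, y_pred_noisy))           # Noisy data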
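
One possible second test is a noise sweep: repeat the check at several noise levels and record both the error and how far the predictions move. The sketch below does this on synthetic data with a small `MLPRegressor`; the noise levels and model settings are arbitrary choices for illustration. Random noise is a weaker test than a targeted adversarial attack, so passing it is not a guarantee of safety.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor

# Synthetic data standing in for the processed features and RawScore
rng = np.random.default_rng(0)
weights = np.array([1.0, -1.0, 0.5, 2.0])
X_train = rng.normal(size=(300, 4))
y_train = X_train @ weights
X_test = rng.normal(size=(100, 4))
y_test = X_test @ weights

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

def add_random_noise(X, epsilon, rng):
    """Return X plus zero-mean Gaussian noise with standard deviation epsilon."""
    return X + rng.normal(0, epsilon, size=X.shape)

clean_pred = model.predict(X_test)
report = {}
for epsilon in [0.0, 0.05, 0.2, 0.5]:
    noisy_pred = model.predict(add_random_noise(X_test, epsilon, rng))
    report[epsilon] = {
        # Error against the true targets at this noise level
        "mae": mean_absolute_error(y_test, noisy_pred),
        # Average amount each prediction moved relative to the clean input
        "mean_prediction_shift": float(np.mean(np.abs(noisy_pred - clean_pred))),
    }

for epsilon, stats in report.items():
    print(epsilon, stats)
```

A robust model shows error and prediction shift growing slowly as `epsilon` increases; a sharp jump at small noise levels suggests the model is sensitive to tiny input changes.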