<font size=5  color=#003366> <b>[LEPL1109] - STATISTICS AND DATA SCIENCES</b> <br><br> 
<b>Hackathon 03 - Clustering: Bias in sensitive datasets</b> </font> <br><br><br>

<font size=4  color=#003366>
Prof. D. Hainaut<br>
Prof. L. Jacques<br>

<br><br>
Adrien Banse (adrien.banse@uclouvain.be)<br>
Jana Jovcheva (jana.jovcheva@uclouvain.be)<br>
François Lessage (francois.lessage@uclouvain.be)<br>
Sofiane Tanji (sofiane.tanji@uclouvain.be)<br>
<div style="text-align: right"> Version 2024-2025</div>
<br><br>
<div class="alert alert-danger">
<b>[IMPORTANT] Read all the documentation</b>  <br>
    Make sure that you read the whole notebook, <b>and</b> the <code>README.md</code> file in the folder.
</div>
<br><br>
</font>

# **Guidelines and Deliverables**

*   This hackathon is due on **22 December 2024 at 22:00**.
*   Copying code or answers from other groups (or from the internet) is strictly forbidden. <b>Each source of inspiration (Stack Overflow, Git, other groups, ChatGPT...) must be clearly indicated!</b>
*  This notebook (the <code>.ipynb</code> file), the report (PDF format) and all other files that are necessary to run your code must be submitted on <b>Moodle</b>.
* Only the PDF report and the Python source file will be graded, both on their content and the quality of the text / figures.
  * 5/10 for the code.
  * 4/10 for the LaTeX report.
  * 1/10 for the visualization. <br><br>

<div class="alert alert-info">
<b>[DELIVERABLE] Summary</b>  <br>
After reading this document (and playing with the code!), we expect you to provide us with:
<ol>
   <li> a PDF file (written in LaTeX) that answers all the questions below. The report should contain high quality figures with named axes (we recommend saving plots with the <samp>.pdf</samp> extension);
   <li> this Jupyter Notebook (it will be read, checked for plagiarism and evaluated);
   <li> and all other files we would need to run your code.
</ol>
</div>

As mentioned above, plagiarism is forbidden. We cannot forbid you from using artificial intelligence, but we remind you that the aim of this project is to learn on your own and with the help of the course material. We also remind you that, for a given question, artificial intelligence tools tend to produce similar solutions, which could be perceived as a form of plagiarism.

# **Context & Objective**

## Context

Predictive algorithms serve multiple functions in criminal justice. They forecast crime locations, identify potential violent offenders, predict court appearance compliance, and estimate recidivism risk. 
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) stands as a prominent risk assessment tool. Since 1998, the COMPAS risk score has been used by many jurisdictions in the United States to assess risk of recidivism in pre-trial bail decisions.
In the United States, a defendant may either be detained or released on bail *(sous caution)* prior to the trial in court depending on various factors. Judges may detain defendants or increase the bail amount based on the risk score provided by the COMPAS algorithm.


In 2016, investigative journalists at ProPublica published [Machine Bias](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing), highlighting significant biases in the COMPAS algorithm. Specifically, they showed that the proportion of false positives for African-American defendants is significantly higher than for Caucasian defendants. In other words, among defendants who did not relapse into criminal behaviour, African-American defendants were more often labeled high risk than Caucasian defendants. A more thorough explanation of their data analysis procedure can be found in their companion article [How We Analyzed the COMPAS Recidivism Algorithm](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm).
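To make the notion of "proportion of false positives" concrete, here is a minimal sketch that computes a false positive rate per group. The labels, predictions and group names are made up for illustration (they are not taken from the COMPAS data), and `false_positive_rate` is a hypothetical helper, not part of the hackathon template.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN): share of true negatives that were predicted positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) > 0 else float("nan")


# Toy data (invented): 1 = recidivated / predicted high risk, 0 = otherwise.
y_true = [0, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

fpr_by_group = {}
for g in sorted(set(group)):
    # Keep only the samples belonging to group g.
    idx = [i for i, gi in enumerate(group) if gi == g]
    fpr_by_group[g] = false_positive_rate(
        [y_true[i] for i in idx], [y_pred[i] for i in idx]
    )

print(fpr_by_group)
```

In this toy sample, group A has two false positives among three true negatives, whereas group B has one among three. A gap of this kind is what ProPublica reported between the two groups of defendants.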

The COMPAS algorithm is proprietary software. What is known is that its decision is based on the answers to a questionnaire with 137 questions which the defendant must fill in. There are questions related to crime (“How many prior juvenile felony offense arrests?”) along with seemingly mundane ones (“Do you live with friends?”; “Do you feel discouraged at times?”).

In this hackathon, you are provided with a dataset containing a subset of answers to that questionnaire from more than 7000 defendants living in Broward County, Florida, as well as whether or not they relapsed into criminal behaviour.

## Objective(s)
It has been shown in the article linked above that the COMPAS algorithm is biased. We take this for granted and do not show it again. The main objective of the hackathon is for you to understand that it is not only the COMPAS algorithm that is biased, but that **the data itself is biased**, in the sense that one can find structural patterns of discrimination embedded in the data. In other words, *learning from data coming from a biased world without precautions will necessarily lead to biased predictions*. Knowing which precautions one should take to avoid biased predictions is a whole subfield of machine learning called "Fairness in AI". It is outside the scope of this hackathon and of LEPL1109. If you are curious about it, however, a great resource is the book [Fairness and Machine Learning: Limitations and Opportunities](https://fairmlbook.org/).

To see that the data itself is biased, you will implement your own recidivism prediction algorithms and measure their fairness.

## Dataset description
A large part of this hackathon will be devoted to handling, understanding and manipulating the dataset. The dataset provided represents 7214 defendants (data points) with 49 answers to the questionnaire (features) for each defendant. This is a lot of features and it should take you some time to understand them before going through Part 1. This data processing part may be tiresome but it is a necessary task in any serious data project.
A description of the features is provided in the table below:

| Feature               | Description                                                                                      |
|-----------------------|--------------------------------------------------------------------------------------------------|
| name                  | Full name of the defendant                                                                       |
| first                 | First name of the defendant                                                                      |
| last                  | Last name of the defendant                                                                       |
| compas_screening_date | The date the defendant filled the questionnaire                                                  |
| sex                   | Sex of the defendant (Female, Male)                                                              |
| dob                   | Date of birth of the defendant (YYYY-MM-DD)                                                      |
| age                   | Age of the defendant                                                                             |
| age_cat               | Age category of the defendant (Less than 25, 25-45, Greater than 45)                             |
| race                  | Race of the defendant (African-American, Caucasian, Hispanic, Asian, Native American, Other)     |
| juv_fel_count         | Number of juvenile felonies committed by the defendant                                           |
| decile_score          | Decile of the COMPAS score                                                                       |
| juv_misd_count        | Number of juvenile misdemeanors                                                                  |
| juv_other_count       | Number of juvenile convictions that are considered neither misdemeanors nor felonies             |
| priors_count          | Number of prior crimes committed                                                                 |
| days_b_screening_arrest | Count of days between screening date and (original) arrest date                                |
| c_jail_in             | Datetime at which the defendant entered jail (YYYY-MM-DD, hh:mm:ss)                              |
| c_jail_out            | Datetime at which the defendant left jail (YYYY-MM-DD, hh:mm:ss)                                 |
| c_case_number         | Case number for the current charge                                                               |
| c_offense_date        | Date the offense was committed (YYYY-MM-DD)                                                      |
| c_arrest_date         | Date the defendant was arrested (YYYY-MM-DD)                                                     |
| c_days_from_compas    | Days from COMPAS screening date to current arrest date                                           |
| c_charge_degree       | Current charge degree (felony or misdemeanor) at the time of filling the questionnaire ("F", "M")|
| c_charge_desc         | Description of the current charge                                                                |
| is_recid              | Binary variable indicating whether the defendant is rearrested at any time (0, 1)                |
| r_case_number         | Case number for a recidivism charge                                                              |
| r_charge_degree       | Recidivism charge degree (felony or misdemeanor) for an offense subsequent to filling the questionnaire |
| r_days_from_arrest    | Days from Arrest to Recidivism Event                                                             |
| r_offense_date        | Date the recidivism offense was committed (YYYY-MM-DD)                                           |
| r_charge_desc         | Description of the recidivism charge                                                             |
| r_jail_in             | Datetime at which the defendant entered jail for a recidivism charge (YYYY-MM-DD, hh:mm:ss)      |
| r_jail_out            | Datetime at which the defendant left jail for a recidivism charge (YYYY-MM-DD, hh:mm:ss)         |
| violent_recid         | Number of violent recidivism events                                                              |
| is_violent_recid      | Binary variable indicating whether the defendant committed a violent recidivism (0, 1)           |
| vr_case_number        | Case number for a violent recidivism charge                                                      |
| vr_charge_degree      | Violent recidivism charge degree (felony or misdemeanor)                                         |
| vr_offense_date       | Date the violent recidivism offense was committed (YYYY-MM-DD)                                   |
| vr_charge_desc        | Description of the violent recidivism charge                                                     |
| type_of_assessment    | Type of COMPAS assessment performed                                                              |
| decile_score.1        | *Same as decile_score*                                                                           |
| score_text            | Recidivism risk of the defendant (Low, Medium, High)                                             |
| screening_date        | Date on which the defendant was screened (YYYY-MM-DD)                                            |
| v_type_of_assessment  | Type of violent risk assessment                                                                  |
| v_decile_score        | Decile score for violent risk assessment                                                         |
| v_score_text          | Violent recidivism risk of the defendant (Low, Medium, High)                                     |
| v_screening_date      | Date of the violent risk assessment (YYYY-MM-DD)                                                 |
| in_custody            | Date on which the defendant was placed in custody (YYYY-MM-DD)                                   |
| out_custody           | Date on which the defendant left custody (YYYY-MM-DD)                                            |
| priors_count.1        | *Same as priors_count*                                                                           |
| two_year_recid        | Binary variable on whether the defendant has recidivated within two years (0, 1)                 |

## **Notebook structure**

### PART 1 - Data preprocessing
   #### 1.1 - Importing the packages
   #### 1.2 - Importing the dataset
   #### 1.3 - Dataset curation
   #### 1.4 - Feature engineering
   #### 1.5 - Sensitive features
   #### 1.6 - Scale the dataset

### PART 2 - Data exploration
   #### 2.1 - Feature visualization
   #### 2.2 - Principal Component Analysis

   
### PART 3 - Clustering
   #### 3.1 - K-Means
   #### 3.2 - Results Analysis


### PART 4 - Validation and fairness metrics
   #### 4.1 - Silhouette score
   #### 4.2 - Purity and entropy of a clustering
   #### 4.3 - Precision and Recall per race group
   #### 4.4 - Select the number of clusters


### PART 5 - Visualization
   #### 5.1 - Visualize your results

<br><br>

***Remark***

We filled this notebook with preliminary (trivial) code. This makes it possible to run each cell, even the last ones, without throwing warnings once the dataset is imported. <b>Take advantage of this to divide the work between all team members!</b> <br><br>
Remember that many libraries exist in Python, so many functions have already been developed. Read the documentation and don't reinvent the wheel! You can import whatever you want.

<br><font size=7 color=#009999> <b>PART I - Preliminaries</b> </font> <br><br>

<font size=5 color=#009999> <b>1.1 - Importing the packages</b> <br>
</font>


In [152]:
"""
CELL N°1.1 : IMPORTING ALL THE NECESSARY PACKAGES

@pre:  /
@post: The necessary packages should be loaded.
"""

import warnings
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns

random_seed = 42

np.random.seed(random_seed)
random.seed(random_seed)

warnings.filterwarnings("ignore")

# Import all the necessary packages here...


<br>
<font size=5 color=#009999> <b>1.2 - Importing the dataset</b> <br>
</font>


In [153]:
"""
CELL N°1.2 : IMPORTING THE DATASET

@pre:  /
@post: The object `df` should contain a Pandas DataFrame corresponding to the file `compas-dataset.csv`.
"""

df = pd.read_csv("compas-dataset.csv")

df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7214 entries, 0 to 7213
Data columns (total 49 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     7214 non-null   object 
 1   first                    7214 non-null   object 
 2   last                     7214 non-null   object 
 3   compas_screening_date    7214 non-null   object 
 4   sex                      7214 non-null   object 
 5   dob                      7214 non-null   object 
 6   age                      7214 non-null   int64  
 7   age_cat                  7214 non-null   object 
 8   race                     7214 non-null   object 
 9   juv_fel_count            7214 non-null   int64  
 10  decile_score             7214 non-null   int64  
 11  juv_misd_count           7214 non-null   int64  
 12  juv_other_count          7214 non-null   int64  
 13  priors_count             7214 non-null   int64  
 14  days_b_screening_arrest 

|  | age | juv_fel_count | decile_score | juv_misd_count | juv_other_count | priors_count | days_b_screening_arrest | c_days_from_compas | is_recid | r_days_from_arrest | violent_recid | is_violent_recid | decile_score.1 | v_decile_score | priors_count.1 | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7214.0 | 7214.0 | 7214.0 | 7214.0 | 7214.0 | 7214.0 | 6907.0 | 7192.0 | 7214.0 | 2316.0 | 0.0 | 7214.0 | 7214.0 | 7214.0 | 7214.0 | 7214.0 |
| mean | 34.817993 | 0.06723 | 4.509565 | 0.090934 | 0.109371 | 3.472415 | 3.304763 | 57.731368 | 0.481148 | 20.26943 |  | 0.113529 | 4.509565 | 3.691849 | 3.472415 | 0.450652 |
| std | 11.888922 | 0.473972 | 2.856396 | 0.485239 | 0.501586 | 4.882538 | 75.809505 | 329.740215 | 0.499679 | 74.871668 |  | 0.317261 | 2.856396 | 2.510148 | 4.882538 | 0.497593 |
| min | 18.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -414.0 | 0.0 | 0.0 | -1.0 |  | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 25.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | 0.0 | 0.0 |  | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 50% | 31.0 | 0.0 | 4.0 | 0.0 | 0.0 | 2.0 | -1.0 | 1.0 | 0.0 | 0.0 |  | 0.0 | 4.0 | 3.0 | 2.0 | 0.0 |
| 75% | 42.0 | 0.0 | 7.0 | 0.0 | 0.0 | 5.0 | 0.0 | 2.0 | 1.0 | 1.0 |  | 0.0 | 7.0 | 5.0 | 5.0 | 1.0 |
| max | 96.0 | 20.0 | 10.0 | 13.0 | 17.0 | 38.0 | 1057.0 | 9485.0 | 1.0 | 993.0 |  | 1.0 | 10.0 | 10.0 | 38.0 | 1.0 |


<br>
<font size=5 color=#009999> <b>1.3 - Dataset curation</b> <br>
</font>

For this hackathon, your goal is to **determine the risk of recidivism of an individual**. Therefore, you should be able to determine which features are useful for your application and remove the unnecessary ones. We provide a list of features to keep and ask you to add features to that list. This step may take more time than the others. It is important to carefully analyze each feature and its relevance for our goal.

In this data cleaning task, you must remove redundant features, features that are not quantifiable and features that you believe are not linked to the risk of recidivism. You should neither limit yourself to the provided list, which is too short, nor add all numeric features.
You should also avoid data leakage. Except for the "two_year_recid" feature, do not keep features which represent true recidivism. For similar reasons, do not keep features linked to the predictions made by the COMPAS algorithm. Using predictions made by a supervised algorithm (which is trained using both the features matrix and the target variable) is effectively leaking information from the target variable.
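As an illustration of the mechanics only, the following sketch removes a redundant column and two leaking columns from a tiny made-up DataFrame. The column names mimic the dataset, but the values are invented and the list `leaky_or_redundant` is a hypothetical example, not the expected answer to this section.

```python
import pandas as pd

# Toy frame (invented values) with column names borrowed from the dataset.
toy = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "priors_count": [0, 3, 1],
        "priors_count.1": [0, 3, 1],  # exact duplicate of priors_count
        "decile_score": [2, 8, 4],  # prediction made by the COMPAS algorithm
        "is_recid": [0, 1, 0],  # records true recidivism
        "two_year_recid": [0, 1, 0],  # target variable, kept on purpose
    }
)

# Check the redundancy before dropping the duplicate.
is_duplicate = bool((toy["priors_count"] == toy["priors_count.1"]).all())

# Hypothetical list: one redundant column and two columns that leak the target.
leaky_or_redundant = ["priors_count.1", "decile_score", "is_recid"]
curated = toy.drop(columns=leaky_or_redundant)

print(is_duplicate)
print(curated.columns.tolist())
```

Note that `drop` returns a new DataFrame, so the result has to be assigned to a variable to take effect.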

<div class="alert alert-warning">
<b>[Question 1.1] Removing unnecessary features </b>  <br>
Can you already, a priori, detect that some features are useless?
<ol>
   <li> if yes, list those (useless) features and explain your choice;
   <li> if not, then explain why it is better to wait.
</ol>
    Generally speaking, is it a good idea to remove a feature based on <i>a priori</i> knowledge, or doesn't it alter the final outcome?
</div>

In [154]:
"""
CELL N°1.3.1 : CURATION OF THE DATASET

@pre:  A pandas.DataFrame `df` containing the dataset
@post: A pandas.DataFrame `df` containing the dataset without outliers and with necessary features only
"""

# We also provide some code to remove outliers.
df = df[
    (
        df["is_recid"] != -1
    )  # Data aggregator encoded is_recid = -1 whenever they couldn't find a COMPAS case.
    & (
        df["days_b_screening_arrest"] <= 30
    )  # More than 30 days between the day of arrest and the date when the questionnaire was filled => poor data quality
    & (df["days_b_screening_arrest"] >= -30)  # Same as above
    & (
        df["c_charge_degree"] != "O"
    )  # These are ordinary traffic offenses, which do not result in jail time.
]

# For reasons made explicit later, keep at least these columns.
columns_to_keep = [
    "sex",
    "race",
    "c_jail_in",
    "c_jail_out",
    "in_custody",
    "out_custody",
    "two_year_recid",
]

df.info()
df.describe(include="all")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6172 entries, 0 to 7213
Data columns (total 49 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     6172 non-null   object 
 1   first                    6172 non-null   object 
 2   last                     6172 non-null   object 
 3   compas_screening_date    6172 non-null   object 
 4   sex                      6172 non-null   object 
 5   dob                      6172 non-null   object 
 6   age                      6172 non-null   int64  
 7   age_cat                  6172 non-null   object 
 8   race                     6172 non-null   object 
 9   juv_fel_count            6172 non-null   int64  
 10  decile_score             6172 non-null   int64  
 11  juv_misd_count           6172 non-null   int64  
 12  juv_other_count          6172 non-null   int64  
 13  priors_count             6172 non-null   int64  
 14  days_b_screening_arrest 

|  | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | juv_fel_count | ... | score_text | screening_date | v_type_of_assessment | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6172 | 6172 | 6172 | 6172 | 6172 | 6172 | 6172.0 | 6172 | 6172 | 6172.0 | ... | 6172 | 6172 | 6172 | 6172.0 | 6172 | 6172 | 6172 | 6172 | 6172.0 | 6172.0 |
| unique | 6128 | 2493 | 3465 | 685 | 2 | 4830 |  | 3 | 6 |  | ... | 3 | 685 | 1 |  | 3 | 685 | 1087 | 1097 |  |  |
| top | anthony smith | michael | williams | 2013-04-20 | Male | 1987-12-21 |  | 25 - 45 | African-American |  | ... | Low | 2013-04-20 | Risk of Violence |  | Low | 2013-04-20 | 2013-01-27 | 2020-01-01 |  |  |
| freq | 3 | 127 | 73 | 30 | 4997 | 5 |  | 3532 | 3175 |  | ... | 3421 | 30 | 6172 |  | 4117 | 30 | 19 | 46 |  |  |
| mean |  |  |  |  |  |  | 34.534511 |  |  | 0.0593 | ... |  |  |  | 3.641769 |  |  |  |  | 3.246436 | 0.45512 |
| std |  |  |  |  |  |  | 11.730938 |  |  | 0.463599 | ... |  |  |  | 2.488768 |  |  |  |  | 4.74377 | 0.498022 |
| min |  |  |  |  |  |  | 18.0 |  |  | 0.0 | ... |  |  |  | 1.0 |  |  |  |  | 0.0 | 0.0 |
| 25% |  |  |  |  |  |  | 25.0 |  |  | 0.0 | ... |  |  |  | 1.0 |  |  |  |  | 0.0 | 0.0 |
| 50% |  |  |  |  |  |  | 31.0 |  |  | 0.0 | ... |  |  |  | 3.0 |  |  |  |  | 1.0 | 0.0 |
| 75% |  |  |  |  |  |  | 42.0 |  |  | 0.0 | ... |  |  |  | 5.0 |  |  |  |  | 4.0 | 1.0 |


In data science, datasets are rarely tailored to specific applications. Instead, they typically originate from information collected over a certain period. It is the data scientist's responsibility to effectively utilize these datasets.

<div class="alert alert-info">
<b>[Remark 1.1]</b><br>
In most real-world cases, the datasets you work with will contain artifacts like typos or missing data, which may need to be removed before using them in algorithms. In Pandas, missing data is represented as "NaN" (Not a Number), although this marker applies to all missing objects, not just numbers.
</div>

Can you find a way to inspect your dataset and see whether any data is missing?
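One possible way to do this, shown here on a small made-up DataFrame rather than on `df`, is to count missing entries per column with `isna()`. The column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (invented) with missing values in a numeric and a text column.
toy = pd.DataFrame(
    {
        "age": [25, 40, 31, 52],
        "days_in_jail": [1.0, np.nan, 3.0, np.nan],
        "charge_desc": ["Battery", None, "Theft", "Theft"],
    }
)

# isna() flags both NaN and None; summing booleans counts the missing entries.
missing_count = toy.isna().sum()
# The mean of booleans gives the share of missing entries per column.
missing_share = toy.isna().mean()

summary = pd.DataFrame({"n_missing": missing_count, "share_missing": missing_share})
print(summary.sort_values("n_missing", ascending=False))
```

Sorting by the number of missing entries puts the most incomplete columns at the top, which helps decide whether a column is worth keeping.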

In [155]:
"""
CELL N°1.3.2: INFORMATION ABOUT TYPES AND NANs
@pre:  A pandas.DataFrame `df` containing the dataset.
@post: Statistics and/or visualization on the presence of missing data in `df`.
"""
df

|  | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | juv_fel_count | ... | score_text | screening_date | v_type_of_assessment | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | 0 | ... | Low | 2013-08-14 | Risk of Violence | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 |
| 1 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | 0 | ... | Low | 2013-01-27 | Risk of Violence | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 1 |
| 2 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | 0 | ... | Low | 2013-04-14 | Risk of Violence | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 1 |
| 5 | marsha miles | marsha | miles | 2013-11-30 | Male | 1971-08-22 | 44 | 25 - 45 | Other | 0 | ... | Low | 2013-11-30 | Risk of Violence | 1 | Low | 2013-11-30 | 2013-11-30 | 2013-12-01 | 0 | 0 |
| 6 | edward riddle | edward | riddle | 2014-02-19 | Male | 1974-07-23 | 41 | 25 - 45 | Caucasian | 0 | ... | Medium | 2014-02-19 | Risk of Violence | 2 | Low | 2014-02-19 | 2014-03-31 | 2014-04-18 | 14 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7209 | steven butler | steven | butler | 2013-11-23 | Male | 1992-07-17 | 23 | Less than 25 | African-American | 0 | ... | Medium | 2013-11-23 | Risk of Violence | 5 | Medium | 2013-11-23 | 2013-11-22 | 2013-11-24 | 0 | 0 |
| 7210 | malcolm simmons | malcolm | simmons | 2014-02-01 | Male | 1993-03-25 | 23 | Less than 25 | African-American | 0 | ... | Low | 2014-02-01 | Risk of Violence | 5 | Medium | 2014-02-01 | 2014-01-31 | 2014-02-02 | 0 | 0 |
| 7211 | winston gregory | winston | gregory | 2014-01-14 | Male | 1958-10-01 | 57 | Greater than 45 | Other | 0 | ... | Low | 2014-01-14 | Risk of Violence | 1 | Low | 2014-01-14 | 2014-01-13 | 2014-01-14 | 0 | 0 |
| 7212 | farrah jean | farrah | jean | 2014-03-09 | Female | 1982-11-17 | 33 | 25 - 45 | African-American | 0 | ... | Low | 2014-03-09 | Risk of Violence | 2 | Low | 2014-03-09 | 2014-03-08 | 2014-03-09 | 3 | 0 |


<div class="alert alert-info">
<b>[Remark 1.2] Each problem has its own solution</b> <br>
There exist numerous ways to deal with missing information and we will discuss the two main approaches:
<ol>
   <li> you remove rows or columns that contain missing data;
   <li> or you replace NaNs with another value. The latter can be a fixed value or computed to be, e.g., the mean of all non-NaN values. The topic of replacing missing data, also called imputation of missing values, is very broad and complex, and there is no global solution that applies everywhere. Maybe you can find one that works well here?
</ol>
    
You **should** read more about how to impute missing values [here](https://scikit-learn.org/stable/modules/impute.html). However, you will not be evaluated on how sophisticated your handling of NaNs is, so for this hackathon, do not spend an unreasonable amount of time on the next cell.
</div> 
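The two approaches from the remark above can be sketched as follows on a made-up DataFrame. The 50% threshold and the choice of the median are arbitrary assumptions made for the example, not recommendations for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (invented) where one column is mostly empty.
toy = pd.DataFrame(
    {
        "age": [25.0, 40.0, np.nan, 52.0],
        "priors_count": [0.0, 3.0, 1.0, np.nan],
        "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    }
)

# Approach 1a: drop columns where more than half of the entries are missing
# (the 0.5 threshold is an arbitrary choice for this sketch).
share_missing = toy.isna().mean()
kept = toy.loc[:, share_missing <= 0.5]

# Approach 1b: drop every row that still contains a missing value.
complete_rows = kept.dropna()

# Approach 2: impute missing entries with the median of their column.
imputed = kept.fillna(kept.median())

print(kept.columns.tolist())
print(len(complete_rows))
print(imputed)
```

Dropping rows keeps only two of the four samples here, whereas imputation keeps all four at the cost of introducing artificial values. The same trade-off applies, on a larger scale, to the real dataset.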

<div class="alert alert-warning">
<b>[Question 1.2] Handling missing data </b>  <br>
Given the dataset and the amount / type of missing information, what strategy do you propose to follow regarding missing data (NaNs)? <br> You can choose one or several of the following:
<ol>
   <li> drop features (column) with missing information; 
   <li> drop samples (row) with missing information;
   <li> replace missing information with interpolation / extrapolation / simple substitution / ...
</ol>
Briefly justify your choice.
</div> 

In [156]:
"""
CELL N°1.3.3: Handling missing values
@pre:  A pandas.DataFrame `df` containing the dataset.
@post: A pandas.DataFrame `df` containing the dataset with no missing values.
"""

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6172 entries, 0 to 7213
Data columns (total 49 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     6172 non-null   object 
 1   first                    6172 non-null   object 
 2   last                     6172 non-null   object 
 3   compas_screening_date    6172 non-null   object 
 4   sex                      6172 non-null   object 
 5   dob                      6172 non-null   object 
 6   age                      6172 non-null   int64  
 7   age_cat                  6172 non-null   object 
 8   race                     6172 non-null   object 
 9   juv_fel_count            6172 non-null   int64  
 10  decile_score             6172 non-null   int64  
 11  juv_misd_count           6172 non-null   int64  
 12  juv_other_count          6172 non-null   int64  
 13  priors_count             6172 non-null   int64  
 14  days_b_screening_arrest 

<br>
<font size=5 color=#009999> <b>1.4 - Feature engineering</b> <br>
</font>

<div class="alert alert-info">
<b>[Remark 1.3] New features extraction</b> <br>
In the present case, some features in the dataset still need to be reworked in order to provide meaningful information. For example, working with datetimes might not be easy.
</div>

You may want to incorporate the information about date and time into the dataset in a more **intelligent** manner than it was before. Again, there can be multiple solutions, and we propose a very simple one.

For example, which matters more for predicting the likelihood of recidivism: the exact dates at which each defendant entered and left jail/custody, or the time spent in jail/custody?

Note: 1) you should apply your solution to all the date/time features you kept, which should include at least the ones hinted at above. 2) Pandas has a `to_datetime` function that should prove useful!
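A minimal sketch of the idea, assuming a duration expressed in days is what you want: parse the two timestamps with `pd.to_datetime`, subtract them, and keep only the difference. The timestamps below are invented and `jail_days` is a hypothetical column name.

```python
import pandas as pd

# Toy frame (invented timestamps) using the same column names as the dataset.
toy = pd.DataFrame(
    {
        "c_jail_in": ["2013-01-01 08:00:00", "2013-05-10 12:00:00"],
        "c_jail_out": ["2013-01-03 20:00:00", "2013-05-10 18:00:00"],
    }
)

# Parse the strings into datetimes; subtracting them yields timedeltas.
jail_in = pd.to_datetime(toy["c_jail_in"])
jail_out = pd.to_datetime(toy["c_jail_out"])

# Convert the timedeltas to a plain number of days (fractions included).
toy["jail_days"] = (jail_out - jail_in).dt.total_seconds() / 86400

# The raw timestamps are no longer needed once the duration is stored.
toy = toy.drop(columns=["c_jail_in", "c_jail_out"])
print(toy)
```

The result is a single numeric feature that can be scaled and fed to a clustering algorithm, unlike the original text columns.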

In [157]:
"""
CELL N°1.4 : FEATURE ENGINEERING

@pre:  A pandas.DataFrame `df` containing the dataset
@post: A pandas.DataFrame `df` containing the previous dataset with the new features you created.
"""

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6172 entries, 0 to 7213
Data columns (total 49 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     6172 non-null   object 
 1   first                    6172 non-null   object 
 2   last                     6172 non-null   object 
 3   compas_screening_date    6172 non-null   object 
 4   sex                      6172 non-null   object 
 5   dob                      6172 non-null   object 
 6   age                      6172 non-null   int64  
 7   age_cat                  6172 non-null   object 
 8   race                     6172 non-null   object 
 9   juv_fel_count            6172 non-null   int64  
 10  decile_score             6172 non-null   int64  
 11  juv_misd_count           6172 non-null   int64  
 12  juv_other_count          6172 non-null   int64  
 13  priors_count             6172 non-null   int64  
 14  days_b_screening_arrest 

<div class="alert alert-warning">
<b>[Question 1.3] New features </b>  <br>
What features have you added? If a particular manipulation has been applied, please explain.
</div> 

<br>
<font size=5 color=#009999> <b>1.5 - Sensitive features</b> <br>
</font>
<br>

<div class="alert alert-info">
<b>[Remark 1.4] Sensitive features</b> <br>
At this stage of the hackathon, you still have two sensitive features: the sex attribute and the race attribute. As the end goal is to build a fair learning algorithm, you should not use these two features to determine a defendant's risk of recidivism.
</div>

To check if your learning techniques are unfair to particular subgroups of these features, you should **remove both features from the dataset while keeping them aside** to analyze the fairness of our learning techniques.

In [158]:
"""
CELL N°1.5 : SENSITIVE FEATURES

@pre:  A pandas.DataFrame `df` containing the dataset with sensitive features
@post: A pandas.DataFrame `df` containing the dataset without sensitive features and separate numpy arrays for each sensitive feature 
as well as the true label array `y`.
"""

# Use the dictionaries below to encode numerically the sensitive features.
race_map = {
    "African-American": 0,
    "Caucasian": 1,
    "Asian": 2,
    "Other": 3,
    "Hispanic": 4,
    "Native American": 5,
}

sex_map = {
    "Male": 0,
    "Female": 1
}


# Create numeric numpy arrays for the sensitive features (as required by @post)
race_text = df["race"].tolist()
sex_text = df["sex"].tolist()

race = np.array([race_map[i] for i in race_text])
sex = np.array([sex_map[i] for i in sex_text])

print(race_text)
print(sex_text)
print(race)
print(sex)

# Drop sensitive features (`columns=` already selects the axis, so `axis=1` is not needed)
df = df.drop(columns=["race", "sex"])

# As you approach the last step of the preprocessing, you should also store the target variable and remove it from the dataframe.
# It is good practice to remove it just before the scaling so that its dimension corresponds to the number of data points in `df`.
y = df["two_year_recid"].astype(float).values
df = df.drop(columns=["two_year_recid"])
# In the original template, this line was only: df.drop(columns=["two_year_recid"]),
# which has no effect because the result was not assigned back to `df`.

print(y)
df.info()
df.describe()

['Other', 'African-American', 'African-American', 'Other', 'Caucasian', 'Other', 'Caucasian', 'Caucasian', 'African-American', 'Caucasian', 'African-American', 'Caucasian', 'African-American', 'Hispanic', 'African-American', 'Caucasian', 'Caucasian', 'African-American', 'African-American', 'Caucasian', 'Caucasian', 'Hispanic', 'Caucasian', 'Other', 'African-American', 'African-American', 'African-American', 'Caucasian', 'African-American', 'Caucasian', 'Other', 'Caucasian', 'Caucasian', 'African-American', 'Caucasian', 'African-American', 'African-American', 'Caucasian', 'African-American', 'African-American', 'Caucasian', 'African-American', 'Caucasian', 'African-American', 'Caucasian', 'Other', 'African-American', 'African-American', 'Hispanic', 'African-American', 'Caucasian', 'African-American', 'African-American', 'Caucasian', 'African-American', 'African-American', 'Hispanic', 'African-American', 'Other', 'African-American', 'Hispanic', 'Hispanic', 'African-American', 'African-Am

| statistic | age | juv_fel_count | decile_score | juv_misd_count | juv_other_count | priors_count | days_b_screening_arrest | c_days_from_compas | is_recid | r_days_from_arrest | violent_recid | is_violent_recid | decile_score.1 | v_decile_score | priors_count.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 | 1997.0 | 0.0 | 6172.0 | 6172.0 | 6172.0 | 6172.0 |
| mean | 34.534511 | 0.0593 | 4.418503 | 0.091218 | 0.110661 | 3.246436 | -1.740279 | 24.903273 | 0.484446 | 20.100651 | NaN | 0.112119 | 4.418503 | 3.641769 | 3.246436 |
| std | 11.730938 | 0.463599 | 2.839463 | 0.497872 | 0.470731 | 4.74377 | 5.084709 | 276.812982 | 0.499799 | 76.543499 | NaN | 0.315539 | 2.839463 | 2.488768 | 4.74377 |
| min | 18.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -30.0 | 0.0 | 0.0 | -1.0 | NaN | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 25.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 2.0 | 1.0 | 0.0 |
| 50% | 31.0 | 0.0 | 4.0 | 0.0 | 0.0 | 1.0 | -1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 4.0 | 3.0 | 1.0 |
| 75% | 42.0 | 0.0 | 7.0 | 0.0 | 0.0 | 4.0 | -1.0 | 1.0 | 1.0 | 1.0 | NaN | 0.0 | 7.0 | 5.0 | 4.0 |
| max | 96.0 | 20.0 | 10.0 | 13.0 | 9.0 | 38.0 | 30.0 | 9485.0 | 1.0 | 993.0 | NaN | 1.0 | 10.0 | 10.0 | 38.0 |


<br>
<font size=5 color=#009999> <b>1.6 - Scaling the dataset</b> <br>
</font>

***Standardizing*** is important when you work with data because it brings features measured in different units onto a common scale, so that they can be compared with one another.

$z$ is the standard score of a population $x$. It can be computed as follows:
$$z = \frac{x-\mu}{\sigma}$$
with $\mu$ the mean of the population and $\sigma$ the standard deviation of the population.

Please consult [Wikipedia](https://en.wikipedia.org/wiki/Standard_score) for further information about the standardization.\
Be careful to use the same formula as we do: check the `scikit-learn` documentation and the imports already present in this notebook.
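
As a quick sanity check, the formula above can be compared with `StandardScaler` on a small array. This is only a sketch on made-up numbers (the array `x` is not part of the dataset). It relies on the fact that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `pandas.DataFrame.std` uses `ddof=1` by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 observations, 2 features
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 60.0]])

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

# Same computation with scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

# Both approaches should agree up to floating-point error
print(np.allclose(z_manual, z_sklearn))
```

After standardization, each column has zero mean and unit (population) standard deviation.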

In [159]:
"""
CELL N°1.6 : SCALE THE DATASET

@pre:  A pandas.DataFrame `df` containing the dataset
@post: A pandas.DataFrame `df` containing the standardized dataset
"""

#"""
def scale_dataset(df):
    # The first attempt failed with
    #   ValueError: could not convert string to float: 'miguel hernandez'
    # because `df` still contained text columns (e.g. names). StandardScaler only
    # accepts numeric input, so we keep the numeric columns only. This assumes
    # that the remaining text columns are not needed as features.
    numeric_df = df.select_dtypes(include="number")
    scaler = StandardScaler()
    scaled = scaler.fit_transform(numeric_df)
    return pd.DataFrame(scaled, columns=numeric_df.columns, index=numeric_df.index)


"""
# Earlier attempt, which did not work either
def scale_dataset(df):
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(recid)
    scaled_df = pd.DataFrame(scaled_features, columns=recid.columns)#, index=df.index)
    scaled_df = scaled_df[[col for col in scaled_df.columns]]
    return scaled_df
"""

X = scale_dataset(df)
X.info()
X.describe()

<br><font size=7 color=#009999> <b>PART 2 - Data Exploration</b> </font> <br><br>

<font size=5 color=#009999> <b>2.1 - Feature visualization</b> <br>
</font>


### Is the dataset balanced in terms of sensitive groups ?
It is good practice to check this in order to better understand the contents of our dataset.
Indeed, if the training dataset is severely imbalanced, our learning algorithm may perform better for over-represented groups than for under-represented groups, whereas our goal is for the model to perform equally well across all groups.

<div class="alert alert-warning">
<b>[Question 2.1] (Im)Balanced dataset ? </b>  <br>
Is the dataset imbalanced? What could be the consequences in terms of fairness, i.e. in terms of the model performing equally well across all groups?
</div> 

In [None]:
"""
CELL N°2.1.1: (Im)Balanced dataset ?

@pre:  A pandas.DataFrame `X` containing the dataset
@post: A pie chart representing the distribution of race groups in the dataset.
"""

# As the race group was previously removed, we can temporarily add it back, using the following map.
race_reverse_map = {v: k for k, v in race_map.items()}

# Build the text labels without modifying `race`, which must keep its numeric
# encoding for the fairness analysis later on.
X["race"] = [race_reverse_map[r] for r in race]

# Count the number of rows in each race group (largest group first).
counts = X["race"].value_counts()

plt.figure(figsize=(7, 7))
plt.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
plt.title("Distribution of race groups in the dataset")
plt.show()

X = X.drop(columns=["race"])  # Remove the race group again after doing the plot.

### Correlation
To identify the important features in our dataset, we can compute and plot (see e.g. `sns.heatmap`) the correlation matrix, which visually shows the correlations between all pairs of features.

In [None]:
"""
CELL N°2.1.2 : Correlation matrix

@pre:  A pandas.DataFrame `df` containing the dataset
@post: A visualization of the correlation matrix between features.
"""

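# Pearson correlation is only defined for numeric columns, so select them first.
corr_matrix = df.select_dtypes(include="number").corr(method="pearson")
sns.heatmap(corr_matrix, cmap="Blues", annot=False, square=True)
plt.title("Correlation matrix between features")
plt.show()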

<br>
<font size=5 color=#009999> <b>2.2 Principal Component Analysis</b> <br>
</font>


PCA is often considered the simplest and most fundamental technique for dimensionality reduction. Remember that PCA is essentially a rotation of the coordinate axes, chosen such that each successive axis captures as much of the remaining variance as possible. Although the algorithm can return a new coordinate system of the same dimension as the input, we can keep only the axes corresponding to the 3 largest singular values and project the data onto this coordinate system to perform the visualization.

To visualize the importance of features, we can extract the PCA loadings. These indicate the correlation between the components and the original features. For standardized features, the loadings lie between -1 and 1: the closer a loading is to one of these bounds, the more the feature influences the corresponding component. We propose to perform a 2-dimensional PCA and then add the loadings as vectors to the figure to obtain what is called a biplot.
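
To make the notion of loadings concrete, here is a minimal sketch on synthetic data (the array `data` and its three features are made up for illustration). It computes the loadings as the principal axes scaled by the standard deviation of each component, which is the same formula as in the biplot function of this notebook, and compares them with the correlations between the features and the components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: feature 1 is almost a copy of feature 0, feature 2 is independent noise
f0 = rng.normal(size=500)
data = np.column_stack([f0, f0 + 0.1 * rng.normal(size=500), rng.normal(size=500)])
data = StandardScaler().fit_transform(data)

pca = PCA(n_components=2).fit(data)
scores = pca.transform(data)

# Loadings: principal axes scaled by the standard deviation of each component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Correlation between each original feature (rows) and each component (columns)
corr = np.array(
    [[np.corrcoef(data[:, j], scores[:, k])[0, 1] for k in range(2)] for j in range(3)]
)

print(np.round(loadings, 2))
print(np.round(corr, 2))
```

On standardized data, the two printed matrices should agree up to a factor very close to 1 (it comes from the `n` versus `n - 1` normalization of the variance).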

The biplot visualization function is provided below.

In [None]:
"""
CELL N°2.2.1 : Principal Component Analysis (2D)

@pre:  A pandas.DataFrame `X` containing the dataset and labels `y`
@post: A PCA visualization in 2D where points are colored with respect to true labels `y`
"""


def biplot_visualization(X, y, columns=None):
    """
    Plot a biplot graph: the scaled data after applying a 2D PCA with loadings in vector forms.

    :param X: a n by m matrix (or DataFrame), containing the input prior to the PCA transformation
    :param y: a vector of length n containing the target
    :param columns: a list of length m containing the names of the columns
        If not given, X.columns will be used
    """
    # Resolve the column names first: after the PCA projection, X is a NumPy
    # array with only 2 columns and the original names would be lost.
    columns = (
        columns
        if columns is not None
        else X.columns
        if isinstance(X, pd.DataFrame)
        else [f"Feature {i+1}" for i in range(X.shape[1])]
    )

    pca = PCA(n_components=2)
    X = pca.fit_transform(X)

    # Normalize data for scaling
    X_normalized = X / (X.max(axis=0) - X.min(axis=0))

    df = pd.DataFrame(data=X_normalized, columns=["PC1", "PC2"])

    # Prepare loadings (vector components)
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

    loadings_df = pd.DataFrame(loadings, columns=["PC1", "PC2"], index=columns)

    # Create scatter plot
    plt.figure(figsize=(10, 8))
    sns.scatterplot(x=df["PC1"], y=df["PC2"], hue=y, palette="viridis", s=70, alpha=0.7)

    # Add vectors for loadings
    for index, row in loadings_df.iterrows():
        plt.arrow(
            0,
            0,
            row.PC1,
            row.PC2,
            color="red",
            alpha=0.7,
            head_width=0.02,
            head_length=0.03,
        )
        plt.text(
            row.PC1 * 1.1,
            row.PC2 * 1.1,
            index,
            color="black",
            ha="center",
            va="center",
            fontsize=10,
        )

    # Labels and limits
    plt.title("Biplot Visualization", fontsize=14)
    plt.xlabel("Principal Component 1")
    plt.ylabel("Principal Component 2")
    plt.axhline(0, color="gray", linestyle="--", linewidth=0.5)
    plt.axvline(0, color="gray", linestyle="--", linewidth=0.5)
    plt.grid(alpha=0.3)
    plt.legend(title="Classes", loc="best")
    plt.show()


# biplot_visualization(X, y, columns=df.columns)

In the next cell, you are asked to perform a 3-component PCA and plot it using Plotly.
<div class="alert alert-danger">
 Note: On certain versions of Firefox, the 3D scatter function of plotly may have some issues.
</div>

In [None]:
"""
CELL N°2.2.2 : Principal Component Analysis (3D)

@pre:  A pandas.DataFrame `X` containing the dataset and labels `y`
@post: A PCA visualization in 3D where points are colored with respect to true labels `y`
"""
def triplot_visualization(X, y, columns=None):
    """
    Plot a triplot graph: the scaled data after applying a 3D PCA.

    :param X: a n by m matrix (or DataFrame), containing the input prior to the PCA transformation
    :param y: a vector of length n containing the target
    :param columns: a list of length m containing the names of the columns
        If not given, X.columns will be used
    """
    # Resolve the column names first: after the PCA projection, X is a NumPy
    # array with only 3 columns and the original names would be lost.
    columns = (
        columns
        if columns is not None
        else X.columns
        if isinstance(X, pd.DataFrame)
        else [f"Feature {i+1}" for i in range(X.shape[1])]
    )

    pca = PCA(n_components=3)
    X = pca.fit_transform(X)

    # Normalize data for scaling
    X_normalized = X / (X.max(axis=0) - X.min(axis=0))

    df = pd.DataFrame(data=X_normalized, columns=["PC1", "PC2", "PC3"])
    # Store the labels as strings so that Plotly uses one discrete color per class
    df["Class"] = np.asarray(y).astype(str)

    # Prepare loadings (vector components); kept for inspection, not drawn in 3D
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    loadings_df = pd.DataFrame(loadings, columns=["PC1", "PC2", "PC3"], index=columns)

    # Create the 3D scatter plot from the projected data
    fig = px.scatter_3d(
        df, x="PC1", y="PC2", z="PC3", color="Class", opacity=0.7,
        title="Triplot Visualization",
    )
    fig.update_layout(
        scene=dict(
            xaxis_title="Principal Component 1",
            yaxis_title="Principal Component 2",
            zaxis_title="Principal Component 3",
        )
    )
    fig.show()


# triplot_visualization(X, y, columns=df.columns)

<div class="alert alert-warning">
<b>[Question 2.2] Principal Component Analysis </b>  <br>
Do all features have the same importance? If not, which features are less important, and why? You can use all other graphs from the visualization part to justify your answer.
</div> 

<br><font size=7 color=#009999> <b>PART 3 - Clustering</b> </font> <br><br>

<font size=4 color=#009999> <b>ABCs of Clustering</b> <br>
</font>
Clustering can be defined as the task of *grouping* objects from a set $S$ (here, each row/observation is an object) in such a way that objects assigned to the same group (called a cluster) are more **similar** (or less **distant**) to each other, in some sense, than to those assigned to the other groups. Usually, we would like to divide our objects into $K$ groups.

As such, clustering reduces to finding, among the set $\mathcal{P}$ of all possible $K$-partitions of $S$, the partition $(C_1,\dots,C_K)$ that minimizes some error criterion $f(C_1,\dots,C_K)$. Each object is assigned to a cluster $C_i$, and each cluster has a centroid $c_i$. The distance between **any object** in $C_i$ and its centroid $c_i$ is **always smaller** than its distance to any other centroid. In other words, each object is assigned to the cluster whose centroid is the closest.


A mathematical formulation of the problem could be the following, $$ \boxed{\min_{(C_1,\dots,C_K) \,\in\, \mathcal{P}}\,f(C_1,\dots,C_K) = \sum_{i = 1}^{K}\,\sum_{x \in C_i}\,\Delta(x,c_i)}$$

where $\Delta(x,c_i)$ denotes the distance between object $x$ and centroid $c_i$.
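
For fixed centroids, the criterion above can be evaluated in a few lines. The sketch below is illustrative only: the helper `clustering_cost` and the toy points are hypothetical, and $\Delta$ is taken to be the Euclidean distance. Note that `KMeans` in `scikit-learn` minimizes the sum of *squared* Euclidean distances (its `inertia_`), which is a slightly different choice of $\Delta$.

```python
import numpy as np


def clustering_cost(points, centroids):
    """Assign each point to its closest centroid and return (labels, total cost)."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Each point joins the cluster of its closest centroid
    labels = dists.argmin(axis=1)
    # Sum of the distances between each point and its own centroid
    cost = dists[np.arange(len(points)), labels].sum()
    return labels, cost


# Toy example (made up): two obvious groups and one centroid per group
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])

labels, cost = clustering_cost(points, centroids)
print(labels, cost)  # labels [0 0 1 1], cost 2.0
```

Every point lies at distance 0.5 from its centroid, hence the total cost of 2.0.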

<br>
<font size=5 color=#009999>
EXAMPLE OF SEPARATING OBJECTS INTO 10 CLUSTERS
</font> <br> <br>

**First**, let us imagine the following 2D dataset.

<img src="Imgs/10-partitions-data.svg" width = "250">

**Then**, a 10-partition is defined by the position of the centroids, one for each cluster. Below, you can observe four examples of (random) centroid locations (stars).

<img src="Imgs/10-partitions-chose-centroids.svg" width = "1000">

**Next**, the regions are colored based on their closest centroid. Here, we take the distance to be the Euclidean distance.

<img src="Imgs/10-partitions-centroids.svg" width = "1000">

**Finally**, data points (objects) are colored according to the region they are in.

<img src="Imgs/10-partitions-clusters.svg" width = "1000">

<font size=5 color=#009999> <b>3.1 - K-Means</b> <br>
</font>


In [None]:
"""
CELL N°3.1.1 : GROUND TRUTH

@pre:  A pandas.DataFrame `X` containing the dataset and labels `y`
@post: An 80/20 split of your dataset into train and test sets.
"""


<div class="alert alert-warning">
<b>[Question 3.1] Number of clusters </b>  <br>
    Accounting for all features, what do you think is the ideal number of clusters? What will happen if too many or even too few clusters are chosen?
</div>

Now that your dataset is divided into a train and a test set, use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">KMeans</a> algorithm from `scikit-learn` to apply the clustering on your dataset.
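
The overall workflow (split, fit on the training set, predict on both sets) can be sketched as follows on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration and the parameter values are arbitrary. This is not the expected solution for the hackathon dataset, only a reminder of the `scikit-learn` calls involved.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: two well-separated blobs of 50 points each
X_toy = np.vstack(
    [rng.normal(0.0, 0.5, size=(50, 2)), rng.normal(5.0, 0.5, size=(50, 2))]
)
y_toy = np.array([0] * 50 + [1] * 50)

# 80/20 split into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
train_clusters = model.fit_predict(X_tr)  # fit on the training data only
test_clusters = model.predict(X_te)       # assign test points to the learned centroids

print(train_clusters[:10], test_clusters[:10])
```

Cluster indices are arbitrary: cluster 0 does not necessarily correspond to class 0, which is why a mapping from clusters to majority classes is needed before computing accuracy-like metrics.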

In [26]:
"""
CELL N°3.1.2 : K-Means

@pre:  A split of your dataset: X_train, X_test, y_train, y_test
@post: A split of your dataset in train and test sets.
"""


def train_and_predict(model, X_train, X_test):
    """Trains the clustering model on the training data and predict the clusters for both training and test data.

    Parameters:
    model (sklearn or similar clustering model): The clustering algorithm that has a fit_predict method and a predict method.
    X_train (array-like, shape (n_samples, n_features)): The training data to fit the model on.
    X_test (array-like, shape (n_samples, n_features)): The test data to predict the clusters for.

    Returns:
    tuple: A tuple containing two arrays:
        - train_clusters (array): Cluster labels for the training data.
        - test_clusters (array): Cluster labels for the test data.
    """
    train_clusters = ...  # TODO
    test_clusters = ...  # TODO
    return train_clusters, test_clusters


def compute_y_pred(model, X_train, X_test, y_train):
    """Compute the predicted labels for the test data based on the clustering model.

    This function assigns a predicted label to each sample in the test set by:
    1. Training the model on the training data using the previous function.
    2. Assigning the majority class from the training labels to each cluster.
    3. Using the cluster assignments from the test data to assign predicted labels.

    Parameters:
    model (sklearn or similar clustering model): The trained clustering model with an `n_clusters` attribute.
    X_train (array-like, shape (n_samples, n_features)): The training data used to fit the model.
    X_test (array-like, shape (n_samples, n_features)): The test data to predict labels for.
    y_train (array-like, shape (n_samples,)): The true labels of the training data.

    Returns:
    np.array: An array of predicted labels for the test data based on the majority class in each cluster.
    """
    mapping = {}
    train_clusters, test_clusters = train_and_predict(model, X_train, X_test)
    df = pd.DataFrame({"cluster": train_clusters, "target": y_train})

    for cluster in range(model.n_clusters):
        majority_class = df[df["cluster"] == cluster]["target"].mode()[0]
        mapping[cluster] = majority_class

    y_pred = ...  # TODO
    return y_pred


def compute_metrics(model, X_train, y_train, X_test, y_test):
    """Computes various evaluation metrics for the clustering model.

    Parameters:
    model (sklearn or similar clustering model): The trained clustering model with an `n_clusters` attribute.
    X_train (array-like, shape (n_samples, n_features)): The training data used to fit the model.
    X_test (array-like, shape (n_samples, n_features)): The test data to predict labels for.
    y_train (array-like, shape (n_samples,)): The true labels of the training data.
    y_test (array-like, shape (n_samples,)): The true labels of the test data.

    Returns:
    dict: A dictionary containing the computed metrics:
        - "n_clusters": The number of clusters in the model.
        - "Accuracy": The accuracy of the model on the test data.
        - "F1-Score": The F1-score of the model on the test data.
        - "Precision": The precision of the model on the test data.
        - "Recall": The recall of the model on the test data.
        - "Silhouette Score": The silhouette score of the clustering on the test data.
    """
    y_pred = compute_y_pred(model, X_train, X_test, y_train)
    accuracy = ...  # TODO
    f1 = ...  # TODO
    precision, recall, _, _ = ...  # TODO
    sil_score = ...  # TODO
    return {
        "n_clusters": ...,  # TODO
        "Accuracy": accuracy,
        "F1-Score": f1,
        "Precision": precision,
        "Recall": recall,
        "Silhouette Score": sil_score,
    }


kmeans = KMeans(n_clusters=2, random_state=random_seed)
# results = compute_metrics(kmeans, X_train, y_train, X_test, y_test)
# results

<font size=5 color=#009999> <b>3.2 - Results Analysis</b> <br>
</font>

In this section, we address the difficult task of evaluating the performance of the clustering algorithm.

<font size=3 color=#009999> <b>3.2.1 - Quality of the clustering</b> <br>
</font>
The silhouette score of a point compares how close it is to the other points of its own cluster with how close it is to the points of the nearest neighboring cluster. The [mean silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) is the average of the silhouette scores of all points and provides a way to measure the quality of the clustering.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
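
As an illustration, the mean silhouette score can be computed for several numbers of clusters on synthetic data. The sketch below uses three made-up, well-separated blobs (`X_toy`); the range of `k` and the other parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: three well-separated blobs of 60 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X_toy = np.vstack([c + rng.normal(0.0, 0.5, size=(60, 2)) for c in centers])

# Mean silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On data like this, the score should peak at the true number of blobs. On real data, the curve is usually much flatter and harder to interpret.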

In [None]:
"""
CELL N°3.2.1 : Silhouette Score

@pre:  A split of your dataset: X_train, X_test, y_train, y_test
@post: A "Mean Silhouette Score versus Number of Clusters" plot
"""

plt.figure(figsize=(8, 5))
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))

plt.show()

<br>
<font size=5 color=#009999> <b>3.2.2 - Purity and entropy of a clustering</b> <br>
</font>

### Purity

Purity measures how well a cluster contains points from a single class. A cluster with high purity mostly contains points from one class.

**Example:** Imagine you are grouping fruits based on their shape, but you also have information about their color. If a group contains mostly red apples, that group has high purity. However, if you find a few green apples or pears in the group, the purity decreases. In this case, high purity means the majority of fruits share both shape and color consistency.

Formula:
$$
\text{Purity } = \frac{1}{N} \sum_{i = 1}^k \max_j n_{i,j}
$$
where:
- $N = $ total number of points,
- $k = $ number of clusters,
- $n_{i,j} = $ number of points from class $j$ in cluster $i$,
- $\max_j n_{i,j} = $ number of points from the most common class in cluster $i$.
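
The purity formula translates almost directly into code. Here is a minimal sketch, where the helper name `purity` and the toy labels are hypothetical:

```python
import numpy as np


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = np.asarray(clusters)
    classes = np.asarray(classes)
    total = 0
    for c in np.unique(clusters):
        # Count the points of each class inside cluster c
        _, counts = np.unique(classes[clusters == c], return_counts=True)
        # Keep the size of the most common class
        total += counts.max()
    return total / len(classes)


# Toy example: cluster 0 holds classes [0, 0, 0, 1], cluster 1 holds [1, 1, 0, 0]
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 1]
toy_classes = [0, 0, 0, 1, 1, 1, 0, 0]
print(purity(toy_clusters, toy_classes))  # (3 + 2) / 8 = 0.625
```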


### Entropy

Entropy measures how mixed the classes are within a cluster. Low entropy means most points in a cluster belong to the same class. High entropy means points are more evenly distributed across different classes.

**Example:** Consider a fruit basket that is mostly filled with red apples, with only a few bananas and oranges. Since the basket is dominated by one type of fruit, it has low entropy. In contrast, if the basket contains an equal mix of apples, bananas, and oranges, the distribution is more random, resulting in high entropy. This even distribution means it is harder to predict the dominant fruit just by looking at the basket.

***Formula for a single cluster:***
$$
E_i = -\sum_{j=1}^{C} p_{i,j} \log_2(p_{i,j})
$$

Where:

- $C = $ number of classes,
- $p_{i,j} = $ proportion of points from class $j$ in cluster $i$.

The overall entropy is the weighted average across all clusters:

$$
\text{Entropy} = \frac{1}{N} \sum_{i=1}^{k} n_i \cdot E_i
$$

Where $n_i$ is the number of points in cluster $i$.
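
The two entropy formulas can be sketched in the same way. The helper name `clustering_entropy` and the toy labels are hypothetical. Only the classes actually present in a cluster are counted, so the term $0 \log_2 0$ never has to be evaluated.

```python
import numpy as np


def clustering_entropy(clusters, classes):
    """Weighted average of the per-cluster class entropies (in bits)."""
    clusters = np.asarray(clusters)
    classes = np.asarray(classes)
    total = 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        # Proportion of each class present in cluster c
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()
        # Entropy E_i of the cluster, weighted by its size n_i
        total += len(members) * -(p * np.log2(p)).sum()
    return total / len(classes)


# Toy example: cluster 0 holds classes [0, 0, 0, 1], cluster 1 holds [1, 1, 0, 0]
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 1]
toy_classes = [0, 0, 0, 1, 1, 1, 0, 0]
print(round(clustering_entropy(toy_clusters, toy_classes), 4))  # 0.9056
```

Here cluster 0 has entropy of about 0.811 bits and cluster 1 has entropy 1 bit (a perfect 50/50 mix), so the weighted average is about 0.906.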

A good clustering aims for both high purity (most points in a cluster belong to one class) and low entropy (each cluster contains little class mixing).
    
<div class="alert alert-danger">
 If this makes it easier for you to implement purity and entropy, you can modify the previously defined function `compute_metrics` to also return in the results dictionary the purity, the entropy or any other metric that you may want to use later on.
</div>
<div class="alert alert-danger">
 Compared to the silhouette score which is computed using only the features, purity and entropy are metrics computed using the true label `y`. Do not forget to compute these metrics on a test set.
</div>

In [None]:
"""
CELL N°3.2.2 : Purity and Entropy

@pre:  A split of your dataset: X_train, X_test, y_train, y_test
@post: A "Purity/Entropy versus Number of Clusters" plot. There should be two curves, one for the purity and one for the entropy.
"""

plt.figure(figsize=(8, 5))
plt.gca().xaxis.set_major_locator(MaxNLocator(integer=True))

plt.show()

<div class="alert alert-warning">
<b>[Question 3.2] Quality of the clustering </b>  <br>
    You considered three different measures of the quality of the clustering. The first one, the silhouette score, is oblivious to the true labels: it is a truly unsupervised metric. The second and third metrics (purity and entropy) use the true labels to assess the quality of the clustering. Based on this observation,
    
1. Comment on the evolution of each metric according to the number of clusters.
2. State what you now think the ideal number of clusters is.
    
</div>

<br><font size=7 color=#009999> <b>PART 4 - Fairness metrics</b> </font> <br><br>

Congratulations on making it this far! So far, you have thoroughly analyzed a sensitive dataset, cleaned it, and focused on the features you believed were useful for predicting recidivism. You then used the K-Means algorithm to build your own recidivism predictor.

Because the dataset is sensitive and the predictor could have a negative impact on certain parts of the population, you should now assess the fairness of your predictor with respect to each gender and race group.

<br>
<font size=5 color=#009999> <b>4.1 False Positive Rate</b> <br>
</font>

The false positive rate (FPR) is a performance metric used to evaluate the accuracy of a machine learning model, particularly in binary classification tasks. It refers to the proportion of actual negative instances (people that did not recidivate) that are incorrectly classified as positive. A lower FPR indicates that the model is better at identifying negative cases.

A fair model would have the same FPR across all groups.
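
A per-group FPR can be computed from the confusion matrix restricted to each group. The sketch below is only illustrative: the helper `false_positive_rate` and the toy arrays are hypothetical, and label 1 is assumed to be the positive class (recidivism).

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), or NaN if the group contains no actual negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn) if (fp + tn) > 0 else np.nan


# Toy example (made up): 8 individuals split into two groups
y_true_toy = np.array([0, 0, 0, 1, 0, 0, 1, 1])
y_pred_toy = np.array([1, 0, 0, 1, 1, 1, 0, 1])
groups_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

fpr_by_group = {
    int(g): false_positive_rate(y_true_toy[groups_toy == g], y_pred_toy[groups_toy == g])
    for g in np.unique(groups_toy)
}
print(fpr_by_group)
```

In this toy example, group 0 has one false positive among three actual negatives (FPR of 1/3), while both actual negatives of group 1 are misclassified (FPR of 1): the hypothetical model would be unfair towards group 1.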

<div class="alert alert-danger">
 As with the purity and entropy metrics, the false positive rate uses the true labels. You should therefore make a train/test split beforehand.
</div>

In [None]:
"""
CELL N°4.1 False Positive Rate

@pre:  A split of your dataset: X_train, X_test, y_train, y_test
@post: A "False Positive Rate vs Number of Clusters" plot for each group
"""

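# Because the dataset is imbalanced, we regroup the data into three race groups: African-American (0), Caucasian (1) and Other (2).
# `race` is a Python list, so it is converted to an array first: `list == 0` is not an element-wise comparison.
# This assumes that `race` still holds the numeric codes from `race_map`.
race_array = np.asarray(race)
group_labels = np.where(race_array == 0, 0, np.where(race_array == 1, 1, 2))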

# X_train, X_test, y_train, y_test, group_train, group_val = train_test_split(
#     X, y, group_labels, test_size=0.2, random_state=random_seed
# )

# Doing so, you can now use X_test[group_val == i] to get the test points with race i.


plt.xlabel("Number of Clusters")
plt.ylabel("False Positive Rate")
plt.title("False Positive Rate vs. Number of Clusters")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

<br>
<font size=5 color=#009999> <b>4.2 Demographic Parity</b> <br>
</font>
Demographic parity is a fairness metric aimed at ensuring that a machine learning model’s predictions do not depend on membership in a sensitive group. Specifically, demographic parity is achieved when the likelihood of a prediction is independent of sensitive group membership. In binary classification, demographic parity requires equal selection rates across groups.

In our case, perfect demographic parity means that there is the exact same proportion of “bail denied” in each race group. A fair model would have the same Demographic Parity value across all groups.

In [None]:
"""
CELL N°4.2 Demographic Parity

@pre:  A split of your dataset: X_train, X_test, y_train, y_test
@post: A "Demographic Parity vs Number of Clusters" plot
"""


# We provide the function below to compute demographic parity
def compute_demographic_parity(y_pred, group_labels):
    unique_groups = np.unique(group_labels)
    demographic_parity = {}

    for group in unique_groups:
        # Create a boolean mask for the current group
        group_mask = group_labels == group

        # Calculate the proportion of positive predictions for the group
        group_pred = y_pred[group_mask]
        positive_rate = np.mean(group_pred == 1)

        demographic_parity[group] = positive_rate

    return demographic_parity


plt.xlabel("Number of Clusters")
plt.ylabel("Demographic Parity")
plt.title("Demographic Parity vs. Number of Clusters")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

<div class="alert alert-warning">
<b>[Question 4.1] Fairness of your model </b>  <br>
    You considered two different measures for the fairness of your model and checked for various variants of your algorithm (number of clusters) the value of these fairness metrics.

Is your algorithm unfair ? If yes, which ethnic group is penalized by the unfairness of your model ?
    
</div>

<div class="alert alert-warning">
<b>[Question 4.2] Presence of the sensitive features in the dataset [BONUS]</b> <br> 
In Cell 1.5, you removed the sensitive features from your dataset before building your algorithm. Yet, you may have noticed unfairness in your algorithm.

1. Provide reasons why it is not necessarily enough to remove sensitive features from your dataset if you want to have fair predictions.
2. Compute FPR and Demographic Parity for your algorithm when trained on the full dataset. Is the fairness of your classifier worse ?
</div>

In [31]:
# Empty cell for the BONUS question.

<br><font size=7 color=#009999> <b>PART 5 - Visualization </b> </font> <br><br>

<font size=5 color=#009999> <b>5.1 Visualize your results</b> <br>
</font>
In the last cell, you can create the figure of your choice to visualize your results. You can be as creative as you want as long as you only use one figure (with potentially more than one plot).

You will be evaluated on the clarity of your figure. You should ask yourself the following question while creating it: "Is the message I am trying to convey clear enough so that a student from another group can take a quick look and understand it directly ?" If the answer is positive, it's probably a great plot !

In [32]:
# Empty cell for the VISUALIZATION question.