# Task

Operators such as joins, selections and missing-value imputers can cause *data distribution issues*, which can heavily impact the performance of our model for specific demographic groups. Mlinspect helps identify such issues by offering a check that calculates histograms for sensitive groups in the data and verifies whether the histogram change is significant enough to alert the user. Thanks to our annotation propagation, we can deal with complex code involving things like nested sklearn pipelines and group memberships that are removed from the training data using projections.
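
To make the idea of a histogram comparison concrete, here is a minimal sketch in plain Python. It is not mlinspect's implementation: the helper names (`group_ratios`, `ratio_changes`, `flagged_groups`), the `max_drop` threshold and the toy rows are all assumptions made for illustration. It computes the share of each group before and after an operator and flags groups whose share dropped noticeably.

```python
from collections import Counter


def group_ratios(rows, column):
    """Return the share of each group in `column`, i.e. a normalized histogram."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def ratio_changes(before, after, column):
    """Absolute change in each group's share between two versions of the data."""
    before_ratios = group_ratios(before, column)
    after_ratios = group_ratios(after, column)
    return {group: after_ratios.get(group, 0.0) - ratio
            for group, ratio in before_ratios.items()}


def flagged_groups(before, after, column, max_drop=0.1):
    """Groups whose share dropped by more than `max_drop` (hypothetical threshold)."""
    changes = ratio_changes(before, after, column)
    return sorted(group for group, change in changes.items() if change < -max_drop)


# Toy rows invented for illustration; they are not taken from the COMPAS data.
before = [
    {"sex": "Female", "age": 23}, {"sex": "Female", "age": 27},
    {"sex": "Female", "age": 29}, {"sex": "Female", "age": 41},
    {"sex": "Male", "age": 25}, {"sex": "Male", "age": 33},
    {"sex": "Male", "age": 38}, {"sex": "Male", "age": 45},
    {"sex": "Male", "age": 52}, {"sex": "Male", "age": 60},
]
# A selection that looks neutral but removes mostly one group.
after = [row for row in before if row["age"] > 30]

changes = ratio_changes(before, after, "sex")
flagged = flagged_groups(before, after, "sex")
print(changes)
print(flagged)
```

In this toy data the filter on `age` never mentions `sex`, yet the share of one group shrinks from 4 in 10 rows to 1 in 6. This is the kind of side effect the check is meant to surface.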

We want to find out whether preprocessing operations in the pipeline introduce bias and, if so, which groups are affected.
The pipeline we want to analyse in this task can be found using the path `os.path.join(str(get_project_root()), "experiments", "user_interviews", "compas_modified.py")`. The sensitive attributes we want to take a look at are `sex` and `race`.

The COMPAS dataset contains information about 6,889 criminal defendants in Broward County, FL, along with predictions of their recidivism risk, as produced by a commercial tool called COMPAS. The sensitive attributes include gender and race. The task is to predict whether a defendant is likely to re-offend. We took this existing dataset and modified it only slightly by introducing an artificial issue, which we will now try to find using mlinspect.

The code of the pipeline:

> ```python
> """
> COMPAS pipeline
> """
> import os
> 
> import pandas as pd
> from sklearn.compose import ColumnTransformer
> from sklearn.impute import SimpleImputer
> from sklearn.linear_model import LogisticRegression
> from sklearn.pipeline import Pipeline
> from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, label_binarize
> 
> from mlinspect.utils import get_project_root
> 
> train_file = os.path.join(str(get_project_root()), "experiments", "user_interviews", "compas_train_modified.csv")
> train = pd.read_csv(train_file, na_values='?', index_col=0)
> test_file = os.path.join(str(get_project_root()), "example_pipelines", "compas", "compas_test.csv")
> test = pd.read_csv(test_file, na_values='?', index_col=0)
> 
> train = train[
>     ['sex', 'dob', 'age', 'c_charge_degree', 'race', 'score_text', 'priors_count', 'days_b_screening_arrest',
>      'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']]
> test = test[
>     ['sex', 'dob', 'age', 'c_charge_degree', 'race', 'score_text', 'priors_count', 'days_b_screening_arrest',
>      'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']]
> 
> # If the charge date of a defendant's COMPAS-scored crime was not within 30 days of when the person was arrested,
> # we assume, for data quality reasons, that we do not have the right offense.
> train = train[(train['days_b_screening_arrest'] <= 30) & (train['days_b_screening_arrest'] >= -30)]
> # We coded the recidivist flag (is_recid) to be -1 if we could not find a COMPAS case at all.
> train = train[train['is_recid'] != -1]
> # In a similar vein, ordinary traffic offenses (those with a c_charge_degree of 'O'), which will not result in jail
> # time, are removed (only two of them).
> train = train[train['c_charge_degree'] != "O"]
> # We filtered the underlying data from Broward county to include only those rows representing people who had either
> # recidivated in two years, or had at least two years outside of a correctional facility.
> train = train[train['score_text'] != 'N/A']
> 
> train = train.replace('Medium', "Low")
> test = test.replace('Medium', "Low")
> 
> train_labels = label_binarize(train['score_text'], classes=['High', 'Low'])
> test_labels = label_binarize(test['score_text'], classes=['High', 'Low'])
> 
> impute_and_onehot = Pipeline([('imputer1', SimpleImputer(strategy='most_frequent')),
>                               ('onehot', OneHotEncoder(handle_unknown='ignore'))])
> impute_and_bin = Pipeline([('imputer2', SimpleImputer(strategy='mean')),
>                            ('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform'))])
> 
> compas_featurizer = ColumnTransformer(transformers=[
>     ('impute1_and_onehot', impute_and_onehot, ['is_recid']),
>     ('impute2_and_bin', impute_and_bin, ['age'])
> ])
> compas_pipeline = Pipeline([
>     ('features', compas_featurizer),
>     ('classifier', LogisticRegression())
> ])
> 
> compas_pipeline.fit(train, train_labels.ravel())
> print(compas_pipeline.score(test, test_labels.ravel()))
> ```
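
The pipeline above applies several selections in a row, so it helps to see where in such a chain a group's share changes. The following plain-Python sketch records the group shares after each filter step. It is only an illustration under stated assumptions: `trace_filters` is a hypothetical helper, the rows and the group labels `A` and `B` are invented, and mlinspect itself works on the real pipeline code rather than on hand-written predicates.

```python
from collections import Counter


def ratios(rows, column):
    """Share of each group in `column` among the given rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def trace_filters(rows, steps, column):
    """Apply each (name, predicate) filter in order and record the group shares after it."""
    trace = [("input", ratios(rows, column))]
    for name, keep in steps:
        rows = [row for row in rows if keep(row)]
        trace.append((name, ratios(rows, column)))
    return trace


# Toy rows invented for illustration; the group labels "A" and "B" are placeholders.
toy_rows = [
    {"race": "A", "days_b_screening_arrest": 5, "is_recid": 0},
    {"race": "A", "days_b_screening_arrest": 40, "is_recid": 0},
    {"race": "A", "days_b_screening_arrest": -45, "is_recid": 1},
    {"race": "A", "days_b_screening_arrest": 0, "is_recid": 1},
    {"race": "B", "days_b_screening_arrest": 10, "is_recid": 0},
    {"race": "B", "days_b_screening_arrest": -3, "is_recid": 1},
    {"race": "B", "days_b_screening_arrest": 20, "is_recid": -1},
    {"race": "B", "days_b_screening_arrest": 1, "is_recid": 0},
]

# Two of the selections from the pipeline, written as predicates on a single row.
steps = [
    ("days_b_screening_arrest within 30 days",
     lambda row: -30 <= row["days_b_screening_arrest"] <= 30),
    ("is_recid != -1", lambda row: row["is_recid"] != -1),
]

trace = trace_filters(toy_rows, steps, "race")
for name, shares in trace:
    print(name, {group: round(share, 2) for group, share in shares.items()})
```

Comparing consecutive entries of `trace` shows which step moved the distribution. In this toy data, the first filter shifts the shares the most, and the second shifts them slightly back.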

```python
import os
from mlinspect.utils import get_project_root

COMPAS_FILE_PY = os.path.join(str(get_project_root()), "experiments", "user_interviews", "compas_modified.py")

# TODO
```


# Your answer: Did we find operators that introduce bias? How did the distribution of demographic groups change?

**My answer:** TODO