# Task

Operators like joins, selections, and missing-value imputers can cause *data distribution issues*, which can heavily impact the performance of our model for specific demographic groups. Mlinspect helps identify such issues by offering a check that calculates histograms for sensitive groups in the data and verifies whether a histogram change is significant enough to alert the user. Thanks to our annotation propagation, we can deal with complex code involving things like nested sklearn pipelines and group memberships that are removed from the training data using projections.
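To make the idea of a "histogram change" concrete, here is a minimal sketch in plain pandas. It is not mlinspect's implementation: the helper names `group_ratios` and `ratio_changes`, the toy data, and the `max_relative_drop=0.3` threshold are all assumptions chosen for illustration. The sketch compares the share of each group before and after a selection and flags groups whose share drops by more than the threshold.

```python
import pandas as pd


def group_ratios(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows per group in `column` (missing values count as a group)."""
    return df[column].value_counts(normalize=True, dropna=False)


def ratio_changes(before: pd.DataFrame, after: pd.DataFrame, column: str,
                  max_relative_drop: float = 0.3) -> pd.DataFrame:
    """Compare group shares before and after an operator.

    `max_relative_drop` is a hypothetical threshold for this sketch:
    a group is flagged if its share shrinks by more than this fraction.
    """
    ratios_before = group_ratios(before, column)
    # Groups that vanish entirely after the operator get a share of 0.
    ratios_after = group_ratios(after, column).reindex(ratios_before.index, fill_value=0.0)
    relative_change = (ratios_after - ratios_before) / ratios_before
    return pd.DataFrame({
        "before": ratios_before,
        "after": ratios_after,
        "relative_change": relative_change,
        "flagged": relative_change < -max_relative_drop,
    })


# Toy data (made up): two equally sized age groups, unevenly spread over counties.
toy = pd.DataFrame({
    "age_group": ["g1"] * 4 + ["g2"] * 4,
    "county": ["c1", "c2", "c2", "c2", "c1", "c1", "c1", "c2"],
})

# A selection similar in spirit to the county filter in the pipeline.
filtered = toy[toy["county"].isin(["c2"])]

overview = ratio_changes(toy, filtered, "age_group")
print(overview)
```

In this toy example the share of `g2` falls from one half to one quarter, so it is flagged, while `g1` is not. The real check applies the same kind of before/after comparison to each relevant operator in the pipeline rather than to a single hand-picked filter.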

We want to find out whether preprocessing operations in pipelines introduce bias and, if so, which groups are affected.
The pipeline we want to analyse in this task can be found using the path `os.path.join(str(get_project_root()), "example_pipelines", "healthcare", "healthcare.py")`. The sensitive attributes we want to take a look at are `age_group` and `race`.

In this task, we use a pipeline we created using synthetic data.

The code of the pipeline:

> ```python
> """
> An example pipeline
> """
> import os
> import warnings
> 
> import pandas as pd
> from sklearn.compose import ColumnTransformer
> from sklearn.impute import SimpleImputer
> from sklearn.model_selection import train_test_split
> from sklearn.pipeline import Pipeline
> from sklearn.preprocessing import OneHotEncoder, StandardScaler
> from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
> from example_pipelines.healthcare.healthcare_utils import MyW2VTransformer, create_model
> from mlinspect.utils import get_project_root
> 
> # FutureWarning: Given feature/column names or counts do not match the ones for the data given during fit
> warnings.filterwarnings('ignore')
> 
> COUNTIES_OF_INTEREST = ['county2', 'county3']
> 
> # load input data sources (data generated with https://www.mockaroo.com as a single file and then split into two)
> patients = pd.read_csv(os.path.join(str(get_project_root()), "example_pipelines", "healthcare",
>                                     "healthcare_patients.csv"), na_values='?')
> histories = pd.read_csv(os.path.join(str(get_project_root()), "example_pipelines", "healthcare",
>                                      "healthcare_histories.csv"), na_values='?')
> 
> # combine input data into a single table
> data = patients.merge(histories, on=['ssn'])
> 
> # compute mean complications per age group, append as column
> complications = data.groupby('age_group').agg(mean_complications=('complications', 'mean'))
> 
> data = data.merge(complications, on=['age_group'])
> 
> # target variable: people with a high number of complications
> data['label'] = data['complications'] > 1.2 * data['mean_complications']
> 
> # project data to a subset of attributes
> data = data[['smoker', 'last_name', 'county', 'num_children', 'race', 'income', 'label']]
> 
> # filter data
> data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
> 
> # define the feature encoding of the data
> impute_and_one_hot_encode = Pipeline([
>         ('impute', SimpleImputer(strategy='most_frequent')),
>         ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
>     ])
> 
> featurisation = ColumnTransformer(transformers=[
>     ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
>     ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
>     ('numeric', StandardScaler(), ['num_children', 'income'])
> ])
> 
> # define the training pipeline for the model
> neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
> pipeline = Pipeline([
>     ('features', featurisation),
>     ('learner', neural_net)])
> 
> # train-test split
> train_data, test_data = train_test_split(data, random_state=0)
> # model training
> model = pipeline.fit(train_data, train_data['label'])
> # model evaluation
> # this is running on synthetic random data, so there is nothing meaningful to learn in this example pipeline
> print(model.score(test_data, test_data['label']))
> ```

In [1]:
import os
from mlinspect.utils import get_project_root

HEALTHCARE_FILE_PY = os.path.join(str(get_project_root()), "example_pipelines", "healthcare", "healthcare.py")

# TODO

# Your answer: Did we find operators that introduce bias? How did the distribution of demographic groups change?

**My answer:** TODO