# Adult Experiment

This notebook contains the code to reproduce the Adult experiment. The dataset can be downloaded from the [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/adult). 

**Run the following cells in order to reproduce the experiment from the paper.**

In [None]:
from boexplain import fmin
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Preprocess the data and prepare it for training

In [None]:
# read the data
df = pd.read_csv("adult.data",
                 names=[
                     "Age", "Workclass", "fnlwgt", "Education",
                     "Education-Num", "Marital Status", "Occupation",
                     "Relationship", "Race", "Gender", "Capital Gain",
                     "Capital Loss", "Hours per week", "Country", "Income"
                 ],
                 na_values=" ?")

# Map the income labels to binary values (0: <=50K, 1: >50K)
df['Income'] = df['Income'].replace({" <=50K": 0, " >50K": 1})
# Fill missing values with the mode of each column
# Occupation: the mode is Prof-specialty
df['Occupation'] = df['Occupation'].fillna('Prof-specialty')
# Workclass: the mode is Private
df['Workclass'] = df['Workclass'].fillna('Private')
# Country: the mode is United-States
df['Country'] = df['Country'].fillna('United-States')

categorical = [
    'Workclass', 'Education', 'Marital Status', 'Occupation', 'Relationship',
    'Race', 'Gender', 'Country'
]
for col in df.columns:  # simple label encoding
    if col in categorical:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    df[col] = pd.to_numeric(df[col], downcast='unsigned')

# Split the data set into features and label
X = df.drop(columns='Income')
Y = df['Income']

# Split the data into test data and training data
X_train, X_test, Y_train, Y_test = train_test_split(X.copy(),
                                                    Y.copy(),
                                                    test_size=0.2,
                                                    random_state=0)

# flip training labels satisfying the condition
condition = (X_train["Education-Num"] >=
             8) & (X_train["Education-Num"] <=
                   10) & (X_train["Age"] >= 30) & (X_train["Age"] <= 40)
X_train["Income"] = Y_train
X_train.loc[condition, "Income"] += 1
X_train.loc[condition, "Income"] %= 2
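
The two in-place updates above flip every selected label, since `(y + 1) % 2` maps 0 to 1 and 1 to 0. The following is a minimal, dependency-free sketch of the same corruption on a made-up toy table. The helper names `flip_label` and `in_corrupted_region` and the four rows are illustrative assumptions, not part of the notebook.

```python
def flip_label(label):
    """Flip a binary label: 0 -> 1 and 1 -> 0."""
    return (label + 1) % 2


def in_corrupted_region(row):
    """Same condition as in the notebook: Education-Num in [8, 10] and Age in [30, 40]."""
    return 8 <= row["Education-Num"] <= 10 and 30 <= row["Age"] <= 40


# Toy rows (made up for illustration only)
rows = [
    {"Age": 35, "Education-Num": 9, "Income": 0},   # inside the region
    {"Age": 35, "Education-Num": 12, "Income": 1},  # education outside
    {"Age": 50, "Education-Num": 9, "Income": 1},   # age outside
    {"Age": 30, "Education-Num": 10, "Income": 1},  # on the boundary, inside
]

# Build a corrupted copy; rows outside the region keep their label
corrupted = [
    {**row, "Income": flip_label(row["Income"])} if in_corrupted_region(row) else dict(row)
    for row in rows
]
flipped_incomes = [row["Income"] for row in corrupted]
print(flipped_incomes)
```

Only the first and last rows change, because both bounds of each range are inclusive.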

## The objective function

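The objective function trains a random forest classifier on the filtered training data and queries the average predicted value for the group Male. It returns the absolute difference between that average and 0.259830, the average predicted value obtained after removing the corrupted tuples, so a value of 0 means the filtered data reproduces the clean result.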

In [None]:
def obj(train_filtered):
    # create and train the rf model
    rf = RandomForestClassifier(n_estimators=28, random_state=0)
    rf.fit(train_filtered.drop(columns='Income'), train_filtered['Income'])
    # query the inference data
    X_test["prediction"] = rf.predict(X_test)
    query = abs(X_test.groupby("Gender")["prediction"].mean()[1] - 0.259830)
    X_test.drop(columns='prediction', inplace=True)
    return query
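
To make the query concrete, here is a small sketch of the same computation without the model: it averages the predictions of one group and returns the distance to a target. The function names and the toy predictions are hypothetical; in the notebook the group code 1 is the one the text associates with Male, and the target is 0.259830.

```python
def group_mean(predictions, groups, group):
    """Average prediction over the rows that belong to `group`."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def objective(predictions, groups, target, group=1):
    """Absolute distance between the group's mean prediction and the target."""
    return abs(group_mean(predictions, groups, group) - target)


# Toy data (made up): four rows in group 1, two rows in group 0
groups = [1, 1, 1, 1, 0, 0]
clean_predictions = [1, 0, 0, 0, 1, 1]      # group 1 mean is 0.25
corrupted_predictions = [1, 1, 0, 0, 1, 1]  # group 1 mean is 0.5

target = group_mean(clean_predictions, groups, group=1)
clean_value = objective(clean_predictions, groups, target)
corrupted_value = objective(corrupted_predictions, groups, target)
print(target, clean_value, corrupted_value)
```

The objective is 0 when the predictions match the clean model and grows as the group average drifts away from the target, which is what the minimization exploits.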

## BOExplain API call

The function *fmin* is used to minimize the objective function. The columns Age and Education-Num are searched for an explanation. Each run lasts 200 seconds, and the results are averaged over 10 runs. The correct predicate is provided so that F-score, precision and recall can be calculated. Statistics about the runs are saved to the file adult.json, which the plotting cells below read from results/adult.json.

In [None]:
df_rem = fmin(
    data=X_train,
    f=obj,
    num_cols=["Age", "Education-Num"],
    runtime=200,
    runs=10,
    random=True,  # also run random search (shown as "Random" in the plots)
    correct_pred={
        'Age_min': 30,
        'Age_len': 10,
        'Education-Num_min': 8,
        'Education-Num_len': 2
    },
    name="adult",
    file="adult.json",
    use_seeds_from_paper=True,
)
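
The sketch below illustrates how precision, recall, F-score and Jaccard similarity can be computed for a candidate predicate against the correct one, by comparing the sets of tuples each predicate selects. It assumes that a `*_min`/`*_len` pair describes the closed interval `[min, min + len]`, which matches the flipped region (Age 30 to 40, Education-Num 8 to 10). The helper names, the toy rows, and the candidate predicate are hypothetical, and this is not necessarily how BOExplain computes the metrics internally.

```python
def to_intervals(pred):
    """Turn {'Age_min': 30, 'Age_len': 10} into {'Age': (30, 40)}."""
    intervals = {}
    for key, value in pred.items():
        if key.endswith("_min"):
            col = key[: -len("_min")]
            intervals[col] = (value, value + pred[f"{col}_len"])
    return intervals


def selected_ids(rows, pred):
    """Indices of the rows that fall inside every interval of the predicate."""
    intervals = to_intervals(pred)
    return {
        i for i, row in enumerate(rows)
        if all(lo <= row[col] <= hi for col, (lo, hi) in intervals.items())
    }


def predicate_scores(rows, found, correct):
    """Set-based precision, recall, F-score and Jaccard similarity."""
    found_ids = selected_ids(rows, found)
    correct_ids = selected_ids(rows, correct)
    overlap = len(found_ids & correct_ids)
    union = len(found_ids | correct_ids)
    precision = overlap / len(found_ids) if found_ids else 0.0
    recall = overlap / len(correct_ids) if correct_ids else 0.0
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    jaccard = overlap / union if union else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "jaccard": jaccard}


# Toy rows (made up); the first three lie inside the correct predicate
rows = [
    {"Age": 30, "Education-Num": 8},
    {"Age": 35, "Education-Num": 9},
    {"Age": 40, "Education-Num": 10},
    {"Age": 45, "Education-Num": 9},
    {"Age": 33, "Education-Num": 13},
]
correct = {"Age_min": 30, "Age_len": 10,
           "Education-Num_min": 8, "Education-Num_len": 2}
found = {"Age_min": 34, "Age_len": 12,
         "Education-Num_min": 8, "Education-Num_len": 2}  # hypothetical candidate

scores = predicate_scores(rows, found, correct)
print(scores)
```

Scoring on selected tuples rather than on interval endpoints means two predicates with different bounds can still score perfectly if they select the same rows.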

## Recreate Figure 6

Using the output of the above call to the BOExplain API, the following two cells recreate Figure 6 from the paper.

In [2]:
import pandas as pd
import numpy as np
import altair as alt
alt.data_transformers.disable_max_rows()

from json import loads

# each line of the results file holds one JSON record
experiments = {}
with open("results/adult.json", "r") as fo:
    for i, line in enumerate(fo):
        experiments[i] = loads(line)

df = pd.DataFrame({}, columns=["Algorithm", "Time (seconds)", "Value"])
for i in range(len(experiments)):
    df_new = pd.DataFrame.from_dict({"Algorithm": experiments[i]["cat_enc"],
                        "Time (seconds)": list(range(5, experiments[i]["runtime"]+5, 5)),
                        "Value": experiments[i]["time_array"]},
                                orient='index').T
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df_new])

df = df.explode("Value")
df = df.set_index(['Algorithm']).apply(pd.Series.explode).reset_index()
df["Algorithm"] = df["Algorithm"].replace({"individual_contribution_warm_start_top1": "BOExplain", np.nan: "Random"})

line = alt.Chart(df).mark_line().encode(
    x='Time (seconds)',
    y=alt.Y('mean(Value)', title=['Mean Objective', 'Function Value']),
    color="Algorithm"
).properties(
    title=" ",
    width=225,
    height=90
)
band = alt.Chart(df).mark_errorband(extent='stdev').encode(
    x='Time (seconds)',
    y=alt.Y('Value', title='Mean Objective Function Value'),
    color=alt.Color("Algorithm")
)
chart = line  # the error band (+ band) is left disabled
chart = chart.configure_title(
    anchor='start',
)
chart.configure_legend(
    title=None,
    orient='none',
    legendX=30,
    legendY=163,
    columns=4,
    labelFontSize=15,
    symbolSize=700,
    labelLimit=275,
).configure_axis(
    labelFontSize=15,
    titleFontSize=15,
    titlePadding=2
).configure_title(
    fontSize=15
)
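
The cell above reads one JSON record per line and plots the mean objective value at 5-second checkpoints. Here is a minimal stdlib sketch of that pipeline on made-up records. The keys `cat_enc`, `runtime` and `time_array` come from the notebook, but the layout assumed here (one list per run, one value per checkpoint, and a null `cat_enc` for random search) is an assumption about the results file, and the helper names are hypothetical.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def checkpoints(runtime, step=5):
    """Time grid used for the x-axis: step, 2*step, ..., runtime."""
    return list(range(step, runtime + step, step))


def mean_curve(runs):
    """Average several equal-length curves element-wise."""
    return [sum(values) / len(values) for values in zip(*runs)]


# Two made-up records standing in for results/adult.json
fake_file = io.StringIO(
    '{"cat_enc": "individual_contribution_warm_start_top1", "runtime": 15, '
    '"time_array": [[0.5, 0.25, 0.0], [0.25, 0.25, 0.0]]}\n'
    '{"cat_enc": null, "runtime": 15, '
    '"time_array": [[0.5, 0.5, 0.25], [0.5, 0.25, 0.25]]}\n'
)
experiments = read_json_lines(fake_file)

# Same relabelling as in the notebook: the encoding name becomes "BOExplain",
# a missing value becomes "Random"
labels = {"individual_contribution_warm_start_top1": "BOExplain", None: "Random"}
curves = {
    labels[record["cat_enc"]]: mean_curve(record["time_array"])
    for record in experiments
}
times = checkpoints(experiments[0]["runtime"])
print(times, curves)
```

With the notebook's runtime of 200 seconds, `checkpoints(200)` yields 40 time points from 5 to 200.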

In [4]:
import re
from json import loads

import altair as alt
import numpy as np
import pandas as pd

# each line of the results file holds one JSON record
experiments = {}
with open("results/adult.json", "r") as fo:
    for i, line in enumerate(fo):
        experiments[i] = loads(line)

df = pd.DataFrame({}, columns=["Algorithm", "Time (seconds)", "Precision", "Recall", "F-score", "Jaccard"])
for i in range(len(experiments)):
    df_new = pd.DataFrame.from_dict({"Algorithm": experiments[i]["cat_enc"],
#                         "Iteration": tuple(range(experiments[i]["n_trials"])),
                        "Time (seconds)": tuple(range(5, experiments[i]["runtime"]+5, 5)),
                        "Precision": experiments[i]["precision_time_array"],
                        "Recall": experiments[i]["recall_time_array"],
                        "F-score": experiments[i]["f_score_time_array"],
                        "Jaccard": experiments[i]["jaccard_time_array"]
                        }, orient='index').T
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df_new])

df = df.set_index(['Algorithm', "Time (seconds)"]).apply(pd.Series.explode).reset_index()
df = df.set_index(['Algorithm']).apply(pd.Series.explode).reset_index()
df["Algorithm"] = df["Algorithm"].replace({"individual_contribution_warm_start_top1": "BOExplain", np.nan: "Random"})


num_cols = f"{len(experiments[0]['num_cols'])} numerical columns: "
for i, col in enumerate(experiments[0]["num_cols"]):
    num_cols += f"{col} (range {experiments[0]['num_cols_range'][i][0]} to {experiments[0]['num_cols_range'][i][1]}), "
cat_cols = f"{len(experiments[0]['cat_cols'])} categorical columns: "
for i, col in enumerate(experiments[0]["cat_cols"]):
    cat_cols += f"{col} ({experiments[0]['cat_cols_n_uniq'][i]} unique values), "

out_str = f"Experiment: {experiments[0]['name']}. Completed {experiments[0]['n_trials']} iterations for {experiments[0]['runs']} runs. Search space includes "

if len(experiments[0]['num_cols']) > 0:
    out_str += num_cols
    if len(experiments[0]['cat_cols']) > 0:
        out_str += "and "

if len(experiments[0]['cat_cols']) > 0:
    out_str += cat_cols

out_str = f"{out_str[:-2]}."

out_lst = [line.strip() for line in re.findall(r'.{1,140}(?:\s+|$)', out_str)]
# df = pd.melt(df, id_vars=["Algorithm", "Iteration"], value_vars=["Precision", "Recall", "F1_score", "Jaccard"], value_name="Metric")
# altair
metric = "Metric"
f1_score = alt.Chart(df).mark_line().encode(
    x=alt.X('Time (seconds)', axis=alt.Axis()),
    y=alt.Y(f'mean(F-score)', scale=alt.Scale(domain=[0, 1]), axis=alt.Axis(values=[0.2, 0.5, 0.8]), title=None),
    color=alt.Color("Algorithm")
).properties(
    title="F-score",
    width=225,
    height=90
)
jaccard = alt.Chart(df).mark_line().encode(
    x=alt.X('Time (seconds)', axis=alt.Axis(labels=False, title=None,  tickSize=0)),
    y=alt.Y(f'mean(Jaccard)', scale=alt.Scale(domain=[0, 1]), axis=alt.Axis(values=[0.2, 0.5, 0.8], labels=False, title=None,  tickSize=0), title=None),
    color=alt.Color("Algorithm")
).properties(
    title="Jaccard Similarity",
    width=225,
    height=90
)
prec = alt.Chart(df).mark_line().encode(
    x='Time (seconds)',
    y=alt.Y(f'mean(Precision)', scale=alt.Scale(domain=[0, 1]), axis=alt.Axis(values=[0.2, 0.5, 0.8], labels=False, title=None,  tickSize=0), title=None),
    color=alt.Color("Algorithm")
).properties(
    title="Precision",
    width=225,
    height=90
)
recall = alt.Chart(df).mark_line().encode(
    x='Time (seconds)',
    y=alt.Y(f'mean(Recall)', scale=alt.Scale(domain=[0, 1]), axis=alt.Axis(values=[0.2, 0.5, 0.8], labels=False, title=None,  tickSize=0), title=None),
    color=alt.Color("Algorithm")
).properties(
    title="Recall",
    width=225,
    height=90
)
# first = alt.hconcat(f1_score, jaccard, spacing=0)
# second = alt.hconcat(prec, recall, spacing=0)
alt.hconcat(f1_score, prec, recall, spacing=0).resolve_scale(x='shared', y='shared').configure_legend(
    title=None,
    orient='none',
    legendX=200,
    legendY=135,
    labelFontSize=15,
    symbolSize=700,
    columns=2,
    labelLimit=275,
).configure_axis(
    labelFontSize=15,
    titleFontSize=15
).configure_title(
    fontSize=15
)