# Data processing for CBR library

This notebook contains the code used to process raw csv data into a format suitable for the CBR library. Running the cell associated with a dataset reads the raw data and produces a processed csv that can be passed to the `casebase` constructor. The following datasets are included in the notebook:

* COMPAS recidivism scores + variants
* Mushrooms edibility
* University admission
* Telecommunications customer churn
* Dutch tort law
* Welfare benefit application

In [10]:
from math import nan
import pandas as pd
from dateutil.parser import parse
import matplotlib.pyplot as plt

### COMPAS recidivism scores + variants

We load the COMPAS data (https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis) and modify it somewhat. This produces four datasets: `compas-large`, `compas`, `compas-small`, and `corels`.

Below is a description of the columns of the `compas-large` data: some are taken as-is from the original data, and some we create ourselves on the basis of the original data. There does not seem to be any summary available of what the individual columns mean, which makes it very difficult to determine what the values in the columns actually represent. See also: https://opendata.stackexchange.com/questions/17940/propublicas-compas-data-feature-descriptions.

Some general remarks about the meaning of the columns:
+ The `c_` prefix means the information relates to the crime 
  that was committed and resulted in imprisonment. 
+ The `r_` prefix means the information relates to the recidivism, 
  so this is only included in cases where recidivism did occur.
+ The `vr_` prefix is information related to violent recidivism,
  so this too is only included in cases where this occurred. 
  
The columns that we retain from the original data:

* `name`
* `sex`
* `age`
* `race`
* `c_charge_desc`
* `c_charge_degree`, either `F` for Felony or `M` for Misdemeanour
* `two_year_recid`, whether a person got rearrested within two years

The fields that we add manually on the basis of dropped fields:

* `days_in_jail`, based on `c_jail_in` and `c_jail_out`
* `days_in_custody`, based on `in_custody` and `out_custody`
* `Priors`, the sum of `juv_*_count` and `priors_count` (I think juvenile priors are not counted towards `priors_count` because sometimes the sum of `juv_*_count` is greater than `priors_count`)

There are 6900 rows with non-null values for these columns.
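
The day counts listed above are derived from pairs of timestamps. As a minimal sketch of that derivation, assuming the timestamps are strings that `pd.to_datetime` can parse, the difference can be computed for a whole column at once. The helper name `whole_days_between` and the toy rows are hypothetical and only serve as illustration; the processing cell in this notebook uses `dateutil` row by row instead.

```python
import pandas as pd

# Toy rows for illustration only; the values are not taken from the raw csv.
toy = pd.DataFrame({
    "c_jail_in": ["2013-08-13 06:03:42", "2013-01-26 03:45:27"],
    "c_jail_out": ["2013-08-14 05:41:20", "2013-02-05 05:36:53"],
})


def whole_days_between(start: pd.Series, end: pd.Series) -> pd.Series:
    """Number of whole days from start to end, truncated toward zero."""
    delta = pd.to_datetime(end) - pd.to_datetime(start)
    # Dividing the elapsed seconds by 86400 and casting to int drops the
    # fractional day, like int(seconds / 86400) does for a single row.
    return (delta.dt.total_seconds() / 86400).astype(int)


toy["days_in_jail"] = whole_days_between(toy["c_jail_in"], toy["c_jail_out"])
print(toy["days_in_jail"].tolist())
```

A stay of just under 24 hours counts as 0 days under this truncation, which may partly explain why the median of `days_in_jail` is so low.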

Some remarks:
* According to research done by Rudin et al. the most predictive variables are age, sex, and number of priors. This is in line with the correlations that we find in the data. 
```python 
print(compas.corr())
```
```
                          sex       age  c_charge_degree  two_year_recid  days_in_jail  days_in_custody         Priors
    sex              1.000000 -0.006484         0.055054        0.093938      0.057079         0.036673       0.126077
    age             -0.006484  1.000000        -0.086883       -0.186686      0.019523        -0.029768       0.100359
    c_charge_degree  0.055054 -0.086883         1.000000        0.107918      0.125348         0.083599       0.141103
    two_year_recid   0.093938 -0.186686         0.107918        1.000000      0.118340         0.119047       0.293013
    days_in_jail     0.057079  0.019523         0.125348        0.118340      1.000000         0.249814       0.199281
    days_in_custody  0.036673 -0.029768         0.083599        0.119047      0.249814         1.000000       0.120882
    Priors           0.126077  0.100359         0.141103        0.293013      0.199281         0.120882       1.000000
``` 
  As we can see age negatively correlates with recidivism rates, while number of prior offenses positively correlates and so does sex.

* The value of `days_in_jail` is not very high, 18 on average, while
`days_in_custody` is higher, 40 on average, with a maximum of 6035(!). This
seems strange to me; I would rather expect it the other way around.
```python
print(compas["days_in_jail"].describe())
```
```
    count    6907.000000
    mean       18.265238
    std        51.078167
    min         0.000000
    25%         0.000000
    50%         1.000000
    75%         9.000000
    max       799.000000
    Name: days_in_jail, dtype: float64
```
It is also interesting to look at time in jail as a function of charge description.
```python
cd = "c_charge_desc"
charges = compas[cd].unique()
vals = sorted([(compas[compas[cd] == c]["days_in_jail"].mean(), c, compas[compas[cd] == c].shape[0]) for c in charges if compas[compas[cd] == c].shape[0] >= 10])
cs = pd.DataFrame(vals, columns=["Avg. j.t.", "Crime", "N"])
pd.set_option("display.min_rows", 20)
print(cs)
```
```
        Avg. j.t.                           Crime     N
    0    0.846154   Leaving the Scene of Accident    13
    1    0.857143                  DUI - Enhanced    14
    2    0.900000      Leaving Acc/Unattended Veh    10
    3    1.307692     Possession of Hydromorphone    13
    4    1.375000  DUI Level 0.15 Or Minor In Veh    56
    5    1.416667    Lve/Scen/Acc/Veh/Prop/Damage    12
    6    1.901408      DUI Property Damage/Injury    71
    7    2.238095  Poss Contr Subst W/o Prescript    21
    8    2.659574  Felony Driving While Lic Suspd    94
    9    2.708333        Possession Of Alprazolam    48
    ..        ...                             ...   ...
    77  31.309524    Burglary Unoccupied Dwelling    84
    78  33.181818              Aggravated Battery    22
    79  35.307692      Burglary Dwelling Occupied    26
    80  35.738593           arrest case no charge  1052
    81  37.285714              Prowling/Loitering    14
    82  41.923077                  Felony Battery    13
    83  43.047619  Burglary Dwelling Assault/Batt    21
    84  49.239130  Felony Battery w/Prior Convict    46
    85  63.313433              Felony Petit Theft    67
    86  74.846154       Att Burgl Unoccupied Dwel    13
```
* According to Barenstein (2019), the original publishers of this data, ProPublica, made a mistake by keeping too many recidivists; he writes:

"_I estimate that in the two-year general recidivism dataset ProPublica kept over 40% more recidivists than it should have. [...] It obviously has a substantial impact on the recidivism rate; artificially inflating it. [...] Thus, the two-year recidivism rate in ProPublica’s dataset is inflated by over 24%._"

In [11]:
# Load the ProPublica two year recidivism COMPAS data.
in_loc = "data/compas-scores-two-years-raw.csv"
out_loc = "data/compas-large.csv"
compas = pd.read_csv(in_loc)

# Drop irrelevant columns. 
compas = compas.drop(
    labels=['id',
            'first', # First and last name are both combined in 'name'.
            'last',
            'dob', # Contained in the already included 'age' column.
            'age_cat', # Contained in the already included 'age' column.
            'compas_screening_date', # I think the day on which the COMPAS recidivism assessment was done. 
            'screening_date', # Always equal to compas_screening_date.
            'decile_score', # The COMPAS recidivism score. 
            'decile_score.1', # Always equal to decile_score.
            'score_text', # Quantization of decile_score: 1-4:Low, 5-7:Med, 8-10:High.
            'priors_count.1', # Always equal to priors_count.
            'c_case_number', 
            'c_offense_date', 
            'c_arrest_date', 
            'c_days_from_compas',
            'violent_recid', # Is always empty.
            'is_recid', # Overall recidivism variable (at any time).
            'r_case_number', # The details related to the recidivism are irrelevant to us, so we drop the 'r_*' fields.
            'r_charge_degree',
            'r_days_from_arrest', # The number of days between r_offense_date and r_jail_in.
            'r_offense_date', 
            'r_charge_desc',
            'r_jail_in',
            'r_jail_out',
            'is_violent_recid', # Violent recidivism specifically (at any time).
            'vr_case_number', # The details related to the violent recidivism are irrelevant to us, so we drop the 'vr_*' fields.
            'vr_charge_degree', 
            'vr_offense_date',
            'vr_charge_desc', 
            'type_of_assessment', # Always equal to "Risk of Recidivism".
            'v_type_of_assessment', # Always equal to "Risk of Violence".
            'v_decile_score', # The COMPAS violent recidivism score?
            'v_score_text', # Quantization of the v_decile_score into Low, Medium, High.
            'v_screening_date', # I think the date on which the COMPAS violence assessment was done.
            'days_b_screening_arrest', # Not sure what this means...
            'start',  # Not sure what this means...
            'end',    # Not sure what this means...
            'event'], # Not sure what this means...
    axis='columns')

# Instead of listing date of entry and departure from jail
# we use these values to compute a single field 'days_in_jail',
# and then drop the date fields. Rows with null values for these fields are dropped. 
compas = compas.dropna(subset=["c_jail_in", "c_jail_out"])
f = lambda x: int((parse(x.c_jail_out) - parse(x.c_jail_in)).total_seconds() / 86400)
compas["days_in_jail"] = compas.apply(f, axis='columns')
compas = compas.drop(
    labels=['c_jail_in',
            'c_jail_out'], 
    axis='columns')

# Instead of listing date of entry and departure from custody
# we use these values to compute a single field 'days_in_custody',
# and then drop the date fields. Rows with null values for these fields are dropped. 
compas = compas.dropna(subset=["in_custody", "out_custody"])
f = lambda x: int((parse(x.out_custody) - parse(x.in_custody)).total_seconds() / 86400)
compas["days_in_custody"] = compas.apply(f, axis='columns')
compas = compas.drop(
    labels=['in_custody',
            'out_custody'], 
    axis='columns')

# We combine all prior offenses to a single field, by summing the
# 'juv_*_count' fields together with 'priors_count' to a single 
# 'Priors' field. I am not 100% sure that this is not double
# counting priors but in some cases the sum of 'juv_*_count' exceeds
# the value of 'priors_count', which suggests that our new 
# 'Priors' field is not double counting anything. 
f = lambda x: x.juv_fel_count + x.juv_misd_count + x.juv_other_count + x.priors_count
compas["Priors"] = compas.apply(f, axis='columns')
compas = compas.drop(
    labels=['juv_fel_count',
            'juv_misd_count',
            'juv_other_count',
            'priors_count'], 
    axis='columns')

# There are now 7 rows left with null values for 'c_charge_desc'; we drop those. 
compas = compas.dropna(subset=["c_charge_desc"])

# Lastly we identify the Label column.
compas = compas.rename(columns={"two_year_recid" : "Label"})

# Reorder the 'Label' column to the end of the list. 
cols = [col for col in list(compas.columns) if col != 'Label']
compas = compas[cols + ['Label']]

# Rename the remaining columns to CamelCase. 'Priors' and 'Label' already
# have their final names at this point, so they need no entry here.
compas = compas.rename(
    {
        'age' : 'Age',
        'sex' : 'Sex',
        'c_charge_degree' : 'ChargeDegree',
        'days_in_jail' : 'DaysInJail',
        'days_in_custody' : 'DaysInCustody',
    }, 
    axis='columns'
    )

# Drop duplicate rows.
compas = compas.drop_duplicates()

# Write the processed file to a new csv.
compas.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/compas-scores-two-years-raw.csv, wrote output to data/compas-large.csv.


Now we remove some of the columns to create the `compas` dataset. Removing even more columns from the `compas` dataset yields the `compas-small` dataset. Lastly, we change the labels of the `compas-small` dataset according to the decision rule from (2017) Angelino et al. to obtain what we call the `corels` dataset. This `corels` decision rule is given by the following logical formulas:
\begin{align*}
\text{Risk}(x) &\iff \bigvee_{1 \leq i \leq 3} C_i(x), \\
C_1(x) &\iff 18 \leq \text{Age}(x) \leq 20, \\
C_2(x) &\iff 21 \leq \text{Age}(x) \leq 23 \wedge 2 \leq \text{Priors}(x) \leq 3, \\
C_3(x) &\iff 3 < \text{Priors}(x). 
\end{align*}
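
As a minimal sketch, the rule above can be written with boolean masks over the `Age` and `Priors` columns. The function name `corels_label` and the toy rows are hypothetical; the cell below applies the same rule row by row with a lambda.

```python
import pandas as pd


def corels_label(df: pd.DataFrame) -> pd.Series:
    """Return 1 where Risk(x) holds according to C_1, C_2, C_3, else 0."""
    age, priors = df["Age"], df["Priors"]
    c1 = age.between(18, 20)                             # 18 <= Age <= 20
    c2 = age.between(21, 23) & priors.between(2, 3)      # 21 <= Age <= 23 and 2 <= Priors <= 3
    c3 = priors > 3                                      # 3 < Priors
    return (c1 | c2 | c3).astype(int)


# Toy rows for illustration only.
toy = pd.DataFrame({"Age": [19, 22, 22, 40, 40], "Priors": [0, 2, 1, 4, 3]})
toy["Label"] = corels_label(toy)
print(toy)
```

`Series.between` is inclusive on both ends by default, which matches the non-strict inequalities in $C_1$ and $C_2$.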

In [12]:
# Keep only the fields below, which drops 'name', 'race', and 'c_charge_desc'.
noncat_compas = compas[['Sex', 'Age', 'ChargeDegree', 'DaysInJail', 'DaysInCustody', 'Priors', 'Label']]
noncat_compas = noncat_compas.drop_duplicates()

# Write the processed file to a new csv.
out_loc = "data/compas.csv"
noncat_compas.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

# Filter all but the age and priors information from compas. 
small_compas = compas[['Age', 'Priors', 'Label']]
small_compas = small_compas.drop_duplicates()

# Write the processed file to a new csv.
out_loc = "data/compas-small.csv"
small_compas.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

# Re-label the small-compas data according to the classification rule in
# Angelino et al. - Learning certifiably optimal rule lists for categorical data (2017)
corels_compas = small_compas.copy()
f = lambda x: 1 if (18 <= x.Age <= 20) or ((21 <= x.Age <= 23) and (2 <= x.Priors <= 3)) or (x.Priors > 3) else 0
corels_compas["Label"] = corels_compas.apply(f, axis='columns')
corels_compas = corels_compas.drop_duplicates()

# Write the processed file to a new csv.
out_loc = "data/corels.csv"
corels_compas.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/compas-scores-two-years-raw.csv, wrote output to data/compas.csv.
Finished processing data/compas-scores-two-years-raw.csv, wrote output to data/compas-small.csv.
Finished processing data/compas-scores-two-years-raw.csv, wrote output to data/corels.csv.


### Mushrooms edibility

Found at: http://archive.ics.uci.edu/ml/datasets/Mushroom.

Important to note is that this seems to be _hypothetical_ data, a.k.a. synthetic, as explained at the url above:

"_This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525)._"

This data was provided by Jeffrey Schlimmer, who also wrote about it in his dissertation. In his dissertation the impression is given that this data is _not_ synthetic. He writes:

"_Specifically, the task is to assess the edibility of a novel mushroom sample. The available inputs are a series of 3,078 mushroom samples representing 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each sample is described in terms of the 23 observable attributes (listed in Table 5.4) and identified as edible or poisonous by The Audubon Society Field Guide to North American Mushrooms (1981, pp. 500-525)._"

However, there are several discrepancies. The data we have does not
count 3078 samples but 8124. Furthermore he writes that

"_the majority of the mushrooms in this study, including both edible and poisonous species, have no smell._"

In our dataset we have 3528 mushrooms with 'odor = none', which is less than half of the 8124 samples, and of those only 3 percent are poisonous, in contrast to the above statement. Then again, two mushrooms described specifically in Table 5.5 of Schlimmer's dissertation do appear in the dataset exactly as he describes. Sadly I cannot view the source of this data anywhere, and emails to any addresses of the author of this dataset are bounced. What I can see is an image on Amazon for a review of that book, see: https://images-na.ssl-images-amazon.com/images/I/81C9QDjREyL.jpg. Here we see that mushrooms are indeed specified according to the characteristics also listed in the dataset. However, I do not get the impression that more than 3000, let alone more than 8000, mushrooms are specified in that book. The description of the book on Amazon reads:

"_[The book has] more than 700 mushrooms detailed with color photographs and descriptive text._"

Again, this does not add up to 3000, let alone 8000. Furthermore Schlimmer writes that all of the roughly 3000 samples used for his dissertation stem only from pages 500-525 of the field guide. 

A strange fact of this data is that the landmarks for both outcomes force exactly the same number of cases. I would think this is statistically highly unlikely (almost impossible). If this data is indeed synthetic, then the reason for this strange result might be found in the way it was generated, but since it is unknown if and how it was generated we cannot check this. 
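
The odor figures quoted above can be checked with a short helper. This is a minimal sketch, assuming a processed frame with a de-abbreviated `odor` column and a `Label` column in which 1 means poisonous, as produced by the cell below. The helper name `poisonous_share` and the toy rows are hypothetical.

```python
import pandas as pd


def poisonous_share(df: pd.DataFrame, column: str, value: str):
    """Count the rows where column == value and the fraction of them with Label == 1."""
    subset = df[df[column] == value]
    return len(subset), subset["Label"].mean()


# Toy rows for illustration only; they are not taken from the mushroom data.
toy = pd.DataFrame({
    "odor": ["none", "none", "none", "none", "foul", "almond"],
    "Label": [0, 0, 0, 1, 1, 0],
})
n, share = poisonous_share(toy, "odor", "none")
print(n, share)
```

On the processed data the call would be `poisonous_share(shrooms, "odor", "none")`.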

In [13]:
# Load the shrooms data.
in_loc = "data/mushrooms-raw.csv"
out_loc = "data/mushrooms.csv"
shrooms = pd.read_csv(in_loc)

# Rename the Label field.
shrooms = shrooms.rename(columns={"class" : "Label"})
shrooms["Label"] = shrooms["Label"].replace("e", 0).replace("p", 1)

# All the values in this data have been abbreviated to a single letter, 
# presumably because at the time it was first used (1981) 
# computer memory was not as abundant as it is nowadays. For our ease of
# reading we de-abbreviate the values according to the description below.
description = \
"""1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d"""
for line in description.split("\n"):
    c = line.split(":")[0].split(" ")[1]
    vals = dict([s.strip().split("=")[::-1] for s in line.split(":")[1].split(",")])
    shrooms[c] = shrooms[c].replace(vals)
 
# Drop an irrelevant field (it always takes the same value).
shrooms = shrooms.drop(labels=['veil-type'], axis='columns')

# Reorder the 'Label' column to the end of the list. 
cols = [col for col in list(shrooms.columns) if col != 'Label']
shrooms = shrooms[cols + ['Label']]

# Remove duplicate row occurrences. 
shrooms = shrooms.drop_duplicates()

# Write the processed file to a new csv.
shrooms.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/mushrooms-raw.csv, wrote output to data/mushrooms.csv.
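
The de-abbreviation loop in the cell above turns each line of the description into a column name and a mapping from abbreviation to full value. The sketch below unpacks that step for a single line. The function name `parse_description_line` is hypothetical and not part of the notebook.

```python
def parse_description_line(line: str):
    """Split a line like '4. bruises: bruises=t,no=f' into (column, mapping)."""
    name_part, values_part = line.split(":")
    # '4. bruises' -> 'bruises'
    column = name_part.split(" ")[1]
    mapping = {}
    for item in values_part.split(","):
        # Each item reads 'full=abbreviation'; the csv stores the abbreviation.
        full, abbreviation = item.strip().split("=")
        mapping[abbreviation] = full
    return column, mapping


column, mapping = parse_description_line("4. bruises: bruises=t,no=f ")
print(column, mapping)
```

Passing the resulting mapping to `Series.replace` swaps every abbreviation in that column for its full name.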


### University admission

Found at https://www.kaggle.com/mohansacharya/graduate-admissions.

This dataset is quite strange because rather than having a binary score 1 or 0 for admission or rejection, it has a 'chance of admit' field which ranges from 0 to 1 (although in practice, almost all scores are above 0.5). It turns out that this data comes from each individual student in the dataset; its author writes (on Kaggle):

"_To clarify things, chance of admit is a parameter that was asked
 to individuals (some values manually entered) 
 before the results of the application._"

I am skeptical about this because all the values are precise to two decimal places. For instance, if I asked someone how sure they were of getting accepted I would expect a somewhat vague answer like "0.5", "0.8", etc. but not "0.72", or "0.59", etc., but (almost) all values in the data fall in the latter category. In another comment the author of the dataset writes:

"_For a few profiles, I asked the applicants how sure they were of
 getting an admit in terms of percentage. I added an extra decimal
 to increase the accuracy. For the rest of the data, since Regression
 is supervised learning, I gave values that actually were relatable
 and made enough sense. :)_"

To me this suggests the data is at least in large part synthetic. This suspicion is confirmed in further writing:

"_For some profiles, SOP and LOR strength was asked to the applicants
 themselves. A few values were extrapolated using other parameters.
 Most values were entered manually with no specific pattern.
 It was random assignment. Thanks!_"

In order to compare our code to that of Prakken & Ratsma we round all scores to the nearest integer, i.e. up to 1 or down to 0. In my opinion it would have been better to take a higher cut-off score, since almost all scores are above 0.5, but that 
is the way it was done in Prakken & Ratsma. 
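
To make the rounding step concrete, the sketch below compares Python's built-in `round` with an explicit cut-off. The helper name `binarize`, the example scores, and the alternative cut-off of 0.75 are hypothetical. Note that `round` rounds exact ties to the nearest even integer, so a score of exactly 0.5 becomes 0; whether such a score occurs in the data is not checked here.

```python
def binarize(score: float, threshold: float = 0.5) -> int:
    """Return 1 if the score reaches the threshold, else 0."""
    return 1 if score >= threshold else 0


# Example scores for illustration only.
scores = [0.34, 0.5, 0.72, 0.97]

rounded = [round(s) for s in scores]                          # built-in rounding
thresholded = [binarize(s) for s in scores]                   # explicit cut-off at 0.5
higher_cutoff = [binarize(s, threshold=0.75) for s in scores] # a stricter cut-off

print(rounded)
print(thresholded)
print(higher_cutoff)
```

The two approaches only differ on a score of exactly 0.5, but the explicit threshold makes it easy to experiment with a higher cut-off.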

In [14]:
# Load the data.
in_loc = "data/admission-raw.csv"
out_loc = "data/admission.csv"
admissions = pd.read_csv(in_loc)

# Remove the redundant ID field. 
del admissions["Serial No."]

# Rename the Label field and round it to 0 or 1 so that it becomes binary. 
admissions = admissions.rename(columns={"Chance of Admit " : "Label"})
admissions["Label"] = admissions["Label"].apply(round)

# Reorder the 'Label' column to the end of the list. 
cols = [col for col in list(admissions.columns) if col != 'Label']
admissions = admissions[cols + ['Label']]

# Remove duplicate row occurrences. 
admissions = admissions.drop_duplicates()

# Write the processed file to a new csv.
admissions.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/admission-raw.csv, wrote output to data/admission.csv.


### Telecommunications customer churn

Found at: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113 (or possibly the data we use is an older version of this dataset.)

This dataset seems to be entirely synthetic. Its authors write:

"_The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3._"

In [15]:
# Load the data. 
in_loc = "data/telco-raw.csv"
out_loc = "data/churn.csv"
churn = pd.read_csv(in_loc)

# Delete an irrelevant ID field. 
del churn["customerID"]

# Delete rows that have no 'TotalCharges' value. I'm not sure 
# how these are treated in Prakken & Ratsma.
churn = churn[churn["TotalCharges"] != ' ']

# Rename the Label field and make it use numbers rather than words.
churn = churn.rename(columns={"Churn" : "Label"})
churn["Label"] = churn["Label"].replace("Yes", 1).replace("No", 0)

# Reorder the 'Label' column to the end of the list. 
cols = [col for col in list(churn.columns) if col != 'Label']
churn = churn[cols + ['Label']]

# Remove duplicate row occurrences. 
churn = churn.drop_duplicates()

# Write the processed file to a new csv.
churn.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/telco-raw.csv, wrote output to data/churn.csv.
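
The cell above removes rows whose `TotalCharges` value is a single space, which leaves the column as strings in memory. A possible alternative, sketched here on hypothetical toy rows, is to convert the column to numbers and drop whatever fails to parse; this is an assumption about what one might want, not what the cell above does.

```python
import pandas as pd

# Toy rows for illustration only; the blank string mimics a missing value.
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "No", "Yes"],
})

# Values that cannot be parsed as numbers (such as ' ') become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows without a usable TotalCharges value.
toy = toy.dropna(subset=["TotalCharges"])
print(toy)
```

This variant also removes other unparsable entries, so it is slightly broader than the exact comparison with `' '`.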


### Welfare benefit application

Found at: https://github.com/CorSteging/DiscoveringTheRationaleOfDecisions, and see (2021) Steging et al. - _Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning_.

The welfare benefit application data concerns the eligibility of a person for a welfare benefit to cover the expenses of visiting their spouse in the hospital. This data was first published in (1993) T. Bench-Capon and is entirely synthetic, although the features were constructed so as to resemble those found in legal domains. Eligibility is decided on the basis of the following features:

| Feature | Values |
| --- | ----------- |
| Age | 0 to 100 (all integers) |
| Gender | male or female |
| Con_1, ..., Con_5 | true or false |
| Spouse | true or false |
| Absent | true or false |
| Resources | 0 to 10,000 (all integers) |
| Type (Patient type) | in or out |
| Distance (to the hospital) | 0 to 100 (all integers) |

The label, i.e. whether the person is eligible, is then determined on the basis of the following formulas:
\begin{align*}
\text{Eligible}(x) &\iff \bigwedge_{1 \leq i \leq 6} C_i(x), \\
C_1(x) &\iff (\text{Gender}(x) = \text{female} \wedge \text{Age}(x) \geq 60) \vee (\text{Gender}(x) = \text{male} \wedge \text{Age}(x) \geq 65), \\
C_2(x) &\iff |\{ i \in \{1, \ldots, 5\} : \text{Con}_i(x) \}| \geq 4, \\
C_3(x) &\iff \text{Spouse}(x), \\
C_4(x) &\iff \neg\text{Absent}(x), \\
C_5(x) &\iff \neg(\text{Resources}(x) \geq 3000), \\
C_6(x) &\iff (\text{Type}(x) = \text{in} \wedge \text{Distance}(x) < 50) \vee (\text{Type}(x) = \text{out} \wedge \text{Distance}(x) \geq 50).
\end{align*}
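
The formulas above translate directly into a small function. This is a minimal sketch that takes the feature values as plain arguments, with the value encodings taken from the feature table; the function name `eligible` and its parameter names are hypothetical and do not refer to the column names of the csv.

```python
def eligible(age, gender, contributions, spouse, absent, resources, patient_type, distance):
    """Evaluate Eligible(x) as the conjunction of C_1, ..., C_6.

    contributions is a sequence of five booleans for Con_1, ..., Con_5.
    """
    c1 = (gender == "female" and age >= 60) or (gender == "male" and age >= 65)
    c2 = sum(contributions) >= 4          # at least four of the five Con_i hold
    c3 = spouse
    c4 = not absent
    c5 = not (resources >= 3000)          # as written in the formula for C_5
    c6 = (patient_type == "in" and distance < 50) or (patient_type == "out" and distance >= 50)
    return all([c1, c2, c3, c4, c5, c6])


# Illustrative cases: identical except for gender.
paid = [True, True, True, True, False]
print(eligible(62, "female", paid, True, False, 1000, "in", 10))
print(eligible(62, "male", paid, True, False, 1000, "in", 10))
```

The two calls differ only in gender, so only $C_1$ changes: the first satisfies it because the age threshold for women is 60, the second does not because the threshold for men is 65.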

The data in the csv is generated according to a process described in (2021) Steging et al.

The raw dataset also contains 52 noise variables which have no influence on eligibility. These are included in the data because in (2021) Steging et al. the goal is to see whether a neural network can learn the formula above, even in the presence of noise variables. Since the case-based reasoning approach is (as of yet) not able to deal with noise, we remove them here. One possible way of modifying the strategy of (2022) van Woerkom et al. to accommodate noise is to have the logistic regression analysis check whether any coefficient values that are used for the dimension orders are very close to 0, and remove the variables for which this is the case from the analysis. A more sophisticated approach could perhaps make use of statistical significance tests. 

In [16]:
for in_loc in ['data/welfare-A-raw.csv', 'data/welfare-B-raw.csv']:
    out_loc = in_loc.replace("-raw", "")

    # Load the data.
    welfare = pd.read_csv(in_loc)

    # Rename the Label field.
    welfare = welfare.rename(columns={"eligible" : "Label"})

    # Drop the noise columns from the data. 
    welfare = welfare.drop([f"noise_{i}" for i in range(1, 53)], axis='columns')

    # Reorder the 'Label' column to the end of the list. 
    cols = [col for col in list(welfare.columns) if col != 'Label']
    welfare = welfare[cols + ['Label']]

    # Change some columns using "True" and "False" to use 1 and 0 respectively.
    welfare["is_spouse"] = welfare["is_spouse"].astype(int)
    welfare["is_absent"] = welfare["is_absent"].astype(int)
    welfare["Label"] = welfare["Label"].astype(int)

    # Remove duplicate row occurrences. 
    welfare = welfare.drop_duplicates()

    # Write the processed file to a new csv.
    welfare.to_csv(out_loc, index=False)
    print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/welfare-A-raw.csv, wrote output to data/welfare-A.csv.
Finished processing data/welfare-B-raw.csv, wrote output to data/welfare-B.csv.


Next we process the simplified versions of these datasets provided in (2021) Steging et al. which only use conditions $C_1(x)$ and $C_6(x)$. In other words, the labelling is now based on the formulas
\begin{align*}
\text{Eligible}(x) &\iff C_1(x) \wedge C_6(x), \\
C_1(x) &\iff (\text{Gender}(x) = \text{female} \wedge \text{Age}(x) \geq 60) \vee (\text{Gender}(x) = \text{male} \wedge \text{Age}(x) \geq 65), \\
C_6(x) &\iff (\text{Type}(x) = \text{in} \wedge \text{Distance}(x) < 50) \vee (\text{Type}(x) = \text{out} \wedge \text{Distance}(x) \geq 50).
\end{align*}

In [20]:
for in_loc in ['data/welfare-A-simplified-raw.csv', 'data/welfare-B-simplified-raw.csv']:
    out_loc = in_loc.replace("-raw", "")

    # Load the data.
    welfare_simp = pd.read_csv(in_loc, index_col=0)

    # Rename the Label field and make it use numbers instead of true/false.
    welfare_simp = welfare_simp.rename(columns={"eligible" : "Label"})
    welfare_simp["Label"] = welfare_simp["Label"].astype(int)

    # Reorder the 'Label' column to the end of the list. 
    cols = [col for col in list(welfare_simp.columns) if col != 'Label']
    welfare_simp = welfare_simp[cols + ['Label']]

    # Remove duplicate row occurrences. 
    welfare_simp = welfare_simp.drop_duplicates()

    # Write the processed file to a new csv.
    welfare_simp.to_csv(out_loc, index=False)
    print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/welfare-A-simplified-raw.csv, wrote output to data/welfare-A-simplified.csv.
Finished processing data/welfare-B-simplified-raw.csv, wrote output to data/welfare-B-simplified.csv.


### Dutch tort law

Found at: https://github.com/CorSteging/DiscoveringTheRationaleOfDecisions, and see (2021) Steging et al. - _Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning_.

This data is based on articles 6:162 and 6:163 of the Dutch civil code, which describe when a wrongful act is committed and the resulting damages must be repaired; it was published in (2017) Verheij. This 'duty to repair' (dut) can be formalised as a logical formula depending on several features. This relation is visualized as follows. 

<div>
<img src="data/tort.png" width="450"/>
</div>

The relation between the features and the label is determined by the following formulas: 

\begin{align*}
\text{dut}(x) &\iff \bigwedge_{1 \leq i \leq 5} C_i(x), \\
C_1(x) &\iff \text{cau}(x), \\
C_2(x) &\iff \text{ico}(x) \vee \text{ila}(x) \vee \text{ift}(x), \\
C_3(x) &\iff \text{vun}(x) \vee (\text{vst}(x) \wedge \neg\text{jus}(x)) \vee (\text{vrt}(x) \wedge \neg\text{jus}(x)), \\
C_4(x) &\iff \text{dmg}(x), \\
C_5(x) &\iff \neg(\text{vst}(x) \wedge \neg\text{prp}(x)).
\end{align*}

The data itself consists of all possible combinations of the features together with the associated label.
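
As a minimal sketch of how such a dataset can be enumerated, the code below evaluates the formulas above on every combination of the ten features that occur in them. It is an assumption that these ten are the only inputs; the raw csv may contain further columns. The names `FEATURES`, `dut`, and `rows` are hypothetical.

```python
from itertools import product

# The ten features that occur in the formulas for C_1, ..., C_5.
FEATURES = ["cau", "ico", "ila", "ift", "vun", "vst", "vrt", "jus", "dmg", "prp"]


def dut(f: dict) -> bool:
    """Evaluate the duty to repair for a dict mapping feature names to booleans."""
    c1 = f["cau"]
    c2 = f["ico"] or f["ila"] or f["ift"]
    c3 = f["vun"] or (f["vst"] and not f["jus"]) or (f["vrt"] and not f["jus"])
    c4 = f["dmg"]
    c5 = not (f["vst"] and not f["prp"])
    return bool(c1 and c2 and c3 and c4 and c5)


# Enumerate all truth assignments and attach the label to each of them.
rows = []
for values in product([False, True], repeat=len(FEATURES)):
    case = dict(zip(FEATURES, values))
    case["dut"] = dut(case)
    rows.append(case)

print(len(rows))  # one row per truth assignment of the ten features
```

Because every assignment appears exactly once, removing duplicate rows from data built this way would not drop any of them.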

In [19]:
# Load the data. 
in_loc = "data/tort-raw.csv"
out_loc = "data/tort.csv"
tort = pd.read_csv(in_loc)

# Rename the Label field.
tort = tort.rename(columns={"dut" : "Label"})

# Reorder the 'Label' column to the end of the list. 
cols = [col for col in list(tort.columns) if col != 'Label']
tort = tort[cols + ['Label']]

# Remove duplicate row occurrences. 
tort = tort.drop_duplicates()

# Write the processed file to a new csv.
tort.to_csv(out_loc, index=False)
print(f"Finished processing {in_loc}, wrote output to {out_loc}.")

Finished processing data/tort-raw.csv, wrote output to data/tort.csv.
