# Links to dataset information

**UCI**:
- __[adult](http://archive.ics.uci.edu/ml/datasets/Adult)__
- __[annealing](https://archive.ics.uci.edu/ml/datasets/Annealing)__
- __[audiology-std](https://archive.ics.uci.edu/ml/datasets/Audiology+%28Standardized%29)__
- __[bank](https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing)__
- __[bankruptcy](http://archive.ics.uci.edu/ml/datasets/Qualitative_Bankruptcy)__
- __[car](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)__
- __[chess-krvk](https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29)__
- __[chess-krvkp](http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29)__
- __[congress-voting](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)__
- __[contrac](https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice)__
- __[credit-approval](http://archive.ics.uci.edu/ml/datasets/Credit+Approval)__
- **unsure about this one**: __[ctg](https://www.kaggle.com/akshat0007/fetalhr)__
- __[cylinder-bands](http://archive.ics.uci.edu/ml/datasets/Cylinder+Bands)__
- __[dermatology](https://archive.ics.uci.edu/ml/datasets/Dermatology)__
- __[german_credit](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29)__
- __[heart-cleveland](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)__
- __[ilpd](http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29)__
- __[mammo](https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)__
- __[mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom)__
- __[wine](https://archive.ics.uci.edu/ml/datasets/wine)__
- __[wine_qual](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)__

**Others**:
- __[texas](https://www.dshs.texas.gov/thcic/hospitals/UserManual1Q2013.pdf)__
- __[IEEECIS](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203)__


# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from tqdm import tqdm

In [2]:
from src.loader import load_dataset
from src.models import SRR
from src.preprocessing import processing_pipeline
from src.feature_selection import forward_stepwise_regression
from src.vulnerabilities import *

In [3]:
uci_datasets = ['adult', 'annealing', 'audiology-std', 'bank', 'bankruptcy', 'car',
                'chess-krvk', 'chess-krvkp', 'congress-voting', 'contrac', 'credit-approval',
                'ctg', 'cylinder-bands', 'dermatology', 'german_credit', 'heart-cleveland',
                'ilpd', 'mammo', 'mushroom', 'wine', 'wine_qual']

all_datasets = uci_datasets + ['texas', 'ieeecis']

# Monotonicity

## Trying different hyper-parameters

In [4]:
res = []

### german_credit

In [5]:
dataset = 'german_credit'
print(f"-> {dataset} dataset")
# Load the data
X, y = load_dataset(name=dataset)

-> german_credit dataset
Loading german_credit...


In [6]:
for nbins in [3, 4, 5, 6, 7]:
    # Apply the processing pipeline
    X_train, X_test, y_train, y_test = processing_pipeline(X, y, nbins=nbins)
    
    for k in [3, 5]:
        for M in [3, 5, 10]:
            # Construct and train Select-Regress-Round model
            srr = SRR(k=k, M=M)
            srr.fit(X_train, y_train)
            
            monotonic = binned_features_pass_monotonicity(srr, X_train, y_train)
            res.append(['german_credit', k, M, nbins, int(monotonic)])
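
The helper `binned_features_pass_monotonicity` is imported from `src.vulnerabilities`, and its implementation is not shown in this notebook. As a rough illustration only, the sketch below assumes such a check asks whether the weights a model assigns to the ordered bins of one feature move in a single direction. The function name `weights_are_monotonic` and the dict-of-intervals input layout are hypothetical and may differ from what the project actually does.

```python
import pandas as pd


def weights_are_monotonic(bin_weights):
    """Return True if the weights never change direction across ordered bins.

    `bin_weights` maps pd.Interval bins to weights (hypothetical layout,
    not the real SRR model structure).
    """
    # Order the bins by their left edge, then compare consecutive weights
    ordered = [w for _, w in sorted(bin_weights.items(), key=lambda kv: kv[0].left)]
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


increasing = {
    pd.Interval(float('-inf'), 20): -2,
    pd.Interval(20, 40): 0,
    pd.Interval(40, float('inf')): 3,
}
bumpy = {
    pd.Interval(float('-inf'), 20): 1,
    pd.Interval(20, 40): -1,
    pd.Interval(40, float('inf')): 2,
}

print(weights_are_monotonic(increasing))
print(weights_are_monotonic(bumpy))
```

Under this assumption, a feature whose middle bin gets a lower weight than both neighbours would fail the check, which is one plausible reason finer binning (larger `nbins`) could fail more often.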


### IEEE-CIS

In [7]:
dataset = 'ieeecis'
print(f"-> {dataset} dataset")
# Load the data
X, y = load_dataset(name=dataset)

# The full dataset is too large, so draw a class-balanced subset of 1,500 samples per class
X_subset = pd.concat([
    X[y == 1].sample(n=1500, random_state=15),
    X[y == 0].sample(n=1500, random_state=15)
])
y_subset = y.loc[X_subset.index]

del X
del y

-> ieeecis dataset
Loading ieeecis...
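
The class-balanced subsampling used for `ieeecis` can be wrapped in a small helper. This is a minimal sketch on toy data, assuming a unique index shared by `X` and `y`; the name `balanced_subset` is hypothetical and not part of `src`.

```python
import pandas as pd


def balanced_subset(X, y, n_per_class, seed):
    """Sample `n_per_class` rows from every class label in `y`."""
    parts = [X[y == label].sample(n=n_per_class, random_state=seed)
             for label in sorted(y.unique())]
    X_sub = pd.concat(parts)
    # Align the labels with the sampled rows through the shared index
    return X_sub, y.loc[X_sub.index]


# Toy data: 5 positive and 15 negative rows
X_toy = pd.DataFrame({'amount': range(20)})
y_toy = pd.Series([1] * 5 + [0] * 15)

X_sub, y_sub = balanced_subset(X_toy, y_toy, n_per_class=3, seed=15)
print(y_sub.value_counts().to_dict())
```

Fixing `random_state` keeps the subset reproducible across runs, as in the cell above.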


In [8]:
for nbins in [3, 4, 5, 6, 7]:
    # Apply the processing pipeline
    X_train, X_test, y_train, y_test = processing_pipeline(X_subset, y_subset, nbins=nbins)
    
    for k in [3, 5]:
        for M in [3, 5, 10]:
            # Construct and train Select-Regress-Round model
            srr = SRR(k=k, M=M)
            srr.fit(X_train, y_train)
            
            monotonic = binned_features_pass_monotonicity(srr, X_train, y_train)
            res.append(['ieeecis', k, M, nbins, int(monotonic)])


### bankruptcy

In [9]:
dataset = 'bankruptcy'
print(f"-> {dataset} dataset")
# Load the data
X, y = load_dataset(name=dataset)

-> bankruptcy dataset
Loading bankruptcy...


In [10]:
mapping = {'N': pd.Interval(left=float('-inf'), right=-1),
           'A': pd.Interval(left=           -1, right=1),
           'P': pd.Interval(left=            1, right=float('inf'))}
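
The mapping above replaces the three categorical labels with `pd.Interval` objects. A plausible reason, stated here as an assumption, is that intervals give the labels an explicit order (N below A below P) that a bin-based monotonicity check can rely on. The toy example below shows the conversion and the order recovered from the left edges; it uses `Series.map` on made-up values rather than the notebook's `replace` call on the real data.

```python
import pandas as pd

mapping = {'N': pd.Interval(left=float('-inf'), right=-1),
           'A': pd.Interval(left=-1, right=1),
           'P': pd.Interval(left=1, right=float('inf'))}

# Toy column of labels (made up for illustration)
toy = pd.Series(['P', 'N', 'A', 'N'])
as_intervals = toy.map(mapping)

# Sorting the labels by the left edge of their interval recovers the order
ordered_labels = sorted(mapping, key=lambda label: mapping[label].left)
print(ordered_labels)
```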

In [11]:
# Apply the processing pipeline
X_train, X_test, y_train, y_test = processing_pipeline(X, y, nbins=3)
X_train.replace(mapping, inplace=True)

for k in [3, 5]:
    for M in [3, 5, 10]:
        # Construct and train Select-Regress-Round model
        srr = SRR(k=k, M=M)
        srr.fit(X_train, y_train)

        monotonic = binned_features_pass_monotonicity(srr, X_train, y_train)
        res.append(['bankruptcy', k, M, '-', int(monotonic)])


### Result

In [28]:
df = pd.DataFrame(res, columns=['dataset', 'k', 'M', 'nbins', '% monotonic'])
df['% monotonic'] *= 100
df.groupby(['dataset', 'nbins']).agg({'% monotonic': 'mean'}).round(1).T

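| dataset | bankruptcy | german_credit | german_credit | german_credit | german_credit | german_credit | ieeecis | ieeecis | ieeecis | ieeecis | ieeecis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| nbins | - | 3 | 4 | 5 | 6 | 7 | 3 | 4 | 5 | 6 | 7 |
| % monotonic | 100.0 | 100.0 | 100.0 | 0.0 | 100.0 | 100.0 | 100.0 | 50.0 | 33.3 | 50.0 | 16.7 |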
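
Each percentage is a mean over the (`k`, `M`) configurations trained for that dataset and `nbins` value: 2 values of `k` times 3 values of `M`, so six runs per cell. A value of 16.7 therefore corresponds to one passing run out of six, 33.3 to two, and 50.0 to three. The toy example below reproduces that arithmetic with made-up rows in which only one configuration passes; the dataset name `toy` and the column name `monotonic` are illustrative.

```python
from itertools import product

import pandas as pd

# Made-up results: only the (k=3, M=3) configuration passes the check
rows = [['toy', k, M, 7, int((k, M) == (3, 3))]
        for k, M in product([3, 5], [3, 5, 10])]
df_toy = pd.DataFrame(rows, columns=['dataset', 'k', 'M', 'nbins', 'monotonic'])

# Mean of the 0/1 flags per (dataset, nbins), expressed as a percentage
pct = (100 * df_toy.groupby(['dataset', 'nbins'])['monotonic'].mean()).round(1)
print(pct.loc[('toy', 7)])
```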


## Repeating same training on many data splits

In [32]:
X, y = load_dataset('german_credit')

passed = 0
n_tests = 100

for nfold in tqdm(range(n_tests)):
    
    X_train, X_test, y_train, y_test = processing_pipeline(X, y, seed=nfold, nbins=3)

    srr = SRR(k=3, M=5)
    srr.fit(X_train, y_train)
    
    passed += int(binned_features_pass_monotonicity(srr, X_train, y_train))

print("{:.1f} % passed monotonicity check".format(100 * passed / n_tests))

Loading german_credit...

100.0 % passed monotonicity check
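
The counting loop in this section has the same shape for every dataset, so it could be factored into a helper. The sketch below is one possible arrangement, not the notebook's code: `pass_rate` and `run_once` are hypothetical names, and the stub lambda stands in for the real sequence of `processing_pipeline`, `SRR.fit` and `binned_features_pass_monotonicity` on one split.

```python
def pass_rate(run_once, n_tests=100):
    """Percentage of seeds in range(n_tests) for which `run_once(seed)` is truthy."""
    passed = sum(int(bool(run_once(seed))) for seed in range(n_tests))
    return 100 * passed / n_tests


# Stub check that fails for exactly one seed, standing in for a real training run
rate = pass_rate(lambda seed: seed != 42, n_tests=100)
print("{:.1f} % passed monotonicity check".format(rate))
```

Passing the seed into the callable keeps the split reproducible, matching the `seed=nfold` argument used in the cells of this section.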





In [33]:
X, y = load_dataset('bankruptcy')

mapping = {'N': pd.Interval(left=float('-inf'), right=-1),
           'A': pd.Interval(left=           -1, right=1),
           'P': pd.Interval(left=            1, right=float('inf'))}

passed = 0
n_tests = 100

for nfold in tqdm(range(n_tests)):
    
    X_train, X_test, y_train, y_test = processing_pipeline(X, y, seed=nfold, nbins=3)
    
    X_train.replace(mapping, inplace=True)

    srr = SRR(k=3, M=5)
    srr.fit(X_train, y_train)
    
    passed += int(binned_features_pass_monotonicity(srr, X_train, y_train))

print("{:.1f} % passed monotonicity check".format(100 * passed / n_tests))

Loading bankruptcy...

99.0 % passed monotonicity check





In [34]:
dataset = 'ieeecis'
print(f"-> {dataset} dataset")
# Load the data
X, y = load_dataset(name=dataset)

# The full dataset is too large, so draw a class-balanced subset of 1,500 samples per class
X_subset = pd.concat([
    X[y == 1].sample(n=1500, random_state=15),
    X[y == 0].sample(n=1500, random_state=15)
])
y_subset = y.loc[X_subset.index]

del X
del y

passed = 0
n_tests = 100

for nfold in tqdm(range(n_tests)):
    
    X_train, X_test, y_train, y_test = processing_pipeline(X_subset, y_subset, seed=nfold, nbins=3)
    

    srr = SRR(k=3, M=5)
    srr.fit(X_train, y_train)
    
    passed += int(binned_features_pass_monotonicity(srr, X_train, y_train))

print("{:.1f} % passed monotonicity check".format(100 * passed / n_tests))

-> ieeecis dataset
Loading ieeecis...



100.0 % passed monotonicity check



