# Links to dataset information

**UCI**:
- __[adult](http://archive.ics.uci.edu/ml/datasets/Adult)__
- __[annealing](https://archive.ics.uci.edu/ml/datasets/Annealing)__
- __[audiology-std](https://archive.ics.uci.edu/ml/datasets/Audiology+%28Standardized%29)__
- __[bank](https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing)__
- __[bankruptcy](http://archive.ics.uci.edu/ml/datasets/Qualitative_Bankruptcy)__
- __[car](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation)__
- __[chess-krvk](https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29)__
- __[chess-krvkp](http://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29)__
- __[congress-voting](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)__
- __[contrac](https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice)__
- __[credit-approval](http://archive.ics.uci.edu/ml/datasets/Credit+Approval)__
- **unsure about this one**: __[ctg](https://www.kaggle.com/akshat0007/fetalhr)__
- __[cylinder-bands](http://archive.ics.uci.edu/ml/datasets/Cylinder+Bands)__
- __[dermatology](https://archive.ics.uci.edu/ml/datasets/Dermatology)__
- __[german_credit](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29)__
- __[heart-cleveland](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)__
- __[ilpd](http://archive.ics.uci.edu/ml/datasets/ILPD+%28Indian+Liver+Patient+Dataset%29)__
- __[mammo](https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)__
- __[mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom)__
- __[wine](https://archive.ics.uci.edu/ml/datasets/wine)__
- __[wine_qual](https://archive.ics.uci.edu/ml/datasets/Wine+Quality)__

**Others**:
- __[texas](https://www.dshs.texas.gov/thcic/hospitals/UserManual1Q2013.pdf)__
- __[IEEECIS](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203)__


# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from tqdm import tqdm

In [2]:
from src.loader import load_dataset
from src.models import SRR
from src.preprocessing import one_hot_encode_pair, processing_pipeline, bin_features
from src.feature_selection import forward_stepwise_regression
from src.vulnerabilities import *

In [3]:
uci_datasets = ['adult', 'annealing', 'audiology-std', 'bank', 'bankruptcy', 'car',
                'chess-krvk', 'chess-krvkp', 'congress-voting', 'contrac', 'credit-approval',
                'ctg', 'cylinder-bands', 'dermatology', 'german_credit', 'heart-cleveland',
                'ilpd', 'mammo', 'mushroom', 'wine', 'wine_qual']

all_datasets = uci_datasets + ['texas', 'ieeecis']

In [4]:
# Datasets evaluated in this run
important = ['bankruptcy', 'german_credit', 'ieeecis']

# One row per dataset: [name, train acc, train baseline, test acc, test baseline]
l = []
for dataset in important:

    X, y = load_dataset(name=dataset)

    # For ieeecis, draw a balanced subsample: 1500 rows from each class
    if dataset == 'ieeecis':
        X = pd.concat([X[y == 1].sample(n=1500, random_state=15),
                       X[y == 0].sample(n=1500, random_state=15)])
        y = y.loc[X.index]

    X_train, X_test, y_train, y_test = processing_pipeline(X, y, nbins=3)

    srr = SRR(k=3, M=5)
    srr.fit(X_train, y_train)

    # accuracy_score expects (y_true, y_pred)
    train_acc = accuracy_score(y_train, srr.predict(X_train))
    # Baseline: accuracy of always predicting the majority class (binary 0/1 labels)
    train_base = y_train.mean()
    train_base = max(train_base, 1 - train_base)

    test_acc = accuracy_score(y_test, srr.predict(X_test))
    test_base = y_test.mean()
    test_base = max(test_base, 1 - test_base)

    l.append([dataset, train_acc, train_base, test_acc, test_base])

Loading bankruptcy...
Loading german_credit...
Loading ieeecis...
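
The baseline in the loop above is the accuracy of always predicting the majority class. For binary 0/1 labels this equals `max(p, 1 - p)`, where `p` is the fraction of positives. The sketch below checks that equivalence on toy labels; `majority_baseline` and `y_toy` are hypothetical names introduced for illustration and are not part of `src`.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy binary labels (illustrative only, not taken from any dataset above)
y_toy = [1, 1, 1, 0, 0]

p = sum(y_toy) / len(y_toy)       # fraction of positives, like y.mean()
baseline_binary = max(p, 1 - p)   # the form used in the evaluation loop
baseline_general = majority_baseline(y_toy)

print(baseline_binary, baseline_general)
```

The counting version also works for multi-class labels, where the `max(p, 1 - p)` shortcut does not apply.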


In [12]:
df = pd.DataFrame(l, columns=['dataset', 'training accuracy', 'training baseline', 'test accuracy', 'test baseline'])
pd.concat([df[['dataset']], df.drop(columns='dataset') * 100], axis=1)

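|   | dataset | training accuracy | training baseline | test accuracy | test baseline |
|--:|:--------|------------------:|------------------:|--------------:|--------------:|
| 0 | bankruptcy | 100.0 | 57.333333 | 100.0 | 56.0 |
| 1 | german_credit | 74.222222 | 70.0 | 76.0 | 70.0 |
| 2 | ieeecis | 69.518519 | 50.0 | 66.666667 | 50.0 |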


|    dataset    | training accuracy (baseline) | test accuracy (baseline) |
|:-------------:|:----------------------------:|:------------------------:|
|  bankruptcy   | 100% (57.3%) | 100% (56%) |
| german_credit | 74.2% (70%)  | 76% (70%)  |
|   IEEE-CIS    | 69.5% (50%)  | 66.7% (50%) |
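
A summary table in this "accuracy (baseline)" form can be generated from the percentage values rather than typed by hand, which avoids rounding slips. This is a minimal sketch: the helpers `pct` and `acc_with_baseline` are hypothetical, and the numbers are copied from the DataFrame output above.

```python
def pct(value):
    """Format a percentage with at most one decimal place, e.g. 57.333 -> '57.3%'."""
    return f"{round(value, 1):g}%"


def acc_with_baseline(acc, base):
    """Combine an accuracy and its baseline into one table cell."""
    return f"{pct(acc)} ({pct(base)})"


# (dataset, train acc, train baseline, test acc, test baseline), all in percent
rows = [
    ("bankruptcy", 100.0, 57.333333, 100.0, 56.0),
    ("german_credit", 74.222222, 70.0, 76.0, 70.0),
    ("ieeecis", 69.518519, 50.0, 66.666667, 50.0),
]

print("| dataset | training accuracy (baseline) | test accuracy (baseline) |")
print("|:---:|:---:|:---:|")
for name, tr, tr_base, te, te_base in rows:
    print(f"| {name} | {acc_with_baseline(tr, tr_base)} | {acc_with_baseline(te, te_base)} |")
```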