# Comparing Classifiers
This notebook compares several classification methods on several datasets and evaluates them in terms of the area under the ROC curve (roc-auc).

__Note:__ There is a similar [notebook for regression datasets](Comparing_Regressors.ipynb).

In [1]:
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import QuantileTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

try:
    # Model Trees are installed / on the path
    from modeltrees import ModelTreeClassifier
except ImportError:
    # Assume project structure
    import sys
    sys.path.append("..")
    from modeltrees import ModelTreeClassifier
from modeltrees.criteria import CrossEntropySplitCriterion

import pandas as pd
import numpy as np

# Downloading and accessing files
import shutil
import urllib.request
import zipfile
from urllib.parse import urlparse
from pathlib import Path
import os

# Specific Data Formats
from scipy.io.arff import loadarff


## 1. Datasets
In this section, all datasets for the comparison are defined. Missing datasets are downloaded automatically.

See [Section 3.3](#characteristics) for a list of dataset characteristics.

### 1.1 Downloading Datasets
We do not ship datasets with this repository, but the notebook will automatically try to download missing data.

In [2]:
data_path = "./data"

def copy_from_zip(zip_file, path_in_zip, dest_path):
    # Open Zip File
    with zipfile.ZipFile(zip_file) as zf:
        # open the destination path for writing
        with open(dest_path, "wb") as f:
            f.write(zf.read(path_in_zip)) 
            
            
def get_dataset_file_path(dataset_id, url, path_in_zip=None, zip_file=None, file=None):
    # Two Cases:
    #      (a) Online Source has the zipped file
    #      (b) Online Source has the plain file
    is_zipped = (path_in_zip is not None)
    
    # If no file is specified, take the name from the url
    if file is None:
        if is_zipped:
            file = os.path.basename(path_in_zip)
        else:
            file = urlparse(url)
            file = os.path.basename(file.path)
     
    # Create path to local file
    path = Path(data_path, dataset_id, file)
    
    if not path.exists():
        # Create missing folders
        os.makedirs(path.parent, exist_ok=True)
        
        if is_zipped:
            # Download zip file (if missing)
            path_to_zip = get_dataset_file_path(dataset_id, url, file=zip_file)
            
            # Extract file
            copy_from_zip(path_to_zip, path_in_zip, path)
        else:
            # Download missing file
            with urllib.request.urlopen(url) as response:
                with open(path, "wb") as outputFile:
                    shutil.copyfileobj(response, outputFile)
    
    return path
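
The two building blocks of the download helper can be tried offline. The following is a minimal sketch, not part of the original notebook: it repeats `copy_from_zip`, builds a small archive in a temporary directory instead of downloading one, and uses a hypothetical helper `file_name_from_url` to mirror how a default file name is derived from a URL.

```python
import os
import tempfile
import zipfile
from pathlib import Path
from urllib.parse import urlparse


def copy_from_zip(zip_file, path_in_zip, dest_path):
    # Same logic as the helper above: copy one archive member to dest_path
    with zipfile.ZipFile(zip_file) as zf:
        with open(dest_path, "wb") as f:
            f.write(zf.read(path_in_zip))


def file_name_from_url(url):
    # Hypothetical helper: mirrors the default file name derivation
    return os.path.basename(urlparse(url).path)


with tempfile.TemporaryDirectory() as tmp:
    # Build a toy archive locally (stand-in for the downloaded data.zip)
    zip_path = Path(tmp, "data.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("1year.arff", "@relation toy\n")

    # Extract a single member, as done for zipped sources
    dest = Path(tmp, "1year.arff")
    copy_from_zip(zip_path, "1year.arff", dest)
    extracted = dest.read_text()

zip_name = file_name_from_url(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00365/data.zip")
xls_name = file_name_from_url(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/"
    "default%20of%20credit%20card%20clients.xls")

print(extracted.strip())  # @relation toy
print(zip_name)           # data.zip
print(xls_name)           # default%20of%20credit%20card%20clients.xls
```

Note that the name derived from a URL keeps percent-escapes such as `%20`. Passing `file=` explicitly gives a cleaner local file name.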


### 1.2 Dataset Definitions
The following cells define the datasets, each with its download URL and, where needed, some preprocessing steps.

In [3]:
def fetch_bankruptcy():
    ds_name = "bankruptcy"
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00365/data.zip"

    # Load arff
    X = []
    y = []
    for year in range(1, 6):
        path = get_dataset_file_path(ds_name, url, path_in_zip=f"{year}year.arff")

        D, _ = loadarff(path)
        y.append( D["class"].astype(np.int8) )
        X.append( np.asarray([list(row) for row in D[list(D.dtype.names)[:-1]]], dtype=float) )

    X = np.concatenate(X, axis=0)
    y = np.concatenate(y, axis=0)
    
    return X, y, {'ref':'https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data'}

In [4]:
def fetch_creditcard():
    ds_name = "credit_card"
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"

    # Load xls
    path = get_dataset_file_path(ds_name, url, file="creditcard.xls")
    df = pd.read_excel(path, header=0, skiprows=[1], usecols=range(1,25))
    
    y = df["Y"].values
    df.drop(columns="Y", inplace=True)
    X = df.values
    
    return X, y, {'ref':'https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients'}

### 1.3 Iterating over all Datasets
This gives a generator that iterates over all datasets.  
Each dataset is a triple consisting of 
- Features Matrix `X`, 
- Target Vector `y`, and 
- An attribute dictionary that contains meta information like the name of the dataset or a reference url
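
The triple contract can be illustrated with a toy dataset. This is only a sketch: `fetch_toy` and `get_toy_datasets` are hypothetical names, and plain nested lists stand in for the NumPy arrays used in this notebook.

```python
def fetch_toy():
    # Hypothetical dataset: four samples, two features, binary target
    X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
    y = [0, 1, 0, 1]
    # The attribute dictionary may be empty; a 'ref' entry is optional
    return X, y, {}


def get_toy_datasets():
    # The generator adds the display name before yielding the triple
    X, y, attr = fetch_toy()
    attr["name"] = "Toy"
    yield (X, y, attr)


# Consume the generator the same way the evaluation loop does
summary = [(attr["name"], len(X), len(X[0])) for X, y, attr in get_toy_datasets()]
print(summary)  # [('Toy', 4, 2)]
```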

In [5]:
def get_datasets():
    # Using generators instead of lists for memory efficiency reasons.
    
    # Dataset 1: Bankruptcy
    X, y, attr = fetch_bankruptcy()
    attr['name'] = 'Bankruptcy' 
    yield (X, y, attr)
    
    # Dataset 2: Credit Cards
    X, y, attr = fetch_creditcard()
    attr['name'] = 'Credit Card' 
    yield (X, y, attr)

## 2. Classifiers
We are comparing the following classifiers:
- Logistic Regression
- Decision Trees with maximal depth 3 and 6
- Model Trees with maximal depth 1 and 3. We compare three split criteria:
    - Cross Entropy
    - Plain Gradients
    - Renormalized Gradients

In [6]:
def get_classifiers():
    entropy = CrossEntropySplitCriterion()
    return [
        (LogisticRegression(solver="liblinear"), "Log. Reg."),
        (DecisionTreeClassifier(max_depth=3, random_state=12), "Decision Tree"),
        (DecisionTreeClassifier(max_depth=6, random_state=12), "Decision Tree"),
        (ModelTreeClassifier(max_depth=1, criterion=entropy), "Model Tree (Entropy)"),
        (ModelTreeClassifier(max_depth=3, criterion=entropy), "Model Tree (Entropy)"),
        (ModelTreeClassifier(max_depth=1), "Model Tree (Gradient)"),
        (ModelTreeClassifier(max_depth=3), "Model Tree (Gradient)"),
        (ModelTreeClassifier(max_depth=1, criterion="gradient-renorm-z"), "Model Tree (Renorm. Gradient)"),
        (ModelTreeClassifier(max_depth=3, criterion="gradient-renorm-z"), "Model Tree (Renorm. Gradient)")
    ]

## 3. Comparison
### 3.1 Parameters

In [7]:
# Cross Validation: Number of Folds
n_fold = 5

seed = 42   # We suggest trying other values to get a feeling for the stability

### 3.2 Evaluation
Iterating over datasets and classifiers and evaluating the classifiers in terms of the roc-auc metric.
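
The scorer in the cell below is built on `roc_auc_score`. As a reminder of what this metric measures, here is a minimal pure-Python sketch of the pairwise definition: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. This is an illustration of the definition on made-up scores, not how scikit-learn computes the value internally.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("roc-auc needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(f"{auc * 100:.2f}")  # 75.00
```

A value of 50 percent corresponds to random ranking and 100 percent to a perfect ranking, which is the scale used in the results table.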

In [8]:
# Create a DataFrame for results (see 3.4)
# Multi-Index for better readability
col_index = pd.MultiIndex(levels=[[], []],
                          codes=[[], []],
                          names=['Method', 'Depth'])
results = pd.DataFrame(columns=col_index)

# Create a DataFrame for the Dataset Characteristics (see 3.3)
ds_characteristics = pd.DataFrame(columns=("#Samples", "#Features", "Reference"))

# Create a scorer function
scorer = make_scorer(roc_auc_score, needs_proba=True)

# Iterate over Datasets
for X, y, attr in get_datasets():
    ds_name = attr['name']
    
    # Store dataset  characteristics
    n_samples, n_features = X.shape
    ds_characteristics.loc[ds_name, "#Samples"] = n_samples
    ds_characteristics.loc[ds_name, "#Features"] = n_features
    
    if "ref" in attr:
        ds_characteristics.loc[ds_name, "Reference"] = attr["ref"]
    else:
        ds_characteristics.loc[ds_name, "Reference"] = None
    
    # Iterate over classifiers
    for model, m_name in get_classifiers():
        # Use the same seed for comparing different classifiers
        kfold = KFold(n_splits=n_fold, shuffle=True, random_state=seed)
        
        # Build processing pipeline
        model_pipe = Pipeline([('Normalize', QuantileTransformer()), ('Impute', SimpleImputer()), ('Predict', model)])
        
        # Evaluate over the folds
        scores = cross_val_score(model_pipe, X, y, scoring=scorer, cv=kfold)
        
        # Compute statistics from list of scores
        mean_score = np.mean(scores)
        std_score = np.std(scores)
        
        # Create result cell
        cell_text = f"{mean_score*100:.2f} ± {std_score*100:.2f}" 
        
        # Multi-Indexing
        if hasattr(model, 'max_depth'):
            col_idx = (m_name, f"{model.max_depth}")
        else:
            col_idx = (m_name, '-')
            
        results.loc[ds_name, col_idx] = cell_text

### 3.3 Dataset Characteristics <a id='characteristics'></a>

In [9]:
ds_characteristics["#Samples"] = ds_characteristics["#Samples"].astype(dtype=int)
ds_characteristics["#Features"] = ds_characteristics["#Features"].astype(dtype=int)

def format_link(val):
    # Handle Empty references
    if val is None:
        return ''
    
    # Format link
    return '<a target="_blank" href="{}">Link</a>'.format(val)

ds_characteristics.style.format({'Reference': format_link})

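| | #Samples | #Features | Reference |
|---|---|---|---|
| Bankruptcy | 43405 | 64 | [Link](https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data) |
| Credit Card | 30000 | 23 | [Link](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) |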


### 3.4 Results
The classifiers are evaluated in terms of the roc-auc metric (area under the ROC curve).  
The following results are given in percent. The uncertainty is the standard deviation of the roc-auc score across the cross-validation folds.
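
Each table cell is built from the per-fold scores as mean ± standard deviation. The sketch below reproduces that formatting with the standard library on made-up fold scores. It assumes the population standard deviation (`pstdev`), which matches the default behaviour of `np.std` used in the evaluation loop. The helper name `format_scores` is hypothetical.

```python
from statistics import fmean, pstdev


def format_scores(scores):
    # Mean and population standard deviation over the folds, in percent
    mean_score = fmean(scores)
    std_score = pstdev(scores)
    return f"{mean_score * 100:.2f} ± {std_score * 100:.2f}"


# Made-up roc-auc scores for five folds (not results from this notebook)
fold_scores = [0.80, 0.82, 0.81, 0.79, 0.83]
print(format_scores(fold_scores))  # 81.00 ± 1.41
```

With only five folds the standard deviation is itself a rough estimate, so small differences between methods should be read with care.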

In [10]:
results

| Method | Log. Reg. | Decision Tree | Decision Tree | Model Tree (Entropy) | Model Tree (Entropy) | Model Tree (Gradient) | Model Tree (Gradient) | Model Tree (Renorm. Gradient) | Model Tree (Renorm. Gradient) |
|---|---|---|---|---|---|---|---|---|---|
| Depth | - | 3 | 6 | 1 | 3 | 1 | 3 | 1 | 3 |
| Bankruptcy | 81.07 ± 0.76 | 75.65 ± 1.15 | 84.27 ± 1.80 | 81.07 ± 0.76 | 81.04 ± 0.74 | 88.76 ± 0.80 | 91.59 ± 0.40 | 88.74 ± 0.82 | 91.22 ± 0.61 |
| Credit Card | 73.43 ± 0.45 | 73.28 ± 0.63 | 75.42 ± 0.49 | 73.40 ± 0.48 | 73.40 ± 0.50 | 76.25 ± 0.34 | 76.86 ± 0.56 | 76.73 ± 0.49 | 77.32 ± 0.39 |
