# AIPI 590 - XAI | Assignment #4

## Description

imodels is a Python interpretability library that supports many decision rule set and rule list algorithms (https://github.com/csinva/imodels?tab=readme-ov-file).



From the list of supported models available in the README (link above), choose three algorithms to demo on a dataset of your choice.



In addition to a demonstration for each of your chosen algorithms, you should provide an explanation of the method via a visual. This visual can be a block diagram, a slide, or anything of your choosing. It should represent the method in a way that your fellow students would be able to understand the algorithm visually.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/suneel-nadipalli/xai-assignments-duke-fall24/blob/main/Assignment%203/XAI_Assignment_3_Interpretable_Models.ipynb)

## Suneel Nadipalli

## Importing Libraries

In [1]:
!pip install imodels --quiet

In [2]:
!pip install --upgrade scikit-learn --quiet

In [3]:
%load_ext autoreload
%autoreload 2

# Data Science Libraries
import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

# Scikit Learn Libraries
from sklearn.model_selection import train_test_split

from sklearn.tree import plot_tree, DecisionTreeClassifier

from sklearn import metrics

from sklearn.datasets import fetch_openml

# iModels Libraries
import imodels

from imodels import OneRClassifier, HSTreeClassifierCV, RuleFitRegressor

from imodels.discretization import ExtraBasicDiscretizer

## Helper Functions

In [4]:
# function to download the dataset from OpenML

def get_dataset(dataset_name, version='active'):
    """
    Downloads a dataset from OpenML.

    Args:
      dataset_name: The name of the dataset to download.
      version: The version of the dataset to download (optional).

    Returns:
      A tuple (dataset, dataset.frame): the OpenML dataset object (a Bunch)
      and the same data as a pandas DataFrame.
    """
    try:
        dataset = fetch_openml(name=dataset_name, as_frame=True, parser='auto', version=version)
    except TypeError:
        # older scikit-learn releases do not accept the `parser` argument
        dataset = fetch_openml(name=dataset_name, as_frame=True, version=version)

    return dataset, dataset.frame



In [5]:
# function to split dataset for modeling

def get_features(dataset, drop_cols):

    """
    Extracts features from a dataset, drops specified columns,
    and splits the data into training and testing sets.

    Args:
        dataset: The dataset containing features and target variable.
        drop_cols: A list of column names to drop from the dataset.

    Returns:
        A tuple containing the training features, testing features,
        training target, testing target, and feature names.
    """

    ds_target = dataset['target'].values

    if len(drop_cols) > 0:
        ds_data_numeric = dataset['data'].select_dtypes('number').drop(columns=drop_cols).dropna(axis=1)
    else:
        ds_data_numeric = dataset['data'].select_dtypes('number').dropna(axis=1)

    ds_feature_names = ds_data_numeric.columns.values

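    # NOTE: test_size=0.75 holds out 75% of the rows for testing, so only 25%
    # is used for training; no random_state is set, so the split changes per run
    X_train, X_test, y_train, y_test = train_test_split(
        ds_data_numeric.values, ds_target, test_size=0.75)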

    return X_train, X_test, y_train, y_test, ds_feature_names
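
The effect of `test_size=0.75` on the split sizes can be sketched without loading any data. The helper `split_sizes` and the 1000-row table below are hypothetical, and the rounding rule (test count rounded up, remainder used for training) is an assumption about how scikit-learn sizes a split when only a fractional `test_size` is given.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and every remaining row
    goes to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# hypothetical table with 1000 rows, split the same way as get_features
n_train, n_test = split_sizes(1000, test_size=0.75)
print(f"train rows: {n_train}, test rows: {n_test}")
```

With this setting the models in this notebook are fitted on the smaller quarter of each dataset and evaluated on the remaining three quarters.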



In [6]:
# function to test the get_features function

def test_dataset(dataset, drop_cols):
    """
    Tests a dataset to ensure everything returned from get_features is as expected.

    Args:
        dataset: The dataset to test.
        drop_cols: A list of column names to drop from the dataset.

    Raises:
        AssertionError: If any of the data integrity checks fail.
    """
    X_train, X_test, y_train, y_test, ds_feature_names = get_features(dataset, drop_cols)

    # Assert that the returned feature names are not empty
    assert len(ds_feature_names) > 0, "Feature names should not be empty."

    # Assert that the shapes of the data match correctly
    assert X_train.shape[0] == len(y_train), "Number of training samples does not match the target."
    assert X_test.shape[0] == len(y_test), "Number of test samples does not match the target."

    # Assert that the data has been split correctly (train and test shapes should not be equal)
    assert X_train.shape[0] != X_test.shape[0], "Training and test set sizes should differ."

    # Assert that target arrays are not empty
    assert len(y_train) > 0, "Training target should not be empty."
    assert len(y_test) > 0, "Test target should not be empty."

    print("All tests passed successfully!")



## Modeling

### [Regression Dataset #1 - Moneyball](https://openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=lte_1&id=41021)

***Description***

In the early 2000s, Billy Beane and Paul DePodesta worked for the Oakland Athletics. While there, they literally changed the game of baseball. They didn't do it using a bat or glove, and they certainly didn't do it by throwing money at the issue; in fact, money was the issue. They didn't have enough of it, but they were still expected to keep up with teams that had much deeper pockets. This is where Statistics came riding down the hillside on a white horse to save the day. This data set contains some of the information that was available to Beane and DePodesta in the early 2000s, and it can be used to better understand their methods.

***Attributes***

This data set contains a set of variables that Beane and DePodesta focused heavily on. They determined that stats like on-base percentage (OBP) and slugging percentage (SLG) were very important when it came to scoring runs, however they were largely undervalued by most scouts at the time. This translated to a gold mine for Beane and DePodesta. Since these players weren't being looked at by other teams, they could recruit these players on a small budget. The variables are as follows:

- Team
- League
- Year
- Runs Scored (RS)
- Runs Allowed (RA)
- Wins (W)
- On-Base Percentage (OBP)
- Slugging Percentage (SLG)
- Batting Average (BA)
- Playoffs (binary)
- RankSeason
- RankPlayoffs
- Games Played (G)
- Opponent On-Base Percentage (OOBP)
- Opponent Slugging Percentage (OSLG)

In [7]:
# getting the Moneyball dataset

moneyball, moneyball_df = get_dataset('Moneyball', version=2)



In [8]:
# previewing the dataset

moneyball_df.head()



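|  | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 |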


In [9]:
# splitting dataset

X_train_mb, X_test_mb, y_train_mb, y_test_mb, feature_names_mb = get_features(moneyball, [])



In [10]:
# testing the get_features function

test_dataset(moneyball, [])



All tests passed successfully!


In [11]:
# fitting the dataset to a Rule Fit Regression model

rulefit = RuleFitRegressor(max_rules=10)
rulefit.fit(X_train_mb, y_train_mb, feature_names=feature_names_mb)

# get test performance
preds = rulefit.predict(X_test_mb)
print(f'Test R2: {metrics.r2_score(y_test_mb, preds):0.2f}')

rulefit



Test R2: 0.92


Interpretation:

With a test R2 of 0.92, the model explains roughly 92% of the variance in the target on the held-out data, so it fits the data well.

Looking at the fitted model, we can see that there are 4 features and 5 rules. Among these, 3 features have the largest effect on the target column: On-Base Percentage, Slugging Percentage and Batting Average. The rules are built on these 3 features as well, indicating that **OBP, SLG and BA** are the **highest-impact** deciding factors for this dataset and model.
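
To make the R2 figure concrete, here is a minimal sketch of how the score is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_score_manual` and the four target/prediction pairs are made up for illustration; they are not taken from the Moneyball run.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# hypothetical targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(f"R2: {r2_score_manual(y_true, y_pred):0.4f}")
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than predicting the mean of the targets.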

### [Classification Dataset #1 - Baseball](https://openml.org/search?type=data&status=any&id=185)

Database of baseball players and play statistics, including 'Games_played', 'At_bats', 'Runs', 'Hits', 'Doubles', 'Triples', 'Home_runs', 'RBIs', 'Walks', 'Strikeouts', 'Batting_average', 'On_base_pct', 'Slugging_pct' and 'Fielding_ave'



In [12]:
# getting the Baseball dataset

baseball, baseball_df = get_dataset('baseball', version=1)



In [13]:
# previewing the baseball dataset

baseball_df.head()



|  | Number_seasons | Games_played | At_bats | Runs | Hits | Doubles | Triples | Home_runs | RBIs | Walks | Strikeouts | Batting_average | On_base_pct | Slugging_pct | Fielding_ave | Position | Hall_of_Fame |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 3298 | 12364 | 2174 | 3771 | 624 | 98 | 755 | 2297 | 1402 | 1383.0 | 0.305 | 0.377 | 0.555 | 0.98 | Outfield | 1 |
| 1 | 13 | 1165 | 4019 | 378 | 1022 | 163 | 19 | 57 | 366 | 208 | 499.0 | 0.254 | 0.294 | 0.347 | 0.985 | Second_base | 0 |
| 2 | 13 | 1424 | 5557 | 844 | 1588 | 249 | 48 | 9 | 394 | 453 | 223.0 | 0.286 | 0.343 | 0.353 | 0.974 | Second_base | 0 |
| 3 | 14 | 1281 | 4019 | 591 | 1082 | 188 | 49 | 37 | 303 | 414 | 447.0 | 0.269 | 0.34 | 0.368 | 0.955 | Third_base | 0 |
| 4 | 17 | 1959 | 6606 | 823 | 1832 | 295 | 35 | 336 | 1122 | 594 | 1059.0 | 0.277 | 0.339 | 0.485 | 0.994 | First_base | 0 |


In [14]:
# splitting the dataset

X_train_bball, X_test_bball, y_train_bball, y_test_bball, feature_names_bball = get_features(baseball, [])



In [15]:
# testing the get_features function

test_dataset(baseball, [])



All tests passed successfully!


In [16]:


# fitting a OneR (one-rule) classifier

m = OneRClassifier()
m.fit(X_train_bball, y=y_train_bball, feature_names=feature_names_bball)  # stores into m.rules_
probs = m.predict_proba(X_test_bball)

m



In [25]:
m.rules_



[{'col': 'RBIs',
  'index_col': 8,
  'cutoff': 1322.0,
  'val': 0.103125,
  'flip': False,
  'val_right': 1.2,
  'num_pts': 335,
  'num_pts_right': 15},
 {'col': 'RBIs',
  'index_col': 8,
  'cutoff': 731.0,
  'val': 0.01652892561983471,
  'flip': False,
  'val_right': 0.3717948717948718,
  'num_pts': 320,
  'num_pts_right': 78},
 {'col': 'RBIs',
  'index_col': 8,
  'cutoff': 585.5,
  'val': 0.0,
  'flip': False,
  'val_right': 0.1,
  'num_pts': 242,
  'num_pts_right': 40},
 {'val': 0.0, 'num_pts': 202}]

Interpretation:

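Runs Batted In (RBIs) has been chosen as the single deciding feature. The rule list has 4 levels that are checked in order, so a data point takes the value of the first condition it meets. Each value appears to be the mean target among the training points that reach that level, so it can be read as a risk (probability) only when the target is coded 0/1. The first value, 1.2, exceeds 1 and therefore is not a probability; it suggests that the target column contains labels larger than 1. The levels are as follows:
- If RBIs >= 1322.0, the value is 1.2 (15 training points)
- Else if RBIs >= 731.0, the value is 0.372 (78 training points)
- Else if RBIs >= 585.5, the value is 0.1 (40 training points)
- Otherwise, the value is 0.0 (202 training points)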
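
The way such a rule list is read can be sketched in plain Python. The rules below are copied from the `m.rules_` output above, keeping only the fields the sketch needs. The helper `rule_list_value` is hypothetical and is not part of imodels; it assumes the comparison is `>=` as written in the conditions above, and it ignores the `flip` field because every rule here has `flip` set to `False`. The RBI value 600 is a made-up example, while the other three come from the dataset preview.

```python
# Rules copied from the m.rules_ output (only the fields used in this sketch)
rules = [
    {"col": "RBIs", "cutoff": 1322.0, "val_right": 1.2,
     "num_pts": 335, "num_pts_right": 15},
    {"col": "RBIs", "cutoff": 731.0, "val_right": 0.3717948717948718,
     "num_pts": 320, "num_pts_right": 78},
    {"col": "RBIs", "cutoff": 585.5, "val_right": 0.1,
     "num_pts": 242, "num_pts_right": 40},
    {"val": 0.0, "num_pts": 202},
]


def rule_list_value(rbis, rules):
    """Walk the list top to bottom and return the value of the first rule that fires."""
    for rule in rules:
        if "cutoff" not in rule:       # final catch-all entry
            return rule["val"]
        if rbis >= rule["cutoff"]:     # assumption: the comparison is >=
            return rule["val_right"]
    raise ValueError("rule list has no default entry")


# Bookkeeping check: points that do not satisfy a rule move on to the next one
for current, following in zip(rules, rules[1:]):
    assert current["num_pts"] - current["num_pts_right"] == following["num_pts"]

for rbis in (2297, 1122, 600, 366):
    print(rbis, rule_list_value(rbis, rules))
```

The bookkeeping check passes on the recorded output (335 - 15 = 320, 320 - 78 = 242, 242 - 40 = 202), which supports reading the list sequentially rather than as three independent conditions.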

### [Classification Dataset #2 - Credit Risk Classification](https://openml.org/search?type=data&sort=runs&status=active&id=31)

This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix:

| Actual \ Predicted | Good | Bad |
|---|---|---|
| Good | 0 | 1 |
| Bad | 5 | 0 |

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
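
As a rough illustration of how this cost matrix could be used to score a classifier, the sketch below sums the cost of each (actual, predicted) pair. The function `total_cost` and the five label pairs are hypothetical; only the matrix values come from the dataset description.

```python
# Cost matrix from the dataset description, indexed as COST[actual][predicted]
COST = {
    "good": {"good": 0, "bad": 1},
    "bad": {"good": 5, "bad": 0},
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[a][p] for a, p in zip(actual, predicted))


# hypothetical labels, for illustration only
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good"]
print(total_cost(actual, predicted))
```

Here one good customer rejected costs 1 and one bad customer accepted costs 5, so two errors of different kinds contribute very differently to the total.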

Attribute descriptions:
- Status of existing checking account, in Deutsche Mark.
- Duration in months
- Credit history (credits taken, paid back duly, delays, critical accounts)
- Purpose of the credit (car, television,...)
- Credit amount
- Status of savings account/bonds, in Deutsche Mark.
- Present employment, in number of years.
- Installment rate in percentage of disposable income
- Personal status (married, single,...) and sex
- Other debtors / guarantors
- Present residence since X years
- Property (e.g. real estate)
- Age in years
- Other installment plans (banks, stores)
- Housing (rent, own,...)
- Number of existing credits at this bank
- Job
- Number of people being liable to provide maintenance for
- Telephone (yes,no)
- Foreign worker (yes,no)

In [18]:
# getting the credit dataset

credit, credit_df = get_dataset('credit-g', version=1)



In [19]:
# previewing the dataset

credit_df.head()



|  | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6 | critical/other existing credit | radio/tv | 1169 | no known savings | >=7 | 4 | male single | none | ... | real estate | 67 | none | own | 2 | skilled | 1 | yes | yes | good |
| 1 | 0<=X<200 | 48 | existing paid | radio/tv | 5951 | <100 | 1<=X<4 | 2 | female div/dep/mar | none | ... | real estate | 22 | none | own | 1 | skilled | 1 | none | yes | bad |
| 2 | no checking | 12 | critical/other existing credit | education | 2096 | <100 | 4<=X<7 | 2 | male single | none | ... | real estate | 49 | none | own | 1 | unskilled resident | 2 | none | yes | good |
| 3 | <0 | 42 | existing paid | furniture/equipment | 7882 | <100 | 4<=X<7 | 2 | male single | guarantor | ... | life insurance | 45 | none | for free | 1 | skilled | 2 | none | yes | good |
| 4 | <0 | 24 | delayed previously | new car | 4870 | <100 | 1<=X<4 | 3 | male single | none | ... | no known property | 53 | none | for free | 2 | skilled | 2 | none | yes | bad |


In [20]:
# splitting the dataset

X_train_credit, X_test_credit, y_train_credit, y_test_credit, feature_names_credit = get_features(credit, [])



In [21]:
# testing the dataset

test_dataset(credit, [])



All tests passed successfully!


In [23]:
# fitting the Hierarchical Shrinkage model

dt = HSTreeClassifierCV(max_leaf_nodes=7)
dt.fit(X_train_credit, y=y_train_credit, feature_names=feature_names_credit)
probs = dt.predict_proba(X_test_credit)

dt



Interpretation:

The model uses credit_amount and age as the primary features to make predictions. Hierarchical shrinkage regularizes the tree by pulling each leaf's predicted value toward the averages of its ancestor nodes, which damps splits supported by few samples without changing the tree structure. Several conditions split the data into leaves, and each leaf provides the class prediction (either 0 or 1) and the associated weights.

Of the 2 features, credit_amount is the primary factor, with age applied as a secondary filter for the final classification.

A majority (4/7) of the leaves/rules lean towards the 0 class. This suggests that 0 is the more likely classification, although the leaf count alone does not show how many samples fall into each leaf.
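
The shrinkage step itself can be sketched for a single root-to-leaf path. This follows the hierarchical shrinkage idea of starting from the root mean and adding each parent-to-child change in mean, scaled down by `1 + lam / (samples in the parent)`. The function `hs_prediction`, the path means, and the sample counts are all hypothetical, and it is an assumption that `HSTreeClassifierCV` applies exactly this form internally.

```python
def hs_prediction(node_means, node_counts, lam):
    """Shrunk prediction along one root-to-leaf path.

    node_means[i]  : mean target of the training points in the i-th node on the path
    node_counts[i] : number of training points in that node
    lam            : regularization strength (lam = 0 leaves the tree unchanged)
    """
    pred = node_means[0]  # start from the root mean
    for parent in range(len(node_means) - 1):
        step = node_means[parent + 1] - node_means[parent]
        # steps out of small parent nodes are shrunk the most
        pred += step / (1.0 + lam / node_counts[parent])
    return pred


# hypothetical path: root -> internal node -> leaf
path_means = [0.70, 0.60, 0.90]
path_counts = [250, 100, 40]
for lam in (0, 10, 100):
    print(lam, round(hs_prediction(path_means, path_counts, lam), 4))
```

With `lam = 0` the sketch returns the plain leaf mean, and as `lam` grows the prediction moves back toward the root mean, which is the sense in which the leaf values are shrunk.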