### Example: diabetes

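This tutorial trains RO-FIGS models (`ROFIGSClassifier`) on the diabetes dataset used in the paper and walks through the main functionality: fitting, verbose training logs, evaluation, and inspecting the feature combinations used in splits.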

First, install the required dependencies with `pip install -r requirements.txt`.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

# Add the parent directory to the import path so that the `src` package can be found
sys.path.insert(0, os.path.abspath('..'))

from src.utils import load_final_data, load_data
from src.rofigs import ROFIGSClassifier
from sklearn.metrics import balanced_accuracy_score


In [3]:
# Load the train and test splits for fold 6 of the diabetes dataset
(X_train, y_train), (X_test, y_test) = load_final_data(dataset="diabetes", fold=6)

Fit and evaluate RO-FIGS models

In [4]:
model_8 = ROFIGSClassifier(beam_size=8, max_splits=10, min_impurity_decrease=10, random_state=12345)
model_8.fit(X_train, y_train)

In [5]:
# Set verbose=True to print details of the training process
model_4 = ROFIGSClassifier(beam_size=4, max_splits=10, min_impurity_decrease=10, verbose=True, random_state=12345)
model_4.fit(X_train, y_train)

********************************************************************************
...............................  Iteration = 1  ................................
********************************************************************************
#samples: split = 691, left = 496, right = 195
values: split = 0.349, left = 0.258, right = 0.579
impurity reduction: 14.461

********************************************************************************
...............................  Iteration = 2  ................................
********************************************************************************
Adding a new node with impurity reduction of 14.461:
	3.973 * X_5 + 3.346 * X_6 <= 3.030 (Tree #-1 root)

>>>>> Splitting on features: 5, 1, 2, 4 <<<<<

There are 3 potential splits:

 **************************************************
#samples: split = 691, left = 566, right = 125
values: split = -0.000, left = -0.031, right = 0.140
impurity reduction: 3.001

 ************************
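
The log line `3.973 * X_5 + 3.346 * X_6 <= 3.030` describes an oblique split: a weighted sum of several features is compared against a threshold. The sketch below shows how such a rule could route a sample. The weights and threshold are copied from the log; the sample values, the helper `goes_left`, and the convention that a satisfied condition sends the sample to the left child are assumptions made for illustration, not taken from the `rofigs` code.

```python
# Weights (feature index -> coefficient) and threshold copied from the verbose log.
weights = {5: 3.973, 6: 3.346}
threshold = 3.030


def goes_left(x, weights, threshold):
    """Return True if the weighted feature sum is at or below the threshold.

    Assumption: a satisfied condition routes the sample to the left child.
    """
    score = sum(w * x[i] for i, w in weights.items())
    return score <= threshold


# Hypothetical, already-preprocessed samples with 8 features each.
sample_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.3, 0.0]
sample_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.5, 0.0]

print(goes_left(sample_a, weights, threshold))  # weighted sum is about 1.80
print(goes_left(sample_b, weights, threshold))  # weighted sum is about 4.06
```

Only the features with a coefficient take part in the decision, which is why a split that combines few features stays easy to read.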

In [6]:
# Evaluate models
acc_4 = 100 * balanced_accuracy_score(y_test, model_4.predict(X_test))
print(f"Model with beam_size=4 has \n\t{model_4.count_trees()} trees, {model_4.count_splits()} splits, and {model_4.get_average_num_feat_per_split():.1f} features per split \n\tbalanced accuracy of {acc_4:.1f}\n")

acc_8 = 100 * balanced_accuracy_score(y_test, model_8.predict(X_test))
print(f"Model with beam_size=8 has \n\t{model_8.count_trees()} trees, {model_8.count_splits()} splits, and {model_8.get_average_num_feat_per_split():.1f} features per split \n\tbalanced accuracy of {acc_8:.1f}")

Model with beam_size=4 has 
	2 trees, 2 splits, and 1.5 features per split 
	balanced accuracy of 65.1

Model with beam_size=8 has 
	1 trees, 1 splits, and 3.0 features per split 
	balanced accuracy of 72.5
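
Balanced accuracy is the mean of the per-class recalls, so it does not reward a model for simply favouring the majority class. The following minimal sketch recomputes it by hand on hypothetical labels (not the diabetes data) to show how it can differ from plain accuracy; the helper name `balanced_accuracy` is ours.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average the recall of each class that appears in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 6 negatives and 2 positives, one positive missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {plain:.3f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
```

Here the recall is 1.0 for class 0 and 0.5 for class 1, so the balanced accuracy is lower than the plain accuracy.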


Access feature combinations that appear in splits

In [7]:
# model_4 has two splits: one with two features and one with one feature
print(f"Feature combinations in model_4: {model_4.feature_combinations}") 

# model_8 has only one split with three features
print(f"Feature combinations in model_8: {model_8.feature_combinations}")

Feature combinations in model_4: [(5, 6), (1,)]
Feature combinations in model_8: [(1, 5, 7)]
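
The returned lists can be summarised with plain Python. The sketch below takes the two lists printed above and computes the average number of features per split and how often each feature is used. The helper `summarize` is hypothetical; the averages it produces agree with the values reported in the evaluation cell, but this is not necessarily how `get_average_num_feat_per_split` is implemented.

```python
from collections import Counter

# Values copied from the output above.
combos_4 = [(5, 6), (1,)]
combos_8 = [(1, 5, 7)]


def summarize(combos):
    """Return (average features per split, usage count per feature index)."""
    avg = sum(len(c) for c in combos) / len(combos)
    usage = Counter(f for c in combos for f in c)
    return avg, usage


for name, combos in [("model_4", combos_4), ("model_8", combos_8)]:
    avg, usage = summarize(combos)
    print(f"{name}: {avg:.1f} features per split, usage = {dict(usage)}")
```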
