# Regression Framework Walkthrough

This notebook provides a comprehensive walkthrough of our regression framework. It's designed to help team members understand the code structure, functionality, and how to use it effectively for different regression tasks.

## Table of Contents
1. [Setup and Dependencies](#setup)
2. [Overview of the Framework](#overview)
3. [Data Preparation and Exploration](#data-prep)
4. [Framework Components](#components)
   - [RegressionBuilder](#builder)
   - [RegressionTrain](#train)
   - [RegressionInference](#inference)
   - [RegressionHyperparameterTune](#tuning)
5. [Key Features](#features)
   - [Model Selection](#model-selection)
   - [Feature Selection](#feature-selection)

## 1. Setup and Dependencies <a id="setup"></a>

Let's start by importing the necessary libraries and dependencies. The regression framework relies on various libraries for data manipulation, model training, and evaluation.

In [42]:
# Import core libraries
import os
import shutil
import json
from collections import Counter
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

# Configure warnings and logging
warnings.filterwarnings('ignore')

# Display settings for better visibility
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

In [43]:
# Import sklearn components
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.feature_selection import SequentialFeatureSelector, RFE, RFECV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, BayesianRidge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import DataConversionWarning
import xgboost as xgb
# Note: TabPFN might require separate installation
# Uncommenting this will ensure the notebook works even if TabPFN isn't available
# try:
#     from tabpfn import TabPFNRegressor
#     from tabpfn_feature_importance_helper import TabPFNRegressorWithImportance
#     TABPFN_AVAILABLE = True
# except ImportError:
#     print("TabPFN not available. Some functionality will be limited.")
#     TABPFN_AVAILABLE = False

## 2. Overview of the Framework <a id="overview"></a>

Our regression framework is designed to streamline the process of building, training, and evaluating regression models. It provides a unified interface for various regression algorithms and incorporates best practices for model selection, hyperparameter tuning, and performance evaluation.

### Key Components:

1. **RegressionBuilder**: The main entry point that orchestrates the entire process
2. **RegressionTrain**: Handles model training, validation, and selection
3. **RegressionInference**: Manages prediction on new data
4. **RegressionHyperparameterTune**: Optimizes model hyperparameters

### Design Philosophy:
- **Modularity**: Each component has a specific responsibility
- **Flexibility**: Supports various regression algorithms and configurations
- **Robustness**: Includes validation, error handling, and diagnostics
- **Reproducibility**: Maintains consistent evaluation methodology

Let's first generate some sample data to use throughout this walkthrough.

Alternatively, we can use a real-world dataset like the California Housing dataset:

In [26]:
# Load California Housing dataset
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
df_california = pd.DataFrame(housing.data, columns=housing.feature_names)
df_california['target'] = housing.target
target = "target"
# Display the first few rows
print(f"California Housing dataset shape: {df_california.shape}")
df_california.head()

California Housing dataset shape: (20640, 9)


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


## 3. Data Preparation and Exploration <a id="data-prep"></a>

Before diving into the regression framework, let's explore the data to understand its characteristics. This step is crucial for making informed decisions about model selection and preprocessing.

In [27]:
# Let's use the California Housing dataset for our examples
df = df_california.copy()

# Basic statistics
df.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


## Or use one of the chemistry datasets
Likely you will have to modify the path based on your file system.


In [28]:
chemistry_dataset_ex_1 = "/mnt/c/Users/16303/misc/antonio_molecules/chemetrian/notebooks/nw_msai_349_fall_2025_final_project/vaskas.csv"
df = pd.read_csv(chemistry_dataset_ex_1)
target = "barrier"

In [29]:
df.head()

Unnamed: 0,chi-0_fa,chi-1_fa,chi-2_fa,chi-3_fa,chi-4_fa,chi-5_fa,Z-0_fa,Z-1_fa,Z-2_fa,Z-3_fa,Z-4_fa,Z-5_fa,I-0_fa,I-1_fa,I-2_fa,I-3_fa,I-4_fa,I-5_fa,T-0_fa,T-1_fa,T-2_fa,T-3_fa,T-4_fa,T-5_fa,S-0_fa,S-1_fa,S-2_fa,S-3_fa,S-4_fa,S-5_fa,chi-0_ma,chi-1_ma,chi-2_ma,chi-3_ma,chi-4_ma,chi-5_ma,Z-0_ma,Z-1_ma,Z-2_ma,Z-3_ma,Z-4_ma,Z-5_ma,I-0_ma,I-1_ma,I-2_ma,I-3_ma,I-4_ma,I-5_ma,T-0_ma,T-1_ma,T-2_ma,T-3_ma,T-4_ma,T-5_ma,S-0_ma,S-1_ma,S-2_ma,S-3_ma,S-4_ma,S-5_ma,chi-0_md,chi-1_md,chi-2_md,chi-3_md,chi-4_md,chi-5_md,Z-0_md,Z-1_md,Z-2_md,Z-3_md,Z-4_md,Z-5_md,I-0_md,I-1_md,I-2_md,I-3_md,I-4_md,I-5_md,T-0_md,T-1_md,T-2_md,T-3_md,T-4_md,T-5_md,S-0_md,S-1_md,S-2_md,S-3_md,S-4_md,S-5_md,distance,barrier,smiles,filename
0,271.4006,564.775,1031.2006,1395.6522,1972.4306,2482.2086,6923.0,7298.0,10116.0,12454.0,8122.0,4844.0,46.0,92.0,176.0,238.0,332.0,420.0,268.0,692.0,1042.0,1390.0,1574.0,1454.0,17.0209,44.4116,72.037,98.2048,122.7906,123.6412,271.4006,564.775,1031.2006,1395.6522,1972.4306,2482.2086,6923.0,7298.0,10116.0,12454.0,8122.0,4844.0,46.0,92.0,176.0,238.0,332.0,420.0,268.0,692.0,1042.0,1390.0,1574.0,1454.0,17.0209,44.4116,72.037,98.2048,122.7906,123.6412,271.4006,564.775,1031.2006,1395.6522,1972.4306,2482.2086,6923.0,7298.0,10116.0,12454.0,8122.0,4844.0,46.0,92.0,176.0,238.0,332.0,420.0,268.0,692.0,1042.0,1390.0,1574.0,1454.0,17.0209,44.4116,72.037,98.2048,122.7906,123.6412,0.9348,14.8,[Ir]([P+](CC)(CC)CC)([C-]1N(C)CCN(C)1)(C#[O+])...,ir_tbp_1_dft-pet3_1_dft-sime_1_dft-co_1_dft-ic...
1,224.3694,482.135,884.7308,1137.1092,1673.2128,1550.3932,6811.0,5588.0,8986.0,10374.0,5364.0,3814.0,38.0,76.0,142.0,186.0,286.0,252.0,218.0,556.0,858.0,1078.0,1094.0,1024.0,14.5337,36.1862,57.4704,79.2868,92.0554,82.638,224.3694,482.135,884.7308,1137.1092,1673.2128,1550.3932,6811.0,5588.0,8986.0,10374.0,5364.0,3814.0,38.0,76.0,142.0,186.0,286.0,252.0,218.0,556.0,858.0,1078.0,1094.0,1024.0,14.5337,36.1862,57.4704,79.2868,92.0554,82.638,224.3694,482.135,884.7308,1137.1092,1673.2128,1550.3932,6811.0,5588.0,8986.0,10374.0,5364.0,3814.0,38.0,76.0,142.0,186.0,286.0,252.0,218.0,556.0,858.0,1078.0,1094.0,1024.0,14.5337,36.1862,57.4704,79.2868,92.0554,82.638,0.9699,12.4,[Ir]([N+](C)(C)C)([C-]1N(C)CCN(C)1)(C#[N+][H])...,ir_tbp_1_dft-nme3_1_dft-sime_1_dft-hicn_1_dft-...
2,305.232,670.614,1125.6584,1522.5118,1856.8504,2402.9954,7215.0,8080.0,11196.0,13816.0,13400.0,11216.0,51.0,108.0,184.0,258.0,310.0,390.0,292.0,762.0,1172.0,1452.0,1702.0,1972.0,21.8694,56.4396,89.7068,117.5974,140.652,164.6612,305.232,670.614,1125.6584,1522.5118,1856.8504,2402.9954,7215.0,8080.0,11196.0,13816.0,13400.0,11216.0,51.0,108.0,184.0,258.0,310.0,390.0,292.0,762.0,1172.0,1452.0,1702.0,1972.0,21.8694,56.4396,89.7068,117.5974,140.652,164.6612,305.232,670.614,1125.6584,1522.5118,1856.8504,2402.9954,7215.0,8080.0,11196.0,13816.0,13400.0,11216.0,51.0,108.0,184.0,258.0,310.0,390.0,292.0,762.0,1172.0,1452.0,1702.0,1972.0,21.8694,56.4396,89.7068,117.5974,140.652,164.6612,0.9983,5.2,[Ir]([P+](c1ccccc1)(c1ccccc1)c1ccccc1)([N+]1=C...,ir_tbp_1_dft-pph3_1_dft-oxaz_1_dft-hicn_1_dft-...
3,171.9665,354.089,617.2588,837.7394,1186.002,751.202,6893.0,8358.0,8314.0,6874.0,4074.0,1808.0,29.0,58.0,104.0,138.0,206.0,124.0,162.0,410.0,620.0,720.0,610.0,472.0,12.4648,31.8018,49.8734,63.0708,63.2648,42.1078,171.9665,354.089,617.2588,837.7394,1186.002,751.202,6893.0,8358.0,8314.0,6874.0,4074.0,1808.0,29.0,58.0,104.0,138.0,206.0,124.0,162.0,410.0,620.0,720.0,610.0,472.0,12.4648,31.8018,49.8734,63.0708,63.2648,42.1078,171.9665,354.089,617.2588,837.7394,1186.002,751.202,6893.0,8358.0,8314.0,6874.0,4074.0,1808.0,29.0,58.0,104.0,138.0,206.0,124.0,162.0,410.0,620.0,720.0,610.0,472.0,12.4648,31.8018,49.8734,63.0708,63.2648,42.1078,0.9353,10.5,[Ir]([n+]1ccncc1)([P+](C)(C)C)(C#[N+][H])([Cl]...,ir_tbp_1_dft-pyz_1_dft-pme3_1_dft-hicn_1_chlor...
4,221.8709,472.145,822.1272,1033.6916,1544.7814,1473.76,6755.0,6844.0,9234.0,9142.0,5510.0,3208.0,39.0,78.0,142.0,174.0,272.0,248.0,220.0,546.0,812.0,1026.0,1088.0,1092.0,14.8146,38.7222,60.648,78.1554,91.8444,83.371,221.8709,472.145,822.1272,1033.6916,1544.7814,1473.76,6755.0,6844.0,9234.0,9142.0,5510.0,3208.0,39.0,78.0,142.0,174.0,272.0,248.0,220.0,546.0,812.0,1026.0,1088.0,1092.0,14.8146,38.7222,60.648,78.1554,91.8444,83.371,221.8709,472.145,822.1272,1033.6916,1544.7814,1473.76,6755.0,6844.0,9234.0,9142.0,5510.0,3208.0,39.0,78.0,142.0,174.0,272.0,248.0,220.0,546.0,812.0,1026.0,1088.0,1092.0,14.8146,38.7222,60.648,78.1554,91.8444,83.371,0.931,9.4,[Ir]([P+](C)(C)C)([C-]1N(C)C=CN(C)1)(C#[N+]C)(...,ir_tbp_1_dft-pme3_1_dft-ime_1_dft-iacn_1_dft-c...


## 4. Framework Components <a id="components"></a>

Let's explore each of the main components of our regression framework. We'll start by importing the necessary classes from our framework files.

### 4.1 RegressionBuilder <a id="builder"></a>

The `RegressionBuilder` class serves as the main entry point to our regression framework. It follows the builder pattern, allowing for clean method chaining and provides a unified interface to configure and run the regression process.

In [30]:
# Define a simplified version of the constants used by the framework
# In practice, these would be imported from constants.py

# Scoring metrics
SCORING = {
    'mae': 'neg_mean_absolute_error',
    'mse': 'neg_mean_squared_error',
    'rmse': 'neg_root_mean_squared_error',
    'r2': 'r2',
    'adj_r2': 'r2'  # Note: sklearn doesn't have adjusted R2 scorer; we handle this separately
}

# Regression metrics for evaluation
OFFLINE_REGRESSION_METRICS = ['mae', 'mse', 'rmse', 'r2', 'adj_r2']

# Default hyperparameters for each model type
MODEL_PARAM_DEFAULT = {
    'linear': {},  # Linear regression uses default parameters
    'bayesian': {
        'alpha_1': 1e-6, 
        'alpha_2': 1e-6,
        'lambda_1': 1e-6, 
        'lambda_2': 1e-6
    },
    'elasticnet': {'alpha': 0.1, 'l1_ratio': 0.5},
    'xgboost': {
        'n_estimators': 100,
        'learning_rate': 0.1,
        'max_depth': 3,
        'subsample': 0.8,
        'colsample_bytree': 0.8
    },
    'randomforest': {
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 2,
        'min_samples_leaf': 1
    },
    'svm': {'C': 1.0, 'epsilon': 0.1, 'gamma': 'scale'},
    'tabpfn': {'N_ensemble_configurations': 32}
}

# Hyperparameter ranges for tuning
MODEL_PARAM_RANGE = {
    'linear': {},  # No tuning for linear regression
    'bayesian': {
        'alpha_1': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
        'alpha_2': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
        'lambda_1': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3],
        'lambda_2': [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
    },
    'elasticnet': {
        'alpha': [0.001, 0.01, 0.1, 0.5, 1.0],
        'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
    },
    'xgboost': {
        'n_estimators': [50, 100, 200, 300],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
        'max_depth': [3, 5, 7, 9],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0]
    },
    'randomforest': {
        'n_estimators': [50, 100, 200, 300],
        'max_depth': [5, 10, 15, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    'svm': {
        'C': [0.1, 1.0, 10.0, 100.0],
        'epsilon': [0.01, 0.1, 0.2],
        'gamma': ['scale', 'auto', 0.1, 0.01]
    },
    'tabpfn': {
        'N_ensemble_configurations': [16, 32, 64, 128]
    }
}

# Hyperparameters to not display in output
MODEL_HYPERPARAMTERS_TO_NOT_DISPLAY = {
    'linear': ['copy_X', 'n_jobs', 'positive'],
    'bayesian': ['compute_score', 'fit_intercept', 'verbose'],
    'elasticnet': ['copy_X', 'fit_intercept', 'precompute', 'selection', 'tol', 'warm_start'],
    'xgboost': ['booster', 'verbosity', 'objective', 'nthread', 'gamma', 'min_child_weight'],
    'randomforest': ['bootstrap', 'ccp_alpha', 'max_features', 'max_leaf_nodes', 'oob_score', 'verbose'],
    'svm': ['kernel', 'degree', 'coef0', 'shrinking', 'tol', 'cache_size', 'verbose', 'max_iter'],
    'tabpfn': ['device', 'base_path', 'verbose']
}

In [31]:
# For the purpose of this notebook, let's define the adjusted R² function
def adj_r2(r2, n, p):
    """
    Calculate adjusted R-squared
    Parameters:
    r2 (float): R-squared value
    n (int): Sample size
    p (int): Number of predictors (excluding intercept)
    Returns:
    float: Adjusted R-squared value
    """
    # Check if we have enough samples relative to predictors
    if n <= p + 1:
        return float("nan")  # Not enough degrees of freedom
    # Correct formula for adjusted R-squared
    adj_r2_value = 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    # Adjusted R² should never exceed R²
    if adj_r2_value > r2:
        return r2
    return adj_r2_value

Let's examine the `RegressionBuilder` class to understand its structure and functionality:

In [32]:
# Simplified implementation of RegressionBuilder
class RegressionBuilder:
    def __init__(
        self,
        input_training_df,       # Training data DataFrame
        input_inference_df,      # Inference data DataFrame (can be None for training-only)
        target,                  # Target column name
        enable_parameter_tune,   # Whether to tune hyperparameters
        data_augmentation,       # Whether to apply data augmentation
        feature_selection_autoselect, # Whether to use automatic feature selection
        feature_selection_num,   # Number of features to select (0 = auto)
        goal_metric,             # Optimization metric ('mae', 'mse', 'r2', 'adj_r2')
        data_augmentation_method = None,  # Method for augmentation ('gaussian', 'smote', 'vae')
        data_augmentation_sectioning = None,  # Sectioning method ('binning', 'kde')
        data_augmentation_region_num = 10,    # Number of regions for augmentation
        data_augmentation_min_samples_per_region = 5,  # Min samples per region
        balance_strategy = "equal",  # Strategy for balancing regions
        k = 10,                 # Number of folds for cross-validation
        sampling_method = 'kfold', # Sampling method ('kfold', 'stratified')
        model_choice = None,    # Model to use (string or list)
        saved_model = None,     # Pre-trained model for inference
        saved_scaler = None,    # Pre-fitted scaler for inference
        target_provided = False # Whether target is provided for inference data
    ):
        # Store all configuration parameters
        self.input_training = input_training_df
        self.input_inference = input_inference_df
        self.target = target
        self.enable_parameter_tune = enable_parameter_tune
        self.data_augmentation = data_augmentation
        self.feature_selection_autoselect = feature_selection_autoselect
        self.feature_selection_num = feature_selection_num
        self.goal_metric = goal_metric
        self.data_augmentation_method = data_augmentation_method
        self.data_augmentation_sectioning = data_augmentation_sectioning
        self.data_augmentation_region_num = data_augmentation_region_num
        self.data_augmentation_min_samples_per_region = data_augmentation_min_samples_per_region
        self.balance_strategy = balance_strategy
        self.k = k
        self.sampling_method = sampling_method
        self.model_choice = model_choice
        self.saved_model = saved_model
        self.saved_scaler = saved_scaler
        self.target_provided = target_provided
        
        # File validation would be handled here in the full implementation
        # self.regression_training_file_validator = ... 
        # self.regression_inference_file_validator = ...

    def populate_model_dict(self):
        """Create a dictionary of model classes based on user selection"""
        # Define available models
        model_dict = {
            "linear": LinearRegression,
            "bayesian": BayesianRidge,
            "elasticnet": ElasticNet,
            "xgboost": xgb.XGBRegressor,
            "randomforest": RandomForestRegressor,
            "svm": SVR,
            # "tabpfn": TabPFNRegressor  # Commented out to avoid dependency issues
        }
        
        # Filter models based on user choice
        if isinstance(self.model_choice, list):
            return {k: {"model": model_dict[k]} for k in self.model_choice if k in model_dict}
        elif isinstance(self.model_choice, str) and self.model_choice in model_dict:
            return {self.model_choice: {"model": model_dict[self.model_choice]}}
        else:
            # Default: use all available models
            return {k: {"model": v} for k, v in model_dict.items()}

    def prepare_model_dict_for_training(self, model_dict):
        """Prepare the model dictionary with feature information"""
        X = (
            self.input_training.drop(columns=[self.target])
            .select_dtypes(include=["number"])
            .fillna(0)
        )
        feature_dict = {k: X.columns for k in model_dict.keys()}
        return {
            model_name: {
                "model": model_dict[model_name],
                "feature_analysis": list(feature_dict.get(model_name, [])),
            }
            for model_name in model_dict
        }

    def build_model(self):
        """Build and train the regression model"""
        model_dict = self.populate_model_dict()
        
        # In a full implementation, this would create a RegressionTrain instance
        # and call find_best_model. For simplicity, we'll just outline the steps.
        print("1. Initializing RegressionTrain with the following configuration:")
        print(f"   - Target: {self.target}")
        print(f"   - Hyperparameter tuning: {self.enable_parameter_tune}")
        print(f"   - Data augmentation: {self.data_augmentation}")
        print(f"   - Feature selection: {self.feature_selection_autoselect}")
        print(f"   - Goal metric: {self.goal_metric}")
        print(f"   - How are we splitting train and test?: {self.sampling_method}")
        print(f"   - Models to evaluate: {list(model_dict.keys())}")
        
        print("\n2. Training and evaluating models:")
        # This would call regression_train.find_best_model(model_dict)
        for model_name in model_dict.keys():
            print(f"   - Training {model_name}...")
        
        print(f"\n3. Selecting best model based on {self.goal_metric}")
        print("4. Preparing final model and metrics")
        
        # Placeholder for storing results
        self.model_name = list(model_dict.keys())[0]  # Just for demonstration
        self.model = None
        self.scaler = MinMaxScaler()
        self.metrics = {}
        self.features = []
        self.goal_metric_per_folds = {"train": [], "test": []}
        self.true_prediction_point = {}
        
        return "Model training process outlined"

    def run_inference(self):
        """Run inference on new data using a pre-trained model"""
        print("Running inference with pre-trained model")
        # This would create a RegressionInference instance and call inference()
        return {"predictions": []}  # Placeholder for demonstration

    def run_regression(self):
        """Main method to execute either training or inference"""
        if self.input_training is not None:
            # Training mode
            print("Running in training mode")
            self.build_model()
            return {
                "metrics": self.metrics,
                "model_name": self.model_name,
                "model": self.model,
                "hyperparameters": [],  # Would be populated with model parameters
                "scaler": self.scaler,
                "features": self.features,
                "goal_metric_per_folds": self.goal_metric_per_folds,
                "true_prediction_point": self.true_prediction_point,
            }
        else:
            # Inference mode
            print("Running in inference mode")
            return self.run_inference()

Let's demonstrate how to use the `RegressionBuilder` class for a basic training run:

In [44]:
# Split the data into train and test sets for demonstration
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Create a RegressionBuilder instance with basic configuration
regression_builder = RegressionBuilder(
    input_training_df=train_df,
    input_inference_df=None,
    target=target,
    enable_parameter_tune=False,
    data_augmentation=False,
    feature_selection_autoselect=False,
    feature_selection_num=0,
    goal_metric='r2',
    model_choice=['randomforest'],
    sampling_method="random"
)

# Let's see what models are selected
model_dict = regression_builder.populate_model_dict()
print(f"Selected models: {list(model_dict.keys())}")

# Outline the training process
regression_builder.build_model()

Selected models: ['randomforest']
1. Initializing RegressionTrain with the following configuration:
   - Target: barrier
   - Hyperparameter tuning: False
   - Data augmentation: False
   - Feature selection: False
   - Goal metric: r2
   - How are we splitting train and test?: random
   - Models to evaluate: ['randomforest']

2. Training and evaluating models:
   - Training randomforest...

3. Selecting best model based on r2
4. Preparing final model and metrics


'Model training process outlined'

### 4.2 RegressionTrain <a id="train"></a>

The `RegressionTrain` class is responsible for training regression models. It handles various aspects of the training process, including data preprocessing, feature selection, hyperparameter tuning, and cross-validation (if requested).

In [34]:
class RegressionTrain:
    """Class for training regression models with various options"""
    
    def __init__(
        self,
        input_df,                 # Input DataFrame
        target,                   # Target column name
        enable_parameter_tune,    # Whether to tune hyperparameters
        data_augmentation,        # Whether to apply data augmentation
        feature_selection_autoselect, # Whether to use automatic feature selection
        feature_selection_num,    # Number of features to select (0 = auto)
        goal_metric,              # Optimization metric
        data_augmentation_method=None,  # Method for augmentation
        data_augmentation_sectioning=None,  # Sectioning method
        data_augmentation_region_num=10,    # Number of regions
        data_augmentation_min_samples_per_region=5,  # Min samples per region
        balance_strategy="equal", # Strategy for balancing regions
        k=10,                    # Number of folds for CV
        sampling_method='kfold'  # Sampling method ('kfold', 'stratified')
    ):
        # Store configuration parameters
        self.input = input_df
        self.target = target
        self.enable_parameter_tune = enable_parameter_tune
        self.data_augmentation = data_augmentation
        self.feature_selection_autoselect = feature_selection_autoselect
        self.feature_selection_num = feature_selection_num
        self.goal_metric = goal_metric
        self.data_augmentation_method = data_augmentation_method
        self.data_augmentation_sectioning = data_augmentation_sectioning
        self.data_augmentation_region_num = data_augmentation_region_num
        self.data_augmentation_min_samples_per_region = data_augmentation_min_samples_per_region
        self.balance_strategy = balance_strategy
        self.k = k
        self.sampling_method = sampling_method

    def normalize(self, X, scaler=None):
        """Normalize features using MinMaxScaler"""
        if scaler is None:
            scaler = MinMaxScaler()
            X_scaled = scaler.fit_transform(X)
            return X_scaled, scaler
        else:
            X_scaled = scaler.transform(X)
            return X_scaled
    
    def find_best_model(self, model_dict):
        """Train and evaluate all models to find the best one"""
        # Extract features and target
        X = (
            self.input.drop(columns=[self.target])
            .select_dtypes(include=["number"])
            .fillna(0)
        )
        y = self.input[[self.target]].fillna(self.input[self.target].mean())
        
        # Initialize best model tracking
        best_model = {
            "model_name": None,
            f"test_{self.goal_metric}": None,
            "goal_metric_per_folds": None,
            "model": None,
            "metrics": None,
            "selected_features": None,
            "test_true_predicted": None,
        }
        
        # Determine number of CV splits based on data size
        num_splits = self.k if len(X) >= self.k else len(X)
        
        # Train and evaluate each model
        for k in model_dict.keys():
            print(f"Training and evaluating {k} model...")
            # This would call self.train_model() in the full implementation
            # For simplicity, we'll just outline the process
            
            # 1. Create a dummy model
            if k == 'linear':
                model = LinearRegression()
            elif k == 'randomforest':
                model = RandomForestRegressor(n_estimators=100, random_state=42)
            else:
                # Default to a simple model
                model = LinearRegression()
                
            # 2. Placeholder for metrics (would be calculated by train_model)
            model_dict[k]["metrics"] = {
                f"test_{self.goal_metric}": 0.8,  # Placeholder value
                "model": model,
                "selected_features": list(X.columns),
                "test_true_predicted": {}
            }
            goal_metric_per_folds = {"train": [0.85], "test": [0.8]}  # Placeholder
            
            # 3. Check if this model is better than the current best
            current_metric = model_dict[k]["metrics"][f"test_{self.goal_metric}"]
            is_first_model = best_model[f"test_{self.goal_metric}"] is None
            
            # Determine if current model is better based on goal metric
            if self.goal_metric in ["mae", "mse", "rmse"]:
                # For error metrics, lower is better
                is_better = (
                    is_first_model
                    or current_metric < best_model[f"test_{self.goal_metric}"]
                )
            else:  # 'r2' or 'adj_r2'
                # For fit metrics, higher is better
                is_better = (
                    is_first_model
                    or current_metric > best_model[f"test_{self.goal_metric}"]
                )
                
            # 4. Update best model if needed
            if is_better:
                best_model[f"test_{self.goal_metric}"] = current_metric
                best_model["model_name"] = k
                best_model["model"] = model_dict[k]["metrics"]["model"]
                best_model["metrics"] = model_dict[k]["metrics"]
                best_model["selected_features"] = model_dict[k]["metrics"]["selected_features"]
                best_model["goal_metric_per_folds"] = goal_metric_per_folds
                best_model["test_true_predicted"] = model_dict[k]["metrics"]["test_true_predicted"]
        
        # Create a scaler based on the best features
        scaler_result = self.normalize(X[best_model["selected_features"]])
        best_model["scaler"] = scaler_result[1]
        
        return best_model
    
    def train_model(self, model_name, model_dict, X, y, n_splits):
        """Train a model using the appropriate sampling method"""
        if hasattr(self, 'sampling_method') and self.sampling_method == "stratified":
            return self._train_model_stratified(model_name, model_dict, X, y)
        else:
            return self._train_model_kfold(model_name, model_dict, X, y, n_splits)
    
    # For brevity, we're not including all the training methods here
    # The full implementation would include _train_model_kfold, _train_model_stratified, etc.

Let's see how the `RegressionTrain` class works with a simple example:

In [35]:
# Create a simple model dictionary for demonstration
model_dict = {
    'randomforest': {'model': RandomForestRegressor}
}

# Initialize RegressionTrain
regression_train = RegressionTrain(
    input_df=train_df,
    target=target,
    enable_parameter_tune=False,
    data_augmentation=False,
    feature_selection_autoselect=False,
    feature_selection_num=0,
    goal_metric='r2',
    sampling_method="random"
)

# Find the best model
best_model = regression_train.find_best_model(model_dict)
print(f"\nBest model: {best_model['model_name']}")
#print(f"Selected features: {best_model['selected_features']}")

Training and evaluating randomforest model...

Best model: randomforest


### 4.3 RegressionHyperparameterTune <a id="tuning"></a>

The `RegressionHyperparameterTune` class is responsible for optimizing model hyperparameters using RandomizedSearchCV (Similar to GridSearch in class but performs much faster than GridSearch with a small loss in metric optimization).

In [36]:
class RegressionHyperparameterTune:
    """Class for hyperparameter tuning using RandomizedSearchCV"""
    
    def __init__(self, model_name, model, X, y, goal_metric, k=10):
        self.model_name = model_name
        self.model = model
        self.X = X
        self.y = y
        self.goal_metric = goal_metric
        self.k = k
    
    def tune_parameters_rscv(self):
        """Tune hyperparameters using RandomizedSearchCV"""
        print(f"Tuning hyperparameters for {self.model_name} model...")
        
        # Get parameter distributions from constants
        param_distributions = MODEL_PARAM_RANGE.get(self.model_name, {})
        
        if not param_distributions:
            print(f"No tuning parameters defined for {self.model_name}. Using defaults.")
            return MODEL_PARAM_DEFAULT.get(self.model_name, {})
        
        # Set up RandomizedSearchCV
        random_search = RandomizedSearchCV(
            self.model(),
            scoring=SCORING[self.goal_metric],
            param_distributions=param_distributions,
            n_iter=3,  # Reduced for demonstration
            random_state=13,
            cv=self.k if len(self.X) >= self.k else len(self.X),
            verbose=1
        )
        
        # Fit to data
        random_search.fit(self.X, self.y)
        
        print(f"Best parameters: {random_search.best_params_}")
        return random_search.best_params_

Let's see how to use the hyperparameter tuning class:

In [37]:
# Prepare data for tuning
X = train_df.drop(columns=[target]).select_dtypes(include=["number"]).fillna(0)
y = train_df[target]

# Initialize tuner for RandomForest
hyperparameter_tune = RegressionHyperparameterTune(
    model_name='randomforest',
    model=RandomForestRegressor,
    X=X,
    y=y,
    goal_metric='r2',
    k=2  # Using fewer folds for demonstration
)

# Tune hyperparameters
best_params = hyperparameter_tune.tune_parameters_rscv()

Tuning hyperparameters for randomforest model...
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best parameters: {'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': None}


### 4.4 RegressionInference <a id="inference"></a>

The `RegressionInference` class is responsible for making predictions on new data using a pre-trained model.

In [38]:
class RegressionInference:
    """Class for making predictions with a pre-trained model"""
    
    def __init__(self, input_df, scaler, model, target, target_provided=False):
        self.input = input_df
        self.scaler = scaler
        self.model = model
        self.target = target
        self.target_provided = target_provided
    
    def normalize(self, X):
        """Normalize features using the pre-fitted scaler"""
        X_scaled = self.scaler.transform(X)
        return X_scaled
    
    def inference(self):
        """Make predictions on the input data"""
        # Prepare features
        X = self.input.select_dtypes(include=["number"]).fillna(0)
        X = X[[_ for _ in X.columns if _ != self.target]]
        X = pd.DataFrame(self.normalize(X), columns=X.columns)
        
        # Make predictions
        predictions = [float(_) for _ in list(self.model.predict(X))]
        
        # If target is provided, calculate metrics
        if self.target_provided:
            y_test = self.input[[self.target]]
            test_metrics = {
                "test_mae": mean_absolute_error(y_test, predictions),
                "test_mse": mean_squared_error(y_test, predictions),
                "test_adj_r2": adj_r2(
                    r2_score(y_test, predictions), len(y_test), len(X.columns)
                ),
                "test_r2": r2_score(y_test, predictions),
                "test_true_predicted": [
                    {
                        "true": y_test[self.target].tolist()[i],
                        "predicted": predictions[i],
                    }
                    for i in range(len(X))
                ],
            }
            return {"predictions": predictions, "test_metrics": test_metrics}
        else:
            return {"predictions": predictions}

Let's see how to use the inference class with a pre-trained model:

In [39]:
# Train a simple model to use for inference
X_train = train_df.drop(columns=[target]).select_dtypes(include=["number"]).fillna(0)
y_train = train_df[target]
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create a scaler
scaler = MinMaxScaler()
scaler.fit(X_train)

# Initialize the inference class
regression_inference = RegressionInference(
    input_df=test_df,
    scaler=scaler,
    model=model,
    target=target,
    target_provided=True  # We have the true values for evaluation
)

# Make predictions
results = regression_inference.inference()

# Display results
print(f"Number of predictions: {len(results['predictions'])}")
print(f"First few predictions: {results['predictions'][:5]}")
print(f"\nMetrics:")
for metric, value in results['test_metrics'].items():
    if metric != 'test_true_predicted':
        print(f"  {metric}: {value}")

Number of predictions: 390
First few predictions: [7.943000000000004, 7.943000000000004, 7.943000000000004, 7.943000000000004, 7.943000000000004]

Metrics:
  test_mae: 4.927838461538458
  test_mse: 38.36605079230766
  test_adj_r2: -1.4590178303354504
  test_r2: -0.8837720139844838


## 5. Key Features <a id="features"></a>

Now that we've explored the main components of our regression framework, let's dive deeper into some of its key features.

### 5.1 Model Selection <a id="model-selection"></a>

Our framework supports multiple regression algorithms, each with its own strengths and weaknesses. Here's a quick overview of the available models:

In [40]:
# Create a DataFrame to describe the models
model_descriptions = pd.DataFrame({
    'Model Name': [
        'Linear Regression', 
        'Bayesian Ridge', 
        'Elastic Net',
        'Random Forest',
        'XGBoost',
        'Support Vector Regression',
        'TabPFN'
    ],
    'Key in Framework': [
        'linear', 
        'bayesian', 
        'elasticnet',
        'randomforest',
        'xgboost',
        'svm',
        'tabpfn'
    ],
    'Best For': [
        'Simple linear relationships, interpretability', 
        'Linear relationships with uncertainty estimates', 
        'Linear relationships with sparse feature selection',
        'Complex nonlinear relationships, robust to outliers',
        'High performance on structured data, handles nonlinearity well',
        'Nonlinear relationships, high-dimensional spaces',
        'Small datasets, no hyperparameter tuning needed'
    ],
    'Strengths': [
        'Fast, interpretable, low variance', 
        'Provides uncertainty estimates, robust to ill-posed problems', 
        'Feature selection via L1 regularization',
        'Captures interactions, robust to outliers and noise',
        'High performance, feature importance, handles missing values',
        'Works well in high dimensions, captures complex patterns',
        'Few-shot learning, no hyperparameter tuning, works well on small datasets'
    ],
    'Limitations': [
        'Cannot capture nonlinear relationships', 
        'Still assumes linear relationship', 
        'Sensitive to hyperparameters',
        'Can overfit, slow on large datasets',
        'Sensitive to hyperparameters, can overfit',
        'Slow for large datasets, sensitive to scaling',
        'Limited to smaller datasets, less interpretable'
    ]
})

model_descriptions

Unnamed: 0,Model Name,Key in Framework,Best For,Strengths,Limitations
0,Linear Regression,linear,"Simple linear relationships, interpretability","Fast, interpretable, low variance",Cannot capture nonlinear relationships
1,Bayesian Ridge,bayesian,Linear relationships with uncertainty estimates,"Provides uncertainty estimates, robust to ill-...",Still assumes linear relationship
2,Elastic Net,elasticnet,Linear relationships with sparse feature selec...,Feature selection via L1 regularization,Sensitive to hyperparameters
3,Random Forest,randomforest,"Complex nonlinear relationships, robust to out...","Captures interactions, robust to outliers and ...","Can overfit, slow on large datasets"
4,XGBoost,xgboost,"High performance on structured data, handles n...","High performance, feature importance, handles ...","Sensitive to hyperparameters, can overfit"
5,Support Vector Regression,svm,"Nonlinear relationships, high-dimensional spaces","Works well in high dimensions, captures comple...","Slow for large datasets, sensitive to scaling"
6,TabPFN,tabpfn,"Small datasets, no hyperparameter tuning needed","Few-shot learning, no hyperparameter tuning, w...","Limited to smaller datasets, less interpretable"


The framework automatically evaluates multiple models and selects the best one based on the specified goal metric. This approach ensures that you don't have to manually try different algorithms to find the most suitable one for your specific dataset.

### 5.2 Feature Selection <a id="feature-selection"></a>

Feature selection is a critical step in building effective regression models. Our framework supports both automatic and manual feature selection through the Recursive Feature Elimination (RFE) method.

Feature Selection is built into the variants of train_model() where the program does not always run this process but if they set the class variable "feature_selection_autoselect" to true it will then run something similar to the below cell.

#### I believe this method will be lightly touched upon in class as this is an automated feature selection method. The problem here is it can take a really long time to run given a large number of features, rows of data, and possibly the model. Random Forest I have seen typically performs much better with datasets found in chemistry but at the cost of time. If all number of features + rows of data are the same, auto feature selection like RFE or RFECV takes much longer for random forest than it does for linear regression

In [41]:
# Demonstrate feature selection process
from sklearn.feature_selection import RFE

# Prepare data
X = train_df.drop(columns=['target']).select_dtypes(include=["number"]).fillna(0)
y = train_df['target']

# Initialize a model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Create RFE with 3 features
rfe = RFE(estimator=model, n_features_to_select=3, step=1)
rfe.fit(X, y)

# Get selected features
selected_indices = rfe.get_support(indices=True)
selected_features = X.columns[selected_indices].tolist()

# Display results
print(f"Selected features: {selected_features}")

# Create a DataFrame with feature rankings
feature_ranks = pd.DataFrame({
    'Feature': X.columns,
    'Ranking': rfe.ranking_,
    'Selected': rfe.support_
}).sort_values('Ranking')

feature_ranks

KeyError: "['target'] not found in axis"