##### Hyperparameter Tuning

Here I tune the hyperparameters of the best-performing models, using the scaled dataset

Import required packages and libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

In [3]:
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings

Read in the scaled datasets

In [4]:
scaled_X_train=pd.read_csv('data/scaled_X_train.csv')
scaled_y_train=pd.read_csv('data/scaled_y_train.csv')
scaled_X_test=pd.read_csv('data/scaled_X_test.csv')
scaled_y_test=pd.read_csv('data/scaled_y_test.csv')

Define the evaluate_model function, as in the Model notebook

In [5]:
def evaluate_model(true, predicted):
    """Return MAE, MSE, RMSE, MAPE and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it a second time
    mape = mean_absolute_percentage_error(true, predicted)
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, mape, r2_square
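
To make the five metrics concrete, here is a minimal pure-Python sketch of the same formulas on a toy example. The helper name `evaluate_model_plain` and the sample values are assumptions made for illustration; the notebook itself relies on the scikit-learn functions imported above.

```python
import math

def evaluate_model_plain(true, predicted):
    """Hypothetical stand-in for evaluate_model, written with plain Python lists."""
    n = len(true)
    errors = [t - p for t, p in zip(true, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE is returned as a fraction (not a percentage), as scikit-learn does
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, true)) / n
    mean_true = sum(true) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, mape, r2

# Illustrative values only, not taken from the flight price data
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
mae, mse, rmse, mape, r2 = evaluate_model_plain(y_true, y_pred)
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAPE={mape:.4f} R2={r2:.4f}")
```

Because MAPE comes back as a fraction, a value such as 0.013 in the result tables reads as roughly 1.3%.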

##### The hyperparameters I tune for each model:

CatBoost
* 'iterations': [100, 200], - total number of decision trees built sequentially during the training process
* 'depth': [4, 6, 8], - defining the maximum number of levels a tree can grow
* 'learning_rate': [0.05, 0.1], - scales the contribution of each newly added decision tree to the final model's prediction
* 'l2_leaf_reg': [1, 3, 5] - L2 regularization parameter that adds a penalty to the leaf values in the cost function

XGBoost
* 'n_estimators': [100, 200], - number of boosting rounds, i.e. trees built sequentially
* 'max_depth': [3, 5, 7], - maximum depth (number of levels) of each decision tree in the ensemble
* 'learning_rate': [0.05, 0.1], - after a new tree is built to correct errors, its predictions are multiplied by the learning_rate before being added to the overall model
* 'subsample': [0.8, 1.0], - fraction of the training rows randomly sampled to grow each tree; for example, 0.8 means each tree sees a random 80% of the training data
* 'colsample_bytree': [0.8, 1.0] - fraction of the features randomly selected for use in a single decision tree
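
The effect of learning_rate can be sketched with a toy boosting loop. This is a simplification under a strong assumption: each "tree" is taken to predict the current residual exactly, whereas real XGBoost trees only approximate the loss gradient. The function name `boosted_prediction` and the numbers are hypothetical.

```python
def boosted_prediction(target, initial, learning_rate, n_rounds):
    """Toy boosting loop: each 'tree' is assumed to predict the residual exactly."""
    prediction = initial
    history = [prediction]
    for _ in range(n_rounds):
        residual = target - prediction          # what the next tree is fit to
        prediction += learning_rate * residual  # shrunken contribution of that tree
        history.append(prediction)
    return history

slow = boosted_prediction(target=10.0, initial=0.0, learning_rate=0.05, n_rounds=100)
fast = boosted_prediction(target=10.0, initial=0.0, learning_rate=0.1, n_rounds=100)
print(f"after 100 rounds: lr=0.05 -> {slow[-1]:.4f}, lr=0.1 -> {fast[-1]:.4f}")
```

In this idealized setting the remaining error shrinks by a factor of (1 - learning_rate) per round, which is why a smaller learning rate usually needs more trees to reach the same fit.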

Random Forest
* 'n_estimators': [100, 200], - number of decision trees to build
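* 'max_depth': [10, 20, None], - maximum depth (number of levels) of each individual decision tree; None lets a tree grow until its leaves are pure or hold fewer than min_samples_split samples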
* 'min_samples_split': [2, 5], - minimum number of samples required to split an internal node
* 'min_samples_leaf': [1, 2], - minimum number of samples required to be present in a leaf node
* 'max_features': ['sqrt', 'log2'] - number of features considered when looking for the best split at each node (the square root or the base-2 logarithm of the total number of features)

Gradient Boosting 
* 'n_estimators': [100, 200], - number of individual decision trees that are built sequentially to form the final ensemble model
* 'max_depth': [3, 5, 7], - maximum depth or number of levels for each individual decision tree
* 'learning_rate': [0.05, 0.1], - scales the contribution of each new, weak tree added to the ensemble
* 'subsample': [0.8, 1.0], - introduces randomness by training each new tree on a random fraction of the original training data
* 'min_samples_split': [2, 5] - minimum number of samples required in an internal node for it to be considered for a further split

K-Neighbors
* 'n_neighbors': [3, 5, 7, 9] - user-defined number of closest data points (neighbors) used to predict the value of a new point
* 'weights': ['uniform', 'distance'] - how much influence each of the 'k' neighbors has on the final prediction
    * uniform:  All k neighbors contribute equally
    * distance: Each neighbor's value is weighted by the inverse of its distance to the query point (1/distance)
* 'algorithm': ['auto', 'ball_tree', 'kd_tree'], - determines the method used to compute the nearest neighbors efficiently, particularly as dataset size increases
    * auto: automatically choose
    * ball_tree: Uses the BallTree data structure
    * kd_tree: Uses the KDTree data structure
* 'leaf_size': [20, 30, 40] - leaf size passed to BallTree or KDTree; it affects the speed of building and querying the tree and the memory needed to store it
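
The difference between the two `weights` options can be shown with a small worked example. This is a sketch in plain Python, not scikit-learn's implementation; the function name `knn_predict` and the neighbor values are made up for illustration.

```python
def knn_predict(distances, values, weights="uniform"):
    """Predict from k neighbors given their distances and target values."""
    if weights == "uniform":
        w = [1.0] * len(values)
    elif weights == "distance":
        # Inverse-distance weights; assumes no neighbor sits at distance 0
        w = [1.0 / d for d in distances]
    else:
        raise ValueError(f"unknown weights: {weights}")
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

# Three hypothetical neighbors: the closest one has the smallest target value
distances = [1.0, 2.0, 4.0]
values = [10.0, 20.0, 40.0]
uniform_pred = knn_predict(distances, values, "uniform")
distance_pred = knn_predict(distances, values, "distance")
print(f"uniform: {uniform_pred:.3f}, distance: {distance_pred:.3f}")
```

With distance weighting the prediction is pulled toward the nearest neighbor's value (about 17.14 instead of 23.33 here).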

Decision Tree
* 'max_depth': [5, 10, 15, 20, None] - length of the longest path from the root node to any leaf node
* 'min_samples_split': [2, 5, 10] - the minimum number of samples an internal node must contain before it can be considered for splitting further. Used as a stopping criterion to control the growth of the tree and prevent overfitting
* 'min_samples_leaf': [1, 2, 4] - the minimum number of samples that must be present in a node for it to be considered a valid leaf node. Used to prevent overfitting by ensuring that decisions are based on a sufficient number of data points
* 'max_features': ['sqrt', 'log2', None] - number of features considered when searching for the best split; 'sqrt' and 'log2' evaluate a random subset of that size, None uses all features
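
As a rough sanity check on the size of these search spaces, the number of combinations per model can be counted from the lists above. The sketch below only uses the list lengths; the budget of 20 sampled candidates mirrors the `n_iter=20` used in the quick tuning function.

```python
from math import prod

# Number of candidate values per hyperparameter, counted from the lists above
grid_sizes = {
    "CatBoost": [2, 3, 2, 3],
    "XGBoost": [2, 3, 2, 2, 2],
    "Random Forest": [2, 3, 2, 2, 2],
    "Gradient Boosting": [2, 3, 2, 2, 2],
    "K-Neighbors": [4, 2, 3, 3],
    "Decision Tree": [5, 3, 3, 3],
}
n_iter = 20  # candidates sampled per model by the quick randomized search

totals = {name: prod(sizes) for name, sizes in grid_sizes.items()}
for name, total in totals.items():
    sampled = min(n_iter, total)
    print(f"{name:<18} {total:>4} combinations, {sampled / total:.0%} sampled")
```

So the quick search covers a bit more than half of the CatBoost grid but only a small fraction of the Decision Tree grid.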


Define the hyperparameter tuning function that runs a quick first pass over all 6 models; a finer tuning pass for the top 3 models follows

In [6]:
def flight_price_quick_hyperparam_tuning(X_train, y_train, X_test, y_test):
    
    print("="*150)
    print("STAGE 1: QUICK HYPERPARAMETER TUNING - ALL 6 MODELS")
    print("="*150)
    
    # Stage 1: Quick hyperparameter tuning
    quick_models = {
        "CatBoost": {
            "model": CatBoostRegressor(random_state=42, verbose=0),
            "params": {
                'iterations': [100, 200],
                'depth': [4, 6, 8],
                'learning_rate': [0.05, 0.1],
                'l2_leaf_reg': [1, 3, 5]
            }
        },
        "XGBoost": {
            "model": XGBRegressor(random_state=42, eval_metric='rmse'),
            "params": {
                'n_estimators': [100, 200],
                'max_depth': [3, 5, 7],
                'learning_rate': [0.05, 0.1],
                'subsample': [0.8, 1.0],
                'colsample_bytree': [0.8, 1.0]
            }
        },
        "Random Forest": {
            "model": RandomForestRegressor(random_state=42),
            "params": {
                'n_estimators': [100, 200],
                'max_depth': [10, 20, None],
                'min_samples_split': [2, 5],
                'min_samples_leaf': [1, 2],
                'max_features': ['sqrt', 'log2']
            }
        },
        "Gradient Boosting": {
            "model": GradientBoostingRegressor(random_state=42),
            "params": {
                'n_estimators': [100, 200],
                'max_depth': [3, 5, 7],
                'learning_rate': [0.05, 0.1],
                'subsample': [0.8, 1.0],
                'min_samples_split': [2, 5]
            }
        },
        "K-Neighbors": {
            "model": KNeighborsRegressor(),
            "params": {
                'n_neighbors': [3, 5, 7, 9],
                'weights': ['uniform', 'distance'],
                'algorithm': ['auto', 'ball_tree', 'kd_tree'],
                'leaf_size': [20, 30, 40]
            }
        },
        "Decision Tree": {
            "model": DecisionTreeRegressor(random_state=42),
            "params": {
                'max_depth': [5, 10, 15, 20, None],
                'min_samples_split': [2, 5, 10],
                'min_samples_leaf': [1, 2, 4],
                'max_features': ['sqrt', 'log2', None]
            }
        }
    }
    
    results_quick = []
    trained_models = {}
    
    # Quick tuning with fewer iterations
    print("\nPerforming quick hyperparameter tuning")
    for model_name, model_info in quick_models.items():
        print(f"\n  Training {model_name}", end=" ")
        
        random_search = RandomizedSearchCV(
            estimator=model_info["model"],
            param_distributions=model_info["params"],
            n_iter=20, 
            cv=3, 
            scoring='neg_mean_squared_error',
            random_state=42,
            n_jobs=-1,
            verbose=0
        )
        
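        # Flatten the single-column target to 1-D; this avoids scikit-learn's DataConversionWarning
        random_search.fit(X_train, np.ravel(y_train))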
        best_model = random_search.best_estimator_
        trained_models[model_name] = best_model
        
        # Predictions
        y_train_pred = best_model.predict(X_train)
        y_test_pred = best_model.predict(X_test)
        

        model_train_mae, model_train_mse, model_train_rmse, model_train_mape, model_train_r2 = evaluate_model(y_train, y_train_pred)
        model_test_mae, model_test_mse, model_test_rmse, model_test_mape, model_test_r2 = evaluate_model(y_test, y_test_pred)
        
        results_quick.append({
            'Model': model_name,
            'Train RMSE': model_train_rmse,
            'Train MSE': model_train_mse,
            'Train MAE': model_train_mae,
            'Train MAPE': model_train_mape,
            'Train R2': model_train_r2,
            'Test RMSE': model_test_rmse,
            'Test MSE': model_test_mse,
            'Test MAE': model_test_mae,
            'Test MAPE': model_test_mape,
            'Test R2': model_test_r2,
            'Best Params': str(random_search.best_params_)
        })
    
    # Create DataFrame for quick results
    df_results_quick = pd.DataFrame(results_quick)
    
    # Display quick tuning results
    print("\n" + "="*150)
    print("QUICK TUNING RESULTS - ALL MODELS")
    print("="*150)
    display_df = df_results_quick.drop('Best Params', axis=1)
    print(display_df.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    
    # Identify top 3 models based on Test R2
    top_3_models = df_results_quick.nlargest(3, 'Test R2')['Model'].tolist()
    
    print("\n" + "="*150)
    print("TOP 3 MODELS SELECTED FOR INTENSIVE TUNING")
    print("="*150)
    for i, model_name in enumerate(top_3_models, 1):
        model_stats = df_results_quick[df_results_quick['Model'] == model_name].iloc[0]
        print(f"{i}. {model_name}")
        print(f"   - Test R2: {model_stats['Test R2']:.4f}")
        print(f"   - Test RMSE: {model_stats['Test RMSE']:.4f}")
        print(f"   - Test MAE: {model_stats['Test MAE']:.4f}")
    
    return df_results_quick, trained_models, top_3_models
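
For readers who want to see the core of the function in isolation, here is a minimal sketch of a single randomized search. It runs on synthetic data (an assumption, so that it works without the flight price CSV files) and uses a deliberately tiny grid.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # 1-D target

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_distributions={
        "max_depth": [5, 10, None],
        "min_samples_split": [2, 5, 10],
    },
    n_iter=5,  # sample 5 of the 9 possible combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

# The score is the negated MSE (higher is better), so flip the sign to read it
best_cv_mse = -search.best_score_
print("best params:", search.best_params_)
print(f"best cross-validated MSE: {best_cv_mse:.4f}")
```

`best_estimator_` is refit on the full training data with the best parameters, which is why the function above can call `predict` on it directly.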

Define the more intensive hyperparameter tuning for the top 3 models

In [None]:

def flight_price_complete_hyperparam_tuning(X_train, y_train, X_test, y_test, top_3_models, df_results_quick):
    
    print("\n" + "="*150)
    print("STAGE 2: Complete HYPERPARAMETER TUNING - TOP 3 MODELS")
    print("="*150)
    
    intensive_models = {
        "CatBoost": {
            "model": CatBoostRegressor(random_state=42, verbose=0),
            "params": {
                'iterations': [100, 200, 300, 500],
                'depth': [4, 6, 8, 10],
                'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15],
                'l2_leaf_reg': [1, 3, 5, 7, 9],
                'border_count': [32, 64, 128],
                'subsample': [0.7, 0.8, 0.9, 1.0]
            }
        },
        "XGBoost": {
            "model": XGBRegressor(random_state=42, eval_metric='rmse'),
            "params": {
                'n_estimators': [100, 200, 300, 500],
                'max_depth': [3, 5, 7, 9, 11],
                'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15],
                'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
                'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
                'gamma': [0, 0.1, 0.3, 0.5, 1],
                'min_child_weight': [1, 3, 5, 7],
                'reg_alpha': [0, 0.01, 0.1, 1],
                'reg_lambda': [0.1, 1, 10]
            }
        },
        "Gradient Boosting": {
            "model": GradientBoostingRegressor(random_state=42),
            "params": {
                'n_estimators': [100, 200, 300, 500],
                'max_depth': [3, 5, 7, 9, 11],
                'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15],
                'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
                'min_samples_split': [2, 5, 10, 15],
                'min_samples_leaf': [1, 2, 4, 6],
                'max_features': ['sqrt', 'log2', None],
                'loss': ['squared_error', 'absolute_error', 'huber']
            }
        }
    }
    
    results_intensive = []
    best_models = {}
    
    for model_name in top_3_models:
        if model_name not in intensive_models:
            continue
            
        print(f"\n  Training {model_name}", end=" ")
        
        model_info = intensive_models[model_name]
        
        random_search = RandomizedSearchCV(
            estimator=model_info["model"],
            param_distributions=model_info["params"],
            n_iter=50,  
            cv=5,  
            scoring='neg_mean_squared_error',
            random_state=42,
            n_jobs=-1,
            verbose=0
        )
        
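        # Flatten the single-column target to 1-D; this avoids scikit-learn's DataConversionWarning
        random_search.fit(X_train, np.ravel(y_train))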
        best_model = random_search.best_estimator_
        best_models[model_name] = best_model
        
        # Predictions
        y_train_pred = best_model.predict(X_train)
        y_test_pred = best_model.predict(X_test)
        
        model_train_mae, model_train_mse, model_train_rmse, model_train_mape, model_train_r2 = evaluate_model(y_train, y_train_pred)
        model_test_mae, model_test_mse, model_test_rmse, model_test_mape, model_test_r2 = evaluate_model(y_test, y_test_pred)
        
        # Calculate improvement from quick tuning
        quick_r2 = df_results_quick[df_results_quick['Model'] == model_name]['Test R2'].values[0]
        r2_improvement = model_test_r2 - quick_r2
        
        print("Completed")
        
        results_intensive.append({
            'Model': model_name,
            'Train RMSE': model_train_rmse,
            'Train MSE': model_train_mse,
            'Train MAE': model_train_mae,
            'Train MAPE': model_train_mape,
            'Train R2': model_train_r2,
            'Test RMSE': model_test_rmse,
            'Test MSE': model_test_mse,
            'Test MAE': model_test_mae,
            'Test MAPE': model_test_mape,
            'Test R2': model_test_r2,
            'R2 Improvement': r2_improvement,
            'Best Params': str(random_search.best_params_)
        })
    
    # Create DataFrame for intensive results
    df_results_intensive = pd.DataFrame(results_intensive)
    
    # Display intensive tuning results
    print("\n" + "="*150)
    print("Complete TUNING RESULTS - TOP 3 MODELS")
    print("="*150)
    display_df_intensive = df_results_intensive.drop('Best Params', axis=1)
    print(display_df_intensive.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    
    # Display best parameters for intensive tuning
    print("\n" + "="*150)
    print("BEST HYPERPARAMETERS - Complete TUNING")
    print("="*150)
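    import ast  # local import so this cell does not depend on the import cell
    for _, row in df_results_intensive.iterrows():
        print(f"\n{row['Model']}:")
        # literal_eval parses the stringified dict without executing arbitrary code
        params = ast.literal_eval(row['Best Params'])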
        for param, value in params.items():
            print(f"  - {param}: {value}")
    
    # Display final rankings
    print("\n" + "="*150)
    print("FINAL MODEL RANKINGS (BY TEST R2)")
    print("="*150)
    final_ranking = df_results_intensive.sort_values('Test R2', ascending=False)[
        ['Model', 'Test R2', 'Test RMSE', 'Test MAE', 'Test MAPE', 'R2 Improvement']
    ]
    print(final_ranking.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    
    # Identify best overall model
    best_model_name = df_results_intensive.loc[df_results_intensive['Test R2'].idxmax(), 'Model']
    best_model_r2 = df_results_intensive['Test R2'].max()
    
    print("\n" + "="*150)
    print("BEST OVERALL MODEL")
    print("="*150)
    print(f"Model: {best_model_name}")
    print(f"Test R2: {best_model_r2:.4f}")
    print("="*150)
    
    return df_results_intensive, best_models

Run the quick hyperparam tuning

In [8]:
df_quick, trained_models_quick, top_3 = flight_price_quick_hyperparam_tuning(
    scaled_X_train, scaled_y_train, scaled_X_test, scaled_y_test
)

STAGE 1: QUICK HYPERPARAMETER TUNING - ALL 6 MODELS

Performing quick hyperparameter tuning

  Training CatBoost 
  Training XGBoost 
  Training Random Forest 

  Training Gradient Boosting 

  Training K-Neighbors 
  Training Decision Tree 
QUICK TUNING RESULTS - ALL MODELS
            Model  Train RMSE  Train MSE  Train MAE  Train MAPE  Train R2  Test RMSE  Test MSE  Test MAE  Test MAPE  Test R2
         CatBoost      0.1385     0.0192     0.1020      0.0113    0.9264     0.1561    0.0244    0.1168     0.0130   0.9084
          XGBoost      0.1447     0.0209     0.1065      0.0118    0.9197     0.1591    0.0253    0.1196     0.0133   0.9049
    Random Forest      0.1285     0.0165     0.0895      0.0099    0.9366     0.1780    0.0317    0.1250     0.0138   0.8810
Gradient Boosting      0.1428     0.0204     0.1045      0.0116    0.9218     0.1598    0.0255    0.1193     0.0132   0.9041
      K-Neighbors      0.1793     0.0321     0.1257      0.0139    0.8767     0.2026    0.0410    0.1412     0.0156   0.8459
    Decision Tree      0.1325     0.0175     0.0858      0.0095    0.9327     0.1809    0.0327    0.1225     0.0136   0.8771

TOP 3 MODELS SELECTED FOR INTENSIVE TUN

Observations
The top 3 models by test R2 score are CatBoost (0.9084), XGBoost (0.9049) and Gradient Boosting (0.9041)
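
The ranking can be reproduced from the Test R2 column of the table above. This is a plain-Python sketch of the same selection that `nlargest(3, 'Test R2')` performs in the function; the variable names are illustrative.

```python
# Test R2 values copied from the quick tuning results table
test_r2 = {
    "CatBoost": 0.9084,
    "XGBoost": 0.9049,
    "Random Forest": 0.8810,
    "Gradient Boosting": 0.9041,
    "K-Neighbors": 0.8459,
    "Decision Tree": 0.8771,
}

# Sort model names by score, highest first, and keep the first three
top_3_check = sorted(test_r2, key=test_r2.get, reverse=True)[:3]
print(top_3_check)  # ['CatBoost', 'XGBoost', 'Gradient Boosting']
```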

Run the intensive hyperparam tuning on the top 3 models from stage 1

In [9]:
df_intensive, best_models = flight_price_complete_hyperparam_tuning(
    scaled_X_train, scaled_y_train, scaled_X_test, scaled_y_test, top_3, df_quick
)


STAGE 2: Complete HYPERPARAMETER TUNING - TOP 3 MODELS

  Training CatBoost Completed

  Training XGBoost Completed

  Training Gradient Boosting Completed

Complete TUNING RESULTS - TOP 3 MODELS
            Model  Train RMSE  Train MSE  Train MAE  Train MAPE  Train R2  Test RMSE  Test MSE  Test MAE  Test MAPE  Test R2  R2 Improvement
         CatBoost      0.1346     0.0181     0.0978      0.0109    0.9305     0.1537    0.0236    0.1136     0.0126   0.9113          0.0028
          XGBoost      0.1361     0.0185     0.0988      0.0110    0.9290     0.1547    0.0239    0.1139     0.0126   0.9101          0.0053
Gradient Boosting      0.1228     0.0151     0.0852      0.0095    0.9422     0.1542    0.0238    0.1101     0.0122   0.9107          0.0066

BEST HYPERPARAMETERS - Complete TUNING

CatBoost:
  - subsample: 0.9
  - learning_rate: 0.1
  - l2_leaf_reg: 1
  - iterations: 500
  - depth: 6
  - border_count: 64

XGBoost:
  - subsample: 0.8
  - reg_lambda: 1
  - reg_alpha: 1
  - n_estimators: 300
  - min_child_weight: 3
  - max_depth: 7
  - learning_rate: 0.05
  - gamma: 0
  - colsample_bytree: 0.7

Gradient Boosting:
  - subsam

Observation

* The best model after hyperparameter tuning is CatBoost, with a test R2 of 0.9113
* Gradient Boosting (0.9107) and XGBoost (0.9101) follow closely; the three models differ by at most 0.0012 in test R2
* The best parameters found for CatBoost are
    * subsample: 0.9
    * learning_rate: 0.1
    * l2_leaf_reg: 1
    * iterations: 500
    * depth: 6
    * border_count: 64
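
The R2 gains over the quick tuning stage can be cross-checked from the two result tables. The sketch below recomputes them from the rounded, printed values, so it may differ from the reported "R2 Improvement" column by 0.0001 because that column was computed before rounding.

```python
# Test R2 values copied from the quick and complete tuning result tables
quick_r2 = {"CatBoost": 0.9084, "XGBoost": 0.9049, "Gradient Boosting": 0.9041}
tuned_r2 = {"CatBoost": 0.9113, "XGBoost": 0.9101, "Gradient Boosting": 0.9107}
reported = {"CatBoost": 0.0028, "XGBoost": 0.0053, "Gradient Boosting": 0.0066}

gains = {name: tuned_r2[name] - quick_r2[name] for name in quick_r2}
for name, gain in gains.items():
    print(f"{name:<18} recomputed {gain:.4f}, reported {reported[name]:.4f}")
```

All three gains are below 0.01, so the second tuning stage refines the models rather than changing their ranking in any substantial way.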



For the CatBoost model
* iterations is 500, the largest value in the search grid
* depth is 6, a mid-range value from the grid (4 to 10), which limits tree complexity and helps guard against overfitting
* l2_leaf_reg is 1, the smallest value in the grid (1 to 9), i.e. the weakest L2 regularization searched
* subsample is 0.9, from a grid of 0.7 to 1.0