<div style="text-align:center; font-size:36px; font-weight:bold; color:#4A4A4A; background-color:#fff6e4; padding:10px; border:3px solid #f5ecda; border-radius:6px">
    Machine Learning Template
    <p style="text-align:center; font-size:14px; font-weight:normal; color:#4A4A4A; margin-top:12px;">
        Author: Jens Bender <br> 
        May 2025
    </p>
</div>

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Introduction</h1>
</div> 

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Template Overview</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ This Jupyter Notebook file provides a comprehensive <strong>machine learning template</strong> to streamline the key stages of the machine learning workflow for tabular data:
    <ul>
        <li><strong>Data Preprocessing:</strong>
            <ul>
                <li>Load, clean, transform, and save data using <code>pandas</code> and <code>sklearn</code>.</li>
                <li>Handle duplicates, data types, missing values, and outliers.</li>
                <li>Perform train-validation-test split, feature engineering, scaling, and encoding.</li>
            </ul>
        </li>
        <li><strong>Exploratory Data Analysis (EDA):</strong>
            <ul>
                <li>Analyze descriptive statistics using <code>pandas</code> and <code>numpy</code>.</li>
                <li>Visualize distributions, correlations, and relationships using <code>seaborn</code> and <code>matplotlib</code>.</li>
            </ul>
        </li>
        <li><strong>Modeling:</strong>
            <ul>
                <li>Train baseline models and perform hyperparameter tuning for regression and classification tasks with <code>sklearn</code> and <code>xgboost</code>.</li>
                <li>Evaluate model performance for regression (RMSE, MAPE, R-squared) and classification (accuracy, precision, recall, F1-score).</li>
                <li>Visualize feature importance, show model prediction examples, and save the final model.</li>
            </ul>
        </li>
    </ul>
    This template provides a flexible, customizable foundation for various datasets and use cases, making it an ideal starting point for efficient and reproducible machine learning projects. It is specifically tailored to structured tabular data (e.g., .csv, .xls, or SQL tables) using Pandas and Scikit-learn. It is not optimized for text, image, or time series data, which require specialized preprocessing, models, and tools (e.g., TensorFlow, PyTorch).
</div>

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Project Overview</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>💡 Example: Predicting Rental Prices</strong> <br>
    This is an illustrative example of a project overview designed to help you write your own.
</div> 

**Summary**  
This project aims to build a machine learning model to predict rental prices for properties in Berlin using historical data from property listings. The model will enable stakeholders like landlords, tenants, investors, and market analysts to make data-driven decisions about rental prices, property investments, and market analysis through automated property evaluations.

**Problem**  
Determining rental prices is challenging due to the complex interactions between the many factors involved, such as property characteristics (e.g., size, bedrooms, amenities), location (e.g., neighborhood, access to public transport), and market dynamics (e.g., trends in demand, seasonality). Traditional estimation methods can be subjective and inconsistent. Machine learning offers enhanced predictive capability by capturing non-linear patterns and intricate dependencies in property data, enabling more accurate rental price predictions.

**Objectives**  
- Develop a machine learning model to accurately predict rental prices based on information from property listings.
- Compare different models (e.g., Linear Regression, Random Forest, XGBoost) using suitable evaluation metrics (e.g., RMSE, MAPE, R²).
- Identify key price drivers through feature importance analysis.

**Value Proposition**  
This project provides actionable, data-driven insights for key stakeholders:
- Landlords: Set competitive rental prices to attract tenants while maximizing returns.
- Tenants: Identify reasonable and fair rental rates.
- Investors: Evaluate property investment opportunities based on potential rental income and returns.
- Market Analysts: Gain insights into rental price trends, key influencing factors, and market dynamics.

**Business Goals**  
- Increase rental revenue by 5%-10% within 12 months of model deployment: By accurately predicting competitive rental prices, landlords can optimize their pricing strategies and reduce vacancy rates.  
- Cut time spent on rental price evaluation by 30%-40%: Automate the price-setting process with the model, reducing the time needed for landlords, property managers, and investors to evaluate property pricing compared to traditional manual methods.

**Data**  
The dataset contains historical information from rental properties in Berlin listed on Zillow between 2023-09-01 and 2024-08-31, provided in a single `.csv` file.

Dataset Statistics:
- Dataset size: 44,300 records
- Target variable: Monthly rent
- Features: 9  

Data Overview Table:
| Column | Description | Storage Type | Semantic Type | Theoretical Range | Observed Range |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Monthly Rent | Rental price per month in euros | Integer | Numerical | [0, ∞) | [600, 3100] |
| Size | Living space in square meters | Float | Numerical | [0, ∞) | [30, 240] |
| Bedrooms | Number of bedrooms | Integer | Numerical | [0, ∞) | [1, 6] |
| Bathrooms | Number of bathrooms | Integer | Numerical | [0, ∞) | [1, 4] |
| Property Type | Category of property | String | Categorical (Nominal) | Any property type [e.g., "Apartment", "House"] | 8 unique types |
| Neighborhood | District within Berlin | String | Categorical (Nominal) | Any Berlin district [e.g., "Charlottenburg"] | 12 unique districts |
| Built Year | Year of construction | Integer | Numerical | [1600, 2024] | [1904, 2024] |
| Furnishing | Whether property is furnished | String | Categorical (Binary) | ["Yes", "No"] | ["Yes", "No"] |
| Heating | Type of heating system | String | Categorical (Nominal) | Any heating type [e.g., "Gas", "Electric"] | 8 unique types |
| Amenities | Available features (comma-separated list) | String | Categorical | Any combination of amenities [e.g., "Balcony, Elevator"] | 34 unique amenities |

Example Data:
| Size (m²) | Bedrooms | Bathrooms | Property Type | Neighborhood | Built Year | Furnishing | Heating | Amenities | Monthly Rent (€) |
| :-------- | :------- | :-------- | :----------- | :----------- | :--------- | :--------- | :------ | :-------- | :-------------- |
| 80 | 2 | 1 | Apartment | Prenzlauer Berg | 1990 | Yes | Gas | Balcony, Elevator, River View | 1200 |
| 120 | 3 | 2 | House | Charlottenburg | 2005 | No | Electric | Garden, Parking, Newly Renovated | 2500 |

**Technical Requirements**  
- Data Preprocessing:
  - Load, clean, transform, and save data using `pandas` and `sklearn`.
  - Handle duplicates, data types, missing values, and outliers.
  - Perform train-validation-test split, feature engineering, scaling, and encoding.
- Exploratory Data Analysis (EDA):
  - Analyze descriptive statistics using `pandas` and `numpy`.
  - Visualize distributions, correlations, and relationships using `seaborn` and `matplotlib`.
- Modeling:
  - Train baseline models and perform hyperparameter tuning for the regression task with `sklearn` and `xgboost`.
  - Baseline models: Linear Regression, K-Nearest Neighbors, Support Vector Machine, Random Forest, Multi-Layer Perceptron, XGBoost.
  - Evaluate model performance using Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-squared (R²).
    - Success criteria: Maximum RMSE of 1200, maximum MAPE of 0.15, and minimum R² of 0.80 on the test data.
  - Visualize feature importance, show model prediction examples, and save the final model with `pickle`.
- Deployment:
  - Expose the final model via a REST API (e.g., Flask, FastAPI) for easy integration with existing platforms.  
  - Implement batch processing capabilities to deliver predictions for up to 1M data points in under 2 minutes.  
  - Deploy on cloud infrastructure (e.g., AWS, Microsoft Azure, Google Cloud Platform) to ensure scalability.
  - Set up model performance monitoring and data drift detection.
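
The success criteria listed under Modeling can be checked programmatically once test predictions are available. The following is a minimal sketch using NumPy. The function names `regression_metrics` and `meets_success_criteria` and the toy arrays are illustrative assumptions, not part of the template, and MAPE is expressed as a fraction (0.15 = 15%).

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAPE (as a fraction), and R² for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mape = float(np.mean(np.abs(errors / y_true)))  # Assumes no zero targets
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mape": mape, "r2": r2}


def meets_success_criteria(metrics, max_rmse=1200, max_mape=0.15, min_r2=0.80):
    """Return True only if all three thresholds from the project overview are met."""
    return (
        metrics["rmse"] <= max_rmse
        and metrics["mape"] <= max_mape
        and metrics["r2"] >= min_r2
    )


# Toy values for illustration only (not real results)
y_test_example = [1000, 1500, 2000, 2500]
y_pred_example = [1100, 1400, 2100, 2400]
metrics = regression_metrics(y_test_example, y_pred_example)
print(metrics, meets_success_criteria(metrics))
```

In the notebook itself, the same numbers would normally come from the `sklearn.metrics` functions imported in the Setup section; the manual formulas are shown only to make the definitions explicit.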

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Setup</h1>
</div> 

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Imports</h2>
</div>

In [None]:
# Data Processing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression, ElasticNet
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVR, SVC
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
from xgboost import XGBRegressor, XGBClassifier
import xgboost as xgb
import pickle

# Evaluation Metrics: Regression
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, r2_score
# Evaluation Metrics: Classification
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    roc_auc_score,
    precision_recall_curve,
    auc,
    classification_report, 
    confusion_matrix, 
    ConfusionMatrixDisplay
)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Environment Variables</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>🔒 Security Note:</strong> Setting environment variables is optional, but it is recommended if you store sensitive information (such as API keys or database credentials) in a <code>.env</code> file. Using environment variables helps keep such information secure and separate from your codebase.
</div>

In [None]:
# Imports
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get API key from .env 
api_key = os.getenv("API_KEY")

# Get SQL database credentials from .env
sql_username = os.getenv("SQL_USERNAME")
sql_password = os.getenv("SQL_PASSWORD")
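
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in the `.env` file may only surface later, for example as a failed database connection. A minimal sketch of a fail-fast check is shown here; the helper name `get_required_env` and the variable `EXAMPLE_API_KEY` are hypothetical.

```python
import os


def get_required_env(names):
    """Return the requested environment variables, failing fast if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Placeholder value set in-process purely for demonstration
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
config = get_required_env(["EXAMPLE_API_KEY"])
print(sorted(config))
```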

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Data Loading and Inspection</h1>
</div>

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">CSV</h2>
</div>  

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Load data from a <code>.csv</code> file into a Pandas DataFrame.
</div>

In [None]:
try:
    df = pd.read_csv("data/your_csv_file_here.csv")
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except pd.errors.EmptyDataError:
    print("Error: The file is empty.")
except pd.errors.ParserError:
    print("Error: The file content could not be parsed as a CSV.")
except PermissionError:
    print("Error: Permission denied when accessing the file.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">MySQL</h2>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Load data from a MySQL database table into a Pandas DataFrame.
</div>

In [None]:
# Imports 
from sqlalchemy import create_engine, exc

# Database info
mysql_host = "localhost"  # Default hostname for a MySQL server running locally
mysql_port = 3306  # Default port for MySQL
mysql_database_name = "your_mysql_database_name_here"
mysql_table_name = "your_mysql_table_name_here"

try:
    # Create an SQLAlchemy engine for interacting with the MySQL database
    engine = create_engine(f"mysql+mysqlconnector://{sql_username}:{sql_password}@{mysql_host}:{mysql_port}/{mysql_database_name}")
    
    # Load data from MySQL into DataFrame
    with engine.connect() as connection:
        df = pd.read_sql(f"SELECT * FROM {mysql_table_name}", con=connection)
    print("Data loaded successfully.")

except exc.SQLAlchemyError as e:
    print(f"Database error occurred: {e}")
    
except Exception as e:
    print(f"An unexpected error occurred: {e}")

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Initial Data Inspection</h2>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Basic exploration of the dataset to understand its structure and detect obvious issues.
</div>

In [None]:
# Show DataFrame info to check the number of rows and columns, data types and missing values
df.info()

In [None]:
# Show top five rows to get a sense of the data
df.head()

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Data Preprocessing</h1>
</div> 

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Standardizing Names and Labels</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Column Names</strong> <br>
    📌 Convert all column names to snake_case for consistency, improved readability, and to minimize the risk of errors.
</div>

In [None]:
# Convert column names to snake_case
df.columns = (
    df.columns
    .str.strip()  # Remove leading/trailing spaces
    .str.lower()  # Convert to lowercase
    .str.replace(r"[-/\s]+", "_", regex=True)  # Replace runs of spaces, hyphens, and slashes with "_"
)
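
As a quick sanity check outside pandas, the same kind of conversion can be sketched with the standard `re` module. The helper `to_snake_case` and the sample names are hypothetical; the pattern used here collapses each run of spaces, hyphens, or slashes into a single underscore and leaves other characters untouched.

```python
import re


def to_snake_case(name):
    """Strip, lowercase, and replace runs of spaces, hyphens, or slashes with '_'."""
    return re.sub(r"[-/\s]+", "_", name.strip().lower())


# Sample column names for illustration only
sample_names = [" Monthly Rent ", "Built-Year", "Property/Type", "Size  (m2)"]
print([to_snake_case(name) for name in sample_names])
```

Characters such as parentheses survive this conversion, so column names containing units or symbols may need an additional cleanup step.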

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Categorical Labels</strong> <br>
    📌 Convert all categorical labels to snake_case for consistency, improved readability, and to minimize the risk of errors.
</div>

In [None]:
def standardize_categorical_labels(categorical_label):
    return (
        categorical_label
        .strip()  # Remove leading/trailing spaces
        .lower()  # Convert to lowercase
        .replace("-", "_")  # Replace hyphens with "_"
        .replace("/", "_")  # Replace slashes with "_"
        .replace(" ", "_")  # Replace spaces with "_"
    )


# Define categorical columns to standardize labels
columns_to_standardize = ["categorical_column_1", "categorical_column_2", "categorical_column_3"]

# Apply standardization of categorical labels
for column in columns_to_standardize:
    df[column] = df[column].apply(standardize_categorical_labels)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Handling Duplicates</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove duplicates based on all columns.
</div>

In [None]:
# Identify duplicates based on all columns
df.duplicated().value_counts()

In [None]:
# Remove duplicates
df = df.drop_duplicates().copy()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove duplicates based on the ID column.
</div>

In [None]:
# Identify duplicates based on the ID column
df.duplicated(["id"]).value_counts()

In [None]:
# Remove duplicates
df = df.drop_duplicates(["id"]).copy()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove duplicates based on a combination of specific columns.
</div>

In [None]:
# Identify duplicates based on a combination of specific columns
df.duplicated(["column_1", "column_2", "column_3"]).value_counts()

In [None]:
# Remove duplicates
df = df.drop_duplicates(["column_1", "column_2", "column_3"]).copy()
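
Before removing duplicates, it can help to look at the affected rows. The sketch below uses toy data (the column names and values are made up) to show two details of the pandas API: `keep=False` marks every member of a duplicate group, and `drop_duplicates` keeps the first occurrence by default.

```python
import pandas as pd

# Toy data for illustration only
example_df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "rent": [900, 1200, 1250, 1500],
})

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_groups = example_df[example_df.duplicated(["id"], keep=False)]
print(duplicate_groups)

# drop_duplicates keeps the first occurrence by default (keep="first")
deduplicated = example_df.drop_duplicates(["id"])
print(len(example_df) - len(deduplicated), "row(s) removed")
```

Which occurrence is the right one to keep depends on the data; if the rows differ in other columns, sorting before deduplicating makes the choice explicit.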

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Handling Data Types</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and convert incorrect storage data types.
</div>

In [None]:
# Identify storage data types
df.dtypes

In [None]:
# Convert column from str to int (nullable Int32 keeps missing values as <NA>)
df["int_column"] = pd.to_numeric(df["str_column"]).astype("Int32")

In [None]:
# Convert column from str to datetime
df["datetime_column"] = pd.to_datetime(df["str_column"])

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify object columns with two unique categories and convert them to boolean columns.
</div>

In [None]:
# Identify object columns with two unique categories 
df.select_dtypes(include=["object"]).nunique()

In [None]:
# Convert column from object to boolean
df["bool_column"] = df["object_column"].map({"category_1": True, "category_2": False})
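
With `Series.map` and a dictionary, any label that is not a key in the dictionary becomes a missing value. A minimal sketch of a check for such labels follows; the toy column and the label "Maybe" are assumptions for illustration.

```python
import pandas as pd

# Toy column for illustration; "Maybe" stands in for an unexpected label
object_column = pd.Series(["Yes", "No", "Maybe", "Yes"])
mapping = {"Yes": True, "No": False}

bool_column = object_column.map(mapping)

# Labels missing from the mapping become NaN, so list them before relying on the result
unmapped_labels = object_column[bool_column.isna()].unique().tolist()
print(unmapped_labels)
```

If the list is empty and the column has no missing values, the result can be cast with `.astype(bool)` to obtain a true boolean column.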

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Train-Validation-Test Split</h2>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The ideal split depends on the dataset size and the task. A general guideline is:
    <table style="margin-left:0; margin-top:15px; border-collapse: collapse;">
        <thead>
            <tr>
                <th style="text-align:left; background-color:#e8f4fd; padding:8px;">Dataset Size</th>
                <th style="text-align:left; background-color:#e8f4fd; padding:8px;">Training Set</th>
                <th style="text-align:left; background-color:#e8f4fd; padding:8px;">Validation Set</th>
                <th style="text-align:left; background-color:#e8f4fd; padding:8px;">Test Set</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td style="background-color:#e8f4fd; text-align:left; padding:8px;">Smaller datasets (<10,000)</td>
                <td style="background-color:#e8f4fd; text-align:left; padding:8px;">70%</td>
                <td style="background-color:#e8f4fd; text-align:left; padding:8px;">15%</td>
                <td style="background-color:#e8f4fd; text-align:left; padding:8px;">15%</td>
            </tr>
            <tr>
                <td style="background-color:#d0e7fa; text-align:left; padding:8px;">Larger datasets (≥10,000)</td>
                <td style="background-color:#d0e7fa; text-align:left; padding:8px;">80%</td>
                <td style="background-color:#d0e7fa; text-align:left; padding:8px;">10%</td>
                <td style="background-color:#d0e7fa; text-align:left; padding:8px;">10%</td>
            </tr>
        </tbody>
    </table>
</div>

In [None]:
# Split the data into X features and y target
X = df.drop("target", axis=1)
y = df["target"]

In [None]:
# Split the data into training and temporary sets (70% train, 30% temporary)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split the temporary data into validation and test sets (50% each)
# Note: This accomplishes a 70% training, 15% validation and 15% test set size
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Delete the temporary data to free up memory
del X_temp, y_temp
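
To adapt the two-stage split to other ratios, such as the 80/10/10 guideline for larger datasets, the two `test_size` values can be derived from the target fractions. This is a small sketch; the helper name `two_stage_test_sizes` is hypothetical.

```python
def two_stage_test_sizes(train_fraction, val_fraction, test_fraction):
    """Return the `test_size` values for two consecutive train_test_split calls."""
    total = train_fraction + val_fraction + test_fraction
    if abs(total - 1.0) > 1e-9:
        raise ValueError("Fractions must sum to 1.")
    first_test_size = val_fraction + test_fraction  # Share held out as temporary data
    second_test_size = test_fraction / first_test_size  # Share of temporary data used for testing
    return first_test_size, second_test_size


print(two_stage_test_sizes(0.70, 0.15, 0.15))
print(two_stage_test_sizes(0.80, 0.10, 0.10))
```

For an 80/10/10 split, the first call to `train_test_split` would use a `test_size` of about 0.2 and the second a `test_size` of about 0.5.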

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Engineering New Features</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Derive a new feature from a raw text column or a categorical text column.
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">From Raw Text</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Derive a categorical, numerical, or boolean feature from a raw text column, which contains unstructured text data stored as strings.
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Categorical Feature</strong> <br>
    📌 Extract a categorical feature from a raw text column.
</div>

In [None]:
# Function to extract a category from a string   
def extract_category_from_string(string):
    # Map categories to their corresponding list of keywords
    category_keywords_map = {
        "Category 1": ["Keyword 1", "Keyword 2", "Keyword 3"],
        "Category 2": ["Keyword 4", "Keyword 5", "Keyword 6"],
        "Category 3": ["Keyword 7", "Keyword 8", "Keyword 9"]
    }

    # Loop through each category and its associated keywords 
    for category, keywords in category_keywords_map.items():
        # Check if any keyword is present in the string
        if any(keyword in string for keyword in keywords):
            return category  # Return the category corresponding to the keyword
    return np.nan  # Return a missing value if no keyword matches

# Apply function on an existing string column to create a new categorical feature column
X_train["categorical_feature"] = X_train["str_column"].apply(extract_category_from_string)
X_val["categorical_feature"] = X_val["str_column"].apply(extract_category_from_string)
X_test["categorical_feature"] = X_test["str_column"].apply(extract_category_from_string)

# Show category frequencies
print(X_train["categorical_feature"].value_counts())

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Numerical Feature</strong> <br>
    📌 Extract a numerical feature from a raw text column.
</div>

In [None]:
# Imports
import re

# Function to extract the first number in a string 
def extract_number_from_string(string):
    first_number = re.search(r"(?<!\w)-?\d+(?:[.,]\d+)?\b", string)  # Searches for the first integer or float (positive or negative; decimal separator "." or ",")
    if first_number:
        return float(first_number.group().replace(",", "."))  # Replace "," with "." as decimal separator  
    else:
        return np.nan  # Return a missing value if no number in string

# Apply function on an existing string column to create a new numerical feature column
X_train["numerical_feature"] = X_train["str_column"].apply(extract_number_from_string)
X_val["numerical_feature"] = X_val["str_column"].apply(extract_number_from_string)
X_test["numerical_feature"] = X_test["str_column"].apply(extract_number_from_string)

# Show descriptive statistics of numerical feature
X_train["numerical_feature"].describe()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Boolean Feature</strong> <br>
    📌 Extract a boolean feature from a raw text column.
</div>

In [None]:
# List of keywords to determine if the feature is present or absent
keywords = ["keyword 1", "keyword 2", "keyword 3"]

# Extract boolean feature column: True if any keyword is found in the string column
X_train["boolean_feature"] = X_train["str_column"].apply(lambda x: any(keyword.lower() in x.lower() for keyword in keywords))
X_val["boolean_feature"] = X_val["str_column"].apply(lambda x: any(keyword.lower() in x.lower() for keyword in keywords))
X_test["boolean_feature"] = X_test["str_column"].apply(lambda x: any(keyword.lower() in x.lower() for keyword in keywords))

# Show frequencies of boolean feature
print(X_train["boolean_feature"].value_counts())

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">From Categorical Text</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Derive a new feature from a categorical text column, which contains a predefined set of unique categories represented as strings.
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Target Encoding</strong> <br>
    📌 Derive a numerical feature from a categorical text column by encoding categories based on the mean of the target variable. This method is especially useful for categorical columns with high cardinality (i.e., a large number of unique categories).
</div>

In [None]:
# Merge X_train and y_train
df_train = pd.concat([X_train, y_train], axis=1)

# Calculate mean target by category and global mean based on the training data
mean_target_by_category = df_train.groupby("str_column")["target_variable"].mean()
mean_target = df_train["target_variable"].mean()

# Replace string categories with corresponding mean target in training, validation, and test data (use global mean for unseen categories) 
X_train["numerical_feature"] = X_train["str_column"].map(mean_target_by_category).fillna(mean_target)
X_val["numerical_feature"] = X_val["str_column"].map(mean_target_by_category).fillna(mean_target)
X_test["numerical_feature"] = X_test["str_column"].map(mean_target_by_category).fillna(mean_target)
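
The following toy example walks through the same steps on made-up data to show how a category that never occurs in the training set falls back to the global mean. The column names `neighborhood` and `monthly_rent` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy training data for illustration only
train = pd.DataFrame({
    "neighborhood": ["mitte", "mitte", "spandau", "spandau"],
    "monthly_rent": [2000, 1800, 1000, 1200],
})
# Validation data containing a category that never occurs in training
val = pd.DataFrame({"neighborhood": ["mitte", "pankow"]})

# Statistics are computed from the training data only
category_means = train.groupby("neighborhood")["monthly_rent"].mean()
global_mean = train["monthly_rent"].mean()

# "mitte" receives its training mean, the unseen "pankow" receives the global mean
val["neighborhood_encoded"] = val["neighborhood"].map(category_means).fillna(global_mean)
print(val)
```

Computing the means from the training data alone keeps validation and test targets out of the features.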

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Tiering</strong> <br>
    📌 Group categories into predefined tiers based on domain-specific factors that are relevant for predicting the target variable using subject matter expertise. This method is useful for improving predictive performance and handling categorical columns with high cardinality (i.e., a large number of unique categories). <br><br>

💡 Example: Categorize 300+ cities into 3 tiers based on factors such as employment opportunities, cost of living, and population density.
</div>

In [None]:
def derive_tier(category):
    # Define a dictionary mapping categories to their respective tiers
    tier_map = {
        "category_1": "tier_1", "category_2": "tier_1", "category_3": "tier_1", 
        "category_4": "tier_1", "category_5": "tier_1", "category_6": "tier_1", 
        "category_7": "tier_1", "category_8": "tier_1", "category_9": "tier_1", 
        "category_10": "tier_1",
        
        "category_11": "tier_2", "category_12": "tier_2", "category_13": "tier_2",
        "category_14": "tier_2", "category_15": "tier_2", "category_16": "tier_2",
        "category_17": "tier_2", "category_18": "tier_2", "category_19": "tier_2",
        "category_20": "tier_2",
        
        "category_21": "tier_3", "category_22": "tier_3", "category_23": "tier_3",
        "category_24": "tier_3", "category_25": "tier_3", "category_26": "tier_3",
        "category_27": "tier_3", "category_28": "tier_3", "category_29": "tier_3",
        "category_30": "tier_3"
        # Add more categories as needed...
    }

    # Return the tier based on the category (default to "tier_2" for unknown categories)
    return tier_map.get(category, "tier_2")

# Apply the function to create the tier feature in training, validation, and test data
X_train["tier"] = X_train["str_column"].apply(derive_tier)
X_val["tier"] = X_val["str_column"].apply(derive_tier)
X_test["tier"] = X_test["str_column"].apply(derive_tier)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Defining Semantic Types</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Define semantic column types (numerical, categorical, boolean) for downstream tasks like additional preprocessing steps, exploratory data analysis, and machine learning.
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Manual</strong> <br>
    📌 Option 1: Define semantic column types manually for small datasets or when you have specific requirements and need precise control.
</div>

In [None]:
# Identify storage data types
print(X_train.dtypes)
print(y_train.dtype)

In [None]:
# Define semantic column types manually
numerical_columns = ["numerical_column_1", "numerical_column_2", "numerical_column_3"]
categorical_columns = ["categorical_column_1", "categorical_column_2", "categorical_column_3"]
boolean_columns = ["boolean_column_1", "boolean_column_2", "boolean_column_3"]

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Programmatic</strong> <br>
    📌 Option 2: Define semantic column types programmatically for large datasets or when you need to automate the process, ensuring efficiency and scalability.
</div>

In [None]:
# Merge X_train and y_train
df_train = pd.concat([X_train, y_train], axis=1)

# Define semantic column types programmatically based on storage data types
numerical_columns = df_train.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_columns = df_train.select_dtypes(include=["object"]).columns.tolist()
boolean_columns = df_train.select_dtypes(include=["bool"]).columns.tolist()
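
When the target is merged into the training frame, a purely dtype-based selection also places the target in one of the column lists. The sketch below shows one possible variant on toy data that drops the target first and selects all numeric dtypes with `include="number"`; the frame, the column names, and `target_column` are assumptions for illustration.

```python
import pandas as pd

# Toy training frame for illustration; "monthly_rent" plays the role of the target
df_example = pd.DataFrame({
    "size": [80.0, 120.0],
    "bedrooms": [2, 3],
    "neighborhood": ["mitte", "spandau"],
    "furnished": [True, False],
    "monthly_rent": [1200, 2500],
})
target_column = "monthly_rent"

# Select feature columns only, so the target does not end up in a feature list
features = df_example.drop(columns=[target_column])
numerical_columns = features.select_dtypes(include="number").columns.tolist()
boolean_columns = features.select_dtypes(include=["bool"]).columns.tolist()
# Everything that is neither numeric nor boolean is treated as categorical here
categorical_columns = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(numerical_columns, categorical_columns, boolean_columns)
```

Storage types do not always match semantic types (an integer-coded category is still categorical), so the resulting lists are worth reviewing by hand.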

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Handling Missing Values</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Identification</strong> <br>
    📌 Identify missing values.
</div>

In [None]:
# Identify missing values in training, validation, and test data
print("Training Data - Features:")
print(X_train.isnull().sum())
print("\nTraining Data - Target Variable:")
print(y_train.isnull().sum())

print("\nValidation Data - Features:")
print(X_val.isnull().sum())
print("\nValidation Data - Target Variable:")
print(y_val.isnull().sum())

print("\nTest Data - Features:")
print(X_test.isnull().sum())
print("\nTest Data - Target Variable:")
print(y_test.isnull().sum())

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Imputation</strong> <br>
    📌 Impute missing values in a numerical column using the median.
</div>

In [None]:
# Descriptive statistics of numerical column
X_train["numerical_column"].describe()

In [None]:
# Calculate median from training data
median = X_train["numerical_column"].median()

# Impute median in training, validation, and test data
X_train["numerical_column"] = X_train["numerical_column"].fillna(median)
X_val["numerical_column"] = X_val["numerical_column"].fillna(median)
X_test["numerical_column"] = X_test["numerical_column"].fillna(median)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Impute missing values in a categorical column using the mode (most frequent value). 
</div>

In [None]:
# Frequencies of categorical column
X_train["categorical_column"].value_counts()

In [None]:
# Calculate mode from training data
mode = X_train["categorical_column"].mode()[0]

# Impute mode in training, validation, and test data
X_train["categorical_column"] = X_train["categorical_column"].fillna(mode)
X_val["categorical_column"] = X_val["categorical_column"].fillna(mode)
X_test["categorical_column"] = X_test["categorical_column"].fillna(mode)
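
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The same "fit on training data, apply everywhere" logic can also be expressed with <code>SimpleImputer</code> from <code>sklearn</code>. This is a sketch of an alternative, not the method used above; the toy DataFrames and column names (<code>num</code>, <code>cat</code>) are assumptions for illustration.
</div>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for X_train and X_val
X_train_toy = pd.DataFrame({"num": [1.0, 2.0, np.nan, 10.0], "cat": ["a", "b", "a", np.nan]})
X_val_toy = pd.DataFrame({"num": [np.nan, 5.0], "cat": [np.nan, "b"]})

# Median imputation for the numerical column, mode imputation for the categorical column
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training data only, then reuse them for validation data
X_train_toy[["num"]] = median_imputer.fit_transform(X_train_toy[["num"]])
X_val_toy[["num"]] = median_imputer.transform(X_val_toy[["num"]])
X_train_toy[["cat"]] = mode_imputer.fit_transform(X_train_toy[["cat"]])
X_val_toy[["cat"]] = mode_imputer.transform(X_val_toy[["cat"]])

print(X_val_toy)
```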

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Deletion</strong> <br>
    📌 Delete rows with missing values in any column.
</div>

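In [None]:
# Delete rows where any column has a missing value 
X_train.dropna(inplace=True)
X_val.dropna(inplace=True)
X_test.dropna(inplace=True)

# Keep targets aligned with the remaining rows (assumes X and y share the same index)
y_train = y_train.loc[X_train.index]
y_val = y_val.loc[X_val.index]
y_test = y_test.loc[X_test.index]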

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Delete rows with missing values in specific columns.
</div>

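In [None]:
# Delete rows where either column_1 or column_2 has a missing value 
X_train.dropna(subset=["column_1", "column_2"], how="any", inplace=True)
X_val.dropna(subset=["column_1", "column_2"], how="any", inplace=True)
X_test.dropna(subset=["column_1", "column_2"], how="any", inplace=True)

# Keep targets aligned with the remaining rows (assumes X and y share the same index)
y_train = y_train.loc[X_train.index]
y_val = y_val.loc[X_val.index]
y_test = y_test.loc[X_test.index]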

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Handling Outliers</h2>
</div>

In [None]:
# Imports for creating a custom transformer class to handle outliers
from sklearn.base import BaseEstimator, TransformerMixin

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">3SD Method</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove univariate outliers in numerical columns by applying the 3 standard deviation (SD) rule. Specifically, a data point is considered an outlier if it falls more than 3 standard deviations above or below the mean of the column. 
</div>

In [None]:
# Create a custom transformer class to identify and remove outliers using the 3SD method
class OutlierRemover3SD(BaseEstimator, TransformerMixin):
    def fit(self, df, numerical_columns):
        # Convert single column string to list
        if isinstance(numerical_columns, str):
            self.numerical_columns_ = [numerical_columns]
        else:
            self.numerical_columns_ = numerical_columns
            
        # Calculate statistics (mean, standard deviation, cutoff values) for each column
        self.stats_ = pd.DataFrame(index=self.numerical_columns_)
        self.stats_["mean"] = df[self.numerical_columns_].mean()
        self.stats_["sd"] = df[self.numerical_columns_].std()
        self.stats_["lower_cutoff"] = self.stats_["mean"] - 3 * self.stats_["sd"]
        self.stats_["upper_cutoff"] = self.stats_["mean"] + 3 * self.stats_["sd"]
        
        # Create masks for filtering outliers 
        self.masks_ = (df[self.numerical_columns_] >= self.stats_["lower_cutoff"]) & (df[self.numerical_columns_] <= self.stats_["upper_cutoff"])  # masks by column
        self.final_mask_ = self.masks_.all(axis=1)  # single mask across all columns
     
        # Calculate number of outliers
        self.stats_["outliers"] = (~self.masks_).sum()  # by column
        self.outliers_ = (~self.final_mask_).sum()  # across all columns
        
        # Show outliers across all columns
        if len(self.numerical_columns_) == 1:
            print(f"\nIdentified {self.outliers_} rows ({self.outliers_ / len(self.final_mask_) * 100:.1f}%) with outliers in the '{self.numerical_columns_[0]}' column.")
        else:
            print(f"\nIdentified {self.outliers_} rows ({self.outliers_ / len(self.final_mask_) * 100:.1f}%) with outliers in one or more numerical columns.")
 
        return self

    def transform(self, df):
        # Create masks for new df
        masks = (df[self.numerical_columns_] >= self.stats_["lower_cutoff"]) & (df[self.numerical_columns_] <= self.stats_["upper_cutoff"])  # masks by column
        final_mask = masks.all(axis=1)  # single mask across all columns
        
        # Remove outliers based on the final mask
        print(f"Removed {(~final_mask).sum()} rows ({(~final_mask).sum() / len(final_mask) * 100:.1f}%) with outliers.")
        return df[final_mask]

    def fit_transform(self, df, numerical_columns):
        # Perform both fit and transform 
        return self.fit(df, numerical_columns).transform(df)


# Initialize outlier remover 
outlier_remover_3sd = OutlierRemover3SD()

# Fit outlier remover to training data
outlier_remover_3sd.fit(X_train, numerical_columns)

# Show mean, sd, cutoff values, and outliers by column for training data
print("\nOutliers by column:")
round(outlier_remover_3sd.stats_, 2)

In [None]:
# Remove outliers
print("Training Data:")
X_train_no_outliers = outlier_remover_3sd.transform(X_train)
print("\nValidation Data:")
X_val_no_outliers = outlier_remover_3sd.transform(X_val)
print("\nTest Data:")
X_test_no_outliers = outlier_remover_3sd.transform(X_test)
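
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ To make the 3SD rule concrete, the following sketch computes the cutoffs by hand for a single hypothetical column and applies them to new values. The data and variable names are assumptions for illustration. Note that the extreme value itself inflates the standard deviation, so the rule becomes less sensitive when outliers are large.
</div>

```python
import pandas as pd

# Hypothetical toy data: 20 typical values plus one extreme value
train_values = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])
new_values = pd.Series([11.0, 9.0, 80.0])

# Cutoffs are learned from the training values only
mean = train_values.mean()
sd = train_values.std()  # sample standard deviation (ddof=1), the pandas default
lower_cutoff = mean - 3 * sd
upper_cutoff = mean + 3 * sd

# Flag values outside [lower_cutoff, upper_cutoff]
train_is_outlier = (train_values < lower_cutoff) | (train_values > upper_cutoff)
new_is_outlier = (new_values < lower_cutoff) | (new_values > upper_cutoff)

print(f"Cutoffs: [{lower_cutoff:.2f}, {upper_cutoff:.2f}]")
print(f"Training outliers: {train_is_outlier.sum()}, new outliers: {new_is_outlier.sum()}")
```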

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">1.5 IQR Method</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove univariate outliers in numerical columns using the 1.5 interquartile range (IQR) rule. Specifically, a data point is considered an outlier if it falls more than 1.5 interquartile ranges above the third quartile (Q3) or below the first quartile (Q1) of the column. 
</div>

In [None]:
# Create a custom transformer class to identify and remove outliers using the 1.5 IQR method
class OutlierRemoverIQR(BaseEstimator, TransformerMixin):
    def fit(self, df, numerical_columns):
        # Convert single column string to list
        if isinstance(numerical_columns, str):
            self.numerical_columns_ = [numerical_columns]
        else:
            self.numerical_columns_ = numerical_columns
        
        # Calculate statistics (first quartile, third quartile, interquartile range, cutoff values) for each column
        self.stats_ = pd.DataFrame(index=self.numerical_columns_)
        self.stats_["Q1"] = df[self.numerical_columns_].quantile(0.25)
        self.stats_["Q3"] = df[self.numerical_columns_].quantile(0.75)
        self.stats_["IQR"] = self.stats_["Q3"] - self.stats_["Q1"]
        self.stats_["lower_cutoff"] = self.stats_["Q1"] - 1.5 * self.stats_["IQR"]
        self.stats_["upper_cutoff"] = self.stats_["Q3"] + 1.5 * self.stats_["IQR"]

        # Create masks for filtering outliers 
        self.masks_ = (df[self.numerical_columns_] >= self.stats_["lower_cutoff"]) & (df[self.numerical_columns_] <= self.stats_["upper_cutoff"])  # masks by column
        self.final_mask_ = self.masks_.all(axis=1)  # single mask across all columns

        # Calculate number of outliers
        self.stats_["outliers"] = (~self.masks_).sum()  # by column
        self.outliers_ = (~self.final_mask_).sum()  # across all columns
        
        # Show outliers across all columns
        if len(self.numerical_columns_) == 1:
            print(f"\nIdentified {self.outliers_} rows ({self.outliers_ / len(self.final_mask_) * 100:.1f}%) with outliers in the '{self.numerical_columns_[0]}' column.")
        else:
            print(f"\nIdentified {self.outliers_} rows ({self.outliers_ / len(self.final_mask_) * 100:.1f}%) with outliers in one or more numerical columns.")
            
        return self

    def transform(self, df):
        # Create masks for new df
        masks = (df[self.numerical_columns_] >= self.stats_["lower_cutoff"]) & (df[self.numerical_columns_] <= self.stats_["upper_cutoff"])  # masks by column
        final_mask = masks.all(axis=1)  # single mask across all columns
        
        # Remove outliers based on the final mask
        print(f"Removed {(~final_mask).sum()} rows ({(~final_mask).sum() / len(final_mask) * 100:.1f}%) with outliers.")
        return df[final_mask]

    def fit_transform(self, df, numerical_columns):
        # Perform both fit and transform
        return self.fit(df, numerical_columns).transform(df)


# Initialize outlier remover 
outlier_remover_iqr = OutlierRemoverIQR()

# Fit outlier remover to training data
outlier_remover_iqr.fit(X_train, numerical_columns)

# Show outliers by column for training data
print("\nOutliers by column:")
round(outlier_remover_iqr.stats_, 2)

In [None]:
# Remove outliers
print("Training Data:")
X_train_no_outliers = outlier_remover_iqr.transform(X_train)
print("\nValidation Data:")
X_val_no_outliers = outlier_remover_iqr.transform(X_val)
print("\nTest Data:")
X_test_no_outliers = outlier_remover_iqr.transform(X_test)
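
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ As a worked example of the 1.5 IQR rule, the sketch below derives the quartiles and cutoffs for one hypothetical column and applies them to new values. The data and variable names are assumptions for illustration; quartiles use the default linear interpolation of <code>pandas</code>.
</div>

```python
import pandas as pd

# Hypothetical toy data: nine typical values plus one extreme value
train_values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
new_values = pd.Series([0.0, 14.0, 15.0])

# Quartiles and cutoffs are learned from the training values only
q1 = train_values.quantile(0.25)
q3 = train_values.quantile(0.75)
iqr = q3 - q1
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

# Flag values outside [lower_cutoff, upper_cutoff]
train_is_outlier = (train_values < lower_cutoff) | (train_values > upper_cutoff)
new_is_outlier = (new_values < lower_cutoff) | (new_values > upper_cutoff)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Cutoffs: [{lower_cutoff}, {upper_cutoff}]")
```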

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Isolation Forest</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify and remove multivariate outliers using the isolation forest algorithm.
</div>

In [None]:
# Imports
from sklearn.ensemble import IsolationForest

# Initialize isolation forest
isolation_forest = IsolationForest(contamination=0.05, random_state=42)

# Define columns to use: All numerical and boolean features 
numerical_boolean_features = numerical_columns + boolean_columns  # Remove the target variable if necessary

# Fit isolation forest on training data
isolation_forest.fit(X_train[numerical_boolean_features])

# Predict outliers on training, validation, and test data
X_train["outlier"] = isolation_forest.predict(X_train[numerical_boolean_features])
X_train["outlier_score"] = isolation_forest.decision_function(X_train[numerical_boolean_features])
X_val["outlier"] = isolation_forest.predict(X_val[numerical_boolean_features])
X_val["outlier_score"] = isolation_forest.decision_function(X_val[numerical_boolean_features])
X_test["outlier"] = isolation_forest.predict(X_test[numerical_boolean_features])
X_test["outlier_score"] = isolation_forest.decision_function(X_test[numerical_boolean_features])

# Show number of outliers (counting labels directly avoids a KeyError if no outliers are flagged)
n_outliers_train = (X_train["outlier"] == -1).sum()
contamination_train = n_outliers_train / len(X_train)
print(f"Training Data: Identified {n_outliers_train} rows ({100 * contamination_train:.1f}%) as multivariate outliers.")

n_outliers_val = (X_val["outlier"] == -1).sum()
contamination_val = n_outliers_val / len(X_val)
print(f"Validation Data: Identified {n_outliers_val} rows ({100 * contamination_val:.1f}%) as multivariate outliers.")

n_outliers_test = (X_test["outlier"] == -1).sum()
contamination_test = n_outliers_test / len(X_test)
print(f"Test Data: Identified {n_outliers_test} rows ({100 * contamination_test:.1f}%) as multivariate outliers.")

In [None]:
# Scatter plot matrix to visualize outliers for a subsample of the training data
X_train_subsample = X_train[numerical_boolean_features + ["outlier"]].sample(n=min(5000, len(X_train)), random_state=42)  # cap at the number of available rows
sns.pairplot(X_train_subsample, hue="outlier", palette={1: "#4F81BD", -1: "#D32F2F"}, plot_kws={"alpha":0.6, "s":40})

In [None]:
# Remove outliers
X_train_no_outliers = X_train[X_train["outlier"] == 1]
print(f"Training Data: Removed {X_train[X_train['outlier'] == -1].shape[0]} rows ({X_train[X_train['outlier'] == -1].shape[0] / X_train.shape[0] * 100:.1f}%) with multivariate outliers.") 
X_val_no_outliers = X_val[X_val["outlier"] == 1]
print(f"Validation Data: Removed {X_val[X_val['outlier'] == -1].shape[0]} rows ({X_val[X_val['outlier'] == -1].shape[0] / X_val.shape[0] * 100:.1f}%) with multivariate outliers.") 
X_test_no_outliers = X_test[X_test["outlier"] == 1]
print(f"Test Data: Removed {X_test[X_test['outlier'] == -1].shape[0]} rows ({X_test[X_test['outlier'] == -1].shape[0] / X_test.shape[0] * 100:.1f}%) with multivariate outliers.") 
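
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The following sketch shows how the two isolation forest outputs relate, using hypothetical synthetic data (200 points from a standard normal distribution plus two hand-placed extreme points). <code>predict</code> returns -1 for outliers and 1 for inliers, and <code>decision_function</code> returns negative scores for outliers. With <code>contamination=0.05</code>, roughly 5% of the fitted rows are flagged regardless of how many true outliers exist.
</div>

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical synthetic data: 200 typical points and 2 extreme points
rng = np.random.default_rng(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_extreme = np.array([[8.0, 8.0], [-9.0, 7.0]])
X_toy = np.vstack([X_inliers, X_extreme])

# Fit the isolation forest and get labels (-1 = outlier, 1 = inlier) and scores
forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X_toy)
scores = forest.decision_function(X_toy)

# Count flagged rows and inspect the two extreme points (last two rows)
n_outliers = int((labels == -1).sum())
print(f"Flagged {n_outliers} of {len(X_toy)} rows ({100 * n_outliers / len(X_toy):.1f}%).")
print(f"Labels of extreme points: {labels[-2:]}")
```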

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Feature Scaling and Encoding</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Use a <code>ColumnTransformer</code> to preprocess columns based on their semantic type. This applies the appropriate transformation to each column type in a single step.
    <ul>
        <li>Scale numerical columns:</li>
        <ul>
            <li><code>StandardScaler</code>: Scales features to have mean = 0 and standard deviation = 1.</li>
            <li><code>MinMaxScaler</code>: Scales features to a range between [0, 1].</li>
        </ul>
        <li>Encode categorical columns:</li>
        <ul>
            <li>Nominal columns: <code>OneHotEncoder</code> to convert string categories into binary (one-hot) encoded columns; use for unordered categories (e.g., red, green, blue).</li>
            <li>Ordinal columns: <code>OrdinalEncoder</code> to convert string categories into integers; use for ordered categories (e.g., low, medium, high).</li>
        </ul>
        <li>Retain boolean columns: Pass through boolean columns unchanged using <code>remainder="passthrough"</code>.</li>
    </ul>
</div> 

In [None]:
# Define nominal and ordinal columns
nominal_columns = ["nominal_column_1", "nominal_column_2", "nominal_column_3"]
ordinal_columns = ["ordinal_column_1", "ordinal_column_2", "ordinal_column_3"]

# Define the explicit order of categories for all ordinal columns
ordinal_column_orders = [
    ["low", "medium", "high"],  # Order for ordinal_column_1
    ["poor", "average", "good", "excellent"],  # Order for ordinal_column_2
    ["small", "medium", "large"]  # Order for ordinal_column_3
]

# Initialize a column transformer 
column_transformer = ColumnTransformer(
    transformers=[
        ("scaler", StandardScaler(), numerical_columns),  # Use MinMaxScaler() for min-max normalization
        ("nominal_encoder", OneHotEncoder(drop="first"), nominal_columns),
        ("ordinal_encoder", OrdinalEncoder(categories=ordinal_column_orders), ordinal_columns)  
    ],
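    remainder="passthrough",  # Keep boolean columns as they are
    sparse_threshold=0  # Always return a dense array so it can be converted to a DataFrame below
)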

# Fit the column transformer to the training data and apply transformations
X_train_transformed = column_transformer.fit_transform(X_train)

# Apply the same transformations to the validation and test data
X_val_transformed = column_transformer.transform(X_val)
X_test_transformed = column_transformer.transform(X_test)

# Get transformed column names
nominal_encoded_columns = list(column_transformer.named_transformers_["nominal_encoder"].get_feature_names_out())
passthrough_columns = list(X_train.columns.difference(numerical_columns + nominal_columns + ordinal_columns, sort=False))
transformed_columns = numerical_columns + nominal_encoded_columns + ordinal_columns + passthrough_columns

# Convert transformed data from arrays to DataFrames with column names 
X_train_transformed = pd.DataFrame(X_train_transformed, columns=transformed_columns, index=X_train.index)
X_val_transformed = pd.DataFrame(X_val_transformed, columns=transformed_columns, index=X_val.index)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=transformed_columns, index=X_test.index)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Polynomial Features</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Create polynomial features in combination with feature scaling and encoding. Use a <code>ColumnTransformer</code> to preprocess columns based on their semantic type. This applies the appropriate transformation to each column type in a single step. 
    <ul>
        <li>Numerical columns:</li>
        <ul>
            <li><code>Pipeline</code> to create polynomial features followed by feature scaling.</li>
            <li><code>PolynomialFeatures</code> to create polynomial terms (e.g., squared terms and interaction terms).</li>
            <li>Feature scaling:</li>
            <ul>
                <li><code>StandardScaler</code>: Scales features to have mean = 0 and standard deviation = 1.</li>
                <li><code>MinMaxScaler</code>: Scales features to a range between [0, 1].</li>
            </ul>
        </ul>
        <li>Categorical columns:</li>
        <ul>
            <li>Nominal columns: <code>OneHotEncoder</code> to convert string categories into binary (one-hot) encoded columns; use for unordered categories (e.g., red, green, blue).</li>
            <li>Ordinal columns: <code>OrdinalEncoder</code> to convert string categories into integers; use for ordered categories (e.g., low, medium, high).</li>
        </ul>
        <li>Retain boolean columns: Pass through boolean columns unchanged using <code>remainder="passthrough"</code>.</li>
    </ul>
</div> 
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Polynomial features are commonly used with models such as linear regression, logistic regression, and elastic net regression to capture non-linear relationships and interactions between features. Polynomial features are often combined with Elastic Net regularization (balancing the effects of both Lasso and Ridge regularization) to prevent overfitting.  
</div>

In [None]:
# Define nominal and ordinal columns
nominal_columns = ["nominal_column_1", "nominal_column_2", "nominal_column_3"]
ordinal_columns = ["ordinal_column_1", "ordinal_column_2", "ordinal_column_3"]

# Define the explicit order of categories for all ordinal columns
ordinal_column_orders = [
    ["low", "medium", "high"],  # Order for ordinal_column_1
    ["poor", "average", "good", "excellent"],  # Order for ordinal_column_2
    ["small", "medium", "large"]  # Order for ordinal_column_3
]

# Initialize a column transformer
column_transformer = ColumnTransformer(
    transformers=[
        # Numerical columns: Use a pipeline to create polynomial features followed by feature scaling 
        ("polynomial_scaler", Pipeline([
            ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),  # degree hyperparameter tuning is recommended 
            ("scaler", StandardScaler())  # Use MinMaxScaler() for min-max normalization
        ]), numerical_columns),
        
        # Nominal columns: One-hot encoding
        ("nominal_encoder", OneHotEncoder(drop="first"), nominal_columns),

        # Ordinal Columns: Ordinal encoding with explicit category order
        ("ordinal_encoder", OrdinalEncoder(categories=ordinal_column_orders), ordinal_columns)  
    ],
    remainder="passthrough",  # Keep boolean columns as they are
    sparse_threshold=0  # Always return a dense array so it can be converted to a DataFrame below
)

# Fit the column transformer to the training data and apply transformations
X_train_transformed = column_transformer.fit_transform(X_train)

# Apply the same transformations to the validation and test data
X_val_transformed = column_transformer.transform(X_val)
X_test_transformed = column_transformer.transform(X_test)

# Get transformed column names
numerical_polynomial_columns = list(column_transformer.named_transformers_["polynomial_scaler"].named_steps["polynomial"].get_feature_names_out(numerical_columns))
nominal_encoded_columns = list(column_transformer.named_transformers_["nominal_encoder"].get_feature_names_out())
passthrough_columns = list(X_train.columns.difference(numerical_columns + nominal_columns + ordinal_columns, sort=False))
transformed_columns = numerical_polynomial_columns + nominal_encoded_columns + ordinal_columns + passthrough_columns

# Convert transformed data from arrays to DataFrames with column names 
X_train_transformed = pd.DataFrame(X_train_transformed, columns=transformed_columns, index=X_train.index)
X_val_transformed = pd.DataFrame(X_val_transformed, columns=transformed_columns, index=X_val.index)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=transformed_columns, index=X_test.index)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Saving Data</h2>
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">CSV</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Save preprocessed data from a Pandas DataFrame to a <code>.csv</code> file in the <code>data</code> directory.
</div>

In [None]:
# Imports
import os

# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Merge transformed X features and y target variable
df_train_transformed = pd.concat([X_train_transformed, y_train], axis=1)
df_val_transformed = pd.concat([X_val_transformed, y_val], axis=1)
df_test_transformed = pd.concat([X_test_transformed, y_test], axis=1)

# Save as .csv  
df_train_transformed.to_csv("data/df_train_preprocessed.csv", index=False)
df_val_transformed.to_csv("data/df_val_preprocessed.csv", index=False)
df_test_transformed.to_csv("data/df_test_preprocessed.csv", index=False)
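
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ A quick round-trip check can confirm that a saved file matches the DataFrame in memory. The sketch below uses a hypothetical toy DataFrame and a temporary directory (both assumptions for illustration). Because <code>index=False</code> drops the index, the original row labels cannot be recovered from the file.
</div>

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for df_train_transformed
df_toy = pd.DataFrame({
    "feature_1": [0.5, -1.2, 0.7],
    "feature_2": [1.0, 0.0, 1.0],
    "target": [10, 20, 30],
})

# Save to a temporary directory and load the file again
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "df_toy_preprocessed.csv")
    df_toy.to_csv(csv_path, index=False)
    df_reloaded = pd.read_csv(csv_path)

# Raises an AssertionError if shape, column names, dtypes, or values differ
pd.testing.assert_frame_equal(df_toy, df_reloaded)
print("Round trip successful.")
```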

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">MySQL</h3>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Save preprocessed data from a Pandas DataFrame to a MySQL database table. <br><br>
    🔒 Make sure <code>sql_username</code> and <code>sql_password</code> have been loaded from environment variables before running this cell.
</div>

In [None]:
# Imports 
from sqlalchemy import create_engine

# Database info
mysql_host = "localhost"  # Default hostname for a MySQL server running locally
mysql_port = 3306  # Default port for MySQL
mysql_database_name = "your_mysql_database_name_here"
mysql_table_name = "preprocessed_data"

# Merge transformed X features and y target variable
df_train_transformed = pd.concat([X_train_transformed, y_train], axis=1)
df_val_transformed = pd.concat([X_val_transformed, y_val], axis=1)
df_test_transformed = pd.concat([X_test_transformed, y_test], axis=1)

try:
    # Create an SQLAlchemy engine for interacting with the MySQL database
    engine = create_engine(f"mysql+mysqlconnector://{sql_username}:{sql_password}@{mysql_host}:{mysql_port}/{mysql_database_name}")
    
    # Save data to MySQL (engine.begin() commits the transaction on success and rolls back on error)
    with engine.begin() as connection:
        df_train_transformed.to_sql(name="df_train_preprocessed", con=connection, if_exists="replace", index=False)
        df_val_transformed.to_sql(name="df_val_preprocessed", con=connection, if_exists="replace", index=False)
        df_test_transformed.to_sql(name="df_test_preprocessed", con=connection, if_exists="replace", index=False)
    
    print("Preprocessed data successfully saved to MySQL.")

except Exception as e:
    print(f"Error saving preprocessed data to MySQL: {e}")
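
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Credentials that contain special characters such as <code>@</code>, <code>/</code>, or <code>:</code> need to be URL-encoded before they are placed in the connection string. The sketch below shows one way to do this with the standard library; the helper name <code>build_mysql_url</code> and the example values are hypothetical.
</div>

```python
from urllib.parse import quote_plus


# Hypothetical helper: build a connection URL with URL-encoded credentials
def build_mysql_url(username, password, host, port, database):
    return (
        f"mysql+mysqlconnector://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example values for illustration only; real credentials should come from environment variables
example_url = build_mysql_url("example_user", "p@ss/word:1", "localhost", 3306, "example_db")
print(example_url)
```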

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Exploratory Data Analysis (EDA)</h1>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Merge <code>X_train</code> and <code>y_train</code> to create a single DataFrame with both features and target, containing all partially preprocessed columns (not yet scaled or encoded) for EDA. 
</div>

In [None]:
# Merge X_train and y_train 
df_train = pd.concat([X_train, y_train], axis=1)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Univariate EDA</h2>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Analyze the distribution of a single column using descriptive statistics and visualizations.
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Numerical Columns</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Examine descriptive statistics (e.g., mean, median, standard deviation) and visualize the distributions (e.g., histograms) of numerical columns.
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Descriptive Statistics</strong> <br>
    📌 Examine descriptive statistics of numerical columns. 
</div>

In [None]:
# Descriptive statistics of a single numerical column
df_train["numerical_column"].describe()

In [None]:
# Table of descriptive statistics of all numerical columns
df_train[numerical_columns].describe().transpose()
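
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The descriptive statistics table can be extended with skewness to spot asymmetric distributions before plotting. This is a minimal sketch on hypothetical toy data (the column names are assumptions); a mean far above the median (<code>50%</code>) together with a positive skew points to a long right tail.
</div>

```python
import pandas as pd

# Hypothetical toy data: one symmetric and one right-skewed column
df_toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 2.0, 3.0, 4.0, 100.0],
})

# Descriptive statistics with one row per column, plus skewness
stats_table = df_toy.describe().transpose()
stats_table["skew"] = df_toy.skew()
print(stats_table.round(2))
```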

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Visualize Distributions</strong> <br> 
    📌 Histogram matrix that shows the distributions of all numerical columns. 
</div>

In [None]:
# Imports
import math


# Function to visualize distributions of all numerical columns using a histogram matrix
def plot_numerical_distributions(df, numerical_columns, safe_to_file=False):
    # Calculate number of rows and columns for subplot grid
    n_plots = len(numerical_columns)
    n_cols = 3  
    n_rows = math.ceil(n_plots / n_cols) 
    
    # Create subplot grid with figure size based on 4x3 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3))
    
    # Flatten the axes for easier iteration
    axes = axes.flat
    
    # Iterate over all numerical columns
    for i, column in enumerate(numerical_columns):
        # Get the current axes object
        ax = axes[i]
        
        # Create histogram for the current column
        sns.histplot(data=df, x=column, ax=ax)
        
        # Customize histogram title, axes labels and tick labels
        ax.set_title(column.title().replace("_", " "), fontsize=14)
        ax.set_ylabel("Frequency", fontsize=12)
        ax.set_xlabel("")
        ax.tick_params(axis="both", labelsize=10)

    # Hide any unused subplots 
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
        
    # Adjust layout to prevent overlap
    fig.tight_layout()
    
    # Save plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)  
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Numerical distributions plot (histogram matrix) saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving numerical distributions plot (histogram matrix): {e}")
        else:
            print(f"Skip saving numerical distributions plot (histogram matrix) to file: '{image_path}' already exists.")
            
    # Show the plot
    plt.show()


# Use function to visualize distributions of all numerical columns in the training data
plot_numerical_distributions(df_train, numerical_columns, safe_to_file="numerical_distributions_histograms.png")

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Customize individual histograms for better interpretability. <br><br>  
    💡 Example: Format x-axis in thousands (K) without decimals. 
</div>

In [None]:
# Imports
from matplotlib.ticker import FuncFormatter

# Set the figure size
plt.figure(figsize=(6, 4))

# Create histogram 
sns.histplot(df_train["numerical_column"])

# Customize title, axes labels and tick labels 
plt.title("numerical_column", fontsize=14)
plt.ylabel("Frequency", fontsize=12)
plt.xlabel("")
plt.tick_params(axis="both", labelsize=10)  # apply fontsize 10 to tick labels of both axes
plt.gca().xaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{x / 1000:.0f}K"))   # Format x-axis tick labels in thousands (no decimals)

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Categorical Columns</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Examine descriptive statistics (absolute and relative frequencies) and visualize the distributions (e.g., bar plots) of categorical columns.
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Descriptive Statistics</strong> <br> 
    📌 Examine frequencies of categorical columns.
</div>

In [None]:
# --- Frequencies of a single categorical column ---
# Calculate absolute and relative frequencies 
absolute_frequencies = df_train["categorical_column"].value_counts()
relative_frequencies = df_train["categorical_column"].value_counts(normalize=True) 
percent_frequencies = relative_frequencies.map(lambda x: f"{x * 100:.2f}%")  # formatted as percent with 2 decimals (string)

# Show frequency table
pd.concat([absolute_frequencies, percent_frequencies], axis=1, keys=["absolute_frequency", "percent_frequency"]).reset_index()

In [None]:
# --- Frequencies of all categorical columns ---
# Function to create frequency tables for all categorical columns 
def calculate_frequencies(df, categorical_columns):
    # Initialize dictionary to store all frequency tables 
    frequencies = {}

    # Iterate over each categorical column
    for column in categorical_columns:
        # Calculate frequencies for current column
        absolute_frequencies = df[column].value_counts()
        relative_frequencies = df[column].value_counts(normalize=True) 
        percent_frequencies = relative_frequencies.map(lambda x: f"{x * 100:.2f}%")  # formatted as string with % sign

        # Create frequency table
        frequency_table = pd.concat([absolute_frequencies, relative_frequencies, percent_frequencies], axis=1).reset_index()
        frequency_table.columns = ["category", "absolute_frequency", "relative_frequency", "percent_frequency"] 
        
        # Add current frequency table to dictionary 
        frequencies[column] = frequency_table
    
    return frequencies


# Use function to create frequency tables of all categorical and boolean columns in the training data
frequencies = calculate_frequencies(df_train, categorical_columns + boolean_columns)

# Display a single frequency table (display() is needed because this is not the last expression in the cell)
display(frequencies["categorical_column"][["category", "absolute_frequency", "percent_frequency"]])

# Display all frequency tables
for column, frequency_table in frequencies.items():
    print(f"{column.title().replace('_', ' ')} Frequencies:")
    display(frequency_table[["category", "absolute_frequency", "percent_frequency"]])

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Visualize Distributions</strong> <br> 
    📌 Bar plot matrix that shows the frequency distributions of all categorical columns.
</div>

In [None]:
# Imports
import math
import matplotlib.ticker as mtick


# Function to visualize categorical frequency distributions using a bar plot matrix 
def plot_categorical_distributions(df, categorical_columns, max_categories=5, safe_to_file=False):
    # Calculate number of rows and columns for subplot grid
    n_plots = len(categorical_columns)
    n_cols = 3  
    n_rows = math.ceil(n_plots / n_cols) 
    
    # Create subplot grid with figure size based on 4x4 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4))
    
    # Flatten the axes for easier iteration
    axes = axes.flat
    
    # Iterate over all categorical columns
    for i, column in enumerate(categorical_columns):      
        # Get the current axes object
        ax = axes[i]
        
        # Calculate frequencies for the current column
        column_frequencies = df[column].value_counts(normalize=True)  # False for absolute frequencies

        # Retain inherent order of categories for ordinal columns
        if column == "ordinal_column":
            column_frequencies = column_frequencies.reindex(["low", "medium", "high"])  # define the explicit order of categories
            
        # Format plot title
        plot_title = column.title().replace("_", " ")
    
        # Limit number of categories for better readability  
        if len(column_frequencies) > max_categories:
            column_frequencies = column_frequencies.head(max_categories)
            plot_title += f" (Top {max_categories})"
        
        # Create bar plot for the current column
        sns.barplot(x=column_frequencies.index, y=column_frequencies.values, ax=ax)
        
        # Add value labels
        value_labels = [f"{value * 100:.1f}%" for value in column_frequencies.values]
        ax.bar_label(ax.containers[0], labels=value_labels, padding=2, fontsize=10 if len(column_frequencies) <= 6 else 7)  
        
        # Customize title and axes labels 
        ax.set_title(plot_title, fontsize=14)
        ax.set_ylabel("Frequency", fontsize=12)
        ax.set_xlabel("")

        # Customize axes tick labels
        xticks = range(len(column_frequencies.index))
        xticklabels = [str(label).title().replace("_", " ") for label in column_frequencies.index]
        if len(column_frequencies) >= 5:  # rotate x-tick labels if 5 or more categories 
            ax.set_xticks(xticks, labels=xticklabels, fontsize=11, rotation=45, ha="right")
        else:
            ax.set_xticks(xticks, labels=xticklabels, fontsize=11)
        ax.tick_params(axis="y", labelsize=10)
        ax.set_ylim(0, column_frequencies.max() * 1.1)  # set y-axis from 0 to 10% above maximum frequency
        ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0, decimals=0))  # format y-axis as percent

    # Hide any unused subplots 
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
        
    # Adjust layout to prevent overlap
    fig.tight_layout()
    
    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Categorical distributions plot (bar plot matrix) saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving categorical distributions plot (bar plot matrix): {e}")
        else:
            print(f"Skip saving categorical distributions plot (bar plot matrix) to file: '{image_path}' already exists.")
    
    # Show the plot
    plt.show()


# Use function to visualize categorical frequencies of all boolean and categorical columns in training data
plot_categorical_distributions(df_train, boolean_columns + categorical_columns, max_categories=8, safe_to_file="categorical_frequencies_barplots.png")
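
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): how the frequency steps inside the plotting function behave on a small toy column. The data and the variable names <code>ordered_frequencies</code> and <code>top_frequencies</code> are assumptions made for this example. Note that <code>reindex</code> returns <code>NaN</code> for any listed category that does not occur in the data.
</div>

```python
import pandas as pd

# Toy ordinal column (hypothetical data for illustration only)
ordinal_values = pd.Series(
    ["low", "high", "medium", "low", "low", "medium", "medium", "low"],
    name="ordinal_column",
)

# Relative frequencies, sorted from most to least frequent by default
relative_frequencies = ordinal_values.value_counts(normalize=True)

# Restore the inherent category order of the ordinal column
ordered_frequencies = relative_frequencies.reindex(["low", "medium", "high"])

# Keep only the two most frequent categories (mirrors max_categories=2)
top_frequencies = relative_frequencies.head(2)

print(ordered_frequencies)
print(top_frequencies)
```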

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Customize individual bar plots for better interpretability. <br><br>  
    💡 Example: Horizontal bar plot for high-cardinality column (large number of unique categories). 
</div>

In [None]:
# Imports
import matplotlib.ticker as mtick

# Set the figure size
plt.figure(figsize=(6, 12))

# Create horizontal bar plot
ax = sns.barplot(x=frequencies["categorical_column"]["relative_frequency"], 
                 y=frequencies["categorical_column"]["category"].str.title().str.replace("_", " "))

# Add value labels inside the bars
for i, value in enumerate(frequencies["categorical_column"]["relative_frequency"]):
    ax.text(value * 0.98,  # x position (slightly left of the bar's right end)
            i,    # y position
            f"{value:.1%}",  # text (relative frequency on a 0-1 scale shown as percent with 1 decimal place)
            ha="right",  # horizontal alignment
            va="center") # vertical alignment

# Customize title, axes labels, and tick labels 
plt.title("categorical_column", fontsize=14)
plt.ylabel("")
plt.xlabel("Frequency", fontsize=12)
plt.tick_params(axis="both", labelsize=10)
plt.gca().xaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0, decimals=1))  # x-tick labels as percentages (1 decimal)

# Show the plot
plt.show()

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Bivariate EDA</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Analyze relationships between two columns using correlations, group-wise statistics, and contingency tables, and visualize them using scatter plots, bar plots, and heatmaps.
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Numerical vs. Numerical</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Analyze relationships between two numerical columns using correlation coefficients and visualize relationships using scatter plots.
</div>

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Correlations</h4>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Correlation heatmap of all numerical and boolean columns. 
</div>

In [None]:
# Function to plot correlation heatmap 
def plot_correlation_heatmap(df, numerical_columns, safe_to_file=False):
    # Create correlation matrix and round to 2 decimals
    correlation_matrix = round(df[numerical_columns].corr(), 2) 
    
    # Create upper triangle mask (k=1 excludes diagonal)
    mask = np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool) 
    
    # Set upper triangle to NaN to avoid redundancy
    correlation_matrix[mask] = np.nan

    # Set the figure size
    plt.figure(figsize=(10, 8))

    # Create heatmap
    ax = sns.heatmap(
        correlation_matrix, 
        cmap="viridis",  # Colorblind-friendly colormap (other options: "cividis", "magma", "YlOrBr", "RdBu") 
        annot=True,  # Show correlation values
        fmt=".2f",  # Ensure uniform decimal formatting
        linewidth=0.5  # Thin white lines between cells
    )

    # Customize title and axes tick labels
    plt.title("Correlation Heatmap", fontsize=14)
    formatted_column_names = correlation_matrix.columns.str.title().str.replace("_", " ") 
    ax.set_xticklabels(formatted_column_names)
    ax.set_yticklabels(formatted_column_names)
    
    # Adjust layout to prevent overlap
    plt.tight_layout()
    
    # Save heatmap to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)  
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:    
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Correlation heatmap saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving correlation heatmap: {e}")
        else:
            print(f"Skip saving correlation heatmap: '{image_path}' already exists.")
        
    # Show heatmap
    plt.show()


# Use function to plot correlation heatmap of all numerical and boolean columns in training data
plot_correlation_heatmap(df_train, numerical_columns + boolean_columns, safe_to_file="correlation_heatmap.png")
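
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): what the upper-triangle mask does to a correlation matrix. The toy DataFrame <code>df_demo</code> is an assumption, and <code>DataFrame.mask</code> is used here as a non-mutating alternative to the in-place assignment in the function above.
</div>

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values for illustration only)
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 5],
    "c": [5, 4, 3, 1, 2],
})

# Correlation matrix rounded to 2 decimals
correlation_matrix = df_demo.corr().round(2)

# Boolean mask that is True strictly above the diagonal (k=1 keeps the diagonal visible)
mask = np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)

# Replace the masked cells with NaN; the lower triangle and the diagonal remain
masked_matrix = correlation_matrix.mask(mask)

print(masked_matrix)
```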

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Feature-target correlations sorted by absolute magnitude (strongest first).
</div>

In [None]:
# Feature-target correlations sorted by absolute values in descending order 
feature_target_correlations = df_train[numerical_columns + boolean_columns].corr()["numerical_target"].sort_values(key=abs, ascending=False)
feature_target_correlations
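
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): sorting with <code>key=abs</code> ranks strong negative correlations above weaker positive ones. The toy columns <code>strong_negative</code>, <code>weak_positive</code>, and <code>low_signal</code> are assumptions made for this example.
</div>

```python
import pandas as pd

# Toy data (hypothetical values for illustration only)
df_demo = pd.DataFrame({
    "numerical_target": [1, 2, 3, 4, 5],
    "weak_positive": [2, 1, 4, 3, 5],
    "strong_negative": [5, 4, 3, 1, 2],
    "low_signal": [3, 1, 5, 2, 4],
})

# Correlation of every column with the target, sorted by absolute value (descending)
feature_target_correlations = (
    df_demo.corr()["numerical_target"].sort_values(key=abs, ascending=False)
)

print(feature_target_correlations)
```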

In [None]:
# Function to visualize feature-target correlations using a bar plot
def plot_feature_target_correlations(feature_target_correlations, y_min=-1, y_max=1, safe_to_file=False):
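    # Remove target variable self-correlation (assumes the target is the first entry, which holds after sorting by absolute value in descending order)
    feature_target_correlations = feature_target_correlations.iloc[1:]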
    
    # Set figure size
    plt.figure(figsize=(8, 6))
    
    # Create bar plot
    ax = sns.barplot(x=feature_target_correlations.index, y=feature_target_correlations.values)  # for pandas Series

    # Add horizontal line
    ax.axhline(0, color="gray", alpha=0.5, linewidth=0.8)
    
    # Add value labels
    ax.bar_label(ax.containers[0], fmt="%.2f", padding=3, fontsize=10)
    
    # Customize title, axes labels and ticks
    plt.title("Feature-Target Correlations", fontsize=14)
    plt.ylabel("Correlation", fontsize=12)
    plt.xlabel("")
    plt.yticks(np.arange(y_min, y_max + 0.2, 0.2), fontsize=10)
    formatted_xtick_labels = [label.title().replace("_", " ") for label in feature_target_correlations.index]
    plt.xticks(ticks=range(len(formatted_xtick_labels)), labels=formatted_xtick_labels, rotation=45, ha="right", fontsize=11)
    
    # Adjust layout
    plt.tight_layout()
    
    # Save plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Feature-target correlations bar plot saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving feature-target correlations bar plot: {e}")
        else:
            print(f"Skip saving feature-target correlations bar plot to file: '{image_path}' already exists.")
    
    # Show plot
    plt.show()

    
# Use function to plot feature-target correlations of training data
plot_feature_target_correlations(feature_target_correlations, y_min=-0.5, y_max=0.5, safe_to_file="feature_target_correlations.png")

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Visualize Relationships</h4>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Numerical-Numerical Relationships (Scatter Plot Matrix)</strong> <br>
    📌 Visualize pairwise relationships between all numerical columns using a scatter plot matrix.
</p>

In [None]:
# Imports
import itertools
import math


# Function to visualize pairwise relationships between numerical columns using a scatter plot matrix
def plot_numerical_relationships(df, numerical_columns, safe_to_file=False):
    # Get all possible pairs of numerical columns
    column_pairs = list(itertools.combinations(numerical_columns, 2))
    
    # Calculate number of rows and columns for subplot grid
    n_plots = len(column_pairs)
    n_cols = 3  
    n_rows = math.ceil(n_plots / n_cols) 
    
    # Create subplot grid with figure size based on 4x4 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4))
    
    # Flatten axes for easier iteration
    axes = axes.flat
    
    # Iterate over each column pair
    for i, (column_1, column_2) in enumerate(column_pairs):
        # Get the current axes object
        ax = axes[i]
        
        # Create scatter plot
        sns.scatterplot(data=df, x=column_1, y=column_2, ax=ax)
        
        # Customize title, axes labels, and axes tick labels
        column_1_name = column_1.title().replace("_", " ")
        column_2_name = column_2.title().replace("_", " ")
        ax.set_title(f"{column_1_name} vs. {column_2_name}", fontsize=13)
        ax.set_xlabel(column_1_name, fontsize=12)
        ax.set_ylabel(column_2_name, fontsize=12)
        ax.tick_params(axis="both", labelsize=10)
            
    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
        
    # Adjust layout to prevent overlap
    fig.tight_layout()

    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Numerical relationships plot (scatter plot matrix) saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving numerical relationships plot (scatter plot matrix): {e}")
        else:
            print(f"Skip saving numerical relationships plot (scatter plot matrix) to file: '{image_path}' already exists.")
    
    # Show the plot
    plt.show()


# Use function to visualize relationships between all numerical columns in the training data
plot_numerical_relationships(df_train, numerical_columns, safe_to_file="numerical_relationships_scatterplots.png")
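
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): how the number of scatter plots and grid rows grows with the number of columns. With <em>n</em> columns there are <em>n(n-1)/2</em> pairs, so the matrix becomes large quickly. The column names below are assumptions made for this example.
</div>

```python
import itertools
import math

# Hypothetical column names for illustration only
numerical_columns = ["age", "income", "tenure", "balance", "score"]

# All unordered pairs of columns: n * (n - 1) / 2 pairs
column_pairs = list(itertools.combinations(numerical_columns, 2))

# Grid layout with a fixed number of 3 columns
n_cols = 3
n_rows = math.ceil(len(column_pairs) / n_cols)

# Subplots left empty in the last row (these are the ones hidden with axis("off"))
n_unused = n_rows * n_cols - len(column_pairs)

print(f"{len(column_pairs)} pairs -> {n_rows} rows x {n_cols} columns, {n_unused} unused subplots")
```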

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Customize individual scatter plots for better interpretability. <br><br>
    💡 Example: Define the exact axes ticks for two columns representing years. 
</div>

In [None]:
# Set figure size
plt.figure(figsize=(8, 6))

# Create scatter plot 
sns.scatterplot(data=df_train, x="numerical_column_1", y="numerical_column_2")

# Customize title and axes labels
plt.title("Relationship between numerical_column_1 and numerical_column_2", fontsize=14)
plt.xlabel("numerical_column_1", fontsize=12)
plt.ylabel("numerical_column_2", fontsize=12)

# Customize axes tick labels
plt.xticks(range(0, 21, 1), fontsize=10)  # set x-axis ticks from 0 to 20 years in 1-year steps
plt.yticks(range(5, 11, 1), fontsize=10)  # set y-axis ticks from 5 to 10 years in 1-year steps

# Show plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Numerical vs. Categorical</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Analyze relationships between a numerical column and a categorical column using group-wise statistics (e.g., median or mean by category) and visualize relationships using bar plots.
</div>

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Group-Wise Statistics</h4>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Descriptive statistics of numerical columns grouped by a single categorical column. 
</div>

In [None]:
# Group-wise statistics of all numerical columns by a single categorical column (focus on median, mean, and std for easier readability)
stats_by_category = df_train[numerical_columns].groupby(df_train["categorical_column"]).agg(["median", "mean", "std"]).transpose()
stats_by_category

In [None]:
# Group-wise statistics of a single numerical column by a single categorical column 
stats_by_category = df_train["numerical_column"].groupby(df_train["categorical_column"]).describe()
stats_by_category
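
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): the shape of the transposed group-wise statistics table. Rows are (column, statistic) pairs and columns are the categories. The toy columns <code>price</code> and <code>area</code> are assumptions; the skewed <code>price</code> values show why median and mean are worth comparing side by side.
</div>

```python
import pandas as pd

# Toy data (hypothetical values for illustration only)
df_demo = pd.DataFrame({
    "categorical_column": ["a", "a", "a", "b", "b", "b"],
    "price": [10.0, 20.0, 60.0, 5.0, 5.0, 20.0],
    "area": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})
numerical_columns = ["price", "area"]

# Median, mean, and standard deviation of each numerical column per category
stats_by_category = (
    df_demo[numerical_columns]
    .groupby(df_demo["categorical_column"])
    .agg(["median", "mean", "std"])
    .transpose()
)

print(stats_by_category)
```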

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Effect Size</strong> <br>
    📌 Quantify the magnitude of the difference between two groups using Cohen's d.
</div>

In [None]:
# Function to calculate Cohen's d
def calculate_cohens_d(df, numerical_column, categorical_column, group_1, group_2):
    x1 = df[df[categorical_column] == group_1][numerical_column]
    x2 = df[df[categorical_column] == group_2][numerical_column]
    
    mean1, mean2 = np.mean(x1), np.mean(x2)
    std1, std2 = np.std(x1, ddof=1), np.std(x2, ddof=1)  # Sample standard deviation using N−1
    n1, n2 = len(x1), len(x2)
    
    pooled_std = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))
    
    return (mean1 - mean2) / pooled_std if pooled_std != 0 else 0


# Use function to calculate Cohen's d for a single numerical column
cohens_d_result = calculate_cohens_d(df_train, "numerical_column", "categorical_column", "category_1", "category_2")
print(f"Cohen's d: {cohens_d_result}")

# Use function to calculate Cohen's d for all numerical columns
cohens_d_results = {column: calculate_cohens_d(df_train, column, "categorical_column", "category_1", "category_2") for column in numerical_columns}
cohens_d_results = pd.DataFrame.from_dict(cohens_d_results, orient="index", columns=["Cohen's d"])
cohens_d_results
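
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): a hand-checkable Cohen's d calculation, i.e. the difference in group means divided by the pooled sample standard deviation. It uses Python's standard <code>statistics</code> module instead of <code>numpy</code>, and the two toy groups are assumptions. As a rough convention, values around 0.2, 0.5, and 0.8 are often read as small, medium, and large effects.
</div>

```python
import math
import statistics

# Toy groups (hypothetical values for illustration only)
group_1 = [2.0, 4.0, 6.0]
group_2 = [1.0, 3.0, 5.0]

n1, n2 = len(group_1), len(group_2)

# Sample variances (divide by N-1)
var1 = statistics.variance(group_1)
var2 = statistics.variance(group_2)

# Pooled standard deviation weights each variance by its degrees of freedom
pooled_std = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Cohen's d: standardized mean difference
cohens_d_value = (statistics.mean(group_1) - statistics.mean(group_2)) / pooled_std

print(f"Pooled std: {pooled_std}, Cohen's d: {cohens_d_value}")
```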

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Visualize Relationships</h4>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Numerical-Categorical Relationships (Bar Plot Matrix)</strong> <br>
    📌 Visualize pairwise relationships between all numerical columns and a single categorical column using a bar plot matrix.
</p>

In [None]:
# Imports
import math


# Function to visualize pairwise relationships between multiple numerical columns and a single categorical column using a bar plot matrix
def plot_numerical_categorical_relationships(df, numerical_columns, categorical_column, estimator=np.median, safe_to_file=False):
    # Calculate number of rows and columns for subplot grid
    n_plots = len(numerical_columns)
    n_cols = 3  
    n_rows = math.ceil(n_plots / n_cols) 
    
    # Create subplot grid with figure size based on 4x4 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4))
    
    # Flatten axes for easier iteration
    axes = axes.flat
    
    # Iterate over all numerical columns
    for i, numerical_column in enumerate(numerical_columns):
        # Get the current axes object
        ax = axes[i]
        
        # Create bar plot 
        sns.barplot(data=df, x=categorical_column, y=numerical_column, estimator=estimator, errorbar=None, ax=ax)

        # Add value labels 
        for bar in ax.patches:  # patches are the bars themselves
            # Get the height of the bar (which is the value)
            height = bar.get_height()
            
            # Format value labels            
            value = f"{height:.1f}"
                
            # Get the x and y coordinates for the text label
            x_pos = bar.get_x() + bar.get_width() / 2.
            y_pos = height * 1.01

            # Add the text label 
            ax.text(x_pos, y_pos, value, ha="center", va="bottom", fontsize=10) 

        # Extend the y-axis upper limit by 5% to make room for value labels
        y_min, y_max = ax.get_ylim()
        ax.set_ylim(y_min, y_max * 1.05)
        
        # Customize title and axes labels 
        numerical_column_name = numerical_column.title().replace("_", " ")
        categorical_column_name = categorical_column.title().replace("_", " ")
        ax.set_title(f"{estimator.__name__.title()} {numerical_column_name} by {categorical_column_name}", fontsize=13)
        ax.set_xlabel("")
        ax.set_ylabel(numerical_column_name, fontsize=12)
    
    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
        
    # Adjust layout to prevent overlap
    fig.tight_layout()

    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Numerical-categorical relationships plot (bar plot matrix) saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving numerical-categorical relationships plot (bar plot matrix): {e}")
        else:
            print(f"Skip saving numerical-categorical relationships plot (bar plot matrix) to file: '{image_path}' already exists.")
 
    
    # Show the plot
    plt.show()


# Use function to visualize relationships between all numerical columns and a single categorical column in the training data
plot_numerical_categorical_relationships(df_train, numerical_columns, "categorical_column", safe_to_file="numerical_categorical_relationships_barplots.png")

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Customize individual bar plots for better interpretability. <br><br>
    💡 Example: Define exact order of categories and format y-axis in thousands. 
</div>

In [None]:
# Imports
from matplotlib.ticker import FuncFormatter

# Define exact order of categories
ordered_categories = ["category_1", "category_2", "category_3"]

# Create figure and axes
fig, ax = plt.subplots(figsize=(6, 4))

# Create bar plot 
sns.barplot(data=df_train, x="categorical_column", order=ordered_categories, y="numerical_column", estimator=np.median, errorbar=None, ax=ax)

# Add value labels 
for bar in ax.patches:  
    height = bar.get_height()
    value = f"{height / 1000:,.0f}K"  # format in thousands with thousand separator (no decimals), matching the y-axis tick labels
    x_pos = bar.get_x() + bar.get_width() / 2.
    y_pos = height * 1.01
    ax.text(x_pos, y_pos, value, ha="center", va="bottom", fontsize=10) 

# Extend the y-axis upper limit by 5% 
y_min, y_max = ax.get_ylim()
ax.set_ylim(y_min, y_max * 1.05)

# Customize title and axes labels
numerical_column_name = "numerical_column".title().replace("_", " ")
categorical_column_name = "categorical_column".title().replace("_", " ")
ax.set_title(f"Median {numerical_column_name} by {categorical_column_name}", fontsize=14)
ax.set_xlabel(categorical_column_name, fontsize=12)
ax.set_ylabel(numerical_column_name, fontsize=12)

# Customize axes tick labels
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{x / 1000:.0f}K"))  # format in thousands (no decimals)
ax.tick_params(axis="y", labelsize=10)
ax.set_xticks(range(len(ordered_categories)))
ax.set_xticklabels([label.title().replace("_", " ") for label in ordered_categories], fontsize=10) 

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()
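
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): a single formatting helper for values in thousands. The function name <code>format_thousands</code> is an assumption. Its signature accepts the <code>(value, position)</code> pair that matplotlib's <code>FuncFormatter</code> passes to its callback, so the same helper could serve both tick labels and bar value labels.
</div>

```python
def format_thousands(value, _position=None):
    """Format a raw value in thousands with a thousands separator, e.g. 25000 -> '25K'."""
    return f"{value / 1000:,.0f}K"


# Raw values (hypothetical, for illustration only) and their formatted labels
for raw_value in [0, 25_000, 1_250_000]:
    print(raw_value, "->", format_thousands(raw_value))
```

Used as <code>FuncFormatter(format_thousands)</code> for the axis and as <code>format_thousands(height)</code> for the bar labels, both stay on the same scale.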

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Categorical vs. Categorical</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Analyze relationships between two categorical columns using contingency tables and visualize relationships using grouped bar plots.
</div>

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Contingency Tables</h4>
</div> 

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Contingency tables between all categorical feature-target pairs for machine learning projects with a categorical target variable.
</p> 

In [None]:
# Show all categorical and boolean columns with number of unique categories
df_train[categorical_columns + boolean_columns].nunique()

In [None]:
# Function to create contingency tables between each categorical feature and a single categorical target variable 
def calculate_feature_target_crosstabs(df, categorical_columns, target_variable, normalize="index"):
    # Dictionary to store contingency tables
    contingency_tables = {}
    
    # Get all possible pairs of the target variable with the categorical features
    feature_target_pairs = [(feature, target_variable) for feature in categorical_columns if feature != target_variable]
    
    # Iterate over each feature-target pair
    for feature, target in feature_target_pairs:
        # Create contingency table (normalize="index" gives percentage distribution of the target variable within each category of the feature)
        contingency_tables[(feature, target)] = pd.crosstab(df[feature], df[target], normalize=normalize)

    return contingency_tables


# Use function to create contingency tables between the target variable and each categorical or boolean feature in the training data
contingency_tables = calculate_feature_target_crosstabs(df_train, categorical_columns + boolean_columns, target_variable="categorical_target")

# Display a single feature-target contingency table (formatted as percent with 1 decimal)
display(contingency_tables[("categorical_feature", "categorical_target")].map(lambda x: f"{x * 100:.1f}%"))

# Display all feature-target contingency tables 
for (feature, target) in contingency_tables.keys():
    display(contingency_tables[(feature, target)].map(lambda x: f"{x * 100:.1f}%"))
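
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): the difference between <code>normalize="index"</code> and <code>normalize="columns"</code> in <code>pd.crosstab</code>. With <code>"index"</code>, each row (feature category) sums to 1; with <code>"columns"</code>, each column (target class) sums to 1. The toy data is an assumption made for this example.
</div>

```python
import pandas as pd

# Toy data (hypothetical values for illustration only)
df_demo = pd.DataFrame({
    "categorical_feature": ["a", "a", "a", "a", "b", "b"],
    "categorical_target": [1, 0, 0, 0, 1, 1],
})

# Share of each target class within each feature category (rows sum to 1)
row_shares = pd.crosstab(
    df_demo["categorical_feature"], df_demo["categorical_target"], normalize="index"
)

# Share of each feature category within each target class (columns sum to 1)
column_shares = pd.crosstab(
    df_demo["categorical_feature"], df_demo["categorical_target"], normalize="columns"
)

print(row_shares)
print(column_shares)
```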

In [None]:
# Function to create heatmap of a contingency table
def plot_crosstab_heatmap(contingency_table, safe_to_file=False):
    # Get number of categories
    n_categories_feature = contingency_table.index.nunique()
    n_categories_target = contingency_table.columns.nunique()
    
    # Set the figure size based on number of categories
    if n_categories_feature > 4:
        plt.figure(figsize=(n_categories_target * 1.5, n_categories_feature * 0.4))
    else:
        plt.figure(figsize=(n_categories_target * 2, n_categories_feature * 0.8))

    # Create heatmap (formatted as percent with 1 decimal)
    ax = sns.heatmap(contingency_table, annot=True, fmt=".1%", cmap="viridis", cbar=False, linewidth=0.5)

    # Customize title and axes labels
    feature_name = contingency_table.index.name.title().replace("_", " ")
    target_name = contingency_table.columns.name.title().replace("_", " ")
    ax.set_title(f"{target_name} Distribution Within {feature_name}", fontsize=14)
    ax.set_xlabel(target_name, fontsize=12) 
    ax.set_ylabel(feature_name, fontsize=12) 

    # Customize axes tick labels
    formatted_xticklabels = [label.get_text().title().replace("_", " ") for label in ax.get_xticklabels()]
    formatted_yticklabels = [label.get_text().title().replace("_", " ") for label in ax.get_yticklabels()]
    ax.set_xticklabels(formatted_xticklabels, fontsize=10)
    ax.set_yticklabels(formatted_yticklabels, rotation=0, fontsize=10)

    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Contingency table heatmap saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving contingency table heatmap: {e}")
        else:
            print(f"Skip saving contingency table heatmap to file: '{image_path}' already exists.")

    # Show heatmap
    plt.show()


# Use function to create a contingency table heatmap of a single categorical feature with the categorical target variable  
plot_crosstab_heatmap(contingency_tables["categorical_feature", "categorical_target"])

# Use function to create contingency table heatmaps of all categorical features with the categorical target variable 
for (feature, target) in contingency_tables.keys():
    plot_crosstab_heatmap(contingency_tables[feature, target])

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Customize individual contingency table heatmaps for better interpretability. <br><br>
    💡 Example: Display only the top and bottom 5 feature categories ranked by target frequency (useful for high-cardinality features).
</p>

In [None]:
# Create contingency table with percentage distribution of the target variable within each category of the feature 
contingency_table = pd.crosstab(df_train["categorical_feature"], df_train["categorical_target"], normalize="index")

# Sort feature categories from highest to lowest target frequency (class 1 for binary target)
contingency_table = contingency_table.sort_values(by=1, ascending=False)

# Filter feature categories with 5 highest and 5 lowest target frequencies
contingency_table = pd.concat([contingency_table.head(5), contingency_table.tail(5)])

# Create heatmap of contingency table 
plot_crosstab_heatmap(contingency_table)
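
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    💡 Illustrative sketch (not part of the template): when a feature has fewer than 10 categories, <code>head(5)</code> and <code>tail(5)</code> overlap and the concatenated table contains duplicate rows. The hypothetical helper <code>select_top_bottom</code> below is one possible guard; its name and the toy table are assumptions.
</div>

```python
import pandas as pd


def select_top_bottom(contingency_table, by, n=5):
    """Return the n highest and n lowest rows by column `by`, without duplicates."""
    sorted_table = contingency_table.sort_values(by=by, ascending=False)
    # With at most 2 * n rows, head and tail would overlap, so return everything once
    if len(sorted_table) <= 2 * n:
        return sorted_table
    return pd.concat([sorted_table.head(n), sorted_table.tail(n)])


# Toy row-normalized contingency table with only 4 feature categories (hypothetical values)
demo_table = pd.DataFrame(
    {0: [0.9, 0.5, 0.7, 0.2], 1: [0.1, 0.5, 0.3, 0.8]},
    index=pd.Index(["a", "b", "c", "d"], name="categorical_feature"),
)

# Naive approach: duplicates every row because head(5) and tail(5) both return all 4 rows
sorted_demo = demo_table.sort_values(by=1, ascending=False)
naive_table = pd.concat([sorted_demo.head(5), sorted_demo.tail(5)])

# Guarded approach: each category appears once
guarded_table = select_top_bottom(demo_table, by=1, n=5)

print(len(naive_table), len(guarded_table))
```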

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Visualize Relationships</h4>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Categorical-Categorical Relationships (Grouped Bar Plot Matrix)</strong> <br>
    📌 Visualize pairwise relationships between all categorical columns with low cardinality using a grouped bar plot matrix.
</p>

In [None]:
# Imports
import itertools
import math
import matplotlib.ticker as mtick


# Function to visualize pairwise relationships between all categorical columns using a grouped bar plot matrix
def plot_categorical_relationships(df, categorical_columns, max_categories=5, safe_to_file=False):
    # Filter columns with low cardinality (small number of unique categories) 
    low_cardinality_columns = [column for column in categorical_columns if df[column].nunique() <= max_categories]

    # Get all possible pairs of categorical columns
    column_pairs = tuple(itertools.combinations(low_cardinality_columns, 2))
    
    # Calculate number of rows and columns for subplot grid
    n_plots = len(column_pairs)
    n_cols = 3  
    n_rows = math.ceil(n_plots / n_cols) 
    
    # Create subplot grid with figure size based on 4x4 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4))
    
    # Flatten the axes for easier iteration
    axes = axes.flat
    
    # Iterate over each column pair
    for i, (column_1, column_2) in enumerate(column_pairs):
        # Get the current axes object
        ax = axes[i]

        # Ensure column_1 (plot x-axis) has higher cardinality (more categories) for plot readability
        if df[column_1].nunique() < df[column_2].nunique():
            column_1, column_2 = column_2, column_1
        
        # Create contingency table (normalize="index" calculates the percentage distribution of column_2 within each category of column_1)
        contingency_table = pd.crosstab(df[column_1], df[column_2], normalize="index") 

        # Reshape data for easier plotting  
        plot_df = contingency_table.stack().reset_index()
        plot_df.columns = [column_1, column_2, "Frequency"]
         
        # Create grouped bar plot
        sns.barplot(data=plot_df, x=column_1, y="Frequency", hue=column_2, palette="viridis", ax=ax)

        # Add value labels
        n_categories_col2 = plot_df[column_2].nunique()
        value_label_size = {1: 10, 2: 10, 3: 8}.get(n_categories_col2, 7)  # dynamic fontsize based on number of categories (default fontsize 7)
        for container in ax.containers:
            value_labels = [f"{value * 100:.0f}%" for value in container.datavalues]
            ax.bar_label(container, labels=value_labels, padding=2, fontsize=value_label_size) 
                    
        # Extend the y-axis upper limit by 5% 
        y_min, y_max = ax.get_ylim()
        ax.set_ylim(y_min, y_max * 1.05)
        
        # Customize title and axes labels
        column_1_name = column_1.title().replace("_", " ")
        column_2_name = column_2.title().replace("_", " ")
        ax.set_title(f"{column_1_name} vs. {column_2_name}", fontsize=14)
        ax.set_xlabel(column_1_name, fontsize=12)
        ax.set_ylabel(f"% within {column_1_name}", fontsize=12)
        
        # Customize axes ticks
        ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0, decimals=0))  # format y-axis tick labels as percentages
        xticks = range(plot_df[column_1].nunique())
        xticklabels = [label.get_text().title().replace("_", " ") for label in ax.get_xticklabels()]
        ax.set_xticks(xticks, labels=xticklabels, fontsize=10)

        # Customize legend
        legend_handles, legend_labels = ax.get_legend_handles_labels()
        legend_labels = [str(label).title().replace("_", " ") for label in legend_labels]  # format legend labels 
        ax.legend(handles=legend_handles, labels=legend_labels, title=column_2_name)    
    
    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis("off")
        
    # Adjust layout to prevent overlap
    fig.tight_layout()

    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Categorical relationships plot (grouped bar plot matrix) saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving categorical relationships plot (grouped bar plot matrix): {e}")
        else:
            print(f"Skip saving categorical relationships plot (grouped bar plot matrix) to file: '{image_path}' already exists.")
  
    
    # Show the plot
    plt.show()


# Use function to visualize relationships between all categorical columns in the training data
plot_categorical_relationships(df_train, categorical_columns + boolean_columns, safe_to_file="categorical_relationships_groupedbarplots.png")

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Customize individual grouped bar plots for better interpretability. <br><br>
    💡 Example: Display only the top and bottom 5 categories (useful for high-cardinality columns).
</p>

In [None]:
# Create contingency table (normalize="index" calculates the percentage distribution of column_2 within each category of column_1)
contingency_table = pd.crosstab(df_train["categorical_column_1"], df_train["categorical_column_2"], normalize="index") 

# Sort categories from highest to lowest (by class 1 of categorical_column_2)
contingency_table = contingency_table.sort_values(by=1, ascending=False)

# Filter 5 highest and 5 lowest categories
contingency_table = pd.concat([contingency_table.head(5), contingency_table.tail(5)])

# Reshape data for easier plotting  
plot_df = contingency_table.stack().reset_index()
plot_df.columns = ["categorical_column_1", "categorical_column_2", "Frequency"]

# Create figure and axes
fig, ax = plt.subplots(figsize=(12, 6))

# Create grouped bar plot
sns.barplot(data=plot_df, x="categorical_column_1", y="Frequency", hue="categorical_column_2", palette="viridis", ax=ax)

# Add value labels
for container in ax.containers:
    value_labels = [f"{value * 100:.1f}%" for value in container.datavalues]
    ax.bar_label(container, labels=value_labels, padding=2, fontsize=10) 
            
# Extend the y-axis upper limit by 5% 
y_min, y_max = ax.get_ylim()
ax.set_ylim(y_min, y_max * 1.05)

# Customize title and axes labels
column_1_name = "categorical_column_1".title().replace("_", " ")
column_2_name = "categorical_column_2".title().replace("_", " ")
ax.set_title(f"{column_1_name} vs. {column_2_name}", fontsize=14)
ax.set_xlabel(column_1_name, fontsize=12)
ax.set_ylabel(f"% Within {column_1_name}", fontsize=12)

# Customize axes ticks
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0, decimals=0))  # format y-axis tick labels as percentages
column_1_categories = contingency_table.index.tolist()  # Get the specific order of categories
ax.set_xticks(range(len(column_1_categories)))
ax.set_xticklabels([str(label).title().replace("_", " ") for label in column_1_categories], rotation=45, ha="right")  # str() keeps this working for non-string categories

# Customize legend
legend_handles, legend_labels = ax.get_legend_handles_labels()
legend_labels = [str(label).title().replace("_", " ") for label in legend_labels]
ax.legend(handles=legend_handles, labels=legend_labels, title=column_2_name)    

# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px">
    <h1 style="margin:0px">Modeling (Regression)</h1>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Use this section for regression problems, where the task is to predict a numerical target variable.
</div>

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Helper Function for Residual Plots</h2>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Creates two plots, predicted vs. actual target and residuals vs. actual target, to evaluate model performance and to spot potential violations of model assumptions visually.
</div>

In [None]:
# Helper function to create residual plots
def plot_residuals(y, y_pred):
    # Calculate residuals
    residuals = [actual_value - predicted_value for actual_value, predicted_value in zip(y, y_pred)]

    # Create a 1x2 grid of subplots
    fig, axes = plt.subplots(1, 2, figsize=(12, 5), dpi=150)

    # Plot 1: Predicted vs. Actual Target
    axes[0].scatter(y, y_pred)
    axes[0].plot([min(y), max(y)], 
                 [min(y), max(y)], 
                 color="red", 
                 linestyle="--", 
                 label="Perfect Prediction")  # Add diagonal reference line
    axes[0].set_title("Predicted vs. Actual Target")
    axes[0].set_xlabel("Actual Target")
    axes[0].set_ylabel("Predicted Target")
    axes[0].grid(True)
    axes[0].legend() 

    # Plot 2: Residuals vs. Actual Target
    axes[1].scatter(y, residuals)
    axes[1].axhline(y=0, color="red", linestyle="--", label="Perfect Prediction")  # Add horizontal reference line
    axes[1].set_xlabel("Actual Target")
    axes[1].set_ylabel("Residuals")
    axes[1].set_title("Residuals vs. Actual Target")
    axes[1].grid(True)
    axes[1].legend() 

    # Adjust layout and display the plots
    plt.tight_layout()
    plt.show()

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Training Baseline Models (Individually)</h2>
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Linear Regression</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li><code>fit_intercept=True</code>: Calculates the intercept; can be set to <code>False</code> if data is already centered.</li>
        <li><code>n_jobs=None</code>: Number of CPU threads; use <code>-1</code> for all available processors.</li>
        <li><code>positive=False</code>: Forces regression coefficients to be non-negative if set to <code>True</code>.</li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression" target="_blank">scikit-learn LinearRegression documentation</a>.  
</div>
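<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch of how <code>positive=True</code> changes the fitted coefficients. The synthetic data and the names <code>X_demo</code>, <code>y_demo</code>, <code>lr_default</code>, and <code>lr_positive</code> are assumptions for illustration and are not part of the template.
</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 3*x0 - 2*x1 + 5 + small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# Default settings: the intercept is estimated and coefficients are unconstrained
lr_default = LinearRegression().fit(X_demo, y_demo)

# positive=True constrains all coefficients to be non-negative
lr_positive = LinearRegression(positive=True).fit(X_demo, y_demo)

print("Default coefficients: ", lr_default.coef_.round(2), "| intercept:", round(lr_default.intercept_, 2))
print("Positive coefficients:", lr_positive.coef_.round(2), "| intercept:", round(lr_positive.intercept_, 2))
```

In this sketch the true coefficient of the second feature is negative, so the constrained model cannot represent it and is expected to fit worse. The constraint is mainly useful when negative coefficients are implausible for the problem at hand.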

In [None]:
# Initialize model
lr = LinearRegression()

# Train model
lr.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = lr.predict(X_train_transformed)
y_val_pred = lr.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
lr_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(lr_evaluation.round(2))  # round metrics to 2 decimals
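
<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: The same evaluation steps are used for every baseline model, so they can be wrapped in a small helper. The function name <code>evaluate_regression</code> and the toy values are hypothetical and not part of the template. This is a sketch, not the author's implementation.
</p>

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score


def evaluate_regression(y_train, y_train_pred, y_val, y_val_pred):
    """Return a table of RMSE, MAPE and R-squared for training and validation data."""
    def metrics(y_true, y_pred):
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # works across scikit-learn versions
        mape = mean_absolute_percentage_error(y_true, y_pred)
        r2 = r2_score(y_true, y_pred)
        return [rmse, mape, r2]

    return pd.DataFrame({
        "Metric": ["RMSE", "MAPE", "R-squared"],
        "Training": metrics(y_train, y_train_pred),
        "Validation": metrics(y_val, y_val_pred),
    })


# Toy example with hand-made values (hypothetical data)
demo_evaluation = evaluate_regression(
    y_train=[10, 20, 30], y_train_pred=[10, 20, 30],  # perfect predictions
    y_val=[10, 20, 30], y_val_pred=[12, 18, 33],      # imperfect predictions
)
print(demo_evaluation.round(2))
```

With such a helper, each model cell would reduce to fitting, predicting, and one call such as <code>evaluate_regression(y_train, y_train_pred, y_val, y_val_pred)</code>.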

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Elastic Net Regression</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>alpha=1.0</code>: Regularization strength. Higher values increase the penalty, reducing overfitting but possibly underfitting the data.</li>
                <li><code>l1_ratio=0.5</code>: The mix between L1 (Lasso) and L2 (Ridge) regularization.
                    <ul>
                        <li><code>l1_ratio=1.0</code> corresponds to pure Lasso.</li>
                        <li><code>l1_ratio=0.0</code> corresponds to pure Ridge.</li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Solver Configuration:
            <ul>
                <li><code>fit_intercept=True</code>: Whether to calculate the intercept for the model. If <code>False</code>, the model assumes data is centered.</li>
                <li><code>precompute=False</code>: Whether to use precomputed Gram matrices to speed up calculations. Set to <code>True</code> for small datasets.</li>
                <li><code>max_iter=1000</code>: The maximum number of iterations allowed for convergence during optimization.</li>
                <li><code>tol=1e-4</code>: Stopping criterion for optimization. If the change in the cost function is smaller than <code>tol</code>, training stops.</li>
                <li><code>warm_start=False</code>: If <code>True</code>, reuse the solution of the previous fit as initialization for the next fit.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>selection="cyclic"</code>: Determines the strategy for updating coefficients. <code>"cyclic"</code> updates coefficients sequentially, <code>"random"</code> in a random order.</li>
                <li><code>random_state=None</code>: Seed for random number generation when <code>selection="random"</code>.</li>
            </ul>
        </li>
        <li>Performance Configuration:
            <ul>
                <li><code>copy_X=True</code>: Whether to copy the input data <code>X</code>. If <code>False</code>, training modifies the original data, saving memory.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html" target="_blank">scikit-learn Elastic Net Regression documentation</a>.  
</div>
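<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch of how <code>alpha</code> and <code>l1_ratio</code> affect the number of non-zero coefficients. The synthetic data and the names <code>X_demo</code>, <code>y_demo</code>, <code>en_demo</code>, and <code>n_nonzero</code> are assumptions for illustration.
</p>

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (hypothetical): only the first 2 of 10 features influence the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Count non-zero coefficients for different regularization settings
n_nonzero = {}
for alpha, l1_ratio in [(0.01, 0.5), (1.0, 0.5), (1.0, 1.0)]:
    en_demo = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo)
    n_nonzero[(alpha, l1_ratio)] = int(np.count_nonzero(en_demo.coef_))
    print(f"alpha={alpha}, l1_ratio={l1_ratio}: {n_nonzero[(alpha, l1_ratio)]} non-zero coefficients")
```

Stronger regularization and a larger L1 share tend to push the coefficients of uninformative features to exactly zero, which is why Elastic Net can double as a simple feature selector.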


In [None]:
# Initialize model 
en = ElasticNet()

# Train model
en.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = en.predict(X_train_transformed)
y_val_pred = en.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
en_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(en_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">K-Nearest Neighbors Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_neighbors=5</code>: The number of neighbors to use for prediction. A higher value makes the model more general, while a lower value may lead to overfitting.</li>
                <li><code>weights="uniform"</code>: Determines how neighbors are weighted during prediction. <code>"uniform"</code> gives equal weight to all neighbors, while <code>"distance"</code> gives closer neighbors more influence.</li>
                <li><code>p=2</code>: The power parameter for the Minkowski distance. <code>p=2</code> corresponds to the Euclidean distance, commonly used in KNN regression.</li>
                <li><code>algorithm="auto"</code>: The algorithm used to compute nearest neighbors. <code>"auto"</code> selects the best algorithm based on the dataset (options include <code>"ball_tree"</code>, <code>"kd_tree"</code>, and <code>"brute"</code>).</li>
                <li><code>leaf_size=30</code>: The size of the leaf in tree-based algorithms like Ball Tree and KD Tree. This parameter impacts the speed and memory usage during training.</li>
            </ul>
        </li>
        <li>Distance Metrics:
            <ul>
                <li><code>metric="minkowski"</code>: The distance metric used to calculate the proximity between data points. The default is <code>"minkowski"</code>, but you can also use <code>"euclidean"</code>, <code>"manhattan"</code>, etc.</li>
                <li><code>metric_params=None</code>: Additional parameters for the distance metric, usually left as <code>None</code>.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>n_jobs=None</code>: The number of parallel jobs to run for neighbor search. Setting <code>n_jobs=-1</code> utilizes all available CPU cores for faster computation.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html" target="_blank">scikit-learn KNN Regressor documentation</a>.  
</div>
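<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch comparing <code>weights="uniform"</code> and <code>weights="distance"</code> on the training data. The synthetic data and the names <code>X_demo</code>, <code>y_demo</code>, <code>knn_uniform</code>, and <code>knn_distance</code> are assumptions for illustration.
</p>

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical): noisy sine curve with one feature
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 6, size=(100, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=100)

# Same number of neighbors, different weighting schemes
knn_uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X_demo, y_demo)
knn_distance = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_demo, y_demo)

# Training RMSE: with "distance" weights every training point is its own
# nearest neighbor at distance 0, so its target value is reproduced exactly
rmse_uniform = np.sqrt(np.mean((y_demo - knn_uniform.predict(X_demo)) ** 2))
rmse_distance = np.sqrt(np.mean((y_demo - knn_distance.predict(X_demo)) ** 2))
print(f"Training RMSE (uniform):  {rmse_uniform:.3f}")
print(f"Training RMSE (distance): {rmse_distance:.3f}")
```

A training error of zero with distance weighting says nothing about generalization, so the two settings should be compared on the validation data.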


In [None]:
# Initialize model
knn = KNeighborsRegressor()

# Train model
knn.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = knn.predict(X_train_transformed)
y_val_pred = knn.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
knn_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(knn_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Support Vector Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>C=1.0</code>: Regularization parameter balancing error reduction and model complexity.</li>
                <li><code>epsilon=0.1</code>: Margin of tolerance for predictions without penalty.</li>
            </ul>
        </li>
        <li>Kernel Configuration:
            <ul>
                <li><code>kernel="rbf"</code>: Kernel function for mapping data to higher dimensions (default is radial basis function or <code>"rbf"</code>).</li>
                <li><code>degree=3</code>: Degree of the polynomial kernel function (ignored by the rbf kernel).</li>
                <li><code>gamma="scale"</code>: Influence range of a single training example. <code>"scale"</code> means <code>1 / (n_features * X.var())</code>.</li>
                <li><code>coef0=0.0</code>: Independent term in polynomial and sigmoid kernel function (ignored by the rbf kernel).</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>tol=1e-3</code>: Stopping criterion for optimization. If the change in the cost function is less than this tolerance, training stops.</li>
                <li><code>cache_size=200</code>: Memory (MB) allocated for kernel computation. Larger values speed up training.</li>
                <li><code>shrinking=True</code>: Enables the shrinking heuristic, which can speed up training by eliminating unnecessary steps during optimization.</li>
                <li><code>verbose=False</code>: Whether to print progress messages during training.</li>
                <li><code>max_iter=-1</code>: Maximum number of iterations during training (<code>-1</code> for no limit).</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR" target="_blank">scikit-learn SVR documentation</a>.  
</div>
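<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch that computes the value behind <code>gamma="scale"</code> by hand and checks that it gives the same predictions. The synthetic data and the names <code>X_demo</code>, <code>y_demo</code>, <code>gamma_scale</code>, <code>svr_scale</code>, and <code>svr_manual</code> are assumptions for illustration.
</p>

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (hypothetical): non-linear target with 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(scale=2.0, size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# Value that gamma="scale" resolves to: 1 / (n_features * X.var())
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

# Fit once with the string setting and once with the explicit value
svr_scale = SVR(kernel="rbf", gamma="scale").fit(X_demo, y_demo)
svr_manual = SVR(kernel="rbf", gamma=gamma_scale).fit(X_demo, y_demo)

max_difference = np.max(np.abs(svr_scale.predict(X_demo) - svr_manual.predict(X_demo)))
print(f"gamma='scale' corresponds to gamma={gamma_scale:.4f}")
print(f"Largest prediction difference: {max_difference:.2e}")
```

Because <code>gamma="scale"</code> depends on the variance of the input, feature scaling changes the effective kernel width. This is one reason to fit SVR on scaled features.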


In [None]:
# Initialize model
svr = SVR()

# Train model
svr.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = svr.predict(X_train_transformed)
y_val_pred = svr.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
svr_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(svr_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Decision Tree Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>max_depth=None</code>: Maximum depth of the tree. <code>None</code> allows nodes to expand until all leaves are pure or contain fewer samples than <code>min_samples_split</code>.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split an internal node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required to be at a leaf node.</li>
                <li><code>criterion="squared_error"</code>: Function to measure the quality of a split. Options include <code>"squared_error"</code> (mean squared error), <code>"friedman_mse"</code> (Friedman’s mean squared error), <code>"absolute_error"</code> (mean absolute error), and <code>"poisson"</code> (Poisson deviance).</li>
                <li><code>splitter="best"</code>: Strategy to choose the split at each node. Options are <code>"best"</code> (best split) and <code>"random"</code> (random split).</li>
                <li><code>max_features=None</code>: Number of features to consider when looking for the best split. If <code>None</code>, all features are considered.</li>
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>ccp_alpha=0.0</code>: Complexity parameter for pruning. A higher value encourages pruning by penalizing tree complexity.</li>
                <li><code>min_impurity_decrease=0.0</code>: A node will split only if the impurity decrease exceeds this threshold.</li>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes in the tree.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>min_weight_fraction_leaf=0.0</code>: Minimum weighted fraction of the sum of weights required at a leaf node.</li>
                <li><code>max_samples</code>: Not a <code>DecisionTreeRegressor</code> parameter. It belongs to bagging-based ensembles such as <code>RandomForestRegressor</code>, and passing it to a standalone tree raises a <code>TypeError</code>.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>presort</code>: No longer available. This pre-sorting option was deprecated and then removed from scikit-learn, so passing it to current versions raises a <code>TypeError</code>.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html" target="_blank">scikit-learn DecisionTreeRegressor documentation</a>.  
</div>
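<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch of how <code>max_depth</code> and <code>min_samples_leaf</code> limit tree growth. The synthetic data and the names <code>X_train_demo</code>, <code>y_train_demo</code>, <code>tree_full</code>, and <code>tree_limited</code> are assumptions for illustration.
</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): noisy sine curve, split into training and validation parts
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)
X_train_demo, X_val_demo = X_demo[:150], X_demo[150:]
y_train_demo, y_val_demo = y_demo[:150], y_demo[150:]

# Default tree: grows until every leaf is pure, which memorizes the training data
tree_full = DecisionTreeRegressor(random_state=42).fit(X_train_demo, y_train_demo)

# Constrained tree: limited depth and a minimum number of samples per leaf
tree_limited = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=42).fit(X_train_demo, y_train_demo)

for name, model in [("Default tree", tree_full), ("Limited tree", tree_limited)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train R2={model.score(X_train_demo, y_train_demo):.2f}, "
        f"val R2={model.score(X_val_demo, y_val_demo):.2f}"
    )
```

A training R-squared of 1.0 for the default tree is a typical sign of overfitting. The gap between training and validation scores is the quantity to watch when tuning these parameters.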


In [None]:
# Initialize model
tree = DecisionTreeRegressor(random_state=42)

# Train model
tree.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = tree.predict(X_train_transformed)
y_val_pred = tree.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
tree_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(tree_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Random Forest Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees in the forest.</li>
                <li><code>max_depth=None</code>: Maximum depth of each tree; <code>None</code> allows trees to grow until all leaves are pure or minimum samples are reached.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split a node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required at a leaf node.</li>
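                <li><code>max_features=1.0</code>: Number of features considered for the best split. The default <code>1.0</code> uses all features (older scikit-learn versions called this <code>"auto"</code>); <code>"sqrt"</code> uses the square root of the number of features.</li>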
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes per tree.</li>
                <li><code>min_impurity_decrease=0.0</code>: Splits a node only if it decreases impurity by this threshold.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>bootstrap=True</code>: Whether to use bootstrap samples for training each tree.</li>
                <li><code>oob_score=False</code>: Whether to use out-of-bag samples to estimate prediction accuracy.</li>
                <li><code>n_jobs=None</code>: Number of CPU threads used (<code>-1</code> for all processors).</li>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>verbose=0</code>: Controls the verbosity of output during training.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>max_samples=None</code>: Maximum number of samples used to train each tree, useful for subsampling large datasets.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor" target="_blank">scikit-learn RandomForestRegressor documentation</a>.  
</div>
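<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch of <code>oob_score=True</code>, which scores each sample using only the trees that did not see it during bootstrapping. The synthetic data and the names <code>X_demo</code>, <code>y_demo</code>, and <code>rf_demo</code> are assumptions for illustration.
</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): 5 features, 2 of which drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

# oob_score=True requires bootstrap=True (the default)
rf_demo = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

# For regressors, both scores are R-squared values
print(f"Training R-squared:   {rf_demo.score(X_demo, y_demo):.2f}")
print(f"Out-of-bag R-squared: {rf_demo.oob_score_:.2f}")
```

The out-of-bag score gives a rough generalization estimate without touching the validation set. It does not replace the validation metrics used elsewhere in this template.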


In [None]:
# Initialize model
rf = RandomForestRegressor(random_state=42)

# Train model
rf.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = rf.predict(X_train_transformed)
y_val_pred = rf.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
rf_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(rf_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Multi-Layer Perceptron Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Architecture:
            <ul>
                <li><code>hidden_layer_sizes=(100,)</code>: Defines the size and number of hidden layers; <code>(100,)</code> indicates one layer with 100 neurons.</li>
                <li><code>activation="relu"</code>: Activation function for the hidden layers; options include <code>"relu"</code>, <code>"tanh"</code>, <code>"logistic"</code>, or <code>"identity"</code>.</li>
                <li><code>solver="adam"</code>: Optimization algorithm; options are <code>"adam"</code> (default), <code>"lbfgs"</code>, or <code>"sgd"</code>.</li>
            </ul>
        </li>
        <li>Regularization and Learning:
            <ul>
                <li><code>alpha=0.0001</code>: L2 regularization term to prevent overfitting.</li>
                <li><code>learning_rate="constant"</code>: Strategy for learning rate adjustment; options are <code>"constant"</code>, <code>"invscaling"</code>, or <code>"adaptive"</code>.</li>
                <li><code>learning_rate_init=0.001</code>: Initial learning rate for weight updates.</li>
                <li><code>power_t=0.5</code>: Exponent for inverse scaling of learning rate (used when <code>learning_rate="invscaling"</code>).</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>max_iter=200</code>: Maximum number of iterations for training.</li>
                <li><code>tol=1e-4</code>: Tolerance for stopping criteria; training stops if loss improvement is below this value.</li>
                <li><code>momentum=0.9</code>: Momentum parameter for gradient descent updates (used when <code>solver="sgd"</code>).</li>
                <li><code>n_iter_no_change=10</code>: Number of iterations with no improvement to stop early.</li>
                <li><code>early_stopping=False</code>: Enables early stopping when validation score doesn’t improve.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>batch_size="auto"</code>: Number of samples per batch for training; <code>"auto"</code> uses <code>min(200, n_samples)</code>.</li>
                <li><code>shuffle=True</code>: Whether to shuffle training data before each epoch.</li>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>verbose=False</code>: Controls verbosity of output during training.</li>
                <li><code>warm_start=False</code>: Reuses previous solution to initialize weights for additional fitting.</li>
                <li><code>beta_1=0.9</code>, <code>beta_2=0.999</code>: Exponential decay rates for moving averages of gradients and squared gradients (used in <code>solver="adam"</code>).</li>
                <li><code>epsilon=1e-8</code>: Small value to prevent division by zero in <code>solver="adam"</code>.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html" target="_blank">scikit-learn MLPRegressor documentation</a>.  
</div>
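<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    💡 Example: A minimal sketch of <code>early_stopping=True</code> together with <code>n_iter_no_change</code>. The synthetic data, the name <code>mlp_demo</code>, and the chosen layer size are assumptions for illustration. <code>validation_fraction</code> is a further <code>MLPRegressor</code> parameter that is not listed above.
</p>

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data (hypothetical): linear target with 4 already standardized features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

mlp_demo = MLPRegressor(
    hidden_layer_sizes=(32,),  # one hidden layer with 32 neurons
    early_stopping=True,       # hold out part of the training data as an internal validation set
    validation_fraction=0.1,   # share of the training data held out (default value)
    n_iter_no_change=10,       # stop after 10 epochs without sufficient improvement
    max_iter=500,
    random_state=42,
)
mlp_demo.fit(X_demo, y_demo)

print(f"Epochs run: {mlp_demo.n_iter_} (max_iter={mlp_demo.max_iter})")
print(f"Best internal validation R-squared: {mlp_demo.best_validation_score_:.3f}")
```

The internal validation set is carved out of the training data, so the template's own validation data stays untouched for model comparison.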


In [None]:
# Initialize model
mlp = MLPRegressor(random_state=42)

# Train model
mlp.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = mlp.predict(X_train_transformed)
y_val_pred = mlp.predict(X_val_transformed)

# Evaluate model
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5  # RMSE = square root of MSE (squared=False is deprecated since scikit-learn 1.4)
val_rmse = mean_squared_error(y_val, y_val_pred) ** 0.5
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
mlp_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(mlp_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">XGBoost Regressor</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees.</li>
                <li><code>max_depth=6</code>: Maximum depth of each tree.</li>
                <li><code>learning_rate=0.3</code>: Step size shrinking to prevent overfitting.</li>
                <li><code>subsample=1.0</code>: Fraction of training samples used for each tree.</li>
                <li><code>colsample_bytree=1.0</code>: Fraction of features used for each tree.</li>
                <li><code>colsample_bylevel=1.0</code>: Fraction of features used at each tree level.</li>
                <li><code>colsample_bynode=1.0</code>: Fraction of features used at each node.</li>
            </ul>
        </li>
        <li>Regularization and Learning:
            <ul>
                <li><code>gamma=0</code>: Minimum loss reduction required to make a further partition on a leaf node.</li>
                <li><code>min_child_weight=1</code>: Minimum sum of instance weight (hessian) in a child.</li>
                <li><code>scale_pos_weight=1</code>: Controls the balance of positive and negative weights; used for imbalanced classification datasets and usually left at its default for regression.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>objective="reg:squarederror"</code>: Objective function for regression; default is for squared error.</li>
                <li><code>booster="gbtree"</code>: Booster type; options include <code>"gbtree"</code> (default), <code>"gblinear"</code>, and <code>"dart"</code>.</li>
                <li><code>tree_method="auto"</code>: Tree construction algorithm; <code>"auto"</code> chooses based on system configuration. Options include <code>"exact"</code>, <code>"approx"</code>, and <code>"hist"</code>.</li>
                <li><code>eval_metric="rmse"</code>: Metric used for validation during training; default is root mean square error (<code>rmse</code>).</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>early_stopping_rounds=None</code>: Stops training if validation metric does not improve after specified rounds.</li>
                <li><code>n_jobs=None</code>: Number of threads used for parallel computation; when left unset, XGBoost uses all available threads.</li>
                <li><code>random_state=None</code>: Seed for reproducibility.</li>
                <li><code>verbosity=1</code>: Verbosity level for training output; <code>0</code> is silent, <code>1</code> shows warnings, <code>2</code> info, and <code>3</code> debug messages.</li>
            </ul>
        </li>
        <li>Advanced Parameters:
            <ul>
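                <li><code>reg_lambda=1</code>: L2 regularization term on weights (named <code>lambda</code> in the native XGBoost parameters).</li>
                <li><code>reg_alpha=0</code>: L1 regularization term on weights (named <code>alpha</code> in the native XGBoost parameters).</li>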
                <li><code>max_delta_step=0</code>: Used to help with convergence in highly imbalanced datasets.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.  
</div>
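
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ To build intuition for how <code>learning_rate</code> and <code>n_estimators</code> interact, the following toy sketch replaces each tree with a constant learner that predicts the mean residual. This is a simplification and not how XGBoost fits trees; the function name <code>boost_with_constant_learner</code> and the data are hypothetical. With this learner, each round closes a fraction <code>learning_rate</code> of the remaining gap, so a smaller learning rate needs more rounds to reach the same fit.
</div>

```python
def boost_with_constant_learner(y, learning_rate, n_estimators):
    """Toy additive model: each round adds a shrunken step toward the mean residual.

    Illustration only; real XGBoost fits a regression tree in every round.
    """
    prediction = 0.0
    for _ in range(n_estimators):
        mean_residual = sum(value - prediction for value in y) / len(y)
        prediction += learning_rate * mean_residual  # shrinkage step
    return prediction


y_toy = [10.0, 20.0, 30.0]  # hypothetical targets with mean 20.0

for learning_rate in (0.3, 0.05):
    for n_estimators in (10, 100):
        prediction = boost_with_constant_learner(y_toy, learning_rate, n_estimators)
        print(f"learning_rate={learning_rate}, n_estimators={n_estimators}: prediction={prediction:.3f}")
```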


In [None]:
# Initialize model
xgb = XGBRegressor(random_state=42)

# Train model
xgb.fit(X_train_transformed, y_train)

# Predict on the training and validation data
y_train_pred = xgb.predict(X_train_transformed)
y_val_pred = xgb.predict(X_val_transformed)

# Evaluate model
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))  # avoids squared=False, which recent scikit-learn versions no longer accept
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Create table of evaluation metrics
xgb_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2]
})

# Show evaluation metrics
print(xgb_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Training Baseline Models (Pipeline)</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Train 8 baseline models:  
    <ul>
        <li>Linear Regression</li>
        <li>Elastic Net Regression</li>
        <li>K-Nearest Neighbors Regressor</li>
        <li>Support Vector Regressor</li>
        <li>Decision Tree Regressor</li>
        <li>Random Forest Regressor</li>
        <li>Multi-Layer Perceptron Regressor</li>
        <li>XGBoost Regressor</li>
    </ul>
    🎯 Model performance will be evaluated using the following metrics:  
    <ul>
        <li>Root Mean Squared Error (RMSE)</li>
        <li>Mean Absolute Percentage Error (MAPE)</li>
        <li>R-squared (R2)</li>
    </ul>
</div>
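
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ For reference, the three metrics can be sketched in plain Python as shown below. The helper names (<code>rmse_sketch</code>, <code>mape_sketch</code>, <code>r2_sketch</code>) and the sample values are hypothetical and only illustrate the formulas; the notebook itself relies on the <code>sklearn</code> implementations. Note that MAPE is returned as a fraction here, so 0.05 means 5%, which matches how <code>mean_absolute_percentage_error</code> reports it.
</div>

```python
import math


def rmse_sketch(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape_sketch(y_true, y_pred):
    """Mean absolute percentage error as a fraction (assumes no zero targets)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_sketch(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values
y_true_example = [100, 200, 300, 400]
y_pred_example = [110, 190, 310, 380]

print(f"RMSE: {rmse_sketch(y_true_example, y_pred_example):.2f}")
print(f"MAPE: {mape_sketch(y_true_example, y_pred_example):.4f}")
print(f"R-squared: {r2_sketch(y_true_example, y_pred_example):.3f}")
```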

In [None]:
# Define models with baseline configurations
models = [
    LinearRegression(), 
    ElasticNet(),
    KNeighborsRegressor(),
    SVR(), 
    DecisionTreeRegressor(random_state=42),
    RandomForestRegressor(random_state=42), 
    MLPRegressor(random_state=42), 
    XGBRegressor(random_state=42)
]

# Create lists for storing the evaluation metrics (RMSE, MAPE, R2) of each model 
rmse_ls = []
mape_ls = []
r2_ls = []

# Loop through each model
for model in models:
    # Show model
    print("=" * 100)
    print(f"Model: {model}")

    # Scale numerical columns and encode categorical columns 
    column_transformer = ColumnTransformer(
        transformers=[
            ("scaler", StandardScaler(), numerical_columns),  # Use MinMaxScaler() for min-max normalization
            ("nominal_encoder", OneHotEncoder(drop="first"), nominal_columns),
            ("ordinal_encoder", OrdinalEncoder(categories=ordinal_column_orders), ordinal_columns)  
        ],
        remainder="passthrough"  # Include the boolean columns without transformation
    )

    # Create a pipeline
    pipeline = Pipeline(steps=[
        ("column_transformer", column_transformer),
        ("model", model)
    ])

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)
    
    # Predict on the validation data
    y_val_pred = pipeline.predict(X_val)

    # Calculate evaluation metrics: RMSE, MAPE, R2
    rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))  # avoids squared=False, which recent scikit-learn versions no longer accept
    mape = mean_absolute_percentage_error(y_val, y_val_pred)
    r2 = r2_score(y_val, y_val_pred)

    # Show evaluation metrics
    print(f"RMSE: {round(rmse, 2)}")
    print(f"MAPE: {round(mape, 2)}")
    print(f"R-squared (R²): {round(r2, 2)}")

    # Add evaluation metrics to their respective lists
    rmse_ls.append(rmse)
    mape_ls.append(mape)
    r2_ls.append(r2)

    # Create residual plots
    plot_residuals(y_val, y_val_pred)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Hyperparameter Tuning</h2>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Based on baseline model training, select the best performing models for hyperparameter tuning. <br><br> 
    💡 Example: The following models outperformed the other candidates across evaluation metrics (RMSE, MAPE, R-squared) on the validation data and were selected for hyperparameter tuning:  
    <ul>
        <li><b>Model 1: Random Forest Regressor</b>  
            <br><i>Justification:</i> Delivered low RMSE and MAPE scores during baseline evaluation, showing strong predictive performance with minimal error.</li>
        <li><b>Model 2: XGBoost Regressor</b>  
            <br><i>Justification:</i> Achieved high R-squared and consistently low error metrics, indicating its ability to capture underlying patterns in the data effectively.</li>
    </ul>
</div>
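
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ One simple way to shortlist models is to rank them by validation RMSE and keep the top candidates. The sketch below uses placeholder names and numbers that are not real results; in the notebook, the values would come from the <code>rmse_ls</code> list filled in the baseline loop. The variable names <code>model_names</code>, <code>example_rmse</code>, and <code>top_models</code> are hypothetical.
</div>

```python
# Hypothetical validation RMSE values (placeholders, not real results)
model_names = ["Linear Regression", "Elastic Net", "Random Forest", "XGBoost"]
example_rmse = [15.2, 16.8, 12.9, 12.3]

# Rank models from lowest to highest RMSE (lower is better)
ranking = sorted(zip(model_names, example_rmse), key=lambda pair: pair[1])

# Keep the two best models for hyperparameter tuning
top_models = [name for name, _ in ranking[:2]]
print(top_models)
```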

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Grid Search</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">  
    ℹ️ Hyperparameters:  
    <ul>  
        <li>Core Parameters:  
            <ul>  
                <li><code>estimator</code>: The machine learning model you want to optimize (e.g., <code>RandomForestRegressor</code>).</li>  
                <li><code>param_grid</code>: A dictionary where keys are hyperparameter names, and values are lists of possible values to try.</li>  
                <li><code>cv=5</code>: Number of folds for cross-validation.</li>  
                <li><code>scoring=None</code>: The metric to evaluate model performance (e.g., <code>"neg_mean_squared_error"</code>, <code>"neg_mean_absolute_error"</code>, <code>"r2"</code>).</li>  
            </ul>  
        </li>  
        <li>Optional Parameters:  
            <ul>  
                <li><code>verbose=0</code>: Controls the verbosity of the output. Higher values provide more detailed logs.</li>  
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel. <code>-1</code> uses all available CPU cores.</li>  
                <li><code>pre_dispatch="2*n_jobs"</code>: Controls the number of jobs that get dispatched during parallel execution.</li>  
                <li><code>refit=True</code>: If <code>True</code>, the estimator is refit on the entire training set using the best parameters.</li>  
                <li>Note: <code>GridSearchCV</code> has no <code>random_state</code> parameter; for reproducible results, set <code>random_state</code> on the estimator itself.</li>  
                <li><code>error_score=np.nan</code>: Value to assign to the score if a parameter combination results in a failure.</li>  
                <li><code>return_train_score=False</code>: If <code>True</code>, training scores will be returned along with validation scores.</li>  
            </ul>  
        </li>  
    </ul>  
</div>  
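
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Grid search fits one model per parameter combination and per fold, so the cost grows multiplicatively with every added value. The sketch below counts the fits for the two example grids used in this section; the helper <code>count_grid_fits</code> and the variable names are hypothetical. It can help decide whether a grid is affordable before starting a long run.
</div>

```python
import math


def count_grid_fits(param_grid, cv):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = math.prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv  # refit=True adds one final fit on top


# Same value lists as the example grids in this section
rf_grid_example = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.33, 0.66, 1.0],
}
xgb_grid_example = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "gamma": [0, 0.1, 0.2],
    "min_child_weight": [1, 3, 5],
}

print(count_grid_fits(rf_grid_example, cv=5))
print(count_grid_fits(xgb_grid_example, cv=5))
```

The XGBoost grid is roughly 27 times larger than the Random Forest grid, which is one reason to consider randomized search for it.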


<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 1</h4>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <p>💡 Example: Random Forest Regressor</p>
    <p>The following hyperparameters are typically the most impactful:</p>
    <ul>
        <li><code>n_estimators</code></li>
        <li><code>max_depth</code></li>
        <li><code>min_samples_split</code></li>
        <li><code>min_samples_leaf</code></li>
        <li><code>max_features</code></li>
    </ul>
    <p>For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor" target="_blank">scikit-learn RandomForestRegressor documentation</a>.</p>
</div>
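
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ scikit-learn scorers follow a greater-is-better convention, so <code>"neg_root_mean_squared_error"</code> reports the RMSE with a negative sign. The sketch below uses made-up scores to show that multiplying by -1 recovers the RMSE and that the best candidate has the highest score but the lowest RMSE. The names <code>example_scores</code> and <code>example_rmse_values</code> are hypothetical.
</div>

```python
# Hypothetical mean test scores as reported under scoring="neg_root_mean_squared_error"
example_scores = [-14.2, -12.5, -13.1]

# Flip the sign to get RMSE values (lower is better)
example_rmse_values = [-score for score in example_scores]

# The best candidate maximizes the score, which is the same as minimizing the RMSE
best_by_score = max(range(len(example_scores)), key=lambda i: example_scores[i])
best_by_rmse = min(range(len(example_rmse_values)), key=lambda i: example_rmse_values[i])

print(best_by_score, best_by_rmse, example_rmse_values[best_by_rmse])
```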

In [None]:
# Initialize model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter grid 
rf_param_grid = {
    "n_estimators": [100, 200, 500],               
    "max_depth": [None, 10, 20],              
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.33, 0.66, 1.0]  # floats are fractions of features; the integer 1 would mean a single feature
}

# Initialize grid search object
rf_grid_search = GridSearchCV(
    estimator=rf, 
    param_grid=rf_param_grid, 
    cv=5, 
    scoring="neg_root_mean_squared_error"  # use "r2" for R-squared or "neg_mean_absolute_percentage_error" for MAPE
)

# Fit the grid search to the training data
rf_grid_search.fit(X_train_transformed, y_train)

In [None]:
# DataFrame of grid search results 
rf_grid_search_results = pd.DataFrame({
    "validation_rmse": -1 * rf_grid_search.cv_results_["mean_test_score"],  # mean cross-validated RMSE (sign flipped back to positive)
    "parameters": rf_grid_search.cv_results_["params"]  # parameter values
}) 

# Extract each hyperparameter as a separate column
rf_grid_search_results["n_estimators"] = rf_grid_search_results["parameters"].apply(lambda x: x["n_estimators"])
rf_grid_search_results["max_depth"] = rf_grid_search_results["parameters"].apply(lambda x: x["max_depth"])
rf_grid_search_results["min_samples_split"] = rf_grid_search_results["parameters"].apply(lambda x: x["min_samples_split"])
rf_grid_search_results["min_samples_leaf"] = rf_grid_search_results["parameters"].apply(lambda x: x["min_samples_leaf"])
rf_grid_search_results["max_features"] = rf_grid_search_results["parameters"].apply(lambda x: x["max_features"])

# Delete the parameters column
rf_grid_search_results = rf_grid_search_results.drop("parameters", axis=1)

# Show the top 10 best performing models
print(rf_grid_search_results.sort_values("validation_rmse").head(10))

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 2</h4>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <p>💡 Example: XGBoost Regressor</p>
    <p>The following hyperparameters are typically the most impactful:</p>
    <ul>
        <li><code>n_estimators</code></li>
        <li><code>max_depth</code></li>
        <li><code>learning_rate</code></li>
        <li><code>subsample</code></li>
        <li><code>colsample_bytree</code></li>
        <li><code>gamma</code></li>
        <li><code>min_child_weight</code></li>
    </ul>
    <p>For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.</p>
</div>

In [None]:
# Initialize model
xgb = XGBRegressor(random_state=42)

# Define hyperparameter grid 
xgb_param_grid = {
    "n_estimators": [100, 200, 300, 500],               
    "max_depth": [3, 6, 10], 
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],  
    "gamma": [0, 0.1, 0.2],           
    "min_child_weight": [1, 3, 5]
}

# Initialize grid search object
xgb_grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=xgb_param_grid, 
    cv=5, 
    scoring="neg_root_mean_squared_error"  # use "r2" for R-squared or "neg_mean_absolute_percentage_error" for MAPE
)

# Fit the grid search to the training data
xgb_grid_search.fit(X_train_transformed, y_train)

In [None]:
# DataFrame of grid search results 
xgb_grid_search_results = pd.DataFrame({
    "validation_rmse": -1 * xgb_grid_search.cv_results_["mean_test_score"],  # mean cross-validated RMSE (sign flipped back to positive)
    "parameters": xgb_grid_search.cv_results_["params"]  # parameter values
}) 

# Extract each hyperparameter as a separate column
xgb_grid_search_results["n_estimators"] = xgb_grid_search_results["parameters"].apply(lambda x: x["n_estimators"])
xgb_grid_search_results["max_depth"] = xgb_grid_search_results["parameters"].apply(lambda x: x["max_depth"])
xgb_grid_search_results["learning_rate"] = xgb_grid_search_results["parameters"].apply(lambda x: x["learning_rate"])
xgb_grid_search_results["subsample"] = xgb_grid_search_results["parameters"].apply(lambda x: x["subsample"])
xgb_grid_search_results["colsample_bytree"] = xgb_grid_search_results["parameters"].apply(lambda x: x["colsample_bytree"])
xgb_grid_search_results["gamma"] = xgb_grid_search_results["parameters"].apply(lambda x: x["gamma"])
xgb_grid_search_results["min_child_weight"] = xgb_grid_search_results["parameters"].apply(lambda x: x["min_child_weight"])

# Delete the parameters column
xgb_grid_search_results = xgb_grid_search_results.drop("parameters", axis=1)

# Show the top 10 best performing models
print(xgb_grid_search_results.sort_values("validation_rmse").head(10))

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Randomized Search</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">  
    ℹ️ Hyperparameters:  
    <ul>  
        <li>Core Parameters:  
            <ul>  
                <li><code>estimator</code>: The machine learning model you want to optimize (e.g., <code>RandomForestRegressor</code>).</li>  
                <li><code>param_distributions</code>: A dictionary where keys are hyperparameter names, and values are the distributions or lists of possible values.</li>  
                <li><code>n_iter=10</code>: The number of random parameter combinations to sample from the distributions.</li>  
                <li><code>cv=5</code>: Number of folds for cross-validation.</li>  
                <li><code>scoring=None</code>: The metric to evaluate model performance (e.g., <code>"neg_mean_squared_error"</code>, <code>"neg_mean_absolute_error"</code>, <code>"r2"</code>).</li>  
            </ul>  
        </li>  
        <li>Optional Parameters:  
            <ul>  
                <li><code>verbose=0</code>: Controls the verbosity of the output. Higher values provide more detailed logs.</li>  
                <li><code>random_state=None</code>: Ensures reproducibility of results.</li>  
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel. <code>-1</code> uses all available CPU cores.</li>  
                <li><code>pre_dispatch="2*n_jobs"</code>: Controls the number of jobs that get dispatched during parallel execution.</li>  
                <li><code>refit=True</code>: If <code>True</code>, the estimator is refit on the entire training set using the best parameters.</li>  
                <li><code>error_score=np.nan</code>: Value to assign to the score if a parameter combination results in a failure.</li>  
                <li><code>return_train_score=False</code>: If <code>True</code>, training scores will be returned along with validation scores.</li>  
            </ul>  
        </li>  
    </ul>  
</div>  

In [None]:
# Import scipy to define the hyperparameter distributions (randint for int, uniform for float)
from scipy.stats import randint, uniform  

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 1</h4>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <p>💡 Example: Random Forest Regressor</p>
    <p>The following hyperparameters are typically the most impactful:</p>
    <ul>
        <li><code>n_estimators</code></li>
        <li><code>max_depth</code></li>
        <li><code>min_samples_split</code></li>
        <li><code>min_samples_leaf</code></li>
        <li><code>max_features</code></li>
    </ul>
    <p>For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor" target="_blank">scikit-learn RandomForestRegressor documentation</a>.</p>
</div>
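
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The <code>scipy.stats</code> distributions take arguments that are easy to misread: <code>uniform(loc, scale)</code> covers the range from <code>loc</code> to <code>loc + scale</code>, and <code>randint(low, high)</code> excludes <code>high</code>. The sketch below mimics these semantics with the standard library only, so it is an approximation of what randomized search draws and not the scipy implementation. The helpers <code>sample_uniform</code> and <code>sample_randint</code> are hypothetical.
</div>

```python
import random


def sample_uniform(loc, scale, rng):
    """Mimic scipy.stats.uniform(loc, scale): a float between loc and loc + scale."""
    return loc + scale * rng.random()


def sample_randint(low, high, rng):
    """Mimic scipy.stats.randint(low, high): an integer from low to high - 1."""
    return rng.randrange(low, high)


rng = random.Random(42)  # fixed seed so the draws are repeatable

# Draw five candidate settings, similar in spirit to n_iter=5
sampled_candidates = [
    {
        "n_estimators": sample_randint(50, 1000, rng),
        "max_depth": sample_randint(5, 50, rng),
        "max_features": sample_uniform(0.1, 0.9, rng),  # range 0.1 to 1.0, not 0.1 to 0.9
    }
    for _ in range(5)
]

for candidate in sampled_candidates:
    print(candidate)
```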

In [None]:
# Initialize model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter distributions 
rf_param_distributions = {
    "n_estimators": randint(50, 1000),               
    "max_depth": randint(5, 50),              
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.1, 0.9)                
}

# Initialize randomized search object
rf_random_search = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=rf_param_distributions, 
    n_iter=50,
    cv=5, 
    scoring="neg_root_mean_squared_error",  # use "r2" for R-squared or "neg_mean_absolute_percentage_error" for MAPE
    random_state=42
)

# Fit the random search to the training data
rf_random_search.fit(X_train_transformed, y_train)

In [None]:
# DataFrame of randomized search results
rf_random_search_results = pd.DataFrame({
    "validation_rmse": -1 * rf_random_search.cv_results_["mean_test_score"],  # mean cross-validated RMSE (sign flipped back to positive)
    "parameters": rf_random_search.cv_results_["params"]  # parameter values
})

# Extract each hyperparameter as a separate column
rf_random_search_results["n_estimators"] = rf_random_search_results["parameters"].apply(lambda x: x["n_estimators"])
rf_random_search_results["max_depth"] = rf_random_search_results["parameters"].apply(lambda x: x["max_depth"])
rf_random_search_results["min_samples_split"] = rf_random_search_results["parameters"].apply(lambda x: x["min_samples_split"])
rf_random_search_results["min_samples_leaf"] = rf_random_search_results["parameters"].apply(lambda x: x["min_samples_leaf"])
rf_random_search_results["max_features"] = rf_random_search_results["parameters"].apply(lambda x: x["max_features"])

# Delete the parameters column
rf_random_search_results = rf_random_search_results.drop("parameters", axis=1)

# Show the top 10 best performing models (lowest RMSE first)
print(rf_random_search_results.sort_values("validation_rmse").head(10))

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 2</h4>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <p>💡 Example: XGBoost Regressor</p>
    <p>The following hyperparameters are typically the most impactful:</p>
    <ul>
        <li><code>n_estimators</code></li>
        <li><code>max_depth</code></li>
        <li><code>learning_rate</code></li>
        <li><code>subsample</code></li>
        <li><code>colsample_bytree</code></li>
        <li><code>gamma</code></li>
        <li><code>min_child_weight</code></li>
    </ul>
    <p>For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.</p>
</div>

In [None]:
# Initialize model
xgb = XGBRegressor(random_state=42)

# Define hyperparameter distributions 
xgb_param_distributions = {
    'n_estimators': [100, 200, 300, 400, 500, 600, 700],  
    'max_depth': randint(3, 11),                
    'learning_rate': uniform(0.01, 0.4),       
    'subsample': uniform(0.5, 0.5),             
    'colsample_bytree': uniform(0.5, 0.5),      
    'gamma': uniform(0, 0.5),                     
    'min_child_weight': [1, 3, 5, 7],           
}

# Initialize randomized search object
xgb_random_search = RandomizedSearchCV(
    estimator=xgb, 
    param_distributions=xgb_param_distributions, 
    n_iter=50,
    cv=5, 
    scoring="neg_root_mean_squared_error",  # use "r2" for R-squared or "neg_mean_absolute_percentage_error" for MAPE
    random_state=42
)

# Fit the random search to the training data
xgb_random_search.fit(X_train_transformed, y_train)

In [None]:
# DataFrame of randomized search results
xgb_random_search_results = pd.DataFrame({
    "validation_rmse": -1 * xgb_random_search.cv_results_["mean_test_score"],  # mean cross-validated RMSE (sign flipped back to positive)
    "parameters": xgb_random_search.cv_results_["params"]  # parameter values
})

# Extract each hyperparameter as a separate column
xgb_random_search_results["n_estimators"] = xgb_random_search_results["parameters"].apply(lambda x: x["n_estimators"])
xgb_random_search_results["max_depth"] = xgb_random_search_results["parameters"].apply(lambda x: x["max_depth"])
xgb_random_search_results["learning_rate"] = xgb_random_search_results["parameters"].apply(lambda x: x["learning_rate"])
xgb_random_search_results["subsample"] = xgb_random_search_results["parameters"].apply(lambda x: x["subsample"])
xgb_random_search_results["colsample_bytree"] = xgb_random_search_results["parameters"].apply(lambda x: x["colsample_bytree"])
xgb_random_search_results["gamma"] = xgb_random_search_results["parameters"].apply(lambda x: x["gamma"])
xgb_random_search_results["min_child_weight"] = xgb_random_search_results["parameters"].apply(lambda x: x["min_child_weight"])

# Delete the parameters column
xgb_random_search_results = xgb_random_search_results.drop("parameters", axis=1)

# Show the top 10 best performing models (lowest RMSE first)
print(xgb_random_search_results.sort_values("validation_rmse").head(10))

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Final Model</h2>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Select the best model based on hyperparameter tuning. <br><br> 
    💡 Example: The XGBoost Regressor model achieved the best performance (e.g., RMSE = 12.34) on the validation data compared to other candidates, making it the optimal choice for the final model. This model will be further evaluated on the test data to confirm its generalizability.  
    <br><br>
    Hyperparameter Values:  
    <ul>
        <li><code>n_estimators=300</code></li>
        <li><code>max_depth=4</code></li>
        <li><code>learning_rate=0.1</code></li>
        <li><code>subsample=0.8</code></li>
        <li><code>colsample_bytree=0.8</code></li>
        <li><code>gamma=0.1</code></li>
        <li><code>min_child_weight=3</code></li>
        <li><code>random_state=42</code></li>
    </ul>
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Training Final Model</h3>
</div> 

In [None]:
# Initialize model
final_model = XGBRegressor(
    n_estimators=300, 
    max_depth=4, 
    learning_rate=0.1,
    subsample=0.8, 
    colsample_bytree=0.8, 
    gamma=0.1,
    min_child_weight=3, 
    random_state=42
)

# Train model
final_model.fit(X_train_transformed, y_train)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Model Evaluation</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Metrics</strong> <br> 
    📌 Evaluate model performance using metrics (RMSE, MAPE, R-squared) for training, validation, and test data.
</div>

In [None]:
# Predict on the training, validation and test data
y_train_pred = final_model.predict(X_train_transformed)
y_val_pred = final_model.predict(X_val_transformed)
y_test_pred = final_model.predict(X_test_transformed)

# Calculate evaluation metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))  # avoids squared=False, which recent scikit-learn versions no longer accept
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mape = mean_absolute_percentage_error(y_train, y_train_pred)
val_mape = mean_absolute_percentage_error(y_val, y_val_pred)
test_mape = mean_absolute_percentage_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Create table of evaluation metrics
final_model_evaluation = pd.DataFrame({
    "Metric": ["RMSE", "MAPE", "R-squared"],
    "Training": [train_rmse, train_mape, train_r2],
    "Validation": [val_rmse, val_mape, val_r2],
    "Test": [test_rmse, test_mape, test_r2]
})

# Show table
print(final_model_evaluation.round(2))  # round metrics to 2 decimals

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Error Analysis: Residual Plots</strong> <br> 
    📌 Plot predicted vs. actual values and residuals vs. actual values for validation and test data.
</div>

In [None]:
# Residual plots for validation data
plot_residuals(y_val, y_val_pred)

In [None]:
# Residual plots for test data
plot_residuals(y_test, y_test_pred)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Error Analysis: Descriptive Statistics and Visualizations</strong> <br> 
    📌 Analyze errors on test data with descriptive statistics (mean, median) and visualize error distributions using histograms. 
</div>
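
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ It helps to fix one sign convention for errors before analyzing them. The sketch below assumes the convention "predicted minus actual", so a positive error means the model over-predicted. The helper <code>percentage_error</code> is hypothetical, and the three value pairs are taken from the example table in the Model Prediction Examples section.
</div>

```python
def percentage_error(actual, predicted):
    """Signed error in percent: positive means the model over-predicted."""
    return (predicted - actual) / actual * 100


# (actual, predicted) pairs from the example table in this notebook
example_rows = [(1200, 1205), (2500, 2100), (2000, 2060)]

example_errors = [round(percentage_error(actual, predicted), 1) for actual, predicted in example_rows]
print(example_errors)
```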

In [None]:
# Combine X_test, y_test, and y_test_pred into one DataFrame
df_test = X_test.copy()  
df_test["Actual"] = y_test
df_test["Predicted"] = y_test_pred

# Calculate errors 
df_test["Error"] = df_test["Predicted"] - df_test["Actual"]
df_test["Absolute Error"] = np.abs(df_test["Actual"] - df_test["Predicted"])
df_test["Error (%)"] = (df_test["Predicted"] - df_test["Actual"]) / df_test["Actual"] * 100  # same sign convention as "Error"

In [None]:
# Mean and median error
print(f"Mean Absolute Error: {df_test['Absolute Error'].mean():.2f}")
print(f"Median Absolute Error: {df_test['Absolute Error'].median():.2f}")
print(f"Mean Error (%): {df_test['Error (%)'].mean():.2f}%")
print(f"Median Error (%): {df_test['Error (%)'].median():.2f}%")

In [None]:
# Error histograms 
plt.figure(figsize=(12, 6))

# Create subplot for absolute errors
plt.subplot(1, 2, 1)
sns.histplot(data=df_test, x="Absolute Error", bins=30)
plt.title("Distribution of Absolute Errors")
plt.xlabel("Absolute Error")
plt.ylabel("Frequency")

# Create subplot for percentage errors
plt.subplot(1, 2, 2)
sns.histplot(data=df_test, x="Error (%)", bins=30)
plt.title("Distribution of Percentage Errors")
plt.xlabel("Error (%)")
plt.ylabel("Frequency")

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Error Analysis: Feature Relationships with Errors</strong> <br> 
    📌 Explore relationships between the features and the errors through correlations and scatterplots. 
</div>

In [None]:
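# Correlations between features and absolute error
# Note: Ensure numerical_columns and boolean_columns were defined in the exploratory data analysis section
# Keep only columns present in df_test (this drops the target) and add the error column so corr() can relate them
feature_columns = [col for col in numerical_columns + boolean_columns if col in df_test.columns]
df_test[feature_columns + ["Absolute Error"]].corr()["Absolute Error"].drop("Absolute Error").sort_values(ascending=False)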

In [None]:
# Scatterplot matrix between (numerical or boolean) features and absolute error
import math

# Features to plot: All numerical and boolean columns excluding the target variable
features_to_plot = [col for col in numerical_columns + boolean_columns if col != "numerical_target"]

# Define the number of columns for the grid
num_cols = 4  

# Calculate the number of rows needed
num_rows = math.ceil(len(features_to_plot) / num_cols)

# Set the figure size dynamically based on the grid size
plt.figure(figsize=(num_cols * 4, num_rows * 4))

# Iterate over the features to plot
for i, feature in enumerate(features_to_plot):
    # Create a subplot in the grid 
    plt.subplot(num_rows, num_cols, i + 1)
    
    # Create a scatterplot between the current feature and the absolute error
    sns.scatterplot(data=df_test, x=feature, y="Absolute Error")
    
    # Add title and axis labels
    plt.title(f"Absolute Error by {feature}")
    plt.xlabel(f"{feature}")
    plt.ylabel("Absolute Error")

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Feature Importance</h3>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Feature Importance Plot</strong>: Linear Regression or Elastic Net Regression.
</div>

In [None]:
# Get the coefficients
# Note: final_model must be LinearRegression or ElasticNet
coefficients = final_model.coef_

# Get the feature names 
feature_names = X_train_transformed.columns  

# Create a DataFrame to make it easier for Seaborn to plot
feature_importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": np.abs(coefficients)
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create feature importance plot of the top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x="importance", y="feature", data=feature_importance_df.head(10), palette="colorblind")
plt.title("Top 10 Feature Importances")
plt.xlabel("Importance (Absolute Coefficients)")
plt.ylabel("Features")
plt.tight_layout()
plt.show()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Feature Importance Plot</strong>: Decision Tree or Random Forest. 
</div>

In [None]:
# Get the feature importances
# Note: final_model must be a decision tree or a random forest
importances = final_model.feature_importances_

# Get the feature names 
feature_names = X_train_transformed.columns  

# Create a DataFrame to make it easier for Seaborn to plot
feature_importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create feature importance plot of the top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x="importance", y="feature", data=feature_importance_df.head(10), palette="colorblind")
plt.title("Top 10 Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.tight_layout()
plt.show()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Feature Importance Plot</strong>: XGBoost.  
</div> 

In [None]:
# Get the feature importances
# Note: final_model must be an XGBoost model
# The scikit-learn wrapper exposes get_score() on the underlying Booster
importances = final_model.get_booster().get_score(importance_type="gain")

# Convert to DataFrame
feature_importance_df = pd.DataFrame({
    "feature": list(importances.keys()),
    "importance": list(importances.values())
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create feature importance plot of the top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x="importance", y="feature", data=feature_importance_df.head(10), palette="colorblind")
plt.title("Top 10 Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.tight_layout()
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Model Prediction Examples</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Show illustrative examples of model predictions from test data to demonstrate performance on unseen data.
    <ul>
        <li>Goal: Give stakeholders a clear picture of when the model performs well and when it struggles.</li>
        <li>Recommendations:
            <ul>
                <li>Show 5-10 diverse examples: Best cases, worst cases, and typical cases.</li>
                <li>Show 2-5 most important features, actual vs. predicted values, and errors.</li>
                <li>Add notes about any interesting patterns or edge cases observed.</li>
            </ul>
        </li>
        <li>Example table:</li>
    </ul>
    <table style="border-collapse: collapse; margin-top: 10px;">
        <tr style="background-color:#e8f4fd; border-bottom: 2px solid #a6c8e9;">
            <th style="padding: 8px;">Size (m²)</th>
            <th style="padding: 8px;">Bedrooms</th>
            <th style="padding: 8px;">Neighborhood</th>
            <th style="padding: 8px;">Built Year</th>
            <th style="padding: 8px;">Actual Price (€)</th>
            <th style="padding: 8px;">Predicted Price (€)</th>
            <th style="padding: 8px;">Error (%)</th>
        </tr>
        <tr style="background-color:#e8f4fd;">
            <td style="padding: 8px;">80</td>
            <td style="padding: 8px;">2</td>
            <td style="padding: 8px;">Prenzlauer Berg</td>
            <td style="padding: 8px;">1990</td>
            <td style="padding: 8px;">1200</td>
            <td style="padding: 8px;">1205</td>
            <td style="padding: 8px; color:green;">+0.4</td>
        </tr>
        <tr style="background-color:#d0e7fa;">
            <td style="padding: 8px;">120</td>
            <td style="padding: 8px;">3</td>
            <td style="padding: 8px;">Charlottenburg</td>
            <td style="padding: 8px;">2005</td>
            <td style="padding: 8px;">2500</td>
            <td style="padding: 8px;">2100</td>
            <td style="padding: 8px; color:red;">-16.0</td>
        </tr>
        <tr style="background-color:#e8f4fd;">
            <td style="padding: 8px;">100</td>
            <td style="padding: 8px;">3</td>
            <td style="padding: 8px;">Kreuzberg</td>
            <td style="padding: 8px;">2010</td>
            <td style="padding: 8px;">2000</td>
            <td style="padding: 8px;">2060</td>
            <td style="padding: 8px; color:green;">+3.0</td>
        </tr>
        <tr style="background-color:#d0e7fa;">
            <td style="padding: 8px;">150</td>
            <td style="padding: 8px;">4</td>
            <td style="padding: 8px;">Mitte</td>
            <td style="padding: 8px;">2018</td>
            <td style="padding: 8px;">3500</td>
            <td style="padding: 8px;">4000</td>
            <td style="padding: 8px; color:red;">+14.3</td>
        </tr>
        <tr style="background-color:#e8f4fd;">
            <td style="padding: 8px;">75</td>
            <td style="padding: 8px;">1</td>
            <td style="padding: 8px;">Neukölln</td>
            <td style="padding: 8px;">1995</td>
            <td style="padding: 8px;">1000</td>
            <td style="padding: 8px;">1030</td>
            <td style="padding: 8px; color:green;">+3.0</td>
        </tr>
    </table>
</div>
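
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Sketch: Computing the Error (%) Column.</strong> <br>
    A minimal sketch of how the signed percentage error in the example table can be reproduced, assuming it is defined as <code>(predicted - actual) / actual * 100</code>. The helper name <code>percentage_error</code> is hypothetical, and the value pairs are the illustrative rows from the example table.
</div>

```python
# Hypothetical helper: signed percentage error relative to the actual value
def percentage_error(actual, predicted):
    return (predicted - actual) / actual * 100


# Illustrative (actual, predicted) pairs taken from the example table
examples = [(1200, 1205), (2500, 2100), (2000, 2060), (3500, 4000), (1000, 1030)]

# Positive values indicate overprediction, negative values underprediction
errors = [round(percentage_error(actual, predicted), 1) for actual, predicted in examples]

# Show each example with its signed percentage error
for (actual, predicted), error in zip(examples, errors):
    print(f"Actual: {actual}, Predicted: {predicted}, Error (%): {error:+.1f}")
```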

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Identify Best, Worst, and Typical Cases.</strong>
</div>

In [None]:
# Get top 5 most important features
# Note: Ensure feature_importance_df was created in feature importance section 
top5_features = feature_importance_df.sort_values("importance", ascending=False).head(5)["feature"]

# Keep only top 5 features and error metrics
# Note: Ensure df_test was created in model evaluation section 
columns_to_keep = list(top5_features) + ["Actual", "Predicted", "Error", "Absolute Error", "Error (%)"]
df_test = df_test[columns_to_keep].copy()

# Best cases: Top 10 cases with smallest errors
print("Best cases:")
print(df_test.sort_values("Absolute Error").head(10))

# Worst cases: Top 10 cases with largest errors
print("Worst cases:")
print(df_test.sort_values("Absolute Error", ascending=False).head(10))

# Typical cases: 10 cases closest to the mean absolute error
mean_error = df_test["Absolute Error"].mean()
df_test["Difference from Mean Error"] = np.abs(df_test["Absolute Error"] - mean_error)
print("Typical cases:")
print(df_test.sort_values("Difference from Mean Error").head(10))

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Plot: Actual vs. Predicted Values.</strong> <br> 
    While identifying the best, worst, and typical cases offers insights into individual scenarios, this plot provides context for understanding overall model performance and reveals trends like consistent underprediction or overprediction.
</div>  

In [None]:
# Plot: Actual vs. predicted values
plt.figure(figsize=(7, 5), dpi=150)
plt.scatter(y_test, y_test_pred)
plt.plot([min(y_test), max(y_test)], 
         [min(y_test), max(y_test)], 
         color="red", 
         linestyle="--", 
         label="Perfect Prediction")  # Diagonal reference line
plt.title("Actual vs. Predicted Values", fontsize=14)
plt.xlabel("Actual Values", fontsize=12)
plt.ylabel("Predicted Values", fontsize=12)
plt.grid(True, linestyle="--", alpha=0.6)
plt.legend() 
plt.tight_layout()
plt.show()
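
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Sketch: Quantifying Systematic Under- or Overprediction.</strong> <br>
    A numeric complement to the plot, sketched under the assumption that the residual is defined as predicted minus actual. The variable names and values are hypothetical and only illustrate the calculation; in the notebook, <code>y_test</code> and <code>y_test_pred</code> would take their place.
</div>

```python
# Hypothetical actual and predicted values (illustrative only)
actual = [100, 200, 300, 400]
predicted = [90, 185, 290, 370]

# Residuals: negative values mean the model predicted too low
residuals = [p - a for a, p in zip(actual, predicted)]

# Mean signed error: clearly below zero suggests consistent underprediction,
# clearly above zero suggests consistent overprediction
mean_signed_error = sum(residuals) / len(residuals)

# Share of cases in which the model predicted too low
share_underpredicted = sum(r < 0 for r in residuals) / len(residuals)

print(f"Mean signed error: {mean_signed_error:.2f}")
print(f"Share of underpredicted cases: {share_underpredicted:.0%}")
```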

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Saving Model</h3>
</div>   

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Save both the column transformer and the final model as pickle files in the <code>models</code> directory for later use.
</div>

In [None]:
# Imports
import pickle
import os

# Create models directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Save column transformer 
try:
    with open("models/column_transformer.pkl", "wb") as file:
        pickle.dump(column_transformer, file)
    print("Column transformer saved successfully.")
except Exception as e:
    print(f"An error occurred while saving the column transformer: {e}")
    
# Save final model
try:
    with open("models/final_model.pkl", "wb") as file:
        pickle.dump(final_model, file)
    print("Final model saved successfully.")
except Exception as e:
    print(f"An error occurred while saving the final model: {e}")

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Modeling (Classification)</h1>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ This part of the template covers classification problems, where the task is to predict a categorical target variable. 
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Helper functions to save and load models using <code>pickle</code>. 
</div>

In [None]:
# Function to save model as .pkl file
def save_model(model, filename):
    # Create models directory if it doesn't exist
    os.makedirs("models", exist_ok=True)
    # Save model as .pkl file 
    try:
        with open(f"models/{filename}", "wb") as file:
            pickle.dump(model, file)
        print(f"Model saved successfully under 'models/{filename}'.")
    except Exception as e:
        print(f"An error occurred while saving the model: {e}")


# Function to load model from .pkl file 
def load_model(filename):
    try:
        with open(f"models/{filename}", "rb") as file:  # ensure model is stored in "models" directory
            model = pickle.load(file)
        print(f"{filename} loaded successfully.")
        return model
    except Exception as e:
        print(f"An error occurred while loading {filename}: {e}")
        return None

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Baseline Models</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Train 8 baseline models (default hyperparameter values):  
    <ul>
        <li>Logistic Regression</li>
        <li>Elastic Net Logistic Regression</li>
        <li>K-Nearest Neighbors Classifier</li>
        <li>Support Vector Classifier</li>
        <li>Decision Tree Classifier</li>
        <li>Random Forest Classifier</li>
        <li>Multi-Layer Perceptron Classifier</li>
        <li>XGBoost Classifier</li>
    </ul>
    🎯 Evaluate model performance:  
    <ul>
        <li>Metrics:
            <ul>
                <li>Accuracy</li>
                <li>Recall</li>
                <li>Precision</li>
                <li>F1-score</li>
                <li>ROC-AUC (Area Under the Receiver Operating Characteristic Curve)</li>
                <li>AUC-PR (Area Under the Precision-Recall Curve)</li>
            </ul>     
        </li>
        <li>Additional Diagnostics:
            <ul>
                <li>Metrics Comparison Table</li>
                <li>Precision-Recall Curves</li>
                <li>Classification Report</li>
                <li>Confusion Matrix</li>                
                <li>Overfitting</li>
                <li>Feature Misclassification Analysis</li> 
            </ul>
        </li>
    </ul>
</div>
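
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Sketch: Metrics from Confusion Matrix Counts.</strong> <br>
    A minimal sketch of how accuracy, precision, recall, and F1-score follow from the four confusion matrix counts of a binary classifier. The counts are hypothetical values chosen for illustration; in the notebook, the metrics are computed with <code>sklearn</code>.
</div>

```python
# Hypothetical confusion matrix counts for a binary classifier (illustrative values)
tp, fp, fn, tn = 40, 10, 20, 30

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Recall: share of actual positives that the model finds
recall = tp / (tp + fn)

# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}")
```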

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Training</h3>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Train all baseline models and store fitted models, predicted values, and evaluation metrics in a results dictionary.
</p> 

In [None]:
# Imports
import time

# Define features to be used for model training
columns_to_keep = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
X_train_transformed = X_train_transformed[columns_to_keep].copy()
X_val_transformed = X_val_transformed[columns_to_keep].copy()
X_test_transformed = X_test_transformed[columns_to_keep].copy()

# Define baseline models
baseline_models = {
    "Logistic Regression": LogisticRegression(),
    "Elastic Net": LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(probability=True),  # get predicted probabilities for ROC-AUC and AUC-PR
    "Neural Network": MLPClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42)
}


# Function to train and evaluate a single model
def evaluate_model(model, X_train, y_train, X_val, y_val):
    # Fit model on the training data and measure training time
    start_time = time.time()
    model.fit(X_train, y_train)
    end_time = time.time()    
    
    # Predict on the validation data
    y_val_pred = model.predict(X_val)
    y_val_proba = model.predict_proba(X_val)[:, 1]
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_val, y_val_pred)
    recall = recall_score(y_val, y_val_pred)
    precision = precision_score(y_val, y_val_pred)
    f1 = f1_score(y_val, y_val_pred)
    roc_auc = roc_auc_score(y_val, y_val_proba)
    precision_curve, recall_curve, _ = precision_recall_curve(y_val, y_val_proba)
    auc_pr = auc(recall_curve, precision_curve)
    
    # Return fitted model, predicted values, and evaluation metrics
    return {
        "model": model,
        "training_time": end_time - start_time,
        "y_val_pred": y_val_pred,
        "y_val_proba": y_val_proba,
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1-Score": f1,
        "ROC-AUC": roc_auc,
        "AUC-PR": auc_pr
    }


# Example usage
# tree = evaluate_model(baseline_models["Decision Tree"], X_train_transformed, y_train, X_val_transformed, y_val)
# knn = evaluate_model(baseline_models["K-Nearest Neighbors"], X_train_transformed, y_train, X_val_transformed, y_val)


# Function to train and evaluate all models 
def evaluate_all_models(models, X_train, y_train, X_val, y_val):
    results = {}   
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        result = evaluate_model(model, X_train, y_train, X_val, y_val)
        results[model_name] = result
        print(f"Training Time: {round(result['training_time'], 1)} sec")
    return results

    
# Use function to train all baseline models
baseline_model_results = evaluate_all_models(baseline_models, X_train_transformed, y_train, X_val_transformed, y_val)

# Save baseline model results as .pkl file (using helper function)  
save_model(baseline_model_results, "baseline_models.pkl")

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Metrics</h3>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Compare the evaluation metrics of all baseline models on the validation data.
</p> 

In [None]:
# Load baseline model results (using helper function)
baseline_model_results = load_model("baseline_models.pkl")

# Extract evaluation metrics
baseline_model_comparison = {
    model_name: {
        metric: baseline_model_results[model_name][metric]
        for metric in ["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"]
    }
    for model_name in baseline_model_results
}

# Convert the dictionary to a DataFrame 
baseline_model_comparison = pd.DataFrame(baseline_model_comparison).transpose()

# Show model comparison table
round(baseline_model_comparison, 2)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Precision-Recall Curves</h3>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Plot precision-recall curves of all baseline models on the validation data in a single graph.
</p> 

In [None]:
# Function to plot precision-recall curve of one or more models
def plot_precision_recall_curve(y_true, model_results, title="Precision-Recall Curves", safe_to_file=False):
    # Set the figure size
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Get colors for different models (colormap "viridis" is colorblind-friendly)
    cmap = plt.get_cmap("viridis", len(model_results))
    
    # Plot baseline performance of random classifier
    baseline = np.sum(y_true) / len(y_true)
    ax.axhline(y=baseline, color="black", linestyle="--", alpha=0.5, label=f"Baseline = {baseline:.2f}")
    
    # Iterate over each model in the results dictionary
    for i, (model_name, model_result) in enumerate(model_results.items()):
        # Plot precision-recall curve for the current model
        precision_curve, recall_curve, _ = precision_recall_curve(y_true, model_result["y_val_proba"])
        auc_pr = auc(recall_curve, precision_curve)
        ax.plot(recall_curve, precision_curve, color=cmap(i), label=f"{model_name} AUC-PR={auc_pr:.2f}")
    
    # Customize title, axes labels, axes ticks, legend, and grid
    ax.set_title(title, fontsize=14)
    ax.set_ylabel("Precision", fontsize=12)
    ax.set_xlabel("Recall", fontsize=12)
    ax.set_ylim(0, 1.02)  # slightly extend y-axis for visibility
    ax.set_xlim(0, 1)
    ax.set_yticks(np.arange(0, 1.1, 0.1))
    ax.set_xticks(np.arange(0, 1.1, 0.1))
    ax.legend(loc="best", fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Save the plot to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                fig.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Precision-recall curve plot saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving precision-recall curve plot: {e}")
        else:
            print(f"Skip saving precision-recall curve plot to file: '{image_path}' already exists.")
    
    # Show the plot
    plt.show()


# Use function to plot precision-recall curves of all baseline models on the validation data 
plot_precision_recall_curve(
    y_val, 
    baseline_model_results, 
    title="Precision-Recall Curves: Baseline Models", 
    safe_to_file="precision_recall_curves_baseline.png"
)
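
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Sketch: Points on a Precision-Recall Curve.</strong> <br>
    A dependency-free sketch of what a single point on the curve represents: precision and recall at one probability threshold. It also shows why the baseline of a random classifier equals the share of positives. The labels, probabilities, and the helper <code>precision_recall_at</code> are hypothetical and only serve as illustration.
</div>

```python
# Hypothetical true labels and predicted probabilities (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_proba = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Baseline: expected precision of a random classifier equals the share of positives
baseline = sum(y_true) / len(y_true)


# Hypothetical helper: precision and recall at a given probability threshold
def precision_recall_at(threshold):
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention: precision is set to 1.0 if nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Lowering the threshold never reduces recall, but it often reduces precision
print(f"Baseline: {baseline:.2f}")
for threshold in [0.5, 0.25, 0.15]:
    precision, recall = precision_recall_at(threshold)
    print(f"Threshold {threshold}: Precision = {precision:.2f}, Recall = {recall:.2f}")
```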

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Classification Report</h3>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Create classification report for all baseline models on the validation data.
</p> 

In [None]:
# Create classification report for all baseline models
for model_name, model_result in baseline_model_results.items():
    print(f"\n{model_name}: Classification Report")
    print(classification_report(y_val, model_result["y_val_pred"], zero_division=0))  # disable zero-division warning if no predictions for a given class (sets metric to 0) 

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Confusion Matrix</h3>
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Plot confusion matrix for all baseline models on the validation data.
</p> 

In [None]:
# Function to plot confusion matrix
def plot_confusion_matrix(y, y_pred, title="", display_labels=None, safe_to_file=False, axes=None):
    # Create axis if not provided
    if axes is None:
        fig, ax = plt.subplots()
    else:
        ax = axes
    
    # Create confusion matrix
    cm = confusion_matrix(y, y_pred)
    cm_disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=display_labels)
    cm_disp.plot(cmap="viridis", values_format="d", colorbar=False, ax=ax)
    
    # Customize plot
    ax.set_title(f"{title}", fontsize=14)
    ax.set_xlabel("Predicted", fontsize=12)
    ax.set_ylabel("True", fontsize=12)

    # Save to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Confusion matrix saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving confusion matrix: {e}")
        else:
            print(f"Skip saving confusion matrix to file: '{image_path}' already exists.")

    # Show the plot
    if axes is None:
        fig.tight_layout()
        plt.show()


# --- Use function to plot confusion matrix for all baseline models ---
# Calculate number of rows and columns for subplot grid
n_plots = len(baseline_model_results)
n_cols = 3  
n_rows = math.ceil(n_plots / n_cols) 

# Create subplot grid with figure size based on 5x5 inches per subplot
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 5))

# Flatten the axes for easier iteration
axes = axes.flat

# Iterate over each model
for i, (model_name, model_result) in enumerate(baseline_model_results.items()):
    # Plot confusion matrix for current model
    plot_confusion_matrix(y_val, model_result["y_val_pred"], title=f"{model_name}", axes=axes[i])

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis("off")
    
# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Overfitting</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Diagnose overfitting for all baseline models by comparing evaluation metrics between training and validation data.
</div>

In [None]:
# Function to analyze overfitting
def analyze_overfitting(X_train, y_train, model_results):
    # Store overfitting results as a list of dictionaries
    overfitting_results = [] 
    
    # Iterate over each model
    for model_name, model_result in model_results.items():
        model = model_result["model"]
    
        # Predict on training data
        y_train_proba = model.predict_proba(X_train)[:, 1]
        if "best_threshold" in model_result:
            y_train_pred = (y_train_proba >= model_result["best_threshold"]).astype(int)
        else:
            y_train_pred = model.predict(X_train)

        # Calculate training metrics
        accuracy_train = accuracy_score(y_train, y_train_pred)
        recall_train = recall_score(y_train, y_train_pred)
        precision_train = precision_score(y_train, y_train_pred)
        f1_train = f1_score(y_train, y_train_pred)
        roc_auc_train = roc_auc_score(y_train, y_train_proba)
        precision_curve_train, recall_curve_train, _ = precision_recall_curve(y_train, y_train_proba)
        auc_pr_train = auc(recall_curve_train, precision_curve_train)

        # Get validation metrics
        accuracy_val = model_result["Accuracy"]
        recall_val = model_result["Recall"]
        precision_val = model_result["Precision"]
        f1_val = model_result["F1-Score"] 
        roc_auc_val = model_result["ROC-AUC"]
        auc_pr_val = model_result["AUC-PR"]
        
        # Create results dictionary for current model
        model_metrics = {
            "Model": model_name,
            "Accuracy (Train)": accuracy_train,
            "Accuracy (Val)": accuracy_val,
            "Accuracy (Diff)": accuracy_train - accuracy_val,
            "Recall (Train)": recall_train,
            "Recall (Val)": recall_val,
            "Recall (Diff)": recall_train - recall_val,
            "Precision (Train)": precision_train,
            "Precision (Val)": precision_val,
            "Precision (Diff)": precision_train - precision_val,
            "F1-Score (Train)": f1_train,
            "F1-Score (Val)": f1_val, 
            "F1-Score (Diff)": f1_train - f1_val,
            "ROC-AUC (Train)": roc_auc_train,
            "ROC-AUC (Val)": roc_auc_val,
            "ROC-AUC (Diff)": roc_auc_train - roc_auc_val,
            "AUC-PR (Train)": auc_pr_train,
            "AUC-PR (Val)": auc_pr_val,
            "AUC-PR (Diff)": auc_pr_train - auc_pr_val,
        }

        # Append current model dictionary
        overfitting_results.append(model_metrics)
    
    # Convert the list of dictionaries into a DataFrame
    overfitting_results = pd.DataFrame(overfitting_results)
    
    # Set model as index
    overfitting_results = overfitting_results.set_index("Model")
    
    return overfitting_results


# Use function to analyze overfitting for all baseline models 
baseline_models_overfitting = analyze_overfitting(X_train_transformed, y_train, model_results=baseline_model_results)

# Display overfitting results (with 2 decimals)
round(baseline_models_overfitting, 2)
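
<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 <strong>Sketch: Flagging Large Train-Validation Gaps.</strong> <br>
    One possible way to read the difference columns is to flag models whose gap between training and validation score exceeds a chosen threshold. The scores and the threshold of 0.10 are hypothetical assumptions for illustration, not a recommendation of the template; an appropriate threshold depends on the metric and the use case.
</div>

```python
# Hypothetical train and validation scores per model (illustrative values only)
scores = {
    "Decision Tree": {"train": 1.00, "val": 0.78},
    "Logistic Regression": {"train": 0.84, "val": 0.82},
}

# Assumed rule of thumb: flag gaps above 0.10 (arbitrary choice for this sketch)
gap_threshold = 0.10

# Difference score per model: train minus validation
gaps = {name: round(s["train"] - s["val"], 2) for name, s in scores.items()}

# Models with a gap above the threshold are candidates for overfitting
flagged = [name for name, gap in gaps.items() if gap > gap_threshold]

print(f"Train-validation gaps: {gaps}")
print(f"Possibly overfitting: {flagged}")
```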

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Plot training and validation metrics of all baseline models side-by-side using grouped bar plots.
</div>

In [None]:
# Function to plot train vs. validation metrics of multiple models 
def plot_train_val_metrics(metrics, overfitting_results, safe_to_file=False):    
    # Ensure metrics is a list, even if a single metric is provided
    if not isinstance(metrics, list):
        metrics = [metrics]

    # Get number of metrics
    n = len(metrics)
    
    # Set up the subplot layout based on the number of metrics
    if n == 1:
        fig, ax = plt.subplots(figsize=(9, 6))
        axes = [ax]
    else:
        # Create a grid with 2 columns and enough rows to accommodate all metrics
        rows = int(np.ceil(n / 2))
        fig, axes = plt.subplots(rows, 2, figsize=(16, 6 * rows))
        # Flatten the axes for easier iteration
        axes = axes.flat

    # Add overall figure title only if there are multiple metrics
    if n > 1:
        fig.suptitle("Overfitting: Train vs. Validation Metrics", fontsize=16, y=0.98)
    
    for ax, metric in zip(axes, metrics):        
        # Create DataFrame with only training and validation metric
        metric_df = overfitting_results[[f"{metric} (Train)", f"{metric} (Val)"]].reset_index()
    
        # Rename columns for clarity
        metric_df = metric_df.rename(columns={f"{metric} (Train)": "Training", f"{metric} (Val)": "Validation"})
        
        # Melt the DataFrame for easier plotting 
        metric_df = pd.melt(
            metric_df,
            id_vars=["Model"], 
            value_vars=["Training", "Validation"],
            var_name="Data", 
            value_name=metric
        )
        
        # Create grouped bar chart
        sns.barplot(data=metric_df, x="Model", y=metric, hue="Data", palette="viridis", ax=ax)
    
        # Add value labels 
        for container in ax.containers:
            ax.bar_label(container, fmt="%.2f", padding=3, fontsize=10)
       
        # Customize plot 
        if n > 1:
            ax.set_title(f"{metric}", fontsize=14, pad=12)
            ax.set_ylabel("")
        else:
            ax.set_title(f"Overfitting: Train vs. Validation {metric}", fontsize=14, pad=12)
            ax.set_ylabel(metric, fontsize=12)
        ax.set_xlabel("")
        ax.set_ylim(0, 1.05)
        ax.set_yticks(np.arange(0, 1.1, 0.1))
        ax.tick_params(axis="x", labelrotation=45 if len(overfitting_results.index) > 5 else 0, labelsize=12)  # rotate xticks if more than 5 models
        ax.tick_params(axis="y", labelsize=10)
        ax.legend(fontsize=11)
        ax.grid(axis="y", alpha=0.3)
    
    # Save to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Overfitting plot saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving overfitting plot: {e}")
        else:
            print(f"Skip saving overfitting plot to file: '{image_path}' already exists.")
    
    # Adjust layout and show the plot
    fig.tight_layout()
    plt.show()


# Use function to plot train vs. validation AUC-PR of all baseline models  
plot_train_val_metrics("AUC-PR", baseline_models_overfitting)

# Use function to plot train vs. validation comparison of all metrics for all baseline models 
plot_train_val_metrics(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], baseline_models_overfitting)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Plot train-validation difference scores of all metrics and all baseline models in a single grouped bar plot.
</div>

In [None]:
# Function to plot train-validation difference scores of multiple models across multiple metrics in a single grouped bar plot
def plot_train_val_difference(metrics, overfitting_results, safe_to_file=False):
    # Extract difference scores from overfitting results
    diff_metrics = [metric + " (Diff)" for metric in metrics]
    metric_df = overfitting_results[diff_metrics].reset_index()
        
    # Rename columns for better readability (regex=False treats the parentheses literally)
    metric_df.columns = metric_df.columns.str.replace(" (Diff)", "", regex=False)
    
    # Melt the DataFrame for easier plotting
    metric_df = pd.melt(
        metric_df,
        id_vars=["Model"], 
        var_name="Metric", 
        value_name="Value"
    )
    
    # Set the figure size
    fig, ax = plt.subplots(figsize=(12, 6))
        
    # Create grouped bar plot
    sns.barplot(data=metric_df, x="Model", y="Value", hue="Metric", palette="viridis", ax=ax)
    
    # Add value labels
    for container in ax.containers:
        ax.bar_label(container, fmt="%.2f", padding=3, fontsize=7)
    
    # Customize plot
    ax.set_title("Overfitting: Train-Validation Difference Scores", fontsize=14, pad=12)
    ax.set_xlabel("")
    ax.set_ylabel("Difference (Train - Val)", fontsize=12)
    ax.tick_params(axis="x", labelsize=12, labelrotation=45 if len(overfitting_results.index) > 5 else 0)  # rotate xticks if more than 5 models
    ax.tick_params(axis="y", labelsize=10)  
    ax.set_ylim(metric_df["Value"].min() - 0.05, metric_df["Value"].max() + 0.05)  # slightly extend y-axis for visibility 
    ax.legend(fontsize=11)
    ax.grid(axis="y", alpha=0.3)

    # Save to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Overfitting plot saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving overfitting plot: {e}")
        else:
            print(f"Skip saving overfitting plot to file: '{image_path}' already exists.")
    
    # Adjust layout and show plot
    fig.tight_layout()
    plt.show()


# Use function to plot train-validation difference scores of all metrics and all baseline models
plot_train_val_difference(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], baseline_models_overfitting)  

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Feature Misclassification Analysis</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Analyze relationships between the features and misclassifications on the validation data through correlations and box plots.
</div>

In [None]:
# Function to analyze feature correlations with misclassifications
def analyze_feature_misclassification(X, y, y_pred, numerical_features=None):  
    # Combine features with actual and predicted target values into a single DataFrame
    df = X.copy()  
    df["Actual"] = y
    df["Predicted"] = y_pred

    # Create misclassification column
    df["Misclassification"] = (df["Predicted"] != df["Actual"]).astype(int)
    
    # Calculate correlations between features and misclassifications (drop the trivial self-correlation)
    correlations = df.drop(columns=["Actual", "Predicted"]).corr()["Misclassification"].drop("Misclassification")
    
    # --- Create box plot matrix ---
    # Ensure only numerical features
    if numerical_features is None:
        numerical_features = X.select_dtypes(include=["number"]).columns  
        # Include continuous numerical features; exclude binary and ordinal features with fewer than 5 unique values
        numerical_features = [column for column in numerical_features if X[column].nunique() > 4]
        
    # Number of columns and rows for matrix grid
    n_cols = 3  
    n_rows = math.ceil(len(numerical_features) / n_cols)
    
    # Create subplot grid with figure size based on 4x4 inches per subplot
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 4))
    
    # Flatten axes for easier iteration
    axes = axes.flat
    
    # Iterate over the numerical features 
    for i, feature in enumerate(numerical_features):
        # Get the current axes object
        ax = axes[i]
        
        # Create a box plot of the current feature grouped by misclassification
        sns.boxplot(data=df, x="Misclassification", y=feature, ax=ax)
        
        # Customize plot
        ax.set_title(f"{feature.title().replace('_', ' ')} by Misclassification")
        ax.set_xlabel("Misclassification")
        ax.set_ylabel(f"{feature.title().replace('_', ' ')}")
        ax.set_xticks(ticks=[0, 1], labels=["Correct", "Misclassified"])
    
    # Adjust layout to prevent overlap
    fig.tight_layout()
    
    # Show the plot
    plt.show()

    # Return the misclassification correlations
    return correlations


# Example usage for a single model  
# rf_misclassification_correlations = analyze_feature_misclassification(X_val_transformed, y_val, baseline_model_results["Random Forest"]["y_val_pred"])

# --- Use function to analyze feature misclassification relationships of all baseline models (no outlier handling) ---
# Initialize results dictionary
baseline_misclassification_correlations = {}
# Iterate over each model
for model_name, model_result in baseline_model_results.items():
    print(f"{model_name}: Feature Misclassification Analysis")
    # Analyze feature misclassification relationships for current model
    misclassification_correlations = analyze_feature_misclassification(X_val_transformed, y_val, model_result["y_val_pred"])
    # Add current model results to dictionary
    baseline_misclassification_correlations[model_name] = misclassification_correlations
    print("=" * 145)
# Convert results dictionary into a DataFrame    
baseline_misclassification_correlations = pd.DataFrame(baseline_misclassification_correlations)

In [None]:
# Display feature misclassification correlations of all baseline models (with 2 decimals)
round(baseline_misclassification_correlations, 2)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Hyperparameter Tuning</h2>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Model Selection</strong> <br>
    💡  Select the best performing baseline models for hyperparameter tuning. <br><br>
    Example: The following models outperformed the other candidates on the evaluation metrics (accuracy, recall, precision, F1-score, ROC-AUC, AUC-PR) and additional diagnostics (precision-recall curves, classification report, confusion matrix, overfitting, feature misclassification analysis) and were selected for hyperparameter tuning:  
    <ul>
        <li><b>Random Forest</b>: Showed balanced, high performance across key metrics (notably F1-score and AUC-PR) and minimal overfitting.</li>
        <li><b>XGBoost</b>: Achieved leading performance on key metrics, especially ROC-AUC and F1-score.</li>
    </ul>
    <strong>Next steps</strong> <br>
    <ul>
        <li>Tune the hyperparameters of each model using grid search or randomized search.</li>
        <li>Retrain the best-performing model from each algorithm and plot precision-recall curves.</li>
        <li>Optimize decision thresholds.</li>
        <li>Evaluate the best hyperparameter-tuned models with both default and optimized thresholds using:
            <ul>
                <li>Evaluation metrics (accuracy, recall, precision, F1-score, ROC-AUC, AUC-PR).</li>
                <li>Additional diagnostics (classification report, confusion matrix, overfitting, feature misclassification analysis).</li>
            </ul>
        </li>
    </ul>   
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Grid Search</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">  
    ℹ️ Hyperparameters:  
    <ul>  
        <li>Core Parameters:  
            <ul>  
                <li><code>estimator</code>: The machine learning model you want to optimize (e.g., <code>RandomForestClassifier</code>).</li>  
                <li><code>param_grid</code>: A dictionary where keys are hyperparameter names, and values are lists of possible values to try.</li>  
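                <li><code>cv=None</code>: Cross-validation splitting strategy. The default <code>None</code> uses 5-fold cross-validation; an integer sets the number of folds (e.g., <code>cv=5</code>).</li>  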
                <li><code>scoring=None</code>: The metric to evaluate model performance (e.g., <code>"accuracy"</code>, <code>"precision"</code>, <code>"recall"</code>, <code>"f1"</code>, <code>"roc_auc"</code>).</li>  
            </ul>  
        </li>  
        <li>Optional Parameters:  
            <ul>  
                <li><code>verbose=0</code>: Controls the verbosity of the output. Higher values provide more detailed logs.</li>  
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel. <code>-1</code> uses all available CPU cores.</li>  
                <li><code>pre_dispatch="2*n_jobs"</code>: Controls the number of jobs that get dispatched during parallel execution.</li>  
                <li><code>refit=True</code>: If <code>True</code>, the estimator is refit on the entire training set using the best parameters.</li>  
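                <li>Note: <code>GridSearchCV</code> itself has no <code>random_state</code> parameter (only <code>RandomizedSearchCV</code> does). For reproducible results, set <code>random_state</code> on the estimator or pass a seeded cross-validation splitter via <code>cv</code>.</li>  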
                <li><code>error_score=np.nan</code>: Value to assign to the score if a parameter combination results in a failure.</li>  
                <li><code>return_train_score=False</code>: If <code>True</code>, training scores will be returned along with validation scores.</li>  
            </ul>  
        </li>  
    </ul>  
</div>  
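
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Grid search fits every combination in <code>param_grid</code> once per fold, so it can help to count the fits before starting a search. The sketch below does this with the standard library only. The helper name <code>count_grid_fits</code> is hypothetical (it is not part of scikit-learn), and the grid values are copied from the XGBoost example in this template. If the count is too large, shrinking the grid or using randomized search with a fixed <code>n_iter</code> keeps the runtime predictable.
</div>

```python
import math


def count_grid_fits(param_grid, cv=5):
    """Return (number of combinations, number of cross-validation fits).

    Hypothetical helper for illustration only. Assumes param_grid is a single
    dictionary that maps parameter names to lists of candidate values.
    The final refit on the full training data (refit=True) is not counted.
    """
    n_combinations = math.prod(len(values) for values in param_grid.values())
    return n_combinations, n_combinations * cv


# Grid values copied from the XGBoost grid search example in this template
xgb_param_grid_example = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "gamma": [0, 0.1, 0.2],
    "min_child_weight": [1, 3, 5],
    "scale_pos_weight": [1, 5, 10],
}

n_combinations, n_fits = count_grid_fits(xgb_param_grid_example, cv=5)
print(f"{n_combinations:,} combinations x 5 folds = {n_fits:,} fits")
```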

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 1</h4>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Example: Random Forest Classifier <br><br>
    ℹ️ The following hyperparameters are typically the most impactful:
    <ul>
        <li><code>n_estimators</code>: Number of trees in the forest.</li>
        <li><code>max_depth</code>: Maximum depth of each tree; <code>None</code> allows trees to grow until all leaves are pure or minimum samples are reached.</li>
        <li><code>min_samples_split</code>: Minimum number of samples required to split a node.</li>
        <li><code>min_samples_leaf</code>: Minimum number of samples required at a leaf node.</li>
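        <li><code>max_features</code>: Number of features considered when looking for the best split; the default <code>"sqrt"</code> uses the square root of the number of features (the older <code>"auto"</code> option has been removed from recent scikit-learn versions). An integer is read as an absolute number of features, a float as a fraction of all features.</li>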
        <li><code>class_weight</code>: Weights associated with classes. If <code>None</code>, all classes are supposed to have weight one. Use <code>"balanced"</code> to automatically adjust weights inversely proportional to class frequencies in the input data.</li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">scikit-learn RandomForestClassifier documentation</a>.
</div>
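
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 To make <code>class_weight="balanced"</code> more concrete: scikit-learn documents the weight of each class as <code>n_samples / (n_classes * class_count)</code>. The sketch below reproduces that formula with the standard library. The function name <code>balanced_class_weights</code> and the label list are hypothetical and only serve as an illustration.
</div>

```python
from collections import Counter


def balanced_class_weights(y):
    """Compute class weights as n_samples / (n_classes * class_count).

    Hypothetical helper that mirrors the documented "balanced" formula.
    """
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


# Hypothetical imbalanced target: 90 negative and 10 positive samples
y_example = [0] * 90 + [1] * 10

class_weights = balanced_class_weights(y_example)
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```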

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Fit grid search and save as <code>.pkl</code> file.
</p> 

In [None]:
# Initialize model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid 
rf_param_grid = {
    "n_estimators": [100, 200, 500],               
    "max_depth": [None, 10, 20],              
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.33, 0.66, 1.0],  # floats are fractions of all features (an integer 1 would mean a single feature)
    "class_weight": [None, "balanced", "balanced_subsample"]
}

# Initialize grid search object
rf_grid_search = GridSearchCV(
    estimator=rf, 
    param_grid=rf_param_grid, 
    cv=5, 
    scoring="accuracy"  # use "f1", "precision", "recall" or "average_precision" optionally
)

# Fit the grid search to the training data
rf_grid_search.fit(X_train_transformed, y_train)

# Save fitted grid search as .pkl file using helper function  
save_model(rf_grid_search, "rf_grid_search.pkl")

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Load grid search from <code>.pkl</code> file and show Top 10 models.
</p> 

In [None]:
# Load grid search using helper function
rf_grid_search = load_model("rf_grid_search.pkl")

# Create DataFrame of grid search results 
rf_grid_search_results = pd.DataFrame({
    "validation_accuracy": rf_grid_search.cv_results_["mean_test_score"],  # mean cross-validated accuracy (average over the cv folds)
    "parameters": rf_grid_search.cv_results_["params"]  # parameter values
}) 

# Extract each hyperparameter as a separate column
for parameter in rf_param_grid:
    rf_grid_search_results[parameter] = rf_grid_search_results["parameters"].apply(lambda x: x[parameter])

# Delete the parameters column
rf_grid_search_results = rf_grid_search_results.drop("parameters", axis=1)

# Show top 10 best performing models
rf_grid_search_results.sort_values("validation_accuracy", ascending=False).head(10)

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 2</h4>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Example: XGBoost Classifier <br><br>
    ℹ️ The following hyperparameters are typically the most impactful:
    <ul>
        <li><code>n_estimators</code>: Number of trees (boosting rounds).</li>
        <li><code>max_depth</code>: Maximum depth of each tree.</li>
        <li><code>learning_rate</code>: Step size shrinkage to prevent overfitting.</li>
        <li><code>subsample</code>: Fraction of training samples used per tree.</li>
        <li><code>colsample_bytree</code>: Fraction of features used per tree.</li>
        <li><code>gamma</code>: Minimum loss reduction required to split a leaf node.</li>
        <li><code>min_child_weight</code>: Minimum sum of instance weights (hessian) in a child.</li>
        <li><code>scale_pos_weight</code>: Balances positive and negative class weights for imbalanced datasets.</li>
    </ul>
    For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.
</div>
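
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 The XGBoost documentation suggests the ratio of negative to positive samples as a typical starting value for <code>scale_pos_weight</code>. The sketch below computes that ratio for a binary target encoded as 0 and 1. The function name <code>suggest_scale_pos_weight</code> and the example labels are hypothetical; the result is only a starting point around which candidate values can be placed.
</div>

```python
def suggest_scale_pos_weight(y):
    """Return the ratio of negative to positive samples.

    Hypothetical helper; assumes a binary target encoded as 0 (negative)
    and 1 (positive).
    """
    n_positive = sum(1 for label in y if label == 1)
    n_negative = sum(1 for label in y if label == 0)
    if n_positive == 0:
        raise ValueError("y contains no positive samples.")
    return n_negative / n_positive


# Hypothetical imbalanced target: 950 negative and 50 positive samples
y_example = [0] * 950 + [1] * 50

ratio = suggest_scale_pos_weight(y_example)
print(f"Suggested starting value for scale_pos_weight: {ratio:.1f}")
```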

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Fit grid search and save as <code>.pkl</code> file.
</p> 

In [None]:
# Initialize model
xgb = XGBClassifier(random_state=42)

# Define hyperparameter grid 
xgb_param_grid = {
    "n_estimators": [100, 200, 300, 500],               
    "max_depth": [3, 6, 10], 
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],  
    "gamma": [0, 0.1, 0.2],           
    "min_child_weight": [1, 3, 5],
    "scale_pos_weight": [1, 5, 10]
}

# Initialize grid search object
xgb_grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=xgb_param_grid, 
    cv=5, 
    scoring="accuracy"  # use "f1", "precision", "recall" or "average_precision" optionally
)

# Fit the grid search to the training data
xgb_grid_search.fit(X_train_transformed, y_train)

# Save fitted grid search as .pkl file using helper function  
save_model(xgb_grid_search, "xgb_grid_search.pkl")

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Load grid search from <code>.pkl</code> file and show Top 10 models.
</p> 

In [None]:
# Load grid search using helper function 
xgb_grid_search = load_model("xgb_grid_search.pkl")

# Create DataFrame of grid search results 
xgb_grid_search_results = pd.DataFrame({
    "validation_accuracy": xgb_grid_search.cv_results_["mean_test_score"],  # mean cross-validated accuracy (average over the cv folds)
    "parameters": xgb_grid_search.cv_results_["params"]  # parameter values
}) 

# Extract each hyperparameter as a separate column
for parameter in xgb_param_grid:
    xgb_grid_search_results[parameter] = xgb_grid_search_results["parameters"].apply(lambda x: x[parameter])

# Delete the parameters column
xgb_grid_search_results = xgb_grid_search_results.drop("parameters", axis=1)

# Show top 10 best performing models
xgb_grid_search_results.sort_values("validation_accuracy", ascending=False).head(10)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Randomized Search</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">  
    ℹ️ Hyperparameters:  
    <ul>  
        <li>Core Parameters:  
            <ul>  
                <li><code>estimator</code>: The machine learning model you want to optimize (e.g., <code>RandomForestClassifier</code>).</li>  
                <li><code>param_distributions</code>: A dictionary where keys are hyperparameter names, and values are the distributions or lists of possible values.</li>  
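                <li><code>n_iter=10</code>: The number of random combinations to sample from the parameter distributions.</li>  
                <li><code>cv=None</code>: Cross-validation splitting strategy. The default <code>None</code> uses 5-fold cross-validation; an integer sets the number of folds (e.g., <code>cv=5</code>).</li>  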
                <li><code>scoring=None</code>: The metric to evaluate model performance (e.g., <code>"accuracy"</code>, <code>"precision"</code>, <code>"recall"</code>, <code>"f1"</code>, <code>"roc_auc"</code>).</li>  
            </ul>  
        </li>  
        <li>Optional Parameters:  
            <ul>  
                <li><code>verbose=0</code>: Controls the verbosity of the output. Higher values provide more detailed logs.</li>  
                <li><code>random_state=None</code>: Ensures reproducibility of results.</li>  
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel. <code>-1</code> uses all available CPU cores.</li>  
                <li><code>pre_dispatch="2*n_jobs"</code>: Controls the number of jobs that get dispatched during parallel execution.</li>  
                <li><code>refit=True</code>: If <code>True</code>, the estimator is refit on the entire training set using the best parameters.</li>  
                <li><code>error_score=np.nan</code>: Value to assign to the score if a parameter combination results in a failure.</li>  
                <li><code>return_train_score=False</code>: If <code>True</code>, training scores will be returned along with validation scores.</li>  
            </ul>  
        </li>  
    </ul>  
</div>  
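
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 The value ranges of the <code>scipy.stats</code> distributions are easy to misread: <code>randint(low, high)</code> draws integers from <code>low</code> up to but not including <code>high</code>, and <code>uniform(loc, scale)</code> draws floats between <code>loc</code> and <code>loc + scale</code>. The sketch below checks this empirically by sampling; the variable names and the sample size are arbitrary choices for illustration.
</div>

```python
from scipy.stats import randint, uniform

# randint(100, 501): integers from 100 to 500 (upper bound 501 is excluded)
n_estimators_dist = randint(100, 501)

# uniform(0.1, 0.9): floats between loc=0.1 and loc + scale = 1.0
max_features_dist = uniform(0.1, 0.9)

# Draw samples to inspect the observed value ranges
n_estimators_samples = n_estimators_dist.rvs(size=1000, random_state=42)
max_features_samples = max_features_dist.rvs(size=1000, random_state=42)

print("n_estimators range:", n_estimators_samples.min(), "to", n_estimators_samples.max())
print("max_features range:", round(max_features_samples.min(), 3), "to", round(max_features_samples.max(), 3))
```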

In [None]:
# Import scipy to define the hyperparameter distributions (randint for int, uniform for float)
from scipy.stats import randint, uniform  

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 1</h4>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Example: Random Forest Classifier <br><br>
    ℹ️ The following hyperparameters are typically the most impactful:
    <ul>
        <li><code>n_estimators</code>: Number of trees in the forest.</li>
        <li><code>max_depth</code>: Maximum depth of each tree; <code>None</code> allows trees to grow until all leaves are pure or minimum samples are reached.</li>
        <li><code>min_samples_split</code>: Minimum number of samples required to split a node.</li>
        <li><code>min_samples_leaf</code>: Minimum number of samples required at a leaf node.</li>
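        <li><code>max_features</code>: Number of features considered when looking for the best split; the default <code>"sqrt"</code> uses the square root of the number of features (the older <code>"auto"</code> option has been removed from recent scikit-learn versions). An integer is read as an absolute number of features, a float as a fraction of all features.</li>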
        <li><code>class_weight</code>: Weights associated with classes. If <code>None</code>, all classes are supposed to have weight one. Use <code>"balanced"</code> to automatically adjust weights inversely proportional to class frequencies in the input data.</li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" target="_blank">scikit-learn RandomForestClassifier documentation</a>.
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Fit random search and save as <code>.pkl</code> file.
</p> 

In [None]:
# Initialize model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter distributions 
rf_param_distributions = {
    "n_estimators": randint(100, 501),  # Random integers between 100 and 500             
    "max_depth": randint(5, 31),  # Random integers between 5 and 30            
    "min_samples_split": randint(2, 21),  # Random integers between 2 and 20
    "min_samples_leaf": randint(1, 11),  # Random integers between 1 and 10
    "max_features": uniform(0.1, 0.9),  # Random floats between 0.1 and 1.0  
    "class_weight": [None, "balanced", "balanced_subsample"]
}

# Initialize randomized search object
rf_random_search = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=rf_param_distributions, 
    n_iter=50,
    cv=5, 
    scoring="accuracy",  # use "f1", "precision", "recall" or "average_precision" optionally
    random_state=42,
    n_jobs=-1,  # utilize all available CPU cores for parallel processing
    verbose=2,  # print training progress messages 
    refit=False  # Prevent storing "best_estimator_" to save storage
)

# Fit the random search to the training data
rf_random_search.fit(X_train_transformed, y_train)

# Save fitted random search as .pkl file using helper function 
save_model(rf_random_search, "rf_random_search.pkl")

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Load random search from <code>.pkl</code> file and show Top 10 models.
</p> 

In [None]:
# Load random search using helper function 
rf_random_search = load_model("rf_random_search.pkl")

# Create DataFrame of randomized search results
rf_random_search_results = pd.DataFrame({
    "validation_accuracy": rf_random_search.cv_results_["mean_test_score"],  # mean cross-validated accuracy (average over the cv folds)
    "parameters": rf_random_search.cv_results_["params"]  # parameter values
})

# Extract each hyperparameter as a separate column
for parameter in rf_param_distributions:
    rf_random_search_results[parameter] = rf_random_search_results["parameters"].apply(lambda x: x[parameter])

# Delete the parameters column
rf_random_search_results = rf_random_search_results.drop("parameters", axis=1)

# Show top 10 best performing models 
rf_random_search_results.sort_values("validation_accuracy", ascending=False).head(10)

<div style="background-color:#5f9ade; color:white; padding:8px; border-radius:6px;">
    <h4 style="margin:0px">Model 2</h4>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Example: XGBoost Classifier <br><br>
    ℹ️ The following hyperparameters are typically the most impactful:
    <ul>
        <li><code>n_estimators</code>: Number of trees (boosting rounds).</li>
        <li><code>max_depth</code>: Maximum depth of each tree.</li>
        <li><code>learning_rate</code>: Step size shrinkage to prevent overfitting.</li>
        <li><code>subsample</code>: Fraction of training samples used per tree.</li>
        <li><code>colsample_bytree</code>: Fraction of features used per tree.</li>
        <li><code>gamma</code>: Minimum loss reduction required to split a leaf node.</li>
        <li><code>min_child_weight</code>: Minimum sum of instance weights (hessian) in a child.</li>
        <li><code>scale_pos_weight</code>: Balances positive and negative class weights for imbalanced datasets.</li>
    </ul>
    For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.
</div>

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Fit random search and save as <code>.pkl</code> file.
</p> 

In [None]:
# Initialize model
xgb = XGBClassifier(random_state=42)

# Define hyperparameter distributions 
xgb_param_distributions = {
    "n_estimators": randint(100, 501),  # Random integers between 100 and 500             
    "max_depth": randint(3, 11),  # Random integers between 3 and 10            
    "learning_rate": uniform(0.01, 0.29),  # Random floats between 0.01 and 0.30
    "subsample": uniform(0.5, 0.5),  # Random floats between 0.5 and 1.0
    "colsample_bytree": uniform(0.5, 0.5),  # Random floats between 0.5 and 1.0
    "gamma": uniform(0, 0.5),  # Random floats between 0.0 and 0.5  
    "min_child_weight": randint(1, 10),  # Random integers between 1 and 9
    "scale_pos_weight": randint(1, 16)  # Random integers between 1 and 15
}

# Initialize randomized search object
xgb_random_search = RandomizedSearchCV(
    estimator=xgb, 
    param_distributions=xgb_param_distributions, 
    n_iter=50,
    cv=5, 
    scoring="accuracy",  # use "f1", "precision", "recall" or "average_precision" optionally
    random_state=42,
    n_jobs=-1,  # utilize all available CPU cores for parallel processing
    verbose=2  # print training progress messages 
)

# Fit the random search to the training data
xgb_random_search.fit(X_train_transformed, y_train)

# Save fitted random search as .pkl file using helper function 
save_model(xgb_random_search, "xgb_random_search.pkl")

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Load random search from <code>.pkl</code> file and show Top 10 models.
</p> 

In [None]:
# Load random search using helper function 
xgb_random_search = load_model("xgb_random_search.pkl")

# Create DataFrame of randomized search results
xgb_random_search_results = pd.DataFrame({
    "validation_accuracy": xgb_random_search.cv_results_["mean_test_score"],  # mean cross-validated accuracy (average over the cv folds)
    "parameters": xgb_random_search.cv_results_["params"]  # parameter values
})

# Extract each hyperparameter as a separate column
for parameter in xgb_param_distributions:
    xgb_random_search_results[parameter] = xgb_random_search_results["parameters"].apply(lambda x: x[parameter])

# Delete the parameters column
xgb_random_search_results = xgb_random_search_results.drop("parameters", axis=1)

# Show top 10 best performing models 
xgb_random_search_results.sort_values("validation_accuracy", ascending=False).head(10)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Retraining</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Retrain the best hyperparameter-tuned model from each algorithm on the full training dataset.
</div>

In [None]:
# Define best hyperparameter-tuned model from each algorithm (continue example with Random Forest and XGBoost)
tuned_models = {
    "Random Forest": RandomForestClassifier(**rf_random_search.best_params_, random_state=42),  # or rf_grid_search.best_params_
    "XGBoost": XGBClassifier(**xgb_random_search.best_params_, random_state=42)  # or xgb_grid_search.best_params_
}

# Retrain models on the full training data
tuned_model_results = evaluate_all_models(tuned_models, X_train_transformed, y_train, X_val_transformed, y_val)

# Save hyperparameter-tuned model results as .pkl file 
save_model(tuned_model_results, "tuned_models.pkl")

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Precision-Recall Curves</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Plot precision-recall curves of the tuned models on the validation data.
</div>

In [None]:
# Load hyperparameter-tuned model results
tuned_model_results = load_model("tuned_models.pkl")

# Plot precision-recall curves of hyperparameter-tuned models
plot_precision_recall_curve(
    y_val, 
    tuned_model_results, 
    title="Precision-Recall Curves: Hyperparameter-Tuned Models",
    safe_to_file="precision_recall_curves_tuned.png"
)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Threshold Optimization</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ The default decision threshold (typically 0.5) may not serve the business goal, especially when certain performance targets are non-negotiable or misclassifications have quantifiable costs. Two common strategies for choosing the threshold based on business needs are: 
    <ol>
        <li>Set a minimum requirement for a primary metric (e.g., recall), and then optimize a secondary metric (e.g., precision).</li>
        <li>Assign costs to false positives (FP) and false negatives (FN), and select the threshold that minimizes total cost.</li>
    </ol> 
    Examples:
    <ul>
        <li>Spam Filtering: Enforce a minimum precision of 0.99 to avoid misclassifying legitimate emails as spam (users are highly sensitive to missing important emails), then maximize recall to catch as much spam as possible.</li>
        <li>Medical Screening: Ensure a recall of at least 0.97 for the disease class to avoid missed diagnoses (which could delay treatment and increase mortality risk), then optimize precision to reduce unnecessary follow-up tests for healthy people.</li>
        <li>Fraud Detection: Assign costs to missed fraud (FN cost of 200€) and blocked legitimate transactions (FP cost of 5€) and select the threshold that minimizes total cost.</li>
    </ul>
</div>
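
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 A stripped-down illustration of the cost-based strategy, using the costs from the fraud detection example (FN cost of 200, FP cost of 5). The labels and predicted probabilities are hypothetical toy data, and the helper <code>total_cost</code> is an illustrative name rather than part of any library. The sketch scans thresholds from 0.01 to 0.99 and keeps the first one with the lowest total cost.
</div>

```python
def total_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Total misclassification cost at a given decision threshold."""
    # False positives: negative samples predicted as positive
    fp = sum(1 for t, p in zip(y_true, y_proba) if t == 0 and p >= threshold)
    # False negatives: positive samples predicted as negative
    fn = sum(1 for t, p in zip(y_true, y_proba) if t == 1 and p < threshold)
    return cost_fp * fp + cost_fn * fn


# Hypothetical validation labels (1 = fraud) and predicted fraud probabilities
y_true_example = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_proba_example = [0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.35, 0.60, 0.70, 0.90]

# Candidate thresholds from 0.01 to 0.99 in steps of 0.01
thresholds = [i / 100 for i in range(1, 100)]

# Total cost for each threshold (FP cost of 5, FN cost of 200)
costs = [
    total_cost(y_true_example, y_proba_example, threshold, cost_fp=5, cost_fn=200)
    for threshold in thresholds
]

# Pick the first threshold with the lowest total cost
best_cost = min(costs)
best_threshold = thresholds[costs.index(best_cost)]

# Compare with the default threshold of 0.5
default_cost = total_cost(y_true_example, y_proba_example, 0.5, cost_fp=5, cost_fn=200)

print(f"Default threshold 0.50: total cost {default_cost}")
print(f"Best threshold {best_threshold:.2f}: total cost {best_cost}")
```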

In [None]:
# Function to evaluate a single model across multiple decision thresholds and determine the best threshold 
def optimize_threshold(y_true, y_pred_proba, thresholds=None, optimize="Accuracy", min_accuracy=0, min_recall=0, min_precision=0, min_f1=0,
                       cost_fp=0, cost_fn=0, title="Metrics by Threshold", safe_to_file=False):
    # Use 1% to 99% in 1%-steps in the absence of custom thresholds
    if thresholds is None:
        thresholds = np.arange(0.01, 1, 0.01)

    # --- Calculate metrics for each threshold ---
    # Store threshold evaluation results as list of dictionaries
    threshold_results = []   
    
    # Iterate over each threshold
    for threshold in thresholds:
        # Get class predictions for current threshold
        y_pred = (y_pred_proba >= threshold).astype(int)
        
        # Calculate evaluation metrics for current threshold
        accuracy = accuracy_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        precision = precision_score(y_true, y_pred, zero_division=0)  # return 0 instead of a warning if no positives are predicted
        f1 = f1_score(y_true, y_pred, zero_division=0)
    
        # Calculate cost for current threshold
        if optimize == "Cost" and cost_fp == 0 and cost_fn == 0:
            print("Warning: Cannot optimize for 'Cost' when cost_fn and cost_fp are both 0. Defaulting to 'Accuracy'.")
            optimize="Accuracy"
            total_cost = None
        elif optimize == "Cost" and (cost_fp > 0 or cost_fn > 0):
            # Calculate number of true negatives (tn), false positives (fp), false negatives (fn) and true positives (tp)
            tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()  # fix labels so the matrix is always 2x2
            # Calculate total cost
            total_cost = cost_fp * fp + cost_fn * fn
        else:
            total_cost = None
        
        # Add evaluation metrics dictionary to list
        threshold_results.append({
            "threshold": threshold,
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1-Score": f1,
            "Cost": total_cost
        })    

    # Convert list of dictionaries to DataFrame
    threshold_results = pd.DataFrame(threshold_results)

    # --- Determine the best threshold --- 
    # Filter thresholds that satisfy minimum accuracy, recall, precision, and F1-score
    filtered_thresholds = threshold_results[
        (threshold_results["Accuracy"] >= min_accuracy) & 
        (threshold_results["Recall"] >= min_recall) & 
        (threshold_results["Precision"] >= min_precision) & 
        (threshold_results["F1-Score"] >= min_f1)
    ]
    
    # Fallback to no minimum criteria if not a single threshold satisfies all of them
    if filtered_thresholds.empty:
        print("Warning: No threshold satisfies all minimum criteria.")
        print("Defaulting to optimization without any minimum criteria.")
        filtered_thresholds = threshold_results.copy()
    
    # Optimize accuracy
    if optimize == "Accuracy":
        best_threshold = filtered_thresholds.loc[filtered_thresholds["Accuracy"].idxmax(), "threshold"] 
    # Optimize recall
    elif optimize == "Recall":
        best_threshold = filtered_thresholds.loc[filtered_thresholds["Recall"].idxmax(), "threshold"]  
    # Optimize precision
    elif optimize == "Precision":
        best_threshold = filtered_thresholds.loc[filtered_thresholds["Precision"].idxmax(), "threshold"]  
    # Optimize F1-score
    elif optimize == "F1-Score":
        best_threshold = filtered_thresholds.loc[filtered_thresholds["F1-Score"].idxmax(), "threshold"]  
    # Optimize cost
    elif optimize == "Cost":
        best_threshold = filtered_thresholds.loc[filtered_thresholds["Cost"].idxmin(), "threshold"]  
    # Fallback to accuracy if metric unknown
    else:
        print(f"Warning: Unknown optimize metric '{optimize}'. Defaulting to accuracy.")
        optimize = "Accuracy"
        best_threshold = filtered_thresholds.loc[filtered_thresholds["Accuracy"].idxmax(), "threshold"] 
    
    # --- Plot metrics by threshold --- 
    # Set the figure size
    fig, ax = plt.subplots(figsize=(12, 6))

    # Define metrics to use in plot starting with the optimization metric
    metrics = [optimize]
    # When plotting anything other than cost, also plot all metrics with minimum criteria
    if optimize != "Cost":
        if min_accuracy > 0 and "Accuracy" not in metrics:
            metrics.append("Accuracy")
        if min_recall > 0 and "Recall" not in metrics:
            metrics.append("Recall")
        if min_precision > 0 and "Precision" not in metrics:
            metrics.append("Precision")
        if min_f1 > 0 and "F1-Score" not in metrics:
            metrics.append("F1-Score")

    # Order metrics
    ordered_metrics = ["Cost", "Accuracy", "Recall", "Precision", "F1-Score"]
    ordered_metrics = [metric for metric in ordered_metrics if metric in metrics]
        
    # Get a color from viridis colormap for each metric
    n_metrics = len(ordered_metrics)
    cmap = plt.get_cmap("viridis", n_metrics)

    # Iterate over each metric
    for i, metric in enumerate(ordered_metrics):
        # Create line plot of current metric by threshold
        ax.plot(threshold_results["threshold"], threshold_results[metric], label=metric, color=cmap(i))
    
    # Format cost plots
    if optimize == "Cost":
        title += f" (FN Cost: {cost_fn}, FP Cost: {cost_fp})"  # add FN and FP costs to title 
        ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f"{x:,.0f}"))  # format cost y-axis with thousand separator and no decimals

    # Format plots with all other metrics
    else:
        ax.set_ylim(-0.02, 1.02)  # value range of metrics is 0 to 1, slightly extend y-axis for better visibility
        ax.set_yticks(np.arange(0, 1.1, 0.1))  # set y-axis ticks from 0 to 1 in 0.1 steps        

    # Customize plot
    ax.set_title(title, fontsize=14)
    ax.set_xlabel("Threshold", fontsize=12)
    ax.set_ylabel("Metric Value" if n_metrics > 1 else optimize, fontsize=12)  # a single plotted metric is always the optimization metric
    ax.set_xlim(0, 1)
    ax.set_xticks(np.arange(0, 1.1, 0.1))
    ax.legend(fontsize=11).set_visible(n_metrics > 1)  # show the legend only when several metrics are plotted
    ax.grid(True, alpha=0.3)
 
    # Add dashed line for best threshold 
    ax.axvline(x=best_threshold, color="gray", linestyle="--")
    y_min, y_max = ax.get_ylim()
    ax.text(best_threshold+0.01, y_min+(y_max-y_min)*0.05, f"Best Threshold: {best_threshold:.2f}", rotation=90, fontsize=11)
    
    # Adjust layout
    fig.tight_layout()

    # Save to file
    if safe_to_file:
        os.makedirs("images", exist_ok=True)
        image_path = os.path.join("images", f"{safe_to_file}")  
        if not os.path.exists(image_path):
            try:        
                plt.savefig(image_path, bbox_inches="tight", dpi=144)
                print(f"Metrics by threshold plot saved successfully to '{image_path}'.")
            except Exception as e:
                print(f"Error saving metrics by threshold plot: {e}")
        else:
            print(f"Skip saving metrics by threshold plot to file: '{image_path}' already exists.")

    # Show the plot
    plt.show()

    return best_threshold, threshold_results


# Example usage: Optimize accuracy (default) 
rf_best_threshold, rf_threshold_results = optimize_threshold(
    y_true=y_val, 
    y_pred_proba=tuned_model_results["Random Forest"]["y_val_proba"],
    title="Accuracy by Threshold"
)

# Example usage: Optimize precision while ensuring minimum recall  
rf_best_threshold, rf_threshold_results = optimize_threshold(
    y_true=y_val, 
    y_pred_proba=tuned_model_results["Random Forest"]["y_val_proba"],
    optimize="Precision",
    min_recall=0.95
)

# Example usage: Optimize cost  
rf_best_threshold, rf_threshold_results = optimize_threshold(
    y_true=y_val, 
    y_pred_proba=tuned_model_results["Random Forest"]["y_val_proba"],
    optimize="Cost",
    cost_fn=8500,
    cost_fp=1250,
    title="Cost by Threshold"
)


# --- Use function to optimize thresholds for all tuned models --- 
# Store tuned threshold model results as dictionary
tuned_threshold_model_results = {}

# Iterate over each tuned model
for model_name, model_results in tuned_model_results.items():
    # Optimize threshold for current model on the validation data 
    best_threshold, threshold_results = optimize_threshold(
        y_true=y_val, 
        y_pred_proba=model_results["y_val_proba"], 
        optimize="Precision",  # maximize precision
        min_recall=0.95,  # ensure minimum recall of 0.95 
        title=f"{model_name}: Metrics by Threshold"
    )

    # Add best threshold and resulting class predictions to tuned threshold model results dictionary
    tuned_threshold_model_results[model_name] = {}
    tuned_threshold_model_results[model_name]["best_threshold"] = best_threshold
    tuned_threshold_model_results[model_name]["y_val_pred"] = (tuned_model_results[model_name]["y_val_proba"] >= best_threshold).astype(int)

    # Add evaluation metrics of optimized threshold model to results dictionary
    tuned_threshold_model_results[model_name]["ROC-AUC"] = model_results["ROC-AUC"]
    tuned_threshold_model_results[model_name]["AUC-PR"] = model_results["AUC-PR"]
    tuned_threshold_model_results[model_name]["Accuracy"] = threshold_results.loc[threshold_results["threshold"] == best_threshold, "Accuracy"].squeeze()
    tuned_threshold_model_results[model_name]["Recall"] = threshold_results.loc[threshold_results["threshold"] == best_threshold, "Recall"].squeeze()
    tuned_threshold_model_results[model_name]["Precision"] = threshold_results.loc[threshold_results["threshold"] == best_threshold, "Precision"].squeeze()
    tuned_threshold_model_results[model_name]["F1-Score"] = threshold_results.loc[threshold_results["threshold"] == best_threshold, "F1-Score"].squeeze()
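
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: The loop above maximizes precision subject to a minimum recall. The following self-contained illustration applies that selection rule to made-up toy data. The helper <code>pick_threshold</code> and the 0.01 threshold grid are assumptions for illustration and may differ from how <code>optimize_threshold</code> works internally.
</div>

```python
import numpy as np


def pick_threshold(y_true, y_proba, min_recall=0.95):
    """Hypothetical helper: maximize precision subject to recall >= min_recall."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    best_threshold, best_precision = None, -1.0

    # Scan a grid of candidate thresholds from 0.00 to 1.00 in 0.01 steps
    for threshold in np.round(np.linspace(0, 1, 101), 2):
        y_pred = (y_proba >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

        # Keep the first threshold with the highest precision that satisfies the recall constraint
        if recall >= min_recall and precision > best_precision:
            best_threshold, best_precision = float(threshold), float(precision)

    return best_threshold, best_precision


# Toy data (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_proba = [0.10, 0.35, 0.60, 0.40, 0.80, 0.90]

threshold, precision = pick_threshold(y_true, y_proba, min_recall=1.0)
print(f"Best threshold: {threshold:.2f}, precision: {precision:.2f}")
```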

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Metrics</h3>
</div> 

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    📌 Compare evaluation metrics of hyperparameter-tuned models with default and optimized thresholds on the validation data.
</p> 
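<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: Threshold-dependent metrics (accuracy, recall, precision, F1-score) change when the threshold changes, whereas ROC-AUC and AUC-PR are computed from the predicted probabilities and stay the same. The following minimal example on made-up toy data illustrates this. The function <code>metrics_at</code> is a hypothetical helper.
</div>

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy data (made up for illustration)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_proba = np.array([0.10, 0.35, 0.60, 0.40, 0.80, 0.90])


def metrics_at(threshold):
    """Hypothetical helper: compute metrics for a given decision threshold."""
    y_pred = (y_proba >= threshold).astype(int)
    return {
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "ROC-AUC": roc_auc_score(y_true, y_proba),  # uses probabilities, not y_pred
    }


default_metrics = metrics_at(0.5)
optimized_metrics = metrics_at(0.36)
print("Default threshold:", default_metrics)
print("Optimized threshold:", optimized_metrics)
```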

In [None]:
# Extract metrics with default thresholds
metrics_default_thresholds = {
    model_name: {
        metric: tuned_model_results[model_name][metric]
        for metric in ["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"]
    }
    for model_name in tuned_model_results
}

# Extract metrics with optimized thresholds
metrics_optimized_thresholds = {
    model_name: {
        metric: tuned_threshold_model_results[model_name][metric]
        for metric in ["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"]
    }
    for model_name in tuned_threshold_model_results
}

# Create dictionary with tuned model comparison tables 
tuned_model_comparison = {
    "default_thresholds": pd.DataFrame(metrics_default_thresholds).transpose(),
    "optimized_thresholds": pd.DataFrame(metrics_optimized_thresholds).transpose()
}

# Display comparison table for default thresholds
print("Hyperparameter-Tuned Models (Default Thresholds)")
display(round(tuned_model_comparison["default_thresholds"], 2))

# Display comparison table for optimized thresholds
print("Hyperparameter-Tuned Models (Optimized Thresholds)")
display(round(tuned_model_comparison["optimized_thresholds"], 2))

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Classification Report</h3>
</div> 

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Default Thresholds</strong> <br>
    📌 Show classification report of hyperparameter-tuned models with default thresholds on the validation data.
</p> 
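<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: If the report should be stored or compared across models rather than only printed, <code>classification_report</code> can return a dictionary via <code>output_dict=True</code>. The toy labels below are made up for illustration.
</div>

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

# Return the report as a nested dictionary instead of a formatted string
report = classification_report(y_true, y_pred, output_dict=True)

# Convert to a DataFrame with one row per class and per average
report_df = pd.DataFrame(report).transpose()
print(report_df.round(2))
```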

In [None]:
# Create classification report for all tuned models with default thresholds
for model_name, model_result in tuned_model_results.items():
    print(f"\n{model_name} (Default Threshold): Classification Report")
    print(classification_report(y_val, model_result["y_val_pred"]))

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Optimized Thresholds</strong> <br>
    📌 Show classification report of hyperparameter-tuned models with optimized thresholds on the validation data.
</p> 

In [None]:
# Create classification report for all tuned models with optimized thresholds
for model_name, model_result in tuned_threshold_model_results.items():
    print(f"\n{model_name} (Optimized Threshold): Classification Report")
    print(classification_report(y_val, model_result["y_val_pred"]))

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Confusion Matrix</h3>
</div> 

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Default Thresholds</strong> <br>
    📌 Plot confusion matrix of hyperparameter-tuned models with default thresholds on the validation data.
</p> 
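<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: The four cells of a binary confusion matrix can be unpacked directly and, for example, combined into a total misclassification cost. The cost values reuse the example values from the threshold optimization (<code>cost_fn=8500</code>, <code>cost_fp=1250</code>). The cost formula itself is an assumption for illustration, and the toy labels are made up.
</div>

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Assumed cost formula: each false negative and false positive has a fixed cost
cost_fn, cost_fp = 8500, 1250
total_cost = fn * cost_fn + fp * cost_fp
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, total cost={total_cost:,}")
```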

In [None]:
# --- Plot confusion matrix for all tuned models with default thresholds ---
# Calculate number of rows and columns for subplot grid
n_plots = len(tuned_model_results)
n_cols = 2  
n_rows = math.ceil(n_plots / n_cols) 

# Create subplot grid with figure size based on 6x6 inches per subplot
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 6))

# Flatten the axes for easier iteration
axes = axes.flat

# Iterate over each model
for i, (model_name, model_result) in enumerate(tuned_model_results.items()):
    # Plot confusion matrix for current model
    plot_confusion_matrix(y_val, model_result["y_val_pred"], title=f"{model_name} (Default Threshold)", axes=axes[i])

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis("off")
    
# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px">
    <strong>Optimized Thresholds</strong> <br>
    📌 Plot confusion matrix of hyperparameter-tuned models with optimized thresholds on the validation data.
</p> 

In [None]:
# --- Plot confusion matrix for all tuned models with optimized thresholds ---
# Calculate number of rows and columns for subplot grid
n_plots = len(tuned_threshold_model_results)
n_cols = 2  
n_rows = math.ceil(n_plots / n_cols) 

# Create subplot grid with figure size based on 6x6 inches per subplot
fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 6))

# Flatten the axes for easier iteration
axes = axes.flat

# Iterate over each model
for i, (model_name, model_result) in enumerate(tuned_threshold_model_results.items()):
    # Plot confusion matrix for current model
    plot_confusion_matrix(y_val, model_result["y_val_pred"], title=f"{model_name} (Optimized Threshold)", axes=axes[i])

# Hide any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis("off")

# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Overfitting</h3>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Default Thresholds</strong> <br>
    📌 Diagnose overfitting of hyperparameter-tuned models with default thresholds by comparing evaluation metrics between training and validation data.
</div>
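<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: The core idea of the overfitting diagnosis is the gap between training and validation scores. The minimal example below uses made-up metric values and is not the implementation of <code>analyze_overfitting</code>. The column names are assumptions.
</div>

```python
import pandas as pd

# Made-up metric values for illustration only
train_metrics = pd.Series({"Accuracy": 0.99, "Recall": 0.98, "Precision": 0.97})
val_metrics = pd.Series({"Accuracy": 0.91, "Recall": 0.85, "Precision": 0.88})

# Combine into one table and compute the train-validation difference per metric
overfitting = pd.DataFrame({"Train": train_metrics, "Validation": val_metrics})
overfitting["Difference"] = overfitting["Train"] - overfitting["Validation"]

# A large positive difference suggests the model fits the training data better than unseen data
largest_gap = overfitting["Difference"].idxmax()
print(overfitting.round(2))
print(f"Largest train-validation gap: {largest_gap}")
```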

In [None]:
# Analyze overfitting for tuned models with default thresholds 
tuned_models_overfitting = analyze_overfitting(X_train_transformed, y_train, model_results=tuned_model_results)
print("Hyperparameter-Tuned Models (Default Thresholds)")
round(tuned_models_overfitting, 2)

In [None]:
# Plot train vs. validation comparison of all metrics for tuned models with default thresholds  
plot_train_val_metrics(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], tuned_models_overfitting)

In [None]:
# Plot train-validation difference scores of all metrics for tuned models with default thresholds
print("Hyperparameter-Tuned Models (Default Thresholds)")
plot_train_val_difference(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], tuned_models_overfitting)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Optimized Thresholds</strong> <br>
    📌 Diagnose overfitting of hyperparameter-tuned models with optimized thresholds by comparing evaluation metrics between training and validation data.
</div>

In [None]:
# Analyze overfitting for tuned models with optimized thresholds 
tuned_threshold_models_overfitting = analyze_overfitting(X_train_transformed, y_train, model_results=tuned_threshold_model_results)
print("Hyperparameter-Tuned Models (Optimized Thresholds)")
round(tuned_threshold_models_overfitting, 2)

In [None]:
# Plot train vs. validation comparison of all metrics for tuned models with optimized thresholds  
plot_train_val_metrics(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], tuned_threshold_models_overfitting)

In [None]:
# Plot train-validation difference scores of all metrics for tuned models with optimized thresholds
print("Hyperparameter-Tuned Models (Optimized Thresholds)")
plot_train_val_difference(["Accuracy", "Recall", "Precision", "F1-Score", "ROC-AUC", "AUC-PR"], tuned_threshold_models_overfitting)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Feature Misclassification Analysis</h3>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Default Thresholds</strong> <br>
    📌 Analyze feature misclassification relationships of hyperparameter-tuned models with default thresholds on the validation data.
</div>
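<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: One simple way to relate features to misclassifications is to correlate each numerical feature with a binary misclassification indicator. The example below uses made-up toy data and is an assumption about the general approach, not the implementation of <code>analyze_feature_misclassification</code>.
</div>

```python
import numpy as np
import pandas as pd

# Toy features and labels (made up for illustration)
X = pd.DataFrame({"age": [20, 30, 40, 50, 60, 70], "income": [5, 3, 6, 2, 4, 1]})
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 1])

# Binary indicator: 1 if the prediction is wrong, 0 if it is correct
misclassified = pd.Series((y_true != y_pred).astype(int), index=X.index)

# Pearson correlation of each feature with the misclassification indicator
correlations = X.corrwith(misclassified)
print(correlations.round(2))
```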

In [None]:
# --- Analyze feature misclassification relationships of tuned models with default thresholds ---
# Initialize results dictionary
tuned_misclassification_correlations = {}

# Iterate over each model
for model_name, model_result in tuned_model_results.items():
    print(f"{model_name}: Feature Misclassification Analysis")
    
    # Analyze feature misclassification relationships for current model
    misclassification_correlations = analyze_feature_misclassification(X_val_transformed, y_val, model_result["y_val_pred"])
    
    # Add current model results to dictionary
    tuned_misclassification_correlations[model_name] = misclassification_correlations
    print("=" * 145)
    
# Convert results dictionary into a DataFrame    
tuned_misclassification_correlations = pd.DataFrame(tuned_misclassification_correlations)

In [None]:
# Display feature misclassification correlations of tuned models with default thresholds
round(tuned_misclassification_correlations, 2)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    <strong>Optimized Thresholds</strong> <br>
    📌 Analyze feature misclassification relationships of hyperparameter-tuned models with optimized thresholds on the validation data.
</div>

In [None]:
# --- Analyze feature misclassification relationships of tuned models with optimized thresholds ---
# Initialize results dictionary
tuned_threshold_misclassification_correlations = {}

# Iterate over each model
for model_name, model_result in tuned_threshold_model_results.items():
    print(f"{model_name}: Feature Misclassification Analysis")
    
    # Analyze feature misclassification relationships for current model
    misclassification_correlations = analyze_feature_misclassification(X_val_transformed, y_val, model_result["y_val_pred"])
    
    # Add current model results to dictionary
    tuned_threshold_misclassification_correlations[model_name] = misclassification_correlations
    print("=" * 145)
    
# Convert results dictionary into a DataFrame    
tuned_threshold_misclassification_correlations = pd.DataFrame(tuned_threshold_misclassification_correlations)

In [None]:
# Display feature misclassification correlations of tuned models with optimized thresholds
round(tuned_threshold_misclassification_correlations, 2)

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Final Model</h2>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Model Selection</strong> <br>
    💡 Example: The Random Forest model achieved the best performance (e.g., accuracy = 0.92) on the validation data compared to other candidates. Therefore, the final model is a Random Forest Classifier with the following hyperparameters:  
    <ul>
        <li><code>n_estimators=225</code></li>
        <li><code>max_depth=26</code></li>
        <li><code>min_samples_split=2</code></li>
        <li><code>min_samples_leaf=1</code></li>
        <li><code>max_features=0.13</code></li>
        <li><code>class_weight="balanced"</code></li>
    </ul>
    <strong>Next steps</strong> <br>
    <ul>
        <li>Retrain the final model, save it to a file, and apply the optimized threshold.</li>
        <li>Evaluate the final model on the training, validation, and test sets to confirm its generalizability.</li>
        <li>Use the same performance metrics as for the baseline and hyperparameter-tuned models 
            (accuracy, recall, precision, F1-score, ROC-AUC, AUC-PR) along with additional diagnostics 
            (classification report, confusion matrix, overfitting, feature misclassification analysis).
        </li>
        <li>In addition, conduct a feature importance analysis and review model prediction examples 
            to further interpret and validate model behavior.
        </li>
    </ul>
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Retraining</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Retrain the final model with optimized hyperparameters and save it to a <code>.pkl</code> file in the <code>model</code> directory.
</div>
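<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: A saved model should reproduce the predictions of the original model after loading. The minimal round trip below uses <code>pickle</code> and a temporary directory on made-up toy data. It illustrates the idea only and is not the implementation of the <code>save_model</code> and <code>load_model</code> helpers.
</div>

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save the model to a .pkl file in a temporary directory and load it again
with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "final_model.pkl")
    with open(model_path, "wb") as file:
        pickle.dump(model, file)
    with open(model_path, "rb") as file:
        loaded_model = pickle.load(file)

# Check that the loaded model predicts the same probabilities
same_predictions = np.array_equal(model.predict_proba(X), loaded_model.predict_proba(X))
print(f"Loaded model reproduces predictions: {same_predictions}")
```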

In [None]:
# Initialize model with tuned hyperparameters (Random Forest example)
final_model = RandomForestClassifier(**rf_random_search.best_params_, random_state=42)

# Fit model
final_model.fit(X_train_transformed, y_train)

# Save final model as .pkl file 
save_model(final_model, "final_model.pkl")

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Apply optimized threshold to obtain predicted values on the training, validation, and test sets.
</div>

In [None]:
# Load final model
final_model = load_model("final_model.pkl")

# Predict probabilities for the positive class (class 1) on the training, validation, and test data
y_train_proba = final_model.predict_proba(X_train_transformed)[:, 1]
y_val_proba = final_model.predict_proba(X_val_transformed)[:, 1]
y_test_proba = final_model.predict_proba(X_test_transformed)[:, 1]

# Apply optimized threshold to convert probabilities to binary predictions
threshold = tuned_threshold_model_results["Random Forest"]["best_threshold"]  # Random Forest example
y_train_pred = (y_train_proba >= threshold).astype(int)
y_val_pred = (y_val_proba >= threshold).astype(int)
y_test_pred = (y_test_proba >= threshold).astype(int)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Metrics</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Compare evaluation metrics of the final model on the training, validation, and test sets.
</div>
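<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: AUC-PR can be estimated in more than one way. Applying <code>auc</code> to the precision-recall curve uses trapezoidal interpolation, whereas <code>average_precision_score</code> uses a step-wise sum, so the two values can differ slightly. The toy data below is made up for illustration.
</div>

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy data (made up for illustration)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_proba = np.array([0.10, 0.35, 0.60, 0.40, 0.80, 0.90])

# Trapezoidal area under the precision-recall curve
precision_curve, recall_curve, _ = precision_recall_curve(y_true, y_proba)
auc_pr_trapezoid = auc(recall_curve, precision_curve)

# Step-wise average precision (no interpolation between curve points)
auc_pr_step = average_precision_score(y_true, y_proba)

print(f"AUC-PR (trapezoid): {auc_pr_trapezoid:.3f}")
print(f"Average precision:  {auc_pr_step:.3f}")
```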

In [None]:
# --- Calculate evaluation metrics ---
# Accuracy
accuracy_train = accuracy_score(y_train, y_train_pred)
accuracy_val = accuracy_score(y_val, y_val_pred)
accuracy_test = accuracy_score(y_test, y_test_pred)

# Recall 
recall_train = recall_score(y_train, y_train_pred)
recall_val = recall_score(y_val, y_val_pred)
recall_test = recall_score(y_test, y_test_pred)

# Precision 
precision_train = precision_score(y_train, y_train_pred)
precision_val = precision_score(y_val, y_val_pred)
precision_test = precision_score(y_test, y_test_pred)

# F1-score
f1_train = f1_score(y_train, y_train_pred)
f1_val = f1_score(y_val, y_val_pred)
f1_test = f1_score(y_test, y_test_pred)

# ROC-AUC
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_test = roc_auc_score(y_test, y_test_proba)

# AUC-PR
precision_curve_train, recall_curve_train, _ = precision_recall_curve(y_train, y_train_proba)
auc_pr_train = auc(recall_curve_train, precision_curve_train)
precision_curve_val, recall_curve_val, _ = precision_recall_curve(y_val, y_val_proba)
auc_pr_val = auc(recall_curve_val, precision_curve_val)
precision_curve_test, recall_curve_test, _ = precision_recall_curve(y_test, y_test_proba)
auc_pr_test = auc(recall_curve_test, precision_curve_test)

# --- Comparison table of evaluation metrics ---
# Create comparison table
final_model_comparison = pd.DataFrame({
    "Data": ["Training", "Validation", "Test"],
    "Accuracy": [accuracy_train, accuracy_val, accuracy_test],
    "Recall": [recall_train, recall_val, recall_test],
    "Precision": [precision_train, precision_val, precision_test],
    "F1-Score": [f1_train, f1_val, f1_test],
    "ROC-AUC": [roc_auc_train, roc_auc_val, roc_auc_test],
    "AUC-PR": [auc_pr_train, auc_pr_val, auc_pr_test]
})

# Display comparison table (with 2 decimals)
round(final_model_comparison, 2)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Classification Report</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Show classification report for training, validation, and test data.
</div>

In [None]:
# Classification report
print("Classification Report: Training")
print(classification_report(y_train, y_train_pred))
print("Classification Report: Validation")
print(classification_report(y_val, y_val_pred))
print("Classification Report: Test")
print(classification_report(y_test, y_test_pred))

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Confusion Matrix</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Plot confusion matrix for training, validation, and test data.
</div>

In [None]:
# --- Plot confusion matrix for training, validation, and test data ---
# Create subplot grid 
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Add overall title
fig.suptitle("Final Model: Confusion Matrix", fontsize=16, y=1.03)

# Create confusion matrix subplots for training, validation, and test data (using helper function)
plot_confusion_matrix(y_train, y_train_pred, title="Training", axes=axes[0])
plot_confusion_matrix(y_val, y_val_pred, title="Validation", axes=axes[1])
plot_confusion_matrix(y_test, y_test_pred, title="Test", axes=axes[2])
    
# Adjust layout to prevent overlap
fig.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Overfitting</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Diagnose overfitting of final model by comparing evaluation metrics between training, validation, and test data using a grouped bar plot.
</div>
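<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: A grouped bar plot with <code>seaborn</code> expects long-format data with one row per combination of dataset and metric. The minimal example below shows how <code>pd.melt</code> reshapes a wide comparison table. The metric values are made up for illustration.
</div>

```python
import pandas as pd

# Wide format: one row per dataset, one column per metric (made-up values)
wide = pd.DataFrame({
    "Data": ["Training", "Validation", "Test"],
    "Accuracy": [0.95, 0.91, 0.90],
    "Recall": [0.93, 0.88, 0.87],
})

# Long format: one row per (dataset, metric) pair
long = pd.melt(wide, id_vars=["Data"], var_name="Metric", value_name="Value")
print(long)
```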

In [None]:
# --- Overfitting grouped bar plot ---
# Melt the final_model_comparison DataFrame (from the "Metrics" section) for easier plotting 
metric_df = pd.melt(final_model_comparison, id_vars=["Data"], var_name="Metric", value_name="Value")

# Create figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Create grouped bar plot
sns.barplot(data=metric_df, x="Metric", y="Value", hue="Data", palette="viridis", ax=ax)

# Add value labels 
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", padding=3, fontsize=10)

# Customize plot 
ax.set_title("Final Model: Overfitting", fontsize=14)
ax.set_ylabel("Metric Value", fontsize=12)
ax.set_xlabel("")
ax.set_ylim(0, 1.05)  # slightly extend y-axis upper limit for better visibility of value labels
ax.set_yticks(np.arange(0, 1.1, 0.1))  # y-axis ticks from 0 to 1 in 0.1 steps
ax.tick_params(axis="x", labelsize=12) 
ax.tick_params(axis="y", labelsize=10)
ax.legend(fontsize=11)
ax.grid(axis="y", alpha=0.3)

# Adjust the layout
fig.tight_layout()

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Feature Misclassification Analysis</h3>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Analyze relationships between the features and misclassifications on the training, validation, and test sets through correlations and box plots.
</div>

In [None]:
# Analyze feature misclassification relationships on the training data (using helper function)
print("Feature Misclassification Relationships: Training")
misclassification_correlations_train = analyze_feature_misclassification(X_train_transformed, y_train, y_train_pred)

In [None]:
# Analyze feature misclassification relationships on the validation data
print("Feature Misclassification Relationships: Validation")
misclassification_correlations_val = analyze_feature_misclassification(X_val_transformed, y_val, y_val_pred)

In [None]:
# Analyze feature misclassification relationships on the test data
print("Feature Misclassification Relationships: Test")
misclassification_correlations_test = analyze_feature_misclassification(X_test_transformed, y_test, y_test_pred)

In [None]:
# --- Feature Misclassification Correlations ---
# Merge training, validation, and test correlations into a single DataFrame
misclassification_correlations = pd.concat(
    [misclassification_correlations_train, misclassification_correlations_val, misclassification_correlations_test], 
    axis=1,
    keys=["Training", "Validation", "Test"]
)

# Show misclassification correlations (rounded to 2 decimals and sorted by absolute correlation values in test data)
round(misclassification_correlations.sort_values("Test", key=abs, ascending=False), 2)

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Feature Importance</h3>
</div> 

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Visualize feature importances of Logistic Regression or Elastic Net Logistic Regression with a bar plot.  
</div>
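<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: For binary classification, <code>LogisticRegression.coef_</code> is a 2D array of shape <code>(1, n_features)</code>, so it needs to be flattened before it can be used as a DataFrame column. The example below uses synthetic data in which only the first feature drives the target. Absolute coefficients are comparable across features only if the features are on the same scale.
</div>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first feature determines the target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# coef_ is 2D for binary problems; flatten it to one value per feature
coef_shape = model.coef_.shape
importance = np.abs(model.coef_).ravel()
top_feature = int(np.argmax(importance))
print(f"coef_ shape: {coef_shape}, most important feature index: {top_feature}")
```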

In [None]:
# --- Feature importance plot for logistic regression or elastic net logistic regression ---
# Get the coefficients 
coefficients = final_model.coef_[0]  # final_model must be LogisticRegression; coef_ has shape (1, n_features) for binary classification

# Get feature names in proper format 
feature_names = X_train_transformed.columns.str.title().str.replace("_", " ") 

# Create a DataFrame for easier plotting
feature_importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": np.abs(coefficients)
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create figure and axes
fig, ax = plt.subplots(figsize=(12, 6))

# Create bar plot of top 10 features
sns.barplot(data=feature_importance_df.head(10), x="importance", y="feature", hue="feature", palette="viridis", ax=ax)

# Customize plot
ax.set_title("Final Model: Top 10 Most Important Features", fontsize=14)
ax.set_xlabel("Feature Importance (Absolute Coefficients)", fontsize=12)
ax.set_ylabel("")
ax.tick_params(axis="both", labelsize=12)
ax.grid(axis="x", alpha=0.3)

# Add value labels 
for i, value in enumerate(feature_importance_df["importance"].head(10)):
    ax.text(value + 0.001, i, f"{value:.2f}", va="center", fontsize=12)

# Adjust layout
fig.tight_layout()

# Save plot to file
os.makedirs("images", exist_ok=True)  
image_path = os.path.join("images", "feature_importance_final.png")  
if not os.path.exists(image_path):
    try:        
        fig.savefig(image_path, bbox_inches="tight", dpi=144)
        print(f"Feature importance plot saved successfully to '{image_path}'.")
    except Exception as e:
        print(f"Error saving feature importance plot: {e}")
else:
    print(f"Skip saving feature importance plot: '{image_path}' already exists.")

# Show the plot
plt.show()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Visualize feature importances of Decision Tree or Random Forest with a bar plot.  
</div>
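<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: The impurity-based <code>feature_importances_</code> of tree models are normalized to sum to 1, which makes them easy to read as relative shares. The example below uses synthetic data in which only the second feature drives the target.
</div>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the second feature determines the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# One importance value per feature, normalized to sum to 1
importances = model.feature_importances_
top_feature = int(np.argmax(importances))
print(f"Importances: {np.round(importances, 2)}, most important feature index: {top_feature}")
```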

In [None]:
# --- Feature importance plot for decision tree or random forest ---
# Get feature importances
importances = final_model.feature_importances_  # final_model must be DecisionTree or RandomForest

# Get feature names in proper format 
feature_names = X_train_transformed.columns.str.title().str.replace("_", " ")

# Create a DataFrame for easier plotting
feature_importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create figure and axes
fig, ax = plt.subplots(figsize=(12, 6))

# Create bar plot of top 10 features
sns.barplot(data=feature_importance_df.head(10), x="importance", y="feature", hue="feature", palette="viridis", ax=ax)

# Customize plot
ax.set_title("Final Model: Top 10 Most Important Features", fontsize=14)
ax.set_xlabel("Feature Importance", fontsize=12)
ax.set_ylabel("")
ax.tick_params(axis="both", labelsize=12)
ax.grid(axis="x", alpha=0.3)

# Add value labels 
for i, value in enumerate(feature_importance_df["importance"].head(10)):
    ax.text(value + 0.001, i, f"{value:.2f}", va="center", fontsize=12)

# Adjust layout
fig.tight_layout()

# Save plot to file
os.makedirs("images", exist_ok=True)  
image_path = os.path.join("images", "feature_importance_final.png")  
if not os.path.exists(image_path):
    try:        
        fig.savefig(image_path, bbox_inches="tight", dpi=144)
        print(f"Feature importance plot saved successfully to '{image_path}'.")
    except Exception as e:
        print(f"Error saving feature importance plot: {e}")
else:
    print(f"Skip saving feature importance plot: '{image_path}' already exists.")

# Show the plot
plt.show()

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Visualize feature importances of XGBoost with a bar plot.  
</div>

In [None]:
# --- Feature importance plot for XGBoost ---
# Get the feature importances
importances = final_model.get_booster().get_score(importance_type="gain")  # final_model must be XGBClassifier (sklearn API); features never used in a split are omitted

# Create a DataFrame for easier plotting
feature_importance_df = pd.DataFrame({
    "feature": list(importances.keys()),
    "importance": list(importances.values())
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values("importance", ascending=False)

# Create figure and axes
fig, ax = plt.subplots(figsize=(12, 6))

# Create bar plot of top 10 features
sns.barplot(data=feature_importance_df.head(10), x="importance", y="feature", hue="feature", palette="viridis", ax=ax)

# Customize plot
ax.set_title("Final Model: Top 10 Most Important Features", fontsize=14)
ax.set_xlabel("Feature Importance", fontsize=12)
ax.set_ylabel("")
ax.tick_params(axis="both", labelsize=12)
ax.grid(axis="x", alpha=0.3)

# Add value labels 
for i, value in enumerate(feature_importance_df["importance"].head(10)):
    ax.text(value + 0.001, i, f"{value:.2f}", va="center", fontsize=12)

# Adjust layout
fig.tight_layout()

# Save plot to file
os.makedirs("images", exist_ok=True)  
image_path = os.path.join("images", "feature_importance_final.png")  
if not os.path.exists(image_path):
    try:        
        fig.savefig(image_path, bbox_inches="tight", dpi=144)
        print(f"Feature importance plot saved successfully to '{image_path}'.")
    except Exception as e:
        print(f"Error saving feature importance plot: {e}")
else:
    print(f"Skip saving feature importance plot: '{image_path}' already exists.")

# Show the plot
plt.show()

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Model Prediction Examples</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Show illustrative examples of model predictions from test data to showcase performance on unseen data.
    <ul>
        <li>Goal: Give stakeholders a clear picture of when the model performs well and when it struggles.</li>
        <li>Recommendations:
            <ul>
                <li>Show 5-10 diverse examples: Best cases, worst cases, and typical cases.</li>
                <li>Show 2-5 most important features, actual vs. predicted values, prediction confidence, and whether the example was misclassified.</li>
                <li>Add notes about any interesting patterns or edge cases observed.</li>
            </ul>
        </li>
    </ul>
</div>

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Identify best examples (correct with high confidence), worst examples (incorrect with high confidence), and typical examples (average confidence).
</div>
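<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    💡 Sketch: The selection logic for best, worst, and typical examples can be illustrated on a small made-up table. Defining typical examples as those with confidence closest to the mean is an assumption for illustration. The column name <code>Misclassified</code> is also hypothetical.
</div>

```python
import pandas as pd

# Made-up predictions for illustration only
examples = pd.DataFrame({
    "Actual": [1, 1, 0, 0, 1, 0],
    "Predicted": [1, 0, 0, 1, 1, 0],
    "Confidence Score": [0.97, 0.88, 0.91, 0.64, 0.55, 0.72],
})
examples["Misclassified"] = examples["Actual"] != examples["Predicted"]

# Best examples: correct predictions with the highest confidence
best = examples[~examples["Misclassified"]].sort_values("Confidence Score", ascending=False).head(2)

# Worst examples: incorrect predictions with the highest confidence
worst = examples[examples["Misclassified"]].sort_values("Confidence Score", ascending=False).head(2)

# Typical examples (assumed definition): confidence closest to the mean confidence
mean_confidence = examples["Confidence Score"].mean()
distance_to_mean = (examples["Confidence Score"] - mean_confidence).abs()
typical = examples.loc[distance_to_mean.sort_values().index[:2]]

print("Best:", list(best.index), "Worst:", list(worst.index), "Typical:", list(typical.index))
```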

In [None]:
# --- Create DataFrame with top 5 features, actual and predicted values, prediction confidence, and misclassifications from test data ---
# Combine test features with actual and predicted target values into a single DataFrame
prediction_examples = X_test.copy()  # raw features (before transformation) for easier interpretability
prediction_examples.columns = prediction_examples.columns.str.title().str.replace("_", " ")  # format feature names
prediction_examples["Actual"] = y_test
prediction_examples["Predicted"] = y_test_pred

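# Calculate prediction confidence (ensure final_model supports predict_proba and X_test_transformed is correctly preprocessed)
# Note: the maximum class probability belongs to the predicted class only at the default 0.5 threshold; with an optimized threshold, the two can differ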
prediction_examples["Confidence Score"] = final_model.predict_proba(X_test_transformed).max(axis=1)
prediction_examples["Confidence"] = prediction_examples["Confidence Score"].apply(lambda x: f"{x:.0%}")  # format as percentages

# Calculate misclassification (and format for better readability)
prediction_examples["Misclassification"] = prediction_examples["Predicted"] != prediction_examples["Actual"]
prediction_examples["Misclassification"] = prediction_examples["Misclassification"].map({True: "❌ Yes", False: "✅ No"})  

# Get top 5 most important features (ensure feature_importance_df was created in the feature importance section and its feature names match the formatted column names above)
top5_features = feature_importance_df.sort_values("importance", ascending=False).head(5)["feature"]

# Filter DataFrame columns
columns_to_keep = list(top5_features) + ["Actual", "Predicted", "Confidence Score", "Confidence", "Misclassification"]
prediction_examples = prediction_examples[columns_to_keep].copy()

# --- Identify best examples ---
# Get the top 5 correctly classified cases with highest confidence scores for each class 
best_class_1 = prediction_examples[(prediction_examples["Actual"] == 1) & (prediction_examples["Misclassification"] == "✅ No")].sort_values("Confidence Score", ascending=False).head(5)
best_class_0 = prediction_examples[(prediction_examples["Actual"] == 0) & (prediction_examples["Misclassification"] == "✅ No")].sort_values("Confidence Score", ascending=False).head(5)

# Combine and show best prediction examples from each class
best_examples = pd.concat([best_class_1, best_class_0]).drop(columns=["Confidence Score"])
print("Best examples (correct with high confidence):")
display(best_examples)

# --- Identify worst examples ---
# Get the top 5 misclassified cases despite high confidence scores for each class
worst_class_1 = prediction_examples[(prediction_examples["Actual"] == 1) & (prediction_examples["Misclassification"] == "❌ Yes")].sort_values("Confidence Score", ascending=False).head(5)
worst_class_0 = prediction_examples[(prediction_examples["Actual"] == 0) & (prediction_examples["Misclassification"] == "❌ Yes")].sort_values("Confidence Score", ascending=False).head(5)

# Combine and show worst prediction examples from each class
worst_examples = pd.concat([worst_class_1, worst_class_0]).drop(columns=["Confidence Score"])
print("Worst examples (incorrect with high confidence):")
display(worst_examples)

# --- Identify typical examples ---
# Calculate average confidence score
mean_confidence = prediction_examples["Confidence Score"].mean()

# Calculate difference from average confidence for each case
prediction_examples["Difference from Mean Confidence"] = np.abs(prediction_examples["Confidence Score"] - mean_confidence)

# Get the top 5 cases with confidence scores closest to the average confidence for each class
typical_class_1 = prediction_examples[prediction_examples["Actual"] == 1].sort_values("Difference from Mean Confidence").head(5)
typical_class_0 = prediction_examples[prediction_examples["Actual"] == 0].sort_values("Difference from Mean Confidence").head(5)

# Combine and show typical prediction examples from each class
typical_examples = pd.concat([typical_class_1, typical_class_0]).drop(columns=["Confidence Score", "Difference from Mean Confidence"])
print("Typical examples (average confidence):")
display(typical_examples)

<div style="background-color:#fff6e4; padding:15px; border:3px solid #f5ecda; border-radius:6px;">
    📌 Display the best, worst, and typical prediction examples for each class in the test data, including the top five most important features, actual vs. predicted values, prediction confidence, and whether the example was misclassified.
</div>

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px">
    💡 Option 1: Single Table (Loan Default Example)
</p>

| Example | Income (€) | Age | Credit Score | Loan Amount (€) | Work Years | Actual Default | Predicted Default | Confidence | Misclassified |
|---------|------------|-----|--------------|-----------------|------------|----------------|-------------------|------------|---------------|
| Best    | 48,000     | 26  | 740          | 12,000          | 1          | Yes            | Yes               | 99%        | ✅ No        |
| Best    | 72,000     | 56  | 770          | 25,000          | 2          | No             | No                | 100%       | ✅ No        |
| Worst   | 55,000     | 42  | 630          | 40,000          | 3          | Yes            | No                | 95%        | ❌ Yes       |
| Worst   | 32,000     | 24  | 620          | 18,000          | 1          | No             | Yes               | 98%        | ❌ Yes       |
| Typical | 60,000     | 47  | 710          | 30,000          | 3          | Yes            | Yes               | 94%        | ✅ No        |
| Typical | 38,000     | 24  | 750          | 10,000          | 4          | No             | No                | 94%        | ✅ No        |


<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px">
    💡 Option 2: Separate Tables (Loan Default Example)
</p>

🏆 Best Predictions (Correct with High Confidence)
| Income (€) | Age | Credit Score | Loan Amount (€) | Work Years | Actual Default | Predicted Default | Confidence | Misclassified |
|------------|-----|--------------|-----------------|------------|----------------|-------------------|------------|---------------|
| 48,000     | 26  | 740          | 12,000          | 1          | Yes            | Yes               | 99%        | ✅ No         |
| 72,000     | 56  | 770          | 25,000          | 2          | No             | No                | 100%       | ✅ No         |

⚠️ Worst Predictions (Incorrect with High Confidence)
| Income (€) | Age | Credit Score | Loan Amount (€) | Work Years | Actual Default | Predicted Default | Confidence | Misclassified |
|------------|-----|--------------|-----------------|------------|----------------|-------------------|------------|---------------|
| 55,000     | 42  | 630          | 40,000          | 3          | Yes            | No                | 95%        | ❌ Yes        |
| 32,000     | 24  | 620          | 18,000          | 1          | No             | Yes               | 98%        | ❌ Yes        |

➖ Typical Predictions (Average Confidence)
| Income (€) | Age | Credit Score | Loan Amount (€) | Work Years | Actual Default | Predicted Default | Confidence | Misclassified |
|------------|-----|--------------|-----------------|------------|----------------|-------------------|------------|---------------|
| 60,000     | 47  | 710          | 30,000          | 3          | Yes            | Yes               | 94%        | ✅ No         |
| 38,000     | 24  | 750          | 10,000          | 4          | No             | No                | 94%        | ✅ No         |

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Summary</h1>
</div>

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px; margin-bottom:8px;">
    <h2 style="margin:0px">🧹 Data Preprocessing</h2>
</div> 

Used `pandas` and `sklearn` for data loading, cleaning, transformation, and saving.
- **Loaded data**:
    - From a .csv file using `pandas` `read_csv`.
    - From a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `read_sql`.
- **Standardized column names and labels**:
    - To `snake_case` using `pandas` string methods and `apply` with custom functions.
- **Handled duplicates**:
    - Removed duplicate rows (e.g., based on the ID column) using `pandas` `drop_duplicates`.
- **Handled data types**:
    - Converted string columns to numerical types (`pandas` `astype`) and datetime types (`pandas` `to_datetime`).
    - Converted string columns with two categories to boolean columns using `pandas` `map`.
- **Train-validation-test split**:
    - Split data into training (e.g., 70%), validation (15%), and test (15%) sets using `sklearn` `train_test_split`.
- **Engineered new features**:
    - Derived categorical, numerical, and boolean features from raw text columns using `pandas` `apply` with custom functions.
    - Derived categorical and numerical features from categorical text columns using tiering and target encoding.
- **Defined semantic type** for each column (numerical, categorical, boolean).
- **Handled missing values**:
    - Deleted rows with missing values using `pandas` `dropna`.
    - Imputed missing values: Filled in the median for numerical columns or the mode for categorical columns using `pandas` `fillna`.
- **Handled outliers**:
    - Identified univariate outliers using statistical methods (e.g., 3SD or 1.5 IQR) with custom transformer classes that inherit from `sklearn` `BaseEstimator` and `TransformerMixin`.
    - Identified multivariate outliers using `sklearn` `IsolationForest`. 
- **Feature scaling and encoding**:
    - Scaled numerical features using standard scaling with `sklearn` `StandardScaler` or min-max normalization with `MinMaxScaler`.
    - Encoded categorical features:
        - Nominal features: Used one-hot encoding with `sklearn` `OneHotEncoder`.
        - Ordinal features: Used ordinal encoding with `sklearn` `OrdinalEncoder`.
    - Applied scaling and encoding together using `sklearn` `ColumnTransformer`.
- **Polynomial features**:
    - Created polynomial features using `sklearn` `PolynomialFeatures`.
- **Saved the preprocessed data**:
    - For training, validation, and test sets as .csv files using `pandas` `to_csv`.
    - In a MySQL database table using `sqlalchemy`, `mysql-connector-python`, and `pandas` `to_sql`.
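
The univariate outlier rules mentioned above (3SD and 1.5 IQR) can be sketched without any third-party library. This is only a minimal illustration, not the template's custom transformer classes: the function names `iqr_outliers` and `sd_outliers` and the sample values are assumptions made for this example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def sd_outliers(values, n_sd=3.0):
    """Return values more than n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_sd * sd]


example_values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up data with one extreme point
print("1.5 IQR rule:", iqr_outliers(example_values))
print("3SD rule:", sd_outliers(example_values))
```

On this small made-up sample the 1.5 IQR rule flags the extreme value while the 3SD rule does not, because the extreme value itself inflates the standard deviation. This is one reason to compare both rules before removing rows.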

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px; margin-bottom:8px;">
    <h2 style="margin:0px">🔍 Exploratory Data Analysis (EDA)</h2>
</div> 

Used `pandas`, `numpy`, `seaborn`, and `matplotlib` for statistical analysis and visualizations.
- **Univariate EDA**:
    - **Numerical columns**:
        - Analyzed descriptive statistics (e.g., mean, median, standard deviation) using `pandas` `describe`.
        - Visualized distributions with histograms using `seaborn` `histplot` and `matplotlib`.
    - **Categorical columns**:
        - Examined frequencies using `pandas` `value_counts`.
        - Visualized frequency distributions with bar plots using `seaborn` `barplot` and `matplotlib`. 
- **Bivariate EDA**:
    - **Numerical vs. numerical**:
        - Analyzed pairwise relationships with a correlation matrix (`pandas` `corr` and `numpy`) and visualized them with a heatmap (`seaborn` `heatmap`).
        - Visualized relationships with scatterplots using `seaborn` `scatterplot` and `matplotlib`.
    - **Numerical vs. categorical**:
        - Explored relationships with group-wise statistics (e.g., mean or median by category) using `pandas` `groupby` and `agg`.
        - Quantified the magnitude of group differences with Cohen's d using a custom function.
        - Visualized results with bar plots using `seaborn` `barplot` and `matplotlib`.
    - **Categorical vs. categorical**:
        - Analyzed relationships with contingency tables using `pandas` `crosstab`.
        - Visualized relationships with grouped bar plots using `pandas` `crosstab` `plot` and `matplotlib`.
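
The Cohen's d bullet above refers to a custom function. A minimal sketch of one common definition (mean difference divided by the pooled sample standard deviation) could look as follows. The function name `cohens_d` and the sample values are assumptions for illustration, and the template's own function may differ.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


group_1 = [2, 4, 6]  # made-up values for one category
group_2 = [1, 3, 5]  # made-up values for the other category
print(cohens_d(group_1, group_2))
```

By the usual rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large group differences.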

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px; margin-bottom:8px;">
    <h2 style="margin:0px">🏗️ Modeling</h2>
</div> 

Used `sklearn` and `xgboost` for model training, evaluation, and optimization.
- **Baseline models**:
    - Trained baseline models with default hyperparameter values.
    - Evaluated regression model performance using metrics and diagnostics:
        - Calculated RMSE, MAPE, and R-squared with `sklearn` `mean_squared_error`, `mean_absolute_percentage_error`, and `r2_score`.
        - Analyzed errors with residual plots and error distributions using `pandas` and `matplotlib`.
        - Explored feature-error relationships using scatter plots with `seaborn` `scatterplot` and `matplotlib`.
    - Evaluated classification model performance using metrics and diagnostics:
        - Calculated accuracy, recall, precision, F1-score, ROC-AUC, and AUC-PR with `sklearn` `accuracy_score`, `recall_score`, `precision_score`, `f1_score`, `roc_auc_score`, `precision_recall_curve`, and `auc`.
        - Compared metrics using tables with `pandas`.
        - Plotted precision-recall curves using `matplotlib`.
        - Created classification reports with `sklearn` `classification_report`.
        - Plotted confusion matrices with `sklearn` `confusion_matrix` and `ConfusionMatrixDisplay`.
        - Analyzed overfitting using custom tables (`pandas`) and plots (`seaborn` `barplot` and `matplotlib`).
        - Explored feature-misclassification relationships through correlations (`pandas` `corr`) and grouped box plots (`seaborn` `boxplot` and `matplotlib`).
- **Hyperparameter tuning**:
    - Performed grid search (`sklearn` `GridSearchCV`) or random search (`RandomizedSearchCV`) with 5-fold cross-validation on the most promising baseline models.
    - Retrained the best-performing model from each algorithm.
    - Classification models: Plotted precision-recall curves and optimized decision thresholds.
    - Evaluated all tuned models using the above metrics and diagnostics.
- **Final model**:
    - Selected Random Forest for its good performance (highest AUC-PR), low overfitting (lowest AUC-PR difference), and interpretability.
    - Retrained and saved the final model as a .pkl file using `pickle`.
    - Compared training, validation, and test performance using the above metrics and diagnostics.
    - Visualized feature importances using `seaborn` `barplot` and `matplotlib` or `xgboost` `plot_importance`.
    - Showed model prediction examples of the best, worst, and typical cases using `pandas`.
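
The summary mentions optimizing decision thresholds but does not state the criterion. As a hedged sketch, the example below assumes the F1-score is the quantity to maximize and tries every distinct predicted probability as a candidate threshold. The function names and the sample data are hypothetical, and in practice the threshold would be chosen on validation data, not test data.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score when the positive class is predicted for probabilities >= threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, y_prob):
    """Try each distinct probability as a threshold and return the one with the best F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


y_val_example = [0, 0, 1, 1]                # made-up validation labels
y_val_prob_example = [0.1, 0.4, 0.35, 0.8]  # made-up positive-class probabilities
best_threshold = best_f1_threshold(y_val_example, y_val_prob_example)
print(best_threshold, f1_at_threshold(y_val_example, y_val_prob_example, best_threshold))
```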

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px; margin-bottom:8px;">
    <h2 style="margin:0px">Next Steps</h2>
</div> 

- Add unsupervised learning section with algorithms like K-Means Clustering, DBSCAN, and Principal Component Analysis (PCA).
- Add model deployment as a web application with an interactive UI and API (e.g., using Gradio or Flask).

<div style="background-color:#2c699d; color:white; padding:15px; border-radius:6px;">
    <h1 style="margin:0px">Appendix</h1>
</div>

<div style="background-color:#3d7ab3; color:white; padding:12px; border-radius:6px;">
    <h2 style="margin:0px">Model Hyperparameters</h2>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    ℹ️ Overview of hyperparameters and their default values for both regression and classification models. 
</div>

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Regression</h3>
</div> 

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Linear Regression</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li><code>fit_intercept=True</code>: Calculates the intercept; can be set to <code>False</code> if data is already centered.</li>
        <li><code>n_jobs=None</code>: Number of parallel jobs; use <code>-1</code> for all available processors. Only speeds up sufficiently large problems.</li>
        <li><code>positive=False</code>: Forces regression coefficients to be non-negative if set to <code>True</code>.</li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression" target="_blank">scikit-learn LinearRegression documentation</a>.  
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Logistic Regression</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Regularization:
            <ul>
                <li><code>penalty="l2"</code>: Regularization type (options are <code>"l1"</code>, <code>"l2"</code>, <code>"elasticnet"</code>, or <code>None</code> for no regularization; older scikit-learn releases accepted the string <code>"none"</code>).</li>
                <li><code>C=1.0</code>: Inverse of regularization strength (smaller = stronger regularization).</li>
                <li><code>l1_ratio=None</code>: The mix ratio between L1 and L2 regularization (0 is L2 only, 1 is L1 only). Used with <code>penalty="elasticnet"</code> and <code>solver="saga"</code>.</li>
            </ul>
        </li>
        <li>Optimization:
            <ul>
                <li><code>solver="lbfgs"</code>: Optimization algorithm (options are <code>"lbfgs"</code>, <code>"liblinear"</code>, <code>"saga"</code>, <code>"newton-cg"</code>, or <code>"sag"</code>).</li>
                <li><code>max_iter=100</code>: Maximum number of iterations for convergence.</li>
                <li><code>tol=1e-4</code>: Tolerance for stopping criteria.</li>
            </ul>
        </li>
        <li>Model Behavior:
            <ul>
                <li><code>fit_intercept=True</code>: Whether to calculate the intercept.</li>
                <li><code>intercept_scaling=1</code>: Scaling of the synthetic intercept feature (only used with <code>solver="liblinear"</code> and <code>fit_intercept=True</code>).</li>
                <li><code>warm_start=False</code>: Reuse previous solution for subsequent fits.</li>
                <li><code>dual=False</code>: Use dual formulation (only with <code>penalty="l2"</code> and <code>solver="liblinear"</code>).</li>
            </ul>
        </li>
        <li>Multi-Class Classification:
            <ul>
                <li><code>multi_class="auto"</code>: Multi-class handling (options are <code>"auto"</code>, <code>"ovr"</code>, or <code>"multinomial"</code>). Deprecated since scikit-learn 1.5; check the documentation for your installed version.</li>
                <li><code>class_weight=None</code>: Class weights; <code>"balanced"</code> adjusts for imbalanced class frequencies.</li>
            </ul>
        </li>
        <li>Performance:
            <ul>
                <li><code>random_state=None</code>: Seed for reproducibility.</li>
                <li><code>n_jobs=None</code>: Number of CPU cores used when parallelizing over classes with a one-vs-rest scheme (ignored with <code>solver="liblinear"</code>).</li>
                <li><code>verbose=0</code>: Verbosity level; controls solver progress output.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" target="_blank">scikit-learn LogisticRegression documentation</a>.  
</div>  
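
To make <code>class_weight="balanced"</code> concrete: scikit-learn documents the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The library-free sketch below reproduces that formula. The function name `balanced_class_weights` and the sample labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class computed as n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


y_train_example = [0, 0, 0, 1]  # made-up labels: three negatives, one positive
class_weights_example = balanced_class_weights(y_train_example)
print(class_weights_example)
```

The minority class receives the larger weight (2.0 versus about 0.67 here), so its errors count more during training.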

<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Elastic Net</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Elastic Net Regression</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>alpha=1.0</code>: Regularization strength. Higher values increase the penalty, reducing overfitting but possibly underfitting the data.</li>
                <li><code>l1_ratio=0.5</code>: The mix between L1 (Lasso) and L2 (Ridge) regularization.
                    <ul>
                        <li><code>l1_ratio=1.0</code> corresponds to pure Lasso.</li>
                        <li><code>l1_ratio=0.0</code> corresponds to pure Ridge.</li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Solver Configuration:
            <ul>
                <li><code>fit_intercept=True</code>: Whether to calculate the intercept for the model. If <code>False</code>, the model assumes data is centered.</li>
                <li><code>precompute=False</code>: Whether to use a precomputed Gram matrix to speed up calculations. A Gram matrix can also be passed directly; for sparse input this option is always <code>False</code>.</li>
                <li><code>max_iter=1000</code>: The maximum number of iterations allowed for convergence during optimization.</li>
                <li><code>tol=1e-4</code>: Tolerance for the optimization. If the coefficient updates are smaller than <code>tol</code>, the solver checks the dual gap for optimality and continues until it is smaller than <code>tol</code>.</li>
                <li><code>warm_start=False</code>: If <code>True</code>, reuse the solution of the previous fit as initialization for the next fit.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>selection="cyclic"</code>: Determines the strategy for updating coefficients. <code>"cyclic"</code> updates coefficients sequentially, <code>"random"</code> in a random order.</li>
                <li><code>random_state=None</code>: Seed for random number generation when <code>selection="random"</code>.</li>
            </ul>
        </li>
        <li>Performance Configuration:
            <ul>
                <li><code>copy_X=True</code>: Whether to copy the input data <code>X</code>. If <code>False</code>, the original data may be overwritten during training, which saves memory.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html" target="_blank">scikit-learn Elastic Net Regression documentation</a>.  
</div>
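
As a small illustration of how <code>alpha</code> and <code>l1_ratio</code> interact, the sketch below evaluates the penalty term from the objective given in the scikit-learn `ElasticNet` documentation, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name `elastic_net_penalty` and the coefficient values are assumptions for this example.

```python
def elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5):
    """Penalty term: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    l1_norm = sum(abs(w) for w in coefs)
    l2_norm_squared = sum(w * w for w in coefs)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_squared


example_coefs = [1.0, -2.0]  # made-up regression coefficients
print(elastic_net_penalty(example_coefs, l1_ratio=1.0))  # L1 part only (Lasso-style)
print(elastic_net_penalty(example_coefs, l1_ratio=0.0))  # L2 part only (Ridge-style)
print(elastic_net_penalty(example_coefs, l1_ratio=0.5))  # equal mix of both
```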


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Elastic Net Logistic Regression</strong> <br>
    ℹ️ Note: Uses the same <code>LogisticRegression</code> class and hyperparameters as standard Logistic Regression; only the regularization settings differ. Set <code>penalty</code> to <code>"elasticnet"</code> and <code>solver</code> to <code>"saga"</code> (the only solver that supports this penalty), and use <code>l1_ratio</code> to control the mix between L1 and L2 regularization.
    <ul>
        <li>Hyperparameters:
            <ul>
                <li><code>penalty="elasticnet"</code>: Regularization type that combines the L1 and L2 penalties (other options are <code>"l1"</code>, <code>"l2"</code>, or <code>None</code>).</li>
                <li><code>l1_ratio=0.5</code>: The mix ratio between L1 and L2 regularization (0 is L2 only, 1 is L1 only). This parameter is specific to Elastic Net regularization.</li>
                <li><code>solver="saga"</code>: Optimization algorithm; <code>"saga"</code> is the only solver that supports the Elastic Net penalty.</li>
            </ul>
        </li>
    </ul>
</div>  


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">K-Nearest Neighbors</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>K-Nearest Neighbors Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_neighbors=5</code>: The number of neighbors to use for prediction. A higher value makes the model more general, while a lower value may lead to overfitting.</li>
                <li><code>weights="uniform"</code>: Determines how neighbors are weighted during prediction. <code>"uniform"</code> gives equal weight to all neighbors, while <code>"distance"</code> gives closer neighbors more influence.</li>
                <li><code>p=2</code>: The power parameter for the Minkowski distance. <code>p=2</code> corresponds to the Euclidean distance, commonly used in KNN regression.</li>
                <li><code>algorithm="auto"</code>: The algorithm used to compute nearest neighbors. <code>"auto"</code> selects the best algorithm based on the dataset (options include <code>"ball_tree"</code>, <code>"kd_tree"</code>, and <code>"brute"</code>).</li>
                <li><code>leaf_size=30</code>: The size of the leaf in tree-based algorithms like Ball Tree and KD Tree. This parameter impacts the speed and memory usage during training.</li>
            </ul>
        </li>
        <li>Distance Metrics:
            <ul>
                <li><code>metric="minkowski"</code>: The distance metric used to calculate the proximity between data points. The default is <code>"minkowski"</code>, but you can also use <code>"euclidean"</code>, <code>"manhattan"</code>, etc.</li>
                <li><code>metric_params=None</code>: Additional parameters for the distance metric, usually left as <code>None</code>.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>n_jobs=None</code>: The number of parallel jobs to run for neighbor search. Setting <code>n_jobs=-1</code> utilizes all available CPU cores for faster computation.</li>
                <li>Note: <code>radius</code>, <code>max_iter</code>, and <code>verbose</code> are not parameters of <code>KNeighborsRegressor</code>. Radius-based neighbor search is provided by the separate <code>RadiusNeighborsRegressor</code> class (<code>radius=1.0</code> by default), which considers all neighbors within a given radius instead of a fixed number and may therefore use a different number of neighbors for each prediction.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html" target="_blank">scikit-learn KNN Regressor documentation</a>.  
</div>
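
The effect of <code>p</code> and <code>weights</code> can be illustrated with a tiny library-free KNN regressor. This is a sketch under simplifying assumptions (brute-force search, no tie handling beyond exact matches), not scikit-learn's implementation. The function names and the data are hypothetical.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=1 is the Manhattan distance, p=2 the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, query, n_neighbors=5, weights="uniform", p=2):
    """Predict a target as the mean (optionally distance-weighted) of the nearest neighbors."""
    nearest = sorted(
        (minkowski_distance(x, query, p), y) for x, y in zip(X_train, y_train)
    )[:n_neighbors]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # weights == "distance": closer neighbors count more; exact matches take over completely
    exact = [y for d, y in nearest if d == 0]
    if exact:
        return sum(exact) / len(exact)
    inverse = [(1 / d, y) for d, y in nearest]
    return sum(w * y for w, y in inverse) / sum(w for w, _ in inverse)


X_example = [[0.0], [1.0], [2.0], [10.0]]  # made-up training data with one feature
y_example = [0.0, 1.0, 2.0, 10.0]
print(minkowski_distance([0, 0], [3, 4], p=2))
print(knn_predict(X_example, y_example, [1.2], n_neighbors=2, weights="uniform"))
print(knn_predict(X_example, y_example, [1.2], n_neighbors=2, weights="distance"))
```

With uniform weights the two nearest targets are simply averaged, while distance weighting pulls the prediction toward the closer neighbor.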


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>K-Nearest Neighbors Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_neighbors=5</code>: The number of neighbors to use for prediction. A higher value makes the model more general, while a lower value may lead to overfitting.</li>
                <li><code>weights="uniform"</code>: Determines how neighbors are weighted during prediction. <code>"uniform"</code> gives equal weight to all neighbors, while <code>"distance"</code> gives closer neighbors more influence.</li>
                <li><code>p=2</code>: The power parameter for the Minkowski distance. <code>p=2</code> corresponds to the Euclidean distance, commonly used in KNN classification.</li>
                <li><code>algorithm="auto"</code>: The algorithm used to compute nearest neighbors. <code>"auto"</code> selects the best algorithm based on the dataset (options include <code>"ball_tree"</code>, <code>"kd_tree"</code>, and <code>"brute"</code>).</li>
                <li><code>leaf_size=30</code>: The size of the leaf in tree-based algorithms like Ball Tree and KD Tree. This parameter impacts the speed and memory usage during training.</li>
            </ul>
        </li>
        <li>Distance Metrics:
            <ul>
                <li><code>metric="minkowski"</code>: The distance metric used to calculate the proximity between data points. The default is <code>"minkowski"</code>, but you can also use <code>"euclidean"</code>, <code>"manhattan"</code>, etc.</li>
                <li><code>metric_params=None</code>: Additional parameters for the distance metric, usually left as <code>None</code>.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>n_jobs=None</code>: The number of parallel jobs to run for neighbor search. Setting <code>n_jobs=-1</code> utilizes all available CPU cores for faster computation.</li>
                <li>Note: <code>radius</code>, <code>max_iter</code>, and <code>verbose</code> are not parameters of <code>KNeighborsClassifier</code>. Radius-based neighbor search is provided by the separate <code>RadiusNeighborsClassifier</code> class (<code>radius=1.0</code> by default), which considers all neighbors within a given radius instead of a fixed number and may therefore use a different number of neighbors for each prediction.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html" target="_blank">scikit-learn KNN Classifier documentation</a>.  
</div>  


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Support Vector Machine</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Support Vector Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>C=1.0</code>: Regularization parameter balancing error reduction and model complexity.</li>
                <li><code>epsilon=0.1</code>: Margin of tolerance for predictions without penalty.</li>
            </ul>
        </li>
        <li>Kernel Configuration:
            <ul>
                <li><code>kernel="rbf"</code>: Kernel function for mapping data to higher dimensions (default is radial basis function or <code>"rbf"</code>).</li>
                <li><code>degree=3</code>: Degree of the polynomial kernel function (ignored by the rbf kernel).</li>
                <li><code>gamma="scale"</code>: Influence range of a single training example. <code>"scale"</code> means <code>1 / (n_features * X.var())</code>.</li>
                <li><code>coef0=0.0</code>: Independent term in polynomial and sigmoid kernel function (ignored by the rbf kernel).</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>tol=1e-3</code>: Stopping criterion for optimization. If the change in the cost function is less than this tolerance, training stops.</li>
                <li><code>cache_size=200</code>: Memory (MB) allocated for kernel computation. Larger values speed up training.</li>
                <li><code>shrinking=True</code>: Enables the shrinking heuristic, which can speed up training by eliminating unnecessary steps during optimization.</li>
                <li><code>verbose=False</code>: Whether to print progress messages during training.</li>
                <li><code>max_iter=-1</code>: Maximum number of iterations during training (<code>-1</code> for no limit).</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR" target="_blank">scikit-learn SVR documentation</a>.  
</div>
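
To show what <code>gamma="scale"</code> evaluates to, the sketch below computes `1 / (n_features * X.var())` by hand, where the variance is taken over all values of the feature matrix, and plugs the result into an RBF kernel. It uses only the standard library. The function names `gamma_value` and `rbf_kernel` and the data are assumptions for illustration.

```python
import math
import statistics


def gamma_value(X, gamma="scale"):
    """Resolve gamma: "scale" = 1 / (n_features * X.var()), "auto" = 1 / n_features."""
    n_features = len(X[0])
    if gamma == "scale":
        flat = [value for row in X for value in row]
        # population variance over all entries, like NumPy's default X.var()
        return 1.0 / (n_features * statistics.pvariance(flat))
    if gamma == "auto":
        return 1.0 / n_features
    return float(gamma)


def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


X_small = [[0, 2], [4, 6]]  # made-up feature matrix with two samples and two features
gamma_scale = gamma_value(X_small, "scale")
print(gamma_scale, gamma_value(X_small, "auto"))
print(rbf_kernel(X_small[0], X_small[1], gamma_scale))
```

Because <code>"scale"</code> depends on the variance of the inputs, scaling the features before fitting changes the effective kernel width.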


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Support Vector Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li><code>C=1.0</code>: Inverse of regularization strength (smaller = stronger regularization).</li>
        <li><code>kernel="rbf"</code>: Kernel type (<code>"linear"</code>, <code>"poly"</code>, <code>"rbf"</code>, <code>"sigmoid"</code>, or callable).</li>
        <li><code>degree=3</code>: Degree of the polynomial kernel (ignored for other kernels).</li>
        <li><code>gamma="scale"</code>: Kernel coefficient (<code>"scale"</code>, <code>"auto"</code>, or float).</li>
        <li><code>coef0=0.0</code>: Independent term in kernel functions (<code>"poly"</code> and <code>"sigmoid"</code> only).</li>
        <li><code>class_weight=None</code>: Class weights; <code>"balanced"</code> adjusts for imbalance.</li>
        <li><code>max_iter=-1</code>: Maximum iterations (unlimited if <code>-1</code>).</li>
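        <li><code>probability=False</code>: Enables probability estimates; must be <code>True</code> to use <code>predict_proba</code>. Estimates are fitted with internal cross-validation, which slows training.</li>
        <li><code>random_state=None</code>: Seed for shuffling the data for probability estimates (ignored when <code>probability=False</code>).</li>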
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html" target="_blank">scikit-learn SVC documentation</a>.  
</div>  


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Decision Tree</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Decision Tree Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>max_depth=None</code>: Maximum depth of the tree. <code>None</code> allows nodes to expand until all leaves are pure or contain fewer samples than <code>min_samples_split</code>.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split an internal node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required to be at a leaf node.</li>
                <li><code>criterion="squared_error"</code>: Function to measure the quality of a split. Options are <code>"squared_error"</code> (mean squared error), <code>"friedman_mse"</code> (Friedman’s mean squared error), <code>"absolute_error"</code> (mean absolute error), and <code>"poisson"</code> (Poisson deviance).</li>
                <li><code>splitter="best"</code>: Strategy to choose the split at each node. Options are <code>"best"</code> (best split) and <code>"random"</code> (random split).</li>
                <li><code>max_features=None</code>: Number of features to consider when looking for the best split. If <code>None</code>, all features are considered.</li>
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>ccp_alpha=0.0</code>: Complexity parameter for pruning. A higher value encourages pruning by penalizing tree complexity.</li>
                <li><code>min_impurity_decrease=0.0</code>: A node will split only if the impurity decrease exceeds this threshold.</li>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes in the tree.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>min_weight_fraction_leaf=0.0</code>: Minimum weighted fraction of the sum of weights required at a leaf node.</li>
                <li><code>max_samples</code>: Not a <code>DecisionTreeRegressor</code> parameter. It belongs to bagging-based ensembles such as <code>RandomForestRegressor</code>; passing it to a standalone tree raises a <code>TypeError</code>.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>presort</code>: Pre-sorting data for faster splits was deprecated in scikit-learn 0.22 and removed in 0.24; do not pass it in current versions.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html" target="_blank">scikit-learn DecisionTreeRegressor documentation</a>.  
</div>
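
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Decision Tree Regressor</strong> <br>
    A small sketch of how the complexity and pruning parameters above limit tree growth. It compares a fully grown tree with a constrained one on synthetic data; the dataset and the specific values are assumptions chosen for illustration.
</div>

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Default settings: the tree grows until leaves cannot be split further
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Constrained tree: depth limit, minimum leaf size, and cost-complexity pruning
pruned_tree = DecisionTreeRegressor(
    max_depth=4,
    min_samples_leaf=5,
    ccp_alpha=1.0,
    random_state=0,
).fit(X, y)

n_leaves_full = full_tree.get_n_leaves()
n_leaves_pruned = pruned_tree.get_n_leaves()
print(f"Leaves without constraints: {n_leaves_full}")
print(f"Leaves with constraints:    {n_leaves_pruned}")
```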


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Decision Tree Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>max_depth=None</code>: Maximum depth of the tree. <code>None</code> allows nodes to expand until all leaves are pure or contain fewer samples than <code>min_samples_split</code>.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split an internal node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required to be at a leaf node.</li>
                <li><code>criterion="gini"</code>: Function to measure the quality of a split. Options are <code>"gini"</code> (Gini impurity) and <code>"entropy"</code> or <code>"log_loss"</code> (both Shannon information gain).</li>
                <li><code>splitter="best"</code>: Strategy to choose the split at each node. Options are <code>"best"</code> (best split) and <code>"random"</code> (random split).</li>
                <li><code>max_features=None</code>: Number of features to consider when looking for the best split. If <code>None</code>, all features are considered.</li>
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>ccp_alpha=0.0</code>: Complexity parameter for pruning. A higher value encourages pruning by penalizing tree complexity.</li>
                <li><code>min_impurity_decrease=0.0</code>: A node will split only if the impurity decrease exceeds this threshold.</li>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes in the tree.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>min_weight_fraction_leaf=0.0</code>: Minimum weighted fraction of the sum of weights required at a leaf node.</li>
                <li><code>class_weight=None</code>: Weights associated with classes. If <code>None</code>, all classes are given equal weight. Can be <code>"balanced"</code> to automatically adjust weights inversely proportional to class frequencies.</li>
                <li><code>max_samples</code>: Not a <code>DecisionTreeClassifier</code> parameter. It belongs to bagging-based ensembles such as <code>RandomForestClassifier</code>; passing it to a standalone tree raises a <code>TypeError</code>.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>presort</code>: Pre-sorting data for faster splits was deprecated in scikit-learn 0.22 and removed in 0.24; do not pass it in current versions.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" target="_blank">scikit-learn DecisionTreeClassifier documentation</a>.  
</div>  
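
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Decision Tree Classifier</strong> <br>
    A minimal sketch, assuming a synthetic imbalanced dataset, of a shallow class-weighted tree. It also uses <code>get_params()</code> to list the hyperparameters the installed scikit-learn version actually accepts, which is one way to check a parameter list against your environment.
</div>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary dataset (illustrative only)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1
)

tree_clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=3,              # keep the tree shallow and interpretable
    min_samples_leaf=5,
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=1,
)
tree_clf.fit(X, y)

# Constructor parameters accepted by the installed scikit-learn version
accepted_params = sorted(tree_clf.get_params())
depth = tree_clf.get_depth()
print(accepted_params)
print(f"Fitted depth: {depth}")
```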


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Random Forest</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Random Forest Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees in the forest.</li>
                <li><code>max_depth=None</code>: Maximum depth of each tree; <code>None</code> allows trees to grow until all leaves are pure or minimum samples are reached.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split a node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required at a leaf node.</li>
                <li><code>max_features=1.0</code>: Number of features considered for the best split; the default <code>1.0</code> uses all features. The former <code>"auto"</code> option was deprecated in scikit-learn 1.1 and removed in 1.3.</li>
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes per tree.</li>
                <li><code>min_impurity_decrease=0.0</code>: Splits a node only if it decreases impurity by this threshold.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>bootstrap=True</code>: Whether to use bootstrap samples for training each tree.</li>
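                <li><code>oob_score=False</code>: Whether to use out-of-bag samples to estimate the generalization score (R-squared by default); requires <code>bootstrap=True</code>.</li>
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel; <code>None</code> usually means 1, <code>-1</code> uses all processors.</li>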
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>verbose=0</code>: Controls the verbosity of output during training.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
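                <li><code>max_samples=None</code>: Number (int) or fraction (float) of samples drawn to train each tree when <code>bootstrap=True</code>; <code>None</code> draws as many samples as the training set contains. Useful for subsampling large datasets.</li>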
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor" target="_blank">scikit-learn RandomForestRegressor documentation</a>.  
</div>
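
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Random Forest Regressor</strong> <br>
    A short sketch of how <code>bootstrap</code>, <code>max_samples</code>, and <code>oob_score</code> work together: each tree is trained on a bootstrap subsample, and the samples it did not see provide an out-of-bag R-squared estimate. The synthetic data and the chosen values are assumptions for illustration.
</div>

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(
    n_samples=500, n_features=10, n_informative=5, noise=5.0, random_state=0
)

rf_reg = RandomForestRegressor(
    n_estimators=200,
    max_features=1.0,   # consider all features at each split
    max_samples=0.8,    # each tree sees a bootstrap sample of 80% of the rows
    bootstrap=True,     # required for max_samples and oob_score
    oob_score=True,     # score the forest on out-of-bag samples
    random_state=0,
)
rf_reg.fit(X, y)

oob_r2 = rf_reg.oob_score_  # R-squared estimated from out-of-bag predictions
print(f"Out-of-bag R-squared: {oob_r2:.3f}")
```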


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Random Forest Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees in the forest.</li>
                <li><code>max_depth=None</code>: Maximum depth of each tree; <code>None</code> allows trees to grow until all leaves are pure or minimum samples are reached.</li>
                <li><code>min_samples_split=2</code>: Minimum number of samples required to split a node.</li>
                <li><code>min_samples_leaf=1</code>: Minimum number of samples required at a leaf node.</li>
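                <li><code>max_features="sqrt"</code>: Number of features considered for the best split; the default <code>"sqrt"</code> uses the square root of the number of features. The former <code>"auto"</code> option was deprecated in scikit-learn 1.1 and removed in 1.3.</li>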
            </ul>
        </li>
        <li>Regularization:
            <ul>
                <li><code>max_leaf_nodes=None</code>: Maximum number of leaf nodes per tree.</li>
                <li><code>min_impurity_decrease=0.0</code>: Splits a node only if it decreases impurity by this threshold.</li>
                <li><code>class_weight=None</code>: Weights associated with classes. If <code>None</code>, all classes are supposed to have weight one. Use <code>"balanced"</code> to automatically adjust weights inversely proportional to class frequencies in the input data.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>bootstrap=True</code>: Whether to use bootstrap samples for training each tree.</li>
                <li><code>oob_score=False</code>: Whether to use out-of-bag samples to estimate prediction accuracy.</li>
                <li><code>n_jobs=None</code>: Number of jobs to run in parallel; <code>None</code> usually means 1, <code>-1</code> uses all processors.</li>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>verbose=0</code>: Controls the verbosity of output during training.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
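                <li><code>max_samples=None</code>: Number (int) or fraction (float) of samples drawn to train each tree when <code>bootstrap=True</code>; <code>None</code> draws as many samples as the training set contains. Useful for subsampling large datasets.</li>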
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier" target="_blank">scikit-learn RandomForestClassifier documentation</a>.  
</div>  
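
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Random Forest Classifier</strong> <br>
    A minimal sketch, assuming a synthetic imbalanced dataset, that combines <code>max_features</code>, <code>class_weight</code>, and <code>oob_score</code>, then reads the impurity-based feature importances. All values are illustrative.
</div>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced binary dataset (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

rf_clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",      # sqrt(n_features) candidate features per split
    class_weight="balanced",  # reweight classes inversely to their frequency
    oob_score=True,           # out-of-bag accuracy estimate
    random_state=0,
)
rf_clf.fit(X, y)

oob_accuracy = rf_clf.oob_score_
importances = rf_clf.feature_importances_  # normalized to sum to 1
print(f"Out-of-bag accuracy: {oob_accuracy:.3f}")
```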


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">Multi-Layer Perceptron</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Multi-Layer Perceptron Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Architecture:
            <ul>
                <li><code>hidden_layer_sizes=(100,)</code>: Defines the size and number of hidden layers; <code>(100,)</code> indicates one layer with 100 neurons.</li>
                <li><code>activation="relu"</code>: Activation function for the hidden layers; options include <code>"relu"</code>, <code>"tanh"</code>, <code>"logistic"</code>, or <code>"identity"</code>.</li>
                <li><code>solver="adam"</code>: Optimization algorithm; options are <code>"adam"</code> (default), <code>"lbfgs"</code>, or <code>"sgd"</code>.</li>
            </ul>
        </li>
        <li>Regularization and Learning:
            <ul>
                <li><code>alpha=0.0001</code>: L2 regularization term to prevent overfitting.</li>
                <li><code>learning_rate="constant"</code>: Learning rate schedule for weight updates (only used when <code>solver="sgd"</code>); options are <code>"constant"</code>, <code>"invscaling"</code>, or <code>"adaptive"</code>.</li>
                <li><code>learning_rate_init=0.001</code>: Initial learning rate for weight updates.</li>
                <li><code>power_t=0.5</code>: Exponent for inverse scaling of learning rate (used when <code>learning_rate="invscaling"</code>).</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>max_iter=200</code>: Maximum number of iterations for training.</li>
                <li><code>tol=1e-4</code>: Tolerance for stopping criteria; training stops if loss improvement is below this value.</li>
                <li><code>momentum=0.9</code>: Momentum parameter for gradient descent updates (used when <code>solver="sgd"</code>).</li>
                <li><code>n_iter_no_change=10</code>: Number of iterations with no improvement to stop early.</li>
                <li><code>early_stopping=False</code>: Enables early stopping when validation score doesn’t improve.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>batch_size="auto"</code>: Number of samples per batch for training; <code>"auto"</code> uses <code>min(200, n_samples)</code>.</li>
                <li><code>shuffle=True</code>: Whether to shuffle training data before each epoch.</li>
                <li><code>random_state=None</code>: Random seed for reproducibility.</li>
                <li><code>verbose=False</code>: Controls verbosity of output during training.</li>
                <li><code>warm_start=False</code>: Reuses previous solution to initialize weights for additional fitting.</li>
                <li><code>beta_1=0.9</code>, <code>beta_2=0.999</code>: Exponential decay rates for moving averages of gradients and squared gradients (used in <code>solver="adam"</code>).</li>
                <li><code>epsilon=1e-8</code>: Small value to prevent division by zero in <code>solver="adam"</code>.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html" target="_blank">scikit-learn MLPRegressor documentation</a>.  
</div>
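
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Multi-Layer Perceptron Regressor</strong> <br>
    A sketch of how <code>early_stopping</code>, <code>validation_fraction</code>, and <code>n_iter_no_change</code> interact: a share of the training data is held out internally, and training stops once the validation score has not improved for the given number of epochs. The synthetic data, the scaling step, and the layer sizes are assumptions for illustration.
</div>

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

mlp_reg = make_pipeline(
    StandardScaler(),  # neural networks train more reliably on scaled inputs
    MLPRegressor(
        hidden_layer_sizes=(64, 32),  # two hidden layers
        solver="adam",
        alpha=0.0001,
        early_stopping=True,          # hold out data and monitor its score
        validation_fraction=0.1,      # 10% of the training data held out
        n_iter_no_change=10,          # patience in epochs
        max_iter=500,
        random_state=0,
    ),
)
mlp_reg.fit(X, y)

fitted_mlp = mlp_reg.named_steps["mlpregressor"]
n_epochs = fitted_mlp.n_iter_  # epochs actually run before stopping
print(f"Epochs run: {n_epochs}")
```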


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Multi-Layer Perceptron Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Architecture:
            <ul>
                <li><code>hidden_layer_sizes=(100,)</code>: Number and size of hidden layers; e.g., <code>(100,)</code> = 1 layer with 100 neurons.</li>
                <li><code>activation="relu"</code>: Activation function; options: <code>"relu"</code>, <code>"tanh"</code>, <code>"logistic"</code>, <code>"identity"</code>.</li>
            </ul>
        </li>
        <li>Optimization:
            <ul>
                <li><code>solver="adam"</code>: Optimization algorithm; options: <code>"adam"</code>, <code>"lbfgs"</code>, <code>"sgd"</code>.</li>
                <li><code>alpha=0.0001</code>: L2 regularization to reduce overfitting.</li>
                <li><code>learning_rate="constant"</code>: Learning rate schedule (only used with <code>solver="sgd"</code>); options: <code>"constant"</code>, <code>"invscaling"</code>, <code>"adaptive"</code>.</li>
                <li><code>learning_rate_init=0.001</code>: Initial learning rate.</li>
                <li><code>power_t=0.5</code>: Used for <code>"invscaling"</code> learning rate.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>max_iter=200</code>: Maximum training iterations.</li>
                <li><code>tol=1e-4</code>: Stopping criterion for improvement tolerance.</li>
                <li><code>momentum=0.9</code>: Momentum for SGD (used with <code>solver="sgd"</code>).</li>
                <li><code>early_stopping=False</code>: Stop early if no improvement on validation set.</li>
                <li><code>n_iter_no_change=10</code>: Iterations without improvement to stop training.</li>
            </ul>
        </li>
        <li>Performance:
            <ul>
                <li><code>batch_size="auto"</code>: Batch size; <code>"auto"</code> = <code>min(200, n_samples)</code>.</li>
                <li><code>shuffle=True</code>: Shuffle training data every epoch.</li>
                <li><code>random_state=None</code>: Seed for reproducibility.</li>
                <li><code>verbose=False</code>: Control output verbosity.</li>
                <li><code>warm_start=False</code>: Retain model state for further training.</li>
            </ul>
        </li>
        <li>Classification-Specific:
            <ul>
                <li><code>validation_fraction=0.1</code>: Fraction of data for validation (used with <code>early_stopping=True</code>).</li>
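                <li><code>out_activation_</code>: Not a hyperparameter but a read-only attribute set during fitting; it is <code>"softmax"</code> for multi-class targets and <code>"logistic"</code> for binary or multi-label targets.</li>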
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html" target="_blank">scikit-learn MLPClassifier documentation</a>.  
</div>
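
<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>Example: Multi-Layer Perceptron Classifier</strong> <br>
    A brief sketch showing that the output activation is chosen by scikit-learn from the target type and cannot be passed to the constructor. The two synthetic datasets and the short training budget are assumptions for illustration; a convergence warning may appear because <code>max_iter</code> is kept small.
</div>

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary and three-class datasets (illustrative only)
X_bin, y_bin = make_classification(n_samples=200, n_features=6, random_state=0)
X_multi, y_multi = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

binary_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=0)
binary_clf.fit(X_bin, y_bin)

multi_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=0)
multi_clf.fit(X_multi, y_multi)

# out_activation_ exists only after fitting and depends on the target type
print(binary_clf.out_activation_)  # "logistic" for binary targets
print(multi_clf.out_activation_)   # "softmax" for multi-class targets
```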


<div style="background-color:#4e8ac8; color:white; padding:10px; border-radius:6px;">
    <h3 style="margin:0px">XGBoost</h3>
</div>

<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>XGBoost Regressor</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees.</li>
                <li><code>max_depth=6</code>: Maximum depth of each tree.</li>
                <li><code>learning_rate=0.3</code>: Step size shrinkage applied to each tree's contribution to prevent overfitting.</li>
                <li><code>subsample=1.0</code>: Fraction of training samples used for each tree.</li>
                <li><code>colsample_bytree=1.0</code>: Fraction of features used for each tree.</li>
                <li><code>colsample_bylevel=1.0</code>: Fraction of features used at each tree level.</li>
                <li><code>colsample_bynode=1.0</code>: Fraction of features used at each node.</li>
            </ul>
        </li>
        <li>Regularization and Learning:
            <ul>
                <li><code>gamma=0</code>: Minimum loss reduction required to make a further partition on a leaf node.</li>
                <li><code>min_child_weight=1</code>: Minimum sum of instance weight (hessian) in a child.</li>
                <li><code>scale_pos_weight=1</code>: Controls the balance of positive and negative weights; relevant to imbalanced binary classification and normally left at <code>1</code> for regression.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>objective="reg:squarederror"</code>: Objective function for regression; default is for squared error.</li>
                <li><code>booster="gbtree"</code>: Booster type; options include <code>"gbtree"</code> (default), <code>"gblinear"</code>, and <code>"dart"</code>.</li>
                <li><code>tree_method="auto"</code>: Tree construction algorithm. Options include <code>"exact"</code>, <code>"approx"</code>, and <code>"hist"</code>; since XGBoost 2.0, <code>"auto"</code> is equivalent to <code>"hist"</code>.</li>
                <li><code>eval_metric="rmse"</code>: Metric used for validation during training; default is root mean square error (<code>rmse</code>).</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>early_stopping_rounds=None</code>: Stops training if validation metric does not improve after specified rounds.</li>
                <li><code>n_jobs=None</code>: Number of threads used for parallel computation; when left unset, recent XGBoost versions use all available threads.</li>
                <li><code>random_state=None</code>: Seed for reproducibility.</li>
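                <li><code>verbosity=1</code>: Verbosity level of printed messages; <code>0</code> (silent), <code>1</code> (warning), <code>2</code> (info), <code>3</code> (debug). The separate <code>verbose</code> argument of <code>fit()</code> controls evaluation logging.</li>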
            </ul>
        </li>
        <li>Advanced Parameters:
            <ul>
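                <li><code>reg_lambda=1</code>: L2 regularization term on weights (called <code>lambda</code> in the native API, which is a reserved word in Python).</li>
                <li><code>reg_alpha=0</code>: L1 regularization term on weights (called <code>alpha</code> in the native API).</li>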
                <li><code>max_delta_step=0</code>: Used to help with convergence in highly imbalanced datasets.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.  
</div>


<div style="background-color:#e8f4fd; padding:15px; border:3px solid #d0e7fa; border-radius:6px;">
    <strong>XGBoost Classifier</strong> <br>
    ℹ️ Hyperparameters and Default Values:
    <ul>
        <li>Model Complexity:
            <ul>
                <li><code>n_estimators=100</code>: Number of trees (boosting rounds).</li>
                <li><code>max_depth=6</code>: Maximum depth of each tree.</li>
                <li><code>learning_rate=0.3</code>: Step size shrinkage to prevent overfitting.</li>
                <li><code>subsample=1.0</code>: Fraction of training samples used per tree.</li>
                <li><code>colsample_bytree=1.0</code>: Fraction of features used per tree.</li>
                <li><code>colsample_bylevel=1.0</code>: Fraction of features used per tree level.</li>
                <li><code>colsample_bynode=1.0</code>: Fraction of features used per split node.</li>
            </ul>
        </li>
        <li>Regularization and Learning:
            <ul>
                <li><code>gamma=0</code>: Minimum loss reduction required to split a leaf node.</li>
                <li><code>min_child_weight=1</code>: Minimum sum of instance weights (hessian) in a child.</li>
                <li><code>scale_pos_weight=1</code>: Balances positive and negative class weights for imbalanced datasets.</li>
            </ul>
        </li>
        <li>Training Behavior:
            <ul>
                <li><code>objective="binary:logistic"</code>: Objective function for binary classification; alternatives include <code>"multi:softmax"</code> or <code>"multi:softprob"</code>.</li>
                <li><code>booster="gbtree"</code>: Booster type; options include <code>"gbtree"</code> (default), <code>"gblinear"</code>, and <code>"dart"</code>.</li>
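                <li><code>tree_method="auto"</code>: Tree construction algorithm; options: <code>"exact"</code>, <code>"approx"</code>, <code>"hist"</code>. Since XGBoost 2.0, <code>"gpu_hist"</code> is deprecated in favor of <code>tree_method="hist"</code> with <code>device="cuda"</code>.</li>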
                <li><code>eval_metric="logloss"</code>: Default evaluation metric for binary classification. Options include <code>"error"</code>, <code>"auc"</code>, and others.</li>
            </ul>
        </li>
        <li>Performance Optimization:
            <ul>
                <li><code>early_stopping_rounds=None</code>: Stops training if validation metric does not improve after specified rounds.</li>
                <li><code>n_jobs=None</code>: Number of threads for parallel computation; when left unset, recent XGBoost versions use all available threads.</li>
                <li><code>random_state=None</code>: Seed for reproducibility.</li>
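                <li><code>verbosity=1</code>: Verbosity level; <code>0</code> (silent), <code>1</code> (warning), <code>2</code> (info), <code>3</code> (debug). The separate <code>verbose</code> argument of <code>fit()</code> controls evaluation logging.</li>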
            </ul>
        </li>
        <li>Advanced Parameters:
            <ul>
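                <li><code>reg_lambda=1</code>: L2 regularization term on weights (called <code>lambda</code> in the native API, which is a reserved word in Python).</li>
                <li><code>reg_alpha=0</code>: L1 regularization term on weights (called <code>alpha</code> in the native API).</li>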
                <li><code>max_delta_step=0</code>: Helps with convergence in imbalanced datasets.</li>
            </ul>
        </li>
    </ul>
    For more details, refer to the official <a href="https://xgboost.readthedocs.io/en/latest/parameter.html" target="_blank">XGBoost documentation</a>.
</div>