# Solution Overview

## The Environment

- The Jupyter notebook runs in a Python 3.10.14 environment on a CPU runtime. A full run takes less than 15 hours (excluding hyperparameter search and feature selection).

- The operating system on which the notebook ran is Linux-5.15.154+-x86_64-with-glibc2.31.

- RAM usage stayed below 30 GB throughout the run of the notebook.

## The Code

- This notebook takes as input only the data files provided in the competition resources.

- At the end of the run, it outputs `scv_catselect_2xtsf.parquet` as the submission file.

- The methodology predicts building features in two stages, as required by the hierarchical nature of the task. First, an ensemble of 5 LightGBM classifiers (each trained on a different fold of the training data) identifies the building stock type. Then, depending on the identified stock type, the additional characteristics of each building are predicted by a set of LightGBM classifier ensembles (5 classifiers per ensemble).

- The General Overview of the Notebook is as follows:

   1. Importing relevant methods from the following libraries:

      - scipy==1.14.1
      - numpy==1.26.4
      - pandas==2.2.2
      - lightgbm==4.2.0
      - scikit-learn==1.2.2
      - category-encoders==2.6.3
      - tsfresh==0.20.3

   2. Setting Paths

      - Indicating the location of the input files to be used in the notebook.

      - These paths are stored in the following variables:
          - `TEST_DIR` : path to the folder containing parquet files of the test data buildings.
          - `TRAIN_DIR` : path to the folder containing parquet files of the train data buildings.
          - `SS_FILE` : path to the parquet file showing how a submission should be formatted.
          - `LABEL_FILE` : path to the parquet file containing attributes (metadata) of each building in the train dataset.

   3. Data Preprocessing and Feature Engineering

      - Data is read, preprocessed and features engineered in four different phases.

      - Phase 1: This phase extracts statistics of electricity consumption based on datetime features, including the hour, day, and month of peak consumption. Statistics such as the difference between average weekend and average weekday consumption are calculated, in addition to rolling window statistics.

      - Phase 2: An automated feature engineering process that applies tsfresh extraction to a quarter of each training building's time series (the full data was not used because of resource constraints).

      - Phase 3: This is the second manual feature engineering process that creates a set of 96 features each to represent consumption during weekends and consumption during weekdays. The aggregation is achieved by averaging consumption across all the 96 15-minute time blocks for each day (weekend days in one set and weekdays in the other set). This phase also calculates additional statistics based on these aggregates such as the number of outliers per set of 15-minute consumption values.

      - Phase 4: This second automated feature engineering process applies the tsfresh extraction process to the two sets of timeseries data created in Phase 3 for each building.

   4. Tackling Hierarchy 1 - Building Stock Type Classification

      - A preprocessing pipeline is created to encode non-numeric features using a `TargetEncoder` object, followed by feature scaling with `MinMaxScaler`.

      - Feature selection using the `select_features` method of `CatBoostClassifier` with the `RecursiveByShapValues` algorithm. This feature selection is omitted from this notebook, but its results are applied to reduce the large feature dimension and improve model performance.

      - Hyperparameter optimisation using `optuna` to select the best hyperparameters of the `LGBMClassifier` that predicts building stock type. This step is also omitted from this notebook because it is time-consuming.

      - Fitting 5 `LGBMClassifier` models, one on each of 5 folds of the training data. These 5 models are ensembled during inference by taking the mode of their individual predictions.

   5. Tackling Hierarchy 2 - Commercial Building Models

      - The preprocessed data with engineered features is filtered to create a dataframe of commercial buildings only.

      - This filtered dataframe is manipulated such that individual characteristics of commercial buildings to be predicted are transformed from being in individual target columns into a single target column along with a `target_type` column indicating the kind of characteristic being predicted. This is done using the `create_combo_task` function.

      - A dictionary is created to store the indices of the features relevant for predicting each kind of characteristic (`target_type`). These indices are obtained from a feature selection process using the `select_features` method of `CatBoostClassifier` with the `RecursiveByShapValues` algorithm. This feature selection is omitted from this notebook, but its results are applied to reduce the large feature dimension and improve model performance.

      - A dictionary is created to store the optimised hyperparameters of the `LGBMClassifier` model for predicting the values of each kind of characteristic (`target_type`). These hyperparameter values are obtained from an optuna hyperparameter search process which has been omitted from this notebook because it is time-intensive.

      - Finally we iterate through each commercial building characteristic's data, preprocess it, select relevant features, and fit one `LGBMClassifier` model on each of 5 folds of the data.

      - The model performance is evaluated and all the relevant objects are stored to be used in predicting on final test data.

   6. Tackling Hierarchy 2 - Residential Building Models

      - The preprocessed data with engineered features is filtered to create a dataframe of residential buildings only.

      - This filtered dataframe is manipulated such that individual characteristics of residential buildings to be predicted are transformed from being in individual target columns into a single target column along with a `target_type` column indicating the kind of characteristic whose values are being predicted. This is done using the `create_combo_task` function.

      - A dictionary is created to store the indices of the features relevant for predicting each kind of characteristic (`target_type`). These indices are obtained from a feature selection process using the `select_features` method of `CatBoostClassifier` with the `RecursiveByShapValues` algorithm. This feature selection is omitted from this notebook, but its results are applied to reduce the large feature dimension and improve model performance.

      - A dictionary is created to store the optimised hyperparameters of the `LGBMClassifier` model for predicting the values of each kind of characteristic (`target_type`). These hyperparameter values are obtained from an optuna hyperparameter search process which has been omitted from this notebook because it is time-intensive.

      - Finally we iterate through each residential building characteristic's data, preprocess it, select relevant features, and fit one `LGBMClassifier` model on each of 5 folds of the data.

      - The model performance is evaluated and all the relevant objects are stored to be used in predicting on final test data.

   7. Predicting on Testset and Preparing Submission

      - The data for the test set of buildings is read, preprocessed, and feature-engineered in the same 4 phases as the training data.

      - Each of the 5 models developed for classifying building stock type predicts the stock type of every building in the test set, and the modal prediction is adopted as the final prediction.

      - Depending on its predicted stock type, each building is passed through the set of ensemble models that predict the additional characteristics of either residential or commercial buildings.

      - The prediction dataframes of the residential and commercial buildings are combined and reshaped to fit the format of the sample submission file.

      - The formatted prediction dataframe is exported to the `scv_catselect_2xtsf.parquet` parquet file for submission.
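
The two-stage inference described in step 7 can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `majority_vote` is a hypothetical helper, and the hard-coded label arrays stand in for the predictions of the five fold models.

```python
import numpy as np
import pandas as pd


def majority_vote(fold_predictions: np.ndarray) -> np.ndarray:
    """Return the modal label per sample for an array of shape (n_models, n_samples)."""
    votes = pd.DataFrame(fold_predictions)
    # DataFrame.mode works column-wise; row 0 holds the most frequent label
    # (ties resolve to the first label in sort order).
    return votes.mode(axis=0).iloc[0].to_numpy()


# Stand-in predictions from 5 fold models for 4 buildings (made-up values).
stock_type_preds = np.array([
    ["commercial", "residential", "residential", "commercial"],
    ["commercial", "residential", "commercial", "commercial"],
    ["residential", "residential", "residential", "commercial"],
    ["commercial", "residential", "residential", "residential"],
    ["commercial", "commercial", "residential", "commercial"],
])

final_stock_type = majority_vote(stock_type_preds)

# Stage 2 routing: each building goes to the ensembles matching its predicted stock type.
is_commercial = final_stock_type == "commercial"
commercial_idx = np.flatnonzero(is_commercial)
residential_idx = np.flatnonzero(~is_commercial)
```

With an odd number of models and two stock types, the vote cannot tie; for multi-class characteristics in stage 2 a tie-breaking rule would need to be chosen explicitly.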


# Importing Relevant Modules and Specifying Directories

In [1]:
import os
from multiprocessing import Pool
from functools import partial

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import f1_score

from category_encoders import TargetEncoder
from lightgbm import LGBMClassifier
from scipy.stats import skew, kurtosis
from tsfresh.feature_extraction import extract_features, EfficientFCParameters

In [2]:
TEST_DIR = ""
TRAIN_DIR = ""
SS_FILE = ""
LABEL_FILE = ""

In [3]:
def list_files_recursive(root_dir: str, n: int = 5) -> list[str]:
    """
    Recursively lists paths to parquet files within a given directory.
    
    Parameters
    ----------
    root_dir : str
        The path to the root directory where the search for parquet files will start.
        
    n : int, optional
        The number of directory entries inspected in each directory. Default is 5.
    
    Returns
    -------
    all_files : list[str]
        A list of parquet file paths found within the 'root_dir' directory.
        
    Notes
    -----
    The function traverses all subdirectories of 'root_dir'. The limit 'n' is applied
    per directory to the first 'n' entries (parquet or not), so the result can hold
    more than 'n' paths when several subdirectories contain parquet files.
    """
    all_files = []

    for dirpath, _, filenames in os.walk(root_dir):
        for ind, filename in enumerate(filenames):
            if ind < n and filename.endswith('.parquet'):
                all_files.append(os.path.join(dirpath, filename))
            elif ind >= n:
                break

    return all_files
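
Note that the loop above applies `n` separately in every directory and counts non-parquet entries toward it. If a single global cap on parquet files is wanted instead, one possible variant is sketched below. `list_parquet_files` and the folder names in the demonstration are hypothetical and not part of the notebook.

```python
import tempfile
from itertools import islice
from pathlib import Path


def list_parquet_files(root_dir: str, n: int = 5) -> list[str]:
    """Return at most `n` parquet paths under `root_dir`, searched recursively."""
    # sorted() makes the selection deterministic across file systems.
    matches = sorted(Path(root_dir).rglob("*.parquet"))
    return [str(path) for path in islice(matches, n)]


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("commercial", "residential"):
        folder = Path(tmp) / sub
        folder.mkdir()
        for i in range(3):
            (folder / f"{i}.parquet").touch()
        (folder / "notes.txt").touch()  # ignored: not a parquet file

    found = list_parquet_files(tmp, n=4)
    found_names = [Path(p).relative_to(tmp).as_posix() for p in found]
```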

# Data Preprocessing and Feature Engineering

## Manual Feature Engineering - Datetime and General Statistics

In [4]:
def extract_datetime_features(
    df, datetime_column="timestamp", consumption_column="cons"
):
    """
    Extracts various time-based features from a datetime column, including:
    year, month, day, day of the week, is_weekend, quarter, hour, and cyclic time features
    (sin and cos transformations of time attributes). Additionally, it generates lag and
    rolling window features for the specified consumption column.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing the datetime and consumption columns.

    datetime_column : str, optional
        The name of the datetime column from which features are extracted, by default 'timestamp'.

    consumption_column : str, optional
        The name of the column representing consumption values, used for lag and rolling window 
        features, by default 'cons'.

    Returns
    -------
    pd.DataFrame
        A DataFrame with the original columns and new engineered features based on 
        the datetime column.

    Notes
    -----
    - Cyclic transformations (sin, cos) are applied to month, day of the week, and hour to handle 
    their circular nature.
    - Lags and rolling window features are generated for a specified consumption column.
    """

    # Ensure the datetime column is in datetime format
    df[datetime_column] = pd.to_datetime(df[datetime_column])

    # Create a DataFrame to store the engineered features
    features_df = pd.DataFrame(index=df.index)

    # Extract string prefix from datetime column name for feature naming
    prefix = datetime_column[:3]

    # Date features
    features_df[f"{prefix}_year"] = df[datetime_column].dt.year
    features_df[f"{prefix}_month"] = df[datetime_column].dt.month
    features_df[f"{prefix}_day"] = df[datetime_column].dt.day
    features_df[f"{prefix}_day_of_week"] = df[datetime_column].dt.dayofweek
    features_df[f"{prefix}_is_weekend"] = (df[datetime_column].dt.dayofweek > 4).astype(
        int
    )
    features_df[f"{prefix}_quarter"] = df[datetime_column].dt.quarter
    features_df[f"{prefix}_week_of_year"] = df[datetime_column].dt.isocalendar().week
    features_df[f"{prefix}_is_month_start"] = df[
        datetime_column
    ].dt.is_month_start.astype(int)
    features_df[f"{prefix}_is_month_end"] = df[datetime_column].dt.is_month_end.astype(
        int
    )
    features_df[f"{prefix}_is_quarter_start"] = df[
        datetime_column
    ].dt.is_quarter_start.astype(int)
    features_df[f"{prefix}_is_quarter_end"] = df[
        datetime_column
    ].dt.is_quarter_end.astype(int)
    features_df[f"{prefix}_is_year_start"] = df[
        datetime_column
    ].dt.is_year_start.astype(int)
    features_df[f"{prefix}_is_year_end"] = df[datetime_column].dt.is_year_end.astype(
        int
    )
    features_df[f"{prefix}_days_in_month"] = df[
        datetime_column
    ].dt.days_in_month.astype(int)

    # Time features
    features_df[f"{prefix}_hour"] = df[datetime_column].dt.hour
    features_df[f"{prefix}_minute"] = df[datetime_column].dt.minute
    features_df[f"{prefix}_second"] = df[datetime_column].dt.second
    features_df[f"{prefix}_is_morning"] = (
        (df[datetime_column].dt.hour >= 6) & (df[datetime_column].dt.hour < 12)
    ).astype(int)
    features_df[f"{prefix}_is_afternoon"] = (
        (df[datetime_column].dt.hour >= 12) & (df[datetime_column].dt.hour < 18)
    ).astype(int)
    features_df[f"{prefix}_is_evening"] = (
        (df[datetime_column].dt.hour >= 18) & (df[datetime_column].dt.hour < 24)
    ).astype(int)
    features_df[f"{prefix}_is_night"] = (
        (df[datetime_column].dt.hour >= 0) & (df[datetime_column].dt.hour < 6)
    ).astype(int)

    # Cyclic time features
    features_df[f"{prefix}_month_sin"] = np.sin(
        2 * np.pi * features_df[f"{prefix}_month"] / 12
    )
    features_df[f"{prefix}_month_cos"] = np.cos(
        2 * np.pi * features_df[f"{prefix}_month"] / 12
    )
    features_df[f"{prefix}_day_of_week_sin"] = np.sin(
        2 * np.pi * features_df[f"{prefix}_day_of_week"] / 7
    )
    features_df[f"{prefix}_day_of_week_cos"] = np.cos(
        2 * np.pi * features_df[f"{prefix}_day_of_week"] / 7
    )
    features_df[f"{prefix}_hour_sin"] = np.sin(
        2 * np.pi * features_df[f"{prefix}_hour"] / 24
    )
    features_df[f"{prefix}_hour_cos"] = np.cos(
        2 * np.pi * features_df[f"{prefix}_hour"] / 24
    )

    # Lag features: shift by 1 to 7 rows (consecutive readings, not days)
    for lag in range(1, 8):
        features_df[f"{prefix}_lag_{lag}"] = df[consumption_column].shift(lag)

    # Rolling window features over 7 consecutive rows (readings, not days)
    window_size = 7
    features_df[f"{prefix}_rolling_mean_{window_size}"] = (
        df[consumption_column].rolling(window=window_size).mean()
    )
    features_df[f"{prefix}_rolling_std_{window_size}"] = (
        df[consumption_column].rolling(window=window_size).std()
    )
    features_df[f"{prefix}_rolling_min_{window_size}"] = (
        df[consumption_column].rolling(window=window_size).min()
    )
    features_df[f"{prefix}_rolling_max_{window_size}"] = (
        df[consumption_column].rolling(window=window_size).max()
    )

    # Concatenate the new features with the original dataframe (excluding the datetime column)
    result_df = pd.concat([features_df, df.drop([datetime_column], axis=1)],
                           axis=1)

    # Set the datetime column as the index
    result_df = result_df.set_index(df[datetime_column])

    return result_df
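
The reason for the sin/cos pairs can be seen in a small sketch. `encode_cyclic` is a hypothetical helper that mirrors the transformation used above for the hour of day; it shows that 23:00 and 00:00 end up close together in the encoded space even though their raw values differ by 23.

```python
import numpy as np


def encode_cyclic(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


# Hours 23, 0 and 12 on a 24-hour clock.
points = encode_cyclic([23, 0, 12], period=24)

dist_23_to_0 = float(np.linalg.norm(points[0] - points[1]))  # neighbours across midnight
dist_12_to_0 = float(np.linalg.norm(points[2] - points[1]))  # opposite sides of the clock
```

Both components are needed: the sine alone cannot distinguish, for example, 01:00 from 11:00.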


In [5]:
def quantile_25(growth_vals: pd.Series) -> float:
    """
    Calculate the 1st quartile (25th percentile) of a series of values.

    Parameters
    ----------
    growth_vals : pd.Series
        A pandas Series containing numerical values for which the 1st quartile 
        (25th percentile) is to be calculated.

    Returns
    -------
    float
        The 25th percentile (1st quartile) of the input values.
    """
    q25 = growth_vals.quantile(0.25)
    return q25


def quantile_75(growth_vals: pd.Series) -> float:
    """
    Calculate the 3rd quartile (75th percentile) of a series of values.

    Parameters
    ----------
    growth_vals : pd.Series
        A pandas Series containing numerical values for which the 3rd quartile 
        (75th percentile) is to be calculated.

    Returns
    -------
    float
        The 75th percentile (3rd quartile) of the input values.
    """
    q75 = growth_vals.quantile(0.75)
    return q75


In [6]:
def get_flat(df, reference, prefix):
    """
    Flatten a pandas DataFrame with multi-level columns into a dictionary whose keys are
    generated from the column names and index values, prefixed as specified.

    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame with multi-level columns and a single-level index.
    reference : str
        A string used to filter out dictionary keys that start with this reference.
    prefix : str
        A string to prepend to each key in the resulting dictionary.

    Returns
    -------
    dict
        A dictionary where keys are of the form 'prefix_columnname_index' and values are the
        corresponding DataFrame values. Keys that start with the `reference` string are excluded
        from the output.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     ('A', 'X'): [1, 2],
    ...     ('A', 'Y'): [3, 4],
    ...     ('B', 'X'): [5, 6],
    ... }, index=['row1', 'row2'])
    >>> get_flat(df, 'A', 'prefix')
    {'prefix_B_X_row1': 5, 'prefix_B_X_row2': 6}

    Notes
    -----
    The function is designed to work with DataFrames that have multi-level (hierarchical) 
    column labels and a single-level index. It constructs new dictionary keys by concatenating 
    the column names and index, and excludes any keys that begin with the `reference` string.
    """

    # Set the index name for the DataFrame to 'cons'
    df.index.name = "cons"

    # Initialize an empty dictionary to store flattened key-value pairs
    flat_dict = {}

    # Iterate over each index and column of the DataFrame
    for idx in df.index:
        for col in df.columns:
            # Flatten the multi-level column names into a single string
            col_name = "_".join([str(x) for x in col if x])

            # Generate the key by combining the column name and index
            key = f"{col_name}_{idx}"

            # Add the value from the DataFrame to the flat_dict
            flat_dict[key] = df.at[idx, col]

    # Add prefix to the keys and filter out keys that start with the reference string
    flat_dict = {
        f"{prefix}_{k}": v for k, v in flat_dict.items() if not k.startswith(reference)
    }

    return flat_dict
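
To make the key format concrete, the sketch below builds the kind of pivot table that `process_file` later passes to `get_flat` and applies the same flattening rule inline. The toy data and the reduced list of aggregations are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the per-building frame: a 0/1 flag and a consumption column.
toy = pd.DataFrame({
    "tim_is_weekend": [0, 0, 1, 1],
    "cons": [1.0, 3.0, 10.0, 20.0],
})

pivot = pd.pivot_table(
    data=toy, values="cons", index="tim_is_weekend", aggfunc=["mean", "max"]
).reset_index()

# Same rule as get_flat: join the non-empty column levels, append the row label,
# skip the grouping column itself, then add the prefix.
flat = {}
for idx in pivot.index:
    for col in pivot.columns:
        name = "_".join(str(level) for level in col if level)
        if not name.startswith("tim_is_weekend"):
            flat[f"weekend_{name}_{idx}"] = float(pivot.at[idx, col])
```

After `reset_index()` the row labels are 0 and 1, so the `_0` suffix marks the rows where the flag is 0 (here, weekdays) and `_1` the rows where it is 1, provided both groups are present in the data.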


In [7]:
# reading data of building attributes and metadata
labels = pd.read_parquet(LABEL_FILE)

In [8]:
def process_file(file):
    """
    Process a parquet file containing building energy consumption data and extract various
    statistical and time-based features for analysis.

    Parameters
    ----------
    file : str
        Path to the parquet file containing building energy consumption data.

    Returns
    -------
    dict
        A dictionary containing various features such as consumption statistics, peak periods,
        seasonal means, and advanced rolling statistics.

    Notes
    -----
    - The function reads the parquet file into a DataFrame and renames the main electricity
      consumption column to 'cons'.
    - Various statistics like mean, standard deviation, range, variance, etc. are calculated.
    - Peak consumption periods (hour, day, month) are identified and features are extracted
      based on these periods.
    - Additional rolling statistics and pivot tables are computed for different time periods.
    - The function returns a dictionary containing all the calculated features.
    """

    # Read the parquet file
    df = pd.read_parquet(file)

    # Rename the main electricity consumption column to 'cons'
    value = "out.electricity.total.energy_consumption"
    df = df.rename(columns={value: "cons"})
    value = "cons"

    # Extract building ID and calculate consumption statistics
    build_id = df.first_valid_index()  # building id
    cons_std = df[value].std()
    cons_mean = df[value].mean()
    cons_min = df[value].min()
    cons_max = df[value].max()
    cons_range = cons_max - cons_min
    cons_var = df[value].var()
    peak_to_avg_ratio = cons_max / cons_mean

    # Extract datetime features
    df_datetime = extract_datetime_features(df)
    df_datetime.index = pd.to_datetime(df_datetime.index)

    # Identify peak consumption hours, days, and months
    peak_hours = df_datetime[df_datetime[value] == cons_max].index.hour
    peak_hour = peak_hours[0] if len(peak_hours) > 0 else None
    peak_hourcount = len(peak_hours)

    peak_days = df_datetime[df_datetime[value] == cons_max].index.dayofweek
    peak_day = peak_days[0] if len(peak_days) > 0 else None
    peak_daycount = len(peak_days)

    peak_months = df_datetime[df_datetime[value] == cons_max].index.month
    peak_month = peak_months[0] if len(peak_months) > 0 else None
    peak_monthcount = len(peak_months)

    # Determine if the peak day is a weekend (1 for weekend, 0 for weekday)
    peak_is_weekend = 1 if peak_day > 4 else 0

    # Classify the peak consumption time into morning, afternoon, evening, or night
    if (peak_hour >= 6) and (peak_hour < 12):
        peak_timeofday = "m"
    elif (peak_hour >= 12) and (peak_hour < 18):
        peak_timeofday = "a"
    elif (peak_hour >= 18) and (peak_hour < 24):
        peak_timeofday = "e"
    elif (peak_hour >= 0) and (peak_hour < 6):
        peak_timeofday = "n"

    # Calculate consumption differences, skewness, and kurtosis
    cons_diff_mean = df[value].diff().mean()
    cons_diff_std = df[value].diff().std()
    cons_skew = df[value].skew()
    cons_kurt = df[value].kurtosis()

    # Calculate average consumption for weekdays and weekends
    weekday_mean = df_datetime[df_datetime["tim_is_weekend"] == 0][value].mean()
    weekend_mean = df_datetime[df_datetime["tim_is_weekend"] == 1][value].mean()
    weekend_diff = weekend_mean - weekday_mean

    # Calculate seasonal mean consumption values
    spring_mean = df_datetime[
        (df_datetime["tim_month"] >= 3) & (df_datetime["tim_month"] <= 5)
    ][value].mean()
    summer_mean = df_datetime[
        (df_datetime["tim_month"] >= 6) & (df_datetime["tim_month"] <= 8)
    ][value].mean()
    autumn_mean = df_datetime[
        (df_datetime["tim_month"] >= 9) & (df_datetime["tim_month"] <= 11)
    ][value].mean()
    winter_mean = df_datetime[
        (df_datetime["tim_month"] == 12) | (df_datetime["tim_month"] <= 2)
    ][value].mean()

    # Extract the building's state
    build_state = df.iloc[0, -1]  # state where building is found

    # Initialize the overall dictionary to store all computed features
    overall_dict = {
        "bldg_id": build_id,
        "cons_range": cons_range,
        "cons_std": cons_std,
        "cons_mean": cons_mean,
        "cons_min": cons_min,
        "cons_max": cons_max,
        "cons_var": cons_var,
        "peak_to_avg_ratio": peak_to_avg_ratio,
        "peak_hour": peak_hour,
        "peak_day": peak_day,
        "peak_is_weekend": peak_is_weekend,
        "peak_timeofday": peak_timeofday,
        "peak_month": peak_month,
        "peak_hourcount": peak_hourcount,
        "peak_daycount": peak_daycount,
        "peak_monthcount": peak_monthcount,
        "cons_diff_mean": cons_diff_mean,
        "cons_diff_std": cons_diff_std,
        "cons_skew": cons_skew,
        "cons_kurt": cons_kurt,
        "weekday_mean": weekday_mean,
        "weekend_mean": weekend_mean,
        "weekend_diff": weekend_diff,
        "spring_mean": spring_mean,
        "summer_mean": summer_mean,
        "autumn_mean": autumn_mean,
        "winter_mean": winter_mean,
        "build_state": build_state,
    }

    # Aggregations by time periods
    periods = ["weekend", "afternoon", "morning", "evening", "night"]
    time_conditions = {
        "weekend": "tim_is_weekend",
        "afternoon": "tim_is_afternoon",
        "morning": "tim_is_morning",
        "evening": "tim_is_evening",
        "night": "tim_is_night",
    }

    for period in periods:
        condition = time_conditions[period]
        cons_by_period = pd.pivot_table(
            data=df_datetime,
            values=value,
            index=condition,
            aggfunc=["mean", "std", quantile_25, quantile_75, "max", "min"],
        ).reset_index()
        cons_by_period_dict = get_flat(cons_by_period, condition, period)
        overall_dict.update(cons_by_period_dict)

    # Advanced Rolling Statistics
    rolling_windows = [7, 30]  # window sizes in rows (consecutive readings), not days
    for window in rolling_windows:
        df[f"rolling_mean_{window}"] = df[value].rolling(window=window).mean()
        df[f"rolling_std_{window}"] = df[value].rolling(window=window).std()
        df[f"rolling_median_{window}"] = df[value].rolling(window=window).median()
        df[f"rolling_range_{window}"] = (
            df[value].rolling(window=window).apply(lambda x: x.max() - x.min())
        )
        df[f"rolling_var_{window}"] = df[value].rolling(window=window).var()
        overall_dict.update(
            {
                f"rolling_mean_{window}": df[f"rolling_mean_{window}"].mean(),
                f"rolling_std_{window}": df[f"rolling_std_{window}"].mean(),
                f"rolling_median_{window}": df[f"rolling_median_{window}"].mean(),
                f"rolling_range_{window}": df[f"rolling_range_{window}"].mean(),
                f"rolling_var_{window}": df[f"rolling_var_{window}"].mean(),
            }
        )

    return overall_dict
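
An integer `window` in pandas counts rows, not calendar time. The sketch below contrasts a row-based window with a time-based one, assuming 15-minute readings; the synthetic series and its start date are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two days of synthetic 15-minute readings (made-up values 0, 1, 2, ...).
index = pd.date_range("2018-01-01", periods=2 * 96, freq="15min")
cons = pd.Series(np.arange(len(index), dtype=float), index=index)

row_window = cons.rolling(window=7).mean()     # 7 rows = 105 minutes of data
day_window = cons.rolling(window="1D").mean()  # all readings in the trailing 24 hours

# Number of rows needed for an integer window to span whole days.
rows_per_day = int(pd.Timedelta("1D") / pd.Timedelta("15min"))
seven_days_in_rows = 7 * rows_per_day
```

Time-based windows such as `"1D"` require a datetime index (or an `on=` column), whereas integer windows work on any index.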


In [9]:
def prepare_data(folder, n_files=5, is_test=False, labels=labels):
    """
    Prepare data by processing a folder of parquet files and extracting features from them.

    Parameters
    ----------
    folder : str
        Path to the folder containing parquet files.
    n_files : int, optional
        Number of files to process, by default 5.
    is_test : bool, optional
        Indicates whether the function is being run on test data, by default False.
    labels : pandas.DataFrame
        DataFrame containing labels or metadata related to the building.
    
    Returns
    -------
    pandas.DataFrame
        A DataFrame containing extracted features. If `is_test` is False, the labels are merged
        based on the building ID.
    """

    # Retrieve list of files from the folder
    files = list_files_recursive(folder, n_files)

    # Process files in parallel using a multiprocessing Pool
    with Pool() as pool:
        dict_lists = pool.map(process_file, files)

    # Create a DataFrame from the processed features
    X = pd.DataFrame(dict_lists)

    if is_test:
        return X
    else:
        # Merge the features with the labels
        return pd.merge(X, labels, on='bldg_id', how='left')


In [10]:
# Takes approximately 6 hours
train_data = prepare_data(TRAIN_DIR,n_files=7200)

## tsfresh Feature Engineering - Raw Time Series
In this portion of the feature engineering process, the tsfresh library is used to
extract features from the first quarter of each building's consumption time series.
Only a quarter is used because of memory and time constraints.

In [13]:
def ts_fresh_prep(folder: str, n_files: int = 3) -> pd.DataFrame:
    """
    Prepares a combined time series dataframe from multiple building data files 
    for feature extraction using ts_fresh.

    This function reads time series data from a specified number of parquet files, 
    processes the first quarter of each file's data, and structures the combined 
    data in a format suitable for the ts_fresh feature extraction library.

    Parameters
    ----------
    folder : str
        The path to the folder containing parquet files with time series data 
        for different buildings.
    
    n_files : int, optional, default=3
        The number of building data files to consider for the processing.
        Only the first n_files in the folder will be used.
    
    Returns
    -------
    pd.DataFrame
        A dataframe containing time series data in ts_fresh-compatible format, 
        with columns ['bldg_id', 'time', 'cons'].
        - 'bldg_id' : Building identifier for each entry.
        - 'time'    : Time step for the building (0-indexed per building).
        - 'cons'    : The total energy consumption for the respective time step.
    """

    # Get the list of files in the folder (recursively) based on the limit n_files
    files = list_files_recursive(folder, n_files)

    # Read and process parquet files, renaming the energy consumption column
    dfs = [
        pd.read_parquet(file)
        .rename(columns={'out.electricity.total.energy_consumption': 'cons'})
        .reset_index(drop=False)  # Retain the original index as a separate column
        for file in files
    ]

    # Select only the first quarter of the data from each building's dataframe
    dfs = [df.head(int(df.shape[0] / 4)) for df in dfs]

    # Combine all dataframes into a single dataframe
    combined_df = pd.concat(dfs, ignore_index=True)

    # Create a time column that indexes each row in order within each building's data
    combined_df['time'] = combined_df.groupby('bldg_id').cumcount()

    # Select the columns required by ts_fresh: building id, time, and consumption
    tsf_df = combined_df[['bldg_id', 'time', 'cons']]

    return tsf_df
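
The long format produced by `ts_fresh_prep` can be illustrated on toy data. The sketch assumes, as the function above does, that each building's frame is indexed by `bldg_id`; the IDs and values are made up, and half of each series is kept here instead of a quarter so the toy output stays readable.

```python
import pandas as pd

# Two tiny per-building frames indexed by bldg_id (made-up values).
frames = [
    pd.DataFrame({"cons": [1.0, 2.0, 3.0, 4.0]}, index=pd.Index([101] * 4, name="bldg_id")),
    pd.DataFrame({"cons": [5.0, 6.0, 7.0, 8.0]}, index=pd.Index([202] * 4, name="bldg_id")),
]

# Move bldg_id into a column and keep the leading part of each series.
trimmed = [df.reset_index().head(len(df) // 2) for df in frames]
long_df = pd.concat(trimmed, ignore_index=True)

# cumcount restarts at 0 for every building, which gives the sort column.
long_df["time"] = long_df.groupby("bldg_id").cumcount()
long_df = long_df[["bldg_id", "time", "cons"]]
```

A frame of this shape is what `extract_features` consumes with `column_id="bldg_id"` and `column_sort="time"`, yielding one feature row per building.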


In [14]:
# Also time-intensive
tsf_df = ts_fresh_prep(TRAIN_DIR,n_files=7200)

In [None]:
# Extract features 
extracted_features = extract_features(tsf_df, column_id="bldg_id", column_sort="time",
                                     default_fc_parameters=EfficientFCParameters(),
                                        n_jobs=3)

Feature Extraction:  80%|████████  | 16/20 [4:41:06<39:19, 589.98s/it]   

In [None]:
extracted_features['bldg_id'] = extracted_features.index

In [None]:
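# Report and drop every tsfresh feature column that contains missing values
na_counts = extracted_features.isna().sum()
cols_with_na = na_counts[na_counts > 0]
for col, n_missing in cols_with_na.items():
    print(col, ': ', n_missing)
extracted_features = extracted_features.drop(columns=cols_with_na.index)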

In [None]:
train_data_comp = pd.merge(train_data,extracted_features,how='left',on='bldg_id')
train_data_comp

## Manual Feature Engineering - 15min-Interval Condensation

In [None]:
def get_15min_df(ndf):
    """
    Aggregates electricity consumption data from the input time series DataFrame,
    computing average consumption in 15-minute intervals for both weekdays and weekends.

    The output consists of two sets of 96 columns:
    - One set for the average consumption during weekdays (15-minute intervals across weekdays).
    - One set for the average consumption during weekends (15-minute intervals across weekends).

    Parameters
    ----------
    ndf : pd.DataFrame
        DataFrame containing time series data for the electricity consumption of a single building.
        The DataFrame must contain the following columns:
        - 'timestamp' : datetime column indicating the time of the observation.
        - 'cons' : electricity consumption value at that time.
        - 'bldg_id' : identifier of the building.

    Returns
    -------
    pd.DataFrame
        A DataFrame with one row:
        - 96 columns representing average weekday consumption per 15-minute interval.
        - 96 columns representing average weekend consumption per 15-minute interval.
        - A 'bldg_id' column indicating the building to which the data pertains.

    Notes
    -----
    Each 15-minute interval is represented as a unique column. Weekdays are considered Monday
    through Friday, while weekends are Saturday and Sunday.
    """

    # Make a copy of the input DataFrame to avoid modifying the original data
    df = ndf.copy()

    # Ensure 'timestamp' column is in datetime format
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Create a column for 15-minute intervals (0 to 95) based on the time of day
    df["15min_chunk"] = (df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute) // 15

    # Create a boolean column to distinguish between weekdays (False) and weekends (True)
    df["is_weekend"] = (
        df["timestamp"].dt.weekday >= 5
    )  # True for Saturday (5) and Sunday (6)

    # Group data by 15-minute intervals and whether it's a weekend, then calculate mean consumption
    grouped = df.groupby(["15min_chunk", "is_weekend"])["cons"].mean().reset_index()

    # Initialize an empty dictionary to store results
    result = {}

    # Iterate over all possible 15-minute chunks (0-95)
    for chunk in range(96):
        # Define column names for weekdays and weekends
        weekday_col = f"{chunk}_15_weekday"
        weekend_col = f"{chunk}_15_weekend"

        # Extract mean consumption values for weekdays and weekends for the current chunk
        weekday_value = grouped[
            (grouped["15min_chunk"] == chunk) & (grouped["is_weekend"] == False)
        ]["cons"].values
        weekend_value = grouped[
            (grouped["15min_chunk"] == chunk) & (grouped["is_weekend"] == True)
        ]["cons"].values

        # Handle cases where no data exists for a given chunk
        result[weekday_col] = weekday_value if len(weekday_value) > 0 else [None]
        result[weekend_col] = weekend_value if len(weekend_value) > 0 else [None]

    # Convert result dictionary into a DataFrame
    result = pd.DataFrame(result)

    # Add a column for the building ID
    result["bldg_id"] = df["bldg_id"].iloc[0]

    return result
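
As a rough illustration of the condensation step, the sketch below builds the same kind of one-row weekday/weekend profile with a `groupby` followed by `unstack`. The synthetic readings, the start date, and the building id are assumptions made for the example; this is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic 15-minute readings for one hypothetical building.
# 2018-01-01 is a Monday, so the range covers 10 weekdays and 4 weekend days.
ts = pd.date_range("2018-01-01", periods=14 * 96, freq="15min")
df = pd.DataFrame(
    {"timestamp": ts, "cons": np.arange(len(ts), dtype=float), "bldg_id": 1}
)

# Slot of the day (0 to 95) and day type for every reading.
slot = (df["timestamp"].dt.hour * 60 + df["timestamp"].dt.minute) // 15
day_type = np.where(df["timestamp"].dt.weekday >= 5, "weekend", "weekday")

# Mean consumption per (slot, day type): 96 rows, one column per day type.
profile = (
    df.assign(slot=slot, day_type=day_type)
    .groupby(["slot", "day_type"])["cons"]
    .mean()
    .unstack("day_type")
)

# Flatten to a single row using the same column naming as get_15min_df.
wide = {
    f"{s}_15_{d}": profile.loc[s, d] for s in profile.index for d in profile.columns
}
row = pd.DataFrame([wide])
row["bldg_id"] = df["bldg_id"].iloc[0]
```

With this synthetic series the first weekday slot averages readings 0, 96, 192, 288, 384, 672, 768, 864, 960 and 1056, which gives 528.0.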


In [None]:
def calculate_iqr_outliers(series):
    """
    Calculate the Interquartile Range (IQR) and use it to determine outlier thresholds.

    The IQR is the range between the first quartile (25th percentile) and the third
    quartile (75th percentile) of the series. This function returns the lower and upper
    bounds for identifying potential outliers based on 0.5 * IQR, a narrower band than
    the conventional 1.5 * IQR rule.

    Parameters
    ----------
    series : pandas.Series
        A pandas Series object containing numerical data.

    Returns
    -------
    lower_bound : float
        The lower threshold for outlier detection (values below this are considered outliers).

    upper_bound : float
        The upper threshold for outlier detection (values above this are considered outliers).
    """

    # Calculate the 25th percentile (Q1) and 75th percentile (Q3) of the data
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)

    # Compute the Interquartile Range (IQR)
    iqr = q3 - q1

    # Calculate the lower and upper bounds for outliers (using 0.5 * IQR)
    lower_bound = q1 - 0.5 * iqr
    upper_bound = q3 + 0.5 * iqr

    return lower_bound, upper_bound
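
A small worked example may help show what the 0.5 multiplier does. The values below are made up; the sketch repeats the same arithmetic as `calculate_iqr_outliers` and compares it with the more common 1.5 multiplier.

```python
import pandas as pd

# Made-up consumption values for illustration.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0])

q1 = values.quantile(0.25)  # 3.0
q3 = values.quantile(0.75)  # 7.0
iqr = q3 - q1               # 4.0

# Thresholds with the 0.5 multiplier used in this notebook.
lower, upper = q1 - 0.5 * iqr, q3 + 0.5 * iqr
n_below = int((values < lower).sum())
n_above = int((values > upper).sum())

# Thresholds with the conventional 1.5 multiplier, for comparison only.
upper_wide = q3 + 1.5 * iqr
n_above_wide = int((values > upper_wide).sum())
```

Here the bounds are 1.0 and 9.0, so the value 10.0 counts as "above" under the 0.5 rule but would not under the 1.5 rule (upper bound 13.0).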


In [None]:
def prep_15min(folder, n_files=3):
    """
    Aggregates electricity consumption data from parquet files and computes new features 
    based on average consumption in 15-minute intervals for weekdays and weekends. The 
    function also calculates summary statistics (e.g., mean, sum, variance) for these 
    intervals and identifies outliers based on interquartile range (IQR).
    
    Parameters
    ----------
    folder : str
        Directory path where the parquet files containing the electricity consumption 
        data are located.
    
    n_files : int, optional
        The number of parquet files (buildings) to process, by default 3.
    
    Returns
    -------
    pd.DataFrame
        DataFrame containing manually engineered features, including statistics and outlier counts 
        for 15-minute interval consumption on weekdays and weekends.
    
    Notes
    -----
    - The parquet files should contain a column named 'out.electricity.total.energy_consumption', 
      which represents the total electricity consumption.
    - The `get_15min_df` function should compute average consumption in 15-minute 
      intervals for both weekdays and weekends.
    - The `calculate_iqr_outliers` function calculates lower and upper IQR thresholds
      to identify outliers.
    """

    # Get a list of files to process from the specified folder
    files = list_files_recursive(folder, n_files)

    # Read each parquet file into a DataFrame, renaming the electricity consumption column
    dfs = [
        pd.read_parquet(file)
        .rename(columns={'out.electricity.total.energy_consumption': 'cons'})
        .reset_index(drop=False)
        for file in files
    ]

    # Apply get_15min_df to condense each building's series into average weekday and weekend 15-minute profiles
    all_15min = [get_15min_df(df) for df in dfs]

    # Concatenate all 15-minute DataFrames into one
    comb = pd.concat(all_15min).reset_index(drop=True)

    # Select columns corresponding to weekday and weekend consumption
    weekdays = [col for col in comb.columns if col.endswith('_weekday')]
    weekends = [col for col in comb.columns if col.endswith('_weekend')]

    # List of statistics to calculate for each interval
    statistics = ['sum', 'mean', 'max', 'min', 'std', 'var', 'median']

    # Compute statistics for weekday columns and assign new feature names
    comb_weekdays_stats = comb[weekdays].agg(statistics, axis=1)
    comb_weekdays_stats.columns = [f"weekdays_{col}" for col in comb_weekdays_stats.columns]

    # Compute statistics for weekend columns and assign new feature names
    comb_weekends_stats = comb[weekends].agg(statistics, axis=1)
    comb_weekends_stats.columns = [f"weekends_{col}" for col in comb_weekends_stats.columns]

    # Concatenate new statistics with the original DataFrame
    comb = pd.concat([comb, comb_weekdays_stats, comb_weekends_stats], axis=1)

    # Initialize columns to track outlier counts and IQR thresholds for weekdays and weekends
    comb['n_below_weekday'] = 0
    comb['n_above_weekday'] = 0
    comb['n_below_weekend'] = 0
    comb['n_above_weekend'] = 0
    comb['lower_weekday'] = 0.0
    comb['upper_weekday'] = 0.0
    comb['lower_weekend'] = 0.0
    comb['upper_weekend'] = 0.0

    # Calculate outliers for each row using IQR-based thresholds
    for idx, row in comb.iterrows():
        # Get weekday and weekend values for the current row
        weekday_values = row[weekdays]
        weekend_values = row[weekends]

        # Calculate IQR-based thresholds for weekday and weekend values
        lower_weekday, upper_weekday = calculate_iqr_outliers(weekday_values)
        lower_weekend, upper_weekend = calculate_iqr_outliers(weekend_values)

        # Count how many weekday values fall outside the IQR range
        n_below_weekday = (weekday_values < lower_weekday).sum()
        n_above_weekday = (weekday_values > upper_weekday).sum()

        # Count how many weekend values fall outside the IQR range
        n_below_weekend = (weekend_values < lower_weekend).sum()
        n_above_weekend = (weekend_values > upper_weekend).sum()

        # Store the counts and thresholds in the DataFrame
        comb.at[idx, 'n_below_weekday'] = n_below_weekday
        comb.at[idx, 'n_above_weekday'] = n_above_weekday
        comb.at[idx, 'n_below_weekend'] = n_below_weekend
        comb.at[idx, 'n_above_weekend'] = n_above_weekend
        comb.at[idx, 'lower_weekday'] = lower_weekday
        comb.at[idx, 'upper_weekday'] = upper_weekday
        comb.at[idx, 'lower_weekend'] = lower_weekend
        comb.at[idx, 'upper_weekend'] = upper_weekend

    return comb
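
The row-wise `iterrows` loop above can become slow for thousands of buildings. The sketch below shows one possible vectorised form of the same counting logic for the weekday columns. It runs on random data and is an alternative under stated assumptions, not the code used to produce the submission.

```python
import numpy as np
import pandas as pd

# Random stand-in for the 96 weekday profile columns of four buildings.
rng = np.random.default_rng(0)
cols = [f"{i}_15_weekday" for i in range(96)]
comb = pd.DataFrame(rng.random((4, 96)), columns=cols)

# Row-wise quartiles and 0.5 * IQR thresholds, one value per building.
q1 = comb[cols].quantile(0.25, axis=1)
q3 = comb[cols].quantile(0.75, axis=1)
iqr = q3 - q1
lower = q1 - 0.5 * iqr
upper = q3 + 0.5 * iqr

# Compare every slot against its own row's thresholds and count per row.
n_below = comb[cols].lt(lower, axis=0).sum(axis=1)
n_above = comb[cols].gt(upper, axis=0).sum(axis=1)
```

The same pattern would apply to the weekend columns by swapping the column list.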


In [None]:
df_15min = prep_15min(TRAIN_DIR,n_files=7200)

In [None]:
train_data_comp = pd.merge(train_data_comp,df_15min,how='left',on='bldg_id')
train_data_comp

## tsfresh Feature Engineering - 15min-Interval Condensed Time Series

In [None]:
def df15min_tsfprep(ndf: pd.DataFrame) -> pd.DataFrame:
    """
    Prepares a time series DataFrame for feature extraction with the tsfresh
    library. This preparation is done by melting and pivoting columns related to
    '15_weekday' and '15_weekend' time intervals into separate rows and columns.

    Parameters
    ----------
    ndf : pd.DataFrame
        Input DataFrame containing time series data, where the columns represent
        values at 15-minute intervals for both weekdays and weekends. The columns
        should be named with a numeric prefix and suffixes '_15_weekday' or '_15_weekend'.

    Returns
    -------
    pd.DataFrame
        Transformed DataFrame where:
        - Rows are indexed by 'bldg_id' and 'time'.
        - There are two columns: '15_weekday' and '15_weekend', which hold the respective
          time series values for weekdays and weekends.

    Notes
    -----
    The function works under the assumption that the input DataFrame has a structure
    where the time series data for weekdays and weekends are stored in columns with
    suffixes '_15_weekday' and '_15_weekend', respectively.

    Example
    -------
    Input:
    +---------+--------------+--------------+-----+
    | bldg_id | 0_15_weekday | 0_15_weekend | ... |
    +---------+--------------+--------------+-----+
    | 1       | 123          | 130          | ... |
    +---------+--------------+--------------+-----+

    Output:
    +---------+------+------------+------------+
    | bldg_id | time | 15_weekday | 15_weekend |
    +---------+------+------------+------------+
    | 1       | 0    | 123        | 130        |
    +---------+------+------------+------------+
    """

    # Make a copy of the input DataFrame to avoid modifying the original
    df = ndf.copy()

    # Melt the DataFrame to move '15_weekday' and '15_weekend' columns into rows
    melted_df = df.melt(
        id_vars=["bldg_id"],
        value_vars=[
            col for col in df.columns if "_15_weekday" in col or "_15_weekend" in col
        ],
        var_name="time",
        value_name="value",
    )

    # Create a new column 'day_type' to distinguish between weekday and weekend
    melted_df["day_type"] = melted_df["time"].apply(
        lambda x: "15_weekday" if "weekday" in x else "15_weekend"
    )

    # Extract the time part from the column name (numeric prefix before '_15')
    melted_df["time"] = melted_df["time"].str.extract(r"(\d+)_15")[0].astype(int)

    # Pivot the DataFrame so '15_weekday' and '15_weekend' become columns
    reshaped_df = melted_df.pivot_table(
        index=["bldg_id", "time"], columns="day_type", values="value"
    ).reset_index()

    # Remove the column index name left over from the pivot
    reshaped_df.columns.name = None

    return reshaped_df
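
To make the reshape concrete, here is a minimal sketch of the same wide-to-long transformation on a two-building toy frame. It uses `pivot` rather than `pivot_table`, which is a deliberate swap for illustration: `pivot` raises an error on duplicate (bldg_id, time) pairs instead of averaging them. The input values are invented.

```python
import pandas as pd

# Toy wide frame with two 15-minute slots per day type (values are invented).
wide = pd.DataFrame(
    {
        "bldg_id": [1, 2],
        "0_15_weekday": [1.0, 3.0],
        "0_15_weekend": [1.5, 3.5],
        "1_15_weekday": [2.0, 4.0],
        "1_15_weekend": [2.5, 4.5],
    }
)

# Wide to long: one row per (building, original column name).
long = wide.melt(id_vars="bldg_id", var_name="name", value_name="value")

# Split the column name into the slot number and the day-type label.
parts = long["name"].str.extract(r"^(?P<time>\d+)_(?P<day_type>15_week(?:day|end))$")
long["time"] = parts["time"].astype(int)
long["day_type"] = parts["day_type"]

# Long to tidy: one row per (building, slot), one column per day type.
tidy = long.pivot(
    index=["bldg_id", "time"], columns="day_type", values="value"
).reset_index()
tidy.columns.name = None
```

The result has the layout tsfresh expects when called with `column_id="bldg_id"` and `column_sort="time"`: an id column, a sort column, and one column per time series kind.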


In [None]:
tsf_15min = df15min_tsfprep(df_15min)

In [None]:
# Extract features (no default_fc_parameters given, so tsfresh uses its default settings)
extracted_15min = extract_features(
    tsf_15min, column_id="bldg_id", column_sort="time", n_jobs=3
)

In [None]:
extracted_15min['bldg_id'] = extracted_15min.index

In [None]:
for col in extracted_15min.columns:
    if extracted_15min[col].isna().sum()>0:
        print(col,': ',extracted_15min[col].isna().sum())
        extracted_15min.drop(col,axis=1,inplace=True)
    elif extracted_15min[col].nunique()==1:
        print('constant', ' ',col)
        extracted_15min.drop(col,axis=1,inplace=True)

In [None]:
train_data_comp = pd.merge(train_data_comp,extracted_15min,how='left',on='bldg_id')
train_data_comp

# Developing Building Stock Type Model

In [None]:
X = train_data_comp[
    [
        col
        for col in train_data_comp.columns
        if not (col.endswith("_com")) and not (col.endswith("_res"))
    ]
]
X = X.drop("building_stock_type", axis=1)


In [None]:
stock_type_y = train_data_comp["building_stock_type"].map(
    {"residential": 1, "commercial": 0}
)


In [None]:
# Identify and drop features with only one unique value; a constant column carries no information for the model
constant_cols = []
for col in X.columns:
    if X[col].nunique()==1:
        constant_cols.append(col)
        
X = X.drop(constant_cols,axis=1)

In [None]:
# data preprocessing pipeline: target-encode the non-numeric columns, then min-max scale all features
type_prep_pipe = Pipeline([
    ('tenc', ColumnTransformer([('Oencode', TargetEncoder(),
                                 ['peak_timeofday','build_state'])],
                               remainder='passthrough')),
    ('scaler', MinMaxScaler())])

prepped_typeX = type_prep_pipe.fit_transform(X,stock_type_y)

In [None]:
prepped_typeX.shape

In [None]:
# indices of important features after feature selection process
true_indexes = [1024, 1026, 1029, 6, 1031, 5, 9, 11, 12, 2069, 21, 1559, 2071, 535, 23,
                25, 28, 1052, 541, 30, 33, 1570, 1574, 39, 40, 551, 42, 554, 1062, 45,
                2093, 1580, 560, 1584, 52, 1589, 54, 2104, 57, 568, 1083, 570, 63, 66,
                1091, 1092, 69, 75, 1099, 1615, 81, 82, 1106, 84, 1107, 87, 599, 1113,
                90, 1117, 1118, 95, 606, 1634, 99, 98, 101, 1635, 110, 112, 113, 1654,
                1145, 122, 124, 125, 126, 127, 128, 638, 1663, 131, 132, 645, 647, 659,
                149, 662, 1693, 160, 1696, 162, 1698, 678, 1702, 682, 1712, 690, 180,
                1716, 697, 186, 1216, 1220, 1747, 724, 1236, 737, 1290, 780, 1810, 789,
                803, 301, 813, 826, 314, 831, 834, 838, 844, 334, 847, 848, 337, 1879,
                861, 862, 353, 1386, 1389, 882, 1913, 393, 413, 416, 932, 1448, 944, 435,
                1460, 950, 958, 1471, 448, 447, 450, 1987, 460, 461, 974, 1487, 1495,
                990, 2018, 1509, 1510, 1511, 1512, 2023, 1515, 2028, 2031, 2034, 499,
                1012, 1014, 1015, 1016, 1022, 1023]


In [None]:
finalX = prepped_typeX[:,true_indexes]
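
Because the positional indices refer to the preprocessed matrix rather than to `X`, it can be useful to map them back to column names. A scikit-learn `ColumnTransformer` with `remainder='passthrough'` places the transformed columns first and the remaining columns after them in their original order. The sketch below reconstructs that order by hand on a toy frame; the feature names `f0`, `f1`, `f2` and the chosen positions are hypothetical.

```python
import pandas as pd

# Toy feature frame; only the two encoded column names come from the notebook.
X_toy = pd.DataFrame(
    {
        "f0": [0.1, 0.2],
        "peak_timeofday": ["am", "pm"],
        "f1": [1.0, 2.0],
        "build_state": ["NY", "CA"],
        "f2": [5, 6],
    }
)
encoded_cols = ["peak_timeofday", "build_state"]

# Assumed output order: encoded columns first, then the passthrough columns.
output_order = encoded_cols + [c for c in X_toy.columns if c not in encoded_cols]

# Map hypothetical selected positions back to feature names.
selected_positions = [0, 2, 4]
selected_names = [output_order[i] for i in selected_positions]
print(selected_names)
```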

In [None]:
finalX.shape

In [None]:
# Final parameters from the hyperparameter tuning process, which reached an f1_score of 1.0:
cbest_params = {
    "n_estimators": 130,
    "max_depth": 3,
    "num_leaves": 159,
    "learning_rate": 0.2926672253860692,
    "min_child_samples": 38,
    "min_child_weight": 1.8156267400360182,
    "subsample": 0.40161872280656674,
    "scale_pos_weight": 6,
    "colsample_bytree": 0.9412860959511857,
    "reg_alpha": 0.4842521664025019,
    "reg_lambda": 3.664346685841027e-06,
}


In [None]:
pyc_df = pd.DataFrame(finalX)

In [None]:
typekf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
cmodels = []
oof_pred = []
for train_index, test_index in typekf.split(pyc_df, stock_type_y):

    X_train, X_test = pyc_df.iloc[train_index], pyc_df.iloc[test_index]
    y_train, y_test = stock_type_y.iloc[train_index], stock_type_y.iloc[test_index]
    cmodel = Pipeline(
        [
            (
                "model",
                LGBMClassifier(
                    verbose=-1, device_type="cpu", random_state=42, **cbest_params
                ),
            )
        ]
    )

    cmodel.fit(X_train, y_train)
    preds = cmodel.predict(X_test)
    oof_pred.append(f1_score(y_test, preds))
    cmodels.append(cmodel)

print(np.mean(oof_pred))
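
The loop above keeps one fitted model per fold in `cmodels`. One common way to combine such fold models at prediction time is to average their predicted probabilities (soft voting). The sketch below illustrates that idea with made-up probabilities; how the notebook actually combines the five models is not shown in this section, so treat the averaging and the 0.5 threshold as assumptions.

```python
import numpy as np

# Hypothetical positive-class probabilities from 5 fold models (rows)
# for 4 test buildings (columns).
fold_probs = np.array(
    [
        [0.9, 0.2, 0.6, 0.4],
        [0.8, 0.1, 0.4, 0.5],
        [0.7, 0.3, 0.7, 0.3],
        [0.9, 0.2, 0.5, 0.6],
        [0.8, 0.1, 0.6, 0.4],
    ]
)

# Average across models, then threshold at 0.5 (assumed threshold).
mean_prob = fold_probs.mean(axis=0)
pred = (mean_prob >= 0.5).astype(int)

# Undo the label mapping used for stock_type_y (residential=1, commercial=0).
labels = np.where(pred == 1, "residential", "commercial")
```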


# Second Hierarchy

In [None]:
def create_combo_task(df: pd.DataFrame, target_names: list) -> pd.DataFrame:
    """
    Combine individual target columns into a single target column along with a `target_type` column.

    This function takes a DataFrame and a list of target column names, and returns a new DataFrame 
    where each row is duplicated for each target column. A new column `target_type` specifies the 
    original target column from which the `target` value was taken.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing features and multiple target columns.
    target_names : list of str
        List of column names in `df` representing different target columns to be combined.

    Returns
    -------
    pd.DataFrame
        A DataFrame with two new columns:
        - `target_type`: Indicates the original target column name.
        - `target`: The target values from the corresponding `target_type` column, 
        converted to strings.

    Example
    -------
    >>> res_df = pd.DataFrame({
    ...     'feature1': [0.1, 0.2],
    ...     'feature2': [0.3, 0.4],
    ...     'target1': [1, 0],
    ...     'target2': [0, 1]
    ... })
    >>> target_columns = ['target1', 'target2']
    >>> combo_res_class_df = create_combo_task(res_df, target_columns)
    >>> print(combo_res_class_df)
       feature1  feature2 target_type target
    0       0.1       0.3     target1      1
    1       0.2       0.4     target1      0
    0       0.1       0.3     target2      0
    1       0.2       0.4     target2      1
    """

    comb_clas_dfs = []

    for target in target_names:
        # Create a copy of the DataFrame without the target columns
        new_df = df.drop(columns=target_names)
        # Add a new column specifying the target type
        new_df["target_type"] = target
        # Convert the target column values to strings and add them as 'target'
        new_df["target"] = df[target].astype(str)
        comb_clas_dfs.append(new_df)

    # Concatenate all dataframes vertically
    combined_df = pd.concat(comb_clas_dfs, axis=0)

    return combined_df
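
For comparison, the same long format can be produced with `DataFrame.melt`. This is a sketch of an equivalent reshape on the docstring's toy data, not a replacement used elsewhere in the notebook. One difference is that `melt` returns a fresh 0..n-1 index, whereas the concatenation above repeats the original index for each target.

```python
import pandas as pd

res_df = pd.DataFrame(
    {
        "feature1": [0.1, 0.2],
        "feature2": [0.3, 0.4],
        "target1": [1, 0],
        "target2": [0, 1],
    }
)
target_columns = ["target1", "target2"]
feature_columns = [c for c in res_df.columns if c not in target_columns]

# Stack the target columns: one row per (original row, target column).
combo = res_df.melt(
    id_vars=feature_columns,
    value_vars=target_columns,
    var_name="target_type",
    value_name="target",
)

# Store target values as strings, as create_combo_task does.
combo["target"] = combo["target"].astype(str)
```

In both versions, `astype(str)` turns a missing target into the literal string `"nan"`, so rows without a label for a given target type need to be filtered out before training.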


# Commercial Targets

## Commercial Building Data Preparation

In [None]:
# filtering complete data to create data of commercial buildings only
comm_targets = [col for col in train_data_comp.columns if col.endswith('_com')]
comm_df = train_data_comp[~(train_data_comp[comm_targets[0]].isna())]
comm_df = comm_df[[col for col in comm_df if not col.endswith('_res')]]

# dropping feature columns with only one unique value
comm_df = comm_df.drop(columns=constant_cols)

# applying create_combo_task function to condense the commercial building target types
# into one single target column
combo_com_clas_df = create_combo_task(comm_df,comm_targets)

## Commercial Building Feature Selection
The lists below give, for each commercial building target type, the indices of the features used to predict that target.
These indices were identified by a feature selection process.

In [None]:
# dictionary to store each target type as a key and the relevant feature indices as values
com_relevant_indices = {}

In [None]:
com_relevant_indices['in.vintage_com'] = \
[512, 1, 2050, 516, 1542, 2055, 530, 1047, 1565, 33, 546, 547, 1060, 1572, 1574, 551,
 1575, 2083, 572, 63, 1088, 1600, 578, 1602, 75, 78, 1618, 611, 614, 616, 1130, 1133,
 1135, 1138, 1654, 1655, 1145, 123, 1660, 637, 1661, 126, 640, 128, 638, 1666, 655, 1167,
 1681, 660, 664, 670, 671, 679, 1196, 179, 185, 186, 699, 700, 1213, 188, 705, 1217, 707,
 1221, 198, 1223, 200, 201, 1225, 1227, 716, 1738, 1226, 1222, 1737, 1746, 1748, 1749,
 727, 216, 220, 1756, 734, 1757, 1758, 1254, 1775, 1776, 241, 1805, 272, 273, 1297, 1302,
 1818, 1307, 1820, 1821, 1823, 1312, 1313, 1825, 803, 804, 1316, 1829, 1831, 809, 1834,
 1835, 810, 1837, 1839, 1841, 1330, 1846, 311, 1848, 1849, 1855, 836, 1349, 1862, 1865,
 337, 1361, 1367, 1880, 1368, 1371, 1884, 1372, 1888, 1889, 1891, 1380, 357, 1893, 1894,
 1900, 366, 1903, 1904, 375, 1918, 1919, 1922, 386, 1923, 1925, 1414, 1929, 407, 409,
 414, 1438, 1441, 1444, 1957, 1958, 1445, 1455, 1968, 1460, 1461, 1974, 1463, 1464,
 1975, 1978, 1973, 1977, 445, 446, 1983, 1984, 1985, 1474, 455, 2001, 467, 2004, 2003,
 470, 2013, 487, 2030, 500, 503, 506, 509]

com_relevant_indices['in.comstock_building_type_group_com'] = \
[512, 1546, 1550, 18, 1555, 19, 20, 1558, 23, 2071, 2072, 2076, 2078, 2079, 2080, 1060,
 1573, 1064, 1066, 45, 51, 1076, 1592, 1604, 84, 1622, 101, 1131, 1647, 1135, 113, 1648,
 116, 1146, 122, 123, 124, 126, 127, 128, 646, 1674, 659, 672, 1184, 685, 174, 186, 187,
 188, 189, 190, 1726, 191, 1220, 1221, 1734, 711, 1224, 1228, 1746, 1748, 733, 735, 1778,
 1780, 1278, 1288, 268, 269, 270, 1816, 1819, 802, 812, 1836, 1838, 832, 836, 1861, 329,
 1867, 333, 1360, 849, 1363, 1884, 351, 1888, 370, 1914, 387, 1422, 1959, 1483, 1485,
 2008, 473, 2011, 499, 500, 1530, 2044, 2046]

com_relevant_indices['in.heating_fuel_com'] = \
[512, 1, 0, 9, 10, 13, 1044, 23, 2073, 2079, 2080, 545, 1575, 1066, 1074, 2098, 1592,
 1602, 84, 1622, 87, 110, 1651, 116, 1653, 123, 636, 637, 638, 128, 641, 642, 643, 650,
 653, 655, 657, 660, 665, 675, 683, 688, 690, 179, 693, 182, 187, 188, 189, 190, 703,
 192, 699, 706, 1212, 1220, 710, 198, 1224, 1737, 199, 202, 715, 1746, 1748, 1749, 726,
 729, 1757, 736, 749, 241, 1285, 265, 266, 268, 269, 270, 271, 274, 1811, 1814, 287, 800,
 801, 802, 804, 807, 1832, 810, 812, 1838, 1330, 823, 826, 315, 835, 836, 327, 1352,
 1865, 1866, 331, 1867, 334, 345, 351, 354, 1892, 1894, 360, 376, 392, 415, 1957, 426,
 1962, 1963, 1974, 1468, 446, 448, 463, 2013, 2016, 2017, 2030, 508]

com_relevant_indices['in.wall_construction_type_com'] = \
[1, 517, 2055, 2056, 9, 10, 11, 17, 536, 538, 539, 542, 1568, 2081, 1060, 2092, 2094,
 562, 2102, 567, 2103, 1083, 572, 573, 1602, 1604, 75, 591, 611, 614, 622, 1135, 1140,
 117, 120, 123, 635, 1661, 638, 1662, 640, 1666, 642, 132, 645, 646, 647, 1672, 653,
 659, 669, 1184, 672, 674, 676, 677, 166, 1192, 682, 1195, 687, 177, 690, 180, 692, 186,
 192, 1217, 705, 707, 706, 1221, 199, 1227, 716, 717, 203, 1234, 730, 220, 224, 231, 235,
 238, 1775, 239, 754, 755, 756, 245, 757, 1278, 258, 1806, 272, 1296, 274, 275, 1819,
 1308, 1313, 802, 291, 1316, 1317, 803, 807, 811, 812, 1323, 1326, 1327, 305, 1330, 1844,
 1333, 1845, 311, 1846, 1851, 1852, 1858, 1349, 327, 1863, 1353, 331, 1870, 1872, 1361,
 345, 346, 1371, 1372, 1369, 1374, 1887, 1376, 350, 1890, 1899, 366, 368, 1906, 375, 
 1402, 380, 1407, 384, 1922, 392, 393, 1932, 1424, 402, 1943, 409, 411, 1951, 1953, 1443,
 1446, 1961, 1455, 1972, 439, 442, 1980, 1469, 1471, 448, 1991, 1992, 1999, 1489, 2002,
 468, 1493, 470, 471, 473, 476, 477, 2014, 1505, 485, 494]

com_relevant_indices['in.hvac_category_com'] = \
[1, 9, 1546, 19, 20, 2069, 535, 536, 537, 1048, 538, 540, 542, 543, 2080, 2081, 1064,
 554, 2093, 51, 2100, 567, 1592, 578, 580, 1120, 1130, 1133, 110, 117, 118, 119, 1653,
 121, 1654, 124, 125, 637, 126, 640, 128, 642, 643, 644, 127, 662, 151, 164, 1191, 174,
 177, 180, 181, 184, 186, 187, 699, 192, 1220, 1222, 198, 1224, 1226, 203, 1227, 1747,
 1748, 726, 735, 1760, 755, 265, 267, 268, 269, 270, 272, 287, 1313, 802, 803, 804, 807,
 808, 813, 814, 822, 824, 826, 323, 836, 327, 337, 340, 342, 387, 428, 440, 451, 1484,
 1492, 2008, 2009, 2011, 2016, 2017]

com_relevant_indices['in.ownership_type_com'] = \
[512, 1, 527, 532, 533, 2071, 2073, 538, 1567, 2081, 547, 1060, 39, 1074, 2099, 51,
 2101, 2100, 2103, 1592, 1594, 1598, 1087, 577, 1611, 1622, 105, 1645, 1646, 1648,
 1649, 1652, 1141, 116, 1656, 1658, 1659, 123, 1147, 124, 127, 1663, 642, 1668, 1669,
 646, 662, 1184, 674, 682, 1196, 686, 1200, 1204, 1208, 185, 697, 699, 1209, 189, 190,
 191, 702, 1729, 188, 700, 709, 1228, 1741, 1229, 1231, 1233, 1745, 1746, 1751, 1752,
 217, 218, 731, 1761, 1767, 1777, 1778, 1280, 1793, 1306, 795, 1820, 1310, 1826, 1320,
 1833, 1321, 1834, 812, 1837, 1327, 1840, 1839, 1844, 311, 1849, 826, 1854, 1344, 1345,
 1856, 323, 1861, 329, 1867, 1869, 337, 1874, 343, 344, 1370, 1883, 1888, 1385, 1392,
 1398, 1406, 1408, 1921, 1412, 1934, 412, 1442, 1444, 421, 422, 1446, 1963, 1452, 1968,
 1459, 1460, 1979, 443, 1980, 447, 1983, 1984, 1474, 1986, 1476, 1479, 1993, 1999,
 1488, 472, 2008, 2010, 2013, 2016, 2017, 2027, 499]

com_relevant_indices['in.number_of_stories_com'] = \
[1, 2056, 2060, 2064, 529, 530, 19, 1555, 1554, 2067, 23, 1048, 535, 2074, 2079, 2080,
 2098, 2099, 2103, 1592, 1594, 1615, 117, 637, 641, 643, 132, 644, 646, 1157, 1672, 652,
 672, 688, 177, 690, 181, 183, 184, 185, 186, 696, 188, 189, 192, 705, 1219, 1220, 1224,
 1225, 1226, 1227, 718, 1745, 1748, 724, 1753, 731, 1757, 1763, 1764, 741, 1254, 1766,
 1266, 1780, 1781, 1782, 249, 254, 1278, 265, 267, 268, 1807, 271, 1810, 1811, 1299, 277,
 1305, 1818, 1819, 283, 1830, 807, 808, 1321, 1837, 1846, 826, 831, 834, 835, 836, 1866,
 1868, 1361, 1363, 349, 350, 1375, 353, 1384, 361, 1391, 1904, 1906, 887, 375, 1913, 378,
 383, 385, 386, 387, 1926, 393, 395, 399, 411, 412, 1440, 1962, 1964, 1456, 1969, 1974,
 1465, 1979, 444, 1980, 1473, 453, 1994, 1488, 2001, 1492, 472, 2013, 480, 1505, 499,
 2038, 1533, 2046]

In [None]:
com_relevant_indices['in.tstat_htg_sp_f..f_com'] = \
[513, 2, 2052, 2056, 1546, 2067, 23, 2073, 2075, 2076, 2077, 2078, 2080, 2098, 1077,
 1592, 87, 612, 100, 614, 1133, 1647, 1136, 1137, 1650, 116, 122, 1147, 124, 123, 1662,
 1660, 126, 1659, 646, 648, 1675, 652, 1682, 659, 1684, 663, 1687, 1180, 672, 1700, 167,
 1705, 1706, 172, 174, 1719, 185, 186, 188, 189, 190, 191, 704, 1219, 1733, 1734, 1221,
 713, 1738, 1737, 715, 1227, 1228, 1740, 1746, 1747, 726, 1750, 1752, 1762, 1764, 1765,
 1776, 1777, 1778, 1780, 1781, 1782, 253, 1792, 267, 268, 269, 1294, 270, 275, 1812,
 1814, 1303, 1816, 279, 794, 1824, 1313, 1826, 1320, 1836, 1333, 822, 836, 1867, 1363,
 1879, 347, 1886, 1889, 1896, 361, 364, 1903, 1392, 1908, 1912, 1913, 1914, 387, 393,
 1421, 1435, 412, 1437, 1440, 1443, 1444, 1446, 1959, 1960, 1963, 1451, 1457, 1458, 433,
 1460, 1459, 1462, 1973, 1978, 1981, 446, 1982, 1992, 1483, 1485, 2013, 2024, 2030, 499,
 500, 504, 2046]

com_relevant_indices['in.tstat_clg_sp_f..f_com'] = \
[2052, 1542, 2056, 2064, 23, 2073, 2075, 2076, 2077, 2080, 1573, 1574, 560, 2097, 1076,
 1592, 1087, 1604, 1611, 100, 1133, 1134, 1137, 116, 122, 123, 124, 1660, 643, 1156, 646,
 1674, 1676, 1684, 1688, 678, 1705, 681, 683, 172, 685, 174, 1196, 1711, 692, 183, 696,
 697, 698, 186, 700, 189, 190, 191, 1217, 1730, 1729, 706, 1221, 1222, 1734, 712, 1737,
 1226, 1219, 1228, 1223, 1230, 1746, 1747, 1748, 729, 1754, 231, 1778, 1782, 1278, 1796,
 1797, 1798, 268, 1294, 271, 273, 1811, 1816, 1306, 794, 1310, 1313, 1826, 803, 1827,
 1835, 812, 1837, 1337, 826, 1341, 831, 832, 1861, 1862, 1351, 1867, 1871, 1874, 1876,
 1881, 1882, 1889, 1377, 1383, 361, 363, 367, 1905, 1908, 1397, 1912, 1913, 1914, 378,
 1405, 1406, 387, 1413, 1420, 1421, 398, 399, 396, 406, 411, 1437, 413, 1957, 1448, 1960,
 1962, 1453, 434, 1467, 1470, 453, 1992, 1483, 1485, 1997, 1999, 2013, 1505, 2021, 2026,
 2028, 508]

com_relevant_indices['in.weekday_opening_time..hr_com'] = \
[1, 1547, 1548, 1549, 1550, 524, 17, 19, 23, 2075, 2080, 1574, 1066, 1070, 1071, 1083,
 574, 1123, 1127, 105, 1129, 1131, 120, 121, 122, 124, 1157, 1161, 649, 1162, 653, 1166,
 654, 659, 662, 667, 1184, 1192, 1193, 1196, 684, 1200, 178, 1204, 181, 183, 185, 698,
 1211, 1212, 1213, 190, 1215, 705, 711, 712, 1228, 1746, 1748, 727, 1751, 1753, 757, 1273,
 1292, 1293, 271, 272, 274, 1816, 1305, 1818, 1312, 1313, 1314, 1315, 1826, 1317, 1316,
 802, 1336, 1337, 1338, 1859, 336, 1360, 1362, 1363, 1361, 345, 351, 1383, 1384, 361,
 1385, 388, 1436, 1448, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 438, 1464, 1467, 1470,
 447, 1472, 1471, 1480, 1481, 1489, 2008, 1502, 1503, 500, 1533]

com_relevant_indices['in.weekday_operating_hours..hr_com'] = \
[517, 1542, 1551, 1553, 1554, 19, 1557, 1558, 1047, 1045, 1568, 1569, 1573, 1574, 1577,
 1066, 1067, 1578, 1071, 1072, 1073, 1076, 1077, 1588, 2103, 1592, 1618, 597, 1129, 1131,
 116, 117, 118, 119, 632, 125, 646, 650, 1682, 666, 1180, 1184, 674, 1192, 681, 1196,
 177, 1201, 1204, 190, 1221, 1224, 202, 715, 1741, 1746, 1238, 1250, 1251, 1252, 1253,
 1269, 246, 1281, 1282, 1292, 1807, 1300, 1304, 1817, 1821, 303, 1330, 822, 1848, 1850,
 1344, 1857, 323, 1349, 1861, 331, 1357, 1873, 1362, 1881, 346, 1371, 1888, 1376, 1385,
 1388, 365, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1908, 1398, 1400, 1401, 1402, 1405,
 1406, 1407, 1408, 1409, 1410, 1411, 387, 1414, 1415, 1926, 1419, 1420, 1421, 1423, 1424,
 1425, 1429, 1433, 1434, 1436, 432, 1972, 440, 1977, 1985, 1483, 1485, 467, 1492, 1491,
 474, 1501, 477, 2023]
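
A dictionary of index lists like `com_relevant_indices` can be applied to a preprocessed feature matrix one target at a time. The sketch below shows the pattern on a tiny array and adds two sanity checks (duplicate and out-of-range indices). The helper name `select_features`, the target names and the toy matrix are hypothetical.

```python
import numpy as np

# Toy preprocessed matrix: 3 buildings, 4 features.
prepped = np.arange(12, dtype=float).reshape(3, 4)

# Hypothetical per-target index lists, mirroring com_relevant_indices.
relevant_indices = {"target_a": [0, 2], "target_b": [3, 1]}


def select_features(matrix, indices):
    """Return the selected columns after basic validity checks."""
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate feature index")
    if min(indices) < 0 or max(indices) >= matrix.shape[1]:
        raise IndexError("feature index out of range")
    return matrix[:, indices]


# One feature subset per target type, in the order given by each index list.
subsets = {
    name: select_features(prepped, idx) for name, idx in relevant_indices.items()
}
```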

## Commercial Building Hyperparameter Tuning
The dictionaries below give the optimised hyperparameter values for the model dedicated to each commercial building target type.
These values were identified by an Optuna optimisation process.

In [None]:
# dictionary to store each target type as a key and the optimised hyperparameters as values
tuned_params_com = {}

In [None]:
# F1 score: 0.24855760678990307
tuned_params_com['in.vintage_com'] =\
{'n_estimators': 929, 'max_depth': 30, 'num_leaves': 98,
 'learning_rate': 0.09757152134768582, 'min_child_samples': 63,
 'min_child_weight': 5.789632655738992, 'subsample': 0.7037137854216451,
 'scale_pos_weight': 2, 'colsample_bytree': 0.7992042532947691,
 'reg_alpha': 0.00010604613407883311, 'reg_lambda': 2.718366809743991e-07}

# F1 score: 0.9698519045448032
tuned_params_com['in.comstock_building_type_group_com'] =\
{'n_estimators': 940, 'max_depth': 46, 'num_leaves': 81,
 'learning_rate': 0.09672413197136034, 'min_child_samples': 91,
 'min_child_weight': 0.004624180720460598, 'subsample': 0.48241105899368497,
 'scale_pos_weight': 6, 'colsample_bytree': 0.6973108502666483,
 'reg_alpha': 5.754194740383054e-06, 'reg_lambda': 4.785401761956007e-07}

# F1 score: 0.7889239792117071
tuned_params_com['in.heating_fuel_com'] = \
{'n_estimators': 999, 'max_depth': 40, 'num_leaves': 31,
 'learning_rate': 0.07780397696933257, 'min_child_samples': 90,
 'min_child_weight': 0.8545561924548544, 'subsample': 0.42951096078090645,
 'scale_pos_weight': 2, 'colsample_bytree': 0.7134162448258592,
 'reg_alpha': 7.466548533825271e-08, 'reg_lambda': 0.21541630101775278}

In [None]:
# F1 score: 0.9006977979183691
tuned_params_com['in.hvac_category_com'] =\
{'n_estimators': 832, 'max_depth': 4, 'num_leaves': 156,
 'learning_rate': 0.20961451964851766, 'min_child_samples': 81,
 'min_child_weight': 0.0057852790807866265, 'subsample': 0.46176193007934135,
 'scale_pos_weight': 6, 'colsample_bytree': 0.7308989879791561,
 'reg_alpha': 1.293965596370794e-07, 'reg_lambda': 1.0701562102768924e-08}

# F1 score: 0.5028913117127053
tuned_params_com['in.wall_construction_type_com'] = \
{'n_estimators': 956, 'max_depth': 20, 'num_leaves': 163,
 'learning_rate': 0.16471541188369432, 'min_child_samples': 21,
 'min_child_weight': 4.835751672994968, 'subsample': 0.5763943704418849,
 'scale_pos_weight': 2, 'colsample_bytree': 0.7440354364609669,
 'reg_alpha': 4.945273284359274e-07, 'reg_lambda': 0.00030261260038118315}

# F1 score: 0.5994503822192405
tuned_params_com['in.ownership_type_com'] =\
{'n_estimators': 786, 'max_depth': 1, 'num_leaves': 81,
 'learning_rate': 0.13965667981953575, 'min_child_samples': 83,
 'min_child_weight': 8.027579422119214, 'subsample': 0.5590589332530607,
 'scale_pos_weight': 2, 'colsample_bytree': 0.6915794829646965,
 'reg_alpha': 3.981238143791743e-06, 'reg_lambda': 0.0006286670702081804}

In [None]:
# F1 score: 0.7225128425921922
tuned_params_com['in.number_of_stories_com'] =\
{'n_estimators': 977, 'max_depth': 8, 'num_leaves': 123,
 'learning_rate': 0.06315708717537268, 'min_child_samples': 61,
 'min_child_weight': 3.4659332460428436, 'subsample': 0.9051859696480227,
 'scale_pos_weight': 3, 'colsample_bytree': 0.7828327035765278,
 'reg_alpha': 0.055681878622985126, 'reg_lambda': 5.30990530587543e-08}


# F1 score: 0.5514551239417378
tuned_params_com['in.tstat_htg_sp_f..f_com'] = \
{'n_estimators': 905, 'max_depth': 7, 'num_leaves': 203,
 'learning_rate': 0.08990569979001954, 'min_child_samples': 62,
 'min_child_weight': 8.371358945304257, 'subsample': 0.5108362241099114,
 'scale_pos_weight': 2, 'colsample_bytree': 0.8565858005727937,
 'reg_alpha': 1.9613936967546285e-05, 'reg_lambda': 1.6480916334949594e-08}


# F1 score: 0.518247047502408
tuned_params_com['in.tstat_clg_sp_f..f_com'] =\
{'n_estimators': 291, 'max_depth': 4, 'num_leaves': 83,
 'learning_rate': 0.25758042044965385, 'min_child_samples': 9,
 'min_child_weight': 9.175461752314197, 'subsample': 0.4499538211206938,
 'scale_pos_weight': 5, 'colsample_bytree': 0.823353726262743,
 'reg_alpha': 0.0027062361196170865, 'reg_lambda': 0.2042564684055546}

# F1 score: 0.26215467147714877
tuned_params_com['in.weekday_opening_time..hr_com'] =\
{'n_estimators': 348, 'max_depth': 31, 'num_leaves': 138,
 'learning_rate': 0.03888434262515391, 'min_child_samples': 45,
 'min_child_weight': 1.1109450873970528, 'subsample': 0.8063239893252891,
 'scale_pos_weight': 5, 'colsample_bytree': 0.5176944787284479,
 'reg_alpha': 0.0005115978825365721, 'reg_lambda': 8.712227579337613e-06}

#0.1995791303369906 0.2050631416306615 f1
tuned_params_com['in.weekday_operating_hours..hr_com'] = \
{'n_estimators': 833, 'max_depth': 25, 'num_leaves': 229,
 'learning_rate': 0.10743457607271996, 'min_child_samples': 79,
 'min_child_weight': 0.032893547715672766, 'subsample': 0.9676303652176244,
 'scale_pos_weight': 5, 'colsample_bytree': 0.5836586160140631,
 'reg_alpha': 1.4559384884652598e-07, 'reg_lambda': 1.0991167895637697e-05}


## Developing Models for Commercial Buildings

In [None]:
def fit_com_models(df, min_n=6):
    """
    Fits models for each commercial building target type by iterating through the data,
    preprocessing features, selecting relevant ones, and using optimized hyperparameters
    to train models. The models, encoders, and preprocessing pipelines are stored for
    future use in prediction on test data.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing features and target values of the different commercial building target types.

    min_n : int, optional (default=6)
        Occurrence threshold for target classes. Within each target type, only classes that
        occur more than `min_n` times are included in the training process.

    Returns
    -------
    models_dict : dict
        Dictionary storing the trained model pipelines for each commercial building target type.

    enc_dict : dict
        Dictionary storing the label encoder objects for each commercial building target type.

    prepro_dict : dict
        Dictionary storing the data preprocessing pipelines for each commercial building
        target type.

    Notes
    -----
    - The function uses StratifiedKFold cross-validation with 5 splits.
    - The task is classified as 'binary' if the number of unique target values is 2, and 
    'multiclass' otherwise.
    - Optimized hyperparameters for each target type are retrieved from a pre-defined dictionary 
    `tuned_params_com`.
    """

    # Initialize lists and dictionaries to store results
    f1s = []

    models_dict = {}
    enc_dict = {}
    prepro_dict = {}

    # Iterate over each unique target type
    for t_type in df["target_type"].unique():
        sub_df = df[df["target_type"] == t_type]

        # Keep only target classes with more than `min_n` observations
        selected_classes = [
            targ
            for targ in sub_df["target"].unique()
            if sub_df[sub_df["target"] == targ].shape[0] > min_n
        ]

        filt_df = sub_df[sub_df["target"].isin(selected_classes)]

        # Separate features and target
        X = filt_df.drop(
            ["building_stock_type", "target", "target_type"], axis=1
        ).reset_index(drop=True)
        y = filt_df["target"]

        # Encode target labels
        lenc = LabelEncoder()
        y_enc = pd.Series(lenc.fit_transform(y))

        # Define preprocessing pipeline
        prep_pipe = Pipeline(
            [
                (
                    "tenc",
                    ColumnTransformer(
                        [
                            (
                                "Oencode",
                                TargetEncoder(),
                                ["peak_timeofday", "build_state"],
                            )
                        ],
                        remainder="passthrough",
                    ),
                ),
                ("scaler", MinMaxScaler()),
            ]
        )

        # Apply preprocessing pipeline to features
        prepped_X = prep_pipe.fit_transform(X, y_enc)

        # Select relevant features for each commercial building type
        feature_filtered = prepped_X[:, com_relevant_indices[t_type]]

        # Retrieve optimized hyperparameters for the target type
        com_params_cl = tuned_params_com.get(t_type)
        hyp_ccl = com_params_cl

        # Define the task type and number of classes
        task = "multiclass" if y.nunique() > 2 else "binary"
        num_class = y.nunique() if y.nunique() > 2 else 1

        prepped_df = pd.DataFrame(feature_filtered)
        kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

        models = []
        oof_pred = []

        # Stratified K-Fold cross-validation
        for train_index, test_index in kf.split(prepped_df, y_enc):
            X_train, X_test = prepped_df.iloc[train_index], prepped_df.iloc[test_index]
            y_train, y_test = y_enc.iloc[train_index], y_enc.iloc[test_index]

            # Define and train the model pipeline
            model = Pipeline(
                [
                    (
                        "model",
                        LGBMClassifier(
                            verbose=-1,
                            device_type="cpu",
                            objective=task,
                            num_class=num_class,
                            random_state=42,
                            **hyp_ccl,
                        ),
                    )
                ]
            )

            model.fit(X_train, y_train)
            preds = model.predict(X_test)
            oof_pred.append(f1_score(y_test, preds, average="weighted"))
            models.append(model)

        # Calculate and print the F1 score for the target type
        f1 = np.mean(oof_pred)
        print(f"{t_type} test f1: {f1}")

        # Store the trained model, encoder, and preprocessing pipeline
        models_dict[t_type] = models
        enc_dict[t_type] = lenc
        prepro_dict[t_type] = prep_pipe

        f1s.append(f1)

    # Print the average F1 score across all target types
    print(f"Average F1: {np.mean(f1s)}")

    return models_dict, enc_dict, prepro_dict


In [None]:
comclas_models,comclas_encs,comclas_prepro = fit_com_models(combo_com_clas_df)
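
As a minimal sketch of two decisions made inside `fit_com_models`, the snippet below reproduces the class filter (a class is kept only if it occurs strictly more than `min_n` times) and the choice between a binary and a multiclass objective. The helper names `keep_frequent_classes` and `lgbm_task` and the toy labels are hypothetical; they are not part of the notebook.

```python
from collections import Counter


def keep_frequent_classes(labels, min_n=6):
    """Return the classes that occur strictly more than `min_n` times."""
    counts = Counter(labels)
    return {cls for cls, n in counts.items() if n > min_n}


def lgbm_task(labels):
    """Mirror the objective / num_class choice made before fitting."""
    n_classes = len(set(labels))
    if n_classes > 2:
        return "multiclass", n_classes
    return "binary", 1


# Hypothetical toy labels: "A" x8, "B" x7, "C" x6, "D" x2
labels = ["A"] * 8 + ["B"] * 7 + ["C"] * 6 + ["D"] * 2

kept = keep_frequent_classes(labels, min_n=6)
filtered = [lab for lab in labels if lab in kept]
task, num_class = lgbm_task(filtered)

print(sorted(kept), task, num_class)
```

Note that class "C", with exactly 6 occurrences, is dropped because the comparison is strict, so the filtered toy problem becomes binary.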

# Residential Targets

## Residential Building Data Preparation

In [None]:
# filtering the complete data to keep residential buildings only
res_targets = [col for col in train_data_comp.columns if col.endswith('_res')]
res_df = train_data_comp[~(train_data_comp[res_targets[0]].isna())]
res_df = res_df[[col for col in res_df if not col.endswith('_com')]]

# dropping feature columns with only one unique value
res_df.drop(constant_cols,axis=1,inplace=True)

# applying create_combo_task function to condense the residential building target types
# into one single target column
combo_res_clas_df = create_combo_task(res_df,res_targets)

## Residential Building Feature Selection
Defining the indices of the features relevant for predicting each residential building target type.
These indices refer to column positions in the preprocessed feature matrix and were identified in a separate feature selection process.
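
The following sketch illustrates how such an index list is applied, assuming the preprocessed features are a NumPy array as returned by the preprocessing pipeline. The toy matrix and the key `example_target` are hypothetical. Because the `ColumnTransformer` uses `remainder="passthrough"`, its output places the two target-encoded columns first and the untouched columns after them, so the indices refer to that reordered layout rather than to the original DataFrame column order.

```python
import numpy as np

# Hypothetical toy matrix after preprocessing: 4 buildings x 6 features.
# Columns 0 and 1 stand in for the target-encoded columns, which the
# ColumnTransformer emits first; columns 2..5 stand in for passthrough columns.
prepped_X = np.arange(24, dtype=float).reshape(4, 6)

# Hypothetical selection for one target type (the real lists are much longer).
relevant_indices = {"example_target": [1, 4, 2]}

# Integer-array indexing keeps every row and picks the listed columns,
# in the order in which the indices are listed.
selected = prepped_X[:, relevant_indices["example_target"]]

print(selected.shape)
print(selected[0])
```

The index lists do not need to be sorted; the selected columns simply follow the order of the list, which is harmless as long as the same list is used at training and prediction time.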

In [None]:
# dictionary to store each target type as a key and the relevant feature indices as values
res_relevant_indices = {}

In [None]:
res_relevant_indices['in.geometry_building_type_recs_res'] = \
[1, 1027, 517, 10, 525, 1550, 13, 25, 2078, 32, 2081, 2080, 35, 36, 1573, 547, 42, 44,
 49, 51, 53, 54, 2101, 60, 64, 67, 68, 72, 78, 85, 87, 1117, 95, 1127, 1128, 617, 618,
 624, 1656, 633, 121, 635, 123, 636, 638, 640, 641, 642, 131, 643, 135, 139, 662, 1176,
 155, 674, 164, 166, 170, 173, 1201, 178, 1206, 185, 190, 1216, 1217, 1220, 1221, 1222,
 1739, 1233, 726, 220, 221, 1758, 736, 225, 236, 753, 755, 756, 252, 1796, 267, 272, 278,
 282, 794, 1819, 801, 814, 303, 307, 311, 319, 835, 836, 1861, 337, 338, 339, 849, 341,
 342, 346, 349, 352, 870, 1387, 1905, 370, 1404, 1409, 1936, 1937, 1939, 1433, 1947,
 924, 1953, 421, 428, 1965, 434, 436, 948, 950, 1461, 949, 1977, 954, 955, 956, 957,
 958, 959, 1983, 446, 962, 965, 966, 454, 456, 1490, 1491, 2008, 1498, 488, 2026, 1021]

res_relevant_indices['in.geometry_foundation_type_res'] = \
[1, 514, 519, 521, 2058, 524, 1548, 1552, 17, 1554, 532, 25, 2077, 2078, 547, 548, 39,
 1064, 558, 51, 1595, 574, 1615, 605, 608, 105, 617, 106, 626, 116, 117, 629, 1652, 122,
 123, 634, 635, 636, 639, 640, 644, 646, 652, 653, 663, 1176, 666, 155, 1184, 177, 178,
 179, 185, 187, 190, 191, 1214, 193, 1730, 708, 1220, 1221, 1222, 718, 720, 208, 1746,
 1745, 217, 730, 731, 220, 221, 222, 223, 227, 233, 747, 751, 242, 755, 248, 253, 1792,
 266, 275, 279, 1819, 802, 804, 1834, 810, 1837, 302, 1839, 307, 1332, 1848, 314, 317,
 832, 833, 323, 1861, 329, 333, 339, 344, 345, 346, 353, 367, 370, 1399, 375, 379, 387,
 393, 397, 406, 418, 421, 425, 428, 1969, 436, 437, 1462, 444, 445, 446, 1983, 457, 1481,
 1996, 466, 472, 483, 484, 492, 497, 501, 2045, 511]

res_relevant_indices['in.geometry_floor_area_res'] = \
[1024, 1, 1025, 1539, 4, 1541, 2054, 1542, 1032, 2060, 1039, 528, 2064, 530, 19, 2066,
 1045, 536, 25, 32, 35, 547, 1059, 39, 1578, 42, 44, 47, 2097, 49, 51, 53, 2102, 54,
 2103, 63, 66, 72, 1611, 84, 85, 87, 607, 95, 1642, 1131, 1133, 110, 624, 116, 630, 631,
 123, 635, 124, 1662, 642, 131, 155, 156, 165, 168, 178, 179, 692, 1202, 1206, 182, 184,
 189, 190, 707, 1220, 1221, 1222, 1224, 1226, 1746, 727, 221, 1249, 241, 1782, 1277, 265,
 1296, 1820, 287, 813, 315, 828, 829, 830, 831, 835, 836, 1362, 1875, 343, 1886, 360,
 1897, 1389, 1390, 372, 375, 1915, 1408, 387, 1413, 1414, 1933, 1424, 1937, 1425, 1430,
 1431, 1432, 1433, 1955, 1957, 428, 1967, 433, 1458, 436, 955, 958, 1470, 960, 961, 962,
 964, 1476, 456, 1483, 1488, 2008, 2011, 992, 2017, 994, 995, 997, 1001, 1002, 492, 1004,
 1005, 1007, 1008, 1009, 1010, 1013, 1535]

res_relevant_indices['in.geometry_wall_type_res'] = \
[1, 1027, 516, 1540, 522, 523, 1020, 525, 524, 1553, 529, 532, 533, 2075, 2078, 1060,
 559, 2095, 560, 575, 577, 592, 83, 84, 596, 602, 608, 105, 117, 1141, 1656, 634, 123,
 637, 638, 640, 642, 645, 1669, 1672, 650, 1676, 652, 656, 657, 660, 661, 662, 1176, 156,
 673, 674, 1701, 1196, 175, 1200, 689, 178, 691, 1725, 190, 702, 193, 705, 1737, 1227,
 1741, 721, 1234, 211, 725, 726, 1752, 1759, 1259, 751, 754, 755, 756, 246, 1273, 1277,
 1279, 775, 801, 803, 1829, 805, 808, 1839, 303, 1844, 1845, 1334, 1847, 826, 315, 319,
 836, 1861, 1353, 339, 1380, 360, 1384, 1898, 364, 1908, 375, 1917, 395, 1931, 1934, 402,
 407, 1431, 1437, 1952, 1441, 1447, 423, 426, 427, 428, 1962, 1454, 1461, 1462, 439, 443,
 444, 446, 1993, 458, 1490, 1491, 2004, 1501, 478, 1503, 2013, 1008, 503, 1529, 506, 508]

res_relevant_indices['in.income_res'] = \
[1024, 1, 1538, 1544, 525, 526, 17, 531, 2067, 19, 1556, 25, 1565, 1566, 30, 1059, 1060,
 550, 552, 555, 2095, 2097, 51, 564, 1080, 1087, 576, 1096, 72, 75, 78, 1616, 592, 1618,
 596, 602, 606, 610, 613, 106, 1133, 1134, 627, 116, 1139, 632, 633, 122, 634, 1144,
 1660, 639, 1666, 131, 132, 644, 1667, 658, 661, 155, 156, 677, 178, 185, 188, 190,
 1216, 192, 706, 709, 710, 1223, 712, 713, 1228, 1741, 1229, 720, 1233, 1235, 1748, 725,
 1245, 735, 1775, 1778, 1779, 1266, 1269, 1271, 1278, 1289, 1294, 1807, 1811, 277, 283,
 1308, 1826, 295, 1323, 301, 1837, 815, 1328, 817, 818, 1843, 1334, 1337, 826, 1852, 1342,
 831, 1858, 323, 1861, 1351, 1354, 332, 333, 1356, 1876, 1365, 1367, 1372, 1884, 349, 353,
 1890, 1377, 1889, 1894, 1895, 359, 1897, 365, 1901, 1402, 1917, 381, 1409, 1922, 1923,
 1412, 1411, 1926, 391, 392, 395, 1931, 1419, 1936, 1425, 402, 1938, 400, 401, 406, 1430,
 1431, 404, 1946, 1436, 1948, 414, 416, 1956, 424, 1451, 1458, 1460, 1972, 1461, 439,
 1974, 1978, 444, 1469, 1468, 1983, 1982, 1985, 963, 1477, 454, 1990, 1991, 458, 460,
 974, 1998, 463, 1488, 1491, 2005, 2010, 2011, 477, 2015, 992, 999, 488, 496, 498, 506,
 503, 1017, 1018, 507, 1535]

res_relevant_indices['in.roof_material_res'] = \
[1, 517, 520, 1032, 525, 1039, 528, 17, 1558, 25, 2075, 541, 545, 547, 51, 1588, 54,
 60, 585, 588, 87, 600, 1131, 620, 110, 1137, 1652, 1653, 1656, 634, 123, 635, 636,
 637, 640, 1665, 643, 645, 135, 648, 651, 654, 669, 672, 674, 165, 678, 1193, 682, 172,
 173, 687, 1200, 1202, 179, 1722, 187, 190, 704, 708, 1224, 716, 1741, 1230, 719, 720,
 728, 730, 731, 222, 736, 1250, 1261, 755, 756, 767, 257, 258, 1799, 269, 1294, 1293,
 1807, 273, 271, 1295, 1301, 279, 1818, 795, 1823, 803, 1831, 1835, 814, 303, 1329, 307,
 1844, 822, 311, 1848, 1337, 1850, 823, 829, 1341, 319, 1856, 836, 1861, 1862, 327, 1354,
 331, 1869, 337, 1362, 339, 340, 855, 1880, 346, 358, 361, 878, 879, 374, 375, 1405, 1925,
 392, 1420, 1421, 398, 399, 1936, 1426, 1939, 1428, 1941, 1940, 1432, 1946, 411, 1438,
 1440, 420, 1449, 1961, 939, 428, 429, 431, 1457, 434, 1460, 436, 1462, 1974, 1981, 446,
 447, 1986, 966, 456, 1993, 1480, 459, 1997, 975, 1488, 1489, 1490, 2002, 473, 2010, 1505,
 995, 488, 490, 1517, 496, 501, 506, 507]

res_relevant_indices['in.heating_fuel_res'] = \
[1, 1527, 516, 522, 527, 17, 1554, 1555, 1556, 2071, 25, 2079, 2080, 1059, 548, 2093, 51,
 2100, 2099, 578, 582, 75, 593, 598, 87, 601, 91, 104, 617, 618, 1131, 110, 117, 118,
 1655, 119, 122, 123, 1659, 124, 637, 638, 1664, 640, 642, 131, 643, 639, 646, 1158, 1162,
 652, 1166, 656, 657, 658, 674, 678, 174, 688, 179, 693, 696, 697, 190, 191, 702, 193,
 198, 714, 202, 716, 1741, 719, 1232, 721, 1235, 724, 213, 726, 728, 1755, 731, 734, 736,
 755, 1779, 243, 1278, 255, 259, 267, 268, 1300, 794, 1820, 1822, 290, 802, 807, 303, 307,
 311, 832, 321, 323, 835, 329, 337, 1873, 340, 1876, 349, 1885, 351, 1374, 358, 360, 367,
 371, 387, 1924, 389, 398, 400, 402, 1943, 1946, 412, 416, 418, 419, 421, 423, 428, 1965,
 1966, 1456, 439, 1978, 444, 445, 447, 449, 1991, 461, 1498, 1500, 476, 1503, 1504, 2017,
 481, 493, 495, 503, 1532]

res_relevant_indices['in.tenure_res'] = \
[0, 1, 3, 1542, 1548, 525, 1551, 17, 19, 532, 23, 536, 25, 30, 547, 38, 49, 51, 54, 1079,
 60, 573, 66, 75, 591, 594, 84, 90, 93, 95, 1127, 1128, 104, 1133, 110, 624, 1650, 1651,
 118, 632, 120, 634, 121, 122, 1659, 123, 1664, 641, 131, 139, 652, 660, 661, 155, 673,
 1186, 1190, 166, 168, 172, 173, 685, 1719, 183, 1220, 1224, 716, 1741, 208, 721, 726,
 1240, 222, 1758, 1759, 1253, 1775, 1269, 757, 760, 1791, 1279, 1282, 260, 261, 263, 265,
 267, 1807, 273, 1810, 1297, 277, 278, 279, 1813, 801, 1314, 1829, 1321, 303, 1330, 307,
 822, 310, 825, 826, 1850, 831, 323, 836, 835, 1356, 337, 340, 1367, 1370, 1883, 1885,
 861, 1887, 1889, 1380, 358, 359, 1385, 1898, 366, 1391, 370, 375, 1405, 383, 384, 1409,
 1926, 391, 1928, 393, 394, 1931, 1421, 1422, 1936, 1428, 1429, 1943, 1433, 1946, 410,
 928, 420, 1961, 1449, 430, 1966, 944, 942, 1972, 437, 950, 1975, 1462, 1467, 447, 959,
 1983, 962, 451, 963, 965, 1479, 967, 1995, 1998, 1495, 1497, 2010, 1505, 2020, 491, 492]

In [None]:
res_relevant_indices['in.vacancy_status_res'] = \
[6, 18, 1554, 19, 24, 2075, 2076, 2079, 33, 2082, 36, 2084, 2085, 39, 42, 45, 51, 1589,
 54, 57, 1594, 63, 576, 577, 1089, 69, 72, 75, 78, 81, 87, 89, 91, 92, 94, 96, 97, 619,
 622, 1139, 116, 1144, 1664, 131, 132, 1672, 1681, 155, 1180, 174, 177, 178, 693, 183,
 185, 186, 187, 189, 191, 1227, 724, 1752, 221, 734, 736, 1251, 1252, 749, 1776, 1777,
 1778, 757, 1784, 1785, 1792, 1796, 1798, 263, 264, 266, 267, 273, 794, 795, 796, 1311,
 287, 802, 806, 807, 1836, 1331, 1845, 1340, 828, 830, 831, 833, 834, 835, 836, 1865,
 1365, 1890, 358, 365, 378, 411, 1436, 1961, 1980, 447, 995, 2019, 1000, 1001, 2025, 1523]

res_relevant_indices['in.vintage_res'] = \
[1, 2053, 517, 518, 10, 11, 526, 527, 1556, 536, 2075, 1565, 1567, 1570, 547, 2089,
 555, 560, 2097, 1074, 564, 567, 1080, 56, 569, 1600, 577, 72, 1608, 587, 591, 596,
 602, 613, 620, 1133, 1134, 623, 624, 1135, 116, 117, 1656, 634, 123, 1147, 637, 1662,
 640, 643, 133, 647, 652, 656, 665, 676, 682, 687, 177, 689, 692, 694, 697, 185, 190,
 191, 1729, 1217, 709, 1222, 711, 1227, 1743, 723, 1748, 1242, 1755, 225, 1765, 231,
 1258, 237, 755, 1267, 246, 1274, 1791, 773, 264, 777, 266, 267, 273, 275, 1305, 794,
 795, 1821, 1824, 1826, 1314, 295, 1832, 1834, 1839, 1327, 818, 306, 307, 1842, 311,
 1336, 1338, 827, 1851, 318, 1345, 323, 327, 1864, 1865, 1353, 329, 333, 1361, 338,
 339, 1364, 340, 1366, 345, 346, 347, 1376, 353, 354, 355, 356, 1889, 1896, 1901, 1392,
 1395, 1396, 1397, 379, 1403, 1406, 1410, 1923, 1925, 389, 392, 1417, 396, 1935, 1425,
 1426, 1429, 1943, 407, 1945, 1434, 1951, 432, 1460, 438, 1980, 445, 447, 448, 1985, 450,
 453, 454, 455, 1992, 970, 460, 465, 466, 467, 2004, 1492, 1496, 477, 2018, 483, 996,
 2021, 994, 488, 2029, 494, 507, 1531, 508]

res_relevant_indices['in.bedrooms_res'] = \
[1, 2050, 1538, 516, 1539, 1034, 10, 525, 2063, 2066, 19, 1558, 25, 30, 1568, 36, 1574,
 45, 46, 49, 52, 54, 2103, 1080, 1084, 67, 72, 83, 604, 617, 1643, 1133, 110, 622, 624,
 1137, 1136, 116, 635, 123, 1659, 124, 132, 648, 655, 1172, 661, 155, 1182, 1184, 674,
 165, 173, 177, 178, 179, 182, 185, 186, 187, 189, 190, 707, 711, 1224, 716, 1741, 1743,
 1746, 731, 1759, 1249, 1262, 754, 1778, 253, 1790, 767, 1285, 265, 266, 267, 268, 1294,
 275, 278, 280, 287, 288, 801, 290, 294, 806, 298, 1836, 1837, 813, 814, 303, 1328, 815,
 307, 817, 1327, 1332, 824, 826, 315, 1850, 830, 319, 834, 323, 1860, 1863, 327, 343,
 1883, 1885, 352, 1890, 357, 1898, 1387, 367, 1397, 1407, 1408, 388, 1414, 1928, 1933,
 1937, 1426, 1430, 1432, 409, 1946, 1433, 415, 1953, 1442, 1450, 1453, 944, 1970, 947,
 436, 951, 954, 444, 958, 959, 962, 963, 1475, 967, 1483, 1491, 2011, 2015, 995, 2020,
 998, 1005, 1007, 1523, 1012]

res_relevant_indices['in.cooling_setpoint_res'] = \
[1, 2, 517, 2055, 521, 11, 528, 1552, 2067, 537, 25, 2077, 1569, 547, 2083, 1573, 1574,
 551, 2089, 1071, 51, 564, 2100, 2099, 2103, 1080, 1087, 75, 1615, 81, 594, 596, 84,
 1626, 608, 611, 100, 615, 1136, 116, 630, 122, 123, 1146, 635, 637, 638, 640, 1156,
 649, 653, 662, 1176, 1695, 672, 673, 1709, 686, 689, 691, 692, 693, 180, 695, 697, 186,
 187, 190, 191, 1729, 1730, 1225, 210, 213, 728, 729, 730, 220, 1759, 1775, 756, 1783,
 1272, 1791, 258, 260, 1284, 268, 1807, 271, 277, 794, 287, 1312, 1313, 1314, 1316, 1829,
 805, 1319, 1321, 1833, 1837, 1838, 1841, 1329, 307, 1331, 1843, 1848, 825, 826, 1853,
 832, 833, 835, 1866, 1870, 1872, 337, 1360, 339, 347, 1884, 1373, 1371, 351, 352, 1377,
 355, 1894, 1383, 363, 1900, 368, 1392, 1396, 1399, 1913, 379, 380, 1405, 1411, 1924,
 1928, 1929, 1417, 1930, 1933, 1421, 1424, 406, 1433, 410, 1947, 412, 1437, 415, 1439,
 419, 1958, 1960, 1962, 1963, 1964, 428, 430, 437, 1465, 1467, 1468, 445, 1980, 446,
 448, 449, 450, 455, 1998, 477, 480, 481, 488, 496, 500, 508, 510]

res_relevant_indices['in.heating_setpoint_res'] = \
[1, 514, 518, 1542, 1554, 1558, 537, 2076, 2078, 2079, 33, 2082, 1059, 36, 1060, 39, 552,
 45, 1071, 51, 564, 572, 574, 63, 1600, 577, 1090, 1601, 66, 1602, 583, 72, 586, 588,
 592, 594, 1618, 595, 598, 599, 600, 89, 607, 612, 613, 614, 1133, 1134, 116, 118, 1654,
 634, 123, 122, 1147, 1662, 639, 131, 1681, 155, 1192, 681, 177, 693, 183, 185, 186, 187,
 189, 190, 1213, 1219, 1224, 1738, 205, 728, 1752, 1754, 1755, 1251, 1253, 1254, 1776,
 753, 1778, 1792, 1793, 264, 265, 267, 1803, 269, 271, 1807, 277, 1819, 796, 287, 802,
 803, 1826, 1828, 299, 1326, 1838, 307, 1848, 825, 1340, 829, 830, 831, 832, 834, 835,
 836, 1861, 327, 1352, 1865, 1355, 1870, 1871, 337, 345, 1883, 1887, 353, 355, 1898, 1389,
 369, 1410, 387, 1925, 1933, 1422, 401, 1943, 1435, 415, 1449, 1450, 427, 428, 426, 1458,
 1983, 1985, 453, 454, 974, 1490, 2002, 1492, 469, 1494, 474, 987, 476, 479, 482, 486,
 2023, 491, 2029, 494, 499, 508]

## Residential Building Hyperparameter Tuning
Defining the optimised hyperparameter values used by the model dedicated to each residential building target type.
These were identified in a separate Optuna optimisation process; the score in the comment above each dictionary is the F1 recorded for that configuration.
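
The search itself is not part of this notebook. As a rough sketch only, the function below shows what a search space over the same eleven hyperparameters could look like when written against the `suggest_int` / `suggest_float` interface of an Optuna trial. All ranges are assumptions chosen to contain the tuned values listed here, not the ranges actually used, and `MidpointTrial` is a hypothetical stand-in that lets the sketch run without Optuna installed.

```python
import math


def suggest_lgbm_params(trial):
    """Sample one LightGBM configuration from an ASSUMED search space."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 1, 35),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "min_child_weight": trial.suggest_float("min_child_weight", 1e-3, 10.0, log=True),
        "subsample": trial.suggest_float("subsample", 0.4, 1.0),
        "scale_pos_weight": trial.suggest_int("scale_pos_weight", 1, 6),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 1.0, log=True),
    }


class MidpointTrial:
    """Hypothetical stand-in for an Optuna trial: returns range midpoints."""

    def suggest_int(self, name, low, high):
        return (low + high) // 2

    def suggest_float(self, name, low, high, log=False):
        # Geometric midpoint for log-scaled ranges, arithmetic otherwise.
        return math.sqrt(low * high) if log else (low + high) / 2


params = suggest_lgbm_params(MidpointTrial())
print(sorted(params))
```

With Optuna available, an objective that calls `suggest_lgbm_params(trial)`, fits the classifier, and returns the mean cross-validated weighted F1 could be passed to `study.optimize` on a study created with `optuna.create_study(direction="maximize")`.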

In [None]:
tuned_params_res = {}

In [None]:
# 0.6958745369005935 f1: 
tuned_params_res['in.geometry_building_type_recs_res'] =\
{'n_estimators': 934, 'max_depth': 3, 'num_leaves': 123,
 'learning_rate': 0.2945480376671004, 'min_child_samples': 34,
 'min_child_weight': 6.167802479295942, 'subsample': 0.8909410473215246,
 'scale_pos_weight': 4, 'colsample_bytree': 0.7543807422433793,
 'reg_alpha': 0.00011906921825675507, 'reg_lambda': 4.8485395544580945e-08}

# 0.5188565122370529 f1: 
tuned_params_res['in.geometry_foundation_type_res'] =\
{'n_estimators': 922, 'max_depth': 28, 'num_leaves': 103,
 'learning_rate': 0.09849562522357214, 'min_child_samples': 89,
 'min_child_weight': 6.187944536726171, 'subsample': 0.8149723490572682,
 'scale_pos_weight': 4, 'colsample_bytree': 0.4426110188002679,
 'reg_alpha': 0.00010014981523703602, 'reg_lambda': 0.00023407513126556}

# 0.3480546863006969 f1
tuned_params_res['in.geometry_floor_area_res'] = \
{'n_estimators': 910, 'max_depth': 25, 'num_leaves': 102,
 'learning_rate': 0.07279111345034658, 'min_child_samples': 63,
 'min_child_weight': 3.2526461228236947, 'subsample': 0.9480222295341681,
 'scale_pos_weight': 4, 'colsample_bytree': 0.9628635623120934,
 'reg_alpha': 0.00027927416239330717, 'reg_lambda': 0.004003317689316492}

# 0.7699896053788151 f1: 
tuned_params_res['in.geometry_wall_type_res'] =\
{'n_estimators': 600, 'max_depth': 3, 'num_leaves': 104,
 'learning_rate': 0.24648477843229688, 'min_child_samples': 65,
 'min_child_weight': 1.6775679111282833, 'subsample': 0.4329242857212138,
 'scale_pos_weight': 5, 'colsample_bytree': 0.4003413875429718,
 'reg_alpha': 0.01216804184830573, 'reg_lambda': 4.694397179386468e-05}

# 0.10280264644713603 f1: 
tuned_params_res['in.income_res'] =\
{'n_estimators': 684, 'max_depth': 13, 'num_leaves': 117,
 'learning_rate': 0.14516046183367184, 'min_child_samples': 19,
 'min_child_weight': 7.227114697099523, 'subsample': 0.9226131920085651,
 'scale_pos_weight': 6, 'colsample_bytree': 0.6485099688165997,
 'reg_alpha': 3.024497277471494e-05, 'reg_lambda': 0.00023040655178930457}

# 0.5404408272513648 f1
tuned_params_res['in.roof_material_res'] = \
{'n_estimators': 803, 'max_depth': 24, 'num_leaves': 145,
 'learning_rate': 0.016910825931978325, 'min_child_samples': 40,
 'min_child_weight': 7.771112155902316, 'subsample': 0.44775552808959174,
 'scale_pos_weight': 3, 'colsample_bytree': 0.6769161871481143,
 'reg_alpha': 2.6435425761672062e-08, 'reg_lambda': 0.008235783409096475}


In [None]:
# 0.6888482971159382  f1: 
tuned_params_res['in.heating_fuel_res'] =\
{'n_estimators': 335, 'max_depth': 7, 'num_leaves': 51,
 'learning_rate': 0.12963879325098038, 'min_child_samples': 33,
 'min_child_weight': 6.257192439904964, 'subsample': 0.5425612366413691,
 'scale_pos_weight': 6, 'colsample_bytree': 0.7958368532059044,
 'reg_alpha': 5.97882134647585e-08, 'reg_lambda': 3.173451776830115e-05}

#0.7702281848379628 f1: 
tuned_params_res['in.tenure_res'] =\
{'n_estimators': 930, 'max_depth': 27, 'num_leaves': 183,
 'learning_rate': 0.05958169486618951, 'min_child_samples': 67,
 'min_child_weight': 2.0933971144187558, 'subsample': 0.5822419223973381,
 'scale_pos_weight': 2, 'colsample_bytree': 0.5297643605000292,
 'reg_alpha': 3.0331132949376417e-05, 'reg_lambda': 1.3852796948184832e-05}

# 0.9976882365296991 f1: 
tuned_params_res['in.vacancy_status_res'] =\
{'n_estimators': 767, 'max_depth': 1, 'num_leaves': 174,
 'learning_rate': 0.16599487617300912, 'min_child_samples': 90,
 'min_child_weight': 0.7710019776202889, 'subsample': 0.5317999966923885,
 'scale_pos_weight': 4, 'colsample_bytree': 0.4505082905425865,
 'reg_alpha': 3.792576185268018e-05, 'reg_lambda': 0.007801565273536475}

# 0.2048327110306317 f1
tuned_params_res['in.vintage_res'] = \
{'n_estimators': 740, 'max_depth': 35, 'num_leaves': 35,
 'learning_rate': 0.09298959300141071, 'min_child_samples': 92,
 'min_child_weight': 8.47911871022102, 'subsample': 0.40431044203566197,
 'scale_pos_weight': 5, 'colsample_bytree': 0.6864178335889475,
 'reg_alpha': 0.43398145576873853, 'reg_lambda': 6.576364002484857e-07}


In [None]:
# 0.5075012873030607 f1: 
tuned_params_res['in.bedrooms_res'] =\
{'n_estimators': 735, 'max_depth': 34, 'num_leaves': 241,
 'learning_rate': 0.060871086872348566, 'min_child_samples': 69,
 'min_child_weight': 0.4737265426136331, 'subsample': 0.8986333166383806,
 'scale_pos_weight': 5, 'colsample_bytree': 0.8935561882173697,
 'reg_alpha': 0.00041647170063019743, 'reg_lambda': 1.4111607171914601e-05}

# 0.2523593670534726  f1: 
tuned_params_res['in.cooling_setpoint_res'] =\
{'n_estimators': 480, 'max_depth': 11, 'num_leaves': 97,
 'learning_rate': 0.08270776904857623, 'min_child_samples': 48,
 'min_child_weight': 5.010895136096351, 'subsample': 0.5761480486298616,
 'scale_pos_weight': 3, 'colsample_bytree': 0.9262180465285766,
 'reg_alpha': 2.9482770022566384e-07, 'reg_lambda': 0.005365434491707699}

#0.33351148735202124 f1: 
tuned_params_res['in.heating_setpoint_res'] =\
{'n_estimators': 941, 'max_depth': 1, 'num_leaves': 124,
 'learning_rate': 0.2776918947250579, 'min_child_samples': 62,
 'min_child_weight': 7.94405862197077, 'subsample': 0.9164496651494268,
 'scale_pos_weight': 2, 'colsample_bytree': 0.9702796578314632,
 'reg_alpha': 0.005468991713660568, 'reg_lambda': 0.007772734581335458}

## Developing Models for Residential Buildings

In [None]:
def fit_residential_models(df, min_occurrences=6):
    """
    Trains classification models for each residential building target type in the dataset.

    This function iterates through each unique target type in the dataset, performs data 
    preprocessing and feature selection, trains the model using pre-optimized hyperparameters, 
    evaluates the model using stratified k-fold cross-validation, and stores the trained models 
    and preprocessing pipelines.

    Parameters
    ----------
    df : pd.DataFrame
        A dataframe containing features and target values for different residential building types.
        It should have at least the following columns: 'building_stock_type', 'target', and 
        'target_type'.

    min_occurrences : int, optional, default=6
        Occurrence threshold for target classes. Only classes that occur more than
        `min_occurrences` times are included in the model training process; all others are excluded.

    Returns
    -------
    models_dict : dict
        A dictionary where the keys are the residential building target types, and the 
        values are lists of trained model pipelines (one model per fold of cross-validation).

    enc_dict : dict
        A dictionary where the keys are the residential building target types, and the values are
        `LabelEncoder` objects used to encode the target labels.

    prepro_dict : dict
        A dictionary where the keys are the residential building target types, and the values are
        preprocessing pipelines used to prepare the data before model training.

    Notes
    -----
    - The model uses LightGBM (LGBMClassifier) with optimized hyperparameters for each target type.
    - Categorical features such as 'peak_timeofday' and 'build_state' are encoded using 
    target encoding.
    - The models are evaluated using weighted F1-score for cross-validation.
    """

    # Initialize dictionaries to store models, encoders, and preprocessing pipelines
    models_dict = {}
    enc_dict = {}
    prepro_dict = {}
    f1_scores = []

    for target_type in df["target_type"].unique():
        # Subset the dataframe for the current target type
        sub_df = df[df["target_type"] == target_type]

        # Select target classes that have more than `min_occurrences`
        selected_classes = [
            targ
            for targ in sub_df["target"].unique()
            if sub_df[sub_df["target"] == targ].shape[0] > min_occurrences
        ]
        filtered_df = sub_df[sub_df["target"].isin(selected_classes)]

        # Separate features (X) and target (y)
        X = filtered_df.drop(
            ["building_stock_type", "target", "target_type"], axis=1
        ).reset_index(drop=True)
        y = filtered_df["target"]

        # Encode target labels
        label_encoder = LabelEncoder()
        y_encoded = pd.Series(label_encoder.fit_transform(y))

        # Data preprocessing pipeline: TargetEncoder for selected categorical features,
        # followed by MinMaxScaler
        preprocessing_pipeline = Pipeline(
            [
                (
                    "target_encoding",
                    ColumnTransformer(
                        [
                            (
                                "target_encoder",
                                TargetEncoder(),
                                ["peak_timeofday", "build_state"],
                            )
                        ],
                        remainder="passthrough",
                    ),
                ),
                ("scaling", MinMaxScaler()),
            ]
        )

        # Fit and transform the features using the preprocessing pipeline
        prepped_X = preprocessing_pipeline.fit_transform(X, y_encoded)

        # Filter features based on relevance (using pre-defined feature indices)
        relevant_features = prepped_X[:, res_relevant_indices[target_type]]

        # Load tuned hyperparameters for the current target type; fall back to
        # LightGBM defaults (empty dict) when none are defined
        model_params = tuned_params_res.get(target_type, {})

        # Determine the task type (binary or multiclass classification)
        if y.nunique() > 2:
            task_type = "multiclass"
            num_classes = y.nunique()
        else:
            task_type = "binary"
            num_classes = 1

        # Cross-validation setup: StratifiedKFold
        stratified_kfold = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
        models_per_fold = []
        oof_predictions = []

        # Train and evaluate models using cross-validation
        for train_idx, test_idx in stratified_kfold.split(relevant_features, y_encoded):
            X_train, X_test = relevant_features[train_idx], relevant_features[test_idx]
            y_train, y_test = y_encoded.iloc[train_idx], y_encoded.iloc[test_idx]

            model_pipeline = Pipeline(
                [
                    (
                        "classifier",
                        LGBMClassifier(
                            verbose=-1,
                            device_type="cpu",
                            objective=task_type,
                            num_class=num_classes,
                            random_state=42,
                            **model_params,
                        ),
                    )
                ]
            )

            model_pipeline.fit(X_train, y_train)
            y_pred = model_pipeline.predict(X_test)
            weighted_f1 = f1_score(y_test, y_pred, average="weighted")
            oof_predictions.append(weighted_f1)
            models_per_fold.append(model_pipeline)

        # Calculate and print the average F1 score for this target type
        mean_f1 = np.mean(oof_predictions)
        print(f"{target_type} test F1: {mean_f1}")

        # Store the models, encoders, and preprocessing pipelines in their respective dictionaries
        models_dict[target_type] = models_per_fold
        enc_dict[target_type] = label_encoder
        prepro_dict[target_type] = preprocessing_pipeline

        f1_scores.append(mean_f1)

    # Print the average F1 score across all target types, then return the fitted objects
    print(f"Average F1 score: {np.mean(f1_scores)}")
    return models_dict, enc_dict, prepro_dict


In [None]:
resclas_models,resclas_encs,resclas_prepro = fit_residential_models(combo_res_clas_df)
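
For reference, the score printed for each target type is the mean of the per-fold weighted F1 values. The sketch below computes a weighted F1 by hand (per-class F1 weighted by the number of true instances of that class, which is what `average="weighted"` means in scikit-learn's `f1_score`) and averages it over two folds. The function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1  # weight each class by its support
    return total / len(y_true)


# Hypothetical true / predicted labels for two validation folds
fold_scores = [
    weighted_f1([0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 0, 2]),
    weighted_f1([0, 1, 1, 2], [0, 1, 1, 2]),
]
mean_f1 = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_f1)
```

Because the weights follow class frequency, frequent classes dominate this metric, which helps explain why targets with many rare classes can score low even when the common classes are predicted well.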

# Onward Test

## Reading and Manipulating Test Data

In [None]:
def create_test_combo(df, target_names):
    """
    Combine all features into a format for one-time prediction on multiple target types.

    This function creates copies of the input DataFrame `df`, appends a new column called 
    `target_type` for each target in `target_names`, and then concatenates the resulting 
    DataFrames into a single DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame containing features used for prediction.

    target_names : list of str
        A list of target names to append as a new column 'target_type' in the DataFrame.

    Returns
    -------
    pandas.DataFrame
        A concatenated DataFrame with an added 'target_type' column for each target in 
        `target_names`.

    Examples
    --------
    >>> res_df = pd.DataFrame({'feature1': [1, 2], 'feature2': [3, 4]})
    >>> res_targets = ['target1', 'target2']
    >>> combo_res_class_df = create_test_combo(res_df, res_targets)
    >>> print(combo_res_class_df)
       feature1  feature2 target_type
    0         1         3     target1
    1         2         4     target1
    0         1         3     target2
    1         2         4     target2

    """
    # Initialize a list to store copies of DataFrames with appended target types
    comb_clas_dfs = []

    # Loop through each target in target_names
    for target in target_names:
        # Create a copy of the input DataFrame to avoid modifying the original
        new_df = df.copy()

        # Add the 'target_type' column with the current target
        new_df["target_type"] = target

        # Append the modified DataFrame to the list
        comb_clas_dfs.append(new_df)

    # Concatenate all the DataFrames into one along the row axis (axis=0)
    combined_df = pd.concat(comb_clas_dfs, axis=0)

    return combined_df


In [None]:
def expand_targets_wide(df):
    """
    Expands the input DataFrame by pivoting it to create multiple columns, 
    each corresponding to a unique target type, instead of a single column for all targets.

    Parameters
    ----------
    df : pandas.DataFrame
        The input DataFrame, which contains at least the following columns:
        - 'bldg_id': Identifier for each building.
        - 'target_type': The type/category of the target (used as columns in the output).
        - 'target': The actual target values (used as the values in the output).

    Returns
    -------
    reshaped_df : pandas.DataFrame
        A reshaped DataFrame where each unique 'target_type' becomes a separate column, 
        and 'bldg_id' is used as the index. The columns represent different target types, 
        and the values in those columns represent the corresponding target values.

    Example
    -------
    >>> reshaped = expand_targets_wide(btest_comm)
    """
    reshaped_df = df.pivot(index='bldg_id', columns='target_type', values='target').reset_index()

    # Remove the name for the columns index (pivot table metadata)
    reshaped_df.columns.name = None

    return reshaped_df
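
The long-to-wide reshaping done by `expand_targets_wide` can be seen on a small toy table. This is a minimal sketch: the two target names are taken from the commercial columns used later in the notebook, but the building IDs and values are made up for illustration.

```python
import pandas as pd

# Toy long-format predictions: one row per (building, target type).
# IDs and values are invented for illustration only.
long_df = pd.DataFrame(
    {
        "bldg_id": [1, 1, 2, 2],
        "target_type": [
            "in.number_of_stories_com",
            "in.weekday_opening_time..hr_com",
        ] * 2,
        "target": ["2", "8", "1", "9"],
    }
)

# Same reshaping as expand_targets_wide: one column per target type.
wide_df = long_df.pivot(
    index="bldg_id", columns="target_type", values="target"
).reset_index()
wide_df.columns.name = None

print(wide_df)
```

Note that `DataFrame.pivot` raises a `ValueError` if the same (`bldg_id`, `target_type`) pair occurs more than once, so the input is expected to hold exactly one prediction per building and target type.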


In [None]:
# Although the test set has 1440 buildings, reading with n_files=1440 produced a
# dataframe that was 10 buildings short, so n_files=1450 is used whenever test
# building files are read in this notebook.
test_data = prepare_data(TEST_DIR,n_files=1450,is_test=True)

In [None]:
tsf_df = ts_fresh_prep(TEST_DIR,n_files=1450)

In [None]:
# Extract features separately
extracted_features_test = extract_features(
    tsf_df, column_id="bldg_id", column_sort="time",
    default_fc_parameters=EfficientFCParameters(), n_jobs=3,
)

In [None]:
extracted_features_test['bldg_id'] = extracted_features_test.index

In [None]:
extracted_features_test = extracted_features_test[extracted_features.columns]
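
Selecting the test features with the training column list keeps both tables in the same column order, but plain column selection raises a `KeyError` if any training column is absent from the test extraction. One possible alternative, shown here only as a sketch and not as what the notebook does, is `DataFrame.reindex`, which fills absent columns with NaN instead. The frame names and values below are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the train and test feature tables (hypothetical names and values).
train_features = pd.DataFrame({"f_mean": [0.1], "f_max": [0.9], "bldg_id": [10]})
test_features = pd.DataFrame({"bldg_id": [20], "f_max": [0.8], "f_extra": [5.0]})

# reindex keeps the train column order, drops test-only columns,
# and fills train columns that are absent from test with NaN.
aligned = test_features.reindex(columns=train_features.columns)

# Columns that had to be filled; worth inspecting before prediction.
missing_in_test = [
    c for c in train_features.columns if c not in test_features.columns
]
print(missing_in_test)
```

Whether NaN-filled columns are acceptable depends on the downstream preprocessing, so checking `missing_in_test` before predicting is the safer habit.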

In [None]:
test_data = pd.merge(test_data,extracted_features_test,how='left',on='bldg_id')
test_data

In [None]:
df_15min_test = prep_15min(TEST_DIR,n_files=1450)

In [None]:
test_data = pd.merge(test_data,df_15min_test,how='left',on='bldg_id')
test_data

In [None]:
tsf_15min_test = df15min_tsfprep(df_15min_test)

In [None]:
# Extract features
extracted_15min_test = extract_features(tsf_15min_test, column_id="bldg_id", 
                                        column_sort="time", n_jobs=3)

In [None]:
extracted_15min_test['bldg_id'] = extracted_15min_test.index

extracted_15min_test = extracted_15min_test[extracted_15min.columns]

In [None]:
test_data = pd.merge(test_data,extracted_15min_test,how='left',on='bldg_id')
test_data

## Predicting Building Stock Type of Test Data Buildings

In [None]:
prepped_test = type_prep_pipe.transform(test_data[X.columns])

In [None]:
finaltest = prepped_test[:,true_indexes]

In [None]:
# predict building stock type
stock_type_preds = []
for cmodel in cmodels:
    prediction = cmodel.predict(finaltest)
    stock_type_preds.append(prediction)
    
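# majority vote across the fold models: rows are models, columns are buildings
predictions = pd.DataFrame(stock_type_preds)
predictions = predictions.mode()
predictions = predictions.T

# take the first mode explicitly so that a single Series is assigned
test_data['building_stock_type'] = predictions[0]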
test_data['building_stock_type'] = test_data['building_stock_type'].map(
    {1:'residential',0:'commercial'}
)
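
The voting step above combines the hard predictions of the fold models by taking the most frequent label per building. A minimal sketch on made-up predictions (five hypothetical models, four buildings) shows how `DataFrame.mode` behaves in this layout.

```python
import numpy as np
import pandas as pd

# Hypothetical hard predictions from 5 fold models for 4 buildings
# (1 = residential, 0 = commercial, as in the mapping above).
fold_preds = [
    np.array([1, 0, 1, 0]),
    np.array([1, 0, 0, 0]),
    np.array([1, 1, 1, 0]),
    np.array([0, 0, 1, 0]),
    np.array([1, 0, 0, 1]),
]

# Rows are models, columns are buildings.
votes = pd.DataFrame(fold_preds)

# mode() works column-wise; row 0 holds the most frequent label per building.
# If labels tie, mode() returns extra rows and row 0 is the smallest tied label.
majority = votes.mode().iloc[0].astype(int)

print(majority.tolist())
```

With an odd number of models and two classes a tie cannot occur, so the stock type vote always has a single winner. For the multi-class targets predicted later, ties are possible, and taking row 0 resolves them in favour of the smallest encoded label.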

## Predicting Commercial Building Targets of Test Data

In [None]:
# preparing commercial building data targets
test_commercial = test_data[test_data['building_stock_type']=='commercial'].copy()
test_commercial.drop(constant_cols,axis=1,inplace=True)
btest_comclas = create_test_combo(test_commercial,comm_targets)

In [None]:
# predicting target values of each commercial building target type
btest_comclas_preds = []
for t_type in btest_comclas['target_type'].unique():
    temp_df = btest_comclas[btest_comclas['target_type']==t_type].copy()
    
    filtered_copy = temp_df.copy()
    temp_df_prep = comclas_prepro[t_type].transform(filtered_copy)
    filtered_tempdf = temp_df_prep[:,com_relevant_indices[t_type]]
    
    trained_models = comclas_models[t_type]
    target_preds = []
    for tmodel in trained_models:
        prediction = tmodel.predict(filtered_tempdf)
        target_preds.append(prediction)

    predictions = pd.DataFrame(target_preds)
    predictions = predictions.mode()
    predictions = predictions.T
    
    
    temp_df.reset_index(drop=True,inplace=True)
    temp_df['target'] = predictions[0]
    temp_df['target'] = temp_df['target'].astype('int')

    temp_df['target'] = comclas_encs[t_type].inverse_transform(temp_df['target'])
    btest_comclas_preds.append(temp_df)
    
btest_comclas = pd.concat(btest_comclas_preds,axis=0)

In [None]:
# Viewing the Predictions
btest_comclas.iloc[:,-3:]

In [None]:
# Pivoting the predictions to wide format, where each target type has its own
# column
btest_comm = btest_comclas
btest_comm = expand_targets_wide(btest_comm)
btest_comm['building_stock_type'] = 'commercial'

## Predicting Residential Building Targets of Test Data

In [None]:
# preparing residential building data targets
test_residential = test_data[test_data['building_stock_type']=='residential'].copy()
test_residential.drop(constant_cols,axis=1,inplace=True)
btest_resclas = create_test_combo(test_residential,res_targets)

In [None]:
btest_resclas_preds = []
for t_type in btest_resclas['target_type'].unique():
    temp_df = btest_resclas[btest_resclas['target_type']==t_type].copy()
    
    filtered_copy = temp_df.copy()
    preptemp_df = resclas_prepro[t_type].transform(filtered_copy)
    filtered_tempdf = preptemp_df[:,res_relevant_indices[t_type]]
    
    trained_models = resclas_models[t_type]
    target_preds = []
    for tmodel in trained_models:
        prediction = tmodel.predict(filtered_tempdf)
        target_preds.append(prediction)

    predictions = pd.DataFrame(target_preds)
    predictions = predictions.mode()
    predictions = predictions.T
    
    
    temp_df.reset_index(drop=True,inplace=True)
    temp_df['target'] = predictions[0]
    temp_df['target'] = temp_df['target'].astype('int')
    
    temp_df['target'] = resclas_encs[t_type].inverse_transform(temp_df['target'])
    btest_resclas_preds.append(temp_df)
btest_resclas = pd.concat(btest_resclas_preds,axis=0)

In [None]:
btest_res = btest_resclas

btest_res = expand_targets_wide(btest_res)
btest_res['building_stock_type'] = 'residential'

In [None]:
total_testpred = pd.concat([btest_res,btest_comm],axis=0)

In [None]:
total_testpred = total_testpred.fillna(value=str(None))
total_testpred.set_index('bldg_id',inplace=True)
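
Because residential and commercial buildings have different target columns, concatenating the two prediction tables leaves NaN wherever a target does not apply to a building's stock type. The sketch below illustrates this on two toy one-row tables (IDs and values are invented) and shows the effect of filling those gaps with the string `'None'`.

```python
import pandas as pd

# Toy wide-format predictions, one table per stock type (invented values).
res = pd.DataFrame(
    {"bldg_id": [1], "in.bedrooms_res": ["3"], "building_stock_type": ["residential"]}
)
com = pd.DataFrame(
    {"bldg_id": [2], "in.number_of_stories_com": ["2"], "building_stock_type": ["commercial"]}
)

combined = pd.concat([res, com], axis=0)

# Targets that do not apply to a building's stock type come out as NaN here.
n_missing = int(combined.isna().sum().sum())

# Fill the gaps with the string 'None' and index by building, as in the cell above.
combined = combined.fillna("None").set_index("bldg_id")

print(n_missing)
print(combined)
```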

## Applying Sample Submission Formatting

In [None]:
ss = pd.read_parquet(SS_FILE)

In [None]:
# ordering columns to fit sample submission format
total_testpred = total_testpred[ss.columns]

In [None]:
for col in total_testpred.columns:
    total_testpred[col] = total_testpred[col].astype('str')

In [None]:
for col in total_testpred.columns:
    total_testpred[col] = np.where(total_testpred[col]=='None',None,
                                   total_testpred[col])

In [None]:
# Modifying the predicted values for certain columns where a trailing '.0' is
# supposed to be excluded from numeric values.
columns_to_modify = [
    'in.number_of_stories_com',
    'in.tstat_clg_sp_f..f_com',
    'in.tstat_htg_sp_f..f_com',
    'in.weekday_opening_time..hr_com',
    'in.weekday_operating_hours..hr_com',
    'in.bedrooms_res'
]

for col in columns_to_modify:
    if col in total_testpred.columns:
        # anchor the pattern so that only a trailing '.0' is removed
        total_testpred[col] = total_testpred[col].str.replace(r'\.0$', '', regex=True)

In [None]:
# exporting to submission file
total_testpred.to_parquet('scv_catselect_2xtsf.parquet')

In [None]:
# Viewing the results
seeing = pd.read_parquet('scv_catselect_2xtsf.parquet')
seeing