# Building Instinct: Where power meets predictions

Welcome to the <i> Building Instinct: Where power meets predictions </i> Challenge, a building metadata inference problem based on electricity load profiles. This is a hierarchical multi-output multi-class classification problem in which you are tasked with predicting the metadata of different buildings from their load profiles.
In this challenge, you will be given end-use load profiles for 7200 unique buildings (train data), along with their metadata (labels), to train your inference model. You will also be given load profiles for 1440 buildings (without their metadata) as your test data. Your predictive leaderboard score will be calculated from your model's performance on the test data. Your task is to infer the metadata of the different hierarchical classes for each building. More information about this hierarchical multi-output multi-class classification problem is given later in this starter notebook.

### Supplied Materials:
 
*  Starter Notebook
*  Train dataset: 7200 `<building_id>.parquet` files containing timestamped end-use load profiles for 7200 buildings
*  `train_label.parquet` file containing labels (also referred to as metadata or attributes) for the 7200 buildings in the train dataset
*  Test dataset: 1440 `<building_id>.parquet` files containing timestamped end-use load profiles for 1440 buildings
*  `utils.py`: helper functions used in this starter notebook to help you get started
*  `requirements.txt`: must list all the packages required by your submission

### Data:

Each of the above-mentioned timestamped `.parquet` files (either train or test) contains a time series of electricity consumption for the corresponding building, starting on Jan 1, 2018 (`2018-01-01 00:15:00`) and running through the end of Dec 31, 2018 (`2019-01-01 00:00:00`), in 15-minute increments. All times are in Eastern Standard Time (EST). All energy consumption values are in `kWh`. The energy consumption for each timestamp (row) is the electricity consumed during the 15 minutes ending at that timestamp. For example, for the row corresponding to `2018-12-20 17:15:00`, the energy load value is the energy consumed from `2018-12-20 17:00:00` to `2018-12-20 17:15:00`. Each of these files also contains a column giving the state in which the building is located.
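
The timestamp convention can be checked with a short sketch. It assumes that every file covers the full year with no missing or duplicated timestamps, which the description implies but does not guarantee; the variable names are illustrative only.

```python
import pandas as pd

# Interval-ending timestamps for 2018 at 15-minute resolution.
# Assumption: the series has no gaps or duplicated timestamps.
timestamps = pd.date_range(
    "2018-01-01 00:15:00", "2019-01-01 00:00:00", freq="15min"
)

# 365 days x 24 hours x 4 intervals per hour
n_intervals = len(timestamps)

# Each row covers the 15 minutes that END at its timestamp.
row = pd.Timestamp("2018-12-20 17:15:00")
interval_start = row - pd.Timedelta(minutes=15)

print(n_intervals)
print(interval_start, "->", row)
```

Because the last row of the year is stamped `2019-01-01 00:00:00`, grouping directly on the timestamp's calendar fields would assign that reading to 2019; shifting by 15 minutes first avoids this.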

The `train_label.parquet` file contains the metadata (labels/classes to predict) for each building in the train dataset. Each building is either <i>residential</i> or <i>commercial</i>, as recorded in the `building_stock_type` column. A commercial building has 11 metadata attributes, stored in the columns whose names end with `_com`. A residential building has 13 metadata attributes, stored in the columns whose names end with `_res`.
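
To make this layout concrete, here is a minimal sketch on a toy label table. The values and the reduced set of columns are illustrative only (the real file has 11 `_com` and 13 `_res` columns), and `toy_labels` is a hypothetical stand-in for the real label DataFrame.

```python
import numpy as np
import pandas as pd

# Toy label table mimicking the layout described above (illustrative values).
toy_labels = pd.DataFrame(
    {
        "building_stock_type": ["residential", "commercial", "residential"],
        "in.heating_fuel_com": [np.nan, "Electricity", np.nan],
        "in.vintage_com": [np.nan, "1990 to 1999", np.nan],
        "in.heating_fuel_res": ["Natural Gas", np.nan, "Electricity"],
        "in.tenure_res": ["Owner", np.nan, "Renter"],
    },
    index=pd.Index([1, 3, 4], name="bldg_id"),
)

com_cols = [c for c in toy_labels.columns if c.endswith("_com")]
res_cols = [c for c in toy_labels.columns if c.endswith("_res")]
is_res = toy_labels["building_stock_type"] == "residential"

# Residential rows carry only `_res` labels; commercial rows only `_com` labels.
res_labels = toy_labels.loc[is_res, res_cols]
com_labels = toy_labels.loc[~is_res, com_cols]

# The "other" block of columns should be entirely empty for each stock type.
structure_ok = bool(
    toy_labels.loc[is_res, com_cols].isna().all().all()
    and toy_labels.loc[~is_res, res_cols].isna().all().all()
)
print(structure_ok)
```

Splitting the labels this way gives two dense tables, one per stock type, which is usually the form a second-level classifier needs.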



### Data loading and exploration

Below are a few code snippets that show you how to load and explore the data.

Please remember to include any packages you use in a `requirements.txt` file and include it in the starter notebook folder.

In [1]:
# importing the required libraries

import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

from utils import (
    calculate_average_hourly_energy_consumption,
    train_model,
    get_pred,
    calculate_hierarchical_f1_score,
    sample_submission_generator,
)

Below, the timestamped load profile for the building with ID 1 (`1.parquet`) is loaded as a pandas DataFrame and the first 10 rows are displayed. This building is located in Kentucky.

In [2]:
load_filepath_bldg = os.path.join(
    os.getcwd(), "building-instinct-train-data", "1.parquet"
)  # path to a file in the train dataset
df_bldg = pd.read_parquet(load_filepath_bldg, engine="pyarrow")

# show the first 10 rows of the df_bldg dataframe
df_bldg.head(10)

bldg_id,timestamp,out.electricity.total.energy_consumption,in.state
1,2018-01-01 00:15:00,2.288,KY
1,2018-01-01 00:30:00,2.19,KY
1,2018-01-01 00:45:00,2.101,KY
1,2018-01-01 01:00:00,2.016,KY
1,2018-01-01 01:15:00,2.027,KY
1,2018-01-01 01:30:00,2.05,KY
1,2018-01-01 01:45:00,2.074,KY
1,2018-01-01 02:00:00,2.097,KY
1,2018-01-01 02:15:00,2.129,KY
1,2018-01-01 02:30:00,2.162,KY


Next we load the metadata (labels) for the buildings in the train dataset and display the first 10 rows. As shown in the dataframe below, residential buildings (e.g. building 1) have entries only in the columns whose names end with `_res` (in addition to the `building_stock_type` column, which specifies their residential building stock type). Similarly, commercial buildings (e.g. building 3) have entries only in the columns whose names end with `_com` (in addition to the `building_stock_type` column, which specifies their commercial building stock type).


In [3]:
load_filepath_labels = os.path.join(
    os.getcwd(), "building-instinct-train-label", "train_label.parquet"
)  # path to the train label file
df_targets = pd.read_parquet(load_filepath_labels, engine="pyarrow")

# show the first 10 rows of the dataframe
df_targets.head(10)

bldg_id,building_stock_type,in.comstock_building_type_group_com,in.heating_fuel_com,in.hvac_category_com,in.number_of_stories_com,in.ownership_type_com,in.vintage_com,in.wall_construction_type_com,in.tstat_clg_sp_f..f_com,in.tstat_htg_sp_f..f_com,...,in.geometry_building_type_recs_res,in.geometry_floor_area_res,in.geometry_foundation_type_res,in.geometry_wall_type_res,in.heating_fuel_res,in.income_res,in.roof_material_res,in.tenure_res,in.vacancy_status_res,in.vintage_res
1,residential,,,,,,,,,,...,Multi-Family with 2 - 4 Units,1500-1999,Unheated Basement,Wood Frame,Natural Gas,100000-119999,Composition Shingles,Owner,Occupied,<1940
2,residential,,,,,,,,,,...,Multi-Family with 5+ Units,750-999,Vented Crawlspace,Wood Frame,Natural Gas,10000-14999,"Asphalt Shingles, Medium",Renter,Occupied,1970s
3,commercial,Warehouse and Storage,Electricity,Small Packaged Unit,2.0,owner_occupied,1990 to 1999,WoodFramed,999.0,999.0,...,,,,,,,,,,
4,residential,,,,,,,,,,...,Multi-Family with 5+ Units,1000-1499,Vented Crawlspace,Wood Frame,Electricity,60000-69999,"Asphalt Shingles, Medium",Renter,Occupied,1980s
5,commercial,Warehouse and Storage,NaturalGas,Residential Style Central Systems,1.0,owner_occupied,2000 to 2012,WoodFramed,999.0,999.0,...,,,,,,,,,,
6,commercial,Mercantile,NaturalGas,Small Packaged Unit,1.0,leased,1960 to 1969,WoodFramed,72.0,67.0,...,,,,,,,,,,
7,residential,,,,,,,,,,...,Single-Family Detached,750-999,Heated Basement,Wood Frame,Natural Gas,80000-99999,"Asphalt Shingles, Medium",Owner,Occupied,1940s
8,residential,,,,,,,,,,...,Multi-Family with 5+ Units,500-749,Vented Crawlspace,Wood Frame,Electricity,40000-44999,"Asphalt Shingles, Medium",Renter,Occupied,1990s
9,commercial,Warehouse and Storage,NaturalGas,Small Packaged Unit,1.0,leased,1980 to 1989,Mass,999.0,999.0,...,,,,,,,,,,
10,residential,,,,,,,,,,...,Single-Family Detached,1000-1499,Slab,Steel Frame,Electricity,20000-24999,Composition Shingles,Renter,Occupied,1950s


Below we print the list of commercial and residential metadata. Most of the entries are self-explanatory. For additional clarification, `in.tstat_clg_sp_f..f_com` and `in.tstat_htg_sp_f..f_com` refer to the cooling and heating thermostat setpoints (in Fahrenheit) for commercial buildings.

In [4]:
columns_com = df_targets.filter(like="_com").columns.tolist()
print("\n Metadata columns for commercial buildings: \n", columns_com)

columns_res = df_targets.filter(like="_res").columns.tolist()
print("\n Metadata columns for residential buildings: \n", columns_res)


 Metadata columns for commercial buildings: 
 ['in.comstock_building_type_group_com', 'in.heating_fuel_com', 'in.hvac_category_com', 'in.number_of_stories_com', 'in.ownership_type_com', 'in.vintage_com', 'in.wall_construction_type_com', 'in.tstat_clg_sp_f..f_com', 'in.tstat_htg_sp_f..f_com', 'in.weekday_opening_time..hr_com', 'in.weekday_operating_hours..hr_com']

 Metadata columns for residential buildings: 
 ['in.bedrooms_res', 'in.cooling_setpoint_res', 'in.heating_setpoint_res', 'in.geometry_building_type_recs_res', 'in.geometry_floor_area_res', 'in.geometry_foundation_type_res', 'in.geometry_wall_type_res', 'in.heating_fuel_res', 'in.income_res', 'in.roof_material_res', 'in.tenure_res', 'in.vacancy_status_res', 'in.vintage_res']


### Hierarchical Multi-output Multi-class Classification

Your task in this challenge is to classify (i.e., predict the classes for) each metadata target variable for a given building based on its electricity load profile. The process begins with classifying the `building_stock_type` (a binary classification: residential or commercial). If the building is classified as residential, you should then predict the classes for all 13 corresponding metadata target variables (columns ending with `_res`). Similarly, if the building is classified as commercial, you should predict the classes for all 11 corresponding metadata target variables (columns ending with `_com`).

This is a hierarchical classification problem with two levels (hierarchies) of classification: The first level involves determining the building stock type, and the second level involves classifying the target variables specific to the first-level class. Additionally, this is a multi-output classification problem since there are multiple target variables to classify (13 for residential and 11 for commercial). Furthermore, it is a multi-class problem because many target variables have more than two classes to predict. To assess the performance of your classification model, a customized F1-score is used as the performance metric for this challenge. The details of this performance metric will be discussed later in the Starter Notebook.

To help you better understand the problem and get you up to speed, we will create and train a simple classification model. Every effective machine learning model relies on proper data preprocessing and feature engineering/extraction. Therefore, before diving into the classification model, we will provide a brief discussion on feature engineering and extraction.


### Feature extraction/engineering

Collected smart meter data on energy consumption are often processed to reduce the size of the input data or to define more meaningful features for the ML task at hand (classification in this case). This is often referred to as the feature extraction/engineering or data reduction stage. Each recorded electricity consumption value for a given building is itself a feature; however, using every single reading results in a very high-dimensional feature set. For the provided dataset, this gives a 365 x 24 x 4 = 35040-dimensional feature set for every building. Analyzing data sets of this size can be challenging. Data size reduction and feature engineering/extraction methods are therefore pivotal for reducing the size of load data sets. Used properly, these methods reduce the input size of classification algorithms, save computation time, and produce features suited to a specific task or algorithm.

Many knowledge-based and automatic feature extraction techniques can be used for classification. For instance, one can reduce the full load profiles by aggregating and averaging over different date-time windows: aggregate the energy consumption within every hour and then average over the entire year to get a 24-hour representative load profile (RLP) for the year. This reduces the size of the feature set from 35040 to 24. The averaging can also be done per season or per month; done for every month, it yields twelve 24-hour RLPs (12 x 24 = 288 features). One could engineer other features as well, such as the number of consumption peaks or the times at which they occur. More sophisticated, automatic deep learning feature extraction techniques could also be integrated into the classification pipeline, and several of these techniques can be combined. There is no hard and fast rule as to which technique works best; it depends on the dataset, the classification algorithm, the application of interest, and the type of labels to classify.
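
The aggregation ideas above can be sketched on a synthetic series as follows. The load values are random numbers rather than challenge data, and the peak-hour feature is just one possible choice; none of this is taken from `utils.py`.

```python
import numpy as np
import pandas as pd

# Synthetic one-year, 15-minute load series (kWh per interval); values are made up.
idx = pd.date_range("2018-01-01 00:15:00", "2019-01-01 00:00:00", freq="15min")
rng = np.random.default_rng(0)
load = pd.Series(rng.uniform(0.1, 1.0, size=len(idx)), index=idx)

# Timestamps mark the END of each interval, so shift back 15 minutes before
# extracting calendar fields; otherwise the 00:00 row lands in the next day.
start = load.index - pd.Timedelta(minutes=15)

# Sum the four 15-minute values that fall within each clock hour.
hourly = load.groupby(start.floor("60min")).sum()

# 24-hour RLP for the whole year: average each hour of day over all days.
rlp_year = hourly.groupby(hourly.index.hour).mean()

# One 24-hour RLP per month: 12 x 24 = 288 features.
rlp_month = hourly.groupby([hourly.index.month, hourly.index.hour]).mean()

# One example of an engineered feature: hour of day (0-23) of the yearly RLP peak.
peak_hour = int(rlp_year.idxmax())

features = np.concatenate([rlp_year.to_numpy(), rlp_month.to_numpy(), [peak_hour]])
print(len(hourly), len(rlp_year), len(rlp_month), features.shape)
```

Concatenating several reduced representations, as in the last step, is one simple way to combine techniques; whether it helps depends on the classifier and the target column.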

As a starting point, we have provided a simple function in `utils.py` that calculates the average hourly energy consumption. The function `calculate_average_hourly_energy_consumption(folder_path, season_months_dict)` reads all the parquet files in the `folder_path` folder, calculates the hourly average energy consumption (as described above), and returns a pandas DataFrame in which each row corresponds to one file (building) in the folder. The 15-minute energy consumption values are aggregated within each hour. The dictionary argument `season_months_dict` defines the months over which the averaging takes place. Its keys are season names (strings) and its values are lists of the corresponding month numbers. For example, if `season_months_dict` = {'cold': [1, 2, 12], 'hot': [6, 7, 8], 'mild': [3, 4, 5, 9, 10, 11]}, the energy consumption is averaged within 3 different seasons, resulting in 3 x 24 = 72 features for every building load profile. Below, this function is applied to the parquet files in the train dataset.
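
For intuition, the seasonal averaging for a single building might look like the sketch below. This is not the code in `utils.py`: `seasonal_hourly_profile` is a hypothetical helper, and labelling the hours 1 to 24 is an assumption based on the column names in the helper's output.

```python
import numpy as np
import pandas as pd


def seasonal_hourly_profile(load, season_months_dict):
    """Hypothetical helper: average hourly energy per season for one building."""
    start = load.index - pd.Timedelta(minutes=15)      # interval start times
    hourly = load.groupby(start.floor("60min")).sum()  # kWh per clock hour
    features = {}
    for season, months in season_months_dict.items():
        in_season = hourly[hourly.index.month.isin(months)]
        by_hour = in_season.groupby(in_season.index.hour).mean()
        for hour, value in by_hour.items():
            features[(season, hour + 1)] = value       # label hours 1..24 (assumed)
    return pd.Series(features)


# Constant synthetic load of 0.25 kWh per 15 minutes, i.e. 1 kWh per hour.
idx = pd.date_range("2018-01-01 00:15:00", "2019-01-01 00:00:00", freq="15min")
load = pd.Series(np.full(len(idx), 0.25), index=idx)

season_months_dict = {
    "cold": [1, 2, 12],
    "hot": [6, 7, 8],
    "mild": [3, 4, 5, 9, 10, 11],
}
profile = seasonal_hourly_profile(load, season_months_dict)
print(profile.shape)
```

With a constant load, every seasonal hourly average equals 1 kWh, which makes the sketch easy to sanity-check before trying it on real profiles.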

<b>NOTE</b>: The provided feature engineering and feature-set size reduction method is just one of many approaches, shared for illustrative purposes. We highly encourage experimentation with various traditional and sophisticated (e.g. deep learning) techniques for feature extraction, either independently or in combination. Choose the methods that best fit your analysis, classification algorithm, and objectives.

Using the provided helper function in `utils.py`, below we calculate hourly energy consumption averaged over the entire year resulting in a 24-feature RLP for every building in the train dataset (`df_features`).

In [5]:
folder_path_train = os.path.join(
    os.getcwd(), "building-instinct-train-data"
)  # folder path for the train dataset
season_months_dict = {"year": list(range(1, 13))}  # average over all 12 months

df_features = calculate_average_hourly_energy_consumption(
    folder_path=folder_path_train, season_months_dict=season_months_dict
)

df_features.sort_index(inplace=True)
df_features.head(10)

,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year,year
bldg_id,1,2,3,4,5,6,7,8,9,10,...,15,16,17,18,19,20,21,22,23,24
1,1.388449,1.394814,1.406436,1.505438,1.629721,1.699849,1.76369,1.76297,1.692008,1.589521,...,1.370995,1.464129,1.615696,1.780175,1.961877,2.027405,1.939288,1.726951,1.463641,1.37217
2,0.974589,0.873729,0.832753,0.815477,0.692959,0.83766,1.058562,1.047055,0.944164,0.915151,...,1.021211,1.03311,1.120825,1.379164,1.508359,1.547868,1.505479,1.306167,1.119795,1.070608
3,26.463855,26.684141,26.856077,27.02217,27.258033,27.391528,27.572034,26.574916,25.369904,27.300352,...,57.903417,57.760485,57.724989,57.738569,57.761025,58.752627,54.264969,37.545003,27.301021,26.393634
4,0.329827,0.341614,0.347764,0.356296,0.371395,0.405652,0.517104,0.616778,0.564156,0.478921,...,0.44517,0.493049,0.539337,0.57617,0.594879,0.584274,0.562288,0.505312,0.401838,0.346137
5,4.612172,4.537062,4.476187,4.424258,4.371671,4.581824,4.845716,5.583372,6.002681,6.144763,...,7.000456,6.965389,5.152788,3.310272,2.944129,3.778194,4.762844,4.880666,4.761954,4.666943
6,9.929333,8.987023,9.079168,9.226018,9.590823,13.350907,20.614406,22.863366,26.527285,26.583311,...,24.093264,23.897739,23.724209,24.50624,25.755247,26.8076,22.786241,15.432445,12.726244,11.050264
7,0.316586,0.299337,0.266301,0.260748,0.394123,0.430542,0.399929,0.443529,0.461975,0.469151,...,0.630934,0.636866,0.702737,0.799603,0.910874,0.854578,0.816477,0.673989,0.459247,0.353238
8,0.394104,0.34506,0.338258,0.329145,0.336849,0.349704,0.388208,0.455132,0.544277,0.60723,...,0.522068,0.547055,0.548811,0.608025,0.703126,0.739375,0.791392,0.739055,0.612395,0.479586
9,130.721933,131.9799,132.178927,132.830446,133.266893,135.160795,146.462227,216.727471,264.917181,273.437032,...,269.033935,267.266991,239.046661,154.014922,111.503383,112.502299,125.288994,129.144772,129.320504,129.947422
10,0.980586,0.84943,0.768293,0.75334,0.75943,0.781893,0.891951,1.046551,1.180244,1.241507,...,1.32609,1.449488,1.586414,1.707638,1.788099,1.807647,1.771512,1.61383,1.392948,1.166989


Next, with the reduced-size feature set for the training data we train a simple classification model and use it for prediction. To this end, we first split the train data (for which we have labels) into train and test sets. We use a customized hierarchical classification model that is provided to you within the `train_model()` function found in the `utils.py` module. This function instantiates and trains three separate classification models (and returns the trained models):

* A classifier to predict the `building_stock_type` (either 'commercial' or 'residential').
* A classifier for predicting attributes/metadata of commercial buildings.
* A classifier for predicting attributes/metadata of residential buildings.

We have also provided a function (`get_pred()`) to make predictions with the above-mentioned trained models. This function takes a feature dataframe and the list of trained classifiers, and generates predictions for the `building_stock_type` and its respective attributes following the hierarchical structure. The predictions are stored in a new dataframe with the same index as the input features and with the columns given in the column list.
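
To illustrate the idea of a two-level model, here is a minimal sketch on synthetic data. It is not the implementation of `train_model()` or `get_pred()`: the helper names `fit_hierarchy` and `predict_hierarchy`, the random forest settings, and the reduced set of label columns are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

RES_COLS = ["in.tenure_res", "in.vacancy_status_res"]
COM_COLS = ["in.heating_fuel_com", "in.ownership_type_com"]
ALL_COLS = ["building_stock_type"] + COM_COLS + RES_COLS


def fit_hierarchy(X, y):
    """Fit one level-1 model and one multi-output model per stock type."""
    is_res = (y["building_stock_type"] == "residential").to_numpy(dtype=bool)
    stock_clf = RandomForestClassifier(n_estimators=25, random_state=0)
    stock_clf.fit(X, y["building_stock_type"].to_numpy(dtype=object))
    res_clf = RandomForestClassifier(n_estimators=25, random_state=0)
    res_clf.fit(X[is_res], y.loc[is_res, RES_COLS].to_numpy(dtype=object))
    com_clf = RandomForestClassifier(n_estimators=25, random_state=0)
    com_clf.fit(X[~is_res], y.loc[~is_res, COM_COLS].to_numpy(dtype=object))
    return stock_clf, res_clf, com_clf


def predict_hierarchy(X, models):
    """Predict the stock type first, then only the matching attribute block."""
    stock_clf, res_clf, com_clf = models
    pred = pd.DataFrame(index=X.index, columns=ALL_COLS, dtype=object)
    pred["building_stock_type"] = stock_clf.predict(X)
    is_res = (pred["building_stock_type"] == "residential").to_numpy(dtype=bool)
    if is_res.any():
        pred.loc[is_res, RES_COLS] = res_clf.predict(X[is_res])
    if (~is_res).any():
        pred.loc[~is_res, COM_COLS] = com_clf.predict(X[~is_res])
    return pred


# Synthetic features and labels (made up, for illustration only).
rng = np.random.default_rng(0)
n = 80
stock = np.where(np.arange(n) % 2 == 0, "residential", "commercial")
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["h1", "h2", "h3", "h4"])
X.loc[stock == "commercial", "h1"] += 10.0  # make the two types easy to separate

y = pd.DataFrame(index=X.index, columns=ALL_COLS, dtype=object)
y["building_stock_type"] = stock
res_mask = stock == "residential"
y.loc[res_mask, "in.tenure_res"] = rng.choice(["Owner", "Renter"], res_mask.sum())
y.loc[res_mask, "in.vacancy_status_res"] = rng.choice(["Occupied", "Vacant"], res_mask.sum())
y.loc[~res_mask, "in.heating_fuel_com"] = rng.choice(["Electricity", "NaturalGas"], (~res_mask).sum())
y.loc[~res_mask, "in.ownership_type_com"] = rng.choice(["leased", "owner_occupied"], (~res_mask).sum())

models = fit_hierarchy(X, y)
pred = predict_hierarchy(X, models)
print(pred.head())
```

Leaving the non-matching block of columns empty keeps the prediction table in the same shape as the label table, which is what the scoring and submission steps expect.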


In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    df_features, df_targets, test_size=0.2, random_state=42
)
column_list = list(df_targets.columns)

classifier_list = train_model(X=X_train, y=y_train)
y_pred = get_pred(X=X_test, classifier_list=classifier_list, column_list=column_list)

Below are the labels for the test portion of the training data (`y_test`) along with the predicted labels (`y_pred`).

In [7]:
y_test

bldg_id,building_stock_type,in.comstock_building_type_group_com,in.heating_fuel_com,in.hvac_category_com,in.number_of_stories_com,in.ownership_type_com,in.vintage_com,in.wall_construction_type_com,in.tstat_clg_sp_f..f_com,in.tstat_htg_sp_f..f_com,...,in.geometry_building_type_recs_res,in.geometry_floor_area_res,in.geometry_foundation_type_res,in.geometry_wall_type_res,in.heating_fuel_res,in.income_res,in.roof_material_res,in.tenure_res,in.vacancy_status_res,in.vintage_res
3099,commercial,Mercantile,NaturalGas,Small Packaged Unit,2,owner_occupied,Before 1946,Mass,73,68,...,,,,,,,,,,
2532,commercial,Warehouse and Storage,NaturalGas,Small Packaged Unit,1,leased,1980 to 1989,SteelFramed,999,999,...,,,,,,,,,,
4072,residential,,,,,,,,,,...,Single-Family Detached,500-749,Slab,Wood Frame,Natural Gas,15000-19999,Composition Shingles,Owner,Occupied,1950s
1288,commercial,Office,Electricity,Small Packaged Unit,2,owner_occupied,1946 to 1959,Mass,75,66,...,,,,,,,,,,
2541,residential,,,,,,,,,,...,Single-Family Detached,500-749,Vented Crawlspace,Brick,Natural Gas,10000-14999,"Metal, Dark",Owner,Occupied,1960s
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3791,residential,,,,,,,,,,...,Single-Family Detached,2000-2499,Vented Crawlspace,Brick,Natural Gas,70000-79999,Composition Shingles,Owner,Occupied,1970s
912,commercial,Warehouse and Storage,Electricity,Multizone CAV/VAV,1,leased,Before 1946,Mass,999,999,...,,,,,,,,,,
6521,commercial,Mercantile,Electricity,Small Packaged Unit,1,leased,1980 to 1989,WoodFramed,73,70,...,,,,,,,,,,
2996,residential,,,,,,,,,,...,Single-Family Detached,3000-3999,Vented Crawlspace,Wood Frame,Fuel Oil,60000-69999,"Asphalt Shingles, Medium",Owner,Occupied,1960s


In [8]:
y_pred

bldg_id,building_stock_type,in.comstock_building_type_group_com,in.heating_fuel_com,in.hvac_category_com,in.number_of_stories_com,in.ownership_type_com,in.vintage_com,in.wall_construction_type_com,in.tstat_clg_sp_f..f_com,in.tstat_htg_sp_f..f_com,...,in.geometry_building_type_recs_res,in.geometry_floor_area_res,in.geometry_foundation_type_res,in.geometry_wall_type_res,in.heating_fuel_res,in.income_res,in.roof_material_res,in.tenure_res,in.vacancy_status_res,in.vintage_res
3099,commercial,Mercantile,NaturalGas,Small Packaged Unit,1,leased,Before 1946,Mass,71,68,...,,,,,,,,,,
2532,commercial,Warehouse and Storage,NaturalGas,Small Packaged Unit,1,owner_occupied,Before 1946,SteelFramed,999,999,...,,,,,,,,,,
4072,residential,,,,,,,,,,...,Multi-Family with 5+ Units,500-749,Slab,Wood Frame,Electricity,40000-44999,"Asphalt Shingles, Medium",Renter,Occupied,1990s
1288,commercial,Office,Electricity,Small Packaged Unit,1,owner_occupied,Before 1946,Mass,74,69,...,,,,,,,,,,
2541,residential,,,,,,,,,,...,Single-Family Detached,1500-1999,Vented Crawlspace,Wood Frame,Natural Gas,<10000,Composition Shingles,Owner,Occupied,1960s
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3791,residential,,,,,,,,,,...,Single-Family Detached,1000-1499,Slab,Wood Frame,Natural Gas,80000-99999,Composition Shingles,Owner,Occupied,1940s
912,commercial,Warehouse and Storage,Electricity,Small Packaged Unit,1,owner_occupied,1980 to 1989,Mass,999,999,...,,,,,,,,,,
6521,commercial,Mercantile,NaturalGas,Small Packaged Unit,1,owner_occupied,Before 1946,SteelFramed,72,68,...,,,,,,,,,,
2996,residential,,,,,,,,,,...,Single-Family Detached,1500-1999,Slab,Wood Frame,Electricity,20000-24999,Composition Shingles,Owner,Occupied,<1940


Next we turn to the performance metric: the customized hierarchical $F1$-score that will be used to measure the performance of your classification model. The final $F1$-score is derived by first calculating the $F1$-scores at two hierarchical levels:
* The `building_stock_type` level, which is the first level of hierarchy ($F1_{l1}$).
* The second level, which is conditional on the `building_stock_type` being either 'commercial' or 'residential' ($F1_{l2}$).

The final $F1$-score is a weighted average of the first- and second-level $F1$-scores: $F1 = \alpha F1_{l1} + (1-\alpha) F1_{l2}$, where $\alpha$ is the weight. $F1_{l1}$ is the $F1$-score for the binary classification of the `building_stock_type` column. $F1_{l2}$ is the arithmetic average of the $F1$-scores for the residential and commercial columns ($F1_{l2}^{res}$ and $F1_{l2}^{com}$): $F1_{l2} = 0.5 \times (F1_{l2}^{res} + F1_{l2}^{com})$. To calculate $F1_{l2}^{res}$, the macro $F1$-score is first computed for each column whose name ends with `_res`. These scores are then averaged to yield $F1_{l2}^{res}$. Similarly, $F1_{l2}^{com}$ is calculated using the columns whose names end with `_com`. The function that calculates the final $F1$-score is provided in `utils.py` (`calculate_hierarchical_f1_score()`). This function (with the default parameter values) will also be used to calculate your predictive leaderboard score, i.e. the hierarchical $F1$-score on the test dataset.
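
The weighting can be illustrated with a small numeric sketch. It only combines per-column scores that are already available; how `calculate_hierarchical_f1_score()` computes those per-column macro $F1$-scores (for example, for rows whose stock type is misclassified) is not reproduced here. The function name `combine_hierarchical_f1` and the example scores are hypothetical, and `alpha=0.4` mirrors the value passed in this notebook.

```python
import numpy as np


def combine_hierarchical_f1(f1_l1, f1_res_columns, f1_com_columns, alpha=0.4):
    """Combine level-1 and per-column level-2 F1-scores as described above."""
    f1_res = float(np.mean(f1_res_columns))  # average over the `_res` columns
    f1_com = float(np.mean(f1_com_columns))  # average over the `_com` columns
    f1_l2 = 0.5 * (f1_res + f1_com)          # level-2 score
    return alpha * f1_l1 + (1 - alpha) * f1_l2


# Made-up scores: level 1 = 0.9, two `_res` columns, two `_com` columns.
score = combine_hierarchical_f1(0.9, [0.6, 0.4], [0.3, 0.5], alpha=0.4)
print(round(score, 4))  # 0.4 * 0.9 + 0.6 * 0.45 = 0.63
```

Since the second level carries weight $1-\alpha$, improving the weakest attribute columns can raise the final score more than further polishing an already strong `building_stock_type` classifier.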

In this function if you set the parameter `F1_list` to `True`, it returns a tuple where the first element is the overall hierarchical $F1$-score and the second element is a dictionary containing the $F1$-scores for all individual columns. This could help you explore which columns your model classifies well and which columns it does not. Below we apply this function on the predictions made by the trained model above.

In [9]:
F1, F1_dict = calculate_hierarchical_f1_score(
    y_test, y_pred, alpha=0.4, average="macro", F1_list=True
)

df_f1_scores = pd.DataFrame(
    list(F1_dict.items()), columns=["column name", "Macro F1-score"]
)

df_f1_scores.set_index("column name", inplace=True)

In [10]:
print(f"Hierarchical F1-score: {F1} \n")

print("Macro F1-scores for all the individual columns:\n")


df_f1_scores

Hierarchical F1-score: 0.5257525961502534 

Macro F1-scores for all the individual columns:



column name,Macro F1-score
building_stock_type,0.98255
in.vacancy_status_res,0.923055
in.tenure_res,0.672234
in.comstock_building_type_group_com,0.410894
in.bedrooms_res,0.273604
in.wall_construction_type_com,0.265956
in.hvac_category_com,0.25631
in.geometry_building_type_recs_res,0.247342
in.geometry_wall_type_res,0.219878
in.geometry_floor_area_res,0.21961


Your task is to develop a model to predict the metadata for the 1440 buildings in the test dataset. Although you do not have access to the metadata/labels for the test dataset, the ground-truth labels will be used to calculate your leaderboard score. You must submit your predictions as a `.parquet` file. The `.parquet` file should have the same structure as the metadata dataframes in this notebook (e.g. `df_targets` or `y_pred`): it MUST have 25 columns with the exact column names: `building_stock_type`, the 11 column names ending with `_com`, and the 13 column names ending with `_res`. The index name MUST be `bldg_id`. Your `.parquet` file MUST have 1440 rows containing your predicted metadata for the 1440 buildings in the test dataset.

As mentioned in the beginning of this Notebook, the parquet files of energy consumption follow the naming convention of `<building_id>.parquet`. Therefore, the index values (`bldg_id`) in your submission should start with 1 and end with 1440. Please note that the index values MUST be integer numbers corresponding to the building IDs.

To further help you understand the structure of the DataFrame for the submission file, we have provided a function called `sample_submission_generator(bldg_id_list, df_targets, path_to_save)` to generate a prediction DataFrame and save it as a sample `.parquet` file. This function takes a list of building IDs that should be a list of integers from 1 to 1440. It also takes an input dataframe (`df_targets`): the function uses the distribution of classes for each attribute (column) of this DataFrame to sample from, in order to populate the entries for the sample submission. Please note that it does not matter much how this function generates the values for the submission file; what matters is the structure of the generated DataFrame, and hence the saved `.parquet` file, i.e. the number of rows and columns and their names, the hierarchical nature of it (e.g. if a row is residential there are no entries for the columns ending with `_com` for that row), etc. The function also takes in the filepath to save the `.parquet` file. In addition to saving the `.parquet` file the function also returns the generated sample DataFrame.

We would like to emphasize again that your submission MUST be a `.parquet` file passing all the above-mentioned requirements.
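
Before submitting, it may help to run a quick structural check on your prediction DataFrame. The sketch below is not the official validator: `check_submission` is a hypothetical helper, and it only tests the requirements listed above, demonstrated on a toy table with three buildings.

```python
import numpy as np
import pandas as pd


def check_submission(df, expected_columns, expected_ids):
    """Hypothetical pre-submission check; returns a list of problems found."""
    problems = []
    if df.index.name != "bldg_id":
        problems.append("index name must be 'bldg_id'")
    if not pd.api.types.is_integer_dtype(df.index):
        problems.append("index values must be integers")
    elif sorted(df.index) != sorted(expected_ids):
        problems.append("index must contain each expected building ID exactly once")
    if sorted(df.columns) != sorted(expected_columns):
        problems.append("column names must match the label columns exactly")
    if "building_stock_type" in df.columns:
        com_cols = [c for c in df.columns if c.endswith("_com")]
        res_cols = [c for c in df.columns if c.endswith("_res")]
        is_res = df["building_stock_type"] == "residential"
        if df.loc[is_res, com_cols].notna().any().any():
            problems.append("residential rows must have empty `_com` columns")
        if df.loc[~is_res, res_cols].notna().any().any():
            problems.append("commercial rows must have empty `_res` columns")
    return problems


# Toy submission with three buildings and a reduced set of columns.
toy_columns = ["building_stock_type", "in.heating_fuel_com", "in.tenure_res"]
toy_submission = pd.DataFrame(
    {
        "building_stock_type": ["commercial", "residential", "residential"],
        "in.heating_fuel_com": ["Electricity", np.nan, np.nan],
        "in.tenure_res": [np.nan, "Owner", "Renter"],
    },
    index=pd.Index([1, 2, 3], name="bldg_id"),
)

problems_ok = check_submission(toy_submission, toy_columns, range(1, 4))
problems_bad = check_submission(
    toy_submission.rename_axis("id"), toy_columns, range(1, 4)
)
print(problems_ok)
print(problems_bad)
```

For the real submission you would presumably pass `list(df_targets.columns)` as `expected_columns` and `range(1, 1441)` as `expected_ids`.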

Below we use the `sample_submission_generator()` to generate and save a sample `.parquet` submission file.

In [11]:
bldg_id_list = list(range(1, 1441))  # building IDs 1..1440
df_sample = sample_submission_generator(
    bldg_id_list, df_targets, "sample_submission.parquet"
)

print("The generated sample submission DataFrame that is saved as a .parquet file: \n")
df_sample

The generated sample submission DataFrame that is saved as a .parquet file: 



bldg_id,building_stock_type,in.comstock_building_type_group_com,in.heating_fuel_com,in.hvac_category_com,in.number_of_stories_com,in.ownership_type_com,in.vintage_com,in.wall_construction_type_com,in.tstat_clg_sp_f..f_com,in.tstat_htg_sp_f..f_com,...,in.geometry_building_type_recs_res,in.geometry_floor_area_res,in.geometry_foundation_type_res,in.geometry_wall_type_res,in.heating_fuel_res,in.income_res,in.roof_material_res,in.tenure_res,in.vacancy_status_res,in.vintage_res
1,commercial,Healthcare,FuelOil,Multizone CAV/VAV,1,owner_occupied,1960 to 1969,SteelFramed,999,68,...,,,,,,,,,,
2,commercial,Mercantile,Electricity,Small Packaged Unit,1,owner_occupied,Before 1946,Mass,999,68,...,,,,,,,,,,
3,commercial,Office,NaturalGas,Small Packaged Unit,2,owner_occupied,Before 1946,Mass,72,67,...,,,,,,,,,,
4,commercial,Mercantile,DistrictHeating,Small Packaged Unit,2,leased,1960 to 1969,WoodFramed,74,999,...,,,,,,,,,,
5,residential,,,,,,,,,,...,Multi-Family with 5+ Units,1000-1499,Ambient,Wood Frame,Fuel Oil,15000-19999,Composition Shingles,Owner,Occupied,<1940
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1436,residential,,,,,,,,,,...,Multi-Family with 5+ Units,1500-1999,Slab,Wood Frame,Electricity,200000+,"Asphalt Shingles, Medium",Owner,Occupied,1960s
1437,residential,,,,,,,,,,...,Single-Family Detached,0-499,Unheated Basement,Wood Frame,Electricity,10000-14999,Composition Shingles,Owner,Vacant,1970s
1438,residential,,,,,,,,,,...,Single-Family Attached,500-749,Unvented Crawlspace,Wood Frame,Natural Gas,<10000,Composition Shingles,Renter,Occupied,1980s
1439,commercial,Warehouse and Storage,NaturalGas,Small Packaged Unit,5,state,1946 to 1959,Mass,73,61,...,,,,,,,,,,


You can also read the saved `sample_submission.parquet` file as shown below (Note: this is how your submitted `.parquet` file will be read for scoring).

In [12]:
df_from_parquet = pd.read_parquet("sample_submission.parquet", engine="pyarrow")

print("The dataframe read from the sample_submission.parquet file: \n")

df_from_parquet

The dataframe read from the sample_submission.parquet file: 



bldg_id,building_stock_type,in.comstock_building_type_group_com,in.heating_fuel_com,in.hvac_category_com,in.number_of_stories_com,in.ownership_type_com,in.vintage_com,in.wall_construction_type_com,in.tstat_clg_sp_f..f_com,in.tstat_htg_sp_f..f_com,...,in.geometry_building_type_recs_res,in.geometry_floor_area_res,in.geometry_foundation_type_res,in.geometry_wall_type_res,in.heating_fuel_res,in.income_res,in.roof_material_res,in.tenure_res,in.vacancy_status_res,in.vintage_res
1,commercial,Healthcare,FuelOil,Multizone CAV/VAV,1,owner_occupied,1960 to 1969,SteelFramed,999,68,...,,,,,,,,,,
2,commercial,Mercantile,Electricity,Small Packaged Unit,1,owner_occupied,Before 1946,Mass,999,68,...,,,,,,,,,,
3,commercial,Office,NaturalGas,Small Packaged Unit,2,owner_occupied,Before 1946,Mass,72,67,...,,,,,,,,,,
4,commercial,Mercantile,DistrictHeating,Small Packaged Unit,2,leased,1960 to 1969,WoodFramed,74,999,...,,,,,,,,,,
5,residential,,,,,,,,,,...,Multi-Family with 5+ Units,1000-1499,Ambient,Wood Frame,Fuel Oil,15000-19999,Composition Shingles,Owner,Occupied,<1940
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1436,residential,,,,,,,,,,...,Multi-Family with 5+ Units,1500-1999,Slab,Wood Frame,Electricity,200000+,"Asphalt Shingles, Medium",Owner,Occupied,1960s
1437,residential,,,,,,,,,,...,Single-Family Detached,0-499,Unheated Basement,Wood Frame,Electricity,10000-14999,Composition Shingles,Owner,Vacant,1970s
1438,residential,,,,,,,,,,...,Single-Family Attached,500-749,Unvented Crawlspace,Wood Frame,Natural Gas,<10000,Composition Shingles,Renter,Occupied,1980s
1439,commercial,Warehouse and Storage,NaturalGas,Small Packaged Unit,5,state,1946 to 1959,Mass,73,61,...,,,,,,,,,,



## And finally, thank you for choosing to participate in this challenge!
## Best of luck and have fun!