# Solution Overview

My solution is based on LightGBM trained on over 649 different features. The features are built on hourly consumption to reduce memory consumption.

The cross-validation scheme is a Stratified K-Fold for each target, and LightGBM was trained using the following metrics:

- Binary target: Binary Cross Entropy
- Multi Class: Softmax

For each target, I selected the number of boosting rounds that led to the best cross-validated macro F1 score.
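The round selection can be sketched as follows. This is a minimal illustration, not the code from train.py: the helper names `macro_f1` and `best_round` are hypothetical, and it assumes the macro F1 of every fold has already been recorded after each boosting round.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_round(fold_scores):
    """fold_scores[f][r] is the macro F1 of fold f after r + 1 rounds.

    Returns the 1-based round with the highest mean score across folds,
    together with that mean score.
    """
    n_rounds = len(fold_scores[0])
    mean_by_round = [
        sum(fold[r] for fold in fold_scores) / len(fold_scores)
        for r in range(n_rounds)
    ]
    best = max(range(n_rounds), key=mean_by_round.__getitem__)
    return best + 1, mean_by_round[best]


# Two folds, three rounds each (made-up scores for illustration only).
round_number, score = best_round([[0.5, 0.7, 0.6], [0.4, 0.8, 0.7]])
```

Averaging across folds before taking the maximum keeps a single round count per target, which is then the number of rounds used at inference time.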

------

Due to time constraints, I was not able to fully reproduce each training run on a dataset with 10,000 additional samples for each state and building type (commercial/residential). The following targets were trained on a dataset with 5,000 additional samples instead: in.geometry_floor_area_res, in.income_res, in.roof_material_res, in.vintage_res, in.weekday_operating_hours..hr_com.

This makes the training procedure more complicated, as two trainings on two different datasets will be required to reproduce the submitted solution. This difficulty does not exist in the case of training on all targets in the dataset with 10,000 additional samples.

## Additional Dataset

Two categories of additional dataset were used:

### NREL EULP dataset:
I downloaded each file from https://data.openei.org/s3_viewer?bucket=oedi-data-lake&prefix=nrel-pds-building-stock%2Fend-use-load-profiles-for-us-building-stock%2F

I selected the following dataset release as the most recent:

Feature dataset:
- 2024/comstock_amy2018_release_1/timeseries_individual_buildings/by_state/upgrade=32/
- 2022/resstock_amy2018_release_1.1/timeseries_individual_buildings/by_state/upgrade=10/
 
Target dataset (these files were saved by hand directly to ./data/data_dump):
- 2024/comstock_amy2018_release_1/metadata/upgrade32.parquet -> renamed as metadata_com.parquet
- 2022/resstock_amy2018_release_1.1/metadata/upgrade10.parquet -> renamed as metadata_res.parquet

Each feature file was downloaded and downcast/preprocessed using the PowerShell script below (it must be executed after concat_data.py):

```
# When asked for the number of homes to scrape, enter -1 to scrape all of them
command.ps1
```
Refer to the Preprocessing Step for more information.

### Census dataset:
QuickFacts for each state from https://www.census.gov/quickfacts/fact/table/{STATE}/PST045223.
{STATE} was replaced by a comma-separated list of 5 states, e.g. "AL,AK,AZ,AR,CA", and the request was repeated until every US state was covered.
Each file was saved to "data\other_dataset\census\QuickFacts Aug-23-2024_{num_file}.csv".

These files contain geographic, demographic, and economic information for each state.
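The batching of states into QuickFacts URLs could look like the sketch below. The function name `quickfacts_urls` and the short state list are illustrative assumptions; only the URL template and the batch size of 5 come from the description above.

```python
def quickfacts_urls(states, batch_size=5):
    """Build one QuickFacts URL per batch of comma-separated state codes."""
    template = "https://www.census.gov/quickfacts/fact/table/{STATE}/PST045223"
    urls = []
    for start in range(0, len(states), batch_size):
        batch = ",".join(states[start:start + batch_size])
        urls.append(template.replace("{STATE}", batch))
    return urls


# Shortened list for illustration; the real run covers every US state.
urls = quickfacts_urls(["AL", "AK", "AZ", "AR", "CA", "CO", "CT"])
```

With seven states this yields two URLs, the first one covering "AL,AK,AZ,AR,CA" and the second one the remaining two states.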

## Data Preprocessing / Feature Engineering

The main part of my solution is the set of features I created:

### Simple aggregation

- Average daily consumption by season, month, national holiday, and state holiday.
- Average hourly consumption by season, month, national holiday, state holiday, and every TOU (time-of-use pricing) period.
- Average hourly consumption by season, month, and weekend vs. non-weekend.
- Average consumption for every hour of the day in each season (24 * 4 seasons).
- Total consumption by season, month, weekday, hour, and overall.
- Average hourly consumption by season, month, and weekday.
- Mean, max, min, and median hourly consumption for each month.
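As an illustration of one of these aggregations, the sketch below computes the average consumption for every hour of the day in each season (24 * 4 features). It is a simplified stand-in for the real preprocessing: the function name, the feature naming scheme, and the meteorological season mapping are all assumptions, since the source does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: meteorological seasons; the original season definition is not stated.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def season_hour_means(readings):
    """readings: iterable of (datetime, consumption) pairs at hourly resolution."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in readings:
        key = (SEASON_BY_MONTH[timestamp.month], timestamp.hour)
        totals[key] += value
        counts[key] += 1
    return {
        f"mean_{season}_h{hour:02d}": totals[(season, hour)] / counts[(season, hour)]
        for season, hour in totals
    }


# Synthetic year in which consumption simply equals the hour of the day.
start = datetime(2018, 1, 1)
readings = [(start + timedelta(hours=i), float(i % 24)) for i in range(365 * 24)]
features = season_hour_means(readings)
```

The other aggregations follow the same pattern with a different grouping key (month, weekday, holiday flag, or TOU period) and a different reducer (sum, min, max, median).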

### Profile Aggregation

- How often a given hour is the min or the max of its day over the whole year (as a percentage).
- Average difference between hourly consumption and the mean, min, and max hourly consumption (24 features).
- How often a given weekday is the min or the max of its week over the whole year (as a percentage).
- Average difference between daily consumption and the mean, min, and max weekly consumption (7 features).
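The first profile feature can be sketched as below, assuming each day is available as a list of 24 hourly values. The function name is hypothetical, shares are returned as fractions between 0 and 1 rather than percentages, and ties are resolved in favor of the earliest hour, which is a choice of this sketch rather than something stated in the source.

```python
def hour_extreme_share(daily_profiles):
    """Return (max_share, min_share): for each hour, the fraction of days
    in which that hour holds the daily max / min consumption."""
    n_days = len(daily_profiles)
    max_count = [0] * 24
    min_count = [0] * 24
    for profile in daily_profiles:
        # list.index returns the first position, so ties go to the earliest hour.
        max_count[profile.index(max(profile))] += 1
        min_count[profile.index(min(profile))] += 1
    return (
        [count / n_days for count in max_count],
        [count / n_days for count in min_count],
    )


rising = [float(h) for h in range(24)]        # peak at hour 23
falling = [float(23 - h) for h in range(24)]  # peak at hour 0
max_share, min_share = hour_extreme_share([rising, falling, rising, rising])
```

The weekday variant is the same computation applied to 7 daily totals per week instead of 24 hourly values per day.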

### Range of work feature
Given the difference between the consumption of the next 4 hours and the consumption of the previous 4 hours, calculate:
- 1 - The timestamp with the max difference -> time the spike begins
- 2 - The timestamp with the min difference -> time the spike ends

These timestamps and their difference (the range of work) are calculated by applying a min to (1) and a max to (2). This gives a conservative estimate of the beginning and end of the spike, and of the range between them.
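One possible reading of this calculation is sketched below. Several details are assumptions of the sketch and not confirmed by the source: the current hour is excluded from both windows, windows are sums over 4 hourly values, hours without a full window on both sides are skipped, and the min/max are taken across days. The function names are hypothetical.

```python
def spike_hours(profile, window=4):
    """Return (begin_hour, end_hour) for one day of 24 hourly values."""
    diffs = {}
    for hour in range(window, len(profile) - window):
        next_sum = sum(profile[hour + 1:hour + 1 + window])
        prev_sum = sum(profile[hour - window:hour])
        diffs[hour] = next_sum - prev_sum
    begin = max(diffs, key=diffs.get)  # largest increase: spike begins
    end = min(diffs, key=diffs.get)    # largest decrease: spike ends
    return begin, end


def range_of_work(daily_profiles):
    """Earliest begin, latest end, and the number of hours between them."""
    begins, ends = zip(*(spike_hours(profile) for profile in daily_profiles))
    begin, end = min(begins), max(ends)
    return begin, end, end - begin


# Two synthetic days: high consumption during hours 8-17 and 9-18.
day_one = [10.0 if 8 <= h <= 17 else 1.0 for h in range(24)]
day_two = [10.0 if 9 <= h <= 18 else 1.0 for h in range(24)]
begin, end, span = range_of_work([day_one, day_two])
```

Taking the earliest begin and the latest end across days widens the estimated working window, which matches the "conservative" intent described above.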

This calculation is done over multiple filters, both for the beginning and for the end.

Begin filter:
- Only between 3-13 on workdays
- Only between 3-13 on weekends
- Only on workdays
- Only on weekends

End filter:
- Only after 3 on workdays
- Only after 3 on weekends
- Only on workdays
- Only on weekends


The total amount of memory required to download the dataset and create the silver/gold datasets is 100 GB+.

## Training Method

I trained each target always using Stratified K-Fold on the selected target.

- Building Stock Type was trained on the entire dataset.
- Residential Targets: each residential target is trained only on the residential portion of the data.
- Commercial Targets: each commercial target is trained only on the commercial portion of the data.
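The routing of targets to data subsets and the stratified split can be sketched as below. Both helpers are hypothetical: the suffix rule is inferred from target names such as in.income_res and in.weekday_operating_hours..hr_com, and the fold assignment is a simplified, deterministic round-robin stratification rather than the implementation used in train.py (scikit-learn's StratifiedKFold would be the usual choice).

```python
from collections import defaultdict


def training_subset(target):
    """Pick the data portion for a target from its name suffix (assumed rule)."""
    if target.endswith("_res"):
        return "residential"
    if target.endswith("_com"):
        return "commercial"
    return "all"  # e.g. the building stock type target


def stratified_folds(labels, n_splits=5):
    """Assign each sample to a fold so that every class is spread evenly."""
    seen = defaultdict(int)
    fold_of = []
    for label in labels:
        # The k-th sample of a class goes to fold k mod n_splits.
        fold_of.append(seen[label] % n_splits)
        seen[label] += 1
    return fold_of


folds = stratified_folds(["a"] * 6 + ["b"] * 3, n_splits=3)
```

In this toy example every fold receives two samples of class "a" and one of class "b", so the class ratio of the full dataset is preserved in each validation fold.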


The total amount of memory required to run a complete experiment and save all required checkpoints is 50 GB+.

## Feature Insight

### Building Stock Type

For this single target, which is quite different from the others, the best features are:

- Average hourly consumption for a given month and weekday.
- How often a given weekday is the min of its week over the whole year (as a percentage).

For residential:

- How often a given hour is the min of its day over the whole year (as a percentage).
- Average difference between daily consumption and mean weekly consumption.

For commercial:
- How often a given hour is the min of its day over the whole year (as a percentage).
- How often a given hour is the max of its day over the whole year (as a percentage).

# Solution Reproduction Steps

## 1. Pipeline Configuration

Before running the pipeline, please navigate to the config/config.json file and specify the actual data paths to let the pipeline know about data location. 

- PATH_ORIGINAL_DATA: folder for starting dataset, default: "data/original_dataset",
- PATH_SILVER_PARQUET_DATA: folder for silver dataset, default: "data/silver_parquet_dataset",
- PATH_GOLD_PARQUET_DATA: folder for gold dataset, default: "data/gold_parquet_dataset",
- PATH_MAPPER_DATA: folder for mapping file used in preprocessing dataset, default: "data/mapper_dataset",
- PATH_OTHER_DATA: folder for other dataset, default: "data/other_dataset",
- ORIGINAL_TRAIN_CHUNK_FOLDER: folder name where to put training parquet files, default: "building-instinct-train-data",
- ORIGINAL_TEST_CHUNK_FOLDER: folder name where to put test parquet files, default: "building-instinct-test-data",
- ORIGINAL_TRAIN_LABEL_FOLDER: folder name where to put training label parquet files, default: "building-instinct-train-label",
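As a rough illustration of how these keys fit together, the sketch below resolves the folders from a config dictionary. The helper `resolve_paths` and the output key names are hypothetical; the defaults are copied from the list above, and placing the chunk folders under PATH_ORIGINAL_DATA follows the folder structure of this repository. A real run would load the dictionary from config/config.json with `json.load`.

```python
from pathlib import Path

# Default values copied from the configuration list.
DEFAULT_CONFIG = {
    "PATH_ORIGINAL_DATA": "data/original_dataset",
    "PATH_SILVER_PARQUET_DATA": "data/silver_parquet_dataset",
    "PATH_GOLD_PARQUET_DATA": "data/gold_parquet_dataset",
    "PATH_MAPPER_DATA": "data/mapper_dataset",
    "PATH_OTHER_DATA": "data/other_dataset",
    "ORIGINAL_TRAIN_CHUNK_FOLDER": "building-instinct-train-data",
    "ORIGINAL_TEST_CHUNK_FOLDER": "building-instinct-test-data",
    "ORIGINAL_TRAIN_LABEL_FOLDER": "building-instinct-train-label",
}


def resolve_paths(config):
    """Join the chunk folder names onto the original-data folder."""
    original = Path(config["PATH_ORIGINAL_DATA"])
    return {
        "train_chunks": original / config["ORIGINAL_TRAIN_CHUNK_FOLDER"],
        "test_chunks": original / config["ORIGINAL_TEST_CHUNK_FOLDER"],
        "train_labels": original / config["ORIGINAL_TRAIN_LABEL_FOLDER"],
        "silver": Path(config["PATH_SILVER_PARQUET_DATA"]),
        "gold": Path(config["PATH_GOLD_PARQUET_DATA"]),
    }


paths = resolve_paths(DEFAULT_CONFIG)
```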

Folder Structure:
```
data
├─── gold_parquet_dataset
│    └─── test_data.parquet
│    └─── train_{target}_label.parquet
└─── mapper_dataset
│    └───mapper_category.json
│    └───target_mapper.json.json
└─── original_dataset
│    └───building-instinct-test-data
│    └───building-instinct-train-data
│    └───building-instinct-train-label
└─── other_dataset
│    └─── census
└─── silver_parquet_dataset

```


## 2. Environment Setup
Run the following command to install all required libraries and packages.

```
!pip install -r requirements.txt --quiet
```

## 3. Preprocessing Step
You can skip this step if you want to start with the pretrained checkpoints provided in ./experiment/solution_lgb and go directly to Inference Step.

Otherwise, run the following scripts in order before the training step (train and test parquet files must be placed under the ORIGINAL_TRAIN_CHUNK_FOLDER and ORIGINAL_TEST_CHUNK_FOLDER folders):
```
# Concatenate the starting train/test dataset and create the mapping JSON.
python concat_data.py

# Download every NREL EULP file and preprocess it. This step can take a couple of days, depending on the connection.
# When asked, type -1 to download everything.
command.ps1

# Create the economics dataset.
python script/create_economics_dataset.py
```

Required Memory: 50gb+

## 4. Training step
You can skip this step if you want to start with the pretrained checkpoints provided in ./experiment/solution_lgb and go directly to Inference Step.

Otherwise, execute the following command.

Two training runs are required because I used two versions of the additional dataset taken from the NREL EULP dataset.
The first run uses only 5_000 samples per state, for both residential and commercial, from the NREL EULP dataset.
The second run uses 10_000 samples.

Due to time constraints, I was not able to complete training with 10_000 additional samples for the following targets: in.geometry_floor_area_res, in.income_res, in.roof_material_res, in.vintage_res, in.weekday_operating_hours..hr_com. For these targets I used models trained on only 5_000 additional samples per state.

To reproduce the full solution:

### Training with 5_000 samples

Update the file ./config/params_lgb.json and set:

- learning_rate: 0.05,
- n_round: 2000
- experiment_name: "add_5000"
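If you prefer to script this edit instead of changing the file by hand, a sketch like the one below could be used. It assumes that learning_rate, n_round, and experiment_name are top-level keys of params_lgb.json, which the source does not confirm; the helper `update_params` is hypothetical and the demo runs against a throwaway file rather than the real ./config/params_lgb.json.

```python
import json
import tempfile
from pathlib import Path


def update_params(path, **overrides):
    """Overwrite the given top-level keys in a JSON parameter file."""
    path = Path(path)
    params = json.loads(path.read_text(encoding="utf-8"))
    params.update(overrides)
    path.write_text(json.dumps(params, indent=2), encoding="utf-8")
    return params


# Demo on a temporary file with made-up starting values.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "params_lgb.json"
    demo_file.write_text(json.dumps({"learning_rate": 0.15, "n_round": 800}), encoding="utf-8")
    updated = update_params(
        demo_file, learning_rate=0.05, n_round=2000, experiment_name="add_5000"
    )
    reloaded = json.loads(demo_file.read_text(encoding="utf-8"))
```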

Training with 5_000 additional samples (estimated training time: 7 days):
```
python create_additional_data.py --sample 5000
python preprocess.py --add
python train.py
```

### Training with 10_000 samples

Update the file ./config/params_lgb.json and set:

- learning_rate: 0.15,
- n_round: 800
- experiment_name: "add_10000"

Training with 10_000 additional samples (estimated training time: 3-5 days):
```
python create_additional_data.py
python preprocess.py --add
python train.py
```

### Create Solution Folder
The final submission combines two categories of model (due to time constraints):

- Models trained on 5_000 additional samples for in.geometry_floor_area_res, in.income_res, in.roof_material_res, in.vintage_res, in.weekday_operating_hours..hr_com.
- Models trained on 10_000 additional samples for every other target.

Create a folder called "solution_lgb" under ./experiment and place each subfolder (one for each target) in it. The 5 targets (in.geometry_floor_area_res, in.income_res, in.roof_material_res, in.vintage_res, in.weekday_operating_hours..hr_com) must come from add_5000, the others from add_10000.
Then set experiment_name to "solution" in ./config/params_lgb.json and go to the Inference Step.
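Assembling the folder can also be scripted. The sketch below is only an illustration under stated assumptions: it assumes each target's checkpoint subfolder is named after the target, and the function `build_solution` and the experiment folder paths passed to it are hypothetical, so check the actual folder names under ./experiment before using anything like it. The demo builds a throwaway directory tree instead of touching real checkpoints.

```python
import shutil
import tempfile
from pathlib import Path

# Targets whose models come from the 5_000-sample experiment.
TARGETS_FROM_5000 = {
    "in.geometry_floor_area_res",
    "in.income_res",
    "in.roof_material_res",
    "in.vintage_res",
    "in.weekday_operating_hours..hr_com",
}


def build_solution(exp_5000, exp_10000, solution):
    """Copy target subfolders from the two experiments into one solution folder."""
    exp_5000, exp_10000, solution = Path(exp_5000), Path(exp_10000), Path(solution)
    solution.mkdir(parents=True, exist_ok=True)
    # Every target except the listed ones comes from the 10_000-sample run.
    for folder in exp_10000.iterdir():
        if folder.is_dir() and folder.name not in TARGETS_FROM_5000:
            shutil.copytree(folder, solution / folder.name, dirs_exist_ok=True)
    # The listed targets come from the 5_000-sample run.
    for target in TARGETS_FROM_5000:
        folder = exp_5000 / target
        if folder.is_dir():
            shutil.copytree(folder, solution / target, dirs_exist_ok=True)


# Demo with fake checkpoints; "hypothetical_target" is a made-up target name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for experiment, tag in (("add_5000", "5000"), ("add_10000", "10000")):
        for target in ("in.income_res", "hypothetical_target"):
            checkpoint_dir = root / experiment / target
            checkpoint_dir.mkdir(parents=True)
            (checkpoint_dir / "model.txt").write_text(tag)
    build_solution(root / "add_5000", root / "add_10000", root / "solution_lgb")
    picked = {
        folder.name: (folder / "model.txt").read_text()
        for folder in (root / "solution_lgb").iterdir()
    }
```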

### Additional Notes
If training on the bigger dataset (10_000) is completed for every target, reproducing the solution only requires the following.

Set in ./config/params_lgb.json:
- learning_rate: 0.05,
- n_round: 2000
- experiment_name: "solution"

Run:
```
python create_additional_data.py
python preprocess.py --add
python train.py
```
This solution will score slightly higher than my last submission.

The double training is only necessary to reproduce the currently submitted solution, which is missing 5 targets on the +10_000 dataset.


Required Memory for gold dataset + model checkpoint: 70gb+

## 5. Inference step
To run inference and generate predictions for the holdout dataset, follow the instructions below.

Place all test parquet files under ORIGINAL_TEST_CHUNK_FOLDER.
experiment_name must be set to "solution" in ./config/params_lgb.json.

```
# Concatenate all test files and create the silver dataset.
!python script/create_test_data.py
# Create the economics dataset. This step is not necessary and is commented out, as the preprocessed file macro_economics_data.parquet is already provided.
# !python script/create_economics_dataset.py
# Run preprocessing on the test set.
!python preprocess.py --inference
# Run inference and generate submission.parquet.
!python inference.py
```

Submission will be saved to ./experiment/solution_lgb/submission.parquet