# Solution Overview

The solution is an ensemble of two CNN architectures (`tf_efficientnetv2_s_in21k` and `seresnext26t_32x4d`) applied to 2D single-channel representations of the energy consumption data. Data augmentation is applied during training. The final prediction is a simple ensemble that averages the outputs of the different architectures, each trained on differently preprocessed data.
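
The averaging step can be sketched as follows. This is only a minimal illustration, assuming each model outputs per-class probabilities for the same samples in the same class order. The function name `average_ensemble` and the toy numbers are hypothetical and are not taken from the repository.

```python
from typing import List, Sequence


def average_ensemble(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average class probabilities over models.

    model_probs[m][i][c] is the probability model m assigns to class c for sample i.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    averaged = []
    for i in range(n_samples):
        rows = [probs[i] for probs in model_probs]
        # zip(*rows) groups the per-model probabilities of each class together
        averaged.append([sum(col) / n_models for col in zip(*rows)])
    return averaged


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy outputs of two models for two samples and two classes (made-up numbers)
effnet_probs = [[0.75, 0.25], [0.5, 0.5]]
seresnext_probs = [[0.5, 0.5], [0.0, 1.0]]

avg = average_ensemble([effnet_probs, seresnext_probs])
labels = [argmax(p) for p in avg]
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one on a given sample.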

<img src="assets/model.png" alt="Model Architecture Diagram" width="600"/>

The solution is divided into two stages:

1. **Classifying the building type** as either commercial or residential.
2. **Performing multilabel classification** based on the result from stage 1.

For stage 2, the model training followed a two-step process:

- **Pretraining** models on an external dataset from the .
- **Fine-tuning** the pretrained models on the challenge dataset.
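
The two-stage flow described above can be sketched as a simple routing function. This is an illustration under assumptions: `two_stage_predict`, `toy_stage1`, and `toy_stage2` are hypothetical names, and the toy callables stand in for the trained models, whose real interfaces are defined in `src/`.

```python
from typing import Callable, Dict, List

Sample = List[float]


def two_stage_predict(
    sample: Sample,
    stage1: Callable[[Sample], str],
    stage2: Dict[str, Callable[[Sample], Dict[str, str]]],
) -> Dict[str, str]:
    """Route a sample to the multilabel model that matches its predicted building type."""
    building_type = stage1(sample)  # expected: "commercial" or "residential"
    labels = stage2[building_type](sample)
    return {"building_stock_type": building_type, **labels}


# Hypothetical stand-ins for the trained models (the rule and values are made up)
def toy_stage1(sample: Sample) -> str:
    return "commercial" if sum(sample) > 10 else "residential"


toy_stage2 = {
    "commercial": lambda s: {"in.vintage_com": "toy-value"},
    "residential": lambda s: {"in.vintage_res": "toy-value"},
}

prediction = two_stage_predict([6.0, 7.0], toy_stage1, toy_stage2)
```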

# Requirements
Linux (tested on Ubuntu 22.04), Python 3.9 in a conda environment, and a GPU with the CUDA drivers required to run PyTorch (tested on an NVIDIA RTX 3090).

Install the dependencies using Python 3.9 with conda:
```
$ conda create -n envs python=3.9
$ conda activate envs
$ pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
$ pip install -r requirements.txt
```

## Folder Structure
```
project/
├── data/
│   ├── train : Contains the train files.
│   ├── test : Contains the test files.
│   └── labels
│       └── train_label.parquet
├── src/
│   ├── prepare_data.py 
│   │   └── Creates multilabel folds for training.
│   ├── train_stage1.py 
│   │   └── Trains a classifier for building type (commercial/residential).
│   ├── train_stage2.py 
│   │   └── Pretrains a multilabel classifier for a specific building type on external data.
│   ├── train_stage3.py 
│   │   └── Fine-tunes the pretrained models on the challenge dataset.
│   └── infer.py 
│       └── Performs inference on the challenge data and generates the submission file.
```
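
Before running the pipeline, it can help to check that this layout is in place. The helper below is a small sketch and not part of the repository: `missing_paths` and `EXPECTED_PATHS` are hypothetical names, and the list simply mirrors the tree shown above.

```python
from pathlib import Path
from typing import List

# Paths copied from the folder structure above, relative to the project root
EXPECTED_PATHS = [
    "data/train",
    "data/test",
    "data/labels/train_label.parquet",
    "src/prepare_data.py",
    "src/train_stage1.py",
    "src/train_stage2.py",
    "src/train_stage3.py",
    "src/infer.py",
]


def missing_paths(project_root: str) -> List[str]:
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

For example, `missing_paths("project")` returns an empty list when everything is in place.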

## Imports

In [2]:
import os
import pandas as pd
import numpy as np
from glob import glob
import pickle
from sklearn.preprocessing import LabelEncoder
from concurrent.futures import ThreadPoolExecutor, as_completed

## Pipeline Walkthrough

### 1. Encode Target Variables
Before training, we encode each target variable with `LabelEncoder`. The fitted encoders are saved in pickle format so that they can be reused in subsequent steps.

In [None]:
input_dir="data"
filepath_labels =f"{input_dir}/labels/train_label.parquet" #path to the train label file
df_targets = pd.read_parquet(filepath_labels, engine='pyarrow')

In [None]:
os.makedirs(f"{input_dir}/Label_encoded",exist_ok=True)
#Iterate through each target column and apply label encoding
for col in df_targets.columns:
    le = LabelEncoder()
    df_targets[col] = le.fit_transform(df_targets[col])
    # Save the encoder to a file
    with open(f'{input_dir}/Label_encoded/{col}_label_encoder.pkl', 'wb') as f:
        pickle.dump(le, f)
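
A saved encoder can later be reloaded to map predicted class indices back to the original labels. The sketch below shows that round trip on toy data in a temporary directory. The helper `load_encoder` and the toy column name and labels are assumptions made for illustration; only the `{col}_label_encoder.pkl` naming follows the cell above.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder


def load_encoder(encoder_dir: str, col: str) -> LabelEncoder:
    """Load the pickled encoder for one target column."""
    with open(os.path.join(encoder_dir, f"{col}_label_encoder.pkl"), "rb") as f:
        return pickle.load(f)


# Demo in a throwaway directory with toy labels
with tempfile.TemporaryDirectory() as tmp_dir:
    le = LabelEncoder().fit(["commercial", "residential", "commercial"])
    with open(os.path.join(tmp_dir, "building_stock_type_label_encoder.pkl"), "wb") as f:
        pickle.dump(le, f)

    restored = load_encoder(tmp_dir, "building_stock_type")
    known_classes = list(restored.classes_)  # sorted unique labels seen during fit
    decoded = list(restored.inverse_transform([1, 0]))  # indices back to labels
```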

### 2. Download External Data
### 2.a. Download external data for commercial buildings 
This section reproduces the download of external data for commercial buildings from the ***Open Energy Data Initiative (OEDI)***. The download runs in parallel and stops at `bldg_id = 47477`, the last building used in the final submission.

We first download the metadata file from S3.

In [None]:
meta_com_base_s3 = "s3://oedi-data-lake/nrel-pds-building-stock/end-use-load-profiles-for-us-building-stock/2024/"
meta_com_rep_s3 = "comstock_amy2018_release_1/metadata_and_annual_results/national/csv/"
meta_com_f = "baseline_metadata_only.csv"
s3_path = meta_com_base_s3 + meta_com_rep_s3 + meta_com_f
local_meta_path = f"{input_dir}/external/baseline_metadata_com.csv"

In [None]:
os.system(f"aws s3 cp {s3_path} {local_meta_path} --no-sign-request")

In [None]:
df_ext = pd.read_csv(local_meta_path)

We then select the subset of the external data to download.

In [None]:
com_cols =['in.comstock_building_type_group',
 'in.heating_fuel',
 'in.hvac_category',
 'in.number_of_stories',
 'in.ownership_type',
 'in.vintage',
 'in.wall_construction_type',
 'in.tstat_clg_sp_f..f',
 'in.tstat_htg_sp_f..f',
 'in.weekday_opening_time..hr',
 'in.weekday_operating_hours..hr']
res_cols=['in.bedrooms',
 'in.cooling_setpoint',
 'in.heating_setpoint',
 'in.geometry_building_type_recs',
 'in.geometry_floor_area',
 'in.geometry_foundation_type',
 'in.geometry_wall_type',
 'in.heating_fuel',
 'in.income',
 'in.roof_material',
 'in.tenure',
 'in.vacancy_status',
 'in.vintage']
cons_cols=['bldg_id','in.state']

In [None]:
LAST_ROW_BLDG_ID = 47477  # last row to include for the submitted version
all_cols = cons_cols + com_cols
dfcom  = df_ext[all_cols]
last_idx = dfcom.loc[dfcom.bldg_id==LAST_ROW_BLDG_ID].index[0]
dfcom = dfcom.iloc[:last_idx+1]
dfcom
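
The cell above uses an index label as a row position, which coincides only when the DataFrame has the default `RangeIndex` (as it does right after `pd.read_csv`). A position-based variant that does not depend on the index could look like the sketch below. `truncate_at_bldg_id` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def truncate_at_bldg_id(df: pd.DataFrame, last_bldg_id: int) -> pd.DataFrame:
    """Keep rows from the top of df through the first row whose bldg_id matches."""
    positions = (df["bldg_id"] == last_bldg_id).to_numpy().nonzero()[0]
    if len(positions) == 0:
        raise ValueError(f"bldg_id {last_bldg_id} not found")
    return df.iloc[: int(positions[0]) + 1]


# Toy metadata with a non-default index to show that positions are used, not labels
toy_meta = pd.DataFrame(
    {"bldg_id": [5, 9, 7, 3], "in.state": ["CA", "NY", "TX", "WA"]},
    index=[10, 11, 12, 13],
)
toy_subset = truncate_at_bldg_id(toy_meta, 7)
```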

In [None]:
# Define a function to download a single file from S3
def download_s3_file(row,base_rep_s3="comstock_amy2018_release_1",target_type="com"):
    """
    Downloads the external data for a given building ID.
    
    Args:
        row (pandas.Series): One metadata row containing 'in.state' and 'bldg_id'.
        base_rep_s3 (str): Name of the release folder in the S3 path.
        target_type (str): Building type of the target, "com" or "res".

    Returns:
        None: Data is downloaded and saved locally.
    """
    state = row['in.state']
    bldg_id = row['bldg_id']
    state_path = f"state={state}/"
    file_name = f"{bldg_id}-0.parquet"
    base_s3 = "s3://oedi-data-lake/nrel-pds-building-stock/end-use-load-profiles-for-us-building-stock/2024/"
    rep_s3 = f"{base_rep_s3}/timeseries_individual_buildings/by_state/upgrade=0/"
    s3_path = os.path.join(base_s3, rep_s3, state_path, file_name)
    
   
    local_dir = f"{input_dir}/external/features/{target_type}/{state}"
    os.makedirs(local_dir, exist_ok=True)
    local_file = os.path.join(local_dir, file_name)
    # AWS S3 copy command
    os.system(f"aws s3 cp {s3_path} {local_file} --no-sign-request")

# Parallelize the download
def parallel_download(df,base_rep_s3="comstock_amy2018_release_1",target_type="com", max_workers=10):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(download_s3_file, row,base_rep_s3,target_type) for idx, row in df.iterrows()]
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Error downloading file: {e}")

In [None]:
parallel_download(dfcom,base_rep_s3="comstock_amy2018_release_1",target_type="com")
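
Note that `os.system` returns the exit status of the command instead of raising, so the `try`/`except` in `parallel_download` will not see a failed `aws s3 cp`. One possible alternative, sketched here under assumptions, is `subprocess.run(..., check=True)`, which raises `CalledProcessError` on a non-zero exit status. The helper names below are hypothetical, and the building ID and state in the example are toy values.

```python
import subprocess
from typing import List

BASE_S3 = "s3://oedi-data-lake/nrel-pds-building-stock/end-use-load-profiles-for-us-building-stock/2024/"


def build_s3_uri(bldg_id: int, state: str, base_rep_s3: str) -> str:
    """Build the same S3 path that download_s3_file assembles."""
    return (
        f"{BASE_S3}{base_rep_s3}/timeseries_individual_buildings/by_state/"
        f"upgrade=0/state={state}/{bldg_id}-0.parquet"
    )


def build_copy_command(s3_uri: str, local_file: str) -> List[str]:
    """Argument list for the AWS CLI copy; a list avoids shell quoting issues."""
    return ["aws", "s3", "cp", s3_uri, local_file, "--no-sign-request"]


def download(s3_uri: str, local_file: str) -> None:
    """Run the copy; check=True raises CalledProcessError if the command fails."""
    subprocess.run(build_copy_command(s3_uri, local_file), check=True)


# Toy building ID and state, only to show the resulting path
example_uri = build_s3_uri(12345, "CA", "comstock_amy2018_release_1")
```

With this variant, an exception raised inside a worker surfaces through `future.result()` and is reported by the existing `except` branch.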

### 2.b. Download external data for residential buildings
This section reproduces the download of external data for residential buildings from the ***Open Energy Data Initiative (OEDI)***. The download runs in parallel and stops at `bldg_id = 36600`, the last building used in the final submission.

We first download the metadata file from S3.

In [None]:
meta_res_base_s3 = "s3://oedi-data-lake/nrel-pds-building-stock/end-use-load-profiles-for-us-building-stock/2024/"
meta_res_rep_s3 = "resstock_amy2018_release_2/metadata_and_annual_results/national/csv/"
meta_res_f = "baseline_metadata_only.csv"
s3_path = meta_res_base_s3 + meta_res_rep_s3 + meta_res_f
local_meta_path = f"{input_dir}/external/baseline_metadata_res.csv"

In [None]:
os.system(f"aws s3 cp {s3_path} {local_meta_path} --no-sign-request")

In [None]:
df_ext = pd.read_csv(local_meta_path)

We then select the subset of the external data to download.

In [None]:
LAST_ROW_BLDG_ID = 36600  # last row to include
all_cols = cons_cols + res_cols
dfres = df_ext[all_cols]
last_idx = dfres.loc[dfres.bldg_id==LAST_ROW_BLDG_ID].index[0]
dfres = dfres.iloc[:last_idx+1]
dfres

In [None]:
parallel_download(dfres,base_rep_s3="resstock_amy2018_release_2",target_type="res")

### 2.c. Prepare External Data
In this section, we process the external dataset by removing rows with target values that do not appear in the training dataset.

In [None]:
filepath_com_labels = f'{input_dir}/external/baseline_metadata_com.csv'  # path to the commercial metadata file
dfcom = pd.read_csv(filepath_com_labels)
filepath_res_labels = f'{input_dir}/external/baseline_metadata_res.csv'  # path to the residential metadata file
dfres = pd.read_csv(filepath_res_labels)

In [None]:
resfiles = glob(f"{input_dir}/external/features/res/*/*-0.parquet")
comfiles = glob(f"{input_dir}/external/features/com/*/*-0.parquet")
bldg_com=[int(os.path.basename(f).replace("-0.parquet","")) for f in comfiles]
bldg_res=[int(os.path.basename(f).replace("-0.parquet","")) for f in resfiles]

In [None]:
dfc = dfcom[dfcom['bldg_id'].isin(bldg_com)]
dfr=dfres[dfres['bldg_id'].isin(bldg_res)]

In [None]:
dfc=dfc[cons_cols+com_cols].rename({col:col+"_com" for col in com_cols},axis=1)
dfc.insert(0, 'building_stock_type', "commercial")
dfc["in.weekday_opening_time..hr_com"]=dfc["in.weekday_opening_time..hr_com"].astype(int)
dfc["in.weekday_operating_hours..hr_com"]=dfc["in.weekday_operating_hours..hr_com"].astype(int)
for col in com_cols:
    with open(f'{input_dir}/Label_encoded/{col}_com_label_encoder.pkl', 'rb') as f:
        le = pickle.load(f)
    dfc=dfc[dfc[col+"_com"].astype(str).isin(list(le.classes_))]
dfc=dfc.astype(object)
dfc.to_parquet(f"{input_dir}/labels/external_com.parquet", engine='pyarrow')

In [None]:
dfr=dfr[cons_cols+res_cols].rename({col:col+"_res" for col in res_cols},axis=1)
dfr.insert(0, 'building_stock_type',"residential")
for col in res_cols:
    with open(f'{input_dir}/Label_encoded/{col}_res_label_encoder.pkl', 'rb') as f:
        le = pickle.load(f)
    dfr=dfr[dfr[col+"_res"].astype(str).isin(list(le.classes_))]
dfr=dfr.astype(object)
dfr.to_parquet(f"{input_dir}/labels/external_res.parquet", engine='pyarrow')
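
The filtering in the two cells above follows the same pattern: keep a row only if every target column holds a value the corresponding encoder has seen. A compact sketch of that pattern is given below. `keep_known_classes` is a hypothetical helper, and the toy values and class lists are made up rather than read from the saved encoders.

```python
from typing import Dict, Iterable

import pandas as pd


def keep_known_classes(df: pd.DataFrame, known: Dict[str, Iterable[str]]) -> pd.DataFrame:
    """Keep rows whose value in every listed column is one of the known classes."""
    mask = pd.Series(True, index=df.index)
    for col, classes in known.items():
        # Compare as strings, as in the cells above
        mask &= df[col].astype(str).isin(list(classes))
    return df[mask]


# Toy external rows and toy known classes
toy_external = pd.DataFrame(
    {
        "bldg_id": [1, 2, 3],
        "in.heating_fuel_res": ["Electricity", "Natural Gas", "Wood"],
        "in.bedrooms_res": [3, 9, 2],
    }
)
toy_known = {
    "in.heating_fuel_res": ["Electricity", "Natural Gas"],
    "in.bedrooms_res": ["2", "3"],
}
toy_kept = keep_known_classes(toy_external, toy_known)
```

In this toy example only the first row survives: the second has an unseen bedroom count and the third an unseen heating fuel.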

### 3. Training models
<img src="assets/training_pipeline.png" alt="Training Model Diagram" width="600"/>



### 3.a. Create Multilabel Folds
- Inside the `src` folder, run `python prepare_data.py` to create the multilabel stratified K-fold splits (K=10) used for training.


### 3.b. Training
- Inside the `src` folder, run the shell script `train_effnet.sh` to train the full pipeline for the `tf_efficientnetv2_s_in21k` backbone.
- Inside the `src` folder, run the shell script `train_seresnext.sh` to train the full pipeline for the `seresnext26t_32x4d` backbone.


### 4. Inference and Submission
- Inside the `src` folder, run `infer.py` to perform inference and generate the submission file.
