This competition is challenging because of the large number of features and the simple fact that it's financial data. There are many ways to approach the problem, but two top-level approaches stand out.

* Treat it like a regular regression task. Here the features (`f_0` to `f_299`, plus `investment_id`) are not time-bound.
* Treat it literally like a forecasting problem, where the temporal relationships between features are modeled as well.

One popular choice is going to be [LightGBM](https://lightgbm.readthedocs.io/en/latest/). Since there are many combinations of hyperparameters to try, this kernel shows how [Weights and Biases](https://wandb.ai/site) can help you stay sane while juggling them.

In this kernel, we approach the problem as a regular regression task and will use LightGBM to model the data distribution. 

# Imports and Setup
Let's get the latest version of W&B. 

In [None]:
!pip -qq install --upgrade wandb

In [None]:
import os
import gc
import numpy as np
import pandas as pd
import lightgbm as lgb

# W&B
import wandb

try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=api_key)
    anonymous = None
except Exception:
    # Fall back to anonymous logging when no secret is configured
    anonymous = "must"
    print('To use your W&B account,\nGo to Add-ons -> Secrets and provide your W&B access token. Use the Label name as wandb_api. \nGet your W&B access token from here: https://wandb.ai/authorize')

Next we import the two functions that let us log a bunch of useful things (hyperparameters, training and validation metrics, etc.). With them we can get the most out of LightGBM and W&B together.

In [None]:
from wandb.integration.lightgbm import log_summary, wandb_callback

# Dataset

We will be using Rob Mulla's highly useful [parquet dataset](https://www.kaggle.com/robikscube/ubiquant-parquet). 

In [None]:
df = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
df.head()

Just to simplify things, we will look at a single `investment_id`. Note that for a given `investment_id` not all `time_id`s are available. But since we aren't concerned about time (temporal relationships) for now, we can proceed without second thoughts.

In [None]:
investment_id = 100
investment_df = df[df.investment_id == investment_id]
print(f'Number of rows: {len(investment_df)}')
investment_df.head()
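
To see how many `time_id`s are actually missing for one `investment_id`, a small check like the one below can help. It is only a sketch on a synthetic frame: `toy_df` and the helper `missing_time_ids` are made up for illustration, and the only assumption carried over from the real data is that it has integer `time_id` and `investment_id` columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition data (hypothetical values).
toy_df = pd.DataFrame({
    "time_id": [0, 1, 2, 3, 4, 0, 2, 4],
    "investment_id": [7, 7, 7, 7, 7, 100, 100, 100],
})

def missing_time_ids(frame: pd.DataFrame, investment_id: int) -> np.ndarray:
    """Return the time_ids that occur in `frame` but not for `investment_id`."""
    all_ids = frame["time_id"].unique()
    seen = frame.loc[frame["investment_id"] == investment_id, "time_id"].unique()
    # setdiff1d returns the sorted values of all_ids that are not in seen
    return np.setdiff1d(all_ids, seen)

gaps = missing_time_ids(toy_df, 100)
print(f"investment_id=100 is missing {len(gaps)} time_ids: {gaps.tolist()}")
```

On the real data you would pass `df` and the chosen `investment_id` instead of the toy frame.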

# Simple Model

In this section, we will not use any fancy stratified K-fold training. Since the purpose of this kernel is also to show off the LightGBM x W&B functionality, it's best to train a LightGBM regressor on a standard train-validation split.

Note that if you are training on the entire dataset, `investment_id` can be a feature.
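
One possible way to do that is to mark `investment_id` as a pandas `category` column, which LightGBM treats as categorical by default when it is given a DataFrame. The snippet below is a rough sketch on made-up rows (`toy_df` and `feature_cols` are illustrative names), not a tuned setup. With thousands of distinct ids, whether this helps is something to verify on validation data.

```python
import pandas as pd

# Made-up rows that mimic the competition columns (hypothetical values).
toy_df = pd.DataFrame({
    "investment_id": [1, 1, 100, 100, 2047],
    "f_0": [0.1, -0.3, 0.5, 0.2, -0.7],
    "f_1": [1.2, 0.4, -0.9, 0.0, 0.3],
    "target": [0.01, -0.02, 0.03, 0.00, -0.01],
})

# Anonymized features plus the id column
feature_cols = [c for c in toy_df.columns if c.startswith("f_")] + ["investment_id"]

X = toy_df[feature_cols].copy()
# A `category` dtype lets LightGBM pick the column up as categorical
X["investment_id"] = X["investment_id"].astype("category")
y = toy_df["target"]

print(X.dtypes)
```

Note that the dtype only survives if you pass the DataFrame itself; converting to a NumPy array with `.values` drops the categorical information.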

### Create train-validation split

Here we are not creating a random split. For now we are trying not to look at the future. You will see in a few moments why I did this even though we are not concerned about temporal dependence. :)

In [None]:
# Keep only the anonymized feature columns (f_0 ... f_299)
features = [feat for feat in investment_df.columns if feat.startswith('f_')]

val_split = 0.2
val_split_index = len(investment_df) - int(len(investment_df)*val_split)
print(f'Number of rows: {len(investment_df)}, number of train rows: {val_split_index}')

train_df, valid_df = investment_df[:val_split_index], investment_df[val_split_index:]
train_X, train_y = train_df[features].values, train_df.target.values
valid_X, valid_y = valid_df[features].values, valid_df.target.values
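
The positional split above only avoids looking at the future if the rows are already ordered by `time_id`. A slightly more defensive variant sorts first. This is a sketch on synthetic data: `toy_df` and the helper `chronological_split` are illustrative names, not part of the original kernel.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic rows for a single investment, deliberately out of time order.
toy_df = pd.DataFrame({
    "time_id": rng.permutation(50),
    "f_0": rng.normal(size=50),
    "target": rng.normal(size=50),
})

def chronological_split(frame, val_fraction=0.2, time_col="time_id"):
    """Hold out the most recent `val_fraction` of rows, ordered by `time_col`."""
    ordered = frame.sort_values(time_col, kind="stable").reset_index(drop=True)
    split_at = len(ordered) - int(len(ordered) * val_fraction)
    return ordered.iloc[:split_at], ordered.iloc[split_at:]

train_part, valid_part = chronological_split(toy_df)
print(f"train rows: {len(train_part)}, valid rows: {len(valid_part)}")
print("no look-ahead:", train_part["time_id"].max() < valid_part["time_id"].min())
```

For a single `investment_id` each `time_id` appears at most once, so a row-count split cannot cut through a `time_id`. On the full dataset you would want to split on `time_id` values rather than row positions.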

### LightGBM dataset

In [None]:
lgb_train = lgb.Dataset(train_X, train_y)
lgb_valid = lgb.Dataset(valid_X, valid_y, reference=lgb_train)

### Train 

1. Pass `wandb_callback` to the `callbacks` argument of `lgb.train`. This will:
    - log the params passed to `lightgbm.train` as W&B config.
    - log evaluation metrics collected by LightGBM, such as rmse, accuracy, etc., to Weights & Biases.
    - capture the best metrics.


2. Once the model is trained, use `log_summary` to:
    - log `best_iteration` and `best_score`.
    - log feature importance plot.
    - save and upload your best trained model to Weights & Biases Artifacts (when `save_model_checkpoint = True`)


Notes: 
> You can use `LGBMRegressor` and it will work the same way. 

> The parameters used in the cell below may or may not be optimal. 


In [None]:
# Define the parameters for LightGBM
params = {
     'boosting_type': 'gbdt',
     'objective': 'regression',
     'metric': ['rmse', 'l2', 'l1', 'mae'],
     'num_leaves': 8,
     'learning_rate': 0.1,
     'feature_fraction': 0.7,
     'bagging_fraction': 0.8,
     'bagging_freq': 5,
}

# 1️⃣ Initialize a new wandb project
run = wandb.init(project='ubiquant-lgb', job_type=f'train_{investment_id}')

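# 2️⃣ Train with `wandb_callback`.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_train, lgb_valid],
                # One name per entry in `valid_sets`
                valid_names=['train', 'validation'],
                # Early stopping goes in as a callback; the `early_stopping_rounds`
                # keyword was removed from `lgb.train` in LightGBM 4.0
                callbacks=[wandb_callback(),
                           lgb.early_stopping(stopping_rounds=10)])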

# 3️⃣ Use `log_summary` to get feature importance, best score, etc.
log_summary(gbm, save_model_checkpoint=False)

# 4️⃣ End the run (needed in Jupyter sessions)
wandb.finish()
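
To juggle several hyperparameter combinations, one simple option is to expand a small grid and launch one W&B run per combination. The sketch below only builds the grid with the standard library. The values in `search_space` and the helper `expand_grid` are hypothetical, and the training step is left as a comment because it is the same cell as above.

```python
from itertools import product

# Settings shared by every run (taken from the params cell above)
base_params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": ["rmse", "l1"],
}

# Hypothetical values to try; adjust to your own budget
search_space = {
    "num_leaves": [8, 16, 32],
    "learning_rate": [0.05, 0.1],
    "feature_fraction": [0.7, 0.9],
}

def expand_grid(base, space):
    """Return one params dict per combination of the values in `space`."""
    keys = list(space)
    combos = product(*(space[k] for k in keys))
    return [{**base, **dict(zip(keys, values))} for values in combos]

param_grid = expand_grid(base_params, search_space)
print(f"{len(param_grid)} combinations to try")

for params in param_grid[:2]:
    # For each `params` you would repeat the cell above:
    # wandb.init(...), lgb.train(params, ...), log_summary(...), wandb.finish()
    print({k: params[k] for k in search_space})
```

Because `wandb_callback` logs the params as W&B config, each combination ends up as its own run that you can compare in the dashboard.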

### What do you get by using W&B?

1. The hyperparameters used to train your model are saved as W&B Config. 
![img](https://i.imgur.com/fGeGj5T.png)

2. Track train/val/test metrics. 
![img](https://i.imgur.com/1cdlUIT.png)
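
As a reminder of what the tracked metrics mean, here is a tiny NumPy sketch with made-up numbers: in LightGBM `l1` is the mean absolute error (`mae` is an alias for it), `l2` is the mean squared error, and `rmse` is the square root of `l2`.

```python
import numpy as np

# Made-up targets and predictions, only for illustration
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.0, 2.0, 2.0, 1.0])

errors = y_pred - y_true          # [0, 1, 0, -2]

l1 = np.mean(np.abs(errors))      # mean absolute error
l2 = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(l2)                # root mean squared error

print(f"l1={l1:.4f}  l2={l2:.4f}  rmse={rmse:.4f}")
```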

# WORK IN PROGRESS