This notebook is a tutorial on how to use XGBoost for this competition and how to use Weights and Biases to get the most out of your XGBoost model.

This tutorial is based on [[Tutorial] Time Series forecasting with XGBoost](https://www.kaggle.com/robikscube/tutorial-time-series-forecasting-with-xgboost) by [Rob Mulla](https://www.kaggle.com/robikscube). 

# Setup and Imports

In [None]:
import os
import json
import time
import numpy as np
import pandas as pd
from tqdm import tqdm
from datetime import datetime
import matplotlib.pyplot as plt

import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error

Weights and Biases comes preinstalled in the Kaggle environment, but it's recommended to upgrade to the latest version.

In [None]:
!pip install -q --upgrade wandb

In [None]:
import wandb
from wandb.xgboost import wandb_callback

wandb.login()

# Load Dataset

If you haven't already, check out the [Tutorial to the G-Research Crypto Competition](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition).

In [None]:
crypto_df = pd.read_csv('../input/g-research-crypto-forecasting/train.csv')
crypto_df.head()

The following are the columns available in the `train.csv` file.

* `timestamp`: All timestamps are returned as Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.

* `Asset_ID`: The asset ID corresponding to one of the cryptocurrencies (e.g. Asset_ID = 1 for Bitcoin). The mapping from Asset_ID to crypto asset is contained in `asset_details.csv`.

* `Count`: Total number of trades in the time interval (last minute).

* `Open`: Opening price of the time interval (in USD).

* `High`: Highest price reached during time interval (in USD).

* `Low`: Lowest price reached during time interval (in USD).

* `Close`: Closing price of the time interval (in USD).

* `Volume`: Quantity of asset bought or sold, displayed in base currency USD.

* `VWAP`: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.

* `Target`: Residual log-returns for the asset over a 15 minute horizon.
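
To make the `timestamp` and `VWAP` descriptions above concrete, here is a minimal sketch using only the standard library. The timestamps and trades are hypothetical values chosen for illustration, and `vwap` is a hypothetical helper, not part of the competition data or API.

```python
from datetime import datetime, timezone

# Hypothetical minute-bar timestamps (seconds since the Unix epoch).
timestamps = [1514764860, 1514764920, 1514764980]

# Minute-by-minute data: every timestamp is a multiple of 60.
assert all(ts % 60 == 0 for ts in timestamps)

# Convert to human-readable UTC strings.
readable = [
    datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    for ts in timestamps
]

# Hypothetical trades within one minute: (price in USD, quantity traded).
trades = [(100.0, 2.0), (102.0, 1.0), (101.0, 1.0)]

def vwap(trades):
    """Volume-weighted average price: sum(price * qty) / sum(qty)."""
    total_quantity = sum(qty for _, qty in trades)
    return sum(price * qty for price, qty in trades) / total_quantity

minute_vwap = vwap(trades)
print(readable[0], minute_vwap)
```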

In [None]:
assets = pd.read_csv('../input/g-research-crypto-forecasting/asset_details.csv').sort_values("Asset_ID").reset_index(drop=True)
assets

Here we will log the raw dataset as a W&B Artifact to build data lineage as we train models and validate on different splits of the dataset.

This is an additional step, but it can be really useful to have in your arsenal.

Note: This is a one-time step. Once you have logged your raw data, you only need to log different splits or preprocessed versions of it.

In [None]:
# The config below is for demonstration purposes. 
wandb_config = {'competition': 'gresearch', '_wandb_kernel': 'ayut'}

run = wandb.init(project='gresearch', config=wandb_config, job_type='raw_data')
raw_data_artifact = wandb.Artifact('raw-data', type='raw-dataset')
raw_data_artifact.add_file('../input/g-research-crypto-forecasting/train.csv')
raw_data_artifact.add_file('../input/g-research-crypto-forecasting/asset_details.csv')    
run.log_artifact(raw_data_artifact)
run.finish()

#### Utils

In [None]:
import calendar

# If you encounter a "year is out of range" error, the timestamp
# may be in milliseconds; try `timestamp /= 1000` in that case.
def timestamp_to_utc(timestamp: int):
    return datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')

def utc_to_timestamp(date_str):
    # timegm treats the parsed date as UTC; time.mktime would use the local timezone
    return calendar.timegm(datetime.strptime(date_str, "%d/%m/%Y").timetuple())

# Prepare Train-Validation Split

Note that I have used the data that's used for LB score computation as `valid_df`. I will be using this `valid_df` for evaluating all my models. 

In [None]:
crypto_df['datetime'] = pd.to_datetime(crypto_df['timestamp'], unit='s')
train_df = crypto_df[crypto_df['datetime'] < '2021-06-13 00:00:00']
valid_df = crypto_df[crypto_df['datetime'] >= '2021-06-13 00:00:00']

print("Number of samples in train_df: ", len(train_df))
print("Number of samples in valid_df: ", len(valid_df))
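
The chronological split above can be checked on a toy frame. This is a minimal sketch with hypothetical rows that straddle the cutoff; it only illustrates that every training row precedes every validation row.

```python
import pandas as pd

# Hypothetical minute bars around 2021-06-13 00:00:00 UTC (timestamp 1623542400).
toy = pd.DataFrame({
    "timestamp": [1623542280, 1623542340, 1623542400, 1623542460],
    "Close": [1.0, 2.0, 3.0, 4.0],
})
toy["datetime"] = pd.to_datetime(toy["timestamp"], unit="s")

cutoff = pd.Timestamp("2021-06-13 00:00:00")
toy_train = toy[toy["datetime"] < cutoff]   # strictly before the cutoff
toy_valid = toy[toy["datetime"] >= cutoff]  # cutoff and later

print(len(toy_train), len(toy_valid))
```

Splitting by time rather than at random keeps future rows out of the training set.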

We will again save the splits as W&B Artifact and use the reference to the previously logged raw data to build the data lineage. 

The step below might take some time, since large `csv` files are being written to disk. Again, this is a one-time process, although you might want to repeat it for different splits of the raw dataset. For a competition that runs for months, good data version control can make a huge difference.

In [None]:
train_df.to_csv('train_df.csv', index=False)
valid_df.to_csv('valid_df.csv', index=False)

run = wandb.init(project='gresearch', config=wandb_config, job_type='data_split')
# Notice the use of raw_artifact. This will act as reference for this split.
raw_artifact = run.use_artifact('ayush-thakur/gresearch/raw-data:latest', type='raw-dataset')

train_artifact = wandb.Artifact('train-data', type='train-split')
valid_artifact = wandb.Artifact('valid-data', type='valid-split')

train_artifact.add_file('train_df.csv')
valid_artifact.add_file('valid_df.csv')

run.log_artifact(train_artifact)
run.log_artifact(valid_artifact)

run.finish()

# Features

In [None]:
# Features
featues_col = ["Count", "Open", "High", "Low", "Close", "Volume", "VWAP"]

def upper_shadow(df):
    return df['High'] - np.maximum(df['Close'], df['Open'])

def lower_shadow(df):
    return np.minimum(df['Close'], df['Open']) - df['Low']

def log_return(series, periods=1):
    return np.log(series).diff(periods=periods)

def fill_nan_inf(df):
    # Fill NaN values
    df = df.fillna(0)
    # Fill Inf values
    df = df.replace([np.inf, -np.inf], 0)
    
    return df

def create_features(df, label=False):
    """
    Create time series features
    """
    # Build features
    up_shadow = upper_shadow(df)
    low_shadow = lower_shadow(df)    
    five_min_log_return = log_return(df.VWAP, periods=5)
    abs_one_min_log_return = log_return(df.VWAP,periods=1).abs()    
    features = df[featues_col]

    # Concat all the features into one dataframe
    X = pd.concat([features, up_shadow, low_shadow, 
                   five_min_log_return, abs_one_min_log_return], 
                  axis=1)
    
    # Rename feature columns
    X.columns = featues_col+["up_shadow", "low_shadow", "five_min_log_return", "abs_one_min_log_return"]
    
    # Fill NaN and Inf
    X = fill_nan_inf(X)
    
    if label:
        y = df.Target
        # Fill NaN and Inf
        y = fill_nan_inf(y)
        
        return X, y
    
    return X
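
As a quick sanity check of the feature definitions above, here is a small worked example on two hypothetical candles. The numbers are made up for illustration; the formulas mirror `upper_shadow`, `lower_shadow`, and `log_return`.

```python
import numpy as np
import pandas as pd

# Two hypothetical one-minute candles.
candles = pd.DataFrame({
    "Open":  [100.0, 103.0],
    "High":  [105.0, 104.0],
    "Low":   [98.0, 99.0],
    "Close": [103.0, 100.0],
    "VWAP":  [101.0, 102.0],
})

# Upper shadow: distance from the high to the top of the candle body.
up = candles["High"] - np.maximum(candles["Close"], candles["Open"])
# Lower shadow: distance from the bottom of the candle body to the low.
low = np.minimum(candles["Close"], candles["Open"]) - candles["Low"]
# One-period log return of VWAP; the first row has no predecessor, so it is NaN.
log_ret = np.log(candles["VWAP"]).diff(periods=1)
# Same NaN handling as fill_nan_inf: replace NaN with 0.
log_ret_filled = log_ret.fillna(0)

print(up.tolist(), low.tolist(), log_ret_filled.tolist())
```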

# Train a naive XGBRegressor on one crypto data

In this section, we will train an XGBRegressor, which is an implementation of the scikit-learn API for XGBoost regression.

We will take the crypto data of Bitcoin, fill the missing gaps in the series, and compute features for the train and validation splits. We will then initialize a W&B run and train an XGBRegressor with default parameters. Later in this notebook we will try to find the best parameters.

Let's first start by initializing a W&B run and use the split artifact reference that we logged previously.

In [None]:
# Initialize a W&B run
run = wandb.init(project='gresearch', config=wandb_config, job_type='subset') 

# Notice the use of splits.
train_artifact = run.use_artifact('ayush-thakur/gresearch/train-data:latest', type='train-split')
valid_artifact = run.use_artifact('ayush-thakur/gresearch/valid-data:latest', type='valid-split')

Let's prepare the features for just Bitcoin trading data.

In [None]:
# Get single crypto trading data, indexed by timestamp so that gaps can be filled per minute
btc_train = train_df[train_df.Asset_ID==1].set_index('timestamp')
btc_valid = valid_df[valid_df.Asset_ID==1].set_index('timestamp')

# Fill missing minutes by forward-filling the previous row
btc_train = btc_train.reindex(range(btc_train.index[0],btc_train.index[-1]+60,60),method='pad')
btc_valid = btc_valid.reindex(range(btc_valid.index[0],btc_valid.index[-1]+60,60),method='pad')
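
Filling gaps with `reindex(..., method='pad')` assumes the frame is indexed by timestamp in seconds. A toy sketch with hypothetical values, where the bar at second 120 is missing:

```python
import pandas as pd

# Hypothetical minute bars indexed by timestamp; the bar at 120 is missing.
bars = pd.DataFrame({"Close": [10.0, 11.0, 13.0]}, index=[0, 60, 180])

# Build the complete minute grid from the first to the last timestamp.
full_index = range(bars.index[0], bars.index[-1] + 60, 60)

# 'pad' forward-fills: a missing minute takes the values of the previous bar.
filled = bars.reindex(full_index, method="pad")
print(filled)
```

The missing minute inherits the previous bar's values, so its one-minute log return is zero.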

# Create features
X_train, y_train = create_features(btc_train, label=True)
X_valid, y_valid = create_features(btc_valid, label=True)

Since we have used the training and validation splits and are going to train the model on a subset of the data, we should log the subset as a W&B Artifact. Here we will log the features dataframes so that they can be sanity-checked later.

In [None]:
# Write the train and validation feature frames to separate files
pd.concat([X_train, y_train], axis=1).to_csv('btc_subset_train.csv', index=False)
pd.concat([X_valid, y_valid], axis=1).to_csv('btc_subset_valid.csv', index=False)

btc_subset = wandb.Artifact('btc-data', type='subset')
btc_subset.add_file('btc_subset_train.csv')
btc_subset.add_file('btc_subset_valid.csv')
run.log_artifact(btc_subset)

run.finish()

Now, finally, let's train a simple regressor and use W&B's XGBoost callback to log the metrics and configs.

In [None]:
# Initialize a W&B run
run = wandb.init(project='gresearch', config=wandb_config, job_type='train') 

# Initialize an XGBRegressor with some parameters.
reg = xgb.XGBRegressor(n_estimators=1000)

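# Train the regressor. Note the use of wandb_callback.
# Newer XGBoost releases expect early_stopping_rounds and callbacks to be
# passed to the XGBRegressor constructor instead of fit().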
reg.fit(X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        early_stopping_rounds=50,
        verbose=False,
        callbacks=[wandb_callback()])

Let's save the model along with model configuration. We will use the data lineage created so far and start building model lineage on top of it.

In [None]:
# Get the booster
bstr = reg.get_booster()

# Save the booster to disk
model_name = f'{run.id}_model.json'
model_path = f'./{model_name}'
bstr.save_model(str(model_path))

# Get the booster's config
config = json.loads(bstr.save_config())

model_artifact = wandb.Artifact(name=model_name, type='model', metadata=dict(config))
# Notice the use of earlier artifact as reference for model artifact
features_artifact = run.use_artifact('ayush-thakur/gresearch/btc-data:v1', type='subset')

model_artifact.add_file(model_path)
run.log_artifact(model_artifact)
run.finish()

# So Far

So far we have built a data and model lineage and used `wandb_callback` for XGBoost. Using `wandb_callback` takes a single line of code and lets you keep tabs on your experiments.

## [Check out the W&B Dashboard](https://wandb.ai/ayush-thakur/gresearch?workspace=user-ayush-thakur)

The image below shows the data and model lineage created so far. Imagine you are training tons of models on different splits of the same dataset. Taking the extra effort to build an MLOps pipeline around them can be really useful in the long run.

![img](https://i.imgur.com/jkVCZRi.png)

The image below shows the logged metrics.

![img](https://i.imgur.com/FXQTmts.png)

# XGBRegressor as Multi Output Regressor

In this section we will train on one month of trading data and try to forecast the returns for the following month. The longer-term goal is to take one quarter (3 months) of data and forecast the 4th month.

Let's see how things go. 

We will use scikit-learn's `MultiOutputRegressor`. Note, however, that this strategy fits one independent regressor per target, so it doesn't use any dependence between different targets.

First let's build our train and valid datasets. We will use the data from January 2021 for training and February 2021 for validation.

In [None]:
# Select training and validation periods.
# 86340 is one day (86400 s) minus one minute, so the validation window
# starts one minute after the training window ends.

train_window = [utc_to_timestamp("01/01/2021"), utc_to_timestamp("31/01/2021")]
valid_window = [utc_to_timestamp("01/02/2021")-86340, utc_to_timestamp("28/02/2021")]

train_window, valid_window
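
The window arithmetic can be verified with a few lines of standard-library code. This is a sketch: `to_ts` and `fmt` are hypothetical helpers that interpret dates as midnight UTC, which is an assumption about how the windows are meant to be read.

```python
import calendar
from datetime import datetime, timezone

def to_ts(date_str):
    # Hypothetical helper: parse dd/mm/YYYY as midnight UTC, return Unix seconds.
    return calendar.timegm(datetime.strptime(date_str, "%d/%m/%Y").timetuple())

def fmt(ts):
    # Hypothetical helper: render Unix seconds as a UTC string.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

train_end = to_ts("31/01/2021")
valid_start = to_ts("01/02/2021") - 86340  # one day minus one minute earlier

print(fmt(train_end), fmt(valid_start), valid_start - train_end)
```

Under this reading, the training window ends at midnight at the start of 31 January and the validation window begins one minute later, so most of 31 January falls in the validation window.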

In [None]:
# Get single crypto trading data
btc_df = crypto_df[crypto_df.Asset_ID==1].set_index('timestamp')

# Get the windowed data
btc_train = btc_df.loc[train_window[0]:train_window[1]]
btc_valid = btc_df.loc[valid_window[0]:valid_window[1]]

# Fill missing value
btc_train = btc_train.reindex(range(train_window[0], train_window[1]+60,60),method='pad')
btc_valid = btc_valid.reindex(range(valid_window[0], valid_window[1]+60,60),method='pad')

# Create features
X_train, y_train = create_features(btc_train, label=True)
X_valid, y_valid = create_features(btc_valid, label=True)

In [None]:
X_trains, y_trains, X_valids, y_valids = [], [], [], []

for i in tqdm(range(len(assets))):
    # Look up the asset ID explicitly rather than relying on the row position
    asset_id = assets.loc[i, 'Asset_ID']
    # Get single crypto trading data
    df = crypto_df[crypto_df.Asset_ID==asset_id].set_index('timestamp')
    
    # Get the windowed data
    train = df.loc[train_window[0]:train_window[1]]
    valid = df.loc[valid_window[0]:valid_window[1]]

    # Fill missing value
    train = train.reindex(range(train_window[0], train_window[1]+60,60),method='pad')
    valid = valid.reindex(range(valid_window[0], valid_window[1]+60,60),method='pad')

    # Create features
    X_train, y_train = create_features(train, label=True)
    X_valid, y_valid = create_features(valid, label=True)
    
    X_trains.append(X_train); y_trains.append(y_train)
    X_valids.append(X_valid); y_valids.append(y_valid)

In [None]:
X_all_train = np.concatenate(X_trains, axis=1)
X_all_valid = np.concatenate(X_valids, axis=1)
y_all_train = np.column_stack(y_trains)
y_all_valid = np.column_stack(y_valids)

X_all_train.shape, y_all_train.shape, X_all_valid.shape, y_all_valid.shape
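
To see what shapes to expect from the stacking step, here is a small sketch with random placeholder arrays. The sizes are hypothetical; only the shape logic matters.

```python
import numpy as np

n_samples, n_features, n_assets = 5, 3, 4
rng = np.random.default_rng(0)

# One feature matrix and one target vector per asset (placeholder data).
X_list = [rng.normal(size=(n_samples, n_features)) for _ in range(n_assets)]
y_list = [rng.normal(size=n_samples) for _ in range(n_assets)]

# Features are placed side by side: one block of columns per asset.
X_all = np.concatenate(X_list, axis=1)
# Targets become one column per asset.
y_all = np.column_stack(y_list)

print(X_all.shape, y_all.shape)
```

Each row then describes a single minute across all assets, which requires every asset to share the same timestamp grid.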

In [None]:
# define the direct multioutput model and fit it
from sklearn.multioutput import MultiOutputRegressor
mreg = MultiOutputRegressor(xgb.XGBRegressor(n_estimators=1000))

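# Fit on only the first 10 rows as a quick smoke test;
# drop the slices to train on the full window.
mreg.fit(X_all_train[:10], y_all_train[:10])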
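
The sketch below illustrates that `MultiOutputRegressor` fits one estimator per target column, independently of the others. To keep it fast and deterministic it swaps in `LinearRegression` for `XGBRegressor` and uses synthetic data; both choices are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Three synthetic targets built from the features.
Y = np.column_stack([2.0 * X[:, 0], X[:, 1] - X[:, 2], X.sum(axis=1)])

# One LinearRegression is cloned and fitted per target column.
multi = MultiOutputRegressor(LinearRegression()).fit(X, Y)

# Fitting a single model on target 1 alone gives the same predictions.
single = LinearRegression().fit(X, Y[:, 1])

preds = multi.predict(X)
print(len(multi.estimators_), preds.shape)
```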

In [None]:
y_pred_lr_all = mreg.predict(X_all_valid)

In [None]:
y_pred_lr_all
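
One way to inspect multi-output predictions is to score each target column separately. The following is a minimal sketch with hypothetical arrays, computing the mean absolute error and root mean squared error per column with NumPy. It is not the competition's scoring metric.

```python
import numpy as np

# Hypothetical targets and predictions: rows are minutes, columns are assets.
y_true = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
y_pred = np.array([[0.0, 2.0], [2.0, 3.0], [1.0, 5.0]])

errors = y_pred - y_true
# Average over rows (axis=0) to get one score per asset.
mae_per_asset = np.abs(errors).mean(axis=0)
rmse_per_asset = np.sqrt((errors ** 2).mean(axis=0))

print(mae_per_asset, rmse_per_asset)
```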

# WORK IN PROGRESS

I hope you find this useful. If you have any questions, feel free to comment or reach out. And if you think it can be improved further, please let me know.

Upcoming:

* Extend the regression for the entire dataset.
* Use a quarter worth of data and forecast for one month ahead. (Not sure how exactly)
* Show how to use W&B Sweeps for Hyperparameter Optimization. 