<h1><center>Understanding LGBM in Complete Depth</center></h1>
<h3><center>Ubiquant Market Prediction</center></h3>

<center><img src = "https://storage.googleapis.com/kaggle-media/competitions/ubiquant/6.jpg" width = "2750" height = "2500"/></center>                                                                          

<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Contents</center></h2>

> | S.No       |                   Heading                |
> | :------------- | :-------------------:                |         
> |  01 |  [**Competition Overview**](#competition-overview)  |                   
> |  02 |  [**Libraries**](#libraries)                        |  
> |  03 |  [**Weights and Biases**](#weights-and-biases)      |
> |  04 |  [**Gradient Boosting Basic**](#gradient-boosting-basic)                |
> |  05 |  [**Gradient Boosting Advance**](#gradient-boosting-advance)                |

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:maroon; border:0; color:white' role="tab" aria-controls="home"><center>If you find this notebook useful, please give it an upvote; it helps keep up my motivation. This notebook will be updated frequently, so keep checking for further developments.</center></h3>

---

<a id="competition-overview"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Competition Overview</center></h2>

## **<span style="color:orange;">Description</span>**


In this competition, you'll build a model that forecasts an investment's return rate. Train and test your algorithm on historical prices. Top entries will solve this real-world data science problem with as much accuracy as possible.

If successful, you could improve the ability of quantitative researchers to forecast returns. This will enable investors at any scale to make better decisions.

---

## **<span style="color:orange;">Evaluation Metric</span>**

Submissions are evaluated on the mean of the Pearson correlation coefficient for each time ID.
  
You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time.
  
You will get an error if your submission includes nulls or infinities, and submissions that only include one prediction value will receive a score of -1.
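
To make the metric concrete, here is a minimal sketch of how the mean per-time-ID Pearson correlation could be computed locally. The function name `mean_pearson_by_time_id` and the toy arrays are hypothetical, and skipping groups with constant values is an assumption of this sketch; the official scorer may handle such cases differently.

```python
import numpy as np

def mean_pearson_by_time_id(time_id, y_true, y_pred):
    """Average the Pearson correlation of y_true vs. y_pred over each time_id group."""
    time_id = np.asarray(time_id)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    scores = []
    for t in np.unique(time_id):
        mask = time_id == t
        # Pearson correlation is undefined when either side is constant.
        # Assumption: such groups are skipped here; the official scorer may differ.
        if np.std(y_true[mask]) == 0 or np.std(y_pred[mask]) == 0:
            continue
        scores.append(np.corrcoef(y_true[mask], y_pred[mask])[0, 1])

    # Return NaN when no group could be scored.
    return float(np.mean(scores)) if scores else float("nan")

# Toy data: perfectly correlated in time_id 0, perfectly anti-correlated in time_id 1.
time_id = [0, 0, 0, 1, 1, 1]
y_true = [0.1, 0.2, 0.3, 0.3, 0.2, 0.1]
y_pred = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
score = mean_pearson_by_time_id(time_id, y_true, y_pred)
print(score)
```

Because the correlation is computed within each time ID and then averaged, a model is rewarded for ranking investments correctly within a time step rather than for matching the overall scale of the target.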

---

<a id="libraries"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Libraries</center></h2>

In [None]:
import os
import re

import json
import time

import numpy as np
import pandas as pd

import random
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from termcolor import colored

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from collections import defaultdict

import wandb
from wandb.lightgbm import wandb_callback, log_summary

wandb.login()

---

<a id="weights-and-biases"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Weights and Biases</center></h2>

<center><img src = "https://i.imgur.com/1sm6x8P.png" width = "750" height = "500"/></center>  

**Weights & Biases** is the machine learning platform for developers to build better models faster.

You can use W&B's lightweight, interoperable tools to

- quickly track experiments,
- version and iterate on datasets,
- evaluate model performance,
- reproduce models,
- visualize results and spot regressions,
- and share findings with colleagues.
  
Set up W&B in 5 minutes, then quickly iterate on your machine learning pipeline with the confidence that your datasets and models are tracked and versioned in a reliable system of record.

In this notebook I will use Weights & Biases to log experiments and visualize results.

---

<a id="gradient-boosting-basic"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Gradient Boosting Basic</center></h2>

To explain Gradient Boosting and LightGBM, I refer to the original paper, [LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf).

## **<span style="color:orange;">Problem with Traditional GBDT Methods</span>**


Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm with several effective implementations, such as XGBoost and pGBRT. These implementations work well when the feature dimension is low or the data size is small.
  
However, when the feature dimension is high and the data size is large, their efficiency and scalability are still unsatisfactory, even though they have been heavily optimized.

## **<span style="color:orange;">The Reason</span>**

A major reason is that, for each feature, these implementations need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming.

## **<span style="color:orange;">Proposed solution by Researchers</span>**

To tackle this problem, the authors of the paper propose two novel techniques:
1. **Gradient-based One-Side Sampling (GOSS)**
2. **Exclusive Feature Bundling (EFB)**

## **<span style="color:orange;">Basic GOSS and EFB Understanding</span>**

With **GOSS**, the authors exclude a significant proportion of data instances with small gradients and use only the rest to estimate the information gain. They prove that, since data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain a quite accurate estimate of the information gain with a much smaller data size.
  
With **EFB**, the authors bundle mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features. They prove that finding the optimal bundling of exclusive features is NP-hard, but that a greedy algorithm can achieve a quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much).
  
This new GBDT implementation with GOSS and EFB is called LightGBM.
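
As a rough illustration of how these two techniques surface in practice, the parameter dictionary below sketches a LightGBM configuration with GOSS and EFB switched on. The parameter names follow the LightGBM documentation as I understand it, the values are illustrative rather than tuned, and `X_train` / `y_train` in the commented line are hypothetical placeholders.

```python
# Illustrative LightGBM parameters; values are NOT tuned for this competition.
params = {
    "objective": "regression",
    # Enables GOSS. Assumption: in LightGBM >= 4.0 the preferred spelling is
    # "data_sample_strategy": "goss" together with "boosting": "gbdt".
    "boosting": "goss",
    "top_rate": 0.2,        # fraction of large-gradient instances that is always kept
    "other_rate": 0.1,      # fraction of small-gradient instances sampled at random
    "enable_bundle": True,  # Exclusive Feature Bundling (on by default)
    "learning_rate": 0.05,
    "verbose": -1,
}

# GOSS keeps top_rate + other_rate of the data, so the two must not exceed 1.
kept_fraction = params["top_rate"] + params["other_rate"]
print(f"Fraction of instances used per iteration: {kept_fraction:.2f}")

# Hypothetical usage (X_train and y_train are placeholders, not defined here):
# model = lgb.train(params, lgb.Dataset(X_train, label=y_train), num_boost_round=500)
```

With these settings each boosting iteration would estimate split gains from roughly 30% of the instances, which is where the speed-up described above comes from.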

---

<a id="gradient-boosting-advance"></a>
<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0; color:white' role="tab" aria-controls="home"><center>Gradient Boosting Advance</center></h2>

Gradient boosting decision tree (GBDT) is a widely-used machine learning algorithm, due to its 
- Efficiency, 
- Accuracy, and 
- Interpretability. 
  
GBDT achieves state-of-the-art performance in many machine learning tasks, such as 
- Multi-class classification, 
- Click prediction, and 
- Learning to rank. 
  
In recent years, with the emergence of big data (in terms of both the number of features
and the number of instances), GBDT is facing new challenges, especially in the tradeoff between accuracy and efficiency.
  
Conventional implementations of GBDT need to, for every feature, scan all the data instances to estimate the information gain of all the possible split points. 
Therefore, their computational complexities will be proportional to both the number of features and the number of instances. This makes these implementations very time consuming when handling big data.
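
The cost described above can be seen in a minimal sketch of an exhaustive split search. This is not the code of any particular library: the function name `best_split_exhaustive` is hypothetical, and the score used is a simplified variance-gain style quantity (squared gradient sum divided by instance count on each side).

```python
import numpy as np

def best_split_exhaustive(X, gradients):
    """Try every feature and every gap between sorted instances as a split point.

    Returns (best_feature, best_threshold, best_gain, n_candidates_scanned).
    """
    n, d = X.shape
    total = gradients.sum()
    best_feature, best_threshold, best_gain = None, None, -np.inf
    scanned = 0
    for j in range(d):                      # loop over all features
        order = np.argsort(X[:, j])
        x_sorted = X[order, j]
        left_sums = np.cumsum(gradients[order])
        for i in range(n - 1):              # loop over all instances
            scanned += 1
            if x_sorted[i] == x_sorted[i + 1]:
                continue                    # cannot split between equal values
            n_left, n_right = i + 1, n - i - 1
            right_sum = total - left_sums[i]
            gain = left_sums[i] ** 2 / n_left + right_sum ** 2 / n_right
            if gain > best_gain:
                best_feature = j
                best_threshold = (x_sorted[i] + x_sorted[i + 1]) / 2
                best_gain = gain
    return best_feature, best_threshold, best_gain, scanned

# Toy data: the gradient sign depends only on feature 2.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
gradients = np.where(X[:, 2] > 0.5, 1.0, -1.0)
feature, threshold, gain, scanned = best_split_exhaustive(X, gradients)
print(feature, round(threshold, 3), scanned)
```

The number of candidates scanned is `n_features * (n_instances - 1)`, so doubling either the number of features or the number of instances doubles the work for a single split.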

---

To tackle this challenge, a straightforward idea is to reduce the number of data instances and the number of features. However, this turns out to be highly non-trivial. 
  
**For example**, it is unclear how to perform data sampling for GBDT. Although some works sample data according to their weights to speed up the training process of boosting, they cannot be directly applied to GBDT because GBDT has no sample weights at all.

---

As a result, two novel techniques, GOSS and EFB, were developed towards this goal. Let's dig deeper into both of them.

## **<span style="color:orange;">Gradient-based One-Side Sampling (GOSS)</span>**

- While there is no native weight for data instances in GBDT, it was noticed that data instances with different gradients play different roles in the computation of information gain. 
- In particular, according to the definition of information gain, those instances with larger gradients (i.e., under-trained instances) will contribute more to the information gain.
- Therefore, when down-sampling the data instances, in order to retain the accuracy of information gain estimation, we should keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients.
- In the paper, it was proven that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range.
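
The sampling rule in the bullets above can be sketched in a few lines of NumPy. This follows my reading of the GOSS procedure in the paper: keep the top fraction `top_rate` of instances by absolute gradient, sample a fraction `other_rate` of all instances from the remainder, and up-weight the sampled small-gradient instances by `(1 - top_rate) / other_rate` so that their total contribution stays roughly unchanged. The function name `goss_sample` is hypothetical, and this is not LightGBM's internal implementation.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (indices, weights) of the instances kept by a GOSS-style sampler."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort instances by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]        # always kept
    rest_idx = order[n_top:]       # candidates for random sampling

    # Randomly keep a subset of the small-gradient instances.
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient instances to compensate for the
    # ones that were dropped.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights

rng = np.random.default_rng(42)
g = rng.normal(size=1000)          # toy gradients
idx, w = goss_sample(g, top_rate=0.2, other_rate=0.1, rng=rng)
print(len(idx), w[0], w[-1])
```

The returned weights would be applied to the gradients when computing split gains, so the estimate is built from 30% of the instances in this toy setting while still reflecting the full data distribution approximately.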
  
## **<span style="color:orange;">Exclusive Feature Bundling (EFB)</span>**

- In real applications, although there are usually a large number of features, the feature space is quite sparse, which makes it possible to design a nearly lossless approach to reduce the number of effective features. 
- Specifically, in a sparse feature space, many features are (almost) exclusive, i.e., they rarely take nonzero values simultaneously. 
- Examples include the one-hot features (e.g., one-hot word representation in text mining). 
- We can safely bundle such exclusive features. 
- To this end, researchers designed an efficient algorithm by reducing the optimal bundling problem to a graph coloring problem (by taking features as vertices and adding edges for every two features if they are not mutually exclusive), and solving it by a greedy algorithm with a constant approximation ratio.
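
The greedy bundling idea in the bullets above can be sketched as follows. This is a simplified illustration rather than LightGBM's actual implementation: the function name `greedy_bundling` and the `max_conflicts` parameter are hypothetical, and conflicts are counted directly from the nonzero pattern of a small dense array.

```python
import numpy as np

def greedy_bundling(X, max_conflicts=0):
    """Greedily group features that are (almost) never nonzero in the same row."""
    nonzero = X != 0                                  # (n_rows, n_features)
    # conflicts[i, j] = number of rows where features i and j are both nonzero
    conflicts = nonzero.T.astype(int) @ nonzero.astype(int)
    np.fill_diagonal(conflicts, 0)

    # Visit the features with the most total conflicts first.
    order = np.argsort(-conflicts.sum(axis=1), kind="stable")

    bundles = []           # feature indices in each bundle
    bundle_masks = []      # rows already occupied by each bundle
    bundle_conflicts = []  # conflicts accumulated inside each bundle
    for f in order:
        for b, mask in enumerate(bundle_masks):
            new_conflicts = int(np.sum(mask & nonzero[:, f]))
            if bundle_conflicts[b] + new_conflicts <= max_conflicts:
                bundles[b].append(int(f))
                bundle_masks[b] = mask | nonzero[:, f]
                bundle_conflicts[b] += new_conflicts
                break
        else:
            # No existing bundle can take this feature: start a new one.
            bundles.append([int(f)])
            bundle_masks.append(nonzero[:, f].copy())
            bundle_conflicts.append(0)
    return bundles

# Toy data: features 0-2 are a one-hot encoding, feature 3 is dense.
categories = np.arange(12) % 3
X = np.zeros((12, 4))
X[np.arange(12), categories] = 1.0
X[:, 3] = np.arange(12) + 1.0
bundles = greedy_bundling(X)
print(bundles)
```

In this toy example the three one-hot columns never conflict with each other, so they end up in one bundle, while the dense column conflicts with all of them and stays on its own. Raising `max_conflicts` would allow features that overlap in a few rows to share a bundle, trading a little accuracy for fewer effective features.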

> This new GBDT algorithm with GOSS and EFB was named LightGBM.

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:maroon; border:0; color:white' role="tab" aria-controls="home"><center>Code will be added soon!</center></h3>

--- 

## **<span style="color:orange;">Let's have a Talk!</span>**
> ### Reach out to me on [LinkedIn](https://www.linkedin.com/in/ishandutta0098)

---