<a href="https://colab.research.google.com/github/subhashpolisetti/Decision-Tree-Ensemble-Algorithms/blob/main/GradientBoostRankingTechniques_XGBoost_LightGBM_CatBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ranking Models Comparison: XGBoost, LightGBM, and CatBoost

This notebook demonstrates the implementation and comparison of three popular ranking models—**XGBoost**, **LightGBM**, and **CatBoost**—on a synthetic ranking dataset. The goal is to predict the relevance of items in a ranked list for a set of queries. Each of these models is widely used for ranking tasks in machine learning and provides powerful tools for learning to rank items based on their relevance.

## Key Concepts:

1. **Ranking Models**: In ranking tasks, the model learns to assign a relevance score to items within a query. The objective is to rank the items in an order based on their predicted relevance, rather than classifying them into fixed categories or predicting continuous values.

2. **XGBoost**: XGBoost (Extreme Gradient Boosting) is an efficient and scalable gradient boosting library. It offers several ranking objectives, including `rank:pairwise`, `rank:ndcg`, and `rank:map`. This notebook uses the **pairwise ranking** objective (`rank:pairwise`), which learns from pairs of items within the same query that have different relevance labels.

3. **LightGBM**: LightGBM (Light Gradient Boosting Machine) is a fast, distributed, high-performance implementation of gradient boosting. It is optimized for large datasets and has a ranking-specific objective called **LambdaRank**.

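4. **CatBoost**: CatBoost is a gradient boosting framework that is particularly efficient with categorical features. It offers several **ranking-specific objectives** for learning-to-rank problems, such as **`YetiRank`**, `PairLogit`, and `QueryRMSE`. The `CatBoostRanker` in this notebook is created without a `loss_function` argument, so it uses the default, `YetiRank`.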
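
The pairwise idea mentioned above can be illustrated with a minimal sketch. It computes a logistic loss over all item pairs in a single query whose labels differ. The function name `pairwise_logistic_loss` and the toy data are illustrative assumptions; the loss actually used inside each library may differ in weighting and pair sampling.

```python
import math
from itertools import combinations


def pairwise_logistic_loss(scores, labels):
    """Mean logistic loss over item pairs with different labels in ONE query.

    Illustrative only: not the exact loss implemented by any library.
    """
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # tied labels carry no ordering information
        # Orient the pair so that `hi` is the more relevant item.
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        margin = scores[hi] - scores[lo]
        # Small when the more relevant item is scored higher, large otherwise.
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0


labels = [3, 1, 2]
good_scores = [2.0, -1.0, 0.5]  # same order as the labels
bad_scores = [-1.0, 2.0, 0.5]   # most relevant item scored lowest

print("loss, well-ordered scores:  ", pairwise_logistic_loss(good_scores, labels))
print("loss, badly-ordered scores: ", pairwise_logistic_loss(bad_scores, labels))
```

The loss depends only on score differences within a query, which is why the absolute scale of a ranker's output has no direct meaning.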

## Steps in the Notebook:

1. **Synthetic Dataset**: We create a synthetic ranking dataset with 100 samples, 5 features, and integer relevance scores from 1 to 5. Both the features and the labels are drawn at random. The samples are organized into two query groups of 50 items each.

2. **Train/Test Split**: The first 80 rows are used for training and the last 20 for testing. Group sizes are redefined for each subset: two training queries of 40 items each and two test queries of 10 items each.

3. **Model Training**:
    - **XGBoost**: The XGBoost model is trained using the `rank:pairwise` objective.
    - **LightGBM**: The LightGBM model is trained using the `lambdarank` objective.
    - **CatBoost**: The CatBoost model is trained with `CatBoostRanker`'s default ranking objective (`YetiRank`).

4. **Prediction and Evaluation**: After training, each model predicts relevance scores for the test data, and the raw scores are printed. No ranking metric is reported in the notebook.

5. **Comparison**: The predictions from **XGBoost**, **LightGBM**, and **CatBoost** are printed side by side. Each model's scores are on its own scale, so only the ordering of items within a query is comparable across models.
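
One way to turn the printed scores into a single quality number is NDCG averaged over queries. The following is a minimal sketch under stated assumptions: it uses the common gain `2**rel - 1` and discount `1 / log2(position + 1)`, and the helper names (`dcg`, `ndcg_at_k`, `mean_ndcg`) are hypothetical, not part of any of the three libraries.

```python
import math


def dcg(labels_in_ranked_order, k):
    """Discounted cumulative gain of the top-k labels (position 0 is the best rank)."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(labels_in_ranked_order[:k])
    )


def ndcg_at_k(labels, scores, k):
    """NDCG@k for ONE query: DCG of the predicted order divided by the ideal DCG."""
    ranked = [
        rel
        for _, rel in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ]
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(labels, scores, group_sizes, k=10):
    """Average NDCG@k over consecutive query groups of the given sizes."""
    values, start = [], 0
    for size in group_sizes:
        stop = start + size
        values.append(ndcg_at_k(labels[start:stop], scores[start:stop], k))
        start = stop
    return sum(values) / len(values)


# Toy example: two queries of two items each.
toy_labels = [3, 1, 1, 3]
toy_scores = [1.0, 0.0, 1.0, 0.0]  # first query ordered correctly, second reversed
print(mean_ndcg(toy_labels, toy_scores, group_sizes=[2, 2], k=2))
```

With the notebook's variables, a call such as `mean_ndcg(list(y_test), list(y_pred), group_test)` would score one model's predictions. Because the labels here are random, values close to those of a random ordering are to be expected.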

## Libraries Used:
- **XGBoost**: For training the XGBoost ranking model.
- **LightGBM**: For training the LightGBM ranking model.
- **CatBoost**: For training the CatBoost ranking model.
- **NumPy**: For generating and manipulating the synthetic dataset.

## Results:
- The predicted ranking scores from all three models are printed. Because the features and labels are random, the scores illustrate each library's output format rather than meaningful differences in ranking quality.

This notebook helps to understand how different ranking models work and compares their predictions on a synthetic ranking task. Feel free to experiment with hyperparameter tuning or use real-world ranking datasets to test these models' effectiveness.


In [1]:
from xgboost import XGBRanker
import numpy as np

# Generate a synthetic ranking dataset
X = np.random.rand(100, 5)  # 100 samples, each with 5 random features
y = np.random.randint(1, 6, size=100)  # Random relevance scores (between 1 and 5)
group = [50, 50]  # Two queries of 50 samples each (not used below; the split redefines the group sizes)

# Train/Test Split
X_train, X_test = X[:80], X[80:]  # Use the first 80 samples for training and the rest for testing
y_train, y_test = y[:80], y[80:]  # Split the relevance scores correspondingly
group_train = [40, 40]  # Each group in the training set contains 40 items
group_test = [10, 10]  # Each group in the test set contains 10 items

# Initialize and train the XGBoost Ranker model
xgb_ranker = XGBRanker(
    objective="rank:pairwise",  # Pairwise ranking objective, where the model learns to rank pairs of items
    learning_rate=0.1,          # The learning rate that controls how much the model adjusts during each iteration
    max_depth=3,                # Maximum depth of each decision tree to prevent overfitting
    n_estimators=100            # The number of boosting rounds (trees) to train
)

# Fit the model on the training data with the specified group information
xgb_ranker.fit(X_train, y_train, group=group_train)

# Predict the relevance scores for the test data
y_pred = xgb_ranker.predict(X_test)

# Print the predicted ranking scores for the test data
print("XGBoost Ranking Predictions:", y_pred)


XGBoost Ranking Predictions: [-0.9623178   0.04632036 -1.1156845   0.6520399  -0.06923307  0.40256834
 -0.53648645  0.6583479  -0.5993957   0.10472348 -0.16147704 -0.3133536
 -0.08844604  0.04491812 -0.22709228  0.22807404  0.08727098 -0.4425296
 -0.30992997 -1.1322114 ]
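
The scores above are only meaningful relative to other items in the same query. As a minimal sketch, the helper below (a hypothetical `rank_within_groups`, not an XGBoost function) walks through consecutive group sizes and orders the items of each query from highest to lowest score. The example scores are the first six XGBoost predictions rounded to two decimals, and the group sizes `[3, 3]` are chosen only to keep the illustration short.

```python
def rank_within_groups(scores, group_sizes):
    """For each query, return item indices (relative to that query) ordered best-first."""
    if sum(group_sizes) != len(scores):
        raise ValueError("group sizes must sum to the number of rows")
    rankings, start = [], 0
    for size in group_sizes:
        block = scores[start:start + size]
        # Highest score first; ties keep their original order.
        order = sorted(range(size), key=lambda i: block[i], reverse=True)
        rankings.append(order)
        start += size
    return rankings


example_scores = [-0.96, 0.05, -1.12, 0.65, -0.07, 0.40]
print(rank_within_groups(example_scores, group_sizes=[3, 3]))
# → [[1, 0, 2], [0, 2, 1]]
```

With the notebook's variables, `rank_within_groups(list(y_pred), group_test)` would give the predicted order for each of the two test queries.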


In [2]:
import lightgbm as lgb

# Dataset and groups for LightGBM
lgb_train = lgb.Dataset(X_train, y_train, group=group_train)  # Create LightGBM dataset for training
lgb_test = lgb.Dataset(X_test, y_test, group=group_test, reference=lgb_train)  # Create LightGBM dataset for testing

# Set the parameters for the LightGBM Ranker
params = {
    "objective": "lambdarank",  # The objective function for ranking (LambdaRank is used for learning to rank)
    "metric": "ndcg",           # The evaluation metric (Normalized Discounted Cumulative Gain)
    "learning_rate": 0.1,       # The learning rate, which controls how much the model is updated in each iteration
    "max_depth": 3,             # Maximum depth of the trees to prevent overfitting
    "num_leaves": 31,           # Upper bound on leaves per tree; with max_depth=3 a tree has at most 8 leaves, so this cap is not binding
}

# Train the LightGBM ranking model using the training dataset and parameters
lgb_ranker = lgb.train(
    params,            # Parameters defined above
    lgb_train,         # Training dataset
    valid_sets=[lgb_test]  # Use lgb_test as the validation set to monitor performance
)

# Predict ranking scores for the test dataset
y_pred = lgb_ranker.predict(X_test)

# Print the predicted ranking scores for the test set
print("LightGBM Ranking Predictions:", y_pred)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.016488 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 140
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 5
LightGBM Ranking Predictions: [-0.72329493  0.78000156  0.05968753  0.64799642 -0.58830022 -0.45155588
 -0.77904294  1.93025642 -0.69964296 -2.3539077  -0.28196066 -2.52643691
  1.0309714  -2.38831163 -0.84669579 -0.5433052  -1.95342094 -0.77104739
 -2.38424839 -2.47131722]
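
LambdaRank, the idea behind the `lambdarank` objective used above, scales the gradient of each item pair by how much NDCG would change if the two items swapped ranks. The sketch below computes that swap weight for one query. It assumes the gain `2**rel - 1` and discount `1 / log2(position + 1)`; the function name `delta_ndcg` is hypothetical, and LightGBM's own implementation includes further details (such as truncation and normalization) that are not reproduced here.

```python
import math


def _gain(rel):
    return 2 ** rel - 1


def _discount(pos):
    # pos is a 0-based rank, so the top item has discount 1 / log2(2) = 1.
    return 1.0 / math.log2(pos + 2)


def delta_ndcg(labels_ranked, i, j):
    """|Change in NDCG| if the items at 0-based ranks i and j swapped places.

    `labels_ranked` holds the true labels in the model's current ranked order.
    """
    ideal = sum(
        _gain(rel) * _discount(pos)
        for pos, rel in enumerate(sorted(labels_ranked, reverse=True))
    )
    if ideal == 0:
        return 0.0
    swap_effect = (_gain(labels_ranked[i]) - _gain(labels_ranked[j])) * (
        _discount(i) - _discount(j)
    )
    return abs(swap_effect) / ideal


current_order = [1, 3, 2]  # true labels, listed in the order the model ranks them
print("swap ranks 0 and 1:", delta_ndcg(current_order, 0, 1))
print("swap ranks 1 and 2:", delta_ndcg(current_order, 1, 2))
```

Swaps near the top of the list, or between items with very different labels, receive larger weights, which focuses training on the mistakes that hurt NDCG most.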


In [3]:

%pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [4]:
from catboost import CatBoostRanker, Pool

# Pool for ranking: Prepare the data for CatBoost
# The group_id specifies which items belong to the same query (group of items to rank)
train_pool = Pool(X_train, y_train, group_id=[0] * 40 + [1] * 40)  # Training pool with 2 groups
test_pool = Pool(X_test, y_test, group_id=[0] * 10 + [1] * 10)  # Test pool with 2 groups

# Initialize and train the CatBoost Ranker model
catboost_ranker = CatBoostRanker(
    iterations=100,           # The number of boosting iterations (trees)
    learning_rate=0.1,        # The learning rate (controls model adjustments at each iteration)
    depth=3,                  # The maximum depth of each tree
    verbose=10                # Print progress every 10 iterations
)

# Fit the model on the training data
catboost_ranker.fit(train_pool)

# Predict the ranking for the test data
y_pred = catboost_ranker.predict(test_pool)

# Print the predicted ranking scores for the test data
print("CatBoost Ranking Predictions:", y_pred)


0:	total: 47.7ms	remaining: 4.72s
10:	total: 56.1ms	remaining: 454ms
20:	total: 64.6ms	remaining: 243ms
30:	total: 70.3ms	remaining: 157ms
40:	total: 81.2ms	remaining: 117ms
50:	total: 93.7ms	remaining: 90ms
60:	total: 107ms	remaining: 68.5ms
70:	total: 119ms	remaining: 48.5ms
80:	total: 159ms	remaining: 37.2ms
90:	total: 169ms	remaining: 16.7ms
99:	total: 181ms	remaining: 0us
CatBoost Ranking Predictions: [-1.39999172  1.4968492  -1.73894059  1.42851333  0.23416513  0.78178116
 -0.14234553  1.38200394 -0.9499378  -0.47298719  2.32974744 -1.05942711
  0.59158084 -0.13426086 -1.23678856  0.37088161 -0.63732242  0.08784243
 -0.67714497 -1.8269111 ]
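
Since each library produces scores on its own scale, one hedged way to compare two models is to check how often they agree on the relative order of item pairs. The sketch below computes Kendall's tau-a for two score lists belonging to the same query. The function name `kendall_tau_a` is an illustrative assumption, and the sample values are the first four XGBoost and CatBoost predictions shown above, rounded to two decimals.

```python
from itertools import combinations


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs.

    Returns 1.0 for identical orderings and -1.0 for exactly reversed ones.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        agreement = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
        # agreement == 0 means a tie in at least one list; it counts for neither.
    total_pairs = len(scores_a) * (len(scores_a) - 1) / 2
    return (concordant - discordant) / total_pairs if total_pairs else 0.0


xgb_sample = [-0.96, 0.05, -1.12, 0.65]
cat_sample = [-1.40, 1.50, -1.74, 1.43]
print("Kendall tau-a:", kendall_tau_a(xgb_sample, cat_sample))
```

For a full comparison, the function would be applied to each query's slice of the predictions separately, because items from different queries are never ranked against each other.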


In [5]:
# Store predictions from all models in a dictionary
predictions = {}

# XGBoost: Train the model and make predictions
xgb_ranker.fit(X_train, y_train, group=group_train)  # Fit XGBoost model on the training data with group info
predictions['XGBoost'] = xgb_ranker.predict(X_test)  # Store the predictions for the test data

# LightGBM: Train the model and make predictions
lgb_ranker = lgb.train(params, lgb_train)  # Train LightGBM model with the defined parameters and training data
predictions['LightGBM'] = lgb_ranker.predict(X_test)  # Store the predictions for the test data

# CatBoost: Train the model and make predictions
catboost_ranker.fit(train_pool)  # Train the CatBoost model on the training pool
predictions['CatBoost'] = catboost_ranker.predict(test_pool)  # Store the predictions for the test pool

# Print the predictions for all models
for model, preds in predictions.items():
    print(f"{model} Predictions:", preds)  # Print the predictions from each model (XGBoost, LightGBM, CatBoost)


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000040 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 140
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 5
0:	total: 506us	remaining: 50.1ms
10:	total: 8.19ms	remaining: 66.2ms
20:	total: 27.7ms	remaining: 104ms
30:	total: 36.1ms	remaining: 80.4ms
40:	total: 41.3ms	remaining: 59.4ms
50:	total: 47ms	remaining: 45.2ms
60:	total: 52ms	remaining: 33.2ms
70:	total: 57.4ms	remaining: 23.4ms
80:	total: 69.7ms	remaining: 16.3ms
90:	total: 86.4ms	remaining: 8.55ms
99:	total: 91.9ms	remaining: 0us
XGBoost Predictions: [-0.9623178   0.04632036 -1.1156845   0.6520399  -0.06923307  0.40256834
 -0.53648645  0.6583479  -0.5993957   0.10472348 -0.16147704 -0.3133536
 -0.08844604  0.04491812 -0.22709228  0.22807404  0.08727098 -0.4425296
 -0.30992997 -1.1322114 ]
LightGBM Predictions: [-0.72329493  0.78000156  0.05968753  0.6479