# Guided Exercise: Ranking

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/truera-examples/blob/release/prod/starter-examples/starter-ranking.ipynb)

### Setup

You are a data scientist working for a newly opened winery. The winery is preparing for a big tasting event that will be attended by several eminent wine critics. To help prepare for the event, you are tasked with training a model that can rank wines for a given wine critic.  

Fortunately, you have historical data for these critics' wine reviews. For each review written, the critic assigns the wine in question a numerical score (from 0 to 100). The data includes features relevant to the wine itself (e.g., the country of origin, the bottling year, the price of the wine) and to the critic (e.g., the critic's average review length, whether or not they use Twitter).

Your model will use this information to rank a selection of your winery's wines. The intent of this ranking is to craft an individualized tasting menu for each critic attending your winery's event.

#### Goals 🎯
In this tutorial you will learn how to:
1. Set up and ingest a ranking project into TruEra Diagnostics. 
2. View and change the ranking project setting `K` (i.e., the number of records per group ID considered when computing NDCG).

### First, set the credentials for your TruEra deployment.
If you don't have credentials yet, get them instantly by signing up for free at: https://www.truera.com


In [None]:
#connection details
CONNECTION_STRING = "https://app.truera.net"
AUTH_TOKEN = ""

In [None]:
! pip install truera

In [None]:
# Set up TruEra workspace.
import numpy as np
import pandas as pd
from lightgbm import LGBMRanker
from xgboost import XGBRanker

from truera.client.truera_workspace import TrueraWorkspace
from truera.client.truera_authentication import TokenAuthentication
from truera.client.ingestion.util import ColumnSpec

auth = TokenAuthentication(AUTH_TOKEN)
tru = TrueraWorkspace(CONNECTION_STRING, auth)

### Now, run the rest of the notebook and follow the analysis

### First, load the world-wines-ranking data

We will use a subset of the [world-wines-ranking](https://www.kaggle.com/datasets/diegoperezsalas/worldwinesranking/) dataset. This data includes about 130k wine ratings from industry experts. We have cleaned, preprocessed, and downsampled this data to 6k records (1.5k train, 4.5k test).

In [None]:
dir_name = "https://truera-examples.s3.us-west-2.amazonaws.com/data/starter-ranking/"
split_names = ["all", "train", "test"]
splits = {}
for split_name in split_names:
    split_path = dir_name + split_name + ".csv"
    print(f"-> Loading {split_name} from {split_path}...")
    splits[split_name] = pd.read_csv(split_path)
    splits[split_name]['taster_has_twitter'] = splits[split_name]['taster_has_twitter'].astype(bool)

In [None]:
splits['train'].head()

Get the column names for the pre data and feature influences.

In [None]:
def get_col_names(df):
    # columns that hold labels, IDs, or extra data rather than model features
    not_pre_cols = ['points', 'title', 'winery', 'taster_id', 'id']
    return [col for col in df.columns if col not in not_pre_cols]

pre_cols = get_col_names(splits['all'])

### Ranking IDs: groups and items

Ranking projects in TruEra require you to specify two additional columns:

- `ranking_group_id_column`: indicates the group to which the record belongs (in this demo, the critic rating the wine). 
- `ranking_item_id_column`: indicates the item with which the record is associated (in this demo, the name of the wine).

Both of these columns must be specified when defining your `ColumnSpec`. The group ID is relevant when `fit`ting your ranking model, as will become evident in the subsequent cells.

For more details on these ranking columns, check out the [TruEra documentation](https://docs.truera.com/1.40/public/project-overview/) (see the "Ranking" modal under "Understanding Output Types").
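
To make the two ID columns concrete, here is a minimal, library-free sketch. The critic IDs, wine names, and scores are hypothetical and are not taken from the dataset. It groups records by group ID and orders each group's items by their label, which is the ordering a ranking model tries to reproduce for that critic.

```python
from collections import defaultdict

# Hypothetical records: (group ID, item ID, label). Values are illustrative only.
records = [
    ("critic_01", "Wine A", 84),
    ("critic_01", "Wine B", 93),
    ("critic_02", "Wine A", 90),
    ("critic_02", "Wine C", 87),
    ("critic_01", "Wine C", 88),
]

def order_items_per_group(rows):
    """Return {group_id: [item_id, ...]} with items sorted by descending label."""
    by_group = defaultdict(list)
    for group_id, item_id, label in rows:
        by_group[group_id].append((label, item_id))
    return {
        group_id: [item for _, item in sorted(pairs, key=lambda p: p[0], reverse=True)]
        for group_id, pairs in by_group.items()
    }

ideal_orderings = order_items_per_group(records)
for group_id, items in ideal_orderings.items():
    print(group_id, items)
```

The same wine can appear in several groups, and its position is only meaningful relative to the other wines rated by the same critic.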

### Train the `XGBRanker` model

Train a ranking model using the `xgboost` package. Note that this `fit` method is similar to an analogous classification/regression model's `fit`, with the addition of a `qid` parameter. This parameter indicates the "group ID" of each record being passed in with the `X` and `Y` dataframes.
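
The notebook sorts records by group before fitting so that each group forms one contiguous block. The sketch below illustrates that preparation step without any libraries. The `taster_NN` ID format and the helper names `to_sorted_qid` and `is_contiguous` are assumptions made for illustration; they are not part of `xgboost` or TruEra.

```python
# Hypothetical (taster_id, points) rows; the "taster_NN" format is an assumption.
rows = [
    ("taster_03", 88),
    ("taster_01", 92),
    ("taster_03", 85),
    ("taster_02", 90),
    ("taster_01", 79),
]

def to_sorted_qid(rows):
    """Sort rows by group ID and derive an integer group ID for each row."""
    sorted_rows = sorted(rows, key=lambda r: r[0])  # stable: keeps within-group order
    qid = [int(taster_id[-2:]) for taster_id, _ in sorted_rows]
    return sorted_rows, qid

def is_contiguous(qid):
    """Return True if every group ID occupies a single contiguous block."""
    seen = set()
    previous = None
    for q in qid:
        if q != previous:
            if q in seen:
                return False
            seen.add(q)
            previous = q
    return True

sorted_rows, qid_example = to_sorted_qid(rows)
print(qid_example)
print(is_contiguous(qid_example))
print(is_contiguous([1, 2, 1]))
```

A check like `is_contiguous` is a cheap guard to run before `fit`, since an unsorted `qid` would split one critic's reviews across several blocks.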

Format the data for use with the `XGBRanker` object

In [None]:
def df_to_xgboost_data(df, pre_cols):
    # sort by the group id; required for xgbranker model
    df = df.sort_values(by=['taster_id'])
    # use only pre data columns as features
    X = df[pre_cols]
    # get the relevance scores
    Y = df[['points']]
    # get the critic (group) IDs as integer vals from the last two characters of the 'taster_id' col
    qid = df['taster_id'].apply(lambda x: int(x[-2:])).astype(int)
    return X, Y, qid

X, Y, qid = df_to_xgboost_data(splits['train'], pre_cols)

In [None]:
# train the XGBRanker model
xgb_ranker = XGBRanker(tree_method="hist",
                   lambdarank_num_pair_per_sample=10,
                   objective="rank:ndcg",
                   lambdarank_pair_method="topk",
                   random_state=1)
xgb_ranker.fit(X, Y, qid=qid)

In [None]:
pred_xgb = 'pred_score_xgb'
for split_name, split in splits.items():
    Y_pred = xgb_ranker.predict(split[pre_cols])
    splits[split_name][pred_xgb] = Y_pred

### Train the `LGBMRanker` model

Train another ranking model using the `lightgbm` package.

There are some minor differences in using the `LGBMRanker`, namely:
- The `group` parameter is the number of contiguous records belonging to the group. For the training data, this is always 100 records.
- Relevance values (labels) must start from 0, so we subtract the minimum label value from every label.
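
The two adjustments above can be sketched without hard-coding the group size. This is a minimal, library-free illustration: the helper names `group_sizes` and `shift_labels` and the sample values are hypothetical, and it assumes the group IDs are already sorted so that each group is contiguous.

```python
from itertools import groupby

def group_sizes(sorted_qid):
    """Count the records in each contiguous block of identical group IDs."""
    return [sum(1 for _ in block) for _, block in groupby(sorted_qid)]

def shift_labels(labels):
    """Shift labels so that the smallest relevance value becomes 0."""
    lowest = min(labels)
    return [label - lowest for label in labels]

# Hypothetical sorted group IDs and review scores.
qid_example = [1, 1, 1, 2, 2, 3]
points_example = [88, 92, 85, 90, 79, 95]

sizes = group_sizes(qid_example)
shifted = shift_labels(points_example)
print(sizes)
print(shifted)

# The group sizes must account for every record exactly once.
assert sum(sizes) == len(qid_example)
```

Deriving the sizes from the data gives the same result as the constant 100 when every group has 100 records, and keeps working if group sizes ever differ.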

In [None]:
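# convert qid into lgbm-compatible groups (every training group has 100 records)
group = [100 for _ in np.unique(qid)]
# shift the labels so that the lowest relevance value is 0
Y_lgbm = Y - Y['points'].min()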

lgbm_ranker = LGBMRanker(random_state=0)
lgbm_ranker.fit(
    X,
    Y_lgbm,
    group=group,
)

In [None]:
pred_lgbm = 'pred_score_lgbm'
for split_name, split in splits.items():
    Y_pred = lgbm_ranker.predict(split[pre_cols])
    splits[split_name][pred_lgbm] = Y_pred

### Ingesting the data/models into TruEra

Now we ingest the `world-wines-ranking` data and the trained models into TruEra. First we set up the names for the project's artifacts and the column names.

In [None]:
# names for project setup
project_name = "Starter - Ranking"
model_name_xgb = "wine_xgbranker"
model_name_lgbm = "wine_lgbmranker"
score_type = "ranking_score" # either "ranking_score" (raw model scores) or "rank" (rank-ordering of model scores)

# add project + data collection
tru.add_project(project_name, score_type)
tru.add_data_collection("wine_rating_6k_ratings")

# reduce settings for speed
tru.set_num_internal_qii_samples(100)
tru.set_num_default_influences(100)
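
The `score_type` comment above distinguishes raw model scores from a rank-ordering of those scores. As a rough illustration of the difference, the sketch below converts raw scores into within-group ranks. The convention used here (1-based, with the highest score receiving rank 1) is an assumption and may not match how TruEra defines `"rank"`; the helper `scores_to_ranks` and the sample values are hypothetical.

```python
from collections import defaultdict

def scores_to_ranks(group_ids, scores):
    """Return 1-based ranks within each group; the highest score gets rank 1 (assumed convention)."""
    indices_by_group = defaultdict(list)
    for index, group_id in enumerate(group_ids):
        indices_by_group[group_id].append(index)

    ranks = [0] * len(scores)
    for indices in indices_by_group.values():
        ordered = sorted(indices, key=lambda i: scores[i], reverse=True)
        for rank, index in enumerate(ordered, start=1):
            ranks[index] = rank
    return ranks

# Hypothetical group IDs and raw model scores, one entry per record.
group_ids = ["critic_01", "critic_01", "critic_02", "critic_01", "critic_02"]
scores = [0.2, 0.9, 0.7, 0.5, 0.4]

ranks = scores_to_ranks(group_ids, scores)
print(ranks)
```

Ranks are only comparable within a group: a rank of 1 for one critic says nothing about how that score compares with another critic's scores.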

Set up the column names for `ColumnSpec` objects.

In [None]:
id_col_name = "id" 
ranking_item_id_column = "title" 
ranking_group_id_column = "taster_id" 
label_col_name = "points"
extra_data_col_names = ['winery']

Add pre data/labels/extra data.

In [None]:
split_names = ['train', 'test', 'all'] # reorder these so we can add_python_model with train_split_name 

column_spec_no_preds = ColumnSpec(
            id_col_name=id_col_name,
            ranking_item_id_column_name=ranking_item_id_column,
            ranking_group_id_column_name=ranking_group_id_column,
            pre_data_col_names=pre_cols,
            label_col_names=label_col_name,
            extra_data_col_names=extra_data_col_names
        )

for data_split_name in split_names:
    tru.add_data(
        data=splits[data_split_name],
        column_spec=column_spec_no_preds, 
        data_split_name=data_split_name
    )

Add predictions (note: we manually add these since computation of ranking predictions in TruEra is currently not supported).

We add the `train` split first since it is the training split for both models.

In [None]:
def column_spec_for_preds(pred_name):
    return ColumnSpec(
            id_col_name=id_col_name,
            ranking_item_id_column_name=ranking_item_id_column,
            ranking_group_id_column_name=ranking_group_id_column,
            prediction_col_names=[pred_name],
        )

for data_split_name in split_names:
    for model_name, model, pred_col_name in zip([model_name_xgb, model_name_lgbm], [xgb_ranker, lgbm_ranker], [pred_xgb, pred_lgbm]):
        if data_split_name == "train":
            tru.add_python_model(model_name, model, train_split_name="train")
        tru.set_model(model_name)
        tru.add_data(
            data=splits[data_split_name][[id_col_name, pred_col_name, ranking_group_id_column, ranking_item_id_column]],
            column_spec=column_spec_for_preds(pred_col_name), 
            data_split_name=data_split_name,
        )

Compute feature influences for each model on each split.

In [None]:
for data_split_name in split_names:
    for model_name in [model_name_xgb, model_name_lgbm]:
        tru.set_model(model_name)
        tru.set_data_split(data_split_name)
        tru.compute_feature_influences()

You should be able to see the `Starter - Ranking` project in your list of TruEra projects now!

### Viewing/Changing `K` for NDCG

`K` is a project-level setting that is unique to ranking projects. It sets the number of records per group to consider when calculating the Normalized Discounted Cumulative Gain (NDCG) of a model on a split/segment. Read more about NDCG in the [TruEra documentation](https://docs.truera.com/1.40/public/supported-metrics/accuracy-metrics/#ranking-models).
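
To see why `K` affects the metric, here is a minimal sketch of NDCG@K for a single group. It assumes the common linear-gain form, in which each relevance value is divided by `log2(position + 1)`; TruEra's exact formula and its aggregation across groups may differ. The labels, scores, and function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevances (linear gain, assumed)."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances[:k]))

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one group: DCG of the model's ordering divided by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_labels = sorted(labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal_labels, k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance labels and model scores for one critic's wines.
labels = [3, 2, 0, 1]
scores = [0.9, 0.1, 0.5, 0.7]

for k in (1, 2, 4):
    print(k, ndcg_at_k(labels, scores, k))
```

With `K = 1` the model's top pick is also the most relevant wine, so the value is 1.0. Larger values of `K` expose the misplaced second-best wine and the value drops below 1.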

The following cells show you how to interact with this setting.

In [None]:
# get the default ranking K value
tru.get_ranking_k()

In [None]:
# get an explainer to compute NDCG
explainer = tru.get_explainer()
explainer.compute_performance("NDCG")

In [None]:
# you can change this project setting as follows
tru.set_ranking_k(5)

In [None]:
# the NDCG should change for a different value of K
explainer.compute_performance("NDCG")

In [None]:
# view the new value of K
tru.get_ranking_k()