# XGBoost

Ranklib is a relatively old library and doesn't have the widespread use that XGBoost does. Ranklib is still under active development, but the fork of the project that OSC created reflects an older version.

The ES-LTR plugin is designed to work with the XGBoost model format. This notebook starts with the `classic` training data generated in `hello-ltr.py` and shows how you could use XGBoost instead of Ranklib to create a model and use it with the plugin.

In [4]:
import sagemaker
import boto3

In [7]:
boto_session = boto3.Session()
region = boto_session.region_name
# Create one SageMaker session and reuse it for the bucket lookup and uploads
sgmk_session = sagemaker.Session()
bucket_name = sgmk_session.default_bucket()
bucket_prefix = "es-ltr-xgboost"
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()

In [8]:
training_image = sagemaker.image_uris.retrieve(
    "xgboost", region=region, version="1.2-1"
)

print(training_image)

121021644041.dkr.ecr.ap-southeast-1.amazonaws.com/sagemaker-xgboost:1.2-1


### Input Data

Gather the data generated for our `classic` model in `hello-ltr.ipynb`. If this file doesn't exist yet, rerun that notebook!

In [1]:
import ltr.judgments as judge

# Parse the judgment list, closing the file once it has been read
with open('data/classic-training.txt') as f:
    judgments = list(judge.judgments_from_file(f))
df = judge.judgments_to_dataframe(judgments)
df

Recognizing 1 queries in: data/classic-training.txt


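|     | features0 | uid | qid | keywords | docId | grade |
|-----|-----------|-----|-----|----------|-------|-------|
| 0 | 2014.0 | 1_374430 | 1 | | 374430 | 0 |
| 1 | 1995.0 | 1_19404 | 1 | | 19404 | 1 |
| 2 | 1994.0 | 1_278 | 1 | | 278 | 1 |
| 3 | 2016.0 | 1_372058 | 1 | | 372058 | 0 |
| 4 | 1972.0 | 1_238 | 1 | | 238 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 2013.0 | 1_177699 | 1 | | 177699 | 0 |
| 996 | 2011.0 | 1_62835 | 1 | | 62835 | 0 |
| 997 | 2008.0 | 1_4944 | 1 | | 4944 | 1 |
| 998 | 1997.0 | 1_9404 | 1 | | 9404 | 1 |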


### Libraries for xgboost-ing

Just the dependencies we need to train and visualize our model trained with XGBoost instead of Ranklib.

In [2]:
import pandas as pd
import xgboost as xgb
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 50,150

### Set up our training matrix

XGBoost has its own data specifications, so we need to get our features into that format to use it.


In [9]:
df = df[['grade', 'features0']]
features = df[['features0']]
labels = df[['grade']]

#dmx = xgb.DMatrix(features, labels)
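
For ranking objectives, XGBoost also needs to know which rows belong to the same query; `DMatrix.set_group` accepts this as a list of group sizes. Here is a minimal sketch of deriving those sizes from a `qid` column, assuming the rows are already ordered so that each query's rows are contiguous. The helper name `group_sizes` is hypothetical and not part of the `ltr` package.

```python
from itertools import groupby


def group_sizes(qids):
    """Return the number of consecutive rows for each query id, in row order."""
    return [sum(1 for _ in rows) for _, rows in groupby(qids)]


# Toy qid column: three rows for query 1, then two rows for query 2.
sizes = group_sizes([1, 1, 1, 2, 2])
print(sizes)

# With a local DMatrix, this list would be passed as dmx.set_group(sizes).
```

The training data used here contains a single query, so the list would have one entry equal to the number of rows.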

In [14]:
df.to_csv("data/train-sm.csv", index=False, header=False)
train_uri = sgmk_session.upload_data(
    path="data/train-sm.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

# Define the data input channels for the training job:
s3_input_train = sagemaker.inputs.TrainingInput(train_uri, content_type="csv")

print(f"{s3_input_train.config}")

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-ap-southeast-1-344028372807/es-ltr-xgboost/train-sm.csv', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}
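
SageMaker's built-in XGBoost algorithm expects CSV training input with the label in the first column and no header row, which is why `grade` is selected before `features0` and the file is written with `header=False`. A small sketch of that layout using only the standard library, with the first two rows of the dataframe shown earlier:

```python
import csv
import io

rows = [
    {"grade": 0, "features0": 2014.0},
    {"grade": 1, "features0": 1995.0},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # Label first, then the feature columns; no header row is written.
    writer.writerow([row["grade"], row["features0"]])

print(buffer.getvalue())
```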


### Train the first XGBoost model

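Using the demo parameters for our model, we train a small tree ensemble as a SageMaker training job. Note that the hyperparameters below already set `objective="rank:pairwise"`, so this job trains with the ranking objective rather than plain regression.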

In [15]:
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,  # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,  # number of instances to be used
    role=sgmk_role,  # IAM role to be used
    max_run=20 * 60,  # Maximum allowed active runtime
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30 * 60,  # Maximum clock time (including spot delays)
)

# define its hyperparameters
estimator.set_hyperparameters(
    num_round=2,  # int: [1,300]
    max_depth=2,  # int: [1,10]
    eta=1,  # float: [0,1]
    objective="rank:pairwise",
)

# start a training (fitting) job
estimator.fit({"train": s3_input_train})

2021-10-13 01:58:41 Starting - Starting the training job...
2021-10-13 01:59:04 Starting - Launching requested ML instancesProfilerReport-1634090320: InProgress
...
2021-10-13 01:59:32 Starting - Preparing the instances for training.........
2021-10-13 02:01:05 Downloading - Downloading input data...
2021-10-13 02:01:25 Training - Downloading the training image...
2021-10-13 02:02:10 Uploading - Uploading generated training model
2021-10-13 02:02:10 Completed - Training job completed
[2021-10-13 02:01:58.491 ip-10-0-247-209.ap-southeast-1.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value rank:pairwise to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBo

In [None]:
#param = {'max_depth':2, 'eta':1, 'silent':1}
#num_round = 2

#model = xgb.train(param, dmx, num_round)

### Inspect as dataframe

Looking at the model as a dataframe can tell you which splits helped the most.

In [None]:
model.trees_to_dataframe()

In [None]:
xgb.plot_tree(model)

### Adjust the objective for LTR

Really, we don't want plain regression as our objective function. In LTR we take advantage of a pairwise loss function to find the optimal splits for a regression tree.

This doesn't make a massive difference to the generated model, because it is still a regression tree at the end of the day, but we are no longer using residual squared error.
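
To make the difference between the two objectives concrete, here is a minimal sketch that scores the same predictions with squared error and with a pairwise logistic loss. It only illustrates the idea and is not XGBoost's internal implementation of `rank:pairwise`; the function names and the toy grades and scores are made up.

```python
import math
from itertools import combinations


def squared_error(grades, scores):
    """Pointwise loss: penalizes each score's distance from its grade."""
    return sum((g - s) ** 2 for g, s in zip(grades, scores))


def pairwise_logistic_loss(grades, scores):
    """Pairwise loss: penalizes pairs where the higher-graded doc is not scored higher."""
    loss = 0.0
    for i, j in combinations(range(len(grades)), 2):
        if grades[i] == grades[j]:
            continue  # ties carry no ordering information
        hi, lo = (i, j) if grades[i] > grades[j] else (j, i)
        loss += math.log1p(math.exp(-(scores[hi] - scores[lo])))
    return loss


grades = [2, 1, 0]
well_ordered = [10.0, 5.0, 0.0]  # correct order, values far from the grades
mis_ordered = [0.0, 1.0, 2.0]    # values close to the grades, order reversed

print(squared_error(grades, well_ordered), squared_error(grades, mis_ordered))
print(pairwise_logistic_loss(grades, well_ordered),
      pairwise_logistic_loss(grades, mis_ordered))
```

Squared error prefers the second set of scores because they sit closer to the grade values, while the pairwise loss prefers the first because only the relative order of documents matters to it.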

In [None]:
#param2 = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'rank:pairwise'}

#ranking_model = xgb.train(param2, dmx, num_round)

In [None]:
ranking_model.trees_to_dataframe()

In [None]:
xgb.plot_tree(ranking_model)

### Uploading an XGBoost model to the plugin

Since the model can be represented with JSON, the plugin can parse it. But we need to make sure the plugin gets the proper feature value names in order for it to parse properly.

These are supplied via a mapping `txt` file, `fmap.txt`.

The first step is to dump the model with the feature mapping to the features already stored in the plugin.
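
XGBoost's feature map is a plain text file with one line per feature: the feature index, the feature name, and a type (`q` for quantitative, `i` for indicator, `int` for integer), separated by whitespace. Here is a minimal sketch of writing one, assuming the feature set stored in the plugin has a single feature called `release_year`. That name and the helper `write_fmap` are assumptions for illustration, so substitute the names from your own feature set.

```python
def write_fmap(path, feature_names, feature_type="q"):
    """Write an XGBoost feature map: '<index> <name> <type>' per line."""
    with open(path, "w") as f:
        for index, name in enumerate(feature_names):
            f.write(f"{index} {name} {feature_type}\n")


# Assumed feature name; column features0 in the training data is index 0.
write_fmap("fmap.txt", ["release_year"])

with open("fmap.txt") as f:
    print(f.read())
```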

In [None]:
model_dump = ranking_model.get_dump(fmap='fmap.txt', dump_format='json')

### Massage the JSON

Manipulate the XGBoost output format to clean it up for posting to the plugin.

In [None]:
import json

# Each dumped tree is a JSON string; parse them into a list of dictionaries
clean_model = [json.loads(tree) for tree in model_dump]
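
Each entry in `clean_model` is a nested dictionary describing one tree: split nodes carry `split`, `split_condition`, `yes`, `no`, `missing`, and `children`, while leaves carry a `leaf` value. The sketch below walks such a structure to show how a single tree scores a document. The tree is hand-written for illustration (its threshold and leaf values are made up, not output from the model above), and `score_tree` is a hypothetical helper, not part of XGBoost or the plugin.

```python
def score_tree(node, features):
    """Follow splits from the root until a leaf is reached; return its value."""
    while "leaf" not in node:
        value = features.get(node["split"])
        if value is None:
            next_id = node["missing"]  # feature absent: take the default branch
        elif value < node["split_condition"]:
            next_id = node["yes"]
        else:
            next_id = node["no"]
        node = next(c for c in node["children"] if c["nodeid"] == next_id)
    return node["leaf"]


# Hand-written example tree with made-up values.
example_tree = {
    "nodeid": 0, "depth": 0, "split": "release_year",
    "split_condition": 2000.0, "yes": 1, "no": 2, "missing": 1,
    "children": [
        {"nodeid": 1, "leaf": 0.5},
        {"nodeid": 2, "leaf": -0.5},
    ],
}

print(score_tree(example_tree, {"release_year": 1989.0}))
print(score_tree(example_tree, {"release_year": 2016.0}))
```

Because the `split` field holds a feature name taken from the feature map, the names there need to match the features the plugin knows about.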

### Post it to the plugin

Still referencing the index and feature set the model will be associated with.

In [None]:
# Import the class directly so the `client` variable does not shadow the module
from ltr.client import ElasticClient
client = ElasticClient()

client.submit_xgboost_model('release', 'tmdb', 'xgb', clean_model)
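
For orientation, the sketch below builds the kind of request body that the LTR plugin's documentation shows for creating an XGBoost model under a feature set. It is an assumption that `submit_xgboost_model` sends something of this shape; the helper `build_create_model_body` is hypothetical, and the single-leaf tree is a placeholder rather than a real model.

```python
import json


def build_create_model_body(model_name, trees):
    """Wrap a list of parsed XGBoost trees in a create-model request body."""
    return {
        "model": {
            "name": model_name,
            "model": {
                "type": "model/xgboost+json",
                # The tree list is serialized to a JSON string here, as in
                # the plugin documentation's example.
                "definition": json.dumps(trees),
            },
        }
    }


# Placeholder tree; in the notebook this would be clean_model.
body = build_create_model_body("xgb", [{"nodeid": 0, "leaf": 0.5}])
print(json.dumps(body, indent=2))
```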

### Confirm it works

In [None]:
from ltr.release_date_plot import search
search(client, 'batman', 'xgb')

### Compare it to the classic Ranklib model

In [None]:
from ltr.release_date_plot import plot
plot(client, "batman", models = ['classic', 'xgb'])