## Setup

In [None]:
!pip install numpy pandas pyvespa lightgbm

Import the required packages:

In [1]:
import json
import lightgbm as lgb
import numpy as np
import pandas as pd

## Create data

Generate a toy dataset to follow along. The column names use a format that Vespa understands: `query(value)` means that the user sends a parameter named `value` along with the query, and `attribute(field)` means that `field` is a document attribute defined in the schema. The example below has one query parameter named `value` and two document attributes, `numeric` and `categorical`. For `lightgbm` to handle a categorical variable, the column should use `dtype="category"`, as shown below.

Further down, we give an example where model feature names must be mapped to Vespa document and query parameter names. This is useful when the model was trained with feature names that are incompatible with Vespa.
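As a rough illustration of the naming convention described above, plain column names could be wrapped in the `query(...)` and `attribute(...)` syntax with a small helper. The helper `to_vespa_feature_names` and its `query_params` argument are hypothetical names introduced here for illustration; they are not part of pyvespa or LightGBM.

```python
def to_vespa_feature_names(columns, query_params):
    """Wrap plain column names in Vespa's feature-name syntax.

    Hypothetical helper: names listed in `query_params` become
    `query(name)`, all other names become `attribute(name)`.
    """
    return {
        name: f"query({name})" if name in query_params else f"attribute({name})"
        for name in columns
    }


mapping = to_vespa_feature_names(
    ["value", "numeric", "categorical"], query_params={"value"}
)
print(mapping)
```

A mapping like this could then be passed to `features.rename(columns=mapping)` on a pandas DataFrame before training.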

In [2]:
# Create random training set
features = pd.DataFrame({
    "query(value)": np.random.random(100),
    "attribute(numeric)": np.random.random(100),
    "attribute(categorical)": pd.Series(
        np.random.choice(["a", "b", "c"], size=100), dtype="category"
    ),
})
features.head()

Unnamed: 0,query(value),attribute(numeric),attribute(categorical)
0,0.365326,0.236163,a
1,0.585092,0.517502,b
2,0.437061,0.49748,a
3,0.385089,0.451904,a
4,0.790912,0.116505,b


Generate target variables:

In [3]:
numeric_features = pd.get_dummies(features)
targets = (
    (numeric_features["query(value)"] + 
     numeric_features["attribute(numeric)"]  -
     0.5 * numeric_features["attribute(categorical)_a"] + 
     0.5 * numeric_features["attribute(categorical)_c"]) > 1.0
) * 1.0
targets

0     0.0
1     1.0
2     0.0
3     0.0
4     0.0
     ... 
95    1.0
96    0.0
97    1.0
98    0.0
99    0.0
Length: 100, dtype: float64
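To make the labelling rule concrete, the sketch below applies it by hand to the five rows shown by `features.head()` above. The function `label` is a hypothetical name used only for this illustration; it mirrors the expression in the cell above, where category `a` subtracts 0.5, category `c` adds 0.5, and category `b` leaves the sum unchanged.

```python
def label(query_value, numeric, category):
    """Return 1.0 when the adjusted sum exceeds 1.0, otherwise 0.0."""
    score = query_value + numeric
    if category == "a":
        score -= 0.5  # one-hot column attribute(categorical)_a
    elif category == "c":
        score += 0.5  # one-hot column attribute(categorical)_c
    return float(score > 1.0)


# Rows copied from the features.head() output above.
rows = [
    (0.365326, 0.236163, "a"),
    (0.585092, 0.517502, "b"),
    (0.437061, 0.497480, "a"),
    (0.385089, 0.451904, "a"),
    (0.790912, 0.116505, "b"),
]
labels = [label(*row) for row in rows]
print(labels)  # [0.0, 1.0, 0.0, 0.0, 0.0]
```

These values agree with the first five entries of `targets` printed above.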

## Fit lightgbm model

In [4]:
training_set = lgb.Dataset(features, targets)

# Train the model
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 3,
}
model = lgb.train(params, training_set, num_boost_round=5)

[LightGBM] [Info] Number of positive: 52, number of negative: 48
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 74
[LightGBM] [Info] Number of data points in the train set: 100, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.520000 -> initscore=0.080043
[LightGBM] [Info] Start training from score 0.080043


## Vespa application package

The model expects two document attributes, `numeric` and `categorical`. We can use the model in the first-phase ranking by using the `lightgbm` rank feature.

In [5]:
from vespa.package import ApplicationPackage, Field, RankProfile, Function

app_package = ApplicationPackage(name="lightgbm")
app_package.schema.add_fields(
    Field(name="numeric", type="double", indexing=["summary", "attribute"]),
    Field(name="categorical", type="string", indexing=["summary", "attribute"])
)
app_package.schema.add_rank_profile(
    RankProfile(
        name="classify", 
        first_phase="lightgbm('lightgbm_model.json')"
    )
)

We can check what the Vespa search definition file will look like:

In [6]:
print(app_package.schema.schema_to_text)

schema lightgbm {
    document lightgbm {
        field numeric type double {
            indexing: summary | attribute
        }
        field categorical type string {
            indexing: summary | attribute
        }
    }
    rank-profile classify {
        first-phase {
            expression: lightgbm('lightgbm_model.json')
        }
    }
}


We can export the application package files to disk:

In [7]:
from pathlib import Path
Path("lightgbm").mkdir(parents=True, exist_ok=True)
app_package.to_files("lightgbm")

Note that there are no models under the `models` folder yet. We need to export the LightGBM model that we trained earlier to `models/lightgbm_model.json`.

In [8]:
!tree lightgbm

lightgbm
├── files
├── models
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

6 directories, 4 files


## Export the model

In [9]:
with open("lightgbm/models/lightgbm_model.json", "w") as f:
    json.dump(model.dump_model(), f, indent=2)
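For intuition about what the exported JSON contains, here is a minimal sketch of how a score could be computed from a dumped tree ensemble. It rests on several assumptions: the key names (`tree_info`, `tree_structure`, `split_feature`, `threshold`, `left_child`, `right_child`, `leaf_value`) reflect the layout of `dump_model()` as I understand it, the toy model is hand-written rather than the one trained above, only numeric `<=` splits are handled (no categorical splits or missing values), and the binary objective is assumed to apply a plain sigmoid to the summed leaf values. The functions `eval_tree` and `predict_binary` are hypothetical and do not show how Vespa evaluates the model.

```python
import math


def eval_tree(node, row):
    """Walk one tree until a leaf is reached and return its value."""
    while "leaf_value" not in node:
        go_left = row[node["split_feature"]] <= node["threshold"]
        node = node["left_child"] if go_left else node["right_child"]
    return node["leaf_value"]


def predict_binary(model_dict, row):
    """Sum the leaf values of all trees and squash the result with a sigmoid."""
    raw = sum(eval_tree(tree["tree_structure"], row) for tree in model_dict["tree_info"])
    return 1.0 / (1.0 + math.exp(-raw))


# Hand-written toy ensemble with two single-split trees (not the trained model).
toy_model = {
    "tree_info": [
        {"tree_structure": {
            "split_feature": 0, "threshold": 0.5,
            "left_child": {"leaf_value": -0.2},
            "right_child": {"leaf_value": 0.3},
        }},
        {"tree_structure": {
            "split_feature": 1, "threshold": 0.4,
            "left_child": {"leaf_value": -0.1},
            "right_child": {"leaf_value": 0.1},
        }},
    ]
}

probability = predict_binary(toy_model, [0.7, 0.2])
print(probability)
```

For the row `[0.7, 0.2]`, the first tree goes right (0.3) and the second goes left (-0.1), so the raw score is about 0.2 before the sigmoid is applied.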

Now we can see that the model is where Vespa expects it to be:

In [10]:
!tree lightgbm

lightgbm
├── files
├── models
│   └── lightgbm_model.json
├── schemas
│   └── lightgbm.sd
├── search
│   └── query-profiles
│       ├── default.xml
│       └── types
│           └── root.xml
└── services.xml

6 directories, 5 files


## Deploy the application

In [11]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy_from_disk(application_name="lightgbm", application_root="lightgbm")

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Waiting for application status, 30/300 seconds...
Finished deployment.


## Feed the data

In [12]:
feed_batch = [
    {
        "id": idx, 
        "fields": {
            "numeric": float(row["attribute(numeric)"]), 
            "categorical": str(row["attribute(categorical)"])
        }
    } for idx, row in features.iterrows()
]

In [13]:
status = app.feed_batch(feed_batch)

Successful documents fed: 100/100.
Batch progress: 1/1.


## Query

In [14]:
hits = app.query(
    body={
        "yql": "select * from sources * where true",
        "ranking": "classify",
        "ranking.features.query(value)": 1,
        "hits": 100
    }
).hits

## Check Vespa and model predictions match

In [15]:
predictions = pd.DataFrame.from_records(
[
    {
        "vespa_relevance": float(hit["relevance"]), 
        "attribute(numeric)": float(hit["fields"]["numeric"]), 
        "attribute(categorical)": str(hit["fields"]["categorical"]),         
        "query(value)": 1
    } for hit in hits
]
)
predictions["attribute(categorical)"] = predictions["attribute(categorical)"].astype('category') 

In [16]:
X = predictions[["attribute(numeric)", "attribute(categorical)", "query(value)"]]
X.head(10)

Unnamed: 0,attribute(numeric),attribute(categorical),query(value)
0,0.698056,c,1
1,0.737584,c,1
2,0.5882,c,1
3,0.775526,c,1
4,0.74825,c,1
5,0.67067,c,1
6,0.802057,c,1
7,0.587497,c,1
8,0.789344,c,1
9,0.722675,c,1


In [17]:
predictions

Unnamed: 0,vespa_relevance,attribute(numeric),attribute(categorical),query(value)
0,0.697870,0.698056,c,1
1,0.697870,0.737584,c,1
2,0.697870,0.588200,c,1
3,0.697870,0.775526,c,1
4,0.697870,0.748250,c,1
...,...,...,...,...
95,0.347303,0.400556,a,1
96,0.347303,0.244337,a,1
97,0.347303,0.497480,a,1
98,0.347303,0.360573,a,1


In [18]:
model.predict(X)

array([0.64492719, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.64492719, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.64492719, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.64492719, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.64492719, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.64492719, 0.60973365, 0.64492719, 0.60973365, 0.64492719,
       0.60973365, 0.60973365, 0.60973365, 0.60973365, 0.64492719,
       0.60973365, 0.64492719, 0.60973365, 0.60973365, 0.60973365,
       0.64492719, 0.60973365, 0.60973365, 0.64492719, 0.60973365,
       0.60973365, 0.64492719, 0.64492719, 0.64492719, 0.64492719,
       0.60973365, 0.60973365, 0.64492719, 0.60973365, 0.60973365,
       0.60973365, 0.60973365, 0.60973365, 0.60973365, 0.60973365,
       0.60973365, 0.60973365, 0.60973365, 0.60973365, 0.49078109,
       0.49078109, 0.49078109, 0.49078109, 0.49078109, 0.49078109,
       0.49078109, 0.49078109, 0.49078109, 0.49078109, 0.49078

In [19]:
assert predictions.vespa_relevance.tolist() == model.predict(X).tolist()

AssertionError: 
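Exact equality between two lists of floats is a brittle check, and a bare `AssertionError` does not say which rows differ or by how much. The sketch below is one possible way to compare scores within a tolerance and report the offending pairs. The helper `mismatches` and its default tolerances are assumptions made for illustration; the sample values are the first Vespa relevance and the first `model.predict` result shown above.

```python
import math


def mismatches(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, expected, actual) for every pair outside the tolerance."""
    if len(expected) != len(actual):
        raise ValueError("inputs must have the same length")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# First row of the outputs above: Vespa relevance vs. model.predict.
real_gap = mismatches([0.697870], [0.64492719])
print(real_gap)

# Pure rounding noise is tolerated.
rounding_only = mismatches([0.1 + 0.2], [0.3])
print(rounding_only)
```

The gap in the outputs above is far larger than rounding noise, so a tolerance alone would not make the check pass. One thing worth checking, as an assumption rather than a confirmed diagnosis: `X` lists its columns in a different order from the training `features` frame, and LightGBM may match DataFrame columns by position rather than by name. Selecting `predictions[list(features.columns)]` before calling `model.predict` would rule that out.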

## Clean environment

In [None]:
!rm -fr lightgbm
vespa_docker.container.stop(timeout=600)
vespa_docker.container.remove()