# LTR Case Study: Solr

Before starting, Solr must be running. We used a fresh solr-7.4.0 distribution downloaded from [the Solr download site](http://archive.apache.org/dist/lucene/solr/7.4.0/). Start Solr with the following commands.

    cd <solr_home>
    bin/solr start
    
The case study follows the steps outlined in Michael Alcorn's tutorial [From Zero to Learning to Rank in Apache Solr](https://github.com/airalcorn2/Solr-LTR).

In [1]:
import csv
import json
import os
import random
import requests
import sqlite3
import sys
import urllib.parse
import xml.etree.ElementTree as ET

In [2]:
DATA_DIR = "../../data"

MOVIES_DATA = os.path.join(DATA_DIR, "movies_metadata.csv")
LOOKUPS_DB = os.path.join(DATA_DIR, "lookups.db")
FEATURE_FILE_TEMPLATE = os.path.join(DATA_DIR, "solr_features_{:s}.txt")

RANKLIB_LM_MODEL = os.path.join(DATA_DIR, "solr_lambdamart_model.txt")
RANKLIB_PROC_LM_MODEL = os.path.join(DATA_DIR, "solr_lambdamart_model.xml")
SOLR_LM_MODEL = os.path.join(DATA_DIR, "solr_lambdamart_model.json")

SOLR_URL = "http://localhost:8983/solr/tmdbindex"

FEATURE_LIST = [
    "origScore", "titleSimTFIDF", "titleSimBM25", "descSimTFIDF", "descSimBM25",
    "docRecency", "isGoHands", "isAniplex", "isThriller", "isForeign",
    "isDrama", "isWar", "isAction", "isComedy", "isMusic", 
    "isRomance", "isAdventure", "isFamily", "isFantasy", "isCrime",
    "isHorror", "isHistory", "isMystery", "isAnimation", "isDocumentary",
    "isWestern"
]
QUERY_LIST = [
    "murder", "musical", "biography", "police", "world war ii",
    "comedy", "superhero", "nazis", "romance", "martial arts",
    "extramarital", "spy", "vampire", "magic", "wedding",
    "sport", "prison", "teacher", "alien", "dystopia"
]
TOP_N = 10

## Plugin Setup

The LTR functionality has been bundled with Apache Solr since 6.x, but it must be enabled explicitly.

### Create new core
    
We then create a new core to hold our index.

    bin/solr create -c tmdbindex 
    
In case you need to start over, the command to drop the core is as follows.

    bin/solr delete -c tmdbindex

### Prepare Core for LTR

To enable LTR for the core, stop Solr and open the core's solrconfig.xml file for editing.

    cd <solr_home>
    bin/solr stop
    vi server/solr/tmdbindex/conf/solrconfig.xml
    
Add the following snippet __before__ the `</config>` tag on the last line of solrconfig.xml.

    <lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />

    <queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

    <cache name="QUERY_DOC_FV"
           class="solr.search.LRUCache"
           size="4096"
           initialSize="2048"
           autowarmCount="4096"
           regenerator="solr.search.NoOpRegenerator" />

    <transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
      <str name="fvCacheName">QUERY_DOC_FV</str>
    </transformer>
    
Restart Solr with LTR enabled.

    bin/solr start -Dsolr.ltr.enabled=true
    
We are now ready to define LTR features.


## Load Data

### Add "TF-IDF" fields to schema

Since Solr 6.x, the default similarity is BM25; the older TF-IDF similarity is still available as ClassicSimilarity. We therefore declare a new field type `text_tfidf` that uses the TF-IDF based similarity. Any field whose name ends in `_t` is automatically (via the dynamic field naming convention) of type `text_general`, which uses BM25 similarity.

In addition, we declare copy-field rules that automatically copy the title field `title_t` and the description field `description_t` (both BM25) to their TF-IDF counterparts `title_tfidf` and `description_tfidf`.

In [3]:
headers = {"Content-type": "application/json"}
data = {
  "add-field-type" : {
    "name":"text_tfidf",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    # NOTE: filters go in a "filters" list; repeating a "filter" key in a
    # Python dict literal silently keeps only the last entry
    "indexAnalyzer":{
       "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
       "filters":[
          {"class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
          {"class":"solr.LowerCaseFilterFactory"}]
    },
    "queryAnalyzer":{
       "tokenizer":{
          "class":"solr.StandardTokenizerFactory"},
       "filters":[
          {"class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
          {"class":"solr.SynonymGraphFilterFactory",
           "ignoreCase":"true",
           "synonyms":"synonyms.txt"},
          {"class":"solr.LowerCaseFilterFactory"}]
    },
    "similarity":{
          "class":"solr.ClassicSimilarityFactory"
    }
  },
  "add-dynamic-field": {
    "name": "*_tfidf",
    "type": "text_tfidf",
    "indexed": "true",
    "stored": "true"
  },
  # both copy-field rules are passed as a list for the same reason
  "add-copy-field": [
    {"source": "title_t", "dest": "title_tfidf"},
    {"source": "description_t", "dest": "description_tfidf"}
  ]
}
resp = requests.post(SOLR_URL + "/schema", headers=headers, data=json.dumps(data))
print(resp.text)

{
  "responseHeader":{
    "status":0,
    "QTime":80}}
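One thing to keep in mind when building Schema API payloads in Python: a dict keeps only the last value for a repeated key, so repeated commands (several `add-copy-field` entries, several analyzer filters) have to be grouped into a list. The sketch below is a minimal check, not part of the original notebook; the helper name `find_duplicate_keys` is hypothetical. It uses the `object_pairs_hook` argument of `json.loads` to report keys that occur more than once in a raw JSON string.

```python
import json


def find_duplicate_keys(json_text):
    """Return the keys that occur more than once within any single JSON object."""
    duplicates = []

    def hook(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=hook)
    return duplicates


raw = (
    '{"add-copy-field": {"source": "title_t", "dest": "title_tfidf"},'
    ' "add-copy-field": {"source": "description_t", "dest": "description_tfidf"}}'
)
print(find_duplicate_keys(raw))
# a plain load keeps only the last rule
print(json.loads(raw)["add-copy-field"]["dest"])
```

Running a check like this on a hand-written payload before posting it makes it harder to lose a rule silently.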



### Load Records into Solr

In [4]:
def get_keywords(conn, movie_id):
    cur = conn.cursor()
    cur.execute("select keywords from keywords where mid = ?", [movie_id])
    rows = cur.fetchall()
    keywords = []
    if len(rows) > 0:
        for row in rows:
            keywords = row[0].split("|")
            break
    cur.close()
    return keywords


def filter_genres(conn, genres):
    filtered_genres = []
    cur = conn.cursor()
    for genre in genres:
        cur.execute("select gname from genres where gname = ?", [genre])
        rows = cur.fetchall()
        if len(rows) == 0:
            continue
        filtered_genres.append(genre)
    cur.close()
    return filtered_genres


conn = sqlite3.connect(LOOKUPS_DB)
print(get_keywords(conn, 460870))

['electricity', 'scientific experiment', 'nikola tesla']


In [5]:
def get_float(orig_value, default_value):
    if orig_value is None:
        return default_value
    elif len(orig_value.strip()) == 0:
        return default_value
    else:
        return float(orig_value)
    
def parse_genres(genre_json):
    if len(genre_json.strip()) == 0:
        return []
    names = []
    idname_pairs = json.loads(genre_json.replace("'", "\""))
    for idname_pair in idname_pairs:
        names.append(idname_pair["name"])
    return names


def add_record_to_solr(solr_url, doc_id, title, description, popularity, 
                       release_date, revenue, runtime, rating, keywords, genres,
                       should_commit=False):
    headers = {
        "content-type": "application/json",
        "accept": "application/json"
    }
    if doc_id is None:
        # only do a commit
        requests.post(solr_url + "/update", params={"commit": "true"}, headers=headers)
    else:
        req_body = json.dumps({
            "add": {
                "doc": {
                    "id": doc_id,
                    "title_t": title,
                    "description_t": description,
                    "popularity_f": popularity,
                    "released_dt": release_date,
                    "revenue_f": revenue,
                    "runtime_f": runtime,
                    "rating_f": rating,
                    "keywords_ss": keywords,
                    "genres_ss": genres
                }
            }
        })
        params = { "commit": "true" if should_commit else "false" }
        requests.post(solr_url + "/update", data=req_body, params=params, headers=headers)
        

i = 0
should_commit = False
with open(MOVIES_DATA, "r") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if i % 1000 == 0:
            print("{:d} records ingested into Solr".format(i))
            should_commit = True
        if row["original_language"] != "en":
            # only stick to english
            i += 1
            continue
        doc_id = int(row["id"])
        title = row["original_title"]
        description = row["overview"]
        popularity = get_float(row["popularity"], 0.0)
        release_date = row["release_date"]
        revenue = get_float(row["revenue"], 0.0)
        runtime = get_float(row["runtime"], 0.0)
        rating = get_float(row["vote_average"], 0.0)
        # look up keywords
        keywords = get_keywords(conn, doc_id)
        # parse out genres
        genres = filter_genres(conn, parse_genres(row["genres"]))
        # add record to solr
        add_record_to_solr(SOLR_URL, doc_id, title, description, popularity, 
                           release_date, revenue, runtime, rating, keywords, genres,
                           should_commit=should_commit)
        should_commit = False
        i += 1

add_record_to_solr(SOLR_URL, None, None, None, None, None, None, None, None, None, None, True)
print("{:d} records ingested into Solr, COMPLETE".format(i))

0 records ingested into Solr
1000 records ingested into Solr
2000 records ingested into Solr
3000 records ingested into Solr
4000 records ingested into Solr
5000 records ingested into Solr
6000 records ingested into Solr
7000 records ingested into Solr
8000 records ingested into Solr
9000 records ingested into Solr
10000 records ingested into Solr
11000 records ingested into Solr
12000 records ingested into Solr
13000 records ingested into Solr
14000 records ingested into Solr
15000 records ingested into Solr
16000 records ingested into Solr
17000 records ingested into Solr
18000 records ingested into Solr
19000 records ingested into Solr
20000 records ingested into Solr
21000 records ingested into Solr
22000 records ingested into Solr
23000 records ingested into Solr
24000 records ingested into Solr
25000 records ingested into Solr
26000 records ingested into Solr
27000 records ingested into Solr
28000 records ingested into Solr
29000 records ingested into Solr
30000 records ingested 
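The loop above sends one HTTP request per document. A possible speed-up, sketched here under the assumption that documents are first collected as plain dicts, is to post them in batches: Solr's JSON update handler also accepts a JSON array of documents. The helper names `batched` and `build_update_body` are hypothetical and not part of the original notebook.

```python
import json
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def build_update_body(docs):
    """Serialize a list of document dicts as a JSON array for the /update handler."""
    return json.dumps(docs)


# toy documents standing in for the rows read from the CSV file
docs = [{"id": i, "title_t": "movie {:d}".format(i)} for i in range(5)]
bodies = [build_update_body(batch) for batch in batched(docs, 2)]
print(len(bodies))
```

Each body would then be sent with `requests.post(SOLR_URL + "/update", data=body, headers=headers)`, followed by a single commit at the end.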

## Define LTR Features

LTR features are defined as JSON objects, each naming a feature class and its parameters (typically a Solr query, filter query or function query). See [the Solr LTR Documentation](https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html) for examples and details of the feature classes supported (OriginalScoreFeature, SolrFeature and ValueFeature).

As before, we can use Solr's JSON API to define and upload the LTR feature set to the index.
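One of the features defined below, `docRecency`, uses the function query `recip(ms(NOW, released_dt), 3.16e-11, 1, 1)`. Solr's `recip(x, m, a, b)` computes `a / (m*x + b)`, and 3.16e-11 is roughly the reciprocal of the number of milliseconds in a year, so the feature is approximately `1 / (age_in_years + 1)`. The following sketch reproduces that arithmetic in plain Python; the names `MS_PER_YEAR` and `doc_recency` are illustrative only.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.156e10 milliseconds


def recip(x, m, a, b):
    """Same formula as Solr's recip() function query: a / (m*x + b)."""
    return a / (m * x + b)


def doc_recency(age_years):
    """Approximate docRecency value for a document released age_years ago."""
    return recip(age_years * MS_PER_YEAR, 3.16e-11, 1, 1)


for years in (0, 1, 10):
    print(years, round(doc_recency(years), 4))
```

A document released today scores 1.0, a one-year-old document about 0.5, and the value decays slowly towards 0 for older documents.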

In [6]:
headers = {"Content-type": "application/json"}
data = [
  {
    "store": "myFeatureStore",
    "name": "origScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "store": "myFeatureStore",
    "name": "titleSimTFIDF",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!dismax qf=title_tfidf}${query}"
    }
  },
  {
    "store": "myFeatureStore",
    "name": "titleSimBM25",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!dismax qf=title_t}${query}"
    }
  },
  {
    "store": "myFeatureStore",
    "name": "descSimTFIDF",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!dismax qf=description_tfidf}${query}"
    }
  },
  {
    "store": "myFeatureStore",
    "name": "descSimBM25",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!dismax qf=description_t}${query}"
    }
  },
  {
    "store": "myFeatureStore",
    "name": "docRecency",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "q": "{!func}recip(ms(NOW, released_dt), 3.16e-11, 1, 1)"
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isGoHands",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}GoHands"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isAniplex",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Aniplex"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isThriller",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Thriller"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isForeign",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Foreign"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isDrama",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Drama"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isWar",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}War"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isAction",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Action"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isComedy",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Comedy"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isMusic",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Music"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isRomance",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Romance"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isAdventure",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Adventure"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isFamily",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Family"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isFantasy",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Fantasy"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isCrime",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Crime"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isHorror",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Horror"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isHistory",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}History"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isMystery",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Mystery"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isAnimation",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Animation"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isDocumentary",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Documentary"]
    }
  },
  {
    "store": "myFeatureStore",
    "name": "isWestern",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": {
      "fq": ["{!terms f=genres_ss}Western"]
    }
  }
]
resp = requests.put(SOLR_URL + "/schema/feature-store", headers=headers, data=json.dumps(data))
print(resp.text)

{
  "responseHeader":{
    "status":0,
    "QTime":143}}
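Most of the feature definitions above differ only in the genre name. As a sketch (the helper `make_genre_feature` is hypothetical, not part of the original notebook), the repetitive entries could be generated from a list of genre names instead of being written out by hand.

```python
def make_genre_feature(genre, store="myFeatureStore"):
    """Build a binary SolrFeature definition that fires when genres_ss contains genre."""
    return {
        "store": store,
        "name": "is" + genre.replace(" ", ""),
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"fq": ["{!terms f=genres_ss}" + genre]},
    }


genre_features = [make_genre_feature(g) for g in ["Thriller", "Drama", "Western"]]
print([f["name"] for f in genre_features])
```

The resulting list can be appended to the hand-written features (`origScore`, the similarity features and `docRecency`) before the `PUT` to `/schema/feature-store`.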



## Generate LTR features

### Create Dummy LinearModel

The [Solr LTR docs](https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html) imply that we can just call the following URL to get the values of the computed features for a given query `q`:

    http://localhost:8983/solr/tmdbindex/query?q=%22martial%20arts%22&fl=id,score,[features]

However, several of our `SolrFeature` definitions reference the external value `${query}`, which is supplied through the `efi.query` parameter of an LTR rerank query, so we need to do a little more work. Namely, we define a dummy LinearModel in which __the only feature with a non-zero weight is the original score feature__, so that reranking reproduces the original scores, and push it into Solr as follows.

In [7]:
headers = {"Content-type": "application/json"}
data = {
  "store" : "myFeatureStore",
  "name" : "myLinearModel",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [
    { "name": "origScore" },
    { "name": "titleSimTFIDF" },
    { "name": "titleSimBM25" },
    { "name": "descSimTFIDF" },
    { "name": "descSimBM25" },
    { "name": "docRecency" },
    { "name": "isGoHands" },
    { "name": "isAniplex" },
    { "name": "isThriller" },
    { "name": "isForeign" },
    { "name": "isDrama" },
    { "name": "isWar" },
    { "name": "isAction" },
    { "name": "isComedy" },
    { "name": "isMusic" },
    { "name": "isRomance" },
    { "name": "isAdventure" },
    { "name": "isFamily" },
    { "name": "isFantasy" },
    { "name": "isCrime" },
    { "name": "isHorror" },
    { "name": "isHistory" },
    { "name": "isMystery" },
    { "name": "isAnimation" },
    { "name": "isDocumentary" },
    { "name": "isWestern" }
  ],
  "params" : {
    "weights" : {
      "origScore": 1.0,
      "titleSimTFIDF": 0.0,
      "titleSimBM25": 0.0,
      "descSimTFIDF": 0.0,
      "descSimBM25": 0.0,
      "docRecency": 0.0,
      "isGoHands": 0.0,
      "isAniplex": 0.0,
      "isThriller": 0.0,
      "isForeign": 0.0,
      "isDrama": 0.0,
      "isWar": 0.0,
      "isAction": 0.0,
      "isComedy": 0.0,
      "isMusic": 0.0,
      "isRomance": 0.0,
      "isAdventure": 0.0,
      "isFamily": 0.0,
      "isFantasy": 0.0,
      "isCrime": 0.0,
      "isHorror": 0.0,
      "isHistory": 0.0,
      "isMystery": 0.0,
      "isAnimation": 0.0,
      "isDocumentary": 0.0,
      "isWestern": 0.0
    }
  }
}
resp = requests.put(SOLR_URL + "/schema/model-store", headers=headers, data=json.dumps(data))
print(resp.text)

{
  "responseHeader":{
    "status":0,
    "QTime":10}}
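The model definition above lists every feature twice, once under `features` and once under `weights`. A minimal sketch of building the same structure from a list of feature names follows; the function name `build_linear_model` is hypothetical, and the short feature list is only for illustration.

```python
def build_linear_model(feature_names, active_feature,
                       store="myFeatureStore", name="myLinearModel"):
    """Build a LinearModel definition where only active_feature has weight 1.0."""
    return {
        "store": store,
        "name": name,
        "class": "org.apache.solr.ltr.model.LinearModel",
        "features": [{"name": n} for n in feature_names],
        "params": {
            "weights": {n: (1.0 if n == active_feature else 0.0) for n in feature_names}
        },
    }


model = build_linear_model(["origScore", "titleSimBM25", "isDrama"], "origScore")
print(model["params"]["weights"])
```

In the notebook this would be called with `FEATURE_LIST` and `"origScore"`, which keeps the model in sync with the feature list used later for the LETOR export.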



### Extract features to LETOR format

To run a query and get back feature values, use the following URL:

    http://localhost:8983/solr/tmdbindex/query?q=%22martial%20arts%22&df=description_t&rq={!ltr%20model=myLinearModel%20efi.query=%27martial%20arts%27}&fl=id,score,[features]
    
We will now extract the features for a given set of queries and write them out in [LETOR format](https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/). The resulting dataset will be used to train a RankLib LambdaMART model.

In [8]:
def rating2label(rating):
    """ convert 0-10 continuous rating to 1-5 categorical labels """
    # clamp so that a rating of exactly 10 maps to 5 rather than 6
    return min(int(rating // 2) + 1, 5)

assert(rating2label(6.4) == 4)
assert(rating2label(9.8) == 5)

In [9]:
feature_name2id = {name: idx + 1 for idx, name in enumerate(FEATURE_LIST)}

assert(feature_name2id["isRomance"] == 16)

In [10]:
def format_letor(doc_id, rating, feat_str, qid, query):
    label = rating2label(rating)
    feat_pairs = []
    for feat_nv in feat_str.split(","):
        feat_name, feat_val = feat_nv.split("=")
        feat_id = str(feature_name2id[feat_name])
        feat_val = float(feat_val)
        if feat_name.startswith("is"):
            feat_val = int(feat_val)
        feat_val = str(feat_val)
        feat_pairs.append(":".join([feat_id, feat_val]))
    return "{:d} qid:{:d} {:s} # docid:{:d} query:{:s}".format(
        label, qid, " ".join(feat_pairs), doc_id, query)

print(format_letor(192143, 4.5, "origScore=9.458602,titleSimTFIDF=0.0,titleSimBM25=0.0,descSimTFIDF=2.3550842,descSimBM25=9.458602,docRecency=0.07054524,isGoHands=0.0,isAniplex=0.0,isThriller=0.0,isForeign=0.0,isDrama=0.0,isWar=0.0,isAction=0.0,isComedy=0.0,isMusic=0.0,isRomance=0.0,isAdventure=0.0,isFamily=0.0,isFantasy=0.0,isCrime=0.0,isHorror=0.0,isHistory=0.0,isMystery=0.0,isAnimation=0.0,isDocumentary=1.0,isWestern=0.0", 1, "biography"))

3 qid:1 1:9.458602 2:0.0 3:0.0 4:2.3550842 5:9.458602 6:0.07054524 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0 22:0 23:0 24:0 25:1 26:0 # docid:192143 query:biography
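To make the layout of a LETOR line explicit (label, `qid`, `feature_id:value` pairs, then a `#` comment), here is a small sketch of the inverse operation. The parser `parse_letor_line` is a hypothetical helper, useful mainly for sanity-checking the generated files, and the sample line is a shortened version of the output above.

```python
def parse_letor_line(line):
    """Split a LETOR line into (label, qid, {feature_id: value}, comment)."""
    data, _, comment = line.partition("#")
    tokens = data.split()
    label = int(tokens[0])
    qid = int(tokens[1].split(":", 1)[1])
    features = {}
    for token in tokens[2:]:
        feat_id, feat_val = token.split(":", 1)
        features[int(feat_id)] = float(feat_val)
    return label, qid, features, comment.strip()


line = "3 qid:1 1:9.458602 2:0.0 25:1 26:0 # docid:192143 query:biography"
label, qid, features, comment = parse_letor_line(line)
print(label, qid, features[25], comment)
```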


In [11]:
random.shuffle(QUERY_LIST)
train_queries = QUERY_LIST[0:12]
val_queries = QUERY_LIST[12:15]
test_queries = QUERY_LIST[15:]
feat_suffixes = ["train", "val", "test"]
qid = 1
for qt_idx, queries in enumerate([train_queries, val_queries, test_queries]):
    fletor = open(FEATURE_FILE_TEMPLATE.format(feat_suffixes[qt_idx]), "w")
    for query in queries:
        print("generating feature for {:s} ({:s})".format(query, feat_suffixes[qt_idx]))
        if len(query.split()) > 1:
            query = "\"" + query + "\""
        payload = {
            "q": query,
            "defType": "edismax",
            "qf": "title_t description_t",
            "pf": "title_t description_t",
            "mm": 2,
            "rq": "{!ltr model=myLinearModel efi.query=" + query + "}",
            "fl": "id,rating_f,[features]",            
            "rows": 100
        }
        params = urllib.parse.urlencode(payload, quote_via=urllib.parse.quote_plus)
        search_url = SOLR_URL + "/select?" + params
        resp = requests.get(search_url)
        resp_json = json.loads(resp.text)
        for doc in resp_json["response"]["docs"]:
            doc_id = int(doc["id"])
            rating = doc["rating_f"]
            feat_str = doc["[features]"]
            fletor.write("{:s}\n".format(format_letor(doc_id, rating, feat_str, qid, query)))
        qid += 1
    fletor.close()
print("number of queries, train {:d}, test {:d}, validation {:d}".format(
    len(train_queries), len(test_queries), len(val_queries)))

generating feature for alien (train)
generating feature for romance (train)
generating feature for police (train)
generating feature for world war ii (train)
generating feature for prison (train)
generating feature for biography (train)
generating feature for nazis (train)
generating feature for magic (train)
generating feature for spy (train)
generating feature for martial arts (train)
generating feature for vampire (train)
generating feature for superhero (train)
generating feature for musical (val)
generating feature for dystopia (val)
generating feature for teacher (val)
generating feature for murder (test)
generating feature for sport (test)
generating feature for extramarital (test)
generating feature for comedy (test)
generating feature for wedding (test)
number of queries, train 12, test 5, validation 3
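Because `random.shuffle` is called without a seed, the train/validation/test assignment changes on every run. If a repeatable split is wanted, one option is to shuffle a copy with a dedicated seeded generator, as sketched below. The function name `split_queries` and the seed value are arbitrary choices for illustration.

```python
import random


def split_queries(queries, n_train, n_val, seed):
    """Shuffle a copy of queries with a fixed seed and cut it into three parts."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


queries = ["query{:d}".format(i) for i in range(20)]
train, val, test = split_queries(queries, 12, 3, seed=42)
print(len(train), len(val), len(test))
```

Calling the function again with the same seed returns the same three lists, and the original list is left untouched.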


## Train LTR model using RankLib

We train a LambdaMART model with [RankLib](https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/) through its command line interface, as shown below.

    cd <scripts_dir>
    java -jar RankLib-2.10.jar \
        -train ../data/solr_features_train.txt \
        -test ../data/solr_features_test.txt \
        -validate ../data/solr_features_val.txt \
        -ranker 6 \
        -metric2t NDCG@10 \
        -metric2T NDCG@10 \
        -save ../models/solr_lambdamart_model.txt
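The command optimizes and reports NDCG@10. For readers unfamiliar with the metric, the sketch below computes NDCG@k for a single ranked list using one common formulation (gain `2^label - 1`, discount `log2(rank + 1)`). This is an illustrative assumption; RankLib's own implementation may differ in details such as tie handling.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(labels, k) / ideal


print(ndcg_at_k([5, 3, 1], 10))
print(ndcg_at_k([1, 3, 5], 10))
```

A perfectly ordered list scores 1.0, and any list that places lower labels ahead of higher ones scores below that.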

Console output is shown below:

    Discard orig. features
    Training data:	../data/solr_features_train.txt
    Test data:	../data/solr_features_test.txt
    Validation data:	../data/solr_features_val.txt
    Feature vector representation: Dense.
    Ranking method:	LambdaMART
    Feature description file:	Unspecified. All features will be used.
    Train metric:	NDCG@10
    Test metric:	NDCG@10
    Feature normalization: No
    Model file: ../models/solr_lambdamart_model.txt
    
    [+] LambdaMART's Parameters:
    No. of trees: 1000
    No. of leaves: 10
    No. of threshold candidates: 256
    Min leaf support: 1
    Learning rate: 0.1
    Stop early: 100 rounds without performance gain on validation data
    
    Reading feature file [../data/solr_features_train.txt]... [Done.]            
    (12 ranked lists, 1039 entries read)
    Reading feature file [../data/solr_features_val.txt]... [Done.]            
    (3 ranked lists, 257 entries read)
    Reading feature file [../data/solr_features_test.txt]... [Done.]            
    (5 ranked lists, 402 entries read)
    Initializing... [Done]
    ---------------------------------
    Training starts...
    ---------------------------------
    #iter   | NDCG@10-T | NDCG@10-V | 
    ---------------------------------
    1       | 0.5858    | 0.4278    | 
    2       | 0.6125    | 0.4547    | 
    3       | 0.6073    | 0.4931    | 
    4       | 0.617     | 0.4778    | 
    5       | 0.6361    | 0.4797    | 
    6       | 0.6377    | 0.4723    | 
    7       | 0.6382    | 0.4706    | 
    8       | 0.6374    | 0.4741    | 
    9       | 0.6435    | 0.4606    | 
    10      | 0.6417    | 0.4606    | 
    11      | 0.6434    | 0.4773    | 
    12      | 0.6424    | 0.4851    | 
    13      | 0.6484    | 0.4847    | 
    14      | 0.659     | 0.4988    | 
    15      | 0.679     | 0.4988    | 
    16      | 0.6794    | 0.4988    | 
    17      | 0.6861    | 0.5005    | 
    18      | 0.6867    | 0.4864    | 
    19      | 0.6934    | 0.5305    | 
    20      | 0.6956    | 0.5247    | 
    21      | 0.7165    | 0.5234    | 
    22      | 0.7154    | 0.5177    | 
    23      | 0.7108    | 0.5004    | 
    24      | 0.7111    | 0.4932    | 
    25      | 0.7219    | 0.5037    | 
    26      | 0.7301    | 0.5051    | 
    27      | 0.7256    | 0.493     | 
    28      | 0.7189    | 0.493     | 
    29      | 0.7339    | 0.4933    | 
    30      | 0.7323    | 0.4878    | 
    31      | 0.7476    | 0.483     | 
    32      | 0.7447    | 0.4982    | 
    33      | 0.7421    | 0.4909    | 
    34      | 0.7407    | 0.4836    | 
    35      | 0.756     | 0.4912    | 
    36      | 0.7545    | 0.4912    | 
    37      | 0.757     | 0.4924    | 
    38      | 0.7561    | 0.4943    | 
    39      | 0.7598    | 0.496     | 
    40      | 0.764     | 0.5018    | 
    41      | 0.7626    | 0.4889    | 
    42      | 0.7678    | 0.5095    | 
    43      | 0.7677    | 0.5089    | 
    44      | 0.7658    | 0.5023    | 
    45      | 0.7625    | 0.5026    | 
    46      | 0.7632    | 0.5015    | 
    47      | 0.7643    | 0.5019    | 
    48      | 0.7681    | 0.4931    | 
    49      | 0.7681    | 0.4974    | 
    50      | 0.7708    | 0.4974    | 
    51      | 0.7733    | 0.4944    | 
    52      | 0.7791    | 0.4968    | 
    53      | 0.7834    | 0.4981    | 
    54      | 0.7813    | 0.4985    | 
    55      | 0.7765    | 0.5333    | 
    56      | 0.777     | 0.5329    | 
    57      | 0.7742    | 0.5321    | 
    58      | 0.7785    | 0.53      | 
    59      | 0.7762    | 0.5226    | 
    60      | 0.775     | 0.5336    | 
    61      | 0.7844    | 0.5336    | 
    62      | 0.7843    | 0.5353    | 
    63      | 0.796     | 0.5318    | 
    64      | 0.792     | 0.5319    | 
    65      | 0.7967    | 0.5302    | 
    66      | 0.7996    | 0.5311    | 
    67      | 0.7998    | 0.5311    | 
    68      | 0.7978    | 0.5302    | 
    69      | 0.7992    | 0.5299    | 
    70      | 0.8023    | 0.5299    | 
    71      | 0.7998    | 0.4988    | 
    72      | 0.7996    | 0.4992    | 
    73      | 0.801     | 0.4971    | 
    74      | 0.8111    | 0.5265    | 
    75      | 0.8108    | 0.5229    | 
    76      | 0.8172    | 0.4927    | 
    77      | 0.8158    | 0.4852    | 
    78      | 0.8145    | 0.494     | 
    79      | 0.8202    | 0.4892    | 
    80      | 0.8215    | 0.494     | 
    81      | 0.8197    | 0.494     | 
    82      | 0.8177    | 0.494     | 
    83      | 0.8171    | 0.5046    | 
    84      | 0.819     | 0.5051    | 
    85      | 0.8236    | 0.5056    | 
    86      | 0.8181    | 0.506     | 
    87      | 0.8187    | 0.5064    | 
    88      | 0.8222    | 0.5055    | 
    89      | 0.8185    | 0.5055    | 
    90      | 0.8229    | 0.5055    | 
    91      | 0.8204    | 0.4988    | 
    92      | 0.829     | 0.4984    | 
    93      | 0.8296    | 0.4984    | 
    94      | 0.8396    | 0.5079    | 
    95      | 0.8394    | 0.5082    | 
    96      | 0.8431    | 0.5088    | 
    97      | 0.8467    | 0.5084    | 
    98      | 0.8467    | 0.5088    | 
    99      | 0.8504    | 0.52      | 
    100     | 0.8496    | 0.5227    | 
    101     | 0.8536    | 0.5234    | 
    102     | 0.8516    | 0.5239    | 
    103     | 0.8482    | 0.5415    | 
    104     | 0.8558    | 0.5394    | 
    105     | 0.8559    | 0.5393    | 
    106     | 0.853     | 0.5396    | 
    107     | 0.8586    | 0.5389    | 
    108     | 0.8529    | 0.5397    | 
    109     | 0.8549    | 0.5381    | 
    110     | 0.8587    | 0.5402    | 
    111     | 0.8587    | 0.5402    | 
    112     | 0.8546    | 0.538     | 
    113     | 0.8566    | 0.5388    | 
    114     | 0.8572    | 0.5392    | 
    115     | 0.8569    | 0.5387    | 
    116     | 0.8605    | 0.5402    | 
    117     | 0.8592    | 0.5398    | 
    118     | 0.8601    | 0.5229    | 
    119     | 0.8634    | 0.5127    | 
    120     | 0.858     | 0.5135    | 
    121     | 0.8665    | 0.5131    | 
    122     | 0.8703    | 0.5117    | 
    123     | 0.8694    | 0.5064    | 
    124     | 0.8677    | 0.5069    | 
    125     | 0.8725    | 0.5069    | 
    126     | 0.8717    | 0.5315    | 
    127     | 0.8664    | 0.5244    | 
    128     | 0.8714    | 0.5256    | 
    129     | 0.8781    | 0.5256    | 
    130     | 0.875     | 0.5363    | 
    131     | 0.8827    | 0.5137    | 
    132     | 0.8825    | 0.508     | 
    133     | 0.8843    | 0.5073    | 
    134     | 0.8865    | 0.5037    | 
    135     | 0.8856    | 0.5034    | 
    136     | 0.8826    | 0.5034    | 
    137     | 0.8862    | 0.504     | 
    138     | 0.8905    | 0.5146    | 
    139     | 0.8954    | 0.5142    | 
    140     | 0.8943    | 0.5024    | 
    141     | 0.8947    | 0.5026    | 
    142     | 0.8957    | 0.4931    | 
    143     | 0.8967    | 0.5007    | 
    144     | 0.8987    | 0.5012    | 
    145     | 0.9006    | 0.5122    | 
    146     | 0.8994    | 0.5049    | 
    147     | 0.8992    | 0.5049    | 
    148     | 0.9045    | 0.5302    | 
    149     | 0.9047    | 0.5122    | 
    150     | 0.9052    | 0.5111    | 
    151     | 0.9088    | 0.5107    | 
    152     | 0.9077    | 0.5082    | 
    153     | 0.9071    | 0.5199    | 
    154     | 0.904     | 0.5239    | 
    155     | 0.9052    | 0.5238    | 
    156     | 0.9043    | 0.5183    | 
    157     | 0.9041    | 0.5233    | 
    158     | 0.9046    | 0.5252    | 
    159     | 0.9029    | 0.5316    | 
    160     | 0.9122    | 0.5341    | 
    161     | 0.9181    | 0.5341    | 
    162     | 0.9144    | 0.5341    | 
    163     | 0.9127    | 0.5327    | 
    164     | 0.9154    | 0.5327    | 
    165     | 0.9146    | 0.5268    | 
    166     | 0.9186    | 0.5268    | 
    167     | 0.9185    | 0.5268    | 
    168     | 0.9176    | 0.5268    | 
    169     | 0.9167    | 0.5268    | 
    170     | 0.917     | 0.5268    | 
    171     | 0.9194    | 0.5272    | 
    172     | 0.9218    | 0.5201    | 
    173     | 0.9168    | 0.5206    | 
    174     | 0.9213    | 0.524     | 
    175     | 0.9236    | 0.5234    | 
    176     | 0.9229    | 0.5234    | 
    177     | 0.9221    | 0.5258    | 
    178     | 0.9228    | 0.5208    | 
    179     | 0.9246    | 0.5179    | 
    180     | 0.9218    | 0.5255    | 
    181     | 0.9264    | 0.5289    | 
    182     | 0.923     | 0.5278    | 
    183     | 0.9202    | 0.5231    | 
    184     | 0.9214    | 0.5309    | 
    185     | 0.9249    | 0.5303    | 
    186     | 0.9249    | 0.5295    | 
    187     | 0.9249    | 0.5295    | 
    188     | 0.9265    | 0.5251    | 
    189     | 0.9265    | 0.5255    | 
    190     | 0.9232    | 0.5281    | 
    191     | 0.9255    | 0.5335    | 
    192     | 0.9271    | 0.5284    | 
    193     | 0.9271    | 0.5276    | 
    194     | 0.9271    | 0.5281    | 
    195     | 0.9283    | 0.529     | 
    196     | 0.9286    | 0.5286    | 
    197     | 0.9295    | 0.529     | 
    198     | 0.9282    | 0.5286    | 
    199     | 0.9295    | 0.5286    | 
    200     | 0.9276    | 0.5286    | 
    201     | 0.9295    | 0.5294    | 
    202     | 0.9316    | 0.5294    | 
    203     | 0.9316    | 0.5284    | 
    204     | 0.9316    | 0.5284    | 
    ---------------------------------
    Finished sucessfully.
    NDCG@10 on training data: 0.8482
    NDCG@10 on validation data: 0.5415
    ---------------------------------
    NDCG@10 on test data: 0.5944
    
    Model saved to: ../models/solr_lambdamart_model.txt


## Upload trained model

The resulting LambdaMART model looks like [this](https://github.com/sujitpal/ltr-examples/blob/master/models/solr_lambdamart_model.txt).


### Convert model file to JSON

Note that RankLib writes the model in an XML-like format, preceded by `##` comment lines. It must be converted to JSON before it can be uploaded to Solr.

In [12]:
# Strip RankLib's "##" comment header and blank lines so the rest parses as XML.
with open(RANKLIB_LM_MODEL, "r") as fraw, open(RANKLIB_PROC_LM_MODEL, "w") as fxml:
    fxml.write("<?xml version=\"1.0\"?>\n")
    for line in fraw:
        if line.startswith("##") or len(line.strip()) == 0:
            continue
        fxml.write(line)

In [13]:
def parse_split(el_split, feature_id2name, split_type="root"):
    """Recursively convert a RankLib <split> element into a Solr tree node.

    split_type is informational only ("root", "left" or "right").
    """
    # A leaf carries only an <output> value.
    output = el_split.find("output")
    if output is not None:
        return {
            "value": output.text.strip()
        }
    # An internal node tests one feature against a threshold.
    feature = feature_id2name[int(el_split.find("feature").text.strip())]
    threshold = el_split.find("threshold").text.strip()
    left, right = None, None
    for el_csplit in el_split.findall("split"):
        attr_pos = el_csplit.attrib["pos"]
        if attr_pos == "left":
            left = parse_split(el_csplit, feature_id2name, "left")
        elif attr_pos == "right":
            right = parse_split(el_csplit, feature_id2name, "right")
    return {
        "feature": feature,
        "threshold": threshold,
        "left": left,
        "right": right
    }


trees = []
feature_id2name = {i+1:f for i, f in enumerate(FEATURE_LIST)}
xml = ET.parse(RANKLIB_PROC_LM_MODEL)
el_ensemble = xml.getroot()
for el_tree in el_ensemble:
    weight = el_tree.attrib["weight"]
    el_split = el_tree.find("split")
    tree_dict = {
        "weight": weight,
        "root": parse_split(el_split, feature_id2name)
    }
    trees.append(tree_dict)
params_dict = {"trees" : trees}
    
features = [{"name": f} for f in FEATURE_LIST]
model_dict = {
    "store": "myFeatureStore",
    "name": "myLambdaMARTModel",
    "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "features": features,
    "params": params_dict
}
with open(SOLR_LM_MODEL, "w") as fjson:
    fjson.write(json.dumps(model_dict, indent=4))
os.unlink(RANKLIB_PROC_LM_MODEL)
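
To make the JSON structure produced above more concrete, here is a minimal sketch of how a model in this shape could be evaluated for one document. It assumes the usual regression-tree convention that a feature value less than or equal to the threshold goes to the left child and anything larger goes right; Solr's own implementation may differ in details such as threshold slack. The helper names `score_tree` and `score_model`, and the toy model with its thresholds, weights and leaf values, are made up for illustration and are not taken from the trained model.

```python
def score_tree(node, feature_values):
    """Walk one tree dict (in the shape built by parse_split) to a leaf value."""
    while "value" not in node:
        # Assumed convention: values <= threshold go left, otherwise right.
        if feature_values[node["feature"]] <= float(node["threshold"]):
            node = node["left"]
        else:
            node = node["right"]
    return float(node["value"])


def score_model(model, feature_values):
    """Sum weight * leaf value over all trees in the model dict."""
    return sum(
        float(tree["weight"]) * score_tree(tree["root"], feature_values)
        for tree in model["params"]["trees"]
    )


# Hypothetical two-tree model using feature names from FEATURE_LIST;
# thresholds, weights and leaf values are invented for illustration.
toy_model = {
    "params": {
        "trees": [
            {
                "weight": "0.1",
                "root": {
                    "feature": "titleSimBM25",
                    "threshold": "2.5",
                    "left": {"value": "-1.0"},
                    "right": {
                        "feature": "isDocumentary",
                        "threshold": "0.5",
                        "left": {"value": "0.5"},
                        "right": {"value": "2.0"},
                    },
                },
            },
            {"weight": "0.1", "root": {"value": "1.0"}},
        ]
    }
}

toy_features = {"titleSimBM25": 3.2, "isDocumentary": 1.0}
toy_score = score_model(toy_model, toy_features)
print(round(toy_score, 3))
```

Weights, thresholds and leaf values are kept as strings here because that is how the conversion code above writes them into the JSON file.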

### Model Upload

In [14]:
# The model file is plain JSON, so it can be sent as-is.
with open(SOLR_LM_MODEL, "r") as fjson:
    data = fjson.read()
headers = {"Content-type": "application/json"}
resp = requests.put(SOLR_URL + "/schema/model-store", headers=headers, data=data)
print(resp.text)

{
  "responseHeader":{
    "status":0,
    "QTime":54}}

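A `status` of 0 only says the request was accepted. One way to double-check that the model is registered is to list the model store and look for its name. The sketch below assumes that a GET on the same `/schema/model-store` endpoint returns a JSON object with a top-level `models` array whose entries carry a `name` key; the helper names and the sample response are hypothetical. It uses the standard library's `urllib.request` to stay self-contained, although `requests.get(...)` would work just as well.

```python
import json
import urllib.request


def uploaded_model_names(model_store_response):
    """Return the names of all models in a parsed model-store listing."""
    # Assumes a top-level "models" array of objects with a "name" key.
    return [m["name"] for m in model_store_response.get("models", [])]


def fetch_model_store(solr_url):
    """GET <solr_url>/schema/model-store and parse the JSON body."""
    with urllib.request.urlopen(solr_url + "/schema/model-store") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical response, trimmed to the fields this sketch relies on.
sample_response = {
    "responseHeader": {"status": 0, "QTime": 1},
    "models": [{"name": "myLambdaMARTModel", "store": "myFeatureStore"}],
}
print(uploaded_model_names(sample_response))
# Against a live server: uploaded_model_names(fetch_model_store(SOLR_URL))
```
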


## Run rerank query

We select a random query from our query list and generate results with and without reranking using the trained LTR model.

In [15]:
def get_rating_string(rating):
    # Filled stars for the label, hollow stars up to the maximum of 5.
    return u"\u2605" * rating + u"\u2606" * (5 - rating)


print(get_rating_string(3))
print(get_rating_string(rating2label(6.4)))

★★★☆☆
★★★★☆


In [16]:
# Pick one query uniformly at random.
query = random.choice(QUERY_LIST)
if len(query.split()) > 1:
    query = "\"" + query + "\""

### Top 20 results without re-ranking

In [17]:
def render_results(docs, query, top_n):
    print("top {:d} results for {:s}".format(top_n, query))
    print("---")
    for doc in docs:
        doc_id = int(doc["id"])
        stars = get_rating_string(rating2label(float(doc["rating_f"])))
        score = float(doc["score"])
        title = doc["title_t"]
        print("{:s} {:06d} {:.3f} {:s}".format(stars, doc_id, score, title))


payload = {
    "q": query,
    "defType": "edismax",
    "qf": "title_t description_t",
    "pf": "title_t description_t",
    "mm": 2,
    "fl": "id,title_t,rating_f,score",            
    "rows": TOP_N * 2
}
params = urllib.parse.urlencode(payload, quote_via=urllib.parse.quote_plus)
search_url = SOLR_URL + "/select?" + params
resp = requests.get(search_url)
resp_json = json.loads(resp.text)
docs = resp_json["response"]["docs"]
render_results(docs, query, TOP_N * 2)

top 20 results for biography
---
★★★☆☆ 192143 9.473 The Last Mogul
★★★☆☆ 242631 9.473 So This Is Love
★★★☆☆ 046931 9.366 Georg
★★★★☆ 043277 9.261 The Great Ziegfeld
★★☆☆☆ 043491 9.059 Rhapsody in Blue
★★★★☆ 016769 8.961 Coal Miner's Daughter
★★☆☆☆ 107973 8.771 The Dolly Sisters
★★★☆☆ 101653 8.771 The Magnificent Yankee
★★★★☆ 027966 8.767 King of the Underworld
★★★☆☆ 042819 8.585 The City of Your Final Destination
★☆☆☆☆ 187156 8.502 Jean-Luc Cinema Godard
★★★☆☆ 043604 8.415 Viva Villa!
★★★★★ 049914 8.415 W.C. Fields and Me
★★★☆☆ 022043 8.331 The Profit
★★★★☆ 191600 8.248 John Huston: The Man, the Movies, the Maverick
★★★☆☆ 085411 8.248 The Three Stooges
★★★★★ 102814 8.167 The Beach Boys: An American Band
★★★☆☆ 004882 8.167 A Bullet for Pretty Boy
★★★★★ 235932 8.167 Le grand Méliès
★★★★☆ 004975 8.087 Love Is the Devil: Study for a Portrait of Francis Bacon


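### Top 20 results with LTR

The `rq` parameter below sets `reRankDocs=10`, so only the top 10 hits of the original query are re-scored by the model. Hits 11 to 20 keep their original order and scores, as the output shows.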

In [18]:
payload = {
    "q": query,
    "defType": "edismax",
    "qf": "title_t description_t",
    "pf": "title_t description_t",
    "mm": 2,
    "rq": "{!ltr model=myLambdaMARTModel reRankDocs=10 efi.query=" + query + "}",
    "fl": "id,title_t,rating_f,score",            
    "rows": TOP_N * 2
}
params = urllib.parse.urlencode(payload, quote_via=urllib.parse.quote_plus)
search_url = SOLR_URL + "/select?" + params
resp = requests.get(search_url)
resp_json = json.loads(resp.text)
docs = resp_json["response"]["docs"]
render_results(docs, query, TOP_N * 2)

top 20 results for biography
---
★★☆☆☆ 043491 0.868 Rhapsody in Blue
★★★☆☆ 192143 0.517 The Last Mogul
★★★☆☆ 046931 -0.056 Georg
★★★★☆ 043277 -0.704 The Great Ziegfeld
★★★☆☆ 242631 -0.812 So This Is Love
★★★☆☆ 101653 -1.163 The Magnificent Yankee
★★★☆☆ 042819 -1.306 The City of Your Final Destination
★★★★☆ 016769 -1.487 Coal Miner's Daughter
★★★★☆ 027966 -1.566 King of the Underworld
★★☆☆☆ 107973 -1.835 The Dolly Sisters
★☆☆☆☆ 187156 8.502 Jean-Luc Cinema Godard
★★★☆☆ 043604 8.415 Viva Villa!
★★★★★ 049914 8.415 W.C. Fields and Me
★★★☆☆ 022043 8.331 The Profit
★★★★☆ 191600 8.248 John Huston: The Man, the Movies, the Maverick
★★★☆☆ 085411 8.248 The Three Stooges
★★★★★ 102814 8.167 The Beach Boys: An American Band
★★★☆☆ 004882 8.167 A Bullet for Pretty Boy
★★★★★ 235932 8.167 Le grand Méliès
★★★★☆ 004975 8.087 Love Is the Devil: Study for a Portrait of Francis Bacon
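
To compare the two orderings numerically rather than by eye, NDCG@10 can be computed from the star labels of the first ten rows in each listing. This is a minimal sketch that assumes the common formulation with gain `2^label - 1` and a `log2(rank + 1)` discount, and it takes the ideal ordering from these ten documents only, so the numbers are not directly comparable to the RankLib figures in the training log. The function names are illustrative.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain with gain 2^label - 1 and a log2 discount."""
    return sum(
        (2 ** label - 1) / math.log2(rank + 2)
        for rank, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels, k):
    """DCG@k divided by the DCG@k of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


# Star counts of the first ten rows in each listing above, top to bottom.
labels_before = [3, 3, 3, 4, 2, 4, 2, 3, 4, 3]
labels_after = [2, 3, 3, 4, 3, 3, 3, 4, 4, 2]

ndcg_before = ndcg_at_k(labels_before, 10)
ndcg_after = ndcg_at_k(labels_after, 10)
print("NDCG@10 before: {:.4f}".format(ndcg_before))
print("NDCG@10 after:  {:.4f}".format(ndcg_after))
```

A single query is far too small a sample to judge the model; repeating this over all of `QUERY_LIST` and averaging would give a more meaningful comparison.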
