# Feature Generation for Solr LTR model

We will define features in Solr using its LTR functionality, then extract the feature values for specific queries and write them out in LETOR format for training our LTR model.

## Prepare Solr for LTR

To enable LTR, we have to edit the core's `solrconfig.xml` file (more details at the [Solr LTR Tutorial](https://github.com/airalcorn2/Solr-LTR)). First stop Solr and open the file:

    cd <solr_home>
    bin/solr stop
    vi server/solr/tmdbindex/conf/solrconfig.xml
    
Add the following snippet __before__ the closing `</config>` tag on the last line of `solrconfig.xml`.

    <lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />

    <queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

    <cache name="QUERY_DOC_FV"
           class="solr.search.LRUCache"
           size="4096"
           initialSize="2048"
           autowarmCount="4096"
           regenerator="solr.search.NoOpRegenerator" />

    <transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
      <str name="fvCacheName">QUERY_DOC_FV</str>
    </transformer>
    
Restart Solr with LTR enabled.

    bin/solr start -Dsolr.ltr.enabled=true
    
You are now ready to define LTR features.
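
If later steps fail, one thing worth checking is whether the snippet actually ended up inside `<config>`. The following is a minimal sketch of such a check, assuming the file parses as plain XML. `has_ltr_config` is a hypothetical helper name, and the inline `sample` string stands in for the real `solrconfig.xml`.

```python
import xml.etree.ElementTree as ET

def has_ltr_config(solrconfig_xml):
    """Return True if the XML text declares the ltr query parser and the features transformer."""
    root = ET.fromstring(solrconfig_xml)
    has_parser = any(
        qp.get("name") == "ltr" for qp in root.iter("queryParser"))
    has_transformer = any(
        tr.get("name") == "features" for tr in root.iter("transformer"))
    return has_parser and has_transformer

# Tiny stand-in for solrconfig.xml (assumption: the real file is read from
# server/solr/tmdbindex/conf/solrconfig.xml and passed in as text).
sample = """<config>
  <queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>
  <transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
    <str name="fvCacheName">QUERY_DOC_FV</str>
  </transformer>
</config>"""
print(has_ltr_config(sample))
```

This only confirms that the two elements are present; it does not verify that the LTR jars are on the classpath.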

## Feature definition

See `../scripts/solr_efi_features.json` for the list of features. The list is uploaded into Solr using the following command:

    curl -XPUT "http://localhost:8983/solr/tmdbindex/schema/feature-store" \
        --data-binary "@solr_efi_features.json" -H "Content-type:application/json"
        
Features can be viewed using the following URL:

    http://localhost:8983/solr/tmdbindex/schema/feature-store/myFeatureStore
    
To change the feature definitions, delete the existing features first and then upload them again:

    curl -XDELETE 'http://localhost:8983/solr/tmdbindex/schema/feature-store/myFeatureStore'
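
For orientation, here is a rough sketch of what entries in a feature file can look like, built as Python dicts and serialized to JSON. The two entries are hypothetical illustrations, not the contents of `solr_efi_features.json`: the feature names and the `title_t` field come from this notebook, while the `params` values are assumptions.

```python
import json

# Hypothetical entries; the real definitions live in solr_efi_features.json.
features = [
    {
        "store": "myFeatureStore",
        "name": "origScore",
        "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params": {},
    },
    {
        "store": "myFeatureStore",
        "name": "titleSimBM25",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        # ${query} is filled in from efi.query at request time;
        # the query template itself is an assumption.
        "params": {"q": "{!dismax qf=title_t}${query}"},
    },
]

print(json.dumps(features, indent=2))
```

Writing this JSON to a file and uploading it with the `curl -XPUT` command above would register both features under `myFeatureStore`.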
    

## Feature extraction

The [Solr LTR docs](https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html) imply that we can just call the following URL to get the values of the computed features for a given query `q`:

    http://localhost:8983/solr/tmdbindex/query?q=%22martial%20arts%22&fl=id,score,[features]

However, some of our features are of type SolrFeature and receive the query as external feature information (`efi.query`), so they have to be computed in the context of an actual query, which takes a little more work. We construct a dummy LinearModel, `solr_feature_ltr_model.json`, in which the only feature that is turned on is the original score, then push it into Solr using the following command:

    curl -XPUT "http://localhost:8983/solr/tmdbindex/schema/model-store" \
        --data-binary "@solr_feature_ltr_model.json" -H "Content-type:application/json"
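
A dummy model of this kind can be sketched as follows. This is an illustration under assumptions rather than the actual contents of `solr_feature_ltr_model.json`: the feature list is shortened to three names, and the store name is taken from the feature-store URL used earlier.

```python
import json

# Shortened for illustration; the notebook's FEATURE_LIST has 26 entries.
feature_names = ["origScore", "titleSimTFIDF", "titleSimBM25"]

model = {
    "store": "myFeatureStore",
    "class": "org.apache.solr.ltr.model.LinearModel",
    "name": "myLinearModel",
    "features": [{"name": name} for name in feature_names],
    "params": {
        # Only the original score gets a nonzero weight, so the model
        # score is just the original score.
        "weights": {name: (1.0 if name == "origScore" else 0.0)
                    for name in feature_names}
    },
}

print(json.dumps(model, indent=2))
```

Every feature still has to be listed under `features` so that Solr computes and logs it, even though its weight is zero.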
        
To run a query and get back feature values, use the following URL:

    http://localhost:8983/solr/tmdbindex/query?q=%22martial%20arts%22&df=description_t&rq={!ltr%20model=myLinearModel%20efi.query=%27martial%20arts%27}&fl=id,score,[features]
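
Hand-encoding such URLs is error-prone, so here is a minimal sketch that builds an equivalent URL with the standard library. `build_feature_query_url` is a hypothetical helper, and it percent-encodes more characters than the hand-written URL above (for example `{`, `}` and `=` inside `rq`), which Solr should decode to the same parameters.

```python
from urllib.parse import urlencode, quote

def build_feature_query_url(base_url, query, model="myLinearModel", df="description_t"):
    """Build a /query URL that reranks with `model` and logs feature values."""
    params = {
        "q": '"{:s}"'.format(query),  # phrase query
        "df": df,
        "rq": "{{!ltr model={:s} efi.query='{:s}'}}".format(model, query),
        "fl": "id,score,[features]",
    }
    # quote (rather than quote_plus) encodes spaces as %20, as in the URL above
    return base_url + "query?" + urlencode(params, quote_via=quote)

url = build_feature_query_url("http://localhost:8983/solr/tmdbindex/", "martial arts")
print(url)
```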
    
In this notebook, we will extract the features for a given set of queries and write them out in [LETOR format](https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/). The resulting dataset will be used to train a RankLib LambdaMART model.

In [1]:
import json
import os
import random
import requests
import urllib.parse

In [2]:
DATA_DIR = "../../data/tmdb-dataset"
SOLR_URL = "http://localhost:8983/solr/tmdbindex/"

FEATURE_LIST = [
    "origScore", "titleSimTFIDF", "titleSimBM25", "descSimTFIDF", "descSimBM25",
    "docRecency", "isGoHands", "isAniplex", "isThriller", "isForeign",
    "isDrama", "isWar", "isAction", "isComedy", "isMusic", 
    "isRomance", "isAdventure", "isFamily", "isFantasy", "isCrime",
    "isHorror", "isHistory", "isMystery", "isAnimation", "isDocumentary",
    "isWestern"
]
QUERY_LIST = [
    "murder", "musical", "biography", "police", "world war ii",
    "comedy", "superhero", "nazis", "romance", "martial arts",
    "extramarital", "spy", "vampire", "magic", "wedding",
    "sport", "prison", "teacher", "alien", "dystopia"
]

In [3]:
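def rating2label(rating):
    """ convert 0-10 continuous rating to 1-5 categorical labels """
    # clamp so that a rating of exactly 10 maps to 5 rather than 6
    return min(int(rating // 2) + 1, 5)

assert(rating2label(6.4) == 4)
assert(rating2label(9.8) == 5)
assert(rating2label(10.0) == 5)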

In [4]:
feature_name2id = {name: idx + 1 for idx, name in enumerate(FEATURE_LIST)}

assert(feature_name2id["isRomance"] == 16)

In [5]:
def format_letor(doc_id, rating, feat_str, qid, query):
    """ build one LETOR line: label, qid, id:value feature pairs, then a comment """
    label = rating2label(rating)
    feat_pairs = []
    # feat_str is the [features] value from Solr: comma-separated name=value pairs
    for feat_nv in feat_str.split(","):
        feat_name, feat_val = feat_nv.split("=")
        feat_id = str(feature_name2id[feat_name])
        feat_val = float(feat_val)
        # binary is* indicator features are written as 0/1 integers
        if feat_name.startswith("is"):
            feat_val = int(feat_val)
        feat_val = str(feat_val)
        feat_pairs.append(":".join([feat_id, feat_val]))
    return "{:d} qid:{:d} {:s} # docid:{:d} query:{:s}".format(
        label, qid, " ".join(feat_pairs), doc_id, query)

print(format_letor(192143, 4.5, "origScore=9.458602,titleSimTFIDF=0.0,titleSimBM25=0.0,descSimTFIDF=2.3550842,descSimBM25=9.458602,docRecency=0.07054524,isGoHands=0.0,isAniplex=0.0,isThriller=0.0,isForeign=0.0,isDrama=0.0,isWar=0.0,isAction=0.0,isComedy=0.0,isMusic=0.0,isRomance=0.0,isAdventure=0.0,isFamily=0.0,isFantasy=0.0,isCrime=0.0,isHorror=0.0,isHistory=0.0,isMystery=0.0,isAnimation=0.0,isDocumentary=1.0,isWestern=0.0", 1, "biography"))

3 qid:1 1:9.458602 2:0.0 3:0.0 4:2.3550842 5:9.458602 6:0.07054524 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0 21:0 22:0 23:0 24:0 25:1 26:0 # docid:192143 query:biography
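
To sanity-check the generated files, it can be handy to read a LETOR line back into its parts. The following is a minimal sketch of such a reader. `parse_letor_line` is a hypothetical helper that is not part of the notebook, and the sample line is an abbreviated version of the output above.

```python
def parse_letor_line(line):
    """Split one LETOR line into (label, qid, {feature_id: value}, comment)."""
    # Everything after the first '#' is a free-text comment.
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    qid = int(tokens[1].split(":", 1)[1])
    features = {}
    for token in tokens[2:]:
        feat_id, feat_val = token.split(":", 1)
        features[int(feat_id)] = float(feat_val)
    return label, qid, features, comment.strip()

sample_line = "3 qid:1 1:9.458602 2:0.0 25:1 26:0 # docid:192143 query:biography"
parsed_label, parsed_qid, parsed_feats, parsed_comment = parse_letor_line(sample_line)
print(parsed_label, parsed_qid, parsed_feats, parsed_comment)
```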


In [6]:
# shuffle the queries, then split them 12/3/5 into train/val/test
random.shuffle(QUERY_LIST)
train_queries = QUERY_LIST[0:12]
val_queries = QUERY_LIST[12:15]
test_queries = QUERY_LIST[15:]
feat_suffixes = ["train", "val", "test"]
qid = 1  # query ids run consecutively across all three splits
for qt_idx, queries in enumerate([train_queries, val_queries, test_queries]):
    out_path = os.path.join(DATA_DIR, "solr_features_{:s}.txt".format(feat_suffixes[qt_idx]))
    with open(out_path, "w") as fletor:
        for query in queries:
            print("generating feature for {:s} ({:s})".format(query, feat_suffixes[qt_idx]))
            # multi-word queries are sent as phrase queries
            if len(query.split()) > 1:
                query = "\"" + query + "\""
            payload = {
                "q": query,
                "defType": "edismax",
                "qf": "title_t description_t",
                "pf": "title_t description_t",
                "mm": 2,
                "rq": "{!ltr model=myLinearModel efi.query=" + query + "}",
                "fl": "id,rating_f,[features]",
                "rows": 100
            }
            params = urllib.parse.urlencode(payload, quote_via=urllib.parse.quote_plus)
            search_url = SOLR_URL + "select?" + params
            resp = requests.get(search_url)
            resp.raise_for_status()  # fail early on an HTTP error instead of a KeyError below
            resp_json = json.loads(resp.text)
            for doc in resp_json["response"]["docs"]:
                doc_id = int(doc["id"])
                rating = doc["rating_f"]
                feat_str = doc["[features]"]
                fletor.write("{:s}\n".format(format_letor(doc_id, rating, feat_str, qid, query)))
            qid += 1
print("number of queries, train {:d}, test {:d}, validation {:d}".format(
    len(train_queries), len(test_queries), len(val_queries)))

generating feature for nazis (train)
generating feature for musical (train)
generating feature for comedy (train)
generating feature for superhero (train)
generating feature for world war ii (train)
generating feature for martial arts (train)
generating feature for dystopia (train)
generating feature for biography (train)
generating feature for prison (train)
generating feature for wedding (train)
generating feature for teacher (train)
generating feature for spy (train)
generating feature for romance (val)
generating feature for magic (val)
generating feature for alien (val)
generating feature for extramarital (test)
generating feature for vampire (test)
generating feature for murder (test)
generating feature for sport (test)
generating feature for police (test)
number of queries, train 12, test 5, validation 3
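
Because the shuffle above is unseeded, each run assigns queries to different splits. If reproducibility matters, one option is a seeded split such as the sketch below. `split_queries` and the seed value are assumptions for illustration, not what the notebook ran.

```python
import random

def split_queries(queries, seed=42, n_train=12, n_val=3):
    """Shuffle a copy of `queries` with a fixed seed and slice it into three splits."""
    shuffled = list(queries)  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

all_queries = [
    "murder", "musical", "biography", "police", "world war ii",
    "comedy", "superhero", "nazis", "romance", "martial arts",
    "extramarital", "spy", "vampire", "magic", "wedding",
    "sport", "prison", "teacher", "alien", "dystopia"
]
train_q, val_q, test_q = split_queries(all_queries)
print(len(train_q), len(val_q), len(test_q))
```

Using a private `random.Random` instance keeps the split independent of any other code that touches the global random state.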
