# Judgments and Feature Logging

In this notebook, we cover the first two steps of Learning to Rank. First, we grade some documents as relevant or irrelevant for queries, producing what we call _judgments_. Second, we retrieve some _features_: metadata about each graded document in our judgments. We call the process of extracting the features from Solr _feature logging_.

NOTE: This notebook depends upon TheMovieDB dataset. If you have any issues, please rerun the [Setting up TheMovieDB notebook](1.setup-the-movie-db.ipynb)

In [10]:
import requests
import sys
sys.path.append('..')
from aips import *
engine = get_engine()
tmdb_collection = engine.get_collection("tmdb")

## Omitted from book
A single judgment, grading document 37799 ("The Social Network") as relevant (`grade=1`) for the search query string `social network`.

In [11]:
from ltr.judgments import Judgment

Judgment(grade=1, keywords='social network', doc_id=37799)

Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1)

## Listing 10.3

A slightly larger judgment list. Here two query strings are graded: `social network` and `star wars`. For `social network`, one movie is graded as relevant and three as irrelevant. For `star wars`, two movies are graded as relevant and three as irrelevant.

In [9]:
sample_judgements = [
    # for 'social network' query
    Judgment(1, 'social network', '37799'),  # The Social Network
    Judgment(0, 'social network', '267752'), # #chicagoGirl
    Judgment(0, 'social network', '38408'),  # Life As We Know It
    Judgment(0, 'social network', '28303'),  # The Cheyenne Social Club
    
    # for 'star wars' query
    Judgment(1, 'star wars', '11'),     # Star Wars
    Judgment(1, 'star wars', '1892'),   # Return of the Jedi
    Judgment(0, 'star wars', '54138'),  # Star Trek Into Darkness
    Judgment(0, 'star wars', '85783'),  # The Star
    Judgment(0, 'star wars', '325553')  # Battlestar Galactica
]
sample_judgements

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]
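As a quick sanity check on a judgment list like the one above, it can help to tally how many documents received each grade per query. The sketch below is self-contained and uses a hypothetical `SimpleJudgment` namedtuple as a stand-in for `ltr.judgments.Judgment`; it is not part of the `ltr` library.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for ltr.judgments.Judgment, holding only the fields used here.
SimpleJudgment = namedtuple("SimpleJudgment", ["grade", "keywords", "doc_id"])

judgments = [
    SimpleJudgment(1, "social network", "37799"),
    SimpleJudgment(0, "social network", "267752"),
    SimpleJudgment(0, "social network", "38408"),
    SimpleJudgment(0, "social network", "28303"),
    SimpleJudgment(1, "star wars", "11"),
    SimpleJudgment(1, "star wars", "1892"),
    SimpleJudgment(0, "star wars", "54138"),
    SimpleJudgment(0, "star wars", "85783"),
    SimpleJudgment(0, "star wars", "325553"),
]

# Count (keywords, grade) pairs to see how many documents got each grade per query.
grade_counts = Counter((j.keywords, j.grade) for j in judgments)

for (keywords, grade), count in sorted(grade_counts.items()):
    print(f"{keywords!r}: grade={grade} -> {count} document(s)")
```

A query with no relevant documents, or with only relevant ones, gives a ranking model nothing to compare, so a tally like this is a cheap way to spot such queries early.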

None of our judgments have features yet, as the first one demonstrates.

In [12]:
sample_judgements[0].features

[]

## Listing 10.4

Create a feature set with three features. The first retrieves the relevance score of the search string in the `title` field (hence `title:(${keywords})`), the second does the same for `overview`, and the third is simply the `release_year` of the movie.

We create a feature store named `movies` in Solr. We'll use the feature store name as a handle when we want to log features further down.

In [13]:
engine.delete_feature_store(tmdb_collection, "movies")

feature_set = [
    {
      "name": "title_bm25",
      "store": "movies",
      "class": "org.apache.solr.ltr.feature.SolrFeature",
      "params": {"q": "title:(${keywords})"}
    },
    {
      "name": "overview_bm25",
      "store": "movies",
      "class": "org.apache.solr.ltr.feature.SolrFeature",
      "params": {"q": "overview:(${keywords})"}
    },
    {
      "name": "release_year",
      "store": "movies",
      "class": "org.apache.solr.ltr.feature.SolrFeature",
      "params": {"q": "{!func}release_year"}
    }
]

resp = engine.create_feature_store(tmdb_collection, feature_set)
resp.json()

{'responseHeader': {'status': 0, 'QTime': 3}}
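To visualize what the `${keywords}` placeholder in the feature definitions stands for, the sketch below fills it in with Python's `string.Template`. This is only an illustration of the substitution; Solr performs the real substitution on the server at query time, and the `expand` helper is a hypothetical name introduced here.

```python
from string import Template

# The "q" params from the feature definitions above.
feature_queries = {
    "title_bm25": "title:(${keywords})",
    "overview_bm25": "overview:(${keywords})",
    "release_year": "{!func}release_year",
}


def expand(query_template, **values):
    """Illustrative only: fill ${...} placeholders with the supplied values."""
    return Template(query_template).safe_substitute(**values)


expanded = {
    name: expand(query, keywords="social network")
    for name, query in feature_queries.items()
}

for name, query in expanded.items():
    print(f"{name}: {query}")
```

Note that `release_year` has no placeholder, so it comes out unchanged: its value depends only on the document, not on the query.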

## Listing 10.5

Recall we have one relevant and three irrelevant movies for `social network`. Here we retrieve all three features created above for each of the four movies. The special `[features ...]` entry in `fl` tells Solr to append to each document the features from the `movies` feature store, evaluated with the template parameter `efi.keywords="social network"`.

In [15]:
import requests

logging_solr_query = {
    "q": "id:37799 OR id:267752 OR id:38408 OR id:28303", #social network graded documents
    "fl": 'id,title,[features store=movies efi.keywords="social network"]',
    "rows": 10,
    "wt": "json"  
}
response = tmdb_collection.search(data=logging_solr_query)
response

{'responseHeader': {'zkConnected': True,
  'status': 0,
  'QTime': 11,
  'params': {'q': 'id:37799 OR id:267752 OR id:38408 OR id:28303',
   'fl': 'id,title,[features store=movies efi.keywords="social network"]',
   'rows': '10',
   'wt': 'json'}},
 'response': {'numFound': 4,
  'start': 0,
  'numFoundExact': True,
  'docs': [{'id': '38408',
    'title': 'Life As We Know It',
    '[features]': 'title_bm25=0.0,overview_bm25=4.353118,release_year=2010.0'},
   {'id': '28303',
    'title': 'The Cheyenne Social Club',
    '[features]': 'title_bm25=3.4286604,overview_bm25=3.1086721,release_year=1970.0'},
   {'id': '37799',
    'title': 'The Social Network',
    '[features]': 'title_bm25=8.243603,overview_bm25=3.8143613,release_year=2010.0'},
   {'id': '267752',
    'title': '#chicagoGirl',
    '[features]': 'title_bm25=0.0,overview_bm25=6.0172443,release_year=2013.0'}]}}
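The logging query above hard-codes the graded document ids. A minimal sketch of building the same request dictionary from a list of ids follows; `build_logging_query` is a hypothetical helper, not part of the `aips` or `ltr` packages, and it assumes the keywords contain no double quotes (those would need escaping).

```python
def build_logging_query(doc_ids, keywords, store="movies", rows=None):
    """Build a feature-logging request for the given graded document ids."""
    # Restrict the result set to exactly the graded documents.
    id_clause = " OR ".join(f"id:{doc_id}" for doc_id in doc_ids)
    return {
        "q": id_clause,
        "fl": f'id,title,[features store={store} efi.keywords="{keywords}"]',
        # Request at least one row per graded document so none is cut off.
        "rows": rows if rows is not None else len(doc_ids),
        "wt": "json",
    }


query = build_logging_query(
    ["37799", "267752", "38408", "28303"], "social network", rows=10
)
print(query)
```

The resulting dictionary matches the hand-written one in the listing, so it could be passed to `tmdb_collection.search(data=query)` in the same way.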

## Omitted from book - parse features Solr response

**The following code is used to generate the listings below, but the parsing code itself is omitted from the book.**

This code simply looks at the Solr response and populates each judgment's feature vector with the `[features]` value from the corresponding Solr document.

In [None]:
def populate_features_for_qid(qid, docs, judgements):
    doc_id_to_features = {}

    # Map Doc Id => Features
    for doc in docs:
        # Parse the "[features]" string, e.g.
        # "title_bm25=0.0,overview_bm25=13.237938,vote_average=7.0"
        features = doc["[features]"]
        features = features.split(",")
        features = [float(ftr.split("=")[1]) for ftr in features]

        doc_id_to_features[doc["id"]] = features

    # Save in correct judgment
    for judgment in judgements:
        if judgment.qid == qid:
            try:
                judgment.features = doc_id_to_features[judgment.doc_id]
            except KeyError:
                pass
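The parser above keeps only the feature values and relies on Solr returning them in feature-store order. When debugging, it can be handy to keep the names too. The following is a small sketch of that variant; `parse_features` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def parse_features(raw):
    """Turn 'name=value,name=value' into an ordered {name: float} dict."""
    features = {}
    for pair in raw.split(","):
        # partition splits on the first "=" only, leaving the value intact.
        name, _, value = pair.partition("=")
        features[name] = float(value)
    return features


parsed = parse_features(
    "title_bm25=8.243603,overview_bm25=3.8143613,release_year=2010.0"
)
print(parsed)
print(list(parsed.values()))
```

Because Python dictionaries preserve insertion order, `list(parsed.values())` gives the same vector that `populate_features_for_qid` stores on the judgment.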


## Listing 10.6 (output)

Listing 10.6 is the output of the following: the result of logging features for `social network` only.

In [None]:
populate_features_for_qid(1, engine.docs_from_response(response), sample_judgements)
sample_judgements

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[8.243603, 3.8143613, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 6.0172443, 2013.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 4.353118, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[3.4286604, 3.1086721, 1970.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]

## Listing 10.7 (output)

Listing 10.7 is the output of the following, which adds features parsed from the `star wars` movies to our judgment list.

In [None]:
logging_solr_query = {
    "fl": 'id,title,[features store=movies efi.keywords="star wars"]',
    'q': "id:11 OR id:1892 OR id:54138 OR id:85783 OR id:325553", #star wars graded documents
    'rows': 10,
    'wt': 'json'  
}

response = tmdb_collection.search(data=logging_solr_query)
documents = engine.docs_from_response(response)
populate_features_for_qid(2, documents, sample_judgements)
sample_judgements

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[8.243603, 3.8143613, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 6.0172443, 2013.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 4.353118, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[3.4286604, 3.1086721, 1970.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[6.7963624, 0.0, 1977.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[0.0, 1.9681965, 1983.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[2.444128, 0.0, 2013.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[3.1871135, 0.0, 1952.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[0.0, 0.0, 2003.0],weight=1)]

## Loading / logging training set (omitted from book)

The following downloads a larger judgment list, parses it, and logs features for each graded document. It repeats the full logging workflow from this notebook in a single loop.

In [16]:
from ltr.log import FeatureLogger
from ltr.judgments import judgments_open
from itertools import groupby
from ltr import download

ftr_logger = FeatureLogger(engine, tmdb_collection, feature_set="movies")

with judgments_open("data/ai_pow_search_judgments.txt") as judgements:
    for qid, query_judgments in groupby(judgements, key=lambda j: j.qid):
        ftr_logger.log_for_qid(query_judgments, qid, judgements.keywords(qid))
    
ftr_logger.logged

AttributeError: 'dict' object has no attribute 'json'
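The loop above relies on `itertools.groupby`, which only groups *consecutive* items that share a key. Whether `judgments_open` yields judgments already ordered by `qid` is not stated here, so the sketch below simply shows how `groupby` behaves on unsorted versus sorted input, using a hypothetical `Row` namedtuple as a stand-in for a judgment.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a judgment; only qid and doc_id matter here.
Row = namedtuple("Row", ["qid", "doc_id"])

rows = [Row(1, "37799"), Row(2, "11"), Row(1, "267752"), Row(2, "1892")]


def by_qid(row):
    return row.qid


# Unsorted input: each change of qid starts a new group, so qids repeat.
unsorted_groups = [
    (qid, [r.doc_id for r in group]) for qid, group in groupby(rows, key=by_qid)
]

# Sorted input: one group per qid (sorted is stable, so doc order is kept).
sorted_groups = [
    (qid, [r.doc_id for r in group])
    for qid, group in groupby(sorted(rows, key=by_qid), key=by_qid)
]

print(unsorted_groups)
print(sorted_groups)
```

If the judgments for one query were scattered through the file, sorting by `qid` first would keep them in a single logging request per query.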

## Next up, prep data for training

We now have a dataset extracted from the search engine. Next we need to manipulate the training data. This manipulation turns our slightly strange-looking ranking problem into one that looks more like any other boring machine learning problem.

Up next: [Feature Normalization and Pairwise Transform](3.pairwise-transform.ipynb)