# Judgments and Feature Logging

In this notebook, we cover the first two steps of Learning to Rank. First, we grade some documents as relevant or irrelevant for queries; we call these grades _judgments_. Second, we retrieve some _features_: metadata about each graded document in our judgments. We call the process of extracting document features _feature logging_.

NOTE: This notebook depends upon TheMovieDB dataset. If you have any issues, please rerun the [Setting up TheMovieDB notebook](1.setup-the-movie-db.ipynb)

In [1]:
import requests
import json
import sys
sys.path.append('../..')
from aips import *
set_engine("opensearch")
engine = get_engine()
print(type(engine))
tmdb_collection = engine.get_collection("tmdb")
ltr = get_ltr_engine(tmdb_collection)
print(type(ltr))

<class 'engines.opensearch.OpenSearchEngine.OpenSearchEngine'>
<class 'engines.opensearch.OpenSearchLTR.OpenSearchLTR'>


## Omitted from book
A single judgment, grading document 37799 ("The Social Network") as relevant (`grade=1`) for the search query string `social network`.

In [2]:
from ltr.judgments import Judgment

Judgment(grade=1, keywords='social network', doc_id=37799)

Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1)
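
To make the fields in that output concrete, here is a minimal stand-in written as a plain dataclass. `SketchJudgment` is a hypothetical name used only for illustration; it is not the real `ltr.judgments.Judgment` class, and it does not model how the real class assigns `qid` values.

```python
from dataclasses import dataclass, field

@dataclass
class SketchJudgment:
    """Hypothetical stand-in for ltr.judgments.Judgment (illustration only)."""
    grade: int       # 1 = relevant, 0 = irrelevant
    keywords: str    # the query string being graded
    doc_id: str      # id of the graded document
    qid: int = 1     # numeric id shared by all judgments for one query (assumed default)
    features: list = field(default_factory=list)  # stays empty until features are logged
    weight: int = 1

j = SketchJudgment(grade=1, keywords="social network", doc_id="37799")
print(j)
print(j.features)  # nothing logged yet, so this is an empty list
```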

## Listing 10.3

A slightly bigger judgment list. Here two query strings are graded: `social network` and `star wars`. For `social network`, one movie is graded as relevant and three as irrelevant. For `star wars`, two movies are graded as relevant and three as irrelevant.

In [3]:
sample_judgments = [
    # for 'social network' query
    Judgment(1, 'social network', '37799'),  # The Social Network
    Judgment(0, 'social network', '267752'), # #chicagoGirl
    Judgment(0, 'social network', '38408'),  # Life As We Know It
    Judgment(0, 'social network', '28303'),  # The Cheyenne Social Club
    
    # for 'star wars' query
    Judgment(1, 'star wars', '11'),     # star wars
    Judgment(1, 'star wars', '1892'),   # return of jedi
    Judgment(0, 'star wars', '54138'),  # Star Trek Into Darkness
    Judgment(0, 'star wars', '85783'),  # The Star
    Judgment(0, 'star wars', '325553')  # Battlestar Galactica
]
sample_judgments

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]
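
As a quick sanity check on a judgment list like this one, it can help to tally relevant and irrelevant grades per query. The sketch below uses plain `(grade, keywords, doc_id)` tuples that mirror the sample judgments, so it runs without the `ltr` package; the variable names are illustrative assumptions.

```python
from collections import defaultdict

# (grade, keywords, doc_id) tuples mirroring the sample judgments above
rows = [
    (1, "social network", "37799"),
    (0, "social network", "267752"),
    (0, "social network", "38408"),
    (0, "social network", "28303"),
    (1, "star wars", "11"),
    (1, "star wars", "1892"),
    (0, "star wars", "54138"),
    (0, "star wars", "85783"),
    (0, "star wars", "325553"),
]

# Count relevant (grade 1) and irrelevant (grade 0) documents per query
counts = defaultdict(lambda: {"relevant": 0, "irrelevant": 0})
for grade, keywords, _doc_id in rows:
    label = "relevant" if grade == 1 else "irrelevant"
    counts[keywords][label] += 1

for keywords, tally in counts.items():
    print(keywords, tally)
```

A query with no relevant documents, or no irrelevant ones, gives a ranking model nothing to contrast, so a tally like this is a cheap way to spot such queries early.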

The following demonstrates that none of our judgments have features yet.

In [4]:
sample_judgments[0].features

[]

## Listing 10.4

Create a feature set. The first feature retrieves the relevance score of the search string in the `title` field (a `match` query on `title` using the `{{keywords}}` parameter), the second does the same for `overview`, and the third is simply the `release_year` of the movie.

We create a feature store named `movie_model` in our search engine. We'll use the feature store name as a handle when we want to log features further down.

In [5]:
ltr.delete_feature_store("movies")

feature_set = [
    ltr.generate_query_feature(feature_name="title_bm25", field_name="title"),
    ltr.generate_query_feature(feature_name="overview_bm25", field_name="overview"),
    ltr.generate_field_value_feature(feature_name="release_year", field_name="release_year")]

ltr.upload_features(features=feature_set, model_name="movie_model", log=True)

print(json.dumps(feature_set, indent=2))

'Uploading Features movie_model'

{'_index': '.ltrstore',
 '_id': 'featureset-movie_model',
 '_version': 1,
 'result': 'created',
 'forced_refresh': True,
 '_shards': {'total': 1, 'successful': 1, 'failed': 0},
 '_seq_no': 1,
 '_primary_term': 1}

[
  {
    "name": "title_bm25",
    "template": {
      "match": {
        "title": "{{keywords}}"
      }
    },
    "params": [
      "keywords"
    ]
  },
  {
    "name": "overview_bm25",
    "template": {
      "match": {
        "overview": "{{keywords}}"
      }
    },
    "params": [
      "keywords"
    ]
  },
  {
    "name": "release_year",
    "template": {
      "function_score": {
        "functions": [
          {
            "filter": {
              "bool": {
                "must": [
                  {
                    "exists": {
                      "field": "release_year"
                    }
                  }
                ]
              }
            },
            "field_value_factor": {
              "field": "release_year",
              "missing": 0
            }
          }
        ]
      }
    },
    "params": [
      "keywords"
    ]
  }
]
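
Each template above carries a `{{keywords}}` placeholder that is filled in with the query string when features are logged. The sketch below illustrates that substitution idea on a plain dictionary. `render_template` is a hypothetical helper written for this illustration; the search engine performs its own template rendering, and this is not its implementation.

```python
def render_template(template, params):
    """Recursively replace {{name}} placeholders in string values (illustration only)."""
    if isinstance(template, dict):
        return {key: render_template(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render_template(value, params) for value in template]
    if isinstance(template, str):
        for name, value in params.items():
            template = template.replace("{{" + name + "}}", value)
        return template
    return template  # numbers, booleans, None are returned unchanged

# The title_bm25 template from the feature set printed above
title_feature = {"match": {"title": "{{keywords}}"}}
rendered = render_template(title_feature, {"keywords": "social network"})
print(rendered)
```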


## Listing 10.5

Recall we have one relevant and three irrelevant movies for `social network`. Here we retrieve all three features created above for each of the four movies.

In [6]:
ids = ["37799", "267752", "38408", "28303"] #social network graded documents
options = {"keywords": "social network"}
response = ltr.get_logged_features("movie_model", ids,
                 options=options, fields=["id", "title"], log=True)
print("Documents:")
print(json.dumps(response, indent="  "))

{'size': 100,
 'query': {'bool': {'must': [{'terms': {'id': ['37799',
       '267752',
       '38408',
       '28303']}}],
   'should': [{'sltr': {'_name': 'logged_featureset',
      'featureset': 'movie_model',
      'params': {'keywords': 'social network'}}}]}},
 'ext': {'ltr_log': {'log_specs': {'name': 'main',
    'named_query': 'logged_featureset',
    'missing_as_zero': True}}}}
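
The request body printed above has three parts: a `terms` filter that restricts results to the graded documents, an `sltr` clause that names the feature set and passes the query parameters, and an `ltr_log` extension that asks for the feature values to be returned. A minimal sketch of assembling that same dictionary follows. `build_logging_request` is a hypothetical helper name; this is not how `get_logged_features` is necessarily implemented.

```python
import json

def build_logging_request(featureset, doc_ids, params, size=100):
    """Assemble a feature-logging request shaped like the one printed above (sketch)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Only the graded documents should come back
                "must": [{"terms": {"id": doc_ids}}],
                # The sltr clause evaluates every feature in the feature set
                "should": [{
                    "sltr": {
                        "_name": "logged_featureset",
                        "featureset": featureset,
                        "params": params,
                    }
                }],
            }
        },
        # Ask for logged feature values; features that do not match are reported as 0
        "ext": {
            "ltr_log": {
                "log_specs": {
                    "name": "main",
                    "named_query": "logged_featureset",
                    "missing_as_zero": True,
                }
            }
        },
    }

request = build_logging_request(
    "movie_model",
    ["37799", "267752", "38408", "28303"],
    {"keywords": "social network"},
)
print(json.dumps(request, indent=2))
```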

{'took': 27,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 4, 'relation': 'eq'},
  'max_score': 2037.5275,
  'hits': [{'_index': 'tmdb',
    '_id': 'hm6o5ZIBCtyA4Ms4dAcI',
    '_score': 2037.5275,
    '_source': {'id': '37799',
     'title': 'The Social Network',
     'overview': 'On a fall night in 2003, Harvard undergrad and computer programming genius Mark Zuckerberg sits down at his computer and heatedly begins working on a new idea. In a fury of blogging and programming, what begins in his dorm room as a small site among friends soon becomes a global social network and a revolution in communication. A mere six years and 500 million friends later, Mark Zuckerberg is the youngest billionaire in history... but for this entrepreneur, success leads to both personal and legal complications.',
     'tagline': "You don't get to 500 million friends without making a few enemies.",
     'directors': ['David Fincher'],


Documents:
[
  {
    "id": "37799",
    "title": "The Social Network",
    "overview": "On a fall night in 2003, Harvard undergrad and computer programming genius Mark Zuckerberg sits down at his computer and heatedly begins working on a new idea. In a fury of blogging and programming, what begins in his dorm room as a small site among friends soon becomes a global social network and a revolution in communication. A mere six years and 500 million friends later, Mark Zuckerberg is the youngest billionaire in history... but for this entrepreneur, success leads to both personal and legal complications.",
    "tagline": "You don't get to 500 million friends without making a few enemies.",
    "directors": [
      "David Fincher"
    ],
    "cast": "Jesse Eisenberg Andrew Garfield Justin Timberlake Armie Hammer Max Minghella Rooney Mara Brenda Song Rashida Jones John Getz David Selby Denise Grayson Douglas Urbanski Joseph Mazzello Wallace Langham Patrick Mapel Dakota Johnson Melise Bryan Ba

## Omitted from book - parse features search engine response

**The following code is used to generate the listings below, but this parsing code is omitted from the book itself.**

Documents returned by the search engine carry their logged features in the `[features]` property.

In [7]:
def populate_features_for_qid(qid, docs, judgments):
    doc_id_to_features = {}

    # Map Doc Id => Features
    for doc in docs:
        features = doc["[features]"]
        doc_id_to_features[doc["id"]] = list(features.values())

    # Save in correct judgment
    for judgment in judgments:
        if judgment.qid == qid:
            try:
                judgment.features = doc_id_to_features[judgment.doc_id]
            except KeyError:
                pass
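
To see what this function does without a running search engine, the sketch below applies a condensed copy of it to mock data. The document shape (a `[features]` dictionary keyed by feature name) is an assumption inferred from the parsing code, and `SimpleNamespace` objects stand in for real `Judgment` instances.

```python
from types import SimpleNamespace

def populate_features_for_qid(qid, docs, judgments):
    # Condensed copy of the function above so this sketch runs on its own
    doc_id_to_features = {doc["id"]: list(doc["[features]"].values()) for doc in docs}
    for judgment in judgments:
        if judgment.qid == qid and judgment.doc_id in doc_id_to_features:
            judgment.features = doc_id_to_features[judgment.doc_id]

# Mock documents; the "[features]" layout is an assumed shape for illustration
mock_docs = [
    {"id": "37799", "title": "The Social Network",
     "[features]": {"title_bm25": 18.135925, "overview_bm25": 8.391596, "release_year": 2010.0}},
    {"id": "28303", "title": "The Cheyenne Social Club",
     "[features]": {"title_bm25": 7.5430527, "overview_bm25": 6.839079, "release_year": 1970.0}},
]

judgments = [
    SimpleNamespace(qid=1, doc_id="37799", features=[]),
    SimpleNamespace(qid=1, doc_id="267752", features=[]),  # missing from mock_docs: stays empty
    SimpleNamespace(qid=2, doc_id="11", features=[]),      # different query: untouched
]

populate_features_for_qid(1, mock_docs, judgments)
for judgment in judgments:
    print(judgment.doc_id, judgment.features)
```

Note that `list(features.values())` relies on dictionary order, so the feature vector follows whatever order the features appear in the response. Document ids are also compared as-is, so a string id in the response will not match an integer `doc_id` in a judgment.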


## Listing 10.6 (output)

Listing 10.6 is the output of the following: the result of logging features for just the `social network` judgments.

In [8]:
populate_features_for_qid(1, response, sample_judgments)
sample_judgments

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[18.135925, 8.391596, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 13.237938, 2013.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 9.576859, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[7.5430527, 6.839079, 1970.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]

## Listing 10.7 (output)

Listing 10.7 is the output of the following, which adds features parsed from the `star wars` movies to our judgment list.

In [9]:
ids = ["11", "1892", "54138", "85783", "325553"] # star wars graded documents
options = {"keywords": "star wars"}
documents = ltr.get_logged_features("movie_model", ids, options=options, fields=["id", "title"])
populate_features_for_qid(2, documents, sample_judgments)
sample_judgments

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[18.135925, 8.391596, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[0.0, 13.237938, 2013.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[0.0, 9.576859, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[7.5430527, 6.839079, 1970.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[14.951998, 0.0, 1977.0],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[0.0, 4.3300323, 1983.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[5.377082, 0.0, 2013.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[7.01165, 0.0, 1952.0],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[0.0, 0.0, 2003.0],weight=1)]

## Loading / logging training set (omitted from book)

The following downloads a larger judgment list, parses it, and logs features for each graded document. It just repeats the full logging workflow in this notebook but in one loop.

In [10]:
from ltr.log import FeatureLogger
from ltr.judgments import judgments_open
from itertools import groupby

collection = engine.get_collection("tmdb")
ftr_logger = FeatureLogger(collection, feature_set="movie_model")

with judgments_open("data/ai_pow_search_judgments.txt") as judgments:
    for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):
        ftr_logger.log_for_qid(query_judgments, qid, judgments.keywords(qid))
    
ftr_logger.logged

Duplicate docs in for query id 67: {'2503'}
Missing doc 225130 with error
Missing doc 37106 with error
Duplicate docs in for query id 74: {'11852'}
Missing doc 61919 with error
Missing doc 67479 with error
Missing doc 17882 with error
Duplicate docs in for query id 95: {'17431'}
Duplicate docs in for query id 98: {'1830'}
Parsing QID 100
Duplicate docs in for query id 99: {'9799'}
Missing doc 61920 with error


[Judgment(grade=1,qid=1,keywords=rambo,doc_id=7555,features=[13.038148, 11.173398, 2008.0],weight=1),
 Judgment(grade=1,qid=1,keywords=rambo,doc_id=1370,features=[11.056428, 12.652582, 1988.0],weight=1),
 Judgment(grade=1,qid=1,keywords=rambo,doc_id=1369,features=[7.593794, 10.758981, 1985.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=13258,features=[0.0, 10.096009, 2007.0],weight=1),
 Judgment(grade=1,qid=1,keywords=rambo,doc_id=1368,features=[0.0, 11.867074, 1982.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=31362,features=[0.0, 8.33506, 1988.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=61410,features=[0.0, 4.6697874, 2010.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=319074,features=[0.0, 0.0, 2015.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=10296,features=[0.0, 0.0, 2004.0],weight=1),
 Judgment(grade=0,qid=1,keywords=rambo,doc_id=35868,features=[0.0, 0.0, 2001.0],weight=1),
 Judgment(grade=0,qid=1,keywords=ram
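
One detail about the loop above: `itertools.groupby` only groups consecutive items that share a key. The loop therefore presumably relies on the judgment file listing each query's judgments contiguously, which is an assumption about the file rather than something shown here. A small sketch with made-up query ids shows the difference:

```python
from itertools import groupby

# Made-up query ids in which qid 1 reappears after qid 2
qids = [1, 1, 2, 1]

# Without sorting, qid 1 is split into two separate groups
unsorted_groups = [(key, len(list(group))) for key, group in groupby(qids)]

# Sorting first brings all items for a key together
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(qids))]

print(unsorted_groups)
print(sorted_groups)
```

If a judgment list is not already ordered by query id, sorting it by `qid` before grouping avoids logging the same query in several partial batches.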

## Next up, prep data for training

We now have a dataset extracted from the search engine. Next, we need to manipulate the training data a bit. This manipulation turns our slightly strange-looking ranking problem into one that looks more like any other boring machine learning problem.

Up next: [Feature Normalization and Pairwise Transform](3.pairwise-transform.ipynb)