# Direct feature logging

This notebook doesn't rely on a Learning to Rank plugin (such as those available for Elasticsearch, Solr, and OpenSearch).

Instead, it demonstrates how to extract search engine features from sub-queries. Many teams expect to rerank outside the search engine, yet still find it useful to compute features that rely on statistics held in the search engine, such as BM25 scores.

Teams need to implement two bits of functionality:

1. At training time, how do we take training data / judgments, and pull out features from the search engine?
2. At query time, how do we pull through features with the first-pass set of search results we want to rerank?

## Setup

The cells below [set up TheMovieDB](../../chapters/ch10/1.setup-the-movie-db.ipynb) with OpenSearch as the search engine, by calling `get_engine("OPENSEARCH")`.

In [None]:
from aips import get_engine, set_engine, get_ltr_engine, indexer
engine = get_engine("OPENSEARCH")
engine

<engines.opensearch.OpenSearchEngine.OpenSearchEngine at 0xffffa7504dc0>

In [2]:
indexer.download_data_files("tmdb") # -> Holds "tmdb.json", big json dict with corpus
indexer.download_data_files("judgments") # -> Holds "ai_pow_search_judgments.txt", which is our labeled judgment list
tmdb_collection = indexer.build_collection(engine, "tmdb")

## Feature logging time

Below is the sort of "nightly job" that takes a set of search judgments and hydrates them with the feature data needed for training.

In [3]:
from ltr.judgments import Judgment


sample_judgments = [
    # for 'social network' query
    Judgment(1, 'social network', '37799'),  # The Social Network
    Judgment(0, 'social network', '267752'), # #chicagoGirl
    Judgment(0, 'social network', '38408'),  # Life As We Know It
    Judgment(0, 'social network', '28303'),  # The Cheyenne Social Club
    
    # for 'star wars' query
    Judgment(1, 'star wars', '11'),     # star wars
    Judgment(1, 'star wars', '1892'),   # return of jedi
    Judgment(0, 'star wars', '54138'),  # Star Trek Into Darkness
    Judgment(0, 'star wars', '85783'),  # The Star
    Judgment(0, 'star wars', '325553')  # Battlestar Galactica
]
sample_judgments

[Judgment(grade=1,qid=1,keywords=social network,doc_id=37799,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=267752,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=38408,features=[],weight=1),
 Judgment(grade=0,qid=1,keywords=social network,doc_id=28303,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=11,features=[],weight=1),
 Judgment(grade=1,qid=2,keywords=star wars,doc_id=1892,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=54138,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=85783,features=[],weight=1),
 Judgment(grade=0,qid=2,keywords=star wars,doc_id=325553,features=[],weight=1)]

### Using 'include_named_queries_score'

In Elasticsearch/OpenSearch, to log features you can list a set of queries, each with a `_name`, and pass `include_named_queries_score` to compute the score of every named query.

* A similar pattern is available in [Solr using pseudo-fields](https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#functions-with-fl)

Below we see how they might be computed.

In [4]:
import requests

def run_opensearch_query(query,
                         url_params=None,
                         url="http://aips-opensearch:9200/tmdb/_search"):
    resp = requests.post(url, json=query, params=url_params)
    return resp.json()


def features_for_keywords(keywords):
    """Return some BM25 features for keywords."""
    return [
        {"match": {
            "title": {
                "query": keywords,
                "_name": "title_bm25"
            }
        }},
        {"match": {
            "overview": {
                "query": keywords,
                "_name": "overview_bm25"
            }
        }}
    ]
    

def log_query(judgments, keywords):
    """Given a set of judgments for keywords, retrieve them, with feature scores."""
    params = {"include_named_queries_score": "true"}
    
    ids = []
    for judgment in judgments:
        if judgment.keywords == keywords:
            ids.append(judgment.doc_id)
            
    query = {
        "query": {
            "bool": {
                # ********
                # First filter down to what's evaluated
                # ...
                "filter": [
                    {"terms": {
                        "id": ids,
                    }}
                ],
                # ********
                # Then list features as named queries in SHOULD clause
                # ...
                "should": features_for_keywords(keywords)
            }
        },
        # ******
        # Sort on something other than the direct score
        # (We don't care about ranking while logging)
        "sort": "_id",
        # ****
        # Ensure we get all the results (in reality you should
        # use your search engine's deep paging capabilities)
        "size": len(ids)
        
    }
        
    resp = run_opensearch_query(query, url_params=params)
    print(resp)
    return resp['hits']['hits']
    


results = log_query(sample_judgments, "star wars")
for result in results:
    matched_queries = {}
    if 'matched_queries' in result:
        matched_queries = result['matched_queries']
    print(result['_source']['id'], result['_source']['title'], matched_queries)

{'took': 6, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 5, 'relation': 'eq'}, 'max_score': None, 'hits': [{'_index': 'tmdb', '_id': '11', '_score': None, '_source': {'id': '11', 'title': 'Star Wars', 'overview': 'Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.', 'tagline': 'A long time ago in a galaxy far, far away...', 'directors': ['George Lucas'], 'cast': "Mark Hamill Harrison Ford Carrie Fisher Peter Cushing Alec Guinness Anthony Daniels Kenny Baker Peter Mayhew David Prowse James Earl Jones Phil Brown Shelagh Fraser Jack Purvis Alex McCrindle Eddie Byrne Drewe Henley Denis Lawson Garrick Hagon Jack Klaff William Hootkins Angus MacInnes Jeremy Sinden
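
The `matched_queries` field of each hit is what carries the feature values. As a minimal sketch (the helper name `named_query_scores` and the sample hits are illustrative, not part of the notebook), the function below reads a fixed list of features out of a single hit. It assumes that with `include_named_queries_score` the field is a mapping from query name to score, as used later in this notebook, and that without the parameter it is a plain list of names, in which case no scores can be read and the default is used.

```python
def named_query_scores(hit, feature_names, default=0.0):
    """Return {feature_name: score} for one search hit.

    Features whose named query did not match the document are absent
    from `matched_queries`, so they fall back to `default`.
    """
    matched = hit.get("matched_queries", {})
    if not isinstance(matched, dict):
        # Assumption: without include_named_queries_score the engine
        # returns only a list of names, so no scores are available.
        matched = {}
    return {name: float(matched.get(name, default)) for name in feature_names}


# Illustrative hits (made-up scores, not real engine output)
example_hits = [
    {"_id": "a", "matched_queries": {"title_bm25": 9.5, "overview_bm25": 2.0}},
    {"_id": "b", "matched_queries": {"overview_bm25": 1.25}},
    {"_id": "c"},  # passed the filter, but no named query matched
]

for hit in example_hits:
    print(hit["_id"], named_query_scores(hit, ["title_bm25", "overview_bm25"]))
```

Iterating over an explicit list of feature names keeps non-feature keys (such as keywords or grades) from ever being looked up in `matched_queries`.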

## Query Time

At query time, we care about two things:

1. Ranking results by the L0 (first-pass) scoring factors we care about
2. Being able to compute feature scores on each document as needed

In [5]:
import json

def l0_query(keywords):
    params = {"include_named_queries_score": "true"}

    # ********
    # Features executed as named queries with boost 0
    # ...

    wrapped = [{
        "bool": {
            "should": features_for_keywords(keywords),
            "boost": 0   # No impact to l0 relevance
        }
    }]
            
    query = {
        "query": {
            "bool": {

                "should": 
                    # ********
                    # First score with l0
                    # ...
                    [
                        {"multi_match": {
                            "fields": ["title^10", "overview"],
                            "query": keywords,
                            "boost": 100,
                            "_name": "l0"
                        }}
                    ]
                    # ****
                    # Add in weight 0 feature queries
                    + wrapped
            }
        },
        # ******
        # Sort on the score, so hits come back in l0 order
        # (this is the first-pass ranking we rerank later)
        "sort": "_score",

        
    }
    print(json.dumps(query, indent=2))
    resp = run_opensearch_query(query, url_params=params)
    return resp['hits']['hits']    
    

l0_query("star wars")

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": [
              "title^10",
              "overview"
            ],
            "query": "star wars",
            "boost": 100,
            "_name": "l0"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "title": {
                    "query": "star wars",
                    "_name": "title_bm25"
                  }
                }
              },
              {
                "match": {
                  "overview": {
                    "query": "star wars",
                    "_name": "overview_bm25"
                  }
                }
              }
            ],
            "boost": 0
          }
        }
      ]
    }
  },
  "sort": "_score"
}


[{'_index': 'tmdb',
  '_id': '11',
  '_score': 14951.998,
  '_source': {'id': '11',
   'title': 'Star Wars',
   'overview': 'Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.',
   'tagline': 'A long time ago in a galaxy far, far away...',
   'directors': ['George Lucas'],
   'cast': "Mark Hamill Harrison Ford Carrie Fisher Peter Cushing Alec Guinness Anthony Daniels Kenny Baker Peter Mayhew David Prowse James Earl Jones Phil Brown Shelagh Fraser Jack Purvis Alex McCrindle Eddie Byrne Drewe Henley Denis Lawson Garrick Hagon Jack Klaff William Hootkins Angus MacInnes Jeremy Sinden Graham Ashley Don Henderson Richard LeParmentier Leslie Schofield Michael Leader David Ankrum Mark Austin Scott Beach Lightning Bear Jon Berg Doug Besw
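
As the comment in the cell above notes, wrapping the feature clauses in a `bool` with `boost` set to 0 means they are evaluated and reported without affecting l0 relevance. A minimal sketch of that composition as a pure function, which makes the query body easy to inspect or unit test without a running engine, might look like the following. The name `build_l0_body` and its parameters are hypothetical; the query structure mirrors the one printed above.

```python
def build_l0_body(l0_clause, feature_clauses):
    """Combine a first-pass (l0) clause with zero-weight feature clauses."""
    zero_weight_features = {
        "bool": {
            "should": list(feature_clauses),
            "boost": 0,  # features are computed but carry no weight
        }
    }
    return {
        "query": {"bool": {"should": [l0_clause, zero_weight_features]}},
        "sort": "_score",
    }


keywords = "star wars"
l0_clause = {"multi_match": {"fields": ["title^10", "overview"],
                             "query": keywords, "boost": 100, "_name": "l0"}}
feature_clauses = [
    {"match": {"title": {"query": keywords, "_name": "title_bm25"}}},
    {"match": {"overview": {"query": keywords, "_name": "overview_bm25"}}},
]
body = build_l0_body(l0_clause, feature_clauses)
```

Keeping construction separate from the HTTP call also lets the same feature clauses be reused for both logging and query-time scoring.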

## Directly log all judgments

In [6]:
from ltr.judgments import judgments_open
import pandas as pd

# ****
# Read every judgment, collecting keywords and a grade lookup
all_keywords = set()
all_judgments = []
grade_lookup = {}
with judgments_open("data/ai_pow_search_judgments.txt") as judgments:
    for judgment in judgments:
        all_keywords.add(judgment.keywords)
        all_judgments.append(judgment)
        grade_lookup[(judgment.keywords, judgment.doc_id)] = judgment.grade

# ****
# Log features for each query's judged documents
all_logged = []
for keywords in all_keywords:
    resp = log_query(all_judgments, keywords)
    
    for hit in resp:
        doc_id = hit['_source']['id']
        ftr = {
            "keywords": keywords,
            "grade": grade_lookup[keywords, doc_id],
            "title_bm25": 0,
            "overview_bm25": 0
        }
        
        for ftr_name in ftr:
            try:
                ftr[ftr_name] = hit['matched_queries'][ftr_name]
            except KeyError:
                pass
        all_logged.append(ftr)

logged_features = pd.DataFrame(all_logged).sort_values('keywords', ascending=True)
logged_features

Parsing QID 100
{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 40, 'relation': 'eq'}, 'max_score': None, 'hits': [{'_index': 'tmdb', '_id': '10744', '_score': None, '_source': {'id': '10744', 'title': 'The Cooler', 'overview': 'Bernie works at a Las Vegas casino, where he uses his innate ability to bring about misfortune in those around him to jinx gamblers into losing. His imposing boss, Shelly Kaplow, is happy with the arrangement. But Bernie finds unexpected happiness when he begins dating attractive waitress Natalie Belisario.', 'tagline': 'When your life depends on losing... the last thing you need is lady luck.', 'directors': ['Wayne Kramer'], 'cast': 'William H. Macy Alec Baldwin Maria Bello Shawn Hatosy Paul Sorvino Estella Warren Ron Livingston Joey Fatone Arthur J. Nascarella M.C. Gainey Ellen Greene Don Scribner Tony Longo', 'genres': ['Drama', 'Romance'], 'release_date': '2003-01-17', 'release

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 41, 'relation': 'eq'}, 'max_score': None, 'hits': [{'_index': 'tmdb', '_id': '11002', '_score': None, '_source': {'id': '11002', 'title': 'Greystoke: The Legend of Tarzan, Lord of the Apes', 'overview': "A shipping disaster in the 19th Century has stranded a man and woman in the wilds of Africa. The lady is pregnant, and gives birth to a son in their tree house. Soon after, a family of apes stumble across the house and in the ensuing panic, both parents are killed. A female ape takes the tiny boy as a replacement for her own dead infant, and raises him as her son. Twenty years later, Captain Phillippe D'Arnot discovers the man who thinks he is an ape. Evidence in the tree house leads him to believe that he is the direct descendant of the Earl of Greystoke, and thus takes it upon himself to return the man to civilization.", 'tagline': '', 'directors': ['Hugh H

| | keywords | grade | title_bm25 | overview_bm25 |
|---|---|---|---|---|
| 1165 | 300 | 0 | 0.000000 | 0.000000 |
| 1168 | 300 | 0 | 0.000000 | 0.000000 |
| 1167 | 300 | 0 | 0.000000 | 0.000000 |
| 1166 | 300 | 0 | 0.000000 | 0.000000 |
| 1164 | 300 | 0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 188 | wolf wall street | 0 | 15.531824 | 0.000000 |
| 189 | wolf wall street | 0 | 0.000000 | 0.000000 |
| 210 | wolf wall street | 0 | 0.000000 | 0.000000 |
| 223 | wolf wall street | 0 | 0.000000 | 11.612869 |
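
The loop above passes the full judgment list to `log_query` for every keyword, so each call rescans all judgments. For larger judgment lists, one option is to group document ids by keywords once, up front. The sketch below uses a hypothetical `SimpleJudgment` named tuple as a stand-in for `ltr.judgments.Judgment`, relying only on the `grade`, `keywords`, and `doc_id` attributes seen earlier.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in exposing the attributes used in this notebook
SimpleJudgment = namedtuple("SimpleJudgment", ["grade", "keywords", "doc_id"])


def group_by_keywords(judgments):
    """Map each keywords string to {doc_id: grade} in a single pass."""
    grouped = defaultdict(dict)
    for judgment in judgments:
        grouped[judgment.keywords][judgment.doc_id] = judgment.grade
    return dict(grouped)


sample = [
    SimpleJudgment(1, "social network", "37799"),
    SimpleJudgment(0, "social network", "267752"),
    SimpleJudgment(1, "star wars", "11"),
    SimpleJudgment(0, "star wars", "54138"),
]
grouped = group_by_keywords(sample)
for keywords, grades in grouped.items():
    print(keywords, sorted(grades))
```

The resulting mapping doubles as the grade lookup, and its per-query keys are exactly the ids needed for the `terms` filter.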


## Train a linear model on features

Train a naive model that tries to predict the grade from `title_bm25` and `overview_bm25`.

In [7]:
from sklearn.linear_model import LinearRegression

# Define features (X) and target (y)
X = logged_features[["title_bm25", "overview_bm25"]]
y = logged_features["grade"]

# Fit linear regression model
model = LinearRegression()
model.fit(X, y)

### Weights of each feature 

Notice the coefficients (first title, then overview) for predicting `grade`:

In [8]:
model.coef_

array([0.02293156, 0.00783086])
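
A linear model's prediction is the intercept plus the weighted sum of the features, so each coefficient can be read as the score gained per unit of BM25. The sketch below applies the printed coefficients by hand in plain Python. The helper `linear_score` and the feature values are illustrative, and the intercept is left as a parameter because the notebook does not print `model.intercept_`; it shifts every score equally, so it does not affect the ordering.

```python
# Coefficients as printed by model.coef_ above (title first, then overview)
WEIGHTS = {"title_bm25": 0.02293156, "overview_bm25": 0.00783086}


def linear_score(features, weights=WEIGHTS, intercept=0.0):
    """Intercept plus the weighted sum of the feature values."""
    return intercept + sum(weights[name] * features.get(name, 0.0)
                           for name in weights)


# Hypothetical feature values for two documents
strong_title = {"title_bm25": 10.0, "overview_bm25": 0.0}
strong_overview = {"title_bm25": 0.0, "overview_bm25": 10.0}

print(linear_score(strong_title))
print(linear_score(strong_overview))
```

With these weights, a title match is worth roughly three times as much as an overview match with the same BM25 score.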

## Reranking

We take our l0 first-pass results and score them with the model. Sorting by `rerank_score` then gives the reranked order.

In [9]:
# hits = l0_query("the force awakens")
hits = l0_query("wonderful life")

all_ftrs = []
for hit in hits:
    ftrs = {"title_bm25": 0.0,
            "overview_bm25": 0.0,
            "title": hit['_source']['title']}
    for ftr_name in ftrs:
        try: 
            ftrs[ftr_name] = hit['matched_queries'][ftr_name]
        except KeyError:
            pass
    all_ftrs.append(ftrs)
    
l0_results = pd.DataFrame(all_ftrs)
l0_results['rerank_score'] = model.predict(l0_results[['title_bm25', 'overview_bm25']])
l0_results

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "fields": [
              "title^10",
              "overview"
            ],
            "query": "wonderful life",
            "boost": 100,
            "_name": "l0"
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "title": {
                    "query": "wonderful life",
                    "_name": "title_bm25"
                  }
                }
              },
              {
                "match": {
                  "overview": {
                    "query": "wonderful life",
                    "_name": "overview_bm25"
                  }
                }
              }
            ],
            "boost": 0
          }
        }
      ]
    }
  },
  "sort": "_score"
}


| | title_bm25 | overview_bm25 | title | rerank_score |
|---|---|---|---|---|
| 0 | 14.407387 | 0.0 | Wonderful Life | 0.360889 |
| 1 | 12.506475 | 0.0 | Isn't Life Wonderful | 0.317298 |
| 2 | 11.048712 | 2.032398 | It's a Wonderful Life | 0.299785 |
| 3 | 8.795404 | 0.0 | Mr. Wonderful | 0.232198 |
| 4 | 8.795404 | 1.629939 | Wonderful World | 0.244961 |
| 5 | 8.795404 | 1.389229 | Wonderful Summer | 0.243076 |
| 6 | 8.795404 | 0.0 | Wonderful Town | 0.232198 |
| 7 | 8.795404 | 2.618285 | Wilby Wonderful | 0.252701 |
| 8 | 8.186155 | 2.338284 | The Wonderful, Horrible Life of Leni Riefenstahl | 0.236537 |
| 9 | 7.634939 | 0.0 | Having Wonderful Time | 0.205586 |
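
The rows above are still in l0 order. To get the reranked list, sort by `rerank_score` in descending order. With the DataFrame from the cell above that would be `l0_results.sort_values('rerank_score', ascending=False)`; the plain-Python sketch below does the same on a few rows copied from the output, so it runs without the search engine.

```python
# (title, rerank_score) pairs copied from the output above, in l0 order
l0_rows = [
    ("Wonderful Life", 0.360889),
    ("Isn't Life Wonderful", 0.317298),
    ("It's a Wonderful Life", 0.299785),
    ("Mr. Wonderful", 0.232198),
    ("Wonderful World", 0.244961),
    ("Wilby Wonderful", 0.252701),
]

# Highest model score first; sorted() is stable, so ties keep l0 order
reranked = sorted(l0_rows, key=lambda row: row[1], reverse=True)
for rank, (title, score) in enumerate(reranked, start=1):
    print(rank, title, score)
```

Here the overview signal lifts "Wilby Wonderful" and "Wonderful World" above "Mr. Wonderful", even though all three share the same title score.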
