# Pretrained cross encoders

In this notebook, we see how cross encoders accurately predict the relevance of keywords relative to a passage in the movie dataset.

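Cross encoders measure query-document relevance late in the ranking process: they score each query-passage pair directly, typically as a reranking step over candidates that have already been retrieved.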

A bi-encoder encodes the query and the document independently into vectors, so that a query and a relevant passage end up with a higher cosine similarity.

A cross encoder, however, is trained on a different task: it classifies whether two pieces of text are relevant to each other. The first is a question or query; the second is the passage being ranked.

In both cases the BERT / Transformer "lego piece" is in place, but the models are trained on different tasks: one to produce embeddings, the other to act as a classifier.
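
To make the bi-encoder side concrete, here is a minimal sketch of the cosine-similarity comparison. The three vectors are hand-made stand-ins for embeddings (the numbers are invented for illustration and do not come from any model), and `cosine_similarity` is a helper defined here, not a library call.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented stand-ins for embeddings (NOT real model output)
query_vec = np.array([0.9, 0.1, 0.0])      # e.g. the query
war_movie_vec = np.array([0.8, 0.2, 0.1])  # a passage pointing the same way
comedy_vec = np.array([0.0, 0.2, 0.9])     # a passage pointing elsewhere

sim_war = cosine_similarity(query_vec, war_movie_vec)
sim_comedy = cosine_similarity(query_vec, comedy_vec)
print(sim_war, sim_comedy)
```

The query and each passage are turned into vectors separately and only compared at the end. A cross encoder has no such intermediate vectors: it takes the (query, passage) pair as a single input and outputs one relevance score.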

We use the [SentenceTransformers](https://sbert.net/examples/cross_encoder/applications/README.html) cross-encoder library, which exposes pretrained cross encoders for our use.

## Load movie dataset

We load the TheMovieDB training set into two pandas dataframes: one with the judgments, the other with the movie data.

In [2]:
from ltr.download import download, extract_tgz

dataset = ["https://github.com/ai-powered-search/tmdb/raw/main/judgments.tgz", 
           "https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz"]
download(dataset, dest="data/")
extract_tgz("data/movies.tgz", "data/") # -> Holds "tmdb.json", big json dict with corpus
extract_tgz("data/judgments.tgz", "data/") # -> Holds "ai_pow_search_judgments.txt", 
                                  # which is our labeled judgment list

import pandas as pd
movies = pd.read_json('data/tmdb.json', orient="index")
movies = movies[['title', 'overview']]
movies['title'] = movies['title'].fillna('')
movies['overview'] = movies['overview'].fillna('')

from ltr.judgments import judgments_open

all_judgments = []
with judgments_open("data/ai_pow_search_judgments.txt") as judgments:
    for judgment in judgments:
        all_judgments.append({'grade': judgment.grade,
                              'keywords': judgment.keywords,
                              'doc_id': int(judgment.doc_id)})

judgments = pd.DataFrame(all_judgments)
judgments

data/judgments.tgz already exists
data/movies.tgz already exists


## Rerank from the judgments directly

To "simulate" L0 retrieval, here we will just grab candidates directly from the labeled dataset. Usually this would be some top N from your search engine.

Below we grab the query `vietnam war`, which is a general category. Movies about the Vietnam War may or may not mention the phrase itself in their title or overview.

In [38]:
vietnam_war = judgments[judgments['keywords'] == 'vietnam war'].merge(movies,
                                                                      how='left', right_index=True, 
                                                                      left_on='doc_id')

vietnam_war['passage'] = vietnam_war['title'] + '\n\n' + vietnam_war['overview']
vietnam_war

Unnamed: 0,grade,keywords,doc_id,title,overview,passage
3825,1,vietnam war,600,Full Metal Jacket,A pragmatic U.S. Marine observes the dehumaniz...,Full Metal Jacket\n\nA pragmatic U.S. Marine o...
3826,1,vietnam war,28,Apocalypse Now,"At the height of the Vietnam war, Captain Benj...",Apocalypse Now\n\nAt the height of the Vietnam...
3827,1,vietnam war,792,Platoon,"As a young and naive recruit in Vietnam, Chris...",Platoon\n\nAs a young and naive recruit in Vie...
3828,1,vietnam war,11778,The Deer Hunter,A group of working-class friends decides to en...,The Deer Hunter\n\nA group of working-class fr...
3829,0,vietnam war,437,Cube 2: Hypercube,The sequel to the low budget first film ‘Cube....,Cube 2: Hypercube\n\nThe sequel to the low bud...
3830,0,vietnam war,11543,Kingpin,After bowler Roy Munson swindles the wrong cro...,Kingpin\n\nAfter bowler Roy Munson swindles th...
3831,0,vietnam war,84892,The Perks of Being a Wallflower,"Pittsburgh, Pennsylvania, 1991. High school fr...","The Perks of Being a Wallflower\n\nPittsburgh,..."
3832,0,vietnam war,72890,Girl Most Likely,A failed New York playwright stages a suicide ...,Girl Most Likely\n\nA failed New York playwrig...
3833,0,vietnam war,70667,Kon-Tiki,The true story about legendary explorer Thor H...,Kon-Tiki\n\nThe true story about legendary exp...
3834,0,vietnam war,15347,Born Free,"At a national park in Kenya, English game ward...","Born Free\n\nAt a national park in Kenya, Engl..."


## Rerank based on keyword -- passage similarity

The cross encoder reranks using the relevance it predicts between the keywords and each passage. Below we gather these into `pairs`, ask the model to score each pair, and store the scores back in the dataframe. Finally, we sort on this score from most to least relevant.

### Awesome 

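You can see this works very well to rank based on semantics! In the sorted output below, the four movies graded relevant land at the top, well separated from the irrelevant ones, which all score below -10.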

### Not so awesome

The cross encoder doesn't have a straightforward way to include other types of features we might want in an LTR solution, such as numerical features (popularity, etc.). Nor does it give us direct keyword matches.

For these reasons, traditional LTR persists, with cross encoders and bi-encoders providing one piece of the puzzle.
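
As a rough sketch of what "one piece of the puzzle" can look like, the cross-encoder score can sit in a feature matrix as one column next to other signals. Everything below is hypothetical: the document ids, the values, and the `popularity` and `title_keyword_match` columns are invented for illustration and are not taken from the dataset above.

```python
import pandas as pd

# Toy candidates: every value here is invented for illustration
candidates = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "cross_encoder_score": [2.5, -0.3, -9.8],  # semantic relevance signal
    "popularity": [12.0, 48.5, 3.1],           # numerical feature
    "title_keyword_match": [0, 1, 0],          # direct keyword match flag
})

# The cross-encoder score is just one feature among several
feature_columns = ["cross_encoder_score", "popularity", "title_keyword_match"]
features = candidates[feature_columns].to_numpy()
print(features.shape)
```

A learning-to-rank model would then be trained on a matrix like `features` together with the judgment grades, which lets it weigh the semantic score against the other signals.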

In [35]:
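from sentence_transformers import CrossEncoder

# Pretrained cross encoder that scores each (query, passage) pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
# One (keywords, passage) pair per candidate row
pairs = vietnam_war[['keywords', 'passage']].to_records(index=False)
scores = model.predict(pairs)
vietnam_war['scores'] = scores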

In [36]:
vietnam_war.sort_values('scores', ascending=False)

Unnamed: 0,grade,keywords,doc_id,title,overview,passage,scores
3826,1,vietnam war,28,Apocalypse Now,"At the height of the Vietnam war, Captain Benj...",Apocalypse Now\n\nAt the height of the Vietnam...,3.10594
3828,1,vietnam war,11778,The Deer Hunter,A group of working-class friends decides to en...,The Deer Hunter\n\nA group of working-class fr...,2.45561
3825,1,vietnam war,600,Full Metal Jacket,A pragmatic U.S. Marine observes the dehumaniz...,Full Metal Jacket\n\nA pragmatic U.S. Marine o...,1.755414
3827,1,vietnam war,792,Platoon,"As a young and naive recruit in Vietnam, Chris...",Platoon\n\nAs a young and naive recruit in Vie...,0.254646
3846,0,vietnam war,16220,Wizards,"In a post-apocalyptic future, humankind is des...","Wizards\n\nIn a post-apocalyptic future, human...",-10.406036
3859,0,vietnam war,28043,Black Sabbath,Black Sabbath is a 1963 Italian horror film di...,Black Sabbath\n\nBlack Sabbath is a 1963 Itali...,-10.720423
3856,0,vietnam war,45929,Titanica,Titanica is a fascinating non-fiction drama wh...,Titanica\n\nTitanica is a fascinating non-fict...,-10.844439
3858,0,vietnam war,346,Seven Samurai,"A veteran samurai, who has fallen on hard time...","Seven Samurai\n\nA veteran samurai, who has fa...",-10.864466
3844,0,vietnam war,22575,Go West,"Embezzler, shill, all around confidence man S....","Go West\n\nEmbezzler, shill, all around confid...",-10.88236
3853,0,vietnam war,70435,Haywire,Mallory Kane is a highly trained operative who...,Haywire\n\nMallory Kane is a highly trained op...,-10.915937
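
The raw scores above are unbounded (roughly 3 down to -11). If a bounded value is easier to work with, one common option is to squash them with a sigmoid. The sketch below assumes the scores can be treated as logits, which is an assumption rather than something this notebook verifies; the two input values are copied from the sorted output above, and `sigmoid` is a helper defined here.

```python
import numpy as np

def sigmoid(x):
    """Map unbounded scores into the open interval (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# Top and bottom scores shown in the sorted output above
raw_scores = np.array([3.10594, -10.915937])
squashed = sigmoid(raw_scores)
print(squashed)
```

The sigmoid is monotonic, so it does not change the ranking; it only rescales the scores.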
