# Cutting off the tail of results

This code accompanies a blog post about [optimizing the cutoff of search results]. We look for a good cutoff (such as a `minimum_should_match` value) using the `variable_precision` metric mentioned in the blog post. Much of this code is adapted from the [Hello LTR](http://github.com/o19s/hello-ltr) project from [OpenSource Connections](http://opensourceconnections.com). It reuses TMDB data synthesized for the book [AI Powered Search](http://aipoweredsearch.com).

In [3]:
from search.download import download
judgments='https://github.com/ai-powered-search/tmdb/raw/87716fa2d4447807e695c03b83bcb9cd70a5d493/judgments.tgz'
corpus='https://github.com/ai-powered-search/tmdb/raw/main/movies.tgz'

download([corpus, judgments], dest='data/tmdb/');

data/tmdb/movies.tgz already exists
data/tmdb/judgments.tgz already exists


In [4]:
import json
# Use a context manager so the file handle is closed after loading
with open('data/tmdb/tmdb.json') as f:
    tmdb_movies = json.load(f)

In [38]:
from search.judgments import load_as_dataframe
judgments = load_as_dataframe('data/tmdb/ai_pow_search_judgments.txt')
judgments

Recognizing 105 queries...
Parsing QID 100


qid,grade,keywords,doc_id
1,1,rambo,7555
1,1,rambo,1370
1,1,rambo,1369
1,0,rambo,13258
1,1,rambo,1368
...,...,...,...
105,0,vietnam war,61563
105,0,vietnam war,10961
105,0,vietnam war,4806
105,0,vietnam war,10661


## Index the TMDB (TheMovieDB) corpus

Recreate the `tmdb` index and load [TheMovieDB](http://themoviedb.org) corpus into it. We don't customize the text mappings; for this example, the default standard tokenization is good enough.

In [13]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk



def indexable_movies(tmdb_movies):
    for movie_id, movie in tmdb_movies.items():
        # Reset per movie so a missing or empty date never reuses the previous movie's value
        release_date = movie.get('release_date') or None

        try:
            source_doc = {'id': movie_id,
                'title': movie['title'],
                'overview': movie['overview'],
                'tagline': movie['tagline'],
                'cast': " ".join([castMember['name'] for castMember in movie['cast']]),
                'genres': [genre['name'] for genre in movie['genres']],
                'release_date': release_date,
                'vote_average': float(movie['vote_average']) if 'vote_average' in movie else None,
                'vote_count': int(movie['vote_count']) if 'vote_count' in movie else 0,
            }
        except KeyError:
            # Skip movies missing a required field instead of re-yielding the previous doc
            continue

        yield {
            "_id": movie_id,
            "_source": source_doc,
            "_index": "tmdb"
        }
        

es = Elasticsearch("http://localhost:9200")
es.indices.delete(index='tmdb', ignore=[400,404])

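# Standard dev settings; all text fields use the default (standard) analyzer
settings = {
    'settings': {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}

# Recreate the index with these settings (body= assumes the 7.x Python client,
# matching the other calls in this notebook)
es.indices.create(index='tmdb', body=settings)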


for success, info in parallel_bulk(es, indexable_movies(tmdb_movies)):
    if not success:
        print(success, info)

es.indices.refresh(index='tmdb')
es.count(index='tmdb')

{'count': 65702,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}

## Check our search works...

Run a sample query to confirm the index returns sensible results.

In [29]:
results = es.search(index='tmdb', body={
    "query": {
        "multi_match": {
            "query": "harry and the hendersons",
            "fields": ["title", "tagline", "overview"]
        }
    },
    "size": 100
})
results

{'took': 29,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': 20.167906,
  'hits': [{'_index': 'tmdb',
    '_type': '_doc',
    '_id': '8989',
    '_score': 20.167906,
    '_source': {'id': '8989',
     'title': 'Harry and the Hendersons',
     'overview': 'Returning from a hunting trip in the forest, the Henderson family\'s car hits an animal in the road. At first they fear it was a man, but when they examine the "body" they find it\'s a "bigfoot". They think it\'s dead so they decide to take it home (there could be some money in this). As you guessed, it isn\'t dead. Far from being the ferocious monster they fear "Harry" to be, he\'s a friendly giant.',
     'tagline': "When You Can't Believe Your Eyes, Trust Your Heart.",
     'cast': 'John Lithgow Melinda Dillon Margaret Langrick Joshua Rudoy Kevin Peter Hall David Suchet Lainie Kazan Don Ameche M. Emmet Walsh Laura Kenn

## Turn results into useful dataframe

Turn the search results into a dataframe that is easy to label with the judgments. Each row includes the `doc_id`, the `title`, and the number of results returned in the response (at most `size`, which we set to 100).

In [42]:
import pandas as pd

def flatten_hits(results, keywords):
    """Flatten an Elasticsearch response into one row per hit, in ranked order."""
    hits = results['hits']['hits']
    hits_len = len(hits)
    flattened = []
    for hit in hits: 
        doc_id = hit['_id']
        title = hit['_source']['title']
        flattened.append({'keywords': keywords,
                          'doc_id': doc_id,
                          'title': title,
                          'num_results': hits_len})
    return pd.DataFrame(flattened)
flatten_hits(results, keywords="harry and the hendersons")

Unnamed: 0,keywords,doc_id,title,num_results
0,harry and the hendersons,53157,The Park Is Mine,10
1,harry and the hendersons,560585,A Face of War,10
2,harry and the hendersons,523176,"There Is No Return, Johnny",10
3,harry and the hendersons,18638,Faith of My Fathers,10
4,harry and the hendersons,91475,Mean Johnny Barrows,10
5,harry and the hendersons,148667,Message from Nam,10
6,harry and the hendersons,25784,Journey from the Fall,10
7,harry and the hendersons,133252,A Rumor Of War,10
8,harry and the hendersons,264321,A Rumor Of War,10
9,harry and the hendersons,208982,Vietnam in HD,10


## Issue every query in judgments

Issue every query in the judgments and label its returned results as relevant or not. We do this by left joining the judgments onto each query's hits. Hits without a judgment are assumed irrelevant (`grade=0.0`). Given the underlying dataset, this is reasonable: the judgments mostly contain positively labeled title or series matches, so unlabeled results can typically be treated as irrelevant.

In [49]:
def labeled_results(judgments, results, keywords):
    hits_df = flatten_hits(results, keywords)
    merged = hits_df.merge(judgments, on='doc_id', how='left').fillna(0.0)
    merged = merged.rename(columns={'keywords_x': 'keywords'})
    return merged[['keywords', 'doc_id', 'title', 'num_results', 'grade']]


labeled_frames = []
for query in judgments['keywords'].unique():
    query_judgments = judgments[judgments['keywords'] == query]
    results = es.search(index='tmdb', body={
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title", "tagline", "overview"]
            }
        },
        "size": 100
    })
    labeled_frames.append(labeled_results(query_judgments, results, keywords=query))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_results_labeled = pd.concat(labeled_frames)

all_results_labeled

Unnamed: 0,keywords,doc_id,title,num_results,grade
0,rambo,205697,The Last American Soldier,12,0.0
1,rambo,7555,Rambo,12,1.0
2,rambo,1370,Rambo III,12,1.0
3,rambo,28448,Walker,12,0.0
4,rambo,1368,First Blood,12,1.0
...,...,...,...,...,...
95,vietnam war,11778,The Deer Hunter,100,1.0
96,vietnam war,209574,American Commandos,100,0.0
97,vietnam war,10654,Hair,100,0.0
98,vietnam war,11856,Air America,100,0.0
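
The left join described above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's own data: `hits` and `toy_judgments` are hypothetical names, and doc id `99999` is made up to stand in for a hit that nobody judged.

```python
import pandas as pd

# Hypothetical hits for one query, in ranked order (99999 is a made-up doc id).
hits = pd.DataFrame({
    'keywords': ['rambo'] * 3,
    'doc_id': ['7555', '99999', '1370'],
})

# Hypothetical judgments for the same query; doc 99999 was never judged.
toy_judgments = pd.DataFrame({
    'doc_id': ['7555', '1370'],
    'grade': [1, 1],
})

# The left join keeps every hit in its original order; unjudged hits get NaN,
# which we then treat as irrelevant (grade 0.0).
labeled = hits.merge(toy_judgments, on='doc_id', how='left')
labeled['grade'] = labeled['grade'].fillna(0.0)
print(labeled)
```

Filling only the `grade` column, rather than the whole frame, avoids writing `0.0` into unrelated columns. The join keys also need the same dtype on both sides, so it is worth checking that `doc_id` is a string (or an integer) in both frames before merging.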


## Compute variable precision per query

Variable precision is the proportion of the returned result set (capped at `max_n` results) that is relevant. We compute `variable_precision` once per query.

In [50]:
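def variable_precision(grades, max_n):
    """Fraction of the first min(max_n, len(grades)) grades that are relevant."""
    n = min(max_n, len(grades))
    if n == 0:
        # No results at all: avoid dividing by zero
        return 0.0
    return sum(grades[:n]) / n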

In [54]:
final_scores = []
for query in judgments['keywords'].unique():
    judged_query = judgments[judgments['keywords'] == query]
    var_prec = variable_precision(judged_query['grade'].tolist(), 100)
    final_scores.append({'query': query, 'variable_precision': var_prec})
final_scores = pd.DataFrame(final_scores)
final_scores

Unnamed: 0,query,variable_precision
0,rambo,0.097561
1,rocky,0.170732
2,war games,0.051282
3,crocodile dundee,0.107143
4,matrix,0.090909
...,...,...
100,science fiction,0.102564
101,screwball comedy,0.097561
102,snakes,0.097561
103,superheroes,0.097561
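
The cell above takes its grades from the judgment list. To score the hits each query actually returned, the same function can be applied per query to a labeled results frame. The sketch below assumes a frame shaped like `all_results_labeled` (one row per hit, with `keywords` and `grade` columns); `toy_labeled` and `per_query` are hypothetical names and the grades are made up for illustration.

```python
import pandas as pd

def variable_precision(grades, max_n):
    # Same metric as above, with a guard for empty result sets (an assumption).
    n = min(max_n, len(grades))
    return sum(grades[:n]) / n if n > 0 else 0.0

# Hypothetical labeled results: one row per returned hit, in ranked order.
toy_labeled = pd.DataFrame({
    'keywords': ['rambo', 'rambo', 'rambo', 'rambo', 'rocky', 'rocky'],
    'grade':    [0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})

# Group the hits by query and score each group's grades.
per_query = (
    toy_labeled.groupby('keywords', sort=False)['grade']
    .apply(lambda grades: variable_precision(grades.tolist(), max_n=100))
    .rename('variable_precision')
    .reset_index()
)
print(per_query)
```

Here `rambo` scores 0.5 (2 relevant out of 4 hits) and `rocky` scores 1.0 (2 out of 2).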


## Flaw with Variable Precision - only 1 result == score of 1!

The `variable_precision` metric has a flaw: a result set containing a single relevant result scores a "perfect" 1.0. As shown below, that beats a result set with 3 relevant results out of 4 (0.75), even though the latter is arguably the better "cutoff" because it surfaces more relevant results.

In [56]:
variable_precision(grades=[1.0], max_n=100)

1.0

In [57]:
variable_precision(grades=[1.0, 0.0, 1.0, 1.0], max_n=100)

0.75

In [58]:
variable_precision(grades=[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0], max_n=100)

0.375

## Another approach: reward optimization

Another way to look at the problem: as the user scrolls through the results, each relevant result is a reward and each irrelevant result is an annoyance. We sum the rewards and subtract a penalty every time we come across an irrelevant result. Capturing more relevant results makes users feel good, while scrolling past irrelevant ones wears on them. If they see more and more irrelevant results, their annoyance grows until they give up.

In [98]:
def time_well_spent(grades, median_grade=0.5):
    """Sum each grade's offset from median_grade: relevant results add, irrelevant ones subtract."""
    received_reward = 0.0
    for grade in grades:
        received_reward += grade - median_grade

    return received_reward

In [102]:
time_well_spent(grades=[1.0])

0.5

In [106]:
time_well_spent(grades=[1.0, 0.0])

0.0

In [107]:
time_well_spent(grades=[1.0, 0.0, 1.0])

0.5

In [103]:
time_well_spent(grades=[1.0, 0.0, 1.0, 1.0])

1.0

In [104]:
time_well_spent(grades=[1.0, 0.0, 1.0, 1.0, 0.0])

0.5

In [105]:
time_well_spent(grades=[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

-1.0
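
One way to turn this reward into a cutoff is to track the running total while walking down the ranked grades and keep the prefix where it peaks. The following is a minimal sketch under that assumption, not necessarily the blog post's method; `best_cutoff` is a hypothetical helper introduced here for illustration.

```python
def best_cutoff(grades, median_grade=0.5):
    """Return (cutoff, reward) for the prefix with the highest cumulative reward.

    A cutoff of 0 means no prefix earned a positive reward.
    """
    best_n, best_reward = 0, 0.0
    running = 0.0
    for n, grade in enumerate(grades, start=1):
        # Same per-result reward as time_well_spent: grade minus median_grade.
        running += grade - median_grade
        # Strict '>' keeps the shortest prefix when rewards tie.
        if running > best_reward:
            best_n, best_reward = n, running
    return best_n, best_reward

grades = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(best_cutoff(grades))  # (4, 1.0)
```

For the ten grades above, the running reward peaks at 1.0 after the fourth result, matching the `time_well_spent` values computed earlier, so the sketch would cut the list off at 4 results.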