# Assignment 3, Model 1: MLM

In this notebook you will implement MLM re-ranking of the first-pass ranking retrieved from your index. 

Your implementation of the mixture of language models (MLM) approach should work with two fields, `title` and `content`, with weights 0.2 and 0.8, respectively. 

Content should be the "catch-all" field. Use Dirichlet smoothing with the smoothing parameter set to 2000.
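
For reference, the usual textbook formulation of MLM with Dirichlet smoothing can be written as below. The notation is introduced here for illustration and is not taken from the assignment text: $f_{t,d_i}$ is the count of term $t$ in field $i$ of document $d$, $|d_i|$ is the length of that field, $P(t \mid C_i)$ is the probability of $t$ in field $i$ across the whole collection, $w_i$ is the field weight (0.2 for `title`, 0.8 for `content`), and $\mu = 2000$. The sum over $t \in q$ runs over the analyzed query terms, counting repeated terms each time.

$$
\log P(q \mid d) = \sum_{t \in q} \log \sum_{i} w_i \, P(t \mid \theta_{d_i}),
\qquad
P(t \mid \theta_{d_i}) = \frac{f_{t,d_i} + \mu \, P(t \mid C_i)}{|d_i| + \mu},
\qquad
P(t \mid C_i) = \frac{\sum_{d'} f_{t,d'_i}}{\sum_{d'} |d'_i|}
$$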

Use markdown cells with section headings and explanations, and write readable code, so that your intent is clear at each step.

In [1]:
import math

from pprint import pprint
from tqdm import tqdm
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.info()

{'name': 'HoG',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'Zle62SN2R0-CJz5T3d07Og',
 'version': {'number': '7.4.2',
  'build_flavor': 'default',
  'build_type': 'deb',
  'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96',
  'build_date': '2019-10-28T20:40:44.881551Z',
  'build_snapshot': False,
  'lucene_version': '8.2.0',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

In [2]:
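INDEX_NAME = 'collection_v2'

# Field weights for the mixture: 'names' plays the role of the title field
# and 'catch_all' of the content field.
fields = {
    'names': 0.2,
    'catch_all': 0.8,
}

# Probability floor used when a term's smoothed probability is 0.
PENALTY = 1e-8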

## File loading utilities

In [3]:
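def load_queries(path):
    """Read queries from a file with one '<query_id> <query text>' pair per line."""
    queries = {}
    with open(path) as f:
        for line in f:
            parts = line.split(maxsplit=1)
            if len(parts) < 2:
                continue  # skip blank lines and lines without query text
            query_id, text = parts
            queries[query_id] = text.strip()
    return queries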

In [4]:
BASELINE_RANKING_FILE = 'data/ranking2_baseline.csv'
MLM_RANKING_FILE = 'data/ranking2_model1.csv'
queries = load_queries('data/queries2.txt')

## Query analyzing

In [5]:
def analyze(query):
    response = es.indices.analyze(index=INDEX_NAME, body={'text': query, 'analyzer':'english_analyzer'})
    analyzed_query = [term['token'] for term in response['tokens']]
    return analyzed_query

## Term field frequency retrieval

Sometimes the total term frequency (`ttf`) of a query term in a field is not directly available in a document's term vectors: a document can have a high BM25 score even though this particular term does not occur in the field being scored. In that case, we search the index for the term in that field and read its `ttf` from the term vectors of a matching document.
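
To make the situation described above concrete, here is a small sketch using a hand-written dictionary that mimics the shape of one field in an Elasticsearch term vectors response. All numbers are made up, and `toy_field_tv` and `local_ttf` are hypothetical names used only for this illustration.

```python
# Toy term vectors for ONE field of ONE document, mimicking the shape of the
# Elasticsearch termvectors response (all values are made up for illustration).
toy_field_tv = {
    'field_statistics': {'sum_doc_freq': 120, 'doc_count': 10, 'sum_ttf': 500},
    'terms': {
        'search': {'doc_freq': 4, 'ttf': 9, 'term_freq': 2},
        'engine': {'doc_freq': 3, 'ttf': 5, 'term_freq': 1},
    },
}


def local_ttf(term, field_tv):
    """Return the term's ttf if this document's field contains it, else None."""
    entry = field_tv.get('terms', {}).get(term)
    return entry['ttf'] if entry is not None else None


print(local_ttf('search', toy_field_tv))   # term occurs in this document's field
print(local_ttf('ranking', toy_field_tv))  # term absent: its ttf must come from another document
```

Only terms that occur in the document's field carry a `ttf` entry, which is why a fallback lookup is needed for the others.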

In [6]:
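def find_ttf(term, tv, field):
    """Return the total term frequency (ttf) of `term` in `field`, or 0 if it cannot be found."""
    # Fast path: the term occurs in this document's field, so its ttf is in the term vectors.
    if term in tv['terms']:
        return tv['terms'][term]['ttf']

    # Otherwise, look the term up in other documents that match it in this field.
    query = {'size': 10, 'query': {'match': {field: term}}}
    docs = es.search(index=INDEX_NAME, body=query)['hits']['hits']

    for doc in docs:
        doc_tv = es.termvectors(index=INDEX_NAME, id=doc['_id'], fields=field, term_statistics=True)['term_vectors']
        if field in doc_tv and term in doc_tv[field]['terms']:
            return doc_tv[field]['terms'][term]['ttf']

    # The ttf could not be found: return 0, so LM_score applies the PENALTY floor for this term.
    return 0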

## Baseline ranking function

In [7]:
def rank_baseline(documents, query):
    scores = {}
    
    for doc in documents:
        doc_id = doc['_id']
        scores[doc_id] = doc['_score']
        
    sorted_scores = sorted(scores.items(), key = lambda pair: pair[1], reverse=True)
    return [doc[0] for doc in sorted_scores]

## MLM Ranking and scoring functions

In [8]:
def rank_mlm(documents, query):
    scores = {}
    query_terms = analyze(query)

    for doc in documents:
        doc_id = doc['_id']
        termvectors = es.termvectors(index=INDEX_NAME, id=doc_id, fields=list(fields.keys()), field_statistics=True, term_statistics=True)['term_vectors']
        scores[doc_id] = MLM_score(termvectors, query_terms)

    sorted_scores = sorted(scores.items(), key = lambda pair: pair[1], reverse=True)
    return [doc[0] for doc in sorted_scores]


def MLM_score(termvectors, query_terms):
    mlm_score = 0
    for field_name, field_weight in fields.items():
        mlm_score += field_weight * LM_score(termvectors, field_name, query_terms)
    return mlm_score


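def LM_score(termvectors, field, query_terms, mu_param=2000):
    """Dirichlet-smoothed log-likelihood of the query terms for one field of a document."""
    field_tv = termvectors.get(field, {})

    # NOTE: if the document has no term vectors for this field, the score is 0,
    # which is the highest value a log-probability can take.
    if 'terms' not in field_tv:
        return 0

    # Document-field and collection-field lengths do not depend on the query term.
    doc_length = sum(stats['term_freq'] for stats in field_tv['terms'].values())
    field_length = field_tv['field_statistics']['sum_ttf']

    score = 0
    for term in query_terms:
        ftd = field_tv['terms'].get(term, {}).get('term_freq', 0)  # term count in this document field
        ptc = find_ttf(term, field_tv, field) / field_length       # collection probability of the term
        term_score = (ftd + mu_param * ptc) / (doc_length + mu_param)
        score += math.log(term_score) if term_score > 0 else math.log(PENALTY)

    return score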
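
Note that `MLM_score` above takes a weighted sum of per-field log scores. The usual textbook formulation of MLM instead mixes the field probabilities for each query term first and takes the logarithm afterwards. The following is a minimal sketch of that variant on made-up statistics, with no Elasticsearch calls. The names `dirichlet_prob`, `mlm_log_score`, `doc_stats` and `coll_probs` are hypothetical and do not appear elsewhere in this notebook.

```python
import math


def dirichlet_prob(term_freq, field_len, coll_prob, mu=2000):
    """Dirichlet-smoothed probability of a term in one document field."""
    return (term_freq + mu * coll_prob) / (field_len + mu)


def mlm_log_score(query_terms, doc_stats, coll_probs, weights, mu=2000, floor=1e-8):
    """Sum over query terms of log( sum over fields of w_f * P(t | field) )."""
    score = 0.0
    for term in query_terms:
        mixed = 0.0
        for field, weight in weights.items():
            stats = doc_stats[field]
            tf = stats['tf'].get(term, 0)
            p_coll = coll_probs[field].get(term, 0.0)
            mixed += weight * dirichlet_prob(tf, stats['length'], p_coll, mu)
        # Fall back to a small floor if the term is unseen in every field.
        score += math.log(mixed) if mixed > 0 else math.log(floor)
    return score


# Made-up statistics for a single document and collection (illustration only).
weights = {'names': 0.2, 'catch_all': 0.8}
doc_stats = {
    'names': {'length': 3, 'tf': {'search': 1}},
    'catch_all': {'length': 100, 'tf': {'search': 4, 'engine': 2}},
}
coll_probs = {
    'names': {'search': 0.001, 'engine': 0.0005},
    'catch_all': {'search': 0.002, 'engine': 0.001},
}

print(mlm_log_score(['search', 'engine'], doc_stats, coll_probs, weights))
```

Because the mixture is formed inside the logarithm, a term that is missing from one field can still receive probability mass from the other field.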

## Exporting rankings to disk

Each entity is renamed on export because the expected output format differs from the identifiers stored in the index.
For example, `<http://dbpedia.org/resource/Feature_Selection>` becomes `<dbpedia:Feature_Selection>`.

In [9]:
def rename(entity):
    basename = entity.split('/')[-1]
    return f'"<dbpedia:{basename}"'

def export_ranking(ranking, path):
    with open(path, 'w') as f:
        f.write('QueryId,EntityId\n')
        for query_id, entity_list in ranking.items():
            for entity in entity_list:
                f.write(f'{query_id},{rename(entity)}\n')
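
As a quick check of the renaming described above, here is a standalone sketch of the same mapping. Unlike `rename`, it strips the URI prefix rather than splitting on `/`, on the assumption that some resource names may themselves contain a slash. `to_prefixed` and `DBPEDIA_PREFIX` are hypothetical names, and the sketch returns the bare `<dbpedia:...>` form without the surrounding double quotes that `rename` adds for the CSV.

```python
DBPEDIA_PREFIX = 'http://dbpedia.org/resource/'


def to_prefixed(entity):
    """Map '<http://dbpedia.org/resource/X>' (or the bare URI) to '<dbpedia:X>'."""
    uri = entity.strip().lstrip('<').rstrip('>')
    if uri.startswith(DBPEDIA_PREFIX):
        uri = uri[len(DBPEDIA_PREFIX):]
    return f'<dbpedia:{uri}>'


print(to_prefixed('<http://dbpedia.org/resource/Feature_Selection>'))
# <dbpedia:Feature_Selection>
```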

## Computing the rankings

In [10]:
def compute_ranking(rank_method):
    ranking = {}
    
    for query_id, query in tqdm(queries.items()):
        # retrieve first 100 hits using the default retrieval model
        first_pass = es.search(index=INDEX_NAME, q=query, size=100)['hits']['hits']
        
        # rerank the first pass using custom ranking method
        ranking[query_id] = rank_method(first_pass, query)
        
    return ranking

In [11]:
baseline_ranking = compute_ranking(rank_baseline)
export_ranking(baseline_ranking, BASELINE_RANKING_FILE)

100%|██████████| 233/233 [00:49<00:00,  4.68it/s]


In [12]:
mlm_ranking = compute_ranking(rank_mlm)
export_ranking(mlm_ranking, MLM_RANKING_FILE)

100%|██████████| 233/233 [13:10<00:00,  3.39s/it]


The resulting rankings for the two query sets should be saved and pushed to GitHub as `data/ranking_model1.csv` and `data/ranking2_model1.csv`.