# Assignment 3, Model 2: SDM+ELR

In this notebook you will implement SDM+ELR re-ranking of the first-pass ranking retrieved from your index.

Your implementation of the Sequential Dependence Model (SDM) extended with Entity Linking incorporated Retrieval (ELR) should be a single-field variant of SDM, so use the "catch-all" text field for scoring.

Your implementation should use the following weights: 0.80 for unigram matches, 0.05 for ordered bigram matches, 0.05 for unordered bigram matches, and 0.1 for entity matches. 
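
As a minimal sketch of how these weights might be spread over the matches of a single query, the snippet below divides each weight by the number of features of its type, mirroring the normalisation done in `SDM_ELR_score` further down. The helper name `normalise_weights` and the example inputs (three query terms, entity scores summing to 0.5) are illustrative assumptions, not part of the assignment.

```python
def normalise_weights(weights, n_terms, entity_score_sum):
    """Split each feature-type weight over the features of that type."""
    n_bigrams = max(n_terms - 1, 0)  # consecutive term pairs in the query
    return {
        'lambda_t': weights['terms'] / n_terms if n_terms else 0.0,
        'lambda_o': weights['ordered_bigrams'] / n_bigrams if n_bigrams else 0.0,
        'lambda_u': weights['unordered_bigrams'] / n_bigrams if n_bigrams else 0.0,
        'lambda_e': weights['entities'] / entity_score_sum if entity_score_sum else 0.0,
    }


example_weights = {
    'terms': 0.8,
    'ordered_bigrams': 0.05,
    'unordered_bigrams': 0.05,
    'entities': 0.1,
}

# hypothetical query: 3 analysed terms, entity annotation scores summing to 0.5
lambdas = normalise_weights(example_weights, n_terms=3, entity_score_sum=0.5)
```

The zero fallbacks avoid a division by zero for single-term queries (no bigrams) and for queries without entity annotations.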

Dirichlet smoothing should be applied with the smoothing parameter set to 2000. 
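
For reference, a Dirichlet-smoothed log-probability can be sketched as follows. The function name `dirichlet_log_prob` and the toy counts are assumptions made for illustration; the notebook's own scoring functions below compute the same quantity from Elasticsearch term vectors and substitute a small penalty when the probability is zero.

```python
import math


def dirichlet_log_prob(tf, doc_len, collection_tf, collection_len, mu=2000):
    """Log of (tf + mu * P(t|C)) / (|d| + mu), the Dirichlet-smoothed estimate."""
    p_collection = collection_tf / collection_len  # background model P(t|C)
    prob = (tf + mu * p_collection) / (doc_len + mu)
    # a zero probability has no finite log; this sketch returns -inf in that case
    return math.log(prob) if prob > 0 else float('-inf')


# toy numbers: the term occurs twice in a 100-term document and
# 1000 times in a collection of one million term occurrences
example_log_prob = dirichlet_log_prob(tf=2, doc_len=100,
                                      collection_tf=1000,
                                      collection_len=1_000_000)
```

With `mu=2000` the background model contributes as much as 2000 extra term occurrences, so short documents are pulled strongly towards the collection statistics.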

The entity annotations for queries are provided as part of the input.

Use markdown cells with section headings and explanations, and write readable code, so that your intention is clear at each step.

In [1]:
import json
import math
from pprint import pprint
from tqdm import tqdm
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.info()

{'name': 'HoG',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'Zle62SN2R0-CJz5T3d07Og',
 'version': {'number': '7.4.2',
  'build_flavor': 'default',
  'build_type': 'deb',
  'build_hash': '2f90bbf7b93631e52bafb59b3b049cb44ec25e96',
  'build_date': '2019-10-28T20:40:44.881551Z',
  'build_snapshot': False,
  'lucene_version': '8.2.0',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

In [2]:
INDEX_NAME = 'collection_v2'
weights = {
    'terms' : 0.8,
    'ordered_bigrams' : 0.05,
    'unordered_bigrams' : 0.05,
    'entities' : 0.1,
}

PENALTY = 1e-6

## File loading utilities

In [3]:
def load_queries(path):
    queries = {}
    with open(path) as f:
        for line in f:
            query = line.split(maxsplit = 1)
            queries[query[0]] = query[1].strip()
    return queries

def read_json(path):
    with open(path) as f:
        return json.load(f)

The expected output format names entities differently from the indexed files, so each entity has to be renamed when the ranking is exported.
For example, `<http://dbpedia.org/resource/Feature_Selection>` translates to `<dbpedia:Feature_Selection>`.

In [4]:
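def translate(entity):
    """Convert an entity URI between its long and short DBpedia forms."""
    if entity.startswith('<http'):
        # the last path segment keeps the closing '>', e.g. 'Feature_Selection>'
        basename = entity.split('/')[-1]
        return f'<dbpedia:{basename}'
    elif entity.startswith('<db'):
        # the part after the prefix also keeps the closing '>'
        basename = entity.split(':')[-1]
        return f'<http://dbpedia.org/resource/{basename}'
    # unknown format: return the identifier unchanged instead of None
    return entity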

In [5]:
ranking_file = 'data/ranking_model2.csv'
queries = load_queries('data/queries.txt')
entity_annotations = read_json('data/entity_annotations.json')
collection_stats = read_json('data/collection_stats.json')

# translate the URI of every annotated entity
for annotation in entity_annotations.values():
    for entity in annotation.values():
        entity['uri'] = translate(entity['uri'])

## Query analyzing

In [6]:
def analyze(query):
    response = es.indices.analyze(index=INDEX_NAME, body={'text': query, 'analyzer':'english_analyzer'})
    analyzed_query = [term['token'] for term in response['tokens']]
    return analyzed_query

## Term field frequency retrieval

The total term frequency (`ttf`) of a term in a field is not always available directly in a document's term vector: a document can have a high BM25 score even though this particular term does not occur in the field being scored. In that case, we search the index for the term and read its `ttf` from the term vector of a document that contains it.

In [7]:
def find_ttf(term, tv, field='catch_all'):
    # use .get() so a document without this field does not raise a KeyError
    if term in tv.get('terms', {}):
        return tv['terms'][term]['ttf']

    # otherwise look the term up in other documents that contain it
    query = {'size': 10, 'query': {'match': {field: term}}}
    docs = es.search(index=INDEX_NAME, body=query)['hits']['hits']

    for doc in docs:
        doc_tv = es.termvectors(index=INDEX_NAME, id=doc['_id'], fields=field, term_statistics=True)['term_vectors']
        if field in doc_tv and term in doc_tv[field]['terms']:
            return doc_tv[field]['terms'][term]['ttf']

    # if ttf could not be found, fall back to a collection frequency of 0
    return 0

## Get term sequence from termvector

In [8]:
def get_term_sequence(termvector):
    term_positions = {}
    for term, info in termvector['terms'].items():
        for token in info['tokens']:
            term_positions[token['position']] = term

    term_sequence = [None] * (max(term_positions.keys()) + 1)
    for position, term in term_positions.items():
        term_sequence[position] = term

    return term_sequence
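
To illustrate the input this function expects, here is a toy term vector in the per-field layout returned by the Elasticsearch term vectors API (each term maps to a list of `tokens` carrying a `position`). The terms, positions and the helper name `sequence_from_termvector` are made up for this sketch, which rebuilds the sequence in the same way as `get_term_sequence`.

```python
# toy term vector for one field; position 2 is deliberately missing,
# as happens when the analyzer drops a stop word
toy_termvector = {
    'terms': {
        'entiti': {'term_freq': 2, 'tokens': [{'position': 0}, {'position': 3}]},
        'link': {'term_freq': 1, 'tokens': [{'position': 1}]},
    }
}


def sequence_from_termvector(termvector):
    """Rebuild the ordered term sequence; positions without a term become None."""
    positions = {
        token['position']: term
        for term, info in termvector['terms'].items()
        for token in info['tokens']
    }
    if not positions:
        return []
    return [positions.get(i) for i in range(max(positions) + 1)]


toy_sequence = sequence_from_termvector(toy_termvector)
```

Keeping the `None` placeholders preserves the original distances between terms, which matters for the bigram window counts below.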

## SDM + ELR ranking

In [9]:
def rank_sdm_elr(documents, query, query_entities):
    scores = {}
    query_terms = analyze(query)
    
    for doc in documents:
        doc_id = doc['_id']
        termvectors = es.termvectors(index=INDEX_NAME, id=doc_id, fields=['catch_all', 'related_entity_uri'], term_statistics=True, field_statistics=True)['term_vectors']
        scores[doc_id] = SDM_ELR_score(termvectors, query_terms, query_entities)
        
    sorted_scores = sorted(scores.items(), key = lambda pair: pair[1], reverse=True)
    return [doc[0] for doc in sorted_scores]

In [10]:
def SDM_ELR_score(termvectors, query_terms, query_entities):     
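    lambda_t = weights['terms'] / len(query_terms) if query_terms else 0
    # guard against queries without entity annotations (sum of scores is 0)
    entity_score_sum = sum(x['score'] for x in query_entities.values())
    lambda_e = weights['entities'] / entity_score_sum if entity_score_sum else 0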
    if len(query_terms) > 1:
        lambda_o = weights['ordered_bigrams'] /  (len(query_terms) - 1)
        lambda_u = weights['unordered_bigrams'] /  (len(query_terms) - 1)
    else:
        lambda_o = 0 
        lambda_u = 0
        
    if 'catch_all' in termvectors:
        term_sequence = get_term_sequence(termvectors['catch_all'])
    else:
        term_sequence = []
        
    return sum([lambda_t*term_score(termvectors, query_terms),
               lambda_o*bigram_score(term_sequence, query_terms, ordered=True),
               lambda_u*bigram_score(term_sequence, query_terms, ordered=False),
               lambda_e*entity_score(termvectors, query_entities)])

## Feature functions

### Unigram match score

In [11]:
def term_score(termvectors, query_terms, field='catch_all', mu_param=2000):
    score = 0
    # a document without this field is treated as empty and still smoothed
    doc_terms = termvectors.get(field, {}).get('terms', {})
    # document length: total number of term occurrences in the field
    doc_length = sum(info['term_freq'] for info in doc_terms.values())
    field_length = collection_stats[field]['sum_ttf']

    for term in query_terms:
        ftd = doc_terms.get(term, {}).get('term_freq', 0)
        sum_ftd = find_ttf(term, {'terms': doc_terms}, field)
        ptc = sum_ftd / field_length
        # Dirichlet-smoothed probability of the term given the document
        prob = (ftd + mu_param * ptc) / (doc_length + mu_param)
        score += math.log(prob) if prob > 0 else math.log(PENALTY)

    return score

### Bigram match scores

In [12]:
def count_ordered_bigram_matches(text, bigram):
    """Counts the number of bigram matches in term sequence."""
    count = 0
    for term, next_term in zip(text, text[1:]):
        if (term, next_term) == bigram:
            count += 1
    return count

def count_unordered_bigram_matches(text, bigram, w=4):
    """Counts the number of unordered bigram matches in text within a given window size."""
    count = 0
    for i in range(len(text) - 1):
        if text[i] in bigram:
            other_term = bigram[0] if text[i] == bigram[1] else bigram[1]
            if other_term in text[i+1:i+w]:
                count += 1
    return count
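
As a small illustration of the window used above: the slice `text[i+1:i+w]` covers the `w - 1` positions that follow position `i`, so with `w=4` two terms still match when at most two other terms separate them. The helper name `window_after` and the toy sequence are assumptions made for this sketch.

```python
def window_after(text, i, w=4):
    """Terms that can pair with text[i]: the next w - 1 positions."""
    return text[i + 1:i + w]


toy_text = ['sequenti', 'a', 'b', 'depend', 'model']

# 'depend' is within reach of 'sequenti', 'model' is not
reachable = window_after(toy_text, 0)
```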

In [13]:
def bigram_background_model(bigram, ordered, w=5):
    slop = 0 if ordered else w
    
    query = {
        "track_total_hits": True,
        "size":0,
        "query": { 
            "match_phrase": { 
                "catch_all": { 
                    "query": f"{bigram[0]} {bigram[1]}", 
                    "slop": slop
                } 
            } 
        }
    }
    
    # background model approximation: the number of documents matching the
    # phrase stands in for the collection-wide bigram count
    docs = es.search(index=INDEX_NAME, body=query)['hits']
    n_bigrams_collection = docs['total']['value']

    return n_bigrams_collection

In [14]:
def bigram_score(term_sequence, query_terms, ordered, mu_param=2000):
    score = 0
    doc_len = len(term_sequence)
    
    # define bigram counting method
    bigram_count = count_ordered_bigram_matches if ordered else count_unordered_bigram_matches
    
    for bigram in zip(query_terms, query_terms[1:]):
        count = bigram_count(term_sequence, bigram)
        n_bigrams_background = bigram_background_model(bigram, ordered)
        background_model = n_bigrams_background / collection_stats['catch_all']['sum_ttf']
        # Dirichlet-smoothed estimate; log(PENALTY) below covers a zero probability
        raw_score = (count + mu_param * background_model) / (doc_len + mu_param)
        score += math.log(raw_score) if raw_score != 0 else math.log(PENALTY)
        
    return score

### Entity score

In [16]:
def entity_score(termvectors, query_entities, lambda_param=0.1):
    score = 0
    termvectors = termvectors.get('related_entity_uri', {})
    
    for entity_name, annotation in query_entities.items():
        feature_function = 0
        if annotation['uri'] in termvectors.get('terms', {}):
            feature_function += (1-lambda_param)
        feature_function += lambda_param * find_ttf(annotation['uri'], termvectors, field='related_entity_uri')
        if feature_function == 0:
            score += annotation['score'] * math.log(PENALTY)
        else:
            score += annotation['score'] * math.log(feature_function)

    return score

## Exporting rankings to disk

In [17]:
def export_ranking(ranking, path):
    with open(path, 'w') as f:
        f.write('QueryId,EntityId\n')
        for query_id, entity_list in ranking.items():
            for entity in entity_list:
                f.write(f'{query_id},"{translate(entity)}"\n')

## Computing the rankings

In [18]:
def compute_ranking(rank_method):
    ranking = {}
    
    for query_id, query in tqdm(queries.items()):
        # retrieve first 100 hits using the default retrieval model
        first_pass = es.search(index=INDEX_NAME, q=query, size=100)['hits']['hits']
        query_entities = entity_annotations[query_id]
        ranking[query_id] = rank_method(first_pass, query, query_entities)
        
    return ranking

In [19]:
sdm_elr_ranking = compute_ranking(rank_sdm_elr)
export_ranking(sdm_elr_ranking, ranking_file)

100%|██████████| 234/234 [14:37<00:00,  3.75s/it]


The resulting rankings for the two query sets should be saved and pushed to GitHub as `data/ranking_model2.csv` and `data/ranking2_model2.csv`.