# INFO 4271 - Exercise 4 - Statistical Ranking

Issued: May 7, 2024

Due: May 13, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Generative Relevance Models
Generative retrieval models use the probabilistic language model framework for matching queries and documents.

a) Implement the `rank()` function sketched below. In class, we discussed two alternative model variants. Choose the query likelihood model.

In [18]:
from collections import Counter
import numpy as np
import re

def bag_of_words(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize the text
    return text.lower().split()

# Rank a collection of documents relative to a query using the query likelihood model
def rank(query, docs):
    query_terms = bag_of_words(query)
    doc_terms = [bag_of_words(d) for d in docs]

    doc_counts = [Counter(terms) for terms in doc_terms]
    doc_lengths = [len(terms) for terms in doc_terms]

    # Number of distinct query terms, used as the smoothing constant below
    n_query_terms = len(set(query_terms))

    scores = []
    for doc, counts, length in zip(docs, doc_counts, doc_lengths):
        score = 0.0
        for term in query_terms:
            # Add-one smoothed probability of the term under the document model;
            # the +1 keeps unseen terms from producing log(0)
            prob = (counts[term] + 1) / (length + n_query_terms)

            # Summing log probabilities is equivalent to taking
            # the log of the product of probabilities
            score += np.log(prob)

        # Record the document together with its log-likelihood score
        scores.append((doc, score))

    # Highest log-likelihood first
    return sorted(scores, key=lambda x: x[1], reverse=True)

In [19]:
from pprint import pprint

Q = 'french bulldog'
D = ['the french revolution was a period of upheaval in france', 
     'the french bulldog is a small breed of domestic dog', 
     'french is a very french language spoken by the french']

pprint(rank(Q, D))   

[('the french bulldog is a small breed of domestic dog', -3.58351893845611),
 ('french is a very french language spoken by the french', -3.58351893845611),
 ('the french revolution was a period of upheaval in france',
  -4.276666119016055)]
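
The tie between the top two results can be checked by hand. Using the estimate from `rank()`, (count + 1) / (document length + 2), and writing d1, d2, d3 for the documents in the order they appear in `D` (each has 10 tokens), the scores work out as follows:

```latex
\begin{aligned}
\text{score}(d_2) &= \ln\frac{1+1}{10+2} + \ln\frac{1+1}{10+2} = \ln\frac{1}{36} \approx -3.584 \\
\text{score}(d_3) &= \ln\frac{3+1}{10+2} + \ln\frac{0+1}{10+2} = \ln\frac{1}{36} \approx -3.584 \\
\text{score}(d_1) &= \ln\frac{1+1}{10+2} + \ln\frac{0+1}{10+2} = \ln\frac{1}{72} \approx -4.277
\end{aligned}
```

In d3, the three occurrences of "french" exactly compensate for the missing "bulldog", which is why it ties with the document that contains both query terms.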


b) Probabilistic language models may encounter previously unseen query terms. Explain why this can become problematic and how you would address the issue. 

If a query term does not occur in a document, its term frequency there is zero, and so is its maximum-likelihood probability under that document's language model.

This is problematic because the query likelihood is a product over all query terms: a single zero factor gives the document a score of zero (a log-probability of negative infinity), regardless of how well it matches the other query terms.

One solution is smoothing: reserve a small amount of probability mass for unseen terms, for example with the add-one estimate used in a) or by interpolating the document model with a collection-wide language model. This avoids zero probabilities.
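
The effect can be illustrated with a small sketch. It is not the solution from class: the helper names `mle_query_likelihood` and `jm_query_likelihood`, the two-document collection, and the mixing weight `lam=0.5` are all assumptions made for illustration. The second function interpolates the document model with a collection model, which is one common way to smooth.

```python
from collections import Counter

def mle_query_likelihood(query_terms, doc_terms):
    """Unsmoothed query likelihood: product of maximum-likelihood term probabilities."""
    counts = Counter(doc_terms)
    likelihood = 1.0
    for term in query_terms:
        likelihood *= counts[term] / len(doc_terms)
    return likelihood

def jm_query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Query likelihood with linear interpolation of document and collection models.

    lam is a hypothetical mixing weight chosen for illustration only.
    """
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    likelihood = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms)
        p_coll = coll_counts[term] / len(collection_terms)
        likelihood *= lam * p_doc + (1 - lam) * p_coll
    return likelihood

docs = [
    "the french revolution was a period of upheaval in france".split(),
    "the french bulldog is a small breed of domestic dog".split(),
]
collection = [term for doc in docs for term in doc]
query = ["french", "bulldog"]

unsmoothed = [mle_query_likelihood(query, d) for d in docs]
smoothed = [jm_query_likelihood(query, d, collection) for d in docs]
print(unsmoothed)  # the first document scores exactly zero
print(smoothed)    # both documents now receive a non-zero score
```

Without smoothing, the first document is ruled out entirely even though it matches "french". With interpolation it stays rankable but still scores below the document containing both terms. Note that a term missing from the whole collection would still yield zero under this particular scheme.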



# 2. Relevance Feedback
Relevance Feedback allows us to refine the query representation after a round of user interaction with the search results. If organic feedback is not available, we can assume highly ranked documents to be *pseudo* relevant. Discuss the advantages and disadvantages of the pseudo relevance feedback scheme. Think in particular about single versus multiple rounds of feedback.

### Advantages
1. No user feedback required: Explicit feedback is hard to obtain because users rarely provide it and tend to simply try a new search query instead.
2. Improved queries as an automatic process with no additional input required: Pseudo-relevant documents can help improve the original query by adding and reweighting terms.

### Disadvantages
1. Assumption of relevance: Top-ranked documents are assumed to be relevant, but this is not always the case.
2. Query drift: Especially with multiple rounds of pseudo relevance feedback, the new queries drift away from the original intent of the user. Each round adds more noise to the query because some highly ranked documents are irrelevant, and the errors compound: a drifted query retrieves more off-topic documents, which are then used for the next round of feedback.
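
Query drift over multiple rounds can be made concrete with a toy sketch. Everything here is an assumption for illustration: the `score` function is a simple weighted term count rather than a real ranking model, and the parameters `k`, `n_new_terms` and `beta` as well as the three documents are hypothetical.

```python
from collections import Counter

def score(query_weights, doc_terms):
    """Toy relevance score: weighted count of query terms in the document."""
    counts = Counter(doc_terms)
    return sum(weight * counts[term] for term, weight in query_weights.items())

def pseudo_relevance_feedback(query_weights, docs, k=2, n_new_terms=2, beta=0.5):
    """One feedback round: treat the top-k documents as relevant and add
    their most frequent unseen terms to the query with weight beta."""
    ranked = sorted(docs, key=lambda d: score(query_weights, d), reverse=True)
    feedback_counts = Counter(term for doc in ranked[:k] for term in doc)
    expanded = dict(query_weights)
    added = 0
    for term, _ in feedback_counts.most_common():
        if added == n_new_terms:
            break
        if term not in expanded:
            expanded[term] = beta
            added += 1
    return expanded

docs = [
    "french bulldog puppy care".split(),
    "french revolution french history".split(),
    "puppy care training tips".split(),
]
query = {"french": 1.0, "bulldog": 1.0}

round_1 = pseudo_relevance_feedback(query, docs)
round_2 = pseudo_relevance_feedback(round_1, docs)
print(sorted(round_1))
print(sorted(round_2))
```

In this toy setup the first round adds the on-topic terms "puppy" and "care". The second round, still trusting the off-topic document about the French revolution in its top 2, pulls "revolution" and "history" into the query. A single round keeps the expansion closer to the original intent.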