# INFO 4271 - Exercise 5 - Learning to Rank

Issued: May 14, 2024

Due: May 27, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Search Result Diversification
Search result diversification trades off relevance against topical diversity.

Implement the missing functions sketched in the code base. As you decrease the `l` parameter from `1.0` to `0.0`, you will obtain increasingly diverse result lists.
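
The trade-off can be written down as a greedy selection rule. This is only a sketch with symbols introduced here for illustration, not taken from the exercise sheet: `S` is the set of documents already selected, `C` the remaining candidates, `rel(d)` the relevance score of document `d`, and `div(.)` a diversity measure of a result list. At each step a greedy re-ranker would add the candidate

```latex
d^{*} = \arg\max_{d \in C} \; \Big[\, l \cdot \mathrm{rel}(d) + (1 - l) \cdot \mathrm{div}\big(S \cup \{d\}\big) \Big]
```

With `l = 1.0` only relevance counts, and with `l = 0.0` only diversity counts.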

import re
# A non-diversified result list for the query "Jaguar". Each result list entry contains a short document and the corresponding relevance score.
ranked_list = [["The official home of Jaguar USA. Explore our luxury sedans, SUVs and sports cars.", 0.99],
		["Discover the different language sites we have to make browsing our vehicle range's easier.", 0.94],
		["Jaguar is the luxury vehicle brand of Jaguar Land Rover, a British multinational car manufacturer with its headquarters in Whitley, Coventry, England.", 0.86],
		["Jaguar has been making luxurious sedans and athletic sports cars for decades, but more recently it has added crossovers and SUVs that continue to perpetuate these trademark attributes.", 0.82],
		["This storied British luxury and sports car brand is famous for striking looks, agility, ride comfort, and powerful engines.", 0.80],
		["Used Jaguar for Sale. Search new and used cars, research vehicle models, and compare cars.", 0.79],
		["Jaguar is a premium automaker whose historic resonance is matched by few others.", 0.78],
		["What new Jaguar should you buy? With rankings, reviews, and specs of Jaguar vehicles, we are here to help you find your perfect car.", 0.76],
		["Some Jaguar models have supercharged V8 engines and sharp handling, from sports cars like the F-Type to sporty SUVs like the F-Pace.", 0.75],
		["In 2008, Tata Motors purchased both Jaguar Cars and Land Rover.", 0.73],
		["The jaguar (Panthera onca) is a large felid species and the only living member of the genus Panthera native to the Americas.", 0.72],
		["The Jaguar was an aircraft engine developed by Armstrong Siddeley.", 0.70],
		["Jaguar is a superhero first published in 1961 by Archie Comics. He was created by writer Robert Bernstein and artist John Rosenberger as part of Archie's 'Archie Adventure Series'.", 0.63],
		["Jaguar are an English heavy metal band, formed in Bristol, England, in December 1979. They had moderate success throughout Europe and Asia in the early 1980s, during the heyday of the new wave of British heavy metal movement.", 0.51],
		["The Atari Jaguar is a home video game console developed by Atari Corporation and released in North America in November 1993.", 0.47]]

# Measure the average relevance of a (partial) result list.
def measure_relevance(ranking):
	# An empty list has no relevance; this also avoids a division by zero.
	if not ranking:
		return 0.0

	relevance = 0.0
	for doc in ranking:
		relevance += doc[1]

	return relevance/len(ranking)

# Measure the average diversity of a (partial) result list.
# Count the number of unique terms across the whole ranked list and divide
# that number by the length of that list.
def measure_diversity(ranking):
	if not ranking:
		return 0.0

	unique_terms = set()
	for doc in ranking:
		# Strip punctuation, lower-case, and split the document into words.
		words = re.sub(r'[^\w\s]', '', doc[0]).lower().split()
		# One shared set counts a term that repeats across documents only once.
		unique_terms.update(words)

	return len(unique_terms)/len(ranking)

def get_highest_relevance(ranking):
	return max(ranking, key=lambda x: x[1])


# Re-rank an existing ranked list to increase diversity and return the top k items of that ranking.
# The parameter l controls the importance of relevance scores vs. diversity:
# l = 1.0 ranks by relevance only, l = 0.0 by diversity only.
def diversify(ranking, k, l):
	# Work on a copy so that the caller's list is not modified.
	candidates = list(ranking)
	reranked = []
	# Seed the result with the most relevant document.
	if candidates and k > 0:
		best = get_highest_relevance(candidates)
		candidates.remove(best)
		reranked.append(best)
	# Greedily add the candidate with the best combined score until k
	# documents are selected or no candidates remain.
	while candidates and len(reranked) < k:
		best_index, best_score = -1, float('-inf')
		for i, doc in enumerate(candidates):
			doc_score = l * doc[1] + (1-l) * measure_diversity(reranked + [doc])
			if doc_score > best_score:
				best_index, best_score = i, doc_score
		reranked.append(candidates.pop(best_index))

	return reranked

for doc in diversify(ranked_list, 2, 0):
	print(doc)

['The official home of Jaguar USA. Explore our luxury sedans, SUVs and sports cars.', 0.99]
['Jaguar are an English heavy metal band, formed in Bristol, England, in December 1979. They had moderate success throughout Europe and Asia in the early 1980s, during the heyday of the new wave of British heavy metal movement.', 0.51]
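
To see the effect of `l` more directly, the sketch below sweeps it over a small toy list. It is a self-contained illustration, not the exercise code: `toy_list`, `terms`, and `greedy_diversify` are hypothetical names. It also makes one assumption that differs from the cell above: diversity is measured as the fraction of a candidate's terms not yet covered by the selected documents, so that it lies on the same 0 to 1 scale as the relevance scores.

```python
import re

# Hypothetical toy data: [text, relevance score]
toy_list = [
    ["jaguar luxury cars", 0.9],
    ["jaguar sports cars", 0.8],
    ["jaguar big cat americas", 0.5],
]

def terms(text):
    """Strip punctuation, lower-case the text, and return its set of terms."""
    return set(re.sub(r"[^\w\s]", "", text).lower().split())

def greedy_diversify(ranking, k, l):
    """Greedily pick up to k documents, scoring each candidate as
    l * relevance + (1 - l) * novelty."""
    candidates = list(ranking)  # copy so the caller's list stays untouched
    selected, seen = [], set()
    while candidates and len(selected) < k:
        def score(doc):
            doc_terms = terms(doc[0])
            # ASSUMPTION: novelty = share of the document's terms not seen yet
            novelty = len(doc_terms - seen) / len(doc_terms) if doc_terms else 0.0
            return l * doc[1] + (1 - l) * novelty
        best = max(candidates, key=score)
        candidates.remove(best)
        selected.append(best)
        seen |= terms(best[0])
    return selected

for l in (1.0, 0.5, 0.0):
    print(l, [doc[0] for doc in greedy_diversify(toy_list, 2, l)])
```

On this toy list, `l = 1.0` returns the two car documents, while `l = 0.5` replaces the second car document with the big cat document, because most of its terms are new.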


# 2. Training Data Selection

You want to develop a supervised ranker in the following way:
* You index your collection.
* You formulate a training set of 100 queries.
* You use a basic statistical ranker such as BM25 to find the top 10 documents for each query.
* You ask human annotators to manually rate the relevance of each top-rated document.
* You use the resulting 1,000 relevance judgments to train your supervised ranker.

This scheme leaves you with three types of training examples:
* Documents that the human judges marked `relevant`.
* Documents that the human judges marked `non-relevant`.
* Documents that were not judged because they were not in the pre-retrieved top 10.


Which type(s) of examples do you include in your model training? Discuss the advantages and disadvantages of each type of training examples.

Answer:
Documents marked as relevant or non-relevant should both be included, since human judgments can be a more reliable signal of relevance than what a statistical ranker such as BM25 returns. These data points give the model clear positive and negative examples of what it should identify as relevant. The disadvantage is that they might not represent all relevant documents: only the BM25 top 10 were judged, and relevance is subjective and depends on the query.

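Including unjudged documents would increase the diversity of the training data, since they might include relevant documents that were missed by the initial ranking. The disadvantage is that they carry no label, so any label assigned to them (for example, treating them as non-relevant) is a guess that can add noise to the training data.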
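
A minimal sketch of how the three example types could be assembled into one training set. All names (`judgments`, `unjudged_pool`, `build_training_set`, `unjudged_weight`) are hypothetical. Treating a sample of unjudged documents as down-weighted negatives is an assumption made for illustration, one possible heuristic rather than a prescribed method.

```python
import random

def build_training_set(judgments, unjudged_pool, n_unjudged=2,
                       unjudged_weight=0.5, seed=0):
    """Return a list of (query, doc_id, label, weight) tuples.

    judgments:     {query: {doc_id: True/False}} from the human annotators
    unjudged_pool: {query: [doc_id, ...]} documents outside the judged top 10
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    examples = []
    for query, labels in judgments.items():
        # Judged documents keep their human label at full weight.
        for doc_id, relevant in labels.items():
            examples.append((query, doc_id, int(relevant), 1.0))
        # Unjudged documents have no label. ASSUMPTION: a small random sample
        # is used as negatives with reduced weight, since some may be relevant.
        pool = [d for d in unjudged_pool.get(query, []) if d not in labels]
        for doc_id in rng.sample(pool, min(n_unjudged, len(pool))):
            examples.append((query, doc_id, 0, unjudged_weight))
    return examples

# Hypothetical toy input
judgments = {"jaguar": {"d1": True, "d2": False}}
unjudged_pool = {"jaguar": ["d3", "d4", "d5"]}
for example in build_training_set(judgments, unjudged_pool):
    print(example)
```

Lowering `unjudged_weight` limits the damage when a sampled document is in fact relevant, and setting `n_unjudged=0` trains on human judgments only.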