# Submission instructions

In this assignment, we will create ranking functions (Part 1) and investigate pointwise and pairwise learning-to-rank approaches (Part 2).

The HW is due **Friday, December 3, 2021 @ 11:59 pm**. You can work individually or in teams of two or three students. Team size makes no difference to grading: every submission is graded the same way. **Only one of the team members needs to submit the HW.**

Please submit a zip file containing the two parts of the assignment:

1. **Ranking Code (47%)**: Make sure to finish the ranking.ipynb Jupyter notebook first.
2. **LTR Code (53%)**: In the learning-to-rank.ipynb notebook, we will implement LTR models. Submit both Jupyter notebooks.


### HW3 - Learning to Rank (53% of total HW3 grade)

In the first part of this assignment, we examined various ways of ranking documents given a query; however, weights for different features were not learned automatically but set manually. As more and more ranking signals are investigated, integrating more features becomes challenging as it would be hard to come up with a single ranking function like BM25 for arbitrary features. 

In this assignment, you will be investigating different approaches to the learning to rank task that you have learned: (1) the pointwise approach using linear regression and (2) the pairwise approach employing gradient boosted decision trees. The goal is to let these algorithms learn weights automatically for various features. 

More specifically, it involves the following tasks (weights are for the programming assignment as a whole):
* [Task 1: Pointwise Approach and Linear Regression (10%)](#Task-1:-Pointwise-Approach-and-Linear-Regression-(10%)): Implement a pointwise approach with linear regression based on basic tf-idf features
* [Task 2: Pairwise Approach and Gradient Boosted Decision Trees (15%)](#Task-2:-Pairwise-Approach-and-Gradient-Boosted-Decision-Trees-(15%)): Implement an instance of the pairwise approach with the help of gradient boosted decision trees, using basic tf-idf features
* [Task 3: Adding More Features (20%)](#Task-3:-Adding-More-Features-(20%)): Train your best model, and experiment with more features such as BM25, Smallest Window, and PageRank
* [Task 4: Report (8%)](#Task-4:-Report-(8%)): Write up a summary report and answer some questions about the above tasks


__Grading__
- Part of your grade will be based on your model's performance on a hidden test set. 
- You will get full credit for solutions whose NDCG scores are within a reasonable range of the NDCG scores obtained by the TA.

## Setup

The `base_classes` folder contains useful class definitions (not to be edited).

In [13]:
import os
try: 
    os.mkdir('base_classes')
except FileExistsError:
    pass

You can add additional imports below as required.

In [14]:
# You can add additional imports here

import sys
import pickle as pkl
import array
import os
import timeit
import contextlib
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from collections import Counter
from collections import OrderedDict
import math

import xgboost as xgb

# Data

**This dataset is the same as what you used in the first part of this HW.**

As in the first part of the HW, we have partitioned the data into two sets for you: 
1. Training set (hw3.(signal|rel).train)
2. Development set (hw3.(signal|rel).dev)


## Loading previous code

We load the AScorer class that you completed in the first part of HW3. 

**Note that you may need to make updates to this class for completing the tasks in this notebook.**

We also load the Idf class that you can use to get document frequency values based on the corpus. You will also need to load the Rank class for the computation of NDCG scores on the tasks below.

In [15]:
from base_classes.load_train_data import load_train_data
from base_classes.id_map import IdMap
from base_classes.ndcg import NDCG
from base_classes.query import Query
from base_classes.document import Document
from base_classes.ascore import AScorer
from base_classes.build_idf import Idf
from base_classes.rank import Rank

# Task 1: Pointwise Approach and Linear Regression (10%)

In ranking, each query $q_i$ will be associated with a set of documents, and for each document $j$, we extract a query-document feature vector $x_{i,j}$. There is also a label $y_{i,j}$ associated with each query-document vector $x_{i,j}$.

In the pointwise approach, such group structure in ranking is ignored, and we simply view our training data as $\{(x_{i}, y_{i})\}$, where each instance consists of a query-document feature vector $x_{i}$ and a label $y_{i}$ (which is a relevance score as in the first part of this programming assignment). The ranking problem amounts to learning a function $f$ such that $f(x_{i})$ closely matches $y_{i}$.

In this task, we consider a very simple instance of the pointwise approach, the *linear regression* approach. That is, we will use a linear function $f$ which gives a score to each query-document feature vector $x$ as follows: $f(x) = wx+b$. Here, the weight vector ${w}$ and the bias term $b$ are parameters that we need to learn to minimize the loss function as defined below:
\begin{equation}
\sum_{i=1}^m (f(x_{i})-y_{i})^2
\end{equation}
This formulation is also referred to as the *ordinary least squares* approach.
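
To make the loss above concrete, here is a minimal sketch of ordinary least squares for the special case of a single feature, where the minimizer has a simple closed form. The function names `fit_ols_1d` and `squared_loss` and the toy data are illustrative assumptions, not part of the assignment code; the assignment itself uses five features and a library solver.

```python
def fit_ols_1d(xs, ys):
    """Closed-form least-squares fit of f(x) = w*x + b for one feature.

    Assumes xs contains at least two distinct values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def squared_loss(xs, ys, w, b):
    """Sum of squared errors, i.e. the loss that OLS minimizes."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_ols_1d(xs, ys)
print(w, b, squared_loss(xs, ys, w, b))
```

Any other choice of `w` and `b` gives a larger value of `squared_loss` on the same data, which is what "minimizing the loss" means here.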

### 1.1: Designing Feature Vectors

Represent each query-document pair as a five-dimensional vector of query vector-document vector (tf-idf) scores. Each dimension corresponds to a document field -- url, title, header, body, and anchor. Specifically, given a query vector $q$ and a document vector $d_{f}$ of a document field $f$, the tf-idf score is the dot product $q \cdot d_{f}$. 

To start with, use query and document vectors with lnn.ltc weighting (as represented in SMART notation ddd.qqq). In other words, begin by using:

1) For the document vectors, "lnn":
    - logarithmic term frequency of query terms in documents
    - no document frequency 
    - no normalization
2) For the query vector, "ltc":
    - logarithmic term frequency for words in query
    - idf (inverse document frequency)
    - cosine (i.e., L2) normalization
    
Then, experiment with a few weighting schemes other than lnn.ltc.  Refer to the figure below for other possible weighting schemes. You will report which weighting scheme yields the best performance in Task 4.

<img src="fig/IIR_fig_6.15.png">
Figure is from Pg.128 http://nlp.stanford.edu/IR-book/pdf/06vect.pdf
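
As a rough illustration of lnn.ltc weighting for a single document field, the sketch below computes the dot product $q \cdot d_{f}$ from token lists. It makes several assumptions: `lnn_ltc_score` is a hypothetical helper, `idf` is a plain dictionary standing in for the Idf class (whose interface is not shown here), logarithms are base 10, and terms missing from `idf` get weight 0.

```python
import math
from collections import Counter


def lnn_ltc_score(query_terms, field_terms, idf):
    """Dot product of an ltc query vector and an lnn document-field vector."""
    q_tf = Counter(query_terms)
    d_tf = Counter(field_terms)

    # Query side (ltc): logarithmic tf, multiplied by idf ...
    q_weights = {t: (1 + math.log10(c)) * idf.get(t, 0.0) for t, c in q_tf.items()}
    # ... followed by cosine (L2) normalization.
    norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if norm > 0:
        q_weights = {t: w / norm for t, w in q_weights.items()}

    # Document side (lnn): logarithmic tf, no idf, no normalization.
    # Terms that do not occur in the field contribute nothing.
    return sum(
        w * (1 + math.log10(d_tf[t]))
        for t, w in q_weights.items()
        if d_tf[t] > 0
    )


# Hypothetical toy example: "a" occurs 10 times in the field, "b" never.
toy_idf = {"a": 1.0, "b": 1.0}
score = lnn_ltc_score(["a", "b"], ["a"] * 10, toy_idf)
print(score)
```

Computing this score once per field (url, title, header, body, anchor) would give the five dimensions of the feature vector.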


A few important notes:
- Creating these vectors is similar to the exercise you performed in computing cosine similarity in the first part of this programming assignment
- Modify the AScorer class to implement other weighting schemes
- **You will use these basic feature vectors for both Task 1 and Task 2. Do not use any other signals or features for Tasks 1 and 2; you will have the opportunity to use additional features in Task 3.**



In [36]:
def get_features (signal_file, idf):
    '''
    Create a feature vector from the signal file and the idf object.

    Args:
        signal_file: filepath to signal file
        idf: object of class Idf (with idf built)

    Returns:
        feature_vec (numpy array of dimension (N, 5)): N is the number of (query, document)
        pairs in the signal file.
    '''

    # Experiment with different values of weighting below. Note that this uses dddqqq notation.
    # Make sure to set weighting to the best value prior to submitting your code.
    # You should be able to support lnn.ltc weighting, along with any other weighting that you experiment with


    WEIGHTING = 'lnnltc' 

    assert len(WEIGHTING) == 6, "Invalid weighting scheme."        

    feature_vec = []

    ### Begin your code

    ### End your code

    return feature_vec


def get_relevance (relevance_file):
    '''
    Extract relevance scores from the relevance file. This should be a simple wrapper (<10 lines) over
    the get_rel_scores() function in the NDCG class.

    Args:
        relevance_file: filepath to relevance file

    Returns:
        relevance_vec (numpy array of dimension (N,)): N is the number of (query, document)
        pairs in the relevance file.   
        ndcg_obj: NDCG object which contains relevance scores
    '''  


    relevance_vec = []
    ndcg_obj = NDCG()

    ### Begin your code

    ### End your code

    return relevance_vec, ndcg_obj   
    

### 1.2: Training a Linear Regression Model

Implement the PointwiseLearner class below. You may use the LinearRegression class from the sklearn package. If you do, set fit_intercept to True and normalize to False. (In scikit-learn 1.2 and later the normalize parameter no longer exists, so omit it there.)

In [39]:
class PointwiseLearner:
    
    def __init__(self):
        self.model = None

    def train_model (self, x, y):
    
        '''
        - Train your linear regression model using the LinearRegression class 

        Args:
                x (numpy array of dimension (N, 5)): Feature vector for each query, document pair. 
                Dimension is N x 5, where N is the number of query, document pairs. 
                Is the independent variable for linear regression. 

                y (numpy array of dimension (N,)): Relevance score for each query, document pair. 
                Is the dependent variable for linear regression.

        Returns: none
        '''
        ### Begin your code

        ### End your code
    
    def predict_model (self, x):
    
        '''
        - Output predicted scores based on the trained model.

        Args:
                x (numpy array of dimension (N, 5)): Feature vector for each query, document pair. 
                Dimension is N x 5, where N is the number of (query, document) pairs. 
                Predictions are made on this input feature array.

        Returns:
                y_pred (numpy array of dimension (N,)): Predicted relevance scores for each query, document pair
                based on the trained linear regression model.
        '''
        ### Begin your code

        ### End your code
    

In [40]:
lm = PointwiseLearner()

idf = Idf()

#Get train features and relevance

train_signal_file = "data/hw3.signal.train"
train_rel_file = "data/hw3.rel.train"
train_features = get_features(train_signal_file, idf)
train_relevance, train_ndcg = get_relevance(train_rel_file)
assert train_features.shape[1] == 5, 'Train features are of incorrect shape. They should be 5 dimensions, but got {}'.format(train_features.shape[1])

#Train linear regression model

lm.train_model(train_features, train_relevance)

# Get predictions on dev set.
dev_signal_file = "data/hw3.signal.dev"
dev_rel_file = "data/hw3.rel.dev"
dev_features = get_features(dev_signal_file, idf)
dev_relevance, dev_ndcg =  get_relevance(dev_rel_file)
dev_predicts = lm.predict_model(dev_features)

Total Number of Docs is 98998
Total Number of Terms is 347071


AttributeError: 'list' object has no attribute 'shape'

Make sure your code passes the sanity check below.

In [18]:
assert dev_features.shape[1] == 5, 'Dev features are of incorrect shape. They should be 5 dimensions, but got {}'.format(dev_features.shape[1])

NameError: name 'dev_features' is not defined

## Evaluation

Using the predictions from your trained model, compute the mean squared error and NDCG score that you receive. 

Include the score you received in your report. 
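
For intuition about what the NDCG score measures, here is a minimal sketch for a single query. It assumes the common gain $(2^{rel} - 1)$ with a $\log_2$ position discount and returns 1.0 when no document is relevant; the provided NDCG class may use a different formula or convention, so treat the numbers as illustrative only. The names `dcg` and `ndcg` are hypothetical helpers.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances)
    )


def ndcg(true_rels, predicted_scores):
    """NDCG for one query: rank documents by predicted score, compare to the ideal order."""
    order = sorted(
        range(len(true_rels)), key=lambda i: predicted_scores[i], reverse=True
    )
    ranked_rels = [true_rels[i] for i in order]
    ideal = dcg(sorted(true_rels, reverse=True))
    # Assumption: a query with no relevant documents scores 1.0.
    return dcg(ranked_rels) / ideal if ideal > 0 else 1.0


# Toy labels and scores (hypothetical).
perfect = ndcg([3, 2, 0], [0.9, 0.5, 0.1])
reversed_order = ndcg([3, 2, 0], [0.1, 0.5, 0.9])
print(perfect, reversed_order)
```

Note that NDCG depends only on the order induced by the predicted scores, not on their values, which is why a model with a mediocre mean squared error can still rank well.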

In [19]:
def NDCG_calc_for_LTR (dev_ndcg, dev_predicts, out_file="ranked_result_default"):

    ''' We provide this function to calculate the average NDCG score given a predicted score and a ground truth score.
        Note that the code below calls rank_with_score() in the Rank class, so the correct value for NDCG 
        depends on the correct implementation of that function.
         Args:
                dev_ndcg (type NDCG): Object that contains the "ground truth" relevance scores in dev_ndcg.rel_scores 
                dev_predicts: numpy array of dimension (N,) which contains predicted scores for a dataset.
                out_file: filename to which the ranked_result_file is written
            
        Returns: avg_ndcg_score: Scalar that averages NDCG score across all queries. 
    
    '''
    idx = 0
    dev_predicts_dict = {}

    #Converts the dev_predicts vector into query->url->score dict
    for query, url_dict in dev_ndcg.rel_scores.items():
        query_obj = Query(query) #Converts str to Query object
        dev_curr_dict = {}
        for url in url_dict.keys():
            dev_curr_dict[url] = dev_predicts[idx]
            idx+=1
        dev_predicts_dict[query_obj] = dev_curr_dict

    #Orders dev_predicts_dict. This remains a Query->url->score dict after ordering.
    #Note that this depends on your implementation of the rank_with_score() function in the Rank class.
    r = Rank()
    dev_predicts_dict_ordered = r.rank_with_score(dev_predicts_dict)

    #Creates a Query->Document->score dict called dev_predicts_ranks that will be written to file.
    dev_data = load_train_data(dev_signal_file) #Query->Document dict

    dev_predicts_ranked = {} #The Query->Document->Score dict that will be written to file.
    for query in dev_predicts_dict_ordered:
        doc_to_score = {}
        for url in dev_predicts_dict_ordered[query]:
            doc = dev_data[query][url]
            doc_to_score[doc] = dev_predicts_dict_ordered[query][url]
        dev_predicts_ranked[query] = doc_to_score

    #Writes dev_predicts_ranked to file.
    if not os.path.exists("output"): os.mkdir("output")
    ranked_result_file = os.path.join("output", out_file)
    r.write_ranking_to_file(dev_predicts_ranked, ranked_result_file)

    #Uses the NDCG class to get the NDCG score
    dev_ndcg.read_ranking_calc(ranked_result_file)
    avg_ndcg_score = dev_ndcg.get_avg_ndcg()
    return avg_ndcg_score


In [20]:
# Compute mean squared error and NDCG Score

mse = mean_squared_error(dev_relevance, dev_predicts)

print ("Mean Squared Error:", mse)

print ("Average NDCG score:", NDCG_calc_for_LTR(dev_ndcg, dev_predicts, "ranked_result_pointwise"))

NameError: name 'dev_relevance' is not defined

# Task 2: Pairwise Approach and Gradient Boosted Decision Trees (15%)

We next use the LambdaMART algorithm to implement Gradient Boosted Decision Trees. 

LambdaMART is the boosted tree version of an earlier algorithm, LambdaRank. The full evolution of algorithms from RankNet through LambdaRank and MART to LambdaMART is presented in the paper linked below (pages 16 and 17 are particularly important).
https://pdfs.semanticscholar.org/0df9/c70875783a73ce1e933079f328e8cf5e9ea2.pdf

The relevant lecture notes are **CS5604LearningToRank.pdf** accessible at [Canvas Files](https://canvas.vt.edu/courses/136044/files?preview=20605783)

We can use the XGBoost package to implement LambdaMART. You may find it helpful to read the documentation here: https://xgboost.readthedocs.io/en/latest/get_started.html

#### Parameter description (not exhaustive, see here for more details): https://xgboost.readthedocs.io/en/latest/parameter.html

General Parameters (**make sure to use the following values**):
- "booster": use "gbtree". Uses a tree-based model for boosting
- "objective": use "rank:pairwise". Uses the LambdaMART algorithm to minimize pairwise loss. 
- "eval_metric": use "ndcg" (while we will be evaluating your performance solely based on ndcg, feel free to test performance on other metrics)

Hyperparameters to be tuned (not exhaustive):
- "eta": Learning rate
- "gamma": Minimum loss reduction required to make a further partition on a leaf node of the tree
- "max_depth": Maximum depth of a tree
- "subsample": Subsample ratio of training instances to prevent overfitting

When training, you should also experiment with early stopping to prevent overfitting. Take a look at the description of early stopping here: https://xgboost.readthedocs.io/en/latest/python/python_intro.html
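
To make the idea of a pairwise objective concrete, the sketch below computes a RankNet-style pairwise logistic loss over query groups. It is an illustration under stated assumptions, not XGBoost's internal implementation: `pairwise_logistic_loss` is a hypothetical helper, and `group_sizes` plays the same role as the group list passed to `set_group`, i.e. consecutive blocks of the flat score and label arrays belong to the same query.

```python
import math


def pairwise_logistic_loss(scores, labels, group_sizes):
    """Average logistic loss over all document pairs (i, j) from the same
    query where document i has a higher relevance label than document j."""
    total, n_pairs, start = 0.0, 0, 0
    for size in group_sizes:
        end = start + size
        # Pairs are only formed inside one query group, never across queries.
        for i in range(start, end):
            for j in range(start, end):
                if labels[i] > labels[j]:
                    # Small when the more relevant document gets the higher score.
                    total += math.log1p(math.exp(-(scores[i] - scores[j])))
                    n_pairs += 1
        start = end
    return total / n_pairs if n_pairs else 0.0


# Two queries with two documents each (toy data).
labels = [1, 0, 0, 1]
groups = [2, 2]
good = pairwise_logistic_loss([2.0, 0.0, 0.0, 1.0], labels, groups)
bad = pairwise_logistic_loss([0.0, 2.0, 1.0, 0.0], labels, groups)
print(good, bad)
```

Scores that order each query's documents correctly give a lower loss than scores that invert them, and the absolute score values only matter through within-query differences.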

In [41]:
train_query_dict = load_train_data(train_signal_file)
train_groups = []
for query, url_dict in train_query_dict.items():
    train_groups.append(len(url_dict))
    
assert len(train_groups) == 700, 'Expected 700 queries, but got {}'.format(len(train_groups))

dev_query_dict = load_train_data(dev_signal_file)
dev_groups = []
for query, url_dict in dev_query_dict.items():
    dev_groups.append(len(url_dict))
    
assert len(dev_groups) == 100, 'Expected 100 queries, but got {}'.format(len(dev_groups))

dtrain = xgb.DMatrix(train_features, label = train_relevance)
dtrain.set_group(train_groups)
ddev = xgb.DMatrix(dev_features, label = dev_relevance) 
ddev.set_group(dev_groups)



NameError: name 'dev_signal_file' is not defined

In [42]:
class GBDTLearner:
    
    def __init__(self):
        self.params = None
        self.model = None

    def train_model (self, dtrain, evallist):
    
        '''
        - Specifies parameters for XGBoost training
        - Trains model

        Args:
                dtrain (type DMatrix): DMatrix is an internal data structure used by XGBoost 
                which is optimized for both memory efficiency and training speed.
                
                evallist (array of tuples): The datasets on which the algorithm reports performance as training takes place
                

        Returns: none
        '''
        num_rounds = 10 #Experiment with different values of this parameter
        
        ### Begin your code
        
        self.params = {}
        self.params["booster"] = "gbtree"
        self.params["objective"] = "rank:pairwise"
        self.params["eval_metric"] = "ndcg"
        self.params["eta"] = 1/(num_rounds*num_rounds)
        self.params["gamma"] = 1/(num_rounds*num_rounds)
        self.params["max_depth"] = num_rounds
        self.params["subsample"] = 1
        
        self.model = xgb.train(
            self.params,
            dtrain=dtrain,
            evals=evallist,
            early_stopping_rounds=num_rounds,
            num_boost_round=num_rounds)

        ### End your code
    
    def predict_model (self, dtest):
    
        '''
        - Output predicted scores based on the trained model.

        Args:
                dtest (type DMatrix): DMatrix that contains the dev/test signal data

        Returns:
                y_pred (numpy array of dimension (N,)): Predicted relevance scores for each query, document pair
                based on the trained  model.
        '''
        ### Begin your code
        
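        # Use only the trees up to the best iteration found by early stopping
        # (best_iteration is set by xgb.train when early_stopping_rounds is given).
        return self.model.predict(
            dtest, iteration_range=(0, self.model.best_iteration + 1))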

        ### End your code




In [43]:
#Train a gradient boosted decision trees model.

model = GBDTLearner()
evallist = [(dtrain, 'train')]
model.train_model(dtrain, evallist)

# Get predictions on dev set.

dev_predicts_gbdt = model.predict_model(ddev)

NameError: name 'dtrain' is not defined

In [44]:
print ("Average NDCG score:", NDCG_calc_for_LTR(dev_ndcg, dev_predicts_gbdt, "ranked_result_gbdt"))

NameError: name 'dev_ndcg' is not defined

# Task 3: Adding More Features (20%)

Putting it all together! In this part, train your best model - and feel free to use additional features! Experiment with the following to see which yields the best performance on the dev set:

1. Using the smallest window feature from the first part of this programming assignment
2. Using PageRank from the idf file

In addition, you may also choose to experiment with using word vectors. We provide GloVe embeddings for the words in our vocabulary, which you can download with the help of embedding.py in the base_classes folder. (Disclaimer: downloading might be slow)

You may choose to write several helper functions as required.
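
As one possible helper, the sketch below computes a smallest-window value with a sliding window over a tokenized field. It assumes the window is the length, in tokens, of the shortest span containing every distinct query term, and returns infinity when some term is missing; the definition used in the first part of the assignment may differ in details. `smallest_window` is a hypothetical name.

```python
from collections import Counter


def smallest_window(query_terms, tokens):
    """Length of the shortest token span containing all distinct query terms,
    or float('inf') if the tokens do not contain every term."""
    needed = set(query_terms)
    if not needed:
        return float("inf")
    counts = Counter()   # occurrences of each needed term inside the window
    covered = 0          # number of distinct needed terms inside the window
    best = float("inf")
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # Shrink from the left while the window still covers all terms.
        while covered == len(needed):
            best = min(best, right - left + 1)
            left_tok = tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    covered -= 1
            left += 1
    return best


print(smallest_window(["a", "b"], ["a", "x", "x", "b", "a"]))
```

An infinite window is awkward as a model input, so in practice you would likely map it to a finite feature, for example a large constant or an inverse such as 1/window.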

In [None]:
class BestModel:
    
    def __init__(self):
        ### Begin your code
        pass  # Placeholder so the class definition parses; replace with your own setup
        ### End your code
   
    # You may choose to write other helper functions below 
    # (such as to augment feature array with additional features)
    
    ### Begin your code

    ### End your code
    
    
    def train_and_predict(self, train_signal_file, train_rel_file, test_signal_file, idf):
    
        '''
        - Receives the training signal and relevance files as parameters
        - Creates a feature vector associated with the signal file
        - Trains the best possible model on the training data
        - Using the trained model, makes a prediction on the test_signal_file

        Args:
            train_signal_file: filename of training signal
            train_rel_file: filename of training relevance file
            test_signal_file: filename containing dev/test signal
            idf: object of class Idf, containing a fully built idf dictionary

        Returns:
            test_predictions: predicted scores for each (query, document) pair in test_signal_file
        '''
        test_predictions = []
    
        ### Begin your code

        ### End your code
        
        return test_predictions

In [None]:
model = BestModel()
idf = Idf()
train_signal_file = "data/hw3.signal.train"
train_rel_file = "data/hw3.rel.train"
dev_signal_file = "data/hw3.signal.dev"

dev_predicts_best = model.train_and_predict(train_signal_file, train_rel_file, dev_signal_file, idf)

dev_rel_file = "data/hw3.rel.dev"
dev_relevance, dev_ndcg = get_relevance(dev_rel_file)

print ("Average NDCG score:", NDCG_calc_for_LTR(dev_ndcg, dev_predicts_best, "ranked_result_best"))

# Task 4: Report (8%)

This section is meant to be relatively more open-ended as you describe the model choices you made in this assignment. **Please keep your report concise.** Be sure to document any design decisions you made, and provide a brief rationale for them. 

You may choose to insert cells below to generate tables or plots if required.

### A. Design of feature vectors (Task 1 and 2) (1%)

For each (query, document) pair, in designing your feature vector from query vector and document vectors, you had various possible options for (i) term frequency, (ii) document frequency and (iii) normalization. The default option we recommended you start with for the feature vector is lnn.ltc (using the SMART notation ddd.qqq).

What other choices did you experiment with? How did the performance compare across these choices? What might be the rationale for this difference in performance across the various models?

> We used **tf-idf** scores as the feature vector for each query-document pair. We also experimented with raw term frequency, and in our experiments tf-idf gave better performance. With raw term frequency, a term that occurs 20 times is treated as twenty times as significant, which is not always true; a term that occurs less often may carry more significance. A feature based on raw term frequency alone therefore does not guarantee good performance. tf-idf gives us a basic metric for extracting the most descriptive terms in a document, and it lets us easily compute the similarity between two documents.

### B. Hyperparameter tuning  (Task 2) (1%)

Briefly describe the hyperparameters you tuned for your implementation of XGBoost. 
Which hyperparameters were most consequential to the performance of the model?

Provide an intuition, based on your understanding of the LambdaMART algorithm, for why the performance of the model varied as it did with the hyperparameters you tuned.

> We trained XGBoost with the following hyperparameters: <br>
"eta" : 0.01<br> "gamma" : 0.01<br> "max_depth" : 10<br> "subsample" : 1<br>
We believe **eta, gamma, and subsample** were the most consequential to the performance of the model: eta is the learning rate, gamma is the minimum loss reduction required to split a leaf (we kept this value small), and subsample is the ratio of training instances sampled to prevent overfitting; the value we used did not cause our model to overfit.<br><br>
From our understanding of the LambdaMART algorithm, for a given learning rate each update is driven by the gradient of the cross-entropy loss multiplied by the pairwise change in the target metric. We set the learning rate and the loss reduction to 1/(number of boost rounds × number of boost rounds) rather than 1/(number of rounds), and we think this is why the performance of our model varied.


### C. Model Design and Ablation Analysis (Task 3) (3%)

You had the option to include various additional features in your model design. Which features did you experiment with? Which features did you end up using in your final model, and why? 

We expect ablation analysis on which features provided useful signals and which ones did not. Please include at least 1-2 plots and/or tables for this question.
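
One simple way to organize an ablation is leave-one-out: score the full feature set, then drop one feature at a time and record the change. The sketch below assumes a hypothetical `evaluate` callable that trains a model on the given feature names and returns its dev NDCG; here it is replaced by a toy stand-in so the example runs on its own, and the numbers it produces are not real results.

```python
def leave_one_out_ablation(feature_names, evaluate):
    """Return rows of (setting, score, change vs. full model)."""
    full_score = evaluate(feature_names)
    rows = [("all features", full_score, 0.0)]
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        score = evaluate(kept)
        rows.append((f"without {name}", score, score - full_score))
    return rows


def toy_evaluate(features):
    """Stand-in for 'train on these features and return dev NDCG'.
    The scoring rule is invented purely for illustration."""
    return 0.1 * len(features) + (0.3 if "bm25" in features else 0.0)


rows = leave_one_out_ablation(["tfidf", "bm25", "pagerank"], toy_evaluate)
for setting, score, delta in rows:
    print(f"{setting:20s} {score:.3f} {delta:+.3f}")
```

The feature whose removal causes the largest drop is the one the model relies on most; a removal that leaves the score unchanged or improves it suggests the feature is not providing useful signal.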

> Your answer here

### D. Error Analysis (Task 3) (3%)

Analyze your errors for the best performing model you trained. Please include 1-2 plots and/or tables for this question. 

> Your Answer Here

### Congratulations, you have finished HW3 part 2!