## Applied LTR, Part I.

1. **Extract features** for all document-query pairs from the qrels (i.e., all documents with relevance assessments).
Use the following features (all are retrieval scores that you have computed for Assignment 1; we do not apply any normalization here):  
  - BM25 retrieval score of the query against each field (title, content)
  - LM retrieval score of the query against each field (title, content) using Jelinek-Mercer smoothing
  - LM retrieval score of the query against each field (title, content) using Dirichlet smoothing

2. **Train and evaluate the model using 5-fold cross-validation**

In [1]:
from elasticsearch import Elasticsearch

INDEX_NAME = "aquaint"
DOC_TYPE = "doc"

es = Elasticsearch()

In [18]:
QUERY_FILE = "data/queries.txt"  # make sure the query file exists at this location
QRELS_FILE = "data/qrels.csv"  # file with the relevance judgments (ground truth)

In [47]:
FEATURES_FILE = "data/features.txt"  # output the features in this file
OUTPUT_FILE = "data/ltr.txt"  # output the ranking

## Utility functions

#### Load queries

In [4]:
def load_queries(query_file):
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin.readlines():
            qid, query = line.strip().split(" ", 1)
            queries[qid] = query
    return queries

queries = load_queries(QUERY_FILE)
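
As a quick illustration of the file layout that `load_queries` expects, the sketch below parses two made-up lines in the same way. The query IDs and texts are hypothetical and not taken from the actual query file; the only assumption is that each line holds a query ID, a single space, and then the query text.

```python
# Hypothetical sample lines; the real queries.txt content may differ.
sample_lines = [
    "901 learning to rank\n",
    "902 language model smoothing\n",
]

parsed = {}
for line in sample_lines:
    # Split only on the first space so that the query text stays intact.
    qid, query = line.strip().split(" ", 1)
    parsed[qid] = query

print(parsed)
```

Note that query IDs stay strings, which matches how they are used as dictionary keys elsewhere in the notebook.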

#### Load qrels

In [23]:
def load_qrels(qrels_file):
    gt = {}  # holds a list of relevant documents for each queryID
    with open(qrels_file, "r") as fin:
        header = fin.readline().strip()
        if header != "queryID,docIDs":
            raise Exception("Incorrect file format!")
        for line in fin.readlines():
            qid, docids = line.strip().split(",")
            gt[qid] = docids.split()
    return gt
            
qrels = load_qrels(QRELS_FILE)
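
To make the qrels layout concrete, here is a minimal sketch that parses an in-memory sample with the same `queryID,docIDs` header. The query and document IDs are hypothetical. Unlike `load_qrels`, this variant stores the relevant documents as a set, which is an assumption on our side: it may help because relevance lookups are repeated for every query-document pair later on.

```python
import io

# Hypothetical qrels content in the "queryID,docIDs" layout checked by load_qrels.
sample = io.StringIO(
    "queryID,docIDs\n"
    "901,DOC-A DOC-B DOC-C\n"
    "902,DOC-D\n"
)

gt = {}
header = sample.readline().strip()
if header != "queryID,docIDs":
    raise ValueError("Incorrect file format!")
for line in sample:
    qid, docids = line.strip().split(",")
    # A set gives constant-time "is this document relevant?" checks.
    gt[qid] = set(docids.split())

print("DOC-B" in gt["901"], "DOC-B" in gt["902"])
```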

## Step 1) Creating training data and writing it to a file

### Extracting features for query-document pairs

We have 6 features in total. Each feature here is a retrieval score, which we obtain using a different ES configuration.

In [9]:
ES_CONFIG = {
    1: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    },
    2: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "BM25", 
                "b": 0.75, 
                "k1": 1.2
            } 
        }
    },    
    3: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "LMDirichlet", 
                "mu": 200  # small for title
            } 
        }
    },
    4: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "LMDirichlet", 
                "mu": 2000  # larger for content
            } 
        }
    },
    5: {
        "field": "title",
        "similarity": {
            "default": {
                "type": "LMJelinekMercer", 
                "lambda": 0.1  
            } 
        }
    },
    6: {
        "field": "content",
        "similarity": {
            "default": {
                "type": "LMJelinekMercer", 
                "lambda": 0.1  
            } 
        }
    }    
}

In [6]:
import time

Collecting feature values in the `features` dict. It has the structure `features[qid][docid][fid] = value`, where fid is a feature ID (1..6).
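
The nested structure can be sketched in isolation as follows. This is only an illustration with hypothetical IDs and scores; it uses `collections.defaultdict` instead of the explicit `if ... not in ...` checks, which is a stylistic alternative and not what the notebook cell itself does.

```python
from collections import defaultdict

# features_demo[qid][docid][fid] = value, with inner dicts created on first access.
features_demo = defaultdict(lambda: defaultdict(dict))

# Hypothetical (qid, docid, fid, score) tuples standing in for search hits.
hits = [
    ("901", "DOC-A", 1, 3.2),
    ("901", "DOC-A", 2, 7.5),
    ("901", "DOC-B", 2, 6.1),
]
for qid, docid, fid, score in hits:
    features_demo[qid][docid][fid] = score

# DOC-B was only "retrieved" for feature 2, so the other feature IDs are absent.
print(features_demo["901"]["DOC-A"])
print(sorted(features_demo["901"]["DOC-B"]))
```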

In [49]:
features = {}

for fid in range(1, len(ES_CONFIG) + 1):
    print("Computing values for feature #", fid)
    # Set ES similarity config
    es.indices.close(index=INDEX_NAME)
    es.indices.put_settings(index=INDEX_NAME, body={"similarity": ES_CONFIG[fid]["similarity"]})
    es.indices.open(index=INDEX_NAME)

    time.sleep(1)  # wait until it takes effect

    for qid, query in queries.items():
        if qid not in features:
            features[qid] = {}
        #print("Ranking documents for [%s] '%s'" % (qid, query))
        res = es.search(index=INDEX_NAME, q=query, df=ES_CONFIG[fid]["field"], _source=False, size=1000).get('hits', {})
        for doc in res.get("hits", {}):
            docid = doc.get("_id")
            if docid not in features[qid]:
                features[qid][docid] = {}
            features[qid][docid][fid] = doc.get("_score")

Computing values for feature # 1
Computing values for feature # 2
Computing values for feature # 3
Computing values for feature # 4
Computing values for feature # 5
Computing values for feature # 6


### Looking up relevance labels and writing training data to file

**IMPORTANT** Here, we consider all documents that are retrieved in the top 1000 for any field or retrieval model. If a document is not in the qrels file, then it is treated as a negative training instance (target label = 0). This leads to a highly unbalanced training set, with many more negative than positive instances. It is your task to figure out how to deal with this issue (in Part II of the exercise).
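
To get a feel for the imbalance, one can count the labels and, as one of several possible remedies, randomly downsample the negatives. The sketch below is not necessarily the intended solution for Part II; the helper `downsample_negatives`, its `ratio` parameter, and the sample instances are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical (label, qid, docid) instances: 1 positive and 9 negatives.
instances = [(1, "901", "DOC-A")] + [(0, "901", "DOC-N%d" % i) for i in range(9)]

print(Counter(label for label, _, _ in instances))  # label distribution

def downsample_negatives(instances, ratio, seed=0):
    """Keep all positives and at most `ratio` negatives per positive."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    positives = [x for x in instances if x[0] == 1]
    negatives = [x for x in instances if x[0] == 0]
    keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, keep)

balanced = downsample_negatives(instances, ratio=2)
print(Counter(label for label, _, _ in balanced))
```

Downsampling should only be applied to training queries; the test queries should keep all their candidate documents so that the ranking is evaluated on the full pool.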

In [57]:
with open(FEATURES_FILE, "w") as fout:
    for qid, query in queries.items():
        for docid, ft in features[qid].items():
            # Note that docid will not have a feature value for feature ID i
            # if it was not retrieved in the top-1000 positions for that feature
            # Here, we use -1 as the value for "missing" features
            for fid in range(1, len(ES_CONFIG) + 1):
                if fid not in ft:
                    ft[fid] = -1
            
            # relevance label is determined based on the ground truth (qrels) file
            label = 1 if docid in qrels.get(qid, []) else 0
                        
            # Sort by feature ID so that every line lists the features in the same order
            feat_str = ['{}:{}'.format(k, v) for k, v in sorted(ft.items())]
            fout.write(" ".join([str(label), qid, docid] + feat_str) + "\n")
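
Each line of the features file has the form `label qid docid 1:v1 2:v2 ...`. The round trip can be sketched as below; `format_instance` and `parse_instance` are hypothetical helpers written for illustration, and the IDs and scores are made up. The sketch always writes feature IDs in ascending order, which matters when the reader takes the values positionally and ignores the IDs.

```python
def format_instance(label, qid, docid, ft, num_features=6, missing=-1):
    """Serialize one instance; feature IDs are written in ascending order."""
    values = [ft.get(fid, missing) for fid in range(1, num_features + 1)]
    feats = ["{}:{}".format(fid, v) for fid, v in enumerate(values, start=1)]
    return " ".join([str(label), qid, docid] + feats)

def parse_instance(line):
    """Inverse of format_instance; returns (label, qid, docid, feature values)."""
    items = line.strip().split()
    values = [float(item.split(":")[1]) for item in items[3:]]
    return int(items[0]), items[1], items[2], values

# Features inserted out of order, with features 3..6 missing.
line = format_instance(1, "901", "DOC-A", {2: 7.5, 1: 3.2})
print(line)
print(parse_instance(line))
```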

## Step 2) Loading training data from file and performing retrieval

Learning-to-rank code copy-pasted from the example (Task 1).

In [32]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

### A class for a pointwise learning-to-rank model

In [33]:
class PointWiseLTRModel(object):
    def __init__(self, regressor):
        """
        :param regressor: an instance of scikit-learn regressor
        """
        self.regressor = regressor
        self.model = None  # set by _train()

    def _train(self, X, y):
        """
        Trains an LTR model.
        :param X: features of training instances
        :param y: relevance assessments of training instances
        :return:
        """
        assert self.regressor is not None
        self.model = self.regressor.fit(X, y)

    def rank(self, ft, doc_ids):
        """
        Predicts relevance labels and ranks documents for a given query.
        :param ft: a list of features for query-doc pairs
        :param doc_ids: a list of document ids
        :return: a list of (doc_id, score) tuples sorted by descending score
        """
        assert self.model is not None
        rel_labels = self.model.predict(ft)
        sort_indices = np.argsort(rel_labels)[::-1]

        results = []
        for i in sort_indices:
            results.append((doc_ids[i], rel_labels[i]))
        return results
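
The ranking step boils down to sorting documents by their predicted score in descending order. A pure-Python sketch of that idea is shown below; `rank_by_score` is a hypothetical helper, and the scores are made up rather than produced by a trained model. With tied scores, the order may differ from the `np.argsort(...)[::-1]` approach used in the class.

```python
def rank_by_score(scores, doc_ids):
    """Return (doc_id, score) pairs sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(doc_ids[i], scores[i]) for i in order]

# Hypothetical predicted relevance scores for three documents.
ranking = rank_by_score([0.1, 0.8, 0.4], ["DOC-A", "DOC-B", "DOC-C"])
print(ranking)
# [('DOC-B', 0.8), ('DOC-C', 0.4), ('DOC-A', 0.1)]
```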

### Read training data from file

In [54]:
def read_data_from_file(path):
    """
    :param path: path of file
    :return: X feature vectors, y labels, qids query ID of each instance, doc_ids document ID of each instance
    """
    X, y, qids, doc_ids = [], [], [], []
    with open(path, "r") as f:
        for line in f:
            items = line.strip().split()
            label = int(items[0])
            qid = items[1]
            doc_id = items[2]
            # Each feature is stored as "fid:value"; keep only the value
            features = np.array([float(item.split(":")[1]) for item in items[3:]])
            X.append(features)
            y.append(label)
            qids.append(qid)
            doc_ids.append(doc_id)

    return X, y, qids, doc_ids

Now we apply LTR to this data.

### Loading training data

In [55]:
X, y, qids, doc_ids = read_data_from_file(path=FEATURES_FILE)
qids_unique = list(set(qids))

print("#queries: ", len(qids_unique))
print("#query-doc pairs: ", len(y))

#queries:  50
#query-doc pairs:  110185


### Applying 5-fold cross-validation
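
The cross-validation here splits at the query level: all documents of a query end up either in the training set or in the test set, never in both. A minimal sketch of the `i % FOLDS == f` assignment, using hypothetical query IDs, looks like this. Sorting the IDs first is an assumption added for reproducibility, since iterating over a `set` of strings can yield a different order on each run.

```python
FOLDS_DEMO = 5
# Hypothetical query IDs, sorted so that the split is the same on every run.
query_ids = sorted(str(q) for q in range(901, 911))

folds = []
for f in range(FOLDS_DEMO):
    # Queries whose position modulo the fold count equals f are held out.
    test = [q for i, q in enumerate(query_ids) if i % FOLDS_DEMO == f]
    train = [q for i, q in enumerate(query_ids) if i % FOLDS_DEMO != f]
    folds.append((train, test))
    print("Fold #{}: test queries {}".format(f + 1, test))

# Every query is tested exactly once across the folds.
all_test = [q for _, test in folds for q in test]
```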

In [56]:
FOLDS = 5

fout = open(OUTPUT_FILE, "w")
# write header
fout.write("QueryId,DocumentId\n")
    
for f in range(FOLDS):
    print("Fold #{}".format(f + 1))
    
    train_qids, test_qids = [], []  # holds the IDs of train and test queries
    train_ids, test_ids = [], []  # holds the instance IDs (indices in X )

    for i in range(len(qids_unique)):
        qid = qids_unique[i]
        if i % FOLDS == f:  # test query
            test_qids.append(qid)
        else:  # train query
            train_qids.append(qid)

    train_X, train_y = [], []  # training feature values and target labels
    test_X = []  # for testing we only have feature values

    for i in range(len(X)):
        if qids[i] in train_qids:
            train_X.append(X[i])
            train_y.append(y[i])
        else:
            test_X.append(X[i])

    # Create and train LTR model
    print("\tTraining model ...")
    clf = RandomForestRegressor(max_depth=3, random_state=0)
    ltr = PointWiseLTRModel(clf)
    ltr._train(train_X, train_y)
    
    # Apply LTR model on the remaining fold (test queries)
    print("\tApplying model ...")
    
    for qid in set(test_qids):
        print("\t\tRanking docs for queryID {}".format(qid))
        # Collect the features and docids for that (test) query `qid`
        test_ft, test_docids = [], []
        for i in range(len(X)):
            if qids[i] == qid:
                test_ft.append(X[i])
                test_docids.append(doc_ids[i])
        
        # Get ranking
        r = ltr.rank(test_ft, test_docids)    
        # Write the results to file
        for doc, score in r:
            fout.write(qid + "," + doc + "\n")
        
fout.close()

Fold #1
	Training model ...
	Applying model ...
		Ranking docs for queryID 622
		Ranking docs for queryID 419
		Ranking docs for queryID 307
		Ranking docs for queryID 375
		Ranking docs for queryID 650
		Ranking docs for queryID 354
		Ranking docs for queryID 426
		Ranking docs for queryID 389
		Ranking docs for queryID 408
		Ranking docs for queryID 303
Fold #2
	Training model ...
	Applying model ...
		Ranking docs for queryID 322
		Ranking docs for queryID 448
		Ranking docs for queryID 367
		Ranking docs for queryID 436
		Ranking docs for queryID 401
		Ranking docs for queryID 314
		Ranking docs for queryID 383
		Ranking docs for queryID 394
		Ranking docs for queryID 310
		Ranking docs for queryID 435
Fold #3
	Training model ...
	Applying model ...
		Ranking docs for queryID 658
		Ranking docs for queryID 347
		Ranking docs for queryID 443
		Ranking docs for queryID 651
		Ranking docs for queryID 393
		Ranking docs for queryID 374
		Ranking docs for queryID 638
		Ranking