This project adds a customized Laplace smoothing function to Indri and compares the performance of different retrieval methods.
- Tuned the ranking functions listed below and implemented a Laplace smoothing function, since the original code does not provide one.
- Wrote a C++ function that modifies the topic and description fields of the TREC queries so that the ranking results can be compared.
- Modified the indexing methods as well.
- Provided a shell script that runs everything automatically, since running the queries may take time.
This experiment runs the set of queries against the WT2g collection, returns a ranked list of the top 1000 documents in a particular format for each ranking function, and evaluates the ranked lists.
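The exact output format is not spelled out here. As a minimal sketch, assuming the usual six-column TREC run layout (query id, the literal `Q0`, document id, rank, score, run tag), one line of a ranked list could be produced as follows. The function name `formatRunLine` and the default run tag `Exp` are illustrative choices, not names taken from this project.

```cpp
#include <cstdio>
#include <string>

// Formats one line of a TREC-style run file:
//   <query id> Q0 <document id> <rank> <score> <run tag>
// Assumption: the project writes the common six-column layout; the run tag
// is an arbitrary label and "Exp" is only a placeholder.
std::string formatRunLine(int queryId, const std::string& docno, int rank,
                          double score, const std::string& runTag = "Exp") {
    char buf[256];
    std::snprintf(buf, sizeof(buf), "%d Q0 %s %d %.6f %s",
                  queryId, docno.c_str(), rank, score, runTag.c_str());
    return std::string(buf);
}
```

Each query would contribute up to 1000 such lines, with ranks starting at 1 and scores in non-increasing order.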
The implemented ranking functions include:
- Vector space model, terms weighted by Okapi TF (see note) times an IDF value, and inner product similarity between vectors
- Language modeling, maximum likelihood estimates with Laplace smoothing only, query likelihood
- Language modeling, Jelinek-Mercer smoothing using the corpus, 0.8 of the weight attached to the background probability, query likelihood
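For reference, the two language-modeling estimates above can be written out as follows. This is the textbook form under assumed notation, not a transcription of the project's code: $tf_{w,d}$ is the count of term $w$ in document $d$, $|d|$ the document length, $|V|$ the vocabulary size, $cf_w$ the count of $w$ in the whole collection, $|C|$ the total number of term occurrences in the collection, and $\alpha$ the Laplace pseudo-count ($\alpha = 1$ gives add-one smoothing).

```latex
% Laplace (add-alpha) smoothing
p_{\mathrm{Lap}}(w \mid d) = \frac{tf_{w,d} + \alpha}{|d| + \alpha\,|V|}

% Jelinek-Mercer smoothing, with weight \lambda = 0.8 on the background model
p_{\mathrm{JM}}(w \mid d) = (1 - \lambda)\,\frac{tf_{w,d}}{|d|} + \lambda\,\frac{cf_w}{|C|},
\qquad \lambda = 0.8

% Query likelihood score for a query q
\log p(q \mid d) = \sum_{w \in q} \log p(w \mid d)
```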
- Put the modified `LaplaceTermScoreFunction.hpp` in the `include/indri` directory.
- In `src/TermScoreFactory.cpp`:
  - add `#include "indri/LaplaceTermScoreFunction.hpp"` at the beginning
  - add the following code at line 61:

        else if( method == "laplace" || method == "add_one" || method == "l" ) {
          double alpha = spec.get( "alpha", 1.0 );
          return new indri::query::LaplaceTermScoreFunction( spec.get("index_path", ""), alpha );
        }
- add
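To give a rough idea of what a Laplace term scorer computes, here is a self-contained sketch. It is not the contents of `LaplaceTermScoreFunction.hpp`: the class name `LaplaceScorerSketch` is hypothetical, it does not derive from any Indri class, and it takes the vocabulary size directly as a constructor argument, whereas the real class is constructed from an index path and `alpha`.

```cpp
#include <cmath>

// Hypothetical stand-in for a Laplace-smoothed term scorer.
// Assumption: the vocabulary size is supplied by the caller; how the real
// header obtains it from the index is not shown in this document.
class LaplaceScorerSketch {
public:
    LaplaceScorerSketch(double vocabularySize, double alpha = 1.0)
        : vocabularySize_(vocabularySize), alpha_(alpha) {}

    // Returns log( (occurrences + alpha) / (contextLength + alpha * |V|) ).
    double scoreOccurrence(double occurrences, int contextLength) const {
        double numerator = occurrences + alpha_;
        double denominator = contextLength + alpha_ * vocabularySize_;
        return std::log(numerator / denominator);
    }

private:
    double vocabularySize_;  // |V|, number of distinct terms
    double alpha_;           // pseudo-count added to every term
};
```

Returning a log probability lets the per-term scores be summed into a query likelihood score, and the pseudo-count keeps the score finite for terms that do not occur in the document.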
This experiment uses Ubuntu 20.04.
Link to download Indri: https://sourceforge.net/p/lemur/wiki/Home/
The experiment uses a set of 50 TREC queries for the corpus, in the standard TREC format with topic title, description, and narrative. Documents from the corpus have been judged for relevance to these queries by NIST assessors. The queries must be downloaded before proceeding with the following steps.
indri-5.18
|
-----------run.sh
|
-----------WT2G collection
|
-----------queries
|
-----------index params
|
-----------query params
|
-----------< other files >
./query_build.sh
Run this script to build the queries from the TREC queries.
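As an illustration of the kind of processing involved, the sketch below pulls one field out of a TREC topic and strips punctuation from it. It assumes the classic SGML-like topic layout (`<num>`, `<title>`, `<desc>`, `<narr>` tags, each field running until the next `<`). The helper names `extractField` and `sanitize` are hypothetical and are not the functions used by `query_build.sh`.

```cpp
#include <cctype>
#include <string>

// Returns the text that follows `tag` up to the next '<' (or end of input).
// Assumption: fields in the topic file are delimited this way.
std::string extractField(const std::string& topic, const std::string& tag) {
    std::size_t start = topic.find(tag);
    if (start == std::string::npos) return "";
    start += tag.size();
    std::size_t end = topic.find('<', start);
    if (end == std::string::npos) end = topic.size();
    return topic.substr(start, end - start);
}

// Keeps letters and digits, turns everything else into single spaces,
// and trims leading and trailing whitespace.
std::string sanitize(const std::string& text) {
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            if (pendingSpace && !out.empty()) out += ' ';
            out += static_cast<char>(c);
            pendingSpace = false;
        } else {
            pendingSpace = true;
        }
    }
    return out;
}
```

Calling `sanitize(extractField(topic, "<title>"))` yields a plain bag of words for the title. For the description, the extracted text usually still begins with the label `Description`, which a caller would need to drop separately.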
./run.sh
Running the line above in a terminal starts a pipeline that compiles the downloaded Indri C++ code, builds the index, runs the queries, and evaluates the returned search results. Note that the queries must be extracted in advance, before the pipeline is run.
See WSM_indri.pdf for detailed comparison.