sauparna/Terrier

Terrier Mod

Sauparna Palchowdhury
sauparna.palc [at] gmail [dot] com

Terrier-4.0's README has been copied over to
README-Terrier-4.0.txt. Any rights, responsibilities and credits stem
from the contents of that file.

----------------------------------------------------------------------
DESCRIPTION

This is Terrier-4.0 with some additions and modifications for running
IR experiments on TREC data. The purpose of distributing this software
is to augment Terrier with better documentation. See NOTES.txt.

To run the commands described below you will need the sample TREC data: http://sauparna.sdf.org/Search/.files/ap.tgz

----------------------------------------------------------------------
COMPILING

Type "ant" in the shell.

----------------------------------------------------------------------
INDEXING

bin/trec_terrier.sh -i                                \
		    -Dcollection.spec=filelist.txt    \
		    -Dterrier.index.path=ap/AP        \
		    -Dstopwords.filename=ap/ser17.txt \
		    -Dtermpipelines=Stop,SStemmer     \
		    -DTrecDocTags.doctag=DOC          \
		    -DTrecDocTags.idtag=DOCNO         \
		    -DTrecDocTags.process=            \
		    -DTrecDocTags.skip=		      \
		    -DTrecDocTags.casesensitive=false

filelist.txt - A file listing the paths of the corpus files, one per
line. It can be generated by typing this in the shell:

find -L corpus/* -type f >filelist.txt

ap/AP - This is a directory. In the sample test collection, ap.txt is
the only file in the corpus, and it has been placed inside a directory
named 'AP' because the script expects a path to a directory in which
to look for the corpus.
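
Before indexing, it can help to confirm that every path in the file
list is readable. The following is a minimal sketch, not part of
Terrier: the directory 'demo_corpus' and its one-document file are
made up for illustration and are created in a temporary directory.

```shell
#!/bin/sh
# Sketch: build a file list for a hypothetical corpus and sanity-check it.
# 'demo_corpus' and its contents are invented for this example.
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/demo_corpus/AP"
printf '<DOC>\n<DOCNO> D1 </DOCNO>\n<TEXT> sample text </TEXT>\n</DOC>\n' \
    > "$workdir/demo_corpus/AP/ap.txt"

cd "$workdir"

# Same form as the find command above, applied to the demo directory.
find -L demo_corpus/* -type f > filelist.txt

# Count listed paths that are missing or unreadable.
missing=0
while IFS= read -r path; do
    if [ ! -r "$path" ]; then
        echo "unreadable: $path" >&2
        missing=$((missing + 1))
    fi
done < filelist.txt

count=$(wc -l < filelist.txt | tr -d ' ')
echo "files listed: $count, unreadable: $missing"
```

For a real corpus, replace 'demo_corpus' with the corpus directory and
drop the lines that create the sample file.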

----------------------------------------------------------------------
RETRIEVAL

bin/trec_terrier.sh -r                                   \
		    -q                                   \
		    -c i                                 \
		    -Dterrier.index.path=ap/AP           \
		    -Dtrec.topics=ap/query.txt           \
		    -DTrecQueryTags.doctag=TOP           \
		    -DTrecQueryTags.idtag=NUM            \
		    -DTrecQueryTags.process=TOP,NUM,DESC \
		    -DTrecQueryTags.skip=TITLE,NARR      \
		    -DTrecQueryTags.casesensitive=false  \
		    -Dstopwords.filename=ap/ser17.txt    \
		    -Dtermpipelines=Stop,SStemmer        \
		    -Dtrec.model=TF_IDF                  \
		    -Dquerying.postprocesses.controls=qe:QueryExpansion            \
		    -Dquerying.postprocesses.order=QueryExpansion                  \
		    -Dtrec.qe.model=org.terrier.matching.models.queryexpansion.Bo1 \
		    -Dexpansion.terms=10                 \
		    -Dexpansion.documents=3              \
		    -Dtrec.results=./runs                \
		    -Dtrec.results.file=run.txt

The trec.results parameter points to a directory named 'runs'.

run.txt has the retrieval output in TREC format.
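
A run file in the conventional TREC format has six whitespace-separated
columns per line: topic id, the literal Q0, document id, rank, score
and run tag. Assuming run.txt follows that layout, the sketch below
shows one way to inspect such a file. The three sample lines (topic
numbers, document ids, scores and the tag 'demo') are invented for
illustration and are not actual output.

```shell
#!/bin/sh
# Sketch: inspect a TREC-format run file.
# The sample lines are hypothetical, not real retrieval output.
set -eu

run=$(mktemp)
cat > "$run" <<'EOF'
51 Q0 AP880212-0001 0 12.50 demo
51 Q0 AP880213-0002 1 11.75 demo
52 Q0 AP880214-0003 0 9.10 demo
EOF

# Count retrieved documents per topic (column 1).
per_topic=$(awk '{ n[$1]++ } END { for (q in n) print q, n[q] }' "$run" | sort -n)
echo "$per_topic"

# Count lines that do not have exactly six fields.
bad=$(awk 'NF != 6 { c++ } END { print c + 0 }' "$run")
echo "malformed lines: $bad"
```

To check an actual run, point 'run' at runs/run.txt instead of the
temporary sample file.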