A Spark API for the Terrier.org information retrieval platform
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.settings
example_notebooks/toree
src
.classpath
.gitignore
.project
README.md
TODO.md
pom.xml

README.md

Terrier-Spark

Terrier-Spark is a Scala library for Apache Spark that allows the Terrier.org information retrieval platform to be installed and working.

To use within a notebook, this requires Apache Toree to be installed and working.

Requirements:

  • Terrier 5.0
  • Apache Spark version 2.0 or newer
  • Jupyter & Apache Tree (optional)

Functionality

  • Retrieving a run from a Terrier index (local or remote)
  • Evaluating a run
  • Optimising the parameter of a retrieval run on a local index
  • Grid searching the parameter of a retrieval run on a local index
  • Learning a model using learning-to-rank

For known improvements/issues, see TODO.md

Example

val indexref = IndexRef.of("/path/to/index/data.properties")

val props = Map(
"terrier.home" -> terrierHome)

TopicSource.configureTerrier(props)
val topics = TopicSource.extractTRECTopics(topicsFile)
    .toList.toDF("qid", "query")

val queryTransform = new QueryingTransformer()
    .setTerrierProperties(props)
    .setIndexReference(indexref)
    .setSampleModel(model)

val r1 = queryTransform.transform(topics)
//r1 is a dataframe with results for queries in topics
val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)

val r2 = qrelTransform.transform(r1)
//r2 is a dataframe as r1, but also includes a label column
val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList

More examples are provided in the example notebooks, or in our SIGIR 2018 demo paper [1].

Use from the Spark Shell

$ spark-shell --packages org.terrier:terrier-spark:0.0.1-SNAPSHOT

Use within a Jupyter Notebook

Firstly, make sure you have a working installation of Toree. Next, import Terrier and terrier-spark using some %AddDeps "magic":

%AddDeps org.terrier terrier-core 5.0 --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive

You can then use the terrier-spark code directly in your Scala notebooks.

We have provided several example notebooks:

Bibliography

If you use this software, please cite one of:

  1. Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines. Craig Macdonald. In Proceedings of SIGIR 2018.

  2. Agile Information Retrieval Experimentation with Terrier Notebooks. Craig Macdonald, Richard McCreadie, Iadh Ounis. In Proceedings of DESIRES 2018. In press.

Credits

Developed by Craig Macdonald, University of Glasgow