# Training a parameter

This notebook demonstrates the use of `ArbitraryParameterTrainingEstimator` to train the $b$ parameter of BM25.
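
For orientation, the role of $b$ can be seen in the textbook form of BM25 shown below. This is the commonly cited formulation, not a transcription of Terrier's source code, which may differ in details such as query-term weighting. Here $tf_{t,d}$ is the frequency of term $t$ in document $d$, $dl_d$ is the document length, $avdl$ is the average document length in the collection, and $k_1$ is the term-frequency saturation parameter.

$$
\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{(k_1 + 1)\, tf_{t,d}}{tf_{t,d} + k_1 \left( (1 - b) + b \cdot \frac{dl_d}{avdl} \right)}
$$

With $b = 0$ the document length has no effect on the score; with $b = 1$ term frequencies are fully normalised by $dl_d / avdl$. This is why the search later in the notebook is restricted to the interval $[0, 1]$.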

This notebook requires [Apache-Toree](https://toree.apache.org/) to be installed and working. You will also need a Terrier index, and the paths to the topics and qrels files must be set appropriately.

First, let's import Terrier and Terrier-Spark using the `%AddDeps` magic command. Toree can use Maven Central for this, as well as your local Maven repository.

In [1]:
%AddDeps org.terrier terrier-core 5.0 --repository file:/home/user/.m2/repository --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive

//if you are using a local Maven repository, you can use 
//%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive



Marking org.terrier:terrier-core:5.0 for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir6296464374568874300/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /home/user/.m2/repository/org/apache/pdfbox/pdfbox/1.8.4/pdfbox-1.8.4.jar
-> New file at /home/user/.m2/repository/org/apache/htrace/htrace-core4/4.0.1-incubating/htrace-core4-4.0.1-incubating.jar
-> New file at /home/user/.m2/repository/org/apache/pdfbox/jempbox/1.8.4/jempbox-1.8.4.pom
-> New file at /home/user/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar
-> New file at /home/user/.m2/repository/org/sonatype/sisu/sisu-inject-bean/2.3.0/sisu-inject-bean-2.3.0.jar
-> New file at /home/user/.m2/repository/org/apache/maven/maven-model-builder/3.0.5/maven-model-builder-3.0.5.pom
-> New file at /home/user/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.8/poi-ooxml-schemas-3.8.pom
-> New file at /home/user/.m2/repository/org/apache/pdfbox/p

Waiting for a Spark session to start...

Marking org.terrier:terrier-spark:0.0.1-SNAPSHOT for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir6296464374568874300/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /home/user/.m2/repository/org/eclipse/jetty/orbit/org.eclipse.jdt.core/3.8.2.v20130121/org.eclipse.jdt.core-3.8.2.v20130121.jar
-> New file at /home/user/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.2.0/hadoop-mapreduce-client-app-2.2.0.jar
-> New file at /home/user/.m2/repository/org/spire-math/spire_2.11/0.7.4/spire_2.11-0.7.4.pom
-> New file at /home/user/.m2/repository/org/glassfish/hk2/hk2-locator/2.4.0-b34/hk2-locator-2.4.0-b34.pom
-> New file at /home/user/.m2/repository/oro/oro/2.0.8/oro-2.0.8.pom
-> New file at /home/user/.m2/repository/org/apache/poi/poi-ooxml/3.8/poi-ooxml-3.8.jar
-> New file at /home/user/.m2/repository/org/glassfish/hk2/external/javax.inject/2.4.0-b34/javax.inject-2.4.0-b34.pom
-> New file at /home/user/.m2/rep

Let's check that Terrier has been imported correctly. Do you get the expected version number?

In [2]:
org.terrier.Version.VERSION

5.0

Now let's set up the locations appropriate for our local environment.

In [3]:
val terrierHome = "/home/user/projects/tr5/Terrier"
val wt2gIndex = "/home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties"
val topicsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450"
val qrelsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8"

terrierHome = /home/user/projects/tr5/Terrier
wt2gIndex = /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties
topicsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450
qrelsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8


file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8
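
Since a wrong path tends to surface only later, during retrieval or evaluation, it can help to check the locations up front. The following is a minimal sketch using only the Java standard library; `localPathExists` is a hypothetical helper, not part of Terrier or Terrier-Spark, and the literal paths are the example values from the cell above.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper (not part of Terrier): strips an optional "file:" scheme
// and reports whether the path exists on the local filesystem.
def localPathExists(location: String): Boolean =
  Files.exists(Paths.get(location.stripPrefix("file:")))

// Example values; substitute the paths you defined in the previous cell.
val locations = Seq(
  "terrierHome" -> "/home/user/projects/tr5/Terrier",
  "topicsFile"  -> "file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450",
  "qrelsFile"   -> "file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8"
)

// Print one status line per location.
locations.foreach { case (name, location) =>
  val status = if (localPathExists(location)) "found" else "MISSING"
  println(s"$name: $status ($location)")
}
```

Any location reported as `MISSING` should be corrected before running the training cell.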

We now use `ArbitraryParameterTrainingEstimator` to find a good value for the Terrier property `c`, which for BM25 corresponds to the $b$ parameter.

In [6]:
import org.apache.spark.ml._
import org.apache.spark.sql._
import org.terrier.spark.ml._
import org.terrier.spark.TopicSource
import org.terrier.querying.IndexRef

// Terrier configuration: an empty term pipeline means no stopword removal or stemming
val props = Map("terrier.home" -> terrierHome, "termpipelines" -> "")
val indexref = IndexRef.of(wt2gIndex)

TopicSource.configureTerrier(props)

// Load the TREC topics and split them 50/50 into training and test sets (fixed seed for repeatability)
val allTopicsList = TopicSource.extractTRECTopics(topicsFile).toList
val Array(trTopics, teTopics) = allTopicsList.toDF("qid", "query").randomSplit(Array(0.5, 0.5), 130882)

val tuner = new ArbitraryParameterTrainingEstimator()
tuner.setTerrierProperties(props)
tuner.setIndexReference(indexref)
// Optimise NDCG with a rank cutoff of 1000
tuner.set(tuner.measure, Measure.NDCG)
tuner.set(tuner.measureCutoff, 1000)
// Tune property "c" of BM25, starting at 0.25 and bounded to [0, 1]
tuner.set(tuner.paramName, Seq("c"))
tuner.setSampleModel("BM25")
tuner.set(tuner.paramValueInitial, Seq(0.25d))
tuner.set(tuner.paramValueMin, 0d)
tuner.set(tuner.paramValueMax, 1d)
// Stopping criteria: tolerance on the measure and maximum number of optimiser iterations
tuner.set(tuner.measureTol, 1e-4)
tuner.set(tuner.optMaxIter, 100)
tuner.setQrelsFile(qrelsFile)
// Run the search using the training topics only
val model = tuner.fit(trTopics)


TERRIER_HOME=/home/user/projects/tr5/Terrier
terrier.etc=null
TERRIER_HOME=/home/user/projects/tr5/Terrier
TERRIER_ETC=/home/user/projects/tr5/Terrier/etc
Evaluating List(c)=Vector(0.25)
We have 47506 qrels
List(c)=Vector(0.25) => -0.7065811663845499
Evaluating List(c)=Vector(0.0125)
Querying /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties for 26 queries
Got for 25974 results total
We have 47506 qrels
List(c)=Vector(0.0125) => -0.6810591735163936
Evaluating List(c)=Vector(0.4875)
Querying /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties for 26 queries
Got for 25974 results total
We have 47506 qrels
List(c)=Vector(0.4875) => -0.6872177217334728
Evaluating List(c)=Vector(0.012500000000000122)
Querying /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties for 26 queries
Got for 25974 results total
We have 47506 qrels
List(c)=Vector(0.012500000000000122) => -0.6810591735163936
Evaluating List(c)=Vec

props = Map(terrier.home -> /home/user/projects/tr5/Terrier, termpipelines -> "")
indexref = /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties
allTopicsList = List((401,foreign minorities germany), (402,behavioral genetics), (403,osteoporosis), (404,ireland peace talks), (405,cosmic events), (406,parkinson s disease), (407,poaching wildlife preserves), (408,tropical storms), (409,legal pan am 103), (410,schengen agreement), (411,salvaging shipwreck treasure), (412,airport security), (...



List((401,foreign minorities germany), (402,behavioral genetics), (403,osteoporosis), (404,ireland peace talks), (405,cosmic events), (406,parkinson s disease), (407,poaching wildlife preserves), (408,tropical storms), (409,legal pan am 103), (410,schengen agreement), (411,salvaging shipwreck treasure), (412,airport security), (413,steel production), (414,cuba sugar exports), (415,drugs golden triangle), (416,three gorges project), (417,creativity), (418,quilts income), (419,recycle automobile tires), (420,carbon monoxide poisoning), (421,industrial waste disposal), (422,art stolen forged), (423,milosevic mirjana markovic), (424,suicides), (425,counterfeiting money), (426,law enforcement dogs), (427,uv damage eyes), (428,declining birth rates), (429,legionnaires disease), (430,killer bee attacks), (431,robotic technology), (432,profiling motorists police), (433,greek philosophy stoicism), (434,estonia economy), (435,curbing population growth), (436,railway accidents), (437,deregulation
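
Each `=>` line in the log above reports the measure obtained for one candidate value of `c`. The values are negative, which is consistent with the estimator negating NDCG so that an optimiser can minimise it; that reading is an assumption, not something the log states. As a reminder of what is being measured, here is a minimal sketch of NDCG for a single query. It uses one common formulation (exponential gain, base-2 log discount); evaluation tools differ in the gain function they use, and `dcg` and `ndcg` are hypothetical helpers rather than the Terrier-Spark implementation.

```scala
// gains: relevance labels of the returned documents, in rank order.
// Discounted cumulative gain over the top k ranks.
def dcg(gains: Seq[Double], k: Int): Double =
  gains.take(k).zipWithIndex.map { case (g, i) =>
    // rank is i + 1, so the discount is log2(rank + 1) = log2(i + 2)
    (math.pow(2.0, g) - 1.0) / (math.log(i + 2.0) / math.log(2.0))
  }.sum

// ranked: labels of the retrieved documents in rank order.
// judged: labels of all judged documents for the query (defines the ideal ranking).
def ndcg(ranked: Seq[Double], judged: Seq[Double], k: Int): Double = {
  val ideal = dcg(judged.sortBy(g => -g), k)
  if (ideal == 0.0) 0.0 else dcg(ranked, k) / ideal
}

// A ranking that places the only relevant document second scores below 1.
val perfect = ndcg(Seq(1.0, 0.0), Seq(1.0, 0.0), k = 1000)
val swapped = ndcg(Seq(0.0, 1.0), Seq(1.0, 0.0), k = 1000)
println(s"perfect=$perfect swapped=$swapped")
```

The figure reported per candidate in the log would then be an aggregate of such per-query scores over the 26 training queries.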

Let's see what parameter value it found.

In [7]:
println(model.propertySettings.get("c").get)

0.21520996093750025
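
To give an intuition for how a bounded one-dimensional search with a tolerance and an iteration cap can arrive at such a value, here is a self-contained sketch using golden-section search on a toy objective. This is not the optimiser used by `ArbitraryParameterTrainingEstimator`; the function `goldenSection`, the toy objective, and the choice to apply the tolerance to the bracket width (rather than to the measure) are all assumptions made for illustration.

```scala
// Minimise f on [lo, hi] by golden-section search.
// Stops when the bracket is narrower than tol or after maxIter steps.
def goldenSection(f: Double => Double, lo: Double, hi: Double,
                  tol: Double = 1e-4, maxIter: Int = 100): Double = {
  val invPhi = (math.sqrt(5.0) - 1.0) / 2.0 // approximately 0.618
  var a = lo
  var b = hi
  var x1 = b - invPhi * (b - a)
  var x2 = a + invPhi * (b - a)
  var f1 = f(x1)
  var f2 = f(x2)
  var iter = 0
  while (b - a > tol && iter < maxIter) {
    if (f1 < f2) {
      // The minimum lies in [a, x2]: shrink from the right, reuse x1 as the new x2.
      b = x2; x2 = x1; f2 = f1
      x1 = b - invPhi * (b - a); f1 = f(x1)
    } else {
      // The minimum lies in [x1, b]: shrink from the left, reuse x2 as the new x1.
      a = x1; x1 = x2; f1 = f2
      x2 = a + invPhi * (b - a); f2 = f(x2)
    }
    iter += 1
  }
  (a + b) / 2.0
}

// Toy stand-in for a negated effectiveness measure, with its best value at 0.2.
val toyNegatedMeasure: Double => Double = x => -(0.7 - math.pow(x - 0.2, 2))

val best = goldenSection(toyNegatedMeasure, 0.0, 1.0)
println(f"best parameter ~ $best%.4f")
```

In the real setting each call to the objective is a full retrieval and evaluation run over the training topics, so the number of evaluations matters far more than it does for this toy function.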
