# Training and evaluating an LTR model

This notebook demonstrates how to use `FeaturesQueryingTransformer` and Spark ML to learn a learning-to-rank (LTR) model.

This notebook requires [Apache Toree](https://toree.apache.org/) to be installed and working. You will also need a Terrier index, and the paths to the index, the topics and the qrels must be set appropriately for your environment.

First, let's import Terrier and Terrier-Spark using the `%AddDeps` magic command. Toree can resolve these from Maven Central as well as from your local Maven repository.

In [1]:
%AddDeps org.terrier terrier-core 5.0 --repository --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive



Marking org.terrier:terrier-core:5.0 for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir1096571982312481501/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /tmp/toree-tmp-dir1096571982312481501/toree_add_deps/https/repo1.maven.org/maven2/org/mortbay/jasper/apache-el/8.0.9.M3/apache-el-8.0.9.M3.jar
-> New file at /home/user/.m2/repository/org/apache/ant/ant-apache-bcel/1.8.4/ant-apache-bcel-1.8.4.jar
-> New file at /home/user/.m2/repository/org/apache/pdfbox/jempbox/1.8.4/jempbox-1.8.4.jar
-> New file at /home/user/.m2/repository/org/sonatype/sisu/sisu-inject-bean/2.3.0/sisu-inject-bean-2.3.0.pom
-> New file at /home/user/.m2/repository/org/apache/maven/maven-model-builder/3.0.5/maven-model-builder-3.0.5.pom
-> New file at /home/user/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar
-> New file at /home/user/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.pom
-> New file

Waiting for a Spark session to start...

Marking org.terrier:terrier-spark:0.0.1-SNAPSHOT for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir1096571982312481501/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /home/user/.m2/repository/org/apache/spark/spark-catalyst_2.11/2.1.0/spark-catalyst_2.11-2.1.0.jar
-> New file at /home/user/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.2.0/hadoop-mapreduce-client-core-2.2.0.jar
-> New file at /home/user/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-app/2.2.0/hadoop-mapreduce-client-app-2.2.0.jar
-> New file at /home/user/.m2/repository/org/spire-math/spire_2.11/0.7.4/spire_2.11-0.7.4.pom
-> New file at /home/user/.m2/repository/org/glassfish/jersey/bundles/repackaged/jersey-guava/2.22.2/jersey-guava-2.22.2.jar
-> New file at /home/user/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar
-> New file at /home/user/.m2/repository/org/eclipse/jetty/jetty-servlet/9.2.9.v20150224/jetty-servlet-9.2.9.v2015022

Let's check that Terrier has been imported correctly. Do you get the expected version number?

In [2]:
org.terrier.Version.VERSION

5.0

Now let's set up the file locations appropriate for our local environment.

In [3]:
val terrierHome = "/home/user/projects/tr5/Terrier"
val wt2gIndex = "/home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties"
val topicsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450"
val qrelsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8"

terrierHome = /home/user/projects/tr5/Terrier
wt2gIndex = /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties
topicsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450
qrelsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8


file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8

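We now use `FeaturesQueryingTransformer` to extract three features (BM25, PL2 and a proximity feature) for each retrieved document. In addition, let's use a `ParamGridBuilder` together with a `CrossValidator` to determine empirically which sample model (the weighting model that retrieves the initial set of documents to be re-ranked) works best.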

In [7]:
import org.apache.spark.ml._
import org.apache.spark.sql._
import org.terrier.spark.ml._
import org.apache.spark.ml.regression._
import org.terrier.spark.TopicSource
import org.terrier.querying.IndexRef
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.ml.tuning.CrossValidator

val featuresList = List( 
         "WMODEL:BM25", 
         "WMODEL:PL2",
         "DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier")

val props = Map(
    "terrier.home" -> terrierHome, 
    "termpipelines"->"", 
    "proximity.dependency.type" -> "SD")
val indexref = IndexRef.of(wt2gIndex)

TopicSource.configureTerrier(props)

val allTopicsList = TopicSource.extractTRECTopics(topicsFile).toList

// split all topics randomly into a training and a test set (the seed makes the split repeatable)
val Array(trTopics, teTopics) = allTopicsList.toDF("qid", "query").randomSplit(Array(0.5,0.5), 130882)

val queryTransform = new FeaturesQueryingTransformer()
     .setTerrierProperties(props)
     .setIndexReference(indexref)
     .setRetrievalFeatures(featuresList)
     .setSampleModel("InL2")
   
val qrelTransform = new QrelTransformer()
   .setQrelsFile(qrelsFile)

val useRF: Boolean = true

val learner = 
 if(useRF) new RandomForestRegressor()
   .setLabelCol("label")
   .setFeaturesCol("features")
   .setPredictionCol("newscore")
 else new LinearRegression()
   .setLabelCol("label")
   .setFeaturesCol("features")
   .setMaxIter(10)
   .setRegParam(0.01)
   .setElasticNetParam(0.8)
   .setPredictionCol("newscore")

val replaceScore = new ReplaceScoreTransformer()

val pipeline = new Pipeline().setStages(
   Array(queryTransform, qrelTransform, learner, replaceScore))

val paramGrid = new ParamGridBuilder()
 .addGrid(queryTransform.sampleModel, Array("InL2", "BM25", "PL2"))
 .build()

val cv = new CrossValidator()
 .setEstimator(pipeline)
 .setEvaluator(new NDCGEvaluator(1000))
 .setEstimatorParamMaps(paramGrid)
 .setNumFolds(2)
val model = cv.fit(trTopics)   


TERRIER_HOME=/home/user/projects/tr5/Terrier
terrier.etc=null
TERRIER_HOME=/home/user/projects/tr5/Terrier
TERRIER_ETC=/home/user/projects/tr5/Terrier/etc
input schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), StructField(score,DoubleType,false), StructField(rank,IntegerType,false), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
output schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), StructField(score,DoubleType,false), StructField(rank,IntegerType,false), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true), StructField(label,IntegerType,false))
input schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), 

props = Map(terrier.home -> /home/user/projects/tr5/Terrier, termpipelines -> "", proximity.dependency.type -> SD)
indexref = /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties
allTopicsList = List((401,foreign minorities germany), (402,behavioral genetics), (403,osteoporosis), (404,ireland peace talks), (405,cosmic events), (406,parkinson s diseas...


error: error while loading ValidatorParams, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-mllib_2.11-2.2.0.cloudera1.jar(org/apache/spark/ml/tuning/ValidatorParams.class)' has location not matching its contents: contains trait ValidatorParams
error: error while loading CrossValidator, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-mllib_2.11-2.2.0.cloudera1.jar(org/apache/spark/ml/tuning/CrossValidator.class)' has location not matching its contents: contains class CrossValidator
error: error while loading ParamGridBuilder, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-mllib_2.11-2.2.0.cloudera1.jar(org/apache/spark/ml/tuning/ParamGridBuilder.class)' has location not matching its contents: contains class ParamGridBuilder
error: error while loading CrossValidatorModel, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1

List((401,foreign minorities germany), (402,behavioral genetics), (403,osteoporosis), (404,ireland peace talks), (405,cosmic events), (406,parkinson s disease), (407,poaching wildlife preserves), (408,tropical storms), (409,legal pan am 103), (410,schengen agreement), (411,salvaging shipwreck treasure), (412,airport security), (413,steel production), (414,cuba sugar exports), (415,drugs golden triangle), (416,three gorges project), (417,creativity), (418,quilts income), (419,recycle automobile tires), (420,carbon monoxide poisoning), (421,industrial waste disposal), (422,art stolen forged), (423,milosevic mirjana markovic), (424,suicides), (425,counterfeiting money), (426,law enforcement dogs), (427,uv damage eyes), (428,declining birth rates), (429,legionnaires disease), (430,killer bee attacks), (431,robotic technology), (432,profiling motorists police), (433,greek philosophy stoicism), (434,estonia economy), (435,curbing population growth), (436,railway accidents), (437,deregulation
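
The `randomSplit` call in the cell above does not cut the topic list exactly in half. Spark samples rows independently, so the weights describe expected proportions rather than exact sizes (the evaluation log further down reports 24 test queries out of 50 topics). The plain-Scala sketch below illustrates the idea of a seeded, repeatable split. It is only an analogue, not Spark's sampling code, and `SeededSplitSketch` and its `split` method are hypothetical names introduced here.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark or Terrier-Spark.
object SeededSplitSketch {
  // Assign each item to the first group with probability `firstFraction`,
  // using a Random seeded with `seed` so that the split is repeatable.
  def split[A](items: Seq[A], firstFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw one number per item, in order, and remember which side it fell on.
    val tagged = items.map(item => (item, rng.nextDouble() < firstFraction))
    val (first, second) = tagged.partition(_._2)
    (first.map(_._1), second.map(_._1))
  }
}
```

For example, `SeededSplitSketch.split((401 to 450).map(_.toString), 0.5, 130882L)` returns the same two groups every time it is called with that seed, but the groups are not guaranteed to contain 25 topics each.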

Let's see what the effectiveness is on the test set, comparing the untrained ranking with the ranking produced by the trained model.

In [8]:
val rtrUntrained = qrelTransform.transform(queryTransform.transform(teTopics))
val rtrTrained = model.transform(teTopics)

println("\n\n*** UNTRAINED MAP@1000 " + new RankingEvaluator(Measure.MAP, 1000).evaluate(rtrUntrained))
println("\n\n*** TRAINED MAP@1000 " + new RankingEvaluator(Measure.MAP, 1000).evaluate(rtrTrained))

println("\n\n*** UNTRAINED NDCG@20 " + new RankingEvaluator(Measure.NDCG, 20).evaluate(rtrUntrained))
println("\n\n*** TRAINED NDCG@20 " + new RankingEvaluator(Measure.NDCG, 20).evaluate(rtrTrained))

Querying /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties for 24 queries with feaures
We have 47506 qrels
input schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), StructField(score,DoubleType,false), StructField(rank,IntegerType,false), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
output schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), StructField(score,DoubleType,false), StructField(rank,IntegerType,false), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true), StructField(label,IntegerType,false))
input schema: StructType(StructField(qid,StringType,true), StructField(query,StringType,true), StructField(docno,StringType,false), StructField(docid,IntegerType,false), StructField(score

rtrUntrained = [qid: string, query: string ... 6 more fields]
rtrTrained = [qid: string, query: string ... 6 more fields]


[qid: string, query: string ... 6 more fields]
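
MAP and NDCG both reward rankings that place relevant documents near the top. As a rough illustration of what the evaluators measure, the sketch below computes average precision and NDCG@k for a single query from its `label` values in rank order. It assumes that any label above zero counts as relevant for average precision, and it uses the common exponential-gain form of NDCG. `RankingEvaluator` may use a different formulation, and `RankingMeasuresSketch` and its methods are hypothetical names, not part of Terrier-Spark.

```scala
// Hypothetical helper for illustration; not the Terrier-Spark implementation.
object RankingMeasuresSketch {
  // Average precision of one ranked list, cut off at rank k.
  // `labels` holds the relevance label of each retrieved document in rank order;
  // `numRelevant` is the total number of relevant documents for the query.
  def averagePrecision(labels: Seq[Int], numRelevant: Int, k: Int): Double =
    if (numRelevant == 0) 0.0
    else {
      val precisionAtHits = labels.take(k).zipWithIndex
        .filter { case (label, _) => label > 0 }   // keep relevant documents only
        .zipWithIndex                              // number the hits 0, 1, 2, ...
        .map { case ((_, rank0), hit0) => (hit0 + 1).toDouble / (rank0 + 1) }
      precisionAtHits.sum / numRelevant
    }

  // DCG@k with gain (2^label - 1) and discount log2(rank + 1), ranks starting at 1.
  def dcg(labels: Seq[Int], k: Int): Double =
    labels.take(k).zipWithIndex.map { case (label, rank0) =>
      (math.pow(2.0, label) - 1.0) / (math.log(rank0 + 2.0) / math.log(2.0))
    }.sum

  // NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering of
  // `allLabels`, the labels of every judged document for the query.
  def ndcg(labels: Seq[Int], allLabels: Seq[Int], k: Int): Double = {
    val ideal = dcg(allLabels.sortBy(label => -label), k)
    if (ideal == 0.0) 0.0 else dcg(labels, k) / ideal
  }
}
```

MAP is then the mean of `averagePrecision` over all test queries. For the labels `Seq(1, 0, 1, 0)` with two relevant documents, `averagePrecision` gives (1/1 + 2/3) / 2, or about 0.833.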

Can we investigate this further? What was the training effectiveness of the different weighting models?

In [15]:
// Combine with paramGrid to see how they affect the overall metrics
paramGrid.zip(model.avgMetrics)

Array(({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: InL2
},0.5827476296872508), ({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: BM25
},0.5779856669585126), ({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: PL2
},0.5925388505135973))


[({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: InL2
},0.5827476296872508), ({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: BM25
},0.5779856669585126), ({
	FeaturesQueryingTransformer_95c6d84ff7be-sampleModel: PL2
},0.5925388505135973)]
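
`CrossValidator` keeps the parameter setting whose average metric across the folds is best and exposes the refitted pipeline as `bestModel`. Assuming that a larger NDCG is better, the selection amounts to a `maxBy` over the zipped pairs, as in this small sketch. `BestSettingSketch` is a hypothetical helper, and the metric values are copied from the output above.

```scala
// Hypothetical helper for illustration; CrossValidator does this internally.
object BestSettingSketch {
  // Pair each candidate setting with its average metric and return the
  // pair with the highest metric (larger is assumed to be better).
  def best[A](settings: Seq[A], avgMetrics: Seq[Double]): (A, Double) = {
    require(settings.nonEmpty && settings.size == avgMetrics.size,
      "need one metric per setting")
    settings.zip(avgMetrics).maxBy(_._2)
  }

  // Values taken from the cross-validation output shown above.
  val sampleModels = Seq("InL2", "BM25", "PL2")
  val avgMetrics = Seq(0.5827476296872508, 0.5779856669585126, 0.5925388505135973)

  // chosen == ("PL2", 0.5925388505135973)
  val chosen = best(sampleModels, avgMetrics)
}
```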

Let's look in detail at the parameters of the `FeaturesQueryingTransformer` in the best model. You can see that PL2 is the chosen sample model, which matches it having the highest average metric in the output above.

In [16]:
val bestModel = model.bestModel.asInstanceOf[PipelineModel]
bestModel.stages(0).asInstanceOf[FeaturesQueryingTransformer].explainParams

bestModel = pipeline_1889000dec34


retrievalFeatures: The names of features to use (default: List(), current: List(WMODEL:BM25, WMODEL:PL2, DSM:org.terrier.matching.dsms.DFRDependen...


index: The reference to the index to be queried (current: /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties)
inputQueryCol: The input column containing the queries (default: query)
inputQueryNumCol: The input column containing the queries (default: qid)
localTerrierProperties: TR properties (default: Map(), current: Map(terrier.home -> /home/user/projects/tr5/Terrier, termpipelines -> , proximity.dependency.type -> SD))
maxResults: Max number of results for QueryingTransformer to render for each query (default: 1000)
retrievalFeatures: The names of features to use (default: List(), current: List(WMODEL:BM25, WMODEL:PL2, DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier))
sampleModel: The sample weighting model (default: InL2, current: PL2)

Let's look at the learned model. Fortunately, `RandomForestRegressionModel` has a `.toDebugString` method, which lets us inspect the resulting trees.
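
Each tree in the debug string is printed as nested `If`/`Else` tests on a feature index, ending in `Predict` leaves, and a random forest regressor averages the predictions of its trees. A minimal sketch of how such a listing is evaluated follows. `TreeReadingSketch` and its types are hypothetical, the sample subtree is copied from part of Tree 0 in the output of the next cell, and the mapping of features 0, 1 and 2 to BM25, PL2 and the proximity feature is an assumption based on the order of `featuresList`.

```scala
// Hypothetical types for illustration; Spark uses its own internal node classes.
object TreeReadingSketch {
  sealed trait Node
  // A "Predict: x" line.
  final case class Leaf(prediction: Double) extends Node
  // An "If (feature f <= t)" / "Else (feature f > t)" pair:
  // go to `ifBranch` when features(feature) <= threshold, else to `elseBranch`.
  final case class Split(feature: Int, threshold: Double,
                         ifBranch: Node, elseBranch: Node) extends Node

  // Walk one tree from the root down to a leaf.
  def predict(node: Node, features: Array[Double]): Double = node match {
    case Leaf(p) => p
    case Split(f, t, ifBranch, elseBranch) =>
      if (features(f) <= t) predict(ifBranch, features)
      else predict(elseBranch, features)
  }

  // A regression forest averages the predictions of its trees.
  def forestPredict(trees: Seq[Node], features: Array[Double]): Double =
    trees.map(tree => predict(tree, features)).sum / trees.size

  // The "feature 0 <= 10.978..." subtree of Tree 0, transcribed by hand.
  val sampleSubtree: Node =
    Split(0, 10.978431413411807,
      Split(1, 4.318875920614034,
        Leaf(0.010077682133109386), Leaf(0.00425298394841542)),
      Split(0, 11.852411915267531,
        Leaf(0.04317958783120707), Leaf(0.014056224899598393)))
}
```

For instance, `TreeReadingSketch.predict(TreeReadingSketch.sampleSubtree, Array(10.0, 4.0, 0.0))` follows both `If` branches and returns `0.010077682133109386`.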

In [22]:
bestModel.stages(2).asInstanceOf[RandomForestRegressionModel].getNumTrees
bestModel.stages(2).asInstanceOf[RandomForestRegressionModel].toDebugString

"RandomForestRegressionModel (uid=rfr_6d1b11bf8d8b) with 20 trees
        Else (featur...


RandomForestRegressionModel (uid=rfr_6d1b11bf8d8b) with 20 trees
  Tree 0 (weight 1.0):
    If (feature 1 <= 7.846192540483554)
     If (feature 1 <= 5.304530238481379)
      If (feature 2 <= 0.0)
       If (feature 0 <= 10.978431413411807)
        If (feature 1 <= 4.318875920614034)
         Predict: 0.010077682133109386
        Else (feature 1 > 4.318875920614034)
         Predict: 0.00425298394841542
       Else (feature 0 > 10.978431413411807)
        If (feature 0 <= 11.852411915267531)
         Predict: 0.04317958783120707
        Else (feature 0 > 11.852411915267531)
         Predict: 0.014056224899598393
      Else (feature 2 > 0.0)
       If (feature 0 <= 13.498727649538473)
        If (feature 2 <= 0.7178214709431138)
         Predict: 1.0
        Else (feature 2 > 0.7178214709431138)
         Predict: 0.18803418803418803
       Else (feature 0 > 13.498727649538473)
        Predict: 1.0
     Else (feature 1 > 5.304530238481379)
      If (feature 1 <= 7.0524071771466375)
     