# Welcome to the first example notebook for Terrier-Spark

This notebook requires [Apache-Toree](https://toree.apache.org/) to be installed and working. You will also need a Terrier index, and the paths to the topics and qrels files must be set appropriately.

First, let's import Terrier and Terrier-Spark using the `%AddDeps` magic command. Toree can resolve these from Maven Central, as well as from your local Maven repository.

In [1]:
%AddDeps org.terrier terrier-core 5.0 --repository file:/home/user/.m2/repository --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive



Marking org.terrier:terrier-core:5.0 for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir9121027855058483540/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /home/user/.m2/repository/org/apache/pdfbox/pdfbox/1.8.4/pdfbox-1.8.4.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/org/mortbay/jasper/apache-el/8.0.9.M3/apache-el-8.0.9.M3.jar
-> New file at /home/user/.m2/repository/org/sonatype/sisu/sisu-inject-bean/2.3.0/sisu-inject-bean-2.3.0.jar
-> New file at /home/user/.m2/repository/org/apache/ant/ant-apache-bcel/1.8.4/ant-apache-bcel-1.8.4.pom
-> New file at /home/user/.m2/repository/org/apache/maven/maven-model-builder/3.0.5/maven-model-builder-3.0.5.jar
-> New file at /home/user/.m2/repository/org/apache/maven/maven-model-builder/3.0.5/maven-model-builder-3.0.5.pom
-> New file at /home/user/.m2/repository/org/apache/htrace/htrace-core4/4.0.1-incubating/htrace-core4-4.0.1

Waiting for a Spark session to start...

Marking org.terrier:terrier-spark:0.0.1-SNAPSHOT for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir9121027855058483540/toree_add_deps/
-> file:/home/user/.m2/repository
-> https://repo1.maven.org/maven2
-> New file at /home/user/.m2/repository/org/apache/spark/spark-catalyst_2.11/2.1.0/spark-catalyst_2.11-2.1.0.jar
-> New file at /home/user/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.2.0/hadoop-mapreduce-client-core-2.2.0.jar
-> New file at /home/user/.m2/repository/org/eclipse/jetty/orbit/org.eclipse.jdt.core/3.8.2.v20130121/org.eclipse.jdt.core-3.8.2.v20130121.pom
-> New file at /home/user/.m2/repository/org/apache/hadoop/hadoop-mapreduce-client-core/2.2.0/hadoop-mapreduce-client-core-2.2.0.pom
-> New file at /home/user/.m2/repository/org/glassfish/jersey/bundles/repackaged/jersey-guava/2.22.2/jersey-guava-2.22.2.jar
-> New file at /home/user/.m2/repository/org/glassfish/hk2/hk2-locator/2.4.0-b34/hk2-locator-2.4.0-b34.pom
-> New file at /home/user/.m

Let's check that Terrier has been imported correctly. Do you get the expected version number?

In [2]:
org.terrier.Version.VERSION

5.0

Now let's set up the file locations appropriate for our local setting.

In [3]:
val terrierHome = "/home/user/projects/tr5/Terrier"
val wt2gIndex = "/home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties"
val topicsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450"
val qrelsFile = "file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8"

terrierHome = /home/user/projects/tr5/Terrier
wt2gIndex = /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties
topicsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/topics.401-450
qrelsFile = file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8


file:/home/user/TopicsQrels/TREC/WT2G/small_web/qrels.trec8

This is our simple function to obtain a Terrier run for a given weighting model. It returns a Dataset with three columns: the qid, the query, and the resulting nDCG@20 for that query.


In [4]:

import org.terrier.spark.ml._
import org.terrier.spark._
import org.terrier.querying.IndexRef

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType, DoubleType}

def ndcgForModel(model : String) = {
    
    val indexref = IndexRef.of(wt2gIndex)
    
    val props = Map(
    "terrier.home" -> terrierHome,
    "termpipelines"->"")
    
    TopicSource.configureTerrier(props)
    val topics = TopicSource.extractTRECTopics(topicsFile)
        .toList.toDF("qid", "query")

    val queryTransform = new QueryingTransformer()
        .setTerrierProperties(props)
        .setIndexReference(indexref)
        .setSampleModel(model)

    val r1 = queryTransform.transform(topics)
    // r1 is a DataFrame holding the retrieved results for each query in topics
    val qrelTransform = new QrelTransformer()
        .setQrelsFile(qrelsFile)

    val r2 = qrelTransform.transform(r1)
    // r2 is the same DataFrame as r1, but also includes a label column
    val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList
    
    
    val newSchema = StructType(topics.schema.fields ++ Array(StructField("ndcg", DoubleType, false)))
    spark.createDataFrame(topics.rdd.zipWithIndex.map{ case (row, index) => Row.fromSeq(row.toSeq ++ Array(ndcg(index.toInt)))}, newSchema)
}


ndcgForModel: (model: String)org.apache.spark.sql.DataFrame
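
For readers who have not met nDCG@k before, the following plain-Scala sketch shows one common formulation of the measure (linear gain, log2 rank discount). The helper names `dcgAtK` and `ndcgAtK` are hypothetical and not part of Terrier-Spark, and `RankingEvaluator` may use a different gain or discount, so treat this as an illustration of the measure rather than a description of its implementation.

```scala
// Hypothetical helpers for illustration only; not part of Terrier-Spark.

// Discounted cumulative gain over the top k results.
// gains(i) is the relevance label of the document at rank i + 1.
def dcgAtK(gains: Seq[Double], k: Int): Double =
  gains.take(k).zipWithIndex.map { case (gain, i) =>
    gain / (math.log(i + 2.0) / math.log(2.0)) // discount = log2(rank + 1)
  }.sum

// nDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking,
// where the ideal ranking sorts all judged labels in descending order.
def ndcgAtK(rankedGains: Seq[Double], judgedGains: Seq[Double], k: Int): Double = {
  val ideal = dcgAtK(judgedGains.sortWith(_ > _), k)
  if (ideal == 0.0) 0.0 else dcgAtK(rankedGains, k) / ideal
}

// Toy relevance labels, made up for this example.
val perfect = ndcgAtK(Seq(1.0, 1.0, 0.0), Seq(1.0, 1.0, 0.0), 20)
val swapped = ndcgAtK(Seq(0.0, 1.0, 1.0), Seq(1.0, 1.0, 0.0), 20)
```

Here `perfect` scores a ranking that already places the relevant documents first, while `swapped` is penalised for placing a non-relevant document at rank 1. A query with no relevant judged documents is given 0 in this sketch, which is a design choice rather than a universal convention.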


Let's get the results for two weighting models, TF_IDF and BM25. Each call returns a Dataset with three columns ("qid", "query", "ndcg").

In [5]:

val TFIDF = ndcgForModel("TF_IDF")
val BM25 = ndcgForModel("BM25")

TFIDF.columns


 Assuming the value of terrier.home from the corresponding system property.
Please ensure that the property terrier.home
is specified in the file terrier.properties,
or as a system property in the command line.
TERRIER_HOME=/home/user/projects/tr5/Terrier
terrier.etc=null
TERRIER_HOME=/home/user/projects/tr5/Terrier
TERRIER_ETC=/home/user/projects/tr5/Terrier/etc
We have 47506 qrels
TERRIER_HOME=/home/user/projects/tr5/Terrier
terrier.etc=null
TERRIER_HOME=/home/user/projects/tr5/Terrier
TERRIER_ETC=/home/user/projects/tr5/Terrier/etc
Querying /home/user/projects/tr5/terrier-core-4.2-SNAPSHOT/var/index_wt2g/data.properties for 50 queries
We have 47506 qrels

TFIDF = [qid: string, query: string ... 1 more field]
BM25 = [qid: string, query: string ... 1 more field]


error: error while loading QualifiedTableName, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-catalyst_2.11-2.2.0.cloudera1.jar(org/apache/spark/sql/catalyst/QualifiedTableName.class)' has location not matching its contents: contains class QualifiedTableName
error: error while loading JavaTypeInference, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-catalyst_2.11-2.2.0.cloudera1.jar(org/apache/spark/sql/catalyst/JavaTypeInference.class)' has location not matching its contents: contains class JavaTypeInference
error: error while loading FunctionIdentifier, class file '/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/spark-catalyst_2.11-2.2.0.cloudera1.jar(org/apache/spark/sql/catalyst/FunctionIdentifier.class)' has location not matching its contents: contains class FunctionIdentifier
error: error while loading DefinedByConstructorParams, class fil

[qid, query, ndcg]

How did we do? Let's look at the mean NDCG of each model.


In [6]:

import org.apache.spark.sql.functions._

println("TFIDF")
TFIDF.select(avg($"ndcg")).show()

println("BM25")
BM25.select(avg($"ndcg")).show()

TFIDF
+------------------+
|         avg(ndcg)|
+------------------+
|0.3397919431818957|
+------------------+

BM25
+------------------+
|         avg(ndcg)|
+------------------+
|0.3386923263003945|
+------------------+



Let's look at the results side by side, so we can see on which queries BM25 did better or worse than TF_IDF.

In [7]:


import org.apache.spark.sql.functions.col

val joined = TFIDF
    .join(BM25, TFIDF.col("qid") === BM25.col("qid"))
    .toDF("qid", "query", "TFIDF", "qidBM25", "queryBM25", "BM25")
    .select($"qid", $"query", $"BM25" - $"TFIDF")
    .orderBy($"BM25" - $"TFIDF")

joined.filter(row => row.get(2) != 0.0).collect




joined = [qid: string, query: string ... 1 more field]


0,1,2
427,uv damage eyes,-0.0585747828736678
424,suicides,-0.0425992524362574
445,women clergy,-0.0420222350661259
408,tropical storms,-0.010144724844642
401,foreign minorities germany,-0.0037445218227228
421,industrial waste disposal,-0.0036009290801817
422,art stolen forged,-0.0033479627985089
411,salvaging shipwreck treasure,-0.002034006395408
403,osteoporosis,-0.001667381371887
404,ireland peace talks,-0.0009332068582119102


OK, so we can see which queries BM25 helped and which it hurt. Can we show this graphically?

We will use the [vegas-viz](https://github.com/vegas-viz/Vegas) module for making the graphs, which we import using the `%AddDeps` magic.

In [11]:
%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive --exclude org.scala-lang:* --exclude org.scala-lang.modules:*
%AddDeps org.vegas-viz vegas-spark_2.11 0.3.11 --transitive --exclude org.scala-lang:* --exclude org.scala-lang.modules:*



Marking org.vegas-viz:vegas_2.11:0.3.11 for download
Preparing to fetch from:
-> file:/tmp/toree-tmp-dir9121027855058483540/toree_add_deps/
-> https://repo1.maven.org/maven2
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/org/webjars/bower/vega/3.0.0-rc4/vega-3.0.0-rc4.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/org/scalaz/scalaz-core_2.11/7.1.1/scalaz-core_2.11-7.1.1.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/org/typelevel/cats-kernel_2.11/0.9.0/cats-kernel_2.11-0.9.0.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/org/webjars/bower/vega-lite/1.2.0/vega-lite-1.2.0.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540/toree_add_deps/https/repo1.maven.org/maven2/com/chuusai/shapeless_2.11/2.3.2/shapeless_2.11-2.3.2.jar
-> New file at /tmp/toree-tmp-dir9121027855058483540

In [12]:
import vegas._
import vegas.render.WindowRenderer._
import vegas.sparkExt._
val queryCount = joined.count

val plot = Vegas("Per Query delta NDCG")
   .withDataFrame(joined.orderBy($"BM25" - $"TFIDF"))
   .configCell(width=600, height=300)
   .encodeX("qid", Nominal, sortOrder=SortOrder.None)
   .encodeY("(BM25 - TFIDF)", Quant)
   .mark(Bar)
plot.show


queryCount = 50
plot = ExtendedUnitSpecBuilder(ExtendedUnitSpec(None,None,Bar,Some(Encoding(None,None,Some(PositionChannelDef(None,None,Some(SortSortOrder(None)),Some(qid),Some(Nominal),None,None,None,None,None)),Some(PositionChannelDef(None,None,None,Some((BM25 - TFIDF)),Some(Quantitative),None,None,None,None,None)),None,None,None,None,None,None,None,None,None,None,None)),None,Some(Per Query delta NDCG),Some(Data(None,None,Some(List(Values(Map(qid -> 427, query -> uv damage eyes, (BM25 - TFIDF) -> -0.05857478287366785)), Values(Map(qid -> 424, query -> suicides, (BM25 - TFIDF) -> -0.04259925243625749)), Values(Map(qid -> 445, query -> women clergy, (BM25 - TFIDF) -> -0.0...


ExtendedUnitSpecBuilder(ExtendedUnitSpec(None,None,Bar,Some(Encoding(None,None,Some(PositionChannelDef(None,None,Some(SortSortOrder(None)),Some(qid),Some(Nominal),None,None,None,None,None)),Some(PositionChannelDef(None,None,None,Some((BM25 - TFIDF)),Some(Quantitative),None,None,None,None,None)),None,None,None,None,None,None,None,None,None,None,None)),None,Some(Per Query delta NDCG),Some(Data(None,None,Some(List(Values(Map(qid -> 427, query -> uv damage eyes, (BM25 - TFIDF) -> -0.05857478287366785)), Values(Map(qid -> 424, query -> suicides, (BM25 - TFIDF) -> -0.04259925243625749)), Values(Map(qid -> 445, query -> women clergy, (BM25 - TFIDF) -> -0.0420222350661259)), Values(Map(qid -> 408, query -> tropical storms, (BM25 - TFIDF) -> -0.010144724844642017)), Values(Map(qid -> 401, query -> foreign minorities germany, (BM25 - TFIDF) -> -0.003744521822722807)), Values(Map(qid -> 421, query -> industrial waste disposal, (BM25 - TFIDF) -> -0.003600929080181753)), Values(Map(qid -> 422, quer

In [10]:
val queryCount = joined.count
val plot = Vegas.layered("Per Query delta NDCG")
   .withData(joined
             .orderBy($"BM25" - $"TFIDF")
             .collect
             .zipWithIndex
             .map{ case (r,rank) => Map(
                "position" -> (queryCount - rank), 
                 "qid" -> r.get(0), 
                 "deltaNDCG" -> r.get(2), 
                 "deltaNDCG_offset" -> 1.1d * r.getDouble(2) )  
                 })
   .configCell(width=600, height=300)
   .withLayers(
       Layer()
           .encodeX("position", Quant)
           .encodeY("deltaNDCG", Quant)
           .mark(Bar),
       Layer()
           .encodeX("position", Quant)
           .encodeY("deltaNDCG_offset", Quant)
           .encodeText(field="qid", dataType= Nominal)
           .mark(Text)
   )
plot.show

Name: Compile Error
Message: <console>:75: error: not found: value Vegas
       val plot = Vegas.layered("Per Query delta NDCG")
                  ^
<console>:86: error: not found: value width
          .configCell(width=600, height=300)
                      ^
<console>:86: error: not found: value height
          .configCell(width=600, height=300)
                                 ^
<console>:88: error: not found: value Layer
              Layer()
              ^
<console>:89: error: not found: value Quant
                  .encodeX("position", Quant)
                                       ^
<console>:90: error: not found: value Quant
                  .encodeY("deltaNDCG", Quant)
                                        ^
<console>:91: error: not found: value Bar
                  .mark(Bar),
                        ^
<console>:92: error: not found: value Layer
              Layer()
              ^
<console>:93: error: not found: value Quant
                  .encodeX("position", Quan

And there we have it: query 419 benefited most from BM25, while queries 424 and 427 benefited least.