# [ Chapter 12 - Overcoming Bias in Learned Relevance Models ]

## Chapter 12 setup

In Chapter 12, we continue our work on a Learning to Rank solution, evolving from a purely offline use of click-based training data toward actively exploring potentially relevant items that users may find valuable.

To set up, we:

1. Fetch the retrotech data
2. Enable LTR
3. Define a few fields (different ways of analyzing the underlying retrotech text)
4. Define a list of 'promoted' products that our store wants to make prominent
5. Insert the retrotech product data via Spark


### [TODO: remove the product / signals cells, as those were loaded in ch4]

In [5]:
%load_ext autoreload
%autoreload 1

import sys
sys.path.append('..')
from aips import *
import pandas as pd
import os
from IPython.display import display,HTML
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AIPS").getOrCreate()
engine = get_engine()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
#Get datasets
![ ! -d 'retrotech' ] && git clone --depth=1 https://github.com/ai-powered-search/retrotech.git
! cd retrotech && git pull
! cd retrotech && tar -xvf products.tgz -C '../../data/retrotech/' && tar -xvf signals.tgz -C '../../data/retrotech/'

Already up to date.
products.csv
signals.csv


In [7]:
! cd ../data/retrotech/ && head products.csv

"upc","name","manufacturer","shortDescription","longDescription"
"096009010836","Fists of Bruce Lee - Dolby - DVD",\N,\N,\N
"043396061965","The Professional - Widescreen Uncut - DVD",\N,\N,\N
"085391862024","Pokemon the Movie: 2000 - DVD",\N,\N,\N
"067003016025","Summerbreeze - CD","Nettwerk",\N,\N
"731454813822","Back for the First Time [PA] - CD","Def Jam South",\N,\N
"024543008200","Big Momma's House - Widescreen - DVD",\N,\N,\N
"031398751823","Kids - DVD",\N,\N,\N
"037628413929","20 Grandes Exitos - CD","Sony Discos Inc.",\N,\N
"060768972223","Power Of Trinity (Box) - CD","Sanctuary Records",\N,\N

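The sample above marks missing values with `\N`, and the UPCs are quoted strings with leading zeros. The schema printed later in this notebook shows `upc` inferred as `long`, which is consistent with the promoted UPCs being written as integers without leading zeros. The following is a minimal sketch, using only Python's standard `csv` module on rows copied from the output above, of how those two details behave. The helper name `parse_products` is hypothetical and not part of `aips`.

```python
import csv
import io

# Header and two rows of products.csv, copied from the output above.
SAMPLE = r'''"upc","name","manufacturer","shortDescription","longDescription"
"096009010836","Fists of Bruce Lee - Dolby - DVD",\N,\N,\N
"067003016025","Summerbreeze - CD","Nettwerk",\N,\N
'''

def parse_products(text):
    """Hypothetical helper: parse CSV text, mapping the \\N marker to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (None if value == r"\N" else value) for key, value in row.items()}
            for row in reader]

rows = parse_products(SAMPLE)
print(rows[0]["manufacturer"])  # the \N marker becomes None
print(rows[1]["manufacturer"])  # a real value is kept as-is
print(int(rows[0]["upc"]))      # casting to an integer drops the leading zero
```

If leading zeros matter downstream, keeping `upc` as a string would be one option; this notebook treats it as a number.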

In [8]:
#Create Products Collection
products_collection = engine.create_collection("products")
engine.apply_additional_schema(products_collection)
engine.enable_ltr(products_collection)

promoted = [27242815414, 600603141003, 27242813908, 803238004525, 27242799127, 36725236271,
 883393003458, 600603135088, 9781400532711, 97360810042, 97360810042, 97360810042, 97360810042,
 803238004525, 27242799127, 36725236271, 883393003458, 36725236271, 883393003458, 27242815414,
# promoted transformers movies for example
 97360724240, 97360722345, 97368920347,
]

# Deduplicate the UPCs so the left join below yields one row per product
promoted = [{'upc': promoted_upc, 'promotion_b': True} for promoted_upc in sorted(set(promoted))]

# Any extra fields we want to add manually
enriched_data = spark.createDataFrame(promoted)

print("Loading Products...")
csvFile = "../data/retrotech/products.csv"
product_update_opts={"zkhost": "aips-zk", "collection": products_collection.name, 
                     "gen_uniq_key": "true", "commit_within": "5000"}
csvDF = spark.read.csv(csvFile, header=True, inferSchema=True)
joined = csvDF.join(enriched_data, ['upc'], "left")
joined.write.format("solr").options(**product_update_opts).mode("overwrite").save()
print("Products Schema: ")
joined.printSchema()
print("Status: Success")

Wiping "products" collection
Creating "products" collection
Status: Success
Deleting all copy fields
Status: Success
Adding LTR QParser for products collection
Status: Success
Adding LTR Doc Transformer for products collection
Status: Success
Loading Products...
Products Schema: 
root
 |-- upc: long (nullable = true)
 |-- name: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- shortDescription: string (nullable = true)
 |-- longDescription: string (nullable = true)
 |-- promotion_b: boolean (nullable = true)

Status: Success
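
The cell above attaches the `promotion_b` flag with a left join on `upc`. A left join emits one output row per matching right-hand row, so a UPC that appears more than once on the promoted side would match its product more than once. The plain-Python sketch below illustrates that general left-join behavior. It does not use Spark; `left_join` is a hypothetical helper, and the first product name is a placeholder.

```python
def left_join(left_rows, right_rows, key):
    """Hypothetical helper mimicking SQL left-join semantics on a single key."""
    right_columns = {col for row in right_rows for col in row} - {key}
    joined = []
    for left in left_rows:
        matches = [right for right in right_rows if right[key] == left[key]]
        if not matches:
            # Unmatched left rows are kept, with nulls in the right-hand columns.
            joined.append({**left, **dict.fromkeys(right_columns)})
        for match in matches:
            joined.append({**left, **match})
    return joined

products = [{"upc": 97360810042, "name": "(name omitted)"},  # placeholder name
            {"upc": 31398751823, "name": "Kids - DVD"}]
promoted_upcs = [97360810042, 97360810042, 97360810042]       # the same UPC repeated

with_dupes = [{"upc": upc, "promotion_b": True} for upc in promoted_upcs]
deduped = [{"upc": upc, "promotion_b": True} for upc in sorted(set(promoted_upcs))]

print(len(left_join(products, with_dupes, "upc")))  # 4: the promoted product appears three times
print(len(left_join(products, deduped, "upc")))     # 2: one row per product
```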


In [16]:
query = "ipod"

collection = "products"
request = {
    "query": query,
    "query_fields": ["name", "manufacturer", "longDescription"],
    "return_fields": ["upc", "name", "manufacturer", "score"],
    "limit": 5,
    "order_by": [("score", "desc"), ("upc", "asc")]
}

response = products_collection.search(**request)
display_product_search(query, response["docs"])

In [10]:
! cd ../data/retrotech && head signals.csv

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"

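Each signal row carries a query id, a user, a signal type, a target, and a timestamp. As a rough sketch of how these rows could be inspected outside Spark, the snippet below tallies signal types and per-user queries with the standard library, using three rows copied from the output above. Only `query` signals are visible in this sample; the full file may contain other types.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Header and first three rows of signals.csv, copied from the output above.
SAMPLE = '''"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
'''

signals = list(csv.DictReader(io.StringIO(SAMPLE)))

# Count how many signals of each type appear, and how many queries each user issued.
type_counts = Counter(row["type"] for row in signals)
queries_per_user = Counter(row["user"] for row in signals if row["type"] == "query")

# The timestamps carry fractional seconds, which %f parses.
first_time = datetime.strptime(signals[0]["signal_time"], "%Y-%m-%d %H:%M:%S.%f")

print(type_counts)
print(queries_per_user)
print(first_time)
```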

## Download query sessions

Download the simulated raw clickstream data, one session file per query.

In [18]:
from ltr import download
simulated_queries = ["dryer", "bluray", "blue ray", "headphones", "ipad", "iphone",
                     "kindle", "lcd tv", "macbook", "nook", "star trek", "star wars",
                     "transformers dark of the moon"]

sessions = [f"https://github.com/ai-powered-search/retrotech/raw/master/sessions/{query}_sessions.gz"
            for query in simulated_queries]
           
download(sessions, dest="data/")

data/dryer_sessions.gz already exists
data/bluray_sessions.gz already exists
data/blue ray_sessions.gz already exists
data/headphones_sessions.gz already exists
data/ipad_sessions.gz already exists
data/iphone_sessions.gz already exists
data/kindle_sessions.gz already exists
data/lcd tv_sessions.gz already exists
data/macbook_sessions.gz already exists
data/nook_sessions.gz already exists
data/star trek_sessions.gz already exists
data/star wars_sessions.gz already exists
data/transformers dark of the moon_sessions.gz already exists
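
Several of the simulated queries contain spaces, so the generated URLs and file names do too (for example `data/blue ray_sessions.gz`). Whether `download` percent-encodes these URLs is not shown in this notebook. If a client does require encoded URLs, one option is `urllib.parse.quote`, sketched below; `session_url` is a hypothetical helper, not part of `ltr`.

```python
from urllib.parse import quote

BASE = "https://github.com/ai-powered-search/retrotech/raw/master/sessions"

def session_url(query):
    """Hypothetical helper: build a session URL with the file name percent-encoded."""
    return f"{BASE}/{quote(query + '_sessions.gz')}"

print(session_url("blue ray"))
print(session_url("ipad"))
```

Queries without spaces, such as `ipad`, pass through unchanged, so the helper could be applied to the whole list.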


In [None]:
!ls data/

In [22]:
!ls

0.setup.ipynb  1.ab-testing-to-active-learning.ipynb


In [17]:
!ls retrotech/sessions/

'blue ray_sessions.gz'	 'lcd tv_sessions.gz'
 bluray_sessions.gz	  macbook_sessions.gz
 dryer_sessions.gz	  nook_sessions.gz
 headphones_sessions.gz  'star trek_sessions.gz'
 ipad_sessions.gz	 'star wars_sessions.gz'
 iphone_sessions.gz	 'transformers dark of the moon_sessions.gz'
 kindle_sessions.gz


Up next: [A/B Testing Simulation to Active Learning](1.ab-testing-to-active-learning.ipynb)