# [ Chapter 12 - Overcoming Bias in Learned Relevance Models ]

## Chapter 12 setup

In chapter 12, we continue our work on a Learning to Rank solution, evolving from a purely offline use of click-based training data to actively exploring potentially relevant items that users may find valuable.

To set up, we:

1. Fetch the retrotech data
2. Enable LTR
3. Define a few fields (different ways of analyzing the underlying retrotech text)
4. Define a list of 'promoted' products that our store wants to make prominent
5. Insert the retrotech product data via Spark


### [TODO: remove the product / signals cells, as those were loaded in ch4]

In [1]:
%load_ext autoreload
%autoreload 1

import sys
sys.path.append('../..')
from aips import *
import pandas
import os
from IPython.display import display, HTML
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AIPS").getOrCreate()
engine = get_engine()

In [2]:
#Get datasets
![ ! -d 'retrotech' ] && git clone --depth=1 https://github.com/ai-powered-search/retrotech.git
! cd retrotech && git pull
! cd retrotech && tar -xvf products.tgz -C '../../../data/retrotech/' && tar -xvf signals.tgz -C '../../../data/retrotech/'

Cloning into 'retrotech'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 19 (delta 0), reused 19 (delta 0), pack-reused 0
Receiving objects: 100% (19/19), 48.29 MiB | 1.13 MiB/s, done.
Already up to date.
products.csv
signals.csv


In [3]:
! cd ../../data/retrotech/ && head products.csv

"upc","name","manufacturer","shortDescription","longDescription"
"096009010836","Fists of Bruce Lee - Dolby - DVD",\N,\N,\N
"043396061965","The Professional - Widescreen Uncut - DVD",\N,\N,\N
"085391862024","Pokemon the Movie: 2000 - DVD",\N,\N,\N
"067003016025","Summerbreeze - CD","Nettwerk",\N,\N
"731454813822","Back for the First Time [PA] - CD","Def Jam South",\N,\N
"024543008200","Big Momma's House - Widescreen - DVD",\N,\N,\N
"031398751823","Kids - DVD",\N,\N,\N
"037628413929","20 Grandes Exitos - CD","Sony Discos Inc.",\N,\N
"060768972223","Power Of Trinity (Box) - CD","Sanctuary Records",\N,\N

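The preview above shows two details worth keeping in mind: missing values appear as the unquoted token `\N`, and UPCs are quoted strings with leading zeros. The following is a minimal standard-library sketch, not the `load_dataframe` implementation, that reads two of the rows above. It assumes `\N` means "no value"; `read_products` and `NULL_MARKER` are names made up for this illustration.

```python
import csv
import io

# Two rows copied from the products.csv preview above. The raw string keeps
# the backslash in the null token literal.
SAMPLE = r'''"upc","name","manufacturer","shortDescription","longDescription"
"096009010836","Fists of Bruce Lee - Dolby - DVD",\N,\N,\N
"067003016025","Summerbreeze - CD","Nettwerk",\N,\N
'''

NULL_MARKER = r"\N"  # assumption: this token stands for a missing value


def read_products(text):
    """Parse CSV text into dicts, replacing the null marker with None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {field: (None if value == NULL_MARKER else value) for field, value in row.items()}
        for row in reader
    ]


rows = read_products(SAMPLE)
for row in rows:
    # int() drops the leading zero, so integer UPCs look one digit shorter.
    print(int(row["upc"]), row["name"], row["manufacturer"])
```

Converting a UPC to an integer drops its leading zero, which is worth remembering when comparing UPC strings from the CSV against integer UPCs elsewhere in the notebook.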

In [4]:
from aips.data_loaders.products import load_dataframe

#Create Products Collection
products_collection = engine.create_collection("products")
engine.enable_ltr(products_collection)

promoted = [27242815414, 600603141003, 27242813908, 803238004525, 27242799127, 36725236271,
 883393003458, 600603135088, 9781400532711, 97360810042, 97360810042, 97360810042, 97360810042,
 803238004525, 27242799127, 36725236271, 883393003458, 36725236271, 883393003458, 27242815414,
# promoted transformers movies for example
 97360724240, 97360722345, 97368920347,
]
# Change: 97368920347 603497664429
promoted = [{'upc': promoted_upc, 'has_promotion': True} for promoted_upc in promoted]

# Any extra fields we want to add manually
enriched_data = spark.createDataFrame(promoted)

products_dataframe = load_dataframe("../../data/retrotech/products.csv")
joined = products_dataframe.join(enriched_data, ['upc'], "left")
products_collection.write(joined)

Wiping "products" collection
Creating "products" collection
Status: Success
Adding LTR QParser for products collection
Adding LTR Doc Transformer for products collection
Loading Products
Schema: 
root
 |-- upc: long (nullable = true)
 |-- name: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- long_description: string (nullable = true)
 |-- short_description: string (nullable = true)

Successfully written 48204 documents
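The `promoted` list in this cell repeats several UPCs. A SQL-style left join emits one output row per matching right-hand row, so repeated keys on the right can repeat product rows in the joined result. The plain-Python sketch below illustrates that behavior on a tiny, hypothetical catalog. `left_join` is a helper written for this example, not part of `aips` or Spark, and only the UPC `97360810042` is taken from the notebook.

```python
def left_join(left_rows, right_rows, key):
    """SQL-style left join of two lists of dicts on `key`."""
    matches_by_key = {}
    for right in right_rows:
        matches_by_key.setdefault(right[key], []).append(right)

    joined = []
    for left in left_rows:
        matches = matches_by_key.get(left[key])
        if matches:
            # One output row per matching right-hand row.
            joined.extend({**left, **match} for match in matches)
        else:
            # No match: keep the left row and leave the flag empty.
            joined.append({**left, "has_promotion": None})
    return joined


# Hypothetical miniature catalog for illustration only.
products = [
    {"upc": 97360810042, "name": "example promoted product"},
    {"upc": 1, "name": "example unpromoted product"},
]
promoted_upcs = [97360810042, 97360810042, 97360810042]  # repeated on purpose

as_written = [{"upc": upc, "has_promotion": True} for upc in promoted_upcs]
# dict.fromkeys keeps the first occurrence of each UPC, in order.
deduplicated = [{"upc": upc, "has_promotion": True} for upc in dict.fromkeys(promoted_upcs)]

rows_as_written = left_join(products, as_written, "upc")
rows_deduplicated = left_join(products, deduplicated, "upc")
print(len(rows_as_written), len(rows_deduplicated))
```

Unpromoted products keep an empty `has_promotion` value rather than `False`, so any later feature that reads the flag should treat a missing value as "not promoted".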


In [5]:
query = "ipod"

collection = "products"
request = {
    "query": query,
    "query_fields": ["name", "manufacturer", "long_description"],
    "return_fields": ["upc", "name", "manufacturer", "score"],
    "limit": 5,
    "order_by": [("score", "desc"), ("upc", "asc")]
}

response = products_collection.search(**request)
display_product_search(query, response["docs"])

In [6]:
! cd ../../data/retrotech && head signals.csv

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"

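To make the signal layout concrete, here is a small standard-library sketch that parses three of the rows shown above and groups each user's queries in time order. The helper name `read_signals` is hypothetical, and the timestamp format string is inferred from the sample rows rather than from any documented schema.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Three rows copied from the signals.csv preview above.
SAMPLE = '''"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
'''


def read_signals(text):
    """Parse signal rows and convert signal_time to datetime objects."""
    signals = []
    for row in csv.DictReader(io.StringIO(text)):
        # %f accepts one to six fractional digits, so "07.3116" parses cleanly.
        row["signal_time"] = datetime.strptime(row["signal_time"], "%Y-%m-%d %H:%M:%S.%f")
        signals.append(row)
    return signals


signals = read_signals(SAMPLE)

# Collect each user's query strings in chronological order.
queries_by_user = defaultdict(list)
for signal in sorted(signals, key=lambda s: s["signal_time"]):
    if signal["type"] == "query":
        queries_by_user[signal["user"]].append(signal["target"])

print(dict(queries_by_user))
```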

## Download query sessions

Download the simulated raw clickstream data, one gzipped session file per query.

In [7]:
from ltr import download
simulated_queries = ["dryer", "bluray", "blue ray", "headphones", "ipad", "iphone",
                     "kindle", "lcd tv", "macbook", "nook", "star trek", "star wars",
                     "transformers dark of the moon"]

sessions = [f"https://github.com/ai-powered-search/retrotech/raw/master/sessions/{query}_sessions.gz"
            for query in simulated_queries]
           
download(sessions, dest="../../data/")

GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/dryer_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/bluray_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/blue ray_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/headphones_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/ipad_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/iphone_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/kindle_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/lcd tv_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/macbook_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/nook_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/star trek_sessions.gz
GET h
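Several of the query names contain spaces, so the URLs built above contain literal spaces as well. Whether the `download` helper needs encoded URLs is not shown here, so treat the following as an optional sketch: it percent-encodes the file name with `urllib.parse.quote`. `session_url` and `BASE_URL` are names introduced for this example.

```python
from urllib.parse import quote

BASE_URL = "https://github.com/ai-powered-search/retrotech/raw/master/sessions"


def session_url(query):
    """Build a session-file URL with the file name percent-encoded."""
    filename = f"{query}_sessions.gz"
    # quote() turns the space in "blue ray" into %20 and leaves "_" and "." alone.
    return f"{BASE_URL}/{quote(filename)}"


print(session_url("blue ray"))
print(session_url("ipad"))
```

If a download helper derives the local file name from the URL, encoding the URL may also change the saved file name, so check the destination directory after switching.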

In [8]:
!ls ../../data/

 ai_pow_search_judgments.txt  'lcd tv_sessions.gz'
'blue ray_sessions.gz'	       macbook_sessions.gz
 bluray_sessions.gz	       movies.tgz
 cooking		       nook_sessions.gz
 devops			       normed_judgments.txt
 dryer_sessions.gz	       predictor_deltas.npy
 feature_data.npy	       retrotech
 headphones_sessions.gz        reviews
 health			       scifi
 ipad_sessions.gz	      'star trek_sessions.gz'
 iphone_sessions.gz	      'star wars_sessions.gz'
 jobs			       tmdb.json
 judgments.tgz		      'transformers dark of the moon_sessions.gz'
 kindle_sessions.gz	       travel
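Rather than scanning the listing by eye, a short check can confirm that every expected session file is present. This is a sketch that assumes the files are saved as `{query}_sessions.gz` directly under the destination directory, as the listing above suggests. `missing_session_files` is a hypothetical helper, not part of `ltr`.

```python
from pathlib import Path

simulated_queries = ["dryer", "bluray", "blue ray", "headphones", "ipad", "iphone",
                     "kindle", "lcd tv", "macbook", "nook", "star trek", "star wars",
                     "transformers dark of the moon"]


def missing_session_files(data_dir, queries):
    """Return the expected session file names that are not in data_dir."""
    data_dir = Path(data_dir)
    expected = [f"{query}_sessions.gz" for query in queries]
    return [name for name in expected if not (data_dir / name).is_file()]


# An empty list means every query has its session file.
print(missing_session_files("../../data/", simulated_queries))
```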


In [9]:
!ls

0.setup.ipynb  1.ab-testing-to-active-learning.ipynb  retrotech


In [10]:
!ls retrotech/sessions/

'blue ray_sessions.gz'	 'lcd tv_sessions.gz'
 bluray_sessions.gz	  macbook_sessions.gz
 dryer_sessions.gz	  nook_sessions.gz
 headphones_sessions.gz  'star trek_sessions.gz'
 ipad_sessions.gz	 'star wars_sessions.gz'
 iphone_sessions.gz	 'transformers dark of the moon_sessions.gz'
 kindle_sessions.gz


Up next: [A/B Testing Simulation to Active Learning](1.ab-testing-to-active-learning.ipynb)