# [ Chapter 12 - Overcoming Bias in Learned Relevance Models ]

## Chapter 12 setup

In chapter 12, we continue our work on a Learning to Rank solution, evolving from a purely offline use of click-based training data to actively exploring potentially relevant items that users may find valuable.

To set up, we:

1. Fetch the retrotech data
2. Enable LTR
3. Define a few fields (different ways of analyzing the underlying retrotech text)
4. Define a list of 'promoted' products that our store wants to make prominent
5. Insert the retrotech product data via Spark


### [TODO: remove the product / signals cells, as those were loaded in ch4]

In [1]:
import sys
sys.path.append('../..')
from aips import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AIPS").getOrCreate()
engine = get_engine()

In [8]:
#Get datasets
![ ! -d 'retrotech' ] && git clone --depth=1 https://github.com/ai-powered-search/retrotech.git
! cd retrotech && git pull
! cd retrotech && tar -xvf products.tgz -C '../../../data/retrotech/' && tar -xvf signals.tgz -C '../../../data/retrotech/'

products.csv
signals.csv


In [9]:
! cd ../../data/retrotech/ && head products.csv

"upc","name","manufacturer","short_description","long_description"
"096009010836","Fists of Bruce Lee - Dolby - DVD", , , 
"043396061965","The Professional - Widescreen Uncut - DVD", , , 
"085391862024","Pokemon the Movie: 2000 - DVD", , , 
"067003016025","Summerbreeze - CD","Nettwerk", , 
"731454813822","Back for the First Time [PA] - CD","Def Jam South", , 
"024543008200","Big Momma's House - Widescreen - DVD", , , 
"031398751823","Kids - DVD", , , 
"037628413929","20 Grandes Exitos - CD","Sony Discos Inc.", , 
"060768972223","Power Of Trinity (Box) - CD","Sanctuary Records", , 


In [5]:
from aips.data_loaders.products import load_dataframe

# Create the products collection and enable LTR on it
products_collection = engine.create_collection("products")
get_ltr_engine(products_collection).enable_ltr(products_collection)

# UPCs of the products the store wants to make prominent
promoted_upcs = [600603141003, 27242813908, 74108007469,
                 12505525766, 400192926087, 47875842328,
                 803238004525, 27242799127, 27242815414,
                 97360724240, 97360722345, 826663114164]

# One row per promoted UPC, flagged with has_promotion=True
promoted_rows = [{"upc": promoted_upc, "has_promotion": True}
                 for promoted_upc in promoted_upcs]
promoted_dataframe = spark.createDataFrame(promoted_rows)

# Left join keeps every product; only promoted ones get has_promotion set
products_dataframe = load_dataframe("../../data/retrotech/products.csv")
enriched_data = products_dataframe.join(promoted_dataframe, ["upc"], "left")
products_collection.write(enriched_data)

Wiping "products" collection
Creating "products" collection
Status: Success
Adding LTR QParser for products collection
Adding LTR Doc Transformer for products collection
Loading Products
Schema: 
root
 |-- upc: long (nullable = true)
 |-- name: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- short_description: string (nullable = true)
 |-- long_description: string (nullable = true)

Successfully written 48194 documents
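
The left join in the cell above keeps every product and attaches `has_promotion` only where the UPC appears in the promoted list. The following is a minimal plain-Python sketch of that behavior, not the Spark code itself: `left_join_promotions` is a hypothetical helper, `None` stands in for Spark's null, and the two product rows are taken from the outputs in this notebook.

```python
# Hypothetical stand-in for the Spark left join; None plays the role of null.
def left_join_promotions(products, promoted_upcs):
    promoted = set(promoted_upcs)
    return [
        {**product, "has_promotion": True if product["upc"] in promoted else None}
        for product in products
    ]

products = [
    {"upc": 96009010836, "name": "Fists of Bruce Lee - Dolby - DVD"},
    {"upc": 97360724240, "name": "Transformers: Revenge of the Fallen - Widescreen - DVD"},
]
promoted_upcs = [97360724240, 97360722345]

enriched = left_join_promotions(products, promoted_upcs)
for row in enriched:
    print(row["upc"], row["has_promotion"])
```

Under this sketch, non-promoted rows carry no value rather than `False`, so a filter on `has_promotion` being `True` would match only the promoted products.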


In [10]:
query = "Transformers"

collection = "products"
request = {
    "query": query,
    "query_fields": ["name", "manufacturer", "long_description"],
    "return_fields": ["upc", "name", "manufacturer", "score"],
    "filters": [("has_promotion", True)],
    "limit": 5,
    "order_by": [("score", "desc"), ("upc", "asc")]
}

response = products_collection.search(**request)
print(response["docs"])
display_product_search(query, response["docs"])

[{'upc': '97360722345', 'name': 'Transformers/Transformers: Revenge of the Fallen: Two-Movie Mega Collection [2 Discs] - Widescreen - DVD', 'manufacturer': ' ', 'score': 3.3835273}, {'upc': '97360724240', 'name': 'Transformers: Revenge of the Fallen - Widescreen - DVD', 'manufacturer': ' ', 'score': 3.1457326}, {'upc': '400192926087', 'name': 'Transformers: Dark of the Moon - Original Soundtrack - CD', 'manufacturer': 'Reprise', 'score': 2.9793549}, {'upc': '47875842328', 'name': 'Transformers: Dark of the Moon Stealth Force Edition - Nintendo Wii', 'manufacturer': 'Activision', 'score': 2.851606}, {'upc': '826663114164', 'name': 'Transformers: The Complete Series [25th Anniversary Matrix of Leadership Edition] [16 Discs] - DVD', 'manufacturer': ' ', 'score': 2.356245}]
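
The `order_by` entry in the request sorts by `score` descending and uses `upc` ascending as a tie-breaker. A rough plain-Python sketch of that ordering follows. The scores are made up to force a tie, and comparing UPCs numerically is an assumption based on the `long` type in the schema above; the engine's actual comparison may differ.

```python
# Illustrative documents only: the scores are invented so that two of them tie.
docs = [
    {"upc": "97360724240", "score": 3.0},
    {"upc": "400192926087", "score": 3.0},
    {"upc": "97360722345", "score": 3.5},
]

# Negating the score gives descending order; the numeric UPC breaks ties ascending.
ranked = sorted(docs, key=lambda doc: (-doc["score"], int(doc["upc"])))
print([doc["upc"] for doc in ranked])
```

A deterministic tie-breaker like this keeps result order stable between runs when several documents receive the same score.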


In [11]:
! cd ../../data/retrotech && head signals.csv

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"
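
The signals file is a quoted CSV with the header shown above. As a small sketch of reading it with the standard library (the notebook itself loads data through Spark, so this is only an illustration), the first two rows can be parsed and their timestamps converted like this:

```python
import csv
import io
from datetime import datetime

# First rows of signals.csv, copied from the output above.
sample = '''"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
'''

signals = list(csv.DictReader(io.StringIO(sample)))
for signal in signals:
    # %f accepts one to six fractional digits, so the 4-digit fractions parse as-is.
    signal["signal_time"] = datetime.strptime(
        signal["signal_time"], "%Y-%m-%d %H:%M:%S.%f"
    )

print(signals[0]["target"], signals[0]["signal_time"].year)
```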


## Download query sessions

Download the simulated raw clickstream sessions, one gzipped file per query.

In [12]:
from ltr import download
simulated_queries = ["dryer", "bluray", "blue ray", "headphones", "ipad", "iphone",
                     "kindle", "lcd tv", "macbook", "nook", "star trek", "star wars",
                     "transformers dark of the moon"]

sessions = [f"https://github.com/ai-powered-search/retrotech/raw/master/sessions/{query}_sessions.gz"
            for query in simulated_queries]
           
download(sessions, dest="../../data/")

GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/dryer_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/bluray_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/blue ray_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/headphones_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/ipad_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/iphone_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/kindle_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/lcd tv_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/macbook_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/nook_sessions.gz
GET https://github.com/ai-powered-search/retrotech/raw/master/sessions/star trek_sessions.gz
GET h
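
Several of the queries contain spaces, which appear unencoded in the request URLs above. Whether the `download` helper encodes them itself is not shown here. If a client rejects such URLs, one option is to percent-encode the file name first. The sketch below uses `urllib.parse.quote`; `session_url` is a hypothetical helper, not part of the `ltr` module.

```python
from urllib.parse import quote

BASE_URL = "https://github.com/ai-powered-search/retrotech/raw/master/sessions"

def session_url(query):
    """Build a session file URL with the file name percent-encoded."""
    # quote() turns the space into %20; safe="" would also encode "/" if one appeared.
    return f"{BASE_URL}/{quote(query + '_sessions.gz', safe='')}"

print(session_url("blue ray"))
```

Note that a download helper which derives the local file name from the URL might then save `blue%20ray_sessions.gz` rather than `blue ray_sessions.gz`, so the destination name would need to be checked.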

In [13]:
!ls ../../data/

'blue ray_sessions.gz'	 'lcd tv_sessions.gz'
 bluray_sessions.gz	  macbook_sessions.gz
 dryer_sessions.gz	  nook_sessions.gz
 headphones_sessions.gz   retrotech
 ipad_sessions.gz	 'star trek_sessions.gz'
 iphone_sessions.gz	 'star wars_sessions.gz'
 kindle_sessions.gz	 'transformers dark of the moon_sessions.gz'
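
To sanity-check a downloaded file before using it, one could peek at its first few lines. The sketch below assumes the archives decompress to UTF-8 text, which this notebook does not confirm; `peek_gzip` is a hypothetical helper.

```python
import gzip
from itertools import islice

def peek_gzip(path, num_lines=5):
    """Return the first few decompressed lines of a gzip file as text."""
    # "rt" opens the archive in text mode; UTF-8 is an assumption about the content.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, num_lines)]

# Example usage, with the path taken from the listing above:
# for line in peek_gzip("../../data/dryer_sessions.gz"):
#     print(line)
```

Reading only the first lines with `islice` avoids decompressing the whole file just to inspect its layout.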


In [14]:
!ls retrotech/sessions/

'blue ray_sessions.gz'	 'lcd tv_sessions.gz'
 bluray_sessions.gz	  macbook_sessions.gz
 dryer_sessions.gz	  nook_sessions.gz
 headphones_sessions.gz  'star trek_sessions.gz'
 ipad_sessions.gz	 'star wars_sessions.gz'
 iphone_sessions.gz	 'transformers dark of the moon_sessions.gz'
 kindle_sessions.gz


Up next: [A/B Testing Simulation to Active Learning](1.ab-testing-to-active-learning.ipynb)