# A/B Testing Simulation to Active Learning

In this notebook, users have a hidden preference for a single query. We use this to explore A/B testing to see whether a given LTR model actually gives the users what they want.

Then we ask, much like in real life, how can we learn what the user _actually_ wants? We employe active learning to try to escape the 'echo chamber' of presentation bias we learned about at the end of chapter 11. After all users can't click on results that never show up in their search results!

## 🚨 We're putting it all together in this chapter

As this chapter puts together everything from chapters 10 and 11, much of the setup code below wraps up a lot of chapter 11 and 10 into a 'single function' so we can very easily run through the steps in 'one liners'

### Getting training data (Ch 11)

Chapter 11 is all about turning raw clickstream data into search training data (aka judgments). This involves overcoming biases in how users percieve search. But here we put that in one function call `sessions_to_sdbn`.

### Train a model (Ch 10)

Chapter 10 is about training an LTR model, including interacting with Solr to extract features, how a ranking model works, how to train a model, and how to perform a good test/train split for search. But here we similarly wrap that up into a handful of function calls, `test_train_split`, `ranksvm_ltr`, and `eval_model`.

*long story short, if you see a reference to chapter 10 and 11, it's probably omited from chapter 12* - don't expect it to be covered in chapter 12 extensively.


## Setup - gather some sessions (omitted)

To get started, we first load a set of simulated search sessions for all queries. 

Much of this setup is omitted from the chapter. This first part is just loading and synthesizing a bunch of clickstream sessions, like we used in chapter 11.

In [32]:
import numpy as np
import pandas as pd
import random; random.seed(0)
import glob

import requests
import sys
sys.path.append('..')
from aips import *
from ltr.client.solr_client import SolrClient
engine = get_engine()
client = SolrClient(solr_base=SOLR_URL)

In [33]:
def all_sessions():
    sessions = pd.concat([pd.read_csv(f, compression='gzip')
                          for f in glob.glob('retrotech/sessions/*_sessions.gz')])
    return sessions.rename(columns={'clicked_doc_id': 'doc_id'})

sessions = all_sessions()
sessions

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,2,ipad,0.0,92636260712,False
1,2,ipad,1.0,635753493559,True
2,2,ipad,2.0,885909393404,False
3,2,ipad,3.0,843404073153,False
4,2,ipad,4.0,885909457595,False
...,...,...,...,...,...
149995,60001,bluray,25.0,23942973416,False
149996,60001,bluray,26.0,25192107191,False
149997,60001,bluray,27.0,27242809710,False
149998,60001,bluray,28.0,600603132872,False


In [34]:
sessions["query"].unique()

array(['ipad', 'star trek', 'kindle', 'nook', 'dryer', 'star wars',
       'headphones', 'macbook', 'transformers dark of the moon', 'lcd tv',
       'iphone', 'blue ray', 'bluray'], dtype=object)

## Setup Part 2 - Add some more query sessions (omitted)

Here we duplicate the simulated queries from above, but we flip a handful of the clicks. This just fills out our data a bit more, gives a bit more data to work with.

In [35]:
random.seed(0)

def copy_query_sessions(sessions, src_query, dest_query, flip=False):
    new_sessions = sessions[sessions["query"] == src_query].copy()  
    new_sessions["draw"] = np.random.rand(len(new_sessions), 1)
    new_sessions.loc[new_sessions["clicked"] & (new_sessions["draw"] < 0.04), "clicked"] = False
    new_sessions["query"] = dest_query
    return pd.concat([sessions, new_sessions.drop("draw", axis=1)])


sessions = copy_query_sessions(sessions, "transformers dark of the moon", "transformers dark of moon")
sessions = copy_query_sessions(sessions, "transformers dark of the moon", "dark of moon")
sessions = copy_query_sessions(sessions, "transformers dark of the moon", "dark of the moon")
sessions = copy_query_sessions(sessions, "headphones", "head phones")
sessions = copy_query_sessions(sessions, "lcd tv", "lcd television")
sessions = copy_query_sessions(sessions, "lcd tv", "television, lcd")
sessions = copy_query_sessions(sessions, "macbook", "apple laptop")
sessions = copy_query_sessions(sessions, "iphone", "apple iphone")
sessions = copy_query_sessions(sessions, "kindle", "amazon kindle")
sessions = copy_query_sessions(sessions, "kindle", "amazon ereader")
sessions = copy_query_sessions(sessions, "blue ray", "blueray")

sessions

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,2,ipad,0.0,92636260712,False
1,2,ipad,1.0,635753493559,True
2,2,ipad,2.0,885909393404,False
3,2,ipad,3.0,843404073153,False
4,2,ipad,4.0,885909457595,False
...,...,...,...,...,...
149995,55001,blueray,25.0,22265004517,False
149996,55001,blueray,26.0,885170038875,False
149997,55001,blueray,27.0,786936817232,False
149998,55001,blueray,28.0,600603132872,False


In [36]:
sessions["query"].unique()

array(['ipad', 'star trek', 'kindle', 'nook', 'dryer', 'star wars',
       'headphones', 'macbook', 'transformers dark of the moon', 'lcd tv',
       'iphone', 'blue ray', 'bluray', 'transformers dark of moon',
       'dark of moon', 'dark of the moon', 'head phones',
       'lcd television', 'television, lcd', 'apple laptop',
       'apple iphone', 'amazon kindle', 'amazon ereader', 'blueray'],
      dtype=object)

## Setup Part 3 - Our test query, `transformers dvd`, with hidden, 'true' preferences

We add a new query to our set of queries `transformers dvd` and we note the users' hidden preferences in the variables `desired_movies` as well as what they consider mediocre `meh_transformers_movies` and not at all relevant `irrelevant_transformers_products`. Each holds the UPC of the associated product.

This simulates biased sessions in the data, as if the user never actually sees (and hence never clicks) their actual desired item. If the users desired results are shown, those results get a higher probability of click. Otherwise there is a lower probability of clicks.

In [37]:
next_sess_id = sessions["sess_id"].max()

# For some reason, the sessions only capture examines on the "dubbed" transformers movies
# ie the Japanese shows brought to an English-speaking market. But we'll see this is not what the 
# user wants (ie presentation bias). These are "meh" mildly interesting. There are also many many
# completely irrelevant movies.

# What the user wants, but never visible! Never gets clicked!
# These are the widescreen transformers dvds of the hollywood movies
desired_transformers_movies = ["97360724240", "97360722345", "97368920347"] 

# Bunch of random merchandise
irrelevant_transformers_products = ["708056579739", "93624995012", "47875819733", "47875839090", "708056579746",
                                     "47875332911", "47875842328", "879862003524", "879862003517", "93624974918"] 

# Other transformer movies
meh_transformers_movies = ["97363455349", "97361312743", "97361372389", "97361312804", "97363532149", "97363560449"]

displayed_transformer_products = meh_transformers_movies + irrelevant_transformers_products

new_sessions = []
for i in range(0,5000):
    random.shuffle(displayed_transformer_products)

    # shuffle each session
    for rank, upc in enumerate(displayed_transformer_products):
        draw = random.random()        
        clicked = upc in meh_transformers_movies and draw < 0.13 or \
                  upc in irrelevant_transformers_products and draw < 0.005 or \
                  upc in desired_transformers_movies and draw < 0.65 \

        new_sessions.append({"sess_id": next_sess_id + i, 
                             "query": "transformers dvd", 
                             "rank": rank,
                             "clicked": clicked,
                             "doc_id": upc})


sessions = pd.concat([sessions, pd.DataFrame(new_sessions)])
sessions

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,2,ipad,0.0,92636260712,False
1,2,ipad,1.0,635753493559,True
2,2,ipad,2.0,885909393404,False
3,2,ipad,3.0,843404073153,False
4,2,ipad,4.0,885909457595,False
...,...,...,...,...,...
79995,65000,transformers dvd,11.0,47875842328,False
79996,65000,transformers dvd,12.0,879862003517,False
79997,65000,transformers dvd,13.0,97361372389,False
79998,65000,transformers dvd,14.0,93624995012,False


## Setup 4 - chapter 11 In One Function (omitted) 

Wrapping up Chapter 11 in a single function `sessions_to_sdbn`. 

This function computes a relevance grade out of raw clickstream data. Recall that the SDBN (Simplified Dynamic Bayesian Network) click model we learned about in chapter 11 helps overcome position bias. We also use a beta prior so that a single click doesn't count as much as an observation with hundreds.

In [38]:
def sessions_to_sdbn(sessions, prior_weight=10, prior_grade=0.2) -> pd.DataFrame:
    """ Compute SDBN of the provided query as a dataframe.
        Where we left off at end of 'overcoming confidence bias' 
        """
    all_sdbn = pd.DataFrame()
    for query in sessions["query"].unique():
        sdbn_sessions = sessions[sessions["query"] == query].copy().set_index("sess_id")

        last_click_per_session = sdbn_sessions.groupby(["clicked", "sess_id"])["rank"].max()[True]

        sdbn_sessions["last_click_rank"] = last_click_per_session
        sdbn_sessions["examined"] = sdbn_sessions["rank"] <= sdbn_sessions["last_click_rank"]

        sdbn = sdbn_sessions[sdbn_sessions["examined"]].groupby("doc_id")[["clicked", "examined"]].sum()
        sdbn["grade"] = sdbn["clicked"] / sdbn["examined"]
        sdbn["query"] = query

        sdbn = sdbn.sort_values("grade", ascending=False)

        sdbn["prior_a"] = prior_grade*prior_weight
        sdbn["prior_b"] = (1-prior_grade)*prior_weight

        sdbn["posterior_a"] = sdbn["prior_a"] +  sdbn["clicked"]
        sdbn["posterior_b"] = sdbn["prior_b"] + (sdbn["examined"] - sdbn["clicked"])

        sdbn["beta_grade"] = sdbn["posterior_a"] / (sdbn["posterior_a"] + sdbn["posterior_b"])

        sdbn.sort_values("beta_grade", ascending=False)
        all_sdbn = pd.concat([all_sdbn, sdbn])
    return all_sdbn[["query", "clicked", "examined", "grade", "beta_grade"]].reset_index().set_index(["query", "doc_id"])



## Listing 12.1 Use Convert Raw Sessions to SDBN

We kickoff with the data we left off with in chapter 11.

In this listing we user our "chapter 11 in one function" `sessions_to_sdbn` to rebuild training data.

In [39]:
sdbn = sessions_to_sdbn(sessions,
                        prior_weight=10,
                        prior_grade=0.2)
sdbn

Unnamed: 0_level_0,Unnamed: 1_level_0,clicked,examined,grade,beta_grade
query,doc_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ipad,885909457588,185,408,0.453431,0.447368
ipad,885909472376,680,1677,0.405486,0.404268
ipad,821793013776,103,379,0.271768,0.269923
ipad,635753493559,333,1238,0.268982,0.268429
ipad,722868830062,74,297,0.249158,0.247557
...,...,...,...,...,...
transformers dvd,47875819733,24,1679,0.014294,0.015394
transformers dvd,708056579739,23,1659,0.013864,0.014979
transformers dvd,879862003524,23,1685,0.013650,0.014749
transformers dvd,93624974918,19,1653,0.011494,0.012628


## Chapter 10 Functions (omitted from book)

Now with the chapter 11 setup out of the way, we'll need to give Chapter 10's code a similar treatment, wrapping that LTR system into a black box.

All of the following are support functions for the chapter:

1. Convert the sdbn dataframe into individual `Judgment` objects needed for training the model from chapter 10
2. Pairwise transformation of the data
3. Normalization of the data
4. Training the model
5. Uploading the model to Solr

All of these steps are covered in Chapter 10.

In [40]:
import requests
import numpy as np
from ltr.judgments import judgments_from_file, judgments_to_nparray
from sklearn import svm
import json
import math
from itertools import groupby
from ltr.log import FeatureLogger
from ltr.judgments import judgments_open
from itertools import groupby
from ltr import download
from ltr.judgments import judgments_writer

from ltr.judgments import Judgment

def sdbn_to_judgments(sdbn):
    """Turn pandas dataframe into ltr judgments objects."""
    judgments = []
    queries = {}
    next_qid = 0
    for row_dict in sdbn.reset_index().to_dict(orient="records"):
        # Round grade to 10ths, Map 0.3 -> 3, etc
        grade = round(row_dict['beta_grade'], 1) * 10
        qid = -1
        if row_dict['query'] in queries:
            qid = queries[row_dict['query']]
        else:
            queries[row_dict['query']] = next_qid
            qid = next_qid
            next_qid += 1
        assert qid != -1
        
        judgments.append(Judgment(doc_id=row_dict['doc_id'],
                                  keywords=row_dict['query'],
                                  qid=qid,
                                  grade=int(grade))
                        )
    return judgments


sdbn_to_judgments(sdbn)


def write_judgments(judgments, dest='retrotech_judgments.txt'):
    with judgments_writer(open(dest, 'wt')) as writer:
        for judgment in judgments:
            writer.write(judgment)
            
write_judgments(sdbn_to_judgments(sdbn))
!cat retrotech_judgments.txt


def normalize_features(logged_judgments):
    print(logged_judgments[0].features)
    print(logged_judgments[len(logged_judgments) - 1].features)
    print(logged_judgments[len(logged_judgments) - 1])
    all_features = []
    means = [0] * len(logged_judgments[0].features)
    for judgment in logged_judgments:
        for idx, f in enumerate(judgment.features):
            means[idx] += f
        all_features.append(judgment.features)
    
    for i in range(len(means)):
        means[i] /= len(logged_judgments)
      
    std_devs = [0.0] * len(logged_judgments[0].features)
    for judgment in logged_judgments:
        for idx, f in enumerate(judgment.features):
            std_devs[idx] += (f - means[idx])**2
            
    for i in range(len(std_devs)):
        std_devs[i] /= len(logged_judgments)
        std_devs[i] = math.sqrt(std_devs[i])
    
    for i in range(len(std_devs)):
        if std_devs[i] == 0:
            std_devs[i] = 0.00001
        
    # Normalize!
    normed_judgments = []
    for judgment in logged_judgments:
        normed_features = [0.0] * len(judgment.features)
        for idx, f in enumerate(judgment.features):
            normed = 0.0
            if std_devs[idx] > 0: 
                normed = (f - means[idx]) / std_devs[idx]
            normed_features[idx] = normed
        normed_judgment=Judgment(qid=judgment.qid,
                                 keywords=judgment.keywords,
                                 doc_id=judgment.doc_id,
                                 grade=judgment.grade,
                                 features=normed_features)
        normed_judgment.old_features=judgment.features
        normed_judgments.append(normed_judgment)

    return means, std_devs, normed_judgments


def pairwise_transform(normed_judgments, weigh_difference = True):
        
    predictor_deltas = []
    feature_deltas = []
    
    # For each query's judgments
    for qid, query_judgments in groupby(normed_judgments, key=lambda j: j.qid):

        # Annoying issue consuming python iterators, we ensure we have two
        # full copies of each query's judgments
        query_judgments_copy_1 = list(query_judgments) 
        query_judgments_copy_2 = list(query_judgments_copy_1)

        # Examine every judgment combo for this query, 
        # if they're different, store the pairwise difference:
        # +1 if judgment1 more relevant
        # -1 if judgment2 more relevant
        for judgment1 in query_judgments_copy_1:
            for judgment2 in query_judgments_copy_2:
                
                j1_features=np.array(judgment1.features)
                j2_features=np.array(judgment2.features)
                
                if judgment1.grade > judgment2.grade:
                    diff = judgment1.grade - judgment2.grade if weigh_difference else 1.0
                    predictor_deltas.append(+1)
                    feature_deltas.append(diff * (j1_features-j2_features))
                elif judgment1.grade < judgment2.grade:
                    diff = judgment2.grade - judgment1.grade if weigh_difference else 1.0
                    predictor_deltas.append(-1)
                    feature_deltas.append(diff * (j1_features-j2_features))

    # For training purposes, we return these as numpy arrays
    return np.array(feature_deltas), np.array(predictor_deltas)
def upload_model(model, model_name, means, std_devs, feature_set):
    linear_model = {
      "store": "aips_feature_store",
      "class": "org.apache.solr.ltr.model.LinearModel",
      "name": model_name,
      "features": [
      ],
      "params": {
          "weights": {
          }
      }
    }

    ftr_model = {}
    ftr_names = [ftr['name'] for ftr in feature_set]
    for idx, ftr_name in enumerate(ftr_names):
        config = {
            "name": ftr_name,
            "norm": {
                "class": "org.apache.solr.ltr.norm.StandardNormalizer",
                "params": {
                    "avg": str(means[idx]),
                    "std": str(std_devs[idx])
                }
            }
        }
        linear_model['features'].append(config)
        linear_model['params']['weights'][ftr_name] =  model.coef_[0][idx] 

    # Delete old model
    resp = requests.delete(f"{SOLR_URL}/products/schema/model-store/{model_name}")

    # Upload the model
    resp = requests.put(f"{SOLR_URL}/products/schema/model-store", json=linear_model)
    print(resp.json())
    requests.get(f"{SOLR_URL}/admin/collections?action=RELOAD&name=products&wt=xml")


    
## TODO - can't easily to test/train split on these few queries
##   make more queries?

def ranksvm_ltr(sdbn, model_name, feature_set):
    """Train a RankSVM model via Solr, store in Solr."""
    judgments = sdbn_to_judgments(sdbn)
    judgments_path = 'retrotech_judgments.txt'
    write_judgments(judgments, judgments_path)
    
    requests.delete(f"{SOLR_URL}/products/schema/model-store/{model_name}")
    
    resp = requests.put(f"{SOLR_URL}/products/schema/feature-store",
                    json=feature_set)
    print("Put feature set")
    print(resp.json())
    ftr_logger=FeatureLogger(client, index='products', feature_set="aips_feature_store", id_field='upc')

    with judgments_open(judgments_path) as judgment_list:
        for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):
            ftr_logger.log_for_qid(judgments=query_judgments, 
                                   qid=qid,
                                   keywords=judgment_list.keywords(qid))

    logged_judgments = ftr_logger.logged
    means, std_devs, normed_judgments = normalize_features(logged_judgments)
    feature_deltas, predictor_deltas = pairwise_transform(normed_judgments)

    model = svm.LinearSVC(max_iter=10000, verbose=1)
    model.fit(feature_deltas, predictor_deltas)  
    upload_model(model, model_name, means, std_devs, feature_set)
    
requests.delete(f"{SOLR_URL}/products/schema/feature-store/aips_feature_store")


# qid:0: ipad*1
# qid:1: star trek*1
# qid:2: kindle*1
# qid:3: nook*1
# qid:4: dryer*1
# qid:5: star wars*1
# qid:6: headphones*1
# qid:7: macbook*1
# qid:8: transformers dark of the moon*1
# qid:9: lcd tv*1
# qid:10: iphone*1
# qid:11: blue ray*1
# qid:12: bluray*1
# qid:13: transformers dark of moon*1
# qid:14: dark of moon*1
# qid:15: dark of the moon*1
# qid:16: head phones*1
# qid:17: lcd television*1
# qid:18: television, lcd*1
# qid:19: apple laptop*1
# qid:20: apple iphone*1
# qid:21: amazon kindle*1
# qid:22: amazon ereader*1
# qid:23: blueray*1
# qid:24: transformers dvd*1

4	qid:0	 # 885909457588	ipad
4	qid:0	 # 885909472376	ipad
3	qid:0	 # 821793013776	ipad
3	qid:0	 # 635753493559	ipad
2	qid:0	 # 722868830062	ipad
2	qid:0	 # 885909471812	ipad
2	qid:0	 # 92636260712	ipad
2	qid:0	 # 886111271283	ipad
2	qid:0	 # 885909457632	ipad
2	qid:0	 # 885909393404	ipad
2	qid:0	 # 885909457601	ipad
2	qid:0	 # 716829772249	ipad
2	qid:0	 # 885909457595	ipad
2	qid:0	 # 600603132827	ipad
2	q

<Response [200]>

## Also Chapter 10 - Perform a test / train split on the SDBN data (omitted)

This function is broken out from the model training. It lets us train a model on one set of data (reusing the chapter 10 training code), reserving test queries for evaluation.

In [41]:
from math import floor

def test_train_split(sdbn, train):
    """Split queries in sdbn into train / test split with `train` proportion going to training set."""
    queries = sdbn.index.get_level_values('query').unique().copy().tolist()
    random.shuffle(queries)
    num_queries = len(queries)
    split_point = floor(num_queries * train)
    
    train_queries = queries[:split_point]
    test_queries = queries[split_point:]
    return sdbn.loc[train_queries, :], sdbn.loc[test_queries]


## Chapter 10 - Search Code (omitted)

Also from Chapter 10, a simple function to search using the LTR model and return a list of search results.

In [42]:
def search_with_model(query, model_name, at=10, log=False):
    """ Search using test_model LTR model (see rq to and qf params below). """
    fuzzy_kws = "~" + ' ~'.join(query.split())
    squeezed_kws = "".join(query.split())
    
    rq = \
        "{!ltr reRankDocs=60000 reRankWeight=10.0 model=" + model_name \
        + " efi.fuzzy_keywords=\"" + fuzzy_kws + "\" " \
        + "efi.squeezed_keywords=\"" + squeezed_kws +"\" " \
        + "efi.keywords=\"" + query + "\"}"

    request = {
            "fields": ["upc", "name", "manufacturer", "score"],
            "limit": at,
            "params": {
              "rq": rq,
              "qf": "name name_ngram upc manufacturer shortDescription longDescription",
              "defType": "edismax",
              "q": query
            }
        }
    
    if log:
        print(request)

    resp = requests.post(f"{SOLR_URL}/products/select", 
                                   json=request).json()
        
    if log:
        print(resp)
        
    search_results = resp['response']['docs']

    for rank, result in enumerate(search_results):
        result['rank'] = rank
        
    return search_results

def search_and_grade(query, model_name, sdbn, desired=[]):
    results = search_with_model(query, model_name, at=10)
    results = pd.DataFrame(results)
    results['desired'] = False
    for upc in desired:
        results.loc[results['upc'] == upc, 'desired'] = True
        
    sdbn_query = sdbn.loc[query].copy().reset_index()
    return results.merge(sdbn_query, left_on='upc', right_on='doc_id', how='left')

## Chapter 10 - Evaluate the model on the test set (omitted)

This function computes the model's performance on a set of test queries. The model was not trained on the queries in `test`. We compute the precision of these queries

In [43]:
def evaluate_model(test, model_name, sdbn, at=10):
    queries = test.index.get_level_values("query").unique()
    
    query_results = {}
    
    for query in queries:
        search_results = search_with_model(query, model_name, at=at, log=True)

        results = pd.DataFrame(search_results).reset_index()
        judgments = sdbn.loc[query, :].copy().reset_index()
        judgments["doc_id"] = judgments["doc_id"].astype(str)
        if len(results) == 0:
            print(f"No Results for {query}")
            query_results[query] = 0
        else:
            graded_results = results.merge(judgments, left_on="upc", right_on="doc_id", how="left")
            print(graded_results)
            graded_results[["clicked", "examined", "grade", "beta_grade"]] = graded_results[["clicked", "examined", "grade", "beta_grade"]].fillna(0)
            grade_results = graded_results.drop("doc_id", axis=1)

            query_results[query] = (graded_results["beta_grade"].sum() / at)
    return query_results

## Listing 12.2 - model training

We wrap all the important decisions from chapter 10 in a few lines 

In [44]:
random.seed(1234)
def feature(name, q, store_name="aips_feature_store"):
    return {
        "name": name,
        "store": store_name,
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q" : q}
    }
  
feature_set = [
    feature("long_description_bm25", "longDescription:(${keywords})"),
    feature("short_description_constant", "shortDescription:(${keywords})^=1")
]

train, test = test_train_split(sdbn, train=0.8)
ranksvm_ltr(train, "click_model_basic", feature_set=feature_set)
evaluate_model(test, "click_model_basic", sdbn=sdbn)

Put feature set
{'responseHeader': {'status': 0, 'QTime': 1}}
{!terms f=upc}36725236271,883393003458,36725234789,882777064009,22265004289,27242817197,729507810218,885170042704,812491010310,812491010334,827912072969,74000373105,884483335329,827912068467,97278016000,827912068474,696211503197,605342041546,882777064207,885170042667,13803112610,729507813059,22265004258,719192579996,600603139758,36725235564,22265004302,723755834491
Searching products [Status: 200]
Missing doc 600603139758
{!terms f=upc}856751002097,48231011396,84691226727,74108007469,12505525766,36725578241,48231011402,12505527456,74108096487,36725561977,84691226703,665331101927,783722274422,14381196320,77283045400,74108056764,883049066905,12505451713,36172950027,883929085118
Searching products [Status: 200]
{!terms f=upc}27242815414,600603132872,827396513927,600603141003,885170033412,883929140855,24543672067,813774010904,36725617605,786936817232,25192073007,719192580374,36725608443,75993997675,36725608894,711719983156,22265


Liblinear failed to converge, increase the number of iterations.



{'responseHeader': {'status': 0, 'QTime': 3}}
{'fields': ['upc', 'name', 'manufacturer', 'score'], 'limit': 10, 'params': {'rq': '{!ltr reRankDocs=60000 reRankWeight=10.0 model=click_model_basic efi.fuzzy_keywords="~kindle" efi.squeezed_keywords="kindle" efi.keywords="kindle"}', 'qf': 'name name_ngram upc manufacturer shortDescription longDescription', 'defType': 'edismax', 'q': 'kindle'}}
{'responseHeader': {'zkConnected': True, 'status': 0, 'QTime': 1, 'params': {'json': '{"fields": ["upc", "name", "manufacturer", "score"], "limit": 10, "params": {"rq": "{!ltr reRankDocs=60000 reRankWeight=10.0 model=click_model_basic efi.fuzzy_keywords=\\"~kindle\\" efi.squeezed_keywords=\\"kindle\\" efi.keywords=\\"kindle\\"}", "qf": "name name_ngram upc manufacturer shortDescription longDescription", "defType": "edismax", "q": "kindle"}}'}}, 'response': {'numFound': 84, 'start': 0, 'maxScore': 3.873207, 'numFoundExact': True, 'docs': [{'upc': '814916011872', 'name': 'Amazon - Kindle Keyboard 3G (F

{'kindle': 0.1596595681912331,
 'ipad': 0.12855798749813316,
 'nook': 0.19635613059219356,
 'dark of moon': 0.0,
 'transformers dvd': 0.003258006235976338}

In [45]:
# # What the user wants, but never visible! Never gets clicked!
# These are the widescreen transformers dvds of the hollywood movies
desired_movies = ["97360724240", "97360722345", "97368920347"] 
result = search_and_grade('transformers dvd', "click_model_basic", sdbn, desired_movies)
upcs1 = result['upc']
result

Unnamed: 0,upc,name,manufacturer,score,rank,desired,doc_id,clicked,examined,grade,beta_grade
0,708056579746,Nintendo - Transformers 3 Stylus 2-Pack,Nintendo,0.017091,0,False,708056579746.0,26.0,1664.0,0.015625,0.016726
1,47875332911,Transformers: Revenge of the Fallen - Windows,Activision,0.015678,1,False,47875332911.0,24.0,1630.0,0.014724,0.015854
2,34707056190,Memorex - 50-Pack 16x DVD+R Disc Spindle,Memorex,0.015483,2,False,,,,,
3,23942950585,Verbatim - 25-Pack 16x DVD-R Disc Spindle,Verbatim,0.015189,3,False,,,,,
4,659846419028,Digital Innovations - DvdDr Laser Lens Cleaner...,Digital Innovations,0.014991,4,False,,,,,
5,659846419035,Digital Innovations - CleanDr. Laser Lens Clea...,Digital Innovations,0.014954,5,False,,,,,
6,85854103756,Case Logic - 200-Disc Expandable DVD Album - B...,Case Logic,0.014788,6,False,,,,,
7,716829999523,Coby - Portable DVD Player with Dual TFT-LCD S...,Coby,0.014711,7,False,,,,,
8,600603124068,Init&#x2122; - 24-Disc CD/DVD Wallet - Red,Init&#x99;,0.01464,8,False,,,,,
9,34707056398,Memorex - 50-Pack 16x DVD-R Disc Spindle,Memorex,0.014593,9,False,,,,,


## Listing 12.3

Train a model that performs better offline called `click_model_improved`

In [46]:
random.seed(1234)

feature_set_improved = [
    feature("name_fuzzy", "name_ngram:(${keywords})"),
    feature("name_pf2", "{!edismax qf=name name pf2=name}(${keywords})"),
    feature("shortDescription_pf2", "{!edismax qf=shortDescription pf2=shortDescription}(${keywords})")
]

sdbn = sessions_to_sdbn(sessions) # chapter 11: generate training data

train, test = test_train_split(sdbn, train=0.8)
ranksvm_ltr(train, "click_model_improved", feature_set_improved) # chapter 10: train the model -> the 'LTR engine'
evaluate_model(test, "click_model_improved", sdbn)

Put feature set
{'responseHeader': {'status': 0, 'QTime': 3}}
{!terms f=upc}36725236271,883393003458,36725234789,882777064009,22265004289,27242817197,729507810218,885170042704,812491010310,812491010334,827912072969,74000373105,884483335329,827912068467,97278016000,827912068474,696211503197,605342041546,882777064207,885170042667,13803112610,729507813059,22265004258,719192579996,600603139758,36725235564,22265004302,723755834491
Searching products [Status: 200]
Missing doc 600603139758
{!terms f=upc}856751002097,48231011396,84691226727,74108007469,12505525766,36725578241,48231011402,12505527456,74108096487,36725561977,84691226703,665331101927,783722274422,14381196320,77283045400,74108056764,883049066905,12505451713,36172950027,883929085118
Searching products [Status: 200]
{!terms f=upc}27242815414,600603132872,827396513927,600603141003,885170033412,883929140855,24543672067,813774010904,36725617605,786936817232,25192073007,719192580374,36725608443,75993997675,36725608894,711719983156,22265


Liblinear failed to converge, increase the number of iterations.



{'kindle': 0.16108148412140244,
 'ipad': 0.0,
 'nook': 0.1562118998229628,
 'dark of moon': 0.4124635468004116,
 'transformers dvd': 0.10077083021678328}

## Simulate a user querying, clicking, purchasing (omitted)

This function simulates a user performing a query and possibly taking an action as they scan down the results.

In [47]:
def simulate_live_user_session(query, model_name, desired_products, indifferent_products,
                               desired_probability=0.15,
                               indifferent_probability=0.03,
                               uninterested_probability=0.01,
                               quit_per_result_probability=0.2):
    """Simulates a user 'query' where purchase probability depends on if 
       products upc is in one of three sets.
       
       Users purchase a single product per session.    
       
       Users quit with `quit_per_rank_prod` after scanning each rank
       
       """   
    search_results = search_with_model(query, model_name, at=10)

    results = pd.DataFrame(search_results).reset_index()
    for doc in results.to_dict(orient="records"): 
        draw = random.random()
        
        if doc["upc"] in desired_products:
            if draw < desired_probability:
                return True
        elif doc["upc"] in indifferent_products:
            if draw < indifferent_probability:
                return True
        elif draw < uninterested_probability:
            return True
        if random.random() < quit_per_result_probability:
            return False
        
    return False


## Listing 12.4 - Simulated A/B test on just `transformers dvd` query

Here we pretend 1000 users were served two rankings for `transformers dvd` and based on the hidden preferences here (`wants_to_purchase` and `might_purchase`) we see which performs better with conversions.

In [48]:
random.seed(1234)

transformers_dvds = []
wants_to_purchase = ["97360724240", "97363560449", "97363532149",
                     "97360810042"]
might_purchase = ["97361312743", "97363455349", "97361372389"]

def ab_test_models(query, model_a, model_b):
    """Randomly assign this user to a or b"""
    draw = random.random()
    model_name = model_a if draw < 0.5 else model_b
    
    purchase_made = simulate_live_user_session(query, model_name, 
                                           wants_to_purchase,
                                           might_purchase)
    return (model_name, purchase_made)

number_of_users = 1000
purchases = {"click_model_basic": 0, "click_model_improved": 0}
for _ in range(number_of_users): 
    model_name, purchase_made = ab_test_models("transformers dvd", 
                                             "click_model_basic",
                                             "click_model_improved")
    if purchase_made:
        purchases[model_name] += 1 
    
purchases

{'click_model_basic': 21, 'click_model_improved': 16}

In [49]:
sdbn = sessions_to_sdbn(sessions)
sdbn.loc["transformers dvd"]

Unnamed: 0_level_0,clicked,examined,grade,beta_grade
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97363560449,677,1946,0.347893,0.347137
97361312804,662,1920,0.344792,0.344041
97361312743,657,1916,0.342902,0.34216
97363455349,664,1937,0.342798,0.342065
97361372389,622,1919,0.324127,0.323484
97363532149,623,1927,0.3233,0.322664
879862003517,37,1698,0.02179,0.022834
93624995012,32,1673,0.019127,0.020202
47875842328,29,1663,0.017438,0.01853
708056579746,26,1664,0.015625,0.016726


## New helper: show the features for each SDBN entry (omitted)

This function shows us the logged features of each training row for the given sdbn data for debugging.

So not just

| query   | doc      | grade
|---------|----------|---------
|transformers dvd | 1234 | 1.0

But also a recording of the matches that occured

| query           | doc      | grade    | short_desc_match  | long_desc_match |...
|-----------------|----------|----------|-------------------|-----------------|---
|transformers dvd | 1234     | 1.0      | 0.0               | 1.0             |...

In [50]:
def associate_sdbn_with_features(sdbn, feature_set):
    """Log features alongside sdbn into a dataframe"""
    judgments = sdbn_to_judgments(sdbn)
    judgments_path = "retrotech_judgments.txt"
    write_judgments(judgments, judgments_path)
    
    # For more on this code, review Chapter 10
    requests.delete(f"{SOLR_URL}/products/schema/feature-store/explore")
    
    resp = requests.put(f"{SOLR_URL}/products/schema/feature-store",
                    json=feature_set)

    ftr_logger=FeatureLogger(client, index="products", feature_set="explore", id_field="upc")
    
    with judgments_open(judgments_path) as judgment_list:
        for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):
            ftr_logger.log_for_qid(judgments=query_judgments, 
                                   qid=qid,
                                   keywords=judgment_list.keywords(qid))

    logged_judgments = ftr_logger.logged
    means, std_devs, normed_judgments = normalize_features(logged_judgments)
    feature_deltas, predictor_deltas = pairwise_transform(normed_judgments)
    features, predictors = judgments_to_nparray(logged_judgments)
    logged_judgments_dataframe = pd.concat([pd.DataFrame(predictors),
                                            pd.DataFrame(features)], 
                                           axis=1,
                                           ignore_index=True)
    columns = {idx + 2: ftr["name"] for idx, ftr in enumerate(feature_set)}
    columns[0] = "grade"
    columns[1] = "qid"
    
    qid_to_query = {}
    for j in logged_judgments:
        qid_to_query[j.qid] = j.keywords
        
    qid_to_query = pd.DataFrame(qid_to_query.values()).reset_index().rename(columns={"index": "qid", 0: "query"})
    
    logged_judgments_dataframe = logged_judgments_dataframe.rename(columns=columns)
    logged_judgments_dataframe = logged_judgments_dataframe.merge(qid_to_query, how="left", on="qid")
    cols_order = ["query", "grade"] + [ftr["name"] for idx, ftr in enumerate(feature_set)]
    logged_judgments_dataframe["grade"] = logged_judgments_dataframe["grade"] / 10.0 
    return logged_judgments_dataframe[cols_order].sort_values("query")

## Listing 12.5 - Output matches for one feature set

Another way of formulating `presentation_bias` is to look at the kinds of documents not being shown to users, so we can strategically show those to users. Below we show the value of each feature in `explore_feature_set` for each document in the sdbn judgments.

In [51]:
sdbn = sessions_to_sdbn(sessions)

explore_feature_set = [
    feature("long_desc_match", "longDescription:(${keywords})^=1", "explore"),
    feature("short_desc_match", "shortDescription:(${keywords})^=1", "explore"),
    feature("name_match", "name:(${keywords})^=1", "explore"),
    feature("has_promotion", "promotion_b:true", "explore")
]

sdbn_with_features = associate_sdbn_with_features(sdbn, explore_feature_set)
transformers_dvds = sdbn_with_features[sdbn_with_features["query"] == "transformers dvd"]
transformers_dvds

{!terms f=upc}885909457588,885909472376,821793013776,635753493559,722868830062,885909471812,92636260712,886111271283,885909457632,885909393404,885909457601,716829772249,885909457595,600603132827,27242798236,884962753071,886111287055,635753490879,843404073153,610839379408
Searching products [Status: 200]
Missing doc 600603132827
{!terms f=upc}50644555190,97360719840,97361427546,97360719741,13964123296,97360719642,30206696622,97363485049,738572128920,97360717648,97361166247,13964123302,5051368213637,29757201560,46034897179,883929139446,742725280410,635753490541,97361301747,31398121381,27242829619,97360743548,97360719147,791149900183,52824803121,97361311944,12505226021,50694439860,36725578340,36725235564
Searching products [Status: 200]
{!terms f=upc}814916011896,9781400532711,814916014361,814916011872,814916014590,814916014385,813580017906,814916010202,814916010240,9781400532650,813580018514,814916010288,813580018491,814916010219,92636257521,814916014606,813580015261,814916010233,8135800

Unnamed: 0,query,grade,long_desc_match,short_desc_match,name_match,has_promotion
618,transformers dvd,0.0,1.0,0.0,1.0,0.0
623,transformers dvd,0.0,1.0,1.0,1.0,0.0
622,transformers dvd,0.0,1.0,1.0,1.0,0.0
621,transformers dvd,0.0,1.0,0.0,1.0,0.0
620,transformers dvd,0.0,1.0,0.0,1.0,0.0
619,transformers dvd,0.0,1.0,0.0,1.0,0.0
617,transformers dvd,0.0,0.0,0.0,1.0,0.0
610,transformers dvd,0.3,0.0,0.0,1.0,0.0
615,transformers dvd,0.3,0.0,0.0,1.0,0.0
614,transformers dvd,0.3,0.0,0.0,1.0,0.0


## Listing 12.6 - Train Gaussian Process Regressor

We train data on just the `transformers_dvd` training data. 

NOTE we could also train on the full sdbn data, and see globally what's missing. However it's often convenient to zero in on specific queries to round out their training data.

In [52]:
from sklearn.gaussian_process import GaussianProcessRegressor

x_train = transformers_dvds[["long_desc_match", "short_desc_match",
                             "name_match", "has_promotion"]]
y_train = transformers_dvds["grade"]

gpr = GaussianProcessRegressor()
gpr.fit(x_train, y_train)

## Listing 12.7: Predict on every value

Here `gpr` predicts on every possible feature value. This lets us analyze which set of feature values to use when exploring with users.

In [53]:
zero_or_one = [0, 1]

index = pd.MultiIndex.from_product(
    [zero_or_one] * 4, names=["long_desc_match", "short_desc_match",
                              "name_match", "has_promotion"])
to_explore = pd.DataFrame(index=index).reset_index()

predictions_with_std = \
    gpr.predict(to_explore[["long_desc_match", "short_desc_match",
                                 "name_match", "has_promotion"]],
                return_std=True)
to_explore["predicted_grade"] = predictions_with_std[0]
to_explore["prediction_stddev"] = predictions_with_std[1]

to_explore.sort_values("prediction_stddev")

Unnamed: 0,long_desc_match,short_desc_match,name_match,has_promotion,predicted_grade,prediction_stddev
2,0,0,1,0,0.2250004,4e-06
10,1,0,1,0,1.192093e-07,4e-06
14,1,1,1,0,0.0,7e-06
6,0,1,1,0,1.192093e-07,1e-05
0,0,0,0,0,0.1364695,0.79506
3,0,0,1,1,0.1364695,0.79506
8,1,0,0,0,0.0,0.79506
11,1,0,1,1,0.0,0.79506
12,1,1,0,0,0.0,0.79506
15,1,1,1,1,0.0,0.79506


## Listing 12.8 - Calculate Expected Improvement


We use [Expected Improvement](https://distill.pub/2020/bayesian-optimization/) scoring to select candidates for exploration within the `transformers dvd` query.

In [54]:
from scipy.stats import norm

theta = 0.6
to_explore["opportunity"] = to_explore["predicted_grade"] - \
                            sdbn["grade"].mean() - theta

to_explore["prob_of_improvement"] = \
    norm.cdf(to_explore["opportunity"]) / to_explore["prediction_stddev"]

to_explore["expected_improvement"] = \
    to_explore["opportunity"] * to_explore["prob_of_improvement"] + \
    to_explore["prediction_stddev"] * \
    norm.pdf(to_explore["opportunity"] / to_explore["prediction_stddev"])

to_explore.sort_values("expected_improvement", ascending=False).head()

Unnamed: 0,long_desc_match,short_desc_match,name_match,has_promotion,predicted_grade,prediction_stddev,opportunity,prob_of_improvement,expected_improvement
1,0,0,0,1,0.08277285,0.929873,-0.697174,0.261161,0.097997
5,0,1,0,1,-5.960464e-08,0.929873,-0.779947,0.23413,0.078346
13,1,1,0,1,-5.960464e-08,0.929873,-0.779947,0.23413,0.078346
9,1,0,0,1,-5.960464e-08,0.929873,-0.779947,0.23413,0.078346
0,0,0,0,0,0.1364695,0.79506,-0.643477,0.326966,0.018202


## Create a query to fetch 'explore' docs (omitted)

Based on the selected features from the GaussianProcessRegressor, we create a query to fetch a doc that contains those features.

In [55]:
def explore_query(explore_vector, query):
    config_explore = {
        "long_desc_match": {"field": "longDescription", "query_dependent": True},                      "short_desc_match": {"field": "shortDescription", "query_dependent": True},
        "name_match": {"field": "name", "query_dependent": True},
        "long_description_bm25": {"field": "longDescription", "query_dependent": True},
        "manufacturer_match": {"field": "manufacturer", "query_dependent": True},
        "has_promotion": {"field": "promotion_b", "query_dependent": False, "1_value": "true"}
    }
    clauses = []
    for col_name, config in config_explore.items():
        try:
            clause = ""
            if explore_vector[col_name] == 1.0:
                clause = f'+{config["field"]}:'
            elif explore_vector[col_name] == -1.0:
                clause = f'-{config["field"]}:'
            if len(clause) > 0:  
                if config["query_dependent"]:
                    clause += f"({query})"
                else:
                    clause += f'{config["1_value"]}'

            clauses.append(clause)
        except KeyError as e:
            pass
    
    final_query = " ".join(clauses)
    final_query = final_query.strip()
    if len(final_query) == 0:
        return "*:*"
    return final_query

## Listing 12.9 - Find document to explore from Solr

Here we fetch a document that matches the properties of something missing from our training set for display to the user

In [56]:
random.seed(1234)

products_collection = engine.get_collection("products")
fields = ["long_desc_match", "short_desc_match",
          "name_match", "has_promotion"]
explore_vector = to_explore.sort_values("expected_improvement",
                                        ascending=False) \
                            .head().iloc[0][fields]

def explore(collection, query, explore_vector):
    """ Explore according to the provided explore vector, select
        a random doc from that group."""
    draw = random.random()
    q = explore_query(explore_vector, query)
    request = {
        "fields": ["upc", "name", "manufacturer", "score"],
        "limit": 1,
        "params": {"q": q, "sort": f"random_{draw} DESC"}
    }
    
    response = collection.search(request)
    return engine.docs_from_response(response)[0]["upc"]

explore(products_collection, "transformers dvd", explore_vector)

'97360724240'

## Simulate new sessions with the new data

(Takes a while)

We simulate new sessions, if the upc is in `might_purchase` or `wants_to_purchase`, we set it to 'clicked' with a given probability.

In [57]:
import random
random.seed(1234)

wants_to_purchase = ["97360724240", "97363560449", "97363532149", "97360810042", "97368920347"]
might_purchase = ["97361312743", "97363455349", "97361372389"]
explore_on_rank = 2.0

products_collection = engine.get_collection("products")
with_explore_sessions = sessions.copy()
for i in range(0, 500):
    print(i)
    explore_upc = explore(products_collection, "transformers dvd", explore_vector)
    print(i, explore_upc)
    sess_ids = list(set(sessions[sessions["query"] == "transformers dvd"]["sess_id"].tolist()))
    random.shuffle(sess_ids)
    sess_ids[0]
    new_session = sessions[sessions["sess_id"] == sess_ids[0]].copy()
    new_session["sess_id"] = 100000 + i
    new_session.loc[new_session["rank"] == explore_on_rank, "doc_id"] = explore_upc
    draw = random.random()
    new_session.loc[new_session["rank"] == explore_on_rank, "clicked"] = False
    if explore_upc in wants_to_purchase:
        if draw < 0.8:
            print(f"click {explore_upc}")
            new_session.loc[new_session["rank"] == explore_on_rank, "clicked"] = True
    elif explore_upc in might_purchase:
        if draw < 0.5:
            print(f"click {explore_upc}")
            new_session.loc[new_session["rank"] == explore_on_rank, "clicked"] = True
    else:
        if draw < 0.01:
            print(f"click {explore_upc}")
            new_session.loc[new_session["rank"] == explore_on_rank, "clicked"] = True

    with_explore_sessions = pd.concat([with_explore_sessions, new_session])

with_explore_sessions[with_explore_sessions["sess_id"] == 100049]

0
0 97360724240
click 97360724240
1
1 97360810042
click 97360810042
2
2 36725236271
3
3 27242799127
4
4 600603135088
click 600603135088
5
5 27242799127
6
6 97368920347
click 97368920347
7
7 9781400532711
8
8 803238004525
9
9 97360810042
10
10 97360722345
11
11 97360810042
click 97360810042
12
12 27242799127
13
13 97360810042
click 97360810042
14
14 36725236271
15
15 36725236271
16
16 803238004525
17
17 883393003458
18
18 36725236271
19
19 36725236271
20
20 600603135088
21
21 600603135088
22
22 600603135088
23
23 36725236271
24
24 97360724240
click 97360724240
25
25 803238004525
26
26 803238004525
27
27 883393003458
28
28 27242799127
29
29 27242813908
30
30 27242799127
31
31 9781400532711
32
32 97360724240
click 97360724240
33
33 9781400532711
34
34 27242815414
35
35 883393003458
36
36 803238004525
click 803238004525
37
37 97360722345
38
38 883393003458
39
39 27242799127
40
40 36725236271
41
41 803238004525
42
42 883393003458
43
43 97368920347
click 97368920347
44
44 803238004525
45
45 

Unnamed: 0,sess_id,query,rank,doc_id,clicked
12064,100049,transformers dvd,0.0,47875819733,False
12065,100049,transformers dvd,1.0,47875839090,False
12066,100049,transformers dvd,2.0,36725236271,False
12067,100049,transformers dvd,3.0,708056579739,False
12068,100049,transformers dvd,4.0,97363532149,False
12069,100049,transformers dvd,5.0,97363560449,False
12070,100049,transformers dvd,6.0,97361312804,False
12071,100049,transformers dvd,7.0,879862003517,False
12072,100049,transformers dvd,8.0,93624974918,False
12073,100049,transformers dvd,9.0,47875332911,False


## Listing 12.10 - Update judgments from new sessions

Have we added any new docs that appear to be getting more clicks?

In [58]:
reimproved_sdbn = sessions_to_sdbn(with_explore_sessions)
reimproved_sdbn.loc['transformers dvd']

Unnamed: 0_level_0,clicked,examined,grade,beta_grade
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97360724240,20,21,0.952381,0.709677
97360810042,62,69,0.898551,0.810127
97368920347,39,46,0.847826,0.732143
97363455349,731,2117,0.3453,0.344617
97363560449,733,2130,0.344131,0.343458
97361312804,726,2110,0.344076,0.343396
97361312743,708,2084,0.339731,0.339064
97363532149,692,2098,0.329838,0.329222
97361372389,673,2089,0.322164,0.321582
803238004525,2,27,0.074074,0.108108


## New heavily clicked doc is promoted!

```
      {
        "upc":"97360810042",
        "name":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_ngram":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_omit_norms":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_txt_en_split":"Transformers: Dark of the Moon - Blu-ray Disc",
        "manufacturer":"\\N",
        "shortDescription":"\\N",
        "longDescription":"\\N",
        "promotion_b":true,
        "id":"72593b1c-313b-4f25-a4f2-04eae29d858b",
        "_version_":1710117636920049669
      },
```

## Listing 12.11 - Rebuild model using updated judgments

After showing the new document to users, we can rebuild the model using judgments that cover this feature blindspot.

In [59]:
random.seed(1234)

# {'blue ray': 0.0,
# 'dryer': 0.07068309073137659,
# 'headphones': 0.06426395939086295,
# 'dark of moon': 0.25681268708548055,
# 'transformers dvd': 0.10077083021678328}

feature_set_reimproved = [
    feature("name_fuzzy", "name_ngram:(${keywords})"),
    feature("name_pf2", "{!edismax qf=name name pf2=name}(${keywords})"),
    feature("shortDescription_pf2", "{!edismax qf=shortDescription pf2=shortDescription}(${keywords})"),
    feature("has_promotion", "promotion_b:true^=1.0")
]

train, test = test_train_split(sdbn, train=0.8)
ranksvm_ltr(train, "click_model_reimproved", feature_set_improved)
evaluate_model(test, "click_model_reimproved", sdbn=reimproved_sdbn)

Put feature set
{'responseHeader': {'status': 500, 'QTime': 1}, 'error': {'msg': 'name_fuzzy already contained in the store, please use a different name', 'trace': 'org.apache.solr.ltr.feature.FeatureException: name_fuzzy already contained in the store, please use a different name\n\tat org.apache.solr.ltr.store.FeatureStore.add(FeatureStore.java:50)\n\tat org.apache.solr.ltr.store.rest.ManagedFeatureStore.addFeature(ManagedFeatureStore.java:119)\n\tat org.apache.solr.ltr.store.rest.ManagedFeatureStore.applyUpdatesToManagedData(ManagedFeatureStore.java:129)\n\tat org.apache.solr.rest.ManagedResource.doPut(ManagedResource.java:383)\n\tat org.apache.solr.rest.RestManager$ManagedEndpoint.delegateRequestToManagedResource(RestManager.java:342)\n\tat org.apache.solr.handler.SchemaHandler$ManagedResourceRequestHandler.handleRequestBody(SchemaHandler.java:315)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:226)\n\tat org.apache.solr.core.SolrCore.execute


Liblinear failed to converge, increase the number of iterations.



{'responseHeader': {'status': 0, 'QTime': 3}}
{'fields': ['upc', 'name', 'manufacturer', 'score'], 'limit': 10, 'params': {'rq': '{!ltr reRankDocs=60000 reRankWeight=10.0 model=click_model_reimproved efi.fuzzy_keywords="~kindle" efi.squeezed_keywords="kindle" efi.keywords="kindle"}', 'qf': 'name name_ngram upc manufacturer shortDescription longDescription', 'defType': 'edismax', 'q': 'kindle'}}
{'responseHeader': {'zkConnected': True, 'status': 0, 'QTime': 2, 'params': {'json': '{"fields": ["upc", "name", "manufacturer", "score"], "limit": 10, "params": {"rq": "{!ltr reRankDocs=60000 reRankWeight=10.0 model=click_model_reimproved efi.fuzzy_keywords=\\"~kindle\\" efi.squeezed_keywords=\\"kindle\\" efi.keywords=\\"kindle\\"}", "qf": "name name_ngram upc manufacturer shortDescription longDescription", "defType": "edismax", "q": "kindle"}}'}}, 'response': {'numFound': 84, 'start': 0, 'maxScore': 3.873207, 'numFoundExact': True, 'docs': [{'upc': '814916014385', 'name': 'Amazon - Kindle Fire

{'kindle': 0.16108148412140244,
 'ipad': 0.0,
 'nook': 0.1562118998229628,
 'dark of moon': 0.4124635468004116,
 'transformers dvd': 0.17374053862937391}

## Listing 12.12 - Rerun A/B test on new `reimproved` model

In [60]:
number_of_users = 1000
purchases = {"click_model_basic": 0, "click_model_reimproved": 0}
for _ in range(0, number_of_users):    
    model_name, purchase_made = ab_test_models("transformers dvd", 
                                               "click_model_basic",
                                               "click_model_reimproved")
    if purchase_made:
        purchases[model_name] += 1 
    
purchases

{'click_model_basic': 15, 'click_model_reimproved': 76}

## Listings 12.6-12.8 in one function (omitted)

We wrap the core of the Active Learning we covered in this chapter into a single function to allow us to select the ideal document to explore.

In [61]:
from sklearn.gaussian_process import GaussianProcessRegressor
from scipy.stats import norm


def best_explore_candidate(sdbn, feature_set, theta=0.6):
    
    requests.delete(f"{SOLR_URL}/products/schema/feature-store/explore")
    
    resp = requests.put(f"{SOLR_URL}/products/schema/feature-store",
                    json=feature_set)
    
    sdbn_ftrs = associate_sdbn_with_features(sdbn, feature_set)
    transformers_dvds = sdbn_ftrs[sdbn_ftrs["query"] == "transformers dvd"]

    y_train = transformers_dvds["grade"]
    feature_names = [ftr["name"] for ftr in explore_feature_set]
    x_train = transformers_dvds[feature_names]

    gpr=GaussianProcessRegressor()
    gpr.fit(x_train, y_train)
    
    zero_or_one = [0,1]

    index = pd.MultiIndex.from_product([zero_or_one] * 4,
                                       names = feature_names)
    to_explore = pd.DataFrame(index=index).reset_index()

    predictions_with_std = gpr.predict(to_explore[feature_names], return_std=True)
    to_explore["predicted_grade"] = predictions_with_std[0]
    to_explore["prediction_stddev"] = predictions_with_std[1]

    to_explore.sort_values("prediction_stddev")

    to_explore["opportunity"] = to_explore["predicted_grade"] - sdbn["grade"].mean() - theta


    to_explore["prob_of_improvement"] = norm.cdf( (to_explore["opportunity"]) / to_explore["prediction_stddev"])

    to_explore["expected_improvement"] = to_explore["opportunity"] * to_explore["prob_of_improvement"] \
     + to_explore["prediction_stddev"] * norm.pdf( to_explore["opportunity"] / to_explore["prediction_stddev"])


    to_explore.sort_values("expected_improvement", ascending=False).head()
    
    options = to_explore.loc[:, feature_names]
    return options.loc[0]


explore_feature_set = [
    feature("manufacturer_match", "manufacturer:(${keywords})^=1", "explore"),
    feature("name_fuzzy", "name_ngram:(${keywords})", "explore"),
    feature("long_description_bm25", "longDescription:(${keywords})", "explore"),
    feature("short_description_constant", "shortDescription:(${keywords})^=1", "explore")
]

best_explore_candidate(sdbn, explore_feature_set)

{!terms f=upc}885909457588,885909472376,821793013776,635753493559,722868830062,885909471812,92636260712,886111271283,885909457632,885909393404,885909457601,716829772249,885909457595,600603132827,27242798236,884962753071,886111287055,635753490879,843404073153,610839379408
Searching products [Status: 200]
Missing doc 600603132827
{!terms f=upc}50644555190,97360719840,97361427546,97360719741,13964123296,97360719642,30206696622,97363485049,738572128920,97360717648,97361166247,13964123302,5051368213637,29757201560,46034897179,883929139446,742725280410,635753490541,97361301747,31398121381,27242829619,97360743548,97360719147,791149900183,52824803121,97361311944,12505226021,50694439860,36725578340,36725235564
Searching products [Status: 200]
{!terms f=upc}814916011896,9781400532711,814916014361,814916011872,814916014590,814916014385,813580017906,814916010202,814916010240,9781400532650,813580018514,814916010288,813580018491,814916010219,92636257521,814916014606,813580015261,814916010233,8135800

manufacturer_match            0
name_fuzzy                    0
long_description_bm25         0
short_description_constant    0
Name: 0, dtype: int64

## Listing 12.13 - Fully Automated LTR Loop

These lines expand Listing 12.13 from the book (the book content is a truncated form of what's below). You could put this in a loop and constantly try new features to try to get closer at a generalized ranking solution of what users actually want.

In [62]:
exploit_feature_set = [
    feature("name_fuzzy", "name_ngram:(${keywords})", "exploit"),
    feature("long_description_bm25","longDescription:(${keywords})", "exploit"),
    feature("short_description_constant", "shortDescription:(${keywords})^=1", "exploit")
]
train, test = test_train_split(sdbn, train=0.8) 
ranksvm_ltr(train, "exploit_model", exploit_feature_set)
evaluate_model(test, "exploit_model", sdbn=reimproved_sdbn)

# ===============
# EXPLORE

explore_feature_set = [
    feature("manufacturer_match", "manufacturer:(${keywords})^=1", "explore"),
    feature("name_fuzzy", "name_ngram:(${keywords})", "explore"),
    feature("long_description_bm25", "longDescription:(${keywords})", "explore"),
    feature("short_description_constant", "shortDescription:(${keywords})^=1", "explore")
]

products_collection = engine.get_collection("products")
explore_vector = best_explore_candidate(sdbn, explore_feature_set, theta=0.6)
explore_upc = explore(products_collection, "transformers dvd", explore_vector) 


# =========
# GATHER                                   
sdbn = sessions_to_sdbn(sessions,            
                        prior_weight=10,    
                        prior_grade=0.2)
sdbn

Put feature set
{'responseHeader': {'status': 500, 'QTime': 1}, 'error': {'msg': 'name_fuzzy already contained in the store, please use a different name', 'trace': 'org.apache.solr.ltr.feature.FeatureException: name_fuzzy already contained in the store, please use a different name\n\tat org.apache.solr.ltr.store.FeatureStore.add(FeatureStore.java:50)\n\tat org.apache.solr.ltr.store.rest.ManagedFeatureStore.addFeature(ManagedFeatureStore.java:119)\n\tat org.apache.solr.ltr.store.rest.ManagedFeatureStore.applyUpdatesToManagedData(ManagedFeatureStore.java:129)\n\tat org.apache.solr.rest.ManagedResource.doPut(ManagedResource.java:383)\n\tat org.apache.solr.rest.RestManager$ManagedEndpoint.delegateRequestToManagedResource(RestManager.java:342)\n\tat org.apache.solr.handler.SchemaHandler$ManagedResourceRequestHandler.handleRequestBody(SchemaHandler.java:315)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:226)\n\tat org.apache.solr.core.SolrCore.execute


Liblinear failed to converge, increase the number of iterations.



{'responseHeader': {'status': 0, 'QTime': 2}}
{'fields': ['upc', 'name', 'manufacturer', 'score'], 'limit': 10, 'params': {'rq': '{!ltr reRankDocs=60000 reRankWeight=10.0 model=exploit_model efi.fuzzy_keywords="~macbook" efi.squeezed_keywords="macbook" efi.keywords="macbook"}', 'qf': 'name name_ngram upc manufacturer shortDescription longDescription', 'defType': 'edismax', 'q': 'macbook'}}
{'responseHeader': {'zkConnected': True, 'status': 0, 'QTime': 2, 'params': {'json': '{"fields": ["upc", "name", "manufacturer", "score"], "limit": 10, "params": {"rq": "{!ltr reRankDocs=60000 reRankWeight=10.0 model=exploit_model efi.fuzzy_keywords=\\"~macbook\\" efi.squeezed_keywords=\\"macbook\\" efi.keywords=\\"macbook\\"}", "qf": "name name_ngram upc manufacturer shortDescription longDescription", "defType": "edismax", "q": "macbook"}}'}}, 'response': {'numFound': 157, 'start': 0, 'maxScore': 4.135349, 'numFoundExact': True, 'docs': [{'upc': '885909240012', 'name': 'Apple&#xAE; - Apple Mini Disp

Unnamed: 0_level_0,Unnamed: 1_level_0,clicked,examined,grade,beta_grade
query,doc_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ipad,885909457588,185,408,0.453431,0.447368
ipad,885909472376,680,1677,0.405486,0.404268
ipad,821793013776,103,379,0.271768,0.269923
ipad,635753493559,333,1238,0.268982,0.268429
ipad,722868830062,74,297,0.249158,0.247557
...,...,...,...,...,...
transformers dvd,47875819733,24,1679,0.014294,0.015394
transformers dvd,708056579739,23,1659,0.013864,0.014979
transformers dvd,879862003524,23,1685,0.013650,0.014749
transformers dvd,93624974918,19,1653,0.011494,0.012628


Up next: [Chapter 13: Semantic Search with Dense Vectors](../ch13/1.setting-up-the-outdoors-dataset.ipynb)