# A/B Testing Simulation

In this notebook, we simulate a user who has a hidden preference for a single query, and we use that simulation to explore A/B testing.

Of course, this problem needs to be multiplied over millions of queries!

1. Start from the last judgments from Chapter 11
2. Fully train a model with two feature sets (turning the Chapter 11 automated LTR notebook into a function)
3. Simulate user interaction with the rankings

In [43]:
import numpy as np
import pandas as pd
import random; random.seed(0)
import glob

import requests
import sys
sys.path.append('..')
from ltr.client.solr_client import SolrClient

client = SolrClient(host='http://aips-solr:8983/solr')

In [44]:
def all_sessions():
    sessions = pd.concat([pd.read_csv(f, compression='gzip')
                          for f in glob.glob('retrotech/sessions/*_sessions.gz')])
    return sessions.rename(columns={'clicked_doc_id': 'doc_id'})
    
sessions = all_sessions()
sessions

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,15002,kindle,0.0,9781400532711,True
1,15002,kindle,1.0,814916011872,False
2,15002,kindle,2.0,813580018491,False
3,15002,kindle,3.0,814916014606,False
4,15002,kindle,4.0,813580017906,False
...,...,...,...,...,...
74995,5001,transformers dark of the moon,10.0,47875841369,False
74996,5001,transformers dark of the moon,11.0,97363560449,False
74997,5001,transformers dark of the moon,12.0,93624956037,False
74998,5001,transformers dark of the moon,13.0,97363532149,False


In [45]:
sessions['query'].unique()

array(['kindle', 'blue ray', 'star wars', 'macbook', 'headphones',
       'bluray', 'lcd tv', 'dryer', 'nook', 'ipad', 'star trek', 'iphone',
       'transformers dark of the moon'], dtype=object)

In [46]:
new_sessions = sessions[sessions['query'] == 'macbook'].copy() 

In [5]:
random.seed(0)
np.random.seed(0)  # the noise below is drawn with NumPy, so seed it as well

# Make two queries' sessions identical, except for the query string
# TODO? Randomly flip some of the clicked bools, but this might make it non deterministic
def copy_query_sessions(sessions, src_query, dest_query):
    new_sessions = sessions[sessions['query'] == src_query].copy()
    new_sessions['draw'] = np.random.rand(len(new_sessions))
    # unclick some in the new query for a bit of noise
    # (use .loc so the assignment hits new_sessions itself, not a temporary copy)
    new_sessions.loc[new_sessions['clicked'] & (new_sessions['draw'] < 0.04), 'clicked'] = False
    new_sessions['query'] = dest_query
    return pd.concat([sessions, new_sessions.drop('draw', axis=1)])

sessions = copy_query_sessions(sessions, 'transformers dark of the moon', 'transformers dark of moon')
sessions = copy_query_sessions(sessions, 'transformers dark of the moon', 'dark of moon')
sessions = copy_query_sessions(sessions, 'transformers dark of the moon', 'dark of the moon')
sessions = copy_query_sessions(sessions, 'headphones', 'head phones')
sessions = copy_query_sessions(sessions, 'lcd tv', 'lcd television')
sessions = copy_query_sessions(sessions, 'lcd tv', 'television, lcd')
sessions = copy_query_sessions(sessions, 'macbook', 'apple laptop')
sessions = copy_query_sessions(sessions, 'iphone', 'apple iphone')
sessions = copy_query_sessions(sessions, 'kindle', 'amazon kindle')
sessions = copy_query_sessions(sessions, 'kindle', 'amazon ereader')
sessions = copy_query_sessions(sessions, 'blue ray', 'blueray')

sessions


Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,15002,kindle,0.0,9781400532711,True
1,15002,kindle,1.0,814916011872,False
2,15002,kindle,2.0,813580018491,False
3,15002,kindle,3.0,814916014606,False
4,15002,kindle,4.0,813580017906,False
...,...,...,...,...,...
149995,55001,blueray,25.0,22265004517,False
149996,55001,blueray,26.0,885170038875,False
149997,55001,blueray,27.0,786936817232,False
149998,55001,blueray,28.0,600603132872,False


In [47]:
sessions['query'].unique()

array(['kindle', 'blue ray', 'star wars', 'macbook', 'headphones',
       'bluray', 'lcd tv', 'dryer', 'nook', 'ipad', 'star trek', 'iphone',
       'transformers dark of the moon'], dtype=object)

## Inject bias for transformers / transformers dvd

This simulates biased sessions in the data, as if the user never actually sees (and hence never clicks) the item they really want. If the user's desired results are shown, those results get a higher probability of a click; otherwise, there is a lower probability of clicks.

In [48]:
next_sess_id = sessions['sess_id'].max() + 1  # first unused session id

# For some reason, the sessions only capture examines on the 'dubbed' transformers movies
# ie the Japanese shows brought to an English-speaking market. But we'll see this is not what the 
# user wants (ie presentation bias). These are 'meh' mildly interesting. There are also many many
# completely irrelevant movies.

# What the user wants, but never visible! Never gets clicked!
# These are the widescreen transformers dvds of the hollywood movies
desired_movies = ["97360724240", "97360722345", "97368920347"] 

# Bunch of random merchandise
irrelevant_transformers_products = ["708056579739", "93624995012", "47875819733", "47875839090", "708056579746",
                                     "47875332911", "47875842328", "879862003524", "879862003517", "93624974918",
                                     ] 

# Other transformer movies
meh_transformers_movies = ["97363455349", "97361312743", "97361372389", "97361312804", "97363532149", "97363560449"]

displayed_transformer_products = meh_transformers_movies + irrelevant_transformers_products

new_sessions = []
for i in range(0,5000):
    random.shuffle(displayed_transformer_products)

    # shuffle each session
    for rank, upc in enumerate(displayed_transformer_products):
        clicked = False
        draw = random.random()

        if upc in meh_transformers_movies:
            if draw < 0.13:
                clicked = True
        elif upc in irrelevant_transformers_products:
            if draw < 0.005:
                clicked = True
        elif upc in desired_movies:
            if draw < 0.65:
                clicked = True

        new_sessions.append({'sess_id': next_sess_id + i, 
                             'query': 'transformers dvd', 
                             'rank': rank,
                             'clicked': clicked,
                             'doc_id': upc
                             })


# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
sessions = pd.concat([sessions, pd.DataFrame(new_sessions)])

## Chapter 11 In One Function (omitted) 

Wrapping up Chapter 11 in a single function `sessions_to_sdbn`

In [49]:
def sessions_to_sdbn(sessions, prior_weight=10, prior_grade=0.2) -> pd.DataFrame:
    """ Compute SDBN of the provided query as a dataframe.
        Where we left off at end of 'overcoming confidence bias' 
        """
    all_sdbn = pd.DataFrame()
    for query in sessions['query'].unique():
        sdbn_sessions = sessions[sessions['query'] == query].copy().set_index('sess_id')

        last_click_per_session = sdbn_sessions.groupby(['clicked', 'sess_id'])['rank'].max()[True]

        sdbn_sessions['last_click_rank'] = last_click_per_session
        sdbn_sessions['examined'] = sdbn_sessions['rank'] <= sdbn_sessions['last_click_rank']

        sdbn = sdbn_sessions[sdbn_sessions['examined']].groupby('doc_id')[['clicked', 'examined']].sum()
        sdbn['grade'] = sdbn['clicked'] / sdbn['examined']
        sdbn['query'] = query

        sdbn = sdbn.sort_values('grade', ascending=False)

        sdbn['prior_a'] = prior_grade*prior_weight
        sdbn['prior_b'] = (1-prior_grade)*prior_weight

        sdbn['posterior_a'] = sdbn['prior_a'] +  sdbn['clicked']
        sdbn['posterior_b'] = sdbn['prior_b'] + (sdbn['examined'] - sdbn['clicked'])

        sdbn['beta_grade'] = sdbn['posterior_a'] / (sdbn['posterior_a'] + sdbn['posterior_b'])

        sdbn = sdbn.sort_values('beta_grade', ascending=False)
        all_sdbn = pd.concat([all_sdbn, sdbn])
    return all_sdbn[['query', 'clicked', 'examined', 'grade', 'beta_grade']].reset_index().set_index(['query', 'doc_id'])

queries = ['dryer', 'bluray', 'blue ray', 'headphones', 'ipad', 'iphone',
           'kindle', 'lcd tv', 'macbook', 'nook', 'star trek', 'star wars',
           'transformers dark of the moon']



## Listing 12.1 Convert Raw Sessions to SDBN

In this listing we use our "chapter 11 in one function" `sessions_to_sdbn` to rebuild training data.

In [50]:
sdbn = sessions_to_sdbn(sessions,
                        prior_weight=10,
                        prior_grade=0.2)
sdbn

query,doc_id,clicked,examined,grade,beta_grade
kindle,814916011896,208,453,0.459161,0.453564
kindle,9781400532711,915,2046,0.447214,0.446012
kindle,814916014361,244,824,0.296117,0.294964
kindle,814916011872,458,1593,0.287508,0.286962
kindle,814916014590,148,520,0.284615,0.283019
...,...,...,...,...,...
transformers dvd,47875819733,24,1679,0.014294,0.015394
transformers dvd,708056579739,23,1659,0.013864,0.014979
transformers dvd,879862003524,23,1685,0.013650,0.014749
transformers dvd,93624974918,19,1653,0.011494,0.012628
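
The `beta_grade` column can be checked by hand. The following is a minimal sketch that repeats the arithmetic inside `sessions_to_sdbn` for a single row, assuming the same prior (`prior_grade=0.2`, `prior_weight=10`). The helper name `beta_grade` is ours, for illustration only.

```python
def beta_grade(clicked, examined, prior_weight=10, prior_grade=0.2):
    """Posterior mean of a Beta prior updated with click / no-click counts."""
    prior_a = prior_grade * prior_weight          # pseudo-clicks from the prior
    prior_b = (1 - prior_grade) * prior_weight    # pseudo-skips from the prior
    posterior_a = prior_a + clicked
    posterior_b = prior_b + (examined - clicked)
    return posterior_a / (posterior_a + posterior_b)


# First kindle row in the table above: 208 clicks out of 453 examines
kindle_grade = beta_grade(208, 453)
print(round(kindle_grade, 6))  # 0.453564, matching the beta_grade column
```

With few examines the prior dominates and pulls the grade toward 0.2; with thousands of examines, as here, `beta_grade` stays close to the raw `grade`.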


## Chapter 10 Functions (omitted from book)

All of the following are support functions for the chapter:

1. Convert the sdbn dataframe into individual `Judgment` objects needed for training the model from chapter 10
2. Pairwise transformation of the data
3. Normalization of the data
4. Training the model
5. Uploading the model to Solr

All of these steps are covered in Chapter 10.
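
As a reminder of what step 2 does, here is a small standalone sketch of a pairwise transformation on toy data. It is an unweighted simplification (comparable to calling the notebook's `pairwise_transform` with `weigh_difference=False`), and the function name `pairwise_deltas`, the grades, and the feature values are made up for illustration.

```python
def pairwise_deltas(grades, features):
    """For every ordered pair of docs with different grades, emit the
    feature difference and a +1 / -1 label saying which doc is better."""
    deltas, labels = [], []
    for grade_i, ftrs_i in zip(grades, features):
        for grade_j, ftrs_j in zip(grades, features):
            if grade_i == grade_j:
                continue  # ties carry no ranking preference
            deltas.append([a - b for a, b in zip(ftrs_i, ftrs_j)])
            labels.append(1 if grade_i > grade_j else -1)
    return deltas, labels


# Hypothetical query with three docs: one graded 3, two graded 1
grades = [3, 1, 1]
features = [[1.0, 0.5], [0.25, 0.0], [0.0, 0.5]]
deltas, labels = pairwise_deltas(grades, features)
print(labels)     # [1, 1, -1, -1]
print(deltas[0])  # [0.75, 0.5]
```

A linear classifier trained to separate the `+1` deltas from the `-1` deltas yields weights that can be used directly as a ranking function, which is the idea behind the RankSVM training below.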

In [51]:
import json
import math
from itertools import groupby

import numpy as np
import requests
from sklearn import svm

from ltr import download
from ltr.judgments import (Judgment, judgments_from_file, judgments_open,
                           judgments_to_nparray, judgments_writer)
from ltr.log import FeatureLogger

def sdbn_to_judgments(sdbn):
    """Turn pandas dataframe into ltr judgments objects."""
    judgments = []
    queries = {}
    next_qid = 0
    for row_dict in sdbn.reset_index().to_dict(orient="records"):
        # Round grade to 10ths, Map 0.3 -> 3, etc
        grade = round(row_dict['beta_grade'], 1) * 10
        qid = -1
        if row_dict['query'] in queries:
            qid = queries[row_dict['query']]
        else:
            queries[row_dict['query']] = next_qid
            qid = next_qid
            next_qid += 1
        assert qid != -1
        
        judgments.append(Judgment(doc_id=row_dict['doc_id'],
                                  keywords=row_dict['query'],
                                  qid=qid,
                                  grade=int(grade))
                        )
    return judgments


sdbn_to_judgments(sdbn)


def write_judgments(judgments, dest='retrotech_judgments.txt'):
    with judgments_writer(open(dest, 'wt')) as writer:
        for judgment in judgments:
            writer.write(judgment)
            
write_judgments(sdbn_to_judgments(sdbn))
!cat retrotech_judgments.txt


def normalize_features(logged_judgments):
    all_features = []
    means = [0] * len(logged_judgments[0].features)
    for judgment in logged_judgments:
        for idx, f in enumerate(judgment.features):
            means[idx] += f
        all_features.append(judgment.features)
    
    for i in range(len(means)):
        means[i] /= len(logged_judgments)
      
    std_devs = [0.0] * len(logged_judgments[0].features)
    for judgment in logged_judgments:
        for idx, f in enumerate(judgment.features):
            std_devs[idx] += (f - means[idx])**2
            
    for i in range(len(std_devs)):
        std_devs[i] /= len(logged_judgments)
        std_devs[i] = math.sqrt(std_devs[i])
        
    # Normalize!
    normed_judgments = []
    for judgment in logged_judgments:
        normed_features = [0.0] * len(judgment.features)
        for idx, f in enumerate(judgment.features):
            normed = 0.0
            if std_devs[idx] > 0: 
                normed = (f - means[idx]) / std_devs[idx]
            normed_features[idx] = normed
        normed_judgment=Judgment(qid=judgment.qid,
                                 keywords=judgment.keywords,
                                 doc_id=judgment.doc_id,
                                 grade=judgment.grade,
                                 features=normed_features)
        normed_judgment.old_features=judgment.features
        normed_judgments.append(normed_judgment)

    return means, std_devs, normed_judgments


def pairwise_transform(normed_judgments, weigh_difference = True):
        
    predictor_deltas = []
    feature_deltas = []
    
    # For each query's judgments
    for qid, query_judgments in groupby(normed_judgments, key=lambda j: j.qid):

        # Annoying issue consuming python iterators, we ensure we have two
        # full copies of each query's judgments
        query_judgments_copy_1 = list(query_judgments) 
        query_judgments_copy_2 = list(query_judgments_copy_1)

        # Examine every judgment combo for this query, 
        # if they're different, store the pairwise difference:
        # +1 if judgment1 more relevant
        # -1 if judgment2 more relevant
        for judgment1 in query_judgments_copy_1:
            for judgment2 in query_judgments_copy_2:
                
                j1_features=np.array(judgment1.features)
                j2_features=np.array(judgment2.features)
                
                if judgment1.grade > judgment2.grade:
                    diff = judgment1.grade - judgment2.grade if weigh_difference else 1.0
                    predictor_deltas.append(+1)
                    feature_deltas.append(diff * (j1_features-j2_features))
                elif judgment1.grade < judgment2.grade:
                    diff = judgment2.grade - judgment1.grade if weigh_difference else 1.0
                    predictor_deltas.append(-1)
                    feature_deltas.append(diff * (j1_features-j2_features))

    # For training purposes, we return these as numpy arrays
    return np.array(feature_deltas), np.array(predictor_deltas)

def upload_model(model, model_name, means, std_devs, feature_set):

    linear_model = {
      "store": "test",
      "class": "org.apache.solr.ltr.model.LinearModel",
      "name": model_name,
      "features": [
      ],
      "params": {
          "weights": {
          }
      }
    }

    ftr_model = {}
    ftr_names = [ftr['name'] for ftr in feature_set]
    for idx, ftr_name in enumerate(ftr_names):
        config = {
            "name": ftr_name,
            "norm": {
                "class": "org.apache.solr.ltr.norm.StandardNormalizer",
                "params": {
                    "avg": str(means[idx]),
                    "std": str(std_devs[idx])
                }
            }
        }
        linear_model['features'].append(config)
        linear_model['params']['weights'][ftr_name] =  model.coef_[0][idx] 

    print("PUT http://aips-solr:8983/solr/products/schema/model-store")
    print(json.dumps(linear_model, indent=2))

    # Delete any old model with the same name
    resp = requests.delete(f'http://aips-solr:8983/solr/products/schema/model-store/{model_name}')

    # Upload the model
    resp = requests.put('http://aips-solr:8983/solr/products/schema/model-store', json=linear_model)
    return resp.text
    
    
## TODO - can't easily do a test/train split on these few queries
##   make more queries?

def ranksvm_ltr(sdbn, model_name, feature_set):
    """Train a RankSVM model via Solr, store in Solr."""
    judgments = sdbn_to_judgments(sdbn)
    judgments_path = 'retrotech_judgments.txt'
    write_judgments(judgments, judgments_path)
    
    # For more on this code, review Chapter 10
    requests.delete('http://aips-solr:8983/solr/products/schema/feature-store/test')
    
    resp = requests.put('http://aips-solr:8983/solr/products/schema/feature-store',
                    json=feature_set)

    ftr_logger=FeatureLogger(client, index='products', feature_set='test', id_field='upc')

    with judgments_open(judgments_path) as judgment_list:
        for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):
            ftr_logger.log_for_qid(judgments=query_judgments, 
                                   qid=qid,
                                   keywords=judgment_list.keywords(qid))

    logged_judgments = ftr_logger.logged
    means, std_devs, normed_judgments = normalize_features(logged_judgments)
    feature_deltas, predictor_deltas = pairwise_transform(normed_judgments)

    model = svm.LinearSVC(max_iter=10000, verbose=1)
    model.fit(feature_deltas, predictor_deltas)  
    upload_model(model, model_name, means, std_devs, feature_set)


# qid:0: kindle*1
# qid:1: blue ray*1
# qid:2: star wars*1
# qid:3: macbook*1
# qid:4: headphones*1
# qid:5: bluray*1
# qid:6: lcd tv*1
# qid:7: dryer*1
# qid:8: nook*1
# qid:9: ipad*1
# qid:10: star trek*1
# qid:11: iphone*1
# qid:12: transformers dark of the moon*1
# qid:13: transformers dvd*1

5	qid:0	 # 814916011896	kindle
4	qid:0	 # 9781400532711	kindle
3	qid:0	 # 814916014361	kindle
3	qid:0	 # 814916011872	kindle
3	qid:0	 # 814916014590	kindle
3	qid:0	 # 814916014385	kindle
2	qid:0	 # 813580017906	kindle
2	qid:0	 # 814916010202	kindle
2	qid:0	 # 814916010240	kindle
2	qid:0	 # 9781400532650	kindle
2	qid:0	 # 813580018514	kindle
2	qid:0	 # 814916010288	kindle
2	qid:0	 # 813580018491	kindle
2	qid:0	 # 814916010219	kindle
2	qid:0	 # 92636257521	kindle
2	qid:0	 # 814916014606	kindle
2	qid:0	 # 813580015261	kindle
2	qid:0	 # 814916010233	kindle
1	qid:0	 # 813580018361	kindle
1	qid:0	 # 843404077182	kindle
1	qid:0	 # 813580015247	kindle
1	qid:0	 # 885

## Also Chapter 10 - Perform a test / train split on the SDBN data

This function is broken out from the model training. It lets us train a model on one set of data (reusing the chapter 10 training code), reserving test queries for evaluation.

In [52]:
from math import floor

def test_train_split(sdbn, train):
    """Split queries in sdbn into train / test split with `train` proportion going to training set."""
    queries = sdbn.index.get_level_values('query').unique().copy().tolist()
    random.shuffle(queries)
    num_queries = len(queries)
    split_point = floor(num_queries * train)
    
    train_queries = queries[:split_point]
    test_queries = queries[split_point:]
    return sdbn.loc[train_queries, :], sdbn.loc[test_queries]
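
The key point of the split above is that it happens per query, not per row, so every judgment for a query lands on the same side. A minimal sketch of just that step, using the 14 query strings that appear in this notebook; the helper name `split_queries` is ours, and unlike the notebook's function it uses a local `random.Random` so the global seed is left untouched.

```python
import random
from math import floor


def split_queries(queries, train, seed=1234):
    """Shuffle the queries and cut the list at the `train` proportion."""
    rng = random.Random(seed)      # local RNG (an assumption of this sketch)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    split_point = floor(len(shuffled) * train)
    return shuffled[:split_point], shuffled[split_point:]


queries = ['kindle', 'blue ray', 'star wars', 'macbook', 'headphones',
           'bluray', 'lcd tv', 'dryer', 'nook', 'ipad', 'star trek', 'iphone',
           'transformers dark of the moon', 'transformers dvd']
train_queries, test_queries = split_queries(queries, train=0.8)
print(len(train_queries), len(test_queries))  # 11 3
```

With only three held-out queries, the offline evaluation is very coarse, which is one reason the TODO earlier in the notebook asks for more queries.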


## Chapter 10 - Search Code

Also from Chapter 10, a simple function to search using the LTR model and return a list of search results.

In [53]:
def search(query, model_name, at=10, log=False):
    """ Search using the LTR model `model_name` (see the rq and qf params below). """
    fuzzy_kws = "~" + ' ~'.join(query.split())
    squeezed_kws = "".join(query.split())
    
    rq = \
        "{!ltr reRankDocs=60000 reRankWeight=10.0 model=" + model_name \
        + " efi.fuzzy_keywords=\"" + fuzzy_kws + "\" " \
        + "efi.squeezed_keywords=\"" + squeezed_kws +"\" " \
        + "efi.keywords=\"" + query + "\"}"

    request = {
            "fields": ["upc", "name", "manufacturer", "score"],
            "limit": at,
            "params": {
              "rq": rq,
              "qf": "name name_ngram upc manufacturer shortDescription longDescription",
              "defType": "edismax",
              "q": query
            }
        }
    
    if log:
        print(request)

    resp = requests.post('http://aips-solr:8983/solr/products/select', 
                                   json=request).json()
    
    if log:
        print(resp)
        
    search_results = resp['response']['docs']

    for rank, result in enumerate(search_results):
        result['rank'] = rank
        
    return search_results

def search_and_grade(query, model_name, sdbn, desired=[]):
    results = search(query, model_name, at=10)
    results = pd.DataFrame(results)
    results['desired'] = False
    for upc in desired:
        results.loc[results['upc'] == upc, 'desired'] = True
        
    sdbn_query = sdbn.loc[query].copy().reset_index()
    return results.merge(sdbn_query, left_on='upc', right_on='doc_id', how='left')
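
To see what `search` actually sends to Solr, it can help to build the `rq` rerank parameter on its own. This sketch copies the string construction from `search` into a standalone helper; the name `build_rq` is ours, and nothing here contacts Solr.

```python
def build_rq(query, model_name):
    """Assemble the {!ltr ...} rerank query string the way `search` does."""
    fuzzy_kws = "~" + " ~".join(query.split())
    squeezed_kws = "".join(query.split())
    return (
        "{!ltr reRankDocs=60000 reRankWeight=10.0 model=" + model_name
        + ' efi.fuzzy_keywords="' + fuzzy_kws + '" '
        + 'efi.squeezed_keywords="' + squeezed_kws + '" '
        + 'efi.keywords="' + query + '"}'
    )


rq = build_rq("transformers dvd", "test1")
print(rq)
```

Each `efi.*` entry is external feature information that the model's features can reference, such as the `${keywords}` placeholder used in the feature sets below.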

## Evaluate the model on the test set

This function computes the model's performance on a set of test queries.

In [54]:
def eval_model(test, model_name, sdbn, at=10):
    queries = test.index.get_level_values('query').unique()
    collection = "products"
    
    query_results = {}
    
    for query in queries:
        search_results = search(query, model_name, at=at)

        results = pd.DataFrame(search_results).reset_index()
        judgments = sdbn.loc[query, :].copy().reset_index()
        judgments['doc_id'] = judgments['doc_id'].astype(str)
        if len(results) == 0:
            print(f"No Results for {query}")
            query_results[query] = 0
        else:
            graded_results = results.merge(judgments, left_on='upc', right_on='doc_id', how='left')
            print(graded_results)
            graded_results[['clicked', 'examined', 'grade', 'beta_grade']] = graded_results[['clicked', 'examined', 'grade', 'beta_grade']].fillna(0)
            graded_results = graded_results.drop('doc_id', axis=1)

            query_results[query] = (graded_results['beta_grade'].sum() / at)
    return query_results
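
The per-query score computed by `eval_model` is simply the sum of the `beta_grade` values in the top `at` results divided by `at`, with unjudged results counted as 0. A minimal sketch of that arithmetic on made-up grades; the helper name `mean_grade_at` and the toy values are ours.

```python
import math


def mean_grade_at(beta_grades, at=10):
    """Average beta_grade over the top `at` slots; missing grades count as 0."""
    top = beta_grades[:at]
    filled = [0.0 if (g is None or math.isnan(g)) else g for g in top]
    # Divide by `at`, not len(top): returning fewer results is penalized too
    return sum(filled) / at


# Hypothetical top-4 ranking where ranks 1 and 3 have no judgment
toy_grades = [0.4, float("nan"), 0.2, None]
score = mean_grade_at(toy_grades, at=4)
print(round(score, 4))
```

Because unjudged documents score 0, a model that surfaces new but genuinely relevant documents can look worse offline than it really is, which is part of the motivation for testing online.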

## Listing 12.2 - model training

We wrap all the important decisions from Chapter 10 in a few lines.

In [55]:
random.seed(1234)

feature_set = [
    {
      "name" : "long_description_bm25",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "longDescription:(${keywords})"
      }
    },
    {
      "name" : "short_description_constant",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "shortDescription:(${keywords})^=1"
      }
    }
]

train, test = test_train_split(sdbn, train=0.8)
ranksvm_ltr(train, model_name='test1', feature_set=feature_set)
eval_model(test, model_name='test1', sdbn=sdbn)

Searching products [Status: 200]
Missing doc 600603124570
Searching products [Status: 200]
Missing doc 600603141003
Missing doc 600603132872
Searching products [Status: 200]
Missing doc 600603123061
Searching products [Status: 200]
Missing doc 600603139758
Searching products [Status: 200]
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603135101
Searching products [Status: 200]
Missing doc 600603132827
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603140631
Missing doc 600603125065
Missing doc 600603132827
Missing doc 600603133237
Searching products [Status: 200]
[LibLinear]PUT http://aips-solr:8983/solr/products/schema/model-store
{
  "store": "test",
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "test1",
  "features": [
    {
      "name": "long_description_bm25",
      "norm": {
        "class": "org.apache.solr.ltr.norm.StandardNormalizer",
        "params": {
          "avg": "1.0558415082876724",

In [56]:
# What the user wants, but never visible! Never gets clicked!
# These are the widescreen transformers dvds of the hollywood movies
desired_movies = ["97360724240", "97360722345", "97368920347"] 
result = search_and_grade('transformers dvd', 'test1', sdbn, desired_movies)
upcs1 = result['upc']
result

Unnamed: 0,upc,name,manufacturer,score,rank,desired,doc_id,clicked,examined,grade,beta_grade
0,25193328625,Atonement - Fullscreen Dubbed Subtitle AC3 - DVD,\N,0.017951,0,False,,,,,
1,18713578310,Jungle Book/Aladdin - DVD,\N,0.017951,1,False,,,,,
2,14381243925,Twilight Zone: Season 1 [The Definitive Editio...,\N,0.017951,2,False,,,,,
3,97363416043,Mean Girls - Widescreen Collector's Subtitle -...,\N,0.017951,3,False,,,,,
4,20286155379,Hello Rockview [CD & DVD] [Digipak] - CD,Sleep It Off,0.017951,4,False,,,,,
5,16861801663,Blackening (CD+DVD) (Special) - CD,Roadrunner Records,0.017951,5,False,,,,,
6,24543051008,The Rocky Horror Picture Show - Widescreen Sub...,\N,0.017951,6,False,,,,,
7,97368861848,Dora the Explorer: Dora's Ultimate Adventures ...,\N,0.017951,7,False,,,,,
8,24543110408,Vanishing Point - Widescreen - DVD,\N,0.017951,8,False,,,,,
9,97368884045,Hogan's Heroes: The Complete Second Season - 4...,\N,0.017951,9,False,,,,,


## Listing 12.3

Train a model, called `test2`, that performs better offline.

In [57]:
random.seed(1234)


feature_set_better = [
    {
      "name" : "name_fuzzy",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "name_ngram:(${keywords})"
      }
    },
    {
      "name" : "name_pf2",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "{!edismax qf=name name pf2=name}(${keywords})"
      }
    },
    {
      "name" : "shortDescription_pf2",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "{!edismax qf=shortDescription pf2=shortDescription}(${keywords})"
      }
    },
]

sdbn = sessions_to_sdbn(sessions) # chapter 11: generate training data

train, test = test_train_split(sdbn, train=0.8)
ranksvm_ltr(train, 'test2', feature_set_better) # chapter 10: train the model -> the 'LTR engine'
eval2 = eval_model(test, 'test2', sdbn)

eval2

Searching products [Status: 200]
Missing doc 600603124570
Searching products [Status: 200]
Missing doc 600603141003
Missing doc 600603132872
Searching products [Status: 200]
Missing doc 600603123061
Searching products [Status: 200]
Missing doc 600603139758
Searching products [Status: 200]
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603135101
Searching products [Status: 200]
Missing doc 600603132827
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603140631
Missing doc 600603125065
Missing doc 600603132827
Missing doc 600603133237
Searching products [Status: 200]
[LibLinear]


Liblinear failed to converge, increase the number of iterations.



PUT http://aips-solr:8983/solr/products/schema/model-store
{
  "store": "test",
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "test2",
  "features": [
    {
      "name": "name_fuzzy",
      "norm": {
        "class": "org.apache.solr.ltr.norm.StandardNormalizer",
        "params": {
          "avg": "2.4331423202054783",
          "std": "1.4936218654796216"
        }
      }
    },
    {
      "name": "name_pf2",
      "norm": {
        "class": "org.apache.solr.ltr.norm.StandardNormalizer",
        "params": {
          "avg": "2.981331874657533",
          "std": "3.5369474213302845"
        }
      }
    },
    {
      "name": "shortDescription_pf2",
      "norm": {
        "class": "org.apache.solr.ltr.norm.StandardNormalizer",
        "params": {
          "avg": "0.5265155200684927",
          "std": "1.564823160506588"
        }
      }
    }
  ],
  "params": {
    "weights": {
      "name_fuzzy": 0.04738334251825029,
      "name_pf2": 0.14553197785451133,
    

{'blue ray': 0.0,
 'dryer': 0.07068309073137659,
 'transformers dark of the moon': 0.41539851567397756}

In [58]:
# What the user wants, but never visible! Never gets clicked!
# These are the widescreen transformers dvds of the hollywood movies
desired_movies = ['97360724240', '97363560449', '97363532149', '97360810042']
result = search_and_grade('transformers dvd', 'test2', sdbn, desired_movies)
upcs2 = result['upc']
result

Unnamed: 0,upc,name,manufacturer,score,rank,desired,doc_id,clicked,examined,grade,beta_grade
0,32429037763,Transformers - DVD,\N,0.341141,0,False,,,,,
1,97368920347,The Transformers: The Movie - DVD,\N,0.085697,1,False,,,,,
2,826663126044,Transformers Japanese Collection: Headmasters ...,\N,0.076749,2,False,,,,,
3,826663114218,"Transformers: Season 2, Vol. 1 - DVD",\N,0.074654,3,False,,,,,
4,97037110192,"Transformers: Serie Megatron, Vol. 1 - DVD",\N,0.072194,4,False,,,,,
5,97363455349,Transformers - Widescreen Dubbed Subtitle AC3 ...,\N,0.068298,5,False,97363455349.0,664.0,1937.0,0.342798,0.342065
6,97361372389,Transformers - Widescreen Dubbed Subtitle AC3 ...,\N,0.068298,6,False,97361372389.0,622.0,1919.0,0.324127,0.323484
7,826663129892,Transformers Prime: Darkness Rising - Fullscre...,\N,0.068298,7,False,,,,,
8,97361312743,Transformers - Widescreen Dubbed Subtitle AC3 ...,\N,0.068298,8,False,97361312743.0,657.0,1916.0,0.342902,0.34216
9,400173151118,Transformers Cybertron The Ultimate Collection...,\N,0.067366,9,False,,,,,


In [59]:
def live_user_query(query, model_name,
                    desired, meh,
                    desired_prob=0.15, 
                    meh_prob=0.03, 
                    uninteresting_prob=0.01,
                    quit_per_rank_prob=0.2):
    """Simulate a live user for `query`, where the purchase probability of each
       result depends on which of three sets (desired, meh, or neither) its upc is in.
       
       Users purchase at most a single product per session.
       
       Users quit with probability `quit_per_rank_prob` after scanning each rank.
       
       """   
    search_results = search(query, model_name, at=10)

    results = pd.DataFrame(search_results).reset_index()
    for doc in results.to_dict(orient="records"):
        clicked = False
        draw = random.random()
        
        upc = doc['upc']

        if upc in desired:
            if draw < desired_prob:
                return True
        elif upc in meh:
            if draw < meh_prob:
                return True
        else:
            if draw < uninteresting_prob:
                return True
            
        if random.random() < quit_per_rank_prob:
            return False
    
    return False
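
The simulated user above is simple enough that the purchase probability for a given ranking can also be computed in closed form, which gives a useful sanity check on the simulation. A sketch under the same assumptions as `live_user_query` (one draw per rank, then a quit draw); the helper name `purchase_probability` is ours.

```python
def purchase_probability(per_rank_probs, quit_per_rank_prob=0.2):
    """Probability that the simulated user buys something from this ranking."""
    reach = 1.0   # probability the user is still scanning when they get here
    total = 0.0
    for p in per_rank_probs:
        total += reach * p
        # To reach the next rank the user must neither buy nor quit
        reach *= (1 - p) * (1 - quit_per_rank_prob)
    return total


# Ten results that are all 'uninteresting' (default uninteresting_prob=0.01)
p_uninteresting = purchase_probability([0.01] * 10)
print(round(p_uninteresting, 4))
```

Because of the quit probability, positions near the top carry most of the weight: a desired product at rank 0 contributes far more than the same product at rank 9.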


In [60]:
random.seed(1234)

wants_to_purchase = ['97360724240', '97363560449', '97363532149', '97360810042']
might_purchase = ['97361312743', '97363455349', '97361372389']

model1_purchases = 0
model2_purchases = 0

def a_or_b_model(query, a_model, b_model):
    """Randomly assign this user to a or b"""
    draw = random.random()
    
    user_made_purchase = False
    model_name = None
    if draw < 0.5:
        model_name=a_model
    else:
        model_name=b_model
        
    purchase_made = live_user_query(query=query, 
                                   model_name=model_name,
                                   desired=wants_to_purchase,
                                   meh=might_purchase)
    return (model_name, purchase_made)


NUM_USERS=1000
purchases = {'test1': 0, 'test2': 0}
for _ in range(0, NUM_USERS):
    
    model_name, purchase_made = a_or_b_model(query='transformers dvd', 
                                             a_model='test1',
                                             b_model='test2')
    if purchase_made:
        purchases[model_name]+= 1 
    
purchases

{'test1': 21, 'test2': 17}
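
Before reading much into 21 versus 17 purchases, it is worth asking whether the gap could be noise. The following is a rough sketch of a two-proportion z-test using only the standard library. It assumes, purely for illustration, that each model served exactly 500 of the 1000 users; the notebook records purchases per model but not users per model, so the real split will differ slightly.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# ASSUMPTION: 500 users per arm (not recorded in the notebook)
z, p_value = two_proportion_z_test(21, 500, 17, 500)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

Under that assumption the p-value is far above the usual 0.05 threshold, so this run alone does not show that either model converts better; many more simulated users would be needed.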

In [61]:
model1_purchases, model2_purchases

(0, 0)

In [62]:
all_upcs = set(upcs1.tolist() + upcs2.tolist())
len(all_upcs)

20

In [63]:
upcs2.tolist()

['32429037763',
 '97368920347',
 '826663126044',
 '826663114218',
 '97037110192',
 '97363455349',
 '97361372389',
 '826663129892',
 '97361312743',
 '400173151118']

In [64]:
sessions['query'].unique()

array(['kindle', 'blue ray', 'star wars', 'macbook', 'headphones',
       'bluray', 'lcd tv', 'dryer', 'nook', 'ipad', 'star trek', 'iphone',
       'transformers dark of the moon', 'transformers dvd'], dtype=object)

In [65]:
#1. Just use the queries we have to do a test/train split
#2. Simulate the user for A/B test

In [66]:
sessions[sessions['query'] == 'headphones']

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,30002,headphones,0.0,803238004525,True
1,30002,headphones,1.0,615104173552,False
2,30002,headphones,2.0,848447000135,False
3,30002,headphones,3.0,27242807785,False
4,30002,headphones,4.0,878615035287,False
...,...,...,...,...,...
149995,35001,headphones,25.0,27242798236,False
149996,35001,headphones,26.0,709483027855,False
149997,35001,headphones,27.0,46838046100,False
149998,35001,headphones,28.0,27242799127,False


## Active Learning (to be moved)

In [67]:
sdbn = sessions_to_sdbn(sessions,
                        prior_weight=10,
                        prior_grade=0.2)
sdbn.loc['transformers dvd']

Unnamed: 0_level_0,clicked,examined,grade,beta_grade
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97363560449,677,1946,0.347893,0.347137
97361312804,662,1920,0.344792,0.344041
97361312743,657,1916,0.342902,0.34216
97363455349,664,1937,0.342798,0.342065
97361372389,622,1919,0.324127,0.323484
97363532149,623,1927,0.3233,0.322664
879862003517,37,1698,0.02179,0.022834
93624995012,32,1673,0.019127,0.020202
47875842328,29,1663,0.017438,0.01853
708056579746,26,1664,0.015625,0.016726


## New helper: show the features for each SDBN entry

This function follows the same steps as `ranksvm_ltr` but skips model training. It returns the logged feature values of every judged document, labeled with its query, so we can filter down to a single query afterwards.

In [68]:
def sdbn_with_features(sdbn, feature_set):
    """Log features alongside sdbn into a dataframe"""
    judgments = sdbn_to_judgments(sdbn)
    judgments_path = 'retrotech_judgments.txt'
    write_judgments(judgments, judgments_path)
    
    # For more on this code, review Chapter 10
    requests.delete('http://aips-solr:8983/solr/products/schema/feature-store/explore')
    
    resp = requests.put('http://aips-solr:8983/solr/products/schema/feature-store',
                    json=feature_set)

    ftr_logger=FeatureLogger(client, index='products', feature_set='explore', id_field='upc')
    
    with judgments_open(judgments_path) as judgment_list:
        for qid, query_judgments in groupby(judgments, key=lambda j: j.qid):
            ftr_logger.log_for_qid(judgments=query_judgments, 
                                   qid=qid,
                                   keywords=judgment_list.keywords(qid))

    logged_judgments = ftr_logger.logged
    means, std_devs, normed_judgments = normalize_features(logged_judgments)
    feature_deltas, predictor_deltas = pairwise_transform(normed_judgments)
    features, predictors = judgments_to_nparray(logged_judgments)
    logged_judgments_dataframe = pd.concat([pd.DataFrame(predictors),
                                            pd.DataFrame(features)], 
                                           axis=1,
                                           ignore_index=True)
    columns = {idx + 2: ftr['name'] for idx, ftr in enumerate(feature_set)}
    columns[0] = 'grade'
    columns[1] = 'qid'
    
    qid_to_query = {}
    for j in logged_judgments:
        qid_to_query[j.qid] = j.keywords
        
    # Build the qid -> query lookup from the dict items so each qid keeps its own keywords
    qid_to_query = pd.DataFrame(list(qid_to_query.items()), columns=['qid', 'query'])
    
    print(qid_to_query)
    logged_judgments_dataframe = logged_judgments_dataframe.rename(columns=columns)
    print(logged_judgments_dataframe)
    logged_judgments_dataframe = logged_judgments_dataframe.merge(qid_to_query, how='left', on='qid')
    cols_order = ['query', 'grade'] + [ftr['name'] for ftr in feature_set]
    logged_judgments_dataframe['grade'] = logged_judgments_dataframe['grade'] / 10.0 
    return logged_judgments_dataframe[cols_order].sort_values('query')

## Raw feature values -> what's missing?
(put in book)

Another way of formulating `presentation_bias` is to look at the kinds of documents that are never shown to users, so we can strategically show those documents. Below we show the value of each feature in `feature_set` for every document in the SDBN judgments.

In [69]:

    
    
sdbn = sessions_to_sdbn(sessions,
                        prior_weight=10,
                        prior_grade=0.2)

feature_set = [
    {
      "name" : "long_desc_match",
      "store": "explore",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "longDescription:(${keywords})^=1"
      }
    },
    {
      "name" : "short_desc_match",
      "store": "explore",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "shortDescription:(${keywords})^=1"
      }
    },
    {
      "name" : "name_match",
      "store": "explore",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "name:(${keywords})^=1"
      }
    },
    {
      "name" : "has_promotion",
      "store": "explore",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "promotion_b:true"
      }
    },
]


sdbn_ftrs = sdbn_with_features(sdbn, feature_set)
transformers_dvds = sdbn_ftrs[sdbn_ftrs['query'] == 'transformers dvd']
transformers_dvds

Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603132872
Missing doc 600603141003
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603123061
Searching products [Status: 200]
Missing doc 600603124570
Searching products [Status: 200]
Missing doc 600603141003
Missing doc 600603132872
Searching products [Status: 200]
Missing doc 600603139758
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603135101
Searching products [Status: 200]
Missing doc 600603132827
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603140631
Missing doc 600603125065
Missing doc 600603132827
Missing doc 600603133237
Searching products [Status: 200]
Searching products [Status: 200]
    qid                          query
0     0                         kindle
1     1                       blue ray
2     2                      star wars
3     3                        macbook
4     4           

Unnamed: 0,query,grade,long_desc_match,short_desc_match,name_match,has_promotion
352,transformers dvd,0.0,1.0,1.0,1.0,0.0
351,transformers dvd,0.0,1.0,1.0,1.0,0.0
350,transformers dvd,0.0,1.0,0.0,1.0,0.0
349,transformers dvd,0.0,1.0,0.0,1.0,0.0
348,transformers dvd,0.0,1.0,0.0,1.0,0.0
347,transformers dvd,0.0,1.0,0.0,1.0,0.0
346,transformers dvd,0.0,0.0,0.0,1.0,0.0
343,transformers dvd,0.3,0.0,0.0,1.0,0.0
344,transformers dvd,0.3,0.0,0.0,1.0,0.0
342,transformers dvd,0.3,0.0,0.0,1.0,0.0


## Train Gaussian Process Regressor

We train the regressor on just the `transformers dvd` rows (`transformers_dvds`), using the feature values as inputs and the grade as the target.

NOTE: We could also train on the full SDBN data to see what's missing globally. However, it's often convenient to zero in on specific queries to round out their training data.

In [70]:
from sklearn.gaussian_process import GaussianProcessRegressor

y_train = transformers_dvds['grade']
x_train = transformers_dvds[['long_desc_match', 'short_desc_match',
                             'name_match', 'has_promotion']]

gpr=GaussianProcessRegressor()
gpr.fit(x_train, y_train)

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True, kernel=None,
                         n_restarts_optimizer=0, normalize_y=False,
                         optimizer='fmin_l_bfgs_b', random_state=None)

## Listing 12.7: Predict on every value

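Here `gpr` predicts a grade for every possible combination of the four binary feature values, along with the standard deviation of each prediction. The standard deviation shows where the model is least certain, which lets us analyze which combinations of feature values to use when exploring with users.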

In [71]:
zero_or_one = [0,1]

index = pd.MultiIndex.from_product([zero_or_one] * 4,
                                   names = ['long_desc_match', 'short_desc_match',
                                             'name_match', 'has_promotion'])
explore_options = pd.DataFrame(index=index).reset_index()

predictions_with_std = gpr.predict(explore_options[['long_desc_match', 'short_desc_match',
                                                    'name_match', 'has_promotion']], return_std=True)
explore_options['predicted_grade'] = predictions_with_std[0]
explore_options['prediction_stddev'] = predictions_with_std[1]

explore_options.sort_values('prediction_stddev')


Predicted variances smaller than 0. Setting those variances to 0.



Unnamed: 0,long_desc_match,short_desc_match,name_match,has_promotion,predicted_grade,prediction_stddev
2,0,0,1,0,0.2250005,0.0
6,0,1,1,0,3.76021e-07,0.0
10,1,0,1,0,-4.3093e-08,0.0
14,1,1,1,0,-3.718255e-07,0.0
8,1,0,0,0,-1.573196e-07,0.795058
11,1,0,1,1,-1.573196e-07,0.795058
0,0,0,0,0,0.1364695,0.795059
3,0,0,1,1,0.1364695,0.795059
4,0,1,0,0,8.029912e-08,0.79506
7,0,1,1,1,8.029912e-08,0.79506
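
The four rows with a `prediction_stddev` of 0.0 are presumably the feature combinations already present in the training data, since a Gaussian process with near-zero noise is almost certain at its training points. As a rough sketch of the "what's missing" idea, the block below enumerates all 16 combinations and lists those not in that set. The `seen` set is copied from the zero-stddev rows of the output above, not from the training data itself.

```python
from itertools import product

feature_names = ['long_desc_match', 'short_desc_match',
                 'name_match', 'has_promotion']

# Combinations whose prediction_stddev is 0.0 in the output above
seen = {(0, 0, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 0)}

# Every 0/1 assignment of the four features, in the same order as from_product
all_combos = list(product([0, 1], repeat=len(feature_names)))
unseen = [combo for combo in all_combos if combo not in seen]

print(len(all_combos), len(unseen))
for combo in unseen:
    print(dict(zip(feature_names, combo)))
```

Every seen combination has `name_match` set to 1 and `has_promotion` set to 0, so any document without a name match or with a promotion falls into unexplored territory.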


## Calculate Expected Improvement

We use [Expected Improvement](https://distill.pub/2020/bayesian-optimization/) scoring to select candidates for exploration within the `transformers dvd` query.

In [72]:
from scipy.stats import norm


theta = 0.6
explore_options['opportunity'] = explore_options['predicted_grade'] - sdbn['grade'].mean() - theta


explore_options['prob_of_improvement'] = norm.cdf( (explore_options['opportunity']) / explore_options['prediction_stddev'])

explore_options['expected_improvement'] = explore_options['opportunity'] * explore_options['prob_of_improvement'] \
 + explore_options['prediction_stddev'] * norm.pdf( explore_options['opportunity'] / explore_options['prediction_stddev'])



explore_options.sort_values('expected_improvement', ascending=False).head()

Unnamed: 0,long_desc_match,short_desc_match,name_match,has_promotion,predicted_grade,prediction_stddev,opportunity,prob_of_improvement,expected_improvement
1,0,0,0,1,0.08277291,0.929873,-0.698354,0.22632,0.121754
5,0,1,0,1,9.020036e-08,0.929873,-0.781127,0.200444,0.104104
13,1,1,0,1,-7.772397e-08,0.929873,-0.781127,0.200444,0.104104
9,1,0,0,1,-1.021803e-07,0.929873,-0.781127,0.200444,0.104104
0,0,0,0,0,0.1364695,0.795059,-0.644658,0.208732,0.093761
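
To make the arithmetic above easier to check, here is a minimal standalone sketch of the same expected improvement calculation using only the standard library. The helper names are hypothetical. The baseline of 0.181127 is not stated in the notebook; it is inferred from the table as `predicted_grade - opportunity - theta`, and corresponds to `sdbn['grade'].mean()`. Unlike the cell above, the sketch guards against a zero standard deviation instead of dividing by it.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, equivalent to scipy.stats.norm.cdf for a scalar."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal PDF, equivalent to scipy.stats.norm.pdf for a scalar."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(predicted, stddev, baseline, theta):
    """EI = opportunity * CDF(z) + stddev * PDF(z), with z = opportunity / stddev."""
    opportunity = predicted - baseline - theta
    if stddev <= 0.0:
        # No uncertainty: the improvement is whatever the prediction already offers
        return max(opportunity, 0.0)
    z = opportunity / stddev
    return opportunity * normal_cdf(z) + stddev * normal_pdf(z)

baseline = 0.181127  # inferred from the output table, see the note above
ei = expected_improvement(predicted=0.08277291, stddev=0.929873,
                          baseline=baseline, theta=0.6)
print(round(ei, 4))
```

With the values from the first row of the table, the result agrees with the `expected_improvement` column to about three decimal places.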


In [73]:
sdbn.loc['transformers dvd']

Unnamed: 0_level_0,clicked,examined,grade,beta_grade
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97363560449,677,1946,0.347893,0.347137
97361312804,662,1920,0.344792,0.344041
97361312743,657,1916,0.342902,0.34216
97363455349,664,1937,0.342798,0.342065
97361372389,622,1919,0.324127,0.323484
97363532149,623,1927,0.3233,0.322664
879862003517,37,1698,0.02179,0.022834
93624995012,32,1673,0.019127,0.020202
47875842328,29,1663,0.017438,0.01853
708056579746,26,1664,0.015625,0.016726


In [74]:
explore_vect = explore_options.sort_values('expected_improvement', ascending=False).head().iloc[0][['long_desc_match',
                                                                                     'short_desc_match', 
                                                                                     'name_match',
                                                                                     'has_promotion']]
explore_vect

long_desc_match     0.0
short_desc_match    0.0
name_match          0.0
has_promotion       1.0
Name: 1, dtype: float64

## Create a query to fetch 'explore' docs

Based on the selected features from the GaussianProcessRegressor, we create a query to fetch a doc that contains those features.

In [75]:
def explore_query(explore_vect, query):
    # Field names must match the Solr fields used in feature_set
    config_explore = {'long_desc_match':  {'field': 'longDescription', 'query_dependent': True},
                      'short_desc_match': {'field': 'shortDescription', 'query_dependent': True},
                      'name_match':       {'field': 'name', 'query_dependent': True},
                      'has_promotion':    {'field': 'promotion_b', 'query_dependent': False, '1_value': 'true'}
                     }
    clauses = []
    for col_name, config in config_explore.items():
        if explore_vect[col_name] == 1.0:
            clause = f"+{config['field']}:"
            if config['query_dependent']:
                clause += f"({query})"
            else:
                clause += f"{config['1_value']}"
            clauses.append(clause)
    return ' '.join(clauses)
    
    # fields = ['long_description', 'short_description', 'name', 'is_promotion']
    # field_value = ['match', 'match', 'match', True]

explore_query(explore_vect, 'transformers dvd')


'+promotion_b:true'
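
Note that `explore_query` only requires the features set to 1, so `+promotion_b:true` can match any promoted product, including ones that do match the name. A stricter variant, sketched below as an assumption rather than something this notebook uses, also excludes documents that match the features set to 0. The function name `strict_explore_query` is hypothetical, and the field names follow the feature definitions above (`longDescription`, `shortDescription`).

```python
def strict_explore_query(explore_vect, query):
    """Hypothetical variant: require features set to 1, exclude features set to 0."""
    config = {
        'long_desc_match':  {'field': 'longDescription', 'query_dependent': True},
        'short_desc_match': {'field': 'shortDescription', 'query_dependent': True},
        'name_match':       {'field': 'name', 'query_dependent': True},
        'has_promotion':    {'field': 'promotion_b', 'query_dependent': False,
                             '1_value': 'true'},
    }
    clauses = []
    for col_name, cfg in config.items():
        value = f"({query})" if cfg['query_dependent'] else cfg['1_value']
        sign = '+' if explore_vect[col_name] == 1.0 else '-'
        clauses.append(f"{sign}{cfg['field']}:{value}")
    # With only exclusions, add a match-all clause for them to subtract from
    if all(clause.startswith('-') for clause in clauses):
        clauses.insert(0, '*:*')
    return ' '.join(clauses)

vect = {'long_desc_match': 0.0, 'short_desc_match': 0.0,
        'name_match': 0.0, 'has_promotion': 1.0}
strict = strict_explore_query(vect, 'transformers dvd')
print(strict)
```

The trade-off is that a stricter query may return very few or no candidates, so the looser form can be the more practical choice for small catalogs.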

In [76]:
import random
random.seed(1234)

explore_vect = explore_options.sort_values('expected_improvement', ascending=False).head().iloc[0][['long_desc_match',
                                                                                     'short_desc_match', 
                                                                                     'name_match',
                                                                                     'has_promotion']]

def explore(query, explore_vect, log=False):
    """ Explore according to the provided explore vector, select
        a random doc from that group."""
    solr_query = explore_query(explore_vect, query)
    
    draw = random.random()

    request = {
            "fields": ["upc", "name", "manufacturer", "score"],
            "limit": 1,
            "params": {
              "q": solr_query,
              "sort": f"random_{draw} DESC"
            }
        }
    
    resp = requests.post('http://aips-solr:8983/solr/products/select', 
                                   json=request).json()
    
    return resp['response']['docs'][0]['upc']

explore_upc = explore('transformers dvd', explore_vect)
explore_upc

'97360810042'

## Simulate new sessions with the new data

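We simulate new sessions in which the explore document replaces the result at rank `explore_on_rank`. If its UPC is in `wants_to_purchase`, it is clicked with probability 0.8; if it is in `might_purchase`, with probability 0.5; otherwise with probability 0.01.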

In [77]:
import random
random.seed(1234)

wants_to_purchase = ['97360724240', '97363560449', '97363532149', '97360810042', '97368920347']
might_purchase = ['97361312743', '97363455349', '97361372389']
explore_on_rank = 2.0

with_explore_sessions = sessions.copy()
for i in range(0, 500):
    print(i)
    explore_upc = explore('transformers dvd', explore_vect)
    sess_ids = list(set(sessions[sessions['query'] == 'transformers dvd']['sess_id'].tolist()))
    random.shuffle(sess_ids)
    new_session = sessions[sessions['sess_id'] == sess_ids[0]].copy()
    new_session['sess_id'] = 100000 + i
    new_session.loc[new_session['rank'] == explore_on_rank, 'doc_id'] = explore_upc
    draw = random.random()
    new_session.loc[new_session['rank'] == explore_on_rank, 'clicked'] = False
    if explore_upc in wants_to_purchase:
        if draw < 0.8:
            print(f"click {explore_upc}")
            new_session.loc[new_session['rank'] == explore_on_rank, 'clicked'] = True
    elif explore_upc in might_purchase:
        if draw < 0.5:
            print(f"click {explore_upc}")
            new_session.loc[new_session['rank'] == explore_on_rank, 'clicked'] = True
    else:
        if draw < 0.01:
            print(f"click {explore_upc}")
            new_session.loc[new_session['rank'] == explore_on_rank, 'clicked'] = True

    with_explore_sessions = pd.concat([with_explore_sessions, new_session])

with_explore_sessions

0
click 97360810042
1
2
3
4
click 9781400532711
5
6
7
8
9
10
11
12
13
14
15
16
17
click 97360810042
18
click 97360724240
19
20
click 97360810042
21
22
23
24
25
26
click 97360810042
27
28
click 97368920347
29
30
31
32
33
click 97360810042
34
35
36
click 97360810042
37
38
39
40
41
click 97360810042
42
click 97360810042
43
click 97368920347
44
45
46
47
48
49
50
click 97360810042
51
52
53
54
55
56
click 36725236271
57
58
59
60
61
62
63
click 97360810042
64
65
66
67
68
click 97360810042
69
click 97368920347
70
71
click 97360810042
72
click 97360724240
73
74
75
click 97360810042
76
77
78
79
80
click 97360810042
81
82
83
84
85
86
87
88
89
90
click 97360810042
91
92
93
94
95
96
97
click 97368920347
98
99
100
101
click 97368920347
102
click 97360724240
103
104
105
106
107
108
109
110
click 97360810042
111
112
113
114
click 97360724240
115
116
117
118
119
click 97368920347
120
121
122
123
124
125
126
127
128
129
click 97360724240
130
click 97360810042
131
132
133
134
135
click 97360724240
136
13

Unnamed: 0,sess_id,query,rank,doc_id,clicked
0,15002,kindle,0.0,9781400532711,True
1,15002,kindle,1.0,814916011872,False
2,15002,kindle,2.0,813580018491,False
3,15002,kindle,3.0,814916014606,False
4,15002,kindle,4.0,813580017906,False
...,...,...,...,...,...
25083,100499,transformers dvd,11.0,47875839090,False
25084,100499,transformers dvd,12.0,93624995012,False
25085,100499,transformers dvd,13.0,97361372389,False
25086,100499,transformers dvd,14.0,47875332911,False
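
The click rule inside the loop above can be isolated into a small function, which makes the three probabilities easier to see and to test. This is a sketch only: `click_probability` and `simulate_click` are hypothetical helpers, not functions from the notebook.

```python
import random

def click_probability(upc, wants_to_purchase, might_purchase):
    """Chance that the simulated user clicks the explore document."""
    if upc in wants_to_purchase:
        return 0.8
    if upc in might_purchase:
        return 0.5
    return 0.01

def simulate_click(upc, wants_to_purchase, might_purchase, rng):
    """Draw once and decide whether the explore document is clicked."""
    return rng.random() < click_probability(upc, wants_to_purchase, might_purchase)

wants = ['97360724240', '97363560449', '97363532149', '97360810042', '97368920347']
might = ['97361312743', '97363455349', '97361372389']

rng = random.Random(1234)
trials = 10000
clicks = sum(simulate_click('97360810042', wants, might, rng) for _ in range(trials))
print(clicks / trials)
```

Over many trials the click rate for a wanted product settles near 0.8, which is why the explored document ends up with such a high grade in the next step.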


In [78]:
with_explore_sessions[with_explore_sessions['sess_id'] == 100049]

Unnamed: 0,sess_id,query,rank,doc_id,clicked
12064,100049,transformers dvd,0.0,47875819733,False
12065,100049,transformers dvd,1.0,47875839090,False
12066,100049,transformers dvd,2.0,36725236271,False
12067,100049,transformers dvd,3.0,708056579739,False
12068,100049,transformers dvd,4.0,97363532149,False
12069,100049,transformers dvd,5.0,97363560449,False
12070,100049,transformers dvd,6.0,97361312804,False
12071,100049,transformers dvd,7.0,879862003517,False
12072,100049,transformers dvd,8.0,93624974918,False
12073,100049,transformers dvd,9.0,47875332911,False


## Examine new sessions

Have we added any new docs that appear to be getting more clicks?

In [84]:
new_sdbn = sessions_to_sdbn(with_explore_sessions,
                            prior_weight=10,
                            prior_grade=0.2)
new_sdbn.loc['transformers dvd']

Unnamed: 0_level_0,clicked,examined,grade,beta_grade
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
97368920347,38,43,0.883721,0.754717
97360810042,62,75,0.826667,0.752941
97360724240,14,17,0.823529,0.592593
97363455349,731,2115,0.345626,0.344941
97361312804,726,2107,0.344566,0.343883
97363560449,733,2133,0.343647,0.342977
97361312743,708,2076,0.34104,0.340364
97363532149,692,2099,0.329681,0.329066
97361372389,673,2094,0.321394,0.320817
9781400532711,1,17,0.058824,0.111111
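
The `beta_grade` column in this table is consistent with smoothing each grade toward the prior: the prior contributes `prior_weight` pseudo-examinations at a click rate of `prior_grade`. The sketch below reproduces the column under that assumption. The actual `sessions_to_sdbn` implementation is not shown in this section, so treat the formula as inferred from the numbers.

```python
def beta_grade(clicked, examined, prior_weight=10, prior_grade=0.2):
    """Grade smoothed toward prior_grade by prior_weight pseudo-examinations."""
    prior_clicks = prior_weight * prior_grade
    return (clicked + prior_clicks) / (examined + prior_weight)

rows = [('97368920347', 38, 43),
        ('97360724240', 14, 17),
        ('97363455349', 731, 2115)]

for upc, clicked, examined in rows:
    grade = clicked / examined
    print(upc, round(grade, 6), round(beta_grade(clicked, examined), 6))
```

Documents with few examinations, such as the newly explored ones, are pulled noticeably toward the prior, while documents with thousands of examinations barely move.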


## New heavily clicked doc is promoted!

```
      {
        "upc":"97360810042",
        "name":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_ngram":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_omit_norms":"Transformers: Dark of the Moon - Blu-ray Disc",
        "name_txt_en_split":"Transformers: Dark of the Moon - Blu-ray Disc",
        "manufacturer":"\\N",
        "shortDescription":"\\N",
        "longDescription":"\\N",
        "promotion_b":true,
        "id":"72593b1c-313b-4f25-a4f2-04eae29d858b",
        "_version_":1710117636920049669
      },
```

In [85]:
random.seed(1234)

# {'blue ray': 0.0,
# 'dryer': 0.07068309073137659,
# 'headphones': 0.06426395939086295,
# 'dark of moon': 0.25681268708548055,
# 'transformers dvd': 0.10077083021678328}

feature_set_better = [
    {
      "name" : "name_fuzzy",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "name_ngram:(${keywords})"
      }
    },
    {
      "name" : "name_pf2",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "{!edismax qf=name pf2=name}(${keywords})"
      }
    },
    {
      "name" : "shortDescription_pf2",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : { 
        "q" : "{!edismax qf=shortDescription pf2=shortDescription}(${keywords})"
      }
    },
    {
      "name" : "has_promotion",
      "store": "test",
      "class" : "org.apache.solr.ltr.feature.SolrFeature",
      "params" : {
        "q" : "promotion_b:true^=1.0"
      }
    }
]

train, test = test_train_split(new_sdbn, train=0.8)
ranksvm_ltr(train, model_name='test3', feature_set=feature_set_better)
eval_model(test, model_name='test3', sdbn=new_sdbn)

Searching products [Status: 200]
Missing doc 600603124570
Searching products [Status: 200]
Missing doc 600603141003
Missing doc 600603132872
Searching products [Status: 200]
Missing doc 600603123061
Searching products [Status: 200]
Missing doc 600603139758
Searching products [Status: 200]
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603135101
Searching products [Status: 200]
Missing doc 600603132827
Searching products [Status: 200]
Searching products [Status: 200]
Missing doc 600603140631
Missing doc 600603125065
Missing doc 600603132827
Missing doc 600603133237
Searching products [Status: 200]
[LibLinear]PUT http://aips-solr:8983/solr/products/schema/model-store
{
  "store": "test",
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "test3",
  "features": [
    {
      "name": "name_fuzzy",
      "norm": {
        "class": "org.apache.solr.ltr.norm.StandardNormalizer",
        "params": {
          "avg": "2.3727308748355247",
          "


Liblinear failed to converge, increase the number of iterations.



{'blue ray': 0.16923076923076924,
 'dryer': 0.07068309073137659,
 'transformers dark of the moon': 0.26893025879698185}

In [86]:
test

Unnamed: 0_level_0,Unnamed: 1_level_0,clicked,examined,grade,beta_grade
query,doc_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
blue ray,27242815414,42,42,1.000000,0.846154
blue ray,600603132872,46,88,0.522727,0.489796
blue ray,827396513927,1304,3381,0.385685,0.385137
blue ray,600603141003,978,2620,0.373282,0.372624
blue ray,885170033412,568,2184,0.260073,0.259799
...,...,...,...,...,...
transformers dark of the moon,24543701538,182,1232,0.147727,0.148148
transformers dark of the moon,47875841369,37,251,0.147410,0.149425
transformers dark of the moon,47875841406,80,626,0.127796,0.128931
transformers dark of the moon,24543750949,31,313,0.099042,0.102167


In [88]:
eval_model(test, model_name='test3', sdbn=new_sdbn)

   index          upc                                               name  \
0      0  97360810042      Transformers: Dark of the Moon - Blu-ray Disc   
1      1  97360810042      Transformers: Dark of the Moon - Blu-ray Disc   
2      2  97360810042      Transformers: Dark of the Moon - Blu-ray Disc   
3      3  97360810042      Transformers: Dark of the Moon - Blu-ray Disc   
4      4  27242815414           Sony - 3D Wi-Fi Built-In Blu-ray  Player   
5      5  27242815414           Sony - 3D Wi-Fi Built-In Blu-ray  Player   
6      6  27242813908                  Sony - Wi-Fi Ready Blu-ray Player   
7      7  13132318394                      Blue Valentine - Blu-ray Disc   
8      8  43396263635  Blue Streak - Widescreen Dubbed Subtitle AC3 -...   
9      9  25192101021  Blue Crush 2 - Widescreen Dubbed Subtitle AC3 ...   

  manufacturer     score  rank       doc_id  clicked  examined  grade  \
0           \N  0.934142     0          NaN      NaN       NaN    NaN   
1           \N  0

{'blue ray': 0.16923076923076924,
 'dryer': 0.07068309073137659,
 'transformers dark of the moon': 0.26893025879698185}

In [89]:
new_transformers = new_sdbn.loc['transformers dvd']
new_transformers.loc['97368920347']

clicked       38.000000
examined      43.000000
grade          0.883721
beta_grade     0.754717
Name: 97368920347, dtype: float64

In [90]:
search('transformers dvd', 'test3', at=5)

[{'upc': '97368920347',
  'name': 'The Transformers: The Movie - DVD',
  'manufacturer': '\\N',
  'score': 1.0444574,
  'rank': 0},
 {'upc': '97368920347',
  'name': 'Transformers Animated: Transform and Roll Out - DVD',
  'manufacturer': '\\N',
  'score': 1.0266799,
  'rank': 1},
 {'upc': '97360722345',
  'name': 'Transformers/Transformers: Revenge of the Fallen: Two-Movie Mega Collection [2 Discs] - Widescreen - DVD',
  'manufacturer': '\\N',
  'score': 1.025708,
  'rank': 2},
 {'upc': '97360724240',
  'name': 'Transformers: Revenge of the Fallen - Widescreen - DVD',
  'manufacturer': '\\N',
  'score': 1.0250593,
  'rank': 3},
 {'upc': '97360810042',
  'name': 'Transformers: Dark of the Moon - Blu-ray Disc',
  'manufacturer': '\\N',
  'score': 0.9912172,
  'rank': 4}]

In [91]:
NUM_USERS=1000
purchases = {'test1': 0, 'test3': 0}
for _ in range(0, NUM_USERS):
    
    model_name, purchase_made = a_or_b_model(query='transformers dvd', 
                                             a_model='test1',
                                             b_model='test3')
    if purchase_made:
        purchases[model_name]+= 1 
    
purchases

{'test1': 21, 'test3': 190}
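
A gap of 21 versus 190 purchases looks decisive, but it helps to check it against chance. Below is a rough two-proportion z-test using only the standard library. It assumes, purely for illustration, that about 500 users landed in each arm, because the loop above does not record how many users each model served. The function name is hypothetical.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, nothing to distinguish
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

USERS_PER_ARM = 500  # assumed; the notebook only counts purchases

z_13, p_13 = two_proportion_z(21, USERS_PER_ARM, 190, USERS_PER_ARM)
z_12, p_12 = two_proportion_z(21, USERS_PER_ARM, 17, USERS_PER_ARM)
print('test1 vs test3:', round(z_13, 2), p_13)
print('test1 vs test2:', round(z_12, 2), round(p_12, 3))
```

Under that assumption, the `test1` versus `test3` difference is far outside what chance would explain, while the earlier `test1` versus `test2` result (21 versus 17) is not distinguishable from noise.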

In [92]:

# Full, auto-exploring, auto LTR loop

# OFFLINE 
# Periodically - Train a model given current info
# Same time -> Generate explore options



# ONLINE

# return results of current model
# mix in random upcs that match the explore criteria
explore_upc = explore('transformers dvd', explore_vect)
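
# The "mix in" step could be sketched as follows. This is an assumption about
# how the online half of the loop might look, not code from this notebook:
# `mix_in_explore` is a hypothetical helper that overwrites the model's result
# at the explore rank, mirroring how the simulated sessions replaced the
# document at `explore_on_rank`.

```python
def mix_in_explore(ranked_upcs, explore_upc, explore_on_rank=2):
    """Return the model's ranking with explore_upc shown at explore_on_rank."""
    # Drop the explore doc if the model already returned it, so it is not shown twice
    mixed = [upc for upc in ranked_upcs if upc != explore_upc]
    if explore_on_rank < len(mixed):
        mixed[explore_on_rank] = explore_upc
    else:
        mixed.append(explore_upc)
    return mixed

model_results = ['97368920347', '97360722345', '97360724240', '97363560449']
shown = mix_in_explore(model_results, '97360810042', explore_on_rank=2)
print(shown)
```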