# Synthesize search sessions from signals

This notebook synthesizes search sessions from the click-through rate (CTR) of the documents clicked for each query. The assumption is that ordering a query's results by CTR roughly recovers the source search system's relevance ranking in aggregate, with position and other biases included.
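
As a minimal sketch of that idea (not the `SessionGenerator` implementation, which is not shown here), CTR can be computed as clicks on a document divided by the number of times the query was issued, and the documents then ordered by it. The helper name `rank_by_ctr` and the input shape are assumptions made for illustration; the counts are taken from the `dryer` rows printed at the end of this notebook, where 20 clicks over 246 queries gives a CTR of about 0.081301.

```python
from collections import Counter

def rank_by_ctr(click_doc_ids, total_query_count):
    """Return (doc_id, ctr, rank) tuples ordered by CTR, with rank 0 as the top result.

    click_doc_ids: one entry per click on a document for a single query (assumed shape).
    total_query_count: how many times that query was issued.
    """
    click_counts = Counter(click_doc_ids)
    ctrs = {doc: n / total_query_count for doc, n in click_counts.items()}
    ordered = sorted(ctrs.items(), key=lambda item: item[1], reverse=True)
    return [(doc, ctr, rank) for rank, (doc, ctr) in enumerate(ordered)]

# Click counts for three 'dryer' documents, copied from the table later in this notebook.
clicks = ["12505451713"] * 20 + ["883929085118"] * 18 + ["883049066905"] * 16
ranking = rank_by_ctr(clicks, total_query_count=246)

for doc, ctr, rank in ranking:
    print(rank, doc, round(ctr, 6))
```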

You can then check whether a document's CTR is above or below average for its rank position by computing a z-score. That z-score can in turn be used to translate the document to any other position.
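
The two steps can be sketched roughly as follows. The z-score formula agrees with the `ctr_std_z_score` column printed at the end of this notebook; the translation step (destination mean plus z times destination standard deviation, clamped at zero) is an assumption about how such a mapping could work, and `z_score` and `translate_ctr` are hypothetical helper names rather than part of `session_gen`.

```python
def z_score(ctr, posn_ctr_mean, posn_ctr_std):
    """How many standard deviations a document's CTR sits from the mean CTR at its rank."""
    return (ctr - posn_ctr_mean) / posn_ctr_std

def translate_ctr(z, dest_ctr_mean, dest_ctr_std):
    """Assumed translation: keep the same z-score at the destination rank.

    Clamping at 0.0 is an illustrative choice so the result stays a valid probability.
    """
    return max(0.0, dest_ctr_mean + z * dest_ctr_std)

# Rank-0 'dryer' document: CTR 0.081301 against a rank-0 mean of 0.227634 and std of 0.178527.
z = z_score(0.081301, 0.227634, 0.178527)

# Move the same document to rank 4, whose mean and std are 0.050176 and 0.028513.
moved_ctr = translate_ctr(z, 0.050176, 0.028513)

print(round(z, 4), round(moved_ctr, 4))
```

A negative z-score means the document underperforms the typical result at its rank, so after translation its expected CTR stays below the destination rank's mean.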

This is intended for creating fake search session data for the examples in AI Powered Search; it is not a replacement for logging real search sessions in your search system.

In [16]:
! cd ../data/retrotech && head signals.csv

import random
import pandas as pd
import numpy as np
import sys
sys.path.append('..')
from aips import *
from session_gen import SessionGenerator
import os
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this

#seed=8675309
#random.seed(seed)
#np.random.seed(seed)

DOCS_PER_SESSION=20 # how many docs in one search page view?
NUM_SESSIONS=5000 # how many sessions to generate for each query?

# Generate search sessions for these queries
QUERIES_TO_SIMULATE=['dryer', 'iphone', 'ipad']

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"


In [17]:
session_gen = SessionGenerator(signals_path='../data/retrotech/signals.csv', min_query_count=100)

  exec(code_obj, self.user_global_ns, self.user_ns)
  pop_query_events = signals[signals['type'] == 'query'][signals['target'].isin(popular_queries)]
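
The `min_query_count=100` argument and the `popular_queries` name in the warning above suggest that only frequently issued queries are kept. A rough stdlib sketch of such a filter is shown below; the function body, the toy rows, and the `click` signal type are assumptions for illustration, since only `query` rows appear in the sample output. No case normalization is applied here, so `Xbox` and `xbox` would count separately.

```python
from collections import Counter

def popular_queries(rows, min_query_count):
    """Return the set of query strings issued at least min_query_count times.

    rows: dicts with 'type' and 'target' keys, mirroring the signals.csv columns.
    """
    counts = Counter(row["target"] for row in rows if row["type"] == "query")
    return {query for query, n in counts.items() if n >= min_query_count}

# Toy rows (hypothetical data, not taken from signals.csv).
rows = (
    [{"type": "query", "target": "ipad"}] * 3
    + [{"type": "query", "target": "nook"}]
    + [{"type": "click", "target": "12505451713"}]
)

popular = popular_queries(rows, min_query_count=2)
print(popular)
```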


In [18]:
session_gen('dryer', num_docs=DOCS_PER_SESSION, dampen=1.5)
session_gen.random_rankings['dryer']

  canonical = self.canonical_rankings[self.canonical_rankings['query'] == query][self.canonical_rankings['rank'] < num_docs]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_n.iloc[b] = a_val
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https

Unnamed: 0,posn_ctr_mean,posn_ctr_std,dest_rank,posn_ctr_mad,posn_ctr_median
33277,0.122041,0.070682,1.0,0.053763,0.109649
33278,0.085199,0.046758,2.0,0.036821,0.077982
33279,0.040146,0.023542,5.0,0.01885,0.03875
33280,0.050176,0.028513,4.0,0.022494,0.047291
33281,0.021941,0.014033,8.0,0.011926,0.021459
33282,0.06336,0.034434,3.0,0.027111,0.058824
33283,0.00761,0.004507,19.0,0.0037,0.00545
33284,0.026061,0.015995,7.0,0.01337,0.026549
33285,0.018867,0.012259,9.0,0.010602,0.017834
33286,0.227634,0.178527,0.0,0.134027,0.168625


# Randomly sample source signals, generate new sessions

In [19]:
from time import perf_counter 

for query in ['ipad']:
    
    session_dfs=[]
    t1_start = perf_counter()  
    for i in range(0, NUM_SESSIONS):
        session_dfs.append(session_gen(query, use_median=True, dampen=1.0, num_docs=DOCS_PER_SESSION))
        if (i % 500 == 0):
            print("Created Sessions %s Last Query %s Elapsed %s" % (i, query, perf_counter()-t1_start))

    sessions = pd.concat(session_dfs)
    sessions = sessions.sort_values(['sess_id', 'dest_rank'])
    sessions[['sess_id', 'query', 'dest_rank', 'clicked_doc_id', 'clicked']] \
        .rename(columns={'dest_rank': 'rank'}) \
        .to_csv("%s_sessions.gz" % query, compression='gzip', index=False)

Created Sessions 0 Last Query ipad Elapsed 0.04834919993299991
Created Sessions 500 Last Query ipad Elapsed 10.501352899940684
Created Sessions 1000 Last Query ipad Elapsed 20.97733150003478
Created Sessions 1500 Last Query ipad Elapsed 31.424249600037
Created Sessions 2000 Last Query ipad Elapsed 42.07259869994596
Created Sessions 2500 Last Query ipad Elapsed 52.49486119998619
Created Sessions 3000 Last Query ipad Elapsed 63.02001049998216
Created Sessions 3500 Last Query ipad Elapsed 73.51306999998633
Created Sessions 4000 Last Query ipad Elapsed 83.76188370003365
Created Sessions 4500 Last Query ipad Elapsed 94.27420300000813
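
To sanity-check the generated file, one option is to read it back and compute the observed CTR at each rank. The sketch below uses only the standard library and the column names written by the cell above (`sess_id`, `query`, `rank`, `clicked_doc_id`, `clicked`). How `clicked` is encoded in the CSV is an assumption, so the parser accepts both `True`/`False` and `1`/`0`; `ctr_by_rank` is a hypothetical helper name.

```python
import csv
import gzip
from collections import defaultdict

def ctr_by_rank(path):
    """Return {rank: fraction of rows at that rank that were clicked} for a sessions file."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    with gzip.open(path, "rt", newline="") as f:
        for row in csv.DictReader(f):
            # Ranks may have been written as floats such as "1.0".
            rank = int(float(row["rank"]))
            shown[rank] += 1
            # Assumed encodings for the clicked flag.
            if row["clicked"].strip().lower() in ("true", "1", "1.0"):
                clicked[rank] += 1
    return {rank: clicked[rank] / shown[rank] for rank in sorted(shown)}

# Example usage once the cell above has written the file:
# print(ctr_by_rank("ipad_sessions.gz"))
```

If the sessions behave as intended, the CTR should generally fall as rank increases, reflecting the position bias carried over from the source signals.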


In [20]:
gset = session_gen.canonical_rankings
orig_dryer = gset[gset['query'] == 'dryer']

orig_dryer[orig_dryer['rank'] < 20]

Unnamed: 0,index,query,clicked_doc_id,click_count,tot_query_count,ctr,rank,posn_ctr_mean,posn_ctr_std,posn_ctr_median,posn_ctr_mad,ctr_std_z_score,ctr_mod_z_score
33277,33884,dryer,12505451713,20,246,0.081301,0,0.227634,0.178527,0.168625,0.134027,-0.819671,-0.651546
33278,33933,dryer,883929085118,18,246,0.073171,1,0.122041,0.070682,0.109649,0.053763,-0.691413,-0.678507
33279,33927,dryer,883049066905,16,246,0.065041,2,0.085199,0.046758,0.077982,0.036821,-0.431125,-0.35146
33280,33894,dryer,36172950027,13,246,0.052846,3,0.06336,0.034434,0.058824,0.027111,-0.305348,-0.220504
33281,33910,dryer,74108056764,13,246,0.052846,4,0.050176,0.028513,0.047291,0.022494,0.093614,0.246946
33282,33912,dryer,77283045400,13,246,0.052846,5,0.040146,0.023542,0.03875,0.01885,0.539454,0.74775
33283,33923,dryer,783722274422,13,246,0.052846,6,0.031617,0.018574,0.031458,0.015179,1.142909,1.409022
33284,33920,dryer,665331101927,11,246,0.044715,7,0.026061,0.015995,0.026549,0.01337,1.166273,1.358787
33285,33888,dryer,14381196320,9,246,0.036585,8,0.021941,0.014033,0.021459,0.011926,1.043589,1.268373
33286,33911,dryer,74108096487,9,246,0.036585,9,0.018867,0.012259,0.017834,0.010602,1.445304,1.768698
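
The two z-score columns can be reproduced from the other columns in this table. The values below are copied from three of the rows above. The printed `ctr_mod_z_score` values match plain `(ctr - median) / mad`, without the 0.6745 scaling constant that some definitions of a modified z-score include. That `use_median=True` selects this median-based score is an assumption, as the `session_gen` source is not shown here.

```python
# (ctr, posn_ctr_mean, posn_ctr_std, posn_ctr_median, posn_ctr_mad,
#  ctr_std_z_score, ctr_mod_z_score) for ranks 0, 4 and 9 of the 'dryer' table.
rows = [
    (0.081301, 0.227634, 0.178527, 0.168625, 0.134027, -0.819671, -0.651546),
    (0.052846, 0.050176, 0.028513, 0.047291, 0.022494, 0.093614, 0.246946),
    (0.036585, 0.018867, 0.012259, 0.017834, 0.010602, 1.445304, 1.768698),
]

max_err = 0.0
for ctr, mean, std, median, mad, table_std_z, table_mod_z in rows:
    std_z = (ctr - mean) / std    # standard z-score against the position's mean and std
    mod_z = (ctr - median) / mad  # median/MAD variant, less sensitive to outliers
    max_err = max(max_err, abs(std_z - table_std_z), abs(mod_z - table_mod_z))
    print(round(std_z, 4), round(mod_z, 4))

# Small differences are expected because the table values are rounded to six decimals.
print("largest deviation from the table:", max_err)
```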
