# Synthesize search sessions from signals

This notebook synthesizes search sessions from the click-through rate (CTR) of the documents clicked for each query. The assumption is that ordering a query's results by CTR roughly recovers, in aggregate, the source search system's relevance ranking (position bias and other biases included).

You can then check whether a document's CTR is above or below average for its rank position (using a z score), and use that z score to translate the document's CTR to any other position.
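As a rough illustration of that idea, here is a minimal sketch using made-up CTR values rather than the retrotech signals. The helper names `z_score` and `translate` are hypothetical and are not part of `session_gen`, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_score(value, observations):
    """How many standard deviations `value` sits from the mean of `observations`."""
    return (value - mean(observations)) / pstdev(observations)

def translate(z, observations):
    """The CTR that sits `z` standard deviations from the mean at another position."""
    return mean(observations) + z * pstdev(observations)

# Illustrative CTRs seen at two rank positions (made-up numbers)
rank_0_ctrs = [0.30, 0.20, 0.10]
rank_5_ctrs = [0.06, 0.04, 0.02]

# A document with CTR 0.30 at rank 0 is above average for that position...
z = z_score(0.30, rank_0_ctrs)

# ...so it should also be above average when moved to rank 5.
ctr_at_rank_5 = translate(z, rank_5_ctrs)
```

The document keeps its relative standing (the same z score) while its absolute CTR shrinks to match the lower click rates typical of the new position.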

This is intended for creating fake search session data for the examples in AI Powered Search; it is not a replacement for logging real search sessions in your search system.

In [15]:
! cd ../data/retrotech && head signals.csv

import random
import pandas as pd
import numpy as np
import sys
sys.path.append('..')
from aips import *
from session_gen import SessionGenerator
import os
from IPython.display import display, HTML

#seed=8675309
#random.seed(seed)
#np.random.seed(seed)

DOCS_PER_SESSION=30 # how many docs in one search page view?
NUM_SESSIONS=5000 # how many sessions to generate for each query?

# Generate search sessions for these queries
QUERIES_TO_SIMULATE=['dryer', 'iphone', 'nook', 'kindle', 
                     'lcd tv', 'ipad', 'headphones', 'macbook',
                     'star wars', 'star trek',
                     'blue ray', 'bluray']

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"


In [16]:
session_gen = SessionGenerator(signals_path='../data/retrotech/signals.csv', min_query_count=100)

  exec(code_obj, self.user_global_ns, self.user_ns)
  pop_query_events = signals[signals['type'] == 'query'][signals['target'].isin(popular_queries)]


In [17]:
session_gen('dryer', dampen=1.5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_n.iloc[b] = a_val
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_n.iloc[a] = b_val


Unnamed: 0,rank,dest_ctr_mean,dest_ctr_std,dest_rank,dest_ctr_mad,dest_ctr_median,index,query,clicked_doc_id,click_count,...,posn_ctr_std,posn_ctr_median,posn_ctr_mad,ctr_std_z_score,ctr_mod_z_score,dest_ctr_median_based,dest_ctr_mean_based,draw,clicked,sess_id
0,0,0.010893,0.007176,14.0,0.00629,0.008386,33884,dryer,12505451713,20,...,0.178527,0.168625,0.134027,-0.819671,-0.651546,0.002239,0.00207,0.753407,False,1
1,1,0.021941,0.014033,8.0,0.011926,0.021459,33933,dryer,883929085118,18,...,0.070682,0.109649,0.053763,-0.691413,-0.678507,0.009322,0.007387,0.370424,False,1
2,2,0.085199,0.046758,2.0,0.036821,0.077982,33927,dryer,883049066905,16,...,0.046758,0.077982,0.036821,-0.431125,-0.35146,0.05857,0.054961,0.6099,False,1
3,3,0.005449,0.002513,27.0,0.001977,0.004545,33894,dryer,36172950027,13,...,0.034434,0.058824,0.027111,-0.305348,-0.220504,0.003892,0.004298,0.965528,False,1
4,4,0.050176,0.028513,4.0,0.022494,0.047291,33910,dryer,74108056764,13,...,0.028513,0.047291,0.022494,0.093614,0.246946,0.055623,0.05418,0.451658,False,1
5,5,0.01652,0.011034,10.0,0.009665,0.014778,33912,dryer,77283045400,13,...,0.023542,0.03875,0.01885,0.539454,0.74775,0.025619,0.025448,0.558347,False,1
6,6,0.040146,0.023542,5.0,0.01885,0.03875,33923,dryer,783722274422,13,...,0.018574,0.031458,0.015179,1.142909,1.409022,0.078591,0.080505,0.631737,False,1
7,7,0.06336,0.034434,3.0,0.027111,0.058824,33920,dryer,665331101927,11,...,0.015995,0.026549,0.01337,1.166273,1.358787,0.11408,0.123598,0.305083,False,1
8,8,0.018867,0.012259,9.0,0.010602,0.017834,33888,dryer,14381196320,9,...,0.014033,0.021459,0.011926,1.043589,1.268373,0.038004,0.038057,0.993098,False,1
9,9,0.005332,0.002412,28.0,0.001885,0.004505,33911,dryer,74108096487,9,...,0.012259,0.017834,0.010602,1.445304,1.768698,0.009505,0.010562,0.930664,False,1
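The source of `session_gen` is not shown here, so the following is an inference from the displayed columns rather than the actual implementation. The `dest_ctr_mean_based` and `dest_ctr_median_based` values in row 0 can be reproduced by shifting the destination position's center by the dampened z score times its spread. The `draw < ctr` click rule and the helper name `translate_ctr` are assumptions.

```python
def translate_ctr(z, dest_center, dest_spread, dampen=1.0):
    """Shift the destination position's typical CTR by a dampened z score."""
    return dest_center + dampen * z * dest_spread

# Row 0 of the 'dryer' output above, generated with dampen=1.5
mean_based = translate_ctr(z=-0.819671,            # ctr_std_z_score
                           dest_center=0.010893,   # dest_ctr_mean
                           dest_spread=0.007176,   # dest_ctr_std
                           dampen=1.5)

median_based = translate_ctr(z=-0.651546,          # ctr_mod_z_score
                             dest_center=0.008386, # dest_ctr_median
                             dest_spread=0.00629,  # dest_ctr_mad
                             dampen=1.5)

# Assumption: a click is recorded when the uniform random draw falls below the CTR
draw = 0.753407
clicked = draw < median_based

print(round(mean_based, 6), round(median_based, 6), clicked)
```

Under this reading, every row shown has a `draw` larger than its translated CTR, which agrees with `clicked` being `False` throughout.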


# Randomly sample source signals, generate new sessions

In [19]:
from time import perf_counter

# Only 'iphone' is simulated here; swap in QUERIES_TO_SIMULATE to cover every query
for query in ['iphone']:

    session_dfs = []
    t1_start = perf_counter()
    for i in range(NUM_SESSIONS):
        # Each call returns one synthesized session as a DataFrame
        session_dfs.append(session_gen(query, use_median=True, dampen=1.0))
        if i % 500 == 0:
            print("Created Sessions %s Last Query %s Elapsed %s" % (i, query, perf_counter()-t1_start))

    # Combine all sessions, order by session and destination rank, and write a gzipped CSV
    sessions = pd.concat(session_dfs)
    sessions = sessions.sort_values(['sess_id', 'dest_rank'])
    sessions[['sess_id', 'query', 'dest_rank', 'clicked_doc_id', 'clicked']] \
        .rename(columns={'dest_rank': 'rank'}) \
        .to_csv("%s_sessions.gz" % query, compression='gzip', index=False)


Created Sessions 0 Last Query iphone Elapsed 0.04499190009664744
Created Sessions 500 Last Query iphone Elapsed 10.5044395000441
Created Sessions 1000 Last Query iphone Elapsed 20.926782099995762
Created Sessions 1500 Last Query iphone Elapsed 31.13648159999866
Created Sessions 2000 Last Query iphone Elapsed 41.43589540000539
Created Sessions 2500 Last Query iphone Elapsed 51.86735100008082
Created Sessions 3000 Last Query iphone Elapsed 62.31002160010394
Created Sessions 3500 Last Query iphone Elapsed 72.7401164000621
Created Sessions 4000 Last Query iphone Elapsed 83.08010390005074
Created Sessions 4500 Last Query iphone Elapsed 93.27326360007282
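
One way to sanity check the generated file is to recompute the CTR at each rank from the sessions and confirm it falls off with position. The sketch below uses only the standard library and a tiny made-up sample in the same column layout as the CSV written above; the function name `ctr_by_rank` and the document ids are hypothetical.

```python
import csv
import gzip
import io
from collections import defaultdict

def ctr_by_rank(gz_bytes):
    """Fraction of sessions with a click at each rank, from gzipped CSV bytes."""
    views = defaultdict(int)
    clicks = defaultdict(int)
    with gzip.open(io.BytesIO(gz_bytes), mode='rt', newline='') as f:
        for row in csv.DictReader(f):
            rank = int(float(row['rank']))      # ranks may be written as floats, e.g. "14.0"
            views[rank] += 1
            clicks[rank] += row['clicked'] == 'True'
    return {rank: clicks[rank] / views[rank] for rank in sorted(views)}

# Made-up sample with the columns sess_id, query, rank, clicked_doc_id, clicked
sample = (
    "sess_id,query,rank,clicked_doc_id,clicked\n"
    "1,iphone,0.0,111,True\n"
    "1,iphone,1.0,222,False\n"
    "2,iphone,0.0,111,False\n"
    "2,iphone,1.0,222,False\n"
)
ctrs = ctr_by_rank(gzip.compress(sample.encode('utf-8')))
print(ctrs)
```

To run it against the real output, pass the bytes of `iphone_sessions.gz` (for example, `open('iphone_sessions.gz', 'rb').read()`) instead of the sample.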
