# Synthesize search sessions from signals

This notebook synthesizes search sessions from the click-through rate (CTR) of the documents clicked for each query. The assumption is that ordering a query's results by CTR roughly reproduces, in aggregate, the source search system's relevance ranking (position bias and all other biases included).
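
As a rough illustration of the "order by CTR" idea, the sketch below computes CTR as `click_count / tot_query_count` and assigns a zero-based rank per query. The three rows are copied from the canonical rankings output of this notebook; whether `SessionGenerator` builds its rankings exactly this way is an assumption.

```python
import pandas as pd

# Click and query counts copied from the canonical rankings output in this notebook.
clicks = pd.DataFrame({
    "query": ["transformers dark of the moon"] * 3,
    "clicked_doc_id": [25192107191, 97360810042, 97363560449],
    "click_count": [6, 99, 19],
    "tot_query_count": [147, 147, 147],
})

# CTR: share of the query's searches that ended in a click on this document.
clicks["ctr"] = clicks["click_count"] / clicks["tot_query_count"]

# Zero-based rank within each query, highest CTR first (ties broken by row order).
clicks["rank"] = (
    clicks.groupby("query")["ctr"]
    .rank(method="first", ascending=False)
    .astype(int) - 1
)

ranked = clicks.sort_values(["query", "rank"]).reset_index(drop=True)
print(ranked)
```

The document with 99 clicks out of 147 searches ends up at rank 0, matching the first row of the canonical rankings.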

You can then check whether a document's CTR is above or below average for its rank position by computing a z-score. That z-score can in turn be used to translate the document's CTR to any other position.
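
The z-score translation can be sketched as follows. The helper names `z_score` and `translate_ctr` are hypothetical, and clamping the result to [0, 1] is an assumption rather than something taken from `SessionGenerator`; the numbers are copied from the outputs of this notebook.

```python
def z_score(ctr, posn_center, posn_spread):
    """How many spreads above or below the typical CTR for its position a document sits."""
    return (ctr - posn_center) / posn_spread


def translate_ctr(z, dest_center, dest_spread):
    """Project a z-score onto another position's CTR distribution.

    Clamping to [0, 1] is an assumption made for this sketch.
    """
    return max(0.0, min(1.0, dest_center + z * dest_spread))


# Position statistics copied from the notebook's outputs.
src = {"mean": 0.230722, "std": 0.178259}   # rank 0
dest = {"mean": 0.126987, "std": 0.071751}  # rank 1

z = z_score(0.673469, src["mean"], src["std"])             # top document at rank 0
translated = translate_ctr(z, dest["mean"], dest["std"])   # same document moved to rank 1
print(round(z, 3), round(translated, 3))
```

Here `z` agrees with the `ctr_std_z_score` value shown for that document. Passing the position's median and MAD instead of its mean and standard deviation gives (0.673469 - 0.171806) / 0.1339 ≈ 3.75, in line with the `ctr_mod_z_score` column.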

This is intended for creating fake search session data for the examples in AI Powered Search; it is not a replacement for logging real search sessions in your search system.

In [1]:
! cd ../data/retrotech && head signals.csv

import random
import pandas as pd
import numpy as np
import sys
sys.path.append('..')
from aips import *
from session_gen import SessionGenerator
import os
from IPython.display import display,HTML

#seed=8675309
#random.seed(seed)
#np.random.seed(seed)

DOCS_PER_SESSION=15 # how many docs in one search page view?
NUM_SESSIONS=5000 # how many sessions to generate for each query?

# Generate search sessions for these queries
QUERIES_TO_SIMULATE=['dryer', 'iphone', 'ipad', 'transformers dark of the moon']

"query_id","user","type","target","signal_time"
"u2_0_1","u2","query","nook","2019-07-31 08:49:07.3116"
"u2_1_2","u2","query","rca","2020-05-04 08:28:21.1848"
"u3_0_1","u3","query","macbook","2019-12-22 00:07:07.0152"
"u4_0_1","u4","query","Tv antenna","2019-08-22 23:45:54.1030"
"u5_0_1","u5","query","AC power cord","2019-10-20 08:27:00.1600"
"u6_0_1","u6","query","Watch The Throne","2019-09-18 11:59:53.7470"
"u7_0_1","u7","query","Camcorder","2020-02-25 13:02:29.3089"
"u9_0_1","u9","query","wireless headphones","2020-04-26 04:26:09.7198"
"u10_0_1","u10","query","Xbox","2019-09-13 16:26:12.0132"


In [2]:
session_gen = SessionGenerator(signals_path='../data/retrotech/signals.csv', min_query_count=100)
session_gen('transformers dark of the moon', num_docs=DOCS_PER_SESSION)
session_gen.random_rankings['transformers dark of the moon']

Unnamed: 0,posn_ctr_mean,posn_ctr_std,dest_rank,posn_ctr_mad,posn_ctr_median
55098,0.027051,0.01652,7.0,0.013837,0.027027
55099,0.015513,0.010639,11.0,0.009261,0.013043
55100,0.090041,0.048576,2.0,0.0384,0.082456
55101,0.066729,0.035581,3.0,0.028424,0.062802
55102,0.052618,0.029491,4.0,0.023502,0.050794
55103,0.042438,0.024659,5.0,0.020032,0.040695
55104,0.033338,0.019867,6.0,0.016147,0.032864
55105,0.126987,0.071751,1.0,0.055309,0.114837
55106,0.022961,0.014608,8.0,0.012406,0.022222
55107,0.019781,0.012853,9.0,0.011032,0.018562


# Randomly sample source signals, generate new sessions

In [3]:
from time import perf_counter 

for query in ['transformers dark of the moon']:
    
    session_dfs=[]
    t1_start = perf_counter()  
    for i in range(0, NUM_SESSIONS):
        session_dfs.append(session_gen(query, use_median=True, dampen=1.0, num_docs=DOCS_PER_SESSION))
        if (i % 500 == 0):
            print("Created Sessions %s Last Query %s Elapsed %s" % (i, query, perf_counter()-t1_start))

    sessions = pd.concat(session_dfs)
    sessions = sessions.sort_values(['sess_id', 'dest_rank'])
    sessions[['sess_id', 'query', 'dest_rank', 'clicked_doc_id', 'clicked']] \
        .rename(columns={'dest_rank': 'rank'}) \
        .to_csv("%s_sessions.gz" % query, compression='gzip', index=False)


Created Sessions 0 Last Query transformers dark of the moon Elapsed 0.02069779997691512
Created Sessions 500 Last Query transformers dark of the moon Elapsed 7.948518699966371
Created Sessions 1000 Last Query transformers dark of the moon Elapsed 16.04209959995933
Created Sessions 1500 Last Query transformers dark of the moon Elapsed 24.25959489995148
Created Sessions 2000 Last Query transformers dark of the moon Elapsed 32.66837129998021
Created Sessions 2500 Last Query transformers dark of the moon Elapsed 40.64491539995652
Created Sessions 3000 Last Query transformers dark of the moon Elapsed 48.351152800023556
Created Sessions 3500 Last Query transformers dark of the moon Elapsed 55.67575059994124
Created Sessions 4000 Last Query transformers dark of the moon Elapsed 62.9664745000191
Created Sessions 4500 Last Query transformers dark of the moon Elapsed 70.29577289998997
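
One way to sanity-check generated sessions is to measure the observed CTR at each rank. The sketch below assumes the column layout written to the CSV above (`sess_id`, `query`, `rank`, `clicked_doc_id`, `clicked`) and that `clicked` is a boolean or 0/1 flag; the toy rows are made up for illustration, and `ctr_by_rank` is a hypothetical helper.

```python
import pandas as pd


def ctr_by_rank(sessions: pd.DataFrame) -> pd.Series:
    """Fraction of sessions in which the document shown at each rank was clicked."""
    clicked = sessions["clicked"].astype(float)  # accepts booleans or 0/1 flags
    return clicked.groupby(sessions["rank"]).mean()


# Made-up rows in the same layout as the generated sessions file.
toy = pd.DataFrame({
    "sess_id": [0, 0, 1, 1],
    "query": ["example query"] * 4,
    "rank": [0, 1, 0, 1],
    "clicked_doc_id": [111, 222, 222, 111],
    "clicked": [True, False, True, True],
})

observed = ctr_by_rank(toy)
print(observed)
```

On real output, the same function could be applied to the frame returned by `pd.read_csv` on the generated `.gz` file, where CTR would be expected to fall off as rank increases.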


In [4]:
gset = session_gen.canonical_rankings
# Canonical (CTR-ordered) rankings for the query simulated above
orig_rankings = gset[gset['query'] == 'transformers dark of the moon']

orig_rankings[orig_rankings['rank'] < 20]

Unnamed: 0,index,query,clicked_doc_id,click_count,tot_query_count,ctr,rank,posn_ctr_mean,posn_ctr_std,posn_ctr_median,posn_ctr_mad,ctr_std_z_score,ctr_mod_z_score
55098,56264,transformers dark of the moon,97360810042,99,147,0.673469,0,0.230722,0.178259,0.171806,0.1339,2.483731,3.74654
55099,56266,transformers dark of the moon,97363560449,19,147,0.129252,1,0.126987,0.071751,0.114837,0.055309,0.031568,0.260628
55100,56257,transformers dark of the moon,25192107191,6,147,0.040816,2,0.090041,0.048576,0.082456,0.0384,-1.013335,-1.084368
55101,56260,transformers dark of the moon,47875841420,6,147,0.040816,3,0.066729,0.035581,0.062802,0.028424,-0.728277,-0.773498
55102,56268,transformers dark of the moon,786936817218,4,147,0.027211,4,0.052618,0.029491,0.050794,0.023502,-0.861514,-1.003452
55103,56262,transformers dark of the moon,47875842335,2,147,0.013605,5,0.042438,0.024659,0.040695,0.020032,-1.169291,-1.352314
55104,56270,transformers dark of the moon,47875841406,2,147,0.013605,6,0.033338,0.019867,0.032864,0.016147,-0.993267,-1.192671
55105,56255,transformers dark of the moon,24543701538,1,147,0.006803,7,0.027051,0.01652,0.027027,0.013837,-1.225692,-1.461602
55106,56256,transformers dark of the moon,24543750949,1,147,0.006803,8,0.022961,0.014608,0.022222,0.012406,-1.10608,-1.242926
55107,56258,transformers dark of the moon,36725235564,1,147,0.006803,9,0.019781,0.012853,0.018562,0.011032,-1.009719,-1.065887


In [5]:
for query in gset['query'].unique():
    print(query)

#
*
1080p
1196648
1342081 1342106 1342115 1342124
24
300
3547042
360
360 elite
3d
3d glasses
3d movies
3d tv
3ds
50 cent
8800
a630
ac
ac adapter
acer
acer iconia
acer laptop
acer tablet
action replay
adapter
adapters
adele
adobe
air conditioner
air conditioners
air purifier
airport
airport express
akon
alarm clock
alarm clocks
alienware
alpine
altec lansing
amazon kindle
amp
amplifier
amplifiers
amps
amy winehouse
android
android tablet
android tablets
anime
antec
antenna
antennas
anti virus
antivirus
apc
apple
apple computer
apple computers
apple ipad
apple ipod
apple keyboard
apple laptop
apple laptops
apple mac
apple macbook
apple macbook pro
apple tv
aquos
archos
arkham city
asus
asus laptop
asus tablet
asus transformer
ati
atrix
audio cable
avatar
averatec
avril lavigne
babylon 5
backpack
batman
batman arkham city
batman year one
batteries
battery
battery backup
battery charger
battery chargers
battlefield
battlefield 2
battlefield 3
battlestar galactica
beatles
beats
beats audio
