wbsearchentities A/B test analysis
==

As part of [T208917](https://phabricator.wikimedia.org/T208917), the weights of the wbsearchentities prefix search on www.wikidata.org were tuned using historical click logs as a guide. To determine the effectiveness of this tuning, an A/B test was run from Nov 30 at 00:00 UTC until Dec 7 at 00:00 UTC. The test was limited to users searching for Wikidata items in the English language. Users were divided equally, on a per-page-load basis, into either the control or the test bucket.

The analytics for this test collect usage from the various entity selectors used throughout the interface for editing Wikidata. The data also includes usage of the autocomplete on the top-right of all Wikidata pages, but due to a bug in the data collection the usage of the top-right autocomplete was only logged from entity pages.

The graphs below are all probability densities annotated with 95% confidence intervals. The confidence intervals are constructed by running five thousand rounds of the bootstrap method.
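
As a rough illustration of how a bootstrap interval of this kind can be built, the sketch below resamples a synthetic sample with replacement and reads the interval off the percentiles of the resampled means. The function name `bootstrap_mean_ci`, the seed, and the toy data are assumptions made for the example and are not part of the analysis.

```python
import numpy as np

def bootstrap_mean_ci(values, rounds=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row is one resample, drawn with replacement, the same size as the input.
    samples = rng.choice(values, size=(rounds, len(values)), replace=True)
    means = samples.mean(axis=1)
    low, high = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high, means

# Synthetic data for illustration only, not taken from the test.
toy = np.random.default_rng(1).integers(0, 5, size=200)
low, high, means = bootstrap_mean_ci(toy)
print('95% CI for the mean: [{:.3f}, {:.3f}]'.format(low, high))
```

The array of resampled means is returned alongside the bounds so that its density can be plotted, which is the shape the graphs below take.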

In [None]:
from gzip import GzipFile
import pickle
import numpy as np
try:
    from pyspark.sql import functions as F
except ImportError:
    import findspark
    findspark.init('/usr/lib/spark2')
    from pyspark.sql import SparkSession, functions as F
    spark = SparkSession.builder.master('local').getOrCreate()
    
import bokeh
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
import bokeh.transform
import IPython.display

palette = bokeh.palettes.Spectral[6]
bokeh.io.output_notebook()

def markdown(content):
    IPython.display.display(IPython.display.Markdown(content))

In [None]:
dec = (F.col('month') == 12) & (F.col('day') < 7)
nov = (F.col('month') == 11) & (F.col('day') >= 30)
date_cond = (F.col('year') == 2018) & (dec | nov)

df_raw = (
    spark.read.table('event.wikidatacompletionsearchclicks')
    .where(
        date_cond & 
        (F.col('event.context') == 'item') &
        (F.col('event.language') == 'en'))
    # Drop events without a bucket. These are from the very beginning of
    # the test, when not all users had received the new testing code.
    .where(F.col('event.bucket').isNotNull())
    .where(F.col('event.pageToken').isNotNull())
    .select('dt', 'event.*')
    .toPandas()
)

In [None]:
# Cap prefix lengths at 10 characters. Whether this cap is fair is an open question.
df_raw['prefixLen'] = df_raw['searchTerm'].str.len().clip(upper=10)
df_raw['click'] = df_raw['action'] == 'click'
df_raw['start'] = df_raw['action'] == 'session-start'
# Newer pandas versions reject the unit-less 'datetime64' dtype, so name the unit.
df_raw['dt'] = df_raw['dt'].astype('datetime64[ns]')
print((df_raw['dt'].min(), df_raw['dt'].max()))
df_by_page = (
    df_raw
    .groupby(['bucket', 'pageToken'])
    .agg({
        'dt': lambda x: x.iloc[0],
        'start': np.any,
        'click': np.any,
        'clickIndex': np.min,
        'prefixLen': np.sum,
    })
    .reset_index()
)
df_by_page['totalCharsTyped'] = df_by_page['prefixLen']
del df_by_page['prefixLen']

colors = {bucket: color for bucket, color in zip(df_raw['bucket'].unique(), palette)}

Event Counts
============

In [None]:
df_time = (
    df_raw[['dt', 'bucket', 'pageToken', 'click', 'start']]
    .groupby(['bucket', 'pageToken'])
    .agg({
        'dt': lambda x: x.iloc[0],
        'start': np.any,
        'click': np.any,
    })
    .reset_index()
    .set_index('dt')
    .groupby('bucket')
    .resample("D")
    .sum()
)

for col, title in (('start', 'session start'), ('click', 'click')):
    p = bokeh.plotting.figure(
        title='Page loads with {} events by day'.format(title).title(),
        plot_height=200, x_axis_type='datetime',
        toolbar_location=None)
    for bucket, g in df_time.reset_index().groupby('bucket'):
        p.line('dt', col, source=g, color=colors[bucket], legend=bucket)
    bokeh.io.show(p)

In [None]:
total_events = df_time.groupby('bucket').sum()
md = [
    'Raw event counts per bucket\n==\n',
    '|bucket|event|count|',
    '|------|-----|-----|',
]
for idx, row in total_events.iterrows():
    for event, count in row.items():
        md.append('|{}|{}|{}|'.format(idx, event, int(count)))
markdown('\n'.join(md))

In [None]:
def ci(values, rounds=5000, n=None, agg=lambda x: x.mean(axis=1)):
    if n is None:
        n = len(values)
    samples = np.random.choice(values, size=n * rounds, replace=True).reshape(rounds, -1)
    scores = np.sort(agg(samples))
    alpha = 0.05
    low = int(rounds * (alpha/2))
    mid = int(rounds / 2)
    high = int(rounds * (1 - alpha/2))
    return (scores[low], scores[mid], scores[high]), scores

In [None]:
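# gaussian_kde is part of the public scipy.stats namespace; the
# scipy.stats.kde module path is deprecated.
from scipy.stats import gaussian_kde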
from collections import defaultdict

def ridge(bucket, data, scale):
    return list(zip([bucket]*len(data), scale*data))

def plot_distribution(title, buckets, data):
    min_x = min(np.min(raw) for _, raw in data.values())
    max_x = max(np.max(raw) for _, raw in data.values())
    
    x = np.linspace(min_x, max_x, 500)
    # A bit evil .. but for the patch to draw the polygon we need
    # the data to start and end with y=0. The first and last
    # x values are repeated and these are applied manually later.
    x = np.append(np.append(x, x[-1])[::-1], x[0])[::-1]
    source = bokeh.models.ColumnDataSource(data=dict(x=x))
    p = bokeh.plotting.figure(
        y_range=sorted(buckets, reverse=True), title=title,
        plot_height=75 * len(buckets), plot_width=700,
        x_range=(min_x, max_x),
        toolbar_location=None)
    
    pdfs = {bucket: gaussian_kde(raw) for bucket, (_, raw) in data.items()}
    ys = {bucket: pdf(x) for bucket, pdf in pdfs.items()}
    max_y = max(np.max(ys[bucket]) for bucket in data.keys())
    scale = 0.8 / max_y
    
    bounds_data = defaultdict(list)
    for bucket, (bounds, raw) in sorted(data.items(), key=lambda x: x[0], reverse=True):
        # Apply polygon minimum edges
        ys[bucket][0] = 0
        ys[bucket][-1] = 0
        y = ridge(bucket, ys[bucket], scale=scale)
        source.add(y, bucket)
        p.patch(
            'x', bucket, color=colors[bucket], line_color="black",
            alpha=0.6, source=source)
        if bounds:
            bounds_data['buckets'].append(bucket)
            bounds_data['upper'].append(bounds[-1])
            bounds_data['lower'].append(bounds[0])
    if bounds_data:
        source_error = bokeh.models.ColumnDataSource(bounds_data)
        p.add_layout(bokeh.models.Whisker(
            dimension="width", line_color="black",
            source=source_error, base="buckets", upper="upper", lower="lower"))

    p.y_range.range_padding = 0.4
    bokeh.io.show(p)
    

def plot_ci(title, df, extract, rounds=5000):
    data = {}
    buckets = df['bucket'].unique()
    for bucket in sorted(buckets):
        samples = extract(df[df['bucket'] == bucket])
        data[bucket] = ci(samples, rounds=rounds)
    plot_distribution(title, buckets, data)

Number of characters typed before success
====================================
The number of characters typed in each session is a proxy for the amount of effort a user must exert to find the item they are looking for. The 95% CIs of the two buckets overlap completely, suggesting the test treatment had no effect on the number of characters typed.
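
Reading significance from whether two intervals overlap is a conservative check. A complementary approach, sketched here on synthetic data, is to bootstrap the difference in means directly and see whether its interval contains zero. The helper `bootstrap_diff_ci` and the toy samples are hypothetical and are not used elsewhere in this analysis.

```python
import numpy as np

def bootstrap_diff_ci(a, b, rounds=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling each group independently."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mean_a = rng.choice(a, size=(rounds, len(a)), replace=True).mean(axis=1)
    mean_b = rng.choice(b, size=(rounds, len(b)), replace=True).mean(axis=1)
    diffs = mean_b - mean_a
    low, high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

# Synthetic capped prefix lengths (1 to 10), for illustration only.
rng = np.random.default_rng(42)
control = rng.integers(1, 11, size=500)
test = rng.integers(1, 11, size=500)
low, high = bootstrap_diff_ci(control, test)
print('same distribution:    [{:.3f}, {:.3f}]'.format(low, high))

# A deliberately large shift, to show an interval that excludes zero.
low_shift, high_shift = bootstrap_diff_ci(control, control + 5)
print('shifted by 5 chars:   [{:.3f}, {:.3f}]'.format(low_shift, high_shift))
```

An interval that contains zero is consistent with no effect, while one that excludes zero points to a difference between the buckets.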

In [None]:
df_clicks = df_raw[df_raw['click'] == True].copy().dropna()
plot_ci('Mean Characters Typed Per Successful Lookup', df_clicks, lambda x: x['prefixLen'])

In [None]:
df_clicks_by_page = df_by_page[df_by_page['click'] == True].copy().dropna()
plot_ci('Mean Characters Typed Per Page Load', df_clicks_by_page, lambda x: x['totalCharsTyped'])

Abandonment Rate
================
Page loads that log a start event but no click event are interpreted loosely as abandoned searches, and their share of all page loads as the abandonment rate. The 95% CIs of the two buckets overlap completely, suggesting the test treatment had no effect on abandonment rates. This is the first time we've looked at abandonment rates for wbsearchentities, and further investigation into why the rate is so high may be called for.
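
To make the definition concrete, here is a minimal sketch on a hand-made event table. The column names mirror those used in this notebook, but the rows are invented for illustration.

```python
import pandas as pd

# Invented events: one row per logged action, keyed by bucket and page load.
events = pd.DataFrame({
    'bucket':    ['control', 'control', 'control',
                  'test', 'test', 'test', 'test', 'test'],
    'pageToken': ['a', 'a', 'b',
                  'c', 'd', 'd', 'e', 'e'],
    'action':    ['session-start', 'click', 'session-start',
                  'session-start', 'session-start', 'click',
                  'session-start', 'click'],
})
events['click'] = events['action'] == 'click'

# A page load counts as abandoned when none of its events is a click.
by_page = events.groupby(['bucket', 'pageToken'])['click'].any().reset_index()
by_page['abandon'] = ~by_page['click']

rates = by_page.groupby('bucket')['abandon'].mean()
print(rates)
```

In this toy table one of the two control page loads and one of the three test page loads end without a click.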

In [None]:
df_abandon = (
    df_raw
    .groupby(['bucket', 'pageToken'])
    .agg({'click': np.any})
    .reset_index()
)
df_abandon['abandon'] = 1 - df_abandon['click']

In [None]:
plot_ci('Abandonment Rate', df_abandon, lambda x: x['abandon'])

Click Position
===========

The position of the clicked result is another proxy for the amount of effort a user must exert to find the item they are looking for. The mean position clicked decreased from 1.38 to 1.33, which is statistically significant.

In [None]:
plot_ci('Mean click position', df_clicks, lambda x: 1 + x['clickIndex'])

Looking more closely at this result, the change that occurred was an increase in Clicks@1 from 80% to 84%. Clicks@2 saw a comparable drop from 14% to 10%.
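
For readers unfamiliar with the Clicks@k notation, the sketch below computes it, along with the mean click position, from a small invented table. It assumes, as the code in this notebook does, that `clickIndex` is zero-based, so position 1 corresponds to `clickIndex == 0`.

```python
import pandas as pd

# Invented clicks, for illustration only; clickIndex is zero-based.
clicks = pd.DataFrame({
    'bucket':     ['control'] * 5 + ['test'] * 5,
    'clickIndex': [0, 0, 0, 1, 2,
                   0, 0, 0, 0, 1],
})

# Clicks@k is the share of clicks that landed on result position k.
cols = []
for k in range(3):
    col = 'clicks_at_{}'.format(k + 1)
    clicks[col] = clicks['clickIndex'] == k
    cols.append(col)
summary = clicks.groupby('bucket')[cols].mean()

# Mean click position, converted to one-based positions.
mean_pos = (clicks['clickIndex'] + 1).groupby(clicks['bucket']).mean()

print(summary)
print(mean_pos)
```

A shift of clicks from position 2 to position 1 raises Clicks@1 and lowers the mean position at the same time, which is the pattern described above.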

In [None]:
for i in range(4):
    title = 'Percentage of users clicking result position {}'.format(i + 1)
    plot_ci(title, df_clicks, lambda x: x['clickIndex'] == i)