# Add a Link: Leading Indicators

The Phabricator task for this is [T277355](https://phabricator.wikimedia.org/T277355).

We've got the following three leading indicators:

1. Revert rate: compare the revert rate of Add a Link edits to that of unstructured link task edits.
2. User rejection rate: do users reject more than 30% of links?
3. Task completion rate: what is the proportion of users who start the Add a Link task and complete it? If it is below 75%, we investigate.

With regard to the rejection rate, we also want to calculate the "proportion of users who accept all links". We then want to compare the rejection rate when excluding these users from the dataset.

# Libraries and Configuration

In [1]:
import datetime as dt

import pandas as pd
import numpy as np

from collections import defaultdict

from wmfdata import spark, mariadb

from scipy import stats

In [2]:
## Start timestamp of the experiment (https://phabricator.wikimedia.org/T277356#7120922)
exp_start = '2021-05-27T19:12:03'

exp_start_ts = dt.datetime.strptime(exp_start, '%Y-%m-%dT%H:%M:%S')

## We'll limit data gathering to midnight June 14, the day we're gathering data
exp_end_ts = dt.datetime(2021, 6, 14, 0, 0, 0)

## List of wikis that we deployed to:
wikis = ['arwiki','bnwiki','cswiki', 'viwiki']

## Lists of known users to ignore (e.g. test accounts and experienced users)
known_users = defaultdict(set)
known_users['cswiki'].update([14, 127629, 303170, 342147, 349875, 44133, 100304, 307410, 439792, 444907,
                              454862, 456272, 454003, 454846, 92295, 387915, 398470, 416764, 44751, 132801,
                              137787, 138342, 268033, 275298, 317739, 320225, 328302, 339583, 341191,
                              357559, 392634, 398626, 404765, 420805, 429109, 443890, 448195, 448438,
                              453220, 453628, 453645, 453662, 453663, 453664, 440694, 427497, 272273,
                              458025, 458487, 458049, 59563, 118067, 188859, 191908, 314640, 390445,
                              451069, 459434, 460802, 460885, 79895, 448735, 453176, 467557, 467745,
                              468502, 468583, 468603, 474052, 475184, 475185, 475187, 475188, 294174,
                              402906, 298011])

known_users['kowiki'].update([303170, 342147, 349875, 189097, 362732, 384066, 416362, 38759, 495265,
                              515553, 537326, 566963, 567409, 416360, 414929, 470932, 472019, 485036,
                              532123, 558423, 571587, 575553, 576758, 360703, 561281, 595100, 595105,
                              595610, 596025, 596651, 596652, 596653, 596654, 596655, 596993, 942,
                              13810, 536529])

known_users['viwiki'].update([451842, 628512, 628513, 680081, 680083, 680084, 680085, 680086, 355424,
                              387563, 443216, 682713, 659235, 700934, 705406, 707272, 707303, 707681, 585762])

known_users['arwiki'].update([237660, 272774, 775023, 1175449, 1186377, 1506091, 1515147, 1538902,
                              1568858, 1681813, 1683215, 1699418, 1699419, 1699425, 1740419, 1759328, 1763990])

## Grab the user IDs of known test accounts so they can be added to the exclusion list

def get_known_users(wiki):
    '''
    Get user IDs of known test accounts and return a set of them.
    '''
    
    username_patterns = ["MMiller", "Zilant", "Roan", "KHarlan", "MWang", "SBtest",
                         "Cloud", "Rho2019", "Test"]

    known_user_query = '''
SELECT user_id
FROM user
WHERE user_name LIKE "{name_pattern}%"
    '''
    
    known_users = set()
    
    for u_pattern in username_patterns:
        new_known = mariadb.run(known_user_query.format(
            name_pattern = u_pattern), wiki)
        known_users = known_users | set(new_known['user_id'])

    return(known_users)
        
for wiki in wikis:
    known_users[wiki] = known_users[wiki] | get_known_users(wiki)

In [3]:
## Controlling the maximum number of rows, columns, and output width used
## by Pandas. Update it with larger values if tables start getting truncated.

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)

## Helper functions

In [4]:
def make_known_users_sql(kd, wiki_column, user_column):
    '''
    Based on the dictionary `kd` mapping wiki names to sets of user IDs of known users,
    create a SQL expression to exclude users based on the name of the wiki matching `wiki_column`
    and the user ID not matching `user_column`
    '''
    
    wiki_exp = '''({w_column} = '{wiki}' AND {u_column} NOT IN ({id_list}))'''
    
    expressions = list()

    ## Iteratively build the expression for each wiki
    for wiki_name, wiki_users in kd.items():
        expressions.append(wiki_exp.format(
            w_column = wiki_column,
            wiki = wiki_name,
            u_column = user_column,
            id_list = ','.join([str(u) for u in wiki_users])
        ))
    
    ## We then join all the expressions with an OR, and we're done.
    return(' OR '.join(expressions))
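
As a usage illustration, the sketch below shows the kind of expression this helper builds for a small, made-up dictionary. The function `known_users_sql_sketch` and the user IDs are hypothetical. Unlike the helper above, it sorts the IDs for stable output and, on the assumption that an empty `NOT IN ()` list may not be accepted by the SQL engine, emits a plain wiki match when a wiki has no known users.

```python
def known_users_sql_sketch(kd, wiki_column, user_column):
    """Hypothetical variant of make_known_users_sql, for illustration only."""
    expressions = []
    for wiki_name, wiki_users in kd.items():
        if wiki_users:
            id_list = ','.join(str(u) for u in sorted(wiki_users))
            expressions.append(
                f"({wiki_column} = '{wiki_name}' AND {user_column} NOT IN ({id_list}))")
        else:
            # Assumption: with no IDs to exclude, keep every user from this wiki.
            expressions.append(f"({wiki_column} = '{wiki_name}')")
    return ' OR '.join(expressions)


# Made-up IDs, not the real exclusion lists.
example = {'cswiki': {14, 3}, 'bnwiki': set()}
sql = known_users_sql_sketch(example, 'wiki', 'event.user_id')
print(sql)
```

Because the per-wiki clauses are joined with `OR`, a wiki that has no clause at all would be filtered out entirely, which is why the sketch keeps a clause for the empty case.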
    

In [5]:
def make_when_then(wiki_list, wiki_column):
    '''
    Take the ordered list of wiki names and turn it into a string
    of "WHEN wiki_column = '{wiki}' THEN '{k}'" where `k` is the index
    of the wiki in the list, so it can be used for ordering results.
    '''

    whens = list()
    
    for k, wiki in enumerate(wiki_list):
        whens.append(f'WHEN {wiki_column} = "{wiki}" THEN "{k:02}"')
    
    ## Join them with line breaks to create the list
    return('\n'.join(whens))
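
One possible way to use these `WHEN ... THEN` lines is inside a `CASE` expression in an `ORDER BY` clause, so that results come back in the order of the wiki list. The sketch below restates the helper under the hypothetical name `when_then_sketch` so it runs on its own; the surrounding `ORDER BY CASE ... END` wrapper is an assumption about how the output is meant to be embedded.

```python
def when_then_sketch(wiki_list, wiki_column):
    """Illustrative restatement of make_when_then."""
    return '\n'.join(
        f'WHEN {wiki_column} = "{wiki}" THEN "{k:02}"'
        for k, wiki in enumerate(wiki_list))


wikis_example = ['arwiki', 'bnwiki', 'cswiki', 'viwiki']
whens = when_then_sketch(wikis_example, 'wiki')

# Assumed embedding: sort rows by the zero-padded index of each wiki.
order_clause = 'ORDER BY CASE\n' + whens + '\nEND'
print(order_clause)
```

Zero-padding the index keeps string ordering consistent with numeric ordering once the list has ten or more wikis.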


In [6]:
def make_partition_statement(start_ts, end_ts, prefix = ''):
    '''
    This takes the two timestamps and creates a statement that selects
    partitions based on `year`, `month`, and `day` in order to make our
    data gathering not use excessive amounts of data. It assumes that
    `start_ts` and `end_ts` are not more than a month apart.
    This assumption simplifies the code and output a lot.
    
    An optional prefix can be set to enable selecting partitions for
    multiple tables with different aliases.
    
    :param start_ts: start timestamp
    :type start_ts: datetime.datetime
    
    :param end_ts: end timestamp
    :type end_ts: datetime.datetime
    
    :param prefix: prefix to use in front of partition clauses, "." is added automatically
    :type prefix: str
    '''
    
    if prefix:
        prefix = f'{prefix}.' # adds "." after the prefix
    
    # there are three cases:
    # 1: month and year are the same, output a "BETWEEN" statement with the days
    # 2: months differ, but the years the same.
    # 3: years differ too.
    # Case #2 and #3 can be combined, because it doesn't really matter
    # if the years are the same in the month-selection or not.
    
    if start_ts.year == end_ts.year and start_ts.month == end_ts.month:
        return(f'''{prefix}year = {start_ts.year}
AND {prefix}month = {start_ts.month}
AND {prefix}day BETWEEN {start_ts.day} AND {end_ts.day}''')
    else:
        return(f'''
(
    ({prefix}year = {start_ts.year}
     AND {prefix}month = {start_ts.month}
     AND {prefix}day >= {start_ts.day})
 OR ({prefix}year = {end_ts.year}
     AND {prefix}month = {end_ts.month}
     AND {prefix}day <= {end_ts.day})
)''')
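
The second branch only covers the start month from `start_ts.day` onward and the end month up to `end_ts.day`, so any whole month in between would be left out. A small guard like the one below could make that assumption explicit before building the statement. The function name `touches_adjacent_months` is hypothetical and not part of the notebook.

```python
import datetime as dt


def touches_adjacent_months(start_ts, end_ts):
    """Hypothetical guard: True if the range stays within one calendar month
    or spans exactly two consecutive ones."""
    start_index = start_ts.year * 12 + start_ts.month
    end_index = end_ts.year * 12 + end_ts.month
    return 0 <= end_index - start_index <= 1


# The experiment window used in this notebook: late May to mid June 2021.
exp_range_ok = touches_adjacent_months(dt.datetime(2021, 5, 27, 19, 12, 3),
                                       dt.datetime(2021, 6, 14))

# A range that skips over June entirely would not be covered correctly.
too_long = touches_adjacent_months(dt.datetime(2021, 5, 27),
                                   dt.datetime(2021, 7, 1))
print(exp_range_ok, too_long)
```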

In [7]:
def round_cell(x, num_decimals = 1):
    '''
    Try converting the value of `x` into a float, then rounding to the specified
    number of decimal places. Used when outputting `pandas.DataFrame` that contain
    columns full of `object` data types. If the value cannot be parsed as a float,
    the value is returned as-is.
    
    :param x: whatever we want to try to round
    :type x: obj
    
    :param num_decimals: the number of decimal places to round to
    :type num_decimals: int
    '''
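    try:
        return(round(float(x), num_decimals))
    except (TypeError, ValueError):
        ## float() raises TypeError for values such as None, ValueError for unparseable strings
        return(x)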

In [8]:
def add_and_sort_wiki_df(wiki_list, df, name_column = 'wiki', value_column = None):
    '''
    Takes the given list of wikis and compares them with the named column
    in the dataframe.
    
    If no value column is defined, it adds one empty row to the dataframe
    for each missing wiki.
    
    If a value column is defined, it identifies the unique values in that column
    and adds them for each wiki.
    
    Once the full new dataframe is completed, all NAs are replaced with 0,
    and the dataframe is sorted by the name and value columns.
    
    :param wiki_list: list of all the wikis we're expecting to have data for
    :type wiki_list: list
    
    :param df: dataframe with data, possibly missing some wikis.
    :type df: pandas.DataFrame
    
    :param name_column: column in the dataframe that contains wiki names
    :type name_column: str
    
    :param value_column: name of a column that holds values we should generate rows for.
    :type value_column: str
    '''

    ## We make a series out of the list of wikis, then use that to create a dataframe
    ## containing the name of any wiki not already in the named column. Then we set
    ## all the missing values to 0, and sort the resulting dataframe by the named column.

    wikis_s = pd.Series(wiki_list)
    
    if value_column is None:
        return(
            pd.concat([
                df,
                pd.DataFrame({name_column : wikis_s.loc[~wikis_s.isin(df[name_column])]})
            ]).fillna(0).sort_values(name_column))
    else:
        # Identify what wikis we're missing, and if we're not missing any just return
        # the existing dataframe
        missing_wikis = wikis_s.loc[~wikis_s.isin(df[name_column])]
        if missing_wikis.empty:
            return(df)
        
        ## From https://stackoverflow.com/a/26977495
        unique_values = pd.unique(df[value_column])

        new_df = pd.concat([
            df,
            pd.concat( # combine the results of the list comprehension
                [
                    pd.DataFrame( # for each element in the list, create a pandas.DataFrame
                        {name_column : [w] * len(unique_values), # repeat the wiki name to match the values
                         value_column : unique_values}) # add all the unique values of the value column
                    for w in wikis_s.loc[~wikis_s.isin(df[name_column])] # do this for every missing wiki
                ]
            )
        ])
        return(new_df.fillna(0).sort_values([name_column, value_column]))

# Revert Rate

The way we've done this previously is to only look at user activity within 24 hours of registration, because that's when most of the visits to the Newcomer Homepage take place. In this case, we want to identify all users who visited the Homepage, clicked on either an Add a Link or unstructured link task, and saved an edit to that page. We'll then want to look at the revert rate, both overall and on average per user.
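
The two views of the revert rate can differ noticeably when a few users make most of the edits. The sketch below uses made-up per-user rows in the shape the revert query returns (`num_edits`, `num_reverts`) to show the difference; the numbers are illustrative only and are not taken from the experiment.

```python
from statistics import mean

# Made-up rows: (user_id, num_edits, num_reverts).
rows = [(1, 10, 1), (2, 2, 1), (3, 8, 0)]

# Overall rate: all reverts divided by all edits, so prolific users dominate.
overall_rate = 100 * sum(r for _, _, r in rows) / sum(e for _, e, _ in rows)

# Per-user average: each user's own rate counts equally.
per_user_rate = mean(100 * r / e for _, e, r in rows)

print(overall_rate, per_user_rate)
```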

In [9]:
revert_query = '''
WITH hp_visits AS (
    SELECT
        hpv.wiki,
        hpv.event.user_id,
        hpv.event.homepage_pageview_token,
        hpv.dt AS event_dt
    FROM event.homepagevisit AS hpv
    WHERE {partition_statement}
    AND wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    AND dt >= "{start_ts}" AND dt < "{end_ts}"
),
newcomer_tasks AS (
-- grab unique task token/task type data from newcomer tasks
    SELECT
        DISTINCT event.newcomer_task_token, event.task_type, event.page_id
    FROM event.newcomertask
    WHERE {partition_statement}
    AND event.task_type IN ("links", "link-recommendation")
),
homepage_task_clicks AS (
    -- clicks to tasks in sessions found in hp_visits
    SELECT
        hpm.wiki,
        hpm.event.user_id,
        hpm.dt AS event_dt,
        str_to_map(hpm.event.action_data, ";", "=") AS action_data
    FROM event.homepagemodule AS hpm
    JOIN hp_visits
    ON hpm.event.homepage_pageview_token = hp_visits.homepage_pageview_token
    WHERE {partition_statement}
    AND event.action = "se-task-click"
    AND dt >= "{start_ts}" AND dt < "{end_ts}"
    AND dt > hp_visits.event_dt
),
postedit_task_clicks AS (
    -- clicks to tasks done after saving an edit
    SELECT
        hp.wiki,
        hp.event.user_id,
        hp.dt AS event_dt,
        str_to_map(hp.event.action_data, ";", "=") AS action_data
    FROM event.helppanel AS hp
    WHERE {partition_statement}
    AND wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    AND event.action = "postedit-task-click"
    AND dt >= "{start_ts}" AND dt < "{end_ts}"

),
link_task_clicks AS (
-- filter task_clicks down to those on links and link recommendations
    SELECT
        task_clicks.wiki,
        task_clicks.user_id,
        task_clicks.event_dt,
        newcomer_tasks.page_id,
        newcomer_tasks.task_type,
        LEAD(task_clicks.event_dt, 1) OVER
            (PARTITION BY task_clicks.wiki, task_clicks.user_id, newcomer_tasks.page_id
             ORDER BY task_clicks.event_dt) AS next_click_dt
    FROM (
        SELECT
            *
        FROM homepage_task_clicks
        UNION ALL
        SELECT
            *
        FROM postedit_task_clicks) AS task_clicks
    JOIN newcomer_tasks
    ON task_clicks.action_data["newcomerTaskToken"] = newcomer_tasks.newcomer_task_token
),
edits AS (
-- edits and reverts (within 48 hours) of newcomer tasks
    SELECT
        `database` AS wiki,
        rev_id,
        FIRST_VALUE(page_id) AS page_id,
        FIRST_VALUE(performer.user_id) AS user_id,
        FIRST_VALUE(rev_timestamp) AS rev_timestamp,
        MAX(IF(array_contains(tags, 'mw-reverted') AND
               (unix_timestamp(meta.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
                unix_timestamp(rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") < 60*60*48), 1, 0)) AS was_reverted
    FROM event_sanitized.mediawiki_revision_tags_change
    WHERE {partition_statement}
    AND `database` IN ({wiki_list})
    AND ({known_user_database_expression})
    AND array_contains(tags, "newcomer task")
    GROUP BY wiki, rev_id
)
SELECT
    link_task_clicks.wiki,
    link_task_clicks.user_id,
    link_task_clicks.task_type,
    COUNT(1) AS num_edits,
    SUM(edits.was_reverted) AS num_reverts
FROM link_task_clicks
JOIN edits
ON link_task_clicks.wiki = edits.wiki
AND link_task_clicks.user_id = edits.user_id
AND link_task_clicks.page_id = edits.page_id
WHERE (link_task_clicks.next_click_dt IS NULL
       OR link_task_clicks.event_dt != link_task_clicks.next_click_dt) -- removing duplicates
AND edits.rev_timestamp > link_task_clicks.event_dt
AND (
        (link_task_clicks.next_click_dt IS NOT NULL
         AND unix_timestamp(edits.rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") <
             unix_timestamp(link_task_clicks.next_click_dt, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
    OR
        (unix_timestamp(edits.rev_timestamp, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
         unix_timestamp(link_task_clicks.event_dt, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'") < 60*60*24*7)
    )
GROUP BY link_task_clicks.wiki, link_task_clicks.user_id, link_task_clicks.task_type
'''

In [10]:
link_task_edits_data = spark.run(
    revert_query.format(
        start_ts = exp_start_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        end_ts = exp_end_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        known_user_id_expression = make_known_users_sql(known_users, 'wiki', 'event.user_id'),
        known_userid_expression = make_known_users_sql(known_users, 'wiki', 'event.userid'),
        known_user_database_expression = make_known_users_sql(known_users,
                                                              '`database`', 'performer.user_id'),
        partition_statement = make_partition_statement(exp_start_ts, exp_end_ts)
    )
)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [None]:
link_task_edits_data

Number of task edits, reverts, and revert rate in our dataset:

In [11]:
link_tasks_agg = link_task_edits_data.groupby('task_type').agg({'num_edits' : 'sum', 'num_reverts' : 'sum'})
link_tasks_agg['revert_rate'] = 100 * link_tasks_agg['num_reverts'] / link_tasks_agg['num_edits']
link_tasks_agg.round(1)

| task_type | num_edits | num_reverts | revert_rate |
|---|---|---|---|
| link-recommendation | 1248 | 77 | 6.2 |
| links | 88 | 23 | 26.1 |


In [None]:
link_task_edits_data.sort_values('num_edits', ascending = False).head(12)

In [13]:
stats.chi2_contingency(link_tasks_agg[['num_edits', 'num_reverts']])

(32.87686979941189,
 9.818454907000372e-09,
 1,
 array([[1232.72980501,   92.27019499],
        [ 103.27019499,    7.72980501]]))
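
The call above passes `num_edits` and `num_reverts` as the two columns, so reverted edits are counted in both. A more conventional contingency table would split each task type into non-reverted and reverted edits. The sketch below computes Pearson's chi-square statistic for that table by hand, using the counts from the aggregate table above. It applies no continuity correction, so it is not directly comparable to the `chi2_contingency` output, and the variable names are illustrative.

```python
import math

# Counts from the aggregate table: (num_edits, num_reverts) per task type.
counts = {'link-recommendation': (1248, 77), 'links': (88, 23)}

# Rows: task type; columns: (not reverted, reverted).
table = [[edits - reverts, reverts] for edits, reverts in counts.values()]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, and the upper tail of a chi-square
# distribution with 1 degree of freedom equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 1), p_value)
```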

## Overall statistics

For the slide deck, for each wiki, aggregate the number of edits and editors.

In [14]:
overall_agg = (link_task_edits_data.groupby(['wiki', 'task_type'])
               .agg({'num_edits' : 'sum', 'num_reverts' : 'sum', 'user_id' : 'count'})
               .rename(columns = {'user_id' : 'num_editors'}))
overall_agg['revert_rate'] = 100 * overall_agg['num_reverts'] / overall_agg['num_edits']
overall_agg.round(1)

| wiki | task_type | num_edits | num_reverts | num_editors | revert_rate |
|---|---|---|---|---|---|
| arwiki | link-recommendation | 884 | 71 | 94 | 8.0 |
| arwiki | links | 51 | 13 | 17 | 25.5 |
| bnwiki | link-recommendation | 83 | 2 | 18 | 2.4 |
| bnwiki | links | 10 | 1 | 6 | 10.0 |
| cswiki | link-recommendation | 206 | 3 | 22 | 1.5 |
| cswiki | links | 7 | 1 | 5 | 14.3 |
| viwiki | link-recommendation | 75 | 1 | 24 | 1.3 |
| viwiki | links | 20 | 8 | 8 | 40.0 |


# Rejection Rate

Q: Do we count all rejections, or only those that are part of saved edits? Do we count all edit sessions, or only those made by new accounts?

A: We'll count edit sessions from all users, because in this case experienced users are going to be better able to determine if a suggested link is appropriate. We will, however, only count saved edits, because we want to ignore users loading up the editor and testing out Add a Link.

To make things easier, we'll use the structured task schema as our source of data, because it has an `editsummary_save` event and lets us rely on instrumented events.

The number of accepted, rejected, and skipped links is also saved in the edit summary of each edit. That data is, however, localized to each wiki, so I'm not going to spend time digging those numbers out.

### Notes

There is the possibility that a user gets to the edit summary screen and then chooses to go back and make changes. We'll treat this as two separate end states and count both. We do this for two reasons: we expect this to happen relatively rarely, and there isn't anything wrong with going back and changing your mind.

For the `skipall_dialog`, we'll need to go fetch their initial impression of the interface, because that's where the number of suggestions is stored.
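
The rejection rate query reads these counts from the `action_data` string using `str_to_map(action_data, ";", "=")`. As a rough Python analogue of that parsing step, the sketch below splits a payload into key/value pairs. The function name `parse_action_data` and the sample payload are hypothetical; only the key names are taken from the query.

```python
def parse_action_data(action_data, pair_delim=';', kv_delim='='):
    """Rough Python analogue of str_to_map(action_data, ";", "=")."""
    result = {}
    for pair in action_data.split(pair_delim):
        if not pair:
            continue  # ignore empty segments, e.g. from a trailing delimiter
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result


# Hypothetical payload; values stay strings until cast, as in the query.
sample = 'accepted_count=3;rejected_count=1;skipped_count=2'
rate_map = parse_action_data(sample)
total = sum(int(rate_map[k])
            for k in ('accepted_count', 'rejected_count', 'skipped_count'))
print(rate_map, total)
```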

In [15]:
rejection_rate_query = '''
WITH saved_edits AS (
    SELECT
        coalesce(lsi.dt, lsi.meta.dt) AS event_dt,
        lsi.homepage_pageview_token,
        hpv.wiki,
        hpv.event.user_id
    FROM event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    JOIN event.homepagevisit AS hpv
    ON lsi.homepage_pageview_token = hpv.event.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND {hpv_partition_statement}
    AND hpv.wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    AND lsi.action = "editsummary_save"
    AND coalesce(lsi.dt, lsi.meta.dt) >= "{start_ts}" AND coalesce(lsi.dt, lsi.meta.dt) < "{end_ts}"
),
saved_rates AS ( -- grab accept/reject/skips for saved edits
    -- str_to_map() first splits on ";" (the pair delimiter),
    -- then splits each key/value pair on "=" (the key/value delimiter)
    SELECT
        lsi.homepage_pageview_token,
        saved_edits.user_id,
        saved_edits.wiki,
        coalesce(lsi.dt, lsi.meta.dt) as event_dt,
        str_to_map(lsi.action_data, ";", "=") AS rate_map
    FROM saved_edits
    JOIN event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    ON saved_edits.homepage_pageview_token = lsi.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND lsi.action = "impression"
    AND lsi.active_interface = "editsummary_dialog"
    AND coalesce(lsi.dt, lsi.meta.dt) >= "{start_ts}"
    AND coalesce(lsi.dt, lsi.meta.dt) < "{end_ts}"
    AND coalesce(lsi.dt, lsi.meta.dt) < saved_edits.event_dt
),
skip_alls AS (
    SELECT
        coalesce(lsi.dt, lsi.meta.dt) AS event_dt,
        lsi.homepage_pageview_token,
        hpv.wiki,
        hpv.event.user_id
    FROM event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    JOIN event.homepagevisit AS hpv
    ON lsi.homepage_pageview_token = hpv.event.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND {hpv_partition_statement}
    AND hpv.wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    AND lsi.active_interface = "skipall_dialog"
    AND lsi.action = "confirm_skip_all_suggestions"
    AND coalesce(lsi.dt, lsi.meta.dt) >= "{start_ts}" AND coalesce(lsi.dt, lsi.meta.dt) < "{end_ts}"
),
skip_all_rates AS ( -- grab the suggestion count from start of the sessions
    SELECT
        lsi.homepage_pageview_token,
        skip_alls.user_id,
        skip_alls.wiki,
        coalesce(lsi.dt, lsi.meta.dt) as event_dt,
        -- building a map similar to what we have for saved edits
        map("accepted_count", "0", "rejected_count", "0",
            "skipped_count",
            str_to_map(lsi.action_data, ";", "=")["number_phrases_found"]) AS rate_map
    FROM skip_alls
    JOIN event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    ON skip_alls.homepage_pageview_token = lsi.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND lsi.action = "impression"
    AND lsi.active_interface = "machinesuggestions_mode"
    AND coalesce(lsi.dt, lsi.meta.dt) >= "{start_ts}"
    AND coalesce(lsi.dt, lsi.meta.dt) < "{end_ts}"
    AND coalesce(lsi.dt, lsi.meta.dt) < skip_alls.event_dt
),
rates_cast AS (
    SELECT
        wiki,
        user_id,
        homepage_pageview_token,
        CAST(rate_map['accepted_count'] AS INT) AS accepted_count,
        CAST(rate_map['rejected_count'] AS INT) AS rejected_count,
        CAST(rate_map['skipped_count'] AS INT) AS skipped_count
    FROM
        (SELECT
             *
         FROM saved_rates
         UNION ALL
         SELECT
             *
         FROM skip_all_rates) AS r
)
SELECT
    wiki,
    user_id,
    SUM(accepted_count) AS num_links_accepted,
    SUM(rejected_count) AS num_links_rejected,
    SUM(skipped_count) AS num_links_skipped,
    SUM(accepted_count + rejected_count + skipped_count) AS num_links_recommended,
    COUNT(1) AS num_edit_sessions,
    COUNT(IF(accepted_count = accepted_count + rejected_count + skipped_count, 1, NULL)) AS num_all_accepted,
    COUNT(IF(rejected_count = accepted_count + rejected_count + skipped_count, 1, NULL)) AS num_all_rejected,
    COUNT(IF(skipped_count = accepted_count + rejected_count + skipped_count, 1, NULL)) AS num_all_skipped
FROM rates_cast
GROUP BY wiki, user_id
'''

In [16]:
rate_data = spark.run(
    rejection_rate_query.format(
        start_ts = exp_start_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        end_ts = exp_end_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        known_user_id_expression = make_known_users_sql(known_users, 'wiki', 'hpv.event.user_id'),
        lsi_partition_statement = make_partition_statement(exp_start_ts, exp_end_ts, prefix = 'lsi'),
        hpv_partition_statement = make_partition_statement(exp_start_ts, exp_end_ts, prefix = 'hpv')
    )
)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [None]:
rate_data.loc[rate_data['wiki'] == 'arwiki'].sort_values('num_edit_sessions')

Now we can easily calculate the overall acceptance, rejection, and skip rate:

In [17]:
rate_data['num_links_accepted'].sum()

2061

In [18]:
round(100 * rate_data['num_links_accepted'].sum() / rate_data['num_links_recommended'].sum(), 1)

67.1

In [19]:
rate_data['num_links_rejected'].sum()

720

In [20]:
round(100 * rate_data['num_links_rejected'].sum() / rate_data['num_links_recommended'].sum(), 1)

23.4

In [21]:
rate_data['num_links_skipped'].sum() 

292

In [22]:
round(100 * rate_data['num_links_skipped'].sum() / rate_data['num_links_recommended'].sum(), 1)

9.5

Using this data, we can also calculate whether a user accepts all links or not.

In [23]:
len(rate_data)

160

In [24]:
len(rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted']])

41

In [25]:
round(100 *
     len(rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted']]) /
     len(rate_data), 1)

25.6

So about 26% of users only have edit sessions where they accepted all the recommended links.

In [26]:
rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'num_edit_sessions'].sum()

58

In [27]:
rate_data['num_edit_sessions'].sum()

676

In [28]:
round(100 *
     rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'num_edit_sessions'].sum() /
     rate_data['num_edit_sessions'].sum(), 1)

8.6

These users are not very active, as their edit sessions only make up 8.6% of the total sessions.

If we remove these users from the accept/reject/skip proportions, the new values become as follows.

In [29]:
round(100 *
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_accepted'].sum() /
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_recommended'].sum(), 1)

65.0

In [30]:
round(100 *
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_rejected'].sum() /
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_recommended'].sum(), 1)

24.9

In [31]:
round(100 *
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_skipped'].sum() /
    rate_data.loc[
        ~rate_data['user_id'].isin(
            rate_data.loc[rate_data['num_edit_sessions'] == rate_data['num_all_accepted'], 'user_id']),
    'num_links_recommended'].sum(), 1)

10.1

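In both cases, the proportion of recommendations that actually get rejected is below the 30% threshold (23.4% across all users, 24.9% when excluding users who accepted every recommendation). This is partly due to the proportion of recommendations that are skipped rather than rejected (9.5% across all users, 10.1% when excluding users who accepted every recommendation).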
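
The three proportions, with and without the all-accepting users, follow the same pattern and could be computed with one helper. The sketch below does this in plain Python on made-up rows that reuse the column names of `rate_data`; the function `link_rates` and the numbers are illustrative and are not the experiment's data.

```python
def link_rates(rows):
    """Return accepted/rejected/skipped percentages of all recommended links."""
    total = sum(r['num_links_recommended'] for r in rows)
    return {
        key: round(100 * sum(r[f'num_links_{key}'] for r in rows) / total, 1)
        for key in ('accepted', 'rejected', 'skipped')
    }


# Made-up per-user rows.
rows = [
    {'user_id': 1, 'num_links_accepted': 4, 'num_links_rejected': 0,
     'num_links_skipped': 0, 'num_links_recommended': 4,
     'num_edit_sessions': 1, 'num_all_accepted': 1},
    {'user_id': 2, 'num_links_accepted': 6, 'num_links_rejected': 3,
     'num_links_skipped': 1, 'num_links_recommended': 10,
     'num_edit_sessions': 3, 'num_all_accepted': 1},
    {'user_id': 3, 'num_links_accepted': 2, 'num_links_rejected': 3,
     'num_links_skipped': 1, 'num_links_recommended': 6,
     'num_edit_sessions': 2, 'num_all_accepted': 0},
]

# Drop users whose every session accepted all recommended links.
others = [r for r in rows if r['num_edit_sessions'] != r['num_all_accepted']]

all_rates = link_rates(rows)
filtered_rates = link_rates(others)
print(all_rates, filtered_rates)
```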

## Restricting it to >= 5 Edit Sessions

Because most of the users who only accepted links had only a few sessions, what happens when we restrict it to those who had at least 5 sessions?

In [32]:
len(rate_data.loc[(rate_data['num_edit_sessions'] == rate_data['num_all_accepted']) &
                  (rate_data['num_edit_sessions'] >= 5)])

1

How many users made at least 5 edit sessions?

In [33]:
len(rate_data.loc[(rate_data['num_edit_sessions'] >= 5)])

25

In [None]:
rate_data.loc[(rate_data['num_edit_sessions'] == rate_data['num_all_accepted']) &
              (rate_data['num_edit_sessions'] >= 5)]

In [35]:
round(100 *
     len(rate_data.loc[(rate_data['num_edit_sessions'] == rate_data['num_all_accepted']) &
                  (rate_data['num_edit_sessions'] >= 5)]) /
     len(rate_data), 1)

0.6

In [36]:
round(100 *
     rate_data['num_all_accepted'].sum() /
     rate_data['num_edit_sessions'].sum(), 1)

43.5

# Task Completion Rate

Out of all users who start an Add a Link task, what proportion completes it?

We define the start of an Add a Link task session as the impression of the "machine suggestions" mode in VisualEditor. Completion is defined as either clicking to save the edit in the edit summary dialog, as before, or confirming that all suggestions were skipped. We've confirmed that if a user rejects all suggestions and clicks "Submit" on the edit summary, the event logged is the same as if they had accepted any recommended links.
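
To make the definition concrete, the sketch below classifies sessions from a small, made-up event log. The interface and action names are the ones used in the queries in this notebook; the tokens, timestamps, and variable names are hypothetical. It works per session token, whereas the query that follows aggregates to one row per user.

```python
# Hypothetical event log: (homepage_pageview_token, timestamp, active_interface, action).
events = [
    ('tok-a', 1, 'machinesuggestions_mode', 'impression'),
    ('tok-a', 5, 'editsummary_dialog', 'editsummary_save'),
    ('tok-b', 2, 'machinesuggestions_mode', 'impression'),
    ('tok-c', 3, 'machinesuggestions_mode', 'impression'),
    ('tok-c', 9, 'skipall_dialog', 'confirm_skip_all_suggestions'),
]

START = ('machinesuggestions_mode', 'impression')
COMPLETIONS = {
    ('editsummary_dialog', 'editsummary_save'),
    ('skipall_dialog', 'confirm_skip_all_suggestions'),
}

# token -> earliest timestamp at which the task was started
started = {}
for token, ts, interface, action in events:
    if (interface, action) == START:
        started[token] = min(ts, started.get(token, ts))

# A session is complete if a completion event follows its start.
completed = {
    token for token, ts, interface, action in events
    if (interface, action) in COMPLETIONS
    and token in started and ts > started[token]
}

completion_rate = round(100 * len(completed) / len(started), 1)
print(sorted(completed), completion_rate)
```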

Q: Should we limit this to users who signed up during the experiment, in order to filter out potentially experienced users who are just trying out the interface? What if more experienced users are more likely to complete the session?

In [37]:
task_completion_query = '''
WITH task_init AS (
    -- select sessions where the user started
    SELECT
        coalesce(lsi.dt, lsi.meta.dt) AS event_dt,
        lsi.homepage_pageview_token,
        hpv.wiki,
        hpv.event.user_id
    FROM event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    JOIN event.homepagevisit AS hpv
    ON lsi.homepage_pageview_token = hpv.event.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND {hpv_partition_statement}
    AND hpv.wiki IN ({wiki_list})
    AND ({known_user_id_expression})
    AND lsi.active_interface = "machinesuggestions_mode"
    AND lsi.action = "impression"
    AND coalesce(lsi.dt, lsi.meta.dt) >= "{start_ts}" AND coalesce(lsi.dt, lsi.meta.dt) < "{end_ts}"
),
task_completion AS (
    -- session where the user clicked "save"
    SELECT
        coalesce(lsi.dt, lsi.meta.dt) AS event_dt,
        lsi.homepage_pageview_token,
        1 AS saved_edit
    FROM event.mediawiki_structured_task_article_link_suggestion_interaction AS lsi
    JOIN task_init
    ON lsi.homepage_pageview_token = task_init.homepage_pageview_token
    WHERE {lsi_partition_statement}
    AND (
            (lsi.active_interface = "editsummary_dialog"
             AND lsi.action = "editsummary_save")
        OR
            (lsi.active_interface = "skipall_dialog"
             AND lsi.action = "confirm_skip_all_suggestions")
        )
    AND coalesce(lsi.dt, lsi.meta.dt) > task_init.event_dt
)
SELECT
    task_init.wiki,
    task_init.user_id,
    1 AS started_task,
    MAX(coalesce(task_completion.saved_edit, 0)) AS completed_task
FROM task_init
LEFT JOIN task_completion
ON task_init.homepage_pageview_token = task_completion.homepage_pageview_token
GROUP BY task_init.wiki, task_init.user_id
'''

In [38]:
task_completion_data = spark.run(
    task_completion_query.format(
        start_ts = exp_start_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        end_ts = exp_end_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        known_user_id_expression = make_known_users_sql(known_users, 'wiki', 'hpv.event.user_id'),
        lsi_partition_statement = make_partition_statement(exp_start_ts, exp_end_ts, prefix = 'lsi'),
        hpv_partition_statement = make_partition_statement(exp_start_ts, exp_end_ts, prefix = 'hpv')
    )
)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [None]:
task_completion_data.head()

Number of users who started a task:

In [39]:
len(task_completion_data)

279

Number of users who completed at least one task:

In [40]:
len(task_completion_data.loc[task_completion_data['completed_task'] == 1])

160

Proportion of users who completed a task, out of those who started one:

In [41]:
round(100 *
     len(task_completion_data.loc[task_completion_data['completed_task'] == 1]) /
     len(task_completion_data), 1)

57.3

# Clarifying Questions

What's the number of Add a Link edits in total?

In [42]:
add_link_tag_query = '''
SELECT
    `database` AS wiki,
    COUNT(DISTINCT rev_id) AS num_edits
    FROM event_sanitized.mediawiki_revision_tags_change
    WHERE {partition_statement}
    AND `database` IN ({wiki_list})
    AND ({known_user_database_expression})
    AND array_contains(tags, "newcomer task add link")
    AND rev_timestamp >= "{start_ts}" AND rev_timestamp < "{end_ts}"
    GROUP BY `database`
'''

In [43]:
tagged_edit_counts = spark.run(
    add_link_tag_query.format(
        start_ts = exp_start_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        end_ts = exp_end_ts.strftime('%Y-%m-%dT%H:%M:%S'),
        wiki_list = ','.join(['"{}"'.format(w) for w in wikis]),
        known_user_database_expression = make_known_users_sql(known_users, '`database`', 'performer.user_id'),
        partition_statement = make_partition_statement(exp_start_ts, exp_end_ts)
    )
)        

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [None]:
tagged_edit_counts

In [44]:
tagged_edit_counts['num_edits'].sum()

1255