# Visual Editor Media Search Editing Funnel

This notebook is meant to be run daily and updates a staging table in the Data Lake with aggregate statistics on the media funnel in Visual Editor (VE). It expects all relevant data for the day being aggregated to be present, so it should be run at least four hours after midnight UTC (we know that it takes 2–3 hours for data to arrive). Specifically, the goal is to answer what percentage of searches for media in VE lead to the subsequent addition of a media file to the article.

In [1]:
import json
import datetime as dt

from collections import defaultdict

import numpy as np
import pandas as pd

from wmfdata import spark, mariadb
from growth import utils

## Configuration Variables

In [2]:
# Name of the table in the Data Lake

table_name = 'nettrom_sd.ve_media_funnel_aggregates'

# Values of the `action` field for opening and closing the media dialog for the
# two paths.
dialog_actions = {
    'add media' : {
        'open' : 'window-open-from-tool',
        'close' : 'dialog-insert'
    },
    'edit media' : {
        'open' : 'window-open-from-context',
        'close' : 'dialog-done'
    }
}

## Helper Functions

In [3]:
def make_partition_statement(start_ts, end_ts, prefix = ''):
    '''
    Take two timestamps and create a statement that selects partitions
    based on `year`, `month`, and `day`, so that our queries do not scan
    excessive amounts of data. It assumes that `start_ts` and `end_ts`
    fall in the same or in adjacent calendar months (any month in between
    would not be selected), which should be a reasonable expectation for
    this notebook.
    
    An optional prefix can be set to enable selecting partitions for
    multiple tables with different aliases.
    
    :param start_ts: start timestamp
    :type start_ts: datetime.datetime
    
    :param end_ts: end timestamp
    :type end_ts: datetime.datetime
    
    :param prefix: prefix to use in front of partition clauses, "." is added automatically
    :type prefix: str
    
    :returns: SQL predicate on the `year`, `month`, and `day` partition columns
    :rtype: str
    '''
    
    if prefix:
        prefix = f'{prefix}.' # adds "." after the prefix
    
    # there are three cases:
    # 1: month and year are the same, output a "BETWEEN" statement with the days
    # 2: months differ, but the years are the same.
    # 3: years differ too.
    # Case #2 and #3 can be combined, because it doesn't really matter
    # if the years are the same in the month-selection or not.
    
    if start_ts.year == end_ts.year and start_ts.month == end_ts.month:
        return(f'''{prefix}year = {start_ts.year}
AND {prefix}month = {start_ts.month}
AND {prefix}day BETWEEN {start_ts.day} AND {end_ts.day}''')
    else:
        return(f'''
(
    ({prefix}year = {start_ts.year}
     AND {prefix}month = {start_ts.month}
     AND {prefix}day >= {start_ts.day})
 OR ({prefix}year = {end_ts.year}
     AND {prefix}month = {end_ts.month}
     AND {prefix}day <= {end_ts.day})
)''')
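
To see what the two branches produce, here is a minimal sketch. It uses a condensed, single-line restatement of the helper so that the example is self-contained; the name `partition_clause` and the dates are hypothetical and chosen only for illustration.

```python
import datetime as dt

def partition_clause(start_ts, end_ts, prefix=''):
    # Condensed restatement of make_partition_statement, for illustration only.
    p = f'{prefix}.' if prefix else ''
    if (start_ts.year, start_ts.month) == (end_ts.year, end_ts.month):
        # Same year and month: a single BETWEEN on the day partition.
        return (f'{p}year = {start_ts.year} AND {p}month = {start_ts.month} '
                f'AND {p}day BETWEEN {start_ts.day} AND {end_ts.day}')
    # Different months (or years): tail of the first month OR head of the second.
    return (f'(({p}year = {start_ts.year} AND {p}month = {start_ts.month} '
            f'AND {p}day >= {start_ts.day}) '
            f'OR ({p}year = {end_ts.year} AND {p}month = {end_ts.month} '
            f'AND {p}day <= {end_ts.day}))')

same_month = partition_clause(dt.date(2021, 2, 27), dt.date(2021, 2, 28), 'es')
cross_month = partition_clause(dt.date(2021, 2, 28), dt.date(2021, 3, 1), 'vefu')

print(same_month)
print(cross_month)
```

The cross-month case is the one the daily run hits on the first day of each month, when "yesterday" and "today" fall in different months.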

## Statistics Needs

What exactly does the funnel look like?

* Starts editing.
* Opens the media dialog, either to add an image or to edit an existing one.
* Searches for an image, or uploads an image.
* Confirms the image.
* Closes the dialog, either adding the image or replacing an existing one.
* Saves the edit.

Note that the main thing we're focused on is the use of MediaSearch to *search* for images, and then whether that search leads to an image being *used* (either added to the article, or replacing an existing image).

I'm going to interpret that to mean that we're not focused on the uploading path, so we'll ignore it. I want to treat "add an image" and "replace an image" as two distinct paths. Again, the focus here is on whether the image is being used (i.e. "was the search successful?"), so it doesn't really matter that a user can both add and edit images in the same edit session.

What do we need to store?

* Date
* Namespace
* The path taken: "add media" or "edit media"
* Step 1: number of edit sessions.
* Step 2: number of sessions that open the media dialog.
* Step 3: number of sessions that search for an image.
* Step 4: number of sessions that confirm an image.
* Step 5: number of sessions that close the dialog.
* Step 6: number of sessions that save the edit.
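
As a rough sketch of how these stored counts can answer the question of what share of searches lead to an image being used, the example below computes two rates from a single aggregate row. The column names follow the aggregate query in this notebook; the counts are made up for illustration and are not real data. Whether "used" is best measured at step 5 (dialog closed) or step 6 (edit saved) is a judgment call, so both are shown.

```python
# Made-up counts for a single (date, namespace, path) row; not real data.
row = {
    'num_edit_sessions': 1000,
    'num_dialog_opens': 120,
    'num_media_searches': 80,
    'num_media_confirms': 40,
    'num_dialog_close': 36,
    'num_edit_saves': 30,
}

def share_of_searches(row, column):
    """Return row[column] as a percentage of sessions that searched for media."""
    searches = row['num_media_searches']
    if searches == 0:
        return None  # no searches in this row, so the rate is undefined
    return 100 * row[column] / searches

used_pct = share_of_searches(row, 'num_dialog_close')
saved_pct = share_of_searches(row, 'num_edit_saves')

print(f'{used_pct:.1f}% of search sessions closed the dialog with an image')
print(f'{saved_pct:.1f}% of search sessions went on to save the edit')
```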

## Table Creation

This is for reference, reflecting the `CREATE TABLE` statement used to create the dataset.

## Define Dates and Timestamp Limit

We'll grab today's date, figure out what yesterday was (the day we're grabbing data for), and set a limit to one hour after midnight today. All in UTC.

In [7]:
today = dt.datetime.now(dt.timezone.utc).date()
yesterday = today - dt.timedelta(days = 1)

limit_timestamp = dt.datetime.combine(today, dt.time(hour = 1))
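
To make the resulting window concrete, here is a small sketch that wraps the same date arithmetic in a function so it can be checked against a fixed point in time. The function name `aggregation_window` and the example run time are hypothetical.

```python
import datetime as dt

def aggregation_window(now):
    """Return (yesterday, today, limit_timestamp) for a timezone-aware `now`."""
    today = now.astimezone(dt.timezone.utc).date()
    yesterday = today - dt.timedelta(days=1)
    # Sessions started yesterday may still complete until 01:00 UTC today.
    limit_timestamp = dt.datetime.combine(today, dt.time(hour=1))
    return yesterday, today, limit_timestamp

# Example: a run at 04:30 UTC on 1 March 2021.
now = dt.datetime(2021, 3, 1, 4, 30, tzinfo=dt.timezone.utc)
yesterday, today, limit_ts = aggregation_window(now)

print(yesterday)             # day being aggregated
print(today)                 # exclusive upper bound for session starts
print(limit_ts.isoformat())  # upper bound for the later funnel steps
```

For this run time, sessions must start on 2021-02-28 and may complete any time before `2021-03-01T01:00:00`.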

## Funnel Query

This query is the same for both paths, but the "open media dialog" and "close media dialog" actions differ. In the "add media" path the open action is `window-open-from-tool` and the close action is `dialog-insert`; in the "edit media" path they are `window-open-from-context` and `dialog-done`, respectively.

We use partitions and timestamps to limit when edit sessions were initiated to only count edit sessions that started within the day we're aggregating over. We allow the sessions to be completed within one hour *after* the end of this day.

We originally used differences in the values in `dt` to determine order in the funnel. This turned out not to work as desired, because too many events appear to occur quickly enough to have the same timestamp. There's a separate notebook that investigates this in more detail. The conclusion of that analysis is that we cannot require the difference between two events to be greater than 0 at any point in the funnel. We do, however, require it to be non-negative, because a later funnel step cannot precede an earlier one.
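
The effect of this choice can be illustrated with a toy session in plain Python. This is only a sketch of the ordering rule, not the query itself: the function `funnel_depth` and the event list are hypothetical, with the confirm and insert events deliberately sharing a timestamp.

```python
def funnel_depth(events, steps, strict=False):
    """Count how many consecutive funnel steps a session reaches.

    events: list of (action, timestamp) pairs with ISO 8601 timestamps.
    strict: if True, require each step to be strictly later than the previous one.
    """
    prev_ts = None
    depth = 0
    for step in steps:
        candidates = [
            ts for action, ts in events
            if action == step
            and (prev_ts is None or (ts > prev_ts if strict else ts >= prev_ts))
        ]
        if not candidates:
            break
        prev_ts = min(candidates)  # earliest qualifying event, like MIN(dt)
        depth += 1
    return depth

steps = ['init', 'window-open-from-tool', 'search-change-query',
         'search-confirm-image', 'dialog-insert', 'saveSuccess']

# Made-up session: confirm and insert are logged with the same timestamp.
events = [
    ('init', '2021-02-27T10:00:00Z'),
    ('window-open-from-tool', '2021-02-27T10:00:05Z'),
    ('search-change-query', '2021-02-27T10:00:09Z'),
    ('search-confirm-image', '2021-02-27T10:00:12Z'),
    ('dialog-insert', '2021-02-27T10:00:12Z'),
    ('saveSuccess', '2021-02-27T10:01:00Z'),
]

print(funnel_depth(events, steps))               # non-negative difference allowed
print(funnel_depth(events, steps, strict=True))  # strictly positive difference required
```

With the non-strict comparison the session completes all six steps; with the strict comparison it drops out at the dialog close, even though the image was inserted.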

In [5]:
edit_funnel_query = '''
WITH step_1 AS ( -- Number of VE edit sessions
    SELECT
        event.editing_session_id,
        FIRST_VALUE(event.page_ns) AS namespace,
        MIN(dt) AS dt
    FROM event.editattemptstep AS es
    WHERE {es_partition_statement}
    AND dt >= "{start_date}"
    AND dt < "{end_date}"
    AND event.is_oversample = false
    AND event.editor_interface = "visualeditor"
    AND event.action = "init"
    GROUP BY event.editing_session_id
),
step_2 AS ( -- Open the media dialog
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_1
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_1.editing_session_id = vefu.event.editingsessionid
    WHERE {vefu_partition_statement}
    AND vefu.event.feature = "media"
    AND vefu.event.action = "{open_action}"
    AND vefu.dt >= step_1.dt
    AND vefu.dt < "{time_limit_ts}"
    GROUP BY vefu.event.editingsessionid
),
step_3 AS ( -- Search for media
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_2
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_2.editing_session_id = vefu.event.editingsessionid
    WHERE {vefu_partition_statement}
    AND vefu.event.feature = "media"
    AND vefu.event.action = "search-change-query"
    AND vefu.dt >= step_2.dt
    AND vefu.dt < "{time_limit_ts}"
    GROUP BY vefu.event.editingsessionid
),
step_4 AS ( -- Confirm a search result
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_3
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_3.editing_session_id = vefu.event.editingsessionid
    WHERE {vefu_partition_statement}
    AND vefu.event.feature = "media"
    AND vefu.event.action = "search-confirm-image"
    AND vefu.dt >= step_3.dt
    AND vefu.dt < "{time_limit_ts}"
    GROUP BY vefu.event.editingsessionid
),
step_5 AS ( -- Close the media dialog
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_4
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_4.editing_session_id = vefu.event.editingsessionid
    WHERE {vefu_partition_statement}
    AND vefu.event.feature = "media"
    AND vefu.event.action = "{close_action}"
    AND vefu.dt >= step_4.dt
    AND vefu.dt < "{time_limit_ts}"
    GROUP BY vefu.event.editingsessionid
),
step_6 AS ( -- Save the edit
    SELECT
        DISTINCT es.event.editing_session_id
    FROM step_5
    INNER JOIN event.editattemptstep AS es
    ON step_5.editing_session_id = es.event.editing_session_id
    WHERE {es_partition_statement}
    AND es.event.action = "saveSuccess"
    AND es.dt >= step_5.dt
    AND es.dt < "{time_limit_ts}"
)
INSERT INTO {aggregate_table}
SELECT
    TO_DATE(step_1.dt) AS log_date,
    step_1.namespace,
    "{path_type}" AS path_type,
    count(1) AS num_edit_sessions,
    count(step_2.editing_session_id) AS num_dialog_opens,
    count(step_3.editing_session_id) AS num_media_searches,
    count(step_4.editing_session_id) AS num_media_confirms,
    count(step_5.editing_session_id) AS num_dialog_close,
    count(step_6.editing_session_id) AS num_edit_saves
FROM step_1
LEFT JOIN step_2
ON step_1.editing_session_id = step_2.editing_session_id
LEFT JOIN step_3
ON step_1.editing_session_id = step_3.editing_session_id
LEFT JOIN step_4
ON step_1.editing_session_id = step_4.editing_session_id
LEFT JOIN step_5
ON step_1.editing_session_id = step_5.editing_session_id
LEFT JOIN step_6
ON step_1.editing_session_id = step_6.editing_session_id
GROUP BY TO_DATE(step_1.dt), step_1.namespace
'''

In [8]:
for path_type in dialog_actions.keys():
    try:
        query_result = spark.run(
            edit_funnel_query.format(
                start_date = yesterday,
                end_date = today,
                time_limit_ts = limit_timestamp.isoformat(),
                es_partition_statement = make_partition_statement(yesterday, today, prefix = 'es'),
                vefu_partition_statement = make_partition_statement(yesterday, today, prefix = 'vefu'),
                aggregate_table = table_name,
                path_type = path_type,
                open_action = dialog_actions[path_type]['open'],
                close_action = dialog_actions[path_type]['close']
            )
        )
    except UnboundLocalError:
        # wmfdata currently (late Feb 2021) has an issue with DDL/DML SQL queries,
        # and so we ignore that error
        pass

PySpark executors will use /usr/bin/python3.7.
PySpark executors will use /usr/bin/python3.7.
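
Because the same query template is formatted once per path, a dry run that only renders the path-specific parameters, without calling Spark, can be a quick sanity check. The sketch below assumes a hypothetical three-line `template` standing in for `edit_funnel_query`; the `dialog_actions` values are the ones configured above.

```python
dialog_actions = {
    'add media': {'open': 'window-open-from-tool', 'close': 'dialog-insert'},
    'edit media': {'open': 'window-open-from-context', 'close': 'dialog-done'},
}

# Hypothetical stand-in for edit_funnel_query, reduced to the path-specific parts.
template = '''-- path: {path_type}
open:  vefu.event.action = "{open_action}"
close: vefu.event.action = "{close_action}"'''

# Render the template once per path instead of sending it to Spark.
rendered = {
    path_type: template.format(
        path_type=path_type,
        open_action=actions['open'],
        close_action=actions['close'],
    )
    for path_type, actions in dialog_actions.items()
}

for text in rendered.values():
    print(text)
    print()
```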


After the queries have run, run this command to set the permissions on the table so that others can query it: `hdfs dfs -chmod -R o+r <path to your table>`
This is taken care of by the shell script that runs this notebook every day.