# Investigating Timestamp Differences in the `dt` Field

One issue we run into when writing a standard funnel query is whether we can actually distinguish between events in that funnel based on the value of `dt`. This is partly because EditAttemptStep and VisualEditorFeatureUse record that field at one-second resolution, or at least they did until they were migrated on March 8, 2021, at which point millisecond resolution was introduced.

We're interested in understanding to what extent we can distinguish between events in the funnel based on `dt`. If we grab the first timestamp of every event in the relevant funnel for each edit session, how many events line up and how many do not?
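To make the problem concrete, here is a minimal sketch of how truncating to whole seconds can make two distinct events indistinguishable. The timestamps and the session are made up for illustration; only the action names come from this notebook.

```python
import pandas as pd

# Hypothetical millisecond-resolution timestamps for three events in one session.
events = pd.DataFrame({
    'action': ['init', 'window-open-from-tool', 'search-change-query'],
    'dt_ms': pd.to_datetime([
        '2021-02-01T12:00:00.120Z',
        '2021-02-01T12:00:00.870Z',
        '2021-02-01T12:00:02.310Z',
    ]),
})

# Truncate to whole seconds, mimicking the one-second resolution of `dt`.
events['dt_s'] = events['dt_ms'].dt.floor('s')

# Time since the previous event, in seconds, at both resolutions.
events['diff_ms'] = events['dt_ms'].diff().dt.total_seconds()
events['diff_s'] = events['dt_s'].diff().dt.total_seconds()
```

In this toy session the first two events are 0.75 seconds apart, but at one-second resolution their difference is 0, so `dt` alone cannot tell us which came first.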

In [6]:
import json
import datetime as dt

from collections import defaultdict

import numpy as np
import pandas as pd

from wmfdata import spark, mariadb

## Configuration Variables

See the explanation in the "Funnel Query" section for why we have two paths and why their actions are stored in this dictionary.

In [7]:
# Values of the `action` field for opening and closing the media dialog for the
# two paths.
dialog_actions = {
    'add media' : {
        'open' : 'window-open-from-tool',
        'close' : 'dialog-insert'
    },
    'edit media' : {
        'open' : 'window-open-from-context',
        'close' : 'dialog-done'
    }
}

## Funnel Query

This query is the same for both paths, but the "open media dialog" and "close media dialog" actions differ. In the "add media" path the open action is `window-open-from-tool` and the close action is `dialog-insert`; in the "edit media" path they're `window-open-from-context` and `dialog-done`, respectively.

The funnel was instrumented across all wikis as of January 16, 2021, so that is the earliest date we could use; note that the partition filter in the query below starts at January 18 (`day >= 18`). We stop before February 15 (`day < 15`), because that's just before the switch from legacy search to MediaSearch started rolling out. We only look at VisualEditor sessions taking place in the article namespace, as that's where the majority of these sessions occur.
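As a quick sanity check on the date window, the sketch below mirrors the `year`/`month`/`day` predicate that appears in each step of the query and lists which calendar days it matches. The helper name `in_query_window` is hypothetical and exists only for this illustration.

```python
from datetime import date, timedelta

def in_query_window(d):
    """Mirror the partition predicate used in each step of the funnel query."""
    return d.year == 2021 and ((d.month == 1 and d.day >= 18) or
                               (d.month == 2 and d.day < 15))

# Scan the first 90 days of 2021 and keep the days the predicate accepts.
days = [date(2021, 1, 1) + timedelta(days=i) for i in range(90)]
matched = [d for d in days if in_query_window(d)]

first_day, last_day, num_days = matched[0], matched[-1], len(matched)
```

The predicate matches January 18 through February 14 inclusive, which is 28 days.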

In [42]:
edit_funnel_query = '''
WITH step_1 AS ( -- Initialize a VE edit session
    SELECT
        event.editing_session_id,
        MIN(dt) AS dt
    FROM event.editattemptstep AS es
    WHERE es.year = 2021
    AND ((es.month = 1 AND es.day >= 18)
         OR (es.month = 2 AND es.day < 15))
    AND event.is_oversample = false
    AND event.editor_interface = "visualeditor"
    AND event.page_ns = 0
    AND event.action = "init"
    GROUP BY event.editing_session_id
),
step_2 AS ( -- Open the media dialog
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_1
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_1.editing_session_id = vefu.event.editingsessionid
    WHERE vefu.year = 2021
    AND ((vefu.month = 1 AND vefu.day >= 18)
         OR (vefu.month = 2 AND vefu.day < 15))
    AND vefu.event.feature = "media"
    AND vefu.event.action = "{open_action}"
    AND vefu.dt >= step_1.dt
    GROUP BY vefu.event.editingsessionid
),
step_3 AS ( -- Search for media
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_2
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_2.editing_session_id = vefu.event.editingsessionid
    WHERE vefu.year = 2021
    AND ((vefu.month = 1 AND vefu.day >= 18)
         OR (vefu.month = 2 AND vefu.day < 15))
    AND vefu.event.feature = "media"
    AND vefu.event.action = "search-change-query"
    AND vefu.dt >= step_2.dt
    GROUP BY vefu.event.editingsessionid
),
step_4 AS ( -- Confirm a search result
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_3
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_3.editing_session_id = vefu.event.editingsessionid
    WHERE vefu.year = 2021
    AND ((vefu.month = 1 AND vefu.day >= 18)
         OR (vefu.month = 2 AND vefu.day < 15))
    AND vefu.event.feature = "media"
    AND vefu.event.action = "search-confirm-image"
    AND vefu.dt >= step_3.dt
    GROUP BY vefu.event.editingsessionid
),
step_5 AS ( -- Close the media dialog
    SELECT
        vefu.event.editingsessionid AS editing_session_id,
        MIN(vefu.dt) AS dt
    FROM step_4
    INNER JOIN event.visualeditorfeatureuse AS vefu
    ON step_4.editing_session_id = vefu.event.editingsessionid
    WHERE vefu.year = 2021
    AND ((vefu.month = 1 AND vefu.day >= 18)
         OR (vefu.month = 2 AND vefu.day < 15))
    AND vefu.event.feature = "media"
    AND vefu.event.action = "{close_action}"
    AND vefu.dt >= step_4.dt
    GROUP BY vefu.event.editingsessionid
),
step_6 AS ( -- Save the edit
    SELECT
        es.event.editing_session_id,
        MIN(es.dt) AS dt
    FROM step_5
    INNER JOIN event.editattemptstep AS es
    ON step_5.editing_session_id = es.event.editing_session_id
    WHERE es.year = 2021
    AND ((es.month = 1 AND es.day >= 18)
         OR (es.month = 2 AND es.day < 15))
    AND es.event.action = "saveSuccess"
    AND es.dt >= step_5.dt
    GROUP BY es.event.editing_session_id
)
SELECT
    step_1.editing_session_id,
    step_1.dt AS step_1_dt,
    step_2.dt AS step_2_dt,
    step_3.dt AS step_3_dt,
    step_4.dt AS step_4_dt,
    step_5.dt AS step_5_dt,
    step_6.dt AS step_6_dt,
    IF(step_2.dt IS NULL, -1,
        unix_timestamp(step_2.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(step_1.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS step_diff_12,
    IF(step_2.dt IS NULL OR step_3.dt IS NULL, -1,
        unix_timestamp(step_3.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(step_2.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS step_diff_23,
    IF(step_3.dt IS NULL OR step_4.dt IS NULL, -1,
        unix_timestamp(step_4.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(step_3.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS step_diff_34,
    IF(step_4.dt IS NULL OR step_5.dt IS NULL, -1,
        unix_timestamp(step_5.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(step_4.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS step_diff_45,
    IF(step_5.dt IS NULL OR step_6.dt IS NULL, -1,
        unix_timestamp(step_6.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") -
        unix_timestamp(step_5.dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS step_diff_56
FROM step_1
LEFT JOIN step_2
ON step_1.editing_session_id = step_2.editing_session_id
LEFT JOIN step_3
ON step_1.editing_session_id = step_3.editing_session_id
LEFT JOIN step_4
ON step_1.editing_session_id = step_4.editing_session_id
LEFT JOIN step_5
ON step_1.editing_session_id = step_5.editing_session_id
LEFT JOIN step_6
ON step_1.editing_session_id = step_6.editing_session_id
'''
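The `step_diff_*` columns in the query are differences in whole seconds, with `-1` as a sentinel for sessions that never reached the later step. A minimal Python sketch of the same logic follows, assuming `dt` strings in the second-resolution format the query parses. The function name `step_diff` and the constant `DT_FORMAT` are hypothetical, not part of the notebook.

```python
from datetime import datetime, timezone

# Python strptime equivalent of the pattern yyyy-MM-dd'T'HH:mm:ss'Z'.
DT_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def step_diff(earlier, later):
    """Seconds between two `dt` strings, or -1 if either step is missing."""
    if earlier is None or later is None:
        return -1
    t0 = datetime.strptime(earlier, DT_FORMAT).replace(tzinfo=timezone.utc)
    t1 = datetime.strptime(later, DT_FORMAT).replace(tzinfo=timezone.utc)
    return int((t1 - t0).total_seconds())

same_second = step_diff('2021-02-01T12:00:00Z', '2021-02-01T12:00:00Z')
five_seconds = step_diff('2021-02-01T12:00:00Z', '2021-02-01T12:00:05Z')
not_reached = step_diff('2021-02-01T12:00:00Z', None)
```

A difference of 0 therefore means "same recorded second", not necessarily "same instant", and `-1` rows have to be filtered out before aggregating.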

In [43]:
add_media_timings = spark.run(
    edit_funnel_query.format(
        open_action = dialog_actions['add media']['open'],
        close_action = dialog_actions['add media']['close']
    )
)

In [None]:
add_media_timings.loc[add_media_timings['step_diff_12'] >= 0].head()

In [44]:
edit_media_timings = spark.run(
    edit_funnel_query.format(
        open_action = dialog_actions['edit media']['open'],
        close_action = dialog_actions['edit media']['close']
    )
)

In [None]:
edit_media_timings.loc[edit_media_timings['step_diff_12'] >= 0].head()

## Time Difference Between Editor Initialization and Media Dialog Open

In [45]:
add_media_step_12_agg = (add_media_timings.loc[add_media_timings['step_diff_12'] >= 0]
                         .groupby('step_diff_12')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
add_media_step_12_agg['percent'] = (100.0 *
                                    add_media_step_12_agg['num_sessions'] /
                                    add_media_step_12_agg['num_sessions'].sum())
add_media_step_12_agg.head(10)

step_diff_12,num_sessions,percent
0,1261,37.79976
1,41,1.229017
2,2,0.059952
3,3,0.089928
4,1,0.029976
5,4,0.119904
6,1,0.029976
7,3,0.089928
8,2,0.059952
9,8,0.239808
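The same aggregation is repeated for every transition and both paths, so it could be factored into a helper. The sketch below is one way to do that. The name `aggregate_step_diff` is hypothetical, it uses `size()` instead of counting `step_1_dt` (equivalent as long as `step_1_dt` is never null), and the data frame is toy data, not real results.

```python
import pandas as pd

def aggregate_step_diff(timings, diff_col):
    """Count sessions per value of `diff_col`, skipping the -1 sentinel."""
    reached = timings.loc[timings[diff_col] >= 0]
    agg = (reached
           .groupby(diff_col)
           .size()
           .to_frame('num_sessions'))
    agg['percent'] = 100.0 * agg['num_sessions'] / agg['num_sessions'].sum()
    return agg

# Toy data for illustration only: five sessions reached step 2, two did not.
toy_timings = pd.DataFrame({'step_diff_12': [0, 0, 0, 1, 7, -1, -1]})
toy_agg = aggregate_step_diff(toy_timings, 'step_diff_12')
```

With the real data this would be called as, for example, `aggregate_step_diff(add_media_timings, 'step_diff_12').head(10)`.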


In [46]:
edit_media_step_12_agg = (edit_media_timings.loc[edit_media_timings['step_diff_12'] >= 0]
                         .groupby('step_diff_12')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
edit_media_step_12_agg['percent'] = (100.0 *
                                    edit_media_step_12_agg['num_sessions'] /
                                    edit_media_step_12_agg['num_sessions'].sum())
edit_media_step_12_agg.head(10)

step_diff_12,num_sessions,percent
0,515,33.660131
1,23,1.503268
3,1,0.065359
6,1,0.065359
7,1,0.065359
8,2,0.130719
9,2,0.130719
10,1,0.065359
11,2,0.130719
12,2,0.130719


## Time Difference Between Media Dialog Open and Media Search

In [47]:
add_media_step_23_agg = (add_media_timings.loc[add_media_timings['step_diff_23'] >= 0]
                         .groupby('step_diff_23')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
add_media_step_23_agg['percent'] = (100.0 *
                                    add_media_step_23_agg['num_sessions'] /
                                    add_media_step_23_agg['num_sessions'].sum())
add_media_step_23_agg.head(10)

step_diff_23,num_sessions,percent
0,1128,69.715698
1,15,0.92707
2,3,0.185414
3,1,0.061805
5,3,0.185414
6,1,0.061805
7,3,0.185414
8,2,0.123609
9,3,0.185414
10,5,0.309023


In [48]:
edit_media_step_23_agg = (edit_media_timings.loc[edit_media_timings['step_diff_23'] >= 0]
                         .groupby('step_diff_23')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
edit_media_step_23_agg['percent'] = (100.0 *
                                    edit_media_step_23_agg['num_sessions'] /
                                    edit_media_step_23_agg['num_sessions'].sum())
edit_media_step_23_agg.head(10)

step_diff_23,num_sessions,percent
0,151,43.019943
1,5,1.424501
3,1,0.2849
6,1,0.2849
9,1,0.2849
11,1,0.2849
12,1,0.2849
16,2,0.569801
17,2,0.569801
19,1,0.2849


## Time Difference Between Media Search and Confirm Image

In [49]:
add_media_step_34_agg = (add_media_timings.loc[add_media_timings['step_diff_34'] >= 0]
                         .groupby('step_diff_34')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
add_media_step_34_agg['percent'] = (100.0 *
                                    add_media_step_34_agg['num_sessions'] /
                                    add_media_step_34_agg['num_sessions'].sum())
add_media_step_34_agg.head(10)

step_diff_34,num_sessions,percent
0,409,40.737052
1,19,1.89243
2,4,0.398406
4,2,0.199203
5,1,0.099602
6,1,0.099602
7,2,0.199203
8,1,0.099602
9,1,0.099602
10,3,0.298805


In [50]:
edit_media_step_34_agg = (edit_media_timings.loc[edit_media_timings['step_diff_34'] >= 0]
                         .groupby('step_diff_34')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
edit_media_step_34_agg['percent'] = (100.0 *
                                    edit_media_step_34_agg['num_sessions'] /
                                    edit_media_step_34_agg['num_sessions'].sum())
edit_media_step_34_agg.head(10)

step_diff_34,num_sessions,percent
0,105,41.176471
1,7,2.745098
2,1,0.392157
4,1,0.392157
9,1,0.392157
10,2,0.784314
11,3,1.176471
16,1,0.392157
18,1,0.392157
21,2,0.784314


## Time Difference Between Confirm Image and Media Dialog Close

In [51]:
add_media_step_45_agg = (add_media_timings.loc[add_media_timings['step_diff_45'] >= 0]
                         .groupby('step_diff_45')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
add_media_step_45_agg['percent'] = (100.0 *
                                    add_media_step_45_agg['num_sessions'] /
                                    add_media_step_45_agg['num_sessions'].sum())
add_media_step_45_agg.head(10)

step_diff_45,num_sessions,percent
0,401,41.044012
1,8,0.818833
4,1,0.102354
5,1,0.102354
6,2,0.204708
8,1,0.102354
9,2,0.204708
10,3,0.307062
11,1,0.102354
12,1,0.102354


In [52]:
edit_media_step_45_agg = (edit_media_timings.loc[edit_media_timings['step_diff_45'] >= 0]
                         .groupby('step_diff_45')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
edit_media_step_45_agg['percent'] = (100.0 *
                                    edit_media_step_45_agg['num_sessions'] /
                                    edit_media_step_45_agg['num_sessions'].sum())
edit_media_step_45_agg.head(10)

step_diff_45,num_sessions,percent
0,97,48.989899
1,5,2.525253
4,1,0.505051
10,3,1.515152
11,2,1.010101
14,1,0.505051
21,1,0.505051
23,1,0.505051
24,1,0.505051
28,1,0.505051


## Time Difference Between Media Dialog Close and Edit Saved

In [53]:
add_media_step_56_agg = (add_media_timings.loc[add_media_timings['step_diff_56'] >= 0]
                         .groupby('step_diff_56')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
add_media_step_56_agg['percent'] = (100.0 *
                                    add_media_step_56_agg['num_sessions'] /
                                    add_media_step_56_agg['num_sessions'].sum())
add_media_step_56_agg.head(10)

step_diff_56,num_sessions,percent
0,112,14.322251
1,3,0.383632
2,2,0.255754
3,2,0.255754
5,2,0.255754
6,1,0.127877
7,2,0.255754
8,5,0.639386
9,5,0.639386
10,2,0.255754


In [55]:
edit_media_step_56_agg = (edit_media_timings.loc[edit_media_timings['step_diff_56'] >= 0]
                         .groupby('step_diff_56')
                         .agg({'step_1_dt' : 'count'})
                         .rename(columns = {'step_1_dt' : 'num_sessions'})
                        )
edit_media_step_56_agg['percent'] = (100.0 *
                                    edit_media_step_56_agg['num_sessions'] /
                                    edit_media_step_56_agg['num_sessions'].sum())
edit_media_step_56_agg.head(10)

step_diff_56,num_sessions,percent
0,16,10.322581
1,2,1.290323
4,1,0.645161
12,1,0.645161
13,1,0.645161
14,1,0.645161
15,1,0.645161
16,1,0.645161
21,1,0.645161
23,1,0.645161


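Given that such a large proportion of consecutive events share the same one-second timestamp (roughly 34% to 70% of sessions for every transition except the final one, where it drops to 10% to 14%), I'm starting to wonder whether we're running into effects of networking protocols or browser implementations rather than actual user behaviour. Either way, at one-second resolution we can't reliably order these events using `dt` alone.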
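To compare the share of same-second events across all transitions in one place, something like the following sketch could be used. The function name `zero_diff_share` is hypothetical and the data frame is toy data for illustration, not real results.

```python
import pandas as pd

def zero_diff_share(timings, diff_cols):
    """Percent of sessions reaching each step whose diff is 0 seconds."""
    shares = {}
    for col in diff_cols:
        # Drop the -1 sentinel: those sessions never reached the later step.
        reached = timings.loc[timings[col] >= 0, col]
        shares[col] = (100.0 * (reached == 0).mean()
                       if len(reached) else float('nan'))
    return pd.Series(shares, name='percent_zero')

# Toy data: three sessions reached step 2, two of those reached step 3.
toy = pd.DataFrame({
    'step_diff_12': [0, 0, 1, -1],
    'step_diff_23': [0, 5, -1, -1],
})
shares = zero_diff_share(toy, ['step_diff_12', 'step_diff_23'])
```

Applied to `add_media_timings` and `edit_media_timings` with the five `step_diff_*` columns, this should reproduce the first row of each table above.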