# MediaSearch Filter Usage Aggregation

There are two questions about filter usage in T258229:

* What percentage of users use more than 1? 2? 3? filters in their session?
* What filters are used the most?

Not all filters are available for all searches. For example, the namespace filter is only available when searching for categories and pages. We do not account for how many filters were available for a given search, which would be needed to understand to what extent users take advantage of the filters offered to them; the "proportion of sessions using filters" metric covers that. Instead, we aggregate across all sessions and leave the interpretation of the aggregates to whoever uses them.

I interpret the first question as counting the number of distinct filters changed during a session. We then aggregate over that number and store the date, the number of filters, and the number of sessions.

I interpret the second question as counting the number of events that set a filter to a specific value. We aggregate those and store the date, filter type, filter value, and number of events.

For simplicity, we'll create two tables, one for each aggregation.
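
To make the two aggregations concrete, here is a minimal sketch on made-up data. The session IDs, filter types, and filter values are invented placeholders, not values taken from the event stream. The sketch counts every `filter_change` event per session, mirroring the `COUNT(1)` in the query further down; counting distinct filter types instead would need a set of types per session.

```python
from collections import Counter

# Made-up (session_id, filter_type, filter_value) tuples, one per filter_change event.
events = [
    ("a", "mimeType", "jpeg"),
    ("a", "imageSize", "500"),
    ("a", "mimeType", ""),       # empty value: the filter was cleared
    ("b", "sort", "recency"),
    ("c", "mimeType", "png"),
    ("c", "mimeType", "jpeg"),
    ("d", "sort", "recency"),
]

# Question 1: filter changes per session, then sessions per number of changes.
changes_per_session = Counter(session for session, _, _ in events)
sessions_per_change_count = Counter(changes_per_session.values())

# Question 2: events per (filter type, filter value); an empty value is stored as "reset".
changes_per_filter_value = Counter(
    (ftype, fvalue or "reset") for _, ftype, fvalue in events
)

for num_changes, num_sessions in sorted(sessions_per_change_count.items()):
    print(num_changes, num_sessions)
for (ftype, fvalue), num_changes in sorted(changes_per_filter_value.items()):
    print(ftype, fvalue, num_changes)
```

Each printed row of the first loop corresponds to one row in the first table, and each row of the second loop to one row in the second table, with the date left out.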

In [3]:
import datetime as dt

import pandas as pd
import numpy as np

from wmfdata import spark, mariadb

## Configuring Timestamps

We'll call the day we're gathering data for `data_day`. We expect this notebook to be run the day after, which we'll call `next_day`. To be able to ignore search sessions that started the day before `data_day`, we also define that day as `previous_day`. Lastly, we set a cutoff for data at one hour after midnight UTC on `next_day`. In other words, we expect a search session that started on `data_day` to be completed by then.

In [4]:
next_day = dt.datetime.now(dt.timezone.utc).date()

data_day = next_day - dt.timedelta(days = 1)
previous_day = data_day - dt.timedelta(days = 1)

limit_timestamp = dt.datetime.combine(next_day, dt.time(hour = 1))
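
As a sanity check, the same arithmetic can be wrapped in a function and run for a fixed point in time. This is only a sketch: `compute_window` is a hypothetical helper that is not part of the notebook, and the chosen `now` is arbitrary (picked so the window crosses a month boundary).

```python
import datetime as dt

def compute_window(now):
    """Return (previous_day, data_day, next_day, limit_timestamp) for a given UTC time."""
    next_day = now.date()
    data_day = next_day - dt.timedelta(days=1)
    previous_day = data_day - dt.timedelta(days=1)
    # Cutoff: one hour after midnight at the start of next_day (naive datetime).
    limit_timestamp = dt.datetime.combine(next_day, dt.time(hour=1))
    return previous_day, data_day, next_day, limit_timestamp

# Pretend the notebook runs on 1 March 2021 at 06:00 UTC.
now = dt.datetime(2021, 3, 1, 6, 0, tzinfo=dt.timezone.utc)
previous_day, data_day, next_day, limit_timestamp = compute_window(now)
print(previous_day, data_day, next_day, limit_timestamp.isoformat())
# 2021-02-27 2021-02-28 2021-03-01 2021-03-01T01:00:00
```

Note that `limit_timestamp` carries no timezone information, so its ISO string has no UTC offset.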

## Table Configurations

In [9]:
filters_per_session_counts_table = 'nettrom_sd.mediasearch_filters_per_session_aggregates'
filter_change_aggregates_table = 'nettrom_sd.mediasearch_filter_change_aggregates'

## Table Creation Statements

These are mainly for reference.

In [4]:
create_filters_per_session_query = f'''
CREATE TABLE {filters_per_session_counts_table} (
    log_date DATE COMMENT "the date of the aggregated search counts",
    filter_changes INT COMMENT "the number of filter changes in a session",
    num_sessions INT COMMENT "the number of sessions with a given number of filter changes"
)
'''

In [5]:
print(create_filters_per_session_query)


CREATE TABLE nettrom_sd.mediasearch_filters_per_session_aggregates (
    log_date DATE COMMENT "the date of the aggregated search counts",
    filter_changes INT COMMENT "the number of filter changes in a session",
    num_sessions INT COMMENT "the number of sessions with a given number of filter changes"
)



In [6]:
create_filters_setting_query = f'''
CREATE TABLE {filter_change_aggregates_table} (
    log_date DATE COMMENT "the date of the aggregated search counts",
    filter_type STRING COMMENT "the type of filter set",
    filter_value STRING COMMENT "the value the filter was set to",
    num_changes INT COMMENT "the number of times the filter was set to that value"
)
'''

In [7]:
print(create_filters_setting_query)


CREATE TABLE nettrom_sd.mediasearch_filter_change_aggregates (
    log_date DATE COMMENT "the date of the aggregated search counts",
    filter_type STRING COMMENT "the type of filter set",
    filter_value STRING COMMENT "the value the filter was set to",
    num_changes INT COMMENT "the number of times the filter was set to that value"
)



## Helper Functions

In [5]:
def make_partition_statement(start_ts, end_ts, prefix = ''):
    '''
    This takes the two timestamps and creates a statement that selects
    partitions based on `year`, `month`, and `day` in order to make our
    data gathering not use excessive amounts of data. It assumes that
    `start_ts` and `end_ts` are not more than a month apart, which should
    be a reasonable expectation for this notebook.
    
    An optional prefix can be set to enable selecting partitions for
    multiple tables with different aliases.
    
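    :param start_ts: start of the range (inclusive)
    :type start_ts: datetime.date or datetime.datetime
    
    :param end_ts: end of the range (inclusive)
    :type end_ts: datetime.date or datetime.datetime
    
    :param prefix: prefix to use in front of partition clauses, "." is added automatically
    :type prefix: str
    
    :returns: SQL condition on the partition columns, ready to follow `WHERE`
    :rtype: str
    '''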
    
    if prefix:
        prefix = f'{prefix}.' # adds "." after the prefix
    
    # there are three cases:
    # 1: month and year are the same, output a "BETWEEN" statement with the days
    # 2: months differ, but the years are the same.
    # 3: years differ too.
    # Case #2 and #3 can be combined, because it doesn't really matter
    # if the years are the same in the month-selection or not.
    
    if start_ts.year == end_ts.year and start_ts.month == end_ts.month:
        return f'''{prefix}year = {start_ts.year}
AND {prefix}month = {start_ts.month}
AND {prefix}day BETWEEN {start_ts.day} AND {end_ts.day}'''
    else:
        return f'''
(
    ({prefix}year = {start_ts.year}
     AND {prefix}month = {start_ts.month}
     AND {prefix}day >= {start_ts.day})
 OR ({prefix}year = {end_ts.year}
     AND {prefix}month = {end_ts.month}
     AND {prefix}day <= {end_ts.day})
)'''
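
The "not more than a month apart" assumption can be checked without touching the data lake. The sketch below mirrors the generated SQL condition as a plain Python predicate and lists the days of a range that the condition would fail to select. Both `partition_matches` and `missed_days` are hypothetical helpers written for this illustration only.

```python
import datetime as dt

def partition_matches(day, start_ts, end_ts):
    """Pure-Python mirror of the condition built by make_partition_statement."""
    if start_ts.year == end_ts.year and start_ts.month == end_ts.month:
        return (day.year == start_ts.year
                and day.month == start_ts.month
                and start_ts.day <= day.day <= end_ts.day)
    return ((day.year == start_ts.year
             and day.month == start_ts.month
             and day.day >= start_ts.day)
            or (day.year == end_ts.year
                and day.month == end_ts.month
                and day.day <= end_ts.day))

def missed_days(start_ts, end_ts):
    """Dates in [start_ts, end_ts] that the partition condition does not select."""
    num_days = (end_ts - start_ts).days
    days = [start_ts + dt.timedelta(days=i) for i in range(num_days + 1)]
    return [d for d in days if not partition_matches(d, start_ts, end_ts)]

# The three-day window this notebook uses, here crossing a year boundary.
short_window = missed_days(dt.date(2020, 12, 31), dt.date(2021, 1, 2))
# A window with a whole month in between start and end.
long_window = missed_days(dt.date(2021, 1, 31), dt.date(2021, 3, 1))

print(short_window)      # []
print(len(long_window))  # 28
```

The three-day window is fully covered even across a year boundary, while the longer window silently drops all of February. That is fine for this notebook, but worth keeping in mind if the helper is reused for longer ranges.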

## The Queries

A few notes:

1. The part of the queries that define a valid MediaSearch session is the same as for the other notebooks, in order to ensure consistency across metrics.
2. Filters can be set at any point during the search session. This leads to two potential paths: 1) the filter change took place before a search was made, and thus applies to the first search following it; and 2) the filter change happened after a search, at which point a new search automatically happens to show new results based on the filter. We cannot tell from a `search_new` event which filters applied to it, so in the queries below we allow `filter_change` to occur at any point during the search session. Search sessions are based on `search_new`, and we therefore assume that filters apply to the searches made. The only case we're ignoring is a user setting and then resetting filters before running a search. This is also consistent with how filter-based aggregations are done in other notebooks.
3. We do not count sessions without filters. That is handled by the other notebook, which aggregates overall filter usage (proportion of sessions using at least one filter).
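
The session logic described in these notes can be sketched in plain Python on a handful of made-up events. The session IDs and timestamps are invented, and the code only approximates what the SQL below does: a session is valid if its first `search_new` falls on the day of interest, and its `filter_change` events count regardless of whether they come before or after the search, as long as they precede the cutoff.

```python
import datetime as dt
from collections import Counter

data_day = dt.date(2021, 2, 28)
limit_timestamp = dt.datetime(2021, 3, 1, 1, 0)

# Made-up (session_id, action, timestamp) tuples.
events = [
    ("a", "search_new",    dt.datetime(2021, 2, 27, 23, 58)),  # started the day before
    ("a", "filter_change", dt.datetime(2021, 2, 28, 0, 2)),
    ("b", "filter_change", dt.datetime(2021, 2, 28, 10, 0)),   # filter set before the search
    ("b", "search_new",    dt.datetime(2021, 2, 28, 10, 1)),
    ("c", "search_new",    dt.datetime(2021, 2, 28, 23, 50)),
    ("c", "filter_change", dt.datetime(2021, 3, 1, 0, 30)),    # after midnight, before cutoff
    ("c", "filter_change", dt.datetime(2021, 3, 1, 1, 30)),    # after the cutoff
    ("d", "filter_change", dt.datetime(2021, 2, 28, 12, 0)),   # never ran a search
]

# Session start: the earliest search_new event per session.
session_start = {}
for session_id, action, ts in events:
    if action == "search_new":
        session_start[session_id] = min(ts, session_start.get(session_id, ts))

# Valid sessions started on the day of interest.
valid_sessions = {s for s, start in session_start.items() if start.date() == data_day}

# Filter changes in valid sessions, at any point before the cutoff.
filter_changes = Counter(
    session_id
    for session_id, action, ts in events
    if session_id in valid_sessions
    and action == "filter_change"
    and ts < limit_timestamp
)
print(dict(filter_changes))  # {'b': 1, 'c': 1}
```

Session `a` is dropped because it started the day before, session `d` because it never ran a search, and the late filter change in session `c` falls outside the cutoff.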

In [6]:
filters_per_session_query = '''
WITH mediasearch_sessions AS ( -- all MediaSearch sessions started during the day of interest
    SELECT
        web_pageview_id AS session_id,
        MIN(coalesce(dt, meta.dt)) AS session_start_dt
    FROM event.mediawiki_mediasearch_interaction AS ms
    WHERE {ms_partition_statement}
    AND action = "search_new"
    GROUP BY web_pageview_id
    HAVING TO_DATE(session_start_dt) = "{today}"
),
filters_per_session AS ( -- number of filter changes in each session
    SELECT
        TO_DATE(mess.session_start_dt) AS log_date,
        ms.web_pageview_id AS session_id,
        COUNT(1) AS num_filter_changes
    FROM mediasearch_sessions AS mess
    JOIN event.mediawiki_mediasearch_interaction AS ms
    ON mess.session_id = ms.web_pageview_id
    WHERE {ms_partition_statement}
    AND action = "filter_change"
    AND coalesce(dt, meta.dt) < "{limit_timestamp}"
    GROUP BY TO_DATE(mess.session_start_dt), ms.web_pageview_id
)
INSERT INTO {aggregate_table}
SELECT
    log_date,
    num_filter_changes,
    COUNT(1) AS num_sessions
FROM filters_per_session
GROUP BY log_date, num_filter_changes
'''

In [7]:
filter_change_aggregate_query = '''
WITH mediasearch_sessions AS ( -- all MediaSearch sessions started during the day of interest
    SELECT
        web_pageview_id AS session_id,
        MIN(coalesce(dt, meta.dt)) AS session_start_dt
    FROM event.mediawiki_mediasearch_interaction AS ms
    WHERE {ms_partition_statement}
    AND action = "search_new"
    GROUP BY web_pageview_id
    HAVING TO_DATE(session_start_dt) = "{today}"
),
mediasearch_filter_aggregates AS ( -- all filter change events
    SELECT
        TO_DATE(mess.session_start_dt) AS log_date,
        search_filter_type,
        search_filter_value,
        COUNT(1) AS num_changes
    FROM mediasearch_sessions AS mess
    JOIN event.mediawiki_mediasearch_interaction AS ms
    ON mess.session_id = ms.web_pageview_id
    WHERE {ms_partition_statement}
    AND action = "filter_change"
    AND coalesce(dt, meta.dt) < "{limit_timestamp}"
    GROUP BY TO_DATE(mess.session_start_dt), search_filter_type, search_filter_value
)
INSERT INTO {aggregate_table}
SELECT
    log_date,
    search_filter_type,
    IF(search_filter_value = '', 'reset', search_filter_value) AS search_filter_value,
    num_changes
FROM mediasearch_filter_aggregates
'''

In [11]:
try:
    spark.run(filters_per_session_query.format(
        today = data_day,
        limit_timestamp = limit_timestamp.isoformat(),
        ms_partition_statement = make_partition_statement(previous_day, next_day, prefix = 'ms'),
        aggregate_table = filters_per_session_counts_table
    ))
except UnboundLocalError:
    # wmfdata currently (late Feb 2021) has an issue with DDL/DML SQL queries,
    # and so we ignore that error
    pass

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [12]:
try:
    spark.run(filter_change_aggregate_query.format(
        today = data_day,
        limit_timestamp = limit_timestamp.isoformat(),
        ms_partition_statement = make_partition_statement(previous_day, next_day, prefix = 'ms'),
        aggregate_table = filter_change_aggregates_table
    ))
except UnboundLocalError:
    # wmfdata currently (late Feb 2021) has an issue with DDL/DML SQL queries,
    # and so we ignore that error
    pass

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.
