# Introduction


[Instrumentation Ticket](https://phabricator.wikimedia.org/T307572) | [QA Ticket](https://phabricator.wikimedia.org/T324707)

# Instrumentation note


The Structured Data team deployed the instrumentation to measure search previews on desktop in January 2023 on the Indonesian, Portuguese and Russian Wikipedias. The related events are stored in the `event.mediawiki_searchpreview` table.

Updates:
- Mobile search previews were deployed in February.
- Interwiki links were added to search previews in March.

In [1]:
from wmfdata import hive, spark
import wmfdata 

import math
import pandas as pd
import numpy as np

from datetime import datetime, timedelta, date

In [3]:
spark_session = wmfdata.spark.create_session(type='yarn-large')  

SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/15 07:21:49 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


## QA on 2023-01-31

# Check daily events


In [4]:
daily_df = spark.run("""
    SELECT TO_DATE(dt),  year, month,day, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY TO_DATE(dt),year, month, day
"""
)


In [5]:
daily_df 

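| | to_date(dt) | year | month | day | events | sessions |
|---|---|---|---|---|---|---|
| 0 | 2023-03-10 | 2023 | 3 | 9 | 94 | 65 |
| 1 | 2023-04-07 | 2023 | 3 | 7 | 13 | 13 |
| 2 | 2023-02-04 | 2023 | 2 | 10 | 3 | 3 |
| 3 | 2023-03-03 | 2023 | 3 | 2 | 94 | 70 |
| 4 | 2023-02-03 | 2023 | 2 | 7 | 6 | 6 |
| ... | ... | ... | ... | ... | ... | ... |
| 1223 | 2023-01-21 | 2023 | 1 | 9 | 2 | 1 |
| 1224 | 2023-02-16 | 2023 | 2 | 18 | 21 | 17 |
| 1225 | 2023-01-11 | 2023 | 1 | 10 | 35 | 5 |
| 1226 | 2023-02-01 | 2023 | 2 | 5 | 2 | 2 |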


It seems that the `dt` field contains some dates in 2022, when this schema was not yet enabled. The issue appears on all wikis, not just one particular wiki.

In the modern event logging platform, `dt` now means the time according to the client. So if someone has the time on their phone set to 2022, `dt` would reflect that. By contrast, `meta.dt` (which is used to set the partition fields) is the time our server received the event.
In this case, we need to query the data using the partition fields or `meta.dt` instead of the `dt` field.
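
To make the distinction concrete, here is a minimal sketch (not part of the original notebook) of how events with a misconfigured client clock could be flagged by comparing the two timestamps. The function names, the one-day tolerance, and the sample timestamps are hypothetical; only the meaning of `dt` and `meta.dt` is taken from the explanation above.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def clock_skew(dt: str, meta_dt: str) -> timedelta:
    """Client-reported time (dt) minus server receipt time (meta.dt)."""
    return parse_ts(dt) - parse_ts(meta_dt)


def is_suspect(dt: str, meta_dt: str,
               tolerance: timedelta = timedelta(days=1)) -> bool:
    """True when client and server times differ by more than `tolerance`."""
    return abs(clock_skew(dt, meta_dt)) > tolerance


# Hypothetical events: the first client clock is roughly right,
# the second one is set to the year 2022.
print(is_suspect("2023-01-10T08:00:05Z", "2023-01-10T08:00:07Z"))  # False
print(is_suspect("2022-01-10T08:00:05Z", "2023-01-10T08:00:07Z"))  # True
```

A check like this only identifies affected events. For daily counts, grouping by the partition fields or `meta.dt` remains the safer choice.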

In [4]:
daily_df = hive.run("""
    SELECT TO_DATE(meta.dt),  year, month,day, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY TO_DATE(meta.dt),year, month, day
"""
)

In [5]:
daily_df 

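| | _c0 | year | month | day | events | sessions |
|---|---|---|---|---|---|---|
| 0 | 2023-01-05 | 2023 | 1 | 5 | 3747 | 1107 |
| 1 | 2023-01-06 | 2023 | 1 | 6 | 4480 | 1361 |
| 2 | 2023-01-07 | 2023 | 1 | 7 | 4468 | 1281 |
| 3 | 2023-01-08 | 2023 | 1 | 8 | 4121 | 1245 |
| 4 | 2023-01-09 | 2023 | 1 | 9 | 4894 | 1507 |
| 5 | 2023-01-10 | 2023 | 1 | 10 | 5371 | 1630 |
| 6 | 2023-01-11 | 2023 | 1 | 11 | 5385 | 1614 |
| 7 | 2023-01-12 | 2023 | 1 | 12 | 5434 | 1669 |
| 8 | 2023-01-13 | 2023 | 1 | 13 | 4815 | 1522 |
| 9 | 2023-01-14 | 2023 | 1 | 14 | 4260 | 1252 |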


### Notes

According to the results, the number of events and sessions increases significantly starting on 2023-01-26.
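
One way to locate such a jump programmatically is to compare each day with the previous one. The sketch below is an illustration only, not something the original notebook ran: the `first_jump` helper and the default factor of 2.0 are assumptions, and the input is limited to the ten daily rows visible in the output above, which end on 2023-01-14.

```python
from typing import List, Optional, Tuple

# Daily event counts copied from the ten rows shown in the output above.
DAILY_EVENTS: List[Tuple[str, int]] = [
    ("2023-01-05", 3747),
    ("2023-01-06", 4480),
    ("2023-01-07", 4468),
    ("2023-01-08", 4121),
    ("2023-01-09", 4894),
    ("2023-01-10", 5371),
    ("2023-01-11", 5385),
    ("2023-01-12", 5434),
    ("2023-01-13", 4815),
    ("2023-01-14", 4260),
]


def first_jump(daily: List[Tuple[str, int]],
               factor: float = 2.0) -> Optional[str]:
    """Return the first date whose count is at least `factor` times
    the previous day's count, or None if there is no such date."""
    for (_, prev), (day, cur) in zip(daily, daily[1:]):
        if prev > 0 and cur >= factor * prev:
            return day
    return None


# No day in the visible rows doubles the previous day's count.
print(first_jump(DAILY_EVENTS))  # None
```

Running the same helper over the full result set, ordered by date, would show whether the increase is sharp enough to cross the chosen factor.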

# Events by wikis

In [12]:
hive.run("""
    SELECT wiki_id, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY wiki_id
"""
)

| | wiki_id | events | sessions |
|---|---|---|---|
| 0 | idwiki | 31477 | 15903 |
| 1 | ptwiki | 78994 | 45809 |
| 2 | ruwiki | 158309 | 111854 |


# Events by actions

In [10]:
hive.run("""
    SELECT action, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY action
"""
)

| | action | events | sessions |
|---|---|---|---|
| 0 | click-article-section | 1550 | 1387 |
| 1 | click-image | 1387 | 686 |
| 2 | click-interwiki-commons | 880 | 651 |
| 3 | click-snippet | 9953 | 8808 |
| 4 | close-searchpreview | 56743 | 37360 |
| 5 | new-session | 139471 | 139433 |
| 6 | open-searchpreview | 58796 | 40269 |


### Notes

`click-interwiki-links` is not recorded yet, according to https://phabricator.wikimedia.org/T321078.
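
A check like this can be expressed as a set comparison between the actions seen in the data and the actions the instrumentation is expected to emit. The following is a hedged sketch: the observed set is copied from the output above, while the expected set is an assumption (the observed actions plus `click-interwiki-links`), not an authoritative list from the schema.

```python
from typing import Iterable, List

# Actions that appear in the query output above.
OBSERVED_ACTIONS = {
    "click-article-section",
    "click-image",
    "click-interwiki-commons",
    "click-snippet",
    "close-searchpreview",
    "new-session",
    "open-searchpreview",
}

# Assumption: the expected actions are the observed ones plus the
# interwiki-links click mentioned in the note above.
EXPECTED_ACTIONS = OBSERVED_ACTIONS | {"click-interwiki-links"}


def missing_actions(observed: Iterable[str],
                    expected: Iterable[str]) -> List[str]:
    """Expected actions that never appear in the data."""
    return sorted(set(expected) - set(observed))


def unexpected_actions(observed: Iterable[str],
                       expected: Iterable[str]) -> List[str]:
    """Actions in the data that are not on the expected list."""
    return sorted(set(observed) - set(expected))


print(missing_actions(OBSERVED_ACTIONS, EXPECTED_ACTIONS))
# ['click-interwiki-links']
print(unexpected_actions(OBSERVED_ACTIONS, EXPECTED_ACTIONS))
# []
```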

# Events by platforms

In [15]:
hive.run("""
    SELECT platform, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY platform
"""
)

| | platform | events | sessions |
|---|---|---|---|
| 0 | desktop | 268780 | 173566 |


### Notes

Only desktop events were recorded in January.

# Events by anonymous vs. non-anonymous

In [6]:
hive.run("""
    SELECT is_anon, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY is_anon
"""
)

| | is_anon | events | sessions |
|---|---|---|---|
| 0 | | 112501 | 34118 |
| 1 | False | 8734 | 7972 |
| 2 | True | 174057 | 155056 |


Checking the events where `is_anon IS NULL`:

In [7]:
hive.run("""
    SELECT TO_DATE(meta.dt), is_anon, COUNT(*)
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
      AND is_anon IS NULL
    GROUP BY TO_DATE(meta.dt), is_anon
"""
)

| | _c0 | is_anon | _c2 |
|---|---|---|---|
| 0 | 2023-01-05 | | 3747 |
| 1 | 2023-01-06 | | 4480 |
| 2 | 2023-01-07 | | 4468 |
| 3 | 2023-01-08 | | 4121 |
| 4 | 2023-01-09 | | 4894 |
| 5 | 2023-01-10 | | 5371 |
| 6 | 2023-01-11 | | 5385 |
| 7 | 2023-01-12 | | 5434 |
| 8 | 2023-01-13 | | 4815 |
| 9 | 2023-01-14 | | 4260 |


### Notes

Events with `is_anon IS NULL` account for less than 0.1% of events after `isAnon` was added to the searchsatisfaction schema on 2023-01-27.
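
The share of NULL values can be computed directly from a grouped count. The sketch below is illustrative and uses the `is_anon` breakdown shown earlier in this section, which covers all of 2023 and therefore mixes events from before and after the fix. The `null_share` helper is a hypothetical name. Verifying the 0.1% figure would require the same calculation restricted to events with `meta.dt` after 2023-01-27, and those counts are not shown here.

```python
from typing import Dict, Optional

# Event counts per is_anon value, copied from the breakdown above.
# None stands for NULL.
EVENTS_BY_IS_ANON: Dict[Optional[bool], int] = {
    None: 112501,
    False: 8734,
    True: 174057,
}


def null_share(counts: Dict[Optional[bool], int]) -> float:
    """Fraction of events whose is_anon value is NULL."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events to compute a share from")
    return counts.get(None, 0) / total


# Share over the whole period, before and after the fix combined.
print(f"{null_share(EVENTS_BY_IS_ANON):.1%}")  # 38.1%
```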

# Events by click positions

In [16]:
hive.run("""
    SELECT result_display_position, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
    GROUP BY result_display_position
"""
)

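| | result_display_position | events | sessions |
|---|---|---|---|
| 0 | -1 | 139674 | 139491 |
| 1 | 0 | 70082 | 24313 |
| 2 | 1 | 19890 | 7444 |
| 3 | 2 | 9655 | 3662 |
| 4 | 3 | 5973 | 2293 |
| ... | ... | ... | ... |
| 218 | 495 | 2 | 1 |
| 219 | 497 | 2 | 1 |
| 220 | 498 | 3 | 2 |
| 221 | 499 | 4 | 3 |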


Checking the events where `result_display_position = -1` and `action != 'new-session'`:

In [19]:
hive.run("""
    SELECT TO_DATE(dt),action, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023
      AND result_display_position = -1
      AND action != 'new-session'
    GROUP BY TO_DATE(dt), action
"""
)

| | _c0 | action | events | sessions |
|---|---|---|---|---|
| 0 | 2023-01-12 | close-searchpreview | 3 | 1 |
| 1 | 2023-01-12 | open-searchpreview | 3 | 1 |
| 2 | 2023-01-13 | close-searchpreview | 20 | 12 |
| 3 | 2023-01-13 | open-searchpreview | 20 | 12 |
| 4 | 2023-01-14 | close-searchpreview | 1 | 1 |
| 5 | 2023-01-15 | close-searchpreview | 6 | 4 |
| 6 | 2023-01-15 | open-searchpreview | 6 | 4 |
| 7 | 2023-01-16 | close-searchpreview | 4 | 4 |
| 8 | 2023-01-16 | open-searchpreview | 4 | 4 |
| 9 | 2023-01-17 | close-searchpreview | 2 | 2 |


==================================================

## QA on 2023-03-15

# Events by platforms

Check the data for search previews on mobile.

In [6]:
spark.run("""
    SELECT year, month, platform, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023 and month > 1
    GROUP BY year, month, platform
"""
)

                                                                                

| | year | month | platform | events | sessions |
|---|---|---|---|---|---|
| 0 | 2023 | 2 | desktop | 1113348 | 954276 |
| 1 | 2023 | 2 | mobile | 1263408 | 893446 |
| 2 | 2023 | 3 | mobile | 664668 | 462589 |
| 3 | 2023 | 3 | desktop | 589787 | 499601 |


# Events by actions

Check for events from the interwiki links added to search previews.

In [7]:
spark.run("""
    SELECT action, COUNT(1) AS events, COUNT(DISTINCT session_id) AS sessions
    FROM event.mediawiki_searchpreview
    WHERE year= 2023 and month > 1
    GROUP BY action
"""
)

                                                                                

| | action | events | sessions |
|---|---|---|---|
| 0 | new-session | 2809779 | 2808509 |
| 1 | click-image | 13067 | 6197 |
| 2 | click-interwiki-commons | 6196 | 4017 |
| 3 | click-interwiki-links | 8 | 8 |
| 4 | click-article-section | 15346 | 12480 |
| 5 | click-snippet | 18764 | 16440 |
| 6 | close-searchpreview | 376955 | 191935 |
| 7 | open-searchpreview | 391096 | 222433 |
