# Post Deployment Language Switching QA

[Task](https://phabricator.wikimedia.org/T275762)
|
[Schema](https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/secondary/+/refs/heads/master/jsonschema/analytics/legacy/universallanguageselector/current.yaml)

In [1]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse); 
    library(lubridate); 
    library(scales);
    library(magrittr); 
    library(dplyr)
})


## Language Links in the Sidebar

[Instrumentation Task](https://phabricator.wikimedia.org/T275762)

Instrumentation Notes:
- Instrumentation limited to legacy sidebar in modern Vector.
- Logged as `event.context = 'languages-list'`

In [2]:
query <- 
"
SELECT
    TO_DATE(dt) AS `date`,
    wiki,
    event.web_session_id,
    event.usereditbucket,
    event.timetochangelanguage,
    event.isanon,
    event.interfacelanguage,
    event.contentlanguage,
    event.selectedinterfacelanguage,
    Count(*) AS n_events
FROM event.universallanguageselector
WHERE
    year = 2021
    AND ((month = 4 AND day > 26) OR (month = 5))
    AND event.context = 'languages-list'
GROUP BY
    TO_DATE(dt),
    wiki,
    event.web_session_id,
    event.usereditbucket,
    event.timetochangelanguage,
    event.isanon,
    event.interfacelanguage,
    event.contentlanguage,
    event.selectedinterfacelanguage
"

In [3]:
lang_link_events <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [4]:
lang_link_events$date <- as.Date(lang_link_events$date)

## Daily Language Link Events

In [5]:
lang_link_events_daily <- lang_link_events %>%
    group_by(date) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

lang_link_events_daily

`summarise()` ungrouping output (override with `.groups` argument)



date,n_events,n_sessions
<date>,<int>,<int>
2021-04-28,791,636
2021-04-29,46934,33173
2021-04-30,274519,177001
2021-05-01,243709,151370
2021-05-02,258630,161650
2021-05-03,303583,195465
2021-05-04,300642,194931
2021-05-05,295520,191876
2021-05-06,291788,190268
2021-05-07,265159,171239


Events were first recorded on 28 April 2021. There are an average of 176,827 sessions per day, including sessions by both logged-in and logged-out users. No unexpected spikes or drops so far.
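
One rough way to back up the spike and drop check is to look at day-over-day changes in the daily table above. The sketch below is an illustration only: the 50% threshold is an assumption, not part of the original analysis, and the `daily` data frame simply re-enters the session counts printed above.

```r
# Daily session counts copied from the table above.
daily <- data.frame(
  date = as.Date("2021-04-28") + 0:9,
  n_sessions = c(636, 33173, 177001, 151370, 161650,
                 195465, 194931, 191876, 190268, 171239)
)

# Relative change versus the previous day (NA for the first day).
daily$pct_change <- c(NA, diff(daily$n_sessions) / head(daily$n_sessions, -1))

# Hypothetical threshold: flag any day that moves by more than 50%.
threshold <- 0.5
flagged <- daily[!is.na(daily$pct_change) & abs(daily$pct_change) > threshold, ]
flagged
```

With these numbers, only the first days after logging began, when event volume was still climbing, exceed the threshold.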

## Clicks per session

Check that there are duplicate session IDs. Some sessions should have more than one click event, so the comparison below should return `FALSE`.

In [9]:
length(unique(lang_link_events$web_session_id)) == nrow(lang_link_events)
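
The comparison above only says whether any session ID repeats. A minimal sketch of a slightly more informative check, using a made-up `events` table in place of `lang_link_events`, counts how many sessions span more than one row:

```r
library(dplyr)

# Toy stand-in for lang_link_events; the session IDs are made up.
events <- tibble(
  web_session_id = c("a", "a", "b", "c", "c", "c"),
  n_events       = c(1L, 1L, 1L, 1L, 2L, 1L)
)

# TRUE when at least one session ID appears in more than one row.
has_duplicates <- anyDuplicated(events$web_session_id) > 0

# Number of rows per session, and how many sessions span several rows.
rows_per_session <- count(events, web_session_id, name = "n_rows")
n_multi_row_sessions <- sum(rows_per_session$n_rows > 1)
```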

In [6]:
head(lang_link_events)

Unnamed: 0_level_0,date,wiki,web_session_id,usereditbucket,timetochangelanguage,isanon,interfacelanguage,contentlanguage,selectedinterfacelanguage,n_events
Unnamed: 0_level_1,<date>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<int>
1,2021-04-28,dewikivoyage,1d774b99490d092ba4e4,100-999 edits,6208.225,False,de,de,it,1
2,2021-04-28,dewikivoyage,95f1f7824782d3b0a91b,,98815.0,True,de,de,en,1
3,2021-04-28,dewikivoyage,e06266b83d26d1865339,,128051.0,True,de,de,en,1
4,2021-04-28,frwiktionary,082c55c27c55a4860c58,,8196.865,True,fr,fr,en,1
5,2021-04-28,frwiktionary,18ecbc928fb6bfc5df5e,,5913.57,True,fr,fr,pt,1
6,2021-04-28,frwiktionary,1dbcf88e81ed2a100b27,,40165.625,True,fr,fr,en,1


In [12]:
lang_link_events_persession <- lang_link_events %>%
    group_by(isanon) %>%
    summarize(avg_clicks = mean(n_events),
              max_clicks = max(n_events),
              min_clicks = min(n_events))

lang_link_events_persession

`summarise()` ungrouping output (override with `.groups` argument)



isanon,avg_clicks,max_clicks,min_clicks
<chr>,<dbl>,<int>,<int>
False,1.000327,2,1
True,1.000169,4,1
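
Note that each row of `lang_link_events` is one combination of date, wiki, session, and language fields, so the summary above averages over rows rather than over sessions. If true per-session figures are wanted, one possible approach is to total the clicks for each session first. The sketch below uses made-up toy data and hypothetical names (`events`, `clicks_per_session`, `per_session_summary`):

```r
library(dplyr)

# Toy stand-in for lang_link_events; the values are made up.
events <- tibble(
  web_session_id = c("s1", "s1", "s2", "s3", "s3", "s3"),
  isanon         = c("true", "true", "true", "false", "false", "false"),
  n_events       = c(1L, 1L, 1L, 1L, 2L, 1L)
)

# Step 1: total clicks for each session.
clicks_per_session <- events %>%
  group_by(isanon, web_session_id) %>%
  summarize(n_clicks = sum(n_events), .groups = "drop")

# Step 2: summarize those session totals by logged-in status.
per_session_summary <- clicks_per_session %>%
  group_by(isanon) %>%
  summarize(avg_clicks = mean(n_clicks),
            max_clicks = max(n_clicks),
            min_clicks = min(n_clicks),
            .groups = "drop")
```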


## By Logged In Status

In [13]:
lang_link_events_isanon <- lang_link_events %>%
    group_by(isanon) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

lang_link_events_isanon 

`summarise()` ungrouping output (override with `.groups` argument)



isanon,n_events,n_sessions
<chr>,<int>,<int>
False,113161,46278
True,3568719,2058676


97.8% of all sessions with clicks on the language links are by logged-out users. That is high but expected, because instrumentation was limited to the legacy sidebar in modern Vector (not legacy Vector or other skins such as Timeless). The new language switching functionality was made available to all logged-in users opted into the latest version of the Vector skin.

The legacy sidebar in modern Vector would mostly appear to logged-out users on test wikis where Vector is deployed as default.
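
The 97.8% figure can be reproduced from the session counts in the table above. This sketch assumes the two groups do not share sessions, which the table itself does not guarantee:

```r
# Session counts copied from the table above.
n_sessions <- c(logged_in = 46278, logged_out = 2058676)

# Percentage of sessions in each group.
pct_sessions <- 100 * n_sessions / sum(n_sessions)
round(pct_sessions, 1)
```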


## By Test Wiki

In [36]:
# Events and sessions with language link clicks, by test wiki
lang_link_events_testwiki <- lang_link_events %>%
    filter(wiki %in% c('frwiktionary', 'hewiki', 'ptwikiversity', 'frwiki', 
    'euwiki', 'fawiki', 'ptwiki', 'kowiki', 'trwiki', 'srwiki', 'bnwiki', 'dewikivoyage', 'vecwiki' )) %>%
    group_by(wiki) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

lang_link_events_testwiki


`summarise()` ungrouping output (override with `.groups` argument)



wiki,n_events,n_sessions
<chr>,<int>,<int>
bnwiki,8451,5726
dewikivoyage,755,604
euwiki,31609,22351
fawiki,200781,110092
frwiki,1933743,1091442
frwiktionary,21307,13350
hewiki,173526,100165
kowiki,146461,81522
ptwiki,555228,328675
ptwikiversity,15,13


In [39]:
# Events and sessions with language link clicks, by test wiki category and logged-in status
lang_link_events_testwiki_isanon <- lang_link_events %>%
    mutate(istestwiki = ifelse(wiki %in% c('frwiktionary', 'hewiki', 'ptwikiversity', 'frwiki', 
    'euwiki', 'fawiki', 'ptwiki', 'kowiki', 'trwiki', 'srwiki', 'bnwiki', 'dewikivoyage', 'vecwiki' ), 'test_wiki', 'non_test_wiki')) %>%
    group_by(istestwiki, isanon) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

lang_link_events_testwiki_isanon

`summarise()` regrouping output by 'istestwiki' (override with `.groups` argument)



istestwiki,isanon,n_events,n_sessions
<chr>,<chr>,<int>,<int>
non_test_wiki,False,3540,1422
non_test_wiki,True,147,80
test_wiki,False,102371,41836
test_wiki,True,3325559,1922334


Almost all of the events recorded to date (over 99%) have been on test wikis. This is expected: on non-test wikis, the new language switcher button was deployed to all users opted in to modern Vector, whereas users with modern Vector on the test wikis have not been shown the new language switcher and still see the language links in the sidebar.

On non-test wikis, the majority (94.67%) of sessions with clicks on the language links in modern Vector come from logged-in users. We still need to confirm whether a logged-in user on modern Vector on a non-test wiki can see the language links in the sidebar.
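
The two percentages above can be derived from the summary table. The sketch below re-enters the printed counts in a hypothetical `counts` tibble; the `isanon` labels are copied as displayed in the table.

```r
library(dplyr)

# Counts copied from the table above.
counts <- tibble(
  istestwiki = c("non_test_wiki", "non_test_wiki", "test_wiki", "test_wiki"),
  isanon     = c("False", "True", "False", "True"),
  n_events   = c(3540, 147, 102371, 3325559),
  n_sessions = c(1422, 80, 41836, 1922334)
)

# Share of all events by wiki category.
event_share <- counts %>%
  group_by(istestwiki) %>%
  summarize(n_events = sum(n_events), .groups = "drop") %>%
  mutate(pct_events = 100 * n_events / sum(n_events))

# Share of sessions by logged-in status within each wiki category.
session_share <- counts %>%
  group_by(istestwiki) %>%
  mutate(pct_sessions = 100 * n_sessions / sum(n_sessions)) %>%
  ungroup()
```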


## By User Edit Bucket

In [30]:
logged_in_editcount <- lang_link_events %>%
    filter(isanon == 'false') %>%
    group_by(usereditbucket) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

logged_in_editcount

`summarise()` ungrouping output (override with `.groups` argument)



usereditbucket,n_events,n_sessions
<chr>,<int>,<int>
0 edits,20092,10961
1-4 edits,11510,5643
100-999 edits,17264,6533
1000+ edits,31482,8932
5-99 edits,25556,11332
,7,5


In [31]:
logged_out_editcount <- lang_link_events %>%
    filter(isanon == 'true') %>%
    group_by(usereditbucket) %>%
    summarize(n_events = sum(n_events),
             n_sessions = n_distinct(web_session_id))

logged_out_editcount

`summarise()` ungrouping output (override with `.groups` argument)



usereditbucket,n_events,n_sessions
<chr>,<int>,<int>
5-99 edits,1,1
,3325705,1922412


There are just a few instances (under 0.01%) where the `event.usereditbucket` field is populated for logged-out users or recorded as NULL for logged-in users. Further investigation might be needed; however, the number of these events is too low to skew the data.
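
If further investigation is needed, the inconsistent rows could be pulled out directly. This is a minimal sketch on made-up data; it assumes a missing bucket arrives as either `NA` or an empty string, which should be checked against the real query output.

```r
library(dplyr)

# Toy rows; assumes a missing bucket is NA or an empty string.
events <- tibble(
  web_session_id = c("s1", "s2", "s3", "s4"),
  isanon         = c("true", "true", "false", "false"),
  usereditbucket = c(NA, "5-99 edits", "0 edits", NA)
)

# Keep logged-out rows that have a bucket and logged-in rows that lack one.
inconsistent <- events %>%
  mutate(has_bucket = !is.na(usereditbucket) & usereditbucket != "") %>%
  filter((isanon == "true" & has_bucket) | (isanon == "false" & !has_bucket))
```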

## By Final Language

In [43]:
# test that you can switch from one language to the next.
top_final_languages <- lang_link_events %>%
    mutate(all_sessions = n_distinct(web_session_id)) %>%
    group_by(selectedinterfacelanguage) %>%
    summarize(n_sessions = n_distinct(web_session_id),
             pct_sessions = n_sessions/all_sessions) %>%
    distinct() %>%
    arrange(desc(n_sessions))

head(top_final_languages )

`summarise()` regrouping output by 'selectedinterfacelanguage' (override with `.groups` argument)



selectedinterfacelanguage,n_sessions,pct_sessions
<chr>,<int>,<dbl>
en,1443105,0.73470332
es,128358,0.06534871
de,125061,0.06367016
it,65010,0.03309743
ar,64408,0.03279094
ru,52428,0.02669177


The most frequent language switches are to English (73% of sessions), followed by Spanish (6.5%) and German (6.4%).
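
The same shares can be computed without the `mutate()` and `distinct()` steps by counting total sessions once up front. The sketch below uses made-up toy data and a hypothetical `total_sessions` variable. Because a session that switches to several languages is counted once per language, the shares can add up to more than 100%.

```r
library(dplyr)

# Toy stand-in for lang_link_events; session s4 switches to two languages.
events <- tibble(
  web_session_id            = c("s1", "s2", "s3", "s4", "s4"),
  selectedinterfacelanguage = c("en", "en", "es", "en", "de")
)

# Total number of distinct sessions, computed once.
total_sessions <- n_distinct(events$web_session_id)

# Distinct sessions per target language and their share of all sessions.
top_final <- events %>%
  group_by(selectedinterfacelanguage) %>%
  summarize(n_sessions = n_distinct(web_session_id), .groups = "drop") %>%
  mutate(pct_sessions = n_sessions / total_sessions) %>%
  arrange(desc(n_sessions))
```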

## By Initial Language 

In [48]:

top_initial_languages <- lang_link_events %>%
    mutate(all_sessions = n_distinct(web_session_id)) %>%
    group_by(interfacelanguage, contentlanguage) %>%
    summarize(n_sessions = n_distinct(web_session_id),
             pct_sessions = n_sessions/all_sessions) %>%
    distinct() %>%
    arrange(desc(n_sessions))

head(top_initial_languages )

`summarise()` regrouping output by 'interfacelanguage', 'contentlanguage' (override with `.groups` argument)



interfacelanguage,contentlanguage,n_sessions,pct_sessions
<chr>,<chr>,<int>,<dbl>
fr,fr,1103742,0.56192925
pt,pt,328012,0.16699513
tr,tr,150977,0.07686433
fa,fa,109989,0.05599681
he,he,100125,0.05097493
ko,ko,81429,0.04145655


The `interfacelanguage` and `contentlanguage` fields are expected to match in most instances, which is confirmed here.
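
To put a number on how often the two fields match, the share of events with identical values could be computed directly. A minimal sketch on made-up data (the `pairs` tibble is hypothetical):

```r
library(dplyr)

# Toy event counts per language pair; the values are made up.
pairs <- tibble(
  interfacelanguage = c("fr", "pt", "en", "fr"),
  contentlanguage   = c("fr", "pt", "fr", "fr"),
  n_events          = c(10L, 5L, 1L, 4L)
)

# Percentage of events where interface and content language are identical.
pct_matching <- pairs %>%
  summarize(pct = 100 * sum(n_events[interfacelanguage == contentlanguage]) /
              sum(n_events)) %>%
  pull(pct)
```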

The top initial languages are all from test wikis, which is expected since the language links are still shown to all logged-in and logged-out users on modern Vector there. The new language switcher, which replaces the language links with a button, is shown to all logged-in users opted in to modern Vector on non-test wikis.

## Most Frequent Switch Types


In [49]:
# Use a separate name so the earlier top_final_languages result is not overwritten
top_switch_types <- lang_link_events %>%
    mutate(all_sessions = n_distinct(web_session_id)) %>%
    group_by(interfacelanguage, selectedinterfacelanguage) %>%
    summarize(n_sessions = n_distinct(web_session_id),
             pct_sessions = n_sessions/all_sessions) %>%
    distinct() %>%
    arrange(desc(n_sessions))

head(top_switch_types)

`summarise()` regrouping output by 'interfacelanguage', 'selectedinterfacelanguage' (override with `.groups` argument)



interfacelanguage,selectedinterfacelanguage,n_sessions,pct_sessions
<chr>,<chr>,<int>,<dbl>
fr,en,775769,0.39495398
pt,en,264373,0.1345957
tr,en,118824,0.06049483
fa,en,95622,0.04868239
fr,de,94915,0.04832245
he,en,84668,0.04310557


39% of all sessions include a switch from French to English.