# Non-tagged Activation Productivity in Article and Article Talk

This notebook is similar to 10-article-productivity-24hr.ipynb, except that it uses data for non-tagged edits only.

It applies the same registration-timestamp cutoff as 12-nontagged-article-productivity.ipynb, for the same reasons.

In [1]:
# https://stackoverflow.com/a/35018739/1091835
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)

In [30]:
library(data.table)
library(ggplot2)

library(brms) # install.packages("brms")
library(loo) # install.packages("loo")
options(mc.cores = 4)
library(rstanarm) # install.packages("rstanarm")

library(lme4)

Loading required package: Matrix


Attaching package: ‘lme4’


The following object is masked from ‘package:brms’:

    ngrps




## Configuration variables

In [3]:
## Set BLAS threads to 1 so glmer and loo don't use all cores
library(RhpcBLASctl)
blas_set_num_threads(1)

## parallelization
options(mc.cores = 4)

## Data import and setup

In [4]:
nontagged_edit_data = fread(
    '/home/nettrom/src/Growth-homepage-2019/datasets/newcomer_tasks_nontagged_edits_nov2020.tsv',
    colClasses = c(wiki_db = 'factor'))

In [5]:
## Configuration variables for this experiment.
## Start timestamp is from https://phabricator.wikimedia.org/T227728#5680453
## End timestamp is from the data gathering notebook
start_ts = as.POSIXct('2019-11-21 00:24:32', tz = 'UTC')
end_ts = as.POSIXct('2020-04-09 00:00:00', tz = 'UTC')

## Start of the Variant A/B test
variant_test_ts = as.POSIXct('2019-12-13 00:32:04', tz = 'UTC')

## Convert user_registration into a timestamp
nontagged_edit_data[, user_reg_ts := as.POSIXct(user_registration_timestamp,
                                           format = '%Y-%m-%d %H:%M:%S.0', tz = 'UTC')]

## Calculate time since start of experiment in days and weeks (log-transformed)
nontagged_edit_data[, exp_days := 0]
nontagged_edit_data[, exp_days := difftime(user_reg_ts, start_ts, units = 'days')]
nontagged_edit_data[exp_days < 0, exp_days := 0]
nontagged_edit_data[, ln_exp_days := log(1 + as.numeric(exp_days))]
nontagged_edit_data[, ln_exp_weeks := log(1 + as.numeric(exp_days)/7)]

## Calculate time since the start of the variant test, again in days and weeks.
## This enables us to do an interrupted time-series model for that.
nontagged_edit_data[, variant_exp_days := 0]
nontagged_edit_data[, variant_exp_days := difftime(user_reg_ts, variant_test_ts, units = 'days')]
nontagged_edit_data[variant_exp_days < 0, variant_exp_days := 0]
nontagged_edit_data[, ln_var_exp_days := log(1 + as.numeric(variant_exp_days))]
nontagged_edit_data[, ln_var_exp_weeks := log(1 + as.numeric(variant_exp_days)/7)]
nontagged_edit_data[, in_var_exp := 0]
nontagged_edit_data[user_reg_ts > variant_test_ts, in_var_exp := 1]

## Convert all NAs to 0, from
## https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
na_to_zero = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

na_to_zero(nontagged_edit_data)

## Turn "reg_on_mobile" into a factor for more meaningful plots
nontagged_edit_data[, platform := 'desktop']
nontagged_edit_data[reg_on_mobile == 1, platform := 'mobile']
nontagged_edit_data[, platform := factor(platform)]

## Control variables for various forms of activation
nontagged_edit_data[, is_activated_article := num_article_edits_24hrs > 0]
nontagged_edit_data[, is_activated_other := num_other_edits_24hrs > 0]
nontagged_edit_data[, is_activated := is_activated_article | is_activated_other]

## Control variables for constructive forms of activation
nontagged_edit_data[, is_const_activated_article := (num_article_edits_24hrs - num_article_reverts_24hrs) > 0]
nontagged_edit_data[, is_const_activated_other := (num_other_edits_24hrs - num_other_reverts_24hrs) > 0]
nontagged_edit_data[, is_const_activated := is_const_activated_article | is_const_activated_other]

## Control variables for the number of edits made
nontagged_edit_data[, log_num_article_edits_24hrs := log(1 + num_article_edits_24hrs)]
nontagged_edit_data[, log_num_other_edits_24hrs := log(1 + num_other_edits_24hrs)]
nontagged_edit_data[, log_num_edits_24hrs := log(1 + num_article_edits_24hrs + num_other_edits_24hrs)]

## Control variables for the constructive number of edits made
nontagged_edit_data[, log_num_const_article_edits_24hrs := log(
    1 + num_article_edits_24hrs - num_article_reverts_24hrs)]
nontagged_edit_data[, log_num_const_other_edits_24hrs := log(
    1 + num_other_edits_24hrs - num_other_reverts_24hrs)]
nontagged_edit_data[, log_num_const_edits_24hrs := log(
    1 + num_article_edits_24hrs + num_other_edits_24hrs -
    num_article_reverts_24hrs - num_other_reverts_24hrs)]

## Retention variables
nontagged_edit_data[, is_const_retained_article := is_const_activated_article &
               ((num_article_edits_2w - num_article_reverts_2w) > 0)]
nontagged_edit_data[, is_const_retained_other := is_const_activated_other &
               ((num_other_edits_2w - num_other_reverts_2w) > 0)]
nontagged_edit_data[, is_const_retained := is_const_activated &
               ((num_article_edits_2w + num_other_edits_2w -
                 num_article_reverts_2w - num_other_reverts_2w) > 0)]

## Variables for number of edits (overall and only constructive)
## across the entire period.
nontagged_edit_data[, num_total_edits_24hrs := num_article_edits_24hrs + num_other_edits_24hrs]
nontagged_edit_data[, num_total_edits_2w := num_article_edits_2w + num_other_edits_2w]
nontagged_edit_data[, num_total_edits := num_total_edits_24hrs + num_total_edits_2w]

nontagged_edit_data[, num_total_const_edits_24hrs := (num_article_edits_24hrs + num_other_edits_24hrs -
                                                 num_article_reverts_24hrs - num_other_reverts_24hrs)]
nontagged_edit_data[, num_total_const_edits_2w := (num_article_edits_2w + num_other_edits_2w -
                                              num_article_reverts_2w - num_other_reverts_2w)]
nontagged_edit_data[, num_total_const_edits := num_total_const_edits_24hrs + num_total_const_edits_2w]

## Variables for number of article edits across the entire period.
nontagged_edit_data[, num_total_article_edits := num_article_edits_24hrs + num_article_edits_2w]
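
The time-since-start variables computed in the cell above can be checked on a few hand-picked timestamps. The following is a minimal base R sketch, not part of the original analysis: the three registration timestamps are made up for illustration, and only `start_ts` is taken from the notebook.

```r
## Experiment start, as defined in the notebook
start_ts <- as.POSIXct('2019-11-21 00:24:32', tz = 'UTC')

## Hypothetical registrations: before the start, 7 days after, 30 days after
reg_ts <- as.POSIXct(c('2019-11-20 12:00:00',
                       '2019-11-28 00:24:32',
                       '2019-12-21 00:24:32'), tz = 'UTC')

## Days since the start of the experiment, floored at zero
exp_days <- as.numeric(difftime(reg_ts, start_ts, units = 'days'))
exp_days <- pmax(exp_days, 0)

## Log-transformed versions, matching log(1 + x) in the notebook
ln_exp_days <- log1p(exp_days)
ln_exp_weeks <- log1p(exp_days / 7)

data.frame(reg_ts, exp_days, ln_exp_days, ln_exp_weeks)
```

Registrations before the start of the experiment are clamped to zero days, so their log-transformed values are also zero.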

In [6]:
## Registration cutoff (see notes above)
reg_cutoff = as.POSIXct('2019-12-13 09:00:00', tz = 'UTC')

eligible_user_edit_data = nontagged_edit_data[user_reg_ts > reg_cutoff]

## Priors

In [7]:
## Note that using a student_t distribution for the prior is beneficial because that
## distribution handles outliers better than a Normal.
## See https://jrnold.github.io/bayesian_notes/robust-regression.html
## Thanks to Mikhail for sending that to me!
priors <- prior(cauchy(0, 2), class = sd) +
  prior(student_t(5, 0, 10), class = b)
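
To illustrate what the heavier tails of `student_t(5, 0, 10)` mean in practice, the sketch below compares its tail mass with that of a Normal with the same location and scale. The `normal(0, 10)` comparison and the threshold of 30 are assumptions chosen for illustration; neither appears in the notebook.

```r
## Probability mass beyond |b| > 30 under two candidate priors with scale 10.
## The threshold (three scale units) is an arbitrary illustrative choice.
threshold <- 30
prior_scale <- 10

## Normal(0, 10): hypothetical alternative, not used in the notebook
normal_tail <- 2 * pnorm(-threshold / prior_scale)

## Student-t with 5 degrees of freedom, location 0, scale 10
student_tail <- 2 * pt(-threshold / prior_scale, df = 5)

c(normal = normal_tail, student_t = student_tail,
  ratio = student_tail / normal_tail)
```

The Student-t prior places considerably more mass far from zero, so it constrains large coefficients less strongly than the Normal would.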

## Edits during the first 24 hours

We base this model on the one used across all namespaces, which means we do not model group-level variation in the effect of mobile registration. With only four wikis in the dataset, we do not expect such a group-level term to carry meaningful information.
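
The model uses a zero-inflated negative binomial family. As a rough intuition for what that family assumes, the following base R sketch simulates counts from such a process. All parameter values (`mu`, `shape`, `zi`) are made up for illustration and are not estimates from this notebook.

```r
set.seed(42)
n <- 10000

## Illustrative parameters only (assumptions, not fitted values)
mu <- 1.5     # mean of the negative binomial count component
shape <- 0.5  # dispersion; smaller values mean more overdispersion
zi <- 0.6     # probability of a structural zero

## Count component: negative binomial edit counts
counts <- rnbinom(n, size = shape, mu = mu)

## Zero-inflation component: some users never edit at all
structural_zero <- rbinom(n, size = 1, prob = zi) == 1
edits <- ifelse(structural_zero, 0, counts)

## Share of zeros with and without zero inflation
c(nb_zero_share = mean(counts == 0),
  zinb_zero_share = mean(edits == 0))
```

The zero-inflation part adds zeros on top of those the negative binomial already produces, which is why the `zi` formula gets its own predictors in the model specification.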

In [None]:
nontagged_article_edits_24hr.zinb.mod.1 <- brm(
  bf(num_article_edits_24hrs ~ is_treatment + reg_on_mobile + (1 | wiki_db),
     zi ~ wiki_db + reg_on_mobile),
    data = eligible_user_edit_data,
    family = zero_inflated_negbinomial(),
    prior = priors,
    iter = 800,
    control = list(adapt_delta = 0.999,
                 max_treedepth = 15)
)

Compiling Stan program...

Start sampling



In [None]:
## Save the model
save(nontagged_article_edits_24hr.zinb.mod.1,
     file='models/nontagged_article_edits_24hr.zinb.mod.1.Robj')

In [10]:
summary(nontagged_article_edits_24hr.zinb.mod.1)

 Family: zero_inflated_negbinomial 
  Links: mu = log; shape = identity; zi = logit 
Formula: num_article_edits_24hrs ~ is_treatment + reg_on_mobile + (1 | wiki_db) 
         zi ~ wiki_db + reg_on_mobile
   Data: eligible_user_edit_data (Number of observations: 85235) 
Samples: 4 chains, each with iter = 800; warmup = 400; thin = 1;
         total post-warmup samples = 1600

Group-Level Effects: 
~wiki_db (Number of levels: 4) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.64      0.47     0.24     1.99 1.01      440      513

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           -0.33      0.37    -1.14     0.33 1.01      298      367
zi_Intercept        -1.58      0.13    -1.87    -1.34 1.01      830      797
is_treatment        -0.09      0.02    -0.13    -0.04 1.01     1738     1164
reg_on_mobile        0.36      0.03     0.31     0.42 1.00     1234     1165
zi_wiki

## Estimated effects

First, the average number of edits in the control group (computed on the log(1 + x) scale and transformed back):

In [18]:
ctrl_grp_mean = exp(mean(log(1 + eligible_user_edit_data[is_treatment == 0]$num_article_edits_24hrs))) -1
ctrl_grp_mean

The average number of edits in the Homepage group:

In [20]:
homepage_grp_mean = exp(fixef(nontagged_article_edits_24hr.zinb.mod.1, pars = 'is_treatment')[1] +
                        mean(log(1 + eligible_user_edit_data[is_treatment == 0]$num_article_edits_24hrs))) -1
homepage_grp_mean

In [26]:
(ctrl_grp_mean - homepage_grp_mean)

Let's find the 95% credible interval:

In [21]:
homepage_grp_low = exp(fixef(nontagged_article_edits_24hr.zinb.mod.1, pars = 'is_treatment')[3] +
                       mean(log(1 + eligible_user_edit_data[is_treatment == 0]$num_article_edits_24hrs))) -1
homepage_grp_low

In [27]:
(ctrl_grp_mean - homepage_grp_low)

In [22]:
homepage_grp_high = exp(fixef(nontagged_article_edits_24hr.zinb.mod.1, pars = 'is_treatment')[4] +
                        mean(log(1 + eligible_user_edit_data[is_treatment == 0]$num_article_edits_24hrs))) -1
homepage_grp_high

In [28]:
(ctrl_grp_mean - homepage_grp_high)

In [23]:
(ctrl_grp_mean - homepage_grp_mean) / ctrl_grp_mean

In [24]:
(ctrl_grp_mean - homepage_grp_low) / ctrl_grp_mean

In [25]:
(ctrl_grp_mean - homepage_grp_high) / ctrl_grp_mean

In summary, we find that the Control group makes an average of $0.25$ non-tagged edits in the Article & Article talk namespaces in the first 24 hours after registration. The Homepage group is estimated to make an average of $0.15$ non-tagged edits ($-0.1$ edits, or $-40.3\%$). The 95% credible interval for the Homepage group's estimate is $[0.10, 0.21]$ edits, which corresponds to $[-0.16, -0.05]$ edits relative to the Control group, or $[-61.1\%, -19.0\%]$.
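
The back-transformation behind these numbers can be sketched with the rounded values shown in this notebook. This is an approximation: it plugs in the rounded control-group mean ($0.25$) and the rounded `is_treatment` coefficient and interval from the model summary, so the results differ slightly from the figures reported above, which use unrounded values.

```r
## Rounded inputs taken from the summary output and the text above
ctrl_grp_mean <- 0.25
treatment_coef <- c(estimate = -0.09, lower = -0.13, upper = -0.04)

## Shift the control mean on the log(1 + x) scale, then transform back
homepage_grp <- exp(treatment_coef + log(1 + ctrl_grp_mean)) - 1

## Absolute and relative differences to the control group
abs_diff <- homepage_grp - ctrl_grp_mean
rel_diff <- 100 * abs_diff / ctrl_grp_mean

round(rbind(homepage_grp, abs_diff, rel_diff), 2)
```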

## Non-tagged Activation in the Article and Article Talk namespaces

This is the model from 01-activation.ipynb used for estimating the effect of the Homepage on activation (editing within 24 hours after registration), here fitted to non-tagged edits.

In [39]:
act.model.article.full = glmer(formula = is_activated_article ~
                               is_treatment + reg_on_mobile +
                    (1 + reg_on_mobile | wiki_db),
                    family = binomial(link = "logit"), data = eligible_user_edit_data)
summary(act.model.article.full)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
is_activated_article ~ is_treatment + reg_on_mobile + (1 + reg_on_mobile |  
    wiki_db)
   Data: eligible_user_edit_data

     AIC      BIC   logLik deviance df.resid 
 84271.7  84327.8 -42129.9  84259.7    85229 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-0.8357 -0.4733 -0.4501 -0.4248  2.3542 

Random effects:
 Groups  Name          Variance Std.Dev. Corr 
 wiki_db (Intercept)   0.21488  0.4636        
         reg_on_mobile 0.01634  0.1278   -0.75
Number of obs: 85235, groups:  wiki_db, 4

Fixed effects:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.99468    0.19835  -5.015 5.31e-07 ***
is_treatment  -0.07502    0.02110  -3.556 0.000376 ***
reg_on_mobile -0.22924    0.06153  -3.726 0.000195 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (In

In [40]:
ranef(act.model.article.full)

$wiki_db
       (Intercept) reg_on_mobile
arwiki  -0.5013052    0.20380990
cswiki   0.6357887   -0.12069570
kowiki   0.2433184   -0.04651067
viwiki  -0.3771000   -0.03637751

with conditional variances for “wiki_db” 

In [41]:
coef(act.model.article.full)

$wiki_db
       (Intercept) is_treatment reg_on_mobile
arwiki  -1.4959858  -0.07501994   -0.02542987
cswiki  -0.3588920  -0.07501994   -0.34993547
kowiki  -0.7513622  -0.07501994   -0.27575044
viwiki  -1.3717807  -0.07501994   -0.26561728

attr(,"class")
[1] "coef.mer"

In [42]:
## Per-wiki effect of registration on mobile, in percentage points ("divide by 4" rule):
100 * coef(act.model.article.full)$wiki_db$reg_on_mobile /4

In [43]:
## Using the "divide by 4" rule to get an estimate of the effect:
100 * coef(act.model.article.full)$wiki_db$is_treatment[1] /4

In [44]:
## Activation overall
activation_talk = eligible_user_edit_data[, list(num_users = .N),
                                 by = c('wiki_db', 'reg_on_mobile', 'is_treatment', 'is_activated_article')]
activation_talk[, percent := num_users / sum(num_users) * 100,
                by = c('wiki_db', 'reg_on_mobile', 'is_treatment')]
activation_talk[order(wiki_db, reg_on_mobile, is_treatment, is_activated_article)]


wiki_db,reg_on_mobile,is_treatment,is_activated_article,num_users,percent
<fct>,<int>,<int>,<lgl>,<int>,<dbl>
arwiki,0,0,False,2768,81.89349
arwiki,0,0,True,612,18.10651
arwiki,0,1,False,10635,82.84646
arwiki,0,1,True,2202,17.15354
arwiki,1,0,False,5592,81.7066
arwiki,1,0,True,1252,18.2934
arwiki,1,1,False,22502,83.19284
arwiki,1,1,True,4546,16.80716
cswiki,0,0,False,613,58.38095
cswiki,0,0,True,437,41.61905


So we find a significant negative effect of the Homepage on activation by editing an article (or its talk page) when using non-tagged edits: by the "divide by 4" rule, Homepage users are about 1.9pp less likely to activate. When using all edits, we found a +2.5pp change in the probability of activation, so the effect is comparable in magnitude but opposite in direction.
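
The "divide by 4" rule gives an upper bound on the change in probability implied by a logit coefficient. As a rough check, the sketch below compares it with the difference in predicted probability at the fixed-effect intercept, using the estimates printed in the model summary. Evaluating at the intercept alone (a desktop registration with a wiki-level deviation of zero) is an assumption made here for illustration.

```r
## Fixed-effect estimates from the glmer summary above
intercept <- -0.99468
beta_treatment <- -0.07502

## "Divide by 4" rule: upper bound on the change, in percentage points
div4_pp <- 100 * beta_treatment / 4

## Difference in predicted probability at the intercept
p_control <- plogis(intercept)
p_homepage <- plogis(intercept + beta_treatment)
exact_pp <- 100 * (p_homepage - p_control)

c(div4_pp = div4_pp, exact_pp = exact_pp)
```

Because the baseline probability is below 0.5, the difference at the intercept is smaller in magnitude than the divide-by-4 figure.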