# Proportion of Users Making a Newcomer Task Edit

In [T266982#6629305](https://phabricator.wikimedia.org/T266982#6629305), Marshall asks:

"if our model estimates an 85% increase in productivity from newcomers, and no part of that increase is from non-suggested edits, then why do we see that only about 2% of newcomer edits come from suggested edits? Could you please spend a little time thinking that through?"

I think this can be explained using the datasets from the same analysis (covering a period when the "newcomer task" tag was in use), by measuring the proportion of users who made a tagged edit out of all users who made an edit. That gives us a sense of how large an impact Newcomer Tasks had.

In [1]:
# https://stackoverflow.com/a/35018739/1091835
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)

In [2]:
library(data.table)
library(ggplot2)

library(brms) # install.packages("brms")
library(loo) # install.packages("loo")
options(mc.cores = 4)
library(rstanarm) # install.packages("rstanarm")

Loading required package: Rcpp

Loading 'brms' package (version 2.14.0).

This is loo version 2.3.1

This is rstanarm version 2.21.1

## Configuration variables

In [3]:
## Set BLAS threads to 1 so glmer and loo don't use all cores
library(RhpcBLASctl)
blas_set_num_threads(1)

## parallelization
options(mc.cores = 4)

## Data import and setup

`user_edit_data` holds the full dataset, counting all edits. `nontagged_edit_data` holds the same users and measures, but counts only edits not tagged with "newcomer task". We'll restrict the latter to users who registered after the tag was deployed, then join it with the former. Subtracting the non-tagged edit counts from the full edit counts then gives each user's number of tagged edits.

In [4]:
nontagged_edit_data = fread(
    '/home/nettrom/src/Growth-homepage-2019/datasets/newcomer_tasks_nontagged_edits_nov2020.tsv',
    colClasses = c(wiki_db = 'factor'))

In [5]:
user_edit_data = fread('/home/nettrom/src/Growth-homepage-2019/datasets/newcomer_tasks_edit_data_may2020.tsv',
                       colClasses = c(wiki_db = 'factor'))

In [6]:
## Configuration variables for this experiment.
## Start timestamp is from https://phabricator.wikimedia.org/T227728#5680453
## End timestamp is from the data gathering notebook
start_ts = as.POSIXct('2019-11-21 00:24:32', tz = 'UTC')
end_ts = as.POSIXct('2020-04-9 00:00', tz = 'UTC')

## Start of the Variant A/B test
variant_test_ts = as.POSIXct('2019-12-13 00:32:04', tz = 'UTC')

## Convert user_registration into a timestamp
user_edit_data[, user_reg_ts := as.POSIXct(user_registration_timestamp,
                                           format = '%Y-%m-%d %H:%M:%S.0', tz = 'UTC')]

## Calculate time since start of experiment, in days and weeks
user_edit_data[, exp_days := 0]
user_edit_data[, exp_days := difftime(user_reg_ts, start_ts, units = 'days')]
user_edit_data[exp_days < 0, exp_days := 0]
user_edit_data[, ln_exp_days := log(1 + as.numeric(exp_days))]
user_edit_data[, ln_exp_weeks := log(1 + as.numeric(exp_days)/7)]

## Calculate time since the start of the variant test, again in days and weeks.
## This enables us to do an interrupted time-series model for that.
user_edit_data[, variant_exp_days := 0]
user_edit_data[, variant_exp_days := difftime(user_reg_ts, variant_test_ts, units = 'days')]
user_edit_data[variant_exp_days < 0, variant_exp_days := 0]
user_edit_data[, ln_var_exp_days := log(1 + as.numeric(variant_exp_days))]
user_edit_data[, ln_var_exp_weeks := log(1 + as.numeric(variant_exp_days)/7)]
user_edit_data[, in_var_exp := 0]
user_edit_data[user_reg_ts > variant_test_ts, in_var_exp := 1]

## Convert all NAs to 0, from
## https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
na_to_zero = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

na_to_zero(user_edit_data)

## Turn "reg_on_mobile" into a factor for more meaningful plots
user_edit_data[, platform := 'desktop']
user_edit_data[reg_on_mobile == 1, platform := 'mobile']
user_edit_data[, platform := factor(platform)]

## Control variables for various forms of activation
user_edit_data[, is_activated_article := num_article_edits_24hrs > 0]
user_edit_data[, is_activated_other := num_other_edits_24hrs > 0]
user_edit_data[, is_activated := is_activated_article | is_activated_other]

## Control variables for constructive forms of activation
user_edit_data[, is_const_activated_article := (num_article_edits_24hrs - num_article_reverts_24hrs) > 0]
user_edit_data[, is_const_activated_other := (num_other_edits_24hrs - num_other_reverts_24hrs) > 0]
user_edit_data[, is_const_activated := is_const_activated_article | is_const_activated_other]

## Control variables for the number of edits made
user_edit_data[, log_num_article_edits_24hrs := log(1 + num_article_edits_24hrs)]
user_edit_data[, log_num_other_edits_24hrs := log(1 + num_other_edits_24hrs)]
user_edit_data[, log_num_edits_24hrs := log(1 + num_article_edits_24hrs + num_other_edits_24hrs)]

## Control variables for the constructive number of edits made
user_edit_data[, log_num_const_article_edits_24hrs := log(
    1 + num_article_edits_24hrs - num_article_reverts_24hrs)]
user_edit_data[, log_num_const_other_edits_24hrs := log(
    1 + num_other_edits_24hrs - num_other_reverts_24hrs)]
user_edit_data[, log_num_const_edits_24hrs := log(
    1 + num_article_edits_24hrs + num_other_edits_24hrs -
    num_article_reverts_24hrs - num_other_reverts_24hrs)]

## Retention variables
user_edit_data[, is_const_retained_article := is_const_activated_article &
               ((num_article_edits_2w - num_article_reverts_2w) > 0)]
user_edit_data[, is_const_retained_other := is_const_activated_other &
               ((num_other_edits_2w - num_other_reverts_2w) > 0)]
user_edit_data[, is_const_retained := is_const_activated &
               ((num_article_edits_2w + num_other_edits_2w -
                 num_article_reverts_2w - num_other_reverts_2w) > 0)]

## Variables for number of edits (overall and only constructive)
## across the entire period.
user_edit_data[, num_total_edits_24hrs := num_article_edits_24hrs + num_other_edits_24hrs]
user_edit_data[, num_total_edits_2w := num_article_edits_2w + num_other_edits_2w]
user_edit_data[, num_total_edits := num_total_edits_24hrs + num_total_edits_2w]

user_edit_data[, num_total_const_edits_24hrs := (num_article_edits_24hrs + num_other_edits_24hrs -
                                                 num_article_reverts_24hrs - num_other_reverts_24hrs)]
user_edit_data[, num_total_const_edits_2w := (num_article_edits_2w + num_other_edits_2w -
                                              num_article_reverts_2w - num_other_reverts_2w)]
user_edit_data[, num_total_const_edits := num_total_const_edits_24hrs + num_total_const_edits_2w]

## Variables for number of article edits across the entire period.
user_edit_data[, num_total_article_edits := num_article_edits_24hrs + num_article_edits_2w]

In [7]:
## Configuration variables for this experiment.
## Start timestamp is from https://phabricator.wikimedia.org/T227728#5680453
## End timestamp is from the data gathering notebook
start_ts = as.POSIXct('2019-11-21 00:24:32', tz = 'UTC')
end_ts = as.POSIXct('2020-04-9 00:00', tz = 'UTC')

## Start of the Variant A/B test
variant_test_ts = as.POSIXct('2019-12-13 00:32:04', tz = 'UTC')

## Convert user_registration into a timestamp
nontagged_edit_data[, user_reg_ts := as.POSIXct(user_registration_timestamp,
                                           format = '%Y-%m-%d %H:%M:%S.0', tz = 'UTC')]

## Calculate time since start of experiment, in days and weeks
nontagged_edit_data[, exp_days := 0]
nontagged_edit_data[, exp_days := difftime(user_reg_ts, start_ts, units = 'days')]
nontagged_edit_data[exp_days < 0, exp_days := 0]
nontagged_edit_data[, ln_exp_days := log(1 + as.numeric(exp_days))]
nontagged_edit_data[, ln_exp_weeks := log(1 + as.numeric(exp_days)/7)]

## Calculate time since the start of the variant test, again in days and weeks.
## This enables us to do an interrupted time-series model for that.
nontagged_edit_data[, variant_exp_days := 0]
nontagged_edit_data[, variant_exp_days := difftime(user_reg_ts, variant_test_ts, units = 'days')]
nontagged_edit_data[variant_exp_days < 0, variant_exp_days := 0]
nontagged_edit_data[, ln_var_exp_days := log(1 + as.numeric(variant_exp_days))]
nontagged_edit_data[, ln_var_exp_weeks := log(1 + as.numeric(variant_exp_days)/7)]
nontagged_edit_data[, in_var_exp := 0]
nontagged_edit_data[user_reg_ts > variant_test_ts, in_var_exp := 1]

## Convert all NAs to 0, from
## https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
na_to_zero = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

na_to_zero(nontagged_edit_data)

## Turn "reg_on_mobile" into a factor for more meaningful plots
nontagged_edit_data[, platform := 'desktop']
nontagged_edit_data[reg_on_mobile == 1, platform := 'mobile']
nontagged_edit_data[, platform := factor(platform)]

## Control variables for various forms of activation
nontagged_edit_data[, is_activated_article := num_article_edits_24hrs > 0]
nontagged_edit_data[, is_activated_other := num_other_edits_24hrs > 0]
nontagged_edit_data[, is_activated := is_activated_article | is_activated_other]

## Control variables for constructive forms of activation
nontagged_edit_data[, is_const_activated_article := (num_article_edits_24hrs - num_article_reverts_24hrs) > 0]
nontagged_edit_data[, is_const_activated_other := (num_other_edits_24hrs - num_other_reverts_24hrs) > 0]
nontagged_edit_data[, is_const_activated := is_const_activated_article | is_const_activated_other]

## Control variables for the number of edits made
nontagged_edit_data[, log_num_article_edits_24hrs := log(1 + num_article_edits_24hrs)]
nontagged_edit_data[, log_num_other_edits_24hrs := log(1 + num_other_edits_24hrs)]
nontagged_edit_data[, log_num_edits_24hrs := log(1 + num_article_edits_24hrs + num_other_edits_24hrs)]

## Control variables for the constructive number of edits made
nontagged_edit_data[, log_num_const_article_edits_24hrs := log(
    1 + num_article_edits_24hrs - num_article_reverts_24hrs)]
nontagged_edit_data[, log_num_const_other_edits_24hrs := log(
    1 + num_other_edits_24hrs - num_other_reverts_24hrs)]
nontagged_edit_data[, log_num_const_edits_24hrs := log(
    1 + num_article_edits_24hrs + num_other_edits_24hrs -
    num_article_reverts_24hrs - num_other_reverts_24hrs)]

## Retention variables
nontagged_edit_data[, is_const_retained_article := is_const_activated_article &
               ((num_article_edits_2w - num_article_reverts_2w) > 0)]
nontagged_edit_data[, is_const_retained_other := is_const_activated_other &
               ((num_other_edits_2w - num_other_reverts_2w) > 0)]
nontagged_edit_data[, is_const_retained := is_const_activated &
               ((num_article_edits_2w + num_other_edits_2w -
                 num_article_reverts_2w - num_other_reverts_2w) > 0)]

## Variables for number of edits (overall and only constructive)
## across the entire period.
nontagged_edit_data[, num_total_edits_24hrs := num_article_edits_24hrs + num_other_edits_24hrs]
nontagged_edit_data[, num_total_edits_2w := num_article_edits_2w + num_other_edits_2w]
nontagged_edit_data[, num_total_edits := num_total_edits_24hrs + num_total_edits_2w]

nontagged_edit_data[, num_total_const_edits_24hrs := (num_article_edits_24hrs + num_other_edits_24hrs -
                                                 num_article_reverts_24hrs - num_other_reverts_24hrs)]
nontagged_edit_data[, num_total_const_edits_2w := (num_article_edits_2w + num_other_edits_2w -
                                              num_article_reverts_2w - num_other_reverts_2w)]
nontagged_edit_data[, num_total_const_edits := num_total_const_edits_24hrs + num_total_const_edits_2w]

## Variables for number of article edits across the entire period.
nontagged_edit_data[, num_total_article_edits := num_article_edits_24hrs + num_article_edits_2w]
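
The two preparation cells above apply the same transformations to two different tables. One way to avoid that duplication is a helper that adds the derived columns by reference. This is only a sketch: `add_activity_columns` is a hypothetical name that does not appear in the original analysis, the toy values are made up, and only a few of the derived columns are shown.

```r
library(data.table)

## Hypothetical helper: adds a few of the derived columns by reference
add_activity_columns = function(DT) {
  DT[, is_activated_article := num_article_edits_24hrs > 0]
  DT[, is_activated_other := num_other_edits_24hrs > 0]
  DT[, is_activated := is_activated_article | is_activated_other]
  DT[, num_total_edits_24hrs := num_article_edits_24hrs + num_other_edits_24hrs]
  invisible(DT)
}

## Toy input (values are made up)
toy = data.table(num_article_edits_24hrs = c(0, 2, 0),
                 num_other_edits_24hrs = c(0, 0, 1))
add_activity_columns(toy)
toy
```

Because `:=` modifies the table in place, the same function could be called once per dataset without reassigning the result.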

In [8]:
## Registration cutoff: keep only users who registered after the "newcomer task" tag was deployed
reg_cutoff = as.POSIXct('2019-12-13 09:00:00', tz = 'UTC')

eligible_user_edit_data = nontagged_edit_data[user_reg_ts > reg_cutoff]

In [10]:
all_edit_data = merge(eligible_user_edit_data, user_edit_data,
                      by = c('wiki_db', 'user_id'))

Confirm that both datasets have the same length:

In [12]:
length(eligible_user_edit_data$user_id)

In [13]:
length(all_edit_data$user_id)
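
The length comparison above could also be written as an assertion that stops the notebook if the join dropped or duplicated users. A minimal sketch on hypothetical toy tables (the `toy_*` names and values are made up; only the key columns `wiki_db` and `user_id` come from the datasets):

```r
library(data.table)

## Hypothetical stand-ins for eligible_user_edit_data and user_edit_data
toy_eligible = data.table(wiki_db = c('wiki_a', 'wiki_b'), user_id = c(1, 1))
toy_all = data.table(wiki_db = c('wiki_a', 'wiki_a', 'wiki_b'), user_id = c(1, 2, 1))

toy_merged = merge(toy_eligible, toy_all, by = c('wiki_db', 'user_id'))

## An inner join keeps every eligible user only if each has exactly one match
stopifnot(nrow(toy_merged) == nrow(toy_eligible))
stopifnot(anyDuplicated(toy_merged, by = c('wiki_db', 'user_id')) == 0)
```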

In [None]:
names(all_edit_data)

The columns we're most interested in are:

* `num_total_edits`, edits across all 15 days
* `num_total_article_edits`, number of Article & Article talk edits across all 15 days

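We'll also be interested in these for the first 24 hours, as that appears to be a strong indicator of continued participation.

After the merge, columns shared by both datasets get a suffix: `.x` for the non-tagged dataset (the first argument to `merge`) and `.y` for the full dataset. The number of tagged edits is therefore the `.y` value minus the `.x` value.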

In [15]:
all_edit_data$num_total_tagged_edits = (all_edit_data$num_total_edits.y -
                                        all_edit_data$num_total_edits.x)

In [23]:
all_edit_data$num_total_tagged_article_edits = (all_edit_data$num_total_article_edits.y -
                                                all_edit_data$num_total_article_edits.x)

In [20]:
all_edit_data$num_total_tagged_edits_24hrs = (all_edit_data$num_total_edits_24hrs.y -
                                              all_edit_data$num_total_edits_24hrs.x)

In [21]:
all_edit_data$num_tagged_article_edits_24hrs = (all_edit_data$num_article_edits_24hrs.y -
                                                all_edit_data$num_article_edits_24hrs.x)
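
The subtraction in the cells above relies on how `merge` names columns that exist in both tables. A small sketch on toy data shows the mechanics; the wiki names and counts are hypothetical, and only the column names `wiki_db`, `user_id`, and `num_total_edits` come from the datasets.

```r
library(data.table)

## Toy stand-ins for the two datasets (values are hypothetical)
toy_nontagged = data.table(wiki_db = c('wiki_a', 'wiki_a', 'wiki_b'),
                           user_id = c(1, 2, 1),
                           num_total_edits = c(3, 0, 5))
toy_all = data.table(wiki_db = c('wiki_a', 'wiki_a', 'wiki_b'),
                     user_id = c(1, 2, 1),
                     num_total_edits = c(4, 2, 5))

## merge() suffixes shared non-key columns: .x = first table, .y = second
toy_merged = merge(toy_nontagged, toy_all, by = c('wiki_db', 'user_id'))

## Tagged edits = all edits minus non-tagged edits
toy_merged[, num_total_tagged_edits := num_total_edits.y - num_total_edits.x]
toy_merged
```

In this toy example the second user on `wiki_a` has only tagged edits, so that user has a non-tagged count of zero but a positive overall count.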

Now, let's look at the proportion of users in the Homepage group who made an edit in the Article or Article talk namespaces, and then at the share of those editors who also made a tagged edit.

In [32]:
length(all_edit_data[is_treatment.x == 1 & num_total_article_edits.x > 0]$user_id) /
length(all_edit_data[is_treatment.x == 1]$user_id)

In [33]:
length(all_edit_data[is_treatment.x == 1 & num_total_tagged_article_edits > 0]$user_id) /
length(all_edit_data[is_treatment.x == 1 & num_total_article_edits.x > 0]$user_id)

Let's look at the same within 24 hours after registration (the activation period):

In [34]:
length(all_edit_data[is_treatment.x == 1 & is_activated_article.x == 1]$user_id) /
length(all_edit_data[is_treatment.x == 1]$user_id)

In [36]:
length(all_edit_data[is_treatment.x == 1 & num_tagged_article_edits_24hrs > 0]$user_id) /
length(all_edit_data[is_treatment.x == 1 & is_activated_article.x == 1]$user_id)

So only 14.9% of Homepage-group users who activated by editing in the Article or Article talk namespaces also made a tagged edit within that same 24-hour period.
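
The two proportions computed above use different denominators (all users, then only activated users), which is easy to lose track of. A small sketch on made-up data, using the mean of a logical vector as the proportion; all values and variable names here are hypothetical:

```r
## Toy Homepage-group users (values are hypothetical)
article_edits_24hrs = c(0, 0, 0, 0, 0, 0, 3, 1, 2, 1)
tagged_article_edits_24hrs = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0)

activated = article_edits_24hrs > 0

## Share of all users who activated
prop_activated = mean(activated)

## Share of activated users who made at least one tagged edit
prop_tagged_given_activated = mean(tagged_article_edits_24hrs[activated] > 0)

## Share of all users who made a tagged edit is the product of the two
prop_tagged_overall = prop_activated * prop_tagged_given_activated
c(prop_activated, prop_tagged_given_activated, prop_tagged_overall)
```

The product makes the point of this section visible: a modest share of a modest share leaves only a small fraction of all registrations making a tagged edit.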

How does the average number of edits compare between the Homepage and Control groups, for users who actually edited?

Let's first look at it without tagged edits:

In [49]:
## Geometric mean number of edits in the Control group, for users who edited

ctrl_grp_mean_total_edits = exp(mean(log(all_edit_data[is_treatment.x == 0 &
                                                       num_total_edits.x > 0]$num_total_edits.x)))
ctrl_grp_mean_total_edits

In [40]:
## Geometric mean number of edits in the Homepage group, for users who edited, only non-tagged edits:

exp(mean(log(all_edit_data[is_treatment.x == 1 & num_total_edits.x > 0]$num_total_edits.x)))

Let's do the same but add the tagged edits:

In [50]:
## Geometric mean number of edits in the Homepage group, for users who edited, all edits:

homepage_grp_mean_total_edits = exp(mean(log(all_edit_data[is_treatment.x == 1 &
                                                           num_total_edits.y > 0]$num_total_edits.y)))
homepage_grp_mean_total_edits

In [51]:
## Increase in number

homepage_grp_mean_total_edits - ctrl_grp_mean_total_edits

In [52]:
## Increase in percent

(homepage_grp_mean_total_edits - ctrl_grp_mean_total_edits) / ctrl_grp_mean_total_edits

So, for users who edit, the Homepage increases the average (geometric mean) number of edits from 2.04 to 2.15 (+0.10 edits, or +5.1%). In other words, scaled to 1,000 users who edit, the Control group makes about 2,042 edits and the Homepage group about 2,146.
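
The averages in this section are geometric means, computed as `exp(mean(log(x)))` over users with at least one edit. Edit counts are typically right-skewed, so a few very active users pull the arithmetic mean up, while the geometric mean stays closer to a typical user. A sketch with made-up counts (the helper name `geometric_mean` is hypothetical and not part of the original analysis):

```r
## Hypothetical helper wrapping the exp(mean(log(x))) pattern used above
geometric_mean = function(x) {
  x = x[x > 0]   # log(0) is -Inf, so keep only users who edited
  exp(mean(log(x)))
}

## Toy per-user edit counts (values are made up)
toy_control = c(1, 1, 2, 4, 0, 0)
toy_homepage = c(1, 2, 2, 4, 0, 8)

gm_control = geometric_mean(toy_control)
gm_homepage = geometric_mean(toy_homepage)

## Relative increase, computed the same way as in the cells above
(gm_homepage - gm_control) / gm_control
```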

Let's look at this for the first 24 hours:

In [42]:
## Geometric mean number of edits in the Control group, for users who edited

exp(mean(log(all_edit_data[is_treatment.x == 0 & num_total_edits_24hrs.x > 0]$num_total_edits_24hrs.x)))

In [43]:
## Geometric mean number of edits in the Homepage group, for users who edited, only non-tagged edits:

exp(mean(log(all_edit_data[is_treatment.x == 1 & num_total_edits_24hrs.x > 0]$num_total_edits_24hrs.x)))

Let's do the same but add the tagged edits:

In [44]:
## Geometric mean number of edits in the Homepage group, for users who edited, all edits:

exp(mean(log(all_edit_data[is_treatment.x == 1 & num_total_edits_24hrs.y > 0]$num_total_edits_24hrs.y)))

So, for users who activate (edit within 24 hours of registration), the average number of edits increases from 1.77 to 1.84.

Let's drill down to article edits specifically:

In [53]:
## Geometric mean number of article edits in the Control group, for users who edited

ctrl_grp_mean_article_edits_24hrs = exp(
    mean(
        log(
            all_edit_data[is_treatment.x == 0 &
                          num_article_edits_24hrs.x > 0]$num_article_edits_24hrs.x)
    )
)
ctrl_grp_mean_article_edits_24hrs

In [47]:
## Geometric mean number of article edits in the Homepage group, for users who edited, only non-tagged edits:

exp(mean(log(all_edit_data[is_treatment.x == 1 &
                           num_article_edits_24hrs.x > 0]$num_article_edits_24hrs.x)))

Let's do the same but add the tagged edits:

In [54]:
## Geometric mean number of article edits in the Homepage group, for users who edited, all edits:

homepage_grp_mean_article_edits_24hrs = exp(
    mean(
        log(
            all_edit_data[is_treatment.x == 1 &
                          num_article_edits_24hrs.y > 0]$num_article_edits_24hrs.y)
    )
)
homepage_grp_mean_article_edits_24hrs

In [55]:
## Delta count

homepage_grp_mean_article_edits_24hrs - ctrl_grp_mean_article_edits_24hrs

In [56]:
## Delta percentage

(homepage_grp_mean_article_edits_24hrs - ctrl_grp_mean_article_edits_24hrs) / ctrl_grp_mean_article_edits_24hrs

For users who edit an article within 24 hours of registration, we increase from an average of 1.74 to 1.83 edits (+0.09 edits or +4.9%).

Using the same 1,000-user framing: the Control group makes about 1,740 edits, and the Homepage group about 1,825.

## Summary

There are some challenges in understanding the effects of the Homepage with Newcomer Tasks. When thinking about how the Homepage can affect newly registered users and how we model productivity, it might be useful to keep three things in mind:

1. Some registrations will never edit. Our model accounts for this through what is called "zero inflation", which is allowed to vary by wiki and by desktop/mobile. We regard zero inflation as something external to Wikipedia that the Homepage can't change.
2. Some registrations might edit. We know from our analysis of activation that the Homepage with Newcomer Tasks has a significant positive effect on activation. This will show up in our productivity analysis as an increase, because users go from 0 to 1+ edits.
3. Some registrations will edit. Our model of productivity does not focus solely on these users, but takes them into account because it expects a non-linear relationship between user activity level and the proportion of users who reach that activity level.

When we report a large increase in expected user productivity, we're combining points 2 and 3 above rather than referring specifically to point 3. We also see from this quick analysis that only a small proportion (14.9%) of users who activated by editing in the Article and Article talk namespaces also made a tagged edit within the same time period. That explains why tagged edits are only about 2% of "newcomer edits" (all edits in "content namespaces" by all users within 30 days of registration): we're not reaching a large number of users, but we reach enough to make a significant dent in participation.
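
To make the combination of points 2 and 3 concrete, here is a back-of-the-envelope sketch. Expected edits per registration can be approximated as the share of registrations who edit multiplied by the average number of edits among those who do, so a gain in each factor compounds. The two editing rates below are hypothetical placeholders, not results from this analysis. The per-editor averages are the 2.042 and 2.146 reported above, and treating those geometric means as if they were arithmetic means is a simplification that the actual zero-inflated model does not make.

```r
## Hypothetical share of registrations who edit (NOT results from this analysis)
p_edit_control = 0.20
p_edit_homepage = 0.24

## Average edits among users who edit, as reported above
edits_per_editor_control = 2.042
edits_per_editor_homepage = 2.146

## Expected edits per registration = P(edit) * edits per editor
expected_control = p_edit_control * edits_per_editor_control
expected_homepage = p_edit_homepage * edits_per_editor_homepage

## The relative increase combines both effects multiplicatively
relative_increase = expected_homepage / expected_control - 1
relative_increase
```

With these placeholder rates, most of the overall increase comes from more users starting to edit rather than from each editor making more edits, which is the distinction between points 2 and 3.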