# Potential Edits Being Blocked

After discussing our findings and needs further, [T292106](https://phabricator.wikimedia.org/T292106) got an update to ask the following question:

* "How many edits are not happening because they are being blocked simply for coming from Relay IP addresses?"

We're thinking about trying to solve this as a probability problem:

1. What's the overall probability that an iOS 15 user who loads the editor also saves an edit?
2. How many iOS 15 requests do we have that open the editor? What number and proportion of them were made with Relay Service IP addresses?
3. Combine the first and second to estimate how many edits we'd get if none were blocked.
4. Count how many edits we've actually had (using `cu_changes` across all the wikis) by iOS 15 users. What number and proportion of them were made with Relay Service IP addresses?
5. Use the last two measurements to estimate how many edits we've lost.

Based on the description of the task, we're primarily interested in an overall estimate. We'll examine data from Sept 20 (the iOS 15 release date) to Oct 14 (most recent whole day of data).

I'll use EditAttemptStep to generate the first statistic. The second, we'll pull from `pageview_actor`.

For the fourth, I'll grab my code to find all the wikis from the Welcome Survey aggregation, then for each wiki check if it has `cu_changes` and if so pull data. If it doesn't have that, we don't have data and will need to note that.

# Notes

After digging into what happens in the browser when trying to edit, I don't think we have good data that'll allow us to estimate how often users try to open the editor. The mobile experience as well as VE both load things through JS. There do appear to be some requests that might allow us to dig things out of the webrequest logs, since these requests go to the API, but we'd then be sifting through a lot of data.

Instead, I advocate for two things:

1. We file a phab task requesting a data stream of users clicking "Edit" but being notified that they're blocked from editing. I'm not sure what data that stream should capture, but given the complexity of the systems we have, I think being able to determine that someone wanted to edit but wasn't allowed to would be important.
2. We have good data on pageviews from iOS 15, and we have good data on edits from iOS 15. Let's focus on those two and split them by Relay Service. Then we'll aggregate and calculate the probability of a saved edit given a pageview for non-Relay users, and use the number of pageviews by Relay users to estimate the number of edits. This assumes that the likelihood of clicking Edit and saving that edit is the same for both groups. I think that's an assumption that's *very* uncertain, but I think it'll give us the best estimate for now.
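
In symbols (the notation is introduced here only for illustration and does not appear in the task), with $V$ for pageviews and $E$ for saved edits in each group, the estimate described in the second point can be sketched as:

$$
\hat{p} = \frac{E_{\text{non-relay}}}{V_{\text{non-relay}}}, \qquad
\hat{E}_{\text{relay}} = \hat{p} \cdot V_{\text{relay}}, \qquad
\text{lost edits} \approx \hat{E}_{\text{relay}} - E_{\text{relay}}
$$

Here $\hat{p}$ is the probability of a saved edit per pageview among non-Relay users, and $\hat{E}_{\text{relay}}$ is the number of edits Relay users would be expected to make if they edited at that same rate.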

I asked around about whether identifying edits through web requests was doable, and the query parameters listed in [T277785](https://phabricator.wikimedia.org/T277785) suggest that it is. As discussed above, we'd be going through a lot of data, so we're holding off on that for now.

In [1]:
import os
import ipaddress
import time

from collections import defaultdict

import datetime as dt

import numpy as np
import pandas as pd

import findspark
from wmfdata import spark

In [2]:
SPARK_HOME = os.environ.get("SPARK_HOME", "/usr/lib/spark2")
findspark.init(SPARK_HOME)
from pyspark.sql import functions as F, types as T, Window

In [3]:
# we'll start with a regular sized session
spark_session = spark.get_session()

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


## Load in the Relay Range Dataset

In [4]:
# load iCloud's private relay egress ranges
# data comes from https://mask-api.icloud.com/egress-ip-ranges.csv
relay_ranges = pd.read_csv('datasets/egress-ip-ranges.csv',
                           sep=',', names=['range', 'country', 'region', 'city', 'empty']).drop(columns=['empty'])


In [5]:
## This data structure is based on using https://stackoverflow.com/a/1004527
## to determine if an IP falls within a given network, which gives the following
## assertions:
## 1: IPv4 and IPv6 networks are disjoint so we can split on IP version. There used to be
##    compatibility between the networks, but that was deprecated according to
##    https://networkengineering.stackexchange.com/questions/57903/are-the-ipv6-address-space-and-ipv4-address-space-completely-disjoint
## 2: The netmask is binary-AND'ed onto the binary IP address, hence
##    the second layer are the netmasks, of which we expect there to be a limited number. 
## 3: We then have a limited set of possible networks which are all numbers
##    so we store those as a set and let Python handle it, which gives us fast lookup.

dict_nets = {
    '4' : defaultdict(set),
    '6' : defaultdict(set)    
}

In [6]:
for net_raw in relay_ranges.range:
    net = ipaddress.ip_network(net_raw)
    
    net_v = str(net.version)
    
    dict_nets[net_v][net.netmask].add(int(net.network_address))

In [7]:
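def is_ip_private_relay(ip_raw):
    """Return True if ip_raw falls within one of the Private Relay egress ranges."""
    try:
        ip = ipaddress.ip_address(ip_raw)
        bin_ip = int(ip)

        # mask the address with each known netmask and look for a matching network
        for netmask, range_set in dict_nets[str(ip.version)].items():
            bin_netmask = int(netmask)
            if (bin_ip & bin_netmask) in range_set:
                return True
    except ValueError:  # not a valid IP address
        pass

    return False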
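
To illustrate how the netmask lookup behaves, here is a minimal self-contained sketch of the same idea. The ranges are hypothetical stand-ins taken from the reserved documentation blocks, not real Relay egress ranges, and `example_nets` and `in_example_ranges` are names made up for this example.

```python
import ipaddress
from collections import defaultdict

# Hypothetical ranges from the IPv4/IPv6 documentation blocks,
# standing in for the real egress list.
example_ranges = ["192.0.2.0/24", "198.51.100.128/25", "2001:db8::/32"]

# IP version -> netmask (as int) -> set of network addresses (as int)
example_nets = {4: defaultdict(set), 6: defaultdict(set)}
for raw in example_ranges:
    net = ipaddress.ip_network(raw)
    example_nets[net.version][int(net.netmask)].add(int(net.network_address))


def in_example_ranges(ip_raw):
    """Return True if ip_raw falls inside one of the example networks."""
    try:
        ip = ipaddress.ip_address(ip_raw)
    except ValueError:  # not a valid IP address
        return False
    ip_int = int(ip)
    # AND the address with each netmask and check for a known network address
    return any(
        (ip_int & netmask) in networks
        for netmask, networks in example_nets[ip.version].items()
    )


print(in_example_ranges("192.0.2.55"))    # inside 192.0.2.0/24
print(in_example_ranges("198.51.100.5"))  # outside 198.51.100.128/25
print(in_example_ranges("2001:db8::1"))   # inside 2001:db8::/32
print(in_example_ranges("not-an-ip"))     # invalid input
```

The cost of a lookup grows with the number of distinct netmasks rather than the number of ranges, which is why the structure is keyed on netmask first.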

## Probability of Saving an Edit

For all edit attempts on iOS 15, split by platform and editor interface, what's the probability of successfully saving an edit?

In [17]:
edit_attempt_query = '''
SELECT
    event.platform,
    event.editor_interface,
    COUNT(IF(event.action = "init", 1, NULL)) AS num_attempts,
    COUNT(IF(event.action = "saveSuccess", 1, NULL)) AS num_saves
FROM event.editattemptstep
WHERE year = 2021
AND ((month = 9 AND day >= 20)
     OR (month = 10 AND day <= 14))
AND event.action IN ("init", "saveSuccess")
AND user_agent_map["os_family"] = "iOS"
AND user_agent_map["os_major"] = "15"
AND event.is_oversample = false
GROUP BY event.platform, event.editor_interface
'''

In [18]:
edit_attempt_data = spark.run(edit_attempt_query)

PySpark executors will use /usr/lib/anaconda-wmf/bin/python3.


In [20]:
edit_attempt_data['prob_saved'] = 100.0 * edit_attempt_data['num_saves'] / edit_attempt_data['num_attempts']

In [21]:
edit_attempt_data

| | platform | editor_interface | num_attempts | num_saves | prob_saved |
|---|---|---|---|---|---|
| 0 | desktop | wikitext-2017 | 115 | 80 | 69.565217 |
| 1 | desktop | visualeditor | 154 | 26 | 16.883117 |
| 2 | desktop | wikitext | 9669 | 938 | 9.701107 |
| 3 | phone | wikitext | 52464 | 3406 | 6.492071 |
| 4 | phone | visualeditor | 3512 | 822 | 23.405467 |


Based on this (`prob_saved` is a percentage), I think it's important for us to take site (desktop/mobile) into consideration, and possibly also try to understand if the edit attempt is using VE or wikitext.
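
As a rough illustration of the platform-level split, the counts from the table above can be summed across editor interfaces. This is a sketch using only the standard library; `pct_saved_by_platform` is a name made up for this example.

```python
from collections import defaultdict

# (platform, editor_interface, num_attempts, num_saves), copied from the table above
rows = [
    ("desktop", "wikitext-2017", 115, 80),
    ("desktop", "visualeditor", 154, 26),
    ("desktop", "wikitext", 9669, 938),
    ("phone", "wikitext", 52464, 3406),
    ("phone", "visualeditor", 3512, 822),
]

# platform -> [total attempts, total saves]
totals = defaultdict(lambda: [0, 0])
for platform, _interface, attempts, saves in rows:
    totals[platform][0] += attempts
    totals[platform][1] += saves

pct_saved_by_platform = {
    platform: 100.0 * saves / attempts
    for platform, (attempts, saves) in totals.items()
}

for platform, pct in pct_saved_by_platform.items():
    print(f"{platform}: {pct:.2f}% of editor loads ended in a saved edit")
```

The per-platform figures are dominated by the wikitext editor, which accounts for most attempts on both desktop and phone.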

## Number of Edits from iOS 15

We'll grab data from `wmf_raw.mediawiki_private_cu_changes` from September 20–30 and identify all edits made through iOS 15 and label whether they were made using a relay IP or not.

In [22]:
spark_session.sql("ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive-shaded.jar")

DataFrame[result: int]

In [23]:
spark_session.sql("CREATE TEMPORARY FUNCTION ua AS 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF'")

DataFrame[]

In [24]:
## bot user name regex from Data Engineering
botUsernamePattern = r"^.*[Bb]ot([^a-zA-Z].*$|$)"
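
To show what this pattern does and does not catch, here is a small sketch using Python's `re` module. It assumes `re.search` treats this particular pattern the same way Spark's `REGEXP` does, and the user names are hypothetical.

```python
import re

# Same pattern as botUsernamePattern above.
bot_username_pattern = re.compile(r"^.*[Bb]ot([^a-zA-Z].*$|$)")

# Hypothetical user names, for illustration only.
examples = ["ExampleBot", "Examplebot2", "Bot-helper", "Botanist", "Robotics fan", "Alice"]

flagged = {name: bool(bot_username_pattern.search(name)) for name in examples}
for name, is_bot_like in flagged.items():
    print(f"{name!r}: {is_bot_like}")
```

The pattern only matches when "Bot" or "bot" is followed by a non-letter or the end of the name, so names like "Botanist" are not flagged.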

In [29]:
ios15_nonbot_edits_query = '''
WITH bots AS (
    SELECT
        wiki_db,
        ug_user
    FROM wmf_raw.mediawiki_user_groups
    WHERE snapshot = "2021-09"
    AND ug_group = "bot"
)
SELECT
    cu.wiki_db,
    cuc_user,
    cuc_user_text,
    cuc_this_oldid,
    cuc_ip,
    ua(cuc_agent)['browser_family'] AS browser_family
FROM wmf_raw.mediawiki_private_cu_changes AS cu
LEFT JOIN bots
ON cu.wiki_db = bots.wiki_db
AND cu.cuc_user = bots.ug_user
WHERE cu.month = "2021-09"
AND cuc_timestamp >= "20210920000000"
AND cuc_timestamp < "20211001000000"
AND ua(cuc_agent)['os_family'] = "iOS"
AND ua(cuc_agent)['os_major'] = "15"
AND cuc_this_oldid != 0 -- only edits
AND bots.ug_user IS NULL -- not in the bot user group
AND cuc_user_text NOT REGEXP "{bot_regex}" -- not a bot user name
'''

In [30]:
ios15_nonbot_edits_df = spark_session.sql(ios15_nonbot_edits_query.format(
    bot_regex = botUsernamePattern
))

In [32]:
ios15_nonbot_edits_relay_df = spark_session.createDataFrame(
    ios15_nonbot_edits_df.rdd.map(lambda r: T.Row(
        wiki_db = r.wiki_db,
        cuc_user = r.cuc_user,
        cuc_user_text = r.cuc_user_text,
        cuc_this_oldid = r.cuc_this_oldid,
        cuc_ip = r.cuc_ip,
        browser_family = r.browser_family,
        is_relay=is_ip_private_relay(r.cuc_ip)
    ))
)

In [33]:
ios15_nonbot_edits_pdf = ios15_nonbot_edits_relay_df.toPandas()

In [None]:
ios15_nonbot_edits_pdf.head()

In [40]:
ios15_nonbot_edits_pdf['browser_family'].unique()

array(['Mobile Safari UI/WKWebView', 'Other', 'Chrome Mobile iOS',
       'Mobile Safari', 'Google', 'Firefox iOS', 'LINE', 'Facebook',
       'Instagram', 'Yandex Browser', 'Edge Mobile', 'UC Browser',
       'DuckDuckGo Mobile', 'Pinterest', 'Whale'], dtype=object)

In [82]:
ios15_nonbot_edits_pdf.loc[ios15_nonbot_edits_pdf['is_relay'] == True, 'browser_family'].unique()

array(['Mobile Safari'], dtype=object)

In [47]:
ios15_edits_agg = (ios15_nonbot_edits_pdf.groupby('is_relay')
                   .agg({'cuc_this_oldid' : 'count'})
                   .rename(columns = {'cuc_this_oldid' : 'num_edits'})
                   .reset_index())

In [48]:
ios15_edits_agg

| | is_relay | num_edits |
|---|---|---|
| 0 | False | 34777 |
| 1 | True | 53 |


## iOS 15 Pageviews

We'll pull data from `wmf.pageview_actor` to count the number of pageviews made using iOS 15 from September 20–30 (excluding the mobile app, which Private Relay does not affect) and again label them depending on whether they were made through the relay or not.

In [38]:
viewers_data = (
    spark_session.read.table("wmf.pageview_actor")

    .where(F.col("year") == 2021)
    .where(F.col("month") == 9)
    .where(F.col("day") >= 20)

    # only user pageview traffic
    .where(F.col("agent_type") == 'user')
    .where(F.col("is_pageview") == True)

    # exclude mobile app -- private relay does not affect it
    .where(F.col("access_method") != 'mobile app')

    # only iOS 15 devices
    .where(F.col("user_agent_map.os_family") == 'iOS')
    .where(F.col("user_agent_map.os_major") == '15')

)
viewers_data_is_relay = spark_session.createDataFrame(
    viewers_data.rdd.map(lambda r: T.Row(
        log_date = dt.date(r.year, r.month, r.day),
        project = "{}.{}".format(r.normalized_host.project, r.normalized_host.project_family),
        is_relay = is_ip_private_relay(r.ip)
    ))
)

agg_viewer_data_by_relay = (
    viewers_data_is_relay
    .groupBy("is_relay")
    .agg(F.count("*").alias("num_views"))
)

In [None]:
agg_viewers_by_relay_pdf = agg_viewer_data_by_relay.toPandas()

In [42]:
agg_viewers_by_relay_pdf

| | is_relay | num_views |
|---|---|---|
| 0 | False | 144350871 |
| 1 | True | 5705764 |


## Probability of Editing

In [49]:
views_edits_pdf = agg_viewers_by_relay_pdf.merge(ios15_edits_agg, on = 'is_relay')

In [50]:
views_edits_pdf

| | is_relay | num_views | num_edits |
|---|---|---|---|
| 0 | False | 144350871 | 34777 |
| 1 | True | 5705764 | 53 |


In [51]:
views_edits_pdf['prob_editing'] = views_edits_pdf['num_edits'] / views_edits_pdf['num_views']

In [52]:
views_edits_pdf

| | is_relay | num_views | num_edits | prob_editing |
|---|---|---|---|---|
| 0 | False | 144350871 | 34777 | 0.000241 |
| 1 | True | 5705764 | 53 | 9e-06 |


It's clear that the probability of saving an edit per pageview is *much* higher when not on the relay: about 241 versus 9 edits per million pageviews, roughly a 26-fold difference. Let's start by adding standard errors to those estimates.

In [95]:
views_edits_pdf['SE'] = np.sqrt(
    views_edits_pdf['prob_editing'] * (1 - views_edits_pdf['prob_editing']) /
    views_edits_pdf['num_views'])

In [97]:
views_edits_pdf['prob_high'] = views_edits_pdf['prob_editing'] + views_edits_pdf['SE']
views_edits_pdf['prob_low'] = views_edits_pdf['prob_editing'] - views_edits_pdf['SE']
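
The standard error used here is the usual one for a binomial proportion, sqrt(p * (1 - p) / n). As a sanity check, the following standard-library sketch recomputes it from the raw counts in the table above; `binomial_se` and `intervals` are names made up for this example.

```python
import math


def binomial_se(successes, trials):
    """Return the proportion p and its standard error sqrt(p * (1 - p) / n)."""
    p = successes / trials
    return p, math.sqrt(p * (1 - p) / trials)


# (num_edits, num_views), copied from the views/edits table above
counts = {
    "non-relay": (34777, 144350871),
    "relay": (53, 5705764),
}

intervals = {}
for group, (num_edits, num_views) in counts.items():
    p, se = binomial_se(num_edits, num_views)
    intervals[group] = (p - se, p, p + se)
    low, mid, high = (1e6 * x for x in intervals[group])
    print(f"{group}: {mid:.2f} edits per million views (range {low:.2f} to {high:.2f})")
```

Note that plus or minus one standard error is a fairly narrow band (roughly a 68% interval under a normal approximation), and with only 53 relay edits that approximation is rough for the relay group.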

In [98]:
views_edits_pdf

| | is_relay | num_views | num_edits | prob_editing | expected_edits | SE | prob_high | prob_low | expected_high | expected_low |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 144350871 | 34777 | 0.000241 | 0.0 | 1e-06 | 0.000242 | 0.00024 | 34963.463459 | 34590.536541 |
| 1 | True | 5705764 | 53 | 9e-06 | 1374.632195 | 1e-06 | 1.1e-05 | 8e-06 | 60.280076 | 45.719924 |


Now, let's calculate expected edits based on the probabilities.

In [99]:
views_edits_pdf['expected_high'] = views_edits_pdf['num_views'] * views_edits_pdf['prob_high']
views_edits_pdf['expected_low'] = views_edits_pdf['num_views'] * views_edits_pdf['prob_low']

In [100]:
views_edits_pdf

| | is_relay | num_views | num_edits | prob_editing | expected_edits | SE | prob_high | prob_low | expected_high | expected_low |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 144350871 | 34777 | 0.000241 | 0.0 | 1e-06 | 0.000242 | 0.00024 | 34963.463459 | 34590.536541 |
| 1 | True | 5705764 | 53 | 9e-06 | 1374.632195 | 1e-06 | 1.1e-05 | 8e-06 | 60.280076 | 45.719924 |


In [53]:
views_edits_pdf['expected_edits'] = 0

In [101]:
views_edits_pdf['prob_editing'] * 1e6

0    240.919918
1      9.288852
Name: prob_editing, dtype: float64

In [102]:
views_edits_pdf['prob_low'] * 1e6

0    239.628180
1      8.012936
Name: prob_low, dtype: float64

In [103]:
views_edits_pdf['prob_high'] * 1e6

0    242.211656
1     10.564769
Name: prob_high, dtype: float64

Overall expected number of edits using the relay, using the non-relay probability:

In [105]:
(views_edits_pdf.loc[views_edits_pdf['is_relay'] == True, 'num_views'].iloc[0] * 
 views_edits_pdf.loc[views_edits_pdf['is_relay'] == False, 'prob_editing'].iloc[0])

1374.6321948275602

Expected number of edits using the relay, using the non-relay probability + its SE:

In [106]:
(views_edits_pdf.loc[views_edits_pdf['is_relay'] == True, 'num_views'].iloc[0] * 
 views_edits_pdf.loc[views_edits_pdf['is_relay'] == False, 'prob_high'].iloc[0])

1382.0025451696783

Expected number of edits using the relay, using the non-relay probability - its SE:

In [107]:
(views_edits_pdf.loc[views_edits_pdf['is_relay'] == True, 'num_views'].iloc[0] * 
 views_edits_pdf.loc[views_edits_pdf['is_relay'] == False, 'prob_low'].iloc[0])

1367.2618444854418

In [73]:
views_edits_pdf

| | is_relay | num_views | num_edits | prob_editing | expected_edits |
|---|---|---|---|---|---|
| 0 | False | 144350871 | 34777 | 0.000241 | 0.0 |
| 1 | True | 5705764 | 53 | 9e-06 | 1374.632195 |


Difference between the expected and the actual number of relay edits (point estimate):

In [81]:
(views_edits_pdf.loc[views_edits_pdf['is_relay'] == True, 'expected_edits'] -
 views_edits_pdf.loc[views_edits_pdf['is_relay'] == True, 'num_edits'])

1    1321.632195
dtype: float64
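
Applying the same subtraction to the low and high probabilities gives a rough range around that point estimate. This is a sketch using values copied from the outputs above; `lost_edits` is a name made up for this example.

```python
# Values copied from the outputs above.
relay_views = 5705764
relay_edits_observed = 53

# Edits per pageview for non-relay iOS 15 traffic.
nonrelay_prob = {
    "low": 239.628180e-6,   # point estimate - 1 SE
    "mid": 240.919918e-6,   # point estimate
    "high": 242.211656e-6,  # point estimate + 1 SE
}

# Expected relay edits at the non-relay rate, minus the edits actually observed.
lost_edits = {
    label: relay_views * prob - relay_edits_observed
    for label, prob in nonrelay_prob.items()
}

for label, lost in lost_edits.items():
    print(f"{label}: about {lost:.0f} edits not made")
```

These figures cover September 20–30 only. The range reflects sampling error in the non-relay rate alone, not the much larger uncertainty in assuming that relay and non-relay users are equally likely to edit.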