Load the processed conflict data for March 2020. It consists of Schema:TwoColConflictConflict events joined with metadata about the revisions involved.

In [27]:
conflicts = spark.read.parquet("/tmp/awight/conflict_details").cache()
conflicts.count()

17669

Remove the conflicts that we suspect were caused by MediaWiki glitches [T246439](https://phabricator.wikimedia.org/T246439) and [T246440](https://phabricator.wikimedia.org/T246440). The filter keeps only events whose base revision differs from the latest revision.

In [28]:
from pyspark.sql.functions import col
clean_conflicts = conflicts.filter(col("baseRevisionId") != col("latestRevisionId"))
clean_conflicts.count()

8875

In [34]:
deduped_conflicts = clean_conflicts.dropDuplicates()
deduped_conflicts.count()

8808

Nice to know that duplicates account for only 0.75% of the remaining events (67 of 8875).
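The 0.75% figure follows directly from the two row counts reported above. A quick check, with variable names chosen here purely for illustration:

```python
# Row counts reported by the cells above.
clean_count = 8875    # after removing the suspected glitch events
deduped_count = 8808  # after dropDuplicates()

duplicate_count = clean_count - deduped_count
duplicate_share = duplicate_count / clean_count

print(f"{duplicate_count} duplicates, {duplicate_share:.2%} of events")
```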

Add some calculated columns:
* `next_edit_delta`: Number of seconds elapsed between entering the conflict workflow and the next revision on an article.
* `is_resolved`: True if a new revision is stored within 1 hour of entering the conflict workflow. This is a crappy proxy for the actual success of the workflow.
* `is_talk`: True if the page is in a talk namespace (odd namespace ID) or in the project namespace (ID = 4).
* `is_anon`: True when the user is anonymous, approximated here as an `editCount` of 0.
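To make the definitions concrete, here is a minimal row-level sketch in plain Python. It is not the notebook's vectorized pandas code; the function names and the constants `RESOLVED_WINDOW_SECONDS` and `PROJECT_NAMESPACE` are made up for illustration, and treating an edit count of 0 as "anonymous" is an assumption taken from how the cell below computes `is_anon`.

```python
from datetime import datetime
from typing import Optional

RESOLVED_WINDOW_SECONDS = 3600  # "within 1 hour" from the definition above
PROJECT_NAMESPACE = 4


def next_edit_delta(conflict_ts: datetime, next_ts: Optional[datetime]) -> Optional[float]:
    """Seconds from entering the conflict workflow to the next revision, or None if there is none."""
    if next_ts is None:
        return None
    return (next_ts - conflict_ts).total_seconds()


def is_resolved(delta: Optional[float]) -> bool:
    """True if a next revision exists and was stored in under an hour."""
    return delta is not None and delta < RESOLVED_WINDOW_SECONDS


def is_talk(namespace_id: int) -> bool:
    """True for odd (talk) namespace IDs and for the project namespace."""
    return namespace_id % 2 == 1 or namespace_id == PROJECT_NAMESPACE


def is_anon(edit_count: int) -> bool:
    """Assumption: an edit count of 0 stands in for an anonymous user."""
    return edit_count == 0


# Example: the next revision arrives ten minutes after the conflict.
delta = next_edit_delta(datetime(2020, 3, 1, 12, 0, 0), datetime(2020, 3, 1, 12, 10, 0))
print(delta, is_resolved(delta), is_talk(1), is_talk(4), is_talk(0), is_anon(0))
```

A conflict with no following revision yields a delta of `None`, which `is_resolved` treats as unresolved.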

In [39]:
import pandas as pd
c = clean_conflicts.toPandas()
c['next_edit_delta'] = (c['next_timestamp'] - c['conflict_timestamp']) / pd.Timedelta(1, unit='s')
# Resolved: a next revision exists and was saved less than an hour after the conflict.
c['is_resolved'] = c['next_rev_id'].notna() & c['next_edit_delta'].lt(3600)
c['is_talk'] = (c['page_namespace'].ne(0) & c['page_namespace'].mod(2).eq(1)) | c['page_namespace'].eq(4)
c['is_anon'] = c['editCount'].eq(0)

In [44]:
c.groupby(['is_talk', 'is_anon', 'twoColConflictShown'])['is_resolved'].mean()

is_talk  is_anon  twoColConflictShown
False    False    False                  0.813333
                  True                   0.878689
         True     False                  0.688947
                  True                   0.676471
True     False    False                  0.922948
                  True                   0.954965
         True     False                  0.844920
                  True                   0.944444
Name: is_resolved, dtype: float64

In [46]:
c.groupby(['is_talk', 'is_anon', 'twoColConflictShown'])['is_resolved'].count()

is_talk  is_anon  twoColConflictShown
False    False    False                  2925
                  True                    915
         True     False                  1900
                  True                     68
True     False    False                  1791
                  True                    866
         True     False                   374
                  True                     36
Name: is_resolved, dtype: int64

In [63]:
# Cutoff: '2020-03-25 12:00:00' (UTC) -> 1585137600 Unix seconds
from pyspark.sql.functions import unix_timestamp
dewiki_conflicts = clean_conflicts.filter((col('wiki') == 'dewiki') & (unix_timestamp(col('conflict_timestamp'), "yyyy-MM-dd HH:mm:ss") > 1585137600))
dewiki_conflicts.count()

853
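The hard-coded cutoff `1585137600` can be checked against the date in the cell's comment. The sketch below assumes the cutoff is meant in UTC; Spark's `unix_timestamp` parses strings in the session time zone, so the comparison only matches this value if that session is also set to UTC.

```python
from datetime import datetime, timezone

# Assumption: the cutoff is meant as 2020-03-25 12:00:00 in UTC.
cutoff = datetime(2020, 3, 25, 12, 0, 0, tzinfo=timezone.utc)
cutoff_epoch = int(cutoff.timestamp())

print(cutoff_epoch)
```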

In [66]:
d = dewiki_conflicts.toPandas()
d['next_edit_delta'] = (d['next_timestamp'] - d['conflict_timestamp']) / pd.Timedelta(1, unit='s')
# Resolved: a next revision exists and was saved less than an hour after the conflict.
d['is_resolved'] = d['next_rev_id'].notna() & d['next_edit_delta'].lt(3600)
d['is_talk'] = (d['page_namespace'].ne(0) & d['page_namespace'].mod(2).eq(1)) | d['page_namespace'].eq(4)
d['is_anon'] = d['editCount'].eq(0)
d.groupby(['is_talk', 'is_anon', 'twoColConflictShown'])['is_resolved'].mean()

is_talk  is_anon  twoColConflictShown
False    False    False                  0.750000
                  True                   0.856354
         True     True                   0.678571
True     False    False                  1.000000
                  True                   0.951220
         True     True                   0.937500
Name: is_resolved, dtype: float64

In [67]:
d.groupby(['is_talk', 'is_anon', 'twoColConflictShown'])['is_resolved'].count()

is_talk  is_anon  twoColConflictShown
False    False    False                    8
                  True                   362
         True     True                    56
True     False    False                   26
                  True                   369
         True     True                    32
Name: is_resolved, dtype: int64

In [19]:
c["user_is_new"] = (c["user_editcount"] < 100)
c["no_js"] = c["is_js"] != True
c.groupby('no_js')['user_is_new'].mean()

no_js
False    0.377483
True     0.484655
Name: user_is_new, dtype: float64