# Analysis of Nuke usage across Wikimedia projects

**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**<br>
**Last updated on 24 July 2023**

[TASK: T341564](https://phabricator.wikimedia.org/T341564)

# Contents

1. [Overview](#Overview)
2. [Data Gathering](#Data-Gathering)
3. [Results](#Results)

# Overview

The [Nuke](https://www.mediawiki.org/wiki/Extension:Nuke) extension enables administrators to mass delete pages recently created by a user or an IP address. As [IP masking](https://meta.wikimedia.org/wiki/IP_Editing:_Privacy_Enhancement_and_Abuse_Mitigation) is being rolled out, this analysis examines how the Nuke feature is used across wikis, to inform whether the extension needs changes to work with IP masking.

**Approach**

Nuke does not write to a dedicated log, nor does it apply an edit tag. Pages deleted using Nuke are recorded in the regular deletion log, like any other deletion, so there is no way to explicitly filter for deletions made with Nuke. However, Nuke adds a consistent default comment to each log entry. The admin performing the action can change this comment, but that is rare. Using the [i18n translations](https://github.com/wikimedia/mediawiki-extensions-Nuke/tree/master/i18n) of the `nuke-defaultreason` and `nuke-multiplepeople` messages from the Nuke extension, log entries recorded as a result of *nuking* were gathered. For languages with no translation, the default (English) message was used.
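
The matching idea can be sketched as follows: a deletion log comment is treated as a Nuke action when it contains either default message with the user placeholder removed. The helper name `is_nuke_comment` and the sample comments are assumptions made for illustration; the analysis itself performs this match in SQL with `LIKE`, in the Gather logs section.

```python
# Illustrative sketch: flag deletion-log comments that look like Nuke actions.
# The two reasons are the English defaults with the user placeholder removed.
DEFAULT_REASON = "Mass deletion of pages added by"
MULTI_REASON = "Mass deletion of recently added pages"


def is_nuke_comment(comment, default_reason=DEFAULT_REASON, multi_reason=MULTI_REASON):
    """Return True if the comment contains either default Nuke message."""
    return default_reason in comment or multi_reason in comment


# Hypothetical log comments, made up for this example
sample_comments = [
    "Mass deletion of pages added by [[Special:Contributions/Example|Example]]",
    "Mass deletion of recently added pages",
    "Vandalism",
]
flags = [is_nuke_comment(c) for c in sample_comments]
print(flags)
```

A comment that an admin rewrote entirely would not be flagged, which is the limitation described above.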

**Summary**



# Data Gathering

## Imports

In [328]:
import wmfdata as wmf
import pandas as pd
import numpy as np

import json
import re
from datetime import datetime, timedelta

from os import listdir
import warnings
from IPython.display import display_html

In [273]:
pd.options.display.max_columns = None
pd.options.display.max_rows = 250
bold = '\033[1m'
end = '\033[0m'

## Collect i18n messages

In [8]:
# shell script to collect i18n files from the Nuke GitHub repo
!chmod +x get_i18n.sh
!./get_i18n.sh

Cloning into 'mediawiki-extensions-Nuke'...
remote: Enumerating objects: 6239, done.
remote: Counting objects: 100% (1198/1198), done.
remote: Compressing objects: 100% (392/392), done.
remote: Total 6239 (delta 817), reused 1182 (delta 801), pack-reused 5041
Receiving objects: 100% (6239/6239), 1.74 MiB | 26.95 MiB/s, done.
Resolving deltas: 100% (4287/4287), done.
yes: standard output: Broken pipe


In [278]:
# canonical information on Wikimedia projects
cd_wikis = pd.read_csv('https://raw.githubusercontent.com/wikimedia-research/canonical-data/master/wiki/wikis.tsv', sep='\t')

public_wikis = cd_wikis.query("""(visibility == 'public') & (editability == 'public') & (status == 'open')""")

In [275]:
# JSON processing function
def load_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
    return data

In [13]:
# the translations usually contain a placeholder variable that is filled with the username or IP when the action is performed
# for example, the English message is: Mass deletion of pages added by [[Special:Contributions/$1|{{GENDER:$1|$1}}]]
# the text needs to be cleaned to remove the placeholder variables; however, the position of the variable is not consistent across languages
# the function was developed incrementally to catch and clean the various patterns observed

def clean_i18n(language, i18n_location='i18n/'):
    
    curr_i18n = load_json(f'{i18n_location}{language}.json')
    
    patternA = r'\[\[.*\]\]'
    patternA1 = r'{{.*}}'
    patternB = r'\$1'
    
    try:
        
        default_reason = curr_i18n['nuke-defaultreason']

        if re.search(patternA, default_reason):
            split_outs = re.split(patternA, default_reason)
            cleaned_default = [i.strip() for i in split_outs if len(i)!=0][0]

            if re.search(patternA1, cleaned_default):
                split_outs = re.split(patternA1, cleaned_default)
                cleaned_default = [i.strip() for i in split_outs if len(i)!=0][0]      

        elif re.search(patternB, default_reason):
            split_outs = re.split(patternB, default_reason)
            cleaned_default = [i.strip() for i in split_outs if len(i)!=0][0]

        else:
            cleaned_default = default_reason
    
    except (KeyError, IndexError):
        # message key missing, or nothing left after cleaning: fall back to English
        cleaned_default = 'SET_EN'
    
    try:
        multiuser_reason = curr_i18n['nuke-multiplepeople']
        
    except KeyError:
        # message key missing for this language: fall back to English
        multiuser_reason = 'SET_EN'
        
    return [cleaned_default, multiuser_reason]
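
As a worked example of the cleaning step, the sketch below applies the same wikilink pattern to the English message quoted in the comment above. The helper name `first_fragment` and the second, made-up message are assumptions for illustration only.

```python
import re

# English `nuke-defaultreason` message, as quoted in the comment above
message = "Mass deletion of pages added by [[Special:Contributions/$1|{{GENDER:$1|$1}}]]"

# Same wikilink pattern as `patternA` in clean_i18n
wikilink = r'\[\[.*\]\]'


def first_fragment(text, pattern):
    """Split on the pattern and keep the first non-empty fragment, stripped."""
    fragments = re.split(pattern, text)
    return [f.strip() for f in fragments if len(f) != 0][0]


cleaned = first_fragment(message, wikilink)
print(cleaned)

# Made-up message with the placeholder in the middle: only the leading text survives
middle = first_fragment("Deleted pages by [[User:$1]] in bulk", wikilink)
print(middle)
```

Because only the first non-empty fragment is kept, text that follows the placeholder is dropped. The resulting search string is therefore shorter and may match more loosely in languages that put the placeholder mid-sentence.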

In [11]:
i18n_location = 'i18n/'
i18n_df = public_wikis.copy()
i18n_df[['reason_default', 'reason_multi']] = None

In [24]:
# default reasons to fall back on for languages with no translation
en_default_reason, en_multi_reason = clean_i18n('en')

In [67]:
i18n_available_langs = [i.replace('.json', '') for i in listdir(i18n_location)]

for lang in i18n_df.language_code.unique():
    
    # there is a slight mismatch between language codes used in the canonical data & i18n files; reassign manually
    if lang == 'zh':
        i18n_lang = 'zh-hans'
    elif lang == 'zh-min-nan':
        i18n_lang = 'zh-hant'
    elif lang == 'be':
        i18n_lang = 'be-tarask'
    else:
        i18n_lang = lang
    
    if i18n_lang in i18n_available_langs:
        
        i18n_cleaned_result = clean_i18n(i18n_lang)
        
        default_reason = i18n_cleaned_result[0]
        multi_reason = i18n_cleaned_result[1]
        
        if default_reason == 'SET_EN':
            i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_default'] = en_default_reason
        else:
            i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_default'] = default_reason
        
        if multi_reason == 'SET_EN':
            i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_multi'] = en_multi_reason
        else:
            i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_multi'] = multi_reason
            
    else:
        i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_default'] = en_default_reason
        i18n_df.loc[i18n_df.query("""language_code == @lang""").index, 'reason_multi'] = en_multi_reason
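
The remapping and fallback logic of the loop above can also be read as a lookup. The following is a minimal sketch under that reading; `LANG_REMAP`, `resolve_reasons`, and the `zh-hans` entry are hypothetical and not part of the notebook.

```python
# Illustrative sketch of the language-code remapping and English fallback.
# LANG_REMAP mirrors the manual reassignments in the loop above.
LANG_REMAP = {'zh': 'zh-hans', 'zh-min-nan': 'zh-hant', 'be': 'be-tarask'}

EN_REASONS = ("Mass deletion of pages added by", "Mass deletion of recently added pages")


def resolve_reasons(lang, cleaned_by_lang, en_reasons=EN_REASONS):
    """Return (default, multi) reasons for a wiki language, falling back to English.

    cleaned_by_lang maps an i18n language code to the [default, multi] pair
    that clean_i18n would return, with 'SET_EN' marking a missing key.
    """
    i18n_lang = LANG_REMAP.get(lang, lang)
    default, multi = cleaned_by_lang.get(i18n_lang, ('SET_EN', 'SET_EN'))
    if default == 'SET_EN':
        default = en_reasons[0]
    if multi == 'SET_EN':
        multi = en_reasons[1]
    return default, multi


# Toy input: the Afrikaans values come from the i18n_df preview; 'zh-hans' is made up
cleaned_by_lang = {
    'af': ['Massa verwydering van bladsye van', 'verskeie gebruikers'],
    'zh-hans': ['EXAMPLE_DEFAULT', 'SET_EN'],
}
print(resolve_reasons('af', cleaned_by_lang))
print(resolve_reasons('zh', cleaned_by_lang))
print(resolve_reasons('ab', cleaned_by_lang))
```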

In [380]:
i18n_df.head(10)

index,database_code,domain_name,database_group,language_code,language_name,status,visibility,editability,english_name,reason_default,reason_multi
3,abwiki,ab.wikipedia.org,wikipedia,ab,Abkhazian,open,public,public,Abkhazian Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages
5,acewiki,ace.wikipedia.org,wikipedia,ace,Achinese,open,public,public,Achinese Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages
8,adywiki,ady.wikipedia.org,wikipedia,ady,Adyghe,open,public,public,Adyghe Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages
9,afwiki,af.wikipedia.org,wikipedia,af,Afrikaans,open,public,public,Afrikaans Wikipedia,Massa verwydering van bladsye van,verskeie gebruikers
10,afwikibooks,af.wikibooks.org,wikibooks,af,Afrikaans,open,public,public,Afrikaans Wikibooks,Massa verwydering van bladsye van,verskeie gebruikers
11,afwikiquote,af.wikiquote.org,wikiquote,af,Afrikaans,open,public,public,Afrikaans Wikiquote,Massa verwydering van bladsye van,verskeie gebruikers
12,afwiktionary,af.wiktionary.org,wiktionary,af,Afrikaans,open,public,public,Afrikaans Wiktionary,Massa verwydering van bladsye van,verskeie gebruikers
16,alswiki,als.wikipedia.org,wikipedia,als,Alsatian,open,public,public,Alemannisch Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages
17,altwiki,alt.wikipedia.org,wikipedia,alt,Southern Altai,open,public,public,Altai Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages
18,amiwiki,ami.wikipedia.org,wikipedia,ami,Amis,open,public,public,Amis Wikipedia,Mass deletion of pages added by,Mass deletion of recently added pages


## Gather logs

In [None]:
%%time

warnings.filterwarnings('ignore')

log_query = """
SELECT
    log_id,
    log_timestamp,
    log_namespace,
    comment_id,
    comment_text,
    actor_name,
    CASE
        WHEN actor_user IS NULL THEN 'IP'
        ELSE 'user'
    END AS actor_class
FROM
    logging log
    JOIN comment cmt ON log.log_comment_id = cmt.comment_id
    LEFT JOIN archive ar ON log.log_title = ar.ar_title AND log.log_namespace = ar.ar_namespace
    LEFT JOIN actor ON ar.ar_actor = actor.actor_id
WHERE
    log_type = 'delete'
    AND log_action = 'delete'
    AND ar_parent_id = 0
    AND (
        comment_text LIKE "%{DEFAULT_REASON}%"
        OR comment_text LIKE "%{MULTI_REASON}%"
    )
"""

combined_result = pd.DataFrame()
for db in i18n_df.database_code.values:
    try:
        result = wmf.mariadb.run(log_query.format(DEFAULT_REASON=i18n_df.query("""database_code == @db""").reason_default.values[0], 
                                                  MULTI_REASON=i18n_df.query("""database_code == @db""").reason_multi.values[0]), 
                                 dbs=db,
                                 date_col='log_timestamp')
        result['db'] = db
        combined_result = pd.concat([combined_result, result], ignore_index=True)
    except Exception as e:
        # report which wiki failed and why, then continue with the next one
        print(f'{db} failed: {e}')

IOStream.flush timed out
IOStream.flush timed out
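
To show what the per-wiki filter looks like once the placeholders are filled, here is a small sketch that formats only the comment condition of the query for `afwiki`. The names `where_template` and `reasons` are assumptions for this example; the reason strings are the ones shown in the `i18n_df` preview.

```python
# Illustrative sketch: fill the LIKE placeholders for a single wiki.
# Only the comment filter of the full query is reproduced here.
where_template = """
    AND (
        comment_text LIKE "%{DEFAULT_REASON}%"
        OR comment_text LIKE "%{MULTI_REASON}%"
    )
"""

# Reasons for afwiki, as shown in the i18n_df preview
reasons = {
    'DEFAULT_REASON': 'Massa verwydering van bladsye van',
    'MULTI_REASON': 'verskeie gebruikers',
}

filled = where_template.format(**reasons)
print(filled)
```

Since the message text is interpolated directly into the SQL string, a translation containing `%`, `_`, or a double quote would change the pattern or break the statement. Whether any translation does so was not checked here.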


In [150]:
combined_result.to_parquet('nuke_logs.parquet', index=False)

In [281]:
nuke_logs = pd.read_parquet('nuke_logs.parquet')

nuke_logs = (
    nuke_logs
    .assign(
        log_namespace=pd.Categorical(nuke_logs['log_namespace']),
        actor_class=pd.Categorical(nuke_logs['actor_class']),
        db=pd.Categorical(nuke_logs['db']), 
        year=lambda nuke_logs:nuke_logs['log_timestamp'].dt.year, 
        month=lambda nuke_logs:nuke_logs['log_timestamp'].dt.month))

nuke_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302982 entries, 0 to 1302981
Data columns (total 10 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   log_id         1302982 non-null  int64         
 1   log_timestamp  1302982 non-null  datetime64[ns]
 2   log_namespace  1302982 non-null  category      
 3   comment_id     1302982 non-null  int64         
 4   comment_text   1302982 non-null  object        
 5   actor_name     1302982 non-null  object        
 6   actor_class    1302982 non-null  category      
 7   db             1302982 non-null  category      
 8   year           1302982 non-null  int64         
 9   month          1302982 non-null  int64         
dtypes: category(3), datetime64[ns](1), int64(4), object(2)
memory usage: 74.6+ MB


In [295]:
# at the time of the analysis, logs from the preceding three years were considered
data_start_date = datetime.now() - timedelta(days=365*3)
nuke_logs_rc = nuke_logs.query("""log_timestamp >= @data_start_date""")

#unique log actions
unique_nukes_rc = nuke_logs_rc.drop_duplicates('log_id', ignore_index=True)
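
The de-duplication matters because the join to `archive` is on page title and namespace, so one log entry can match more than one archived first revision (for example, if a title was created and deleted more than once). The toy data below is made up to illustrate the effect, and the fixed `cutoff` is an assumption used in place of `datetime.now()` for reproducibility.

```python
import pandas as pd

# Toy log rows (made up): log_id 2 appears twice because the join on page title
# matched two archived first revisions.
toy_logs = pd.DataFrame({
    'log_id': [1, 2, 2, 3],
    'log_timestamp': pd.to_datetime(['2019-01-01', '2022-05-01', '2022-05-01', '2023-01-15']),
    'actor_class': ['user', 'IP', 'user', 'IP'],
})

# Fixed cutoff for reproducibility; the notebook uses datetime.now() minus three years
cutoff = pd.Timestamp('2020-07-24')

# Keep recent rows, then keep one row per log entry
recent = toy_logs[toy_logs['log_timestamp'] >= cutoff]
unique_recent = recent.drop_duplicates('log_id', ignore_index=True)
print(unique_recent)
```

Note that `drop_duplicates` keeps the first row per `log_id`, so when a log entry matches creators of different classes, the class that is counted depends on row order.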

# Results

In [309]:
print(bold, 'Nuke actions during the preceding three years by actor-class', end)
print('absolute counts')
print(unique_nukes_rc.actor_class.value_counts())
print('\npercentage')
print(unique_nukes_rc.actor_class.value_counts(normalize=True)*100)

Nuke actions during the preceding three years by actor-class
absolute counts
user    165109
IP       81470
Name: actor_class, dtype: int64

percentage
user    66.959879
IP      33.040121
Name: actor_class, dtype: float64


In [357]:
# aggregate data by wiki
nuke_actions_by_wiki = (unique_nukes_rc[['db', 'actor_class']]
                        .value_counts()
                        .rename_axis(['db', 'actor_class'])
                        .reset_index(name='counts')
                        .pivot(columns='actor_class', index='db', values='counts')
                        .fillna(0))

nuke_actions_by_wiki['total'] = nuke_actions_by_wiki['IP'] + nuke_actions_by_wiki['user']
nuke_actions_by_wiki['IP_percentage'] = nuke_actions_by_wiki['IP'] / nuke_actions_by_wiki['total'] * 100
nuke_actions_by_wiki['user_percentage'] = nuke_actions_by_wiki['user'] / nuke_actions_by_wiki['total'] * 100

nuke_actions_by_wiki.head()

db,IP,user,total,IP_percentage,user_percentage
commonswiki,560.0,28389.0,28949.0,1.934436,98.065564
enwiki,847.0,22379.0,23226.0,3.646775,96.353225
testwiki,207.0,19408.0,19615.0,1.055315,98.944685
wikidatawiki,9261.0,17325.0,26586.0,34.834123,65.165877
jawiki,5763.0,9107.0,14870.0,38.755884,61.244116
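
The same per-wiki table can be sketched on toy data with `pd.crosstab`, which is offered here as an alternative to the `value_counts` and `pivot` chain rather than as the method used in the analysis. The rows below are made up.

```python
import pandas as pd

# Toy rows (made up): one row per unique Nuke log entry
toy = pd.DataFrame({
    'db': ['enwiki', 'enwiki', 'enwiki', 'jawiki', 'jawiki'],
    'actor_class': ['user', 'user', 'IP', 'IP', 'user'],
})

# Count entries per wiki and actor class; missing combinations become 0
by_wiki = pd.crosstab(toy['db'], toy['actor_class'])
by_wiki['total'] = by_wiki['IP'] + by_wiki['user']
by_wiki['IP_percentage'] = by_wiki['IP'] / by_wiki['total'] * 100
print(by_wiki)
```

Because `crosstab` returns integer counts, the float values seen in the output above (a side effect of `fillna(0)` after pivoting) do not appear.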


In [358]:
wiki_comparision = pd.read_csv('https://raw.githubusercontent.com/wikimedia-research/wiki-comparison/main/data-collection/snapshots/Jan_2023.tsv', sep='\t')

nuke_actions_by_wiki = (pd.merge(nuke_actions_by_wiki, 
                                 wiki_comparision[['database code', 'overall size rank']], 
                                 left_on='db', right_on='database code'))

nuke_actions_by_wiki = (nuke_actions_by_wiki
                        .assign(
                            IP=nuke_actions_by_wiki['IP'].astype(int), 
                            user=nuke_actions_by_wiki['user'].astype(int),
                            total=nuke_actions_by_wiki['total'].astype(int),
                            IP_percentage=round(nuke_actions_by_wiki['IP_percentage'], 2),
                            user_percentage=round(nuke_actions_by_wiki['user_percentage'], 2))
                        .sort_values('overall size rank', ignore_index=True))

nuke_actions_by_wiki.head()

index,IP,user,total,IP_percentage,user_percentage,database code,overall size rank
0,847,22379,23226,3.65,96.35,enwiki,1
1,5763,9107,14870,38.76,61.24,jawiki,3
2,679,782,1461,46.48,53.52,dewiki,4
3,925,2469,3394,27.25,72.75,ruwiki,6
4,560,28389,28949,1.93,98.07,commonswiki,7
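
`pd.merge` defaults to an inner join, so any wiki missing from the wiki-comparison snapshot is silently dropped from the table. A hedged way to check for this is a left join with `indicator=True`, sketched below on made-up data; whether any wiki is actually dropped in the real data is not established here.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up)
counts = pd.DataFrame({'db': ['enwiki', 'jawiki', 'testwiki'], 'total': [10, 5, 7]})
ranks = pd.DataFrame({'database code': ['enwiki', 'jawiki'], 'overall size rank': [1, 3]})

# Left join keeps every wiki; the _merge column records where each row was found
checked = pd.merge(counts, ranks, how='left',
                   left_on='db', right_on='database code', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'db'].tolist()
print(unmatched)
```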


In [368]:
nuke_actions_by_wiki.set_index('database code').to_csv('nuke_actions_by_wiki.tsv', sep='\t')

In [375]:
nuke_actions_by_wiki.query("""(total > 300) & (IP_percentage >= 30)""").sort_values('IP_percentage', ascending=False, ignore_index=True)

index,IP,user,total,IP_percentage,user_percentage,database code,overall size rank
0,419,0,419,100.0,0.0,thwikiquote,442
1,471,2,473,99.58,0.42,tnwiki,420
2,447,4,451,99.11,0.89,gotwiki,580
3,668,7,675,98.96,1.04,stwiki,474
4,1113,25,1138,97.8,2.2,satwiki,348
5,617,52,669,92.23,7.77,sowiki,169
6,931,97,1028,90.56,9.44,fiwiktionary,120
7,335,38,373,89.81,10.19,ltwiki,50
8,531,67,598,88.8,11.2,jawikiversity,541
9,2663,362,3025,88.03,11.97,jawikibooks,95
