# Content Translation Article Deletion Ratios, across all wikis
**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**

**Last updated on 22 November 2023**

[TASK: T347471](https://phabricator.wikimedia.org/T347471)

# Contents

1. [Overview](#Overview)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)
    * [Current Quarter - FY23 Q1](#Current-Quarter)
    * [Previous Quarter - FY22 Q4](#Previous-Quarter)
4. [Formatting](#Formatting)

# Overview

## Purpose
The purpose of this analysis is to identify the wikis where articles created with Content Translation (CX) are deleted at a higher rate than articles created by other means. Specifically, we want to answer the following questions:
* How many wikis have translations deleted more often than regular articles?
* Which wikis are these?
* Has the number of such wikis decreased compared to the previous period?
* What is the highest deletion ratio for translations on any wiki?

This analysis will be used as a baseline to assess how deletion rates evolve as improvements are made.

## Summary
* The deletion rate for CX-created articles (5.22%) is considerably lower than that of non-CX-created articles (12.71%).
* [23 wikis](https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison#July_2023_through_September_2023_(Q1_2023)) had a higher deletion rate for CX-created articles than for other articles. 10 of these wikis also had higher deletion rates for CX-created articles during the previous quarter.
* Among these, Kurdish WP (kuwiki) has been on the list for the last four quarters, and Armenian WP (hywiki), Lithuanian WP (ltwiki), and Tatar WP (ttwiki) for the last three quarters.

# Data-Gathering

In [30]:
import numpy as np
import pandas as pd
import wmfdata as wmf

pd.options.display.max_columns = None
from IPython.display import clear_output

import importlib
import warnings

import data_functions as dtf
import formatting_functions as ftf

In [31]:
importlib.reload(dtf)
importlib.reload(ftf)

<module 'formatting_functions' from '/srv/home/kcv-wikimf/gitref/content-translation-deletion-stats/formatting_functions.py'>

## spark_session

In [33]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='cx-deletion-stats',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session

In [7]:
spark_session.sparkContext.setLogLevel("ERROR")

## run query

In [35]:
currq_dates = dtf.generate_quarters(2023)['Q1']
prevq_dates = dtf.generate_quarters(2022)['Q4']
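
The source of `dtf.generate_quarters` is not shown in this notebook. As a rough illustration only, the sketch below assumes that fiscal year N runs from 1 July N to 30 June N+1 (which is consistent with FY23 Q1 covering July through September 2023) and that each quarter maps to the keys used later in the query (`start_dt`, `end_dt`, `mw_snapshot`). The function name `fiscal_quarter_dates`, the date formats, and the snapshot naming are all assumptions.

```python
from datetime import date, timedelta


def fiscal_quarter_dates(fiscal_year: int) -> dict:
    """Hypothetical stand-in for dtf.generate_quarters.

    Assumes fiscal year N runs from 1 July N to 30 June N+1.
    """
    quarters = {}
    for i in range(4):
        # Zero-based month offset from January of `fiscal_year`; Q1 starts in July.
        month_index = 6 + 3 * i
        start = date(fiscal_year + month_index // 12, month_index % 12 + 1, 1)
        # First day of the following quarter, then step back one day.
        next_index = month_index + 3
        next_start = date(fiscal_year + next_index // 12, next_index % 12 + 1, 1)
        end = next_start - timedelta(days=1)
        quarters[f"Q{i + 1}"] = {
            "start_dt": start.isoformat(),
            "end_dt": end.isoformat(),
            # Assumption: the snapshot is named after the last month of the quarter.
            "mw_snapshot": end.strftime("%Y-%m"),
        }
    return quarters


current_quarter = fiscal_quarter_dates(2023)["Q1"]
previous_quarter = fiscal_quarter_dates(2022)["Q4"]
```

Under these assumptions, `fiscal_quarter_dates(2022)["Q4"]` is the quarter immediately preceding `fiscal_quarter_dates(2023)["Q1"]`, which matches how the two periods are compared here.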

In [36]:
%%time

warnings.filterwarnings('ignore')

deletion_stats_currq_all = dtf.query_deletion_stats(currq_dates)
deletion_stats_prevq_all = dtf.query_deletion_stats(prevq_dates)

CPU times: user 1.12 s, sys: 35.1 ms, total: 1.15 s
Wall time: 4min 38s


                                                                                

# Analysis

## Current-Quarter

In [37]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_currq_all, period='FY23-Q1', pr=True)

During FY23-Q1, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 5.22%
	- created without using the Content Translation Tool: 12.71%


In [38]:
# deletion ratio by wiki
deletion_stats_currq = dtf.generate_ratios_by_wiki(deletion_stats_currq_all)

In [39]:
print(f'During FY23-Q1, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_currq.query("""deletion_pct_diff < 0""").shape[0]} wikis where articles created using CX \
were deleted more than articles created without using CX')

During FY23-Q1, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 22 wikis where articles created using CX were deleted more than articles created without using CX


In [40]:
# wikis with high deletion ratio
currq_high_deletion_ratio = deletion_stats_currq.query("""deletion_pct_diff < 0""").sort_values('deletion_pct_diff')
currq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| aswiki | 24 | 426 | 10 | 29 | 41.67 | 6.81 | -34.86 |
| kawiki | 77 | 1326 | 33 | 234 | 42.86 | 17.65 | -25.21 |
| ltwiki | 43 | 4126 | 13 | 272 | 30.23 | 6.59 | -23.64 |
| lawiki | 16 | 603 | 4 | 49 | 25.0 | 8.13 | -16.87 |
| cebwiki | 18 | 840 | 3 | 63 | 16.67 | 7.5 | -9.17 |
| ocwiki | 36 | 794 | 3 | 14 | 8.33 | 1.76 | -6.57 |
| kuwiki | 65 | 1023 | 5 | 18 | 7.69 | 1.76 | -5.93 |
| brwiki | 21 | 2865 | 1 | 21 | 4.76 | 0.73 | -4.03 |
| lvwiki | 35 | 3321 | 3 | 152 | 8.57 | 4.58 | -3.99 |
| uzwiki | 4686 | 5801 | 1540 | 1726 | 32.86 | 29.75 | -3.11 |
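
The per-wiki columns in the table above can be reproduced from the raw counts. The sketch below is not the source of `dtf.generate_ratios_by_wiki` (which is not shown here); it is a minimal reconstruction that assumes the 15-article threshold mentioned in the printed message and assumes `deletion_pct_diff` is the non-CX percentage minus the CX percentage, which is what the displayed rows imply. The function name `ratios_by_wiki` and its `min_cx_articles` parameter are hypothetical.

```python
import pandas as pd


def ratios_by_wiki(stats: pd.DataFrame, min_cx_articles: int = 15) -> pd.DataFrame:
    """Hypothetical sketch of the per-wiki ratio step."""
    # Keep only wikis with more than `min_cx_articles` CX-created articles.
    df = stats[stats["created_cx"] > min_cx_articles].copy()
    df["deleted_cx_pct"] = (df["deleted_cx"] / df["created_cx"] * 100).round(2)
    df["deleted_non_cx_pct"] = (df["deleted_non_cx"] / df["created_non_cx"] * 100).round(2)
    # Negative values mean CX articles were deleted at a higher rate.
    df["deletion_pct_diff"] = (df["deleted_non_cx_pct"] - df["deleted_cx_pct"]).round(2)
    return df


# Two rows copied from the table above.
sample = pd.DataFrame(
    {
        "created_cx": [24, 4686],
        "created_non_cx": [426, 5801],
        "deleted_cx": [10, 1540],
        "deleted_non_cx": [29, 1726],
    },
    index=pd.Index(["aswiki", "uzwiki"], name="wiki_db"),
)
result = ratios_by_wiki(sample)
```

With these inputs, the computed percentages for aswiki and uzwiki agree with the corresponding rows of the table.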


## Previous-Quarter

In [41]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_prevq_all, period='FY22-Q4', pr=True)

During FY22-Q4, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 3.62%
	- created without using the Content Translation Tool: 13.06%


In [42]:
# deletion ratio by wiki
deletion_stats_prevq = dtf.generate_ratios_by_wiki(deletion_stats_prevq_all)

In [43]:
print(f'During FY22-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_prevq.query("""deletion_pct_diff < 0""").shape[0]} wikis where articles created using CX \
were deleted more than articles created without using CX')        

During FY22-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 23 wikis where articles created using CX were deleted more than articles created without using CX


In [44]:
# wikis with high deletion ratio

prevq_high_deletion_ratio = deletion_stats_prevq.query("""deletion_pct_diff < 0""").sort_values('deletion_pct_diff')
prevq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| shwiki | 20 | 791 | 13 | 53 | 65.0 | 6.7 | -58.3 |
| suwiki | 19 | 144 | 10 | 15 | 52.63 | 10.42 | -42.21 |
| tnwiki | 32 | 32 | 13 | 4 | 40.62 | 12.5 | -28.12 |
| lvwiki | 34 | 3521 | 10 | 224 | 29.41 | 6.36 | -23.05 |
| ltwiki | 59 | 4077 | 17 | 562 | 28.81 | 13.78 | -15.03 |
| fiwiki | 76 | 7945 | 17 | 738 | 22.37 | 9.29 | -13.08 |
| gdwiki | 17 | 90 | 2 | 3 | 11.76 | 3.33 | -8.43 |
| iuwiki | 16 | 12 | 16 | 11 | 100.0 | 91.67 | -8.33 |
| bswiki | 110 | 785 | 22 | 106 | 20.0 | 13.5 | -6.5 |
| kuwiki | 86 | 568 | 8 | 18 | 9.3 | 3.17 | -6.13 |


In [45]:
# wikis where CX-created articles had a higher deletion rate than non-CX articles in both quarters
wikis_high_deletion_ratio = np.intersect1d(currq_high_deletion_ratio.index.values, prevq_high_deletion_ratio.index.values)
wikis_high_deletion_ratio

array(['bswiki', 'dewiki', 'kuwiki', 'ltwiki', 'lvwiki', 'mtwiki',
       'ocwiki', 'shwiki'], dtype=object)

# Formatting
For publication on MediaWiki.org at [Content translation/Deletion statistics comparison](https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison)

In [46]:
currq_wtable = currq_high_deletion_ratio.copy()

In [47]:
# format the percentage columns

percentage_columns = ['deleted_cx_pct', 'deleted_non_cx_pct', 'deletion_pct_diff']
currq_wtable[percentage_columns] = currq_wtable[percentage_columns]/100

currq_wtable = (
    currq_wtable
    .assign(
        deleted_cx_pct = ftf.format_percent('deleted_cx_pct', currq_wtable),
        deleted_non_cx_pct = ftf.format_percent('deleted_non_cx_pct', currq_wtable),
        deletion_pct_diff = ftf.format_percent('deletion_pct_diff', currq_wtable)
    )
    .reset_index()
)
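
`ftf.format_percent` is defined in a helper module that is not shown in this notebook. Judging only from how it is called (a column name plus the DataFrame) and from the formatted output (`25.00%`, `-34.86%`), a minimal sketch might look like the following. The body is an assumption, not the actual helper.

```python
import pandas as pd


def format_percent(column: str, df: pd.DataFrame) -> pd.Series:
    """Hypothetical sketch: format fractional values as percentage strings."""
    # "{:.2%}" multiplies by 100 and appends "%", keeping two decimals.
    return df[column].map("{:.2%}".format)


demo = pd.DataFrame({"deleted_cx_pct": [0.25, 0.075], "deletion_pct_diff": [-0.3486, -0.0917]})
formatted_cx = format_percent("deleted_cx_pct", demo).tolist()
formatted_diff = format_percent("deletion_pct_diff", demo).tolist()
```

This is why the percentage columns are divided by 100 before formatting: the format specifier expects fractions rather than values already expressed in percent.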

In [48]:
# rename columns
columns_rename_map = {
    'wiki_db': 'Wikipedia',
    'created_cx': 'Created CX Articles', 
    'created_non_cx': 'Created non-CX Articles', 
    'deleted_cx': 'Deleted CX Articles', 
    'deleted_non_cx': 'Deleted non-CX Articles',
    'deleted_cx_pct': 'CX Articles Deletion Ratio', 
    'deleted_non_cx_pct': 'Non-CX Articles Deletion Ratio', 
    'deletion_pct_diff': 'Deletion Ratio Difference'
}

currq_wtable.rename(columns_rename_map, axis=1, inplace=True)

In [49]:
# create a multi-level column
column_arrays = [
    np.array(['Wikipedia'] + ['Created Articles'] * 2 + ['Deleted Articles'] * 2 + ['Deletion Ratios'] * 3),
    currq_wtable.columns.to_numpy()
]

currq_wtable.columns = pd.MultiIndex.from_arrays(column_arrays)

currq_wtable.head()

| | Wikipedia | Created Articles | Created Articles | Deleted Articles | Deleted Articles | Deletion Ratios | Deletion Ratios | Deletion Ratios |
|---|---|---|---|---|---|---|---|---|
| | **Wikipedia** | **Created CX Articles** | **Created non-CX Articles** | **Deleted CX Articles** | **Deleted non-CX Articles** | **CX Articles Deletion Ratio** | **Non-CX Articles Deletion Ratio** | **Deletion Ratio Difference** |
| 0 | aswiki | 24 | 426 | 10 | 29 | 41.67% | 6.81% | -34.86% |
| 1 | kawiki | 77 | 1326 | 33 | 234 | 42.86% | 17.65% | -25.21% |
| 2 | ltwiki | 43 | 4126 | 13 | 272 | 30.23% | 6.59% | -23.64% |
| 3 | lawiki | 16 | 603 | 4 | 49 | 25.00% | 8.13% | -16.87% |
| 4 | cebwiki | 18 | 840 | 3 | 63 | 16.67% | 7.50% | -9.17% |


In [50]:
# add a footnote (as superscript) to wikis that also had a higher deletion ratio for CX-created articles in the previous quarter
currq_wtable[('Wikipedia', 'Wikipedia')] = currq_wtable[('Wikipedia', 'Wikipedia')].apply(lambda x:ftf.add_footnote(x, wikis_high_deletion_ratio))
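
`ftf.add_footnote` is likewise not shown. Based on the published markup (for example `ltwiki<sup>2</sup>`), a plausible sketch is a function that appends a superscript marker when the wiki appears in the repeat list. The `marker` parameter and its default are assumptions made for illustration.

```python
def add_footnote(wiki: str, flagged_wikis, marker: str = "2") -> str:
    """Hypothetical sketch: append a superscript footnote marker to flagged wikis."""
    if wiki in flagged_wikis:
        return f"{wiki}<sup>{marker}</sup>"
    return wiki


flagged = ["kuwiki", "ltwiki", "lvwiki"]
labelled = [add_footnote(w, flagged) for w in ["aswiki", "ltwiki"]]
```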

In [51]:
table_headers = [
    'Wikipedias with higher deletion ratios for articles created with Content Translation',
    'Reviewed Time Period: July through September 2023 (FY 23 Q1)'
]

table_footers = [
    '<sup>1</sup> Excludes Wikipedias with 15 or fewer articles created with Content Translation during the reviewed time period.',
    '<sup>2</sup> Also identified in the prior quarter as a wiki with a higher deletion ratio for articles created with Content Translation.'
]

In [52]:
# to be published at https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison
print(ftf.dataframe_to_mediawiki(currq_wtable, table_headers, table_footers))

{| class='wikitable'
! colspan='8' | Wikipedias with higher deletion ratios for articles created with Content Translation
! colspan='8' | Reviewed Time Period: July through September 2023 (FY 23 Q1)
|-
colspan='1' | Wikipedia !! colspan='2' | Created Articles !! colspan='2' | Deleted Articles !! colspan='3' | Deletion Ratios
colspan='1' | Wikipedia !! colspan='1' | Created CX Articles !! colspan='1' | Created non-CX Articles !! colspan='1' | Deleted CX Articles !! colspan='1' | Deleted non-CX Articles !! colspan='1' | CX Articles Deletion Ratio !! colspan='1' | Non-CX Articles Deletion Ratio !! colspan='1' | Deletion Ratio Difference
|-
| aswiki || 24 || 426 || 10 || 29 || 41.67% || 6.81% || -34.86%
|-
| kawiki || 77 || 1326 || 33 || 234 || 42.86% || 17.65% || -25.21%
|-
| ltwiki<sup>2</sup> || 43 || 4126 || 13 || 272 || 30.23% || 6.59% || -23.64%
|-
| lawiki || 16 || 603 || 4 || 49 || 25.00% || 8.13% || -16.87%
|-
| cebwiki || 18 || 840 || 3 || 63 || 16.67% || 7.50% || -9.17%
|-
| ocw
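
The markup above comes from `ftf.dataframe_to_mediawiki`, whose source is not included in this notebook. For readers unfamiliar with wikitable syntax, the sketch below shows one simple way to render a DataFrame with flat (single-level) columns as a wikitable. It is not the author's implementation: the function name `to_wikitable` and the `caption` parameter are hypothetical, and it does not handle the multi-level headers, header rows, or footers used here.

```python
import pandas as pd


def to_wikitable(df: pd.DataFrame, caption: str = "") -> str:
    """Illustrative only: render a flat-column DataFrame as MediaWiki table markup."""
    lines = ["{| class='wikitable'"]
    if caption:
        lines.append(f"|+ {caption}")
    # Header cells start with "!" and are separated by "!!".
    lines.append("! " + " !! ".join(str(c) for c in df.columns))
    for row in df.itertuples(index=False):
        lines.append("|-")  # row separator
        # Data cells start with "|" and are separated by "||".
        lines.append("| " + " || ".join(str(v) for v in row))
    lines.append("|}")
    return "\n".join(lines)


demo = pd.DataFrame({"Wikipedia": ["aswiki", "kawiki"], "Created CX Articles": [24, 77]})
markup = to_wikitable(demo, caption="Example")
```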

## September Deletion Stats analysis
The deletion rate for CX-created articles increased in September 2023. This section identifies the source of that increase.

In [116]:
%%time

deletion_stats_by_wiki_month_query = """
SELECT
    wiki_db,
    MONTH(event_timestamp) AS month,
    
    -- Counting created CX articles
    SUM(CASE 
            WHEN ARRAY_CONTAINS(revision_tags, 'contenttranslation') THEN 1 
        ELSE 0 
    END) AS created_cx,

    -- Counting total created articles
    COUNT(*) AS total_articles,

    -- Counting deleted CX articles
    SUM(CASE
            WHEN ARRAY_CONTAINS(revision_tags, 'contenttranslation')
             AND revision_is_deleted_by_page_deletion 
             AND revision_deleted_by_page_deletion_timestamp BETWEEN '{START_DATE}' and '{END_DATE}' THEN 1
        ELSE 0 
     END) AS deleted_cx,

    -- Counting total deleted articles
    SUM(CASE 
            WHEN revision_is_deleted_by_page_deletion 
             AND revision_deleted_by_page_deletion_timestamp BETWEEN '{START_DATE}' and '{END_DATE}' THEN 1 
        ELSE 0 
    END) AS deleted_articles
FROM 
    wmf.mediawiki_history mwh
-- Join canonical data about wikis
JOIN
    canonical_data.wikis cdw 
    ON mwh.wiki_db = cdw.database_code
WHERE
    snapshot = '{MW_SNAPSHOT}'
    AND event_timestamp BETWEEN '{START_DATE}' and '{END_DATE}'
    -- Article namespace only
    AND page_namespace = 0
    -- New page creations only
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
    -- Remove bots
    AND size(event_user_is_bot_by_historical) <= 0
    -- Limit to Wikipedias only
    AND database_group = 'wikipedia'
    -- Limit to those that are currently live
    AND status = 'open'
GROUP BY  
    wiki_db,
    MONTH(event_timestamp)
"""

stats_by_wiki = wmf.spark.run(
    deletion_stats_by_wiki_month_query.format(
        MW_SNAPSHOT=currq_dates['mw_snapshot'],
        START_DATE=currq_dates['start_dt'], 
        END_DATE=currq_dates['end_dt']
    )                        
)





CPU times: user 508 ms, sys: 85.5 ms, total: 593 ms
Wall time: 3min 41s


                                                                                

In [145]:
stats_by_wiki.sort_values(['wiki_db', 'month'], inplace=True, ignore_index=True)
monthly_agg = stats_by_wiki.groupby('month').agg({'created_cx': 'sum', 'deleted_cx': 'sum'})
monthly_agg['deleted_cx_pct'] = round(monthly_agg.deleted_cx / monthly_agg.created_cx * 100, 2)
monthly_agg

| month | created_cx | deleted_cx | deleted_cx_pct |
|---|---|---|---|
| 7 | 22100 | 854 | 3.86 |
| 8 | 24261 | 969 | 3.99 |
| 9 | 25573 | 2016 | 7.88 |


In [146]:
wikis_pivot = stats_by_wiki.pivot_table(index='wiki_db', columns='month', values='deleted_cx', aggfunc='sum', fill_value=0)
wikis_pivot['sep_diff'] = wikis_pivot[9] - wikis_pivot[[7, 8]].mean(axis=1)
wikis_pivot.sort_values(by='sep_diff', ascending=False, inplace=True)
wikis_pivot.head(5)

| wiki_db | 7 | 8 | 9 | sep_diff |
|---|---|---|---|---|
| uzwiki | 23 | 181 | 1336 | 1234.0 |
| arwiki | 34 | 38 | 136 | 100.0 |
| hawiki | 1 | 1 | 14 | 13.0 |
| jawiki | 11 | 11 | 21 | 10.0 |
| thwiki | 2 | 2 | 10 | 8.0 |


In [147]:
sep_pct_deleted = monthly_agg.query("""month == 9""").deleted_cx_pct.values[0]

for wp in wikis_pivot.index[:5]:
    stats_excl_wiki = stats_by_wiki.query("""wiki_db != @wp""")
    monthly_agg_excl_wiki = stats_excl_wiki.groupby('month').agg({'created_cx': 'sum', 'deleted_cx': 'sum'})
    monthly_agg_excl_wiki['deleted_cx_pct'] = round(monthly_agg_excl_wiki.deleted_cx / monthly_agg_excl_wiki.created_cx * 100, 2)
    
    sep_pct_deleted_excl_wiki = monthly_agg_excl_wiki.query("""month == 9""").deleted_cx_pct.values[0]
    print(f'For Sep 2023, by excluding {wp}, the percentage of deleted articles created using CX changes from {sep_pct_deleted}% to {sep_pct_deleted_excl_wiki}%')

For Sep 2023, by excluding uzwiki, the percentage of deleted articles created using CX changes from 7.88% to 3.1%
For Sep 2023, by excluding arwiki, the percentage of deleted articles created using CX changes from 7.88% to 7.78%
For Sep 2023, by excluding hawiki, the percentage of deleted articles created using CX changes from 7.88% to 8.46%
For Sep 2023, by excluding jawiki, the percentage of deleted articles created using CX changes from 7.88% to 7.88%
For Sep 2023, by excluding thwiki, the percentage of deleted articles created using CX changes from 7.88% to 7.87%


In [148]:
uzwiki_stats = stats_by_wiki.query("""wiki_db == 'uzwiki'""").groupby('month').sum()
uzwiki_stats['deleted_cx_pct'] = round(uzwiki_stats.deleted_cx / uzwiki_stats.created_cx * 100, 2)
uzwiki_stats['deleted_articles_pct'] = round(uzwiki_stats.deleted_articles / uzwiki_stats.total_articles * 100, 2)
uzwiki_stats

| month | created_cx | total_articles | deleted_cx | deleted_articles | deleted_cx_pct | deleted_articles_pct |
|---|---|---|---|---|---|---|
| 7 | 511 | 1600 | 23 | 226 | 4.5 | 14.12 |
| 8 | 547 | 1618 | 181 | 373 | 33.09 | 23.05 |
| 9 | 3628 | 7269 | 1336 | 2667 | 36.82 | 36.69 |


In [158]:
monthly_agg_excl_uz = (
    stats_by_wiki
    .query("""not (wiki_db == 'uzwiki' and month == 9)""")
    .groupby('month')
    .agg({'created_cx': 'sum', 'deleted_cx': 'sum'})
)
monthly_agg_excl_uz['deleted_cx_pct'] = round(monthly_agg_excl_uz.deleted_cx / monthly_agg_excl_uz.created_cx * 100, 2)
print(f'Deleted CX percentage average of Q1 FY23, excluding Uzbek Wikipedia is {monthly_agg_excl_uz.deleted_cx_pct.mean()}%')
monthly_agg_excl_uz

Deleted CX percentage average of Q1 FY23, excluding Uzbek Wikipedia is 3.65%


| month | created_cx | deleted_cx | deleted_cx_pct |
|---|---|---|---|
| 7 | 22100 | 854 | 3.86 |
| 8 | 24261 | 969 | 3.99 |
| 9 | 21945 | 680 | 3.1 |


**Summary**
- The increase in the deletion rate of articles created using CX is due to increased activity and a higher deletion rate on Uzbek Wikipedia (likely due to a content campaign).
- Excluding Uzbek Wikipedia, the deletion rate for September 2023 drops from 7.88% to 3.1%. With Uzbek Wikipedia's September data excluded, the mean of the three monthly rates is 3.65%, compared with the overall quarterly rate of 5.22%.
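
The figures in this summary can be checked by hand from the monthly tables above. The short sketch below redoes the arithmetic using only the counts shown in those tables. The variable names are chosen for this illustration.

```python
# September totals across all wikis (from the monthly aggregate table).
created_sep, deleted_sep = 25573, 2016
# Uzbek Wikipedia's September counts (from the uzwiki table).
uz_created_sep, uz_deleted_sep = 3628, 1336

# September CX deletion rate with and without Uzbek Wikipedia.
pct_all = round(deleted_sep / created_sep * 100, 2)
pct_excl_uz = round(
    (deleted_sep - uz_deleted_sep) / (created_sep - uz_created_sep) * 100, 2
)

# July and August rates, unchanged because only uzwiki's September rows are dropped.
pct_jul = round(854 / 22100 * 100, 2)
pct_aug = round(969 / 24261 * 100, 2)

# Unweighted mean of the three monthly rates, as reported in the notebook.
monthly_mean = round((pct_jul + pct_aug + pct_excl_uz) / 3, 2)

print(pct_all, pct_excl_uz, monthly_mean)
```

Note that the 3.65% figure is an unweighted mean of monthly percentages, so it is not directly comparable to a rate pooled over all articles in the quarter.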