# Content Translation Article Deletion Ratios, across all wikis

**Last updated on 23 October 2024**

# Contents

1. [Overview](#Overview)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)
    * [Current Quarter - FY23 Q4](#Current-Quarter)
    * [Previous Quarter - FY23 Q3](#Previous-Quarter)
4. [Formatting](#Formatting)

# Overview

## Purpose
The purpose of this analysis is to identify the wikis where articles created with Content Translation (CX) are deleted at a higher rate than articles created with other tools. Specifically, we want to answer the following questions:
* How many wikis have translations deleted more often than regular articles?
* Which wikis are these?
* Has the number of such wikis decreased compared to the previous period?
* What is the highest deletion ratio for translations on any wiki?

This analysis will be used as a baseline to assess how deletion rates evolve as improvements are made.

Results are posted at: https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison#April_to_June_2024_(Q4_2023-24)

# Data-Gathering

## Setup

In [1]:
import importlib
import warnings

import numpy as np
import pandas as pd
import wmfdata as wmf
from IPython.display import clear_output

import data_functions as dtf
import formatting_functions as ftf

# show all columns when displaying DataFrames
pd.options.display.max_columns = None

In [2]:
importlib.reload(dtf)
importlib.reload(ftf)

<module 'formatting_functions' from '/srv/home/kcvelaga/git/content-translation-deletion-stats/formatting_functions.py'>

In [3]:
spark_session = wmf.spark.get_active_session()

if spark_session is None:
    spark_session = wmf.spark.create_custom_session(
        master="yarn",
        app_name='cx-del-stats-jun24',
        spark_config={
            "spark.driver.memory": "4g",
            "spark.dynamicAllocation.maxExecutors": 64,
            "spark.executor.memory": "16g",
            "spark.executor.cores": 4,
            "spark.sql.shuffle.partitions": 256,
            "spark.driver.maxResultSize": "2g"
        }
    )

spark_session.sparkContext.setLogLevel("ERROR")

clear_output()

spark_session

## Run query

In [4]:
# generate the fiscal-year quarters once and pick the two periods to compare
quarters = dtf.generate_quarters(2023)
currq_dates = quarters['Q4']
prevq_dates = quarters['Q3']
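
`dtf.generate_quarters` lives in the project's `data_functions` module and is not shown in this notebook. As a rough sketch only, assuming the fiscal year starts on 1 July (consistent with Q4 2023-24 covering April to June 2024) and that the helper returns a start and end date per quarter, it might look like the following. The name `fiscal_quarters` and the return format are assumptions, not the project's actual API.

```python
from datetime import date, timedelta


def fiscal_quarters(fy_start_year):
    """Hypothetical stand-in for dtf.generate_quarters.

    Assumes the fiscal year starts on 1 July of `fy_start_year`
    and returns {quarter: (first_day, last_day)}.
    """
    quarters = {}
    for i in range(4):
        # Months elapsed since January of fy_start_year (July = 6)
        month_index = 6 + 3 * i
        start = date(fy_start_year + month_index // 12, month_index % 12 + 1, 1)
        # The quarter ends the day before the next quarter starts
        next_index = month_index + 3
        next_start = date(fy_start_year + next_index // 12, next_index % 12 + 1, 1)
        quarters[f"Q{i + 1}"] = (start, next_start - timedelta(days=1))
    return quarters


q = fiscal_quarters(2023)
print(q["Q4"])  # first and last day of April to June 2024
print(q["Q3"])  # first and last day of January to March 2024
```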

In [5]:
%%time

warnings.filterwarnings('ignore')

deletion_stats_currq_all = dtf.query_deletion_stats(currq_dates)
deletion_stats_prevq_all = dtf.query_deletion_stats(prevq_dates)



CPU times: user 731 ms, sys: 137 ms, total: 868 ms
Wall time: 4min 10s


                                                                                

# Analysis

## Current-Quarter

In [6]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_currq_all, period='FY23-Q4', pr=True)

During FY23-Q4, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 4.73%
	- created without using the Content Translation Tool: 11.46%


In [9]:
# deletion ratio by wiki
deletion_stats_currq = dtf.generate_ratios_by_wiki(deletion_stats_currq_all)

In [8]:
print(f'During FY23-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_currq.query("""deletion_pct_diff <= -2""").shape[0]} wikis where the deletion rate of articles created \
using CX was at least 2 percentage points higher than that of articles created without using CX.')

During FY23-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 16 wikis where the deletion rate of articles created using CX was at least 2 percentage points higher than that of articles created without using CX.


In [10]:
# wikis with high deletion ratio
currq_high_deletion_ratio = deletion_stats_currq.query("""deletion_pct_diff <= -2""").sort_values('deletion_pct_diff')
currq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| yiwiki | 41 | 21 | 32 | 8 | 78.05 | 38.1 | -39.95 |
| kuwiki | 30 | 846 | 10 | 45 | 33.33 | 5.32 | -28.01 |
| swwiki | 21 | 2591 | 6 | 47 | 28.57 | 1.81 | -26.76 |
| lvwiki | 38 | 3397 | 9 | 252 | 23.68 | 7.42 | -16.26 |
| ocwiki | 16 | 541 | 3 | 20 | 18.75 | 3.7 | -15.05 |
| uzwiki | 15340 | 31189 | 2036 | 861 | 13.27 | 2.76 | -10.51 |
| roa_rupwiki | 17 | 33 | 2 | 1 | 11.76 | 3.03 | -8.73 |
| mnwiki | 19 | 759 | 4 | 101 | 21.05 | 13.31 | -7.74 |
| hrwiki | 254 | 2147 | 59 | 372 | 23.23 | 17.33 | -5.9 |
| tnwiki | 22 | 579 | 1 | 2 | 4.55 | 0.35 | -4.2 |
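
`dtf.generate_ratios_by_wiki` is defined outside this notebook. The sketch below is an assumption about how the columns in the table above relate, inferred from the values shown: each percentage is deleted / created × 100, the difference is the non-CX percentage minus the CX percentage (so a negative value means CX articles were deleted more often), and wikis with 15 or fewer CX articles are excluded. The name `ratios_by_wiki` is hypothetical. The yiwiki and kuwiki counts are taken from the table above, while `examplewiki` is a made-up row included only to show the filter.

```python
import pandas as pd


def ratios_by_wiki(df, min_cx_created=15):
    """Hypothetical sketch of per-wiki deletion percentages and their difference."""
    # Keep only wikis with more than `min_cx_created` CX articles
    out = df[df["created_cx"] > min_cx_created].copy()
    out["deleted_cx_pct"] = (out["deleted_cx"] / out["created_cx"] * 100).round(2)
    out["deleted_non_cx_pct"] = (
        out["deleted_non_cx"] / out["created_non_cx"] * 100
    ).round(2)
    # Negative difference: CX articles were deleted at a higher rate than non-CX ones
    out["deletion_pct_diff"] = (
        out["deleted_non_cx_pct"] - out["deleted_cx_pct"]
    ).round(2)
    return out


counts = pd.DataFrame(
    {
        "created_cx": [41, 30, 5],
        "created_non_cx": [21, 846, 100],
        "deleted_cx": [32, 10, 4],
        "deleted_non_cx": [8, 45, 1],
    },
    index=pd.Index(["yiwiki", "kuwiki", "examplewiki"], name="wiki_db"),
)

ratios = ratios_by_wiki(counts)
print(ratios.query("deletion_pct_diff <= -2").sort_values("deletion_pct_diff"))
```

With these inputs the yiwiki and kuwiki rows reproduce the percentages in the table, and `examplewiki` is dropped because it has only 5 CX articles.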


## Previous-Quarter

In [12]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_prevq_all, period='FY23-Q3', pr=True)

During FY23-Q3, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 2.32%
	- created without using the Content Translation Tool: 11.65%


In [13]:
# deletion ratio by wiki
deletion_stats_prevq = dtf.generate_ratios_by_wiki(deletion_stats_prevq_all)

In [14]:
print(f'During FY23-Q3, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_prevq.query("""deletion_pct_diff < -2""").shape[0]} wikis where the deletion rate of articles created \
using CX was at least 2 percentage points higher than that of articles created without using CX.')

During FY23-Q3, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 10 wikis where the deletion rate of articles created using CX was at least 2 percentage points higher than that of articles created without using CX.


In [15]:
# wikis with high deletion ratio

prevq_high_deletion_ratio = deletion_stats_prevq.query("""deletion_pct_diff < -2""").sort_values('deletion_pct_diff')
prevq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| ltwiki | 50 | 4039 | 23 | 504 | 46.0 | 12.48 | -33.52 |
| banwiki | 18 | 1416 | 5 | 15 | 27.78 | 1.06 | -26.72 |
| cebwiki | 34 | 741 | 8 | 37 | 23.53 | 4.99 | -18.54 |
| mnwiki | 22 | 903 | 7 | 211 | 31.82 | 23.37 | -8.45 |
| iswiki | 45 | 966 | 6 | 60 | 13.33 | 6.21 | -7.12 |
| gnwiki | 24 | 115 | 2 | 2 | 8.33 | 1.74 | -6.59 |
| tgwiki | 142 | 1979 | 15 | 115 | 10.56 | 5.81 | -4.75 |
| svwiki | 108 | 14912 | 14 | 1321 | 12.96 | 8.86 | -4.1 |
| kuwiki | 23 | 629 | 3 | 60 | 13.04 | 9.54 | -3.5 |
| ttwiki | 55 | 388 | 3 | 8 | 5.45 | 2.06 | -3.39 |


In [16]:
# wikis where CX-created articles had a higher deletion rate than non-CX articles in both the current and the previous quarter
wikis_high_deletion_ratio = np.intersect1d(currq_high_deletion_ratio.index.values, prevq_high_deletion_ratio.index.values)
wikis_high_deletion_ratio

array(['kuwiki', 'mnwiki'], dtype=object)

# Formatting
For publication on MediaWiki.org at [Content translation/Deletion statistics comparison](https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison)

In [17]:
currq_wtable = currq_high_deletion_ratio.copy()

In [18]:
# format the percentage columns

percentage_columns = ['deleted_cx_pct', 'deleted_non_cx_pct', 'deletion_pct_diff']
currq_wtable[percentage_columns] = currq_wtable[percentage_columns]/100

currq_wtable = (
    currq_wtable
    .assign(
        deleted_cx_pct = ftf.format_percent('deleted_cx_pct', currq_wtable),
        deleted_non_cx_pct = ftf.format_percent('deleted_non_cx_pct', currq_wtable),
        deletion_pct_diff = ftf.format_percent('deletion_pct_diff', currq_wtable)
    )
    .reset_index()
)
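
`ftf.format_percent` is part of the project's `formatting_functions` module and is not shown here. Judging by how it is called and by the formatted values in the table output (for example `38.10%`), it presumably takes a column name and a DataFrame of fractions and returns percent strings with two decimals. A minimal sketch under that assumption, with `format_percent_sketch` as a hypothetical name:

```python
import pandas as pd


def format_percent_sketch(column, df, decimals=2):
    """Hypothetical stand-in for ftf.format_percent: fraction -> 'NN.NN%' string."""
    # The '%' format type multiplies by 100 and appends a percent sign
    return df[column].map(lambda value: f"{value:.{decimals}%}")


demo = pd.DataFrame({"deleted_non_cx_pct": [0.381, 0.0532]})
formatted = format_percent_sketch("deleted_non_cx_pct", demo).tolist()
print(formatted)  # ['38.10%', '5.32%']
```

This is why the percentage columns are divided by 100 before formatting: the `%` format type expects a fraction rather than a value that is already a percentage.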

In [19]:
# rename columns
columns_rename_map = {
    'wiki_db': 'Wikipedia',
    'created_cx': 'Created CX Articles', 
    'created_non_cx': 'Created non-CX Articles', 
    'deleted_cx': 'Deleted CX Articles', 
    'deleted_non_cx': 'Deleted non-CX Articles',
    'deleted_cx_pct': 'CX Articles Deletion Ratio', 
    'deleted_non_cx_pct': 'Non-CX Articles Deletion Ratio', 
    'deletion_pct_diff': 'Deletion Ratio Difference'
}

currq_wtable.rename(columns_rename_map, axis=1, inplace=True)

In [20]:
# create a multi-level column
column_arrays = [
    np.array(['Wikipedia'] + ['Created Articles'] * 2 + ['Deleted Articles'] * 2 + ['Deletion Ratios'] * 3),
    currq_wtable.columns.to_numpy()
]

currq_wtable.columns = pd.MultiIndex.from_arrays(column_arrays)

currq_wtable.head()

| | Wikipedia | Created Articles | Created Articles | Deleted Articles | Deleted Articles | Deletion Ratios | Deletion Ratios | Deletion Ratios |
|---|---|---|---|---|---|---|---|---|
| | Wikipedia | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
| 0 | yiwiki | 41 | 21 | 32 | 8 | 78.05% | 38.10% | -39.95% |
| 1 | kuwiki | 30 | 846 | 10 | 45 | 33.33% | 5.32% | -28.01% |
| 2 | swwiki | 21 | 2591 | 6 | 47 | 28.57% | 1.81% | -26.76% |
| 3 | lvwiki | 38 | 3397 | 9 | 252 | 23.68% | 7.42% | -16.26% |
| 4 | ocwiki | 16 | 541 | 3 | 20 | 18.75% | 3.70% | -15.05% |


In [21]:
# add a footnote marker (as superscript) to wikis that also had a high deletion ratio for articles created using CX in the previous quarter
currq_wtable[('Wikipedia', 'Wikipedia')] = currq_wtable[('Wikipedia', 'Wikipedia')].apply(lambda x:ftf.add_footnote(x, wikis_high_deletion_ratio))
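
`ftf.add_footnote` is not shown in this notebook. Based on the published table, where `kuwiki` appears as `kuwiki<sup>2</sup>`, it presumably appends a superscript marker to wiki names found in the given list. A minimal sketch under that assumption; the name `add_footnote_sketch` and the `marker` parameter are hypothetical:

```python
def add_footnote_sketch(wiki, flagged_wikis, marker=2):
    """Hypothetical stand-in for ftf.add_footnote.

    Appends a superscript footnote marker to wikis in `flagged_wikis`
    and returns all other names unchanged.
    """
    if wiki in flagged_wikis:
        return f"{wiki}<sup>{marker}</sup>"
    return wiki


flagged = ["kuwiki", "mnwiki"]
labels = [add_footnote_sketch(w, flagged) for w in ["yiwiki", "kuwiki"]]
print(labels)  # ['yiwiki', 'kuwiki<sup>2</sup>']
```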

In [25]:
table_headers = [
    'Wikipedias with higher deletion ratios for articles created with Content Translation',
    'Reviewed Time Period: April to June 2024 (Q4 2023-24)'
]

table_footers = [
    '<sup>1</sup> Excludes Wikipedias with 15 or fewer articles created with Content Translation during the reviewed time period.',
    '<sup>2</sup> Also identified in the prior quarter as a wiki with a higher deletion ratio for articles created with Content Translation.'
]

In [26]:
# to be published at https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison
print(ftf.dataframe_to_mediawiki(currq_wtable, table_headers, table_footers))

{| class='wikitable'
! colspan='8' | Wikipedias with higher deletion ratios for articles created with Content Translation
! colspan='8' | Reviewed Time Period: April to June 2024 (Q4 2023-24)
|-
colspan='1' | Wikipedia !! colspan='2' | Created Articles !! colspan='2' | Deleted Articles !! colspan='3' | Deletion Ratios
colspan='1' | Wikipedia !! colspan='1' | Created CX Articles !! colspan='1' | Created non-CX Articles !! colspan='1' | Deleted CX Articles !! colspan='1' | Deleted non-CX Articles !! colspan='1' | CX Articles Deletion Ratio !! colspan='1' | Non-CX Articles Deletion Ratio !! colspan='1' | Deletion Ratio Difference
|-
| yiwiki || 41 || 21 || 32 || 8 || 78.05% || 38.10% || -39.95%
|-
| kuwiki<sup>2</sup> || 30 || 846 || 10 || 45 || 33.33% || 5.32% || -28.01%
|-
| swwiki || 21 || 2591 || 6 || 47 || 28.57% || 1.81% || -26.76%
|-
| lvwiki || 38 || 3397 || 9 || 252 || 23.68% || 7.42% || -16.26%
|-
| ocwiki || 16 || 541 || 3 || 20 || 18.75% || 3.70% || -15.05%
|-
| uzwiki || 1534
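
`ftf.dataframe_to_mediawiki` is not shown in this notebook. For readers unfamiliar with wikitable markup, the following is a simplified sketch and not the project's implementation: it handles flat (single-level) columns only and leaves out the header and footer rows. The name `to_wikitable` is hypothetical.

```python
import pandas as pd


def to_wikitable(df, caption=None):
    """Hypothetical, simplified DataFrame-to-wikitable converter (flat columns only)."""
    lines = ['{| class="wikitable"']
    if caption:
        lines.append(f"|+ {caption}")
    # Header row: cells start with "!" and are separated by "!!"
    lines.append("! " + " !! ".join(str(col) for col in df.columns))
    for row in df.itertuples(index=False):
        lines.append("|-")
        # Data row: cells start with "|" and are separated by "||"
        lines.append("| " + " || ".join(str(cell) for cell in row))
    lines.append("|}")
    return "\n".join(lines)


demo = pd.DataFrame(
    {"Wikipedia": ["yiwiki", "kuwiki<sup>2</sup>"], "Created CX Articles": [41, 30]}
)
markup = to_wikitable(demo, caption="Demo")
print(markup)
```

In wikitable markup, every header line begins with `!` and the table is closed with `|}`, which is a useful check when reviewing generated output before publishing it.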

In [27]:
spark_session.stop()