# Content Translation Article Deletion Ratios, across all wikis
**Krishna Chaitanya Velaga, Data Scientist III, Wikimedia Foundation**

**Last updated on 22 September 2023**

[TASK: T343300](https://phabricator.wikimedia.org/T343300)

<u>Attribution:</u> This analysis builds on the [previous iteration of the analysis](https://gitlab.wikimedia.org/mneisler/content_translation_deletion_stats/-/blob/master/content_translation_deletion_ratios.ipynb) (written in R) by [Megan Neisler](https://github.com/MeganNeisler).

# Contents

1. [Overview](#Overview)
2. [Data Gathering](#Data-Gathering)
3. [Analysis](#Analysis)
    * [Current Quarter - FY23 Q4](#Current-Quarter)
    * [Previous Quarter - FY23 Q3](#Previous-Quarter)
4. [Formatting](#Formatting)

# Overview

## Purpose
The purpose of this analysis is to identify and count the wikis where the deletion rate of articles created with Content Translation (CX) is higher than the deletion rate of articles created with other tools. Specifically, we want to answer the following questions:
* How many wikis have translations deleted more often than regular articles?
* Which are these wikis?
* Has the number of those wikis decreased compared to the previous period?
* How high is the highest deletion ratio a wiki has for translations?

This analysis will also be used as a baseline to assess the evolution of deletion rates as improvements are made.

## Summary
* The deletion rate for CX-created articles (3.9%) is considerably lower than that of non-CX-created articles (13.59%).
* [23 wikis](https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison#March_2023_through_April_2023_(Q4_2023)) had a higher deletion rate for CX-created articles than for non-CX-created articles. 10 of these wikis also had a higher deletion rate for CX-created articles during the previous quarter.
* Among these, Kurdish WP (kuwiki) has been on the list for the last four quarters, and Armenian WP (hywiki), Lithuanian WP (ltwiki), and Tatar WP (ttwiki) for the last three quarters.

# Data-Gathering

In [1]:
import importlib
import warnings

import numpy as np
import pandas as pd
import wmfdata as wmf
from IPython.display import clear_output

import data_functions as dtf
import formatting_functions as ftf

# show all columns when displaying DataFrames
pd.options.display.max_columns = None






In [23]:
importlib.reload(dtf)
importlib.reload(ftf)

<module 'formatting_functions' from '/srv/home/kcv-wikimf/gitref/content-translation-deletion-stats/formatting_functions.py'>

## spark_session

In [3]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [4]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='cx-deletion-stats',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g"
        
    }
)

clear_output()

spark_session

In [5]:
spark_session.sparkContext.setLogLevel("ERROR")

## run query

In [None]:
currq_dates = dtf.generate_quarters(2022)['Q4']
prevq_dates = dtf.generate_quarters(2022)['Q3']
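
The `generate_quarters` helper lives in `data_functions` and is not shown in this notebook. As a rough sketch only, assuming a fiscal year that starts on 1 July (which matches the reviewed period of April 2023 through June 2023 for FY23 Q4), quarter boundaries could be derived as below. The function name `fiscal_quarters` and its return format are hypothetical and may differ from the actual helper.

```python
from datetime import date, timedelta


def fiscal_quarters(start_year: int) -> dict:
    """Hypothetical sketch: quarter date ranges for the fiscal year starting 1 July of start_year."""
    quarters = {}
    for i in range(4):
        # Months counted from January of start_year; Q1 starts in July (index 6).
        month_index = 6 + 3 * i
        start = date(start_year + month_index // 12, month_index % 12 + 1, 1)
        next_index = month_index + 3
        next_start = date(start_year + next_index // 12, next_index % 12 + 1, 1)
        # A quarter ends the day before the next quarter starts.
        quarters[f"Q{i + 1}"] = (start, next_start - timedelta(days=1))
    return quarters


fy23 = fiscal_quarters(2022)
print(fy23["Q3"])
print(fy23["Q4"])
```

Under this assumption, passing `2022` yields the quarters of FY23, with Q4 covering April through June 2023.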

In [29]:
%%time

warnings.filterwarnings('ignore')

deletion_stats_currq_all = dtf.query_deletion_stats(currq_dates)
deletion_stats_prevq_all = dtf.query_deletion_stats(prevq_dates)



CPU times: user 983 ms, sys: 197 ms, total: 1.18 s
Wall time: 4min 47s


                                                                                

# Analysis

## Current-Quarter

In [37]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_currq_all, period='FY23-Q4', pr=True)

During FY23-Q4, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 3.9%
	- created without using the Content Translation Tool: 13.59%
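
The implementation of `dtf.overall_deletion_pct` is not shown here. One plausible reading of the "overall" percentage is a pooled rate: total deleted articles divided by total created articles, summed over all wikis. The sketch below illustrates that reading on made-up counts; the function name `overall_pct`, the column names, and the pooling choice are all assumptions.

```python
import pandas as pd

# Toy per-wiki counts, made up for illustration (not real data).
toy = pd.DataFrame(
    {
        "created_cx": [100, 50],
        "deleted_cx": [4, 2],
        "created_non_cx": [1000, 500],
        "deleted_non_cx": [120, 90],
    },
    index=pd.Index(["aawiki", "bbwiki"], name="wiki_db"),
)


def overall_pct(df: pd.DataFrame, kind: str) -> float:
    """Pooled deletion percentage: total deleted / total created across all wikis."""
    return round(df[f"deleted_{kind}"].sum() / df[f"created_{kind}"].sum() * 100, 2)


print(overall_pct(toy, "cx"), overall_pct(toy, "non_cx"))
```

A pooled rate weights large wikis more heavily than an average of per-wiki rates would, so the two approaches can give noticeably different numbers.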


In [31]:
# deletion ratio by wiki
deletion_stats_currq = dtf.generate_ratios_by_wiki(deletion_stats_currq_all)

In [32]:
print(f'During FY23-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_currq.query("""deletion_pct_diff < 0""").shape[0]} wikis where articles created using CX \
were deleted more than articles created without using CX')

During FY23-Q4, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 23 wikis where articles created using CX were deleted more than articles created without using CX


In [33]:
# wikis with high deletion ratio
currq_high_deletion_ratio = deletion_stats_currq.query("""deletion_pct_diff < 0""").sort_values('deletion_pct_diff')
currq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| shwiki | 20 | 791 | 13 | 54 | 65.0 | 6.83 | -58.17 |
| suwiki | 19 | 144 | 10 | 15 | 52.63 | 10.42 | -42.21 |
| tnwiki | 32 | 32 | 14 | 4 | 43.75 | 12.5 | -31.25 |
| lvwiki | 34 | 3521 | 10 | 225 | 29.41 | 6.39 | -23.02 |
| ltwiki | 59 | 4077 | 17 | 574 | 28.81 | 14.08 | -14.73 |
| fiwiki | 76 | 7945 | 17 | 765 | 22.37 | 9.63 | -12.74 |
| gdwiki | 17 | 90 | 2 | 3 | 11.76 | 3.33 | -8.43 |
| iuwiki | 16 | 12 | 16 | 11 | 100.0 | 91.67 | -8.33 |
| ttwiki | 73 | 531 | 11 | 37 | 15.07 | 6.97 | -8.1 |
| kuwiki | 86 | 568 | 9 | 23 | 10.47 | 4.05 | -6.42 |
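
The percentage columns in this table can be reproduced from the count columns. The sketch below recomputes them for the first three rows, assuming each percentage is deleted / created × 100 and that `deletion_pct_diff` is the non-CX percentage minus the CX percentage (so negative values mean CX articles were deleted at a higher rate). This is inferred from the displayed values, not from the source of `dtf.generate_ratios_by_wiki`.

```python
import pandas as pd

# Counts copied from the first three rows of the table above.
counts = pd.DataFrame(
    {
        "created_cx": [20, 19, 32],
        "created_non_cx": [791, 144, 32],
        "deleted_cx": [13, 10, 14],
        "deleted_non_cx": [54, 15, 4],
    },
    index=pd.Index(["shwiki", "suwiki", "tnwiki"], name="wiki_db"),
)

# Keep only wikis with more than 15 CX-created articles, as in the analysis.
eligible = counts[counts["created_cx"] > 15]

ratios = eligible.assign(
    deleted_cx_pct=lambda d: d["deleted_cx"] / d["created_cx"] * 100,
    deleted_non_cx_pct=lambda d: d["deleted_non_cx"] / d["created_non_cx"] * 100,
    # Negative difference: CX articles were deleted at a higher rate.
    deletion_pct_diff=lambda d: d["deleted_non_cx_pct"] - d["deleted_cx_pct"],
).round(2)

print(ratios[["deleted_cx_pct", "deleted_non_cx_pct", "deletion_pct_diff"]])
```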


## Previous-Quarter

In [39]:
# overall deletion percent
dtf.overall_deletion_pct(deletion_stats_prevq_all, period='FY23-Q3', pr=True)

During FY23-Q3, overall percentage of articles that were deleted,
	- created using the Content Translation tool: 3.96%
	- created without using the Content Translation Tool: 13.2%


In [40]:
# deletion ratio by wiki
deletion_stats_prevq = dtf.generate_ratios_by_wiki(deletion_stats_prevq_all)

In [41]:
print(f'During FY23-Q3, across all wikis where more than 15 articles have been created with the Content Translation tool,\n\
there were {deletion_stats_prevq.query("""deletion_pct_diff < 0""").shape[0]} wikis where articles created using CX \
were deleted more than articles created without using CX')

During FY23-Q3, across all wikis where more than 15 articles have been created with the Content Translation tool,
there were 26 wikis where articles created using CX were deleted more than articles created without using CX


In [42]:
# wikis with high deletion ratio

prevq_high_deletion_ratio = deletion_stats_prevq.query("""deletion_pct_diff < 0""").sort_values('deletion_pct_diff')
prevq_high_deletion_ratio

| wiki_db | created_cx | created_non_cx | deleted_cx | deleted_non_cx | deleted_cx_pct | deleted_non_cx_pct | deletion_pct_diff |
|---|---|---|---|---|---|---|---|
| bowiki | 17 | 98 | 15 | 6 | 88.24 | 6.12 | -82.12 |
| iuwiki | 202 | 44 | 202 | 20 | 100.0 | 45.45 | -54.55 |
| htwiki | 84 | 91 | 35 | 3 | 41.67 | 3.3 | -38.37 |
| gdwiki | 18 | 24 | 9 | 3 | 50.0 | 12.5 | -37.5 |
| jvwiki | 83 | 295 | 32 | 15 | 38.55 | 5.08 | -33.47 |
| lawiki | 17 | 642 | 5 | 40 | 29.41 | 6.23 | -23.18 |
| crwiki | 36 | 24 | 36 | 19 | 100.0 | 79.17 | -20.83 |
| yiwiki | 31 | 68 | 16 | 21 | 51.61 | 30.88 | -20.73 |
| iswiki | 25 | 1451 | 5 | 50 | 20.0 | 3.45 | -16.55 |
| kuwiki | 59 | 1049 | 11 | 39 | 18.64 | 3.72 | -14.92 |


In [43]:
# wikis where CX-created articles had a higher deletion rate than non-CX articles in both the current and the previous quarter
wikis_high_deletion_ratio = np.intersect1d(currq_high_deletion_ratio.index.values, prevq_high_deletion_ratio.index.values)
wikis_high_deletion_ratio

array(['afwiki', 'bswiki', 'gdwiki', 'hywiki', 'iuwiki', 'kswiki',
       'kuwiki', 'ltwiki', 'ttwiki', 'tumwiki'], dtype=object)

# Formatting
The cells below format the current-quarter table for publication on mediawiki.org at [Content translation/Deletion statistics comparison](https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison).

In [58]:
currq_wtable = currq_high_deletion_ratio.copy()

In [59]:
# format the percentage columns

percentage_columns = ['deleted_cx_pct', 'deleted_non_cx_pct', 'deletion_pct_diff']
currq_wtable[percentage_columns] = currq_wtable[percentage_columns]/100

currq_wtable = (
    currq_wtable
    .assign(
        deleted_cx_pct = ftf.format_percent('deleted_cx_pct', currq_wtable),
        deleted_non_cx_pct = ftf.format_percent('deleted_non_cx_pct', currq_wtable),
        deletion_pct_diff = ftf.format_percent('deletion_pct_diff', currq_wtable)
    )
    .reset_index()
)
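
`ftf.format_percent` is defined in `formatting_functions` and is not shown in this notebook. Since the columns are divided by 100 before the call and the result displays as strings such as `65.00%`, a minimal stand-in could rely on Python's `%` format type, which multiplies by 100 and appends a percent sign. The implementation below is an assumption that only mirrors the call signature used above.

```python
import pandas as pd


def format_percent(column: str, df: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for ftf.format_percent: fractions to strings like '65.00%'."""
    return df[column].map("{:.2%}".format)


# Percentages taken from the current-quarter table, converted to fractions.
demo = pd.DataFrame({"deletion_pct_diff": [-58.17, -42.21, -31.25]}) / 100
print(format_percent("deletion_pct_diff", demo).tolist())
```

Dividing by 100 first keeps the stored values as true fractions, so the formatter does not need to know whether a column was already scaled.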

In [60]:
# rename columns
columns_rename_map = {
    'wiki_db': 'Wikipedia',
    'created_cx': 'Created CX Articles', 
    'created_non_cx': 'Created non-CX Articles', 
    'deleted_cx': 'Deleted CX Articles', 
    'deleted_non_cx': 'Deleted non-CX Articles',
    'deleted_cx_pct': 'CX Articles Deletion Ratio', 
    'deleted_non_cx_pct': 'Non-CX Articles Deletion Ratio', 
    'deletion_pct_diff': 'Deletion Ratio Difference'
}

currq_wtable.rename(columns_rename_map, axis=1, inplace=True)

In [61]:
# create a multi-level column
column_arrays = [
    np.array(['Wikipedia'] + ['Created Articles'] * 2 + ['Deleted Articles'] * 2 + ['Deletion Ratios'] * 3),
    currq_wtable.columns.to_numpy()
]

currq_wtable.columns = pd.MultiIndex.from_arrays(column_arrays)

currq_wtable.head()

| | Wikipedia | Created Articles | Created Articles | Deleted Articles | Deleted Articles | Deletion Ratios | Deletion Ratios | Deletion Ratios |
|---|---|---|---|---|---|---|---|---|
| | Wikipedia | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
| 0 | shwiki | 20 | 791 | 13 | 54 | 65.00% | 6.83% | -58.17% |
| 1 | suwiki | 19 | 144 | 10 | 15 | 52.63% | 10.42% | -42.21% |
| 2 | tnwiki | 32 | 32 | 14 | 4 | 43.75% | 12.50% | -31.25% |
| 3 | lvwiki | 34 | 3521 | 10 | 225 | 29.41% | 6.39% | -23.02% |
| 4 | ltwiki | 59 | 4077 | 17 | 574 | 28.81% | 14.08% | -14.73% |


In [64]:
# add a footnote marker (as superscript) to wikis that also had a higher deletion ratio for CX-created articles in the previous quarter
currq_wtable[('Wikipedia', 'Wikipedia')] = currq_wtable[('Wikipedia', 'Wikipedia')].apply(lambda x: ftf.add_footnote(x, wikis_high_deletion_ratio))
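
`ftf.add_footnote` is also defined outside this notebook. Judging by the published table, where repeat wikis appear as `ltwiki<sup>2</sup>`, it appends a superscript marker when the wiki is in the list of repeat wikis. The version below is a hypothetical sketch of that behaviour; the `marker` parameter is an assumption added for illustration.

```python
def add_footnote(wiki: str, flagged_wikis, marker: str = "2") -> str:
    """Hypothetical stand-in for ftf.add_footnote: tag wikis that were also flagged last quarter."""
    return f"{wiki}<sup>{marker}</sup>" if wiki in flagged_wikis else wiki


# A subset of the wikis found in both quarters.
flagged = ["gdwiki", "iuwiki", "kuwiki", "ltwiki", "ttwiki"]
print([add_footnote(w, flagged) for w in ["shwiki", "ltwiki"]])
```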

In [66]:
table_headers = [
    'Wikipedias with higher deletion ratios for articles created with Content Translation',
    'Reviewed Time Period: April 2023 through June 2023 (FY 23 Q4)'
]

table_footers = [
    '<sup>1</sup> Excludes Wikipedias with 15 or fewer articles created with Content Translation during the reviewed time period.',
    '<sup>2</sup> Also identified in the prior quarter as a wiki with a higher deletion ratio for articles created with Content Translation.'
]

In [67]:
# to be published at https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison
print(ftf.dataframe_to_mediawiki(currq_wtable, table_headers, table_footers))

{| class='wikitable'
! colspan='8' | Wikipedias with higher deletion ratios for articles created with Content Translation
! colspan='8' | Reviewed Time Period: April 2023 through June 2023 (FY 23 Q4)
|-
colspan='1' | Wikipedia !! colspan='2' | Created Articles !! colspan='2' | Deleted Articles !! colspan='3' | Deletion Ratios
colspan='1' | Wikipedia !! colspan='1' | Created CX Articles !! colspan='1' | Created non-CX Articles !! colspan='1' | Deleted CX Articles !! colspan='1' | Deleted non-CX Articles !! colspan='1' | CX Articles Deletion Ratio !! colspan='1' | Non-CX Articles Deletion Ratio !! colspan='1' | Deletion Ratio Difference
|-
| shwiki || 20 || 791 || 13 || 54 || 65.00% || 6.83% || -58.17%
|-
| suwiki || 19 || 144 || 10 || 15 || 52.63% || 10.42% || -42.21%
|-
| tnwiki || 32 || 32 || 14 || 4 || 43.75% || 12.50% || -31.25%
|-
| lvwiki || 34 || 3521 || 10 || 225 || 29.41% || 6.39% || -23.02%
|-
| ltwiki<sup>2</sup> || 59 || 4077 || 17 || 574 || 28.81% || 14.08% || -14.73%
|-
| 