Starting with the 2022 Q4 statistics, we decided to [change the approach](https://phabricator.wikimedia.org/T343300) to calculating the overall deletion percentages. With this change, only Wikipedias are considered, rather than all Wikimedia projects. To keep the stats comparable with previous quarters, the analysis has been re-run for all previous quarters starting with 2021 Q3.
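To make the change concrete, here is a minimal sketch of how restricting the aggregation to Wikipedias could affect the overall percentages. It is not the code in `data_functions`; the function name, the `project_family` field, the count fields, and the toy numbers are all assumptions made for illustration. Only the two output keys mirror the ones printed later in this notebook.

```python
def overall_deletion_pct_sketch(rows, wikipedias_only=True):
    """Aggregate per-wiki counts into overall deletion percentages.

    `rows` is a list of dicts with hypothetical keys: project_family,
    cx_created, cx_deleted, non_cx_created, non_cx_deleted.
    """
    if wikipedias_only:
        # New approach: keep Wikipedias only, drop other project families.
        rows = [r for r in rows if r["project_family"] == "wikipedia"]

    def pct(deleted_key, created_key):
        created = sum(r[created_key] for r in rows)
        deleted = sum(r[deleted_key] for r in rows)
        # Guard against an empty selection instead of dividing by zero.
        return round(100 * deleted / created, 2) if created else None

    return {
        "deleted_cx_pct": pct("cx_deleted", "cx_created"),
        "deleted_non_cx_pct": pct("non_cx_deleted", "non_cx_created"),
    }


# Toy counts, invented purely to illustrate the effect of the filter.
toy_rows = [
    {"project_family": "wikipedia", "cx_created": 100, "cx_deleted": 2,
     "non_cx_created": 1000, "non_cx_deleted": 100},
    {"project_family": "wikipedia", "cx_created": 300, "cx_deleted": 10,
     "non_cx_created": 3000, "non_cx_deleted": 500},
    {"project_family": "wikivoyage", "cx_created": 100, "cx_deleted": 20,
     "non_cx_created": 1000, "non_cx_deleted": 400},
]

wikipedias = overall_deletion_pct_sketch(toy_rows)
all_projects = overall_deletion_pct_sketch(toy_rows, wikipedias_only=False)
```

With these toy counts the two calls give different percentages, which is why the earlier quarters had to be re-run under the new definition before comparing them.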

## Data-Gathering

In [2]:
import numpy as np
import pandas as pd
import wmfdata as wmf

pd.options.display.max_columns = None
from IPython.display import clear_output

import importlib
import warnings

import data_functions as dtf




You are using Wmfdata v2.0.0, but v2.0.1 is available.

To update, run `pip install --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release`.

To see the changes, refer to https://github.com/wikimedia/wmfdata-python/blob/release/CHANGELOG.md.


In [3]:
importlib.reload(dtf)

<module 'data_functions' from '/srv/home/kcv-wikimf/gitref/content-translation-deletion-stats/data_functions.py'>

### spark_session

In [4]:
spark_session = wmf.spark.get_active_session()

if spark_session is not None:
    spark_session.stop()
else:
    print('no active session')

no active session


In [5]:
spark_session = wmf.spark.create_custom_session(
    master="yarn",
    app_name='cx-deletion-stats',
    spark_config={
        "spark.driver.memory": "4g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "16g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.driver.maxResultSize": "2g",
    }
)

clear_output()

spark_session

In [6]:
spark_session.sparkContext.setLogLevel("ERROR")

### run query

In [7]:
%%time

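for y in [2020, 2021, 2022]:
    for q in [f'Q{n}' for n in range(1, 4 + 1)]:
        # Skip Q1 and Q2 of 2020; they fall outside the range being re-run.
        if y == 2020 and q in ['Q1', 'Q2']:
            continue

        quarter = dtf.generate_quarters(y)[q]
        # Use the same MediaWiki snapshot for every quarter so results are comparable.
        quarter['mw_snapshot'] = '2023-08'
        quarterly_stats = dtf.query_deletion_stats(quarter)
        overall_deletion_rate = dtf.overall_deletion_pct(quarterly_stats)
        # Adjacent f-strings avoid the stray indentation that a backslash
        # continuation inside a single string literal puts into the output.
        print(
            f'From {quarter["start_dt"]} to {quarter["end_dt"]} - '
            f'deletion rate of CX created articles: {overall_deletion_rate["deleted_cx_pct"]}%; '
            f'deletion rate of non-CX created articles: {overall_deletion_rate["deleted_non_cx_pct"]}%\n'
        )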

                                                                                

From 2021-01-01 to 2021-03-31 - deletion rate of CX created articles: 2.8%; deletion rate of non-CX created articles: 15.19%

From 2021-04-01 to 2021-06-30 - deletion rate of CX created articles: 3.06%; deletion rate of non-CX created articles: 13.78%

From 2021-07-01 to 2021-09-30 - deletion rate of CX created articles: 2.14%; deletion rate of non-CX created articles: 12.51%

From 2021-10-01 to 2021-12-31 - deletion rate of CX created articles: 2.77%; deletion rate of non-CX created articles: 10.66%

From 2022-01-01 to 2022-03-31 - deletion rate of CX created articles: 3.01%; deletion rate of non-CX created articles: 11.99%

From 2022-04-01 to 2022-06-30 - deletion rate of CX created articles: 3.16%; deletion rate of non-CX created articles: 12.82%

From 2022-07-01 to 2022-09-30 - deletion rate of CX created articles: 2.93%; deletion rate of non-CX created articles: 11.44%

From 2022-10-01 to 2022-12-31 - deletion rate of CX created articles: 3.42%; deletion rate of non-CX created articles: 12.93%

From 2023-01-01 to 2023-03-31 - deletion rate of CX created articles: 3.58%; deletion rate of non-CX created articles: 12.68%

From 2023-04-01 to 2023-06-30 - deletion rate of CX created articles: 3.62%; deletion rate of non-CX created articles: 13.05%

CPU times: user 2.99 s, sys: 749 ms, total: 3.74 s
Wall time: 14min 3s
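The loop indexes quarters by year and quarter label, yet the printed ranges run from 2021-01-01 (for `2020`, `Q3`) to 2023-06-30 (for `2022`, `Q4`). That pattern is consistent with fiscal years that begin on 1 July. The sketch below shows one way such a mapping could be built. It is an assumption inferred from the output, not the actual `dtf.generate_quarters` implementation, and `generate_quarters_sketch` is a hypothetical name; only the `start_dt` and `end_dt` keys come from the notebook.

```python
from datetime import date, timedelta


def generate_quarters_sketch(fiscal_year_start):
    """Return {'Q1'..'Q4': {'start_dt', 'end_dt'}} for a fiscal year
    assumed to begin on 1 July of `fiscal_year_start`."""
    quarters = {}
    for i in range(4):
        # Zero-based month offset from January of the starting year:
        # 6 is July, 9 is October, 12 is January of the next year, and so on.
        month_index = 6 + 3 * i
        start = date(fiscal_year_start + month_index // 12,
                     month_index % 12 + 1, 1)

        # The quarter ends the day before the next quarter starts.
        next_index = month_index + 3
        next_start = date(fiscal_year_start + next_index // 12,
                          next_index % 12 + 1, 1)
        end = next_start - timedelta(days=1)

        quarters[f"Q{i + 1}"] = {
            "start_dt": start.isoformat(),
            "end_dt": end.isoformat(),
        }
    return quarters


first_quarter_run = generate_quarters_sketch(2020)["Q3"]
last_quarter_run = generate_quarters_sketch(2022)["Q4"]
```

Under this assumption, skipping `Q1` and `Q2` of `2020` simply leaves out July to December 2020, so the first quarter queried is January to March 2021.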


                                                                                