# Language-agnostic modeling of quality of Wikipedia articles

This notebook provides a tutorial on how to explore the dataset of language-agnostic feature values and quality scores of Wikipedia articles. It has three stages:
1. Accessing and extending a sample of the dataset
2. Visualizing feature values and quality scores over time
3. Future Analyses

## 1. Accessing the data

The dataset of language-agnostic feature values and quality scores of Wikipedia articles, available on [Zenodo](https://zenodo.org/records/10495081), is too large to handle in a Jupyter notebook. For that reason, we have prepared a sample containing the revisions of the English Wikipedia articles maintained by [WikiProject Climate change](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Climate_change). The sample has the following columns:
- wiki_db: Wikipedia language edition ('enwiki' in this sample).
- page_id: Id of the page (in the corresponding Wikipedia language edition).
- revision_id: Id of the revision (in the corresponding Wikipedia language edition).
- revision_timestamp: Timestamp of the revision (ISO 8601, UTC).
- page_length: Number of bytes of the revision.
- num_refs: Number of references of the revision.
- num_wikilinks: Number of wikilinks of the revision.
- num_categories: Number of categories of the revision.
- num_media: Number of media files of the revision.
- num_headings: Number of sections of the revision.
- item_id: Id of the page in Wikidata.
- pred_qual: Predicted quality score between 0 and 1.

In [1]:
# TODO: add other libraries here as necessary
import pandas as pd

In [2]:
# Read the zipped CSV
df_revisions = pd.read_csv('https://public-paws.wmcloud.org/User:Pablo%20(WMF)/outreachy/round28/features_scores_climatechange_2022.csv.zip')
df_revisions

Unnamed: 0,wiki_db,page_id,revision_id,revision_timestamp,page_length,num_refs,num_wikilinks,num_categories,num_media,num_headings,item_id,pred_qual
0,enwiki,348869,366664976,2010-06-07T22:45:20Z,10464,11,66,4,2,7,Q1137345,0.557963
1,enwiki,348869,251114181,2008-11-11T15:34:55Z,4049,0,41,4,2,3,Q1137345,0.397999
2,enwiki,348869,712041311,2016-03-26T15:07:18Z,20701,28,84,4,2,10,Q1137345,0.696191
3,enwiki,348869,341876534,2010-02-04T12:02:16Z,10100,11,61,4,2,7,Q1137345,0.554477
4,enwiki,348869,519730962,2012-10-25T09:56:12Z,17858,21,93,4,2,9,Q1137345,0.654836
...,...,...,...,...,...,...,...,...,...,...,...,...
1456207,enwiki,66790245,1069533455,2022-02-02T19:41:20Z,42178,81,118,4,2,14,Q105549782,0.797312
1456208,enwiki,66790245,1007245703,2021-02-17T04:02:02Z,744,1,2,0,0,0,Q105549782,0.139069
1456209,enwiki,66790245,1041821278,2021-09-01T18:18:01Z,5728,9,19,4,0,5,Q105549782,0.401310
1456210,enwiki,66790245,1055285832,2021-11-15T00:29:52Z,34922,80,115,4,1,14,Q105549782,0.770978
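
The `revision_timestamp` column is read as plain strings, and the rows are not in chronological order. Before any time-based analysis it probably helps to parse and sort them. The following is a minimal sketch on a three-row stand-in copied from the output above; the name `sample` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Small stand-in for df_revisions, using rows copied from the output above
sample = pd.DataFrame({
    'page_id': [348869, 348869, 348869],
    'revision_id': [366664976, 251114181, 712041311],
    'revision_timestamp': ['2010-06-07T22:45:20Z',
                           '2008-11-11T15:34:55Z',
                           '2016-03-26T15:07:18Z'],
    'pred_qual': [0.557963, 0.397999, 0.696191],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps
sample['revision_timestamp'] = pd.to_datetime(sample['revision_timestamp'], utc=True)

# Sort chronologically within each page
sample = sample.sort_values(['page_id', 'revision_timestamp']).reset_index(drop=True)
print(sample[['revision_id', 'revision_timestamp', 'pred_qual']])
```

The same two calls should apply unchanged to the full `df_revisions` dataframe.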


In [3]:
# As mentioned above, pages are English Wikipedia articles maintained by WikiProject Climate change. 
# The id and title of these pages, together with their quality class and importance class can be extracted with the following Quarry query:
# https://quarry.wmcloud.org/query/52210
df_pages = pd.read_csv('https://quarry.wmcloud.org/query/52210/result/latest/0/csv')
df_pages

Unnamed: 0,page_id,page_title,quality_class,importance_class
0,39,Albedo,B,High
1,627,Agriculture,GA,Low
2,903,Arable_land,C,Low
3,1365,Ammonia,B,Low
4,3201,Attribution_of_recent_climate_change,B,High
...,...,...,...,...
3812,73428116,Anne_Therese_Gennari,Start,Low
3813,73464356,Christopher_Magadza,C,Low
3814,73540566,Britney_Schmidt,C,Low
3815,73569052,Rainwater_harvesting_in_the_Sahel,C,Low


In [4]:
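# Both dataframes can be merged to add page metadata to the revision sample.
# merge() defaults to an inner join, so revisions of pages missing from df_pages are dropped.
df_merged = df_revisions.merge(df_pages, on='page_id')
df_merged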

Unnamed: 0,wiki_db,page_id,revision_id,revision_timestamp,page_length,num_refs,num_wikilinks,num_categories,num_media,num_headings,item_id,pred_qual,page_title,quality_class,importance_class
0,enwiki,348869,366664976,2010-06-07T22:45:20Z,10464,11,66,4,2,7,Q1137345,0.557963,North_Atlantic_oscillation,Start,Unknown
1,enwiki,348869,251114181,2008-11-11T15:34:55Z,4049,0,41,4,2,3,Q1137345,0.397999,North_Atlantic_oscillation,Start,Unknown
2,enwiki,348869,712041311,2016-03-26T15:07:18Z,20701,28,84,4,2,10,Q1137345,0.696191,North_Atlantic_oscillation,Start,Unknown
3,enwiki,348869,341876534,2010-02-04T12:02:16Z,10100,11,61,4,2,7,Q1137345,0.554477,North_Atlantic_oscillation,Start,Unknown
4,enwiki,348869,519730962,2012-10-25T09:56:12Z,17858,21,93,4,2,9,Q1137345,0.654836,North_Atlantic_oscillation,Start,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1426996,enwiki,66790245,1069533455,2022-02-02T19:41:20Z,42178,81,118,4,2,14,Q105549782,0.797312,Build_Back_Better_Plan,C,Mid
1426997,enwiki,66790245,1007245703,2021-02-17T04:02:02Z,744,1,2,0,0,0,Q105549782,0.139069,Build_Back_Better_Plan,C,Mid
1426998,enwiki,66790245,1041821278,2021-09-01T18:18:01Z,5728,9,19,4,0,5,Q105549782,0.401310,Build_Back_Better_Plan,C,Mid
1426999,enwiki,66790245,1055285832,2021-11-15T00:29:52Z,34922,80,115,4,1,14,Q105549782,0.770978,Build_Back_Better_Plan,C,Mid
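
The merged output above ends at index 1426999, while `df_revisions` ends at index 1456210, which suggests that some revisions belong to pages that are absent from `df_pages`. One way to inspect them, sketched here on toy data, is a left join with `indicator=True`. The names `revisions`, `pages`, `merged`, and `unmatched`, and the page id 999, are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_revisions and df_pages
revisions = pd.DataFrame({
    'page_id': [348869, 348869, 66790245, 999],  # 999 is a made-up id with no match
    'revision_id': [366664976, 251114181, 1069533455, 1],
    'pred_qual': [0.557963, 0.397999, 0.797312, 0.5],
})
pages = pd.DataFrame({
    'page_id': [348869, 66790245],
    'page_title': ['North_Atlantic_oscillation', 'Build_Back_Better_Plan'],
    'quality_class': ['Start', 'C'],
    'importance_class': ['Unknown', 'Mid'],
})

# A left join keeps every revision; the _merge column records whether a page matched.
# validate='many_to_one' raises an error if page_id is duplicated in pages.
merged = revisions.merge(pages, on='page_id', how='left',
                         indicator=True, validate='many_to_one')

# Revisions whose page is missing from the page list
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['page_id', 'revision_id']])
```

Whether unmatched revisions should be kept or dropped depends on the analysis; the inner join used above simply discards them.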


In [5]:
!pip install mwviews 
from mwviews.api import PageviewsClient



In [6]:
# TODO: For any article, use the API to gather the pageview counts
# for the time period in which each revision was made.
# mwviews documentation: https://github.com/mediawiki-utilities/python-mwviews
# Setting a user_agent is best practice: it helps identify the request if there is an issue
tutorial_label = 'PAWS Language-agnostic quality modeling tutorial (mwviews)'
# NOTE: it is best practice to include a contact email in the user agent.
# An email address is generally private information, though, so do not replace it
# with your own if you are working in PAWS or pushing to a GitHub repo.
# For Outreachy, you can leave Pablo's email or switch to your MediaWiki username,
# e.g., Pablo (WMF) for https://www.mediawiki.org/wiki/User:Pablo_(WMF)
contact_email = 'paragon@wikimedia.org'
p = PageviewsClient(user_agent=f'<{contact_email}> {tutorial_label}')
# See below an example for monthly pageviews of two given articles in 2022
p.article_views('en.wikipedia', ['Albedo', 'Agriculture'], granularity='monthly', start='20220101', end='20221231')

defaultdict(dict,
            {datetime.datetime(2022, 11, 1, 0, 0): {'Albedo': 35815,
              'Agriculture': 147939},
             datetime.datetime(2022, 1, 1, 0, 0): {'Albedo': 33276,
              'Agriculture': 129619},
             datetime.datetime(2022, 3, 1, 0, 0): {'Albedo': 33142,
              'Agriculture': 163182},
             datetime.datetime(2022, 5, 1, 0, 0): {'Albedo': 31365,
              'Agriculture': 122209},
             datetime.datetime(2022, 6, 1, 0, 0): {'Albedo': 26119,
              'Agriculture': 99250},
             datetime.datetime(2022, 10, 1, 0, 0): {'Albedo': 35853,
              'Agriculture': 135306},
             datetime.datetime(2022, 9, 1, 0, 0): {'Albedo': 34701,
              'Agriculture': 149693},
             datetime.datetime(2022, 12, 1, 0, 0): {'Albedo': 27722,
              'Agriculture': 114116},
             datetime.datetime(2022, 4, 1, 0, 0): {'Albedo': 35430,
              'Agriculture': 126763},
             datetime.date
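
The dictionary returned by `article_views` is keyed by date and, as the output above shows, is not in chronological order. A possible way to make it easier to work with is to convert it into a sorted dataframe. The sketch below uses a hand-copied stand-in with three of the monthly values shown above, so it runs without calling the API; `views`, `df_views`, and `df_views_long` are illustrative names.

```python
from collections import defaultdict
from datetime import datetime

import pandas as pd

# Stand-in for the defaultdict returned by p.article_views(...),
# using three of the monthly values shown in the output above
views = defaultdict(dict, {
    datetime(2022, 11, 1): {'Albedo': 35815, 'Agriculture': 147939},
    datetime(2022, 1, 1): {'Albedo': 33276, 'Agriculture': 129619},
    datetime(2022, 3, 1): {'Albedo': 33142, 'Agriculture': 163182},
})

# One row per month, one column per article; sort because the dict is unordered
df_views = pd.DataFrame.from_dict(views, orient='index').sort_index()
df_views.index.name = 'month'

# Long format (one row per article and month) is easier to join with revisions
df_views_long = (
    df_views.reset_index()
    .melt(id_vars='month', var_name='page_title', value_name='pageviews')
)
print(df_views_long)
```

In long format, the pageviews could then be joined to revisions on the page title and the month of each revision, assuming the revision timestamps have been reduced to the same monthly granularity.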

## 2. Visualizing feature values and quality scores over time

Here we want to explore the evolution of individual articles, and of the Climate change sample as a whole, by visualizing feature values and quality scores over time. For this type of data, a plot like the following makes sense: it shows the predicted quality score of the latest revision of each English Wikipedia article up to a given year (the darker the color, the more recent the year).

![image1](./enwiki_boxplot.png)

Choose a finer time granularity than yearly (e.g., monthly, weekly, daily, hourly) and create data visualizations of the feature values and scores (page_length, num_refs, num_wikilinks, num_categories, num_media, num_headings, pred_qual). Write your thoughts on the trade-offs between coarser and finer granularities and how they affect the visualization.
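
As a starting point for the granularity question, and not as the expected solution, the sketch below keeps the latest revision of each page per period and compares yearly with monthly bucketing. It uses the four revisions of page 66790245 shown in section 1; `revs` and the helper `latest_per_period` are hypothetical names introduced here.

```python
import pandas as pd

# Stand-in for df_revisions: the four revisions of page 66790245 shown in section 1
revs = pd.DataFrame({
    'page_id': [66790245] * 4,
    'revision_timestamp': pd.to_datetime(
        ['2022-02-02T19:41:20Z', '2021-02-17T04:02:02Z',
         '2021-09-01T18:18:01Z', '2021-11-15T00:29:52Z'], utc=True),
    'pred_qual': [0.797312, 0.139069, 0.401310, 0.770978],
})


def latest_per_period(df, freq):
    """Keep only the latest revision of each page in each period (hypothetical helper)."""
    # Drop the timezone first: pandas periods carry no timezone information
    period = df['revision_timestamp'].dt.tz_localize(None).dt.to_period(freq)
    return (
        df.assign(period=period)
        .sort_values('revision_timestamp')
        .drop_duplicates(['page_id', 'period'], keep='last')
        .reset_index(drop=True)
    )


yearly = latest_per_period(revs, 'Y')   # the three 2021 revisions collapse into one
monthly = latest_per_period(revs, 'M')  # each revision falls in its own month
print(yearly[['period', 'pred_qual']])
print(monthly[['period', 'pred_qual']])
```

On this page, yearly bucketing hides the jump from 0.139069 to 0.770978 within 2021, while monthly bucketing keeps every step. The cost of finer buckets is more points to draw, and more empty periods for articles that are rarely edited.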

In [7]:
# TODO: Build the data analysis and visualization per instructions above

Then create data visualizations using different types of charts, and propose approaches to filtering and aggregating the data.
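
One possible filter-then-aggregate pattern, shown only as an illustration, is to drop very short revisions and summarize `pred_qual` per `quality_class`. The rows below are copied from the merged output in section 1; the name `toy` and the `MIN_BYTES` threshold of 1000 are arbitrary choices made for this sketch.

```python
import pandas as pd

# Toy stand-in for the merged dataframe, using rows from the merged output in section 1
toy = pd.DataFrame({
    'revision_id': [366664976, 251114181, 712041311,
                    1069533455, 1007245703, 1041821278, 1055285832],
    'page_length': [10464, 4049, 20701, 42178, 744, 5728, 34922],
    'pred_qual': [0.557963, 0.397999, 0.696191,
                  0.797312, 0.139069, 0.401310, 0.770978],
    'quality_class': ['Start', 'Start', 'Start', 'C', 'C', 'C', 'C'],
})

MIN_BYTES = 1000  # arbitrary illustrative threshold, not a recommendation

# Filter out very short revisions, then aggregate per quality class
summary = (
    toy[toy['page_length'] >= MIN_BYTES]
    .groupby('quality_class')
    .agg(n_revisions=('revision_id', 'count'),
         median_pred_qual=('pred_qual', 'median'))
)
print(summary)
```

Note that `quality_class` presumably reflects each article's current assessment, so it is attached to every historical revision of the page. Comparing it against old revisions' `pred_qual` should be done with that caveat in mind.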

In [8]:
# TODO: Build the data analysis and visualization per instructions above

## 3. Future Analyses

In [9]:
# TODO: Describe what additional patterns you might want to explore and visualize in the data (and why). You don't have to know how to do the analyses.