# Studying Wikipedia Page Protections
This notebook is a tutorial on studying [page protections](https://en.wikipedia.org/wiki/Wikipedia:Protection_policy) on Wikipedia via either the [MediaWiki dumps](https://www.mediawiki.org/wiki/Manual:Page_restrictions_table) or the [API](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Binfo). It has three stages:
* Accessing the Page Protection dumps
* Accessing the Page Protection API
* Example analysis of page protection data (both descriptive statistics and learning a predictive model)

## Accessing the Page Protection Dumps
This section shows how to parse the [MediaWiki dumps](https://www.mediawiki.org/wiki/Manual:Page_restrictions_table) to determine [which types of edit protection](https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#Overview_of_types_of_protection) are applied to a given Wikipedia article.

In [1]:
# TODO: add other libraries here as necessary
import gzip  # necessary for decompressing dump file into text format

In [2]:
# Every language on Wikipedia has its own page restrictions table
# you can find all the dbnames (e.g., enwiki) here: https://www.mediawiki.org/w/api.php?action=sitematrix
# for example, you could replace the LANGUAGE parameter of 'enwiki' with 'arwiki' to study Arabic Wikipedia
LANGUAGE = 'enwiki'
# e.g., enwiki -> en.wikipedia (this is necessary for the API section)
SITENAME = LANGUAGE.replace('wiki', '.wikipedia')
# directory on PAWS server that holds Wikimedia dumps
DUMP_DIR = "/public/dumps/public/{0}/latest/".format(LANGUAGE)
DUMP_FN = '{0}-latest-page_restrictions.sql.gz'.format(LANGUAGE)

In [3]:
# The dataset isn't huge -- 1.1 MB -- so should be quick to process in full
!ls -shH "{DUMP_DIR}{DUMP_FN}"

1.1M /public/dumps/public/enwiki/latest/enwiki-latest-page_restrictions.sql.gz


In [4]:
# Inspect the first 46 lines of the page protections dump (each truncated to 1000 characters) to see what it looks like
# As you can see from the CREATE TABLE statement, each datapoint has 7 fields (pr_page, pr_type, ... , pr_id)
# A description of the fields in the data can be found here:
#   https://www.mediawiki.org/wiki/Manual:Page_restrictions_table
# And the data that we want is on lines that start with INSERT INTO `page_restrictions` VALUES...
# The first datapoint (1086732,'edit','sysop',0,NULL,'infinity',1307) can be interpreted as:
#   1086732:    page ID 1086732 (en.wikipedia.org/wiki/?curid=1086732)
#   'edit':     has edit protections
#   'sysop':    that require sysop permissions (https://en.wikipedia.org/wiki/Wikipedia:User_access_levels#Administrator)
#   0:          does not cascade to other pages
#   NULL:       no user-specific restrictions
#   'infinity': restriction does not expire automatically
#   1307:       table primary key -- has no meaning by itself

!zcat "{DUMP_DIR}{DUMP_FN}" | head -46 | cut -c1-1000

-- MySQL dump 10.16  Distrib 10.1.44-MariaDB, for debian-linux-gnu (x86_64)
--
-- Host: 10.64.48.13    Database: enwiki
-- ------------------------------------------------------
-- Server version	10.1.43-MariaDB

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `page_restrictions`
--

DROP TABLE IF EXISTS `page_restrictions`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40

In [5]:
# TODO: Complete example that loops through all page restrictions in the dump file above and extracts data
# The Python gzip library will allow you to decompress the file for reading: https://docs.python.org/3/library/gzip.html#gzip.open
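One possible way to complete this step is sketched below. It assumes each data row follows the seven-field layout described above and that the `pr_type` and `pr_level` values contain no embedded quotes, so a regular expression is enough to pull the tuples out of each INSERT line. The names `ROW_PATTERN`, `parse_insert_line`, and `iter_page_restrictions` are illustrative, and the second tuple in the stand-in file is made up so the sketch can run without access to the real dump.

```python
import gzip
import os
import re
import tempfile

# One tuple per restriction:
# (pr_page, pr_type, pr_level, pr_cascade, pr_user, pr_expiry, pr_id)
# Assumption: type/level strings contain no embedded quotes.
ROW_PATTERN = re.compile(
    r"\((\d+),'([^']*)','([^']*)',(\d+),(NULL|\d+),(NULL|'[^']*'),(\d+)\)"
)
INSERT_PREFIX = "INSERT INTO `page_restrictions` VALUES"


def parse_insert_line(line):
    """Yield one dict per restriction tuple found in a single INSERT line."""
    for match in ROW_PATTERN.finditer(line):
        page, ptype, level, cascade, user, expiry, pr_id = match.groups()
        yield {
            "pr_page": int(page),
            "pr_type": ptype,
            "pr_level": level,
            "pr_cascade": int(cascade),
            "pr_user": None if user == "NULL" else int(user),
            "pr_expiry": None if expiry == "NULL" else expiry.strip("'"),
            "pr_id": int(pr_id),
        }


def iter_page_restrictions(dump_path):
    """Stream a gzipped SQL dump and yield every restriction as a dict."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as fin:
        for line in fin:
            if line.startswith(INSERT_PREFIX):
                yield from parse_insert_line(line)


# Tiny stand-in file so the sketch runs anywhere.
# The first tuple is the datapoint discussed above; the second is made up.
sample_sql = (
    "-- header comment\n"
    "INSERT INTO `page_restrictions` VALUES "
    "(1086732,'edit','sysop',0,NULL,'infinity',1307),"
    "(12345,'move','autoconfirmed',0,NULL,'20250101000000',42);\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-page_restrictions.sql.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as fout:
        fout.write(sample_sql)
    sample_records = list(iter_page_restrictions(sample_path))

print(sample_records[0])
```

In the notebook itself, the same generator could be pointed at the real file, for example `records = list(iter_page_restrictions(DUMP_DIR + DUMP_FN))`.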


## Accessing the Page Protection API
The [Page Protection API](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Binfo) can be a much simpler way to access page protection data if you already know which articles you are interested in and there are relatively few of them (e.g., hundreds or low thousands).

NOTE: the API is always up to date, whereas the MediaWiki dumps are snapshots taken at specific points in time and are always at least several days behind. The data you get from the dumps may therefore differ from the API results if a page's protections have changed in the intervening days.

In [6]:
# TODO: add other libraries here as necessary
import mwapi  # useful for accessing Wikimedia API

In [7]:
# TODO: Select ten random page IDs from the data gathered from the Mediawiki dump to look up via the API
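A minimal sketch of the sampling step, assuming the page IDs from the dump are available as a Python iterable. The helper name `sample_page_ids` and the demo IDs are hypothetical. A fixed seed is used so the same ten pages are chosen on every run, which makes the later comparison with the API reproducible.

```python
import random


def sample_page_ids(page_ids, k=10, seed=0):
    """Return up to k distinct page IDs, chosen reproducibly."""
    # A page can have several restrictions (edit, move, ...), so deduplicate first.
    unique_ids = sorted(set(page_ids))
    rng = random.Random(seed)
    return rng.sample(unique_ids, min(k, len(unique_ids)))


# Stand-in IDs for illustration only; in the notebook these would come from the dump.
demo_ids = list(range(1000, 1050)) + [1000, 1001]
sampled_ids = sample_page_ids(demo_ids)
print(sampled_ids)
```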


In [8]:
# mwapi documentation: https://pypi.org/project/mwapi/
# user_agent helps identify the request if there's an issue and is best practice
tutorial_label = 'Page Protection API tutorial (mwapi)'
# NOTE: it is best practice to include a contact email in user agents.
# An email address is generally private information, though, so do not change it to yours
# if you are working in the PAWS environment or adding to a GitHub repo.
# For Outreachy, you can leave this as my (Isaac's) email or switch it to your MediaWiki username,
# e.g., Isaac (WMF) for https://www.mediawiki.org/wiki/User:Isaac_(WMF)
contact_email = 'isaac@wikimedia.org'
session = mwapi.Session('https://{0}.org'.format(SITENAME), user_agent='{0} -- {1}'.format(tutorial_label, contact_email))

# TODO: You'll have to add additional parameters here to query the pages you're interested in
# API endpoint: https://www.mediawiki.org/w/api.php?action=help&modules=query%2Binfo
# More details: https://www.mediawiki.org/wiki/API:Info
params = {'action':'query',
          'prop':'info'}
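As a rough sketch of what the completed query might look like: adding `inprop=protection` asks the info module to include protection details, and `pageids` takes a pipe-separated list of IDs. The helper names below are illustrative, and the batch size of 50 is an assumption based on the usual per-request limit for regular users. `OfflineSession` is a hypothetical stand-in that returns a made-up response so the sketch runs without network access. In the notebook, the `mwapi.Session` created above would be passed instead.

```python
def build_protection_params(page_ids):
    """Build query parameters that request protection info for the given page IDs."""
    return {
        "action": "query",
        "prop": "info",
        "inprop": "protection",
        "pageids": "|".join(str(pid) for pid in page_ids),
        "formatversion": "2",  # pages are returned as a list rather than a dict
    }


def fetch_protections(session, page_ids, batch_size=50):
    """Return {page_id: set of (type, level, expiry)} for pages that exist."""
    page_ids = list(page_ids)
    results = {}
    for start in range(0, len(page_ids), batch_size):
        batch = page_ids[start:start + batch_size]
        response = session.get(**build_protection_params(batch))
        for page in response.get("query", {}).get("pages", []):
            if page.get("missing") or "pageid" not in page:
                continue  # skip deleted or invalid pages
            results[page["pageid"]] = {
                (p.get("type"), p.get("level"), p.get("expiry"))
                for p in page.get("protection", [])
            }
    return results


class OfflineSession:
    """Hypothetical stand-in for mwapi.Session; the response below is made up."""

    def __init__(self):
        self.requests = []

    def get(self, **params):
        self.requests.append(params)
        return {"query": {"pages": [
            {"pageid": 1086732, "title": "Placeholder title",
             "protection": [{"type": "edit", "level": "sysop", "expiry": "infinity"}]},
            {"pageid": 999999999, "missing": True},
        ]}}


api_protections = fetch_protections(OfflineSession(), [1086732, 999999999])
print(api_protections)
```

Returning sets of tuples keeps the result easy to compare against the records parsed from the dump.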

In [9]:
# TODO: make request to API for data


In [10]:
# TODO: examine API results and compare to data from Mediawiki dump to see if they are the same and explain any discrepancies
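One way to approach the comparison is sketched below, assuming both sources have been reduced to a mapping from page ID to a set of `(type, level)` pairs. Expiry is left out on purpose, since the two sources may format timestamps differently. The function name and the demo dictionaries are hypothetical. Likely explanations for any discrepancies include protections that were added, changed, or removed after the dump snapshot, and possibly cascading protections that the API reports but the dump does not list for that page.

```python
def compare_protections(dump_protections, api_protections):
    """Return pages whose (type, level) pairs differ between the dump and the API."""
    discrepancies = {}
    for page_id in sorted(set(dump_protections) | set(api_protections)):
        dump_set = dump_protections.get(page_id, set())
        api_set = api_protections.get(page_id, set())
        if dump_set != api_set:
            discrepancies[page_id] = {
                "only_in_dump": dump_set - api_set,
                "only_in_api": api_set - dump_set,
            }
    return discrepancies


# Made-up inputs for illustration only.
demo_dump = {
    1: {("edit", "sysop")},
    2: {("edit", "autoconfirmed"), ("move", "sysop")},
    3: {("move", "sysop")},
}
demo_api = {
    1: {("edit", "sysop")},
    2: {("edit", "extendedconfirmed"), ("move", "sysop")},
}
demo_discrepancies = compare_protections(demo_dump, demo_api)
for page_id, diff in demo_discrepancies.items():
    print(page_id, diff)
```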


## Example Analyses of Page Protection Data
Here we show some examples of things we can do with the data that we gathered about the protections for various Wikipedia articles. You'll want to come up with some questions to ask of the data as well. For this, you might need to gather additional data such as:
* The [page table](https://www.mediawiki.org/wiki/Manual:Page_table), which, for example, can be found in the `DUMP_DIR` under the name `{LANGUAGE}-latest-page.sql.gz`
* Additional information about a sample of, for example, 100 articles, gathered from other [API endpoints](https://www.mediawiki.org/wiki/API:Properties).

In [11]:
# TODO: add any imports of data analysis / visualization libraries here as necessary

### Descriptive statistics
TODO: give an overview of basic details about page protections and any conclusions you reach based on the analyses you do below

In [12]:
# TODO: do basic analyses here to understand the data
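A first pass at descriptive statistics could simply count how often each combination of restriction type and level occurs and what share of restrictions never expire. The sketch below assumes records shaped like the fields of the page restrictions table (`pr_page`, `pr_type`, `pr_level`, `pr_expiry`) and treats an expiry of `'infinity'` or `NULL` as indefinite, which is an assumption worth checking against the real data. The demo records are made up.

```python
from collections import Counter


def summarize_restrictions(records):
    """Return simple counts describing a list of restriction records."""
    type_level_counts = Counter((r["pr_type"], r["pr_level"]) for r in records)
    # Assumption: 'infinity' or a missing expiry means the restriction never expires.
    indefinite = sum(1 for r in records if r["pr_expiry"] in ("infinity", None))
    return {
        "num_restrictions": len(records),
        "num_pages": len({r["pr_page"] for r in records}),
        "type_level_counts": type_level_counts,
        "share_indefinite": indefinite / len(records) if records else 0.0,
    }


# Made-up records for illustration only.
demo_records = [
    {"pr_page": 1, "pr_type": "edit", "pr_level": "sysop", "pr_expiry": "infinity"},
    {"pr_page": 1, "pr_type": "move", "pr_level": "sysop", "pr_expiry": "infinity"},
    {"pr_page": 2, "pr_type": "edit", "pr_level": "autoconfirmed", "pr_expiry": "20250101000000"},
    {"pr_page": 3, "pr_type": "edit", "pr_level": "autoconfirmed", "pr_expiry": "infinity"},
]
demo_summary = summarize_restrictions(demo_records)
for (ptype, level), count in demo_summary["type_level_counts"].most_common():
    print(ptype, level, count)
print("share indefinite:", demo_summary["share_indefinite"])
```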

### Predictive Model
TODO: Train and evaluate a predictive model on the data you gathered for the above descriptive statistics. Describe what you learned from the model or how it would be useful.

In [13]:
# imports

In [14]:
# TODO: preprocess data
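Preprocessing depends heavily on the question being asked, so the following is only one illustrative setup. It assumes the task is to predict whether a page has a sysop-level edit restriction, and it derives a few toy features from the page's restrictions themselves. In a real analysis the features would more likely come from other sources, such as the page table. All names and demo records are hypothetical.

```python
import random
from collections import defaultdict


def build_page_rows(records):
    """Collapse restriction records into one (page_id, features, label) row per page."""
    by_page = defaultdict(list)
    for record in records:
        by_page[record["pr_page"]].append(record)
    rows = []
    for page_id, page_records in sorted(by_page.items()):
        features = {
            "has_move_restriction": int(any(r["pr_type"] == "move" for r in page_records)),
            "has_indefinite_restriction": int(any(r["pr_expiry"] == "infinity" for r in page_records)),
            "num_restrictions": len(page_records),
        }
        # Illustrative target: does the page have a sysop-level edit restriction?
        label = int(any(r["pr_type"] == "edit" and r["pr_level"] == "sysop" for r in page_records))
        rows.append((page_id, features, label))
    return rows


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up records for illustration only.
demo_records = [
    {"pr_page": 1, "pr_type": "edit", "pr_level": "sysop", "pr_expiry": "infinity"},
    {"pr_page": 1, "pr_type": "move", "pr_level": "sysop", "pr_expiry": "infinity"},
    {"pr_page": 2, "pr_type": "edit", "pr_level": "autoconfirmed", "pr_expiry": "20250101000000"},
    {"pr_page": 3, "pr_type": "move", "pr_level": "autoconfirmed", "pr_expiry": "infinity"},
    {"pr_page": 4, "pr_type": "edit", "pr_level": "sysop", "pr_expiry": "20250101000000"},
    {"pr_page": 5, "pr_type": "edit", "pr_level": "autoconfirmed", "pr_expiry": "infinity"},
]
demo_rows = build_page_rows(demo_records)
train_rows, test_rows = split_rows(demo_rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Splitting by page rather than by restriction keeps all restrictions for one page on the same side of the train/test boundary.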


In [15]:
# TODO: train model


In [16]:
# TODO: evaluate model
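Whatever model is trained, it helps to compare it against a trivial baseline using more than one metric, because protected pages are likely to be unevenly distributed across classes. The sketch below assumes binary labels (1 for the class of interest) and computes accuracy, precision, and recall by hand. The label lists are made up, and `majority_baseline` is a hypothetical helper that always predicts the most common training label.

```python
def evaluate_binary(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def majority_baseline(y_train, n_predictions):
    """Predict the most common training label for every example."""
    majority = 1 if sum(y_train) * 2 >= len(y_train) else 0
    return [majority] * n_predictions


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_model = [1, 0, 0, 1, 0, 1, 0, 0]
y_baseline = majority_baseline(y_true, len(y_true))

model_scores = evaluate_binary(y_true, y_model)
baseline_scores = evaluate_binary(y_true, y_baseline)
print("model:   ", model_scores)
print("baseline:", baseline_scores)
```

A model that does not clearly beat the majority baseline on precision and recall has probably not learned much, even if its accuracy looks high.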


### Future Analyses
TODO: Describe any additional analyses you can think of that would be interesting (and why) -- even if you are not sure how to do them.