# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may help when using or adapting this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [2]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' and 'pandas' modules are not part of the Python standard library. Install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

In [3]:
# Load the CSV files into dataframes and collect the article names into the ARTICLE_TITLES list

politicians_by_country = pd.read_csv('politicians_by_country_AUG.2024.csv')
pop_by_country = pd.read_csv('population_by_country_AUG.2024.csv')


ARTICLE_TITLES = politicians_by_country['name'].tolist()

# Display the resulting list
print(ARTICLE_TITLES[0:10])

# Plan for the rest of the analysis:
# 1. Pass the politician article titles to the page info API
# 2. Take the revision ID from each API result and pass it to the ORES API
# 3. Merge the article quality score returned by ORES with the population dataset
# 4. Write the combined data to a CSV file with specific column names
# 5. Analyze the result

['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi']


The example relies on some constants that help make the code a bit more readable.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<kilpas@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned; see the Info documentation
# for what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
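
To make the throttling constants concrete, the short sketch below works through the arithmetic: a budget of 1/100 of a second per request, less the assumed 2 ms of latency, leaves roughly 8 ms to sleep before each call. The helper name `throttle_wait` is hypothetical, and the 100-requests-per-second target is taken from the constants above rather than from a documented API limit.

```python
def throttle_wait(max_requests_per_second, assumed_latency_seconds):
    """Return the number of seconds to sleep before each request (never negative)."""
    per_request_budget = 1.0 / max_requests_per_second
    return max(0.0, per_request_budget - assumed_latency_seconds)

# Same numbers as the constants cell: 100 requests/second and 2 ms of latency
wait = throttle_wait(100.0, 0.002)
print(f"Sleep {wait * 1000:.1f} ms before each request")
```

Clamping at zero means that if the assumed latency ever exceeds the per-request budget, the code simply does not sleep.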


The API request is made by a single procedure, written to be reusable. It is parameterized but relies on the constants above for its default values. Because the procedure is meant to request data for a set of article pages, the parameter most likely to change is article_titles.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

# Maximum number of titles per request; defined here so this cell does not
# depend on a later cell having been run
MAX_TITLES_PER_REQUEST = 50

def request_pageinfo_per_articles(article_titles, 
                                  endpoint_url=API_ENWIKIPEDIA_ENDPOINT, 
                                  request_template=PAGEINFO_PARAMS_TEMPLATE,
                                  headers=REQUEST_HEADERS):
    
    # Make sure the placeholder address has been replaced with a real UW email
    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")
    
    # Work on a copy so the shared template constant is never modified
    request_template = request_template.copy()
    
    # Break the article titles list into chunks of 50 (API limit)
    article_batches = [article_titles[i:i + MAX_TITLES_PER_REQUEST] for i in range(0, len(article_titles), MAX_TITLES_PER_REQUEST)]
    
    all_pages_info = {}
    
    for batch in article_batches:
        page_titles = "|".join(batch)
        
        # Update the template with the current batch of titles
        request_template['titles'] = page_titles
        
        # Make the request
        try:
            # Throttle the requests to avoid overloading the API
            if API_THROTTLE_WAIT > 0.0:
                time.sleep(API_THROTTLE_WAIT)
                
            response = requests.get(endpoint_url, headers=headers, params=request_template)
            json_response = response.json()
            
            if json_response and 'query' in json_response:
                pages = json_response['query']['pages']
                all_pages_info.update(pages)
                
        except Exception as e:
            print(f"An error occurred while requesting data for titles: {page_titles}\nError: {e}")
            continue
    
    return all_pages_info
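

The cells that follow call a single-request function named `request_pageinfo_per_article` and print the full JSON response, including the `query` wrapper. That function is not defined in this notebook, so the block below is only a minimal sketch of what such a function might look like, inferred from how it is called. The `http_get` parameter is a hypothetical hook added so the sketch can be exercised without network access, and the constants are repeated from the cells above so the block runs on its own.

```python
import time

# Values repeated from the constants cell so this sketch is self-contained
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_THROTTLE_WAIT = (1.0 / 100.0) - 0.002
REQUEST_HEADERS = {
    'User-Agent': '<kilpas@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def request_pageinfo_per_article(article_title=None,
                                 endpoint_url=API_ENWIKIPEDIA_ENDPOINT,
                                 request_template=PAGEINFO_PARAMS_TEMPLATE,
                                 headers=REQUEST_HEADERS,
                                 http_get=None):
    """Request page info for one title, or for one '|'-joined string of titles.

    Returns the decoded JSON response, or None if the request fails.
    `http_get` is a hypothetical stand-in for requests.get, used for offline testing.
    """
    # Copy the template so the shared constant is never modified
    params = request_template.copy()
    if article_title:
        params["titles"] = article_title
    if not params["titles"]:
        raise ValueError("Must supply an article title to make a page info request.")

    if http_get is None:
        import requests  # third-party; imported only when a real request is made
        http_get = requests.get

    try:
        # Throttle the request to avoid overloading the API
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = http_get(endpoint_url, headers=headers, params=params)
        return response.json()
    except Exception as e:
        print(f"Request failed for titles: {params['titles']}\nError: {e}")
        return None
```

The title can be supplied either as the first argument or through the `titles` entry of a copied template, which matches both calling styles used in the cells below.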


In [4]:
print(f"Getting page info data for: {ARTICLE_TITLES[3]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[3])
print(json.dumps(info,indent=4))

Getting page info data for: Chinook salmon
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1212891": {
                "pageid": 1212891,
                "ns": 0,
                "title": "Chinook salmon",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2024-08-16T10:34:52Z",
                "lastrevid": 1234351318,
                "length": 53787,
                "watchers": 108,
                "talkid": 3909817,
                "fullurl": "https://en.wikipedia.org/wiki/Chinook_salmon",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Chinook_salmon&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Chinook_salmon"
            }
        }
    }
}


In [5]:
print(f"Getting page info data for: {ARTICLE_TITLES[1]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[1])
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Northern flicker
{
    "351590": {
        "pageid": 351590,
        "ns": 0,
        "title": "Northern flicker",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-08-16T10:34:47Z",
        "lastrevid": 1237967172,
        "length": 32225,
        "watchers": 113,
        "talkid": 8324488,
        "fullurl": "https://en.wikipedia.org/wiki/Northern_flicker",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Northern_flicker&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Northern_flicker"
    }
}


Information for multiple pages can be requested at the same time by separating the page titles with the vertical bar "|" character. This approach has limits, however: the batching code in this notebook assumes at most 50 titles per request. Check the API documentation before requesting multiple pages in a single request, and keep the number of pages per request reasonable.
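
As a rough illustration of joining titles with "|", the sketch below splits a long list of titles into groups and joins each group into one request string. The helper name `batch_titles` and the made-up titles are hypothetical, and the default group size of 50 follows the limit assumed elsewhere in this notebook, so confirm it against the API documentation for your account.

```python
def batch_titles(titles, batch_size=50):
    """Yield '|'-joined strings holding at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# 120 made-up titles, purely for illustration
example_titles = [f"Article {n}" for n in range(1, 121)]
batches = list(batch_titles(example_titles))

# Count the titles in each request string
print([batch.count("|") + 1 for batch in batches])  # → [50, 50, 20]
```

Each string produced this way can be used as the `titles` value of one request.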

This example also illustrates creating a copy of the template, setting values in the template, and then calling the function using the template to supply the parameters for the API request.
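
The reason for copying the template is worth making explicit: assigning to the shared dictionary would change the defaults seen by every later call. The sketch below shows the copy-then-modify pattern on its own. The helper name `build_params` is hypothetical, and a shallow copy is assumed to be sufficient because every value in the template is a plain string.

```python
# Template repeated from the constants cell so the sketch runs on its own
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

def build_params(titles, template=PAGEINFO_PARAMS_TEMPLATE):
    """Return a new parameter dictionary with 'titles' filled in."""
    params = template.copy()   # shallow copy; the template itself is untouched
    params["titles"] = titles
    return params

params = build_params("Bison|Red squirrel|Horseshoe bat")
print(params["titles"])                          # the copy carries the titles
print(repr(PAGEINFO_PARAMS_TEMPLATE["titles"]))  # the template is still empty
```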

In [6]:
page_titles = f"{ARTICLE_TITLES[0]}|{ARTICLE_TITLES[2]}|{ARTICLE_TITLES[4]}"
print(f"Getting page info data for: {page_titles}")
request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
request_info['titles'] = page_titles
info = request_pageinfo_per_article(request_template=request_info)
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Bison|Red squirrel|Horseshoe bat
{
    "4583": {
        "pageid": 4583,
        "ns": 0,
        "title": "Bison",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-08-16T10:34:45Z",
        "lastrevid": 1238698333,
        "length": 60744,
        "watchers": 259,
        "talkid": 75239,
        "fullurl": "https://en.wikipedia.org/wiki/Bison",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Bison&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Bison"
    },
    "531505": {
        "pageid": 531505,
        "ns": 0,
        "title": "Horseshoe bat",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2024-08-16T10:34:48Z",
        "lastrevid": 1228572241,
        "length": 57108,
        "watchers": 64,
 

In [4]:
import pandas as pd
import requests
import time
import json

In [6]:
# Load data from the CSV files
politicians_by_country = pd.read_csv('politicians_by_country_AUG.2024.csv')
pop_by_country = pd.read_csv('population_by_country_AUG.2024.csv')

# Remove duplicates in the 'name' column (which is the first column)
politicians_by_country_cleaned = politicians_by_country.iloc[:, 0].drop_duplicates()

# Extract the article titles from the cleaned data (first column)
ARTICLE_TITLES = politicians_by_country_cleaned.tolist()

# Display the first 10 titles (just to verify the extraction)
print(ARTICLE_TITLES[0:10])


['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi']


In [14]:
# Constants
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# The articlequality model is used because the goal is an article quality score
ORES_API_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{}/?models=articlequality&revids={}"

REQUEST_HEADERS = {
    'User-Agent': '<kilpas@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# lastrevid is returned by prop=info by default, so it does not need to be listed in inprop
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

MAX_TITLES_PER_REQUEST = 50
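
The ORES endpoint constant contains two `{}` placeholders. The sketch below shows one plausible way to fill them, assuming the first is the wiki context (`enwiki` here) and the second is a "|"-joined list of revision IDs. The helper name `build_ores_url`, the context value, and the model name in the literal (`articlequality`, chosen to match the stated goal of scoring article quality) are assumptions; substitute whatever the constant in your notebook names.

```python
# Assumed endpoint template: first {} is the wiki context, second {} the revision IDs
ORES_URL_TEMPLATE = "https://ores.wikimedia.org/v3/scores/{}/?models=articlequality&revids={}"

def build_ores_url(rev_ids, context="enwiki", template=ORES_URL_TEMPLATE):
    """Fill the endpoint template with a wiki context and '|'-joined revision IDs."""
    joined_ids = "|".join(str(rev_id) for rev_id in rev_ids)
    return template.format(context, joined_ids)

# Revision IDs taken from the page info output shown in this notebook
print(build_ores_url([1195651393, 1233202991]))
```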

In [21]:
def request_pageinfo_per_articles(article_titles, 
                                  endpoint_url=API_ENWIKIPEDIA_ENDPOINT, 
                                  request_template=PAGEINFO_PARAMS_TEMPLATE,
                                  headers=REQUEST_HEADERS):
    """
    Request page info for a list of article titles in batches of 50.
    Returns the page info data including the current revision ID.
    """
    if 'uwnetid@uw.edu' in headers['User-Agent']:
        raise Exception("Please replace 'uwnetid@uw.edu' with your actual UW email address in the headers.")
    
    article_batches = [article_titles[i:i + MAX_TITLES_PER_REQUEST] for i in range(0, len(article_titles), MAX_TITLES_PER_REQUEST)]
    all_pages_info = {}
    
    for batch in article_batches:
        page_titles = "|".join(batch)
        # Copy the template so the shared constant is not modified between calls
        request_params = request_template.copy()
        request_params['titles'] = page_titles
        
        try:
            time.sleep(0.1)  # Throttle requests
            response = requests.get(endpoint_url, headers=headers, params=request_params)
            json_response = response.json()
            
            if json_response and 'query' in json_response:
                pages = json_response['query']['pages']
                all_pages_info.update(pages)
        except Exception as e:
            print(f"Error fetching page info for titles: {page_titles}\nError: {e}")
    
    return all_pages_info


In [24]:
def request_pageinfo_per_articles(article_titles, 
                                   endpoint_url=API_ENWIKIPEDIA_ENDPOINT, 
                                   request_template=PAGEINFO_PARAMS_TEMPLATE,
                                   headers=REQUEST_HEADERS):
    """
    Request page info for a list of article titles in batches of 50.
    Returns a dictionary with politician names as keys and their last revision IDs as values.
    """
    if 'uwnetid@uw.edu' in headers['User-Agent']:
        raise Exception("Please replace 'uwnetid@uw.edu' with your actual UW email address in the headers.")
    
    article_batches = [article_titles[i:i + MAX_TITLES_PER_REQUEST] for i in range(0, len(article_titles), MAX_TITLES_PER_REQUEST)]
    politicians_revisions = {}  # Dictionary to hold politician names and their last revision IDs
    
    for batch in article_batches:
        page_titles = "|".join(batch)
        # Copy the template so the shared constant is not modified between calls
        request_params = request_template.copy()
        request_params['titles'] = page_titles
        
        try:
            time.sleep(0.1)  # Throttle requests
            response = requests.get(endpoint_url, headers=headers, params=request_params)
            json_response = response.json()
            
            if json_response and 'query' in json_response:
                pages = json_response['query']['pages']
                for page_id, page_data in pages.items():
                    politician_name = page_data.get('title')
                    last_revision_id = page_data.get('lastrevid')
                    if politician_name and last_revision_id:
                        # Store only title and lastrevid in the output dictionary
                        politicians_revisions[politician_name] = {
                            'title': politician_name,
                            'lastrevid': last_revision_id
                        }

        except Exception as e:
            print(f"Error fetching page info for titles: {page_titles}\nError: {e}")
    
    return politicians_revisions

# Example usage:
# Load data from CSV files and prepare ARTICLE_TITLES as before
# politicians_by_country = pd.read_csv('politicians_by_country_AUG.2024.csv')
# ARTICLE_TITLES = politicians_by_country['name'].drop_duplicates().tolist()

# Step 1: Get page info for each politician (including revision ID)
politicians_revisions = request_pageinfo_per_articles(ARTICLE_TITLES)

# Step 2: Write the data to a JSON file
with open('politicians-and-revision-ids.json', 'w') as json_file:
    json.dump(politicians_revisions, json_file, indent=4)

print("Politicians and their revision IDs have been saved to 'politicians-and-revision-ids.json'.")


Politicians and their revision IDs have been saved to 'politicians-and-revision-ids.json'.
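
Because the function stores an entry only when a page comes back with both a title and a `lastrevid`, some requested names may be missing from the result. The sketch below is one simple way to list them. The helper name `titles_without_revisions` and the title 'No Such Article' are hypothetical. The comparison is by exact string, which assumes the API returns each title in the same form it was requested, so it may over-report if titles are normalized.

```python
def titles_without_revisions(requested_titles, revisions):
    """Return the requested titles that have no entry in the revisions mapping."""
    return [title for title in requested_titles if title not in revisions]

# Small stand-in for the dictionary written to the JSON file
revisions = {
    "Majah Ha Adrif": {"title": "Majah Ha Adrif", "lastrevid": 1233202991},
    "Aziza Ahmadyar": {"title": "Aziza Ahmadyar", "lastrevid": 1195651393},
}
requested = ["Majah Ha Adrif", "No Such Article", "Aziza Ahmadyar"]

print(titles_without_revisions(requested, revisions))  # → ['No Such Article']
```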


In [22]:
# Step 1: Get page info for each politician (including revision ID)
page_info_data = request_pageinfo_per_articles(ARTICLE_TITLES)


In [20]:
# Use the first few titles as an example for making the page info request
example_titles = [ARTICLE_TITLES[0], ARTICLE_TITLES[2], ARTICLE_TITLES[4]]
# Join the titles with the pipe as the delimiter
page_titles = "|".join(example_titles)

# Print the titles being requested for clarity
print(f"Getting page info data for: {page_titles}")

# Create a copy of PAGEINFO_PARAMS_TEMPLATE so the shared constant is left untouched
request_info = PAGEINFO_PARAMS_TEMPLATE.copy()

# Make the page info request; the function fills in 'titles' from article_titles
info = request_pageinfo_per_articles(article_titles=example_titles, request_template=request_info)

# The function returns its results dictionary directly (there is no 'query' wrapper),
# so check the dictionary itself
if info:
    print(json.dumps(info, indent=4))
else:
    print("No data retrieved or error occurred.")


Getting page info data for: Majah Ha Adrif|Tayyab Agha|Aziza Ahmadyar


In [23]:
info

{'47805901': {'pageid': 47805901,
  'ns': 0,
  'title': 'Aziza Ahmadyar',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'touched': '2024-10-08T13:30:38Z',
  'lastrevid': 1195651393,
  'length': 3790,
  'talkid': 47806200,
  'fullurl': 'https://en.wikipedia.org/wiki/Aziza_Ahmadyar',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Aziza_Ahmadyar&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Aziza_Ahmadyar'},
 '10483286': {'pageid': 10483286,
  'ns': 0,
  'title': 'Majah Ha Adrif',
  'contentmodel': 'wikitext',
  'pagelanguage': 'en',
  'pagelanguagehtmlcode': 'en',
  'pagelanguagedir': 'ltr',
  'touched': '2024-09-30T14:32:18Z',
  'lastrevid': 1233202991,
  'length': 3188,
  'talkid': 13330265,
  'fullurl': 'https://en.wikipedia.org/wiki/Majah_Ha_Adrif',
  'editurl': 'https://en.wikipedia.org/w/index.php?title=Majah_Ha_Adrif&action=edit',
  'canonicalurl': 'https://en.wikipedia.org/wiki/Majah_