# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for one or more article pages. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



# Loading Libraries

The following code block references Dr. David W. McDonald's example code, with new libraries added to the originals.

In [2]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd

# Import Data

This step loads two CSV files containing (1) politicians with their Wikipedia article titles and respective countries and (2) country populations, in millions, with their corresponding regions.

Each article title is the title of a Wikipedia page about a politician. The pages vary in completeness.

The following code block references Dr. David W. McDonald's example code.

In [3]:
# Load the CSV files into dataframes and collect the article names into the ARTICLE_TITLES list

politicians_by_country = pd.read_csv('politicians_by_country_AUG.2024.csv')
pop_by_country = pd.read_csv('population_by_country_AUG.2024.csv')


ARTICLE_TITLES = politicians_by_country['name'].tolist()

# Display the resulting list
print(ARTICLE_TITLES[0:10])

# Planned pipeline:
# 1. Pass the politician article titles to the page info API
# 2. Take the revision ID from each API result and pass it to the ORES API
# 3. Merge the article quality score returned by ORES with the population dataset
# 4. Write the combined data to a CSV file with specific column names
# 5. Analyze the result

['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi']


# Initialize Constants

Establishes the API request endpoint and parameters in order to extract the Wikipedia page revision IDs. 

The following code block references Dr. David W. McDonald's example code.

##### USER NOTE! 
- Replace the email in REQUEST_HEADERS with your own email address
- Specify the source of the article titles, or use the commented-out example list below

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<kilpas@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
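
To see what these constants produce, the sketch below computes the throttle wait and encodes the parameter template into a query string without contacting the API. It uses `urllib.parse.urlencode` from the standard library as a stand-in for the encoding that `requests` performs, so the string `requests` builds may differ in small details. The two example titles come from the commented-out list above; `example_params` and `example_url` are illustrative names that are not part of the original example.

```python
import urllib.parse

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Same arithmetic as the constants above: 10 ms between requests,
# minus the 2 ms of latency that is assumed to pass anyway
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# Build the parameters for one request without modifying the template
example_params = {**PAGEINFO_PARAMS_TEMPLATE, "titles": "Bison|Red squirrel"}

# urlencode percent-encodes the "|" separators and turns spaces into "+"
example_url = API_ENWIKIPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(example_params)

print(f"Wait between requests: {API_THROTTLE_WAIT:.3f} s")
print(example_url)
```

Printing the URL this way is a quick check that the titles and the extended properties are being passed as intended before any real request is sent.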


# Calling the API

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is article_titles.

The following code block references Dr. David W. McDonald's example code.

In [3]:
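#########
#
#    PROCEDURES/FUNCTIONS
#

# The function below splits the titles into batches of this size, so it must be defined before the function is called
MAX_TITLES_PER_REQUEST = 50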

def request_pageinfo_per_articles(article_titles, 
                                  endpoint_url=API_ENWIKIPEDIA_ENDPOINT, 
                                  request_template=PAGEINFO_PARAMS_TEMPLATE,
                                  headers=REQUEST_HEADERS):
    
    # Make sure the placeholder email in the User-Agent header has been replaced
    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")
    
    # Break the article titles list into chunks of 50 (API limit)
    article_batches = [article_titles[i:i + MAX_TITLES_PER_REQUEST] for i in range(0, len(article_titles), MAX_TITLES_PER_REQUEST)]
    
    all_pages_info = {}
    
    for batch in article_batches:
        page_titles = "|".join(batch)
        
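        # Copy the template so the shared default dictionary is not modified,
        # then set the current batch of titles on the copy
        params = request_template.copy()
        params['titles'] = page_titles
        
        # Make the request
        try:
            # Throttle the requests to avoid overloading the API
            if API_THROTTLE_WAIT > 0.0:
                time.sleep(API_THROTTLE_WAIT)
                
            response = requests.get(endpoint_url, headers=headers, params=params)
            # Turn HTTP error statuses into exceptions so they are reported below
            response.raise_for_status()
            json_response = response.json()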
            
            if json_response and 'query' in json_response:
                pages = json_response['query']['pages']
                all_pages_info.update(pages)
                
        except Exception as e:
            print(f"An error occurred while requesting data for titles: {page_titles}\nError: {e}")
            continue
    
    return all_pages_info


The function requests information for multiple pages at the same time by separating the page titles with the vertical bar "|" character. This approach has limits: the API accepts at most 50 titles per request for regular users, which is why the titles are split into batches of MAX_TITLES_PER_REQUEST. Check the API documentation before changing the number of pages sent in a single request.

This example also illustrates creating a copy of the template, setting values in the template, and then calling the function using the template to supply the parameters for the API request.
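
The batching and template-copy steps described above can be sketched without any network access. This is a minimal sketch under stated assumptions: `chunk_titles` and `params_for_batch` are hypothetical helper names, the titles are made up, and the batch size of 50 is taken from the "API limit" comment in the function above.

```python
MAX_TITLES_PER_REQUEST = 50

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}


def chunk_titles(titles, size=MAX_TITLES_PER_REQUEST):
    """Split a list of titles into consecutive batches of at most `size`."""
    return [titles[i:i + size] for i in range(0, len(titles), size)]


def params_for_batch(batch, template=PAGEINFO_PARAMS_TEMPLATE):
    """Return a copy of the template with the batch joined by '|' as titles."""
    params = template.copy()            # the shared template stays untouched
    params["titles"] = "|".join(batch)
    return params


# Made-up titles, only to show how 120 titles are split
titles = [f"Example article {n}" for n in range(120)]
batches = chunk_titles(titles)
all_params = [params_for_batch(batch) for batch in batches]

print([len(batch) for batch in batches])   # [50, 50, 20]
print(all_params[-1]["titles"][:40])
```

Working on a copy matters because the template is also the default argument of the request function: writing into it directly would leave the last batch of titles in the shared dictionary after the call.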

# Sanity Check - Article Titles Extraction

This block of code checks that the article titles are extracted correctly before requesting the revision IDs.

In [6]:
# Load data from the CSV files
politicians_by_country = pd.read_csv('politicians_by_country_AUG.2024.csv')
pop_by_country = pd.read_csv('population_by_country_AUG.2024.csv')

# Remove duplicates in the 'name' column (which is the first column)
politicians_by_country_cleaned = politicians_by_country.iloc[:, 0].drop_duplicates()

# Extract the article titles from the cleaned data (first column)
ARTICLE_TITLES = politicians_by_country_cleaned.tolist()

# Display the first 10 titles (just to verify the extraction)
print(ARTICLE_TITLES[0:10])


['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi']
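
Before dropping duplicates it can be useful to see which names repeat, since the cleaning step above discards them silently. The sketch below does this with the standard library only. `find_duplicates` and `drop_duplicates_keep_first` are hypothetical helpers, and the short list (with one name repeated on purpose) is made-up sample input, not the real CSV contents.

```python
from collections import Counter


def find_duplicates(titles):
    """Return a dict mapping each repeated title to how often it occurs."""
    counts = Counter(titles)
    return {title: n for title, n in counts.items() if n > 1}


def drop_duplicates_keep_first(titles):
    """Remove repeats while keeping the first occurrence and the original order."""
    return list(dict.fromkeys(titles))


# Sample input: the first name is deliberately repeated at the end
sample_titles = ['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Majah Ha Adrif']

duplicates = find_duplicates(sample_titles)
unique_titles = drop_duplicates_keep_first(sample_titles)

print(duplicates)
print(unique_titles)
```

Keeping the first occurrence mirrors the default behaviour of the pandas `drop_duplicates` call used in the cell above.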


# Specifying Constants for the Student Example

This block of code mirrors the professor's example above, adding the ORES endpoint and the maximum number of titles per request.

In [14]:
# Constants
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# The ORES model that returns article quality scores is 'articlequality'
ORES_API_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{}/?models=articlequality&revids={}"

REQUEST_HEADERS = {
    'User-Agent': '<kilpas@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# 'lastrevid' is returned by prop=info by default, so it does not need to be listed in 'inprop'
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

MAX_TITLES_PER_REQUEST = 50
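
The two `{}` placeholders in ORES_API_ENDPOINT can be filled with `str.format`. The sketch below assumes the first placeholder is the wiki context (for example `enwiki`) and the second is one or more revision IDs joined with "|". That reading is an assumption based on the shape of the URL, not something this notebook states. `fill_scores_url`, the stand-in template on example.org, and the revision IDs are all hypothetical.

```python
def fill_scores_url(template, context, rev_ids):
    """Fill a two-placeholder URL template with a wiki context and revision IDs."""
    revids = "|".join(str(rev_id) for rev_id in rev_ids)
    return template.format(context, revids)


# Stand-in template with the same two-placeholder shape as ORES_API_ENDPOINT
DEMO_TEMPLATE = "https://example.org/v3/scores/{}/?models=some_model&revids={}"

demo_url = fill_scores_url(DEMO_TEMPLATE, "enwiki", [1001, 1002])
print(demo_url)
```

Passing the template as an argument keeps the helper independent of which endpoint or model is eventually used.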

# Initialize Function to Request API Page Info

This version of the function above keeps only the title and last revision ID of each page, still requesting the titles in batches joined with the "|" character. The results are then written to a JSON file of article titles and their corresponding revision IDs.

In [24]:
def request_pageinfo_per_articles(article_titles, 
                                   endpoint_url=API_ENWIKIPEDIA_ENDPOINT, 
                                   request_template=PAGEINFO_PARAMS_TEMPLATE,
                                   headers=REQUEST_HEADERS):
    """
    Request page info for a list of article titles in batches of MAX_TITLES_PER_REQUEST (50).
    Returns a dictionary keyed by article title; each value is a dictionary
    holding the 'title' and 'lastrevid' of that page.
    """
    if 'uwnetid@uw.edu' in headers['User-Agent']:
        raise Exception("Please replace 'uwnetid@uw.edu' with your actual UW email address in the headers.")
    
    article_batches = [article_titles[i:i + MAX_TITLES_PER_REQUEST] for i in range(0, len(article_titles), MAX_TITLES_PER_REQUEST)]
    politicians_revisions = {}  # Dictionary to hold politician names and their last revision IDs
    
    for batch in article_batches:
        page_titles = "|".join(batch)
        # Copy the template so the shared default dictionary is not modified
        params = request_template.copy()
        params['titles'] = page_titles
        
        try:
            time.sleep(0.1)  # Throttle requests
            response = requests.get(endpoint_url, headers=headers, params=params)
            json_response = response.json()
            
            if json_response and 'query' in json_response:
                pages = json_response['query']['pages']
                for page_id, page_data in pages.items():
                    politician_name = page_data.get('title')
                    last_revision_id = page_data.get('lastrevid')
                    if politician_name and last_revision_id:
                        # Store only title and lastrevid in the output dictionary
                        politicians_revisions[politician_name] = {
                            'title': politician_name,
                            'lastrevid': last_revision_id
                        }

        except Exception as e:
            print(f"Error fetching page info for titles: {page_titles}\nError: {e}")
    
    return politicians_revisions

# Step 1: Get page info for each politician (including revision ID)
politicians_revisions = request_pageinfo_per_articles(ARTICLE_TITLES)

# Step 2: Write the data to a JSON file
with open('politicians-and-revision-ids.json', 'w') as json_file:
    json.dump(politicians_revisions, json_file, indent=4)

print("Politicians and their revision IDs have been saved to 'politicians-and-revision-ids.json'.")


Politicians and their revision IDs have been saved to 'politicians-and-revision-ids.json'.
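
The function above silently skips any page that has no `lastrevid`, which is the case when a title has no article. The sketch below parses a hand-written response in the shape the API returns for `prop=info` with `format=json`, and collects the skipped titles so they can be logged. The sample response, its page and revision IDs, and the `split_found_and_missing` helper are made up for illustration.

```python
# Hand-written sample: one existing page and one missing page
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "1111": {"pageid": 1111, "ns": 0, "title": "Example Politician A",
                     "lastrevid": 987654321},
            "-1": {"ns": 0, "title": "Example Politician B", "missing": ""},
        }
    },
}


def split_found_and_missing(json_response):
    """Return (found, missing): revision info by title, and titles without a revision."""
    found, missing = {}, []
    pages = json_response.get("query", {}).get("pages", {})
    for page in pages.values():
        title = page.get("title")
        rev_id = page.get("lastrevid")
        if title and rev_id:
            found[title] = {"title": title, "lastrevid": rev_id}
        elif title:
            missing.append(title)
    return found, missing


found, missing = split_found_and_missing(sample_response)
print(found)
print(missing)
```

Recording the missing titles makes it possible to report how many politicians had no revision ID instead of losing them without a trace.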
