# Data 512 HW 2 (Raviprakash Rthvik, Student ID: 2272104, Email: rravipra@uw.edu)

# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.1 - August 14, 2023

In [None]:
# import the required libraries
# These are standard python modules
import json, time, urllib.parse
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import csv

The example relies on some constants that help make the code a bit more readable.

# INITIALIZING CONSTANTS

In [None]:
#########
#
#    CONSTANTS
#
#########

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
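
As a quick check of the throttling constants, the arithmetic below works out the wait applied before each request. It is only a sketch that repeats the two constants from the cell above; the target of roughly 100 requests per second is read off the `1.0/100.0` term and is an assumption of this notebook, not a limit quoted from the API documentation.

```python
# Repeat the two throttling constants from the notebook
API_LATENCY_ASSUMED = 0.002                            # assumed ~2ms of API and network latency
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED  # time to sleep before each request

# Sleep time plus assumed latency is the time budget for one request
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

print(f"wait per request: {API_THROTTLE_WAIT:.3f} s")
print(f"upper bound on request rate: {max_requests_per_second:.0f} per second")
```

If the real round-trip latency is larger than the assumed 2 ms, which is likely on most connections, the effective request rate will be lower than this bound.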

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

# PROCEDURES/FUNCTIONS

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#
#########

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # work on a copy so the shared template is not modified between calls
    request_params = dict(request_template)

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['titles'] = article_title

    if not request_params['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        # report the problem and signal failure to the caller with None
        print(e)
        json_response = None
    return json_response

The code below extracts the list of (city, state) page titles from the CSV file, which can be found here: [us_cities_by_state](https://drive.google.com/file/d/1khouDmMaZyKo0y5WkFj4lu7g8o35x_98/view)

In [None]:
import pandas as pd

# Specify the path to your CSV file (this is the CSV file which contains the cities by the state)

csv_file_path = '/content/drive/MyDrive/us_cities_by_state_SEPT.2023.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(csv_file_path)

# Extract the 'page_title' column into a list
page_titles = df['page_title'].tolist()

# Print the list of page titles (i.e., a list of "city, state" names)
print(page_titles)

['Abbeville, Alabama', 'Adamsville, Alabama', 'Addison, Alabama', 'Akron, Alabama', 'Alabaster, Alabama', 'Albertville, Alabama', 'Alexander City, Alabama', 'Aliceville, Alabama', 'Allgood, Alabama', 'Altoona, Alabama', 'Andalusia, Alabama', 'Anderson, Lauderdale County, Alabama', 'Anniston, Alabama', 'Arab, Alabama', 'Ardmore, Alabama', 'Argo, Alabama', 'Ariton, Alabama', 'Arley, Alabama', 'Ashford, Alabama', 'Ashland, Alabama', 'Ashville, Alabama', 'Athens, Alabama', 'Atmore, Alabama', 'Attalla, Alabama', 'Auburn, Alabama', 'Autaugaville, Alabama', 'Avon, Alabama', 'Babbie, Alabama', 'Baileyton, Alabama', 'Bakerhill, Alabama', 'Banks, Alabama', 'Bay Minette, Alabama', 'Bayou La Batre, Alabama', 'Bear Creek, Alabama', 'Beatrice, Alabama', 'Beaverton, Alabama', 'Belk, Alabama', 'Benton, Alabama', 'Berlin, Alabama', 'Berry, Alabama', 'Bessemer, Alabama', 'Billingsley, Alabama', 'Birmingham, Alabama', 'Black, Alabama', 'Blountsville, Alabama', 'Blue Springs, Alabama', 'Boaz, Alabama', 'B

Iterate over each entry in page_titles (i.e., the city, state names) from the list created above to request its page info:

In [None]:
page_info_per_city = [] # list of dictionaries, one per page title

# Iterate over each page title and request page info
for page_title in page_titles:
    info = request_pageinfo_per_article(page_title)
    if not info or 'query' not in info:
        # the request function returns None when an exception occurs, so skip this title
        print(f"No page info returned for: {page_title}")
        continue
    page_info = info['query']['pages']

    # Each request asks for a single title, so the response holds one page entry keyed by its page ID
    page_id = next(iter(page_info))

    # Extract the last revision ID; .get() yields None if the entry has no 'lastrevid'
    last_revision_id = page_info[page_id].get('lastrevid')
    # Create a dictionary and append it to the result list
    page_info_per_city.append({'title': page_title, 'lastrevid': last_revision_id})

# Uncomment the line below to print the list of dictionaries
# print(json.dumps(page_info_per_city, indent=4))
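
When a title does not resolve to an existing article, the API (with the default JSON format used here) is understood to return a page entry that carries a `missing` key and no `lastrevid`. The helper below is a hypothetical sketch, not part of the original notebook, showing one way to read `lastrevid` from a response defensively. The name `extract_lastrevid` and both sample responses are illustrative; the first sample is abridged from the 'Abbeville, Alabama' output shown later in this notebook, and the second is an assumed shape for a missing page.

```python
def extract_lastrevid(response):
    """Return the lastrevid of the single page in a page info response, or None."""
    if not response:
        return None                      # request failed and returned None
    pages = response.get('query', {}).get('pages', {})
    if not pages:
        return None                      # no page entries at all
    page = next(iter(pages.values()))    # one title per request, so one page entry
    if 'missing' in page:
        return None                      # the title has no article
    return page.get('lastrevid')

# Sample shaped like the notebook's output for 'Abbeville, Alabama' (abridged)
found = {"query": {"pages": {"104730": {"pageid": 104730,
                                        "title": "Abbeville, Alabama",
                                        "lastrevid": 1171163550}}}}
# Assumed shape of a response for a title with no article (hypothetical title)
not_found = {"query": {"pages": {"-1": {"ns": 0,
                                        "title": "No such city, Alabama",
                                        "missing": ""}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(not_found))
print(extract_lastrevid(None))
```

Returning None for every failure case keeps the CSV rows aligned with the input titles, at the cost of not distinguishing a failed request from a missing page.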

In [None]:
# Path to the CSV file where the data is going to be saved
file_path = 'page_info_per_city.csv'

# Define the CSV header (column names)
header = ['title', 'lastrevid']

# Open the CSV file for writing
with open(file_path, mode='w', newline='', encoding='utf-8') as file:
    # Create a CSV writer object
    csv_writer = csv.DictWriter(file, fieldnames = header)

    # Write the header to the CSV file
    csv_writer.writeheader()

    # Write the data from page_info_per_city to the CSV file
    csv_writer.writerows(page_info_per_city)

print(f'Data has been saved to {file_path}')

Data has been saved to page_info_per_city.csv


In [None]:
print(f"Getting page info data for: {ARTICLE_TITLES[3]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[3])
print(json.dumps(info,indent=4))

Getting page info data for: Chinook salmon
{
    "batchcomplete": "",
    "query": {
        "pages": {
            "1212891": {
                "pageid": 1212891,
                "ns": 0,
                "title": "Chinook salmon",
                "contentmodel": "wikitext",
                "pagelanguage": "en",
                "pagelanguagehtmlcode": "en",
                "pagelanguagedir": "ltr",
                "touched": "2023-10-10T22:39:15Z",
                "lastrevid": 1178125499,
                "length": 49187,
                "watchers": 102,
                "talkid": 3909817,
                "fullurl": "https://en.wikipedia.org/wiki/Chinook_salmon",
                "editurl": "https://en.wikipedia.org/w/index.php?title=Chinook_salmon&action=edit",
                "canonicalurl": "https://en.wikipedia.org/wiki/Chinook_salmon"
            }
        }
    }
}


In [None]:
print(f"Getting page info data for: {'Abbeville, Alabama'}")
info = request_pageinfo_per_article('Abbeville, Alabama')
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Abbeville, Alabama
{
    "104730": {
        "pageid": 104730,
        "ns": 0,
        "title": "Abbeville, Alabama",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2023-10-10T22:35:37Z",
        "lastrevid": 1171163550,
        "length": 24706,
        "talkid": 281244,
        "fullurl": "https://en.wikipedia.org/wiki/Abbeville,_Alabama",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Abbeville,_Alabama&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Abbeville,_Alabama"
    }
}


In [None]:
print(f"Getting page info data for: {ARTICLE_TITLES[1]}")
info = request_pageinfo_per_article(ARTICLE_TITLES[1])
print(json.dumps(info['query']['pages'],indent=4))

Getting page info data for: Northern flicker
{
    "351590": {
        "pageid": 351590,
        "ns": 0,
        "title": "Northern flicker",
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2023-10-18T00:13:37Z",
        "lastrevid": 1179719310,
        "length": 27754,
        "watchers": 105,
        "talkid": 8324488,
        "fullurl": "https://en.wikipedia.org/wiki/Northern_flicker",
        "editurl": "https://en.wikipedia.org/w/index.php?title=Northern_flicker&action=edit",
        "canonicalurl": "https://en.wikipedia.org/wiki/Northern_flicker"
    }
}


Multiple pages can be requested at the same time by separating the page titles with the vertical bar "|" character. However, the API caps the number of titles a single request may carry (the documentation lists 50 for most clients), so check the API documentation before requesting multiple pages in one call, and keep the number of pages per request reasonably small.
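
One way to stay within a per-request limit is to split the title list into fixed-size batches and join each batch with "|". The sketch below is illustrative only: `batch_titles` and `MAX_TITLES_PER_REQUEST` are hypothetical names that do not appear in the original notebook, and the value 50 is an assumption that should be confirmed against the API documentation before relying on it.

```python
MAX_TITLES_PER_REQUEST = 50   # assumed limit; confirm against the API documentation

def batch_titles(titles, batch_size=MAX_TITLES_PER_REQUEST):
    """Yield '|'-joined strings, each holding at most batch_size titles."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# Same example titles as the constants cell, repeated so the block runs on its own
ARTICLE_TITLES = ['Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat']

# A batch size of 2 is used here only to make the splitting visible
for titles_param in batch_titles(ARTICLE_TITLES, batch_size=2):
    print(titles_param)
```

Each yielded string could be passed as `article_title` to `request_pageinfo_per_article`. The `pages` dictionary in the response would then hold one entry per title, so any code that reads `lastrevid` would need to iterate over all entries rather than take only the first.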

The request function can also be driven entirely by the template: create a copy of the template, set values in the copy, and then call the function with the copy as `request_template` to supply the parameters for the API request.
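
A minimal sketch of the copy-and-set pattern follows. It repeats the template from the constants cell so it can run on its own; the variable name `request_info` and the chosen title are illustrative, and the actual API call is left as a comment because it depends on the function defined earlier in the notebook and on network access.

```python
# Repeat the template from the constants cell so this block runs on its own
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers"
}

# Copy the template so the shared constant is left untouched
request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
request_info['titles'] = 'Red squirrel'
request_info['inprop'] = ""          # ask for the basic page info only

print(request_info)
print(PAGEINFO_PARAMS_TEMPLATE['titles'] == "")   # check that the original template is unchanged

# In the notebook, where request_pageinfo_per_article is defined, the copy
# would then be passed as the request_template argument:
# info = request_pageinfo_per_article(request_template=request_info)
```

A shallow copy is enough here because every value in the template is a string; a template holding nested dictionaries or lists would call for `copy.deepcopy` instead.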