# Step 1 - Getting PageInfo Data

This short notebook takes a list of Wikipedia article titles for U.S. cities and queries Wikipedia's PageInfo API for data on each article. It reads the titles from the "page\_title" column of a CSV file named "us\_cities\_by\_state\_SEPT.2023.csv", which the code below loads from the "input/" subdirectory next to this notebook. Each individual API response is stored in the subdirectory "raw\_API\_data/Pageinfo/" as a JSON file named after the corresponding article title.

In [2]:
#this cell uses pandas to import the csv list of articles 

import pandas as pd 
article_csv = pd.read_csv("input/us_cities_by_state_SEPT.2023.csv")
article_list = article_csv['page_title'].tolist()
article_csv

Unnamed: 0,state,page_title,url
0,Alabama,"Abbeville, Alabama","https://en.wikipedia.org/wiki/Abbeville,_Alabama"
1,Alabama,"Adamsville, Alabama","https://en.wikipedia.org/wiki/Adamsville,_Alabama"
2,Alabama,"Addison, Alabama","https://en.wikipedia.org/wiki/Addison,_Alabama"
3,Alabama,"Akron, Alabama","https://en.wikipedia.org/wiki/Akron,_Alabama"
4,Alabama,"Alabaster, Alabama","https://en.wikipedia.org/wiki/Alabaster,_Alabama"
...,...,...,...
22152,Wyoming,"Wamsutter, Wyoming","https://en.wikipedia.org/wiki/Wamsutter,_Wyoming"
22153,Wyoming,"Wheatland, Wyoming","https://en.wikipedia.org/wiki/Wheatland,_Wyoming"
22154,Wyoming,"Worland, Wyoming","https://en.wikipedia.org/wiki/Worland,_Wyoming"
22155,Wyoming,"Wright, Wyoming","https://en.wikipedia.org/wiki/Wright,_Wyoming"
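
Because each response is later written to a file named after its article title, a repeated title in the list would overwrite an earlier file. The following is a minimal sketch of a duplicate check. The helper name `find_duplicate_titles` and the sample list are made up for illustration; in the notebook the function would be called on `article_list`.

```python
from collections import Counter

def find_duplicate_titles(titles):
    """Return a sorted list of the titles that occur more than once."""
    counts = Counter(titles)
    return sorted(title for title, n in counts.items() if n > 1)

# Made-up sample standing in for article_list
sample_titles = ["Abbeville, Alabama", "Akron, Alabama", "Abbeville, Alabama"]
duplicates = find_duplicate_titles(sample_titles)
print(duplicates)
```

An empty result means every title maps to its own output file, at least before any filename characters are replaced.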


The code below was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program, and is made available under the [Creative Commons](https://creativecommons.org/) [CC-BY license](https://creativecommons.org/licenses/by/4.0/) (Revision 1.2, August 14, 2023). It defines a function and the accompanying constants needed to query the PageInfo API for an individual article's data. The User-Agent constant was modified to include my email address.

In [13]:
import time, json, requests

#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<ramirost@uw.edu>, University of Washington, Ramiro Steinmann Petrasso',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned; see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
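
For orientation, a successful response to this query is a nested dict in which the page record sits under `query`, then `pages`, keyed by the page ID. The sketch below pulls that record out. The helper name `extract_page_record` and all sample values are made up for illustration; only the nesting is meant to mirror the API's default JSON layout, and it is worth confirming against a real response.

```python
def extract_page_record(json_response):
    """Return the single page record from a PageInfo response, or None."""
    if not json_response:
        return None
    pages = json_response.get("query", {}).get("pages", {})
    # One title per request means at most one entry; its key is the page ID
    for page_id, record in pages.items():
        return record
    return None

# Illustrative response with made-up values, not real API output
sample_response = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Abbeville, Alabama",
                "talkid": 67890,
                "length": 1000,
            }
        }
    },
}
record = extract_page_record(sample_response)
print(record["title"], record["talkid"])
```

Returning `None` for an empty or failed response matches how `request_pageinfo_per_article` signals an exception.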



In [None]:
#This constant defines the directory in which all of the JSON files will be dumped. 
DATA_STORAGE_DIRECTORY = "raw_API_data/Pageinfo/"


import os

#This function takes in a Python dict and stores it as a JSON file in the specified subdirectory, with the specified title as the filename.
def dump_to_file(view_data, title, folder):
    #replace characters that are not allowed in filenames on common filesystems
    for forbidden_char in ['\\', ':', '/', '*', '"', '<', '>', '?']:
        title = title.replace(forbidden_char, "_")

    #print("writing pageinfo data file for: ", title)

    #create the output folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)

    #os.path.join avoids the doubled slash that plain string concatenation produced
    with open(os.path.join(folder, title + ".json"), "w") as outfile:
        json.dump(view_data, outfile)

#This function loops over a list of Wikipedia article titles, gets the Pageinfo data for each article, 
#and then writes that data to the specified subdirectory as json files titled after each article. By
#default it will store the files in the folder from the above constant
def grab_articles_in_list(list_of_articles,
                          directory = DATA_STORAGE_DIRECTORY):
    for article in list_of_articles:
        #print(f"Getting page info data for: {article}")
        info = request_pageinfo_per_article(article)
        #print(json.dumps(info,indent=4))
        dump_to_file(info, article, directory)
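
The loop issues one request per title, so the throttling constants set a floor on how long it takes. The sketch below works through that arithmetic. The function name `estimate_min_runtime_seconds` is made up, and the article count of 22,156 is an assumption read off the table index above (0 through 22155). Real network round trips are likely to be much slower than the assumed 2 ms, so this is only a lower bound.

```python
API_LATENCY_ASSUMED = 0.002                              # same value as the constant above
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # 0.008 s of sleep per request

def estimate_min_runtime_seconds(n_articles,
                                 wait=API_THROTTLE_WAIT,
                                 latency=API_LATENCY_ASSUMED):
    """Lower bound on loop time: each request costs the sleep plus the assumed latency."""
    return n_articles * (wait + latency)

n_articles = 22156  # assumption: one row per index value 0..22155
estimate = estimate_min_runtime_seconds(n_articles)
print(f"{estimate:.1f} s, about {estimate / 60:.1f} min")
```

At 100 requests per second this comes to roughly 3.7 minutes for the full list.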


In [1]:
#This simple line executes the loop to download and store all of the PageInfo data.

grab_articles_in_list(article_list)
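
Once the loop has finished, the stored files can be read back to check the download. The following is a minimal sketch, assuming every file in the folder was written with `json.dump` as above. The helper name `load_stored_responses` is hypothetical, and the demonstration uses a temporary folder with one made-up file in place of "raw_API_data/Pageinfo/".

```python
import json
import os
import tempfile

def load_stored_responses(folder):
    """Load every .json file in folder into a dict keyed by the filename without its extension."""
    responses = {}
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".json"):
            with open(os.path.join(folder, filename)) as infile:
                responses[filename[:-len(".json")]] = json.load(infile)
    return responses

# Demonstration on a temporary folder holding one made-up file
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "Abbeville, Alabama.json"), "w") as outfile:
        json.dump({"query": {"pages": {}}}, outfile)
    loaded = load_stored_responses(demo_folder)

# A file containing null loads as None, which points to a request that raised an exception
failed = [title for title, data in loaded.items() if data is None]
print(len(loaded), "files loaded,", len(failed), "empty responses")
```

Comparing `len(loaded)` with the length of `article_list` gives a quick check that no titles were lost or overwritten.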