## Data acquisition to create wp_politicians_by_country.csv file

### Assumptions
* The environment variables `USER_NAME` and `ACCESS_TOKEN` have been set before running this notebook via the command `source creds.sh`.
Make sure to go into the `creds.sh` file and replace the variables with appropriate values.

* All data files (`.csv`, etc.) are in the same directory as this notebook.

### Description

The notebook is intended to be run in sequential order.

The goal of this notebook is to combine the data in `politicians_by_country_AUG.2024.csv` and `population_by_country_AUG.2024.csv` with article information retrieved through Wikipedia API calls.

First, the `revision_id` (called `lastrevid` in the Wikipedia data schema) is obtained by calling the Wikipedia API through the `request_pageinfo_per_article()` function in the notebook below.
This function takes the title of a Wikipedia article, in this case the name of a politician, and returns page information about the article. The `lastrevid` is extracted
from the API response and saved as `revision_id`.
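
As a rough illustration of this extraction step, the sketch below pulls `lastrevid` out of a response shaped like the ones the notebook prints further down. The helper name `extract_lastrevid`, the page id, and the revision id are all made up for the example; no network call is made.

```python
from typing import Optional

# Hypothetical, trimmed-down responses in the shape returned by a pageinfo query.
# The page id, title, and revision id are invented for illustration.
found_response = {
    "batchcomplete": "",
    "query": {"pages": {"12345": {"pageid": 12345, "ns": 0,
                                  "title": "Example Politician",
                                  "lastrevid": 987654321}}},
}
missing_response = {
    "batchcomplete": "",
    "query": {"pages": {"-1": {"ns": 0, "title": "No Such Article", "missing": ""}}},
}

def extract_lastrevid(response: Optional[dict]) -> Optional[int]:
    """Return `lastrevid` from a single-title response, or None if it is absent."""
    if not response:
        return None
    pages = response.get("query", {}).get("pages", {})
    # One title is requested at a time, so `pages` holds at most one entry
    for page in pages.values():
        return page.get("lastrevid")
    return None

print(extract_lastrevid(found_response))
print(extract_lastrevid(missing_response))
```

Returning `None` rather than raising lets the caller decide how to record articles that have no revision.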

Second, the `revision_id` is passed to the ORES article quality model via an API call, which returns a predicted quality class for the Wikipedia article associated with a politician.
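
The prediction sits several levels deep in the response. The sketch below assumes the nesting that the notebook code indexes into (`enwiki` > `scores` > revision id > `articlequality` > `score` > `prediction`). The helper name `extract_prediction` is hypothetical and the sample prediction value is invented.

```python
from typing import Optional

# Hypothetical response; only the keys that the notebook reads are included,
# and the "Start" prediction is made up for illustration.
sample_score = {
    "enwiki": {
        "scores": {
            "1085687913": {
                "articlequality": {"score": {"prediction": "Start"}}
            }
        }
    }
}

def extract_prediction(score: Optional[dict]) -> Optional[str]:
    """Return the predicted quality class, or None if the response lacks one."""
    try:
        scores = score["enwiki"]["scores"]
        rev_key = next(iter(scores))  # the single revision id that was scored
        return scores[rev_key]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError, StopIteration):
        return None

print(extract_prediction(sample_score))
print(extract_prediction({"error": "example failure"}))
```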

Third, data from the `population_by_country_AUG.2024.csv` file will be processed to associate each country with its region, so that each politician can be assigned to the appropriate region.
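
One possible way to do this kind of grouping with pandas alone is to forward-fill the region name down onto the country rows. This is a sketch on made-up rows, not the approach used in the notebook below. It assumes region rows are written in all caps, and the column names `Geography` and `Population` are assumptions for the example.

```python
import pandas as pd

# Made-up rows: each ALL-CAPS region row is followed by its countries
sample_df = pd.DataFrame({
    "Geography": ["REGION A", "Country One", "Country Two", "REGION B", "Country Three"],
    "Population": [30.0, 10.0, 20.0, 5.0, 5.0],
})

is_region = sample_df["Geography"].str.isupper()

# Keep the name on region rows only, then carry it down to the rows beneath
sample_df["region"] = sample_df["Geography"].where(is_region).ffill()

# Country rows are everything that is not a region row
countries = sample_df[~is_region].rename(
    columns={"Geography": "country", "Population": "population"}
)
country_to_region = dict(zip(countries["country"], countries["region"]))
print(country_to_region)
```

With nested regions, the forward fill assigns each country to the nearest region row above it.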

Finally, all the above information will be grouped together in a final dictionary with the following schema: `country`, `region`, `population`, `article_title`, `revision_id`, `article_quality`.
This final dictionary will be transformed into a Pandas DataFrame and saved to .csv format.

At each step above, any politician whose article does not return a `lastrevid` or an ORES score is logged and saved to a separate file noting the missing information.
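
The notebook itself keeps these politicians in Python lists and pickles them. As an alternative sketch, the records could also be written to a plain CSV log. The file name, the `reason` field, and the sample records below are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical records for politicians whose articles lacked information
missing = [
    {"name": "Example Person A", "country": "Country One", "reason": "no lastrevid"},
    {"name": "Example Person B", "country": "Country Two", "reason": "no ORES score"},
]

# Written to the temp directory here so the example leaves the project folder untouched
out_path = os.path.join(tempfile.gettempdir(), "politicians_missing_info_example.csv")
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "country", "reason"])
    writer.writeheader()
    writer.writerows(missing)

# Read the log back to confirm what was saved
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows))
```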

In [1]:
import json, os, pickle, time, urllib.parse
import requests

import pandas as pd

In [2]:
"""
Load csv data into Pandas Dataframes
"""
politicians_df = pd.read_csv('politicians_by_country_AUG.2024.csv')
population_df = pd.read_csv('population_by_country_AUG.2024.csv')

In [3]:
"""
Constants setup ahead of time for Wikipedia API calls using the `request_pageinfo_per_article()` function below.
"""

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'dtropf@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
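
The throttle constant above follows a simple pattern: the time budget per request minus the assumed latency. The sketch below restates that arithmetic as a helper. The function name `throttle_wait` is hypothetical, and the rates are the ones given in the notebook's comments (100 requests per second here, 5000 requests per hour for the ORES calls later on).

```python
def throttle_wait(max_requests: float, per_seconds: float,
                  assumed_latency: float = 0.002) -> float:
    """Seconds to sleep before each request to stay under the given rate."""
    # Never return a negative wait, even if latency exceeds the per-request budget
    return max(0.0, per_seconds / max_requests - assumed_latency)

# Page info requests: (1.0 / 100.0) - 0.002
pageinfo_wait = throttle_wait(100, 1.0)

# ORES requests: ((60.0 * 60.0) / 5000.0) - 0.002
ores_wait = throttle_wait(5000, 60.0 * 60.0)

print(round(pageinfo_wait, 3), round(ores_wait, 3))
```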

In [None]:
"""
Create function to obtain relevant information from a Wikipedia page based on the Wikipedia article title.
Will return `lastrevid` which will be used for `revision_id` later in the notebook.
"""

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    """
    Calls Wikipedia API to obtain page information for `article_title`.
    Returns JSON of information according to arguments in `request_template` and `headers`.
    """
    
    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

In [11]:
"""
Use the `request_pageinfo_per_article()` function above to get the article information for each politician.
Extract from the response to the Wikipedia API `lastrevid` and keep track of the politicians which did not return a `lastrevid`.
"""
politicians_list = [] # List for politicians with articles that returned lastrevid
bad_politician_list = [] # List for politicians with articles that DID NOT return lastrevid

# For each row in the politicians dataframe, obtain article information about the politician
# via `request_pageinfo_per_article()` function, then extract and save `name` as `article_title` and `lastrevid` as `revision_id`.
# Keep all other information from row of dataframe as well (country and url).
for index, row in politicians_df.iterrows():
    rd = row.to_dict() # Convert row pandas object to dictionary
    article_title = rd['name']
    info = request_pageinfo_per_article(article_title)
    # Not all politicians return a `lastrevid` (and a failed request returns None), so use a try/except block
    # and save politicians that do not return `lastrevid` in a separate list
    try:
        info_num = list(info['query']['pages'].keys())[0] # Trick to get key for nested dictionary needed to obtain lastrevid from returned json schema
        rev_id = info['query']['pages'][info_num]['lastrevid']
        rd['revision_id'] = rev_id
        politicians_list.append(rd)
    except (KeyError, IndexError, TypeError):
        print(info) # Print out each politician info that did not have `lastrevid`
        bad_politician_list.append(rd)

{'batchcomplete': '', 'query': {'pages': {'-1': {'ns': 0, 'title': 'Barbara Eibinger-Miedl', 'missing': '', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'fullurl': 'https://en.wikipedia.org/wiki/Barbara_Eibinger-Miedl', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Barbara_Eibinger-Miedl&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Barbara_Eibinger-Miedl'}}}}
{'batchcomplete': '', 'query': {'pages': {'-1': {'ns': 0, 'title': 'Mehrali Gasimov', 'missing': '', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'fullurl': 'https://en.wikipedia.org/wiki/Mehrali_Gasimov', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Mehrali_Gasimov&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Mehrali_Gasimov'}}}}
{'batchcomplete': '', 'query': {'pages': {'-1': {'ns': 0, 'title': 'Kyaw Myint', 'missing': '', 'contentmodel': 'wikitext', 'pagelan

In [14]:
"""
Pickle lists in case something goes wrong.
These lists can be loaded later so that API calls to Wikipedia for `revision_id` do not need to be made again.
"""
# Pickle the lists
with open('bad_good_politicians_HW2lists.pkl', 'wb') as f:
    pickle.dump((politicians_list,bad_politician_list), f)  # Pickle both lists as a tuple
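
Reloading the lists later mirrors the dump: open the file in binary read mode and unpack the tuple. The sketch below uses stand-in lists and a temporary file so it can run on its own. In the notebook the file would be `bad_good_politicians_HW2lists.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in lists; in the notebook these are `politicians_list` and `bad_politician_list`
good = [{"name": "Example Person A", "revision_id": 111}]
bad = [{"name": "Example Person B"}]

path = os.path.join(tempfile.gettempdir(), "example_lists.pkl")
with open(path, "wb") as f:
    pickle.dump((good, bad), f)  # Same tuple layout as the cell above

# Later (for example after a kernel restart), restore both lists in one step
with open(path, "rb") as f:
    loaded_good, loaded_bad = pickle.load(f)

print(len(loaded_good), len(loaded_bad))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.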

In [20]:
"""
Constants setup to make calls to the Wikipedia ORES algorithm for classification of article quality.
"""
# The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

# The throttling rate is a function of the Access token that you are granted when you request the token. The constants
# come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
   
# Because all LiftWing API requests require some form of authentication, you need to provide your access token
# as part of the header too
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "dtropf@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

# This is a template for the parameters that we need to supply in the headers of an API request
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}


# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This is a template of the data required as a payload when making a scoring request of the ORES model
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that it's English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

# Obtain USERNAME and ACCESS_TOKEN via environment variables set by using the command `source creds.sh` before running this notebook.
# See README.txt for instructions.
USERNAME = os.environ['USER_NAME']
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
EMAIL_ADDRESS = 'dtropf@uw.edu'

In [None]:
"""
Calls to ORES model for classification of article quality.
"""

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    """
    Call Wikipedia ORES API to obtain a classification of an article based on `article_revid`.
    Returns the JSON response from which the article quality prediction will be extracted.
    """
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
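
The header-building loop in the function above relies on `str.format` filling named placeholders from a dictionary. A small standalone sketch of that step follows, using placeholder values rather than real credentials. The template here is an example and not the notebook's exact header.

```python
# Example template; "{access_token}" and "{email_address}" are named placeholders
header_template = {
    "User-Agent": "{email_address}, example project",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}

# Placeholder values only; a real token would come from the environment
header_params = {"email_address": "someone@example.org", "access_token": "EXAMPLE_TOKEN"}

# str.format ignores keyword arguments that a given string does not use,
# so every template value can be formatted with the same parameter dict
headers = {key: value.format(**header_params) for key, value in header_template.items()}
print(headers["Authorization"])
```

One caveat with this approach is that any literal braces in a template value would need to be doubled (`{{` and `}}`) to survive formatting.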

In [None]:
"""
Obtain an `article_quality` ORES value for the Wikipedia article associated with each politician.
Also keep track of the articles which did not return a prediction.
"""
final_politician_list = [] # List for politicians with articles that returned ORES classification
bad_final_politician_list = [] # List for politicians with articles that DID NOT return an ORES classification

# For each politician, use their `revision_id` to call the `request_ores_score_per_article()` function
# and obtain the predicted classification for the article.
for i, p in enumerate(politicians_list):
    revision_id = p['revision_id']

    # A politician may not return an ORES prediction, so use try/except block
    try:
        # ORES prediction
        score = request_ores_score_per_article(article_revid=revision_id,
                                           email_address=EMAIL_ADDRESS,
                                           access_token=ACCESS_TOKEN)

        aq_id = list(score['enwiki']['scores'].keys())[0] # Trick to get to prediction score in nested dictionary of response to API
        article_quality = score['enwiki']['scores'][aq_id]['articlequality']['score']['prediction']
        p['article_quality'] = article_quality
        print(i, p)
    except Exception: # Catch request and parsing errors, but still allow the loop to be interrupted manually
        bad_final_politician_list.append(p)
        print(f'BAD: {i} {p}')

In [32]:
"""
Pickle lists in case something goes wrong
"""

# Pickle the lists
with open('article_quality_politicians_HW2lists.pkl', 'wb') as f:
    pickle.dump((politicians_list, bad_final_politician_list), f)  # Pickle both lists as a tuple

In [53]:
"""
This cell for grouping countries with their respective regions.
The `population_by_country_AUG.2024.csv` file has a hierarchy in which each region and the countries inside that region appear in the same column.
Thus, some data processing needs to be done in order to properly associate a politician with a specific region.

The goal is to obtain a list of dictionaries, each dictionary is associated with a single region and will have all countries
of the region as well as the respective populations of each country.

Using the list of dictionaries, the country a politician is from can be found in one of these dictionaries and thus a region can then
be assigned to the politician.

NOTE: This cell assumes that the data is structured such that the 'Geography' value (column) in the first row is a "region"
      signified by a string that is in all caps.
"""

region_list = [] # List of dictionaries, one dictionary for each region
region_dict = None # Region currently being filled; None until the first region row is seen
# Cycle through each row of the population csv file, determine if the value in `Geography`
# column is all caps (indicates it is a region), then aggregate all non-capitalized values beneath
# the region into a list and save in the region dictionary.
for index, row in population_df.iterrows():
    geography = row.values[0]
    population = row.values[1]
    if geography.isupper(): # Test if region
        if region_dict is not None: # The previous region is complete, so save it before starting the next one
            region_list.append(region_dict)
            print(region_dict)
        # Create new dictionary for next region, with empty lists for its countries and their populations
        region_dict = {'region': geography, 'population': population,
                       'country_list': [], 'population_list': []}
    else: # Logic if country and NOT region
        region_dict['country_list'].append(geography)
        region_dict['population_list'].append(population)

# The last region is never followed by another region row, so save it after the loop
if region_dict is not None:
    region_list.append(region_dict)
    print(region_dict)

{'region': 'WORLD', 'population': 8009.0, 'country_list': [], 'population_list': []}
{'region': 'AFRICA', 'population': 1453.0, 'country_list': [], 'population_list': []}
{'region': 'NORTHERN AFRICA', 'population': 256.0, 'country_list': ['Algeria', 'Egypt', 'Libya', 'Morocco', 'Sudan', 'Tunisia', 'Western Sahara'], 'population_list': [46.8, 105.2, 6.9, 37.0, 48.1, 11.9, 0.6]}
{'region': 'WESTERN AFRICA', 'population': 442.0, 'country_list': ['Benin', 'Burkina Faso', 'Cape Verde', "Cote d'Ivoire", 'Gambia', 'Ghana', 'Guinea', 'GuineaBissau', 'Liberia', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Senegal', 'Sierra Leone', 'Togo'], 'population_list': [13.7, 22.9, 0.6, 30.9, 2.8, 34.1, 14.2, 2.2, 5.4, 23.3, 4.9, 27.2, 223.8, 18.3, 8.9, 9.1]}
{'region': 'EASTERN AFRICA', 'population': 483.0, 'country_list': ['Burundi', 'Comoros', 'Djibouti', 'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi', 'Mauritius', 'Mayotte', 'Mozambique', 'Reunion', 'Rwanda', 'Seychelles', 'Somalia', 'South Suda

In [64]:
"""
This cell bundles together `country`, `region`, `population`, `article_title`, `revision_id`, and `article_quality`
from the various objects collected from cells above into a single dictionary.

The dictionary is then transformed into a Pandas DataFrame and finally saved out as a CSV.

NOTE: The value `population` refers to the `country` population and not the `region` population.
"""

# Schema for final output to csv
final_dict = {'country': [], 'region': [], 'population': [], 'article_title': [], 
              'revision_id': [], 'article_quality': []}

# For each politician, get the country, article, revision_id, article_quality, country population
for i, p in enumerate(politicians_list):
    if p in bad_final_politician_list: # If politician did not have an ORES score or otherwise, skip that politician
        continue
    country = p['country']
    article_title = p['name']
    revision_id = p['revision_id']
    article_quality = p['article_quality']

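    # Loop through each region to find out which region the politician belongs to based on their country.
    # Reset first so a country found in no region gets None rather than the previous politician's values.
    region = None
    population = None
    for r in region_list:
        if country in r['country_list']:
            region = r['region']
            population = r['population_list'][r['country_list'].index(country)]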

    # Save all results to dictionary
    final_dict['country'].append(country)
    final_dict['region'].append(region)
    final_dict['population'].append(population)
    final_dict['article_title'].append(article_title)
    final_dict['revision_id'].append(revision_id)
    final_dict['article_quality'].append(article_quality)
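
The nested loop above scans every region for every politician. A possible alternative, sketched here on made-up data, is to build a single country lookup once and then query it per politician. The names `region_list_example` and `country_lookup` are hypothetical, and the dictionaries follow the same keys as `region_list`.

```python
# Made-up regions in the same shape as the entries of `region_list`
region_list_example = [
    {"region": "REGION A", "population": 30.0,
     "country_list": ["Country One", "Country Two"], "population_list": [10.0, 20.0]},
    {"region": "REGION B", "population": 5.0,
     "country_list": ["Country Three"], "population_list": [5.0]},
]

# Map each country to (region, country population) in one pass
country_lookup = {
    country: (r["region"], pop)
    for r in region_list_example
    for country, pop in zip(r["country_list"], r["population_list"])
}

# A country that is missing from every region falls back to (None, None)
region, population = country_lookup.get("Country Two", (None, None))
unknown = country_lookup.get("Nowhere", (None, None))
print(region, population, unknown)
```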

In [70]:
"""
This final cell converts the `final_dict` object in the cell above to a Pandas DataFrame then saves that data to a csv file.
"""
final_df = pd.DataFrame(final_dict)
final_df.to_csv('wp_politicians_by_country.csv', index=False)
final_df

Unnamed: 0,country,region,population,article_title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233202991,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230459615,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225661708,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234741562,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651393,Start
...,...,...,...,...,...,...
7141,Zimbabwe,EASTERN AFRICA,16.7,Josiah Tongogara,1203429435,C
7142,Zimbabwe,EASTERN AFRICA,16.7,Langton Towungana,1246280093,Stub
7143,Zimbabwe,EASTERN AFRICA,16.7,Sengezo Tshabangu,1228478288,Start
7144,Zimbabwe,EASTERN AFRICA,16.7,Herbert Ushewokunze,959111842,Stub
