# Getting Quality Predictions for World Politicians' Wikipedia Articles

### Homework #2 – Data 512
### Daniel Vogler

# Requesting Article Quality Scores

## Environment Setup

In [13]:
import json
import math
import os
import time

import pandas as pd
import requests
from dotenv import load_dotenv
from tqdm.notebook import tqdm

## Constant Definition

The code below defines constants and formatting templates that will be used by functions later in this notebook that make requests to the Wikimedia Page Info API and the ORES API.

The code and comments below are adapted, with light modifications, from Dr. David McDonald,
who provided them for use in DATA 512, a course in the University of Washington MS of Data
Science Program. The code is provided and utilized here under the [Creative Commons CC-BY license.](https://creativecommons.org/licenses/by/4.0/)

I use the `python-dotenv` package to manage the secret `ACCESS_TOKEN` needed to access the ORES API.

In [14]:
USERNAME = "voglerdaniel"
load_dotenv()
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")

################################################################################################ 
# The code and comments below are adopted, with light modifications, from Dr. David McDonald,
# who provided them for use in DATA 512, a course in the University of Washington MS of Data
# Science Program. The code is provided and utilized here 
# under the Creative Commons CC-BY license
################################################################################################

#########
#
#    CONSTANTS
#    The current LiftWing ORES API endpoint and prediction model
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

# Defining the request header
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}

#    This is a template for the parameters that we need to supply in the headers of an API request

REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "dvogler@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#    This is a template of the data required as a payload when making a scoring request of the ORES model

ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#    USERNAME and ACCESS_TOKEN are set at the top of this cell; they are not
#    re-initialized here, since that would overwrite the token loaded from .env
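To make the throttle constants concrete, the short calculation below reproduces `API_THROTTLE_WAIT` and estimates how long a fully throttled run would spend waiting. The figure of 7,155 requests is the record count reported at the end of this notebook; the names `wait_seconds` and `estimated_minutes` are illustrative only, and the estimate ignores actual response times.

```python
# Rough arithmetic behind the throttle constants (illustrative names, not part of the pipeline)
REQUESTS_PER_HOUR = 5000       # rate limit stated in the constants above
LATENCY_ASSUMED = 0.002        # seconds, same assumption as API_LATENCY_ASSUMED
N_REQUESTS = 7155              # one ORES request per article in this dataset

wait_seconds = (60.0 * 60.0) / REQUESTS_PER_HOUR - LATENCY_ASSUMED
estimated_minutes = N_REQUESTS * wait_seconds / 60.0

print(f"wait per request: {wait_seconds:.3f} s")
print(f"throttle time for {N_REQUESTS} requests: about {estimated_minutes:.0f} minutes")
```

At roughly 0.72 seconds per request, the waiting alone comes to well over an hour, which is why the later cells save intermediate batches.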

Aside from the constants above, one more input is needed to make the API requests to ORES for article quality scores: a mapping of article names to their latest revision IDs. For this, I use the ordinary page info APIs to define `ARTICLE_REVISIONS`. The constants needed for the Wikimedia Page Info API are defined below; the next section of this notebook shows how I make calls to that API.

In [15]:
######### CONSTANTS

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"

# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<dvogler@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

## Defining Helper Functions for Page Info Requests

The code below requests page info for each article. Later, I will use the `lastrevid` contained in the page info to make calls to the ORES API. 

Once again, the code and comments below are adapted, with light modifications, from Dr. David McDonald, who provided them for use in DATA 512, a course in the University of Washington MS of Data Science Program. The code was provided under the Creative Commons CC-BY license. The original source code can be found in the [resources folder](../resources).

In [16]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
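    # copy the template so the shared default dict is not modified between calls
    request_template = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title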

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

The page info API can take **batches** of article titles in a single request, which is faster than requesting one article at a time. Below I define a small helper function to format batched requests per the API documentation.

In [17]:
def make_batch(names):

    # the page info API accepts up to 50 titles per request, separated by "|"
    separator = "|"
    result = separator.join(names)

    return result
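As a rough illustration of the batching idea, the sketch below splits a list of titles into groups of at most 50 and joins each group with the `|` separator. The helper `chunk_titles` and the placeholder titles are hypothetical and not part of the original notebook.

```python
def chunk_titles(titles, batch_size=50):
    """Yield successive '|'-joined groups of at most batch_size titles."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# placeholder titles, used only to show the shape of the output
example_titles = [f"Article {i}" for i in range(120)]
batches = list(chunk_titles(example_titles))

print(len(batches))                   # number of requests needed
print(batches[-1].count("|") + 1)     # titles in the final, partial batch
```

With 120 placeholder titles this yields three request strings: two full batches and one partial batch of 20.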

## Importing Data and Making Page Info Requests

In this section, I will iterate over every politician's `name` in the [politicians dataset](../raw_data/politicians_by_country_AUG.2024.csv) and make a (batched) request to the Wikimedia page info API to figure out the most recent revision ID, which the ORES API needs.

First, I import the names of the politicians I'll need and define some constants that allow batch-processing this data:

In [18]:
cleaned_politicians_path = "../cleaned_data/politicians_by_country_AUG_2024_clean.csv"

politicians = pd.read_csv(cleaned_politicians_path)

BATCH_SIZE_LIMIT = 50 # per API limits

iterations = math.ceil(len(politicians) / BATCH_SIZE_LIMIT)

N = len(politicians)
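The batch arithmetic can be checked by hand. The sketch below assumes 7,155 rows (the record count reported at the end of this notebook) and computes the number of batches along with the inclusive row bounds of the first and last batch. `batch_bounds` is a hypothetical helper, not part of the notebook's pipeline.

```python
import math

N_ROWS = 7155          # assumed row count, taken from the final cell of this notebook
BATCH_SIZE = 50

n_batches = math.ceil(N_ROWS / BATCH_SIZE)

def batch_bounds(iteration, batch_size=BATCH_SIZE, n_rows=N_ROWS):
    """Return inclusive (start, end) row labels for one batch."""
    start = iteration * batch_size
    end = min(start + batch_size, n_rows) - 1
    return start, end

print(n_batches)                      # 144
print(batch_bounds(0))                # (0, 49)
print(batch_bounds(n_batches - 1))    # (7150, 7154)
```

The final batch holds only 5 rows, which matches the shorter last progress bar in the run log further down.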

Next, the code below defines a data structure `ARTICLE_REVISIONS` which will map each article's name to its latest revision number, and a function that executes this lookup in batches of 50 by calling the page info API.

In [19]:
ARTICLE_REVISIONS = {}

def get_article_revisions():
    for iteration in range(iterations):

        print(f"Starting iteration {iteration} out of {iterations}")
        
        # pandas indexing is inclusive of both endpoints, so I have to subtract
        # 1 from the index of the last article in the batch to prevent overlap
        start = iteration * BATCH_SIZE_LIMIT
        end = start + BATCH_SIZE_LIMIT - 1

        names = politicians.loc[start:end, "name"]

        batch_key = make_batch(names)

        batch_info = request_pageinfo_per_article(batch_key)

        payload = batch_info["query"]["pages"]

        # the payload has info about 50 articles (one batch); iterate over all of them
        # and get the rev ID
        for k in payload.keys():
            
            # read the title first so the warning below always refers to the current page
            title = payload[k].get("title")
            revid = payload[k].get("lastrevid")

            if revid is None:
                print(f"Warning: no revid found for article with title {title}")
            
            ARTICLE_REVISIONS[title] = revid
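To show what the loop above is reading, here is a hand-written payload in the shape the page info API returns, with one existing page and one missing page. The page ID, revision ID, and titles are made up for illustration, and `extract_revisions` is a hypothetical helper rather than the notebook's own code.

```python
# Hand-written example in the shape of a MediaWiki "prop=info" response (values are invented)
sample_response = {
    "query": {
        "pages": {
            "1001": {"pageid": 1001, "title": "Example Politician", "lastrevid": 123456789},
            "-1": {"title": "Nonexistent Politician", "missing": ""},
        }
    }
}

def extract_revisions(response):
    """Map each returned title to its lastrevid, or None when the page has none."""
    pages = response["query"]["pages"]
    return {page["title"]: page.get("lastrevid") for page in pages.values()}

revisions = extract_revisions(sample_response)
print(revisions)
```

Using `dict.get` keeps a missing `lastrevid` from raising, so each title is recorded exactly once, either with its revision ID or with `None`.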

To facilitate reproducibility and eliminate redundancy, I check if the `article_revisions.json` file already exists; if it does, I just import it. Otherwise, I populate the `ARTICLE_REVISIONS` dict and write it to that file. This conditional code allows this [notebook to run with a single button press](http://www.practicereproducibleresearch.org/core-chapters/2-assessment.html), regardless of whether or not it's running for the first time.

In [20]:
article_revisions_filepath = "../cleaned_data/article_revisions.json"
if os.path.isfile(article_revisions_filepath):
    with open(article_revisions_filepath, "r") as f:
        ARTICLE_REVISIONS = json.load(f)
else: 
    get_article_revisions()
    with open(article_revisions_filepath, 'w') as file:
        json.dump(ARTICLE_REVISIONS, file, indent = 4)
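The load-or-compute pattern described above can be sketched in isolation. This version is a generic illustration: `load_or_build` and `expensive_lookup` are hypothetical names, and the example writes to a temporary directory instead of `../cleaned_data`.

```python
import json
import os
import tempfile

def load_or_build(path, build_fn):
    """Load JSON from path if it exists; otherwise build the data, save it, and return it."""
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data

calls = []

def expensive_lookup():
    calls.append(1)                       # record that the slow path ran
    return {"Example Politician": 123456789}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "article_revisions.json")
    first = load_or_build(cache_path, expensive_lookup)    # builds and saves
    second = load_or_build(cache_path, expensive_lookup)   # loads from disk

print(first == second, len(calls))
```

The slow function runs only on the first call; the second call is served from the saved file.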

## Calling the ORES API to Get Predicted Article Quality

Yet again, the code and comments below are adapted, with light modifications, from Dr. David McDonald, who provided them for use in DATA 512, a course in the University of Washington MS of Data Science Program. The code was provided under the [Creative Commons CC-BY license](https://creativecommons.org/licenses/by/4.0/). The original source code can be found in the [resources folder](../resources).

This function enables requests to the ORES API for a given article. Later in this section, I will iterate over all of the politicians in the dataset, make a call to the ORES API, and figure out the article quality prediction.

In [22]:
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
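    #    Copy the templates so the shared default dicts are not modified between calls;
    #    otherwise a rev_id left over from an earlier call could be reused
    request_data = dict(request_data)
    header_params = dict(header_params)

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call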
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
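The header-building step in the function above relies on `str.format` filling the placeholders in each template value. The sketch below repeats that step on its own, with an obviously fake email address and token; the header template is copied from the constants cell, and the revision ID in the body is made up.

```python
import json

header_format = {
    "User-Agent": "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    "Content-Type": "application/json",
    "Authorization": "Bearer {access_token}",
}
# placeholder credentials for illustration only
header_params = {"email_address": "someone@example.edu", "access_token": "FAKE_TOKEN"}

# values without placeholders (like Content-Type) pass through format() unchanged
headers = {key: value.format(**header_params) for key, value in header_format.items()}

# the request body is the data template serialized as JSON, with a made-up revision ID
body = json.dumps({"lang": "en", "rev_id": 123456789, "features": True})

print(headers["Authorization"])
print(body)
```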


The code below gathers ORES article quality predictions for each politician in the provided list, leveraging the helper function for calls to the ORES API defined above.

Because each article requires its own throttled API request, the full run takes a while, so I take a batched approach. The parameter `starting_point` can be changed from its default value of 0 if `get_all_predictions()` is interrupted while it is running. After getting a batch (50 articles) of quality predictions, the code below saves that batch to a folder of intermediate records. If `get_all_predictions()` is terminated or errors out, I can simply restart from the last successful batch, then merge the intermediate JSON files.

The code below explicitly looks for article quality scores that are available in the English Wikipedia. Occasionally, this means that no prediction score is available, because the page is only written in languages other than English. These pages are effectively filtered out of my analysis. **This is an intentional methodological choice**. My reasoning is that, without deeply understanding the way the machine learning models provided as services by ORES assign quality scores to articles, it is not appropriate to compare quality scores across languages, as this would not be an apples-to-apples comparison.

In [25]:
predictions = []

# starting point is the iteration we start at, in case this gets cut off
def get_all_predictions(starting_point=0,
                        df = politicians):

    # tqdm shows a progress bar when running loops
    for iteration in tqdm(range(starting_point, iterations)):
        
        start_i = iteration * BATCH_SIZE_LIMIT
        end_i = start_i + BATCH_SIZE_LIMIT - 1

        names = df.loc[start_i:end_i, "name"]

        batch = [] # will store the records in this batch
        for name in tqdm(names, leave=False):
            
            # look up the revid outside the try block so a failed ORES call does not erase it
            revid = ARTICLE_REVISIONS.get(name)
            prediction = None

            try:
                if revid is None:
                    raise ValueError(f"no revision ID recorded for {name}")

                score = request_ores_score_per_article(revid, 
                                            email_address="dvogler@uw.edu",
                                            access_token=ACCESS_TOKEN)
                
                # access the predicted article quality, which is nested several layers into the response JSON
                prediction = list(score["enwiki"]["scores"].values())[0]["articlequality"]["score"]["prediction"]
            
            except Exception:
                # some articles have no revid or no score; I track these with print statements
                print(f"Warning: could not access ORES quality prediction for article called {name}")
            
            predictions_record = {
                "title": name,
                "revid": revid,
                "prediction": prediction
            }

            predictions.append(predictions_record)
            batch.append(predictions_record)
        
        print(f"Batch {iteration} of {iterations} completed. Saving...")

        # save batch result
        with open(f"../intermediate_data_batches/quality_predictions-{iteration}.json", "w",
                  encoding="utf-8") as file:
            
            json.dump(batch, 
                      file, 
                      ensure_ascii=False, 
                      indent = 4)
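    # save final result once every batch has finished; this belongs inside the function,
    # because at module level it would write an empty list before any predictions exist
    with open("../output_data/quality_predictions.json", "w", encoding="utf-8") as file:
        json.dump(predictions, 
                  file, 
                  ensure_ascii=False, 
                  indent = 4)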
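The long indexing expression in the loop can be easier to follow against a concrete response. The dictionary below is a hand-written stand-in that only mirrors the keys the notebook's code indexes into; the revision ID and the `"Stub"` label are invented for illustration, and `extract_prediction` is a hypothetical helper.

```python
# Hand-written stand-in for a scoring response (only the keys used by the notebook)
sample_score = {
    "enwiki": {
        "scores": {
            "123456789": {
                "articlequality": {"score": {"prediction": "Stub"}}
            }
        }
    }
}

def extract_prediction(score):
    """Return the predicted quality class, or None if the response lacks the expected keys."""
    try:
        by_revision = score["enwiki"]["scores"]
        first_entry = next(iter(by_revision.values()))
        return first_entry["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError, StopIteration):
        return None

print(extract_prediction(sample_score))            # Stub
print(extract_prediction({"error": "not found"}))  # None
print(extract_prediction(None))                    # None
```

Catching only the specific lookup errors keeps unrelated problems, such as a typo in a variable name, from being silently swallowed.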



In [26]:
get_all_predictions(0)

Batch 0 of 144 completed. Saving...
Batch 1 of 144 completed. Saving...
Batch 2 of 144 completed. Saving...
[Batches 3 through 141 print the same message.]
Batch 142 of 144 completed. Saving...
Batch 143 of 144 completed. Saving...


From the warnings printed in the output log above, I compute the error rate below.

In [29]:
# errors

ERROR_NUMBER = 9 # manual count from print statements
                # a better and more reproducible practice would
                # be to log this during the for loop, but given time
                # constraints I was not able to re-compute everything

ERROR_RATE = ERROR_NUMBER / N

print("Error rate (percent):", round(ERROR_RATE*100, 2))

Error rate (percent): 0.13


No article quality prediction could be retrieved for 9 of the 7,155 records, or about 0.13 percent.
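As the comment in the cell above notes, counting failures programmatically would be more reproducible than reading them off the log. A minimal sketch, assuming records shaped like the ones saved by `get_all_predictions()`; the three sample records are invented and `error_rate` is a hypothetical helper:

```python
sample_records = [
    {"title": "Example Politician A", "revid": 111, "prediction": "Stub"},
    {"title": "Example Politician B", "revid": None, "prediction": None},
    {"title": "Example Politician C", "revid": 333, "prediction": "GA"},
]

def error_rate(records):
    """Return (number of records without a prediction, that number as a fraction of all records)."""
    failures = sum(1 for record in records if record["prediction"] is None)
    return failures, failures / len(records)

n_failed, rate = error_rate(sample_records)
print(n_failed, round(rate * 100, 2))
```

Applied to the saved predictions, the same function would give the failure count directly from the data rather than from a manual tally.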

Below, I'm defining a function that stitches together batches of records. If for any reason `output_data/quality_predictions.json` comes up empty, it can be reconstructed from the batches.

In [40]:
batch_directory = "../intermediate_data_batches"

def stitch_batches(batch_directory=batch_directory):

    combined_data = []
    for filename in os.listdir(batch_directory):
        path = os.path.join(batch_directory, filename)
        
        if not filename.endswith(".json"):
            continue
        with open(path, 'r', encoding='utf-8') as f:
            batch = json.load(f)
            for data in batch:
                combined_data.append(data)
    
    return combined_data
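One thing to keep in mind is that `os.listdir` returns entries in arbitrary order, so the stitched records may not follow batch order. If order matters, a variation like the one below sorts files by the batch number in their names. It assumes the `quality_predictions-<n>.json` naming used earlier; `batch_number` and `stitch_batches_sorted` are hypothetical names, and the demo writes to a temporary directory.

```python
import json
import os
import re
import tempfile

def batch_number(filename):
    """Extract the integer batch index from a name like quality_predictions-12.json."""
    match = re.search(r"-(\d+)\.json$", filename)
    return int(match.group(1)) if match else -1

def stitch_batches_sorted(batch_directory):
    """Combine all batch files in numeric batch order."""
    filenames = [f for f in os.listdir(batch_directory) if f.endswith(".json")]
    combined = []
    for filename in sorted(filenames, key=batch_number):
        with open(os.path.join(batch_directory, filename), "r", encoding="utf-8") as f:
            combined.extend(json.load(f))
    return combined

with tempfile.TemporaryDirectory() as tmp:
    # write three tiny batch files out of order to exercise the numeric sort
    for i in (10, 2, 0):
        path = os.path.join(tmp, f"quality_predictions-{i}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump([{"title": f"batch {i}"}], f)
    stitched = stitch_batches_sorted(tmp)

print([record["title"] for record in stitched])
```

Sorting on the parsed integer avoids the lexicographic trap where `quality_predictions-10.json` would sort before `quality_predictions-2.json`.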

In [43]:
with open("../output_data/quality_predictions.json") as f:
    predictions = json.load(f)

if len(predictions) == 0:
    predictions = stitch_batches()
    print("Combined", len(predictions), "records. Writing to output_data folder")
    with open(f"../output_data/quality_predictions.json", "w", encoding="utf-8") as file:
        json.dump(predictions, 
                      file, 
                      ensure_ascii=False, 
                      indent = 4)
    
else:
    print("quality predictions are in data_output folder")


quality predictions are in data_output folder


The article quality predictions from the ORES API are now in the `output_data` folder. As a final sense check, I verify that the number of records matches the original dataset:

In [45]:
len(predictions)

7155