# Data 512, Homework 2

This script runs all the code in this repository. All output is saved in the 'processed_data' folder. See the README for a full description of what this code does.

Code for calling the APIs is based on [this example notebook](https://drive.google.com/file/d/1Z8DqXpHmNUJ3RD7e-OOwx2WYJPIYjUPp/view?usp=sharing) and [this example notebook](https://drive.google.com/file/d/1rZdBrtCe9XO4IkDWqm0A2RA-HfZCsqHh/view?usp=sharing).

*Note: this code was last run on 10/13/2022, and the results may change if it is run at a later date.*

## Set Up

### Preprocessing

The file 'raw_data/politicians_by_country.csv' was downloaded from [this file](https://docs.google.com/spreadsheets/u/0/d/1Y4vSTYENgNE5KltqKZqnRQQBQZN5c8uKbSM4QTt8QGg/edit). The file 'raw_data/population_by_country.csv' was downloaded from [this file](https://docs.google.com/spreadsheets/u/0/d/1POuZDfA1sRooBq9e1RNukxyzHZZ-nQ2r6H5NcXhsMPU/edit). No changes were made to either file besides renaming it.

### Import Packages

In [1]:
import json, time, urllib.parse
import requests
import pandas as pd
import numpy as np

### Set Constants

*Constants are variables that will not be changed later in the script.*

The basic English Wikipedia API endpoint

In [2]:
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

Throttling for requests

In [3]:
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED
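
As a quick check of the arithmetic, the sketch below recomputes the wait time from the two constants above and derives the request rate it implies. The names `seconds_per_request` and `max_requests_per_second` are introduced here for illustration only, and the 2 ms latency is the same assumption the constants already make.

```python
# Values copied from the constants above.
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

# Time budget per request: the sleep plus the assumed network latency.
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

print(f"sleep per request: {API_THROTTLE_WAIT:.3f} s")
print(f"implied upper bound: about {max_requests_per_second:.0f} requests per second")
```

If the real latency is higher than assumed, the effective rate is simply lower, so the throttle errs on the side of fewer requests.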

Unique ID for requests  
**NOTE: this should be replaced with your own email and usage information**

In [4]:
REQUEST_HEADERS = {
    'User-Agent': '<klein324@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

Additional page properties that can be returned - see the [Info documentation](https://www.mediawiki.org/wiki/API:Info) for what can be included. This analysis does not need any additional properties, so an empty string is used instead.

In [5]:
#PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
PAGEINFO_EXTENDED_PROPERTIES = ""

Template with basic parameters for making a PageInfo request

In [6]:
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
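
To make the template concrete, here is a minimal sketch of the URL that such a request would roughly correspond to. It does not contact the API. `build_pageinfo_url` is a hypothetical helper that is not part of this script, and "Ada Lovelace" is just an example title.

```python
import urllib.parse

API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "",
}

def build_pageinfo_url(article_title):
    """Hypothetical helper: build the query URL for one article title."""
    params = dict(PAGEINFO_PARAMS_TEMPLATE)  # copy so the template is not mutated
    params["titles"] = article_title
    return API_ENWIKIPEDIA_ENDPOINT + "?" + urllib.parse.urlencode(params)

example_url = build_pageinfo_url("Ada Lovelace")
print(example_url)
```

In the script itself the parameters are passed to `requests.get` as a dictionary, so the URL never has to be assembled by hand; this sketch only shows what the parameters mean.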

The current ORES API endpoint

In [7]:
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"

Template for mapping to the ORES URL

In [8]:
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

Template with basic parameters for making an ORES request

In [9]:
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
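
The sketch below shows how the endpoint, the URL template, and the parameter template combine into a single request URL. The revision ID is made up for illustration, and no request is sent.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",
    "revid": "",
    "model": "articlequality",
}

params = dict(ORES_PARAMS_TEMPLATE)  # work on a copy of the template
params["revid"] = 123456789          # made-up revision ID
ores_url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(**params)
print(ores_url)
```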

## Functions

### Function to request PageInfo for 1 article

This function takes inputs of all the information needed to access the PageInfo API and outputs the json response unmodified.

Inputs:
- article_title: the title of a Wikipedia article
- endpoint_url: the MediaWiki API URL
- request_template: a template with the parameter values for the request
- headers: a unique ID for the request

Output:
- a json response or None if an exception is thrown

In [10]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    # copy the template so the shared default dictionary is not modified between calls
    request_params = dict(request_template)
    request_params['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

### Function to request ORES score for 1 article

This function takes inputs of all the information needed to access the ORES API and outputs the json response unmodified.

Inputs:
- article_revid: the revision ID for the version of a Wikipedia article
- endpoint_url: the ORES API URL
- endpoint_params: a parameterized string for mapping the request template to the ORES URL
- request_template: a template with the parameter values for the request
- headers: a unique ID for the request
- features: additional features to add to the request URL

Output:
- a json response or None if an exception is thrown

In [11]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into a copy of the template, so the shared default dictionary is not modified
    request_params = dict(request_template)
    request_params['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

### Function to take a dataset and population to return article coverage

This function takes a dataset of articles as well as a population dataset and returns the coverage of articles for a specific country or region.

Inputs:
- data: a dataset of articles
    - article title should be in a column called 'name'
    - dataset should also include a column with the region or country
    - the dataset should be filtered to a specific region or country
- pop_df: a dataset of population counts in millions
    - region/country should be in a column called 'Geography'
    - the region/country the dataset in the first parameter is filtered to should be included in this dataset
- column: a string specifying the column name the region or country name is in
    - default is 'country'

Output:
- a numeric value: the ratio of articles to population (articles per capita), or NaN if the population is 0

In [12]:
def country_region_ratio(data, pop_df, column="country"):
    country_region = data[column].iloc[0]
    num_articles = data['name'].count()
    population = pop_df[pop_df['Geography']==country_region]['Population (millions)'].iloc[0]*1000000
    if population == 0:
        return(np.nan)
    ratio = num_articles/population
    return(ratio)
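
A small usage sketch may help here. The function is repeated so the example runs on its own, and the two toy dataframes (with the made-up places "Examplia" and "Emptyland") are invented purely for illustration.

```python
import numpy as np
import pandas as pd

def country_region_ratio(data, pop_df, column="country"):
    # same logic as the function above
    country_region = data[column].iloc[0]
    num_articles = data['name'].count()
    population = pop_df[pop_df['Geography'] == country_region]['Population (millions)'].iloc[0] * 1000000
    if population == 0:
        return np.nan
    return num_articles / population

# Made-up data: two articles for a country of half a million people.
toy_articles = pd.DataFrame({"name": ["A", "B"], "country": ["Examplia", "Examplia"]})
toy_population = pd.DataFrame({
    "Geography": ["Examplia", "Emptyland"],
    "Population (millions)": [0.5, 0.0],
})

ratio = country_region_ratio(toy_articles, toy_population)
print(ratio)  # 2 articles / 500,000 people

# A population of 0 yields NaN instead of a division error.
empty_articles = pd.DataFrame({"name": ["C"], "country": ["Emptyland"]})
empty_ratio = country_region_ratio(empty_articles, toy_population)
print(empty_ratio)
```

Note that the function raises an `IndexError` if the country or region is missing from `pop_df`, which is why the input is expected to contain it.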

## Retrieve and Process Data

### Load Input Files 

In [13]:
politicians = pd.read_csv('../raw_data/politicians_by_country.csv')
population = pd.read_csv('../raw_data/population_by_country.csv')

### Add 'Korea' as Country

Change 'Korean' to 'Korea' in politicians dataset

In [14]:
politicians['country'] = ['Korea' if x=='Korean' else x for x in politicians['country']]

Add 'Korea' to population dataset - population is sum of North and South Korea

In [15]:
# filter to North and South Korea
korea_filt = population[population['Geography'].str.contains('Korea')]
# find the index to put the new row at
idx = max(korea_filt.index) + 1
# calculate the population for Korea
pop = sum(korea_filt["Population (millions)"])
# create the new row
korea = pd.DataFrame({"Geography": 'Korea', "Population (millions)": pop}, index=[idx])
# insert the new row at the correct index
population = pd.concat([population.iloc[:idx], korea, population.iloc[idx:]]).reset_index(drop=True)
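
The same steps can be tried on a toy dataframe to see where the new row lands. The geography names and population figures below are made up and are not taken from the real input file; the sketch also assumes the default integer index that `pd.read_csv` produces, since `.iloc` slices by position.

```python
import pandas as pd

# Made-up stand-in for the population dataset.
toy_population = pd.DataFrame({
    "Geography": ["EAST ASIA", "Korea, North", "Korea, South", "Japan"],
    "Population (millions)": [0.0, 26.0, 52.0, 125.0],
})

# Rows whose name mentions Korea, and the position just after the last one.
korea_filt = toy_population[toy_population["Geography"].str.contains("Korea")]
idx = max(korea_filt.index) + 1
pop = sum(korea_filt["Population (millions)"])

# Build the combined row and splice it in at that position.
korea = pd.DataFrame({"Geography": "Korea", "Population (millions)": pop}, index=[idx])
toy_population = pd.concat(
    [toy_population.iloc[:idx], korea, toy_population.iloc[idx:]]
).reset_index(drop=True)

print(toy_population)
```

Placing the row directly after the two Korea rows keeps it under the same region heading, which matters later when regions are assigned by position.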

### Get Article Revision ID and Quality Prediction

- This iterates through the politicians, gets the most recent revision ID for each article, and adds it to the politicians dataframe.
    - If this throws an error, the index is added to an array.
- It then uses the revision ID to get the predicted article quality and adds it to the politicians dataframe.
    - If this throws an error, the index is added to a second array.
- This takes about an hour. The number of seconds the process takes is printed after the code completes.

In [16]:
start = time.time()
# create columns for revision ID and prediction of article quality
politicians['revision_id'] = None
politicians['article_quality'] = None
# create arrays to store articles that throw errors
error_rid = []
error_qlt = []
# iterate through politicians
for index, row in politicians.iterrows():
    try:
        # get most recent revision ID and add to the dataframe
        info = request_pageinfo_per_article(row["name"])
        rid = [c['lastrevid'] for c in info["query"]["pages"].values()][0]
        # .at writes to the dataframe itself; assigning through .iloc[index][...] can silently write to a copy
        politicians.at[index, 'revision_id'] = rid
        try:
            # get ORES score and add to the dataframe
            score = request_ores_score_per_article(rid)
            prediction = score["enwiki"]["scores"][str(rid)]["articlequality"]["score"]["prediction"]
            politicians.at[index, 'article_quality'] = prediction
        except Exception:
            # add articles able to get revision ID but not ORES score to array
            error_qlt.append(index)
    except Exception:
        # add articles unable to get revision ID to array
        error_rid.append(index)
print(time.time()-start)

2974.440201997757
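
To illustrate the two lookups in the loop without calling either API, the sketch below parses hand-written sample responses. The samples are made up and heavily trimmed; their shape follows the keys the loop reads, and the "missing page" shape is an assumption about what the API returns for a title with no article. `extract_lastrevid` and `extract_prediction` are hypothetical helpers, not functions from this script.

```python
# Made-up sample shaped like a PageInfo response for an existing page.
sample_pageinfo = {
    "query": {"pages": {"1001": {"pageid": 1001, "title": "Example Person",
                                 "lastrevid": 123456789}}}
}
# Assumed shape for a title with no article: the page entry has no 'lastrevid'.
sample_missing = {
    "query": {"pages": {"-1": {"title": "No Such Person", "missing": ""}}}
}
# Made-up sample shaped like an ORES articlequality response.
sample_ores = {
    "enwiki": {"scores": {"123456789": {"articlequality": {"score": {"prediction": "Stub"}}}}}
}

def extract_lastrevid(pageinfo_response):
    """Hypothetical helper: latest revision ID of the first page, or None."""
    pages = pageinfo_response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("lastrevid")  # None when the key is absent
    return None

def extract_prediction(ores_response, revid):
    """Hypothetical helper: predicted quality class, or None if unavailable."""
    try:
        return ores_response["enwiki"]["scores"][str(revid)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        return None

rid = extract_lastrevid(sample_pageinfo)
print(rid, extract_prediction(sample_ores, rid))
print(extract_lastrevid(sample_missing))
```

Returning `None` instead of raising is one possible design; the loop above takes the other route and records the failing index in `error_rid` or `error_qlt`.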


No revision ID could be retrieved for the following articles

In [17]:
politicians.loc[error_rid]

Unnamed: 0,name,url,country,revision_id,article_quality
2446,Prince Ofosu Sefah,https://en.wikipedia.org/wiki/Prince_Ofosu_Sefah,Ghana,,
2985,Harjit Kaur Talwandi,https://en.wikipedia.org/wiki/Harjit_Kaur_Talw...,India,,
3212,Abd al-Razzaq al-Hasani,https://en.wikipedia.org/wiki/'Abd_al-Razzaq_a...,Iraq,,
4865,Abiodun Abimbola Orekoya,https://en.wikipedia.org/wiki/Abiodun_Abimbola...,Nigeria,,
4879,Segun “Aeroland” Adewale,https://en.wikipedia.org/wiki/Segun_”Aeroland”...,Nigeria,,
5801,Roman Konoplev,https://en.wikipedia.org/wiki/Roman_Konoplev,Russia,,
6344,Nhlanhla “Lux” Dlamini,https://en.wikipedia.org/wiki/Nhlanhla_”Lux”_D...,South Africa,,


A revision ID was retrieved for the following articles, but no ORES score could be retrieved

In [18]:
politicians.loc[error_qlt]

Unnamed: 0,name,url,country,revision_id,article_quality


### Combine with Population

Add a 'region' column to the population dataframe
- if the region/country name is all caps, the row is a region, so the value is set to the region name
- otherwise, the row is a country, so the value is set to None

In [19]:
population['region'] = [x if x.isupper() else None for x in population['Geography']]

Fill each row that has 'None' in the region column with the region of the closest previous row that is not 'None'

In [20]:
population['region'] = population['region'].ffill()
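
The two region steps can be seen on a toy dataframe. The rows below are illustrative only; the approach relies on the population file listing each all-caps region name directly above its countries.

```python
import pandas as pd

# Illustrative rows: region headings in all caps, followed by their countries.
toy = pd.DataFrame({
    "Geography": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"]
})

# Step 1: keep the name only on region rows.
toy["region"] = [x if x.isupper() else None for x in toy["Geography"]]
# Step 2: carry the most recent region name down to the country rows below it.
toy["region"] = toy["region"].ffill()

print(toy)
```

Region rows end up with their own name in the 'region' column, so they can still be looked up by name when region-level populations are needed.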

Merge the politicians and population dataframes on country

In [21]:
pol_pop = politicians.merge(population, left_on="country", right_on="Geography")
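
Because `merge` defaults to an inner join, rows whose country has no match in the other dataframe are dropped. The sketch below uses made-up data to show that effect, and uses an outer join with `indicator=True` as one way to inspect what did not match; the names `pol`, `pop_toy`, `inner`, `outer`, and `unmatched` are illustrative.

```python
import pandas as pd

# Made-up data: one country matches, one appears on each side only.
pol = pd.DataFrame({"name": ["P1", "P2"], "country": ["Examplia", "Nowhere"]})
pop_toy = pd.DataFrame({
    "Geography": ["Examplia", "Otherland"],
    "Population (millions)": [1.0, 2.0],
})

# Default inner join: only rows with a match on both sides survive.
inner = pol.merge(pop_toy, left_on="country", right_on="Geography")

# Outer join with an indicator column to see which side each row came from.
outer = pol.merge(pop_toy, left_on="country", right_on="Geography",
                  how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner), len(unmatched))
```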

Select relevant columns

In [22]:
pol_pop = pol_pop.loc[:, ['country', 'region', 'Population (millions)', 'name', 'revision_id', 'article_quality']]

Save data to 'processed_data' folder

In [23]:
pol_pop.to_csv('../processed_data/wp_politicians_by_country.csv', index=False)

### Find Non-Matching Countries

Get the set of unique countries in politicians dataset

In [24]:
pol_set = set(politicians.country)

Get the set of unique countries in population dataset - remove regions before getting set

In [25]:
pop_set = set(population[~population['Geography'].str.isupper()].Geography)

Get the countries in the politicians set but not the population set

In [26]:
only_pol = pol_set.difference(pop_set)

Get the countries in the population set but not the politicians set

In [27]:
only_pop = pop_set.difference(pol_set)

Combine the two sets to get a full list of what countries are not in both datasets

In [28]:
no_match = only_pol.union(only_pop)

Save this list to a txt file in the 'processed_data' folder

In [29]:
with open('../processed_data/wp_countries-no_match.txt', 'w') as f:
    for country in no_match:
        f.write(country + "\n")
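
The set logic above can be checked on toy sets. The sketch uses made-up country names and shows that the union of the two differences equals the symmetric difference; sorting before writing is an optional variation that is not used in this script, since sets have no fixed iteration order.

```python
# Made-up country names for illustration.
pol_set = {"Examplia", "Nowhere"}
pop_set = {"Examplia", "Otherland"}

only_pol = pol_set.difference(pop_set)   # in politicians only
only_pop = pop_set.difference(pol_set)   # in population only
no_match = only_pol.union(only_pop)

# Equivalent one-step form.
same_result = pol_set.symmetric_difference(pop_set)

# Optional: a sorted list gives the same file contents on every run.
no_match_sorted = sorted(no_match)
print(no_match_sorted)
```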

## Analysis

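If you do not want to run the 'Retrieve and Process Data' section yourself, the only code you need to run before this 'Analysis' section is the following:
- [Import Packages](#Import-Packages) at the top of the script
- the [last function cell](#Function-to-take-a-dataset-and-population-to-return-article-coverage)
- the cell below - just remove the '#' in order to run the code
- after the cell below, the cell under 'Add Korea as Country' that adds the combined 'Korea' row to `population` - the raw population file has no 'Korea' row, and `country_region_ratio` looks each country up in `population`, so this is needed if `pol_pop` contains 'Korea' rows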

In [30]:
#pol_pop = pd.read_csv('../processed_data/wp_politicians_by_country.csv')
#population = pd.read_csv('../raw_data/population_by_country.csv')

### Calculate Ratios

For each country, calculate the ratio of all articles to population

In [31]:
country_articles = pol_pop.groupby(['country']).apply(lambda x: country_region_ratio(x, population))

For each region, calculate the ratio of all articles to population

In [32]:
region_articles = pol_pop.groupby(['region']).apply(lambda x: country_region_ratio(x, population, column="region"))

For each country, calculate the ratio of high quality articles (featured article or good article) to population

In [33]:
country_quality = pol_pop[pol_pop['article_quality'].isin(['FA', 'GA'])].groupby(['country']).apply(lambda x: country_region_ratio(x, population))

For each region, calculate the ratio of high quality articles (featured article or good article) to population

In [34]:
region_quality = pol_pop[pol_pop['article_quality'].isin(['FA', 'GA'])].groupby(['region']).apply(lambda x: country_region_ratio(x, population, column="region"))

### Create Tables

Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order)

In [35]:
pd.set_option('precision', 6)
country_articles.sort_values(ascending=False).head(10).to_frame(name='articles per capita').reset_index().style.hide_index()

country,articles per capita
Antigua and Barbuda,0.00017
Federated States of Micronesia,0.00013
Andorra,0.0001
Barbados,9.3e-05
Marshall Islands,9e-05
Montenegro,6e-05
Seychelles,6e-05
Luxembourg,5.3e-05
Bhutan,5.1e-05
Grenada,5e-05


Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order)

In [36]:
pd.set_option('precision', 10)
country_articles.sort_values(ascending=True).head(10).to_frame(name='articles per capita').reset_index().style.hide_index()

country,articles per capita
China,1.4e-09
Mexico,7.8e-09
Saudi Arabia,8.17e-08
Romania,1.053e-07
India,1.263e-07
Sri Lanka,1.339e-07
Egypt,1.353e-07
Ethiopia,2.026e-07
Taiwan,2.155e-07
Vietnam,2.716e-07


Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

In [37]:
pd.set_option('precision', 8)
country_quality.sort_values(ascending=False).head(10).to_frame(name='high quality articles per capita').reset_index().style.hide_index()

country,high quality articles per capita
Andorra,2e-05
Montenegro,5e-06
Albania,2.14e-06
Suriname,1.67e-06
Bosnia-Herzegovina,1.47e-06
Lithuania,1.07e-06
Croatia,1.05e-06
Slovenia,9.5e-07
Palestinian Territory,9.3e-07
Gabon,8.3e-07


Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order)

In [38]:
pd.set_option('precision', 10)
country_quality.sort_values(ascending=True).head(10).to_frame(name='high quality articles per capita').reset_index().style.hide_index()

country,high quality articles per capita
India,4.2e-09
Thailand,1.5e-08
Japan,1.6e-08
Nigeria,1.83e-08
Vietnam,2.01e-08
Colombia,2.04e-08
Uganda,2.12e-08
Pakistan,2.12e-08
Sudan,2.13e-08
Iran,2.26e-08


Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita

In [39]:
pd.set_option('precision', 8)
region_articles.sort_values(ascending=False).to_frame(name='articles per capita').reset_index().style.hide_index()

region,articles per capita
SOUTHERN EUROPE,5.89e-06
CARIBBEAN,4.57e-06
WESTERN EUROPE,3.55e-06
EASTERN EUROPE,2.56e-06
NORTHERN EUROPE,2.45e-06
WESTERN ASIA,2.34e-06
OCEANIA,1.95e-06
SOUTHERN AFRICA,1.71e-06
EASTERN AFRICA,1.37e-06
CENTRAL ASIA,1.36e-06


Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita

In [40]:
pd.set_option('precision', 9)
region_quality.sort_values(ascending=False).to_frame(name='high quality articles per capita').reset_index().style.hide_index()

region,high quality articles per capita
SOUTHERN EUROPE,3.05e-07
CARIBBEAN,1.82e-07
EASTERN EUROPE,1.32e-07
WESTERN EUROPE,1.12e-07
WESTERN ASIA,9.5e-08
NORTHERN EUROPE,7.5e-08
SOUTHERN AFRICA,5.8e-08
CENTRAL AMERICA,5.6e-08
OCEANIA,4.5e-08
CENTRAL ASIA,3.8e-08
