# Analyzing Bias in Data
This notebook contains the code used to assemble the datasets needed to examine total articles per capita and high-quality articles per capita for Wikipedia articles about politicians, along with the corresponding analysis.

## Import Necessary Packages
To make the API calls we import `requests`. To handle the API responses we use a combination of `pandas` and `polars`, converting them to dataframe objects. `rapidfuzz` provides the fuzzy string matching used later to reconcile country names.

In [72]:
# Imports
import json, time, urllib.parse
import requests
import polars as pl
import pandas as pd
from rapidfuzz import process, fuzz

## Constants
The following code was provided by Dr. David McDonald from the University of Washington under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.

In [41]:
# ORES CONSTANTS:
#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED_ORES = 0.002     # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT_ORES = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED_ORES  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required to be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

# WP Constants:
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<tliu2@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned - see the Info documentation for
# what can be included. If you don't want any, this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
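
The two throttle waits defined above follow from simple rate arithmetic. The sketch below restates that arithmetic with hypothetical lowercase names (`ores_wait`, `pageinfo_wait`, `n_requests`) so the original constants are not overwritten; the figure of 7,000 requests is a round approximation of the article count used later in this notebook.

```python
# Worked arithmetic for the throttle constants (names here are illustrative only).
assumed_latency = 0.002  # seconds, the same 2 ms assumption used in the constants

# ORES / LiftWing: 5000 requests per hour -> 3600 s / 5000 = 0.72 s per request.
ores_wait = (60.0 * 60.0) / 5000.0 - assumed_latency

# Wikipedia page info: 100 requests per second -> 0.01 s per request.
pageinfo_wait = (1.0 / 100.0) - assumed_latency

print(f"ORES wait per request:      {ores_wait:.3f} s")
print(f"Page info wait per request: {pageinfo_wait:.3f} s")

# Rough time spent sleeping for about 7,000 ORES requests.
n_requests = 7000
print(f"About {n_requests * ores_wait / 60:.0f} minutes of throttle waits for {n_requests} ORES calls")
```

At roughly 0.718 seconds per ORES request, a full pull of about 7,000 revisions spends well over an hour in throttle waits alone, which is why the results are stored in intermediate files.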

## API Keys Manager
The following code was provided by Dr. David McDonald from the University of Washington under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.

In [42]:
# API Key
from apikeys.KeyManager import KeyManager
keyman = KeyManager()

USERNAME = "cakeymoon"
key_info = keyman.findRecord(USERNAME,API_ORES_LIFTWING_ENDPOINT)
ACCESS_TOKEN = key_info[0]['key']
print(key_info[0]['description'])


## Functions
The functions `request_ores_score_per_article` and `request_pageinfo_per_article` were provided by Dr. David McDonald from the University of Washington under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license.

In [103]:
# Functions:
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT_ORES > 0.0:
            time.sleep(API_THROTTLE_WAIT_ORES)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response



def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


def fuzzy_match(country, population_country, scorer=fuzz.ratio, threshold=85):
    # Does a fuzzy string match on country-region names in case of minor naming inconsistencies.
    match = process.extractOne(country, population_country, scorer=scorer)
    if match and match[1] >= threshold:
        return match[0]  # Return the matched country name
    return country  # Return the original name if no good match found
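
To illustrate how the threshold in `fuzzy_match` behaves, the sketch below uses the standard library's `difflib` as a stand-in for `rapidfuzz`, so it runs without extra packages. The scores from `difflib` are not guaranteed to equal `fuzz.ratio`, and `similarity` and `fuzzy_match_sketch` are hypothetical names. The country names are taken from the manual mapping used later in this notebook.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (a stand-in for rapidfuzz's fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def fuzzy_match_sketch(name: str, candidates: list, threshold: float = 85) -> str:
    """Return the closest candidate if it clears the threshold, else the original name."""
    if not candidates:
        return name
    best = max(candidates, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else name

population_names = ["Timor-Leste", "Burma", "Korea (South)"]

# A punctuation difference scores high enough to be matched.
print(fuzzy_match_sketch("Timor Leste", population_names))

# A different name for the same country does not, so it is returned unchanged.
print(fuzzy_match_sketch("Myanmar", population_names))
```

Names that differ only in punctuation clear the threshold, while alternate names such as Myanmar and Burma do not, which is why a manual mapping is still needed for those cases.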

## Intermediary File Assembly
To analyze our data, we need to pull page details from the Wikimedia API and article quality predictions from the ORES API.
There are just over 7,000 pages in our list, so we call the APIs once and store the results in intermediate CSV files that we can reload later.

We first load the files that contain the pages and regions we are interested in.
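
The "call once, then reload" idea can be sketched as a small helper that only runs the expensive pull when the intermediate file is missing. This is an illustrative pattern rather than the code used in this notebook; `load_or_build` and `fake_api_pull` are hypothetical names, and the sample row reuses a title and revision ID that appear in the outputs further down.

```python
import csv
import os
import tempfile

def load_or_build(path, build_rows, fieldnames):
    """Return rows from the CSV at `path`, calling `build_rows()` only if the file is missing."""
    if not os.path.exists(path):
        rows = build_rows()
        with open(path, "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    # Always read back from disk so both paths return the same (string) values.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

calls = []

def fake_api_pull():
    """Stand-in for the slow API loop; records how many times it ran."""
    calls.append(1)
    return [{"title": "A. Wahab", "lastrevid": 1213828687}]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "page_data.csv")
    first = load_or_build(cache, fake_api_pull, ["title", "lastrevid"])
    second = load_or_build(cache, fake_api_pull, ["title", "lastrevid"])

print(len(calls), first == second)
```

The second call reads the cached file, so the stand-in pull runs only once.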

In [105]:
polit_df = pl.read_csv('./data/politicians_by_country_AUG.2024.csv')
pop_df = pl.read_csv('./data/population_by_country_AUG.2024.csv')

print(pop_df.head(n=5))

shape: (5, 2)
┌─────────────────┬────────────┐
│ Geography       ┆ Population │
│ ---             ┆ ---        │
│ str             ┆ f64        │
╞═════════════════╪════════════╡
│ WORLD           ┆ 8009.0     │
│ AFRICA          ┆ 1453.0     │
│ NORTHERN AFRICA ┆ 256.0      │
│ Algeria         ┆ 46.8       │
│ Egypt           ┆ 105.2      │
└─────────────────┴────────────┘


We assemble the dataset that contains the page details for each politician.
First we get a list of all the unique politicians in our source data and pass each person one-by-one to the Wikimedia API.
We then take what the API returns and form it into a dataframe.
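
As a sketch of what each response looks like, the page details sit under `query.pages`, keyed by page ID, and a title with no article comes back under the key `-1`. The helper name `extract_page` is hypothetical, the `found` sample reuses values from the output below, and the `missing` sample uses a made-up title.

```python
def extract_page(response):
    """Return the fields needed later, or None when the API reports a missing page."""
    pages = response["query"]["pages"]
    page_id, page_info = next(iter(pages.items()))  # one title per request -> one entry
    if page_id == "-1":
        return None
    return {
        "pageid": page_info["pageid"],
        "title": page_info["title"],
        "lastrevid": page_info["lastrevid"],
    }

found = {"query": {"pages": {"34742655": {
    "pageid": 34742655, "ns": 0,
    "title": "'Abd al-Razzaq al-Hasani", "lastrevid": 1238643436,
}}}}
missing = {"query": {"pages": {"-1": {
    "ns": 0, "title": "No Such Politician", "missing": "",  # made-up title
}}}}

print(extract_page(found))
print(extract_page(missing))
```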

In [11]:
# Get a list of unique politicians to pull page data on.
politician_list = polit_df.unique(subset=['name']).select('name').sort(by='name').to_series().to_list()

df_list = []
counter = 1
invalid_counter = 0
invalid_articles = []
total_politicians = len(politician_list)

# Iterate through each person in the list and call the API.
for person in politician_list:
    print(f'At article: {person}, {counter} / {total_politicians} articles.')
    json_dump = request_pageinfo_per_article(person)
    temp_data = json_dump['query']['pages']
    page_id = list(temp_data.keys())[0]  # Get the page id
    print(f'Page ID for {person}: {page_id}')
    if page_id == '-1':  # The API reports a missing page under the key '-1'.
        invalid_counter += 1
        invalid_articles.append(person)
    else:
        page_info = temp_data[page_id]
    
        df = pd.DataFrame({
            "pageid": [page_info["pageid"]],
            "ns": [page_info["ns"]],
            "title": [page_info["title"]],
            "contentmodel": [page_info["contentmodel"]],
            "pagelanguage": [page_info["pagelanguage"]],
            "pagelanguagehtmlcode": [page_info["pagelanguagehtmlcode"]],
            "pagelanguagedir": [page_info["pagelanguagedir"]],
            "touched": [page_info["touched"]],
            "lastrevid": [page_info["lastrevid"]],
            "length": [page_info["length"]],
            "watchers": [page_info.get("watchers", None)], 
            "talkid": [page_info.get("talkid", None)],  
            "fullurl": [page_info["fullurl"]],
            "editurl": [page_info["editurl"]],
            "canonicalurl": [page_info["canonicalurl"]]
        })
        df_list.append(df)
    counter += 1

print(f'Number of invalid articles: {invalid_counter}')
print(f'Invalid Articles: {invalid_articles}')
final_df = pd.concat(df_list)

# Store the dataframe for future use.
final_df.to_csv('./politician_page_data.csv', index=False, encoding='utf-8-sig')
    

print(polit_df.head(n=5))
print(pop_df.head(n=5))

At article: 'Abd al-Razzaq al-Hasani, 1 / 7111 articles.
Page ID for 'Abd al-Razzaq al-Hasani: 34742655
At article: 8th National Assembly of Slovenia, 2 / 7111 articles.
Page ID for 8th National Assembly of Slovenia: 57710979
At article: A. J. R. de Soysa, 3 / 7111 articles.
Page ID for A. J. R. de Soysa: 52888071
At article: A. R. F. Webber, 4 / 7111 articles.
Page ID for A. R. F. Webber: 6040733
At article: A. Wahab, 5 / 7111 articles.
Page ID for A. Wahab: 51745938
At article: AKM Fazlul Haque (Jatiya Party politician), 6 / 7111 articles.
Page ID for AKM Fazlul Haque (Jatiya Party politician): 64433698
At article: AKM Fazlul Kabir Chowdhury, 7 / 7111 articles.
Page ID for AKM Fazlul Kabir Chowdhury: 54988209
At article: AKM Hafizuddin, 8 / 7111 articles.
Page ID for AKM Hafizuddin: 70349044
At article: Abai Tasbolatov, 9 / 7111 articles.
Page ID for Abai Tasbolatov: 64626956
At article: Abakar Sabone, 10 / 7111 articles.
Page ID for Abakar Sabone: 33422765
At article: Abang Muhammad

Using the page data we pulled previously, we get the revision IDs and pass them to the ORES API one by one.
We then transform each API response into a dataframe and store the combined result.
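
The transformation from one ORES response to one row can be sketched as follows. The nesting matches what the code in the next cell reads; `flatten_score` is a hypothetical name and the probabilities in `sample` are made-up values for illustration only.

```python
def flatten_score(score):
    """Flatten one ORES article quality response into a single flat row."""
    revid = next(iter(score["enwiki"]["scores"]))
    quality = score["enwiki"]["scores"][revid]["articlequality"]["score"]
    row = {"revid": revid, "prediction": quality["prediction"]}
    # One column per quality class, e.g. probability_FA, probability_GA, ...
    row.update({f"probability_{k}": v for k, v in quality["probability"].items()})
    return row

# Illustrative response; the probabilities are not real model output.
sample = {"enwiki": {"scores": {"1238643436": {"articlequality": {"score": {
    "prediction": "Start",
    "probability": {"B": 0.10, "C": 0.20, "FA": 0.01, "GA": 0.04, "Start": 0.60, "Stub": 0.05},
}}}}}}

row = flatten_score(sample)
print(row)
```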

In [55]:
# Load in the politician page data csv we generated in the previous step and select the relevant columns.
wp_page_data = pl.read_csv('./politician_page_data.csv', infer_schema_length=int(1e10))
wp_page_data = wp_page_data.select('pageid', 'title', 'lastrevid')
print(wp_page_data.head(n=5))

# Get a list of unique revids to pull from the ORES API.
revids = wp_page_data.select('lastrevid').unique().sort(by='lastrevid').to_series().to_list()
counter = 1
invalid_counter = 0
invalid_articles = []
df_list = []
total_ids = len(revids)

# Iterate through the list of revids.
for rev in revids:
    print(f'At id: {rev}, {counter} / {total_ids} ids.')

    while True:
        # Request a score; a failed request returns None or a dict without 'enwiki'.
        score = request_ores_score_per_article(article_revid=rev,
                                               email_address="tliu2@uw.edu",
                                               access_token=ACCESS_TOKEN)
        try:
            revid = list(score["enwiki"]["scores"].keys())[0]
            break  # If it works, break the while loop.
        except (KeyError, TypeError):  # No 'enwiki' key (or a None response) means the request failed.
            code = score.get('httpCode') if score else None
            print(f'failed revid: {rev}')
            print(score)
            print('retrying...')
            # Keep trying until it works; wait longer when rate limited (HTTP 429).
            if code == 429:
                time.sleep(10)
            else:
                time.sleep(5)

    
    # Transform the data into a dataframe format.
    articlequality = score["enwiki"]["scores"][revid]["articlequality"]["score"]
    prediction = articlequality["prediction"]
    probabilities = articlequality["probability"]
    data = {
        "revid": [revid],
        "prediction": [prediction],
        **{f"probability_{k}": [v] for k, v in probabilities.items()}
    }
    
    df = pd.DataFrame(data)
    df_list.append(df)
    counter += 1
    
final = pd.concat(df_list)
# Output it - use utf-8 as there are some special characters.
final.to_csv('./articlequality.csv', index=False, encoding='utf-8-sig')


shape: (5, 3)
┌──────────┬─────────────────────────────────┬────────────┐
│ pageid   ┆ title                           ┆ lastrevid  │
│ ---      ┆ ---                             ┆ ---        │
│ i64      ┆ str                             ┆ i64        │
╞══════════╪═════════════════════════════════╪════════════╡
│ 34742655 ┆ 'Abd al-Razzaq al-Hasani        ┆ 1238643436 │
│ 57710979 ┆ 8th National Assembly of Slove… ┆ 1249160149 │
│ 52888071 ┆ A. J. R. de Soysa               ┆ 1187003494 │
│ 6040733  ┆ A. R. F. Webber                 ┆ 1156355955 │
│ 51745938 ┆ A. Wahab                        ┆ 1213828687 │
└──────────┴─────────────────────────────────┴────────────┘
At id: 395521877, 1 / 7103 ids.
At id: 696310078, 2 / 7103 ids.
At id: 719237570, 3 / 7103 ids.
At id: 747227236, 4 / 7103 ids.
At id: 797826672, 5 / 7103 ids.
At id: 800277980, 6 / 7103 ids.
At id: 825090349, 7 / 7103 ids.
At id: 826516065, 8 / 7103 ids.
At id: 867840089, 9 / 7103 ids.
At id: 872461806, 10 / 7103 ids.
At id
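
The retry loop in the cell above waits 10 seconds after an HTTP 429 and 5 seconds otherwise, and repeats until the request succeeds. A bounded variant with exponential backoff is sketched below. `call_with_retries`, `max_attempts`, and `base_wait` are hypothetical names, and `fetch` stands in for a call to `request_ores_score_per_article`.

```python
import time

def call_with_retries(fetch, max_attempts=5, base_wait=1.0, sleep=time.sleep):
    """Call fetch() until it returns a dict containing 'enwiki', or give up after max_attempts."""
    for attempt in range(max_attempts):
        result = fetch()
        if isinstance(result, dict) and "enwiki" in result:
            return result
        if attempt < max_attempts - 1:
            sleep(base_wait * 2 ** attempt)  # wait 1x, 2x, 4x, ... the base wait
    return None

# Simulated responses: a rate-limit error, a failed request, then a success.
responses = iter([{"httpCode": 429}, None, {"enwiki": {"scores": {}}}])
waits = []  # record the waits instead of actually sleeping
result = call_with_retries(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Capping the number of attempts trades completeness for a guarantee that the loop ends; revisions that return `None` would need to be logged and retried separately.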

## Analysis - Preprocess
Now that we have the data we need from the APIs, we must create the merged dataset which has columns:
- country
- region
- population
- article_title
- revision_id
- article_quality

We'll load in our intermediate files we generated and pull the columns we need from each and merge on article title.
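
One way a `region` value could be attached to each country is sketched below. It assumes that every upper-case `Geography` value in the population file (such as `NORTHERN AFRICA` in the preview printed earlier) is a region heading for the countries listed beneath it; `attach_regions` is a hypothetical helper and is not part of the merge code in this notebook.

```python
def attach_regions(rows):
    """Forward-fill the most recent upper-case heading onto the country rows below it."""
    current_region = None
    countries = []
    for geography, population in rows:
        if geography.isupper():
            current_region = geography  # a region (or continent) heading
        else:
            countries.append({
                "country": geography,
                "region": current_region,
                "population": population,
            })
    return countries

# The first five rows of the population file, in file order.
rows = [
    ("WORLD", 8009.0),
    ("AFRICA", 1453.0),
    ("NORTHERN AFRICA", 256.0),
    ("Algeria", 46.8),
    ("Egypt", 105.2),
]

for record in attach_regions(rows):
    print(record)
```

Under that assumption, both Algeria and Egypt are labelled with `NORTHERN AFRICA`, the closest heading above them.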

In the original source files for population and politicians by country, the same country or region is not always named consistently.
For example, South Korea may appear as `Korea, South` or `Korea (South)`. To merge these sets properly, we handle known cases with a manual mapping and then fuzzy match the remaining names against the population file.


In [106]:
manual_mapping = {
    'Korea, South': 'Korea (South)',
    'Korea, North': 'Korea (North)',
    'Bosnia Herzegovina': 'Bosnia and Herzegovina',
    'Timor Leste': 'Timor-Leste',
    "Cote d'Ivoire": 'Côte d’Ivoire',
    'Myanmar': 'Burma'
}
polit_df = polit_df.with_columns(
    pl.col('country').replace_strict(manual_mapping, default=pl.col('country'))
)
# Load in page data.
polit_page_df = pl.read_csv('./data/politician_page_data.csv', infer_schema_length=int(1e10)).rename({'lastrevid': 'revid'})

# Remap the polit source data to fuzzy match with the population source data's geography column.
pop_countries = pop_df.select('Geography').unique().to_series().to_list()
polit_df = polit_df.to_pandas()
polit_df['country'] = polit_df['country'].apply(
    lambda x: fuzzy_match(x, pop_countries)
)
polit_df = pl.from_pandas(polit_df)

quality_df = pl.read_csv('./data/articlequality.csv', infer_schema_length=int(1e10))

# Merge page info with ORES data.
merged_df = polit_page_df.join(
    quality_df, on=['revid'], how='full', coalesce=True
)

merged_df = merged_df.select('pageid', 'title', 'revid', 'prediction')

# Merge with the original source data on pop and polit by country
merged_df = merged_df.join(
    polit_df.rename({'name': 'title'}), on=['title'], how='full', coalesce=True
)
merged_df = merged_df.join(
    pop_df, left_on=['country'], right_on=['Geography'], how='full'
)

merged_df = merged_df.select('country', 'Geography', 'Population', 'title', 'revid', 'prediction')
merged_df = merged_df.rename({
    'Geography': 'region', 'title': 'article_title', 'revid': 'revision_id', 'prediction': 'article_quality', 'Population': 'population'
})

print(merged_df.sort(by='article_title').head(n=5))
print(merged_df.sort(by='article_title', nulls_last=True).head(n=5))


shape: (5, 6)
┌─────────┬─────────────────┬────────────┬───────────────┬─────────────┬─────────────────┐
│ country ┆ region          ┆ population ┆ article_title ┆ revision_id ┆ article_quality │
│ ---     ┆ ---             ┆ ---        ┆ ---           ┆ ---         ┆ ---             │
│ str     ┆ str             ┆ f64        ┆ str           ┆ i64         ┆ str             │
╞═════════╪═════════════════╪════════════╪═══════════════╪═════════════╪═════════════════╡
│ null    ┆ CENTRAL ASIA    ┆ 80.0       ┆ null          ┆ null        ┆ null            │
│ null    ┆ Iceland         ┆ 0.4        ┆ null          ┆ null        ┆ null            │
│ null    ┆ SOUTHERN EUROPE ┆ 152.0      ┆ null          ┆ null        ┆ null            │
│ null    ┆ Mexico          ┆ 131.0      ┆ null          ┆ null        ┆ null            │
│ null    ┆ Fiji            ┆ 0.9        ┆ null          ┆ null        ┆ null            │
└─────────┴─────────────────┴────────────┴───────────────┴─────────────┴─────────────────┘

In [107]:
# Get no matches
no_matches = merged_df.filter(
    (pl.col('region').is_null()) | (pl.col('country').is_null())
)
no_matches = no_matches.with_columns(
    pl.col('country').fill_null(pl.col('region'))
).select('country').unique().to_series().sort().to_list()

with open('./data/wp_countries-no_match.txt', 'w', encoding='utf-8-sig') as f:
    for line in no_matches:
        f.write(f"{line}\n")
        
matches_df = merged_df.filter(
    (pl.col('region').is_not_null()) & (pl.col('country').is_not_null())
)
matches_df.write_csv('./data/wp_politicians_by_country.csv', include_bom=True)
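
The cell above keeps the rows where both `country` and `region` are present after the full joins, and writes every name that appears on only one side to `wp_countries-no_match.txt`. In set terms this is an intersection and a symmetric difference. The sketch below uses a few names from the outputs in this notebook plus one hypothetical entry; it is not the real file contents.

```python
# Names seen in the politician data (the last entry is hypothetical).
politician_countries = {"Egypt", "Montenegro", "Example-Only Land"}

# Names seen in the population data, including a region heading.
population_names = {"Egypt", "Montenegro", "Iceland", "CENTRAL ASIA"}

# Names present in both sources are kept for the analysis.
matched = politician_countries & population_names

# Names present in only one source go to the no-match list.
no_match = sorted(politician_countries ^ population_names)

print(sorted(matched))
print(no_match)
```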

## Analysis
With our final dataset assembled, we can begin our analysis.

We will specifically be looking for six things:
1. Top 10 countries by coverage.
    - The 10 countries with highest total articles per capita (in descending order).
2. Bottom 10 countries by coverage.
    - The 10 countries with lowest total articles per capita (in descending order).
3. Top 10 countries by high quality.
    - The 10 countries with highest high quality articles per capita (in descending order).
4. Bottom 10 countries by high quality.
    - The 10 countries with lowest high quality articles per capita (in descending order).
5. Geographic regions by total coverage.
    - A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. Geographic regions by high quality coverage.
    - Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

We define a "high quality" article as one whose `article_quality` is "FA" (featured article) or "GA" (good article).

The population dataset reports population in millions, and some countries are listed with a population of 0. For this analysis we remove these countries, since a per capita value cannot be calculated for a population of 0.
Examples of countries with a population of 0 are Monaco and Tuvalu.
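
As a worked example of the per capita calculation, the sketch below converts a population given in millions to a head count and divides the article count by it. `per_capita` is a hypothetical helper; the inputs are the article counts and populations for Antigua and Barbuda and Montenegro from the results tables in this notebook.

```python
def per_capita(count, population_millions):
    """Return count per person, or None when the listed population is 0."""
    if population_millions == 0:
        return None  # per capita is undefined, so these countries are dropped
    return count / (population_millions * 1_000_000)

# Antigua and Barbuda: 33 articles, population listed as 0.1 million.
antigua_total = per_capita(33, 0.1)

# Montenegro: 3 high quality articles, population listed as 0.6 million.
montenegro_high_quality = per_capita(3, 0.6)

print(f"{antigua_total:.5f}")
print(f"{montenegro_high_quality:.0e}")
print(per_capita(5, 0.0))
```

These reproduce the 0.00033 and 5e-06 values reported for those two countries.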


In [108]:
df = pl.read_csv('./data/wp_politicians_by_country.csv', infer_schema_length=int(1e10))
df = df.filter(pl.col('population') != 0)
high_quality = ['FA', 'GA']


In [109]:
# Per Capita Calculations
df = df.with_columns(
    pl.col('population') * 1_000_000
)

country_df = df.group_by(['country']).agg(
    pl.col('article_title').count().alias('total_articles'),
    pl.col('article_quality').is_in(high_quality).sum().alias('high_quality_articles'),
    pl.col('population').max().alias('population')
)

country_df = country_df.with_columns(
    (pl.col('total_articles') / pl.col('population')).alias('articles_per_capita'),
    (pl.col('high_quality_articles') / pl.col('population')).alias('high_quality_articles_per_capita'),
)

country_df = country_df.sort(by=['country'])

# 1. Top 10 countries by coverage
top10_coverage = country_df.top_k(10, by='articles_per_capita').sort(by='articles_per_capita', descending=True)
print('1. Top 10 Countries By Coverage')
display(top10_coverage)

# 2. Bottom 10 countries by coverage
bottom10_coverage = country_df.bottom_k(10, by='articles_per_capita').sort(by='articles_per_capita', descending=True)
print('2. Bottom 10 Countries By Coverage')
display(bottom10_coverage)

# 3. Top 10 countries by high quality
top10_quality = country_df.top_k(10, by='high_quality_articles_per_capita').sort(by='high_quality_articles_per_capita', descending=True)
print('3. Top 10 Countries By Quality')
display(top10_quality)

# 4. Bottom 10 countries by high quality
bottom10_quality = country_df.bottom_k(10, by='high_quality_articles_per_capita').sort(by='high_quality_articles_per_capita', descending=True)
print('4. Bottom 10 Countries By Quality')
display(bottom10_quality)

1. Top 10 Countries By Coverage


| country | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Antigua and Barbuda | 33 | 0 | 100000.0 | 0.00033 | 0.0 |
| Federated States of Micronesia | 14 | 0 | 100000.0 | 0.00014 | 0.0 |
| Marshall Islands | 13 | 0 | 100000.0 | 0.00013 | 0.0 |
| Tonga | 10 | 0 | 100000.0 | 0.0001 | 0.0 |
| Barbados | 25 | 0 | 300000.0 | 8.3e-05 | 0.0 |
| Montenegro | 36 | 3 | 600000.0 | 6e-05 | 5e-06 |
| Seychelles | 6 | 0 | 100000.0 | 6e-05 | 0.0 |
| Bhutan | 44 | 0 | 800000.0 | 5.5e-05 | 0.0 |
| Maldives | 33 | 1 | 600000.0 | 5.5e-05 | 2e-06 |
| Samoa | 8 | 0 | 200000.0 | 4e-05 | 0.0 |


2. Bottom 10 Countries By Coverage


| country | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Mozambique | 12 | 0 | 33900000.0 | 3.5398e-07 | 0.0 |
| Ethiopia | 44 | 2 | 126500000.0 | 3.4783e-07 | 1.581e-08 |
| Egypt | 32 | 1 | 105200000.0 | 3.0418e-07 | 9.5057e-09 |
| Israel | 2 | 0 | 9800000.0 | 2.0408e-07 | 0.0 |
| Norway | 1 | 0 | 5500000.0 | 1.8182e-07 | 0.0 |
| Zambia | 3 | 0 | 20200000.0 | 1.4851e-07 | 0.0 |
| Saudi Arabia | 5 | 2 | 36900000.0 | 1.355e-07 | 5.4201e-08 |
| Ghana | 4 | 1 | 34100000.0 | 1.173e-07 | 2.9326e-08 |
| India | 151 | 0 | 1428600000.0 | 1.057e-07 | 0.0 |
| China | 16 | 0 | 1411300000.0 | 1.1337e-08 | 0.0 |


3. Top 10 Countries By Quality


| country | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Montenegro | 36 | 3 | 600000.0 | 6e-05 | 5e-06 |
| Luxembourg | 27 | 2 | 700000.0 | 3.9e-05 | 3e-06 |
| Albania | 70 | 7 | 2700000.0 | 2.6e-05 | 3e-06 |
| Kosovo | 26 | 4 | 1700000.0 | 1.5e-05 | 2e-06 |
| Maldives | 33 | 1 | 600000.0 | 5.5e-05 | 2e-06 |
| Lithuania | 58 | 4 | 2900000.0 | 2e-05 | 1e-06 |
| Croatia | 65 | 5 | 3800000.0 | 1.7e-05 | 1e-06 |
| Guyana | 17 | 1 | 800000.0 | 2.1e-05 | 1e-06 |
| Palestinian Territory | 61 | 6 | 5500000.0 | 1.1e-05 | 1e-06 |
| Slovenia | 38 | 2 | 2100000.0 | 1.8e-05 | 9.5238e-07 |


4. Bottom 10 Countries By Quality


| country | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Liberia | 25 | 0 | 5400000.0 | 5e-06 | 0.0 |
| Zambia | 3 | 0 | 20200000.0 | 1.4851e-07 | 0.0 |
| Antigua and Barbuda | 33 | 0 | 100000.0 | 0.00033 | 0.0 |
| Bahamas | 9 | 0 | 400000.0 | 2.25e-05 | 0.0 |
| Barbados | 25 | 0 | 300000.0 | 8.3e-05 | 0.0 |
| Belize | 9 | 0 | 500000.0 | 1.8e-05 | 0.0 |
| Benin | 7 | 0 | 13700000.0 | 5.1095e-07 | 0.0 |
| Bhutan | 44 | 0 | 800000.0 | 5.5e-05 | 0.0 |
| Botswana | 3 | 0 | 2700000.0 | 1e-06 | 0.0 |
| Cape Verde | 9 | 0 | 600000.0 | 1.5e-05 | 0.0 |


In [110]:
# Region Calculations
region_df = df.group_by(['region']).agg(
    pl.col('article_title').count().alias('total_articles'),
    pl.col('article_quality').is_in(high_quality).sum().alias('high_quality_articles'),
    pl.col('population').sum().alias('population')
).sort(by=['region'])

region_df = region_df.with_columns(
    (pl.col('total_articles') / pl.col('population')).alias('articles_per_capita'),
    (pl.col('high_quality_articles') / pl.col('population')).alias('high_quality_articles_per_capita'),
)

regions_by_cover = region_df.sort(by='articles_per_capita', descending=True)
regions_by_quality = region_df.sort(by='high_quality_articles_per_capita', descending=True)

print('5. Regions by Total Coverage')
display(regions_by_cover)

print('6. Regions by Quality')
display(regions_by_quality)

5. Regions by Total Coverage


| region | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Antigua and Barbuda | 33 | 0 | 3.3e6 | 0.00001 | 0.0 |
| Federated States of Micronesia | 14 | 0 | 1.4e6 | 0.00001 | 0.0 |
| Grenada | 2 | 0 | 200000.0 | 0.00001 | 0.0 |
| Marshall Islands | 13 | 0 | 1.3e6 | 0.00001 | 0.0 |
| Seychelles | 6 | 0 | 600000.0 | 0.00001 | 0.0 |
| … | … | … | … | … | … |
| Nigeria | 246 | 6 | 5.5055e10 | 4.4683e-9 | 1.0898e-10 |
| Pakistan | 94 | 4 | 2.2607e10 | 4.1580e-9 | 1.7694e-10 |
| Indonesia | 113 | 15 | 3.1493e10 | 3.5881e-9 | 4.7629e-10 |
| China | 16 | 0 | 2.2581e10 | 7.0857e-10 | 0.0 |


6. Regions by Quality


| region | total_articles | high_quality_articles | population | articles_per_capita | high_quality_articles_per_capita |
| --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | f64 | f64 | f64 |
| Montenegro | 36 | 3 | 2.16e7 | 0.000002 | 1.3889e-7 |
| Luxembourg | 27 | 2 | 1.89e7 | 0.000001 | 1.0582e-7 |
| Kosovo | 26 | 4 | 4.42e7 | 5.8824e-7 | 9.0498e-8 |
| Gabon | 5 | 1 | 1.2e7 | 4.1667e-7 | 8.3333e-8 |
| Latvia | 7 | 1 | 1.33e7 | 5.2632e-7 | 7.5188e-8 |
| … | … | … | … | … | … |
| Uzbekistan | 25 | 0 | 9.1e8 | 2.7473e-8 | 0.0 |
| Vanuatu | 4 | 0 | 1.2e6 | 0.000003 | 0.0 |
| Yemen | 32 | 0 | 1.1008e9 | 2.9070e-8 | 0.0 |
| Zambia | 3 | 0 | 6.06e7 | 4.9505e-8 | 0.0 |
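
When article-level rows are aggregated, summing the `population` column directly counts a country's population once per article. The sketch below shows one alternative for region-level figures: count each country's population once, then divide the region's article totals by the combined population. It assumes each row carries a true region label; the rows are toy values, and `region_per_capita` is a hypothetical helper.

```python
from collections import defaultdict

HIGH_QUALITY = {"FA", "GA"}

# Toy article-level rows (made-up values): one row per article, population repeated per row.
rows = [
    {"region": "REGION X", "country": "A", "population": 1_000_000, "quality": "GA"},
    {"region": "REGION X", "country": "A", "population": 1_000_000, "quality": "Stub"},
    {"region": "REGION X", "country": "B", "population": 3_000_000, "quality": "Start"},
    {"region": "REGION Y", "country": "C", "population": 2_000_000, "quality": "FA"},
]

def region_per_capita(rows):
    """Aggregate article rows to regions, counting each country's population once."""
    country_pop = {}  # (region, country) -> population
    totals = defaultdict(lambda: {"articles": 0, "high_quality": 0})
    for r in rows:
        country_pop[(r["region"], r["country"])] = r["population"]
        totals[r["region"]]["articles"] += 1
        totals[r["region"]]["high_quality"] += r["quality"] in HIGH_QUALITY

    region_pop = defaultdict(int)
    for (region, _country), pop in country_pop.items():
        region_pop[region] += pop

    return {
        region: {
            "population": region_pop[region],
            "articles_per_capita": t["articles"] / region_pop[region],
            "high_quality_articles_per_capita": t["high_quality"] / region_pop[region],
        }
        for region, t in totals.items()
    }

result = region_per_capita(rows)
for region, stats in result.items():
    print(region, stats)
```

In the toy data, `REGION X` has a population of 4,000,000 (countries A and B counted once each), not the 5,000,000 that a plain sum over its three article rows would give.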
