# Homework 2 - Considering bias in data

This repository analyzes how the coverage of politicians on Wikipedia, and the quality of articles about them, varies among countries. For this assignment, we combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. We then analyze the number of articles, and specifically high-quality articles, per capita.

## License
Snippets of the example code were developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023

## Step 1: Getting and Analyzing the Article and Population Data

Data Source

The Wikipedia [Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) was crawled to generate a list of Wikipedia articles about politicians from a wide range of countries. The list is available in the [politicians_by_country_AUG.2024.csv](HCDS/data-512-homework_2/politicians_by_country_AUG.2024.csv) file.

The population data used for the analysis is available in CSV format as [population_by_country_AUG.2024.csv](data-512-homework_2/raw_files/population_by_country_AUG.2024.csv). It was downloaded from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau.


In [11]:
#########
#
#    IMPORT MODULES/PACKAGES
#

# These are standard python modules
import json, time, urllib.parse
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests

In this step we read the data and perform some basic analysis.

In [54]:
# Load the CSV files
population_df = pd.read_csv('./raw_files/population_by_country_AUG.2024.csv')
politicians_df = pd.read_csv('./raw_files/politicians_by_country_AUG.2024.csv')

# Some initial analysis
print("Total number of rows in the provided politicians csv: ", politicians_df.shape[0])
print("Total number of rows in the provided population csv: ", population_df.shape[0])

# Count politicians (same name and url) that are listed under more than one country
politicians_count_df = politicians_df.groupby(['name', 'url'])['country'].nunique().reset_index(name='country_count')
politicians_count_dup_df = politicians_count_df[politicians_count_df['country_count'] > 1]
print("Total number of politicians that are associated with more than 1 country:", politicians_count_dup_df.shape[0])

politicians_count_dup_df


Total number of rows in the provided politicians csv:  7155
Total number of rows in the provided population csv:  233
Total number of politicians that are associated with more than 1 country: 41


Unnamed: 0,name,url,country_count
182,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,2
456,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,2
568,Ali al-Qaradaghi,https://en.wikipedia.org/wiki/Ali_al-Qaradaghi,2
800,Antonio Gutiérrez y Ulloa,https://en.wikipedia.org/wiki/Antonio_Gutiérre...,2
815,Antonín Janoušek,https://en.wikipedia.org/wiki/Antonín_Janoušek,2
892,Ashab Uddin Ahmad,https://en.wikipedia.org/wiki/Ashab_Uddin_Ahmad,2
1016,Bak Jungyang,https://en.wikipedia.org/wiki/Bak_Jungyang,2
1161,Bona Malwal,https://en.wikipedia.org/wiki/Bona_Malwal,2
1495,Count Václav Antonín Chotek of Chotkov and Vojnín,https://en.wikipedia.org/wiki/Count_Václav_Ant...,2
1682,Djama Ali Moussa,https://en.wikipedia.org/wiki/Djama_Ali_Moussa,2


### API call to retrieve page info and ORES score for every article

The provided list of politicians contains article titles but not the latest revision id of each article. To get the revision id for each title, we use the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page) for English Wikipedia. For our use case, we follow the documentation at [API:Info](https://www.mediawiki.org/wiki/API:Info).

The article quality requests described below additionally require a Wikimedia account and a private API access token that is sent with each request.

Next, we need the predicted quality score for each article in the Wikipedia dataset. For this we use a machine learning system called [ORES](https://www.mediawiki.org/wiki/ORES). The name was originally an acronym for "Objective Revision Evaluation Service", but the system was later renamed simply "ORES". ORES provides estimates of Wikipedia article quality.

We use a few constants for better readability of the code.

In [18]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<swarali@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included.

PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

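#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token.
#    Note: the two assignments below replace the values of API_LATENCY_ASSUMED and API_THROTTLE_WAIT set earlier
#    in this cell, so the page info requests are throttled at this slower rate as well.
#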
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}


#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values

USERNAME = ""
ACCESS_TOKEN = ""

In [55]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.

#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 

#   Note: if you don't want to use the key manager to help manage your API keys, you can specify the values as constants
#   below. Just don't distribute the notebook without removing the constants or you'll be distributing your key too.

USERNAME = '<your username for the wikimedia account>'
ACCESS_TOKEN = '<your private API access token>'
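
The environment-variable idiom mentioned in the comments above could look like the following minimal sketch. The variable names `WIKIMEDIA_USERNAME` and `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_credentials` are hypothetical, chosen only for illustration; they are not defined by Wikimedia or by this notebook.

```python
import os

def load_credentials(env=os.environ):
    """Read the username and access token from environment variables.

    The variable names are hypothetical; use whatever names you exported in your shell.
    """
    username = env.get("WIKIMEDIA_USERNAME", "")
    access_token = env.get("WIKIMEDIA_ACCESS_TOKEN", "")
    if not username or not access_token:
        raise RuntimeError("Set WIKIMEDIA_USERNAME and WIKIMEDIA_ACCESS_TOKEN before running.")
    return username, access_token
```

With both variables exported, `USERNAME, ACCESS_TOKEN = load_credentials()` could stand in for the constants above, so that no key is stored in the notebook itself.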

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

The cell below also includes additional helper functions that process the data and create a few intermediary files.
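
To make the shape of a page info response concrete, here is a minimal sketch of pulling `lastrevid` values out of one. The `sample_response` dictionary is made-up illustrative data that follows the JSON layout the code in this notebook relies on (pages keyed by page id under `query` and `pages`), and `extract_last_revisions` is a hypothetical helper, not part of the notebook.

```python
def extract_last_revisions(json_response):
    """Map each returned page title to its latest revision id.

    Pages without a 'lastrevid' key (for example, missing pages) are skipped.
    """
    if not json_response:
        return {}
    pages = json_response.get("query", {}).get("pages", {})
    return {
        page["title"]: page["lastrevid"]
        for page in pages.values()
        if "lastrevid" in page
    }

# Made-up response for illustration only; ids and titles are not real data
sample_response = {
    "query": {
        "pages": {
            "101": {"pageid": 101, "ns": 0, "title": "Example Politician", "lastrevid": 123456789},
            "-1": {"ns": 0, "title": "No Such Page", "missing": ""},
        }
    }
}

print(extract_last_revisions(sample_response))
```

Guarding against an empty response matters because the request procedure returns `None` when an exception occurs.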

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
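    # work on a copy so the shared template dictionary is not modified between calls
    request_template = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title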

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

# Function to read article titles from the CSV file
def read_article_file(file_path):
    """
    Read the article titles from a CSV file.

    Args:
    - file_path (str): Path to the CSV file containing article titles.

    Returns:
    - list: A list of article titles extracted from the 'name' column of the CSV file.

    """
    try:
        # Read the CSV file into a DataFrame
        df = pd.read_csv(file_path)

        # Check if 'name' column exists and extract titles
        if 'name' in df.columns:
            article_titles = df['name'].tolist()
        else:
            print(f"Error: 'name' column not found in the CSV file.")
            return []
    except FileNotFoundError:
        print(f"Error: CSV file '{file_path}' not found. Please check the file path and try again.")
        raise
    except Exception as e:
        print(f"Unexpected error while reading '{file_path}': {e}")
        raise
    
    return article_titles


# iterate over article and save info in a json
def fetch_and_save_article_info(article_list):
    """
    Process the Wikipedia API data for each article and save it as a JSON file.

    Args:
    - article_list (list): List of article titles to query from the Wikipedia API.

    Returns:
    - None
    """
    wiki_information = {}
    batch_size = 50  # Set the batch size for requests

    # Fetch information in batches
    for i in range(0, len(article_list), batch_size):
        batch_articles = article_list[i:i + batch_size]
        names = '|'.join(batch_articles)
        print(f"Fetching information for batch: {batch_articles}")
        
        info = request_pageinfo_per_article(names)
        
        if info and 'query' in info and 'pages' in info['query']:
            response = info['query']['pages']
            for page_id, page_info in response.items():
                name = page_info['title']
                wiki_information[name] = page_info
        else:
            print(f"No valid response for batch starting with {batch_articles[0]}")

    # Write the information to a single JSON file
    output_file = "./intermediary_files/articles_page_info.json"
    try:
        with open(output_file, 'w', encoding='utf-8') as json_file:
            json.dump(wiki_information, json_file, ensure_ascii=False, indent=4)
        print(f"Data successfully saved to {output_file}.")
    except Exception as e:
        print(f"Error saving data to JSON: {e}")


# function API call to get the quality score for every article
def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


# read the json file with article title and last rev id and get quality scores
def process_ores_scores_from_json(json_file_path, email_address, access_token):
    """
    Get ORES scores for articles from a JSON file and save them to a CSV file.

    Args:
        json_file_path (str): Path to the JSON file containing article page information.
        email_address (str): Your email address for the API request.
        access_token (str): Your access token for the API request.
    """
    
    output_csv = 'intermediary_files/articles_ores_scores.csv'
    all_article_scores = []
    
    try:
        # Load the JSON file
        with open(json_file_path, mode='r', encoding='utf-8') as json_file:
            articles = json.load(json_file)

            for article in articles.values():  # Assuming articles is a dictionary
                lastrevid = article.get('lastrevid')
                
                if lastrevid:
                    print(f"Requesting ORES score for revision ID: {lastrevid}")
                    response = request_ores_score_per_article(
                        article_revid=lastrevid,
                        email_address=email_address,
                        access_token=access_token
                    )
                    
                    if response is not None:
                        # Process the response and store the scores
                        score_data = {
                            'revision_id': lastrevid,
                            'quality_prediction': response.get('enwiki', {}).get('scores', {}).get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('prediction')
                        }

                        # Extract probabilities
                        probabilities = response.get('enwiki', {}).get('scores', {}).get(str(lastrevid), {}).get('articlequality', {}).get('score', {}).get('probability', {})
                        score_data.update({f'Probability {key}': value for key, value in probabilities.items()})
                        
                        all_article_scores.append(score_data)
                    else:
                        print(f"Failed to get score for lastrevid {lastrevid}")
                else:
                    print("lastrevid not found for an article.")
        
        # Save the scores to a CSV file
        all_article_scores_df = pd.DataFrame(all_article_scores)
        all_article_scores_df.to_csv(output_csv, index=False)
        print(f"Scores saved to {output_csv}")

    except FileNotFoundError:
        print(f"Error: JSON file '{json_file_path}' not found. Please check the file path and try again.")
        raise
    except json.JSONDecodeError:
        print(f"Error: Failed to decode JSON from '{json_file_path}'.")
        raise

## Step 2: Getting Article Quality Predictions

ORES estimates the quality of an article (at a particular point in time), and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

    FA - Featured article (high quality)
    GA - Good article (high quality)
    B - B-class article
    C - C-class article
    Start - Start-class article
    Stub - Stub-class article
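
As an illustration of how the probabilities relate to a quality class, the sketch below picks the most likely category from a probability dictionary and checks whether it counts as high quality. The numbers in `sample_probabilities` are invented for the example, and `most_likely_class` and `is_high_quality` are hypothetical helpers. In practice the ORES response already carries a `prediction` field alongside the probabilities, which is what this notebook stores.

```python
# FA and GA are the two classes marked as high quality in the list above
HIGH_QUALITY = {"FA", "GA"}

def most_likely_class(probabilities):
    """Return the class with the highest probability."""
    return max(probabilities, key=probabilities.get)

def is_high_quality(quality_class):
    """Return True for the classes treated as high quality in this analysis."""
    return quality_class in HIGH_QUALITY

# Invented probabilities for illustration only
sample_probabilities = {"FA": 0.02, "GA": 0.05, "B": 0.10, "C": 0.18, "Start": 0.45, "Stub": 0.20}

predicted = most_likely_class(sample_probabilities)
print(predicted, is_high_quality(predicted))
```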

We run the cell below to read the article titles from the CSV file, get the latest revision id for each politician article using the page info API, and save the results to a JSON file.

The API can return information for multiple pages in a single request when the page titles are separated by the vertical bar "|" character, as implemented in the `fetch_and_save_article_info` function. However, this approach has a limit: the API handles at most 50 article titles per request. Reference documentation can be found [here](https://www.mediawiki.org/w/api.php?action=help&modules=query).
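
The batching described above can be sketched as follows. `MAX_TITLES_PER_REQUEST` and `batch_titles` are hypothetical names used for illustration, and the titles are placeholders; the value 50 is the limit mentioned in the text.

```python
MAX_TITLES_PER_REQUEST = 50  # limit described in the text above

def batch_titles(titles, batch_size=MAX_TITLES_PER_REQUEST):
    """Yield '|'-joined strings containing at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

# 120 placeholder titles are split into batches of 50, 50 and 20
placeholder_titles = [f"Title {i}" for i in range(120)]
batches = list(batch_titles(placeholder_titles))
print([len(batch.split("|")) for batch in batches])
```

Each yielded string can be passed as the `titles` parameter of a single page info request.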

In [None]:
# Read the article titles from the CSV
article_titles = read_article_file("./raw_files/politicians_by_country_AUG.2024.csv")
# Process and save data for all articles
fetch_and_save_article_info(article_titles)


The number of articles for which we could successfully get the API information was 7103.

The next step takes about 2-2.5 hours to run. I have already run this step, and the resulting scores are saved in a file named articles_ores_scores.csv under the intermediary_files folder. Skip the next cell to proceed with the analysis using the already fetched data.

In [None]:
#    Make the call, just pass in the page info JSON file, email address, and access token
json_file_path = './intermediary_files/articles_page_info.json'
score = process_ores_scores_from_json(json_file_path, email_address='swarali@uw.edu', access_token=ACCESS_TOKEN)

## Step 3: Combining the Datasets

We need to combine the page info JSON (which holds the revision ids) with the CSV file of ORES scores to generate a final dataset for further analysis. That dataset is then merged with the population CSV.

In [76]:
def create_final_dataframe(json_file_path, csv_file_path, source_data_file_path):
    """
    Create a final DataFrame by merging politician data from a JSON file with scores from a CSV file.

    Args:
        json_file_path (str): Path to the JSON file containing politician data.
        csv_file_path (str): Path to the CSV file containing ORES scores.
        source_data_file_path (str): Path to the source CSV listing the politician articles.
    """
    
    # Load JSON data
    with open(json_file_path, 'r', encoding='utf-8') as json_file:
        politicians_data = json.load(json_file)

    source_data_df = pd.read_csv(source_data_file_path)
    
    # Convert JSON data to DataFrame
    df_article_info = pd.DataFrame.from_dict(politicians_data, orient='index')

    merged_df_join_with_source = pd.merge(source_data_df, df_article_info, left_on='name', right_on='title', how='left')

    # Load CSV data
    ores_scores_df = pd.read_csv(csv_file_path)

    # Perform left join on different column names
    merged_df_source_with_scores = pd.merge(merged_df_join_with_source, ores_scores_df, left_on='lastrevid', right_on='revision_id', how='left')
    merged_df_source_with_scores = merged_df_source_with_scores[['name', 'country', 'pageid', 'title', 'revision_id', 'quality_prediction']]
    merged_df_source_with_scores["revision_id"] = merged_df_source_with_scores["revision_id"].astype("Int64")

    return merged_df_source_with_scores

    # Save final DataFrame to CSV
    # final_df.to_csv(output_csv_path, index=False)
    # print(f"Final DataFrame saved to {output_csv_path}")
# Example usage
final_df_source_with_scores = create_final_dataframe(
    json_file_path='intermediary_files/articles_page_info.json',
    csv_file_path='intermediary_files/articles_ores_scores.csv',
    source_data_file_path= './raw_files/politicians_by_country_AUG.2024.csv'
)

final_df_source_with_scores


Unnamed: 0,name,country,pageid,title,revision_id,quality_prediction
0,Majah Ha Adrif,Afghanistan,10483286.0,Majah Ha Adrif,1233202991,Start
1,Haroon al-Afghani,Afghanistan,11966231.0,Haroon al-Afghani,1230459615,B
2,Tayyab Agha,Afghanistan,46841383.0,Tayyab Agha,1225661708,Start
3,Khadija Zahra Ahmadi,Afghanistan,71600382.0,Khadija Zahra Ahmadi,1234741562,Stub
4,Aziza Ahmadyar,Afghanistan,47805901.0,Aziza Ahmadyar,1195651393,Start
...,...,...,...,...,...,...
7150,Josiah Tongogara,Zimbabwe,633594.0,Josiah Tongogara,1203429435,C
7151,Langton Towungana,Zimbabwe,16375315.0,Langton Towungana,1246280093,Stub
7152,Sengezo Tshabangu,Zimbabwe,75270547.0,Sengezo Tshabangu,1228478288,Start
7153,Herbert Ushewokunze,Zimbabwe,11742819.0,Herbert Ushewokunze,959111842,Stub


Analyzing the acquired data: we now have a list of politicians' Wikipedia articles along with their revision_id, pageid and quality_prediction. For articles where the ORES call failed, the row is kept but its quality_prediction is empty.

In [77]:
# Wikilist politician articles with no revision_id => pageInfo call failed
missing_revision_id_df = final_df_source_with_scores[final_df_source_with_scores["revision_id"].isna()]
print("Total number of articles for which pageInfo call failed: " , missing_revision_id_df.shape[0])
missing_revision_id_df


Total number of articles for which pageInfo call failed:  8


Unnamed: 0,name,country,pageid,title,revision_id,quality_prediction
430,Barbara Eibinger-Miedl,Austria,,Barbara Eibinger-Miedl,,
516,Mehrali Gasimov,Azerbaijan,,Mehrali Gasimov,,
1200,Kyaw Myint,Myanmar,,Kyaw Myint,,
1342,André Ngongang Ouandji,Cameroon,,André Ngongang Ouandji,,
1955,Tomás Pimentel,Dominican Republic,,Tomás Pimentel,,
2427,Richard Sumah,Ghana,,Richard Sumah,,
4496,Segun ''Aeroland'' Adewale,Nigeria,,Segun ''Aeroland'' Adewale,,
5719,Bashir Bililiqo,Somalia,,Bashir Bililiqo,,


In [106]:
# Measuring error rate for the ORES and Pageinfo API
with_revision_id_df = final_df_source_with_scores[~final_df_source_with_scores["revision_id"].isna()]
print("Total number of articles for which pageInfo call succeeded: " , with_revision_id_df.shape[0])

missing_rating_df = with_revision_id_df[with_revision_id_df["quality_prediction"].isna()]
print("Total number of articles for which pageInfo call succeeded, but ORES API call failed: " , missing_rating_df.shape[0])

with_rating_df = final_df_source_with_scores[~final_df_source_with_scores["quality_prediction"].isna()]
print("Total number of articles for which pageInfo call succeeded, and ORES API call succeeded: " , with_rating_df.shape[0])

Total number of articles for which pageInfo call succeeded:  7147
Total number of articles for which pageInfo call succeeded, but ORES API call failed:  12
Total number of articles for which pageInfo call succeeded, and ORES API call succeeded:  7135
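
To turn these counts into failure rates, a small worked calculation is shown below. It uses only the numbers printed in this notebook (7155 source rows, 8 page info failures, 7147 rows with a revision id, 12 ORES failures); the helper name `failure_rate` is hypothetical.

```python
def failure_rate(failed, attempted):
    """Return the share of failed calls as a percentage."""
    return 100.0 * failed / attempted

pageinfo_rate = failure_rate(8, 7155)   # rows with no revision id / all source rows
ores_rate = failure_rate(12, 7147)      # rows with no prediction / rows with a revision id

print(f"Page info failure rate: {pageinfo_rate:.2f}%")
print(f"ORES failure rate: {ores_rate:.2f}%")
```

Both rates come out well below one percent, so the missing rows are unlikely to change the per capita results much.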


### Adding population and region details

At this stage, we have the initial list of articles enriched with the latest revision id and the article rating. We now want to enrich the list further by adding the population and region of the country associated with each article.

We now determine which region each country belongs to. In the population file, regions and countries are listed hierarchically: each region name is written in capital letters and is followed by the countries that belong to it.
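
The idea can be illustrated on a tiny hand-made list: names written entirely in capitals are treated as regions, and every following country inherits the most recent region. The list below is a small example built from names that appear in this notebook's output, and `assign_regions` is a hypothetical helper written in plain Python rather than with pandas.

```python
def assign_regions(geographies):
    """Pair each country with the closest preceding all-caps region name."""
    current_region = ""
    pairs = []
    for name in geographies:
        if name.isupper():
            current_region = name  # a region header: remember it, do not emit a row
        else:
            pairs.append((name, current_region))
    return pairs

rows = ["NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa", "Tonga"]
for country, region in assign_regions(rows):
    print(country, "->", region)
```

Because the most recent header wins, a country listed under nested headers ends up in the most specific region that precedes it.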

In [70]:
# Load the population CSV into a DataFrame
population_df = pd.read_csv('./raw_files/population_by_country_AUG.2024.csv')

# Loop through each row in the dataframe. A name in all capital letters marks a region.
region = ""

population_list = []

for index, row in population_df.iterrows():
    row_dict = row.to_dict()
    if row["Geography"].isupper():
        # save the geography in a variable
        region = row["Geography"]
        region_population = row["Population"]
        row_dict["region"] = ""
    else:
        # assign the region to the latest value in the region variable
        row_dict["region"] = region

    population_list.append(row_dict)

population_df_regions = pd.DataFrame(population_list)
population_df_regions = population_df_regions.rename(columns = {"Geography": "country", "Population": "population"})
country_region_df = population_df_regions[~population_df_regions["country"].str.isupper()].reset_index(drop=True)
print("Total number of countries in the `population_by_country_AUG.2024.csv` : ", country_region_df.shape[0])
 
country_region_df

Total number of countries in the `population_by_country_AUG.2024.csv` :  209


Unnamed: 0,country,population,region
0,Algeria,46.8,NORTHERN AFRICA
1,Egypt,105.2,NORTHERN AFRICA
2,Libya,6.9,NORTHERN AFRICA
3,Morocco,37.0,NORTHERN AFRICA
4,Sudan,48.1,NORTHERN AFRICA
...,...,...,...
204,Samoa,0.2,OCEANIA
205,Solomon Islands,0.8,OCEANIA
206,Tonga,0.1,OCEANIA
207,Tuvalu,0.0,OCEANIA


### Merging datasets

At this stage, we have two datasets:

1. A list of Wikipedia articles about politicians, containing details such as the politician's country, article revision ID, and article title.
2. A list of countries with information on population and their respective regions.

We aim to merge these datasets for the next steps in our analysis, to determine the number of articles, particularly high-quality ones, per capita, both country-wise and region-wise.



In [82]:
# Merging the data frames of interest using outer join to find the mismatched countries.
article_scores_with_country = pd.merge(final_df_source_with_scores, country_region_df, on=["country"], how="outer", indicator=True)
article_scores_with_country


Unnamed: 0,name,country,pageid,title,revision_id,quality_prediction,population,region,_merge
0,Majah Ha Adrif,Afghanistan,10483286.0,Majah Ha Adrif,1233202991,Start,42.4,SOUTH ASIA,both
1,Haroon al-Afghani,Afghanistan,11966231.0,Haroon al-Afghani,1230459615,B,42.4,SOUTH ASIA,both
2,Tayyab Agha,Afghanistan,46841383.0,Tayyab Agha,1225661708,Start,42.4,SOUTH ASIA,both
3,Khadija Zahra Ahmadi,Afghanistan,71600382.0,Khadija Zahra Ahmadi,1234741562,Stub,42.4,SOUTH ASIA,both
4,Aziza Ahmadyar,Afghanistan,47805901.0,Aziza Ahmadyar,1195651393,Start,42.4,SOUTH ASIA,both
...,...,...,...,...,...,...,...,...,...
7193,Langton Towungana,Zimbabwe,16375315.0,Langton Towungana,1246280093,Stub,16.7,EASTERN AFRICA,both
7194,Sengezo Tshabangu,Zimbabwe,75270547.0,Sengezo Tshabangu,1228478288,Start,16.7,EASTERN AFRICA,both
7195,Herbert Ushewokunze,Zimbabwe,11742819.0,Herbert Ushewokunze,959111842,Stub,16.7,EASTERN AFRICA,both
7196,Denis Walker,Zimbabwe,3255571.0,Denis Walker,1247902630,C,16.7,EASTERN AFRICA,both


We see that some countries have no associated articles, and some articles have a country with no match in the population data. To list these countries, we look at the '_merge' column in the above dataframe, keep the rows whose value is not 'both', and collect the unique country names.
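
How the `_merge` indicator exposes mismatches can be seen on a toy example. The two small DataFrames below are made up for illustration: the country spellings mimic the kind of mismatch found here, and the population values are placeholders. Only the use of `pd.merge` with `how="outer"` and `indicator=True` mirrors the notebook.

```python
import pandas as pd

# Toy data for illustration only; populations are placeholder values
articles = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C"],
    "country": ["Algeria", "Korea, South", "Egypt"],
})
populations = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Korea (South)"],
    "population": [1.0, 2.0, 3.0],
})

merged = pd.merge(articles, populations, on="country", how="outer", indicator=True)

# 'left_only' rows are articles whose country has no population entry;
# 'right_only' rows are countries that have no articles
unmatched = merged.loc[merged["_merge"] != "both", "country"].dropna().unique()
print(sorted(unmatched))
```

Both spellings of the same country show up as unmatched, which is why near-duplicate names appear in the unmatched list.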

In [89]:
output_no_match_path='./result_files/wp_countries-no_match.txt'
output_consolidated_path='./result_files/wp_politicians_by_country.csv'

# Identify unmatched countries
unmatched_df = article_scores_with_country[article_scores_with_country['_merge'] != 'both']

# Extract unique unmatched countries
unmatched_countries = unmatched_df['country'].dropna().unique()

# Save unmatched countries to a text file
with open(output_no_match_path, 'w') as f:
    for country in unmatched_countries:
        f.write(f"{country}\n")

print(f"Unmatched countries saved to {output_no_match_path}")

# Consolidate remaining matched data into a single CSV
consolidated_df = article_scores_with_country[article_scores_with_country['_merge'] == 'both']
wp_politicians_by_country = consolidated_df.drop(columns=["_merge"])
wp_politicians_by_country = wp_politicians_by_country.rename(columns = { "quality_prediction": "article_quality"})
wp_politicians_by_country = wp_politicians_by_country[["country", "region", "population", "title", "revision_id", "article_quality"]]
wp_politicians_by_country = wp_politicians_by_country.drop_duplicates()
wp_politicians_by_country

Unmatched countries saved to ./result_files/wp_countries-no_match.txt


Unnamed: 0,country,region,population,title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233202991,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230459615,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225661708,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234741562,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651393,Start
...,...,...,...,...,...,...
7192,Zimbabwe,EASTERN AFRICA,16.7,Josiah Tongogara,1203429435,C
7193,Zimbabwe,EASTERN AFRICA,16.7,Langton Towungana,1246280093,Stub
7194,Zimbabwe,EASTERN AFRICA,16.7,Sengezo Tshabangu,1228478288,Start
7195,Zimbabwe,EASTERN AFRICA,16.7,Herbert Ushewokunze,959111842,Stub


In [87]:
print("unmatched countries:" ,unmatched_countries)

unmatched countries: ['Andorra' 'Australia' 'Brunei' 'Canada' 'China (Hong Kong SAR)'
 'China (Macao SAR)' 'Curacao' 'Denmark' 'Dominica' 'Fiji' 'French Guiana'
 'French Polynesia' 'Georgia' 'Guadeloupe' 'Guam' 'Guinea-Bissau'
 'GuineaBissau' 'Iceland' 'Ireland' 'Jamaica' 'Kiribati' 'Korea (North)'
 'Korea (South)' 'Korea, South' 'Korean' 'Liechtenstein' 'Martinique'
 'Mauritius' 'Mayotte' 'Mexico' 'Nauru' 'Netherlands' 'New Caledonia'
 'New Zealand' 'Palau' 'Philippines' 'Puerto Rico' 'Reunion' 'Romania'
 'San Marino' 'Sao Tome and Principe' 'Suriname' 'United Kingdom'
 'United States' 'Western Sahara' 'eSwatini']


In [90]:
# Write the dataframe to a file
wp_politicians_by_country.to_csv("result_files/wp_politicians_by_country.csv", index=False)

## Step 4: Analysis and results

This analysis involves calculating two key metrics on both a country-by-country and regional basis:

1. **Total Articles per Capita**: The ratio of the number of articles to the population of each country or region.
2. **High-Quality Articles per Capita**: The ratio of high-quality articles (classified as "FA" for Featured Article or "GA" for Good Article by ORES) to the population of each country or region.

### Key Considerations:
- Each country is assigned to only one region, and the `population_by_country_AUG.2024.csv` file represents regions in a hierarchical structure. In the analysis, ensure that each country is placed in the closest, most specific (lowest in the hierarchy) region.
- The population data in `population_by_country_AUG.2024.csv` is given in millions, so dividing an article count by it yields articles per million people; a literal per-person ratio would be very small. Careful thought should be given to how the results are presented to ensure clarity.
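
To illustrate the point about units, here is a minimal sketch of the two possible representations, using the Antigua and Barbuda figures reported in the results (33 articles, population 0.1 million). The helper names are hypothetical and are not part of the notebook.

```python
def articles_per_million(article_count: int, population_millions: float) -> float:
    """Articles per one million inhabitants (population is already in millions)."""
    if population_millions <= 0:
        raise ValueError("population must be positive")
    return article_count / population_millions


def articles_per_person(article_count: int, population_millions: float) -> float:
    """Literal per-person ratio; very small for any realistic country."""
    return articles_per_million(article_count, population_millions) / 1_000_000


per_million = articles_per_million(33, 0.1)
per_person = articles_per_person(33, 0.1)
print(f"{per_million:.1f} articles per million people")
print(f"{per_person:.6f} articles per person")
```

The per-million form is easier to read in a table, which is one reason to keep the population column in millions when computing the ratios.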

### Steps Done in the Code:
- Calculated the total number of articles per capita and high-quality articles per capita, both at the country level and the regional level.
- Ensured that countries were correctly assigned to their closest hierarchical region based on the dataset.
- Applied filtering to classify articles as high-quality if ORES predicted them as either "FA" (Featured Article) or "GA" (Good Article).
- Reported each ratio as articles per million people, since the population data is in millions and literal per-person ratios would be very small.
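
The region assignment can be sketched as follows. This is only an illustration on toy data and rests on an assumption about the file layout: region rows are written in ALL CAPS and each region row precedes the countries that belong to it. The values for WORLD, AFRICA and WESTERN AFRICA are placeholders.

```python
import pandas as pd

# Toy stand-in for population_by_country_AUG.2024.csv.
# Assumption: region rows are upper-case and precede their member countries.
toy_population = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "WESTERN AFRICA", "Ghana"],
    "Population": [1.0, 1.0, 256.0, 46.8, 105.2, 1.0, 34.1],
})

is_region = toy_population["Geography"].str.isupper()

# Keep the label only on region rows, then carry it forward to the rows below.
# The most recent region row is also the most specific one in the hierarchy.
toy_population["region"] = toy_population["Geography"].where(is_region).ffill()

countries = (
    toy_population.loc[~is_region, ["Geography", "region", "Population"]]
    .rename(columns={"Geography": "country", "Population": "population"})
    .reset_index(drop=True)
)
print(countries)
```

Forward-filling works here because a broader region such as AFRICA is immediately overwritten by the narrower NORTHERN AFRICA row before any country row is reached.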


In [95]:
# Grouping data by country, region, and population to count total articles
article_count_by_country = wp_politicians_by_country.groupby(
    ['country', 'region', 'population']
).size().reset_index(name='total_articles')

# Filter out rows where population is 0 for the calculation
article_count_by_country = article_count_by_country.loc[article_count_by_country['population'] > 0]

# Calculating total articles per capita
article_count_by_country["articles_per_capita"] = article_count_by_country["total_articles"] / article_count_by_country["population"]

# Sorting and selecting the top 10 countries with the highest articles per capita
top_10_countries = article_count_by_country.sort_values("articles_per_capita", ascending=False).reset_index(drop=True)
top_10_countries = top_10_countries.head(10)

# Displaying the result
print("The 10 countries with the highest total articles per capita (in descending order):")
top_10_countries

The 10 countries with the highest total articles per capita (in descending order):


Unnamed: 0,country,region,population,total_articles,articles_per_capita
0,Antigua and Barbuda,CARIBBEAN,0.1,33,330.0
1,Federated States of Micronesia,OCEANIA,0.1,14,140.0
2,Marshall Islands,OCEANIA,0.1,13,130.0
3,Tonga,OCEANIA,0.1,10,100.0
4,Barbados,CARIBBEAN,0.3,25,83.333333
5,Seychelles,EASTERN AFRICA,0.1,6,60.0
6,Montenegro,SOUTHERN EUROPE,0.6,36,60.0
7,Bhutan,SOUTH ASIA,0.8,44,55.0
8,Maldives,SOUTH ASIA,0.6,33,55.0
9,Samoa,OCEANIA,0.2,8,40.0


Countries such as Monaco and Tuvalu have a population of 0 in the dataset (presumably because values are given in millions and rounded), so their total articles per capita evaluates to `inf`. I therefore removed them from the calculation.
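
A small sketch of why the division produces `inf` and how such rows can be dropped. The article counts for Monaco and Tuvalu below are hypothetical; only the Tonga row mirrors the results table.

```python
import numpy as np
import pandas as pd

# Populations are in millions; Monaco and Tuvalu counts are made up for illustration.
toy = pd.DataFrame({
    "country": ["Monaco", "Tuvalu", "Tonga"],
    "population": [0.0, 0.0, 0.1],
    "total_articles": [10, 1, 10],
})

# pandas follows IEEE 754 here: a positive number divided by 0.0 gives inf, not an error.
ratio = toy["total_articles"] / toy["population"]
print(ratio.tolist())

# Keep only rows with a finite ratio, which here is the same as population > 0.
finite = toy[np.isfinite(ratio)]
print(finite["country"].tolist())
```

Filtering on `population > 0` before dividing, as the notebook does, has the same effect and avoids creating the infinite values in the first place.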

In [94]:
# Bottom 10 countries by coverage: the 10 countries with the lowest total articles per capita (in ascending order).
# Sorting and selecting the bottom 10 countries with the lowest articles per capita
bottom_10_countries = article_count_by_country.sort_values("articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10_countries = bottom_10_countries.head(10)

# Displaying the result
print("The 10 countries with the lowest total articles per capita (in ascending order):")
bottom_10_countries


The 10 countries with the lowest total articles per capita (in ascending order):


Unnamed: 0,country,region,population,total_articles,articles_per_capita
0,China,EAST ASIA,1411.3,16,0.011337
1,India,SOUTH ASIA,1428.6,151,0.105698
2,Ghana,WESTERN AFRICA,34.1,4,0.117302
3,Saudi Arabia,WESTERN ASIA,36.9,5,0.135501
4,Zambia,EASTERN AFRICA,20.2,3,0.148515
5,Norway,NORTHERN EUROPE,5.5,1,0.181818
6,Israel,WESTERN ASIA,9.8,2,0.204082
7,Egypt,NORTHERN AFRICA,105.2,32,0.304183
8,Cote d'Ivoire,WESTERN AFRICA,30.9,10,0.323625
9,Ethiopia,EASTERN AFRICA,126.5,44,0.347826


An article is considered high quality if it has an FA or GA rating. We therefore select only the articles with these ratings and then display the top 10 countries by high-quality articles per capita.

In [97]:
# Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

high_quality_articles = wp_politicians_by_country[wp_politicians_by_country["article_quality"].isin(["FA", "GA"])]
high_quality_articles_count = high_quality_articles.groupby(['country', 'region', 'population']).size().reset_index(name='total_articles')
high_quality_articles_count["total_articles_per_capita"] = high_quality_articles_count["total_articles"] / high_quality_articles_count["population"]

top_10_quality = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
top_10_quality = top_10_quality.head(10)

print("The 10 countries with the highest high quality articles per capita (in descending order):")
top_10_quality


The 10 countries with the highest high quality articles per capita (in descending order):


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Montenegro,SOUTHERN EUROPE,0.6,3,5.0
1,Luxembourg,WESTERN EUROPE,0.7,2,2.857143
2,Albania,SOUTHERN EUROPE,2.7,7,2.592593
3,Kosovo,SOUTHERN EUROPE,1.7,4,2.352941
4,Maldives,SOUTH ASIA,0.6,1,1.666667
5,Lithuania,NORTHERN EUROPE,2.9,4,1.37931
6,Croatia,SOUTHERN EUROPE,3.8,5,1.315789
7,Guyana,SOUTH AMERICA,0.8,1,1.25
8,Palestinian Territory,WESTERN ASIA,5.5,6,1.090909
9,Slovenia,SOUTHERN EUROPE,2.1,2,0.952381


In [98]:
# Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
bottom_10_quality = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10_quality = bottom_10_quality.head(10)

print("The 10 countries with the lowest high quality articles per capita (in ascending order):")
bottom_10_quality

The 10 countries with the lowest high quality articles per capita (in ascending order):


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Bangladesh,SOUTH ASIA,173.5,1,0.005764
1,Egypt,NORTHERN AFRICA,105.2,1,0.009506
2,Ethiopia,EASTERN AFRICA,126.5,2,0.01581
3,Japan,EAST ASIA,124.5,2,0.016064
4,Pakistan,SOUTH ASIA,240.5,4,0.016632
5,Colombia,SOUTH AMERICA,52.2,1,0.019157
6,Congo DR,MIDDLE AFRICA,102.3,2,0.01955
7,Vietnam,SOUTHEAST ASIA,98.9,2,0.020222
8,Uganda,EASTERN AFRICA,48.6,1,0.020576
9,Algeria,NORTHERN AFRICA,46.8,1,0.021368
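
Because `high_quality_articles_count` is built from a frame already filtered to FA and GA rows, countries with no high-quality article never appear in it, so the table above ranks only countries with at least one such article. The following is a sketch, on toy data with hypothetical rows, of one way to keep zero-count countries if that is the intended reading of "lowest". It is not the method used in the notebook.

```python
import pandas as pd

# Hypothetical article rows; populations are in millions.
toy = pd.DataFrame({
    "country": ["Bangladesh", "Bangladesh", "Norway", "Montenegro"],
    "population": [173.5, 173.5, 5.5, 0.6],
    "article_quality": ["GA", "Stub", "Start", "FA"],
})

# Flag high-quality rows instead of filtering, so every country survives the groupby.
toy["is_high_quality"] = toy["article_quality"].isin(["FA", "GA"])
hq_counts = (
    toy.groupby(["country", "population"])["is_high_quality"]
    .sum()
    .reset_index(name="high_quality_articles")
)
hq_counts["hq_per_million"] = hq_counts["high_quality_articles"] / hq_counts["population"]
print(hq_counts.sort_values("hq_per_million"))
```

With this variant, countries that have no FA or GA article tie at zero, so a bottom-10 list would need a tie-breaking rule or would simply report all of them.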


In [99]:
# Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

articles_count_per_region = wp_politicians_by_country.groupby(['region']).size().reset_index(name='total_articles')
articles_count_per_region

Unnamed: 0,region,total_articles
0,CARIBBEAN,219
1,CENTRAL AMERICA,188
2,CENTRAL ASIA,106
3,EAST ASIA,152
4,EASTERN AFRICA,665
5,EASTERN EUROPE,709
6,MIDDLE AFRICA,231
7,NORTHERN AFRICA,302
8,NORTHERN EUROPE,191
9,OCEANIA,72


In [101]:
# Merging article counts with population data based on region
merged_regional_data = pd.merge(
    articles_count_per_region[['region', 'total_articles']],
    population_df[['Geography', 'Population']],
    left_on="region", right_on="Geography", how="left"
)

# Calculating total articles per capita for each region
merged_regional_data["articles_per_capita"] = merged_regional_data["total_articles"] / merged_regional_data["Population"]

# Sorting the data by articles per capita in descending order and resetting the index
sorted_regional_data = merged_regional_data.sort_values("articles_per_capita", ascending=False).reset_index(drop=True)

# Displaying the sorted data
sorted_regional_data[["region", "total_articles", "Population", "articles_per_capita"]]


Unnamed: 0,region,total_articles,Population,articles_per_capita
0,SOUTHERN EUROPE,797,152.0,5.243421
1,CARIBBEAN,219,44.0,4.977273
2,WESTERN EUROPE,498,199.0,2.502513
3,EASTERN EUROPE,709,285.0,2.487719
4,WESTERN ASIA,610,299.0,2.040134
5,NORTHERN EUROPE,191,108.0,1.768519
6,SOUTHERN AFRICA,123,70.0,1.757143
7,OCEANIA,72,45.0,1.6
8,EASTERN AFRICA,665,483.0,1.376812
9,SOUTH AMERICA,569,426.0,1.335681


In [103]:
# Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.
articles_count_per_region_high_quality = high_quality_articles.groupby(['region']).size().reset_index(name='total_articles')
articles_count_per_region_high_quality


Unnamed: 0,region,total_articles
0,CARIBBEAN,9
1,CENTRAL AMERICA,10
2,CENTRAL ASIA,5
3,EAST ASIA,3
4,EASTERN AFRICA,17
5,EASTERN EUROPE,38
6,MIDDLE AFRICA,8
7,NORTHERN AFRICA,16
8,NORTHERN EUROPE,9
9,OCEANIA,1


In [105]:
# Merging high-quality articles count data with population data based on region
# (uses articles_count_per_region_high_quality, the frame defined in the previous cell)
regional_high_quality_df = pd.merge(
    articles_count_per_region_high_quality,
    population_df,
    left_on=["region"],
    right_on=["Geography"],
    how="left"
)

# Selecting only relevant columns: region, total articles, and population
regional_high_quality_df = regional_high_quality_df[["region", "total_articles", "Population"]]

# Calculating total high-quality articles per capita
regional_high_quality_df["total_articles_per_capita"] = regional_high_quality_df["total_articles"] / regional_high_quality_df["Population"]

# Sorting the regions by total high-quality articles per capita in descending order
regional_high_quality_df = regional_high_quality_df.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)

# Displaying the final sorted dataframe
regional_high_quality_df



Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,53,152.0,0.348684
1,CARIBBEAN,9,44.0,0.204545
2,EASTERN EUROPE,38,285.0,0.133333
3,SOUTHERN AFRICA,8,70.0,0.114286
4,WESTERN EUROPE,21,199.0,0.105528
5,WESTERN ASIA,27,299.0,0.090301
6,NORTHERN EUROPE,9,108.0,0.083333
7,NORTHERN AFRICA,16,256.0,0.0625
8,CENTRAL ASIA,5,80.0,0.0625
9,CENTRAL AMERICA,10,182.0,0.054945
