# Homework 2 Code - Considering Bias in Data

## Data Analysis
This notebook cleans the requested data and analyzes bias in online political resources by looking at the prevalence and quality of Wikipedia articles about politicians in over 167 countries. The code produces six tables, embedded at the bottom of this notebook.

1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.
This homework uses example code referenced in the markdown below and in the README.
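
As a minimal sketch of how the top and bottom rankings described above can be produced, the snippet below computes articles per capita on a small made-up table and sorts it in both directions. The helper `rank_countries` and the toy values are hypothetical and only illustrate the idea; they are not the notebook's actual data or code.

```python
import pandas as pd

# Hypothetical toy data: country names, populations and counts are made up.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [1_000_000, 5_000_000, 20_000_000, 2_000_000],
    "number_of_articles": [50, 100, 40, 10],
})

# Per capita metric: articles divided by population.
toy["articles_per_capita"] = toy["number_of_articles"] / toy["population"]


def rank_countries(df, column, n=10, highest=True):
    """Return the n rows with the highest (or lowest) value in `column`."""
    ranked = df.sort_values(column, ascending=not highest)
    return ranked.head(n).reset_index(drop=True)


top = rank_countries(toy, "articles_per_capita", n=2, highest=True)
bottom = rank_countries(toy, "articles_per_capita", n=2, highest=False)
print(top[["country", "articles_per_capita"]])
print(bottom[["country", "articles_per_capita"]])
```

The same helper could be pointed at a high quality per capita column to build the other two country tables.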

##### USER NOTE!

A USER NOTE in a markdown cell marks a place where users should edit the code to control their desired inputs and outputs. Please ctrl+F to see all user notes.

# Requesting ORES scores through LiftWing ML Service API
The Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely that in the coming years all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another ongoing change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES, which require running an ML model over the text of an article page, are examples of compute-intensive API services.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).
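
To make the parameter rename concrete, here is a small sketch of how a LiftWing request body might be assembled with the "rev_id" field. The helper names `build_liftwing_payload` and `liftwing_url` are hypothetical; the endpoint pattern, model name, and payload fields are taken from the constants defined later in this notebook.

```python
import json


def liftwing_url(model_name="enwiki-articlequality"):
    """Fill the model name into the LiftWing prediction endpoint pattern."""
    return f"https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"


def build_liftwing_payload(rev_id, lang="en", features=True):
    """Build the JSON body; note the key is 'rev_id', not the older 'revid'."""
    return json.dumps({"lang": lang, "rev_id": int(rev_id), "features": features})


print(liftwing_url())
print(build_liftwing_payload(1231655023))
```

Nothing is sent over the network here; the sketch only shows the shape of the URL and body that a POST request would carry.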

# Loading Libraries

The following code block references Dr. David W. McDonald's example code, with new libraries added to the originals.

In [22]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd
import csv
import numpy as np
from IPython.display import display

# Import Data

Imports the extracted dictionary (a JSON file) of politician names ("article_title") and revision IDs (lastrevid/rev_id) provided by Wikipedia.

These article titles are titles of Wikipedia pages about a politician, at varying degrees of completeness.

The following code block references Dr. David W. McDonald's example code.

In [46]:
# Load the JSON data from the file
with open('politicians-and-revision-ids.json', 'r') as json_file:
    data = json.load(json_file)

# Create a new dictionary with 'article_title' as a string and 'rev_id' as an integer
ARTICLE_REVISIONS = {article: {'article_title': article, 'rev_id': int(info.pop('lastrevid'))} for article, info in data.items() if 'lastrevid' in info}

# Print the first five key-value pairs to check the result
for i, (key, value) in enumerate(ARTICLE_REVISIONS.items()):
    if i == 5:
        break
    print(f"{key}: {value}")


Abdul Baqi Turkistani: {'article_title': 'Abdul Baqi Turkistani', 'rev_id': 1231655023}
Abdul Ghani Ghani: {'article_title': 'Abdul Ghani Ghani', 'rev_id': 1227026187}
Abdul Rahim Ayoubi: {'article_title': 'Abdul Rahim Ayoubi', 'rev_id': 1226326055}
Ahmad Wali Massoud: {'article_title': 'Ahmad Wali Massoud', 'rev_id': 1221720658}
Aimal Faizi: {'article_title': 'Aimal Faizi', 'rev_id': 1185105938}


# Initialize Constants

Establishes the API request endpoint and parameters in order to request the ORES article quality scores.

The following code block references Dr. David W. McDonald's example code, edited only for the specific parameters email and access token.

##### USER NOTE! 
- User must change the email and Access Token in order for this code to work.

In [47]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "kilpas@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "kilpas@uw.edu",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = "kilpas"
ACCESS_TOKEN = ""
#
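
The throttle constant above can be checked with a short worked calculation. This sketch assumes the same figures as the constants cell (5000 requests per hour and roughly 2 ms of latency); the function name `throttle_wait` is hypothetical.

```python
def throttle_wait(requests_per_hour, assumed_latency=0.002):
    """Seconds to sleep between requests so the hourly limit is not exceeded."""
    seconds_per_request = 3600.0 / requests_per_hour  # 3600 s in an hour
    return max(seconds_per_request - assumed_latency, 0.0)


wait = throttle_wait(5000)
print(f"{wait:.3f} seconds between requests")  # 0.718 seconds between requests
```

With a different rate limit in the token, only the first argument would need to change.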

# Get your access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or log in](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check it. If you do not have a Wikimedia account you will need to create an account that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note, when you create a Personal API token you are granted three items - a Client ID, a Client secret, and an Access token - you should save all three of these. When you dismiss the box they are gone. If you lose any one of the tokens you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.


The Wikimedia Foundation appears to be issuing access tokens that are adhering to the [JWT (JSON Web Token) standard](https://jwt.io/introduction/). There was also some documentation by IBM about the [use of JWT tokens](https://www.ibm.com/docs/en/cics-ts/6.1?topic=cics-json-web-token-jwt) that I found useful. Keep in mind, documentation from IBM is specific to their implementation of the JWT standard. Access tokens are composed of different parts that specify the domain being accessed and rate limits. The little snippet of code below is not required to make ORES requests. It just allows us to see what is in the Wikimedia provided access token that you were issued.
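
A minimal sketch of inspecting a JWT-style token is shown here. It relies only on the general JWT layout (three base64url segments separated by dots) and does not verify the signature. The function names and the fake token are hypothetical, and the claims in a real Wikimedia token may look different.

```python
import base64
import json


def decode_jwt_segment(segment):
    """Decode one base64url segment, restoring the padding JWTs strip off."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_jwt(token):
    """Return (header, payload) as dictionaries; the signature is ignored."""
    header_b64, payload_b64, _signature = token.split(".")
    return decode_jwt_segment(header_b64), decode_jwt_segment(payload_b64)


def _encode(obj):
    """Helper used only to build a fake token for this demonstration."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# A made-up, unsigned token so the sketch runs without a real credential.
fake_token = ".".join([
    _encode({"alg": "none", "typ": "JWT"}),
    _encode({"sub": "example-user", "note": "made-up claims"}),
    "signature-placeholder",
])

header, payload = inspect_jwt(fake_token)
print(header)
print(payload)
```

To look inside a real token, `inspect_jwt(ACCESS_TOKEN)` could be called instead; avoid printing or committing the token itself.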

## Define a function to make the ORES API request

The API request is made through a function that encapsulates the call and makes it reusable in other notebooks. The function is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision ID on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.
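
One possible way to handle the "two highly probable levels" case is to flag a result as ambiguous when the top two probabilities are close. The sketch below is an illustration only: the function `pick_quality` and the 0.1 margin are assumptions, not part of ORES or of this notebook. The first probability set is the one returned for 'Abdul Baqi Turkistani' in Example 1; the second is made up.

```python
def pick_quality(probabilities, min_margin=0.1):
    """Return (best_label, runner_up, ambiguous) for a probability dictionary."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (second, p_second) = ranked[0], ranked[1]
    # Flag the prediction when the top two levels are nearly tied.
    ambiguous = (p_best - p_second) < min_margin
    return best, second, ambiguous


clear_case = {
    "B": 0.007563164541978764,
    "C": 0.010571998067933679,
    "FA": 0.0014567448768872152,
    "GA": 0.00350824677167893,
    "Start": 0.04243220742433865,
    "Stub": 0.9344676383171826,
}
close_case = {"B": 0.41, "C": 0.39, "Start": 0.20}  # hypothetical values

print(pick_quality(clear_case))  # ('Stub', 'Start', False)
print(pick_quality(close_case))  # ('B', 'C', True)
```

This notebook simply uses the "prediction" field returned by the service; a margin check like this would only matter if ambiguous articles needed separate treatment.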

In [48]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):
    
    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


**Example 1** - Call the function for one specific article title by passing in three items: revision ID, email, and access token. The following code block references Dr. David W. McDonald's example code, edited to confirm the constants configuration and success in calling the ORES function above.

In [50]:
#   Which article - the key for the article dictionary defined above
article_title = "Abdul Baqi Turkistani"
#
print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {ARTICLE_REVISIONS[article_title]['rev_id']:d}")
#
#    Make the call, just pass in the article revision ID, email address, and access token
#    (each ARTICLE_REVISIONS entry is a dictionary, so the integer ID is read from its 'rev_id' key)
score = request_ores_score_per_article(article_revid=ARTICLE_REVISIONS[article_title]['rev_id'],
                                       email_address="kilpas@uw.edu",
                                       access_token=ACCESS_TOKEN)
#
#    Output the result
print(json.dumps(score,indent=4))
#

Getting LiftWing ORES scores for 'Abdul Baqi Turkistani' with revid: 1231655023
{
    "enwiki": {
        "models": {
            "articlequality": {
                "version": "0.9.2"
            }
        },
        "scores": {
            "1231655023": {
                "articlequality": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.007563164541978764,
                            "C": 0.010571998067933679,
                            "FA": 0.0014567448768872152,
                            "GA": 0.00350824677167893,
                            "Start": 0.04243220742433865,
                            "Stub": 0.9344676383171826
                        }
                    }
                }
            }
        }
    }
}


# Call the function to get ORES Scores for all article titles

This block of code references Dr. McDonald's example code, generalized for a larger dictionary of article titles and revision IDs. The successful predictions are saved to a file titled 'politicians_final_ORES_scores.json'.

The code takes approximately 4 hours to run, and outputs a list of article titles whose predictions are not available through the ORES tool. These articles will not be included in the analysis due to lacking the necessary prediction data.
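
A rough runtime estimate can be sketched from the throttle settings. Note that the loop below sleeps once inside `request_ores_score_per_article` and once more at the end of each iteration, so two waits per article are assumed. The function name, the article count of 7000, and the zero response time are hypothetical placeholders; the real run also includes network and model time, so this is a lower bound rather than a prediction.

```python
def estimate_runtime_hours(n_articles, wait_seconds=0.718, waits_per_article=2,
                           response_seconds=0.0):
    """Estimate total hours: (throttle waits + response time) per article."""
    per_article = waits_per_article * wait_seconds + response_seconds
    return n_articles * per_article / 3600.0


# Hypothetical article count, used only to show the arithmetic.
print(f"{estimate_runtime_hours(7000):.2f} hours")
```

Passing a measured average for `response_seconds` would bring the estimate closer to the observed runtime.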

In [54]:
# Initialize a dictionary to store the final ORES scores for all articles
final_ores_scores = {}

# Initialize a list to store article titles that failed to get an ORES score
failed_articles = []

# Iterate over all article titles and their corresponding data in ARTICLE_REVISIONS
for article_title, info in ARTICLE_REVISIONS.items():
    
    # Access the rev_id from the ARTICLE_REVISIONS dictionary
    rev_id = info['rev_id']

    # Print the message with the correct formatting for rev_id
    print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {rev_id}")
    
    try:
        # Make the call, passing in the article revision ID, email address, and access token
        score = request_ores_score_per_article(article_revid=rev_id,
                                               email_address="kilpas@uw.edu",
                                               access_token=ACCESS_TOKEN)
        
        if score:  # If the score was successfully returned
            final_ores_scores[article_title] = {
                "rev_id": rev_id,
                "ores_score": score
            }
            # Output the result for debugging purposes (optional)
            # print(json.dumps(score, indent=4))
        else:
            raise ValueError("No ORES score returned.")

    except Exception as e:
        # Handle any errors (e.g., failed request, missing data, etc.)
        print(f"Error fetching ORES score for article '{article_title}' (rev_id: {rev_id}): {e}")
        failed_articles.append(article_title)  # Add the article title to the failed list
    
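    # Optional: request_ores_score_per_article already sleeps for API_THROTTLE_WAIT before each request,
    # so this extra delay roughly doubles the wait per article; keep it only for additional headroom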
    time.sleep(API_THROTTLE_WAIT)

# Save the successful ORES scores to a JSON file
with open('politicians_final_ORES_scores.json', 'w') as output_file:
    json.dump(final_ores_scores, output_file, indent=4)

# Print and save the list of articles that failed to get a score
if failed_articles:
    print("\nThe following articles failed to get an ORES score:")
    for article in failed_articles:
        print(f"- {article}")
    
    # Save the failed articles to a separate JSON file for logging purposes
    with open('failed_politicians_ORES_scores.json', 'w') as failed_file:
        json.dump(failed_articles, failed_file, indent=4)
else:
    print("\nAll articles were successfully scored.")


Getting LiftWing ORES scores for 'Abdul Baqi Turkistani' with revid: 1231655023
Getting LiftWing ORES scores for 'Abdul Ghani Ghani' with revid: 1227026187
Getting LiftWing ORES scores for 'Abdul Rahim Ayoubi' with revid: 1226326055
Getting LiftWing ORES scores for 'Ahmad Wali Massoud' with revid: 1221720658
Getting LiftWing ORES scores for 'Aimal Faizi' with revid: 1185105938
Getting LiftWing ORES scores for 'Amir Muhammad Akhundzada' with revid: 1247931713
Getting LiftWing ORES scores for 'Aziza Ahmadyar' with revid: 1195651393
Getting LiftWing ORES scores for 'Azizullah Lodin' with revid: 1247762293
Getting LiftWing ORES scores for 'Baran Khan Kudezai' with revid: 1176481824
Getting LiftWing ORES scores for 'Bashir Ahmad Bezan' with revid: 1248505877
Getting LiftWing ORES scores for 'Cheragh Ali Cheragh' with revid: 1193992206
Getting LiftWing ORES scores for 'Ezatullah (Nangarhar)' with revid: 1158302291
Getting LiftWing ORES scores for 'Fazel Ahmed Manawi' with revid: 1234514379
G

# Clean ORES Output for Data Analysis

This code contains a function called extract_predictions that creates a dictionary retaining only article title and its corresponding revision ID and prediction. 

It also prints the titles of articles whose requests did not previously trip the error in the ORES function, but whose responses lack sufficient prediction data to be analyzed. These articles will not be included in the analysis due to their incomplete data.
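
As an alternative to catching `KeyError`, the nested lookup can be written as a small helper that returns `None` when any key along the path is missing. This is only a sketch: the helper `dig` is hypothetical, the "good" response mirrors the structure shown in Example 1, and the "bad" response is a made-up stand-in for a reply without the 'enwiki' key.

```python
def dig(mapping, *keys):
    """Walk nested dictionaries; return None if any key on the path is absent."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


rev_id = "1231655023"
good = {"enwiki": {"scores": {rev_id: {"articlequality": {"score": {"prediction": "Stub"}}}}}}
bad = {"error": "hypothetical payload without an 'enwiki' key"}

path = ("enwiki", "scores", rev_id, "articlequality", "score", "prediction")
print(dig(good, *path))  # Stub
print(dig(bad, *path))   # None
```

Returning `None` keeps the loop going in the same way the `except KeyError` branch does, at the cost of not reporting which key was missing.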

In [60]:
# Load the JSON data from the file
with open('politicians_final_ORES_scores.json', 'r') as json_file:
    input_dict = json.load(json_file)

# Function to extract the prediction for each entry in the dictionary
def extract_predictions(article_dict):
    result = {}
    
    for name, details in article_dict.items():
        try:
            # Get the revision ID as a string
            rev_id = str(details['rev_id'])
            
            # Index through the nested structure to get the 'prediction'
            prediction = details['ores_score']['enwiki']['scores'][rev_id]['articlequality']['score']['prediction']
            
            # Store the name, rev_id, and prediction in the result dictionary
            result[name] = {
                'rev_id': rev_id,
                'prediction': prediction
            }
        except KeyError as e:
            # If any required key is missing, print an error message and continue
            print(f"KeyError for {name}: {e}")
    
    return result

# Extract predictions for all entries in the dictionary
predictions_dict = extract_predictions(input_dict)

# Print the result
for name, data in predictions_dict.items():
    print(f"{name}: rev_id = {data['rev_id']}, prediction = {data['prediction']}")


KeyError for Josip Ferfolja: 'enwiki'
KeyError for Nerio Giovanazzi: 'enwiki'
KeyError for Bernice Dahn: 'enwiki'
KeyError for Karol Dembovský: 'enwiki'
Abdul Baqi Turkistani: rev_id = 1231655023, prediction = Stub
Abdul Ghani Ghani: rev_id = 1227026187, prediction = Stub
Abdul Rahim Ayoubi: rev_id = 1226326055, prediction = Start
Ahmad Wali Massoud: rev_id = 1221720658, prediction = Start
Aimal Faizi: rev_id = 1185105938, prediction = Stub
Amir Muhammad Akhundzada: rev_id = 1247931713, prediction = Start
Aziza Ahmadyar: rev_id = 1195651393, prediction = Start
Azizullah Lodin: rev_id = 1247762293, prediction = Start
Baran Khan Kudezai: rev_id = 1176481824, prediction = Start
Bashir Ahmad Bezan: rev_id = 1248505877, prediction = Start
Cheragh Ali Cheragh: rev_id = 1193992206, prediction = Start
Ezatullah (Nangarhar): rev_id = 1158302291, prediction = Stub
Fazel Ahmed Manawi: rev_id = 1234514379, prediction = Start
Gajinder Singh Safri: rev_id = 1212323536, prediction = Stub
Ghulam Ghaus

# Merge Data for Analysis

This block of code loads the geography data and population data, combining them into the following files: 'politicians_by_country_no_ARTQUAL.csv' and 'wp_countries-no_match.txt'

- 'politicians_by_country_no_ARTQUAL.csv' is an intermediary file that will later be merged with article quality data (from the ORES tool predictions) to be included in the analysis.

- 'wp_countries-no_match.txt' is a final file that contains the list of countries not represented in the source data of politicians with Wikipedia articles. However, that does not mean these countries do not have politicians with Wikipedia articles.

Countries whose population is recorded as 0 will be excluded from the analysis. The population data is expressed in millions, so very small countries are rounded down to 0 and cannot be used in per capita calculations.
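
The matching step can be illustrated on toy data with an outer merge and pandas' `indicator=True` option, which labels each row as `both`, `left_only`, or `right_only`. The politician names and the country "Atlantis" are hypothetical; the three population values are the ones printed for Algeria, Egypt, and Libya in the output further down.

```python
import pandas as pd

politicians = pd.DataFrame({
    "name": ["Politician 1", "Politician 2", "Politician 3"],  # made-up names
    "country": ["Algeria", "Egypt", "Atlantis"],                # Atlantis has no population row
})
geo = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Libya"],
    "region": ["NORTHERN AFRICA"] * 3,
    "population": [46.8, 105.2, 6.9],  # in millions
})

merged = pd.merge(politicians, geo, how="outer", on="country", indicator=True)

# Countries that appear in only one of the two tables.
unmatched = merged.loc[merged["_merge"] != "both", "country"].tolist()
# Rows that matched on both sides, with the helper column dropped.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(sorted(unmatched))
print(matched)
```

Here "Atlantis" is unmatched because it has politicians but no population row, and "Libya" is unmatched because it has a population row but no politicians.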

The following code has been edited by a text-generator. 

ChatGPT prompt:
- "Please load the population and geography data and create a function in python that will merge the two based on country. Then, identify the unmatched countries and output them in a text file"

In [84]:
# Load the population and geography data, return as a pandas DataFrame
def load_geography_population(file_path):
    geography_data = []
    no_population_countries = []
    current_region = None
    
    with open(file_path, mode='r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip the header
        
        for row in reader:
            # Ensure there are exactly two columns and both are non-empty
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                print(f"Incomplete or missing data in row: {row}")
                continue
            
            name, population = row[0].strip(), row[1].strip()
            
            # If name is in all caps, it's a region
            if name.isupper():
                current_region = name
            else:
                # Check if population is empty or zero
                try:
                    population = float(population)
                except ValueError:
                    population = 0  # If the population is not a number, treat it as 0

                if population == 0:
                    no_population_countries.append(name)
                
                # Add the data to the list
                geography_data.append({
                    'country': name,
                    'region': current_region,
                    'population': population
                })
    
    # Convert to pandas DataFrame
    df = pd.DataFrame(geography_data)
    
    # Print first 5 rows
    print(df.head())
    print("The following countries have less than 1 million inhabitants,\n and will be excluded from the analysis:", no_population_countries)
    return df, no_population_countries

# Load the politicians_by_country CSV data
def load_politicians_data(file_path):
    politicians_df = pd.read_csv(file_path)
    return politicians_df

# Merge the two datasets and identify unmatched countries
def merge_datasets(geo_df, politicians_df, output_csv, no_match_txt):
    # Merge the two dataframes on 'country'
    merged_df = pd.merge(politicians_df, geo_df, how='outer', on='country', indicator=True)
    
    # Identify countries with no match in the geography data
    no_match_countries = merged_df[merged_df['_merge'] != 'both']['country'].dropna().unique()
    
    # Write no match countries to a text file
    with open(no_match_txt, 'w') as file:
        for country in no_match_countries:
            file.write(f"{country}\n")
    
    # Filter out the rows with unmatched countries and keep only successful merges
    # matched_df = merged_df[merged_df['_merge'] == 'both'].drop(columns=['_merge'])
    matched_df = merged_df[merged_df['_merge'] == 'both'].drop(columns=['_merge', 'url'])
    
    # Save the merged dataset to a CSV file
    matched_df.to_csv(output_csv, index=False)

    return matched_df

# File paths
geo_pop_file = 'population_by_country_AUG.2024.csv'  # The first CSV file
politicians_file = 'politicians_by_country_AUG.2024.csv'  # The second CSV file
output_csv = 'politicians_by_country_no_ARTQUAL.csv'
no_match_txt = 'wp_countries-no_match.txt'

# Load the data
geo_pop_data, no_population_countries = load_geography_population(geo_pop_file)
politicians_df = load_politicians_data(politicians_file)

# Merge datasets and handle unmatched countries
merged_data = merge_datasets(geo_pop_data, politicians_df, output_csv, no_match_txt)

print(f"Merged data has been written to {output_csv}")


   country           region  population
0  Algeria  NORTHERN AFRICA        46.8
1    Egypt  NORTHERN AFRICA       105.2
2    Libya  NORTHERN AFRICA         6.9
3  Morocco  NORTHERN AFRICA        37.0
4    Sudan  NORTHERN AFRICA        48.1
The following countries have less than 1 million inhabitants,
 and will be excluded from the analysis: ['Liechtenstein', 'Monaco', 'San Marino', 'Nauru', 'Palau', 'Tuvalu']
Merged data has been written to politicians_by_country_no_ARTQUAL.csv


# Create Final File for Data Analysis

This block of code creates a final file titled 'wp_politicians_by_country.csv' with six columns: 
- country
- region 
- population
- article title
- revision ID
- article quality

It also outputs a list of the article names from the source dataset with no revision IDs or predictions, as these steps merge source data with processed data.
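
For larger tables, the row-by-row loop can also be expressed with `Series.map`. The sketch below is one possible alternative and not the notebook's method. The two scored entries reuse titles, revision IDs, and predictions printed earlier in this notebook, while "Unscored Person" is a hypothetical title with no prediction.

```python
import pandas as pd

predictions = {
    "Abdul Baqi Turkistani": {"rev_id": "1231655023", "prediction": "Stub"},
    "Abdul Rahim Ayoubi": {"rev_id": "1226326055", "prediction": "Start"},
}

df = pd.DataFrame({
    "name": ["Abdul Baqi Turkistani", " Abdul Rahim Ayoubi ", "Unscored Person"],
})

# Strip stray whitespace once, then look every title up in the dictionary.
titles = df["name"].str.strip()
df["revision_id"] = titles.map(lambda t: predictions.get(t, {}).get("rev_id"))
df["article_quality"] = titles.map(lambda t: predictions.get(t, {}).get("prediction"))

# Titles with no prediction end up with missing values in the new columns.
missing = titles[df["article_quality"].isna()].tolist()
print(df)
print("No match found for:", missing)
```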

In [85]:
# Load the merged CSV file
csv_file = 'politicians_by_country_no_ARTQUAL.csv'
df = pd.read_csv(csv_file)

# Create new columns for 'revision_id' and 'article_quality' initialized with None
df['revision_id'] = None
df['article_quality'] = None

# Iterate through the DataFrame and populate the 'revision_id' and 'article_quality' columns
for index, row in df.iterrows():
    article_title = row['name'].strip()  # Assuming 'name' is the article title in CSV, remove leading/trailing spaces
    if article_title in predictions_dict:
        df.at[index, 'revision_id'] = predictions_dict[article_title]['rev_id']
        df.at[index, 'article_quality'] = predictions_dict[article_title]['prediction']
    else:
        print(f"No match found for {article_title} in the dictionary")

# Save the updated DataFrame to a new CSV file
output_csv = 'wp_politicians_by_country.csv'
df.to_csv(output_csv, index=False)

print(f"Updated data has been saved to {output_csv}")

No match found for Barbara Eibinger-Miedl in the dictionary
No match found for Mehrali Gasimov in the dictionary
No match found for Kyaw Myint in the dictionary
No match found for André Ngongang Ouandji in the dictionary
No match found for Tomás Pimentel in the dictionary
No match found for Richard Sumah in the dictionary
No match found for Carlos Pineda (politician) in the dictionary
No match found for János Ghyczy in the dictionary
No match found for Josip Ferfolja in the dictionary
No match found for Nerio Giovanazzi in the dictionary
No match found for Bernice Dahn in the dictionary
No match found for Segun ''Aeroland'' Adewale in the dictionary
No match found for Karol Dembovský in the dictionary
No match found for Klemen Boštjančič in the dictionary
No match found for Josip Ferfolja in the dictionary
No match found for Bashir Bililiqo in the dictionary
Updated data has been saved to wp_politicians_by_country.csv


# Edit Population Column

Converts the 'population' column in the 'wp_politicians_by_country.csv' file from millions to absolute counts for analysis.
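
The population file stores values in millions (for example 46.8 for Algeria), so scaling means multiplying by 1,000,000. A minimal sketch on toy rows is given here; the helper `to_absolute_population` is hypothetical and returns a new DataFrame rather than modifying its input.

```python
import pandas as pd


def to_absolute_population(df, column="population", factor=1_000_000):
    """Return a copy of df with `column` scaled from millions to head counts."""
    out = df.copy()
    out[column] = out[column] * factor
    return out


toy = pd.DataFrame({"country": ["Algeria", "Libya"], "population": [46.8, 6.9]})
converted = to_absolute_population(toy)
print(converted)
```

Because the cell below reads and overwrites the same CSV file, running it twice would apply the factor twice; writing the result to a separate file or column is one way to avoid that.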

In [90]:
# Load the dataset
file_name = 'wp_politicians_by_country.csv'
df = pd.read_csv(file_name)

# The source population values are expressed in millions; scale them to absolute counts
df['population'] = df['population'] * 1_000_000  # e.g. 46.8 -> 46,800,000

# Save the updated DataFrame back to the CSV file
df.to_csv(file_name, index=False)

# Add Columns for Aggregate Article Analysis

The function create_cleaned_analysis creates two new columns, the number of articles per country, and the number of high quality articles per country.
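
The counting idea can be shown on a few made-up rows: articles rated "FA" or "GA" count as high quality, and every row counts toward the total. This sketch uses a single grouped aggregation, which is an alternative formulation and not the notebook's code; the country labels and quality values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "article_quality": ["FA", "Stub", "GA", "Start", "C"],
})

# Boolean flag: True for featured (FA) or good (GA) articles.
is_high = toy["article_quality"].isin(["FA", "GA"])

summary = (
    toy.assign(high=is_high)
       .groupby("country")
       .agg(number_of_articles=("article_quality", "size"),
            number_of_high_quality_articles=("high", "sum"))
       .reset_index()
)
print(summary)
```

Summing the boolean flag gives 0 for countries without any high quality article, so no separate fill step is needed in this formulation.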

The following code has been edited by a text-generator. 

ChatGPT prompt:
- "Please count the amount of article quality values containing "FA" and "GA", per country, and per region, and output the following as two new columns in the original dataset"

In [103]:
def create_cleaned_analysis(data_file, output_file='wp_cleaned_analysis.csv'):
    # Load the dataset
    df = pd.read_csv(data_file)

    # Calculate the number of high-quality articles (FA and GA) per country and region
    high_quality_count = df[df['article_quality'].isin(['FA', 'GA'])].groupby(['country', 'region']).size().reset_index(name='number_of_high_quality_articles')

    # Calculate the total number of articles per country and region
    article_count = df.groupby(['country', 'region']).size().reset_index(name='number_of_articles')

    # Create a DataFrame with all countries, regions, and their populations
    population_data = df[['country', 'region', 'population']].drop_duplicates()

    # Merge the high-quality article counts with the population data
    output_df = pd.merge(population_data, high_quality_count, on=['country', 'region'], how='left')

    # Merge the total article counts with the output DataFrame
    output_df = pd.merge(output_df, article_count, on=['country', 'region'], how='left')

    # Fill NaN values with 0 for countries without high-quality articles
    output_df['number_of_high_quality_articles'] = output_df['number_of_high_quality_articles'].fillna(0)

    # Display the output DataFrame
    display(output_df)

    # Save the output DataFrame to a new CSV file
    output_df.to_csv(output_file, index=False)

# Example usage
create_cleaned_analysis('wp_politicians_by_country.csv')


Unnamed: 0,country,region,population,number_of_high_quality_articles,number_of_articles
0,Afghanistan,SOUTH ASIA,42400000.0,3.0,85
1,Albania,SOUTHERN EUROPE,2700000.0,7.0,70
2,Algeria,NORTHERN AFRICA,46800000.0,1.0,71
3,Angola,MIDDLE AFRICA,36700000.0,2.0,58
4,Antigua and Barbuda,CARIBBEAN,100000.0,0.0,33
...,...,...,...,...,...
161,Venezuela,SOUTH AMERICA,28800000.0,1.0,56
162,Vietnam,SOUTHEAST ASIA,98900000.0,2.0,36
163,Yemen,WESTERN ASIA,34400000.0,0.0,32
164,Zambia,EASTERN AFRICA,20200000.0,0.0,3
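
The left merge followed by `fillna(0)` is what keeps countries with no FA or GA articles in the output. The sketch below reproduces that step on made-up rows. The country names, regions, and counts are hypothetical and are not taken from the dataset. It also shows an optional cast back to integers, since the merge turns the count column into floats (hence values such as `3.0` in the table above).

```python
import pandas as pd

# Hypothetical toy data: one row per article (not from the real dataset)
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "region": ["R1", "R1", "R1", "R2", "R2"],
    "population": [1_000_000.0] * 3 + [500_000.0] * 2,
    "article_quality": ["FA", "Stub", "GA", "Start", "C"],
})

keys = ["country", "region"]

# Count FA/GA articles; country B has none, so it is absent from this frame
high_quality = (
    df[df["article_quality"].isin(["FA", "GA"])]
    .groupby(keys).size().reset_index(name="number_of_high_quality_articles")
)
totals = df.groupby(keys).size().reset_index(name="number_of_articles")
population = df[keys + ["population"]].drop_duplicates()

# Left merges keep every country; missing counts arrive as NaN
out = population.merge(high_quality, on=keys, how="left").merge(totals, on=keys, how="left")

# Fill the gaps with 0 and cast back to int (assumed optional cleanup step)
out["number_of_high_quality_articles"] = (
    out["number_of_high_quality_articles"].fillna(0).astype(int)
)
print(out)
```

With an inner merge instead of `how='left'`, country B would be dropped entirely and would never appear in the bottom-10 tables.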


# Create Separate Files for Analysis by Country or by Region

This function creates two intermediate files. One holds article counts and high-quality article counts per capita by country; the other holds the same metrics per capita by region. Both files are inputs to the table-generation step later in the notebook.

The following code has been edited by a text-generator. 

ChatGPT prompt:
- "Please iterate the code to create two calculations, high-quality articles and total articles by country, and high quality articles and total articles by region."

In [104]:
def calculate_articles_per_capita(data_file, group_by='country', output_file_country='wp_cleaned_by_country.csv', output_file_region='wp_cleaned_by_region.csv'):
    # Load the cleaned analysis data
    df = pd.read_csv(data_file)

    # Replace any zero populations with NaN to avoid division errors
    df['population'] = df['population'].replace(0, np.nan)

    if group_by == 'country':
        # Group by country and calculate per capita metrics
        df_grouped = df.groupby('country').agg({
            'number_of_articles': 'sum',
            'number_of_high_quality_articles': 'sum',
            'population': 'mean'  # Assuming population doesn't change within a country
        }).reset_index()

        # Calculate articles per capita
        df_grouped['articles_per_capita'] = df_grouped['number_of_articles'] / df_grouped['population']

        # Calculate high-quality articles per capita
        df_grouped['high_quality_articles_per_capita'] = df_grouped['number_of_high_quality_articles'] / df_grouped['population']

        # Save the results to the country-based output file
        df_grouped.to_csv(output_file_country, index=False)

        # Display the updated DataFrame for country
        display(df_grouped.head())

    elif group_by == 'region':
        # Group by region and calculate per capita metrics
        df_grouped = df.groupby('region').agg({
            'number_of_articles': 'sum',
            'number_of_high_quality_articles': 'sum',
            'population': 'sum'  # Summing populations within a region
        }).reset_index()

        # Calculate articles per capita
        df_grouped['articles_per_capita'] = df_grouped['number_of_articles'] / df_grouped['population']

        # Calculate high-quality articles per capita
        df_grouped['high_quality_articles_per_capita'] = df_grouped['number_of_high_quality_articles'] / df_grouped['population']

        # Save the results to the region-based output file
        df_grouped.to_csv(output_file_region, index=False)

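        # Display the updated DataFrame for region
        display(df_grouped.head())

    else:
        # Fail loudly instead of silently doing nothing on an unsupported value
        raise ValueError("group_by must be 'country' or 'region'")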

# Example usage:
# To calculate per capita by country:
calculate_articles_per_capita('wp_cleaned_analysis.csv', group_by='country')

# To calculate per capita by region:
calculate_articles_per_capita('wp_cleaned_analysis.csv', group_by='region')


Unnamed: 0,country,number_of_articles,number_of_high_quality_articles,population,articles_per_capita,high_quality_articles_per_capita
0,Afghanistan,85,3.0,42400000.0,2e-06,7.075472e-08
1,Albania,70,7.0,2700000.0,2.6e-05,2.592593e-06
2,Algeria,71,1.0,46800000.0,2e-06,2.136752e-08
3,Angola,58,2.0,36700000.0,2e-06,5.449591e-08
4,Antigua and Barbuda,33,0.0,100000.0,0.00033,0.0


Unnamed: 0,region,number_of_articles,number_of_high_quality_articles,population,articles_per_capita,high_quality_articles_per_capita
0,CARIBBEAN,219,9.0,36600000.0,5.983607e-06,2.459016e-07
1,CENTRAL AMERICA,188,9.0,51300000.0,3.664717e-06,1.754386e-07
2,CENTRAL ASIA,106,5.0,80400000.0,1.318408e-06,6.218905e-08
3,EAST ASIA,152,3.0,1562700000.0,9.726755e-08,1.919754e-09
4,EASTERN AFRICA,665,17.0,480900000.0,1.382824e-06,3.535038e-08
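
The region branch divides summed article counts by summed population, which is not the same as averaging the country-level per capita values. The sketch below contrasts the two on a hypothetical two-country region. All names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical country-level rows for a single region
df = pd.DataFrame({
    "country": ["Small", "Large"],
    "region": ["R1", "R1"],
    "population": [100_000.0, 10_000_000.0],
    "number_of_articles": [30, 50],
})

# Ratio of sums: total articles divided by total population,
# the same calculation as the region branch of calculate_articles_per_capita
region = df.groupby("region")[["number_of_articles", "population"]].sum()
ratio_of_sums = region["number_of_articles"] / region["population"]

# Mean of per-country ratios: weights both countries equally,
# so the small country dominates the result
df["articles_per_capita"] = df["number_of_articles"] / df["population"]
mean_of_ratios = df.groupby("region")["articles_per_capita"].mean()

print("ratio of sums: ", ratio_of_sums["R1"])
print("mean of ratios:", mean_of_ratios["R1"])
```

In this toy case the mean of ratios is roughly 19 times larger than the ratio of sums, which is why the regional figures are computed from summed counts and summed populations.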


# Generate and Embed Tables From Analysis

The following block generates six tables:

- Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
- Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
- Top 10 countries by high quality: The 10 countries with the highest high-quality articles per capita (in descending order).
- Bottom 10 countries by high quality: The 10 countries with the lowest high-quality articles per capita (in ascending order).
- Geographic regions by total coverage: A rank-ordered list of geographic regions (in descending order) by total articles per capita.
- Geographic regions by high quality coverage: A rank-ordered list of geographic regions (in descending order) by high-quality articles per capita.


In [105]:
def generate_tables(country_file='wp_cleaned_by_country.csv', region_file='wp_cleaned_by_region.csv'):
    # Load the country-level and region-level datasets
    df_country = pd.read_csv(country_file)
    df_region = pd.read_csv(region_file)

    # 1. Top 10 countries by total articles per capita (descending order)
    top_10_countries_by_coverage = df_country[['country', 'articles_per_capita']].sort_values(by='articles_per_capita', ascending=False).head(10)
    print("Top 10 countries by coverage:")
    display(top_10_countries_by_coverage)

    # 2. Bottom 10 countries by total articles per capita (ascending order)
    bottom_10_countries_by_coverage = df_country[['country', 'articles_per_capita']].sort_values(by='articles_per_capita', ascending=True).head(10)
    print("\nBottom 10 countries by coverage:")
    display(bottom_10_countries_by_coverage)

    # 3. Top 10 countries by high quality articles per capita (descending order)
    top_10_countries_by_high_quality = df_country[['country', 'high_quality_articles_per_capita']].sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)
    print("\nTop 10 countries by high quality articles:")
    display(top_10_countries_by_high_quality)

    # 4. Bottom 10 countries by high quality articles per capita (ascending order)
    bottom_10_countries_by_high_quality = df_country[['country', 'high_quality_articles_per_capita']].sort_values(by='high_quality_articles_per_capita', ascending=True).head(10)
    print("\nBottom 10 countries by high quality articles:")
    display(bottom_10_countries_by_high_quality)

    # 5. Geographic regions by total coverage (articles per capita in descending order)
    regions_by_total_coverage = df_region[['region', 'articles_per_capita']].sort_values(by='articles_per_capita', ascending=False)
    print("\nGeographic regions by total coverage:")
    display(regions_by_total_coverage)

    # 6. Geographic regions by high quality articles per capita (descending order)
    regions_by_high_quality_coverage = df_region[['region', 'high_quality_articles_per_capita']].sort_values(by='high_quality_articles_per_capita', ascending=False)
    print("\nGeographic regions by high quality articles:")
    display(regions_by_high_quality_coverage)

# Example usage
generate_tables()


Top 10 countries by coverage:


Unnamed: 0,country,articles_per_capita
4,Antigua and Barbuda,0.00033
51,Federated States of Micronesia,0.00014
93,Marshall Islands,0.00013
149,Tonga,0.0001
12,Barbados,8.3e-05
125,Seychelles,6e-05
98,Montenegro,6e-05
17,Bhutan,5.5e-05
90,Maldives,5.5e-05
121,Samoa,4e-05
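
Per capita values this small are hard to compare at a glance. One option, sketched below, is to rescale them to articles per one million people. The `articles_per_million` column is a hypothetical addition and is not produced by the notebook; the three input values are copied from the first rows of the table above.

```python
import pandas as pd

# First three rows of the "Top 10 countries by coverage" table
top = pd.DataFrame({
    "country": [
        "Antigua and Barbuda",
        "Federated States of Micronesia",
        "Marshall Islands",
    ],
    "articles_per_capita": [0.00033, 0.00014, 0.00013],
})

# Hypothetical extra column: articles per one million people
top["articles_per_million"] = (top["articles_per_capita"] * 1_000_000).round(1)
print(top.to_string(index=False))
```

The ranking is unchanged by the rescaling; only the units differ.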



Bottom 10 countries by coverage:


Unnamed: 0,country,articles_per_capita
31,China,1.133707e-08
66,India,1.056979e-07
57,Ghana,1.173021e-07
122,Saudi Arabia,1.355014e-07
164,Zambia,1.485149e-07
108,Norway,1.818182e-07
70,Israel,2.040816e-07
45,Egypt,3.041825e-07
37,Cote d'Ivoire,3.236246e-07
50,Ethiopia,3.478261e-07



Top 10 countries by high quality articles:


Unnamed: 0,country,high_quality_articles_per_capita
98,Montenegro,5e-06
86,Luxembourg,2.857143e-06
1,Albania,2.592593e-06
76,Kosovo,2.352941e-06
90,Maldives,1.666667e-06
85,Lithuania,1.37931e-06
38,Croatia,1.315789e-06
62,Guyana,1.25e-06
111,Palestinian Territory,1.090909e-06
129,Slovenia,9.52381e-07



Bottom 10 countries by high quality articles:


Unnamed: 0,country,high_quality_articles_per_capita
165,Zimbabwe,0.0
34,Congo,0.0
77,Kuwait,0.0
137,St. Lucia,0.0
37,Cote d'Ivoire,0.0
136,St. Kitts and Nevis,0.0
130,Solomon Islands,0.0
40,Cyprus,0.0
127,Singapore,0.0
42,Djibouti,0.0
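
Every row in the table above is 0.0, so the ten countries shown are simply the ten zero-valued rows that the sort happened to place first. A possible way to make the tie visible is to count the zero-valued countries and to use `nsmallest` with `keep='all'`, which returns every row tied with the last selected value. The sketch below uses hypothetical data with the same column name as the country file.

```python
import pandas as pd

# Hypothetical per-country values; three countries tie at zero
df_country = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "high_quality_articles_per_capita": [0.0, 2.5e-06, 0.0, 0.0],
})

col = "high_quality_articles_per_capita"

# How many countries have no high-quality articles at all
n_zero = int((df_country[col] == 0).sum())

# Ask for the 2 smallest values; keep="all" also returns the third tied row
bottom = df_country.nsmallest(2, col, keep="all")

print(f"{n_zero} countries tie at zero")
print(bottom)
```

Reporting the size of the tie alongside the table avoids implying that the listed countries rank below other countries with the same value.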



Geographic regions by total coverage:


Unnamed: 0,region,articles_per_capita
8,NORTHERN EUROPE,6.870504e-06
9,OCEANIA,6.486486e-06
0,CARIBBEAN,5.983607e-06
14,SOUTHERN EUROPE,5.260726e-06
1,CENTRAL AMERICA,3.664717e-06
17,WESTERN EUROPE,2.746828e-06
5,EASTERN EUROPE,2.663411e-06
16,WESTERN ASIA,2.064997e-06
13,SOUTHERN AFRICA,1.800878e-06
4,EASTERN AFRICA,1.382824e-06



Geographic regions by high quality articles:


Unnamed: 0,region,high_quality_articles_per_capita
14,SOUTHERN EUROPE,3.49835e-07
8,NORTHERN EUROPE,3.23741e-07
0,CARIBBEAN,2.459016e-07
1,CENTRAL AMERICA,1.754386e-07
5,EASTERN EUROPE,1.427498e-07
13,SOUTHERN AFRICA,1.171303e-07
17,WESTERN EUROPE,1.158301e-07
16,WESTERN ASIA,9.140149e-08
9,OCEANIA,9.009009e-08
7,NORTHERN AFRICA,6.64322e-08
