### Loading the Data

We begin by loading two datasets:

1. **Politicians Data**: This dataset (`politicians_by_country_AUG.2024.csv`) lists Wikipedia articles on politicians from various countries. Each row holds the politician's name, the article URL, and the associated country. Revision IDs and ORES-predicted quality scores are not part of this file; they are retrieved later in the notebook.

2. **Population Data**: This dataset (`population_by_country_AUG.2024.csv`) provides population information for countries and regions, with population values reported in millions.

The following code loads both datasets and prints the first few rows of each to inspect their structure.

In [4]:
import pandas as pd

#Load the politicians data
politicians_df = pd.read_csv('../data/politicians_by_country_AUG.2024.csv')

#Load the population data
population_df = pd.read_csv('../data/population_by_country_AUG.2024.csv')

#View the first few rows of each dataset
print(politicians_df.head())
politicians_df.count()

print(population_df.head())
population_df.count()


                   name                                                url  \
0        Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1     Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2           Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3  Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4        Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   

       country  
0  Afghanistan  
1  Afghanistan  
2  Afghanistan  
3  Afghanistan  
4  Afghanistan  
         Geography  Population
0            WORLD      8009.0
1           AFRICA      1453.0
2  NORTHERN AFRICA       256.0
3          Algeria        46.8
4            Egypt       105.2


Geography     233
Population    233
dtype: int64

### Checking for Duplicates in the Politicians Data

To ensure data integrity, we check for duplicate entries in the politicians dataset. Duplicate records can arise from various sources, such as errors in data scraping or discrepancies in data entry. Identifying these duplicates is crucial for accurate analysis.

The following steps are taken to identify potential duplicates:

1. **Check for Duplicates Based on All Columns**:  
   We first check for any duplicate entries where all columns (name, URL, and country) are identical.

2. **Check for Duplicates Based on Name and URL**:  
   In some cases, the same article may be listed under more than one country (e.g., the same politician could be listed in multiple countries due to dual citizenship or other reasons). We check the `name` column and the `url` column separately for repeated values to identify such cases.
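
As a minimal sketch of the two checks above, the example below uses a small made-up DataFrame (the rows and the name `toy_df` are illustrative assumptions, not taken from the dataset). By default `duplicated()` flags only the second and later occurrences, while `keep=False` flags every row in a duplicated group, which makes it easier to see which countries a repeated article is listed under.

```python
import pandas as pd

# Hypothetical toy data: one article listed under two countries
toy_df = pd.DataFrame({
    "name": ["A. Example", "A. Example", "B. Sample"],
    "url": ["https://en.wikipedia.org/wiki/A._Example",
            "https://en.wikipedia.org/wiki/A._Example",
            "https://en.wikipedia.org/wiki/B._Sample"],
    "country": ["Country X", "Country Y", "Country X"],
})

# No row is identical across all columns
full_dupes = toy_df[toy_df.duplicated()]

# Default keep='first': only the repeat occurrence is flagged
repeat_rows = toy_df[toy_df.duplicated(subset=["name", "url"])]

# keep=False: every row in a duplicated (name, url) group is flagged
all_listed = toy_df[toy_df.duplicated(subset=["name", "url"], keep=False)]

print(len(full_dupes), len(repeat_rows), len(all_listed))
```

In this toy case the counts differ (0, 1, and 2), which is why a duplicate count from the default setting reflects extra rows rather than the number of rows involved.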


In [5]:
#Check for duplicates based on all columns
duplicate_politicians_all = politicians_df[politicians_df.duplicated()]
print(len(duplicate_politicians_all))

#Check for duplicates based on name and on url (each column is checked separately)
duplicate_politicians_name = politicians_df[politicians_df.duplicated(subset=['name'])]
duplicate_politicians_url = politicians_df[politicians_df.duplicated(subset=['url'])]
print(len(duplicate_politicians_name))
print(len(duplicate_politicians_url))

#No rows are fully identical, but 44 names and 44 urls are repeated.
#This means that the only thing setting them apart is different countries for the same name and url.




0
44
44


### Checking for Duplicates in the Population Data

To ensure the population dataset is clean and free from duplicate entries, we perform the following checks:

1. **Check for Duplicates Based on All Columns**:  
   We first check for any duplicate entries where all columns (e.g., Geography and Population) are identical. This ensures that there are no exact duplicates in the dataset.

2. **Check for Duplicates Based on Geography**:  
   We also check for duplicates based solely on the `Geography` column, which represents the country or region name. This will help identify cases where the same country or region might have multiple entries, possibly with different population values.


In [6]:
#Check for duplicates based on all columns
duplicate_population_all = population_df[population_df.duplicated()]
print(len(duplicate_population_all))

#Check for duplicates based on geography
duplicate_population_geo = population_df[population_df.duplicated(subset=['Geography'])]
print(len(duplicate_population_geo))


0
0


### Handling Politicians Appearing in Multiple Countries

During the data cleaning process, we identified 44 politicians who appear in multiple countries. This duplication is likely due to their serving roles in different countries, either through holding multiple nationalities or by having political positions in various countries. As such, it is logical to include them in all countries they are associated with.

#### Steps Taken:
1. **Retaining Duplicates**:  
   Rather than removing these entries, we keep a copy of the duplicate politicians. This ensures that they are accurately represented in all relevant countries for our analysis. The decision is based on information retrieved via the Wikipedia API, which shows that these politicians serve or are associated with multiple countries.

2. **Saving the Duplicate Politicians**:  
   We create a combined dataset of these duplicate entries and save it for future reference. The duplicate politicians, based on either name or URL, are concatenated and written to a separate CSV file.


In [7]:
# 44 politician entries repeat a name/url already listed under another country (2 or more countries).
# The country was decided (based on the Wikipedia API) either by their nationalities or by the next country
# served, so it makes sense to have them be a part of both/all the countries their names appear in.

# However, we keep a copy of the duplicate politicians
combined_duplicates = pd.concat([duplicate_politicians_name, duplicate_politicians_url]).drop_duplicates()
combined_duplicates.to_csv('../data/combined_duplicates_politicians.csv', index=False)



### Expanding Population Data with Continents and Regions

In this section, we expand the population dataset by adding `Continent` and `Region` columns. The population data contains a mix of continents, regions, and countries, so the goal is to assign each country to its respective continent and region. Some special cases (like "Northern America" and "Oceania") are handled explicitly to ensure proper assignment.

#### Steps:
1. **Copying the Data**:  
   We begin by creating a copy of the original population DataFrame to avoid modifying the original data.

2. **Defining Special Cases**:  
   Some entries, such as "Northern America" and "Oceania", are regions and continents simultaneously. These are handled through the `special_cases` dictionary.

3. **Assigning Continents and Regions**:  
   We loop through the rows of the DataFrame, checking whether a row represents a continent, region, or country:
   - If the `Geography` is in uppercase, it indicates either a continent or region.
   - Countries are assigned the current `Continent` and `Region` based on the most recent continent and region encountered in the loop.

4. **Filtering and Cleaning**:  
   Once the `Continent` and `Region` columns are populated, we filter out rows that represent continents or regions themselves, keeping only the country-level data. Finally, the `is_region` helper column is dropped.

5. **Saving the Expanded Data**:  
   The expanded DataFrame, now containing `Country`, `Continent`, and `Region` information, is saved as `population_expanded.csv`.
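
The steps above can also be sketched without an explicit loop. The example below is a hedged alternative, not the notebook's method: it uses a hypothetical miniature table (`toy_pop`; the first five rows mirror the preview shown earlier, the last two are invented for illustration) and forward-fills the most recent continent and region header onto the country rows. An empty string serves as a temporary "no region" sentinel so that a plain continent header resets the region.

```python
import pandas as pd

# Hypothetical miniature of the population table
toy_pop = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "OCEANIA", "Australia"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2, 45.0, 26.6],
})

continents = {"AFRICA", "NORTHERN AMERICA", "LATIN AMERICA AND THE CARIBBEAN",
              "EUROPE", "ASIA", "OCEANIA"}
# Continents that also act as their own region
special_cases = {"NORTHERN AMERICA", "OCEANIA"}

geo = toy_pop["Geography"]
is_header = geo.str.isupper()          # continents, regions, and WORLD
is_continent = geo.isin(continents)

# Region headers keep their own name; plain continents reset the region
region_marker = geo.where(is_header & (geo != "WORLD"))
region_marker = region_marker.mask(is_continent & ~geo.isin(special_cases), "")

# Carry the latest continent/region header down to the country rows
toy_pop["Continent"] = geo.where(is_continent).ffill()
region = region_marker.ffill()
toy_pop["Region"] = region.mask(region == "")

# Keep only country rows that received both a continent and a region
countries = (toy_pop[~is_header]
             .dropna(subset=["Continent", "Region"])
             .rename(columns={"Geography": "Country"})
             .reset_index(drop=True))
print(countries)
```

As in the loop version, a country that sits directly under a plain continent header with no region header in between would end up with no region and be dropped by the final `dropna`.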


In [24]:
import pandas as pd

population_df_copy = population_df.copy()
population_df_copy['Continent'] = None
population_df_copy['Region'] = None

special_cases = {
    'NORTHERN AMERICA': 'NORTHERN AMERICA',
    'OCEANIA': 'OCEANIA'
}

#variables to store the current continent and region
current_continent = None
current_region = None

#iterate over the rows of the df and assign the continent and region
for index, row in population_df_copy.iterrows():
    geography = row['Geography']
    
    if geography.isupper():
        if geography == 'WORLD':
            continue #skipping world
        elif geography in ['AFRICA', 'NORTHERN AMERICA','LATIN AMERICA AND THE CARIBBEAN', 'EUROPE', 'ASIA', 'OCEANIA']:

            current_continent = geography
            current_region = special_cases.get(geography, None)
   
        else:
            current_region = geography
    else:
        population_df_copy.at[index, 'Continent'] = current_continent
        population_df_copy.at[index, 'Region'] = current_region

population_df_copy.rename(columns={'Geography': 'Country'}, inplace=True)

#filter out rows that represent continents or regions
#Continent and Region are already populated for countries, so we drop rows where either one is None
population_df_new = population_df_copy.dropna(subset=['Continent', 'Region']).reset_index(drop=True)

if 'is_region' in population_df_new.columns:
    population_df_new = population_df_new.drop(columns=['is_region'])
population_df_new.to_csv('../data/population_expanded.csv', index=False)

print(population_df_new.head())


   Country  Population Continent           Region
0  Algeria        46.8    AFRICA  NORTHERN AFRICA
1    Egypt       105.2    AFRICA  NORTHERN AFRICA
2    Libya         6.9    AFRICA  NORTHERN AFRICA
3  Morocco        37.0    AFRICA  NORTHERN AFRICA
4    Sudan        48.1    AFRICA  NORTHERN AFRICA


### Checking for Missing Values in the Datasets

Before proceeding with any data analysis, it is important to identify any missing values in the datasets. Missing values can skew the results, so identifying them early on helps us decide how to handle such cases in further processing.

#### Steps:
1. **Checking Missing Values in the Politicians Dataset**:  
   We use the `isnull()` function combined with `sum()` to calculate the number of missing values in each column of the `politicians_df`.

2. **Checking Missing Values in the Population Dataset**:  
   Similarly, we apply the same method to the `population_df` to identify missing values in the population data.


In [25]:
#check for missing values in each column
missing_values_politicians = politicians_df.isnull().sum()
print(missing_values_politicians)

#check for missing values in each column
missing_values_population = population_df.isnull().sum()
print(missing_values_population)



name             0
url              0
country          0
revision_id      8
quality_score    8
dtype: int64
Geography     0
Population    0
is_region     0
dtype: int64


### Retrieving Article Page Info from Wikipedia API

To enrich our analysis with article quality predictions, we first need to obtain the most recent revision ID of each article in the Wikipedia dataset. We do this by making requests to the MediaWiki API using the article titles provided in the dataset.

#### Steps:
1. **Setting Up API Requests**:  
   We use the MediaWiki API to fetch page information for each article. The API endpoint for this is `https://en.wikipedia.org/w/api.php`, and the parameters specify that we want the page URL and revision ID.

2. **API Throttling**:  
   To avoid overwhelming the API, we implement a throttle with a calculated wait time between requests, ensuring compliance with Wikipedia’s request rate limits. 

3. **User-Agent**:  
   Every API request includes a custom `User-Agent` header that identifies our project and affiliation (University of Washington, MSDS program).

4. **Functionality**:  
   The function `request_pageinfo_per_article` takes an article title as input and retrieves its most recent revision ID by sending a request to the Wikipedia API. If the article title is valid, it extracts and returns the revision ID, or `None` if the revision ID cannot be found.

5. **Error Handling**:  
   If there’s an issue with the request (e.g., connection failure or invalid response), the function catches the exception, logs an error message, and returns `None`.
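
To make the parsing and throttling steps concrete, here is a small offline sketch. The two payloads are hand-written assumptions in the shape that `action=query&prop=info&format=json` responses take (pages keyed by page ID, with a `-1` key and a `missing` field for a title that does not exist); `extract_lastrevid` is a hypothetical helper, not part of the notebook. The throttle line repeats the arithmetic behind `API_THROTTLE_WAIT`, which corresponds to roughly 100 requests per second minus an assumed 2 ms of latency.

```python
# Hypothetical response payloads, trimmed to the fields used here
found_response = {
    "query": {"pages": {"12345": {"pageid": 12345, "title": "Example Person",
                                  "lastrevid": 1233202991}}}
}
missing_response = {
    "query": {"pages": {"-1": {"title": "No Such Page", "missing": ""}}}
}

def extract_lastrevid(json_response):
    """Return the lastrevid of the first page in the response, or None."""
    pages = json_response.get("query", {}).get("pages", {})
    first_page = next(iter(pages.values()), {})
    return first_page.get("lastrevid")

# Same arithmetic as the throttle constant: 1/100 s per request minus 0.002 s
throttle_wait = (1.0 / 100.0) - 0.002

print(extract_lastrevid(found_response))
print(extract_lastrevid(missing_response))
print(round(throttle_wait, 3))
```

Using `.get` at each level means a missing or malformed page simply yields `None` instead of raising a `KeyError`.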


In [10]:
import requests
import time
import json

#constants
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

REQUEST_HEADERS = {
    'User-Agent': '<tbaner@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

#pageInfo Request Template
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",  
    "prop": "info",
    "inprop": "url|talkid"
}

#function to request page info and retrieve revision ID
def request_pageinfo_per_article(article_title):
    request_template = PAGEINFO_PARAMS_TEMPLATE.copy()
    request_template['titles'] = article_title

    if API_THROTTLE_WAIT > 0.0:
        time.sleep(API_THROTTLE_WAIT)

    try:
        response = requests.get(API_ENWIKIPEDIA_ENDPOINT, headers=REQUEST_HEADERS, params=request_template)
        json_response = response.json()
        
        #extract the page info; only one title is requested, so the first page is the one we want
        pages = json_response["query"]["pages"]
        page_info = next(iter(pages.values()), {})
        #lastrevid is absent for missing or invalid titles, in which case we return None
        return page_info.get("lastrevid") or None
    except Exception as e:
        print(f"Error fetching page info for {article_title}: {e}")
        return None


### Retrieving Article Quality Predictions from ORES API

After obtaining the latest revision ID for each Wikipedia article, we request the ORES API to predict the article's quality. ORES (Objective Revision Evaluation Service) uses machine learning models to provide quality predictions based on Wikipedia’s assessment scale (e.g., "FA", "GA", "B", "C", "Start", "Stub").

#### Steps:
1. **Setting Up API Requests**:  
   The ORES API is used to fetch the article quality score. The endpoint is `https://ores.wikimedia.org/v3/scores/enwiki/`, with the quality prediction model being `wp10`. The request includes the revision ID of the article as a query parameter to get the prediction.

2. **Functionality**:  
   The function `get_ores_quality_prediction` takes an article title and its revision ID as inputs, constructs the API request URL, and sends the request to the ORES API. If the request is successful, it extracts the quality score from the JSON response. The quality score is one of Wikipedia's quality classes such as "FA" (featured article), "GA" (good article), etc.

3. **Error Handling**:  
   If there’s an issue retrieving the score (e.g., connection failure or an invalid response), the function catches the exception, logs an error message, and returns `None`.
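
The nested lookup described above can be illustrated offline. The payload below is a hand-written assumption, trimmed to the keys the notebook's function reads (`enwiki` → `scores` → revision ID → `wp10` → `score` → `prediction`); `extract_prediction` is a hypothetical helper that walks the levels defensively rather than relying on a caught exception.

```python
# Hypothetical ORES-style payload, trimmed to the prediction field
sample_ores_response = {
    "enwiki": {"scores": {"1233202991": {"wp10": {"score": {"prediction": "Start"}}}}}
}

def extract_prediction(data, revision_id, model="wp10"):
    """Walk the nested dicts; return None if any level is missing."""
    node = data
    for key in ("enwiki", "scores", str(revision_id), model, "score", "prediction"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

print(extract_prediction(sample_ores_response, 1233202991))
print(extract_prediction(sample_ores_response, 42))
```

Note that the revision ID is converted to a string before the lookup, since JSON object keys are always strings.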


In [11]:
# ORES API Constants
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
ORES_MODEL = "wp10"  # The model used for quality predictions

#function to request article quality using ORES API
def get_ores_quality_prediction(article_title, revision_id):
    try:
        ores_url = f"{ORES_ENDPOINT}?models={ORES_MODEL}&revids={revision_id}"
        response = requests.get(ores_url)

        if response.status_code == 200:
            data = response.json()
            scores = data['enwiki']['scores'][str(revision_id)]['wp10']['score']['prediction']
            return scores
        else:
            print(f"Failed to get ORES score for {article_title} (Revision ID: {revision_id})")
            return None
    except Exception as e:
        print(f"Error getting ORES score for {article_title}: {e}")
        return None


### Retrieving Revision IDs and Quality Scores for Politicians' Articles

To enrich the dataset with article revision IDs and ORES-predicted quality scores, this section of the code iterates over each row in the `politicians_df` DataFrame, retrieving the revision ID and article quality score for each Wikipedia article. Articles for which no revision ID can be retrieved are logged for further review.

#### Steps:
1. **Setting Up Error Log**:  
   We initialize an empty list `error_log` to track articles for which a revision ID could not be retrieved (and which therefore have no quality score).

2. **Adding New Columns**:  
   Two new columns, `revision_id` and `quality_score`, are added to the `politicians_df` DataFrame to store the retrieved revision ID and predicted article quality, respectively.

3. **Loop Through Each Article**:  
   The code loops through each row in the DataFrame, extracting the article title from the URL and calling two functions: 
   - `request_pageinfo_per_article` to get the latest revision ID.
   - `get_ores_quality_prediction` to obtain the article quality score from the ORES API using the revision ID.

4. **Handling Missing Data**:  
   If a revision ID cannot be retrieved, the article title is added to the `error_log`. After completing the loop, the `error_log` is saved to a file (`ores_error_log.txt`) for review.

5. **Saving the Updated Data**:  
   Once the DataFrame is updated with the retrieved revision IDs and quality scores, the updated dataset is saved to `politicians_with_quality.csv`.

6. **Calculating Error Rate**:  
   The code calculates the error rate as the proportion of articles in `error_log`, i.e. those for which a revision ID could not be retrieved, and prints it for review.
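
The title-extraction step can be sketched as follows. `title_from_url` is a hypothetical helper, and the second URL is an invented example. It additionally decodes percent-escapes, on the assumption that some URLs in the dataset may contain them: `requests` encodes query parameters itself, so an already-encoded title would otherwise be encoded twice.

```python
from urllib.parse import unquote

def title_from_url(url):
    """Take the last path segment of a Wikipedia URL and decode percent-escapes."""
    return unquote(url.rstrip("/").split("/")[-1])

print(title_from_url("https://en.wikipedia.org/wiki/Majah_Ha_Adrif"))
print(title_from_url("https://en.wikipedia.org/wiki/Jos%C3%A9_Example"))
```

One limitation of splitting on `/` is that a title which itself contains a slash would be truncated to its last segment.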


In [12]:
#list to log articles without scores
error_log = []

#add columns to store revision ID and quality score
politicians_df['revision_id'] = None
politicians_df['quality_score'] = None

#loop through each politician and get revision ID and quality score
for index, row in politicians_df.iterrows():
    article_title = row['url'].split('/')[-1]
    revision_id = request_pageinfo_per_article(article_title)
    
    if revision_id:
        quality_score = get_ores_quality_prediction(article_title, revision_id)
        politicians_df.at[index, 'revision_id'] = revision_id
        politicians_df.at[index, 'quality_score'] = quality_score
    else:
        error_log.append(article_title)

#save the error log for any missing articles
with open('../data/ores_error_log.txt', 'w') as log_file:
    log_file.write("\n".join(error_log))

politicians_df.to_csv('../data/politicians_with_quality.csv', index=False)

#calculate error rate
error_rate = len(error_log) / len(politicians_df)
print(f"Error rate: {error_rate:.2%}")


Error rate: 0.11%


### Combining Entries for China and Guinea-Bissau, While Excluding Korea

In this section of the code, we handled special cases by combining entries for countries that had different variations in their names between the Wikipedia politicians dataset and the population dataset. For example, we merged entries for China and Guinea-Bissau, but chose to leave Korea out of the merge due to the ambiguity between North and South Korea in the articles.

#### Key Steps:
1. **Handling Special Cases with `name_map`:**  
   We created a dictionary `name_map` that standardizes the country names. Variants like "China (Hong Kong SAR)" and "China (Macao SAR)" are both mapped to "China," and multiple spellings of Guinea-Bissau are normalized. We excluded Korea because the articles do not indicate whether a politician belongs to North or South Korea, so grouping them under a single "Korea" entry would introduce bias.

2. **Applying the Name Mapping:**  
   We applied the `name_map` to both the `politicians_df` and `population_df_new` DataFrames to standardize the country names before performing any merges.

3. **Identifying Unmatched Countries:**  
   After merging the unique country names from the two datasets, we identified the countries that appear in only one of them and saved them to the file `wp_countries-no_match.txt`. This file lists every country that has no match in the other dataset.

4. **Inner Merge of Full Data:**  
   We performed an inner merge of the two DataFrames, `politicians_df` and `population_df_new`, based on the standardized country names. This merge results in a dataset that includes only countries that appear in both datasets.

5. **Cleaning the Merged Data:**  
   The resulting merged DataFrame was cleaned to include only the relevant columns: `country`, `region`, `population`, `article_title`, `revision_id`, and `article_quality`. The final cleaned dataset is saved to `wp_politicians_by_country.csv`.
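
As a hedged alternative to comparing the two sets of names by hand, pandas can label unmatched rows directly through the `indicator` option of `merge`. The country lists below are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

# Hypothetical country lists standing in for the two datasets
pol_countries = pd.DataFrame({"country": ["Afghanistan", "Korea", "China"]})
pop_countries = pd.DataFrame({"country": ["Afghanistan", "China",
                                          "Korea (North)", "Korea (South)"]})

# An outer merge with indicator=True adds a `_merge` column whose values are
# 'both', 'left_only', or 'right_only'
check = pd.merge(pol_countries, pop_countries, on="country",
                 how="outer", indicator=True)

unmatched = check.loc[check["_merge"] != "both", "country"].tolist()
matched = check.loc[check["_merge"] == "both", "country"].tolist()
print(sorted(unmatched))
print(sorted(matched))
```

The `_merge` column also records which side an unmatched name came from, which can help when deciding whether a `name_map` entry is needed.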


In [35]:
#combining the China entries and the Guinea-Bissau ones. Leaving Korea out (as the articles do not mention
#whether the politicians are in North or South Korea, so grouping them under Korea would add bias)

#adding a mapping
name_map = {
    "guineabissau": "Guinea-Bissau",
    "GuineaBissau": "Guinea-Bissau",
    "China (Hong Kong SAR)": "China",
    "China (Macao SAR)": "China"
}

#apply the name_map to both dfs before merging
politicians_df['country'] = politicians_df['country'].replace(name_map)
population_df_new['Country'] = population_df_new['Country'].replace(name_map)

#extract unique countries from both dfs
unique_countries_politicians = politicians_df['country'].drop_duplicates()
unique_countries_population = population_df_new['Country'].drop_duplicates()

#perform an inner merge based on unique country names
merged_df = pd.merge(
    unique_countries_politicians.to_frame(name='country'),
    unique_countries_population.to_frame(name='Country'),
    how='inner',
    left_on='country',
    right_on='Country'
)

#identify unmatched countries by comparing the two unique sets
unmatched_countries_left = unique_countries_politicians[~unique_countries_politicians.isin(merged_df['country'])]
unmatched_countries_right = unique_countries_population[~unique_countries_population.isin(merged_df['Country'])]

#combine all unmatched countries into a list
unmatched_countries = list(unmatched_countries_left) + list(unmatched_countries_right)

#saving unmatched countries
with open('../data/wp_countries-no_match.txt', 'w') as f:
    for country in unmatched_countries:
        if pd.notna(country):  # Only write valid country names
            f.write(f"{country}\n")

#perform full inner join between the original dfs based on the matched countries
merged_full_df = pd.merge(
    politicians_df,
    population_df_new,
    how='inner',
    left_on='country',
    right_on='Country'
)

#cleaning the merged data to include only required columns
#(rename returns a new DataFrame here, which avoids pandas' SettingWithCopyWarning from an in-place rename on a slice)
merged_df_clean = merged_full_df[['country', 'Region', 'Population', 'name', 'revision_id', 'quality_score']].rename(columns={
    'Region': 'region',
    'Population': 'population',
    'name': 'article_title',
    'quality_score': 'article_quality'
})

merged_df_clean.to_csv('../data/wp_politicians_by_country.csv', index=False)
merged_df_clean.head()


       country      region  population         article_title  revision_id  article_quality
0  Afghanistan  SOUTH ASIA        42.4        Majah Ha Adrif   1233202991            Start
1  Afghanistan  SOUTH ASIA        42.4     Haroon al-Afghani   1230459615                B
2  Afghanistan  SOUTH ASIA        42.4           Tayyab Agha   1225661708            Start
3  Afghanistan  SOUTH ASIA        42.4  Khadija Zahra Ahmadi   1234741562             Stub
4  Afghanistan  SOUTH ASIA        42.4        Aziza Ahmadyar   1195651393            Start


### Removing Rows with Missing `revision_id`

In this step, we handle rows that are missing the `revision_id`, which are also logged in the `ores_error_log.txt`. Since a `revision_id` is required to retrieve the quality score from the ORES API, these rows must be removed from the main dataset to ensure accurate analysis.

#### Key Steps:
1. **Identify Rows with Missing `revision_id`:**  
   We first created a subset of the DataFrame `rows_with_missing_revision` to capture rows where the `revision_id` is missing. These rows correspond to articles for which we could not retrieve a valid revision ID via the Wikipedia API.

2. **Count Removed Rows:**  
   We count the rows with a missing `revision_id` and print the total, so the number of articles dropped from the analysis is on record.

3. **Remove Rows with Missing `revision_id`:**  
   Finally, the rows with missing `revision_id` are removed from the main DataFrame `merged_df_clean`, and the filtered dataset is stored in `merged_df_filtered`.


In [37]:
#removing rows where revision_id is missing (the same articles as in ores_error_log.txt) and counting the removed rows
rows_with_missing_revision = merged_df_clean[merged_df_clean['revision_id'].isna()]
count_removed_rows = len(rows_with_missing_revision)
print(count_removed_rows)

#remove the rows with missing revision_id from the main df
merged_df_filtered = merged_df_clean.dropna(subset=['revision_id'])


8


### Calculating Total Articles Per Capita and High-Quality Articles Per Capita

In this section, we calculate two important metrics to analyze the coverage and quality of Wikipedia articles about politicians for each country and region. Because the population data is reported in millions, both metrics are expressed as articles per million people:
- **Total Articles Per Capita**: The number of Wikipedia articles about politicians divided by the population.
- **High-Quality Articles Per Capita**: The number of high-quality articles (those that ORES labels "FA" or "GA") divided by the population.

#### Key Steps:
1. **Identifying High-Quality Articles:**  
   A new column `high_quality` is added to the DataFrame, indicating whether each article is considered high-quality based on the ORES labels "FA" (Featured Article) or "GA" (Good Article).

2. **Calculating Country-Level Metrics:**  
   The data is grouped by `country` to calculate the total number of articles and high-quality articles per country. Additionally, we retrieve the population for each country.

3. **Per Capita Calculations:**  
   - The **Total Articles Per Capita** is calculated by dividing the total number of articles by the country's population.
   - The **High-Quality Articles Per Capita** is calculated by dividing the number of high-quality articles by the country's population.

4. **Calculating Region-Level Metrics:**  
   To ensure each country's population is only counted once, we group the data by both `region` and `country`, and then aggregate the results by region.
   - Total articles and high-quality articles are summed for each region, and population values are aggregated for unique countries within each region.
   - The **Total Articles Per Capita** and **High-Quality Articles Per Capita** are then calculated for each region.
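
The two-step regional aggregation can be checked on a tiny hypothetical table (all values below are invented for illustration). Population repeats on every article row of a country, so summing it directly over article rows would overcount; grouping by `(region, country)` first ensures each country contributes its population once.

```python
import pandas as pd

# Hypothetical article-level rows; population is in millions
toy = pd.DataFrame({
    "region": ["R1", "R1", "R1", "R1"],
    "country": ["A", "A", "A", "B"],
    "population": [2.0, 2.0, 2.0, 4.0],
    "article_quality": ["GA", "Stub", "Start", "FA"],
})
toy["high_quality"] = toy["article_quality"].isin(["FA", "GA"])

# Step 1: one row per (region, country), so each population appears once
per_country = toy.groupby(["region", "country"]).agg(
    total_articles=("article_quality", "count"),
    high_quality_articles=("high_quality", "sum"),
    population=("population", "first"),
).reset_index()

# Step 2: sum the per-country rows within each region
per_region = per_country.groupby("region").agg(
    total_articles=("total_articles", "sum"),
    high_quality_articles=("high_quality_articles", "sum"),
    population=("population", "sum"),
).reset_index()
per_region["total_articles_per_capita"] = (
    per_region["total_articles"] / per_region["population"])
per_region["high_quality_articles_per_capita"] = (
    per_region["high_quality_articles"] / per_region["population"])
print(per_region)
```

Here the region's population is 6.0 million (2.0 + 4.0), whereas a naive sum over the four article rows would give 10.0 and understate both per capita figures.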


In [102]:
#Calculating total-articles-per-capita and high-quality-articles-per-capita -->

#definition of high-quality articles
high_quality_classes = ["FA", "GA"]

#flag high-quality articles
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning when adding the column)
merged_df_filtered = merged_df_filtered.assign(
    high_quality=merged_df_filtered['article_quality'].isin(high_quality_classes)
)

#group by country and calculate the required metrics
country_grouped = merged_df_filtered.groupby('country').agg(
    total_articles=('article_title', 'count'),
    high_quality_articles=('high_quality', 'sum'),
    population=('population', 'first')
).reset_index()

#total-articles-per-capita and high-quality-articles-per-capita (population is in millions)
country_grouped['total_articles_per_capita'] = country_grouped['total_articles'] / (country_grouped['population'])
country_grouped['high_quality_articles_per_capita'] = country_grouped['high_quality_articles'] / (country_grouped['population'])

#group by both region and country, then aggregate. 
#This helps in making sure each country's population is only counted once per region
grouped = merged_df_filtered.groupby(['region', 'country']).agg(
    total_articles=('article_title', 'count'),
    high_quality_articles=('high_quality', 'sum'),
    population=('population', 'first')
).reset_index()

#aggregate by region only to sum up the results from unique countries
region_grouped = grouped.groupby('region').agg(
    total_articles=('total_articles', 'sum'),
    high_quality_articles=('high_quality_articles', 'sum'),
    population=('population', 'sum')
).reset_index()


#calculate total-articles-per-capita and high-quality-articles-per-capita for regions
region_grouped['total_articles_per_capita'] = region_grouped['total_articles'] / (region_grouped['population'])
region_grouped['high_quality_articles_per_capita'] = region_grouped['high_quality_articles'] / (region_grouped['population'])


country_grouped.head(), region_grouped.head()
region_grouped.to_csv('../data/region_grouped2.csv', index=False)


## RESULTS: Top 10 Countries by Coverage

In this section, we display the top 10 countries with the highest total articles per capita (articles per million people). The population data is reported in millions to one decimal place, so a country with fewer than 0.05 million people appears with a population of 0.0, which makes its per capita value infinite. These countries (2 in number) are excluded from the ranking.
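
The exclusion described above can be reproduced on a small hypothetical table (the country names and values are invented). Dividing a positive count by a population of 0.0 yields `inf` in pandas rather than raising an error, so the infinite rows have to be filtered out explicitly before ranking.

```python
import numpy as np
import pandas as pd

# Hypothetical country-level table; population 0.0 stands for "below 0.05 million"
toy_country = pd.DataFrame({
    "country": ["Tiny", "Small", "Large"],
    "total_articles": [3, 10, 20],
    "population": [0.0, 0.1, 100.0],
})
toy_country["total_articles_per_capita"] = (
    toy_country["total_articles"] / toy_country["population"])

# The zero-population row produces an infinite value
has_inf = np.isinf(toy_country["total_articles_per_capita"]).any()

# Replace inf with NaN, then drop those rows before ranking
filtered = (toy_country.replace([np.inf, -np.inf], np.nan)
            .dropna(subset=["total_articles_per_capita"]))
top = filtered.nlargest(2, "total_articles_per_capita")
print(has_inf)
print(top["country"].tolist())
```

Without the filter, the zero-population row would sort to the top of any descending ranking.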

Below is the code used to generate the top 10 countries by coverage, ranked in descending order:


In [90]:
## RESULTS
#producing the required tables based on the previous calculations

# Top 10 countries by coverage, excluding infinite values

import numpy as np
from tabulate import tabulate

#filter out infinite values from the total_articles_per_capita column; they come from countries whose population (in millions) is listed as 0
#the data is reported to one decimal place, so the smallest non-zero value it can show is 0.1
#thus any population below 0.05 million shows up as 0 - ignoring such countries (2 in number)
filtered_countries_coverage = country_grouped.replace([np.inf, -np.inf], np.nan).dropna(subset=['total_articles_per_capita'])

#top 10 countries by coverage, excluding infinite values
top_10_countries_coverage = filtered_countries_coverage.nlargest(10, 'total_articles_per_capita')
top_10_countries_coverage['total_articles_per_capita (per million)'] = top_10_countries_coverage['total_articles_per_capita'].apply(lambda x: format(x, '.4f'))

#add a rank column to explicitly show the ordering
top_10_countries_coverage['Rank (from the top)'] = range(1, len(top_10_countries_coverage) + 1)
top_10_display = top_10_countries_coverage[['Rank (from the top)', 'country', 'total_articles_per_capita (per million)']]

print("Top 10 countries by coverage-->")
print(tabulate(top_10_display, headers='keys', tablefmt='psql', showindex=False))




Top 10 countries by coverage-->
+-----------------------+--------------------------------+-------------------------------------------+
|   Rank (from the top) | country                        |   total_articles_per_capita (per million) |
|-----------------------+--------------------------------+-------------------------------------------|
|                     1 | Antigua and Barbuda            |                                  330      |
|                     2 | Federated States of Micronesia |                                  140      |
|                     3 | Marshall Islands               |                                  130      |
|                     4 | Tonga                          |                                  100      |
|                     5 | Barbados                       |                                   83.3333 |
|                     6 | Montenegro                     |                                   60      |
|                     7 | Seychelles     

## RESULTS: Bottom 10 Countries by Coverage

In this section, we display the bottom 10 countries with the lowest total articles per capita (normalized by population). Countries with population values of zero (resulting in infinite values for articles per capita) have been excluded from this analysis. Below is the code used to generate the bottom 10 countries by coverage, ranked in ascending order.


In [91]:
#Bottom 10 countries by coverage

#excluding infinite values
bottom_10_countries_coverage = filtered_countries_coverage.nsmallest(10, 'total_articles_per_capita')

bottom_10_countries_coverage['total_articles_per_capita (per million)'] = bottom_10_countries_coverage['total_articles_per_capita'].apply(lambda x: f"{x:.6f}")

#add a rank column
bottom_10_countries_coverage['Rank (from the bottom)'] = range(1, len(bottom_10_countries_coverage) + 1)

bottom_10_display = bottom_10_countries_coverage[['Rank (from the bottom)', 'country', 'total_articles_per_capita (per million)']]

print("Bottom 10 countries by coverage-->")
print(tabulate(bottom_10_display, headers='keys', tablefmt='psql', showindex=False))


Bottom 10 countries by coverage-->
+--------------------------+---------------+-------------------------------------------+
|   Rank (from the bottom) | country       |   total_articles_per_capita (per million) |
|--------------------------+---------------+-------------------------------------------|
|                        1 | China         |                                  0.034011 |
|                        2 | Ghana         |                                  0.087977 |
|                        3 | India         |                                  0.105698 |
|                        4 | Saudi Arabia  |                                  0.135501 |
|                        5 | Zambia        |                                  0.148515 |
|                        6 | Norway        |                                  0.181818 |
|                        7 | Israel        |                                  0.204082 |
|                        8 | Egypt         |                               
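
The exclusion of infinite values mentioned at the start of this section happens before the cell above, where `filtered_countries_coverage` is built. As a minimal sketch of that kind of filter, assuming the per-capita column is a plain ratio of article count to population (the toy data and the `coverage` and `filtered` names are hypothetical, not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror this notebook, values are made up.
coverage = pd.DataFrame({
    "country": ["A", "B", "C"],
    "total_articles": [10, 5, 3],
    "population": [2.0, 0.0, 1.5],  # in millions; B is reported as 0.0
})

# Dividing by a zero population yields inf instead of raising an error.
coverage["total_articles_per_capita"] = (
    coverage["total_articles"] / coverage["population"]
)

# Keep only rows whose per-capita value is finite (drops inf and NaN).
finite_mask = np.isfinite(coverage["total_articles_per_capita"])
filtered = coverage[finite_mask].copy()

print(filtered)
```

Taking a `.copy()` of the filtered frame means later column assignments, such as the formatted display column, act on an independent frame rather than on a view of the original.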

## RESULTS: Top 10 Countries by High-Quality Articles

This section presents the top 10 countries ranked by high-quality articles per capita, where "high-quality" is defined as articles classified as "FA" (featured article) or "GA" (good article) based on ORES predictions. The data is normalized by population (in millions) to calculate per capita metrics. Countries are ranked in descending order of high-quality articles per capita.
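
For reference, the metric described above could be derived along the following lines. This is a sketch under assumptions: the column name `article_quality`, the toy rows, and the `per_country` frame are hypothetical, and the notebook's own construction of `country_grouped` may differ.

```python
import pandas as pd

# ORES classes treated as "high quality" in this analysis.
HIGH_QUALITY_CLASSES = {"FA", "GA"}

# Hypothetical toy data: one row per article, population in millions.
articles = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "article_quality": ["FA", "Stub", "GA", "Start", "C"],  # assumed column name
    "population": [4.0, 4.0, 4.0, 10.0, 10.0],
})

# Flag each article as high quality or not.
articles["is_high_quality"] = articles["article_quality"].isin(HIGH_QUALITY_CLASSES)

# Aggregate to one row per country.
per_country = articles.groupby("country", as_index=False).agg(
    total_articles=("article_quality", "count"),
    high_quality_articles=("is_high_quality", "sum"),
    population=("population", "first"),
)

# High-quality articles per million people.
per_country["high_quality_articles_per_capita"] = (
    per_country["high_quality_articles"] / per_country["population"]
)

print(per_country)
```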


In [92]:
#Top 10 countries by high quality


top_10_countries_high_quality = country_grouped.nlargest(10, 'high_quality_articles_per_capita')

top_10_countries_high_quality['high_quality_articles_per_capita (per million)'] = top_10_countries_high_quality['high_quality_articles_per_capita'].apply(lambda x: f"{x:.6f}")

#adding a rank column to mark the order
top_10_countries_high_quality['Rank (from the top)'] = range(1, len(top_10_countries_high_quality) + 1)
top_10_high_quality_display = top_10_countries_high_quality[['Rank (from the top)', 'country', 'high_quality_articles_per_capita (per million)']]

print("Top 10 countries by high quality-->")
print(tabulate(top_10_high_quality_display, headers='keys', tablefmt='psql', showindex=False))


Top 10 countries by high quality-->
+-----------------------+-----------------------+--------------------------------------------------+
|   Rank (from the top) | country               |   high_quality_articles_per_capita (per million) |
|-----------------------+-----------------------+--------------------------------------------------|
|                     1 | Montenegro            |                                         5        |
|                     2 | Luxembourg            |                                         2.85714  |
|                     3 | Albania               |                                         2.59259  |
|                     4 | Kosovo                |                                         2.35294  |
|                     5 | Maldives              |                                         1.66667  |
|                     6 | Lithuania             |                                         1.37931  |
|                     7 | Croatia               |      

## RESULTS: Bottom 10 Countries by High-Quality Articles

This section lists the bottom 10 countries by high-quality articles per capita. "High-quality" articles are defined as those classified as "FA" (featured article) or "GA" (good article) based on ORES predictions. These countries have the fewest high-quality articles per capita, where the count is normalized by population (in millions).
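
When several countries share the same lowest value, a fixed-size cut depends on how ties are broken. The sketch below uses made-up data (the `scores` frame is hypothetical) to show how the `keep` argument of `nsmallest` affects the result: the default `keep='first'` returns the first `n` rows in their existing order, while `keep='all'` returns every row tied with the last selected value.

```python
import pandas as pd

# Hypothetical toy data with three countries tied at zero.
scores = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "high_quality_articles_per_capita": [0.0, 0.0, 0.0, 0.4, 1.2],
})

metric = "high_quality_articles_per_capita"

# Default behaviour: ties are cut off at n rows, in existing row order.
first_two = scores.nsmallest(2, metric)

# keep="all": every row tied with the n-th value is retained.
all_ties = scores.nsmallest(2, metric, keep="all")

# Counting the zeros directly shows how large the tied group is.
zero_count = int((scores[metric] == 0).sum())

print(first_two["country"].tolist())
print(all_ties["country"].tolist())
print(zero_count)
```

Reporting the size of the tied group alongside the table makes it clear that the 10 rows shown are a sample of the tie rather than a strict ranking.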


In [93]:
#Bottom 10 countries by high quality

bottom_10_countries_high_quality = country_grouped.nsmallest(10, 'high_quality_articles_per_capita')

bottom_10_countries_high_quality['high_quality_articles_per_capita (per million)'] = bottom_10_countries_high_quality['high_quality_articles_per_capita'].apply(lambda x: f"{x:.6f}")

#adding rank column to mark the order
bottom_10_countries_high_quality['Rank (from the bottom)'] = range(1, len(bottom_10_countries_high_quality) + 1)

# Select only the relevant columns for display
bottom_10_high_quality_display = bottom_10_countries_high_quality[['Rank (from the bottom)', 'country', 'high_quality_articles_per_capita (per million)']]

print("Bottom 10 countries by high quality-->")
print(tabulate(bottom_10_high_quality_display, headers='keys', tablefmt='psql', showindex=False))

#many countries have articles but no high-quality articles, so they all tie at zero;
#the 10 shown here are the first of them in ascending order of country name


Bottom 10 countries by high quality-->
+--------------------------+---------------------+--------------------------------------------------+
|   Rank (from the bottom) | country             |   high_quality_articles_per_capita (per million) |
|--------------------------+---------------------+--------------------------------------------------|
|                        1 | Antigua and Barbuda |                                                0 |
|                        2 | Bahamas             |                                                0 |
|                        3 | Barbados            |                                                0 |
|                        4 | Belize              |                                                0 |
|                        5 | Benin               |                                                0 |
|                        6 | Bhutan              |                                                0 |
|                        7 | Botswana      

## RESULTS: Geographic Regions by Total Coverage

This section lists geographic regions ranked by total articles per capita, sorted in descending order. The "total articles per capita" metric represents the number of Wikipedia articles about political figures for each region, normalized by the population of the region (in millions).
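
One way to obtain a region-level figure is to pool article counts and populations within each region and then divide, rather than averaging the per-country ratios. The sketch below illustrates that approach on made-up data; the `countries` and `region` frames are hypothetical, and it is an assumption that `region_grouped` was built this way, since its construction is not shown in this section.

```python
import pandas as pd

# Hypothetical toy data: population in millions.
countries = pd.DataFrame({
    "region": ["X", "X", "Y"],
    "country": ["A", "B", "C"],
    "total_articles": [30, 10, 12],
    "population": [1.0, 9.0, 4.0],
})

# Sum articles and population within each region.
region = countries.groupby("region", as_index=False)[
    ["total_articles", "population"]
].sum()

# Pooled ratio: total regional articles per million regional inhabitants.
region["total_articles_per_capita"] = (
    region["total_articles"] / region["population"]
)

print(region)
```

With the pooled ratio, a small country with many articles (such as `A` here) does not dominate the regional figure the way it would in an unweighted mean of country-level ratios.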


In [103]:
# Geographic regions by total coverage

#Sort the region_grouped df by total_articles_per_capita in desc order
region_grouped_sorted_total_coverage = region_grouped.sort_values('total_articles_per_capita', ascending=False)

region_grouped_sorted_total_coverage['total_articles_per_capita (per million)'] = region_grouped_sorted_total_coverage['total_articles_per_capita'].apply(lambda x: f"{x:.6f}")

#adding a rank column to mark the order
region_grouped_sorted_total_coverage['Rank'] = range(1, len(region_grouped_sorted_total_coverage) + 1)

region_total_coverage_display = region_grouped_sorted_total_coverage[['Rank', 'region', 'total_articles_per_capita (per million)']]

print("Geographic regions by total coverage-->")
print(tabulate(region_total_coverage_display, headers='keys', tablefmt='psql', showindex=False))


Geographic regions by total coverage-->
+--------+-----------------+-------------------------------------------+
|   Rank | region          |   total_articles_per_capita (per million) |
|--------+-----------------+-------------------------------------------|
|      1 | NORTHERN EUROPE |                                  6.8705   |
|      2 | OCEANIA         |                                  6.48649  |
|      3 | CARIBBEAN       |                                  5.95628  |
|      4 | SOUTHERN EUROPE |                                  5.26073  |
|      5 | CENTRAL AMERICA |                                  3.66472  |
|      6 | WESTERN EUROPE  |                                  2.74131  |
|      7 | EASTERN EUROPE  |                                  2.66341  |
|      8 | WESTERN ASIA    |                                  2.06161  |
|      9 | SOUTHERN AFRICA |                                  1.80088  |
|     10 | EASTERN AFRICA  |                                  1.38074  |
|     11 | 

## RESULTS: Geographic Regions by High-Quality Coverage

This section ranks geographic regions by the number of high-quality Wikipedia articles per capita, in descending order. High-quality articles are those classified as "FA" (Featured Article) or "GA" (Good Article) by the ORES model. The metric is the number of high-quality articles per million people in each region.
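
The sort, rank, and format steps are the same in every results cell, so they could be gathered into one helper. The function below is a hypothetical sketch (the name `ranked_table` and its parameters are not part of the notebook), shown on made-up data.

```python
import pandas as pd


def ranked_table(df, metric, label_col, n=None, ascending=False,
                 rank_name="Rank", decimals=6):
    """Return a display table: rank, label, and the metric formatted as text."""
    # A stable sort keeps the existing row order among tied values.
    ordered = df.sort_values(metric, ascending=ascending, kind="stable")
    if n is not None:
        ordered = ordered.head(n)

    out = ordered[[label_col]].copy()
    out.insert(0, rank_name, list(range(1, len(out) + 1)))
    out[f"{metric} (per million)"] = ordered[metric].map(
        lambda x: f"{x:.{decimals}f}"
    )
    return out.reset_index(drop=True)


# Hypothetical toy data.
regions = pd.DataFrame({
    "region": ["X", "Y", "Z"],
    "high_quality_articles_per_capita": [0.1, 0.35, 0.2],
})

table = ranked_table(regions, "high_quality_articles_per_capita", "region")
print(table)
```

The returned frame can be passed to `tabulate` exactly as in the cells of this notebook. By default `tabulate` re-parses numeric-looking strings, which is why trailing zeros are dropped in the printed tables; passing `disable_numparse=True` should preserve the fixed number of decimals.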


In [104]:
# Geographic regions by high quality coverage

#Sort the region_grouped df by high_quality_articles_per_capita in descending order
region_grouped_sorted_high_quality = region_grouped.sort_values('high_quality_articles_per_capita', ascending=False)

region_grouped_sorted_high_quality['high_quality_articles_per_capita (per million)'] = region_grouped_sorted_high_quality['high_quality_articles_per_capita'].apply(lambda x: f"{x:.6f}")

#adding a rank column to mark the order
region_grouped_sorted_high_quality['Rank'] = range(1, len(region_grouped_sorted_high_quality) + 1)

region_high_quality_display = region_grouped_sorted_high_quality[['Rank', 'region', 'high_quality_articles_per_capita (per million)']]

print("Geographic regions by high quality coverage -->")
print(tabulate(region_high_quality_display, headers='keys', tablefmt='psql', showindex=False))


Geographic regions by high quality coverage -->
+--------+-----------------+--------------------------------------------------+
|   Rank | region          |   high_quality_articles_per_capita (per million) |
|--------+-----------------+--------------------------------------------------|
|      1 | SOUTHERN EUROPE |                                         0.349835 |
|      2 | NORTHERN EUROPE |                                         0.323741 |
|      3 | CARIBBEAN       |                                         0.245902 |
|      4 | CENTRAL AMERICA |                                         0.194932 |
|      5 | EASTERN EUROPE  |                                         0.14275  |
|      6 | SOUTHERN AFRICA |                                         0.11713  |
|      7 | WESTERN EUROPE  |                                         0.11583  |
|      8 | WESTERN ASIA    |                                         0.091401 |
|      9 | OCEANIA         |                                         0.0