### Importing Necessary Libraries
In this cell, we import all the essential libraries that we'll need for data processing, analysis, and visualization. These libraries include:
- `numpy` for numerical operations
- `pandas` for handling dataframes
- `matplotlib` for plotting results
- `tabulate` for displaying clean tables
- `requests` for fetching data via API (if applicable)


In [1]:

import pandas as pd 
import requests  
import time  
import json  




### Loading the Datasets
In this cell, we load the datasets that will be used throughout the analysis. These include:
1. `politicians_with_quality.csv` - A dataset containing details about Wikipedia articles on politicians, including revision IDs and quality scores.
2. `population_by_country_AUG.2024.csv` - A dataset providing population data by country, which will be used for calculating articles per capita.

We use `pandas` to load these CSV files into dataframes for further processing.


In [2]:
# Loading the datasets
politicians_df = pd.read_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/politicians_with_quality.csv')
population_df = pd.read_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/population_by_country_AUG.2024.csv')

# Displaying the first few rows to ensure the data has been loaded correctly
print("Politicians Data:")
print(politicians_df.head())

print("\nPopulation Data:")
print(population_df.head())


Politicians Data Preview:
                   name                                                url  \
0        Majah Ha Adrif       https://en.wikipedia.org/wiki/Majah_Ha_Adrif   
1     Haroon al-Afghani    https://en.wikipedia.org/wiki/Haroon_al-Afghani   
2           Tayyab Agha          https://en.wikipedia.org/wiki/Tayyab_Agha   
3  Khadija Zahra Ahmadi  https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...   
4        Aziza Ahmadyar       https://en.wikipedia.org/wiki/Aziza_Ahmadyar   

       country  
0  Afghanistan  
1  Afghanistan  
2  Afghanistan  
3  Afghanistan  
4  Afghanistan  

Population Data Preview:
         Geography  Population
0            WORLD      8009.0
1           AFRICA      1453.0
2  NORTHERN AFRICA       256.0
3          Algeria        46.8
4            Egypt       105.2


### Standardizing and Cleaning Country Names
Before merging the datasets, we need to ensure that country names are consistent across both datasets. This cell:
1. Converts country names to lowercase and removes extra spaces in both the politicians and population datasets.
2. Applies a mapping dictionary (`name_map`) to standardize specific country names (e.g., different variations of China and Guinea-Bissau).
3. Ensures that merging will occur seamlessly later on by standardizing names between the datasets.


In [3]:
# Checking for duplicates based on all columns in the politicians dataset
duplicate_politicians_all = politicians_df[politicians_df.duplicated()]
print(f"Total duplicate rows (all columns): {len(duplicate_politicians_all)}")

# Checking for duplicates based on just 'name' and 'url' columns
duplicate_politicians_name = politicians_df[politicians_df.duplicated(subset=['name'])]
duplicate_politicians_url = politicians_df[politicians_df.duplicated(subset=['url'])]
print(f"Duplicate rows based on 'name': {len(duplicate_politicians_name)}")
print(f"Duplicate rows based on 'url': {len(duplicate_politicians_url)}")

# Combine duplicate data based on name and url and export to a CSV file
combined_duplicates = pd.concat([duplicate_politicians_name, duplicate_politicians_url]).drop_duplicates()
combined_duplicates.to_csv('../data/combined_duplicates_politicians.csv', index=False)
print("Combined duplicates have been saved to 'combined_duplicates_politicians.csv'")


Total duplicate rows (all columns): 0
Duplicate rows based on 'name': 44
Duplicate rows based on 'url': 44
Combined duplicates have been saved to 'combined_duplicates_politicians.csv'


In [4]:
# Adding a helper column to identify regions (uppercase 'Geography' means a region)
population_df['is_region'] = population_df['Geography'].apply(lambda x: x.isupper())

# Split the dataframe into two: regions and countries
df_region = population_df[population_df['is_region'] == True].copy()
df_country = population_df[population_df['is_region'] == False].copy()

# Dropping the helper column after the split
df_region.drop(columns=['is_region'], inplace=True)
df_country.drop(columns=['is_region'], inplace=True)

# View the country dataset to verify the operation
print(f"Total countries: {df_country.count()[0]}")


Total countries: 209


  print(f"Total countries: {df_country.count()[0]}")


### Fetching Wikipedia Article Revision IDs
In this section, we use the Wikipedia API to retrieve the `revision_id` for a list of Wikipedia articles. The `revision_id` is essential for tracking the latest version of an article.

Key points:
- We utilize the MediaWiki API to query Wikipedia for page information.
- The function `get_revision_id` is used to fetch the latest revision ID for each article title.
- To avoid exceeding API rate limits, a throttle delay is introduced between requests.


In [7]:
import requests
import time
import logging

# API URL for Wikipedia
API_URL = "https://en.wikipedia.org/w/api.php"
# Throttle time to avoid hitting the API too fast
API_THROTTLE = 0.7  # Adjusted for faster requests

# Headers for making requests to Wikipedia API
HEADERS = {
    'User-Agent': 'rsethi3@uw.edu, University Project - 2024'
}

# Template for page info request (this will be copied and modified per request)
PAGE_INFO_PARAMS = {
    "action": "query",
    "format": "json",
    "titles": "",  # Title will be added dynamically
    "prop": "info",
    "inprop": "url|talkid"
}

# This function requests page info for an article and retrieves the revision ID
def get_revision_id(article_title):
    # Copy the request template and add the article title
    params = PAGE_INFO_PARAMS.copy()
    params["titles"] = article_title
    
    # Throttle to respect API limits
    time.sleep(API_THROTTLE)
    
    try:
        # Make the request to the Wikipedia API
        response = requests.get(API_URL, headers=HEADERS, params=params)
        data = response.json()

        # Extract page info from the JSON response
        if 'query' in data:
            pages = data['query']['pages']
            for page_id, page_info in pages.items():
                return page_info.get("lastrevid", None)
        else:
            logging.error(f"Could not find 'query' in the response for {article_title}")
            return None
    except Exception as e:
        logging.error(f"Error fetching revision ID for {article_title}: {e}")
        return None


### Setting up API Configuration
Before making API requests to ORES (Objective Revision Evaluation Service), we need to configure the API endpoints and ensure we have the required libraries and setup. This cell:
1. Defines the ORES API base URL for fetching quality scores for Wikipedia articles.
2. Loads necessary libraries (`requests` for API calls and `json` for handling API responses).


In [8]:
# ORES API endpoint and model
ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"
ORES_MODEL = "wp10"  # Predicting quality

# Function to get quality score based on revision ID
def get_quality_prediction(article_title, revision_id):
    try:
        # Construct the ORES request URL with revision ID
        full_url = f"{ORES_URL}?models={ORES_MODEL}&revids={revision_id}"
        response = requests.get(full_url)

        # Check for successful response
        if response.status_code == 200:
            data = response.json()
            # Navigate through the JSON structure to extract the prediction
            return data['enwiki']['scores'][str(revision_id)]['wp10']['score']['prediction']
        else:
            logging.error(f"Failed to retrieve ORES quality score for {article_title} (revision: {revision_id})")
            return None
    except Exception as e:
        logging.error(f"Error retrieving ORES score for {article_title}: {e}")
        return None


### Adding Revision IDs and Quality Scores to the Dataset
In this section, we iterate through each row in the `politicians_df` DataFrame and:
1. Extract the article title from the URL.
2. Fetch the article's `revision_id` using the `get_revision_id` function.
3. Retrieve the ORES quality score using the `get_quality_prediction` function.
4. Log any errors or failed requests, saving them for review.

The final dataset with both the `revision_id` and `quality_score` is saved, along with a log of articles that failed to fetch.


In [9]:
# Initialize an empty error log for articles that fail to get revision IDs or quality scores
error_log = []

# Add columns to store the revision ID and quality score in the DataFrame
politicians_df['revision_id'] = None
politicians_df['quality_score'] = None

# Iterate through each row (each politician) in the dataset
for index, row in politicians_df.iterrows():
    # Extract the article title from the Wikipedia URL (the last part of the URL)
    article_title = row['url'].split('/')[-1]
    logging.info(f"Processing {article_title}")

    # Step 1: Get the revision ID for the article
    revision_id = get_revision_id(article_title)
    
    if revision_id:
        # Step 2: If we successfully got a revision ID, fetch the quality score using ORES
        quality_score = get_quality_prediction(article_title, revision_id)
        politicians_df.at[index, 'revision_id'] = revision_id  # Store the revision ID
        politicians_df.at[index, 'quality_score'] = quality_score  # Store the quality score
    else:
        # If we couldn't retrieve the revision ID, log the article title in the error log
        error_log.append(article_title)

# Save the error log to a text file for review
with open('../data/ores_error_log.txt', 'w') as log_file:
    log_file.write("\n".join(error_log))

# Save the DataFrame (now with quality scores and revision IDs) to a CSV file
politicians_df.to_csv('../data/politicians_with_quality.csv', index=False)

# Calculate and log the error rate (percentage of articles that failed)
error_rate = len(error_log) / len(politicians_df)
logging.info(f"Error rate: {error_rate:.2%}")

2024-10-14 00:17:25,088 - INFO - Processing Majah_Ha_Adrif
2024-10-14 00:17:27,373 - INFO - Processing Haroon_al-Afghani
2024-10-14 00:17:28,879 - INFO - Processing Tayyab_Agha
2024-10-14 00:17:30,243 - INFO - Processing Khadija_Zahra_Ahmadi
2024-10-14 00:17:31,674 - INFO - Processing Aziza_Ahmadyar
2024-10-14 00:17:33,214 - INFO - Processing Muqadasa_Ahmadzai
2024-10-14 00:17:35,121 - INFO - Processing Mohammad_Sarwar_Ahmedzai
2024-10-14 00:17:36,781 - INFO - Processing Amir_Muhammad_Akhundzada
2024-10-14 00:17:39,037 - INFO - Processing Nasrullah_Baryalai_Arsalai
2024-10-14 00:17:40,526 - INFO - Processing Abdul_Rahim_Ayoubi
2024-10-14 00:17:42,024 - INFO - Processing Ismael_Balkhi
2024-10-14 00:17:43,528 - INFO - Processing Abdul_Baqi_Turkistani
2024-10-14 00:17:44,977 - INFO - Processing Mohammad_Ghous_Bashiri
2024-10-14 00:17:46,361 - INFO - Processing Jan_Baz
2024-10-14 00:17:47,740 - INFO - Processing Bashir_Ahmad_Bezan
2024-10-14 00:17:49,039 - INFO - Processing Rafiullah_Bidar

### Merging Wikipedia Politicians Data with Population Data
In this section, we perform the following steps:
1. **Load Data**: Load the cleaned Wikipedia politicians dataset (`politicians_with_quality.csv`) and the population data (`population_by_country_AUG.2024.csv`).
2. **Assign Regions to Countries**: Process the population data to assign regions to countries, using uppercase entries to identify regions.
3. **Standardize Country Names**: Handle inconsistencies in country names (e.g., variations of "China" and "Guinea-Bissau").
4. **Merge Datasets**: Perform an outer merge on the `country` field to ensure both datasets align correctly.
5. **Log Non-Matching Countries**: Identify countries that are missing from either dataset and save them to a log file.
6. **Filter and Save Matched Data**: Filter the merged data to include only matching countries and save the cleaned dataset to a CSV file.


In [67]:
import pandas as pd

# Step 1: Load the Wikipedia politicians data and population data
# This includes the processed politicians data with quality scores and the population dataset with regions
politicians_df = pd.read_csv('../data/politicians_with_quality.csv')
population_df_new = pd.read_csv('../data/population_by_country_AUG.2024.csv')

# Step 2: Process the population data to assign regions to countries
# We use an iterative approach, identifying regions (uppercase entries) and associating countries with their regions
current_region = None  # Tracks the current region during iteration
processed_pop_data = []  # List to hold the processed data with region-country-population associations

# Iterate through each row in the population data
for _, row in population_df_new.iterrows():
    geography = row['Geography']
    
    # If the name is in all caps, it's a region
    if geography.isupper():
        current_region = geography  # Set the current region
    else:
        # If it's a country, add its country name, region, and population to the processed list
        processed_pop_data.append({
            'country': geography.lower().strip(),  # Standardizing country names (lowercase, no spaces)
            'region': current_region,              # The associated region
            'population': row['Population']        # Population value
        })

# Convert the processed population data into a DataFrame for merging later
pop_df = pd.DataFrame(processed_pop_data)

# Step 3: Standardize country names in the Wikipedia data
# This ensures consistent formatting of country names for merging
politicians_df['country'] = politicians_df['country'].str.lower().str.strip()

# Step 4: Handle special cases (China, Guinea-Bissau, etc.) using a name_map dictionary
# These cases address variations or inconsistencies in country names across the two datasets
name_map = {
    "china (hong kong sar)": "china",
    "china (macao sar)": "china",
    "guinea-bissau": "guinea-bissau",
    "guineabissau": "guinea-bissau"
}

# Apply the name_map to both datasets to standardize country names for matching
politicians_df['country'] = politicians_df['country'].replace(name_map)
pop_df['country'] = pop_df['country'].replace(name_map)

# Step 5: Perform an outer merge on 'country' between politicians_df and pop_df
# This ensures that we don't lose data from either dataset, even if a country is missing in one
merged_df = pd.merge(politicians_df, pop_df, on='country', how='outer', indicator=True)

# Step 6: Identify unmatched countries and write them to a file
# These are countries that exist in one dataset but not the other, and we'll save them for review
non_matches = merged_df[merged_df['_merge'] != 'both']
non_matching_countries = non_matches['country'].dropna().unique()

# Write the non-matching countries to a text file for troubleshooting
with open('../data/wp_countries-no_match.txt', 'w') as f:
    for country in non_matching_countries:
        f.write(f"{country}\n")

# Step 7: Filter for matched data (countries present in both datasets)
# We only keep rows where there is a match between both datasets
matched_data = merged_df[merged_df['_merge'] == 'both']

# Step 8: Select relevant columns for the final dataset
# This includes country, region, population, and article-related fields
final_columns = ['country', 'region', 'population', 'name', 'revision_id', 'quality_score']
final_df = matched_data[final_columns]

# Step 9: Rename the columns to be more meaningful
# This makes the column names more descriptive
final_df.rename(columns={
    'name': 'article_title',
    'quality_score': 'article_quality'
}, inplace=True)

# Step 10: Save the consolidated final DataFrame to a CSV file for future analysis
final_df.to_csv('../data/wp_politicians_by_country.csv', index=False)

# Quick preview of the final DataFrame to verify the results
print(final_df.head())


       country      region  population         article_title   revision_id  \
0  afghanistan  SOUTH ASIA        42.4        Majah Ha Adrif  1.233203e+09   
1  afghanistan  SOUTH ASIA        42.4     Haroon al-Afghani  1.230460e+09   
2  afghanistan  SOUTH ASIA        42.4           Tayyab Agha  1.225662e+09   
3  afghanistan  SOUTH ASIA        42.4  Khadija Zahra Ahmadi  1.234742e+09   
4  afghanistan  SOUTH ASIA        42.4        Aziza Ahmadyar  1.195651e+09   

  article_quality  
0           Start  
1               B  
2           Start  
3            Stub  
4           Start  


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df.rename(columns={


### Handling Missing `revision_id`
In this step, we clean the dataset by removing rows where the `revision_id` is missing. This field is essential for retrieving the article quality score from the ORES API, so any row missing this value is excluded from further analysis. The steps include:
1. **Identify Missing `revision_id`**: Create a subset of rows where `revision_id` is NaN (missing).
2. **Remove Rows with Missing `revision_id`**: Filter the DataFrame to exclude these rows.
3. **Log Missing Data**: Optionally save rows with missing `revision_id` to a log file for review.


In [68]:
# Step 11: Handle Rows with Missing 'revision_id'
# In this part, we'll handle rows where the 'revision_id' is missing.
# Since 'revision_id' is critical to retrieve the quality score from ORES,
# we need to remove any rows where this value is missing.

# Step 11.1: Identify rows with missing 'revision_id'
# We'll create a subset of rows where 'revision_id' is NaN (not available).
rows_with_missing_revision = final_df[final_df['revision_id'].isna()]

# Step 11.2: Remove rows where 'revision_id' is missing
# Now, we filter out the rows with missing 'revision_id' from the main DataFrame to ensure data accuracy.
final_df_filtered = final_df.dropna(subset=['revision_id'])

# Step 11.3: Log missing rows for review
# Save the rows with missing 'revision_id' to a separate file for review (optional).
rows_with_missing_revision.to_csv('../data/ores_error_log.txt', index=False)

# Step 11.4: Quick preview of the cleaned data after removing rows with missing 'revision_id'
# Let's print the cleaned DataFrame to verify that the missing rows have been successfully removed.
print(final_df_filtered.head())

# Step 11.5: Save the cleaned data to a new CSV file
# This cleaned dataset will be used for further analysis.
final_df_filtered.to_csv('../data/wp_politicians_by_country_clean.csv', index=False)


       country      region  population         article_title   revision_id  \
0  afghanistan  SOUTH ASIA        42.4        Majah Ha Adrif  1.233203e+09   
1  afghanistan  SOUTH ASIA        42.4     Haroon al-Afghani  1.230460e+09   
2  afghanistan  SOUTH ASIA        42.4           Tayyab Agha  1.225662e+09   
3  afghanistan  SOUTH ASIA        42.4  Khadija Zahra Ahmadi  1.234742e+09   
4  afghanistan  SOUTH ASIA        42.4        Aziza Ahmadyar  1.195651e+09   

  article_quality  
0           Start  
1               B  
2           Start  
3            Stub  
4           Start  


### Calculating Total and High-Quality Articles Per Capita (Country and Region Level)
In this step, we compute two key metrics to analyze the coverage and quality of Wikipedia articles for politicians:
1. **Total Articles Per Capita**: Number of articles about politicians per person, using population data.
2. **High-Quality Articles Per Capita**: Number of high-quality articles (those classified as "FA" or "GA") per person.

This is calculated for both country and region levels, ensuring that we aggregate data correctly and calculate per capita metrics for comparison.


In [70]:
# Step 1: Define high-quality classes
# We're defining 'FA' (Featured Article) and 'GA' (Good Article) as high-quality articles based on ORES scoring.
high_quality_classes = ["FA", "GA"]

# Step 2: Mark high-quality articles
# Here, we're adding a new column 'high_quality' to the DataFrame to mark whether an article is of high quality.
final_df_filtered['high_quality'] = final_df_filtered['article_quality'].apply(lambda x: x in high_quality_classes)

# Step 3: Group data by country to calculate total and high-quality articles
# We group the data by country and calculate:
# - Total articles per country (count of articles)
# - High-quality articles per country (sum of high-quality articles)
# - Population for each country (using the first population entry since it doesn't change per article)
country_grouped = final_df_filtered.groupby('country').agg(
    total_articles=('article_title', 'count'),
    high_quality_articles=('high_quality', 'sum'),
    population=('population', 'first')  # Population should be the same for all articles in a country
).reset_index()

# Step 4: Calculate per capita metrics
# Now we calculate the total articles per capita and high-quality articles per capita for each country.
# This is done by dividing the article counts by the population.
country_grouped['total_articles_per_capita'] = country_grouped['total_articles'] / country_grouped['population']
country_grouped['high_quality_articles_per_capita'] = country_grouped['high_quality_articles'] / country_grouped['population']

# Step 5: Group data by both region and country
# Grouping by both region and country to ensure that each country's population is only counted once in regional aggregates.
grouped = final_df_filtered.groupby(['region', 'country']).agg(
    total_articles=('article_title', 'count'),
    high_quality_articles=('high_quality', 'sum'),
    population=('population', 'first')
).reset_index()

# Step 6: Aggregate data by region
# After grouping by both region and country, we then aggregate the results by region.
# We sum the total articles, high-quality articles, and population across all countries within each region.
region_grouped = grouped.groupby('region').agg(
    total_articles=('total_articles', 'sum'),
    high_quality_articles=('high_quality_articles', 'sum'),
    population=('population', 'sum')
).reset_index()

# Step 7: Calculate regional per capita metrics
# Just like with countries, we now calculate the per capita metrics for each region.
# This is done by dividing the total articles and high-quality articles by the total population in the region.
region_grouped['total_articles_per_capita'] = region_grouped['total_articles'] / region_grouped['population']
region_grouped['high_quality_articles_per_capita'] = region_grouped['high_quality_articles'] / region_grouped['population']

# Step 8: Output results
# Now that we've calculated the metrics, let's take a look at the results for both countries and regions.
print(country_grouped.head())  # Preview the country-level metrics
print(region_grouped.head())   # Preview the region-level metrics

# Step 9: Save results to CSV files
# Finally, we save the grouped metrics to CSV files for future use or reporting.
country_grouped.to_csv('../data/country_grouped_metrics.csv', index=False)
region_grouped.to_csv('../data/region_grouped_metrics.csv', index=False)


               country  total_articles  high_quality_articles  population  \
0          afghanistan              85                      3        42.4   
1              albania              70                      7         2.7   
2              algeria              71                      1        46.8   
3               angola              58                      2        36.7   
4  antigua and barbuda              33                      0         0.1   

   total_articles_per_capita  high_quality_articles_per_capita  
0                   2.004717                          0.070755  
1                  25.925926                          2.592593  
2                   1.517094                          0.021368  
3                   1.580381                          0.054496  
4                 330.000000                          0.000000  
            region  total_articles  high_quality_articles  population  \
0        CARIBBEAN             218                      9        36.6   
1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df_filtered['high_quality'] = final_df_filtered['article_quality'].apply(lambda x: x in high_quality_classes)


### Analyzing Wikipedia Politicians Coverage: Top 10 and Bottom 10 Countries by Articles Per Capita
In this section, we will analyze Wikipedia articles about politicians by calculating the total articles per capita for each country. We'll then display the top 10 and bottom 10 countries based on this metric, while excluding countries with very small populations or missing values. Infinite values (due to zero population) are handled by replacing them with NaN and subsequently removing these entries.


### Results Summary

This section summarizes the coverage of Wikipedia articles about political figures across various countries and regions. We focus on two main aspects: total article coverage and high-quality article coverage.

1. **Top 10 Countries by Article Coverage**: 
   - Displays the top 10 countries with the highest number of Wikipedia articles about political figures, normalized by population.
   - Countries with a population below 0.05 million are excluded to avoid skewed results.

2. **Bottom 10 Countries by Article Coverage**: 
   - Highlights the countries with the lowest article coverage per capita.
   - As with the top 10, countries with a population of zero are excluded.

3. **Top 10 Countries by High-Quality Articles**: 
   - Lists the top 10 countries with the highest number of high-quality articles per capita. 
   - High-quality articles are defined as those classified as "FA" (Featured Article) or "GA" (Good Article).

4. **Bottom 10 Countries by High-Quality Articles**: 
   - Shows the countries with the lowest proportion of high-quality articles per capita.

5. **Geographic Regions by Total Coverage**: 
   - Ranks geographic regions based on the total number of articles per capita.

6. **Geographic Regions by High-Quality Coverage**: 
   - Ranks regions based on the number of high-quality articles per capita.


In [85]:
import pandas as pd
import numpy as np
from tabulate import tabulate

# Step 1: Load the country and region data
# We're loading the metrics that have already been processed for countries and regions
country_stats = pd.read_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/country_grouped_metrics.csv')
region_stats = pd.read_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/region_grouped_metrics.csv')

# Step 2: Handle infinite values in 'total_articles_per_capita'
# Replace infinite values (from zero population) with NaN, and drop rows with missing values
filtered_country_stats = country_stats.replace([np.inf, -np.inf], np.nan).dropna(subset=['total_articles_per_capita'])

# Step 3: Exclude countries with populations less than 0.05 million
# Filter out countries with very small populations since they can distort the results
filtered_country_stats = filtered_country_stats[filtered_country_stats['population'] > 0.05 * 1e6]

# Step 4: Get the top 10 countries by total articles per capita
# Sort to get the top 10 countries that have the highest number of articles per capita
top_10_countries_coverage = filtered_country_stats.nlargest(10, 'total_articles_per_capita')

# Step 5: Format the 'total_articles_per_capita' column for display
# Format to four decimal places for consistency
top_10_countries_coverage['total_articles_per_capita (per million)'] = top_10_countries_coverage['total_articles_per_capita'].apply(lambda x: format(x, '.4f'))

# Step 6: Add a rank column for clarity
# Number the countries from 1 to 10 based on their rank in coverage
top_10_countries_coverage['Rank (from the top)'] = range(1, len(top_10_countries_coverage) + 1)

# Step 7: Select relevant columns for display
# We're only interested in showing the rank, country, and the articles per capita
top_10_display = top_10_countries_coverage[['Rank (from the top)', 'country', 'total_articles_per_capita (per million)']]

# Step 8: Display the top 10 countries using tabulate for a cleaner output
# This prints the top 10 countries by articles per capita in a nicely formatted table
print("Top 10 countries by coverage-->")
print(tabulate(top_10_display, headers='keys', tablefmt='psql', showindex=False))

# Step 9: Save the top 10 countries by coverage as a CSV (optional)
# Save to CSV for future use or reporting
top_10_display.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/top_10_countries_by_coverage.csv', index=False)

# Step 10: Get the bottom 10 countries by total articles per capita
# Same as the top 10 but now we're grabbing the bottom 10 countries
bottom_10_countries_coverage = filtered_country_stats.nsmallest(10, 'total_articles_per_capita')
bottom_10_countries_coverage['total_articles_per_capita (per million)'] = bottom_10_countries_coverage['total_articles_per_capita'].apply(lambda x: format(x, '.4f'))
bottom_10_countries_coverage['Rank (from the bottom)'] = range(1, len(bottom_10_countries_coverage) + 1)

# Step 11: Select relevant columns for display for the bottom 10 countries
bottom_10_display = bottom_10_countries_coverage[['Rank (from the bottom)', 'country', 'total_articles_per_capita (per million)']]

# Step 12: Display the bottom 10 countries in a clean table
print("Bottom 10 countries by coverage-->")
print(tabulate(bottom_10_display, headers='keys', tablefmt='psql', showindex=False))


dnkcs
                 country  total_articles  high_quality_articles  population  \
0            afghanistan              85                      3        42.4   
1                albania              70                      7         2.7   
2                algeria              71                      1        46.8   
3                 angola              58                      2        36.7   
4    antigua and barbuda              33                      0         0.1   
..                   ...             ...                    ...         ...   
162            venezuela              56                      1        28.8   
163              vietnam              36                      2        98.9   
164                yemen              32                      0        34.4   
165               zambia               3                      0        20.2   
166             zimbabwe              69                      0        16.7   

     total_articles_per_capita  high_quality_

In [86]:
# Save the bottom 10 countries by coverage as a CSV for later use
bottom_10_display.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/bottom_10_countries_by_coverage.csv', index=False)

# Step 11: Top 10 countries by high-quality articles per capita
# Find the top 10 countries based on high-quality articles per capita
top_10_countries_high_quality = filtered_country_stats.nlargest(10, 'high_quality_articles_per_capita')
top_10_countries_high_quality['high_quality_articles_per_capita (per million)'] = top_10_countries_high_quality['high_quality_articles_per_capita'].apply(lambda x: format(x, '.4f'))

# Add a rank to make it clear who's at the top
top_10_countries_high_quality['Rank (from the top)'] = range(1, len(top_10_countries_high_quality) + 1)

# Select the columns we need to display
top_10_high_quality_display = top_10_countries_high_quality[['Rank (from the top)', 'country', 'high_quality_articles_per_capita (per million)']]

# Display the top 10 countries for high-quality articles in a nice table
print("Top 10 countries by high-quality articles per capita-->")
print(tabulate(top_10_high_quality_display, headers='keys', tablefmt='psql', showindex=False))

# Save the top 10 high-quality countries to a CSV for later use
top_10_high_quality_display.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/top_10_countries_by_high_quality.csv', index=False)

# Step 12: Bottom 10 countries by high-quality articles per capita
# Find the bottom 10 countries based on high-quality articles per capita
bottom_10_countries_high_quality = filtered_country_stats.nsmallest(10, 'high_quality_articles_per_capita')
bottom_10_countries_high_quality['high_quality_articles_per_capita (per million)'] = bottom_10_countries_high_quality['high_quality_articles_per_capita'].apply(lambda x: format(x, '.4f'))

# Add a rank for clarity
bottom_10_countries_high_quality['Rank (from the bottom)'] = range(1, len(bottom_10_countries_high_quality) + 1)

# Select relevant columns for display
bottom_10_high_quality_display = bottom_10_countries_high_quality[['Rank (from the bottom)', 'country', 'high_quality_articles_per_capita (per million)']]

# Display the bottom 10 countries for high-quality articles in a table
print("Bottom 10 countries by high-quality articles per capita-->")
print(tabulate(bottom_10_high_quality_display, headers='keys', tablefmt='psql', showindex=False))

# Save the bottom 10 countries for high-quality articles to a CSV for later use
bottom_10_high_quality_display.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/bottom_10_countries_by_high_quality.csv', index=False)

# Step 13: Geographic regions by total articles per capita
# Rank regions by total articles per capita and display the table
regions_by_total_coverage = region_stats.sort_values(by='total_articles_per_capita', ascending=False)
print("Geographic regions by total coverage-->")
print(tabulate(regions_by_total_coverage[['region', 'total_articles_per_capita']], headers='keys', tablefmt='psql', showindex=False))

# Step 14: Geographic regions by high-quality articles per capita
# Rank regions by high-quality articles per capita and display the table
regions_by_high_quality_coverage = region_stats.sort_values(by='high_quality_articles_per_capita', ascending=False)
print("Geographic regions by high-quality coverage-->")
print(tabulate(regions_by_high_quality_coverage[['region', 'high_quality_articles_per_capita']], headers='keys', tablefmt='psql', showindex=False))

# Save the region data for future reference
regions_by_total_coverage.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/regions_by_total_coverage.csv', index=False)
regions_by_high_quality_coverage.to_csv('/Users/radhikasethi/Documents/github/data-512-homework-2/data/regions_by_high_quality_coverage.csv', index=False)


Top 10 countries by high-quality articles per capita-->
+-----------------------+-----------------------+--------------------------------------------------+
|   Rank (from the top) | country               |   high_quality_articles_per_capita (per million) |
|-----------------------+-----------------------+--------------------------------------------------|
|                     1 | montenegro            |                                           5      |
|                     2 | luxembourg            |                                           2.8571 |
|                     3 | albania               |                                           2.5926 |
|                     4 | kosovo                |                                           2.3529 |
|                     5 | maldives              |                                           1.6667 |
|                     6 | lithuania             |                                           1.3793 |
|                     7 | croatia  