# Homework 2 Description
## Considering Bias in Data

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, we combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We were expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies among countries. The final analysis consists of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.


# Step 1: Getting the Article and Population Data


The first step is getting the data, which lives in several different places. You will need data that lists Wikipedia articles of politicians and data for country populations.
The Wikipedia Category:Politicians by nationality was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries. This data is in the homework folder as politicians_by_country.SEPT.2022.csv.
The population data is available in CSV format as population_by_country_2022.csv from the homework folder. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.


## Step 1a - Read the population and politicians input data files
Load the following two CSV files:

1. Population data (population_by_country_2022.csv): countries, regions, and the population of each country or region
2. Politicians data (politicians_by_country.SEPT.2022.csv): crawled Wikipedia article pages about politicians from different countries

#### Brief overview of the approach:
In this phase we configure the REST API endpoints used to look up article revisions, then handle data inconsistencies such as exact duplicate politician rows and countries whose population is listed as zero. Step 2 then retrieves the ORES scores.

## Import Required Libraries

In [1]:
# These are standard python modules for dealing with json and time objects and REST APIs and regular expressions
import json, time, urllib.parse, re

# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests

# Standard libraries for dataframe, numpy array loading and manipulation and plotting graphs and other standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings


In [94]:
# Create the required output directories
os.makedirs('./outputs', exist_ok=True)
os.makedirs('./intermediate_outputs', exist_ok=True)

warnings.filterwarnings(action='once')

## Declaring Global variables

Declaring global parameters in a single cell at the start of the notebook is good practice.

### Page Info Variables

In [3]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'rlokwani@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}


# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
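
As a rough check on the throttle constants above, the sketch below redoes the arithmetic. It assumes the 2 ms latency figure holds, which is only the notebook's own working assumption, and `n_requests` borrows the de-duplicated row count reported later in this notebook.

```python
# Re-deriving the throttle arithmetic; values copied from the constants cell.
API_LATENCY_ASSUMED = 0.002                              # assumed seconds of latency per call
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED  # seconds slept before each call

# Time budget per request if the latency assumption is right
seconds_per_request = API_THROTTLE_WAIT + API_LATENCY_ASSUMED
max_requests_per_second = 1.0 / seconds_per_request

n_requests = 7582  # de-duplicated politician rows reported later in the notebook
min_total_seconds = n_requests * seconds_per_request

print(f"sleep per request: {API_THROTTLE_WAIT * 1000:.1f} ms")
print(f"ceiling: {max_requests_per_second:.0f} requests/second")
print(f"lower bound for {n_requests} requests: {min_total_seconds:.0f} s")
```

In practice the real network latency is likely far higher than 2 ms, so the actual request rate would sit well below this ceiling.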


## ORES Scores Global variables

In [4]:
#########
#
#    CONSTANTS
#

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'rlokwani@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
#
# The current ML models for English wikipedia are:
#   "articlequality"
#   "articletopic"
#   "damaging"
#   "version"
#   "draftquality"
#   "drafttopic"
#   "goodfaith"
#   "wp10"
#
# The specific documentation on these is scattered so if you want to use one you'll have to look around.
#

## Declaring Global Functions

## Function for requesting page info

In [5]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
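
To make the shape of the page info response concrete, here is a minimal sketch that parses two hand-written samples instead of calling the API. The helper name `extract_lastrevid` and the page IDs are hypothetical; the title and revision ID are taken from this notebook's own output, and the "missing" sample reflects how the API is assumed to report a page that does not exist.

```python
# Hand-written samples in the shape returned by action=query&prop=info.
# Page IDs are made up; the title and revision ID come from this notebook's output.
sample_found = {
    "query": {"pages": {"12345": {"pageid": 12345,
                                  "title": "Shahjahan Noori",
                                  "lastrevid": 1099689043}}}
}
# Assumed shape for a page that does not exist: keyed by "-1", no lastrevid.
sample_missing = {
    "query": {"pages": {"-1": {"title": "No Such Page", "missing": ""}}}
}

def extract_lastrevid(page_info):
    """Return (title, lastrevid), or None when the page or revision is absent."""
    if not page_info:
        return None
    pages = page_info.get("query", {}).get("pages", {})
    for page in pages.values():
        if "lastrevid" in page:
            return page["title"], page["lastrevid"]
    return None

print(extract_lastrevid(sample_found))
print(extract_lastrevid(sample_missing))
```

Returning `None` for both a failed request and a missing page lets the caller log the title in one place rather than relying on a `KeyError`.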

## Function for getting ORES scores

In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
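
The sketch below shows, without any network call, how the request URL is assembled from the template and how the prediction is read out of the nested response. The response dictionary is a trimmed, hand-written stand-in (a real response is assumed to carry more fields), and `extract_prediction` is a hypothetical helper name.

```python
# Constants copied from the ORES constants cell.
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

params = {"context": "enwiki", "revid": 1099689043, "model": "articlequality"}
request_url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(**params)
print(request_url)

# Trimmed stand-in for a response; only the keys this notebook reads are kept.
sample_response = {
    "enwiki": {"scores": {"1099689043": {
        "articlequality": {"score": {"prediction": "GA"}}}}}
}

def extract_prediction(response, revid, context="enwiki", model="articlequality"):
    """Return the predicted class, or None if any level of the response is missing."""
    try:
        return response[context]["scores"][str(revid)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        return None

print(extract_prediction(sample_response, 1099689043))
```

Note that the revision ID is an integer in the template but a string key in the response, hence the `str(revid)` lookup.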


# Data Preprocessing

In [47]:
# Reading both the given CSVs for politicians and country population
df_politicians=pd.read_csv('./input_data/politicians_by_country_SEPT.2022.csv - politicians_international_SEPT.2022.csv.csv')
df_country_population=pd.read_csv('./input_data/population_by_country_2022.csv - population_by_country_2022.csv.csv')

In [48]:
df_politicians.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [49]:
df_country_population.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


# Step 1 b) Check for data inconsistencies

You should be a little careful with the data. Crawling Wikipedia categories to identify relevant page subsets can result in misleading and/or duplicate category labels. Naturally, the data crawl attempted to resolve these, but not all may have been caught. You should document how you handle any data inconsistencies.

The population_by_country_2022.csv contains some rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'Geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in politicians_by_country.SEPT.2022.csv, but you will want to retain some of them so that you can report coverage and quality by region as specified in the analysis section below.


## Checking politician CSV for duplicate records
1. Checking for duplicate politician entries with different country
2. Checking for duplicate politician with same country

In [50]:
redundant_articles_without_country = df_politicians[df_politicians.duplicated(subset=['name', 'url'], keep = False)]
redundant_articles_with_country = df_politicians[df_politicians.duplicated(subset=['name', 'url', 'country'], keep = False)]

print ("Duplicate articles from a different country: ", len(redundant_articles_without_country))
print ("Duplicate articles: ", len(redundant_articles_with_country))

Duplicate articles from a different country:  98
Duplicate articles:  4
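
The two checks above overlap: a row that repeats name, url and country is also flagged by the name and url check, so the first count includes the exact repeats. A minimal sketch on made-up rows shows how the two masks could be combined to isolate articles listed under more than one country.

```python
import pandas as pd

# Toy rows; names, URLs and countries are made up for illustration.
toy = pd.DataFrame({
    "name":    ["A", "A", "B", "B", "C"],
    "url":     ["u/A", "u/A", "u/B", "u/B", "u/C"],
    "country": ["X", "Y", "Z", "Z", "X"],
})

# Rows whose article appears more than once, regardless of country
same_article = toy.duplicated(subset=["name", "url"], keep=False)
# Rows that are repeated exactly, country included
exact_row = toy.duplicated(subset=["name", "url", "country"], keep=False)

# Same article listed under different countries only
cross_country = toy[same_article & ~exact_row]

# Dropping only the exact repeats, as the next cell does
deduplicated = toy.drop_duplicates(subset=["name", "url", "country"], keep="last")

print(int(same_article.sum()), int(exact_row.sum()), len(cross_country), len(deduplicated))
```

Whether cross-country duplicates should be kept, assigned to one country, or dropped is a judgment call; this notebook keeps them.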


In [51]:
# Removing duplicate entries
df_politicians = df_politicians[~df_politicians.duplicated(subset=['name', 'url','country'], keep = 'last')]
print (len(df_politicians))
df_politicians.head()

7582


Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


## Checking country population CSV 

In [52]:
### Check that every Geography value is unique

len(df_country_population)==len(np.unique(df_country_population['Geography']))

True

In [53]:
### Check if the population does not have anomalous values
print ("Minimum population:", np.min(df_country_population['Population (millions)']))
print ("Maximum population: ", np.max(df_country_population['Population (millions)']))

Minimum population: 0.0
Maximum population:  7963.0


In [54]:
## Identifying countries whose population is listed as 0.0 million; these would cause divide-by-zero results in the per-capita ratios later

minimum_population = df_country_population[df_country_population['Population (millions)'] == 0]
print ("Countries with zero population: ", len(minimum_population), minimum_population['Geography'].values)

Countries with zero population:  6 ['Liechtenstein' 'Monaco' 'San Marino' 'Nauru' 'Palau' 'Tuvalu']
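
A population of 0.0 here only means the figure is too small to show in millions at this precision. The sketch below illustrates why such rows matter: pandas returns `inf` rather than raising when dividing by zero. Monaco's article count is made up; Andorra's row uses values that appear later in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Monaco", "Andorra"],
    "population": [0.0, 0.1],   # millions, as in the population CSV
    "article_count": [5, 10],   # Monaco's count is made up for illustration
})

# Dividing by a zero population yields inf instead of raising an exception
toy["articles_per_capita"] = toy["article_count"] / (toy["population"] * 1_000_000)

# np.isfinite filters out inf, and also NaN (which 0 articles / 0 population would give)
finite = toy[np.isfinite(toy["articles_per_capita"])]
print(finite)
```

Filtering with `np.isfinite` is one option; comparing against `np.inf`, as done later in this notebook, would leave any NaN rows in place.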


## Handling cumulative and country level counts

Using the structure of the CSV, assign each entry the region closest to it in the nested hierarchy

In [55]:
country=[]
population=[]
regions=[]
temporary_region=""

for idx, row in df_country_population.iterrows():
    if row['Geography'].isupper():
        temporary_region=row['Geography']
    else:
        country.append(row['Geography'])
        population.append(row['Population (millions)'])
        regions.append(temporary_region)

In [56]:
preprocessed_country_population_dataframe=pd.DataFrame()
preprocessed_country_population_dataframe['Geography']=country
preprocessed_country_population_dataframe['Population (in millions)']=population
preprocessed_country_population_dataframe['Region']=regions
preprocessed_country_population_dataframe.head()      #Final dataframe for future use

Unnamed: 0,Geography,Population (in millions),Region
0,Algeria,44.9,NORTHERN AFRICA
1,Egypt,103.5,NORTHERN AFRICA
2,Libya,6.8,NORTHERN AFRICA
3,Morocco,36.7,NORTHERN AFRICA
4,Sudan,46.9,NORTHERN AFRICA
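
The same region assignment can be sketched without an explicit loop by forward-filling the ALL CAPS rows. This is an alternative formulation, not the code used above; the first five rows mirror the CSV preview shown earlier, while "REGION B" and "Country X" are made-up rows added to show the region switching.

```python
import pandas as pd

toy = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
                  "REGION B", "Country X"],          # last two rows are made up
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5, 10.0, 1.5],
})

# ALL CAPS rows are regional aggregates; everything else is a country
is_region = toy["Geography"].str.isupper()

# Keep the name only on region rows, then carry it down to the countries below
toy["Region"] = toy["Geography"].where(is_region).ffill()

countries = toy[~is_region].reset_index(drop=True)
print(countries)
```

Because the most recent ALL CAPS row wins, each country lands in the lowest region of the hierarchy that precedes it, which matches the behaviour of the loop.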


In [93]:
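# Sum only the population column so the non-numeric 'Geography' column is not aggregated
preprocessed_region_population=preprocessed_country_population_dataframe.groupby('Region')['Population (in millions)'].sum().round().reset_index()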
preprocessed_region_population.columns=['region', 'population']
preprocessed_region_population



Unnamed: 0,region,population
0,CARIBBEAN,44.0
1,CENTRAL AMERICA,178.0
2,CENTRAL ASIA,78.0
3,EAST ASIA,1674.0
4,EASTERN AFRICA,473.0
5,EASTERN EUROPE,287.0
6,MIDDLE AFRICA,196.0
7,NORTHERN AFRICA,251.0
8,NORTHERN AMERICA,372.0
9,NORTHERN EUROPE,106.0


## Step 2: Getting Article Quality Predictions

Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:
FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures. These quality classes are a subset of the quality assessment categories developed by Wikipedia editors.

ORES requires a specific revision ID of a specific article to be able to make a label prediction. You can use the API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.
Putting this together, to get a Wikipedia page quality prediction from ORES for each politician’s article page you will need to: a) read each line of politicians_by_country.SEPT.2022.csv, b) make a page info request to get the current page revision, and c) make an ORES request using the page title and current revision id.  

The homework folder contains example code in notebooks to illustrate making a page info request and making an ORES request. This sample code is licensed CC0 so feel free to reuse any of the code in either notebook without attribution.

Note: It is possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. The choice is up to you.
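
One way to keep such a log is to collect failures alongside results instead of only printing them. The sketch below is a minimal illustration under assumptions: `collect_with_failure_log` and `fake_fetch` are hypothetical names, and the stand-in fetcher replaces the real API calls so the example runs offline.

```python
def collect_with_failure_log(titles, fetch):
    """Call fetch(title) for each title; return (results, failures)."""
    results, failures = {}, []
    for title in titles:
        try:
            value = fetch(title)
            if value is None:
                raise ValueError("empty response")
            results[title] = value
        except Exception as err:
            # Record the title and the reason so the log can be saved later
            failures.append((title, repr(err)))
    return results, failures

def fake_fetch(title):
    # Stand-in for a real request; raises KeyError for unknown titles
    known = {"Shahjahan Noori": "GA", "Tayyab Agha": "Start"}
    return known[title]

results, failures = collect_with_failure_log(
    ["Shahjahan Noori", "Roman Konoplev", "Tayyab Agha"], fake_fetch)

# One title per line, ready to be written to a log file
log_text = "\n".join(title for title, _ in failures)
print(log_text)
```

The resulting `log_text` could be written to a file in the outputs folder or simply printed in the notebook, as the note above allows.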


### Brief overview of the process:
We first call the MediaWiki API:Info endpoint to get the latest revision ID of each article, then pass those revision IDs to the ORES API to get the predicted quality class of each article.

## Step 2a) Get Revision ID

In [17]:
ARTICLE_TITLES=list(df_politicians['name'])     

In [18]:
# Get responses for each of the articles, parse the JSON to get required details. We use exception handling to avoid runtime errors
article_titles=[]
article_lastrevid=[]

for idx, i in enumerate(ARTICLE_TITLES):
    try:
        page_info = request_pageinfo_per_article(article_title = i, request_template = PAGEINFO_PARAMS_TEMPLATE)
        page_id = list(page_info['query']['pages'].keys())[0]
        if page_info['query']['pages'][page_id]['title'] and page_info['query']['pages'][page_id]['lastrevid']:
            article_titles.append(page_info['query']['pages'][page_id]['title'])
            article_lastrevid.append(page_info['query']['pages'][page_id]['lastrevid'])
    except Exception:
        print("Couldn't get the page info for: ", i)    #Exception for pages not present


df_article_info=pd.DataFrame()
df_article_info['article_title']=article_titles
df_article_info['lastrevid']=article_lastrevid

df_article_info.head()

Couldn't get the page info for:  Prince Ofosu Sefah
Couldn't get the page info for:  Harjit Kaur Talwandi
Couldn't get the page info for:  Abd al-Razzaq al-Hasani
Couldn't get the page info for:  Abiodun Abimbola Orekoya
Couldn't get the page info for:  Segun “Aeroland” Adewale
Couldn't get the page info for:  Roman Konoplev
Couldn't get the page info for:  Nhlanhla “Lux” Dlamini


Unnamed: 0,article_title,lastrevid
0,Shahjahan Noori,1099689043
1,Abdul Ghafar Lakanwal,943562276
2,Majah Ha Adrif,852404094
3,Haroon al-Afghani,1095102390
4,Tayyab Agha,1104998382


In [19]:
df_article_info.to_csv('./intermediate_outputs/Articles_lastrevisionids.csv') #Saving this CSV as a checkpoint

In [57]:
#Removing null value entries for revision IDs to avoid issues for future APIs

df_articles_info=pd.read_csv('./intermediate_outputs/Articles_lastrevisionids.csv')
df_articles_info = df_articles_info[~df_articles_info['lastrevid'].isnull()]
df_articles_info.drop(columns=['Unnamed: 0'], inplace=True)
df_articles_info.head()

Unnamed: 0,article_title,lastrevid
0,Shahjahan Noori,1099689043
1,Abdul Ghafar Lakanwal,943562276
2,Majah Ha Adrif,852404094
3,Haroon al-Afghani,1095102390
4,Tayyab Agha,1104998382


### Step 2 b) Get ORES scores

In [23]:
# Get ORES API responses for each of the articles, parse the JSON to get required details. We use exception handling to avoid runtime errors
# Important to note that: We won't be getting ORES scores for the 7 articles from 2 a) for which revision IDs threw an error. 

title_revisions_dict = dict(zip(df_articles_info.article_title, df_articles_info.lastrevid))
ARTICLE_TITLES=list(df_articles_info['article_title'])

ores_scores_df = list()
for i in ARTICLE_TITLES:
    # Look up the revision ID first so it is available for both the request and the response parsing
    revid = title_revisions_dict[i]
    try:
        response = request_ores_score_per_article(revid)
        score = response['enwiki']['scores'][str(revid)]['articlequality']['score']['prediction']
        ores_scores_df.append([i, revid, score])
    except Exception:
        print ("ORES score not found for: ", i)


ores_scores_df = pd.DataFrame(ores_scores_df, columns = ['name', 'lastrevid', 'ores_prediction'])
ores_scores_df.to_csv('./intermediate_outputs/ores_scores.csv')

## Step 3: Combining the datasets

Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.

Identify all countries for which there are no matches and output a list of those countries, with each country on a separate line called:
wp_countries-no_match.txt

Consolidate the remaining data into a single CSV file called:
wp_politicians_by_country.csv

The schema for that file should look something like this:
Column
country
region
population
article_title
revision_id
article_quality



## Process brief:
We merge the politician and ORES dataframes on name, then merge the result with the population dataframe on country. Rows with no matching country on either side are logged and removed, and the remaining rows are saved as a CSV.

In [58]:
# Merging the two dataframes on names and then Geography/Country to get a combined dataset
politicians_country_ores_df=df_politicians.merge(ores_scores_df, on='name', how='right').drop(columns=['url'])
politicians_region_ores_df=politicians_country_ores_df.merge(preprocessed_country_population_dataframe, left_on='country',
                                                            right_on='Geography', how='outer')
print(politicians_country_ores_df.shape)
print(politicians_region_ores_df.shape)


(7575, 4)
(7600, 7)


In [59]:
politicians_region_ores_df.head()

Unnamed: 0,name,country,lastrevid,ores_prediction,Geography,Population (in millions),Region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA


In [60]:
#Listing null values for country
no_wiki_data = politicians_region_ores_df[politicians_region_ores_df['country'].isnull()]['Geography'].unique()
# List of countries where there is no population data
no_population_data = politicians_region_ores_df[politicians_region_ores_df['Geography'].isnull()]['country'].unique()

missing_countries=list(set(np.concatenate([no_wiki_data, no_population_data])))

In [61]:
print(missing_countries)

['Martinique', 'China,  Macao SAR', 'Western Sahara', 'New Zealand', 'Ireland', 'Canada', 'Philippines', 'Mayotte', 'China,  Hong Kong SAR', 'Sao Tome and Principe', 'Reunion', 'eSwatini', 'Korean', 'Kiribati', 'Guadeloupe', 'Brunei', 'Mauritius', 'Curacao', 'French Polynesia', 'New Caledonia', 'Guam', 'French Guiana', 'United States', 'Puerto Rico', 'Australia', 'United Kingdom']
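
An alternative way to find the unmatched countries is the `indicator=True` option of `merge`, which labels each row by the side it came from. The sketch below uses toy data: "Atlantis" and the politician names are made up, while the population values mirror rows shown earlier.

```python
import pandas as pd

articles = pd.DataFrame({
    "name": ["P1", "P2", "P3"],                       # made-up article names
    "country": ["Algeria", "Egypt", "Atlantis"],      # "Atlantis" is made up
})
population = pd.DataFrame({
    "Geography": ["Algeria", "Egypt", "Sudan"],
    "Population (in millions)": [44.9, 103.5, 46.9],
})

merged = articles.merge(population, left_on="country", right_on="Geography",
                        how="outer", indicator=True)

# left_only: articles whose country has no population row
no_population = sorted(merged.loc[merged["_merge"] == "left_only", "country"].unique())
# right_only: population rows with no articles
no_articles = sorted(merged.loc[merged["_merge"] == "right_only", "Geography"].unique())
# both: rows that can go into the combined dataset
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(no_population, no_articles, len(matched))
```

Keeping the two unmatched lists separate also makes it easier to spot spelling mismatches between the two sources before writing the no-match file.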


In [62]:
# Writing the list of missing countries to a log file, as required

with open('./outputs/wp_countries-no_match.txt', 'w') as file:
    for country in missing_countries:
        file.write(f"{country}\n")

In [63]:
# Removing null values and then adjusting the dataframe titles to expected schema
politicians_region_ores_df = politicians_region_ores_df[(~politicians_region_ores_df['country'].isnull())]
politicians_region_ores_df = politicians_region_ores_df[(~politicians_region_ores_df['Geography'].isnull())]

politicians_region_ores_df = politicians_region_ores_df.drop(columns=['Geography'])
politicians_region_ores_df.columns=['article_title', 'country', 'revision_id', 'article_quality', 'population', 'region']

politicians_region_ores_df.to_csv('./outputs/wp_politicians_by_country.csv', index = False)
politicians_region_ores_df.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,41.1,SOUTH ASIA


# Step 4: Analysis
Your analysis will consist of calculating total-articles-per-population (a ratio representing the number of articles per person)  and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis. All of these values are to be “per capita”.

In this analysis a country can only exist in one region. The population_by_country_2022.csv actually represents regions in a hierarchical order. For your analysis always put a country in the closest (lowest in the hierarchy) region.

For this analysis you should consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
Also, keep in mind that the population_by_country_2022.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.

## Process brief:

1. We aggregate the articles and population at country and region levels, then calculate articles_per_capita by dividing number of articles by population*1,000,000.
2. For high quality articles dataframe we filter the dataframe for respective categories first (FA and GA) and then follow step 1.
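
As a worked instance of the calculation described above, the sketch below applies the ratio to Afghanistan, using the article counts and population that appear in the tables of this notebook. The helper name `articles_per_capita` is chosen here for illustration.

```python
MILLION = 1_000_000

def articles_per_capita(article_count, population_millions):
    """Articles per person, given a population expressed in millions."""
    return article_count / (population_millions * MILLION)

# Afghanistan: 118 articles in total, 6 of them FA or GA, population 41.1 million
afghanistan_total = articles_per_capita(118, 41.1)
afghanistan_high_quality = articles_per_capita(6, 41.1)

print(f"{afghanistan_total:.6e}")
print(f"{afghanistan_high_quality:.6e}")
```

The printed values should agree with the Afghanistan rows in the country tables below, which is a quick way to confirm the unit conversion from millions.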

## 4a) Total articles per population

### By region

In [75]:
politicians_region_ores_df.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,41.1,SOUTH ASIA


In [95]:
# Use the regional population totals computed during preprocessing, count the articles per region, then compute articles_per_capita
region_population = preprocessed_region_population    # Regional totals from the preprocessing section
region_count=politicians_region_ores_df[['region', 'article_title']].groupby('region').count().reset_index()
total_articles_region_data=region_population.merge(region_count, on='region')
total_articles_region_data.columns=['region', 'population', 'article_count']
total_articles_region_data['articles_per_capita'] = total_articles_region_data['article_count'] / (total_articles_region_data['population'] * 1000000)
total_articles_region_data

Unnamed: 0,region,population,article_count,articles_per_capita
0,CARIBBEAN,44.0,201,4.568182e-06
1,CENTRAL AMERICA,178.0,195,1.095506e-06
2,CENTRAL ASIA,78.0,106,1.358974e-06
3,EAST ASIA,1674.0,246,1.469534e-07
4,EASTERN AFRICA,473.0,648,1.369979e-06
5,EASTERN EUROPE,287.0,735,2.560976e-06
6,MIDDLE AFRICA,196.0,203,1.035714e-06
7,NORTHERN AFRICA,251.0,227,9.043825e-07
8,NORTHERN EUROPE,106.0,262,2.471698e-06
9,OCEANIA,44.0,86,1.954545e-06


### By country

In [77]:
# Keep one row per country to get its population, count the articles per country, then compute articles_per_capita
temporary_politicians_df=politicians_region_ores_df[~politicians_region_ores_df.duplicated(subset=['country', 'region'], keep = 'last')]

country_population = temporary_politicians_df[['country', 'population']].groupby('country').sum().reset_index()
country_count=politicians_region_ores_df[['country', 'article_title']].groupby('country').count().reset_index()
total_articles_country=country_population.merge(country_count, on='country')
total_articles_country.columns=['country', 'population', 'article_count']
total_articles_country['articles_per_capita'] = total_articles_country['article_count'] / (total_articles_country['population'] * 1000000)
total_articles_country = total_articles_country[total_articles_country['articles_per_capita'] != np.inf] #Removing divide-by-zero error
print(len(total_articles_country['country'].unique()))
total_articles_country.reset_index(inplace=True)
total_articles_country

178


Unnamed: 0,index,country,population,article_count,articles_per_capita
0,0,Afghanistan,41.1,118,2.871046e-06
1,1,Albania,2.8,83,2.964286e-05
2,2,Algeria,44.9,34,7.572383e-07
3,3,Andorra,0.1,10,1.000000e-04
4,4,Angola,35.6,42,1.179775e-06
...,...,...,...,...,...
173,179,Venezuela,28.3,62,2.190813e-06
174,180,Vietnam,99.4,27,2.716298e-07
175,181,Yemen,33.7,61,1.810089e-06
176,182,Zambia,20.0,13,6.500000e-07


## 4b) High quality articles

### By region

In [96]:
# Filter for high quality articles (FA or GA), count them per region, then compute articles_per_capita using the regional population totals

region_population = preprocessed_region_population
high_quality_region_df = politicians_region_ores_df[(politicians_region_ores_df['article_quality'] == 'FA') | (politicians_region_ores_df['article_quality'] == 'GA')]
region_count=high_quality_region_df[['region', 'article_title']].groupby('region').count().reset_index()
high_quality_region_df=region_population.merge(region_count, on='region')
high_quality_region_df.columns=['region', 'population', 'article_count']
high_quality_region_df['articles_per_capita'] = high_quality_region_df['article_count'] / (high_quality_region_df['population'] * 1000000)
high_quality_region_df

Unnamed: 0,region,population,article_count,articles_per_capita
0,CARIBBEAN,44.0,8,1.818182e-07
1,CENTRAL AMERICA,178.0,10,5.617978e-08
2,CENTRAL ASIA,78.0,3,3.846154e-08
3,EAST ASIA,1674.0,16,9.557945e-09
4,EASTERN AFRICA,473.0,15,3.171247e-08
5,EASTERN EUROPE,287.0,38,1.324042e-07
6,MIDDLE AFRICA,196.0,5,2.55102e-08
7,NORTHERN AFRICA,251.0,7,2.788845e-08
8,NORTHERN EUROPE,106.0,8,7.54717e-08
9,OCEANIA,44.0,2,4.545455e-08
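
The FA/GA filter can also be written with `isin`, and the default inner merge used above drops any region or country that has no high quality articles at all. The sketch below, on made-up data, shows one possible variant that keeps those rows with a count of zero; whether they belong in a ranking is a judgment call this notebook leaves implicit.

```python
import pandas as pd

# Toy data; country labels and populations are made up for illustration.
articles = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "article_quality": ["FA", "Stub", "GA", "Start"],
})
population = pd.DataFrame({"country": ["A", "B", "C"],
                           "population": [2.0, 4.0, 1.0]})   # millions

# High quality means predicted FA or GA
high_quality = articles[articles["article_quality"].isin(["FA", "GA"])]
counts = high_quality.groupby("country").size().rename("article_count").reset_index()

# A left merge keeps countries with no high quality articles; fill their count with 0
per_country = population.merge(counts, on="country", how="left")
per_country["article_count"] = per_country["article_count"].fillna(0).astype(int)
per_country["articles_per_capita"] = (
    per_country["article_count"] / (per_country["population"] * 1_000_000))

print(per_country)
```

With this variant, country C appears with zero high quality articles instead of disappearing from the table.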


### By country

In [80]:
# Filter for high quality articles (FA or GA), keep one row per country to get its population, count the high quality articles per country, then compute articles_per_capita

temporary_politicians_df= politicians_region_ores_df[~politicians_region_ores_df.duplicated(subset=['country', 'region'], keep = 'last')]
country_population = temporary_politicians_df[['country', 'population']].groupby('country').sum().reset_index()

high_quality_country_df = politicians_region_ores_df[(politicians_region_ores_df['article_quality'] == 'FA') | (politicians_region_ores_df['article_quality'] == 'GA')]
country_count=high_quality_country_df[['country', 'article_title']].groupby('country').count().reset_index()
high_quality_country_df=country_population.merge(country_count, on='country')
high_quality_country_df.columns=['country', 'population', 'article_count']
high_quality_country_df['articles_per_capita'] = high_quality_country_df['article_count'] / (high_quality_country_df['population'] * 1000000)
high_quality_country_df = high_quality_country_df[high_quality_country_df['articles_per_capita'] != np.inf]
high_quality_country_df.reset_index(inplace=True)
high_quality_country_df.drop(columns=['index'], inplace=True)
high_quality_country_df

Unnamed: 0,country,population,article_count,articles_per_capita
0,Afghanistan,41.1,6,1.459854e-07
1,Albania,2.8,6,2.142857e-06
2,Andorra,0.1,2,2.000000e-05
3,Armenia,3.0,1,3.333333e-07
4,Azerbaijan,10.2,1,9.803922e-08
...,...,...,...,...
87,Ukraine,41.0,4,9.756098e-08
88,United Arab Emirates,9.4,4,4.255319e-07
89,Uruguay,3.6,1,2.777778e-07
90,Vietnam,99.4,2,2.012072e-08


## Step 5: Results

We filter the dataframe as required and produce results as data tables.

### 5.1 Top 10 countries by coverage 
The 10 countries with the highest total articles per capita (in descending order) 

In [81]:
# Sorting entire country dataframe by articles_per_capita and then taking the top 10
top10_total_country=total_articles_country.sort_values(by=['articles_per_capita'], ascending=False).head(10).reset_index()
top10_total_country=top10_total_country[['country']]
top10_total_country.columns=['top10_countries_by_coverage']
top10_total_country

Unnamed: 0,top10_countries_by_coverage
0,Antigua and Barbuda
1,Federated States of Micronesia
2,Andorra
3,Barbados
4,Marshall Islands
5,Montenegro
6,Seychelles
7,Luxembourg
8,Bhutan
9,Grenada


### 5.2 Bottom 10 countries by coverage: 
The 10 countries with the lowest total articles per capita (in ascending order).

In [82]:
# Sorting entire country dataframe by articles_per_capita and then taking the bottom 10

bottom10_total_country=total_articles_country.sort_values(by=['articles_per_capita'], ascending=True).head(10).reset_index()
bottom10_total_country=bottom10_total_country[['country']]
bottom10_total_country.columns=['bottom10_countries_by_coverage']
bottom10_total_country

Unnamed: 0,bottom10_countries_by_coverage
0,China
1,Mexico
2,Saudi Arabia
3,Romania
4,India
5,Sri Lanka
6,Egypt
7,Ethiopia
8,Taiwan
9,Vietnam
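
The sort-then-head pattern used in these cells can also be expressed with `nlargest` and `nsmallest`. The sketch below is an equivalent formulation on a six-row sample whose values are copied from the country table earlier in this notebook, so the rankings it prints apply to the sample only.

```python
import pandas as pd

# Six rows copied from the articles-per-capita country table above
coverage = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Vietnam", "Zambia"],
    "articles_per_capita": [2.871046e-06, 2.964286e-05, 7.572383e-07,
                            1.0e-04, 2.716298e-07, 6.5e-07],
})

# Highest coverage first
top2 = coverage.nlargest(2, "articles_per_capita")["country"].tolist()
# Lowest coverage first
bottom2 = coverage.nsmallest(2, "articles_per_capita")["country"].tolist()

print(top2)
print(bottom2)
```

Keeping the `articles_per_capita` column in the displayed result tables, rather than the country name alone, would make the size of the gaps visible.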


### 5.3 Top 10 countries by high quality 
The 10 countries with the highest high quality articles per capita (in descending order)

In [83]:
# Sorting high quality dataframe for countries by articles_per_capita and then taking the top 10

top10_total_country=high_quality_country_df.sort_values(by=['articles_per_capita'], ascending=False).head(10).reset_index()
top10_total_country=top10_total_country[['country']]
top10_total_country.columns=['top10_countries_high_quality']
top10_total_country

Unnamed: 0,top10_countries_high_quality
0,Andorra
1,Montenegro
2,Albania
3,Suriname
4,Bosnia-Herzegovina
5,Lithuania
6,Croatia
7,Slovenia
8,Palestinian Territory
9,Gabon


### 5.4 Bottom 10 countries by high quality 
The 10 countries with the lowest high quality articles per capita (in ascending order)

In [84]:
# Sorting high quality dataframe for countries by articles_per_capita and then taking the bottom 10

bottom10_total_country=high_quality_country_df.sort_values(by=['articles_per_capita'], ascending=True).head(10).reset_index()
bottom10_total_country=bottom10_total_country[['country']]
bottom10_total_country.columns=['bottom10_countries_high_quality']
bottom10_total_country

Unnamed: 0,bottom10_countries_high_quality
0,India
1,Thailand
2,Japan
3,Nigeria
4,Vietnam
5,Colombia
6,Uganda
7,Pakistan
8,Sudan
9,Iran


### 5.5 Geographic regions by total coverage
A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [97]:
# Sorting total dataframe for regions by articles_per_capita (descending order)

top_total_region=total_articles_region_data.sort_values(by=['articles_per_capita'], ascending=False).reset_index()
top_total_region=top_total_region[['region']]
top_total_region.columns=['regions_by_coverage']
top_total_region

Unnamed: 0,regions_by_coverage
0,SOUTHERN EUROPE
1,CARIBBEAN
2,WESTERN EUROPE
3,EASTERN EUROPE
4,NORTHERN EUROPE
5,WESTERN ASIA
6,OCEANIA
7,SOUTHERN AFRICA
8,EASTERN AFRICA
9,CENTRAL ASIA


### 5.6 Geographic regions by high quality coverage
Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [98]:
# Sorting high quality dataframe for regions by articles_per_capita (descending order)

bottom_total_region=high_quality_region_df.sort_values(by=['articles_per_capita'], ascending=False).reset_index()
bottom_total_region=bottom_total_region[['region']]
bottom_total_region.columns=['regions_by_high_quality']
bottom_total_region

Unnamed: 0,regions_by_high_quality
0,SOUTHERN EUROPE
1,CARIBBEAN
2,EASTERN EUROPE
3,WESTERN EUROPE
4,WESTERN ASIA
5,NORTHERN EUROPE
6,SOUTHERN AFRICA
7,CENTRAL AMERICA
8,OCEANIA
9,CENTRAL ASIA
