# Homework 2

## Considering Bias in Data

The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article. We will then perform an analysis of the number of articles, and specifically high-quality articles, per capita.

To reproduce this analysis, run all the cells in order.

## License

Snippets of the below code were taken from code examples developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024.

A copy of the reference code is also available in this repository under the folder named `reference_code`.

As a first step, import all the required libraries that allow us to parse csv data, save data into files and make API requests.

In [1]:
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd

# Data Acquisition

## Available data

For this project, we utilize some pre-collected data. These data files are made available in this repository under the `resources` folder.

- *`politicians_by_country_AUG.2024.csv`* : Contains a list of Wikipedia articles about politicians, with the article name, URL and associated country for each. The [Wikipedia Category:Politicians by nationality](https://en.wikipedia.org/wiki/Category:Politicians_by_nationality) was crawled to generate a list of Wikipedia article pages about politicians from a wide range of countries.

- *`population_by_country_AUG.2024.csv`*: This dataset was downloaded from the [world population data sheet](https://www.prb.org/international/indicator/population/table/) published by the Population Reference Bureau, and contains the population of each country (the values appear to be in millions, e.g. 8009.0 for WORLD). It also contains rows that provide cumulative regional population counts; such rows are distinguished by having ALL CAPS values in the 'Geography' field (e.g. AFRICA, OCEANIA).


## Fetching page quality

To get a Wikipedia page quality prediction from ORES for a given politician’s article page we need to perform the below 2 steps:

- *Step 1:*  make a page info request to get the current page revision 
- *Step 2:*  make an ORES request using the page title and current revision id  


# Step 0: Understanding the list of politicians

Let's start with a basic analysis of the provided list of articles on politicians from each country.


In [2]:
# Path relative to the repository root, so the notebook runs on any machine
WIKIPEDIA_ARTICLES = "resources/politicians_by_country_AUG.2024.csv"

# Read the list of wikipedia articles from the CSV file and create a dataframe
wiki_list = pd.read_csv(WIKIPEDIA_ARTICLES)
print("Total number of rows in the provided csv: ", wiki_list.shape[0])

Total number of rows in the provided csv:  7155


In [3]:
# Politicians associated to more than 1 country
wiki_count_df = wiki_list.groupby(['name', 'url'])['country'].nunique().reset_index(name='country_count')
wiki_count_dup_df = wiki_count_df[wiki_count_df['country_count'] > 1]
print("Total number of politicians that are associated with more than 1 country:" ,wiki_count_dup_df.shape[0])

wiki_count_dup_df

Total number of politicians that are associated with more than 1 country: 41


Unnamed: 0,name,url,country_count
182,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,2
456,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,2
568,Ali al-Qaradaghi,https://en.wikipedia.org/wiki/Ali_al-Qaradaghi,2
800,Antonio Gutiérrez y Ulloa,https://en.wikipedia.org/wiki/Antonio_Gutiérre...,2
815,Antonín Janoušek,https://en.wikipedia.org/wiki/Antonín_Janoušek,2
892,Ashab Uddin Ahmad,https://en.wikipedia.org/wiki/Ashab_Uddin_Ahmad,2
1016,Bak Jungyang,https://en.wikipedia.org/wiki/Bak_Jungyang,2
1161,Bona Malwal,https://en.wikipedia.org/wiki/Bona_Malwal,2
1495,Count Václav Antonín Chotek of Chotkov and Vojnín,https://en.wikipedia.org/wiki/Count_Václav_Ant...,2
1682,Djama Ali Moussa,https://en.wikipedia.org/wiki/Djama_Ali_Moussa,2


In [4]:
# Finding list of unique articles
wiki_list = wiki_list[["name", "url"]].drop_duplicates()
print("Total number of distinct articles in the csv : ", wiki_list.shape[0])

Total number of distinct articles in the csv :  7111


# Step 1: API call to fetch article's latest revision id

The provided list of politicians contains the article titles but not their latest revision ids. To enrich the list of article titles with the corresponding revision ids, we use the [MediaWiki Action API for the English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). For our exact use case, we follow the documentation at [API:Info](https://www.mediawiki.org/wiki/API:Info).

First, we define some constants to make the code more readable.

In [5]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
REQUEST_HEADERS = {
    'User-Agent': '<vanksu@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making a page info request
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
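    # Work on a copy so the shared template dict is not modified between calls
    request_template = dict(request_template)

    # The article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_template['titles'] = article_title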

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [7]:
def fetch_data_multiple_articles(appended_titles, collected_data):

    try:
        # Fetch page info for the batch of article titles
        api_response = request_pageinfo_per_article(appended_titles)
    
        pages = api_response['query']['pages']
        
        for page_id, page_data in pages.items():
            try:
                collected_data.append({
                    "pageid": page_data["pageid"],
                    "name": page_data["title"],
                    "revision_id": page_data["lastrevid"],
                })
            except Exception as e:
                print("Error while fetching data for specific article title: ", page_data["title"])
                print(e)
    except Exception as e:
        print("Error while fetching data for article titles:", appended_titles)
        print(e)

    return collected_data
        

The information for multiple pages can be requested in a single call by separating the page titles with the vertical bar "|" character. However, this approach has a limit: the API can handle requests with up to 50 article titles appended. The code below uses the alternative multi-value separator that the API accepts, the U+001F unit separator, placed before each title. Reference documentation can be found [here](https://www.mediawiki.org/w/api.php?action=help&modules=query).
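
The batching idea can be sketched on its own, without any network call. This is a minimal illustration, assuming the plain "|" separator; the helper name `batch_titles` and the sample titles are hypothetical and not part of the notebook.

```python
from typing import Iterator, List

MAX_TITLES_PER_REQUEST = 50  # limit quoted above for a single query


def batch_titles(titles: List[str], size: int = MAX_TITLES_PER_REQUEST) -> Iterator[str]:
    """Yield pipe-separated strings holding at most `size` titles each."""
    for start in range(0, len(titles), size):
        yield "|".join(titles[start:start + size])


# Hypothetical titles, used only to show the resulting batch sizes
sample_titles = [f"Article {i}" for i in range(120)]
batches = list(batch_titles(sample_titles))
print([len(batch.split("|")) for batch in batches])
```

Slicing by position keeps every batch within the limit regardless of how the dataframe index is numbered.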

In [8]:
# Create empty list to store the collected information from MediaWiki API 
BATCH_SIZE = 50
appended_title = ""
collected_data = []

# Count rows with enumerate: the dataframe index has gaps after drop_duplicates(),
# so it cannot be used to decide when a batch of 50 titles is complete
for position, (index, row) in enumerate(wiki_list.iterrows(), start=1):

    # Fetch article title from dataframe
    article_title = row["name"]
    # Append article title to the "title" string to make a single call containing 50 names
    appended_title = appended_title + '\u001F' + article_title

    if position % BATCH_SIZE == 0:
        # API call to fetch revision ids and append results to collected_data
        collected_data = fetch_data_multiple_articles(appended_title, collected_data)
        appended_title = ""

# API call to fetch revision ids and append results to collected_data for the remaining titles, if any
if appended_title:
    collected_data = fetch_data_multiple_articles(appended_title, collected_data)

wiki_list_with_rev = pd.DataFrame(collected_data)
wiki_list_with_rev = wiki_list_with_rev.drop_duplicates()

print("Total number of articles for which we attempted to fetch revision id: ", wiki_list.shape[0])
print("Total number of articles for which we successfully fetched revision id: ", wiki_list_with_rev.shape[0])

Error while fetching data for specific article title:  Barbara Eibinger-Miedl
'pageid'
Error while fetching data for specific article title:  Mehrali Gasimov
'pageid'
Error while fetching data for specific article title:  Kyaw Myint
'pageid'
Error while fetching data for specific article title:  André Ngongang Ouandji
'pageid'
Error while fetching data for specific article title:  Tomás Pimentel
'pageid'
Error while fetching data for specific article title:  Richard Sumah
'pageid'
Error while fetching data for specific article title:  Segun ''Aeroland'' Adewale
'pageid'
Error while fetching data for specific article title:  Bashir Bililiqo
'pageid'
Total number of articles for which we attempted to fetch revision id:  7111
Total number of articles for which we successfully fetched revision id:  7103
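
The `'pageid'` messages above are `KeyError`s: the response entry for those titles carried no `pageid` field, which is most likely what the API returns when it cannot find a page under the requested title. The sketch below separates found and missing pages explicitly instead of relying on the exception. The response fragment is a hypothetical, hand-written example of the assumed shape (a negative key and a `missing` field for an unknown title), and `split_found_and_missing` is an illustrative helper, not part of the notebook.

```python
# Hypothetical response fragment: one existing page and one page the API could not find
sample_response = {
    "query": {
        "pages": {
            "27428272": {"pageid": 27428272, "title": "Abdul Baqi Turkistani", "lastrevid": 1231655023},
            "-1": {"title": "Kyaw Myint", "missing": ""},
        }
    }
}


def split_found_and_missing(api_response):
    """Return (found_rows, missing_titles) from a page info response."""
    found, missing = [], []
    for page_data in api_response["query"]["pages"].values():
        if "missing" in page_data or "lastrevid" not in page_data:
            # No revision id available for this title
            missing.append(page_data.get("title"))
        else:
            found.append({
                "pageid": page_data["pageid"],
                "name": page_data["title"],
                "revision_id": page_data["lastrevid"],
            })
    return found, missing


found_rows, missing_titles = split_found_and_missing(sample_response)
print(len(found_rows), "found;", "missing:", missing_titles)
```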


In [9]:
# Write the intermediary dataframe to a file
wiki_list_with_rev.to_csv("generated_files/wiki_list_with_revision.csv",  index=False)
wiki_list_with_rev

Unnamed: 0,pageid,name,revision_id
0,27428272,Abdul Baqi Turkistani,1231655023
1,29443640,Abdul Ghani Ghani,1227026187
2,44482763,Abdul Rahim Ayoubi,1226326055
3,34682634,Ahmad Wali Massoud,1221720658
4,52438668,Aimal Faizi,1185105938
...,...,...,...
7098,3255571,Denis Walker,1247902630
7099,11742819,Herbert Ushewokunze,959111842
7100,633594,Josiah Tongogara,1203429435
7101,16375315,Langton Towungana,1246280093


# Step 2: Call ORES API to fetch article rating

By now, we have the revision ids for each article in the provided list of politician names. We need to loop through the list of articles, use the corresponding revision id and make a call to the ORES API to fetch the article rating.

Note: This step takes ~2 hours.
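
The estimate can be sanity-checked from the throttling constants defined in the next cell. The short calculation below is only a rough sketch: it counts the enforced wait between requests for the 7103 articles that have a revision id, and ignores the time the service itself takes to respond.

```python
REQUESTS_PER_HOUR = 5000      # rate limit stated for the access token
API_LATENCY_ASSUMED = 0.002   # seconds, the same assumption as in the constants
ARTICLE_COUNT = 7103          # articles with a revision id from Step 1

# Wait enforced before every request, in seconds
throttle_wait = (60.0 * 60.0) / REQUESTS_PER_HOUR - API_LATENCY_ASSUMED

# Total time spent waiting across all requests, in minutes
total_wait_minutes = throttle_wait * ARTICLE_COUNT / 60.0

print(f"{throttle_wait:.3f} s per request, about {total_wait_minutes:.0f} minutes of waiting")
```

The waiting alone comes to roughly 85 minutes, so the remainder of the ~2 hours is the time spent on the requests themselves.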

In [10]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required to be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

In [11]:
#   Once you've done the right set up with your Wikimedia account, it should provide you with three different keys, a Client ID,
#   a Client secret, and an Access token.

#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 

#   Note: if you don't want to use the key manager to help manage your API keys, you can specify the values as constants
#   below. Just don't distribute the notebook without removing the constants or you'll be distributing your key too.

USERNAME = "your_user_name"
ACCESS_TOKEN = "your_access_token"
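
The environment-variable idiom mentioned in the comments above can be sketched as follows. The variable name `WIKIMEDIA_ACCESS_TOKEN` and the helper `load_access_token` are assumptions made for illustration; any name exported in your shell would work the same way.

```python
import os

# Hypothetical variable name; use whatever name you exported in your shell
TOKEN_ENV_VAR = "WIKIMEDIA_ACCESS_TOKEN"


def load_access_token(env_var: str = TOKEN_ENV_VAR) -> str:
    """Read the access token from the environment instead of hard-coding it."""
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running the ORES cells.")
    return token
```

With this in place, `ACCESS_TOKEN = load_access_token()` keeps the key out of the notebook source.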

In [12]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


### Note: The next step takes ~2 hours.

This step has already been run, and the corresponding enrichment has been saved in a file named `enriched_article_data.csv` under the `generated_files` folder. Skip the next 2 cells to proceed with the analysis using the already fetched data.

In [None]:
# Read the article names and corresponding revision IDs from the intermediary file dump

wiki_with_rev_df = pd.read_csv("generated_files/wiki_list_with_revision.csv")

# Create an empty list for appending the results of ORES API call
wiki_with_rating_list = []

for index,row in wiki_with_rev_df.iterrows():

    row_dict = row.to_dict()

    try:
        score = request_ores_score_per_article(article_revid=row["revision_id"],
                                           email_address="vanksu@uw.edu",
                                           access_token=ACCESS_TOKEN)
        
        final_rating = score["enwiki"]["scores"][f'{row["revision_id"]}']["articlequality"]["score"]["prediction"]
    
        row_dict["article_rating"] = final_rating
        wiki_with_rating_list.append(row_dict)
    
    except Exception as e:
        print("Error fetching article_rating for", row["name"], "| Revision:", row["revision_id"])
        print(e)
        row_dict["article_rating"] = ""
        wiki_with_rating_list.append(row_dict)

print("Total number of articles for which we attempted to fetch article rating: ", wiki_with_rev_df.shape[0])
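
The nested lookup used in the loop above can be isolated into a small helper, which makes the expected response shape easier to see. This is a sketch: `extract_prediction` is a hypothetical name, and `sample_score` is a hand-written fragment that contains only the keys the notebook reads, not a full ORES response.

```python
from typing import Optional


def extract_prediction(score: Optional[dict], revision_id: int) -> Optional[str]:
    """Return the predicted quality class, or None if the response lacks it."""
    try:
        return score["enwiki"]["scores"][str(revision_id)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is absent; TypeError: the response is None
        return None


# Hand-written fragment with only the keys read above
sample_score = {
    "enwiki": {"scores": {"1233202991": {"articlequality": {"score": {"prediction": "Start"}}}}}
}

print(extract_prediction(sample_score, 1233202991))
print(extract_prediction(None, 1233202991))
```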


In [None]:
wiki_with_rating_df = pd.DataFrame(wiki_with_rating_list)
wiki_with_rating_df = wiki_with_rating_df.drop_duplicates()
wiki_with_rating_df.to_csv("generated_files/enriched_article_data.csv", index=False)

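# Count only the rows where the ORES call returned a rating (failed calls were stored as an empty string)
rated_count = (wiki_with_rating_df["article_rating"] != "").sum()
print("Total number of articles for which we successfully fetched article rating: ", rated_count)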

At this stage, we have 2 dataframes - one containing the initial list of articles, and the other one having the article rating and revision id of each of those articles.

We proceed to merge these 2 dataframes to create an enriched version of the articles listed in `politicians_by_country_AUG.2024.csv`.

In [13]:
wiki_list = pd.read_csv(WIKIPEDIA_ARTICLES)
wiki_with_rating_df = pd.read_csv("generated_files/enriched_article_data.csv")

# Left-join wiki_list with wiki_with_rating_df to attach the revision id and rating to every row
wiki_list_enriched = pd.merge(wiki_list, wiki_with_rating_df, on=["name"], how="left" )

wiki_list_enriched = wiki_list_enriched.rename(columns={'name': 'article_title'})
wiki_list_enriched = wiki_list_enriched.drop(['url', 'pageid'], axis=1)
wiki_list_enriched["revision_id"] = wiki_list_enriched["revision_id"].astype("Int64")

wiki_list_enriched

Unnamed: 0,article_title,country,revision_id,article_rating
0,Majah Ha Adrif,Afghanistan,1233202991,Start
1,Haroon al-Afghani,Afghanistan,1230459615,B
2,Tayyab Agha,Afghanistan,1225661708,Start
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub
4,Aziza Ahmadyar,Afghanistan,1195651393,Start
...,...,...,...,...
7150,Josiah Tongogara,Zimbabwe,1203429435,C
7151,Langton Towungana,Zimbabwe,1246280093,Stub
7152,Sengezo Tshabangu,Zimbabwe,1228478288,Start
7153,Herbert Ushewokunze,Zimbabwe,959111842,Stub


## Analysing data acquisition

Now we have a list of politicians' wiki articles along with their revision_id and article_rating. For articles where the ORES call failed, the list still has a row, with the article rating left empty.

Considering this, let's analyse our data acquisition process.

In [14]:
# Wikilist articles with no revision_id => pageInfo call failed
missing_revision_id_df = wiki_list_enriched[wiki_list_enriched["revision_id"].isna()]
print("Total number of articles for which pageInfo call failed: " , missing_revision_id_df.shape[0])
missing_revision_id_df

Total number of articles for which pageInfo call failed:  8


Unnamed: 0,article_title,country,revision_id,article_rating
430,Barbara Eibinger-Miedl,Austria,,
516,Mehrali Gasimov,Azerbaijan,,
1200,Kyaw Myint,Myanmar,,
1342,André Ngongang Ouandji,Cameroon,,
1955,Tomás Pimentel,Dominican Republic,,
2427,Richard Sumah,Ghana,,
4496,Segun ''Aeroland'' Adewale,Nigeria,,
5719,Bashir Bililiqo,Somalia,,


In [15]:
# Wikilist articles with revision_id but no article rating => ORES call failed

having_revision_id_df = wiki_list_enriched[~wiki_list_enriched["revision_id"].isna()]
print("Total number of articles for which pageInfo call succeeded: " , having_revision_id_df.shape[0])

missing_rating_df = having_revision_id_df[having_revision_id_df["article_rating"].isna()]
print("Total number of articles for which pageInfo call succeeded, but ORES API call failed: " , missing_rating_df.shape[0])

having_rating_df = wiki_list_enriched[~wiki_list_enriched["article_rating"].isna()]
print("Total number of articles for which pageInfo call succeeded, and ORES API call succeeded: " , having_rating_df.shape[0])

Total number of articles for which pageInfo call succeeded:  7147
Total number of articles for which pageInfo call succeeded, but ORES API call failed:  7
Total number of articles for which pageInfo call succeeded, and ORES API call succeeded:  7140


# Step 3: Adding population and region details

At this stage, we have the initial list of articles enriched with the latest revision id and article rating. We now want to enrich the list even further by adding details about the population and region of the country associated with each article title.

In [16]:
# Load the population data into a dataframe
population_df = pd.read_csv("resources/population_by_country_AUG.2024.csv")
population_df

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2
...,...,...
228,Samoa,0.2
229,Solomon Islands,0.8
230,Tonga,0.1
231,Tuvalu,0.0


As described in the Available data section, `population_by_country_AUG.2024.csv` lists regions and countries in a hierarchical order. For this analysis, we assign each country to its closest (lowest in the hierarchy) region. In this section, we find the region to which each country belongs.

In [17]:
# Loop through each row in the dataframe. A name in all capital letters denotes a region.
region = ""

population_list = []

for index, row in population_df.iterrows():
    row_dict = row.to_dict()
    if row["Geography"].isupper():
        # save the geography in a variable
        region = row["Geography"]
        row_dict["region"] = ""
    else:
        # assign the region to the latest value in the region variable
        row_dict["region"] = region

    population_list.append(row_dict)

population_df_enriched = pd.DataFrame(population_list)
population_df_enriched

Unnamed: 0,Geography,Population,region
0,WORLD,8009.0,
1,AFRICA,1453.0,
2,NORTHERN AFRICA,256.0,
3,Algeria,46.8,NORTHERN AFRICA
4,Egypt,105.2,NORTHERN AFRICA
...,...,...,...
228,Samoa,0.2,OCEANIA
229,Solomon Islands,0.8,OCEANIA
230,Tonga,0.1,OCEANIA
231,Tuvalu,0.0,OCEANIA
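
The same region assignment can also be expressed without an explicit loop. The sketch below is an alternative shown on a small hand-built frame, not the method used in this notebook: it keeps the region names, blanks out the country rows, and forward-fills so each country inherits the closest region row above it. The frame `sample_population` is illustrative and omits the population column.

```python
import pandas as pd

# Small illustrative frame in the same top-down order as the population file
sample_population = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa"],
})

# Region rows are the ones written in ALL CAPS
is_region = sample_population["Geography"].str.isupper()

# Keep region names, set country rows to NaN, then carry the last region name downwards
sample_population["region"] = sample_population["Geography"].where(is_region).ffill()

# Keep only the country rows
countries = sample_population[~is_region].reset_index(drop=True)
print(countries)
```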


In [18]:
population_df_enriched = population_df_enriched.rename(columns = {"Geography": "country", "Population": "population"})
country_region_df = population_df_enriched[~population_df_enriched["country"].str.isupper()].reset_index(drop=True)

print("Total number of countries in the `population_by_country_AUG.2024.csv` : ", country_region_df.shape[0])
country_region_df

Total number of countries in the `population_by_country_AUG.2024.csv` :  209


Unnamed: 0,country,population,region
0,Algeria,46.8,NORTHERN AFRICA
1,Egypt,105.2,NORTHERN AFRICA
2,Libya,6.9,NORTHERN AFRICA
3,Morocco,37.0,NORTHERN AFRICA
4,Sudan,48.1,NORTHERN AFRICA
...,...,...,...
204,Samoa,0.2,OCEANIA
205,Solomon Islands,0.8,OCEANIA
206,Tonga,0.1,OCEANIA
207,Tuvalu,0.0,OCEANIA


# Step 4: Analysis

In [19]:
# Merging the data frames of interest using outer join. This is because I am interested in finding the mismatched countries.
accumulated_data = pd.merge(wiki_list_enriched, country_region_df, on=["country"], how="outer", indicator=True)
accumulated_data

Unnamed: 0,article_title,country,revision_id,article_rating,population,region,_merge
0,Majah Ha Adrif,Afghanistan,1233202991,Start,42.4,SOUTH ASIA,both
1,Haroon al-Afghani,Afghanistan,1230459615,B,42.4,SOUTH ASIA,both
2,Tayyab Agha,Afghanistan,1225661708,Start,42.4,SOUTH ASIA,both
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub,42.4,SOUTH ASIA,both
4,Aziza Ahmadyar,Afghanistan,1195651393,Start,42.4,SOUTH ASIA,both
...,...,...,...,...,...,...,...
7193,,Kiribati,,,0.1,OCEANIA,right_only
7194,,Nauru,,,0.0,OCEANIA,right_only
7195,,New Caledonia,,,0.3,OCEANIA,right_only
7196,,New Zealand,,,5.2,OCEANIA,right_only


In [20]:
# Articles for which country data is not found in country_region_df
country_population_missing_df = accumulated_data[accumulated_data["_merge"] == "left_only"]
print("Countries in the wiki_list for which there is no population info:", country_population_missing_df["country"].unique())

Countries in the wiki_list for which there is no population info: ['Guinea-Bissau' 'Korean' 'Korea, South']


In [21]:
# Countries for which there are no articles present in wiki_list_enriched
country_article_missing_df = accumulated_data[accumulated_data["_merge"] == "right_only"]
print("Countries in the country_region_df for which there are no articles:", country_article_missing_df["country"].unique())

Countries in the country_region_df for which there are no articles: ['Western Sahara' 'GuineaBissau' 'Mauritius' 'Mayotte' 'Reunion'
 'Sao Tome and Principe' 'eSwatini' 'Canada' 'United States' 'Mexico'
 'Curacao' 'Dominica' 'Guadeloupe' 'Jamaica' 'Martinique' 'Puerto Rico'
 'French Guiana' 'Suriname' 'Georgia' 'Brunei' 'Philippines'
 'China (Hong Kong SAR)' 'China (Macao SAR)' 'Korea (North)'
 'Korea (South)' 'Denmark' 'Iceland' 'Ireland' 'United Kingdom'
 'Liechtenstein' 'Netherlands' 'Romania' 'Andorra' 'San Marino'
 'Australia' 'Fiji' 'French Polynesia' 'Guam' 'Kiribati' 'Nauru'
 'New Caledonia' 'New Zealand' 'Palau']


In [22]:
# Write countries with no match (i.e., the results of the above two queries) to a .txt file
# Combine the two lists
combined_list = country_population_missing_df["country"].unique().tolist() + country_article_missing_df["country"].unique().tolist()

# Write the combined list to a text file with each string on a new line
with open('generated_files/wp_countries-no_match.txt', 'w+') as file:
    for country in combined_list:
        file.write(country + '\n')

In [23]:
# Find the articles for which we have information about country population
wp_politicians_by_country = accumulated_data[accumulated_data["_merge"] == "both"]
wp_politicians_by_country = wp_politicians_by_country.drop(columns=["_merge"])
wp_politicians_by_country = wp_politicians_by_country.rename(columns = { "article_rating": "article_quality"})
wp_politicians_by_country = wp_politicians_by_country[["country", "region", "population", "article_title", "revision_id", "article_quality"]]
wp_politicians_by_country = wp_politicians_by_country.drop_duplicates()
wp_politicians_by_country

Unnamed: 0,country,region,population,article_title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233202991,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230459615,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225661708,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234741562,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651393,Start
...,...,...,...,...,...,...
7150,Zimbabwe,EASTERN AFRICA,16.7,Josiah Tongogara,1203429435,C
7151,Zimbabwe,EASTERN AFRICA,16.7,Langton Towungana,1246280093,Stub
7152,Zimbabwe,EASTERN AFRICA,16.7,Sengezo Tshabangu,1228478288,Start
7153,Zimbabwe,EASTERN AFRICA,16.7,Herbert Ushewokunze,959111842,Stub


In [24]:
# Write the dataframe to a file
wp_politicians_by_country.to_csv("generated_files/wp_politicians_by_country.csv", index=False)

# Step 5: Results

In [25]:
# Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) .

country_article_count = wp_politicians_by_country.groupby(['country', 'region', 'population']).size().reset_index(name='total_articles')
country_article_count["total_articles_per_capita"] = country_article_count["total_articles"] / country_article_count["population"]

top_10 = country_article_count.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
top_10 = top_10.head(10)

print("The 10 countries with the highest total articles per capita (in descending order):", top_10["country"].tolist())
top_10

The 10 countries with the highest total articles per capita (in descending order): ['Monaco', 'Tuvalu', 'Antigua and Barbuda', 'Federated States of Micronesia', 'Marshall Islands', 'Tonga', 'Barbados', 'Montenegro', 'Seychelles', 'Maldives']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Monaco,WESTERN EUROPE,0.0,10,inf
1,Tuvalu,OCEANIA,0.0,1,inf
2,Antigua and Barbuda,CARIBBEAN,0.1,33,330.0
3,Federated States of Micronesia,OCEANIA,0.1,14,140.0
4,Marshall Islands,OCEANIA,0.1,13,130.0
5,Tonga,OCEANIA,0.1,10,100.0
6,Barbados,CARIBBEAN,0.3,25,83.333333
7,Montenegro,SOUTHERN EUROPE,0.6,36,60.0
8,Seychelles,EASTERN AFRICA,0.1,6,60.0
9,Maldives,SOUTH ASIA,0.6,33,55.0


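The total articles per capita is infinite (`inf`) for Monaco and Tuvalu. Their populations are small enough to appear as 0.0 in the population dataset, so dividing by the population yields `inf` rather than a meaningful ratio. The remaining countries in the top 10 also have comparatively small populations, which pushes their per capita values up.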
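
One possible way to keep such rows from dominating a ranking is to treat a population of 0.0 as unknown before dividing. The sketch below is an assumption about how this could be handled and is not applied in this notebook; it uses three rows from the table above, and the column name `articles_per_population_unit` is illustrative.

```python
import numpy as np
import pandas as pd

# Three rows taken from the top 10 table above
counts = pd.DataFrame({
    "country": ["Monaco", "Tuvalu", "Antigua and Barbuda"],
    "population": [0.0, 0.0, 0.1],
    "total_articles": [10, 1, 33],
})

# A population shown as 0.0 is treated as missing, so the ratio becomes NaN instead of inf
safe_population = counts["population"].replace(0.0, np.nan)
counts["articles_per_population_unit"] = counts["total_articles"] / safe_population

# NaN rows are placed last by sort_values, so they no longer lead the ranking
ranked = counts.sort_values("articles_per_population_unit", ascending=False)
print(ranked)
```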

In [26]:
# Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .
bottom_10 = country_article_count.sort_values("total_articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10 = bottom_10.head(10)

print("The 10 countries with the lowest total articles per capita (in ascending order):", bottom_10["country"].tolist())
bottom_10

The 10 countries with the lowest total articles per capita (in ascending order): ['China', 'India', 'Ghana', 'Saudi Arabia', 'Zambia', 'Norway', 'Israel', 'Egypt', "Cote d'Ivoire", 'Ethiopia']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,China,EAST ASIA,1411.3,16,0.011337
1,India,SOUTH ASIA,1428.6,151,0.105698
2,Ghana,WESTERN AFRICA,34.1,4,0.117302
3,Saudi Arabia,WESTERN ASIA,36.9,5,0.135501
4,Zambia,EASTERN AFRICA,20.2,3,0.148515
5,Norway,NORTHERN EUROPE,5.5,1,0.181818
6,Israel,WESTERN ASIA,9.8,2,0.204082
7,Egypt,NORTHERN AFRICA,105.2,32,0.304183
8,Cote d'Ivoire,WESTERN AFRICA,30.9,10,0.323625
9,Ethiopia,EASTERN AFRICA,126.5,44,0.347826


In [27]:
# Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

high_quality_articles = wp_politicians_by_country[wp_politicians_by_country["article_quality"].isin(["FA", "GA"])]
high_quality_articles_count = high_quality_articles.groupby(['country', 'region', 'population']).size().reset_index(name='total_articles')
high_quality_articles_count["total_articles_per_capita"] = high_quality_articles_count["total_articles"] / high_quality_articles_count["population"]

top_10 = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
top_10 = top_10.head(10)

print("The 10 countries with the highest high quality articles per capita (in descending order):", top_10["country"].tolist())
top_10

The 10 countries with the highest high quality articles per capita (in descending order): ['Montenegro', 'Luxembourg', 'Albania', 'Kosovo', 'Maldives', 'Lithuania', 'Croatia', 'Guyana', 'Palestinian Territory', 'Slovenia']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Montenegro,SOUTHERN EUROPE,0.6,3,5.0
1,Luxembourg,WESTERN EUROPE,0.7,2,2.857143
2,Albania,SOUTHERN EUROPE,2.7,7,2.592593
3,Kosovo,SOUTHERN EUROPE,1.7,4,2.352941
4,Maldives,SOUTH ASIA,0.6,1,1.666667
5,Lithuania,NORTHERN EUROPE,2.9,4,1.37931
6,Croatia,SOUTHERN EUROPE,3.8,5,1.315789
7,Guyana,SOUTH AMERICA,0.8,1,1.25
8,Palestinian Territory,WESTERN ASIA,5.5,6,1.090909
9,Slovenia,SOUTHERN EUROPE,2.1,2,0.952381


In [28]:
# Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
bottom_10 = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10 = bottom_10.head(10)

print("The 10 countries with the lowest high quality articles per capita (in ascending order):", bottom_10["country"].tolist())
bottom_10

The 10 countries with the lowest high quality articles per capita (in ascending order): ['Bangladesh', 'Egypt', 'Ethiopia', 'Japan', 'Pakistan', 'Colombia', 'Congo DR', 'Vietnam', 'Uganda', 'Algeria']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Bangladesh,SOUTH ASIA,173.5,1,0.005764
1,Egypt,NORTHERN AFRICA,105.2,1,0.009506
2,Ethiopia,EASTERN AFRICA,126.5,2,0.01581
3,Japan,EAST ASIA,124.5,2,0.016064
4,Pakistan,SOUTH ASIA,240.5,4,0.016632
5,Colombia,SOUTH AMERICA,52.2,1,0.019157
6,Congo DR,MIDDLE AFRICA,102.3,2,0.01955
7,Vietnam,SOUTHEAST ASIA,98.9,2,0.020222
8,Uganda,EASTERN AFRICA,48.6,1,0.020576
9,Algeria,NORTHERN AFRICA,46.8,1,0.021368
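
Note that filtering to FA/GA articles before grouping only produces rows for countries that have at least one high quality article, so countries with zero high quality articles cannot appear in the ranking above. If those countries should be included, one possible approach is to sum a boolean flag per country instead of filtering first. The sketch below uses a hypothetical toy frame (the countries `A`, `B`, `C` and the column name `high_quality_articles` are illustrative only), with the same `article_quality` column as the analysis.

```python
import pandas as pd

# Hypothetical toy data: one row per article, with an ORES-style quality label.
wp_politicians_by_country = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "population": [10.0, 10.0, 5.0, 5.0, 2.0],
    "article_quality": ["GA", "Stub", "Start", "C", "FA"],
})

# Flag each article instead of dropping the non-FA/GA rows.
is_high_quality = wp_politicians_by_country["article_quality"].isin(["FA", "GA"])

# Summing the flag per country keeps countries whose count is zero.
hq_counts = (
    wp_politicians_by_country.assign(is_high_quality=is_high_quality)
    .groupby(["country", "population"])["is_high_quality"]
    .sum()
    .reset_index(name="high_quality_articles")
)
hq_counts["high_quality_per_capita"] = (
    hq_counts["high_quality_articles"] / hq_counts["population"]
)
print(hq_counts.sort_values("high_quality_per_capita"))
```

Here country `B` is retained with a count of 0, whereas the filter-then-group approach would omit it entirely.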


In [29]:
# Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

articles_count_per_region = wp_politicians_by_country.groupby(['region']).size().reset_index(name='total_articles')
articles_count_per_region

Unnamed: 0,region,total_articles
0,CARIBBEAN,219
1,CENTRAL AMERICA,188
2,CENTRAL ASIA,106
3,EAST ASIA,152
4,EASTERN AFRICA,665
5,EASTERN EUROPE,709
6,MIDDLE AFRICA,231
7,NORTHERN AFRICA,302
8,NORTHERN EUROPE,191
9,OCEANIA,72


In [30]:
regional_data = pd.merge(articles_count_per_region, population_df, left_on=["region"], right_on=["Geography"], how="left")
regional_data = regional_data[["region", "total_articles", "Population"]]
regional_data["total_articles_per_capita"] = regional_data["total_articles"] / regional_data["Population"]
regional_data = regional_data.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
regional_data

Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,797,152.0,5.243421
1,CARIBBEAN,219,44.0,4.977273
2,WESTERN EUROPE,498,199.0,2.502513
3,EASTERN EUROPE,709,285.0,2.487719
4,WESTERN ASIA,610,299.0,2.040134
5,NORTHERN EUROPE,191,108.0,1.768519
6,SOUTHERN AFRICA,123,70.0,1.757143
7,OCEANIA,72,45.0,1.6
8,EASTERN AFRICA,665,483.0,1.376812
9,SOUTH AMERICA,569,426.0,1.335681
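
Because the regional merge uses `how="left"`, a region name that does not match any `Geography` value would silently receive a missing `Population`. A quick check is to pass `indicator=True` to `pd.merge` and list the rows that found no match. The sketch below is illustrative only: it uses two rows from the tables above plus a hypothetical `ATLANTIS` region added purely to show an unmatched key.

```python
import pandas as pd

# Toy inputs: two real rows from the tables above and one hypothetical region.
articles_count_per_region = pd.DataFrame({
    "region": ["CARIBBEAN", "OCEANIA", "ATLANTIS"],  # ATLANTIS is made up
    "total_articles": [219, 72, 5],
})
population_df = pd.DataFrame({
    "Geography": ["CARIBBEAN", "OCEANIA"],
    "Population": [44.0, 45.0],
})

# indicator=True adds a "_merge" column marking where each row came from.
checked = pd.merge(
    articles_count_per_region,
    population_df,
    left_on="region",
    right_on="Geography",
    how="left",
    indicator=True,
)

# Rows marked "left_only" have no population match.
unmatched = checked.loc[checked["_merge"] == "left_only", "region"].tolist()
print("Regions without a population match:", unmatched)
```

An empty list indicates that every region in the article counts was matched to a population row.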


In [31]:
# Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

high_quality_articles_regional = wp_politicians_by_country[wp_politicians_by_country["article_quality"].isin(["FA", "GA"])]

hq_articles_count_per_region = high_quality_articles_regional.groupby(['region']).size().reset_index(name='total_articles')
hq_articles_count_per_region

Unnamed: 0,region,total_articles
0,CARIBBEAN,9
1,CENTRAL AMERICA,10
2,CENTRAL ASIA,5
3,EAST ASIA,3
4,EASTERN AFRICA,17
5,EASTERN EUROPE,38
6,MIDDLE AFRICA,8
7,NORTHERN AFRICA,17
8,NORTHERN EUROPE,9
9,OCEANIA,1


In [32]:
hq_regional_data = pd.merge(hq_articles_count_per_region, population_df, left_on=["region"], right_on=["Geography"], how="left")
hq_regional_data = hq_regional_data[["region", "total_articles", "Population"]]
hq_regional_data["total_articles_per_capita"] = hq_regional_data["total_articles"] / hq_regional_data["Population"]
hq_regional_data = hq_regional_data.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
hq_regional_data

Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,53,152.0,0.348684
1,CARIBBEAN,9,44.0,0.204545
2,EASTERN EUROPE,38,285.0,0.133333
3,SOUTHERN AFRICA,8,70.0,0.114286
4,WESTERN EUROPE,21,199.0,0.105528
5,WESTERN ASIA,27,299.0,0.090301
6,NORTHERN EUROPE,9,108.0,0.083333
7,NORTHERN AFRICA,17,256.0,0.066406
8,CENTRAL ASIA,5,80.0,0.0625
9,CENTRAL AMERICA,10,182.0,0.054945
