# Considering bias in data
The goal of this analysis is to explore the concept of bias in data using Wikipedia articles. This analysis will consider articles on political figures from different countries. We will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We will also analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary among countries. The analysis consists of a series of tables that show:
1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.


## 1. Setup
Note: this analysis was done in Google Colab, a Jupyter Notebook environment that runs in the cloud. The data files are read from Google Drive, so the snippet below mounts Drive to make them available to the notebook.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change to the required path where all the data files are located. This will differ for each user.

In [2]:
%cd 'drive/MyDrive/data 512/week 2'

/content/drive/MyDrive/data 512/week 2


We're grabbing the ORES API access token and username from Google Colab's user data. This way, we can securely use the ORES service without hardcoding sensitive information.

In [3]:
from google.colab import userdata

ORES_ACCESS_TOKEN = userdata.get('ACCESS_TOKEN')
ORES_USERNAME = userdata.get('USERNAME')

## 2. Loading packages and defining constants and functions
Here, we load the packages needed to fetch the data. We also define the required API constants and the API-calling functions that are used later in the notebook.

In [4]:
# These are standard python modules
import json, time, urllib.parse

# The 'requests' and 'pandas' modules are not standard Python modules.
# You will need to install them with pip/pip3 if you do not already have them
import requests
import pandas as pd

### 2.1 Definitions for Wikimedia API calls
The below code was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the Creative Commons CC-BY license. Revision 1.2 - September 16, 2024.


The example relies on some constants that help make the code a bit more readable.


In [5]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<raaguln@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

### 2.2 Definitions for ORES article quality API calls
The below code was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the Creative Commons CC-BY license. Revision 1.2 - September 16, 2024.


The example relies on some constants that help make the code a bit more readable.

In [7]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<raaguln@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "raaguln@uw.edu",         # your email address should go here
    'access_token'  : ORES_ACCESS_TOKEN         # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ORES_USERNAME
ACCESS_TOKEN = ORES_ACCESS_TOKEN

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for one article at a time. Therefore the parameter most likely to change is the article_revid.

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## 3. Get Wikipedia information for all politicians
In this step, we call the Wikimedia API to get the required information (the latest revision id) for the Wikipedia articles about the politicians.

We load the CSV files with the politician articles and the country populations, which are used throughout the analysis.

In [9]:
# Use the pandas library to load
politicians = pd.read_csv("politicians_by_country_AUG.2024.csv")
countries = pd.read_csv("population_by_country_AUG.2024.csv")

In [10]:
# Exploring what's in the dataset
politicians.head()

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan


In [11]:
# Finding out how many rows the data has
len(politicians)

7155

In [12]:
# Finding out if there are duplicate names
len(politicians['name'].unique())

7111

In [13]:
# Finding out if there are duplicate (name, country) combinations
unique_politicians_group = politicians.groupby(['name', 'country']).size().reset_index(name='count')
unique_politicians_group[unique_politicians_group['count'] > 1]

Unnamed: 0,name,country,count


There are 7155 - 7111 = 44 rows whose politician name repeats an earlier one, but no (name, country) pair is repeated, so each repeated name is listed under a different country. These rows must not be dropped as duplicates, since each one is associated with a different country.

To highlight: such a politician is associated with two different countries, but there is only one Wikipedia article with that name. Two Wikipedia articles cannot share the same title, as that would break the URL of the page. Wikipedia's naming conventions make this apparent: articles with similar names get additional detail in the title (usually extra information in round brackets). See the sample articles below:

1. Jack Sparrow - https://en.wikipedia.org/wiki/Jack_Sparrow
2. Jack Sparrow (song) - https://en.wikipedia.org/wiki/Jack_Sparrow_(song)
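
To see which names sit behind that 44-row gap, one option is to flag every name that occurs more than once and list its countries. This is a minimal sketch on a toy frame; the rows are made up for illustration and `find_shared_names` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def find_shared_names(df, name_col="name", country_col="country"):
    """Return rows whose name appears more than once, sorted for easy reading."""
    # keep=False marks every occurrence of a repeated name, not just the later ones
    mask = df[name_col].duplicated(keep=False)
    return df.loc[mask, [name_col, country_col]].sort_values([name_col, country_col])

# Toy data (invented for illustration only)
toy_politicians = pd.DataFrame({
    "name": ["A. Example", "B. Sample", "A. Example", "C. Placeholder"],
    "country": ["Country X", "Country X", "Country Y", "Country Z"],
})

shared = find_shared_names(toy_politicians)
print(shared)
```

Applied to the real data, the same call would be `find_shared_names(politicians)`.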

In [14]:
# Exploring what's in the dataset `countries`
countries.head()

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2


In [15]:
# Number of rows in the dataset `countries`
len(countries)

233

Here, we loop through the politicians' names in batches of 50 and call the Wikimedia API to get the revision id for each article. The API accepts up to 50 article titles in one call, and we use that to reduce the number of API calls. Note: the final dictionary will have only 7111 entries, the number of unique Wikipedia titles.


API Reference - https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page

In [16]:
# Create dictionary mapping for storing all the information
wiki_information = {}
# Number of article titles that the API will accept in one go. Refer the documentation.
batch_size = 50
for i in range(0, len(politicians), batch_size):
    # Put all titles in a `|` separated string
    names = '|'.join(politicians['name'][i:i+batch_size])
    # Make the API call
    info = request_pageinfo_per_article(names)
    # Skip the batch if the request failed (the helper returns None on error)
    if info and 'query' in info and 'pages' in info['query']:
        response = info['query']['pages']
        # For each title, add the response to the dictionary
        for page_id in response:
            name = response[page_id]['title']
            wiki_information[name] = response[page_id]
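
One thing to watch for: the MediaWiki API may normalize a requested title (for example, converting underscores to spaces). When that happens, the `title` in the response differs from the name that was sent, and the `query` object carries a `normalized` list of `from`/`to` pairs. The sketch below shows how that list could be used to key results by the requested name. `sample_response` is a hand-written stand-in for an API response and `index_by_requested_title` is a hypothetical helper, not part of the original notebook.

```python
def index_by_requested_title(requested_titles, api_response):
    """Map each requested title to its page record, honoring title normalization."""
    query = (api_response or {}).get("query", {})
    # 'normalized' is a list of {"from": requested, "to": canonical} entries
    canonical = {n["from"]: n["to"] for n in query.get("normalized", [])}
    pages_by_title = {p["title"]: p for p in query.get("pages", {}).values()}
    result = {}
    for requested in requested_titles:
        title = canonical.get(requested, requested)
        if title in pages_by_title:
            result[requested] = pages_by_title[title]
    return result

# Hand-written stand-in for a response (structure only; not fetched from the API)
sample_response = {
    "query": {
        "normalized": [{"from": "Denis_Walker", "to": "Denis Walker"}],
        "pages": {
            "3255571": {"pageid": 3255571, "title": "Denis Walker", "lastrevid": 1247902630},
        },
    }
}

indexed = index_by_requested_title(["Denis_Walker", "No Such Page"], sample_response)
print(indexed)
```

Titles that come back without a matching page are simply left out of the result, so they can be reported separately.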

We inspect the values returned by the API to check for any issues in the responses. It turns out that 8 articles have no revision data.

In [17]:
revision_ids = []
titles_with_no_revisions = []
for name in wiki_information:
    if 'lastrevid' in wiki_information[name]:
        revision_ids.append(wiki_information[name]['lastrevid'])
    else:
        titles_with_no_revisions.append(name)
print('Number of entries with no revision data -', len(titles_with_no_revisions))
print('List of titles with no revision data -', titles_with_no_revisions)

Number of entries with no revision data - 8
List of titles with no revision data - ['Barbara Eibinger-Miedl', 'Mehrali Gasimov', 'Kyaw Myint', 'André Ngongang Ouandji', 'Tomás Pimentel', 'Richard Sumah', "Segun ''Aeroland'' Adewale", 'Bashir Bililiqo']


In [18]:
# Full information for the data with no revisions
for name in titles_with_no_revisions:
    print(wiki_information[name])

{'ns': 0, 'title': 'Barbara Eibinger-Miedl', 'missing': '', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'fullurl': 'https://en.wikipedia.org/wiki/Barbara_Eibinger-Miedl', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Barbara_Eibinger-Miedl&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Barbara_Eibinger-Miedl'}
{'ns': 0, 'title': 'Mehrali Gasimov', 'missing': '', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'fullurl': 'https://en.wikipedia.org/wiki/Mehrali_Gasimov', 'editurl': 'https://en.wikipedia.org/w/index.php?title=Mehrali_Gasimov&action=edit', 'canonicalurl': 'https://en.wikipedia.org/wiki/Mehrali_Gasimov'}
{'ns': 0, 'title': 'Kyaw Myint', 'missing': '', 'contentmodel': 'wikitext', 'pagelanguage': 'en', 'pagelanguagehtmlcode': 'en', 'pagelanguagedir': 'ltr', 'fullurl': 'https://en.wikipedia.org/wiki/Kyaw_Myint', 'editurl': 'https://en.wi

Opening the links above shows that 7 of the articles have been moved to a draft article or to a new link, and hence there is no revision id information for them. The article `Segun ''Aeroland'' Adewale` seems to have been deleted completely. We handle these cases gracefully later on so that no errors are thrown.


In [19]:
# Sample print to understand the output
wiki_information['Denis Walker']

{'pageid': 3255571,
 'ns': 0,
 'title': 'Denis Walker',
 'contentmodel': 'wikitext',
 'pagelanguage': 'en',
 'pagelanguagehtmlcode': 'en',
 'pagelanguagedir': 'ltr',
 'touched': '2024-10-05T14:24:34Z',
 'lastrevid': 1247902630,
 'length': 10247,
 'talkid': 3338681,
 'fullurl': 'https://en.wikipedia.org/wiki/Denis_Walker',
 'editurl': 'https://en.wikipedia.org/w/index.php?title=Denis_Walker&action=edit',
 'canonicalurl': 'https://en.wikipedia.org/wiki/Denis_Walker'}

We write this information to a separate JSON file for easier access later on.

In [20]:
# Utility function to write the data to JSON
def write_to_json(filename, data):
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
        print(f"{filename} created successfully!")

# This writes the file to the current working directory. To write it
# elsewhere, pass the full path as the filename.
write_to_json('wiki_page_info.json', wiki_information)

wiki_page_info.json created successfully!


## 4. Get article quality from ORES API
In this step, we call the ORES API to get the quality prediction for each Wikipedia article about a politician. The full run takes approx. 2 hours, so the API-calling cells below are commented out; if you are on a time crunch, just load the existing data saved from an earlier run. Otherwise, uncomment them, sit back, and grab some more popcorn while this runs. This is [exactly how long the process will run for](https://www.youtube.com/watch?v=oMrfhk-MXRg).
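
A rough lower bound on the run time follows from the throttle alone. The sketch below redoes the arithmetic using the 5000-requests-per-hour limit and the 2 ms latency assumed in section 2.2, together with the 7103 saved ORES results reported further down. It ignores the actual round-trip time of each request, which plausibly accounts for the rest of the observed duration.

```python
# Throttle settings as defined for the ORES calls in this notebook
requests_per_hour = 5000
assumed_latency_s = 0.002
throttle_wait_s = (60.0 * 60.0) / requests_per_hour - assumed_latency_s

# Number of articles scored (length of the saved ORES results)
n_requests = 7103

# Time spent sleeping between requests, in minutes
min_runtime_min = n_requests * throttle_wait_s / 60.0
print(f"wait per request: {throttle_wait_s:.3f} s")
print(f"sleep time alone: {min_runtime_min:.0f} minutes")
```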


In [21]:
# ores_information = {}

# for title in wiki_information.keys():
#     info = wiki_information[title]
#     if 'lastrevid' in info:
#         revid = info['lastrevid']
#         score = request_ores_score_per_article(article_revid=revid,
#                                        email_address=REQUEST_HEADER_PARAMS_TEMPLATE['email_address'],
#                                        access_token=ACCESS_TOKEN)
#         ores_information[title] = score

In [22]:
# for key, value in ores_information.items():
#     print(key, value)

In [23]:
# # Utility function to write the data to JSON
# def write_to_json(filename, data):
#     with open(filename, 'w') as f:
#         json.dump(data, f, indent=4)
#         print(f"{filename} created successfully!")

# # This writes the files to the same folder that the code is structured in. If
# # you want to change the path, make sure you provide the right path.
# write_to_json('ores_info.json', ores_information)

In [24]:
# Read the `ores_info.json` file
ores_information = {}
with open('ores_info.json', 'r') as f:
    ores_information = json.load(f)

In [25]:
# Exploring the length of the dataset `ores_information`
len(ores_information)

7103

In [50]:
# Understanding the structure of `ores_information`
list(ores_information.values())[1]

{'enwiki': {'models': {'articlequality': {'version': '0.9.2'}},
  'scores': {'1227026187': {'articlequality': {'score': {'prediction': 'Stub',
      'probability': {'B': 0.011385918433177827,
       'C': 0.01746636720083123,
       'FA': 0.0023603522965185463,
       'GA': 0.005156605723717649,
       'Start': 0.0518527973277106,
       'Stub': 0.9117779590180441}}}}}}}

Many of the values in the response are not needed for our analysis. We format the responses into a mapping that keeps only the values we use.

In [27]:
def get_lastrevid(wiki_info):
    """
    Retrieve the last revision ID from the wiki information.

    Parameters:
    wiki_info (dict): Dictionary containing wiki data.

    Returns:
    int or None: The last revision ID if present; otherwise, None.
    """
    # dict.get returns None when the key is absent
    return wiki_info.get('lastrevid')


ores_prediction = {}
issues = {}

# Iterate through each title and value in wiki_information
for title, value in wiki_information.items():
    # Check for missing ores information
    if title not in ores_information:
        issues[title] = 'ores_information missing'
        continue
    ores_info = ores_information[title]
    revid = get_lastrevid(value)
    if revid is not None:
        # Check for enwiki and scores
        if 'enwiki' in ores_info and 'scores' in ores_info['enwiki']:
            first_level = ores_info['enwiki']['scores']
            # Check if revid exists in scores
            if str(revid) in first_level.keys():
                ores_prediction[title] = first_level[f'{revid}']["articlequality"]["score"]["prediction"]
            else:
                # Handle missing revid
                issues[title] = 'revid id missing in prediction value'
                continue
        else:
            # Handle missing enwiki/scores
            issues[title] = 'enwiki/scores missing'
            continue
    else:
        # Handle missing last revision ID
        issues[title] = 'lastrevid missing'

In [28]:
# Checking if the response from the API only has our desired outputs
set(ores_prediction.values())

{'B', 'C', 'FA', 'GA', 'Start', 'Stub'}

In [29]:
# Count the number of entries with ORES API issues
ores_error_count = 0
for key, value in issues.items():
    if value == 'enwiki/scores missing' or value == 'revid id missing in prediction value':
        print(key)
        print(value)
        print("-x-x-x-x")
        ores_error_count += 1


Abdul Baqi Turkistani
enwiki/scores missing
-x-x-x-x
Bernard Percival
enwiki/scores missing
-x-x-x-x
Carlos de la Rosa
enwiki/scores missing
-x-x-x-x
Ali bin Khalifa Al Khalifa
revid id missing in prediction value
-x-x-x-x
Salman bin Hamad Al Khalifa
revid id missing in prediction value
-x-x-x-x
Hristo Petrov
revid id missing in prediction value
-x-x-x-x
Ventsislav Varbanov
enwiki/scores missing
-x-x-x-x
Za Hlei Thang
revid id missing in prediction value
-x-x-x-x
Bun Chanmol
enwiki/scores missing
-x-x-x-x
Ieu Koeus
revid id missing in prediction value
-x-x-x-x
Andargachew Tsege
revid id missing in prediction value
-x-x-x-x
Georg Magnus Sprengtporten
revid id missing in prediction value
-x-x-x-x
Muhammed Magassy
enwiki/scores missing
-x-x-x-x
Bachhav Shobha Dinesh
revid id missing in prediction value
-x-x-x-x
Ono no Imoko
revid id missing in prediction value
-x-x-x-x
Sugawara no Kiyotomo
revid id missing in prediction value
-x-x-x-x
Faisal Shboul
enwiki/scores missing
-x-x-x-x
Haditha J

In [30]:
# ORES error rate
print('Number of entries with ORES API issues =', ores_error_count)
print('ORES error rate =', ores_error_count/len(ores_prediction))

Number of entries with ORES API issues = 29
ORES error rate = 0.004099519366694939


## 5. Finding regions for each country
In `population_by_country_AUG.2024.csv`, there are multiple levels of headings above each country, and all region names are in capital letters. Using that as the guiding principle to identify regions, together with the instruction to always place a country in its closest (lowest in the hierarchy) region, we arrive at the code below.

In [31]:
current_region = None
country_region_mapping = []

for index, row in countries.iterrows():
    geography = row['Geography']
    population = row['Population']

    # Check if the geography is in all caps, which is an indication of the region
    if geography.isupper():
        current_region = geography
    else:
        # If it's a country, map it to the current region
        if current_region:
            country_region_mapping.append([geography, current_region, population])

# Convert the mapping list into a Pandas dataframe
df_mapped = pd.DataFrame(country_region_mapping, columns=['country', 'region', 'population'])
df_mapped.head()

Unnamed: 0,country,region,population
0,Algeria,NORTHERN AFRICA,46.8
1,Egypt,NORTHERN AFRICA,105.2
2,Libya,NORTHERN AFRICA,6.9
3,Morocco,NORTHERN AFRICA,37.0
4,Sudan,NORTHERN AFRICA,48.1


In [32]:
print(f'Length of country_region_mapping = {len(df_mapped)}')

Length of country_region_mapping = 209
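
The same rule (an all-caps row starts a region, and each following row belongs to it) can also be expressed without an explicit loop. This is an alternative sketch, not the method used above. It runs on a small hand-built frame whose rows mirror the first few rows of the population file, and `map_countries_to_regions` is a hypothetical helper.

```python
import pandas as pd

def map_countries_to_regions(df):
    """Assign each non-uppercase row to the nearest preceding all-caps row."""
    is_region = df["Geography"].str.isupper()
    # Keep region names on region rows only, then carry them forward
    region = df["Geography"].where(is_region).ffill()
    out = df.loc[~is_region].assign(region=region[~is_region])
    out = out.rename(columns={"Geography": "country", "Population": "population"})
    return out[["country", "region", "population"]].reset_index(drop=True)

toy_countries = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
})

toy_mapped = map_countries_to_regions(toy_countries)
print(toy_mapped)
```

Because the region name is carried forward from the most recent all-caps row, a country always lands in the lowest region heading above it, matching the loop's behavior.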


## 6. Combining the two datasets
In this step, we merge the Wikipedia and population data. We first build the Wikipedia dataframe: all 7155 rows of politician data, keyed by `article_title`, with the revision id and the ORES article quality prediction added as columns.


In [33]:
def get_revid(title):
    """
    Get the last revision ID for a given title.

    Parameters:
    title (str): The title of the article.

    Returns:
    str or None: Last revision ID as a string or None if not found.
    """
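    if title in wiki_information:
        revid = get_lastrevid(wiki_information[title])
        # Avoid turning a missing revision ID into the string 'None'
        return str(revid) if revid is not None else None
    else:
        return None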

def get_article_quality(title):
    """
    Retrieve the article quality prediction for a given title.

    Parameters:
    title (str): The title of the article.

    Returns:
    Any or None: Article quality prediction or None if not found.
    """
    if title in ores_prediction:
        return ores_prediction[title]
    else:
        return None

politicians_df = politicians.drop(columns='url').copy()

# Rename the columns using the mapping
column_rename_mapping = {'name': 'article_title', 'country': 'country'}
politicians_df = politicians_df.rename(columns=column_rename_mapping)

# Create the new columns `revision_id` and `article_quality`
politicians_df['revision_id'] = politicians_df['article_title'].apply(get_revid)
politicians_df['article_quality'] = politicians_df['article_title'].apply(get_article_quality)

In [34]:
# Checking the output of the previous step
politicians_df.head()

Unnamed: 0,article_title,country,revision_id,article_quality
0,Majah Ha Adrif,Afghanistan,1233202991,Start
1,Haroon al-Afghani,Afghanistan,1230459615,B
2,Tayyab Agha,Afghanistan,1225661708,Start
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub
4,Aziza Ahmadyar,Afghanistan,1195651393,Start


We identify all countries for which there is no match between the two datasets and write them to a file called `wp_countries-no_match.txt`, with each country on a separate line.


In [35]:
a = set(politicians_df['country'].unique())
b = set(df_mapped['country'].unique())

In [36]:
# Finding the union ( | ) of
# 1. countries in A but not in B
# 2. countries in B but not in A
country_mismatches = (a - b) | (b - a)

In [37]:
country_mismatches

{'Andorra',
 'Australia',
 'Brunei',
 'Canada',
 'China (Hong Kong SAR)',
 'China (Macao SAR)',
 'Curacao',
 'Denmark',
 'Dominica',
 'Fiji',
 'French Guiana',
 'French Polynesia',
 'Georgia',
 'Guadeloupe',
 'Guam',
 'Guinea-Bissau',
 'GuineaBissau',
 'Iceland',
 'Ireland',
 'Jamaica',
 'Kiribati',
 'Korea (North)',
 'Korea (South)',
 'Korea, South',
 'Korean',
 'Liechtenstein',
 'Martinique',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Nauru',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Palau',
 'Philippines',
 'Puerto Rico',
 'Reunion',
 'Romania',
 'San Marino',
 'Sao Tome and Principe',
 'Suriname',
 'United Kingdom',
 'United States',
 'Western Sahara',
 'eSwatini'}

In [38]:
print('Number of country names that dont match =', len(country_mismatches))

Number of country names that dont match = 46


In [39]:
# Writing the output to a txt file
with open('wp_countries-no_match.txt', 'w') as f:
    # Sort so the file contents are stable across runs (sets are unordered)
    f.write("\n".join(sorted(country_mismatches)))

We consolidate the remaining data into a single CSV file called `wp_politicians_by_country.csv`, with the columns country, region, population, article_title, revision_id, and article_quality. We perform an inner join because countries without a match have no region or population value. Our analysis relies on per-capita values, which missing data would break or skew, so the inner join drops those rows.
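
The mismatch list and the effect of the inner join can be cross-checked with the `indicator` option of pandas' `merge`, which labels each row of an outer merge as `left_only`, `right_only`, or `both`. A minimal sketch on toy frames follows; the article titles and the unmatched rows are invented for illustration.

```python
import pandas as pd

toy_articles = pd.DataFrame({
    "article_title": ["Person A", "Person B", "Person C"],
    "country": ["Algeria", "Korean", "Egypt"],
})
toy_population = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Country Q"],
    "region": ["NORTHERN AFRICA", "NORTHERN AFRICA", "REGION R"],
    "population": [46.8, 105.2, 1.0],
})

# Outer merge keeps every row from both sides and records where it came from
outer = toy_articles.merge(toy_population, on="country", how="outer", indicator=True)
unmatched = outer.loc[outer["_merge"] != "both", ["country", "_merge"]].drop_duplicates()

# Inner merge keeps only rows whose country appears in both frames
inner = toy_articles.merge(toy_population, on="country", how="inner")

print(unmatched)
print(len(inner))
```

Rows tagged `left_only` are articles whose country has no population entry; rows tagged `right_only` are countries with no articles.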




In [40]:
# Merge the dataframes based on the 'country' column
merged_df = politicians_df.merge(df_mapped, on='country', how='inner')
merged_df.head()

Unnamed: 0,article_title,country,revision_id,article_quality,region,population
0,Majah Ha Adrif,Afghanistan,1233202991,Start,SOUTH ASIA,42.4
1,Haroon al-Afghani,Afghanistan,1230459615,B,SOUTH ASIA,42.4
2,Tayyab Agha,Afghanistan,1225661708,Start,SOUTH ASIA,42.4
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub,SOUTH ASIA,42.4
4,Aziza Ahmadyar,Afghanistan,1195651393,Start,SOUTH ASIA,42.4


In [41]:
print(f'Length of the final merged data = {len(merged_df)}')

Length of the final merged data = 7013


We do a little bit of checking of the join we performed in the last step just to make sure everything is alright.

In [42]:
columns = merged_df.columns.values.tolist()
for col in columns:
    print(f"Column = {col}")
    # find unique values
    print(f'Unique values - {merged_df[col].unique()}')
    # check the count of unique values
    print(f'Count of unique values - {merged_df[col].nunique()}')
    # check if there are any NA values
    print(f'Count of NA values - {merged_df[col].isna().sum()}')
    print()

Column = article_title
Unique values - ['Majah Ha Adrif' 'Haroon al-Afghani' 'Tayyab Agha' ...
 'Sengezo Tshabangu' 'Herbert Ushewokunze' 'Denis Walker']
Count of unique values - 6970
Count of NA values - 0

Column = country
Unique values - ['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Antigua and Barbuda'
 'Argentina' 'Armenia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain'
 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bhutan'
 'Bolivia' 'Bosnia Herzegovina' 'Botswana' 'Brazil' 'Bulgaria'
 'Burkina Faso' 'Myanmar' 'Burundi' 'Cambodia' 'Cameroon' 'Cape Verde'
 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia' 'Comoros'
 'Congo' 'Congo DR' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus' 'Czechia'
 'Djibouti' 'Dominican Republic' 'Timor Leste' 'Ecuador' 'Egypt'
 'United Arab Emirates' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia'
 'Finland' 'France' 'Gabon' 'Gambia' 'Germany' 'Ghana' 'Greece' 'Grenada'
 'Guatemala' 'Guinea' 'Guyana' 'Haiti' 'Honduras' 'Hungary' 'India'

We write our final data to the CSV file named `wp_politicians_by_country.csv`

In [43]:
# index=False keeps the row index out of the file, so it has only the listed columns
merged_df.to_csv('wp_politicians_by_country.csv', index=False)

## 7. Analysis
Here, we perform basic analysis on the dataset we created above and generate the following tables:
1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order).
3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order).
4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
6. Geographic regions by high quality coverage: A rank ordered list of geographic regions (in descending order) by high quality articles per capita.


### 7.1 Top 10 countries by coverage
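The 10 countries with the highest total articles per capita (in descending order). The ratios are articles per unit of `population`; the magnitudes suggest that column is stored in millions, so they are best read as articles per million people.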

In [44]:
# Calculate articles per capita
articles_per_capita = merged_df.groupby('country')['article_title'].count() / merged_df.groupby('country')['population'].first()

# Top 10 countries by coverage
top_10_countries = articles_per_capita.sort_values(ascending=False).head(10)
top_10_countries

country,articles_per_capita
Monaco,inf
Tuvalu,inf
Antigua and Barbuda,330.0
Federated States of Micronesia,140.0
Marshall Islands,130.0
Tonga,100.0
Barbados,83.333333
Montenegro,60.0
Seychelles,60.0
Maldives,55.0
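
The `inf` entries for Monaco and Tuvalu most likely come from dividing by a population of 0, which would happen if the population column is stored in millions and rounded to one decimal place. That is an inference from the magnitudes in the table, not something confirmed here. Below is a minimal sketch of one way to keep such countries out of the ranking. It uses a small made-up frame (`toy_df`) in place of `merged_df`, and the helper name `articles_per_population` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'article_title': ['A', 'B', 'C', 'D', 'E'],
    'country': ['Smallland', 'Smallland', 'Bigland', 'Tinyland', 'Tinyland'],
    'population': [0.1, 0.1, 50.0, 0.0, 0.0],  # assumed to be in millions
})

def articles_per_population(df):
    """Article count divided by population, skipping zero-population countries."""
    grouped = df.groupby('country')
    counts = grouped['article_title'].count()
    # Replace 0 with NaN so the division yields NaN instead of inf
    population = grouped['population'].first().replace(0, np.nan)
    # Drop the countries whose ratio is undefined
    return (counts / population).dropna()

toy_ratio = articles_per_population(toy_df)
toy_top = toy_ratio.sort_values(ascending=False)
print(toy_top)
```

Whether to drop these countries or report them separately is a judgment call. Dropping them keeps the ranking finite, but it also hides the fact that they have articles at all.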


### 7.2 Bottom 10 countries by coverage
The 10 countries with the lowest total articles per capita (in ascending order).

In [45]:
# Bottom 10 countries by coverage
bottom_10_countries = articles_per_capita.sort_values(ascending=True).head(10)
bottom_10_countries

country,articles_per_capita
China,0.011337
India,0.105698
Ghana,0.117302
Saudi Arabia,0.135501
Zambia,0.148515
Norway,0.181818
Israel,0.204082
Egypt,0.304183
Cote d'Ivoire,0.323625
Ethiopia,0.347826


### 7.3 Top 10 countries by high quality
The 10 countries with the highest high quality articles per capita (in descending order).

In [46]:
# Filter for high-quality articles
high_quality_articles = merged_df[(merged_df['article_quality'] == 'FA') | (merged_df['article_quality'] == 'GA')]

# High-quality articles per capita
high_quality_articles_per_capita = high_quality_articles.groupby('country')['article_title'].count() / merged_df.groupby('country')['population'].first()

# Top 10 countries by high-quality article coverage
top_10_high_quality_countries = high_quality_articles_per_capita.sort_values(ascending=False).head(10)
top_10_high_quality_countries

country,high_quality_articles_per_capita
Montenegro,5.0
Luxembourg,2.857143
Albania,2.592593
Kosovo,2.352941
Maldives,1.666667
Lithuania,1.37931
Croatia,1.315789
Guyana,1.25
Palestinian Territory,1.090909
Slovenia,0.952381


### 7.4 Bottom 10 countries by high quality
The 10 countries with the lowest high quality articles per capita (in ascending order).

In [47]:
# Bottom 10 countries by high-quality article coverage
bottom_10_high_quality_countries = high_quality_articles_per_capita.sort_values(ascending=True).head(10)
bottom_10_high_quality_countries

country,high_quality_articles_per_capita
Bangladesh,0.005764
Egypt,0.009506
Ethiopia,0.01581
Japan,0.016064
Pakistan,0.016632
Colombia,0.019157
Congo DR,0.01955
Vietnam,0.020222
Uganda,0.020576
Algeria,0.021368
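
One thing to keep in mind when reading this table: the numerator only contains countries with at least one FA or GA article, so the division produces `NaN` for every other country. `sort_values` places `NaN` last, which means countries with zero high quality articles never show up in the bottom 10. If the intent is to rank them as well, one option is to reindex the counts against the full list of countries before dividing. The sketch below does this on made-up data (`toy_df`), and it is not the method used above.

```python
import pandas as pd

# Toy stand-in for merged_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'article_title': ['A', 'B', 'C', 'D'],
    'country': ['Xland', 'Xland', 'Yland', 'Zland'],
    'population': [10.0, 10.0, 5.0, 2.0],
    'article_quality': ['GA', 'Stub', 'FA', 'Start'],
})

population = toy_df.groupby('country')['population'].first()

# isin() is an equivalent, shorter way to write the FA-or-GA filter
is_high_quality = toy_df['article_quality'].isin(['FA', 'GA'])
hq_counts = toy_df[is_high_quality].groupby('country')['article_title'].count()

# Countries missing from hq_counts have no high quality articles: count them as 0
hq_counts_all = hq_counts.reindex(population.index, fill_value=0)
hq_per_population = hq_counts_all / population

hq_ranked = hq_per_population.sort_values(ascending=True)
print(hq_ranked)
```

With this change, the bottom of the ranking would be filled with ties at 0, so it may be more informative to list those countries separately.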


### 7.5 Geographic regions by total coverage
A rank-ordered list of geographic regions (in descending order) by total articles per capita.

In [48]:
# Geographic regions by total coverage
# Caveat: .first() takes the population from the first row in each region (a single country),
# not the region's total population, so these ratios are not true regional per capita values.
regions_total_coverage = merged_df.groupby('region')['article_title'].count() / merged_df.groupby('region')['population'].first()
regions_total_coverage_ranked = regions_total_coverage.sort_values(ascending=False)

# Geographic regions by high quality coverage (computed here, displayed in 7.6; same caveat applies)
high_quality_articles_region = merged_df[(merged_df['article_quality'] == 'FA') | (merged_df['article_quality'] == 'GA')]
regions_high_quality_coverage = high_quality_articles_region.groupby('region')['article_title'].count() / merged_df.groupby('region')['population'].first()
regions_total_coverage_ranked

region,articles_per_capita
CARIBBEAN,2190.0
OCEANIA,720.0
CENTRAL AMERICA,376.0
SOUTHERN EUROPE,295.185185
WESTERN ASIA,203.333333
NORTHERN EUROPE,136.428571
EASTERN EUROPE,77.065217
WESTERN EUROPE,54.130435
EASTERN AFRICA,50.378788
SOUTHERN AFRICA,45.555556
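
One way to divide by a regional population total, rather than by the population of a single country in the region, is to collapse the data to one row per country and sum populations within each region. The sketch below uses made-up data (`toy_df`) and assumes each country has a single, constant `population` value across its rows. It also only counts countries that appear in the frame, so any country dropped during the merge would not contribute to its region's total.

```python
import pandas as pd

# Toy stand-in for merged_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    'article_title': ['A', 'B', 'C', 'D', 'E'],
    'country': ['Xland', 'Xland', 'Yland', 'Zland', 'Zland'],
    'region': ['NORTH', 'NORTH', 'NORTH', 'SOUTH', 'SOUTH'],
    'population': [10.0, 10.0, 30.0, 4.0, 4.0],
})

# Keep one row per country so each population is counted exactly once
country_level = toy_df.drop_duplicates(subset='country')
region_population = country_level.groupby('region')['population'].sum()

# Article counts still come from the full, article-level frame
region_articles = toy_df.groupby('region')['article_title'].count()

region_coverage = (region_articles / region_population).sort_values(ascending=False)
print(region_coverage)
```

The same `region_population` series could serve as the denominator for the high quality ranking in 7.6.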


### 7.6 Geographic regions by high quality coverage
A rank-ordered list of geographic regions (in descending order) by high quality articles per capita.

In [49]:
regions_high_quality_coverage_ranked = regions_high_quality_coverage.sort_values(ascending=False)
regions_high_quality_coverage_ranked

region,high_quality_articles_per_capita
CARIBBEAN,90.0
CENTRAL AMERICA,20.0
SOUTHERN EUROPE,19.62963
OCEANIA,10.0
WESTERN ASIA,9.0
NORTHERN EUROPE,5.714286
EASTERN EUROPE,4.130435
SOUTHERN AFRICA,2.962963
WESTERN EUROPE,2.282609
EASTERN AFRICA,1.287879


## Research Implications
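Both the number and the quality of articles about politicians vary widely across countries and regions. In the tables above, small countries such as Antigua and Barbuda and Tonga sit at the top of the coverage ranking, while China and India sit at the bottom. This suggests that Wikipedia might have some biases in how it covers political figures.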

Why might that be? To start with, having access to the internet and knowing how to use it makes a big difference in how much people can contribute to Wikipedia. Countries where fewer people are online, or where people are less comfortable with technology, might have fewer articles. Second, a country's politics and culture can influence whether people want to write or edit articles about their politicians. Countries with restrictive governments, such as North Korea and China, might have fewer people willing to write articles, fearing negative consequences for any mistakes. And finally, how well people in different countries understand English can also affect the quality and quantity of articles.

Some of the findings stand out. For example, the large difference in coverage between rich and poor countries suggests that Wikipedia doesn't represent everyone equally. This could lead to a skewed view of global politics, since politicians from countries with fewer articles are less visible. This is worth investigating further to ensure equitable coverage and access to political information across the globe.

To make Wikipedia fairer and more balanced, we need to make it more accessible to people in less developed countries and help them understand why Wikipedia is important. We should also encourage more people from different backgrounds to contribute to Wikipedia. This will help us create a more complete and accurate picture of political figures and issues around the world.