# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki Action API for English Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [1]:
# 
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd

The example relies on some constants that help make the code a bit more readable.

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<vanksu@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

WIKIPEDIA_ARTICLES = "/Users/sushmavankayala/Documents/DATA_512/data-512-homework_2/resources/politicians_by_country_AUG.2024.csv"

The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [3]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
    # work on a copy so the shared template (a mutable default argument) is never modified
    request_params = dict(request_template)

    # the article title can be passed as a parameter to the call or set in the request_template
    if article_title:
        request_params['titles'] = article_title

    if not request_params['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_params)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [4]:
def fetch_data_multiple_articles(appended_titles, collected_data):

    try:
        # Fetch pageinfo for the batch of article titles passed in (not the global 'title')
        api_response = request_pageinfo_per_article(appended_titles)
    
        pages = api_response['query']['pages']
        
        for page_id, page_data in pages.items():
            try:
                collected_data.append({
                    "pageid": page_data["pageid"],
                    "name": page_data["title"],
                    "revision_id": page_data["lastrevid"],
                })
            except Exception as e:
                print("Error while fetching data for specific article title: ", page_data["title"])
                print(e)
    except Exception as e:
        print("Error while fetching data for article titles:", appended_titles)
        print(e)

    return collected_data
        

Documentation about the maximum number of titles that can be sent in one request, separated by "|": https://www.mediawiki.org/w/api.php?action=help&modules=query
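
A minimal sketch of batching titles, assuming a limit of 50 titles per request (the batch size this notebook uses; check the linked documentation for the limit that applies to your account). The helper name `batch_titles` and the sample titles are hypothetical and are not part of the original notebook.

```python
def batch_titles(titles, batch_size=50):
    """Yield '|'-joined strings holding at most batch_size titles each."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])


# Made-up titles: 120 of them should split into batches of 50, 50 and 20
example_titles = [f"Article {n}" for n in range(120)]
batches = list(batch_titles(example_titles))
print([len(batch.split("|")) for batch in batches])
```

Titles that themselves contain "|" cannot be joined this way; for those, the API accepts U+001F as an alternative separator, which is what the collection loop in this notebook uses.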

In [7]:
# Read the list of wikipedia articles from the CSV file and create a dataframe
wiki_list = pd.read_csv(WIKIPEDIA_ARTICLES)
print("Total number of rows in the provided csv: ", wiki_list.shape[0])

# Politicians associated to more than 1 country
wiki_count_df = wiki_list.groupby(['name', "url"]).size().reset_index(name='Count')
wikie_count_dup_df = wiki_count_df[wiki_count_df['Count'] > 1]
print("Total number of politicians that are associated with more than 1 country:" ,wikie_count_dup_df.shape[0])

wikie_count_dup_df

Total number of rows in the provided csv:  7155
Total number of politicians that are associated with more than 1 country: 41


Unnamed: 0,name,url,Count
182,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,2
456,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,2
568,Ali al-Qaradaghi,https://en.wikipedia.org/wiki/Ali_al-Qaradaghi,2
800,Antonio Gutiérrez y Ulloa,https://en.wikipedia.org/wiki/Antonio_Gutiérre...,2
815,Antonín Janoušek,https://en.wikipedia.org/wiki/Antonín_Janoušek,2
892,Ashab Uddin Ahmad,https://en.wikipedia.org/wiki/Ashab_Uddin_Ahmad,2
1016,Bak Jungyang,https://en.wikipedia.org/wiki/Bak_Jungyang,2
1161,Bona Malwal,https://en.wikipedia.org/wiki/Bona_Malwal,2
1495,Count Václav Antonín Chotek of Chotkov and Vojnín,https://en.wikipedia.org/wiki/Count_Václav_Ant...,2
1682,Djama Ali Moussa,https://en.wikipedia.org/wiki/Djama_Ali_Moussa,2


The API can return information for multiple pages in a single request when the page titles are separated with the vertical bar "|" character. However, the number of titles per request is limited. Check the API documentation before requesting multiple pages at once, and keep the number of pages in each request reasonable.

This example also illustrates creating a copy of the template, setting values in the template, and then calling the function using the template to supply the parameters for the API request.
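
A minimal sketch of the copy-then-set pattern described above. The dictionary `params_template` is a stand-in for `PAGEINFO_PARAMS_TEMPLATE` so that the snippet runs on its own, and the two titles are arbitrary examples.

```python
import copy

# Stand-in for PAGEINFO_PARAMS_TEMPLATE (assumption: same keys as the notebook's template)
params_template = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "talkid|url|watched|watchers",
}

# Copy first, then set values on the copy; the shared template stays untouched
request_info = copy.copy(params_template)
request_info["titles"] = "Ada Lovelace|Alan Turing"

print(params_template["titles"] == "")
print(request_info["titles"])
```

The copy could then be passed to `request_pageinfo_per_article` through its `request_template` parameter.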

In [None]:
# Create an empty list to store the information collected from the MediaWiki API
title = ""
collected_data = []

for index, row in wiki_list.iterrows():

    # Fetch article title from dataframe
    article_title = row["name"]
    # Append the article title to the "title" string so that a single call covers 50 names.
    # U+001F is the API's alternative multi-value separator (used instead of "|"); the string must start with it.
    title = title + '\u001F' + article_title
    
    if ((index+1) % 50) == 0:
        # API call to fetch revision ids and append the results to collected_data
        collected_data = fetch_data_multiple_articles(title, collected_data)
        title = ""

# API call to fetch revision ids for any remaining titles (skipped when nothing is left over)
if title:
    collected_data = fetch_data_multiple_articles(title, collected_data)

In [None]:
rev_df = pd.DataFrame(collected_data)
rev_df = rev_df.drop_duplicates()
rev_df.to_csv("generated_files/wiki_list_with_revision.csv",  index=False)
rev_df

In [None]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#    
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#
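
The comments above say the throttle constants come from dissecting the access token. A minimal sketch of that step, assuming the token is a JWT whose payload carries a `ratelimit` claim with `requests_per_unit` and `unit` fields. The token below is fabricated for illustration, and the helper names `decode_jwt_payload` and `_b64url` are hypothetical. The signature is not verified; the code only reads the claims.

```python
import base64
import json


def decode_jwt_payload(token):
    """Return the claims in a JWT's payload segment without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url segments drop their '=' padding, so restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data):
    """Encode a dict as an unpadded base64url JSON segment (used to build the fake token)."""
    raw = json.dumps(data).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


# Fabricated token: header.payload.signature, with a dummy signature
fake_claims = {"ratelimit": {"requests_per_unit": 5000, "unit": "HOUR"}}
fake_token = ".".join([_b64url({"typ": "JWT"}), _b64url(fake_claims), "signature"])

claims = decode_jwt_payload(fake_token)
requests_per_hour = claims["ratelimit"]["requests_per_unit"]
throttle_wait = (60.0 * 60.0) / requests_per_hour   # seconds between requests
print(claims["ratelimit"], throttle_wait)
```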

In [None]:
#   Once you've done the right setup with your Wikimedia account, it should provide you with three different keys: a Client ID,
#   a Client secret, and an Access token.
#
#   In this case I don't want to distribute my keys with the source of the notebook, so I wrote a key manager object that helps
#   track all of my API keys - a username and domain name retrieves the key. The key manager hides the keys on disk separate
#   from the code. A common code idiom to hide API keys will use code to extract the key from an OS environment variable. 
#

#   Note: if you don't want to use the key manager to help manage your API keys, you can specify the values as constants
#   below. Just don't distribute the notebook without removing the constants or you'll be distributing your key too.

USERNAME = "Vanksu1501"
# The access token is redacted here: read it from an OS environment variable instead of hard-coding it.
# The variable name WIKIMEDIA_ACCESS_TOKEN is an illustrative choice, not a Wikimedia convention.
import os
ACCESS_TOKEN = os.environ.get("WIKIMEDIA_ACCESS_TOKEN", "")

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT, 
                                   model_name = API_ORES_EN_QUALITY_MODEL, 
                                   request_data = ORES_REQUEST_DATA_TEMPLATE, 
                                   header_format = REQUEST_HEADER_TEMPLATE, 
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token
    
    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")
    
    # Create the request URL with the specified model parameter - default is an article quality score request
    request_url = endpoint_url.format(model_name=model_name)
    
    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
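

The scoring loop in this notebook reads the prediction from a nested path in the response. A minimal sketch of extracting it defensively, assuming the response has the shape that loop expects (`enwiki`, `scores`, the revision id as a string, `articlequality`, `score`, `prediction`). The sample response is fabricated and `extract_prediction` is a hypothetical helper.

```python
def extract_prediction(response, rev_id):
    """Return the predicted quality class, or None if the response lacks it."""
    try:
        return response["enwiki"]["scores"][str(rev_id)]["articlequality"]["score"]["prediction"]
    except (KeyError, TypeError):
        # KeyError: an expected key is missing; TypeError: response is None or not a dict
        return None


# Fabricated response in the assumed shape
sample = {"enwiki": {"scores": {"12345": {"articlequality": {"score": {"prediction": "Stub"}}}}}}

print(extract_prediction(sample, 12345))
print(extract_prediction(None, 12345))
print(extract_prediction({"error": "not found"}, 12345))
```

Returning `None` instead of raising lets the caller record an empty rating for that revision and continue with the next one.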


In [None]:
scores_list = []

for index,row in rev_df.iterrows():

    row_dict = row.to_dict()

    try:
        score = request_ores_score_per_article(article_revid=row["revision_id"],
                                           email_address="vanksu@uw.edu",
                                           access_token=ACCESS_TOKEN)
        
        final_rating = score["enwiki"]["scores"][f'{row["revision_id"]}']["articlequality"]["score"]["prediction"]
    
        row_dict["article_rating"] = final_rating
        scores_list.append(row_dict)
    
    except Exception as e:
        print("Error fetching article_rating for", row["name"], "| Revision:", row["revision_id"])
        print(e)
        row_dict["article_rating"] = ""
        scores_list.append(row_dict)


print("Completed fetching ratings for ", rev_df.shape[0], "articles")

In [None]:
scores_df = pd.DataFrame(scores_list)
scores_df = scores_df.drop_duplicates()
scores_df.to_csv("generated_files/enriched_article_data.csv", index=False)

# Initial data collection completed

Now we have a list of politicians' Wikipedia articles along with their revision_id, pageid and article_rating. For articles where the ORES call failed, the row is kept in the list with an empty article_rating.

In [15]:
# Loading the enriched wikilist from generated files
scores_df = pd.read_csv('generated_files/enriched_article_data.csv')

# Find the articles for which ORES call failed
scores_df_missing_rating = scores_df[scores_df["article_rating"].isna()]
print("Number of articles for which ORES call failed:", scores_df_missing_rating.shape[0])
scores_df_missing_rating

Number of articles for which ORES call failed: 7


Unnamed: 0,pageid,name,revision_id,article_rating
833,26878336,Marcel Van Langenhove (referee),1136613172,
2602,21395252,Giovanni de Ciotta,1222150298,
2625,38965045,Márton Berzeviczy,1218731042,
2764,77136001,Mandipalli Ramprasad Reddy,1245717814,
5146,77294450,Jerzy Swatoń,1234283402,
5655,36205190,Matija Škerbec,1180682572,
6391,27076074,Peter-Josef Schallberger,1022570699,


In [16]:
# Left-join wiki_list with scores_df on the article name to attach revision ids and ratings
wiki_list_rev = pd.merge(wiki_list, scores_df, on=["name"], how="left" )

wiki_list_rev = wiki_list_rev.rename(columns={'name': 'article_title'})
wiki_list_rev = wiki_list_rev.drop(['url', 'pageid'], axis=1)
wiki_list_rev["revision_id"] = wiki_list_rev["revision_id"].astype("Int64")

wiki_list_rev

Unnamed: 0,article_title,country,revision_id,article_rating
0,Majah Ha Adrif,Afghanistan,1233202991,Start
1,Haroon al-Afghani,Afghanistan,1230459615,B
2,Tayyab Agha,Afghanistan,1225661708,Start
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub
4,Aziza Ahmadyar,Afghanistan,1195651393,Start
...,...,...,...,...
7150,Josiah Tongogara,Zimbabwe,1203429435,C
7151,Langton Towungana,Zimbabwe,1246280093,Stub
7152,Sengezo Tshabangu,Zimbabwe,1228478288,Start
7153,Herbert Ushewokunze,Zimbabwe,959111842,Stub


In [18]:
# Wikilist articles with no revision_id => pageInfo call failed
wiki_list_rev[wiki_list_rev["revision_id"].isna()]

Unnamed: 0,article_title,country,revision_id,article_rating
430,Barbara Eibinger-Miedl,Austria,,
516,Mehrali Gasimov,Azerbaijan,,
1200,Kyaw Myint,Myanmar,,
1342,André Ngongang Ouandji,Cameroon,,
1955,Tomás Pimentel,Dominican Republic,,
2427,Richard Sumah,Ghana,,
4496,Segun ''Aeroland'' Adewale,Nigeria,,
5719,Bashir Bililiqo,Somalia,,


In [21]:
# Wikilist articles with revision_id but no article rating => ORES call failed
wiki_list_rev[wiki_list_rev["revision_id"].notna() & wiki_list_rev["article_rating"].isna()]


Unnamed: 0,article_title,country,revision_id,article_rating
832,Marcel Van Langenhove (referee),Belgium,1136613172,
2624,Márton Berzeviczy,Hungary,1218731042,
2628,Giovanni de Ciotta,Hungary,1222150298,
2760,Mandipalli Ramprasad Reddy,India,1245717814,
5183,Jerzy Swatoń,Poland,1234283402,
5684,Matija Škerbec,Slovenia,1180682572,
6437,Peter-Josef Schallberger,Switzerland,1022570699,


# Manipulating population DF to find region corresponding to each country

In [22]:
population_df = pd.read_csv("/Users/sushmavankayala/Documents/DATA_512/data-512-homework_2/resources/population_by_country_AUG.2024.csv")
population_df

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2
...,...,...
228,Samoa,0.2
229,Solomon Islands,0.8
230,Tonga,0.1
231,Tuvalu,0.0


In [23]:
# Loop through each row in the dataframe. A name written entirely in capital letters is a region.
region = ""
region_population = 0

population_list = []

for index, row in population_df.iterrows():
    row_dict = row.to_dict()
    if row["Geography"].isupper():
        region = row["Geography"]
        region_population = row["Population"]
        row_dict["region"] = ""
        row_dict["region_population"] = 0
    else:
        row_dict["region"] = region
        row_dict["region_population"] = region_population

    population_list.append(row_dict)

population_df_enriched = pd.DataFrame(population_list)
population_df_enriched

Unnamed: 0,Geography,Population,region,region_population
0,WORLD,8009.0,,0.0
1,AFRICA,1453.0,,0.0
2,NORTHERN AFRICA,256.0,,0.0
3,Algeria,46.8,NORTHERN AFRICA,256.0
4,Egypt,105.2,NORTHERN AFRICA,256.0
...,...,...,...,...
228,Samoa,0.2,OCEANIA,45.0
229,Solomon Islands,0.8,OCEANIA,45.0
230,Tonga,0.1,OCEANIA,45.0
231,Tuvalu,0.0,OCEANIA,45.0
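
The row-by-row loop above can also be written with vectorized pandas operations. A minimal sketch on a small made-up sample in the same layout, assuming (as the loop does) that every region row is written entirely in upper case and precedes its countries.

```python
import pandas as pd

# Small made-up sample in the same layout as the population file
sample = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Samoa"],
    "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2, 45.0, 0.2],
})

# Region rows are the ones whose name is entirely upper case
is_region = sample["Geography"].str.isupper()

# Keep the name only on region rows, then carry it forward onto the country rows below
sample["region"] = sample["Geography"].where(is_region).ffill()

# Drop the region rows themselves to leave one row per country
countries = sample[~is_region].reset_index(drop=True)
print(countries)
```

Each country picks up the nearest region row above it, which matches what the loop does with its running `region` variable.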


In [33]:
population_df_enriched = population_df_enriched.rename(columns = {"Geography": "country", "Population": "population"})
country_region_df = population_df_enriched[population_df_enriched["region_population"] > 0]
country_region_df

Unnamed: 0,country,population,region,region_population
3,Algeria,46.8,NORTHERN AFRICA,256.0
4,Egypt,105.2,NORTHERN AFRICA,256.0
5,Libya,6.9,NORTHERN AFRICA,256.0
6,Morocco,37.0,NORTHERN AFRICA,256.0
7,Sudan,48.1,NORTHERN AFRICA,256.0
...,...,...,...,...
228,Samoa,0.2,OCEANIA,45.0
229,Solomon Islands,0.8,OCEANIA,45.0
230,Tonga,0.1,OCEANIA,45.0
231,Tuvalu,0.0,OCEANIA,45.0


# Joining population data with the articles_enriched data

In [34]:
accumulated_data = pd.merge(wiki_list_rev, country_region_df, on=["country"], how="outer", indicator=True)
accumulated_data

Unnamed: 0,article_title,country,revision_id,article_rating,population,region,region_population,_merge
0,Majah Ha Adrif,Afghanistan,1233202991,Start,42.4,SOUTH ASIA,2029.0,both
1,Haroon al-Afghani,Afghanistan,1230459615,B,42.4,SOUTH ASIA,2029.0,both
2,Tayyab Agha,Afghanistan,1225661708,Start,42.4,SOUTH ASIA,2029.0,both
3,Khadija Zahra Ahmadi,Afghanistan,1234741562,Stub,42.4,SOUTH ASIA,2029.0,both
4,Aziza Ahmadyar,Afghanistan,1195651393,Start,42.4,SOUTH ASIA,2029.0,both
...,...,...,...,...,...,...,...,...
7193,,Kiribati,,,0.1,OCEANIA,45.0,right_only
7194,,Nauru,,,0.0,OCEANIA,45.0,right_only
7195,,New Caledonia,,,0.3,OCEANIA,45.0,right_only
7196,,New Zealand,,,5.2,OCEANIA,45.0,right_only
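
The `_merge` column comes from `indicator=True` and records which side each row was found on. A minimal sketch on made-up rows (the population numbers are illustrative only) showing how `left_only` and `right_only` expose names that do not match between the two inputs.

```python
import pandas as pd

# Made-up rows that mimic the two inputs
articles = pd.DataFrame({
    "article_title": ["A", "B", "C"],
    "country": ["Egypt", "Egypt", "Korea, South"],
})
populations = pd.DataFrame({
    "country": ["Egypt", "Korea (South)", "Canada"],
    "population": [105.2, 51.3, 38.9],   # illustrative numbers only
})

merged = pd.merge(articles, populations, on="country", how="outer", indicator=True)

# left_only: article rows with no population; right_only: countries with no articles
no_population = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
no_articles = merged.loc[merged["_merge"] == "right_only", "country"].unique().tolist()
print(no_population, no_articles)
```

Here "Korea, South" and "Korea (South)" land on opposite sides because the join compares the strings exactly.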


In [36]:
# Articles for which country data is not found in country_region_df
country_population_missing_df = accumulated_data[accumulated_data["_merge"] == "left_only"]
print("Countries in the wiki_list for which there is no population info:", country_population_missing_df["country"].unique())

Countries in the wiki_list for which there is no population info: ['Guinea-Bissau' 'Korean' 'Korea, South']


In [37]:
# Countries for which there are no articles present in wiki_list_rev
country_article_missing_df = accumulated_data[accumulated_data["_merge"] == "right_only"]
print("Countries in the country_region_df for which there are no articles:", country_article_missing_df["country"].unique())

Countries in the country_region_df for which there are no articles: ['Western Sahara' 'GuineaBissau' 'Mauritius' 'Mayotte' 'Reunion'
 'Sao Tome and Principe' 'eSwatini' 'Canada' 'United States' 'Mexico'
 'Curacao' 'Dominica' 'Guadeloupe' 'Jamaica' 'Martinique' 'Puerto Rico'
 'French Guiana' 'Suriname' 'Georgia' 'Brunei' 'Philippines'
 'China (Hong Kong SAR)' 'China (Macao SAR)' 'Korea (North)'
 'Korea (South)' 'Denmark' 'Iceland' 'Ireland' 'United Kingdom'
 'Liechtenstein' 'Netherlands' 'Romania' 'Andorra' 'San Marino'
 'Australia' 'Fiji' 'French Polynesia' 'Guam' 'Kiribati' 'Nauru'
 'New Caledonia' 'New Zealand' 'Palau']


In [43]:
# Write countries with no match (i.e., the results of the two queries above) to a .txt file
# Combine the two lists
combined_list = country_population_missing_df["country"].unique().tolist() + country_article_missing_df["country"].unique().tolist()

# Write the combined list to a text file with each string on a new line
with open('generated_files/wp_countries-no_match.txt', 'w+') as file:
    for country in combined_list:
        file.write(country + '\n')

In [38]:
# Find the articles for which we have information about country population
wp_politicians_by_country = accumulated_data[accumulated_data["_merge"] == "both"]
wp_politicians_by_country = wp_politicians_by_country.drop(columns = ["region_population", "_merge"])
wp_politicians_by_country = wp_politicians_by_country.rename(columns = { "article_rating": "article_quality"})
wp_politicians_by_country = wp_politicians_by_country[["country", "region", "population", "article_title", "revision_id", "article_quality"]]
wp_politicians_by_country = wp_politicians_by_country.drop_duplicates()
wp_politicians_by_country

Unnamed: 0,country,region,population,article_title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233202991,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230459615,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225661708,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234741562,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651393,Start
...,...,...,...,...,...,...
7150,Zimbabwe,EASTERN AFRICA,16.7,Josiah Tongogara,1203429435,C
7151,Zimbabwe,EASTERN AFRICA,16.7,Langton Towungana,1246280093,Stub
7152,Zimbabwe,EASTERN AFRICA,16.7,Sengezo Tshabangu,1228478288,Start
7153,Zimbabwe,EASTERN AFRICA,16.7,Herbert Ushewokunze,959111842,Stub


In [39]:
wp_politicians_by_country.to_csv("generated_files/wp_politicians_by_country.csv", index=False)

# Step 4: Analysis

In [54]:
# Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) .

country_article_count = wp_politicians_by_country.groupby(['country', 'region', 'population']).size().reset_index(name='total_articles')
country_article_count["total_articles_per_capita"] = country_article_count["total_articles"] / country_article_count["population"]

top_10 = country_article_count.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
top_10 = top_10.head(10)

print("The 10 countries with the highest total articles per capita (in descending order):", top_10["country"].tolist())
top_10

The 10 countries with the highest total articles per capita (in descending order): ['Monaco', 'Tuvalu', 'Antigua and Barbuda', 'Federated States of Micronesia', 'Marshall Islands', 'Tonga', 'Barbados', 'Montenegro', 'Seychelles', 'Maldives']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Monaco,WESTERN EUROPE,0.0,10,inf
1,Tuvalu,OCEANIA,0.0,1,inf
2,Antigua and Barbuda,CARIBBEAN,0.1,33,330.0
3,Federated States of Micronesia,OCEANIA,0.1,14,140.0
4,Marshall Islands,OCEANIA,0.1,13,130.0
5,Tonga,OCEANIA,0.1,10,100.0
6,Barbados,CARIBBEAN,0.3,25,83.333333
7,Montenegro,SOUTHERN EUROPE,0.6,36,60.0
8,Seychelles,EASTERN AFRICA,0.1,6,60.0
9,Maldives,SOUTH ASIA,0.6,33,55.0
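
The `inf` values for Monaco and Tuvalu come from dividing by a population recorded as 0.0 (the population figures appear to be in millions, rounded to one decimal). A minimal sketch, on made-up rows, of one way to keep such countries from topping the ranking: treat a zero population as missing. Whether to exclude these countries is an analysis decision, not something the original notebook does.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame({
    "country": ["Tinyland", "Midland", "Bigland"],      # hypothetical countries
    "population": [0.0, 0.5, 100.0],                     # same units as the population file
    "total_articles": [10, 20, 30],
})

# Replace zero populations with NaN so the ratio becomes NaN instead of inf
safe_population = counts["population"].replace(0.0, np.nan)
counts["total_articles_per_capita"] = counts["total_articles"] / safe_population

# sort_values places NaN last by default, so undefined ratios fall to the end of the list
ranked = counts.sort_values("total_articles_per_capita", ascending=False)
print(ranked)
```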


In [55]:
# Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .
bottom_10 = country_article_count.sort_values("total_articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10 = bottom_10.head(10)

print("The 10 countries with the lowest total articles per capita (in ascending order):", bottom_10["country"].tolist())
bottom_10

The 10 countries with the lowest total articles per capita (in ascending order): ['China', 'India', 'Ghana', 'Saudi Arabia', 'Zambia', 'Norway', 'Israel', 'Egypt', "Cote d'Ivoire", 'Ethiopia']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,China,EAST ASIA,1411.3,16,0.011337
1,India,SOUTH ASIA,1428.6,151,0.105698
2,Ghana,WESTERN AFRICA,34.1,4,0.117302
3,Saudi Arabia,WESTERN ASIA,36.9,5,0.135501
4,Zambia,EASTERN AFRICA,20.2,3,0.148515
5,Norway,NORTHERN EUROPE,5.5,1,0.181818
6,Israel,WESTERN ASIA,9.8,2,0.204082
7,Egypt,NORTHERN AFRICA,105.2,32,0.304183
8,Cote d'Ivoire,WESTERN AFRICA,30.9,10,0.323625
9,Ethiopia,EASTERN AFRICA,126.5,44,0.347826


In [58]:
# Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order)

high_quality_articles = wp_politicians_by_country[wp_politicians_by_country["article_quality"].isin(["FA", "GA"])]
high_quality_articles_count = high_quality_articles.groupby(['country', 'region', 'population']).size().reset_index(name='total_articles')
high_quality_articles_count["total_articles_per_capita"] = high_quality_articles_count["total_articles"] / high_quality_articles_count["population"]

top_10 = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
top_10 = top_10.head(10)

print("The 10 countries with the highest high quality articles per capita (in descending order):", top_10["country"].tolist())
top_10

The 10 countries with the highest high quality articles per capita (in descending order): ['Montenegro', 'Luxembourg', 'Albania', 'Kosovo', 'Maldives', 'Lithuania', 'Croatia', 'Guyana', 'Palestinian Territory', 'Slovenia']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Montenegro,SOUTHERN EUROPE,0.6,3,5.0
1,Luxembourg,WESTERN EUROPE,0.7,2,2.857143
2,Albania,SOUTHERN EUROPE,2.7,7,2.592593
3,Kosovo,SOUTHERN EUROPE,1.7,4,2.352941
4,Maldives,SOUTH ASIA,0.6,1,1.666667
5,Lithuania,NORTHERN EUROPE,2.9,4,1.37931
6,Croatia,SOUTHERN EUROPE,3.8,5,1.315789
7,Guyana,SOUTH AMERICA,0.8,1,1.25
8,Palestinian Territory,WESTERN ASIA,5.5,6,1.090909
9,Slovenia,SOUTHERN EUROPE,2.1,2,0.952381


In [59]:
# Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
bottom_10 = high_quality_articles_count.sort_values("total_articles_per_capita", ascending=True).reset_index(drop=True)
bottom_10 = bottom_10.head(10)

print("The 10 countries with the least high quality articles per capita (in ascending order):", bottom_10["country"].tolist())
bottom_10

The 10 countries with the least high quality articles per capita (in ascending order): ['Bangladesh', 'Egypt', 'Ethiopia', 'Japan', 'Pakistan', 'Colombia', 'Congo DR', 'Vietnam', 'Uganda', 'Algeria']


Unnamed: 0,country,region,population,total_articles,total_articles_per_capita
0,Bangladesh,SOUTH ASIA,173.5,1,0.005764
1,Egypt,NORTHERN AFRICA,105.2,1,0.009506
2,Ethiopia,EASTERN AFRICA,126.5,2,0.01581
3,Japan,EAST ASIA,124.5,2,0.016064
4,Pakistan,SOUTH ASIA,240.5,4,0.016632
5,Colombia,SOUTH AMERICA,52.2,1,0.019157
6,Congo DR,MIDDLE AFRICA,102.3,2,0.01955
7,Vietnam,SOUTHEAST ASIA,98.9,2,0.020222
8,Uganda,EASTERN AFRICA,48.6,1,0.020576
9,Algeria,NORTHERN AFRICA,46.8,1,0.021368
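
Because the counts come from a `groupby` over high quality articles only, a country with no FA or GA article never appears in `high_quality_articles_count`, so the bottom 10 ranks only countries that have at least one. A minimal sketch, on made-up rows, of including the zero-count countries by left-merging the counts onto the full country list. Treat it as one possible variant, not the notebook's method; the column name `hq_articles` is hypothetical.

```python
import pandas as pd

# Made-up rows in the same layout as wp_politicians_by_country
articles = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "population": [10.0, 10.0, 20.0, 5.0],
    "article_quality": ["GA", "Stub", "Start", "FA"],
})

# One row per country with its population
countries = articles[["country", "population"]].drop_duplicates()

# Count FA/GA articles per country; countries without any are absent here
hq_counts = (
    articles[articles["article_quality"].isin(["FA", "GA"])]
    .groupby("country").size().reset_index(name="hq_articles")
)

# The left merge restores the absent countries, then their count is filled with 0
result = countries.merge(hq_counts, on="country", how="left")
result["hq_articles"] = result["hq_articles"].fillna(0).astype(int)
result["hq_articles_per_capita"] = result["hq_articles"] / result["population"]
print(result.sort_values("hq_articles_per_capita"))
```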


In [60]:
# Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

articles_count_per_region = wp_politicians_by_country.groupby(['region']).size().reset_index(name='total_articles')
articles_count_per_region

Unnamed: 0,region,total_articles
0,CARIBBEAN,219
1,CENTRAL AMERICA,188
2,CENTRAL ASIA,106
3,EAST ASIA,152
4,EASTERN AFRICA,665
5,EASTERN EUROPE,709
6,MIDDLE AFRICA,231
7,NORTHERN AFRICA,302
8,NORTHERN EUROPE,191
9,OCEANIA,72


In [67]:
regional_data = pd.merge(articles_count_per_region, population_df, left_on=["region"], right_on=["Geography"], how="left")
regional_data = regional_data[["region", "total_articles", "Population"]]
regional_data["total_articles_per_capita"] = regional_data["total_articles"] / regional_data["Population"]
regional_data

Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,CARIBBEAN,219,44.0,4.977273
1,CENTRAL AMERICA,188,182.0,1.032967
2,CENTRAL ASIA,106,80.0,1.325
3,EAST ASIA,152,1648.0,0.092233
4,EASTERN AFRICA,665,483.0,1.376812
5,EASTERN EUROPE,709,285.0,2.487719
6,MIDDLE AFRICA,231,202.0,1.143564
7,NORTHERN AFRICA,302,256.0,1.179688
8,NORTHERN EUROPE,191,108.0,1.768519
9,OCEANIA,72,45.0,1.6


In [68]:
regional_data_top_10 = regional_data.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
regional_data_top_10 = regional_data_top_10.head(10)

print("The 10 regions with the highest articles per capita (in descending order):", regional_data_top_10["region"].tolist())
regional_data_top_10

The 10 regions with the highest articles per capita (in descending order): ['SOUTHERN EUROPE', 'CARIBBEAN', 'WESTERN EUROPE', 'EASTERN EUROPE', 'WESTERN ASIA', 'NORTHERN EUROPE', 'SOUTHERN AFRICA', 'OCEANIA', 'EASTERN AFRICA', 'SOUTH AMERICA']


Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,797,152.0,5.243421
1,CARIBBEAN,219,44.0,4.977273
2,WESTERN EUROPE,498,199.0,2.502513
3,EASTERN EUROPE,709,285.0,2.487719
4,WESTERN ASIA,610,299.0,2.040134
5,NORTHERN EUROPE,191,108.0,1.768519
6,SOUTHERN AFRICA,123,70.0,1.757143
7,OCEANIA,72,45.0,1.6
8,EASTERN AFRICA,665,483.0,1.376812
9,SOUTH AMERICA,569,426.0,1.335681


In [70]:
# Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

high_quality_articles_regional = wp_politicians_by_country[wp_politicians_by_country["article_quality"].isin(["FA", "GA"])]

hq_articles_count_per_region = high_quality_articles_regional.groupby(['region']).size().reset_index(name='total_articles')
hq_articles_count_per_region


Unnamed: 0,region,total_articles
0,CARIBBEAN,9
1,CENTRAL AMERICA,10
2,CENTRAL ASIA,5
3,EAST ASIA,3
4,EASTERN AFRICA,17
5,EASTERN EUROPE,38
6,MIDDLE AFRICA,8
7,NORTHERN AFRICA,17
8,NORTHERN EUROPE,9
9,OCEANIA,1


In [71]:
hq_regional_data = pd.merge(hq_articles_count_per_region, population_df, left_on=["region"], right_on=["Geography"], how="left")
hq_regional_data = hq_regional_data[["region", "total_articles", "Population"]]
hq_regional_data["total_articles_per_capita"] = hq_regional_data["total_articles"] / hq_regional_data["Population"]
hq_regional_data

Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,CARIBBEAN,9,44.0,0.204545
1,CENTRAL AMERICA,10,182.0,0.054945
2,CENTRAL ASIA,5,80.0,0.0625
3,EAST ASIA,3,1648.0,0.00182
4,EASTERN AFRICA,17,483.0,0.035197
5,EASTERN EUROPE,38,285.0,0.133333
6,MIDDLE AFRICA,8,202.0,0.039604
7,NORTHERN AFRICA,17,256.0,0.066406
8,NORTHERN EUROPE,9,108.0,0.083333
9,OCEANIA,1,45.0,0.022222


In [72]:
hq_regional_data_top_10 = hq_regional_data.sort_values("total_articles_per_capita", ascending=False).reset_index(drop=True)
hq_regional_data_top_10 = hq_regional_data_top_10.head(10)

print("The 10 regions with the highest high quality articles per capita (in descending order):", hq_regional_data_top_10["region"].tolist())
hq_regional_data_top_10

The 10 regions with the highest high quality articles per capita (in descending order): ['SOUTHERN EUROPE', 'CARIBBEAN', 'EASTERN EUROPE', 'SOUTHERN AFRICA', 'WESTERN EUROPE', 'WESTERN ASIA', 'NORTHERN EUROPE', 'NORTHERN AFRICA', 'CENTRAL ASIA', 'CENTRAL AMERICA']


Unnamed: 0,region,total_articles,Population,total_articles_per_capita
0,SOUTHERN EUROPE,53,152.0,0.348684
1,CARIBBEAN,9,44.0,0.204545
2,EASTERN EUROPE,38,285.0,0.133333
3,SOUTHERN AFRICA,8,70.0,0.114286
4,WESTERN EUROPE,21,199.0,0.105528
5,WESTERN ASIA,27,299.0,0.090301
6,NORTHERN EUROPE,9,108.0,0.083333
7,NORTHERN AFRICA,17,256.0,0.066406
8,CENTRAL ASIA,5,80.0,0.0625
9,CENTRAL AMERICA,10,182.0,0.054945
