# Article Page Info MediaWiki API Example
This example illustrates how to access page info data using the [MediaWiki REST API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). This example shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [2]:
# These are standard python modules
import json, time, urllib.parse
#
# The 'requests' module is not a standard Python module. You will need to install this with pip/pip3 if you do not already have it
import requests
import pandas as pd

## Data Collection
Below is the section where the Wikimedia API is used to collect the page info data

In [None]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'netid@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

df = pd.read_csv("politicians_by_country_AUG.2024.csv")

# Extract values from the 'name' column and convert them into a list
ARTICLE_TITLES = df['name'].tolist()
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None,
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT,
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['titles'] = article_title

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Here, I am collection the API response for multiple politician's name in a single request.

In [None]:
all_responses = {}
batch_size = 50
for i in range(0, len(df), batch_size):
    names = '|'.join(df['name'][i:i+batch_size])
    info = request_pageinfo_per_article(names)
    if 'query' in info and 'pages' in info['query']:
        response = info['query']['pages']
        for page_id in response:
            name = response[page_id]['title']
            all_responses[name] = response[page_id]

The collected api response is stored in the all_pageinfo_responses.json file

In [None]:
output_file = 'all_pageinfo_responses.json'

# Save the all_responses dictionary to a JSON file
with open(output_file, 'w') as f:
    json.dump(all_responses, f, indent=4)

In [None]:
len(list(all_responses.keys()))

7111

In the section below, I will be using the page info response to get the politician names and in turn use another ORES API to get the liftwing article information.

# Requesting ORES scores through LiftWing ML Service API
Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely in the coming years that all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another on-going change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES that require running an ML model over the text of an article page is an example of a compute intensive API service.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).


## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023



In [None]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#
API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "netid@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : "",         # your email address should go here
    'access_token'  : ""          # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = { 'Bison':1085687913 , 'Northern flicker':1086582504 , 'Red squirrel':1083787665 , 'Chinook salmon':1085406228 , 'Horseshoe bat':1060601936 }

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # required that its english - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = ""
ACCESS_TOKEN = ""
#

## Define a function to make the ORES API request

The API request will be made using a function to encapsulate call and make access reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.

In [None]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


In [3]:
with open('all_pageinfo_responses.json', 'r') as file:
    data = json.load(file)

# Convert JSON data to a dataframe
rev_df = pd.DataFrame.from_dict(data, orient='index')

# Adding a "name" column based on the dictionary keys
rev_df['name'] = rev_df.index

# Resetting the index so the "name" column is a regular column in the dataframe
rev_df.reset_index(drop=True, inplace=True)

rev_df['pageid'] = rev_df['pageid'].fillna(0).astype(int)
rev_df['lastrevid'] = rev_df['lastrevid'].fillna(0).astype(int)
rev_df['talkid'] = rev_df['talkid'].fillna(0).astype(int)
rev_df['length'] = rev_df['length'].fillna(0).astype(int)

# Reordering columns to have 'name' as the first column
rev_df = rev_df[['name'] + [col for col in rev_df.columns if col != 'name']]

In [4]:
# Check for NaNs, empty strings, or spaces in a column (for example: 'lastrevid')
# 1. Check for NaN values
has_nans = rev_df['lastrevid'].isna()

# 2. Check for empty strings or strings with only spaces
has_empty_spaces = rev_df['lastrevid'].apply(lambda x: isinstance(x, str) and x.strip() == '')

# Combine both checks
invalid_entries = has_nans | has_empty_spaces

# Display rows that contain NaNs, spaces, or empty strings
print("Rows with NaNs, spaces, or empty strings in 'lastrevid':")
print(rev_df[invalid_entries])

Rows with NaNs, spaces, or empty strings in 'lastrevid':
Empty DataFrame
Columns: [name, pageid, ns, title, contentmodel, pagelanguage, pagelanguagehtmlcode, pagelanguagedir, touched, lastrevid, length, talkid, fullurl, editurl, canonicalurl, watchers, missing, redirect, new]
Index: []


In [None]:
# code to check if api response works
sample_row = rev_df.iloc[0]  # Get the first row of the DataFrame
name = sample_row['name']
lastrevid = (sample_row['lastrevid'])


# Example request data and header templates (fill in with your actual templates)
request_data = {"rev_id": None}  # Template for request data
header_format = {"Authorization": "Bearer {access_token}", "From": "{email_address}"}  # Example header format
header_params = {"email_address": "smohan5@uw.edu", "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiIxYzRkOWRkZDZjZjc4NjBhOTcxMTUwYTJiMmY2OGU0NiIsImp0aSI6ImNiYTExNjU1OTkyNDhhMzdlMzk1NDk0NGNlYzkxODM5YjUwZDU1YmMwYWU2MzA3ZDhjZjIxNzM5MjUyOTNmNmU0ZTJiZWFmNzFmN2I3YWQ3IiwiaWF0IjoxNzI4NjkzODU3LjMwNzM2NCwibmJmIjoxNzI4NjkzODU3LjMwNzM2NywiZXhwIjozMzI4NTYwMjY1Ny4zMDU0NjYsInN1YiI6Ijc2Njk2MzE2IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyJdfQ.uRRcC9Njy8DWcjVdSx1VoLqaZVUD2t1xUbNSr-8IT5SarTeLimSH_Q17JceJrp9HhRe-weXfHV6tW_pInV6jsmvYTNm2VgkCF6hqelRAufgyMwAnIRXjkXTda_s24awrn94_lPjN00B5g4mTd42VB_obysPR0RIMjC1apyIwQDIRbSw7uFEMoMPlx4VgSJH9cHd2HBn91arsyY5uWSpzvmkj0XIsSYR4fxoF4eFRV5mCWcfvV0F-mhC-yCy05uXu0ti9TjFKkuzsnNi81c27j9RvpKjBkbq_MSaHEDAoPSYUsktDRc0SOp6XycYpT0ZXIqGhCsonEFpKLXfDGkGbGtPr7bgOQA8_Q1zF_QhUiSrSm2Q38JtiPHwrv5UuJKLYqtfaKe58KXkrd2XkfE5GWe64jqbMA48lFZXH7c6EZEcd8JUoauxKS4o-yreSEqZUxofnfgXnm-gmhX5f3YJYgaKEhJnK-w0DzF5bofLTM3aH5-yuewQcK77HOZH4k8NZX4f1oWu1UW-r2PrMQdxEF8nDcsMtSl1eYEnhash23ddq1YDw6oOnDieaiGLf73uyX_gcYa9Bjs8kRttvYu-DSQwHIPCYKtFtZEsBuOdj6g6d9axw0fKKoEcweQtNABF4JkNMl6CzXkLm34o9QBUEaqzYBSUcTOVV7tobR3P1SGI"}  # Fill with your details
# Make the API call for the sample revision ID
response = request_ores_score_per_article(
    article_revid=int(lastrevid),  # Ensure it's an integer
    email_address=header_params['email_address'],
    access_token=header_params['access_token'],
    request_data=request_data,
    header_format=header_format,
    header_params=header_params
)

# Print the response for the sample row
print(f"Response for {name} (rev_id: {lastrevid}): {response}")

Response for Abdul Baqi Turkistani (rev_id: 1231655023): {'enwiki': {'models': {'articlequality': {'version': '0.9.2'}}, 'scores': {'1231655023': {'articlequality': {'score': {'prediction': 'Stub', 'probability': {'B': 0.007563164541978764, 'C': 0.010571998067933679, 'FA': 0.0014567448768872152, 'GA': 0.00350824677167893, 'Start': 0.04243220742433865, 'Stub': 0.9344676383171826}}}}}}}


In [None]:
import json

api_responses = {}
failed_revids = []  # List to store revids that encounter errors

# Iterate over the DataFrame and make API requests
for index, row in rev_df.iterrows():
    name = row['name']
    lastrevid = row['lastrevid']

    # Example request data and header templates (fill in with your actual templates)
    print('Fetching for: ', lastrevid)
    request_data = {"rev_id": lastrevid}
    header_format = {"Authorization": "Bearer {access_token}", "From": "{email_address}"}  # Example header format
    header_params = {
        "email_address": "smohan5@uw.edu",
        "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhdWQiOiIxYzRkOWRkZDZjZjc4NjBhOTcxMTUwYTJiMmY2OGU0NiIsImp0aSI6ImNiYTExNjU1OTkyNDhhMzdlMzk1NDk0NGNlYzkxODM5YjUwZDU1YmMwYWU2MzA3ZDhjZjIxNzM5MjUyOTNmNmU0ZTJiZWFmNzFmN2I3YWQ3IiwiaWF0IjoxNzI4NjkzODU3LjMwNzM2NCwibmJmIjoxNzI4NjkzODU3LjMwNzM2NywiZXhwIjozMzI4NTYwMjY1Ny4zMDU0NjYsInN1YiI6Ijc2Njk2MzE2IiwiaXNzIjoiaHR0cHM6Ly9tZXRhLndpa2ltZWRpYS5vcmciLCJyYXRlbGltaXQiOnsicmVxdWVzdHNfcGVyX3VuaXQiOjUwMDAsInVuaXQiOiJIT1VSIn0sInNjb3BlcyI6WyJiYXNpYyJdfQ.uRRcC9Njy8DWcjVdSx1VoLqaZVUD2t1xUbNSr-8IT5SarTeLimSH_Q17JceJrp9HhRe-weXfHV6tW_pInV6jsmvYTNm2VgkCF6hqelRAufgyMwAnIRXjkXTda_s24awrn94_lPjN00B5g4mTd42VB_obysPR0RIMjC1apyIwQDIRbSw7uFEMoMPlx4VgSJH9cHd2HBn91arsyY5uWSpzvmkj0XIsSYR4fxoF4eFRV5mCWcfvV0F-mhC-yCy05uXu0ti9TjFKkuzsnNi81c27j9RvpKjBkbq_MSaHEDAoPSYUsktDRc0SOp6XycYpT0ZXIqGhCsonEFpKLXfDGkGbGtPr7bgOQA8_Q1zF_QhUiSrSm2Q38JtiPHwrv5UuJKLYqtfaKe58KXkrd2XkfE5GWe64jqbMA48lFZXH7c6EZEcd8JUoauxKS4o-yreSEqZUxofnfgXnm-gmhX5f3YJYgaKEhJnK-w0DzF5bofLTM3aH5-yuewQcK77HOZH4k8NZX4f1oWu1UW-r2PrMQdxEF8nDcsMtSl1eYEnhash23ddq1YDw6oOnDieaiGLf73uyX_gcYa9Bjs8kRttvYu-DSQwHIPCYKtFtZEsBuOdj6g6d9axw0fKKoEcweQtNABF4JkNMl6CzXkLm34o9QBUEaqzYBSUcTOVV7tobR3P1SGI"  # Replace with your actual access token
    }

    try:
        # Make the API call
        response = request_ores_score_per_article(
            article_revid=lastrevid,
            email_address=header_params['email_address'],
            access_token=header_params['access_token'],
            request_data=request_data,
            header_format=header_format,
            header_params=header_params
        )

        # Store the response in the dictionary with lastrevid as the key
        if response:
            api_responses[lastrevid] = response
        else:
            api_responses[lastrevid] = None
            print(f"No response for {name} (rev_id: {lastrevid})")

    except Exception as e:
        # If there's an error, store the revid in the failed_revids list
        print(f"Error fetching data for {name} (rev_id: {lastrevid}): {e}")
        failed_revids.append(lastrevid)

    # Save the API responses periodically to avoid data loss in case of long runtime
    if index % 100 == 0:  # Adjust frequency as needed
        with open('ores_responses_partial.json', 'w') as json_file:
            json.dump(api_responses, json_file, indent=4)

# Save the final API responses as a JSON file
with open('ores_responses.json', 'w') as json_file:
    json.dump(api_responses, json_file, indent=4)

print(f"Failed revids: {failed_revids}")


Fetching for:  1231655023
Fetching for:  1227026187
Fetching for:  1226326055
Fetching for:  1221720658
Fetching for:  1185105938
Fetching for:  1247931713
Fetching for:  1195651393
Fetching for:  1247762293
Fetching for:  1176481824
Fetching for:  1248505877
Fetching for:  1193992206
Fetching for:  1158302291
Fetching for:  1234514379
Fetching for:  1212323536
Fetching for:  1158659195
Fetching for:  1240993642
Fetching for:  1238402857
Fetching for:  1230459615
Fetching for:  1207743719
Fetching for:  1244521219
Fetching for:  1227635806
Fetching for:  1234741562
Fetching for:  1243745950
Fetching for:  1233202991
Fetching for:  1246566971
Fetching for:  1235165845
Fetching for:  1246567093
Fetching for:  1227103354
Fetching for:  1237694188
Fetching for:  1136611354
Fetching for:  1247727443
Fetching for:  1176429234
Fetching for:  1134129082
Fetching for:  949986748
Fetching for:  1235521766
Fetching for:  1246566804
Fetching for:  988838315
Fetching for:  1225385278
Fetching for: 

KeyboardInterrupt: 

Now, the article information and scores data is stored in the ORES response json file. Below I will be reading that file and converting it to a dataframe to further use them in the analysis

In [9]:
# code to read json and convert to df
with open('ores_responses(1).json', 'r') as file:
    data = json.load(file)

rows = []
for rev_id, rev_data in data.items():
    try:
        # Ensure the data for enwiki, scores, articlequality, and prediction exist and are not None
        if (
            rev_data
            and 'enwiki' in rev_data
            and 'scores' in rev_data['enwiki']
            and rev_id in rev_data['enwiki']['scores']
            and 'articlequality' in rev_data['enwiki']['scores'][rev_id]
            and 'score' in rev_data['enwiki']['scores'][rev_id]['articlequality']
            and 'prediction' in rev_data['enwiki']['scores'][rev_id]['articlequality']['score']
        ):
            article_quality = rev_data['enwiki']['scores'][rev_id]['articlequality']['score']['prediction']
            # Append the revision ID and article quality to the list
            rows.append({'revision_id': rev_id, 'article_quality': article_quality})
        else:
            print(f"Data missing for revision ID {rev_id}")
    except KeyError:
        print(f"KeyError for revision ID {rev_id}: some data is missing.")


article_df = pd.DataFrame(rows)
print(article_df)


Data missing for revision ID 1245373664
     revision_id article_quality
0     1231655023            Stub
1     1227026187            Stub
2     1226326055           Start
3     1221720658           Start
4     1185105938            Stub
...          ...             ...
7097  1247902630               C
7098   959111842            Stub
7099  1203429435               C
7100  1246280093            Stub
7101  1228478288           Start

[7102 rows x 2 columns]


Data missing for revision ID 1245373664

Your notebook should compute and print the score error rate. The error rate is the ratio of the number of articles for which you were not able to get a score divided by the total number of articles. If your request error rate is higher than 1%, then you should review your code, determine what is going wrong, fix it, and rerun your score collection.

Next, I am checking if I got the scores for all the articles. Below I am calculating the request error rate.

In [15]:
# check if there's a result for all politican names.
# I am joining the page info dataframe with article info dataframe based on the revision id column
# Convert both 'lastrevid' and 'revision_id' columns to string type
rev_df['lastrevid'] = rev_df['lastrevid'].astype(str)
article_df['revision_id'] = article_df['revision_id'].astype(str)

# Use pd.merge() to join the two DataFrames on 'lastrevid' and 'revision_id'
merged_df = pd.merge(rev_df, article_df, left_on='lastrevid', right_on='revision_id', how='left')

# Replace missing values (NaN) in the merged DataFrame with empty strings
merged_df.fillna('', inplace=True)

missing_article_info = merged_df[merged_df['revision_id'] == '']

# Print the names of the politicians with missing article info
missing_politicians = missing_article_info['name'].tolist()

print("Politicians where article information is missing:")
for name in missing_politicians:
    print(name)

# Display the merged DataFrame
print(merged_df)



Politicians where article information is missing:
Naw Susanna Hla Hla Soe
Barbara Eibinger-Miedl
Mehrali Gasimov
Kyaw Myint
André Ngongang Ouandji
Tomás Pimentel
Richard Sumah
Segun ''Aeroland'' Adewale
Bashir Bililiqo
                            name    pageid  ns                       title  \
0          Abdul Baqi Turkistani  27428272   0       Abdul Baqi Turkistani   
1              Abdul Ghani Ghani  29443640   0           Abdul Ghani Ghani   
2             Abdul Rahim Ayoubi  44482763   0          Abdul Rahim Ayoubi   
3             Ahmad Wali Massoud  34682634   0          Ahmad Wali Massoud   
4                    Aimal Faizi  52438668   0                 Aimal Faizi   
...                          ...       ...  ..                         ...   
7106      André Ngongang Ouandji         0   0      André Ngongang Ouandji   
7107              Tomás Pimentel         0   0              Tomás Pimentel   
7108               Richard Sumah         0   0               Richard Sumah   
7

  merged_df.fillna('', inplace=True)


Politicians where article information is missing:
Naw Susanna Hla Hla Soe
Barbara Eibinger-Miedl
Mehrali Gasimov
Kyaw Myint
André Ngongang Ouandji
Tomás Pimentel
Richard Sumah
Segun ''Aeroland'' Adewale
Bashir Bililiqo

From this we can calculate the error rate.

In [19]:
Total_number_of_articles = 7102
number_of_missing = len(missing_politicians)
error_rate = number_of_missing/Total_number_of_articles
error_rate*100

0.12672486623486343

Error rate = 0.126%
Which is within the limit

Combining the datasets

In [26]:
# Now I need to merge the population by country csv with merged data

politicians_df = pd.read_csv('politicians_by_country_AUG.2024.csv')
population_df = pd.read_csv('population_by_country_AUG.2024.csv')

print(politicians_df.columns)
print(population_df.columns)

# merging these csvs
# Merging on 'country' from politicians_df and 'Geography' from population_df
#politicians_df = pd.merge(politicians_df, population_df, left_on='country', right_on='Geography', how='left')
#politicians_df.head()

# Merging the two DataFrames using an outer join
politicians_merged_df = pd.merge(
    politicians_df,
    population_df,
    left_on='country',
    right_on='Geography',
    how='outer'
)

# Optional: Fill NaN values with an empty string or any other value you prefer
politicians_merged_df.fillna('', inplace=True)

politicians_merged_df.head()




Index(['name', 'url', 'country'], dtype='object')
Index(['Geography', 'Population'], dtype='object')


  politicians_merged_df.fillna('', inplace=True)


Unnamed: 0,name,url,country,Geography,Population
0,,,,AFRICA,1453.0
1,,,,ASIA,4739.0
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Afghanistan,42.4
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Afghanistan,42.4
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Afghanistan,42.4


Combining all the dataframes together

In [29]:
# combining merged_df that has article data with country information based on the name column
# Merging the DataFrames using an outer join
combined_data = pd.merge(
    merged_df,
    politicians_merged_df,
    on='name',  # Use 'on' since both DataFrames have the same column name
    how='outer'  # Use 'outer' to retain all rows
)

# Display the first few rows of the combined DataFrame
combined_data.head()



Unnamed: 0,name,pageid,ns,title,contentmodel,pagelanguage,pagelanguagehtmlcode,pagelanguagedir,touched,lastrevid,...,watchers,missing,redirect,new,revision_id,article_quality,url,country,Geography,Population
0,,,,,,,,,,,...,,,,,,,,,AFRICA,1453.0
1,,,,,,,,,,,...,,,,,,,,,ASIA,4739.0
2,,,,,,,,,,,...,,,,,,,,,Andorra,0.1
3,,,,,,,,,,,...,,,,,,,,,Australia,26.6
4,,,,,,,,,,,...,,,,,,,,,Brunei,0.4


In [37]:
# Identify rows with no matches (NaN or empty string in 'name' column)
no_match_data = combined_data[combined_data['name'].isna() | (combined_data['name'] == '')]

# Extract the 'country' and 'Geography' columns
no_match_countries_geography = no_match_data[['country', 'Geography']]

# Write the output to a file
with open('wp_countries-no_match.txt', 'w') as file:
    for index, row in no_match_countries_geography.iterrows():
        country = row['country'] if pd.notna(row['country']) else ""
        geography = row['Geography'] if pd.notna(row['Geography']) else ""
        file.write(f"{country}, {geography}\n")

# Check if any countries with no matches were found
if no_match_data.empty:
    print("No countries with no matches found.")




In [40]:
# consolidating rest of the data
consolidated_data = combined_data[~(combined_data['name'].isna() | (combined_data['name'] == ''))]

# Select the specified columns for the final DataFrame
final_data = consolidated_data[['country', 'Geography', 'Population', 'title', 'revision_id', 'article_quality']].copy()

# Rename 'Geography' to 'region' and 'Population' to 'population' for clarity
final_data.rename(columns={
    'Geography': 'region',
    'Population': 'population',
    'title': 'article_title'
}, inplace=True)

# Save the final DataFrame to a CSV file
final_data.to_csv('wp_politicians_by_country.csv', index=False)

## Analysis

In [46]:
high_quality = final_data[final_data["article_quality"].isin(["FA", "GA"])]

# Step 2: Calculate total articles and high-quality articles per country
total_articles_per_country = final_data.groupby("country").size().reset_index(name="total_articles")
high_quality_articles_per_country = high_quality.groupby("country").size().reset_index(name="high_quality_articles")

# Step 3: Merge the article counts with population data
merged = pd.merge(total_articles_per_country, high_quality_articles_per_country, on="country", how="left").fillna(0)
merged = pd.merge(merged, final_data[["country", "population", "region"]].drop_duplicates(), on="country", how="left")

# Convert population from millions to actual number
merged["population_actual"] = merged["population"] * 1_000_000

# step 4
# Convert the relevant columns to numeric types
merged["total_articles"] = pd.to_numeric(merged["total_articles"], errors='coerce')
merged["high_quality_articles"] = pd.to_numeric(merged["high_quality_articles"], errors='coerce')
merged["population_actual"] = pd.to_numeric(merged["population_actual"], errors='coerce')

# Perform the division again
merged["articles_per_capita"] = merged["total_articles"] / merged["population_actual"]
merged["high_quality_articles_per_capita"] = merged["high_quality_articles"] / merged["population_actual"]


# Step 5: Calculate regional aggregates
regional_aggregates = merged.groupby("region").agg(
    total_articles=("total_articles", "sum"),
    total_population=("population_actual", "sum"),
    high_quality_articles=("high_quality_articles", "sum")
).reset_index()

regional_aggregates["articles_per_capita"] = regional_aggregates["total_articles"] / regional_aggregates["total_population"]
regional_aggregates["high_quality_articles_per_capita"] = (
    regional_aggregates["high_quality_articles"] / regional_aggregates["total_population"]
)

In [53]:
filtered_merged = merged[merged["population_actual"] > 0]

In [54]:
top_10_coverage = filtered_merged.sort_values("articles_per_capita", ascending=False).head(10)
top_10_coverage

Unnamed: 0,country,total_articles,high_quality_articles,population,region,population_actual,articles_per_capita,high_quality_articles_per_capita
4,Antigua and Barbuda,33,0.0,0.1,Antigua and Barbuda,100000.0,0.00033,0.0
51,Federated States of Micronesia,14,0.0,0.1,Federated States of Micronesia,100000.0,0.00014,0.0
96,Marshall Islands,13,0.0,0.1,Marshall Islands,100000.0,0.00013,0.0
152,Tonga,10,0.0,0.1,Tonga,100000.0,0.0001,0.0
12,Barbados,25,0.0,0.3,Barbados,300000.0,8.3e-05,0.0
128,Seychelles,6,0.0,0.1,Seychelles,100000.0,6e-05,0.0
101,Montenegro,36,3.0,0.6,Montenegro,600000.0,6e-05,5e-06
17,Bhutan,44,0.0,0.8,Bhutan,800000.0,5.5e-05,0.0
93,Maldives,33,1.0,0.6,Maldives,600000.0,5.5e-05,2e-06
124,Samoa,8,0.0,0.2,Samoa,200000.0,4e-05,0.0


In [56]:
# Bottom 10 countries by total articles per capita
bottom_10_coverage = filtered_merged.sort_values("articles_per_capita", ascending=True).head(10)
bottom_10_coverage


Unnamed: 0,country,total_articles,high_quality_articles,population,region,population_actual,articles_per_capita,high_quality_articles_per_capita
31,China,16,0.0,1411.3,China,1411300000.0,1.133707e-08,0.0
67,India,151,0.0,1428.6,India,1428600000.0,1.056979e-07,0.0
57,Ghana,4,1.0,34.1,Ghana,34100000.0,1.173021e-07,2.932551e-08
125,Saudi Arabia,5,2.0,36.9,Saudi Arabia,36900000.0,1.355014e-07,5.420054e-08
167,Zambia,3,0.0,20.2,Zambia,20200000.0,1.485149e-07,0.0
111,Norway,1,0.0,5.5,Norway,5500000.0,1.818182e-07,0.0
71,Israel,2,0.0,9.8,Israel,9800000.0,2.040816e-07,0.0
45,Egypt,32,1.0,105.2,Egypt,105200000.0,3.041825e-07,9.505703e-09
37,Cote d'Ivoire,10,0.0,30.9,Cote d'Ivoire,30900000.0,3.236246e-07,0.0
50,Ethiopia,44,2.0,126.5,Ethiopia,126500000.0,3.478261e-07,1.581028e-08


In [57]:
# Top 10 countries by high-quality articles per capita
top_10_high_quality = filtered_merged.sort_values("high_quality_articles_per_capita", ascending=False).head(10)
top_10_high_quality


Unnamed: 0,country,total_articles,high_quality_articles,population,region,population_actual,articles_per_capita,high_quality_articles_per_capita
101,Montenegro,36,3.0,0.6,Montenegro,600000.0,6e-05,5e-06
89,Luxembourg,27,2.0,0.7,Luxembourg,700000.0,3.9e-05,2.857143e-06
1,Albania,70,7.0,2.7,Albania,2700000.0,2.6e-05,2.592593e-06
79,Kosovo,26,4.0,1.7,Kosovo,1700000.0,1.5e-05,2.352941e-06
93,Maldives,33,1.0,0.6,Maldives,600000.0,5.5e-05,1.666667e-06
88,Lithuania,58,4.0,2.9,Lithuania,2900000.0,2e-05,1.37931e-06
38,Croatia,65,5.0,3.8,Croatia,3800000.0,1.7e-05,1.315789e-06
63,Guyana,17,1.0,0.8,Guyana,800000.0,2.1e-05,1.25e-06
114,Palestinian Territory,61,6.0,5.5,Palestinian Territory,5500000.0,1.1e-05,1.090909e-06
132,Slovenia,38,2.0,2.1,Slovenia,2100000.0,1.8e-05,9.52381e-07


In [58]:
# Bottom 10 countries by high-quality articles per capita
bottom_10_high_quality = filtered_merged.sort_values("high_quality_articles_per_capita", ascending=True).head(10)
bottom_10_high_quality


Unnamed: 0,country,total_articles,high_quality_articles,population,region,population_actual,articles_per_capita,high_quality_articles_per_capita
168,Zimbabwe,69,0.0,16.7,Zimbabwe,16700000.0,4.131737e-06,0.0
34,Congo,31,0.0,6.1,Congo,6100000.0,5.081967e-06,0.0
80,Kuwait,17,0.0,4.4,Kuwait,4400000.0,3.863636e-06,0.0
140,St. Lucia,3,0.0,0.2,St. Lucia,200000.0,1.5e-05,0.0
37,Cote d'Ivoire,10,0.0,30.9,Cote d'Ivoire,30900000.0,3.236246e-07,0.0
139,St. Kitts and Nevis,3,0.0,0.1,St. Kitts and Nevis,100000.0,3e-05,0.0
133,Solomon Islands,12,0.0,0.8,Solomon Islands,800000.0,1.5e-05,0.0
40,Cyprus,16,0.0,1.3,Cyprus,1300000.0,1.230769e-05,0.0
130,Singapore,4,0.0,5.8,Singapore,5800000.0,6.896552e-07,0.0
42,Djibouti,15,0.0,1.1,Djibouti,1100000.0,1.363636e-05,0.0


In [62]:
filtered_regional_aggregates = regional_aggregates.dropna(subset=["region"]).query("region != ''")

In [63]:
# Regions ranked by total articles per capita
regions_by_coverage = filtered_regional_aggregates.sort_values("articles_per_capita", ascending=False)
regions_by_coverage


Unnamed: 0,region,total_articles,total_population,high_quality_articles,articles_per_capita,high_quality_articles_per_capita
97,Monaco,10,0.000000e+00,0.0,inf,
155,Tuvalu,1,0.000000e+00,0.0,inf,
5,Antigua and Barbuda,33,1.000000e+05,0.0,3.300000e-04,0.000000e+00
52,Federated States of Micronesia,14,1.000000e+05,0.0,1.400000e-04,0.000000e+00
94,Marshall Islands,13,1.000000e+05,0.0,1.300000e-04,0.000000e+00
...,...,...,...,...,...,...
165,Zambia,3,2.020000e+07,0.0,1.485149e-07,0.000000e+00
123,Saudi Arabia,5,3.690000e+07,2.0,1.355014e-07,5.420054e-08
58,Ghana,4,3.410000e+07,1.0,1.173021e-07,2.932551e-08
67,India,151,1.428600e+09,0.0,1.056979e-07,0.000000e+00


In [64]:
# Regions ranked by high-quality articles per capita
regions_by_high_quality = filtered_regional_aggregates.sort_values("high_quality_articles_per_capita", ascending=False)
regions_by_high_quality

Unnamed: 0,region,total_articles,total_population,high_quality_articles,articles_per_capita,high_quality_articles_per_capita
99,Montenegro,36,600000.0,3.0,0.000060,0.000005
87,Luxembourg,27,700000.0,2.0,0.000039,0.000003
2,Albania,70,2700000.0,7.0,0.000026,0.000003
77,Kosovo,26,1700000.0,4.0,0.000015,0.000002
91,Maldives,33,600000.0,1.0,0.000055,0.000002
...,...,...,...,...,...,...
65,Honduras,17,9700000.0,0.0,0.000002,0.000000
64,Haiti,34,11600000.0,0.0,0.000003,0.000000
166,Zimbabwe,69,16700000.0,0.0,0.000004,0.000000
97,Monaco,10,0.0,0.0,inf,
