# Homework 2 - Considering Bias in Data
The goal of this assignment is to explore the concept of bias in data using Wikipedia articles that are about political figures from different countries. We will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

## License
This code was developed by Sarah Nguyen for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [MIT License](https://opensource.org/licenses/MIT).

In [1]:
import pandas as pd
import json, time, urllib.parse
import requests

# Data Acquisition
We need two datasets that live in different places: a list of Wikipedia articles about politicians and a list of country populations.

The `population_by_country_AUG.2024.csv` file contains rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows will not match the country values in `politicians_by_country_AUG.2024.csv`, but you will want to retain them so that you can report coverage and quality by region as specified in the analysis section below.
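
One way the regional rows could be separated from the country rows is sketched below. It is only an illustration: the column names (`geography`, `population`), the sample values, and the assumption that each regional row is listed directly above the countries it covers are all hypothetical and should be checked against the real CSV.

```python
import pandas as pd

# Hand-written sample rows. Column names and numbers are illustrative assumptions,
# not values taken from population_by_country_AUG.2024.csv.
sample_population = pd.DataFrame({
    "geography": ["AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt", "OCEANIA", "Fiji"],
    "population": [100.0, 60.0, 25.0, 35.0, 10.0, 10.0],
})

# Regional rows are the ones whose geography value is written entirely in capitals.
is_region = sample_population["geography"].str.isupper()

# Assumption: regions precede their countries, so the most recent region label
# can be carried down to the country rows that follow it.
sample_population["region"] = sample_population["geography"].where(is_region).ffill()

region_rows = sample_population[is_region]
country_rows = sample_population[~is_region]

print(country_rows[["geography", "region"]])
```

Keeping `region_rows` separately means the regional totals stay available for the per-region reporting, while `country_rows` is the part that can be matched against the politicians data.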

In [2]:
politicians_data = pd.read_csv('politicians_by_country_AUG.2024.csv')
population_data = pd.read_csv('population_by_country_AUG.2024.csv')

In [4]:
# Creates a list of unique politician names from the csv so we can loop through all politicians
if 'name' in politicians_data.columns:
    politician_list = politicians_data['name'].unique().tolist()

    print(politician_list)

['Majah Ha Adrif', 'Haroon al-Afghani', 'Tayyab Agha', 'Khadija Zahra Ahmadi', 'Aziza Ahmadyar', 'Muqadasa Ahmadzai', 'Mohammad Sarwar Ahmedzai', 'Amir Muhammad Akhundzada', 'Nasrullah Baryalai Arsalai', 'Abdul Rahim Ayoubi', 'Ismael Balkhi', 'Abdul Baqi Turkistani', 'Mohammad Ghous Bashiri', 'Jan Baz', 'Bashir Ahmad Bezan', 'Rafiullah Bidar', 'Mohammad Siddiq Chakari', 'Cheragh Ali Cheragh', 'Nasir Ahmad Durrani', 'Muhammad Hashim Esmatullahi', 'Ezatullah (Nangarhar)', 'Aimal Faizi', 'Gajinder Singh Safri', 'Sharif Ghalib', 'Hashmat Ghani Ahmadzai', 'Abdul Ghani Ghani', 'Ghulam Ghaus', 'Ghulam Muhammad Ghobar', 'Mohammad Gul (Helmand Council)', 'Sayed Yousuf Halim', 'Rangina Hamidi', 'Sayed Zafar Hashemi', 'Qutbuddin Hilal', 'Mahboba Hoqomal', 'Musa Hotak', 'Mirza Muhammad Ismail', 'Sayed Jalal', 'Said Tayeb Jawad', 'Sayed Jalal Karim', 'Hafizullah Shabaz Khail', 'Masoud Khalili', 'Mohammad Khan (athlete)', 'Samoud Khan', 'Baran Khan Kudezai', 'Azizullah Lodin', 'Razaq Mamoon', 'Fazel

### Article Page Info MediaWiki API Example

This example illustrates how to access page info data using the [MediaWiki Action API for the EN Wikipedia](https://www.mediawiki.org/wiki/API:Main_page). It shows how to request summary 'page info' for a single article page. The API documentation, [API:Info](https://www.mediawiki.org/wiki/API:Info), covers additional details that may be helpful when trying to use or understand this example.

### License
The next two code blocks are example code developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - September 16, 2024



In [5]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"
API_HEADER_AGENT = 'User-Agent'

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024'
}

# # This is just a list of English Wikipedia article titles that we can use for example requests
# ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# Modified from Dr. McDonald's example: use the politician names as the article titles
ARTICLE_TITLES = politician_list

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}


In [6]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    
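    # The article title can be passed as a parameter to the call or supplied in the request_template.
    # Copy the template first so the shared default dictionary is not changed between calls.
    request_template = dict(request_template)
    if article_title:
        request_template['titles'] = article_title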

    if not request_template['titles']:
        raise Exception("Must supply an article title to make a pageinfo request.")

    if API_HEADER_AGENT not in headers:
        raise Exception(f"The header data should include a '{API_HEADER_AGENT}' field that contains your UW email address.")

    if 'uwnetid@uw' in headers[API_HEADER_AGENT]:
        raise Exception(f"Use your UW email address in the '{API_HEADER_AGENT}' field.")

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

# Overall plan:
#   1. Pass each politician's name to the page info request.
#   2. From the API result, take the revision ID and pass it to the ORES API.
#   3. Take the predicted quality score from the ORES output.
#   4. Merge this information with the population dataset (CSV), using the column names given in the assignment.
#   5. Build the final dataset, then do the analysis.


## Step 2: Getting Article Quality Predictions
In order to evaluate the predicted quality of each article in the Wikipedia dataset, we'll use a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. 

The article quality estimates are, from best to worst:

- FA - Featured article
- GA - Good article (also known as A-Class)
- B - B-Class article
- C - C-Class article
- Start - Start-class article
- Stub - Stub-class article


To make a label prediction, ORES requires a specific revision ID of an article. Therefore, we first need to get this information from the MediaWiki API. The steps to get a page quality prediction from ORES for each politician's article page are the following:

a) read each line of `politicians_by_country_AUG.2024.csv`, 

b) make a page info request to get the current page revision, and 

c) make an ORES request using the page title and current revision id.  
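
The flow of steps (a) to (c) can be sketched without any network access by passing the two lookups in as functions. The function name `predict_quality_for_titles` and the stand-in data are hypothetical and exist only for this illustration; in the notebook the real lookups are the page info request and the ORES request.

```python
from typing import Callable, Dict, Iterable, Optional

def predict_quality_for_titles(
    titles: Iterable[str],
    lookup_revision: Callable[[str], Optional[int]],
    lookup_prediction: Callable[[int], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Map each title to a predicted quality label, or None if a lookup fails."""
    results: Dict[str, Optional[str]] = {}
    for title in titles:                      # step (a): one title per input row
        rev_id = lookup_revision(title)       # step (b): page info request
        if rev_id is None:
            results[title] = None             # no revision id, so nothing to score
            continue
        results[title] = lookup_prediction(rev_id)   # step (c): ORES request
    return results

# Stand-in lookups (hypothetical values) so the sketch runs offline.
fake_revisions = {"Example Politician A": 111, "Example Politician B": 222}
fake_predictions = {111: "Stub", 222: "GA"}

example = predict_quality_for_titles(
    ["Example Politician A", "Example Politician B", "Missing Page"],
    fake_revisions.get,
    fake_predictions.get,
)
print(example)
```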


### Part A & B: Make a page info request to get the current page revision for each line of `politicians_by_country_AUG.2024.csv`

This is done by functions `batch_request_pageinfo` and `filter_pageinfo`.
- `batch_request_pageinfo` calls the API to get the page information for multiple pages at the same time, by separating the page titles with the vertical bar "|" character. However, this approach has limits. You should probably check the API documentation if you want to do multiple pages in a single request - and limit the number of pages in one request reasonably. At the time of writing, the limit is 50 pages in a single request.

- `filter_pageinfo` filters the page information we just got into `title:lastrevid`

In [7]:
def batch_request_pageinfo(ARTICLE_TITLES, batch_size=50):
    """
    Retrieve page information for a list of article titles in batches.

    This function processes a list of article titles by dividing them into batches of a specified size.
    For each batch, it sends a request to the page information API to fetch relevant data.
    The responses from all batches are aggregated and returned as a comprehensive collection of page data.

    Args:
        ARTICLE_TITLES (list of str):
            A list of article titles for which to retrieve page information.
        batch_size (int, optional):
            The number of titles to include in each API request batch.
            Defaults to 50 according to API documentation.

    Returns:
        list:
            A list containing the page information data retrieved from the API for all provided article titles.
            Each element in the list corresponds to the 'pages' data from an individual API response.
    """
    all_page_data = []  # To store all API responses

    # Batching titles into multi-page requests means only one API call per batch, which saves time
    for i in range(0, len(ARTICLE_TITLES), batch_size):
        batch = ARTICLE_TITLES[i:i + batch_size]
        # Join the batch into a single string separated by '|', this creates a multipage request
        page_titles = '|'.join(batch)
        print(f"Getting page info data for: {page_titles}")

        request_info = PAGEINFO_PARAMS_TEMPLATE.copy()
        request_info['titles'] = page_titles
        info = request_pageinfo_per_article(request_template=request_info) # get API page info

        # Process the response
        if info and 'query' in info and 'pages' in info['query']:
            all_page_data.append(info['query']['pages'])  # Append the 'pages' part of the response
        else:
            print("No data returned for this batch.")

        # Optionally throttle the requests to avoid overloading the API
        time.sleep(API_THROTTLE_WAIT)

    return all_page_data  


In [8]:
def filter_pageinfo(all_page_data):
    """
    Extract and filter relevant information from page data.

    This function processes a list of page data and extracts the 'title' and 'lastrevid' fields for each page.
    It returns a dictionary where the keys are the article titles and the values are the corresponding 'lastrevid' values.

    Args:
        all_page_data (list):
            A list of page data, where each element is a dictionary containing page information
            (typically the response from a page information API request).

    Returns:
        dict:
            A dictionary where keys are article titles and values are the corresponding 'lastrevid' values.

    Note:
        Pages that lack a 'title' or 'lastrevid' field (for example, pages that do not exist)
        are skipped rather than raising an error.
    """

    # Iterate through the full page data and extract title and lastrevid
    filtered_data = {}
    for pages in all_page_data:
        for page_id, page_data in pages.items():
            title = page_data.get('title')
            lastrevid = page_data.get('lastrevid')
            if title and lastrevid:
                filtered_data[title] = lastrevid  # Stores in a dictionary as 'title':lastrevid, which is desired input format for ORES
    return filtered_data
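
The skipping behaviour can be illustrated on a small hand-written response. The sample below only resembles the `query.pages` structure of a page info response and is an assumption for illustration; in particular, the entry without a `lastrevid` stands in for a title that the API could not resolve to an existing page. The helper `extract_last_revisions` is a compact, hypothetical restatement of `filter_pageinfo` so that the block runs on its own.

```python
def extract_last_revisions(all_page_data):
    """Keep title -> lastrevid for every page that has both fields."""
    revisions = {}
    for pages in all_page_data:
        for page_data in pages.values():
            title = page_data.get("title")
            lastrevid = page_data.get("lastrevid")
            if title and lastrevid:
                revisions[title] = lastrevid
    return revisions

# Hand-written sample: one existing page and one page reported as missing.
sample_page_data = [{
    "12345": {"pageid": 12345, "title": "Example Politician A", "lastrevid": 1000000001},
    "-1": {"title": "Example Politician B", "missing": ""},
}]

requested_titles = ["Example Politician A", "Example Politician B"]
revisions = extract_last_revisions(sample_page_data)

# Titles that were requested but did not come back with a revision id.
skipped = [title for title in requested_titles if title not in revisions]

print(revisions)
print(skipped)
```

Comparing the requested titles against the keys of the result, as done here with `skipped`, is the same idea as the missing-politician check later in this notebook.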

In [9]:
# Call the function to collect all the page data
all_page_data = batch_request_pageinfo(ARTICLE_TITLES)

Getting page info data for: Majah Ha Adrif|Haroon al-Afghani|Tayyab Agha|Khadija Zahra Ahmadi|Aziza Ahmadyar|Muqadasa Ahmadzai|Mohammad Sarwar Ahmedzai|Amir Muhammad Akhundzada|Nasrullah Baryalai Arsalai|Abdul Rahim Ayoubi|Ismael Balkhi|Abdul Baqi Turkistani|Mohammad Ghous Bashiri|Jan Baz|Bashir Ahmad Bezan|Rafiullah Bidar|Mohammad Siddiq Chakari|Cheragh Ali Cheragh|Nasir Ahmad Durrani|Muhammad Hashim Esmatullahi|Ezatullah (Nangarhar)|Aimal Faizi|Gajinder Singh Safri|Sharif Ghalib|Hashmat Ghani Ahmadzai|Abdul Ghani Ghani|Ghulam Ghaus|Ghulam Muhammad Ghobar|Mohammad Gul (Helmand Council)|Sayed Yousuf Halim|Rangina Hamidi|Sayed Zafar Hashemi|Qutbuddin Hilal|Mahboba Hoqomal|Musa Hotak|Mirza Muhammad Ismail|Sayed Jalal|Said Tayeb Jawad|Sayed Jalal Karim|Hafizullah Shabaz Khail|Masoud Khalili|Mohammad Khan (athlete)|Samoud Khan|Baran Khan Kudezai|Azizullah Lodin|Razaq Mamoon|Fazel Ahmed Manawi|Moeen Marastial|Ahmad Wali Massoud|Mohammad Daud Miraki
Getting page info data for: Abdul Sattar M

### Prepping Data for ORES request
Desired output:
`{'title1':lastrevid,  'title2':lastrevid, ...}`

In [10]:
filtered_politician_page_info = filter_pageinfo(all_page_data)

print(filtered_politician_page_info)

{'Abdul Baqi Turkistani': 1231655023, 'Abdul Ghani Ghani': 1227026187, 'Abdul Rahim Ayoubi': 1226326055, 'Ahmad Wali Massoud': 1221720658, 'Aimal Faizi': 1185105938, 'Amir Muhammad Akhundzada': 1247931713, 'Aziza Ahmadyar': 1195651393, 'Azizullah Lodin': 1247762293, 'Baran Khan Kudezai': 1176481824, 'Bashir Ahmad Bezan': 1248505877, 'Cheragh Ali Cheragh': 1193992206, 'Ezatullah (Nangarhar)': 1158302291, 'Fazel Ahmed Manawi': 1234514379, 'Gajinder Singh Safri': 1212323536, 'Ghulam Ghaus': 1158659195, 'Ghulam Muhammad Ghobar': 1240993642, 'Hafizullah Shabaz Khail': 1238402857, 'Haroon al-Afghani': 1230459615, 'Hashmat Ghani Ahmadzai': 1207743719, 'Ismael Balkhi': 1244521219, 'Jan Baz': 1227635806, 'Khadija Zahra Ahmadi': 1234741562, 'Mahboba Hoqomal': 1243745950, 'Majah Ha Adrif': 1233202991, 'Masoud Khalili': 1246566971, 'Mirza Muhammad Ismail': 1235165845, 'Moeen Marastial': 1246567093, 'Mohammad Daud Miraki': 1227103354, 'Mohammad Ghous Bashiri': 1237694188, 'Mohammad Gul (Helmand Cou

This is a check to see if after the API call, we are missing any politicians.

In [118]:
missing_politicians = []

# Check to see if any politicians in politician_list are missing from filtered_politician_page_info
for politician in politician_list:
    if politician not in filtered_politician_page_info:
        missing_politicians.append(politician)

# Save the missing politicians to missing_politicians.json
with open('missing_politicians.json', 'w') as f:
    json.dump(missing_politicians, f, indent=4)

total_politicians = len(politician_list)
missing_count = len(missing_politicians)

print(f"\nTotal Politicians in List from Wiki Crawl: {total_politicians}")
print(f"Politicians Missing from filtered_politician_page_info: {missing_count}")
print(f"Politicians Found in filtered_politician_page_info: {total_politicians - missing_count}")

if missing_politicians:
    print("\nThe following politicians are missing from API Call:")
    for missing in missing_politicians:
        print(missing)
else:
    print("\nAll politicians in the list are found in filtered_politician_page_info.")


Total Politicians in List from Wiki Crawl: 7111
Politicians Missing from filtered_politician_page_info: 8
Politicians Found in filtered_politician_page_info: 7103

The following politicians are missing from API Call:
Barbara Eibinger-Miedl
Mehrali Gasimov
Kyaw Myint
André Ngongang Ouandji
Tomás Pimentel
Richard Sumah
Segun ''Aeroland'' Adewale
Bashir Bililiqo


### Part C: Requesting ORES scores through the LiftWing ML Service API
Wikimedia Foundation (WMF) is reworking access to their APIs. It is likely in the coming years that all API access will require some kind of authentication, either through a simple key/token or through some version of OAuth. For now this is still a work in progress. You can follow the progress from their [API portal](https://api.wikimedia.org/wiki/Main_Page). Another ongoing change is better control over API services in situations where those services require additional computational resources, beyond simply serving the text of a web page (i.e., the text of an article). Services like ORES that require running an ML model over the text of an article page are an example of a compute-intensive API service.

Wikimedia is implementing a new Machine Learning (ML) service infrastructure that they call [LiftWing](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing). Given that ORES already has several ML models that have been well used, ORES is the first set of APIs that are being moved to LiftWing.

This example illustrates how to generate article quality estimates for article revisions using the LiftWing version of [ORES](https://www.mediawiki.org/wiki/ORES). The [ORES API documentation](https://ores.wikimedia.org) can be accessed from the main ORES page. The [ORES LiftWing documentation](https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage) is very thin ... even thinner than the standard ORES documentation. Further, it is clear that some parameters have been renamed (e.g., "revid" in the old ORES API is now "rev_id" in the LiftWing ORES API).


## License
This code example was developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023. 

This code block includes modifications by Sarah Nguyen, specifically:
- Added the variables `user`, `personal_token`, and `personal_email`, which replace the part of the sample code that encoded private credentials. The real values were removed manually before sharing this notebook.
- Removed the `apikeys.zip` file so that the encoding functions are not needed.

# Get your access token

You will need a Wikimedia user account to get access to Lift Wing (the ML API service). You can either [create an account or login](https://api.wikimedia.org/w/index.php?title=Special:UserLogin). If you have a Wikipedia user account, you might already have a Wikimedia account. If you are not sure, try your Wikipedia username and password to check. If you do not have a Wikimedia account, you will need to create one that you can use to get an access token.

There is [a 'guide' that describes how to get authentication tokens](https://api.wikimedia.org/wiki/Authentication) - but not everything works the way it is described in that documentation. You should review that documentation and then read the rest of this comment.

The documentation talks about using a "dashboard" for managing authentication tokens. That's a rather generous description for what looks like a simple list of token things. You might have a hard time finding this "dashboard". First, on the left hand side of the page, you'll see a column of links. The bottom section is a set of links titled "Tools". In that section is a link that says [Special pages](https://api.wikimedia.org/wiki/Special:SpecialPages) which will take you to a list of ... well, special pages. At the very bottom of the "Special pages" page is a section titled "Other special pages" (scroll all the way to the bottom). The first link in that section is called [API keys](https://api.wikimedia.org/wiki/Special:AppManagement). When you get to the "API keys" page you can create a new key.

The authentication guide suggests that you should create a server-side app key. This does not seem to work correctly - as yet. It failed on multiple attempts when I attempted to create a server-side app key. BUT, there is an option to create a [Personal API token](https://api.wikimedia.org/wiki/Authentication) that should work for this course and the type of ORES page scoring that you will need to perform.

Note that when you create a Personal API token you are granted three items: a Client ID, a Client secret, and an Access token. You should save all three of these, because they are gone once you dismiss the box. If you lose any one of them, you can destroy or deactivate the Personal API token from the dashboard and then create a new one.

The value you need to work the code below is the Access token - a very long string.

In [263]:
user = "yourusername"
personal_token = "yourverylongpersonaltoken"
personal_email = "yourpersonalemail"
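
Since the access token is private, one option is to read these values from environment variables so that the real token never appears in the notebook. The sketch below is only a suggestion: the environment variable names (`WIKIMEDIA_USER`, `WIKIMEDIA_ACCESS_TOKEN`, `WIKIMEDIA_EMAIL`) and the helper `read_credentials` are hypothetical, and the placeholders from the cell above are used as fallbacks.

```python
import os

def read_credentials(env=None):
    """Return (user, token, email), preferring environment variables over the placeholders."""
    env = os.environ if env is None else env
    user = env.get("WIKIMEDIA_USER", "yourusername")
    token = env.get("WIKIMEDIA_ACCESS_TOKEN", "yourverylongpersonaltoken")
    email = env.get("WIKIMEDIA_EMAIL", "yourpersonalemail")
    return user, token, email

# In the notebook this would replace the hard-coded assignments above.
user, personal_token, personal_email = read_credentials()
```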

In [12]:
#########
#
#    CONSTANTS
#

#    The current LiftWing ORES API endpoint and prediction model
#

API_ORES_LIFTWING_ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/{model_name}:predict"
API_ORES_EN_QUALITY_MODEL = "enwiki-articlequality"

#
#    The throttling rate is a function of the Access token that you are granted when you request the token. The constants
#    come from dissecting the token and getting the rate limits from the granted token. An example of that is below.
#
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = ((60.0*60.0)/5000.0)-API_LATENCY_ASSUMED  # The key authorizes 5000 requests per hour

#    When making automated requests we should include something that is unique to the person making the request
#    This should include an email - your UW email would be good to put in there
#
#    Because all LiftWing API requests require some form of authentication, you need to provide your access token
#    as part of the header too
#
REQUEST_HEADER_TEMPLATE = {
    'User-Agent': "<{email_address}>, University of Washington, MSDS DATA 512 - AUTUMN 2024",
    'Content-Type': 'application/json',
    'Authorization': "Bearer {access_token}"
}
#
#    This is a template for the parameters that we need to supply in the headers of an API request
#
REQUEST_HEADER_PARAMS_TEMPLATE = {
    'email_address' : personal_email,         # your email address should go here
    'access_token'  : personal_token # the access token you create will need to go here
}

#
#    A dictionary of English Wikipedia article titles (keys) and sample revision IDs that can be used for this ORES scoring example
#
ARTICLE_REVISIONS = filtered_politician_page_info

#
#    This is a template of the data required as a payload when making a scoring request of the ORES model
#
ORES_REQUEST_DATA_TEMPLATE = {
    "lang":        "en",     # must be English ("en") - we're scoring English Wikipedia revisions
    "rev_id":      "",       # this request requires a revision id
    "features":    True
}

#
#    These are used later - defined here so they, at least, have empty values
#
USERNAME = user
ACCESS_TOKEN = personal_token
#
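
The throttle constants above also give a rough lower bound on how long a full scoring run takes. The arithmetic below is a back-of-the-envelope estimate that uses the 5000 requests per hour limit from the constants and the 7103 articles found earlier in this notebook; it counts only the waiting time and ignores how long each request actually takes to answer.

```python
REQUESTS_PER_HOUR = 5000          # rate limit granted by the access token
API_LATENCY_ASSUMED = 0.002       # assumed latency in seconds, as in the constants above
N_ARTICLES = 7103                 # articles with a revision id in this notebook

# Seconds to wait between requests so the hourly limit is not exceeded.
throttle_wait = (60.0 * 60.0) / REQUESTS_PER_HOUR - API_LATENCY_ASSUMED

# Minimum total time spent waiting, converted to hours.
min_runtime_hours = N_ARTICLES * throttle_wait / 3600.0

print(f"{throttle_wait:.3f} s per request")
print(f"at least {min_runtime_hours:.2f} hours of waiting for {N_ARTICLES} articles")
```

With about 0.718 seconds of waiting per request, the waiting alone comes to roughly 1.4 hours, so a multi-hour run is expected once the response time of each request is added.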

## Define a function to make the ORES API request

The API request is made by a function that encapsulates the call and makes it reusable in other notebooks. The procedure is parameterized, relying on the constants above for some important default parameters. The primary assumption is that this function will be used to request data for a set of article revisions. The main parameter is 'article_revid'. One should be able to simply pass in a new article revision id on each call and get back a Python dictionary as the result. A valid result will be a dictionary that contains the probabilities that the specific revision is one of six different article quality levels. Generally, the quality level with the highest probability score is considered the quality level for the article. This can be tricky when you have two (or more) highly probable quality levels.
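
The "two highly probable quality levels" situation can be made visible by looking at the margin between the best and second-best label. The sketch below assumes the probabilities are available as a plain dictionary of label to probability; the helper `top_two_labels` and the sample numbers are hypothetical and are not taken from a real ORES response.

```python
def top_two_labels(probabilities):
    """Return (best_label, runner_up_label, margin) from a label -> probability dict."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (second_label, second_p) = ranked[0], ranked[1]
    return best_label, second_label, best_p - second_p

# Illustrative probabilities for the six quality levels (made-up numbers).
sample_probabilities = {
    "FA": 0.02, "GA": 0.05, "B": 0.10, "C": 0.38, "Start": 0.40, "Stub": 0.05,
}

best, runner_up, margin = top_two_labels(sample_probabilities)
print(best, runner_up, round(margin, 2))
```

A small margin, as in this made-up example, signals that the predicted label should be read with some caution.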

This is a continuation of Dr. David W. McDonald's example code for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.0 - August 15, 2023. 

In [13]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_ores_score_per_article(article_revid = None, email_address=None, access_token=None,
                                   endpoint_url = API_ORES_LIFTWING_ENDPOINT,
                                   model_name = API_ORES_EN_QUALITY_MODEL,
                                   request_data = ORES_REQUEST_DATA_TEMPLATE,
                                   header_format = REQUEST_HEADER_TEMPLATE,
                                   header_params = REQUEST_HEADER_PARAMS_TEMPLATE):

    #    Make sure we have an article revision id, email and token
    #    This approach prioritizes the parameters passed in when making the call
    if article_revid:
        request_data['rev_id'] = article_revid
    if email_address:
        header_params['email_address'] = email_address
    if access_token:
        header_params['access_token'] = access_token

    #   Making a request requires a revision id - an email address - and the access token
    if not request_data['rev_id']:
        raise Exception("Must provide an article revision id (rev_id) to score articles")
    if not header_params['email_address']:
        raise Exception("Must provide an 'email_address' value")
    if not header_params['access_token']:
        raise Exception("Must provide an 'access_token' value")

    # Create the request URL with the specified model parameter - default is a article quality score request
    request_url = endpoint_url.format(model_name=model_name)

    # Create a compliant request header from the template and the supplied parameters
    headers = dict()
    for key in header_format.keys():
        headers[str(key)] = header_format[key].format(**header_params)

    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free data
        # source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        #response = requests.get(request_url, headers=headers)
        response = requests.post(request_url, headers=headers, data=json.dumps(request_data))
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


### Make an ORES request using the page title and current revision id

- `score_all_articles` is a function to perform ORES scoring for all articles.

The ORES scores are logged and saved in file `ores_scores.json`. Articles which we could not get a score for were logged in `failed_articles.json`

In [131]:
def score_all_articles(article_revisions, email_address, access_token):
    """
    Retrieve ORES scores for a collection of articles based on their revision IDs.

    This function iterates over a dictionary of article titles and their corresponding revision IDs,
    making ORES API calls to retrieve scoring information for each article. It stores the scores
    in a dictionary and tracks articles that fail to get a score.

    Args:
        article_revisions (dict):
            A dictionary where the keys are article titles and the values are revision IDs.
        email_address (str):
            The user's email address required for authenticating the ORES API requests.
        access_token (str):
            The access token for authenticating ORES API requests.

    Returns:
        tuple:
            - all_scores (dict): A dictionary where keys are article titles and values are the retrieved ORES scores.
            - failed_articles (list): A list of article titles for which ORES scoring failed.

    Note:
        Exceptions raised while scoring an article are caught; the article title is
        added to failed_articles and processing continues.
    """
    all_scores = {}  # We are storing the scores for each article in a dict
    failed_articles = []  # If we cannot get a score, we will store here

    # for all articles and revids, we want to get the ores score by passing in the revid, email address, and access token
    for article_title, revision_id in article_revisions.items():
        print(f"Getting LiftWing ORES scores for '{article_title}' with revid: {revision_id}")
        
        try:
            # Make the ORES API call
            score = request_ores_score_per_article(
                article_revid=revision_id,
                email_address=email_address,
                access_token=access_token
            )
            
            # Check if the score is valid, else mark as failed
            if score is None:
                print(f"Failed to retrieve score for '{article_title}'")
                failed_articles.append(article_title)
            else:
                all_scores[article_title] = score
                print(json.dumps(score, indent=4))
                
        except Exception as e:
            print(f"Error retrieving score for '{article_title}': {e}")
            failed_articles.append(article_title)
        
    return all_scores, failed_articles

In [None]:
# Does ORES scoring for all articles and collects ones that didn't go through
all_article_scores, failed_articles = score_all_articles(
    article_revisions=ARTICLE_REVISIONS,
    email_address=personal_email,
    access_token=ACCESS_TOKEN
)

### Please note
The next cell shows a `NameError` for `all_article_scores`. This is expected: I ran `score_all_articles` in a tmux session because it took about 3 hours, and I did not rerun it in this notebook, so the variable was never defined here. The resulting output file, `ores_scores.json`, is in my repository.

In [119]:
# saves ORES scores here, ran score_all_articles in a tmux session originally and didn't want to rerun as it took a long time
# so there is a NameError here as I did not run the cell above again in the notebook
with open('ores_scores.json', 'w') as f:
    json.dump(all_article_scores, f, indent=4)

NameError: name 'all_article_scores' is not defined
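
One thing to be aware of with the cell above: `open(..., 'w')` truncates the file as soon as it is opened, which happens before the `NameError` is raised, so running that cell without the variable defined can leave `ores_scores.json` empty. A guarded save is sketched below. The helper `save_json_if_defined` is hypothetical, and the demonstration uses a temporary directory and stand-in data rather than the real scores.

```python
import json
import os
import tempfile

def save_json_if_defined(namespace, variable_name, path):
    """Write namespace[variable_name] to path as JSON; otherwise leave any existing file untouched."""
    if variable_name not in namespace:
        print(f"'{variable_name}' is not defined; keeping existing file '{path}'.")
        return False
    with open(path, "w") as f:
        json.dump(namespace[variable_name], f, indent=4)
    return True

# Demonstration in a temporary directory with stand-in data.
with tempfile.TemporaryDirectory() as tmp_dir:
    scores_path = os.path.join(tmp_dir, "ores_scores.json")
    with open(scores_path, "w") as f:
        json.dump({"Example Politician A": "placeholder"}, f)

    # An empty namespace plays the role of a notebook where the scoring cell was not rerun.
    was_saved = save_json_if_defined({}, "all_article_scores", scores_path)

    with open(scores_path) as f:
        preserved = json.load(f)

print(was_saved, preserved)
```

In a notebook, the call would look like `save_json_if_defined(globals(), 'all_article_scores', 'ores_scores.json')`.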

In [17]:
# stores articles that did not get score 
with open('failed_articles.json', 'w') as f:
    json.dump(failed_articles, f, indent=4)

Because my list of failed articles was empty, I did an extra validation that compares the politicians in the original input `ARTICLE_REVISIONS` to the politicians in the final ORES output `ores_scores.json`, to ensure I got a score for each article. I did this as a separate step because the scoring loop took 3 hours to run, so rerunning it just to check would have taken too long.

In [121]:
with open('ores_scores.json', 'r') as f:
    ores_scores = json.load(f)

failed_articles = []

ARTICLE_REVISIONS = filtered_politician_page_info 

# Comparing the keys in ARTICLE_REVISIONS to those in ores_scores
for article_title in ARTICLE_REVISIONS:
    if article_title not in ores_scores:
        failed_articles.append(article_title)

# Now saving the failed articles to failed_articles.json
with open('failed_articles.json', 'w') as f:
    json.dump(failed_articles, f, indent=4)

# Calculate the error rate
total_articles = len(ARTICLE_REVISIONS)
failed_count = len(failed_articles)
error_rate = (failed_count / total_articles) * 100
print(f"\nTotal Articles: {total_articles}")
print(f"Failed Articles: {failed_count}")
print(f"Error Rate: {error_rate:.2f}%")

# Per the assignment's request, checking our error rate
if error_rate > 1:
    print("Error rate exceeds 1%. Please review the code and investigate the issue.")



Total Articles: 7103
Failed Articles: 0
Error Rate: 0.00%


# Step 3: Combining the Datasets
After retrieving and including the ORES data for each article, you'll need to merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.

## Output Files:

Identify all countries for which there are no matches and output a list of those countries, with each country on a separate line, to a file called:
`wp_countries-no_match.txt`
Consolidate the remaining data into a single CSV file called:
`wp_politicians_by_country.csv`
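
As a rough illustration of how unmatched countries can be surfaced, here is a minimal sketch that uses the `indicator=True` option of `pd.merge`. The two small dataframes and all of their values are made up for the example; only the column names `country`, `Geography`, and `Population` mirror the real files. This is one possible approach, not necessarily the one used in the cell below.

```python
import pandas as pd

# Hypothetical toy data standing in for the politicians and population CSVs
politicians = pd.DataFrame({
    "name": ["Politician A", "Politician B", "Politician C"],
    "country": ["Albania", "Korea, South", "Albania"],
})
population = pd.DataFrame({
    "Geography": ["Albania", "Korea (South)", "Monaco"],
    "Population": [2.7, 51.0, 0.0],  # illustrative values, in millions
})

# An outer merge keeps rows from both sides; `_merge` records where each row came from
merged = pd.merge(
    politicians, population,
    how="outer", left_on="country", right_on="Geography",
    indicator=True,
)

# Countries that appear on only one side of the merge
left_only = merged.loc[merged["_merge"] == "left_only", "country"]
right_only = merged.loc[merged["_merge"] == "right_only", "Geography"]
no_match = sorted(set(left_only) | set(right_only))
print(no_match)
```

In this toy case the two spellings of South Korea both land in `no_match`, which is the kind of mismatch that a manual name mapping can then resolve.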


GPT-4 was used to help generate the code block and comments below. I gave GPT my own code for merging, renaming, reordering, and checking for missing countries in the dataset merged with `population_by_country_AUG.2024.csv`, and asked it to help extract the revision ID and article quality score from `ores_scores.json` into a new intermediate dataframe called `article_quality_df`.

The first step combines `politicians_by_country_AUG.2024.csv` with `ores_scores.json` to attach an article quality score to each politician. I then merged `article_quality_df` into the politicians data and used the result to generate the final output files.



In [287]:
with open('ores_scores.json', 'r') as f:
    ores_scores = json.load(f)

# Extract article quality predictions and revision IDs from the ORES scores
article_quality_scores = {}
revision_ids = {}

for politician, data in ores_scores.items():
    try:
        # Get the most recent revision's articlequality prediction
        revision_id = list(data['enwiki']['scores'].keys())[0]  # Extract lastrevid
        prediction = data['enwiki']['scores'][revision_id]['articlequality']['score']['prediction']

        # Store the article quality and revision_id
        article_quality_scores[politician] = prediction
        revision_ids[politician] = revision_id
    except (KeyError, IndexError):
        # Missing keys, or an empty 'scores' dict: record the article as unscored
        article_quality_scores[politician] = 'Unknown'
        revision_ids[politician] = None

# Create a DataFrame from the article quality scores and revision IDs
article_quality_df = pd.DataFrame({
    'name': list(article_quality_scores.keys()),
    'article_quality': list(article_quality_scores.values()),
    'revision_id': list(revision_ids.values())
})

# Merge the politicians_data with the article_quality_df on the 'name' column
wp_politicians_by_country = pd.merge(politicians_data, article_quality_df, how='left', on='name')

# Standardize country names so they match between the two datasets
wp_politicians_by_country['country'] = wp_politicians_by_country['country'].replace({
    'Guinea-Bissau': 'GuineaBissau',  # Match 'GuineaBissau' in population data
    'Korea, South': 'Korea (South)',  # Match 'Korea (South)'
    'Korea, North': 'Korea (North)'   # Match 'Korea (North)'
})

# Forward-fill the region for all countries in the population dataset
population_data['is_region'] = population_data['Geography'].str.isupper()
population_data['Region'] = population_data['Geography'].where(population_data['is_region']).ffill()

# Filter out the regions for country-level merging
population_countries = population_data[~population_data['is_region']].copy()

# Merge the politicians data (with ORES scores) and population data
merged_data = pd.merge(wp_politicians_by_country, population_countries, how='outer', left_on='country', right_on='Geography')

# Drop redundant columns and clean up
merged_data = merged_data.drop(columns=['Geography', 'is_region'])

# Reorder and rename columns to match the desired output format
merged_data = merged_data[['country', 'Region', 'Population', 'name', 'revision_id', 'article_quality']]
merged_data.rename(columns={'Region': 'region', 'Population': 'population', 'name': 'article_title'}, inplace=True)

# Save the merged data to a CSV file
merged_data.to_csv('wp_politicians_by_country.csv', index=False)

# Step 1: Extract unique countries from both datasets
# Extract country names from politicians_data
politician_countries = set(wp_politicians_by_country['country'].unique())

# Extract country names from population_data, excluding regions (ALL CAPS entries).
# A separate name is used so the population_countries DataFrame is not overwritten.
population_country_names = set(population_countries['Geography'].unique())

# Step 2: Identify countries in politicians_data but missing in the population_data
missing_in_population_data = politician_countries - population_country_names

# Step 3: Identify countries in population_data but missing in the politicians_data
missing_in_politicians_data = population_country_names - politician_countries

# Combine both missing sets of countries into one sorted list
all_missing_countries = sorted(missing_in_population_data | missing_in_politicians_data)

# Step 4: Save all unmatched countries to a text file
with open('wp_countries-no_match.txt', 'w') as f:
    for country in all_missing_countries:
        f.write(f"{country}\n")

# Print the summary of unmatched countries
print(f"Total number of missing countries: {len(all_missing_countries)}")

# Output the first few rows of the merged data for verification
merged_data.head()


Total number of missing countries: 42


Unnamed: 0,country,region,population,article_title,revision_id,article_quality
0,Afghanistan,SOUTH ASIA,42.4,Majah Ha Adrif,1233202991,Start
1,Afghanistan,SOUTH ASIA,42.4,Haroon al-Afghani,1230459615,B
2,Afghanistan,SOUTH ASIA,42.4,Tayyab Agha,1225661708,Start
3,Afghanistan,SOUTH ASIA,42.4,Khadija Zahra Ahmadi,1234741562,Stub
4,Afghanistan,SOUTH ASIA,42.4,Aziza Ahmadyar,1195651393,Start
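
The `region` column above comes from the forward-fill step in the cell: ALL CAPS rows are treated as region headers, and each country inherits the most recent header above it. Below is a minimal sketch of that idea on made-up rows. It assumes, as the real file appears to, that countries are listed directly under their region; the population values are illustrative only.

```python
import pandas as pd

# Hypothetical excerpt of the population file: region rows are ALL CAPS
population = pd.DataFrame({
    "Geography": ["AFRICA", "NORTHERN AFRICA", "Egypt", "OCEANIA", "Samoa"],
    "Population": [1400.0, 250.0, 105.0, 45.0, 0.2],  # illustrative, in millions
})

# True for region rows, False for country rows
is_region = population["Geography"].str.isupper()

# Keep the name only on region rows, then carry it down to the rows below
population["Region"] = population["Geography"].where(is_region).ffill()

# Country rows now carry the nearest region header above them
countries = population[~is_region]
print(countries[["Geography", "Region"]])
```

Because the nearest header wins, a country listed under a sub-region such as NORTHERN AFRICA is assigned that sub-region rather than the broader AFRICA row.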


# Step 4: Analysis
The analysis will consist of calculating `total-articles-per-capita` (a ratio representing the number of articles per person) and `high-quality-articles-per-capita` (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis.

To find `total-articles-per-capita`, we first grouped our merged dataset by country and region, counting the total number of Wikipedia articles per country. The dataset provided article information, and by grouping it, we could calculate the total number of articles available for each country and region.

To find `high-quality-articles-per-capita`, we filtered the dataset to include only articles that were predicted by ORES to be in the "FA" (Featured Article) or "GA" (Good Article) classes. These articles are considered "high quality" based on our assignment's criteria. We grouped this filtered dataset by country and region, counting how many high-quality articles each country had.

Next, we merged the article counts (both total and high-quality) with population data, using the country as the common key. Population data was provided in millions, so we converted it to individuals so that the ratios are expressed per person. The `total-articles-per-capita` was calculated by dividing the total number of articles by the population of each country. Similarly, the `high-quality-articles-per-capita` was calculated by dividing the number of high-quality articles by the population.

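To handle edge cases where the population is missing or rounds to zero, we filled missing populations with 0 and replaced the infinite values produced by dividing by zero with 0. As a consequence, countries whose population is listed as 0.0 million (Monaco and Tuvalu in this dataset) show a per-capita value of 0 even though they have articles, so they appear at the bottom of the coverage ranking.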

Finally, we sorted the data by region and country to improve readability and saved the results into a CSV file for further analysis.
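
To make the per-capita arithmetic concrete, here is a small worked sketch. The article counts and populations for Cuba, the Dominican Republic, and Monaco are taken from the output tables in this notebook; the column names are my own. It shows the zero-fill approach described above next to an alternative, which is not what the notebook does, that marks a zero population as missing instead.

```python
import numpy as np
import pandas as pd

summary = pd.DataFrame({
    "country": ["Cuba", "Dominican Republic", "Monaco"],
    "total_articles": [50, 38, 10],
    "population_millions": [11.0, 11.3, 0.0],
})

# Convert millions to individuals so the ratio is articles per person
people = summary["population_millions"] * 1_000_000
ratio = summary["total_articles"] / people  # dividing by 0.0 gives inf in pandas

# Approach described above: infinite ratios become 0
summary["per_capita_zero_filled"] = ratio.replace([np.inf, -np.inf], 0)

# Alternative (an assumption for illustration): treat them as missing instead
summary["per_capita_nan"] = ratio.replace([np.inf, -np.inf], np.nan)

print(summary)
```

With the NaN variant, `sort_values` places the missing rows last by default, so a country with an unknown population does not show up as the lowest-coverage country.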

### Please note: I converted the population from millions to individuals so the ratios are expressed per person

In [280]:
# Per the assignment, high-quality articles are those ORES predicts as FA or GA
high_quality_articles = merged_data[merged_data['article_quality'].isin(['FA', 'GA'])]

# First we will want to group by country and region to count total articles and high-quality articles
article_counts = merged_data.groupby(['country', 'region']).size().reset_index(name='total_articles')
high_quality_counts = high_quality_articles.groupby(['country', 'region']).size().reset_index(name='high_quality_articles')

article_summary = pd.merge(article_counts, high_quality_counts, how='left', on=['country', 'region'])
article_summary = pd.merge(article_summary, population_data, how='left', left_on='country', right_on='Geography')
article_summary = article_summary.drop(columns=['Geography'])

# Replace NaNs with 0: countries without high-quality articles get a count of 0, and unmatched countries get a population of 0
article_summary['high_quality_articles'] = article_summary['high_quality_articles'].fillna(0)
article_summary['Population'] = article_summary['Population'].fillna(0)

# Then convert the population from millions to individuals so the ratios are per person
article_summary['Population'] = article_summary['Population'] * 1_000_000

# Calculate total and high quality articles per capita
article_summary['articles_per_capita'] = article_summary['total_articles'] / article_summary['Population']
article_summary['high_quality_articles_per_capita'] = article_summary['high_quality_articles'] / article_summary['Population']

# The population file reports values in millions, so very small countries can appear with a population of 0.
# Replace the infinite ratios produced by dividing by zero with 0.
article_summary['articles_per_capita'] = article_summary['articles_per_capita'].replace([float('inf'), -float('inf')], 0)
article_summary['high_quality_articles_per_capita'] = article_summary['high_quality_articles_per_capita'].replace([float('inf'), -float('inf')], 0)

article_summary = article_summary.sort_values(by=['region', 'country'])
article_summary.to_csv('article_quality_per_capita.csv', index=False)
article_summary.head()

Unnamed: 0,country,region,total_articles,high_quality_articles,Population,is_region,Region,articles_per_capita,high_quality_articles_per_capita
4,Antigua and Barbuda,CARIBBEAN,33,0.0,100000.0,False,CARIBBEAN,0.00033,0.0
9,Bahamas,CARIBBEAN,9,0.0,400000.0,False,CARIBBEAN,2.3e-05,0.0
12,Barbados,CARIBBEAN,25,0.0,300000.0,False,CARIBBEAN,8.3e-05,0.0
39,Cuba,CARIBBEAN,50,6.0,11000000.0,False,CARIBBEAN,5e-06,5.454545e-07
43,Dominican Republic,CARIBBEAN,38,2.0,11300000.0,False,CARIBBEAN,3e-06,1.769912e-07


In [288]:
top_10_coverage = article_summary.sort_values(by='articles_per_capita', ascending=False).head(10)
print("\nTop 10 Countries by Coverage (Articles per Capita, Descending Order):")
display(top_10_coverage[['country', 'total_articles', 'articles_per_capita']])


Top 10 Countries by Coverage (Articles per Capita, Descending Order):


Unnamed: 0,country,total_articles,articles_per_capita
4,Antigua and Barbuda,33,0.00033
51,Federated States of Micronesia,14,0.00014
95,Marshall Islands,13,0.00013
151,Tonga,10,0.0001
12,Barbados,25,8.3e-05
127,Seychelles,6,6e-05
100,Montenegro,36,6e-05
92,Maldives,33,5.5e-05
17,Bhutan,44,5.5e-05
123,Samoa,8,4e-05


In [289]:
bottom_10_coverage = article_summary.sort_values(by='articles_per_capita', ascending=True).head(10)

print("\nBottom 10 Countries by Coverage (Articles per Capita, Ascending Order):")
display(bottom_10_coverage[['country', 'total_articles', 'articles_per_capita']])


Bottom 10 Countries by Coverage (Articles per Capita, Ascending Order):


Unnamed: 0,country,total_articles,articles_per_capita
98,Monaco,10,0.0
156,Tuvalu,1,0.0
31,China,16,1.133707e-08
67,India,151,1.056979e-07
57,Ghana,4,1.173021e-07
124,Saudi Arabia,5,1.355014e-07
166,Zambia,3,1.485149e-07
110,Norway,1,1.818182e-07
71,Israel,2,2.040816e-07
45,Egypt,32,3.041825e-07


In [290]:
top_10_high_quality = article_summary.sort_values(by='high_quality_articles_per_capita', ascending=False).head(10)

print("\nTop 10 Countries by High Quality Coverage (Articles per Capita, Descending Order):")
display(top_10_high_quality[['country', 'high_quality_articles', 'high_quality_articles_per_capita']])


Top 10 Countries by High Quality Coverage (Articles per Capita, Descending Order):


Unnamed: 0,country,high_quality_articles,high_quality_articles_per_capita
100,Montenegro,3.0,5e-06
88,Luxembourg,2.0,2.857143e-06
1,Albania,7.0,2.592593e-06
78,Kosovo,4.0,2.352941e-06
92,Maldives,1.0,1.666667e-06
87,Lithuania,4.0,1.37931e-06
38,Croatia,5.0,1.315789e-06
63,Guyana,1.0,1.25e-06
113,Palestinian Territory,6.0,1.090909e-06
131,Slovenia,2.0,9.52381e-07


In [291]:
bottom_10_high_quality = article_summary.sort_values(by='high_quality_articles_per_capita', ascending=True).head(10)

print("\nBottom 10 Countries by High Quality Coverage (Articles per Capita, Ascending Order):")
display(bottom_10_high_quality[['country', 'high_quality_articles', 'high_quality_articles_per_capita']])



Bottom 10 Countries by High Quality Coverage (Articles per Capita, Ascending Order):


Unnamed: 0,country,high_quality_articles,high_quality_articles_per_capita
4,Antigua and Barbuda,0.0,0.0
37,Cote d'Ivoire,0.0,0.0
110,Norway,0.0,0.0
55,Gambia,0.0,0.0
49,Estonia,0.0,0.0
62,GuineaBissau,0.0,0.0
85,Liberia,0.0,0.0
47,Equatorial Guinea,0.0,0.0
51,Federated States of Micronesia,0.0,0.0
34,Congo,0.0,0.0


In [247]:
# There are two ways to do this: use the regional population rows provided in the population file,
# or sum the populations of the countries in each region. I chose the latter because it builds on the
# more granular per-country data. Note that the sum only covers countries present in article_summary,
# so you may get different answers if you use the regional populations provided in the table.

region_summary = article_summary.groupby('region').agg(
    # First we want the total num of articles and high quality articles
    total_articles=('total_articles', 'sum'),  
    high_quality_articles=('high_quality_articles', 'sum'),  
    # Then get the population per region
    total_population=('Population', 'sum') 
).reset_index()

# Now, we calculate articles per capita and high-quality articles per capita for each region
region_summary['articles_per_capita'] = region_summary['total_articles'] / region_summary['total_population']
region_summary['high_quality_articles_per_capita'] = region_summary['high_quality_articles'] / region_summary['total_population']
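
The two options mentioned in the comments of the cell above can give different regional ratios. Here is a minimal sketch of why. The five Caribbean rows use the article counts and populations shown in the `article_summary` output earlier; the published regional population is a hypothetical number chosen only to illustrate the gap.

```python
import pandas as pd

# Country rows taken from the article_summary output (population in individuals)
countries = pd.DataFrame({
    "country": ["Antigua and Barbuda", "Bahamas", "Barbados", "Cuba", "Dominican Republic"],
    "region": ["CARIBBEAN"] * 5,
    "total_articles": [33, 9, 25, 50, 38],
    "population": [100_000, 400_000, 300_000, 11_000_000, 11_300_000],
})

# Option used in this notebook: sum the countries that are present
by_sum = countries.groupby("region").agg(
    total_articles=("total_articles", "sum"),
    total_population=("population", "sum"),
).reset_index()
by_sum["articles_per_capita"] = by_sum["total_articles"] / by_sum["total_population"]

# Other option: divide by the region's own population row (hypothetical value here)
published = pd.DataFrame({"region": ["CARIBBEAN"], "published_population": [44_000_000]})
comparison = by_sum.merge(published, on="region")
comparison["articles_per_capita_published"] = (
    comparison["total_articles"] / comparison["published_population"]
)
print(comparison)
```

The published regional figure also counts people in countries that have no articles in the dataset, so it gives a larger denominator and a lower ratio than summing only the countries present.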

In [268]:
# Sort the result in descending order by 'articles_per_capita' 
regions_by_coverage = region_summary.sort_values(by='articles_per_capita', ascending=False)

print("\nGeographic Regions by Total Coverage (Articles per Capita, Descending Order):")
display(regions_by_coverage[['region', 'total_articles', 'articles_per_capita']])



Geographic Regions by Total Coverage (Articles per Capita, Descending Order):


Unnamed: 0,region,total_articles,articles_per_capita
8,NORTHERN EUROPE,191,6.870504e-06
9,OCEANIA,72,6.486486e-06
0,CARIBBEAN,219,5.983607e-06
14,SOUTHERN EUROPE,797,5.260726e-06
1,CENTRAL AMERICA,188,3.664717e-06
17,WESTERN EUROPE,498,2.746828e-06
5,EASTERN EUROPE,709,2.663411e-06
16,WESTERN ASIA,610,2.064997e-06
13,SOUTHERN AFRICA,123,1.800878e-06
4,EASTERN AFRICA,665,1.382824e-06


In [292]:
regions_by_high_quality = region_summary.sort_values(by='high_quality_articles_per_capita', ascending=False)

print("\nGeographic Regions by High Quality Coverage (Articles per Capita, Descending Order):")
display(regions_by_high_quality[['region', 'high_quality_articles', 'total_population','high_quality_articles_per_capita']])



Geographic Regions by High Quality Coverage (Articles per Capita, Descending Order):


Unnamed: 0,region,high_quality_articles,total_population,high_quality_articles_per_capita
14,SOUTHERN EUROPE,53.0,151500000.0,3.49835e-07
8,NORTHERN EUROPE,9.0,27800000.0,3.23741e-07
0,CARIBBEAN,9.0,36600000.0,2.459016e-07
1,CENTRAL AMERICA,10.0,51300000.0,1.949318e-07
5,EASTERN EUROPE,38.0,266200000.0,1.427498e-07
13,SOUTHERN AFRICA,8.0,68300000.0,1.171303e-07
17,WESTERN EUROPE,21.0,181300000.0,1.158301e-07
16,WESTERN ASIA,27.0,295400000.0,9.140149e-08
9,OCEANIA,1.0,11100000.0,9.009009e-08
7,NORTHERN AFRICA,17.0,255900000.0,6.64322e-08
