# A2: Bias in data

The goal of this assignment is to identify potential biases with the volume and quality of English Wikipedia articles about politicians, across many different countries.

Two external data sources were used for this assignment and they are both stored in the `data` subdirectory of this repository. The first external data source is a CSV file storing minimal information about ~50,000 Wikipedia articles about politicians ("page_data.csv"), which can be downloaded at https://figshare.com/articles/Untitled_Item/5513449. The second external data source is another CSV file storing ~200 countries/continents and their populations ("WPDS_2018_data.csv"), which can be downloaded at https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0.

The schemas for the CSV files are as follows:

**page_data.csv**

|column |description                                          |
|-------|-----------------------------------------------------|
|page   |name of Wikipedia article                            |
|country|country of politician the Wikipedia article is about |
|rev_id |revision ID of the last edit to the Wikipedia article|

**WPDS_2018_data.csv**

|column                        |description                                             |
|------------------------------|--------------------------------------------------------|
|Geography                     |country or continent                                    |
|Population mid-2018 (millions)|population of country or continent in millions of people|

In [1]:
import csv
import pandas
import requests

## 1. Retrieving article quality predictions

To predict an article's quality, we are using the Object Revision Evaluation Service ([ORES](https://www.mediawiki.org/wiki/ORES)) API provided by Wikimedia. Given a version of an article (a revision ID for an article), the ORES API provides class probabilities for six of the grades in the [WikiProject article quality grading scheme](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment#Grades) (FA, GA, B, C, Start, Stub). The ORES API also provides a grade prediction, which we will be extracting for use in later stages of this assignment.

The ORES API does allow for the [grades of multiple articles to be predicted at once](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context), but there are limitations. First, there's a limit to how long a URL can be (2083 characters). Since each revision ID passed to the ORES API is 9 characters long and multiple revision IDs are separated by the "|" character, each ID costs roughly 10 characters, which limits us to passing ~200 Wikipedia articles to the ORES API with each call. Second, through trial and error, it seems that the ORES API imposes its _own_ limit on the number of Wikipedia articles it can score within a single API call. This limit was found to be 140 Wikipedia articles, with any larger number causing the API call to come back with a 503 Service Unavailable response.
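The URL-length bound can be checked with a little arithmetic. The sketch below assumes the 2083-character limit and the 9-character revision IDs quoted above, together with the `enwiki`/`wp10` request URL used later in this notebook; `max_ids_for_url` is a hypothetical helper written for illustration, not part of any ORES client.

```python
# Assumptions: 2083-character URL limit and 9-character revision IDs, as stated above.
MAX_URL_LENGTH = 2083
REV_ID_LENGTH = 9
BASE_URL = "https://ores.wikimedia.org/v3/scores/enwiki?models=wp10&revids="


def max_ids_for_url(base_url, max_url_length=MAX_URL_LENGTH, rev_id_length=REV_ID_LENGTH):
    """Largest n such that n IDs joined by "|" still fit after base_url."""
    remaining = max_url_length - len(base_url)
    # n IDs need n * rev_id_length characters plus (n - 1) separators
    return (remaining + 1) // (rev_id_length + 1)


print(max_ids_for_url(BASE_URL))
```

The result lands at roughly 200 IDs, which is above the 140-article limit observed in practice, so the service-side limit is the one that actually constrains the chunk size.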

Due to the limit on revision IDs that can be passed to the ORES API with each call, we first split the revision IDs found in the Wikipedia article CSV file into chunks of at most 140 IDs. We then make an API call for each of these chunks and save each article's predicted grade/quality. In some cases a prediction can't be made, either because the revision ID can't be found (RevisionNotFound error) or because the Wikipedia article has been deleted (TextDeleted error). Wikipedia articles without a quality prediction are ignored in later stages of this assignment. Since ~340 revision ID chunks are created from the Wikipedia article CSV file and one API call is made per chunk, this step may take a couple of minutes.

In [2]:
# maximum revision IDs allowed per API call, found through trial and error
MAX_REV_IDS_PER_CALL = 140

def create_string_chunk(rev_ids, chunk_list):
    """
    Transforms a list of revision IDs into a string separated by "|" characters to be passed to a single API call.
    Adds the revision ID string chunk to a list of all revision ID string chunks, representing the number of API calls.
    Clears the original list of revision IDs so we can start creating the next list of revision IDs.
    """
    rev_ids_string = "|".join(rev_ids)
    chunk_list.append(rev_ids_string)
    rev_ids.clear()

rev_ids_string_chunks = []
with open("data/page_data.csv", "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    is_header = True
    tmp_rev_ids = []
    for _, _, rev_id in csv_reader:
        # ignore header row
        if is_header:
            is_header = False
            continue
        tmp_rev_ids.append(rev_id)
        # if we've accumulated the maximum number of revision IDs allowed for an API call, transform the revision IDs
        # into a string to pass as a parameter to an API call
        if len(tmp_rev_ids) == MAX_REV_IDS_PER_CALL:
            create_string_chunk(tmp_rev_ids, rev_ids_string_chunks)
    # transform all leftover revision IDs into a string
    if len(tmp_rev_ids) > 0:
        create_string_chunk(tmp_rev_ids, rev_ids_string_chunks)
            
print("{} revision ID chunks created".format(len(rev_ids_string_chunks)))

338 revision ID chunks created
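The same chunking can also be sketched more compactly with list slicing. This is an illustrative alternative, not the code that produced the output above; `chunk_rev_ids` is a hypothetical helper and the revision IDs are made up.

```python
def chunk_rev_ids(rev_ids, chunk_size=140):
    """Split rev_ids into "|"-joined strings of at most chunk_size IDs each."""
    return [
        "|".join(rev_ids[start:start + chunk_size])
        for start in range(0, len(rev_ids), chunk_size)
    ]


# hypothetical 9-digit revision IDs, for illustration only
example_ids = [str(100000000 + i) for i in range(300)]
example_chunks = chunk_rev_ids(example_ids)
print(len(example_chunks))  # 3 chunks: 140 + 140 + 20 IDs
```

Slicing avoids the mutable temporary list and the separate "leftover" step, at the cost of holding all revision IDs in memory first, which is not a concern for a file of this size.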


In [3]:
# documentation found here: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{context}?models={model}&revids={revids}"

predictions = {}
for rev_ids_string in rev_ids_string_chunks:
    ores_params = {
        "context": "enwiki",
        "model": "wp10",
        "revids": rev_ids_string
    }
    response = requests.get(ORES_ENDPOINT.format(**ores_params))
    # raise an AssertionError if the API call does not come back successfully
    assert response.status_code == 200, "API call came back with status code: {}".format(response.status_code)
    print("*", end="")
    response_json = response.json()
    scores = response_json["enwiki"]["scores"]
    for rev_id in scores.keys():
        model = scores[rev_id]["wp10"]
        # some queries by the API don't return predictions because the revision ID wasn't found (RevisionNotFound error)
        # or the Wikipedia article has since been deleted (TextDeleted error), we will not save these revision IDs in
        # the predictions map
        if "score" in model:
            prediction = model["score"]["prediction"]
            predictions[rev_id] = prediction

print("\nArticle quality predictions retrieved")

**************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Article quality predictions retrieved
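To make the filtering step concrete, here is a small sketch that runs against a hand-written response rather than a live API call. The nesting (`enwiki`, `scores`, revision ID, `wp10`, `score`, `prediction`) mirrors the keys read by the cell above; the revision IDs, the predicted grades, and the exact layout of the `error` entry are assumptions made for illustration, and `extract_predictions` is a hypothetical helper.

```python
# Hand-written stand-in for one ORES response; all values are hypothetical.
mock_response = {
    "enwiki": {
        "scores": {
            "100000001": {"wp10": {"score": {"prediction": "Stub"}}},
            "100000002": {"wp10": {"error": {"type": "RevisionNotFound"}}},
            "100000003": {"wp10": {"score": {"prediction": "GA"}}},
        }
    }
}


def extract_predictions(response_json, context="enwiki", model="wp10"):
    """Return {rev_id: predicted grade}, skipping revisions that have no score."""
    extracted = {}
    for rev_id, models in response_json[context]["scores"].items():
        result = models[model]
        # entries without a "score" key carry an error instead of a prediction
        if "score" in result:
            extracted[rev_id] = result["score"]["prediction"]
    return extracted


print(extract_predictions(mock_response))
```

Only the two revisions that carry a `score` entry end up in the result; the revision with an error is silently dropped, which is the behaviour relied on in the merging step.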


## 2. Merging the datasets

Now that we effectively have three data sources (Wikipedia articles CSV file, countries CSV file, article quality predictions) that can be linked with each other, we will merge the data sources into one CSV outfile ("article_qualities.csv").

The schema for the CSV outfile is as follows:

|column         |description                                         |
|---------------|----------------------------------------------------|
|country        |country of politician the Wikipedia article is about|
|article_name   |name of Wikipedia article                           |
|revision_id    |revision ID of last edit to Wikipedia article       |
|article_quality|predicted grade/quality of Wikipedia article        |
|population     |population of country in millions of people         |

Before merging the three data sources, we have to account for country names that differ between the articles CSV file and the countries CSV file. I went through the articles CSV file, recorded each country name that had no match in the countries CSV file, and tallied how many times each mismatching name appeared. For mismatching names that appeared more than 30 times in the articles CSV file (an arbitrary threshold chosen so that the number of manual mappings was neither too small nor too large) and had a slightly different name in the countries CSV file, I created a manual mapping between the two names (e.g. "Czech Republic" in the articles CSV file maps to "Czechia" in the countries CSV file) so that Wikipedia article data from these countries can be merged. The country name used in the outfile is the one found in the countries CSV file. Wikipedia articles whose country name still has no match in the countries CSV file are ignored in later stages of this assignment, as are articles for which no quality prediction could be made.
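The tallying step described above is not shown in the notebook. A minimal sketch of how it could be done follows, using toy stand-ins for the two CSV files; the function names and the sample rows are hypothetical.

```python
from collections import Counter

# Toy stand-ins for the two CSV files; the rows below are hypothetical.
article_countries = ["Czech Republic"] * 3 + ["Iceland"] * 2 + ["Ivorian"]
population_countries = {"Czechia", "Iceland", "Cote d'Ivoire"}


def tally_mismatches(article_countries, population_countries):
    """Count how often each unmatched country name appears in the articles data."""
    return Counter(
        country for country in article_countries
        if country not in population_countries
    )


def frequent_mismatches(tally, min_count):
    """Keep only the names that appear more than min_count times."""
    return {name: count for name, count in tally.items() if count > min_count}


mismatch_tally = tally_mismatches(article_countries, population_countries)
# the notebook uses a threshold of 30; 1 is used here to suit the toy data
print(frequent_mismatches(mismatch_tally, min_count=1))
```

The names that survive the threshold are the candidates for a manual mapping; deciding what each one maps to still has to be done by hand.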

In [4]:
# load the country populations CSV file into memory since it's small, which makes it easier to merge the three data
# sources together

populations = {}
with open("data/WPDS_2018_data.csv", "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    is_header = True
    for country, population in csv_reader:
        # ignore header row
        if is_header:
            is_header = False
            continue
        # remove commas from populations (e.g. 1,284 -> 1284) to make it easier to convert populations to floats at a
        # later stage
        populations[country] = population.replace(",", "")

In [5]:
# mappings between country names (country name in articles CSV file -> country name in countries CSV file) that
# slightly differ in each CSV file, map only contains common mismatched country names
COMMON_MISMATCHING_COUNTRIES = {
    "Czech Republic": "Czechia",
    "Hondura": "Honduras",
    "Congo, Dem. Rep. of": "Congo, Dem. Rep.",
    "Salvadoran": "El Salvador",
    "South Korean": "Korea, South",
    "Ivorian": "Cote d'Ivoire",
    "Samoan": "Samoa",
    "Saint Lucian": "Saint Lucia",
    "East Timorese": "Timor-Leste",
    "Saint Kitts and Nevis": "St. Kitts-Nevis",
    "Swaziland": "eSwatini",
}

with open("data/page_data.csv", "r") as csv_file_read:
    with open("article_qualities.csv", "w") as csv_file_write:
        csv_reader = csv.reader(csv_file_read)
        csv_writer = csv.writer(csv_file_write)
        # write header row
        csv_writer.writerow(["country", "article_name", "revision_id", "article_quality", "population"])
        is_header = True
        for page, country, rev_id in csv_reader:
            # ignore header row when reading
            if is_header:
                is_header = False
                continue
            # if possible, map the country name provided by the articles CSV file
            country = COMMON_MISMATCHING_COUNTRIES.get(country, country)
            # write to the CSV outfile only if we were able to predict the article's quality and the country name
            # matches with a country name in the countries CSV file
            if rev_id in predictions.keys() and country in populations.keys():
                # join the article's quality prediction
                quality_prediction = predictions[rev_id]
                # join the country's population in millions of people
                population = populations[country]
                csv_writer.writerow([country, page, rev_id, quality_prediction, population])

## 3. Computing the analysis

To identify potential biases with English Wikipedia articles about politicians across countries, we will use two different metrics.

**articles-per-population**

This metric will be displayed in the "articles/million people" column in the tables below. For each country, it represents the number of articles about politicians per one million inhabitants.

**proportion of high-quality articles**

This metric will be displayed in the "perc. of high quality articles" column in the tables below. For each country, it represents the proportion of articles about politicians that were predicted to be either a featured article (FA) or a good article (GA). The proportion is represented as a percentage.
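As a worked example of both metrics, the sketch below plugs in Iceland's figures from the results tables in this notebook (206 articles, 2 of them predicted FA or GA, population 0.4 million). The two function names are hypothetical and only serve the illustration.

```python
def articles_per_million(article_count, population_millions):
    """Number of articles per one million people."""
    return article_count / population_millions


def high_quality_percentage(high_quality_count, article_count):
    """Share of articles predicted FA or GA, expressed as a percentage."""
    return high_quality_count / article_count * 100


# Iceland: 206 articles, 2 high-quality articles, 0.4 million people
print(round(articles_per_million(206, 0.4), 2))
print(round(high_quality_percentage(2, 206), 2))
```

These come out at about 515 articles per million people and about 0.97 percent high-quality articles, in line with Iceland's row in the tables.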

In [6]:
# the outputted CSV file as a pandas.DataFrame
articles_df = pandas.read_csv("article_qualities.csv")

# add a boolean column indicating whether the article was predicted to be of high quality (FA or GA)
articles_df["high_quality_article"] = (articles_df["article_quality"] == "FA") | (articles_df["article_quality"] == "GA")

# aggregate the original DataFrame to calculate the number of articles for each country
article_count_df = pandas.DataFrame({"article count": articles_df.groupby(["country", "population"]).size()}).reset_index()

# aggregate the original DataFrame to calculate the number of high quality articles for each country
hq_article_count_df = pandas.DataFrame({"high quality article count": articles_df.groupby(["country", "population"])["high_quality_article"].sum()}).reset_index()

# merge the two aggregated DataFrames together
countries_df = article_count_df.merge(hq_article_count_df)

# convert a count column from float to integer
countries_df["high quality article count"] = countries_df["high quality article count"].astype(int)

# add a column representing the number of articles per one million people
countries_df["articles/million people"] = countries_df["article count"]/countries_df["population"]

# add a column representing the proportion of high quality articles as a percentage
countries_df["perc. of high quality articles"] = (countries_df["high quality article count"]/countries_df["article count"])*100


### 10 highest ranked countries in terms of articles-per-population

In [7]:
countries_df.sort_values(by="articles/million people", ascending=False).head(10)

| |country|population|article count|high quality article count|articles/million people|perc. of high quality articles|
|---|---|---|---|---|---|---|
|175|Tuvalu|0.01|55|5|5500.0|9.090909|
|120|Nauru|0.01|53|0|5300.0|0.0|
|142|San Marino|0.03|82|0|2733.333333|0.0|
|113|Monaco|0.04|40|0|1000.0|0.0|
|98|Liechtenstein|0.04|29|0|725.0|0.0|
|158|St. Kitts-Nevis|0.05|32|0|640.0|0.0|
|170|Tonga|0.1|63|1|630.0|1.587302|
|108|Marshall Islands|0.06|37|0|616.666667|0.0|
|73|Iceland|0.4|206|2|515.0|0.970874|
|3|Andorra|0.08|34|0|425.0|0.0|


This table is quite unexciting since it's entirely made up of countries with fewer than half a million people. Below is another table displaying the 10 highest ranked countries in terms of articles-per-population, but filtering out countries with fewer than 100 articles about politicians.

In [8]:
countries_df[countries_df["article count"] >= 100].sort_values(by="articles/million people", ascending=False).head(10)

| |country|population|article count|high quality article count|articles/million people|perc. of high quality articles|
|---|---|---|---|---|---|---|
|73|Iceland|0.4|206|2|515.0|0.970874|
|100|Luxembourg|0.6|180|1|300.0|0.555556|
|57|Fiji|0.9|199|1|221.111111|0.502513|
|107|Malta|0.5|103|0|206.0|0.0|
|123|New Zealand|4.9|790|12|161.22449|1.518987|
|1|Albania|2.9|460|4|158.62069|0.869565|
|127|Norway|5.3|658|6|124.150943|0.911854|
|112|Moldova|3.5|426|0|121.714286|0.0|
|54|Estonia|1.3|153|1|117.692308|0.653595|
|58|Finland|5.5|572|0|104.0|0.0|


### 10 lowest ranked countries in terms of articles-per-population

In [9]:
countries_df.sort_values(by="articles/million people").head(10)

| |country|population|article count|high quality article count|articles/million people|perc. of high quality articles|
|---|---|---|---|---|---|---|
|74|India|1371.3|986|14|0.719026|1.419878|
|75|Indonesia|265.2|214|8|0.806938|3.738318|
|34|China|1393.8|1135|33|0.814321|2.907489|
|182|Uzbekistan|32.9|29|1|0.881459|3.448276|
|55|Ethiopia|107.5|105|1|0.976744|0.952381|
|187|Zambia|17.7|25|0|1.412429|0.0|
|87|Korea, North|25.6|39|7|1.523438|17.948718|
|38|Congo, Dem. Rep.|84.3|142|8|1.68446|5.633803|
|167|Thailand|66.2|112|3|1.691843|2.678571|
|13|Bangladesh|166.4|323|3|1.941106|0.928793|


### 10 highest ranked countries in terms of proportion of high-quality articles

In [10]:
countries_df.sort_values(by="perc. of high quality articles", ascending=False).head(10)

| |country|population|article count|high quality article count|articles/million people|perc. of high quality articles|
|---|---|---|---|---|---|---|
|87|Korea, North|25.6|39|7|1.523438|17.948718|
|144|Saudi Arabia|33.4|119|16|3.562874|13.445378|
|31|Central African Republic|4.7|68|8|14.468085|11.764706|
|137|Romania|19.5|348|40|17.846154|11.494253|
|109|Mauritania|4.5|52|5|11.555556|9.615385|
|175|Tuvalu|0.01|55|5|5500.0|9.090909|
|19|Bhutan|0.8|33|3|41.25|9.090909|
|47|Dominica|0.07|12|1|171.428571|8.333333|
|180|United States|328.0|1092|82|3.329268|7.509158|
|18|Benin|11.5|94|7|8.173913|7.446809|


### 10 lowest ranked countries in terms of proportion of high-quality articles

Since there are more than 10 countries with zero high quality articles about politicians, I've also sorted this table by the volume of articles about politicians. So, countries that have no high quality articles about politicians but have a large number of _articles_ (regardless of quality) about politicians will appear higher in the table.

In [11]:
countries_df.sort_values(by=["perc. of high quality articles", "article count"], ascending=[True, False]).head(10)

| |country|population|article count|high quality article count|articles/million people|perc. of high quality articles|
|---|---|---|---|---|---|---|
|58|Finland|5.5|572|0|104.0|0.0|
|16|Belgium|11.4|523|0|45.877193|0.0|
|112|Moldova|3.5|426|0|121.714286|0.0|
|162|Switzerland|8.5|407|0|47.882353|0.0|
|121|Nepal|29.7|361|0|12.154882|0.0|
|71|Honduras|9.0|189|0|21.0|0.0|
|176|Uganda|44.1|188|0|4.263039|0.0|
|39|Costa Rica|5.0|150|0|30.0|0.0|
|172|Tunisia|11.6|140|0|12.068966|0.0|
|150|Slovakia|5.4|119|0|22.037037|0.0|


## 4. Writeup

Since we're looking at _English_ Wikipedia politician articles, I expect that the more English-speaking a country is, the higher its articles-per-population and its proportion of high-quality articles will be, and vice versa. My theory is based on a few assumptions:

1. The higher the proportion of English-speaking residents in a country, the higher the proportion of residents who contribute to English Wikipedia articles
2. Wikipedians are more likely to edit politician articles for politicians in their country than politicians in other countries
3. Wikipedians tend to have greater knowledge of politicians in their country than equivalent politicians in other countries, causing Wikipedians to write higher quality articles for politicians in their country

I also expect that the higher a country's standard of living is (I'm using the [IHDI metric](https://en.wikipedia.org/wiki/List_of_countries_by_inequality-adjusted_HDI) as a quantitative representation of this), the higher its articles-per-population and its proportion of high-quality articles will be, and vice versa. Together with assumptions 2 and 3 above, this theory is based on a new assumption:

4. The higher a country's inequality-adjusted human development index (IHDI), the higher the proportion of residents who contribute to Wikipedia articles

In summary, my theories are that English Wikipedia politician articles are biased in volume and quality towards predominantly English-speaking countries and countries with higher standards of living.

It looks like a country's population heavily influences its rank according to the articles-per-population metric. Unfiltered, each of the top 10 countries according to this metric has a population of less than half a million people, and the 3 lowest-ranked countries according to this metric are 3 of the 4 most populous countries. The United States has the 3rd highest population in the world, but doesn't appear among the 10 lowest-ranked countries according to the articles-per-population metric. Of the top 10 countries according to this metric (filtered), 7 are ranked in the top 30 of the IHDI metric and 5 have a >50% English-speaking population. By contrast, of the bottom 10 countries according to this metric, China has the highest ranking according to the IHDI metric (62) and none of the countries are predominantly English-speaking.

Of the top 10 countries according to the proportion of high-quality articles metric, more rank in the _bottom_ 30 according to the IHDI metric (3) than in the top 30 (1, the United States). Also, only 2 of the top 10 countries according to this metric are predominantly English-speaking. Of the bottom 10 countries according to this metric, more rank in the _top_ 30 according to the IHDI metric (4) than in the bottom 30 (1, Uganda). Also, 4 of the bottom 10 countries are predominantly English-speaking, which is more than among the top 10 countries according to the proportion of high-quality articles metric.

I would say the theories linking more English-speaking countries and higher standards of living to more articles-per-population were sufficiently supported. On the other hand, the theories related to the proportion of high-quality articles were not supported at all, and the results were mostly the opposite of my original theories. From my limited knowledge of world government, my new theory is that the proportion of high-quality politician articles for a country is correlated with how "controversial" or how "atypical" the country's government is. The presence of North Korea and Saudi Arabia at the top of this metric supports the new theory. A case could be made that the inclusion of the United States in the top 10 according to this metric also supports the new theory.