# A2: Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. We combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

We then perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies between countries. The analysis will consist of a series of tables that show:
1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.


### Import Libraries

In [1]:
import os
import requests
from urllib.parse import urlencode

import pandas as pd
import numpy as np

from pprint import pprint as pp
from tqdm import tqdm

### Define Constants

In [2]:
RAW_DATA_PATH = '../data/raw'
PROCESSED_DATA_PATH = '../data/processed'
ERROR_DATA_PATH = '../data/errors'

for path in [RAW_DATA_PATH, PROCESSED_DATA_PATH, ERROR_DATA_PATH]:
    if not os.path.exists(path):
        os.makedirs(path)

In [164]:
# Raw Data
RAW_COUNTRY_DATASET_FPATH = os.path.join(RAW_DATA_PATH, 'page_data.csv')
RAW_WORLD_POPULATION_DATASET_FPATH = os.path.join(RAW_DATA_PATH, 'WPDS_2020_data.csv')

#Processed prediction data
PROCESSED_POLITICIANS_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'politicians_country.csv')
PROCESSED_WORLD_POPULATION_COUNTRY_LEVEL_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'world_population_country_level.csv')
PROCESSED_WORLD_POPULATION_REGION_LEVEL_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'world_population_region_level.csv')
PROCESSED_MISSING_PREDICTION_DATA_FPATH = os.path.join(ERROR_DATA_PATH, 'missing_prediction_revids.csv')

# Processed merged data
PROCESSED_POLITICIANS_WORLD_POPULATION_MERGED_FPATH = os.path.join(PROCESSED_DATA_PATH,'wp_wpds_politicians_by_country.csv')
PROCESSED_POLITICIANS_WORLD_POPULATION_NO_MATCH_FPATH = os.path.join(ERROR_DATA_PATH,'wp_wpds_countries-no_match.csv')

## 1. Data Acquisition

We obtain the data from several different places:
1. The Wikipedia politicians by country dataset can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449)
    * We first download the zipped folder manually
    * We then extracted the zipped folder
    * Inside the folder we go to: `country/country/data`
    * Here, we copy `page_data.csv` and place this inside the raw data path
2. The population data is available in CSV format as [WPDS_2020_data.csv](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing)
    * This dataset is drawn from the world population data sheet published by the [Population Reference Bureau](https://www.prb.org/international/indicator/population/table/).


In [4]:
df_pcd = pd.read_csv(RAW_COUNTRY_DATASET_FPATH)
df_wpd = pd.read_csv(RAW_WORLD_POPULATION_DATASET_FPATH)

In [5]:
df_pcd.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [6]:
df_wpd.head(5)

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


### Data Cleaning

There is some information that is not needed for analysis in each of the files mentioned above. Thus, we performing the following cleaning steps:
1. Country Dataset
    * The dataset contains some page names that start with the string "Template:".
    * These pages are not Wikipedia articles, and should not be included in your analysis.
2. Population Dataset
    * This dataset contains some rows that provide cumulative regional population counts, rather than country-level counts.
    * These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA).
    * We remove these from the dataset, but retain a copy of these in a seperate file

In [7]:
df_pcd = df_pcd[df_pcd["page"].str.contains("Template:")==False]
df_wpd_country = df_wpd[df_wpd['Name'].str.isupper() == False] # Country-level counts
df_wpd_region = df_wpd[df_wpd['Name'].str.isupper()] # Cumulative region level counts

In [8]:
# Ensure that we do not have anything that is all-caps in the `Name` field
df_wpd_country['Name'].unique()

array(['Algeria', 'Egypt', 'Libya', 'Morocco', 'Sudan', 'Tunisia',
       'Western Sahara', 'Benin', 'Burkina Faso', 'Cape Verde',
       "Cote d'Ivoire", 'Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau',
       'Liberia', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Senegal',
       'Sierra Leone', 'Togo', 'Burundi', 'Comoros', 'Djibouti',
       'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi',
       'Mauritius', 'Mayotte', 'Mozambique', 'Reunion', 'Rwanda',
       'Seychelles', 'Somalia', 'South Sudan', 'Tanzania', 'Uganda',
       'Zambia', 'Zimbabwe', 'Angola', 'Cameroon',
       'Central African Republic', 'Chad', 'Congo', 'Congo, Dem. Rep.',
       'Equatorial Guinea', 'Gabon', 'Sao Tome and Principe', 'Botswana',
       'eSwatini', 'Lesotho', 'Namibia', 'South Africa', 'Canada',
       'United States', 'Belize', 'Costa Rica', 'El Salvador',
       'Guatemala', 'Honduras', 'Mexico', 'Nicaragua', 'Panama',
       'Antigua and Barbuda', 'Bahamas', 'Barbados', 'Cuba', 'Curacao',
 

In [9]:
# Ensure that we only have strings that are all-caps in the `Name` field
df_wpd_region['Name'].unique()

array(['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'WESTERN AFRICA',
       'EASTERN AFRICA', 'MIDDLE AFRICA', 'SOUTHERN AFRICA',
       'NORTHERN AMERICA', 'LATIN AMERICA AND THE CARIBBEAN',
       'CENTRAL AMERICA', 'CARIBBEAN', 'SOUTH AMERICA', 'ASIA',
       'WESTERN ASIA', 'CENTRAL ASIA', 'SOUTH ASIA', 'SOUTHEAST ASIA',
       'EAST ASIA', 'EUROPE', 'NORTHERN EUROPE', 'WESTERN EUROPE',
       'EASTERN EUROPE', 'SOUTHERN EUROPE', 'OCEANIA'], dtype=object)

We can now cache away the files that were created

In [10]:
df_pcd.to_csv(PROCESSED_POLITICIANS_DATASET_FPATH, index=False)
df_wpd_country.to_csv(PROCESSED_WORLD_POPULATION_COUNTRY_LEVEL_DATASET_FPATH, index=False)
df_wpd_region.to_csv(PROCESSED_WORLD_POPULATION_REGION_LEVEL_DATASET_FPATH, index=False)

## 2. Obtain Article Quality Predictions

Now we need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. 

ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:
* FA - Featured article
* GA - Good article
* B - B-class article
* C - C-class article
* Start - Start-class article
* Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures.These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors. 

We use a [REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context) to obtain the information for each article. 

In [11]:
ORES_ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{context}?'
CONTEXT = 'enwiki'
MODEL = 'articlequality'
NUM_REVIDS_PER_BATCH = 50 # We will obtain predictions for these many articles at a time

In [12]:
df_pcd[MODEL] = np.NaN
df_pcd.set_index('rev_id', inplace=True)

In [13]:
def api_call(endpoint, parameters):
    call = requests.get(endpoint.format(**parameters))
    response = call.json()
    
    return response

We want to call the API for multiple `rev_id`'s at a time. To achieve this, we create a list of lists, where each small list will contain a batch of rev_ids to be called at a time. 

In [14]:
revids = df_pcd.index.to_list()
num_lists = round(len(revids) / NUM_REVIDS_PER_BATCH)

revids = list(map(list, np.array_split(revids, num_lists)))

In [15]:
df_pcd.head()

Unnamed: 0_level_0,page,country,articlequality
rev_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
355319463,Bir I of Kanem,Chad,
393276188,Information Minister of the Palestinian Nation...,Palestinian Territory,
393822005,Yos Por,Cambodia,
395521877,Julius Gregr,Czech Republic,
395526568,Edvard Gregr,Czech Republic,


In [16]:
czs

NameError: name 'czs' is not defined

This section is responsible for calling the API as well as error handling. The following is the procedure for each batch of `rev_ids`

1. Populate the parameter and query part in the API endpoint
2. Call the API and obtain a JSON response
3. Check to see if the API call was sucessful 
4. Check to see if the each `rev_id` returned a valid response
    * If the response is valid, we go ahead and save this prediction information into a dataframe
    * If not, we add this `rev_id` to a list to show the errored out `rev_ids` later 

In [None]:
error_batch_list = []
missing_revids = []

for revid_batch in tqdm(revids):    
    # Define the parameters that we ill be sending to the API
    params = {
        'context': CONTEXT,
    }
    
    query_parms = {
        'revids': '|'.join(str(x) for x in revid_batch),
        'models': MODEL
    }
    
    # Call the API and get the prediction from API
    response = api_call(ORES_ENDPOINT + urlencode(query_parms), params)
    
    # For each rev id, we populate it with the correct prediction
    try:
        scores = response[CONTEXT]['scores']
    except:
        error_batch_list.append(revid_batch)
        continue
    
    for rev_id in scores.keys():
        if 'error' in rev_id: continue
        
        try:
            prediction = scores[rev_id][MODEL]['score']['prediction']
        except:
            missing_revids.append(rev_id)
            # print(scores[rev_id][MODEL])
            continue
        
        rev_id = int(rev_id)
        df_pcd.loc[rev_id, MODEL] = prediction

If a batch errored out, we iterate through each `rev_id` and call the API individually to see if that may return a valid response. This may sometimes happen because of constraints of the API

In [None]:
for revid_batch in error_batch_list:
    for revid in revid_batch:
        # This is the unique id for the article
        revid_str = str(revid)

        # Define the parameters that we ill be sending to the API
        params = {
            'context': CONTEXT,
            'revid': revid_str,
            'model': MODEL
        }
        response = api_call(ORES_ENDPOINT, params)
        if 'scores' not in response[CONTEXT]: continue
        scores = response[CONTEXT]['scores'][revid_str][MODEL]

        if 'error' in scores.keys():  
            missing_revids.append(revid) # If we do not find the article, we skip it and move on
        
        rev_id = int(revid)
        df_pcd.loc[rev_id, MODEL] = scores['score']['prediction']

### 2.1 Prediction Errors

In [None]:
missing_revids = set(missing_revids)
print(f'There are {len(missing_revids)} rev_ids for which the API did not return a prediction.')

missing_revids_df = pd.DataFrame(list(missing_revids))
missing_revids_df.columns = ['rev_id']
missing_revids_df.to_csv(PROCESSED_MISSING_PREDICTION_DATA_FPATH, index=False)

## 3. Combining Datasets

Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vise versa.

In [156]:
PROCESSED_POLITICIANS_DATASET_PREDICTIONS_FPATH = os.path.join(PROCESSED_DATA_PATH, 'politicians_country_predicted.csv')
df_pcd.to_csv(PROCESSED_POLITICIANS_DATASET_PREDICTIONS_FPATH)
# df_pcd = pd.read_csv(PROCESSED_POLITICIANS_DATASET_PREDICTIONS_FPATH).set_index('rev_id')

In [166]:
# Clean and merge our 2 dataframes
df_pcd.dropna(subset = [MODEL], inplace=True)
df_pcd.reset_index(inplace=True)

# Merge the 2 dataframes
wp_wpds_politicians_by_country_df = pd.merge(left=df_pcd, right=df_wpd_country, left_on='country', right_on='Name')

# Clean and rename the merged dataframe
wp_wpds_politicians_by_country_df.rename(columns={'page': 'article_name', 
                                                  'rev_id': 'revision_id',
                                                  MODEL: 'article_quality_est.',
                                                  'Population': 'population'}, inplace=True)
wp_wpds_politicians_by_country_df = wp_wpds_politicians_by_country_df[['country', 'article_name', 'revision_id', 'article_quality_est.', 'population']]

Check to find out what countries were not merged because there was no match

In [174]:
merged_countries = set(wp_wpds_politicians_by_country_df['country'])
all_pcd = set(df_pcd['country'])
all_wpd_country = set(df_wpd_country['Name'])

un_merged_countries_set = all_pcd.difference(merged_countries).union(all_wpd_country.difference(merged_countries))
un_merged_countries_df = pd.DataFrame(list(un_merged_countries_set))
un_merged_countries_df.columns = ['country']

In [145]:
wp_wpds_politicians_by_country_df.to_csv(PROCESSED_POLITICIANS_WORLD_POPULATION_MERGED_FPATH, index=False)
un_merged_countries_df.to_csv(index=False, PROCESSED_POLITICIANS_WORLD_POPULATION_NO_MATCH_FPATH)

In [146]:
wp_wpds_politicians_by_country_df.head(5)

Unnamed: 0,country,article_name,revision_id,article_quality_est.,population
0,Finland,Vihtori Aromaa,628261896,Stub,5529000
1,Finland,Jonathan Wartiovaara,628268705,Stub,5529000
2,Finland,Arvi Turkka,628270736,Stub,5529000
3,Finland,Juho Heikkinen,628312759,Stub,5529000
4,Finland,Emanuel Aromaa,628379479,Stub,5529000


## 4. Analysis

Here we transform the data so that it can be easily consumed for the results section. 

We create a pivot table to show the number and types of articles for each country. Moreover, we also add relevant information for each country

In [147]:
df_analysis = pd.pivot_table(wp_wpds_politicians_by_country_df,
                             fill_value=0, 
                             columns=['article_quality_est.'],
                             aggfunc={
                                 'article_quality_est.': len, #count the number of articles 
                             },
                             index=['country'] #per country
                            )
df_analysis.columns = df_analysis.columns.droplevel() #clean up multilevel index
df_analysis = df_analysis.reset_index()
df_analysis.columns.name = None

In [175]:
# Add population to the pivot table
df_analysis = pd.merge(left=df_analysis, 
                       right=wp_wpds_politicians_by_country_df.groupby(['country'])['population'].mean(), 
                       left_on='country', 
                       right_index=True)

In [151]:
df_analysis['num_articles'] = df_analysis['FA'] + df_analysis['GA'] + df_analysis['B'] + df_analysis['C'] + df_analysis['Stub'] + df_analysis['Start']
df_analysis['num_high_quality_articles'] = df_analysis['FA'] + df_analysis['GA']
df_analysis['articles_per_population_percent'] = (df_analysis['num_high_quality_articles'] / df_analysis['population']) * 100
df_analysis['high_quality_articles_percent'] = (df_analysis['num_high_quality_articles'] / df_analysis['num_articles']) * 100

After the transformation, our table now looks as follows:

In [152]:
df_analysis.head(5)

Unnamed: 0,country,B,C,FA,GA,Start,Stub,population,num_articles,num_high_quality_articles,articles_per_population_percent,high_quality_articles_percent
0,Afghanistan,8,46,1,12,99,153,38928000,319,13,3.3e-05,4.075235
1,Albania,3,59,0,3,147,244,2838000,456,3,0.000106,0.657895
2,Algeria,3,10,0,2,44,57,44357000,116,2,5e-06,1.724138
3,Andorra,0,2,0,0,8,24,82000,34,0,0.0,0.0
4,Angola,2,6,0,0,23,74,32522000,105,0,0.0,0.0


## 5. Results

In [154]:
def show_results(df, _by, _ascending=True, n=10):
    cols = ['country'] + _by
    return df.sort_values(by=_by, ascending=_ascending).head(n)[cols]

### 5.1
Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [155]:
# df_analysis.sort_values(by=['articles_per_population_percent'], ascending=False).head(10)['']
show_results(df_analysis, _by=['articles_per_population_percent'], _ascending=False)

Unnamed: 0,country,articles_per_population_percent
169,Tuvalu,0.04
44,Dominica,0.001389
177,Vanuatu,0.000935
70,Iceland,0.000543
75,Ireland,0.0005
112,Montenegro,0.000322
105,Martinique,0.000281
19,Bhutan,0.000274
120,New Zealand,0.000261
135,Romania,0.000218


### 5.2

Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

### 5.3 
Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

### 5.4
Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

### 5.5
Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

### 5.6
Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality