## A2 - Bias in Data

We will explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. In this notebook, we will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.


## Step 1: Data acquisition

In this section, we will load the required dataset for the analysis. We will use 2 datasets:

* Wikipedia politicians by country dataset - [FigShare Link](https://figshare.com/articles/dataset/Untitled_Item/5513449)
* World population data sheet - [Population Reference Bureau](https://www.prb.org/international/indicator/population/table/)

We have downloaded the dataset files into the repository from the sources above:

* Wikipedia politicians by country dataset - data/page_data.csv
* World population data sheet - data/world_population.csv

In [1]:
import pandas as pd

In [2]:
# Load Wikipedia politicians by country dataset
politicians_dataset = pd.read_csv("../data/page_data.csv")

# Load World population data sheet
world_population_dataset = pd.read_csv("../data/world_population.csv")

## Step 2: Data cleaning

In this step we will perform some cleaning on the datasets obtained from the previous step. Specifically, we will:

For the Wikipedia politicians by country dataset:
* Remove all rows that start with `Template:`

For World population data sheet:
* Separate country records and sub-region records into different files

In [3]:
# Take an explicit copy so that columns can be added later without a SettingWithCopyWarning
politicians_dataset_cleaned = politicians_dataset[~politicians_dataset.page.str.startswith("Template:")].copy()

In [4]:
world_population_dataset_cleaned_country = world_population_dataset[~world_population_dataset.Name.str.isupper()]
world_population_dataset_cleaned_sub_region = world_population_dataset[world_population_dataset.Name.str.isupper()]
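
To make the two filters concrete, here is a minimal sketch on toy data. The rows are made up for illustration and are not taken from the real files; only the column names `page` and `Name` follow the datasets above.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, for illustration only)
pages = pd.DataFrame({
    "page": ["Template:Example politicians", "Jane Doe", "John Roe"],
    "country": ["Freedonia", "Freedonia", "Sylvania"],
})
population = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
})

# Drop template pages: their titles start with "Template:"
pages_cleaned = pages[~pages["page"].str.startswith("Template:")].copy()

# Region rows are written in capital letters, country rows are not
is_region = population["Name"].str.isupper()
countries = population[~is_region]
regions_only = population[is_region]

print(pages_cleaned["page"].tolist())
print(countries["Name"].tolist())
print(regions_only["Name"].tolist())
```

The split relies entirely on the capitalization convention of the population sheet, so a country name written in capitals would be misclassified as a region.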

Then we will store the cleaned datasets in `data-cleaned/`:

* Cleaned Wikipedia politicians by country dataset: `data-cleaned/page_data_cleaned.csv`
* Cleaned World population data sheet (Country): `data-cleaned/world_population_cleaned_country.csv`
* Cleaned World population data sheet (Sub-Region): `data-cleaned/world_population_cleaned_sub_region.csv`


In [5]:
politicians_dataset_cleaned.to_csv('../data-cleaned/page_data_cleaned.csv')

In [6]:
world_population_dataset_cleaned_country.to_csv('../data-cleaned/world_population_cleaned_country.csv')
world_population_dataset_cleaned_sub_region.to_csv('../data-cleaned/world_population_cleaned_sub_region.csv')

## Step 3: Getting Article Quality Predictions

For this step, we will score the `rev_id` of each Wikipedia page in the politicians dataset with the ORES `articlequality` model. I was not able to install the ORES package ([official repository](https://github.com/wikimedia/ores)) through pip, so I will query the predictions through the [REST API](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) instead.

Next we will run ORES on the politicians dataset

In [7]:
import requests
import pickle
from tqdm import tqdm

def get_prediction(rev_id):
    headers = {
        'User-Agent': 'https://github.com/wanggy0201',
        'From': 'wanggy@uw.edu'
    }

    config = "https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids=" + str(rev_id)
    
    # Send the request once and reuse the response
    response = requests.get(config, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        print("Not able to retrieve records for rev_id:" + str(rev_id))
        return "NaN"
    
def get_predictions_from_df(df, batch_size=50):
    rev_ids = df['rev_id'].tolist()
    
    result = {}
    
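    # Step through the IDs one batch at a time; stepping by batch_size
    # avoids an empty trailing batch when the count divides evenly
    for start in tqdm(range(0, len(rev_ids), batch_size)):
        rev_id_batch = rev_ids[start:start + batch_size]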
        
        rev_id_batch = ('|').join(map(str,rev_id_batch))
        responses = get_prediction(rev_id_batch)['enwiki']['scores']
        
        result.update(responses)
    
    with open("../data-cleaned/api_responses.pickle","wb") as file:
        pickle.dump(result, file)
        
    return result
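
The batching step above can be sketched in isolation, without any network calls. The helper name `make_revid_batches` is hypothetical (it is not part of this notebook or of ORES); it only shows how revision IDs are grouped and joined with `|` for the `revids` query parameter.

```python
def make_revid_batches(rev_ids, batch_size=50):
    """Split rev_ids into '|'-joined strings of at most batch_size IDs each."""
    batches = []
    # Stepping by batch_size never produces an empty trailing batch
    for start in range(0, len(rev_ids), batch_size):
        chunk = rev_ids[start:start + batch_size]
        batches.append("|".join(map(str, chunk)))
    return batches


# Made-up revision IDs, for illustration only
example_batches = make_revid_batches([101, 102, 103, 104, 105], batch_size=2)
print(example_batches)
```

Each string in `example_batches` would be appended to the `revids=` parameter of one request.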

Set `run_queries` to `True` to call the REST API for predictions; otherwise, we load the saved responses that already exist in the repository.

In [8]:
run_queries = False

if run_queries:
    results = get_predictions_from_df(politicians_dataset_cleaned)
else:
    with open("../data-cleaned/api_responses.pickle","rb") as file:
        results = pickle.load(file)


Here we extract the predicted quality class from the responses and join it with the politicians dataset

In [10]:
def extract_probability(rev_id):
    rev_id_str = str(rev_id)
    if 'score' in results[rev_id_str]['articlequality']:
        return results[rev_id_str]['articlequality']['score']['prediction']
    else:
        return "NaN"
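
As a rough illustration of the lookup above, the sketch below runs a defensive variant on a mocked response. The `articlequality` → `score` → `prediction` path follows the notebook; the shape of the failed entry (an `error` key) and the helper name `extract_prediction` are assumptions made for this example.

```python
# Mock of the parsed responses (made-up values). The 'score' -> 'prediction'
# path follows the notebook; the 'error' entry is an assumed shape.
mock_results = {
    "1001": {"articlequality": {"score": {"prediction": "Stub"}}},
    "1002": {"articlequality": {"error": {"message": "revision not found"}}},
}


def extract_prediction(rev_id, results):
    """Return the predicted quality class, or None when no score is present."""
    entry = results.get(str(rev_id), {}).get("articlequality", {})
    score = entry.get("score")
    return score["prediction"] if score else None


print(extract_prediction(1001, mock_results))
print(extract_prediction(1002, mock_results))
print(extract_prediction(9999, mock_results))
```

Returning `None` instead of the string `"NaN"` would let pandas treat the value as missing, so `dropna` could replace the string comparison used later. That is a design alternative, not what this notebook does.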

In [11]:
politicians_dataset_cleaned['prediction'] = politicians_dataset_cleaned['rev_id'].apply(extract_probability)



In [12]:
politicians_dataset_cleaned['prediction'].unique()

array(['Stub', 'Start', 'C', 'NaN', 'B', 'GA', 'FA'], dtype=object)

Log the pages that have no prediction, then filter them out

In [13]:
print("These are the pages that do not have predictions:")
politicians_dataset_cleaned[politicians_dataset_cleaned['prediction'] == 'NaN']

These are the pages that do not have predictions:


| | page | country | rev_id | prediction |
|---|---|---|---|---|
| 126 | List of politicians in Poland | Poland | 516633096 | NaN |
| 222 | Tingtingru | Vanuatu | 550682925 | NaN |
| 330 | Daud Arsala | Afghanistan | 627547024 | NaN |
| 359 | Book:Two Political Biographies | India | 636911471 | NaN |
| 514 | Dilaver Bey | Turkey | 669987106 | NaN |
| ... | ... | ... | ... | ... |
| 46782 | John Rose (Trotskyist) | United Kingdom | 807336308 | NaN |
| 46862 | Jalal Movaghar | Iran | 807367030 | NaN |
| 46863 | Mohsen Movaghar | Iran | 807367166 | NaN |
| 47182 | King Gutierrez | Philippines | 807479587 | NaN |


In [14]:
politicians_dataset_cleaned = politicians_dataset_cleaned[politicians_dataset_cleaned['prediction'] != 'NaN']

In [15]:
politicians_dataset_cleaned['prediction'].unique()

array(['Stub', 'Start', 'C', 'B', 'GA', 'FA'], dtype=object)

## Step 4: Combining datasets

In [16]:
joined_df = pd.merge(
    politicians_dataset_cleaned, 
    world_population_dataset_cleaned_country,
    left_on=['country'], 
    right_on=['Name'],
    how='outer',
    indicator=True
)

joined_df = joined_df.rename(columns = {
    "page" : "article_name", 
    "rev_id":"revision_id", 
    "prediction":"article_quality_est.", 
    "Population":"population"
})[['country', 'article_name', 'revision_id', 'article_quality_est.', 'population', '_merge']]


Get and save entries that have no match to `data-combined/wp_wpds_countries-no_match.csv`

In [17]:
no_match = joined_df[joined_df['_merge'] != 'both'].drop(columns=['_merge'])

In [18]:
no_match.to_csv('../data-combined/wp_wpds_countries-no_match.csv')

Get and save the rest of the entries to `data-combined/wp_wpds_politicians_by_country.csv`

In [19]:
joined_df = joined_df[joined_df['_merge'] == 'both'].drop(columns=['_merge'])

In [20]:
joined_df.to_csv('../data-combined/wp_wpds_politicians_by_country.csv')
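
The matched/unmatched split above can be seen on a small example. The rows below are made up; only the column names (`country`, `page`, `Name`, `Population`) and the `indicator=True` pattern follow the notebook.

```python
import pandas as pd

# Toy inputs (hypothetical rows); column names follow the notebook
articles = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Atlantis"],
    "page": ["Jane Doe", "John Roe", "Mira Poe"],
})
populations = pd.DataFrame({
    "Name": ["Freedonia", "Sylvania"],
    "Population": [1000, 2000],
})

# The outer join keeps every row; _merge records where each row came from
merged = pd.merge(
    articles, populations,
    left_on="country", right_on="Name",
    how="outer", indicator=True,
)

matched = merged[merged["_merge"] == "both"].drop(columns=["_merge"])
unmatched = merged[merged["_merge"] != "both"].drop(columns=["_merge"])

print(matched)
print(unmatched)
```

Rows that exist only in `populations` have an empty `country`, because their name lives in the `Name` column. Keeping `Name` in the no-match output is one way to preserve which population entries had no articles.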

## Step 5: Analysis

In this step we will calculate the metrics used for the analysis:
* The number of articles and the number of high-quality articles (FA or GA), each as a proportion of population, for each country
* The same two proportions for each geographic region

To do that, we will create 4 dataframes:
* `country_high_quality_proportion`: High-quality articles per population for each country
* `country_all_articles_proportion`: Articles per population for each country
* `region_high_quality_proportion`: High-quality articles per population for each geographic region
* `region_all_article_proportion`: Articles per population for each geographic region

The values are stored as raw ratios (article count divided by population), not multiplied by 100.


In [21]:
high_quality_df = joined_df[(joined_df['article_quality_est.'] == 'FA') | (joined_df['article_quality_est.'] == 'GA')]

In [22]:
country_population = joined_df[['country', 'population']].drop_duplicates()

In [23]:
country_high_quality_articles_count = high_quality_df.groupby(['country']).size().reset_index(name='high_quailty_articles')

In [24]:
country_all_articles_count = joined_df.groupby(['country']).size().reset_index(name='all_articles')

Calculating the proportion of high-quality articles per population for each country

In [25]:
country_high_quality_proportion = pd.merge(
    country_population, 
    country_high_quality_articles_count,
    on=['country'], 
    how='left'
).fillna(0)

country_high_quality_proportion['high_quality_proportion'] = country_high_quality_proportion['high_quailty_articles'] / country_high_quality_proportion['population']

Calculating the proportion of articles-per-population for each country

In [26]:
country_all_articles_proportion = pd.merge(
    country_population, 
    country_all_articles_count,
    on=['country'], 
    how='left'
).fillna(0)

# Divide by the population column of the same dataframe
country_all_articles_proportion['all_article_proportion'] = country_all_articles_proportion['all_articles'] / country_all_articles_proportion['population']
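
The two country-level ratios can be traced by hand on a toy table. The rows below are made up. The sketch also shows, as an assumption about how the "relative quality" wording in Step 6 could be read, the share of a country's articles that are FA or GA; the notebook itself divides high-quality counts by population.

```python
import pandas as pd

# Toy merged data (hypothetical rows), one row per article
toy = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Freedonia", "Sylvania"],
    "article_quality_est.": ["GA", "Stub", "FA", "Start"],
    "population": [1000, 1000, 1000, 4000],
})

all_articles = toy.groupby("country").size()
population = toy.groupby("country")["population"].first()

# Countries without any FA/GA article get a count of 0
is_high_quality = toy["article_quality_est."].isin(["FA", "GA"])
high_quality = (
    toy[is_high_quality].groupby("country").size()
    .reindex(all_articles.index, fill_value=0)
)

summary = pd.DataFrame({
    "population": population,
    "all_articles": all_articles,
    "high_quality_articles": high_quality,
})
# Per-capita coverage, expressed as a percentage
summary["articles_per_capita_pct"] = 100 * summary["all_articles"] / summary["population"]
# Alternative quality metric: share of a country's articles that are FA/GA
summary["high_quality_per_article_pct"] = 100 * summary["high_quality_articles"] / summary["all_articles"]

print(summary)
```

Dividing by the article count rather than by population removes the direct dependence on population size, which matters for how the quality rankings are interpreted.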

Here we calculate the country to sub-region mapping

In [27]:
regions = world_population_dataset_cleaned_sub_region['Name'].tolist()

In [28]:
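def find_region(country, population_df=world_population_dataset):
    # Region rows precede their countries in the population sheet, so walk
    # upward from the country's row until a region name is found
    index = population_df.index[population_df['Name'] == country].tolist()[0]
    while population_df.iloc[index]['Name'] not in regions:
        index -= 1
    
    return population_df.iloc[index]['Name']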

# Create a new column for sub region
distinct_countries = joined_df[['country']].drop_duplicates()
distinct_countries['region'] = distinct_countries['country'].apply(find_region)

joined_df_with_region = pd.merge(
    joined_df, 
    distinct_countries,
    on=['country'], 
    how='left'
)
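
The row-walking lookup above depends on the sheet listing each region before its countries. Under that same assumption, a vectorized alternative is to forward-fill the region names. This is a sketch on made-up rows, not the method used in this notebook.

```python
import pandas as pd

# Toy population sheet (hypothetical rows): region rows in capitals,
# each followed by the countries that belong to it
sheet = pd.DataFrame({
    "Name": ["NORTHERN AFRICA", "Algeria", "Egypt", "WESTERN AFRICA", "Benin"],
})

is_region = sheet["Name"].str.isupper()
# Keep the name on region rows, blank it elsewhere, then carry it downward
sheet["region"] = sheet["Name"].where(is_region).ffill()

# Map each country row to the closest region row above it
country_to_region = sheet.loc[~is_region].set_index("Name")["region"].to_dict()
print(country_to_region)
```

The resulting dictionary could be applied with `Series.map` on the `country` column in place of calling a lookup function once per country.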


In [29]:
# calculate region population
# Sum only the numeric population column so the country names are not aggregated
region_population = joined_df_with_region[['country', 'region', 'population']].drop_duplicates().groupby(['region'])[['population']].sum()

In [30]:
region_high_quality_df = joined_df_with_region[(joined_df_with_region['article_quality_est.'] == 'FA') | (joined_df_with_region['article_quality_est.'] == 'GA')]

In [31]:
region_high_quality_articles_count = region_high_quality_df.groupby(['region']).size().reset_index(name='high_quailty_articles')

In [32]:
region_all_articles_count = joined_df_with_region.groupby(['region']).size().reset_index(name='all_articles')

Calculating the proportion of articles-per-population and high-quality articles for each region

In [33]:
region_high_quality_proportion = pd.merge(
    region_population, 
    region_high_quality_articles_count,
    on=['region'], 
    how='left'
).fillna(0)

region_high_quality_proportion['high_quality_proportion'] = region_high_quality_proportion['high_quailty_articles'] / region_high_quality_proportion['population']

In [34]:
region_all_article_proportion = pd.merge(
    region_population, 
    region_all_articles_count,
    on=['region'], 
    how='left'
).fillna(0)

region_all_article_proportion['all_articles_proportion'] = region_all_article_proportion['all_articles'] / region_all_article_proportion['population']

## Step 6: Results

1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [35]:
country_all_articles_proportion.sort_values(by=['all_article_proportion'], ascending=False)[0:10]

| | country | population | all_articles | all_article_proportion |
|---|---|---|---|---|
| 99 | Tuvalu | 10000.0 | 54 | 0.0054 |
| 150 | Nauru | 11000.0 | 52 | 0.004727 |
| 41 | San Marino | 34000.0 | 81 | 0.002382 |
| 65 | Monaco | 38000.0 | 40 | 0.001053 |
| 98 | Liechtenstein | 39000.0 | 28 | 0.000718 |
| 105 | Marshall Islands | 57000.0 | 37 | 0.000649 |
| 87 | Tonga | 99000.0 | 63 | 0.000636 |
| 68 | Iceland | 368000.0 | 201 | 0.000546 |
| 169 | Andorra | 82000.0 | 34 | 0.000415 |
| 174 | Federated States of Micronesia | 106000.0 | 36 | 0.00034 |


2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [36]:
country_all_articles_proportion.sort_values(by=['all_article_proportion'])[0:10]

| | country | population | all_articles | all_article_proportion |
|---|---|---|---|---|
| 7 | India | 1400100000.0 | 968 | 6.913792e-07 |
| 60 | Indonesia | 271739000.0 | 209 | 7.691204e-07 |
| 22 | China | 1402385000.0 | 1129 | 8.050571e-07 |
| 151 | Uzbekistan | 34174000.0 | 28 | 8.193363e-07 |
| 106 | Ethiopia | 114916000.0 | 101 | 8.789029e-07 |
| 181 | Zambia | 18384000.0 | 25 | 1.359878e-06 |
| 165 | Korea, North | 25779000.0 | 36 | 1.396486e-06 |
| 126 | Thailand | 66534000.0 | 112 | 1.68335e-06 |
| 125 | Mozambique | 31166000.0 | 58 | 1.861002e-06 |
| 116 | Bangladesh | 169809000.0 | 317 | 1.866803e-06 |


3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of GA- and FA-quality politician articles, measured here as a proportion of country population

In [37]:
country_high_quality_proportion.sort_values(by=['high_quality_proportion'], ascending=False)[0:10]

| | country | population | high_quailty_articles | high_quality_proportion |
|---|---|---|---|---|
| 99 | Tuvalu | 10000.0 | 4.0 | 0.0004 |
| 175 | Dominica | 72000.0 | 1.0 | 1.4e-05 |
| 121 | Vanuatu | 321000.0 | 3.0 | 9e-06 |
| 68 | Iceland | 368000.0 | 2.0 | 5e-06 |
| 33 | Ireland | 5003000.0 | 25.0 | 5e-06 |
| 123 | Montenegro | 622000.0 | 2.0 | 3e-06 |
| 138 | Martinique | 356000.0 | 1.0 | 3e-06 |
| 124 | Bhutan | 730000.0 | 2.0 | 3e-06 |
| 58 | New Zealand | 4987000.0 | 13.0 | 3e-06 |
| 47 | Romania | 19241000.0 | 42.0 | 2e-06 |


4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of GA- and FA-quality politician articles, measured here as a proportion of country population

In [38]:
country_high_quality_proportion.sort_values(by=['high_quality_proportion'])[0:10]

| | country | population | high_quailty_articles | high_quality_proportion |
|---|---|---|---|---|
| 182 | Seychelles | 98000.0 | 0.0 | 0.0 |
| 24 | Angola | 32522000.0 | 0.0 | 0.0 |
| 135 | Estonia | 1331000.0 | 0.0 | 0.0 |
| 119 | Kiribati | 125000.0 | 0.0 | 0.0 |
| 30 | Finland | 5529000.0 | 0.0 | 0.0 |
| 34 | Tunisia | 11896000.0 | 0.0 | 0.0 |
| 105 | Marshall Islands | 57000.0 | 0.0 | 0.0 |
| 100 | Sao Tome and Principe | 210000.0 | 0.0 | 0.0 |
| 137 | Costa Rica | 5111000.0 | 0.0 | 0.0 |
| 98 | Liechtenstein | 39000.0 | 0.0 | 0.0 |


5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [39]:
region_all_article_proportion.sort_values(by=['all_articles_proportion'], ascending=False)

| | region | population | all_articles | all_articles_proportion |
|---|---|---|---|---|
| 10 | OCEANIA | 42031000.0 | 3126 | 7.4e-05 |
| 9 | NORTHERN EUROPE | 105680000.0 | 3763 | 3.6e-05 |
| 15 | SOUTHERN EUROPE | 151136000.0 | 3710 | 2.5e-05 |
| 18 | WESTERN EUROPE | 195479000.0 | 4560 | 2.3e-05 |
| 0 | CARIBBEAN | 39056000.0 | 695 | 1.8e-05 |
| 5 | EASTERN EUROPE | 281186000.0 | 3732 | 1.3e-05 |
| 14 | SOUTHERN AFRICA | 66628000.0 | 634 | 1e-05 |
| 1 | CENTRAL AMERICA | 162267000.0 | 1543 | 1e-05 |
| 17 | WESTERN ASIA | 272499000.0 | 2563 | 9e-06 |
| 6 | MIDDLE AFRICA | 90189000.0 | 665 | 7e-06 |


6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of GA- and FA-quality politician articles from countries in each region, measured here as a proportion of total regional population

In [40]:
region_high_quality_proportion.sort_values(by=['high_quality_proportion'], ascending=False)

| | region | population | high_quailty_articles | high_quality_proportion |
|---|---|---|---|---|
| 10 | OCEANIA | 42031000.0 | 63 | 1.498894e-06 |
| 9 | NORTHERN EUROPE | 105680000.0 | 102 | 9.651779e-07 |
| 15 | SOUTHERN EUROPE | 151136000.0 | 74 | 4.896252e-07 |
| 5 | EASTERN EUROPE | 281186000.0 | 118 | 4.19651e-07 |
| 0 | CARIBBEAN | 39056000.0 | 13 | 3.328554e-07 |
| 17 | WESTERN ASIA | 272499000.0 | 89 | 3.266067e-07 |
| 18 | WESTERN EUROPE | 195479000.0 | 56 | 2.864758e-07 |
| 8 | NORTHERN AMERICA | 368068000.0 | 104 | 2.825565e-07 |
| 6 | MIDDLE AFRICA | 90189000.0 | 16 | 1.774052e-07 |
| 1 | CENTRAL AMERICA | 162267000.0 | 23 | 1.417417e-07 |


## Reflections and Implications


My initial question when I saw this analysis approach was whether the number of well-known, or at least well-documented, politicians correlates with population at all. This assumes that only politicians who are sufficiently well known will have authors write about them on Wikipedia, regardless of article quality. It is not obvious that a country with a larger population will have more politicians. Other factors could affect the number of politicians in each country more strongly, such as the length of the country's history (its age) and its government structure. For example, countries with a long history, like France and the UK, will have far more well-known politicians than newer countries such as Singapore. These factors can tell a different story than population does.

Therefore, the biases I expected before the analysis were:
* The rate of politician articles per population will be biased towards countries with small populations, i.e. less populous countries will have higher rates.
* Countries with a longer and richer history will have a higher rate of politician articles per population.
* Countries that receive more attention in the world will have a higher rate of high-quality articles.
* Countries that are likely to have more writers (UK, US) will have a higher rate of high-quality articles.

In the end, the results support the first point, but none of the other 3 points were reflected in the analysis. The countries at the top of both the all-article and the high-quality-article per-population rankings are countries with very small populations. The differences in population dominated the rates and made the actual article counts insignificant. For example, Tuvalu had only 54 articles but topped the list with the highest rate, while India and China, with around 1,000 articles each, ranked among the countries with the lowest rates.

I will try to explain why the other 3 expected biases did not show up in the analysis. Population was such a dominant factor that the other factors may have been too weak to show. For example, we do see large article counts for France, Australia, China, and India. These are among the countries with the most articles, which is consistent with the length and richness of their histories. The fourth point also holds when we look at the countries with the most high-quality articles: the US and the UK take the top 2 spots. I have attached the 2 tables below.

In addition to checking my assumptions, I also found a bias between English-speaking and non-English-speaking regions. In both the article count per population and the high-quality article count per population for each region, Europe and North America stood out, whereas non-English-speaking regions like Asia and Africa fell to the bottom. This makes sense, since authors are more likely to focus on countries that speak their own language and to write about politicians in their own countries.

I do think that if any business or research project relied on the assumption made in this analysis, which ties the count of articles to population regardless of quality, it would end up with findings that are heavily skewed by population. Countries or regions with smaller populations will receive significantly more attention, since population is such a dominant factor. If they really want to build a case on this, I would suggest reducing the impact of population by taking its logarithm, or otherwise log-normalizing the population.
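
As a rough illustration of the log suggestion, the sketch below compares the per-capita rate with a rate that divides by `log10(population)`. The article counts and populations are copied from the result tables in this notebook; dividing by `log10(population)` is just one possible choice made for this example, not an established metric.

```python
import math

# (articles, population) pairs copied from the result tables in this notebook
examples = {
    "Tuvalu": (54, 10_000),
    "India": (968, 1_400_100_000),
}

scores = {}
for country, (articles, population) in examples.items():
    per_capita = articles / population
    # Damp the effect of population by using its base-10 logarithm
    per_log_population = articles / math.log10(population)
    scores[country] = (per_capita, per_log_population)
    print(f"{country}: per capita={per_capita:.2e}, "
          f"per log10(population)={per_log_population:.1f}")
```

With the log-scaled denominator India ranks above Tuvalu, the reverse of the per-capita ordering, which shows how strongly the choice of normalization drives the rankings.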

In [44]:
country_all_articles_proportion.sort_values(by=['all_articles'], ascending=False)[0:10]

| | country | population | all_articles | all_article_proportion |
|---|---|---|---|---|
| 43 | France | 64940000.0 | 1672 | 2.574684e-05 |
| 109 | Australia | 25754000.0 | 1559 | 6.053429e-05 |
| 22 | China | 1402385000.0 | 1129 | 8.050571e-07 |
| 73 | Mexico | 127792000.0 | 1075 | 8.412107e-06 |
| 6 | United States | 329878000.0 | 1062 | 3.219372e-06 |
| 5 | Pakistan | 220940000.0 | 1019 | 4.612112e-06 |
| 7 | India | 1400100000.0 | 968 | 6.913792e-07 |
| 28 | Russia | 146733000.0 | 874 | 5.956397e-06 |
| 17 | Spain | 47635000.0 | 871 | 1.828487e-05 |
| 32 | United Kingdom | 67160000.0 | 855 | 1.273079e-05 |


In [45]:
country_high_quality_proportion.sort_values(by=['high_quailty_articles'], ascending=False)[0:10]

| | country | population | high_quailty_articles | high_quality_proportion |
|---|---|---|---|---|
| 6 | United States | 329878000.0 | 80.0 | 2.425139e-07 |
| 32 | United Kingdom | 67160000.0 | 56.0 | 8.338297e-07 |
| 47 | Romania | 19241000.0 | 42.0 | 2.182839e-06 |
| 22 | China | 1402385000.0 | 40.0 | 2.852284e-08 |
| 17 | Spain | 47635000.0 | 38.0 | 7.977328e-07 |
| 109 | Australia | 25754000.0 | 38.0 | 1.475499e-06 |
| 28 | Russia | 146733000.0 | 33.0 | 2.248983e-07 |
| 43 | France | 64940000.0 | 26.0 | 4.003696e-07 |
| 33 | Ireland | 5003000.0 | 25.0 | 4.997002e-06 |
| 3 | Canada | 38190000.0 | 24.0 | 6.284368e-07 |
