# Data 512 A2 Bias

# Step 1 & Step 2 
## Getting the Article and Population Data, and Cleaning the Data

In this section, we read in the two datasets that we will be working with in this project.
- page_data.csv: English Wikipedia pages about politicians, listed by country.
- WPDS_2020_data.csv: Population records for countries and regions (the 'TimeFrame' column in the file reads 2019).

To clean the datasets, we took the following steps:
- The page_data.csv dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, so we removed them.
- The WPDS_2020_data.csv dataset contains some rows that provide cumulative regional population counts. These rows are distinguished by having ALL CAPS values in the 'Name' field. We removed these ALL CAPS rows from a cleaned copy of the data. However, we kept these records in the original dataset for the regional analysis later on.

In [1]:
# Read in necessary packages
import json
import requests
import pandas as pd
import numpy as np
from itertools import product
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

In [2]:
# Read in page_data.csv
page_data = pd.read_csv('page_data.csv')
# Remove page names that start with the string "Template:"
page_data_cleaned = page_data.loc[~(page_data['page'].str.startswith('Template:')),:]
page_data_cleaned.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [3]:
# Read in WPDS_2020_data.csv
WPDS_2020_data = pd.read_csv('WPDS_2020_data.csv')
# Temporarily remove ALL CAPS values
WPDS_2020_data_cleaned = WPDS_2020_data.loc[~WPDS_2020_data['Name'].str.isupper(),:]
WPDS_2020_data_cleaned.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


# Step 3
## Getting Article Quality Predictions
In this section, we get the predicted quality class for each article in the Wikipedia dataset. We used ORES ("Objective Revision Evaluation Service"), a machine learning service that provides estimates of Wikipedia article quality.

The article quality estimates are, from best to worst:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article
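
To illustrate how a predicted class can be read out of an ORES-style response, here is a minimal sketch. The helper name `extract_prediction` and both sample dictionaries are hypothetical: they only mimic the nesting that the loop in this notebook relies on, and they are not real API output.

```python
# Hypothetical helper: safely pull the predicted class out of an ORES-style response.
def extract_prediction(response, rev_id, wiki="enwiki", model="articlequality"):
    """Return the predicted quality class, or None if the response has no score."""
    try:
        return response[wiki]["scores"][str(rev_id)][model]["score"]["prediction"]
    except (KeyError, TypeError):
        # A missing key (for example an "error" entry instead of "score") means no prediction.
        return None


# Made-up sample responses that mimic the nesting used in this notebook.
sample_ok = {
    "enwiki": {"scores": {"355319463": {"articlequality": {"score": {"prediction": "Stub"}}}}}
}
sample_error = {
    "enwiki": {"scores": {"123": {"articlequality": {"error": {"type": "RevisionNotFound"}}}}}
}

print(extract_prediction(sample_ok, 355319463))
print(extract_prediction(sample_error, 123))
```

Returning `None` instead of raising keeps the calling loop simple: the caller can collect the revisions without a prediction and write them out separately.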


In [4]:
# Create function for API call
def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

In [5]:
# Define parameter for API calls
endpoint_legacy = 'https://ores.wikimedia.org/v3/scores/enwiki/{revids}/articlequality'
headers = {
    'User-Agent': 'https://github.com/zihuiz',
    'From': 'zihuiz@uw.edu'
}

In [6]:
# Create list to store quality score 
m,n = page_data_cleaned.shape

# article_quality_est list stores the quality score
article_quality_est = []

# no_est_index list stores the indices of articles that do not have a prediction score
no_est_index = []

In [7]:
for i in tqdm(range(m)):
    # Extract the current row, which contains the page info
    curr_page = page_data_cleaned.iloc[i,:]
    
    # Extract rev_id for the page
    curr_revid = curr_page['rev_id']
    # Set API parameter
    para_temp = {'revids': str(curr_revid)}
    # Receive information from API
    curr_est = api_call(endpoint_legacy, para_temp)
    curr_pred = 'NoPrediction'
    # Try to extract prediction for the page
    try:
        curr_pred = curr_est['enwiki']['scores'][str(curr_revid)]['articlequality']['score']['prediction']
    except (KeyError, TypeError):
        # Record the index if no prediction is available, and continue with the next page
        no_est_index.append(i)
    article_quality_est.append(curr_pred)





In [36]:
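# Append the quality score to the page_data dataset (assign returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the filtered frame)
page_data_cleaned = page_data_cleaned.assign(article_quality_est=article_quality_est)
page_data_with_pred = page_data_cleaned.loc[page_data_cleaned['article_quality_est']!='NoPrediction',:]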
page_data_with_pred.head()

Unnamed: 0,page,country,rev_id,article_quality_est
1,Bir I of Kanem,Chad,355319463,Stub
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
12,Yos Por,Cambodia,393822005,Stub
23,Julius Gregr,Czech Republic,395521877,Stub
24,Edvard Gregr,Czech Republic,395526568,Stub


In [42]:
# Save pages that do not have a prediction score to wp_no_prediction.csv
page_data_no_pred = page_data_cleaned.loc[page_data_cleaned['article_quality_est']=='NoPrediction',]
page_data_no_pred.to_csv('wp_no_prediction.csv',index=False)

# Step 4
## Combining the Datasets
In this section, we merge the Wikipedia data and the population data using the country names, which both datasets contain. Articles whose country has no match in the WPDS dataset are stored in a separate dataset and written to a CSV file called wp_wpds_countries-no_match.csv.

The resulting dataset has the following attributes (column names as they appear in the code):
- country: The country of the article
- page: The name of the article
- rev_id: The revision id of the article used to generate the quality prediction
- article_quality_est: The predicted quality class
- Population: The population of the country

Lastly, we consolidate the remaining data into a single CSV file, which the code below writes as wp_wpds_countries_by_country.csv.
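
As a small illustration of splitting matched and unmatched rows after a left merge, the sketch below uses pandas' `indicator=True` option on toy data. This is an alternative to checking for missing populations, not the approach used in the cell that follows; the "Atlantis" row is made up to force a non-match.

```python
import pandas as pd

# Toy article data; "Atlantis" is a made-up country with no population record.
articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Yos Por", "Example Person"],
    "country": ["Chad", "Cambodia", "Atlantis"],
})
population = pd.DataFrame({
    "Name": ["Chad", "Cambodia"],
    "Population": [16877000, 15497000],
})

# indicator=True adds a "_merge" column that says where each row came from.
merged = articles.merge(population, how="left", left_on="country",
                        right_on="Name", indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns=["_merge", "Name"])
no_match = merged[merged["_merge"] == "left_only"].drop(columns=["_merge", "Name"])

print(matched)
print(no_match)
```

The `_merge` column makes the split explicit, which can be easier to audit than relying on missing values in a merged column.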

In [56]:
# Match the country population to the page_data data set based on the country name
wp_wpds_countries = page_data_with_pred.merge(WPDS_2020_data, how='left', left_on='country', right_on='Name')
wp_wpds_countries_no_match = wp_wpds_countries[wp_wpds_countries['Population'].isna()]
wp_wpds_countries_no_match = wp_wpds_countries_no_match.loc[:,['country','page','rev_id',
                                                               'article_quality_est','Population']]
wp_wpds_countries_no_match.to_csv('wp_wpds_countries-no_match.csv',index=False)
wp_wpds_countries_by_country = wp_wpds_countries[wp_wpds_countries['Population'].notna()]
wp_wpds_countries_by_country = wp_wpds_countries_by_country.loc[:,['country','page','rev_id',
                                                                   'article_quality_est','Population']]
wp_wpds_countries_by_country.to_csv('wp_wpds_countries_by_country.csv',index=False)

In [55]:
wp_wpds_countries_by_country.head()

Unnamed: 0,country,page,rev_id,article_quality_est,Population
0,Chad,Bir I of Kanem,355319463,Stub,16877000.0
1,Palestinian Territory,Information Minister of the Palestinian Nation...,393276188,Stub,5008000.0
2,Cambodia,Yos Por,393822005,Stub,15497000.0
5,Canada,Robert Douglas Cook,401577829,Stub,38190000.0
6,Egypt,List of Grand Viziers of Egypt,442937236,Stub,100803000.0


# Step 5
## Analysis
In this section, we calculate two metrics at both the country level and the sub-region level:
- articles-per-population: The number of politician articles as a percentage of the country's population
- high-quality: The percentage of politician articles that are of GA or FA quality.
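
As a worked example of the two metrics, the sketch below recomputes them for two countries, using the population and article counts reported in the result tables of Step 6. The DataFrame name `example` is only for illustration.

```python
import pandas as pd

# Counts and populations copied from the result tables in Step 6.
example = pd.DataFrame({
    "Name": ["Korea, North", "Uzbekistan"],
    "Population": [25779000, 34174000],
    "total_article": [36, 28],
    "total_good_article": [8, 3],
})

# Politician articles as a percentage of the population.
example["articles-per-population"] = 100 * example["total_article"] / example["Population"]

# GA/FA articles as a percentage of all politician articles.
example["high-quality"] = 100 * example["total_good_article"] / example["total_article"]

print(example[["Name", "articles-per-population", "high-quality"]])
```

Both metrics are percentages, but they answer different questions: the first measures coverage relative to population, the second measures quality relative to coverage.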


In [249]:
# Match countries to sub-regions (e.g. East Asia, Eastern Europe)
sub_region_index = WPDS_2020_data.index[WPDS_2020_data['Name'].str.isupper()]
sub_region_match = []
for i in range(WPDS_2020_data.shape[0]):
    curr_sub_region_index = max(sub_region_index[sub_region_index<=i])
    curr_sub_region = WPDS_2020_data.loc[curr_sub_region_index, 'Name']
    sub_region_match.append(curr_sub_region)
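
The same "most recent ALL CAPS row above" lookup can also be sketched without an explicit loop by forward-filling the region names. This is an alternative formulation shown on toy rows, assuming (as in the WPDS file) that each region row appears directly above its countries.

```python
import pandas as pd

# Toy rows in the same layout as the WPDS file: region rows in ALL CAPS,
# followed by the countries that belong to them.
toy = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin"],
})

# Keep the name only on region rows, then carry it forward onto the rows below.
is_region = toy["Name"].str.isupper()
toy["sub_region"] = toy["Name"].where(is_region).ffill()

print(toy)
```

As in the loop version, a region row is matched to itself, and each country is matched to the nearest region row above it.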

In [192]:
# Match countries to larger sub-regions (e.g. Asia, Europe)
# WORLD - 0
# AFRICA - 1:63
# NORTHERN AMERICA - 64:66
# LATIN AMERICA AND THE CARIBBEAN - 67:94
# SOUTH AMERICA - 95:108
# ASIA - 109:165
# EUROPE - 166:215
# OCEANIA - 216:
country_size = WPDS_2020_data.shape[0]
larger_sub_region_match = []
larger_sub_region_match.append('WORLD')
larger_sub_region_match.extend(['AFRICA' for _ in range(1,64)])
larger_sub_region_match.extend(['NORTHERN AMERICA' for _ in range(64,67)])
larger_sub_region_match.extend(['LATIN AMERICA AND THE CARIBBEAN' for _ in range(67,95)])
larger_sub_region_match.extend(['SOUTH AMERICA' for _ in range(95,109)])
larger_sub_region_match.extend(['ASIA' for _ in range(109,166)])
larger_sub_region_match.extend(['EUROPE' for _ in range(166,216)])
larger_sub_region_match.extend(['OCEANIA' for _ in range(216,country_size)])

In [196]:
sub_region_match_df = pd.DataFrame(data={'country':WPDS_2020_data['Name'].values,
                                         'sub_region':sub_region_match, 
                                         'larger_sub_region':larger_sub_region_match})
wp_wpds_countries_by_country_subregion = wp_wpds_countries_by_country.merge(sub_region_match_df, 
                                                                            how='left', on='country')

In this step, we group the data by country and count the politician articles (all articles, and FA/GA articles only).

In [210]:
total_page = wp_wpds_countries_by_country_subregion.groupby('country').count()
wp_wpds_countries_by_country_subregion_A = \
wp_wpds_countries_by_country_subregion.loc[(wp_wpds_countries_by_country_subregion['article_quality_est'] == 'FA') | 
                                           (wp_wpds_countries_by_country_subregion['article_quality_est'] == 'GA'),:]
total_page_A = wp_wpds_countries_by_country_subregion_A.groupby('country').count()

page_count = total_page.merge(total_page_A, how='left', left_index=True, right_index=True)
page_count = page_count.fillna(0)
page_count_w_population = page_count.merge(WPDS_2020_data, how='left', left_index=True, right_on='Name')
page_count_w_population = page_count_w_population[['Name', 'article_quality_est_x','article_quality_est_y',
                                                   'Population']]
page_count_w_population = page_count_w_population.rename(columns={'article_quality_est_x':'total_article', 
                                                                'article_quality_est_y':'total_good_article'})

page_count_w_population['articles-per-population'] = \
100*page_count_w_population['total_article']/page_count_w_population['Population']

page_count_w_population['A-articles-per-population'] = \
100*page_count_w_population['total_good_article']/page_count_w_population['Population']

page_count_w_population['high-quality'] = \
100*page_count_w_population['total_good_article']/page_count_w_population['total_article']

In this step, we group the data by region and count the politician articles (all articles, and FA/GA articles only).

In [235]:
total_page_rg = wp_wpds_countries_by_country_subregion.groupby('sub_region').count()
total_page_rg_A = wp_wpds_countries_by_country_subregion_A.groupby('sub_region').count()
page_count_rg = total_page_rg.merge(total_page_rg_A, how='left', left_index=True, right_index=True)
page_count_rg_w_population = page_count_rg.merge(WPDS_2020_data, how='left', left_index=True, right_on='Name')
page_count_rg_w_population = page_count_rg_w_population[['Name', 'article_quality_est_x',
                                                         'article_quality_est_y','Population']]

total_page_lrg = wp_wpds_countries_by_country_subregion.groupby('larger_sub_region').count()
total_page_lrg_A = wp_wpds_countries_by_country_subregion_A.groupby('larger_sub_region').count()
page_count_lrg = total_page_lrg.merge(total_page_lrg_A, how='left', left_index=True, right_index=True)
page_count_lrg_w_population = page_count_lrg.merge(WPDS_2020_data, how='left', left_index=True, right_on='Name')
page_count_lrg_w_population = page_count_lrg_w_population[['Name', 'article_quality_est_x',
                                                         'article_quality_est_y','Population']]

page_count_sbrg_w_population = pd.concat([page_count_rg_w_population, page_count_lrg_w_population])\
                                .drop_duplicates().reset_index(drop=True)
page_count_sbrg_w_population = page_count_sbrg_w_population.rename(columns={\
                                            'article_quality_est_x':'total_article',
                                            'article_quality_est_y':'total_good_article'})
page_count_sbrg_w_population['articles-per-population'] = \
100*page_count_sbrg_w_population['total_article']/page_count_sbrg_w_population['Population']

page_count_sbrg_w_population['A-articles-per-population'] = \
100*page_count_sbrg_w_population['total_good_article']/page_count_sbrg_w_population['Population']

page_count_sbrg_w_population['high-quality'] = \
100*page_count_sbrg_w_population['total_good_article']/page_count_sbrg_w_population['total_article']

# Step 6
In this section, we present the results of our analysis.

### Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [242]:
page_count_w_population.sort_values(by='articles-per-population',ascending=False,ignore_index=True)\
[['Name','Population','total_article','articles-per-population']].head(10)

Unnamed: 0,Name,Population,total_article,articles-per-population
0,Tuvalu,10000,54,0.54
1,Nauru,11000,52,0.472727
2,San Marino,34000,81,0.238235
3,Monaco,38000,40,0.105263
4,Liechtenstein,39000,28,0.071795
5,Marshall Islands,57000,37,0.064912
6,Tonga,99000,63,0.063636
7,Iceland,368000,201,0.05462
8,Andorra,82000,34,0.041463
9,Federated States of Micronesia,106000,36,0.033962


### Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [247]:
page_count_w_population.sort_values(by='articles-per-population',ascending=False,ignore_index=True)\
[['Name','Population','total_article','articles-per-population']].tail(10)

Unnamed: 0,Name,Population,total_article,articles-per-population
173,Bangladesh,169809000,317,0.000187
174,Mozambique,31166000,58,0.000186
175,Thailand,66534000,112,0.000168
176,"Korea, North",25779000,36,0.00014
177,Zambia,18384000,25,0.000136
178,Ethiopia,114916000,101,8.8e-05
179,Uzbekistan,34174000,28,8.2e-05
180,China,1402385000,1129,8.1e-05
181,Indonesia,271739000,209,7.7e-05
182,India,1400100000,968,6.9e-05


### Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [239]:
page_count_w_population.sort_values(by='high-quality',ascending=False,ignore_index=True)\
[['Name','total_article','total_good_article','high-quality']].head(10)

Unnamed: 0,Name,total_article,total_good_article,high-quality
0,"Korea, North",36,8.0,22.222222
1,Saudi Arabia,117,15.0,12.820513
2,Romania,343,42.0,12.244898
3,Central African Republic,66,8.0,12.121212
4,Uzbekistan,28,3.0,10.714286
5,Mauritania,48,5.0,10.416667
6,Guatemala,83,7.0,8.433735
7,Dominica,12,1.0,8.333333
8,Syria,128,10.0,7.8125
9,Benin,91,7.0,7.692308


### Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [248]:
page_count_w_population.sort_values(by='high-quality',ascending=False,ignore_index=True)\
[['Name','total_article','total_good_article','high-quality']].tail(10)

Unnamed: 0,Name,total_article,total_good_article,high-quality
173,San Marino,81,0.0,0.0
174,Sao Tome and Principe,21,0.0,0.0
175,Bahrain,42,0.0,0.0
176,Guyana,20,0.0,0.0
177,Seychelles,21,0.0,0.0
178,Guadeloupe,49,0.0,0.0
179,Solomon Islands,97,0.0,0.0
180,Grenada,36,0.0,0.0
181,Cape Verde,36,0.0,0.0
182,Bahamas,20,0.0,0.0


### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [246]:
page_count_sbrg_w_population.sort_values(by='articles-per-population',ascending=False,ignore_index=True)\
[['Name','Population','total_article','articles-per-population']]

Unnamed: 0,Name,Population,total_article,articles-per-population
0,OCEANIA,43155000,3126,0.007244
1,NORTHERN EUROPE,105990000,3763,0.00355
2,SOUTHERN EUROPE,153251000,3710,0.002421
3,WESTERN EUROPE,195479000,4560,0.002333
4,EUROPE,746622000,15765,0.002112
5,CARIBBEAN,43233000,695,0.001608
6,EASTERN EUROPE,291902000,3732,0.001279
7,SOUTHERN AFRICA,67732000,634,0.000936
8,WESTERN ASIA,280927000,2563,0.000912
9,CENTRAL AMERICA,178611000,1543,0.000864


### Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [244]:
page_count_sbrg_w_population.sort_values(by='high-quality',ascending=False,ignore_index=True)\
[['Name','total_article','total_good_article','high-quality']]

Unnamed: 0,Name,total_article,total_good_article,high-quality
0,NORTHERN AMERICA,1901,104,5.470805
1,SOUTHEAST ASIA,2020,73,3.613861
2,WESTERN ASIA,2563,89,3.472493
3,EASTERN EUROPE,3732,118,3.161844
4,EAST ASIA,2473,76,3.07319
5,CENTRAL ASIA,245,7,2.857143
6,NORTHERN EUROPE,3763,102,2.710603
7,ASIA,11667,316,2.708494
8,MIDDLE AFRICA,665,16,2.406015
9,EUROPE,15765,350,2.220108


# Reflection

This project explores the concept of bias through data on Wikipedia articles about political figures from various countries. The two metrics we look at are articles-per-population, the number of politician articles as a percentage of the country's population, and high-quality, the percentage of politician articles that are of GA or FA quality.

The results suggest that biases exist in the Wikipedia data, and the tables point to some possible causes. First, language appears to affect articles-per-population at the regional level. Since we are only using English Wikipedia, regions where English is an official language would be expected to have a higher articles-per-population. The two top-ranked regions, Oceania and Northern Europe, both include major English-speaking countries. Second, population has a significant impact on articles-per-population at the country level. Countries with high articles-per-population rankings usually have small populations, while countries with low rankings have large populations.

What surprised me were the biases in the high-quality tables. The patterns differ from those in the articles-per-population tables: countries and regions with a mix of languages and characteristics rank highly on high-quality. One possible cause is public political interest. Countries that rank highly on high-quality, such as North Korea and Saudi Arabia, tend to attract significant public political interest.

**What might your results suggest about (English) Wikipedia as a data source?**
(English) Wikipedia as a data source is likely to have biases related to language. Our tables suggest that regions where English is an official language tend to have a higher articles-per-population. Most people in these regions would choose to edit Wikipedia in English, while people in other regions, such as Asia, may prefer other languages.

**What might your results suggest about the internet and global society in general?**
The results do not suggest that internet access has an impact on the dataset. Since anyone can edit Wikipedia, even a country with little internet access may have high-quality articles created by people from other countries. However, global political interest does seem to have an impact. From the results, we can see that countries that rank highly on high-quality usually attract high global political interest.

**How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you observed?**
For the biases related to language, a researcher can add official languages as an additional attribute. They can create subgroups by official language and analyze the patterns within each subgroup. A researcher can also use all language editions of Wikipedia to reduce the effect of language. For the biases related to population, we can use a transformation of population, such as the log or square root, in the percentage calculation.
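
To make the population-transformation idea concrete, here is a minimal sketch that compares the raw percentage with a ratio based on the base-10 logarithm of population. The populations and article counts are copied from the Step 6 tables; the choice of log10 and the name `per_log10_population` are assumptions for illustration, not a recommended final metric.

```python
import math

# (population, politician articles) copied from the tables in Step 6.
countries = {
    "Tuvalu": (10_000, 54),
    "Iceland": (368_000, 201),
    "China": (1_402_385_000, 1129),
    "India": (1_400_100_000, 968),
}

results = {}
for name, (population, articles) in countries.items():
    results[name] = {
        # Original metric: articles as a percentage of population.
        "articles_per_population": 100 * articles / population,
        # Illustrative alternative: articles divided by log10 of population.
        "per_log10_population": articles / math.log10(population),
    }

for name, metrics in results.items():
    print(name, metrics)
```

With the log transformation, population still enters the calculation, but very small countries no longer dominate the ranking purely because of their tiny denominators.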

