# Wikipedia politician pages, and bias in data
### <a href="https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data">University of Washington, DATA 512 Autumn 2019, Assignment 2</a>
### Bianca Zlavog

In this assignment, we analyze the number and quality of English Wikipedia politician pages relative to country and region population. The two main data sources we use are the Politicians by Country from the English-language Wikipedia dataset and the 2018 World Population Data from the Population Reference Bureau. We process this data, then use the Wikipedia ORES API to estimate the quality of each Wikipedia politician article. Finally, we merge all the data and analyze the trends in relative coverage and quality of articles by country and region, then discuss our findings along with potential sources of bias in the data.

First, we will import all the needed packages.

In [1]:
import csv
import urllib.request
from urllib.request import urlopen
import codecs
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
import oresapi
import numpy as np

## Part 1: Data Acquisition

In this first step, we download and clean all the necessary input data.

The first dataset we read in is the [Politicians by Country from the English-language Wikipedia](https://figshare.com/articles/Untitled_Item/5513449) dataset. This file contains over 400,000 observations, with the following variables:

   `page`: Wikipedia page title, containing the name of a politician
   
   `country`: Name of the country that the politician worked in
   
   `rev_id`: edit ID of the last edit to the page
    
We save out a copy of the raw data, then clean it by removing any entries starting with the string "Template:", since these entries are not Wikipedia articles. 

In [2]:
# Read in Politicians by Country dataset from website, and parse the resulting csv file
resp = urlopen("https://ndownloader.figshare.com/files/9614893")
zipfile = ZipFile(BytesIO(resp.read()))
file = zipfile.open("country/data/page_data.csv")
csvfile = csv.reader(codecs.iterdecode(file, 'utf-8'))
header = next(csvfile)
page_data = pd.DataFrame(csvfile)
page_data.columns = header

# Save out raw data
page_data.to_csv('../data_raw/page_data.csv', index = False) 

# Drop pages starting with "Template:", which are not Wikipedia articles
page_data = page_data[page_data['page'].str.startswith('Template:') != True]
page_data = page_data.reset_index(drop = True)

page_data.head()

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,Yos Por,Cambodia,393822005
3,Julius Gregr,Czech Republic,395521877
4,Edvard Gregr,Czech Republic,395526568
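
As an aside, pandas can parse a CSV member of a zip archive directly, which avoids the manual `csv.reader` step. The sketch below is a minimal illustration on a small in-memory archive, not the code used in this notebook: the archive contents and the helper name `read_csv_from_zip` are made up for the example, and reading every column as text is an assumption chosen to mirror what `csv.reader` produces.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip(zip_bytes, member):
    """Read one CSV member of a zip archive (given as bytes) into a DataFrame."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            # dtype=str keeps every column as text, like csv.reader does
            return pd.read_csv(handle, dtype=str)


# Build a tiny in-memory archive (made-up rows) to stand in for the download
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "data/page_data.csv",
        "page,country,rev_id\n"
        "Template:Example politicians,Chad,1\n"
        "Yos Por,Cambodia,393822005\n",
    )

toy_pages = read_csv_from_zip(buffer.getvalue(), "data/page_data.csv")

# Drop template pages, which are not articles
toy_articles = toy_pages[~toy_pages["page"].str.startswith("Template:")]
toy_articles = toy_articles.reset_index(drop=True)
print(toy_articles)
```

Using `~` to negate the boolean mask is equivalent to the `!= True` comparison in the cell above.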


Next, we read in the Population Reference Bureau's 2018 World Population Data Sheet. This dataset contains two variables: 

   `Geography`: Country or region measured 

   `Population mid-2018 (millions)`: Population of the respective location in mid-2018, in millions of people
    
We downloaded this dataset manually from the [Canvas course site](https://canvas.uw.edu/files/58607571/download?download_frd=1) to the `data_raw` directory. We then remove any entries of locations given in all capital letters, because these are regions rather than countries.

Note that an updated set of population estimates is available from the [Population Reference Bureau's 2019 World Population Data Sheet](https://www.prb.org/international/indicator/population/table/), which was not used in this assignment. I have included some unused code in the commented-out section below that would instead read this dataset from the website, keep only the geographical information and population variables, then save out the formatted raw data.

In [3]:
# Read in 2018 World Population data
populations = pd.read_csv('../data_raw/WPDS_2018_data.csv')
populations.columns = ['country', 'Population mid-2018 (millions)']

## Alternately, read in 2019 World Population data
# ftpstream = urllib.request.urlopen("https://datacenter.prb.org/download/international/indicator/population/csv")
# csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
# populations = pd.DataFrame(csvfile)
#
## Keep only needed rows and columns
# populations = populations.drop([0, 1, 2, 3, 4])
# populations = populations.drop(populations.columns[[0, 2, 3]], axis = 1)
#
## Rename columns
# populations.columns = ['Geography', 'Population mid-2019 (millions)']
#
# populations.to_csv('../data_raw/WPDS_2019_data.csv', index = False) 

# Remove uppercased rows that contain region-level data
populations_country = populations[populations.country != populations.country.str.upper()]
populations_country = populations_country.reset_index(drop = True)

populations_country.head()

Unnamed: 0,country,Population mid-2018 (millions)
0,Algeria,42.7
1,Egypt,97.0
2,Libya,6.5
3,Morocco,35.2
4,Sudan,41.7


## Part 2: Data Processing

In this section, we process the population dataset to map each country to its respective region. Then, we use the ORES API to obtain article quality data, and finally merge all our datasets together in preparation for analysis.

First, let's create a dataset of countries together with their corresponding region.

In [4]:
# First, extract just the regions
populations_region = populations[populations.country == populations.country.str.upper()]
regions_list = []
regs = populations_region['country'].tolist()

# Extract the indices of the regions dataset, and subtract them to get the number of times each region should be repeated
indices = populations_region.index
last_element = indices[-1]
indices = [(indices[i - 1] - indices[i]) * -1 for i in range(1, len(indices))]
indices.append(len(populations.index) + 1 - last_element)

# Then loop over the regions, and append each to a list a number of times equal to the number of rows belonging to that region in the original populations dataset
for i in range(len(regs)): 
    j = 1
    while j <= indices[i]:
        regions_list.append(regs[i])
        j += 1

# Finally, convert the regions list to a DataFrame, and merge the region data on to the country populations data
regions_df = pd.DataFrame.from_dict(regions_list)
regions_df.columns = ["region"]
populations = populations.merge(regions_df, left_index = True, right_index = True)
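
The same country-to-region mapping can be sketched more compactly with a forward fill, assuming (as in the WPDS file) that each uppercase region row comes before the countries that belong to it. The toy rows below are made up for illustration, and this is an alternative sketch rather than the approach used in the cell above.

```python
import pandas as pd

# Toy stand-in for the WPDS layout: an uppercase region row, then its countries
toy_pop = pd.DataFrame({
    "country": ["AFRICA", "Algeria", "Egypt", "ASIA", "China", "India"],
})

# Flag region rows: their name is unchanged by upper-casing
is_region = toy_pop["country"] == toy_pop["country"].str.upper()

# Keep the name on region rows only, then carry it forward to the rows below
toy_pop["region"] = toy_pop["country"].where(is_region).ffill()

# Country-level rows now carry the region they belong to
toy_countries = toy_pop[~is_region].reset_index(drop=True)
print(toy_countries)
```

This avoids counting rows between regions by hand, at the cost of relying on the row order of the input file.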

Next, we query the [ORES client](https://github.com/wikimedia/ores), from Wikimedia Foundation and authors Aaron Halfaker, Yuvi Panda, Amir Sarabadani, Justin Du, Adam Wight, available under an MIT License. Documentation pages for the API are available [here](https://www.mediawiki.org/wiki/ORES). 
We obtain predictions of article quality for each Wikipedia politician page. 

Note that there are six possible predicted values of article quality: FA (Featured article), GA (Good article), B (B-class article), C (C-class article), Start (Start-class article), Stub (Stub-class article). For the purposes of this assignment, we only consider FA and GA as corresponding to high-quality articles.

Note that I was not able to install the `ores` package due to package compatibility errors, but was able to instead use the [`oresapi` package](https://github.com/halfak/oresapi), available from Aaron Halfaker, 2019, under an MIT License.

In [5]:
# Start an ORES API session
# Provide this useragent argument for the class to help the ORES team track requests
ores_session = oresapi.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")

# Obtain predictions of article quality for each revision ID in the politicians dataset
results = ores_session.score("enwiki", ["articlequality"], page_data['rev_id'])
gen_lst = list(results)

# Convert outputs to a dataframe, keeping only the predicted quality class
# (DataFrame.append was removed in pandas 2.0, so build the column in one step).
# Revisions that ORES could not score have no 'score' entry and become None.
results2 = pd.DataFrame({'articlequality': [
    r['articlequality'].get('score', {}).get('prediction') for r in gen_lst
]})

results2.head()

Unnamed: 0,articlequality
0,Stub
1,Stub
2,Stub
3,Stub
4,Stub


Finally, merge all the datasets together, and save out a cleaned dataset for analysis.

In [6]:
# First merge article quality predictions onto politicians pages dataset
all_data = page_data.merge(results2, left_index = True, right_index = True)

# Remove and save out entries for which the ORES API query did not return article quality scores
all_data_noscores = all_data[~all_data['articlequality'].isin(['B', 'C', 'FA', 'GA', 'Stub', 'Start'])]
all_data_noscores.to_csv('../data_clean/wp_wpds_politicians_noscores.csv', index = False)
all_data = all_data[all_data['articlequality'].isin(['B', 'C', 'FA', 'GA', 'Stub', 'Start'])]

# Now merge on country population data
all_data = all_data.merge(populations_country, how = 'outer', indicator = True)

# Output a dataset containing the data that failed to merge - either no country population data, or no politician data
all_data_nomerge = all_data[all_data['_merge'].isin(['left_only', 'right_only'])]
all_data_nomerge.to_csv('../data_clean/wp_wpds_countries-no_match.csv', index = False)

# Save out the final cleaned dataset with all the entries that merged
all_data_fin = all_data[all_data['_merge'] == "both"]
all_data_fin = all_data_fin.rename(columns = {"page": "article_name", "rev_id": "revision_id", "articlequality": 
                                           "article_quality", "Population mid-2018 (millions)": "population"})
all_data_fin = all_data_fin[["country", "article_name", "revision_id", "article_quality", "population"]]
all_data_fin.to_csv('../data_clean/wp_wpds_politicians_by_country.csv', index = False)


# Merge on region data
all_data_fin = all_data_fin.merge(populations)

# Merge on region populations
populations_region.columns = ['region', 'region_population']
all_data_fin = all_data_fin.merge(populations_region)

# Create indicator for whether the article quality is considered high-quality (GA or FA)
all_data_fin['high_quality'] = np.where((all_data_fin['article_quality'] == "GA") | 
                                        (all_data_fin['article_quality'] == "FA"), 1, 0)

# Convert population columns from strings to numbers (country values are floats, region totals are integers)
all_data_fin['population'] = pd.to_numeric(all_data_fin['population'].str.replace(',', ''))
all_data_fin['region_population'] = all_data_fin['region_population'].str.replace(',', '').astype(int)

all_data_fin.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population,Population mid-2018 (millions),region,region_population,high_quality
0,Chad,Bir I of Kanem,355319463,Stub,15.4,15.4,AFRICA,1284,0
1,Chad,Abdullah II of Kanem,498683267,Stub,15.4,15.4,AFRICA,1284,0
2,Chad,Salmama II of Kanem,565745353,Stub,15.4,15.4,AFRICA,1284,0
3,Chad,Kuri I of Kanem,565745365,Stub,15.4,15.4,AFRICA,1284,0
4,Chad,Mohammed I of Kanem,565745375,Stub,15.4,15.4,AFRICA,1284,0
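
The outer merge with `indicator = True` used in this cell can be illustrated on a toy example. The rows below are made up; the point is only to show how the `_merge` column separates matched rows from rows that exist on one side only.

```python
import pandas as pd

# Made-up article and population rows for illustration
toy_articles = pd.DataFrame({
    "page": ["Article A", "Article B", "Article C"],
    "country": ["Chad", "Chad", "Atlantis"],
})
toy_populations = pd.DataFrame({
    "country": ["Chad", "Tuvalu"],
    "population": [15.4, 0.01],
})

# An outer merge keeps unmatched rows from both sides; the _merge column
# records whether each row came from the left table, the right table, or both
merged = toy_articles.merge(toy_populations, on="country", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]
print(matched)
print(unmatched)
```

Here the two `Chad` articles match, while `Atlantis` (articles but no population) and `Tuvalu` (population but no articles) end up in the unmatched set.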


## Part 3: Analysis

In this section, we create six tables comparing the quantity and quality of Wikipedia politician pages relative to population across countries and regions. Because populations are recorded in millions, the coverage figures are articles per million people. Finally, we conclude with a writeup of the findings and potential sources of bias in the data.

In [13]:
# TABLE 1: Top 10 countries by coverage
# "10 highest-ranked countries in terms of number of politician articles as a proportion of country population"
articles_country = all_data_fin.groupby(['country', 'population'], as_index = False).count()
articles_country['proportion'] = articles_country['article_name'] / articles_country['population']
articles_country = articles_country.sort_values('proportion', ascending = False)
articles_country2 = articles_country[['country', 'proportion']]
articles_country2.head(10)

Unnamed: 0,country,proportion
166,Tuvalu,5400.0
115,Nauru,5200.0
135,San Marino,2700.0
108,Monaco,1000.0
93,Liechtenstein,700.0
161,Tonga,630.0
103,Marshall Islands,616.666667
68,Iceland,502.5
3,Andorra,425.0
61,Grenada,360.0


In [15]:
# TABLE 2: Bottom 10 countries by coverage
# "10 lowest-ranked countries in terms of number of politician articles as a proportion of country population"
articles_country2 = articles_country2.sort_values('proportion')
articles_country2.head(10)

Unnamed: 0,country,proportion
69,India,0.71465
70,Indonesia,0.791855
34,China,0.810733
173,Uzbekistan,0.851064
51,Ethiopia,0.939535
82,"Korea, North",1.40625
178,Zambia,1.412429
159,Thailand,1.691843
112,Mozambique,1.901639
13,Bangladesh,1.917067
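
As a worked check of the units, the figures quoted in this notebook for North Korea can be tied together. The article and high-quality counts come from the conclusions below and the coverage value from Table 2; the population is not quoted directly, so the value computed here is only what those figures imply.

```python
import math

# Figures quoted in this notebook for North Korea
articles = 36          # politician articles (from the conclusions)
coverage = 1.40625     # articles per million people (Table 2)
high_quality = 7       # GA or FA articles (from the conclusions)

# Population implied by the coverage figure, in millions
implied_population = articles / coverage

# Share of articles that are high quality, as in Table 3
quality_share = high_quality / articles

print(round(implied_population, 1), round(quality_share, 6))
```

The quality share agrees with the 0.194444 shown for North Korea in Table 3.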


In [9]:
# TABLE 3: Top 10 countries by relative quality
# "10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality"
articles_qual = all_data_fin.groupby(['country'], as_index = False).sum()
articles_qual = articles_qual.merge(articles_country[['country', 'article_name']])
articles_qual['proportion'] = articles_qual['high_quality'] / articles_qual['article_name']
articles_qual = articles_qual.sort_values('proportion', ascending = False)
articles_qual = articles_qual[['country', 'proportion']]
articles_qual.head(10)

Unnamed: 0,country,proportion
82,"Korea, North",0.194444
137,Saudi Arabia,0.127119
104,Mauritania,0.125
31,Central African Republic,0.121212
132,Romania,0.113703
166,Tuvalu,0.092593
19,Bhutan,0.090909
44,Dominica,0.083333
155,Syria,0.078125
18,Benin,0.076923
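
Because `high_quality` is a 0/1 indicator, the share of high-quality articles per group is simply its mean. The sketch below shows this on made-up data using pandas named aggregation; it is an alternative to the sum-then-merge approach in the cell above, not the code that produced the table.

```python
import pandas as pd

# Made-up articles: country A has 1 high-quality article out of 4, B has 0 of 2
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "high_quality": [1, 0, 0, 0, 0, 0],
})

# Count, sum, and mean of the indicator in a single pass per country
quality_share = (
    toy.groupby("country")["high_quality"]
    .agg(articles="count", high_quality="sum", proportion="mean")
    .reset_index()
    .sort_values("proportion", ascending=False)
)
print(quality_share)
```

Selecting the `high_quality` column before aggregating also avoids summing unrelated columns such as article names.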


In [10]:
# TABLE 4: Bottom 10 countries by relative quality
# "10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality"
articles_qual = articles_qual.sort_values('proportion')
articles_qual.head(10)

Unnamed: 0,country,proportion
90,Lesotho,0.0
38,Costa Rica,0.0
36,Comoros,0.0
108,Monaco,0.0
163,Tunisia,0.0
14,Barbados,0.0
167,Uganda,0.0
17,Belize,0.0
112,Mozambique,0.0
165,Turkmenistan,0.0


In [11]:
# TABLE 5: Geographic regions by coverage 
# "Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population"
articles_region = all_data_fin.groupby(['region', 'region_population'], as_index = False).count()
articles_region['proportion'] = articles_region['article_name'] / articles_region['region_population']
articles_region = articles_region.sort_values('proportion', ascending = False)
articles_region2 = articles_region[['region', 'proportion']]
articles_region2

Unnamed: 0,region,proportion
5,OCEANIA,76.292683
2,EUROPE,21.265416
3,LATIN AMERICA AND THE CARIBBEAN,7.964561
0,AFRICA,5.33567
4,NORTHERN AMERICA,5.263014
1,ASIA,2.542108


In [12]:
# TABLE 6: Geographic regions by relative quality
# "Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality"
region_qual = all_data_fin.groupby(['region'], as_index = False).sum()
region_qual = region_qual.merge(articles_region[['region', 'article_name']])
region_qual['proportion'] = region_qual['high_quality'] / region_qual['article_name']
region_qual = region_qual.sort_values('proportion', ascending = False)
region_qual = region_qual[['region', 'proportion']]
region_qual

Unnamed: 0,region,proportion
4,NORTHERN AMERICA,0.051536
1,ASIA,0.026884
5,OCEANIA,0.0211
2,EUROPE,0.020298
0,AFRICA,0.018246
3,LATIN AMERICA AND THE CARIBBEAN,0.013349


### Conclusions

One finding from this analysis that did not surprise me was that in Tables 1, 2, and 5, countries and regions with a relatively small population ranked in the top 10 for number of articles per capita, whereas the most populous locations ranked in the bottom 10.
From Table 3, I was surprised to see North Korea as the highest ranked country in terms of article quality. While North Korea has a small number of articles (36), almost 20% (7) are high quality, so I wonder whether it is just the small sample size driving these rates, or whether there may be some bias in the data. It seems that many of the top countries in this table have more autocratic or formerly authoritarian governments, so I wonder whether this may drive more in-depth English Wikipedia articles about their politicians.
Within Table 4, I found that some countries have zero high-quality articles, so I am curious what factors might drive this, and whether some bias may be present in these countries' articles.
In Table 6, I noted that Northern America was the highest-ranked region in terms of article quality, which I believe makes sense since there is widespread political discourse, especially in the United States, which has over 5000 politician articles.
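
One way to gauge how much the small sample could be driving North Korea's rate is an approximate confidence interval for 7 high-quality articles out of 36. The sketch below uses the Wilson score interval, which is my choice for illustration and not part of the original analysis; it also assumes the articles can be treated as independent draws, which is a simplification.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half_width, center + half_width


# North Korea: 7 high-quality articles out of 36
low, high = wilson_interval(7, 36)
print(f"{low:.3f} to {high:.3f}")
```

The resulting interval is wide, roughly 10% to 35%, which is consistent with the concern that a ranking based on a few dozen articles is not very stable.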


A potential source of bias in this dataset may be that the politicians data was only collected from English Wikipedia, and there could be more and higher-quality Wikipedia politician articles on pages written in their country's language. This suggests to me that English-language Wikipedia may not be a complete and representative source of information on politicians around the world. In addition, politicians in less prominent or local-level roles may not be widely known outside their own countries, which limits the number of English articles about them. I think this suggests that we may have access to some of the most prominent information that comes out of a country, but we may also lose some perspective and information about smaller-scale events or lesser-known people that are not prominent outside their country. It would be interesting to have some information about the number and quality of politician articles written in the native language of a country, to see how that compares to the English-language pages.


I was quite curious about some of the aspects of the ORES API and how it scores article quality. 
[This page](https://en.wikipedia.org/wiki/Wikipedia:Content_assessment) states that "quality assessments are mainly performed by members of WikiProjects, who tag talk pages of articles". While there are guidelines on how a page should be rated, there would likely be some subjective bias introduced by the people who rate these pages. Interestingly, the high-quality FA and GA pages are rated not by usual members, but by "independent editors, rather than by WikiProjects. GAs are generally reviewed by a single editor, and FA by a panel." I wonder whether this different set of raters introduces a different type of bias for high-quality pages. This information suggests that the algorithm used by ORES is based on supervised learning, and there is likely inherent bias in its scoring algorithm introduced by manually scored entries in the training data.
Something I noticed is that the article quality scores are based on the ID of an edit made to a page at a given point in time, so I wonder whether the article quality for a page is pretty stable across time, or whether particular edits at different points in time might yield very different quality scores (for example, if someone made a damaging edit). I wonder whether some of these issues may be present in the politicians dataset we are working with.


According to the [ORES FAQ page](https://www.mediawiki.org/wiki/ORES/FAQ), "ORES tools use machine learning to predict the quality of new edits and articles, quickly identify and address damaging edits (sometimes called 'vandalism'), check for copyright violations, and patrol recent changes to Wikipedia articles." This made me curious about some of the applications of the ORES API, some examples of which are listed on [this page](https://www.mediawiki.org/wiki/ORES/Applications). Sites such as wikiHow or Facebook informational pages about public figures might be similar enough in content and editor demographics that a model like those behind ORES, trained on data such as our politicians dataset, could be used to assess article quality and flag damaging edits in those contexts.