# A2 Data Bias

The goal of this assignment is to detect bias in Wikipedia's coverage of political figures from multiple countries by analyzing the *coverage* and the *quality* of the politicians' Wikipedia articles. The main output of this project is a data visualization of the highest-ranked and lowest-ranked countries in terms of politician article density. The visuals themselves do not tell the entire story, but they hopefully help uncover political biases in Wikipedia.

To support the analysis, I produce these artifacts:
1. A CSV file with politicians' Wikipedia article information (country, article name, revision ID, article quality, and population of the country)
2. A Jupyter notebook with all the code
3. A comprehensive README file
4. An MIT LICENSE file
5. A visualization PNG file

To be able to effectively process and analyze the data, this project requires:
1. A dataset of Wikipedia articles about politicians
2. Population data for each country
3. Article quality predictions from the Wikimedia ORES machine learning service

These will be downloaded, requested, combined, plotted, and eventually analyzed in this Jupyter notebook.

### Shared Library

Set up the shared imports and variables used throughout this Jupyter notebook.

In [2]:
import csv
import numpy as np
import math
import os
import pandas as pd
from pprint import pprint
import requests


# Variables
my_github = 'reyadji'
my_email = 'adjir@uw.edu'
ores_url = 'https://ores.wikimedia.org/v3/scores/enwiki'
page_file = 'page_data.csv'
population_file = 'Population Mid-2015.csv'
project = 'enwiki'
model = 'wp10'
csv_file = 'A2 data.csv'

## Getting Article & Population Data

The article dataset CSV file can be downloaded from figshare (https://figshare.com/articles/Untitled_Item/5513449), while the population dataset CSV file is available on the Population Reference Bureau website (http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14). This step is straightforward as long as the correct datasets are downloaded (the try/except block reports an error if a file is not available in the current directory).
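The `header=1` and `thousands=','` arguments used in the loading cell can be illustrated on a tiny in-memory file. This is a minimal sketch that assumes the population CSV starts with one title row before the real header and stores numbers with comma separators; the title text and the `Exampleland` row are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the population file: one title row, then the header.
raw = (
    "Illustrative title row\n"
    "Location,Data\n"
    'Indonesia,"255,741,973"\n'
    'Exampleland,"1,234,567"\n'
)

# header=1 uses the second line as the header and skips the title row;
# thousands=',' parses "255,741,973" as the integer 255741973.
pop = pd.read_csv(io.StringIO(raw), header=1, thousands=',')
print(pop.dtypes)
```

Without `thousands=','` the `Data` column would be read as strings, and the later division by population would fail.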

In [3]:
# Load the page_data and population CSV files into pandas DataFrames inside a try/except block
try: 
    page_data = pd.read_csv(page_file)
    pop_data = pd.read_csv(population_file, header=1, thousands=',')
except Exception as e:
    print(str(e))

# print(page_data.columns)

## Getting Article Quality Predictions and Combining the Datasets

Each Wikipedia article is classified into one of 6 categories by the ORES (Objective Revision Evaluation Service) machine learning service. These categories, from best to worst, are:
1. FA: Featured article
2. GA: Good article
3. B: B-class article
4. C: C-class article
5. Start: Start-class article
6. Stub: Stub-class article

ORES predicts a probability for each category for every article, and it picks the category with the highest probability as the article's category. Fortunately, ORES has a public REST API that accepts an article's revision ID or a batch of articles' revision IDs (up to roughly 140), and it returns the predicted category of each article along with the full set of probabilities for each category.
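To make the response handling concrete, here is a small sketch that pulls predictions out of a response-shaped dictionary. The nesting (`enwiki` → `scores` → revision ID → `wp10` → `score`) follows what the notebook code reads; the revision IDs, probabilities, and error payload below are invented for illustration, and `extract_predictions` is a hypothetical helper, not part of ORES.

```python
def extract_predictions(response_json, project='enwiki', model='wp10'):
    """Return ({rev_id: prediction}, [rev_ids without a usable score])."""
    predictions = {}
    failed = []
    for rev_id, result in response_json[project]['scores'].items():
        score = result.get(model, {}).get('score')
        if score is None:
            # No score for this revision (for example, an error entry).
            failed.append(rev_id)
        else:
            predictions[rev_id] = score['prediction']
    return predictions, failed


# Hand-written sample shaped like the JSON the notebook parses (values made up).
sample = {
    'enwiki': {
        'scores': {
            '1001': {'wp10': {'score': {
                'prediction': 'Stub',
                'probability': {'FA': 0.01, 'GA': 0.02, 'B': 0.05,
                                'C': 0.07, 'Start': 0.25, 'Stub': 0.60},
            }}},
            '1002': {'wp10': {'error': {'type': 'ExampleError',
                                        'message': 'illustrative only'}}},
        }
    }
}

predictions, failed = extract_predictions(sample)

# The prediction is the category with the highest probability.
probs = sample['enwiki']['scores']['1001']['wp10']['score']['probability']
most_likely = max(probs, key=probs.get)
print(predictions, failed, most_likely)
```

Collecting the failed revision IDs separately, rather than only printing them, makes it easy to report how many articles were dropped.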

After being loaded, the datasets are cleaned and combined: unnecessary columns (Location Type, TimeFrame, Data Type, and Footnotes from the population dataset) are dropped, the remaining columns are renamed to the specification, and the two datasets are inner-joined on the *country* attribute (orphaned rows are filtered out).
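The effect of the rename and inner join can be seen on a toy example. This sketch uses the same column names as the notebook, but the rows (including the countries `Atlantis` and `Exampleland`) are invented to show how rows without a match on either side are dropped.

```python
import pandas as pd

# Toy stand-ins for the two source datasets (rows are made up).
pages = pd.DataFrame({
    'country': ['Indonesia', 'Indonesia', 'Atlantis'],
    'page': ['Article A', 'Article B', 'Article C'],
    'rev_id': [101, 102, 103],
})
population = pd.DataFrame({
    'Location': ['Indonesia', 'Exampleland'],
    'Data': [255741973, 1000000],
})

# Rename columns to the shared specification, then inner-join on country.
pages = pages.rename(columns={'page': 'article_name', 'rev_id': 'revision_id'})
population = population.rename(columns={'Location': 'country', 'Data': 'population'})
merged = pd.merge(pages, population, on='country', how='inner')

# 'Atlantis' has no population row and 'Exampleland' has no articles,
# so neither survives the inner join.
print(merged)
```

An outer join with `indicator=True` would be one way to inspect which rows are orphaned before discarding them.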

The cleaned dataset is split into subsets of at most 100 rows since the ORES API only takes roughly up to 140 revision IDs in a single batch request. The resulting article quality dataset is then merged into the main dataset on *revision_id* before everything is stored in a CSV file.
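The batching step can be sketched in isolation. Like the notebook, this uses `math.ceil` and `np.array_split`, which produce near-equal batches that never exceed the limit (rather than full batches of 100 plus a remainder). The helper name `split_into_batches` and the generated revision IDs are illustrative assumptions.

```python
import math

import numpy as np


def split_into_batches(rev_ids, max_batch_size=100):
    """Split a list of revision-ID strings into near-equal batches."""
    if not rev_ids:
        return []
    n_batches = math.ceil(len(rev_ids) / max_batch_size)
    return [chunk.tolist() for chunk in np.array_split(np.array(rev_ids), n_batches)]


# 250 made-up revision IDs need ceil(250 / 100) = 3 batches.
rev_ids = [str(800000000 + i) for i in range(250)]
batches = split_into_batches(rev_ids)

# Each batch becomes one pipe-separated value for the 'revids' parameter.
query_values = ['|'.join(batch) for batch in batches]
print([len(batch) for batch in batches])
```

Keeping the batch size at 100 leaves a margin below the approximate 140-ID limit mentioned above.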

A manual test is added at the end to check the integrity of the dataset as well as its accuracy (I chose Indonesia since it's my birth country and I am pretty familiar with the names of its politicians).

In [4]:
# Cleaning and processing both dataframes before joining them on country
page_data.rev_id = page_data.rev_id.apply(str)
page_data = page_data.rename(columns = {'page': 'article_name', 'rev_id': 'revision_id'})
pop_data = pop_data.drop(['Location Type','TimeFrame', 'Data Type', 'Footnotes'], axis=1)
pop_data = pop_data.rename(columns = {'Location': 'country', 'Data': 'population'})
data = pd.merge(page_data, pop_data, on='country', how='inner')

# Subset the dataframe to 100-row chunks
num_of_subsets = math.ceil(len(data)/100)
subsets = np.array_split(data, num_of_subsets)

# Processing each subset to the ORES API to get quality
aq_df = pd.DataFrame()
for s in subsets:
    rev_ids = s.revision_id.str.cat(sep='|')
    headers = {
        'User-Agent': 'https://github.com/{}'.format(my_github), 
        'From': my_email}
    payload = {
        'models': 'wp10',
        'revids': rev_ids
    }
    r = requests.get(ores_url, params=payload, headers=headers)
    if r.status_code != 200:
        print('Response status code is {}. Response text: {}'.format(r.status_code, r.text))
        continue
    revision_id = []
    article_quality = []
    
    for k,v in r.json()['enwiki']['scores'].items():
        try:
            pred = v['wp10']['score']['prediction']
            revision_id.append(k)
            article_quality.append(pred)
        except KeyError:
            print('Could not get a valid article quality for revision id:{0}, article:{1}'.format(k, page_data.loc[page_data.revision_id == k].article_name))
            continue
    aq_df = pd.concat([aq_df, pd.DataFrame(data={'revision_id': revision_id, 'article_quality':article_quality})]) 

# Merge the article quality dataframe with the main dataframe and stored to CSV
data = pd.merge(data, aq_df, on='revision_id', how='inner')
data.to_csv(csv_file, index = False)

# Testing
print('Dataset shape: {}'.format(data.shape))
print('Indonesian politicans: {}'.format(data.loc[(data.country =='Indonesia')].article_name.sample(10)))

Could not get a valid article quality for revision id:807367030, article:46862    Jalal Movaghar
Name: article_name, dtype: object
Could not get a valid article quality for revision id:807367166, article:46863    Mohsen Movaghar
Name: article_name, dtype: object
Dataset shape: (45797, 5)
Indonesian politicans: 7447    Template:Indonesia-politician-stub
7448      Template:Indonesia-diplomat-stub
7449      Template:Indonesia-activist-stub
7450                       Burhan Muhammad
7451                         Ateng Wahyudi
7452                        Muhammad Nazar
7453                            Imam Utomo
7454                   I Gusti Putu Martha
7455          I Gede Sumantara Ady Pratama
7456                        Chris Soumokil
7457                  Piet Alexander Tallo
7458                          Bibit Waluyo
7459                  Sugondo Djojopuspito
7460                  Soeprapto (governor)
7461                       Bustanil Arifin
7462                         Biem Benyamin


## Analysis

In the analysis part, there are 2 metrics to calculate:
1. The number of politician articles as a proportion of each country's population.
2. The percentage of high-quality articles (either FA or GA) out of all articles for each country.

Using these two metrics, I hope to find evidence for or against the notion that English Wikipedia is biased toward politicians from certain countries.

The proportion of articles per country population is calculated by dividing the number of articles for each country by that country's population. This is achieved by grouping the dataset by *country* and counting the articles in each group. The per-country counts are then divided by the country's *population* from the population dataset (the code multiplies by 100 to express the result as a percentage). Finding the highest and lowest proportions is only a matter of sorting.
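As a worked example of this first metric, the sketch below counts articles per country and divides by population on made-up data. The country labels `A` and `B`, the populations, and the column name `articles_per_population_pct` are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Made-up articles: three for country A, one for country B.
articles = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B'],
    'article_name': ['a1', 'a2', 'a3', 'b1'],
})
population = pd.Series({'A': 1000, 'B': 4000}, name='population')

# Count articles per country, then align the counts with the populations.
counts = articles.groupby('country').size().rename('articles')
per_country = pd.concat([counts, population], axis=1, join='inner')

# Articles per population, expressed as a percentage.
per_country['articles_per_population_pct'] = (
    per_country['articles'] * 100 / per_country['population']
)

# Sorting ascending puts the lowest-ranked country first.
ranking = per_country.sort_values('articles_per_population_pct')
print(ranking)
```

Here A scores 3 × 100 / 1000 = 0.3% and B scores 1 × 100 / 4000 = 0.025%, so B ranks lowest.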

Calculating the proportion of high-quality articles is a little trickier since I need to define a boolean function that determines whether each article is high-quality or not. Aside from that extra function, computing the number of high-quality articles as a proportion of all articles is straightforward.
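One possible alternative to a custom boolean function is pandas' `isin`, sketched here on made-up data. Because the mean of a boolean column is the fraction of `True` values, the per-country percentage follows directly; the country labels and quality values are invented for illustration.

```python
import pandas as pd

# Made-up quality labels for two countries.
data = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality': ['FA', 'GA', 'Stub', 'C', 'Start', 'Stub'],
})

# Flag articles whose predicted class is FA or GA.
data['high_quality'] = data['article_quality'].isin(['FA', 'GA'])

# Mean of booleans = share of high-quality articles; scale to a percentage.
high_quality_pct = data.groupby('country')['high_quality'].mean() * 100
print(high_quality_pct)
```

This yields the same proportion as summing the flags and dividing by the article count, as the notebook does in two steps.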

To test this analysis module, I select all Indonesian politicians with high-quality articles. The number is disappointingly low in proportion to the population of Indonesia. Countries with high and low proportions will be discussed in the next section.


In [5]:
# Count the number of articles per country and merge the counts with the population dataset
pop_data = pop_data.set_index('country')
country_df = pd.concat(
    [pd.DataFrame({'articles':data.groupby('country').size()}), pop_data], 
    axis=1, 
    join='inner')
country_df = country_df.assign(art_per_country=country_df.articles*100/country_df.population)
# country_df = country_df.sort_values('art_per_country')

# Define a function to count all articles with 'FA' or 'GA' quality
def is_highquality(article):
    return article in ('FA', 'GA')

# Apply the above function to article_quality column to get boolean high_quality column
data['high_quality']= data.article_quality.apply(is_highquality)
hq_df = data.groupby('country')['high_quality'].sum()

# Join the high_quality dataset with the previous one and calculate the proportion of high quality articles
country_df = country_df.join(hq_df)
country_df['high_quality_prop'] = country_df.high_quality*100/country_df.articles

# Testing
print(data.loc[(data.country =='Indonesia') & (data.high_quality)])

                   article_name    country revision_id  population  \
7533                    Sukarni  Indonesia   781071895   255741973   
7596  Alexander Andries Maramis  Indonesia   798756726   255741973   
7599               Khouw Kim An  Indonesia   798837376   255741973   
7601        Maria Ulfah Santoso  Indonesia   798891601   255741973   
7610      Sri Mulyani Indrawati  Indonesia   800672774   255741973   
7629            Teuku Nyak Arif  Indonesia   802855428   255741973   
7630       Abdul Haris Nasution  Indonesia   803287932   255741973   
7650                 Rano Karno  Indonesia   805553992   255741973   
7659        Korrie Layun Rampan  Indonesia   807049212   255741973   

     article_quality high_quality  
7533              GA         True  
7596              GA         True  
7599              GA         True  
7601              GA         True  
7610              GA         True  
7629              GA         True  
7630              GA         True  
7650       

## Visualization

This section visualizes the 10 highest-ranked and lowest-ranked countries in terms of the number of politician articles as a proportion of country population and the number of high-quality articles as a proportion of all articles. Before I start any plotting, I need to make sure the countries are ranked appropriately. There are 39 countries with 0 high-quality articles; these countries are tied for last place in the ranking of high-quality articles per total articles.
While having the lowest-ranked and highest-ranked countries in the same bar chart could be useful, it does not work here since the lowest-ranked countries have much smaller articles-per-population values than the highest-ranked ones. Hence, I split the lowest-ranked and highest-ranked countries into different plots.
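The handling of the tied zero-value countries can be sketched without any plotting. In this toy example the countries tied at 0 are collapsed into a single `Misc.` entry (the label the plotting code uses), and the remaining slots are filled from the non-zero countries; the values and the group size of 2 are made up to keep the example small.

```python
import pandas as pd

# Made-up high-quality proportions (%); A and B are tied at zero.
props = pd.Series({'A': 0.0, 'B': 0.0, 'C': 1.5, 'D': 3.0, 'E': 12.0})

zero_count = int((props == 0).sum())
nonzero = props[props > 0].sort_values()

# Lowest group: one 'Misc.' bar for all zero-valued countries,
# followed by the smallest non-zero countries.
lowest = pd.concat([pd.Series({'Misc.': 0.0}), nonzero.head(2)])

# Highest group: the largest non-zero countries.
highest = nonzero.tail(2)
print(zero_count, lowest.index.tolist(), highest.index.tolist())
```

Collapsing the ties avoids arbitrarily picking 10 of the zero-valued countries for the chart.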

In [None]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')

# Plot the highest-ranked and lowest-ranked countries in terms of number of politician articles per population
fig = plt.figure(figsize=(12, 10))
ax1 = fig.add_subplot(121)
low_apc = country_df.sort_values('art_per_country').head(10)
low_apc.art_per_country.plot.bar()
ax1.set_ylabel('# Articles per Population (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

ax2 = fig.add_subplot(122)
high_apc = country_df.sort_values('art_per_country').tail(10)
high_apc.art_per_country.plot.bar()
ax2.set_ylabel('# Articles per Population (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()


fig.suptitle('The 10 lowest-ranked and highest-ranked countries in terms of number of politician articles as a proportion of country population')
plt.show()
fig.savefig('articles_per_population.png')

# Plot the highest-ranked and lowest-ranked countries in terms of number of high-quality articles per total articles
fig = plt.figure(figsize=(12, 10))
ax3 = fig.add_subplot(121)
low_hqa = country_df.loc[country_df.high_quality!=0].sort_values('high_quality_prop').head(9)
low_hqa = pd.concat([pd.DataFrame({'high_quality': 0.0, 'high_quality_prop': 0.0}, index=['Misc.']), low_hqa])
low_hqa.high_quality_prop.plot.bar()
ax3.set_ylabel('# High Quality Articles per Total Articles (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

ax4 = fig.add_subplot(122)
high_hqa = country_df.loc[country_df.high_quality!=0].sort_values('high_quality_prop').tail(10)
high_hqa.high_quality_prop.plot.bar()
ax4.set_ylabel('# High Quality Articles per Total Articles (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

fig.suptitle('The 10 lowest-ranked and highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country')
plt.show()
fig.savefig('high_quality_articles_proportion.png')



## Results

### Number of politician articles as a proportion of country population
The 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population are dominated by countries with large populations (China, India, Indonesia), by countries with which English-speaking editors are less familiar, such as Bangladesh and Zambia, or by countries with closed political systems, like North Korea. On the other hand, the top 10 countries by this metric consist exclusively of countries with very small populations. While this metric sheds light on bias against politicians from less-familiar countries in Wikipedia, I found that the metric itself is inherently skewed against countries with large populations and in favor of countries with small ones. It is by no means an indication that English Wikipedia is biased against China, India, and Indonesia. After all, China (1138 articles), India (990 articles), and Indonesia (215 articles) have a decent number of total articles in English Wikipedia, and there is no linear correlation between total number of articles and population.

### Number of high-quality articles as a proportion of all articles about politicians from that country
The second metric filters out "placeholder" and "empty calories" articles about politicians to reduce the noise. The hypothesis is that biases would be more pronounced when there are only a handful of relevant articles. However, there is no obvious bias against the lowest-ranked countries in this metric (excluding the 39 countries without a good or featured article). The group consists of countries from 4 different continents with various populations (Nigeria has the largest population, tiny Luxembourg has the smallest one), both economically developed (Finland, Luxembourg) and economically developing (Nigeria, Czech Republic). The only similarity among these countries is that English is not the primary or even secondary everyday language. The other end of the spectrum also shows very little bias. Although the United States is in the top 10, it does not even have the highest proportion of well-written articles to all articles. That distinction surprisingly belongs to North Korea by a wide margin: 9 out of 39 articles about the Hermit Kingdom's politicians are deemed good or featured articles. It is even more surprising considering North Korea is in the bottom 10 for number of politician articles as a proportion of country population, since there are not that many articles. My impression is that a North Korean politics expert (or a group of researchers) is dedicated to writing most, if not all, of the North Korean politician articles. A cursory glance at the articles' edit history confirms that WikiProject North Korea makes significant contributions to North Korea-related Wikipedia articles.

