# DATA 512 Assignment 2: Bias in Data

**DATA 512 Fall 2018**

**Ryan Bae**

Due: November 1st, 2018

The assignment instructions can be found at the following link:

https://wiki.communitydata.cc/Human_Centered_Data_Science_(Fall_2018)/Assignments#A2:_Bias_in_data

In [1]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import csv
import requests
import json

## Import Data and Call ORES API

In [3]:
# load datasets
page_data = pd.read_csv('page_data.csv')
wpds_2018 = pd.read_csv('WPDS_2018_data.csv')

Below is the function that calls the Wikimedia ORES API to get the quality rating of each article. The code is taken from course instructor Jonathan Morgan's GitHub repository at the link below:

https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb

According to the assignment instructions, the ORES API assigns each article one of the following quality classes, listed from highest to lowest:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

In [4]:
# function to call ORES API. API limit is ~290, so the rev_ids must be split into smaller chunks.
headers = {'User-Agent' : 'https://github.com/ryanbae89',
           'From' : 'rbae@uw.edu'}

def get_ores_data(revision_ids, headers):
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}' 
    # Specify the parameters - smushing all the revision IDs together separated by | marks. 
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so that the request identifies the caller to the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
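
To see what request the function above sends, the URL construction can be reproduced without any network call. This is only an illustrative sketch: `build_ores_url` is a hypothetical helper name that is not part of the original notebook, and the two revision IDs are borrowed from the sample rows shown later in this notebook.

```python
# Hypothetical helper mirroring the URL construction in get_ores_data.
# It only builds the string; it does not contact the API.
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='wp10'):
    # Join the revision IDs with | marks so one request covers the whole batch
    revids = '|'.join(str(x) for x in revision_ids)
    return ENDPOINT.format(project=project, model=model, revids=revids)

url = build_ores_url([355319463, 498683267])
print(url)
```

Printing the URL this way is a cheap check that a chunk of revision IDs is serialized as intended before any request is made.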

The cell below divides the revision IDs from `page_data` into chunks of 100. It then loops over the chunks and requests the article quality scores for each chunk from the ORES API.

The API result for each chunk (a dict) is appended to the list `ores_ratings`.

In [5]:
# get ORES scores for each article
# get_ores_data must be called in chunks due to API limits
rev_ids_full = list(page_data['rev_id'])
n = 100
rev_ids_chunks = [rev_ids_full[i:i+n] for i in range(0, len(rev_ids_full), n)]
ores_ratings = []

# call API using the function in cell above for each rev_ids chunk
for rev_ids in rev_ids_chunks:
    ores_results = get_ores_data(rev_ids, headers)
    ores_ratings.append(ores_results['enwiki'])
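
The chunking step used above can be checked in isolation. The sketch below uses stand-in integers rather than real revision IDs, and `chunk` is a hypothetical helper name introduced only for this illustration.

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Stand-in revision IDs; the real ones come from page_data['rev_id']
sample_ids = list(range(250))
chunks = chunk(sample_ids, 100)

# The last chunk holds the remainder when the total is not a multiple of 100
print([len(c) for c in chunks])
```

Every ID lands in exactly one chunk, so no revision is requested twice or skipped.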

## Data Cleaning and Engineering

Cell blocks below perform data engineering and cleaning to get the final table for analysis. The final dataframe has the following schema:

| country  | article_name  |  revision_id | article_quality | population |
|---------:|:-------------:|-------------:|----------------:|-----------:|
| Chad     |Bir I of Kanem |  355319463   | Stub            | 15400000.0 |

First, each dict in `ores_ratings` is turned into a pandas dataframe, and the resulting dataframes are concatenated.

In [117]:
# turn each chunk into a pandas dataframe and concatenate them in a single call
ores = pd.concat([pd.DataFrame.from_dict(chunk['scores'], orient='index')
                  for chunk in ores_ratings])
ores = ores.reset_index()
ores.columns = ['rev_id', 'score']

The dict in each row of the `score` column of `ores` is further processed to extract the `article_quality` feature. Rows without a valid `article_quality` rating are dropped.

In [118]:
# function to get the prediction from one ORES result dict
def get_pred(result):
    if 'score' in result:
        return result['score']['prediction']
    else:
        return 'NaN'

# apply to every row in the ores dataframe
ores['article_quality'] = ores['score'].apply(get_pred)
ores = ores.drop('score', axis=1)

# change datatypes for joins 
ores['rev_id'] = ores['rev_id'].apply(int)

# drop rows that do not have article_quality 
ores = ores[ores['article_quality'] != 'NaN']
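
The extraction step above can be illustrated on a small hand-written sample. This is a sketch under assumptions: the sample assumes each revision maps to a `wp10` entry holding either a `score` dict with a `prediction` key (which the code above relies on) or an `error` dict. The revision IDs, probabilities, and error text are made up, and `extract_prediction` is a hypothetical name.

```python
# Hypothetical sample shaped like the per-revision entries processed above.
# The rev IDs, probabilities, and error text are made up for illustration.
sample_scores = {
    '1001': {'wp10': {'score': {'prediction': 'Stub',
                                'probability': {'Stub': 0.9, 'Start': 0.1}}}},
    '1002': {'wp10': {'error': {'type': 'RevisionNotFound',
                                'message': 'revision could not be found'}}},
}

def extract_prediction(entry):
    """Return the predicted class, or None when the entry has no score."""
    return entry.get('score', {}).get('prediction')

# Convert the string keys to int so they can be matched against rev_id
predictions = {int(rev_id): extract_prediction(models['wp10'])
               for rev_id, models in sample_scores.items()}

# Keep only the revisions that received a rating
rated = {rev_id: pred for rev_id, pred in predictions.items() if pred is not None}
print(rated)
```

Returning `None` instead of the string `'NaN'` is a design choice in this sketch; it avoids a sentinel string that could be mistaken for a real class label.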

The `ores` dataframe is now inner-joined with `page_data` and `wpds_2018` to get population and article_name columns.

In [119]:
# join with page_data and wpds_2018 tables
ores = ores.merge(page_data, on='rev_id', how='inner')
ores = ores.merge(wpds_2018, left_on='country', right_on='Geography', how='inner')
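
Because both joins above are inner joins, rows whose key has no match in the other table are dropped silently. A minimal sketch for listing such keys before joining is given below; `unmatched_keys` is a hypothetical helper, and the country names are made-up stand-ins rather than values from the datasets.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the sorted keys present on the left but absent on the right."""
    return sorted(set(left_keys) - set(right_keys))

# Stand-in values; with the real data these would be
# page_data['country'] and wpds_2018['Geography']
page_countries = ['Chad', 'Chad', 'Iceland', 'Atlantis']
wpds_geographies = ['Chad', 'Iceland', 'Lemuria']

# Countries with articles but no population row
print(unmatched_keys(page_countries, wpds_geographies))
# Population rows with no articles
print(unmatched_keys(wpds_geographies, page_countries))
```

Inspecting both directions shows which articles and which population rows would not survive the inner join.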

In [120]:
# rename columns, drop unnecessary columns, and reorder columns
ores = ores.rename(index=str, columns={"page": "article_name",
                                       "Population mid-2018 (millions)": "population",
                                       "rev_id": "revision_id"})
ores = ores.drop('Geography', axis=1)
ores = ores[['country', 'article_name', 'revision_id', 'article_quality', 'population']]

The `population` feature is a string giving the population in millions, so commas are stripped, the value is converted to a float, and it is scaled to an absolute count.

In [121]:
# clean population feature
def clean_population_column(population):
    population = population.replace(',', '')
    return float(population)*1e6

ores['population'] = ores['population'].apply(clean_population_column)
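
The conversion above can be tried on individual values. In the sketch below, the raw strings `'15.4'` and `'1,371.3'` are assumptions about how the source file might encode the Chad and India populations that appear in this notebook, and `population_from_millions` is a hypothetical name.

```python
def population_from_millions(text):
    """Convert a string such as '1,371.3' (in millions) to an absolute count."""
    return float(text.replace(',', '')) * 1e6

# Assumed raw strings; rounding hides tiny floating-point artifacts
for raw in ['15.4', '1,371.3']:
    print(raw, '->', round(population_from_millions(raw)))
```

Without removing the comma first, `float('1,371.3')` would raise a `ValueError`, which is why the cleaning step is needed.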

The final `ores` dataframe is shown below:

In [127]:
print(ores.shape)
ores.head()

(44973, 5)


|   | country | article_name | revision_id | article_quality | population |
|--:|:--------|:-------------|------------:|:----------------|-----------:|
| 0 | Chad | Bir I of Kanem | 355319463 | Stub | 15400000.0 |
| 1 | Chad | Abdullah II of Kanem | 498683267 | Stub | 15400000.0 |
| 2 | Chad | Salmama II of Kanem | 565745353 | Stub | 15400000.0 |
| 3 | Chad | Kuri I of Kanem | 565745365 | Stub | 15400000.0 |
| 4 | Chad | Mohammed I of Kanem | 565745375 | Stub | 15400000.0 |


This is the final cleaned table, containing each article's name, its revision ID, its article quality from ORES, its country, and that country's population. It is saved to a CSV file in the cell below.

In [128]:
# save to csv
ores.to_csv('final_data.csv')

## Data Analysis

The code below performs the data analysis using the cleaned final dataframe `ores`. Per the assignment instructions, the following four tables are produced:

1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [172]:
# Get articles per population for each country

# count the number of articles per country
articles_per_country = ores.groupby('country')['revision_id'].count().to_frame().reset_index()

# get population for each country
pop_per_country = ores.groupby('country')['population'].mean().to_frame().reset_index()

# join the two tables on country and calculate articles per population for each country
articles_per_country = articles_per_country.merge(pop_per_country, on='country', how='inner')
articles_per_country['articles_per_population(%)'] = (articles_per_country['revision_id'] \
    / articles_per_country['population'])*100

# rename columns and sort
articles_per_country = articles_per_country.rename(index=str, 
                                                   columns={'revision_id':'num_articles'})
articles_per_country = articles_per_country.sort_values('articles_per_population(%)',
                                                       ascending=False).reset_index(drop=True)
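
As a quick check on the metric computed above, the percentage can be reproduced by hand for single rows. The sketch below uses the article counts and populations of the Tuvalu and India rows from the tables that follow; `articles_per_population_pct` is a hypothetical helper name.

```python
def articles_per_population_pct(num_articles, population):
    """Number of articles expressed as a percentage of the population."""
    return num_articles / population * 100

# Tuvalu: 55 articles for a population of 10,000
tuvalu = articles_per_population_pct(55, 10000.0)
# India: 986 articles for a population of 1,371,300,000
india = articles_per_population_pct(986, 1371300000.0)

print(f'Tuvalu: {tuvalu:.6f}%  India: {india:.6f}%')
```

The two values differ by several orders of magnitude, which is the spread the first two tables display.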

### 1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [173]:
# 10 highest-ranked countries in terms of number of politician articles as a 
# proportion of country population
articles_per_country.head(10)

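|   | country | num_articles | population | articles_per_population(%) |
|--:|:--------|-------------:|-----------:|---------------------------:|
| 0 | Tuvalu | 55 | 10000.0 | 0.55 |
| 1 | Nauru | 53 | 10000.0 | 0.53 |
| 2 | San Marino | 82 | 30000.0 | 0.273333 |
| 3 | Monaco | 40 | 40000.0 | 0.1 |
| 4 | Liechtenstein | 29 | 40000.0 | 0.0725 |
| 5 | Tonga | 63 | 100000.0 | 0.063 |
| 6 | Marshall Islands | 37 | 60000.0 | 0.061667 |
| 7 | Iceland | 206 | 400000.0 | 0.0515 |
| 8 | Andorra | 34 | 80000.0 | 0.0425 |
| 9 | Federated States of Micronesia | 38 | 100000.0 | 0.038 |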


### 2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [174]:
# 10 lowest-ranked countries in terms of number of politician articles as a proportion 
# of country population
articles_per_country[::-1].head(10)

|     | country | num_articles | population | articles_per_population(%) |
|----:|:--------|-------------:|-----------:|---------------------------:|
| 179 | India | 986 | 1371300000.0 | 7.2e-05 |
| 178 | Indonesia | 214 | 265200000.0 | 8.1e-05 |
| 177 | China | 1135 | 1393800000.0 | 8.1e-05 |
| 176 | Uzbekistan | 29 | 32900000.0 | 8.8e-05 |
| 175 | Ethiopia | 105 | 107500000.0 | 9.8e-05 |
| 174 | Zambia | 25 | 17700000.0 | 0.000141 |
| 173 | Korea, North | 39 | 25600000.0 | 0.000152 |
| 172 | Thailand | 112 | 66200000.0 | 0.000169 |
| 171 | Bangladesh | 323 | 166400000.0 | 0.000194 |
| 170 | Mozambique | 60 | 30500000.0 | 0.000197 |


In [186]:
# Get quality articles per country

# keep only FA and GA articles, then count them per country
quality_articles = ores[ores['article_quality'].isin(['FA', 'GA'])]
quality_articles = quality_articles.groupby('country')['revision_id'].count().to_frame().reset_index()

# join with articles_per_country and rename columns; the inner join keeps only
# countries that have at least one FA or GA article
quality_per_country = articles_per_country.merge(quality_articles, 
                                                 on='country',
                                                 how='inner')
quality_per_country = quality_per_country.rename(index=str, 
                                                   columns={'revision_id':'num_quality_articles'})

# calculate quality articles percentage and sort
quality_per_country['quality_article_percentage'] = (quality_per_country['num_quality_articles'] \
    / quality_per_country['num_articles'])*100

quality_per_country = quality_per_country.sort_values('quality_article_percentage',
                                                     ascending=False).reset_index(drop=True)
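
The inner join above keeps only countries that have at least one FA or GA article. If countries with none should also be ranked, one possible variant is a left join followed by filling the missing counts with zero. The sketch below is not what the cell above does; it uses made-up counts for three hypothetical countries to show the idea.

```python
import pandas as pd

# Made-up counts for three hypothetical countries
totals = pd.DataFrame({'country': ['A', 'B', 'C'],
                       'num_articles': [50, 200, 10]})
quality = pd.DataFrame({'country': ['A', 'B'],
                        'num_quality_articles': [5, 1]})

# how='left' keeps country C, which has no FA/GA articles; its missing
# count is NaN after the merge and is filled with 0
merged = totals.merge(quality, on='country', how='left')
merged['num_quality_articles'] = merged['num_quality_articles'].fillna(0).astype(int)
merged['quality_article_percentage'] = (merged['num_quality_articles']
                                        / merged['num_articles']) * 100

print(merged.sort_values('quality_article_percentage'))
```

With this variant, country C appears at the bottom of the ranking with 0% instead of disappearing from the table.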

### 3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [187]:
# 10 highest-ranked countries in terms of number of GA and FA-quality articles as a 
# proportion of all articles about politicians from that country
quality_per_country.head(10)

|   | country | num_articles | population | articles_per_population(%) | num_quality_articles | quality_article_percentage |
|--:|:--------|-------------:|-----------:|---------------------------:|---------------------:|---------------------------:|
| 0 | Korea, North | 39 | 25600000.0 | 0.000152 | 7 | 17.948718 |
| 1 | Saudi Arabia | 119 | 33400000.0 | 0.000356 | 16 | 13.445378 |
| 2 | Central African Republic | 68 | 4700000.0 | 0.001447 | 8 | 11.764706 |
| 3 | Romania | 348 | 19500000.0 | 0.001785 | 40 | 11.494253 |
| 4 | Mauritania | 52 | 4500000.0 | 0.001156 | 5 | 9.615385 |
| 5 | Bhutan | 33 | 800000.0 | 0.004125 | 3 | 9.090909 |
| 6 | Tuvalu | 55 | 10000.0 | 0.55 | 5 | 9.090909 |
| 7 | Dominica | 12 | 70000.0 | 0.017143 | 1 | 8.333333 |
| 8 | United States | 1092 | 328000000.0 | 0.000333 | 82 | 7.509158 |
| 9 | Benin | 94 | 11500000.0 | 0.000817 | 7 | 7.446809 |


### 4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [188]:
# 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a 
# proportion of all articles about politicians from that country
quality_per_country[::-1].head(10)

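|     | country | num_articles | population | articles_per_population(%) | num_quality_articles | quality_article_percentage |
|----:|:--------|-------------:|-----------:|---------------------------:|---------------------:|---------------------------:|
| 142 | Tanzania | 408 | 59100000.0 | 0.00069 | 1 | 0.245098 |
| 141 | Peru | 354 | 32200000.0 | 0.001099 | 1 | 0.282486 |
| 140 | Lithuania | 248 | 2800000.0 | 0.008857 | 1 | 0.403226 |
| 139 | Nigeria | 682 | 195900000.0 | 0.000348 | 3 | 0.439883 |
| 138 | Morocco | 208 | 35200000.0 | 0.000591 | 1 | 0.480769 |
| 137 | Fiji | 199 | 900000.0 | 0.022111 | 1 | 0.502513 |
| 136 | Bolivia | 187 | 11300000.0 | 0.001655 | 1 | 0.534759 |
| 135 | Brazil | 551 | 209400000.0 | 0.000263 | 3 | 0.544465 |
| 134 | Luxembourg | 180 | 600000.0 | 0.03 | 1 | 0.555556 |
| 133 | Sierra Leone | 166 | 7700000.0 | 0.002156 | 1 | 0.60241 |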


## Writeup

The analysis confirmed some biases I expected to find in the data, while other insights turned out to be unexpected. 

In general, I expected countries with smaller populations to have more articles relative to their population. This is because every country, regardless of population size, has one head of state or government, a similar number of cabinet members, and so on. Indeed, the first table (the 10 highest-ranked countries in terms of number of politician articles as a proportion of country population) contains only countries with populations under 1 million, with 7 out of 10 having a population below 100,000.

The second table, showing the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population, includes the most populous countries, such as India, China, and Indonesia. Bangladesh and Ethiopia also have large populations of over 100 million. However, there were a few surprising exceptions, for example North Korea in 7th place in the second table. One theory is that because North Korea is a secretive dictatorship with little publicly known information about its politicians, and because sites like Wikipedia are censored there, there are few articles about its politicians. The United States, despite having the 3rd largest population in the world, is also missing from the second table. This could be because some countries with larger populations also have more representatives in their legislatures, offsetting the above effect (the US also has two houses of legislature, meaning more politicians). In addition, many English Wikipedia editors are based in the United States and are therefore familiar with US politicians.

However, the third and fourth tables did not match my expectations. Here, we are looking at the *proportion of high-quality articles among articles about politicians*. I expected powerful countries with world-famous politicians to top the third table, while the fourth table would contain mostly countries that do not play large roles in world politics. Instead, the result was a mixed bag. Most of the countries in the third table had several high-quality articles, while most of the countries in the fourth table had just one. Because the inner join used to build these tables keeps only countries with at least one GA or FA article, countries with none do not appear in the fourth table. Population size did not seem to play a role in determining which countries appear in either table.

This was something I did not expect, and I do not have a good conjecture as to why this is the case. It could simply be due to the small numbers of high-quality articles, which make the proportion of high-quality articles among all politician articles for each country an imprecise metric. It could also be that the circumstances of individual countries play a larger role in determining how many articles about politicians are of high quality.
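
One way to gauge the small-sample concern is to attach an interval to each proportion. The sketch below uses the Wilson score interval, which is a technique brought in here for illustration and is not part of the analysis above. It is applied to two rows of the third table, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts from the third table: Dominica (1 of 12) and United States (82 of 1092)
for name, k, n in [('Dominica', 1, 12), ('United States', 82, 1092)]:
    low, high = wilson_interval(k, n)
    print(f'{name}: {k}/{n} = {k / n:.1%}, interval {low:.1%} to {high:.1%}')
```

The interval for Dominica is much wider than the one for the United States, which supports the idea that rankings driven by a single high-quality article are imprecise.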