# Bias on Wikipedia

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what the nature of political articles on Wikipedia - both their existence, and their quality - can tell us about bias in Wikipedia's content.


## Setup

In [1]:
import json
import matplotlib.pyplot as plt
import os
import numpy as np
import pandas as pd
import requests

%matplotlib inline

# data_dir = 'data/'

## Data Ingest

The first step is to get the raw data. We need population data and article data.

### Population Data

Download population data from Population Reference Bureau here:  
http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14

This data is saved in `./data/raw/`

In [None]:
# From local data
filename = 'data/raw/Population Mid-2015.csv'
population_data = pd.read_csv(filename, skiprows=2, thousands=',')
population_data.head(4)

In [None]:
# # From website -- may want to include this one only since data may not be licensed for distribution
# filename = 'http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c472%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480'
# population_data = pd.read_csv(filename, skiprows=2, thousands=',')
# population_data.head(4)

In [None]:
# # TODO: consider using Gary Gregg's map to update some of the country names

# country_map = {
#    "East Timorese" : "Timor-Leste",
#    "Hondura" : "Honduras",
#    "Rhodesian" : "Zimbabwe",
#    "Salvadoran" : "El Salvador",
#    "Samoan" : "Samoa",
#    "São Tomé and Príncipe" : "Sao Tome and Principe",
#    # "Somaliland" : "Somalia",  # Oliver says this one is not correct
#    "South African Republic" : "South Africa",
#    "South Korean" : "Korea, South"
# }

OK, so we have the population data by country, loaded from the local file. The commented-out cell above shows how to pull the same data directly from the website instead.

Now let's see about getting the article data.

### Politician/Article Data

You'll find the Wikipedia politician article dataset on Figshare here:  
https://figshare.com/articles/Untitled_Item/5513449

If you want to do this yourself, you'll need to go to the link above, read through the documentation for this repository, then download and unzip it.

In [None]:
# From website: TBD

In [None]:
# From local data
filename = 'data/raw/page_data.csv'
page_data = pd.read_csv(filename)
page_data.head(4)

## Article Scores from ORES

Now that we have our article data, we can get scores for each article using the ORES API.

Documentation for the ORES API can be found here:  
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

In [None]:
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/rexthompson', 'From' : 'rext@uw.edu'}

In [None]:
revids_all = list(page_data['rev_id'])

# # TEMP FOR TESTING
# revids_all = revids_all[0:50]

In [None]:
# collect one record per revision; the dataframe is built once at the end
records = []

# loop 100 entries at a time
chunk_size = 100
for idx_start in range(0, len(revids_all), chunk_size):

    # retrieve and concatenate subset of revids
    revids = revids_all[idx_start:idx_start + chunk_size]
    revids = '|'.join(str(x) for x in revids)

    # pull article data from API
    params = {'project' : 'enwiki',
              'revids' : revids,
              'model' : 'wp10'
              }

    # headers must be passed by keyword: the second positional
    # argument of requests.get is `params`, not `headers`
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    for revid in response['enwiki']['scores']:
        try:
            rating = response['enwiki']['scores'][revid]['wp10']['score']['prediction']
        except KeyError:
            # no 'score' entry was returned for this revision
            print('unable to load score for ' + revid)
            rating = np.nan
        records.append({'revid': revid, 'rating': rating})

    # NOTE -- ratings do not return in the same order as they were passed to the API!!!

# set up dataframe holding the rating for each article
rev_ratings = pd.DataFrame(records, columns=['revid', 'rating'])

So, above, we see that the following two rev_ids do not return a valid result from ORES:

 * 807367030
 * 807367166
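
The failed revisions can also be collected programmatically instead of being read off the printed messages. The following is a minimal sketch, assuming the response has the nesting that the loop above indexes into (`['enwiki']['scores'][revid]['wp10']['score']['prediction']`) and that a revision without a rating simply lacks the `score` key. The `split_scores` helper and the hand-written `sample_response` are hypothetical and are not real API output.

```python
def split_scores(response, project='enwiki', model='wp10'):
    """Return ({revid: prediction}, [revids without a score]) for one response."""
    ratings, failed = {}, []
    for revid, models in response[project]['scores'].items():
        try:
            ratings[revid] = models[model]['score']['prediction']
        except KeyError:
            # no 'score' (or no 'prediction') entry for this revision
            failed.append(revid)
    return ratings, failed


# Hypothetical response fragment written by hand for illustration
sample_response = {
    'enwiki': {
        'scores': {
            '797882322': {'wp10': {'score': {'prediction': 'Stub'}}},
            '807367030': {'wp10': {'error': {'message': 'placeholder'}}},
        }
    }
}

ratings, failed = split_scores(sample_response)
print(ratings)
print(failed)
```

Keeping the failures in a list makes it easy to report or retry them later.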

Let's see how this data looks.

In [None]:
rev_ratings.head()

OK, so now we have a nice table of ratings for each article. Let's merge this back with the original article data. First we convert the revids to ints since they are currently strings. We need them to be the same as page_data which has this as an int.

In [None]:
rev_ratings['revid'] = pd.to_numeric(rev_ratings['revid'], errors='coerce')
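
The effect of `errors='coerce'` can be seen on a few values. This is a small sketch only: the third entry is a made-up string that cannot be parsed, included to show that it becomes `NaN` rather than raising an error.

```python
import pandas as pd

# two revision IDs as strings, plus one hypothetical unparsable value
revids = pd.Series(['807367030', '807367166', 'not-a-revid'])

converted = pd.to_numeric(revids, errors='coerce')
print(converted)
print(converted.isna().sum(), 'value(s) could not be converted')
```

Note that a column containing `NaN` is stored as float. When every value parses cleanly, the result is an integer column.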

Then we merge, and drop the redundant column.

In [None]:
page_data_with_rating = page_data.merge(rev_ratings, left_on='rev_id', right_on='revid').drop(columns='revid')
page_data_with_rating.head()

In [None]:
population_data.head()

Now we'll merge the 'page_data_with_rating' dataframe with the 'population_data' dataframe on country. Note that the country column in the 'population_data' dataframe is actually called 'Location'.

In [None]:
merged_df = population_data.merge(page_data_with_rating, left_on='Location', right_on='country')
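
Because this is an inner merge, rows whose country names do not match exactly are dropped silently. One way to see which names fall out is an outer merge with `indicator=True`. This is a sketch on hypothetical miniature dataframes: the `Samoan`/`Samoa` mismatch is borrowed from the commented-out `country_map` above, and the page names are placeholders.

```python
import pandas as pd

# hypothetical miniature versions of the two dataframes
population_sample = pd.DataFrame({'Location': ['Albania', 'Samoa']})
page_sample = pd.DataFrame({'country': ['Albania', 'Samoan'],
                            'page': ['Page A', 'Page B']})

# an outer merge keeps unmatched rows; the _merge column says where each came from
check = population_sample.merge(page_sample, left_on='Location', right_on='country',
                                how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Location', 'country', '_merge']])
```

Rows marked `left_only` are countries with population data but no articles. Rows marked `right_only` are articles whose country has no population entry.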

Now we do a little cleanup. A sample of the cleaned, merged dataframe is below.

In [None]:
# pull out columns of interest
merged_df = pd.DataFrame({'country':merged_df['Location'],
                          'population':merged_df['Data'],
                          'article_name':merged_df['page'],
                          'revision_id':merged_df['rev_id'],
                          'article_quality':merged_df['rating']})

# convert population to a numeric dtype (the result must be assigned back)
merged_df['population'] = pd.to_numeric(merged_df['population'])

# reorder columns
merged_df = merged_df[['country',
                       'population',
                       'article_name',
                       'revision_id',
                       'article_quality']]
merged_df.head()

Let's save this data to CSV.

The code below saves the data if it has not already been saved. It loads it if it has already been saved.

In [2]:
# set filename for combined data CSV
filename = 'data/population_and_article_quality_data.csv'

# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    merged_df = pd.read_csv(filename)
    print('loaded CSV data from ' + filename)
else:
    merged_df.to_csv(filename, index=False)
    print('saved CSV data to ' + filename)

loaded CSV data from data/population_and_article_quality_data.csv


## Analysis

So, now we have a dataframe with the country and quality rating for each article, and the population of each country. The preview cell below is left commented out.

In [4]:
# merged_df.head()

### Articles Per Population

First, we'll need population data. We could reuse the original data from above, but instead we rebuild it from the saved CSV so that the analysis is transparent and can be reproduced from that one file.

In [5]:
# rebuild population data
population_data = pd.DataFrame(merged_df[['country','population']])
population_data.drop_duplicates(inplace=True)
population_data.set_index('country', inplace=True)
population_data.head()

country,population
Afghanistan,32247000
Albania,2892000
Algeria,39948000
Andorra,78000
Angola,25000000


We want to determine the proportion of articles-per-population for each country. This means dividing the number of articles per country by the population of the corresponding country.

The first step is to determine the number of articles per country. To do this, we group our 'merged_df' by country and count the number of rows. This will return the number of articles per country.

In [6]:
# get number of articles per country
articles_per_country = merged_df.groupby(['country']).size().reset_index(name='article_count').set_index('country')
articles_per_country.head()

country,article_count
Afghanistan,327
Albania,460
Algeria,119
Andorra,34
Angola,110
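
The choice of `size()` over `count()` matters here: `size()` counts rows, while `count()` counts only non-null values in each column. The two differ for any country that has an article with a missing rating. The following sketch uses a hypothetical four-row sample to show this.

```python
import numpy as np
import pandas as pd

# hypothetical sample: one Andorra article has no rating
sample = pd.DataFrame({
    'country': ['Andorra', 'Andorra', 'Andorra', 'Angola'],
    'article_quality': ['Stub', np.nan, 'Start', 'C'],
})

by_size = sample.groupby('country').size()                       # rows per country
by_count = sample.groupby('country')['article_quality'].count()  # non-null ratings per country
print(by_size['Andorra'], by_count['Andorra'])
```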


OK, so now we have number of articles per country and population per country. Let's join these two datasets.

In [7]:
article_count_and_population = population_data.merge(articles_per_country, left_index=True, right_index=True, how='left')
article_count_and_population.head()

country,population,article_count
Afghanistan,32247000,327
Albania,2892000,460
Algeria,39948000,119
Andorra,78000,34
Angola,25000000,110


Now let's divide article count by population to give the number of articles per person for each country, expressed as a percentage. We'll also sort by that value.

In [8]:
article_count_and_population['articles_per_person_pct'] = 100*article_count_and_population['article_count']/article_count_and_population['population']
article_count_and_population.sort_values(by='articles_per_person_pct', ascending=False, inplace=True)
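
As a quick check of the arithmetic, the same calculation can be done by hand for a single country. This sketch uses the Andorra figures from the table above (34 articles, population 78,000). The helper name `articles_per_person_pct` is introduced here only for illustration.

```python
def articles_per_person_pct(article_count, population):
    """Articles per person, expressed as a percentage of the population."""
    return 100 * article_count / population


# Andorra: 34 articles, population 78,000
andorra_pct = articles_per_person_pct(34, 78000)
print(round(andorra_pct, 6))
```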

In [11]:
####################
##### First 10 #####
####################

article_count_and_population.head(10)

country,population,article_count,articles_per_person_pct
Nauru,10860,53,0.488029
Tuvalu,11800,55,0.466102
San Marino,33000,82,0.248485
Monaco,38088,40,0.10502
Liechtenstein,37570,29,0.077189
Marshall Islands,55000,37,0.067273
Iceland,330828,206,0.062268
Tonga,103300,63,0.060987
Andorra,78000,34,0.04359
Federated States of Micronesia,103000,38,0.036893


In [12]:
#######################################
##### Last 10 (lowest at the top) #####
#######################################

article_count_and_population.tail(10).sort_values('articles_per_person_pct')

country,population,article_count,articles_per_person_pct
India,1314097616,990,7.5e-05
China,1371920000,1138,8.3e-05
Indonesia,255741973,215,8.4e-05
Uzbekistan,31290791,29,9.3e-05
Ethiopia,98148000,105,0.000107
"Korea, North",24983000,39,0.000156
Zambia,15473900,26,0.000168
Thailand,65121250,112,0.000172
"Congo, Dem. Rep. of",73340200,142,0.000194
Bangladesh,160411000,324,0.000202


We can make bar graphs if we feel so inclined...
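
A minimal sketch of such a bar graph follows, with the ten highest values copied by hand from the "First 10" table above. The figure size, labels, and title are arbitrary choices. In the notebook, the same plot could be drawn directly from `article_count_and_population.head(10)`.

```python
import matplotlib.pyplot as plt

# values copied from the "First 10" table above
top10 = {
    'Nauru': 0.488029,
    'Tuvalu': 0.466102,
    'San Marino': 0.248485,
    'Monaco': 0.10502,
    'Liechtenstein': 0.077189,
    'Marshall Islands': 0.067273,
    'Iceland': 0.062268,
    'Tonga': 0.060987,
    'Andorra': 0.04359,
    'Federated States of Micronesia': 0.036893,
}

countries = list(top10)
values = list(top10.values())

fig, ax = plt.subplots(figsize=(8, 4))
# reversed so that the highest value is drawn at the top of the chart
ax.barh(countries[::-1], values[::-1])
ax.set_xlabel('Politician articles per person (%)')
ax.set_title('Countries with the most articles per person')
fig.tight_layout()
```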

### High-Quality Articles Per Population

Now we'll look at the proportion of each country's articles that are high-quality. This is a similar exercise to the previous one, except that instead of counting all articles, we count only those in the "FA" or "GA" category. We do this by subsetting the original 'merged_df' dataframe, then grouping in a similar manner to what we did above.

In [13]:
# get number of high-quality articles per country
hq_articles_per_country = merged_df[(merged_df['article_quality'] == 'GA') |
                                    (merged_df['article_quality'] == 'FA' )]
hq_articles_per_country = hq_articles_per_country.groupby(['country']).size().reset_index(name='hq_article_count').set_index('country')
hq_articles_per_country.head(10)

country,hq_article_count
Afghanistan,19
Albania,5
Algeria,3
Angola,2
Argentina,16
Armenia,6
Australia,44
Austria,3
Azerbaijan,3
Bangladesh,6


We now merge with the total article count. The only thing to note here is that countries without any high-quality articles are missing from the high-quality counts, so the left merge leaves them empty. We substitute zero for these.

In [14]:
hq_article_proportions = articles_per_country.merge(hq_articles_per_country, left_index=True, right_index=True, how='left').fillna(0).astype(int)
hq_article_proportions.head()

country,article_count,hq_article_count
Afghanistan,327,19
Albania,460,5
Algeria,119,3
Andorra,34,0
Angola,110,2
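
The zero for Andorra in the table above comes from the `fillna(0)` step. The following small sketch reproduces it with two of the countries shown. The dataframes are rebuilt by hand here purely for illustration.

```python
import pandas as pd

# total counts for two countries (values from the tables above)
articles = pd.DataFrame({'article_count': [327, 34]},
                        index=pd.Index(['Afghanistan', 'Andorra'], name='country'))
# Andorra has no high-quality articles, so it is absent from this frame
hq = pd.DataFrame({'hq_article_count': [19]},
                  index=pd.Index(['Afghanistan'], name='country'))

joined = articles.merge(hq, left_index=True, right_index=True, how='left')
print(joined)    # Andorra's hq_article_count is NaN at this point

filled = joined.fillna(0).astype(int)
print(filled)    # NaN replaced by 0 and the column cast back to int
```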


Now let's divide the high-quality count by the total count to give the percentage of high-quality articles per country. We'll also sort by that percentage.

In [15]:
hq_article_proportions['hq_article_pct'] = 100*hq_article_proportions['hq_article_count']/hq_article_proportions['article_count']
hq_article_proportions.sort_values(by='hq_article_pct', ascending=False, inplace=True)

Let's look at the first 10 rows...

In [16]:
####################
##### First 10 #####
####################

hq_article_proportions.head(10)

country,article_count,hq_article_count,hq_article_pct
"Korea, North",39,9,23.076923
Romania,348,45,12.931034
Saudi Arabia,119,15,12.605042
Central African Republic,68,8,11.764706
Qatar,51,5,9.803922
Guinea-Bissau,21,2,9.52381
Vietnam,191,18,9.424084
Bhutan,33,3,9.090909
Ireland,381,31,8.136483
United States,1098,86,7.832423


In [17]:
#######################################
##### Last 10 (lowest at the top) #####
#######################################

hq_article_proportions.tail(10).sort_values('hq_article_pct')

country,article_count,hq_article_count,hq_article_pct
Monaco,40,0,0.0
Comoros,51,0,0.0
Bahrain,42,0,0.0
Guyana,20,0,0.0
Marshall Islands,37,0,0.0
Suriname,40,0,0.0
Swaziland,32,0,0.0
Guadeloupe,49,0,0.0
Belize,16,0,0.0
Solomon Islands,98,0,0.0


Here we see that the bottom of the table is just a handful of the countries that have no high-quality articles. Let's count how many such countries there are, and list them.

In [26]:
len(hq_article_proportions[hq_article_proportions['hq_article_count']==0])

39

So we see that there are 39 countries for which none of the politician articles is rated high-quality (FA or GA). These are listed here.

In [37]:
print(list(hq_article_proportions[hq_article_proportions['hq_article_count']==0].index))

['Antigua and Barbuda', 'Turkmenistan', 'Nepal', 'Nauru', 'Tonga', 'Zambia', 'Tunisia', 'Burundi', 'French Guiana', 'Federated States of Micronesia', 'Dominica', 'Eritrea', 'Macedonia', 'Tajikistan', 'Andorra', 'Liechtenstein', 'Switzerland', 'Djibouti', 'Bahamas', 'Belgium', 'Lesotho', 'San Marino', 'Sao Tome and Principe', 'Barbados', 'Cape Verde', 'Seychelles', 'Kiribati', 'Kazakhstan', 'Mozambique', 'Monaco', 'Comoros', 'Bahrain', 'Guyana', 'Marshall Islands', 'Suriname', 'Swaziland', 'Guadeloupe', 'Belize', 'Solomon Islands']


In [42]:
hq_article_proportions[hq_article_proportions['hq_article_count']==0].sort_index()

country,article_count,hq_article_count,hq_article_pct
Andorra,34,0,0.0
Antigua and Barbuda,25,0,0.0
Bahamas,20,0,0.0
Bahrain,42,0,0.0
Barbados,14,0,0.0
Belgium,523,0,0.0
Belize,16,0,0.0
Burundi,76,0,0.0
Cape Verde,37,0,0.0
Comoros,51,0,0.0


# SAMPLE CODE, ETC

In [None]:
# get number of articles per country
articles_per_country = merged_df.groupby(['country']).count()
articles_per_country.head()

In [None]:
# The API can only handle so many requests at a time, so we'll go 100 at a time.


# loop over 
params = {'project' : 'enwiki',
          'revids' : '797882322',
          'model' : 'wp10'
          }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
response
#print(json.dumps(response, indent=4, sort_keys=True))

for revid in response['enwiki']['scores']:
    print(revid)
    temp_dict = response['enwiki']['scores'][revid]['wp10']['score']['probability']
    rating = max(temp_dict, key=temp_dict.get)
    print(rating)

In [None]:
response

Importing the other data is just a matter of reading CSV files in! (and for the R programmers - we'll have an R example up as soon as the Hub supports the language).

In [None]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])

In [None]:
print(data[782])

### Note from Andrew Enfield on Slack on 10/25/17:

FYI that I was able to get scores for all articles in just two minutes with the https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context API. I retrieved the scores in chunks of 140 articles at a time/per call - when I experimented, 140 always worked but 145 or 150 gave me errors.

# UPDATED EXAMPLE

# Making ORES requests

Below is an example of how to make requests through the ORES system in Python to find out the current quality of an article. Specifically, this is a function designed to make a request with *multiple* revision IDs. You can take this function, split your revision IDs up into chunks of 50 or 100 to avoid hitting limits in ORES, pass each chunk through this function, and then stitch the whole set together.

In [None]:
import requests
import json

headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    print(json.dumps(response, indent=4, sort_keys=True))


# So if we grab some example revision IDs and turn them into a list and then call get_ores_data...
example_ids = [783381498, 807355596, 757539710]
get_ores_data(example_ids, headers)
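
The chunk-and-stitch step described above can be sketched as follows. This is an illustration only: `chunked`, `score_all`, and `fake_fetch` are hypothetical helpers, and the sketch assumes a fetch function that *returns* a `{revid: rating}` dict, whereas `get_ores_data` as written prints the response. `fake_fetch` stands in for the real API call so that the example runs offline.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_all(revision_ids, fetch, size=100):
    """Call `fetch` on each chunk and merge the per-chunk {revid: rating} dicts."""
    combined = {}
    for chunk in chunked(revision_ids, size):
        combined.update(fetch(chunk))
    return combined


def fake_fetch(chunk):
    """Stand-in for a real ORES request; returns a placeholder rating per ID."""
    return {str(revid): 'Stub' for revid in chunk}


example_ids = [783381498, 807355596, 757539710]
all_scores = score_all(example_ids, fake_fetch, size=2)
print(all_scores)
```

Keying the combined result by revision ID means the order in which the API returns scores does not matter.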

Importing the other data is just a matter of reading CSV files in! And if you're an R programmer wondering where the R example is - check the other file in this example.

In [None]:
## getting the data from the CSV files
import csv

data = []
with open('./data/raw/page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])

In [None]:
print(data[782])

# ORIGINAL EXAMPLE

# Get ORES Scores

Below is an example of how to make a request through the ORES system in Python to find out the current quality of the article on [Aaron Halfaker](https://en.wikipedia.org/wiki/Aaron_Halfaker) (the person who created ORES):

Actually use the following link for documentation:  
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

In [None]:
# multi-revision endpoint; the single-revision form is
# https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'enwiki',
          'revids' : '797882120|797882121',
          'model' : 'wp10'
          }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
print(json.dumps(response, indent=4, sort_keys=True))


Importing the other data is just a matter of reading CSV files in! (and for the R programmers - we'll have an R example up as soon as the Hub supports the language).

In [None]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])

In [None]:
print(data[782])