# Bias on Wikipedia

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what political articles on Wikipedia, both their existence and their quality, can tell us about bias in Wikipedia's content.


## Setup

In [250]:
import json
import matplotlib.pyplot as plt
# import os
import numpy as np
import pandas as pd
import requests

%matplotlib inline

raw_data_dir = '/Users/Thompson/Desktop/3 - DATA 512/Assignments/A2 - Bias in Data/data/raw/'

## Data Ingest

### Population Data:

Download the population data from the Population Reference Bureau here:  
http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14

This data is saved in `./data/raw/`

In [2]:
# From website
filename = 'http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c472%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480'
population_data = pd.read_csv(filename, skiprows=2)
population_data.head(4)

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,


In [3]:
filename = raw_data_dir + 'Population Mid-2015.csv'
population_data = pd.read_csv(filename, skiprows=2)
population_data.head(4)

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,


OK, so we have the population data by country, both direct from online and from local data. Good.

Now let's see about getting the article data and running it through ORES.

### Politician/Article Data

You'll find the Wikipedia politician article dataset on Figshare here:  
https://figshare.com/articles/Untitled_Item/5513449

If you want to do this yourself, you'll need to go to the link above, read through the documentation for this repository, then download and unzip it.

In [4]:
filename = raw_data_dir + 'page_data.csv'
page_data = pd.read_csv(filename)
page_data.head(4)

Unnamed: 0,country,page,last_edit
0,Abkhazia,Zurab Achba,802551672
1,Abkhazia,Garri Aiba,774499188
2,Abkhazia,Zaur Avidzba,803841397
3,Abkhazia,Raul Eshba,789818648


In [286]:
page_data.iloc[1,2]
type(page_data.iloc[1,2])

numpy.int64

## Get ORES Scores

Documentation for the ORES API can be found here:  
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

In [199]:
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/rexthompson', 'From' : 'rext@uw.edu'}

In [225]:
revids_all = list(page_data['last_edit'])
revids_all = revids_all[0:17]  # TEMP FOR TESTING

In [303]:
df = pd.DataFrame()

# loop 100 entries at a time
idx_start = 0
idx_end = 100
while idx_start < len(revids_all):  # '<' avoids an empty final request when the count is a multiple of 100

    # retrieve and concatenate subset of revids
    revids = revids_all[idx_start:idx_end]
    revids = '|'.join(str(x) for x in revids)
    print(revids)
    
    # pull article data from API
    params = {'project' : 'enwiki',
              'revids' : revids,
              'model' : 'wp10'
          }
    
    # pass headers by keyword: the second positional argument of requests.get is `params`
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    
    for revid in response['enwiki']['scores']:
        try:
            temp_dict = response['enwiki']['scores'][revid]['wp10']['score']['probability']
            rating = max(temp_dict, key=temp_dict.get)
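        except KeyError:
            # no 'score' entry came back for this revision
            rating = np.nan
        # NOTE -- DataFrame.append was removed in pandas 2.0; on newer versions,
        # collect the rows in a list and build the frame with pd.DataFrame(rows)
        df = df.append({'revid':int(revid), 'rating':rating}, ignore_index=True)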
    
    # NOTE -- ratings do not return in the same order as they were passed to the API!!!
    
    # update indexes
    idx_start += 100
    idx_end = min(idx_start+100, len(revids_all))

802551672|774499188|803841397|789818648|785284614|798644673|728644481|788591677|758713659|802860970|797469371|804349394|799618550|805063877|718383950|805775169|778690357


In [304]:
revid, temp_dict, rating

('805775169',
 {'B': 0.05049766900209879,
  'C': 0.12329713718688096,
  'FA': 0.003235367931432042,
  'GA': 0.004870197426292483,
  'Start': 0.8020940288738824,
  'Stub': 0.016005599579413436},
 'Start')

In [305]:
df.head()

Unnamed: 0,rating,revid
0,Stub,718383950.0
1,Stub,728644481.0
2,C,758713659.0
3,Stub,774499188.0
4,Start,778690357.0


OK, so now we have a nice table of ratings for each article. Let's merge this back with the original article data.

In [311]:
rating_data = page_data.merge(df, left_on='last_edit', right_on='revid').drop(columns='revid')
rating_data.head()

Unnamed: 0,country,page,last_edit,rating
0,Abkhazia,Zurab Achba,802551672,C
1,Abkhazia,Garri Aiba,774499188,Stub
2,Abkhazia,Zaur Avidzba,803841397,C
3,Abkhazia,Raul Eshba,789818648,Start
4,Abkhazia,Guram Gabiskiria,785284614,Start
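
The merge above uses the default inner join, so any article whose revision ID is missing from `df` is silently dropped, and articles the API could not score carry a `NaN` rating. Here is a minimal sketch, on toy data, of how a left merge with `indicator=True` could surface both cases. The page names, revision IDs and ratings below are illustrative stand-ins, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for page_data and the ratings table (hypothetical values)
pages = pd.DataFrame({
    'page': ['Page A', 'Page B', 'Page C'],
    'last_edit': [101, 102, 103],
})
scores = pd.DataFrame({
    'revid': [102, 101],
    'rating': ['Stub', np.nan],   # revision 101 came back without a score
})

# A left merge keeps every article; `_merge` records where each row matched
merged = pages.merge(scores, how='left', left_on='last_edit',
                     right_on='revid', indicator=True)

# Articles that never appeared in the ratings table
not_scored = merged[merged['_merge'] == 'left_only']

# Articles that were returned by the API but have no rating
unrated = merged[(merged['_merge'] == 'both') & merged['rating'].isna()]

print(not_scored['page'].tolist())
print(unrated['page'].tolist())
```

Checking both lists before dropping the helper columns makes it easier to tell how many articles fall out of the analysis and why.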


In [271]:
# The API can only handle so many revisions per request, so the full run goes 100 at a time.
# Here we score a single revision to inspect the structure of the response.
params = {'project' : 'enwiki',
          'revids' : '797882322',
          'model' : 'wp10'
          }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
response
#print(json.dumps(response, indent=4, sort_keys=True))

for revid in response['enwiki']['scores']:
    print(revid)
    temp_dict = response['enwiki']['scores'][revid]['wp10']['score']['probability']
    rating = max(temp_dict, key=temp_dict.get)
    print(rating)

797882322
C


In [56]:
response

{'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {'797882322': {'wp10': {'score': {'prediction': 'C',
      'probability': {'B': 0.2248799278914761,
       'C': 0.4778248966543574,
       'FA': 0.004980276199873982,
       'GA': 0.052563166554547465,
       'Start': 0.23619365248131424,
       'Stub': 0.0035580802184307985}}}}}}}
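
The response above already carries a `prediction` field for each revision, which in this example matches the class with the highest probability. Below is a minimal sketch of unpacking such a response into a `{revid: rating}` dictionary. The helper name `extract_ratings` is hypothetical, and the shape of the failed entry (an `error` key in place of `score`) is an assumption about how ORES reports revisions it cannot score, so verify it against a real response.

```python
import math

def extract_ratings(response, project='enwiki', model='wp10'):
    """Map each revision ID in an ORES v3 response to its predicted class, or NaN."""
    ratings = {}
    for revid, result in response[project]['scores'].items():
        score = result.get(model, {}).get('score')
        # Fall back to NaN when no score is present (assumed error shape)
        ratings[int(revid)] = score['prediction'] if score else float('nan')
    return ratings

# First entry copied from the response shown above; the second is an ASSUMED error entry
sample = {'enwiki': {'models': {'wp10': {'version': '0.5.0'}},
  'scores': {
    '797882322': {'wp10': {'score': {'prediction': 'C',
        'probability': {'B': 0.2248799278914761,
                        'C': 0.4778248966543574,
                        'FA': 0.004980276199873982,
                        'GA': 0.052563166554547465,
                        'Start': 0.23619365248131424,
                        'Stub': 0.0035580802184307985}}}},
    '1': {'wp10': {'error': {'message': 'hypothetical failure', 'type': 'Hypothetical'}}},
  }}}

ratings = extract_ratings(sample)
print(ratings)
```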

Importing the other data is just a matter of reading CSV files in! (and for the R programmers - we'll have an R example up as soon as the Hub supports the language).

In [None]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])

In [None]:
print(data[782])

### Note from Andrew Enfield on Slack on 10/25/17:

FYI that I was able to get scores for all articles in just two minutes with the https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context API. I retrieved the scores in chunks of 140 articles at a time/per call - when I experimented, 140 always worked but 145 or 150 gave me errors.
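
A minimal sketch of the chunking described in the note, assuming the revision IDs are already in a Python list. The helper name `chunked` is hypothetical, the IDs are stand-ins, and the batch size of 140 is only the figure reported in the note, not a documented API limit.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in revision IDs; in the notebook these would come from page_data['last_edit']
revids = list(range(1, 301))

# One pipe-delimited string per API call, as expected by the `revids` query parameter
batches = ['|'.join(str(r) for r in chunk) for chunk in chunked(revids, 140)]

print(len(batches))
```

Stepping through `range(0, len(items), size)` never produces an empty final batch, so no extra bounds checks are needed.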

# OLD STUFF

# Get ORES Scores

Below is an example of how to make a request through the ORES system in Python to find out the current quality of the article on [Aaron Halfaker](https://en.wikipedia.org/wiki/Aaron_Halfaker) (the person who created ORES):

Actually use the following link for documentation:  
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

In [None]:
# single-revision form, superseded by the batch form below:
# endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'enwiki',
          'revids' : '797882120|797882121',
          'model' : 'wp10'
          }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
print(json.dumps(response, indent=4, sort_keys=True))

