## Human-Centered Data Science
### Assignment 2: Bias

Andrew Smith (ucalegon@uw.edu)

Instructions: You will first load the page_data.csv from Wikipedia. This file contains the following fields:

    1) the name of the page/article
    2) the country from which the article originated
    3) the last revision ID of the article

This file contains information about politicians in each country. Our task is to analyze any bias found in the data.

In [16]:
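import requests
import pandas
import json
# strftime and localtime are used for the progress messages in the API loop
from time import strftime, localtime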

# Load the page data
fp = 'page_data.csv'
df = pandas.read_csv(fp)
df

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
5,Template:Nigeria-politician-stub,Nigeria,391862819
6,Template:Colombia-politician-stub,Colombia,391863340
7,Template:Chile-politician-stub,Chile,391863361
8,Template:Fiji-politician-stub,Fiji,391863617
9,Template:Solomons-politician-stub,Solomon Islands,391863809


Load the population data so that the population numbers can be joined to the page data. The original population figures had to be cleaned up before this process could consume them.

In [10]:
dfpop = pandas.read_csv('population_transformed.csv')
dfpop.columns = ['country', 'population']
dfpop

Unnamed: 0,country,population
0,Afghanistan,32247000
1,Albania,2892000
2,Algeria,39948000
3,Andorra,78000
4,Angola,25000000
5,Antigua and Barbuda,90000
6,Argentina,42426000
7,Armenia,3017106
8,Australia,23888000
9,Austria,8615955
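The clean-up step mentioned above is not shown in this notebook. As a minimal sketch, assuming the raw file stored populations as quoted strings with thousands separators (the raw layout and the `raw` variable are assumptions for illustration, not the original file), the conversion could look like this:

```python
import io
import pandas

# Hypothetical raw input: populations stored as quoted strings with thousands separators.
raw_csv = io.StringIO(
    'country,population\n'
    'Afghanistan,"32,247,000"\n'
    'Albania,"2,892,000"\n'
    'Andorra,"78,000"\n'
)

raw = pandas.read_csv(raw_csv)

# Strip the separators and convert to integers so the column can be used in arithmetic.
raw['population'] = raw['population'].str.replace(',', '', regex=False).astype(int)

print(raw)
```

Converting to integers up front means the later division of article counts by population works without further type handling.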


Send requests to the ORES API at Wikimedia, 50 revision IDs at a time, to retrieve the predicted article quality for each revision.

In [9]:
# Set a few variables for use later
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/ucalegon1979', 'From' : 'ucalegon@uw.edu'}
project = 'enwiki'
model = 'wp10'

i = 0
increment = 50

# Create an empty dataframe to contain the responses of article quality for each rev_id
dfresponse = pandas.DataFrame(columns=['rev_id','response'])

while i < df.shape[0]:
    params = {'project': project,
              'model': model,
              'revids': '|'.join(str(x) for x in df['rev_id'][i:i+increment])
              }

    # pass the identifying headers defined above along with each request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    # retrieve the predicted quality class for each rev_id in the returned dictionary
    for rev_id in response['enwiki']['scores']:
        r = ''
        try:
            r = response['enwiki']['scores'][rev_id]['wp10']['score']['prediction']
        except KeyError:
            # no prediction is present for this rev_id
            r = 'Revision Not Found'

        dfresponse.loc[i] = {'rev_id':rev_id, 'response':r}
        i += 1

    if i % 1000 == 0:
        print(str(i) + ' at time ' + strftime("%a, %d %b %Y %H:%M:%S +0000", localtime()))

1000 at time Thu, 02 Nov 2017 04:27:04 +0000
2000 at time Thu, 02 Nov 2017 04:27:12 +0000
3000 at time Thu, 02 Nov 2017 04:27:19 +0000
4000 at time Thu, 02 Nov 2017 04:27:27 +0000
5000 at time Thu, 02 Nov 2017 04:27:35 +0000
6000 at time Thu, 02 Nov 2017 04:27:43 +0000
7000 at time Thu, 02 Nov 2017 04:27:51 +0000
8000 at time Thu, 02 Nov 2017 04:28:00 +0000
9000 at time Thu, 02 Nov 2017 04:28:08 +0000
10000 at time Thu, 02 Nov 2017 04:28:16 +0000
11000 at time Thu, 02 Nov 2017 04:28:25 +0000
12000 at time Thu, 02 Nov 2017 04:28:33 +0000
13000 at time Thu, 02 Nov 2017 04:28:41 +0000
14000 at time Thu, 02 Nov 2017 04:28:51 +0000
15000 at time Thu, 02 Nov 2017 04:29:00 +0000
16000 at time Thu, 02 Nov 2017 04:29:09 +0000
17000 at time Thu, 02 Nov 2017 04:29:18 +0000
18000 at time Thu, 02 Nov 2017 04:29:27 +0000
19000 at time Thu, 02 Nov 2017 04:29:37 +0000
20000 at time Thu, 02 Nov 2017 04:29:46 +0000
21000 at time Thu, 02 Nov 2017 04:29:57 +0000
22000 at time Thu, 02 Nov 2017 04:30:07 +0000
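The batching and response parsing in the loop above can be separated into small functions that are testable without network access. This is a minimal sketch, not the notebook's own code: the helper names `chunk_rev_ids`, `build_url` and `extract_predictions` are hypothetical, and the shape of the error entry in `sample` is an assumption. Only the key path to `prediction` comes from the loop above.

```python
def chunk_rev_ids(rev_ids, size=50):
    """Yield successive lists of at most `size` revision IDs."""
    for start in range(0, len(rev_ids), size):
        yield rev_ids[start:start + size]


def build_url(endpoint, project, model, rev_ids):
    """Fill the endpoint template with a pipe-separated list of revision IDs."""
    return endpoint.format(project=project, model=model,
                           revids='|'.join(str(r) for r in rev_ids))


def extract_predictions(response, project='enwiki', model='wp10'):
    """Map each rev_id (a string key in the JSON) to its predicted class."""
    predictions = {}
    for rev_id, result in response[project]['scores'].items():
        try:
            predictions[rev_id] = result[model]['score']['prediction']
        except KeyError:
            predictions[rev_id] = 'Revision Not Found'
    return predictions


endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

# Hypothetical response: the second entry imitates a revision without a score.
sample = {
    'enwiki': {
        'scores': {
            '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
            '111': {'wp10': {'error': {'type': 'RevisionNotFound'}}},
        }
    }
}

print(build_url(endpoint, 'enwiki', 'wp10', [355319463, 111]))
print(extract_predictions(sample))
```

Keeping the parsing in its own function makes it possible to check the `KeyError` fallback against a small hand-written dictionary before sending thousands of requests.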

Merge our original data with the responses from ORES, and save the result.

In [17]:
dfresponse['rev_id'] = dfresponse['rev_id'].astype(int)
df = df.merge(right=dfresponse, how='inner', on='rev_id')
df.to_csv('page_data_responses.csv')

# Assign the correct column names to the dataset
df.columns = ['article_name','country','revision_id','article_quality']
df.loc[0:10]

Unnamed: 0,article_name,country,revision_id,article_quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub
5,Template:Nigeria-politician-stub,Nigeria,391862819,Stub
6,Template:Colombia-politician-stub,Colombia,391863340,Stub
7,Template:Chile-politician-stub,Chile,391863361,Stub
8,Template:Fiji-politician-stub,Fiji,391863617,Stub
9,Template:Solomons-politician-stub,Solomon Islands,391863809,Stub
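The cast to `int` before the merge matters because JSON object keys are strings, so the revision IDs collected from the response are text while `rev_id` in the page data is numeric. A minimal sketch on two rows taken from the tables above (the `pages`, `responses` and `merged` names are illustrative):

```python
import pandas

pages = pandas.DataFrame({
    'page': ['Bir I of Kanem', 'Template:Uganda-politician-stub'],
    'country': ['Chad', 'Uganda'],
    'rev_id': [355319463, 391862070],
})

# Revision IDs read from JSON keys arrive as strings.
responses = pandas.DataFrame({
    'rev_id': ['355319463', '391862070'],
    'response': ['Stub', 'Stub'],
})

# Align the key types first, then join on the shared rev_id column.
responses['rev_id'] = responses['rev_id'].astype(int)
merged = pages.merge(responses, how='inner', on='rev_id')

print(merged)
```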


Prepare a table grouped by country to find the number of articles as a proportion of each country's population.

In [18]:
# First, aggregate the page data (left) to one row per country before joining
# Second, join on country, since the population data (right) already has one row per country
dfall = df.groupby(by='country')['revision_id'].count()
dfall = dfall.reset_index()
dfall = dfall.merge(right=dfpop, how='inner', on='country')
dfall.columns = ['country','rev_id_count','population']
dfall['proportion'] = dfall['rev_id_count']/dfall['population']
dfall

Unnamed: 0,country,rev_id_count,population,proportion
0,Afghanistan,327,32247000,1.014048e-05
1,Albania,460,2892000,1.590595e-04
2,Algeria,119,39948000,2.978873e-06
3,Andorra,34,78000,4.358974e-04
4,Angola,110,25000000,4.400000e-06
5,Antigua and Barbuda,25,90000,2.777778e-04
6,Argentina,496,42426000,1.169094e-05
7,Armenia,199,3017106,6.595725e-05
8,Australia,1566,23888000,6.555593e-05
9,Austria,340,8615955,3.946167e-05
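Because the join above is an inner join, any country whose name is spelled differently in the two files drops out silently. A minimal sketch of one way to list such countries before computing proportions, using the `indicator` option of `merge`. The `Exampleland` row is made up to trigger the mismatch, and the variable names are illustrative:

```python
import pandas

counts = pandas.DataFrame({
    'country': ['Andorra', 'Austria', 'Exampleland'],  # 'Exampleland' is hypothetical
    'rev_id_count': [34, 340, 5],
})
population = pandas.DataFrame({
    'country': ['Andorra', 'Austria'],
    'population': [78000, 8615955],
})

# A left join with indicator=True adds a '_merge' column marking rows without a match.
check = counts.merge(population, how='left', on='country', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
print('Countries without population data:', unmatched)

# The inner join keeps only matched countries, as in the cell above.
matched = counts.merge(population, how='inner', on='country')
matched['proportion'] = matched['rev_id_count'] / matched['population']
print(matched)
```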


Prepare a table grouped by country to find the number of high-quality (FA or GA) articles as a proportion of each country's population.

In [20]:
# flag those which are FA or GA as true, then implicitly convert boolean array to int
df['high_quality'] = ((df.article_quality == 'FA') | (df.article_quality == 'GA'))*1

# sum by the highquality 1 or 0 field, grouping by country
dfhighquality = df.groupby(by=['country'])['high_quality'].sum()

# set the country index as its own named column
dfhighquality = dfhighquality.reset_index()

# join this now to the population dataset to be able to compute proportions
dfhighquality = dfhighquality.merge(right=dfpop, how='inner', on='country')
dfhighquality.columns = ['country','high_quality_count','population']
dfhighquality['proportion'] = dfhighquality['high_quality_count']/dfhighquality['population']

dfhighquality

Unnamed: 0,country,high_quality_count,population,proportion
0,Afghanistan,19,32247000,5.892021e-07
1,Albania,5,2892000,1.728907e-06
2,Algeria,3,39948000,7.509763e-08
3,Andorra,0,78000,0.000000e+00
4,Angola,2,25000000,8.000000e-08
5,Antigua and Barbuda,0,90000,0.000000e+00
6,Argentina,16,42426000,3.771272e-07
7,Armenia,6,3017106,1.988661e-06
8,Australia,44,23888000,1.841929e-06
9,Austria,3,8615955,3.481912e-07
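The flag built above with two comparisons and `*1` can also be written with `isin`, which may read more clearly when the list of quality classes grows. A minimal sketch on made-up rows (the country assignments and quality labels here are illustrative, not taken from the data):

```python
import pandas

# Made-up rows for illustration only.
articles = pandas.DataFrame({
    'country': ['Iceland', 'Iceland', 'Nauru', 'Iceland'],
    'article_quality': ['GA', 'Stub', 'Start', 'FA'],
})

# True where the predicted class is FA or GA, converted explicitly to 0/1.
articles['high_quality'] = articles['article_quality'].isin(['FA', 'GA']).astype(int)

# Sum the 0/1 flag per country to count high-quality articles.
high_quality_counts = articles.groupby('country')['high_quality'].sum().reset_index()
print(high_quality_counts)
```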


### Analysis

1) Display the 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [21]:
dfall.sort_values(by='proportion', ascending=False)[0:10]

Unnamed: 0,country,rev_id_count,population,proportion
120,Nauru,53,10860,0.00488
173,Tuvalu,55,11800,0.004661
141,San Marino,82,33000,0.002485
113,Monaco,40,38088,0.00105
97,Liechtenstein,29,37570,0.000772
107,Marshall Islands,37,55000,0.000673
72,Iceland,206,330828,0.000623
168,Tonga,63,103300,0.00061
3,Andorra,34,78000,0.000436
54,Federated States of Micronesia,38,103000,0.000369


2) Display the 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [22]:
dfall.sort_values(by='proportion', ascending=True)[0:10]

Unnamed: 0,country,rev_id_count,population,proportion
73,India,990,1314097616,7.533687e-07
34,China,1138,1371920000,8.294944e-07
74,Indonesia,215,255741973,8.406911e-07
180,Uzbekistan,29,31290791,9.267902e-07
53,Ethiopia,105,98148000,1.069813e-06
86,"Korea, North",39,24983000,1.561062e-06
185,Zambia,26,15473900,1.680249e-06
166,Thailand,112,65121250,1.719869e-06
38,"Congo, Dem. Rep. of",142,73340200,1.936182e-06
13,Bangladesh,324,160411000,2.019812e-06
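The two rankings above sort the whole table and slice the first ten rows. As a minimal alternative sketch, `nlargest` and `nsmallest` select the extremes directly. The four rows below are copied from the tables above, and the `table`, `top` and `bottom` names are illustrative:

```python
import pandas

table = pandas.DataFrame({
    'country': ['Nauru', 'Tuvalu', 'India', 'China'],
    'rev_id_count': [53, 55, 990, 1138],
    'population': [10860, 11800, 1314097616, 1371920000],
})
table['proportion'] = table['rev_id_count'] / table['population']

# Pick the two highest and two lowest proportions without sorting everything.
top = table.nlargest(2, 'proportion')
bottom = table.nsmallest(2, 'proportion')

print(top)
print(bottom)
```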


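3) 10 highest-ranked countries in terms of number of GA and FA-quality articles. Note that the proportion shown here is computed relative to country population (high_quality_count / population), not relative to the number of articles about politicians from that country.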

In [23]:
dfhighquality.sort_values(by='proportion', ascending=False)[0:10]

Unnamed: 0,country,high_quality_count,population,proportion
173,Tuvalu,1,11800,8.5e-05
181,Vanuatu,3,277500,1.1e-05
72,Iceland,3,330828,9e-06
64,Grenada,1,111000,9e-06
77,Ireland,31,4630308,7e-06
104,Maldives,2,346946,6e-06
19,Bhutan,3,757000,4e-06
59,Gabon,6,1751000,3e-06
115,Montenegro,2,622421,3e-06
129,Palestinian Territory,12,4481195,3e-06
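The table above divides the GA/FA count by population. A minimal sketch of dividing by the number of politician articles per country instead, using made-up rows for two hypothetical countries `A` and `B` (the variable names are illustrative, and this is not the computation the notebook ran):

```python
import pandas

# Made-up rows for two hypothetical countries.
articles = pandas.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality': ['FA', 'GA', 'Stub', 'Start', 'Stub', 'C'],
})
articles['high_quality'] = articles['article_quality'].isin(['FA', 'GA'])

# Per country: number of high-quality articles and total number of articles.
by_country = articles.groupby('country')['high_quality'].agg(
    high_quality_count='sum', article_count='count').reset_index()

# Share of each country's articles that are FA or GA.
by_country['proportion'] = by_country['high_quality_count'] / by_country['article_count']
print(by_country)
```

With this denominator, a country's rank depends on the quality mix of its own articles rather than on how large its population is.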


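4) 10 lowest-ranked countries in terms of number of GA and FA-quality articles. As in the previous table, the proportion is computed relative to country population, not relative to the number of articles about politicians from that country. Every country shown has zero GA or FA articles, so the order among them is arbitrary.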

In [24]:
dfhighquality.sort_values(by='proportion', ascending=True)[0:10]

Unnamed: 0,country,high_quality_count,population,proportion
69,Guyana,0,743000,0.0
117,Mozambique,0,25736000,0.0
83,Kazakhstan,0,17544274,0.0
85,Kiribati,0,113400,0.0
146,Seychelles,0,92833,0.0
26,Burundi,0,10742000,0.0
113,Monaco,0,38088,0.0
46,Dominica,0,68000,0.0
120,Nauru,0,10860,0.0
45,Djibouti,0,900000,0.0
