# Bias on Wikipedia

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), the goal is to analyze what the nature of political articles on Wikipedia, both their existence and their quality, can tell us about bias in Wikipedia's content.

## Getting the article and population data

The first step is to load data files downloaded from different online resources. The data files are:
1. page_data.csv: Wikipedia political articles data
2. Population Mid-2015.csv: population data of a variety of countries

Getting the data from page_data.csv file

In [1]:
import csv

data = []
revid = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])
        revid.append(row[2])
# Remove the first element ('rev_id') from revid so that the list only contains revision IDs.
revid.pop(0)

'rev_id'

Getting the data (country and population) from the population file

In [2]:
from itertools import islice
import csv

import pandas as pd
population = []
with open('Population Mid-2015.csv') as population_file:
    reader = csv.reader(population_file)
    # Note: the first row is a title and the second row is blank; the last two rows are also blank.
    # Skip the first two rows and stop before the trailing blank rows.
    for row in islice(reader,2,213):
        population.append([row[0],row[4]])
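
The population values are later converted with `.replace(",","")`, which suggests they are stored as strings with thousands separators. A minimal sketch of that conversion follows, assuming the country name sits in column 0 and the population in column 4 as in the cell above. The in-memory sample, its placeholder columns, and the `parse_population` helper are hypothetical; only the two population figures come from the results shown later.

```python
import csv
import io

# Hypothetical sample mimicking the assumed layout: country name in column 0
# and a comma-formatted population string in column 4.
sample = io.StringIO(
    'Location,a,b,c,Data\n'
    'Zambia,x,x,x,"15,473,900"\n'
    'Chad,x,x,x,"13,707,000"\n'
)

def parse_population(text):
    """Convert a string such as '15,473,900' to an int."""
    return int(text.replace(",", ""))

rows = list(csv.reader(sample))
# Skip the header row, then map each country to its integer population.
populations = {row[0]: parse_population(row[4]) for row in rows[1:]}
print(populations)
```

Converting once at load time would avoid repeating the string cleanup inside later loops, though the notebook does the conversion during analysis instead.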

## Getting article quality predictions

In this step, we get article quality predictions from the ORES API. To avoid hitting ORES request limits, we split the revision IDs into chunks of 50. ORES assigns each article one of six quality categories:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

Split revision IDs into chunks of 50

In [3]:
chunks = [revid[x:x+50] for x in range(0, len(revid), 50)]
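
As a quick illustration of the chunking expression above, the sketch below wraps it in a helper and applies it to made-up IDs. The `chunk_list` name and the sample IDs are hypothetical.

```python
def chunk_list(items, size):
    """Split items into consecutive sub-lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Made-up revision IDs, used only to show the resulting chunk sizes.
example_ids = [str(n) for n in range(120)]
example_chunks = chunk_list(example_ids, 50)
chunk_sizes = [len(chunk) for chunk in example_chunks]
print(chunk_sizes)  # [50, 50, 20]
```

The final chunk simply holds whatever remains, so no revision ID is dropped when the total is not a multiple of 50.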

Write a function to make a request with multiple revision IDs

In [4]:
import requests
import json

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so that ORES can identify who is making the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
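
To see what URL the function requests, the sketch below builds the same string without touching the network. The `build_ores_url` helper and the revision IDs are hypothetical; the endpoint template and parameter values are the ones used above.

```python
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='wp10'):
    """Return the request URL for a batch of revision IDs (no request is sent)."""
    revids = '|'.join(str(x) for x in revision_ids)
    return ENDPOINT.format(project=project, model=model, revids=revids)

# Made-up revision IDs for illustration.
url = build_ores_url([111, 222, 333])
print(url)
```

Printing the URL for a small batch is a cheap way to check the parameters before sending hundreds of requests.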

Request the values for prediction (the quality of an article) from ORES API.

In [5]:
headers = {'User-Agent' : 'https://github.com/yawen32', 'From' : 'liy44@uw.edu'}
article_quality = []
for chunk in chunks:
    response = get_ores_data(chunk, headers)
    scores = response['enwiki']['scores']
    for rev_id in chunk:
        wp10 = scores[rev_id]['wp10']
        # Flag revisions that could not be scored (e.g. the article has been deleted)
        if 'error' in wp10:
            article_quality.append("None")
        else:
            article_quality.append(wp10['score']['prediction'])
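
The parsing logic can be exercised offline against a hand-written response. The sketch below assumes only the keys the loop above reads (`scores`, `wp10`, `score`, `prediction`, `error`). The `extract_predictions` helper, the revision IDs, and the contents of the error entry are hypothetical placeholders, not a record of what ORES returns.

```python
def extract_predictions(response, revision_ids, project='enwiki', model='wp10'):
    """Return one prediction per revision ID, or "None" when scoring failed."""
    scores = response[project]['scores']
    predictions = []
    for rev_id in revision_ids:
        result = scores[str(rev_id)][model]
        if 'error' in result:
            predictions.append("None")
        else:
            predictions.append(result['score']['prediction'])
    return predictions

# Hand-written stand-in for a response; the error body is a placeholder.
mock_response = {
    'enwiki': {'scores': {
        '111': {'wp10': {'score': {'prediction': 'Stub'}}},
        '222': {'wp10': {'error': {'message': 'placeholder'}}},
    }}
}

predictions = extract_predictions(mock_response, ['111', '222'])
print(predictions)  # ['Stub', 'None']
```

Keeping exactly one entry per revision ID matters here, because the predictions are later attached to the article table by position.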

Save prediction values to a file

In [6]:
# Write one prediction per line; the with-block closes the file automatically
with open("article_quality.txt","w") as aq:
    for item in article_quality:
        aq.write("{}\n".format(item))

In [7]:
with open("article_quality.csv","w",newline="") as f:
    aqcsv = csv.writer(f)
    aqcsv.writerow(article_quality)

Read prediction values from the saved file

In [8]:
with open('article_quality.txt','r') as f:
    articleQuality = f.read().splitlines()
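
A small round-trip check shows that writing one prediction per line and reading it back with `splitlines()` preserves both order and values. The sketch uses a temporary directory and made-up predictions so it does not touch the real output file.

```python
import os
import tempfile

# Made-up predictions, including the "None" flag used for unscored revisions.
predictions = ["Stub", "GA", "None", "FA"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "article_quality.txt")
    # Write one prediction per line.
    with open(path, "w") as f:
        for item in predictions:
            f.write("{}\n".format(item))
    # Read the file back; splitlines() drops the trailing newlines.
    with open(path, "r") as f:
        restored = f.read().splitlines()

print(restored == predictions)  # True
```

Note that the "None" flag comes back as the string "None", not as Python's `None`, which matches how it was stored.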

## Combining the datasets
In this step, we combine the article quality data, article data, and population data. Rows without matching data are removed during the merge. The merged data is written to a single CSV file containing five columns: country, article_name, revision_id, article_quality, population.

First, add the ORES data into the Wikipedia data, then merge the Wikipedia data and population data together on the common key value (country).

In [9]:
wiki_data = pd.DataFrame(data[1:],columns=data[0])

In [10]:
wiki_data
len(pd.Series(articleQuality).values)

47197

In [11]:
# Add the ORES data into the Wikipedia data 
wiki_data["article_quality"] = pd.Series(articleQuality).values

In [12]:
# Rename columns of the Wikipedia data
wiki_data.columns = ["article_name","country","revision_id","article_quality"]

In [13]:
# Convert data (country and population) from the population file to dataframe
population_data = pd.DataFrame(population[1:],columns=population[0])

In [14]:
# Renames the columns with suitable names
population_data.columns = ["Location","population"]

In [15]:
# Merge the two datasets (wiki_data and population_data) based on the common key (country name).
# An inner join automatically removes the rows that do not have matching data.
merge_data = pd.merge(wiki_data, population_data, left_on = 'country', right_on = 'Location', how = 'inner')
merge_data = merge_data.drop('Location', axis=1)
# Swap first and second columns so that the dataframe follows the formatting conventions
merge_data = merge_data[["country","article_name","revision_id","article_quality","population"]]
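
To make the effect of the inner join concrete, here is a dependency-free sketch in plain Python rather than pandas. It is an illustration of the idea, not the notebook's method. The article names, revision IDs, and the "Atlantis" row are made up.

```python
# (article_name, country, revision_id, article_quality); all rows are made up.
wiki_rows = [
    ("Article A", "Zambia", "1001", "Stub"),
    ("Article B", "Chad", "1002", "GA"),
    ("Article C", "Atlantis", "1003", "C"),  # no matching population entry
]
# Fiji has a population entry but no article in this toy sample.
population_by_country = {"Zambia": "15,473,900", "Chad": "13,707,000", "Fiji": "867,000"}

# Inner join: keep only articles whose country appears in both sources,
# and emit columns in the order country, article_name, revision_id,
# article_quality, population.
merged = [
    (country, name, rev_id, quality, population_by_country[country])
    for (name, country, rev_id, quality) in wiki_rows
    if country in population_by_country
]
for row in merged:
    print(row)
```

Both the unmatched article ("Atlantis") and the unmatched country ("Fiji") disappear from the result, which is the behaviour `how = 'inner'` requests.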

Write merged data to a CSV file

In [16]:
# index=False keeps the file to the five columns listed above
merge_data.to_csv("final_data.csv", index=False)

## Analysis

In this step, we analyze the merged dataset ("final_data.csv") to understand how the coverage of politicians on Wikipedia, and the quality of articles about politicians, varies among countries.

Calculate the proportion (as a percentage) of articles-per-population

In [26]:
# Extract column "country" from merge data
merge_country = merge_data.iloc[:,0].tolist()

In [27]:
# Count the number of articles for each country
from collections import Counter
count_article = Counter(merge_country)

In [28]:
prop_article_per_population = []
df_prop_article_per_population = pd.DataFrame(columns=['country', 'population', 'num_articles','prop_article_per_population'])
num_country = 0

for country in count_article:
    # Use a separate name so the `population` list loaded earlier is not overwritten
    country_population = int(population_data.loc[population_data["Location"] == country, "population"].iloc[0].replace(",",""))
    percentage = count_article[country] / country_population
    prop_article_per_population.append("{:.10%}".format(percentage))
    df_prop_article_per_population.loc[num_country] = [country,country_population,count_article[country],"{:.10%}".format(percentage)]
    num_country += 1
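
As a worked example of the calculation, the sketch below reproduces the Zambia row using the article count (26) and population (15,473,900) reported in the results table. The list of country labels is synthetic and stands in for the merged data.

```python
from collections import Counter

# Synthetic stand-in for the "country" column: 26 articles for Zambia.
countries = ["Zambia"] * 26
populations = {"Zambia": 15473900}

counts = Counter(countries)
# Articles per person, formatted as a percentage with 10 decimal places.
formatted = {
    country: "{:.10%}".format(counts[country] / populations[country])
    for country in counts
}
print(formatted["Zambia"])  # 0.0001680249%
```

The `%` format type multiplies the ratio by 100 before rounding, so the raw value of about 1.68e-06 is displayed as 0.0001680249%.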

In [29]:
# Show the table of the proportion of articles-per-population for each country
df_prop_article_per_population

index,country,population,num_articles,prop_article_per_population
0,Zambia,15473900.0,26.0,0.0001680249%
1,Chad,13707000.0,100.0,0.0007295542%
2,Zimbabwe,17354000.0,167.0,0.0009623142%
3,Uganda,40141000.0,188.0,0.0004683491%
4,Namibia,2482100.0,165.0,0.0066475968%
5,Nigeria,181839400.0,684.0,0.0003761561%
6,Colombia,48218000.0,288.0,0.0005972873%
7,Chile,18025000.0,352.0,0.0019528433%
8,Fiji,867000.0,199.0,0.0229527105%
9,Solomon Islands,641900.0,98.0,0.0152671756%


Calculate the proportion (as a percentage) of high-quality articles for each country.

In [30]:
prop_high_quality_articles_each_country = []
df_prop_high_quality_articles_each_country = pd.DataFrame(columns=["country","num_high_quality_articles","num_articles","prop_high_quality_articles"])
num_country = 0

for country in count_article:
    # Count the quality labels once per country, then sum the FA and GA counts
    quality_counts = Counter(merge_data.loc[merge_data['country'] == country].iloc[:,3].tolist())
    num_high_quality = quality_counts['FA'] + quality_counts['GA']
    percentage = num_high_quality / count_article[country]
    prop_high_quality_articles_each_country.append("{:.10%}".format(percentage))
    df_prop_high_quality_articles_each_country.loc[num_country] = [country,num_high_quality,count_article[country],"{:.10%}".format(percentage)]
    num_country += 1
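
The high-quality proportion can be checked by hand in the same way. The sketch below mirrors the Chad row of the results table (2 high-quality articles out of 100). The split into one FA and one GA article is an assumption made for illustration, since the table reports only the combined count.

```python
from collections import Counter

# Synthetic quality labels for one country: 2 high-quality articles out of 100.
# The one-FA, one-GA split is assumed; only the total of 2 comes from the table.
qualities = ["FA"] * 1 + ["GA"] * 1 + ["Stub"] * 98

quality_counts = Counter(qualities)
# A Counter returns 0 for missing labels, so absent FA or GA entries are safe.
num_high_quality = quality_counts["FA"] + quality_counts["GA"]
proportion = "{:.10%}".format(num_high_quality / len(qualities))
print(proportion)  # 2.0000000000%
```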

In [31]:
# Show the table of the proportion of high-quality articles for each country
df_prop_high_quality_articles_each_country

index,country,num_high_quality_articles,num_articles,prop_high_quality_articles
0,Zambia,0.0,26.0,0.0000000000%
1,Chad,2.0,100.0,2.0000000000%
2,Zimbabwe,2.0,167.0,1.1976047904%
3,Uganda,1.0,188.0,0.5319148936%
4,Namibia,1.0,165.0,0.6060606061%
5,Nigeria,5.0,684.0,0.7309941520%
6,Colombia,3.0,288.0,1.0416666667%
7,Chile,3.0,352.0,0.8522727273%
8,Fiji,1.0,199.0,0.5025125628%
9,Solomon Islands,0.0,98.0,0.0000000000%


## Tables

Produce four tables that show:
1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
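
Because the proportions are stored as formatted strings, ranking requires converting them back to numbers first. The sketch below shows that conversion and a highest/lowest selection on four rows taken from the results in this notebook. The `percent_to_float` helper name is hypothetical, and plain lists stand in for the dataframes.

```python
def percent_to_float(text):
    """Convert a string such as '0.4880294659%' to a fraction (0.004880294659)."""
    return float(text.strip('%')) / 100

# (country, prop_article_per_population) pairs taken from the results tables.
rows = [
    ("Iceland", "0.0622680063%"),
    ("India", "0.0000753369%"),
    ("Nauru", "0.4880294659%"),
    ("Zambia", "0.0001680249%"),
]

# Sort on the numeric value, not the string, to get a correct ranking.
descending = sorted(rows, key=lambda row: percent_to_float(row[1]), reverse=True)
ascending = sorted(rows, key=lambda row: percent_to_float(row[1]))

top_2 = [country for country, _ in descending[:2]]
bottom_2 = [country for country, _ in ascending[:2]]
print(top_2)     # ['Nauru', 'Iceland']
print(bottom_2)  # ['India', 'Zambia']
```

Storing the numeric ratio alongside the formatted string would make this conversion unnecessary, at the cost of an extra column.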

10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [32]:
# Get index of 10 highest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
highest_rank_10_prop_article_per_population.to_csv("highest_rank_10_prop_article_per_population.csv")
highest_rank_10_prop_article_per_population

index,country,population,num_articles,prop_article_per_population
124,Nauru,10860.0,53.0,0.4880294659%
114,Tuvalu,11800.0,55.0,0.4661016949%
98,San Marino,33000.0,82.0,0.2484848485%
134,Monaco,38088.0,40.0,0.1050199538%
142,Liechtenstein,37570.0,29.0,0.0771892467%
148,Marshall Islands,55000.0,37.0,0.0672727273%
53,Iceland,330828.0,206.0,0.0622680063%
138,Tonga,103300.0,63.0,0.0609874153%
177,Andorra,78000.0,34.0,0.0435897436%
180,Federated States of Micronesia,103000.0,38.0,0.0368932039%


10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [33]:
# Get index of 10 lowest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
lowest_rank_10_prop_article_per_population.to_csv("lowest_rank_10_prop_article_per_population.csv")
lowest_rank_10_prop_article_per_population

index,country,population,num_articles,prop_article_per_population
44,India,1314098000.0,990.0,0.0000753369%
80,China,1371920000.0,1138.0,0.0000829494%
30,Indonesia,255742000.0,215.0,0.0000840691%
167,Uzbekistan,31290790.0,29.0,0.0000926790%
113,Ethiopia,98148000.0,105.0,0.0001069813%
119,"Korea, North",24983000.0,39.0,0.0001561062%
0,Zambia,15473900.0,26.0,0.0001680249%
157,Thailand,65121250.0,112.0,0.0001719869%
110,"Congo, Dem. Rep. of",73340200.0,142.0,0.0001936182%
43,Bangladesh,160411000.0,324.0,0.0002019812%


10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [34]:
# Get index of 10 highest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
highest_rank_10_prop_high_quality_articles.to_csv("highest_rank_10_prop_high_quality_articles.csv")
highest_rank_10_prop_high_quality_articles

index,country,num_high_quality_articles,num_articles,prop_high_quality_articles
119,"Korea, North",9.0,39.0,23.0769230769%
128,Saudi Arabia,14.0,119.0,11.7647058824%
167,Uzbekistan,3.0,29.0,10.3448275862%
172,Central African Republic,7.0,68.0,10.2941176471%
55,Romania,34.0,348.0,9.7701149425%
144,Guinea-Bissau,2.0,21.0,9.5238095238%
156,Bhutan,3.0,33.0,9.0909090909%
91,Vietnam,16.0,191.0,8.3769633508%
181,Dominica,1.0,12.0,8.3333333333%
162,Mauritania,4.0,52.0,7.6923076923%


10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [35]:
# Get index of 10 lowest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
lowest_rank_10_prop_high_quality_articles.to_csv("lowest_rank_10_prop_high_quality_articles_allzeros.csv")
lowest_rank_10_prop_high_quality_articles

index,country,num_high_quality_articles,num_articles,prop_high_quality_articles
0,Zambia,0.0,26.0,0.0000000000%
138,Tonga,0.0,63.0,0.0000000000%
134,Monaco,0.0,40.0,0.0000000000%
131,Tajikistan,0.0,40.0,0.0000000000%
127,Mozambique,0.0,60.0,0.0000000000%
124,Nauru,0.0,53.0,0.0000000000%
115,Antigua and Barbuda,0.0,25.0,0.0000000000%
142,Liechtenstein,0.0,29.0,0.0000000000%
107,Malta,0.0,103.0,0.0000000000%
102,French Guiana,0.0,28.0,0.0000000000%


In [70]:
# Get index of the 10 lowest-ranked countries whose proportion of high-quality articles is NOT equal to 0
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True)!=0
idx_not_zero = idx[idx == True].index[0:10]
lowest_rank_10_prop_high_quality_articles_not_zero = df_prop_high_quality_articles_each_country.loc[idx_not_zero]
lowest_rank_10_prop_high_quality_articles_not_zero.to_csv("lowest_rank_10_prop_high_quality_articles_notzeros.csv")
lowest_rank_10_prop_high_quality_articles_not_zero

index,country,num_high_quality_articles,num_articles,prop_high_quality_articles
72,Tanzania,1.0,408.0,0.2450980392%
22,Czech Republic,1.0,254.0,0.3937007874%
89,Lithuania,1.0,248.0,0.4032258065%
135,Morocco,1.0,208.0,0.4807692308%
8,Fiji,1.0,199.0,0.5025125628%
3,Uganda,1.0,188.0,0.5319148936%
68,Bolivia,1.0,187.0,0.5347593583%
57,Luxembourg,1.0,180.0,0.5555555556%
37,Peru,2.0,354.0,0.5649717514%
73,Sierra Leone,1.0,166.0,0.6024096386%
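
Since many countries tie at zero, the all-zero table depends on row order rather than on a meaningful ranking, which is why the non-zero variant is also produced. The sketch below illustrates the filter-then-rank idea on five rows taken from the tables above, using plain tuples instead of a dataframe.

```python
# (country, num_high_quality_articles, num_articles) taken from the tables above.
rows = [
    ("Zambia", 0, 26),
    ("Tanzania", 1, 408),
    ("Malta", 0, 103),
    ("Czech Republic", 1, 254),
    ("Peru", 2, 354),
]

# Drop countries with no high-quality articles, then rank by proportion.
nonzero = [
    (country, high_quality / total)
    for country, high_quality, total in rows
    if high_quality > 0
]
lowest_2 = [country for country, _ in sorted(nonzero, key=lambda row: row[1])[:2]]
print(lowest_2)  # ['Tanzania', 'Czech Republic']
```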
