# Assignment 2

## Part 1: Acquisition

The goal of the following assignment is to explore the concept of bias through data on Wikipedia articles about political figures from a variety of countries. In this first step, we acquire all data needed to perform the analysis. 

The first data import is for a file "page_data.csv" that contains a list of Wikipedia articles about world politicians, their associated countries, and the revision ids used to look up the articles. Next is an import of "WPDS_2018_data.csv", which has 2018 population data for the countries of the world as well as for overall world regions.

In [296]:
# Import packages for dataframe manipulation and viewing all code output in-notebook.
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Import data and peek at heads for sanity check.
page_data = pd.read_csv('page_data.csv')
pop_data = pd.read_csv('WPDS_2018_data.csv')
page_data.head()

In [298]:
pop_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


In the next few cells, I filter out rows from both dataframes that don't match up with valid politician articles from valid countries.

In [299]:
# Filter out template pages, i.e. any page with "Template" in its title (template pages are titled "Template:...").
page_filtered = page_data[~page_data['page'].str.contains("Template")]
page_filtered.head()

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [300]:
# Filter out population rows for geographic regions listed in uppercase (as opposed to countries).
pop_filtered = pop_data[~pop_data['Geography'].str.isupper()]
pop_filtered.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7


In the following code I import the ORES API client, which I'll use to query for article quality predictions. Beforehand I ran the command "pip install oresapi" in my OS terminal to make the package importable. After import, I used the session header provided in the assignment instructions to call the score function on the revision ids (rev_id) of all politician pages that survived filtering.

In [301]:
# Get ORES data for each article.
import oresapi
ores_session = oresapi.Session("https://ores.wikimedia.org", "Class project <jmorgan@wikimedia.org>")
page_scores = ores_session.score("enwiki", ["articlequality"], page_filtered['rev_id'])

## Part 2: Processing

The main goal of the processing step is to merge the three data sources into a single coherent dataframe that can be exported and used for further analysis. This dataframe should have page name, country, country population, and article quality information in each row. 

In [302]:
# Create dataframe with empty article qualities and rev_ids so we can merge with page dataframe
# Note that I do it in this maybe strange way because I wasn't able to place the popped quality values right into
# the page_filtered dataframe because the indices had been modified post-filtering.
n_articles = page_filtered.shape[0]
score_columns = ['rev_id','article_quality']
scores_temp = pd.DataFrame(index=range(0,n_articles),columns=score_columns)
scores_temp.head()

Unnamed: 0,rev_id,article_quality
0,,
1,,
2,,
3,,
4,,


In [303]:
# Copy over rev_ids.
scores_temp['rev_id'] = page_filtered['rev_id']

In [304]:
# Fill in the predicted quality for each row that has a rev_id; rows without a rev_id are skipped.
import math
row = 0
for score in page_scores:
    # Skip rows with no rev_id, stopping at the end of the dataframe instead of indexing past it.
    while row < n_articles and math.isnan(scores_temp.loc[row,'rev_id']):
        row = row + 1
    if row == n_articles:
        break
    else:
        # Fall back to 'n/a' when ORES returned an error instead of a score.
        scores_temp.loc[row, 'article_quality'] = score.get('articlequality',{}).get('score',{}).get('prediction','n/a')
        row = row + 1
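
The positional bookkeeping above exists only because the filtered dataframe no longer has a contiguous index. A minimal sketch of an alternative, assuming the scores come back in the same order as the rev_ids that were requested: pair the two sequences directly and merge on rev_id. The `pages` and `fake_scores` values are hypothetical stand-ins (no ORES call is made), and the response shape is assumed from the keys the loop above reads.

```python
import pandas as pd

# Hypothetical stand-in for the filtered page list, with a non-contiguous index.
pages = pd.DataFrame(
    {"page": ["A", "B", "C"], "rev_id": [101, 102, 103]},
    index=[1, 10, 12],
)

# Hypothetical stand-in for the ORES responses, one per rev_id, in request order.
fake_scores = [
    {"articlequality": {"score": {"prediction": "Stub"}}},
    {"articlequality": {"error": {"type": "RevisionNotFound"}}},
    {"articlequality": {"score": {"prediction": "GA"}}},
]

def prediction_of(score):
    """Return the predicted quality class, or None if there is no score."""
    return score.get("articlequality", {}).get("score", {}).get("prediction")

# Pair each rev_id with its prediction by position, ignoring the dataframe index.
scores = pd.DataFrame({
    "rev_id": pages["rev_id"].tolist(),
    "article_quality": [prediction_of(s) for s in fake_scores],
})

merged = pages.merge(scores, how="left", on="rev_id")
print(merged)
```

Because the pairing uses plain lists rather than index alignment, no row can be dropped for having an index label outside the range of the temporary dataframe.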

In [305]:
# Sanity check.
scores_temp[0:20]

Unnamed: 0,rev_id,article_quality
0,,
1,355319463.0,Stub
2,,
3,,
4,,
5,,
6,,
7,,
8,,
9,,


Here's where I start doing actual dataframe merges. After populating the article quality values into a dataframe and getting it merge-ready, I merge it with the filtered page list, using the revision id (rev_id) as the lookup key.

In [306]:
# Merge article quality into the filtered page list.
page_filtered = page_filtered.merge(scores_temp, how='left', on='rev_id')

In [307]:
# Rename the Geography column so that we can do a merge on the country column name.
pop_filtered = pop_filtered.rename(columns={'Geography':'country'})
pop_filtered.head()


Unnamed: 0,country,Population mid-2018 (millions)
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
5,Sudan,41.7


Here's where I do my second merge to add in population data. Some countries in the page list have no population data, and some countries with population data have no articles. I perform an outer merge so that those unmatched rows are kept, because the assignment instructions ask for them to be exported separately later.

In [308]:
# Keep all the rows because we want the NaNs for later export and then do a sanity check.
merged_pages = page_filtered.merge(pop_filtered, how='outer', on='country')
merged_pages.head()
merged_pages[20000:20010]

Unnamed: 0,page,country,rev_id,article_quality,Population mid-2018 (millions)
0,Bir I of Kanem,Chad,355319463.0,Stub,15.4
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15.4
2,Salmama II of Kanem,Chad,565745353.0,Stub,15.4
3,Kuri I of Kanem,Chad,565745365.0,Stub,15.4
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15.4


Unnamed: 0,page,country,rev_id,article_quality,Population mid-2018 (millions)
20000,Anker Boye,Denmark,773602945.0,Start,5.8
20001,Charlotte Sahl-Madsen,Denmark,773993189.0,Start,5.8
20002,Søren Pape Poulsen,Denmark,775942803.0,Start,5.8
20003,Jens Kramer Mikkelsen,Denmark,776481266.0,Start,5.8
20004,Jens Pauli Skaalum,Denmark,776481984.0,Stub,5.8
20005,Jesper Langballe,Denmark,776539879.0,C,5.8
20006,Yildiz Akdogan,Denmark,776594023.0,Stub,5.8
20007,John Brædder,Denmark,776932311.0,Stub,5.8
20008,Ruth Kristiansen,Denmark,777288352.0,Start,5.8
20009,Kashif Ahmad,Denmark,777908594.0,Stub,5.8


In [309]:
# Rename columns before export.
merged_pages.columns=['article_name','country','revision_id','article_quality','population']
merged_pages.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,15.4
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15.4
2,Salmama II of Kanem,Chad,565745353.0,Stub,15.4
3,Kuri I of Kanem,Chad,565745365.0,Stub,15.4
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15.4


In [310]:
# Export any rows that have a lack of a match anywhere to a csv file.
no_match = merged_pages[merged_pages.isnull().any(axis=1)]
no_match.to_csv('wp_wpds_countries-no_match.csv',index=False)

In [311]:
# Export all complete merged data to a separate csv file.
matches = merged_pages[~merged_pages.isnull().any(axis=1)]
matches.to_csv('wp_wpds_politicians_by_country.csv',index=False)
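
As a cross-check on the null-based split above, pandas can also record where each row of an outer merge came from. The following is a minimal sketch on toy data (the country names and values are made up for illustration), using the `indicator=True` option of `merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Toy stand-ins for the page list and the population table.
pages = pd.DataFrame({
    "page": ["P1", "P2", "P3"],
    "country": ["Chad", "Chad", "Atlantis"],
})
pop = pd.DataFrame({
    "country": ["Chad", "Fiji"],
    "population": [15.4, 0.9],
})

# indicator=True adds a "_merge" column describing the source of each row.
merged = pages.merge(pop, how="outer", on="country", indicator=True)

# Rows found on only one side are the "no match" rows.
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")
matches = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(no_match)
print(matches)
```

Note that the two approaches are not identical: the indicator only looks at whether the country key matched, whereas the `isnull()` test in the notebook also sends rows with a missing article quality to the no-match file.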

## Part 3: Analysis

In this step, I perform all calculations and dataframe manipulations necessary to produce six data tables. These will be as follows: 

- Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
- Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
- Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
- Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
- Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [312]:
# Get list of unique countries in matches and initialize a dataframe that we'll use to store the calculated values.
country_data = pd.DataFrame(matches['country'].unique())
country_data.insert(1,'population',0)
country_data.insert(2,'n_articles',0)
country_data.insert(3,'n_quality_articles',0)
country_data.insert(4,'coverage',0)
country_data.insert(5,'relative_quality',0)

In [313]:
# Rename columns and sanity check.
country_data.columns = ['country','population','n_articles','n_quality_articles','coverage','relative_quality']
country_data.head()

Unnamed: 0,country,population,n_articles,n_quality_articles,coverage,relative_quality
0,Chad,0,0,0,0,0
1,Cambodia,0,0,0,0,0
2,Canada,0,0,0,0,0
3,Egypt,0,0,0,0,0
4,Pakistan,0,0,0,0,0


In [314]:
# Pump in population data and sanity check.
n_countries = country_data.shape[0]
for i in range(n_countries):
    pop_index = matches[matches.country == country_data.loc[i,'country']].index[0]
    country_data.loc[i,'population'] = matches.loc[pop_index,'population']
country_data.head()

In [316]:
# Count number of articles for each country by how many rows in the list of matches there are for each country.
for i in range(n_countries):
    country_data.loc[i,'n_articles'] = len(matches[matches.country == country_data.loc[i,'country']])
country_data.head()

Unnamed: 0,country,population,n_articles,n_quality_articles,coverage,relative_quality
0,Chad,15.4,96,0,0,0
1,Cambodia,16.0,213,0,0,0
2,Canada,37.2,844,0,0,0
3,Egypt,97.0,232,0,0,0
4,Pakistan,200.6,1032,0,0,0


In [317]:
# Manually set the populations of India (row 6) and China (row 22) so they don't have commas in them.
country_data.loc[6,'population'] = 1371.3
country_data.loc[22,'population'] = 1393.8

# Compute coverage of each country as number of politician articles divided by country population
# (in millions), i.e. articles per million people.
for i in range(n_countries):
    country_data.loc[i,'coverage'] = country_data.loc[i,'n_articles']/float(country_data.loc[i,'population'])
country_data.head()

Unnamed: 0,country,population,n_articles,n_quality_articles,coverage,relative_quality
0,Chad,15.4,96,0,6.233766,0
1,Cambodia,16.0,213,0,13.3125,0
2,Canada,37.2,844,0,22.688172,0
3,Egypt,97.0,232,0,2.391753,0
4,Pakistan,200.6,1032,0,5.144566,0
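
The manual overrides above could likely be avoided at load time. This is a minimal sketch, assuming the population file stores large values as quoted strings with a comma as the thousands separator (for example `"1,393.8"`); the inline CSV text is a toy stand-in for the real file. The `thousands` parameter of `pd.read_csv` tells the parser to strip that separator before converting to a number.

```python
import io
import pandas as pd

# Toy stand-in for the population file; quoted values contain thousands separators.
csv_text = (
    'Geography,Population mid-2018 (millions)\n'
    'ASIA,"4,536"\n'
    'China,"1,393.8"\n'
    'Fiji,0.9\n'
)

# thousands="," makes the parser read "1,393.8" as the number 1393.8.
pop = pd.read_csv(io.StringIO(csv_text), thousands=",")

print(pop)
print(pop.dtypes)
```

With the column parsed as numeric from the start, the same setting would also cover the region totals that are patched by hand later in the notebook.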


After counting the number of articles in each country and computing coverage by dividing number of articles by the country population, I sort the dataframe by coverage to see which countries are the top 10 and which are the bottom 10 by coverage. These are the first two deliverable data tables. 

In [372]:
# Order data frame by coverage in ascending and descending order to get top and bottom countries by coverage.
top_by_coverage = country_data.sort_values(by='coverage', ascending=False)
bottom_by_coverage = country_data.sort_values(by='coverage')

# Display top 10 and bottom 10 countries by coverage.
top_by_coverage['country'][0:10]
bottom_by_coverage['country'][0:10]

99               Tuvalu
149               Nauru
42           San Marino
65               Monaco
98        Liechtenstein
87                Tonga
105    Marshall Islands
68              Iceland
166             Andorra
78              Grenada
Name: country, dtype: object

6             India
60        Indonesia
22            China
150      Uzbekistan
107        Ethiopia
163    Korea, North
178          Zambia
126        Thailand
125      Mozambique
116      Bangladesh
Name: country, dtype: object
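
The two listings above show only country names. A small sketch of how the same top and bottom selections could be made while keeping the coverage values visible, using `nlargest` and `nsmallest`; the dataframe here is a toy stand-in whose rows are illustrative, not the full results.

```python
import pandas as pd

# Toy stand-in for country_data with just the columns needed here.
country_data = pd.DataFrame({
    "country": ["Chad", "Cambodia", "Canada", "Egypt"],
    "coverage": [6.23, 13.31, 22.69, 2.39],
})

# nlargest/nsmallest sort and slice in one step; n would be 10 in the notebook.
top = country_data.nlargest(2, "coverage")
bottom = country_data.nsmallest(2, "coverage")

print(top[["country", "coverage"]])
print(bottom[["country", "coverage"]])
```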

A quality article is defined as one that ORES predicts to be most like a Featured Article (FA) or a Good Article (GA). The total number of quality articles for each country is the number of its articles with either label.

In [319]:
# Count number of quality articles for each country.
for i in range(n_countries):
    country_matches = matches[matches.country == country_data.loc[i,'country']]
    country_data.loc[i,'n_quality_articles'] = country_matches.article_quality.isin(['FA','GA']).sum()
country_data.head()

Unnamed: 0,country,population,n_articles,n_quality_articles,coverage,relative_quality
0,Chad,15.4,96,2,6.233766,0
1,Cambodia,16.0,213,4,13.3125,0
2,Canada,37.2,844,22,22.688172,0
3,Egypt,97.0,232,9,2.391753,0
4,Pakistan,200.6,1032,18,5.144566,0


Relative quality is computed as the number of quality articles for a country divided by the total number of articles for a country. 

In [320]:
# Compute relative quality.
for i in range(n_countries):
    country_data.loc[i,'relative_quality'] = country_data.loc[i,'n_quality_articles']/float(country_data.loc[i,'n_articles'])
country_data.head()

Unnamed: 0,country,population,n_articles,n_quality_articles,coverage,relative_quality
0,Chad,15.4,96,2,6.233766,0.020833
1,Cambodia,16.0,213,4,13.3125,0.018779
2,Canada,37.2,844,22,22.688172,0.026066
3,Egypt,97.0,232,9,2.391753,0.038793
4,Pakistan,200.6,1032,18,5.144566,0.017442
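
The per-country loops above can also be expressed as a single grouped aggregation. This is a minimal sketch on toy data (the rows are made up for illustration), assuming the same definitions used in the notebook: coverage is articles divided by population in millions, and relative quality is FA plus GA articles divided by all articles.

```python
import pandas as pd

# Toy stand-in for the matches dataframe.
matches = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Fiji"],
    "article_quality": ["Stub", "GA", "FA", "Start"],
    "population": [15.4, 15.4, 15.4, 0.9],
})

# Flag quality articles so they can be summed per country.
matches["is_quality"] = matches["article_quality"].isin(["FA", "GA"])

country_stats = matches.groupby("country").agg(
    population=("population", "first"),        # same value on every row of a country
    n_articles=("article_quality", "count"),   # one row per article
    n_quality_articles=("is_quality", "sum"),  # True counts as 1
).reset_index()

country_stats["coverage"] = country_stats["n_articles"] / country_stats["population"]
country_stats["relative_quality"] = (
    country_stats["n_quality_articles"] / country_stats["n_articles"]
)

print(country_stats)
```

The grouped version avoids row-by-row `.loc` assignment, which tends to matter once the table has tens of thousands of rows.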


After computing relative quality for each country, I sort the dataframe by relative quality to see which countries are the top 10 and which are the bottom 10. These are the third and fourth deliverable data tables. 

In [321]:
# Get top 10 and bottom 10 countries by quality.
top_by_quality = country_data.sort_values(by='relative_quality', ascending=False)
bottom_by_quality = country_data.sort_values(by='relative_quality')

top_by_quality['country'][0:10]
bottom_by_quality['country'][0:10]

163                Korea, North
138                  Mauritania
48                      Romania
161    Central African Republic
169                Saudi Arabia
99                       Tuvalu
124                      Bhutan
172                    Dominica
50                        Syria
47                        Benin
Name: country, dtype: object

179             Seychelles
153    Antigua and Barbuda
151                Namibia
35                 Tunisia
149                  Nauru
42              San Marino
140                Lesotho
52                  Uganda
55              Tajikistan
139               Cameroon
Name: country, dtype: object

In the next few cells, I start putting together a dataframe that computes the same coverage and relative quality values, but rolled up by geographic region. Region names and country membership are available in the original WPDS_2018_data.csv population file: region names are listed in all caps, and every subsequent country in the table belongs to that region until a new region name appears. I take advantage of this layout in the following code.

In [364]:
# Gather region names from the geography rows that are in all capital letters and add new columns that will be
# used to compute final data tables. The .copy() makes regions independent of pop_data so it can be edited safely.
regions = pop_data[pop_data['Geography'].str.isupper()].copy()
regions.insert(2, "n_articles",0.0)
regions.insert(3, "coverage",0.0)
regions.insert(4, "relative_quality",0.0)
regions

Unnamed: 0,Geography,Population mid-2018 (millions),n_articles,coverage,relative_quality
0,AFRICA,1284,0.0,0.0,0.0
56,NORTHERN AMERICA,365,0.0,0.0,0.0
59,LATIN AMERICA AND THE CARIBBEAN,649,0.0,0.0,0.0
95,ASIA,4536,0.0,0.0,0.0
144,EUROPE,746,0.0,0.0,0.0
189,OCEANIA,41,0.0,0.0,0.0


This section was a pain to get working due to getting my head around indices. It may not be very efficient, but it hopefully works correctly. In the below code, I step through every country in the full population data file. If the country does in fact exist in the processed dataframe (if it has any articles), I add the number of articles and quality articles to that region's total count. If I've instead hit upon a region, I move to the next region and add subsequent article counts to that region.  

In [365]:
# Accumulate article counts by region. The 'relative_quality' column temporarily holds the count of
# quality articles; it is converted to a proportion further down.
i = 0
while i < pop_data.shape[0]:
    country = pop_data.loc[i,'Geography']
    if country.isupper():
        region = country
        i += 1
    else:
        country_index = country_data[country_data.country == country].index
        if country_index.shape[0] > 0:
            country_index = country_index[0]
            regions.loc[regions['Geography'] == region,'n_articles'] += country_data.loc[country_index,'n_articles']
            regions.loc[regions['Geography'] == region,'relative_quality'] += country_data.loc[country_index,'n_quality_articles']
        i += 1
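
The "each country belongs to the most recent uppercase row" rule can also be written without an explicit loop. A minimal sketch, assuming the same layout as the population file: keep the Geography value only on region rows, forward-fill it down the table, and then group by the resulting region column. All values below are toy data for illustration.

```python
import pandas as pd

# Toy stand-in for pop_data: uppercase rows are regions, followed by their countries.
pop_data = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "China"],
})

# Keep the name on region rows only, then carry it down to the country rows.
is_region = pop_data["Geography"].str.isupper()
pop_data["region"] = pop_data["Geography"].where(is_region).ffill()

# Country-to-region lookup table (region rows themselves are dropped).
country_region = (
    pop_data.loc[~is_region, ["Geography", "region"]]
    .rename(columns={"Geography": "country"})
)

# Toy stand-in for country_data with made-up counts.
country_data = pd.DataFrame({
    "country": ["Algeria", "Egypt", "China"],
    "n_articles": [100, 232, 1000],
    "n_quality_articles": [2, 9, 30],
})

by_region = (
    country_data.merge(country_region, on="country")
    .groupby("region")[["n_articles", "n_quality_articles"]]
    .sum()
)
by_region["relative_quality"] = by_region["n_quality_articles"] / by_region["n_articles"]

print(by_region)
```

Keeping the quality-article count in its own column, rather than reusing `relative_quality` as an accumulator, makes the intermediate table easier to read.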

In [366]:
# Clean up the region dataframe a bit. Reset indices to avoid some further headache, shorten the name of
# the population column, and remove commas from all numbers.
regions.reset_index(drop=True, inplace=True)
regions.rename(columns={'Population mid-2018 (millions)':'population'}, inplace=True)
regions.loc[0,'population'] = 1284
regions.loc[3,'population'] = 4536
regions

Unnamed: 0,Geography,population,n_articles,coverage,relative_quality
0,AFRICA,1284,6796.0,0.0,119.0
1,NORTHERN AMERICA,365,1906.0,0.0,93.0
2,LATIN AMERICA AND THE CARIBBEAN,649,5141.0,0.0,66.0
3,ASIA,4536,11441.0,0.0,287.0
4,EUROPE,746,15770.0,0.0,302.0
5,OCEANIA,41,3101.0,0.0,60.0


In the below code, I compute coverage and relative quality for each region in the same way as was done for each country: coverage is number of articles divided by population and relative quality is number of quality articles divided by total number of articles. 

In [367]:
# Compute region coverage and relative_quality from article counts and population. 
for i in range(regions.shape[0]):
    regions.loc[i,'coverage'] = regions.loc[i,'n_articles']/float(regions.loc[i,'population'])
    regions.loc[i,'relative_quality'] /= regions.loc[i,'n_articles']
regions

Unnamed: 0,Geography,population,n_articles,coverage,relative_quality
0,AFRICA,1284,6796.0,5.292835,0.01751
1,NORTHERN AMERICA,365,1906.0,5.221918,0.048793
2,LATIN AMERICA AND THE CARIBBEAN,649,5141.0,7.921418,0.012838
3,ASIA,4536,11441.0,2.522266,0.025085
4,EUROPE,746,15770.0,21.13941,0.01915
5,OCEANIA,41,3101.0,75.634146,0.019349


After computing coverage and relative quality by geographic region, I sort the dataframe by each metric in turn. These are the final two deliverable data tables.

In [368]:
# Display regions sorted by coverage. 
regions.sort_values(by='coverage', ascending=False)

Unnamed: 0,Geography,population,n_articles,coverage,relative_quality
5,OCEANIA,41,3101.0,75.634146,0.019349
4,EUROPE,746,15770.0,21.13941,0.01915
2,LATIN AMERICA AND THE CARIBBEAN,649,5141.0,7.921418,0.012838
0,AFRICA,1284,6796.0,5.292835,0.01751
1,NORTHERN AMERICA,365,1906.0,5.221918,0.048793
3,ASIA,4536,11441.0,2.522266,0.025085


In [369]:
# Display regions sorted by relative quality.
regions.sort_values(by='relative_quality', ascending=False)

Unnamed: 0,Geography,population,n_articles,coverage,relative_quality
1,NORTHERN AMERICA,365,1906.0,5.221918,0.048793
3,ASIA,4536,11441.0,2.522266,0.025085
5,OCEANIA,41,3101.0,75.634146,0.019349
4,EUROPE,746,15770.0,21.13941,0.01915
0,AFRICA,1284,6796.0,5.292835,0.01751
2,LATIN AMERICA AND THE CARIBBEAN,649,5141.0,7.921418,0.012838


## Part 4: Writeup

After completing this assignment, I was most surprised at how poor the (predicted) quality of the vast majority of Wikipedia articles is. I often go to Wikipedia to learn basic facts or get a rough overview of a topic, and my purposes are served fairly well, but there's a whole world of quality that's still missing from most articles, not even getting into any regional differences. Considering regions, though, I was surprised that Northern America was not the leader in terms of article coverage and that the USA wasn't even in the top 10 for coverage or relative quality. Perhaps I should have seen it coming, but population seemed to be the deciding factor in coverage (small island nations, even with few articles, technically have high coverage).

The most perplexing table to me was the list of top nations by relative article quality. This list seemed to span geographic regions and country sizes and did not include any large western powers, which surprised me. On the flip side, I expected Asia and Africa to be drastically underrepresented due to their large populations relative to western presence in education media. Asia did come in last in coverage, but it ranked second in relative quality, and Africa came in last in neither table.

I didn't see any extremely obvious takeaways regarding Wikipedia as a data source from the final list of data tables produced in this assignment. One idea is that it seems there's a cluster of core editors with their particular wheelhouses, and although there may be many such wheelhouses, certain areas will be left totally barren. I actually was most surprised that relative quality was greater than 1% but less than 5% in all regions, though it seems that some countries (the entire bottom 10 by relative quality) were deprived of quality content entirely. 

Using this dataset to try and make any statement about the "world's politicians" will probably not yield great results since it's obviously skewed towards certain countries over others. However, even in the most deprived regions, some quality content can still be found, so I think Wikipedia may still be a good place to learn high-level information about the world's most famous people. It seems there's still quite a way to go to capture all of the world's information...