# Assignment A2: Bias in data
## Richard Todd

## Step 1: Data acquisition

This assignment combines data from three sources:
* Wikipedia politicians by country, made available on [figshare](https://figshare.com/articles/Untitled_Item/5513449) under the CC-BY-SA 4.0 license. This was downloaded from source and unzipped.
* Population data from the Population Reference Bureau's World Population Data Sheet [International Indicators](https://www.prb.org/international/indicator/population/table/), made available under a CC BY 3.0 license. This data was provided in csv format as part of the class assignment.
* Output from the ORES ("Objective Revision Evaluation Service") machine learning package. In this data, each page is assigned one of six quality categories used in English Wikipedia [content assessment](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades).

First we import the Python libraries used to access, process and analyze the data:

In [1]:
import pandas as pd
import numpy as np
import json
import requests

Load the two csv files accessed as described above.

In [2]:
page_df = pd.read_csv('page_data.csv')
wpds_df = pd.read_csv('WPDS_2018_data.csv')

## Step 2: Data processing

### Cleaning page data

In [3]:
page_df.shape

(47197, 3)

Examining the data shows that some page names have a 'Template' prefix. These are templates rather than articles, so they should be removed for this analysis:

In [4]:
page_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [5]:
page_df = page_df[~page_df["page"].str.startswith("Template")]
page_df.shape

(46701, 3)
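
As a minimal sketch of the prefix filter on toy data (the rows and the `toy_` names below are hypothetical, not taken from `page_data.csv`), matching on `"Template:"` with the colon is slightly stricter than matching on `"Template"` alone, since it keeps any title that merely begins with that word:

```python
import pandas as pd

# Toy rows for illustration only; "Template Doe" is a hypothetical article title
toy_pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem", "Template Doe"],
    "country": ["Uganda", "Chad", "Chad"],
    "rev_id": [1, 2, 3],
})

# Boolean mask marking rows whose title starts with the namespace prefix "Template:"
is_template = toy_pages["page"].str.startswith("Template:")

# Keep only the rows that are not templates
toy_clean = toy_pages[~is_template]
print(toy_clean["page"].tolist())
```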

### Cleaning population data

In [6]:
wpds_df.shape

(207, 2)

In [7]:
wpds_df.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


The WPDS_2018_data file combines country and regional population counts. Regional counts are distinguished by upper-case names (both appear in the 'Geography' field of the dataframe). I split these two groups into two dataframes ahead of analysis:

In [8]:
countries_df = wpds_df[~wpds_df['Geography'].str.isupper()]
regions_df = wpds_df[wpds_df['Geography'].str.isupper()]
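
A small sketch of the same split on toy data may make the upper-case rule concrete. The first three rows mirror the preview above; "EXAMPLE REGION" and "Exampleland" are hypothetical. Taking an explicit `.copy()` of each slice is an assumption on my part; it is one way to make later column changes on the slices unambiguous.

```python
import pandas as pd

# Toy frame: regions are upper-case, countries are not
toy_wpds = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "EXAMPLE REGION", "Exampleland"],
    "Population mid-2018 (millions)": ["1,284", "42.7", "97", "10", "10"],
})

# str.isupper() is True when every cased character in the name is upper-case
is_region = toy_wpds["Geography"].str.isupper()

# Independent copies, so later edits do not touch toy_wpds
toy_countries = toy_wpds[~is_region].copy()
toy_regions = toy_wpds[is_region].copy()
print(toy_regions["Geography"].tolist())
```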

### Acquire and attach article quality predictions

The methodology and code in this section is based upon material provided to the class in the [class wiki](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data) and related materials. I use REST API calls to return quality estimates of each page generated with the ORES ("Objective Revision Evaluation Service") machine learning package. In this data, each page is assigned one of six quality categories used in English Wikipedia [content assessment](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades).

Create a function to access ORES data:

In [9]:
default_headers = {'User-Agent': 'https://github.com/rcctodd', 'From': 'rcctodd@uw.edu'}

def get_ores_data(revision_ids, headers=default_headers):
    # ORES v3 scores endpoint; revision IDs are passed as a pipe-separated list
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {
        'project': 'enwiki',
        'model': 'wp10',
        'revids': '|'.join(str(x) for x in revision_ids)
    }
    # Send the identifying headers along with the request
    json_response = requests.get(endpoint.format(**params), headers=headers).json()
    return json_response
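
To show what the request looks like without calling the API, here is a sketch that only builds the URL. The three revision IDs are taken from the output further down; the variable names are illustrative.

```python
# Same endpoint template as in get_ores_data
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

# Three revision IDs, used here only to show the shape of the URL
example_rev_ids = [355319463, 393276188, 393822005]

# Revision IDs are joined with "|" into a single query-string value
example_url = endpoint.format(
    project='enwiki',
    model='wp10',
    revids='|'.join(str(x) for x in example_rev_ids),
)
print(example_url)
```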

The JSON output comes in a nested dictionary format that requires careful extraction; an example of the format is available in the ORES documentation [here](https://www.mediawiki.org/wiki/ORES#Edit_quality). Here I create a function which extracts only the quality class prediction from the JSON output:

In [10]:
def extract_quality(json_input):
    quality_list = []
    for key, value in json_input["enwiki"]["scores"].items():
        #wp10 is the name of the current ORES model and is the container label for its scores
        temp_dict = value["wp10"]
        #need to account for error values
        if "error" not in temp_dict:
            quality_pred = {
                'rev_id': int(key),
                'quality_cat': temp_dict["score"]["prediction"]
            }
            quality_list.append(quality_pred)
    
    return quality_list
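
The sketch below applies the same extraction logic to a hand-written mock response. The mock is an assumption about the response shape, trimmed to the fields the function reads; the second revision ID and its error payload are hypothetical.

```python
# Hand-written mock of an ORES response, reduced to the fields used here
mock_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            # Hypothetical failed lookup: an "error" entry replaces "score"
            "111": {"wp10": {"error": {"type": "RevisionNotFound", "message": "hypothetical"}}},
        }
    }
}

# Keep one record per revision that has a score, skipping error entries
predictions = [
    {"rev_id": int(rev_id), "quality_cat": models["wp10"]["score"]["prediction"]}
    for rev_id, models in mock_response["enwiki"]["scores"].items()
    if "error" not in models["wp10"]
]
print(predictions)  # [{'rev_id': 355319463, 'quality_cat': 'Stub'}]
```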

In order not to overwhelm the API (following advice received in the assignment instructions!), I create a simple function to chunk the page list and query each chunk in turn (code here adapted from a [geeksforgeeks](https://www.geeksforgeeks.org/break-list-chunks-size-n-python/) posting).

In [11]:
def chunk_query(l, n): 
    for i in range(0, len(l), n):  
        yield l[i:i + n] 

In [12]:
chunked_pages = list(chunk_query(page_df['rev_id'], 100))
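
To illustrate the chunking behaviour on a small input, here is a sketch using a plain list of seven items and a chunk size of three (`chunk_list` is a hypothetical stand-in with the same logic as `chunk_query`). The last chunk is simply shorter when the length is not a multiple of the chunk size.

```python
def chunk_list(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Seven items in chunks of three: two full chunks and one remainder
example_chunks = list(chunk_list(list(range(7)), 3))
print(example_chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```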

Using the functions created above, I incrementally retrieve ORES data, extract the quality field and convert the resulting information to a dataframe:

In [13]:
quality_json = [get_ores_data(subset) for subset in chunked_pages]

In [14]:
ores_predictions = [extract_quality(subset) for subset in quality_json]

In [15]:
ores_prediction_dfs = [pd.DataFrame.from_records(json_subset) for json_subset in ores_predictions]

In [16]:
quality_prediction_df = pd.concat(ores_prediction_dfs)
quality_prediction_df.to_csv("ores_quality_preds.csv", index=False)

In [17]:
quality_prediction_df.head()

Unnamed: 0,quality_cat,rev_id
0,Stub,355319463
1,Stub,393276188
2,Stub,393822005
3,Stub,395521877
4,Stub,395526568


### Combine data sources

In order to combine page and country data, rename the 'Geography' field to 'country':

In [18]:
# Assign the renamed frame back rather than renaming a slice in place,
# which avoids the pandas SettingWithCopyWarning
countries_df = countries_df.rename(columns={'Geography':'country'})


I create a dataframe by merging page data and country data:

In [19]:
wp_wpds_politicians_by_country = pd.merge(page_df, countries_df, on='country', how='outer')

To this, I merge in the ORES quality prediction.

In [20]:
wp_wpds_politicians_by_country = pd.merge(wp_wpds_politicians_by_country, quality_prediction_df, on='rev_id', how='outer')

Records without a quality prediction are dropped:

In [21]:
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['quality_cat'].notnull()]


Records with and without an associated country match are separated and saved as csv files.

In [22]:
wp_wpds_countries_no_match_df = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['country'].isna()]
wp_wpds_countries_no_match_df.to_csv("wp_wpds_countries-no_match_df.csv", index=False)

In [23]:
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['country'].notnull()]
wp_wpds_politicians_by_country.to_csv("wp_wpds_politicians_by_country.csv", index=False)
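
One point worth illustrating: after an outer merge on a shared key, pandas fills the key column from whichever side has a value, so the key itself is not null for unmatched rows. The sketch below uses toy data (two population figures from the preview above, and hypothetical page rows) with the `indicator=True` option of `pd.merge`, which is one possible way to separate matched from unmatched records. It is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy inputs: "Rhodesian" has no population row, "Egypt" has no page row
toy_pages = pd.DataFrame({"page": ["A", "B"], "country": ["Algeria", "Rhodesian"]})
toy_pop = pd.DataFrame({"country": ["Algeria", "Egypt"], "population": [42.7, 97.0]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(toy_pages, toy_pop, on="country", how="outer", indicator=True)

# The key column is populated for every row, matched or not
print(merged["country"].isna().sum())  # 0

# Split on the indicator instead of on null keys
no_match = merged[merged["_merge"] != "both"]
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(sorted(no_match["country"]))
```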

## Step 3: Analysis

In this stage I explore the relationship between population, numbers of articles about politicians and the quality of those articles. "High quality" articles are defined as having an ORES-predicted class of "FA" ("featured article") or "GA" ("good article").

### Country-level analysis

In order to calculate the ten highest-ranked countries by number of politician articles as a proportion of country population, I convert the population data to numeric values, then group by country and append a count of articles per country.

In [24]:
wp_wpds_politicians_by_country['Population mid-2018 (millions)'] = pd.to_numeric(wp_wpds_politicians_by_country['Population mid-2018 (millions)'].str.replace(',', ''))
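
As a small sketch of this conversion (toy strings, assuming that larger values are stored with thousands separators), stripping the commas first lets `pd.to_numeric` parse every value:

```python
import pandas as pd

# Toy population strings; the comma is a thousands separator
toy_population = pd.Series(["1,284", "42.7", "97"])

# Remove the separators as a literal replacement, then parse as numbers
toy_numeric = pd.to_numeric(toy_population.str.replace(",", "", regex=False))
print(toy_numeric.tolist())  # [1284.0, 42.7, 97.0]
```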

In [25]:
country_df = pd.DataFrame(wp_wpds_politicians_by_country.groupby(['country'])['Population mid-2018 (millions)'].max())

In [26]:
country_df['pagecount']= wp_wpds_politicians_by_country.groupby(['country'])['page'].count()

Add a calculation of articles per million people population:

In [27]:
country_df['articles_per_million_pop'] = country_df['pagecount'] / country_df['Population mid-2018 (millions)']

Add to the dataframe a count of articles by quality prediction - replacing NAs with 0 - then calculate the proportion of articles that are high quality: 

In [28]:
country_df = country_df.join(wp_wpds_politicians_by_country.groupby(['country'])['quality_cat'].value_counts().unstack().fillna(0))

In [29]:
country_df['prop_high_quality']=(country_df['GA']+country_df['FA'])/country_df['pagecount']
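
The sketch below walks through the same steps on toy data (countries "X" and "Y" are hypothetical). `value_counts().unstack()` produces one column per quality class, with NaN where a country has no article in that class, which is why the fill with 0 is needed before adding GA and FA.

```python
import pandas as pd

# Toy article-level data: X has one GA and one FA article, Y has only stubs
toy = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y"],
    "quality_cat": ["GA", "Stub", "FA", "Stub", "Stub"],
})

# One row per country, one column per quality class (FA, GA, Stub)
counts = toy.groupby("country")["quality_cat"].value_counts().unstack().fillna(0)

# Total articles per country, then the share predicted GA or FA
counts["pagecount"] = toy.groupby("country")["quality_cat"].count()
counts["prop_high_quality"] = (counts["GA"] + counts["FA"]) / counts["pagecount"]
print(counts["prop_high_quality"])
```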

Sort and truncate dataframe, displaying only variables of interest:

#### Top 10 countries by coverage: 10 highest-ranked countries by politician articles as a proportion of country population

In [30]:
country_df[['articles_per_million_pop']].sort_values('articles_per_million_pop', ascending=False).head(10)

Unnamed: 0_level_0,articles_per_million_pop
country,Unnamed: 1_level_1
Tuvalu,5400.0
Nauru,5200.0
San Marino,2700.0
Monaco,1000.0
Liechtenstein,700.0
Tonga,630.0
Marshall Islands,616.666667
Iceland,502.5
Andorra,425.0
Grenada,360.0


#### Bottom 10 countries by coverage: 10 lowest-ranked countries by politician articles as a proportion of country population

In [31]:
country_df[['articles_per_million_pop']].sort_values('articles_per_million_pop', ascending=True).head(10)

Unnamed: 0_level_0,articles_per_million_pop
country,Unnamed: 1_level_1
India,0.71465
Indonesia,0.791855
China,0.810733
Uzbekistan,0.851064
Ethiopia,0.939535
"Korea, North",1.40625
Zambia,1.412429
Thailand,1.691843
Mozambique,1.901639
Bangladesh,1.917067


#### Top 10 countries by relative quality: 10 highest-ranked countries by relative proportion of politician articles that are of GA and FA-quality

In [32]:
country_df[['prop_high_quality']].sort_values('prop_high_quality', ascending=False).head(10)

Unnamed: 0_level_0,prop_high_quality
country,Unnamed: 1_level_1
"Korea, North",0.194444
Rhodesian,0.146667
Saudi Arabia,0.127119
Mauritania,0.125
Central African Republic,0.121212
Romania,0.113703
Tuvalu,0.092593
Bhutan,0.090909
Dominica,0.083333
Syria,0.078125


#### Bottom 10 countries by relative quality: 10 lowest-ranked countries by relative proportion of politician articles that are of GA and FA-quality

In [33]:
country_df[['prop_high_quality']].sort_values('prop_high_quality', ascending=True).head(10)

Unnamed: 0_level_0,prop_high_quality
country,Unnamed: 1_level_1
South Korean,0.0
Slovakia,0.0
Ivorian,0.0
Solomon Islands,0.0
Somaliland,0.0
Incan,0.0
Hondura,0.0
Guyana,0.0
South Ossetian,0.0
Kazakhstan,0.0


The analysis below shows that 62 countries (~28%) have no articles predicted to be high quality, so the ten selected above are arbitrary.

In [34]:
country_df.shape[0] - np.count_nonzero(country_df[['prop_high_quality']])

62
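
A sketch of the same count on toy values, together with an equivalent and arguably more direct comparison. `np.count_nonzero` treats NaN as non-zero, so rows with a missing proportion are not counted as zero by either approach; the toy series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy proportions: two zeros, one positive value, one missing value
toy_props = pd.Series([0.0, 0.1, np.nan, 0.0], name="prop_high_quality")

# Approach used above: total rows minus the non-zero entries (NaN counts as non-zero)
via_count_nonzero = len(toy_props) - np.count_nonzero(toy_props.to_numpy())

# Equivalent direct comparison (NaN == 0 is False)
via_comparison = int((toy_props == 0).sum())

# Share of all rows with no high-quality articles
zero_share = via_comparison / len(toy_props)
print(via_count_nonzero, via_comparison, zero_share)  # 2 2 0.5
```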

### Region-level analysis

The original population data file contained regions as well as countries, with each upper-case region name preceding the countries in that region. We can use this structure to loop through the dataframe and allocate countries to regions, then merge this into the country dataframe.

In [35]:
region_list = []
for geog in wpds_df['Geography'].tolist():
    if geog.isupper():
        current_region = geog
        region_list.append('regionname')
    else:
        region_list.append(current_region)

In [36]:
wpds_df['region_cat'] = region_list
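
The same allocation can also be sketched without an explicit loop, using `Series.where` and a forward fill. This is an alternative illustration on toy data ("EXAMPLE REGION" and "Exampleland" are hypothetical), not the approach used above. Note that here region rows carry their own name as the region, rather than a placeholder.

```python
import pandas as pd

# Toy frame with the same layout: each region row precedes its countries
toy_wpds = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "EXAMPLE REGION", "Exampleland"],
})

is_region = toy_wpds["Geography"].str.isupper()

# Keep region names, blank out countries, then carry each region name downwards
toy_wpds["region_cat"] = toy_wpds["Geography"].where(is_region).ffill()

# Country rows now carry the region that precedes them
toy_countries = toy_wpds[~is_region]
print(toy_countries["region_cat"].tolist())  # ['AFRICA', 'AFRICA', 'EXAMPLE REGION']
```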

In [37]:
country_df = country_df.merge(wpds_df[['Geography','region_cat']],left_index=True,right_on='Geography')

Resetting the index to make the dataframe consistent with above.

In [38]:
country_df = country_df.set_index('Geography')

Grouping data by region, then calculating articles per million population as above.

In [45]:
region_df = pd.DataFrame(country_df.groupby(['region_cat'])[['Population mid-2018 (millions)','pagecount','GA','FA']].sum())

In [46]:
region_df.columns

Index(['Population mid-2018 (millions)', 'pagecount', 'GA', 'FA'], dtype='object')

In [40]:
region_df['articles_per_million_pop'] = region_df['pagecount'] / region_df['Population mid-2018 (millions)']
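
Computing the regional figure from summed counts and summed population is not the same as averaging the country-level ratios. The toy example below (one hypothetical region with a very small and a very large country) shows how far the two can diverge:

```python
import pandas as pd

# Toy region with one small and one large country
toy = pd.DataFrame({
    "region_cat": ["R", "R"],
    "population_millions": [1.0, 99.0],
    "pagecount": [100, 99],
}, index=["Small", "Large"])

# Approach used above: sum within the region, then divide
region = toy.groupby("region_cat")[["population_millions", "pagecount"]].sum()
region["articles_per_million_pop"] = region["pagecount"] / region["population_millions"]

# For contrast: the unweighted mean of the country-level ratios
country_ratio = toy["pagecount"] / toy["population_millions"]
mean_of_country_ratios = country_ratio.groupby(toy["region_cat"]).mean()

print(region.loc["R", "articles_per_million_pop"])  # 199 / 100 = 1.99
print(mean_of_country_ratios["R"])                  # (100 + 1) / 2 = 50.5
```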

Sorting values for purposes of output:

#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [41]:
region_df.sort_values('articles_per_million_pop', ascending=False)

Unnamed: 0_level_0,Population mid-2018 (millions),pagecount,articles_per_million_pop
region_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
OCEANIA,39.78,3128,78.632479
EUROPE,734.59,15864,21.59572
LATIN AMERICA AND THE CARIBBEAN,628.27,5169,8.227354
AFRICA,1172.4,6851,5.843569
NORTHERN AMERICA,365.2,1921,5.260131
ASIA,4513.1,11531,2.555007


As above, calculate the proportion of articles that are high quality, this time using the GA and FA counts summed by region:

In [48]:
region_df['prop_high_quality']=(region_df['GA']+region_df['FA'])/region_df['pagecount']

#### Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [50]:
region_df.sort_values('prop_high_quality', ascending=False)

Unnamed: 0_level_0,Population mid-2018 (millions),pagecount,GA,FA,prop_high_quality
region_cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NORTHERN AMERICA,365.2,1921,69.0,30.0,0.051536
ASIA,4513.1,11531,254.0,56.0,0.026884
OCEANIA,39.78,3128,52.0,14.0,0.0211
EUROPE,734.59,15864,206.0,116.0,0.020298
AFRICA,1172.4,6851,95.0,30.0,0.018246
LATIN AMERICA AND THE CARIBBEAN,628.27,5169,50.0,19.0,0.013349


## Step 4: Initial reflections

#### Reflections on the results tables
Most striking in the results is the variation. On a population-adjusted basis, there is over a 7000x variation in the number of English Wikipedia articles on politicians between the highest and lowest-ranked countries and a twenty percentage point difference in the proportion of those articles that are "high quality". These differences are less extreme, though still striking, when data is aggregated to the regional level.
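
The headline figures can be checked directly from the country tables above (Tuvalu and India for coverage; North Korea and the zero-valued countries for quality). The variable names are illustrative.

```python
# Values copied from the country-level tables above
top_coverage = 5400.0       # Tuvalu, articles per million population
bottom_coverage = 0.71465   # India, articles per million population
top_quality = 0.194444      # "Korea, North", proportion of high-quality articles
bottom_quality = 0.0        # countries with no GA or FA predictions

# Ratio between the highest and lowest coverage
coverage_ratio = top_coverage / bottom_coverage

# Gap between the highest and lowest quality share, in percentage points
quality_gap_points = (top_quality - bottom_quality) * 100

print(round(coverage_ratio), round(quality_gap_points, 1))
```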

Almost as striking are some of the countries identified as high- and low-ranked. Small nations such as Monaco and San Marino, which do not primarily speak English, are among the countries with the highest number of articles when adjusted for population, whereas large, democratic countries such as India appear on the lowest-ranked list. Interestingly, non-democratic countries such as China and North Korea are also well represented in this low-ranked list; less democratic nations such as North Korea and Saudi Arabia are also well represented among the highest-ranked countries by article quality, whereas small nations dominate the lowest-ranked list.

#### Internal validity considerations
Outputs from this initial, descriptive analysis should be treated with some caution. Where datasets were linked - country to pagecounts, and pagecounts to quality estimates - unmatched values were dropped, potentially introducing bias. The analysis of article quality was based entirely on the output of an algorithm (ORES, referenced above) whose construction and biases I have no insight into.

#### External validity considerations
Extrapolations based on article "quality" should be interpreted with caution. Even assuming a highly-functioning and unbiased ORES output, the algorithm was designed and maintained for a specific purpose: supporting Wikipedia maintenance. An ORES determination of "high quality" should not be confused with a common-or-garden use of the term; for example, ORES makes no determination of the likely accuracy of any given article.

It is tempting to draw conclusions from this work about the quality of political discourse in the countries in question, but we should resist this temptation. For a host of reasons, this data might not represent such an underlying phenomenon. Articles may be edited by users all over the world; there is reason to believe that many users outside a country may be editing articles on its politicians for reasons that are unrelated, or even inversely related, to the quality of political discourse in that country. The results cover only English Wikipedia; variation could be created by levels of internet access, English-language ability and the popularity of Wikipedia.


#### Wider reflections
The exercise caused me to reflect on the causes and consequences of biased data in large, complex data science projects in three ways:

* Uncritical use of data. Even after review of linked materials, I completed this exercise with very little understanding of the data at hand, and even less of the inherent bias in its collection. The stark variation in coverage between countries illustrated in this analysis should give us pause before assuming the completeness or representativeness of any observational dataset.
* Uncritical use of algorithms. I have deployed and relied upon the results of an algorithm here without engaging with the biases that it might create.
* The temptation to rush to conclusions unwarranted by the data. Domain-specific terms such as "quality" invite us to equate model output with the common-or-garden use of the term; the engaging subject matter of political articles beckons us to draw unwarranted conclusions.