# Assignment A2: Bias in data
## Richard Todd

## Step 1: Data acquisition

This assignment combines data from two sources:
* Wikipedia politicians by country, made available on [figshare](https://figshare.com/articles/Untitled_Item/5513449) under the CC-BY-SA 4.0 license. This was downloaded from source and unzipped.
* Population data from the Population Reference Bureau's [International Indicators](https://www.prb.org/international/indicator/population/table/), made available under a CC BY 3.0 license. This data was provided in CSV format as part of the class assignment.

First we import the Python libraries used to access, process and analyze the data:

In [211]:
import os

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests  # used by get_ores_data below

Load the two CSV files accessed as described above.

In [13]:
os.chdir('C:\\Users\\Richard\\Documents\\MSDS\\512\\A2')
page_df = pd.read_csv('page_data.csv')
wpds_df = pd.read_csv('WPDS_2018_data.csv')

## Step 2: Data processing

### Cleaning page data

In [11]:
page_df.shape

(47197, 3)

Examining the data shows that some page names have a 'Template' prefix; these are templates rather than articles and should be removed for this analysis:

In [9]:
page_df.head()

| | page | country | rev_id |
|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 |
| 1 | Bir I of Kanem | Chad | 355319463 |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 |


In [16]:
page_df = page_df[~page_df["page"].str.startswith("Template")]
page_df.shape

(46701, 3)
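
The filter above relies on `Series.str.startswith`, negated with `~`. A minimal sketch on made-up rows follows; the page titles and revision ids are illustrative, not taken from the dataset. It matches on `"Template:"` including the colon, which is an assumption on my part: it would avoid dropping an article whose title merely begins with the word "Template".

```python
import pandas as pd

# Toy stand-in for page_df; values are illustrative only
toy_pages = pd.DataFrame({
    "page": ["Template:Example-politician-stub", "Jane Doe", "John Roe"],
    "country": ["Zambia", "Chad", "Chad"],
    "rev_id": [1, 2, 3],
})

# Keep rows whose title does not start with the "Template:" namespace prefix
toy_articles = toy_pages[~toy_pages["page"].str.startswith("Template:")]
print(toy_articles["page"].tolist())
```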

### Cleaning population data

In [17]:
wpds_df.shape

(207, 2)

In [21]:
wpds_df.head()

| | Geography | Population mid-2018 (millions) |
|---|---|---|
| 0 | AFRICA | 1284.0 |
| 1 | Algeria | 42.7 |
| 2 | Egypt | 97.0 |
| 3 | Libya | 6.5 |
| 4 | Morocco | 35.2 |


The WPDS_2018_data file combines country and regional population counts in a single 'Geography' field; regional rows are distinguished by upper-case names. I split these two groups into two dataframes ahead of analysis:

In [94]:
countries_df = wpds_df[~wpds_df['Geography'].str.isupper()]
regions_df = wpds_df[wpds_df['Geography'].str.isupper()]
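
As a small illustration of the split, the sketch below applies the same `str.isupper()` mask to a toy frame. The rows are a hand-made subset and the second region's value is a placeholder. The `.copy()` calls are an addition of mine: they make each result an independent dataframe, which avoids pandas' `SettingWithCopyWarning` if the frames are modified later.

```python
import pandas as pd

# Toy stand-in for wpds_df; the ASIA and Japan values are placeholders
toy_wpds = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "Japan"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 1.0, 1.0],
})

# True for region rows (all upper-case names), False for countries
is_region = toy_wpds["Geography"].str.isupper()

toy_regions = toy_wpds[is_region].copy()
toy_countries = toy_wpds[~is_region].copy()
print(toy_regions["Geography"].tolist(), toy_countries["Geography"].tolist())
```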

### Acquire and attach article quality predictions

The methodology and code in this section are based on material provided to the class in the [class wiki](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data) and related materials. I use REST API calls to retrieve a quality estimate for each page, generated by ORES ("Objective Revision Evaluation Service"), a machine learning service. Each page is assigned one of six quality categories used in English Wikipedia [content assessment](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades).

Create a function to access ORES data:

In [52]:
default_headers = {'User-Agent': 'https://github.com/rcctodd', 'From': 'rcctodd@uw.edu'}

def get_ores_data(revision_ids, headers=default_headers):
    # URL template for the ORES v3 scoring endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

    params = {
        'project': 'enwiki',
        'model': 'wp10',
        'revids': '|'.join(str(x) for x in revision_ids)
    }
    # Send the identifying headers with the request and parse the JSON body
    json_response = requests.get(endpoint.format(**params), headers=headers).json()
    return json_response
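
To see what the request looks like without calling the API, the sketch below expands the same URL template for two revision ids. `build_ores_url` is a hypothetical helper introduced only for this illustration; it is not part of the notebook or of any library.

```python
def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Hypothetical helper: expand the endpoint template without sending a request."""
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    # Revision ids are joined with a pipe, as in get_ores_data
    revids = "|".join(str(x) for x in revision_ids)
    return endpoint.format(project=project, model=model, revids=revids)

url = build_ores_url([355319463, 393276188])
print(url)
```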

Create a function which extracts only the quality prediction from the JSON output:

In [53]:
def extract_quality(json_input):
    quality_pred = []
    for key, value in json_input["enwiki"]["scores"].items():
        result_dict = value["wp10"]
        if "error" not in result_dict:
            quality = {
                'rev_id': int(key),
                'prediction': result_dict["score"]["prediction"]
            }
            quality_pred.append(quality)
    
    return quality_pred
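
The sketch below runs the same extraction logic on a hand-written dictionary. The nesting mirrors the keys that `extract_quality` reads (`enwiki`, `scores`, revision id, `wp10`); the error entry is an illustrative placeholder rather than a real ORES message.

```python
def extract_quality(json_input):
    # Same logic as the notebook function, repeated so the sketch is self-contained
    quality_pred = []
    for key, value in json_input["enwiki"]["scores"].items():
        result_dict = value["wp10"]
        if "error" not in result_dict:
            quality_pred.append({
                "rev_id": int(key),
                "prediction": result_dict["score"]["prediction"],
            })
    return quality_pred

# Hand-written stand-in for a response: one scored revision, one that errored
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "123": {"wp10": {"error": {"message": "illustrative error"}}},
        }
    }
}

rows = extract_quality(sample_response)
print(rows)
```

Revisions that return an error are skipped silently, which is why some pages later have no prediction attached.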

In order not to overwhelm the API, I create a simple function to chunk the page list and query each in turn (code here adapted from a [geeksforgeeks](https://www.geeksforgeeks.org/break-list-chunks-size-n-python/) posting).

In [59]:
def chunk_query(l, n): 
    for i in range(0, len(l), n):  
        yield l[i:i + n] 

In [60]:
chunked_pages = list(chunk_query(page_df['rev_id'], 100))
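
As a quick check of the chunking behaviour, the sketch below applies the same generator to a plain list of 250 made-up ids. The last chunk is simply shorter than the rest.

```python
def chunk_query(items, n):
    # Yield successive slices of at most n items
    for i in range(0, len(items), n):
        yield items[i:i + n]

toy_ids = list(range(250))  # placeholder ids
chunks = list(chunk_query(toy_ids, 100))
print([len(c) for c in chunks])  # → [100, 100, 50]
```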

Using the functions created above, I incrementally retrieve ORES data, extract the quality field and convert the resulting information to a dataframe:

In [68]:
quality_json = [get_ores_data(subset) for subset in chunked_pages]

In [73]:
ores_predictions = [extract_quality(subset) for subset in quality_json]

In [90]:
ores_prediction_dfs = [pd.DataFrame.from_records(json_subset) for json_subset in ores_predictions]

In [91]:
quality_prediction_df = pd.concat(ores_prediction_dfs)
quality_prediction_df.to_csv("ores_quality_preds.csv", index=False)

In [96]:
quality_prediction_df.head()

| | prediction | rev_id |
|---|---|---|
| 0 | Stub | 355319463 |
| 1 | Stub | 393276188 |
| 2 | Stub | 393822005 |
| 3 | Stub | 395521877 |
| 4 | Stub | 395526568 |


### Combine data sources

To combine page and country data, I rename the "Geography" field to "country":

In [127]:
# Reassign instead of renaming in place, which avoids pandas' SettingWithCopyWarning on a filtered slice
countries_df = countries_df.rename(columns={'Geography': 'country'})
countries_df.head()


| | country | Population mid-2018 (millions) |
|---|---|---|
| 1 | Algeria | 42.7 |
| 2 | Egypt | 97.0 |
| 3 | Libya | 6.5 |
| 4 | Morocco | 35.2 |
| 5 | Sudan | 41.7 |


I create a dataframe by merging the page data and the country data:

In [128]:
wp_wpds_politicians_by_country = pd.merge(page_df, countries_df, on='country', how='outer')

To this, I merge in the ORES quality prediction.

In [130]:
wp_wpds_politicians_by_country = pd.merge(wp_wpds_politicians_by_country, quality_prediction_df, on='rev_id', how='outer')

Records without a quality prediction are dropped:

In [134]:
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['prediction'].notnull()]


Records with and without a matching country are separated and saved as CSV files.

In [245]:
wp_wpds_countries_no_match_df = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['country'].isna()]
wp_wpds_countries_no_match_df.to_csv("wp_wpds_countries-no_match_df.csv", index=False)

In [247]:
wp_wpds_politicians_by_country = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['country'].notnull()]
wp_wpds_politicians_by_country.to_csv("wp_wpds_politicians_by_country.csv", index=False)
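
Because `country` is the merge key, an outer merge fills it for rows coming from either side, so a null check on `country` may not isolate the unmatched rows. One alternative, sketched here on toy data with made-up population values, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`.

```python
import pandas as pd

# Toy stand-ins; population values are placeholders
toy_pages = pd.DataFrame({"page": ["A", "B"], "country": ["Chad", "Rhodesian"]})
toy_pop = pd.DataFrame({"country": ["Chad", "Fiji"], "population": [1.0, 2.0]})

merged = pd.merge(toy_pages, toy_pop, on="country", how="outer", indicator=True)

# Rows present in both sources versus rows found in only one of them
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(matched["country"].tolist())
print(sorted(unmatched["country"].tolist()))
```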

## Step 3: Analysis

In this stage I explore the relationship between population, numbers of articles about politicians and the quality of those articles. "High quality" articles are defined as having an ORES-predicted class of "FA" ("featured article") or "GA" ("good article").

### Country-level analysis

To find the ten highest-ranked countries by number of politician articles as a proportion of country population, I convert the population column to numeric values, group the data by country, and append a count of articles per country.

In [156]:
wp_wpds_politicians_by_country['Population mid-2018 (millions)'] = pd.to_numeric(wp_wpds_politicians_by_country['Population mid-2018 (millions)'].str.replace(',', ''))

In [188]:
country_df = pd.DataFrame(wp_wpds_politicians_by_country.groupby(['country'])['Population mid-2018 (millions)'].max())

In [189]:
country_df['pagecount']= wp_wpds_politicians_by_country.groupby(['country'])['page'].count()

Add a calculation of articles per million people population:

In [190]:
country_df['articles_per_million_pop'] = country_df['pagecount'] / country_df['Population mid-2018 (millions)']

Add to the dataframe a count of articles by quality prediction - replacing NAs with 0 - then calculate the proportion of articles that are high quality: 

In [191]:
country_df = country_df.join(wp_wpds_politicians_by_country.groupby(['country'])['prediction'].value_counts().unstack().fillna(0))

In [194]:
country_df['prop_high_quality']=(country_df['GA']+country_df['FA'])/country_df['pagecount']
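
The per-country steps above can be traced on a toy frame. The sketch below uses made-up countries, pages and populations, and a shortened column name (`population_millions`) for readability. The `reindex` call is an addition of mine: it guarantees that `GA` and `FA` columns exist even when no article in the data has that class.

```python
import pandas as pd

# Toy article-level data; all values are illustrative
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "page": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "prediction": ["GA", "Stub", "Stub", "FA", "Stub", "Start"],
    "population_millions": [2.0, 2.0, 2.0, 2.0, 4.0, 4.0],
})

by_country = toy.groupby("country")
summary = pd.DataFrame({
    "population_millions": by_country["population_millions"].max(),
    "pagecount": by_country["page"].count(),
})
summary["articles_per_million_pop"] = summary["pagecount"] / summary["population_millions"]

# One column per quality class, zero where a country has no such article
counts = by_country["prediction"].value_counts().unstack().fillna(0)
counts = counts.reindex(columns=["FA", "GA", "Start", "Stub"], fill_value=0)

summary = summary.join(counts)
summary["prop_high_quality"] = (summary["GA"] + summary["FA"]) / summary["pagecount"]
print(summary[["articles_per_million_pop", "prop_high_quality"]])
```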

Sort and truncate dataframe, displaying only variables of interest:

#### Top 10 countries by coverage: 10 highest-ranked countries by politician articles as a proportion of country population

In [201]:
country_df[['articles_per_million_pop']].sort_values('articles_per_million_pop', ascending=False).head(10)

| country | articles_per_million_pop |
|---|---|
| Tuvalu | 5400.0 |
| Nauru | 5200.0 |
| San Marino | 2700.0 |
| Monaco | 1000.0 |
| Liechtenstein | 700.0 |
| Tonga | 630.0 |
| Marshall Islands | 616.666667 |
| Iceland | 502.5 |
| Andorra | 425.0 |
| Grenada | 360.0 |


#### Bottom 10 countries by coverage: 10 lowest-ranked countries by politician articles as a proportion of country population

In [202]:
country_df[['articles_per_million_pop']].sort_values('articles_per_million_pop', ascending=True).head(10)

| country | articles_per_million_pop |
|---|---|
| India | 0.71465 |
| Indonesia | 0.791855 |
| China | 0.810733 |
| Uzbekistan | 0.851064 |
| Ethiopia | 0.939535 |
| Korea, North | 1.40625 |
| Zambia | 1.412429 |
| Thailand | 1.691843 |
| Mozambique | 1.901639 |
| Bangladesh | 1.917067 |


#### Top 10 countries by relative quality: 10 highest-ranked countries by relative proportion of politician articles that are of GA and FA-quality

In [204]:
country_df[['prop_high_quality']].sort_values('prop_high_quality', ascending=False).head(10)

| country | prop_high_quality |
|---|---|
| Korea, North | 0.194444 |
| Rhodesian | 0.146667 |
| Saudi Arabia | 0.127119 |
| Mauritania | 0.125 |
| Central African Republic | 0.121212 |
| Romania | 0.113703 |
| Tuvalu | 0.092593 |
| Bhutan | 0.090909 |
| Dominica | 0.083333 |
| Syria | 0.078125 |


#### Bottom 10 countries by relative quality: 10 lowest-ranked countries by relative proportion of politician articles that are of GA and FA-quality

In [208]:
country_df[['prop_high_quality']].sort_values('prop_high_quality', ascending=True).head(10)

| country | prop_high_quality |
|---|---|
| South Korean | 0.0 |
| Slovakia | 0.0 |
| Ivorian | 0.0 |
| Solomon Islands | 0.0 |
| Somaliland | 0.0 |
| Incan | 0.0 |
| Hondura | 0.0 |
| Guyana | 0.0 |
| South Ossetian | 0.0 |
| Kazakhstan | 0.0 |


The analysis below shows that 62 countries (~28%) have no articles predicted to be high quality, so they all tie at zero and the ten shown above are an arbitrary subset.

In [216]:
country_df.shape[0] - np.count_nonzero(country_df[['prop_high_quality']])

62
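
The count above subtracts the number of nonzero entries from the number of rows. `np.count_nonzero` treats `NaN` as nonzero, so only exact zeros are counted. The sketch below, on made-up proportions, shows that calculation next to a more direct comparison with zero that gives the same count.

```python
import numpy as np
import pandas as pd

# Toy proportions: two zeros, one positive value, one missing value
toy_props = pd.DataFrame({"prop_high_quality": [0.0, 0.1, 0.0, np.nan]})

# Approach used in the notebook: rows minus nonzero entries (NaN counts as nonzero)
values = toy_props[["prop_high_quality"]].to_numpy()
n_zero_original = int(toy_props.shape[0] - np.count_nonzero(values))

# Direct alternative: count entries that equal zero
n_zero_direct = int((toy_props["prop_high_quality"] == 0).sum())

print(n_zero_original, n_zero_direct)
```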

### Region-level analysis

The original population data file contains regions as well as countries, with each upper-case region row preceding the countries in that region. We can use this structure to loop through the dataframe and allocate countries to regions, then merge the result into the country dataframe.

In [219]:
region_list = []
for geog in wpds_df['Geography'].tolist():
    if geog.isupper():
        current_region = geog
        region_list.append('regionname')
    else:
        region_list.append(current_region)

In [221]:
wpds_df['region_cat'] = region_list
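
The same allocation can also be written without an explicit loop. The sketch below is an alternative under my own assumptions, not the notebook's method: it keeps the region names, blanks out the country rows, and forward-fills. Unlike the loop, region rows receive their own name instead of the `'regionname'` placeholder.

```python
import pandas as pd

# Toy stand-in for the 'Geography' column of wpds_df
toy_wpds = pd.DataFrame({"Geography": ["AFRICA", "Algeria", "Egypt", "ASIA", "Japan"]})

is_region = toy_wpds["Geography"].str.isupper()

# Keep region names, set country rows to NaN, then carry each region name forward
toy_wpds["region_cat"] = toy_wpds["Geography"].where(is_region).ffill()

toy_countries = toy_wpds[~is_region]
print(toy_countries)
```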

In [234]:
country_df = country_df.merge(wpds_df[['Geography','region_cat']],left_index=True,right_on='Geography')

Setting 'Geography' as the index makes the dataframe consistent with the above.

In [237]:
country_df = country_df.set_index('Geography')

Grouping data by region, then calculating articles per million population as above.

In [239]:
region_df = pd.DataFrame(country_df.groupby(['region_cat'])[['Population mid-2018 (millions)','pagecount']].sum())

In [241]:
region_df['articles_per_million_pop'] = region_df['pagecount'] / region_df['Population mid-2018 (millions)']

Sorting values for purposes of output:

#### Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [243]:
region_df.sort_values('articles_per_million_pop', ascending=False)

| region_cat | Population mid-2018 (millions) | pagecount | articles_per_million_pop |
|---|---|---|---|
| OCEANIA | 39.78 | 3128 | 78.632479 |
| EUROPE | 734.59 | 15864 | 21.59572 |
| LATIN AMERICA AND THE CARIBBEAN | 628.27 | 5169 | 8.227354 |
| AFRICA | 1172.4 | 6851 | 5.843569 |
| NORTHERN AMERICA | 365.2 | 1921 | 5.260131 |
| ASIA | 4513.1 | 11531 | 2.555007 |


#### Geographic regions by coverage (ascending): The same ranking of geographic regions by politician articles per million population, shown in ascending order

In [244]:
region_df.sort_values('articles_per_million_pop', ascending=True)

| region_cat | Population mid-2018 (millions) | pagecount | articles_per_million_pop |
|---|---|---|---|
| ASIA | 4513.1 | 11531 | 2.555007 |
| NORTHERN AMERICA | 365.2 | 1921 | 5.260131 |
| AFRICA | 1172.4 | 6851 | 5.843569 |
| LATIN AMERICA AND THE CARIBBEAN | 628.27 | 5169 | 8.227354 |
| EUROPE | 734.59 | 15864 | 21.59572 |
| OCEANIA | 39.78 | 3128 | 78.632479 |
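
The table above is sorted by `articles_per_million_pop`. A ranking by the proportion of GA and FA articles would also need the `GA` and `FA` counts summed by region. The sketch below shows one way to do that on made-up numbers; it assumes the country-level frame carries `GA`, `FA`, `pagecount` and `region_cat` columns, as `country_df` does after the steps above.

```python
import pandas as pd

# Toy country-level data; counts and region labels are illustrative
toy_country = pd.DataFrame(
    {
        "region_cat": ["R1", "R1", "R2"],
        "pagecount": [10, 30, 20],
        "GA": [1, 2, 0],
        "FA": [1, 0, 1],
    },
    index=["A", "B", "C"],
)

# Sum article counts by region, then take the high-quality share
toy_region = toy_country.groupby("region_cat")[["pagecount", "GA", "FA"]].sum()
toy_region["prop_high_quality"] = (toy_region["GA"] + toy_region["FA"]) / toy_region["pagecount"]

ranked = toy_region.sort_values("prop_high_quality", ascending=False)
print(ranked)
```

Summing counts before dividing weights each region by its total article count, which differs from averaging the per-country proportions.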
