# 512hw2 Bias

In [128]:
# Import the libraries
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1 - data acquisition
Two initial data files were downloaded in CSV format.

1. The Wikipedia politicians by country dataset called 'page_data.csv'. (https://figshare.com/articles/dataset/Untitled_Item/5513449)

2. The population data is available in CSV format as 'WPDS_2020_data.csv'. (https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit)


## Step 2 - data cleaning
We need to filter page_data by removing the pages in 'page_data.csv' whose names start with the string 'Template'.

No cleaning is needed for 'WPDS_2020_data.csv'. For convenience, I created a reduced version of that data (the full data is used in step 6).
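As a small illustration of the filter described above, the sketch below drops template pages from a made-up three-row frame. The sample rows (including the template name and its revision ID) are hypothetical; only the column names follow 'page_data.csv'.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror page_data.csv.
sample = pd.DataFrame({
    'page': ['Template:Example-politician-stub', 'Bir I of Kanem', 'Yos Por'],
    'country': ['Zambia', 'Chad', 'Cambodia'],
    'rev_id': [100000001, 355319463, 393822005],
})

# Keep every row whose page name does not start with 'Template'.
is_template = sample['page'].str.startswith('Template')
sample_cleaned = sample.loc[~is_template].copy()
print(sample_cleaned['page'].tolist())
```

Taking an explicit `.copy()` of the filtered frame is one way to make later column assignments unambiguous to pandas.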

In [155]:
page_data = pd.read_csv('page_data.csv')
WPDS_data = pd.read_csv('WPDS_2020_data.csv')

In [156]:
print(page_data.shape)

(47197, 3)


In [157]:
page_data_cleaned = page_data.loc[(page_data['page'].apply(lambda x: (x[0:8] != 'Template')))]
print(page_data_cleaned.shape)
page_data_cleaned

(46701, 3)


Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568
...,...,...,...
47192,Yahya Jammeh,Gambia,807482007
47193,Lucius Fairchild,United States,807483006
47194,Fahd of Saudi Arabia,Saudi Arabia,807483153
47195,Francis Fessenden,United States,807483270


In [159]:
WPDS_data_reduced = WPDS_data[['Name', 'Type', 'Population']]
WPDS_data_reduced

Unnamed: 0,Name,Type,Population
0,WORLD,World,7772850000
1,AFRICA,Sub-Region,1337918000
2,NORTHERN AFRICA,Sub-Region,244344000
3,Algeria,Country,44357000
4,Egypt,Country,100803000
...,...,...,...
229,Samoa,Country,200000
230,Solomon Islands,Country,715000
231,Tonga,Country,99000
232,Tuvalu,Country,10000


## Step 3 - get article quality prediction 
I used the ORES machine learning system to get the quality prediction for each article.

'ORES' originally stood for "Objective Revision Evaluation Service", but the service is now simply called "ORES". ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:<br>
FA - Featured article<br>
GA - Good article<br>
B - B-class article<br>
C - C-class article<br>
Start - Start-class article<br>
Stub - Stub-class article<br>

I chose to use the REST API endpoint to get the prediction results. The API documentation can be found here: (https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model)<br>
For this assignment, I only extracted the 'prediction' from the API response.
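To make the extraction step concrete, here is a minimal sketch that pulls the 'prediction' field out of a response-shaped dictionary without touching the network. The `sample_response` is hand-written for illustration and the helper `extract_predictions` is a hypothetical name; the nesting follows the keys used in the code in this notebook, and the error entry is an assumed stand-in for a revision that could not be scored.

```python
# Hand-written stand-in for an ORES response; the error entry is an
# illustrative assumption, not real API output.
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {
                'articlequality': {'score': {'prediction': 'Stub'}}
            },
            '516633096': {
                'articlequality': {'error': {'message': 'revision not found'}}
            },
        }
    }
}

def extract_predictions(response, wiki='enwiki', model='articlequality'):
    """Return {rev_id: prediction}, using 'NA' when no score is present."""
    predictions = {}
    for rev_id, models in response[wiki]['scores'].items():
        score = models[model].get('score')
        predictions[rev_id] = score['prediction'] if score else 'NA'
    return predictions

predictions = extract_predictions(sample_response)
print(predictions)
```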

In [47]:
# api header
headers = {
    'User-Agent': 'https://github.com/tommycqy',
    'From': 'qingyuc@uw.edu'
}

# api endpoint
endpoint = "https://ores.wikimedia.org/v3/scores/enwiki?models=articlequality&revids={rev_ids}"

In [48]:
# Call the ORES API for a '|'-separated string of revision IDs and return the parsed JSON
def api_call(endpoint, rev_ids):
    call = requests.get(endpoint.format(rev_ids = rev_ids), headers=headers)
    response = call.json()
    
    return response

In [85]:
def call_api(score_dict):
    # Query ORES in batches of 50 revision IDs and store each prediction in score_dict
    for i in range(0, len(page_data_cleaned), 50):
        # iloc slicing stops at the end of the data, so the last batch may be shorter
        batch_ids = page_data_cleaned.rev_id.iloc[i:i+50]
        res = api_call(endpoint, '|'.join(str(s) for s in batch_ids))
        for key in res['enwiki']['scores']:
            if 'score' in res['enwiki']['scores'][key]['articlequality']:
                curr_score = res['enwiki']['scores'][key]['articlequality']['score']['prediction']
                score_dict[key] = curr_score
            else:
                # no score was returned for this revision; log it as 'NA'
                score_dict[key] = 'NA'
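The batching logic can also be separated from the network call. The sketch below is one possible way to do that; `batch_rev_ids` is a hypothetical helper name, and the integer range stands in for real revision IDs.

```python
def batch_rev_ids(rev_ids, batch_size=50):
    """Yield '|'-joined strings of at most batch_size revision IDs."""
    rev_ids = list(rev_ids)
    for start in range(0, len(rev_ids), batch_size):
        # Slicing past the end simply returns the remaining items.
        yield '|'.join(str(r) for r in rev_ids[start:start + batch_size])

# 120 placeholder IDs split into batches of 50, 50 and 20.
batches = list(batch_rev_ids(range(1, 121), batch_size=50))
batch_sizes = [len(b.split('|')) for b in batches]
print(batch_sizes)
```

Keeping the batching in its own generator makes it easy to check the batch sizes before sending any requests.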

In [86]:
score_dict = dict()
call_api(score_dict)

In [143]:
print(len(score_dict))

46701


After getting all the predictions (some of them are NAs), we need to filter out the NAs and keep a separate log.<br>
There are 276 articles for which we could not get a prediction score.
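The mapping and the split into scored and unscored articles can be sketched on a tiny frame as follows. The frame `pages` and the dictionary `scores` are hypothetical miniatures of the real inputs. The explicit `.copy()` is an assumption about how one might avoid pandas' SettingWithCopyWarning when adding a column to a filtered frame.

```python
import pandas as pd

# Hypothetical miniature inputs standing in for the page data and score_dict.
pages = pd.DataFrame({
    'page': ['Bir I of Kanem', 'List of politicians in Poland', 'May De Silva'],
    'country': ['Chad', 'Poland', 'Seychelles'],
    'rev_id': [355319463, 516633096, 800969960],
})
scores = {'355319463': 'Stub', '516633096': 'NA', '800969960': 'Start'}

# Filter, then copy, so the new column is added to an independent frame.
cleaned = pages.loc[~pages['page'].str.startswith('Template')].copy()
cleaned['article_quality_est.'] = cleaned['rev_id'].astype(str).map(scores)

# Split into the log of unscored articles and the articles kept for analysis.
no_score = cleaned.loc[cleaned['article_quality_est.'] == 'NA']
scored = cleaned.loc[cleaned['article_quality_est.'] != 'NA']
print(len(no_score), len(scored))
```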

In [201]:
page_data_cleaned['article_quality_est.'] = page_data_cleaned['rev_id'].astype(str).map(score_dict)
no_score_log = page_data_cleaned.loc[page_data_cleaned['article_quality_est.']=='NA']
print(len(no_score_log))
no_score_log

276


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  page_data_cleaned['article_quality_est.'] = page_data_cleaned['rev_id'].astype(str).map(score_dict)


Unnamed: 0,page,country,rev_id,article_quality_est.
126,List of politicians in Poland,Poland,516633096,
222,Tingtingru,Vanuatu,550682925,
330,Daud Arsala,Afghanistan,627547024,
359,Book:Two Political Biographies,India,636911471,
514,Dilaver Bey,Turkey,669987106,
...,...,...,...,...
46782,John Rose (Trotskyist),United Kingdom,807336308,
46862,Jalal Movaghar,Iran,807367030,
46863,Mohsen Movaghar,Iran,807367166,
47182,King Gutierrez,Philippines,807479587,


In [202]:
page_data_reduced = page_data_cleaned.loc[page_data_cleaned['article_quality_est.']!='NA']
print(len(page_data_reduced))

46425


## Step 4 - combine the datasets 
Remove any rows that do not have matching data, and output them to a CSV file called: wp_wpds_countries-no_match.csv

Consolidate the remaining data into a single CSV file called: wp_wpds_politicians_by_country.csv<br>
The schema for that file is the following:<br>
Column<br>
country<br>
article_name<br>
revision_id<br>
article_quality_est.<br>
population<br>

Note: revision_id here is the same thing as rev_id, which you used to get scores from ORES.
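One way to separate matching from non-matching rows is the `indicator` option of `DataFrame.merge`, which records where each row came from. The sketch below uses two made-up miniature frames; it is an alternative illustration, not the approach used in the cells that follow.

```python
import pandas as pd

# Hypothetical miniature versions of the article and population tables.
articles = pd.DataFrame({
    'page': ['Bir I of Kanem', 'Julius Gregr'],
    'country': ['Chad', 'Czech Republic'],
})
population = pd.DataFrame({
    'Name': ['Chad', 'Guam'],
    'Population': [16877000, 175000],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = articles.merge(population, how='outer', left_on='country',
                        right_on='Name', indicator=True)
matched = merged.loc[merged['_merge'] == 'both'].drop(columns='_merge')
no_match = merged.loc[merged['_merge'] != 'both'].drop(columns='_merge')
print(len(matched), len(no_match))
```

Here 'Czech Republic' has no population row and 'Guam' has no articles, so both end up in `no_match`.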

In [203]:
df_merge = page_data_reduced.merge(WPDS_data_reduced,how='outer',left_on=['country'],right_on=['Name'])
print(len(df_merge))

46476


In [204]:
no_match_df = df_merge.loc[df_merge['country'].isna() | df_merge['Name'].isna()]
no_match_index = no_match_df.index
no_match_df

Unnamed: 0,page,country,rev_id,article_quality_est.,Name,Type,Population
488,Julius Gregr,Czech Republic,395521877.0,Stub,,,
489,Edvard Gregr,Czech Republic,395526568.0,Stub,,,
490,Miroslav Poche,Czech Republic,672862914.0,Stub,,,
491,Vojtěch Mynář,Czech Republic,673008587.0,Stub,,,
492,Jan Malypetr,Czech Republic,704424304.0,Stub,,,
...,...,...,...,...,...,...,...
46471,,,,,French Polynesia,Country,280000.0
46472,,,,,Guam,Country,175000.0
46473,,,,,New Caledonia,Country,295000.0
46474,,,,,Palau,Country,18000.0


In [209]:
no_match_df.to_csv('wp_wpds_countries-no_match.csv', index=False)  

In [208]:
filter_df  = df_merge[~df_merge.index.isin(no_match_index)]
filter_df.rev_id = filter_df.rev_id.astype('int64')
filter_df.Population = filter_df.Population.astype('int64')
filter_df_reduced = filter_df[['page','country', 'rev_id', 'article_quality_est.','Population']]
filter_df_reduced.rename(columns={'page':'article_name','rev_id':'revision_id','Population':'population'}, inplace=True)
filter_df_reduced

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Unnamed: 0,article_name,country,revision_id,article_quality_est.,population
0,Bir I of Kanem,Chad,355319463,Stub,16877000
1,Abdullah II of Kanem,Chad,498683267,Stub,16877000
2,Salmama II of Kanem,Chad,565745353,Stub,16877000
3,Kuri I of Kanem,Chad,565745365,Stub,16877000
4,Mohammed I of Kanem,Chad,565745375,Stub,16877000
...,...,...,...,...,...
46414,Rita Sinon,Seychelles,800323154,Stub,98000
46415,Sylvette Frichot,Seychelles,800323798,Stub,98000
46416,May De Silva,Seychelles,800969960,Start,98000
46417,Vincent Meriton,Seychelles,802051093,Stub,98000


In [445]:
filter_df_reduced.to_csv('wp_wpds_politicians_by_country.csv', index=False)  

## Step 5 - analysis
The following code generates the tables in step 6.
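As a worked example of the coverage measure (articles as a percentage of population), the sketch below computes it for two made-up countries. The frame `toy` and its values are hypothetical; only the column names follow 'wp_wpds_politicians_by_country.csv'.

```python
import pandas as pd

# Hypothetical rows in the shape of wp_wpds_politicians_by_country.csv.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_name': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2'],
    'population': [1000, 1000, 1000, 1000, 400, 400],
})

# One row per country: number of articles and the (constant) population.
per_country = toy.groupby('country').agg(
    article_count=('article_name', 'count'),
    population=('population', 'first'),
)

# Articles per population, expressed as a percentage.
per_country['coverage_pct'] = (
    per_country['article_count'] / per_country['population'] * 100
)
ranking = per_country['coverage_pct'].sort_values(ascending=False)
print(ranking)
```

Country B has fewer articles than A but a smaller population, so it ranks higher on this measure.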

In [446]:
# for tasks 1 and 2
article_count_by_country = filter_df_reduced.groupby('country').agg({'article_name':'count'})
population_by_country = filter_df_reduced.groupby('country').agg({'population':'mean'})
article_proportion_by_country = article_count_by_country['article_name'].astype('int64') / population_by_country['population'].astype('int64') * 100

In [447]:
# for tasks 3 and 4
groups = filter_df_reduced.groupby('country')
group = groups.apply(lambda g: g[(g['article_quality_est.'] == 'FA') | (g['article_quality_est.'] == 'GA')])
high_quality_article_count = group.groupby(level=[0]).size()
df = filter_df_reduced.set_index('country')

article_count_by_country_with_hqa = df.loc[high_quality_article_count.index].groupby('country').agg({'article_name':'count'})
hqa_proportion_by_country = high_quality_article_count / article_count_by_country_with_hqa['article_name'].astype('int64') * 100

In [448]:
# for tasks 5 and 6
region_population_temp= WPDS_data.loc[(WPDS_data['Type'].apply(lambda x: x == 'Sub-Region'))]
region_population = region_population_temp.rename(columns={'Population': 'population', 'Name': 'sub_region'})
population_by_region = region_population[['sub_region', 'population']]

In [449]:
article_pop_by_country = article_count_by_country.merge(population_by_country,how='left',left_on=['country'],right_on=['country'])
hqa_count_by_country = high_quality_article_count.to_frame('hqa_count')
article_pop_hqa_by_country =  article_pop_by_country.merge(hqa_count_by_country,how='left',left_on=['country'],right_on=['country'])
article_pop_hqa_by_country.fillna(0, inplace=True)
article_pop_hqa_by_country.rename(columns={'article_name':'article_count'}, inplace=True)
article_pop_hqa_by_country.reset_index(inplace=True)
article_pop_hqa_by_country

Unnamed: 0,country,article_count,population,hqa_count
0,Afghanistan,319,38928000,13.0
1,Albania,456,2838000,3.0
2,Algeria,116,44357000,2.0
3,Andorra,34,82000,0.0
4,Angola,106,32522000,0.0
...,...,...,...,...
178,Venezuela,130,28645000,3.0
179,Vietnam,187,96209000,13.0
180,Yemen,116,29826000,3.0
181,Zambia,25,18384000,0.0


In [450]:
# get the dictionary for country region by iterating the pandas dataframe
def get_country_region_dict():
    country_region_dict = {}
    region_name = None
    for index, row in WPDS_data.iterrows():
        type_ = row['Type']
        name = row['Name']
        if (name == 'WORLD'):
            continue
        if (type_ == 'Sub-Region' and name.isupper()):
            if region_name is None or region_name != name:
                region_name = name
                continue
        country_region_dict[name] = region_name
    return country_region_dict

country_region_dict = get_country_region_dict()
print(len(country_region_dict))

210
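The same country-to-region lookup can be sketched without an explicit loop by carrying each sub-region name forward to the rows below it. The frame `wpds` is a hand-made excerpt that assumes the same ordering as the WPDS file (each sub-region row precedes its countries). Unlike the function above, this version keeps only rows of type 'Country'.

```python
import pandas as pd

# Hand-made excerpt; assumes sub-region rows precede their countries.
wpds = pd.DataFrame({
    'Name': ['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'Algeria', 'Egypt',
             'OCEANIA', 'Samoa'],
    'Type': ['World', 'Sub-Region', 'Sub-Region', 'Country', 'Country',
             'Sub-Region', 'Country'],
})

# Keep the name only on Sub-Region rows, then carry it forward to later rows.
region = wpds['Name'].where(wpds['Type'] == 'Sub-Region').ffill()

# Build the lookup from the Country rows only.
is_country = wpds['Type'] == 'Country'
country_to_region = dict(zip(wpds.loc[is_country, 'Name'], region[is_country]))
print(country_to_region)
```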


In [451]:
article_pop_hqa_by_country['sub_region'] = article_pop_hqa_by_country['country'].map(country_region_dict)
article_pop_hqa_by_country

Unnamed: 0,country,article_count,population,hqa_count,sub_region
0,Afghanistan,319,38928000,13.0,SOUTH ASIA
1,Albania,456,2838000,3.0,SOUTHERN EUROPE
2,Algeria,116,44357000,2.0,NORTHERN AFRICA
3,Andorra,34,82000,0.0,SOUTHERN EUROPE
4,Angola,106,32522000,0.0,MIDDLE AFRICA
...,...,...,...,...,...
178,Venezuela,130,28645000,3.0,SOUTH AMERICA
179,Vietnam,187,96209000,13.0,SOUTHEAST ASIA
180,Yemen,116,29826000,3.0,WESTERN ASIA
181,Zambia,25,18384000,0.0,EASTERN AFRICA


In [462]:
article_count_by_region = article_pop_hqa_by_country.groupby('sub_region').agg({'article_count':'sum'})
article_count_by_region.reset_index(inplace=True)
article_count_by_region

Unnamed: 0,sub_region,article_count
0,CARIBBEAN,695
1,CENTRAL AMERICA,1543
2,CENTRAL ASIA,245
3,EAST ASIA,2473
4,EASTERN AFRICA,2502
5,EASTERN EUROPE,3732
6,MIDDLE AFRICA,665
7,NORTHERN AFRICA,899
8,NORTHERN AMERICA,1901
9,NORTHERN EUROPE,3763


In [463]:
region_df = article_count_by_region.merge(population_by_region,how='left',left_on=['sub_region'],right_on=['sub_region'])
region_df['article_proportion'] = region_df['article_count']/region_df['population'] * 100
region_df

Unnamed: 0,sub_region,article_count,population,article_proportion
0,CARIBBEAN,695,43233000,0.001608
1,CENTRAL AMERICA,1543,178611000,0.000864
2,CENTRAL ASIA,245,74961000,0.000327
3,EAST ASIA,2473,1641063000,0.000151
4,EASTERN AFRICA,2502,444970000,0.000562
5,EASTERN EUROPE,3732,291902000,0.001279
6,MIDDLE AFRICA,665,179757000,0.00037
7,NORTHERN AFRICA,899,244344000,0.000368
8,NORTHERN AMERICA,1901,368193000,0.000516
9,NORTHERN EUROPE,3763,105990000,0.00355


In [464]:
hqa_count_by_region = article_pop_hqa_by_country.groupby('sub_region').agg({'hqa_count':'sum'})
hqa_count_by_region.reset_index(inplace=True)
hqa_count_by_region

Unnamed: 0,sub_region,hqa_count
0,CARIBBEAN,13.0
1,CENTRAL AMERICA,23.0
2,CENTRAL ASIA,7.0
3,EAST ASIA,76.0
4,EASTERN AFRICA,35.0
5,EASTERN EUROPE,118.0
6,MIDDLE AFRICA,16.0
7,NORTHERN AFRICA,19.0
8,NORTHERN AMERICA,104.0
9,NORTHERN EUROPE,102.0


In [465]:
final_region_df = region_df.merge(hqa_count_by_region,how='left',left_on=['sub_region'],right_on=['sub_region'])
final_region_df['hqa_proportion'] = final_region_df['hqa_count']/final_region_df['article_count'] * 100
final_region_df

Unnamed: 0,sub_region,article_count,population,article_proportion,hqa_count,hqa_proportion
0,CARIBBEAN,695,43233000,0.001608,13.0,1.870504
1,CENTRAL AMERICA,1543,178611000,0.000864,23.0,1.490603
2,CENTRAL ASIA,245,74961000,0.000327,7.0,2.857143
3,EAST ASIA,2473,1641063000,0.000151,76.0,3.07319
4,EASTERN AFRICA,2502,444970000,0.000562,35.0,1.398881
5,EASTERN EUROPE,3732,291902000,0.001279,118.0,3.161844
6,MIDDLE AFRICA,665,179757000,0.00037,16.0,2.406015
7,NORTHERN AFRICA,899,244344000,0.000368,19.0,2.113459
8,NORTHERN AMERICA,1901,368193000,0.000516,104.0,5.470805
9,NORTHERN EUROPE,3763,105990000,0.00355,102.0,2.710603


## Step 6 - analysis results
There are six analysis tables below, corresponding to the six analysis questions.

First Analysis Task: Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [466]:

print(article_proportion_by_country.sort_values(ascending=False).head(10))

country
Tuvalu                            0.540000
Nauru                             0.472727
San Marino                        0.238235
Monaco                            0.105263
Liechtenstein                     0.071795
Marshall Islands                  0.064912
Tonga                             0.063636
Iceland                           0.054620
Andorra                           0.041463
Federated States of Micronesia    0.033962
dtype: float64


Second Analysis Task: Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [467]:
print(article_proportion_by_country.sort_values(ascending=True).head(10))

country
India           0.000069
Indonesia       0.000077
China           0.000081
Uzbekistan      0.000082
Ethiopia        0.000088
Zambia          0.000136
Korea, North    0.000140
Thailand        0.000168
Mozambique      0.000186
Bangladesh      0.000187
dtype: float64


Third Analysis Task: Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [468]:
print(hqa_proportion_by_country.sort_values(ascending=False).head(10))

country
Korea, North                22.222222
Saudi Arabia                12.820513
Romania                     12.244898
Central African Republic    12.121212
Uzbekistan                  10.714286
Mauritania                  10.416667
Guatemala                    8.433735
Dominica                     8.333333
Syria                        7.812500
Benin                        7.692308
dtype: float64


Fourth Analysis Task: Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

(Note: I only output the countries with at least one 'GA' or 'FA' quality article in this task. Otherwise, the bottom ten countries would all have 0% high-quality articles.)
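The difference between the two ways of ranking can be seen on a toy example. The sketch below computes the share of 'GA'/'FA' articles for every country, including those with none, and then the filtered variant that keeps only countries with at least one such article. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical rows; country B has no GA or FA articles.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'article_quality_est.': ['FA', 'Stub', 'GA', 'Stub', 'Stub', 'Start'],
})

# True for high-quality articles; the group mean of a boolean is a proportion.
is_hq = toy['article_quality_est.'].isin(['FA', 'GA'])
hq_pct_all = is_hq.groupby(toy['country']).mean() * 100

# Variant that drops countries without any high-quality article.
hq_pct_nonzero = hq_pct_all[hq_pct_all > 0]
print(hq_pct_all)
print(hq_pct_nonzero)
```

With all countries included, B appears at 0%; with the filter, only A remains.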

In [469]:
print(hqa_proportion_by_country.sort_values(ascending=True).head(10))

country
Belgium        0.192678
Tanzania       0.247525
Switzerland    0.248756
Nepal          0.280899
Peru           0.285714
Nigeria        0.295858
Portugal       0.314465
Colombia       0.350877
Lithuania      0.409836
Morocco        0.485437
dtype: float64


Fifth Analysis Task: Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [470]:
final_region_df.sort_values('article_proportion', ascending=False)

Unnamed: 0,sub_region,article_count,population,article_proportion,hqa_count,hqa_proportion
10,OCEANIA,3126,43155000,0.007244,63.0,2.015355
9,NORTHERN EUROPE,3763,105990000,0.00355,102.0,2.710603
15,SOUTHERN EUROPE,3710,153251000,0.002421,74.0,1.994609
18,WESTERN EUROPE,4560,195479000,0.002333,56.0,1.22807
0,CARIBBEAN,695,43233000,0.001608,13.0,1.870504
5,EASTERN EUROPE,3732,291902000,0.001279,118.0,3.161844
14,SOUTHERN AFRICA,634,67732000,0.000936,9.0,1.419558
17,WESTERN ASIA,2563,280927000,0.000912,89.0,3.472493
1,CENTRAL AMERICA,1543,178611000,0.000864,23.0,1.490603
11,SOUTH AMERICA,3032,429191000,0.000706,40.0,1.319261


Sixth Analysis Task: Geographic regions by quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [471]:
final_region_df.sort_values('hqa_proportion', ascending=False)

Unnamed: 0,sub_region,article_count,population,article_proportion,hqa_count,hqa_proportion
8,NORTHERN AMERICA,1901,368193000,0.000516,104.0,5.470805
13,SOUTHEAST ASIA,2020,661845000,0.000305,73.0,3.613861
17,WESTERN ASIA,2563,280927000,0.000912,89.0,3.472493
5,EASTERN EUROPE,3732,291902000,0.001279,118.0,3.161844
3,EAST ASIA,2473,1641063000,0.000151,76.0,3.07319
2,CENTRAL ASIA,245,74961000,0.000327,7.0,2.857143
9,NORTHERN EUROPE,3763,105990000,0.00355,102.0,2.710603
6,MIDDLE AFRICA,665,179757000,0.00037,16.0,2.406015
7,NORTHERN AFRICA,899,244344000,0.000368,19.0,2.113459
10,OCEANIA,3126,43155000,0.007244,63.0,2.015355
