# A2: Bias in Data
___

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import requests
import json
import time

## Step 1: Getting the Article and Population Data
___
The first step is getting the data, which lives in several different places. The Wikipedia politicians by country dataset can be found on Figshare. Read through the documentation for this repository, then download and unzip it to extract the data file, which is called `page_data.csv`.
The population data is available in CSV format as `WPDS_2020_data.csv`. This dataset is drawn from the world population data sheet published by the Population Reference Bureau.


In [20]:
page_data = pd.read_csv('data_raw/page_data.csv')
WPDS_data = pd.read_csv('data_raw/WPDS_2020_data - WPDS_2020_data.csv.csv')

## Step 2: Cleaning the Data
___
Both `page_data.csv` and `WPDS_2020_data.csv` contain some rows that you will need to filter out and/or ignore when you combine the datasets in the next step. In the case of `page_data.csv`, the dataset contains some page names that start with the string "Template:". These pages are not Wikipedia articles, and should not be included in your analysis.

Similarly, `WPDS_2020_data.csv` contains some rows that provide cumulative regional population counts, rather than country-level counts. These rows are distinguished by having ALL CAPS values in the 'geography' field (e.g. AFRICA, OCEANIA). These rows won't match the country values in `page_data.csv`, but you will want to retain them (either in the original file, or a separate file) so that you can report coverage and quality by region in the analysis section.
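
As a minimal sketch of the two filters described above, the toy example below drops "Template:" pages and separates ALL CAPS regional rows from country rows. The frames and the `geography` column name are illustrative assumptions that follow the wording of the text; the real files may name their columns differently.

```python
import pandas as pd

# Toy stand-ins for page_data.csv and WPDS_2020_data.csv (illustrative rows only)
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem", "Yos Por"],
    "country": ["Uganda", "Chad", "Cambodia"],
})
population = pd.DataFrame({
    "geography": ["AFRICA", "Algeria", "Egypt", "OCEANIA", "Fiji"],
})

# Keep only real articles: drop page names that start with "Template:"
articles = pages[~pages["page"].str.startswith("Template:")]

# Regional aggregates are written in ALL CAPS; everything else is a country
is_region = population["geography"].str.isupper()
regions = population[is_region]
countries = population[~is_region]

print(articles["page"].tolist())
print(regions["geography"].tolist())
print(countries["geography"].tolist())
```

Keeping `regions` as its own frame, rather than discarding it, leaves the regional rows available for the by-region reporting later on.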


In [21]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


In [29]:
page_data[page_data['country'].str.contains('Hondura')]

Unnamed: 0,page,country,rev_id
22,Template:Honduras-politician-stub,Hondura,394587547
46,Template:Honduras-mayor-stub,Hondura,443469862
1155,Céleo Arias,Hondura,704789339
1239,Juan Francisco de Molina,Hondura,705346284
1240,Felipe Neri Medina,Hondura,705346304
...,...,...,...
45822,Selvin Laínez,Hondura,806796619
45988,Ana Julia García,Hondura,806875293
46089,Francisco Ferrera,Hondura,806948506
47076,Juan Ángel Arias Boquín,Hondura,807445333


In [31]:
# standardizes the country names in page_data so that we can merge with WPDS_data, if necessary
def standardize_countries(country):
    if country == 'Salvadoran':
        return 'El Salvador'
    elif country == 'East Timorese':
        return 'Timor Leste'
    elif country == 'Hondura':
        return 'Honduras'
    elif country == 'Rhodesian':
        return 'Zimbabwe'
    elif country == 'Samoan':
        return 'Samoa'
    elif country == 'São Tomé and Príncipe':
        return "Sao Tome and Principe"
    elif country == 'South African Republic':
        return 'South Africa'
    elif country == 'South Korean':
        return 'South Korea'
    else:
        return country

In [32]:
# cleaning page_data: drop "Template:" pages, which are not articles
# .copy() gives an independent frame, so the assignment below does not raise SettingWithCopyWarning
page_data_clean = page_data[~page_data['page'].str.startswith('Template:')].copy()
page_data_clean['country'] = page_data_clean['country'].apply(standardize_countries)
page_data_clean.head()


Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [23]:
WPDS_data.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
0,WORLD,WORLD,World,2019,7772.85,7772850000
1,AFRICA,AFRICA,Sub-Region,2019,1337.918,1337918000
2,NORTHERN AFRICA,NORTHERN AFRICA,Sub-Region,2019,244.344,244344000
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000


In [25]:
WPDS_data_clean = WPDS_data[WPDS_data['Type'] == 'Country']
WPDS_data_clean.head()

Unnamed: 0,FIPS,Name,Type,TimeFrame,Data (M),Population
3,DZ,Algeria,Country,2019,44.357,44357000
4,EG,Egypt,Country,2019,100.803,100803000
5,LY,Libya,Country,2019,6.891,6891000
6,MA,Morocco,Country,2019,35.952,35952000
7,SD,Sudan,Country,2019,43.849,43849000


In [33]:
page_data_clean.to_csv('data_clean/page_data.csv', sep=',')
WPDS_data_clean.to_csv('data_clean/WPDS_data.csv', sep=',')

## Step 3: Getting Article Quality Predictions
___
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES, which provides estimates of Wikipedia article quality. The name was originally an acronym for "Objective Revision Evaluation Service" but was later shortened to simply "ORES". The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article
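
To make the quality classes concrete, here is a small sketch of pulling a prediction out of a score response and checking whether it counts as high quality. The `sample_response` dictionary is a hand-written, hypothetical fragment shaped like the nested JSON that the parsing code in this notebook reads; the helper names and the error entry are assumptions for illustration, not part of ORES.

```python
# Hypothetical response fragment (hand-written, not real ORES output)
sample_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            # A revision with no score: the "score" key is simply absent
            "516633096": {"wp10": {"error": {"message": "made-up error text"}}},
        }
    }
}

# Quality classes from best to worst, as listed above
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]


def extract_prediction(response, rev_id, project="enwiki", model="wp10"):
    """Return the predicted class for rev_id, or None if no score is present."""
    try:
        return response[project]["scores"][str(rev_id)][model]["score"]["prediction"]
    except KeyError:
        return None


def is_high_quality(prediction):
    """High quality means the two best classes, FA or GA."""
    return prediction in QUALITY_ORDER[:2]


print(extract_prediction(sample_response, 355319463))
print(extract_prediction(sample_response, 516633096))
print(is_high_quality("GA"), is_high_quality("Stub"))
```

Returning `None` for unscored revisions lets the caller decide whether to log, skip, or retry them.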


In [90]:
def pred_page_quality_scores(revids):
    """Query ORES for a batch of revision IDs; return (rev_ids, predictions) for those it scored."""
    ret_scores = []
    ret_revs = []

    ore_endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    rev_ids = list(map(str, revids))
    params = {'project': 'enwiki',
              'model': 'wp10',
              'revids': '|'.join(rev_ids)}

    api_call = requests.get(ore_endpoint.format(**params))
    response = api_call.json()
    for rev_id in rev_ids:
        try:
            ret_scores.append(response["enwiki"]["scores"][rev_id]["wp10"]["score"]["prediction"])
            ret_revs.append(rev_id)
        except KeyError:
            # the response holds no prediction for this revision, so skip it
            print(f"Could not use rev_id={rev_id}")

    return ret_revs, ret_scores

In [92]:
try:
    # reuse cached scores from a previous run, if the file exists
    page_scores_df = pd.read_csv("page_scores.csv")

    # drop the index column that to_csv wrote alongside the data
    if "Unnamed: 0" in page_scores_df.columns:
        page_scores_df.drop(["Unnamed: 0"], axis=1, inplace=True)
except FileNotFoundError:
    revid_lst = []
    scores_lst = []
    batch_size = 50
    start = time.time()
    # request scores in batches of batch_size revision IDs
    for i in range(0, len(page_data_clean), batch_size):
        revid_batch = page_data_clean["rev_id"][i:(i+batch_size)]

        rev_ids, scores = pred_page_quality_scores(revid_batch)
        revid_lst.extend(rev_ids)
        scores_lst.extend(scores)

        if i % 10000 == 0:
            print(f"Iter {i} after {time.time() - start} seconds")
    print(f"Finished after {(time.time()-start) / 60} minutes")

Could not use rev_id=516633096
Could not use rev_id=550682925
Iter 0 after 0.649864912033081 seconds
Could not use rev_id=627547024
Could not use rev_id=636911471
Could not use rev_id=669987106
Could not use rev_id=671484594
Could not use rev_id=680981536
Could not use rev_id=684023803
Could not use rev_id=684023859
Could not use rev_id=696608092
Could not use rev_id=698572327
Could not use rev_id=699260156
Could not use rev_id=703773782
Could not use rev_id=706204833
Could not use rev_id=706810694
Could not use rev_id=708482569
Could not use rev_id=708813010
Could not use rev_id=709508670
Could not use rev_id=710135228
Could not use rev_id=710311600
Could not use rev_id=710715953
Could not use rev_id=711224007
Could not use rev_id=711288191
Could not use rev_id=711513274
Could not use rev_id=712411818
Could not use rev_id=712872338
Could not use rev_id=712872421
Could not use rev_id=712872473
Could not use rev_id=712872531
Could not use rev_id=712873183
Could not use rev_id=712873308


Could not use rev_id=805936041
Could not use rev_id=805970954
Could not use rev_id=805993160
Could not use rev_id=806030859
Could not use rev_id=806179522
Could not use rev_id=806223655
Could not use rev_id=806372496
Could not use rev_id=806403304
Could not use rev_id=806411021
Could not use rev_id=806542084
Could not use rev_id=806695084
Could not use rev_id=806708318
Could not use rev_id=806811023
Could not use rev_id=807000274
Could not use rev_id=807161510
Could not use rev_id=807196681
Could not use rev_id=807336308
Could not use rev_id=807367030
Could not use rev_id=807367166
Could not use rev_id=807479587
Could not use rev_id=807484325
Finished after 6.331365931034088 minutes


In [116]:
page_scores_df = pd.DataFrame({"rev_id": revid_lst, "article_quality_est":scores_lst})
page_scores_df['rev_id'] = page_scores_df['rev_id'].astype(int)
page_scores_df.to_csv("page_scores.csv")
page_scores_df.head()
# print(len(revid_lst), len(scores_lst))

Unnamed: 0,rev_id,article_quality_est
0,355319463,Stub
1,393276188,Stub
2,393822005,Stub
3,395521877,Stub
4,395526568,Stub


## Step 4: Combining the Datasets
___
Some processing of the data will be necessary! In particular, after retrieving and including the ORES data for each article, you'll need to merge the Wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll inevitably run into entries that cannot be merged: either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa.
Please remove any rows that do not have matching data, and output them to a CSV file called:
`wp_wpds_countries-no_match.csv`
Consolidate the remaining data into a single CSV file called:
`wp_wpds_politicians_by_country.csv`
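
One way to separate matching from non-matching rows is the `indicator` option of `DataFrame.merge`, sketched below on toy data. The country names "Atlantis" and "Narnia" and all population figures are made up so that each side has one row without a partner; this is an illustration, not the notebook's own method.

```python
import pandas as pd

# Toy article and population tables (made-up values)
articles = pd.DataFrame({
    "country": ["Chad", "Cambodia", "Atlantis"],
    "page": ["Bir I of Kanem", "Yos Por", "Made-up Politician"],
})
population = pd.DataFrame({
    "Name": ["Chad", "Cambodia", "Narnia"],
    "Population": [1000, 2000, 3000],
})

# An outer merge keeps every row; _merge records which side(s) each row came from
merged = articles.merge(
    population, left_on="country", right_on="Name", how="outer", indicator=True
)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
no_match = merged[merged["_merge"] != "both"].drop(columns="_merge")

print(matched)
print(no_match)
```

The `no_match` frame then holds both kinds of failure: articles whose country has no population row, and population rows with no articles.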


In [121]:
wp_wpds_countries = page_scores_df.merge(page_data_clean, on='rev_id', how='outer')
wp_wpds_countries = wp_wpds_countries.merge(WPDS_data_clean, left_on='country', right_on='Name', how='outer')
# a row with any missing value failed to match in one of the merges
has_match = wp_wpds_countries.notna().all(axis=1)
wp_wpds_countries_no_match = wp_wpds_countries[~has_match]
wp_wpds_countries_no_match.to_csv('data_clean/wp_wpds_countries-no_match.csv', sep=',')

In [124]:
# wp_wpds_politicians_by_country: keep only the fully matched rows
column_map = {'rev_id':'revision_id', 'page':'article_name', 'Population':'population'}
# rename returns a new frame, so no SettingWithCopyWarning is raised
wp_wpds_matched = wp_wpds_countries[has_match].rename(columns=column_map)
wp_wpds_politicians_by_country = wp_wpds_matched.loc[:, ['country',
                                                         'article_name',
                                                         'revision_id',
                                                         'article_quality_est',
                                                         'population']]
wp_wpds_politicians_by_country.to_csv('data_clean/wp_wpds_politicians_by_country.csv')


## Step 5: Analysis
___
Your analysis will consist of calculating the proportion (as a percentage) of articles-per-population and high-quality articles for each country AND for each geographic region. By "high quality" articles, in this case we mean the number of articles about politicians in a given country that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.

Examples:

- if a country has a population of 10,000 people, and you found 10 articles about politicians from that country, then the percentage of articles-per-population would be 0.1%.
- if a country has 10 articles about politicians, and 2 of them are FA or GA class articles, then the percentage of high-quality articles would be 20%.
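
The two worked examples above can be checked with a few lines of arithmetic. The function names here are hypothetical helpers introduced only for this sketch.

```python
def articles_per_population_pct(article_count, population):
    """Articles as a percentage of the country's population."""
    return 100 * article_count / population


def high_quality_pct(hq_article_count, article_count):
    """FA/GA articles as a percentage of all politician articles."""
    return 100 * hq_article_count / article_count


# First example: 10 articles for a population of 10,000
print(articles_per_population_pct(10, 10_000))  # 0.1

# Second example: 2 high-quality articles out of 10
print(high_quality_pct(2, 10))  # 20.0
```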


In [158]:
country_article_cnts = wp_wpds_politicians_by_country.loc[:, ['country', 'article_name']].groupby('country').count().reset_index().rename(columns={'article_name':'article_count'})
country_article_cnts.head()

Unnamed: 0,country,article_count
0,Afghanistan,319
1,Albania,456
2,Algeria,116
3,Andorra,34
4,Angola,106


In [159]:
# high-quality articles are those predicted to be FA or GA
hq_articles = wp_wpds_politicians_by_country[wp_wpds_politicians_by_country['article_quality_est'].isin(['FA', 'GA'])]

hq_article_counts = hq_articles.loc[:, ['country', 'article_quality_est']].groupby('country').count().reset_index().rename(columns={'article_quality_est':'hq_article_count'})
hq_article_counts.head()

Unnamed: 0,country,hq_article_count
0,Afghanistan,13
1,Albania,3
2,Algeria,2
3,Argentina,16
4,Armenia,5


In [174]:
article_pop_df = country_article_cnts.merge(hq_article_counts, on=['country'])
article_pop_df = article_pop_df.merge(wp_wpds_politicians_by_country.loc[:, ['country', 'population']].groupby('country').mean().reset_index(), how='left', on='country')
article_pop_df.head()


Unnamed: 0,country,article_count,hq_article_count,population
0,Afghanistan,319,13,38928000.0
1,Albania,456,3,2838000.0
2,Algeria,116,2,44357000.0
3,Argentina,491,16,45377000.0
4,Armenia,193,5,2956000.0


In [176]:
article_pop_df['article_per_pop'] = article_pop_df['article_count'] / article_pop_df['population']
article_pop_df['hq_per_article'] = article_pop_df['hq_article_count'] / article_pop_df['article_count']
article_pop_df.head()

Unnamed: 0,country,article_count,hq_article_count,population,article_per_pop,hq_per_article
0,Afghanistan,319,13,38928000.0,8e-06,0.040752
1,Albania,456,3,2838000.0,0.000161,0.006579
2,Algeria,116,2,44357000.0,3e-06,0.017241
3,Argentina,491,16,45377000.0,1.1e-05,0.032587
4,Armenia,193,5,2956000.0,6.5e-05,0.025907


Here we also map each country to its region.
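
As an alternative sketch of this mapping, the region label can be carried down to the country rows with a forward fill instead of an explicit loop. The toy frame below reuses the first few rows shown in the `WPDS_data.head()` output and adds two more rows for illustration. Like the loop, it treats every row whose `Type` is not "Country" as a region label.

```python
import pandas as pd

# Toy version of the WPDS table: aggregate rows precede the countries they contain
wpds = pd.DataFrame({
    "Name": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt",
             "WESTERN AFRICA", "Benin"],
    "Type": ["World", "Sub-Region", "Sub-Region", "Country", "Country",
             "Sub-Region", "Country"],
})

# Keep the name on aggregate rows, leave country rows empty, then forward-fill
wpds["region"] = wpds["Name"].where(wpds["Type"] != "Country").ffill()

# Only the country rows are needed for the country-to-region lookup
region_mapping = wpds.loc[wpds["Type"] == "Country", ["Name", "region"]]
print(region_mapping)
```

Because the fill takes the nearest preceding aggregate row, each country is assigned the most specific region listed above it.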

In [213]:
regions = []
for i in range(len(WPDS_data)):
    curr = WPDS_data.iloc[i, :]
    # any non-"Country" row is treated as a region label;
    # country rows inherit the most recent label above them
    if curr["Type"] != "Country":
        regions.append(curr["Name"])
    else:
        regions.append(regions[-1])
WPDS_data["region"] = regions
region_mapping = WPDS_data[["Name", "region"]]

# #regional 
region_pop_df = article_pop_df.merge(region_mapping, how="left", left_on="country", right_on="Name").drop(columns="Name")
region_pop_df = region_pop_df.groupby("region")[["article_count","population"]].sum().reset_index()
region_pop_df["regional_prop"] =  region_pop_df["article_count"] / region_pop_df["population"] * 100
region_pop_df

Unnamed: 0,region,article_count,population,regional_prop
0,CARIBBEAN,552,37790000.0,0.001461
1,CENTRAL AMERICA,1496,163218000.0,0.000917
2,CENTRAL ASIA,135,50197000.0,0.000269
3,Channel Islands,3046,98820000.0,0.003082
4,EAST ASIA,2472,1632883000.0,0.000151
5,EASTERN AFRICA,2369,388773000.0,0.000609
6,EASTERN EUROPE,3311,277651000.0,0.001193
7,MIDDLE AFRICA,538,57457000.0,0.000936
8,NORTHERN AFRICA,761,231852000.0,0.000328
9,NORTHERN AMERICA,1901,368068000.0,0.000516


In [227]:
region_quality_df = article_pop_df.merge(region_mapping, left_on='country', right_on='Name', how='left')
region_quality_df = region_quality_df.groupby("region")[['hq_article_count', 'article_count']].sum().reset_index()
region_quality_df["hq_per_article"] = region_quality_df["hq_article_count"] / region_quality_df["article_count"]
region_quality_df


Unnamed: 0,region,hq_article_count,article_count,hq_per_article
0,CARIBBEAN,13,552,0.023551
1,CENTRAL AMERICA,25,1496,0.016711
2,CENTRAL ASIA,7,135,0.051852
3,Channel Islands,102,3046,0.033487
4,EAST ASIA,75,2472,0.03034
5,EASTERN AFRICA,45,2369,0.018995
6,EASTERN EUROPE,117,3311,0.035337
7,MIDDLE AFRICA,16,538,0.02974
8,NORTHERN AFRICA,19,761,0.024967
9,NORTHERN AMERICA,104,1901,0.054708


## Step 6: Results
___
Your results from this analysis will be published in the form of data tables. You are asked to produce six tables that show:
1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality
5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population
6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

Embed these tables in your Jupyter notebook. You do not need to graph or otherwise visualize the data for this assignment, although you are welcome to do so in addition to generating the data tables described above, if you wish.
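
A compact way to build top-N and bottom-N tables like these is `nlargest` and `nsmallest`, sketched here on a toy frame. The country labels and values are placeholders, and N is 2 rather than 10 to keep the output short.

```python
import pandas as pd

# Placeholder data: four countries with made-up coverage values
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "article_per_pop": [0.004, 0.001, 0.003, 0.002],
})

# Highest-ranked rows first
top = toy.nlargest(2, "article_per_pop")

# Lowest-ranked rows first
bottom = toy.nsmallest(2, "article_per_pop")

print(top)
print(bottom)
```

Both calls return the rows already sorted, so no separate `sort_values` step is needed.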


### 1. Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [196]:
article_pop_df.sort_values('article_per_pop', ascending=False).loc[:, ['country', 'article_count', 'population', 'article_per_pop']].head(10)

Unnamed: 0,country,article_count,population,article_per_pop
133,Tuvalu,54,10000.0,0.0054
52,Iceland,201,368000.0,0.000546
75,Luxembourg,178,632000.0,0.000282
40,Fiji,197,896000.0,0.00022
141,Vanuatu,58,321000.0,0.000181
33,Dominica,12,72000.0,0.000167
1,Albania,456,2838000.0,0.000161
91,New Zealand,783,4987000.0,0.000157
79,Maldives,83,541000.0,0.000153
95,Norway,656,5387000.0,0.000122


### 2. Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [195]:
article_pop_df.sort_values('article_per_pop').loc[:, ['country', 'article_count', 'population', 'article_per_pop']].head(10)

Unnamed: 0,country,article_count,population,article_per_pop
53,India,968,1400100000.0,6.913792e-07
54,Indonesia,209,271739000.0,7.691204e-07
26,China,1129,1402385000.0,8.050571e-07
140,Uzbekistan,28,34174000.0,8.193363e-07
39,Ethiopia,101,114916000.0,8.789029e-07
64,"Korea, North",36,25779000.0,1.396486e-06
129,Thailand,112,66534000.0,1.68335e-06
8,Bangladesh,317,169809000.0,1.866803e-06
143,Vietnam,187,96209000.0,1.943685e-06
121,Sudan,95,43849000.0,2.166526e-06


### 3. Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [186]:
article_pop_df.sort_values('hq_per_article', ascending=False).loc[:, ['country', 'hq_per_article']].head(10)

Unnamed: 0,country,hq_per_article
64,"Korea, North",0.222222
109,Saudi Arabia,0.128205
106,Romania,0.122449
23,Central African Republic,0.121212
140,Uzbekistan,0.107143
82,Mauritania,0.104167
47,Guatemala,0.084337
33,Dominica,0.083333
125,Syria,0.078125
11,Benin,0.076923


### 4. Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

In [187]:
article_pop_df.sort_values('hq_per_article').head(10).loc[:, ['country', 'hq_per_article']]

Unnamed: 0,country,hq_per_article
10,Belgium,0.001927
128,Tanzania,0.002475
124,Switzerland,0.002488
89,Nepal,0.002809
101,Peru,0.002857
94,Nigeria,0.002959
27,Colombia,0.003509
74,Lithuania,0.004098
87,Morocco,0.004854
40,Fiji,0.005076


### 5. Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

In [218]:
region_pop_df.sort_values('regional_prop', ascending=False)

Unnamed: 0,region,article_count,population,regional_prop
10,OCEANIA,2811,40918000.0,0.00687
3,Channel Islands,3046,98820000.0,0.003082
15,SOUTHERN EUROPE,3494,150498000.0,0.002322
18,WESTERN EUROPE,4492,195402000.0,0.002299
0,CARIBBEAN,552,37790000.0,0.001461
6,EASTERN EUROPE,3311,277651000.0,0.001193
7,MIDDLE AFRICA,538,57457000.0,0.000936
17,WESTERN ASIA,2520,271034000.0,0.00093
1,CENTRAL AMERICA,1496,163218000.0,0.000917
14,SOUTHERN AFRICA,458,61945000.0,0.000739


### 6. Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

In [229]:
region_quality_df.sort_values('hq_per_article', ascending=False)

Unnamed: 0,region,hq_article_count,article_count,hq_per_article
9,NORTHERN AMERICA,104,1901,0.054708
2,CENTRAL ASIA,7,135,0.051852
13,SOUTHEAST ASIA,73,2020,0.036139
6,EASTERN EUROPE,117,3311,0.035337
17,WESTERN ASIA,89,2520,0.035317
3,Channel Islands,102,3046,0.033487
4,EAST ASIA,75,2472,0.03034
7,MIDDLE AFRICA,16,538,0.02974
8,NORTHERN AFRICA,19,761,0.024967
0,CARIBBEAN,13,552,0.023551
