# Exploring the concept of bias in data through Wikipedia articles

## Overview

In this work I explore the concept of bias in data through Wikipedia articles. For this purpose, I use a publicly available [dataset of Wikipedia articles about politicians from different countries](https://figshare.com/articles/Untitled_Item/5513449) and a machine learning web service called [ORES](https://www.mediawiki.org/wiki/ORES) to estimate the quality of each of these articles. I then combine this data with another publicly available dataset of country populations (as of mid-2015) from the [Population Reference Bureau website](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14). From the combined dataset, I produce the following four tables:

1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

## Notebook flow

The rest of this notebook is organized as a step-by-step walkthrough of the activities performed to reach the desired result:

1. Loading Prerequisite Libraries and declaring variables
2. Data Acquisition
3. Data Processing
4. Data Analysis
5. Data Visualization

Readers who want to reproduce the work can follow along and rerun most of the steps in this notebook.

## Loading Prerequisite Libraries and declaring commonly used values as variables

In this section, I load a few modules from publicly available Python libraries that provide helper methods used throughout the rest of this notebook. The purpose of each library is described briefly in comments inline with the code.

I also define upfront all the variables and settings used in this notebook. This keeps them together for easy access and gives readers who are running the code a chance to modify these values (for example, file names) without affecting the rest of the code.

In [1]:
#json library has some good helper methods for working with JSON objects.
#Since our raw data is in json format, we need the ability to deserialize json data into python objects for consumption.
import json
#Periodically we need a way to check intermediate results. We do that by displaying the values of the variables.
#This is done using the display function from the IPython.display module.
from IPython.display import display
#pandas is another super useful python library that has many valuable data storage and manipulation functions.
import pandas as pd
#numpy, for working with missing values (NaN)
import numpy as np
#requests module is used to retrieve the data from the REST API endpoints
import requests
#sys module is used when printing exception info
import sys

In [2]:
#Directory where the raw files exist
raw_data_dir = './data/raw/'
#Directory where the processed file will be saved to (at the end of this notebook if all steps are successful)
processed_data_dir = './data/processed/'

#Variables to hold the file names of the raw data files
page_data_file = 'page_data.csv'
page_data_with_scores_file = 'page_data_scores.csv'
population_mid_2015_file = 'Population Mid-2015.csv'

#Variables to hold the file names to contain the processed data at the end of successful execution of the steps in this notebook.
page_data_with_scores_and_population_file = 'page_data_with_scores_and_population.csv'

#header values that are required to be passed to the API. 
#NOTE: You are strongly advised to modify these values to point to your github url and account if you plan on running this code
headers = {'User-Agent' : 'https://github.com/sumanbhagavathula', 'From' : 'sumanbh@uw.edu'}
project = 'enwiki'
model = 'wp10'

#REST API endpoint for ORES
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revid}'

#batch size of revid list to speed up retrieving the scores using ORES endpoint
batch_size=50

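#show up to 20 decimal places in displayed results so that very small ratios keep their digits
#(the fully qualified option name avoids ambiguity with other 'precision' options in newer pandas versions)
pd.set_option('display.precision',20)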

## Data Acquisition Steps

In this section, I acquire the three datasets required for this analysis: two raw datasets, plus the ORES scores for the last revision id of each article. Their provenance is described in the overview section.

I downloaded the two raw datasets from their respective websites and included the Wikipedia articles dataset in this repository. The country population dataset has access restrictions, so I have not included it. If you would like to follow along, you will need to download that file yourself from the source website and save it to the `./data/raw/` directory to be able to run most of the steps in this notebook.

Please also note the format of the calls to the ORES machine learning API, which provides an estimate of article quality. We can either pass one revision id per API call or pass a batch of them separated by the pipe character (`|`). The latter speeds up retrieval and is what this work uses. There is a limit to how many ids can be sent at a time. While I do not know the exact limit, I have used a batch size of 50, as declared in the variable section. You may change that value to experiment with other batch sizes.

You may also skip the ORES API calls entirely and proceed to the Data Processing section, which can load the offline copy of the scores made available in this repository, to avoid waiting while the scores are retrieved again.
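The batching idea can be sketched without touching the network. This is a minimal illustration rather than the loop used later in this notebook: the helper name `batch_revids` is hypothetical, the endpoint template is copied from the variable section, and the five revision ids are the first rows of the page data shown below.

```python
# Endpoint template as declared in the variable section of this notebook.
ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revid}'

def batch_revids(rev_ids, batch_size):
    """Yield pipe-delimited strings holding at most batch_size revision ids each."""
    ids = [str(r) for r in rev_ids]
    for start in range(0, len(ids), batch_size):
        yield '|'.join(ids[start:start + batch_size])

sample_ids = [235107991, 355319463, 391862046, 391862070, 391862409]

# A batch size of 2 is used here only to keep the printed URLs short.
batches = list(batch_revids(sample_ids, batch_size=2))
urls = [ENDPOINT.format(project='enwiki', model='wp10', revid=b) for b in batches]
for url in urls:
    print(url)
```

With five ids and a batch size of 2, three URLs are built and the last one carries the single leftover id.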

### importing the two datasets

In [3]:
#import the datasets
page_data = pd.read_csv(raw_data_dir+page_data_file)
display(page_data.head())

#NOTE: Population Mid-2015 is copyrighted by PRB and is not included in this repository. TBD: include a link.
population_mid_2015 = pd.read_csv(raw_data_dir+population_mid_2015_file)
display(population_mid_2015.head())

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


### wrap the ORES score retrieval in a reusable function and test it with some sample values

In [4]:
#function to call the ORES endpoint and retrieve the predicted quality scores
#for a pipe-delimited string of revision ids
def predicted_ores_score(pipedrevids):
    params = {'project' : project,
              'model' : model,
              'revid' : pipedrevids
              }

    try:
        #pass the headers declared above so that the API can identify the caller
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()[project]['scores']
        return response
    except Exception:
        #network failures, invalid JSON and unexpected response shapes all end up here
        print("Unexpected error:", sys.exc_info()[0])
        return None

#for testing purposes: sample call, can comment out after testing
output = (predicted_ores_score('798538579|798539797|798541884|798544723|798548287|798550386|798552371|798552999|798553325|798553329|798553416|798555546|798555786|798555984|798556097|798556283|798556613|798558498|798558692|798560330|798560381|798561197|798561595|798561903|798564951|798565577|798565999|798566821|798567091|798567112|798567496|798569014|798569398|798570513|798571014|798574254|798574475|798575514|798576875|798577626|798578057|798578265|798579045|798579775|798580067|798582884|798584054|798584996|798585322|798588458'))

display(output)
for key in output:
    print('revision_id:' + str(key) + ', score:' + output[key][model]['score']['prediction'])

{'798538579': {'wp10': {'score': {'prediction': 'Start',
    'probability': {'B': 0.018838974095864495,
     'C': 0.07027727876713798,
     'FA': 0.002067241949280503,
     'GA': 0.005101627561409373,
     'Start': 0.5254816646953945,
     'Stub': 0.37823321293091317}}}},
 '798539797': {'wp10': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.007427766042343175,
     'C': 0.008317146097822468,
     'FA': 0.0009571476724460801,
     'GA': 0.0021312165180934076,
     'Start': 0.06936370603079242,
     'Stub': 0.9118030176385021}}}},
 '798541884': {'wp10': {'score': {'prediction': 'Stub',
    'probability': {'B': 0.0074285519814473655,
     'C': 0.009013143929150482,
     'FA': 0.0011238728410746703,
     'GA': 0.0024012967541774052,
     'Start': 0.06662812876013935,
     'Stub': 0.9134050057340106}}}},
 '798544723': {'wp10': {'score': {'prediction': 'Start',
    'probability': {'B': 0.0372169790785653,
     'C': 0.16466738033540718,
     'FA': 0.002920783867841236,
     'GA':

revision_id:798538579, score:Start
revision_id:798539797, score:Stub
revision_id:798541884, score:Stub
revision_id:798544723, score:Start
revision_id:798548287, score:C
revision_id:798550386, score:Start
revision_id:798552371, score:C
revision_id:798552999, score:Start
revision_id:798553325, score:Start
revision_id:798553329, score:C
revision_id:798553416, score:B
revision_id:798555546, score:Stub
revision_id:798555786, score:Start
revision_id:798555984, score:C
revision_id:798556097, score:Stub
revision_id:798556283, score:Stub
revision_id:798556613, score:C
revision_id:798558498, score:Stub
revision_id:798558692, score:Stub
revision_id:798560330, score:C
revision_id:798560381, score:Stub
revision_id:798561197, score:C
revision_id:798561595, score:Stub
revision_id:798561903, score:Start
revision_id:798564951, score:Stub
revision_id:798565577, score:Start
revision_id:798565999, score:Stub
revision_id:798566821, score:Start
revision_id:798567091, score:Start
revision_id:798567112, score

### wrap the steps that collect ORES scores from a batch response into a reusable function

In [5]:
def save_ores_scores(edits_ores_scores_batch_json, edits_ores_scores):
    #a failed API call returns None; skip that batch instead of raising
    if edits_ores_scores_batch_json is None:
        return edits_ores_scores
    for key in edits_ores_scores_batch_json:
        result = edits_ores_scores_batch_json[key][model]
        if 'error' in result:
            #for example RevisionNotFound: report the revision and leave it out of the scores
            print(result)
        else:
            edits_ores_scores.append({'revision_id':key,'score':result['score']['prediction']})
    return edits_ores_scores
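To make the two response shapes concrete, here is a small offline sketch. The `sample_response` dictionary is hand-built from output shown in this notebook (one successful score, with the probabilities omitted, and one `RevisionNotFound` error), and `split_scores` is a hypothetical helper rather than part of the notebook.

```python
def split_scores(batch_json, model='wp10'):
    """Separate successfully scored revisions from those ORES reported as errors."""
    scored, failed = [], []
    for rev_id, models in batch_json.items():
        result = models[model]
        if 'error' in result:
            failed.append({'revision_id': rev_id, 'error_type': result['error']['type']})
        else:
            scored.append({'revision_id': rev_id, 'score': result['score']['prediction']})
    return scored, failed

# Hand-built sample mirroring the structures printed in this notebook.
sample_response = {
    '798538579': {'wp10': {'score': {'prediction': 'Start'}}},
    '806811023': {'wp10': {'error': {
        'message': 'RevisionNotFound: Could not find revision ({revision}:806811023)',
        'type': 'RevisionNotFound'}}},
}

scored, failed = split_scores(sample_response)
print(scored)
print(failed)
```

Keeping the failed revisions in their own list, instead of only printing them, makes it easier to report later how many articles were left out of the analysis.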

### retrieve ORES Scores for article revisions in the wikipedia articles dataset

In [6]:
#if you wish to use the already downloaded dataset and not rerun the ORES endpoint to save time, please skip this cell
#and the save cell after it, then load the offline dataset in the optional first step of the Data Processing section.
#in batches of batch_size, call ORES endpoint and retrieve the scores
rev_ids = page_data['rev_id'].astype(str).tolist()
edits_ores_scores_list = []
for start in range(0, len(rev_ids), batch_size):
    #join up to batch_size revision ids with '|' as expected by the revids parameter
    pipedrevids = '|'.join(rev_ids[start:start+batch_size])
    edits_ores_scores_batch_json = predicted_ores_score(pipedrevids)
    edits_ores_scores_list = save_ores_scores(edits_ores_scores_batch_json,edits_ores_scores_list)

    #for testing purposes: break condition to speed up testing, can comment out after testing
    #if start >= 2400:
        #break

{'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:806811023)', 'type': 'RevisionNotFound'}}
{'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367030)', 'type': 'RevisionNotFound'}}
{'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367166)', 'type': 'RevisionNotFound'}}
{'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807484325)', 'type': 'RevisionNotFound'}}


### save the ORES scores to a separate file for offline use, to speed up reproduction of this work or for when the ORES endpoint is unavailable

In [7]:
#convert the scores list to pandas dataframe for easier processing operations
edits_ores_scores = pd.DataFrame(edits_ores_scores_list)
        
#save the dataset with the ORES scores
edits_ores_scores.to_csv(raw_data_dir+page_data_with_scores_file, index=False)

#display the first few rows to get an idea of what the dataset looks like.
display(edits_ores_scores.head())

Unnamed: 0,revision_id,score
0,235107991,Stub
1,355319463,Stub
2,391862046,Stub
3,391862070,Stub
4,391862409,Stub


## Data Processing Steps

In this section, I clean the data (renaming some columns, dropping others and filtering as needed) and transform it (changing column data types and joining datasets). At the end of this section, we will have a final data structure that can be used for the analysis. We start with an optional step that loads the ORES scores dataset saved in the previous step. This is to facilitate readers who are following along but prefer to skip the time-consuming retrieval of ORES scores, or for when the ORES scoring API is unavailable.

### load the ORES scores dataset

In [8]:
#As mentioned above, this is an optional step. If you are following along and have run the previous step
#to retrieve the scores using ORES api and assuming that was successful, you can skip this step
#otherwise, run this to load the page scores dataset.
edits_ores_scores = pd.read_csv(raw_data_dir+page_data_with_scores_file)

#display the first few rows to make sure the dataset is loaded.
display(edits_ores_scores.head())

Unnamed: 0,revision_id,score
0,235107991,Stub
1,355319463,Stub
2,391862046,Stub
3,391862070,Stub
4,391862409,Stub


### rename rev_id column in the page data dataset to revision_id to facilitate merge operation in next step

In [9]:
#In the page data dataset, rename rev_id column to revision_id
#so as to have a common name with the edit scores dataset and be able to join in next steps
page_data = page_data.rename(columns={"rev_id":"revision_id"})

#display the first few rows to see the renamed column in the dataframe
display(page_data.head())

Unnamed: 0,page,country,revision_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### merge page data and ORES scores datasets to get a combined dataset that has ORES scores for wikipedia articles, where available

In [10]:
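#the ORES response uses string keys, so cast revision_id to integer to match the page data
#(this changes nothing when the scores were loaded from the offline csv file)
edits_ores_scores['revision_id'] = edits_ores_scores['revision_id'].astype('int64')

#combine the page data and edit scores datasets
page_data_with_scores = page_data.merge(edits_ores_scores,how='inner',on=['revision_id'])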

#display the first few rows of the merged dataset
display(page_data_with_scores.head())

Unnamed: 0,page,country,revision_id,score
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub
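The inner merge above keeps only pages for which a score was retrieved, so revisions that ORES reported as `RevisionNotFound` drop out silently. One way to list them is a left merge with `indicator=True`. The two-row frames below are illustrative: the first row and the unmatched revision id come from this notebook's output, while the page name `Hypothetical unscored page` is made up.

```python
import pandas as pd

pages = pd.DataFrame({'page': ['Bir I of Kanem', 'Hypothetical unscored page'],
                      'revision_id': [355319463, 806811023]})
scores = pd.DataFrame({'revision_id': [355319463], 'score': ['Stub']})

# indicator=True adds a _merge column that says where each row was found.
checked = pages.merge(scores, how='left', on='revision_id', indicator=True)

# Rows marked 'left_only' exist in the page data but have no score.
unscored = checked.loc[checked['_merge'] == 'left_only', ['page', 'revision_id']]
print(unscored)
```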


### rename the Location column in the population dataset to facilitate the join with the page data and scores dataset in the next step. Also convert the Data column from string to int, since population is a number and not a string

In [11]:
#In the population dataset, rename Location column to country
#so as to have a common name with the page data and scores dataset and be able to join in next steps
population_mid_2015 = population_mid_2015.rename(columns={"Location":"country"})

#remove the thousands separators and convert the population values from string to number
population_mid_2015['Data'] = pd.to_numeric(population_mid_2015['Data'].str.replace(',',''))

#display the first few rows to see the renamed column in the dataframe
display(population_mid_2015.head())

Unnamed: 0,country,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


### merge the page scores and population data to get the final data together into one dataset and for further processing

In [12]:
#combine the page data and scores dataset with the population dataset on the common country column
page_data_with_scores_and_population = page_data_with_scores.merge(population_mid_2015,how='inner',on='country')

#display the first few rows of the merged dataset
display(page_data_with_scores_and_population.head())

Unnamed: 0,page,country,revision_id,score,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub,Country,Mid-2015,Number,15473900,
1,Gladys Lundwe,Zambia,757566606,Stub,Country,Mid-2015,Number,15473900,
2,Mwamba Luchembe,Zambia,764848643,Stub,Country,Mid-2015,Number,15473900,
3,Thandiwe Banda,Zambia,768166426,Start,Country,Mid-2015,Number,15473900,
4,Sylvester Chisembele,Zambia,776082926,C,Country,Mid-2015,Number,15473900,
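Because this merge matches on the exact `country` string, any spelling difference between the two sources drops rows without warning. The sketch below shows one way to list names that appear in only one dataset. The country sets and the mismatch in them are invented for illustration and are not findings from the actual data.

```python
# Hypothetical country names, chosen only to illustrate a spelling mismatch.
article_countries = {'Zambia', 'Chad', 'Ivory Coast'}
population_countries = {'Zambia', 'Chad', "Cote d'Ivoire"}

# Set differences give the names without a counterpart on the other side.
only_in_articles = sorted(article_countries - population_countries)
only_in_population = sorted(population_countries - article_countries)

print('No population match:', only_in_articles)
print('No article match:', only_in_population)
```

In this notebook the two sets could be built from `page_data_with_scores['country'].unique()` and `population_mid_2015['country'].unique()` before the merge is run.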


### retain only the relevant columns and rename fields as needed to get the final dataset ready for analysis use

In [13]:
#keep only the necessary columns in the dataset
page_data_with_scores_and_population = page_data_with_scores_and_population.loc[:,['country','page','revision_id','score','Data']]

#Rename the columns to form the final dataset
page_data_with_scores_and_population = page_data_with_scores_and_population.rename(columns={'page':'article_name','score':'article_quality','Data':'population'})

#display the first few rows of the final dataset
page_data_with_scores_and_population.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Zambia,Template:ZambiaProvincialMinisters,235107991,Stub,15473900
1,Zambia,Gladys Lundwe,757566606,Stub,15473900
2,Zambia,Mwamba Luchembe,764848643,Stub,15473900
3,Zambia,Thandiwe Banda,768166426,Start,15473900
4,Zambia,Sylvester Chisembele,776082926,C,15473900


### save the final dataset for offline analysis and usage

In [14]:
#save this dataset
page_data_with_scores_and_population.to_csv(processed_data_dir+page_data_with_scores_and_population_file,index=False)

## Data Analysis Steps

This section consists of the steps and code required to perform the relevant aggregations and joins that are needed to perform the analysis that was the focus of this notebook. These summary views will then be required in the next section to come up with our final tabular format visualizations.

### summarize the politician article count and the mid-2015 population for every country with at least one article. Countries that appear only in the PRB dataset are excluded

In [15]:
#count the articles (one revision id per article) for each country
country_revisionid=page_data_with_scores_and_population.loc[:,['country','revision_id']]

country_revisionid_count = country_revisionid.groupby(by='country',as_index=False).count()

country_allarticlescount = country_revisionid_count.rename(columns={'revision_id':'all_articles_count'})

#population repeats on every row of a country, so max() simply picks that single value
country_population_raw = page_data_with_scores_and_population.loc[:,['country','population']]

country_population = country_population_raw.groupby(by='country',as_index=False).max()

#join the article counts with the population on the common country column
country_population_allarticlescount = country_allarticlescount.merge(country_population)

country_population_allarticlescount.head()

Unnamed: 0,country,all_articles_count,population
0,Afghanistan,327,32247000
1,Albania,460,2892000
2,Algeria,119,39948000
3,Andorra,34,78000
4,Angola,110,25000000
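The same per-country summary can also be computed in a single pass with pandas named aggregation. This is an alternative sketch, not the approach used in the cell above. The Zambia rows and both population values come from this notebook's output; the Andorra revision id is a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Zambia', 'Zambia', 'Zambia', 'Andorra'],
    'revision_id': [757566606, 764848643, 768166426, 1],  # last id is a placeholder
    'population': [15473900, 15473900, 15473900, 78000],
})

# One groupby produces both the article count and the population per country.
summary = (df.groupby('country', as_index=False)
             .agg(all_articles_count=('revision_id', 'count'),
                  population=('population', 'max')))
print(summary)
```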


### summarize the high quality (GA and FA) politician article count for every country with at least one article. Countries that appear only in the PRB dataset are excluded

In [16]:
country_articlequality_revisionid=page_data_with_scores_and_population.loc[:,['country','article_quality','revision_id']]

#keep only the articles that ORES predicts to be GA (good article) or FA (featured article)
country_highqualityarticles_revisionid = (country_articlequality_revisionid
                                          [country_articlequality_revisionid['article_quality'].isin(['GA','FA'])])

country_revisionid_filtered = country_highqualityarticles_revisionid.loc[:,['country','revision_id']]

#count the high quality articles for each country
country_highqualityarticlecount_raw = country_revisionid_filtered.groupby(by='country',as_index=False).count()

country_highqualityarticlecount = (country_highqualityarticlecount_raw
                                   .rename(columns={'revision_id':'highquality_articles_count'}))

#left merge so that countries without any high quality article are kept (with NaN as the count)
country_population_allandhighqualityarticlecount = (country_population_allarticlescount.merge
                                                    (country_highqualityarticlecount,how='left',on='country'))

country_population_allandhighqualityarticlecount.head()

Unnamed: 0,country,all_articles_count,population,highquality_articles_count
0,Afghanistan,327,32247000,15.0
1,Albania,460,2892000,5.0
2,Algeria,119,39948000,2.0
3,Andorra,34,78000,
4,Angola,110,25000000,1.0


### since some countries may not have even a single high quality article, we replace NaN with zero for such countries. Also, change the data type of the high quality articles count to int, since counts can only be whole numbers

In [17]:
#replace NaN in highquality_articles_count with zeros and convert the counts to int
#(assigning the result back avoids relying on an in-place update of a selected column)
country_population_allandhighqualityarticlecount['highquality_articles_count'] = (
    country_population_allandhighqualityarticlecount['highquality_articles_count'].fillna(0).astype(int))
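An alternative to a left merge followed by a NaN replacement is to flag high quality rows with a boolean and sum that flag per country, so countries without any GA or FA article come out as 0 directly. The sketch below is not the notebook's method, and its four rows are made up for illustration.

```python
import pandas as pd

# Invented rows: Andorra has no GA or FA article.
articles = pd.DataFrame({
    'country': ['Zambia', 'Zambia', 'Andorra', 'Romania'],
    'article_quality': ['Stub', 'GA', 'Start', 'FA'],
})

# True for GA/FA rows, False otherwise; summing booleans counts the True values.
articles['is_high_quality'] = articles['article_quality'].isin(['GA', 'FA'])
counts = (articles.groupby('country')['is_high_quality']
                  .sum()
                  .astype(int)
                  .reset_index(name='highquality_articles_count'))
print(counts)
```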

### now we calculate the proportion of articles in each category

We define articles per population as the number of politician articles divided by the total population of that country, multiplied by 100 so that it reads as a percentage.

In [18]:
country_population_allandhighqualityarticlecount['articles_per_population'] = (
    country_population_allandhighqualityarticlecount['all_articles_count']*100.0
    /country_population_allandhighqualityarticlecount['population'])

country_population_allandhighqualityarticlecount.head()

Unnamed: 0,country,all_articles_count,population,highquality_articles_count,articles_per_population
0,Afghanistan,327,32247000,15,0.0010140478184017
1,Albania,460,2892000,5,0.0159059474412171
2,Algeria,119,39948000,2,0.0002978872534294
3,Andorra,34,78000,0,0.0435897435897435
4,Angola,110,25000000,1,0.00044


We then define the high quality articles percentage as the number of GA and FA articles divided by the count of all politician articles for that country, again multiplied by 100.

In [19]:
country_population_allandhighqualityarticlecount['highqualityarticles_percentage'] = (
    country_population_allandhighqualityarticlecount['highquality_articles_count']*100.0
    /country_population_allandhighqualityarticlecount['all_articles_count'])

country_population_allandhighqualityarticlecount.head()

Unnamed: 0,country,all_articles_count,population,highquality_articles_count,articles_per_population,highqualityarticles_percentage
0,Afghanistan,327,32247000,15,0.0010140478184017,4.587155963302752
1,Albania,460,2892000,5,0.0159059474412171,1.0869565217391304
2,Algeria,119,39948000,2,0.0002978872534294,1.680672268907563
3,Andorra,34,78000,0,0.0435897435897435,0.0
4,Angola,110,25000000,1,0.00044,0.909090909090909
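As a quick check of both calculations, the Afghanistan row can be reproduced by hand. The input numbers are copied from the tables above; both results carry the factor of 100 used in the cells, so they are percentages.

```python
# Values copied from the Afghanistan row of the tables above.
all_articles_count = 327
highquality_articles_count = 15
population = 32247000

# Articles as a percentage of the population.
articles_per_population = all_articles_count * 100.0 / population

# GA and FA articles as a percentage of all politician articles.
highqualityarticles_percentage = highquality_articles_count * 100.0 / all_articles_count

print(articles_per_population)
print(highqualityarticles_percentage)
```

The printed values should agree with the `articles_per_population` and `highqualityarticles_percentage` columns of that row, up to floating point rounding.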


### retain only the relevant columns that are needed for the next section on Visualization

In [20]:
country_all_and_highquality_articles_per_population = (
country_population_allandhighqualityarticlecount.loc[:,('country','articles_per_population'
                                                      ,'highqualityarticles_percentage')])

country_all_and_highquality_articles_per_population.head()

Unnamed: 0,country,articles_per_population,highqualityarticles_percentage
0,Afghanistan,0.0010140478184017,4.587155963302752
1,Albania,0.0159059474412171,1.0869565217391304
2,Algeria,0.0002978872534294,1.680672268907563
3,Andorra,0.0435897435897435,0.0
4,Angola,0.00044,0.909090909090909


## Data Visualization Steps

In this section, we produce the four visualizations we set out to create at the beginning of this notebook. Note that these visualizations are simple tabular reports with no further styling.

### 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [21]:
(pd.DataFrame(country_all_and_highquality_articles_per_population
 .sort_values(by='articles_per_population',ascending=False)
 .loc[:,('country','articles_per_population')]
 .head(10)
 .values,columns=['country','articles_per_population']))

Unnamed: 0,country,articles_per_population
0,Nauru,0.4880294659300183
1,Tuvalu,0.4661016949152542
2,San Marino,0.2484848484848484
3,Monaco,0.1050199537912203
4,Liechtenstein,0.0771892467394197
5,Marshall Islands,0.0672727272727272
6,Iceland,0.0622680063356185
7,Tonga,0.0609874152952565
8,Andorra,0.0435897435897435
9,Federated States of Micronesia,0.0368932038834951


### 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [22]:
(pd.DataFrame(country_all_and_highquality_articles_per_population
 .sort_values(by='articles_per_population',ascending=True)
 .loc[:,('country','articles_per_population')]
 .head(10)
 .values,columns=['country','articles_per_population']))

Unnamed: 0,country,articles_per_population
0,India,7.52607711906845e-05
1,China,8.294944311621667e-05
2,Indonesia,8.40691097663503e-05
3,Uzbekistan,9.267902495657587e-05
4,Ethiopia,0.0001069812935566
5,"Korea, North",0.0001561061521834
6,Zambia,0.0001680248676804
7,Thailand,0.0001719868706451
8,"Congo, Dem. Rep. of",0.000193618233929
9,Bangladesh,0.0002019811608929
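The sort-then-head pattern used in the two cells above can also be written with `nlargest` and `nsmallest`. A minimal sketch, using four values copied from the two tables above and a cutoff of 2 instead of 10 to keep it short:

```python
import pandas as pd

ratios = pd.DataFrame({
    'country': ['Nauru', 'Tuvalu', 'India', 'China'],
    'articles_per_population': [0.4880294659300183, 0.4661016949152542,
                                7.52607711906845e-05, 8.294944311621667e-05],
})

# reset_index(drop=True) renumbers the rows from 0, like the tables above.
top = ratios.nlargest(2, 'articles_per_population').reset_index(drop=True)
bottom = ratios.nsmallest(2, 'articles_per_population').reset_index(drop=True)

print(top)
print(bottom)
```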


### 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [23]:
(pd.DataFrame(country_all_and_highquality_articles_per_population
 .sort_values(by='highqualityarticles_percentage',ascending=False)
 .loc[:,('country','highqualityarticles_percentage')]
 .head(10)
 .values,columns=['country','highqualityarticles_percentage']))

Unnamed: 0,country,highqualityarticles_percentage
0,"Korea, North",23.07692307692308
1,Saudi Arabia,11.764705882352942
2,Uzbekistan,10.344827586206897
3,Central African Republic,10.294117647058824
4,Romania,9.770114942528735
5,Guinea-Bissau,9.523809523809524
6,Bhutan,9.090909090909092
7,Vietnam,8.37696335078534
8,Dominica,8.333333333333334
9,Mauritania,7.6923076923076925


### 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [24]:
(pd.DataFrame(country_all_and_highquality_articles_per_population
 .sort_values(by='highqualityarticles_percentage',ascending=True)
 .loc[:,('country','highqualityarticles_percentage')]
 .head(10)
 .values,columns=['country','highqualityarticles_percentage']))

Unnamed: 0,country,highqualityarticles_percentage
0,Turkmenistan,0
1,Tajikistan,0
2,Monaco,0
3,Mozambique,0
4,Nauru,0
5,Tonga,0
6,Cape Verde,0
7,Guadeloupe,0
8,Kazakhstan,0
9,Suriname,0
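Every row in this last table is 0, so the ten countries shown are simply the first ten in sort order among what may be a larger group tied at zero. The size of that group is not shown in this notebook's output, and a check along the following lines could report it. The four-row frame reuses values from the tables above purely for illustration.

```python
import pandas as pd

quality = pd.DataFrame({
    'country': ['Monaco', 'Nauru', 'Tonga', 'Mauritania'],
    'highqualityarticles_percentage': [0.0, 0.0, 0.0, 7.6923076923076925],
})

# Select every country that ties at zero instead of taking an arbitrary ten.
zero_share = quality[quality['highqualityarticles_percentage'] == 0]

print(len(zero_share), 'countries tie at 0%')
print(sorted(zero_share['country']))
```

Reporting the number of tied countries alongside the table would make clear that the ranking among them carries no information.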
