# Investigating Bias in Wikipedia Article Counts by Country

Here I explore bias in English Wikipedia's content by looking at its coverage of politicians by country. I use two metrics, each computed per country:

* **Total Coverage:** number of politician articles as a proportion of the country's population
* **High-Quality Coverage:** number of high-quality politician articles as a proportion of the country's total number of politician articles

I report on the extremes of both of these metrics, i.e. countries with the highest and lowest proportions of each metric.

## Setup

We run a few lines of code to set up the system before we get going.

In [1]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import requests

%matplotlib inline

## Data Ingest

We need a few different datasets for this analysis. For both metrics we need data on English Wikipedia politician pages by country, along with a quality rating for each page. We also need country population data for the first metric. The following sections walk through the data retrieval process.

### Population Data

Population data can be downloaded from Population Reference Bureau (PRB) here:  
http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14

The data represents world populations for 210 countries as of Mid-2015.

Unfortunately this data is copyrighted, so I could not include it in this repository. If you would like to run this analysis yourself, download the data from the link above (click the Excel icon at top right) and save it to `./data/raw/`. Alternatively, just run the code below, which loads the file from local storage if it exists and otherwise attempts to download the source data from the PRB website. The latter assumes that the resource is still available at the time you run it, which may not be the case.

In [2]:
# Population data location (if stored locally)
filename = './data/raw/Population Mid-2015.csv'

# check if file exists locally; if not, attempt to download from source website
if not os.path.isfile(filename):
    filename = 'http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c472%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480'

# load data from .CSV and view structure
population_data = pd.read_csv(filename, skiprows=2, thousands=',')
print('data loaded from ' + filename)

data loaded from http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c288%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c364%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c419%

Let's have a quick look at the data structure.

In [3]:
population_data.head()

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


As shown, this data includes six columns. We'll only use two for this analysis: **Location** and **Data**, which hold each country's name and its population, respectively.
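As a minimal sketch of how the `skiprows` and `thousands` arguments behave, the snippet below parses a small made-up sample. The two preamble lines and the quoted, comma-separated numbers are assumptions about the export layout, not a copy of the PRB file.

```python
import io
import pandas as pd

# Hypothetical sample: two preamble lines before the header row, and
# population values written with thousands separators inside quotes.
sample_csv = io.StringIO(
    'Population Mid-2015\n'
    'Hypothetical preamble line\n'
    'Location,Location Type,TimeFrame,Data Type,Data,Footnotes\n'
    'Afghanistan,Country,Mid-2015,Number,"32,247,000",\n'
    'Andorra,Country,Mid-2015,Number,"78,000",\n'
)

# skiprows=2 drops the preamble; thousands=',' turns "32,247,000" into an integer
population_sample = pd.read_csv(sample_csv, skiprows=2, thousands=',')

# keep only the two columns used in this analysis
country_population = population_sample[['Location', 'Data']]
print(country_population)
```

If the real file has a different number of preamble lines, `skiprows` would need to change accordingly.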

### Page Data

Data on political pages by country can be found here:  
https://figshare.com/articles/Untitled_Item/5513449

Please see the page above for important information on the dataset.

The data is provided in a file named `country.zip`. The zip file includes raw data as a .csv file, as well as a .RProj file and the source code that was used to retrieve the raw data. For simplicity I have unzipped the file manually and saved the raw data file to this repository, at `./data/raw/page_data.csv`.

We load this data to memory below, and take a quick look at the data structure.

In [4]:
# Load page_data from local data
filename = './data/raw/page_data.csv'
page_data = pd.read_csv(filename)
page_data.head(4)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070


The data includes page names, the country associated with each page, and the ID of the latest revision to each page. We will use the latter two columns for this analysis.

### Article Scores from ORES

For our second metric we wish to look at the proportion of high-quality articles compared to the total number of articles for each country. To do this we will need some way to rate the quality of each page in our `page_data` dataset. For this we turn to the Wikimedia ORES API.

Documentation for the ORES API can be found here:  
https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

The API takes a handful of arguments, including project, model, and a string of revision IDs, separated by '|', and returns, among other things, a rating for each revision ID. The rating options, from best to worst, consist of the following:

* **FA:** Featured article
* **GA:** Good article
* **B:** B-class article
* **C:** C-class article
* **Start:** Start-class article
* **Stub:** Stub-class article

For the purposes of this project, we will consider "high-quality" articles to be those rated as either "FA" or "GA".
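The definition above can be captured in a small helper. This is only an illustrative sketch; the names `HIGH_QUALITY_RATINGS` and `is_high_quality` are hypothetical and are not used elsewhere in this notebook.

```python
# Ratings treated as "high-quality" in this project
HIGH_QUALITY_RATINGS = {'FA', 'GA'}

def is_high_quality(rating):
    """Return True only for ratings in the high-quality set."""
    return rating in HIGH_QUALITY_RATINGS

# All six rating classes, plus a missing value (NaN) for pages without a score
ratings = ['FA', 'GA', 'B', 'C', 'Start', 'Stub', float('nan')]
flags = [is_high_quality(r) for r in ratings]
print(list(zip(ratings, flags)))
```

Note that a missing rating is simply treated as not high-quality.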

A few setup tasks before we ping the API:

In [5]:
# set endpoint and headers
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/rexthompson', 'From' : 'rext@uw.edu'}

# pull out revision IDs from page_data
revids_all = list(page_data['rev_id'])

# set up empty dataframe to hold results for each article
rev_ratings = pd.DataFrame()

Now we'll feed the API 100 revision IDs at a time until we have made it through all IDs. Supposedly the API can handle up to ~150 IDs at a time, but we'll stick with 100 for cleanliness and to reduce the chances of crashing the system. We are interested in English Wikipedia only, and the `wp10` model.

The code below will return a DataFrame with revision ID and the rating for each page. We'll print out the ID of any pages that fail to return a valid result (e.g. if the pages have been deleted since `page_data.csv` was created).

In [6]:
# loop 100 entries at a time and save rev_id and rating to rev_ratings DataFrame
idx_start = 0
idx_end = 100
while idx_start < len(revids_all):
    
    # retrieve and concatenate subset of revids
    revids = revids_all[idx_start:idx_end]
    revids = '|'.join(str(x) for x in revids)
    
    # pull article data from API
    params = {'project' : 'enwiki',
              'revids' : revids,
              'model' : 'wp10'
          }
    
    # make the api call and output the results as JSON
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
        
    # loop through the response and pull out the 100 rev_id's and rating for each page; fill invalid entries w/ NaN
    for revid in response['enwiki']['scores']:    
        try:
            # # The following two lines could be used if you wanted to verify the reported ratings
            # temp_dict = response['enwiki']['scores'][revid]['wp10']['score']['probability']
            # rating = max(temp_dict, key=temp_dict.get)
            rating = response['enwiki']['scores'][revid]['wp10']['score']['prediction']
        except KeyError:
            print('unable to load score for ' + revid)
            rating = np.nan
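        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        rev_ratings = pd.concat([rev_ratings, pd.DataFrame([{'rating': rating, 'revid': revid}])],
                                ignore_index=True)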
    
    # NOTE: results are not returned in the same order as they were passed to the API!
    # NOTE: we will handle this by simply doing a merge with the original dataset, so order won't matter
    
    # update indexes
    idx_start += 100
    idx_end = min(idx_start+100, len(revids_all))

unable to load score for 806811023
unable to load score for 807367030
unable to load score for 807367166
unable to load score for 807484325


We see that the following rev_ids do not return a valid result from ORES:

* 806811023
* 807367030
* 807367166
* 807484325
 
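That's no problem. These revisions keep a `NaN` rating, so they can never be counted as high-quality, although they are still counted in article totals wherever their country matches the population data.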
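The batching and parsing steps in the loop above can be sketched without touching the network. The helper names `batch_revids` and `extract_ratings` are hypothetical, and the error entry in the mock response is an assumption about what a failed score looks like; only the `prediction` path is taken from the code above.

```python
def batch_revids(revids, batch_size=100):
    """Yield '|'-joined strings of at most batch_size revision IDs."""
    for start in range(0, len(revids), batch_size):
        yield '|'.join(str(r) for r in revids[start:start + batch_size])

def extract_ratings(response, project='enwiki', model='wp10'):
    """Map each revision ID in a response to its predicted rating, or None."""
    ratings = {}
    for revid, result in response[project]['scores'].items():
        score = result.get(model, {}).get('score')
        ratings[revid] = score['prediction'] if score else None
    return ratings

# Hypothetical response: the first entry follows the structure used above;
# the second stands in for a revision that could not be scored.
mock_response = {'enwiki': {'scores': {
    '235107991': {'wp10': {'score': {'prediction': 'Stub'}}},
    '806811023': {'wp10': {'error': {'message': 'hypothetical error'}}},
}}}

batches = list(batch_revids(list(range(250))))
ratings = extract_ratings(mock_response)
print(len(batches), ratings)
```

Slicing past the end of a list is safe in Python, so the final, shorter batch needs no special handling.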

Let's see how the data looks.

In [7]:
rev_ratings.head()

Unnamed: 0,rating,revid
0,Stub,235107991
1,Stub,355319463
2,Stub,391862046
3,Stub,391862070
4,Stub,391862409


## Data Merge

Now we need to perform a few merges to get a good robust dataset for our analysis.

First we wish to add the rating for each page (as shown above) to the `page_data` dataframe. This is a simple operation, but before we proceed we need to do a bit of cleanup. The `revid` variable in the `rev_ratings` DataFrame consists of strings (from the JSON output), but the `rev_id` column in the `page_data` DataFrame holds integers. We convert `revid` from string to integer so we can more easily merge these two dataframes.

In [8]:
rev_ratings['revid'] = pd.to_numeric(rev_ratings['revid'], errors='coerce')
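As a small illustration of what `errors='coerce'` does, the sketch below converts a made-up Series of ID strings. The values are hypothetical; the point is that an unparseable entry becomes `NaN`, which in turn forces a float dtype.

```python
import pandas as pd

# Hypothetical revision IDs as strings, one of them malformed
revids_as_strings = pd.Series(['235107991', '355319463', 'not-a-number'])
revids_numeric = pd.to_numeric(revids_as_strings, errors='coerce')

# When every value parses cleanly, the result stays an integer dtype
all_valid = pd.to_numeric(pd.Series(['235107991', '355319463']), errors='coerce')

print(revids_numeric.dtype, all_valid.dtype)
```

An integer result is what makes the subsequent merge on `rev_id` straightforward.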

Now we can merge the two dataframes. We'll also drop the redundant column at the same time.

In [9]:
page_data_with_rating = page_data.merge(rev_ratings, left_on='rev_id', right_on='revid').drop(columns='revid')

Let's have a look at the new merged dataframe, which we call `page_data_with_rating`.

In [10]:
page_data_with_rating.head()

Unnamed: 0,page,country,rev_id,rating
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


This looks great! We can move on to the second merge.

We now wish to merge the table above with the population data which, as you may recall, has the following structure:

In [11]:
population_data.head()

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes
0,Afghanistan,Country,Mid-2015,Number,32247000,
1,Albania,Country,Mid-2015,Number,2892000,
2,Algeria,Country,Mid-2015,Number,39948000,
3,Andorra,Country,Mid-2015,Number,78000,
4,Angola,Country,Mid-2015,Number,25000000,


Clearly, we want to merge this table with the new `page_data_with_rating` dataframe on the shared country columns. Let's go ahead and do that now, noting that we have to specify the column names since they are different between the two DataFrames (i.e. `Location` vs. `country`).

In [12]:
merged_df = population_data.merge(page_data_with_rating, left_on='Location', right_on='country')

Let's have a look at the data structure of our new merged_df DataFrame.

In [13]:
merged_df.head()

Unnamed: 0,Location,Location Type,TimeFrame,Data Type,Data,Footnotes,page,country,rev_id,rating
0,Afghanistan,Country,Mid-2015,Number,32247000,,Template:Afghanistan-politician-stub,Afghanistan,394580295,Stub
1,Afghanistan,Country,Mid-2015,Number,32247000,,Template:Afghanistan-mayor-stub,Afghanistan,443496992,Stub
2,Afghanistan,Country,Mid-2015,Number,32247000,,Template:Afghanistan-diplomat-stub,Afghanistan,540459929,Stub
3,Afghanistan,Country,Mid-2015,Number,32247000,,Daud Arsala,Afghanistan,627547024,Stub
4,Afghanistan,Country,Mid-2015,Number,32247000,,Murad Quenili,Afghanistan,670462475,Stub


Looking good! However, we must call out something important at this step. We just merged data from two separate sources. In doing so, we trusted that we would find matches between the `Location` column in `population_data` and the `country` column in `page_data_with_rating`. Let's check the country count in our new `merged_df` DataFrame to see if we were able to match up all 210 countries.

In [14]:
len(merged_df.groupby('country').size())

187

Hmm, not quite 210, is it? While not perfect, this is pretty good, and for the sake of this assignment we were instructed to "remove the rows that do not have matching data". So we'll simply ignore the 23 countries that did not match up perfectly by name between the two data sources.
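If you wanted to see which names failed to match, one option is an outer merge with pandas' `indicator=True`. The sketch below uses made-up toy frames with hypothetical country spellings; it is not run against the real data in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one country is spelled differently in each source
population_toy = pd.DataFrame({'Location': ['Albania', 'Exampleland, Rep. of', 'Andorra'],
                               'Data': [2892000, 1000000, 78000]})
pages_toy = pd.DataFrame({'country': ['Albania', 'Albania', 'Republic of Exampleland', 'Andorra'],
                          'rev_id': [1, 2, 3, 4]})

# Outer merge on the country names; the _merge column records where each row came from
names = population_toy[['Location']].merge(
    pages_toy[['country']].drop_duplicates(),
    left_on='Location', right_on='country', how='outer', indicator=True)

only_in_population = sorted(names.loc[names['_merge'] == 'left_only', 'Location'])
only_in_pages = sorted(names.loc[names['_merge'] == 'right_only', 'country'])
print(only_in_population, only_in_pages)
```

A list like this would show whether the dropped countries are genuine gaps or just naming differences between the two sources.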

Now, before we continue, let's do a little cleanup to get rid of some of the columns we don't need, and reorder the ones we do to be more intuitive.

In [15]:
# pull out columns of interest
merged_df = pd.DataFrame({'country':merged_df['Location'],
                          'population':merged_df['Data'],
                          'article_name':merged_df['page'],
                          'revision_id':merged_df['rev_id'],
                          'article_quality':merged_df['rating']})

# convert population to a numeric type (assign the result back; to_numeric does not modify in place)
merged_df['population'] = pd.to_numeric(merged_df['population'])

# reorder columns
merged_df = merged_df[['country',
                       'population',
                       'article_name',
                       'revision_id',
                       'article_quality']]

Let's see how this new, cleaned dataframe looks.

In [16]:
merged_df.head()

Unnamed: 0,country,population,article_name,revision_id,article_quality
0,Afghanistan,32247000,Template:Afghanistan-politician-stub,394580295,Stub
1,Afghanistan,32247000,Template:Afghanistan-mayor-stub,443496992,Stub
2,Afghanistan,32247000,Template:Afghanistan-diplomat-stub,540459929,Stub
3,Afghanistan,32247000,Daud Arsala,627547024,Stub
4,Afghanistan,32247000,Murad Quenili,670462475,Stub


This is looking good, so for the sake of reproducibility (and per the assignment instructions) let's save this data to CSV.

The code chunk below checks whether a file named `population_and_article_quality_data.csv` already exists in the `./data/` folder. If the file does not exist, it saves `merged_df` to that file; if the file already exists, it loads the file into `merged_df` instead.

Thus, if you are duplicating or expanding upon this analysis and don't want to wait on the API call above, you can simply start at this point by loading in the `population_and_article_quality_data.csv` which is saved in the `./data/` folder on the GitHub repository for this project.

In [17]:
# set filename for combined data CSV
filename = './data/population_and_article_quality_data.csv'

# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    merged_df = pd.read_csv(filename)
    print('loaded CSV data from ' + filename)
else:
    merged_df.to_csv(filename, index=False)
    print('saved CSV data to ' + filename)

saved CSV data to ./data/population_and_article_quality_data.csv


We should now be all set to perform some analyses on this data.

## Analysis

So, now we have a good DataFrame with country and ratings for each article, and population for each country. Let's go about creating data for the two metrics we identified at the beginning of this notebook, i.e. **Total Coverage** and **High-Quality Coverage**.

### Total Coverage

**Total Coverage** seeks to calculate the proportion of political articles for each country compared to each country's population. Thus, for this task we'll need article count and population for each country.

To get article counts, we group our `merged_df` DataFrame by country and count the number of rows. This will return the number of articles per country.

In [18]:
# get number of articles per country
articles_per_country = merged_df.groupby(['country']).size().reset_index(name='article_count').set_index('country')
articles_per_country.head()

Unnamed: 0_level_0,article_count
country,Unnamed: 1_level_1
Afghanistan,327
Albania,460
Algeria,119
Andorra,34
Angola,110


For our population data, we could use the original `population_data` DataFrame from above. However, in the spirit of reproducibility, and to enable "checkpointing" as described above, we will rebuild this dataframe from `population_and_article_quality_data.csv`.

In [19]:
# rebuild population data
population_data = pd.DataFrame(merged_df[['country','population']])
population_data.drop_duplicates(inplace=True)
population_data.set_index('country', inplace=True)
population_data.head()

Unnamed: 0_level_0,population
country,Unnamed: 1_level_1
Afghanistan,32247000
Albania,2892000
Algeria,39948000
Andorra,78000
Angola,25000000
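Dropping duplicates only yields one row per country if each country carries a single population value. A quick sanity check along these lines could confirm that; the toy frame below is hypothetical and merely mirrors the shape of `merged_df`.

```python
import pandas as pd

# Hypothetical rows shaped like merged_df: one row per article
toy = pd.DataFrame({'country': ['Andorra', 'Andorra', 'Angola'],
                    'population': [78000, 78000, 25000000]})

# count distinct population values per country; more than one signals a problem
populations_per_country = toy.groupby('country')['population'].nunique()
inconsistent = populations_per_country[populations_per_country > 1]

deduplicated = toy[['country', 'population']].drop_duplicates()
print(len(deduplicated), list(inconsistent.index))
```

An empty `inconsistent` result means the deduplicated frame can safely be indexed by country.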


We now have number of articles per country and population per country. Let's join these two datasets.

In [20]:
article_count_and_population = population_data.merge(articles_per_country, left_index=True, right_index=True, how='left')
article_count_and_population.head()

Unnamed: 0_level_0,population,article_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,32247000,327
Albania,2892000,460
Algeria,39948000,119
Andorra,78000,34
Angola,25000000,110


Good! Now let's use these two columns to calculate the proportion of articles per country.

In [21]:
article_count_and_population['articles_per_person_pct'] = 100*article_count_and_population['article_count']/article_count_and_population['population']
article_count_and_population.head()

Unnamed: 0_level_0,population,article_count,articles_per_person_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,32247000,327,0.001014
Albania,2892000,460,0.015906
Algeria,39948000,119,0.000298
Andorra,78000,34,0.04359
Angola,25000000,110,0.00044


Excellent!

Now, let's have a look at the ten highest- and lowest-ranked countries in terms of number of politician articles as a proportion of country population.

#### Highest-Ranked

The following table shows the ten highest-ranked countries in terms of number of politician articles as a proportion of country population.

In [22]:
article_count_and_population.sort_values(by='articles_per_person_pct', ascending=False).head(10)

Unnamed: 0_level_0,population,article_count,articles_per_person_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Nauru,10860,53,0.488029
Tuvalu,11800,55,0.466102
San Marino,33000,82,0.248485
Monaco,38088,40,0.10502
Liechtenstein,37570,29,0.077189
Marshall Islands,55000,37,0.067273
Iceland,330828,206,0.062268
Tonga,103300,63,0.060987
Andorra,78000,34,0.04359
Federated States of Micronesia,103000,38,0.036893


As shown, Nauru leads with 53 politician articles for a population of just under 11,000, an articles-per-person rate of 0.488%. Tuvalu is close behind at 0.466%, with 55 articles and a population of ~12,000. The rate then drops sharply, to 0.248% for San Marino and about 0.1% or less for the remaining countries.
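The rates in the table can be checked by hand from the article counts and populations shown. The helper name `articles_per_person_pct` below is hypothetical; it just restates the formula used for the column of the same name.

```python
def articles_per_person_pct(article_count, population):
    """Articles as a percentage of population."""
    return 100 * article_count / population

# Values taken from the table above
nauru_pct = articles_per_person_pct(53, 10860)
tuvalu_pct = articles_per_person_pct(55, 11800)
print(round(nauru_pct, 3), round(tuvalu_pct, 3))
```

With populations this small, a handful of articles moves the rate substantially, which is worth keeping in mind when reading the ranking.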

#### Lowest-Ranked

The following table shows the ten lowest-ranked countries in terms of number of politician articles as a proportion of country population. The lowest-ranked countries are towards the top, with increasing rank as you descend in the table.

In [23]:
article_count_and_population.sort_values('articles_per_person_pct').head(10)

Unnamed: 0_level_0,population,article_count,articles_per_person_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
India,1314097616,990,7.5e-05
China,1371920000,1138,8.3e-05
Indonesia,255741973,215,8.4e-05
Uzbekistan,31290791,29,9.3e-05
Ethiopia,98148000,105,0.000107
"Korea, North",24983000,39,0.000156
Zambia,15473900,26,0.000168
Thailand,65121250,112,0.000172
"Congo, Dem. Rep. of",73340200,142,0.000194
Bangladesh,160411000,324,0.000202


Perhaps not surprisingly, India and China round out the bottom (top of this table), with a relatively small number of pages compared to their populations, both of which are over 1.3 billion. Also on the list is North Korea, which may have a low number of pages because of government censorship, though this analysis cannot confirm that.

### High-Quality Coverage

Now we'll look at high-quality articles as a proportion of all politician articles for each country. This is a similar exercise to the previous one, except that we first count only those articles in the "FA" or "GA" category, and then divide by each country's total article count rather than its population. We do this by subsetting the original `merged_df` dataframe, then grouping in a similar manner to what we did above.

In [24]:
# get number of high-quality articles per country
hq_articles_per_country = merged_df[(merged_df['article_quality'] == 'GA') |
                                    (merged_df['article_quality'] == 'FA' )]
hq_articles_per_country = hq_articles_per_country.groupby(['country']).size().reset_index(name='hq_article_count').set_index('country')

Let's see what this gives us.

In [25]:
hq_articles_per_country.head()

Unnamed: 0_level_0,hq_article_count
country,Unnamed: 1_level_1
Afghanistan,15
Albania,5
Algeria,2
Angola,1
Argentina,17


This looks good. But do you notice anything interesting about this dataframe? Notice any difference in the countries listed, compared to prior DataFrames?

You might have noticed that Andorra is missing from the DataFrame above, since it apparently has no high-quality articles. We will need to take this -- and other such countries -- into account when we merge the DataFrame above with the total article count DataFrame. Let's do that now with a left join, and we'll substitute a `hq_article_count` of zero for any countries that are not included in the merge's right DataFrame (i.e. `hq_articles_per_country`).

In [26]:
hq_article_proportions = articles_per_country.merge(hq_articles_per_country, left_index=True, right_index=True, how='left').fillna(0).astype(int)
hq_article_proportions.head()

Unnamed: 0_level_0,article_count,hq_article_count
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,327,15
Albania,460,5
Algeria,119,2
Andorra,34,0
Angola,110,1


This looks good. And note that Andorra is back in the picture now, with 34 articles, none of which are high-quality. 

Now let's use these two columns to calculate the proportion of high-quality articles per country.

In [27]:
hq_article_proportions['hq_article_pct'] = 100*hq_article_proportions['hq_article_count']/hq_article_proportions['article_count']
hq_article_proportions.head()

Unnamed: 0_level_0,article_count,hq_article_count,hq_article_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,327,15,4.587156
Albania,460,5,1.086957
Algeria,119,2,1.680672
Andorra,34,0,0.0
Angola,110,1,0.909091


Excellent!

Now, let's have a look at the ten highest- and lowest-ranked countries in terms of number of high-quality articles as a proportion of all articles about politicians from each country.

#### Highest-Ranked

The following table shows the ten highest-ranked countries in terms of number of high-quality articles as a proportion of all articles about politicians from each country.

In [28]:
hq_article_proportions.sort_values(by='hq_article_pct', ascending=False).head(10)

Unnamed: 0_level_0,article_count,hq_article_count,hq_article_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Korea, North",39,9,23.076923
Saudi Arabia,119,14,11.764706
Uzbekistan,29,3,10.344828
Central African Republic,68,7,10.294118
Romania,348,34,9.770115
Guinea-Bissau,21,2,9.52381
Bhutan,33,3,9.090909
Vietnam,191,16,8.376963
Dominica,12,1,8.333333
Mauritania,52,4,7.692308


Interestingly, we see that North Korea tops the list by a wide margin, with 9 of its 39 articles (about 23%) rated as high-quality.

#### Lowest-Ranked

The following table shows the ten lowest-ranked countries in terms of number of high-quality articles as a proportion of all articles about politicians from each country. The lowest-ranked countries are towards the top, with increasing rank as you descend in the table.

In [29]:
hq_article_proportions.sort_values(by='hq_article_pct').head(10)

Unnamed: 0_level_0,article_count,hq_article_count,hq_article_pct
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Turkmenistan,33,0,0.0
Tajikistan,40,0,0.0
Monaco,40,0,0.0
Mozambique,60,0,0.0
Nauru,53,0,0.0
Tonga,63,0,0.0
Cape Verde,37,0,0.0
Guadeloupe,49,0,0.0
Kazakhstan,79,0,0.0
Suriname,40,0,0.0


Hmm, that's actually not too helpful. It looks like there are quite a few countries (at least 10) that have a percentage of zero, as a direct result of having exactly zero high-quality articles. Let's look at how many countries have no high-quality articles.

In [30]:
len(hq_article_proportions[hq_article_proportions['hq_article_count']==0])

39

So we see that there are 39 countries that don't have a single high-quality article written about any of their politicians. These 39 countries are listed here.

In [31]:
print(list(hq_article_proportions[hq_article_proportions['hq_article_count']==0].index))

['Andorra', 'Antigua and Barbuda', 'Bahamas', 'Bahrain', 'Barbados', 'Belgium', 'Belize', 'Cape Verde', 'Comoros', 'Costa Rica', 'Djibouti', 'Eritrea', 'Federated States of Micronesia', 'Finland', 'French Guiana', 'Guadeloupe', 'Kazakhstan', 'Kiribati', 'Lesotho', 'Liechtenstein', 'Macedonia', 'Malta', 'Marshall Islands', 'Moldova', 'Monaco', 'Mozambique', 'Nauru', 'Nepal', 'San Marino', 'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Suriname', 'Swaziland', 'Switzerland', 'Tajikistan', 'Tonga', 'Turkmenistan', 'Zambia']
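Because so many countries tie at zero, one possible follow-up is to rank only the countries that have at least one high-quality article. The sketch below shows the idea on the five countries from the `head()` output earlier; it is an illustration of the filtering step, not a result for the full dataset.

```python
import pandas as pd

# Counts copied from the head() output shown earlier in this notebook
toy = pd.DataFrame({'article_count': [327, 460, 119, 34, 110],
                    'hq_article_count': [15, 5, 2, 0, 1]},
                   index=pd.Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
                                  name='country'))
toy['hq_article_pct'] = 100 * toy['hq_article_count'] / toy['article_count']

# drop the zero-count countries, then sort ascending to get the lowest nonzero rates
lowest_nonzero = toy[toy['hq_article_count'] > 0].sort_values(by='hq_article_pct')
print(lowest_nonzero)
```

Applied to `hq_article_proportions`, the same filter would separate the 39 zero-count countries from those with a low but nonzero share.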


This concludes the analysis.