### DATA COLLECTION

**Getting the Article and Population Data**

The politicians-by-country and population-by-country data were obtained by crawling a general list of Wikipedia pages. The data is stored in CSV files and can be found in the `raw` folder.

Considerations in the raw data:
1. Duplicate category labels
2. Cumulative regional population counts are also present

In [1]:
# import libraries
import json, time, urllib.parse
import requests
import pandas as pd
import numpy as np

All the constants that are required for the API requests are declared below.
The politicians and population datasets are also imported here.

In [22]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<sghela@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# Load the politicians and population datasets; the politician names are the article titles to request
politicians = pd.read_csv('../data/politicians_by_country_SEPT.2022.csv')
population = pd.read_csv('../data/population_by_country_2022.csv')
ARTICLE_TITLES = politicians['name']

# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
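The throttle constants above work out to a wait of about 8 ms per request, which caps the request rate at roughly 100 per second once the assumed 2 ms latency is included. A minimal sketch of that arithmetic follows; `REQUESTS_PER_SECOND` and the `throttled` wrapper are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import time

REQUESTS_PER_SECOND = 100      # assumed target rate implied by 1.0/100.0 above
API_LATENCY_ASSUMED = 0.002    # seconds, same value as in the constants cell
API_THROTTLE_WAIT = (1.0 / REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

def throttled(func, wait=API_THROTTLE_WAIT):
    """Hypothetical helper: sleep for `wait` seconds before every call to func."""
    def wrapper(*args, **kwargs):
        if wait > 0.0:
            time.sleep(wait)
        return func(*args, **kwargs)
    return wrapper

# Roughly 0.008 seconds between requests
print(round(API_THROTTLE_WAIT, 3))
```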

**Getting Article Quality Predictions**

An estimate of the Article Quality is provided by ORES, a machine learning tool. The article quality estimates are:
1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

ORES requires a specific revision ID of a specific article to be able to make a label prediction.

The MediaWiki Action API is used to make a page info request for the current revision of each article, and that revision ID is then passed to ORES to retrieve a quality prediction.

In [3]:
"""
Function to make a page info request for an article to get the current page revision ID.

Returns a JSON response containing all the page info details for the article.
"""

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

"""
Function to request an ORES page quality score given an article revision ID.

Returns a JSON response containing the ORES score.
"""
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

To get a Wikipedia page quality prediction from ORES for each politician’s article page, the following steps are performed:
1. read each line of politicians_by_country_SEPT.2022.csv,
2. make a page info request to get the current page revision, and
3. make an ORES request using the current revision ID.

The dataframe `df` is used to store the article title and the last revision ID. <br>
The dataframe `no_val` contains the titles of all the records that do not have a last revision ID.

First, I loop through all the articles in the politicians file and call `request_pageinfo_per_article` for each one. The `lastrevid` is then extracted from the resulting JSON response and stored in one of the dataframes mentioned above.

In [4]:
page_info = {}
no_val = pd.DataFrame(columns=['title'])
df = pd.DataFrame(columns=['title','lastrevid'])
for index, article in enumerate(ARTICLE_TITLES):
    if index % 500 == 0:
        print(f"Done with {index} records")
    page_info = request_pageinfo_per_article(article)   
    page_info = page_info['query']['pages']
    info_df = pd.DataFrame.from_dict(page_info, orient = 'index')
    if 'lastrevid' in info_df.columns:
        info_df = info_df[['title','lastrevid']]
        df = pd.concat([df, info_df])
    else:
        no_val = pd.concat([no_val, info_df[['title']]])


Done with 0 records
Done with 500 records
Done with 1000 records
Done with 1500 records
Done with 2000 records
Done with 2500 records
Done with 3000 records
Done with 3500 records
Done with 4000 records
Done with 4500 records
Done with 5000 records
Done with 5500 records
Done with 6000 records
Done with 6500 records
Done with 7000 records
Done with 7500 records
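The loop above indexes straight into the response, so a failed request (where `request_pageinfo_per_article` returns `None`) would raise an error. A minimal sketch of a more defensive extraction step follows. `extract_lastrevid` is a hypothetical helper, and the sample dictionaries are illustrative stand-ins shaped like the `query` → `pages` structure the loop relies on; they are not captured API output.

```python
def extract_lastrevid(page_info_response):
    """Return (title, lastrevid) pairs; lastrevid is None when the page has none."""
    if not page_info_response:
        return []
    pages = page_info_response.get("query", {}).get("pages", {})
    return [(page.get("title"), page.get("lastrevid")) for page in pages.values()]

# Illustrative responses: one page with a revision, one without
found = {"query": {"pages": {"69537737": {"title": "Shahjahan Noori",
                                          "lastrevid": 1099689043}}}}
missing = {"query": {"pages": {"-1": {"title": "Kang Sun-nam"}}}}

print(extract_lastrevid(found))
print(extract_lastrevid(missing))
print(extract_lastrevid(None))
```

Titles that come back with `None` could be collected in `no_val`, and the rest in `df`, in the same way as the loop does.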


Displaying all the articles that do not have a last revision ID, stored in the `no_val` dataframe.

In [16]:
no_val

Unnamed: 0,title
-1,Prince Ofosu Sefah
-1,Harjit Kaur Talwandi
-1,Abd al-Razzaq al-Hasani
-1,Kang Sun-nam
-1,Segun “Aeroland” Adewale
-1,Nhlanhla “Lux” Dlamini


Displaying the data in `df` containing article titles and last revision IDs.

In [6]:
df.head(10)

Unnamed: 0,title,lastrevid
69537737,Shahjahan Noori,1099689043
42972519,Abdul Ghafar Lakanwal,943562276
10483286,Majah Ha Adrif,852404094
11966231,Haroon al-Afghani,1095102390
46841383,Tayyab Agha,1104998382
68624823,Ahmadullah Wasiq,1109361754
47805901,Aziza Ahmadyar,1087211008
70019038,Muqadasa Ahmadzai,1082489593
27664854,Mohammad Sarwar Ahmedzai,1038918070
12084570,Amir Muhammad Akhundzada,1069322182


Now, for each row in `df`, `request_ores_score_per_article` is called to get the article quality prediction for that revision. The prediction is then extracted from the JSON response and stored in `df` under the `score` column.

In [7]:
i = 0
for index, row in df.iterrows():
    if(i%500 == 0):
        print(f"Done with {i} records")
    lastrevid = row["lastrevid"]
    score = request_ores_score_per_article(lastrevid)
    score = score["enwiki"]["scores"][str(lastrevid)]["articlequality"]["score"]["prediction"]
    df.loc[index,'score'] = score
    i += 1


Done with 0 records
Done with 500 records
Done with 1000 records
Done with 1500 records
Done with 2000 records
Done with 2500 records
Done with 3000 records
Done with 3500 records
Done with 4000 records
Done with 4500 records
Done with 5000 records
Done with 5500 records
Done with 6000 records
Done with 6500 records
Done with 7000 records
Done with 7500 records
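The ORES lookup in the loop above chains six dictionary keys, so one missing level (or a `None` response) stops the whole run. A minimal sketch of a tolerant lookup follows. `extract_prediction` is a hypothetical helper, the nesting mirrors the keys used in the loop, and the `error` entry in the second sample is an assumed shape used only to exercise the fallback.

```python
def extract_prediction(ores_response, revid, context="enwiki", model="articlequality"):
    """Walk the nested ORES response; return None if any level is missing."""
    node = ores_response
    for key in (context, "scores", str(revid), model, "score", "prediction"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Illustrative responses shaped like the keys used in the loop above
ok = {"enwiki": {"scores": {"1099689043": {"articlequality": {"score": {"prediction": "GA"}}}}}}
bad = {"enwiki": {"scores": {"1099689043": {"articlequality": {"error": {}}}}}}

print(extract_prediction(ok, 1099689043))
print(extract_prediction(bad, 1099689043))
```

Revisions that return `None` can then be retried or logged instead of interrupting the loop.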


Displaying the data present in `df`:

In [8]:
df.head()

Unnamed: 0,title,lastrevid,score
69537737,Shahjahan Noori,1099689043,GA
42972519,Abdul Ghafar Lakanwal,943562276,Start
10483286,Majah Ha Adrif,852404094,Start
11966231,Haroon al-Afghani,1095102390,B
46841383,Tayyab Agha,1104998382,Start


### COMBINING DATASETS

Checking to see if the score column has any null values.

In [9]:
df['score'].isnull().values.any()

False

Since there are no null values in the scores, I'll merge the `df` dataframe with the `politicians` dataset.

In [43]:
combined = pd.merge(politicians, df, left_on='name', right_on='title', how='left')
combined.head()

Unnamed: 0,name,url,country,title,lastrevid,score
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043,GA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094,Start
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390,B
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382,Start


Before merging the population dataset with the politicians dataset, let's look at the data in the `population` dataframe.

In [47]:
population.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


A country can belong to only one region. The `population` dataframe lists regions in hierarchical order, and region rows are distinguished by their ALL CAPS names. I create a new `region` column in the dataframe that assigns each country to the nearest upper-case region above it.

In [29]:
for i in range(0,len(population)):
    region = population.loc[i,'Geography'] 
    population.loc[i,'region'] = region if region.isupper() else population.loc[i-1,'region'] 
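The same assignment can be written without an explicit row loop. The sketch below is an alternative under the same assumption (region rows are exactly the ALL CAPS ones), using the first five rows shown earlier as sample data; `is_region` and `countries_only` are names introduced here for illustration.

```python
import pandas as pd

population = pd.DataFrame({
    "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
    "Population (millions)": [7963.0, 1419.0, 251.0, 44.9, 103.5],
})

# True for region rows (ALL CAPS), False for countries
is_region = population["Geography"].str.isupper()

# Keep the name on region rows, blank it on country rows, then carry it forward
population["region"] = population["Geography"].where(is_region).ffill()

# Country rows only, each labelled with the closest region above it
countries_only = population[~is_region]
print(countries_only)
```

Keeping `is_region` around also makes it easy to drop the cumulative region rows before any per-country analysis.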


In [30]:
population.head()

Unnamed: 0,Geography,Population (millions),region
0,WORLD,7963.0,WORLD
1,AFRICA,1419.0,AFRICA
2,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,Algeria,44.9,NORTHERN AFRICA
4,Egypt,103.5,NORTHERN AFRICA


Analysing the population data shows that 6 countries have a population of zero. This is most likely because the population is given in millions and these countries have far fewer than a million inhabitants, so the recorded value is 0.0.

In [116]:
population[population['Population (millions)']==0]['Geography'].unique()

array(['Liechtenstein', 'Monaco', 'San Marino', 'Nauru', 'Palau',
       'Tuvalu'], dtype=object)

Next, I combine the `population` and the `combined` dataframes to give a combined dataset where each politician is mapped to a country, region and population.

In [50]:
combined = pd.merge(combined, population, left_on='country', right_on='Geography', how='left')
combined.head()

Unnamed: 0,name,url,country,title,lastrevid,score,Geography,Population (millions),region
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan,Shahjahan Noori,1099689043,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan,Abdul Ghafar Lakanwal,943562276,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan,Majah Ha Adrif,852404094,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan,Haroon al-Afghani,1095102390,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan,Tayyab Agha,1104998382,Start,Afghanistan,41.1,SOUTH ASIA


The `combined` dataframe is then cleaned to remove any unnecessary columns such as title, url and Geography. The columns are then renamed for convenience.

In [51]:
combined = combined.drop(columns=['title','url', 'Geography'])
combined.head()

Unnamed: 0,name,country,lastrevid,score,Population (millions),region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [52]:
combined = combined.rename(columns={'name':'article_title', 'lastrevid': 'revision_id', 'score':'article_quality', 'Population (millions)':'population'})
combined.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


Next, I check for any duplicates in the dataset.

In [53]:
duplicateRows = combined[combined.duplicated()]
duplicateRows

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
198,Visar Ymeri,Albania,1036757024,Stub,2.8,SOUTHERN EUROPE
382,Hrant Maloyan,Armenia,1114902744,C,3.0,WESTERN ASIA
418,Count Wenzel Chotek of Chotkow and Wognin,Austria,1083654825,Start,9.0,WESTERN EUROPE
434,Eduard Hedvicek,Austria,1072556655,Stub,9.0,WESTERN EUROPE
442,Konstantin Jireček,Austria,1100601439,C,9.0,WESTERN EUROPE
...,...,...,...,...,...,...
7066,Manuel Carrascalão,Timor-Leste,1071880738,Start,1.3,SOUTHEAST ASIA
7290,Sergey Abisov,Ukraine,1113303500,Start,41.0,EASTERN EUROPE
7445,Torokul Dzhanuzakov,Uzbekistan,1092752457,C,35.6,CENTRAL ASIA
7446,Torokul Dzhanuzakov,Uzbekistan,1092752457,C,35.6,CENTRAL ASIA


There are 108 exact duplicates in the dataset, so I drop them.

In [54]:
combined = combined.drop_duplicates()

I check for any duplicates at the article title level. This could indicate that one politician is related to multiple countries. There are 48 such politicians.

In [56]:
len(combined[combined['article_title'].duplicated()])

48
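The two duplicate checks above answer different questions: `duplicated()` on whole rows finds exact repeats, while `duplicated()` on `article_title` alone finds titles that remain after de-duplication because they are listed under more than one country. A small sketch on made-up data (the titles and countries are placeholders, not records from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "article_title": ["A", "A", "B", "B", "C"],
    "country": ["X", "X", "X", "Y", "Z"],
})

# Exact duplicates: every column matches an earlier row
exact = sample.duplicated()
deduped = sample.drop_duplicates()

# Title-level duplicates: keep=False marks every row that shares a title
multi_country = deduped[deduped["article_title"].duplicated(keep=False)]
print(multi_country)
```

With the default `keep='first'`, as used in the cell above, only the second and later occurrences of a title are counted; `keep=False` shows all rows involved.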

I also check for any null values in the `combined` dataset.

In [57]:
combined.isnull().values.any()

True

In [58]:
combined.isnull().sum()

article_title       0
country             0
revision_id         6
article_quality     6
population         70
region             70
dtype: int64

There are 70 null values in the region field. This means that some countries in the `politicians` dataset did not match any entry in the `population` dataset. I list those countries below, then find the countries in the `population` dataset that have no matching politicians and store them in the `wp_countries-no_match` file in the `data` folder.

In [81]:
no_match = pd.DataFrame(combined[combined['region'].isnull()]["country"].unique(),
columns=["country"])
no_match

Unnamed: 0,country
0,Korean


In [36]:
combined_right = pd.merge(combined, population, left_on='country', right_on='Geography', how='right')
combined_right.head()

Unnamed: 0.1,Unnamed: 0,article_title,country,revision_id,article_quality,population,region_x,Geography,Population (millions),region_y
0,,,,,,,,WORLD,7963.0,WORLD
1,,,,,,,,AFRICA,1419.0,AFRICA
2,,,,,,,,NORTHERN AFRICA,251.0,NORTHERN AFRICA
3,201.0,Said Abadou,Algeria,1112194000.0,Stub,44.9,NORTHERN AFRICA,Algeria,44.9,NORTHERN AFRICA
4,202.0,Tahar Allan,Algeria,1059626000.0,Stub,44.9,NORTHERN AFRICA,Algeria,44.9,NORTHERN AFRICA


In [38]:
combined_right = combined_right[combined_right['country'].isna()]
combined_right = combined_right[combined_right['Geography'] != combined_right['region_y']]
combined_right = combined_right['Geography']

In [39]:
combined_right

230            Western Sahara
1028                Mauritius
1029                  Mayotte
1039                  Reunion
1656    Sao Tome and Principe
1669                 eSwatini
1777                   Canada
1778            United States
2064                  Curacao
2106               Guadeloupe
2157               Martinique
2158              Puerto Rico
2538            French Guiana
4206                   Brunei
4544              Philippines
4623    China,  Hong Kong SAR
4624        China,  Macao SAR
4963                  Ireland
5133           United Kingdom
7463                Australia
7491         French Polynesia
7492                     Guam
7493                 Kiribati
7505            New Caledonia
7506              New Zealand
Name: Geography, dtype: object

In [40]:
combined_right.to_csv("../data/wp_countries-no_match.csv", index=False)
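Both directions of the mismatch check above can also be obtained from a single outer merge with `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only` or `both`. The sketch below uses tiny made-up frames (`P1`, `P2` are placeholder names); only the column names follow the notebook.

```python
import pandas as pd

politicians = pd.DataFrame({"name": ["P1", "P2"], "country": ["Algeria", "Korean"]})
population = pd.DataFrame({"Geography": ["Algeria", "Canada"]})

both = pd.merge(politicians, population, left_on="country", right_on="Geography",
                how="outer", indicator=True)

# Countries in the politicians data with no population entry
no_population = both.loc[both["_merge"] == "left_only", "country"].unique().tolist()

# Population entries with no politician articles
no_politicians = both.loc[both["_merge"] == "right_only", "Geography"].unique().tolist()

print(no_population, no_politicians)
```

On the real `population` frame the region rows would still need to be filtered out of `no_politicians`, as is done above by comparing `Geography` with the region column.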

There are also 6 articles having null values for revision id and article quality. These 6 values were also previously discovered and stored in the `no_val` dataframe. I am going to export these records into the `wp_article_quality-no_match` file in the `data` folder.

In [87]:
no_score =(combined[combined['article_quality'].isnull()])
no_score

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
2480,Prince Ofosu Sefah,Ghana,,,33.5,WESTERN AFRICA
3024,Harjit Kaur Talwandi,India,,,1417.2,SOUTH ASIA
3253,Abd al-Razzaq al-Hasani,Iraq,,,44.5,WESTERN ASIA
3834,Kang Sun-nam,"Korea, North",,,26.1,EAST ASIA
4943,Segun “Aeroland” Adewale,Nigeria,,,218.5,WESTERN AFRICA
6434,Nhlanhla “Lux” Dlamini,South Africa,,,60.6,SOUTHERN AFRICA


In [93]:
no_score.to_csv("../data/wp_article_quality-no_match.csv", index=False)

Rows with a null region or article quality are then dropped from the `combined` dataframe, and the remaining records are exported to the `wp_politicians_by_country` file in the `data` folder.

In [90]:
combined = combined[combined['region'].notna()]
combined = combined[combined['article_quality'].notna()]
combined.head()

Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562276,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404094,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102390,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998382,Start,41.1,SOUTH ASIA


In [91]:
combined.isnull().sum()

article_title      0
country            0
revision_id        0
article_quality    0
population         0
region             0
dtype: int64

In [109]:
combined.reset_index(drop=True).to_csv("../data/wp_politicians_by_country.csv")

### DATA ANALYSIS

In [2]:
combined = pd.read_csv("../data/wp_politicians_by_country.csv")

As mentioned before, 6 countries have a recorded population of zero. These countries are excluded from the analysis below to avoid infinite per-capita values. The countries are as follows:
1. Liechtenstein
2. Monaco
3. San Marino
4. Nauru
5. Palau
6. Tuvalu

#### Top 10 countries by coverage: The 10 countries with the highest total articles per capita

Group the dataset by country and population and count the number of articles for each country. The total articles per capita is then the article count divided by the number of people, that is, the `population` value multiplied by 1,000,000 because it is recorded in millions.

In [5]:
total_article_capita = combined.groupby(['country', 'population'])['country'].count().reset_index(name='count')

In [6]:
total_article_capita['article_per_capita'] = total_article_capita['count']/(total_article_capita['population'] * 1000000)

In [7]:
ans1 = total_article_capita[total_article_capita['article_per_capita'] != np.inf].sort_values(by=['article_per_capita'], ascending=False)[:10]

In [8]:
ans1

Unnamed: 0,country,population,count,article_per_capita
5,Antigua and Barbuda,0.1,17,0.00017
54,Federated States of Micronesia,0.1,13,0.00013
3,Andorra,0.1,10,0.0001
13,Barbados,0.3,28,9.3e-05
104,Marshall Islands,0.1,9,9e-05
110,Montenegro,0.6,36,6e-05
143,Seychelles,0.1,6,6e-05
97,Luxembourg,0.7,37,5.3e-05
18,Bhutan,0.8,41,5.1e-05
64,Grenada,0.1,5,5e-05
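As a worked check of the per-capita arithmetic, the sketch below recomputes the first row of the table above and shows one way to handle a zero population up front instead of filtering out `np.inf` afterwards. `articles_per_capita` is a hypothetical helper; note that plain Python raises `ZeroDivisionError` on division by zero, whereas the pandas division in the notebook yields `inf`.

```python
def articles_per_capita(article_count, population_millions):
    """Articles per person; None when the recorded population is zero."""
    if population_millions <= 0:
        return None
    return article_count / (population_millions * 1_000_000)

# Antigua and Barbuda: 17 articles, population recorded as 0.1 million
print(articles_per_capita(17, 0.1))

# A country whose recorded population is 0.0 (article count is made up)
print(articles_per_capita(3, 0.0))
```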


#### Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita

In [9]:
ans2 = total_article_capita[total_article_capita['article_per_capita'] != np.inf].sort_values(by=['article_per_capita'], ascending=True)[:10]

In [10]:
ans2

Unnamed: 0,country,population,count,article_per_capita
32,China,1436.6,2,1.392176e-09
106,Mexico,127.5,1,7.843137e-09
140,Saudi Arabia,36.7,3,8.174387e-08
134,Romania,19.0,2,1.052632e-07
73,India,1417.2,178,1.255998e-07
153,Sri Lanka,22.4,3,1.339286e-07
48,Egypt,103.5,14,1.352657e-07
53,Ethiopia,123.4,25,2.025932e-07
161,Taiwan,23.2,5,2.155172e-07
180,Vietnam,99.4,27,2.716298e-07


#### Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita

"High quality" articles are articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) class. Store the high quality articles in a dataframe, then group it by country and population and count the number of high quality articles. The total high quality articles per capita is the high quality article count divided by the number of people, that is, the `population` value multiplied by 1,000,000 because it is recorded in millions.

In [31]:
high_quality = combined.loc[(combined['article_quality'] == "FA") | (combined['article_quality'] == "GA")]
high_quality.head()

Unnamed: 0.1,Unnamed: 0,article_title,country,revision_id,article_quality,population,region
0,0,Shahjahan Noori,Afghanistan,1099689043,GA,41.1,SOUTH ASIA
55,55,Ahmed Wali Karzai,Afghanistan,1090245979,GA,41.1,SOUTH ASIA
59,59,Masoud Khalili,Afghanistan,1103105365,GA,41.1,SOUTH ASIA
93,93,Amrullah Saleh,Afghanistan,1115022704,FA,41.1,SOUTH ASIA
107,107,Nur ul-Haq Ulumi,Afghanistan,1107429109,GA,41.1,SOUTH ASIA


In [32]:
total_quality_capita = high_quality.groupby(['country', 'population'])['article_quality'].count().reset_index(name='count')

In [33]:
total_quality_capita

Unnamed: 0,country,population,count
0,Afghanistan,41.1,6
1,Albania,2.8,6
2,Andorra,0.1,2
3,Armenia,3.0,1
4,Azerbaijan,10.2,1
...,...,...,...
88,Ukraine,41.0,5
89,United Arab Emirates,9.4,4
90,Uruguay,3.6,1
91,Vietnam,99.4,2


In [34]:
total_quality_capita['quality_per_capita'] = total_quality_capita['count']/(total_quality_capita['population'] * 1000000)

In [35]:
ans3 = total_quality_capita[total_quality_capita['quality_per_capita'] != np.inf].sort_values(by=['quality_per_capita'], ascending=False)[:10]

In [36]:
ans3

Unnamed: 0,country,population,count,quality_per_capita
2,Andorra,0.1,2,2e-05
53,Montenegro,0.6,3,5e-06
1,Albania,2.8,6,2.142857e-06
80,Suriname,0.6,1,1.666667e-06
9,Bosnia-Herzegovina,3.4,5,1.470588e-06
49,Lithuania,2.8,3,1.071429e-06
19,Croatia,3.8,4,1.052632e-06
74,Slovenia,2.1,2,9.52381e-07
61,Palestinian Territory,5.4,5,9.259259e-07
28,Gabon,2.4,2,8.333333e-07


#### Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita

In [37]:
ans4 = total_quality_capita[total_quality_capita['quality_per_capita'] != np.inf].sort_values(by=['quality_per_capita'], ascending=True)[:10]

In [38]:
ans4

Unnamed: 0,country,population,count,quality_per_capita
35,India,1417.2,6,4.2337e-09
84,Thailand,66.8,1,1.497006e-08
39,Japan,124.9,2,1.601281e-08
58,Nigeria,218.5,4,1.830664e-08
91,Vietnam,99.4,2,2.012072e-08
17,Colombia,49.1,1,2.03666e-08
87,Uganda,47.2,1,2.118644e-08
60,Pakistan,235.8,5,2.120441e-08
79,Sudan,46.9,1,2.132196e-08
37,Iran,88.6,2,2.257336e-08


#### Geographic regions by total coverage: A rank ordered list of geographic regions by total articles per capita.

Group the dataset by region, count the number of articles in each region, and sum the `population` column over the rows in each region. The total articles per capita by region is then the article count divided by the number of people, that is, the summed `population` value multiplied by 1,000,000 because it is recorded in millions.

In [19]:
total_region_capita = combined.groupby('region').agg(count=("article_title", "count"), population=("population", "sum")).reset_index()

In [20]:
total_region_capita

Unnamed: 0,region,count,population
0,CARIBBEAN,201,1239.5
1,CENTRAL AMERICA,195,1755.7
2,CENTRAL ASIA,106,1788.4
3,EAST ASIA,245,21763.7
4,EASTERN AFRICA,648,19032.8
5,EASTERN EUROPE,736,37316.1
6,MIDDLE AFRICA,203,7919.0
7,NORTHERN AFRICA,227,7639.9
8,NORTHERN EUROPE,262,1348.4
9,OCEANIA,86,110.1


In [21]:
total_region_capita['article_per_capita'] = total_region_capita['count']/(total_region_capita['population'] * 1000000)

In [22]:
ans5 = total_region_capita[total_region_capita['article_per_capita'] != np.inf].sort_values(by=['article_per_capita'], ascending=False)[:10]

In [23]:
ans5

Unnamed: 0,region,count,population,article_per_capita
9,OCEANIA,86,110.1,7.811081e-07
8,NORTHERN EUROPE,262,1348.4,1.943044e-07
0,CARIBBEAN,201,1239.5,1.621622e-07
1,CENTRAL AMERICA,195,1755.7,1.110668e-07
2,CENTRAL ASIA,106,1788.4,5.927086e-08
14,SOUTHERN EUROPE,890,19444.6,4.577106e-08
16,WESTERN ASIA,686,15532.6,4.416518e-08
4,EASTERN AFRICA,648,19032.8,3.404649e-08
7,NORTHERN AFRICA,227,7639.9,2.971243e-08
6,MIDDLE AFRICA,203,7919.0,2.563455e-08
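One caveat about the aggregation above: `combined` has one row per article, so summing `population` over those rows adds a country's population once for every article it has. A sketch of an alternative that counts each country once is shown below on made-up data (countries `A` and `B`, region `R`); it is an illustration of the idea, not a recomputation of the results above.

```python
import pandas as pd

combined = pd.DataFrame({
    "article_title": ["a1", "a2", "a3", "b1"],
    "country": ["A", "A", "A", "B"],
    "region": ["R", "R", "R", "R"],
    "population": [2.0, 2.0, 2.0, 5.0],   # millions, repeated on every article row
})

# Article count per region
counts = combined.groupby("region")["article_title"].count()

# Region population with each country counted once
region_pop = (combined.drop_duplicates(subset=["country"])
                      .groupby("region")["population"].sum())

per_capita = counts / (region_pop * 1_000_000)

# For comparison: summing over article rows inflates the population
row_sum = combined.groupby("region")["population"].sum()
print(region_pop["R"], row_sum["R"])
```

Even this variant only includes countries that have at least one article in `combined`; using the region totals from the population file would be another option.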


#### Geographic regions by high quality coverage: Rank ordered list of geographic regions by high quality articles per capita.

Group the high quality articles by region, count them, and sum the `population` column over the rows in each region. The total high quality articles per capita by region is then the high quality article count divided by the number of people, that is, the summed `population` value multiplied by 1,000,000 because it is recorded in millions.

In [39]:
total_quality_region = high_quality.groupby('region').agg(count=("article_quality", "count"), population=("population", "sum")).reset_index()

In [40]:
total_quality_region

Unnamed: 0,region,count,population
0,CARIBBEAN,8,89.6
1,CENTRAL AMERICA,10,102.2
2,CENTRAL ASIA,3,45.2
3,EAST ASIA,16,946.7
4,EASTERN AFRICA,15,653.4
5,EASTERN EUROPE,39,2893.2
6,MIDDLE AFRICA,5,43.9
7,NORTHERN AFRICA,6,125.8
8,NORTHERN EUROPE,8,46.6
9,OCEANIA,2,9.3


In [41]:
total_quality_region['quality_per_capita'] = total_quality_region['count']/(total_quality_region['population'] * 1000000)

In [42]:
ans6 = total_quality_region[total_quality_region['quality_per_capita'] != np.inf].sort_values(by=['quality_per_capita'], ascending=False)[:10]

In [43]:
ans6

Unnamed: 0,region,count,population,quality_per_capita
9,OCEANIA,2,9.3,2.150538e-07
8,NORTHERN EUROPE,8,46.6,1.716738e-07
6,MIDDLE AFRICA,5,43.9,1.138952e-07
1,CENTRAL AMERICA,10,102.2,9.784736e-08
0,CARIBBEAN,8,89.6,8.928571e-08
2,CENTRAL ASIA,3,45.2,6.637168e-08
16,WESTERN ASIA,28,538.8,5.196733e-08
14,SOUTHERN EUROPE,46,910.6,5.051614e-08
7,NORTHERN AFRICA,6,125.8,4.769475e-08
10,SOUTH AMERICA,13,302.3,4.300364e-08
