# Data 512A Homework 2 Considering Bias in Data

### Andy Wang, 10/13/2022

## Step 1: Data Acquisition

In this step, I followed the exmaple code notebook to request:
- Page View data from access page view data using the [Wikimedia REST API](https://www.mediawiki.org/wiki/Wikimedia_REST_API)
- Quality Scores for article revisions using [ORES](https://www.mediawiki.org/wiki/ORES)

The main steps are as follows:
- Define Constants, Functions, and Parameters
- read and clean politicians_by_country data and population_by_country data
- Loop for each article to Request Page View data to find revid
- Loop for each article to Quality Scores data to find ORES score

#### 1.1: import packages, define constant

In [1]:
#########
#
#    IMPORT MODULES/PACKAGES
#

# These are standard python modules
import json, time, urllib.parse
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests

In [2]:
#########
#
#    CONSTANTS
#

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': 'wangqc@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

# This is a string of additional page properties that can be returned see the Info documentation for
PAGEINFO_EXTENDED_PROPERTIES = ""


# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"

# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the reviewion
}

#### 1.2: Read and Clean 'politicians_by_country_SEPT.2022.csv' (w/ special considertation)

One main problems after reading the politicians_by_country_SEPT.2022.csv is that there are duplicate values.

There are total 50 duplicates, where 2 of them are completely duplicated value(all columns have same value), so they are just droped.

Other 48 duplicated values are more complicated, the rows having same "name" and "url" but the "country" are different

The three examples shown below demonstrates the strategy to deal with the duplicate values

In [3]:
# Obtain the politicions names
politicians_by_country_df = pd.read_csv('politicians_by_country_SEPT.2022.csv', encoding='utf-8')
politicians_by_country_df = politicians_by_country_df.drop_duplicates()
#POLITICIONS_NAMES = politicians_by_country_df['name'].to_list()

#### Example 1: only one result can be found

By visiting the wiki page url, I find that there is only one "Rudi Kolak" from Bosnia-Herzegovina

Since, no "Rudi Kolak" from Croatia can be found, drop column 1566

Same strategy are used when only one result can be found

In [4]:
politicians_by_country_df[politicians_by_country_df['name']=='Rudi Kolak']

Unnamed: 0,name,url,country
888,Rudi Kolak,https://en.wikipedia.org/wiki/Rudi_Kolak,Bosnia-Herzegovina
1566,Rudi Kolak,https://en.wikipedia.org/wiki/Rudi_Kolak,Croatia


#### Example 2: information mismatch

By visiting the wiki page url, I find that there is 'Count Wenzel Chotek of Chotkow and Wognin' from Austria;

However, there is also a 'Karl, Count Chotek of Chotkow and Wognin' from Czechia

Therefore, change 'name' in column 1566 to Karl, Count Chotek of Chotkow and Wognin

Same strategy are used when there is mismatch information

In [5]:
politicians_by_country_df[politicians_by_country_df['name']=='Count Wenzel Chotek of Chotkow and Wognin']
# Karl, Count Chotek of Chotkow and Wognin

Unnamed: 0,name,url,country
415,Count Wenzel Chotek of Chotkow and Wognin,https://en.wikipedia.org/wiki/Count_Wenzel_Cho...,Austria
1654,Count Wenzel Chotek of Chotkow and Wognin,https://en.wikipedia.org/wiki/Count_Wenzel_Cho...,Czechia


#### Example 3: other cases

There are some more complicated cases, for example, 'Torokul Dzhanuzakov' was a Soviet politician.

The problem is that he cound be count in either countries (Kazakhstan, Kyrgyzstan, Tajikistan, Uzbekistan),

The best solution I can come up with is keeping only his birth country "Kazakhstan" to avoid duplications

In [6]:
politicians_by_country_df[politicians_by_country_df['name']=='Torokul Dzhanuzakov']
# Soviet, born in Kazakhstan

Unnamed: 0,name,url,country
3626,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Kazakhstan
3983,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Kyrgyzstan
6894,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Tajikistan
7341,Torokul Dzhanuzakov,https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov,Uzbekistan


In [7]:
politicians_by_country_df = politicians_by_country_df.drop_duplicates(subset=['url'])
POLITICIONS_NAMES = politicians_by_country_df['name'].to_list()

#### 1.3: define functions

Here in addition to the 2 request function from the provided notebook

I defined 2 functions get_revid and get_ores_score:

Input(get_revid): 
- lsit of all article names

Output(get_revid): 
- list of article names for which revid_id cannot be found
- ataframe of article names with revid_id which request from page view API

Input(get_ores_score): 
- dataframe of article names with revid_id

Output(get_ores_score): 
- list of article names for which ores_score cannot be found
- dataframe of article names with ores_score which request from page view API

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


# define a function for revid
def get_revid(article_list):
    
    if not article_list: return None
    
    # define empty list for output
    article_no_revid_list = []
    article_with_revid_list = []
    revid_list = []
    
    # check if the json file contains all the needed keys
    for article in article_list:
        info = request_pageinfo_per_article(article)
        if 'query' not in info:
            article_no_revid_list.append(article)
        elif 'pages' not in info['query']:
            article_no_revid_list.append(article)
        else:
            for key, value in info['query']['pages'].items():
                if 'lastrevid' not in value:
                    article_no_revid_list.append(article)
                else:
                    article_with_revid_list.append(article)
                    revid_list.append(value['lastrevid'])
    
    # create df contins article name and revid
    revid_df = pd.DataFrame({'article': article_with_revid_list,
                             'revid': revid_list})
    
    return article_no_revid_list, revid_df

def get_ores_score(article_revid_df):
    
    # define empty list for output
    article_no_ores_list = []
    article_with_ores_list = []
    ores_list = []
    
    # check if the json file contains all the needed keys
    for i in range(len(article_revid_df)):
        score = request_ores_score_per_article(article_revid_df['revid'][i])
        if 'enwiki' not in score:
             article_no_ores_list.append(article_revid_df['article'][i])
        elif 'scores' not in score['enwiki']:
            article_no_ores_list.append(article_revid_df['article'][i])
        else:
            for key,value in score['enwiki']['scores'].items():
                if 'articlequality' not in value:
                    article_no_ores_list.append(article_revid_df['article'][i])
                elif 'score' not in value['articlequality']:
                    article_no_ores_list.append(article_revid_df['article'][i])
                elif 'prediction' not in value['articlequality']['score']:
                    article_no_ores_list.append(article_revid_df['article'][i])
                else:
                    article_with_ores_list.append(article_revid_df['article'][i])
                    ores_list.append(value['articlequality']['score']['prediction'])
    
    # create df contins article name and ores_score
    ores_df = pd.DataFrame({'name': article_with_ores_list,
                             'ores_score': ores_list})
    
    return article_no_ores_list, ores_df

#### 1.3: get revid and ores score
- call functions to get revid and ores score
- save the two dataframes to local .csv file
- save the list which cannot get revid/ores score to local .txt file

In [9]:
#########
#
#    GET/SAVE REVID
#

# call functions to get revid
article_no_revid_list, revid_df = get_revid(POLITICIONS_NAMES)

# save the list which cannot get revid score to local .txt file
with open('article_no_revid.txt', 'w', errors = 'ignore') as f:
    for line in article_no_revid_list:
        f.write(f"{line}\n")

# save the dataframe to local .csv file
revid_df.to_csv('revid_df.csv', sep='\t')

In [10]:
#########
#
#    GET/SAVE ORES_SCORE
#

# call functions to get ores score
article_no_ores_list, ores_df = get_ores_score(revid_df)
# save the dataframe to local .csv file
ores_df.to_csv('ores_df.csv', sep='\t')

## Step 2: Combining the Datasets

In this step, I cleaned and combined the following dataset:
- 'revid_df.csv' aquired from last step (page view API)
- 'ores_df.csv' aquired from last step (ORES API)
- 'population_by_country_2022.csv' which is provided (drawn from the world population data sheet published by the Population Reference Bureau)

The main steps are:
- merge dataframes
- add 'region' column
- update 'population' column
- speical considertaion (rows that contains 'Korean' as 'country')


#### 2.1: read in dataframes and merge

In [11]:
# read in dataframes from local
revid_df = pd.read_csv('revid_df.csv', sep='\t',index_col=0)
ores_df = pd.read_csv('ores_df.csv', sep='\t',index_col=0)
# change column names for merge
ores_df.columns = ['article_title', 'article_quality']
revid_df.columns = ['article_title', 'revision_id']
politicians_by_country_df.columns = ['article_title','url','country']
# merge dataframes
ores_revid_df = pd.merge(ores_df, revid_df, how="outer", on=["article_title"])
ores_revid_df = pd.merge(ores_revid_df,politicians_by_country_df,how="left", on=["article_title"])
ores_revid_df = ores_revid_df.drop(columns='url')

#### 2.2: add 'region columns'

To identify the "lowest hierarchy" region names , I use two conditions:
- the rwo contains 'Geography' all letters is upper (identify all region names)
- next row contains 'Geography' not all letters is upper (identify all the "lowest hierarchy" region names)

Then create a seperate column "region", and fill the proper "lowest hierarchy" region names to all countries

In [12]:
# read in 'population_by_country_2022.csv'
population_by_country = pd.read_csv('population_by_country_2022.csv',encoding='utf-8')
population_by_country['Region'] = None

# create two conditions
condition1 = population_by_country['Geography'].str.isupper()
condition2 = population_by_country['Geography'].shift(-1).str.isupper()
condition2[232] = False
condition2 = ~condition2.astype('bool')

# save a copy of datafram only contains "lowest hierarchy" region names with their population
region_population = population_by_country[condition1 & condition2]
region_population = region_population.drop(columns = 'Region')

# add region names for countries
region_name = population_by_country[condition1 & condition2].reset_index()
region_name['Region'] = region_name.loc[:, 'Geography']
region_name = region_name.set_index('index')
population_by_country.update(region_name)
# fill region name for countries
population_by_country['Region'] = population_by_country['Region'].fillna(method='ffill')
# keep only country name (exclude region name)
country_region_population = population_by_country[~condition1]
# rename the column name to match the schema
country_region_population.columns = ['country', 'population','region']
# reset index
country_region_population = country_region_population.reset_index().drop(columns='index')

#### 2.3: Special considerations --- "Korean", "population"

In this part, I adjust the dataframe to solve two problems
- Some article are having country name "Korean" which cannot be found in 'population_by_country_2022.csv'
- Some population values are 0.0 (below 1 million)

to solve the problem, the following strategy are used:
- add another country to the dataframe called 'Korean' which contains population = North Korea + South Korea in EAST ASIA region
  (solving this way because these politicions are from acient Korea when North Korea and South Korea were united)

- change population values 0.0 (below 1 million) to 0.05 million. Since the smallest value of population in the dataframe is 0.1, so take the avg value between 0.1 and 0.0.

In [13]:
# add Korean which combines North Korea and South Korea
country_region_population = country_region_population.append({'country':'Korean', 'population': 77.7, 'region':'EAST ASIA'},ignore_index=True)
# Replace 0 population to 0.05 and covert back from "millions"
country_region_population.replace(to_replace = 0, value = 0.05, inplace=True)
country_region_population['population'] = country_region_population['population'].apply(float)*1000000

#### 2.4: save outputs
There are two main outputs:
- 'wp_countries-no_match.txt': countries that does not have a article or population info cannot be found
- 'wp_politicians_by_country.csv': remaining data following the provided shcema

There is a additional out put:
- 'region_population.csv': the "lowest hierarchy" region names with their populations

In [14]:
# outer merge
outer_merge = pd.merge(ores_revid_df, country_region_population, how="outer", on=['country'])
wp_countries_no_match = outer_merge[outer_merge['article_title'].isna()]['country'].to_list()

# save 'wp_countries-no_match.txt'
with open('wp_countries-no_match.txt', 'w', errors = 'ignore') as f:
    for line in wp_countries_no_match:
        f.write(f"{line}\n")

# 'wp_politicians_by_country.csv'
wp_politicians_by_country = pd.merge(ores_revid_df, country_region_population, how="inner", on=['country'])
wp_politicians_by_country.to_csv('wp_politicians_by_country.csv', sep='\t')
region_population.to_csv('region_population.csv', sep='\t')

## Step 3: Analysis and Results

In this step, I cleaned the data for analysis, the main steps are:
- count "total articles numbers" and "high quality article numbers" for each country/region
- calculate 'high-quality-articles-per-population' and 'total-articles-per-population'
- merge result to two dataframes: one by country, the other one by region

Then provide results for the following questions:
- Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).
- Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .
- Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) .
- Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).
- Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.
- Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

#### 3.1: clean data and calculate 'high-quality-articles-per-population' and 'total-articles-per-population'

In [15]:
# read data saved last step
wp_politicians_by_country = pd.read_csv('wp_politicians_by_country.csv',sep='\t',index_col=0)

In [16]:
#########
#
#    CLEAN DATA BY COUNTRY
#

# define a function for high quality
def high_quality(article_quality):
    return article_quality == "FA" or article_quality == "GA"

# calculate 'total-articles-per-population' for country
total_articles_country = wp_politicians_by_country.groupby(['country','population']).size().reset_index(name='counts')
total_articles_country['total-articles-per-population'] = total_articles_country.loc[:, 'counts']/total_articles_country.loc[:, 'population']
# 'high-quality-articles-per-population' for country 
high_quality_articles = wp_politicians_by_country[wp_politicians_by_country['article_quality'].apply(high_quality)]
high_quality_articles_country = high_quality_articles.groupby(['country','population']).size().reset_index(name='counts')
high_quality_articles_country['high-quality-articles-per-population'] = total_articles_country.loc[:, 'counts']/total_articles_country.loc[:, 'population']

# merge result together
merged_articles_country = pd.merge(total_articles_country, high_quality_articles_country.loc[:,['country','high-quality-articles-per-population']], how="left", on=['country'])
merged_articles_country = merged_articles_country.drop(columns='counts')
merged_articles_country['high-quality-articles-per-population'] = merged_articles_country['high-quality-articles-per-population'].fillna(0)

In [17]:
#########
#
#    CLEAN DATA BY REGION
#

# calculate 'total-articles-per-population' for region
total_articles_region = wp_politicians_by_country.groupby(['region']).size().reset_index(name='counts')
region_population = pd.read_csv('region_population.csv',sep='\t',index_col=0)
region_population.columns = ['region','population']
total_articles_region_population = pd.merge(total_articles_region, region_population, how ='left', on=['region'])
total_articles_region_population['population'] = total_articles_region_population['population'].apply(float)*1000000
total_articles_region_population['total-articles-per-population'] = total_articles_region_population.loc[:, 'counts']/total_articles_region_population.loc[:, 'population']

# calculate 'high-quality-articles-per-population' for region
high_quality_articles_region_population = wp_politicians_by_country[wp_politicians_by_country['article_quality'].apply(high_quality)]
high_quality_articles_region = high_quality_articles.groupby(['region']).size().reset_index(name='counts')
high_quality_articles_region_population = pd.merge(high_quality_articles_region, region_population, how ='left', on=['region'])
high_quality_articles_region_population['population'] = high_quality_articles_region_population['population'].apply(float)*1000000
high_quality_articles_region_population['high-quality-articles-per-population'] = high_quality_articles_region_population.loc[:, 'counts']/high_quality_articles_region_population.loc[:, 'population']

# merge result together
merged_articles_region = pd.merge(total_articles_region_population, high_quality_articles_region_population.loc[:,['region','high-quality-articles-per-population']], how="left", on=['region'])
merged_articles_region = merged_articles_region.drop(columns='counts')

#### 3.2: Provide result for 6 study Questions

#### Q1: Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order).

In [21]:
merged_articles_country.sort_values(by=['total-articles-per-population'], ascending=False).head(10)

Unnamed: 0,country,population,total-articles-per-population,high-quality-articles-per-population
109,Monaco,50000.0,0.00026,0.0
173,Tuvalu,50000.0,0.00022,8.751609e-07
5,Antigua and Barbuda,100000.0,0.00017,0.0
54,Federated States of Micronesia,100000.0,0.00013,0.0
3,Andorra,100000.0,0.0001,7.572383e-07
13,Barbados,300000.0,9.3e-05,0.0
105,Marshall Islands,100000.0,9e-05,0.0
144,Seychelles,100000.0,6e-05,0.0
111,Montenegro,600000.0,6e-05,0.00013
98,Luxembourg,700000.0,5.3e-05,0.0


#### Q2: Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .

In [22]:
merged_articles_country.sort_values(by=['total-articles-per-population'], ascending=True).head(10)

Unnamed: 0,country,population,total-articles-per-population,high-quality-articles-per-population
32,China,1436600000.0,1.392176e-09,0.0
107,Mexico,127500000.0,7.843137e-09,0.0
141,Saudi Arabia,36700000.0,8.174387e-08,1.3e-05
135,Romania,19000000.0,1.052632e-07,2.3e-05
73,India,1417200000.0,1.255998e-07,5e-06
154,Sri Lanka,22400000.0,1.339286e-07,0.0
48,Egypt,103500000.0,1.352657e-07,0.0
53,Ethiopia,123400000.0,1.944895e-07,2e-06
162,Taiwan,23200000.0,2.155172e-07,0.0
181,Vietnam,99400000.0,2.716298e-07,5e-06


#### Q3: Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) .

In [18]:
merged_articles_country.sort_values(by=['high-quality-articles-per-population'], ascending=False).head(10)

Unnamed: 0,country,population,total-articles-per-population,high-quality-articles-per-population
14,Belarus,9200000.0,4.23913e-06,0.00017
111,Montenegro,600000.0,6e-05,0.00013
7,Armenia,3000000.0,1.533333e-05,0.0001
26,Cambodia,16800000.0,2.02381e-06,9.3e-05
37,Costa Rica,5200000.0,1.211538e-05,5.1e-05
129,Papua New Guinea,9300000.0,9.677419e-07,5e-05
86,Korean,77700000.0,8.751609e-07,4e-05
142,Senegal,17900000.0,1.843575e-06,3.2e-05
1,Albania,2800000.0,2.964286e-05,3e-05
71,Hungary,9700000.0,1.340206e-05,2.5e-05


#### Q4: Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

Since the 10 countries with the lowest high quality articles per capita lowest are having values of 0 (too few high quality articles comparing to their population or even no high quality articles), which is not informative, I also find the 10 countries with the lowest high quality articles per capita which having non-zero values.

In [19]:
merged_articles_country.sort_values(by=['high-quality-articles-per-population','total-articles-per-population'], ascending=True).head(10)

Unnamed: 0,country,population,total-articles-per-population,high-quality-articles-per-population
32,China,1436600000.0,1.392176e-09,0.0
107,Mexico,127500000.0,7.843137e-09,0.0
154,Sri Lanka,22400000.0,1.339286e-07,0.0
48,Egypt,103500000.0,1.352657e-07,0.0
162,Taiwan,23200000.0,2.155172e-07,0.0
113,Mozambique,33000000.0,2.727273e-07,0.0
12,Bangladesh,171200000.0,3.271028e-07,0.0
100,Malawi,20400000.0,3.431373e-07,0.0
38,Cote d'Ivoire,28200000.0,3.900709e-07,0.0
164,Tanzania,65500000.0,4.122137e-07,0.0


the 10 countries with the lowest high quality articles per capita which having non-zero values (using the dataframe before merge).

In [20]:
high_quality_articles_country.sort_values(by=['high-quality-articles-per-population'], ascending=True).head(10)

Unnamed: 0,country,population,counts,high-quality-articles-per-population
32,Guinea,13900000.0,2,1.392176e-09
73,Serbia,6800000.0,4,1.255998e-07
48,Lebanon,5500000.0,3,1.352657e-07
53,Mauritania,4700000.0,1,1.944895e-07
12,Burundi,12900000.0,2,3.271028e-07
38,Iraq,44500000.0,5,3.900709e-07
74,Slovakia,5400000.0,1,4.029038e-07
22,Dominican Republic,11200000.0,3,4.143389e-07
36,Indonesia,275500000.0,14,5.151515e-07
77,South Africa,60600000.0,4,5.263158e-07


#### Q5: Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [23]:
merged_articles_region.sort_values(by=['total-articles-per-population'], ascending=False)

Unnamed: 0,region,population,total-articles-per-population,high-quality-articles-per-population
14,SOUTHERN EUROPE,151000000.0,5.788079e-06,3.046358e-07
0,CARIBBEAN,44000000.0,4.568182e-06,1.818182e-07
17,WESTERN EUROPE,197000000.0,3.543147e-06,1.116751e-07
5,EASTERN EUROPE,287000000.0,2.526132e-06,1.324042e-07
8,NORTHERN EUROPE,107000000.0,2.429907e-06,7.476636e-08
16,WESTERN ASIA,294000000.0,2.326531e-06,9.52381e-08
9,OCEANIA,44000000.0,1.954545e-06,4.545455e-08
13,SOUTHERN AFRICA,69000000.0,1.695652e-06,5.797101e-08
4,EASTERN AFRICA,473000000.0,1.365751e-06,3.171247e-08
10,SOUTH AMERICA,434000000.0,1.327189e-06,2.764977e-08


#### Q6: Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [24]:
merged_articles_region.sort_values(by=['high-quality-articles-per-population'], ascending=False)

Unnamed: 0,region,population,total-articles-per-population,high-quality-articles-per-population
14,SOUTHERN EUROPE,151000000.0,5.788079e-06,3.046358e-07
0,CARIBBEAN,44000000.0,4.568182e-06,1.818182e-07
5,EASTERN EUROPE,287000000.0,2.526132e-06,1.324042e-07
17,WESTERN EUROPE,197000000.0,3.543147e-06,1.116751e-07
16,WESTERN ASIA,294000000.0,2.326531e-06,9.52381e-08
8,NORTHERN EUROPE,107000000.0,2.429907e-06,7.476636e-08
13,SOUTHERN AFRICA,69000000.0,1.695652e-06,5.797101e-08
1,CENTRAL AMERICA,178000000.0,1.08427e-06,5.617978e-08
9,OCEANIA,44000000.0,1.954545e-06,4.545455e-08
2,CENTRAL ASIA,78000000.0,1.320513e-06,3.846154e-08
