# Considering Bias in Data


The goal of this assignment is to explore the concept of bias in data using Wikipedia articles. This assignment will consider articles on political figures from different countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.
You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians varies among countries. Your analysis will consist of a series of tables that show:

1. The countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. The countries with the highest and lowest proportion of high quality articles about politicians.
3. A ranking of geographic regions by articles-per-person and proportion of high quality articles.

You are also expected to write a short reflection on the project that focuses on how both your findings from this analysis and the process you went through to reach those findings helps you understand the causes and consequences of biased data in large, complex data science projects.


## Step 1: Getting the Article and Population Data


Importing the neccesary packages for this exercise

In [None]:
import json, time, urllib.parse
import re, requests

import pandas as pd
import numpy as np

from datetime import datetime

Reading the input files

In [177]:
politician_df = pd.read_csv('politicians_by_country_SEPT.2022.csv')
population_df = pd.read_csv('population_by_country_2022.csv')

In [178]:
politician_df.head(10) #sample view of politicians by country data

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
5,Ahmadullah Wasiq,https://en.wikipedia.org/wiki/Ahmadullah_Wasiq,Afghanistan
6,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan
7,Muqadasa Ahmadzai,https://en.wikipedia.org/wiki/Muqadasa_Ahmadzai,Afghanistan
8,Mohammad Sarwar Ahmedzai,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,Afghanistan
9,Amir Muhammad Akhundzada,https://en.wikipedia.org/wiki/Amir_Muhammad_Ak...,Afghanistan


In [179]:
population_df.head(10) #sample view of population by country data

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5
5,Libya,6.8
6,Morocco,36.7
7,Sudan,46.9
8,Tunisia,11.8
9,Western Sahara,0.6


Cleaning data before analysis

1. Removing duplicates

In [180]:
population_df.shape[0]  #233 row count- before cleanup
politician_df.shape[0]  #7584 row count - before cleanup

7584

In [181]:
population_df.drop_duplicates()
politician_df.drop_duplicates()
politician_df = politician_df[~politician_df.duplicated(subset=['name', 'url', 'country'], keep = 'last')]
population_df = population_df[~population_df.duplicated(subset=['Geography','Population (millions)'], keep = 'last')]

In [182]:
population_df.shape[0]  #233 row count- after cleanup
politician_df.shape[0]  #7582 row count - after cleanup

7582

2. Removing countries with 0 population as input

In [183]:
population_df = population_df[population_df["Population (millions)"]!=0]

In [184]:
politician_df.shape  #7582 row count - after cleanup

(7582, 3)

3. Removing the cumulative regional population counts and obtaining the closest in the hierarchy regions for the countries

In [185]:
population_df['shifted'] = population_df['Geography'].shift(-1)
population_df = population_df[~((population_df['Geography'].str.isupper() == True) & (population_df['shifted'].str.isupper() == True))].iloc[:,0:2].reset_index().drop('index', axis = 1)


# Obtain the regions which are capital case in the data
regions = pd.DataFrame()
regions['region'] = population_df[population_df['Geography'].str.isupper()]['Geography']
regions['flag'] = np.arange(1, len(regions['region']) + 1)

# Obtain the country and region linkage
df = population_df.merge(regions, left_on = "Geography", right_on = "region", how = 'left').iloc[:,[0,1,3]]

# Fill the region column for linkage region to be populated
df['flag'] = df['flag'].expanding().max()
population_df = df.merge(regions, on = "flag", how = 'inner')
population_df = population_df.iloc[:,[0, 1, 3]]

# Filter for only countries in the geography column
population_df = population_df[population_df['Geography'] != population_df['region']]


In [186]:
population_df

Unnamed: 0,Geography,Population (millions),region
1,Algeria,44.9,NORTHERN AFRICA
2,Egypt,103.5,NORTHERN AFRICA
3,Libya,6.8,NORTHERN AFRICA
4,Morocco,36.7,NORTHERN AFRICA
5,Sudan,46.9,NORTHERN AFRICA
...,...,...,...
217,Papua New Guinea,9.3,OCEANIA
218,Samoa,0.2,OCEANIA
219,Solomon Islands,0.7,OCEANIA
220,Tonga,0.1,OCEANIA


# Step 2: Getting Article Quality Predictions


Step 2: Getting Article Quality Predictions
Now you need to get the predicted quality scores for each article in the Wikipedia dataset. We're using a machine learning system called ORES. This was originally an acronym for "Objective Revision Evaluation Service" but was simply renamed “ORES”. ORES is a machine learning tool that can provide estimates of Wikipedia article quality. The article quality estimates are, from best to worst:

1. FA - Featured article
2. GA - Good article
3. B - B-class article
4. C - C-class article
5. Start - Start-class article
6. Stub - Stub-class article

These were learned based on articles in Wikipedia that were peer-reviewed using the Wikipedia content assessment procedures.These quality classes are a sub-set of quality assessment categories developed by Wikipedia editors.
ORES requires a specific revision ID of a specific article to be able to make a label prediction. You can use the API:Info request to get a range of metadata on an article, including the most current revision ID of the article page.
Putting this together, to get a Wikipedia page quality prediction from ORES for each politician’s article page you will need to: a) read each line of politicians_by_country.SEPT.2022.csv, b) make a page info request to get the current page revision, and c) make an ORES request using the page title and current revision id.  

The homework folder contains example code in notebooks to illustrate making a page info request and making an ORES request. This sample code is licensed CC0 so feel free to reuse any of the code in either notebook without attribution.
Note: It is possible that you will be unable to get a score for a particular article. If that happens, make sure to maintain a log of articles for which you were not able to retrieve an ORES score. This log can be saved as a separate file, or (if it's only a few articles), simply printed and logged within the notebook. The choice is up to you.

    

1. Page Information Endpoint Details

In [173]:

# The basic English Wikipedia API endpoint
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# We'll assume that there needs to be some throttling for these requests - we should always be nice to a free data resource
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<shubhacp@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}


ARTICLE_TITLES = politician_df['name'].unique()
# This is a string of additional page properties that can be returned see the Info documentation for
# what can be included. If you don't want any this can simply be the empty string
PAGEINFO_EXTENDED_PROPERTIES = "talkid|url|watched|watchers"
#PAGEINFO_EXTENDED_PROPERTIES = ""

# This template lists the basic parameters for making this
PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}

2. Function for REST API call to get the page information

In [21]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    #global output
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title   
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    #output.append(json_response)
    #print(output)
    return json_response

3. Pulling the page information from the endpoint

In [30]:
data_dump = {}
for i in range(0, len(ARTICLE_TITLES), 50):
  # Get the concatenated titles string
    titles = "|".join(ARTICLE_TITLES[i:i+50])
    response = request_pageinfo_per_article(article_title = titles, request_template = PAGEINFO_PARAMS_TEMPLATE)['query']['pages']
    with open(r'C:\Users\shubh\Downloads\hw2\no_page_info_articles.txt', 'w') as f
    for j in list(response.keys()):
        # Check for the missing articles information from endpoint
        if int(j) < 0:
            line = "Couldn't get the page info for: " + response[j]['title']
            print(line)
            f.write(line)
            f.write('\n')
            del response[j]
        else:
            break
    info.update(response)

Couldn't get the page info for: Prince Ofosu Sefah
Couldn't get the page info for: Harjit Kaur Talwandi
Couldn't get the page info for: Abd al-Razzaq al-Hasani
Couldn't get the page info for: Abiodun Abimbola Orekoya
Couldn't get the page info for: Segun “Aeroland” Adewale
Couldn't get the page info for: Roman Konoplev
Couldn't get the page info for: Nhlanhla “Lux” Dlamini


In [31]:

articles = pd.DataFrame.from_dict(data_dump, orient='index', columns=['title', 'lastrevid'])
articles.reset_index(inplace = True, drop = True)
articles.head()

Unnamed: 0,title,lastrevid
0,Abas Basir,1098419766
1,Abdul Baqi Turkistani,889226470
2,Abdul Ghafar Lakanwal,943562276
3,Abdul Ghani Ghani,1072441893
4,Abdul Malik Hamwar,1100874645


4. ORES API Endpoint Details

In [32]:
# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/?models={model}&revids={revids}"

# Use some delays so that we do not hammer the API with our requests
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0) - API_LATENCY_ASSUMED

# When making automated requests we should include something that is unique to the person making the request
# This should include an email - your UW email would be good to put in there
REQUEST_HEADERS = {
    'User-Agent': '<shubhacp@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022'
}

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the reviewion
}

5. Function for REST API call to get the ORES scores for article quantity

In [33]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revids'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

6. Pulling the ORES scores from the endpoint

In [35]:
scores_dict = {}
for i in range(0, len(articles.lastrevid), 50):
    try: 
        # Get the concatenated revision ids string
        revids = map(str, list(articles.lastrevid)[i:i+50])
        revids = "|".join(revids)
        response = request_ores_score_per_article(revids)['enwiki']['scores']
        for j in list(response.keys()):
            scores_dict[j] = response[j]['articlequality']['score']['prediction']
    except:
        print("Couldn't get the ORES info for: ", i)




In [36]:
df_scores = pd.DataFrame.from_dict(scores_dict, orient='index', columns=['prediction'])
df_scores.reset_index(inplace = True)
df_scores = df_scores.rename(columns = {'index': 'lastrevid'})
df_scores['lastrevid'] = df_scores['lastrevid'].astype('int')
df_scores

Unnamed: 0,lastrevid,prediction
0,1013838830,Stub
1,1033383351,Stub
2,1038918070,Start
3,1041460606,B
4,1060707209,Start
...,...,...
7522,1112385169,C
7523,1112725980,Start
7524,1114641622,Stub
7525,904246837,Stub


# Step 3: Combining the Datasets


Some processing of the data will be necessary! In particular, you'll need to - after retrieving and including the ORES data for each article - merge the wikipedia data and population data together. Both have fields containing country names for just that purpose. After merging the data, you'll invariably run into entries which cannot be merged. Either the population dataset does not have an entry for the equivalent Wikipedia country, or vice-versa.

Identify all countries for which there are no matches and output a list of those countries, with each country on a separate line called:
wp_countries-no_match.txt
Consolidate the remaining data into a single CSV file called:
wp_politicians_by_country.csv

The schema for that file should look something like this:
Columns
1. country
2. region
3. population
4. article_title
5. revision_id
6. article_quality



In [53]:
# Taking articles page info
articles = articles[['title', 'lastrevid']]
print(articles.shape)

# Merging with the articles quality scores
df_joined = articles.merge(df_scores, on = ['lastrevid'], how = 'left')
print(df_joined.shape)

# Merging with the politicians input data to obtain the country
df_joined = politician_df.merge(df_joined, left_on = "name", right_on = "title", how = 'left')
politicians_scores = df_joined.drop(['title', 'url'], axis = 1)
print(politicians_scores.shape)
politicians_scores.head()

(7527, 2)
(7527, 3)
(7582, 4)


Unnamed: 0,name,country,lastrevid,prediction
0,Shahjahan Noori,Afghanistan,1099689000.0,GA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start
2,Majah Ha Adrif,Afghanistan,852404100.0,Start
3,Haroon al-Afghani,Afghanistan,1095102000.0,B
4,Tayyab Agha,Afghanistan,1104998000.0,Start


In [55]:
# merging population data as well
final = politicians_scores.merge(population_df, left_on = 'country', right_on = 'Geography', how = 'outer')
print(final.shape)
final.head()

(7607, 7)


Unnamed: 0,name,country,lastrevid,prediction,Geography,Population (millions),region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA


To obtain the no match countries list to a output text file

In [58]:
# List of countries where there is no wikipedia data
l1 = final[final['country'].isnull()]['Geography'].unique()
# List of countries where there is no population data
l2 = final[final['Geography'].isnull()]['country'].unique()
# Combine the lists to obtains the no match countries list
no_match = list(set(np.append(l1, l2)))
no_match.sort()
no_match


['Australia',
 'Brunei',
 'Canada',
 'China,  Hong Kong SAR',
 'China,  Macao SAR',
 'Curacao',
 'French Guiana',
 'French Polynesia',
 'Guadeloupe',
 'Guam',
 'Ireland',
 'Kiribati',
 'Korean',
 'Liechtenstein',
 'Martinique',
 'Mauritius',
 'Mayotte',
 'Monaco',
 'Nauru',
 'New Caledonia',
 'New Zealand',
 'Palau',
 'Philippines',
 'Puerto Rico',
 'Reunion',
 'San Marino',
 'Sao Tome and Principe',
 'Tuvalu',
 'United Kingdom',
 'United States',
 'Western Sahara',
 'eSwatini']

Writing the no match countries into wp_countries-no_match.txt

In [63]:
with open(r'C:\Users\shubh\Downloads\hw2\wp_countries-no_match.txt', 'w') as fp:
    for i in no_match:
        # write each item on a new line
        fp.write("%s\n" % i)
    print('Done')

Done


Obtaining the final consolidated dataframe (with the schema requested in Step 3)

In [70]:
final = final[(~final['country'].isnull()) & (~final['Geography'].isnull())]
final = final.drop('Geography', axis = 1)
final = final.rename(columns={Population (millions)': 'population', 'name': 'article_title', 'latestrevid': 'revision_id', 'prediction': 'article_quality'})

final.head()

Unnamed: 0,article_title,country,lastrevid,article_quality,population,region
0,Shahjahan Noori,Afghanistan,1099689000.0,GA,41.1,SOUTH ASIA
1,Abdul Ghafar Lakanwal,Afghanistan,943562300.0,Start,41.1,SOUTH ASIA
2,Majah Ha Adrif,Afghanistan,852404100.0,Start,41.1,SOUTH ASIA
3,Haroon al-Afghani,Afghanistan,1095102000.0,B,41.1,SOUTH ASIA
4,Tayyab Agha,Afghanistan,1104998000.0,Start,41.1,SOUTH ASIA


Writing the final dataframe to wp_politicians_by_country.csv CSV file

In [69]:
final.to_csv('wp_politicians_by_country.csv', index=False)

# Step 4: Analysis

Your analysis will consist of calculating total-articles-per-population (a ratio representing the number of articles per person)  and high-quality-articles-per-population (a ratio representing the number of high quality articles per person) on a country-by-country and regional basis. All of these values are to be “per capita”.
In this analysis a country can only exist in one region. 
The population_by_country_2022.csv actually represents regions in a hierarchical order. For your analysis always put a country in the closest (lowest in the hierarchy) region.
For this analysis you should consider "high quality" articles to be articles that ORES predicted would be in either the "FA" (featured article) or "GA" (good article) classes.
Also, keep in mind that the population_by_country_2022.csv file provides population in millions. The calculated proportions in this step are likely to be very small numbers.

# 4-a) Total articles per population - By country



Generating a temporary df to remove duplicate entries of country in any df (here I've used the output df of STep 3) to obtain the accurate population of each country

In [224]:
country_population = pd.DataFrame()
temp=final[~final.duplicated(subset=['country', 'region'], keep = 'last')]
temp

Unnamed: 0,article_title,country,lastrevid,article_quality,population,region
117,Abdul Ghafor Zori,Afghanistan,9.940234e+08,Stub,41.1,SOUTH ASIA
200,Menduh Zavalani,Albania,1.071578e+09,GA,2.8,SOUTHERN EUROPE
234,Naima Salhi,Algeria,1.113203e+09,Stub,44.9,NORTHERN AFRICA
244,Enric Tarrado Vives,Andorra,9.866714e+08,Stub,0.1,SOUTHERN EUROPE
286,Mfulupinga Nlando Victor,Angola,9.940851e+08,Start,35.6,MIDDLE AFRICA
...,...,...,...,...,...,...
7417,Nervis Villalobos,Venezuela,1.098002e+09,C,28.3,SOUTH AMERICA
7444,Võ Văn Thưởng,Vietnam,1.084427e+09,C,99.4,SOUTHEAST ASIA
7505,Aidarus al-Zoubaidi,Yemen,1.115038e+09,Start,33.7,WESTERN ASIA
7518,Brenda Mwika Tambatamba,Zambia,1.090872e+09,Stub,20.0,EASTERN AFRICA


In [118]:
country_population= temp[['country', 'population']].groupby('country').sum().reset_index()
country_population['population']=country_population['population']
#country_population.head()


articles_by_country = final[['country', 'article_title']]
articles_by_country = articles_by_country.merge(country_population, on = "country", how = "inner")
articles_by_country = articles_by_country.groupby(['country', 'population']).nunique().reset_index()


articles_by_country['articles_per_capita'] = articles_by_country['article_title'] / (articles_by_country['population'] * 1000000)


Unnamed: 0,country,population
0,Afghanistan,41.1
1,Albania,2.8
2,Algeria,44.9
3,Andorra,0.1
4,Angola,35.6


In [124]:
articles_by_country

Unnamed: 0,country,population,article_title,articles_per_capita
0,Afghanistan,41.1,118,2.871046e-06
1,Albania,2.8,83,2.964286e-05
2,Algeria,44.9,34,7.572383e-07
3,Andorra,0.1,10,1.000000e-04
4,Angola,35.6,42,1.179775e-06
...,...,...,...,...
173,Venezuela,28.3,62,2.190813e-06
174,Vietnam,99.4,27,2.716298e-07
175,Yemen,33.7,61,1.810089e-06
176,Zambia,20.0,13,6.500000e-07


# 4-a) Total articles per population - By Region

In [126]:
# region_population= temp[['region', 'population']].groupby('region').sum().reset_index()
# region_population['population']=region_population['population']
# #country_population.head()


# articles_by_region = final[['region', 'article_title']]
# articles_by_region = articles_by_region.merge(region_population, on = "region", how = "inner")
# articles_by_region = articles_by_region.groupby(['region', 'population']).nunique().reset_index()

# articles_by_region['articles_per_capita'] = articles_by_region['article_title'] / (articles_by_region['population'] * 1000000)




In [226]:
# for total population by region

reg_population= population_df[['region', 'Population (millions)']].groupby('region').sum().reset_index()
reg_population['population']= reg_population['Population (millions)']

reg_population['population'] = reg_population['population'].round().astype('int')
reg_population= reg_population.drop(['Population (millions)'], axis=1)

reg_population


Unnamed: 0,region,population
0,CARIBBEAN,44
1,CENTRAL AMERICA,178
2,CENTRAL ASIA,78
3,EAST ASIA,1674
4,EASTERN AFRICA,473
5,EASTERN EUROPE,287
6,MIDDLE AFRICA,196
7,NORTHERN AFRICA,251
8,NORTHERN AMERICA,372
9,NORTHERN EUROPE,106


In [227]:
#getting the data frame with both the population by region data and articles per capita calculations

total_articles = final[['region', 'article_title']]
    
total_articles = total_articles.merge(reg_population, on = "region", how = "inner")
total_articles = total_articles.groupby(['region', 'population']).nunique().reset_index()


total_articles['articles_per_capita'] = total_articles['article_title'] / (total_articles['population'] * 1000000)

In [228]:
total_articles

Unnamed: 0,region,population,article_title,articles_per_capita
0,CARIBBEAN,44,201,4.568182e-06
1,CENTRAL AMERICA,178,193,1.08427e-06
2,CENTRAL ASIA,78,103,1.320513e-06
3,EAST ASIA,1674,246,1.469534e-07
4,EASTERN AFRICA,473,646,1.365751e-06
5,EASTERN EUROPE,287,733,2.554007e-06
6,MIDDLE AFRICA,196,203,1.035714e-06
7,NORTHERN AFRICA,251,227,9.043825e-07
8,NORTHERN EUROPE,106,261,2.462264e-06
9,OCEANIA,44,72,1.636364e-06


# 4-b)High quality articles per person - By Country

In [229]:
hq_by_country = final[(final['article_quality'] == 'FA') | (final['article_quality'] == 'GA')]
country_count=hq_by_country[['country', 'article_title']].groupby('country').count().reset_index()

hq_by_country=country_population.merge(country_count, on='country')
hq_by_country.columns=['country', 'population', 'article_count']
hq_by_country['articles_per_capita'] = hq_by_country['article_count'] / (hq_by_country['population'] * 1000000)
hq_by_country

Unnamed: 0,country,population,article_count,articles_per_capita
0,Afghanistan,41.1,6,1.459854e-07
1,Albania,2.8,6,2.142857e-06
2,Andorra,0.1,2,2.000000e-05
3,Armenia,3.0,1,3.333333e-07
4,Azerbaijan,10.2,1,9.803922e-08
...,...,...,...,...
87,Ukraine,41.0,4,9.756098e-08
88,United Arab Emirates,9.4,4,4.255319e-07
89,Uruguay,3.6,1,2.777778e-07
90,Vietnam,99.4,2,2.012072e-08


# 4-b)High quality articles per person - By Region

In [231]:
hq_by_region = final[(final['article_quality'] == 'FA') | (final['article_quality'] == 'GA')]
region_count=hq_by_region[['region', 'article_title']].groupby('region').count().reset_index()

hq_by_region=reg_population.merge(region_count, on='region')
hq_by_region.columns=['region', 'population', 'article_count']
hq_by_region['articles_per_capita'] = hq_by_region['article_count'] / (hq_by_region['population'] * 1000000)
hq_by_region

Unnamed: 0,region,population,article_count,articles_per_capita
0,CARIBBEAN,44,8,1.818182e-07
1,CENTRAL AMERICA,178,10,5.617978e-08
2,CENTRAL ASIA,78,3,3.846154e-08
3,EAST ASIA,1674,16,9.557945e-09
4,EASTERN AFRICA,473,15,3.171247e-08
5,EASTERN EUROPE,287,38,1.324042e-07
6,MIDDLE AFRICA,196,5,2.55102e-08
7,NORTHERN AFRICA,251,7,2.788845e-08
8,NORTHERN EUROPE,106,8,7.54717e-08
9,OCEANIA,44,1,2.272727e-08


# Step 5: Results

Your results from this analysis will be produced in the form of data tables. You are being asked to produce six total tables, that show:

1. Top 10 countries by coverage: The 10 countries with the highest total articles per capita (in descending order) .

In [232]:
top10_coverage_countries = articles_by_country.sort_values(by=['articles_per_capita'], ascending=False).head(10)[['country', 'articles_per_capita']]
top10_coverage_countries.reset_index(inplace = True, drop = True)
top10_coverage_countries["country"]

0               Antigua and Barbuda
1    Federated States of Micronesia
2                           Andorra
3                          Barbados
4                  Marshall Islands
5                        Montenegro
6                        Seychelles
7                        Luxembourg
8                            Bhutan
9                           Grenada
Name: country, dtype: object

2. Bottom 10 countries by coverage: The 10 countries with the lowest total articles per capita (in ascending order) .

In [233]:
bottom10_coverage_countries = articles_by_country.sort_values(by=['articles_per_capita'], ascending=True).head(10)[['country', 'articles_per_capita']]
bottom10_coverage_countries.reset_index(inplace = True, drop = True)
bottom10_coverage_countries["country"]

0           China
1          Mexico
2    Saudi Arabia
3         Romania
4           India
5       Sri Lanka
6           Egypt
7        Ethiopia
8          Taiwan
9         Vietnam
Name: country, dtype: object

3. Top 10 countries by high quality: The 10 countries with the highest high quality articles per capita (in descending order) .

In [234]:
top10_hq_countries = hq_by_country.sort_values(by=['articles_per_capita'], ascending=False).head(10)[['country', 'articles_per_capita']]
top10_hq_countries.reset_index(inplace = True, drop = True)
top10_hq_countries["country"]

0                  Andorra
1               Montenegro
2                  Albania
3                 Suriname
4       Bosnia-Herzegovina
5                Lithuania
6                  Croatia
7                 Slovenia
8    Palestinian Territory
9                    Gabon
Name: country, dtype: object

4. Bottom 10 countries by high quality: The 10 countries with the lowest high quality articles per capita (in ascending order).

In [235]:
bottom10_hq_countries = hq_by_country.sort_values(by=['articles_per_capita'], ascending=True).head(10)[['country', 'articles_per_capita']]
bottom10_hq_countries.reset_index(inplace = True, drop = True)
bottom10_hq_countries["country"]

0       India
1    Thailand
2       Japan
3     Nigeria
4     Vietnam
5    Colombia
6      Uganda
7    Pakistan
8       Sudan
9        Iran
Name: country, dtype: object

5. Geographic regions by total coverage: A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [237]:
top_region_coverage= articles_by_region.sort_values(by=['articles_per_capita'], ascending=False)[['region', 'articles_per_capita']]
top_region_coverage.reset_index(inplace = True, drop = True)
top_region_coverage["region"]

0     SOUTHERN EUROPE
1           CARIBBEAN
2      WESTERN EUROPE
3      EASTERN EUROPE
4     NORTHERN EUROPE
5        WESTERN ASIA
6     SOUTHERN AFRICA
7             OCEANIA
8      EASTERN AFRICA
9       SOUTH AMERICA
10     WESTERN AFRICA
11       CENTRAL ASIA
12    CENTRAL AMERICA
13      MIDDLE AFRICA
14    NORTHERN AFRICA
15     SOUTHEAST ASIA
16         SOUTH ASIA
17          EAST ASIA
Name: region, dtype: object

6. Geographic regions by high quality coverage: Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [241]:
top_region_hq = hq_by_region.sort_values(by=['articles_per_capita'], ascending=False)[['region', 'articles_per_capita']]
top_region_hq.reset_index(inplace = True, drop = True)
top_region_hq["region"]

0     SOUTHERN EUROPE
1           CARIBBEAN
2      EASTERN EUROPE
3      WESTERN EUROPE
4        WESTERN ASIA
5     NORTHERN EUROPE
6     SOUTHERN AFRICA
7     CENTRAL AMERICA
8        CENTRAL ASIA
9      SOUTHEAST ASIA
10     EASTERN AFRICA
11     WESTERN AFRICA
12      SOUTH AMERICA
13    NORTHERN AFRICA
14      MIDDLE AFRICA
15            OCEANIA
16         SOUTH ASIA
17          EAST ASIA
Name: region, dtype: object