# Considering Bias in Data
## DATA512 Homework 2

In [264]:
import pandas as pd
import json, time, urllib.parse
import requests
import numpy as np

## Step 1: Getting the Article and Population Data

Below I load the provided politicians and population data into pandas dataframes. I also display the first five rows of each dataframe to get an idea of how they are structured.

In [117]:
politicians = pd.read_csv('politicians_by_country_SEPT.2022.csv - politicians_international_SEPT.2022.csv')
populations = pd.read_csv('population_by_country_2022.csv - population_by_country_2022.csv')

politicians.head()

Unnamed: 0,name,url,country
0,Shahjahan Noori,https://en.wikipedia.org/wiki/Shahjahan_Noori,Afghanistan
1,Abdul Ghafar Lakanwal,https://en.wikipedia.org/wiki/Abdul_Ghafar_Lak...,Afghanistan
2,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
3,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
4,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan


In [118]:
populations.head()

Unnamed: 0,Geography,Population (millions)
0,WORLD,7963.0
1,AFRICA,1419.0
2,NORTHERN AFRICA,251.0
3,Algeria,44.9
4,Egypt,103.5


## Step 2: Getting Article Quality Predictions

**Note:** Articles for which we can't find the current page revision id or get an article quality estimate are still added to the final dataframe, but with a null value for the article quality score.

### Part 1: Code for Page Info Requests API

### Creating Constants

Here I am creating constants that are used later in our functions. Most of these constants are taken from the example code. The changes are listed below:
- Change user-agent in request headers to my email
- ARTICLE_TITLES are from the politicians dataframe we ingested above

In [119]:
API_ENWIKIPEDIA_ENDPOINT = "https://en.wikipedia.org/w/api.php"

API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

REQUEST_HEADERS = {
    'User-Agent': '<kandulat@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

ARTICLE_TITLES = politicians['name'].to_list()

PAGEINFO_EXTENDED_PROPERTIES = ""

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",           # to simplify this should be a single page title at a time
    "prop": "info",
    "inprop": PAGEINFO_EXTENDED_PROPERTIES
}
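As a quick sanity check on the throttling constants, the sketch below recomputes the wait time and the request rate it implies. It also shows one way to fill in the request parameters from a copy of the template rather than the shared dictionary. `build_params` is a hypothetical helper introduced here for illustration; it is not part of the example code.

```python
# Constants mirror the values defined above.
API_LATENCY_ASSUMED = 0.002
API_THROTTLE_WAIT = (1.0 / 100.0) - API_LATENCY_ASSUMED

PAGEINFO_PARAMS_TEMPLATE = {
    "action": "query",
    "format": "json",
    "titles": "",
    "prop": "info",
    "inprop": "",
}

def build_params(title, template=PAGEINFO_PARAMS_TEMPLATE):
    """Hypothetical helper: return a copy of the template with the title filled in."""
    params = dict(template)      # shallow copy, so the shared template is untouched
    params["titles"] = title
    return params

params = build_params("Tayyab Agha")

# Wait time plus assumed latency is 10 ms per request,
# which caps the loop at roughly 100 requests per second.
max_requests_per_second = 1.0 / (API_THROTTLE_WAIT + API_LATENCY_ASSUMED)
print(round(API_THROTTLE_WAIT, 3), round(max_requests_per_second))
```

Copying the template is optional when requests are made one at a time, but it avoids surprises if the same dictionary is ever reused elsewhere.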


### Creating Functions
The first function created is request_pageinfo_per_article. This is mostly taken from the example code (wp_page_info_example).

In [120]:
def request_pageinfo_per_article(article_title = None, 
                                 endpoint_url = API_ENWIKIPEDIA_ENDPOINT, 
                                 request_template = PAGEINFO_PARAMS_TEMPLATE,
                                 headers = REQUEST_HEADERS):
    # Make sure we have an article title
    if not article_title: return None
    
    request_template['titles'] = article_title
        
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or any other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(endpoint_url, headers=headers, params=request_template)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

Below is a function that extracts the last revision id from the json object returned by request_pageinfo_per_article. It returns the lastrevid value as it appears in the response (an integer), or None if the article or its revision id cannot be found.

In [130]:
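def extract_lastrevid(json_response):
    if not json_response:
        return None # Return None for lastrevid if we are unable to find article
    # Use .get so a response without 'query' or 'pages' yields None instead of raising a KeyError
    data = json_response.get('query', {}).get('pages', {})
    for key in data:
        # A missing page has no 'lastrevid' entry, so .get returns None
        return data[key].get('lastrevid')
    return None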
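To make the extraction step concrete, here is a small self-contained sketch that runs the same kind of lookup on hand-written responses. The two dictionaries are made-up examples that follow the general shape of a `prop=info` reply (a page found, and a page reported as missing); the page ids, titles and revision id are placeholders, and `lastrevid_or_none` is an illustrative stand-in for the function above.

```python
def lastrevid_or_none(json_response):
    """Illustrative stand-in for extract_lastrevid, written defensively."""
    if not json_response:
        return None
    pages = json_response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("lastrevid")   # None when the page entry has no revision id
    return None

# Made-up responses; only the nesting mirrors the real API output.
found = {"query": {"pages": {"101": {"pageid": 101, "title": "Example A", "lastrevid": 123456789}}}}
missing = {"query": {"pages": {"-1": {"title": "Example B", "missing": ""}}}}

print(lastrevid_or_none(found))
print(lastrevid_or_none(missing))
print(lastrevid_or_none(None))
```

In all three cases the caller gets either a revision id or None, which is what lets the later steps keep the article in the dataframe with a null quality score.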

The next function loops through each of the politicians in our dataframe and requests the page info. For each response we extract the current page revision id and store it in a dictionary, with the politician name as the key and the revision id as the value. The function returns this dictionary, which is later used as input for the ORES API.

In [131]:
def find_all_pagerevisions(titles):
    pol_dict = {}
    for title in titles:
        json_response = request_pageinfo_per_article(title)
        curr_page_revision = extract_lastrevid(json_response)
        pol_dict[title] = curr_page_revision
    return pol_dict
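The loop above issues one request per title, which is the simplest approach. As a possible alternative, and to the best of my knowledge, the MediaWiki Action API also accepts several titles in a single `titles` parameter separated by `|`, typically up to 50 per request for ordinary clients. The sketch below only shows the batching step; `chunk_titles` and the placeholder titles are assumptions for illustration, and no request is made.

```python
def chunk_titles(titles, batch_size=50):
    """Yield pipe-joined groups of titles, at most batch_size titles per group."""
    for start in range(0, len(titles), batch_size):
        yield "|".join(titles[start:start + batch_size])

sample_titles = [f"Title {i}" for i in range(120)]   # placeholder titles
batches = list(chunk_titles(sample_titles))

# Number of batches, and how many titles ended up in each one.
print(len(batches), [batch.count("|") + 1 for batch in batches])
```

A batched response contains several pages at once and may report normalized titles, so mapping results back to the original names would need extra care. That trade-off is one reason to keep the one-title-per-request loop used here.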

Now we will run the find_all_pagerevisions function to get the current page revision id for each politician in our list.

In [132]:
pol_revision_dict = find_all_pagerevisions(ARTICLE_TITLES)

### Part 2: Code for ORES API

### Creating Constants

Here I am creating constants that are used later in our functions. Most of these constants are taken from the example code (wp_ores_example). The changes are listed below:
- Change user-agent in request headers to my email
- ARTICLE_REVISIONS are from the dictionary we created above with the politicians as the key and the current page revision as the value
- Latency, throttle and request headers variables were already created when using the page info API, so I will just use those instead of recreating the variables here

In [133]:
# The current ORES API endpoint
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
# A template for mapping to the URL
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

ARTICLE_REVISIONS = pol_revision_dict # Created above using page_info API

# This template lists the basic parameters for making an ORES request
ORES_PARAMS_TEMPLATE = {
    "context": "enwiki",        # which WMF project for the specified revid
    "revid" : "",               # the revision to be scored - this will probably change each call
    "model": "articlequality"   # the AI/ML scoring model to apply to the revision
}
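The endpoint and the parameter template are combined into one URL per revision. The sketch below shows that string assembly on its own, without making a request. `build_ores_url` is a hypothetical helper, and `123456789` is a placeholder revision id rather than a real one.

```python
API_ORES_SCORE_ENDPOINT = "https://ores.wikimedia.org/v3"
API_ORES_SCORE_PARAMS = "/scores/{context}/{revid}/{model}"

def build_ores_url(revid, context="enwiki", model="articlequality", features=False):
    """Hypothetical helper that mirrors how the request URL is assembled."""
    url = API_ORES_SCORE_ENDPOINT + API_ORES_SCORE_PARAMS.format(
        context=context, revid=revid, model=model)
    if features:
        url += "?features=true"   # optionally ask for the model features as well
    return url

print(build_ores_url(123456789))
```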

### Creating Functions

The below function is taken from the example code (wp_ores_example). It calls the ORES API and returns a json_object of the data about a particular article

In [134]:
def request_ores_score_per_article(article_revid = None, 
                                   endpoint_url = API_ORES_SCORE_ENDPOINT, 
                                   endpoint_params = API_ORES_SCORE_PARAMS, 
                                   request_template = ORES_PARAMS_TEMPLATE,
                                   headers = REQUEST_HEADERS,
                                   features=False):
    # Make sure we have an article revision id
    if not article_revid: return None
    
    # set the revision id into the template
    request_template['revid'] = article_revid
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # the features used by the ML model can sometimes be returned as well as scores
    if features:
        request_url = request_url+"?features=true"
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like ORES - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Below is a function that extracts the prediction from the json object returned by request_ores_score_per_article. It returns the prediction as a string, or None if no prediction is available.

In [135]:
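def extract_prediction(json_object):
    if not json_object:
        return None # Return None if article does not exist
    data = json_object.get('enwiki', {}).get('scores', {})
    for key in data:
        # A revision that could not be scored has no 'score' entry, so return None in that case
        return data[key].get('articlequality', {}).get('score', {}).get('prediction')
    return None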
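The note at the start of this step says that articles without a quality estimate keep a null score. The sketch below illustrates that behavior on hand-written responses. The dictionaries are made-up examples that follow the general nesting of an ORES v3 reply (one with a score, one with an error entry in place of the score); the revision ids and messages are placeholders, and `prediction_or_none` is an illustrative stand-in for the function above.

```python
def prediction_or_none(json_object, context="enwiki", model="articlequality"):
    """Illustrative stand-in for extract_prediction that never raises on a missing key."""
    if not json_object:
        return None
    scores = json_object.get(context, {}).get("scores", {})
    for rev_scores in scores.values():
        return rev_scores.get(model, {}).get("score", {}).get("prediction")
    return None

# Made-up responses; only the nesting mirrors the real API output.
scored = {"enwiki": {"scores": {"111": {"articlequality": {"score": {"prediction": "Start"}}}}}}
failed = {"enwiki": {"scores": {"222": {"articlequality": {"error": {"message": "placeholder"}}}}}}

print(prediction_or_none(scored))
print(prediction_or_none(failed))
```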

The next function loops through each article in the dictionary we created and finds the ORES score prediction for each one. It returns a dataframe with the article title (the politician name), the revision id, and the predicted article quality as columns.

In [136]:
def get_all_scores(title_dict):
    scores = []
    for key in title_dict:
        json_data = request_ores_score_per_article(title_dict[key])
        curr_score = extract_prediction(json_data)
        scores.append((key, title_dict[key], curr_score))
    scores_df = pd.DataFrame(scores, columns = ['article_title', 'revision_id', 'article_quality']) # Convert list of tuples to dataframe
    return scores_df

Now we will run the get_all_scores function to get a dataframe of the politicians' names and the associated ORES scores.

In [137]:
pol_scores_df = get_all_scores(ARTICLE_REVISIONS)

## Step 3: Combining the Datasets

**Note:** Some rows in the politicians data set do not differentiate between North and South Korea and label the country simply as 'Korean', whereas the populations data set lists North and South Korea separately. One possible solution would be to add the populations of North and South Korea together and use a single value for Korea. I chose not to do this and instead omit both North and South Korea altogether. The two countries have very different politics, which may affect both the population and the article scores, so it didn't make sense to combine them into one country.

The populations dataframe is laid out so that the countries are listed under their region. For example, below are the first few rows of the Geography column:


WORLD

AFRICA

NORTHERN AFRICA

Algeria

Egypt

Libya

Morocco

Sudan

Tunisia

Western Sahara

WESTERN AFRICA

Benin

Burkina Faso

Cape Verde

Cote d'Ivoire

This tells us that no countries are listed directly under the regions WORLD and AFRICA. Algeria, Egypt, Libya, Morocco, Sudan, Tunisia and Western Sahara are all part of Northern Africa. Benin, Burkina Faso, Cape Verde and Cote d'Ivoire are all part of Western Africa.

Now we will create a function that returns a dataframe mapping each country to its region. If the value of a row is in all capital letters, we know that every row below it, until we reach another all-capitals row, is a country that belongs to that region. We loop through each row in the Geography column of the dataframe. If the value is in all capital letters we store it as curr_region, along with its population. Otherwise we append a tuple with the country name, its population, curr_region and the region population. Once we reach another all-capitals row we update curr_region and repeat the procedure.

In [331]:
def create_region_df(pop_df):
    curr_region = 'WORLD'
    region_population = -1
    data = []
    for index, row in pop_df.iterrows(): # Iterate over the dataframe passed in, not the global one
        geo_val = row['Geography'] # Get value of current row
        if geo_val.isupper():
            # An all-uppercase value marks a new region, so remember its name and population
            curr_region = geo_val
            region_population = row['Population (millions)']
        else:
            # Otherwise the row is a country that belongs to the current region
            data.append((geo_val, row['Population (millions)'], curr_region, region_population))
    return pd.DataFrame(data, columns = ['country', 'population', 'region', 'region_population']) # Convert list of tuples to DataFrame and return
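To see the region mapping on a small input, the sketch below applies the same rule to the first five rows shown in the populations output above, using plain Python tuples instead of a dataframe. `map_countries_to_regions` is an illustrative name, not one used elsewhere in this notebook.

```python
# (Geography, Population in millions) for the first rows of the populations data.
rows = [
    ("WORLD", 7963.0),
    ("AFRICA", 1419.0),
    ("NORTHERN AFRICA", 251.0),
    ("Algeria", 44.9),
    ("Egypt", 103.5),
]

def map_countries_to_regions(rows):
    """Attach the most recent all-uppercase row (the region) to each country row."""
    current_region, region_population = None, None
    mapped = []
    for name, population in rows:
        if name.isupper():
            current_region, region_population = name, population
        else:
            mapped.append((name, population, current_region, region_population))
    return mapped

for record in map_countries_to_regions(rows):
    print(record)
```

WORLD and AFRICA produce no records of their own because they are immediately followed by another all-uppercase row. The approach relies on region names being the only all-uppercase values in the column.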

Next we will make a function to write a list to a text file, one value per line. We will use this to create the wp_countries-no_match.txt file later on.

In [332]:
def write_list_to_file(fil_name, list_name):
    # The with statement closes the file automatically, so no explicit close is needed
    with open(fil_name, 'w') as f:
        for val in list_name:
            f.write('%s\n' %val)

Next we will use create_region_df to build the regions dataframe and merge it with the other dataframes to create the final wp_politicians_by_country dataframe.

**Note:** Some politicians are listed under multiple countries. These politicians have the same article_quality score, recorded once for each country they are listed under.

**Note:** Although it is not specified in the homework schema, I am adding a column for the region population to aid region-specific analysis.

In [335]:
regions_df = create_region_df(populations) # Creating regions dataframe

# Join politicians dataframe with scores dataframe to get the politicians country
merged = pol_scores_df.merge(politicians, how = 'right', left_on = 'article_title', right_on = 'name')[['article_title', 'revision_id', 'article_quality', 'country']]

# Join regions dataframe with above dataframe to get the region and population for each country
wp_politicians_by_country = merged.merge(regions_df, how = 'inner', left_on = 'country', right_on = 'country')

# Save the dataframe we created as a csv
wp_politicians_by_country.to_csv('wp_politicians_by_country.csv')

# Take a look at the head of the dataframe to make sure it looks right
wp_politicians_by_country.head()

Unnamed: 0,article_title,revision_id,article_quality,country,population,region,region_population
0,Shahjahan Noori,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA,2008.0
1,Abdul Ghafar Lakanwal,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA,2008.0
2,Majah Ha Adrif,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA,2008.0
3,Haroon al-Afghani,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA,2008.0
4,Tayyab Agha,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA,2008.0


Now we are going to identify all the countries for which there are no matches and output a list of those countries. These are the countries in the populations dataframe that are not in the wp_politicians_by_country dataframe we created above, together with the countries in the politicians dataframe that are not in the populations dataframe.

In [336]:
# Get a list of all the countries, removing regions by checking if the element is all uppercase
all_countries = [x for x in populations['Geography'] if not x.isupper()]
# Find all countries in populations list that are not in the final dataframe
countries_no_match1 = [x for x in all_countries if x not in wp_politicians_by_country['country'].to_list()]
# Find all unique countries in politicians list that are not in populations
countries_no_match2 = list(set([x for x in politicians['country'] if x not in all_countries])) # Set is used to get unique values
countries_no_match = countries_no_match1 + countries_no_match2 # Concat the two lists
write_list_to_file('wp_countries-no_match.txt', countries_no_match) # Write list to file
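As a cross-check on the list comprehensions above, one alternative is an outer merge with `indicator=True`, which labels each row by the side it came from. This is only a sketch on toy data: the article titles are placeholders, `Atlantis` is an invented country with an invented population, and only the Algeria and Egypt populations are taken from the data above.

```python
import pandas as pd

articles = pd.DataFrame({
    "article_title": ["Person A", "Person B", "Person C"],   # placeholder titles
    "country": ["Algeria", "Korean", "Egypt"],
})
regions = pd.DataFrame({
    "country": ["Algeria", "Egypt", "Atlantis"],             # Atlantis is invented
    "population": [44.9, 103.5, 1.0],
})

# The _merge column is 'both', 'left_only' or 'right_only' for each row.
check = articles.merge(regions, how="outer", on="country", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)
```

Rows marked `left_only` correspond to countries that appear only in the politicians data, and `right_only` rows to countries that appear only in the populations data.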

## Step 4: Analysis

### Function to Clean Data

#### Remove all values with population 0

Since we are computing per capita values, we have to divide by the total population of each country. We cannot divide by 0, so we omit countries whose population is listed as 0.

#### Multiply the Populations by a Million

The populations for each country are given in millions, so we multiply the population column by one million to get the actual number of people in the country.

#### Create Column to Indicate if Article is High Quality Or Not
Now we will be creating a column to indicate if an article is high quality or not. This column will have a value of 1 if the article is high quality and 0 if not.

In [337]:
def clean_data(df):
    # Copy the filtered rows so new columns are added to an independent dataframe
    # (this avoids the chained assignment warning without disabling it globally)
    wp_sub = df[df.population != 0].copy()
    wp_sub['population_total'] = wp_sub['population'] * 1000000
    wp_sub['region_population'] = wp_sub['region_population'] * 1000000
    # Flag featured (FA) and good (GA) articles as high quality
    wp_sub['is_high_quality'] = np.where(wp_sub['article_quality'].isin(['FA', 'GA']), 1, 0)
    return wp_sub

# Run function and look at clean data
wp_sub = clean_data(wp_politicians_by_country)
wp_sub.head()

Unnamed: 0,article_title,revision_id,article_quality,country,population,region,region_population,population_total,is_high_quality
0,Shahjahan Noori,1099689000.0,GA,Afghanistan,41.1,SOUTH ASIA,2008000000.0,41100000.0,1
1,Abdul Ghafar Lakanwal,943562300.0,Start,Afghanistan,41.1,SOUTH ASIA,2008000000.0,41100000.0,0
2,Majah Ha Adrif,852404100.0,Start,Afghanistan,41.1,SOUTH ASIA,2008000000.0,41100000.0,0
3,Haroon al-Afghani,1095102000.0,B,Afghanistan,41.1,SOUTH ASIA,2008000000.0,41100000.0,0
4,Tayyab Agha,1104998000.0,Start,Afghanistan,41.1,SOUTH ASIA,2008000000.0,41100000.0,0
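The flagging rule can be checked by hand on the five rows shown above. The sketch below uses plain Python and adds one article with no prediction, to show that it is treated as not high quality while still counting as an article. `is_high_quality` here is an illustrative function, not the dataframe column itself.

```python
HIGH_QUALITY_CLASSES = {"FA", "GA"}   # featured and good articles

def is_high_quality(prediction):
    """Return 1 for FA or GA predictions and 0 for anything else, including None."""
    return 1 if prediction in HIGH_QUALITY_CLASSES else 0

# Predictions for the five rows shown above, plus one article without a score.
predictions = ["GA", "Start", "Start", "B", "Start", None]
flags = [is_high_quality(p) for p in predictions]
print(flags, sum(flags), len(flags))
```

Because unscored articles get a flag of 0, they lower a country's high quality share slightly. That is a consequence of keeping them in the dataframe rather than dropping them.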


### Create Function to group by a column and find per capita values
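Now we will create a function that takes the name of the column to group by and the dataframe to aggregate. It groups by that column and computes the mean of the population columns, the count of article titles, and the sum of the high quality flags. Country population is constant within a country and region population is constant within a region, so the mean simply recovers that value. The article counts are then divided by the country population when grouping by country, and by the region population otherwise, to get the per capita values.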

In [341]:
def get_values_per_capita(group_column, df):
    # Create dictionary for column names
    col_names = {'population_total': 'population_total', 'is_high_quality': 'num_high_quality_articles', 'article_title': 'num_total_articles', 'region_population': 'region_population'}

    # Get mean of population and sum of article_titles and high quality articles
    wp_grouped = df.groupby(group_column).agg({'population_total': 'mean', 'is_high_quality': 'sum', 'article_title': 'count', 'region_population' : 'mean'}).rename(columns = col_names)
    if group_column == 'country':
        wp_grouped['total_articles_per_capita'] = wp_grouped.num_total_articles/wp_grouped.population_total
        wp_grouped['hq_articles_per_capita'] = wp_grouped.num_high_quality_articles/wp_grouped.population_total
    else:
        wp_grouped['total_articles_per_capita'] = wp_grouped.num_total_articles/wp_grouped.region_population
        wp_grouped['hq_articles_per_capita'] = wp_grouped.num_high_quality_articles/wp_grouped.region_population
    return wp_grouped
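As a worked check of the per capita arithmetic, the sketch below recomputes three values by hand from counts and populations that appear in this notebook's output tables: Afghanistan (118 articles, 6 high quality, 41.1 million people) and the Caribbean region (201 articles, 44 million people). `per_capita` is an illustrative helper, not part of the function above.

```python
import math

def per_capita(article_count, population_millions):
    """Articles per person, given a population expressed in millions."""
    return article_count / (population_millions * 1_000_000)

afghanistan_total = per_capita(118, 41.1)
afghanistan_hq = per_capita(6, 41.1)
caribbean_total = per_capita(201, 44.0)

print(f"{afghanistan_total:.6e} {afghanistan_hq:.6e} {caribbean_total:.6e}")

# These should agree with the values reported in the grouped dataframes.
assert math.isclose(afghanistan_total, 2.871046e-06, rel_tol=1e-6)
assert math.isclose(afghanistan_hq, 1.459854e-07, rel_tol=1e-6)
assert math.isclose(caribbean_total, 4.568182e-06, rel_tol=1e-6)
```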

In [370]:
articles_per_country = get_values_per_capita("country", wp_sub)
articles_per_region = get_values_per_capita('region', wp_sub)

In [371]:
articles_per_country.head()

country,population_total,num_high_quality_articles,num_total_articles,region_population,total_articles_per_capita,hq_articles_per_capita
Afghanistan,41100000.0,6,118,2008000000.0,2.871046e-06,1.459854e-07
Albania,2800000.0,6,83,151000000.0,2.964286e-05,2.142857e-06
Algeria,44900000.0,0,34,251000000.0,7.572383e-07,0.0
Andorra,100000.0,2,10,151000000.0,0.0001,2e-05
Angola,35600000.0,0,42,196000000.0,1.179775e-06,0.0


In [372]:
articles_per_region.head()

region,population_total,num_high_quality_articles,num_total_articles,region_population,total_articles_per_capita,hq_articles_per_capita
CARIBBEAN,6166667.0,8,201,44000000.0,4.568182e-06,1.818182e-07
CENTRAL AMERICA,9003590.0,10,195,178000000.0,1.095506e-06,5.617978e-08
CENTRAL ASIA,16871700.0,3,106,78000000.0,1.358974e-06,3.846154e-08
EAST ASIA,88576420.0,16,246,1674000000.0,1.469534e-07,9.557945e-09
EASTERN AFRICA,29335380.0,15,650,473000000.0,1.374207e-06,3.171247e-08


## Step 5: Results

### Top 10 Countries by Coverage
The 10 countries with the highest total articles per capita (in descending order) 

In [373]:
articles_per_country.sort_values(by = 'total_articles_per_capita', ascending = False).head(10)[['total_articles_per_capita']]

country,total_articles_per_capita
Antigua and Barbuda,0.00017
Federated States of Micronesia,0.00013
Andorra,0.0001
Barbados,9.3e-05
Marshall Islands,9e-05
Montenegro,6e-05
Seychelles,6e-05
Luxembourg,5.3e-05
Bhutan,5.1e-05
Grenada,5e-05


### Bottom 10 Countries by Coverage
The 10 countries with the lowest total articles per capita (in ascending order) 

In [374]:
articles_per_country.sort_values(by = 'total_articles_per_capita').head(10)[['total_articles_per_capita']]

country,total_articles_per_capita
China,1.392176e-09
Mexico,7.843137e-09
Saudi Arabia,8.174387e-08
Romania,1.052632e-07
India,1.263054e-07
Sri Lanka,1.339286e-07
Egypt,1.352657e-07
Ethiopia,2.025932e-07
Taiwan,2.155172e-07
Vietnam,2.716298e-07


### Top 10 Countries by High Quality
The 10 countries with the highest high quality articles per capita (in descending order)

In [375]:
articles_per_country.sort_values(by = 'hq_articles_per_capita', ascending = False).head(10)[['hq_articles_per_capita']]

country,hq_articles_per_capita
Andorra,2e-05
Montenegro,5e-06
Albania,2.142857e-06
Suriname,1.666667e-06
Bosnia-Herzegovina,1.470588e-06
Lithuania,1.071429e-06
Croatia,1.052632e-06
Slovenia,9.52381e-07
Palestinian Territory,9.259259e-07
Gabon,8.333333e-07


### Bottom 10 Countries by High Quality
The 10 countries with the lowest high quality articles per capita (in ascending order)

**Note:** All of the bottom 10 countries have 0 high quality articles, so they are tied on high quality articles per capita. To break the tie I ordered the countries with 0 high quality articles by population, largest first.

In [376]:
articles_per_country.sort_values(by = ['hq_articles_per_capita', 'population_total'], ascending = [True, False]).head(10)[['population_total', 'hq_articles_per_capita']]

country,population_total,hq_articles_per_capita
China,1436600000.0,0.0
Brazil,214800000.0,0.0
Bangladesh,171200000.0,0.0
Mexico,127500000.0,0.0
Egypt,103500000.0,0.0
"Congo, Dem. Rep.",99000000.0,0.0
Turkey,85200000.0,0.0
Tanzania,65500000.0,0.0
Italy,58900000.0,0.0
Argentina,46200000.0,0.0
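The tie-breaking order used for this table can be sketched in plain Python with a composite sort key: ascending on high quality articles per capita, then descending on population. The rows below are taken from the tables in this notebook (China, Brazil and Egypt from the table above, Andorra from the per-country output), and the tuple layout is an assumption made for the example.

```python
# (country, population_total, hq_articles_per_capita)
countries = [
    ("Egypt", 103_500_000, 0.0),
    ("China", 1_436_600_000, 0.0),
    ("Andorra", 100_000, 2e-05),
    ("Brazil", 214_800_000, 0.0),
]

# Lowest per capita value first; among ties, the largest population first.
ranked = sorted(countries, key=lambda row: (row[2], -row[1]))
print([name for name, _, _ in ranked])
```

Ordering the tied countries by population puts the countries where the absence of high quality articles affects the most people at the top of the list.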


### Geographic regions by total coverage
A rank ordered list of geographic regions (in descending order) by total articles per capita.

In [377]:
articles_per_region.sort_values(by = 'total_articles_per_capita', ascending = False)[['total_articles_per_capita']]

region,total_articles_per_capita
SOUTHERN EUROPE,5.880795e-06
CARIBBEAN,4.568182e-06
WESTERN EUROPE,3.472081e-06
EASTERN EUROPE,2.56446e-06
NORTHERN EUROPE,2.448598e-06
WESTERN ASIA,2.336735e-06
SOUTHERN AFRICA,1.710145e-06
OCEANIA,1.636364e-06
EASTERN AFRICA,1.374207e-06
CENTRAL ASIA,1.358974e-06


### Geographic regions by high quality coverage
Rank ordered list of geographic regions (in descending order) by high quality articles per capita.

In [379]:
articles_per_region.sort_values(by = 'hq_articles_per_capita', ascending = False)[['hq_articles_per_capita']]

region,hq_articles_per_capita
SOUTHERN EUROPE,3.046358e-07
CARIBBEAN,1.818182e-07
EASTERN EUROPE,1.324042e-07
WESTERN EUROPE,1.116751e-07
WESTERN ASIA,9.52381e-08
NORTHERN EUROPE,7.476636e-08
SOUTHERN AFRICA,5.797101e-08
CENTRAL AMERICA,5.617978e-08
CENTRAL ASIA,3.846154e-08
SOUTHEAST ASIA,3.550296e-08
