<center> <h1> DATA 516 HW2 </h1>
<center> <h5> Win Nawat Suvansinpan </h5>

The goal of this assignment is to look at Wikipedia articles on political figures from different countries while focusing on the bias within data. This is done by measuring the coverage and quality of Wikipedia articles on politicians across different countries.[[1](#appendix)]  
**The data sources are:**
- Politicians by Country from the English-language Wikipedia
    -  Data on most English-language Wikipedia articles within the category "Category:Politicians by nationality" and subcategories, along with the code used to generate that data.
    - https://figshare.com/articles/Untitled_Item/5513449
- Population data
    - Data on population size of countries in the world as of mid-2018
    - https://www.prb.org/international/indicator/population/table/

**Coverage and quality**  
A machine learning service called [ORES](https://www.mediawiki.org/wiki/ORES) (Objective Revision Evaluation Service) is used to estimate the quality of each article. The quality classes, from highest to lowest, are:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article
  
Coverage is defined as the number of politician articles for a country relative to that country's population.


### Notes
[This cell](#api)  calls the function that generates the predictions through a for loop of small API calls.  
It takes a while to run and can be skipped if `predictions.csv` exists and does not need to be updated.  
The cell is commented out.

## Part 0: Importing necessary packages

In [16]:
import json
import numpy as np
import pandas as pd
import requests

## Part 1.1: Reading the data

In [17]:
page_data_csv = pd.read_csv("page_data.csv")
WPDS_2018_data_csv = pd.read_csv("WPDS_2018_data.csv")

Looking at the head of the dataframes.

In [18]:
print(page_data_csv.head())
print("df shape is " + str(page_data_csv.shape))

                                 page   country     rev_id
0  Template:ZambiaProvincialMinisters    Zambia  235107991
1                      Bir I of Kanem      Chad  355319463
2   Template:Zimbabwe-politician-stub  Zimbabwe  391862046
3     Template:Uganda-politician-stub    Uganda  391862070
4    Template:Namibia-politician-stub   Namibia  391862409
df shape is (47197, 3)


In [19]:
print(WPDS_2018_data_csv.head())
print("df shape is " + str(WPDS_2018_data_csv.shape))

  Geography Population mid-2018 (millions)
0    AFRICA                          1,284
1   Algeria                           42.7
2     Egypt                             97
3     Libya                            6.5
4   Morocco                           35.2
df shape is (207, 2)


## Part 1.2: Cleaning the data
As mentioned in the instructions [[2](#appendix)], both page_data.csv and WPDS_2018_data.csv contain some rows that need to be filtered out and/or ignored.  
- `page_data.csv`: Pages whose names start with "Template:" are _not_ Wikipedia articles.
- `WPDS_2018_data.csv`: Rows whose "Geography" value is written entirely in upper case are regions (continents), not countries.

### 1.2.1: Working on `page_data.csv`  
Extracting rows _without_ the string "Template:" under "page".  
- `.str.contains(somestring)` returns a boolean Series that marks which entries contain _"somestring"_.
- The `~` negates that boolean Series, since we only want the rows where the result is False.
- The new dataframe is called `page_data`.

In [20]:
page_data = page_data_csv[~page_data_csv["page"].str.contains("Template:")]
page_data.reset_index(drop=True, inplace=True)

# printing DF head to check results
page_data.head()

Unnamed: 0,page,country,rev_id
0,Bir I of Kanem,Chad,355319463
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
2,Yos Por,Cambodia,393822005
3,Julius Gregr,Czech Republic,395521877
4,Edvard Gregr,Czech Republic,395526568
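The filter above matches "Template:" anywhere in the page name, while the cleaning rule is stated in terms of names that _start_ with it. The sketch below compares the two checks on a few titles; the third title is made up for illustration, and whether such a title exists in `page_data.csv` is not known from this notebook.

```python
import pandas as pd

# Illustrative page titles; the third one is hypothetical.
titles = pd.Series([
    "Template:Zambia-politician-stub",
    "Bir I of Kanem",
    "List of Template: examples",  # made-up title with "Template:" mid-string
])

# Substring match: flags any title containing "Template:" anywhere.
is_template_contains = titles.str.contains("Template:", regex=False)

# Prefix match: flags only titles that begin with "Template:".
is_template_prefix = titles.str.startswith("Template:")

kept_contains = titles[~is_template_contains].tolist()
kept_prefix = titles[~is_template_prefix].tolist()
print(kept_contains)
print(kept_prefix)
```

If no article title contains "Template:" after its first character, both filters keep the same rows.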


### 1.2.2: Working on `WPDS_2018_data.csv`  

Removing rows whose "Geography" value is entirely upper case.  
- A region name written entirely in upper case must end in an upper-case letter.
- A country name with more than one letter ends in a lower-case letter.
- Therefore, it is enough to extract the last letter of each value in the "Geography" column and check its case.
- Keep the rows for which `.str.islower()` is True.
- The new dataframe is called `WPDS_2018_data`.

In [21]:
WPDS_2018_data = WPDS_2018_data_csv[WPDS_2018_data_csv["Geography"].str[-1].str.islower()]
WPDS_2018_data.reset_index(drop=True, inplace=True)

# printing DF head to check results
WPDS_2018_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,Algeria,42.7
1,Egypt,97.0
2,Libya,6.5
3,Morocco,35.2
4,Sudan,41.7


In [22]:
# sanity check on all .isupper() results. Expect all regional names instead of countries.
# save that as WPDS_2018_data_region
WPDS_2018_data_region = WPDS_2018_data_csv[WPDS_2018_data_csv["Geography"].str[-1].str.isupper()]
WPDS_2018_data_region.reset_index(drop=True, inplace=True)

WPDS_2018_data_region.head(10)

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284
1,NORTHERN AMERICA,365
2,LATIN AMERICA AND THE CARIBBEAN,649
3,ASIA,4536
4,EUROPE,746
5,OCEANIA,41
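As a cross-check on the last-letter rule, the whole string can be tested with `.str.isupper()` instead. The sketch below runs both checks on a small hand-picked sample (not the full file) and assumes, as the notebook does, that every country name ends in a lower-case letter.

```python
import pandas as pd

# Small illustrative sample of "Geography" values.
geography = pd.Series([
    "AFRICA",
    "Algeria",
    "LATIN AMERICA AND THE CARIBBEAN",
    "Cote d'Ivoire",
])

# Rule used in the notebook: a country name ends in a lower-case letter.
last_letter_lower = geography.str[-1].str.islower()

# Alternative rule: a country name is not written entirely in upper case.
whole_not_upper = ~geography.str.isupper()

print(last_letter_lower.tolist())
print(whole_not_upper.tolist())
```

On this sample the two rules select the same rows; they would differ only for a name that mixes cases but ends in an upper-case letter.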


## Part 2: ORES results on article quality
The steps to obtain the ORES prediction of article quality are taken from this [Github page](https://github.com/Ironholds/data-512-a2/).  
Articles are categorized into the following categories:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article

In [23]:
headers = {'User-Agent' : 'https://github.com/winnawat', 'From' : 'nawats@uw.edu'}

def get_ores_data(revision_ids, headers):
    '''
    This function retrieves ORES data given a list of revision IDs.
    Function taken from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
    '''
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - joining all the revision IDs together, separated by | marks.
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so that the request identifies the caller
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [24]:
def get_prediction_from_ores(revision_ids,ores_response):
    '''
    A function to extract "prediction" from the JSON output given by function get_ores_data
    Returns a dataframe with columns ['rev_id','prediction']
    '''
    predictions = {}
    for revid in revision_ids:
        try:
            predictions[revid] = ores_response['enwiki']['scores'][str(revid)]['wp10']['score']['prediction']
        except KeyError:
            predictions[revid] = "No prediction"
        
    pred_df = pd.DataFrame.from_dict(predictions, orient='index', columns=["prediction"])
    pred_df.reset_index(inplace=True)
    pred_df.rename(columns={"index":"rev_id"},inplace=True)
    return pred_df
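The extraction logic can be tried without calling the API. The sketch below uses a hand-written mock response that follows the key path used in `get_prediction_from_ores`; the shape of the failed entry (an `error` key with no `score`) is an assumption, and `extract_predictions` is a hypothetical helper that returns a plain dict instead of a dataframe.

```python
# Hand-written mock of an ORES v3 response (not real API output).
mock_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            # Assumed shape of a failed lookup: an "error" entry and no "score" key.
            "807484325": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

def extract_predictions(revision_ids, ores_response):
    """Map each revision ID to its predicted class, or to "No prediction"."""
    scores = ores_response["enwiki"]["scores"]
    predictions = {}
    for revid in revision_ids:
        try:
            predictions[revid] = scores[str(revid)]["wp10"]["score"]["prediction"]
        except KeyError:
            # Covers both a missing revision ID and an entry without a score.
            predictions[revid] = "No prediction"
    return predictions

result = extract_predictions([355319463, 807484325, 1], mock_response)
print(result)
```

Any revision ID that is absent from the response, or present without a score, falls back to "No prediction".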

The revision IDs are stored in the `rev_id` column of the dataframe `page_data`. They are extracted as a list and passed in batches to `get_ores_data` and `get_prediction_from_ores`.

In [40]:
rev_ids = page_data['rev_id'].tolist()

# number of revision IDs; used as the exclusive upper bound when slicing rev_ids
last_id = len(rev_ids)

def make_api_calls(last_id, savecsv=True, filename="predictions.csv"):
    '''
    A function which makes API calls for rev_ids[0:last_id], 100 IDs at a time.
    Saves the predictions to `filename` if savecsv is True; otherwise returns them as a dataframe.
    '''
    frames = []
    
    # making API calls, 100 rev IDs at a time; the final batch holds the remainder
    for start in range(0, last_id, 100):
        subset_rev_ids = rev_ids[start:min(start + 100, last_id)]
        ores_json = get_ores_data(subset_rev_ids, headers)
        frames.append(get_prediction_from_ores(subset_rev_ids, ores_json))
    
    # DataFrame.append was removed in pandas 2.0, so the batches are combined with pd.concat
    if frames:
        pred_df = pd.concat(frames, ignore_index=True)
    else:
        pred_df = pd.DataFrame(columns=['rev_id', 'prediction'])
    
    # either save to csv or return a dataframe
    if savecsv:
        pred_df.to_csv(filename)
    else:
        return pred_df
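The batching in `make_api_calls` can be checked in isolation, without any network access. The sketch below is a minimal illustration; `chunked` is a hypothetical helper name, and the list of 250 integers stands in for real revision IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 stand-in IDs should produce two full batches and one remainder batch.
fake_rev_ids = list(range(250))
batches = list(chunked(fake_rev_ids, 100))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Every ID appears in exactly one batch, so no revision is skipped or requested twice.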

<a id='api'></a>

In [49]:
# saving this as a csv so the loop does not have to be re-run
# this has to be run once. From now on the ores predictions will be read from CSV.

# make_api_calls(last_id)

## Part 3: Merging the data and generating CSV files
The CSV generated above is read and merged with the existing data frames.  
Rows whose country has no match in the other dataset are written to `wp_wpds_countries-no_match.csv`.  
The rest of the data goes into `wp_wpds_politicians_by_country.csv`.

In [50]:
predictions_csv = pd.read_csv("predictions.csv")
predictions = predictions_csv.drop(columns=['Unnamed: 0'])
page_data_pred = pd.merge(page_data, predictions, on='rev_id', how='outer')

In [51]:
page_data_pred.tail()

Unnamed: 0,page,country,rev_id,prediction
46696,Yahya Jammeh,Gambia,807482007,GA
46697,Lucius Fairchild,United States,807483006,C
46698,Fahd of Saudi Arabia,Saudi Arabia,807483153,GA
46699,Francis Fessenden,United States,807483270,C
46700,Ajay Kannoujiya,India,807484325,No prediction


In [52]:
WPDS_2018_data.tail()

Unnamed: 0,country,Population mid-2018 (millions)
195,Samoa,0.2
196,Solomon Islands,0.7
197,Tonga,0.1
198,Tuvalu,0.01
199,Vanuatu,0.3


Renaming 'Geography' column in `WPDS_2018_data` to 'country'

In [53]:
WPDS_2018_data = WPDS_2018_data.rename(columns={'Geography':'country'})
WPDS_2018_data.tail()

Unnamed: 0,country,Population mid-2018 (millions)
195,Samoa,0.2
196,Solomon Islands,0.7
197,Tonga,0.1
198,Tuvalu,0.01
199,Vanuatu,0.3


Merging `WPDS_2018_data` to `page_data_pred` by 'country'

In [145]:
page_data_pred_country = pd.merge(page_data_pred, WPDS_2018_data, on='country', how='outer')
page_data_pred_country.tail()

Unnamed: 0,page,country,rev_id,prediction,Population mid-2018 (millions)
46716,,French Polynesia,,,0.3
46717,,Guam,,,0.2
46718,,New Caledonia,,,0.3
46719,,Palau,,,0.02
46720,,Samoa,,,0.2
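One way to see which rows of an outer merge found no partner is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The sketch below uses toy frames; "Atlantis" and "Example Person" are made-up values, and this is an alternative to the NaN checks used in this notebook, not the method the notebook itself applies.

```python
import pandas as pd

# Toy article table; the last row has a country with no population entry.
articles = pd.DataFrame({
    "page": ["Bir I of Kanem", "Yos Por", "Example Person"],
    "country": ["Chad", "Cambodia", "Atlantis"],  # "Atlantis" is made up
})

# Toy population table; Samoa has no articles in the toy article table.
population = pd.DataFrame({
    "country": ["Chad", "Cambodia", "Samoa"],
    "population": [15.4, 16.0, 0.2],
})

# indicator=True labels each row as "left_only", "right_only", or "both".
merged = pd.merge(articles, population, on="country", how="outer", indicator=True)

no_match = merged[merged["_merge"] != "both"]
matched = merged[merged["_merge"] == "both"]
print(no_match)
```

Rows labelled `left_only` are articles without population data, and rows labelled `right_only` are countries without articles.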


Formatting the column names to the specified schema  

|Column|
|---|
|country|
|article_name|
|revision_id|
|article_quality|
|population|

In [146]:
page_data_pred_country = page_data_pred_country.rename(columns={'page':'article_name',
                                       'rev_id':'revision_id',
                                       'prediction':'article_quality',
                                       'Population mid-2018 (millions)':'population'})

Checking for rows with NaN.

In [147]:
page_data_pred_country.isna().sum()

article_name         20
country               0
revision_id          20
article_quality      20
population         2083
dtype: int64

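There are 20 rows with `article_name`, `revision_id`, and `article_quality` missing (countries with population data but no articles) and 2083 rows with `population` missing (articles whose country has no population entry).  
Saving these rows in a separate CSV `wp_wpds_countries-no_match.csv` before dropping them.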

In [148]:
pop_na = page_data_pred_country[page_data_pred_country['population'].isna()]
page_na = page_data_pred_country[page_data_pred_country['article_name'].isna()]

In [149]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead
all_na = pd.concat([pop_na, page_na])
all_na.reset_index(drop=True, inplace=True)
all_na.to_csv('wp_wpds_countries-no_match.csv')

Dropping all rows with NaN values and making sure the `revision_id` column is an integer.

In [150]:
page_data_pred_country = page_data_pred_country.dropna()
page_data_pred_country.reset_index(drop=True, inplace=True)
page_data_pred_country = page_data_pred_country.astype({'revision_id': 'int'})
page_data_pred_country.tail()

Unnamed: 0,article_name,country,revision_id,article_quality,population
44613,Rita Sinon,Seychelles,800323154,Stub,0.1
44614,Sylvette Frichot,Seychelles,800323798,Stub,0.1
44615,May De Silva,Seychelles,800969960,Start,0.1
44616,Vincent Meriton,Seychelles,802051093,Stub,0.1
44617,Marie-Louise Potter,Seychelles,804209620,Stub,0.1


Saving the dataframe to CSV

In [151]:
page_data_pred_country.to_csv('wp_wpds_politicians_by_country.csv')

## Part 4: Analysis

To estimate the coverage and quality of Wikipedia articles for each country, the following data has to be aggregated by country:
- Number of articles
- Number of high-quality articles
    - High-quality articles are those labelled FA or GA

In [152]:
# Labeling the high-quality articles
page_data_pred_country['HQA'] = page_data_pred_country['article_quality'].isin(['FA','GA']).astype(int)

In [153]:
# Checking the head of the table
page_data_pred_country.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population,HQA
0,Bir I of Kanem,Chad,355319463,Stub,15.4,0
1,Abdullah II of Kanem,Chad,498683267,Stub,15.4,0
2,Salmama II of Kanem,Chad,565745353,Stub,15.4,0
3,Kuri I of Kanem,Chad,565745365,Stub,15.4,0
4,Mohammed I of Kanem,Chad,565745375,Stub,15.4,0


In [157]:
# Checking the high quality articles
page_data_pred_country[page_data_pred_country['HQA'] == 1].head()

Unnamed: 0,article_name,country,revision_id,article_quality,population,HQA
61,Mahamat Nouri,Chad,792954115,FA,15.4,1
83,Hissène Habré,Chad,803166806,GA,15.4,1
262,Norodom Chakrapong,Cambodia,788905950,GA,16.0,1
282,Norodom Sihanouk,Cambodia,799302232,FA,16.0,1
303,Nuon Chea,Cambodia,805876135,GA,16.0,1


Aggregating the data
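A minimal sketch of such an aggregation is shown below on toy rows that follow the schema above. The output column names and the choice of "articles per million people" as the coverage measure are assumptions made for illustration, and the toy values are not results.

```python
import pandas as pd

# Toy rows in the notebook's schema; values are illustrative only.
df = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Cambodia", "Cambodia"],
    "article_quality": ["Stub", "FA", "Stub", "GA", "Start"],
    "population": ["15.4", "15.4", "15.4", "16.0", "16.0"],  # millions, read as text
})
df["HQA"] = df["article_quality"].isin(["FA", "GA"]).astype(int)

# Strip thousands separators in case the population column was read as text.
df["population"] = pd.to_numeric(
    df["population"].astype(str).str.replace(",", "", regex=False)
)

# One row per country: article count, high-quality count, population.
by_country = df.groupby("country").agg(
    articles=("article_quality", "count"),
    high_quality=("HQA", "sum"),
    population_millions=("population", "first"),
)

# Coverage (assumed measure): articles per million people.
by_country["articles_per_million"] = (
    by_country["articles"] / by_country["population_millions"]
)
# Quality: share of a country's articles that are FA or GA.
by_country["high_quality_share"] = by_country["high_quality"] / by_country["articles"]
print(by_country)
```

Sorting `by_country` on either derived column then gives a ranking of countries by coverage or by relative quality.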

## Appendix
<a id='appendix'></a>
[1]: DATA516 Course wiki, A2 assignment: https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data