<center> <h1> DATA 516 HW2 </h1> </center>
<center> <h5> Win Nawat Suvansinpan </h5> </center>

The goal of this assignment is to look at Wikipedia articles on political figures from different countries while focusing on the bias within data. This is done by measuring the coverage and quality of Wikipedia articles on politicians across different countries.[[1](#appendix)]  
**The data sources are:**
- Politicians by Country from the English-language Wikipedia
    -  Data on most English-language Wikipedia articles within the category "Category:Politicians by nationality" and subcategories, along with the code used to generate that data.
    - https://figshare.com/articles/Untitled_Item/5513449
- Population data
    - Data on population size of countries in the world as of mid-2018
    - https://www.prb.org/international/indicator/population/table/

**Coverage and quality**  
A machine learning service called [ORES](https://www.mediawiki.org/wiki/ORES) (Objective Revision Evaluation Service) is used to estimate the quality of each article. The quality tiers, from highest to lowest, are:
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article
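
Because the tiers are ranked, they can be treated as an ordered categorical in pandas. The snippet below is a minimal sketch of that idea: `QUALITY_ORDER` and the sample predictions are hypothetical names and values chosen for illustration, and treating "GA or better" as the cut-off is an assumption rather than something fixed by ORES.

```python
import pandas as pd

# Quality tiers from lowest to highest (the reverse of the list above).
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]

# Hypothetical predictions, for illustration only.
sample = pd.Series(["GA", "Stub", "C", "FA"])
quality = pd.Categorical(sample, categories=QUALITY_ORDER, ordered=True)

# With an ordered categorical, comparisons follow the tier ranking.
# Assumption: "GA" or better is used as the threshold here.
is_high_quality = quality >= "GA"
print(list(is_high_quality))
```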

In [19]:
import json
import pandas as pd
import requests

## Part 1.1: Reading the data

In [20]:
page_data_csv = pd.read_csv("page_data.csv")
WPDS_2018_data_csv = pd.read_csv("WPDS_2018_data.csv")

Looking at the head of the dataframes.

In [21]:
print(page_data_csv.head())
print("df shape is " + str(page_data_csv.shape))

                                 page   country     rev_id
0  Template:ZambiaProvincialMinisters    Zambia  235107991
1                      Bir I of Kanem      Chad  355319463
2   Template:Zimbabwe-politician-stub  Zimbabwe  391862046
3     Template:Uganda-politician-stub    Uganda  391862070
4    Template:Namibia-politician-stub   Namibia  391862409
df shape is (47197, 3)


In [22]:
print(WPDS_2018_data_csv.head())
print("df shape is " + str(WPDS_2018_data_csv.shape))

  Geography Population mid-2018 (millions)
0    AFRICA                          1,284
1   Algeria                           42.7
2     Egypt                             97
3     Libya                            6.5
4   Morocco                           35.2
df shape is (207, 2)
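
The printed head shows values such as `1,284`, so the population column may be read as text rather than as numbers. One possible way to convert it is sketched below; the Series `pop_raw` is a hypothetical stand-in for the "Population mid-2018 (millions)" column, and this is not necessarily how the column was handled in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the population column, read as strings.
pop_raw = pd.Series(["1,284", "42.7", "97"])

# Drop the thousands separator, then convert to a numeric dtype.
pop_millions = pd.to_numeric(pop_raw.str.replace(",", "", regex=False))
print(pop_millions.tolist())
```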


## Part 1.2: Cleaning the data
As mentioned in the instructions [[2](#appendix)], both page_data.csv and WPDS_2018_data.csv contain some rows that need to be filtered out and/or ignored.  
- `page_data.csv`: Pages whose names start with "Template:" are _not_ Wikipedia articles.
- `WPDS_2018_data.csv`: Rows whose "Geography" value is entirely in upper case are regions (continents), not countries.

### 1.2.1: Working on `page_data.csv`  
Extracting rows _without_ the string "Template:" under "page".  
- `.str.contains(somestring)` returns a boolean Series that is True wherever the value contains _"somestring"_.
- The `~` negates the boolean Series, since we only want the rows where the result is False.
- The new dataframe is called `page_data`.
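
The filtering idea above can be tried on a toy Series first. This is a minimal sketch with made-up variable names (`pages`, `is_template`, `articles`). It also shows `.str.startswith`, which only matches the prefix, as a stricter alternative to `.str.contains`, which matches anywhere in the string; on this toy data the two agree.

```python
import pandas as pd

# Toy page names, taken from the head of page_data.csv shown earlier.
pages = pd.Series(["Template:Uganda-politician-stub", "Bir I of Kanem", "Yos Por"])

# True where the name contains the literal text "Template:" (regex disabled).
is_template = pages.str.contains("Template:", regex=False)

# ~ flips True/False, so only non-template rows are kept.
articles = pages[~is_template].reset_index(drop=True)

# Stricter alternative: only match names that begin with "Template:".
starts_with_template = pages.str.startswith("Template:")
print(articles.tolist())
```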

In [23]:
page_data = page_data_csv[~page_data_csv["page"].str.contains("Template:")]
# reassigning instead of using inplace=True avoids modifying a filtered slice in place
page_data = page_data.reset_index(drop=True)

# printing DF head to check results
page_data.head()

|   | page | country | rev_id |
|---|------|---------|--------|
| 0 | Bir I of Kanem | Chad | 355319463 |
| 1 | Information Minister of the Palestinian Nation... | Palestinian Territory | 393276188 |
| 2 | Yos Por | Cambodia | 393822005 |
| 3 | Julius Gregr | Czech Republic | 395521877 |
| 4 | Edvard Gregr | Czech Republic | 395526568 |


### 1.2.2: Working on `WPDS_2018_data.csv`  

Removing rows whose "Geography" value is entirely in upper case.  
- In a cell containing only uppercase letters, the last letter must be uppercase.
- In a country name with more than one letter, the last letter must be lowercase.
- Therefore, we only have to extract the last letter of each value in the "Geography" column and check its case.
- Keep all rows for which `.str.islower()` is True.
- The new dataframe is called `WPDS_2018_data`.
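
The last-letter check can be illustrated on a few names from the head of the population file. This is a small sketch with hypothetical variable names; it also shows that testing the whole string with `.str.isupper()` picks out the same region rows on this toy data.

```python
import pandas as pd

# Toy "Geography" values: two regions and two countries.
geo = pd.Series(["AFRICA", "Algeria", "NORTHERN AMERICA", "Egypt"])

# Check only the case of the final character of each name.
last_is_lower = geo.str[-1].str.islower()
countries = geo[last_is_lower].tolist()

# Alternative check on the whole string; spaces do not affect isupper().
whole_is_upper = geo.str.isupper()
regions = geo[whole_is_upper].tolist()

print(countries, regions)
```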

In [24]:
WPDS_2018_data = WPDS_2018_data_csv[WPDS_2018_data_csv["Geography"].str[-1].str.islower()]
# reassigning instead of using inplace=True avoids modifying a filtered slice in place
WPDS_2018_data = WPDS_2018_data.reset_index(drop=True)

# printing DF head to check results
WPDS_2018_data.head()

|   | Geography | Population mid-2018 (millions) |
|---|-----------|--------------------------------|
| 0 | Algeria | 42.7 |
| 1 | Egypt | 97.0 |
| 2 | Libya | 6.5 |
| 3 | Morocco | 35.2 |
| 4 | Sudan | 41.7 |


In [25]:
# sanity check: rows whose last letter is uppercase should all be regions, not countries
# saving these rows as WPDS_2018_data_region
WPDS_2018_data_region = WPDS_2018_data_csv[WPDS_2018_data_csv["Geography"].str[-1].str.isupper()]
WPDS_2018_data_region = WPDS_2018_data_region.reset_index(drop=True)

WPDS_2018_data_region.head(10)

|   | Geography | Population mid-2018 (millions) |
|---|-----------|--------------------------------|
| 0 | AFRICA | 1284 |
| 1 | NORTHERN AMERICA | 365 |
| 2 | LATIN AMERICA AND THE CARIBBEAN | 649 |
| 3 | ASIA | 4536 |
| 4 | EUROPE | 746 |
| 5 | OCEANIA | 41 |


## Part 2: ORES results on article quality
The steps to obtain the ORES prediction of article quality are taken from this [GitHub page](https://github.com/Ironholds/data-512-a2/).

In [26]:
headers = {'User-Agent' : 'https://github.com/winnawat', 'From' : 'nawats@uw.edu'}

def get_ores_data(revision_ids, headers):
    '''
    This function retrieves ORES data given a list of revision IDs.
    The headers identify the caller and are sent along with the request.
    Function taken from https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
    '''
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - joining all the revision IDs together, separated by | marks.
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers explicitly; otherwise the argument is silently ignored.
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [223]:
example_ids = [783381498, 807355596, 757539710]
ores_json = get_ores_data(example_ids, headers)

In [27]:
def get_prediction_from_ores(revision_ids,ores_response):
    '''
    A function to extract "prediction" from the JSON output given by function get_ores_data
    Returns a dataframe with columns ['rev_id','prediction']
    '''
    predictions = {}
    for revid in revision_ids:
        try:
            predictions[revid] = ores_response['enwiki']['scores'][str(revid)]['wp10']['score']['prediction']
        except KeyError:
            predictions[revid] = "No prediction"
        
    pred_df = pd.DataFrame.from_dict(predictions, orient='index', columns=["prediction"])
    pred_df.reset_index(inplace=True)
    pred_df.rename(columns={"index":"rev_id"},inplace=True)
    return pred_df
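
The nested lookup used above can be checked without calling the API by running it against a hand-written dictionary. The sketch below assumes the response nests as `enwiki -> scores -> <rev_id> -> wp10 -> score -> prediction`, which is the path the function reads. The `error` entry, the rev_ids, and the helper `extract_prediction` are made up for illustration and are not taken from a real ORES response.

```python
# Hand-written stand-in for an ORES response; all values are made up.
mock_response = {
    "enwiki": {
        "scores": {
            "111": {"wp10": {"score": {"prediction": "Stub"}}},
            # Assumed shape of a failed lookup: no "score" key is present.
            "222": {"wp10": {"error": {"message": "made-up error text"}}},
        }
    }
}

def extract_prediction(response, revid):
    """Return the predicted class for revid, or None if it is missing."""
    try:
        return response["enwiki"]["scores"][str(revid)]["wp10"]["score"]["prediction"]
    except KeyError:
        return None

# 111 has a score, 222 has an error entry, 333 is absent from the response.
results = {revid: extract_prediction(mock_response, revid) for revid in (111, 222, 333)}
print(results)
```

Both the error entry and the absent rev_id fall through to the `KeyError` branch, which is why `get_prediction_from_ores` records "No prediction" for them.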

The rev_ids are contained in the dataframe `page_data`. The list of rev_ids is extracted and passed to `get_ores_data` and `get_prediction_from_ores` in batches of 100.

In [73]:
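rev_ids = page_data['rev_id'].tolist()

# number of rev_ids; used as the upper bound when batching
last_id = len(rev_ids)

def make_api_calls(last_id, savecsv=True, filename="predictions.csv"):
    '''
    Queries ORES in batches of 100 rev_ids and collects the predictions.
    Saves the result to a CSV if savecsv is True; otherwise returns the dataframe.
    '''
    batches = []
    # starting at 0 and slicing forward so the final partial batch is not skipped
    for start in range(0, last_id, 100):
        subset_rev_ids = rev_ids[start:start + 100]
        ores_json = get_ores_data(subset_rev_ids, headers)
        batches.append(get_prediction_from_ores(subset_rev_ids, ores_json))
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    if batches:
        pred_df = pd.concat(batches, ignore_index=True)
    else:
        pred_df = pd.DataFrame(columns=['rev_id', 'prediction'])
    if savecsv:
        pred_df.to_csv(filename)
    else:
        return pred_df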
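
Splitting a list into batches of at most 100 can be sketched independently of the API. The helper `chunked` below is a hypothetical name, not part of the assignment code; the point of the example is that stepping from 0 and slicing forward also yields the final, shorter batch.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 made-up IDs split into batches of 100.
batches = list(chunked(list(range(250)), 100))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```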

In [74]:
# saving this as a csv so the loop does not have to be re-run
# this has to be run once. From now on the ores predictions will be read from CSV.

# make_api_calls(last_id)

## Part 3: Merging the data
The CSV generated from above is read and merged with the existing data frames.

In [115]:
predictions_csv = pd.read_csv("predictions.csv")
predictions = predictions_csv.drop(columns=['Unnamed: 0'])
page_data_pred = pd.merge(page_data, predictions, on='rev_id')

In [117]:
page_data_pred.tail()

|   | page | country | rev_id | prediction |
|---|------|---------|--------|------------|
| 46695 | Hal Bidlack | United States | 807481636 | C |
| 46696 | Yahya Jammeh | Gambia | 807482007 | GA |
| 46697 | Lucius Fairchild | United States | 807483006 | C |
| 46698 | Fahd of Saudi Arabia | Saudi Arabia | 807483153 | GA |
| 46699 | Francis Fessenden | United States | 807483270 | C |
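
`pd.merge` performs an inner join by default, so pages without a matching `rev_id` in the predictions are dropped silently. One way to see which rows would be lost is a left join with `indicator=True`, sketched here on toy data. The frames `pages` and `preds` are hypothetical and are not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for page_data and the predictions file.
pages = pd.DataFrame({"page": ["A", "B", "C"], "rev_id": [1, 2, 3]})
preds = pd.DataFrame({"rev_id": [1, 2], "prediction": ["Stub", "No prediction"]})

# A left join keeps every page; the _merge column records where each row matched.
merged = pd.merge(pages, preds, on="rev_id", how="left", indicator=True)

# Pages with no entry at all in the predictions.
unmatched = merged[merged["_merge"] == "left_only"]

# Pages that matched and received an actual quality class.
scored = merged[(merged["_merge"] == "both") & (merged["prediction"] != "No prediction")]
print(unmatched["page"].tolist(), scored["page"].tolist())
```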


## Appendix
<a id='appendix'></a>
[1]: DATA516 Course wiki, A2 assignment: https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)/Assignments#A2:_Bias_in_data