# A2: Bias in Data

The goal of this assignment is to examine bias in the English version of Wikipedia by looking at the articles about politicians from different countries.
Bias is evaluated with the following metrics:
1. Ratio of the number of politician articles to the country's population
2. Ratio of the number of high-quality politician articles to the total number of politician articles

The output lists the 10 highest-ranked and 10 lowest-ranked countries for each metric.

In [1]:
# importing packages
import pandas as pd
import numpy as np
import requests

## Data Retrieval
The data for the analysis comes from 2 sources.

Source 1:  
Wikipedia Data for all political pages by country  
(Link: https://figshare.com/articles/Untitled_Item/5513449)    
Download the zip file and extract it; page_data.csv is the file of interest. Alternatively, clone the entire GitHub repository. The file paths below are written so that the relevant files are read when the Jupyter notebook runs from the root directory of the repository.

Source 2:  
Population Data for "almost" all the countries  
(Link: https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0)  
You can download the file and read it (not required if cloning from GitHub).

Columns in the Wikipedia data:
* page: Name of the Wikipedia page that contains the politician related article 
* country: the country to which the politician belongs
* rev_id: the revision id that identifies the last revision to the said page

Columns in the population data:
* Geography: country name
* Population mid-2018 (millions): population size (in millions)

In [2]:
# reading the population data
pop_data = pd.read_csv('./raw_data/WPDS_2018_data.csv')

# reading the page data
page_data =  pd.read_csv('./raw_data/country/country/data/page_data.csv')

We also need the article quality scores, which we fetch from [ORES](https://www.mediawiki.org/wiki/ORES), a machine learning service provided by Wikimedia. The service estimates the quality of an article and assigns it to one of 6 classes based on the estimated quality of the written text (FA, Featured Article, and GA, Good Article, are the high-quality classes we are looking for).

As per ORES [documentation](https://www.mediawiki.org/wiki/ORES#Article_quality) about article quality, the categories are defined as follows (ordered from best to worst):  
  
1. FA - Featured article  
2. GA - Good article  
3. B - B-class article  
4. C - C-class article  
5. Start - Start-class article  
6. Stub - Stub-class article  


To query the API programmatically, we loop over the revision IDs and send at most 100 of them per request because of rate restrictions.
The results are collected in a Pandas DataFrame.
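
The batching step can be sketched in isolation. This is a minimal illustration rather than the notebook's exact code: `chunked` is a hypothetical helper name, and the revision IDs are made up.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# made-up revision IDs, used only to illustrate the batching
example_rev_ids = list(range(1000, 1250))

batches = list(chunked(example_rev_ids, 100))

# each batch becomes one pipe-separated `revids` value for a single request
revids_params = ['|'.join(str(r) for r in batch) for batch in batches]

print([len(b) for b in batches])
```

With 250 IDs this prints `[100, 100, 50]`: the last batch is simply shorter, so no special handling is needed for a total that is not a multiple of 100.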

The JSON response for a single revision ID from ORES looks like this:  
{
    "enwiki": {
        "models": {
            "wp10": {
                "version": "0.5.0"
            }
        },
        "scores": {
            "757539710": {
                "wp10": {
                    "score": {
                        "prediction": "Start",
                        "probability": {
                            "B": 0.0950995993086368,
                            "C": 0.1709859524092081,
                            "FA": 0.002534267983331672,
                            "GA": 0.005731369423122624,
                            "Start": 0.7091352495053856,
                            "Stub": 0.01651356137031511
                        }
                    }
                }
            }
        }
    }
}
  
We save the rating as a NaN if we do not get a result.
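
As a rough sketch of that lookup, the snippet below reads the prediction out of a trimmed copy of the sample response and falls back to NaN for a revision that is not in it. `extract_rating` is a hypothetical helper name introduced only for this illustration.

```python
import math

# trimmed copy of the sample response shown above
sample_response = {
    "enwiki": {
        "scores": {
            "757539710": {
                "wp10": {"score": {"prediction": "Start"}}
            }
        }
    }
}


def extract_rating(response, rev_id):
    """Return the predicted class for `rev_id`, or NaN when no score is present."""
    try:
        return response["enwiki"]["scores"][str(rev_id)]["wp10"]["score"]["prediction"]
    except (KeyError, TypeError):
        # the revision is absent from the response or has no "score" entry
        return math.nan


found = extract_rating(sample_response, 757539710)
missing = extract_rating(sample_response, 123)
```

Catching only `KeyError` and `TypeError` keeps unrelated failures, such as a network error, from being silently counted as a missing rating.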

In [3]:
# header for making ORES pings
headers = {'User-Agent' : 'https://github.com/rohitgupta91', 'From' : 'rgupta91@uw.edu'}
# endpoint definition
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

In [4]:
# getting a list of all revision IDs 
list_revision_id = list(page_data['rev_id'])

# getting the start index of each batch of 100 ids
list_rev_dist = list(np.arange(0, len(list_revision_id), 100))

In [5]:
# collecting the ORES rating for every revision id
# list of result rows; converted to a dataframe after the loop
rows = []

# variable to store count of rev ids where no info found
count_no_info = 0

# ping with 100 revision ids at a time
for i in list_rev_dist:
    start_idx = i
    end_idx = i + 100

    # getting the (up to) 100 revision ids for the iteration
    loop_rev_id = list_revision_id[start_idx:end_idx]

    # getting all ids joined together
    rev_ids = '|'.join(str(x) for x in loop_rev_id)

    # defining parameters for the API call
    params = {
        'project' : 'enwiki',
        'model'   : 'wp10',
        'revids'  : rev_ids
    }

    # making an API call (passing the headers defined above) and storing the response
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()

    # saving the ratings received from the API
    for rev_id in loop_rev_id:
        try:
            rating = response['enwiki']['scores'][str(int(rev_id))]['wp10']['score']['prediction']
        # if no rating is fetched for a particular rev id, assign NaN
        except (KeyError, TypeError):
            rating = np.nan
            # incrementing the counter of no info
            count_no_info = count_no_info + 1

        # storing the result row
        rows.append({'rev_id' : str(int(rev_id)), 'rating' : rating})

# building the dataframe once, after the loop
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
revid_pd = pd.DataFrame(rows, columns=['rating', 'rev_id'])

In [6]:
# snapshot of the resultant data
revid_pd.head()

Unnamed: 0,rating,rev_id
0,,235107991
1,Stub,355319463
2,Stub,391862046
3,Stub,391862070
4,Stub,391862409


In [7]:
# how many revision ids could not be retrieved
print ("Unable to retrieve", count_no_info, "Revision IDs, which is", 
       round(count_no_info/len(list_revision_id),4) * 100,
      "% of the total IDs")

Unable to retrieve 105 Revision IDs, which is 0.22 % of the total IDs


We merge the page data with the ORES ratings, and subsequently with the population data.
For the merge to work, the joining columns must have the same type (both strings or both integers).
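
A minimal sketch of the type alignment, using made-up rows: one frame stores `rev_id` as integers and the other as strings, so both are cast to strings before merging. The frame names `pages` and `ratings` are illustrative only.

```python
import pandas as pd

# made-up rows: one side stores rev_id as integers, the other as strings
pages = pd.DataFrame({'rev_id': [101, 102, 103],
                      'page': ['Page A', 'Page B', 'Page C']})
ratings = pd.DataFrame({'rev_id': ['101', '102', '103'],
                        'rating': ['Stub', 'GA', 'Start']})

# cast both keys to the same type before merging
pages['rev_id'] = pages['rev_id'].astype(str)
ratings['rev_id'] = ratings['rev_id'].astype(str)

merged = pd.merge(ratings, pages, on='rev_id')
```

Casting both sides explicitly makes the join independent of how each source happened to be parsed.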

In [8]:
# merging the two dataframes
# converting the page data rev id to a string
page_data['rev_id'] = page_data['rev_id'].apply(lambda x: str(x))
page_df = pd.merge(revid_pd, page_data, on = "rev_id")

# merging with the population data
final_df = pd.merge(page_df, pop_data, left_on = 'country', right_on = 'Geography')

In [9]:
# snapshot of the resultant data
final_df.head()

Unnamed: 0,rating,rev_id,page,country,Geography,Population mid-2018 (millions)
0,,235107991,Template:ZambiaProvincialMinisters,Zambia,Zambia,17.7
1,Stub,757566606,Gladys Lundwe,Zambia,Zambia,17.7
2,Stub,764848643,Mwamba Luchembe,Zambia,Zambia,17.7
3,Start,768166426,Thandiwe Banda,Zambia,Zambia,17.7
4,C,776082926,Sylvester Chisembele,Zambia,Zambia,17.7


In [10]:
# some countries not present in the population dataset
no_country = page_df[~page_df['country'].isin(final_df['country'])]
no_country.country.unique()

array(['Palestinian Territory', 'Hondura', 'Czech Republic', 'Salvadoran',
       'Saint Kitts and Nevis', 'Palauan', 'French Guiana', 'Ivorian',
       'Saint Vincent and the Grenadines', 'Rhodesian', 'Omani',
       'Congo, Dem. Rep. of', 'Niuean', 'East Timorese', 'Faroese',
       'Cape Colony', 'South Korean', 'Samoan', 'Montserratian',
       'Pitcairn Islands', 'Abkhazia', 'Martinique', 'Carniolan',
       'Saint Lucian', 'South African Republic', 'Incan', 'Chechen',
       'Jersey', 'Guernsey', 'Guadeloupe', 'South Ossetian', 'Cook Island',
       'Tokelauan', 'Swaziland', 'Dagestani', 'Greenlandic', 'Ossetian',
       'Somaliland', 'Rojava'], dtype=object)

As can be seen from the above result, some of the countries are not present in the population dataset. Hence, we will be losing some rows because we are performing an inner join. Let us see how many countries are not present in the population dataset, and how many rows we lose as a result.
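
One possible way to quantify such a loss before committing to the inner join is a left merge with `indicator=True`, which labels every row by where its key was found. The sketch below uses made-up rows and an illustrative `population_millions` column, not the real dataset.

```python
import pandas as pd

# made-up rows for illustration only
pages = pd.DataFrame({'page': ['P1', 'P2', 'P3', 'P4'],
                      'country': ['Zambia', 'Hondura', 'Zambia', 'Rojava']})
population = pd.DataFrame({'Geography': ['Zambia', 'Honduras'],
                           'population_millions': [17.7, 9.0]})

# a left merge keeps every page; `_merge` records whether a match was found
checked = pd.merge(pages, population, how='left',
                   left_on='country', right_on='Geography', indicator=True)

# rows that an inner join would drop
unmatched = checked[checked['_merge'] == 'left_only']
lost_share = len(unmatched) / len(checked)
unmatched_countries = sorted(unmatched['country'].unique())
```

Here `unmatched_countries` is `['Hondura', 'Rojava']` and `lost_share` is `0.5`, which also surfaces near-miss spellings that could be fixed by hand if desired.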

In [11]:
# how much data loss because of the mismatch in inner join
print ("Lost", round(1 - len(final_df)/len(page_df),4) * 100, "% of data due to merging the dataframes")

Lost 4.49 % of data due to merging the dataframes


Next, we perform some maintenance tasks: renaming columns, converting the population variable from a string (in millions) to an integer, and re-arranging the column structure.
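
The population conversion can be sketched on its own. The input format (a string in millions with a thousands separator, such as `'1,371.3'`) is an assumption inferred from the comma handling in this notebook, and `millions_to_int` is a hypothetical helper name.

```python
def millions_to_int(text):
    """Convert a string such as '1,371.3' (in millions) into a whole-number head count."""
    value_in_millions = float(text.replace(',', ''))
    # round() rather than int() alone, so float noise cannot truncate the result
    return int(round(value_in_millions * 1_000_000))


converted = [millions_to_int(s) for s in ['17.7', '1,371.3', '0.01']]
```

The three inputs give `17700000`, `1371300000` and `10000`.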

In [12]:
# manipulating the final dataframe
# renaming the population, revision id, article quality and article name columns
final_df.rename(columns={
    'Population mid-2018 (millions)': 'population',
    'rev_id': 'revision_id',
    'rating': 'article_quality',
    'page': 'article_name'
}, inplace=True)
# converting the population from a string in millions to an integer head count
# (round() guards against floating point truncation)
final_df['population'] = final_df['population'].apply(lambda x: int(round(float(x.replace(',', '')) * 1000000)))

# re-arranging and selecting columns
final_df = final_df[['country', 'article_name', 'revision_id', 'article_quality', 'population']]

In [13]:
# snapshot of the resultant data
final_df.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Zambia,Template:ZambiaProvincialMinisters,235107991,,17700000
1,Zambia,Gladys Lundwe,757566606,Stub,17700000
2,Zambia,Mwamba Luchembe,764848643,Stub,17700000
3,Zambia,Thandiwe Banda,768166426,Start,17700000
4,Zambia,Sylvester Chisembele,776082926,C,17700000


In [14]:
# saving the dataset before initiating the analysis
print ('Saving the dataset to the disk...')
final_df.to_csv('final_dataframe.csv')

Saving the dataset to the disk...


## Data Analysis

Firstly, we drop all the rows where the rating information was not retrieved from the ORES system. We also create a flag for the high quality articles (FA or GA).  
Afterwards, we roll-up the data at a country-level with the total number of articles and high quality articles summed up.  

We want to create 2 metrics:
1. Ratio of the number of politician articles to the country's population
2. Ratio of the number of high-quality politician articles to the total number of politician articles
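
As a small worked example of the roll-up and the two metrics, the sketch below uses a made-up dataset of two countries. It uses `groupby` with named aggregation as one possible way to do the roll-up; the metric column names `articles_per_population_pct` and `hq_share_pct` are illustrative.

```python
import pandas as pd

# made-up article-level data for two countries
articles = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B'],
    'population': [1000, 1000, 1000, 1000, 400, 400],
    'article_quality': ['FA', 'Stub', 'GA', 'C', 'Start', 'Stub'],
})

# flag for high-quality articles (FA or GA)
articles['high_count'] = articles['article_quality'].isin(['FA', 'GA']).astype(int)

# roll up to one row per country: total articles and high-quality articles
by_country = (
    articles.groupby(['country', 'population'])['high_count']
    .agg(num_articles='count', num_hq_articles='sum')
    .reset_index()
)

# metric 1: articles relative to population (percentage)
by_country['articles_per_population_pct'] = (
    by_country['num_articles'] / by_country['population'] * 100
)
# metric 2: high-quality articles relative to all articles (percentage)
by_country['hq_share_pct'] = (
    by_country['num_hq_articles'] / by_country['num_articles'] * 100
)
```

Country A has 4 articles, 2 of them high quality, for 1000 people, giving roughly 0.4% and 50%; country B has 2 articles, none of them high quality, for 400 people, giving roughly 0.5% and 0%.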

In [15]:
# starting analysis
# removing rows where rating is NaN
final_df = final_df.dropna(how='any')  

# creating a flag for high quality articles
final_df['high_count'] = np.where((final_df['article_quality'] == 'GA') | (final_df['article_quality'] == 'FA'), 1, 0)

In [122]:
# rolling the dataset at the country level
country_df = pd.pivot_table(
    final_df,  
    index = ['country','population'],
    values = ['high_count'],
    aggfunc = [lambda x: len(x), np.sum]
).reset_index()

# moving index to columns
country_df.index.name = country_df.columns.name = None

# dropping one level of the column index
country_df.columns = country_df.columns.droplevel()

# renaming all the columns
country_df.columns = ['country','population','num_articles','num_hq_articles']

In [124]:
# articles per person in a country (percentage)
country_df['art_per_person'] = country_df['num_articles']/country_df['population']*100

The following tables show the highest- and lowest-ranked countries in terms of the number of politician articles relative to the country's population.

In [125]:
# highest ranked countries in terms of articles per person
country_df.sort_values(by = 'art_per_person', ascending = False).head(10)

Unnamed: 0,country,population,num_articles,num_hq_articles,art_per_person
166,Tuvalu,10000,55,5,0.55
115,Nauru,10000,53,0,0.53
135,San Marino,30000,82,0,0.273333
108,Monaco,40000,40,0,0.1
93,Liechtenstein,40000,29,0,0.0725
161,Tonga,100000,63,1,0.063
103,Marshall Islands,60000,37,0,0.061667
68,Iceland,400000,206,2,0.0515
3,Andorra,80000,34,0,0.0425
52,Federated States of Micronesia,100000,38,0,0.038


In [126]:
# lowest ranked countries in terms of articles per person
country_df.sort_values(by = 'art_per_person', ascending = True).head(10)

Unnamed: 0,country,population,num_articles,num_hq_articles,art_per_person
69,India,1371300000,986,14,7.2e-05
70,Indonesia,265200000,214,8,8.1e-05
34,China,1393800000,1135,33,8.1e-05
173,Uzbekistan,32900000,29,1,8.8e-05
51,Ethiopia,107500000,105,1,9.8e-05
178,Zambia,17700000,25,0,0.000141
82,"Korea, North",25600000,39,7,0.000152
159,Thailand,66200000,112,3,0.000169
13,Bangladesh,166400000,323,3,0.000194
112,Mozambique,30500000,60,0,0.000197


The following tables show the highest- and lowest-ranked countries in terms of the proportion of high-quality politician articles among all politician articles for the country.

In [130]:
# share of high-quality articles among a country's politician articles (percentage)
country_df['hq_art_per_person'] = country_df['num_hq_articles']/country_df['num_articles']*100

In [131]:
# highest ranked countries in terms of the share of high-quality articles
country_df.sort_values(by = 'hq_art_per_person', ascending = False).head(10)

Unnamed: 0,country,population,num_articles,num_hq_articles,art_per_person,hq_art_per_person
82,"Korea, North",25600000,39,7,0.000152,17.948718
137,Saudi Arabia,33400000,119,16,0.000356,13.445378
31,Central African Republic,4700000,68,8,0.001447,11.764706
132,Romania,19500000,348,40,0.001785,11.494253
104,Mauritania,4500000,52,5,0.001156,9.615385
19,Bhutan,800000,33,3,0.004125,9.090909
166,Tuvalu,10000,55,5,0.55,9.090909
44,Dominica,70000,12,1,0.017143,8.333333
171,United States,328000000,1092,82,0.000333,7.509158
18,Benin,11500000,94,7,0.000817,7.446809


In [132]:
# lowest ranked countries in terms of the share of high-quality articles
country_df.sort_values(by = 'hq_art_per_person', ascending = True).head(10)

Unnamed: 0,country,population,num_articles,num_hq_articles,art_per_person,hq_art_per_person
136,Sao Tome and Principe,200000,22,0,0.011,0.0
112,Mozambique,30500000,60,0,0.000197,0.0
28,Cameroon,25600000,105,0,0.00041,0.0
65,Guyana,800000,20,0,0.0025,0.0
165,Turkmenistan,5900000,33,0,0.000559,0.0
108,Monaco,40000,40,0,0.1,0.0
107,Moldova,3500000,426,0,0.012171,0.0
36,Comoros,800000,51,0,0.006375,0.0
103,Marshall Islands,60000,37,0,0.061667,0.0
38,Costa Rica,5000000,150,0,0.003,0.0


The lowest-ranked countries all have a proportion of 0, so the bottom-10 list is only an arbitrary subset of the countries tied at zero and is not very informative. Let us see all the countries with no high-quality article.
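
The effect of ties on a bottom-N ranking can be illustrated with made-up values. The sketch below contrasts a plain `head` with `nsmallest(..., keep='all')`, which keeps every row tied at the cut-off, and with a ranking restricted to non-zero shares. The column name `hq_share_pct` is illustrative.

```python
import pandas as pd

# made-up shares: three countries are tied at zero
shares = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'hq_share_pct': [0.0, 0.0, 0.0, 2.5, 7.0],
})

# head(2) after sorting returns only two of the three tied countries
bottom_two = shares.sort_values('hq_share_pct').head(2)

# keep='all' returns every row tied at the cut-off instead
bottom_with_ties = shares.nsmallest(2, 'hq_share_pct', keep='all')

# ranking only the countries with a non-zero share is another option
bottom_nonzero = shares[shares['hq_share_pct'] > 0].nsmallest(2, 'hq_share_pct')
```

Which variant is appropriate depends on the question being asked; listing the tied group separately, as done next, is one reasonable choice.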

In [139]:
# zero high-quality articles
zero_hq = country_df[country_df['num_hq_articles'] == 0]
print ("There are", len(zero_hq), "countries with no high quality politician article.")

There are 37 countries with no high quality politician article.


In [141]:
# printing out all these countries
zero_hq['country'].unique()

array(['Andorra', 'Angola', 'Antigua and Barbuda', 'Bahamas', 'Barbados',
       'Belgium', 'Belize', 'Cameroon', 'Cape Verde', 'Comoros',
       'Costa Rica', 'Djibouti', 'Federated States of Micronesia',
       'Finland', 'Guyana', 'Kazakhstan', 'Kiribati', 'Lesotho',
       'Liechtenstein', 'Macedonia', 'Malta', 'Marshall Islands',
       'Moldova', 'Monaco', 'Mozambique', 'Nauru', 'Nepal', 'San Marino',
       'Sao Tome and Principe', 'Seychelles', 'Slovakia',
       'Solomon Islands', 'Switzerland', 'Tunisia', 'Turkmenistan',
       'Uganda', 'Zambia'], dtype=object)