# Investigating Bias in Wikipedia Articles About Political Figures

The goal of this project is to analyse articles on political figures, and to use their existence and quality to examine the different kinds of bias in Wikipedia's data.

First, import all the packages that we require for this analysis.

In [130]:
import pandas as pd
import csv
import requests
import json
import numpy as np
from IPython.display import display, HTML
from copy import deepcopy

## Step 1: Reading in the Data

The first step is to read in the data for this analysis. We use two data sources.

The first data source contains articles about political figures by country. It is distributed as a zip folder titled country.zip, which can be found at the following link: https://figshare.com/articles/Untitled_Item/5513449. The data file, page_data.csv, is stored in the data directory inside the country directory. The first step that I take is reading the data into a Pandas dataframe. This data is licensed under CC-BY-SA 4.0: you can distribute the data, but you **must** attribute it.

In [2]:
page_data = pd.read_csv('data/raw/page_data.csv')

In [3]:
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


The page_data dataframe contains three columns: page, country and rev_id. The rev_id is what we pass to the ORES API.

The other data that we use is the population data. This data contains world populations for 207 countries as of 2018. The file can be found at the following link: https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0. To download it, hit the download button at the top right corner and select direct download. No license is explicitly stated for the data, which by convention likely means that all rights are reserved. For that reason, I am not including the data with this repository; you will need to download it from the link to use it. Let's load the data and have a look at it.

In [4]:
population_data = pd.read_csv('data/raw/WPDS_2018_data.csv')

In [5]:
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


As observed above, this data contains two columns: the geography (a country or a region such as AFRICA) and its population as of mid-2018, in millions.

## Step 2: Get Article Scores from ORES

As a part of this investigation, we need to determine the countries with the highest and lowest proportion of high quality articles about politicians. In order to do this, we require article scores, which we can obtain using the ORES API. 

You can find the documentation for the ORES API at this link: https://www.mediawiki.org/wiki/ORES.

The first step that we take is setting up the endpoints and headers.

In [6]:
endpoint_def = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/tejasmhos', 'From' : 'tejash@uw.edu'}
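
To see what a request URL looks like once the template is filled in, here is a minimal sketch. It only builds the string and does not call the API. The three revision IDs are the first three rev_id values from the page_data preview above; `sample_rev_ids` and `url` are names I made up for illustration.

```python
# Template copied from the cell above.
endpoint_def = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

# First three rev_id values from the page_data preview (illustrative input).
sample_rev_ids = [235107991, 355319463, 391862046]

# Multiple revision IDs are joined with '|' into a single query parameter.
url = endpoint_def.format(
    project='enwiki',
    model='wp10',
    revids='|'.join(str(r) for r in sample_rev_ids),
)
print(url)
```

Printing the URL before sending a request is a cheap way to confirm that the revision IDs are joined the way the endpoint expects.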

After that's done, we can write a function that takes a list of rev_ids and returns the scores associated with those articles.

In [7]:
rev_ids = list(page_data['rev_id'])

In [8]:
def get_ores_data(revision_ids, headers):
    """
    This code was taken from the sample notebook that was provided to us. All code belongs to
    its original author, who in this case is Os. 
    """
    # Define the endpoint
    endpoint = endpoint_def
    
    #Define the parameters
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # pass the headers along so that the request identifies us to the API
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

In [9]:
def predictions_split(rev_ids, batch_size=100):
    """
    This function takes the rev_ids list, queries ORES batch_size (100 by
    default) revids at a time and combines the results. Batching is needed
    because the API limits the number of values that can be passed at once.
    The return object is a dataframe that contains the revids and the rating
    for those revids; a revid without a score gets a rating of NaN.
    """
    rows = []
    # range() stops at the last non-empty batch, so no empty request is sent
    for start in range(0, len(rev_ids), batch_size):
        response_ores = get_ores_data(rev_ids[start:start + batch_size], headers)
        for revid, models in response_ores['enwiki']['scores'].items():
            try:
                rating = models['wp10']['score']['prediction']
            except KeyError:
                # no score was returned for this revision
                rating = np.nan
            rows.append({'rev_ids': revid, 'ratings': rating})
    # build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['rev_ids', 'ratings'])
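
The batching can be checked on its own, without calling the API. Below is a minimal sketch; `chunked` and `fake_rev_ids` are hypothetical names used only for illustration, and the batch size of 100 mirrors the limit described in the docstring above.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-in for the real list of revision IDs.
fake_rev_ids = list(range(250))

batch_sizes = [len(batch) for batch in chunked(fake_rev_ids)]
print(batch_sizes)
```

Edge cases worth trying by hand are an empty list and a list whose length is an exact multiple of 100. In both, the final batch should not be empty.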

The next step is running our code on the list of rev_ids and getting the ratings associated with them. Some articles may not be in the database, so we assign a NaN value to those articles. Those rows are dropped later, once the joins are done.
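
To make the NaN handling concrete, here is a minimal sketch that runs on a hand-written response instead of a live API call. The nesting mirrors the lookup used in predictions_split. The `error` entry for the second revision is my assumption about how an unscored revision is reported, and its contents are made up. `extract_ratings` and `sample_response` are hypothetical names.

```python
# Hypothetical response, shaped like the lookup in predictions_split expects.
# The 'error' entry is an assumption; its message is illustrative only.
sample_response = {
    'enwiki': {
        'scores': {
            '235107991': {'wp10': {'score': {'prediction': 'Stub'}}},
            '355319463': {'wp10': {'error': {'message': 'illustrative only'}}},
        }
    }
}

def extract_ratings(response, project='enwiki', model='wp10'):
    """Map each revision ID to its predicted class, or NaN if it has no score."""
    ratings = {}
    for revid, models in response[project]['scores'].items():
        score = models.get(model, {}).get('score')
        ratings[revid] = score['prediction'] if score else float('nan')
    return ratings

ratings = extract_ratings(sample_response)
print(ratings)
```

Note that the revision IDs come back as strings, which matters for the join in the next step.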

In [10]:
#run the code, get the results for ratings
result = predictions_split(rev_ids)

## Step 3: Joining Datasets Together to Get Final Dataset

The final step to constructing our complete dataset is performing a number of joins. We first join the result dataframe with the page_data dataframe on the rev_id. 

Before we do this, note that rev_ids in the result dataframe is of type string. We need to coerce this column to int to ensure that the join works correctly. That's what we do below.

In [70]:
#explicit type conversion
result['rev_ids'] = result['rev_ids'].astype(int)
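
The reason for the conversion can be seen on a toy example. This is a minimal sketch: `pages` and `scores` are hypothetical stand-ins for page_data and result, and the ratings are made up.

```python
import pandas as pd

# Toy stand-ins: integer IDs on one side, string IDs on the other.
pages = pd.DataFrame({'rev_id': [235107991, 355319463]})
scores = pd.DataFrame({'rev_ids': ['235107991', '355319463'],
                       'ratings': ['Stub', 'Start']})

# The two key columns start out with different dtypes.
print(pages['rev_id'].dtype, scores['rev_ids'].dtype)

# Coerce the string key to int so both key columns share a dtype.
scores['rev_ids'] = scores['rev_ids'].astype('int64')
merged = pages.merge(scores, left_on='rev_id', right_on='rev_ids', how='inner')
print(merged)
```

Once both keys are integers, every row finds its match.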

Now we merge the page_data and result dataframes on the columns rev_id and rev_ids respectively.

In [71]:
intermed_2 = page_data.merge(result, left_on='rev_id', right_on='rev_ids', how='inner')

The next step is joining this intermediate table to the population table. We do this join on the country fields of the two tables (country and Geography). To avoid mismatches caused by letter case, I lowercase the country names in both tables so that as many rows as possible match.

In [72]:
intermed_2['country'] = intermed_2['country'].apply(lambda x:x.lower())
population_data['Geography'] = population_data['Geography'].apply(lambda x:x.lower())

In [73]:
#perform the final join
final_data = intermed_2.merge(population_data, left_on ='country', right_on = 'Geography', how='inner')
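
An inner join silently drops countries that have no match in the population table. One way to see which ones would be dropped is a left join with `indicator=True`. This is a minimal sketch on toy data: the frames are hypothetical, 'Atlantis' is a made-up name that deliberately has no match, and the two population values are taken from the tables in this notebook.

```python
import pandas as pd

# Toy data: 'Atlantis' exists only on the left side.
articles = pd.DataFrame({'country': ['Zambia', 'Korea, North', 'Atlantis']})
population = pd.DataFrame({'Geography': ['ZAMBIA', 'korea, north'],
                           'population_millions': [17.7, 25.6]})

# Normalise case on both sides before joining.
articles['country'] = articles['country'].str.lower()
population['Geography'] = population['Geography'].str.lower()

# indicator=True adds a '_merge' column with 'both' or 'left_only' per row.
check = articles.merge(population, left_on='country', right_on='Geography',
                       how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Running the same check on the real tables would list the countries whose names differ between the two sources.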

In [74]:
#remove the nans from the table
final_data.dropna(inplace=True)

In [75]:
#Select the columns we need, remove duplicate columns
final_data = final_data[['country', 'rev_id', 'page','Population mid-2018 (millions)','ratings']]

In [76]:
#reset the index
final_data.reset_index(inplace = True)

In [77]:
#rename column names according to convention
final_data.rename(index=str, columns={"page": "article_name", "rev_id": "revision_id","ratings":"article_quality", "Population mid-2018 (millions)":"population"}, inplace=True)

In [78]:
#Again, remove the duplicated columns and the old index column
final_data = final_data[['country', 'article_name', 'revision_id', 'article_quality','population']]
final_data['population'] = final_data['population'].apply(lambda x:x.replace(',',''))
final_data['population'] = final_data['population'].astype('float')
#Converting to millions
final_data['population'] = final_data['population'].apply(lambda x:x*1000000)
#Again, back to int
final_data['population'] = final_data['population'].astype(int)
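
The conversion above can be written as one small function and checked on a couple of values. This is a minimal sketch that assumes the raw column holds strings such as '1,284' with a thousands separator, which is what the `replace(',', '')` call suggests. `to_population` is a hypothetical helper name.

```python
def to_population(text):
    """Convert a 'millions' string such as '1,284' to a head count."""
    millions = float(text.replace(',', ''))
    # round() guards against floating-point error before the int conversion
    return int(round(millions * 1_000_000))

print(to_population('1,284'))
print(to_population('42.7'))
```

Rounding before the integer conversion is a small safety measure, because a plain cast truncates and a product that lands just below a whole number would come out one short.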

In [135]:
#Having a look at how our final data looks
final_data.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,zambia,Gladys Lundwe,757566606,Stub,17700000
1,zambia,Mwamba Luchembe,764848643,Stub,17700000
2,zambia,Thandiwe Banda,768166426,Start,17700000
3,zambia,Sylvester Chisembele,776082926,C,17700000
4,zambia,Victoria Kalima,776530837,Start,17700000


We performed a number of steps here. First, we dropped the rows with NaN ratings. Next, we selected the columns that we require for this analysis and got rid of columns that were duplicated as a result of the join. Finally, we reset the index so that it is sequential, and renamed the columns in accordance with the scheme that was given. Since our data is processed and ready, we save it to a CSV and proceed to the next stage, which is the analysis step.

In [79]:
final_data.to_csv('data/final_data.csv')

## Step 4: Performing Analysis

We perform two analyses. The first, addressed here, finds the number of articles as a percentage of each country's population.

### Analysis 1 : Countries with the highest and lowest proportion of articles as compared to their population

This analysis is pretty simple. We run a simple groupby on the country and population, and get the count of the number of articles that are associated with this country. 

In [103]:
prop_articles_per_country = final_data.groupby(['country','population'])['revision_id'].count().to_frame()
prop_articles_per_country = prop_articles_per_country.reset_index()

In [106]:
#Calculating the proportions
prop_articles_per_country['proportions'] = (prop_articles_per_country['revision_id']/prop_articles_per_country['population']) * 100
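
As a sanity check on the formula, here is a minimal sketch on toy rows. The article counts and populations for the two countries are taken from the result tables in this notebook; the frame itself is hypothetical and only reproduces the row counts, not the real articles.

```python
import pandas as pd

# One row per article, as in final_data (toy reconstruction).
toy = pd.DataFrame({
    'country': ['tuvalu'] * 55 + ['india'] * 986,
    'population': [10000] * 55 + [1371300000] * 986,
})

# Count articles per country, then express the count as a percentage
# of the population.
per_country = (toy.groupby(['country', 'population'])
                  .size()
                  .reset_index(name='number_of_articles'))
per_country['proportions'] = (per_country['number_of_articles']
                              / per_country['population'] * 100)
print(per_country)
```

Because of the factor of 100, the proportions column is a percentage: 55 articles for 10,000 people is 0.55%.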

In [125]:
%%HTML
<!-- This cell styles the tables to make them look good -->
<style type="text/css">
    table.dataframe td, table.dataframe th {
        border-style: solid;
    }
</style>

In [201]:
prop_articles_per_country = prop_articles_per_country.rename(columns = {'revision_id':'number_of_articles'})

In [203]:
#10 highest ranked countries with respect to number of articles as a proportion of population
prop_articles_per_country.sort_values(by='proportions', ascending = False).head(10)[['country', 'population','number_of_articles', 'proportions']]

Unnamed: 0,country,population,number_of_articles,proportions
166,tuvalu,10000,55,0.55
115,nauru,10000,53,0.53
135,san marino,30000,82,0.273333
108,monaco,40000,40,0.1
93,liechtenstein,40000,29,0.0725
161,tonga,100000,63,0.063
103,marshall islands,60000,37,0.061667
68,iceland,400000,206,0.0515
3,andorra,80000,34,0.0425
52,federated states of micronesia,100000,38,0.038


The table above shows the countries that rank highest by number of articles as a proportion of their population. The results aren't that surprising: countries with a small population have a higher proportion of articles relative to the size of their population. The highest are Tuvalu and Nauru, both extremely small island nations in the Pacific.

In [204]:
#10 lowest ranked countries with respect to number of articles as a proportion of population
prop_articles_per_country.sort_values(by='proportions', ascending=True).head(10)[['country','population','number_of_articles','proportions']]

Unnamed: 0,country,population,number_of_articles,proportions
69,india,1371300000,986,7.2e-05
70,indonesia,265200000,214,8.1e-05
34,china,1393800000,1135,8.1e-05
173,uzbekistan,32900000,29,8.8e-05
51,ethiopia,107500000,105,9.8e-05
178,zambia,17700000,25,0.000141
82,"korea, north",25600000,39,0.000152
159,thailand,66200000,112,0.000169
13,bangladesh,166400000,323,0.000194
112,mozambique,30500000,60,0.000197


The table above shows the reverse case: the countries that rank lowest by number of articles as a proportion of their population. The list is full of developing countries, and the two most populous countries (China and India) both appear in it. Nothing here surprises me. Even though the most populous countries have far more people, it's likely that only a few of them are prominent enough to have their own page.

### Analysis 2 : Countries with the highest and lowest proportion of high quality articles as compared to the total number of articles they have

In this stage of analysis, we compute the proportion of high quality articles to total articles in the politicians category of each country. The first step of our analysis is determining the number of high quality articles. As per the definition given to us, a high quality article is one that has a quality of FA or GA. We use a groupby to get this count, after filtering our data down to FA and GA.

In [180]:
#deepcopy our data, then keep only the FA and GA articles
high_quality_count = deepcopy(final_data)
high_quality_count = high_quality_count[high_quality_count['article_quality'].isin(['FA', 'GA'])]

In [181]:
#groupby to get the count of high quality articles by country
high_quality_count = high_quality_count.groupby(['country','population'])['revision_id'].count().to_frame()

In [182]:
#make sure that country and population are actual columns
high_quality_count.reset_index(inplace=True)

In order to get our final counts, we need to join back with the table that contains the total number of articles (the table prop_articles_per_country). In order to avoid duplication of columns, I first deepcopy this table into a new variable, rename some duplicated columns and then perform the join on country. 

In [183]:
total_articles = deepcopy(prop_articles_per_country)
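#the count column is named 'revision_id' or 'number_of_articles', depending on whether the earlier rename cell has run
total_articles = total_articles.rename(columns={'revision_id':'total_articles', 'number_of_articles':'total_articles'})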

In [184]:
#selecting the columns we need
total_articles = total_articles[['country','total_articles']]

In [185]:
#renaming columns and selecting the ones we need for this analysis
high_quality_count = high_quality_count.rename(columns = {'revision_id':'high_quality_articles'})
high_quality_count = high_quality_count[['country','high_quality_articles']]

In [195]:
high_quality_count.shape

(143, 2)

Only 143 countries have at least one high quality article, which is significantly fewer than the total number of countries in the data. So, I perform a left join from the total_articles table and fill the NaN values with 0. This ensures that we don't lose the countries that have no high quality articles.

The final step to get our table is calculating the proportions and adding them to a separate column of the dataframe. We do this below. The final tables are also shown below.

In [196]:
#performing the join, getting the table with total and high quality articles
high_quality_prop = total_articles.merge(high_quality_count, left_on = 'country', right_on = 'country', how = 'left')

In [197]:
high_quality_prop = high_quality_prop.fillna(0)
high_quality_prop['high_quality_articles'] = high_quality_prop['high_quality_articles'].astype(int)

In [198]:
#Calculating the proportion
high_quality_prop['high_quality_articles_proportion'] = (high_quality_prop['high_quality_articles']/high_quality_prop['total_articles'])*100
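
The same table can also be built in a single groupby, without the intermediate join. This is a minimal sketch of that alternative, not the approach used above. The toy frame is hypothetical and only mimics the shape of final_data.

```python
import pandas as pd

# Hypothetical toy data in the shape of final_data.
toy = pd.DataFrame({
    'country': ['a', 'a', 'a', 'a', 'b', 'b'],
    'article_quality': ['FA', 'GA', 'Stub', 'C', 'Start', 'Stub'],
})

# Flag high quality articles, then count and sum the flag per country.
toy['is_high_quality'] = toy['article_quality'].isin(['FA', 'GA'])
summary = toy.groupby('country')['is_high_quality'].agg(
    total_articles='count',
    high_quality_articles='sum',
)
summary['high_quality_articles_proportion'] = (
    summary['high_quality_articles'] / summary['total_articles'] * 100
)
print(summary)
```

A country with no high quality articles (country 'b' here) gets a count of 0 directly, so no `fillna` step is needed in this variant.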

The final step is presenting our results. We first show the countries that have the highest proportion of high quality articles as compared to the total number of articles that they have.

In [199]:
#Finding countries with highest proportion of high quality articles
high_quality_prop.sort_values(by='high_quality_articles_proportion', ascending = False).head(10)[['country', 'total_articles', 'high_quality_articles','high_quality_articles_proportion']]

Unnamed: 0,country,total_articles,high_quality_articles,high_quality_articles_proportion
6,"korea, north",39,7,17.948718
20,saudi arabia,119,16,13.445378
81,central african republic,68,8,11.764706
87,romania,348,40,11.494253
69,mauritania,52,5,9.615385
179,tuvalu,55,5,9.090909
130,bhutan,33,3,9.090909
160,dominica,12,1,8.333333
17,united states,1092,82,7.509158
56,benin,94,7,7.446809


The results are very interesting. The country with the highest proportion is North Korea, followed by Saudi Arabia. Personally, I think the large amount of interest in the way the governments of these two countries function has led to a lot of good quality articles being written about their politicians. I also notice that Romania and the United States are the only two Western nations in the entire top ten.

Now we do the same thing, but for the countries with the lowest proportion of high quality articles.

In [200]:
#Finding the countries with the lowest proportion of high ranked articles
high_quality_prop.sort_values(by='high_quality_articles_proportion', ascending = True).head(10)[['country', 'total_articles', 'high_quality_articles','high_quality_articles_proportion']]

Unnamed: 0,country,total_articles,high_quality_articles,high_quality_articles_proportion
142,comoros,51,0,0.0
157,solomon islands,98,0,0.0
75,nepal,361,0,0.0
126,djibouti,39,0,0.0
74,tunisia,140,0,0.0
28,kazakhstan,79,0,0.0
100,slovakia,119,0,0.0
154,moldova,426,0,0.0
152,sao tome and principe,22,0,0.0
23,cameroon,105,0,0.0


All of the 10 lowest proportions are 0, and we know that there are 37 countries with no high quality articles, so the 10 rows shown above are only a subset of the countries tied at 0. This concludes our research.
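
When many rows tie on the sort key, `head(10)` shows only some of them, so it helps to count the ties directly. This is a minimal sketch on a hypothetical toy table; on the real data the same filter would be applied to high_quality_prop.

```python
import pandas as pd

# Hypothetical toy table: three countries tie at zero, one does not.
toy = pd.DataFrame({
    'country': ['a', 'b', 'c', 'd'],
    'high_quality_articles': [0, 0, 0, 4],
})

# Select every row tied at zero instead of relying on head().
zero_rows = toy[toy['high_quality_articles'] == 0]
print(len(zero_rows))
print(sorted(zero_rows['country']))
```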

## Final Reflections

### What biases did you expect to find in the data, and why?
I expected to find that developed countries would have more high quality articles, and also a higher ratio of articles to population. I assumed that developed countries would have more English-educated people, leading to articles that are both better and more numerous.

### What are the results?
The results turned out to be quite different. Very small countries with several thousand people had a higher proportion of articles to their population. The number of high quality articles favored countries that typically have a lot of negative publicity with respect to the way their government works (North Korea, Saudi Arabia).

### What theories do you have about why the results are what they are?
Most of this is explained throughout my notebook at each stage of analysis. 