# A2: Bias in Data

The goal of this assignment is to explore the concept of bias through data on Wikipedia articles - specifically, articles on political figures from a variety of countries. For this assignment, you will combine a dataset of Wikipedia articles with a dataset of country populations, and use a machine learning service called ORES to estimate the quality of each article.

You are expected to perform an analysis of how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries. Your analysis will consist of a series of tables that show:
1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
2. the countries with the highest and lowest proportion of high quality articles about politicians.
3. a ranking of geographic regions by articles-per-person and proportion of high quality articles.
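
To make the two metrics behind these tables concrete, here is a minimal sketch on invented toy numbers. The column names (`article_count`, `high_quality_count`) and all values are assumptions for illustration only; they are not the assignment's data or schema.

```python
import pandas as pd

# Toy data: hypothetical column names and invented numbers, for illustration only.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [1_000_000, 5_000_000, 20_000_000],
    "article_count": [50, 100, 200],
    "high_quality_count": [5, 1, 10],
})

# Coverage: politician articles relative to population.
toy["articles_per_person"] = toy["article_count"] / toy["population"]

# Quality: share of a country's articles that are rated high quality.
toy["high_quality_share"] = toy["high_quality_count"] / toy["article_count"]

# Rank countries from greatest to least coverage.
ranked = toy.sort_values("articles_per_person", ascending=False)
print(ranked[["country", "articles_per_person", "high_quality_share"]])
```

Note that the two metrics can disagree: in this toy example country B has more coverage per person than C but a smaller share of high quality articles.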


### Import Libraries

In [1]:
import os

import pandas as pd
import numpy as np

from pprint import pprint as pp

### Define Constants

In [2]:
RAW_DATA_PATH = '../data/raw'
PROCESSED_DATA_PATH = '../data/processed'
VISUALIZATIONS_DATA_PATH = '../data/visualizations'

for path in [RAW_DATA_PATH, PROCESSED_DATA_PATH, VISUALIZATIONS_DATA_PATH]:
    # Create the directory if it is missing; do nothing if it already exists
    os.makedirs(path, exist_ok=True)

In [3]:
RAW_COUNTRY_DATASET_FPATH = os.path.join(RAW_DATA_PATH, 'page_data.csv')
RAW_WORLD_POPULATION_DATASET_FPATH = os.path.join(RAW_DATA_PATH, 'WPDS_2020_data.csv')

PROCESSED_POLITICIANS_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'politicians_country.csv')
PROCESSED_WORLD_POPULATION_COUNTRY_LEVEL_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'world_population_country_level.csv')
PROCESSED_WORLD_POPULATION_REGION_LEVEL_DATASET_FPATH = os.path.join(PROCESSED_DATA_PATH, 'world_population_region_level.csv')

## 1. Data Acquisition

We obtain the data from several different places:
1. The Wikipedia politicians by country dataset can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449)
    * We first download the zipped folder manually.
    * We then extract the zipped folder.
    * Inside the extracted folder, we navigate to `country/country/data`.
    * From there, we copy `page_data.csv` into the raw data path (`../data/raw`).
2. The population data is available in CSV format as [WPDS_2020_data.csv](https://docs.google.com/spreadsheets/d/1CFJO2zna2No5KqNm9rPK5PCACoXKzb-nycJFhV689Iw/edit?usp=sharing)
    * This dataset is drawn from the world population data sheet published by the [Population Reference Bureau](https://www.prb.org/international/indicator/population/table/).
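
Because both files are placed by hand, it can help to confirm they are present before loading them. The following is a minimal sketch; `missing_raw_files` is a hypothetical helper name, and the default file names and the `../data/raw` directory are taken from the constants defined earlier.

```python
import os

def missing_raw_files(raw_dir, expected=("page_data.csv", "WPDS_2020_data.csv")):
    """Return the expected file names that are not present in raw_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(raw_dir, name))
    ]

# Report anything that still needs to be downloaded manually.
missing = missing_raw_files("../data/raw")
if missing:
    print("Download these files before continuing:", ", ".join(missing))
```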


In [4]:
df_pcd = pd.read_csv(RAW_COUNTRY_DATASET_FPATH)
df_wpd = pd.read_csv(RAW_WORLD_POPULATION_DATASET_FPATH)

In [5]:
df_pcd.head(5)

|   | page | country | rev_id |
|---|---|---|---|
| 0 | Template:ZambiaProvincialMinisters | Zambia | 235107991 |
| 1 | Bir I of Kanem | Chad | 355319463 |
| 2 | Template:Zimbabwe-politician-stub | Zimbabwe | 391862046 |
| 3 | Template:Uganda-politician-stub | Uganda | 391862070 |
| 4 | Template:Namibia-politician-stub | Namibia | 391862409 |


In [6]:
df_wpd.head(5)

|   | FIPS | Name | Type | TimeFrame | Data (M) | Population |
|---|---|---|---|---|---|---|
| 0 | WORLD | WORLD | World | 2019 | 7772.85 | 7772850000 |
| 1 | AFRICA | AFRICA | Sub-Region | 2019 | 1337.918 | 1337918000 |
| 2 | NORTHERN AFRICA | NORTHERN AFRICA | Sub-Region | 2019 | 244.344 | 244344000 |
| 3 | DZ | Algeria | Country | 2019 | 44.357 | 44357000 |
| 4 | EG | Egypt | Country | 2019 | 100.803 | 100803000 |


### Data Cleaning

Each of the files above contains some rows that are not needed for the analysis, so we perform the following cleaning steps:
1. Country Dataset
    * The dataset contains some page names that start with the string "Template:".
    * These pages are not Wikipedia articles, so we exclude them from the analysis.
2. Population Dataset
    * This dataset contains some rows that provide cumulative regional population counts, rather than country-level counts.
    * These rows are distinguished by having ALL CAPS values in the `Name` field (e.g. AFRICA, OCEANIA).
    * We remove these rows from the country-level dataset, but retain a copy of them in a separate file.
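
The two filters described above can be sketched on a few rows copied from the previews shown earlier. This is a minimal illustration on toy frames, not the full datasets; the variable names (`pages`, `population`, `articles`, `regions`, `countries`) are assumptions made for this example.

```python
import pandas as pd

# Toy rows copied from the dataset previews; not the full datasets.
pages = pd.DataFrame({
    "page": ["Template:Uganda-politician-stub", "Bir I of Kanem"],
    "country": ["Uganda", "Chad"],
})
population = pd.DataFrame({
    "Name": ["AFRICA", "Algeria", "Egypt"],
    "Population": [1337918000, 44357000, 100803000],
})

# Keep only real articles: drop pages whose title starts with "Template:".
articles = pages[~pages["page"].str.startswith("Template:")]

# Regional aggregates are the rows whose name is entirely upper case.
is_region = population["Name"].str.isupper()
regions = population[is_region]
countries = population[~is_region]

print(articles["page"].tolist())
print(regions["Name"].tolist())
print(countries["Name"].tolist())
```

Building the boolean mask once and negating it with `~` keeps the country and region subsets complementary by construction.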

In [7]:
df_pcd = df_pcd[~df_pcd["page"].str.startswith("Template:")]  # Drop template pages, which are not articles
df_wpd_country = df_wpd[~df_wpd['Name'].str.isupper()]  # Country-level counts
df_wpd_region = df_wpd[df_wpd['Name'].str.isupper()]  # Cumulative region-level counts

In [8]:
# Ensure that we do not have anything that is all-caps in the `Name` field
df_wpd_country['Name'].unique()

array(['Algeria', 'Egypt', 'Libya', 'Morocco', 'Sudan', 'Tunisia',
       'Western Sahara', 'Benin', 'Burkina Faso', 'Cape Verde',
       "Cote d'Ivoire", 'Gambia', 'Ghana', 'Guinea', 'Guinea-Bissau',
       'Liberia', 'Mali', 'Mauritania', 'Niger', 'Nigeria', 'Senegal',
       'Sierra Leone', 'Togo', 'Burundi', 'Comoros', 'Djibouti',
       'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi',
       'Mauritius', 'Mayotte', 'Mozambique', 'Reunion', 'Rwanda',
       'Seychelles', 'Somalia', 'South Sudan', 'Tanzania', 'Uganda',
       'Zambia', 'Zimbabwe', 'Angola', 'Cameroon',
       'Central African Republic', 'Chad', 'Congo', 'Congo, Dem. Rep.',
       'Equatorial Guinea', 'Gabon', 'Sao Tome and Principe', 'Botswana',
       'eSwatini', 'Lesotho', 'Namibia', 'South Africa', 'Canada',
       'United States', 'Belize', 'Costa Rica', 'El Salvador',
       'Guatemala', 'Honduras', 'Mexico', 'Nicaragua', 'Panama',
       'Antigua and Barbuda', 'Bahamas', 'Barbados', 'Cuba', 'Curacao',
 

In [9]:
# Ensure that we only have strings that are all-caps in the `Name` field
df_wpd_region['Name'].unique()

array(['WORLD', 'AFRICA', 'NORTHERN AFRICA', 'WESTERN AFRICA',
       'EASTERN AFRICA', 'MIDDLE AFRICA', 'SOUTHERN AFRICA',
       'NORTHERN AMERICA', 'LATIN AMERICA AND THE CARIBBEAN',
       'CENTRAL AMERICA', 'CARIBBEAN', 'SOUTH AMERICA', 'ASIA',
       'WESTERN ASIA', 'CENTRAL ASIA', 'SOUTH ASIA', 'SOUTHEAST ASIA',
       'EAST ASIA', 'EUROPE', 'NORTHERN EUROPE', 'WESTERN EUROPE',
       'EASTERN EUROPE', 'SOUTHERN EUROPE', 'OCEANIA'], dtype=object)
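
Inspecting the unique values by eye works for a dataset this small, but the same check can also be expressed as an assertion. The following is a minimal sketch; `split_is_clean` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def split_is_clean(country_names, region_names):
    """Return True when no country name is all caps and every region name is."""
    country_names = pd.Series(country_names, dtype="object")
    region_names = pd.Series(region_names, dtype="object")
    no_caps_countries = (~country_names.str.isupper()).all()
    all_caps_regions = region_names.str.isupper().all()
    return bool(no_caps_countries and all_caps_regions)

# Names taken from the outputs above.
print(split_is_clean(["Algeria", "eSwatini"], ["WORLD", "AFRICA"]))
```

In the notebook this could be applied as `assert split_is_clean(df_wpd_country['Name'], df_wpd_region['Name'])`.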

In [None]:
df_pcd.to_csv(PROCESSED_POLITICIANS_DATASET_FPATH, index=False)
df_wpd_country.to_csv(PROCESSED_WORLD_POPULATION_COUNTRY_LEVEL_DATASET_FPATH, index=False)
df_wpd_region.to_csv(PROCESSED_WORLD_POPULATION_REGION_LEVEL_DATASET_FPATH, index=False)