# A2: Bias in data

## Step 1: Getting the Article and Population Data

In [1]:
import pandas as pd
from tqdm import tqdm

In [2]:
article_data = pd.read_csv('page_data.csv')
population_data = pd.read_csv('WPDS_2020_data.csv')

## Step 2: Cleaning the Data

In [3]:
# cleaning article_data: drop template pages, whose names start with 'Template:'
article_data_cleaned = article_data[~article_data['page'].str.startswith('Template:')]
print(article_data_cleaned.head())

                                                 page                country  \
1                                      Bir I of Kanem                   Chad   
10  Information Minister of the Palestinian Nation...  Palestinian Territory   
12                                            Yos Por               Cambodia   
23                                       Julius Gregr         Czech Republic   
24                                       Edvard Gregr         Czech Republic   

       rev_id  
1   355319463  
10  393276188  
12  393822005  
23  395521877  
24  395526568  


In [4]:
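# cleaning population_data: region rows have upper-case names, country rows do not
population_data_cleaned = population_data[~population_data['Name'].str.isupper()]
print(population_data_cleaned.head())
# keep the region rows separately for the regional analysis
population_data_regional = population_data[population_data['Name'].str.isupper()]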

  FIPS     Name     Type  TimeFrame  Data (M)  Population
3   DZ  Algeria  Country       2019    44.357    44357000
4   EG    Egypt  Country       2019   100.803   100803000
5   LY    Libya  Country       2019     6.891     6891000
6   MA  Morocco  Country       2019    35.952    35952000
7   SD    Sudan  Country       2019    43.849    43849000


## Step 3: Getting Article Quality Predictions

In [5]:
import json
import requests

In [6]:
headers = {
    'User-Agent': 'https://github.com/zhangjunhao0',
    'From': 'zjh01@uw.edu'
}

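def ores_data(revision_ids):
    """Request ORES articlequality scores for a batch of revision IDs and return the parsed JSON."""
    # ORES takes several revision IDs in one request, joined with '|'
    revids = '|'.join(str(x) for x in revision_ids)
    ores_api = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {
        'project': 'enwiki',
        'model': 'articlequality',
        'revids': revids
    }

    return requests.get(ores_api.format(**params), headers=headers).json()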
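
To make the request format concrete, the sketch below builds the URL that `ores_data` would request for a two-revision batch, without sending anything over the network. `build_ores_url` is a hypothetical helper that is not part of the notebook; the two revision IDs are the first two in the cleaned article data.

```python
# Hypothetical helper that mirrors the URL construction inside ores_data.
ORES_API = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

def build_ores_url(revision_ids, project='enwiki', model='articlequality'):
    # Several revision IDs go into one request, separated by '|'
    revids = '|'.join(str(x) for x in revision_ids)
    return ORES_API.format(project=project, model=model, revids=revids)

example_url = build_ores_url([355319463, 393276188])
print(example_url)
```

Keeping the URL construction in a pure function like this makes it possible to check the request format before spending several minutes on the full run.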

In [7]:
print(article_data_cleaned.shape)

(46701, 3)


In [8]:
# query ORES in batches of 50 revision IDs per request
batch_size = 50
revision_ids = list(article_data_cleaned['rev_id'])
# revision_ids = revision_ids[:1001]  # uncomment to test on a small subset
scores_list = []
rev_ids_no_scores = []
for start in tqdm(range(0, len(revision_ids), batch_size)):
    # Call the API; slicing past the end of the list is safe,
    # so the last (shorter) batch needs no special case
    ores_data_json = ores_data(revision_ids[start:start + batch_size])

    # Keep the predicted class when a score is returned;
    # otherwise record the revision ID as having no score
    ores_data_scores = ores_data_json['enwiki']['scores']
    for rev_id, result in ores_data_scores.items():
        article_quality = result['articlequality']
        if 'score' in article_quality:
            scores_list.append([rev_id, article_quality['score']['prediction']])
        else:
            rev_ids_no_scores.append(rev_id)

100%|██████████| 935/935 [04:45<00:00,  3.28it/s]
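
The progress bar total of 935 follows from the batch size: 46,701 revision IDs in batches of 50 need 935 requests, and the last request carries a single ID. A small check of that arithmetic, using a hypothetical `make_batches` helper and placeholder integers in place of the real revision IDs:

```python
def make_batches(items, batch_size=50):
    # Slicing past the end of a list is safe, so the final short batch needs no special case
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

placeholder_ids = list(range(46701))  # stand-ins for the 46,701 revision IDs
batches = make_batches(placeholder_ids)
print(len(batches), len(batches[-1]))
```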


In [9]:
df_article_score = pd.DataFrame(scores_list, columns=['rev_id','article_quality_est.'])
df_article_score['rev_id'] = df_article_score['rev_id'].astype('int')
df_article_with_quality = article_data_cleaned.merge(df_article_score, how='inner', on='rev_id')
print(df_article_with_quality.head())

                                                page                country  \
0                                     Bir I of Kanem                   Chad   
1  Information Minister of the Palestinian Nation...  Palestinian Territory   
2                                            Yos Por               Cambodia   
3                                       Julius Gregr         Czech Republic   
4                                       Edvard Gregr         Czech Republic   

      rev_id article_quality_est.  
0  355319463                 Stub  
1  393276188                 Stub  
2  393822005                 Stub  
3  395521877                 Stub  
4  395526568                 Stub  
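
In the JSON that `ores_data` returns, revision IDs appear as object keys and are therefore strings, which is why `rev_id` is cast to `int` before the merge. The sketch below walks a hand-written payload that only mimics the keys read by the batch loop; the second entry and its `error` contents are placeholders, not actual ORES output, and `split_scores` is a hypothetical helper.

```python
# Hand-written stand-in for a response; only the keys read by the notebook are modeled.
sample_response = {
    'enwiki': {
        'scores': {
            '355319463': {'articlequality': {'score': {'prediction': 'Stub'}}},
            '1': {'articlequality': {'error': {'message': 'placeholder'}}},
        }
    }
}

def split_scores(response_json, project='enwiki', model='articlequality'):
    scored, unscored = [], []
    for rev_id, result in response_json[project]['scores'].items():
        if 'score' in result[model]:
            # JSON object keys are strings, so convert back to int for later joins
            scored.append((int(rev_id), result[model]['score']['prediction']))
        else:
            unscored.append(int(rev_id))
    return scored, unscored

scored, unscored = split_scores(sample_response)
print(scored, unscored)
```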


In [10]:
with open('articles_with_no_ores_scores', 'w') as f:
    for item in rev_ids_no_scores:
        f.write("%s\n" % item)

## Step 4: Combining the Datasets

In [11]:
df_merged = df_article_with_quality.merge(population_data_cleaned, how='outer', left_on='country', right_on='Name')
print(df_merged.head(20))

                        page country       rev_id article_quality_est. FIPS  \
0             Bir I of Kanem    Chad  355319463.0                 Stub   TD   
1       Abdullah II of Kanem    Chad  498683267.0                 Stub   TD   
2        Salmama II of Kanem    Chad  565745353.0                 Stub   TD   
3            Kuri I of Kanem    Chad  565745365.0                 Stub   TD   
4        Mohammed I of Kanem    Chad  565745375.0                 Stub   TD   
5           Kuri II of Kanem    Chad  669719757.0                 Stub   TD   
6            Bir II of Kanem    Chad  670893206.0                 Stub   TD   
7            Mahamat Hissene    Chad  693055898.0                 Stub   TD   
8                   Othman I    Chad  705432607.0                 Stub   TD   
9            Alphonse Kotiga    Chad  707593108.0                 Stub   TD   
10         Oueddei Kichidemi    Chad  708346649.0                Start   TD   
11                  Dunama I    Chad  710675092.0   

In [12]:
# rename columns and keep only the ones needed for the analysis
df_merged = df_merged.rename(columns={'page': 'article_name', 'rev_id': 'revision_id', 'Population': 'population'})
df_merged = df_merged[['country','article_name','revision_id','article_quality_est.','population']]

In [13]:
# rows with unmatched data
df_unmatched = df_merged[df_merged.isnull().any(axis=1)]
df_unmatched.to_csv('wp_wpds_countries-no_match.csv', index=False)

In [14]:
# rows with matched data
df_matched = df_merged.dropna()
df_matched.to_csv('wp_wpds_politicians_by_country.csv', index=False)
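
An alternative way to separate matched from unmatched rows is the `indicator` option of `DataFrame.merge`, which records which side each row came from instead of relying on null checks. A minimal sketch on toy frames: the page titles and the country name 'Atlantis' are made up for illustration, while the two population figures are the ones shown for Algeria and Egypt above.

```python
import pandas as pd

articles = pd.DataFrame({
    'country': ['Algeria', 'Atlantis'],       # 'Atlantis' has no population row
    'page': ['Made-up page A', 'Made-up page B'],
})
populations = pd.DataFrame({
    'Name': ['Algeria', 'Egypt'],             # 'Egypt' has no article row here
    'Population': [44357000, 100803000],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
merged = articles.merge(populations, how='outer', left_on='country',
                        right_on='Name', indicator=True)
matched = merged[merged['_merge'] == 'both']
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['country', 'Name', '_merge']])
```

With this approach a row that happens to contain a null in some other column would not be mistaken for an unmatched row.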

## Step 5: Analysis

In [15]:
# load data
article_with_population = pd.read_csv('wp_wpds_politicians_by_country.csv')

In [16]:
# high-quality (FA or GA) articles per population for each country, expressed as a percentage
high_quality_article = article_with_population[article_with_population['article_quality_est.'].isin(['FA', 'GA'])]
articles_per_population = high_quality_article.groupby(['country','population'])['revision_id'].count().to_frame().reset_index()
articles_per_population['articles-per-population'] = (articles_per_population['revision_id']/articles_per_population['population']) * 100
articles_per_population = articles_per_population[['country', 'articles-per-population']]
print(articles_per_population.head())

       country  articles-per-population
0  Afghanistan                 0.000033
1      Albania                 0.000106
2      Algeria                 0.000005
3    Argentina                 0.000035
4      Armenia                 0.000169


In [17]:
# percent high quality articles for each country
high_quality_count_by_country = high_quality_article.groupby('country')['revision_id'].count().to_frame().reset_index()
total_count_by_country = article_with_population.groupby('country')['revision_id'].count().to_frame().reset_index()
percent_high_quality_by_country = total_count_by_country.merge(high_quality_count_by_country, how='inner', on='country')
percent_high_quality_by_country['percent_high_quality'] = percent_high_quality_by_country['revision_id_y']/percent_high_quality_by_country['revision_id_x']*100
percent_high_quality_by_country = percent_high_quality_by_country[['country', 'percent_high_quality']]

In [18]:
percent_high_quality_by_country.head()

       country  percent_high_quality
0  Afghanistan              4.075235
1      Albania              0.657895
2      Algeria              1.724138
3    Argentina              3.258656
4      Armenia              2.590674
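
Written out, the two per-country metrics computed above are as follows. The symbols are introduced here for convenience and do not appear in the notebook: $H_c$ is the number of articles for country $c$ predicted as FA or GA, $N_c$ is the number of all articles for that country, and $P_c$ is its population.

```latex
\text{articles-per-population}_c = \frac{H_c}{P_c} \times 100,
\qquad
\text{percent\_high\_quality}_c = \frac{H_c}{N_c} \times 100
```

Both values are percentages, which is why the first one is so small: it divides a handful of articles by a population in the millions.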


In [19]:
population_data_regional.head(30)

                               FIPS                             Name        Type  TimeFrame  Data (M)  Population
0                             WORLD                            WORLD       World       2019   7772.85  7772850000
1                            AFRICA                           AFRICA  Sub-Region       2019  1337.918  1337918000
2                   NORTHERN AFRICA                  NORTHERN AFRICA  Sub-Region       2019   244.344   244344000
10                   WESTERN AFRICA                   WESTERN AFRICA  Sub-Region       2019   401.115   401115000
27                   EASTERN AFRICA                   EASTERN AFRICA  Sub-Region       2019    444.97   444970000
48                    MIDDLE AFRICA                    MIDDLE AFRICA  Sub-Region       2019   179.757   179757000
58                  SOUTHERN AFRICA                  SOUTHERN AFRICA  Sub-Region       2019    67.732    67732000
64                 NORTHERN AMERICA                 NORTHERN AMERICA  Sub-Region       2019   368.193   368193000
67  LATIN AMERICA AND THE CARIBBEAN  LATIN AMERICA AND THE CARIBBEAN  Sub-Region       2019   651.036   651036000
68                  CENTRAL AMERICA                  CENTRAL AMERICA  Sub-Region       2019   178.611   178611000


In [20]:
# articles-per-population for each smaller region

# find the corresponding region for each country
# (work on a copy so that population_data itself is not modified)
country_to_subregion_mapping = population_data.copy()
current_region = None
res_region = []
for name in country_to_subregion_mapping['Name']:
    # region rows are upper case; every country row that follows belongs to that region
    if name.isupper():
        current_region = name
    res_region.append(current_region)
country_to_subregion_mapping['region'] = res_region
country_to_subregion_mapping = country_to_subregion_mapping[~country_to_subregion_mapping['Name'].str.isupper()]
country_to_subregion_mapping = country_to_subregion_mapping[['Name','region']]
article_with_population_and_region = article_with_population.merge(country_to_subregion_mapping, how='inner', left_on='country', right_on='Name')

# analysis
high_quality_article_region = article_with_population_and_region[(article_with_population_and_region['article_quality_est.']=='FA')|(article_with_population_and_region['article_quality_est.']=='GA')]
articles_per_population_region = high_quality_article_region.groupby('region').agg({'revision_id':'count'}).reset_index()
articles_per_population_region = articles_per_population_region.merge(population_data[['Name','Population']], left_on='region',right_on='Name')
articles_per_population_region['articles-per-population'] = (articles_per_population_region['revision_id']/articles_per_population_region['Population']) * 100
articles_per_population_region = articles_per_population_region[['region', 'articles-per-population']]
print(articles_per_population_region)

              region  articles-per-population
0          CARIBBEAN                 0.000030
1    CENTRAL AMERICA                 0.000013
2       CENTRAL ASIA                 0.000009
3          EAST ASIA                 0.000005
4     EASTERN AFRICA                 0.000008
5     EASTERN EUROPE                 0.000040
6      MIDDLE AFRICA                 0.000009
7    NORTHERN AFRICA                 0.000008
8   NORTHERN AMERICA                 0.000028
9    NORTHERN EUROPE                 0.000096
10           OCEANIA                 0.000146
11     SOUTH AMERICA                 0.000009
12        SOUTH ASIA                 0.000004
13    SOUTHEAST ASIA                 0.000011
14   SOUTHERN AFRICA                 0.000013
15   SOUTHERN EUROPE                 0.000048
16    WESTERN AFRICA                 0.000010
17      WESTERN ASIA                 0.000032
18    WESTERN EUROPE                 0.000029


In [21]:
# articles-per-population for each larger subregion (AFRICA, LATIN AMERICA AND THE CARIBBEAN, ASIA, EUROPE)
# find the corresponding larger region for each country
# (work on a copy so that population_data itself is not modified)
country_to_large_subregion_mapping = population_data.copy()
current_region = None
res_region = []
large_regions = ['AFRICA', 'LATIN AMERICA AND THE CARIBBEAN', 'ASIA', 'EUROPE']
for name in country_to_large_subregion_mapping['Name']:
    if name in large_regions:
        current_region = name
    elif name == 'NORTHERN AMERICA': # northern america does not belong to any larger subregion
        current_region = None
    # caveat: OCEANIA is not reset in the same way, so if its rows follow EUROPE in the file,
    # its countries keep the label EUROPE
    res_region.append(current_region)
country_to_large_subregion_mapping['region'] = res_region
country_to_large_subregion_mapping = country_to_large_subregion_mapping[~country_to_large_subregion_mapping['Name'].str.isupper()]
country_to_large_subregion_mapping = country_to_large_subregion_mapping[['Name','region']]
country_to_large_subregion_mapping = country_to_large_subregion_mapping.dropna()
article_with_population_and_large_region = article_with_population.merge(country_to_large_subregion_mapping, how='inner', left_on='country', right_on='Name')

# analysis
high_quality_article_large_region = article_with_population_and_large_region[(article_with_population_and_large_region['article_quality_est.']=='FA')|(article_with_population_and_large_region['article_quality_est.']=='GA')]
articles_per_population_large_region = high_quality_article_large_region.groupby('region').agg({'revision_id':'count'}).reset_index()
articles_per_population_large_region = articles_per_population_large_region.merge(population_data[['Name','Population']], left_on='region',right_on='Name')
articles_per_population_large_region['articles-per-population'] = (articles_per_population_large_region['revision_id']/articles_per_population_large_region['Population']) * 100
articles_per_population_large_region = articles_per_population_large_region[['region', 'articles-per-population']]
print(articles_per_population_large_region)

                            region  articles-per-population
0                           AFRICA                 0.000009
1                             ASIA                 0.000007
2                           EUROPE                 0.000055
3  LATIN AMERICA AND THE CARIBBEAN                 0.000012


In [22]:
# concatenate both smaller and large regions for final result
articles_per_population_all_region = pd.concat([articles_per_population_region, articles_per_population_large_region], ignore_index=True)
print(articles_per_population_all_region)

                             region  articles-per-population
0                         CARIBBEAN                 0.000030
1                   CENTRAL AMERICA                 0.000013
2                      CENTRAL ASIA                 0.000009
3                         EAST ASIA                 0.000005
4                    EASTERN AFRICA                 0.000008
5                    EASTERN EUROPE                 0.000040
6                     MIDDLE AFRICA                 0.000009
7                   NORTHERN AFRICA                 0.000008
8                  NORTHERN AMERICA                 0.000028
9                   NORTHERN EUROPE                 0.000096
10                          OCEANIA                 0.000146
11                    SOUTH AMERICA                 0.000009
12                       SOUTH ASIA                 0.000004
13                   SOUTHEAST ASIA                 0.000011
14                  SOUTHERN AFRICA                 0.000013
15                  SOUT

In [23]:
# percent high quality articles for each small region
high_quality_count_by_region = high_quality_article_region.groupby('region')['revision_id'].count().to_frame().reset_index()
total_count_by_region = article_with_population_and_region.groupby('region')['revision_id'].count().to_frame().reset_index()
percent_high_quality_by_region = total_count_by_region.merge(high_quality_count_by_region, how='inner', on='region')
percent_high_quality_by_region['percent_high_quality'] = percent_high_quality_by_region['revision_id_y']/percent_high_quality_by_region['revision_id_x']*100
percent_high_quality_by_region = percent_high_quality_by_region[['region', 'percent_high_quality']]

# percent high quality articles for each large region
high_quality_count_by_large_region = high_quality_article_large_region.groupby('region')['revision_id'].count().to_frame().reset_index()
total_count_by_large_region = article_with_population_and_large_region.groupby('region')['revision_id'].count().to_frame().reset_index()
percent_high_quality_by_large_region = total_count_by_large_region.merge(high_quality_count_by_large_region, how='inner', on='region')
percent_high_quality_by_large_region['percent_high_quality'] = percent_high_quality_by_large_region['revision_id_y']/percent_high_quality_by_large_region['revision_id_x']*100
percent_high_quality_by_large_region = percent_high_quality_by_large_region[['region', 'percent_high_quality']]

# concatenate both smaller and large regions for final result
percent_high_quality_by_all_region = pd.concat([percent_high_quality_by_region, percent_high_quality_by_large_region], ignore_index=True)
print(percent_high_quality_by_all_region)

                             region  percent_high_quality
0                         CARIBBEAN              1.870504
1                   CENTRAL AMERICA              1.490603
2                      CENTRAL ASIA              2.857143
3                         EAST ASIA              3.073190
4                    EASTERN AFRICA              1.398881
5                    EASTERN EUROPE              3.161844
6                     MIDDLE AFRICA              2.406015
7                   NORTHERN AFRICA              2.113459
8                  NORTHERN AMERICA              5.470805
9                   NORTHERN EUROPE              2.710603
10                          OCEANIA              2.015355
11                    SOUTH AMERICA              1.319261
12                       SOUTH ASIA              1.626202
13                   SOUTHEAST ASIA              3.613861
14                  SOUTHERN AFRICA              1.419558
15                  SOUTHERN EUROPE              1.994609
16            

## Step 6: Results

### Top 10 countries by coverage

In [24]:
articles_per_population = articles_per_population.sort_values(by='articles-per-population', ascending=False)
print(articles_per_population.head(10))

         country  articles-per-population
133       Tuvalu                 0.040000
33      Dominica                 0.001389
141      Vanuatu                 0.000935
51       Iceland                 0.000543
56       Ireland                 0.000500
86    Montenegro                 0.000322
81    Martinique                 0.000281
12        Bhutan                 0.000274
91   New Zealand                 0.000261
106      Romania                 0.000218


### Bottom 10 countries by coverage

In [25]:
articles_per_population = articles_per_population.sort_values(by='articles-per-population', ascending=True)
print(articles_per_population.head(10))

        country  articles-per-population
52        India             9.285051e-07
94      Nigeria             9.702144e-07
128    Tanzania             1.674088e-06
38     Ethiopia             1.740402e-06
8    Bangladesh             1.766691e-06
27     Colombia             2.022490e-06
134      Uganda             2.186222e-06
87      Morocco             2.781486e-06
16       Brazil             2.832701e-06
26        China             2.852284e-06


### Top 10 countries by relative quality

In [26]:
percent_high_quality_by_country = percent_high_quality_by_country.sort_values(by='percent_high_quality', ascending=False)
print(percent_high_quality_by_country.head(10))

                      country  percent_high_quality
63               Korea, North             22.222222
109              Saudi Arabia             12.820513
106                   Romania             12.244898
23   Central African Republic             12.121212
140                Uzbekistan             10.714286
82                 Mauritania             10.416667
46                  Guatemala              8.433735
33                   Dominica              8.333333
125                     Syria              7.812500
11                      Benin              7.692308


### Bottom 10 countries by relative quality

In [27]:
percent_high_quality_by_country = percent_high_quality_by_country.sort_values(by='percent_high_quality', ascending=True)
print(percent_high_quality_by_country.head(10))

         country  percent_high_quality
10       Belgium              0.192678
128     Tanzania              0.247525
124  Switzerland              0.248756
89         Nepal              0.280899
101         Peru              0.285714
94       Nigeria              0.295858
104     Portugal              0.314465
27      Colombia              0.350877
73     Lithuania              0.409836
87       Morocco              0.485437
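
Because the per-country table is built with an inner merge, a country whose articles include no FA or GA predictions never enters `percent_high_quality_by_country`, so the list above ranks only countries with at least one high-quality article. A hedged sketch of one way to keep such countries at 0 percent, using a left merge on toy counts (the country labels and numbers are invented):

```python
import pandas as pd

total = pd.DataFrame({'country': ['A', 'B', 'C'], 'total': [200, 50, 10]})
high_quality = pd.DataFrame({'country': ['A', 'B'], 'high_quality': [4, 1]})

# A left merge keeps every country; missing high-quality counts become 0
pct = total.merge(high_quality, how='left', on='country')
pct['high_quality'] = pct['high_quality'].fillna(0)
pct['percent_high_quality'] = pct['high_quality'] / pct['total'] * 100
print(pct.sort_values('percent_high_quality'))
```

Whether zero-percent countries belong in the ranking is an analysis decision; the sketch only shows how they could be retained.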


### Geographic regions by coverage

In [28]:
print(articles_per_population_all_region.sort_values(by='articles-per-population', ascending=False))

                             region  articles-per-population
10                          OCEANIA                 0.000146
9                   NORTHERN EUROPE                 0.000096
21                           EUROPE                 0.000055
15                  SOUTHERN EUROPE                 0.000048
5                    EASTERN EUROPE                 0.000040
17                     WESTERN ASIA                 0.000032
0                         CARIBBEAN                 0.000030
18                   WESTERN EUROPE                 0.000029
8                  NORTHERN AMERICA                 0.000028
14                  SOUTHERN AFRICA                 0.000013
1                   CENTRAL AMERICA                 0.000013
22  LATIN AMERICA AND THE CARIBBEAN                 0.000012
13                   SOUTHEAST ASIA                 0.000011
16                   WESTERN AFRICA                 0.000010
2                      CENTRAL ASIA                 0.000009
11                    SO

### Geographic regions by quality

In [29]:
print(percent_high_quality_by_all_region.sort_values(by='percent_high_quality', ascending=False))

                             region  percent_high_quality
8                  NORTHERN AMERICA              5.470805
13                   SOUTHEAST ASIA              3.613861
17                     WESTERN ASIA              3.472493
5                    EASTERN EUROPE              3.161844
3                         EAST ASIA              3.073190
2                      CENTRAL ASIA              2.857143
9                   NORTHERN EUROPE              2.710603
20                             ASIA              2.708494
6                     MIDDLE AFRICA              2.406015
21                           EUROPE              2.186226
7                   NORTHERN AFRICA              2.113459
10                          OCEANIA              2.015355
15                  SOUTHERN EUROPE              1.994609
0                         CARIBBEAN              1.870504
16                   WESTERN AFRICA              1.870033
19                           AFRICA              1.740020
12            

## Reflections and Implications

In this project, I investigated the quality of English Wikipedia articles on politicians using two metrics: articles per population and the proportion of high-quality articles. The first metric looks at coverage: countries with larger populations are expected to have more politicians, and thus more Wikipedia articles about them. In this notebook it is computed from the articles that ORES predicts as FA or GA. The second metric looks at quality by measuring what share of a country's articles are predicted to be high quality.

In my opinion, the articles-per-population metric is biased against countries with large populations, for two reasons. First, country populations span several orders of magnitude. Some countries have 10,000 times the population of others, which makes this metric very punishing for large countries. Second, although a larger country should have more politicians, the number of nationally important politicians does not scale linearly with population. The total number of politicians in these countries is likely large, but many of them work at a local level, and it makes little sense for Wikipedia to carry detailed, high-quality articles about them, since their actions are unlikely to affect the country as a whole. The analysis supports this claim: the bottom 10 countries ranked by coverage are mostly countries with large populations.
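
To illustrate the scale effect with invented numbers: if two countries each had 10 high-quality articles, but one had 10,000 times the population of the other, the metric would differ by the same factor of 10,000. The function name and both populations below are hypothetical.

```python
def articles_per_population_pct(article_count, population):
    # Same formula as the notebook: article count divided by population, as a percentage
    return article_count / population * 100

small = articles_per_population_pct(10, 100_000)        # hypothetical small country
large = articles_per_population_pct(10, 1_000_000_000)  # hypothetical large country
ratio = small / large
print(small, large, ratio)
```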

Another finding is that, at the regional level, regions with more English-speaking countries have a higher share of high-quality articles. Northern America, for example, ranks first at about 5.5 percent. I believe this is mainly because the data source is English Wikipedia. Pages about other regions might be machine translated from other languages, or translated by volunteers whose English is less fluent than a native speaker's. The machine learning API receives only these articles as input and cannot adjust its rating standards accordingly, which causes this bias.

When analyzing the data, I found that some country names in the two datasets (articles and population) do not match exactly. As a result, these rows were discarded. Discarding them can introduce bias because the discarded data are not random, and with some countries missing, the regional results can also be biased.
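
One way to reduce this loss would be to reconcile spelling differences before merging. The sketch below is only an illustration: the alias table, the helper `normalize_country`, and every name pair in it are hypothetical, and the real mismatches would have to be read from `wp_wpds_countries-no_match.csv`.

```python
# Hypothetical alias table: article-data spelling -> population-data spelling
ALIASES = {
    'Example Republic': 'Examplia',
}

def normalize_country(name, aliases=ALIASES):
    # Strip stray whitespace, then map known alternative spellings
    cleaned = name.strip()
    return aliases.get(cleaned, cleaned)

article_countries = {'Chad', ' Example Republic', 'Nowhere'}
population_countries = {'Chad', 'Examplia'}

normalized = {normalize_country(c) for c in article_countries}
still_unmatched = sorted(normalized - population_countries)
print(still_unmatched)
```

Names that remain unmatched after normalization would still be dropped, so the remaining list is worth reporting alongside the results.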