# DATA 512 A2
## Riley Waters

In this notebook, I compare Wikipedia articles on political figures across countries. I look at how article coverage and article quality differ between countries and regions, which could reveal underlying bias in Wikipedia's coverage.

Article quality is estimated using ORES (Objective Revision Evaluation Service), a machine learning system. Documentation here: https://www.mediawiki.org/wiki/ORES

### Getting the article and population data

Two data sources are used. The first is the "Wikipedia politicians by country" dataset, which can be found here as 'page_data.csv': https://figshare.com/articles/Untitled_Item/5513449

In [1]:
import pandas as pd

# Get the wiki articles data
article_df = pd.read_csv('./data/source/page_data.csv')
article_df.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


The second source is world population data, drawn from the Population Reference Bureau here: https://www.prb.org/international/indicator/population/table/

In [2]:
# Get the country population data
population_df = pd.read_csv('./data/source/WPDS_2018_data.csv')
population_df.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


### Cleaning the data

The articles dataset has some page names starting with 'Template:'. These are not Wikipedia articles, so they are filtered out. I also rename some fields for clarity.

In [3]:
# Get rid of the rows in article that start with Template:
article_clean_df = article_df[~article_df['page'].str.startswith('Template:')]

# Rename fields
article_clean_df = article_clean_df.rename(columns={'page': 'article_name', 'rev_id': 'revision_id'})
article_clean_df.head()

Unnamed: 0,article_name,country,revision_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


I'll also rename fields in the population dataset and convert population to its full numerical value.

In [4]:
# Rename columns
population_clean_df = population_df.rename(columns={'Geography': 'country', 'Population mid-2018 (millions)':'population'})

# Convert population from millions to an absolute count
# (assumes the values were read as strings, possibly with thousands separators)
population_clean_df['population'] = population_clean_df['population'].apply(lambda x: float(x.replace(',',''))*1e6)

population_clean_df.head()

Unnamed: 0,country,population
0,AFRICA,1284000000.0
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
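The conversion above calls `x.replace`, so it assumes every value in the population column is a string such as '1,284'. As a minimal sketch, assuming the column could instead hold a mix of strings and numbers depending on how pandas parsed the file, a small helper handles both cases. The name `millions_to_count` is hypothetical and not part of this notebook.

```python
def millions_to_count(value):
    """Convert a population given in millions to an absolute count.

    Accepts either a number (42.7) or a string that may contain
    thousands separators ('1,284').
    """
    if isinstance(value, str):
        value = value.replace(',', '')
    return float(value) * 1e6


print(millions_to_count('1,284'))  # string with a thousands separator
print(millions_to_count(42.7))     # already numeric
```

It could be dropped into the cell above as `population_clean_df['population'].apply(millions_to_count)`.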


Some of the rows in the 'country' column are actually regions, which the dataset writes in all uppercase. I collect those and save them to a csv, then keep only the actual countries in my cleaned dataset.

In [5]:
# Split the regions from the countries
cumulative_region_df = population_clean_df[population_clean_df['country'].str.isupper()]
cumulative_region_df.to_csv('./data/region_cumulatives.csv', sep=',',index=False)
cumulative_region_df.head()

Unnamed: 0,country,population
0,AFRICA,1284000000.0
56,NORTHERN AMERICA,365000000.0
59,LATIN AMERICA AND THE CARIBBEAN,649000000.0
95,ASIA,4536000000.0
144,EUROPE,746000000.0
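The region filter relies on how Python's `str.isupper()` treats spaces and punctuation: it returns `True` when a string has at least one cased character and every cased character is uppercase. A quick check on a few names taken from the tables in this notebook:

```python
names = ['AFRICA', 'LATIN AMERICA AND THE CARIBBEAN', 'Algeria', 'Korea, North']

# Spaces and punctuation are ignored; only cased characters are checked
region_flags = {name: name.isupper() for name in names}

for name, is_region in region_flags.items():
    print(f'{name!r}: {"region" if is_region else "country"}')
```

The pandas `.str.isupper()` accessor used above applies this same check to each element of the column.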


In [6]:
# Keep only the actual countries
population_clean_df = population_clean_df[~population_clean_df['country'].str.isupper()]
population_clean_df.head()

Unnamed: 0,country,population
1,Algeria,42700000.0
2,Egypt,97000000.0
3,Libya,6500000.0
4,Morocco,35200000.0
5,Sudan,41700000.0


### Getting article quality predictions

As mentioned, the article quality scores come from a machine learning system called ORES. Using the oresapi package, I retrieve a predicted quality class for each revision id in the article dataset. More information on oresapi can be found here: https://pypi.org/project/oresapi/. ORES assigns each article revision one of the following quality ratings, listed from highest to lowest:

FA - Featured article

GA - Good article

B - B-class article

C - C-class article

Start - Start-class article

Stub - Stub-class article
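For reference, the six classes can be treated as an ordered scale. The sketch below is only an illustration: the names `QUALITY_ORDER`, `HIGH_QUALITY`, `is_high_quality`, and `quality_rank` are hypothetical, and treating FA and GA as "high quality" follows the definition used in the analysis section of this notebook.

```python
# ORES articlequality classes, from highest to lowest quality
QUALITY_ORDER = ['FA', 'GA', 'B', 'C', 'Start', 'Stub']

# Classes counted as "quality articles" in the analysis section
HIGH_QUALITY = {'FA', 'GA'}


def is_high_quality(prediction):
    """Return True if an ORES prediction is FA or GA."""
    return prediction in HIGH_QUALITY


def quality_rank(prediction):
    """Return 0 for the best class (FA) up to 5 for the lowest (Stub)."""
    return QUALITY_ORDER.index(prediction)


# Sort a few predictions from best to worst
print(sorted(['Stub', 'GA', 'C', 'FA'], key=quality_rank))
```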

In [7]:
import oresapi

# Get the ores results for each revid in article df
rev_id_list = article_clean_df['revision_id'].tolist()

ores_session = oresapi.Session("https://ores.wikimedia.org", "Class project rdwaters@uw.edu")

results = ores_session.score("enwiki", ["articlequality"], rev_id_list)

# Keep the rev ids and their corresponding results attached
id_res_zip = zip(rev_id_list, results)

Some of the rev ids cannot be found and result in an error. I collect these error ids and store them in a csv. Then, I merge the non-error results into my article dataframe.

In [8]:
error_id_list = []
temp_list = []
for res in id_res_zip:
    if 'error' not in res[1]['articlequality']:
        # If there is no error, grab the quality
        article_quality = res[1]['articlequality']['score']['prediction']
        temp_dict = {
            'revision_id':res[0],
            'article_quality':article_quality
        }
        temp_list.append(temp_dict)
    else:
        # If there is an error, grab the error rev id
        error_id_list.append(res[0])
temp_df = pd.DataFrame(temp_list)

# Merge the non-error quality ratings into the dataframe
article_score_df = pd.merge(article_clean_df, temp_df, on='revision_id')
article_score_df.head()

Unnamed: 0,article_name,country,revision_id,article_quality
0,Bir I of Kanem,Chad,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,Stub
2,Yos Por,Cambodia,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,Stub
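To make the branching above easier to follow, here is a minimal sketch that runs the same split on two hand-written responses. The response shapes are assumptions inferred from the keys the cell above accesses (`articlequality`, `score`, `prediction`, `error`). The revision ids and the error payload are made up for illustration, and `split_scores` is a hypothetical helper name.

```python
def split_scores(rev_ids, responses):
    """Separate scored revisions from revisions that could not be scored."""
    scored, errors = [], []
    for rev_id, response in zip(rev_ids, responses):
        result = response['articlequality']
        if 'error' in result:
            # No prediction available; remember the revision id only
            errors.append(rev_id)
        else:
            scored.append({
                'revision_id': rev_id,
                'article_quality': result['score']['prediction'],
            })
    return scored, errors


# Hand-written stand-ins for ORES output (not real API responses)
mock_ids = [111, 222]
mock_responses = [
    {'articlequality': {'score': {'prediction': 'Stub'}}},
    {'articlequality': {'error': {'message': 'revision not found'}}},
]

scored, errors = split_scores(mock_ids, mock_responses)
print(scored)
print(errors)
```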


In [9]:
# Store the error ids into a csv
error_df = pd.DataFrame(data={"error_id": error_id_list})
error_df.to_csv('./data/error_rev_ids.csv', sep=',',index=False)

### Combining the datasets
The population and article dataframes are merged on country name using an outer join. Rows missing either an article name or a population belong to a country that appears in only one of the two datasets. These unmatched rows are saved separately, and the rows with matching countries are used for the final analysis.

In [10]:
# Outer join the two datasets on country
combined_df = pd.merge(article_score_df, population_clean_df, on='country', how='outer')
combined_df.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,15400000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15400000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,15400000.0
3,Kuri I of Kanem,Chad,565745365.0,Stub,15400000.0
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15400000.0


In [11]:
# Get all rows where population or article name is null
no_match_df = combined_df.loc[combined_df['population'].isnull() | combined_df['article_name'].isnull()]
# Save these rows to a csv
no_match_df.to_csv('./data/wp_wpds_countries-no_match.csv', index=False)
no_match_df.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
97,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188.0,Stub,
98,Finance Minister of the Palestinian National A...,Palestinian Territory,596181202.0,Start,
99,Planning Minister of the Palestinian National ...,Palestinian Territory,633612729.0,Start,
100,Hossam Arafat (politician),Palestinian Territory,680933208.0,Stub,
101,Tawfik Tirawi,Palestinian Territory,701106976.0,Start,


In [12]:
# Get all rows where population and article name are not null
final_df = combined_df.loc[combined_df['population'].notnull() & combined_df['article_name'].notnull()]
# Save these to a csv and use it for the final analysis
final_df.to_csv('./data/wp_wpds_politicians_by_country.csv', index=False)
final_df.head()

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,Chad,355319463.0,Stub,15400000.0
1,Abdullah II of Kanem,Chad,498683267.0,Stub,15400000.0
2,Salmama II of Kanem,Chad,565745353.0,Stub,15400000.0
3,Kuri I of Kanem,Chad,565745365.0,Stub,15400000.0
4,Mohammed I of Kanem,Chad,565745375.0,Stub,15400000.0
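As an alternative way to see which side each unmatched row came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses two tiny made-up frames rather than the real data, and is not the approach used in the cells above.

```python
import pandas as pd

# Toy stand-ins for the article and population tables (values are illustrative)
articles = pd.DataFrame({
    'article_name': ['A', 'B', 'C'],
    'country': ['Chad', 'Chad', 'Palestinian Territory'],
})
populations = pd.DataFrame({
    'country': ['Chad', 'Algeria'],
    'population': [15400000.0, 42700000.0],
})

merged = pd.merge(articles, populations, on='country', how='outer', indicator=True)

# 'both' rows have a match; the rest exist in only one table
matched = merged[merged['_merge'] == 'both']
unmatched = merged[merged['_merge'] != 'both']

print(unmatched[['country', '_merge']])
```

Here `left_only` marks a country with articles but no population figure, and `right_only` marks a country with a population figure but no articles.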


For analysis purposes, I also need the region of each country and that region's total population. Recall that these are in the original population dataframe. I attach the region and its total population to each country row. This works because the dataset lists each region total followed by the countries in that region, so the row order matters.

In [13]:
region = ''
regional_pop = 0
temp_list = []

for idx, row in population_df.iterrows():
    if row['Geography'].isupper():
        # All uppercase indicates a region. Save its name and population
        region = row['Geography']
        regional_pop = row['Population mid-2018 (millions)']
    else:
        # Anything else is a country. It belongs to the most recently seen region
        temp_dict = {
            'region': region,
            'country': row['Geography'],
            'regional_population': regional_pop
        }
        temp_list.append(temp_dict)

# Create a country region mapping dataframe
country_region_df = pd.DataFrame(temp_list)
country_region_df['regional_population'] = country_region_df['regional_population'].apply(lambda x: float(x.replace(',',''))*1e6)
country_region_df.head()

Unnamed: 0,country,region,regional_population
0,Algeria,AFRICA,1284000000.0
1,Egypt,AFRICA,1284000000.0
2,Libya,AFRICA,1284000000.0
3,Morocco,AFRICA,1284000000.0
4,Sudan,AFRICA,1284000000.0
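The same order-dependent mapping can be sketched without an explicit loop: keep the name and total only on region rows, then forward-fill them down to the country rows that follow. This is an alternative under the same assumption (each region total precedes its countries), shown on a small made-up frame with illustrative values. It is not the method used in the cell above.

```python
import pandas as pd

# Toy frame in the same layout as the population table: region row, then its countries
toy = pd.DataFrame({
    'Geography': ['AFRICA', 'Algeria', 'Egypt', 'ASIA', 'Armenia'],
    'population': [1284.0, 42.7, 97.0, 4536.0, 3.0],
})

is_region = toy['Geography'].str.isupper()

# Region names and totals survive only on region rows, then fill downward
toy['region'] = toy['Geography'].where(is_region).ffill()
toy['regional_population'] = toy['population'].where(is_region).ffill()

# Drop the region rows themselves, leaving one row per country
mapping = toy.loc[~is_region, ['Geography', 'region', 'regional_population']]
mapping = mapping.rename(columns={'Geography': 'country'}).reset_index(drop=True)
print(mapping)
```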


In [14]:
# Merge the mapping dataframe into my main dataframe
final_df = pd.merge(country_region_df, final_df, on='country')
final_df.head()

Unnamed: 0,country,region,regional_population,article_name,revision_id,article_quality,population
0,Algeria,AFRICA,1284000000.0,Ali Fawzi Rebaine,686269631.0,Stub,42700000.0
1,Algeria,AFRICA,1284000000.0,Ahmed Attaf,705910185.0,Stub,42700000.0
2,Algeria,AFRICA,1284000000.0,Ahmed Djoghlaf,707427823.0,Stub,42700000.0
3,Algeria,AFRICA,1284000000.0,Hammi Larouissi,708060571.0,Stub,42700000.0
4,Algeria,AFRICA,1284000000.0,Salah Goudjil,708980561.0,Stub,42700000.0


### Analysis

For the analysis, I need the coverage and relative quality of articles in each country and each region. Coverage is the number of politician articles expressed as a percentage of the population. Relative quality is the percentage of all articles that are quality articles (predicted 'FA' or 'GA').
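As a worked example of the two definitions, here are the figures for Tuvalu from the results tables in this notebook (54 articles, 5 of them FA or GA, population 10,000). The function names are hypothetical helpers used only for illustration.

```python
def coverage_pct(article_count, population):
    """Articles as a percentage of population."""
    return article_count / population * 100.0


def relative_quality_pct(quality_count, article_count):
    """FA/GA articles as a percentage of all articles."""
    return quality_count / article_count * 100.0


# Tuvalu: 54 articles, 5 quality articles, population 10,000
print(round(coverage_pct(54, 10000), 6))
print(round(relative_quality_pct(5, 54), 6))
```

Rounded to six decimals, these give 0.54 and 9.259259, matching the Tuvalu row in the results.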

In [15]:
# Group by the country
group_df = final_df.groupby('country')

temp_list = []
for country, group in group_df:
    # Total articles
    articles_in_group = len(group)
    
    # filter to quality articles
    quality_articles = group[group['article_quality'].isin(['FA', 'GA'])]
    
    # Country population
    population = group['population'].iloc[0]
    
    # Number of quality articles
    quality_articles_count = len(quality_articles)
    
    temp_dict = {
        'country': country,
        'articles_count': articles_in_group,
        'population': population,
        'quality_articles_count': quality_articles_count,
        'coverage': (articles_in_group/population)*100.0,
        'relative_quality': (quality_articles_count/articles_in_group)*100.0
    }
    temp_list.append(temp_dict)

# Create the analysis table per country
analysis_country_df = pd.DataFrame(temp_list)
analysis_country_df.to_csv('./data/final_analysis_data.csv', index=False)
analysis_country_df.head()

Unnamed: 0,articles_count,country,coverage,population,quality_articles_count,relative_quality
0,320,Afghanistan,0.000877,36500000.0,12,3.75
1,457,Albania,0.015759,2900000.0,3,0.656455
2,116,Algeria,0.000272,42700000.0,2,1.724138
3,34,Andorra,0.0425,80000.0,0,0.0
4,106,Angola,0.000349,30400000.0,0,0.0
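The per-country loop above could also be written as a grouped aggregation. This is a sketch of an equivalent approach on a small made-up frame, assuming pandas 0.25 or later for named aggregation. It is not the code that produced the table above.

```python
import pandas as pd

# Toy data: country X has 4 articles (2 FA/GA), country Y has 2 (none FA/GA)
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'article_quality': ['FA', 'Stub', 'GA', 'Start', 'Stub', 'C'],
    'population': [1000.0] * 4 + [500.0] * 2,
})

# Flag FA/GA rows so they can be summed per group
toy['is_quality'] = toy['article_quality'].isin(['FA', 'GA'])

summary = toy.groupby('country').agg(
    articles_count=('article_quality', 'size'),
    population=('population', 'first'),
    quality_articles_count=('is_quality', 'sum'),
).reset_index()

summary['coverage'] = summary['articles_count'] / summary['population'] * 100.0
summary['relative_quality'] = (
    summary['quality_articles_count'] / summary['articles_count'] * 100.0
)
print(summary)
```

The same pattern would apply to the regional table by grouping on 'region' and taking the first 'regional_population' instead.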


In [16]:
# Group by the region
group_df = final_df.groupby('region')

temp_list = []
for region, group in group_df:
    # Total articles
    articles_in_group = len(group)
    
    # filter to quality articles
    quality_articles = group[group['article_quality'].isin(['FA', 'GA'])]
    
    # Regional population
    population = group['regional_population'].iloc[0]
    
    # Number of quality articles
    quality_articles_count = len(quality_articles)
    
    temp_dict = {
        'region': region,
        'articles_count': articles_in_group,
        'population': population,
        'quality_articles_count': quality_articles_count,
        'coverage': (articles_in_group/population)*100.0,
        'relative_quality': (quality_articles_count/articles_in_group)*100.0
    }
    temp_list.append(temp_dict)

# Create the analysis table per region
analysis_regional_df = pd.DataFrame(temp_list)
analysis_regional_df.to_csv('./data/final_analysis_data_regional.csv', index=False)
analysis_regional_df.head()

Unnamed: 0,articles_count,coverage,population,quality_articles_count,region,relative_quality
0,6851,0.000534,1284000000.0,125,AFRICA,1.824551
1,11531,0.000254,4536000000.0,310,ASIA,2.688405
2,15864,0.002127,746000000.0,322,EUROPE,2.029753
3,5169,0.000796,649000000.0,69,LATIN AMERICA AND THE CARIBBEAN,1.334881
4,1921,0.000526,365000000.0,99,NORTHERN AMERICA,5.153566


### Results

#### Top 10 countries by coverage
"10 highest-ranked countries in terms of number of politician articles as a proportion of country population"

In [17]:
analysis_country_df.sort_values('coverage', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,articles_count,country,coverage,population,quality_articles_count,relative_quality
0,54,Tuvalu,0.54,10000.0,5,9.259259
1,52,Nauru,0.52,10000.0,0,0.0
2,81,San Marino,0.27,30000.0,0,0.0
3,40,Monaco,0.1,40000.0,0,0.0
4,28,Liechtenstein,0.07,40000.0,0,0.0
5,63,Tonga,0.063,100000.0,0,0.0
6,37,Marshall Islands,0.061667,60000.0,0,0.0
7,201,Iceland,0.05025,400000.0,2,0.995025
8,34,Andorra,0.0425,80000.0,0,0.0
9,36,Grenada,0.036,100000.0,1,2.777778


#### Bottom 10 countries by coverage
"10 lowest-ranked countries in terms of number of politician articles as a proportion of country population"

In [18]:
analysis_country_df.sort_values('coverage', ascending=True).reset_index(drop=True).head(10)

Unnamed: 0,articles_count,country,coverage,population,quality_articles_count,relative_quality
0,980,India,7.1e-05,1371300000.0,17,1.734694
1,210,Indonesia,7.9e-05,265200000.0,10,4.761905
2,1130,China,8.1e-05,1393800000.0,41,3.628319
3,28,Uzbekistan,8.5e-05,32900000.0,2,7.142857
4,101,Ethiopia,9.4e-05,107500000.0,2,1.980198
5,36,"Korea, North",0.000141,25600000.0,7,19.444444
6,25,Zambia,0.000141,17700000.0,0,0.0
7,112,Thailand,0.000169,66200000.0,3,2.678571
8,58,Mozambique,0.00019,30500000.0,0,0.0
9,319,Bangladesh,0.000192,166400000.0,3,0.940439


#### Top 10 countries by relative quality
"10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality"

In [19]:
analysis_country_df.sort_values('relative_quality', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,articles_count,country,coverage,population,quality_articles_count,relative_quality
0,36,"Korea, North",0.000141,25600000.0,7,19.444444
1,118,Saudi Arabia,0.000353,33400000.0,15,12.711864
2,48,Mauritania,0.001067,4500000.0,6,12.5
3,66,Central African Republic,0.001404,4700000.0,8,12.121212
4,343,Romania,0.001759,19500000.0,39,11.370262
5,54,Tuvalu,0.54,10000.0,5,9.259259
6,33,Bhutan,0.004125,800000.0,3,9.090909
7,12,Dominica,0.017143,70000.0,1,8.333333
8,128,Syria,0.000699,18300000.0,10,7.8125
9,91,Benin,0.000791,11500000.0,7,7.692308


#### Bottom 10 countries by relative quality
"10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality"

Note that many countries have 0 as their percentage of quality articles. More than 10 countries are tied at 0, so the 10 shown here are not meaningfully ranked below the others.

In [20]:
analysis_country_df.sort_values('relative_quality', ascending=True).reset_index(drop=True).head(10)

Unnamed: 0,articles_count,country,coverage,population,quality_articles_count,relative_quality
0,116,Slovakia,0.002148,5400000.0,0,0.0
1,162,Namibia,0.00648,2500000.0,0,0.0
2,37,Cape Verde,0.006167,600000.0,0,0.0
3,58,Mozambique,0.00019,30500000.0,0,0.0
4,147,Costa Rica,0.00294,5000000.0,0,0.0
5,40,Monaco,0.1,40000.0,0,0.0
6,37,Djibouti,0.0037,1000000.0,0,0.0
7,423,Moldova,0.012086,3500000.0,0,0.0
8,185,Uganda,0.00042,44100000.0,0,0.0
9,16,Eritrea,0.000267,6000000.0,0,0.0
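Because of these ties, the order among the zero rows carries no information. A quick way to see how many countries share the bottom value, sketched on made-up rows rather than the real table:

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'relative_quality': [0.0, 3.5, 0.0, 0.0],
})

# All countries tied at the minimum, not just the first few after sorting
lowest = toy['relative_quality'].min()
tied = toy[toy['relative_quality'] == lowest]

print(len(tied), 'countries share the lowest relative quality of', lowest)
```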


#### Geographic regions by coverage
"Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population"

In [21]:
analysis_regional_df.sort_values('coverage', ascending=False).reset_index(drop=True)

Unnamed: 0,articles_count,coverage,population,quality_articles_count,region,relative_quality
0,3128,0.007629,41000000.0,66,OCEANIA,2.109974
1,15864,0.002127,746000000.0,322,EUROPE,2.029753
2,5169,0.000796,649000000.0,69,LATIN AMERICA AND THE CARIBBEAN,1.334881
3,6851,0.000534,1284000000.0,125,AFRICA,1.824551
4,1921,0.000526,365000000.0,99,NORTHERN AMERICA,5.153566
5,11531,0.000254,4536000000.0,310,ASIA,2.688405


#### Geographic regions by relative quality
"Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality"

In [22]:
analysis_regional_df.sort_values('relative_quality', ascending=False).reset_index(drop=True)

Unnamed: 0,articles_count,coverage,population,quality_articles_count,region,relative_quality
0,1921,0.000526,365000000.0,99,NORTHERN AMERICA,5.153566
1,11531,0.000254,4536000000.0,310,ASIA,2.688405
2,3128,0.007629,41000000.0,66,OCEANIA,2.109974
3,15864,0.002127,746000000.0,322,EUROPE,2.029753
4,6851,0.000534,1284000000.0,125,AFRICA,1.824551
5,5169,0.000796,649000000.0,69,LATIN AMERICA AND THE CARIBBEAN,1.334881


Reflections on these findings and their implications are written in the README of this repository.