## English Wikipedia Political Figures Articles Coverage and Quality Analysis

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import pandas as pd
import numpy as np

### Step 1: Set up page and population data

**Wikipedia dataset** from Figshare.com: https://figshare.com/articles/Untitled_Item/5513449

This dataset is downloaded from Figshare.com. <br/>
The dataset is titled "Politicians by Country from the English-language Wikipedia" and contains data extracted from Wikimedia through API calls. <br/>
Both the dataset and the code used to extract the data are released under the CC-BY-SA 4.0 license. <br/>
It is downloadable as a CSV file named "page_data.csv", which has three columns and 47,197 rows. <br/>

    page: article title of the page about a political figure, not yet cleaned
    country: cleaned name of the country whose category the political figure is listed under
    rev_id: unique identifier of the page revision

In [3]:
## load page_data.csv into pandas DataFrame and examine first 5 rows
page_data = pd.read_csv('page_data.csv', sep=',')
page_data.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409
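
The preview above shows that several `page` values start with `Template:`, which is what "not cleaned yet" looks like in practice. The notebook keeps these rows as they are. As a rough sketch only, using a small hand-made frame instead of the real file, rows like these could be counted or set aside as follows (the names `sample_pages`, `is_template` and `articles_only` are illustrative and not part of the notebook):

```python
import pandas as pd

# Hand-made sample that mirrors the first rows of page_data.csv
sample_pages = pd.DataFrame({
    "page": ["Template:ZambiaProvincialMinisters",
             "Bir I of Kanem",
             "Template:Zimbabwe-politician-stub"],
    "country": ["Zambia", "Chad", "Zimbabwe"],
    "rev_id": [235107991, 355319463, 391862046],
})

# Flag rows whose title starts with the "Template:" prefix
is_template = sample_pages["page"].str.startswith("Template:")

# Count the flagged rows and keep the remaining ones
template_count = int(is_template.sum())
articles_only = sample_pages.loc[~is_template].reset_index(drop=True)

print(template_count)
print(articles_only)
```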


**Population dataset** from Dropbox: https://www.dropbox.com/s/5u7sy1xt7g0oi2c/WPDS_2018_data.csv?dl=0

This dataset is downloaded from Dropbox. <br/>
The dataset originally comes from the Population Reference Bureau, under International Indicators. <br/>
It contains mid-2018 population figures for all countries, expressed in millions. <br/>
It is downloadable as a CSV file named "WPDS_2018_data.csv", which has two columns and 207 rows. <br/>

    Geography: country and continent names 
    Population mid-2018 (millions): population data from mid-2018 in millions

In [4]:
## load WPDS_2018_data.csv into pandas DataFrame and examine first 5 rows   
population_data = pd.read_csv('WPDS_2018_data.csv', sep=',', thousands=',')
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2
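
The `Geography` column mixes continent rows (such as AFRICA) with country rows. The notebook does not separate them explicitly; the inner merge in Step 3 only keeps names that match a `country` value. If a separate view were wanted, one possible sketch is shown below. It assumes that continent names are always written in upper case, which the preview suggests but which has not been checked against the whole file.

```python
import pandas as pd

# Hand-made sample that mirrors the first rows of WPDS_2018_data.csv
sample_population = pd.DataFrame({
    "Geography": ["AFRICA", "Algeria", "Egypt", "Libya", "Morocco"],
    "Population mid-2018 (millions)": [1284.0, 42.7, 97.0, 6.5, 35.2],
})

# Assumption: a fully upper-case name marks a continent row
is_region = sample_population["Geography"].str.isupper()

regions = sample_population.loc[is_region]
countries = sample_population.loc[~is_region]

print(regions)
print(countries)
```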


### Step 2: Set up article quality predictions

For the article quality predictions, we will call the ORES API with each article's rev_id and read the 'prediction' value from the JSON response.

For ORES documentation, please refer to this website: https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context

ORES predicts one of 6 quality categories, listed below from highest to lowest quality. In the later analysis, we will treat only the first two categories (FA and GA) as high quality when calculating the high quality article percentage.

    FA - Featured article
    GA - Good article
    B - B-class article
    C - C-class article
    Start - Start-class article
    Stub - Stub-class article
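
To make the lookup path concrete, here is a minimal sketch that pulls predictions out of a response shaped like the one the notebook reads (`enwiki` → `scores` → rev_id → `wp10` → `score` → `prediction`). The `mock_response` dictionary is hand-written for illustration, the content of its error entry is an assumption, and `extract_predictions` is a hypothetical helper rather than part of the ORES client.

```python
def extract_predictions(response, revids, project="enwiki", model="wp10"):
    """Return {revid: prediction} for the revisions that carry a score."""
    scores = response.get(project, {}).get("scores", {})
    predictions = {}
    for revid in revids:
        # The JSON keys are strings, so convert the revid before the lookup
        entry = scores.get(str(revid), {}).get(model, {})
        score = entry.get("score")
        if score is not None:
            predictions[revid] = score["prediction"]
    return predictions


# Hand-written stand-in for an API reply (the error payload is illustrative)
mock_response = {
    "enwiki": {
        "scores": {
            "355319463": {"wp10": {"score": {"prediction": "Stub"}}},
            "235107991": {"wp10": {"error": {"type": "RevisionNotFound"}}},
        }
    }
}

result = extract_predictions(mock_response, [355319463, 235107991])
print(result)
```

Revisions without a `score` entry are simply left out of the result, which matches the way the notebook skips them.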

In [5]:
## import packages for making API calls to ORES
import requests
import json

In [6]:
## Define headers and endpoint for API call
headers = {'User-Agent' : 'https://github.com/yd4wh', 'From' : 'yd4wh@uw.edu'}
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'

In [7]:
## Define a function that takes a batch of rev_ids and returns their quality predictions
def get_ores_quality_prediction(revids, headers, endpoint):
        
    # define parameters for endpoints
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revids)
              }
    
    # use above defined parameters to make the API request, sending the identifying headers
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    
    # collect the quality prediction for each revid in this batch of up to 100
    quality_prediction = []
    revid_list = []
    
    # Testing showed that some revids come back without a score, which makes the lookup below raise KeyError;
    # those revids are skipped so that the loop can continue
    for revid in revids:
        try:
            quality_prediction.append(response['enwiki']['scores'][str(revid)]['wp10']['score']['prediction'])
            revid_list.append(revid)
        # skip the revids without scores
        except KeyError:
            pass
        
    # this function will return revids and its associated prediction values
    return revid_list, quality_prediction

In [8]:
## pass revids to the API call function in batches of 100

# change revids into list
revids = list(page_data['rev_id'])

# define starting and ending points for first iteration
start = 0
end = 100

# create empty Dataframe for collecting article quality output
article_quality = pd.DataFrame()

# loop over all revids in groups of 100s
while start < len(revids):
    
    # pull out the revids in groups of 100 for each iteration
    iter_revids = revids[start:end]
    
    # call the function to get article quality predictions
    iter_result = get_ores_quality_prediction(iter_revids,headers,endpoint)
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used to add the batch result
    article_quality = pd.concat([article_quality, pd.DataFrame(list(iter_result)).T])
    
    # update starting and ending points for next iteration
    start += 100
    end = min(start+100, len(revids))

# show the first rows of the final DataFrame, which excludes the revids that don't have scores
article_quality.head()

Unnamed: 0,0,1
0,355319463,Stub
1,391862046,Stub
2,391862070,Stub
3,391862409,Stub
4,391862819,Stub
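
The batching loop above keeps track of `start` and `end` by hand. As an alternative sketch, not the approach used in this notebook, the same slicing can be wrapped in a small generator. The name `chunked` is hypothetical.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example with 250 dummy ids: two full batches and one partial batch
dummy_revids = list(range(250))
batches = list(chunked(dummy_revids, size=100))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Slicing past the end of a list is safe in Python, so the last batch is simply shorter and no `min(...)` guard is needed.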


The ORES article quality prediction DataFrame is now saved as **article_quality**, with 2 columns and 47,092 rows after removing all articles that don't have a quality score.
    
    revision_id: the revision_id that can be linked back to page_data
    article_quality: the ORES quality prediction for associated revision_id

In [9]:
# rename article_quality columns before merging in next step
article_quality.rename(columns={0:'revision_id',1:'article_quality'}, inplace=True)
article_quality.head()

Unnamed: 0,revision_id,article_quality
0,355319463,Stub
1,391862046,Stub
2,391862070,Stub
3,391862409,Stub
4,391862819,Stub


### Step 3: Combine page_data, population_data and article_quality

This step uses the common columns in page_data (rev_id, country), population_data (Geography) and article_quality (revision_id) to merge all three DataFrames. The result is a combined DataFrame with 5 columns and 44,973 rows after removing all data points that don't match.

    country: country column from page_data
    article_name: page column from page_data
    revision_id: revision_id column from article_quality
    article_quality: article_quality column from article_quality
    population: Population mid-2018 (millions) column from population_data which will be in millions
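
The effect of the two merges can be seen on a toy example. This is only a sketch: apart from the Chad row, the rows, rev_ids and the country name "Examplestan" are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for page_data, article_quality and population_data
pages = pd.DataFrame({
    "page": ["Bir I of Kanem", "Unscored Page", "Unmatched Country Page"],
    "country": ["Chad", "Chad", "Examplestan"],
    "rev_id": [1, 2, 3],
})
quality = pd.DataFrame({
    "revision_id": [1, 3],
    "article_quality": ["Stub", "GA"],
})
population = pd.DataFrame({
    "Geography": ["AFRICA", "Chad"],
    "Population mid-2018 (millions)": [1284.0, 15.4],
})

# Right merge: keep only the revisions that received a quality score
scored = pages.merge(quality, how="right",
                     left_on="rev_id", right_on="revision_id")

# Inner merge: keep only the rows whose country has a population entry
combined = scored.merge(population, how="inner",
                        left_on="country", right_on="Geography")

print(len(scored), len(combined))
print(combined[["country", "page", "article_quality"]])
```

The unscored page drops out in the first merge, and the page whose country has no population entry drops out in the second. The AFRICA row never matches a `country` value, so it does not appear in the result either.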

In [10]:
# make deep copies of the three dataframes as base df for merging
df_page_data = page_data.copy(deep=True)
df_population_data = population_data.copy(deep=True)
df_article_quality = article_quality.copy(deep=True)

In [11]:
# combine page_data and article_quality on rev_id and revision_id columns
combined_data = df_page_data.merge(df_article_quality, how='right', left_on='rev_id', right_on='revision_id')

# combine combined_data with population_data on country and geography columns
combined_data = combined_data.merge(df_population_data, how='inner', left_on='country', right_on='Geography')
combined_data.rename(columns={'page':'article_name','Population mid-2018 (millions)':'population'}, inplace=True)
combined_data.head()

Unnamed: 0,article_name,country,rev_id,revision_id,article_quality,Geography,population
0,Bir I of Kanem,Chad,355319463,355319463,Stub,Chad,15.4
1,Abdullah II of Kanem,Chad,498683267,498683267,Stub,Chad,15.4
2,Salmama II of Kanem,Chad,565745353,565745353,Stub,Chad,15.4
3,Kuri I of Kanem,Chad,565745365,565745365,Stub,Chad,15.4
4,Mohammed I of Kanem,Chad,565745375,565745375,Stub,Chad,15.4


In [12]:
# clean the combined_data DataFrame to keep just the five columns documented above
df_combined_data = combined_data[['country',
                                  'article_name',
                                  'revision_id',
                                  'article_quality',
                                  'population']]
df_combined_data.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Chad,Bir I of Kanem,355319463,Stub,15.4
1,Chad,Abdullah II of Kanem,498683267,Stub,15.4
2,Chad,Salmama II of Kanem,565745353,Stub,15.4
3,Chad,Kuri I of Kanem,565745365,Stub,15.4
4,Chad,Mohammed I of Kanem,565745375,Stub,15.4


In [13]:
# output final_data.csv for reproducibility
df_combined_data.to_csv('final_data.csv', index=False)

### Step 4: Analysis on articles quality by country and population

In [14]:
# make a deep copy of the final DataFrame for analysis
final_data = df_combined_data.copy(deep=True)

The **percentage of articles-per-population** for each country: this measure is calculated by dividing the total number of articles for a country by the total population of that country, then multiplying by 100. This requires us to count the articles by country and to convert the population from millions into an absolute number.
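
As a quick worked example of this measure, the sketch below uses the Afghanistan figures that appear in this notebook's output (326 articles, 36.5 million people). The function name is hypothetical.

```python
def pct_articles_per_population(article_count, population_millions):
    """Articles as a percentage of the absolute population."""
    population = population_millions * 1_000_000  # millions -> people
    return 100 * article_count / population


afghanistan_pct = pct_articles_per_population(326, 36.5)
print(round(afghanistan_pct, 6))
```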

In [15]:
# count total number of articles in each country using group by
article_by_country = final_data.groupby('country').count()['article_name']

In [16]:
# pass the series into a dataframe for merging with population data
df_article_by_country = article_by_country.to_frame(name='article_count')
# change country into a column instead of index for merging
df_article_by_country['country'] = df_article_by_country.index
df_article_by_country.head()

Unnamed: 0_level_0,article_count,country
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,326,Afghanistan
Albania,460,Albania
Algeria,119,Algeria
Andorra,34,Andorra
Angola,110,Angola


In [17]:
# merge with population_data df to calculate percentage
articles_per_population = df_article_by_country.merge(df_population_data, how='inner', 
                                                      left_on='country', right_on='Geography')
# convert population from millions into an absolute number
articles_per_population['population'] = articles_per_population['Population mid-2018 (millions)']*1000000
# calculate the percentage of articles per population by country
articles_per_population['pcnt_articles_per_population'] = 100*(articles_per_population['article_count']/articles_per_population['population'])

The **percentage of high-quality-articles** for each country: this measure is calculated by dividing the number of articles for a country that qualify as either "FA" or "GA" by the total number of articles about politicians of that country, then multiplying by 100.
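
For comparison, the same share can be computed in one pass by averaging a boolean flag per country. This is an alternative sketch on made-up data, not the method used in the cells that follow. Note that it yields 0 for a country with no "FA" or "GA" articles, whereas the merge-based approach in this notebook yields NaN.

```python
import pandas as pd

# Made-up ratings for two placeholder countries
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B"],
    "article_quality": ["FA", "Stub", "GA", "C", "Start", "Stub"],
})

# True where the article is rated FA or GA
is_high = toy["article_quality"].isin(["FA", "GA"])

# The mean of a boolean flag is the share of True values in each group
pct_high = 100 * is_high.groupby(toy["country"]).mean()
print(pct_high)
```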

In [18]:
# limit articles to only "FA" and "GA" qualities
high_quality_articles = final_data.loc[final_data['article_quality'].isin(['FA','GA'])]
# count total number of high quality articles in each country using group by 
quality_article_by_country = high_quality_articles.groupby('country').count()['article_name']

In [19]:
# pass the series into a dataframe for merging with population data
df_quality_article_by_country = quality_article_by_country.to_frame(name='high_quality_article_count')
# change country into a column instead of index for later merging
df_quality_article_by_country['country'] = df_quality_article_by_country.index
df_quality_article_by_country.head()

Unnamed: 0_level_0,high_quality_article_count,country
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,10,Afghanistan
Albania,4,Albania
Algeria,2,Algeria
Argentina,15,Argentina
Armenia,5,Armenia


In [20]:
# merge with articles_per_population df to calculate article percentage
analysis_df = df_quality_article_by_country.merge(articles_per_population, how='right', 
                                                  left_on='country', right_on='country')
# divide total number of high quality articles by total article count
analysis_df['pcnt_high_quality_articles'] = 100*(analysis_df['high_quality_article_count']/analysis_df['article_count'])
analysis_df.head()

Unnamed: 0,high_quality_article_count,country,article_count,Geography,Population mid-2018 (millions),population,pcnt_articles_per_population,pcnt_high_quality_articles
0,10.0,Afghanistan,326,Afghanistan,36.5,36500000.0,0.000893,3.067485
1,4.0,Albania,460,Albania,2.9,2900000.0,0.015862,0.869565
2,2.0,Algeria,119,Algeria,42.7,42700000.0,0.000279,1.680672
3,15.0,Argentina,496,Argentina,44.5,44500000.0,0.001115,3.024194
4,5.0,Armenia,198,Armenia,3.0,3000000.0,0.0066,2.525253


The combined analysis DataFrame includes all countries that have both population data and Wikipedia articles, regardless of the count of high quality articles.
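
Because of the right merge, a country with articles but no high quality articles ends up with NaN in `high_quality_article_count`, and therefore in the percentage as well. The toy sketch below, with made-up counts, shows the effect and one optional way to turn the missing counts into zeros. The notebook itself keeps the NaN values.

```python
import pandas as pd

# Made-up counts: country "B" has no high quality articles
all_counts = pd.DataFrame({"country": ["A", "B"], "article_count": [4, 2]})
high_counts = pd.DataFrame({"country": ["A"],
                            "high_quality_article_count": [2]})

# Right merge keeps every row of all_counts
merged = high_counts.merge(all_counts, how="right", on="country")
missing = merged["high_quality_article_count"].isna()

# Optional: treat a missing count as zero
merged_filled = merged.fillna({"high_quality_article_count": 0})

print(merged)
print(merged_filled)
```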

In [21]:
# keep necessary and non-duplicate columns
analysis_df = analysis_df[['country',
                           'article_count',
                           'high_quality_article_count',
                           'population',
                           'pcnt_articles_per_population',
                           'pcnt_high_quality_articles']]
analysis_df.head()

Unnamed: 0,country,article_count,high_quality_article_count,population,pcnt_articles_per_population,pcnt_high_quality_articles
0,Afghanistan,326,10.0,36500000.0,0.000893,3.067485
1,Albania,460,4.0,2900000.0,0.015862,0.869565
2,Algeria,119,2.0,42700000.0,0.000279,1.680672
3,Argentina,496,15.0,44500000.0,0.001115,3.024194
4,Armenia,198,5.0,3000000.0,0.0066,2.525253


### Step 5: Tables of highest and lowest ranked countries by *articles_per_population* and *high_quality_articles*

This section will display four tables that summarize the 10 highest and 10 lowest ranked countries in terms of their pcnt_articles_per_population and pcnt_high_quality_articles in the order below:

    1. 10 highest-ranked countries in terms of pcnt_articles_per_population
    2. 10 lowest-ranked countries in terms of pcnt_articles_per_population
    3. 10 highest-ranked countries in terms of pcnt_high_quality_articles
    4. 10 lowest-ranked countries in terms of pcnt_high_quality_articles
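
One detail matters for the lowest-ranked tables: `sort_values` places NaN values last by default (`na_position='last'`) in both ascending and descending order, so `head(n)` never picks up rows with a missing percentage. A small sketch on made-up values:

```python
import pandas as pd

# Made-up percentages; country "C" has no value
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "pcnt": [0.5, 0.1, None, 0.3],
})

# NaN rows sort to the end in both directions
lowest2 = toy.sort_values(by="pcnt").head(2)["country"].tolist()
highest2 = toy.sort_values(by="pcnt", ascending=False).head(2)["country"].tolist()

print(lowest2)
print(highest2)
```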

In [22]:
# 10 highest-ranked countries sorting by 'pcnt_articles_per_population'
analysis_df.sort_values(by='pcnt_articles_per_population', ascending=False).head(10)[['country',
                                                                                      'article_count',
                                                                                      'population',
                                                                                      'pcnt_articles_per_population']]

Unnamed: 0,country,article_count,population,pcnt_articles_per_population
131,Tuvalu,55,10000.0,0.55
168,Nauru,53,10000.0,0.53
170,San Marino,82,30000.0,0.273333
166,Monaco,40,40000.0,0.1
161,Liechtenstein,29,40000.0,0.0725
128,Tonga,63,100000.0,0.063
164,Marshall Islands,37,60000.0,0.061667
53,Iceland,206,400000.0,0.0515
143,Andorra,34,80000.0,0.0425
155,Federated States of Micronesia,38,100000.0,0.038


In [23]:
# 10 lowest-ranked countries sorting by 'pcnt_articles_per_population'
analysis_df.sort_values(by='pcnt_articles_per_population').head(10)[['country',
                                                                     'article_count',
                                                                     'population',
                                                                     'pcnt_articles_per_population']]

Unnamed: 0,country,article_count,population,pcnt_articles_per_population
54,India,986,1371300000.0,7.2e-05
55,Indonesia,214,265200000.0,8.1e-05
25,China,1135,1393800000.0,8.1e-05
137,Uzbekistan,29,32900000.0,8.8e-05
39,Ethiopia,105,107500000.0,9.8e-05
179,Zambia,25,17700000.0,0.000141
65,"Korea, North",39,25600000.0,0.000152
126,Thailand,112,66200000.0,0.000169
9,Bangladesh,323,166400000.0,0.000194
167,Mozambique,60,30500000.0,0.000197


In [24]:
# 10 highest-ranked countries sorting by 'pcnt_high_quality_articles'
analysis_df.sort_values(by='pcnt_high_quality_articles', ascending=False).head(10)[['country',
                                                                                    'high_quality_article_count',
                                                                                    'article_count',
                                                                                    'pcnt_high_quality_articles']]

Unnamed: 0,country,high_quality_article_count,article_count,pcnt_high_quality_articles
65,"Korea, North",7.0,39,17.948718
108,Saudi Arabia,16.0,119,13.445378
22,Central African Republic,8.0,68,11.764706
105,Romania,40.0,348,11.494253
82,Mauritania,5.0,52,9.615385
12,Bhutan,3.0,33,9.090909
131,Tuvalu,5.0,55,9.090909
32,Dominica,1.0,12,8.333333
135,United States,82.0,1092,7.509158
11,Benin,7.0,94,7.446809


In [25]:
# 10 lowest-ranked countries sorting by 'pcnt_high_quality_articles'
analysis_df.sort_values(by='pcnt_high_quality_articles').head(10)[['country',
                                                                   'high_quality_article_count',
                                                                   'article_count',
                                                                   'pcnt_high_quality_articles']]

Unnamed: 0,country,high_quality_article_count,article_count,pcnt_high_quality_articles
125,Tanzania,1.0,408,0.245098
100,Peru,1.0,354,0.282486
75,Lithuania,1.0,248,0.403226
94,Nigeria,3.0,682,0.439883
87,Morocco,1.0,208,0.480769
40,Fiji,1.0,199,0.502513
13,Bolivia,1.0,187,0.534759
16,Brazil,3.0,551,0.544465
76,Luxembourg,1.0,180,0.555556
111,Sierra Leone,1.0,166,0.60241


One caveat on the lowest-ranked countries in terms of pcnt_high_quality_articles: the table above only includes countries that have at least 1 article rated "GA" or "FA", and it leaves out countries that don't have any high quality articles about politicians. Those countries are listed separately below in alphabetical order. There are 37 countries that don't have any articles rated "GA" or "FA".

In [35]:
# countries whose high quality percentage is NaN have no "FA" or "GA" articles
countries_without_high_quality_articles = analysis_df.loc[pd.isnull(analysis_df['pcnt_high_quality_articles'])]
print("There are " + str(len(countries_without_high_quality_articles)) + " countries that don't have any high quality articles.")
countries_without_high_quality_articles[['country']]

There are 37 countries that don't have any high quality articles.


Unnamed: 0,country
143,Andorra
144,Angola
145,Antigua and Barbuda
146,Bahamas
147,Barbados
148,Belgium
149,Belize
150,Cameroon
151,Cape Verde
152,Comoros
