In [1]:
import pandas as pd
import numpy as np
import csv
import json
import requests
from collections import defaultdict

__Data Preprocessing__

1. I read the page_data and WPDS csv files into dataframes so that they can be preprocessed.
2. I add a new column called keep that flags every row whose page column contains the word 'Template'. These pages are not articles and should not be processed.
3. I filter the dataframe to keep only the records that do not have the word 'Template' in the page column.

In [2]:
page_data = pd.read_csv('page_data.csv')

world_population_data = pd.read_csv('WPDS_2018_data.csv')

#This function returns 1 if word is found in the row
#Here , row refers to the page column in the dataset and word refers to 'Template'
def filter_rows(row, word):
    if word in row:
        return(1)
    else:
        return(0)

#Adding a keep column to the page_data dataframe to decide which rows to keep and which rows to omit
page_data['keep'] = page_data.apply(lambda x: filter_rows(x['page'], 'Template'), axis=1)

#Filtering the page_data dataframe based on the values in the keep column
#Keep = 0: record does not contain 'Template', Keep = 1: record contains 'Template'
page_data = page_data[page_data['keep'] == 0]
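
The same filter can be written without a helper function or a temporary column. The sketch below is one possible alternative, not the approach used in this notebook; it uses a small made-up dataframe in place of page_data.csv (the rows and the names `sample`, `is_template` and `filtered` are illustrative assumptions) and relies on pandas' vectorized `Series.str.contains`.

```python
import pandas as pd

# Illustrative rows only; the real data comes from page_data.csv
sample = pd.DataFrame({
    'page': ['Template:Example politicians', 'Jane Doe', 'John Roe'],
    'country': ['Country One', 'Country One', 'Country Two'],
    'rev_id': [1, 2, 3],
})

# Boolean mask: True where the page title contains the literal text 'Template'
is_template = sample['page'].str.contains('Template', regex=False)

# Keep only the rows that are not templates
filtered = sample[~is_template]
print(filtered['page'].tolist())  # ['Jane Doe', 'John Roe']
```

A boolean mask avoids the confusingly named keep column, where a value of 1 actually marks a row to be dropped.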

__Predicting the article quality__

1. Here, I use the ORES API to get an article quality prediction for each article in the preprocessed dataset.
2. We add an 'article_quality' column to the page_data dataframe that assigns a predicted article quality to each rev_id.
3. For some rev_ids, no article quality is returned. We add a try/except block to keep track of these records by recording 'No score found' for them.

In [15]:
headers = {'User-Agent' : 'https://github.com/yashkale94', 'From' : 'yashkale@uw.edu'}

def get_ores_data(revision_ids, headers):
    
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
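    #Pass the headers so that the User-Agent and From values are actually sent with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)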
    response = api_call.json()
    return(response)
    
rev_ids = page_data['rev_id'].values
low = 0
high = 100
scores = []
flag = 0
while(True):
    
    #Get predictions for each rev_id in batches of 100
    predictions = get_ores_data(rev_ids[low:high], headers)
    
    #Access the scores for the given rev_ids
    predictions = predictions['enwiki']['scores']
    
    #Checks if we have reached the end of the list of rev_ids
    if high > len(rev_ids):
        high = len(rev_ids)
        flag = 1
        
    #Try catch block to check for rev_ids that do not return any score/prediction
    for revid in rev_ids[low:high]:
        try:
            score = predictions[str(revid)]['wp10']['score']['prediction']
            scores.append(score)
        except KeyError:
            scores.append('No score found')
    if flag == 1:
        break
    low+=100
    high+=100
    
#We add a new article quality column to our page_data dataframe
page_data['article_quality'] = scores
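
The while loop above advances low and high by hand. When the number of rev_ids is an exact multiple of 100, it appears to issue one more request with an empty list of IDs before it stops. A generator is one way to produce the same batches without that edge case. This is only a sketch: `batches` and `example_ids` are hypothetical names, the IDs are made up, and no network call is made.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical revision IDs, used only to show the slicing behaviour
example_ids = list(range(1, 251))
batch_sizes = [len(batch) for batch in batches(example_ids)]
print(batch_sizes)  # [100, 100, 50]

# An exact multiple of the batch size produces no trailing empty batch
exact_sizes = [len(batch) for batch in batches(list(range(200)))]
print(exact_sizes)  # [100, 100]
```

Each yielded batch could then be passed to get_ores_data in place of rev_ids[low:high].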

__Merging of the two dataframes for final dataframe__

1. We want to create a final dataframe for our analysis. For this, we combine the page_data and world_population_data dataframes.
2. We create two new columns in the page_data dataframe, namely 'Region' and 'Population'. These store the region and population of the country that each article belongs to.
3. We use a try/except block to handle articles whose country has no region or population entry in the world_population_data file.

In [16]:
#Get the two columns of world_population_data dataframe in a list
region_country = world_population_data[['Geography','Population mid-2018 (millions)']].values
count = 0

#Create a dictionary to keep the map of every country to the region it belongs to
#If the name is in block letters, it is a region name and not a country name
country_to_region_map = {}
for i in region_country:
    if i[0].isupper():
        key = i[0]
    else:
        country_to_region_map[i[0]] = [key, i[1]]
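
The mapping above depends on row order: an all-uppercase row starts a new region, and every mixed-case row after it is treated as a country in that region. A minimal, self-contained sketch of that idea follows, using made-up rows in the same two-column layout (the names `rows`, `country_info` and `current_region` are assumptions for illustration).

```python
# Hypothetical rows laid out like WPDS_2018_data.csv:
# (Geography, Population mid-2018 (millions)); the values are made up
rows = [
    ('REGION A', '1,000'),
    ('Country One', '12.5'),
    ('Country Two', '0.4'),
    ('REGION B', '250'),
    ('Country Three', '3.1'),
]

country_info = {}
current_region = None
for name, population in rows:
    if name.isupper():
        # An all-uppercase name marks the start of a new region
        current_region = name
    elif current_region is not None:
        # Any other row is a country that belongs to the most recent region
        country_info[name] = (current_region, population)

print(country_info['Country Three'])  # ('REGION B', '3.1')
```

Starting `current_region` at None makes the sketch skip any country row that appears before the first region row, instead of raising a NameError.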
 

In [21]:
region_list = []
population_list = []

#We iterate on the page_data dataframe to create two columns, that store the population and region name for each country
for index, row in page_data.iterrows():
    try:
        region, population = country_to_region_map[row.values[1]]
        region_list.append(region)
        population_list.append(population)
    except:
        #If we do not have a corresponding population for a country, we add 'Population not found'
        region_list.append('Country not found')
        population_list.append('Population not found')    
        
page_data['Region'] = region_list
page_data['Population'] = population_list

In [24]:
page_data.drop(columns=['Region'], inplace=True)
page_data = page_data[['country', 'page','rev_id','article_quality', 'Population']]
data_not_found = page_data[(page_data['article_quality'] == 'No score found') | (page_data['Population'] == 'Population not found')]
page_data_complete = page_data[(page_data['article_quality'] != 'No score found') & (page_data['Population'] != 'Population not found')]

__Calculating the countries by coverage__

1. Here, we use a dictionary to keep a map of number of articles per country.
2. We then use the information we have about the population of each country to calculate the coverage per country.

In [26]:
#creates a dictionary to store country:articles as key value pair
articles_per_country = defaultdict(int)
for i, row in page_data_complete.iterrows():
    country = row.values[0]
    articles_per_country[country]+=1

#creates a list to store the countries and their articles coverage
proportion_country = []
for country, value in articles_per_country.items():
    country_population = country_to_region_map[country][1]
    
    #Removes the character ',' from the population and converts it to float so that we can add the population value later
    country_population = float(country_population.replace(',',''))
    
    #Population values are in millions. Hence we multiply by 10^6
    proportion_country.append((country, value/(country_population*10**6)))

#This gives us the top 10 countries by their coverage
top_countries = sorted(proportion_country, key = lambda x: (x[1]), reverse=True)[0:10]
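
As a worked example of the coverage calculation, the sketch below converts a population string given in millions into a head count and expresses the article count as a percentage of it. The function name `coverage_percent` and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def coverage_percent(article_count, population_millions):
    """Articles per person, as a percentage, for a population given in millions."""
    # Strip thousands separators such as '1,250' before converting to a number
    population = float(population_millions.replace(',', '')) * 10**6
    return 100 * article_count / population

# Made-up inputs: 500 articles for a population of 1,250 million people
example = coverage_percent(500, '1,250')
print(example)  # about 4e-05, i.e. 0.00004 percent
```

The factor of 100 matches the later cells, which multiply the stored proportion by 100 before displaying it as a percentage.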



__Top 10 countries by coverage percentage__

In [27]:
df = pd.DataFrame(top_countries, columns=['Country','Coverage'])
df['Coverage'] = df['Coverage']*100
df

Unnamed: 0,Country,Coverage
0,Tuvalu,0.54
1,Nauru,0.52
2,San Marino,0.27
3,Monaco,0.1
4,Liechtenstein,0.07
5,Tonga,0.063
6,Marshall Islands,0.061667
7,Iceland,0.05025
8,Andorra,0.0425
9,Grenada,0.036


__Bottom 10 countries by coverage percentage:__

In [28]:
bottom_countries = sorted(proportion_country, key = lambda x: (x[1]))[0:10]
df = pd.DataFrame(bottom_countries, columns=['Country','Coverage'])
df['Coverage'] = df['Coverage']*100
df

Unnamed: 0,Country,Coverage
0,India,7.1e-05
1,Indonesia,7.9e-05
2,China,8.1e-05
3,Uzbekistan,8.5e-05
4,Ethiopia,9.4e-05
5,"Korea, North",0.000141
6,Zambia,0.000141
7,Thailand,0.000169
8,Mozambique,0.00019
9,Bangladesh,0.000192


__Calculating countries based on the coverage of GA/FA articles__

1. Here, we want to calculate the relative quality for each country, defined as the proportion of that country's articles that are rated GA or FA.
2. We create a dictionary that stores the number of GA/FA articles per country as key-value pairs.

In [50]:
#We filter the dataframe to keep only those records whose article quality is GA or FA
page_data_high = page_data_complete[(page_data_complete['article_quality'] == 'GA') | (page_data_complete['article_quality'] == 'FA')]

#We create a dictionary to store the number of high quality articles per country
high_rated_articles = defaultdict(int)
for i, row in page_data_high.iterrows():
    article_quality = row.values[3]
    country = row.values[0]
    high_rated_articles[country]+=1

high_rated_articles_ratio = []
for key, value in high_rated_articles.items():
    articles_in_country = articles_per_country[key]
    high_rated_articles_ratio.append((key, value/articles_in_country))

__Top 10 countries by relative quality:__

In [31]:
#We create a dataframe from a list that sorts the countries based on their relative article_quality
top_countries = sorted(high_rated_articles_ratio, key=lambda x: x[1], reverse=True)[0:10]
df = pd.DataFrame(top_countries, columns=['Country','article_quality'])
df['article_quality'] = df['article_quality']*100
df

Unnamed: 0,Country,article_quality
0,"Korea, North",19.444444
1,Saudi Arabia,12.711864
2,Mauritania,12.5
3,Central African Republic,12.121212
4,Romania,11.370262
5,Tuvalu,9.259259
6,Bhutan,9.090909
7,Dominica,8.333333
8,Syria,7.8125
9,Benin,7.692308


1. To calculate the bottom countries, we also need to include the countries that have 0 GA/FA articles.

In [32]:
#We keep a count of number of articles in each country.
#For each country, we keep a track of number of articles in each given article_quality possibility
articles_count_country = {}
for index, row in page_data_complete.iterrows():
    if row.values[0] not in articles_count_country.keys():
        articles_count_country[row.values[0]] = {}
        articles_count_country[row.values[0]][row.values[3]] = 1
    else:
        if row.values[3] not in articles_count_country[row.values[0]].keys():
            articles_count_country[row.values[0]][row.values[3]] = 1
        else:
            articles_count_country[row.values[0]][row.values[3]]+=1

#With this, we make sure we add 0 to those countries that do not have GA or FA in them
for key, value in articles_count_country.items():
    if 'GA' not in value.keys():
        articles_count_country[key]['GA'] = 0
    if 'FA' not in value.keys():
        articles_count_country[key]['FA'] = 0
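
The nested counting above can also be expressed with `collections.Counter`, which returns 0 for a missing key, so no separate step is needed to fill in GA and FA. This is a sketch of an alternative, not the method used in this notebook; `records`, `quality_counts` and `high_quality_share` are hypothetical names and the data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical (country, article_quality) pairs standing in for page_data_complete
records = [
    ('Country One', 'Stub'),
    ('Country One', 'GA'),
    ('Country One', 'Start'),
    ('Country Two', 'Stub'),
    ('Country Two', 'C'),
]

# One Counter of quality labels per country
quality_counts = defaultdict(Counter)
for country, quality in records:
    quality_counts[country][quality] += 1

# Share of GA/FA articles per country; labels that never occurred count as 0
high_quality_share = {
    country: (counts['GA'] + counts['FA']) / sum(counts.values())
    for country, counts in quality_counts.items()
}
print(high_quality_share['Country Two'])  # 0.0
```

Countries with no GA/FA articles still appear in the result with a share of 0.0, which is what the bottom-10 ranking needs.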
        


__Bottom 10 countries by relative quality:__

In [33]:
high_rated_articles_ratio = []
for key, value in articles_per_country.items():
    high_rated_articles = articles_count_country[key]['GA'] + articles_count_country[key]['FA']
    high_rated_articles_ratio.append((key,high_rated_articles/articles_per_country[key]))
    
bottom_countries = sorted(high_rated_articles_ratio, key=lambda x: x[1])[0:10]
df = pd.DataFrame(bottom_countries, columns=['Country','article_quality'])
df['article_quality'] = df['article_quality']*100
df

Unnamed: 0,Country,article_quality
0,Malta,0.0
1,Angola,0.0
2,Finland,0.0
3,Tunisia,0.0
4,San Marino,0.0
5,Uganda,0.0
6,Moldova,0.0
7,Monaco,0.0
8,Turkmenistan,0.0
9,Slovakia,0.0


1. Now, we want to calculate similar metrics at the level of geographic regions.
2. For this, we require the number of articles in each geographic region.
3. We maintain a dictionary that stores the number of articles in a given region.

In [34]:
#We create a dictionary to store articles per region
articles_by_region = defaultdict(int)
for key, value in articles_per_country.items():
    region_population = country_to_region_map[key][1]
    region_articles = value
    articles_by_region[country_to_region_map[key][0]]+=value

#We create a dictionary to store the population in every region
population_by_region = defaultdict(int)
for key, value in articles_per_country.items():
    region_population = country_to_region_map[key][1]
    region_population = region_population.replace(',','')
    population_by_region[country_to_region_map[key][0]]+=(float(region_population))*10**6

In [35]:
#For each region, we store the population. We converted the string value to a float value and multiply by 10^6, as the population
#is in millions
africa_population = world_population_data[world_population_data['Geography'] == 'AFRICA']['Population mid-2018 (millions)'].values[0]
northern_america = world_population_data[world_population_data['Geography'] == 'NORTHERN AMERICA']['Population mid-2018 (millions)'].values[0]
latin_america = world_population_data[world_population_data['Geography'] == 'LATIN AMERICA AND THE CARIBBEAN']['Population mid-2018 (millions)'].values[0]
asia = world_population_data[world_population_data['Geography'] == 'ASIA']['Population mid-2018 (millions)'].values[0]
europe = world_population_data[world_population_data['Geography'] == 'EUROPE']['Population mid-2018 (millions)'].values[0]
oceania = world_population_data[world_population_data['Geography'] == 'OCEANIA']['Population mid-2018 (millions)'].values[0]

africa_population = float(africa_population.replace(',',''))*10**6
northern_america = float(northern_america.replace(',',''))*10**6
latin_america = float(latin_america.replace(',',''))*10**6
asia = float(asia.replace(',',''))*10**6
europe = float(europe.replace(',',''))*10**6
oceania = float(oceania.replace(',',''))*10**6

__Geographic regions by coverage:__

In [45]:
articles_regions_order = []
articles_regions_order.append(('AFRICA', articles_by_region['AFRICA']/ africa_population))
articles_regions_order.append(('NORTHERN AMERICA', articles_by_region['NORTHERN AMERICA']/ northern_america))
articles_regions_order.append(('LATIN AMERICA AND THE CARIBBEAN', articles_by_region['LATIN AMERICA AND THE CARIBBEAN']/ latin_america))
articles_regions_order.append(('ASIA', articles_by_region['ASIA']/ asia))
articles_regions_order.append(('EUROPE', articles_by_region['EUROPE']/ europe))
articles_regions_order.append(('OCEANIA', articles_by_region['OCEANIA']/ oceania))

__Geographic regions ranked by coverage percentage__

In [46]:
top_regions = sorted(articles_regions_order, key=lambda x: x[1], reverse=True)
df = pd.DataFrame(top_regions, columns=['Region','Coverage'])
df['Coverage'] = df['Coverage']*100
df

Unnamed: 0,Region,Coverage
0,OCEANIA,0.007629
1,EUROPE,0.002127
2,LATIN AMERICA AND THE CARIBBEAN,0.000796
3,AFRICA,0.000534
4,NORTHERN AMERICA,0.000526
5,ASIA,0.000254


Similarly, we store the number of high quality articles per region in a dictionary.

In [48]:
#We create a dictionary to store the number of high quality articles per country
high_rated_articles = defaultdict(int)
for i, row in page_data_high.iterrows():
    article_quality = row.values[3]
    country = row.values[0]
    high_rated_articles[country]+=1

high_articles_regionwise = defaultdict(int)
for key, value in high_rated_articles.items():
    high_articles_regionwise[country_to_region_map[key][0]]+=value

high_articles_regionwise_order = []
high_articles_regionwise_order.append(('AFRICA',high_articles_regionwise['AFRICA']/ articles_by_region['AFRICA']))
high_articles_regionwise_order.append(('NORTHERN AMERICA',high_articles_regionwise['NORTHERN AMERICA']/ articles_by_region['NORTHERN AMERICA']))
high_articles_regionwise_order.append(('LATIN AMERICA AND THE CARIBBEAN',high_articles_regionwise['LATIN AMERICA AND THE CARIBBEAN']/ articles_by_region['LATIN AMERICA AND THE CARIBBEAN']))
high_articles_regionwise_order.append(('ASIA',high_articles_regionwise['ASIA']/ articles_by_region['ASIA']))
high_articles_regionwise_order.append(('EUROPE',high_articles_regionwise['EUROPE']/ articles_by_region['EUROPE']))
high_articles_regionwise_order.append(('OCEANIA',high_articles_regionwise['OCEANIA']/ articles_by_region['OCEANIA']))

__Geographic regions ranked by relative article quality__

In [49]:
top_regions = sorted(high_articles_regionwise_order, key=lambda x: x[1], reverse=True)
df = pd.DataFrame(top_regions, columns=['Region','article_quality'])
df['article_quality'] = df['article_quality']*100
df

Unnamed: 0,Region,article_quality
0,NORTHERN AMERICA,5.153566
1,ASIA,2.688405
2,OCEANIA,2.109974
3,EUROPE,2.029753
4,AFRICA,1.824551
5,LATIN AMERICA AND THE CARIBBEAN,1.334881


__We store the records that have no population data, or that have no articles for their country, in a separate file__

In [54]:
page_data1 = pd.read_csv('page_data.csv')

world_population_data1 = pd.read_csv('WPDS_2018_data.csv')

no_population = page_data[page_data['Population'] == 'Population not found']
countries_article = set(page_data1['country'].values)

#We filter the countries who do not have corresponding article data
def filter_countries(country, countries):
    if country not in countries:
        return(1)
    else:
        return(0)

world_population_data1['keep'] = world_population_data1.apply(lambda x: filter_countries(x['Geography'], countries_article), axis=1)
world_population_data1 = world_population_data1[world_population_data1['keep'] == 1]

#We add another rev_id column in the population data so that we can merge this with the page_data dataframe
world_population_data1['rev_id'] = 'No Articles found'

world_population_data1 = world_population_data1[['Geography','Population mid-2018 (millions)','rev_id']]
no_population = no_population[['country', 'Population', 'rev_id']]

df_final = pd.concat([no_population, world_population_data1.rename(columns={'Geography':'country','Population mid-2018 (millions)'
                                                                    :'Population'})], ignore_index=True)

#We then store these records that have no match in a file
df_final.to_csv('wp_wpds_countries-no_match.csv', index=False)

#We store the data for which we have performed analysis and have entire data available in another separate file.
page_data_complete.to_csv('wp_wpds_politicians_by_country.csv', index=False)
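
The unmatched names on each side can also be found with plain set differences. The sketch below uses made-up country names and is only an illustration of the idea; `article_countries`, `population_countries` and the two result lists are hypothetical names.

```python
# Hypothetical country names from the two sources
article_countries = {'Country One', 'Country Two', 'Countryish'}
population_countries = {'Country One', 'Country Two', 'Country Three'}

# Names that appear in the article data but have no population entry
only_in_articles = sorted(article_countries - population_countries)

# Names that appear in the population data but have no articles
only_in_population = sorted(population_countries - article_countries)

print(only_in_articles)    # ['Countryish']
print(only_in_population)  # ['Country Three']
```

Listing both differences side by side makes spelling mismatches between the two sources easier to spot.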

__FINAL ANALYSIS__

1. When I began parsing the dataset, I could see that there are a lot of discrepancies. Country names are not defined consistently across the two data sources, which creates a problem when identifying the same country in both.
2. The way the regions are stored in the data files is extremely inefficient. The only map from a country to its region is that, in the WPDS_2018_data file, regions are written in all capital letters and countries in mixed case. This is not a good dataset to work with, since it requires a lot of cleaning.
3. Some revision IDs do not return any ORES score, even though they are not of the 'Template' form. No reason is stated for why a score is unavailable for such articles. This makes me question which articles are suitable for generating an ORES score.
4. I expected some bias in this dataset: I thought countries with large populations and growing economies would have a higher relative quality ratio than other countries, on the assumption that a larger country has a better chance of having higher-quality articles. On the contrary, I found that these countries had a lower relative quality ratio.
5. I believe this dataset, with a little cleaning, could be used to relate the share of high-quality articles in each country to its literacy rate. We might see an interesting correlation there, possibly suggesting that higher literacy rates correspond to higher-quality articles.
6. In my opinion, this dataset is not a very good source because many data points are misleading. For example, the same country is listed as Omani in one source and Oman in the other. Ideally both would point to the same country, Oman, but a name-based match fails here.