## Steps 1 and 2: Data Acquisition and Processing

In this step we download data from the following sources and save it as .csv files for further processing:

- Wikipedia articles data is downloaded from figshare. This project contains data on most English-language Wikipedia articles within the category "Category:Politicians by nationality" and subcategories, along with the code used to generate that data.
- Population data is downloaded from the Population Reference Bureau (PRB). This data is from the year 2015 and covers 210 countries.

In the next steps, we get an article quality prediction for each article by calling the ORES API, then merge the predictions with the Wikipedia and population data into a single dataset. Finally, we write that dataset to a CSV file on disk.

### Getting the Data and Appending ORES Prediction Values 

In this step we read the CSV files and append the ORES prediction values to the corresponding rows.

In [None]:
import csv
import requests
from multiprocessing.dummy import Pool as ThreadPool

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

#### Page Data: Read from csv and append ORES info


In [None]:
print('Reading data from page_data.csv')

data = []
with open('page_data.csv', encoding='utf-8') as page_file:
    reader = csv.reader(page_file)
    next(reader)
    data = [row for row in reader]
    
url = 'https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids='

Using a thread pool cuts the execution time by roughly a factor of 11: without pooling the requests take close to 11 hours, while with pooling the results arrive in less than 1 hour.

In [None]:
def for_pool(row):
    # create url for API request
    tmp_url = url + row[2]
    try:
        # get request
        result = requests.get(url=tmp_url).json()['enwiki']['scores']
        # get prediction name
        prediction = result[row[2]]['wp10']['score']['prediction']
        return row + [prediction]
    except Exception:
        # the request failed or no score was returned for this revision
        return row + [None]

print('Collecting data using API (please wait about 1 hour...)')

pool = ThreadPool(28)
page_data_with_prediction = pool.map(for_pool, data)
pool.close()
pool.join()
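The cell above sends one request per revision ID. The ORES v3 scores endpoint also accepts several revision IDs in a single `revids` parameter, separated by `|`, which could reduce the number of requests. Below is a minimal sketch of the batching step only; the helper names `chunk_rev_ids` and `build_batch_url` and the batch size of 50 are assumptions for illustration, not part of the original notebook, and no network call is made.

```python
# Hypothetical helpers: group revision IDs so that each ORES request
# scores several revisions at once. The batch size of 50 is an assumption.
ORES_URL = 'https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&revids='

def chunk_rev_ids(rev_ids, batch_size=50):
    """Yield successive lists of at most batch_size revision IDs."""
    for start in range(0, len(rev_ids), batch_size):
        yield rev_ids[start:start + batch_size]

def build_batch_url(batch):
    """Join a batch of revision IDs with '|' and append it to the base URL."""
    return ORES_URL + '|'.join(batch)

# Toy revision IDs, made up for illustration.
example_ids = ['1001', '1002', '1003', '1004', '1005']
batch_urls = [build_batch_url(batch)
              for batch in chunk_rev_ids(example_ids, batch_size=2)]
for batch_url in batch_urls:
    print(batch_url)
```

Each batched response would still need to be unpacked per revision ID, in the same way `for_pool` reads `result[row[2]]` for a single ID.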

#### Population Data: Read from csv and process

Once the data is loaded, the population data needs some processing before it is ready to use. The first two rows and the 'Footnotes' column need to be trimmed, and the population values need to be converted to numbers so that they can be used for the percentage calculations in later steps. The section below applies these steps.

In [None]:
# Build a dictionary mapping country name -> population.
population_data = {}
with open('Population Mid-2015.csv', encoding='utf-8') as population_file:
    reader = csv.reader(population_file)
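    # skip the three leading non-data rows
    next(reader)
    next(reader)
    next(reader)
    for row in reader:
        try:
            # strip thousands separators before converting to int
            population_data[row[0]] = int(row[4].replace(',', ''))
        except (ValueError, IndexError):
            # skip rows without a numeric population value
            pass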
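To make the cleaning step concrete, here is a small self-contained sketch that runs the same logic on an in-memory sample instead of the real file. The sample rows, the column layout, and the helper `parse_population` are assumptions made up for illustration; only the idea (skip three leading rows, read the fifth column, strip commas) comes from the cell above.

```python
import csv
import io

# Toy stand-in for the population file. All names and numbers are made up;
# the layout simply mirrors what the cell above expects.
sample = io.StringIO(
    'Toy population table\n'
    '\n'
    'Location,Location Type,TimeFrame,Data Type,Data,Footnotes\n'
    'Country A,Country,Mid-2015,Number,"1,234,000",\n'
    'Country B,Country,Mid-2015,Number,n/a,\n'
)

def parse_population(text):
    """Convert a string such as '1,234,000' to an int; return None otherwise."""
    try:
        return int(text.replace(',', ''))
    except ValueError:
        return None

reader = csv.reader(sample)
for _ in range(3):          # skip the three leading non-data rows
    next(reader)

toy_population = {}
for row in reader:
    value = parse_population(row[4])
    if value is not None:   # rows without a numeric value are dropped
        toy_population[row[0]] = value

print(toy_population)
```

Returning `None` from the helper keeps the "skip bad rows" decision explicit in the loop rather than hidden in an exception handler.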


#### Create final dataset
For each row in page_data_with_prediction, if a score exists and population_data contains the country's population, add the row together with its population to the new dataset.

In [None]:
final_dataset = []
for row in page_data_with_prediction:
    if row[3] is not None:
        try:
            population = population_data[row[1]]
            final_dataset.append(row + [population])
        except KeyError:
            # the country has no entry in the population data
            pass
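The filtering rule above drops two kinds of rows: articles without an ORES prediction and articles whose country is missing from the population data. The sketch below shows this on toy data; every value in `toy_pages` and `toy_population` is made up for illustration, and the list comprehension is just one possible way to write the same join.

```python
# Toy rows in the same column order as page_data_with_prediction:
# [article_name, country, revision_id, prediction]. All values are made up.
toy_pages = [
    ['Article 1', 'Country A', '1001', 'Stub'],
    ['Article 2', 'Country A', '1002', None],   # no ORES score: dropped
    ['Article 3', 'Country Z', '1003', 'GA'],   # no population: dropped
]
toy_population = {'Country A': 1234000}

# Keep a row only if it has a prediction and its country has a population.
toy_final = [
    row + [toy_population[row[1]]]
    for row in toy_pages
    if row[3] is not None and row[1] in toy_population
]

print(toy_final)
```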


Write the final dataset to disk:

In [None]:
fieldname = ['article_name', 'country', 'revision_id', 'article_quality', 'population']
with open('final_dataset.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(fieldname)
    writer.writerows(final_dataset)

## Step 3: Analysis

In this step we calculate the number of politician articles as a percentage of population for each country, and
the percentage of high-quality articles (where the prediction is either 'FA' or 'GA') for each country.
Based on the results, we produce four tables that show:
1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


The code below calculates the number of articles as a percentage of population for each country:

In [None]:
import pandas as pd

# Load data in pandas dataframe
final_dataset = pd.read_csv('final_dataset.csv')

# find all unique countries in dataframe
countries = final_dataset['country'].unique()

articles_per_population = []
# for each country find articles_per_population
for country in countries:
    tmp_dataset = final_dataset[final_dataset['country'] == country]
    articles = len(tmp_dataset)
    population = tmp_dataset['population'].iloc[0]
    articles_per_population.append([country, articles/population*100])

articles_per_population = list(zip(*articles_per_population))    
articles_per_population = pd.DataFrame({'country': articles_per_population[0],
                                        'articles_per_population': articles_per_population[1]})
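As a worked example of the arithmetic, the sketch below computes the same quantity (articles divided by population, times 100) on a few toy rows using only the standard library. The country names and numbers are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# One (country, population) pair per article; all values are made up.
toy_rows = [
    ('Country A', 1000),
    ('Country A', 1000),
    ('Country B', 500),
]

# Number of articles per country.
article_counts = Counter(country for country, _ in toy_rows)
# Population per country (identical on every row of the same country).
populations = dict(toy_rows)

# Articles as a percentage of population, per country.
toy_percentages = {
    country: article_counts[country] / populations[country] * 100
    for country in article_counts
}

print(toy_percentages)
```

Here Country A has 2 articles for 1000 people and Country B has 1 article for 500 people, so both work out to about 0.2 percent.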


##### Table 1: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

In [None]:
articles_per_population.sort_values('articles_per_population', ascending=False).head(10)

##### Table 2: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

In [None]:
articles_per_population.sort_values('articles_per_population').head(10)

The code below calculates the percentage of high-quality articles (where the prediction is either 'FA' or 'GA') for each country:

In [None]:
high_quality_articles = []
# for each country find high_quality_articles 
for country in countries:
    tmp_dataset = final_dataset[final_dataset['country'] == country]
    row_index = ((tmp_dataset.article_quality == 'GA') | (tmp_dataset.article_quality == 'FA'))
    tmp_high_quality = tmp_dataset[row_index]
    high_quality_articles.append([country, len(tmp_high_quality)/len(tmp_dataset)*100])
    
high_quality_articles = list(zip(*high_quality_articles))    
high_quality_articles = pd.DataFrame({'country': high_quality_articles[0],
                                        'high_quality_articles': high_quality_articles[1]})
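The per-country loop can also be expressed as a pandas group-by, since the mean of a boolean mask is the fraction of `True` values. This is an alternative sketch, not the notebook's method; the toy DataFrame below is made up for illustration and uses the same column names as `final_dataset`.

```python
import pandas as pd

# Toy data with the same column names as final_dataset; values are made up.
toy = pd.DataFrame({
    'country': ['Country A', 'Country A', 'Country A', 'Country A', 'Country B'],
    'article_quality': ['GA', 'Stub', 'FA', 'C', 'Start'],
})

# True where the predicted class is GA or FA.
is_high_quality = toy['article_quality'].isin(['GA', 'FA'])

# Mean of the boolean mask per country = share of high-quality articles.
toy_share = (
    (is_high_quality.groupby(toy['country']).mean() * 100)
    .rename('high_quality_articles')
    .reset_index()
)

print(toy_share)
```

In this toy data, Country A has 2 high-quality articles out of 4 (50 percent) and Country B has none (0 percent).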


##### Table 3: 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [None]:
high_quality_articles.sort_values('high_quality_articles', ascending=False).head(10)

##### Table 4: 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


In [None]:
high_quality_articles.sort_values('high_quality_articles').head(10)

## Step 4: Reflection

