# Getting Quality Predictions for World Politicians' Wikipedia Articles

### Homework #2 – Data 512
### Daniel Vogler

In this final notebook of the assignment, I calculate total articles per capita (defined as Wikipedia articles in the dataset per 1M inhabitants) and total high-quality articles per capita (same definition, but restricted to articles rated `FA` or `GA` by ORES).

In [161]:
import pandas as pd
import plotly.express as px
import numpy as np


In [162]:
df = pd.read_csv("../output_data/data_for_analysis.csv", index_col=0)

# Calculating total articles per capita

With the data prepared, calculating per-capita rates of total articles and high-quality articles is straightforward.

First, I group the dataset by country and divide by population (in millions) to get the number of total articles per 1M inhabitants. Then, I repeat the procedure, this time grouping by region and dividing by the *adjusted* regional population, which I calculate and explain in the [previous notebook](dataset_combination.ipynb).

In [163]:
articles_by_country = df.groupby("country")["article_title"].count()

# country populations are the same across records, so we can just take the first one 
population_by_country = df.groupby("country")["population"].first()

articles_per_capita = pd.DataFrame()

articles_per_capita.index = articles_by_country.index # calc in next step requires matching indices
articles_per_capita["articles_per_capita"] = articles_by_country / population_by_country

# output in largest-to-smallest order
articles_per_capita.sort_values("articles_per_capita", ascending=False)

country,articles_per_capita
Monaco,inf
Tuvalu,inf
Antigua and Barbuda,330.000000
Federated States of Micronesia,140.000000
Marshall Islands,130.000000
...,...
Zambia,0.148515
Saudi Arabia,0.135501
Ghana,0.117302
India,0.105698


Tuvalu and Monaco return infinite values because their populations are recorded as 0 (in millions) in the [raw data](../raw_data/population_by_country_AUG.2024.csv), so the division yields `inf`. These can be interpreted as "a very large number."
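If infinite values are not wanted in the output, one option is to treat a recorded population of 0 as missing before dividing. The following is a minimal sketch on hypothetical toy data (the countries `A`, `B`, `C` and their figures are made up for illustration); it is not the approach used in this notebook, which keeps the `inf` values.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from data_for_analysis.csv
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C", "C", "C"],
    "article_title": ["a1", "a2", "b1", "c1", "c2", "c3"],
    # population in millions; country B mimics a 0 entry in the raw data
    "population": [2.0, 2.0, 0.0, 1.5, 1.5, 1.5],
})

# One row per country: article count and (constant) population
per_country = toy.groupby("country").agg(
    articles=("article_title", "count"),
    population=("population", "first"),
)

# Keep populations that are > 0; everything else becomes NaN,
# so the resulting rate is NaN rather than inf
safe_population = per_country["population"].where(per_country["population"] > 0)
per_country["articles_per_capita"] = per_country["articles"] / safe_population

print(per_country)
```

With this variant, countries with a recorded population of 0 sort to the end of the ranking (pandas places `NaN` last by default) instead of appearing at the top.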

In [164]:
regional_populations_path = "../output_data/adjusted_regional_populations.csv"

adjusted_regional_populations = pd.read_csv(regional_populations_path, index_col=0)

adjusted_regional_populations.set_index("region", inplace=True)

Next, I calculate the articles per capita on a regional basis:

In [165]:
articles_by_region = df.groupby("region")["article_title"].count()
population_by_region = adjusted_regional_populations["in-sample_population"]

regional_articles_per_capita = pd.DataFrame()

regional_articles_per_capita["articles_per_capita"] = articles_by_region / population_by_region

# output regions largest to smallest
regional_articles_per_capita.sort_values("articles_per_capita", ascending=False)

region,articles_per_capita
Northern Europe,7.05036
Oceania,6.486486
Caribbean,6.010929
Southern Europe,5.405941
Central America,3.762183
Western Europe,2.768891
Eastern Europe,2.731029
Western Asia,2.081923
Southern Africa,1.800878
Central Asia,1.467662


# Calculating high-quality articles per capita

Calculating high-quality articles per capita works in exactly the same way as above, with the additional step of filtering the original dataframe such that only high-quality articles are counted.
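As a minimal sketch of the filtering step, the example below counts high-quality articles per country on hypothetical toy data. It uses `isin` and `reindex(..., fill_value=0)` as a compact alternative to the merge-and-fill approach; the toy countries and quality labels are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy rows standing in for the analysis dataset
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "article_title": ["a1", "a2", "b1", "c1"],
    "article_quality": ["FA", "Stub", "C", "GA"],
})

# Count of all articles per country (defines the full list of countries)
all_counts = toy.groupby("country")["article_title"].count()

# Keep only articles rated FA or GA
hq_mask = toy["article_quality"].isin(["FA", "GA"])

# Count HQ articles, then restore countries with none as an explicit 0
hq_counts = (
    toy[hq_mask]
    .groupby("country")["article_title"]
    .count()
    .reindex(all_counts.index, fill_value=0)
)

print(hq_counts)
```

Reindexing against the full list of countries matters because a country with no high-quality articles disappears from the filtered group-by result and would otherwise be missing from the ranking.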

In [166]:
# articles are considered high quality if they have a quality score of FA or GA
hq_articles = df[(df["article_quality"] == "FA") | (df["article_quality"] == "GA")]

hq_articles_by_country = hq_articles.groupby("country")["article_title"].count()

hq_articles_by_country = pd.merge(articles_by_country, hq_articles_by_country, how = "left", on = "country")

hq_articles_by_country.drop(columns=["article_title_x"], inplace=True)
hq_articles_by_country.rename(columns={
    "article_title_y" : "hq_articles"
}, inplace=True)
hq_articles_by_country.fillna(0, inplace=True)

hq_articles_per_capita = pd.DataFrame()

hq_articles_per_capita.index = hq_articles_by_country.index # calc in next step requires matching indices

hq_articles_per_capita["hq_articles_per_capita"] = (
    hq_articles_by_country["hq_articles"] / population_by_country
)

hq_articles_per_capita.sort_values("hq_articles_per_capita", ascending=False)

country,hq_articles_per_capita
Montenegro,5.000000
Luxembourg,2.857143
Albania,2.592593
Kosovo,2.352941
Lithuania,2.068966
...,...
Nicaragua,0.000000
Eritrea,0.000000
Zimbabwe,0.000000
Monaco,NaN


In [167]:
hq_articles_by_region = hq_articles.groupby("region")["article_title"].count()

hq_regional_articles_per_capita = pd.DataFrame()

hq_regional_articles_per_capita["hq_articles_per_capita"] = hq_articles_by_region / population_by_region

hq_regional_articles_per_capita.sort_values("hq_articles_per_capita", ascending=False)

region,hq_articles_per_capita
Northern Europe,0.395683
Southern Europe,0.349835
Caribbean,0.245902
Central America,0.194932
Eastern Europe,0.157776
Southern Africa,0.11713
Western Europe,0.11583
Western Asia,0.091401
Oceania,0.09009
Northern Africa,0.074248


With the necessary calculations complete, I report tables summarizing results in the final section. Analysis of these results appears in the [README](../README.md).

## Results

### The 10 countries with the highest total articles per 1M people are:

In [168]:
articles_per_capita_display = articles_per_capita.sort_values("articles_per_capita", ascending=False).head(10).reset_index()
articles_per_capita_display.index = articles_per_capita_display.index + 1
articles_per_capita_display.rename(columns={
    "articles_per_capita": "Articles per 1M people",
    "country": 'Country'
    }, inplace=True)
articles_per_capita_display

,Country,Articles per 1M people
1,Monaco,inf
2,Tuvalu,inf
3,Antigua and Barbuda,330.0
4,Federated States of Micronesia,140.0
5,Marshall Islands,130.0
6,Tonga,100.0
7,Barbados,83.333333
8,Montenegro,63.333333
9,Seychelles,60.0
10,Bhutan,55.0


Tuvalu and Monaco return infinite values because their populations are recorded as 0 (in millions) in the [raw data](../raw_data/population_by_country_AUG.2024.csv), so the division yields `inf`. These can be interpreted as "a very large number."

Overall, very small countries, mostly islands, tend to have high values for articles per 1M inhabitants. This is likely driven by their very low populations.

### The 10 countries with the lowest total articles per 1M people are:

In [170]:
articles_per_capita_display = articles_per_capita.sort_values("articles_per_capita", ascending=True).head(10).reset_index()
articles_per_capita_display.index = articles_per_capita_display.index + 1
articles_per_capita_display.rename(columns={
    "articles_per_capita": "Articles per 1M people",
    "country": 'Country'
    }, inplace=True)
articles_per_capita_display

,Country,Articles per 1M people
1,China,0.011337
2,India,0.105698
3,Ghana,0.117302
4,Saudi Arabia,0.135501
5,Zambia,0.148515
6,Norway,0.181818
7,Israel,0.204082
8,Egypt,0.304183
9,Cote d'Ivoire,0.323625
10,Mozambique,0.353982


At the other end, the two most populous countries in the world (and several others with relatively large populations) are among the 10 countries with the fewest articles per 1M people. This is likely due to their very *high* populations and to the fact that the number of politicians eligible for a Wikipedia article is not necessarily proportional to a country's population. Please see the [README](../README.md) for a more detailed discussion of this issue.
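The arithmetic behind this effect can be illustrated with made-up numbers. In the sketch below, both article counts and populations are hypothetical and chosen only to show how the same number of articles produces very different per-capita rates; the helper `articles_per_million` is likewise an illustrative name, not part of the notebook.

```python
def articles_per_million(article_count, population_millions):
    """Return articles per 1M inhabitants."""
    return article_count / population_millions

# Hypothetical: two countries with the same number of politician articles
small_rate = articles_per_million(150, 0.5)     # 0.5M inhabitants
large_rate = articles_per_million(150, 1400.0)  # 1,400M inhabitants

print(f"small country: {small_rate:.1f} articles per 1M")
print(f"large country: {large_rate:.3f} articles per 1M")
print(f"ratio: {small_rate / large_rate:.0f}x")
```

If the number of politicians with articles grows more slowly than population, the denominator dominates and large countries fall to the bottom of the ranking regardless of how well their politicians are covered.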

### The 10 countries with the greatest number of high-quality articles per 1M people are:

In [172]:
articles_per_capita_display = hq_articles_per_capita.sort_values("hq_articles_per_capita", ascending=False).head(10).reset_index()
articles_per_capita_display.index = articles_per_capita_display.index + 1
articles_per_capita_display.rename(columns={
    "hq_articles_per_capita": "HQ Articles per 1M people",
    "country": 'Country'
    }, inplace=True)
articles_per_capita_display

,Country,HQ Articles per 1M people
1,Montenegro,5.0
2,Luxembourg,2.857143
3,Albania,2.592593
4,Kosovo,2.352941
5,Lithuania,2.068966
6,Maldives,1.666667
7,Croatia,1.315789
8,Guyana,1.25
9,Palestinian Territory,1.090909
10,Slovenia,0.952381


Southern and Eastern European countries have a high number of high-quality English Wikipedia articles when population size is taken into account.

### The 10 countries with the smallest number of high-quality articles per 1M people are:

In [174]:
# NaN rates (0 HQ articles / 0 population, e.g. Monaco) sort last by default, so they do not appear here
articles_per_capita_display = hq_articles_per_capita.sort_values("hq_articles_per_capita", ascending=True).head(10).reset_index()
articles_per_capita_display.index = articles_per_capita_display.index + 1
articles_per_capita_display.rename(columns={
    "hq_articles_per_capita": "HQ Articles per 1M people",
    "country": 'Country'
    }, inplace=True)
articles_per_capita_display

,Country,HQ Articles per 1M people
1,Zimbabwe,0.0
2,Qatar,0.0
3,Grenada,0.0
4,Gambia,0.0
5,Samoa,0.0
6,Senegal,0.0
7,Federated States of Micronesia,0.0
8,Estonia,0.0
9,Eritrea,0.0
10,Equatorial Guinea,0.0


At the other end, many countries do not have *any* high-quality articles written about their politicians. These countries make up the bottom of the high-quality article rankings.
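The four country-level tables above are built with the same sort, truncate, and rename steps. As a sketch only, those steps could be wrapped in a small helper; `top_n_table` and the toy data are hypothetical names introduced here for illustration and are not defined elsewhere in this notebook.

```python
import pandas as pd

def top_n_table(rates, column, label, n=10, ascending=False):
    """Return the first n rows of `rates` sorted by `column`, with a 1-based index."""
    table = (
        rates.sort_values(column, ascending=ascending)
        .head(n)
        .reset_index()
        .rename(columns={column: label, "country": "Country"})
    )
    table.index = table.index + 1  # start numbering at 1 instead of 0
    return table

# Hypothetical toy rates indexed by country
toy_rates = pd.DataFrame(
    {"articles_per_capita": [3.0, 1.0, 2.0]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

top_two = top_n_table(toy_rates, "articles_per_capita", "Articles per 1M people", n=2)
print(top_two)
```

Passing `ascending=True` would give the lowest-ranked rows instead, which covers both the "highest" and "lowest" tables with one function.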

### Geographic regions ranked by total articles per capita:

In [142]:
regional_articles_display = regional_articles_per_capita.sort_values("articles_per_capita", ascending=False).reset_index()
regional_articles_display.index = regional_articles_display.index + 1

regional_articles_display.rename(columns={
    "region": "Region",
    "articles_per_capita" : "Articles per 1M People"
}, inplace=True)
regional_articles_display.index.name = "Rank"
display(regional_articles_display)

Rank,Region,Articles per 1M People
1,Northern Europe,7.05036
2,Oceania,6.486486
3,Caribbean,6.010929
4,Southern Europe,5.405941
5,Central America,3.762183
6,Western Europe,2.768891
7,Eastern Europe,2.731029
8,Western Asia,2.081923
9,Southern Africa,1.800878
10,Central Asia,1.467662


Northern Europe, Oceania, and the Caribbean, all regions with many English speakers or political/cultural links to English-speaking countries, have the greatest number of articles per 1M people. Southern and Eastern Asia have very few, likely because of their extremely large populations.

### Geographic regions ranked by high-quality articles per capita:

In [143]:
hq_regional_articles_display = hq_regional_articles_per_capita.sort_values("hq_articles_per_capita", ascending=False).reset_index()

hq_regional_articles_display.index = hq_regional_articles_display.index + 1

hq_regional_articles_display.rename(columns={
    "region": "Region",
    "hq_articles_per_capita" : "HQ Articles per 1M People"
}, inplace=True)
hq_regional_articles_display.index.name = "Rank"
hq_regional_articles_display

Rank,Region,HQ Articles per 1M People
1,Northern Europe,0.395683
2,Southern Europe,0.349835
3,Caribbean,0.245902
4,Central America,0.194932
5,Eastern Europe,0.157776
6,Southern Africa,0.11713
7,Western Europe,0.11583
8,Western Asia,0.091401
9,Oceania,0.09009
10,Northern Africa,0.074248
