# Getting Quality Predictions for World Politicians' Wikipedia Articles

### Homework #2 – Data 512
### Daniel Vogler

# Initial Data Exploration – Resolving Inconsistencies

First, I open and explore the provided datasets to try to resolve any inconsistencies before beginning to analyze the data.

In [3]:
import pandas as pd
import os

In [4]:
politicians_raw = pd.read_csv("../raw_data/politicians_by_country_AUG.2024.csv")
population_raw = pd.read_csv("../raw_data/population_by_country_AUG.2024.csv")

As a first step, I preview the first ten rows of each dataframe to understand its structure and the kind of values each column holds.

In [5]:
politicians_raw.head(10)

Unnamed: 0,name,url,country
0,Majah Ha Adrif,https://en.wikipedia.org/wiki/Majah_Ha_Adrif,Afghanistan
1,Haroon al-Afghani,https://en.wikipedia.org/wiki/Haroon_al-Afghani,Afghanistan
2,Tayyab Agha,https://en.wikipedia.org/wiki/Tayyab_Agha,Afghanistan
3,Khadija Zahra Ahmadi,https://en.wikipedia.org/wiki/Khadija_Zahra_Ah...,Afghanistan
4,Aziza Ahmadyar,https://en.wikipedia.org/wiki/Aziza_Ahmadyar,Afghanistan
5,Muqadasa Ahmadzai,https://en.wikipedia.org/wiki/Muqadasa_Ahmadzai,Afghanistan
6,Mohammad Sarwar Ahmedzai,https://en.wikipedia.org/wiki/Mohammad_Sarwar_...,Afghanistan
7,Amir Muhammad Akhundzada,https://en.wikipedia.org/wiki/Amir_Muhammad_Ak...,Afghanistan
8,Nasrullah Baryalai Arsalai,https://en.wikipedia.org/wiki/Nasrullah_Baryal...,Afghanistan
9,Abdul Rahim Ayoubi,https://en.wikipedia.org/wiki/Abdul_Rahim_Ayoubi,Afghanistan


In [6]:
population_raw.head(10)

Unnamed: 0,Geography,Population
0,WORLD,8009.0
1,AFRICA,1453.0
2,NORTHERN AFRICA,256.0
3,Algeria,46.8
4,Egypt,105.2
5,Libya,6.9
6,Morocco,37.0
7,Sudan,48.1
8,Tunisia,11.9
9,Western Sahara,0.6
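
`head()` shows values but not the types pandas inferred for them, so a look at `dtypes` can complement the previews above. The sketch below uses a toy frame built from a few of the rows shown above. The name `population_toy` is illustrative, and the all-capitals rule for spotting regional rows is an assumption drawn from the preview, not something the dataset documents.

```python
import pandas as pd

# Toy rows copied from the population preview above
population_toy = pd.DataFrame(
    {
        "Geography": ["WORLD", "AFRICA", "NORTHERN AFRICA", "Algeria", "Egypt"],
        "Population": [8009.0, 1453.0, 256.0, 46.8, 105.2],
    }
)

# dtypes lists the type pandas inferred for each column
print(population_toy.dtypes)

# Assumption: region-level rows are exactly those written in all capitals
is_region = population_toy["Geography"].str.isupper()
regions = population_toy[is_region]
countries = population_toy[~is_region]
print(countries)
```

Separating the regional rows this way is only a heuristic. It would need checking against the full file before being relied on.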


Now, I'll do a basic check for duplicates:

In [22]:
# a record counts as a duplicate if both its name and its country
# match those of another record
politician_duplicates = politicians_raw[politicians_raw.duplicated(subset=["name", "country"], keep=False)]
politician_duplicates

Unnamed: 0,name,url,country


This indicates that there are no duplicate *records* in this dataset, in the sense that the same politician from the same country is listed twice. But what if one politician maps to multiple countries (because they represented a country that broke up into multiple new countries, like [this person](https://en.wikipedia.org/wiki/Torokul_Dzhanuzakov))? In that case, the politician's URL would be duplicated:

In [17]:
duplicated_urls_df = politicians_raw[politicians_raw.duplicated("url", keep=False)]
duplicated_urls_df.sort_values(by="name") # sort by name to see dups more easily

Unnamed: 0,name,url,country
6351,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,Sweden
3101,Abir Al-Sahlani,https://en.wikipedia.org/wiki/Abir_Al-Sahlani,Iraq
5374,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,Russia
4237,"Aleksandr Nikitin (politician, born 1987)",https://en.wikipedia.org/wiki/Aleksandr_Nikiti...,Moldova
5292,Ali al-Qaradaghi,https://en.wikipedia.org/wiki/Ali_al-Qaradaghi,Qatar
...,...,...,...
1119,Venko Markovski,https://en.wikipedia.org/wiki/Venko_Markovski,Bulgaria
3680,Visar Ymeri,https://en.wikipedia.org/wiki/Visar_Ymeri,Kosovo
151,Visar Ymeri,https://en.wikipedia.org/wiki/Visar_Ymeri,Albania
1303,Yat Hwaidi,https://en.wikipedia.org/wiki/Yat_Hwaidi,Cambodia


Clearly there are some cases like this. I will keep track of them for my final analysis:

In [18]:
def group_duplicated_urls():

    duplicated_urls = duplicated_urls_df.groupby('url', 
                                            as_index=False).agg({
        "url" : "first",
        "country" : lambda x: ', '.join(x.unique()) 
    })

    return duplicated_urls

duplicated_politicans = group_duplicated_urls()

###### AI USE ACKNOWLEDGEMENT ########
# I figured out how to do this by asking ChatGPT on October 13, 2024.

################################################################################
# PROMPT: I have a data frame with three columns: name, url, and country. 
# I want to re-index this data frame, grouping by name and turning country into 
# a single column where the different values of country are separated by a comma

# OUTPUT (example code):
# You can achieve this in Pandas by using the groupby() method combined with agg() 
# to concatenate the country values into a single string separated by commas. 
# Here's how to do it:

# Example Code: 

# import pandas as pd

# # Sample DataFrame
# data = {
#     'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
#     'url': ['url1', 'url2', 'url3', 'url4', 'url5'],
#     'country': ['USA', 'Canada', 'UK', 'USA', 'Mexico']
# }

# df = pd.DataFrame(data)

# # Group by 'name' and aggregate 'country'
# grouped_df = df.groupby('name', as_index=False).agg({
#     'url': 'first',  # Assuming you want to keep the first URL associated with each name
#     'country': lambda x: ', '.join(x.unique())  # Concatenate unique countries
# })

# print(grouped_df)

I am recording these politicians as a reference for later use in the analysis section. Since these records are ~1% of the total records, they will have a negligible effect on the API calls needed to predict article quality, so I do not drop them from the dataset at this stage.
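
To make the URL-based duplicate check and the grouping step concrete, here is a minimal sketch on toy data. The rows are taken from the previews above, and the names `toy`, `shared_urls`, and `countries_per_url` are illustrative. It selects the `country` column before aggregating, which is a slightly different formulation from the function above and avoids aggregating the grouping key itself.

```python
import pandas as pd

# Toy rows taken from the previews above: two politicians listed under
# two countries each, plus one politician listed once
toy = pd.DataFrame(
    {
        "name": [
            "Abir Al-Sahlani",
            "Abir Al-Sahlani",
            "Visar Ymeri",
            "Visar Ymeri",
            "Tayyab Agha",
        ],
        "url": [
            "https://en.wikipedia.org/wiki/Abir_Al-Sahlani",
            "https://en.wikipedia.org/wiki/Abir_Al-Sahlani",
            "https://en.wikipedia.org/wiki/Visar_Ymeri",
            "https://en.wikipedia.org/wiki/Visar_Ymeri",
            "https://en.wikipedia.org/wiki/Tayyab_Agha",
        ],
        "country": ["Sweden", "Iraq", "Kosovo", "Albania", "Afghanistan"],
    }
)

# No row repeats the same (name, country) pair ...
same_name_and_country = toy[toy.duplicated(subset=["name", "country"], keep=False)]

# ... but four rows share a URL with another row
shared_urls = toy[toy.duplicated(subset="url", keep=False)]

# One row per URL, with that URL's countries joined into a single string
countries_per_url = (
    shared_urls.groupby("url")["country"]
    .agg(lambda s: ", ".join(s.unique()))
    .reset_index()
)
print(countries_per_url)
```

With `keep=False`, every member of a duplicated group is flagged, so both country rows for each politician survive into the grouping step.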

In [19]:
duplicated_politicans.to_csv("../cleaned_data/duplicated_politicians.csv")

Now, I shift gears to another data quality challenge.

Often, when working with geographic data, it's challenging to perform joins because regions have different names in different datasets. Let's check if that's the case here.

In [10]:
# names that are in the politician dataset but not in the population dataset
print(set(politicians_raw["country"]) - set(population_raw["Geography"]))
# names that are in the population dataset but not in the politician dataset
print(set(population_raw["Geography"]) - set(politicians_raw["country"]))

{'Korea, South', 'Korean', 'Guinea-Bissau'}
{'Palau', 'Brunei', 'China (Hong Kong SAR)', 'San Marino', 'Martinique', 'CENTRAL AMERICA', 'WESTERN EUROPE', 'OCEANIA', 'NORTHERN AMERICA', 'Nauru', 'SOUTH ASIA', 'ASIA', 'Canada', 'Philippines', 'Guam', 'MIDDLE AFRICA', 'LATIN AMERICA AND THE CARIBBEAN', 'Netherlands', 'Ireland', 'GuineaBissau', 'New Zealand', 'WESTERN AFRICA', 'Romania', 'Mexico', 'Korea (South)', 'EASTERN EUROPE', 'Puerto Rico', 'SOUTH AMERICA', 'Liechtenstein', 'New Caledonia', 'Guadeloupe', 'NORTHERN EUROPE', 'WORLD', 'WESTERN ASIA', 'Suriname', 'Korea (North)', 'Dominica', 'Mayotte', 'CENTRAL ASIA', 'Australia', 'French Guiana', 'Curacao', 'SOUTHEAST ASIA', 'eSwatini', 'United Kingdom', 'Sao Tome and Principe', 'Fiji', 'SOUTHERN AFRICA', 'United States', 'China (Macao SAR)', 'CARIBBEAN', 'Western Sahara', 'EAST ASIA', 'Andorra', 'French Polynesia', 'Kiribati', 'Jamaica', 'Iceland', 'Mauritius', 'NORTHERN AFRICA', 'Georgia', 'EUROPE', 'AFRICA', 'SOUTHERN EUROPE', 'Reuni

The first set shows that three country labels in the politician dataset, `Korea, South`, `Korean`, and `Guinea-Bissau`, have no exact match in the `Geography` column of `population_by_country_AUG.2024.csv`. The population file lists these countries as `Korea (South)` and `GuineaBissau`, and I treat `Korean` as a variant of South Korea, so these are naming differences that require reconciliation, which I perform below.

The second set contains the names in `population_by_country_AUG.2024.csv` that do not appear in the politician dataset. Apart from `Korea (South)` and `GuineaBissau`, which are the other side of the mismatches above, these do not require reconciliation: they are either regional aggregates (such as `WORLD` and `AFRICA`) or countries with no politicians in the list we are tasked with analyzing.
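
The same comparison can be cross-checked with an outer merge and pandas' `indicator=True` option, which labels each row as `left_only`, `right_only`, or `both`. This is a minimal sketch on toy data. The frame names are illustrative, and the toy values are a small subset of the names printed above.

```python
import pandas as pd

# Toy subsets of the two name columns (values taken from the output above)
politician_countries = pd.DataFrame(
    {"country": ["Afghanistan", "Korea, South", "Guinea-Bissau"]}
)
population_names = pd.DataFrame(
    {"Geography": ["Afghanistan", "Korea (South)", "GuineaBissau", "Canada"]}
)

# An outer merge keeps unmatched rows from both sides; the _merge column
# records which side each row came from
comparison = politician_countries.drop_duplicates().merge(
    population_names.drop_duplicates(),
    how="outer",
    left_on="country",
    right_on="Geography",
    indicator=True,
)

only_politicians = set(comparison.loc[comparison["_merge"] == "left_only", "country"])
only_population = set(comparison.loc[comparison["_merge"] == "right_only", "Geography"])
print(only_politicians)
print(only_population)
```

The set difference used in the notebook is shorter. The merge variant is mainly useful when the unmatched rows need to be inspected alongside their other columns.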

Below, I reconcile naming conventions for South Korea and Guinea-Bissau across the two datasets, then output the cleaned data to a dedicated folder.

In [11]:
politicians_cleaned = politicians_raw.copy()
politicians_cleaned["country"] = politicians_cleaned["country"].replace(
    {
        "Korea, South": "Korea (South)",
        "Korean": "Korea (South)"
    }
)

population_cleaned = population_raw.copy()
population_cleaned["Geography"] = population_cleaned["Geography"].replace(
    {
        "GuineaBissau": "Guinea-Bissau"
    }
)

print(set(politicians_cleaned["country"]) - set(population_cleaned["Geography"]))


set()


The empty set confirms that every country name in the politician dataset now has an exact match in the population dataset, so the South Korea and Guinea-Bissau records will no longer be lost in a join. With that issue resolved, I now save the output as cleaned data:

In [12]:
out_directory = "../cleaned_data/"

if not os.path.exists(out_directory):
    os.makedirs(out_directory)
    print(f"Created '{out_directory}' folder to store cleaned data")
else:
    print(f"Depositing cleaned data in the folder: '{out_directory}'")

population_cleaned.to_csv(out_directory + "population_by_country_AUG_2024_clean.csv")
politicians_cleaned.to_csv(out_directory + "politicians_by_country_AUG_2024_clean.csv")

Depositing cleaned data in the folder: '../cleaned_data/'
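
One detail to keep in mind when reading these files back: the `Unnamed: 0` column in the previews above is the kind of column pandas creates when a CSV was written with its index and is then read without `index_col`. The sketch below illustrates that round trip with an in-memory buffer. The frame and variable names are illustrative, and whether to pass `index=False` when saving is a design choice that depends on what downstream code expects, not a correction to the cell above.

```python
import io

import pandas as pd

frame = pd.DataFrame({"Geography": ["Algeria", "Egypt"], "Population": [46.8, 105.2]})

# Default to_csv writes the index as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False omits the index, so no extra column appears on reload
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```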
