# Dataset Combination Notebook

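This notebook combines six country-year datasets (political, economic, ethnic, refugee, urbanisation, and democracy indicators) into one. It standardises column and country names, merges the datasets on `country` and `year`, and exports the result as a single CSV file.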

## 1. Importing Libraries

First, we import the Python libraries used for data manipulation, then read the six source CSV files and preview the first rows of each.

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np

In [None]:
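# Read the datasets
# Note: judging by the previews below, the two ethnicity variable names look swapped
# relative to their contents (cleaned_er.csv holds `totalrefugees`, cleaned_epr.csv
# holds ethnic group sizes). The names are kept as-is because later cells use them.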
Political_Factors = pd.read_csv('/dataset/political/political_combined.csv')
Economic_Factors = pd.read_csv('/dataset/economic/econ_merged_final.csv')
Ethnic_Power_Relations = pd.read_csv('/dataset/ethnicity/cleaned_er.csv')
Ethnicity_Refugees = pd.read_csv('/dataset/ethnicity/cleaned_epr.csv')
Urbanisation = pd.read_csv('/dataset/social/cleaned_urban_population_data2.csv')
Democratic_Factors = pd.read_csv('/dataset/economic/econ_merged_data.csv')

In [None]:
# Store DataFrames in a list
dataframes = [Political_Factors, Economic_Factors, Ethnic_Power_Relations, Ethnicity_Refugees, Urbanisation, Democratic_Factors]

In [None]:
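# View the first 5 rows for each DataFrame
# Pair each DataFrame with its name explicitly rather than searching locals()
df_names = ['Political_Factors', 'Economic_Factors', 'Ethnic_Power_Relations',
            'Ethnicity_Refugees', 'Urbanisation', 'Democratic_Factors']
for df_name, df in zip(df_names, dataframes):
    print(f"First 5 rows for {df_name}:")
    display(df.head())
    print("-" * 20)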

First 5 rows for Political_Factors:


Unnamed: 0.1,Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),...,Media bias (highest score=1),Media freedom (highest score=1),Freedom of the Press (highest score=1),Gender Equality (highest score=1),Educational equality (highest score=1),Health equality (highest score=1),Infant mortality rate (highest score=1),Life expectancy (highest score=1),Mean years of schooling (highest score=1),Human Development Index
0,0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,...,0.96,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.864
1,1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,...,0.96,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.866
2,2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,...,0.96,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.868
3,3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,...,0.96,1.0,0.84,0.78,0.8,0.88,0.98,0.91,0.4,0.873
4,4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,...,0.96,1.0,0.84,0.78,0.8,0.88,0.98,0.91,0.4,0.873


--------------------
First 5 rows for Economic_Factors:


Unnamed: 0,country,year,unemployment_percentage_change,inflation_percentage_change
0,Australia,1990,0.0,-2.666355
1,Australia,1991,0.0,-56.67986
2,Australia,1992,11.965366,-68.135519
3,Australia,1993,1.360291,73.246347
4,Australia,1994,-10.616785,12.316079


--------------------
First 5 rows for Ethnic_Power_Relations:


Unnamed: 0,country,year,totalrefugees
0,Netherlands,1990,0.0
1,Netherlands,1991,0.0
2,Netherlands,1992,0.0
3,Netherlands,1993,0.0
4,Netherlands,1994,0.0


--------------------
First 5 rows for Ethnicity_Refugees:


Unnamed: 0,country,year,group,size
0,Netherlands,1990,Dutch,0.95
1,Netherlands,1991,Dutch,0.95
2,Netherlands,1992,Dutch,0.95
3,Netherlands,1993,Dutch,0.95
4,Netherlands,1994,Dutch,0.95


--------------------
First 5 rows for Urbanisation:


Unnamed: 0,Country,Year,Urban population (% of total population)
0,Australia,1990,85.433
1,Australia,1991,85.403
2,Australia,1992,85.285
3,Australia,1993,85.157
4,Australia,1994,85.028


--------------------
First 5 rows for Democratic_Factors:


Unnamed: 0,Year,Country,Gini coefficient,GDP,Conflict,Deaths,Free and fair elections index,Liberal democracy index
0,1990,Australia,0.437959,2.057392,0,0,0.956,0.852
1,1991,Australia,0.449667,-1.643571,0,0,0.956,0.852
2,1992,Australia,0.444478,-0.690437,0,0,0.956,0.852
3,1993,Australia,0.451112,3.123738,0,0,0.958,0.852
4,1994,Australia,0.455326,2.983092,0,0,0.958,0.852


--------------------
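Beyond `head()`, a compact per-DataFrame summary can make mismatches (such as `Country` versus `country`) visible before any merging. The sketch below is illustrative only: it uses two tiny made-up frames rather than the real datasets, and `summarise` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real datasets (values are illustrative only)
toy_frames = {
    'Urbanisation': pd.DataFrame({
        'Country': ['Australia', 'Australia'],
        'Year': [1990, 1991],
        'Urban population (% of total population)': [85.433, 85.403],
    }),
    'Economic_Factors': pd.DataFrame({
        'country': ['Australia', 'Australia'],
        'year': [1990, 1991],
        'inflation_percentage_change': [-2.666355, -56.67986],
    }),
}

def summarise(frames):
    """Hypothetical helper: build one summary row per DataFrame."""
    rows = []
    for name, df in frames.items():
        # Find the key columns regardless of capitalisation
        keys = [c for c in df.columns if c.lower() in ('country', 'year')]
        rows.append({
            'name': name,
            'rows': len(df),
            'key_columns': keys,
            'missing_values': int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows)

summary = summarise(toy_frames)
print(summary)
```

The `key_columns` entry shows at a glance which frames still need their key columns renamed.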


## 2. Loading and Inspecting Datasets

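With the datasets loaded and previewed above, we drop the leftover index column, keep only the columns we need, standardise the key column names to `country` and `year`, and define the list of countries of interest.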

In [None]:
# Remove the 'Unnamed: 0' column
Political_Factors.drop(columns=['Unnamed: 0'], inplace=True)
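Columns named `Unnamed: 0` usually appear when a DataFrame was written to CSV with its index and then read back. If several frames carried such columns, they could be removed generically. The following is a minimal sketch on a made-up frame; `drop_unnamed` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

# Toy frame mimicking a CSV that was saved with its index (illustrative values)
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'country': ['Australia', 'Australia'],
    'year': [1990, 1991],
})

def drop_unnamed(df):
    """Hypothetical helper: return a copy without 'Unnamed: ...' columns."""
    keep = ~df.columns.str.startswith('Unnamed')
    return df.loc[:, keep].copy()

cleaned = drop_unnamed(toy)
print(cleaned.columns.tolist())
```

Returning a copy leaves the input frame untouched, which keeps the step safe to re-run.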

In [None]:
# Select the required columns
Ethnicity_Refugees = Ethnicity_Refugees[['country', 'year', 'size']]

In [None]:
# Standardise column names
Urbanisation.rename(columns={'Country': 'country', 'Year': 'year'}, inplace=True)

Democratic_Factors.rename(columns={'Country': 'country', 'Year': 'year'}, inplace=True)

# Ethnicity_Refugees is a column subset of the original frame, so assign the
# renamed result instead of renaming in place (avoids SettingWithCopyWarning)
Ethnicity_Refugees = Ethnicity_Refugees.rename(columns={'size': 'ethnicity_ratio'})
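Selecting a subset of columns and then modifying the result in place is a common trigger for pandas' `SettingWithCopyWarning`. One pattern that avoids it is to take an explicit copy of the subset and to let `rename` return a new frame. The sketch below uses a made-up single-row frame, not the real data.

```python
import pandas as pd

# Toy frame with capitalised key columns (illustrative values)
toy = pd.DataFrame({
    'Country': ['Netherlands'],
    'Year': [1990],
    'group': ['Dutch'],
    'size': [0.95],
})

# Take an explicit copy of the column subset so later edits act on an
# independent frame rather than a slice of the original
subset = toy[['Country', 'Year', 'size']].copy()

# rename() without inplace returns a new DataFrame; all mappings go in one call
subset = subset.rename(columns={
    'Country': 'country',
    'Year': 'year',
    'size': 'ethnicity_ratio',
})
print(subset.columns.tolist())
```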


In [None]:
# OECD countries to include (the Slovak Republic and Slovenia are not in this list)
selected_countries = [
    'Australia', 'Austria', 'Belgium', 'Canada', 'Chile', 'Colombia', 'Costa Rica',
    'Czechia', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece',
    'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Japan', 'South Korea', 'Latvia',
    'Lithuania', 'Luxembourg', 'Mexico', 'Netherlands', 'New Zealand', 'Norway',
    'Poland', 'Portugal', 'Spain', 'Sweden', 'Switzerland',
    'Turkey', 'United Kingdom', 'United States']
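The list above is defined, but the cells shown in this notebook do not apply it to any DataFrame. If the intent is to restrict the data to these countries, one way to do so is with `Series.isin`. The sketch below assumes that intent and uses a shortened stand-in list and a made-up frame.

```python
import pandas as pd

# Shortened stand-in for selected_countries (assumption: the full list would be used)
selected = ['Australia', 'Netherlands']

# Toy frame containing one country outside the list (illustrative values)
toy = pd.DataFrame({
    'country': ['Australia', 'Slovenia', 'Netherlands'],
    'year': [1990, 1990, 1990],
})

# Keep only rows whose country is in the list; copy() gives an independent frame
filtered = toy[toy['country'].isin(selected)].copy()
print(filtered['country'].tolist())
```

Applying the filter before merging would keep countries such as Slovenia, which appear in some source files but not in the list, out of the combined dataset.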

## 3. Data Cleaning & Preprocessing

Before merging, we check which countries appear in each dataset and harmonise the country names, so that the same country carries the same label in every DataFrame and the merge keys match.

In [None]:
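# View the countries in each DataFrame. Names are paired explicitly because
# Ethnicity_Refugees was reassigned above, so `dataframes` holds a stale object.
frames = {'Political_Factors': Political_Factors, 'Economic_Factors': Economic_Factors,
          'Ethnic_Power_Relations': Ethnic_Power_Relations, 'Ethnicity_Refugees': Ethnicity_Refugees,
          'Urbanisation': Urbanisation, 'Democratic_Factors': Democratic_Factors}
for df_name, df in frames.items():
    print(f"Countries in {df_name}:")
    print(df['country'].unique())
    print("-" * 20)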

Countries in Political_Factors:
['Australia' 'Austria' 'Belgium' 'Canada' 'Chile' 'Colombia' 'Costa Rica'
 'Czechia' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece'
 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy' 'Japan'
 'Korea (Republic of)' 'Latvia' 'Lithuania' 'Luxembourg' 'Mexico'
 'Netherlands' 'New Zealand' 'Norway' 'Poland' 'Portugal' 'Slovenia'
 'Spain' 'Sweden' 'Switzerland' 'Turkey' 'United Kingdom' 'United States']
--------------------
Countries in Economic_Factors:
['Australia' 'Austria' 'Belgium' 'Canada' 'Chile' 'Colombia' 'Costa Rica'
 'Czechia' 'Denmark' 'Estonia' 'Finland' 'France' 'Germany' 'Greece'
 'Hungary' 'Iceland' 'Ireland' 'Israel' 'Italy' 'Japan' 'Korea' 'Latvia'
 'Lithuania' 'Luxembourg' 'Mexico' 'Netherlands' 'New Zealand' 'Norway'
 'Poland' 'Portugal' 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'Turkey'
 'United Kingdom' 'United States']
--------------------
Countries in Ethnic_Power_Relations:
['Netherlands' 'United States' 'Estonia' 'France' 'Col

In [None]:
# Drop rows for the Slovak Republic; copy() so later in-place edits act on an independent frame
Urbanisation = Urbanisation[Urbanisation['country'] != 'Slovak Republic'].copy()

In [None]:
# Create a function to change a country name (modifies df in place, returns None)
def change_country_name(df, old_name, new_name):
    df.loc[df['country'] == old_name, 'country'] = new_name

In [None]:
# Change the country names in all DataFrames
change_country_name(Ethnic_Power_Relations, 'Korea', 'South Korea')

change_country_name(Ethnicity_Refugees, 'Korea', 'South Korea')

change_country_name(Urbanisation, 'Korea', 'South Korea')
change_country_name(Urbanisation, 'Turkiye', 'Turkey')

change_country_name(Political_Factors, 'Korea (Republic of)', 'South Korea')

change_country_name(Economic_Factors, 'Korea', 'South Korea')
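After renaming, it can help to confirm that no unexpected spellings remain. The sketch below compares each frame's country names against an expected set and reports the leftovers. The frames and the shortened expected set are made up for illustration, and `unexpected_countries` is a hypothetical helper.

```python
import pandas as pd

# Shortened stand-in for the full country list (illustrative only)
expected = {'Australia', 'South Korea', 'Turkey'}

# Toy frames with the kind of naming variants seen in the source files
toy_frames = {
    'Political_Factors': pd.DataFrame(
        {'country': ['Australia', 'Korea (Republic of)', 'Turkey']}),
    'Urbanisation': pd.DataFrame(
        {'country': ['Australia', 'South Korea', 'Turkiye']}),
}

def unexpected_countries(frames, expected_names):
    """Hypothetical helper: names per frame that are not in expected_names."""
    return {name: sorted(set(df['country']) - expected_names)
            for name, df in frames.items()}

leftovers = unexpected_countries(toy_frames, expected)
print(leftovers)
```

An empty list for every frame would indicate that all country labels match the expected set.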

## 4. Merging Datasets

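Once the datasets are clean, we merge them one at a time on the shared `country` and `year` keys. Each step uses an outer join, so a country-year that appears in only one dataset is kept and its missing columns are filled with NaN.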
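A merge on `country` and `year` yields one row per country-year only if those keys are unique on both sides. A dataset with several rows per country-year (for example, one row per ethnic group) would multiply rows in the result. The sketch below, on made-up frames, counts duplicate keys and uses the `validate` argument of `pd.merge` as a guard. Whether the real datasets contain such duplicates is not shown here.

```python
import pandas as pd

left = pd.DataFrame({
    'country': ['Netherlands', 'Netherlands'],
    'year': [1990, 1991],
    'x': [1.0, 2.0],
})
# Two groups for the same country-year (illustrative values)
right = pd.DataFrame({
    'country': ['Netherlands', 'Netherlands'],
    'year': [1990, 1990],
    'ethnicity_ratio': [0.95, 0.02],
})

keys = ['country', 'year']

# Count rows whose key combination already appeared earlier in the frame
n_dupes = int(right.duplicated(subset=keys).sum())
print(f"Duplicate key rows in right: {n_dupes}")

# validate='one_to_one' makes pandas raise MergeError when keys repeat
try:
    pd.merge(left, right, on=keys, how='outer', validate='one_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(f"Merge passed validation: {merge_ok}")
```

If duplicates are found, aggregating to one row per country-year before merging is one possible remedy; the right aggregation depends on what the column is meant to represent.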

In [None]:
# Merge the Political_Factors & Urbanisation DataFrames
df_combined = pd.merge(Political_Factors, Urbanisation, on=['country', 'year'], how='outer')
df_combined.head()

Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),Freedom of Religion (highest score=1),...,Media freedom (highest score=1),Freedom of the Press (highest score=1),Gender Equality (highest score=1),Educational equality (highest score=1),Health equality (highest score=1),Infant mortality rate (highest score=1),Life expectancy (highest score=1),Mean years of schooling (highest score=1),Human Development Index,Urban population (% of total population)
0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.864,85.433
1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.866,85.403
2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,1.0,0.84,0.78,0.8,0.88,0.97,0.9,0.39,0.868,85.285
3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,1.0,0.84,0.78,0.8,0.88,0.98,0.91,0.4,0.873,85.157
4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,1.0,0.84,0.78,0.8,0.88,0.98,0.91,0.4,0.873,85.028


In [None]:
# Merge the df_combined & Democratic_Factors DataFrames
df_combined = pd.merge(df_combined, Democratic_Factors, on=['country', 'year'], how='outer')
df_combined.head()

Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),Freedom of Religion (highest score=1),...,Life expectancy (highest score=1),Mean years of schooling (highest score=1),Human Development Index,Urban population (% of total population),Gini coefficient,GDP,Conflict,Deaths,Free and fair elections index,Liberal democracy index
0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.9,0.39,0.864,85.433,0.437959,2.057392,0,0,0.956,0.852
1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.9,0.39,0.866,85.403,0.449667,-1.643571,0,0,0.956,0.852
2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.9,0.39,0.868,85.285,0.444478,-0.690437,0,0,0.956,0.852
3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.91,0.4,0.873,85.157,0.451112,3.123738,0,0,0.958,0.852
4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.91,0.4,0.873,85.028,0.455326,2.983092,0,0,0.958,0.852


In [None]:
# Merge the df_combined & Ethnic_Power_Relations DataFrames
df_combined = pd.merge(df_combined, Ethnic_Power_Relations, on=['country', 'year'], how='outer')
df_combined.head()

Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),Freedom of Religion (highest score=1),...,Mean years of schooling (highest score=1),Human Development Index,Urban population (% of total population),Gini coefficient,GDP,Conflict,Deaths,Free and fair elections index,Liberal democracy index,totalrefugees
0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.39,0.864,85.433,0.437959,2.057392,0,0,0.956,0.852,0.0
1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.39,0.866,85.403,0.449667,-1.643571,0,0,0.956,0.852,0.0
2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.39,0.868,85.285,0.444478,-0.690437,0,0,0.956,0.852,0.0
3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.4,0.873,85.157,0.451112,3.123738,0,0,0.958,0.852,0.0
4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.4,0.873,85.028,0.455326,2.983092,0,0,0.958,0.852,0.0


In [None]:
# Merge the df_combined & Ethnicity_Refugees DataFrames
df_combined = pd.merge(df_combined, Ethnicity_Refugees, on=['country', 'year'], how='outer')
df_combined.head()

Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),Freedom of Religion (highest score=1),...,Human Development Index,Urban population (% of total population),Gini coefficient,GDP,Conflict,Deaths,Free and fair elections index,Liberal democracy index,totalrefugees,ethnicity_ratio
0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.864,85.433,0.437959,2.057392,0,0,0.956,0.852,0.0,0.84
1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.866,85.403,0.449667,-1.643571,0,0,0.956,0.852,0.0,0.84
2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.868,85.285,0.444478,-0.690437,0,0,0.956,0.852,0.0,0.84
3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.873,85.157,0.451112,3.123738,0,0,0.958,0.852,0.0,0.84
4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.873,85.028,0.455326,2.983092,0,0,0.958,0.852,0.0,0.84


In [None]:
# Merge the df_combined & Economic_Factors DataFrames
df_combined = pd.merge(df_combined, Economic_Factors, on=['country', 'year'], how='outer')
df_combined.head()

Unnamed: 0,country,year,Effective Parliament (highest score=1),Election free and fair (highest score=1),Election government intimidation (highest score=1),Voter turnout (highest score=1),Fair trial (highest score=1),Judicial Independence (highest score=1),Predictable Enforcement (highest score=1),Freedom of Religion (highest score=1),...,Gini coefficient,GDP,Conflict,Deaths,Free and fair elections index,Liberal democracy index,totalrefugees,ethnicity_ratio,unemployment_percentage_change,inflation_percentage_change
0,Australia,1990,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.437959,2.057392,0,0,0.956,0.852,0.0,0.84,0.0,-2.666355
1,Australia,1991,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.449667,-1.643571,0,0,0.956,0.852,0.0,0.84,0.0,-56.67986
2,Australia,1992,0.85,0.82,0.85,0.82,1.0,1.0,0.9,0.98,...,0.444478,-0.690437,0,0,0.956,0.852,0.0,0.84,11.965366,-68.135519
3,Australia,1993,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.451112,3.123738,0,0,0.958,0.852,0.0,0.84,1.360291,73.246347
4,Australia,1994,0.85,0.82,0.85,0.84,1.0,1.0,0.9,0.98,...,0.455326,2.983092,0,0,0.958,0.852,0.0,0.84,-10.616785,12.316079
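The five merges above all use the same keys and join type, so they could also be expressed as a single fold over a list of DataFrames. The following is a sketch of that alternative using `functools.reduce` on made-up frames; `merge_all` is a hypothetical helper, and the notebook itself merges step by step.

```python
from functools import reduce

import pandas as pd

# Toy frames sharing the country/year keys (illustrative values)
a = pd.DataFrame({'country': ['Australia'], 'year': [1990], 'a': [1]})
b = pd.DataFrame({'country': ['Australia'], 'year': [1990], 'b': [2]})
c = pd.DataFrame({'country': ['Australia', 'Austria'],
                  'year': [1990, 1990], 'c': [3, 4]})

def merge_all(frames, keys=('country', 'year')):
    """Hypothetical helper: outer-merge a list of frames from left to right."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=list(keys), how='outer'),
        frames,
    )

combined = merge_all([a, b, c])
print(combined)
```

The step-by-step version has the advantage that each intermediate result can be inspected with `head()`, as the notebook does.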


## 5. Exporting the Final Dataset

After merging, we check the column list of the combined DataFrame and save it as a CSV file for further analysis.

In [None]:
# View the columns in the df_combined DataFrame
df_combined.columns

Index(['country', 'year', 'Effective Parliament (highest score=1)',
       'Election free and fair (highest score=1)',
       'Election government intimidation (highest score=1)',
       'Voter turnout (highest score=1)', 'Fair trial (highest score=1)',
       'Judicial Independence (highest score=1)',
       'Predictable Enforcement (highest score=1)',
       'Freedom of Religion (highest score=1)',
       'Free Political Parties (highest score=1)',
       'Harassment of journalists (highest score=1)',
       'Media bias (highest score=1)', 'Media freedom (highest score=1)',
       'Freedom of the Press (highest score=1)',
       'Gender Equality (highest score=1)',
       'Educational equality (highest score=1)',
       'Health equality (highest score=1)',
       'Infant mortality rate (highest score=1)',
       'Life expectancy (highest score=1)',
       'Mean years of schooling (highest score=1)', 'Human Development Index',
       'Urban population (% of total population)', 'Gini c

In [None]:
# Export the combined DataFrame
df_combined.to_csv('/dataset/fully-combined.csv', index=False)
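Before relying on the exported file, a quick sanity check can confirm that each country-year appears once and that the CSV reads back with the same columns. The sketch below uses a made-up frame and writes to an in-memory buffer instead of the real output path, so it has no side effects.

```python
import io

import pandas as pd

# Toy stand-in for the combined DataFrame (illustrative values)
toy = pd.DataFrame({
    'country': ['Australia', 'Australia'],
    'year': [1990, 1991],
    'GDP': [2.057392, -1.643571],
})

# Each country-year should appear only once in the combined data
has_duplicates = bool(toy.duplicated(subset=['country', 'year']).any())
print(f"Duplicate country-year rows: {has_duplicates}")

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```

Because `index=False` is passed to `to_csv`, the file read back has no extra `Unnamed: 0` column.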