# Research Question: 

Question: Can we predict the number of people who attain their bachelor's in an OECD country? 

In this assignment, we want to predict the number of people in OECD countries who complete their bachelor's degree. Factors like the country's GDP, the ratio of people enrolled in primary, secondary, and tertiary education, the population of the country, how much the government spends on higher education, how much households spend on higher education, the number of public universities, the number of private universities, the average cost of higher education, household income, and the year are all included as variables in this model. We will train a multivariate regression to see if we can reliably predict the number of people graduating with their bachelor's.
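
To make the modeling goal concrete, here is a minimal sketch of a multivariate linear regression fitted with ordinary least squares in NumPy. The data is synthetic and noise-free, and the three predictors and their coefficients are hypothetical stand-ins, not the variables or results of this project.

```python
import numpy as np

# Synthetic, noise-free data: 3 hypothetical predictors for 50 country-year rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.0, 0.5])   # assumed slopes, for illustration only
true_intercept = 4.0                      # assumed intercept, for illustration only
y = X @ true_coefs + true_intercept

# Append a column of ones so the intercept is estimated together with the slopes.
X_design = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: minimise ||X_design @ beta - y||^2.
beta, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
fitted_coefs, fitted_intercept = beta[:3], beta[3]
```

Because the synthetic target has no noise, the fitted values match the assumed ones up to floating-point error; real data would not fit this cleanly.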

Our inspiration for this project stemmed from https://icfdn.org/our-impact/education/. On this website, we found that "Just one extra year of schooling can increase an individual’s earnings by up to 10%, and can raise the region’s average annual gross domestic product (GDP) growth by 0.37%." Because of this, we wanted to explore how education factors in a country relate to that country's GDP.

We decided to evaluate only countries that are part of the Organization for Economic Co-operation and Development (OECD). The 37 countries in this organization collaborate to develop policy standards and promote economic growth. We chose OECD countries because they account for three-fifths of the world's GDP, three-quarters of world trade, half of the world's energy consumption, and 18 percent of the world's population. Because these 37 countries account for such a large part of the world's GDP, we decided that this group of countries would be easier to evaluate than collecting data from all 197 countries in the world.  https://www.state.gov/the-organization-for-economic-co-operation-and-development-oecd/#:~:text=and%20Development%20(OECD)-,The%20Organization%20for%20Economic%20Cooperation%20and%20Development%20(OECD),to%20promote%20sustainable%20economic%20growth.


In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb 

# Data Collection/Cleaning: 

## Enrollment rates in early childhood education: 
https://data.oecd.org/students/enrolment-rate-in-early-childhood-education.htm

This data set shows the enrollment rates in early childhood education in OECD countries. The net enrollment rates are calculated by dividing the number of students of a particular age group (ages 3-5) enrolled in early childhood education by the size of the population of that age group. This data set only includes enrollment rates from 2013-2020.
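
As a small worked example of the net enrollment rate described above, the sketch below divides enrolled children by the same-age population and expresses the result as a percentage. The function name and the figures are hypothetical and are not taken from the OECD data.

```python
def net_enrollment_rate(enrolled: int, population: int) -> float:
    """Return enrolled children as a percentage of the same-age population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * enrolled / population


# Hypothetical figures: 45,000 enrolled four-year-olds out of 50,000.
rate = net_enrollment_rate(45_000, 50_000)
```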

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies the type of education the age group is enrolled in. This column is helpful because, when we combine all the data sets, we need the indicator value to differentiate the type of education. The Subject column lists the age group of each row (ages 3-5). The Measure column specifies how the data is separated. Because this data set is part of a larger Education database, the Measure value indicates how each age group is grouped, which in this case is by age. Because the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each enrollment rate, ranging from 2013 to 2020. The Value column displays the net enrollment percentage for each row. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Measure, Frequency, and Flag Codes columns. After removing these columns, the data set only includes the columns Location, Indicator, Subject, Time, and Value.


In [4]:
enrollment_early_df = pd.read_csv('enrollment rate in early childhood education.csv')
query = """
        SELECT Location, Indicator, Subject, Time, Value
        FROM enrollment_early_df
        """
enrollment_early_df = duckdb.sql(query).df()
enrollment_early_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'enrollment rate in early childhood education.csv'

Because we want one enrollment rate per country per year, we need to combine the age groups within each country and year. To do this, we calculated the average enrollment rate for each country and year. At the end, we dropped the Subject and Indicator columns because here we only need the enrollment rate, year, and country name. We also renamed the columns so that they describe the values more clearly and are easier to understand.

In [None]:
filtered_df = enrollment_early_df[enrollment_early_df['SUBJECT'].isin(['AGE_3', 'AGE_4', 'AGE_5'])]
enrollment_early_education_df = filtered_df.groupby(['LOCATION','TIME'])['Value'].mean().reset_index()
enrollment_early_education_df = enrollment_early_education_df.drop_duplicates()
enrollment_early_education_df = enrollment_early_education_df.rename(columns = {"LOCATION": "Country", 
                                                                                "TIME": "Year", 
                                                                               "Value" : "Early Childhood Education Enrollment Rates"})
enrollment_early_education_df = enrollment_early_education_df.replace({"ENROLMENT_ECE" : "Early Childhood"})
enrollment_early_education_df.head()
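
To illustrate what the averaging step does, here is a small sketch on made-up rows shaped like the OECD export. The country codes `AAA` and `BBB`, the values, and the output column name `Rate` are hypothetical. Note that this is an unweighted mean of the three age-group rates, so it can differ from a pooled rate when the age groups have different population sizes.

```python
import pandas as pd

# Hypothetical rows in the same shape as the OECD export.
toy_df = pd.DataFrame({
    "LOCATION": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "SUBJECT": ["AGE_3", "AGE_4", "AGE_5"] * 2,
    "TIME": [2013] * 6,
    "Value": [60.0, 80.0, 100.0, 90.0, 90.0, 90.0],
})

# One row per country-year: the unweighted mean of the three age-group rates.
toy_avg = (
    toy_df.groupby(["LOCATION", "TIME"], as_index=False)["Value"]
    .mean()
    .rename(columns={"LOCATION": "Country", "TIME": "Year", "Value": "Rate"})
)
```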

## Enrollment rates in secondary and tertiary education: 

https://data.oecd.org/students/enrolment-rate-in-secondary-and-tertiary-education.htm#indicator-chart

This data set shows the enrollment rates in secondary and tertiary education in OECD countries. The net enrollment rates are calculated by dividing the number of students of a particular age enrolled in these levels of education by the size of the population at that age (ages 17-19). The data set only includes data from 2013-2020.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies the type of education the age group is enrolled in. This column is helpful because, when we combine all the data sets, we need the indicator value to differentiate the type of education. The Subject column lists the age group of each row (ages 17-19). The Measure column specifies how the data is separated. Because this data set is part of a larger Education database, the Measure value indicates how each age group is grouped, which in this case is by age. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each enrollment rate, ranging from 2013 to 2020. The Value column displays the net enrollment percentage for each row. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Measure, Frequency, and Flag Codes columns. After removing these columns, the data set only includes the columns Location, Indicator, Subject, Time, and Value.

In [None]:
enrollment_higher_df = pd.read_csv('enrollment rates in secondary and tertiary education.csv')

query = """
        SELECT Location, Indicator, Subject, Time, Value
        FROM enrollment_higher_df
        """
enrollment_higher_df = duckdb.sql(query).df()
enrollment_higher_df.head()

Because we want one enrollment rate per country per year, we need to combine the age groups within each country and year. To do this, we calculated the average enrollment rate for each country and year. At the end, we dropped the Subject and Indicator columns because here we only need the enrollment rate, year, and country. We also renamed the columns so that they describe the values more clearly and are easier to understand.

In [None]:
filtered_df = enrollment_higher_df[enrollment_higher_df['SUBJECT'].isin(['AGE_17', 'AGE_18', 'AGE_19'])]
enrollment_higher_education_df = filtered_df.groupby(['LOCATION','TIME'])['Value'].mean().reset_index()
enrollment_higher_education_df = enrollment_higher_education_df.drop_duplicates()
enrollment_higher_education_df = enrollment_higher_education_df.rename(columns = {"LOCATION": "Country", 
                                                                                "TIME": "Year", 
                                                                               "Value" : "Higher Education Enrollment Rates"})
enrollment_higher_education_df = enrollment_higher_education_df.replace({"ENROLMENT" : "Higher"})
enrollment_higher_education_df.head()

The two dataframes about enrollment rates in education are combined below, so that we have one cohesive dataframe showing enrollment rates in early childhood and higher education in 37 different countries from 2013 to 2020.

In [None]:
query = """
        SELECT *
        FROM enrollment_early_education_df
        FULL JOIN enrollment_higher_education_df
        ON enrollment_early_education_df.Country = enrollment_higher_education_df.Country
        AND enrollment_early_education_df.Year = enrollment_higher_education_df.Year;
        """
enrollment_rates_df = duckdb.sql(query).df()
enrollment_rates_df

The table above has a lot of missing values. Some countries do not have enrollment rates for both early childhood education and higher education, and because of this many values in the Country and Year columns are NaN. To fix this, we make the values in the Country and Country_2 columns the same, and the values in Year and Year_2 the same. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, and change the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
enrollment_rates_df['Country'] = enrollment_rates_df['Country'].fillna(enrollment_rates_df['Country_2'])

# Fill missing values in Country_2 with values from Country
enrollment_rates_df['Country_2'] = enrollment_rates_df['Country_2'].fillna(enrollment_rates_df['Country'])

# Fill missing values in Year with values from Year_2
enrollment_rates_df['Year'] = enrollment_rates_df['Year'].fillna(enrollment_rates_df['Year_2'])

# Fill missing values in Year_2 with values from Year
enrollment_rates_df['Year_2'] = enrollment_rates_df['Year_2'].fillna(enrollment_rates_df['Year'])

#dropping Country_2 and Year_2 columns
enrollment_rates_df = enrollment_rates_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
enrollment_rates_df['Year'] = enrollment_rates_df['Year'].astype(int)

enrollment_rates_df
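
As an alternative to filling and dropping the duplicated key columns by hand, the same result can be sketched with a pandas outer merge on the shared keys, which keeps a single Country and Year column. The toy frames, country codes, and the `Early` and `Higher` column names below are hypothetical; this is a sketch of the idea, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical inputs: "AAA" only has early data, "CCC" only has higher data.
early = pd.DataFrame(
    {"Country": ["AAA", "BBB"], "Year": [2013, 2013], "Early": [80.0, 90.0]}
)
higher = pd.DataFrame(
    {"Country": ["BBB", "CCC"], "Year": [2013, 2013], "Higher": [70.0, 65.0]}
)

# An outer merge on the keys keeps every country-year from either side and
# produces one Country column and one Year column, with NaN where a rate is missing.
merged = early.merge(higher, on=["Country", "Year"], how="outer")
```

Because the key columns are never missing after this merge, Year keeps its integer type and no `astype(int)` step is needed.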

In [None]:
enrollment_rates_df.to_csv('enrollment_rates.csv', index=False)

## GDP:

https://data.oecd.org/gdp/gross-domestic-product-gdp.htm

This data set shows the nominal gross domestic product (GDP) per capita of OECD countries in US dollars from 1960 to 2022. GDP is the standard measure of the value added created through the production of goods and services in a country during a certain period. GDP per capita is found by dividing the total GDP by the country's population. GDP also measures the income earned from that production, or the total amount spent on final goods and services.
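
The per-capita calculation described above can be sketched as a single division. The function name and the figures are hypothetical and are not taken from the OECD data.

```python
def gdp_per_capita(total_gdp_usd: float, population: float) -> float:
    """Return GDP per person in US dollars."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_gdp_usd / population


# Hypothetical country: 2 trillion USD of GDP shared by 50 million people.
value = gdp_per_capita(2_000_000_000_000, 50_000_000)
```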

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured. This column is not necessary because only one value is measured in this data set, so it is redundant. The Measure column specifies how the data was measured, in this case in US dollars. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each GDP value, ranging from 1960 to 2022. The Value column displays the nominal GDP value. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "GDP per Capita" to make clear what it contains, which is needed when we combine all the data sets.

We also limit the data to the years 2013 through 2020 to be consistent with the previous data sets and to reduce the number of NaN values in the data.

In [None]:
gdp_df = pd.read_csv('gdp.csv')
query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "GDP per Capita"
        FROM gdp_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
gdp_df = duckdb.sql(query).df()
gdp_df
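
The query above lists each year explicitly. As a sketch, the same inclusive range filter can be written in pandas with `Series.between`; the toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical GDP rows, two of which fall outside the 2013-2020 window.
toy_gdp = pd.DataFrame({
    "LOCATION": ["AAA"] * 4,
    "TIME": [2012, 2013, 2020, 2021],
    "Value": [100.0, 110.0, 180.0, 190.0],
})

# between() is inclusive on both ends by default, so 2013 and 2020 are kept.
in_range = toy_gdp[toy_gdp["TIME"].between(2013, 2020)]
```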

In [None]:
gdp_df.to_csv('clean_gdp.csv', index=False)

The two dataframes (enrollment_rates_df and gdp_df) are combined below using a full join, so that we have one cohesive dataframe showing the enrollment rates and GDP per capita of each country from 2013-2020.

In [None]:
query = """
        SELECT *
        FROM enrollment_rates_df
        FULL JOIN gdp_df
        ON enrollment_rates_df.Country = gdp_df.Country
        AND enrollment_rates_df.Year = gdp_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries do not have values for early childhood enrollment, higher education enrollment, or GDP, and because of this many values in the Country and Year columns are NaN. To fix this, we make the values in the Country and Country_2 columns the same, and the values in Year and Year_2 the same. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, change the float values in Year to integers, and round the GDP values to two decimals to represent US dollars.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

#rounding the GDP values to 2 decimals
combined_df['GDP per Capita'] = combined_df['GDP per Capita'].round(2)

combined_df

## Population:

https://data.oecd.org/pop/population.htm

This data set shows the total population of OECD countries in millions of people from 1950 to 2022. The total population includes the following: national armed forces stationed abroad; merchant seamen at sea; diplomatic personnel located abroad; civilian aliens resident in the country; displaced persons resident in the country. However, it excludes the following: foreign armed forces stationed in the country; foreign diplomatic personnel located in the country; civilian aliens temporarily in the country.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (population). This column is not necessary because only one value is measured in this data set, so it is redundant. The Measure column specifies how the data was measured, in this case in millions of people. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each population value, ranging from 1950 to 2022. The Value column displays the total population in millions of people. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Total Population" and multiplied it by 1,000,000 to convert it from millions of people to people, which makes the value clear when we combine all the data sets.

We also limit Year to 2013 through 2020 to be consistent with the previous data sets and to reduce the number of missing values.

In [None]:
population_df = pd.read_csv('population.csv')
population_df

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value * 1000000 AS "Total Population"
        FROM population_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
population_df = duckdb.sql(query).df()
population_df

In [None]:
population_df.to_csv('clean_population.csv', index=False)

The two dataframes (combined_df and population_df) are combined below using a full join, so that we have one cohesive dataframe showing the enrollment rates, GDP per capita, and total population of each country from 2013-2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN population_df
        ON combined_df.Country = population_df.Country
        AND combined_df.Year = population_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries do not have values for early childhood enrollment, higher education enrollment, GDP, or population, and because of this many values in the Country and Year columns are NaN. To fix this, we make the values in the Country and Country_2 columns the same, and the values in Year and Year_2 the same. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, and change the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df
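
The join-then-fill pattern is repeated for every new data set. As a sketch of how the repetition could be reduced, the helper below chains pandas outer merges on the shared keys with `functools.reduce`. The helper name `outer_merge_all`, the toy frames, and the `Rate` column are hypothetical; this is an alternative to the DuckDB joins in this notebook, not the method used here.

```python
from functools import reduce

import pandas as pd

KEYS = ["Country", "Year"]


def outer_merge_all(frames):
    """Outer-merge a list of dataframes on Country and Year, left to right."""
    return reduce(
        lambda left, right: left.merge(right, on=KEYS, how="outer"), frames
    )


# Hypothetical cleaned inputs, each with the two key columns and one value column.
enrollment = pd.DataFrame({"Country": ["AAA"], "Year": [2013], "Rate": [80.0]})
gdp = pd.DataFrame(
    {"Country": ["AAA", "BBB"], "Year": [2013, 2013],
     "GDP per Capita": [40000.0, 30000.0]}
)
population = pd.DataFrame(
    {"Country": ["BBB"], "Year": [2013], "Total Population": [5_000_000.0]}
)

combined = outer_merge_all([enrollment, gdp, population])
```

Each merge keeps a single Country and Year column, so no `_2` columns need to be filled or dropped afterwards.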

## Public Spending On Education:

https://data.oecd.org/eduresource/public-spending-on-education.htm#indicator-chart

This data shows public spending on education as a percentage of GDP for tertiary levels of education. Public spending on education includes direct expenditure on educational institutions, as well as education-related public subsidies given to households and administered by educational institutions. Public entities include ministries other than ministries of education, local and regional governments, and other public agencies. Public spending includes expenditure on schools, universities, and other public and private institutions delivering or supporting educational services.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (public spending on education). This column is not necessary because only one value is measured in this data set, so it is redundant. The Measure column specifies how the data was measured, in this case as public spending as a percentage of GDP. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each spending value, ranging from 2000 to 2020. The Value column displays the percentage of GDP that is used for public spending on education. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Public Spending on Education (%)" to make clear what it contains, which is needed when we combine all the data sets.

We also limit Year to 2013 through 2020 to be consistent with the previous data sets and to reduce the number of missing values.

In [None]:
public_spending_df = pd.read_csv('public_spending_on_education.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Public Spending on Education (%)"
        FROM public_spending_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
public_spending_df = duckdb.sql(query).df()
public_spending_df

In [None]:
public_spending_df.to_csv('clean_public_education_df.csv', index=False)

The two dataframes (combined_df and public_spending_df) are combined below using a full join, so that we have one cohesive dataframe showing the enrollment rates, GDP per capita, total population, and public spending as a percentage of GDP of each country from 2013-2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN public_spending_df
        ON combined_df.Country = public_spending_df.Country
        AND combined_df.Year = public_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries do not have values for every variable (early childhood enrollment, higher education enrollment, GDP, and so on), and because of this many values in the Country and Year columns are NaN. To fix this, we make the values in the Country and Country_2 columns the same, and the values in Year and Year_2 the same. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, and change the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

## Private Spending on Education:

https://data.oecd.org/eduresource/private-spending-on-education.htm#indicator-chart

This data shows private spending on education as a percentage of GDP for tertiary education. Private spending on education refers to expenditure funded by private sources, which are households and other private entities. It includes all direct expenditure on educational institutions, net of public subsidies.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (private spending on education). This column is not necessary because only one value is measured in this data set, so it is redundant. The Measure column specifies how the data was measured, in this case as private spending as a percentage of GDP. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each spending value, ranging from 2000 to 2020. The Value column displays the percentage of GDP that is used for private spending on education. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Private Spending on Education (%)" in order to specify the value of it, which is needed when we combine all the data sets.

We also limit Year to 2013 through 2020 to be consistent with the previous data sets and to reduce the number of missing values.

In [None]:
private_spending_df = pd.read_csv('private_spending_on_education.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Private Spending on Education (%)"
        FROM private_spending_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
private_spending_df = duckdb.sql(query).df()
private_spending_df

In [None]:
private_spending_df.to_csv('clean_private_education_df.csv', index=False)

The two dataframes (combined_df and private_spending_df) are combined below using a full join, so that we have one cohesive dataframe showing the enrollment rates, GDP per capita, total population, public spending as a percentage of GDP, and private spending as a percentage of GDP of each country from 2013-2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN private_spending_df
        ON combined_df.Country = private_spending_df.Country
        AND combined_df.Year = private_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries do not have values for every variable (early childhood enrollment, higher education enrollment, GDP, and so on), and because of this many values in the Country and Year columns are NaN. To fix this, we make the values in the Country and Country_2 columns the same, and the values in Year and Year_2 the same. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, and change the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

## Household Income per Capita:

https://data.oecd.org/hha/household-disposable-income.htm#indicator-chart

This data set shows the gross household disposable income per capita in 37 OECD countries. Household disposable income is income available to households, such as wages and salaries, income from self-employment and unincorporated enterprises, income from pensions and other social benefits, and income from financial investments. Gross means that depreciation costs are not subtracted. For gross household disposable income per capita, growth rates (percentage change from previous period) are presented; these are ‘real’ growth rates adjusted to remove the effects of price changes. Information is also presented for gross household disposable income including social transfers, such as health or education provided for free or at reduced prices by governments and not-for-profit organisations.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (household disposable income). This column is not necessary because only one value is measured in this data set, so it is redundant. The Measure column specifies how the data was measured, in this case as household income per capita. Since the Measure is the same for all rows, this column isn't necessary. The Frequency column is similar to the Measure column and is also not necessary here. The Time column indicates the year of each income value, ranging from 1970 to 2020. The Value column displays the household disposable income per capita. The last column, Flag Codes, is used to flag a problem with a row; because the whole column does not contain any values in this case, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Household Income per Capita" in order to specify the value of it, which is needed when we combine all the data sets.

We also limit Year to 2013 through 2020 to be consistent with the previous data sets and to reduce the number of missing values.

In [None]:
household_income_df = pd.read_csv('household_income.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Household Income per Capita"
        FROM household_income_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
household_income_df = duckdb.sql(query).df()
household_income_df

In [None]:
household_income_df.to_csv('clean_household_income_df.csv', index=False)
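
The selection, renaming, and year filter above can also be written in pandas alone. The following is a minimal sketch on a toy frame, assuming the export uses the LOCATION, TIME, and Value column names seen in the query; the rows and the indicator code are made up for illustration.

```python
import pandas as pd

# Toy frame laid out like the OECD export (all values are made up).
raw = pd.DataFrame({
    "LOCATION": ["AUS", "AUS", "AUT"],
    "INDICATOR": ["HHDI", "HHDI", "HHDI"],
    "TIME": [2012, 2013, 2020],
    "Value": [1.5, 2.0, 2.5],
    "Flag Codes": [None, None, None],
})

# Keep only the three useful columns, rename them, and keep 2013-2020.
# Series.between is inclusive on both ends by default.
clean = (
    raw.loc[raw["TIME"].between(2013, 2020), ["LOCATION", "TIME", "Value"]]
       .rename(columns={
           "LOCATION": "Country",
           "TIME": "Year",
           "Value": "Household Income per Capita",
       })
       .reset_index(drop=True)
)
print(clean)
```

Using `between` avoids listing every year by hand, which makes the range easier to change later.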

The two dataframes (combined_df and household_income_df) are combined below using a full join, giving one data frame with enrollment rates, GDP, total population, public spending as a percentage of GDP, private spending as a percentage of GDP, and household income per capita for each country from 2013 to 2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN household_income_df
        ON combined_df.Country = household_income_df.Country
        AND combined_df.Year = household_income_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries lack values for early childhood enrollment, higher education enrollment, GDP, or other variables in some years, so a row may exist in only one of the two joined data frames. For those rows, the Country and Year columns coming from the other data frame are filled with NaN. To fix this, we fill Country from Country_2 and Year from Year_2 (and vice versa) so that both pairs of columns match. We then drop the Country_2 and Year_2 columns because they are redundant, and convert the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df
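
As an alternative to filling and dropping the duplicate key columns, an outer merge in pandas on the shared keys returns a single Country and a single Year column. This is only a sketch on toy frames (the values are made up), not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-ins for combined_df and household_income_df (values are made up).
left = pd.DataFrame({
    "Country": ["AUS", "AUT"],
    "Year": [2013, 2013],
    "GDP": [1.0, 2.0],
})
right = pd.DataFrame({
    "Country": ["AUT", "BEL"],
    "Year": [2013, 2014],
    "Household Income per Capita": [30.0, 40.0],
})

# An outer merge on the shared keys keeps one Country and one Year column,
# so no Country_2 / Year_2 columns need to be filled and dropped afterwards.
merged = pd.merge(left, right, on=["Country", "Year"], how="outer")
merged = merged.sort_values(["Country", "Year"]).reset_index(drop=True)
print(merged)
```

Rows that exist on only one side still appear, with NaN in the columns that come from the other side.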

## Education Spending:

https://data.oecd.org/eduresource/education-spending.htm#indicator-chart

This data set shows the average amount of education spending that covers expenditure on schools, universities and other public and private educational institutions in 37 OECD countries. Spending includes instruction and ancillary services for students and families provided through educational institutions. Education spending is shown in USD per student.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (education spending in dollars); because only one indicator appears in this data set, the column is redundant. The Measure column specifies how the data was measured, in this case USD per student; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year of each spending value and ranges from 1995 to 2020. The Value column holds the education spending in USD per student. The last column, Flag Codes, is used to flag a problem with a row; because the whole column is empty in this data set, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We renamed the "Location" column as "Country" and the "Time" column as "Year" to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Average Spending on Higher Education (USD/student)" so that it stays identifiable once all the data sets are combined.

We also limit Year to the range 2013 to 2020 (inclusive) to be consistent with the previous data sets and to limit the number of missing values.

In [None]:
average_spending_df = pd.read_csv('educaion_spending.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Average Spending on Higher Education (USD/student)"
        FROM average_spending_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
average_spending_df = duckdb.sql(query).df()
average_spending_df

In [None]:
average_spending_df.to_csv('clean_average_spending_df.csv', index=False)

The two dataframes (combined_df and average_spending_df) are combined below using a full join, giving one data frame with enrollment rates, GDP, total population, public spending as a percentage of GDP, private spending as a percentage of GDP, household income per capita, and average spending on higher education per student for each country from 2013 to 2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN average_spending_df
        ON combined_df.Country = average_spending_df.Country
        AND combined_df.Year = average_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries lack values for early childhood enrollment, higher education enrollment, GDP, or other variables in some years, so a row may exist in only one of the two joined data frames. For those rows, the Country and Year columns coming from the other data frame are filled with NaN. To fix this, we fill Country from Country_2 and Year from Year_2 (and vice versa) so that both pairs of columns match. We then drop the Country_2 and Year_2 columns because they are redundant, and convert the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df
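
Since each full join adds more missing values, it can help to count them per column before modelling. A minimal sketch on a toy frame follows; the values are made up and only stand in for combined_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for combined_df after a full join (values are made up).
toy = pd.DataFrame({
    "Country": ["AUS", "AUT", "BEL"],
    "Year": [2013, 2013, 2014],
    "GDP": [1.0, np.nan, 3.0],
    "Household Income per Capita": [np.nan, np.nan, 40.0],
})

# Count missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Share of rows that have every column populated.
complete_share = toy.dropna().shape[0] / toy.shape[0]
print(complete_share)
```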

## Number of Universities:

We weren't able to find a data set that directly listed the number of universities in each OECD country. Because of this, we built the data set by hand in Excel and exported it to a csv file.

All the information in the data set comes from this website: https://www.webometrics.info/en/distribution_by_country


In [None]:
num_universities_df = pd.read_csv('number_of_universities.csv')
num_universities_df

The two dataframes (combined_df and num_universities_df) are combined below using a left join on Country, giving one data frame with enrollment rates, GDP, total population, public spending as a percentage of GDP, private spending as a percentage of GDP, household income per capita, average spending on higher education per student, and number of universities for each country from 2013 to 2020.

In [None]:
combined_df = pd.merge(combined_df, num_universities_df, on='Country', how='left')

combined_df
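
Because the universities table was built by hand, a spelling mismatch in a country code would silently leave NaN after the left join. The sketch below shows one way to check for that with the `indicator` option of `pd.merge`. The frames are toy stand-ins, and the "Number of Universities" column name is an assumption, since the actual csv columns are not shown here.

```python
import pandas as pd

# Toy stand-ins (values are made up; "Number of Universities" is a
# hypothetical column name).
panel = pd.DataFrame({
    "Country": ["AUS", "AUS", "AUT"],
    "Year": [2013, 2014, 2013],
})
universities = pd.DataFrame({
    "Country": ["AUS"],
    "Number of Universities": [10],
})

# indicator=True adds a _merge column saying where each row found a match.
checked = pd.merge(panel, universities, on="Country", how="left", indicator=True)

# Countries in the panel with no row in the universities table.
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Country"].unique())
print(unmatched)
```

The country-level count is repeated on every year of that country, which is what the left join on Country alone produces.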

## Population who have completed Tertiary Education:

https://data.oecd.org/eduatt/population-with-tertiary-education.htm

This data set shows the population with tertiary education. Tertiary education includes bachelor's and higher levels of education. The measure is the percentage of the same-age population.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (percent of the population that has completed tertiary education); because only one indicator appears in this data set, the column is redundant. The Measure column specifies how the data was measured; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year the data is from and ranges from 1989 to 2020. The Value column holds the percentage of the population with tertiary education. The last column, Flag Codes, is used to flag a problem with a row; because the whole column is empty in this data set, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We renamed the "Location" column as "Country" and the "Time" column as "Year" to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Population with Tertiary Education (%)" so that it stays identifiable once all the data sets are combined.

We also limit Year to the range 2013 to 2020 (inclusive) to be consistent with the previous data sets and to limit the number of missing values.

In [None]:
completed_tertiary_edu_df = pd.read_csv('completed_tertiary_edu.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Population with Tertiary Education (%)"
        FROM completed_tertiary_edu_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
completed_tertiary_edu_df = duckdb.sql(query).df()
completed_tertiary_edu_df

In [None]:
completed_tertiary_edu_df.to_csv('clean_completed_tertiary_edu_df.csv', index=False)

The two dataframes (combined_df and completed_tertiary_edu_df) are combined below using a full join, giving one data frame with enrollment rates, GDP, total population, public spending as a percentage of GDP, private spending as a percentage of GDP, household income per capita, average spending on higher education per student, number of universities, and population with tertiary education for each country from 2013 to 2020.

In [None]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN completed_tertiary_edu_df
        ON combined_df.Country = completed_tertiary_edu_df.Country
        AND combined_df.Year = completed_tertiary_edu_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

The table above has a lot of missing values. Some countries lack values for early childhood enrollment, higher education enrollment, GDP, or other variables in some years, so a row may exist in only one of the two joined data frames. For those rows, the Country and Year columns coming from the other data frame are filled with NaN. To fix this, we fill Country from Country_2 and Year from Year_2 (and vice versa) so that both pairs of columns match. We then drop the Country_2 and Year_2 columns because they are redundant, and convert the float values in Year to integers.

In [None]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df
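
The same fill, drop, and cast steps are repeated after every full join, so they could be wrapped in one helper. The sketch below assumes the duplicate key columns are always named Country_2 and Year_2, as in the cells above; `coalesce_join_keys` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def coalesce_join_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Fill Country/Year from Country_2/Year_2, drop the duplicates, cast Year to int."""
    out = df.copy()
    out["Country"] = out["Country"].fillna(out["Country_2"])
    out["Year"] = out["Year"].fillna(out["Year_2"])
    out = out.drop(columns=["Country_2", "Year_2"])
    out["Year"] = out["Year"].astype(int)
    return out

# Toy frame shaped like the output of a full join (values are made up).
joined = pd.DataFrame({
    "Country": ["AUS", None],
    "Year": [2013.0, np.nan],
    "Country_2": [None, "BEL"],
    "Year_2": [np.nan, 2014.0],
})
fixed = coalesce_join_keys(joined)
print(fixed)
```

Filling Country_2 and Year_2 is skipped in this sketch because both columns are dropped right afterwards.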

Some rows in the combined data frame above have the Country value OECD, which represents the average across all the countries. We drop those rows because not all of the data frames we joined include this value.

In [None]:
combined_df = combined_df[combined_df['Country'] != 'OECD']
combined_df
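
Before saving, a quick sanity check can confirm that each country and year pair appears only once and that the OECD aggregate is gone. This is a minimal sketch on a toy frame with made-up values, standing in for the final combined_df.

```python
import pandas as pd

# Toy stand-in for the final combined_df (values are made up).
final = pd.DataFrame({
    "Country": ["AUS", "AUS", "AUT"],
    "Year": [2013, 2014, 2013],
})

# Each (Country, Year) pair should appear at most once.
duplicate_keys = final.duplicated(subset=["Country", "Year"]).sum()

# The OECD aggregate should no longer be present.
has_oecd_row = (final["Country"] == "OECD").any()

print(duplicate_keys, has_oecd_row)
```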

In [None]:
combined_df.to_csv('combined_data.csv', index=False)