# Research Question: 

Question: Can we predict the number of people who attain their bachelor's in an OECD country? 

In this assignment, we want to predict the number of people in OECD countrys that complete their Bachelor's degree. Facors like the country's GDP, ratio of people enrolled in primary education, secondary education, and tertiary education, the population of the country, how much the government spends on higher education, how much households spend on higher education, number of public universities, number of private universities, average cost of higher education, household income, and the year are all included as variables in this model. We will train a multivariate regression to see if we can reliably predict the number of people graduating with their bachelors.

Our inspiration for this project stemmed from https://icfdn.org/our-impact/education/. In this website, we found out that "Just one extra year of schooling can increase an individual’s earnings by up to 10%, and can raise the region’s average annual gross domestic product (GDP) growth by 0.37%." Because of this we wanted to explore how factors of education in a country can impact a country's GDP. 

We decided to only evaluate countries that are part of the Organization for Economic Co-operation and Development (OECD). There are 37 countries that are part of this organization that collaborate to develop policy standards and economic growth. We chose countries that are part of the OECD to evalvuate on because they account for three-fifths of the world's GDP, three-quarters of world trade, half of the world's energy consumption, and 18 percent of the world's population. Because these 37 countries account for a huge part of a country's GDP, we decided that this group of countries would be easier to evaluate compared attaining data from all 197 countries in the world.  https://www.state.gov/the-organization-for-economic-co-operation-and-development-oecd/#:~:text=and%20Development%20(OECD)-,The%20Organization%20for%20Economic%20Cooperation%20and%20Development%20(OECD),to%20promote%20sustainable%20economic%20growth.


In [353]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb 

# Data Collection/Cleaning: 

## GDP

https://data.oecd.org/gdp/gross-domestic-product-gdp.htm

This data set shows the nominal Gross Domestic Product(GDP) per capita of OEPD countries in US dollars from 1960 to 2022. The Gross domestic product is the standard measure of the value added created through the production of goods and servives in a country during a certain period. While the GDP per capita can be found by diving the total GDP by its population. It also measure the inclome earned from that population, the total amount spend on a final goods and services. 

The original data set has 8 different columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, Flag Codes). The Location specifies the country. The indicator specifies what is being measured. This column is also not necessary because there is only one value being measured in this specific data set, it is redundant. The Measure column just specifies how the data was measured, in this case it was measured in US dollars. Since the Measure is the same for all the rows, this column isn't necessary. The frequency column is similar to the Measure column and is also not necessary for this case. The Time column indicates what year each gdp value is from from, ranges from years **1960 to 2022**. The Value column displays the nominal gdp value. And the last column, Flag Codes, is used for indicating something wrong for each row, in this case, because the whole column does not contain any values, this column is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "GDP in US Dollars" in order to specify the value of it, which is needed when we combine all the data sets. 

In [354]:
gdp_df = pd.read_csv('gdp.csv')
query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "GDP per Capita"
        FROM gdp_df
        """
gdp_df = duckdb.sql(query).df()
gdp_df

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.627589
1,AUS,1961,2383.188902
2,AUS,1962,2577.332834
3,AUS,1963,2752.620592
4,AUS,1964,2902.590472
...,...,...,...
1786,CRI,2018,21312.713380
1787,CRI,2019,22739.241909
1788,CRI,2020,21755.528069
1789,CRI,2021,22612.375289


In [355]:
gdp_df.to_csv('clean_gdp.csv', index=False)

In [356]:
combined_df = gdp_df
combined_df 

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.627589
1,AUS,1961,2383.188902
2,AUS,1962,2577.332834
3,AUS,1963,2752.620592
4,AUS,1964,2902.590472
...,...,...,...
1786,CRI,2018,21312.713380
1787,CRI,2019,22739.241909
1788,CRI,2020,21755.528069
1789,CRI,2021,22612.375289


In the table, above there are a lot of missing values. Some countries might not have enrollment rates for either early childhood education or higher education or gdp values and because of this a lot of Column values for Country, and Year are filled with Nan. To fix this, we have to make all the values in the Country and Country_2 column the same, and Year and Year_2 values the same. After doing this, we need to drop the Country_2 and Year_2 columns, because they are redundant, change all the float values in Year to be integers, and round the GDP values to two decimals to represent US dollars. 
**REVISIT THIS DESCRIPTION**

In [357]:
#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

#rounding the GDP values to 2 decimals
combined_df['GDP per Capita'] = combined_df['GDP per Capita'].round(2)

combined_df

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.63
1,AUS,1961,2383.19
2,AUS,1962,2577.33
3,AUS,1963,2752.62
4,AUS,1964,2902.59
...,...,...,...
1786,CRI,2018,21312.71
1787,CRI,2019,22739.24
1788,CRI,2020,21755.53
1789,CRI,2021,22612.38


## Population:

https://data.oecd.org/pop/population.htm

This data set shows the total population of OECD country in millions of people from 1950 to 2022. The total population includes the following: national armed forces stationed abroad; merchant seamen at sea; diplomatic personnel located abroad; civilian aliens resident in the country; displaced persons resident in the country. However, it excludes the following: foreign armed forces stationed in the country; foreign diplomatic personnel located in the country; civilian aliens temporarily in the country.

The original data set has 8 different columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, Flag Codes). The Location specifies the country. The indicator specifies what is being measured (population). This column is also not necessary because there is only one value being measured in this specific data set, it is redundant. The Measure column just specifies how the data was measured, in this case it is the population per millions of people. Since the Measure is the same for all the rows, this column isn't necessary. The frequency column is similar to the Measure column and is also not necessary for this case. The Time column indicates what year each gdp value is from from, ranges from years **1950 to 2022**. The Value column displays the total population per millions of people. And the last column, Flag Codes, is used for indicating something wrong for each row, in this case, because the whole column does not contain any values, this column is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Population" in order to specify the value of it, which is needed when we combine all the data sets. 

In [358]:
population_df = pd.read_csv('population.csv')
population_df

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Population(Million)"
        FROM population_df
        """
population_df = duckdb.sql(query).df()
population_df

Unnamed: 0,Country,Year,Population(Million)
0,AUS,1950,8.178700
1,AUS,1951,8.421700
2,AUS,1952,8.636500
3,AUS,1953,8.815300
4,AUS,1954,8.986500
...,...,...,...
2842,LTU,2018,2.801543
2843,LTU,2019,2.794137
2844,LTU,2020,2.794885
2845,LTU,2021,2.808380


In [359]:
population_df.to_csv('clean_population.csv', index=False)

The two dataframes (combined_df and population_df) are combined below using a left ioin, so that we will have a cohesive data frame showing enrollment rates, gdp, and total population of each country and year. By using the left join, it allows us to filter through the new data set we are looking to add to our dataset to reduce the number of NAs that may form when there are data for years that is not in our combined_df.

In [360]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN population_df
        ON combined_df.Country = population_df.Country
        AND combined_df.Year = population_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Country_2,Year_2,Population(Million)
0,AUS,1960,2412.63,AUS,1960,10.275000
1,AUS,1961,2383.19,AUS,1961,10.508200
2,AUS,1962,2577.33,AUS,1962,10.700500
3,AUS,1963,2752.62,AUS,1963,10.906900
4,AUS,1964,2902.59,AUS,1964,11.121600
...,...,...,...,...,...,...
1786,SVN,2022,48361.94,SVN,2022,2.108732
1787,LVA,2012,21297.47,LVA,2012,2.034324
1788,LVA,2014,23810.22,LVA,2014,1.993785
1789,PRT,1977,4136.35,PRT,1977,9.455673


In the table, above there are a lot of missing values. Some countries might not have enrollment rates for either early childhood education or higher education or gdp values and because of this a lot of Column values for Country, and Year are filled with Nan. To fix this, we have to make all the values in the Country and Country_2 column the same, and Year and Year_2 values the same. After doing this, we need to drop the Country_2 and Year_2 columns, because they are redundant and change all the float values in Year to be integers, 

In [361]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million)
0,AUS,1960,2412.63,10.275000
1,AUS,1961,2383.19,10.508200
2,AUS,1962,2577.33,10.700500
3,AUS,1963,2752.62,10.906900
4,AUS,1964,2902.59,11.121600
...,...,...,...,...
1786,SVN,2022,48361.94,2.108732
1787,LVA,2012,21297.47,2.034324
1788,LVA,2014,23810.22,1.993785
1789,PRT,1977,4136.35,9.455673


## Private Spending on Education:

https://data.oecd.org/eduresource/private-spending-on-education.htm#indicator-chart

This data shows private spending on education as a percentage of GDP for tertiary education. Private spending on education refers to expenditure funded by private resources which are households and other private entities. It includes all direct expenditure on education institions, and net of public subsidies.

The original data set has 8 different columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, Flag Codes). The Location specifies the country. The indicator specifies what is being measured (private spending on education). This column is also not necessary because there is only one value being measured in this specific data set, it is redundant. The Measure column just specifies how the data was measured, in this case it is the public spending as a percentage of gdp. Since the Measure is the same for all the rows, this column isn't necessary. The frequency column is similar to the Measure column and is also not necessary for this case. The Time column indicates what year each gdp value is from from, ranges from years **2000 to 2020**. The Value column displays the percentage of gdp that is used for public education. And the last column, Flag Codes, is used for indicating something wrong for each row, in this case, because the whole column does not contain any values, this column is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Private Spending on Education (%)" in order to specify the value of it, which is needed when we combine all the data sets.

In [362]:
private_spending_df = pd.read_csv('private_spending_on_education.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Private Spending on Education (%)"
        FROM private_spending_df
        """
private_spending_df = duckdb.sql(query).df()
private_spending_df

Unnamed: 0,Country,Year,Private Spending on Education (%)
0,AUS,2000,0.748428
1,AUS,2005,0.828712
2,AUS,2008,0.844496
3,AUS,2009,0.882858
4,AUS,2010,0.896708
...,...,...,...
507,LTU,2016,0.338386
508,LTU,2017,0.310164
509,LTU,2018,0.304289
510,LTU,2019,0.298019


In [363]:
private_spending_df.to_csv('clean_private_education_df.csv', index=False)

The two dataframes (combined_df and private_spending_df) are combined below using a left join, so that we will have a cohesive data frame showing enrollment rates, gdp, and total population, and public spending as a percentage of GDP, and private spending as a percentage of GDP of each country from 2013-2020.

In [364]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN private_spending_df
        ON combined_df.Country = private_spending_df.Country
        AND combined_df.Year = private_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Country_2,Year_2,Private Spending on Education (%)
0,AUS,2000,28312.86,19.028802,AUS,2000.0,0.748428
1,AUS,2005,35659.13,20.176844,AUS,2005.0,0.828712
2,AUS,2008,40130.34,21.249199,AUS,2008.0,0.844496
3,AUS,2009,41672.92,21.691653,AUS,2009.0,0.882858
4,AUS,2010,42816.43,22.031750,AUS,2010.0,0.896708
...,...,...,...,...,...,...,...
1786,USA,2008,48498.45,304.093966,USA,2008.0,1.490560
1787,EST,2019,39068.37,1.326855,EST,2019.0,0.220020
1788,USA,2011,49951.91,311.583481,USA,2011.0,1.661183
1789,SVK,1995,8695.70,5.363676,SVK,1995.0,0.033159


In the table, above there are a lot of missing values. Some countries might not have enrollment rates for either early childhood education or higher education or gdp values, or etc. and because of this a lot of Column values for Country, and Year are filled with Nan. To fix this, we have to make all the values in the Country and Country_2 column the same, and Year and Year_2 values the same. After doing this, we need to drop the Country_2 and Year_2 columns, because they are redundant and change all the float values in Year to be integers,

In [365]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%)
0,AUS,2000,28312.86,19.028802,0.748428
1,AUS,2005,35659.13,20.176844,0.828712
2,AUS,2008,40130.34,21.249199,0.844496
3,AUS,2009,41672.92,21.691653,0.882858
4,AUS,2010,42816.43,22.031750,0.896708
...,...,...,...,...,...
1786,USA,2008,48498.45,304.093966,1.490560
1787,EST,2019,39068.37,1.326855,0.220020
1788,USA,2011,49951.91,311.583481,1.661183
1789,SVK,1995,8695.70,5.363676,0.033159


## Expenditures on Education as a Percent of the GDP

https://databank.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?id=c755d342&report_name=EdStats_Indicators_Report&populartype=series

This data set shows the government expenditures on education as a percentage of the GDP from **1980 to 2019** and can provide insight about the impact of government support on number of degrees awarded. 

The orginal data set has all the countries and all the years since 1960. However, many of the entries are missing values. When downloaded, the countries column did not have a header; to ease manipulation of this data set, I renamed the first column header to "Country". 

Since the years are the headers of the columns, which can make it difficult to match enteries during queries to join two dataframes, the years are melted into one single column named "Year". 

In addition, since the other dataframes record countries as the capital abbreviations of each, to keep the country names consistant in preparation for the final merge into a dataframe, we mapped the country names to the abbreviations and modified the "Country" column.

Upon further investigation, we realized the last column is empty and does not contain any data. Therefore, we updated the data frame. In addition, datasets downloaded from this website filled in empty cells with "..", which lead to issues when merging. To resolve this issue, we did a map to replace all instances of ".." with NA. Also, there was an empty column between 1989 and 1990. When the csv was imported to Python, the header was changed from an empty cell value to 'Unnamed: 11' and we removed this empty column to avoid merging issues. 

In [366]:
gdpPerEdu_df = pd.read_csv("GovernmentExp_EduofGdp.csv")
gdpPerEdu_df = gdpPerEdu_df.rename(columns ={" ": "Country"})

# Removing empty column and OECD Member Data
gdpPerEdu_df = gdpPerEdu_df.drop(columns=['Unnamed: 11'])
gdpPerEdu_df = gdpPerEdu_df.drop([27])

# Melting years into a single column
year_names = gdpPerEdu_df.columns[1:]
gdpPerEdu_df = gdpPerEdu_df.melt(id_vars = ["Country"],
                            var_name = "Year",
                            value_vars = year_names,
                            value_name = "Percent_GDP_On_Edu")
gdpPerEdu_df['Year'] = gdpPerEdu_df['Year'].astype(int)

When we were mapping the countries, we decided not to map the average of the OECD members because the different economic and political structure may skew the average data. Also, may of our other data did not contain this kind of value, therefore, to stay consistent, we decided to omit this. 

In [367]:
#Mapping the country names with abbreviations to be consistent with how countries
#are represented.

country_map = {"Australia":"AUS","Austria":"AUT","Belgium":"BEL",
            "Canada":"CAN", "Chile":"CHL", "Colombia":"COL",
            "Costa Rica":"CRI","Czechia":"CZE", 
            "Denmark":"DNK", "Estonia": "EST", 
            "Finland":"FIN", "France":"FRA", "Germany":"DEU", 
            "Greece":"GRC", "Hungary":"HUN", "Iceland":"ISL", 
            "Ireland":"IRL","Israel":"ISR", "Italy":"ITA", 
            "Japan":"JPN", "Korea, Rep.":"KOR", "Korea":"KOR", 
            "Latvia":"LVA", "Lithuania":"LTU", "Luxembourg":"LUX", 
            "Mexico":"MEX", "Netherlands":"NLD", "New Zealand":"NZL",
            "Norway":"NOR", "Poland":"POL", "Portugal":"PRT", 
            "Slovak Republic":"SVK","Slovenia":"SVN","Spain":"ESP",
            "Sweden":"SWE","Switzerland":"CHE", "Turkiye":"TUR",
            "United Kingdom":"GBR","United States":"USA"}

gdpPerEdu_df["Country"] = gdpPerEdu_df["Country"].map(country_map)

# The last row is empty, thus it is necessary to drop it prior to merging
# it with combined_df
gdpPerEdu_df = gdpPerEdu[:-1]
gdpPerEdu_df.replace('..',np.NaN)

Unnamed: 0,Country,Year,Percent_GDP_On_Edu
0,AUS,1980,5.7
1,AUT,1980,5.0
2,BEL,1980,5.3
3,CAN,1980,6.5
4,CHL,1980,4.2
...,...,...,...
1754,CHE,2022,
1755,TUR,2022,
1756,GBR,2022,
1757,USA,2022,


Despite our efforts to remove data with "Unnamed: 11", it did not reflect in gdpPerEdu_df. Therefore, we specified in the query to ignore all instances where the year was 'Unnamed: 11'.

After saving the modified csv, we proceed to merge this new information with the other pieces of data using LEFT JOIN with the combined_df.

In [368]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN gdpPerEdu_df
        ON combined_df.Country = gdpPerEdu_df.Country
        AND combined_df.Year = gdpPerEdu_df.Year
        WHERE gdpPerEdu_df.Year != 'Unnamed: 11';
        """

combined_df = duckdb.sql(query).df()

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu
0,AUS,2000,28312.86,19.028802,0.748428,4.9
1,AUS,2005,35659.13,20.176844,0.828712,4.9
2,AUS,2008,40130.34,21.249199,0.844496,4.6
3,AUS,2009,41672.92,21.691653,0.882858,5.1
4,AUS,2010,42816.43,22.031750,0.896708,5.6
...,...,...,...,...,...,...
1491,ESP,1982,8088.87,37.987108,,..
1492,COL,2022,20841.38,51.682692,,..
1493,ISR,2022,49788.63,9.528600,,..
1494,NLD,1989,17868.50,14.848906,,..


In [369]:
combined_df.replace('..',np.NaN)
combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu
0,AUS,2000,28312.86,19.028802,0.748428,4.9
1,AUS,2005,35659.13,20.176844,0.828712,4.9
2,AUS,2008,40130.34,21.249199,0.844496,4.6
3,AUS,2009,41672.92,21.691653,0.882858,5.1
4,AUS,2010,42816.43,22.031750,0.896708,5.6
...,...,...,...,...,...,...
1491,ESP,1982,8088.87,37.987108,,..
1492,COL,2022,20841.38,51.682692,,..
1493,ISR,2022,49788.63,9.528600,,..
1494,NLD,1989,17868.50,14.848906,,..


## Expenditures on Tertiary Education as a Percent of the GDP

https://databank.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?id=c755d342&report_name=EdStats_Indicators_Report&populartype=series#

This data show government expenditures on tertiary education as a percentage of GDP from **1960 to 2019**. Multiplying this with the GDP can provide insight as to the amount of money provided by the government per year per country. Similar to the previous factor of expenditures on education as a whole, this can help us analyze the impact of government support as people are working to complete a higher education on the amount of degrees awarded per year and country.

The orginal data set has all the countries and all the years since 1960. However, many of the entries are missing values. When downloaded, the countries column did not have a header; to ease manipulation of this data set, I renamed the first column header to "Country". 

Since the years are the headers of the columns, which can make it difficult to match enteries during queries to join two dataframes, the years are melted into one single column named "Year". 

In addition, since the other dataframes record countries as the capital abbreviations of each, to keep the country names consistant in preparation for the final merge into a dataframe, we mapped the country names to the abbreviations and modified the "Country" column.

In [370]:
gdpPerTertEdu_df = pd.read_csv("TertiaryGovExp%GDP.csv")
gdpPerTertEdu_df = gdpPerTertEdu_df.rename(columns ={" ": "Country"})

# Removing empty column and OECD Member Data
gdpPerTertEdu_df = gdpPerTertEdu_df.drop(columns=['Unnamed: 11'])
gdpPerTertEdu_df = gdpPerTertEdu_df.drop([27])

# Melting years into a single column
year_names = gdpPerTertEdu_df.columns[1:]
gdpPerTertEdu_df = gdpPerTertEdu_df.melt(id_vars = ["Country"],
                                     var_name = "Year",
                          value_vars = year_names,
                           value_name = "Percent_GDP_On_TertEdu")

When we were mapping the countries, we decided not to map the average of the OECD members because the different economic and political structure may skew the average data. Also, may of our other data did not contain this kind of value, therefore, to stay consistent, we decided to omit this. 

In [371]:
country_map = {"Australia":"AUS","Austria":"AUT","Belgium":"BEL",
            "Canada":"CAN", "Chile":"CHL", "Colombia":"COL",
            "Costa Rica":"CRI","Czechia":"CZE", 
            "Denmark":"DNK", "Estonia": "EST", 
            "Finland":"FIN", "France":"FRA", "Germany":"DEU", 
            "Greece":"GRC", "Hungary":"HUN", "Iceland":"ISL", 
            "Ireland":"IRL","Israel":"ISR", "Italy":"ITA", 
            "Japan":"JPN", "Korea, Rep.":"KOR", "Korea":"KOR", 
            "Latvia":"LVA", "Lithuania":"LTU", "Luxembourg":"LUX", 
            "Mexico":"MEX", "Netherlands":"NLD", "New Zealand":"NZL",
            "Norway":"NOR", "Poland":"POL", "Portugal":"PRT", 
            "Slovak Republic":"SVK","Slovenia":"SVN","Spain":"ESP",
            "Sweden":"SWE","Switzerland":"CHE", "Turkiye":"TUR",
            "United Kingdom":"GBR","United States":"USA"}

gdpPerTertEdu_df["Country"] = gdpPerTertEdu_df["Country"].map(country_map)

# The last row is empty, thus it is necessary to drop it prior to merging
# it with combined_df
gdpPerTertEdu_df = gdpPerTertEdu_df[:-1]
gdpPerTertEdu_df.replace('..',np.NaN)

Unnamed: 0,Country,Year,Percent_GDP_On_TertEdu
0,AUS,1980,1.3
1,AUT,1980,0.7
2,BEL,1980,1.0
3,CAN,1980,1.9
4,CHL,1980,1.5
...,...,...,...
1671,CHE,2022,
1672,TUR,2022,
1673,GBR,2022,
1674,USA,2022,


In [372]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN gdpPerTertEdu_df
        ON combined_df.Country = gdpPerTertEdu_df.Country
        AND combined_df.Year = gdpPerTertEdu_df.Year
        """

combined_df = duckdb.sql(query).df()

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu
0,AUS,2000,28312.86,19.028802,0.748428,4.9,1.1
1,AUS,2005,35659.13,20.176844,0.828712,4.9,1.1
2,AUS,2008,40130.34,21.249199,0.844496,4.6,1.0
3,AUS,2009,41672.92,21.691653,0.882858,5.1,1.1
4,AUS,2010,42816.43,22.031750,0.896708,5.6,1.2
...,...,...,...,...,...,...,...
1491,CHL,2019,25509.45,19.107216,1.647009,..,..
1492,CZE,1993,12123.46,10.330607,,4.2,0.8
1493,BEL,1997,23732.79,10.181246,,..,..
1494,TUR,1984,5664.87,49.070000,,2.0,0.5


## Household Income per Capita:

https://data.oecd.org/hha/household-disposable-income.htm#indicator-chart

This data set shows the gross household disposable income per capita in 37 OECD countries. Household disposable income is available to households such as wages and salaries, income from self-employment and unincorporated enterprices, income from pensions and other social benefits, and income from financial investments. Gross means that depreciation costs are not subtracted. For gross household disposable income per capita, growth rates (percentage change from previous period) are presented; these are ‘real’ growth rates adjusted to remove the effects of price changes. Information is also presented for gross household disposable income including social transfers, such as health or education provided for free or at reduced prices by governments and not-for-profit organisations. 

The original data set has 8 different columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, Flag Codes). The Location specifies the country. The indicator specifies what is being measured (household disposable income). This column is also not necessary because there is only one value being measured in this specific data set, it is redundant. The Measure column just specifies how the data was measured, in this case it is the household income per capita. Since the Measure is the same for all the rows, this column isn't necessary. The frequency column is similar to the Measure column and is also not necessary for this case. The Time column indicates what year each gdp value is from from, ranges from years 1970 to 2020. The Value column displays the percentage of gdp that is used for public education. And the last column, Flag Codes, is used for indicating something wrong for each row, in this case, because the whole column does not contain any values, this column is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Household Income per Capita" in order to specify the value of it, which is needed when we combine all the data sets.

We also need to limit Years to be in between 2013 to 2020 in order to be consistent with the previous data sets and limit the amount of missing values.

In [373]:
household_income_df = pd.read_csv('household_income.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Household Income per Capita"
        FROM household_income_df;
        """
household_income_df = duckdb.sql(query).df()
household_income_df

Unnamed: 0,Country,Year,Household Income per Capita
0,JPN,2007,24916.381131
1,JPN,2008,25393.938874
2,JPN,2009,25581.218842
3,JPN,2010,26402.021262
4,JPN,2011,27299.673602
...,...,...,...
506,CRI,2016,14675.888631
507,CRI,2017,16130.493739
508,CRI,2018,16619.155338
509,CRI,2019,17161.123623


In [374]:
household_income_df.to_csv('clean_household_income_df.csv', index=False)

The two dataframes (combined_df and household_income_df) are combined below using a Full Join, so that we will have a cohesive data frame showing enrollment rates, gdp, and total population, public spending as a percentage of GDP, private spending as a percentage of GDP, and household income per capita of each country from 2013-2020.

In [375]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN household_income_df
        ON combined_df.Country = household_income_df.Country
        AND combined_df.Year = household_income_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Country_2,Year_2,Household Income per Capita
0,AUS,2008,40130.34,21.249199,0.844496,4.6,1.0,AUS,2008.0,31051.323158
1,AUS,2010,42816.43,22.031750,0.896708,5.6,1.2,AUS,2010.0,32599.998578
2,AUS,2012,43884.64,22.733465,0.881156,4.9,1.2,AUS,2012.0,33934.614031
3,AUS,2014,47606.75,23.475686,1.133321,5.2,1.4,AUS,2014.0,36816.129278
4,AUS,2015,47226.76,23.815995,1.264779,5.3,1.5,AUS,2015.0,37553.196701
...,...,...,...,...,...,...,...,...,...,...
1491,EST,2015,29222.75,1.314608,0.417253,5.1,1.4,EST,2015.0,19321.329148
1492,GBR,2017,46061.35,66.040229,1.386047,5.4,1.4,GBR,2017.0,32417.628654
1493,CAN,2012,42290.88,34.714222,0.923423,..,..,CAN,2012.0,29745.720518
1494,USA,2011,49951.91,311.583481,1.661183,..,..,USA,2011.0,42186.856901


In the table, above there are a lot of missing values. Some countries might not have enrollment rates for either early childhood education or higher education or gdp values, or etc. and because of this a lot of Column values for Country, and Year are filled with Nan. To fix this, we have to make all the values in the Country and Country_2 column the same, and Year and Year_2 values the same. After doing this, we need to drop the Country_2 and Year_2 columns, because they are redundant and change all the float values in Year to be integers.

In [376]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita
0,AUS,2008,40130.34,21.249199,0.844496,4.6,1.0,31051.323158
1,AUS,2010,42816.43,22.031750,0.896708,5.6,1.2,32599.998578
2,AUS,2012,43884.64,22.733465,0.881156,4.9,1.2,33934.614031
3,AUS,2014,47606.75,23.475686,1.133321,5.2,1.4,36816.129278
4,AUS,2015,47226.76,23.815995,1.264779,5.3,1.5,37553.196701
...,...,...,...,...,...,...,...,...
1491,EST,2015,29222.75,1.314608,0.417253,5.1,1.4,19321.329148
1492,GBR,2017,46061.35,66.040229,1.386047,5.4,1.4,32417.628654
1493,CAN,2012,42290.88,34.714222,0.923423,..,..,29745.720518
1494,USA,2011,49951.91,311.583481,1.661183,..,..,42186.856901


## Education Spending:

https://data.oecd.org/eduresource/education-spending.htm#indicator-chart

This data set shows the average amount of education spending that covers expenditure on schools, universities and other public and private educational institutions in 37 OECD countries. Spending includes instruction and ancillary services for students and families provided through educational institutions. Education spending is shown in USD per student.

The original data set has 8 different columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, Flag Codes). The Location specifies the country. The indicator specifies what is being measured (education spending in dollars). This column is also not necessary because there is only one value being measured in this specific data set, it is redundant. The Measure column just specifies how the data was measured, in this case it is the USD per student. Since the Measure is the same for all the rows, this column isn't necessary. The frequency column is similar to the Measure column and is also not necessary for this case. The Time column indicates what year each gdp value is from from, ranges from years 1995 to 2020. The Value column displays the percentage of gdp that is used for public education. And the last column, Flag Codes, is used for indicating something wrong for each row, in this case, because the whole column does not contain any values, this column is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Average Spending on Higher Education (USD/student)" in order to specify the value of it, which is needed when we combine all the data sets.

We also need to limit Years to be in between 2013 to 2020 in order to be consistent with the previous data sets and limit the amount of missing values.

In [377]:
average_spending_df = pd.read_csv('educaion_spending.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Average Spending on Higher Education (USD/student)"
        FROM average_spending_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
average_spending_df = duckdb.sql(query).df()
average_spending_df

Unnamed: 0,Country,Year,Average Spending on Higher Education (USD/student)
0,AUT,2013,16853.000
1,AUT,2014,16867.620
2,AUT,2015,17560.620
3,AUT,2016,18625.000
4,AUT,2017,18974.750
...,...,...,...
285,LTU,2016,7852.196
286,LTU,2017,8412.116
287,LTU,2018,9908.427
288,LTU,2019,11431.870


In [378]:
average_spending_df.to_csv('clean_average_spending_df.csv', index=False)

The two dataframes (combined_df and average_spending_df) are combined below using a Full Join, so that we will have a cohesive data frame showing enrollment rates, gdp, and total population, public spending as a percentage of GDP, private spending as a percentage of GDP, household income per capita, and average spending for higher education per student of each country from 2013-2020.

In [379]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN average_spending_df
        ON combined_df.Country = average_spending_df.Country
        AND combined_df.Year = average_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Country_2,Year_2,Average Spending on Higher Education (USD/student)
0,AUS,2014.0,47606.75,23.475686,1.133321,5.2,1.4,36816.129278,AUS,2014.0,19493.580
1,AUS,2015.0,47226.76,23.815995,1.264779,5.3,1.5,37553.196701,AUS,2015.0,20304.370
2,AUS,2018.0,53025.39,24.966643,1.230631,..,..,39522.008971,AUS,2018.0,20675.540
3,AUS,2019.0,52785.25,25.340217,1.275049,..,..,40000.974957,AUS,2019.0,20664.380
4,AUS,2020.0,55772.74,25.655289,1.195328,..,..,42423.425135,AUS,2020.0,22204.119
...,...,...,...,...,...,...,...,...,...,...,...
1497,,,,,,,,,RUS,2013.0,8709.718
1498,,,,,,,,,RUS,2017.0,7312.862
1499,,,,,,,,,RUS,2015.0,8177.802
1500,,,,,,,,,RUS,2014.0,8925.799


In the table, above there are a lot of missing values. Some countries might not have enrollment rates for either early childhood education or higher education or gdp values, or etc. and because of this a lot of Column values for Country, and Year are filled with Nan. To fix this, we have to make all the values in the Country and Country_2 column the same, and Year and Year_2 values the same. After doing this, we need to drop the Country_2 and Year_2 columns, because they are redundant and change all the float values in Year to be integers.

In [380]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Average Spending on Higher Education (USD/student)
0,AUS,2014,47606.75,23.475686,1.133321,5.2,1.4,36816.129278,19493.580
1,AUS,2015,47226.76,23.815995,1.264779,5.3,1.5,37553.196701,20304.370
2,AUS,2018,53025.39,24.966643,1.230631,..,..,39522.008971,20675.540
3,AUS,2019,52785.25,25.340217,1.275049,..,..,40000.974957,20664.380
4,AUS,2020,55772.74,25.655289,1.195328,..,..,42423.425135,22204.119
...,...,...,...,...,...,...,...,...,...
1497,RUS,2013,,,,,,,8709.718
1498,RUS,2017,,,,,,,7312.862
1499,RUS,2015,,,,,,,8177.802
1500,RUS,2014,,,,,,,8925.799


## Number of Universities:

We weren't able to find a data set that directly listed the amount of universities in each OECD country. Because of this, we had to make a data set on Excel and convert to a csv later.

All the information from the data set was found from this website: https://www.webometrics.info/en/distribution_by_country


In [381]:
num_universities_df = pd.read_csv('number_of_universities.csv')
num_universities_df

Unnamed: 0,Country,Number of Universities
0,AUS,187
1,AUT,84
2,BEL,142
3,CAN,383
4,CHL,130
5,COL,299
6,CRI,68
7,CZE,64
8,DNK,81
9,EST,31


The two dataframes (combined_df and num_universities_df) are combined below using a Left Join, so that we will have a cohesive data frame showing enrollment rates, gdp, and total population, public spending as a percentage of GDP, private spending as a percentage of GDP, household income per capita, average spending for higher education per student, and number of universities for each country from 2013-2020.

In [382]:
combined_df = pd.merge(combined_df, num_universities_df, on='Country', how='left')

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Average Spending on Higher Education (USD/student),Number of Universities
0,AUS,2014,47606.75,23.475686,1.133321,5.2,1.4,36816.129278,19493.580,187.0
1,AUS,2015,47226.76,23.815995,1.264779,5.3,1.5,37553.196701,20304.370,187.0
2,AUS,2018,53025.39,24.966643,1.230631,..,..,39522.008971,20675.540,187.0
3,AUS,2019,52785.25,25.340217,1.275049,..,..,40000.974957,20664.380,187.0
4,AUS,2020,55772.74,25.655289,1.195328,..,..,42423.425135,22204.119,187.0
...,...,...,...,...,...,...,...,...,...,...
1497,RUS,2013,,,,,,,8709.718,
1498,RUS,2017,,,,,,,7312.862,
1499,RUS,2015,,,,,,,8177.802,
1500,RUS,2014,,,,,,,8925.799,


Since some of the rows above in the combined data frame, have values for OECD, which represents the averages across all the countries, we are going to drop those rows, since not all the data frames that we joined in this value.

In [383]:
combined_df = combined_df[combined_df['Country'] != 'OECD']
combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Average Spending on Higher Education (USD/student),Number of Universities
0,AUS,2014,47606.75,23.475686,1.133321,5.2,1.4,36816.129278,19493.580,187.0
1,AUS,2015,47226.76,23.815995,1.264779,5.3,1.5,37553.196701,20304.370,187.0
2,AUS,2018,53025.39,24.966643,1.230631,..,..,39522.008971,20675.540,187.0
3,AUS,2019,52785.25,25.340217,1.275049,..,..,40000.974957,20664.380,187.0
4,AUS,2020,55772.74,25.655289,1.195328,..,..,42423.425135,22204.119,187.0
...,...,...,...,...,...,...,...,...,...,...
1497,RUS,2013,,,,,,,8709.718,
1498,RUS,2017,,,,,,,7312.862,
1499,RUS,2015,,,,,,,8177.802,
1500,RUS,2014,,,,,,,8925.799,


In [384]:
combined_df.to_csv('combined_data.csv', index=False)

In [385]:
students_per_teacher_df = pd.read_csv('students_per_teaching_staff.csv')

In [386]:
students_per_teacher_df.head()

Unnamed: 0,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUT,STUDPERTEACHER,EARLYCHILDEDU,RT,A,2013,12.798,
1,AUT,STUDPERTEACHER,EARLYCHILDEDU,RT,A,2014,12.926,
2,AUT,STUDPERTEACHER,EARLYCHILDEDU,RT,A,2015,12.537,
3,AUT,STUDPERTEACHER,EARLYCHILDEDU,RT,A,2016,12.346,
4,AUT,STUDPERTEACHER,EARLYCHILDEDU,RT,A,2017,12.709,


In [387]:
query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Students per Teaching Staff"
        FROM students_per_teacher_df
        WHERE TIME IN (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020);
        """
students_per_teacher_df = duckdb.sql(query).df()
students_per_teacher_df

Unnamed: 0,Country,Year,Students per Teaching Staff
0,AUT,2013,12.798
1,AUT,2014,12.926
2,AUT,2015,12.537
3,AUT,2016,12.346
4,AUT,2017,12.709
...,...,...,...
200,SVN,2016,8.045
201,SVN,2017,16.580
202,SVN,2018,7.846
203,SVN,2019,16.625


In [388]:
students_per_teacher_df.to_csv('clean_students_per_teacher_df.csv', index=False)

In [389]:
query = """
        SELECT *
        FROM combined_df
        FULL JOIN students_per_teacher_df
        ON combined_df.Country = students_per_teacher_df.Country
        AND combined_df.Year = students_per_teacher_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Average Spending on Higher Education (USD/student),Number of Universities,Country_2,Year_2,Students per Teaching Staff
0,AUT,2016,52665.09,8.739806,0.112024,5.5,1.8,36227.955632,18625.00,84.0,AUT,2016.0,12.346
1,AUT,2017,54188.36,8.795073,0.152499,5.4,1.7,36984.312014,18974.75,84.0,AUT,2017.0,12.709
2,AUT,2019,59716.25,8.877637,0.192745,..,..,40065.855276,21946.23,84.0,AUT,2019.0,11.999
3,CZE,2015,33909.31,10.542942,0.228737,5.8,0.8,22313.393223,10960.24,64.0,CZE,2015.0,13.455
4,CZE,2016,36101.29,10.565284,0.222914,5.6,0.7,23663.665288,10193.64,64.0,CZE,2016.0,13.403
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1497,EST,2017,33867.80,1.317384,0.214046,5.0,1.1,21826.467629,14523.48,31.0,EST,2017.0,8.534
1498,EST,2019,39068.37,1.326855,0.220020,..,..,24916.118163,17243.69,31.0,EST,2019.0,8.230
1499,FRA,2013,39528.47,65.735961,0.305140,..,..,31388.020775,16233.88,625.0,FRA,2013.0,24.181
1500,NLD,2016,52289.40,17.030314,0.498491,5.5,1.7,33258.289148,19873.65,129.0,NLD,2016.0,16.774


In [390]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Private Spending on Education (%),Percent_GDP_On_Edu,Percent_GDP_On_TertEdu,Household Income per Capita,Average Spending on Higher Education (USD/student),Number of Universities,Students per Teaching Staff
0,AUT,2016,52665.09,8.739806,0.112024,5.5,1.8,36227.955632,18625.00,84.0,12.346
1,AUT,2017,54188.36,8.795073,0.152499,5.4,1.7,36984.312014,18974.75,84.0,12.709
2,AUT,2019,59716.25,8.877637,0.192745,..,..,40065.855276,21946.23,84.0,11.999
3,CZE,2015,33909.31,10.542942,0.228737,5.8,0.8,22313.393223,10960.24,64.0,13.455
4,CZE,2016,36101.29,10.565284,0.222914,5.6,0.7,23663.665288,10193.64,64.0,13.403
...,...,...,...,...,...,...,...,...,...,...,...
1497,EST,2017,33867.80,1.317384,0.214046,5.0,1.1,21826.467629,14523.48,31.0,8.534
1498,EST,2019,39068.37,1.326855,0.220020,..,..,24916.118163,17243.69,31.0,8.230
1499,FRA,2013,39528.47,65.735961,0.305140,..,..,31388.020775,16233.88,625.0,24.181
1500,NLD,2016,52289.40,17.030314,0.498491,5.5,1.7,33258.289148,19873.65,129.0,16.774
