# Research Question: 

Question: Can we predict the number of people who attain their bachelor's in an OECD country? 

In this assignment, we want to predict the number of people in OECD countries who complete their bachelor's degree. The model includes the following variables: the country's GDP; the enrollment ratios for primary, secondary, and tertiary education; the country's population; government spending on higher education; household spending on higher education; the number of public universities; the number of private universities; the average cost of higher education; household income; and the year. We will train a multivariate regression to see whether we can reliably predict the number of people graduating with a bachelor's degree.
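As a rough illustration of what fitting a multivariate regression involves, the sketch below solves an ordinary least-squares problem with NumPy. The data are synthetic and the three features and their coefficients are made up for illustration; this is not the project's data or its final model.

```python
import numpy as np

# Synthetic, illustrative data only: 50 rows and 3 hypothetical features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coefs + rng.normal(scale=0.01, size=50)

# Add an intercept column, then solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, coefs = beta[0], beta[1:]
predictions = X_design @ beta
print(intercept, coefs)
```

With real data, each column of `X` would be one of the variables listed above, and the rows would be country and year combinations.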

Our inspiration for this project came from https://icfdn.org/our-impact/education/. On this website, we learned that "Just one extra year of schooling can increase an individual’s earnings by up to 10%, and can raise the region’s average annual gross domestic product (GDP) growth by 0.37%." Because of this, we wanted to explore how education-related factors can affect a country's GDP.

We decided to evaluate only countries that are part of the Organization for Economic Co-operation and Development (OECD). The 37 member countries collaborate to develop policy standards and promote economic growth. We chose OECD countries because they account for three-fifths of the world's GDP, three-quarters of world trade, half of the world's energy consumption, and 18 percent of the world's population. Because these 37 countries account for such a large share of the world's GDP, we decided that this group would be easier to evaluate than obtaining data for all 197 countries in the world. Source: https://www.state.gov/the-organization-for-economic-co-operation-and-development-oecd/#:~:text=and%20Development%20(OECD)-,The%20Organization%20for%20Economic%20Cooperation%20and%20Development%20(OECD),to%20promote%20sustainable%20economic%20growth.


In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb 

# Data Collection/Cleaning: 

### GDP

https://data.oecd.org/gdp/gross-domestic-product-gdp.htm

This data set shows the nominal gross domestic product (GDP) per capita of OECD countries in US dollars from 1960 to 2022. GDP is the standard measure of the value added created through the production of goods and services in a country during a certain period. GDP per capita is found by dividing the total GDP by the country's population. GDP also measures the income earned from that production, or the total amount spent on final goods and services.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured; because only one quantity is measured in this data set, the column is redundant. The Measure column specifies how the data was measured, in this case in US dollars; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year of each GDP value, ranging from **1960 to 2022**. The Value column holds the nominal GDP per capita value. The last column, Flag Codes, marks problems with individual rows; because the whole column is empty here, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "GDP per Capita" so that its meaning is clear when we combine all the data sets.
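The same select-and-rename step can also be written in plain pandas. The sketch below uses a two-row toy frame; the LOCATION, TIME, and Value headers match the query used in this notebook, while the other header spellings and the code values in those columns are assumptions about the export layout.

```python
import pandas as pd

# Toy frame in the assumed OECD export layout; only two rows for illustration
raw = pd.DataFrame({
    "LOCATION": ["AUS", "AUS"],
    "INDICATOR": ["GDP", "GDP"],
    "SUBJECT": ["TOT", "TOT"],
    "MEASURE": ["USD_CAP", "USD_CAP"],
    "FREQUENCY": ["A", "A"],
    "TIME": [1960, 1961],
    "Value": [2412.627589, 2383.188902],
    "Flag Codes": [None, None],
})

# Keep only the informative columns and give them descriptive names
clean = raw[["LOCATION", "TIME", "Value"]].rename(
    columns={"LOCATION": "Country", "TIME": "Year", "Value": "GDP per Capita"}
)
print(clean)
```

Selecting the three columns first means the constant columns never need to be dropped explicitly.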

In [3]:
gdp_df = pd.read_csv('gdp.csv')
query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "GDP per Capita"
        FROM gdp_df
        """
gdp_df = duckdb.sql(query).df()
gdp_df

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.627589
1,AUS,1961,2383.188902
2,AUS,1962,2577.332834
3,AUS,1963,2752.620592
4,AUS,1964,2902.590472
...,...,...,...
1786,CRI,2018,21312.713380
1787,CRI,2019,22739.241909
1788,CRI,2020,21755.528069
1789,CRI,2021,22612.375289


In [4]:
gdp_df.to_csv('clean_gdp.csv', index=False)

In [5]:
# copy so that later changes to combined_df do not also modify gdp_df
combined_df = gdp_df.copy()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.627589
1,AUS,1961,2383.188902
2,AUS,1962,2577.332834
3,AUS,1963,2752.620592
4,AUS,1964,2902.590472
...,...,...,...
1786,CRI,2018,21312.713380
1787,CRI,2019,22739.241909
1788,CRI,2020,21755.528069
1789,CRI,2021,22612.375289


Before joining the other data sets, we make sure that the values in the Year column are integers and round the GDP values to two decimals to represent US dollars.

In [6]:
#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

#rounding the GDP values to 2 decimals
combined_df['GDP per Capita'] = combined_df['GDP per Capita'].round(2)

combined_df

Unnamed: 0,Country,Year,GDP per Capita
0,AUS,1960,2412.63
1,AUS,1961,2383.19
2,AUS,1962,2577.33
3,AUS,1963,2752.62
4,AUS,1964,2902.59
...,...,...,...
1786,CRI,2018,21312.71
1787,CRI,2019,22739.24
1788,CRI,2020,21755.53
1789,CRI,2021,22612.38


### Population

https://data.oecd.org/pop/population.htm

This data set shows the total population of each OECD country in millions of people from 1950 to 2022. The total population includes the following: national armed forces stationed abroad; merchant seamen at sea; diplomatic personnel located abroad; civilian aliens resident in the country; displaced persons resident in the country. However, it excludes the following: foreign armed forces stationed in the country; foreign diplomatic personnel located in the country; civilian aliens temporarily in the country.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (population); because only one quantity is measured in this data set, the column is redundant. The Measure column specifies how the data was measured, in this case in millions of people; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year of each population value, ranging from **1950 to 2022**. The Value column holds the total population in millions of people. The last column, Flag Codes, marks problems with individual rows; because the whole column is empty here, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Population(Million)" so that its meaning and unit are clear when we combine all the data sets.

In [7]:
population_df = pd.read_csv('population.csv')
population_df

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Population(Million)"
        FROM population_df
        """
population_df = duckdb.sql(query).df()
population_df

Unnamed: 0,Country,Year,Population(Million)
0,AUS,1950,8.178700
1,AUS,1951,8.421700
2,AUS,1952,8.636500
3,AUS,1953,8.815300
4,AUS,1954,8.986500
...,...,...,...
2842,LTU,2018,2.801543
2843,LTU,2019,2.794137
2844,LTU,2020,2.794885
2845,LTU,2021,2.808380


In [8]:
population_df.to_csv('clean_population.csv', index=False)

The two dataframes (combined_df and population_df) are combined below using a left join, so that we will have a cohesive data frame showing the GDP per capita and total population of each country and year. The left join keeps only the country and year combinations already in combined_df, which reduces the number of NAs that would otherwise appear for years that are in the new data set but not in combined_df.
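To make the left-join behavior concrete, here is a small pandas sketch on toy rows (a handful of values taken from the tables in this notebook). It uses `DataFrame.merge` rather than the DuckDB query, purely for illustration.

```python
import pandas as pd

# Left table: the rows we want to keep
left = pd.DataFrame({"Country": ["AUS", "AUS"], "Year": [1960, 1961],
                     "GDP per Capita": [2412.63, 2383.19]})

# Right table: has an extra year (1950) and lacks 1961
right = pd.DataFrame({"Country": ["AUS", "AUS"], "Year": [1950, 1960],
                      "Population(Million)": [8.1787, 10.275]})

# Every left row survives; unmatched left rows get NaN, and
# right-only rows (1950) are not added
joined = left.merge(right, on=["Country", "Year"], how="left")
print(joined)
```

Merging on named key columns also avoids the duplicate key columns that `SELECT *` produces in the SQL version.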

In [9]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN population_df
        ON combined_df.Country = population_df.Country
        AND combined_df.Year = population_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Country_2,Year_2,Population(Million)
0,AUS,1960,2412.63,AUS,1960,10.275000
1,AUS,1961,2383.19,AUS,1961,10.508200
2,AUS,1962,2577.33,AUS,1962,10.700500
3,AUS,1963,2752.62,AUS,1963,10.906900
4,AUS,1964,2902.59,AUS,1964,11.121600
...,...,...,...,...,...,...
1786,SVN,2022,48361.94,SVN,2022,2.108732
1787,LVA,2012,21297.47,LVA,2012,2.034324
1788,LVA,2014,23810.22,LVA,2014,1.993785
1789,PRT,1977,4136.35,PRT,1977,9.455673


The joined table above contains duplicate key columns, Country_2 and Year_2, that come from population_df. Where a country and year in combined_df has no match in population_df, these columns are filled with NaN. To fix this, we make the values in the Country and Country_2 columns the same, and likewise for Year and Year_2. After doing this, we drop the Country_2 and Year_2 columns because they are redundant, and change any float values in Year to integers.
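The cleanup described here can be sketched on a toy join result. The two rows below are illustrative; the second one stands for a row that found no match on the right-hand side.

```python
import numpy as np
import pandas as pd

# Toy join result; the second row had no match on the right-hand side
joined = pd.DataFrame({
    "Country": ["AUS", "LVA"],
    "Year": [1960, 2012],
    "Country_2": ["AUS", None],
    "Year_2": [1960.0, np.nan],
})

# A column holding NaN cannot keep a plain integer dtype, so Year_2 is float
print(joined["Year_2"].dtype)

# Fill any gaps in the key from its duplicate, cast to int, drop the duplicates
joined["Year"] = joined["Year"].fillna(joined["Year_2"]).astype(int)
joined = joined.drop(columns=["Country_2", "Year_2"])
print(joined)
```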

In [10]:
# Fill missing values in Country with values from Country_2
combined_df['Country'] = combined_df['Country'].fillna(combined_df['Country_2'])

# Fill missing values in Country_2 with values from Country
combined_df['Country_2'] = combined_df['Country_2'].fillna(combined_df['Country'])

# Fill missing values in Year with values from Year_2
combined_df['Year'] = combined_df['Year'].fillna(combined_df['Year_2'])

# Fill missing values in Year_2 with values from Year
combined_df['Year_2'] = combined_df['Year_2'].fillna(combined_df['Year'])

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million)
0,AUS,1960,2412.63,10.275000
1,AUS,1961,2383.19,10.508200
2,AUS,1962,2577.33,10.700500
3,AUS,1963,2752.62,10.906900
4,AUS,1964,2902.59,11.121600
...,...,...,...,...
1786,SVN,2022,48361.94,2.108732
1787,LVA,2012,21297.47,2.034324
1788,LVA,2014,23810.22,1.993785
1789,PRT,1977,4136.35,9.455673


### Education Spending on Higher Education 

https://data.oecd.org/eduresource/education-spending.htm#indicator-chart

This data set shows the average amount of education spending that covers expenditure on schools, universities and other public and private educational institutions in 37 OECD countries. Spending includes instruction and ancillary services for students and families provided through educational institutions. Education spending is shown in USD per student.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (education spending in dollars); because only one quantity is measured in this data set, the column is redundant. The Measure column specifies how the data was measured, in this case in USD per student; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year of each spending value, ranging from **1995 to 2020**. The Value column holds the education spending in USD per student. The last column, Flag Codes, marks problems with individual rows; because the whole column is empty here, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Average Spending on Higher Education (USD/student)" in order to specify the value of it, which is needed when we combine all the data sets.

In [11]:
average_spending_df = pd.read_csv('education_spending.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Average Spending on Higher Education (USD/student)"
        FROM average_spending_df
        ORDER BY Year
        """
average_spending_df = duckdb.sql(query).df()
average_spending_df

Unnamed: 0,Country,Year,Average Spending on Higher Education (USD/student)
0,CZE,1995,7846.0600
1,HUN,1995,6369.8950
2,CHL,1995,4452.8050
3,FIN,1995,9831.3360
4,USA,1995,15696.5100
...,...,...,...
490,TUR,2020,9287.7930
491,COL,2020,4980.6108
492,LVA,2020,13043.3500
493,CRI,2020,15424.3000


In [12]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN average_spending_df
        ON combined_df.Country = average_spending_df.Country
        AND combined_df.Year = average_spending_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Country_2,Year_2,Average Spending on Higher Education (USD/student)
0,AUS,2000,28312.86,19.028802,AUS,2000.0,12500.200
1,AUS,2005,35659.13,20.176844,AUS,2005.0,14171.660
2,AUS,2008,40130.34,21.249199,AUS,2008.0,15768.220
3,AUS,2009,41672.92,21.691653,AUS,2009.0,16589.220
4,AUS,2010,42816.43,22.031750,AUS,2010.0,16300.980
...,...,...,...,...,...,...,...
1786,USA,2008,48498.45,304.093966,USA,2008.0,26949.290
1787,EST,2019,39068.37,1.326855,EST,2019.0,17243.690
1788,USA,2011,49951.91,311.583481,USA,2011.0,26202.940
1789,SVK,1995,8695.70,5.363676,SVK,1995.0,4851.865


The join adds duplicate Country_2 and Year_2 columns. Rows of combined_df that have no matching spending value get NaN in these columns and in the spending column, which is also why Year_2 is stored as floats. Since Country_2 and Year_2 are redundant, we drop them and make sure the values in Year are integers.

In [13]:
#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student)
0,AUS,2000,28312.86,19.028802,12500.200
1,AUS,2005,35659.13,20.176844,14171.660
2,AUS,2008,40130.34,21.249199,15768.220
3,AUS,2009,41672.92,21.691653,16589.220
4,AUS,2010,42816.43,22.031750,16300.980
...,...,...,...,...,...
1786,USA,2008,48498.45,304.093966,26949.290
1787,EST,2019,39068.37,1.326855,17243.690
1788,USA,2011,49951.91,311.583481,26202.940
1789,SVK,1995,8695.70,5.363676,4851.865


### Expenditures on Education as a Percent of the GDP

https://data.oecd.org/eduresource/education-spending.htm#indicator-chart

This data set shows total government expenditure on education as a percentage of GDP. Unlike the OECD data sets above, it covers countries worldwide (the output below runs from AFG to ZWE), and some years are missing for some countries.

From the original data set, we keep three columns: Code, Year, and "Historical and more recent expenditure estimates". We renamed the "Code" column as "Country" in order to be consistent with the other data frames, and renamed the expenditure column as "Government Expenditure On Education (%)" so that its meaning is clear when we combine all the data sets.

In [14]:
government_expenditure_df = pd.read_csv('total-government-expenditure-on-education-gdp.csv')

query = """
        SELECT 
            Code AS Country,
            Year,
            "Historical and more recent expenditure estimates" AS "Government Expenditure On Education (%)"
        FROM government_expenditure_df
        """
    
government_expenditure_df = duckdb.sql(query).df()
government_expenditure_df


Unnamed: 0,Country,Year,Government Expenditure On Education (%)
0,AFG,1971,1.16036
1,AFG,1972,1.11718
2,AFG,1973,1.42788
3,AFG,1975,1.30332
4,AFG,1979,1.73981
...,...,...,...
5174,ZWE,2014,6.13835
5175,ZWE,2015,5.81279
5176,ZWE,2016,5.47262
5177,ZWE,2017,5.81878


Joining this table to combined_df again produces duplicate Country_2 and Year_2 columns for the country code and year, so we drop them because they are redundant.

In [15]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN government_expenditure_df
        ON combined_df.Country = government_expenditure_df.Country
        AND combined_df.Year = government_expenditure_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df.head()

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%)
0,AUS,1960,2412.63,10.275000,,1.40000
1,AUS,1978,8553.93,14.359255,,5.99879
2,AUS,1979,9456.68,14.515729,,5.88711
3,AUS,1980,10478.42,14.695356,,5.64446
4,AUS,1982,11988.84,15.184247,,5.47011
...,...,...,...,...,...,...
1786,USA,1973,6725.41,211.908788,,
1787,USA,1982,14399.35,231.664458,,
1788,USA,1990,23835.32,249.622814,,
1789,CHL,2021,28070.41,19.678363,,


### Expenditures on Tertiary Education as a Percent of the GDP

https://databank.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?id=c755d342&report_name=EdStats_Indicators_Report&populartype=series#

This data set shows government expenditure on tertiary education as a percentage of GDP from **1960 to 2019**. Multiplying this percentage by the GDP gives the amount of money provided by the government per year per country. Similar to the previous factor of expenditure on education as a whole, this can help us analyze how government support for people completing higher education affects the number of degrees awarded per year and country.
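As a worked example of that multiplication, the sketch below uses the AUS 1980 values that appear in the joined table in this notebook. Because the GDP column here is per capita, the product is spending per person; scaling by population is one way to approximate a national total.

```python
# AUS, 1980, as shown in the joined table in this notebook
gdp_per_capita = 10478.42       # USD per person
population_million = 14.695356  # millions of people
tertiary_pct_gdp = 1.3          # percent of GDP

# Percent of GDP times GDP per capita gives spending per person
spending_per_capita = gdp_per_capita * tertiary_pct_gdp / 100

# Scale by population to approximate the national total in USD
total_spending = spending_per_capita * population_million * 1_000_000

print(round(spending_per_capita, 2), round(total_spending / 1e9, 2))
```

This works out to roughly 136 USD per person, or about 2 billion USD in total.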

The original data set has all the countries and all the years since 1960. However, many of the entries are missing values. When downloaded, the countries column did not have a header; to ease manipulation of this data set, we renamed the first column header to "Country".

Since the years are the headers of the columns, which makes it difficult to match entries when joining two dataframes, the years are melted into a single column named "Year".

In addition, the other dataframes record countries as three-letter uppercase codes. To keep the country names consistent in preparation for the final merge into one dataframe, we mapped the country names to these codes and modified the "Country" column.
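The melt and mapping steps can be sketched on a toy wide table. The 1980 values match the output shown in this notebook; the 1981 column and the column name "Spending (% GDP)" are made up for illustration.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year
wide = pd.DataFrame({
    "Country": ["Australia", "Austria"],
    "1980": [1.3, 0.7],
    "1981": [1.2, 0.8],  # illustrative values
})

# Melt the year columns into a single "Year" column
long_df = wide.melt(id_vars=["Country"], var_name="Year",
                    value_name="Spending (% GDP)")

# Year labels come from column headers, so they are strings until converted
long_df["Year"] = long_df["Year"].astype(int)

# Map full names to three-letter codes; names missing from the map become NaN
long_df["Country"] = long_df["Country"].map({"Australia": "AUS", "Austria": "AUT"})
print(long_df)
```

Converting "Year" to integers up front keeps the join keys the same type on both sides.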

In [16]:
gdpPerTertEdu_df = pd.read_csv("TertiaryGovExp%GDP.csv")
gdpPerTertEdu_df = gdpPerTertEdu_df.rename(columns ={" ": "Country"})

# Removing empty column and OECD Member Data
gdpPerTertEdu_df = gdpPerTertEdu_df.drop(columns=['Unnamed: 11'])
gdpPerTertEdu_df = gdpPerTertEdu_df.drop([27])

# Melting years into a single column
year_names = gdpPerTertEdu_df.columns[1:]
gdpPerTertEdu_df = gdpPerTertEdu_df.melt(
    id_vars=["Country"],
    value_vars=year_names,
    var_name="Year",
    value_name="Government_Spending_Teritary (% GDP)",
)

When we were mapping the countries, we decided not to map the average of the OECD members because the members' different economic and political structures may skew the average. Also, many of our other data sets did not contain this kind of value, so to stay consistent we decided to omit it.

In [17]:
country_map = {"Australia":"AUS","Austria":"AUT","Belgium":"BEL",
            "Canada":"CAN", "Chile":"CHL", "Colombia":"COL",
            "Costa Rica":"CRI","Czechia":"CZE", 
            "Denmark":"DNK", "Estonia": "EST", 
            "Finland":"FIN", "France":"FRA", "Germany":"DEU", 
            "Greece":"GRC", "Hungary":"HUN", "Iceland":"ISL", 
            "Ireland":"IRL","Israel":"ISR", "Italy":"ITA", 
            "Japan":"JPN", "Korea, Rep.":"KOR", "Korea":"KOR", 
            "Latvia":"LVA", "Lithuania":"LTU", "Luxembourg":"LUX", 
            "Mexico":"MEX", "Netherlands":"NLD", "New Zealand":"NZL",
            "Norway":"NOR", "Poland":"POL", "Portugal":"PRT", 
            "Slovak Republic":"SVK","Slovenia":"SVN","Spain":"ESP",
            "Sweden":"SWE","Switzerland":"CHE", "Turkiye":"TUR",
            "United Kingdom":"GBR","United States":"USA"}

gdpPerTertEdu_df["Country"] = gdpPerTertEdu_df["Country"].map(country_map)

# The last row is empty, thus it is necessary to drop it prior to merging
# it with combined_df
gdpPerTertEdu_df = gdpPerTertEdu_df[:-1]
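# Display the table with the '..' placeholders shown as NaN. replace() returns
# a new DataFrame, so gdpPerTertEdu_df itself still contains '..' afterwards
gdpPerTertEdu_df.replace('..', np.nan)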

Unnamed: 0,Country,Year,Government_Spending_Teritary (% GDP)
0,AUS,1980,1.3
1,AUT,1980,0.7
2,BEL,1980,1.0
3,CAN,1980,1.9
4,CHL,1980,1.5
...,...,...,...
1671,CHE,2022,
1672,TUR,2022,
1673,GBR,2022,
1674,USA,2022,
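Because `replace` returns a new object, the ".." placeholder stays in the stored column unless the result is assigned back, and the column remains text rather than numbers. A minimal sketch of one way to handle this, on a toy column with a hypothetical name "Spending":

```python
import numpy as np
import pandas as pd

# Toy column mixing numeric strings with the ".." missing-value marker
df = pd.DataFrame({"Country": ["AUS", "USA"], "Spending": ["1.3", ".."]})

# replace() returns a new object, so the result has to be assigned back
df["Spending"] = df["Spending"].replace("..", np.nan)

# Convert the remaining strings to floats so the column can be used in a model
df["Spending"] = pd.to_numeric(df["Spending"])
print(df.dtypes)
```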


In [18]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN gdpPerTertEdu_df
        ON combined_df.Country = gdpPerTertEdu_df.Country
        AND combined_df.Year = gdpPerTertEdu_df.Year
        """

combined_df = duckdb.sql(query).df()

#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP)
0,AUS,1980,10478.42,14.695356,,5.64446,1.3
1,AUS,1982,11988.84,15.184247,,5.47011,1.2
2,AUS,1983,12781.13,15.393472,,5.36276,1.7
3,AUS,1985,14514.11,15.788312,,5.39562,1.7
4,AUS,1986,14983.69,16.018350,,5.27071,1.7
...,...,...,...,...,...,...,...
1786,DNK,1992,19826.83,5.171370,,,..
1787,ISR,1996,21623.87,5.685100,,,..
1788,BEL,1997,23732.79,10.181246,,,..
1789,USA,2008,48498.45,304.093966,26949.29,,..


### Household Income per Capita

https://data.oecd.org/hha/household-disposable-income.htm#indicator-chart

This data set shows the gross household disposable income per capita in 37 OECD countries. Household disposable income is the income available to households, such as wages and salaries, income from self-employment and unincorporated enterprises, income from pensions and other social benefits, and income from financial investments. Gross means that depreciation costs are not subtracted. For gross household disposable income per capita, growth rates (percentage change from previous period) are presented; these are ‘real’ growth rates adjusted to remove the effects of price changes. Information is also presented for gross household disposable income including social transfers, such as health or education provided for free or at reduced prices by governments and not-for-profit organisations.

The original data set has 8 columns: Location, Indicator, Subject, Measure, Frequency, Time, Value, and Flag Codes. The Location column specifies the country. The Indicator column specifies what is being measured (household disposable income); because only one quantity is measured in this data set, the column is redundant. The Measure column specifies how the data was measured, in this case as household income per capita; since it is the same for every row, this column isn't necessary. The Frequency column is likewise constant and not needed here. The Time column indicates the year of each income value, ranging from 1970 to 2020. The Value column holds the gross household disposable income per capita. The last column, Flag Codes, marks problems with individual rows; because the whole column is empty here, it is not necessary.

From the original data set, we removed the Indicator, Subject, Measure, Frequency, and Flag Codes columns. We also renamed the "Location" column as "Country" and the "Time" column as "Year" in order to be consistent with the enrollment_rates_df. We also renamed the "Value" column as "Household Income per Capita" in order to specify the value of it, which is needed when we combine all the data sets.

We also need to limit Year to the range 2013 to 2020 in order to be consistent with the previous data sets and limit the number of missing values.
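One way to apply such a year limit in pandas is `Series.between`, sketched below on three toy rows whose values are rounded from this notebook's output. This is an illustration of the filter, not a step taken from the cells that follow.

```python
import pandas as pd

# Toy rows, rounded from the household income output in this notebook
income = pd.DataFrame({
    "Country": ["JPN", "JPN", "CRI"],
    "Year": [2007, 2011, 2016],
    "Household Income per Capita": [24916.38, 27299.67, 14675.89],
})

# between() is inclusive on both ends by default
recent = income[income["Year"].between(2013, 2020)].reset_index(drop=True)
print(recent)
```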

In [19]:
household_income_df = pd.read_csv('household_income.csv')

query = """
        SELECT 
            LOCATION AS Country,
            TIME AS Year,
            Value AS "Household Income per Capita"
        FROM household_income_df;
        """
household_income_df = duckdb.sql(query).df()
household_income_df

Unnamed: 0,Country,Year,Household Income per Capita
0,JPN,2007,24916.381131
1,JPN,2008,25393.938874
2,JPN,2009,25581.218842
3,JPN,2010,26402.021262
4,JPN,2011,27299.673602
...,...,...,...
506,CRI,2016,14675.888631
507,CRI,2017,16130.493739
508,CRI,2018,16619.155338
509,CRI,2019,17161.123623


In [20]:
household_income_df.to_csv('clean_household_income_df.csv', index=False)

The two dataframes (combined_df and household_income_df) are combined below using a left join, so that we will have a cohesive data frame showing GDP per capita, total population, average spending on higher education per student, government expenditure on education and on tertiary education as percentages of GDP, and household income per capita for each country and year.

In [21]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN household_income_df
        ON combined_df.Country = household_income_df.Country
        AND combined_df.Year = household_income_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df 

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP),Country_2,Year_2,Household Income per Capita
0,AUS,2007,39687.45,20.827622,,4.656220,1.0,AUS,2007.0,29524.379108
1,AUS,2008,40130.34,21.249199,15768.22,4.632780,1.0,AUS,2008.0,31051.323158
2,AUS,2010,42816.43,22.031750,16300.98,5.543040,1.2,AUS,2010.0,32599.998578
3,AUS,2011,44440.58,22.340024,16382.44,5.069950,1.2,AUS,2011.0,33941.087440
4,AUS,2012,43884.64,22.733465,16002.71,4.867880,1.2,AUS,2012.0,33934.614031
...,...,...,...,...,...,...,...,...,...,...
1786,EST,2015,29222.75,1.314608,12905.99,5.144190,1.4,EST,2015.0,19321.329148
1787,ESP,2008,33242.25,45.983169,13075.95,4.528060,1.0,ESP,2008.0,23317.348626
1788,GBR,2017,46061.35,66.040229,28042.59,5.384990,1.4,GBR,2017.0,32417.628654
1789,USA,2011,49951.91,311.583481,26202.94,6.521565,..,USA,2011.0,42186.856901


The table above has two columns each for country name and year, so we drop the Country_2 and Year_2 columns because they are redundant, and make sure the values in Year are integers.

In [22]:
#dropping Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

#making the values in the Year column integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP),Household Income per Capita
0,AUS,2007,39687.45,20.827622,,4.656220,1.0,29524.379108
1,AUS,2008,40130.34,21.249199,15768.22,4.632780,1.0,31051.323158
2,AUS,2010,42816.43,22.031750,16300.98,5.543040,1.2,32599.998578
3,AUS,2011,44440.58,22.340024,16382.44,5.069950,1.2,33941.087440
4,AUS,2012,43884.64,22.733465,16002.71,4.867880,1.2,33934.614031
...,...,...,...,...,...,...,...,...
1786,EST,2015,29222.75,1.314608,12905.99,5.144190,1.4,19321.329148
1787,ESP,2008,33242.25,45.983169,13075.95,4.528060,1.0,23317.348626
1788,GBR,2017,46061.35,66.040229,28042.59,5.384990,1.4,32417.628654
1789,USA,2011,49951.91,311.583481,26202.94,6.521565,..,42186.856901


### Number of Universities

We weren't able to find a data set that directly listed the number of universities in each OECD country. Because of this, we built a data set in Excel and later converted it to a CSV file.

All the information from the data set was found from this website: https://www.webometrics.info/en/distribution_by_country


In [23]:
num_universities_df = pd.read_csv('number_of_universities.csv')
num_universities_df

Unnamed: 0,Country,Number of Universities
0,AUS,187
1,AUT,84
2,BEL,142
3,CAN,383
4,CHL,130
5,COL,299
6,CRI,68
7,CZE,64
8,DNK,81
9,EST,31


The two dataframes (combined_df and num_universities_df) are combined below using a left join on Country, so that we will have a cohesive data frame showing GDP per capita, total population, average spending on higher education per student, government expenditure on education and on tertiary education as percentages of GDP, household income per capita, and the number of universities for each country. Because the number of universities is a single value per country, it is repeated for every year of that country.

In [24]:
combined_df = pd.merge(combined_df, num_universities_df, on='Country', how='left')

combined_df

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP),Household Income per Capita,Number of Universities
0,AUS,2007,39687.45,20.827622,,4.656220,1.0,29524.379108,187
1,AUS,2008,40130.34,21.249199,15768.22,4.632780,1.0,31051.323158,187
2,AUS,2010,42816.43,22.031750,16300.98,5.543040,1.2,32599.998578,187
3,AUS,2011,44440.58,22.340024,16382.44,5.069950,1.2,33941.087440,187
4,AUS,2012,43884.64,22.733465,16002.71,4.867880,1.2,33934.614031,187
...,...,...,...,...,...,...,...,...,...
1786,EST,2015,29222.75,1.314608,12905.99,5.144190,1.4,19321.329148,31
1787,ESP,2008,33242.25,45.983169,13075.95,4.528060,1.0,23317.348626,276
1788,GBR,2017,46061.35,66.040229,28042.59,5.384990,1.4,32417.628654,337
1789,USA,2011,49951.91,311.583481,26202.94,6.521565,..,42186.856901,3180


Some rows in the combined data frame above have OECD as the country, which represents the average across all the countries. We are going to drop those rows, because not all the data frames that we joined include this value.
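
A minimal sketch of how the OECD-average rows could be dropped, assuming they are stored with the literal code `OECD` in the Country column (the notebook does not show the exact label). The data frame `toy_df` is a hypothetical stand-in for combined_df, and the OECD values in it are made up for illustration.

```python
import pandas as pd

# Toy stand-in for combined_df. The AUS and EST values match rows shown above;
# the OECD rows are illustrative only and not taken from the real data.
toy_df = pd.DataFrame({
    "Country": ["AUS", "OECD", "EST", "OECD"],
    "Year": [2011, 2011, 2015, 2015],
    "GDP per Capita": [44440.58, 36000.0, 29222.75, 40000.0],
})

# Keep every row whose Country code is not the assumed aggregate label "OECD".
toy_df = toy_df[toy_df["Country"] != "OECD"].reset_index(drop=True)

print(toy_df)
```

The same boolean filter can be applied to combined_df once the actual label used for the aggregate rows has been confirmed.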

### Education Enrollment Data

https://ourworldindata.org/grapher/primary-secondary-enrollment-completion-rates?tab=table

This data set shows the gross rate at which people enter and complete primary, secondary, and tertiary education from **1970 to 2020**. It was created to show how school enrollment around the world increased dramatically in the last century. The data set was published by Our World in Data, a project of the Global Change Data Lab, which is a registered charity in England and Wales. In addition, the enrollment rate at each stage can help demonstrate access to education in a country. A lower enrollment rate may imply that children in that country do not have the support necessary to receive a proper education.

This data set has 9 columns: Entity, Code, Year, and the school enrollment and completion rates for primary, secondary, and tertiary education.

The gross rate of primary school enrollment is measured through administrative data and is defined as the number of children enrolled in primary school, regardless of age, divided by the total population of the age group that officially corresponds to primary schooling. Some percentages are greater than 100, because children may enter education late or repeat a year and are still counted as enrolled. The rates of enrollment in secondary and tertiary education are measured in the same manner.
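
A small worked example shows how a gross rate can exceed 100%. The function name and the counts below are hypothetical and chosen only for illustration: 1,000 children are in the official primary age group, but 1,050 pupils are enrolled because some entered late or are repeating a year.

```python
def gross_enrollment_rate(enrolled_all_ages: int, official_age_population: int) -> float:
    """Gross enrollment rate in percent: everyone enrolled, regardless of age,
    divided by the population of the official age group."""
    return 100 * enrolled_all_ages / official_age_population


# Hypothetical counts: 1,050 enrolled pupils, 1,000 children of official age.
rate = gross_enrollment_rate(1_050, 1_000)
print(rate)  # 105.0
```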

The Code column lists the country codes for all countries that have available data, while the Entity column holds each country's full name. For this assignment, we are limiting the list of countries to OECD countries and using the country abbreviation to match the countries; therefore, we will not be using the Entity column. The Year column lists the years in which this data was collected, from 1970 to 2020. However, to reduce the amount of missing data in our final data set, we are going to limit the range of years to 1995 through 2020.
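
One way the year and country restrictions described above could be expressed in pandas is sketched here. The frame `toy_rates` and the list `oecd_codes` are hypothetical stand-ins (a two-country subset, not the full OECD list); the notebook itself relies on the left join further down to restrict the countries.

```python
import pandas as pd

# Toy stand-in for the enrollment data; only the Code and Year columns are shown.
toy_rates = pd.DataFrame({
    "Code": ["AFG", "AUS", "AUS", "EST"],
    "Year": [1974, 1990, 2011, 2015],
})

# Hypothetical subset of OECD country codes, for illustration only.
oecd_codes = ["AUS", "EST"]

# between() is inclusive on both ends by default, so 1995 and 2020 are kept.
in_range = toy_rates["Year"].between(1995, 2020)
is_oecd = toy_rates["Code"].isin(oecd_codes)

filtered = toy_rates[in_range & is_oecd].reset_index(drop=True)
print(filtered)
```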


In [25]:
enrollment_rates_df = pd.read_csv('enrollment_completion_rates.csv')
enrollment_rates_df.head()

Unnamed: 0,Entity,Code,Year,"Primary completion rate, total (% of relevant age group)","Completion rate, upper secondary education, both sexes (%)","Completion rate, lower secondary education, both sexes (%)","School enrollment, primary (% gross)","School enrollment, secondary (% gross)","School enrollment, tertiary (% gross)"
0,Afghanistan,AFG,1974,16.657,,,33.1083,10.91069,1.0226
1,Afghanistan,AFG,1977,17.88076,,,36.11149,13.13156,1.40822
2,Afghanistan,AFG,1978,19.65284,,,37.63702,13.82176,1.80833
3,Afghanistan,AFG,1980,26.37324,,,44.13337,16.76427,
4,Afghanistan,AFG,1981,32.70612,,,47.7117,19.32814,


Here, we select the columns we will be using: the country abbreviation, the year, and the three enrollment rates.

In [26]:
query = """
        SELECT
            Code AS Country,
            Year,
            "School enrollment, primary (% gross)" AS "Primary Enrollment rate (% gross)",
            "School enrollment, secondary (% gross)" AS "Secondary Enrollment rate (% gross)",
            "School enrollment, tertiary (% gross)" AS "Tertiary Enrollment rate (% gross)"
        FROM enrollment_rates_df
        """

enrollment_rates_df = duckdb.sql(query).df()

In [27]:
enrollment_rates_df.head()

Unnamed: 0,Country,Year,Primary Enrollment rate (% gross),Secondary Enrollment rate (% gross),Tertiary Enrollment rate (% gross)
0,AFG,1974,33.1083,10.91069,1.0226
1,AFG,1977,36.11149,13.13156,1.40822
2,AFG,1978,37.63702,13.82176,1.80833
3,AFG,1980,44.13337,16.76427,
4,AFG,1981,47.7117,19.32814,


By left joining on both country and year, with combined_df as the left table, we filter out all countries that are not OECD countries, since combined_df already contains only OECD countries.

In [28]:
query = """
        SELECT *
        FROM combined_df
        LEFT JOIN enrollment_rates_df
        ON combined_df.Country = enrollment_rates_df.Country
        AND combined_df.Year = enrollment_rates_df.Year
        ORDER BY combined_df.Country, combined_df.Year;
        """

combined_df = duckdb.sql(query).df()
combined_df.head()

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP),Household Income per Capita,Number of Universities,Country_2,Year_2,Primary Enrollment rate (% gross),Secondary Enrollment rate (% gross),Tertiary Enrollment rate (% gross)
0,AUS,1960,2412.63,10.275,,1.4,,,187,,,,,
1,AUS,1961,2383.19,10.5082,,,,,187,,,,,
2,AUS,1962,2577.33,10.7005,,,,,187,,,,,
3,AUS,1963,2752.62,10.9069,,,,,187,,,,,
4,AUS,1964,2902.59,11.1216,,,,,187,,,,,


The table above has two columns each for country and year, so we drop the redundant Country_2 and Year_2 columns and convert the float values in Year to integers.

In [29]:
# Drop the redundant Country_2 and Year_2 columns
combined_df = combined_df.drop(columns=['Country_2', 'Year_2'])

# Convert the values in the Year column to integers
combined_df['Year'] = combined_df['Year'].astype(int)

combined_df.head()

Unnamed: 0,Country,Year,GDP per Capita,Population(Million),Average Spending on Higher Education (USD/student),Government Expenditure On Education (%),Government_Spending_Teritary (% GDP),Household Income per Capita,Number of Universities,Primary Enrollment rate (% gross),Secondary Enrollment rate (% gross),Tertiary Enrollment rate (% gross)
0,AUS,1960,2412.63,10.275,,1.4,,,187,,,
1,AUS,1961,2383.19,10.5082,,,,,187,,,
2,AUS,1962,2577.33,10.7005,,,,,187,,,
3,AUS,1963,2752.62,10.9069,,,,,187,,,
4,AUS,1964,2902.59,11.1216,,,,,187,,,


In [None]:
combined_df.to_csv('combined_data.csv', index=False)

## Exploratory Data Analysis