## INFO 2950 Final Project - Phase II ##

##### Research Question: Of the factors that can affect the number of bachelor's degrees completed in a country per year (GDP, average income, number of private and public universities, etc.), which have the most impact on the number of bachelor's degrees obtained in OECD countries?

In this assignment, we examine the number of people in OECD countries who complete a bachelor's degree. The model's variables include each country's GDP, GDP per capita, average family income, average tuition, population, education spending, number of public and private universities, average student loans, loan interest rates, and whether there are standardized tests. We will train a multivariate regression to see whether we can reliably predict the number of people graduating with a bachelor's degree. We will also look at which combinations of factors minimize the residuals and thus have a greater impact on the number of bachelor's degrees obtained.
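As a rough illustration of the planned approach, the sketch below fits an ordinary least-squares regression with NumPy on synthetic data. The predictors, coefficients, and values are made up for illustration and do not come from our datasets. Standardizing the predictors is an assumed choice here, made so that coefficient magnitudes can be compared across factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three predictors (hypothetical, not real OECD data)
n = 200
X = rng.normal(size=(n, 3))
true_coefs = np.array([3.0, -1.0, 0.5])
y = 10.0 + X @ true_coefs + rng.normal(scale=0.1, size=n)

# Standardize predictors so coefficient magnitudes are comparable
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(n), X_std])
coefs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Residual sum of squares and R^2 summarize the fit
residuals = y - A @ coefs
rss = float(residuals @ residuals)
r_squared = 1.0 - rss / float(((y - y.mean()) ** 2).sum())

# Rank predictors by absolute standardized coefficient (largest first)
ranking = np.argsort(-np.abs(coefs[1:]))
```

On real data the ranking would only be suggestive, since correlated predictors (for example GDP and GDP per capita) can trade off against each other in the fit.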

To provide some background, OECD stands for the Organization for Economic Cooperation and Development, a group of countries with market-based economies that collaborate to promote sustainable economic growth. (Source: https://www.state.gov/the-organization-for-economic-co-operation-and-development-oecd/#:~:text=and%20Development%20(OECD)-,The%20Organization%20for%20Economic%20Cooperation%20and%20Development%20(OECD),to%20promote%20sustainable%20economic%20growth.)

Perhaps talk more about why specifically this group and what impact it will have on our conclusion (more data, but limitation of model)

In [3]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

import duckdb

### Initial Data Cleaning
Removing unwanted bookkeeping columns and renaming the remaining columns

In [37]:
wages = pd.read_csv("datasets/AnnualWage.csv")
wages = wages.rename(columns = {"Reference area" : "Country", "UNIT_MEASURE" : "Currency",
                    "TIME_PERIOD" : "Year", "OBS_VALUE" : "Avg_Wage"})
wages_df = duckdb.sql("""SELECT Country, Currency, Year, Avg_Wage 
                    FROM wages""").df()
wages_df.head()

Unnamed: 0,Country,Currency,Year,Avg_Wage
0,Slovak Republic,EUR,2000,5429.359686
1,Slovak Republic,EUR,2001,5732.174771
2,Slovak Republic,EUR,2002,6204.566786
3,Slovak Republic,EUR,2003,6741.059776
4,Slovak Republic,EUR,2004,7339.87799


Cleaning population information

In [46]:
pop = pd.read_csv("datasets/total_pop.csv")
pop = pop.rename(columns = {"TIME":"Year", "Value": "Percent_Rural"})
rural_pop_df = duckdb.sql("""SELECT Country, Year, Percent_Rural, Indicator
FROM pop WHERE Indicator = 'Rural population (% of total population)'""").df()
rural_pop_df = duckdb.sql("""SELECT Country, Year, Percent_Rural
FROM rural_pop_df""").df()
rural_pop_df.head()

pop = pop.rename(columns = {"Percent_Rural" : "PopTotal"})
total_pop_df = duckdb.sql("""SELECT Country, Year, Indicator, PopTotal
FROM pop WHERE Indicator = 'Total population (thousands)'""").df()
total_pop_df = duckdb.sql("""SELECT Country, Year, PopTotal
FROM total_pop_df""").df()
total_pop_df.head()

pop_info_df = duckdb.sql("""SELECT * FROM rural_pop_df
    LEFT JOIN total_pop_df
    ON (total_pop_df.Year = rural_pop_df.Year 
    AND total_pop_df.Country = rural_pop_df.Country)""").df()
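# Percent_Rural is on a 0-100 scale, so divide by 100 to get RuralPop
# in the same units as PopTotal (thousands of people)
pop_info_df = duckdb.sql("""SELECT Country, Year, PopTotal, Percent_Rural, 
    (PopTotal * Percent_Rural / 100) AS RuralPop  
    FROM pop_info_df ORDER BY Country, Year""").df()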

pop_info_df.head()

Unnamed: 0,Country,Year,Percent_Rural
0,Australia,1980,14
1,Australia,1981,14
2,Australia,1982,14
3,Australia,1983,14
4,Australia,1984,14
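The same population table could also be built in pandas alone by pivoting on `Indicator`, which avoids the two filtered queries and the join. This is a minimal sketch on a toy frame with made-up values. The name `pop_long` is hypothetical, and the sketch assumes `Percent_Rural` is on a 0-100 scale, as the output above suggests.

```python
import pandas as pd

# Toy long-format frame mimicking total_pop.csv (values are made up)
pop_long = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2000, 2000, 2000],
    "Indicator": [
        "Rural population (% of total population)",
        "Total population (thousands)",
        "Rural population (% of total population)",
        "Total population (thousands)",
    ],
    "Value": [20.0, 1000.0, 50.0, 400.0],
})

# One row per (Country, Year), one column per indicator
pop_wide = (
    pop_long.pivot(index=["Country", "Year"], columns="Indicator", values="Value")
    .rename(columns={
        "Rural population (% of total population)": "Percent_Rural",
        "Total population (thousands)": "PopTotal",
    })
    .reset_index()
)
pop_wide.columns.name = None

# Percent_Rural is a percentage, so divide by 100; RuralPop is in thousands
pop_wide["RuralPop"] = pop_wide["PopTotal"] * pop_wide["Percent_Rural"] / 100
```

`pivot` raises an error if a (Country, Year, Indicator) combination appears more than once, which doubles as a check for duplicate rows in the source file.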


Government spending on education

In [102]:
perEduExp_Gov = pd.read_csv("datasets/Edu%TotalGovExp.csv")
perEduExp_Gov = perEduExp_Gov.rename(columns = {" ":"Country"})
year_names = perEduExp_Gov.columns[1:]
perEduExp_Gov = perEduExp_Gov.melt(id_vars = ["Country"],
                                   var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "Edu_Expend_Percent")


EduExp_Gov = pd.read_csv("datasets/ExpendonEdu_MillionUSD.csv")
EduExp_Gov = EduExp_Gov.rename(columns = {" ":"Country"})
year_names = EduExp_Gov.columns[1:]
EduExp_Gov = EduExp_Gov.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "Edu_Expend_USD")

TerEduExp_Gov = pd.read_csv("datasets/TertiaryPubExpenditure.csv")
TerEduExp_Gov = TerEduExp_Gov.rename(columns = {" ":"Country"})
year_names = TerEduExp_Gov.columns[1:]
TerEduExp_Gov = TerEduExp_Gov.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "TerPubEdu_Expend_USD")

gov_exp = duckdb.sql("""SELECT * FROM perEduExp_Gov 
                LEFT JOIN EduExp_Gov
                ON (perEduExp_Gov.Year = EduExp_Gov.Year AND
                perEduExp_Gov.Country = EduExp_Gov.Country)
                LEFT JOIN TerEduExp_Gov
                ON (perEduExp_Gov.Year = TerEduExp_Gov.Year AND
                perEduExp_Gov.Country = TerEduExp_Gov.Country)""").df()
gov_exp = duckdb.sql("""SELECT Country, Year, 
Edu_Expend_Percent, Edu_Expend_USD, TerPubEdu_Expend_USD
FROM gov_exp ORDER BY Country, Year""").df()

gov_exp.head()

Unnamed: 0,Country,Year,Edu_Expend_Percent,Edu_Expend_USD,TerPubEdu_Expend_USD
0,Australia,1980,..,1765.3,..
1,Australia,1981,..,..,..
2,Australia,1982,..,2100.3,..
3,Australia,1983,..,3296.9,..
4,Australia,1984,..,..,..
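The `..` entries in the output above mark missing observations, and their presence means these columns are read as text rather than numbers. Since the rename-and-melt pattern repeats for each file, one option is a small helper that also converts the values to numeric. The function name `wide_to_long` and the toy input are hypothetical; this is a sketch, not the cleaning step we have settled on.

```python
import pandas as pd

def wide_to_long(df, value_name):
    """Melt a Country-by-year table to long form with numeric values."""
    df = df.rename(columns={" ": "Country"})
    long_df = df.melt(id_vars=["Country"], var_name="Year", value_name=value_name)
    # ".." marks a missing observation; coerce it (and other non-numeric text) to NaN
    long_df[value_name] = pd.to_numeric(long_df[value_name], errors="coerce")
    long_df["Year"] = long_df["Year"].astype(int)
    return long_df.sort_values(["Country", "Year"]).reset_index(drop=True)

# Toy input in the same shape as the CSVs above (toy values)
raw = pd.DataFrame({
    " ": ["Australia", "Austria"],
    "1980": ["1765.3", ".."],
    "1981": ["..", "10.5"],
})
tidy = wide_to_long(raw, "Edu_Expend_USD")
```

With numeric columns and an integer `Year`, later joins against tables such as `wages_df` can compare years directly instead of matching strings to integers.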


In [111]:
gdpALL = pd.read_csv("datasets/GDP.csv")
oecd_list = ["Australia","Austria","Belgium", "Canada", "Chile", "Colombia", "Costa Rica",
            "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", 
             "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea, Rep.", 
            "Latvia", "Lithuania", "Luxembourg", "Mexico", "Netherlands", "New Zealand",
            "Norway", "Poland", "Portugal"]

gdpPerCap = pd.read_csv("datasets/GDP_perCapita.csv")
gdpPerCap = gdpPerCap.rename(columns = {" ":"Country"})
year_names = gdpPerCap.columns[1:]
gdpPerCap = gdpPerCap.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "GDP_PerCap")
gdpPerCap.head()

gdpPerEdu = pd.read_csv("datasets/GovernmentExp_EduofGdp.csv")
gdpPerEdu = gdpPerEdu.rename(columns ={" ": "Country"})
year_names = gdpPerEdu.columns[1:]
gdpPerEdu = gdpPerEdu.melt(id_vars = ["Country"],
                                     var_name = "Year",
                          value_vars = year_names,
                           value_name = "Percent_GDP_On_Edu")
gdpPerEdu.head()

gdpPerTertEdu = pd.read_csv("datasets/TertiaryGovExp%GDP.csv")
gdpPerTertEdu = gdpPerTertEdu.rename(columns ={" ": "Country"})
year_names = gdpPerTertEdu.columns[1:]
gdpPerTertEdu = gdpPerTertEdu.melt(id_vars = ["Country"],
                                     var_name = "Year",
                          value_vars = year_names,
                           value_name = "Percent_GDP_On_TertEdu")
gdpPerTertEdu.head()

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       'Unnamed: 67'],
      dtype='object')


Unnamed: 0,Country,Year,Percent_GDP_On_TertEdu
0,Australia,1980,1.3
1,Austria,1980,0.7
2,Belgium,1980,1.0
3,Canada,1980,1.9
4,Chile,1980,1.5
