## INFO 2950 Final Project - Phase II ##

##### Research Question: Of the factors that can affect the number of bachelor's degrees completed in a country per year (GDP, average income, number of private and public universities, etc.), which have the greatest impact on the number of bachelor's degrees obtained in OECD countries?

In this assignment, we examine the number of people in OECD countries who complete a bachelor's degree. The model includes the following variables: the country's GDP, GDP per capita, average family income, average tuition, population, education spending, number of public and private universities, average student loans, loan interest rates, and whether standardized tests are used. We will train a multivariate regression to see whether we can reliably predict the number of people graduating with a bachelor's degree. We will also look at which combinations of factors minimize the residuals and thus appear to have a greater impact on the number of bachelor's degrees obtained.
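As a rough illustration of the modeling step described above, the sketch below fits an ordinary least squares regression with several predictors using NumPy. The data are synthetic, and the column names (`gdp_per_cap`, `edu_spend_pct`, `grads`) are hypothetical stand-ins rather than the project's actual variables.

```python
import numpy as np
import pandas as pd

# Synthetic data: these columns are hypothetical stand-ins, not the project's data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "gdp_per_cap": rng.normal(40.0, 10.0, n),   # thousands of USD (made up)
    "edu_spend_pct": rng.normal(5.0, 1.0, n),   # % of GDP (made up)
})
# Build the response from known coefficients plus small noise,
# so we can check that the fit recovers them.
toy["grads"] = (3.0 + 2.0 * toy["gdp_per_cap"] + 10.0 * toy["edu_spend_pct"]
                + rng.normal(0.0, 0.5, n))

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), toy[["gdp_per_cap", "edu_spend_pct"]].to_numpy()])
y = toy["grads"].to_numpy()
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Root mean squared error of the residuals as a simple fit summary.
residuals = y - X @ coef
rmse = float(np.sqrt(np.mean(residuals ** 2)))
print(dict(zip(["intercept", "gdp_per_cap", "edu_spend_pct"], coef.round(2))), round(rmse, 3))
```

Comparing the residual error across different subsets of predictors is one way to see which combinations fit best. Raw coefficient sizes depend on each variable's units, so standardizing the predictors first makes their magnitudes easier to compare.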

To provide some background, OECD stands for the Organization for Economic Cooperation and Development, a group of countries with market-based economies that collaborate to promote sustainable economic growth. (Source: https://www.state.gov/the-organization-for-economic-co-operation-and-development-oecd/#:~:text=and%20Development%20(OECD)-,The%20Organization%20for%20Economic%20Cooperation%20and%20Development%20(OECD),to%20promote%20sustainable%20economic%20growth.)

Perhaps talk more about why specifically this group was chosen and what impact that will have on our conclusion (more data, but a limitation of the model)

In [154]:
import time

import duckdb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from bs4 import BeautifulSoup

### Initial Data Cleaning
Dropping unwanted bookkeeping columns and renaming the remaining ones

In [37]:
wages = pd.read_csv("datasets/AnnualWage.csv")
wages = wages.rename(columns = {"Reference area" : "Country", "UNIT_MEASURE" : "Currency",
                    "TIME_PERIOD" : "Year", "OBS_VALUE" : "Avg_Wage"})
wages_df = duckdb.sql("""SELECT Country, Currency, Year, Avg_Wage 
                    FROM wages""").df()
wages_df.head()

Unnamed: 0,Country,Currency,Year,Avg_Wage
0,Slovak Republic,EUR,2000,5429.359686
1,Slovak Republic,EUR,2001,5732.174771
2,Slovak Republic,EUR,2002,6204.566786
3,Slovak Republic,EUR,2003,6741.059776
4,Slovak Republic,EUR,2004,7339.87799


Cleaning population information

In [46]:
pop = pd.read_csv("datasets/total_pop.csv")
pop = pop.rename(columns = {"TIME":"Year", "Value": "Percent_Rural"})
rural_pop_df = duckdb.sql("""SELECT Country, Year, Percent_Rural, Indicator
FROM pop WHERE Indicator = 'Rural population (% of total population)'""").df()
rural_pop_df = duckdb.sql("""SELECT Country, Year, Percent_Rural
FROM rural_pop_df""").df()
rural_pop_df.head()

pop = pop.rename(columns = {"Percent_Rural" : "PopTotal"})
total_pop_df = duckdb.sql("""SELECT Country, Year, Indicator, PopTotal
FROM pop WHERE Indicator = 'Total population (thousands)'""").df()
total_pop_df = duckdb.sql("""SELECT Country, Year, PopTotal
FROM total_pop_df""").df()
total_pop_df.head()

pop_info_df = duckdb.sql("""SELECT * FROM rural_pop_df
    LEFT JOIN total_pop_df
    ON (total_pop_df.Year = rural_pop_df.Year 
    AND total_pop_df.Country = rural_pop_df.Country)""").df()
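# Percent_Rural is a percentage (0-100), so divide by 100; RuralPop is in thousands like PopTotal
pop_info_df = duckdb.sql("""SELECT Country, Year, PopTotal, Percent_Rural, 
    (PopTotal * Percent_Rural / 100) AS RuralPop  
    FROM pop_info_df ORDER BY Country, Year""").df()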

pop_info_df.head()

Unnamed: 0,Country,Year,Percent_Rural
0,Australia,1980,14
1,Australia,1981,14
2,Australia,1982,14
3,Australia,1983,14
4,Australia,1984,14
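
The left join in the cell above can also be written with `pandas.DataFrame.merge`. The sketch below uses two small made-up frames with the same column names. It assumes `Percent_Rural` is a percentage (0 to 100) and `PopTotal` is in thousands, as the indicator labels suggest.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the project's datasets.
rural = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Percent_Rural": [14.0, 13.0, 40.0],
})
total = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2000, 2001],
    "PopTotal": [1000.0, 1100.0],   # thousands
})

# Left join keeps every rural row; unmatched rows get NaN for PopTotal.
pop_info = rural.merge(total, on=["Country", "Year"], how="left")
# Convert percent to a fraction before multiplying, so RuralPop stays in thousands.
pop_info["RuralPop"] = pop_info["PopTotal"] * pop_info["Percent_Rural"] / 100
pop_info = pop_info.sort_values(["Country", "Year"]).reset_index(drop=True)
print(pop_info)
```

Merging on a list of key columns avoids the duplicated `Country` and `Year` columns that `SELECT *` produces in a SQL join.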


Government spending on education

In [127]:
perEduExp_Gov = pd.read_csv("datasets/Edu%TotalGovExp.csv")
perEduExp_Gov = perEduExp_Gov.rename(columns = {" ":"Country"})
year_names = perEduExp_Gov.columns[1:]
perEduExp_Gov = perEduExp_Gov.melt(id_vars = ["Country"],
                                   var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "Edu_Expend_Percent")


EduExp_Gov = pd.read_csv("datasets/ExpendonEdu_MillionUSD.csv")
EduExp_Gov = EduExp_Gov.rename(columns = {" ":"Country"})
year_names = EduExp_Gov.columns[1:]
EduExp_Gov = EduExp_Gov.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "Edu_Expend_USD")

TerEduExp_Gov = pd.read_csv("datasets/TertiaryPubExpenditure.csv")
TerEduExp_Gov = TerEduExp_Gov.rename(columns = {" ":"Country"})
year_names = TerEduExp_Gov.columns[1:]
TerEduExp_Gov = TerEduExp_Gov.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "TerPubEdu_Expend_USD")

gov_exp = duckdb.sql("""SELECT * FROM perEduExp_Gov 
                LEFT JOIN EduExp_Gov
                ON (perEduExp_Gov.Year = EduExp_Gov.Year AND
                perEduExp_Gov.Country = EduExp_Gov.Country)
                LEFT JOIN TerEduExp_Gov
                ON (perEduExp_Gov.Year = TerEduExp_Gov.Year AND
                perEduExp_Gov.Country = TerEduExp_Gov.Country)""").df()
gov_exp = duckdb.sql("""SELECT Country, Year, 
Edu_Expend_Percent, Edu_Expend_USD, TerPubEdu_Expend_USD
FROM gov_exp ORDER BY Country, Year""").df()

gov_exp.head()

Unnamed: 0,Country,Year,Edu_Expend_Percent,Edu_Expend_USD,TerPubEdu_Expend_USD
0,Australia,1980,..,1765.3,..
1,Australia,1981,..,..,..
2,Australia,1982,..,2100.3,..
3,Australia,1983,..,3296.9,..
4,Australia,1984,..,..,..
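
The `..` entries in the output above appear to be the source's placeholder for missing values, which would leave these columns stored as strings. A minimal sketch of one way to convert them, using a small made-up frame with the same column names:

```python
import pandas as pd

# Made-up rows mimicking the shape of gov_exp; ".." stands in for missing data.
toy_exp = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia"],
    "Year": [1980, 1981, 1982],
    "Edu_Expend_Percent": ["..", "..", "12.5"],
    "Edu_Expend_USD": ["1765.3", "..", "2100.3"],
})

value_cols = ["Edu_Expend_Percent", "Edu_Expend_USD"]
for col in value_cols:
    # errors="coerce" turns anything non-numeric (including "..") into NaN.
    toy_exp[col] = pd.to_numeric(toy_exp[col], errors="coerce")

print(toy_exp.dtypes)
print(toy_exp.isna().sum())
```

An alternative is to pass `na_values=".."` to `pd.read_csv` so the placeholder is treated as missing when the file is loaded.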


Cleaning and melting the GDP-related data frames so that year becomes a single column (one row per country and year)
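
For readers unfamiliar with `melt`, the sketch below shows the wide-to-long reshape on a tiny made-up table laid out like the files read in this notebook (one row per country, one column per year). The helper name `melt_years` is hypothetical and is not part of the notebook.

```python
import pandas as pd

def melt_years(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a one-column-per-year table into Country/Year/value rows.

    Hypothetical helper; assumes the first column holds the country name
    and every remaining column is a year.
    """
    wide = wide.rename(columns={wide.columns[0]: "Country"})
    long = wide.melt(id_vars=["Country"], var_name="Year", value_name=value_name)
    long["Year"] = long["Year"].astype(int)   # column headers are strings after melt
    return long.sort_values(["Country", "Year"]).reset_index(drop=True)

# Made-up numbers, for illustration only.
wide = pd.DataFrame({" ": ["X", "Y"], "1980": [1.0, 2.0], "1981": [1.5, 2.5]})
long = melt_years(wide, "GDP_PerCap")
print(long)
```

Casting `Year` to an integer matters when a melted table is later joined to one whose `Year` column is already numeric, since a string `"1980"` will not match the integer `1980`.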

In [149]:
gdpALL = pd.read_csv("datasets/GDP.csv")
oecd_list = ["Australia","Austria","Belgium", "Canada", "Chile", "Colombia", "Costa Rica",
            "Czechia", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", 
             "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea, Rep.", 
            "Latvia", "Lithuania", "Luxembourg", "Mexico", "Netherlands", "New Zealand",
            "Norway", "Poland", "Portugal", "Slovak Republic","Slovenia","Spain","Sweden",
            "Switzerland", "Turkiye","United Kingdom","United States"]
gdpOECD = gdpALL[gdpALL["Country Name"].isin(oecd_list)]
gdpOECD = gdpOECD.rename(columns = {"Country Name":"Country"})
year_names = gdpOECD.columns[24:-1]
gdpOECD = gdpOECD.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "GDP")


gdpPerCap = pd.read_csv("datasets/GDP_perCapita.csv")
gdpPerCap = gdpPerCap.rename(columns = {" ":"Country"})
year_names = gdpPerCap.columns[1:]
gdpPerCap = gdpPerCap.melt(id_vars = ["Country"],
                             var_name = "Year",
                                  value_vars = year_names,
                                  value_name = "GDP_PerCap")

gdpPerEdu = pd.read_csv("datasets/GovernmentExp_EduofGdp.csv")
gdpPerEdu = gdpPerEdu.rename(columns ={" ": "Country"})
year_names = gdpPerEdu.columns[1:]
gdpPerEdu = gdpPerEdu.melt(id_vars = ["Country"],
                                     var_name = "Year",
                          value_vars = year_names,
                           value_name = "Percent_GDP_On_Edu")

gdpPerTertEdu = pd.read_csv("datasets/TertiaryGovExp%GDP.csv")
gdpPerTertEdu = gdpPerTertEdu.rename(columns ={" ": "Country"})
year_names = gdpPerTertEdu.columns[1:]
gdpPerTertEdu = gdpPerTertEdu.melt(id_vars = ["Country"],
                                     var_name = "Year",
                          value_vars = year_names,
                           value_name = "Percent_GDP_On_TertEdu")

gdp_factors = duckdb.sql("""SELECT * FROM gdpOECD
LEFT JOIN gdpPerCap ON (gdpOECD.Year = gdpPerCap.Year AND
                gdpOECD.Country = gdpPerCap.Country)
LEFT JOIN gdpPerEdu ON (gdpOECD.Year = gdpPerEdu.Year AND
                gdpOECD.Country = gdpPerEdu.Country)
LEFT JOIN gdpPerTertEdu ON (gdpOECD.Year = gdpPerTertEdu.Year AND
                gdpOECD.Country = gdpPerTertEdu.Country)""").df()

gdp_factors = duckdb.sql("""SELECT Country, Year, GDP, GDP_PerCap, 
Percent_GDP_On_Edu, Percent_GDP_On_TertEdu FROM gdp_factors ORDER BY Country, Year""").df()
print (gdp_factors)

#gdp_factors.fillna()

#gdp_factors["Expend_TertEdu"] = gdp_factors.GDP * gdp_factors.Percent_GDP_On_TertEdu
#gdp_factors["Expend_Edu"] = gdp_factors.GDP * gdp_factors.Percent_GDP_On_Edu



            Country  Year           GDP GDP_PerCap Percent_GDP_On_Edu  \
0         Australia  1980  1.499844e+11    10194.3                5.7   
1         Australia  1981  1.768919e+11    11833.7                 ..   
2         Australia  1982  1.940373e+11    12766.5                5.5   
3         Australia  1983  1.772635e+11    11518.7                5.4   
4         Australia  1984  1.935172e+11    12431.9                 ..   
...             ...   ...           ...        ...                ...   
1629  United States  2018  2.053306e+13    62794.6                 ..   
1630  United States  2019  2.138098e+13         ..                 ..   
1631  United States  2020  2.106047e+13         ..                 ..   
1632  United States  2021  2.331508e+13         ..                 ..   
1633  United States  2022  2.546270e+13         ..                 ..   

     Percent_GDP_On_TertEdu  
0                       1.3  
1                        ..  
2                       1.2  
3  

Response variable (y): tertiary degrees completed

In [172]:
grad = pd.read_csv("datasets/Grad_TretandBach.csv")
grad = grad.rename(columns ={"Reference sector": "Type", "Education level" : "EduLev",
                            "Value":"NumDegree"})
print (grad.columns)
grad = duckdb.sql("""SELECT Country, Type, EduLev, Year, NumDegree
FROM grad""").df()
#grad = grad.pivot(columns = ["Type"], 
                  #values = ["Value"])
#grad.head()

Index(['COUNTRY', 'Country', 'SEX', 'Gender', 'REF_SECTOR', 'Type',
       'EDUCATION_LEV', 'EduLev', 'YEAR', 'Year', 'NumDegree', 'Flag Codes',
       'Flags'],
      dtype='object')
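
If one column per sector is wanted later, `pivot_table` is one option. Note that after the rename above the value column is `NumDegree`, not `Value`. The sketch below uses made-up rows with the renamed columns; the sector labels are placeholders, not values confirmed from the dataset.

```python
import pandas as pd

# Made-up rows; "Public" / "Private" are placeholder labels for Type.
toy_grad = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2015, 2015, 2015, 2015],
    "Type": ["Public", "Private", "Public", "Private"],
    "NumDegree": [100.0, 40.0, 70.0, 10.0],
})

# One row per Country/Year, one column per Type; duplicates (if any) are summed.
wide_grad = toy_grad.pivot_table(index=["Country", "Year"], columns="Type",
                                 values="NumDegree", aggfunc="sum").reset_index()
wide_grad.columns.name = None   # drop the leftover "Type" axis label
print(wide_grad)
```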


In [162]:
url = "https://n26.com/en-eu/the-education-price-index"
result = requests.get(url)
if result.status_code != 200:
    print ("Failed")
print (result.status_code)

200


In [176]:
soup = BeautifulSoup(result.text, 'html.parser')
#print (soup)
# find() returns a single Tag (or None); find_all() returns a list-like ResultSet
table = soup.find("table", {"role":"grid"})
if table is None:
    raise ValueError("No table with role='grid' found on the page")
tableBody = table.find("tbody")
#for country in table.find_all('tr'):
    #row = country.find
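
`find_all` returns a list-like `ResultSet`, whereas `find` returns a single element (or `None`). A minimal sketch of pulling table rows into a DataFrame, run on made-up inline HTML rather than the live page, whose actual markup is not shown here:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the page; the real markup may differ.
html = """
<table role="grid">
  <thead><tr><th>Country</th><th>Cost</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>250</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", {"role": "grid"})   # single Tag, or None if absent
if table is None:
    raise ValueError("no matching table found")

# Header cells give the column names; each body row becomes one record.
header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")]
scraped = pd.DataFrame(rows, columns=header)
scraped["Cost"] = pd.to_numeric(scraped["Cost"])
print(scraped)
```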

In [177]:
oecd_list = ["AUS","AUT","BEL", "CAN", "CHL", "COL", "CRI",
            "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GRC", 
             "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR", 
            "LVA", "LTU", "LUX", "MEX", "NLD", "NZL",
            "NOR", "POL", "PRT", "SVK","SVN","ESP","SWE",
            "CHE", "TUR","GBR","USA"]

Unnamed: 0,Country,Years_Pay_Off_Loan,Min_Wage_Hours_To_Afford_Degree,Cost_Per_Year(Euro),Junior_Salary(Euro),Senior_Salary(Euro),Cost_Per_Year(USD),Junior_Salary(USD),Senior_Salary(USD)
0,Australia,5.0,1490.0,5939.0,51048.0,96902.0,5647.989,48546.648,92153.802
1,Austria,0.0,15.0,41.0,41595.0,75558.0,38.991,39556.845,71855.658
2,Belgium,1.0,506.0,1555.0,50651.0,91263.0,1478.805,48169.101,86791.113
3,Canada,8.0,3642.0,9176.0,47833.0,90994.0,8726.376,45489.183,86535.294
4,Chile,11.0,9004.0,5757.0,24975.0,46620.0,5474.907,23751.225,44335.62
