# Assignment
In this assignment, you'll continue working with the U.S. Education Dataset from Kaggle. The data gives detailed state-level information on several facets of education on an annual basis.

Don't forget to apply the most suitable missing-value filling techniques from the previous checkpoints to the data. Answer the following questions only after you have handled the missing values.

Say we want to understand the relationship between government expenditures and students' overall success in math and reading.


In [1]:
# Libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")


postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'useducation'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

education_df = pd.read_sql_query('select * from useducation',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [2]:
# Fill missing values by interpolation, since the data is a time series

education2_df = education_df.copy()

null_val = ['ENROLL','TOTAL_REVENUE','FEDERAL_REVENUE','STATE_REVENUE','LOCAL_REVENUE',
           'TOTAL_EXPENDITURE','INSTRUCTION_EXPENDITURE','SUPPORT_SERVICES_EXPENDITURE',
           'OTHER_EXPENDITURE','CAPITAL_OUTLAY_EXPENDITURE','GRADES_PK_G','GRADES_KG_G',
           'GRADES_4_G','GRADES_8_G','GRADES_12_G','GRADES_1_8_G','GRADES_9_12_G','GRADES_ALL_G',
           'AVG_MATH_4_SCORE','AVG_MATH_8_SCORE','AVG_READING_4_SCORE','AVG_READING_8_SCORE']

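for col in null_val:
    # Assign the result back: interpolate(inplace=True) on a column selection
    # may not modify the DataFrame in newer pandas versions
    education2_df[col] = education2_df[col].interpolate()

# drop any rows that still contain nulls after interpolation
education2_df.dropna(inplace=True)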

education2_df.head()

| | PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | GRADES_4_G | GRADES_8_G | GRADES_12_G | GRADES_1_8_G | GRADES_9_12_G | GRADES_ALL_G | AVG_MATH_4_SCORE | AVG_MATH_8_SCORE | AVG_READING_4_SCORE | AVG_READING_8_SCORE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 36 | 2001_WYOMING | WYOMING | 2001 | 89711.0 | 804297.0 | 69172.0 | 403021.0 | 332104.0 | 787949.0 | 426072.0 | ... | 6587.0 | 7211.0 | 6855.0 | 53091.0 | 29035.0 | 1098988.5 | 219.51587 | 268.128094 | 221.250787 | 261.870243 |
| 37 | 1992_OKLAHOMA | OKLAHOMA | 1992 | 129586.3125 | 2396705.0 | 211627.0 | 1470516.0 | 714562.0 | 2515272.0 | 1283577.0 | ... | 48793.0 | 44163.0 | 33346.0 | 387659.0 | 157824.0 | 550342.0 | 220.319806 | 268.13199 | 219.256638 | 261.86103 |
| 38 | 1992_OREGON | OREGON | 1992 | 169461.625 | 2773959.0 | 163544.0 | 788309.0 | 1822106.0 | 2898210.0 | 1556770.0 | ... | 41443.0 | 39610.0 | 31920.0 | 325128.0 | 144002.0 | 469901.0 | 222.312769 | 269.755214 | 217.26249 | 261.851817 |
| 39 | 1992_PENNSYLVANIA | PENNSYLVANIA | 1992 | 209336.9375 | 11257252.0 | 658139.0 | 4227323.0 | 6371790.0 | 11539253.0 | 6075381.0 | ... | 131248.0 | 126293.0 | 108244.0 | 1063552.0 | 486354.0 | 1554322.0 | 224.305732 | 271.378439 | 215.268341 | 261.842604 |
| 40 | 1992_RHODE_ISLAND | RHODE_ISLAND | 1992 | 249212.25 | 883073.0 | 43545.0 | 329810.0 | 509718.0 | 863404.0 | 556787.0 | ... | 11129.0 | 10204.0 | 8244.0 | 90758.0 | 37931.0 | 129187.0 | 215.449248 | 265.907109 | 219.940098 | 261.833392 |
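
One thing worth checking in the output above: the row for 2001_WYOMING is followed by 1992_OKLAHOMA, and the ENROLL values of the 1992 rows rise in equal steps across different states. This suggests that interpolating down a whole column may blend values from unrelated states. A possible alternative, sketched here on made-up toy data (the frame `toy` and the column `ENROLL_filled` are hypothetical, not part of the notebook), is to sort by state and year and interpolate within each state:

```python
import numpy as np
import pandas as pd

# Toy panel with made-up values (not taken from the dataset).
toy = pd.DataFrame({
    "STATE": ["A", "B", "A", "B", "A", "B"],
    "YEAR": [1992, 1992, 1993, 1993, 1994, 1994],
    "ENROLL": [100.0, 1000.0, np.nan, np.nan, 300.0, 3000.0],
})

# Column-wise interpolation in the stored row order mixes states A and B.
naive = toy["ENROLL"].interpolate()

# Sort so each state's rows are chronological, then interpolate per state.
toy = toy.sort_values(["STATE", "YEAR"])
toy["ENROLL_filled"] = toy.groupby("STATE")["ENROLL"].transform(
    lambda s: s.interpolate()
)

print(naive.tolist())
print(toy)
```

In the toy frame, the per-state fill gives 200 for state A and 2000 for state B in 1993, while the column-wise fill produces values between 1000 and 300 that belong to neither state's trend. Whether this matters for the real data depends on how the rows are ordered and which values are missing.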


#### 1. Create a new score variable from the weighted averages of all score variables in the dataset. Notice that the number of students in the 4th grade isn't the same as the number of students in the 8th grade, so you should weight the scores appropriately!

In [3]:
# Create overall score: average math and reading within each grade,
# then weight the two grade-level scores by enrollment

grade4_score = (education2_df["AVG_MATH_4_SCORE"] + education2_df["AVG_READING_4_SCORE"]) * 0.5
grade8_score = (education2_df["AVG_MATH_8_SCORE"] + education2_df["AVG_READING_8_SCORE"]) * 0.5

education2_df["overall_score"] = (
    education2_df["GRADES_4_G"] * grade4_score + education2_df["GRADES_8_G"] * grade8_score
) / (education2_df["GRADES_4_G"] + education2_df["GRADES_8_G"])


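Within each grade, the math and reading averages are combined with equal weight. The two grade-level scores are then weighted by the number of students enrolled in the 4th and 8th grades.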
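
To make the weighting concrete, here is a small worked example for a single hypothetical state-year. All numbers are invented for illustration and do not come from the dataset:

```python
import numpy as np

# Hypothetical numbers for one state-year (not from the dataset).
grade4_students, grade8_students = 3000, 1000
math4, reading4 = 220.0, 216.0
math8, reading8 = 270.0, 262.0

# Equal-weight average of math and reading within each grade.
grade4_score = (math4 + reading4) / 2  # 218.0
grade8_score = (math8 + reading8) / 2  # 266.0

# Enrollment-weighted average across the two grades.
overall = np.average(
    [grade4_score, grade8_score],
    weights=[grade4_students, grade8_students],
)

# Unweighted mean, for comparison.
simple = (grade4_score + grade8_score) / 2

print(overall, simple)  # 230.0 242.0
```

Because the 4th grade has three times as many students here, the weighted score (230) sits closer to the 4th-grade score than the simple mean (242) does.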

#### 2. What are the correlations between this newly created score variable and the expenditure types? Which one of the expenditure types is more correlated with it than the others?

In [4]:
# Examine correlation between variables

education2_df[["overall_score", "TOTAL_EXPENDITURE", "INSTRUCTION_EXPENDITURE",
              "SUPPORT_SERVICES_EXPENDITURE", "OTHER_EXPENDITURE", "CAPITAL_OUTLAY_EXPENDITURE"]].corr()

| | overall_score | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | SUPPORT_SERVICES_EXPENDITURE | OTHER_EXPENDITURE | CAPITAL_OUTLAY_EXPENDITURE |
| --- | --- | --- | --- | --- | --- | --- |
| overall_score | 1.0 | 0.204239 | 0.207131 | 0.221824 | 0.169775 | 0.128087 |
| TOTAL_EXPENDITURE | 0.204239 | 1.0 | 0.992698 | 0.992435 | 0.951726 | 0.928129 |
| INSTRUCTION_EXPENDITURE | 0.207131 | 0.992698 | 1.0 | 0.979165 | 0.920297 | 0.895527 |
| SUPPORT_SERVICES_EXPENDITURE | 0.221824 | 0.992435 | 0.979165 | 1.0 | 0.953411 | 0.905265 |
| OTHER_EXPENDITURE | 0.169775 | 0.951726 | 0.920297 | 0.953411 | 1.0 | 0.923468 |
| CAPITAL_OUTLAY_EXPENDITURE | 0.128087 | 0.928129 | 0.895527 | 0.905265 | 0.923468 | 1.0 |


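The overall score is most correlated with support services expenditure (0.22), followed closely by instruction expenditure (0.21) and total expenditure (0.20). All of these correlations are weak.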
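
Reading the ranking off a full matrix is error-prone, so a sorted view can help. The sketch below rebuilds the `overall_score` row from the values reported in the correlation matrix above; the names `corr_with_score` and `ranked` are introduced here for illustration only:

```python
import pandas as pd

# Correlations with overall_score, copied from the matrix above.
corr_with_score = pd.Series({
    "TOTAL_EXPENDITURE": 0.204239,
    "INSTRUCTION_EXPENDITURE": 0.207131,
    "SUPPORT_SERVICES_EXPENDITURE": 0.221824,
    "OTHER_EXPENDITURE": 0.169775,
    "CAPITAL_OUTLAY_EXPENDITURE": 0.128087,
})

# Sort from the strongest to the weakest correlation.
ranked = corr_with_score.sort_values(ascending=False)

print(ranked)
print("Most correlated:", ranked.idxmax())
```

With the real frame, the same Series could presumably be obtained by taking the `overall_score` column of the `.corr()` result and dropping the `overall_score` entry itself.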

#### 3. Now, apply PCA to the 4 expenditure types. How much of the total variance is explained by the 1st component?

In [5]:
# Apply PCA to expenditure types 

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = education2_df[["INSTRUCTION_EXPENDITURE", "SUPPORT_SERVICES_EXPENDITURE",
                  "OTHER_EXPENDITURE", "CAPITAL_OUTLAY_EXPENDITURE"]]

X = StandardScaler().fit_transform(X)

sklearn_pca = PCA(n_components=1)
education2_df["pca_1"] = sklearn_pca.fit_transform(X)

print(
    'The percentage of total variance in the dataset explained by each',
    'component from Sklearn PCA.\n',
    sklearn_pca.explained_variance_ratio_
)

The percentage of total variance in the dataset explained by each component from Sklearn PCA.
 [0.94725496]



About 94.7% of the total variance is explained by the first principal component.
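
As a sanity check, PCA on standardized variables is equivalent to an eigendecomposition of their correlation matrix, so the first component's share of variance is the largest eigenvalue divided by the number of variables. The sketch below applies this to the four expenditure correlations reported in question 2 (rounded to six decimals, so the result is only approximate):

```python
import numpy as np

# Correlation matrix of the four expenditure types, copied from the table
# in question 2 (order: instruction, support services, other, capital outlay).
corr = np.array([
    [1.0,      0.979165, 0.920297, 0.895527],
    [0.979165, 1.0,      0.953411, 0.905265],
    [0.920297, 0.953411, 1.0,      0.923468],
    [0.895527, 0.905265, 0.923468, 1.0],
])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(corr)

# The eigenvalues sum to the trace (4), so this is largest eigenvalue / 4.
first_ratio = eigenvalues[-1] / eigenvalues.sum()

print(first_ratio)
```

The printed ratio should land close to the 0.947 reported by `explained_variance_ratio_`.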

#### 4. What is the correlation between the overall score variable and the 1st principal component?

In [6]:
# Examine correlation between variables

education2_df[["overall_score", "pca_1", "TOTAL_EXPENDITURE", "INSTRUCTION_EXPENDITURE",
              "SUPPORT_SERVICES_EXPENDITURE", "OTHER_EXPENDITURE", "CAPITAL_OUTLAY_EXPENDITURE"]].corr()

| | overall_score | pca_1 | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | SUPPORT_SERVICES_EXPENDITURE | OTHER_EXPENDITURE | CAPITAL_OUTLAY_EXPENDITURE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| overall_score | 1.0 | 0.187067 | 0.204239 | 0.207131 | 0.221824 | 0.169775 | 0.128087 |
| pca_1 | 0.187067 | 1.0 | 0.992988 | 0.975096 | 0.986139 | 0.975451 | 0.956156 |
| TOTAL_EXPENDITURE | 0.204239 | 0.992988 | 1.0 | 0.992698 | 0.992435 | 0.951726 | 0.928129 |
| INSTRUCTION_EXPENDITURE | 0.207131 | 0.975096 | 0.992698 | 1.0 | 0.979165 | 0.920297 | 0.895527 |
| SUPPORT_SERVICES_EXPENDITURE | 0.221824 | 0.986139 | 0.992435 | 0.979165 | 1.0 | 0.953411 | 0.905265 |
| OTHER_EXPENDITURE | 0.169775 | 0.975451 | 0.951726 | 0.920297 | 0.953411 | 1.0 | 0.923468 |
| CAPITAL_OUTLAY_EXPENDITURE | 0.128087 | 0.956156 | 0.928129 | 0.895527 | 0.905265 | 0.923468 | 1.0 |


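The correlation between the overall score and the first principal component (0.187) is lower than the correlation between the overall score and support services expenditure (0.222). It is also lower than the correlations for instruction expenditure (0.207) and total expenditure (0.204).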
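
One way to see why the component cannot beat the best single variable here: for standardized inputs, the correlation between the score and the first component equals the loading-weighted sum of the individual correlations, divided by the square root of the largest eigenvalue. The sketch below checks this with the correlations reported in the tables above (rounded values, so the match is approximate; the variable names are illustrative):

```python
import numpy as np

# Correlations among the four expenditure types, from the tables above
# (order: instruction, support services, other, capital outlay).
corr = np.array([
    [1.0,      0.979165, 0.920297, 0.895527],
    [0.979165, 1.0,      0.953411, 0.905265],
    [0.920297, 0.953411, 1.0,      0.923468],
    [0.895527, 0.905265, 0.923468, 1.0],
])

# Correlation of overall_score with each expenditure type (same order).
r = np.array([0.207131, 0.221824, 0.169775, 0.128087])

# eigh returns eigenvalues in ascending order with matching eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
lam1 = eigenvalues[-1]
v1 = eigenvectors[:, -1]

# The sign of an eigenvector is arbitrary; orient it so loadings are positive.
if v1.sum() < 0:
    v1 = -v1

# corr(score, PC1) = (v1 . r) / sqrt(lambda_1) for standardized inputs.
corr_pc1_score = v1 @ r / np.sqrt(lam1)

print(v1)
print(corr_pc1_score)
```

The loadings come out nearly equal, so the component behaves like an average of the four variables, and its correlation with the score is pulled down by the weaker `OTHER_EXPENDITURE` and `CAPITAL_OUTLAY_EXPENDITURE` correlations.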

#### 5. If you were to choose the best variables for your model, would you prefer using the 1st principal component instead of the expenditure variables? Why?


Since the support services expenditure variable is more correlated with the overall score than the first principal component is, I would prefer using the expenditure variables instead.

Also, PCA works best when the variables involved have weak to moderately strong correlations. The expenditure variables are very highly correlated with one another (all pairwise correlations exceed 0.89), which may result in instability.