## CSMODEL Project 1 - Statistical Inference

by: **Jericho Dizon** and **Patrick Narvasa**

This notebook is an analysis of the [U.S. Education Datasets: Unification Project](https://www.kaggle.com/noriuk/us-education-datasets-unification-project) dataset on Kaggle. It is our submission for the requirements of our CSMODEL class, SY 2020-2021 Term 3.

The U.S. Education Datasets: Unification Project is an effort by Roy Garrard to reflect the "multiple facets" of the U.S. education system in one CSV file. He aggregated data collected from various sources, mainly the U.S. Census Bureau and the National Center for Education Statistics (NCES). The dataset contains the number of students enrolled per year in each grade level, broken down by state. The data spans from 1992 to the last update in 2019.


## Collection Process
The data were gathered from different online sources that maintain databases on enrollment, financials, and academic achievement. These sources draw on surveys and reports from government websites such as the U.S. Census Bureau and the National Center for Education Statistics, available at the following links:
- Enrollment: https://nces.ed.gov/ccd/stnfis.asp
- Financials: https://www.census.gov/programs-surveys/school-finances/data/tables.html
- Academic Achievement: https://www.nationsreportcard.gov/ndecore/xplore/NDE

## Structure of the Data File

In the Kaggle library, there are two (2) CSV files: states_all.csv and states_all_extended.csv. The states_all.csv file is an aggregated version of the data found in states_all_extended.csv. To look at the bigger picture ourselves, we will be looking into the second file only: **states_all_extended.csv**.

There are 1715 observations in the states_all_extended file. Each observation represents one U.S. state in one year.

Example: 1992_ALABAMA is different from 2000_ALABAMA

The example given is from the column PRIMARY_KEY, which combines the year and the state into a single identifier. This makes it easier to select the state and year we want to observe, although two separate columns (STATE, YEAR) also contain those variables.
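As a minimal sketch of how such a key relates to the STATE and YEAR columns, the snippet below rebuilds a key of the form `YEAR_STATE` on a small toy table and checks that filtering by key or by the two columns selects the same row. The `toy_df` name and its rows are illustrative assumptions and are not read from the dataset.

```python
import pandas as pd

# Toy rows for illustration only (hypothetical, not read from the CSV).
toy_df = pd.DataFrame({
    "STATE": ["ALABAMA", "ALABAMA", "ALASKA"],
    "YEAR": [1992, 2000, 1992],
})

# Build a key of the form YEAR_STATE, matching the 1992_ALABAMA example.
toy_df["PRIMARY_KEY"] = toy_df["YEAR"].astype(str) + "_" + toy_df["STATE"]

# Select one observation by its key...
by_key = toy_df[toy_df["PRIMARY_KEY"] == "2000_ALABAMA"]

# ...or by the two separate columns.
by_cols = toy_df[(toy_df["STATE"] == "ALABAMA") & (toy_df["YEAR"] == 2000)]

# Both filters should pick out the same single row.
print(by_key.equals(by_cols))
```

Filtering on the two separate columns is usually more convenient when a whole year or a whole state is needed, while the key is handy for picking out a single observation.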

The columns hold data about the schools of a given state in a given year. They mainly cover the number of students per grade level, the state revenue generated by the schools, academic performance, race, gender, etc.


## Variables

## Research Questions

In this analysis, we want to find out the following:
1) Is there a significant difference between the NAEP scores of all states in 2019 and the scores of all states in 2015?
2) 
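A question like the first one could be approached with a two-sample t-test, using the `ttest_ind` function imported in this notebook. The sketch below is only an illustration under stated assumptions: the `SCORE` column name is hypothetical, the values are made up rather than real NAEP results, the 0.05 significance level is assumed, and Welch's variant (`equal_var=False`) is one possible choice rather than the method this analysis commits to.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up scores for illustration only; these are NOT real NAEP results.
toy_df = pd.DataFrame({
    "YEAR": [2015, 2015, 2015, 2015, 2019, 2019, 2019, 2019],
    "SCORE": [281.0, 275.0, 290.0, 284.0, 280.0, 277.0, 288.0, 283.0],  # hypothetical column
})

# Split the scores by year and drop missing values before testing.
scores_2015 = toy_df.loc[toy_df["YEAR"] == 2015, "SCORE"].dropna()
scores_2019 = toy_df.loc[toy_df["YEAR"] == 2019, "SCORE"].dropna()

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = ttest_ind(scores_2015, scores_2019, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

Because the same states appear in both years, a paired test may also be worth considering; the independent-samples version is shown here only because it matches the notebook's imports.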

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import ttest_ind

## Reading the CSV

In [4]:
states_df = pd.read_csv('states_all.csv')
states_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1715 entries, 0 to 1714
Data columns (total 25 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PRIMARY_KEY                   1715 non-null   object 
 1   STATE                         1715 non-null   object 
 2   YEAR                          1715 non-null   int64  
 3   ENROLL                        1224 non-null   float64
 4   TOTAL_REVENUE                 1275 non-null   float64
 5   FEDERAL_REVENUE               1275 non-null   float64
 6   STATE_REVENUE                 1275 non-null   float64
 7   LOCAL_REVENUE                 1275 non-null   float64
 8   TOTAL_EXPENDITURE             1275 non-null   float64
 9   INSTRUCTION_EXPENDITURE       1275 non-null   float64
 10  SUPPORT_SERVICES_EXPENDITURE  1275 non-null   float64
 11  OTHER_EXPENDITURE             1224 non-null   float64
 12  CAPITAL_OUTLAY_EXPENDITURE    1275 non-null   float64
 13  GRA
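The non-null counts above show that some columns are incomplete; for example, ENROLL has 1224 non-null values out of 1715 entries. A minimal sketch of how missing values could be counted per column is shown below. It uses a small toy table (`toy_df`, with illustrative values) so that it runs on its own; the same calls could be applied to `states_df`.

```python
import numpy as np
import pandas as pd

# Toy table for illustration only; the values are not meant to match the dataset.
toy_df = pd.DataFrame({
    "STATE": ["ALABAMA", "ALASKA", "ARIZONA"],
    "ENROLL": [np.nan, np.nan, 1000.0],
    "TOTAL_REVENUE": [2678885.0, 1049591.0, 3258079.0],
})

# Number of missing values in each column.
missing_counts = toy_df.isna().sum()

# Share of missing values in each column (between 0 and 1).
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share.round(2))
```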

In [3]:
states_df.head(10)

Unnamed: 0,PRIMARY_KEY,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,...,G08_HI_A_READING,G08_HI_A_MATHEMATICS,G08_AS_A_READING,G08_AS_A_MATHEMATICS,G08_AM_A_READING,G08_AM_A_MATHEMATICS,G08_HP_A_READING,G08_HP_A_MATHEMATICS,G08_TR_A_READING,G08_TR_A_MATHEMATICS
0,1992_ALABAMA,ALABAMA,1992,,2678885.0,304177.0,1659028.0,715680.0,2653798.0,1481703.0,...,,,,,,,,,,
1,1992_ALASKA,ALASKA,1992,,1049591.0,106780.0,720711.0,222100.0,972488.0,498362.0,...,,,,,,,,,,
2,1992_ARIZONA,ARIZONA,1992,,3258079.0,297888.0,1369815.0,1590376.0,3401580.0,1435908.0,...,,,,,,,,,,
3,1992_ARKANSAS,ARKANSAS,1992,,1711959.0,178571.0,958785.0,574603.0,1743022.0,964323.0,...,,,,,,,,,,
4,1992_CALIFORNIA,CALIFORNIA,1992,,26260025.0,2072470.0,16546514.0,7641041.0,27138832.0,14358922.0,...,,,,,,,,,,
5,1992_COLORADO,COLORADO,1992,,3185173.0,163253.0,1307986.0,1713934.0,3264826.0,1642466.0,...,,,,,,,,,,
6,1992_CONNECTICUT,CONNECTICUT,1992,,3834302.0,143542.0,1342539.0,2348221.0,3721338.0,2148041.0,...,,,,,,,,,,
7,1992_DELAWARE,DELAWARE,1992,,645233.0,45945.0,420942.0,178346.0,638784.0,372722.0,...,,,,,,,,,,
8,1992_DISTRICT_OF_COLUMBIA,DISTRICT_OF_COLUMBIA,1992,,709480.0,64749.0,0.0,644731.0,742893.0,329160.0,...,,,,,,,,,,
9,1992_FLORIDA,FLORIDA,1992,,11506299.0,788420.0,5683949.0,5033930.0,11305642.0,5166374.0,...,,,,,,,,,,
