## CSMODEL Project 1 - Statistical Inference

by: **Jericho Dizon** and **Patrick Narvasa**

This notebook is an analysis of the [U.S. Education Datasets: Unification Project](https://www.kaggle.com/noriuk/us-education-datasets-unification-project) found on Kaggle. It serves as our submission for the requirements of our class CSMODEL, SY 2020-2021 Term 3.

The U.S. Education Datasets: Unification Project is an effort by Roy Garrard to reflect the "multiple facets" of the US education system in one CSV file. He aggregated data collected from various sources, mainly the U.S. Census Bureau and the National Center for Education Statistics (NCES). It contains the number of students enrolled each year in each grade level, separated by state. The data spans from 1992 to the last update in 2019.

**NAEP** stands for the National Assessment of Educational Progress, [which is an assessment of the educational system in the US](https://nces.ed.gov/nationsreportcard/about/). It is an exam taken by students in Grades 4, 8, and 12. Scores in Grades 4 and 8 range from 0 to 500, while scores in Grade 12 range from 0 to 300. Other subjects such as Science and Geography are also assessed, mostly in the Grade 12 exams, but the subjects common to all grades, and the most important to examine, are **Reading and Math**. NAEP is the source of the academic score columns in Roy Garrard's dataset.


## Collection Process
The data were gathered from online sources that maintain databases on enrollment, finances, and academic achievement. The dataset draws on surveys and reports published on government websites such as the U.S. Census Bureau and the National Center for Education Statistics, at the following links:
- Enrollment: https://nces.ed.gov/ccd/stnfis.asp
- Financials: https://www.census.gov/programs-surveys/school-finances/data/tables.html
- Academic Achievement: https://www.nationsreportcard.gov/ndecore/xplore/NDE

## Structure of the Data File

On Kaggle, there are two (2) CSV files: states_all.csv and states_all_extended.csv. states_all.csv is an aggregated version of the data found in states_all_extended.csv. We will use only **states_all.csv** since it encompasses all the data we need for the analysis.

There are 1715 observations in the states_all file. Each observation represents one U.S. state in one year.

Example: 1992_ALABAMA is different from 2000_ALABAMA

The example above comes from the PRIMARY_KEY column, which combines the year and the state into one value. This makes it easy to select the state and year we want to observe, although two separate columns (STATE, YEAR) also contain those variables.
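As a small illustration of the two equivalent lookups, the sketch below uses a hand-made three-row stand-in for the real data (the name `sample_df` and the `2000_ALASKA` row are assumptions made for the example, not values checked against the file). It selects the same observation once through PRIMARY_KEY and once through STATE and YEAR, then checks that the key is simply the year and state joined by an underscore.

```python
import pandas as pd

# Hand-made stand-in for the real data: only the three identifying columns.
# The first two keys are the ones mentioned in the text; the third is hypothetical.
sample_df = pd.DataFrame({
    "PRIMARY_KEY": ["1992_ALABAMA", "2000_ALABAMA", "2000_ALASKA"],
    "STATE": ["ALABAMA", "ALABAMA", "ALASKA"],
    "YEAR": [1992, 2000, 2000],
})

# Lookup 1: a single comparison against PRIMARY_KEY.
by_key = sample_df[sample_df["PRIMARY_KEY"] == "2000_ALABAMA"]

# Lookup 2: the same observation through the separate STATE and YEAR columns.
by_state_year = sample_df[
    (sample_df["STATE"] == "ALABAMA") & (sample_df["YEAR"] == 2000)
]

# Rebuild the key from YEAR and STATE to confirm the naming pattern.
rebuilt_key = sample_df["YEAR"].astype(str) + "_" + sample_df["STATE"]

print(by_key.equals(by_state_year))
print((rebuilt_key == sample_df["PRIMARY_KEY"]).all())
```

Filtering on STATE and YEAR is usually more convenient when we want all states in one year, while PRIMARY_KEY is handy for picking out a single observation.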

The columns hold data about the schools of a given state in a given year. They mainly cover the number of students per grade level, school revenue and expenditure, and academic performance (NAEP).


## Variables

The following are descriptions of the variables in the `states_all.csv` file:

- **`STATE`**: name of state in the USA
- **`YEAR`**: year in which the state's scores were recorded
- **`AVG_MATH_4_SCORE`**: average score of grade 4 students in math
- **`AVG_MATH_8_SCORE`**: average score of grade 8 students in math
- **`AVG_READING_4_SCORE`**: average score of grade 4 students in reading
- **`AVG_READING_8_SCORE`**: average score of grade 8 students in reading


## Research Questions

In this analysis, we want to find out the following:
1) Is there a significant difference between the NAEP scores of all states in 2019 compared to the scores of all states in 2005?
2) Is there a significant difference between the reading and math NAEP scores per state in 2019?

## Imports

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import ttest_ind
```

## Reading the CSV

```python
states_df = pd.read_csv('states_all.csv')
states_df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1715 entries, 0 to 1714
Data columns (total 25 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PRIMARY_KEY                   1715 non-null   object 
 1   STATE                         1715 non-null   object 
 2   YEAR                          1715 non-null   int64  
 3   ENROLL                        1224 non-null   float64
 4   TOTAL_REVENUE                 1275 non-null   float64
 5   FEDERAL_REVENUE               1275 non-null   float64
 6   STATE_REVENUE                 1275 non-null   float64
 7   LOCAL_REVENUE                 1275 non-null   float64
 8   TOTAL_EXPENDITURE             1275 non-null   float64
 9   INSTRUCTION_EXPENDITURE       1275 non-null   float64
 10  SUPPORT_SERVICES_EXPENDITURE  1275 non-null   float64
 11  OTHER_EXPENDITURE             1224 non-null   float64
 12  CAPITAL_OUTLAY_EXPENDITURE    1275 non-null   float64
 13  GRA
```

```python
states_df.tail(53)
```

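| | PRIMARY_KEY | STATE | YEAR | ENROLL | TOTAL_REVENUE | FEDERAL_REVENUE | STATE_REVENUE | LOCAL_REVENUE | TOTAL_EXPENDITURE | INSTRUCTION_EXPENDITURE | ... | GRADES_4_G | GRADES_8_G | GRADES_12_G | GRADES_1_8_G | GRADES_9_12_G | GRADES_ALL_G | AVG_MATH_4_SCORE | AVG_MATH_8_SCORE | AVG_READING_4_SCORE | AVG_READING_8_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1662 | 2019_ALABAMA | ALABAMA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 230.0 | 269.0 | 212.0 | 253.0 |
| 1663 | 2019_ALASKA | ALASKA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 232.0 | 274.0 | 204.0 | 252.0 |
| 1664 | 2019_ARIZONA | ARIZONA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 238.0 | 280.0 | 216.0 | 259.0 |
| 1665 | 2019_ARKANSAS | ARKANSAS | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 233.0 | 274.0 | 215.0 | 259.0 |
| 1666 | 2019_CALIFORNIA | CALIFORNIA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 235.0 | 276.0 | 216.0 | 259.0 |
| 1667 | 2019_COLORADO | COLORADO | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 242.0 | 285.0 | 225.0 | 267.0 |
| 1668 | 2019_CONNECTICUT | CONNECTICUT | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 243.0 | 286.0 | 224.0 | 270.0 |
| 1669 | 2019_DELAWARE | DELAWARE | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 239.0 | 277.0 | 218.0 | 260.0 |
| 1670 | 2019_DISTRICT_OF_COLUMBIA | DISTRICT_OF_COLUMBIA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 235.0 | 269.0 | 214.0 | 250.0 |
| 1671 | 2019_DODEA | DODEA | 2019 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 250.0 | 292.0 | 235.0 | 280.0 |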


## Data Cleaning
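The `info()` output above shows that several columns have fewer non-null values than the 1715 rows, and the 2019 rows carry scores but no enrollment or finance values. One possible cleaning step, sketched here as an assumption rather than a settled method, is to keep only the identifying columns and the four score columns, then drop rows that are missing any score. The helper name `keep_complete_scores`, the frame `sample_df`, and the `EXAMPLE_STATE` row are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

SCORE_COLS = [
    "AVG_MATH_4_SCORE", "AVG_MATH_8_SCORE",
    "AVG_READING_4_SCORE", "AVG_READING_8_SCORE",
]

def keep_complete_scores(df):
    """Return STATE, YEAR and the score columns for rows with no missing score."""
    subset = df[["STATE", "YEAR"] + SCORE_COLS]
    return subset.dropna(subset=SCORE_COLS).reset_index(drop=True)

# Stand-in for states_df. The first two rows are copied from the tail() output;
# the last row is a hypothetical record with no scores, added to show the filter.
sample_df = pd.DataFrame({
    "STATE": ["ALABAMA", "ALASKA", "EXAMPLE_STATE"],
    "YEAR": [2019, 2019, 2019],
    "ENROLL": [np.nan, np.nan, np.nan],
    "AVG_MATH_4_SCORE": [230.0, 232.0, np.nan],
    "AVG_MATH_8_SCORE": [269.0, 274.0, np.nan],
    "AVG_READING_4_SCORE": [212.0, 204.0, np.nan],
    "AVG_READING_8_SCORE": [253.0, 252.0, np.nan],
})

clean_df = keep_complete_scores(sample_df)
print(clean_df)
```

Dropping every row with a missing score is the strictest choice. If only one score column is needed for a question, restricting `dropna` to that column would keep more observations.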

## Exploratory Data Analysis
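A first look at the score columns could start from simple summary statistics. The sketch below uses only the ten 2019 rows visible in the `tail()` output above, so the numbers it prints describe that small excerpt and not the full set of states. The names `scores_2019` and `summary` are assumptions made for the example.

```python
import pandas as pd

# The ten 2019 rows shown in the tail() output (identifier and score columns only).
scores_2019 = pd.DataFrame({
    "STATE": [
        "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA",
        "COLORADO", "CONNECTICUT", "DELAWARE", "DISTRICT_OF_COLUMBIA", "DODEA",
    ],
    "AVG_MATH_4_SCORE": [230.0, 232.0, 238.0, 233.0, 235.0, 242.0, 243.0, 239.0, 235.0, 250.0],
    "AVG_MATH_8_SCORE": [269.0, 274.0, 280.0, 274.0, 276.0, 285.0, 286.0, 277.0, 269.0, 292.0],
    "AVG_READING_4_SCORE": [212.0, 204.0, 216.0, 215.0, 216.0, 225.0, 224.0, 218.0, 214.0, 235.0],
    "AVG_READING_8_SCORE": [253.0, 252.0, 259.0, 259.0, 259.0, 267.0, 270.0, 260.0, 250.0, 280.0],
})

# Mean, median and sample standard deviation of each score column.
summary = scores_2019.drop(columns="STATE").agg(["mean", "median", "std"])
print(summary.round(2))
```

The same `agg` call can be applied to the full `states_df` after filtering by YEAR to compare the distributions across years.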

## Is there a significant difference between the NAEP scores of all states in 2019 compared to the scores of all states in 2005?
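One way to approach this question is a two-sample t-test on a single score column, using `ttest_ind` from the imports. The sketch below is a minimal illustration under stated assumptions: the helper `compare_years` is hypothetical, the test treats the two years as independent samples with unequal variances (Welch's variant, chosen here as an assumption), and the frame `toy_df` holds made-up numbers that are not NAEP results.

```python
import pandas as pd
from scipy.stats import ttest_ind

def compare_years(df, score_col, year_a, year_b, alpha=0.05):
    """Welch's two-sided t-test on one score column between two years."""
    a = df.loc[df["YEAR"] == year_a, score_col].dropna()
    b = df.loc[df["YEAR"] == year_b, score_col].dropna()
    t_stat, p_value = ttest_ind(a, b, equal_var=False)
    return {
        "t": float(t_stat),
        "p": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }

# Toy data: MADE-UP values for five placeholder states, NOT real NAEP scores.
states = ["STATE_A", "STATE_B", "STATE_C", "STATE_D", "STATE_E"]
toy_df = pd.DataFrame({
    "STATE": states * 2,
    "YEAR": [2005] * 5 + [2019] * 5,
    "AVG_MATH_4_SCORE": [220.0, 225.0, 230.0, 235.0, 240.0,
                         230.0, 235.0, 240.0, 245.0, 250.0],
})

result = compare_years(toy_df, "AVG_MATH_4_SCORE", 2005, 2019)
print(result)
```

With the real data, the call would take `states_df` in place of `toy_df` and be repeated for each score column. Because the same states appear in both years, a paired test such as `scipy.stats.ttest_rel` on states matched by name is an alternative worth considering.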

## Is there a significant difference between the reading and math NAEP scores per state in 2019?
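Reading this question as a comparison of mean scores, a possible starting point is again `ttest_ind`. The sketch below compares the Grade 4 math and reading averages for only the ten 2019 rows visible in the `tail()` output, so its result says nothing yet about the full set of states. The variable names and the choice of Welch's variant are assumptions made for the example.

```python
from scipy.stats import ttest_ind

# Grade 4 averages for the ten 2019 rows visible in the tail() output.
math_4 = [230.0, 232.0, 238.0, 233.0, 235.0, 242.0, 243.0, 239.0, 235.0, 250.0]
reading_4 = [212.0, 204.0, 216.0, 215.0, 216.0, 225.0, 224.0, 218.0, 214.0, 235.0]

# Two-sided Welch's t-test: H0 is that the two mean scores are equal.
t_stat, p_value = ttest_ind(math_4, reading_4, equal_var=False)

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```

Since each state contributes both a math and a reading score, the two samples are not independent, and a paired test (`scipy.stats.ttest_rel`) may fit the question better. The same comparison can be repeated for the Grade 8 columns.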