# Final Project
## By: Rohan Ondkar, Naveen Harish, Melvin Gonsalves

# Part 1: Why are we collecting this data?
Early education is important for a number of reasons. First and foremost, young children are at a critical stage in their development, and early education can play a crucial role in shaping their future. By providing children with a strong foundation in key areas such as language, literacy, and mathematics, early education can help to prepare them for success in school and in life.

Another reason why early education is important is that it can help to narrow the achievement gap that often exists between children from different backgrounds. Children who receive high-quality early education are more likely to do well in school and go on to graduate from high school and attend college. This can help to break the cycle of poverty that affects many families and communities.

One major factor contributing to unequal education in the US is the unequal distribution of funding for schools. In many states, schools are funded primarily through property taxes, which means that schools in more affluent areas tend to have more funding than those in poorer areas. This leads to a situation where schools in wealthier neighborhoods are able to offer more advanced classes, extracurricular activities, and resources such as computers and modern facilities, while schools in poorer areas may struggle to provide even basic education.

This is exactly what we will be trying to to prove throughout this tutorial. Our focus is to confirm whether education inequality is reflected by national math and reading examination differences. One approach you could take is to collect data on national math and reading exam scores, and compare them across different demographic groups. This could include looking at factors such as race, income, and geographic location, to see if there are any significant differences in exam scores

Once we have collected and analyzed this data, we can then use it to confirm whether education inequality is reflected by national math and reading examination differences, and identify the factors that may be contributing to these differences. This information can then be used to inform efforts to address education inequality and improve early education for all students.

We will be using math and read as the subjects to look into because they are the most consistent and are taught at all school at an elementary level. It is common to use national math and reading exam scores as a way to measure educational achievement and inequality, because these subjects are considered fundamental to a student's overall educational experience. Math and reading skills are essential for success in many other academic subjects, as well as in daily life, and they are often included in standardized tests as a way to assess a student's overall academic performance. Additionally, math and reading exam scores can provide a more objective measure of educational achievement and inequality than other indicators, such as graduation rates or self-reported survey data. This can make them a useful tool for identifying and addressing disparities in educational opportunities and outcomes.

Our null hypothesis is that none of the factors mentioned above do not impact children at an elementary level by state. Our null hypothesis is that none of the factors mentioned above do impact children at an elementary level by state.

# Part 2: Data Management/Representation
First we have to import the necessary libraries that we need to load the dataset. We are using pandas, numpy, and matplotlib.pyplot.<br>
Pandas is used for the DataFrame object since that is an easy way to store tabular data. <br>
Numpy is used for its math functionality.<br>
Mathplotlib.pyplot is used to plot graphs demonstrating relationships between variables in our data.

In [120]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Getting the csv file and displaying it. Found at https://www.kaggle.com/datasets/noriuk/us-education-datasets-unification-project

In [121]:
school = pd.read_csv("education_states.csv")

# display first few rows
school

Unnamed: 0,PRIMARY_KEY,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,...,G08_HI_A_READING,G08_HI_A_MATHEMATICS,G08_AS_A_READING,G08_AS_A_MATHEMATICS,G08_AM_A_READING,G08_AM_A_MATHEMATICS,G08_HP_A_READING,G08_HP_A_MATHEMATICS,G08_TR_A_READING,G08_TR_A_MATHEMATICS
0,1992_ALABAMA,ALABAMA,1992,,2678885.0,304177.0,1659028.0,715680.0,2653798.0,1481703.0,...,,,,,,,,,,
1,1992_ALASKA,ALASKA,1992,,1049591.0,106780.0,720711.0,222100.0,972488.0,498362.0,...,,,,,,,,,,
2,1992_ARIZONA,ARIZONA,1992,,3258079.0,297888.0,1369815.0,1590376.0,3401580.0,1435908.0,...,,,,,,,,,,
3,1992_ARKANSAS,ARKANSAS,1992,,1711959.0,178571.0,958785.0,574603.0,1743022.0,964323.0,...,,,,,,,,,,
4,1992_CALIFORNIA,CALIFORNIA,1992,,26260025.0,2072470.0,16546514.0,7641041.0,27138832.0,14358922.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1710,2019_VIRGINIA,VIRGINIA,2019,,,,,,,,...,247.0,278.0,286.0,315.0,,,,,269.0,293.0
1711,2019_WASHINGTON,WASHINGTON,2019,,,,,,,,...,248.0,267.0,285.0,315.0,237.0,259.0,,,263.0,292.0
1712,2019_WEST_VIRGINIA,WEST_VIRGINIA,2019,,,,,,,,...,,,,,,,,,249.0,
1713,2019_WISCONSIN,WISCONSIN,2019,,,,,,,,...,251.0,273.0,277.0,294.0,253.0,267.0,,,268.0,276.0


PRIMARY_KEY is not needed for this table. We have STATE and YEAR, so we do not need PRIMARY_KEY because it provides no new information

In [122]:
school = school.drop(columns=['PRIMARY_KEY'])

school

Unnamed: 0,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,SUPPORT_SERVICES_EXPENDITURE,...,G08_HI_A_READING,G08_HI_A_MATHEMATICS,G08_AS_A_READING,G08_AS_A_MATHEMATICS,G08_AM_A_READING,G08_AM_A_MATHEMATICS,G08_HP_A_READING,G08_HP_A_MATHEMATICS,G08_TR_A_READING,G08_TR_A_MATHEMATICS
0,ALABAMA,1992,,2678885.0,304177.0,1659028.0,715680.0,2653798.0,1481703.0,735036.0,...,,,,,,,,,,
1,ALASKA,1992,,1049591.0,106780.0,720711.0,222100.0,972488.0,498362.0,350902.0,...,,,,,,,,,,
2,ARIZONA,1992,,3258079.0,297888.0,1369815.0,1590376.0,3401580.0,1435908.0,1007732.0,...,,,,,,,,,,
3,ARKANSAS,1992,,1711959.0,178571.0,958785.0,574603.0,1743022.0,964323.0,483488.0,...,,,,,,,,,,
4,CALIFORNIA,1992,,26260025.0,2072470.0,16546514.0,7641041.0,27138832.0,14358922.0,8520926.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1710,VIRGINIA,2019,,,,,,,,,...,247.0,278.0,286.0,315.0,,,,,269.0,293.0
1711,WASHINGTON,2019,,,,,,,,,...,248.0,267.0,285.0,315.0,237.0,259.0,,,263.0,292.0
1712,WEST_VIRGINIA,2019,,,,,,,,,...,,,,,,,,,249.0,
1713,WISCONSIN,2019,,,,,,,,,...,251.0,273.0,277.0,294.0,253.0,267.0,,,268.0,276.0


We are going to use data after 2009 because anything before this does not date DEMOGRAPHICS into account... and thats kinda what we want to do

In [123]:
#setup for 2009. Getting the previous and current length setup
prev = len(school.index)
school = school[school['YEAR'] >= 2009]
curr = len(school.index)



# Part 3: Exploratory Data Analysis

*start here melvin and Nav*

# Part 4: Hypothesis Testing

G04_A_A_READING stands for reading averages from all races from grade 3 (2009-2019) <br>
G04_A_A_MATHEMATICS stands for math averages from all races from grade 3 (2009-2019)

In [124]:
#finding the average read writing scores for each 
avg_score = school[['STATE', 'YEAR', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS']]

avg_score

Unnamed: 0,STATE,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS
867,ALABAMA,2009,216.0,228.0
868,ALASKA,2009,211.0,237.0
869,ARIZONA,2009,210.0,230.0
870,ARKANSAS,2009,216.0,238.0
871,CALIFORNIA,2009,210.0,232.0
...,...,...,...,...
1710,VIRGINIA,2019,224.0,247.0
1711,WASHINGTON,2019,220.0,240.0
1712,WEST_VIRGINIA,2019,213.0,231.0
1713,WISCONSIN,2019,220.0,242.0


Our formula for getting the growth of reading and math score, we subtracted the 2009 average score from each data point (2010-2019) <br>
This is our score change, which we will use for both math (MATH_CHANGE) and reading (READING_CHANGE)

### MATH_CHANGE

In [125]:
#MATH CHANGE

#set MATH_CHANGE = NaN
avg_score['MATH_CHANGE'] = np.NaN

#function to find and return average math for 2009
def find_math_scores_2009(row):
    state_row = row['STATE']
    new = avg_score.loc[avg_score['STATE'] == state_row]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_MATHEMATICS']

# Subtract the 2009 average score from each data point (2010-2019)
for i, row in avg_score.iterrows():
    avg_score.at[i, 'MATH_CHANGE'] = row['G04_A_A_MATHEMATICS'] - find_math_scores_2009(row)

avg_score

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  avg_score['MATH_CHANGE'] = np.NaN


Unnamed: 0,STATE,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS,MATH_CHANGE
867,ALABAMA,2009,216.0,228.0,0.0
868,ALASKA,2009,211.0,237.0,0.0
869,ARIZONA,2009,210.0,230.0,0.0
870,ARKANSAS,2009,216.0,238.0,0.0
871,CALIFORNIA,2009,210.0,232.0,0.0
...,...,...,...,...,...
1710,VIRGINIA,2019,224.0,247.0,4.0
1711,WASHINGTON,2019,220.0,240.0,-2.0
1712,WEST_VIRGINIA,2019,213.0,231.0,-2.0
1713,WISCONSIN,2019,220.0,242.0,-2.0


### READING_CHANGE

In [126]:
#MATH CHANGE

#set READING_CHANGE = NaN
avg_score['READING_CHANGE'] = np.NaN

#function to find and return average reading for 2009
def find_reading_scores_2009(row):
    state_row = row['STATE']
    new = avg_score.loc[avg_score['STATE'] == state_row]
    new = new.loc[new['YEAR'] == 2009]
    return new['G04_A_A_READING']

# Subtract the 2009 average score from each data point (2010-2019)
for i, row in avg_score.iterrows():
    avg_score.at[i, 'READING_CHANGE'] = row['G04_A_A_READING'] - find_reading_scores_2009(row)

avg_score.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  avg_score['READING_CHANGE'] = np.NaN


Unnamed: 0,STATE,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS,MATH_CHANGE,READING_CHANGE
867,ALABAMA,2009,216.0,228.0,0.0,0.0
868,ALASKA,2009,211.0,237.0,0.0,0.0
869,ARIZONA,2009,210.0,230.0,0.0,0.0
870,ARKANSAS,2009,216.0,238.0,0.0,0.0
871,CALIFORNIA,2009,210.0,232.0,0.0,0.0


We can use the pandas method get_dummies to create a dataframe where each state is represented by a binary value (1 or 0) indicating whether the data value is in that state or not. We can then drop the Alabama column because if all other state columns are 0, we can assume the data value must be in Alabama. This is possible because each state is treated as a unique, independent variable.

In [127]:
# get dummies
avg_score = pd.get_dummies(avg_score, columns=['STATE'])

We can then drop the Alabama column because if all other state columns are 0, we can assume the data value must be in Alabama. This is possible because each state is treated as a unique, independent variable.

In [128]:
# drop alabama and reading and mathematics averages since we no longer need them
state_avg = avg_score.drop(columns=['STATE_ALABAMA', 'G04_A_A_READING', 'G04_A_A_MATHEMATICS'])
avg_score.head()

Unnamed: 0,YEAR,G04_A_A_READING,G04_A_A_MATHEMATICS,MATH_CHANGE,READING_CHANGE,STATE_ALABAMA,STATE_ALASKA,STATE_ARIZONA,STATE_ARKANSAS,STATE_CALIFORNIA,...,STATE_SOUTH_DAKOTA,STATE_TENNESSEE,STATE_TEXAS,STATE_UTAH,STATE_VERMONT,STATE_VIRGINIA,STATE_WASHINGTON,STATE_WEST_VIRGINIA,STATE_WISCONSIN,STATE_WYOMING
867,2009,216.0,228.0,0.0,0.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
868,2009,211.0,237.0,0.0,0.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
869,2009,210.0,230.0,0.0,0.0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
870,2009,216.0,238.0,0.0,0.0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
871,2009,210.0,232.0,0.0,0.0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In order to evaluate the accuracy of our predictor, we need to split our dataset into a training set and a test set. The training set will be used to train the predictor, and the test set will be used to evaluate its performance. In this case, we want to predict the values for the years 2017, 2018, and 2019, so we will use those rows as our test data and the remaining rows as our train data. We will first start the training data by dropping the NaN rows

In [129]:
# drop the NaN rows
trainer = state_avg[state_avg['YEAR'] < 2017].dropna()
predictor = state_avg[state_avg['YEAR'] >= 2017].dropna()

In [130]:
from sklearn.linear_model import LinearRegression

X_reading = []
y_reading = []
X_math = []
y_math = []

# iterate through each row and add the year and state to the X variables and the growths to the y variables
for i, row in trainer.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_reading.append(add)
    y_reading.append(row['READING_CHANGE'])
    X_math.append(add)
    y_math.append(row['MATH_CHANGE'])

trainer

Unnamed: 0,YEAR,MATH_CHANGE,READING_CHANGE,STATE_ALASKA,STATE_ARIZONA,STATE_ARKANSAS,STATE_CALIFORNIA,STATE_COLORADO,STATE_CONNECTICUT,STATE_DELAWARE,...,STATE_SOUTH_DAKOTA,STATE_TENNESSEE,STATE_TEXAS,STATE_UTAH,STATE_VERMONT,STATE_VIRGINIA,STATE_WASHINGTON,STATE_WEST_VIRGINIA,STATE_WISCONSIN,STATE_WYOMING
867,2009,0.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
868,2009,0.0,0.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
869,2009,0.0,0.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
870,2009,0.0,0.0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
871,2009,0.0,0.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1655,2011,1.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1656,2013,5.0,4.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1657,2013,2.0,1.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1658,2015,8.0,6.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


To create a linear regression model that can make predictions on our data, we need to first separate our data into independent and dependent variables. The independent variables are the factors that we will use to make predictions, while the dependent variables are the values that we want to predict. 

In [131]:
from sklearn.linear_model import LinearRegression

X_tester_reading = []
X_tester_math = []

reading_regression = LinearRegression().fit(X_reading, y_reading)
math_regression = LinearRegression().fit(X_math, y_math)

# accumulate X values for reading and math
for i, row in predictor.iterrows():
    add = row[3:].tolist()
    add.insert(0, row['YEAR'])
    X_tester_reading.append(add)
    X_tester_math.append(add)
    
# predict based of X values
predictor['PREDICT_MATH'] = reading_regression.predict(X_tester_reading)
predictor['PREDICT_READING'] = reading_regression.predict(X_tester_reading)


predictor

Unnamed: 0,YEAR,MATH_CHANGE,READING_CHANGE,STATE_ALASKA,STATE_ARIZONA,STATE_ARKANSAS,STATE_CALIFORNIA,STATE_COLORADO,STATE_CONNECTICUT,STATE_DELAWARE,...,STATE_TEXAS,STATE_UTAH,STATE_VERMONT,STATE_VIRGINIA,STATE_WASHINGTON,STATE_WEST_VIRGINIA,STATE_WISCONSIN,STATE_WYOMING,PREDICT_MATH,PREDICT_READING
1281,2017,4.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3.712264,3.712264
1288,2017,-7.0,-4.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.962264,0.962264
1295,2017,4.0,5.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4.212264,4.212264
1302,2017,-4.0,0.0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,3.212264,3.212264
1309,2017,0.0,5.0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,3.462264,3.462264
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1710,2019,4.0,-3.0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,3.147170,3.147170
1711,2019,-2.0,-1.0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,4.647170,4.647170
1712,2019,-2.0,-2.0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,2.397170,2.397170
1713,2019,-2.0,0.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,3.647170,3.647170


To determine whether being in a particular state has an effect on the growth of reading and math test scores, we can use the statsmodel library to calculate the p-value of each coefficient that we pass into the model. The p-value is a measure of statistical significance that indicates the probability that the observed relationship between a given independent variable and the dependent variable is due to chance. If the p-value is below a certain threshold (usually 0.05), we can conclude that the relationship is statistically significant and that the independent variable has a significant effect on the dependent variable.

In [132]:
import statsmodels.api as sm

In [133]:
# create statsmodel for math data
predict_math = sm.OLS(trainer['MATH_CHANGE'].tolist(), sm.add_constant(X_math)).fit()
predict_math.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.511
Model:,OLS,Adj. R-squared:,0.346
Method:,Least Squares,F-statistic:,3.11
Date:,"Thu, 08 Dec 2022",Prob (F-statistic):,2.35e-08
Time:,02:36:48,Log-Likelihood:,-420.79
No. Observations:,212,AIC:,949.6
Df Residuals:,158,BIC:,1131.0
Df Model:,53,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-407.2425,126.068,-3.230,0.002,-656.238,-158.247
x1,0.2038,0.063,3.252,0.001,0.080,0.328
x2,-3.5000,1.442,-2.426,0.016,-6.349,-0.651
x3,3.0000,1.442,2.080,0.039,0.151,5.849
x4,-3.0000,1.442,-2.080,0.039,-5.849,-0.151
x5,-1.7500,1.442,-1.213,0.227,-4.599,1.099
x6,-1.7500,1.442,-1.213,0.227,-4.599,1.099
x7,-5.2500,1.442,-3.640,0.000,-8.099,-2.401
x8,-1.5000,1.442,-1.040,0.300,-4.349,1.349

0,1,2,3
Omnibus:,6.335,Durbin-Watson:,1.739
Prob(Omnibus):,0.042,Jarque-Bera (JB):,6.673
Skew:,-0.294,Prob(JB):,0.0356
Kurtosis:,3.639,Cond. No.,1810000.0


In [134]:
# create statsmodel for reading data
predict_reading = sm.OLS(trainer['READING_CHANGE'].tolist(), sm.add_constant(X_reading)).fit()
predict_reading.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.511
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,3.113
Date:,"Thu, 08 Dec 2022",Prob (F-statistic):,2.28e-08
Time:,02:36:48,Log-Likelihood:,-394.27
No. Observations:,212,AIC:,896.5
Df Residuals:,158,BIC:,1078.0
Df Model:,53,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-687.0151,111.242,-6.176,0.000,-906.729,-467.301
x1,0.3425,0.055,6.194,0.000,0.233,0.452
x2,-2.7500,1.273,-2.161,0.032,-5.264,-0.236
x3,0.5000,1.273,0.393,0.695,-2.014,3.014
x4,-0.5000,1.273,-0.393,0.695,-3.014,2.014
x5,-0.2500,1.273,-0.196,0.845,-2.764,2.264
x6,-3.0000,1.273,-2.357,0.020,-5.514,-0.486
x7,-2.2500,1.273,-1.768,0.079,-4.764,0.264
x8,-2.7500,1.273,-2.161,0.032,-5.264,-0.236

0,1,2,3
Omnibus:,6.181,Durbin-Watson:,2.013
Prob(Omnibus):,0.045,Jarque-Bera (JB):,9.162
Skew:,0.103,Prob(JB):,0.0102
Kurtosis:,3.997,Cond. No.,1810000.0


We can look at the P>|t| column in the summary table, which indicates the p-values of the coefficients. If the p-value of a coefficient is greater than 0.05, we can consider it to be statistically significant and conclude that it has a significant effect on the predicted value.

In [135]:
# accumulate constant names
const_names = trainer.columns[3:].tolist()
const_names.insert(0, 'YEAR')

math_significant = []

Iterate through p values and if less than 0.05 add to math_significant

In [136]:
for i in range(len(predict_math.pvalues) - 1):
    if predict_math.pvalues[i + 1] < 0.05:
        math_significant.append(const_names[i])
        
print("Math Constraints Tested: : " + str(math_significant))

Math Constraints Tested: : ['YEAR', 'STATE_ALASKA', 'STATE_ARIZONA', 'STATE_ARKANSAS', 'STATE_CONNECTICUT', 'STATE_DISTRICT_OF_COLUMBIA', 'STATE_FLORIDA', 'STATE_IDAHO', 'STATE_KANSAS', 'STATE_MARYLAND', 'STATE_MISSOURI', 'STATE_MONTANA', 'STATE_NEW_JERSEY', 'STATE_NEW_YORK', 'STATE_SOUTH_DAKOTA', 'STATE_VERMONT']


Iterate through p values and if less than 0.05 add to reading_significant

In [137]:
reading_significant = []


for i in range(len(predict_reading.pvalues) - 1):
    if predict_reading.pvalues[i + 1] < 0.05:
        reading_significant.append(const_names[i])
        
print("Reading Constraints Tested: " + str(reading_significant))
print()

Reading Constraints Tested: ['YEAR', 'STATE_ALASKA', 'STATE_COLORADO', 'STATE_DELAWARE', 'STATE_KANSAS', 'STATE_MISSOURI', 'STATE_NEW_MEXICO', 'STATE_NEW_YORK', 'STATE_NORTH_DAKOTA', 'STATE_SOUTH_DAKOTA', 'STATE_TEXAS']



Checks MATH p-values and coffeficients from the regression model and prints out that specific states name 

In [140]:
print("Predicted Negative Growth in Math:")

for i in range(len(math_regression.coef_)):
    if(i > 1):
        if math_regression.coef_[i] < 0 and predict_math.pvalues[i + 1] < 0.05:
            print(const_names[i])

Predicted Negative Growth in Math:
STATE_ARKANSAS
STATE_CONNECTICUT
STATE_FLORIDA
STATE_IDAHO
STATE_KANSAS
STATE_MARYLAND
STATE_MISSOURI
STATE_MONTANA
STATE_NEW_JERSEY
STATE_NEW_YORK
STATE_SOUTH_DAKOTA
STATE_VERMONT


Checks READING p-values and coffeficients from the regression model and prints out that specific states name 

In [141]:
print("Predicted Negative Growth in Reading:")

for i in range(len(reading_regression.coef_)):
    if(i > 1):
        if reading_regression.coef_[i] < 0 and predict_reading.pvalues[i + 1] < 0.05:
            print(const_names[i])

Predicted Negative Growth in Reading:
STATE_COLORADO
STATE_DELAWARE
STATE_KANSAS
STATE_MISSOURI
STATE_NEW_MEXICO
STATE_NEW_YORK
STATE_NORTH_DAKOTA
STATE_SOUTH_DAKOTA
STATE_TEXAS


Based on the data, a number of states appear to have both a negative coefficient and a low p value above 0.05, suggesting that they are predicted to have significant decreases in the average scores for 4th graders on the NAEP test in both reading and writing. It is notable that 19 states seem to have significant predicted negative growth in their average math and reading scores for 4th graders. Some of these states include Maine and California, which have traditionally performed well in these areas.