# State Average SAT Scores - An Observational Study

When, in 1982, average Scholastic Achievement Test (SAT) scores were first published on a state-by-state basis in the United States, the huge variation in the scores was a source of great pride for some states and of consternation for others.  Average scores ranged from a low of 790 (out of a possible 1,600) in South Carolina to a high of 1,088 in Iowa.  This 298-point spread dwarfed the 20-year national decline of 80 points.  Two researchers set out to "assess the extent to which the compositional/demographic and school-structural characteristics are implicated in SAT differences."  (Data from B. Powell and L. C. Steelman, "Variations in State SAT Performance:  Meaningful or Misleading?"  *Harvard Educational Review* 54(4) (1984): 389-412.)



The state averages of the total SAT (verbal + quantitative) scores are listed below, along with six variables that may be associated with the SAT differences among states.  Some explanatory variables come from the Powell and Steelman article, while others were obtained from the College Entrance Examination Board (by Robert Powers).  The variables are the following:  *takers* is the percentage of the total eligible students (high school seniors) in the state who took the exam; *income* is the median income of families of test-takers, in hundreds of dollars; *years* is the average number of years that the test-takers had formal studies in social sciences, natural sciences, and humanities; *public* is the percentage of the test-takers who attended public secondary schools; *expend* is the total state expenditure on secondary schools, expressed in hundreds of dollars per student; and *rank* is the median percentile ranking of the test-takers within their secondary school classes.

In [None]:
# 3rd party library imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set()
pd.options.display.float_format = "{:.3f}".format
pd.options.display.max_columns = 12

We begin by reading the data and summarizing the variables.

In [None]:
df = pd.read_csv('case1201.csv')
df.head()

In [None]:
df.describe()

# Display 12.4:  Matrix of scatterplots for SAT scores and explanatory variables

In [None]:
cols = ['Rank', 'Takers', 'Years', 'Income', 'Public', 'Expend', 'SAT']
g = sns.pairplot(df[cols])

The SAT versus *Takers* scatterplot shows a clearly nonlinear relationship, so we apply a logarithmic transformation to *Takers* and redraw the scatterplot matrix.

In [None]:
df['logTakers'] = np.log(df['Takers'])
cols.remove('Takers')
cols.insert(1, 'logTakers')
g = sns.pairplot(df[cols])
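As a rough illustration of why a log transform can help here, the sketch below compares the linear correlation before and after the transform. It uses synthetic data generated to be linear in the log of the predictor, not the SAT data; the names `takers` and `score` and all numeric constants are assumptions made for the example.

```python
import numpy as np

# Synthetic stand-in data (NOT the SAT data): the response is linear in log(x).
rng = np.random.default_rng(0)
takers = rng.uniform(2, 70, size=50)  # hypothetical percentages
score = 1100 - 60 * np.log(takers) + rng.normal(0, 15, size=50)

# Pearson correlation with the raw predictor and with its logarithm.
r_raw = np.corrcoef(takers, score)[0, 1]
r_log = np.corrcoef(np.log(takers), score)[0, 1]
print(f"r(raw) = {r_raw:.3f}, r(log) = {r_log:.3f}")
```

On data of this shape the correlation with the log-transformed predictor is the stronger of the two, which is the pattern the second scatterplot matrix is meant to confirm for the real data.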

*Expend* clearly has an outlier (Alaska) and *Public* may have one (Louisiana).
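One way to back up a visual impression of an outlier is a simple interquartile-range rule. The sketch below is only an illustration on toy values: the helper `flag_extremes`, the threshold `k=3.0`, and the `State` column name are assumptions, and the numbers are not taken from `case1201.csv`.

```python
import pandas as pd

def flag_extremes(series, k=3.0):
    """Return a boolean mask of values more than k IQRs outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values for illustration only (hypothetical, not the SAT data).
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F"],
    "Expend": [20.0, 22.0, 23.0, 25.0, 26.0, 50.0],
})
flagged = toy.loc[flag_extremes(toy["Expend"]), "State"].tolist()
print(flagged)
```

Applied to the real data frame, the same mask could be used as `df.loc[flag_extremes(df['Expend'])]`, assuming the column names match those used in the cells above.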

# Preliminary Analysis

In [None]:
formula = 'SAT ~ logTakers + Rank'
model = smf.ols(formula=formula, data=df).fit()
model.summary()

Together, log-transformed *Takers* and *Rank* explain 81% of the total variation in state average SAT scores.  Next, look at the effect of *Expend* on SAT scores after taking *Takers* and *Rank* into account.
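A partial residual for a predictor is the residual from the full fit plus that predictor's fitted component; this is the quantity a component-plus-residual (CCPR) plot draws against the predictor. The sketch below computes it by hand with NumPy on synthetic data. The variables `x1`, `x2`, and `y` and their coefficients are assumptions for illustration, not the SAT variables.

```python
import numpy as np

# Synthetic data: x1 plays the role of an adjustment variable,
# x2 the role of the predictor of interest.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.3, size=n)

# Ordinary least squares fit of y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Partial residual for x2: full-model residual plus the fitted x2 component.
partial_resid = resid + beta[2] * x2

# A straight-line fit of the partial residuals on x2 recovers the x2 coefficient.
slope = np.polyfit(x2, partial_resid, 1)[0]
print(slope, beta[2])
```

Because the least squares residuals are orthogonal to every column of the design matrix, the slope of the partial residuals on `x2` equals the multiple-regression coefficient, so a single distant point that tilts this line is also tilting the fitted coefficient.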

## Display 12.5: Partial residual plot of state average SAT scores (adjusted for takers and median class rank) versus state expenditure

In [None]:
# Two panels on shared axes: the same model fit with and without Alaska.
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)

# Left panel: all states.
model = smf.ols(formula='SAT ~ logTakers + Rank + Expend', data=df).fit()
sm.graphics.plot_ccpr(model, 'Expend', ax=ax[0])
ax[0].set_title('With Alaska')
ax[0].set_ylabel('Partial Residual')
ax[0].set_xlabel('')

# Right panel: Alaska is dropped by the Expend < 40 filter.
model = smf.ols(formula='SAT ~ logTakers + Rank + Expend', data=df.query('Expend < 40')).fit()
sm.graphics.plot_ccpr(model, 'Expend', ax=ax[1])
ax[1].set_title('Without Alaska')
ax[1].set_ylabel('')
ax[1].set_xlabel('')

fig.suptitle('CCPR (partial residual plots)')
fig.supxlabel('Expenditure ($100s per student)')
fig.tight_layout()

## Partial residual plot of state average SAT scores (adjusted for takers and median class rank) versus percentage of test-takers in public schools

In [None]:
# Two panels on shared axes: the same model fit with and without Louisiana.
fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)

# Left panel: all states.
model = smf.ols(formula='SAT ~ logTakers + Rank + Public', data=df).fit()
sm.graphics.plot_ccpr(model, 'Public', ax=ax[0])
ax[0].set_title('With Louisiana')
ax[0].set_ylabel('Partial Residual')
ax[0].set_xlabel('')

# Right panel: Louisiana is dropped by the Public > 50 filter.
model = smf.ols(formula='SAT ~ logTakers + Rank + Public', data=df.query('Public > 50')).fit()
sm.graphics.plot_ccpr(model, 'Public', ax=ax[1])
ax[1].set_title('Without Louisiana')
ax[1].set_ylabel('')
ax[1].set_xlabel('')

fig.suptitle('CCPR (partial residual plots)')
fig.supxlabel('Test Taker Percentage in Public Schools')
fig.tight_layout()