# Identifying controllable factors for education success in high schools

### Abstract
It is well known that in most countries student performance is closely linked to socioeconomic status, but
attention is also paid to factors such as funding, class size and teacher salaries as ways to improve student
performance, especially in schools in lower socioeconomic areas. We conducted a study to answer the
question: What was the impact of differing funding levels, class size and teacher salaries on 2017 school
outcomes, measured by SAT scores and graduation rate, across ~250 high schools in Massachusetts,
USA? The study used a Kaggle dataset.
Initial analysis using multilinear regression found very little impact of these factors across the overall school population.
However, the schools differed widely on situational factors such as socio-demographics and urban density.
Repeating the regression analysis at a segment level found that these factors had a much higher impact,
and that the important factors differed by segment: for instance, higher teacher salaries were important in
well-off areas, whereas a large number/variety of classes was important in disadvantaged areas.
The conclusions of the study carry a number of qualifications, such as the strong collinearity between key
dimensions.
(I did this project for a Python course)

#### Motivation and research question
Governments, charities and society have the best education of their children as one of their primary
goals, and they make substantial investments in schools and teachers accordingly. Questions of
evidence-based confirmation of the value of these investments for student outcomes, so that
scarce money can be best directed, are becoming increasingly pressing. For instance, The Smith
Family, a prominent Australian education charity, lists as one of its activities:
“Research and evaluation helps us to measure the outcomes and assess the effectiveness of
our support and programs. Evaluation and regular reporting also drive continual improvement
across the organisation.”
One common set of education investments covers class size, student funding and teacher salaries. A
better, evidence-based understanding of how much these factors affect student outcomes can
help administrators and charities like The Smith Family optimize their investments.

The question this analysis seeks to answer, at least for Massachusetts high schools, is: "What was the impact of differing funding levels, class size and teacher salaries on
2017 school outcomes, measured by SAT scores and graduation rate?"

I would like to acknowledge the kernel/report [‘Exploratory Analysis SAT scores in Public High Schools’](http://www.kaggle.com/lgl12b/improving-sat-scores-in-public-highschools) by Luis de Mola for inspiration.

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import plotly.offline as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from plotly import tools
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.cluster import KMeans
import statsmodels.api as sm
import utils
from itertools import cycle, islice
# pandas.tools.plotting was removed from pandas; the function now lives in pandas.plotting
from pandas.plotting import parallel_coordinates
import collections

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from math import sqrt

from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot as plt

In [None]:
py.init_notebook_mode(connected=True)


In [None]:
#Read in MA town size data 
town_size = pd.read_csv('../input/massachusetts-town-size-2010/MA_town_size.csv', sep=',')
#remove NA columns, set column headers
town_size.drop(town_size.columns[0],axis=1,inplace=True)
town_size.drop(town_size.index[:5],axis=0,inplace=True)
town_size.columns = town_size.iloc[0]
town_size.drop(town_size.index[0],axis=0,inplace=True)
town_size.columns.values[1]='Pop_per_sqm'
#convert object to integer
town_size['Pop_per_sqm']=town_size['Pop_per_sqm'].astype(int)
town_size.head()

In [None]:
# A histogram showed that 1,000 per sqm was a good cutoff for urban/suburban
town_size['rural?']=(town_size['Pop_per_sqm']<1000).astype(int)
percentage = town_size['rural?'].sum()/town_size['rural?'].count()
print(f'Percentage of towns classified as suburban or rural (i.e. not urban) {percentage:.0%}')

In [None]:
# Read school data
schoolDf = pd.read_csv('../input/massachusetts-public-schools-data/MA_Public_Schools_2017.csv', sep=',',header=0)
# merge in town size flag
schoolDf=pd.merge(schoolDf,town_size[['Community','rural?']], left_on=['Town'],right_on=['Community'],how='left')
#select schools covering grades 9-12
TypeOfSchool=('09,10,11,12')
schoolDf['HigherEd?']=schoolDf['Grade'].str.contains(TypeOfSchool)
no_high_schools = schoolDf['HigherEd?'].sum()
print(f'Number of schools covering grades 9-12 =  {no_high_schools}')
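
A left merge keeps every school, but a school whose `Town` has no exact match in `Community` ends up with a missing `rural?` flag and is later removed by `dropna()`. The sketch below shows one way to surface such rows with the `indicator` option of `pd.merge`. The two small tables, the `School Name` column and the misspelled town are invented for illustration; they are not taken from the Kaggle data.

```python
import pandas as pd

# Hypothetical miniature tables standing in for schoolDf and town_size
schools = pd.DataFrame({
    'School Name': ['A High', 'B High', 'C High'],
    'Town': ['Boston', 'Amherst', 'Springfeild'],  # last one is deliberately misspelled
})
towns = pd.DataFrame({
    'Community': ['Boston', 'Amherst', 'Springfield'],
    'rural?': [0, 1, 0],
})

# indicator=True adds a '_merge' column recording where each row was found
merged = pd.merge(schools, towns, left_on='Town', right_on='Community',
                  how='left', indicator=True)

# Rows marked 'left_only' had no matching town, so their 'rural?' flag is NaN
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Town'].tolist()
print(f'Schools without a town match: {unmatched}')
```

Running the same check on the real tables would show how many schools, if any, are lost because of town-name mismatches rather than genuinely missing data.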

To give structure to the analysis, we grouped the data into 3 categories:
1. Descriptors, or situational factors. Data that describes the school’s town/area, e.g. urban density, % economically disadvantaged
2. Interventions, or controllable factors. These are factors that the school authorities have some control over, such as class size or # of classes offered
3. Outcomes. These are factors that describe student success, such as SAT scores and graduation rates

In [None]:
Descriptors=[
'% First Language Not English',
'% Students With Disabilities',
'% High Needs',
'% Economically Disadvantaged',
'% African American',
'% Asian',
'% Hispanic',
'% White',
'% Multi-Race, Non-Hispanic',
'% Females',
'rural?']

In [None]:
Intervention=[ 'Total # of Classes', 
               'Average Class Size',
               'Number of Students', 
               'Average Salary', 
               'Average Expenditures per Pupil']

In [None]:
Outcomes=['Average SAT_Reading',
'Average SAT_Writing',
'Average SAT_Math',
'% Dropped Out',
'% Graduated',
'% Attending College',
'% MA Community College']

In [None]:
FullList=list(set(Outcomes)|set(Descriptors)|set(Intervention)|set(['HigherEd?']))
FullDf=schoolDf[FullList]
FullDf=FullDf[FullDf['HigherEd?']==True]
# Take an explicit copy so that later in-place edits do not act on a slice of schoolDf
FullDf=FullDf.dropna().copy()
print(f'Number of schools to analyse = {FullDf.shape[0]}')

In [None]:
FullDf.drop(['HigherEd?'],axis=1,inplace=True)

## Let's examine the range of our data

In [None]:
FullDf.describe().transpose()

### Subroutine to format heat map charts
We'll be showing lots of these

In [None]:
def create_heatmap(corr,title):
    layout = go.Layout(
        title=title,
        font=dict(family='Courier New, monospace', size=16, color='#7f7f7f'),
        xaxis=dict(
            #title='x Axis',
            autorange=True,
            showgrid=True,
            zeroline=True,
            showline=True,
            ticks='',
            showticklabels=True
        ),
        yaxis=dict(
            #title='y Axis',
            autorange=True,
            showgrid=True,
            zeroline=True,
            showline=True,
            ticks='',
            showticklabels=True,
            automargin= True
        )
    )
    # Rows of z map to the y axis and columns of z to the x axis
    trace = go.Heatmap(z=corr.values, x=corr.columns, y=corr.index)
    data=[trace]
    fig = go.Figure(data=data, layout=layout)
    return fig

### Check for correlation and collinearity

In [None]:
#Examine descriptive and intervention variables
DesIn=list(set(Descriptors)|set(Intervention))
DesInDf=FullDf[DesIn]

In [None]:
corr=DesInDf.corr()
# create_heatmap returns a complete figure, not just a layout
fig=create_heatmap(corr,'Correlation of descriptive and intervention factors')
iplot(fig)

In [None]:
#Let's examine the eigenvalues and vectors
corr = np.corrcoef(DesInDf, rowvar=False)
w, v = np.linalg.eig(corr) 
float_formatter = lambda x: "%.1f" % x
np.around(w,decimals =1)

A few eigenvalues are close to zero, which indicates collinearity, especially for the last few dimensions
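
To see why a near-zero eigenvalue signals collinearity, here is a minimal sketch on synthetic data (not the school data). The third column is built as an almost exact sum of the first two, and the names `a`, `b`, `c` are hypothetical. The sketch uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so the near-zero one comes first rather than last.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
# c is almost an exact linear combination of a and b (synthetic collinearity)
c = a + b + rng.normal(scale=0.01, size=n)
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)
# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
w, v = np.linalg.eigh(corr)

smallest = w[0]      # close to zero because one column is redundant
loadings = v[:, 0]   # the eigenvector belonging to that eigenvalue
print('eigenvalues:', np.round(w, 3))
print('loadings of the smallest one:', np.round(loadings, 2))
```

The eigenvector that belongs to the near-zero eigenvalue has sizeable loadings on all three columns, which identifies them as the variables involved in the linear dependence.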

In [None]:
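# Let's examine the eigenvectors to locate the collinearity
# Each column of v is one eigenvector; each row corresponds to an input variable
eigenvectors =v
ev_labels=[f'EV{i+1} ({val:.1f})' for i,val in enumerate(w)]
df = pd.DataFrame(data=eigenvectors,columns=ev_labels, index=DesInDf.columns)
fig=create_heatmap(df,'Eigenvectors of correlation matrix <br> (large loadings on a near-zero eigenvalue indicate collinearity)')
iplot(fig)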

Let's examine one of these collinearities: average expenditure per pupil and % multi-race

In [None]:
trace1 = go.Scatter(
    y = DesInDf['Average Expenditures per Pupil'],
    x = DesInDf['% Multi-Race, Non-Hispanic'],
    mode = 'markers'
)

layout = go.Layout(
    title='Average Expenditures per Pupil vs % multi-Race ',
   font=dict(family='Courier New, monospace', size=16, color='#7f7f7f'),
    yaxis=dict( title='Average Expenditures per Pupil',showline=True,),
    xaxis=dict( title='% Multi-Race, Non-Hispanic',showline=True)
)
data = [trace1]
fig = go.Figure(data=data, layout=layout)
# Plot and embed in ipython notebook!
iplot(fig)  

There are no schools with > 3% multi-race students that spend less than $14k per student

### Let's create clusters of schools based on descriptive variables

In [None]:
DescriptorsDf=FullDf[Descriptors]


Create a dendrogram first to decide the optimal number of clusters

In [None]:
def plot_dendrogram(Amodel, **kwargs):

    # Children of hierarchical clustering
    children = Amodel.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
Amodel = AgglomerativeClustering(n_clusters=3)

Amodel = Amodel.fit(DescriptorsDf)

In [None]:
plt.title('Hierarchical Clustering Dendrogram')
plot_dendrogram(Amodel=Amodel, labels=Amodel.labels_)
plt.show()

There looks to be good branch separation with 5 branches. Create a k-means clustering with 5 clusters
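
As a cross-check on reading the number of clusters off the tree, the sketch below uses real merge distances rather than uniform ones. It swaps in `scipy.cluster.hierarchy.linkage` with Ward linkage in place of `AgglomerativeClustering`, and runs on three synthetic blobs rather than the school descriptors, so the data and the "largest gap" heuristic are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
# Three well-separated synthetic blobs in 2-D
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage records the actual merge distance in the third column
Z = linkage(X, method='ward')

# The largest jump between successive merge heights suggests where to cut the tree
heights = Z[:, 2]
gaps = np.diff(heights)
n_clusters = int(len(X) - (np.argmax(gaps) + 1))

labels = fcluster(Z, t=n_clusters, criterion='maxclust')
tree = dendrogram(Z, no_plot=True)  # drop no_plot=True to draw the tree
print(f'Suggested number of clusters: {n_clusters}')
```

With real merge heights the branch lengths in the dendrogram are meaningful, so long vertical gaps correspond to well-separated groups.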

In [None]:
kmeans = KMeans(n_clusters=5,random_state=95)
model = kmeans.fit(DescriptorsDf)
# check size of clusters, sorted by cluster label so that row i is cluster i
Size=collections.Counter(model.labels_)
Size=np.array(sorted(Size.items()))
Size[:,1]

In [None]:
#add cluster tags to dataframe
FullDf['cluster'] = model.labels_

### Let's examine clusters

In [None]:
centers = model.cluster_centers_
# Function that creates a DataFrame with a column for Cluster Number

def pd_centers(featuresUsed, centers):
	colNames = list(featuresUsed)
	colNames.append('prediction')

	# Zip with a column called 'prediction' (index)
	Z = [np.append(A, index) for index, A in enumerate(centers)]

	# Convert to pandas data frame for plotting
	P = pd.DataFrame(Z, columns=colNames)
	P['prediction'] = P['prediction'].astype(int)
	return P

P= pd_centers(DescriptorsDf, centers)
P["CLUSTER SIZE"]=Size[:,1]
pd.options.display.float_format = '{:.0f}'.format
P.transpose()

### We can identify 2 extreme segments:  "WASP Privileged" (segment 4) and "Urban Disadvantaged" (segment 2)
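
Since the segments are referred to by cluster number, the profile and size of each segment need to line up with the right label. A minimal sketch of one way to build such a table with `groupby`, so that means and sizes share the same index; the six rows of data are invented for illustration.

```python
import pandas as pd

# Hypothetical descriptor values and cluster labels for six schools
df = pd.DataFrame({
    '% Economically Disadvantaged': [60.0, 70.0, 5.0, 10.0, 65.0, 8.0],
    '% White': [20.0, 30.0, 90.0, 80.0, 25.0, 85.0],
    'cluster': [2, 2, 4, 4, 2, 4],
})

# Mean of each descriptor per segment, indexed by cluster label
profile = df.groupby('cluster').mean()
# Sizes share the same index, so they line up with the right segment
profile['CLUSTER SIZE'] = df.groupby('cluster').size()
print(profile.round(0))
```

For a converged k-means fit, these per-cluster means should match the cluster centres, which makes the table a quick consistency check.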

### Create a routine to prepare and plot multilinear analysis
We'll be doing a lot of these

### **Note**!  The regression results are affected by which schools are randomly assigned in the train/test split.  Hence the results in the graphs may vary somewhat from the commentary
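
How much the results depend on the split can be gauged by repeating the split with different seeds. The sketch below does this on synthetic data with plain NumPy least squares; the sample size, the true coefficients and the helper `split_fit_score` are assumptions made for illustration, not part of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 60, 3  # a small synthetic sample, similar in scale to one segment
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=n)

def split_fit_score(X, y, seed, test_size=0.33):
    """Fit ordinary least squares on a random train split; return slopes and test R2."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(test_size * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    pred = np.column_stack([np.ones(len(test)), X[test]]) @ beta
    ss_res = np.sum((y[test] - pred) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return beta[1:], 1 - ss_res / ss_tot

results = [split_fit_score(X, y, seed) for seed in range(50)]
coefs = np.array([c for c, _ in results])
r2 = np.array([s for _, s in results])
print(f'R2 range over 50 splits: {r2.min():.2f} to {r2.max():.2f}')
print('Coefficient std across splits:', coefs.std(axis=0).round(2))
```

A wide R2 range or large coefficient spread across seeds would suggest treating any single split, including a fixed `random_state`, with caution.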

In [None]:
def RegressionAndPlot(X,y,title,outcome_type):
    Xdata=StandardScaler().fit_transform(X)
    X=pd.DataFrame(data=Xdata,    # standardised values
                  index=X.index,    # keep the original row index
                  columns=X.columns)  # keep the original column names
    #Split the Dataset into Training and Test Datasets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=200)
    #Linear Regression: Fit a model to the training set 
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # Calculate R2 score
    y_prediction = regressor.predict(X_test)
    R2= r2_score(y_test,y_prediction)

    # These are standardised regression coefficients, not correlation coefficients
    RegressionDf=pd.DataFrame(data=regressor.coef_,    # values
                  index=X.columns,    # factor names as index
                  columns=['Coefficient'])  # single column of coefficients
    RegressionDf.sort_values('Coefficient',ascending=True,inplace=True)
    data = [go.Bar(
                x=RegressionDf['Coefficient'],
                y=RegressionDf.index,
                orientation = 'h'
    )]
    layout = go.Layout(
        title=f'{title} <br> standardised coefficients for {outcome_type} (R2 = {R2:.0%})',
        xaxis=dict(
            title='Standardised coefficient',
            autorange=True,
            showgrid=True,
            zeroline=True,
            showline=True,
            ticks='',
            showticklabels=True
        ),
        yaxis=dict(
            #title='y Axis',
            autorange=True,
            showgrid=True,
            zeroline=True,
            showline=True,
            ticks='',
            showticklabels=True,
            automargin= True
        )
    )
    fig = go.Figure(data=data, layout=layout)
    return fig

## Run multilinear analysis on all high schools (~ 250 schools)
Let's see which descriptive factors and interventions most affect student outcomes

In [None]:
DesIn=list(set(Descriptors)|set(Intervention))
X=FullDf[DesIn]
y = FullDf['Average SAT_Math']
title="Descriptors and interventions, all schools"
outcome_type = 'Maths SAT score'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)


When we combine situational and controllable factors we can describe/predict SAT scores very well (R2 ~ 80%). Sociodemographic factors have the biggest impact on success
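
Ranking factors by the size of their coefficients is only meaningful because the inputs are standardised first. The sketch below illustrates this on synthetic data in which two factors matter equally per standard deviation; the variables `salary` and `class_size`, their scales and the helper `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
salary = rng.normal(loc=70000, scale=8000, size=n)   # dollars
class_size = rng.normal(loc=18, scale=4, size=n)     # students
# Synthetic outcome in which both factors matter equally per standard deviation
y = 10 * (salary - 70000) / 8000 + 10 * (class_size - 18) / 4 + rng.normal(size=n)

def ols(X, y):
    """Ordinary least squares with an intercept; returns the slopes only."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X_raw = np.column_stack([salary, class_size])
# Same transformation as StandardScaler: zero mean, unit (population) variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_coefs = ols(X_raw, y)
std_coefs = ols(X_std, y)
print('raw coefficients:', raw_coefs)
print('standardised coefficients:', std_coefs.round(2))
```

On the raw scale the class-size coefficient looks far larger simply because of its units, whereas the standardised coefficients are of similar size.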

In [None]:
X=FullDf[DesIn]
y = FullDf['% Graduated']
title="Descriptors and interventions, all schools"
outcome_type = 'Graduation rate'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)

Situational and controllable factors don't predict graduation rates as well (R2 ~ 50%), though socioeconomic factors are still key

### Let's examine the effect of controllable factors alone

In [None]:
X=FullDf[Intervention]
y = FullDf['Average SAT_Math']
title="Interventions only, all schools"
outcome_type = 'Maths SAT score'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)

Controllable factors alone had very low descriptive/predictive power for SAT scores ...

In [None]:
X=FullDf[Intervention]
y = FullDf['% Graduated']
title="Interventions only, all schools"
outcome_type = 'Graduation rate'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)

... or for graduation rate

## Run multilinear analysis on segments of schools

### Let's look at "Urban Disadvantaged" (segment 2) first

In [None]:
X=FullDf[FullDf['cluster'] ==2][Intervention]
y = FullDf[FullDf['cluster'] ==2]['% Graduated']
title="Interventions only, Urban Disadvantaged schools (n=54)"
outcome_type = 'Graduation rate'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)

### The impact of interventions is higher at the segment level - R2 ~ 20%.   The most important factor is a large number of classes - possibly a proxy for a variety of classes that better engages students?

### Let's look at "WASP Privileged" (segment 4)

In [None]:
X=FullDf[FullDf['cluster'] ==4][Intervention]
y = FullDf[FullDf['cluster'] ==4]['Average SAT_Math']
title="Interventions only, WASP Privileged schools (n=30)"
outcome_type = 'Average maths SAT'
fig=RegressionAndPlot(X,y,title,outcome_type)
iplot(fig)

### The impact of interventions on the "WASP Privileged" segment is also not strong - R2 ~ 20% - but again higher than at the aggregate level.  Here the most important factor is teachers' average salary.
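
The segment-by-segment comparison can also be written as a single loop over clusters. This is a minimal sketch on synthetic data in which the outcome depends on a different factor in each segment; the data, the `outcome` column and the in-sample least-squares fit are assumptions for illustration and do not reproduce the train/test procedure used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical data: two segments whose outcome depends on different factors
n = 80
df = pd.DataFrame({
    'Average Salary': rng.normal(size=n),
    'Total # of Classes': rng.normal(size=n),
    'cluster': np.repeat([2, 4], n // 2),
})
noise = rng.normal(scale=0.5, size=n)
df['outcome'] = np.where(df['cluster'] == 2,
                         df['Total # of Classes'], df['Average Salary']) + noise

features = ['Average Salary', 'Total # of Classes']
rows = {}
for label, seg in df.groupby('cluster'):
    # Standardise within the segment so coefficients are comparable
    Z = (seg[features] - seg[features].mean()) / seg[features].std()
    A = np.column_stack([np.ones(len(seg)), Z.to_numpy()])
    beta, *_ = np.linalg.lstsq(A, seg['outcome'].to_numpy(), rcond=None)
    rows[label] = pd.Series(beta[1:], index=features)

per_segment = pd.DataFrame(rows).T
per_segment['n'] = df.groupby('cluster').size()
print(per_segment.round(2))
```

Putting the coefficients side by side, together with each segment's size, makes it easier to see where the dominant factor changes and how few schools support each estimate.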

Regarding our question:
What was the impact of differing funding levels, class size and teacher salaries on 2017 school outcomes, measured by SAT scores and graduation rate, across ~250 high schools in Massachusetts, USA?

We conclude that certain controllable factors do appear to have a noticeable impact, though which factors matter, and how much, depends on the type of school.