#### This notebook runs the mean absolute error (MAE) feature selection function to determine the best features to use in the clustering models. I use only the 2015-2016 SQR data to inform feature selection in order to avoid overfitting.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from maeFeatureSelection import maeFeatureSelection
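# Note: Imputer was removed in scikit-learn 0.22; newer releases provide
# sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer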

##### For this first set of analyses, I drop records with missing (NA) values on the target variables rather than imputing a mean or median. NYC does not report outcomes for schools with cohorts of fewer than 20 students, so imputing values for these schools would overestimate the size of the effect between the features and the various targets. Using the correlation heat map from the initial visualizations script, I also remove features that are highly correlated with one another: for example, 8th grade English and Math proficiency are highly correlated, so I keep only Math proficiency. Likewise, the economic need index is a composite score of the percent of students in temporary housing and the percent who are HRA eligible, so I use only that index.

In [2]:
data = pd.read_csv("data/sqrAnalysisData.csv")
sy = data[data["schoolYear"]=='2015_2016'].copy()
syFeatureNames = ['averageGrade8MathProficiency','percentEnglishLanguageLearners',
                  'percentStudentswithDisabilities','percentSelfContained','economicNeedIndex',
                  'percentAsian','percentBlack','percentHispanic','percentWhite']
naDropSy = sy.dropna(subset = ['4YearGraduationRate','collegeandCareerPreparatoryCourseIndex',
 '4YearGraduationRateBlackorHispanicMalesinLowestThirdCitywide',
 '4YearGraduationRateEnglishLanguageLearners'])
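
##### The correlation screen described above was done by eye from a heat map. A minimal sketch of the same idea in code is below. The helper name `highly_correlated_pairs`, the 0.8 cutoff, and the synthetic columns are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for each pair with |r| >= threshold.

    Hypothetical helper; the 0.8 default is an illustrative assumption.
    """
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Synthetic stand-in for the SQR features (not real data)
rng = np.random.default_rng(0)
math_scores = rng.normal(size=200)
demo = pd.DataFrame({
    "mathProficiency": math_scores,
    # Built to track math closely, mimicking two redundant features
    "englishProficiency": math_scores + rng.normal(scale=0.1, size=200),
    "percentELL": rng.normal(size=200),
})

pairs = highly_correlated_pairs(demo, threshold=0.8)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r}")
```

##### For each flagged pair, one feature can be kept and the other dropped, as done with the proficiency scores above.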

In [3]:
imp = Imputer(missing_values='NaN', strategy='mean')
# Impute only the retained features so the columns line up with syFeatureNames
syFeatures = imp.fit_transform(naDropSy.loc[:, syFeatureNames])

In [4]:
gradRate = naDropSy.loc[:,'4YearGraduationRate']
ccpci = naDropSy.loc[:,'collegeandCareerPreparatoryCourseIndex']
bhLowestThirdGradRate = naDropSy.loc[:, '4YearGraduationRateBlackorHispanicMalesinLowestThirdCitywide']
ellGradRate = naDropSy.loc[:, '4YearGraduationRateEnglishLanguageLearners']

In [5]:
gradModel = maeFeatureSelection(data=syFeatures,target=gradRate,featureNames=syFeatureNames) 
ccpciModel = maeFeatureSelection(data=syFeatures,target=ccpci,featureNames=syFeatureNames) 
bhModel = maeFeatureSelection(data=syFeatures,target=bhLowestThirdGradRate,featureNames=syFeatureNames) 
ellModel = maeFeatureSelection(data=syFeatures,target=ellGradRate,featureNames=syFeatureNames) 
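
##### The source of `maeFeatureSelection` is not shown in this notebook. As a rough illustration of what an additive, MAE-driven selection could look like, the sketch below greedily adds the feature that gives the lowest cross-validated MAE at each step. The function name `forward_select_by_mae`, the use of linear regression, 5-fold cross-validation, and the synthetic data are all assumptions; the actual module may work differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict


def forward_select_by_mae(X, y, feature_names, cv=5):
    """Greedy forward selection scored by cross-validated MAE.

    Hypothetical stand-in, not the notebook's maeFeatureSelection.
    Returns a list of (feature_name, mae_after_adding_it) in the
    order the features were added.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    history = []
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            preds = cross_val_predict(LinearRegression(), X[:, cols], y, cv=cv)
            scores.append((mean_absolute_error(y, preds), j))
        best_mae, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((feature_names[best_j], best_mae))
    return history


# Synthetic data: only feature "a" drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

history = forward_select_by_mae(X_demo, y_demo, ["a", "b", "c"])
for name, mae in history:
    print(f"added {name}: MAE = {mae:.3f}")
```

##### If the MAE barely changes as features are added, the extra features contribute little predictive value for that target.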

##### Unfortunately, the feature selection results indicate that the additive models improve the mean absolute error only negligibly. Given the relatively small number of features in the dataset, I move forward with clustering the data on all features. An alternative would be to use PCA or a Lasso model to reduce dimensionality before clustering.
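
##### A minimal sketch of the two alternatives mentioned above, using scikit-learn on synthetic data. The 95% variance cutoff, the 5-fold `LassoCV`, and the data itself are assumptions for illustration; none of this was run on the SQR data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the SQR data): six features, one nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=300)  # near-duplicate of column 0
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Both methods are scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Option 1: PCA, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print("PCA components kept:", X_pca.shape[1])

# Option 2: Lasso, keeping the features with non-zero coefficients
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
kept = np.flatnonzero(lasso.coef_)
print("Lasso kept feature indices:", kept.tolist())
X_lasso = X_scaled[:, kept]
```

##### PCA is unsupervised and yields new combined axes, while Lasso is supervised and keeps a subset of the original features, so the Lasso result depends on which target is used. Either reduced matrix could then be passed to the clustering models.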