# How PASSNYC's outreach can have a large impact.

PASSNYC aims to increase the representation of minority, ELL, and lower-income students at New York's prestigious Specialized High Schools (SHSs).

To do so, PASSNYC will target student registration for the SHS Admissions Test (SHSAT), expecting that as registration goes up among the target demographics, attendance will follow.


# To improve registration, PASSNYC and its partners have a set of levers they can pull.

Ryan B. (in the [Discussion forums](http://www.kaggle.com/passnyc/data-science-for-good/discussion/61971)) categorized them under the umbrellas of *outreach* and *intervention*. Outreach can boost registration by improving awareness and knowledge, while interventions affect scores but are more costly and take longer.

# The best analysis will help PASSNYC identify schools for each available intervention.

Thus, we want to produce meaningful insights for both, but we'll prioritize outreach efforts because they're cheaper, quicker, and can still be effective if there's a need.

# So in what ways do we expect PASSNYC and its partners' interventions to drive registration?

From the students' perspective, taking the SHSAT raises several questions:

* How well do I expect to perform?
* If I do well, what's in it for me at an SHS?
* How easy is it for me to take the test itself?
* How easy is it for me to actually attend the school?

Within those questions we can form hypotheses around how PASSNYC and its partners can drive registration:
* **Preparing students for the tests** will improve the value they expect to get in return.
* **Educating students and parents on the benefits** of attending an SHS will encourage them to take a shot at the test.
* **Educating students on the environment and life at an SHS** may help students who question whether they'd succeed and fit in there.

We can also predict how different demographic groups may need different treatment, such as:
* **Giving realistic information about costs** in time and money to families in a lower socioeconomic status (SES). Low-SES family members support one another closely, so additional travel or study time can take a toll.

# Finally, how do our constraints impact how we analyze the data?

Our priority should be to segment students so that PASSNYC has both:
1. One or more starting points for immediate outreach
2. A strategy for its long-term interventions through partnerships

In doing so, I found the following:
1. A collection of schools that registered far fewer students than would be expected given their students' Common Core test scores. Reaching out to these schools will strongly drive registrations because students there needn't be convinced that they can perform well on the SHSAT.

2. PASSNYC's best bet for outreach each year is to find the schools where students perform well on Common Core tests. Common Core scores were the best predictor by far of registration, but not all grades are equal. Scores of 6th and 7th graders - those closest to taking the test - predicted outcomes better than any other grade. 

3. Holding test scores constant, attendance, economic need, and special education participation became the strongest predictors. Partners that focus on improving students' scores should seek out schools scoring well on those metrics. They should also look for schools near the middle of the pack on test scores, because encouraging on-the-fence students to sign up and improve will add the greatest value.

# Data Analysis

I start by bringing in data science tools and data before cleaning the data and creating new variables. I'll move quickly through the code at this point, but here are some key takeaways:

1. I calculated the percentage of students who sat for the SHSAT at each school as the main predicted variable. This was the closest datapoint available to SHSAT registration.
2. I created variables measuring distance between each school and both the SHSs and the testing centers. I saved the smallest of each distance for each school.
3. I averaged the % of students who attained 4s on Math and ELA for each grade level.
4. Using two NYC Open Data sources on Kaggle, I gathered safety data (averaging the total number of crimes at each school over six years) and the % of students in special education.
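
As a rough illustration of takeaways 1 and 3, here is a minimal sketch in plain Python with made-up numbers. The helper names `parse_count` and `share_of_4s` are hypothetical and do not appear in the notebook; the sketch only assumes that suppressed counts reported as '0-5' are treated as 0 and that the Math and ELA rates are averaged with equal weight.

```python
def parse_count(raw):
    """Treat a suppressed '0-5' value (or a blank) as 0; otherwise parse the number."""
    if raw is None or str(raw).strip() in ("", "0-5"):
        return 0.0
    return float(raw)


def share_of_4s(math_4s, math_tested, ela_4s, ela_tested):
    """Equal-weight average of the Math and ELA shares of students scoring a 4."""
    math_rate = math_4s / math_tested
    ela_rate = ela_4s / ela_tested
    # The parentheses matter: without them only the ELA rate would be halved.
    return (math_rate + ela_rate) / 2


# Hypothetical school with 120 students in HS admissions
pct_taking_shsat = parse_count("30") / 120      # 30 reported test-takers
pct_if_suppressed = parse_count("0-5") / 120    # suppressed count counted as 0
avg_4s = share_of_4s(25, 100, 50, 100)          # 25% in Math, 50% in ELA
```

Treating '0-5' as 0 understates participation slightly at small schools, which is worth keeping in mind when reading the lowest rates.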

In [None]:
#import libraries, including...

import os 

#data structuring, statistics, and other math we'll use
import pandas as pd
import numpy as np
from math import radians, cos, sin, asin, sqrt

#visualization
%matplotlib inline
import seaborn as sns
sns.light_palette("purple", as_cmap=True)

#machine learning and statistics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from scipy.stats import pearsonr

In [None]:
# Load our school explorer data and update the index column name to match our other datasets

school_data = pd.read_csv('../input/data-science-for-good/2016 School Explorer.csv', index_col = 'Location Code')
SHSAT_results = pd.read_csv('../input/20172018-shsat-admissions-test-offers/2017-2018_SHSAT_Admissions_Test_Offers_By_Sending_School.csv', index_col = 'Feeder School DBN')
school_safety = pd.read_csv('../input/ny-2010-2016-school-safety-report/2010-2016-school-safety-report.csv', index_col = 'DBN') #Open NYC dataset from Kaggle
school_demographics = pd.read_csv('../input/ny-school-demographics-and-accountability-snapshot/2006-2012-school-demographics-and-accountability-snapshot.csv', index_col = 'DBN') #Open NYC dataset from Kaggle
specialized_hs_locations = pd.read_csv('../input/nyc-specialized-high-school-locations/elite_eight_data.csv') #Source: Infocusp competition submission. Link: https://www.kaggle.com/infocusp/recommendations-to-passnyc-based-on-data-analysis/data
SHSAT_testing_locations = pd.read_csv('../input/shsat-testing-locations/SHSAT_testing_locations.csv') #Private dataset containing lat/

#Note - I used a private version of the specialized_hs_locations dataset because of a typo in the original, 
#but I give full credit to Infocusp

raw_datasets = [school_data, SHSAT_results, school_safety, specialized_hs_locations, school_demographics]

In [None]:
# INITIAL CLEAN-UP (aka preprocessing)

# Removing unnecessary columns before cleaning and merging

SHSAT_results.drop(['Feeder School Name'], axis=1, inplace=True) 
school_safety = school_safety[['Major N', 'Oth N', 'NoCrim N', 'Prop N', 'Vio N']]
school_demographics = pd.DataFrame(school_demographics[['frl_percent','sped_percent']])

# fix values before running computations

# In our SHSAT results, replacing '0-5' with 0 in the 'Count' columns

SHSAT_results['# of Test-takers'] = SHSAT_results['Count of Testers'][(SHSAT_results['Count of Testers'].astype(str)) != "0-5"]
SHSAT_results['# of Offers'] = SHSAT_results['Count of Offers'][(SHSAT_results['Count of Offers'].astype(str)) != "0-5"]

SHSAT_results.fillna(value = 0, inplace = True)

school_safety = school_safety.loc[school_safety.index.dropna()] # remove rows with blank DBNs

# Convert dollars, percents, and numbers to the correct format; swap out non-numbers with 0s

dollar_columns = ['School Income Estimate']
percent_columns = ['Percentage of Black/Hispanic students','Percent Asian', 'Percent Black', 'Percent ELL', 
                   'Percent Hispanic', 'Percent Black / Hispanic','Percent White', 'Student Attendance Rate', 
                   'Percent of Students Chronically Absent', 'Rigorous Instruction %', 
                   'Collaborative Teachers %', 'Supportive Environment %','Effective School Leadership %', 
                   'Strong Family-Community Ties %', 'Trust %', 'sped_percent']

def fix_dollars(df):
    for cols in dollar_columns:
        if cols in df:
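            # regex=False so that '$' is treated as a literal character, not a regex anchor
            df[cols] = df[cols].astype(np.object_).str.replace('$','', regex=False).str.replace(',','', regex=False).astype(float)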

def fix_percents(df):
    for cols in percent_columns:
        if cols in df:
            df[cols] = (df[cols].astype(np.object_).str.replace('%','').astype(float) / 100)

for dfs in raw_datasets:
    fix_dollars(dfs)
    fix_percents(dfs)

In [None]:
# Making new columns with secondary datasets

school_safety['Average Total Crimes 2013-2016'] = school_safety[['Major N', 'Oth N', 'NoCrim N', 'Prop N', 'Vio N']].sum(axis=1)
school_safety = school_safety.groupby(['DBN']).mean()
school_safety = school_safety['Average Total Crimes 2013-2016']
SHSAT_results = SHSAT_results[['Count of Students in HS Admissions','# of Test-takers','# of Offers']]
school_demographics = (school_demographics.groupby('DBN').mean() / 100)
school_demographics.columns = ['Free Meals %', 'Special Ed %']

In [None]:
# MERGING DATASETS

# merge datasets together
school_data = school_data.join(SHSAT_results).join(school_safety).join(school_demographics)
school_data = school_data.dropna(subset=['Count of Students in HS Admissions'])

In [None]:
## Feature creation within our aggregated dataset ##

# Creating our dependent, or predicted, variable
school_data['% Taking SHSAT'] = (school_data['# of Test-takers'].astype(float) 
                                 / school_data['Count of Students in HS Admissions'].astype(float))

school_data['% Receiving Offers'] = (school_data['# of Offers'].astype(float) 
                                     / school_data['Count of Students in HS Admissions'].astype(float))
   
# Share of students scoring a 4, averaged across the Math and ELA exams, for one grade.
def avg_share_of_4s(grade, math_tested_col=None):
    math_tested_col = math_tested_col or 'Grade {} Math - All Students Tested'.format(grade)
    math_share = (school_data['Grade {} Math 4s - All Students'.format(grade)].astype(float)
                  / school_data[math_tested_col].astype(float))
    ela_share = (school_data['Grade {} ELA 4s - All Students'.format(grade)].astype(float)
                 / school_data['Grade {} ELA - All Students Tested'.format(grade)].astype(float))
    # Parenthesize the sum so both shares are averaged (otherwise only the ELA share is halved)
    return (math_share + ela_share) / 2

# Estimate the % of HS candidates receiving 4s for outreach. Our 2017-2018 SHSAT cohort were 6th graders in 2015-16.
school_data['% of 2017-18 SHSAT Takers Receiving 4s in 2016'] = avg_share_of_4s(6)

# Estimate the % of 5th graders receiving 4s for medium-term interventions, since this is the 2018-2019 class.
school_data['% of 2018-19 SHSAT Takers Receiving 4s in 2016'] = avg_share_of_4s(5)

# Aggregate % of students with 4s for each grade level to see how strong of a predictor each is
school_data['% of 7th Graders Receiving 4s'] = avg_share_of_4s(7)
school_data['% of 8th Graders Receiving 4s'] = avg_share_of_4s(8)
# The original code references the Grade 3 Math column with a lowercase 't' ('tested'), so pass it explicitly
school_data['% of 3rd Graders Receiving 4s'] = avg_share_of_4s(3, math_tested_col='Grade 3 Math - All Students tested')
school_data['% of 4th Graders Receiving 4s'] = avg_share_of_4s(4)

# Average the CC scores across schools for longer-term interventions.
school_data['Average CC Scores'] = (school_data['Average ELA Proficiency'] + 
                                    school_data['Average Math Proficiency']) / 2

In [None]:
# calculate distance between schools and nearest testing center and specialized high school

# taken from https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas and edited

def haversine(longitude1, latitude1, target_dataset):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    distances = []
    lon1, lat1 = map(radians, [longitude1, latitude1])
    for i in range(len(target_dataset)):
        # convert decimal degrees to radians 
        lon2, lat2 = map(radians, [target_dataset.loc[i,"Long"], target_dataset.loc[i,"Lat"]])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a)) 
        km = 6367 * c
        distances.append(km)
    return min(distances)

# these loops iterate separately over our schools to find the nearest SHS and testing locations, respectively
for index, row in school_data.iterrows():
    school_data.loc[index, 'KM to nearest SHS'] = haversine(row['Longitude'], row['Latitude'], specialized_hs_locations)

for index, row in school_data.iterrows():
    school_data.loc[index, 'KM to nearest Testing Location'] = haversine(row['Longitude'], row['Latitude'], SHSAT_testing_locations)

In [None]:
# Create subsets for segmenting students, predictors, target, and school ID for easy review

demographics = ['Percent Black / Hispanic', 'Economic Need Index', 'Percent ELL', 
                'Community School?', 'Special Ed %']

model_predictors = ['KM to nearest SHS','KM to nearest Testing Location', 'Average Total Crimes 2013-2016',
                   'Economic Need Index','Rigorous Instruction %', 'Collaborative Teachers %', 
                    'Supportive Environment %', 'Effective School Leadership %', 
                    'Strong Family-Community Ties %', 'Trust %', 'Special Ed %', 'Percent of Students Chronically Absent'] 
                    #Some data we expect to add value as a predictor and as a means of segmenting schools.

model_target = ['% Taking SHSAT']


# Predictions and results

Our first goal is to understand schools that should've produced many sign-ups last year but didn't. These schools, especially because they're likely to have high-scoring students, will be easy to improve with low-cost outreach efforts.

In [None]:
# Start with a predictor aimed at the short-term to identify the highest-impact schools ('quick wins').
# I call these variables 'short_term' because they'll be easy to act on right away.

short_term_targets = school_data[model_predictors]
short_term_targets = short_term_targets.join(school_data['% of 2017-18 SHSAT Takers Receiving 4s in 2016'])
short_term_targets = short_term_targets.fillna(short_term_targets.mean())

Rand_Forest = RandomForestRegressor(min_samples_leaf=10, n_estimators=100, n_jobs=1, random_state=0)
# parameter tuning
RF_params = {"max_depth": [3,5,6,None],
              "max_features": [0.33,0.67,1.0],
              "min_samples_leaf": [4,9,16]}
RF_Gridsearch = GridSearchCV(Rand_Forest, RF_params, cv=3, n_jobs=1)
RF_Gridsearch.fit(short_term_targets, school_data[model_target].values.ravel())
Rand_Forest = Rand_Forest.set_params(**RF_Gridsearch.best_params_)
Rand_Forest.fit(short_term_targets, school_data[model_target].values.ravel())

# delete variables which are not used or almost unused to keep the model on the simpler side
short_term_targets = short_term_targets.loc[:, Rand_Forest.feature_importances_>0.01]
Rand_Forest.fit(short_term_targets, school_data[model_target].values.ravel())

# Save the model's predictions as a new variable (copy so we don't write into a slice of school_data)
short_term_prediction = school_data[model_target].copy()
short_term_prediction['Prediction'] = Rand_Forest.predict(short_term_targets)

# Model 1 demonstrates the predictive power of high-scoring students.

We see clearly that schools with high-scoring students send a high rate of students to the exam. Each bar in the below chart essentially represents how well each variable predicted the SHSAT registration rate. This tells us two things:

1. Students who expect to do well clearly have the strongest incentives to sign up. Pushing the needle on a student's expectation of his/her score will prove strong encouragement.
2. Schools we find that underperform their prediction will do well in converting SHSAT registrations into acceptances, because we'll be targeting schools with students who'll score well without much help.

The model is a random forest, a machine learning algorithm that builds many decision trees and averages their outputs to predict the outcome, here the SHSAT participation rate at each school.


In [None]:
# Visualize predictor strength

short_term_weights = pd.Series(index=short_term_targets.columns, data=Rand_Forest.feature_importances_).sort_values(ascending=True)
short_term_weights.plot(kind='barh', figsize=(10,15), color="purple");

# We can identify many high-value schools for outreach and quantify the impact.

With every prediction come underperformers, and those are our low-hanging fruit. Below, I calculate just how far each school underperformed, and I create a new table showing the top five examples. Note the "Underperformance in # of Eight Graders" column, which shows how many additional students would've registered if the school had met its predicted registration rate.

The top five schools alone can produce another hundred test-takers with only the push to meet expectations.
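
To make the arithmetic concrete, here is a minimal sketch of the gap calculation in plain Python. The school names and numbers are invented for illustration and the function name `underperformance` is hypothetical; the only assumption is that the gap is the predicted rate minus the actual rate, scaled by the number of students in HS admissions.

```python
# Hypothetical schools: (name, predicted rate, actual rate, students in HS admissions)
schools = [
    ("School A", 0.50, 0.25, 200),
    ("School B", 0.30, 0.35, 150),
    ("School C", 0.40, 0.30, 100),
]


def underperformance(predicted, actual, cohort_size):
    """Return the rate gap and the number of additional test-takers it implies."""
    gap = predicted - actual
    return gap, gap * cohort_size


rows = []
for name, predicted, actual, cohort in schools:
    gap, missing = underperformance(predicted, actual, cohort)
    rows.append({"school": name, "gap": gap, "missing_students": missing})

# Largest shortfall first, mirroring the sort used for the table
rows.sort(key=lambda row: row["gap"], reverse=True)

for row in rows:
    print(row["school"], round(row["gap"], 2), round(row["missing_students"], 1))
```

A negative gap (School B here) means the school sent more students to the test than the model predicted, so it falls to the bottom of the list.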


In [None]:
short_term_prediction['Underperformance Gap'] = (short_term_prediction["Prediction"] - 
                                                 short_term_prediction['% Taking SHSAT'])

short_term_prediction = short_term_prediction.sort_values('Underperformance Gap', ascending=False)
short_term_prediction['Underperformance in # of Eight Graders'] = (short_term_prediction['Underperformance Gap'].astype(float)
                                                                   * school_data['Count of Students in HS Admissions'].astype(float))
short_term_prediction = short_term_prediction.join(school_data[['School Name', 'Address (Full)']])

short_term_prediction.head()

It makes sense to prioritize, so I've also created a prioritization score that takes into account the demographics and size of a school. I then create a final table of 30 schools, which will display at the bottom, still sorted by the underperformance gap like the table above.
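
As a sketch of how such a score can be built, the snippet below averages three demographic shares and scales the result by school size. It assumes an equal-weight mean of the economic need index, the ELL share, and the Black/Hispanic share, which is my reading of the intent; the function name `prioritization_score` and the input values are hypothetical.

```python
def prioritization_score(economic_need, pct_ell, pct_black_hispanic, cohort_size):
    """Equal-weight mean of three demographic shares (each 0 to 1), scaled by cohort size."""
    demographic_mean = (economic_need + pct_ell + pct_black_hispanic) / 3
    return demographic_mean * cohort_size


# Two hypothetical schools with the same demographic mix but different sizes
small_school = prioritization_score(0.5, 0.25, 0.75, 100)
large_school = prioritization_score(0.5, 0.25, 0.75, 400)
```

Because the score is multiplied by cohort size, a large school outranks a small one with identical demographics, which matches the goal of reaching as many target students as possible per visit.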

In [None]:
# Prioritize schools

# Students scoring 4s on Common Core exams were excluded as a weighting variable because they could favor schools based on which grade levels they serve.

# Average the three demographic shares (note the parentheses around the sum), then scale by cohort size
prioritization_metrics = (((school_data['Economic Need Index'] + school_data['Percent ELL'] 
                            + school_data['Percent Black / Hispanic'].astype(float)) / 3) 
                          * school_data['Count of Students in HS Admissions'].astype(float))

short_term_prediction["Prioritization Score"] = prioritization_metrics
short_term_prediction = short_term_prediction.join(school_data[demographics])

# Analysis 2 reveals that cohorts matter when considering schools to target.

Now that we have seen some schools that performed below expectations in the past, how do we start to make predictions about the future? The previous analysis showed that the future can be seen through students' Common Core performance, but how far in advance does that shed light? 

It turns out, student cohorts about 1-2 years from taking the test will give much stronger indications of upcoming registration. In other words, **if you want to make sure your high-performing students ultimately take the SHSAT, start finding them in the 6th and 7th grade.** That guideline makes sense for PASSNYC partners as well, because they'll have the perfect amount of time to prepare students to perform at their peak.

Below are the relationships between the rate of 4s on Math or ELA exams and how well they ultimately predicted registration. Remember the registration data is two years ahead of the Common Core data, so our 6th graders scoring 4s became our 2017-2018 test takers. Hence the strongest correlation lies there.
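
For readers who want to see what the correlation measures, here is a minimal pure-Python sketch of Pearson's r on made-up per-school values. The function `pearson_r` and the data are illustrative assumptions, not the notebook's code (which uses `scipy`); schools that do not serve a grade are dropped before correlating, in the same spirit as the check in the cell below.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values; None marks a school that does not serve 6th graders
pct_4s_grade6 = [0.10, 0.20, None, 0.40, 0.30]
pct_taking_shsat = [0.15, 0.30, 0.50, 0.45, 0.30]

# Keep only schools with a score for this grade
pairs = [(x, y) for x, y in zip(pct_4s_grade6, pct_taking_shsat) if x is not None]
xs, ys = zip(*pairs)
r = pearson_r(xs, ys)
```

Values of r near 1 mean schools with more 4s in that grade also sent a larger share of students to the SHSAT; comparing r across grades shows which cohort is the most informative.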

In [None]:
# Correlate each grade level's rate of 4s with the SHSAT participation rate.
test_scores_by_grade = ['% of 3rd Graders Receiving 4s','% of 4th Graders Receiving 4s', 
                        '% of 2018-19 SHSAT Takers Receiving 4s in 2016', '% of 2017-18 SHSAT Takers Receiving 4s in 2016',
                       '% of 7th Graders Receiving 4s', '% of 8th Graders Receiving 4s']

correlations = {}

DV = school_data['% Taking SHSAT']

for items in test_scores_by_grade:
    grade_test_scores = school_data[items]
    # check that the school has students of that grade (missing scores compare as False)
    grade_check = grade_test_scores >= 0
    # pearsonr returns the coefficient and its p-value; keep only the coefficient
    r, p_value = pearsonr(DV[grade_check], grade_test_scores[grade_check])
    correlations[items] = r

correlations = pd.Series(correlations)

# Model 3 shows that, when looking for young students to target, attendance is key.

Now that we have a sense of who we want to focus our SHSAT preparation efforts on, how do we decide where to go? What school characteristics will predict registration once we've accounted for the student test performance?

Perhaps unsurprisingly, **attendance is the next-strongest predictor of SHSAT registration**. PASSNYC should emphasize to its partners the importance of students showing up at the schools they visit, because attendance both (a) predicts SHSAT registration and (b) improves the chance that students will benefit from the services provided.

**Economic need is the next-best predictor**. And though PASSNYC can't necessarily drive economics, they can tailor their messaging appropriately so that students in tough economic situations can understand both (a) the additional strain they may or may not face on their time and (b) the significant financial upside of a high-quality, prestigious high school education. College bills in particular likely deter students from imagining how SHSs could benefit them, but most young students - and likely their parents, especially if they lack college education - would not understand the volume and quality of scholarships available to those succeeding at the top New York City high schools.

Below, find another chart showing the weights of the various factors.

In [None]:
# Perform the same analysis without test scores to see which school characteristics predict registration

targets_excl_grades = school_data[model_predictors]
targets_excl_grades = targets_excl_grades.fillna(targets_excl_grades.mean())

Rand_Forest_2 = RandomForestRegressor(min_samples_leaf=10, n_estimators=100, n_jobs=1, random_state=0)

RF_Gridsearch.fit(targets_excl_grades, school_data[model_target].values.ravel())
Rand_Forest_2 = Rand_Forest_2.set_params(**RF_Gridsearch.best_params_)
Rand_Forest_2.fit(targets_excl_grades, school_data[model_target].values.ravel())

# delete variables which are not used or almost unused to keep the model on the simpler side
targets_excl_grades = targets_excl_grades.loc[:, Rand_Forest_2.feature_importances_>0.015]
Rand_Forest_2.fit(targets_excl_grades, school_data[model_target].values.ravel())

# Save the model's predictions as a new variable (copy so we don't write into a slice of school_data)
prediction_excl_grades = school_data[model_target].copy()
prediction_excl_grades['Prediction'] = Rand_Forest_2.predict(targets_excl_grades)

In [None]:
excl_grades_weights = pd.Series(index=targets_excl_grades.columns, data=Rand_Forest_2.feature_importances_).sort_values(ascending=True)
excl_grades_weights.plot(kind='barh', figsize=(10,15), color="purple");

# PASSNYC should continue actively studying its student populations both to identify target audiences and best-practice approaches.

As we've seen, identifying high-performing student cohorts can prove extremely valuable for raising SHSAT registration, even through relatively easy means like outreach. However, this will take continued review each year, as the high-performing schools may change. Luckily, drawing insights from that data won't require challenging analysis:

1. Find the high-scoring schools, and make sure families and students are aware of the benefits. They should be optimistic about the test and therefore likely to register with little encouragement.
2. Focus longer interventions on schools where students are likely to engage, and if possible, focus on schools that don't perform quite at the top on test scores. They'll need more than just outreach, but they'll offer realistic returns-on-investment. 

# Concluding remarks

Do feel free to contact me if you have any questions or would like to follow-up. Attached, you should find the .CSV file with the full list of schools with recommended weights. I've recreated the first 15 results below.

A special shout-out and thanks to PASSNYC, the Kaggle team, and the user community. It's fantastic to have the chance to work on meaningful projects with real organizations *and* benefit from the work of others who share the same goal.

In [None]:
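# Save the full list first, then display the first 15 rows (only a cell's last expression is shown)
short_term_prediction.to_csv("Chris_Deitrick_PASSNYC_School_Recommendations.csv")
short_term_prediction.head(15)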