**Introduction**

**Defining the problem statement**

**Datasets used**

**Scope and assumptions**

**Recommendations**

**Introduction**

Our solution is a simple approach that, in our humble opinion, aligns most directly with PASSNYC's mission statement:

>PASSNYC aims to identify talented underserved students within New York City's underperforming school districts in order to increase the diversity of students taking the Specialized High School Admissions Test.

In our solution, we attempt to support this mission in the one way that we can: by enabling data-driven decision making. Our main guiding principles in approaching this exercise were:

- **Take an open and transparent approach to using data**
- **Make the approach easy to replicate or customise**
- **Make the logic behind the recommendations easily explainable and, therefore, usable to all stakeholders**
- **Align most strongly to the main mission: "identifying talented underserved students … in order to increase diversity of students taking the Specialized High School Admissions Test."**

We have consciously steered away from the temptation of applying a machine learning algorithm to this data, as we felt that it would make our solution less user-friendly and less reproducible without adding any extra value.

Our entire approach can be easily reproduced using a spreadsheet, and therefore has no dependence on a technical team to translate into a working model.

We will be happy to share the original Excel model on request as well.

**Defining the problem statement**

PASSNYC and its partners provide outreach services that improve the chances of students taking the SHSAT and receiving placements in these specialized high schools.

>**Identify the schools where minority and underserved students stand to gain the most from services like after school programs, test preparation, mentoring, or resources for parents.**

We are further using the inputs given by Ryan J S Baxter and Akeel6passnyc in the recent AMA:

>"Outreach efforts can be broad and relatively inexpensive (targeted mailers, information handouts, email campaigns, etc), while programmatic interventions (prep courses, workshops, tutoring, etc) are more expensive and are designed to actually impact scores rather than simply engage families in learning about the tests. PASSNYC is engaged in multi-tiered approach to this problem and looking to create **solutions to awareness** (knowing about the test and schools), **participation** (taking the test), **and preparedness** (preparing for the test). NYC schools tend to have 100 - 300 students per grade so most outreach efforts target the entire grade while most programmatic interventions target participation from one fourth to one half of a grade at a particular school."

>"this particular project (the kernel) is focused on **increasing test taking** our larger mission seeks to develop solutions that will also support **other elements such as preparation**."

**Datasets used**

Apart from the datasets provided in the challenge, we will be using the following open datasets:

Mandatory datasets [https://www.kaggle.com/passnyc/data-science-for-good](https://www.kaggle.com/passnyc/data-science-for-good)

1. School explorer
2. D5 SHSAT registration data

Socrata datasets

1. NYC school safety report [https://www.kaggle.com/new-york-city/ny-2010-2016-school-safety-report](https://www.kaggle.com/new-york-city/ny-2010-2016-school-safety-report)
2. NY School Demographics and Accountability Snapshot [https://www.kaggle.com/new-york-city/ny-school-demographics-and-accountability-snapshot](https://www.kaggle.com/new-york-city/ny-school-demographics-and-accountability-snapshot)

Other datasets

1. PASSNYC resource centers [https://www.kaggle.com/infocusp/passnyc-resource-centers](https://www.kaggle.com/infocusp/passnyc-resource-centers) (created by Infocusp)
2. NYC ELA and Math State Test Results [https://www.kaggle.com/araraonline/nys-ela-and-math-test-results](https://www.kaggle.com/araraonline/nys-ela-and-math-test-results) (shared by André Alcântara, similar to the open data)

**Scope and assumptions**

Our solution will be restricted to schools that

- Offer grades 7, 8 or 9 (the grades in which students are eligible to take the SHSAT)
- Have a high percentage of Black / Hispanic students (as these are the underrepresented communities)

This scope is based on the following:

- Most data points, such as the economic index, geographical demography and school safety parameters, are strongly related to the percentage of Black / Hispanic students
- Asian students have been excluded from this recommendation as they are not underrepresented in the SHSAT schools
- As this test is taken in the middle grades, the most impact will be felt by reaching out to schools that offer at least up to 7th grade
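
The scope above can be sketched as a simple filter. This is only an illustration on made-up rows: the column names mirror the School Explorer file, the 60% threshold is the one we apply later in the notebook, and the helper name `in_scope` is hypothetical.

```python
import pandas as pd

# Toy data: column names mirror the School Explorer file, but the rows are made up.
schools = pd.DataFrame({
    "School Name": ["A", "B", "C", "D"],
    "Grade High": ["05", "08", "12", "08"],
    "Percent Black / Hispanic": ["95%", "72%", "40%", "61%"],
})

def in_scope(df, min_share=60, min_grade=7):
    """Keep schools that reach at least `min_grade` and exceed `min_share` percent."""
    # Percentages arrive as strings such as '95%', so strip the sign before comparing.
    share = df["Percent Black / Hispanic"].str.rstrip("%").astype(float)
    top_grade = df["Grade High"].astype(int)
    return df[(share > min_share) & (top_grade >= min_grade)]

scoped = in_scope(schools)
print(scoped["School Name"].tolist())
```

Both thresholds are plain function arguments, so they can be changed without touching the rest of the logic.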

**Recommendations**

At the end of this notebook, you will be able to download CSV files with schools ranked in order of priority for each of the following recommendations:

1. **Increasing awareness** (targeted mailers, information handouts, email campaigns, etc)
  - The main parameter for this recommendation will be low participation in the SHSAT test
2. **Increase participation** (Increase awareness + offer mentoring and counselling)
  - This recommendation will use a combination of the above parameter and also parameters like the Math and ELA scores to further present schools that seem to have good quality of education and therefore may not require any other intervention
  - We will further try to offer more targeted recommendations based on inputs from the school safety report to offer after school programs in areas that may benefit from access to safe spaces and community activities. (This is based on the assumption that children from unsafe areas may not be prioritising their education, and require some counselling and mentoring)
3. **Increasing preparedness through programmatic interventions** (prep courses, workshops, tutoring, etc)
  - As this is the most expensive intervention, we will focus on schools that are already participating in SHSATs but are receiving very few offers.
  - This recommendation uses a combination of participation, average scores and the success rate of the participating students in receiving an SHSAT offer

Apart from the above recommendations, there will be a list that **matches schools with each of the PASSNYC resource centres.** This list will be based on the following parameters:

- Type of intervention recommended for the school matched with the type of program offered by the partner
- Proximity of school to the PASSNYC centre (our assumption is that a student will be comfortable travelling up to 2 miles to the partner centre)
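
To make the proximity assumption concrete, here is a minimal sketch of a great-circle (haversine) distance check against the 2-mile limit. The coordinates are hypothetical and the helper names `haversine_miles` and `within_reach` are ours for illustration; they are not part of any dataset or library.

```python
from math import radians, sin, cos, sqrt, atan2

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return EARTH_RADIUS_MILES * 2 * atan2(sqrt(a), sqrt(1 - a))

def within_reach(school, center, max_miles=2.0):
    """True when the center lies within `max_miles` of the school."""
    return haversine_miles(*school, *center) <= max_miles

# Hypothetical (latitude, longitude) pairs
school = (40.81, -73.95)
near_center = (40.82, -73.95)  # roughly 0.7 miles north of the school
far_center = (40.70, -73.95)   # roughly 7.6 miles south of the school

print(within_reach(school, near_center), within_reach(school, far_center))
```

This measures straight-line distance only; actual travel distance along streets or by transit will be longer.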







In [None]:
print('Initial steps: Ingesting, pre processing data')

In [None]:
#importing libraries
#Note: the unused 'plotly.plotly' import was dropped (that module was removed in plotly 4), as was the duplicate numpy import

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np
import pandas as pd
import random as rnd
%matplotlib inline
from math import sin, cos, sqrt, atan2, radians

In [None]:
#Loading needed databases

School_exp = pd.read_csv("../input/data-science-for-good/2016 School Explorer.csv")
D5 = pd.read_csv("../input/data-science-for-good/D5 SHSAT Registrations and Testers.csv")
School_safety = pd.read_csv("../input/ny-2010-2016-school-safety-report/2010-2016-school-safety-report.csv")
SHSAT = pd.read_csv("../input/2017-2018-shsat-admissions-test-offers-by-schools/2017-2018 SHSAT Admissions Test Offers By Sending School.csv")
Center_list = pd.read_csv("../input/passnyc-resource-centers/passnyc-resource-centers.csv")
School_demo = pd.read_csv("../input/ny-school-demographics-and-accountability-snapshot/2006-2012-school-demographics-and-accountability-snapshot.csv")

In [None]:
#Preprocessing

D5.set_index('Year of SHST',inplace=True, drop=True) #Changing index 
D5 = D5.drop([2013, 2014, 2015]) ##Dropping rows that will not be used

School_safety.set_index('School Year',inplace=True, drop=True) #Changing index 
School_safety = School_safety.drop(['2013-14','2014-15'])  ##Dropping rows that will not be used

School_safety = School_safety.dropna(subset = ['DBN'])

School_exp = pd.merge(School_exp, School_safety, how='left', left_on='Location Code', right_on='DBN')
School_exp = pd.merge(School_exp, SHSAT, how='left', left_on='DBN', right_on='School DBN')



In [None]:
School_exp.set_index('DBN',inplace=True, drop=True)
D5.set_index('DBN',inplace=True, drop=True)
School_exp = School_exp.join(D5)

School_exp = School_exp.drop(columns=[
    'Adjusted Grade','New?','Other Location Code in LCGMS','Location Code_x','SED Code','Grade Low',
    'Grade 3 ELA - All Students Tested','Grade 3 ELA 4s - All Students','Grade 3 ELA 4s - American Indian or Alaska Native','Grade 3 ELA 4s - Black or African American','Grade 3 ELA 4s - Hispanic or Latino','Grade 3 ELA 4s - Asian or Pacific Islander','Grade 3 ELA 4s - White','Grade 3 ELA 4s - Multiracial','Grade 3 ELA 4s - Limited English Proficient','Grade 3 ELA 4s - Economically Disadvantaged',
    'Grade 3 Math - All Students tested','Grade 3 Math 4s - All Students','Grade 3 Math 4s - American Indian or Alaska Native','Grade 3 Math 4s - Black or African American','Grade 3 Math 4s - Hispanic or Latino','Grade 3 Math 4s - Asian or Pacific Islander','Grade 3 Math 4s - White','Grade 3 Math 4s - Multiracial','Grade 3 Math 4s - Limited English Proficient','Grade 3 Math 4s - Economically Disadvantaged',
    'Grade 4 ELA - All Students Tested','Grade 4 ELA 4s - All Students','Grade 4 ELA 4s - American Indian or Alaska Native','Grade 4 ELA 4s - Black or African American','Grade 4 ELA 4s - Hispanic or Latino','Grade 4 ELA 4s - Asian or Pacific Islander','Grade 4 ELA 4s - White','Grade 4 ELA 4s - Multiracial','Grade 4 ELA 4s - Limited English Proficient','Grade 4 ELA 4s - Economically Disadvantaged',
    'Grade 4 Math - All Students Tested','Grade 4 Math 4s - All Students','Grade 4 Math 4s - American Indian or Alaska Native','Grade 4 Math 4s - Black or African American','Grade 4 Math 4s - Hispanic or Latino','Grade 4 Math 4s - Asian or Pacific Islander','Grade 4 Math 4s - White','Grade 4 Math 4s - Multiracial','Grade 4 Math 4s - Limited English Proficient','Grade 4 Math 4s - Economically Disadvantaged',
    'Grade 5 ELA - All Students Tested','Grade 5 ELA 4s - All Students','Grade 5 ELA 4s - American Indian or Alaska Native','Grade 5 ELA 4s - Black or African American','Grade 5 ELA 4s - Hispanic or Latino','Grade 5 ELA 4s - Asian or Pacific Islander','Grade 5 ELA 4s - White','Grade 5 ELA 4s - Multiracial','Grade 5 ELA 4s - Limited English Proficient','Grade 5 ELA 4s - Economically Disadvantaged',
    'Grade 5 Math - All Students Tested','Grade 5 Math 4s - All Students','Grade 5 Math 4s - American Indian or Alaska Native','Grade 5 Math 4s - Black or African American','Grade 5 Math 4s - Hispanic or Latino','Grade 5 Math 4s - Asian or Pacific Islander','Grade 5 Math 4s - White','Grade 5 Math 4s - Multiracial','Grade 5 Math 4s - Limited English Proficient','Grade 5 Math 4s - Economically Disadvantaged',
    'Grade 6 ELA - All Students Tested','Grade 6 ELA 4s - All Students','Grade 6 ELA 4s - American Indian or Alaska Native','Grade 6 ELA 4s - Black or African American','Grade 6 ELA 4s - Hispanic or Latino','Grade 6 ELA 4s - Asian or Pacific Islander','Grade 6 ELA 4s - White','Grade 6 ELA 4s - Multiracial','Grade 6 ELA 4s - Limited English Proficient','Grade 6 ELA 4s - Economically Disadvantaged',
    'Grade 6 Math - All Students Tested','Grade 6 Math 4s - All Students','Grade 6 Math 4s - American Indian or Alaska Native','Grade 6 Math 4s - Black or African American','Grade 6 Math 4s - Hispanic or Latino','Grade 6 Math 4s - Asian or Pacific Islander','Grade 6 Math 4s - White','Grade 6 Math 4s - Multiracial','Grade 6 Math 4s - Limited English Proficient','Grade 6 Math 4s - Economically Disadvantaged',
    'Grade 7 ELA 4s - American Indian or Alaska Native','Grade 7 ELA 4s - Asian or Pacific Islander','Grade 7 ELA 4s - Multiracial','Grade 7 ELA 4s - Economically Disadvantaged',
    'Grade 7 Math 4s - American Indian or Alaska Native','Grade 7 Math 4s - Asian or Pacific Islander','Grade 7 Math 4s - Multiracial',
    'Grade 8 ELA 4s - American Indian or Alaska Native','Grade 8 ELA 4s - Asian or Pacific Islander',
    'Grade 8 Math 4s - American Indian or Alaska Native','Grade 8 Math 4s - Asian or Pacific Islander','Grade 8 Math 4s - Multiracial',
    'Building Code','Location Name','Location Code_y','Address','Borough_x','Geographical District Code','Register','Building Name','# Schools','Schools in Building','Postcode','Latitude_y','Longitude_y','BBL','NTA','School DBN','Borough_y','School Category','School Name_y'])
                                      
School_exp = School_exp[School_exp['Grade High'] != '0K']
School_exp['Grade High'] = School_exp['Grade High'].astype(int)  #np.object was removed from NumPy; the strings convert to int directly

School_demo.set_index('schoolyear', inplace=True, drop=True)
School_demo = School_demo.drop([20052006, 20062007, 20072008, 20082009, 20092010, 20102011])
School_demo = School_demo.fillna(0)
School_demo.set_index('DBN',inplace=True, drop=True)
#convert_objects was removed from pandas; pd.to_numeric with errors='coerce' turns non-numeric entries into NaN in the same way
School_demo['grade6'] = pd.to_numeric(School_demo['grade6'], errors='coerce').fillna(0)
School_demo['grade7'] = pd.to_numeric(School_demo['grade7'], errors='coerce').fillna(0)
School_demo['grade8'] = pd.to_numeric(School_demo['grade8'], errors='coerce').fillna(0)
School_demo['Reach'] = School_demo['grade7'] + School_demo['grade8'] + School_demo['grade6']
School_demo['grade7'] = School_demo['grade7'].astype(float)

School_exp = pd.merge(School_exp, School_demo, how='left', left_on='DBN', right_on='DBN')

#Convert percentage strings such as '95%' to floats (np.object and np.NaN were removed from NumPy, so the builtin object and np.nan are used)
for col in School_exp.columns.values:
    if col.startswith("Percent") or col.endswith("%") or col.endswith("Rate"):
        School_exp[col] = School_exp[col].astype(object).str.replace('%', '').astype(float)

School_exp.replace(np.nan,0, inplace=True)

School_exp['grade7'] = School_exp['grade7'].astype(object).astype(float)

#Add conditional subset flags

School_exp['Minority'] = np.where(School_exp['Percent Black / Hispanic']>60, 'Minority', 'no')

School_exp['7to9'] = np.where(School_exp['Grade High']>6, 'SHST', 'no')

In [None]:
print('Creating our clusters')

In [None]:
#Creating our recommendation clusters

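#Cluster 1 is restricted to schools flagged as 'Minority' (more than 60% Black / Hispanic), matching the scope described above; Clusters 2 and 3 inherit this filter
School_exp['Cluster 1'] = np.where((School_exp['Number of students who registered for the SHSAT']<35) & (School_exp['7to9'] == 'SHST') & (School_exp['Minority'] == 'Minority'), 'Awareness', 'no')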
School_exp['Cluster 2'] = np.where((School_exp['Cluster 1'] == 'Awareness') & (School_exp['Average ELA Proficiency'] > 2.3) & (School_exp['Average Math Proficiency'] > 2.3), 'Mentoring', 'no')
School_exp['Cluster 3'] = np.where((School_exp['Cluster 1'] == 'Awareness') & (School_exp['Supportive Environment %'] < 85), 'Afterschool programs', 'no')
School_exp['Cluster 4'] = np.where((School_exp['7to9'] == 'SHST') & (School_exp['Percent of eight grade students who received offer'] < 10) & (School_exp['Rigorous Instruction %'] < 85), 'Test prep', 'no')


In [None]:
#Nearest resource center locator

R = 3959.0  #Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    #Great-circle distance in miles between two points given in radians
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

n_schools = len(School_exp)
n_centers = len(Center_list)

place = []
dist = [99999] * n_schools  #placeholder for schools with no matching center

for i in range(n_schools):
    lat1 = radians(School_exp['Latitude_x'].iloc[i])
    lon1 = radians(School_exp['Longitude_x'].iloc[i])
    needs_prep = School_exp['Cluster 4'].iloc[i] == 'Test prep'
    needs_afterschool = School_exp['Cluster 3'].iloc[i] == 'Afterschool programs'
    center_name = ''
    for j in range(n_centers):
        lat2 = radians(Center_list['Lat'].iloc[j])
        lon2 = radians(Center_list['Long'].iloc[j])
        offers_prep = Center_list['Test Prep'].iloc[j] == 1
        offers_afterschool = Center_list['After School Program'].iloc[j] == 1

        #A center is a candidate when it offers a program that the school needs
        if (offers_prep and needs_prep) or (offers_afterschool and needs_afterschool):
            distance = haversine_miles(lat1, lon1, lat2, lon2)
            if dist[i] > distance:
                center_name = Center_list['Resource Center Name'].iloc[j]
                dist[i] = round(distance,1)

    place.append(center_name)

**Increasing awareness**

Here are our recommendations for schools to focus on for increasing awareness.
This list contains schools that have more than 60% Black / Hispanic students and is ranked by the total number of students across grades 6, 7 and 8 (the "Reach" column) to help you prioritise your efforts.

Here are the top 10 schools in this list:

In [None]:
School_awareness = School_exp.loc[School_exp['Cluster 1'] == 'Awareness']
School_awareness = School_awareness[['School Name_x', 'Reach']]
School_awareness['Awareness_rank']=School_awareness['Reach'].rank(ascending=0)


In [None]:
School_awareness = School_awareness.sort_values(by=['Awareness_rank'], ascending=[True]) #sort based on order of priority
School_awareness.to_csv('1_Increase_awareness.csv')

In [None]:
print(School_awareness.head(n=10))

print('Please note: The entire ranked list will be available as an output of this notebook')

**Increase participation** 

(Increase awareness + offer mentoring and counselling)

Above-average ELA and Math scores indicate that students only need a bit of a push to become SHSAT-ready, which may make them prime candidates for mentoring and counselling to help them up their game.

The list of schools is ranked by the sum of the school's average ELA and Math proficiency scores.

Here are the top 10 schools in this list:

In [None]:
Inc_participation = School_exp.loc[School_exp['Cluster 2'] == 'Mentoring']
Inc_participation = Inc_participation[['School Name_x', 'Average ELA Proficiency', 'Average Math Proficiency']]
Inc_participation['Proficiency'] = Inc_participation['Average ELA Proficiency'] + Inc_participation['Average Math Proficiency']
Inc_participation['Mentoring_rank']=Inc_participation['Proficiency'].rank(ascending=0)

In [None]:
Inc_participation = Inc_participation.sort_values(by = ['Mentoring_rank'], ascending=[True]) #sort based on order of priority
Inc_participation.to_csv('2_Offer_mentoring.csv')

In [None]:
print(Inc_participation.head(n=10))

print('Please note: The entire ranked list will be available as an output of this notebook')

**Offer programmatic interventions** 

In this recommendation, we offer two lists of schools

1. Schools that may benefit from afterschool programs as they lack a supportive environment. 

We feel that adding the option of spending time in a "safe space" may help students be more forward-thinking and help them break out of a cycle that deprioritises education

2. Schools that will benefit from access to test prep centers

This is based on the data where a school has students who take the SHSAT but are unable to get offers (less than 10% success rate), coupled with the fact that these schools have a poor "Rigorous instruction" rating

Please note that these lists are subsets of the previous lists, ie, we are still 


**Schools that may benefit from afterschool programs**

This list is sorted based on the "Supportive environment" rating for the school.

*Please note that this list is a subset of the previous two recommendations, as it is still filtered on schools that have a large percentage of minority students and offer at least grade 7*

In the list below, you will notice that our top entry is ranked as 21; our ranking starts from this number. Apologies for the bug :(
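
A likely cause of the ranking starting at 21, which we have not verified against the data, is tie handling: by default pandas gives tied values the average of the ranks they span, so if 41 schools shared the top "Supportive Environment %" value they would all be ranked (1 + 41) / 2 = 21. The toy example below shows the default next to the `method="min"` and `method="dense"` alternatives; the scores are made up.

```python
import pandas as pd

# Made-up scores with a three-way tie at the top
scores = pd.Series([84, 84, 84, 80, 75], index=list("ABCDE"))

avg_rank = scores.rank(ascending=False)                    # default: method="average"
min_rank = scores.rank(ascending=False, method="min")      # ties share the lowest rank
dense_rank = scores.rank(ascending=False, method="dense")  # no gaps after ties

print(pd.DataFrame({"average": avg_rank, "min": min_rank, "dense": dense_rank}))
```

With `method="min"` or `method="dense"` the top entries are ranked 1, which is usually what a priority list is expected to show.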

In [None]:
Afterschool_prog = School_exp.loc[School_exp['Cluster 3'] == 'Afterschool programs']
Afterschool_prog = Afterschool_prog[['School Name_x', 'Supportive Environment %']]

Afterschool_prog['Afterschool_rank']=Afterschool_prog['Supportive Environment %'].rank(ascending=0)

In [None]:
Afterschool_prog = Afterschool_prog.sort_values(by = ['Afterschool_rank'], ascending=[True]) #sort based on order of priority

Afterschool_prog.to_csv('3_Afterschool_programs.csv')

In [None]:
print(Afterschool_prog.head(n=10))

print('Please note: The entire ranked list will be available as an output of this notebook')

**Schools that will benefit from access to test prep centers**

Created based on a combination of lack of rigorous instruction and low success rate in the SHSAT

This list is sorted based on the "Economic Need Index" for the school.

*Please note that this list is of ALL schools, as we believe that this is a crucial intervention: our hypothesis is that, despite students being motivated to appear for the SHSAT, poor economic status does not allow them to explore premium tutoring options*

In [None]:
Test_prep = School_exp.loc[School_exp['Cluster 4'] == 'Test prep']
Test_prep = Test_prep[['School Name_x', 'Economic Need Index', 'Rigorous Instruction %', 'Percent of eight grade students who received offer']]
Test_prep['Prep_rank']=Test_prep['Economic Need Index'].rank(ascending=0)

In [None]:
Test_prep = Test_prep.sort_values(by = ['Prep_rank'], ascending=[True]) #sort based on order of priority

Test_prep.to_csv('4_Test_prep.csv')

In [None]:
print(Test_prep.head(n=10))

print('Please note: The entire ranked list will be available as an output of this notebook')

**Matching schools with each of the PASSNYC resource centres.**

Our first output helps you quickly identify which schools are near which resource centre, through a table of the distance between every school and every resource centre.

*Please note that all distances are in miles*

In [None]:
#Making the csv for distance between all the Partner Resource Centers and the schools

rows = len(School_exp) + 1      #one extra row to hold the resource center names
columns = len(Center_list) + 1  #one extra column to hold the school names
Center_school = [[0 for x in range(columns)] for y in range(rows)] 
for i in range(1,rows):
    lat1 = radians(School_exp['Latitude_x'].iloc[i-1])
    lon1 = radians(School_exp['Longitude_x'].iloc[i-1])
    for j in range(1,columns):
        lat2 = radians(Center_list['Lat'].iloc[j-1])
        lon2 = radians(Center_list['Long'].iloc[j-1])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
        c = 2 * atan2(sqrt(a), sqrt(1 - a))
        distance = R * c
        Center_school[i][j] = round(distance,1)

#Row i holds school i-1 and column j holds center j-1, so the labels use the same offset
for i in range(1,rows):
    Center_school[i][0] = School_exp['School Name_x'].iloc[i-1]
for j in range(1,columns):
    Center_school[0][j] = Center_list['Resource Center Name'].iloc[j-1]

Resource_school = pd.DataFrame(Center_school)
Resource_school.to_csv('5_Resource center vs Schools Distance.csv')
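
For larger inputs, the same table can be built without explicit loops. The sketch below is an alternative under stated assumptions rather than the code that produced our output: it uses NumPy broadcasting, the toy coordinates are made up, and the column names `name`, `lat` and `lon` and the function name `distance_matrix_miles` are hypothetical.

```python
import numpy as np
import pandas as pd

def distance_matrix_miles(schools, centers, radius=3959.0):
    """Haversine distance in miles from every school (rows) to every center (columns)."""
    # Column vectors for schools and row vectors for centers broadcast to a full grid.
    lat1 = np.radians(schools["lat"].to_numpy())[:, None]
    lon1 = np.radians(schools["lon"].to_numpy())[:, None]
    lat2 = np.radians(centers["lat"].to_numpy())[None, :]
    lon2 = np.radians(centers["lon"].to_numpy())[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    miles = 2 * radius * np.arcsin(np.sqrt(a))
    return pd.DataFrame(np.round(miles, 1),
                        index=schools["name"].to_numpy(),
                        columns=centers["name"].to_numpy())

# Made-up coordinates for two schools and two centers
schools = pd.DataFrame({"name": ["S1", "S2"], "lat": [40.81, 40.70], "lon": [-73.95, -73.95]})
centers = pd.DataFrame({"name": ["C1", "C2"], "lat": [40.82, 40.70], "lon": [-73.95, -73.95]})

matrix = distance_matrix_miles(schools, centers)
print(matrix)
```

Because the result is a labelled DataFrame, the school and centre names travel with the distances, which avoids keeping a separate header row and column in step by hand.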

**Schools - Resource centre matching  - Afterschool programs**

We have used our earlier list of schools that would benefit from afterschool programs to match with centres that offer such programs.

In this list you will be able to see which is the nearest resource centre to these schools

In [None]:
#Making csv for Cluster 3 : Afterschool programs

Cluster_3 = School_exp[['School Name_x', 'Cluster 3']].copy()
Cluster_3['Nearest Resource Center'] = place
Cluster_3['Distance to the Nearest Resource Center'] = dist
#Keep only the schools flagged for afterschool programs, then drop the flag column
Cluster_3 = Cluster_3[Cluster_3['Cluster 3'] != 'no']
Cluster_3 = Cluster_3.drop(columns=['Cluster 3'])
Cluster_3.to_csv('6_Partners for After School Programs.csv')

In [None]:
print(Cluster_3.head(n=10))

print('Please note: The entire list will be available as an output of this notebook')
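
The matching above can also be sketched with a labelled distance table, looking only at the centres that offer the relevant programme. This is an illustration on made-up numbers: the distance values, the centre flags and the function name `nearest_offering` are all hypothetical.

```python
import pandas as pd

# Hypothetical distances in miles: rows are schools, columns are resource centers.
distances = pd.DataFrame(
    {"Center A": [0.7, 3.1, 5.0], "Center B": [2.4, 0.9, 4.2], "Center C": [1.5, 1.1, 0.3]},
    index=["School 1", "School 2", "School 3"],
)
# Hypothetical program flags per center (1 = offered), like the 'After School Program' column.
offers_afterschool = pd.Series({"Center A": 1, "Center B": 0, "Center C": 1})

def nearest_offering(distances, offers):
    """Nearest center per school, considering only centers that offer the program."""
    eligible_cols = [c for c in distances.columns if offers[c] == 1]
    eligible = distances[eligible_cols]
    return pd.DataFrame({
        "Nearest Resource Center": eligible.idxmin(axis=1),
        "Distance (miles)": eligible.min(axis=1),
    })

matches = nearest_offering(distances, offers_afterschool)
print(matches)
```

Running the function once per programme type keeps the afterschool and test prep matches separate, so a school that needs both gets one nearest centre for each.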

**Schools - Resource centre matching  - Test prep**

We have used our earlier list of schools that would benefit from test prep to match with centres that offer such programs.

In this list you will be able to see which is the nearest resource centre to these schools

In [None]:
#Making csv for Cluster 4 : Test Prep

Cluster_4 = School_exp[['School Name_x', 'Cluster 4']].copy()
Cluster_4['Nearest Resource Center'] = place
Cluster_4['Distance to the Nearest Resource Center'] = dist
#Keep only the schools flagged for test prep, then drop the flag column
Cluster_4 = Cluster_4[Cluster_4['Cluster 4'] != 'no']
Cluster_4 = Cluster_4.drop(columns=['Cluster 4'])
Cluster_4.to_csv('7_Partners for Test Prep.csv')

In [None]:
print(Cluster_4.head(n=10))

print('Please note: The entire list will be available as an output of this notebook')

We hope that our analysis was useful for you and will help you in the amazing work you are doing. 

We tried to align our efforts with the PASSNYC directive of keeping the model influential and shareable through its reproducibility and customisability, as well as making its logic easy to articulate.

A lot of the data that we have presented here can be easily recreated using spreadsheets, and we would be glad to offer you the Excel versions of these calculations as well.

*Thanks for reading through our submission,

Mansi & Samarth*
