# Preface
---
This is a public kernel submitted to the PASSNYC Data Science for Good Kaggle competition. <br>
https://www.kaggle.com/passnyc/data-science-for-good/home <br><br>
Author: Jared Ucherek <br>
Source: https://github.com/jareducherek/PassNYC-Kaggle <br>
<br><br>

This kernel aims to provide PASSNYC with specific schools that would be a good target to start tutoring programs for the SHSAT. Each section begins with a brief description of its contents. 

It is assumed that readers have a general understanding of American schooling and the SHSAT testing process for students in NYC. If you are unfamiliar with NYC, feel free to use the links below to briefly familiarize yourself. Many conclusions below simply reaffirm concepts commonly understood about the American education system. In large American cities across the nation, poor neighborhoods with underfunded schools typically have higher concentrations of minorities.
<br><br><br><br>
SHSAT testing description: <br>
https://www.schools.nyc.gov/school-life/learning/testing/specialized-high-school-admissions-test

Interactive maps of NYC income distributions by borough: <br>
https://ny.curbed.com/2017/8/9/16119400/income-distribution-nyc-map

Previous studies that provide similar findings to this exploratory data analysis: <br>
http://www.centernyc.org/high-school-diversity-data/

An interactive map that illustrates travel time in NYC: <br>
https://project.wnyc.org/transit-time/#40.84135,-73.86692,12,1611

<a id='contents'></a>

# Table of Contents
---


[1. Summary and Approach](#summary)<br>
[2. Imports and Configs](#imports)  <br>
[3. Exploratory Data Analysis](#data)  <br>
-  [a. 2016 NYC Schools](#schools)  <br>
-  [b. D5 SHSAT Registrations](#shsat)  <br>
-  [c. Specialized Schools](#specialized)  <br>
-  [d. General Demographics](#demographics)<br>

[4. Conclusion: Targeted Schools](#conclusion)  <br>
-  [a. Proposal 1: Economic Need](#map1)  <br>
-  [b. Proposal 2: Eighth Grade Size](#map2)  <br>
-  [c. Proposal 3: School Proximity](#map3)<br>
-  [d. Finalized map and Closing Remarks](#map4)<br>



<a id='summary'></a>

# Summary and Approach
***
The PASSNYC team is tasked with a few goals that may not be simultaneously accomplishable. Realistically, increasing testing diversity should increase the diversity of the specialized high schools; however, these schools admit students solely based on test scores. Providing unprepared students with the opportunity to take the exam would simply result in higher rejection rates. 

Top-quality schools extensively prepare their students to take this exam throughout their education. Therefore, helping more students register for the exam would most likely result in very little change. Moreover, the demographics of the test takers are shown to be relatively diverse. Due to the nature of standardized testing and the strict admittance criteria, White and Asian students who are more prepared are accepted at higher rates, which limits the potential for diversity in specialized schools. 

Overall, the solution is to find a balance between providing marginalized schools the opportunity to start registering students for the exam, and tutoring students who might already be close to being admitted into a specialized school. In one case, new schools are introduced into the testing pools, but the students will most likely not succeed without sufficient preparation; in the other case, students already intending to take the test might improve their ability to score well, but testing demographics would not change much. 

There are countless ways to begin increasing the diversity of the eight specialized high schools. Given the reasonable diversity of test takers, yet the lack of diversity in acceptance rates, it is apparent that schools registering mostly Black and Hispanic students to take the exam do not have the resources to properly prepare them. Therefore, the targeted schools from this study will adhere to a small number of attributes: proximity to specialized high schools, percentage Hispanic and Black, and student size. 

There are other factors that will be examined to provide information about a specific school's performance, but proximity, demographics, and size are most important. Students already attending schools near specialized high schools would not need to adapt to a new neighborhood, and their commute to the specialized high schools would not be too long. Additionally, because the program is untested, it is more reasonable to experiment with the effectiveness of tutoring programs by offering services to larger schools. These programs could start up quickly and utilize feedback received by interviewing students or possibly maintaining anonymous test results. This approach aims to equalize the acceptance rates between demographics, rather than blindly equalize the testing demographics, which are reasonably representative of the NYC student population. 

By minimizing deployment time and focusing on a few large schools, hopefully the PASSNYC team can develop a viable program eventually offered to a profusion of underserved students.

[Jump to Table of Contents](#contents)

<a id='imports'></a>

# Imports and Configs
***

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

from geopy.geocoders import Nominatim
import folium
import ast

pd.options.display.max_rows = 4000
pd.options.display.max_columns = 4000
pd.options.display.max_seq_items = 2000

[Jump to Table of Contents](#contents)

<a id='data'></a>

# Exploratory Data Analysis

This section begins by generalizing the data provided in the '2016 School Explorer' CSV. The generalized analysis serves to contextualize the issue of diversity within specialized schools and should be widely accepted. Next, the 'D5 SHSAT Registrations and Testers' CSV is combined with the '2016 School Explorer' data to determine any significant predictors of the number of students who take the SHSAT. Lastly, the '2014-2015-doe-high-school-directory' CSV is filtered to show the 8 specialized schools in New York, and the 'school-district-breakdowns' CSV is used to help show the diversity within NYC schools. 

Given the subject of the data, and the current understanding of American schools, nothing found should be out of the ordinary. This analysis is not meant to draw erroneous conclusions; rather, this aims to plainly provide details of NYC schools, and familiarize those foreign to the characteristics of American schools which face socioeconomic discrimination.  




[Jump to Table of Contents](#contents)

<a id='schools'></a>

## 2016 Schools Data Exploration
***
This section cleans the '2016 School Explorer.csv' by changing certain columns to numbers in order to create a large heatmap of the data. Many inferences can be made by studying the heatmap, but here are some important relationships:

<br>
As 'Percent Asian' increases:
 - Economic Need Index decreases (r = -0.359294)
 - School Income Estimate ($) increases (r = 0.247352)
 - Student Attendance Rate increases (r = 0.182024)
 - Percent of Students Chronically Absent decreases (r = -0.403291)
 - Average ELA Proficiency increases (r = 0.462710)
 - Average Math Proficiency increases (r = 0.528597)
 
<br>
As 'Percent Black / Hispanic' increases:
 - Economic Need Index increases (r = 0.775140)
 - School Income Estimate ($) decreases (r = -0.685344)
 - Student Attendance Rate decreases (r = -0.208507)
 - Percent of Students Chronically Absent increases (r = 0.520729)
 - Average ELA Proficiency decreases (r = -0.748990)
 - Average Math Proficiency decreases (r = -0.735648)
 
<br>
As 'Percent White' increases:
 - Economic Need Index decreases (r = -0.771980)
 - School Income Estimate ($) increases (r = 0.716063)
 - Student Attendance Rate increases (r = 0.141135)
 - Percent of Students Chronically Absent decreases (r = -0.390038)
 - Average ELA Proficiency increases (r = 0.650427)
 - Average Math Proficiency increases (r = 0.579286)
 
<br>
r is <a href="https://en.wikipedia.org/wiki/Correlation_coefficient">Pearson product-moment correlation coefficient</a>

English Language Arts (ELA) and Math tests are standardized tests given to public school students. For more information, feel free to read the <a href="http://www.corestandards.org/read-the-standards/">ELA and Math standards</a>.
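
As a rough illustration of how each r value listed above is obtained, the sketch below computes Pearson's r by hand on made-up numbers and checks it against `np.corrcoef`. The two arrays and the `pearson_r` helper are hypothetical and do not come from the School Explorer data.

```python
import numpy as np

# Hypothetical values for five schools (not taken from the dataset).
percent_white = np.array([0.05, 0.10, 0.30, 0.50, 0.70])
economic_need = np.array([0.90, 0.85, 0.60, 0.40, 0.20])

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(percent_white, economic_need)
r_numpy = np.corrcoef(percent_white, economic_need)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates a weak one; the sign gives the direction.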

In [None]:
#import data and show dimensions
data = pd.read_csv("../input/data-science-for-good/2016 School Explorer.csv")
data.info()

In [None]:
fig, ax = plt.subplots(figsize=(10,8)) 
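# Only numeric columns can be correlated; text columns are skipped until they are cleaned below
sns.heatmap(data = data[data.columns[15:41]].select_dtypes(include='number').corr(), cmap = 'coolwarm', ax=ax)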

### Data Cleaning:
**~26 columns should be included in this matrix, but they must be cleaned first.**

In [None]:
# Columns 15-41 can be converted to floating numbers to increase the size of our heatmap. 
# I will try to move left to right in this section to make it as straightforward as possible. 
# Choosing columns manually also helps me begin understanding what each one means.

clean_data = data.copy()

### Community School

In [None]:
clean_data['Community School?'] = pd.Series(np.where(clean_data['Community School?'].values == 'Yes', 1, 0),
                                  clean_data.index)

### School Income Estimate

In [None]:
clean_data = clean_data.rename(columns = {'School Income Estimate': 'School Income Estimate ($)'})

In [None]:
clean_data['School Income Estimate ($)'].isnull().sum()

In [None]:
clean_data['School Income Estimate ($)'] = clean_data['School Income Estimate ($)']\
                                        .apply(lambda x: x if type(x) == float else x.replace('$',''))\
                                        .apply(lambda x: x if type(x) == float else x.replace(',',''))\
                                        .astype(float)

### Percent Columns

In [None]:
#Converting the % Amounts to decimals

percent_cols = ['Percent ELL', 'Percent Asian',
                'Percent Black', 'Percent Hispanic', 'Percent Black / Hispanic',
                'Percent White', 'Student Attendance Rate',
                'Percent of Students Chronically Absent', 'Rigorous Instruction %',
                'Collaborative Teachers %', 'Supportive Environment %', 'Effective School Leadership %',
                'Strong Family-Community Ties %', 'Trust %']

In [None]:
clean_data[percent_cols] = clean_data[percent_cols]\
                            .apply(lambda x: x.apply(lambda x: x if type(x) == float else x.replace('%',''))\
                            .astype(float)/100)

### Rating Columns (Likert Scale)

In [None]:
# Commented out due to relation to Percentage columns

#likert_columns = ['Rigorous Instruction Rating', 'Collaborative Teachers Rating',
#                 'Supportive Environment Rating', 'Effective School Leadership Rating',
#                 'Strong Family-Community Ties Rating', 'Trust Rating', 'Student Achievement Rating']
#clean_data[likert_columns].apply(pd.Series.value_counts)


In [None]:
#These columns can be encoded to view the correlation as well. 
#pd.unique(clean_data[likert_columns].values.ravel())


In [None]:
#cat_type = pd.api.types.CategoricalDtype(categories=['Not Meeting Target','Approaching Target', 'Meeting Target', 'Exceeding Target'], 
#                            ordered=True)

#temp = pd.DataFrame()
#temp = data[likert_columns]

#for col in likert_columns:
#    temp[col] = temp[col].astype(cat_type)
    
#clean_data[likert_columns] = temp[likert_columns].apply(lambda x: x.cat.codes).replace(-1, np.nan)

### Results

In [None]:
clean_data[clean_data.columns[15:44]].head(2)

Economic Need Index = (Percent Temporary Housing) + (Percent HRA-eligible * 0.5) + (Percent Free Lunch Eligible * 0.5)

For universal lunch schools, the percentage of free lunch eligible comes from the last year the school collected lunch forms. “HRA-eligible” refers to students whose families have been identified by the Human Resources Administration as receiving certain types of public assistance. HRA-eligible is based on current year data. Students are identified in temporary housing if they have been identified in temporary housing anytime in the past four years. Students identified in temporary housing who are also HRA-eligible count toward both percentages. Students who are HRA-eligible also count toward Percent Free Lunch Eligible.


http://schools.nyc.gov/NR/rdonlyres/7B6EEB8B-D0E8-432B-9BF6-3E374958EA70/0/EducatorGuide_EMS_20131118.pdf
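
The formula above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the example percentages are hypothetical, and the inputs are expressed as fractions between 0 and 1.

```python
def economic_need_index(pct_temp_housing, pct_hra_eligible, pct_free_lunch):
    """Economic Need Index as defined above; all inputs are fractions in [0, 1]."""
    return pct_temp_housing + 0.5 * pct_hra_eligible + 0.5 * pct_free_lunch

# Hypothetical school: 10% temporary housing, 40% HRA-eligible, 80% free-lunch eligible.
eni = economic_need_index(0.10, 0.40, 0.80)
print(round(eni, 2))  # 0.7
```

Because HRA-eligible and free-lunch-eligible students each carry half weight, temporary housing has the largest per-student effect on the index.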

In [None]:
fig, ax = plt.subplots(figsize=(14,12)) 
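# The rating columns are still text, so restrict the correlation matrix to numeric columns
sns.heatmap(data = clean_data[clean_data.columns[15:41]].select_dtypes(include='number').corr(), cmap = 'coolwarm', ax=ax)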

In [None]:
#Used to grab specific correlation numbers referenced in the section overview.
#clean_data[clean_data.columns[15:41]].corr()

[Jump to Table of Contents](#contents)

<a id='shsat'></a>

## SHSAT Data Exploration
***
This section cleans the 'D5 SHSAT Registrations and Testers.csv' and combines it with the dataset above. More data from other schools from the Department of Education would be helpful, but data requests apparently take upwards of several months to be granted. Regardless, here are the important inferences from the new heatmap: 

<br>
As 'Percent Asian' increases:
 - Number of students who registered for the SHSAT increases (r = 0.197907) 
 - Number of students who took the SHSAT increases (r = 0.434725)
 
<br>
As 'Percent Black / Hispanic' increases:
 - Number of students who registered for the SHSAT decreases (r = -0.190516) 
 - Number of students who took the SHSAT decreases (r = -0.513048)
 
<br>
As 'Percent White' increases:
 - Number of students who registered for the SHSAT increases  (r = 0.254128) 
 - Number of students who took the SHSAT increases (r = 0.586734)
 
<br>
r is <a href="https://en.wikipedia.org/wiki/Correlation_coefficient">Pearson product-moment correlation coefficient</a>
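
The combination step described at the top of this section is a key-based join. The sketch below shows the idea on two tiny tables: the `DBN` and `Location Code` column names match the datasets, but the rows are invented, and the `indicator` column is an optional sanity check rather than something this notebook relies on.

```python
import pandas as pd

# Invented rows standing in for the School Explorer and SHSAT tables.
schools = pd.DataFrame({
    'Location Code': ['05M046', '05M123', '01M015'],
    'Percent White': [0.02, 0.03, 0.10],
})
registrations = pd.DataFrame({
    'DBN': ['05M046', '05M123'],
    'Number of students who took the SHSAT': [12, 5],
})

# A left join keeps every school; `_merge` records whether an SHSAT row was found.
merged = schools.merge(registrations, how='left',
                       left_on='Location Code', right_on='DBN', indicator=True)
matched = merged[merged['_merge'] == 'both']
print(len(merged), len(matched))
```

Counting the `both` rows is a quick way to confirm how many schools actually have SHSAT data before computing correlations on the merged table.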

In [None]:
#Import data and show dimensions

shsat = pd.read_csv("../input/data-science-for-good/D5 SHSAT Registrations and Testers.csv")
shsat.info()

This data is really helpful, and there are some interesting applications in other public kernels. One thing to note: 9th grade testing rates are low for a reason, since specialized schools start at 9th grade. I am going to drop the 9th grade data because very few students are accepted at that stage, and they are basically considered transfer students. Most information online suggests that the vast majority of test takers are in the 8th grade, applying to the schools to start 9th grade.

https://www.princetonreview.com/k12/shsat-information

In [None]:
shsat_clean = shsat[shsat['Grade level'] == 8]

In [None]:
shsat_locs = {
    '05M046' : (40.831629, -73.936006), '05M123' : (40.820165, -73.944486), '05M129' : (40.815000, -73.952222),
    '05M148' : (40.817322, -73.947338), '05M161' : (40.817755, -73.952468), '05M286' : (40.815478, -73.955556),
    '05M302' : (40.817458, -73.947372), '05M362' : (40.810687, -73.956061), '05M367' : (40.815478, -73.955556), 
    '05M410' : (40.815681, -73.955774), '05M469' : (40.807063, -73.938829), '05M499' : (40.824398, -73.936545),
    '05M514' : (40.819702, -73.956747), '05M670' : (40.815225, -73.944321), '84M065' : (40.810745, -73.949076),
    '84M284' : (40.812433, -73.948153), '84M336' : (40.820126, -73.956664), '84M341' : (40.808695, -73.936839),
    '84M350' : (40.814584, -73.944991), '84M384' : (40.805584, -73.935484), '84M388' : (40.815042, -73.945689),
    '84M481' : (40.805976, -73.951846), '84M709' : (40.821182, -73.940665), '84M726' : (40.819764, -73.95724) 
    }

In [None]:
print('2016: ' + str(shsat_clean[shsat_clean['Year of SHST'] == 2016]['School name'].unique().size))
print('2015: ' + str(shsat_clean[shsat_clean['Year of SHST'] == 2015]['School name'].unique().size))
print('2014: ' + str(shsat_clean[shsat_clean['Year of SHST'] == 2014]['School name'].unique().size))
print('2013: ' + str(shsat_clean[shsat_clean['Year of SHST'] == 2013]['School name'].unique().size))

In [None]:
this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')

def plotDot(point):
    '''input: a row (Series) with 'School name', 'DBN', enrollment, and SHSAT counts
    this function looks up the coordinates in shsat_locs and adds a CircleMarker to this_map'''
    x = point['School name'] + '<br>' + "Enrollment on 10/31: " + str(point['Enrollment on 10/31']) + '<br>' + \
    "Students who took SHSAT: " + str(point['Number of students who took the SHSAT'])
    iframe = folium.IFrame(html=x, width=500, height=90)
    popup = folium.Popup(iframe)
    folium.CircleMarker(location=[shsat_locs[point.DBN][0], shsat_locs[point.DBN][1]],
                        radius=3, weight=5, color='red', popup=popup).add_to(this_map)

#The red schools are schools contained in the SHSAT dataset. 
shsat_clean.apply(plotDot, axis = 1)

#Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())

this_map

In [None]:
shsat_clean_2016 = shsat_clean[shsat_clean['Year of SHST'] == 2016].reset_index(drop=True)
schools_shsat_data = clean_data[clean_data['Location Code'].isin(shsat_clean['DBN'].unique())].reset_index(drop=True)
schools_shsat_data = schools_shsat_data.merge(shsat_clean_2016, left_on='Location Code', right_on='DBN')
# I choose not to drop these extra columns because it is a nice sanity check to make sure the merge worked correctly. 
#schools_shsat_data.drop(columns=['School name', 'DBN', 'Year of SHST', 'Grade level'])

In [None]:
fig, ax = plt.subplots(figsize=(14,12)) 
sns.heatmap(data = schools_shsat_data[list(schools_shsat_data.columns[15:41]) + 
                                      list(schools_shsat_data.columns[-3:])]
                   .select_dtypes(include='number').corr(), 
            cmap = 'coolwarm', ax=ax)

In [None]:
#Used to grab specific correlation numbers referenced in the section overview.
#schools_shsat_data[list(schools_shsat_data.columns[15:41]) + list(schools_shsat_data.columns[-3:])].corr()

In [None]:
train = schools_shsat_data[list(schools_shsat_data.columns[15:41]) + list(schools_shsat_data.columns[-3:])]
train = train[train.select_dtypes(include='number').columns]
train.dropna(inplace=True)
train.info()

In [None]:
X = train[train.columns[:-10]]
y1 = train[train.columns[-2]]
y2 = train[train.columns[-1]]

X2 = sm.add_constant(X)
est = sm.OLS(y2, X2)
est2 = est.fit()
print(est2.summary())

This heatmap is informative enough for analyzing the schools in such a broad context. Ultimately, everything that can be taken from this graph should be rather common knowledge for local residents of the city. This correlation is something I would expect even where I am from.

Poor students and poor families struggle to receive equal opportunities in schools. This is no surprise, so let's try to find some schools that are good candidates to kickstart this PASSNYC program. 

[Jump to Table of Contents](#contents)

<a id='specialized'></a>

## NYC Specialized schools Data Exploration
***

There are only 9 specialized high schools in NYC. LaGuardia uses auditions to accept students, which leaves 8 high schools that accept students based on SHSAT scores; because there are so few, comprehensive analysis of individual schools would not be difficult. Comparing the geographic location and the diversity of each school might help increase the granularity of this study. At the very least, noting the location of each specialized high school helps provide additional context to the overall problem. 

<br><br>
Specialized High Schools: <br>
http://schools.nyc.gov/ChoicesEnrollment/High/specialized/default.htm

In [None]:
highschools = pd.read_csv("../input/nyc-high-school-directory/2014-2015-doe-high-school-directory.csv")
highschools.info(verbose = False)

In [None]:
highschools.head(2)

In [None]:
#highschools['school_type'].unique()
#highschools[highschools['school_type'].isnull()]

In [None]:
specialized_highschool_list = ['Bronx High School of Science', 'Brooklyn Latin School, The', 
                           'Brooklyn Technical High School', 'High School for Mathematics, Science and Engineering at City College',
                           'High School of American Studies at Lehman College', 'Queens High School for the Sciences at York College', 
                           'Staten Island Technical High School', 'Stuyvesant High School', 
                           ]

#Audition only schools = 'Fiorello H. LaGuardia High School of Music & Art and Performing Arts'

specialized_hs = highschools[highschools['school_name'].isin(specialized_highschool_list)].reset_index(drop=True)

In [None]:
schools_shsat_data.head(2)

In [None]:
this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')

def plotBlueDots(point):
    '''input: series that contains a numeric named Latitude and a numeric named Longitude
    this function creates a blue CircleMarker and adds it to your this_map'''
    x = point['School Name'] + '<br>' + "Grade 8 Students: " + str(point['Grade 8 Math - All Students Tested'])
    iframe = folium.IFrame(html=x, width=300, height=40)
    popup = folium.Popup(iframe)
    folium.CircleMarker(location=[point.Latitude, point.Longitude],
                        radius=2, weight=5, popup=popup,
                       color = 'blue').add_to(this_map)
    
def plotRedDots(point):
    '''input: series that contains a numeric named latitude and a numeric named longitude
    this function creates a CircleMarker and adds it to your this_map'''
    x = point['School name'] + '<br>' + "Enrollment on 10/31: " + str(point['Enrollment on 10/31']) + '<br>' + \
    "Students who took SHSAT: " + str(point['Number of students who took the SHSAT'])
    iframe = folium.IFrame(html=x, width=500, height=90)
    popup = folium.Popup(iframe)
    folium.CircleMarker(location=[point.Latitude, point.Longitude],
                        radius=2, weight=5, popup=popup, 
                       color = 'red').add_to(this_map)
    
def plotMarker(point):
    '''input: series that contains a 'Location 1' string holding latitude and longitude
    this function creates a Marker and adds it to your this_map'''
    folium.Marker(location=[float(ast.literal_eval(point['Location 1'])['latitude']), 
                          float(ast.literal_eval(point['Location 1'])['longitude'])],
                          popup=point['school_name']).add_to(this_map)

    
    
    
clean_data.apply(plotBlueDots, axis = 1)
#The red schools are schools contained in the SHSAT dataset. 
schools_shsat_data.apply(plotRedDots, axis = 1)
specialized_hs.apply(plotMarker, axis = 1)

#Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())

this_map

There are too many underserved schools in New York to try and characterize them all at a microscopic level. A good starting metric would be to focus on underperforming schools that are close to these specialized schools. A short commute is an extremely important thing to consider when choosing where to attend high school, especially for students who may have to walk or take public transportation. 
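
One way to turn "close to a specialized school" into a number is the great-circle distance between two coordinates. The sketch below uses the haversine formula; the function name is hypothetical, and the two coordinate pairs are simply the 05M046 and 05M123 entries from `shsat_locs`, used here as sample points. The same call would work for a school and a specialized high school. Straight-line distance ignores transit routes, so it is only a proxy for commute time.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Two coordinate pairs copied from shsat_locs (05M046 and 05M123).
distance = haversine_km(40.831629, -73.936006, 40.820165, -73.944486)
print(round(distance, 2))
```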

[Jump to Table of Contents](#contents)

<a id='demographics'></a>

## General Demographics
***
By analyzing the demographics of NYC students, test takers, and specialized high school students, we can determine where the breakdown in diversity occurs. Unsurprisingly, Black and Hispanic students, who make up the majority of students at underperforming schools in New York, also score poorly on the SHSAT at higher rates. 

In [None]:
district_demographics = pd.read_csv('../input/nyc-school-district-breakdowns/school-district-breakdowns.csv')

In [None]:
district_demographics['COUNT PARTICIPANTS'].sum()

In [None]:
temp_cols = ['COUNT BLACK NON HISPANIC', 'COUNT HISPANIC LATINO', 
        'COUNT WHITE NON HISPANIC', 'COUNT ASIAN NON HISPANIC']
# Total students per group, in the same order as the chart labels below
students = [district_demographics[col].sum() for col in temp_cols]

# Everyone not counted in the four groups above
remainder = district_demographics['COUNT PARTICIPANTS'].sum() - sum(students)
students.append(remainder)

In [None]:
labels = ['Black', 'Hispanic', 'White', 'Asian', 'Other/Unspecified']
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'grey']
tested = np.array([5770, 6514, 5125, 8732, 2192])
offered = np.array([207, 319, 1342, 2619, 580])
pass_rates = offered/tested

title_size = 25
font = {'family' : 'DejaVu Sans',
        'weight' : 'normal',
        'size'   : 18}
plt.rc('font', **font)

plt.figure(0, figsize = (20, 20))
plt.subplot(2,2,1)
plt.title('2018 SHSAT Testing', 
          fontdict = {'fontsize': title_size})
plt.pie(tested, labels=labels, colors=colors,
        autopct='%1.1f%%', startangle = 90)
plt.axis('square')
 
#plt.figure(1, figsize = (10, 10))
plt.subplot(2,2,2)
plt.title('2018 Specialized Schools Acceptances',
         fontdict = {'fontsize': title_size})
plt.pie(offered, labels=labels, colors=colors,
    autopct='%1.1f%%', startangle = 90)
plt.axis('square')

#plt.figure(1, figsize = (10, 10))
plt.subplot(2,2,3)
plt.title('2018 Specialized Schools Pass Rates',
         fontdict = {'fontsize': title_size})
plt.barh(width = pass_rates, y=labels, align='center',
        color='green', ecolor='black')
plt.axis('tight')

#plt.figure(2, figsize = (10, 10))
plt.subplot(2,2,4)
plt.title('2018 Student Survey',
         fontdict = {'fontsize': title_size})
plt.pie(students, labels=labels, colors=colors,
    autopct='%1.1f%%', startangle = 90)

plt.axis('square')
plt.show()

Comparing the demographics from each of the charts reveals why specialized schools lack diversity. Students who take the test show a reasonable amount of diversity, but White and Asian students score high enough to be accepted at a much higher rate. This reduces the diversity when the students begin their 9th grade year at the specialized schools.

Data obtained from: https://www.wsj.com/articles/who-got-into-stuyvesant-and-new-yorks-other-elite-public-high-schools-1520465259
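
The gap described above can be quantified directly from the `tested` and `offered` counts used for the charts. The sketch below recomputes each group's share of test takers, share of offers, and offer rate; the variable names other than `labels`, `tested`, and `offered` are introduced here for illustration only.

```python
labels = ['Black', 'Hispanic', 'White', 'Asian', 'Other/Unspecified']
tested = [5770, 6514, 5125, 8732, 2192]
offered = [207, 319, 1342, 2619, 580]

total_tested = sum(tested)
total_offered = sum(offered)

for label, n_tested, n_offered in zip(labels, tested, offered):
    share_of_testers = n_tested / total_tested
    share_of_offers = n_offered / total_offered
    offer_rate = n_offered / n_tested
    print(f"{label:<18} testers {share_of_testers:6.1%}  "
          f"offers {share_of_offers:6.1%}  rate {offer_rate:6.1%}")

# Offer rate per group, keyed by label, for later comparisons.
offer_rates = {label: o / t for label, t, o in zip(labels, tested, offered)}
```

On these counts, Black and Hispanic students together make up roughly 43% of test takers but only about 10% of offers.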

[Jump to Table of Contents](#contents)

<a id='conclusion'></a>

# Conclusion: Targeted Schools

## Identifying the Schools in Need for SHSAT support
***
Because every student in a public school must take the ELA and Math standardized tests (barring some rare circumstance), we can determine the effective student size of a school based on how many students took the test. 

Again, accounting for size, location, and demographics should be sufficient to find a few schools at which to start a pilot tutoring program; however, analyzing other variables might refine the available options. There are justifications for choosing schools that have both positive and negative qualities, and it might be interesting to look for schools that lack funding but are still performing well on other metrics such as test scores, student attendance, or leadership.

In [None]:
this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')

def plotBlueDots(point):
    '''input: series that contains a numeric named latitude and a numeric named longitude
    this function creates a CircleMarker and adds it to your this_map'''
    folium.CircleMarker(location=[point.Latitude, point.Longitude],
                        radius=2, weight=5,
                       color = 'blue').add_to(this_map)
    
def plotMarker(point):
    '''input: series that contains a 'Location 1' string holding latitude and longitude
    this function creates a Marker and adds it to your this_map'''
    folium.Marker(location=[float(ast.literal_eval(point['Location 1'])['latitude']), 
                          float(ast.literal_eval(point['Location 1'])['longitude'])],
                          popup=point['school_name']).add_to(this_map)

clean_data[(clean_data['Grade 8 Math - All Students Tested'] > 0) &
          (clean_data['Percent Black / Hispanic'] > .60)].apply(plotBlueDots, axis = 1)



#use df.apply(,axis=1) to "iterate" through every row in your dataframe
specialized_hs.apply(plotMarker, axis = 1)


#Set the zoom to the maximum possible
this_map.fit_bounds(this_map.get_bounds())

this_map

By plotting every school with eighth graders and more than 60% Black / Hispanic enrollment onto the map, the scope of the problem is further revealed. There are a lot of schools in NYC, so a reasonable metric needs to be established to filter these.

In [None]:
high_qualifiers = ['Economic Need Index', 'Percent Black / Hispanic', 'Student Attendance Rate', 
                   'Average ELA Proficiency', 'Average Math Proficiency',
                   'Grade 8 Math - All Students Tested', 'Grade 8 ELA - All Students Tested']

low_qualifiers = ['School Income Estimate ($)', 'Effective School Leadership %', 
                  'Collaborative Teachers %', 'Strong Family-Community Ties %', 'Trust %']

target_cols = clean_data[['School Name', 'Latitude', 'Longitude', 'Grade High'] + high_qualifiers + low_qualifiers]
target_cols = target_cols[target_cols['Grade 8 Math - All Students Tested'] > 0]
#target_cols = target_cols[target_cols['Grade High'] == '08']
#plt.figure(0, figsize = (20, 20))


In [None]:
target_cols.hist(figsize=(15,25),layout=(5,3))

Because the distribution of every column is skewed (determined through a quick visual check above), the data should either be log-transformed or analyzed using quantiles.

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

target_cols.select_dtypes(include=numerics).apply(np.log).drop(['Latitude', 'Longitude'], axis=1).describe()

In [None]:
target_cols.select_dtypes(include=numerics).apply(np.log).drop(['Latitude', 'Longitude'], axis=1).info()

It appears that zeros exist in several columns, which rules out a log transformation for normalizing the data. Fortunately, there is a large number of samples, so analyzing the quantiles is an acceptable way to determine how the schools rank within certain categories. 
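
The problem with zeros is easy to reproduce. The sketch below uses a made-up column to show that `np.log` maps 0 to negative infinity, which breaks summary statistics, while quantile thresholds remain well defined. The values are hypothetical.

```python
import numpy as np

# Made-up, right-skewed column containing a zero.
values = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 100.0])

with np.errstate(divide='ignore'):   # silence the log(0) warning
    logged = np.log(values)

print(logged[0])          # -inf
print(np.mean(logged))    # -inf: the mean is no longer usable

# Quantiles depend only on rank order, so the zero causes no trouble.
lower, upper = np.quantile(values, [0.25, 0.75])
print(lower, upper)
```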

In [None]:
target_cols.quantile(q=0.75, axis='rows', numeric_only=True)

In [None]:
target_cols.quantile(q=0.25, axis='rows', numeric_only=True)

In [None]:
high_qualifiers = ['Economic Need Index', 'Percent Black / Hispanic', 'Student Attendance Rate', 
                   'Average ELA Proficiency', 'Average Math Proficiency',
                   'Grade 8 Math - All Students Tested', 'Grade 8 ELA - All Students Tested']

low_qualifiers = ['School Income Estimate ($)', 'Effective School Leadership %', 
                  'Collaborative Teachers %', 'Strong Family-Community Ties %', 'Trust %']

def filter_df_col(df, col, greater_flag):
    # Thresholds come from the unfiltered target_cols, not from the partially filtered df
    if greater_flag:
        threshold = target_cols.quantile(q=0.4, axis='rows', numeric_only=True)[col]
        df = df.loc[df[col] > threshold]
    else:
        threshold = target_cols.quantile(q=0.6, axis='rows', numeric_only=True)[col]
        df = df.loc[df[col] < threshold]
    return df

filtered_targets = target_cols.copy()

for col in high_qualifiers:
    filtered_targets = filter_df_col(filtered_targets, col, True)
    
for col in low_qualifiers:
    filtered_targets = filter_df_col(filtered_targets, col, False)
    
filtered_targets.info()

[Jump to Table of Contents](#contents)

<a id='map1'></a>

## Proposal 1: Economic Need
***

It appears that too many qualifiers are being used to find schools that are poorly funded but still performing acceptably. This is mainly due to the correlation between funding and performance observed in the earlier sections. Recall that the percentage of Black and Hispanic students attending a school is highly correlated with the school's funding and performance. Because of this correlation, however, finding a school that is deserving of PASSNYC's services should not be difficult. Picking a school with a high percentage of Black and Hispanic students would normally mean that the school needs extra services. Realistically, the only thing left to do is find a school with students who would be open to attending a specialized high school. 


For the three proposals below, three different approaches are used to filter the list of schools. The filtered list is then plotted on the maps.

The categories considered across the different maps are:
- Economic Need Index
- Percent Black / Hispanic
- Grade 8 Math - All Students Tested
- School Income Estimate
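
The quantile filtering used in the proposals can be illustrated with one small helper. This is a minimal sketch rather than the kernel's own code: `filter_by_quantile`, its parameters, and the toy DataFrame are hypothetical and invented for illustration. As in the cells of this notebook, the cutoff is taken from a fixed reference frame, so applying one filter does not move the threshold of the next.

```python
import pandas as pd

def filter_by_quantile(df, reference, col, q, keep_above):
    """Keep rows of df whose value in col lies strictly above (or below)
    the q-th quantile of reference[col]."""
    cutoff = reference[col].quantile(q)
    mask = df[col] > cutoff if keep_above else df[col] < cutoff
    return df.loc[mask]

# Toy data, invented for illustration only (not taken from the PASSNYC dataset).
toy = pd.DataFrame({
    'School Name': ['A', 'B', 'C', 'D', 'E'],
    'Economic Need Index': [0.2, 0.4, 0.6, 0.8, 0.9],
    'School Income Estimate ($)': [90000.0, 70000.0, 50000.0, 30000.0, 20000.0],
})

# High economic need (above the median) and low income (below the median).
result = filter_by_quantile(toy, toy, 'Economic Need Index', 0.5, keep_above=True)
result = filter_by_quantile(result, toy, 'School Income Estimate ($)', 0.5, keep_above=False)
remaining = result['School Name'].tolist()
print(remaining)
```

Passing the percentile and direction as arguments avoids redefining the function whenever the thresholds change.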

In [None]:
high_qualifiers = ['Economic Need Index']
low_qualifiers = ['School Income Estimate ($)']

high_percentile = .5
low_percentile = .2

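def filter_df_col(df, col, greater_flag):
    # Keep rows above the high_percentile cutoff of col (greater_flag=True)
    # or below the low_percentile cutoff (False); cutoffs come from the unfiltered target_cols.
    if greater_flag:
        df = df.loc[df[col] > target_cols.quantile(q=high_percentile, axis='rows', numeric_only=True)[col]]
    else:
        df = df.loc[df[col] < target_cols.quantile(q=low_percentile, axis='rows', numeric_only=True)[col]]
    return df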

filtered_targets1 = target_cols.copy()

for col in high_qualifiers:
    filtered_targets1 = filter_df_col(filtered_targets1, col, True)
    
for col in low_qualifiers:
    filtered_targets1 = filter_df_col(filtered_targets1, col, False)
    
filtered_targets1.info()

In [None]:
color_threshold = .7

this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')
    
def plotTargets(point):
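    '''input: row (Series) with 'School Name', 'Grade High', 'Grade 8 Math - All Students Tested',
    'Percent Black / Hispanic', Latitude and Longitude; adds a folium.Circle with a popup to this_map,
    red when Percent Black / Hispanic exceeds color_threshold, otherwise blue'''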
    percent_black_hispanic = point['Percent Black / Hispanic']
    x = point['School Name'] + '<br>' + \
        "Highest Grade: " + str(point['Grade High']) + '<br>' + \
        "Grade 8 Students: " + str(point['Grade 8 Math - All Students Tested']) + '<br>' + \
        "Percent Black / Hispanic: " + str(percent_black_hispanic) + '<br>'
    iframe = folium.IFrame(html=x, width=400, height=90)
    popup = folium.Popup(iframe)
    folium.Circle(location=[point.Latitude, point.Longitude],
                  radius=400, fill = True, popup=popup, 
                  color = 'red' if percent_black_hispanic > color_threshold 
                                  else 'blue').add_to(this_map)

def plotMarker(point):
    '''input: row (Series) with 'school_name' and a 'Location 1' string holding a dict literal
    with 'latitude' and 'longitude' keys; adds a folium.Marker for that school to this_map'''
    folium.Marker(location=[float(ast.literal_eval(point['Location 1'])['latitude']), 
                          float(ast.literal_eval(point['Location 1'])['longitude'])],
                          popup=point['school_name']).add_to(this_map)

filtered_targets1.apply(plotTargets, axis=1)
specialized_hs.apply(plotMarker, axis = 1)

# Fit the view so that every plotted school is visible
this_map.fit_bounds(this_map.get_bounds())

this_map

[Jump to Table of Contents](#contents)

<a id='map2'></a>

## Proposal 2: Eighth Grade Size

***

In [None]:
high_qualifiers = ['Grade 8 Math - All Students Tested']

high_percentile = .9
low_percentile = 0

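def filter_df_col(df, col, greater_flag):
    # Keep rows above the high_percentile cutoff of col (greater_flag=True)
    # or below the low_percentile cutoff (False); only the True branch is used in this proposal.
    if greater_flag:
        df = df.loc[df[col] > target_cols.quantile(q=high_percentile, axis='rows', numeric_only=True)[col]]
    else:
        df = df.loc[df[col] < target_cols.quantile(q=low_percentile, axis='rows', numeric_only=True)[col]]
    return df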

filtered_targets2 = target_cols.copy()

for col in high_qualifiers:
    filtered_targets2 = filter_df_col(filtered_targets2, col, True)

    
filtered_targets2.info()

In [None]:
color_threshold = .7

this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')
    
def plotTargets(point):
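    '''input: row (Series) with 'School Name', 'Grade High', 'Grade 8 Math - All Students Tested',
    'Percent Black / Hispanic', Latitude and Longitude; adds a folium.Circle with a popup to this_map,
    red when Percent Black / Hispanic exceeds color_threshold, otherwise blue'''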
    percent_black_hispanic = point['Percent Black / Hispanic']
    x = point['School Name'] + '<br>' + \
        "Highest Grade: " + str(point['Grade High']) + '<br>' + \
        "Grade 8 Students: " + str(point['Grade 8 Math - All Students Tested']) + '<br>' + \
        "Percent Black / Hispanic: " + str(percent_black_hispanic) + '<br>'
    iframe = folium.IFrame(html=x, width=400, height=90)
    popup = folium.Popup(iframe)
    folium.Circle(location=[point.Latitude, point.Longitude],
                  radius=400, fill = True, popup=popup, 
                  color = 'red' if percent_black_hispanic > color_threshold 
                                  else 'blue').add_to(this_map)

def plotMarker(point):
    '''input: row (Series) with 'school_name' and a 'Location 1' string holding a dict literal
    with 'latitude' and 'longitude' keys; adds a folium.Marker for that school to this_map'''
    folium.Marker(location=[float(ast.literal_eval(point['Location 1'])['latitude']), 
                          float(ast.literal_eval(point['Location 1'])['longitude'])],
                          popup=point['school_name']).add_to(this_map)

filtered_targets2.apply(plotTargets, axis=1)
specialized_hs.apply(plotMarker, axis = 1)

# Fit the view so that every plotted school is visible
this_map.fit_bounds(this_map.get_bounds())


this_map

[Jump to Table of Contents](#contents)

<a id='map3'></a>

## Proposal 3: School Proximity
***

In [None]:
filtered_targets3 = target_cols.copy()

In [None]:
color_threshold = .7

this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')
    
def plotTargets(point):
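    '''input: row (Series) with 'School Name', 'Grade High', 'Grade 8 Math - All Students Tested',
    'Percent Black / Hispanic', Latitude and Longitude; adds a folium.Circle with a popup to this_map,
    red when Percent Black / Hispanic exceeds color_threshold, otherwise blue'''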
    percent_black_hispanic = point['Percent Black / Hispanic']
    x = point['School Name'] + '<br>' + \
        "Highest Grade: " + str(point['Grade High']) + '<br>' + \
        "Grade 8 Students: " + str(point['Grade 8 Math - All Students Tested']) + '<br>' + \
        "Percent Black / Hispanic: " + str(percent_black_hispanic) + '<br>'
    iframe = folium.IFrame(html=x, width=400, height=90)
    popup = folium.Popup(iframe)
    folium.Circle(location=[point.Latitude, point.Longitude],
                  radius=400, fill = True, popup=popup, 
                  color = 'red' if percent_black_hispanic > color_threshold 
                                  else 'blue').add_to(this_map)

def plotMarker(point):
    '''input: row (Series) with 'school_name' and a 'Location 1' string holding a dict literal
    with 'latitude' and 'longitude' keys; adds a folium.Marker for that school to this_map'''
    folium.Marker(location=[float(ast.literal_eval(point['Location 1'])['latitude']), 
                          float(ast.literal_eval(point['Location 1'])['longitude'])],
                          popup=point['school_name']).add_to(this_map)

filtered_targets3.apply(plotTargets, axis = 1)
specialized_hs.apply(plotMarker, axis = 1)
this_map.fit_bounds(this_map.get_bounds())


this_map

[Jump to Table of Contents](#contents)

<a id='map4'></a>

## Finalized Map
***
By interacting with the maps above, I selected a subset of schools that are within a reasonable distance of specialized high schools; they are replotted below. Most of the schools near the two specialized high schools would be great places to start tutoring programs, because they are clustered together and their students would have the chance to attend either high school without drastically changing their commute.
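
The selection above was made by eye on the interactive maps. As a rough programmatic stand-in, straight-line distance could be used to shortlist nearby schools. The following is only a sketch under stated assumptions: `haversine_km`, `schools_within`, the 3 km cutoff, and all coordinates are hypothetical, and great-circle distance is not the same as commute time.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def schools_within(schools, target, max_km):
    """Return the names of schools, given as (name, lat, lon), within max_km of target (lat, lon)."""
    return [name for name, lat, lon in schools
            if haversine_km(lat, lon, target[0], target[1]) <= max_km]

# Hypothetical coordinates, invented for illustration; not real school locations.
target_hs = (40.88, -73.89)
candidate_schools = [
    ('Hypothetical School A', 40.87, -73.89),
    ('Hypothetical School B', 40.70, -73.89),
]

nearby = schools_within(candidate_schools, target_hs, max_km=3.0)
print(nearby)
```

A travel-time source, such as the transit map linked in the preface, would be a better measure of commute than straight-line distance.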

In [None]:
schools1 = ['P.S. 161 PEDRO ALBIZU CAMPOS']
schools2 = ['THE NEW SCHOOL FOR LEADERSHIP AND JOURNALISM']

schools3 = ['INTERNATIONAL SCHOOL FOR LIBERAL ARTS',
           'P.S./M.S. 280 MOSHOLU PARKWAY', 'J.H.S. 080 THE MOSHOLU PARKWAY', 
            'P.S./M.S. 20 P.O.GEORGE J. WERDANN, III', 'P.S. 095 SHEILA MENCHER'
           ]

specialized_target_schools = ['High School for Mathematics, Science and Engineering at City College', 
                              'High School of American Studies at Lehman College',
                              'Bronx High School of Science']


this_map = folium.Map(prefer_canvas=True, tiles='Stamen Toner')
    
def plotTargets(point, color):
    '''input: row (Series) with 'School Name', 'Grade High', 'Grade 8 Math - All Students Tested',
    'Percent Black / Hispanic', Latitude and Longitude, plus a color name;
    adds a folium.Circle of that color with a popup to this_map'''
    percent_black_hispanic = point['Percent Black / Hispanic']
    x = point['School Name'] + '<br>' + \
        "Highest Grade: " + str(point['Grade High']) + '<br>' + \
        "Grade 8 Students: " + str(point['Grade 8 Math - All Students Tested']) + '<br>' + \
        "Percent Black / Hispanic: " + str(percent_black_hispanic) + '<br>'
    iframe = folium.IFrame(html=x, width=500, height=90)
    popup = folium.Popup(iframe)
    folium.Circle(location=[point.Latitude, point.Longitude],
                  radius=400, fill = True, popup=popup, 
                  color = color).add_to(this_map)


#use df.apply(,axis=1) to "iterate" through every row in your dataframe
specialized_hs[specialized_hs['school_name'].isin(specialized_target_schools)].apply(plotMarker, axis = 1)
target_cols[target_cols['School Name'].isin(schools1)].apply(lambda x: plotTargets(x, 'blue'), axis=1)
target_cols[target_cols['School Name'].isin(schools2)].apply(lambda x: plotTargets(x, 'red'), axis=1)
target_cols[target_cols['School Name'].isin(schools3)].apply(lambda x: plotTargets(x, 'green'), axis=1)


# Fit the view so that every plotted school is visible
this_map.fit_bounds(this_map.get_bounds())


this_map


## Closing Remarks
***

The exploratory analysis, interactive maps, and final approach in this notebook should serve as a guide to the background and subjective nature of the problem. Although competitors have submitted a wide variety of analyses on this subject, it is important to start soon and gather feedback from tutors, students, and schools in order to refine the process. With such a complex issue, overanalyzing the topic can obscure the goal and stifle progress. The schools listed above would all be great candidates for starting to help underprivileged kids prepare for the rigorous SHSAT.

<br><br><br>

Feel free to contact me via Kaggle or GitHub to discuss this topic or to suggest changes or bug fixes to my kernel. Thank you for reading.
<br>
GitHub: https://github.com/jareducherek

[Jump to Table of Contents](#contents)