# Introduction

In this notebook we tackle the Data Science for Good Challenge posted by PASSNYC. The main goal of this challenge is to support PASSNYC in its endeavour to help underrepresented students gain access to specialized high schools in NYC through a number of support programs. 

Our analysis begins from a relatively broad perspective, where we try to identify the main variables determining success for students. We start from the premise that access to these specialized schools depends not only on academic criteria but also on personal elements. As we will show below, the data points towards a scenario where the academic component is strongly affected by elements like the economic situation of students and schools. It also provides hints that some students may be failing to even take the test, perhaps due to a lack of confidence or of a supportive environment.

The notebook is structured in the following sections:

- **Data preparation**: here we import all the required data, including the minimum of two NYC Kaggle datasets demanded by competition rules, and carry out the processing of the data. On a first read, or for readers more interested in the actual data analysis, we recommend skipping this section.
- **Data analysis**: this section contains the actual data analysis. It is broken down into two subsections: the big picture, where we look at the data from a global perspective, and a more in-depth analysis, where we look into more specific details. In the latter we also take the time to delve into the difference in performance between community and non-community schools, a characteristic which plays a significant role.
- **Targeting PASSNYC's support**: in this section we use the insights gained in the data analysis section to develop an ad-hoc score which should help PASSNYC decide which schools to prioritize. We present a sample of the highest priority schools according to this ad-hoc score, as well as a geographical visualization of schools where we emphasize those in more urgent need of assistance.
- **Conclusions**: here we present a brief conclusion of our findings.

# Data Preparation

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Read in main data

## This DataFrame will be our main DataFrame, so we will add additional data from other sources below
schoolExplorer2016File = '../input/data-science-for-good/2016 School Explorer.csv'
df = pd.read_csv(schoolExplorer2016File,header=0)

# Convert percentage fields into numerical values, add new features, etc.

df['Average Proficiency'] = df.loc[:,'Average ELA Proficiency':'Average Math Proficiency'].mean(axis=1)

# Normalize percentage rates
percentageFields =  list(df.filter(like='Percent').columns.values) + \
                    list(df.filter(like='Rate').columns.values) + \
                    list(df.filter(like='%').columns.values)
        
fillRate = {}
minRate = 0.01 # Arbitrary value 
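for f in percentageFields:
    # Transform strings such as '85%' into fractions (0.85); missing entries stay NaN
    df[f] = df[f].apply(lambda x: float(x[:-1])/100 if isinstance(x, str) else x)

    # Fill missing entries with the column mean (already expressed as a fraction)
    fillRate[f] = df[f].mean()
    df[f] = df[f].fillna(fillRate[f])
    
    # Normalize: clip very small rates from below
    df.loc[df[f]<minRate, f] = minRate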
    
def remComma(x):
    if isinstance(x,str):
        x = float(x[1:].replace(',',''))
        
    return x
    
fillRate['School Income Estimate'] = df['School Income Estimate'].dropna().apply(lambda x: float(x[1:].replace(',',''))).mean()
df['School Income Estimate']=df['School Income Estimate'].apply(remComma)
df.fillna(value={'School Income Estimate' : fillRate['School Income Estimate']},inplace=True);

# There is a typo on the 'Grade 3 Math - All Students Tested' label (Tested->tested)
df['Grade 3 Math - All Students Tested'] = df['Grade 3 Math - All Students tested']
df.drop(labels=['Grade 3 Math - All Students tested'], axis=1, inplace=True)

# Compute percentage of outstanding students in sample
for g in range(3,9): 
    df['Grade ' + str(g) + ' ELA 4s %'] = df['Grade ' +str(g)+' ELA 4s - All Students'].map(float) / df['Grade ' + str(g) + ' ELA - All Students Tested']
    df['Grade ' + str(g) + ' Math 4s %'] = df['Grade ' +str(g)+' Math 4s - All Students'].map(float) / df['Grade ' + str(g) + ' Math - All Students Tested']
    df['Grade ' + str(g) + ' 4s %'] = df[['Grade ' + str(g) + ' ELA 4s %', 'Grade ' + str(g) + ' Math 4s %']].mean(axis=1)
    
# Schools with no students taking exams yield NaNs; mark them with a sentinel value of -0.1
df[df.filter(like='4s %').columns]=df[df.filter(like='4s %').columns].fillna(-0.1)

# Convert coordinates into Web Mercator Format (we will need this when plotting maps)
def wgs84_to_web_mercator(df, lon="Longitude", lat="Latitude"):
    """ 
        Converts decimal longitude/latitude to Web Mercator format
        Code taken from the Bokeh tutorial on Geoplotting
        https://hub.mybinder.org/user/bokeh-bokeh-notebooks-xq55q0f7/notebooks/tutorial/09%20-%20Geographic%20Plots.ipynb
    """
    k = 6378137 # earth's radius in m
    df[lon+'WebMercator'] = df[lon] * (k * np.pi/180.0)
    df[lat+'WebMercator'] = np.log(np.tan((90 + df[lat]) * np.pi/360.0)) * k
    return df

wgs84_to_web_mercator(df)

# Sort entries by Location Code (will be useful later on)
df.sort_values(by='Location Code', inplace=True)

In [None]:
# Read in NYC DOE high school directory data

doeHSDir2016File = '../input/nyc-high-school-directory/2016-doe-high-school-directory.csv'
doeHSDir2015File = '../input/nyc-high-school-directory/2014-2015-doe-high-school-directory.csv'
HSDir2016 = pd.read_csv(doeHSDir2016File,header=0)
HSDir2015 = pd.read_csv(doeHSDir2015File,header=0)

# Prepare DOE data

labelCols = {'dbn':'DBN'}
numericCols = {'total_students':'Total Students'}
stringCols = {'extracurricular_activities':'Extracurricular Activities',
                   'language_classes':'Language Classes',
                   'advancedplacement_courses':'Advanced Placement Courses',
                   'school_sports':'School Sports',
                   'partner_highered':'Partner Higher Ed',
                   'partner_cultural':'Partner Cultural',
                   'partner_financial':'Partner Financial'}

interestingCols = {**labelCols, **numericCols, **stringCols}
    
HSDir2016.rename(columns=interestingCols,inplace=True)
HSDir2015.rename(columns=interestingCols,inplace=True)

HSDir2016.sort_values(by='DBN',inplace=True)
HSDir2015.sort_values(by='DBN',inplace=True)

hsDirFilter = HSDir2015['DBN'].isin(HSDir2016['DBN'].values)

# Generate a unified DataFrame for both 2015 and 2016.
HSDir = HSDir2016[list(interestingCols.values())].copy()

commonSchoolFilter2016 = HSDir['DBN'].isin(HSDir2015['DBN'].values)
commonSchoolFilter2015 = HSDir2015['DBN'].isin(HSDir['DBN'].values)

HSDir.loc[commonSchoolFilter2016 , 'Total Students'] = 0.5*(HSDir.loc[commonSchoolFilter2016 , 'Total Students'].values + HSDir2015.loc[commonSchoolFilter2015 , 'Total Students'].values)

def cleanSemicolons(s):
    if isinstance(s,str):
        s=s.replace(';',',')
    
    return s

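for col in stringCols.values():
    # Merge the 2015 entries into the 2016 ones for schools present in both directories
    merged = []
    for l, r in zip(HSDir.loc[commonSchoolFilter2016 , col].values , HSDir2015.loc[commonSchoolFilter2015 , col].values):
        if isinstance(l,str) and isinstance(r,str):
            l += ',' + r
        elif isinstance(r,str):
            l = r
        merged.append(l)
            
    # Write the merged entries back (reassigning the loop variable alone has no effect)
    HSDir.loc[commonSchoolFilter2016 , col] = merged
    HSDir.loc[commonSchoolFilter2016 , col] = HSDir.loc[commonSchoolFilter2016 , col].map(cleanSemicolons)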
    
HSDir = pd.concat([HSDir , HSDir2015.loc[~hsDirFilter,list(interestingCols.values())]], ignore_index=True)
HSDir.sort_values(by='DBN', inplace=True)

# Now we add all this data to the 'main' DataFrame
HSDirFilter = HSDir['DBN'].isin(df['Location Code'].values)
DFFilter = df['Location Code'].isin(HSDir['DBN'].values)

def countCommas(s):
    if isinstance(s,str):
        return 1 + s.count(',')
    else:
        return 0

df['Total Students'] = pd.Series(np.nan, index=df.index)
df.loc[DFFilter , 'Total Students'] = HSDir.loc[HSDirFilter, 'Total Students'].values 

# Fill missing values with a simple estimate. We distinguish between school type because of the analysis below
commSchoolFilter = df['Community School?']=='Yes'
df.loc[~DFFilter & commSchoolFilter , 'Total Students'] = df.loc[commSchoolFilter,'Total Students'].mean()
df.loc[~DFFilter & ~commSchoolFilter , 'Total Students'] = df.loc[~commSchoolFilter,'Total Students'].mean()

for col in stringCols.values():
    # Count the comma-separated entries per school; schools without directory data get 0
    df[col] = pd.Series(np.nan, index=df.index)
    df.loc[DFFilter , col] = HSDir.loc[HSDirFilter, col].map(countCommas).values
    df[col] = df[col].fillna(0)

In [None]:
# Read in NYC School Demographics and Accountability data

schoolDemographicsFile = '../input/ny-school-demographics-and-accountability-snapshot/2006-2012-school-demographics-and-accountability-snapshot.csv'
HSDemo = pd.read_csv(schoolDemographicsFile,header=0)

# Prepare NYC School Demographics and Accountability data

HSDemo.sort_values(by='DBN', inplace=True)

## Replace spurious blank strings read into numeric fields by NaNs
def replaceString(x):
    if isinstance(x,str) and x.strip() == '':
        x = np.nan
    else:
        x = float(x)
    
    return x

gradeCountCols = HSDemo.filter(like='grade').columns.values
for l in gradeCountCols:
    # Missing counts are kept as NaN so that they are ignored when averaging below
    HSDemo[l] = HSDemo[l].map(replaceString)

# Consider only average counts
HSDemo = HSDemo.groupby('DBN').mean(numeric_only=True)
HSDemo.reset_index(inplace=True)

genderCols = {'male_per':'Male %', 'female_per':'Female %'}
HSDemo.rename(columns=genderCols,inplace=True)

## Insert student counts to main DataFrame
HSDemoFilter = HSDemo['DBN'].isin(df['Location Code'].values)
DFFilter = df['Location Code'].isin(HSDemo['DBN'].values)

for col in list(gradeCountCols)+list(genderCols.values()):
    df[col] = pd.Series(np.nan, index=df.index)
    df.loc[DFFilter,col] = HSDemo.loc[HSDemoFilter, col].values

In [None]:
# Read in TNYTimes data
TNYTimesFile = '../input/the-new-york-times-nyc-shsat-data/nyc-shsat-data.csv'
TNYTimes = pd.read_csv(TNYTimesFile,header=0)

# Prepare TNYTimes data

TNYTimes.drop(labels=['Unnamed: 4','Source: New York City Department of Education. ^Data suppressed for values of 5 students or fewer.'], axis=1, inplace=True)
TNYTimes.sort_values(by='DBN', inplace=True)

def replaceSmall(s):
    '''Values of 5 students or fewer were suppressed and replaced by an \'s\' in the original data. This function replaces that placeholder with 0.'''
    if s=='s' or s=='s^':
        s = 0 # np.NaN
    return s

TNYTimes['Offers'] = TNYTimes['Offers'].map(replaceSmall)
TNYTimes['Testers'] = TNYTimes['Testers'].map(replaceSmall)

# Insert data into main DataFrame
df['SHSAT Takers'] = 0 # np.NaN
df['SHSAT Offers'] = 0 # np.NaN

hsFilter = TNYTimes['DBN'].isin(df['Location Code'].values)
dfFilter = df['Location Code'].isin(TNYTimes['DBN'].values)

df.loc[dfFilter , 'SHSAT Takers'] = TNYTimes.loc[hsFilter, 'Testers'].map(float).values
df.loc[dfFilter , 'SHSAT Offers'] = TNYTimes.loc[hsFilter, 'Offers'].map(float).values

df['SHSAT Takers %'] = df['SHSAT Takers'] / df['Total Students']
df['SHSAT Offers %'] = df['SHSAT Offers'] / df['SHSAT Takers']
df['SHSAT Offers %'] = df['SHSAT Offers %'].fillna(0)

# Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb; sb.set()

## The big picture
To begin our analysis we try to find simple trends in the data that can be used to identify which variables appear to play a key role in a school's performance. This will allow us to gain some intuition regarding the correlations present in the data. 

In [None]:
meanAverageProficiency = df['Average Proficiency'].mean()
medianAverageProficiency = df['Average Proficiency'].median()
medianSchoolIncomeEstimate = df['School Income Estimate'].median()
df.fillna(value={'Average Proficiency' : meanAverageProficiency},inplace=True);

print('Mean Average Proficiency = ' + str(meanAverageProficiency))
print('Median Average Proficiency = ' + str(medianAverageProficiency))
print('Median School Income Estimate = ' + str(medianSchoolIncomeEstimate))

### Three key variables for academic proficiency

We were able to identify three variables exhibiting a clear effect on average proficiency, namely:
- Economic Need Index
- Chronic Absence
- Estimated School Income

As we will see below, these variables serve not only as a proxy for academic performance but could also, possibly, hint towards issues arising outside school boundaries, e.g., in a student's household, hindering a student's chances of entering a specialized school. 

#### Economic Need Index (ENI)

As the data below shows, there is a clear anticorrelation between a school's Economic Need Index and the average proficiency of its students, i.e., the larger a school's ENI, the lower the average ELA/Math proficiency of its students tends to be. Statistically, this anticorrelation can be measured using Pearson's $R$ coefficient ($|R| \leq 1$, with the extremes indicating maximal (anti-)correlation), for which we find $R=-0.76$. The black dashed line indicates the median of all school average proficiencies.
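
For readers who want to see what goes into Pearson's coefficient, the following is a minimal sketch of its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the toy values are hypothetical and are not taken from the School Explorer data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, NOT taken from the School Explorer data
eni_toy = [0.2, 0.4, 0.6, 0.8, 0.9]
proficiency_toy = [3.4, 3.0, 2.7, 2.4, 2.2]

r_toy = pearson_r(eni_toy, proficiency_toy)
print(round(r_toy, 2))
```

Within this notebook the same quantity should be obtainable directly with pandas, e.g. `df['Economic Need Index'].corr(df['Average Proficiency'])`, which uses Pearson's method by default.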

In [None]:
g=sb.jointplot(x=df['Economic Need Index'],y=df['Average Proficiency'],
           kind='kde',color='b');

# Horizontal dashed line at the median average proficiency
Xs = [df['Economic Need Index'].min(), df['Economic Need Index'].max()]
Meds=[medianAverageProficiency,medianAverageProficiency]
g.ax_joint.plot(Xs, Meds, '--k')

#### Chronic Absence

Chronic absence is the second variable showing a strong degree of anticorrelation with average proficiency. In this case we find $R=-0.5$. 

Just as interestingly, the distribution appears to exhibit a bimodal structure separating schools with higher scores from those with lower scores. 

In [None]:
g=sb.jointplot(x=df['Percent of Students Chronically Absent'],y=df['Average Proficiency'],
           kind='kde',color='b')

# Horizontal dashed line at the median average proficiency
Xs = [df['Percent of Students Chronically Absent'].min(), df['Percent of Students Chronically Absent'].max()]
Meds=[medianAverageProficiency,medianAverageProficiency]
g.ax_joint.plot(Xs, Meds, '--k')

sb.jointplot(x=df['Economic Need Index'],y=df['Percent of Students Chronically Absent'],
           kind='kde',color='b');

To identify these clusters we employ Ward agglomerative clustering and plot the joint and marginal distributions by cluster, where we denote the cluster of schools with low chronic absence/high average proficiency in blue and the cluster with high chronic absence/low average proficiency in red.

We find that the cluster of schools obtaining higher scores (blue) is centered around a chronic absence rate of about $10\%$ and exhibits a median score about $70\%$ higher than that of the cluster of schools with lower scores. This latter cluster is centered at a chronic absence rate of about $30\%$.
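
As a rough illustration of how per-cluster summaries of this kind can be computed once every school carries a cluster label, consider the sketch below. The DataFrame `toy_clusters`, its column names and its values are hypothetical toy data chosen for readability; they are not the notebook's actual clustering output.

```python
import pandas as pd

# Hypothetical toy data: labels as they might come out of a two-cluster fit
toy_clusters = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1, 1],
    'absence': [0.08, 0.10, 0.12, 0.28, 0.30, 0.33],
    'proficiency': [3.3, 3.4, 3.5, 2.0, 2.0, 2.1],
})

# Median absence rate and median proficiency of each cluster
cluster_medians = toy_clusters.groupby('cluster')[['absence', 'proficiency']].median()

# Cluster labels are arbitrary, so identify the clusters by their absence rate
low_absence = cluster_medians['absence'].idxmin()
high_absence = cluster_medians['absence'].idxmax()

# Relative difference between the median proficiencies of both clusters
relative_gap = (cluster_medians.loc[low_absence, 'proficiency']
                / cluster_medians.loc[high_absence, 'proficiency'] - 1)
print(cluster_medians)
print(relative_gap)
```

Identifying the clusters through their median absence rate, rather than through the label value, avoids relying on which cluster happens to be labelled 0 or 1.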

In [None]:
from sklearn.cluster import AgglomerativeClustering

clusterer = AgglomerativeClustering(n_clusters=2, linkage= 'ward')
clusterLabels = clusterer.fit_predict(df[['Average Proficiency','Percent of Students Chronically Absent']])

cluster0 = clusterLabels == 0
cluster1 = clusterLabels == 1

g = sb.JointGrid(x=df['Percent of Students Chronically Absent'][cluster1],y=df['Average Proficiency'][cluster1],
                xlim=[df['Percent of Students Chronically Absent'].min(),df['Percent of Students Chronically Absent'].max()],
                ylim=[df['Average Proficiency'].min(),df['Average Proficiency'].max()])

g = g.plot_joint(sb.kdeplot, cmap='Blues_d')
g = g.plot_marginals(sb.distplot, color="b")
g.x = df['Percent of Students Chronically Absent'].values[cluster0]
g.y = df['Average Proficiency'].values[cluster0]
g = g.plot_joint(sb.kdeplot, cmap='Reds_d')
g = g.plot_marginals(sb.distplot, color="r")

#### Estimated School Income (ESI)

School income appears to be a variable providing a clear threshold separating two subgroups of schools with low scores and high scores. What we find is that ESI exhibits a median of about $48$K USD (shown by the red dashed line); there is, however, a considerable number of schools lying significantly below this value, the vast majority of which have average proficiencies below the median for all schools (shown by the black dashed line). This provides a hint that schools with incomes significantly below the median are very likely to exhibit scores below the median score.
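
One way to put a number on this threshold effect is to compare, on either side of the median income, the share of schools whose proficiency lies below the overall median. The sketch below does this for a hypothetical toy table; the name `toy_income`, its columns and its values are illustrative assumptions and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the School Explorer dataset
toy_income = pd.DataFrame({
    'income': [20e3, 25e3, 30e3, 48e3, 60e3, 80e3, 95e3],
    'proficiency': [2.1, 2.3, 2.7, 2.6, 3.0, 2.4, 3.4],
})

income_median = toy_income['income'].median()
proficiency_median = toy_income['proficiency'].median()

# Schools whose income estimate lies strictly below the median income
low_income = toy_income['income'] < income_median

# Share of schools with below-median proficiency on each side of the threshold
share_low = (toy_income.loc[low_income, 'proficiency'] < proficiency_median).mean()
share_high = (toy_income.loc[~low_income, 'proficiency'] < proficiency_median).mean()
print(share_low, share_high)
```

Applied to the main DataFrame, the same comparison would use `df['School Income Estimate']` and `df['Average Proficiency']` in place of the toy columns.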

In [None]:
g=sb.jointplot(x=df['School Income Estimate'],y=df['Average Proficiency'],
           kind='kde',color='b',n_levels=15)

# Horizontal dashed line at the median average proficiency
Xs = [df['School Income Estimate'].min(), df['School Income Estimate'].max()]
Meds=[medianAverageProficiency,medianAverageProficiency]
g.ax_joint.plot(Xs, Meds, '--k')

Ys = [df['Average Proficiency'].min(), df['Average Proficiency'].max()]
Meds=[medianSchoolIncomeEstimate,medianSchoolIncomeEstimate]

g.ax_joint.plot(Meds, Ys, '--r');

### SHSAT participation rates

In [None]:
ratingFields = df.filter(like='Rating').columns.values
ratingValues = df['Supportive Environment Rating'].dropna().unique()

dfTakersByRating = pd.DataFrame(columns=ratingFields,index=['Meeting/Exceeding Target','Not Meeting/Approaching Target'])

for rf in ratingFields:
    filt = df[rf].isin(['Exceeding Target','Meeting Target'])
    dfTakersByRating.loc['Meeting/Exceeding Target',rf] = df.loc[filt, 'SHSAT Takers %'].mean()
    
    filt = df[rf].isin(['Approaching Target','Not Meeting Target'])
    dfTakersByRating.loc['Not Meeting/Approaching Target',rf] = df.loc[filt, 'SHSAT Takers %'].mean()

dfTakersByRating = dfTakersByRating.reset_index(col_fill='Value').rename(columns={'index':'Score'})
dfTakersByRating = dfTakersByRating.melt(id_vars=['Score'], value_vars=ratingFields, var_name='Rating', value_name='Average SHSAT Takers %')
dfTakersByRating['Rating'] = dfTakersByRating['Rating'].map(lambda s: s.replace(' Rating',''))

Gaining access to a specialized school requires not only that students be prepared for the demands of the SHSAT but also, and perhaps just as importantly, that they feel confident and supported enough to actually take the exam.

The results in the figure below show quite clearly how schools offering collaborative teachers, a supportive environment, rigorous instruction, as well as proper student achievement, manage to get more of their students to take the SHSAT. Interestingly, strong family-community ties appear to reduce the number of students taking the exam. Understanding this will require additional efforts.

This observation is important since it shows that it should be possible, in principle, to achieve an increased enrollment of underrepresented minorities in specialized high schools also by endowing students with the confidence that they too are capable of succeeding both in the exam as well as inside a specialized school.

In [None]:
fig, (ax1) = plt.subplots(figsize=(21, 8), ncols=1)
sb.barplot(x="Rating", y="Average SHSAT Takers %", hue='Score', data=dfTakersByRating, ax=ax1)
plt.show()

## A more in-depth analysis

### The case of community schools

According to our estimates, community schools account for roughly $5\%$ of the total student population in NYC. At first sight, this might suggest that giving them additional consideration is not worth the effort. However, as we show below, community schools exhibit worrying signs in terms of academic performance and access to specialized high schools, which makes them natural candidates for PASSNYC's programs.

In [None]:
dfBySchoolType = df.groupby('Community School?').mean(numeric_only=True)
dfBySchoolType.reset_index(inplace=True)

percentCols=['Percent Asian','Percent Black / Hispanic', 'Percent White']

dfMinorityPercentage = dfBySchoolType.melt(id_vars=['Community School?'], value_vars=percentCols,var_name='Group', value_name='Percentage')

dfMinorityPercentage['Group'] = dfMinorityPercentage['Group'].map(lambda s: s.replace('Percent ','')).values

##### Low academic performance and high rates of chronic absence

As we show below, community schools systematically underperform when compared to non-community schools. First, their overall scores fall within the lowest quartile of non-community school scores, with only a few outliers obtaining results comparable to the median performance of non-community schools. Second, the median percentage of chronically absent students is nearly twice as large for community schools as for non-community schools. This strengthens our observation above regarding the strong connection between the number of chronically absent students and lower academic performance. Moreover, it shows that community schools appear to suffer severely from this connection.
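
The two comparisons described above (community-school scores against the lowest quartile of non-community schools, and the ratio of median absence rates) can be sketched as follows. The table `toy_schools`, its column names and its values are hypothetical and only meant to show the computation.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the School Explorer dataset
toy_schools = pd.DataFrame({
    'community': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No'],
    'proficiency': [2.0, 2.1, 2.3, 2.2, 2.6, 2.8, 3.0, 3.4],
    'absent': [0.38, 0.40, 0.45, 0.15, 0.18, 0.20, 0.22, 0.30],
})

is_community = toy_schools['community'] == 'Yes'

# First quartile of non-community proficiency vs. median community proficiency
q1_non_community = toy_schools.loc[~is_community, 'proficiency'].quantile(0.25)
community_median = toy_schools.loc[is_community, 'proficiency'].median()

# Ratio of the median chronic absence rates of both school types
absence_ratio = (toy_schools.loc[is_community, 'absent'].median()
                 / toy_schools.loc[~is_community, 'absent'].median())
print(community_median < q1_non_community, absence_ratio)
```

In the notebook itself, the columns `Community School?`, `Average Proficiency` and `Percent of Students Chronically Absent` would play the roles of the toy columns.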

In [None]:
fig, (ax1,ax2) = plt.subplots(figsize=(21, 8), ncols=2)
sb.boxplot(x=df['Community School?'],y=df['Average Proficiency'], ax=ax1)
sb.boxplot(x=df['Community School?'],y=df['Percent of Students Chronically Absent'], ax=ax2)
plt.show()

##### Large concentration of Black/Hispanic students
As the following plot illustrates, not only do community schools suffer from a severely lower than average academic performance, but their student bodies also tend to have a much larger proportion of Black/Hispanic students. Since it is precisely these communities which tend to suffer from low access rates to specialized schools, community schools are a very interesting group of targets for PASSNYC's support programs.

In [None]:
fig, (ax1) = plt.subplots(figsize=(21, 8), ncols=1)
sb.barplot(x="Group", y="Percentage", hue='Community School?', data=dfMinorityPercentage, ax=ax1)
plt.show()

##### A need for support systems beyond classroom topics?

A point which draws our attention is the fact that community schools exhibit average numbers of extracurricular activities, advanced placement programs, language classes and school sports. Naively, we would have expected such numbers to be proxies for improvements inside the classroom. However, as we have pointed out above, these schools remain notable underperformers. This leads us to conjecture that the most pressing issues for kids in these schools might originate somewhere outside of school, e.g., that these kids might be exposed to detrimental factors like family or gang violence, or perhaps depression.

We believe PASSNYC's programs focusing on the strengthening of family ties, as well as emotional and mental coaching or mentoring, could prove useful for kids in community schools.

In [None]:
specialActivities=['Extracurricular Activities','Advanced Placement Courses','Language Classes','School Sports']
dfActivities = dfBySchoolType[['Community School?']+specialActivities]
dfActivities = dfActivities.melt(id_vars=['Community School?'], value_vars=specialActivities,var_name='Activity', value_name='Average Offer')

In [None]:
fig, (ax1) = plt.subplots(figsize=(17, 8), ncols=1)
sb.barplot(x="Activity", y="Average Offer", hue='Community School?', data=dfActivities, ax=ax1)
plt.show()

## Targeting PASSNYC's support

Taking all the analysis carried out above, we now address the question of which schools to focus on using PASSNYC's support programs. To do this we develop an ad hoc score for schools in the 2016 School Explorer dataset.

To build up the ad-hoc score we choose the following parameters and assign a weight to each of them:

- Economic Need Index: the school with the largest ENI obtains the full weight, whereas the school with the lowest ENI obtains zero weight.
- Chronic Absence: the school with the largest rate of chronic absence obtains the full weight, whereas the school with the lowest rate obtains zero weight. 
- Community School: if a school is a community school it receives full weight, if not, it receives zero weight.
- School Income Estimate: the school with the lowest SIE obtains the full weight, whereas the school with the highest SIE obtains zero weight. 
- Percentage of Black/Hispanic Students: the school with the largest percentage obtains the full weight, whereas the school with the lowest percentage obtains zero weight.

For the continuous parameters, schools in between receive a share of the weight that interpolates linearly between these two extremes. It is important to note that this approach can readily be adapted to other choices of both the variables used in the scoring and their relative importance.
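
The scoring scheme described above amounts to a weighted sum of min-max normalized variables, where variables for which a low value signals high need are inverted. The following is a minimal sketch under that reading; the function `need_score`, its `inverted` argument, the toy columns and the equal weights are all hypothetical and are not the weights used in this notebook.

```python
import pandas as pd

def need_score(data, weights, inverted=()):
    """Weighted sum of min-max normalized columns.

    Columns listed in `inverted` contribute the full weight at their minimum
    and zero weight at their maximum. Assumes every column has max > min.
    """
    score = pd.Series(0.0, index=data.index)
    for field, weight in weights.items():
        lo, hi = data[field].min(), data[field].max()
        normalized = (data[field] - lo) / (hi - lo)
        if field in inverted:
            normalized = 1.0 - normalized
        score += weight * normalized
    return score

# Hypothetical toy schools, NOT taken from the School Explorer dataset
toy_need = pd.DataFrame(
    {'eni': [0.2, 0.5, 0.9], 'income': [90e3, 50e3, 30e3]},
    index=['A', 'B', 'C'],
)
toy_scores = need_score(toy_need, {'eni': 0.5, 'income': 0.5}, inverted=['income'])
print(toy_scores.sort_values(ascending=False))
```

Because the weights are passed in as a dictionary, adding a variable or changing its relative importance only requires editing that dictionary.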

In [None]:
scoringWeights = {}
scoringWeights['Economic Need Index'] = 0.2
scoringWeights['Percent of Students Chronically Absent'] = 0.2
scoringWeights['School Income Estimate'] = 0.3
scoringWeights['Percent Black / Hispanic'] = 0.3
                                                                      
scoringFields = ['Economic Need Index','Percent of Students Chronically Absent','School Income Estimate','Percent Black / Hispanic']

scoringParams = {}
for field in scoringFields:
    scoringParams[field] = {'max':df[field].max() , 'min':df[field].min()}
    
scores = pd.DataFrame(index=df.index, columns=['DBN','Name','Percent Black / Hispanic','Score'])
scores['DBN'] = df['Location Code'].values
scores['Name'] = df['School Name'].values
scores['Percent Black / Hispanic'] = df['Percent Black / Hispanic'].values
scores['Score'] = 0.0

# A low School Income Estimate signals a higher need, so its normalized value is inverted
invertedFields = ['School Income Estimate']
for field in scoringFields:
    normalized = (df[field].values - scoringParams[field]['min'])/abs(scoringParams[field]['max']-scoringParams[field]['min'])
    if field in invertedFields:
        normalized = 1 - normalized
    scores['Score'] += scoringWeights[field] * normalized

Below we show the 15 highest-scoring candidate schools for PASSNYC's programs.

In [None]:
scores.sort_values(by='Score',ascending=False).head(15)

### Geographical Visualization

To wrap up our analysis we show a map of NYC with schools shown as circles using a color coding given by their ENI (yellow - higher priority, purple - low priority) and their size given by the ad-hoc score.

To obtain more data from a specific school the reader need only use the mouse to hover over a given data point.

In [None]:
from bokeh.io import output_notebook, show
output_notebook()
from bokeh.plotting import figure
from bokeh.models import WMTSTileSource, ColumnDataSource, Circle, ColorBar, BasicTicker, HoverTool
from bokeh.models.mappers import ColorMapper, LinearColorMapper
from bokeh.palettes import Viridis5

# web mercator coordinates
NYC = x_range,y_range = ((df['LongitudeWebMercator'].min(),df['LongitudeWebMercator'].max()) ,
                         (df['LatitudeWebMercator'].min(),df['LatitudeWebMercator'].max()))

p = figure( tools='hover, pan, wheel_zoom',
            x_range=x_range, y_range=y_range, 
            width=1250,
            height=850
          )
p.axis.visible = False

hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [("Score", "@Score"),("ENI", "@ENI"), ("Black/Hispanic %", "@BHPercent")]

url = 'http://a.basemaps.cartocdn.com/dark_all/{Z}/{X}/{Y}.png'
attribution = "Tiles by Carto, under CC BY 3.0. Data by OSM, under ODbL"

p.add_tile(WMTSTileSource(url=url, attribution=attribution))

color_mapper = LinearColorMapper(palette=Viridis5)

source = ColumnDataSource(
    data=dict(
        lat=df['LatitudeWebMercator'].values,
        lon=df['LongitudeWebMercator'].values,
        Score=scores['Score'].values,
        ScoreScaled=np.exp(scores['Score'].values*2.5),
        ENI=df['Economic Need Index'].values,
        BHPercent=df['Percent Black / Hispanic'].values*100        
    )
)

circle = Circle(x="lon", y="lat", size="ScoreScaled", fill_color={'field': 'ENI', 'transform': color_mapper}, fill_alpha=0.5, line_color=None)
p.add_glyph(source, circle)

color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
                     label_standoff=12, border_line_color=None, location=(0,0))

p.add_layout(color_bar, 'right')

show(p)

## Summary

In summary, we have carried out an analysis of the numerous factors potentially playing a key role in a student's success in being accepted at a specialized school. To do this we have employed several data sources, some of which were already available as Kaggle datasets, while others were obtained from external sources (The New York Times).

We found a strong correlation between a school's academic performance and variables like Economic Need Index, Chronic Absence and School Income Estimate. We also found that it may be important for a school to meet certain quality standards, e.g., in terms of collaborative teachers, a supportive environment and rigorous instruction, as this appears to correlate positively with the number of students actually taking the SHSAT.

In a more detailed analysis we looked into community schools, which appear to suffer particularly badly from low academic performance. The information available hints towards academic problems having a root cause extending beyond school limits. We therefore believe students in such schools could greatly profit from support programs focusing on strengthening family ties as well as providing mental coaching, in order to foster the students' self-confidence when facing the SHSAT.

We closed this analysis by developing an ad-hoc score which takes into account the previous observations and can be used as a simple guideline when deciding which schools should be prioritized by PASSNYC when implementing support programs. This score is highly flexible and can easily be tuned to incorporate modifications in the relative weights of its parameters, should more sophisticated approaches be employed in the future.