In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# data visualization, exploration
import seaborn as sns 
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['axes.labelsize'] = 'large'

# I reuse variable names quite often so the below magic line is for enabling autocomplete via tab
%config IPCompleter.greedy=True

Written by Marcus Lee for the public domain under the Apache 2.0 license. You can find me on linkedin.com/marcuslee143 or github.com/MarcusLee143 if you would like to collaborate on this effort.

I would first like to thank PASSNYC for hosting this challenge to provide for those in need in New York City. In the name of reproducible data science, I thought I would break down and organize the problem statement into more understandable terms so that you can follow along with my work here.

#### Objectives:
- Encourage underserved students in NYC to take the SHSAT (Specialized High Schools Admissions Test), the entrance exam for New York City's specialized high schools.
- Identify obstacles in registering for the SHSAT.
- Assist PASSNYC in providing critical outreach services to overcome said obstacles at schools with historically underserved students.

#### Factors identified as good indicators of need by PASSNYC:
- English Language Learners
- Students with Disabilities
- Students on Free/Reduced Lunch
- Students with Temporary Housing

#### Data Sources (publicly available CSV files provided by Kaggle):
**PASSNYC Data:**
- 2016 School Explorer (for identifying schools)
- D5 SHSAT Registrations and Testers (for identifying registration obstacles)

**New York City Open Data:**
- (tbd, still exploring)

With these datasets, we should be able to identify schools whose students are linked to, or have a record of, the factors PASSNYC has identified as important to its outreach services. If you understood everything so far, that's great! If not, feel free to research anything I listed above.

The following cells are for Exploratory Data Analysis. The objective is to sift through the data sources and make them understandable and relatable to the human reader. I'm going to start with the PASSNYC data, and then, if necessary, move on to the New York City Open Data.

# Exploratory Data Analysis

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

## Scratch space for SHSAT registration data
Before I make any complicated links with our data, I just want to better understand the data.

In [None]:
sat_explorer = pd.read_csv('../input/D5 SHSAT Registrations and Testers.csv')

In [None]:
sat_explorer.shape

In [None]:
sat_explorer.columns

In [None]:
sat_explorer.head()

We want to look at schools with low SHSAT registration and participation. We can measure these (for each year) relative to student enrollment and to the number of students who registered. I'm going to define some ratios:

**Registration ratio:** # of students registered for the SHSAT / Enrollment on 10/31

**Participation ratio:** # of students who took the SHSAT / Enrollment on 10/31

**Commitment ratio:** # of students who took the SHSAT / # of students registered for the SHSAT

It will be especially interesting to see if, for example, there are schools where *registration* is accessible, but the *physical test* is not. Different factors may affect these ratios differently, e.g. I would guess that English Language Learners would have difficulty registering AND participating, whereas Students with Disabilities may register just fine but have difficulty showing up to the test. THIS IS JUST AN EXAMPLE HYPOTHESIS and we're going to see what is actually the case as we dive deeper into the data.
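
To make the three ratios concrete, here is a minimal sketch on a toy table. The column names match the SHSAT file, but the numbers are invented, and replacing a zero registration count with NaN before dividing is my own assumption about how to treat schools where nobody registered.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the SHSAT file; the numbers are invented.
toy = pd.DataFrame({
    "Enrollment on 10/31": [100, 80, 50],
    "Number of students who registered for the SHSAT": [40, 0, 10],
    "Number of students who took the SHSAT": [30, 0, 10],
})

enrolled = toy["Enrollment on 10/31"]
registered = toy["Number of students who registered for the SHSAT"]
took = toy["Number of students who took the SHSAT"]

toy["registration_ratio"] = registered / enrolled
toy["participation_ratio"] = took / enrolled
# Assumption: treat "nobody registered" as undefined (NaN) rather than
# letting a division by zero produce NaN or inf depending on the numerator.
toy["commitment_ratio"] = took / registered.replace(0, np.nan)

print(toy[["registration_ratio", "participation_ratio", "commitment_ratio"]])
```

A commitment ratio below 1 means some registered students did not sit the test, which is the gap the hypothesis above is about.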

In [None]:
sat_copy = sat_explorer.copy() # I want to keep the original data clean

In [None]:
sat_copy.drop('DBN', axis=1, inplace=True) # dropping the internal database key

In [None]:
# Calculating the above ratios
num_registered = sat_copy['Number of students who registered for the SHSAT']
num_participated = sat_copy['Number of students who took the SHSAT']
enrollment = sat_copy['Enrollment on 10/31']

sat_copy['registration_ratio'] = num_registered / enrollment
sat_copy['participation_ratio'] = num_participated / enrollment
sat_copy['commitment_ratio'] = num_participated / num_registered

In [None]:
# Sort ratios in ascending order so the schools with the lowest ratios come first
sat_sorted = sat_copy.sort_values(by=['registration_ratio',
                                      'participation_ratio',
                                      'commitment_ratio'], ascending=True, axis=0)
# pandas has weird logic for rows vs columns
# axis=0 (the default) means we are reordering the ROWS, and `by` lists the COLUMN
# LABELS used as sort keys, with later columns breaking ties in earlier ones.
# You're welcome, college students.
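
If the row-versus-column logic still feels slippery, this small sketch may help. The school labels and values are invented purely for illustration.

```python
import pandas as pd

# Invented values; the index labels are placeholders, not real schools.
toy = pd.DataFrame(
    {"registration_ratio": [0.5, 0.0, 0.5, 0.2],
     "participation_ratio": [0.2, 0.0, 0.1, 0.1]},
    index=["school_a", "school_b", "school_c", "school_d"],
)

keys = ["registration_ratio", "participation_ratio"]

# Rows are reordered (axis=0); the second key only matters when the first ties.
lowest_first = toy.sort_values(by=keys, ascending=True)
highest_first = toy.sort_values(by=keys, ascending=False)

print(list(lowest_first.index))
print(list(highest_first.index))
```

Here `school_a` and `school_c` tie on the first key, so their relative order is decided by `participation_ratio`.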

The schools at the top of this sort, with zero registration, probably don't even offer the SHSAT to begin with. It may be of interest for PASSNYC to explore ways to help these schools administer the SHSAT.

Conversely, it may be more efficient to simply leave these schools be on the basis of geographical proximity.

Regardless, for further analysis we probably want to filter out schools where no students registered for the SHSAT. We can come back to those later.

In [None]:
sat_nonzero = sat_sorted.loc[sat_sorted['Number of students who registered for the SHSAT'] > 0]
sat_nonzero.head()

In [None]:
#TODO: With this smaller dataset, examine geographic overlay for each ratio (3 total)

——————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

Alright, so recall that we are interested in identifying the following factors across our datasets so that PASSNYC can **provide services in the areas that actually need them**:
- English Language Learners
- Students with Disabilities
- Students on Free/Reduced Lunch
- Students with Temporary Housing

The beautiful thing about data science, and especially Jupyter notebooks, is that we can look at all the factors we're interested in in one fell swoop. So, let's take a look at the columns and, with those factors in mind, filter our dataset down to the relevant information and work from there.
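
With this many columns, scanning them by eye gets tedious, so one option is to search the column names by keyword. This is only a sketch: `columns_matching` is a hypothetical helper, the toy table is invented, and the keywords are my guess at useful search terms.

```python
import pandas as pd

# Invented miniature of a wide school table.
toy_schools = pd.DataFrame({
    "School Name": ["school a", "school b"],
    "Economic Need Index": [0.9, 0.4],
    "Percent ELL": ["12%", "3%"],
    "Average ELA Proficiency": [2.1, 3.0],
    "Some Unrelated Column": [5, 20],
})

def columns_matching(frame, keywords):
    """Return column names containing any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [c for c in frame.columns if any(k in c.lower() for k in lowered)]

relevant = columns_matching(
    toy_schools, ["economic need", "percent ell", "ela proficiency"]
)
print(relevant)
```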

In [None]:
school_explorer = pd.read_csv('../input/2016 School Explorer.csv')

In [None]:
school_explorer_columns = list(school_explorer.columns)
school_explorer_columns

In [None]:
list(sat_explorer.columns)

PASSNYC's focus is identifying schools that need assistance, so let's start at a higher level and just aggregate information about individual schools and areas.

In [None]:
# There's 161 columns in the 2016 Schools Dataset. No way we need all of them.
school_explorer_geo = school_explorer[
    ['School Name',
     'Latitude',
     'Longitude',
#      'Address (Full)',
     'Economic Need Index', # index for housing, health, and free lunch welfare
     'School Income Estimate',
     'Percent ELL', # exactly one of the factors we wanted
     'Average ELA Proficiency'] # English proficiency measurement, another good proxy for benchmarking English teaching needs
].copy()  # copy so that later edits to this table don't touch school_explorer
school_explorer_geo.head()

Since the Economic Need Index combines housing, health, and food welfare needs into one number, we have to dive deeper into other data sources in order to separate the contribution of each of these factors to SHSAT registration.

However, with regard to English ability, we have enough information to compare SHSAT registration across schools against their proportion of ELL students and their average ELA proficiency. Furthermore, we can probably overlay this on a map of Manhattan. But let's start with simpler data sorting and grouping.

In [None]:
# Here, I'm grouping by school, summing registration and test-taking numbers, and recalculating the proportions
# so that I can join this information on the school_explorer data.
sat_pregroup = sat_sorted.drop(['registration_ratio',
                                 'participation_ratio',
                                 'commitment_ratio',
                                 'Year of SHST',
                                 'Grade level'], axis=1)\
.rename(columns={'School name':'School Name', # necessary for merging tables later
                 'Enrollment on 10/31':'Total enrollment',
                 'Number of students who registered for the SHSAT':'Total registration',
                 'Number of students who took the SHSAT':'Total participation'})
sat_grouped = sat_pregroup.groupby('School Name').sum().reset_index()

In [None]:
# Redo calculations
num_registered = sat_grouped['Total registration']
num_participated = sat_grouped['Total participation']
enrollment = sat_grouped['Total enrollment']
sat_grouped['registration_ratio'] = num_registered / enrollment
sat_grouped['participation_ratio'] = num_participated / enrollment
sat_grouped['commitment_ratio'] = num_participated / num_registered
sat_grouped
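
A quick note on why the ratio columns are dropped before grouping and recalculated afterwards: summing or averaging per-row ratios gives a different number than dividing the summed counts. The sketch below shows the difference on invented numbers.

```python
import pandas as pd

# Invented per-row data: school a appears twice (e.g. two grades or years).
toy = pd.DataFrame({
    "School Name": ["school a", "school a", "school b"],
    "Total enrollment": [100, 50, 80],
    "Total registration": [10, 20, 40],
})
toy["registration_ratio"] = toy["Total registration"] / toy["Total enrollment"]

# Approach used in this notebook: sum the counts, then divide.
grouped = toy.groupby("School Name")[["Total enrollment", "Total registration"]].sum()
grouped["registration_ratio"] = grouped["Total registration"] / grouped["Total enrollment"]

# Alternative for comparison: average the per-row ratios.
mean_of_ratios = toy.groupby("School Name")["registration_ratio"].mean()

print(grouped["registration_ratio"])
print(mean_of_ratios)
```

For `school a` the pooled ratio is 30 / 150, while the mean of the row ratios weights the small and large rows equally, so the two disagree.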

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'; not applicable to my scenario, was getting annoying

In [None]:
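# lowercase school names and strip punctuation
# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal string by default
school_explorer_geo['School Name'] = school_explorer_geo['School Name'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
sat_grouped['School Name'] = sat_grouped['School Name'].str.lower().str.replace(r'[^\w\s]', '', regex=True)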

In [None]:
# Combine the tables by school name.
sat_merge_school = sat_grouped.merge(school_explorer_geo, on="School Name")
sat_merge_school.sort_values(['registration_ratio', 'Total registration']).head()
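
Merging on a free-text school name can silently drop rows that don't match exactly, so it may be worth checking how many rows matched. The sketch below uses an outer merge with `indicator=True`. The school names are invented and `normalize_name` is a hypothetical helper that mirrors the cleaning step above.

```python
import pandas as pd

def normalize_name(names):
    """Lowercase, strip punctuation and trim whitespace from a Series of names."""
    return names.str.lower().str.replace(r"[^\w\s]", "", regex=True).str.strip()

# Invented example tables.
left = pd.DataFrame({"School Name": ["P.S. 123 Example", "Example Academy"],
                     "Total registration": [10, 5]})
right = pd.DataFrame({"School Name": ["p.s. 123 example", "Another School"],
                      "Economic Need Index": [0.8, 0.5]})

left["School Name"] = normalize_name(left["School Name"])
right["School Name"] = normalize_name(right["School Name"])

# how="outer" keeps unmatched rows; the _merge column says where each row came from.
check = left.merge(right, on="School Name", how="outer", indicator=True)
matched = int((check["_merge"] == "both").sum())
left_only = int((check["_merge"] == "left_only").sum())
right_only = int((check["_merge"] == "right_only").sum())

print(matched, left_only, right_only)
```

Rows flagged `left_only` would be SHSAT schools with no match in the School Explorer table, which are the ones an inner merge would lose.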

In [None]:
# clean Percent ELL
sat_merge_school['Percent ELL'] = sat_merge_school['Percent ELL'].str.replace('%', '').astype(float) / 100
sat_merge_school.head()
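
The conversion above assumes every `Percent ELL` value looks like `"12%"`. If some entries turn out to be missing or malformed (I have not checked whether that happens in this file), a more forgiving sketch is to coerce anything unparseable to NaN:

```python
import pandas as pd

# Invented raw values, including a missing entry and a malformed one.
raw = pd.Series(["12%", "0%", None, "N/A", "100%"])

# Strip the literal percent sign, then let to_numeric turn bad values into NaN.
cleaned = pd.to_numeric(raw.str.replace("%", "", regex=False), errors="coerce") / 100

print(cleaned)
```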

So from here, we can make some quick observations on which schools have more or less SHSAT registration, and attempt some correlations between registration numbers, registration ratio, Economic Need Index, Percent ELL, and Average ELA Proficiency.
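
Before plotting, a correlation matrix gives a quick numeric summary. This is a sketch on invented numbers, not results from the real data. The rank-based version is computed as a Pearson correlation on ranks, which is equivalent to Spearman's correlation.

```python
import pandas as pd

# Invented values; column names follow the merged table.
toy = pd.DataFrame({
    "registration_ratio": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Economic Need Index": [0.9, 0.8, 0.6, 0.5, 0.3],
    "Percent ELL": [0.30, 0.10, 0.20, 0.05, 0.15],
})

pearson = toy.corr(method="pearson")  # linear association
rank_based = toy.rank().corr()        # Pearson on ranks, i.e. Spearman

print(pearson.round(2))
print(rank_based.round(2))
```

On the real table, the same two calls could be run on `sat_merge_school` restricted to its numeric columns.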

In [None]:
print("Number of schools in 2016 data: " + str(len(school_explorer_geo)))
print("Number of schools in SHSAT data: " + str(len(sat_explorer.groupby('School name'))))

In [None]:
fig, [ax0, ax1] = plt.subplots(1,2, figsize=(15,6))
sns.regplot(x="Total registration", y="Economic Need Index", data=sat_merge_school, ax=ax0)
sns.regplot(x="registration_ratio", y="Economic Need Index", data=sat_merge_school, ax=ax1)
ax1.set_xlim(0,1.5)
ax0.set_ylim(0.4,1.0)
ax1.set_ylim(0.4,1.0)

In [None]:
fig, [ax0, ax1] = plt.subplots(1,2, figsize=(15,6))
sns.regplot(x="Total registration", y="Percent ELL", data=sat_merge_school, ax=ax0)
sns.regplot(x="registration_ratio", y="Percent ELL", data=sat_merge_school, ax=ax1)
ax1.set_xlim(0,1.5)
ax1.set_ylim(0,1.0)

In [None]:
fig, [ax0, ax1] = plt.subplots(1,2, figsize=(15,6))
sns.regplot(x="Total registration", y="Average ELA Proficiency", data=sat_merge_school, ax=ax0)
sns.regplot(x="registration_ratio", y="Average ELA Proficiency", data=sat_merge_school, ax=ax1)
ax1.set_xlim(0,1.5)
ax1.set_ylim(0,6)

What you should be taking away from this so far is that, with the small subset of data we have on schools, their English levels, and SHSAT registration information, our analysis so far is **inconclusive**.

From here, we should try to connect more data from NYC's open datasets, so we can gather more information on whether the factors PASSNYC is interested in are statistically significant. Also, we'll want to break these observations down by SHSAT year and grade level for better-informed insights. The aggregations I've done help set up a basic analysis pipeline, but they lost a lot of information in the process, so now let's work backwards. (to be continued)
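
As a starting point for that finer-grained view, here is a sketch of grouping by year and grade instead of by school. The column names are taken from the SHSAT file as used earlier in this notebook, and the numbers are invented.

```python
import pandas as pd

# Invented rows for two placeholder schools across two years and two grades.
toy = pd.DataFrame({
    "School name": ["school a"] * 4 + ["school b"] * 4,
    "Year of SHST": [2015, 2015, 2016, 2016] * 2,
    "Grade level": [8, 9, 8, 9] * 2,
    "Enrollment on 10/31": [100, 50, 120, 40, 100, 50, 80, 60],
    "Number of students who registered for the SHSAT": [20, 5, 36, 2, 30, 5, 4, 8],
})

count_cols = ["Enrollment on 10/31",
              "Number of students who registered for the SHSAT"]

# Sum counts within each (year, grade) cell, then recompute the ratio.
by_year_grade = toy.groupby(["Year of SHST", "Grade level"])[count_cols].sum()
by_year_grade["registration_ratio"] = (
    by_year_grade["Number of students who registered for the SHSAT"]
    / by_year_grade["Enrollment on 10/31"]
)

print(by_year_grade)
```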

In [None]:
#TODO: after aggregate analysis, dive deeper into trends 
# for Grade level and Year of the SHSAT (especially for data over time)

In [None]:
#TODO: Revisit Grade granular data, especially "Limited English Proficient" and "Economically Disadvantaged"

In [None]:
#TODO: Gather more information to break apart "Economic Need Index" (housing, health/disability, free lunch)