In [None]:

import numpy as np
import pandas as pd 
import re #regular expressions
import folium
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns #plotting
sns.set()
import scipy
import math
import json
#import branca
from typing import List, Tuple, Dict, Union
from textwrap import wrap
import ipywidgets as widgets
from IPython.display import Markdown
from statsmodels.stats.multitest import multipletests

pd.set_option('display.max_columns', None)

HSDir_2017 = pd.read_csv('../input/nyc-high-school-directory/2017-doe-high-school-directory.csv')

# Table of Contents:
* [Specialized High Schools and the SHSAT](#sec1)
* [Disparity in the Specialized High Schools](#sec2)
    * [Quantifying Diversity](#sec21)
    * [Visualization](#sec22)
    * [Hypothesis Testing](#sec23)
* [Looking at the SHSAT](#sec3)
    * [High Performing Middle Schools](#sec31)
    * [Middle Schools with Undertested Black/Hispanic Students](#sec32)
* [Conclusion](#sec4)

# Specialized High Schools and the SHSAT <a class="anchor" id="sec1"></a>


According to a directory that the New York City Department of Education provides <sup><a href='#1'>1</a></sup> , NYC has 440 public high schools. Nine of these high schools are classified as "specialized" and are targeted specifically towards gifted students. 

In [None]:
#Creating a dictionary of specialized high schools and their district borough numbers, a helpful ID variable
specialized_dict = {dbn:school for dbn, school in HSDir_2017.query('specialized==1')[['dbn', 'school_name']].values}

for school in specialized_dict.values():
    print(school)

The only criterion for admission to most specialized high schools is a student's score on the Specialized High Schools Admissions Test (SHSAT), a test consisting of an English Language Arts section and a math section. The only specialized high school that does not consider the SHSAT is Fiorello H. LaGuardia High School of Music & Art and Performing Arts. According to its website:
>"Acceptance to LaGuardia Arts is based on a competitive audition and review of student records to ensure success in both the demanding studio work and the challenging academic programs." <sup><a href='#2'>2</a></sup>

In [None]:
#Removing LaGuardia (audition-based admission) from the specialized high school dictionary
specialized_dict.pop('03M485', None)

The SHSAT is optionally offered every fall for 8th and 9th graders. Only 18% of the students who took the SHSAT in 2016 ended up receiving an offer, making these schools notoriously difficult to get into. <sup><a href='#3'>3</a></sup> 
 
PASSNYC (Promoting Access to Specialized Schools in New York City) is a non-profit organization based in NYC. Its purpose is to provide outreach programs that help kids in under-served areas prepare for the SHSAT. By giving these kids opportunities to succeed, PASSNYC aims to bring more diversity to the specialized high schools. The organization already has methods in place to identify kids and schools suited for its outreach programs, but it believes it can do better.

The purpose of this notebook is to:

<ul>
    <li>Quantify the diversity issue in specialized high schools</li><br/>
    <li>Identify schools that could use PASSNYC's help</li>
</ul>


# Disparity in the Specialized High Schools <a class="anchor" id="sec2"></a>

PASSNYC's website states:
>"In recent years, the City’s specialized high schools—institutions with historically transformative impact on student outcomes—have seen a shift toward more homogenous [sic] student body demographics." <sup><a href='#4'>4</a></sup>

To validate this statement, demographic data is needed from a range of school years. Thankfully, a NYC Department of Education dataset labeled "2013 - 2018 Demographic Snapshot School" contains demographic information for students from 2013-2018. <sup><a href='#5'>5</a></sup> 

In [None]:
#Import the datasets
demographics_df = pd.read_csv('../input/2013-2018-demographic-snapshot-district/2013_-_2018_Demographic_Snapshot_School.csv')
demographics_df.head()

Looking at the dataset, there appears to be demographic data for ethnicity, gender, whether a student has a disability, whether a student is a non-native English speaker, and whether a student is in poverty.

In [None]:
ethnicity_list = ['ASIAN', 'BLACK', 'HISPANIC', 'WHITE', 'OTHER']
gender_list = ['FEMALE', 'MALE']
disability_list = ['STUDENTS WITH DISABILITIES', 'STUDENTS WITH NO DISABILITIES']
ELL_list = ['ENGLISH LANGUAGE LEARNERS', 'NOT ENGLISH LANGUAGE LEARNERS']
poverty_list = ['POVERTY', 'NO POVERTY']
demographic_dict = {'Ethnicity': ethnicity_list, 'Gender': gender_list,
                    'Disabilities': disability_list, 'Poverty': poverty_list,
                    'English Language Learners': ELL_list}

After processing the dataset, it will be possible to start testing PASSNYC's claim about school homogeneity.

In [None]:
#PREPROCESSING

#Making the Year column an integer
demographics_df['Year'] = demographics_df['Year'].str.slice(0,4).astype('int64')

#Adding a column for Borough
demographics_df['borough'] = demographics_df['DBN'].str.slice(2,3)
demographics_df.loc[demographics_df['borough']=='X','borough'] = 'Bronx'
demographics_df.loc[demographics_df['borough']=='K','borough'] = 'Brooklyn'
demographics_df.loc[demographics_df['borough']=='Q','borough'] = 'Queens'
demographics_df.loc[demographics_df['borough']=='M','borough'] = 'Manhattan'
demographics_df.loc[demographics_df['borough']=='R','borough'] = 'Staten Island'

#Changing 'No Data' results in the Economic Need Index to be np.nan
demographics_df.loc[demographics_df['Economic Need Index']=='No Data', 'Economic Need Index'] = np.nan

#Changing percentage columns to float type
for column in [column for column in demographics_df.columns if '%' in column] + ['Economic Need Index']:
    #Boolean masks are inverted with ~ (unary minus is not supported on boolean Series)
    has_value = ~demographics_df[column].isnull()
    demographics_df.loc[has_value, column] = \
    demographics_df.loc[has_value, column].str.slice(0,-1).astype('float64')/100
    demographics_df[column] = demographics_df[column].astype('float64')
    
#Making all column names in dataset capitalized
demographics_df.columns = demographics_df.columns.str.upper()

#Rename "MULTIPLE RACE CATEGORIES NOT REPRESENTED" to "OTHER" to keep succinct
demographics_df.rename(columns = {'# MULTIPLE RACE CATEGORIES NOT REPRESENTED': '# OTHER',
                                        '% MULTIPLE RACE CATEGORIES NOT REPRESENTED': '% OTHER',
                                        'SCHOOL NAME': 'SCHOOL_NAME'}, inplace=True)

#Adding columns for demographic inverses
demographics_df['# NO POVERTY'] = demographics_df['TOTAL ENROLLMENT'] - demographics_df['# POVERTY']
demographics_df['% NO POVERTY'] = 1 - demographics_df['% POVERTY']
    
demographics_df['# NOT ENGLISH LANGUAGE LEARNERS'] = demographics_df['TOTAL ENROLLMENT'] - demographics_df['# ENGLISH LANGUAGE LEARNERS']  
demographics_df['% NOT ENGLISH LANGUAGE LEARNERS'] = 1 - demographics_df['% ENGLISH LANGUAGE LEARNERS']

demographics_df['# STUDENTS WITH NO DISABILITIES'] = demographics_df['TOTAL ENROLLMENT'] - demographics_df['# STUDENTS WITH DISABILITIES']  
demographics_df['% STUDENTS WITH NO DISABILITIES'] = 1 - demographics_df['% STUDENTS WITH DISABILITIES']  

#Updating specialized school names in the datasets
for DBN, school_name in specialized_dict.items():
    demographics_df.loc[demographics_df['DBN']==DBN, 'SCHOOL_NAME'] = school_name
    
#Year range to work with
years = (2013,2017)

#Only specialized schools
specialized_df = demographics_df.query('DBN in @specialized_dict.keys()').copy()
specialized_df.set_index(['SCHOOL_NAME', 'YEAR'], inplace=True)

## Quantifying Diversity <a class="anchor" id="sec21"></a>

One way to test the accuracy of PASSNYC's statement is to use the Shannon Index. The Shannon Index, \\(H\\), is one of the most widely used diversity indices, <sup><a href='#6'>6</a></sup> which quantify the diversity in a sample. It is defined by

$$ H = -\sum_{i=1}^{R}{\frac{n_{i}}{N}\log{\big(\frac{n_{i}}{N}\big)}} $$ 

where 
* \\(R\\) is the total number of categories
* \\(n_i\\) is the count of observations from category \\(i\\)
* \\(N\\) is the total number of observations. 

This is the same equation as entropy, used in information theory, physics, and statistics. Claude Shannon published this result in 1948 in "A Mathematical Theory of Communication". <sup><a href='#7'>7</a> , <a href='#8'>8</a></sup> The Shannon Index takes a minimum value of 0 (all observations fall in one category) and a maximum value of \\(\log{(R)}\\) (all categories are equally frequent). Higher indices signify higher diversity. 

The sample variance of the Shannon Index is given by

$$ s^{2}_{H} = \frac{\sum_{i=1}^{R}{\frac{n_{i}}{N}\big(\log{\frac{n_{i}}{N}}\big)^{2}} - \big(\sum_{i=1}^{R}{\frac{n_{i}}{N}\log{\frac{n_{i}}{N}}}\big)^{2}}{N} + \frac{R-1}{2N^{2}} $$

This will be used later for hypothesis testing.

While the Shannon Index will give a measure of diversity, it increases non-linearly, making it difficult to compare indices. One way to fix this is to transform the Shannon Index with the exponential function, \\(e^{H}\\). The result is scaled linearly and is known as a Hill Number. <sup><a href='#9'>9</a></sup> Hill Numbers can be explained as the effective number of categories for a demographic. <sup><a href='#10'>10</a></sup> If \\(R\\) different categories had equal frequencies \\(\frac{1}{R}\\), then the Hill Number would be \\(R\\). As the frequencies get less similar, there are fewer effective categories.
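
As a small illustration of the effective-number interpretation, the sketch below computes \\(H\\) and \\(e^{H}\\) for two made-up sets of counts. The helper names `shannon_from_counts` and `hill_from_counts` and the counts themselves are hypothetical and are not part of the analysis that follows.

```python
import math
from typing import Sequence


def shannon_from_counts(counts: Sequence[int]) -> float:
    """Shannon Index H for a list of category counts (empty categories contribute 0)."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def hill_from_counts(counts: Sequence[int]) -> float:
    """Effective number of categories, exp(H)."""
    return math.exp(shannon_from_counts(counts))


# Hypothetical counts, not taken from the dataset
even_counts = [250, 250, 250, 250]     # four equally common categories
skewed_counts = [700, 200, 50, 50]     # one dominant category

print(hill_from_counts(even_counts))    # 4 effective categories, up to floating-point rounding
print(hill_from_counts(skewed_counts))  # strictly between 1 and 4
```

Both samples have four categories, but the skewed one behaves as if it had fewer, which is what makes Hill Numbers easier to compare than raw Shannon Indices.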

Using Hill Numbers, the change in specialized high school diversity can be visualized.

## Visualization <a class="anchor" id="sec22"></a>

In [None]:
def percentage_table(date_range: Tuple[int, int], demographic_list: List[str], demographic_df: pd.DataFrame) -> pd.DataFrame:
    """
    Creates a view of each school's demographic percentages for the first and last year of a range
    
    Args:
        date_range (Tuple[int,int]): The two school years of demographic data to include.
        demographic_list (List[str]): A list of demographic categories.
        demographic_df (pd.DataFrame): A pandas dataframe of demographic values.

    Returns:
        pd.DataFrame: demographic dataframe filtered view.
    """
    percent_columns = [' '.join(['%', demographic]) for demographic in demographic_list]
    count_columns = [' '.join(['#', demographic]) for demographic in demographic_list]
    df = demographic_df.query('YEAR in @date_range')[percent_columns]

    #Pool the counts across schools for each year, then convert them to proportions
    overall_df = demographic_df.query('YEAR in @date_range')[count_columns]\
                               .groupby(level='YEAR').sum()\
                               .apply(lambda x: x/x.sum(), axis=1)
    overall_df.index = pd.MultiIndex.from_product([['OVERALL'], overall_df.index], names=df.index.names)
    #Both column lists follow the order of demographic_list, so relabeling by position is safe
    overall_df.columns = percent_columns
                
    #DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return pd.concat([df, overall_df])


def highlight_rows(x):
    colors = []
    for school in x.index.get_level_values(0):
        if x[school,x.index.unique(level='YEAR')[1]] > x[school,x.index.unique(level='YEAR')[0]]:
            colors.append('background-color: lightgreen')
        elif x[school,x.index.unique(level='YEAR')[1]] < x[school,x.index.unique(level='YEAR')[0]]:
            colors.append('background-color: #FDB5A6')
        else:
            colors.append('')
    return colors


outputs = [widgets.Output() for _ in demographic_dict.keys()]
tab = widgets.Tab(children = outputs)
for i, demographic in enumerate(demographic_dict.keys()):
    tab.set_title(i, demographic)
display(tab)    

for i, demographic in enumerate(demographic_dict.keys()):
    with outputs[i]:
        display(Markdown('### {}'.format(demographic)));
        display(percentage_table(years, demographic_dict[demographic], specialized_df).style.apply(highlight_rows, axis=0));

In [None]:
def log(n: float) -> float:
    """
    Allows math.log to return 0 for log(0) instead of undefined for the purpose of calculating a Shannon Index.
    """
    if n==0:
        return 0
    else:
        return math.log(n)
    
    
def shannon_index(categories: List[int]) -> Tuple[float, float, int]: 
    """
    Calculates Shannon Index numbers for a demographic list.
    
    Args:
        categories (List[int]): A list of demographic categories.

    Returns:
        Tuple[float, float, int]: (Shannon Index expected value, Shannon Index variance, total number of observations)
    """
    N = sum(categories)
    expected_value = -sum((x/N)*log(x/N) for x in categories)
    variance = ((sum((x/N)*(log(x/N)**2) for x in categories) - ((sum((x/N)*log(x/N) for x in categories))**2))/N) + ((len(categories)-1)/(2*(N**2)))
    return (expected_value, variance, N)


def shannon_list(date_range: Tuple[int, int], demographic_list: List[str], demographic_df: pd.DataFrame) -> List[Union[float, str]]:
    """
    Creates a list of the yearly overall Shannon Index information for the specialized high schools.
    
    Args:
        date_range (Tuple[int,int]): The year range of demographic data to include.
        demographic_list (List[str]): A list of demographic categories.
        demographic_df (pd.DataFrame): A pandas dataframe of demographic values.

    Returns:
        List[int, float, float, int]: A list of school year, Shannon expected value, Shannon variance, 
                                      and total number of observations as the value.
    """
    shannon = []
    for year in range(date_range[0], date_range[1]+1):
        shannon.append([year, *shannon_index(list(demographic_df.xs(year, level='YEAR')[[' '.join(['#', demographic]) for demographic in demographic_list]].sum().values))])
    return shannon


def hill_graphs(date_range: Tuple[int, int], demographic_list: List[str], demographic_name: str, demographic_df: pd.DataFrame) -> None:
    """
    Creates a set of bar charts and lineplots visualizing the change in diversity for a date range
    
    Args:
        date_range (Tuple[int,int]): The year range of demographic data to include.
        demographic_list (List[str]): A list of demographic categories.
        demographic_name (str): The name of the demographic to be included in graph text.
        demographic_df (pd.DataFrame): A pandas dataframe of demographic values.

    Returns:
        None: Multiple graphs.
    """
    shannon = shannon_list(date_range, demographic_list, demographic_df)
    
    #A single row: the bar chart spans two columns and the line plot takes the third
    fig = plt.figure(figsize=(20, 8))
    grid = plt.GridSpec(1, 3, wspace=0.4)

    barcharts = fig.add_subplot(grid[0,:2])
    barcharts.set_ylabel('Proportion')
    barcharts.set_ylim(0,1)
    barcharts.set_title(demographic_name + ' Distribution')
    
    demographic_df.query('YEAR in @date_range')[[' '.join(['#', demographic]) for demographic in demographic_list]]\
                   .groupby(level='YEAR').sum().apply(lambda x: x/x.sum(), axis=1)\
                   .sort_index().T\
                   .plot(kind='bar', ax=barcharts, rot=0)

    lineplots = fig.add_subplot(grid[0,2:])
    lineplots.set_ylabel('Effective ' + demographic_name + ' Number')
    lineplots.set_xlabel('Year')
    lineplots.set_ylim(1, len(demographic_list))
    lineplots.set_title('Hill Numbers Over Time')

    #exp(H) converts each year's Shannon Index into a Hill Number
    lineplots.plot(*zip(*[[x[0], math.exp(x[1])] for x in shannon]), color="red")

    plt.show()

    
for i, demographic in enumerate(demographic_dict.keys()):
    display(Markdown('### {}'.format(demographic)))
    hill_graphs(years, demographic_dict[demographic], demographic, specialized_df);

Some observations about the visualizations:

**Ethnicity**:
* Students in the race categories not represented ("other") have increased in overall frequency since 2013. They have become noticeably more present at the High School for Mathematics, Science, and Engineering at City College (HSMSE) and The Brooklyn Latin School, where they make up more than 10% of each student body. 
* The proportion of Hispanic and black students has gone down in seven of the eight specialized high schools. 
* The proportion of Asian and white students has gone up in most high schools. 

**Gender**:
* Since 2013, half of the high schools have increased in female proportion and half of the high schools have decreased in female proportion.
* Overall, the proportions have changed less than a tenth of a percent.

**Disability**:
* Most high schools maintained a low number of students with disabilities, either increasing or decreasing by a relatively small amount. 
* High School for Mathematics, Science, and Engineering at City College (HSMSE) went from about 3.1% of its students having disabilities in 2013/2014 to 4.8% in 2017/2018.

**Poverty**:
* Overall, the percentage of students in poverty has gone down from 53.3% to 50.7%. Since this is closer to a 50/50 split, the diversity has technically increased. If this trend continues, the diversity will start dropping once the proportion of students in poverty falls below 50%.
* The proportion of students in poverty has gone down in every high school except for one. 
* The high school where the proportion of students in poverty increased was Staten Island Technical High School, where it went from 31.7% to 40.8%. 

**English Language Learners**:
* In 2013, 5 of the high schools had students who were English Language Learners. By 2017, only the High School for Mathematics, Science and Engineering at City College (HSMSE) did.
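
The poverty observation above can be checked directly from the two overall percentages quoted there (53.3% and 50.7%). The following is a minimal sketch for a two-category demographic; `two_category_hill` is a hypothetical helper name used only for this illustration.

```python
import math


def two_category_hill(p: float) -> float:
    """Hill Number exp(H) for a demographic split into proportions p and 1 - p."""
    h = -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)
    return math.exp(h)


hill_2013 = two_category_hill(0.533)   # overall poverty share quoted for 2013
hill_2017 = two_category_hill(0.507)   # overall poverty share quoted for 2017

# The 2017 split is closer to 50/50, so its Hill Number is closer to the maximum of 2
print(hill_2013 < hill_2017 < 2)   # True
```

Because the Hill Number of a two-way split is symmetric around 50%, any further decline in the poverty share below 50% would lower it again.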

## Hypothesis Testing <a class="anchor" id="sec23"></a>

Seeing the diversity visualized gives some initial details of how demographics have changed over the years. However, since some of the changes are subtle, it would be helpful if there was a way to see if any changes were statistically significant. One way to compare diversity is to use Hutcheson's t-test. <sup><a href='#11'>11</a> , <a href='#12'>12</a></sup> Created by Kermit Hutcheson in a letter to the editor in a 1970 issue of the Journal of Theoretical Biology, Hutcheson's t-test is an unpooled two-sample t-test specifically adapted for Shannon Indices.

For this test, the null and alternative hypotheses are \\(H_0: H_a = H_b\\) and \\(H_1: H_a \ne H_b\\) respectively, with \\(H_a\\) and \\(H_b\\) being the two Shannon Indices. The null hypothesis assumes equal diversity in the two samples. The test statistic is calculated as
$$ t = \frac{H_a - H_b}{\sqrt{s^{2}_{H_a} + s^{2}_{H_b}}} $$

with \\(s^{2}_{H}\\) being the sample variance of the Shannon Index. The degrees of freedom are given by

$$ \frac{\big(s^{2}_{H_a} + s^{2}_{H_b}\big)^2}
{\Bigg(\frac{\big(s^{2}_{H_a}\big)^2}{N_a} + \frac{\big(s^{2}_{H_b}\big)^2}{N_b}\Bigg)} $$
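
To make the formulas concrete, here is a minimal sketch that computes the test statistic and the degrees of freedom for two made-up samples of category counts. The function names `shannon_stats` and `hutcheson_t` and the counts are hypothetical; the p-value is left out so that the sketch needs only the standard library.

```python
import math
from typing import Sequence, Tuple


def shannon_stats(counts: Sequence[int]) -> Tuple[float, float, int]:
    """Return (H, sample variance of H, N) for a list of category counts."""
    n_total = sum(counts)
    props = [n / n_total for n in counts if n > 0]
    h = -sum(p * math.log(p) for p in props)
    sum_p_log_sq = sum(p * math.log(p) ** 2 for p in props)
    # (sum of p * log(p))^2 equals H^2, and R is the number of categories
    variance = (sum_p_log_sq - h ** 2) / n_total + (len(counts) - 1) / (2 * n_total ** 2)
    return h, variance, n_total


def hutcheson_t(sample_a: Sequence[int], sample_b: Sequence[int]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for two samples of category counts."""
    h_a, var_a, n_a = shannon_stats(sample_a)
    h_b, var_b, n_b = shannon_stats(sample_b)
    t = (h_a - h_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / n_a + var_b ** 2 / n_b)
    return t, dof


# Hypothetical counts for illustration only
before = [400, 300, 200, 100]
after = [550, 300, 100, 50]

t_stat, dof = hutcheson_t(before, after)
print(round(t_stat, 2), round(dof, 1))
```

A positive statistic here means the first sample has the larger Shannon Index; swapping the two samples flips the sign but leaves the degrees of freedom unchanged.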

Five hypothesis tests will be run: one for each demographic (ethnicity, gender, disability, poverty, English Language Learner). Each demographic will be tested on the specialized schools grouped together instead of individually. This requires fewer hypothesis tests and answers the question at hand: have specialized schools become less diverse over the years? 

Since multiple tests are being conducted, the False Discovery Rate (the proportion of false rejections of the null hypothesis compared to the total number of rejections) will be controlled using the Benjamini-Hochberg procedure. <sup><a href='#13'>13</a></sup>
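
For readers unfamiliar with the procedure, the sketch below shows the Benjamini-Hochberg step-up adjustment in plain Python. It is an illustration under stated assumptions, not the implementation used in this notebook: the function name `benjamini_hochberg` and the p-values are hypothetical, and the adjusted values should correspond to what `multipletests(..., method='fdr_bh')` reports.

```python
from typing import List, Sequence, Tuple


def benjamini_hochberg(pvals: Sequence[float], alpha: float = 0.05) -> Tuple[List[bool], List[float]]:
    """Return (reject flags, adjusted p-values) in the original order of pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted p-values never decrease with rank
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    reject = [p <= alpha for p in adjusted]
    return reject, adjusted


# Hypothetical p-values for five tests, not the notebook's results
pvals = [0.001, 0.035, 0.03, 0.20, 0.012]
reject, adjusted = benjamini_hochberg(pvals)
print(reject)   # [True, True, True, False, True]
```

Note that the third p-value (0.03) would be scaled to 0.05 on its own, but it inherits the smaller adjusted value of the next-ranked test, which is the step-up behavior of the procedure.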

In [None]:
def t_test(a: float, var_a: float, N_a: int, b: float, var_b: float, N_b: int) -> Tuple[float, int, float]:
    """
    Calculates the Hutcheson t-test statistics to compare two Shannon Indices
    
    Args:
        a (float): Shannon Index #1
        var_a (float): Shannon Index #1 variance
        N_a (int): total observations #1
        b (float): Shannon Index #2
        var_b (float): Shannon Index #2 variance
        N_b (int): total observations #2

    Returns:
        Tuple[float, int, float]: A tuple of (t-statistic, degrees of freedom, one-tailed p-value)
    """
    t = (a - b)/math.sqrt(var_a + var_b)
    df = math.ceil(((var_a + var_b)**2)/(((var_a**2)/N_a)+((var_b**2)/N_b)))
    #Upper-tail probability of |t|; doubling it would give the p-value for the two-sided alternative
    return (t, df, 1 - scipy.stats.t.cdf(math.fabs(t), df))

def shannon_t_tests(date_range: Tuple[int, int]) -> pd.DataFrame:
    """
    Creates table displaying multiple t-tests adjusted with the Benjamini-Hochberg procedure
    
    Args:
        date_range (Tuple[int,int]): The year range of demographic data to include.
        
    Returns:
        pd.DataFrame: One row per demographic with its Shannon Indices, test statistics, and adjusted p-value.
    """
    significance_df = pd.DataFrame(data=None, columns = ['Name', "Shannon Index " + str(date_range[0]), "Shannon Index " + str(date_range[1]), 't', 'df', 'pval'])
    for i, demographic in enumerate(demographic_dict.keys()):
        shannon = shannon_list(date_range, demographic_dict[demographic], specialized_df)
        significance_df.loc[i] = (demographic, shannon[0][1], shannon[-1][1], *t_test(*shannon[0][1:],*shannon[-1][1:]))
    is_reject, corrected_pvals, _, _ = multipletests(significance_df["pval"], alpha=0.05, method='fdr_bh')
    significance_df["reject"] = is_reject
    significance_df["adj_pval"] = corrected_pvals
    return significance_df

In [None]:
shannon_t_tests(years)

After adjusting for multiple tests, the null hypotheses for ethnicity, poverty, and English Language Learners are rejected at a significance level of \\(\alpha=0.05\\). Therefore, there is evidence that the diversity of these demographics in specialized high schools differs between the 2013/2014 school year and the 2017/2018 school year. In the cases of ethnicity and English Language Learners, it appears that the diversity has decreased. With poverty, it appears that the diversity has actually increased. Looking at the visualizations of the data again, this makes some sense. 

* Ethnicity appears not to have changed much, but because total enrollment in the specialized high schools was 14,876 in the 2013/2014 school year and 15,540 in 2017/2018, the variance of the Shannon Index is small, so even a small change is significant.
* Both gender and disability seem to have barely changed, so it is not surprising that the differences were not large enough to warrant a rejection of the null hypothesis. 
* The proportion of students in poverty at the specialized high schools shifted by about 2.6 percentage points (from 53.3% to 50.7%). Since the proportions converged toward 50%, the Hill Number is close to its maximum value. 
* The English Language Learner percentages in 2013 and 2017 look nearly identical, but since there were 14 students who were English Language Learners in 2013 and only 1 in 2017, the diversity decreased significantly. In retrospect, it may be difficult to classify students as English Language Learners. Since one of the two components of the SHSAT is English Language Arts<sup>1</sup>, a student needs to be adept at the English language to score well. If a student scores well enough to pass, they might not consider themselves an English Language Learner. 


With that being said, the only quantifiable demographic that has gone down in diversity in the specialized high schools is ethnicity. With regard to ethnicity, PASSNYC is correct in saying that the student bodies have become more homogeneous over the past few years. The second part of this project will aim to find ways to address that. 

# Looking at the SHSAT <a class="anchor" id="sec3"></a>

To help identify schools that need assistance with SHSAT preparation, PASSNYC provided two datasets to work with:
* '2016 School Explorer.csv' is a dataset that gives several characteristics of public elementary and middle schools in NYC. 
* 'D5 SHSAT Registrations and Testers.csv' is a dataset that gives SHSAT registration information for schools in the D5 district (Central Harlem).

In [None]:
school_explorer = pd.read_csv('../input/data-science-for-good/2016 School Explorer.csv')
registration = pd.read_csv('../input/data-science-for-good/D5 SHSAT Registrations and Testers.csv')

#Processing for school_explorer dataset

#Capitalize column names
school_explorer.columns = school_explorer.columns.str.upper()

#Change percent columns from strings to usable floats
for column in [column for column in school_explorer.columns if ('%' in column) or ('PERCENT' in column) or ('RATE' in column)]:
    has_value = ~school_explorer[column].isnull()
    school_explorer.loc[has_value, column] = school_explorer.loc[has_value, column].str.strip('%').astype('float64')/100
    school_explorer[column] = school_explorer[column].astype('float64')
    
#Change dollars to floats (regex=True is required for the character class in pandas 2.0+)
school_explorer['SCHOOL INCOME ESTIMATE'] = school_explorer['SCHOOL INCOME ESTIMATE'].str.replace(r'[\$, ]', '', regex=True).astype('float64')

#Change location code column name to DBN
school_explorer.rename(columns = {'LOCATION CODE': 'DBN'}, inplace=True)

school_explorer.head()

Looking at the `school_explorer` dataset, it appears to have demographic columns, metrics on the school itself, and information on test results from the New York State Testing Program (NYSTP) <sup>1</sup>.  The dataset is titled "2016 School Explorer", which could be either the 2015/2016 school year or the 2016/2017 school year. To find out which one it is, the demographic dataset from Part 1 can be used since it had the school years labeled as "2015/2016", etc. 

In [None]:
#testing 2015/2016 vs. 2016/2017
testing = pd.merge(school_explorer[['PERCENT BLACK', 'DBN']], demographics_df.query('YEAR in (2015,2016)')[['% BLACK', 'DBN', 'YEAR']], on='DBN')
for year in (2015,2016):
    #Absolute differences are summed so that positive and negative errors cannot cancel out
    print('Year {} Error: {}'.format(year, (testing.loc[testing['YEAR']==year, 'PERCENT BLACK'] - testing.loc[testing['YEAR']==year, '% BLACK']).astype('float64').abs().sum()))
del testing

It looks like the data provided is for the 2015/2016 year. Since it will be helpful to have the demographics columns included in this dataset, the datasets will be merged.

In [None]:
redundant_columns = ['SCHOOL NAME', 'ECONOMIC NEED INDEX']
redundant_columns.extend(['PERCENT ' + demographic for demographic in ['ELL', 'ASIAN', 'BLACK', 'HISPANIC', 'WHITE', 'BLACK / HISPANIC']])

school_explorer = pd.merge(school_explorer[[column for column in school_explorer.columns if column not in redundant_columns]], 
         demographics_df.query('YEAR==2015'), on='DBN')
school_explorer['PERCENT BLACK/HISPANIC'] = school_explorer['% BLACK'] + school_explorer['% HISPANIC']

To get extra metrics regarding the NYSTP, a dataset from the NYC DOE website can be merged into the existing dataset as well. 

In [None]:
test_results = pd.read_csv('../input/nyc-ela-and-math-results-20152016/3-8_ELA_AND_MATH_RESEARCHER_FILE_2016.csv')

#Dropping nan values
test_results.replace('-', np.nan, inplace=True)
test_results.dropna(subset=[column for column in test_results.columns if ('_COUNT' in column) or ('_PCT' in column)], inplace=True)

#Converting percent columns into float64 types
percentage_columns = [column for column in test_results.columns if 'PCT' in column]
for column in percentage_columns:
    test_results[column] = test_results[column].str.strip('%').astype('float64')/100
    
#Create grade column
test_results['GRADE'] = test_results['ITEM_DESC'].str.slice(6,7).astype('int64')
    
#Get test averages
test_results['SUBGROUP_AVERAGE'] = test_results[percentage_columns[:-2]].apply(lambda x: sum(percentage * score for percentage, score in zip(x, range(1,5))), axis=1)
subgroup_avgs = test_results.groupby(['BEDSCODE', 'GRADE', 'SUBGROUP_NAME'])['SUBGROUP_AVERAGE'].sum().reset_index()
subgroup_avgs.rename(columns = {'SUBGROUP_AVERAGE': 'SUBGROUP_AVERAGE_TOTAL'}, inplace=True)
test_results = test_results.merge(subgroup_avgs, on=['BEDSCODE', 'GRADE', 'SUBGROUP_NAME'])
del subgroup_avgs
test_results['TOTAL_TESTED'] = test_results['TOTAL_TESTED'].astype('int64')
avg_df = test_results.groupby(['BEDSCODE', 'ITEM_SUBJECT_AREA']).apply(lambda x: sum(count * score for count, score in zip(x['TOTAL_TESTED'], x['SUBGROUP_AVERAGE']))/x['TOTAL_TESTED'].sum()).to_frame()
test_results = test_results.join(avg_df, on=['BEDSCODE', 'ITEM_SUBJECT_AREA'])
test_results = test_results.merge(test_results.groupby(['BEDSCODE', 'ITEM_SUBJECT_AREA']).first().unstack()[0].reset_index().copy(),
                   on='BEDSCODE')
test_results.rename(columns = {'ELA': 'ELA_SCHOOL_AVERAGE', 'Mathematics': 'MATH_SCHOOL_AVERAGE'}, inplace=True)
del avg_df

#Drop unnecessary columns
test_results.drop([0, 'SY_END_DATE', 'ITEM_DESC', 'MEAN_SCALE_SCORE'], axis=1, inplace=True)

#Merge
school_explorer = pd.merge(school_explorer[[column for column in school_explorer.columns if 'GRADE' not in column and 'AVERAGE' not in column]], 
                           test_results, left_on='SED CODE', right_on='BEDSCODE')

del test_results

#Just 8th grade results are wanted as they are the ones taking the SHSAT
school_explorer = school_explorer.query('GRADE==8')

Knowing that ethnic diversity has decreased in the specialized high schools, the goal is to increase the number of qualified underrepresented students taking the SHSAT. To find out which middle schools have qualified and underrepresented students, two classifications need to be performed:
* An identification of high performing middle schools
* An identification of middle schools that have both a high percentage of Black/Hispanic students (the most underrepresented ethnically) and a low percentage of students taking the SHSAT.

## High Performing Middle Schools <a class="anchor" id="sec31"></a>

To find high performing middle schools, the NYSTP results can be examined.

In [None]:
from scipy import stats

fig = plt.figure(figsize=(10,10))
NYSTP_results = school_explorer.query('SUBGROUP_NAME=="All Students"').set_index(['BEDSCODE', 'ITEM_SUBJECT_AREA']).unstack()['SUBGROUP_AVERAGE'].dropna()
plt.scatter(NYSTP_results['ELA'],NYSTP_results['Mathematics'])
plt.title('NYSTP Results for the 2015/2016 School Year')
plt.xlabel('English Language Arts results')
plt.ylabel('Mathematics results');

Students' mathematics results and ELA results appear positively correlated. The top right corner contains schools that have good candidates for the SHSAT. To find a suitable cutoff point, K-means clustering<sup><a href='#14'>14</a></sup> will be used to segment the schools into clusters. To find an optimal number of clusters, the "elbow method" will be used. 
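
The idea behind the elbow method can be sketched without any libraries: run k-means for several values of k, record how tightly the points sit around their cluster centers, and look for the k after which the improvement levels off. The sketch below uses the within-cluster sum of squares as the tightness measure, which is a common choice and an assumption here. The function names, the seeding rule, and the points are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], k: int, iterations: int = 50) -> List[Point]:
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumed for determinism)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep the old one if a cluster is empty
        centroids = [
            (sum(x for x, _ in members) / len(members), sum(y for _, y in members) / len(members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids


def inertia(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


# Hypothetical standardized (ELA, math) scores forming three loose groups
points = [(-1.2, -1.1), (0.1, 0.0), (1.3, 1.2), (-1.0, -1.3), (0.0, 0.2),
          (1.1, 1.4), (-1.4, -0.9), (-0.1, -0.1), (1.2, 1.0)]

for k in range(1, 6):
    print(k, round(inertia(points, kmeans(points, k)), 3))
```

On these made-up points the measure falls steeply up to k = 3 and changes little afterwards, which is the "elbow" the method looks for.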

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist

def elbow_method(df: pd.DataFrame, kmax: int) -> np.ndarray:
    """
    Creates graph to find optimal k-value for k-means clustering and returns the standardized data.
         
    Args:
        df (pd.DataFrame): Dataframe to perform elbow method and standardization on
        kmax (int): Upper bound (exclusive) of the k values to try, starting from 1

    Returns:
        np.ndarray: Column-wise z-scores of df
    """
    df_norm = np.asarray(stats.zscore(df))
    k_values = []
    k_range = range(1,kmax)
    for k in k_range:
        kmeanModel = KMeans(n_clusters=k, random_state=0).fit(df_norm)
        #Average distance from each school to its nearest cluster center
        k_values.append(sum(np.min(cdist(df_norm, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df_norm.shape[0])

    plt.plot(k_range, k_values, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Average distance to nearest cluster center')
    plt.title('Elbow Method for finding an optimal k')
    plt.show()
    return df_norm
NYSTP_results_norm = elbow_method(NYSTP_results, 10)

It appears that 3 clusters will work well.

In [None]:
def cluster(df: pd.DataFrame, n: int) -> np.ndarray:
    """
    Returns k-means labels.
         
    Args:
        df (pd.DataFrame): Dataframe to cluster
        n (int): Number of clusters

    Returns:
        np.ndarray: The cluster label assigned to each row of df
    """
    kmeans = KMeans(n_clusters=n, random_state=0)
    labels = kmeans.fit_predict(df)
    return labels

labels = cluster(NYSTP_results_norm, 3)
colmap = {1: '#a6cee3', 2: '#1f78b4', 3: '#b2df8a', 4: '#33a02c', 5: '#fb9a99',
          6: '#e31a1c', 7: '#fdbf6f', 8: '#ff7f00', 9: '#cab2d6', 10: '#6a3d9a'}
colors = list(map(lambda x: colmap[x+1], labels))

fig = plt.figure(figsize=(10, 10))
plt.scatter(NYSTP_results['ELA'],NYSTP_results['Mathematics'], color=colors, alpha=0.5, edgecolor='k')
plt.title('NYSTP Results for the 2015/2016 School Year Clustered')
plt.xlabel('English Language Arts results')
plt.ylabel('Mathematics results');

In [None]:
NYSTP_results['TEST_CLUSTER'] = labels
school_explorer = school_explorer.merge(NYSTP_results['TEST_CLUSTER'].reset_index(), on='BEDSCODE')
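#K-means label numbers are arbitrary; 2 is the label this fit gives the cluster used as high performing
school_explorer['HIGH_PERFORMER'] = (school_explorer['TEST_CLUSTER'] == 2)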
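
K-means assigns label numbers arbitrarily, so the cluster labeled `2` is the high-scoring group only for this particular fit. A minimal sketch of one way to pick the cluster programmatically, by taking the cluster whose members have the highest mean standardized score, is shown below. The helper `highest_mean_cluster` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def highest_mean_cluster(df_norm, labels):
    """Return the label of the cluster with the largest mean over all standardized columns."""
    frame = pd.DataFrame(np.asarray(df_norm))
    cluster_means = frame.groupby(np.asarray(labels)).mean().mean(axis=1)
    return int(cluster_means.idxmax())

# Toy standardized scores: low, middle and high groups (illustration only)
rng = np.random.default_rng(0)
low = rng.normal(loc=-2.0, scale=0.3, size=(30, 2))
mid = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
high = rng.normal(loc=2.0, scale=0.3, size=(30, 2))
X = np.vstack([low, mid, high])

toy_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
best = highest_mean_cluster(X, toy_labels)
is_high_performer = toy_labels == best
print(best, is_high_performer.sum())
```

Selecting by cluster mean rather than by a fixed label keeps the flag stable if the random seed, the data, or the library version changes the label numbering.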

## Middle Schools with Undertested Black/Hispanic Students <a class="anchor" id="sec32"></a>

To find middle schools that have a low percentage of test takers, registration information is needed. The registration dataset that PASSNYC provided only covers the D5 school district (Central Harlem). Fortunately, the Department of Education has provided more data, which lists for each sending school how many students took the test and how many offers were made.

In [None]:
offers = pd.read_csv('../input/2015-2016-shsat-results/2015-2016_SHSAT_Admissions_Test_Offers_By_Sending_School.csv')

#Rename columns
offers.rename(columns = {'Feeder School DBN': 'DBN', 'Feeder School Name': 'SCHOOL_NAME',
                   'Count of Students in HS Admissions': 'OLD_COUNT',
                   'Count of Testers': 'TESTED', 'Count of Offers': 'OFFERED'}, inplace=True)
offers.set_index('DBN', inplace=True)

#Add actual student enrollments (grade 8 plus grade 9 in 2015)
eligible_testers = (demographics_df.query('YEAR==2015')
                    .set_index('DBN')[['GRADE 8', 'GRADE 9']]
                    .sum(axis=1)
                    .to_frame('ELIGIBLE'))
offers = pd.merge(offers, eligible_testers,
                left_index=True, right_index=True)
offers.drop('OLD_COUNT', axis=1, inplace=True)

#Replace the '0-5' placeholder with NaN, then make columns float64 type
for column in ['TESTED', 'OFFERED']:
    offers.loc[offers[column]=='0-5',column] = np.nan
    offers[column] = offers[column].astype('float64')

#Calculate relevant percentages
offers['% TESTED'] = (offers['TESTED']/offers['ELIGIBLE']).fillna(0)
offers['% SUCCESSFUL'] = offers['OFFERED']/offers['TESTED']

#Merge offers into school_explorer dataset
school_explorer = pd.merge(school_explorer.query('GRADE==8'), offers.drop('SCHOOL_NAME', axis=1), on='DBN')
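
The conversion and percentage steps above can be illustrated on a small hand-made table. This sketch uses `pd.to_numeric` with `errors='coerce'` in place of the explicit `'0-5'` replacement; note that it turns any non-numeric entry into NaN, which is broader than the original code. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the offers table (values are invented)
toy = pd.DataFrame(
    {'TESTED': ['40', '0-5', '12'],
     'OFFERED': ['10', '0-5', '0-5'],
     'ELIGIBLE': [200, 150, 60]},
    index=pd.Index(['A', 'B', 'C'], name='DBN'))

# Non-numeric entries such as '0-5' become NaN
for column in ['TESTED', 'OFFERED']:
    toy[column] = pd.to_numeric(toy[column], errors='coerce')

toy['% TESTED'] = (toy['TESTED'] / toy['ELIGIBLE']).fillna(0)
toy['% SUCCESSFUL'] = toy['OFFERED'] / toy['TESTED']
print(toy)
```

Because of `fillna(0)`, a school whose tester count is given only as `'0-5'` ends up with a `% TESTED` of 0, while its `% SUCCESSFUL` stays NaN.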

In [None]:
tested_df = school_explorer.groupby('BEDSCODE').first()[['% TESTED', 'PERCENT BLACK/HISPANIC']].dropna()
fig = plt.figure(figsize=(10, 10))
plt.scatter(tested_df['% TESTED'],tested_df['PERCENT BLACK/HISPANIC'])
plt.title('Scatterplot Comparing SHSAT Test Participation and Proportion of Black/Hispanic Students')
plt.xlabel('Percent Tested')
plt.ylabel('Proportion of Black and Hispanic Students');

Comparing the share of students who took the SHSAT with the proportion of black/Hispanic students shows a large number of predominantly black/Hispanic middle schools where fewer than 20% of students took the SHSAT. Using k-means clustering again, the group in the top left corner (low participation, high proportion of black/Hispanic students) is the one of interest.

In [None]:
tested_df_norm = elbow_method(tested_df, 10)

This time, it looks like 5 clusters will be optimal.

In [None]:
labels = cluster(tested_df_norm, 5)

colors = list(map(lambda x: colmap[x+1], labels))

fig = plt.figure(figsize=(10, 10))
plt.scatter(tested_df['% TESTED'],tested_df['PERCENT BLACK/HISPANIC'], color=colors, alpha=0.5, edgecolor='k')
plt.title('Scatterplot Comparing SHSAT Test Participation and Proportion of Black/Hispanic Students \n with K-means Clusters')
plt.xlabel('Percent Tested')
plt.ylabel('Proportion of Black and Hispanic Students');

In [None]:
tested_df['DIVERSITY_CLUSTER'] = labels
school_explorer = school_explorer.merge(tested_df['DIVERSITY_CLUSTER'].reset_index(), on='BEDSCODE')
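#K-means label numbers are arbitrary; 0 is the label this fit gives the cluster used as undertested
school_explorer['UNDERTESTED'] = (school_explorer['DIVERSITY_CLUSTER'] == 0)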
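
To confirm which cluster sits in the top left corner, it can help to tabulate each cluster's average participation and demographic share in the original units. The sketch below does this on made-up data; the toy frame, the two-cluster setup, and the rule of taking the cluster with the lowest mean participation are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for tested_df (values are invented)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    '% TESTED': np.concatenate([rng.uniform(0.05, 0.15, 20),
                                rng.uniform(0.5, 0.7, 20)]),
    'PERCENT BLACK/HISPANIC': np.concatenate([rng.uniform(0.85, 1.0, 20),
                                              rng.uniform(0.1, 0.3, 20)]),
})
features = toy[['% TESTED', 'PERCENT BLACK/HISPANIC']]
toy['CLUSTER'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Per-cluster means and sizes, in the original (unstandardized) units
profile = toy.groupby('CLUSTER').agg(
    mean_tested=('% TESTED', 'mean'),
    mean_black_hispanic=('PERCENT BLACK/HISPANIC', 'mean'),
    n_schools=('% TESTED', 'size'),
)
print(profile)

# Assumed rule: the "top left" cluster is the one with the lowest mean participation
top_left = profile['mean_tested'].idxmin()
print(top_left)
```

Reading the profile table alongside the scatterplot makes it easier to check that the chosen label really corresponds to low participation and a high proportion of black/Hispanic students.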

# Conclusion <a class="anchor" id="sec4"></a>

Ten middle schools fall in the intersection of the two groups: they are high performing and have a high percentage of black/Hispanic students, but few of their students take the SHSAT.

In [None]:
#One row per school that is both high performing and undertested
target_schools = school_explorer.query('HIGH_PERFORMER & UNDERTESTED').groupby('BEDSCODE').first()
display(target_schools[['SCHOOL_NAME', 'SUBGROUP_AVERAGE_TOTAL', '% TESTED']])
target_schools.shape[0]

In [None]:
m = folium.Map(
    location=[40.777488, -73.879681],
    tiles='Stamen Toner',
    zoom_start=11,)

for index,row in school_explorer.query('HIGH_PERFORMER & UNDERTESTED').groupby('BEDSCODE').first().iterrows():
    folium.Marker(
        location=[row['LATITUDE'], row['LONGITUDE']],
        popup=folium.Popup(row['SCHOOL_NAME'], parse_html=True)
    ).add_to(m)
m

If PASSNYC focuses its efforts on these 10 middle schools (e.g. with after-school programs that encourage more students to take the SHSAT), it could help make the specialized high schools more diverse.

# Bibliography <a class="anchor" id="sec5"></a>
<br/>
<a id='1'> 1. </a> <a href='https://www.kaggle.com/new-york-city/nyc-high-school-directory/home'> "NYC High School Directory" on Kaggle</a><br/>

<a id='2'> 2. </a> <a href='https://laguardiahs.org/apps/pages/index.jsp?uREC_ID=314119&type=d'>Fiorello H. LaGuardia High School Website</a><br/>

<a id='3'> 3. </a> <a href='https://www.princetonreview.com/k12/shsat-information'>SHSAT Information from The Princeton Review</a><br/>

<a id='4'> 4. </a> <a href='http://www.passnyc.org/'>PASSNYC Website</a><br/>

<a id='5'> 5. </a> <a href='https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6'> "2013-2018 Demographic Snapshot School" on NYC Open Data</a><br/>

<a id='6'> 6. </a> <a href='https://en.wikipedia.org/wiki/Diversity_index#Simpson_index'> Wikipedia article on diversity indices </a><br/>

<a id='7'> 7. </a> <a href='http://eebweb.arizona.edu/courses/Ecol206/shannon%20weaver-wiener.pdf'> Ian F. Spellerberg and Peter J. Fedor "A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the ‘Shannon–Wiener’ Index", 2003 </a><br/>

<a id='8'> 8. </a> <a href='http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf'>Claude Shannon "A Mathematical Theory of Communication", 1948 </a><br/>

<a id='9'> 9. </a> <a href='https://www.uvm.edu/~ngotelli/manuscriptpdfs/ChaoHill.pdf'>Anne Chao "Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies", 2014</a><br/>

<a id='10'> 10. </a> <a href='http://www.loujost.com/Statistics%20and%20Physics/Diversity%20and%20Similarity/EffectiveNumberOfSpecies.htm'>Lou Jost on "Effective Number of Species"</a><br/>

<a id='11'> 11. </a> <a href='https://www.sciencedirect.com/science/article/pii/0022519370901244'>Kermit Hutcheson "A test for comparing diversities based on the Shannon formula", 1970 </a><br/>

<a id='12'> 12. </a> <a href='http://www.dataanalytics.org.uk/Publications/S4E2e%20Support/exercises/Comparing%20shannon%20diversity.htm'>Post by Mark Gardener on DataAnalytics.org.uk about calculating Hutcheson's t-test</a><br/>

<a id='13'> 13. </a> <a href='https://multithreaded.stitchfix.com/blog/2015/10/15/multiple-hypothesis-testing/'>More reading about Benjamini</a><br/>

<a id='14'> 14. </a> <a href='https://en.wikipedia.org/wiki/K-means_clustering'>K-means clustering</a> <br/>