# Class size

This notebook loads and cleans class size and pupil-teacher ratio data from the 2013-14 Updated Class Size Report produced by the New York City Department of Education.

## Import Python libraries and set working directories

In [1]:
import os
import feather
import numpy as np
import pandas as pd

In [2]:
input_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'input')
intermediate_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'intermediate')
output_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'output')

## Load data and generate `DBN` code to uniquely identify schools

The [raw file](http://schools.nyc.gov/offices/d_chanc_oper/budget/dbor/DBOR_CLASS_SIZE/FY14_Data/Updated_School_level_Detail_Summary_2014_02_13.xlsx) comes from the NYC Department of Education (NYCDOE) and is available [here](http://schools.nyc.gov/NR/exeres/23ABEBD1-D31F-4436-BC76-DD630242E621,frameless.htm?NRMODE=Published). NYCDOE also posts archived class size reports [here](http://schools.nyc.gov/AboutUs/schools/data/classsize/Class+Size+Archive.htm). The U.S. Department of Education's Office for Civil Rights (OCR) hosts a data portal with information from earlier years [here](https://ocrdata.ed.gov/).

In [None]:
report = pd.read_excel(
    os.path.join(input_dir, 'Updated_School_level_Detail_Summary_2014_02_13.xlsx'),
    skiprows = 10,  # skip the first 10 rows of the sheet
    converters = {'CSD':str}  # read CSD as text so it can be concatenated into the DBN
)

report.columns = report.columns.str.lower()
report.columns = report.columns.str.replace(' ', '_')
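
The two renaming steps above can be tried on a small frame without loading the Excel file. This is only a sketch: the header names below are hypothetical stand-ins, not the full set of columns in the report, and `regex=False` is added here to make the literal-space replacement explicit.

```python
import pandas as pd

# Hypothetical header names, loosely modelled on the report's columns.
toy = pd.DataFrame(columns=['CSD', 'School Code', 'Number of Sections'])

# Same two steps as the notebook: lowercase, then replace spaces with underscores.
toy.columns = toy.columns.str.lower()
toy.columns = toy.columns.str.replace(' ', '_', regex=False)

cleaned = list(toy.columns)
print(cleaned)
```

Normalizing the names once, right after loading, is what lets later cells refer to columns such as `school_code` and `number_of_sections`.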

The DBN code is the concatenation of `csd` (the community school district) and `school_code`.

In [3]:
report['dbn'] = report[['csd', 'school_code']].apply(lambda x: ''.join(x), axis=1)
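
As a quick illustration of the concatenation, here is a minimal sketch on made-up rows. The district and school codes are hypothetical values chosen for the example. It also shows that, when both columns already hold strings, plain vectorized addition gives the same result as the row-wise `join`.

```python
import pandas as pd

# Hypothetical district and school codes, both stored as strings.
toy = pd.DataFrame({
    'csd': ['01', '02'],
    'school_code': ['M015', 'M001'],
})

# Row-wise join, as in the notebook cell above.
toy['dbn'] = toy[['csd', 'school_code']].apply(lambda x: ''.join(x), axis=1)

# Vectorized alternative: string columns can be added directly.
toy['dbn_vectorized'] = toy['csd'] + toy['school_code']

print(toy)
```

Both approaches need `csd` to be a string column, which is why it is read with a `str` converter when the file is loaded.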

## Calculate average class size

First, for each school, sum the number of students or seats filled (`number_of_students_/_seats_filled`) and the number of sections (`number_of_sections`) across all of its rows in the report.

In [6]:
report_agg = report.groupby(['dbn', 'school_name'])[['number_of_students_/_seats_filled', 
                                                       'number_of_sections']].sum().reset_index()

Then, divide each school's total students or seats filled (`number_of_students_/_seats_filled`) by its total sections (`number_of_sections`) to get the school's average class size.

In [7]:
report_agg['avg_class_size'] = report_agg['number_of_students_/_seats_filled'] / report_agg['number_of_sections']
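
Summing first and dividing afterwards gives a section-weighted average, which can differ from the plain mean of row-level class sizes. The sketch below uses made-up numbers and shortened, hypothetical column names (`students`, `sections`) to show the difference.

```python
import pandas as pd

# Made-up rows: school 01M015 has two report rows, school 02M001 has one.
toy = pd.DataFrame({
    'dbn': ['01M015', '01M015', '02M001'],
    'students': [90, 20, 25],
    'sections': [3, 1, 1],
})

# Approach used in the notebook: total students divided by total sections.
totals = toy.groupby('dbn')[['students', 'sections']].sum()
pooled = totals['students'] / totals['sections']

# For comparison: mean of each row's own class size, ignoring section counts.
per_row = toy['students'] / toy['sections']
unweighted = per_row.groupby(toy['dbn']).mean()

print(pooled)
print(unweighted)
```

For school `01M015`, the pooled figure is 110 / 4 = 27.5, while the unweighted mean of 30 and 20 is 25.0. The pooled figure counts every section equally, regardless of which report row it came from.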

## Get pupil-teacher ratios and merge with average class size

In [12]:
# Keep only rows that report a schoolwide pupil-teacher ratio
ratios = report[['dbn', 'school_name', 'schoolwide_pupil-teacher_ratio']].dropna(axis=0, how='any')
# Outer merge keeps schools that appear in only one of the two tables
merged = pd.merge(report_agg, ratios, on = ['dbn', 'school_name'], how = 'outer', indicator = True)
merged.rename(columns = {'schoolwide_pupil-teacher_ratio': 'pupil_teacher_ratio'}, inplace = True)
# Drop the raw totals and the `_merge` indicator column
merged.drop(['number_of_students_/_seats_filled', 'number_of_sections', '_merge'], axis = 1, inplace = True)
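
Before the `_merge` column is dropped, it can be used to see which schools matched in both tables. The following is a minimal sketch on made-up data. The DBN codes and values are hypothetical, and the merge key is simplified to `dbn` alone.

```python
import pandas as pd

# Made-up inputs: one school appears in both tables, two appear in only one.
sizes = pd.DataFrame({
    'dbn': ['01M015', '02M001'],
    'avg_class_size': [27.5, 25.0],
})
ratios_toy = pd.DataFrame({
    'dbn': ['02M001', '03M009'],
    'pupil_teacher_ratio': [14.2, 12.8],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
check = pd.merge(sizes, ratios_toy, on='dbn', how='outer', indicator=True)

# Count rows per merge outcome and list the schools without a match.
counts = check['_merge'].value_counts().to_dict()
unmatched = sorted(check.loc[check['_merge'] != 'both', 'dbn'].tolist())

print(counts)
print(unmatched)
```

Rows marked `left_only` or `right_only` end up with a missing value in the column that comes from the other table, so a check like this is one way to gauge how many schools will have incomplete data in the output.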

## Save data

Save the `merged` dataframe to a [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) file in the `data/intermediate` folder.

In [14]:
merged.to_feather(os.path.join(intermediate_dir, 'df_class_size.feather'))