# Processing the federal Civil Rights Data Collection survey data

This notebook documents how we downloaded, transformed, and cleaned the Civil Rights Data Collection survey data from the U.S. Department of Education's Office for Civil Rights for our analysis of bullying/harassment during the 2013-14 school year in NYC public schools. Note that although our analysis is restricted to NYC schools, here we are loading the entire dataset of U.S. public schools.

## Import Python libraries and set working directories

In [1]:
import os
import re
import feather
import numpy as np
import pandas as pd

In [2]:
input_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'input')
intermediate_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'intermediate')

## Load raw data

The [raw file](https://inventory.data.gov/dataset/2acc601e-9806-4dff-b144-f8a5e7c095b8/resource/3dc84a95-526a-4b90-aacd-72f60d4fecbc/download/crdc201314csv.zip) of survey responses comes from the U.S. Department of Education (ED)'s Office for Civil Rights (OCR), available on [Data.gov](https://catalog.data.gov/dataset/civil-rights-data-collection-2013-14). The full page for the Civil Rights Data Collection for the 2013-14 school year is [here](https://www2.ed.gov/about/offices/list/ocr/docs/crdc-2013-14.html). OCR also hosts a data portal with information from earlier years [here](https://ocrdata.ed.gov/).

In [3]:
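# Read the ID columns as strings so leading zeros are preserved, and
# treat the negative codes in the file as missing values.
ocr_schools_read = pd.read_csv(
    os.path.join(input_dir, 'CRDC2013_14_SCH.csv'),
    encoding='ISO-8859-1',
    dtype={'COMBOKEY': 'str', 'LEAID': 'str', 'NCES_SCHOOL_ID': 'str'},
    na_values=[-1, -2, -3, -4, -5, -9],
    low_memory=False)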

ocr_schools_read.columns = ocr_schools_read.columns.str.lower()
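
To illustrate what the `dtype` and `na_values` arguments do, here is a minimal sketch on a made-up three-row extract. The IDs and counts are invented for illustration and do not come from the CRDC file; only the column names and the list of negative codes are taken from the cell above.

```python
import io
import pandas as pd

# Hypothetical extract: IDs and counts are made up for illustration.
toy_csv = io.StringIO(
    "COMBOKEY,TOT_ENR_M,TOT_ENR_F\n"
    "010000500870,250,-9\n"
    "010000500871,-2,310\n"
    "010000500879,120,115\n"
)

toy = pd.read_csv(
    toy_csv,
    dtype={'COMBOKEY': 'str'},            # keep leading zeros in the ID
    na_values=[-1, -2, -3, -4, -5, -9],   # negative codes are read as missing
)
toy.columns = toy.columns.str.lower()

# Count the missing values per column after loading.
missing_per_column = toy.isna().sum()
print(toy)
print(missing_per_column)
```

Reading the ID as a string matters because an integer parse would drop the leading zero and break later joins on `combokey`.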

## Save an intermediate version of the data with school identifiers

We keep just three columns: `sch_name`, `combokey` (the school's unique ID number), and `nces_school_id` (the school's ID number in another federal dataset). We save this intermediate dataframe to a [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) file in the `data/intermediate` folder and later use it to create a crosswalk from the `DBN`, the code that identifies schools in the NYC School Survey, to the `combokey`, the code that identifies schools in the federal civil rights data. The `nces_school_id` is an intermediate link between the two.

In [4]:
ocr_schools_ids = ocr_schools_read[['sch_name', 'combokey', 'nces_school_id']]
ocr_schools_ids.to_feather(os.path.join(intermediate_dir, 'ocr_schools_ids.feather'))
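
The crosswalk itself is built elsewhere, so the following is only a sketch of the idea under stated assumptions. The `dbn_to_nces` lookup table, the `toy_ids` stand-in for `ocr_schools_ids`, and all ID values are hypothetical; the actual steps may differ.

```python
import pandas as pd

# Hypothetical lookup from NYC DBN to NCES school ID (values are made up).
dbn_to_nces = pd.DataFrame({
    'dbn': ['01M015', '01M019'],
    'nces_school_id': ['N001', 'N002'],
})

# Stand-in for ocr_schools_ids (values are made up).
toy_ids = pd.DataFrame({
    'sch_name': ['School A', 'School B', 'School C'],
    'combokey': ['C001', 'C002', 'C003'],
    'nces_school_id': ['N001', 'N002', 'N003'],
})

# Join on the shared NCES ID; validate raises if either side has duplicates.
crosswalk = dbn_to_nces.merge(
    toy_ids, on='nces_school_id', how='left', validate='one_to_one'
)[['dbn', 'combokey']]
print(crosswalk)
```

A left join keeps every NYC school in the result, so any `DBN` without a federal match shows up with a missing `combokey` instead of silently disappearing.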

## Select bullying/harassment variables

For each school, OCR collects the number of allegations of bullying/harassment, the number of students reported to have been bullied/harassed, and the number of students disciplined for bullying/harassment. Each count is broken out by the basis of the bullying/harassment: race, sex, or disability.

In [5]:
ocr = ocr_schools_read[['combokey'] + [
    c for c in ocr_schools_read.columns
    if 'tot_hb' in c or 'tot_enr' in c or 'hballegations' in c]].copy()
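
As a quick check of how the substring filter behaves, here is a small sketch on an empty dataframe. The column names are assumed examples that follow the fragments used in the cell above; they have not been verified against the full file.

```python
import pandas as pd

# Assumed example column names; the real file has many more.
toy = pd.DataFrame(columns=[
    'combokey', 'sch_name', 'tot_enr_m', 'tot_enr_f',
    'sch_hballegations_sex', 'tot_hbreported_rac_m', 'tot_hbdisciplined_dis_f',
])

# Keep the ID plus any column whose name contains one of these fragments.
keep_fragments = ('tot_hb', 'tot_enr', 'hballegations')
selected = ['combokey'] + [
    c for c in toy.columns if any(fragment in c for fragment in keep_fragments)
]
print(selected)
```

Because the match is on substrings, any other column that happens to contain one of these fragments would also be kept, so it is worth inspecting the selected list once.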

Rename the variables, then create an indicator for each category of harassment that equals 1 if there was at least one allegation or student in that category and 0 otherwise.

In [6]:
ocr.rename(columns = {col: re.sub('rac', 'race', col) for col in ocr.columns}, inplace = True)
ocr.rename(columns = {col: re.sub('sch_hballegations', 'allegations_harass', col) for col in ocr.columns}, inplace = True)
ocr.rename(columns = {col: re.sub('tot_hbreported', 'students_report_harass', col) for col in ocr.columns}, inplace = True)
ocr.rename(columns = {col: re.sub('tot_hbdisciplined', 'students_disc_harass', col) for col in ocr.columns}, inplace = True)
ocr.rename(columns = {col: re.sub('tot_enr', 'ocr_enroll', col) for col in ocr.columns}, inplace = True)
ocr.drop([c for c in ocr.columns if 'dso' in c], axis = 1, inplace = True)
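
The renaming rules can be checked in isolation. This sketch applies the same substitutions, in the same order, to a handful of assumed column names; `rename_column` and the `examples` list are hypothetical helpers introduced only for illustration.

```python
import re

# Same patterns and order as the rename calls above.
replacements = [
    ('rac', 'race'),
    ('sch_hballegations', 'allegations_harass'),
    ('tot_hbreported', 'students_report_harass'),
    ('tot_hbdisciplined', 'students_disc_harass'),
    ('tot_enr', 'ocr_enroll'),
]

def rename_column(col):
    """Apply each substitution in turn to one column name."""
    for pattern, repl in replacements:
        col = re.sub(pattern, repl, col)
    return col

# Assumed example names, inferred from the patterns above.
examples = ['sch_hballegations_rac', 'tot_hbreported_sex_m',
            'tot_hbdisciplined_dis_f', 'tot_enr_f', 'combokey']
renamed = {col: rename_column(col) for col in examples}
for old, new in renamed.items():
    print(old, '->', new)
```

Order matters here: the `rac` to `race` substitution runs first, before any replacement text is introduced, so it only touches the original suffix.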

In [7]:
for var in ['ocr_enroll', 'students_report_harass_sex', 'students_report_harass_race', 
            'students_report_harass_dis', 'students_disc_harass_sex', 'students_disc_harass_race', 'students_disc_harass_dis'] :
    ocr[var + '_tot'] = ocr[var + '_m'] + ocr[var + '_f']

for var in ['students_report_harass_sex', 'students_report_harass_race', 
            'students_report_harass_dis', 'students_disc_harass_sex', 'students_disc_harass_race', 'students_disc_harass_dis'] :    
    ocr['perc_' + var] = (ocr[var + '_tot'] / ocr['ocr_enroll_tot']) * 100 

for var in ['sex', 'race', 'dis'] :
    ocr['perc_allegations_harass_' + var] = (ocr['allegations_harass_' + var] / ocr['ocr_enroll_tot']) * 100
    
for var in ['dis', 'race', 'sex'] : 
    ocr['students_report_harass_' + var + '_tot_ind'] = np.where(ocr['students_report_harass_' + var + '_tot'] > 0, 1, 0)
    ocr['students_disc_harass_' + var + '_tot_ind'] = np.where(ocr['students_disc_harass_' + var + '_tot'] > 0, 1, 0)
    ocr['allegations_harass_' + var + '_ind'] = np.where(ocr['allegations_harass_' + var] > 0, 1, 0)

for indicator in ['disc_harass_', 'report_harass_', 'allegations_harass_']:
    indicators_cols = [col for col in ocr.columns if indicator in col and 'ind' in col]
    ocr[indicator + 'ind'] = ocr[indicators_cols].max(axis = 1)
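
To make the indicator logic concrete, here is a minimal sketch on three made-up schools, using only the allegation counts. The values are invented, and the column filter is simplified to a suffix match for this toy case.

```python
import numpy as np
import pandas as pd

# Made-up counts: the third school has no data in any category.
toy = pd.DataFrame({
    'allegations_harass_sex':  [0, 2, np.nan],
    'allegations_harass_race': [0, 0, np.nan],
    'allegations_harass_dis':  [1, 0, np.nan],
})

# Per-category indicator: 1 if the count is above zero, else 0.
for var in ['sex', 'race', 'dis']:
    toy['allegations_harass_' + var + '_ind'] = np.where(
        toy['allegations_harass_' + var] > 0, 1, 0)

# Overall indicator: 1 if any category indicator is 1.
ind_cols = [c for c in toy.columns if c.endswith('_ind')]
toy['allegations_harass_ind'] = toy[ind_cols].max(axis=1)
print(toy[ind_cols + ['allegations_harass_ind']])
```

A missing count does not compare as greater than zero, so in this sketch the school with no data gets an indicator of 0 rather than a missing value.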

## Save cleaned data

Save the `ocr` dataframe, which represents the cleaned Civil Rights Data Collection survey data, to a [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) file in the `data/intermediate` folder.

In [8]:
ocr.to_feather(os.path.join(intermediate_dir, 'federal_ocr_survey.feather'))