# Processing the New York City School Survey data

This notebook documents how we downloaded, transformed, and cleaned the NYC School Survey data from the NYC Department of Education for our analysis of bullying/harassment during the 2013-14 school year in NYC public schools.

## Import Python libraries and set working directories

In [1]:
import os
import feather
import numpy as np
import pandas as pd

In [2]:
input_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'input')
intermediate_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'intermediate')
output_dir = os.path.join(os.path.dirname(os.getcwd()), 'data', 'output')

os.path.exists(intermediate_dir) or os.mkdir(intermediate_dir)
os.path.exists(output_dir) or os.mkdir(output_dir)

True
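
As a side note, `os.mkdir` creates only the final directory and fails if the parent `data` folder is missing. A minimal sketch of an alternative, assuming a temporary base folder (`base` is a hypothetical stand-in for the project root), uses `os.makedirs` with `exist_ok=True`:

```python
import os
import tempfile

# Hypothetical base folder standing in for the project root.
base = tempfile.mkdtemp()
intermediate_dir = os.path.join(base, "data", "intermediate")

# Creates the missing parent "data" folder too, and does nothing
# if the directory already exists.
os.makedirs(intermediate_dir, exist_ok=True)
os.makedirs(intermediate_dir, exist_ok=True)  # second call is a no-op

print(os.path.isdir(intermediate_dir))
```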

## Load raw data

The [raw Excel file](http://schools.nyc.gov/documents/misc/2014%20Public%20Data%20File%20SUPPRESSED.xlsx) of survey responses comes from the New York City Department of Education (NYCDOE). The full page for the 2014 NYC School Survey Results (representing the 2013-14 school year) is [here](http://schools.nyc.gov/Accountability/tools/survey/2014+NYC+School+Survey+Results). NYCDOE posts archived survey information [here](http://schools.nyc.gov/Accountability/tools/survey/SurveyArchives.htm).

In [12]:
surveys = pd.ExcelFile(
    os.path.join(input_dir, '2014 Public Data File SUPPRESSED.xlsx')
)

## Save an intermediate version of the data with just two columns: `DBN` (the school's unique ID number) and `School Name`

We will save this intermediate dataframe to a [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) file in the `data/intermediate` folder and use it to create a crosswalk from the `DBN`, the code that identifies schools in the NYC School Survey, to the `combokey`, the code that identifies schools in the federal civil rights data.

In [15]:
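# `sheet_name` replaces the older `sheetname` keyword, which recent pandas versions no longer accept
df_survey = surveys.parse(sheet_name = 'Total - GenEd')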
df_survey = df_survey.iloc[2:,0:2]
df_survey.reset_index(inplace = True, drop = True)
df_survey.to_feather(os.path.join(intermediate_dir, 'nyc_survey_ids.feather'))

## Transform raw data

The NYC School Survey file contains many sheets, corresponding to the surveyed populations (parents, students, teachers) and response types (percentages versus counts). We are interested in the "# of Responses" tabs for parents, students, and teachers (the General Ed versions for parents and students), which provide the raw number of responses for each school. We'll reshape the original Excel sheets to "long" format for easier analysis.

In [43]:
def ReshapeFunc(excel_obj, name):
    """ Takes in an Excel file object with multiple tabs in wide format
    and the name of the tab (sheet) to be parsed and reshaped.
    
    Returns a data frame of the specified tab reshaped to long format. """
    
    # parse the sheet and build column names of the form "<header>_<first row value>"
    df = excel_obj.parse(sheet_name = name)
    cols1 = list(df)
    cols2 = list(df.iloc[0,:])
    cols2 = [str(x).lower() for x in cols2]
    cols = [x+"_"+y for x,y in zip(cols1,cols2)]
    df.columns = cols
    df = df.iloc[2:,:]
    
    # reshape - indexing, pivoting and stacking
    idx = [c for c in df.columns if c.endswith('_nan')]
    multi_indexed_df = df.set_index(idx)
    stacked_df = multi_indexed_df.stack(dropna=False)
    long_df = stacked_df.reset_index()
    
    # clean up and finalize
    long_df.columns = long_df.columns.astype(str)
    level = [c for c in long_df.columns if c.startswith('level')]
    col_str = long_df[level].iloc[:,0].str.split("_") 
    long_df['question'] = [x[0] for x in col_str] 
    long_df['answer'] = [x[1] for x in col_str]
    long_df['value'] = long_df['0']
    long_df['value'] = long_df['value'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
    long_df['value'] = long_df['value'].fillna(0)
    drop = [c for c in long_df.columns if '0' in c or 'level' in c]
    df_final = long_df.drop(drop, axis = 1)
    
    df_final['question'] = np.where(df_final['question'].str.contains('Unnamed'), np.nan, df_final['question'])
    # forward-fill the question label across its answer columns
    df_final['question'] = df_final['question'].ffill()
    
    df_final.columns = df_final.columns.str.replace('_nan', '')
    df_final.columns = df_final.columns.str.replace(' ', '_')
    df_final.columns = df_final.columns.str.lower()
    
    return df_final

In [44]:
parents_num = ReshapeFunc(surveys, 'GenEd - Parent # of Responses')
students_num = ReshapeFunc(surveys, 'GenEd - Student # of Responses')
teachers_num = ReshapeFunc(surveys, 'Teacher # of Responses')
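
To illustrate what "wide to long" means here, the following is a minimal sketch on made-up data. It uses `DataFrame.melt` rather than the `set_index`/`stack` approach in `ReshapeFunc`, and the column names and counts are hypothetical, not taken from the survey file.

```python
import pandas as pd

# Hypothetical wide table: one row per school, one column per question/answer pair.
wide = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "5d_agree": [7, 30],
    "5d_disagree": [8, 50],
})

# melt() turns every non-id column into its own (column label, value) row.
long = wide.melt(id_vars="dbn", var_name="column", value_name="value")

# Split the combined column label into separate question and answer columns.
long[["question", "answer"]] = long["column"].str.split("_", n=1, expand=True)
long = long.drop(columns="column")

print(long)
```

Each school now contributes one row per question and answer option, which is the shape the later `groupby` steps rely on.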

## Select bullying/harassment questions from each of the surveys

### Parents

In the parent survey, the relevant questions are:
- 5d. At my child's school students harass or bully other students.
- 5e. At my child's school students harass or bully each other based on differences (such as race, color, ethnicity, national origin, citizenship / immigration status, religion, gender, gender identity, gender expression, sexual orientation, disability or weight).

The answer categories are:
- Strongly disagree
- Disagree
- Agree
- Strongly agree
- Don't Know

Let's select these and put the parent survey information into a single dataframe.

In [81]:
parents = parents_num.loc[parents_num['question'].str.startswith('5d') | 
                            parents_num['question'].str.startswith('5e')].copy()
parents['level'] = 'parents'
parents['question_n'] = parents.groupby(['dbn', 'question'])['value'].transform(sum)
parents.rename(columns = {'number_of_eligible_responses':'total_n',
                          'number_of_parent_responses':'survey_n'}, inplace = True)
parents['survey_n'] = parents['survey_n'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

In our analysis, we delete the Don't Know category and recalculate the response percentages, in order to make Parents' responses comparable to Students' and Teachers', which both have 4-category responses.

In [82]:
parents = parents.loc[parents['answer'] != "don’t know"].copy()
parents['question_n_new'] = parents.groupby(['dbn', 'question'])['value'].transform(sum)
parents['perc'] = (parents['value']/parents['question_n_new']) * 100
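
The effect of dropping "don't know" and recalculating can be sketched on a single hypothetical school and question. The counts below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical response counts for one school and one question.
toy = pd.DataFrame({
    "dbn": ["01M015"] * 5,
    "question": ["5d"] * 5,
    "answer": ["strongly disagree", "disagree", "agree",
               "strongly agree", "don’t know"],
    "value": [26, 8, 7, 7, 12],
})

# Total responses to the question, including "don't know".
toy["question_n"] = toy.groupby(["dbn", "question"])["value"].transform("sum")

# Drop "don't know" and recompute the denominator from the remaining rows.
kept = toy.loc[toy["answer"] != "don’t know"].copy()
kept["question_n_new"] = kept.groupby(["dbn", "question"])["value"].transform("sum")
kept["perc"] = kept["value"] / kept["question_n_new"] * 100

print(kept[["answer", "value", "question_n", "question_n_new", "perc"]])
```

In this sketch the denominator falls from 60 to 48, so the four remaining percentages sum to 100.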

### Students

In the student survey, the relevant questions are:
- 11c. At my school students harass or bully other students.
- 11d. At my school students harass or bully each other based on differences (such as race, color, ethnicity, national origin, citizenship/immigration status, religion, gender, gender identity, gender expression, sexual orientation, disability or weight).

The answer categories are:
- None of the time
- Some of the time
- Most of the time
- All of the time

Let's select these and put the student survey information into a single dataframe.

In [86]:
students_num['question_n'] = students_num.groupby(['dbn', 'question'])['value'].transform(sum)
students_num['survey_n'] = students_num.groupby(['dbn'])['question_n'].transform(max)
students = students_num.loc[students_num['question'].str.startswith('11c') | 
                            students_num['question'].str.startswith('11d')].copy()
students['level'] = 'students'
students['total_student_response_rate'] = students['total_student_response_rate'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
students['total_n'] = round(students['survey_n'] / (students['total_student_response_rate']/100))
students.rename(columns = {'total_student_response_rate':'total_response_rate'}, inplace = True)
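
Because the student sheet is used here without a count of eligible students, `total_n` is backed out from the number of respondents and the reported response rate. A small worked example with hypothetical numbers:

```python
# Hypothetical school: 180 students responded and the reported
# response rate is 90 percent.
survey_n = 180
total_student_response_rate = 90.0

# Eligible students = respondents / (response rate as a fraction),
# rounded to the nearest whole student.
total_n = round(survey_n / (total_student_response_rate / 100))

print(total_n)
```

Note that `survey_n` itself is an estimate in the cell above: it is the largest number of responses any single question received at the school.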

### Teachers

In the teacher survey, the relevant questions are:
- 10e. At my school, students are often harassed or bullied in school.
- 10j. At my school, there are conflicts based on differences (race, color, creed, ethnicity, national origin, citizenship/immigration status, religion, gender, gender identity, gender expression, sexual orientation, disability, or weight).

The answer categories are:
- Strongly disagree
- Disagree
- Agree
- Strongly agree

Let's select these and put the teacher survey information into a single dataframe.

In [87]:
teachers = teachers_num.loc[teachers_num['question'].str.startswith('10e') | 
                            teachers_num['question'].str.startswith('10j')].copy()
teachers['level'] = 'teachers'
teachers['question_n'] = teachers.groupby(['dbn', 'question'])['value'].transform(sum)
teachers.rename(columns = {'number_of_eligible_teachers':'total_n',
                          'number_of_teacher_respondents':'survey_n'}, inplace = True)
teachers['survey_n'] = teachers['survey_n'].apply(lambda x: pd.to_numeric(x, errors='coerce'))

## Recode questions and answers to make them comparable between Parents, Students, and Teachers

As shown above, parents, students, and teachers are asked comparable questions relating to bullying/harassment. They are first asked about bullying/harassment in general, and then they are asked about bullying/harassment based on differences. We denote the first question `harass` and the second `harass_differences`.

Note the question wording for teachers is slightly different, as teachers are asked about *conflicts* based on differences rather than harassment/bullying based on differences outright. This consideration is also mentioned in our analysis.

In [88]:
def recode_question(series):
    if '11d' in str(series) or '5e' in str(series) or '10j' in str(series) :
        return "harass_differences"
    else:
        return "harass"

In [89]:
students['question'] = students['question'].apply(recode_question)
parents['question'] = parents['question'].apply(recode_question)
teachers['question'] = teachers['question'].apply(recode_question)

Parents and teachers answer on a 4-point disagreement/agreement scale, ranging from "strongly disagree" to "strongly agree." We will recode these numerically from 1-4 in ascending agreement. Note that parents have a fifth category, "don't know", that we will recode to NaN. Students answer on a 4-point frequency scale, ranging from "none of the time" to "all of the time." We will recode these numerically from 1-4 in ascending frequency.

In [90]:
def recode_answers(series):
    if series == 'strongly disagree':
        return 1
    elif series == 'disagree':
        return 2
    elif series == 'agree':
        return 3
    elif series == 'strongly agree':
        return 4
    elif series == "don’t know":
        return np.nan
    elif series == 'none of the time':
        return 1
    elif series == 'some of the time':
        return 2
    elif series == 'most of the time':
        return 3
    elif series == 'all of the time':
        return 4
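
The same recoding can also be written as a lookup table. This is a sketch of an alternative, not the notebook's method: `answer_codes` is a hypothetical name, and `Series.map` returns NaN for any label missing from the dictionary, which covers the "don't know" case.

```python
import pandas as pd

# Hypothetical lookup table covering both the agreement and frequency scales.
answer_codes = {
    "strongly disagree": 1, "disagree": 2, "agree": 3, "strongly agree": 4,
    "none of the time": 1, "some of the time": 2,
    "most of the time": 3, "all of the time": 4,
}

answers = pd.Series(["agree", "all of the time", "don’t know", "disagree"])

# Labels absent from the dictionary (here "don’t know") map to NaN.
codes = answers.map(answer_codes)

print(codes)
```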

## Merge Parent, Student, and Teacher surveys and calculate response percentages

In [91]:
# pd.concat stacks the three dataframes; DataFrame.append was removed in pandas 2.0
survey = pd.concat([students, parents, teachers])
survey['answer_code'] = survey['answer'].apply(recode_answers)
pd.crosstab(survey['answer_code'], survey['answer'], dropna = False) # check the mapping

| answer_code | agree | all of the time | disagree | most of the time | none of the time | some of the time | strongly agree | strongly disagree |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 3568 | 0 | 0 | 7136 |
| 2 | 0 | 0 | 7136 | 0 | 0 | 3568 | 0 | 0 |
| 3 | 7136 | 0 | 0 | 3568 | 0 | 0 | 0 | 0 |
| 4 | 0 | 3568 | 0 | 0 | 0 | 0 | 7136 | 0 |


Set response values to NaN where nobody answered the question. These values are currently displayed as "0", which is misleading because no responses were collected.

In [92]:
survey['value'] = np.where(survey['question_n'] == 0, np.nan, survey['value'])
survey['perc'] = np.where(survey['question_n'] == 0, np.nan, survey['perc'])
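
A minimal sketch of this masking step on made-up rows, where the first two rows belong to a question nobody answered:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: question_n is the total number of responses to the question.
toy = pd.DataFrame({
    "question_n": [0, 0, 40, 40],
    "value": [0.0, 0.0, 10.0, 30.0],
})

# Where no one answered, replace the placeholder 0 with NaN;
# otherwise keep the original count.
toy["value"] = np.where(toy["question_n"] == 0, np.nan, toy["value"])

print(toy)
```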

Calculate survey response rate (% who responded to the survey) and question response rate (% who responded to the question). Note for parents the question response rate is based on the TOTAL number of responses (including the Don't Knows).

In [93]:
survey['survey_rr'] = (survey['survey_n'] / survey['total_n']) * 100
survey['survey_rr'] = np.where(survey['survey_rr'] > 100, 100, survey['survey_rr'])
survey['question_rr'] = (survey['question_n'] / survey['survey_n']) * 100
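
The two rates can be illustrated with hypothetical counts. This sketch caps the survey response rate with `Series.clip` in place of the `np.where` call above; both give the same result for values over 100.

```python
import pandas as pd

# Hypothetical counts for two schools. In the second school, more surveys
# were returned than there were eligible respondents on record.
toy = pd.DataFrame({
    "total_n": [149, 240],     # eligible respondents
    "survey_n": [66, 245],     # returned surveys
    "question_n": [64, 237],   # responses to one question
})

# Share of eligible respondents who returned a survey, capped at 100.
toy["survey_rr"] = (toy["survey_n"] / toy["total_n"] * 100).clip(upper=100)

# Share of survey respondents who answered this particular question.
toy["question_rr"] = toy["question_n"] / toy["survey_n"] * 100

print(toy)
```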

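For each surveyed population, calculate the percentage of respondents to the question who chose each answer (`perc`), based on the number of responses. Parents keep the percentages calculated earlier, which exclude the "don't know" responses.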

In [94]:
survey['perc'] = np.where(survey['level'] == 'parents', survey['perc'],
                          (survey['value'] / survey['question_n']) * 100)

Calculate % of parents who responded in the "don't know" category.

In [95]:
survey['dk_parents_perc'] = ((survey['question_n'] - survey['question_n_new'])/ survey['question_n'])*100 

## Reshape the data to the "wide" format such that there are 4 rows per school for the 4 main response categories

In [97]:
survey_pivot = survey.pivot_table(index = ['dbn', 'school_name', 'answer_code'],
                             columns = ['question', 'level'],
                             values = ['perc', 'survey_rr', 'question_rr', 'survey_n', 'dk_parents_perc'],
                             aggfunc ='mean')
survey_pivot.columns = ['_'.join(str(s).strip() for s in col if s) for col in survey_pivot.columns]
survey_pivot.reset_index(inplace = True)
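
The pivot produces multi-level column labels, which the list comprehension then joins into single strings such as `perc_harass_parents`. A minimal sketch on invented values shows the naming pattern:

```python
import pandas as pd

# Hypothetical long-format rows for one school and two answer codes.
toy = pd.DataFrame({
    "dbn": ["01M015"] * 4,
    "answer_code": [1, 2, 1, 2],
    "question": ["harass"] * 4,
    "level": ["parents", "parents", "teachers", "teachers"],
    "perc": [54.2, 16.7, 36.4, 54.5],
})

# One row per school and answer code; columns are (value, question, level) tuples.
wide = toy.pivot_table(index=["dbn", "answer_code"],
                       columns=["question", "level"],
                       values=["perc"],
                       aggfunc="mean")

# Join each column tuple into a single flat name.
wide.columns = ["_".join(str(s).strip() for s in col if s) for col in wide.columns]
wide = wide.reset_index()

print(list(wide.columns))
```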

In [100]:
survey_pivot['survey_n_parents'] = survey_pivot['survey_n_harass_parents']
survey_pivot['survey_n_students'] = survey_pivot['survey_n_harass_students']
survey_pivot['survey_n_teachers'] = survey_pivot['survey_n_harass_teachers']

survey_pivot['survey_rr_parents'] = survey_pivot['survey_rr_harass_parents']
survey_pivot['survey_rr_students'] = survey_pivot['survey_rr_harass_students']
survey_pivot['survey_rr_teachers'] = survey_pivot['survey_rr_harass_teachers']

survey_pivot.drop([c for c in survey_pivot.columns if ('survey_rr_harass_' in c) | ('survey_n_harass_' in c)], axis = 1, inplace = True)

In [102]:
survey_pivot.head()

Unnamed: 0,dbn,school_name,answer_code,dk_parents_perc_harass_parents,dk_parents_perc_harass_differences_parents,perc_harass_parents,perc_harass_students,perc_harass_teachers,perc_harass_differences_parents,perc_harass_differences_students,...,question_rr_harass_teachers,question_rr_harass_differences_parents,question_rr_harass_differences_students,question_rr_harass_differences_teachers,survey_n_parents,survey_n_students,survey_n_teachers,survey_rr_parents,survey_rr_students,survey_rr_teachers
0,01M015,P.S. 015 Roberto Clemente,1,20.0,26.5625,54.166667,,36.363636,63.829787,,...,100.0,96.969697,,100.0,66.0,0.0,22.0,44.295302,,91.666667
1,01M015,P.S. 015 Roberto Clemente,2,20.0,26.5625,16.666667,,54.545455,17.021277,,...,100.0,96.969697,,100.0,66.0,0.0,22.0,44.295302,,91.666667
2,01M015,P.S. 015 Roberto Clemente,3,20.0,26.5625,14.583333,,9.090909,6.382979,,...,100.0,96.969697,,100.0,66.0,0.0,22.0,44.295302,,91.666667
3,01M015,P.S. 015 Roberto Clemente,4,20.0,26.5625,14.583333,,0.0,12.765957,,...,100.0,96.969697,,100.0,66.0,0.0,22.0,44.295302,,91.666667
4,01M019,P.S. 019 Asher Levy,1,19.917012,22.362869,35.751295,,54.054054,41.304348,,...,100.0,96.734694,,100.0,245.0,0.0,37.0,100.0,,97.368421


## Save cleaned data

Save the `survey_pivot` dataframe, which represents the cleaned NYC School Survey data, to a [feather](https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/) file in the `data/intermediate` folder.

In [110]:
survey_pivot.to_feather(os.path.join(intermediate_dir, 'nyc_survey_wide.feather'))