### Getting Data in Shape
##### This notebook ingests the NYCDOE High School Quality Reports for the 2015-16 and 2016-17 school years and produces a single table combining columns from multiple sheets across the source xlsx files. It also adds a school year column to the final dataset so the two years of data are easy to tell apart.

In [None]:
## Import needed libraries. Assumes env.yml in parent folder has been activated. 
import pandas as pd
import re
from functools import reduce

##### Here, I create variables that indicate which sheets and columns to keep from the data files. Since the column names for the graduation metrics vary slightly between the two years, I explicitly name them for ease of ingestion. In the future, I'd like to come up with a clever regex that selects these columns across both tables, similar to what I've done for the summary sheet information.

In [None]:
dataUrls= ["http://infohub.nyced.org/docs/default-source/default-document-library/2015_2016_hs_sqr_results_2017_01_05.xlsx",
          "http://infohub.nyced.org/docs/default-source/default-document-library/2016-17_hs_sqr.xlsx"]
dataSheets = ["Summary","Student Achievement","Closing the Achievement Gap"]
summaryColsRegex = "(?<!Trust)Percent (?!Positive|of|Overage)|(8)|Economic|Name|Type|Enrollment"
achievement1516 = ['School Name','Metric Value - Graduation Rate, 4 year',
                  'Metric Value - College and Career Preparatory Course Index']
achievement1617 = ['School Name','Metric Value - 4-Year Graduation Rate',
                 'Metric Value - College and Career Preparatory Course Index']
closing1516 = ['School Name','Metric Value - Graduation Rate, 4-year, black/hispanic lowest third city',
               'Metric Value - Graduation Rate, 4-year, ELL']
closing1617 = ['School Name','Metric City Rating - 4-Year Graduation Rate - Black or Hispanic Males in Lowest Third Citywide',
            'Metric Value - 4-Year Graduation Rate - English Language Learners']
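
##### As a rough illustration of how the summary regex behaves, the sketch below applies it with `DataFrame.filter` to an empty toy frame. The column names in `toyColumns` are made up for this example and are not taken from the actual report, so treat the result as a demonstration of the pattern only.

```python
import pandas as pd

summaryColsRegex = "(?<!Trust)Percent (?!Positive|of|Overage)|(8)|Economic|Name|Type|Enrollment"

# Hypothetical column names, chosen only to exercise each part of the pattern
toyColumns = ["DBN", "School Name", "School Type", "Enrollment",
              "Percent Asian", "Percent of Students Chronically Absent",
              "Trust - Percent Positive", "Economic Need Index",
              "Average Grade 8 English Proficiency"]
toy = pd.DataFrame(columns=toyColumns)

# filter(regex=...) keeps every column whose name contains a match for the pattern
kept = list(toy.filter(regex=summaryColsRegex).columns)
print(kept)
```

##### The lookahead drops "Percent of ..." and "Percent Positive" columns, while the bare `(8)` alternative keeps any column whose name contains the digit 8.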

##### As part of cleaning, I want to simplify the column names, so this function handles the various quirks of this particular dataset.

In [None]:
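def cleanColumnNames(col):
    # Drop the common metric prefix, then strip spaces, hyphens and commas
    col = col.replace("Metric Value -", "")
    col = col.replace(" ", "")
    col = col.replace("-", "")
    col = col.replace(",", "")
    # Lowercase the first character to get camelCase-style names
    col = col[0].lower() + col[1:]
    return col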
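
##### To make the effect concrete, here is a small sketch that runs the same cleaning steps on three of the column names listed earlier. The function is repeated only so the snippet can run on its own.

```python
def cleanColumnNames(col):
    col = col.replace("Metric Value -", "")
    col = col.replace(" ", "")
    col = col.replace("-", "")
    col = col.replace(",", "")
    return col[0].lower() + col[1:]

examples = ["School Name",
            "Metric Value - Graduation Rate, 4 year",
            "Metric Value - 4-Year Graduation Rate"]
cleaned = [cleanColumnNames(c) for c in examples]
print(cleaned)
# ['schoolName', 'graduationRate4year', '4YearGraduationRate']
```

##### Note that the same metric still ends up with a different cleaned name in each year, which is why the 2015-16 columns are renamed to match the 2016-17 ones further down.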

##### Next, I create a custom cleaning function that downloads the data, reads in the required sheets and subsets each one to the correct columns. It then merges the sheets on school name (the equivalent of a column bind) and adds a school year column.

In [None]:
def cleanSqr(url):
    temp = pd.read_excel(io=url,sheet_name = dataSheets,header=1)
    year = re.search("(?<=library/)(.*)(?=_hs)",url)[0]
    temp["Summary"] = temp["Summary"].filter(regex = summaryColsRegex)
    if year == "2016-17" :
        temp["Student Achievement"] = temp["Student Achievement"].filter(achievement1617)
        temp["Closing the Achievement Gap"] = temp["Closing the Achievement Gap"].filter(closing1617)
    else :
        temp["Student Achievement"] = temp["Student Achievement"].filter(achievement1516)
        temp["Closing the Achievement Gap"] = temp["Closing the Achievement Gap"].filter(closing1516)
    # Merge the three sheets pairwise on school name (inner join by default)
    full = reduce(lambda left,right: pd.merge(left,right,on='School Name'), temp.values())
    colNames = list(full)
    newNames = [cleanColumnNames(col) for col in colNames]
    columnDict = {k: v for k, v in zip(colNames, newNames)}
    full = full.rename(columns = columnDict)
    full['schoolYear'] = year
    return full
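
##### The two less obvious pieces of this function are the year extraction and the `reduce` merge. The sketch below shows both in isolation. The URLs are the ones used above, but the `sheets` dictionary is a toy stand-in for what `pd.read_excel` returns when given a list of sheet names, and its metric columns are invented.

```python
import re
from functools import reduce

import pandas as pd

dataUrls = ["http://infohub.nyced.org/docs/default-source/default-document-library/2015_2016_hs_sqr_results_2017_01_05.xlsx",
            "http://infohub.nyced.org/docs/default-source/default-document-library/2016-17_hs_sqr.xlsx"]

# The lookbehind/lookahead pair captures whatever sits between "library/" and "_hs"
years = [re.search("(?<=library/)(.*)(?=_hs)", url)[0] for url in dataUrls]
print(years)  # ['2015_2016', '2016-17']

# Toy stand-in for the dict of sheets; column names other than 'School Name' are hypothetical
sheets = {
    "Summary": pd.DataFrame({"School Name": ["A", "B"], "Enrollment": [100, 200]}),
    "Student Achievement": pd.DataFrame({"School Name": ["A", "B"], "gradRate": [0.9, 0.8]}),
    "Closing the Achievement Gap": pd.DataFrame({"School Name": ["B", "A"], "ellGradRate": [0.7, 0.6]}),
}

# reduce folds the sheets left to right: (Summary + Achievement) + Closing
merged = reduce(lambda left, right: pd.merge(left, right, on="School Name"), sheets.values())
print(merged)
```

##### Because `pd.merge` defaults to an inner join, a school that is missing from any one sheet is dropped from the merged result.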

##### In this step, I use a list comprehension to apply my function to each url. While the list is short now, this approach scales easily to a much longer list of files. I also rename the columns of the 2015-16 dataset, by position, to match the 2016-17 names. After doing so, I run a few lines of code to confirm that both dataframes have the same column names.

In [None]:
sqr1516, sqr1617 = [cleanSqr(x) for x in dataUrls]
update1516 = {k: v for k, v in zip(list(sqr1516), list(sqr1617))}
sqr1516 = sqr1516.rename(columns = update1516)

In [None]:
s15 = set(list(sqr1516))
s16 = set(list(sqr1617))
s15.symmetric_difference(s16) # An empty set confirms that both dataframes have the same column names
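
##### Here is a minimal sketch of the rename-and-check pattern on two toy frames. The frames and their values are invented for illustration. Only the technique (pairing columns by position with `zip`, then comparing the sets of names) mirrors the cells above.

```python
import pandas as pd

# Toy frames whose second column names differ, as the two report years do
old = pd.DataFrame({"schoolName": ["A"], "graduationRate4year": [0.9]})
new = pd.DataFrame({"schoolName": ["B"], "4YearGraduationRate": [0.8]})

# Names present in only one of the two frames before renaming
before = set(old.columns).symmetric_difference(set(new.columns))

# Pair the columns by position and rename the older frame to match
update = dict(zip(old.columns, new.columns))
old = old.rename(columns=update)

# After renaming, no name should be unique to either frame
after = set(old.columns).symmetric_difference(set(new.columns))
print(before, after)
```

##### One caveat: the positional pairing is only correct if both frames list their metrics in the same order, and the set comparison cannot detect columns that were paired with the wrong counterpart.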

##### Finally, I row bind the two years of data into a single dataframe to save. This will be the dataset I use going forward in analyses. 

In [None]:
colList = list(sqr1516)
sqr1617 = sqr1617[colList]  # ensures the columns are in the same order before concatenating
finalData = pd.concat([sqr1516, sqr1617])  # row bind; DataFrame.append is not available in pandas 2.0+
finalData.to_csv(path_or_buf="data/sqrAnalysisData.csv", index=False)
# If you are running this notebook, change the path_or_buf argument to fit your desired save location
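
##### For reference, a small sketch of the row bind on toy frames using `pd.concat`, which does the same job as `DataFrame.append` and still works in pandas 2.0 and later, where `append` was removed. The frames and values are made up, and `ignore_index=True` is an optional choice here to get a fresh 0..n-1 index.

```python
import pandas as pd

first = pd.DataFrame({"schoolName": ["A"], "enrollment": [100], "schoolYear": ["2015_2016"]})
# Same columns as `first`, deliberately in a different order
second = pd.DataFrame({"enrollment": [120], "schoolYear": ["2016-17"], "schoolName": ["A"]})

# Reorder the second frame's columns to match the first
second = second[list(first)]

# Stack the rows; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([first, second], ignore_index=True)
print(combined)
```

##### `pd.concat` aligns on column names, so the explicit reordering mainly keeps the intent visible. The output column order follows the first frame either way.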