# NYC 311 Capstone - Introduction
## Stephen Behunin | BrainStation | stephenbehuninwork@gmail.com
____________________________________________________________________________________________
## Notebook Executive Summary
The goal of this notebook is to introduce the project, describe the material contained within each notebook, provide a reference document for navigating the report, and define the scope and purpose of this analysis.
____________________________________________________________________________________________

## Layout
____________________________________________________________________________________________
#### Part 1- Report Table of Contents

#### Part 2- Introduction
- Background
- Business Problem
- Business Questions
- Goals of the Analysis
- Scope

#### Part 3 - Data Dictionary
- Final Complaint Dataset Dictionary
- Dropped Column Complaint Dataset Dictionary
- Timeseries Dataset Dictionary

# Part 1- Report Table of Contents
____________________________________________________________________________________________




This section provides a reference layout of all of the notebooks and supplementary materials for this report.

- **Introduction Notebook** (311_Introduction_Final.ipynb)
    - *Part 1- Report Table of Contents*
    - *Part 2- Introduction*
        - Background
        - Business Problem
        - Business Questions
        - Goals of the Analysis
        - Scope
    - *Part 3- Data Dictionary*
     

- **Data Cleaning Notebook** (311_Cleaning_Final.ipynb)
    - Pre-amble
        - Importing packages
        - Reading in the source data
        - Creating a backup copy of the original dataset
    - Part 1 - Dataset Basics
        - Collecting basic information about the dataset
    - Part 2 - Columns
        - Dropping unwanted columns
        - Top Ten Descriptors
        - Formatting column datatypes
    - Part 3 - Rows
        - Dealing with missing or null data
        - Searching for duplicate values
        - Resetting the index column
    - Part 4 - Limiting Scope
        - Filtering out unclosed complaints
        - Restricting the dataset to complaints from 2010 to 2019
        - Converting the complaint dataframe to a .csv file
    - Part 5- Creating the Timeseries Dataframe
        - Creating the dataset
        - Cleaning the new dataset
    - Part 6- Conversion to CSV
        - Final dimensions check
        - Converting the dataframe to a .csv file
        
        
- **Exploratory Data Analysis** (311_Timseries_EDA_Final.ipynb)
    - Pre-amble
        - Importing packages
        - Reading in the data
    - Part 1 - EDA by Timescale
        - Total
            - Daily SR Graph
            - Total Complaints by Descriptor
            - Histogram of SR Volumes
            - 100 Highest and Lowest SR Volume Days
        - Year
            - Total SRs by Year
            - Monthly SR Volume Year to Year
            - Top Ten Descriptors for January
            - Adjusting the timeseries data
        - Month
            - Monthly Mean, Minimum and Maximum
            - Monthly Alltime SR Volume
        - Weekday
    - Part 2 - Seasonal Trend Decomposition
        - Monthly SR Volume Graph
        - Seasonal Plot
        - Monthly Deviation from Mean SR Volume
        - Trend-Seasonal Decomposition
    - Conclusion


- **Modeling** (311_Timeseries_Modeling_Final.ipynb)
    - Pre-amble
        - Importing packages
        - Reading in the data
    - Part 1 - Basic Modeling
        - Monthly SR volume from 2010 to 2018
        - Differencing
        - Train-test split
        - Baseline Forecasts and Evaluation
            - Simple Mean Regression on Differenced Data
    - Part 2 - Advanced Modeling
        - Autocorrelation
        - AR Modeling
        - ARIMA Modeling
    - Part 3 - SARIMA Modeling
        - Creating the non-differenced training and validation sets
        - Non-Differenced Mean Baseline
        - Autocorrelation for Non-Differenced Data
        - Implementing the SARIMA Model
        - Evaluating the SARIMA Model
        - Final Scoring
    - Conclusion


- **Recurrent Neural Networks** (311_Timeseries_RNN_Final.ipynb)
    - Pre-amble
        - Importing packages
        - Reading in the data

    - Part 1 - Data preparation for Neural Networks
        - Feature Engineering Timestamps
        - Splitting the dataset
        - Normalizing the data
    - Part 2 - Defining functions and preparing background processes
        - Data Windowing
        - Indexing and Offsetting
        - Creating the Month window
        - Splitting
        - Plotting the windows
        - Making windowed datasets
    - Part 3 - Baseline & Metrics
        - Loss
        - Mean Squared Error
        - Mean Absolute Percentage Error (MAPE)
    - Part 4 - Neural Networks
        - General Neural Networks
    - Part 5 - Simple RNNs
        - Simple Model
    - Part 6 - GRUs
    - Part 7 - GRU with Adam Optimizer
    - Part 8 -  GRU with SGD-Nesterov Optimizer
    - Part 9 - The Final Comparison
    - Part 10 - Optimization
    - Conclusion


- **Final Report / Project Summary** (311_Report_Final.pdf)
- **Final Presentation** (311_Presentation_Final.pdf)


# Part 2- Introduction
____________________________________________
- Background 
- Business Problem
- Business Questions
- Goals of the Analysis
- Scope

## Background
_______________
In 2018 the NYC 311 system had 44,023,630 customer requests for services or information.

<a href="https://www1.nyc.gov/311/311-sets-new-record-in-2018.page">These interactions</a>  included everything from making noise complaints and requesting city records, to getting information about city services. The city began publishing an online database of service requests in 2010 in order to both enhance visibility around public services and provide a better method of tracking ongoing service requests. Through this portal customers can access information on both active and inactive service requests including the status of a complaint, what actions were taken to remedy the situation and a large amount of technical data regarding the complaint. 

## Business Problem 
________________________
On average the City of New York receives 2.1 million Service Requests (SRs) through its 311 system each year. Everything from potholes to noise complaints flows into the queue of pending SRs. The city must process, assign and complete the required actions for each and every SR it receives. This is of course in addition to dealing with emergency situations, performing regular maintenance, and building infrastructure and programs for the future of the city. Naturally this places extraordinary strain on city services, which are frequently stretched to the limit and need to use every ounce of manpower and funding available to them.

One of the fundamental challenges for successful resource allocation at this scale is consistency. Having appropriate resources available for departments to handle the flow of complaints is necessary to keep the system running smoothly, the bosses happy and the citizens content. Unfortunately the volume of requests received each day is far from consistent. On its slowest day on record the 311 system received 1680 SRs, which for a city of 8.4 million seems like a reasonable figure. On its busiest day, however, the system received 11735 requests. Astute observers may notice the slight sevenfold difference in daily SR volume experienced by this system. This leads to an interesting question: how do you effectively allocate resources, staffing and operating budgets when you may be dealing with 2000 complaints one day and 11000 the next?
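
The size of that swing can be checked directly from the two figures quoted above. This is only a quick arithmetic sketch; the variable names are illustrative and do not come from the project notebooks.

```python
# Daily SR volumes quoted in the paragraph above.
slowest_day = 1680
busiest_day = 11735

# Ratio between the busiest and the slowest day on record.
fold_difference = busiest_day / slowest_day
print(f"Busiest day is {fold_difference:.2f}x the slowest day")
```

The ratio comes out just under seven, which is the "sevenfold" spread the forecasting work aims to anticipate.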

## Business Questions
_________________________
The problem outlined above centers on uncertainty in SR volume. While randomness is inherent to any system so complex and human dependent as city services there are still patterns within the usage of the New York 311 system. The amount of uncertainty surrounding SR volume can be dramatically reduced through understanding these patterns and using them to forecast future SR volume. Reducing the uncertainty of these numbers will help NYC better allocate its resources and effectively serve the community. Answering these questions will be at the core of effectively understanding and forecasting SR volume:
1. What trends, seasonal changes and cycles affect SR volume for the NYC 311 system?
2. What are the traits/predictors of variance in SR volume not explained by the factors listed above?

A multitude of other questions can be devised as part of these larger questions or as supporting information for the analysis. However the goal of this project centers on these two questions.

## Goals of the Analysis
_______________________________
The goals of this analysis are as follows:

1. Understand the long term trends, seasonality and cycles present within the data.
    - Perform EDA and timeseries decomposition to understand these components. 
2. Effectively model future SR volume on a long timeframe in coarse detail. 
    - Predict SR volume based on understanding the above up to a year out.
    - This model can be used to assist in longer term budgeting and planning by providing reasonable projections of future SR volume. 
3. Create a forecasting tool that accurately predicts SR volume in the near term, 3-4 weeks ahead, with a high degree of precision.
    - Use the modeling for #2 combined with more advanced Neural Network techniques to make a more powerful and accurate, but nearsighted, forecast of SR volume.
    - This model can be used to make short term staffing decisions such as employee scheduling, and to assist in resource allocation at a more granular but shorter-range level than the previous model.
    
These goals are significant in the amount of effort and time needed to understand and accurately model them. Because this analysis is complex the scope of the project will be limited as described in the next section.

## Scope 
_________________________________
Between 2010 and 2021 the NYC 311 Open Data system logged 26.4 million service requests, which equates to roughly 2.3 million requests for each of the 11.5 years the system has been operational. To keep the scope of this project to a reasonable level there will be three major limitations to the dataset used for this analysis.
1. Timeframe - 2010 to 2019
    - Only complaints falling within these years will be considered for analysis.
        - This timeframe was chosen as it gives ample data for training, validating and testing the models. 
        - Additionally the Covid-19 pandemic caused issues with the 311 System which led to inconsistent data collection.
          This timeframe avoids these potential issues and ensures that once conditions return to normal the model will be as accurate as possible.
2. Complaint Status
    - Only complaints with a "Closed" status will be considered for analysis, meaning that only those complaints that are considered resolved by the responsible agency will be analyzed.
3. Timeseries Limitation
    - Due to the size of the dataset and the constraints placed on this project the timeseries analysis will be constrained to the single dimension of SR volume. 
        - The dataset comes with many columns that are unnecessary for the purposes of this analysis. The dataset will be cleaned normally to give the purest possible data before being reduced into the timeseries format required.
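
The first two limitations can be sketched as a simple filter. This is a minimal illustration in plain Python rather than the code from the cleaning notebook: the `"Status"` column name, the `in_scope` helper and the sample records are all assumptions made for the example.

```python
from datetime import datetime

def in_scope(record, first_year=2010, last_year=2019):
    """Return True for closed complaints created within the study years."""
    # "Status" is an assumed column name for the complaint status.
    is_closed = record["Status"] == "Closed"
    in_timeframe = first_year <= record["Created Date"].year <= last_year
    return is_closed and in_timeframe

# Hypothetical sample records, not rows from the real dataset.
sample = [
    {"Unique Key": 1, "Created Date": datetime(2015, 3, 4), "Status": "Closed"},
    {"Unique Key": 2, "Created Date": datetime(2020, 5, 1), "Status": "Closed"},
    {"Unique Key": 3, "Created Date": datetime(2012, 7, 9), "Status": "Open"},
]

kept = [r for r in sample if in_scope(r)]
print([r["Unique Key"] for r in kept])
```

Only the first record survives: the second falls outside 2010 to 2019 and the third is not closed.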

# Part 3 - Data Dictionary
_______________________________

This section has data dictionaries for each of the primary dataset configurations within the notebooks. Each row gives the name of a column and a brief description of the column.

## Final Complaint Dataset Dictionary

"Unique Key" - the unique identifier given to each NYC 311 complaint.

"Created Date" - the day on which the SR was created in the 311 system.

"Closed Date" - the day on which the SR was closed in the 311 system.

"Agency" - the agency responsible for resolving the 311 complaint.

"Descriptor" - a short description of the complaint from a preset list of possible complaints.

"Open Data Channel Type" - the channel through which the complaint was submitted : Phone, Online, Mobile etc.

"Latitude" - the geovalidated latitude coordinate of the complaint.

"Longitude" - the geovalidated longitude coordinate of the complaint.

## Dropped Column Complaint Dataset Dictionary

"Agency Name" - this column is already encoded in the "Agency" column, if the full name of the agency is needed it can be found easily. Because of this redundancy the column will be dropped. 

All location data except for the "Latitude" and "Longitude" columns will be dropped to reduce conflict between the variables, as most of the other location columns are redundant.

"Landmark" - this column indicates if there is a notable landmark associated with the location of the service request, it is redundant since latitude and longitude data are available. And because the sheer number of landmarks within the dataset puts this attribute outside the scope of this analysis.

"X Coordinate (State Plane)" & "Y Coordinate (State Plane)" - these columns show the grid locations of each service request on the state unified plane. This plane is designed to help the government entities within the State of New York better coordinate location information in a single unified way. However for the purposes of this analsis this information is redundant and will be dropped. Source: https://gis.ny.gov/coordinationprogram/workgroups/wg_1/related/standards/datum.htm

"Intersection Street 1" & "Intersection Street 2" - these columns display the nearest intersections on either side of the service request location, they are used for referencing blocks within each street. This information adds unneccessary complexity for the purposes of this analysis, the location of the requests will be encoded through the "Latitude" and "Longitude" columns instead.

"Location" - this column is just a combination of the "Latitude" & "Longitude" columns and is redundant, so it will be dropped.

"Complaint Type" - the complaint type is encoded by the "Descriptor" column and is therefore redundant.

"Incident Zip" - indicates the zip code of the complaint.

"Incident Address" - indicates the house/building/apartment number referenced in the complaint.

"Street Name" - name of the street for the complaint.

"Cross Street 1" & "Cross Street 2" - these columns show the nearest cross streets to the complaint location.

"Location Type" - this column indicates the type of location the service request was made regarding, such as subway, sidewalk, street etc.

"Address Type" - references the type of address for the complaint, whether it is a block address or a house/building address.

"City" - shows the city within NYC that the complaint is from.

"Borough" - shows the borough within NYC that the complaint is from.

"Park Facility Name" - name of the park referenced in the complaint, similar to "Landmark".

"Park Borough" - shows the borough within NYC that the park referenced in the complaint is from.

"Community Board" - this is similar to a neighborhood council that has citizen involvement in the administration of government services. This data is excessive for the analysis being performed and the effects will most likely be captured within the location data as the boards are assigned to geographic areas.

"Facility Type" - this column references the type of NYC government building that is subject to a complaint, but only if the complaint is about a government service or building. Irrelevant information for this analysis.

"BBL" - indicates the parcel id for the building referenced in the complaint, used by certain departments such as city planning to more accurately define and accesss records for complaint buildings.

"Due Date" - this column shows the date by which the NYC governmental agency responsible for the complaint should respond to the complaint. For this analysis the due date is irrelevant, although this data may be accessed as part of additional analysis to compare the predictions made by the modeling to the standard expected by the city.

"Resolution Description" - this column gives the action taken by the responding agency to rectify the complaint. Although this column could be useful for determining the length and quality of response it is post hoc information. Meaning that it is given after the event has occured and cannot be used as a predictor variable.

"Resolution Action Updated Date" - this column gives the timestamp of the last update to the resolution given by the responsible agency. This column is unnecessary as the metric of concern is the time to completion for each complaint not the last update.

Empty Columns - the following columns have no data and will not be explained.

"Vehicle Type"

"Taxi Company Borough"

"Taxi Pick Up Location"

"Bridge Highway Name" 

"Bridge Highway Direction"

"Road Ramp" 

"Bridge Highway Segment"

## Timeseries Dataset Dictionary

"Created Date" - datetime column showing the date each complaint was created, in this case the day that saw the SR Volume in the next column.

"Total SRs" - an aggregated column that shows the total number of Service Requests created on a particular day.

"Year sin" - a calculated column that gives the sine transformed seconds value of the "Created Date".

"Year cos' - a calculated column that gives the Cosine transformed seconds value of the "Created Date".