# NYC 311 Calls
This project explores the 311 Service Requests dataset (2010 to present) published through New York City's open data initiative. NYC 311 is the channel through which residents reach the city's non-emergency services, with requests ranging from noise complaints to road maintenance issues. The dataset is publicly available and regularly updated on the [NYC Open Data portal](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9), which also provides a data dictionary documenting its structure.

Our analysis uses data through August 4, 2023 and focuses on how the volume of non-emergency requests changes over time. The primary objectives are:

- **Daily Complaint Patterns**: Calculating the average number of daily complaints received in the year 2022.
- **Peak Inquiry Analysis**: Identifying the single date with the maximum number of calls and determining the prevalent complaint types on that day.
- **Seasonal Variations and Quietest Periods**: Grouping calls by month to find the historically quietest month, and performing an additive ETS decomposition of the daily series to read off the seasonal component on a specific date (December 25, 2020).
- **Autocorrelation Study**: Examining the relationship between consecutive days' call volumes to understand dependencies and patterns.
- **Forecasting**: Using the Prophet library to forecast daily call volumes and measuring accuracy with the Root Mean Square Error (RMSE) on a held-out test set of the last 90 days.

Together, these analyses describe the operational scale of NYC's 311 service and its temporal patterns, which could help city planners and public service providers allocate resources and improve service delivery.

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from prophet import Prophet
from prophet.diagnostics import performance_metrics
from sklearn.metrics import mean_squared_error

In [8]:
df = pd.read_pickle('shared/Project-3_NYC_311_Calls.pkl')
# Use 'Created Date' as a proper DatetimeIndex
df = df.set_index(pd.DatetimeIndex(df['Created Date']))
# Drop the now-redundant 'Created Date' column
del df['Created Date']
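
The preview of `df` shows a 2022 row sitting between 2011 rows, so the index does not appear to be in chronological order. The cells in this notebook work either way, but a quick check (and an optional sort) can be sketched as follows. The `toy` frame is a hypothetical stand-in built from three rows of the preview, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df: three rows copied from the preview, in their original order.
toy = pd.DataFrame({
    "Created Date": ["2011-04-06 00:00:00", "2022-07-08 11:14:43", "2011-04-06 00:00:00"],
    "Unique Key": [20184537, 54732265, 20184540],
})

# Same idea as the cell above: move 'Created Date' into a DatetimeIndex.
toy = toy.set_index(pd.DatetimeIndex(toy.pop("Created Date"), name="Created Date"))

# Check whether the timestamps are already in chronological order.
was_sorted = toy.index.is_monotonic_increasing
print("Sorted before:", was_sorted)

# Sorting is optional here, but it makes date-range slicing predictable.
toy = toy.sort_index()
print("Sorted after:", toy.index.is_monotonic_increasing)
```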

In [9]:
df

Created Date,Unique Key,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,City,Resolution Description,Borough,Open Data Channel Type
2011-04-06 00:00:00,20184537,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10002.0,NEW YORK,More than one complaint was received for this ...,MANHATTAN,UNKNOWN
2011-04-06 00:00:00,20184538,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,WINDOWS,RESIDENTIAL BUILDING,11236.0,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,UNKNOWN
2011-04-06 00:00:00,20184539,HPD,Department of Housing Preservation and Develop...,PAINT - PLASTER,WALLS,RESIDENTIAL BUILDING,10460.0,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2022-07-08 11:14:43,54732265,DSNY,Department of Sanitation,Dirty Condition,Trash,Sidewalk,10467.0,BRONX,The Department of Sanitation investigated this...,BRONX,PHONE
2011-04-06 00:00:00,20184540,HPD,Department of Housing Preservation and Develop...,NONCONST,VERMIN,RESIDENTIAL BUILDING,10460.0,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
...,...,...,...,...,...,...,...,...,...,...,...
2011-04-06 00:00:00,20184532,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10468,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2011-04-06 00:00:00,20184533,HPD,Department of Housing Preservation and Develop...,HEATING,HEAT,RESIDENTIAL BUILDING,10018,NEW YORK,More than one complaint was received for this ...,MANHATTAN,UNKNOWN
2011-04-06 00:00:00,20184534,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,STAIRS,RESIDENTIAL BUILDING,10460,BRONX,The Department of Housing Preservation and Dev...,BRONX,UNKNOWN
2011-04-06 00:00:00,20184535,HPD,Department of Housing Preservation and Develop...,GENERAL CONSTRUCTION,GAS,RESIDENTIAL BUILDING,11236,BROOKLYN,The Department of Housing Preservation and Dev...,BROOKLYN,UNKNOWN


## 1. What is the average number of daily complaints received in 2022?

In [11]:
# Filter data for the year 2022
df_2022 = df.loc['2022']

# Count daily complaints in 2022
daily_calls_2022 = df_2022['Unique Key'].resample('D').count()

# Calculate average daily complaints in 2022
average_daily_calls_2022 = daily_calls_2022.mean()

print("The average number of daily complaints received in 2022:", average_daily_calls_2022)

The average number of daily complaints received in 2022: 8684.320547945206
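
One detail behind this number is how `resample('D')` treats days with no requests. A minimal sketch on hypothetical toy data (the `toy` frame and its timestamps are made up for illustration) suggests that empty days between the first and last timestamp are counted as zero and therefore lower the average:

```python
import pandas as pd

# Hypothetical toy data: six requests on two days, with an empty day in between.
toy = pd.DataFrame(
    {"Unique Key": [1, 2, 3, 4, 5, 6]},
    index=pd.DatetimeIndex(
        [
            "2022-01-01 08:00", "2022-01-01 09:30", "2022-01-01 23:59",
            "2022-01-03 00:00", "2022-01-03 12:00", "2022-01-03 18:45",
        ],
        name="Created Date",
    ),
)

# One bin per calendar day between the first and last timestamp, empty days included.
toy_daily = toy["Unique Key"].resample("D").count()
toy_mean = toy_daily.mean()

print(toy_daily.tolist())  # counts per day, including the empty day
print(toy_mean)            # mean over three days, not two
```

Days before the first or after the last record are not added by `resample`, so the denominator depends on the span of the data. In our case the reported average times 365 is a whole number (3,169,777), which is consistent with all 365 days of 2022 being covered.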


## 2. On which single date were the maximum number of calls received?


In [12]:
# Count daily complaints for all data
daily_calls = df['Unique Key'].resample('D').count()

# Find the date with the maximum number of calls received
max_calls_date = daily_calls.idxmax()
print("The date with the maximum number of calls received:", max_calls_date)

# Further exploration: number of calls received on that day
max_calls = daily_calls.max()
print("Number of calls:", max_calls)

The date with the maximum number of calls received: 2020-08-04 00:00:00
Number of calls: 24415


## 3. On the date the maximum number of calls were received, what was the most important complaint type?

In [13]:
# Select all requests created on the peak date
max_calls_data = df.loc[max_calls_date.strftime('%Y-%m-%d')]

# "Most important" is interpreted here as the most frequent complaint type
most_important_complaint = max_calls_data['Complaint Type'].value_counts().idxmax()

print("The most important complaint type:", most_important_complaint)

The most important complaint type: Damaged Tree
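
Because "most important" is read here as "most frequent", it can help to see how dominant the top complaint type is rather than only its name. The sketch below repeats the peak-day logic on hypothetical toy data (the rows are made up) and uses `value_counts(normalize=True)` to get each type's share:

```python
import pandas as pd

# Hypothetical toy data: four requests on one day, two on the next.
toy = pd.DataFrame(
    {"Complaint Type": ["Damaged Tree", "Noise", "Damaged Tree", "Damaged Tree", "Heat", "Noise"]},
    index=pd.DatetimeIndex(
        [
            "2020-08-04 07:00", "2020-08-04 08:00", "2020-08-04 09:00", "2020-08-04 10:00",
            "2020-08-05 07:00", "2020-08-05 08:00",
        ],
        name="Created Date",
    ),
)

# Find the busiest day, as in the cells above.
daily = toy["Complaint Type"].resample("D").count()
peak_day = daily.idxmax()

# Share of each complaint type on the busiest day, largest first.
peak_rows = toy.loc[peak_day.strftime("%Y-%m-%d")]
shares = peak_rows["Complaint Type"].value_counts(normalize=True)

print(peak_day.date())
print(shares)
```

Applied to the real data, `shares.head()` would show whether a single complaint type dominates the peak day or only narrowly leads.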


## 4. Quietest month: Group the data by months, and identify the month that historically has the fewest number of calls.

In [14]:
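# Count calls per calendar month, then relabel each bin by its month number (1-12).
# Note: the 'ME' (month-end) alias requires pandas >= 2.2; older versions use 'M'.
monthly_calls = df['Unique Key'].resample('ME').count()
monthly_calls.index = monthly_calls.index.month

# Sum the counts for each month number across all years
total_monthly_calls = monthly_calls.groupby(monthly_calls.index).sum()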

# Find the month with the fewest calls
quietest_month = total_monthly_calls.idxmin()

print("The historically quietest month:", quietest_month)

The historically quietest month: 12
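
Summing across years treats every month number as if it were observed equally often. Since our data ends on August 4, 2023, the later months of the year may be covered by fewer complete years than the earlier ones, which can pull their totals down. The sketch below uses hypothetical monthly counts (not the real data) to show how a total-based and an average-based ranking can disagree when coverage is uneven:

```python
import pandas as pd

# Hypothetical monthly totals covering 18 months: Jan 2021 through Jun 2022.
# January-June appear twice, July-December only once.
months = pd.period_range("2021-01", "2022-06", freq="M")
counts = [120 if 7 <= p.month <= 11 else 150 if p.month == 12 else 100 for p in months]
monthly = pd.Series(counts, index=months)

# Total per month number (the approach used above) versus average per month number.
by_sum = monthly.groupby(monthly.index.month).sum()
by_mean = monthly.groupby(monthly.index.month).mean()

print("Quietest by total:  ", by_sum.idxmin())
print("Quietest by average:", by_mean.idxmin())
```

This is only an illustration of the effect; whether it changes the answer for the 311 data would need to be checked by rerunning the cell above with `.mean()` in place of `.sum()`, ideally after excluding the partial final month.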


## 5. Resample your time series to a daily frequency.  Perform ETS decomposition based on an additive model.  What is the value of the seasonal component on 2020-12-25 (rounded to the nearest integer)?

In [15]:
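# Additive decomposition of the daily series.
# With a daily-frequency index and no explicit period, statsmodels uses a weekly period (7).
decomposition = sm.tsa.seasonal_decompose(daily_calls, model='additive')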

# Get the value of the seasonal component on 2020-12-25
seasonal = decomposition.seasonal
seasonal_value_2020_12_25 = round(seasonal.loc['2020-12-25'])

print("The seasonal component on 2020-12-25:", seasonal_value_2020_12_25)

The seasonal component on 2020-12-25: 183
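
With a weekly period, the seasonal component is a day-of-week effect rather than a holiday effect, so the value for 2020-12-25 is simply the value shared by every day in the same position of the weekly cycle. The sketch below is a simplified by-hand version of an additive decomposition on a hypothetical synthetic series (linear trend plus a known weekly pattern); it is meant to illustrate the idea and is not a substitute for `seasonal_decompose`:

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic series: linear trend plus a weekly pattern that sums to zero.
n = 35
idx = pd.date_range("2020-12-01", periods=n, freq="D")
weekly = np.array([30.0, 10.0, 0.0, -5.0, -10.0, -15.0, -10.0])
pos = np.arange(n) % 7  # position of each day within the 7-day cycle
y = pd.Series(100.0 + np.arange(n) + weekly[pos], index=idx)

# Step 1: estimate the trend with a centered 7-day moving average.
trend = y.rolling(window=7, center=True).mean()

# Step 2: remove the trend, then average what is left by position in the cycle.
detrended = y - trend
cycle_means = detrended.groupby(pos).mean()

# Step 3: center the pattern so it sums to zero, and repeat it over the whole series.
cycle_means = cycle_means - cycle_means.mean()
seasonal = pd.Series(cycle_means.to_numpy()[pos], index=idx)

print(seasonal.loc["2020-12-25"])
```

In this toy series, 2020-12-25 falls at position 3 of the cycle, so the sketch recovers the fourth entry of `weekly`. The same reading applies to our result: 183 is the estimated weekly effect for that position in the cycle, not a Christmas-specific effect.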


## 6. Calculate the autocorrelation of the number of daily calls with the number of calls the day prior, i.e., a lag of 1. (Use the daily series.)

In [16]:
# Calculate the autocorrelation of the number of daily calls with a lag of 1 day
autocorrelation_lag_1 = daily_calls.autocorr(lag=1)

print("The autocorrelation of the number of daily calls with the number of calls the day prior:", autocorrelation_lag_1)

The autocorrelation of the number of daily calls with the number of calls the day prior: 0.7517059728398577
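
For intuition, the lag-1 autocorrelation is the Pearson correlation between the series and a copy of itself shifted by one day. The sketch below checks this on a short hypothetical series (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical short daily series.
s = pd.Series([5.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0])

# Built-in lag-1 autocorrelation, as used above.
lag1 = s.autocorr(lag=1)

# Same quantity computed as a correlation with the shifted series.
manual = s.corr(s.shift(1))

# And once more with NumPy on the overlapping values (today vs. yesterday).
by_numpy = np.corrcoef(s.iloc[1:].to_numpy(), s.iloc[:-1].to_numpy())[0, 1]

print(lag1, manual, by_numpy)
```

A value around 0.75, as in our daily series, means that a busy day tends to be followed by another busy day, which is part of why a forecasting model can be expected to do better than a constant guess.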


## 7. Forecast the daily series with a test set of 90 days using the Prophet library.  What is your RMSE on your test set?

In [17]:
# Convert data for Prophet
df_prophet = daily_calls.reset_index()
df_prophet.columns = ['ds', 'y']

# Training data except last 90 days
train = df_prophet.iloc[:-90]
# Last 90 days for testing
test = df_prophet.iloc[-90:]

# Set up and train Prophet model
model = Prophet()
model.fit(train)

# Build a frame with the training dates plus the next 90 days, then predict
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# Keep only the predictions for the 90 test days
forecasted_values = forecast[-90:]['yhat']

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test['y'], forecasted_values))
print("The RMSE on the test set:", rmse)

16:50:05 - cmdstanpy - INFO - Chain [1] start processing
16:50:07 - cmdstanpy - INFO - Chain [1] done processing


The RMSE on the test set: 1233.7823321393885
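
An RMSE is easier to judge next to a simple baseline. One common choice, which is not part of the original analysis, is a seasonal-naive forecast that repeats the last observed week of the training data. The sketch below computes its RMSE on a hypothetical synthetic series; the series, the `rmse` helper, and the variable names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic daily series with a weekly cycle and some noise.
rng = np.random.default_rng(0)
n = 300
idx = pd.date_range("2023-01-01", periods=n, freq="D")
values = 8000 + 500 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 100, n)
y = pd.Series(values, index=idx)

# Same split as above: hold out the last 90 days.
train, test = y.iloc[:-90], y.iloc[-90:]

# Seasonal-naive forecast: repeat the last 7 training days until 90 days are covered.
last_week = train.iloc[-7:].to_numpy()
naive = np.tile(last_week, 13)[:90]


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


baseline_rmse = rmse(test, naive)
print("Seasonal-naive RMSE:", baseline_rmse)
```

On the real data, the same baseline could be built from `train['y']` and compared with the Prophet RMSE of about 1234 calls per day; a model that does not beat this baseline is adding little beyond the weekly pattern.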
