# NYC 311 Calls: Data Preparation
The NYC Open Data site maintains a complete record of all requests made to the 311 service since January 1, 2010. The dataset includes over 36 million records with over 40 features, and the file is over 20GB. Reading it in its entirety taxes most computers and can take a very long time (over 20 minutes on my computer). To ease the operation of the main notebook, this notebook reads in only the columns needed for this analysis, performs all necessary data processing, and saves the results as pickle files, reducing the total size to under 1GB. Read times also improve greatly (less than 6 seconds on my computer).

In [2]:
import pandas as pd
import time

In [3]:
%%time
# Read in 311 data

df_311_calls  = pd.read_csv('Data/311_Service_Requests_20240430.csv',
                            index_col='Unique Key',
                            usecols = ['Agency','Complaint Type','Created Date','Incident Zip','Unique Key', 'Descriptor'],
                            dtype = {'Agency':'category','Complaint Type':str,'Created Date':str,'Incident Zip':str,'Unique Key':'int64', 'Descriptor':str})

df_311_calls = df_311_calls.rename(columns={'Complaint Type': 'Type', 'Created Date': 'Date', 'Incident Zip': 'Zip'})

CPU times: total: 3min 25s
Wall time: 3min 29s
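
If even the reduced column set strains memory, the file can be read in pieces. The following is a minimal sketch, not what this notebook does: it assumes a tiny made-up CSV held in a string and uses the `chunksize` parameter of `pd.read_csv`, dropping rows without a ZIP code as each chunk arrives. The sample rows and the name `df_small` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for the 311 export (all values are made up)
csv_text = """Unique Key,Created Date,Agency,Complaint Type,Descriptor,Incident Zip
1,01/01/2010 12:00:00 AM,NYPD,Noise,Loud Music,10001
2,01/02/2010 01:30:00 PM,DOT,Pothole,Street,
3,01/03/2010 09:15:00 AM,NYPD,Noise,Banging,11201
4,01/04/2010 11:45:00 PM,HPD,Heating,No Heat,10027
"""

cols = ['Unique Key', 'Created Date', 'Agency',
        'Complaint Type', 'Descriptor', 'Incident Zip']

chunks = []
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=cols,
                         dtype={'Incident Zip': str}, chunksize=2):
    # Filter each chunk immediately so unwanted rows never accumulate
    chunks.append(chunk.dropna(subset=['Incident Zip']))

df_small = pd.concat(chunks).set_index('Unique Key')
print(f'Rows kept: {len(df_small):,.0f}')
```

On the real file a chunk size in the hundreds of thousands of rows would be more reasonable; the value 2 here only serves the toy data.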


In [4]:
%%time
# Read in NYC zip codes
df_zips = pd.read_csv('Data/uszips/uszips.csv',
                      index_col = 'zip',
                      usecols=['lat','lng','population','density','zip'],
                      dtype = {'zip': str} )

df_zips = df_zips.rename(columns={'population': 'zip_population', 'density': 'zip_density'})

CPU times: total: 15.6 ms
Wall time: 17 ms


In [5]:
print(f'Initial items: {len(df_311_calls):,.0f}')

Initial items: 36,217,243


In [6]:
%%time
# Drop rows with no usable location data (missing ZIP code)
no_loc = df_311_calls['Zip'].isna()
df_311_calls = df_311_calls[~no_loc]
print(f'Rows with locations: {len(df_311_calls):,.0f}')

Rows with locations: 34,682,301
CPU times: total: 3.81 s
Wall time: 3.85 s


In [7]:
%%time
# Convert date columns to datetime, dropping time component and dropping dates after March 31, 2024
date_format = '%m/%d/%Y %I:%M:%S %p'
df_311_calls['Date'] = pd.to_datetime(df_311_calls['Date'], format=date_format).values.astype('datetime64[D]')
df_311_calls = df_311_calls[df_311_calls['Date'] < pd.Timestamp('2024-04-01')]
print(f'Rows in the date range: {len(df_311_calls):,.0f}')

Rows in the date range: 34,435,921
CPU times: total: 2min 34s
Wall time: 2min 38s
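
To make the date handling concrete, here is a small sketch on made-up timestamps. It uses the same format string as the cell above, but swaps in `.dt.normalize()` to drop the time component, which is an alternative to the `.values.astype('datetime64[D]')` call used in this notebook. The sample values are hypothetical.

```python
import pandas as pd

# Made-up timestamps in the 12-hour format used by the 311 export
raw = pd.Series(['04/30/2024 02:15:07 PM',
                 '03/31/2024 11:59:59 PM',
                 '01/01/2010 12:00:00 AM'])

# An explicit format avoids per-row format inference
parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')

# Set every timestamp to midnight, keeping only the calendar date
dates = parsed.dt.normalize()

# Keep only dates before April 1, 2024
in_range = dates[dates < pd.Timestamp('2024-04-01')]
print(in_range)
```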


In [8]:
%%time
# Rows with missing ZIP codes were already dropped above, so this fillna is only a safeguard
df_311_calls['Zip'] = df_311_calls['Zip'].fillna('empty')

# Clean up ZIP codes by removing '-####' if present
df_311_calls['Zip'] = df_311_calls['Zip'].str.replace(r'-\d{4}$', '', regex=True)

# Drop all rows with zip codes not in NYC
df_311_calls = df_311_calls[df_311_calls['Zip'].isin(df_zips.index)]
print(f'Rows in NYC: {len(df_311_calls):,.0f}')

Rows in NYC: 34,388,549
CPU times: total: 25.5 s
Wall time: 26.1 s
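
The two ZIP code steps can be checked on a handful of made-up values. This sketch assumes a hypothetical two-entry reference index, `valid_zips`, standing in for `df_zips.index`.

```python
import pandas as pd

# Made-up ZIP strings, including ZIP+4 forms and a junk value
zips = pd.Series(['10001', '10001-1234', '07030', 'N/A', '11201-0001'])

# Hypothetical stand-in for df_zips.index
valid_zips = pd.Index(['10001', '11201'])

# Strip a trailing '-####' suffix, leaving the 5-digit ZIP code
cleaned = zips.str.replace(r'-\d{4}$', '', regex=True)

# Keep only values present in the reference index
kept = cleaned[cleaned.isin(valid_zips)]
print(kept.tolist())
```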


In [9]:
%%time
# Convert all complaint types and descriptors to title case
df_311_calls['Type'] = df_311_calls['Type'].str.title()
df_311_calls['Descriptor'] = df_311_calls['Descriptor'].str.title()

CPU times: total: 22.5 s
Wall time: 22.9 s
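
The point of title-casing is to merge labels that differ only in capitalization. A minimal sketch on made-up labels follows; it also shows one quirk worth knowing, namely that `str.title` capitalizes the letter after an apostrophe.

```python
import pandas as pd

# Made-up complaint labels that differ only in capitalization
types = pd.Series(['HEATING', 'Heating', 'heating'])
titled = types.str.title()

print(types.nunique(), '->', titled.nunique())

# Quirk: the letter following an apostrophe is also capitalized
quirk = pd.Series(["can't open door"]).str.title()[0]
print(quirk)
```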


In [10]:
%%time
# Read in NYC weather data, and format
df_weather = pd.read_csv('Data/NYC_weather_data.csv')
df_weather.drop('Unnamed: 0', axis=1, inplace=True)
df_weather['date'] = pd.to_datetime(df_weather['date']).values.astype('datetime64[D]')
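# De-duplicate before indexing: drop_duplicates ignores the index, so running it
# after set_index would also drop distinct dates that share identical readings
df_weather.drop_duplicates(inplace=True)
df_weather.set_index('date', inplace=True)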
df_weather = df_weather[df_weather.index < pd.Timestamp('2024-04-01')]

CPU times: total: 406 ms
Wall time: 422 ms
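
A note on de-duplicating date-indexed data: `DataFrame.drop_duplicates` compares column values only and ignores the index, so the order of `set_index` and `drop_duplicates` matters. The sketch below uses made-up readings and hypothetical column names (`temp_max`, `precip`) to show the difference.

```python
import pandas as pd

# Made-up weather rows: one true duplicate, plus a second day
# that happens to have identical readings
weather = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02']),
    'temp_max': [40.0, 40.0, 40.0],
    'precip': [0.0, 0.0, 0.0],
})

# After set_index, only the value columns are compared,
# so January 2 is treated as a duplicate of January 1
after_index = weather.set_index('date').drop_duplicates()

# Before set_index, 'date' is part of the comparison,
# so only the repeated January 1 row is removed
before_index = weather.drop_duplicates().set_index('date')

print(len(after_index), len(before_index))
```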


In [11]:
%%time
# Store low-cardinality columns as categoricals to reduce memory use and pickle size
df_311_calls['Agency'] = df_311_calls['Agency'].astype('category')
df_311_calls['Type'] = df_311_calls['Type'].astype('category')

CPU times: total: 3.97 s
Wall time: 3.99 s
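
The saving from the `category` dtype can be estimated with `memory_usage(deep=True)`. This is a rough sketch on a made-up column of four repeated agency codes; the actual figures depend on the pandas version and its string storage, so no numbers are quoted here.

```python
import pandas as pd

# Made-up column: 100,000 rows drawn from only four distinct labels
agency = pd.Series(['NYPD', 'HPD', 'DOT', 'DSNY'] * 25_000)
as_cat = agency.astype('category')

# deep=True counts the string data itself, not just references to it
str_bytes = agency.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)

print(f'string:   {str_bytes:,} bytes')
print(f'category: {cat_bytes:,} bytes')
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it pays off when a column has few distinct values relative to its length.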


In [12]:
%%time
# Save the processed data as pickle files for the main notebook
df_311_calls.to_pickle('Data/311_Calls.pickle')
df_zips.to_pickle('Data/NYC_Zips.pickle')
df_weather.to_pickle('Data/NYC_Weather.pickle')

CPU times: total: 15.7 s
Wall time: 18 s
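
Reading the files back is a single `pd.read_pickle` call per file, and unlike CSV, a pickle preserves dtypes and the index, so none of the conversions above need repeating. Here is a minimal round-trip sketch on a made-up frame written to a temporary directory; the file name `calls.pickle` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with a categorical column, a datetime column and a named index
df = pd.DataFrame(
    {'Agency': pd.Categorical(['NYPD', 'HPD', 'NYPD']),
     'Date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])},
    index=pd.Index([1, 2, 3], name='Unique Key'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'calls.pickle')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# dtypes survive the round trip unchanged
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.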
