# 311 Processing

This is a serious hack to get some specific 311 request data sets from a **very large** [dataset](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/data).

There are different types of 311 requests, serviced by different agencies.  For starters I am processing the large data set into a selection of 311 requests by agency.  The agencies I'm including are:

  - Department of Environmental Protection (DEP)
  - Housing Preservation & Development (HPD)
  - Department of Sanitation (DSNY)
  - Department of Parks and Recreation (DPR)
  - Department of Buildings (DOB)
  - Department of Transportation (DOT)
  
The mechanics of this process are:

  1. Download the CSV file.
  2. On my Linux machine I use `split -d -l 5000000 <fname>.csv 311_split_part.` to create multiple files of 5,000,000 lines each.
  3. I read these (6) files, from the `raw` directory, into separate pandas dataframes.  Only the first file has the header row, so its column names are passed to the others.
  4. Perform various hacks to create subsetted dataframes.
  5. Use location info from the df to create a (geopandas) geodataframe.
  6. Save, in the `processed` directory, each department-specific df and gdf in parquet format.
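
A line-based `split` is simple, but if any quoted field in the CSV contains a newline it can cut a record in two. As a hedged alternative to steps 2 through 4, the sketch below streams a CSV in chunks with pandas and keeps only the agencies of interest. The function name `read_agencies`, the `chunksize` value, and the tiny sample data are assumptions for illustration, not what this notebook actually does.

```python
import io

import pandas as pd

# Agency codes from the list above.
AGENCIES = ["DEP", "HPD", "DSNY", "DPR", "DOB", "DOT"]


def read_agencies(source, agencies=AGENCIES, chunksize=500_000):
    """Stream a 311 CSV and keep only the rows for the given agencies."""
    kept = []
    reader = pd.read_csv(source, parse_dates=["Created Date"],
                         low_memory=False, chunksize=chunksize)
    for chunk in reader:
        # Filter each chunk before storing it, so memory use stays bounded.
        kept.append(chunk.loc[chunk["Agency"].isin(agencies)])
    return pd.concat(kept, ignore_index=True)


# Made-up three-row sample standing in for the real download.
sample = io.StringIO(
    "Unique Key,Created Date,Agency\n"
    "1,2020-01-01 00:00:00,DEP\n"
    "2,2020-01-02 00:00:00,NYPD\n"
    "3,2020-01-03 00:00:00,DOT\n"
)
subset = read_agencies(sample, chunksize=2)
print(subset["Agency"].tolist())
```

With the real file, `source` would be the path to the downloaded CSV.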
  
As I said, **this is a seriously hacky approach**.

In keeping with the hackiness, I am going to ignore warnings (1 - appending df's, and 2 - gpd.to_parquet).  What can I say!

**Note:** Some rows in the pandas df's have no location (a NoneType for Location), so they pass straight through without a valid geometry.  I will keep them and filter when needed.
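
The "filter when needed" step could look like the sketch below. It assumes the missing locations show up as NaN in the `Latitude` and `Longitude` columns; the helper name `with_coordinates` and the sample rows are made up for illustration.

```python
import pandas as pd


def with_coordinates(df):
    """Return only the rows that have both a Latitude and a Longitude."""
    return df.dropna(subset=["Latitude", "Longitude"])


# Made-up sample: the second request has no location.
sample = pd.DataFrame({
    "Agency": ["DEP", "DEP"],
    "Latitude": [40.7, None],
    "Longitude": [-74.0, None],
})
located = with_coordinates(sample)
print(len(sample), len(located))
```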

In [None]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

def min_max(df):
    print(df['Created Date'].max())
    print(df['Created Date'].min())

So it takes 38 minutes to read (and convert the `Created Date` column to datetimes).

In [None]:
%%time
df00 = pd.read_csv('../data/raw/311/311_split_part.00', parse_dates=['Created Date'], low_memory=False)

In [None]:
col = list(df00.columns)

In [None]:
%%time
#df00 = pd.read_csv('../data/raw/311/311_split_part.00', parse_dates=['Created Date'], low_memory=False)

df01 = pd.read_csv('../data/raw/311/311_split_part.01', names=col, parse_dates=['Created Date'], low_memory=False)

df02 = pd.read_csv('../data/raw/311/311_split_part.02', names=col, parse_dates=['Created Date'], low_memory=False)

df03 = pd.read_csv('../data/raw/311/311_split_part.03', names=col, parse_dates=['Created Date'], low_memory=False)

df04 = pd.read_csv('../data/raw/311/311_split_part.04', names=col, parse_dates=['Created Date'], low_memory=False)

df05 = pd.read_csv('../data/raw/311/311_split_part.05', names=col, parse_dates=['Created Date'], low_memory=False)
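
The reads above follow one pattern: the first part carries the header, and the remaining parts reuse its column names. A minimal sketch of that pattern as a loop is shown here; `read_parts` is a hypothetical helper, and the two temporary files are made-up stand-ins for the real split parts.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_parts(paths):
    """Read split CSV parts where only the first part has a header row."""
    first = pd.read_csv(paths[0], parse_dates=["Created Date"], low_memory=False)
    cols = list(first.columns)
    rest = [
        pd.read_csv(p, names=cols, parse_dates=["Created Date"], low_memory=False)
        for p in paths[1:]
    ]
    return [first, *rest]


# Made-up two-part example written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    part0 = Path(tmp) / "311_split_part.00"
    part1 = Path(tmp) / "311_split_part.01"
    part0.write_text("Created Date,Agency\n2020-01-01 00:00:00,DEP\n")
    part1.write_text("2020-01-02 00:00:00,DOT\n")
    frames = read_parts([part0, part1])

print([len(f) for f in frames])
```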

In [None]:
sum([len(x) for x in [df00, df01, df02, df03, df04, df05]])

In [None]:
min_max(df00)

In [None]:
min_max(df01)

In [None]:
min_max(df02)

In [None]:
min_max(df03)

In [None]:
min_max(df04)

In [None]:
min_max(df05)

In [None]:
df00.columns

In [None]:
df00['Agency'].value_counts()[:40]

In [None]:
def df_by_agency(agency):
    """
    Return the rows for one agency from all six split dataframes, combined.

    The six dataframes are hard coded.  Yikes!
    """
    df0 = df00.loc[df00['Agency'] == agency]
    df1 = df01.loc[df01['Agency'] == agency]
    df2 = df02.loc[df02['Agency'] == agency]
    df3 = df03.loc[df03['Agency'] == agency]
    df4 = df04.loc[df04['Agency'] == agency]
    df5 = df05.loc[df05['Agency'] == agency]
    
    return df0.append(df1, ignore_index=True).append(df2, ignore_index=True).append(df3, ignore_index=True).append(df4, ignore_index=True).append(df5, ignore_index=True)
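
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the function above only runs on older versions. A sketch of the same selection using `pd.concat` follows. It takes the list of dataframes as an argument instead of hard coding them; the name `rows_for_agency` and the sample frames are assumptions for illustration.

```python
import pandas as pd


def rows_for_agency(frames, agency):
    """Select one agency's rows from each frame and stack them."""
    parts = [df.loc[df["Agency"] == agency] for df in frames]
    return pd.concat(parts, ignore_index=True)


# Made-up sample frames standing in for df00 ... df05.
a = pd.DataFrame({"Unique Key": [1, 2], "Agency": ["DEP", "NYPD"]})
b = pd.DataFrame({"Unique Key": [3, 4], "Agency": ["DOT", "DEP"]})
dep_sample = rows_for_agency([a, b], "DEP")
print(dep_sample["Unique Key"].tolist())
```

In this notebook the call would be something like `rows_for_agency([df00, df01, df02, df03, df04, df05], 'DEP')`.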

# Department of Environmental Protection (DEP)

https://www1.nyc.gov/site/dep/index.page

In [None]:
dep_df = df_by_agency('DEP')

In [None]:
#dep_df.info(verbose=True, show_counts=True)

In [None]:
dep_df.to_parquet('../data/processed/311/dep-full.parq')

In [None]:
dep_311_gdf = gpd.GeoDataFrame(dep_df,
                               geometry = [Point(x, y) for x, y in zip(dep_df.Longitude, dep_df.Latitude)])

In [None]:
dep_311_gdf.to_parquet('../data/processed/311/dep-clean-geo.parq')
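
The geodataframe above is created without a coordinate reference system. The sketch below builds the points with `gpd.points_from_xy` and sets a CRS at the same time. It assumes the `Longitude` and `Latitude` columns are WGS84 decimal degrees (EPSG:4326), which is not confirmed anywhere in this notebook; `to_geodataframe` is a hypothetical helper name and the sample row is made up.

```python
import geopandas as gpd
import pandas as pd


def to_geodataframe(df):
    """Build point geometries from the Longitude/Latitude columns."""
    geometry = gpd.points_from_xy(df["Longitude"], df["Latitude"])
    # Assumption: the coordinates are WGS84 longitude/latitude.
    return gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")


# Made-up single-row sample.
sample = pd.DataFrame({
    "Agency": ["DEP"],
    "Longitude": [-73.98],
    "Latitude": [40.75],
})
gdf = to_geodataframe(sample)
print(gdf.crs)
```

Setting the CRS up front means later spatial joins or reprojections do not have to guess what the coordinates mean.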

# Housing Preservation & Development (HPD)

https://www1.nyc.gov/site/hpd/index.page 

In [None]:
hpd_df = df_by_agency('HPD')

In [None]:
#hpd_df.info(verbose=True, show_counts=True)

In [None]:
hpd_df.to_parquet('../data/processed/311/hpd-full.parq')

In [None]:
hpd_311_gdf = gpd.GeoDataFrame(hpd_df,
                               geometry = [Point(x, y) for x, y in zip(hpd_df.Longitude, hpd_df.Latitude)])

In [None]:
hpd_311_gdf.to_parquet('../data/processed/311/hpd-clean-geo.parq')

# Department of Sanitation (DSNY)

https://www1.nyc.gov/assets/dsny/site/home

In [None]:
dsny_df = df_by_agency('DSNY')

In [None]:
#dsny_df.info(verbose=True, show_counts=True)

In [None]:
dsny_df.to_parquet('../data/processed/311/dsny-full.parq')

In [None]:
dsny_311_gdf = gpd.GeoDataFrame(dsny_df,
                               geometry = [Point(x, y) for x, y in zip(dsny_df.Longitude, dsny_df.Latitude)])

In [None]:
dsny_311_gdf.to_parquet('../data/processed/311/dsny-clean-geo.parq')

# Department of Parks and Recreation (DPR)

https://www.nycgovparks.org/

In [None]:
dpr_df = df_by_agency('DPR')

In [None]:
dpr_df.to_parquet('../data/processed/311/dpr-full.parq')

In [None]:
dpr_311_gdf = gpd.GeoDataFrame(dpr_df,
                               geometry = [Point(x, y) for x, y in zip(dpr_df.Longitude, dpr_df.Latitude)])

In [None]:
dpr_311_gdf.to_parquet('../data/processed/311/dpr-clean-geo.parq')

# Department of Buildings (DOB)

https://www1.nyc.gov/site/buildings/index.page

In [None]:
dob_df = df_by_agency('DOB')

In [None]:
dob_df.to_parquet('../data/processed/311/dob-full.parq')

In [None]:
dob_311_gdf = gpd.GeoDataFrame(dob_df,
                               geometry = [Point(x, y) for x, y in zip(dob_df.Longitude, dob_df.Latitude)])

In [None]:
dob_311_gdf.to_parquet('../data/processed/311/dob-clean-geo.parq')

# Department of Transportation (DOT)

https://www1.nyc.gov/html/dot/html/home/home.shtml

In [None]:
dot_df = df_by_agency('DOT')

In [None]:
dot_df.to_parquet('../data/processed/311/dot-full.parq')

In [None]:
dot_311_gdf = gpd.GeoDataFrame(dot_df,
                               geometry = [Point(x, y) for x, y in zip(dot_df.Longitude, dot_df.Latitude)])

In [None]:
dot_311_gdf.to_parquet('../data/processed/311/dot-clean-geo.parq')
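
The six agency sections repeat the same steps with different names. A rough sketch of how the output paths and the two saves could be driven from one list is shown here. `output_paths` and `save_agency` are hypothetical helpers, and `to_gdf` stands for whatever function turns a dataframe into a geodataframe.

```python
from pathlib import Path

AGENCIES = ["DEP", "HPD", "DSNY", "DPR", "DOB", "DOT"]


def output_paths(agency, out_dir="../data/processed/311"):
    """Return the (full, geo) parquet paths used for one agency."""
    stem = agency.lower()
    out = Path(out_dir)
    return out / f"{stem}-full.parq", out / f"{stem}-clean-geo.parq"


def save_agency(agency, df, to_gdf, out_dir="../data/processed/311"):
    """Write the agency dataframe and its geodataframe to parquet."""
    full_path, geo_path = output_paths(agency, out_dir)
    df.to_parquet(full_path)
    to_gdf(df).to_parquet(geo_path)
    return full_path, geo_path


# List the files each agency would produce.
for agency in AGENCIES:
    print(agency, *output_paths(agency))
```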

# Summary

This notebook implements a very mechanical procedure for processing 311 data.

The most recent dataset I've used was pulled on April 25, 2022.

Later in my analysis I noticed that there is a department-specific (DEP) dataset.  I'm not sure whether all departments have one.