# Chicago Noise Complaints ETL

### ETL process

#### Extract
* Import PostgreSQL table into Pandas DataFrame.

#### Transform
* Drop columns not required.
* Rename columns.
* Add new columns required.
* Drop complaints dated earlier than 1/1/16 (keeping only complaints within approximately last 5 years).
* Check for missing data.
* Check for duplicates.
* Get zip code, full address, latitude, and longitude from Google Maps geocoding API.

#### Load
* After cleaning, dropping data more than 5 years old, and adding zip code, export to PostgreSQL database. This table may be used on its own to create detailed maps of Chicago noise complaints.

#### Additional Transformation
* Drop columns no longer needed.
* Group data by zip code and get count of noise complaints for each zip code.

#### Load
* After aggregating to zip code, export data to PostgreSQL database. This table will be related to 

In [10]:
# Import dependencies.

import numpy as np
import pandas as pd
import requests
import datetime as dt
from sqlalchemy import create_engine
import json

# Import credentials from file.

# Google maps geocoding API key.
from credentials import gmaps_key

# Postgres database credentials.
from credentials import pgadmin_username
from credentials import pgadmin_password

## Extract

### Import cvs file into Pandas DataFrame.

In [11]:
# Read in csv file and store as Pandas dataframe.
noise_complaints_df = pd.read_csv("../Resources/Noise_Complaints_chicago.csv")
noise_complaints_df.head(2)

Unnamed: 0,COMPLAINT ID,COMPLAINT TYPE,STREET NUMBER FROM,STREET NUMBER TO,DIRECTION,STREET NAME,STREET TYPE,INSPECTOR,COMPLAINT DATE,COMPLAINT DETAIL,INSPECTION LOG,DATA SOURCE,Modified Date
0,DOECOMP2,Noise Complaint,1,,S,STATE,ST,10,08/23/1993,STREET MUSIC HEAVY NOISE POLLUTION REFERRED TO...,MORE INFORMATION MAY BE AVAILABLE IN THE CDPH ...,HISTORIC DEPT. OF ENVIRONMENT,01/01/2012
1,DOECOMP73,Noise Complaint,10,,S,WACKER,DR,25,04/02/1993,SAXOPHONE PLAYING ON MADISON.,MORE INFORMATION MAY BE AVAILABLE IN THE CDPH ...,HISTORIC DEPT. OF ENVIRONMENT,01/01/2012


## Transform

### Drop columns not required.

In [12]:
# Drop COMPLAINT ID, COMPLAINT TYPE, STREET NUMBER TO, INSPECTOR, INSPECTION LOG, DATA SOURCE, and Modified Date columns.
(noise_complaints_df.drop(columns=['COMPLAINT ID', 'COMPLAINT TYPE', 'STREET NUMBER TO', 'INSPECTOR', 
                                   'INSPECTION LOG', 'DATA SOURCE', 'Modified Date'], inplace=True))

noise_complaints_df.head(1)

Unnamed: 0,STREET NUMBER FROM,DIRECTION,STREET NAME,STREET TYPE,COMPLAINT DATE,COMPLAINT DETAIL
0,1,S,STATE,ST,08/23/1993,STREET MUSIC HEAVY NOISE POLLUTION REFERRED TO...


### Rename a column.

In [13]:
# Rename STREET NUMBER FROM column to simply STREET NUMBER as this column will be used to represent point locations
# rather than ranges.
noise_complaints_df.rename(columns = {'STREET NUMBER FROM':'STREET NUMBER'}, inplace = True)

noise_complaints_df.head(1)

Unnamed: 0,STREET NUMBER,DIRECTION,STREET NAME,STREET TYPE,COMPLAINT DATE,COMPLAINT DETAIL
0,1,S,STATE,ST,08/23/1993,STREET MUSIC HEAVY NOISE POLLUTION REFERRED TO...


### Add new columns required.

In [14]:
# Add columns to store formatted date, zip code, full address, latitude, and longitude.

# Add FORMATTED DATE column derived from COMPLAINT DATE column. This column holds dates in Pandas DateTime format and 
# will be used to select noise complaints within recent years.
noise_complaints_df["FORMATTED DATE"] = pd.to_datetime(noise_complaints_df["COMPLAINT DATE"])

# Add ZIP CODE column. Will be populated with results of API call and will be used to join PostgresSQL tables.
noise_complaints_df["ZIP CODE"] = ""

# Add CITY column.
noise_complaints_df["CITY"] = "Chicago"

# Add STATE column.
noise_complaints_df["STATE"] = "IL"

# Add partial address column and compute contents from existing columns. 
# STREET NUMBER FROM is assumed to be representative of the point location of the complaint.
noise_complaints_df["PARTIAL ADDRESS"] = (noise_complaints_df["STREET NUMBER"].map(str)+" "+noise_complaints_df["DIRECTION"]
                                          +" "+noise_complaints_df['STREET NAME']+" "+noise_complaints_df['STREET TYPE']
                                          +", "+noise_complaints_df['CITY']+", "+noise_complaints_df['STATE'])

# Add FULL ADDRESS column. Will be populated with results from API call, including zip code.
noise_complaints_df["FULL ADDRESS"] = ""

# Add LATITUDE and LONGITUDE columns
noise_complaints_df["LATITUDE"]=""
noise_complaints_df["LONGITUDE"]=""

noise_complaints_df.head(1)

Unnamed: 0,STREET NUMBER,DIRECTION,STREET NAME,STREET TYPE,COMPLAINT DATE,COMPLAINT DETAIL,FORMATTED DATE,ZIP CODE,CITY,STATE,PARTIAL ADDRESS,FULL ADDRESS,LATITUDE,LONGITUDE
0,1,S,STATE,ST,08/23/1993,STREET MUSIC HEAVY NOISE POLLUTION REFERRED TO...,1993-08-23,,Chicago,IL,"1 S STATE ST, Chicago, IL",,,


### Reorder columns

In [15]:
# Reorder columns
noise_complaints_df = noise_complaints_df[['COMPLAINT DATE', 'COMPLAINT DETAIL', 'FORMATTED DATE', 'STREET NUMBER', 'DIRECTION', 'STREET NAME', 'STREET TYPE', 'ZIP CODE', 'PARTIAL ADDRESS', 'FULL ADDRESS',  'LATITUDE', 'LONGITUDE']]

noise_complaints_df.head(5)

Unnamed: 0,COMPLAINT DATE,COMPLAINT DETAIL,FORMATTED DATE,STREET NUMBER,DIRECTION,STREET NAME,STREET TYPE,ZIP CODE,PARTIAL ADDRESS,FULL ADDRESS,LATITUDE,LONGITUDE
0,08/23/1993,STREET MUSIC HEAVY NOISE POLLUTION REFERRED TO...,1993-08-23,1,S,STATE,ST,,"1 S STATE ST, Chicago, IL",,,
1,04/02/1993,SAXOPHONE PLAYING ON MADISON.,1993-04-02,10,S,WACKER,DR,,"10 S WACKER DR, Chicago, IL",,,
2,08/09/1995,ALLIED VALVE INDUSTRIES TEST COMPR...,1995-08-09,1019,W,GRAND,AVE,,"1019 W GRAND AVE, Chicago, IL",,,
3,01/24/2001,S&S WAREHOUSE EXHAUST FA...,2001-01-24,10300,S,COTTAGE GROVE,AVE,,"10300 S COTTAGE GROVE AVE, Chicago, IL",,,
4,05/17/2010,GENERATOR NOISE PRODUCING FUMES IN THE ALLEY. ...,2010-05-17,1035,W,LAKE,ST,,"1035 W LAKE ST, Chicago, IL",,,


### Drop complaints dated earlier than 1/1/2016, keeping only complaints within approximately the last 5 years.

In [16]:
# Format 1/1/2016 as datetime object so it can be used in date comparison step.
date = dt.datetime.strptime('1/1/2016', '%m/%d/%Y')

# Select rows in which FORMATTED DATE is >= 1/1/2016. Store results in new DataFrame.
last_5_years_df = noise_complaints_df.loc[noise_complaints_df["FORMATTED DATE"] >= date]

# Sort ascending and print to check that prior dates were successfully removed.
last_5_years_df.sort_values("FORMATTED DATE")
print(last_5_years_df)

     COMPLAINT DATE                                   COMPLAINT DETAIL  \
195      01/25/2016             VERY LOUD NOSY ROOF TOP AFTER 6:20 PM.   
196      03/10/2016  COMMERCIAL NOISE COMING FROM ROOF TOP AFTER 8 PM.   
197      05/26/2016  VENTILATION FAN MAKING LOUD HUMMING NOISE ALL ...   
198      06/16/2016  COMMERCIAL/RESIDENTIAL AIR CONDITIONER UNIT ON...   
199      06/28/2016  CHILLERS ON THE ROOF MAKING NOISE AND BOTHERIN...   
...             ...                                                ...   
9235     01/02/2020  VERY LOUD AIR CONDITIONER OR INDUSTRIAL COMMER...   
9236     07/20/2020  AIR COMPRESSOR ON REAR OF BLDG IS PERIODICALLY...   
9237     11/16/2020                                                NaN   
9238     02/04/2020  CALLER STATES THAT THE DEVELOPER CONTINUED TO ...   
9239     03/12/2020  NOISE COMING FROM KITCHEN EQUIPMENT CREATING L...   

     FORMATTED DATE  STREET NUMBER DIRECTION STREET NAME STREET TYPE ZIP CODE  \
195      2016-01-25           

### Get zip code, full address, latitude, and longitude from Google Maps geocoding API.


In [17]:
# Create a test dataframe
test_df = last_5_years_df.head(10)

In [9]:
# Test API call in a for loop over (a copy of) test_df.

for index, row in test_df.iterrows():
    
    try:
        
        # Send partial addresses through Google Maps geocoding api call to retreive additional locational information.
        test_address = row["PARTIAL ADDRESS"]
        url = ('https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}').format(test_address, gmaps_key)
        response = requests.get(url).json()
        
        # Extract results of interest from json.
        zip_code = response["results"][0]["address_components"][7]["short_name"]
        full_address = response["results"][0]["formatted_address"]
        lat = response["results"][0]["geometry"]["location"]["lat"]
        lng = response["results"][0]["geometry"]["location"]["lng"]
        
        # Store zip code in appropriate column of (original) test_df.
        test_df.at[index, "ZIP CODE"] = zip_code
        test_df.at[index, "FULL ADDRESS"] = full_address
        test_df.at[index, "LATITUDE"] = lat
        test_df.at[index, "LONGITUDE"] = lng
    
    except(): pass
    
test_df.head()

Unnamed: 0,COMPLAINT DATE,COMPLAINT DETAIL,FORMATTED DATE,STREET NUMBER,DIRECTION,STREET NAME,STREET TYPE,ZIP CODE,PARTIAL ADDRESS,FULL ADDRESS,LATITUDE,LONGITUDE
195,01/25/2016,VERY LOUD NOSY ROOF TOP AFTER 6:20 PM.,2016-01-25,1456,N,DAYTON,ST,60642,"1456 N DAYTON ST, Chicago, IL","1456 N Dayton St, Chicago, IL 60642, USA",41.9084,-87.6499
196,03/10/2016,COMMERCIAL NOISE COMING FROM ROOF TOP AFTER 8 PM.,2016-03-10,19,S,WABASH,AVE,60603,"19 S WABASH AVE, Chicago, IL","19 S Wabash Ave, Chicago, IL 60603, USA",41.8816,-87.6258
197,05/26/2016,VENTILATION FAN MAKING LOUD HUMMING NOISE ALL ...,2016-05-26,1500,N,CLYBOURN,AVE,60610,"1500 N CLYBOURN AVE, Chicago, IL","1500 N Clybourn Ave, Chicago, IL 60610, USA",41.9083,-87.6462
198,06/16/2016,COMMERCIAL/RESIDENTIAL AIR CONDITIONER UNIT ON...,2016-06-16,640,N,WABASH,AVE,60611,"640 N WABASH AVE, Chicago, IL","640 N Wabash Ave, Chicago, IL 60611, USA",41.8938,-87.6271
199,06/28/2016,CHILLERS ON THE ROOF MAKING NOISE AND BOTHERIN...,2016-06-28,1035,N,DEARBORN,ST,60610,"1035 N DEARBORN ST, Chicago, IL","1035 N Dearborn St, Chicago, IL 60610, USA",41.9017,-87.6295


### Export 'Recent Noise Complaints with Zip' to PostgreSQL database.

Drop columns no longer needed.

In [None]:
# Drop ... columns.
(noise_complaints_df.drop(columns=['', '', '', '', 
                                   '', '', ''], inplace=True))

Check for missing data.

In [18]:
# Check for NaNs

Check for duplicates.

In [None]:
noise_complaints_df.drop_duplicates("id", inplace=True)

Aggregate data to zip codes.

## Load

Export transformed Pandas DataFrame to PostgreSQL database.