# Rat Sightings in Manhattan, NYC

## 01. Data Gathering and Data Preprocessing
---
Author: _Zhan Yu_

## Table of Contents
- [Loading Libraries](#Loading-Libraries)
- [Data Gathering](#Data-Gathering)
    - [Socrata](#Socrata)
    - [Restaurant Data](#Restaurant-Data)
    - [Census Data](#Census-Data)
- [Data Preprocessing](#Data-Preprocessing)
    - [rats.csv](#rats.csv)
    - [rat_sightings.csv](#rat_sightings.csv)

## Loading Libraries

In [1]:
# Libraries: 
import pandas as pd
import numpy as np
import os
import requests
import json

# sodapy must be installed in the current environment before this import
from sodapy import Socrata

import warnings
warnings.simplefilter(action="ignore")

## Data Gathering

The [Rodent Inspection in NYC](https://data.cityofnewyork.us/Health/Rodent-Inspection/p937-wjvj) and [Rat Sightings](https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe) datasets used in this project are public datasets from data.cityofnewyork.us. The census data come from the [United States Census Bureau](https://www.census.gov/en.html).

### Socrata 

The Socrata APIs provide rich query functionality through the Socrata Query Language (SoQL).  
Install `sodapy` in the current environment (mine, for example, is `dsi`) before running:
``` Terminal
pip install sodapy
```
From [NYC Open Data](https://opendata.cityofnewyork.us/), we can load the Rodent Inspection and Rat Sightings datasets with `Socrata`. An unauthenticated client only works with public datasets, so we pass `None` in place of the application token and supply no username or password. Both datasets used here are public.

In [2]:
client = Socrata("data.cityofnewyork.us", None)

# First 1,000,000 results, returned as JSON from API / converted to Python list of dictionaries by sodapy.
results = client.get("p937-wjvj",                                # Rodent Inspection dataset
                     limit=1_000_000, where="boro_code = 1")     # Manhattan only

# Convert to pandas DataFrame
rats = pd.DataFrame.from_records(results)



In [3]:
# First 150,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.

results = client.get("3q43-55fe",                                # Rat Sightings dataset
                     limit=150_000)

# Convert to pandas DataFrame
rat_sightings = pd.DataFrame.from_records(results)

Because the raw datasets are large (about 200 MB), we clean and trim them before exporting.
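
The two downloads above rely on a single large `limit`. A minimal sketch of an alternative that pages through a dataset with the `limit`, `offset`, and `order` keywords, assuming a client with a sodapy-style `get` method. The helper `fetch_all` and the `FakeClient` stand-in are hypothetical names used only for illustration, so the example runs without network access.

```python
from typing import Any, Dict, Iterator, List


def fetch_all(client: Any, dataset_id: str, page_size: int = 50_000,
              **soql: Any) -> Iterator[Dict[str, Any]]:
    """Yield every row of a Socrata dataset, one page at a time.

    `client` is anything with a sodapy-style `get(dataset_id, **kwargs)`
    method; `soql` carries extra query clauses such as `where` or `select`.
    """
    offset = 0
    while True:
        # Ordering by ":id" keeps the paging stable between requests.
        page = client.get(dataset_id, limit=page_size, offset=offset,
                          order=":id", **soql)
        if not page:
            break
        yield from page
        if len(page) < page_size:
            break  # a short page means there is nothing left
        offset += page_size


class FakeClient:
    """Hypothetical stand-in for sodapy.Socrata, used only for this demo."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows
        self.calls = 0

    def get(self, dataset_id: str, limit: int, offset: int = 0,
            **kwargs: Any) -> List[Dict[str, Any]]:
        self.calls += 1
        return self.rows[offset:offset + limit]


demo_client = FakeClient([{"job_id": f"PO{i}", "zip_code": "10002"} for i in range(7)])
demo_rows = list(fetch_all(demo_client, "p937-wjvj", page_size=3,
                           where="boro_code = 1"))
print(len(demo_rows), demo_client.calls)  # 7 3
```

With a real client, the same helper could be called as `pd.DataFrame.from_records(fetch_all(client, "p937-wjvj", where="boro_code = 1"))`.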

### Restaurant Data

From the [United States Census Bureau](https://www.census.gov/en.html) we use the [ZIP Codes Business Patterns from 2010 to 2017](https://www.census.gov/data/developers/data-sets/cbp-nonemp-zbp/zbp-api.2010.html). No API key is needed. ([Code reference](https://towardsdatascience.com/getting-census-data-in-5-easy-steps-a08eeb63995d))

We look up restaurant counts for the Manhattan zip codes that appear in the `rats` data frame.

In [4]:
# Zip codes in Manhattan:
zip_manhattan = set(rats['zip_code'].dropna().astype(int))

# Remove placeholder zip codes (discard does not fail if a value is absent):
zip_manhattan.discard(0)
zip_manhattan.discard(10000)
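
A note on removing placeholder values from a set: `set.remove` raises `KeyError` when the value is not present, while `set.discard` silently does nothing. A minimal sketch with made-up raw values, assuming zip codes arrive as strings with occasional missing entries:

```python
# Made-up raw values shaped like the 'zip_code' column (strings and missing entries).
raw_zips = ["10030", "10027", None, "0", "10000", "10027"]

# Build the set of integer zip codes, skipping missing entries.
zip_codes = {int(z) for z in raw_zips if z is not None}

# 99999 is not in the set; discard() ignores it, remove() would raise KeyError.
for placeholder in (0, 10000, 99999):
    zip_codes.discard(placeholder)

print(sorted(zip_codes))  # [10027, 10030]
```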

We can get the 2010 to 2017 Business Patterns counts for "Food services and drinking places" from the United States Census Bureau. The 2018 Business Patterns data were not yet published when this notebook was written, so we use the 2017 counts as 2018 estimates; the number of restaurants (for example, in zip code 10002) has not been changing much.

In [5]:
# Initiate an empty data frame:
zip_res_yr = pd.DataFrame(columns=['zipcode', 'type', 'count', 'year'], data=[]) 

# From 2010 to 2011:
for year in range(2010,2012):
    
    #zipcode in Manhattan:
    for zipcode in zip_manhattan:       
        
        try:
            # "ESTAB": Number of establishments
            # "72": "Accommodation and food services" (the NAICS sector that contains 722)
            baseAPI = f"https://api.census.gov/data/{year}/zbp?get=ESTAB&for=zipcode:{zipcode}&NAICS2007=72"
            response = requests.get(baseAPI)
            formattedResponse = json.loads(response.text)[1:]
            formattedResponse = [item[::-1] for item in formattedResponse]
            
            # Store the response in a dataframe
            zip_res = pd.DataFrame(columns=['zipcode','type', 'count'], data=formattedResponse)
            zip_res['year'] = year
            zip_res_yr = pd.concat([zip_res_yr, zip_res], ignore_index=True)
            
        except:
            pass

# From 2012 to 2016:
for year in range(2012,2017):
    
    #zipcode in Manhattan:
    for zipcode in zip_manhattan:
        
        try:            
            # "ESTAB": Number of establishments
            # "722": "Food services and drinking places"
            baseAPI = f"https://api.census.gov/data/{year}/zbp?get=ESTAB&for=zipcode:{zipcode}&NAICS2012=722"
            response = requests.get(baseAPI)
            formattedResponse = json.loads(response.text)[1:]
            formattedResponse = [item[::-1] for item in formattedResponse]
            
            # Store the response in a dataframe
            zip_res = pd.DataFrame(columns=['zipcode','type', 'count'], data=formattedResponse)
            zip_res['year'] = year
            zip_res_yr = pd.concat([zip_res_yr, zip_res], ignore_index=True)
            
        except:
            pass

# Year 2017        
for zipcode in zip_manhattan:
    try:            
        # "ESTAB": Number of establishments
        # "722": "Food services and drinking places"
        baseAPI = f"https://api.census.gov/data/2017/zbp?get=ESTAB&for=zipcode:{zipcode}&NAICS2017=722"
        response = requests.get(baseAPI)
        formattedResponse = json.loads(response.text)[1:]
        formattedResponse = [item[::-1] for item in formattedResponse]
            
        # Store the response in a dataframe
        zip_res = pd.DataFrame(columns=['zipcode','type', 'count'], data=formattedResponse)
        zip_res['year'] = 2017
        zip_res_yr = pd.concat([zip_res_yr, zip_res], ignore_index=True)
            
    except:
        pass
zip_res_yr = zip_res_yr.astype(int)

# Getting 'zip_year' for future merging:
zip_res_yr['zip_year'] = zip_res_yr['zipcode'].astype(str) + ' ' + zip_res_yr['year'].astype(str)
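
The three download loops above differ only in the year range and the NAICS parameter. Below is a minimal sketch of how they could share one URL builder and one parser. The helper names `zbp_url` and `parse_rows` are hypothetical, and the sample payload is made up to show the header-plus-rows layout the loops rely on.

```python
import json
from typing import List
from urllib.parse import urlencode


def zbp_url(year: int, zipcode: int) -> str:
    """Build a ZIP Codes Business Patterns request URL for one year and zip."""
    # NAICS parameter names and codes exactly as used in the three loops above.
    if year <= 2011:
        naics = {"NAICS2007": "72"}
    elif year <= 2016:
        naics = {"NAICS2012": "722"}
    else:
        naics = {"NAICS2017": "722"}
    query = urlencode({"get": "ESTAB", "for": f"zipcode:{zipcode}", **naics},
                      safe=":")
    return f"https://api.census.gov/data/{year}/zbp?{query}"


def parse_rows(payload: str) -> List[List[str]]:
    """Drop the header row and reverse each record, as the loops above do."""
    return [row[::-1] for row in json.loads(payload)[1:]]


# Made-up payload (not real counts) in the layout the loops expect.
sample = '[["ESTAB", "NAICS2012", "zipcode"], ["123", "722", "10002"]]'
print(zbp_url(2014, 10002))
print(parse_rows(sample))
```

Each parsed row then has the order `zipcode`, `type`, `count`, matching the column names passed to `pd.DataFrame` in the loops.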

In [6]:
zip_res_year = zip_res_yr[['zip_year', 'count']]
zip_res_year.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594 entries, 0 to 593
Data columns (total 2 columns):
zip_year    594 non-null object
count       594 non-null int64
dtypes: int64(1), object(1)
memory usage: 9.4+ KB


In [7]:
# Assume 2018 has the same number of restaurants as 2017:
for zipcode in zip_manhattan:
    try:
        count = zip_res_year.loc[zip_res_year['zip_year']==str(zipcode)+' 2017']['count'].values[0]
        df = pd.DataFrame(columns = ['zip_year', 'count'], data = [[str(zipcode)+' 2018', count]])
        zip_res_year = pd.concat([zip_res_year, df], ignore_index=True)
    except:
        pass
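
The 2018 carry-forward can also be done without a loop. A minimal sketch, assuming a frame with `zipcode`, `count`, and `year` columns like `zip_res_yr`; the counts below are made up for illustration.

```python
import pandas as pd

# Made-up counts, shaped like zip_res_yr after the download loops.
demo = pd.DataFrame({
    "zipcode": [10002, 10002, 10003],
    "count": [410, 415, 520],
    "year": [2016, 2017, 2017],
})

# Copy the 2017 rows, relabel them as 2018, and append them in one step.
carried = demo.loc[demo["year"] == 2017].assign(year=2018)
extended = pd.concat([demo, carried], ignore_index=True)

# Rebuild the merge key the same way the notebook does.
extended["zip_year"] = extended["zipcode"].astype(str) + " " + extended["year"].astype(str)
print(extended[["zip_year", "count"]])
```

Zip codes with no 2017 row simply get no 2018 row, which matches what the loop does when the lookup fails.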

In [8]:
# Export the data as "zip_res_yr.csv":
zip_res_year.to_csv('../datasets/zip_res_yr.csv', index=False)

### Census Data

From the [United States Census Bureau](https://www.census.gov/en.html) we will use the "ACS 5-Year Data" from 2011 to 2018 and the 2010 "Decennial Census".  
First we need to get an API key from [the key signup page](https://api.census.gov/data/key_signup.html).

In [9]:
api_key = 'API KEY'
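
To keep the key out of the notebook itself, it can be read from an environment variable. A minimal sketch; the variable name `CENSUS_API_KEY` and the helper `load_census_key` are assumptions chosen for illustration, not something the Census API requires.

```python
import os


def load_census_key(var_name: str = "CENSUS_API_KEY") -> str:
    """Return the Census API key stored in an environment variable."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With the variable set in the shell that launches Jupyter, the cell above could become `api_key = load_census_key()`.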

In [10]:
# Initiate an empty data frame 'zip_pop_yr' which has columns 'zipcode', 'population' and 'year':
zip_pop_yr = pd.DataFrame(columns=['zipcode', 'population', 'year'], data=[]) 
# 2010 Census data:
for zipcode in zip_manhattan:
    try:
        baseAPI = f"https://api.census.gov/data/2010/dec/sf1?key={api_key}&get=H010001&for=zip%20code%20tabulation%20area:{zipcode}"
        response = requests.get(baseAPI)
        formattedResponse = json.loads(response.text)[1:]
        formattedResponse = [item[::-1] for item in formattedResponse]
        
        # Store the response in a dataframe
        zip_pop = pd.DataFrame(columns=['zipcode', 'population'], data=formattedResponse)
        zip_pop['year'] = 2010
        zip_pop_yr = pd.concat([zip_pop_yr, zip_pop], ignore_index=True)
        
    except:
        pass

In [11]:
# 2011-2018 American Community Survey 5-Year Data:
acs_table = 'B01003_001E'  # Code of "Total population" 

# From 2011 to 2018:
for year in range(2011,2019):
    
    #zipcode in Manhattan:
    for zipcode in zip_manhattan:
        
        try:
            
            baseAPI = f"https://api.census.gov/data/{year}/acs/acs5?key={api_key}&get={acs_table}&for=zip%20code%20tabulation%20area:{zipcode}"
            response = requests.get(baseAPI)
            formattedResponse = json.loads(response.text)[1:]
            formattedResponse = [item[::-1] for item in formattedResponse]
            
            # Store the response in a dataframe
            zip_pop = pd.DataFrame(columns=['zipcode', 'population'], data=formattedResponse)
            zip_pop['year'] = year
            zip_pop_yr = pd.concat([zip_pop_yr, zip_pop], ignore_index=True)
            
        except:
            pass
        
zip_pop_yr = zip_pop_yr.astype(int)
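
The `try` / `except: pass` pattern in the loops above skips failed requests silently, so it is hard to tell which zip codes and years are missing. A minimal sketch that records the failures instead; `collect` and `fake_fetch` are hypothetical names, and the population figure in the fake payload is made up. With `requests`, the `fetch_text` argument could be something like `lambda url: requests.get(url, timeout=30).text`.

```python
import json
from typing import Callable, Dict, List, Tuple


def collect(urls: Dict[Tuple[int, int], str],
            fetch_text: Callable[[str], str]):
    """Fetch each URL; return parsed rows plus the (year, zipcode) keys that failed."""
    rows: List[List[str]] = []
    failed: List[Tuple[int, int]] = []
    for key, url in urls.items():
        try:
            records = json.loads(fetch_text(url))[1:]
        except (ValueError, OSError):
            # Invalid JSON (for example an empty body) or a network error.
            failed.append(key)
            continue
        # Reverse each record as the notebook does, then append the year.
        rows.extend(record[::-1] + [str(key[0])] for record in records)
    return rows, failed


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP call; zip 10000 returns an empty body.
    if "10000" in url:
        return ""
    return '[["B01003_001E", "zip code tabulation area"], ["74993", "10002"]]'


demo_urls = {(2015, 10002): "demo://acs5/10002", (2015, 10000): "demo://acs5/10000"}
demo_rows, demo_failed = collect(demo_urls, fake_fetch)
print(demo_rows)
print(demo_failed)
```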


In [12]:
# Export the data as "zip_pop_yr.csv":
zip_pop_yr.to_csv('../datasets/zip_pop_yr.csv', index=False)

## Data Preprocessing

### rats.csv  

In [13]:
# Setting the index to 'job_id':
rats.set_index('job_id',inplace = True)
rats.head()

| job_id | inspection_type | job_ticket_or_work_order_id | job_progress | bbl | boro_code | block | lot | house_number | street_name | zip_code | x_coord | y_coord | latitude | longitude | borough | inspection_date | result | approved_date | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PO2104504 | BAIT | 523484 | 4 | 1020110053 | 1 | 2011 | 53 | 138 | WEST 143 STREET | 10030 | 1001245 | 237933 | 40.81972411373 | -73.9385975302 | Manhattan | 2020-01-07T11:30:25.000 | Monitoring visit | 2020-03-09T11:28:53.000 | {'latitude': '40.81972411373', 'longitude': '-... |
| PO2086583 | BAIT | 523490 | 9 | 1019420035 | 1 | 1942 | 35 | 2330 | ADAM C POWELL BOULEVARD | 10030 | 1000034 | 236688 | 40.816309165023 | -73.942975778265 | Manhattan | 2020-01-07T12:00:56.000 | Bait applied | 2020-03-09T11:22:28.000 | {'latitude': '40.816309165023', 'longitude': '... |
| PO2112174 | BAIT | 523481 | 4 | 1019060061 | 1 | 1906 | 61 | 2039 | ADAM C POWELL BOULEVARD | 10027 | 998148 | 233272 | 40.806936299002 | -73.949796592215 | Manhattan | 2020-01-07T09:30:41.000 | Bait applied | 2020-03-09T11:27:18.000 | {'latitude': '40.806936299002', 'longitude': '... |
| PO2128168 | BAIT | 523487 | 4 | 1019670008 | 1 | 1967 | 8 | 431 | WEST 126 STREET | 10027 | 996816 | 235437 | 40.812880674409 | -73.954604074021 | Manhattan | 2020-01-07T10:45:20.000 | Bait applied | 2020-03-09T11:31:44.000 | {'latitude': '40.812880674409', 'longitude': '... |
| PO2128166 | BAIT | 523485 | 4 | 1019670012 | 1 | 1967 | 12 | 423 | WEST 126 STREET | 10027 | 996850 | 235387 | 40.812743388588 | -73.954481339882 | Manhattan | 2020-01-07T10:30:09.000 | Bait applied | 2020-03-09T11:29:33.000 | {'latitude': '40.812743388588', 'longitude': '... |


In [14]:
# Check the shape of data frame:
rats.shape

(594546, 19)

After a first look at the dataset, we have a general idea of its features. We keep only the features we need.

In [15]:
rats_df = rats[['inspection_type',
                'zip_code','inspection_date']]
rats_df.head()

| job_id | inspection_type | zip_code | inspection_date |
|---|---|---|---|
| PO2104504 | BAIT | 10030 | 2020-01-07T11:30:25.000 |
| PO2086583 | BAIT | 10030 | 2020-01-07T12:00:56.000 |
| PO2112174 | BAIT | 10027 | 2020-01-07T09:30:41.000 |
| PO2128168 | BAIT | 10027 | 2020-01-07T10:45:20.000 |
| PO2128166 | BAIT | 10027 | 2020-01-07T10:30:09.000 |


In [16]:
rats_df.isnull().mean()

inspection_type    0.000000
zip_code           0.004043
inspection_date    0.000000
dtype: float64

Since missing values make up a very small share of the dataset (about 0.4% of `zip_code`), we can drop the rows that contain them and still have a large dataset.

In [17]:
rats_df = rats_df.dropna()
rats_df.shape

(592142, 3)

We are going to check and change the original data types:

In [18]:
# Checking the original data types:
rats_df.dtypes

inspection_type    object
zip_code           object
inspection_date    object
dtype: object

In [19]:
# Convert the date column to datetime:
rats_df['inspection_date'] = pd.to_datetime(rats_df['inspection_date'])

# Convert columns that look numeric but are stored as strings:
rats_df['zip_code'] = rats_df['zip_code'].astype(int)

# Checking data types again:
rats_df.dtypes

inspection_type            object
zip_code                    int64
inspection_date    datetime64[ns]
dtype: object

We sort the dataset in reverse chronological order (most recent first) because its time range spans more than 100 years.

In [20]:
rats_df = rats_df.sort_values(by = 'inspection_date', ascending = False)
rats_df.head()

| job_id | inspection_type | zip_code | inspection_date |
|---|---|---|---|
| PO2167036 | COMPLIANCE | 10029 | 2020-01-08 13:20:54 |
| PO2167008 | COMPLIANCE | 10029 | 2020-01-08 13:10:44 |
| PO2189724 | INITIAL | 10002 | 2020-01-08 13:10:40 |
| PO2125410 | COMPLIANCE | 10019 | 2020-01-08 13:05:13 |
| PO2167016 | COMPLIANCE | 10029 | 2020-01-08 13:00:09 |


Let's keep only the data from 2010 to 2018, which still make up the majority of the records:

In [21]:
rats_df = rats_df.loc[(rats_df['inspection_date'] >= '2010-01-01')&(rats_df['inspection_date'] < '2019-01-01')]
rats_df = rats_df.loc[(rats_df['zip_code']>10000)&(rats_df['zip_code']<20000)]
rats_df.shape

(510172, 3)
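
The date filter above uses a half-open interval: inspections on or after 2010-01-01 and strictly before 2019-01-01, so every timestamp in 2018 is kept and nothing from 2019 slips in. A minimal sketch on made-up rows that shows the boundary behaviour together with the zip code range check:

```python
import pandas as pd

# Made-up rows that sit on or near the filter boundaries.
demo = pd.DataFrame({
    "zip_code": [10002, 10029, 0, 10019],
    "inspection_date": pd.to_datetime([
        "2009-12-31T23:59:59.000",  # just before the window
        "2010-01-01T00:00:00.000",  # first instant inside the window
        "2015-06-01T10:00:00.000",  # inside the window, but zip code 0
        "2019-01-01T00:00:00.000",  # first instant after the window
    ]),
})

in_range = (demo["inspection_date"] >= "2010-01-01") & (demo["inspection_date"] < "2019-01-01")
valid_zip = (demo["zip_code"] > 10000) & (demo["zip_code"] < 20000)
kept = demo.loc[in_range & valid_zip]
print(kept)
```

Only the second row survives: the first and last fall outside the date window, and the third has a placeholder zip code.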

In [22]:
# Export the cleaned data:
rats_df.to_csv('../datasets/rats.csv', index=False)

### rat_sightings.csv

In [23]:
rat_sightings.head(2)

|  | descriptor | incident_zip | x_coordinate_state_plane_ | created_date | location | city | :@computed_region_sbqj_enih | cross_street_2 | :@computed_region_efsh_h5xi | park_facility_name | ... | longitude | :@computed_region_f5dn_yrer | status | unique_key | intersection_street_2 | closed_date | resolution_action_updated_date | address_type | due_date | facility_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rat Sighting | 10002 | 990296 | 2020-03-10T00:47:13.000 | {'latitude': '40.71757450675541', 'human_addre... | NEW YORK | 4 | EAST HOUSTON STREET | 11723 | Unspecified | ... | -73.97818964140325 | 70 | In Progress | 45788764 | EAST HOUSTON STREET |  |  |  |  |  |
| 1 | Rat Sighting | 10002 | 990272 | 2020-03-10T00:43:44.000 | {'latitude': '40.717423561327045', 'human_addr... | NEW YORK | 4 | EAST HOUSTON STREET | 11723 | Unspecified | ... | -73.97827626827603 | 70 | In Progress | 45789884 | EAST HOUSTON STREET |  |  |  |  |  |


After a first look at the dataset, we have a general idea of its features. We keep only the features we need.

In [24]:
sightings = rat_sightings[['incident_zip', 'created_date', 'borough']]
sightings.head()

|  | incident_zip | created_date | borough |
|---|---|---|---|
| 0 | 10002 | 2020-03-10T00:47:13.000 | MANHATTAN |
| 1 | 10002 | 2020-03-10T00:43:44.000 | MANHATTAN |
| 2 | 11213 | 2020-03-10T00:35:10.000 | BROOKLYN |
| 3 | 10024 | 2020-03-09T20:32:15.000 | MANHATTAN |
| 4 | 10025 | 2020-03-09T20:21:23.000 | MANHATTAN |


Let's only focus on Manhattan, and drop missing data.

In [25]:
sightings = sightings.loc[sightings['borough']=='MANHATTAN'].drop(columns = 'borough')
sightings = sightings.dropna()
sightings = sightings.drop(sightings.loc[sightings['incident_zip']=='N/A'].index)
sightings['incident_zip'] = sightings['incident_zip'].astype(int)
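
The cell above drops the literal string `'N/A'` before casting to `int`. A minimal sketch of a more general alternative, assuming other non-numeric placeholders might also appear: `pd.to_numeric` with `errors="coerce"` turns anything unparseable into `NaN`, which is then dropped. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows with the kinds of zip values that need cleaning.
demo = pd.DataFrame({
    "incident_zip": ["10002", "N/A", None, "10025"],
    "created_date": ["2020-03-10T00:47:13.000"] * 4,
})

# Unparseable values ('N/A') and missing values both become NaN.
demo["incident_zip"] = pd.to_numeric(demo["incident_zip"], errors="coerce")

# Drop rows without a usable zip code, then cast to integer.
demo = demo.dropna(subset=["incident_zip"]).astype({"incident_zip": int})
print(demo)
```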

In [26]:
# Clean the zip codes:
sightings = sightings.loc[(sightings['incident_zip']>10000)&(sightings['incident_zip']<20000)]
sightings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37146 entries, 0 to 143224
Data columns (total 2 columns):
incident_zip    37146 non-null int64
created_date    37146 non-null object
dtypes: int64(1), object(1)
memory usage: 870.6+ KB


In [28]:
# Export the cleaned data:
sightings.to_csv('../datasets/rats_sightings.csv', index=False)

So far, we have four datasets, which will be used in the next notebook for EDA.