# Data Gathering and Data Cleaning
#### By: _Zhan Yu_

## Table of Contents
- [Loading Libraries & Data](#Loading-Libraries-&-Data)

## Loading Libraries & Data

In this project, we use two public datasets from data.cityofnewyork.us: [Rodent Inspection in NYC](https://data.cityofnewyork.us/Health/Rodent-Inspection/p937-wjvj) and [Restaurant-Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j).

In [41]:
# Libraries: 
import pandas as pd
import numpy as np
import os

from sodapy import Socrata

import warnings
warnings.simplefilter(action="ignore")

### SoQL (Socrata Query Language)  

The Socrata APIs provide rich query functionality through the “Socrata Query Language”, or “SoQL”.   
Install `sodapy` in the current environment (for example, mine is `dsi`) before running:
``` Terminal
pip install sodapy
```
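
As a minimal sketch of what SoQL allows, the column selection and filtering done later in pandas could instead be pushed into the query itself, so that less data is downloaded. The helper name `build_rodent_query` is hypothetical, and the date cutoff is an assumption for illustration; the column names are those of the Rodent Inspection dataset, and `select`, `where`, `order`, and `limit` are keyword arguments that `sodapy`'s `Socrata.get` passes on as SoQL clauses.

```python
def build_rodent_query(boro_code=1, since="2010-01-01", limit=1_000_000):
    """Return SoQL keyword arguments for sodapy's Socrata.get (hypothetical helper)."""
    columns = ["job_id", "inspection_type", "job_progress", "bbl",
               "latitude", "longitude", "inspection_date", "result",
               "approved_date"]
    return {
        "select": ", ".join(columns),  # only the columns we plan to keep
        "where": f"boro_code = {int(boro_code)} AND inspection_date > '{since}T00:00:00'",
        "order": "inspection_date DESC",  # newest inspections first
        "limit": limit,
    }


params = build_rodent_query()
print(params["where"])

# With a Socrata client created as in the next cell, the request would be:
# results = client.get("p937-wjvj", **params)
```

Whether this is worth doing depends on how much of the raw data is needed later; the cells that follow download the full records and trim them in pandas.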

In [2]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# In this case they are public datasets.

# First 1,000,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("p937-wjvj",                                # Rodent Inspection dataset
                     limit=1_000_000, where="boro_code = 1")

# Convert to pandas DataFrame
rats = pd.DataFrame.from_records(results)



In [3]:
# First 500,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("43nn-pn8j",                                # NYC Restaurant Inspection dataset
                     limit=500_000)

# Convert to pandas DataFrame
restaurants = pd.DataFrame.from_records(results)

Because the two datasets are too big (both about 200 MB), we are going to do some data cleaning and trimming before we export them.

## Data Cleaning

### rats.csv  

In [4]:
# Setting the index to 'job_id':
rats.set_index('job_id',inplace = True)
rats.head()

job_id,inspection_type,job_ticket_or_work_order_id,job_progress,bbl,boro_code,block,lot,house_number,street_name,zip_code,x_coord,y_coord,latitude,longitude,borough,inspection_date,result,approved_date,location
PO2075886,INITIAL,2075886,1,1022430001,1,2243,1,4977,BROADWAY,10034,1006543,255608,40.868203550611,-73.919469362846,Manhattan,2019-08-02T11:11:55.000,Active Rat Signs,2019-08-07T11:34:09.000,"{'latitude': '40.868203550611', 'longitude': '..."
PO333717,BAIT,70581,5,1009620100,1,962,100,462,1 AVENUE,10016,991147,208600,40.739489790595,-73.976623458224,Manhattan,2011-10-26T09:45:58.000,Bait applied,2011-10-28T07:14:07.000,"{'latitude': '40.739489790595', 'longitude': '..."
PO431003,BAIT,92841,1,1022290001,1,2229,1,4986,BROADWAY,10034,1006611,255630,40.868266491555,-73.919201730342,Manhattan,2012-03-19T14:00:03.000,Bait applied,2012-03-20T07:20:09.000,"{'latitude': '40.868266491555', 'longitude': '..."
PO333717,BAIT,69386,4,1009620100,1,962,100,462,1 AVENUE,10016,991147,208600,40.739489790595,-73.976623458224,Manhattan,2011-10-17T13:50:37.000,Bait applied,2011-10-18T07:09:09.000,"{'latitude': '40.739489790595', 'longitude': '..."
PO422674,BAIT,91531,1,1022290001,1,2229,1,4986,BROADWAY,10034,1006611,255630,40.868266491555,-73.919201730342,Manhattan,2012-03-08T13:51:50.000,Bait applied,2012-03-09T12:23:58.000,"{'latitude': '40.868266491555', 'longitude': '..."


In [5]:
# Check the shape of data frame:
rats.shape

(594536, 19)

After a first look at the dataset, we have a general idea of its features. We are going to keep only the features we need.

In [6]:
rats_df = rats[['inspection_type',
                'job_progress', 'bbl','latitude',
                'longitude','inspection_date', 'result', 'approved_date']]
rats_df.head()

job_id,inspection_type,job_progress,bbl,latitude,longitude,inspection_date,result,approved_date
PO2075886,INITIAL,1,1022430001,40.868203550611,-73.919469362846,2019-08-02T11:11:55.000,Active Rat Signs,2019-08-07T11:34:09.000
PO333717,BAIT,5,1009620100,40.739489790595,-73.976623458224,2011-10-26T09:45:58.000,Bait applied,2011-10-28T07:14:07.000
PO431003,BAIT,1,1022290001,40.868266491555,-73.919201730342,2012-03-19T14:00:03.000,Bait applied,2012-03-20T07:20:09.000
PO333717,BAIT,4,1009620100,40.739489790595,-73.976623458224,2011-10-17T13:50:37.000,Bait applied,2011-10-18T07:09:09.000
PO422674,BAIT,1,1022290001,40.868266491555,-73.919201730342,2012-03-08T13:51:50.000,Bait applied,2012-03-09T12:23:58.000


In [7]:
rats_df.isnull().mean()

inspection_type    0.000000
job_progress       0.000000
bbl                0.000000
latitude           0.000891
longitude          0.000891
inspection_date    0.000000
result             0.000002
approved_date      0.000000
dtype: float64

Since missing values make up a very small share of the dataset (under 0.1% in every column), we can drop all rows with missing values and still have a relatively large dataset.
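
The reasoning above can be sketched on a toy frame. The values below are made up for illustration and only stand in for `rats_df`; the point is that the per-column fractions from `isnull().mean()` do not directly tell us how many rows `dropna()` removes, so it can help to count the affected rows as well.

```python
import pandas as pd

# Toy data standing in for rats_df (values are made up for illustration).
toy = pd.DataFrame({
    "latitude": ["40.86", None, "40.73", "40.72"],
    "longitude": ["-73.91", None, "-73.97", "-73.99"],
    "result": ["Bait applied", "Bait applied", None, "Passed Inspection"],
})

missing_share = toy.isnull().mean()        # fraction of missing values per column
rows_with_gaps = toy.isnull().any(axis=1)  # True for rows that dropna() would remove
cleaned = toy.dropna()                     # keeps only fully populated rows

print(missing_share)
print(f"rows dropped: {rows_with_gaps.sum()} of {len(toy)}")
```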

In [8]:
rats_df = rats_df.dropna()  # reassign instead of inplace=True, which raised a SettingWithCopyWarning on this slice
rats_df.shape

(594005, 8)

In [9]:
rats_df.isnull().sum().sum()

0

We are going to check and change the original data types:

In [11]:
# Checking the data types (this cell ran after the date conversion in In [10], so the date columns already show datetime64):
rats_df.dtypes

inspection_type            object
job_progress               object
bbl                        object
latitude                   object
longitude                  object
inspection_date    datetime64[ns]
result                     object
approved_date      datetime64[ns]
dtype: object

In [10]:
# Changing date columns into datetime form (assign returns a new DataFrame, so no SettingWithCopyWarning is raised):
rats_df = rats_df.assign(
    inspection_date=pd.to_datetime(rats_df['inspection_date']),
    approved_date=pd.to_datetime(rats_df['approved_date']),
)


In [None]:
# Converting columns that look numeric but are stored as strings:
rats_df = rats_df.astype({'job_progress': int, 'bbl': int,
                          'latitude': float, 'longitude': float})

In [67]:
# Checking data types again:
rats_df.dtypes

inspection_type            object
job_progress                int64
bbl                         int64
latitude                  float64
longitude                 float64
inspection_date    datetime64[ns]
result                     object
approved_date      datetime64[ns]
dtype: object

We sort the dataset by inspection date, newest first, because its dates span more than 100 years.

In [98]:
set(rats_df['inspection_type'])

{'BAIT', 'CLEAN_UPS', 'COMPLIANCE', 'INITIAL'}

In [99]:
set(rats_df['result'])

{'Active Rat Signs',
 'Bait applied',
 'Cleanup done',
 'Monitoring visit',
 'Passed Inspection',
 'Problem Conditions'}

In [68]:
rats_df = rats_df.sort_values(by = 'inspection_date', ascending = False)
rats_df.head()

job_id,inspection_type,job_progress,bbl,latitude,longitude,inspection_date,result,approved_date
PO2167036,COMPLIANCE,2,1016390133,40.795054,-73.943041,2020-01-08 13:20:54,Active Rat Signs,2020-01-09 08:16:29
PO2167008,COMPLIANCE,2,1016390021,40.795249,-73.9435,2020-01-08 13:10:44,Active Rat Signs,2020-01-09 08:16:29
PO2189724,INITIAL,1,1004160001,40.720992,-73.990498,2020-01-08 13:10:40,Passed Inspection,2020-01-09 09:39:57
PO2125410,COMPLIANCE,2,1010100046,40.765455,-73.977827,2020-01-08 13:05:13,Passed Inspection,2020-01-09 08:54:36
PO2167016,COMPLIANCE,2,1016390035,40.794941,-73.942221,2020-01-08 13:00:09,Passed Inspection,2020-01-09 08:16:29


Let's keep only the last 10 years of data, which still make up the majority of the records:

In [69]:
rats_df = rats_df.loc[rats_df['inspection_date'] > '2010-01-01']
rats_df.shape

(591855, 8)
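
Going by the shapes printed above, 591,855 of the 594,005 cleaned rows remain after the cutoff, which is about 99.6%. A minimal sketch of how that share could be checked directly is shown below; the dates are made up for illustration, and the 1918 entry is only an assumed example of the kind of outlier a cutoff removes.

```python
import pandas as pd

# Toy data standing in for rats_df['inspection_date'] (made-up values).
toy = pd.DataFrame({
    "inspection_date": pd.to_datetime([
        "1918-04-20 10:00:00",  # implausibly old record
        "2011-10-17 13:50:37",
        "2012-03-08 13:51:50",
        "2019-08-02 11:11:55",
    ])
})

cutoff = pd.Timestamp("2010-01-01")
recent = toy.loc[toy["inspection_date"] > cutoff]
share_kept = len(recent) / len(toy)

# The min and max show the full time range before trimming.
print(toy["inspection_date"].min(), toy["inspection_date"].max())
print(f"kept {share_kept:.0%} of rows")
```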

#### Export the cleaned data:

In [97]:
os.makedirs('../datasets', exist_ok=True)  # create the output folder first; without it to_csv raises FileNotFoundError
rats_df.to_csv('../datasets/rats.csv', index=False)

In [93]:
rats_df['bbl'].value_counts().head(10)

1003020001    428
1001650001    408
1004200001    366
1003890036    348
1001220001    327
1015240046    223
1020820047    216
1003760007    214
1003740041    210
1001660027    200
Name: bbl, dtype: int64

In [96]:
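# Unique coordinates recorded for each of the 10 most-inspected lots (bbl).
# Built in one step to avoid chained assignment (top10.loc[i]['latitude'] = ...), which may not write back:
top_bbls = rats_df['bbl'].value_counts().head(10).index
top10 = pd.DataFrame({
    'latitude': [np.unique(rats_df.loc[rats_df['bbl'] == b, 'latitude']) for b in top_bbls],
    'longitude': [np.unique(rats_df.loc[rats_df['bbl'] == b, 'longitude']) for b in top_bbls],
}, index=top_bbls)
top10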

Unnamed: 0,latitude,longitude
1003020001,[40.718008702954],[-73.99342724941]
1001650001,[40.715450766367],[-73.999538265827]
1004200001,"[0.0, 40.718274963396, 40.719784495498, 40.721...","[-73.993748286251, -73.992269055289, -73.99158..."
1003890036,"[0.0, 40.723545956581, 40.723680527621, 40.723...","[-73.979692000943, -73.979259113526, -73.97910..."
1001220001,[40.713806515614],[-74.005569530537]
1015240046,[40.785967020536],[-73.951206362211]
1020820047,[40.829347984541],[-73.945645775795]
1003760007,"[40.7239438899, 40.72402347752]","[-73.978941506339, -73.978883756867]"
1003740041,"[40.721292227281, 40.72146797746]","[-73.978231589502, -73.977787910393]"
1001660027,"[0.0, 40.715439772645]","[-74.001890223957, 0.0]"
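
Several of the lots above list `0.0` among their coordinates. A minimal sketch of one way to set those rows aside before mapping is shown below, under the assumption (not confirmed by the data source) that a coordinate of exactly `0.0` marks a failed geocode and not a real location. The toy frame reuses a few values from the table above.

```python
import pandas as pd

# Toy data standing in for rats_df; coordinates copied from the table above.
toy = pd.DataFrame({
    "bbl": [1001660027, 1001660027, 1003020001],
    "latitude": [0.0, 40.715439772645, 40.718008702954],
    "longitude": [0.0, -74.001890223957, -73.99342724941],
})

# Assumption: exactly 0.0 is a placeholder, since no point in NYC has that latitude or longitude.
has_coords = (toy["latitude"] != 0.0) & (toy["longitude"] != 0.0)
located = toy.loc[has_coords]

# Number of distinct coordinates left per lot.
print(located.groupby("bbl")[["latitude", "longitude"]].nunique())
```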


In [19]:
grand_st = rats_df.loc[rats_df['bbl']== 1003020001].drop(columns = ['bbl'])
grand_st.shape

(428, 7)

In [100]:
# Export data: 
os.makedirs('../datasets', exist_ok=True)  # create the output folder if it is missing
grand_st.to_csv('../datasets/grand_st.csv', index=False)

In [20]:
columbus_pk = rats_df.loc[rats_df['bbl']== 1001650001].drop(columns = ['bbl'])
columbus_pk.shape

(408, 7)

In [None]:
# Export data: 
columbus_pk.to_csv('../datasets/columbus_pk.csv', index=False)

In [21]:
rats_df.loc[rats_df['bbl']==1004200001].drop(columns = ['bbl'])

job_id,inspection_type,job_progress,latitude,longitude,inspection_date,result,approved_date
PO2157964,INITIAL,1,40.7217132138775,-73.9915576363285,2019-11-18 11:40:36,Active Rat Signs,2019-11-19 13:39:33
PO2133146,INITIAL,1,40.7217132138775,-73.9915576363285,2019-10-23 14:40:25,Active Rat Signs,2019-10-24 12:11:05
PO2117477,INITIAL,1,40.7217080491804,-73.9915500932201,2019-10-02 15:14:56,Active Rat Signs,2019-10-03 14:32:11
PO2111709,INITIAL,1,40.7217132138775,-73.9915576363285,2019-09-18 11:35:18,Active Rat Signs,2019-09-19 11:09:56
PO2095913,INITIAL,1,40.7217132138775,-73.9915576363285,2019-08-26 10:26:46,Active Rat Signs,2019-08-27 13:01:29
...,...,...,...,...,...,...,...
PO378262,BAIT,1,40.7216308734523,-73.9915865091172,2011-11-20 16:45:02,Bait applied,2011-11-21 10:46:02
PO378118,BAIT,1,40.7216308734523,-73.9915865091172,2011-11-20 14:00:00,Bait applied,2011-11-21 08:41:35
PO368394,BAIT,1,40.7216308734523,-73.9915865091172,2011-10-23 14:00:00,Monitoring visit,2011-10-28 07:17:10
PO368202,BAIT,1,40.7216308734523,-73.9915865091172,2011-10-23 14:00:00,Monitoring visit,2011-10-26 10:13:41


### restaurants.csv

In [5]:
restaurants.shape

(400874, 26)

In [6]:
restaurants = restaurants.loc[restaurants['boro'] == 'Manhattan']

In [7]:
restaurants.isnull().mean()

camis                    0.000000
dba                      0.000811
boro                     0.000000
building                 0.000665
street                   0.000000
zipcode                  0.016704
phone                    0.000000
cuisine_description      0.000000
inspection_date          0.000000
action                   0.003598
violation_code           0.013454
violation_description    0.020105
critical_flag            0.020105
score                    0.042313
grade                    0.497169
grade_date               0.504136
record_date              0.000000
inspection_type          0.003598
latitude                 0.001051
longitude                0.001051
community_board          0.017755
council_district         0.017679
census_tract             0.017679
bin                      0.020764
bbl                      0.001051
nta                      0.017755
dtype: float64

In [8]:
restaurants.head(2)

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,...,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta
1,40400049,PANYA/AUTRE KYOYA,Manhattan,810,STUYVESANT STREET,,2125980454,Japanese,2017-01-03T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Re-inspection,0.0,0.0,,,,,1,
3,41279957,LELABAR,Manhattan,422,HUDSON STREET,10014.0,2122060594,American,2017-05-05T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.730478164589,-74.006801296496,102.0,3.0,6700.0,1009774.0,1005830002,MN23


In [22]:
restaurants['bbl'].value_counts()

1             3112
1012800001     520
1007130001     432
1007810002     391
1000160125     255
              ... 
1003900036       1
1001750029       1
1004680039       1
1006620003       1
1004360042       1
Name: bbl, Length: 7136, dtype: int64
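
The most frequent value here is `1`, which is not a usable lot identifier: a BBL is a 10-digit code (1 digit for the borough, 5 for the block, 4 for the lot), and the `head(2)` output above shows a `bbl` of `1` alongside coordinates of `0.0`. A minimal sketch of filtering such rows and matching the integer dtype used for `bbl` in `rats_df` is shown below. The toy frame reuses names and BBLs from the outputs in this notebook; treating every non-10-character value as a placeholder is an assumption.

```python
import pandas as pd

# Toy data standing in for restaurants (bbl arrives as a string from the API).
toy = pd.DataFrame({
    "dba": ["PANYA/AUTRE KYOYA", "LELABAR", "DOUGHNUT PLANT"],
    "bbl": ["1", "1005830002", "1012800001"],
})

# Assumption: anything that is not 10 characters long is a placeholder.
# Missing values give NaN for str.len(), so they are dropped by this filter too.
valid = toy.loc[toy["bbl"].str.len() == 10].copy()
valid["bbl"] = valid["bbl"].astype("int64")  # same dtype as bbl in rats_df

print(valid)
```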

In [25]:
restaurants.loc[restaurants['bbl']=='1012800001']

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,...,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta
2093,50079524,DOUGHNUT PLANT,Manhattan,89,E 42ND ST,10017,2125053700,Donuts,2020-02-12T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752093933162,-73.977604354379,105,04,009200,1035381,1012800001,MN19
2183,50046726,CHIRPING CHICKEN,Manhattan,72,GRAND CENTRAL TERMINAL,10017,2126614059,Chicken,2018-10-12T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Re-inspection,40.752486367873,-73.977268555955,105,04,009200,1035381,1012800001,MN19
3455,40928958,HALE & HEARTY SOUP,Manhattan,55,GRAND CENTRAL TERMINAL,10017,2129832845,Soups & Sandwiches,2018-05-11T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752486367873,-73.977268555955,105,04,009200,1035381,1012800001,MN19
3677,40876078,CIPRIANI DOLCI,Manhattan,0,GRAND CENTRAL TERMINAL,10017,2129730999,Italian,2017-12-27T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752513822619,-73.977304639681,105,04,009200,1035381,1012800001,MN19
4440,50033575,JACQUES TORRES ICE CREAM,Manhattan,89,E 42 ST,10017,2129837353,"Ice Cream, Gelato, Yogurt, Ices",2018-12-21T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Re-inspection,40.752093933162,-73.977604354379,105,04,009200,1035381,1012800001,MN19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
396904,50046726,CHIRPING CHICKEN,Manhattan,72,GRAND CENTRAL TERMINAL,10017,2126614059,Chicken,2019-08-08T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752486367873,-73.977268555955,105,04,009200,1035381,1012800001,MN19
397171,40423705,ZARO'S BAKERY,Manhattan,89,EAST 42 STREET,10017,2123767619,American,2019-05-30T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752093933162,-73.977604354379,105,04,009200,1035381,1012800001,MN19
397292,50047012,GREAT NORTHERN FOOD HALL,Manhattan,89,E 42ND ST,10017,6462799871,Café/Coffee/Tea,2019-03-22T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752093933162,-73.977604354379,105,04,009200,1035381,1012800001,MN19
397613,41470208,FINANCIER PATISSERIE,Manhattan,15,VANDERBILT AVENUE,10017,2129731010,Café/Coffee/Tea,2017-06-06T00:00:00.000,Violations were cited in the following area(s).,...,2020-02-25T06:00:52.000,Cycle Inspection / Initial Inspection,40.752785703357,-73.978073333536,105,04,009200,1035381,1012800001,MN19


In [9]:
set(restaurants['grade'])

{'A', 'B', 'C', 'G', 'N', 'P', 'Z', nan}

In [16]:
# Grade-A inspections with no recorded violations (both conditions combined in one mask):
grad_a = restaurants.loc[(restaurants['grade'] == 'A') &
                         (restaurants['action'] == 'No violations were recorded at the time of this inspection.')]

In [17]:
grad_a.shape

(283, 26)

In [20]:
grad_a.isnull().sum()

camis                      0
dba                        0
boro                       0
building                   1
street                     0
zipcode                   15
phone                      0
cuisine_description        0
inspection_date            0
action                     0
violation_code           283
violation_description    283
critical_flag            283
score                      0
grade                      0
grade_date                 0
record_date                0
inspection_type            0
latitude                   1
longitude                  1
community_board           16
council_district          16
census_tract              16
bin                       16
bbl                        1
nta                       16
dtype: int64

In [9]:
set(rats['result'])

{'Active Rat Signs',
 'Bait applied',
 'Cleanup done',
 'Monitoring visit',
 'Passed Inspection',
 'Problem Conditions',
 nan}

In [15]:
set(rats['inspection_type'])

{'BAIT', 'CLEAN_UPS', 'COMPLIANCE', 'INITIAL'}

In [14]:
set(restaurants['action'])

{'Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.',
 'Establishment re-closed by DOHMH',
 'Establishment re-opened by DOHMH',
 'No violations were recorded at the time of this inspection.',
 'Violations were cited in the following area(s).',
 nan}

In [21]:
set(restaurants['violation_description'])

 '""Wash hands” sign not posted at hand wash facility.',
 'A food containing artificial trans fat, with 0.5 grams or more of trans fat per serving, is being stored, distributed, held for service, used in preparation of a menu item, or served.',
 'Accurate thermometer not provided in refrigerated or hot holding equipment.',
 'Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding.',
 'Ashtray present in smoke-free area.',
 'Bulb not shielded or shatterproof, in areas where there is extreme heat, temperature changes, or where accidental contact may occur.',
 'Caloric content not posted on menus, menu boards or food tags, in a food service establishment that is 1 of 15 or more outlets operating the same type of business nationally under common ownership or control, or as a franchise or doing business under the same name, for each menu item that is served in

In [None]:
# Keywords of interest: mice, rats, vermin
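
One possible use of the three keywords noted above is to flag restaurant inspections whose `violation_description` mentions rodents. The sketch below assumes that intent, which the notebook does not state. Restaurant names and the first and last descriptions are made up for illustration; the thermometer description is taken from the output above.

```python
import pandas as pd

# Toy data standing in for restaurants (names and two descriptions are made up).
toy = pd.DataFrame({
    "dba": ["CAFE ONE", "CAFE TWO", "CAFE THREE", "CAFE FOUR"],
    "violation_description": [
        "Evidence of mice or live mice present in facility's food and/or non-food areas.",
        "Accurate thermometer not provided in refrigerated or hot holding equipment.",
        None,
        "Facility not vermin proof.",
    ],
})

keywords = ["mice", "rats", "vermin"]
pattern = "|".join(keywords)  # regular expression: mice|rats|vermin

# case=False ignores capitalization; na=False treats missing descriptions as no match.
mentions_rodents = toy["violation_description"].str.contains(pattern, case=False, na=False)
rodent_violations = toy.loc[mentions_rodents]

print(rodent_violations["dba"].tolist())
```

Note that this is plain substring matching, so a keyword also matches inside longer words; word boundaries (`\b`) in the pattern would make it stricter.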