# EDA
#### By: _Zhan Yu_

## Table of Contents
- [Loading Libraries & Data](#Loading-Libraries-&-Data)

## Loading Libraries & Data

This project uses two public datasets from data.cityofnewyork.us: [Rodent Inspection in NYC](https://data.cityofnewyork.us/Health/Rodent-Inspection/p937-wjvj) and [Restaurant-Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j).

In [1]:
# Libraries: 
import pandas as pd

import os
from sodapy import Socrata

### SoQL (Socrata Query Language)

The Socrata APIs provide rich query functionality through the Socrata Query Language (SoQL).
Install `sodapy` in the current environment (for example, mine is `dsi`) before running:
``` Terminal
pip install sodapy
```
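As a rough illustration of what a SoQL query looks like on the wire, the sketch below builds a SODA endpoint URL by hand. SoQL clauses travel as `$`-prefixed URL parameters such as `$where` and `$limit`; the helper name `build_soql_url` is hypothetical and is not part of sodapy, which assembles these parameters itself from keyword arguments like `where=` and `limit=`.

```python
from urllib.parse import urlencode


def build_soql_url(domain, dataset_id, where=None, select=None, limit=None):
    """Build a SODA endpoint URL with SoQL parameters (hypothetical helper)."""
    params = {}
    if select:
        params["$select"] = select
    if where:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = limit
    base = f"https://{domain}/resource/{dataset_id}.json"
    query = urlencode(params)  # percent-encodes "$", "=" and spaces
    return f"{base}?{query}" if query else base


# Same filter as the rodent query in this notebook, but only 5 rows.
url = build_soql_url("data.cityofnewyork.us", "p937-wjvj",
                     where="boro_code = 1", limit=5)
print(url)
```

Pasting the printed URL into a browser is a quick way to preview a filter before pulling a large result set.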

In [3]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# In this case they are public datasets.

# Up to 1,000,000 rows, returned as JSON from the API and converted to a
# Python list of dictionaries by sodapy.
results = client.get("p937-wjvj",                                # Rodent Inspection dataset
                     limit=1_000_000, where="boro_code = 1")     # boro_code 1 = Manhattan

# Convert to pandas DataFrame
rats = pd.DataFrame.from_records(results)



In [None]:
# First 500,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("43nn-pn8j",                                # NYC Restaurant Inspection dataset
                     limit=500_000)

# Convert to pandas DataFrame
restaurants = pd.DataFrame.from_records(results)
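A single request with a very large `limit` can be slow or time out, so one alternative is to page through the dataset with `limit` and `offset`. The sketch below is only an outline under assumptions: `fetch_all_pages` and `fake_fetch` are hypothetical names, and the stand-in fetcher replaces the real API so the example runs offline. With sodapy, the fetcher could be something like `lambda limit, offset: client.get("43nn-pn8j", limit=limit, offset=offset, order=":id")`, where the `order` clause is an assumed choice of stable sort key.

```python
def fetch_all_pages(fetch_page, page_size=50_000, max_rows=None):
    """Collect records page by page until a short page signals the end.

    fetch_page is any callable taking (limit, offset) keyword arguments
    and returning a list of dictionaries.
    """
    records = []
    offset = 0
    while max_rows is None or len(records) < max_rows:
        # Never ask for more rows than the caller still wants.
        limit = page_size if max_rows is None else min(page_size, max_rows - len(records))
        page = fetch_page(limit=limit, offset=offset)
        records.extend(page)
        if len(page) < limit:  # last page reached
            break
        offset += len(page)
    return records


# Offline stand-in for the API: 7 fake records (illustrative only).
_fake_rows = [{"camis": str(i)} for i in range(7)]


def fake_fetch(limit, offset):
    return _fake_rows[offset:offset + limit]


rows = fetch_all_pages(fake_fetch, page_size=3)
print(len(rows))
```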

In [5]:
# Both datasets are too big (about 200 MB each) to save as raw CSVs:
# rats.to_csv('datasets/rats_raw.csv', index=False)
# restaurants.to_csv('datasets/restaurants_raw.csv', index=False)
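If a local cache of the raw pulls is still wanted, one option is to write compressed CSVs instead. The sketch below uses a tiny made-up frame and a temporary directory, so the file name and sample rows are placeholders rather than the notebook's actual data; how much space gzip saves on the real files has not been measured here.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame standing in for the raw pull.
sample = pd.DataFrame({
    "job_id": ["PO1", "PO2"],
    "result": ["Bait applied", "Passed Inspection"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rats_raw.csv.gz")
    # Write a gzip-compressed CSV.
    sample.to_csv(path, index=False, compression="gzip")
    # read_csv infers the compression from the .gz extension.
    restored = pd.read_csv(path)

print(restored)
```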

**rats.csv**

In [6]:
rats.shape

(594536, 20)

In [7]:
rats.isnull().mean()

inspection_type                0.000000
job_ticket_or_work_order_id    0.000000
job_id                         0.000000
job_progress                   0.000000
bbl                            0.000000
boro_code                      0.000000
block                          0.000000
lot                            0.000000
house_number                   0.013893
street_name                    0.000631
zip_code                       0.004043
x_coord                        0.001759
y_coord                        0.001759
latitude                       0.000891
longitude                      0.000891
borough                        0.000000
inspection_date                0.000000
result                         0.000002
approved_date                  0.000000
location                       0.001097
dtype: float64

In [8]:
rats.dropna(inplace = True)

In [9]:
rats.shape

(583054, 20)
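Dropping every row with any missing value removes 11,482 rows (about 1.9%), and most of the missing values sit in `house_number`, a column that the analysis subset further down does not use. A possible alternative, sketched here on made-up rows, is to drop only rows that are missing in the columns actually needed; the `needed` list is an assumption for illustration.

```python
import pandas as pd

# Toy rows: one is missing house_number, another is missing latitude.
toy = pd.DataFrame({
    "job_id": ["PO1", "PO2", "PO3"],
    "house_number": [None, "12", "34"],
    "latitude": ["40.7", None, "40.8"],
    "result": ["Bait applied", "Passed Inspection", "Active Rat Signs"],
})

needed = ["latitude", "result"]           # columns the analysis relies on (assumed)
strict = toy.dropna()                     # drops a row if ANY column is missing
targeted = toy.dropna(subset=needed)      # drops a row only if a needed column is missing

print(len(strict), len(targeted))
```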

In [10]:
rats.set_index('job_id',inplace = True)

In [11]:
rats.columns

Index(['inspection_type', 'job_ticket_or_work_order_id', 'job_progress', 'bbl',
       'boro_code', 'block', 'lot', 'house_number', 'street_name', 'zip_code',
       'x_coord', 'y_coord', 'latitude', 'longitude', 'borough',
       'inspection_date', 'result', 'approved_date', 'location'],
      dtype='object')

In [13]:
rats_df = rats[['inspection_type',
       'job_progress', 'bbl','latitude',
       'longitude','inspection_date', 'result', 'approved_date']]
rats_df.head()

| job_id | inspection_type | job_progress | bbl | latitude | longitude | inspection_date | result | approved_date |
|---|---|---|---|---|---|---|---|---|
| PO2075886 | INITIAL | 1 | 1022430001 | 40.868203550611 | -73.919469362846 | 2019-08-02T11:11:55.000 | Active Rat Signs | 2019-08-07T11:34:09.000 |
| PO333717 | BAIT | 5 | 1009620100 | 40.739489790595 | -73.976623458224 | 2011-10-26T09:45:58.000 | Bait applied | 2011-10-28T07:14:07.000 |
| PO431003 | BAIT | 1 | 1022290001 | 40.868266491555 | -73.919201730342 | 2012-03-19T14:00:03.000 | Bait applied | 2012-03-20T07:20:09.000 |
| PO333717 | BAIT | 4 | 1009620100 | 40.739489790595 | -73.976623458224 | 2011-10-17T13:50:37.000 | Bait applied | 2011-10-18T07:09:09.000 |
| PO422674 | BAIT | 1 | 1022290001 | 40.868266491555 | -73.919201730342 | 2012-03-08T13:51:50.000 | Bait applied | 2012-03-09T12:23:58.000 |


In [14]:
rats.isnull().sum().sum()

0

In [15]:
rats_df['inspection_date'] = pd.to_datetime(rats_df['inspection_date'])
rats_df['approved_date'] = pd.to_datetime(rats_df['approved_date'])

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
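The warning appears because `rats_df` was created by selecting columns from `rats`, so pandas cannot tell whether the assignment is meant to reach the original frame. A common way to avoid it is to take an explicit `.copy()` when building the subset. The sketch below shows this on two toy rows; `rats_toy` and `subset` are illustrative names, not the notebook's variables.

```python
import pandas as pd

# Two toy rows shaped like the rodent data.
rats_toy = pd.DataFrame({
    "job_id": ["PO2075886", "PO333717"],
    "inspection_type": ["INITIAL", "BAIT"],
    "inspection_date": ["2019-08-02T11:11:55.000", "2011-10-26T09:45:58.000"],
    "borough": ["Manhattan", "Manhattan"],
}).set_index("job_id")

# .copy() makes the subset an independent DataFrame, so assigning to its
# columns is unambiguous and does not touch rats_toy.
subset = rats_toy[["inspection_type", "inspection_date"]].copy()
subset["inspection_date"] = pd.to_datetime(subset["inspection_date"])

print(subset.dtypes)
```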
  


In [16]:
rats_df.dtypes

inspection_type            object
job_progress               object
bbl                        object
latitude                   object
longitude                  object
inspection_date    datetime64[ns]
result                     object
approved_date      datetime64[ns]
dtype: object
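The API returns every field as a string, which is why `latitude`, `longitude`, and `job_progress` still show up as `object`. If numeric work on them is planned, a conversion along these lines may help; this is a sketch on toy rows, and leaving `bbl` as a string identifier is an assumption about how it will be used.

```python
import pandas as pd

# Toy rows with string-typed values, as returned by the API.
toy = pd.DataFrame({
    "job_progress": ["1", "5"],
    "latitude": ["40.868203550611", "40.739489790595"],
    "longitude": ["-73.919469362846", "-73.976623458224"],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
for col in ["latitude", "longitude"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Nullable integer dtype keeps whole numbers even if some values are missing.
toy["job_progress"] = pd.to_numeric(toy["job_progress"], errors="coerce").astype("Int64")

print(toy.dtypes)
```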

In [17]:
rats_df['inspection_date'].sort_values(ascending = False).index

Index(['PO2167036', 'PO2167008', 'PO2189724', 'PO2125410', 'PO2167016',
       'PO2166220', 'PO2104678', 'PO2167032', 'PO2166216', 'PO2167034',
       ...
       'PO1492406', 'PO1907117', 'PO1948729', 'PO2066470', 'PO1292060',
       'PO1429204', 'PO1992462', 'PO1344062', 'PO51132', 'PO1836311'],
      dtype='object', name='job_id', length=583054)

In [21]:
rats_df = rats_df.sort_values(by = 'inspection_date', ascending = False)

In [22]:
rats_df.head()

| job_id | inspection_type | job_progress | bbl | latitude | longitude | inspection_date | result | approved_date |
|---|---|---|---|---|---|---|---|---|
| PO2167036 | COMPLIANCE | 2 | 1016390133 | 40.795053875967 | -73.943041006587 | 2020-01-08 13:20:54 | Active Rat Signs | 2020-01-09 08:16:29 |
| PO2167008 | COMPLIANCE | 2 | 1016390021 | 40.795248979755 | -73.943499517247 | 2020-01-08 13:10:44 | Active Rat Signs | 2020-01-09 08:16:29 |
| PO2189724 | INITIAL | 1 | 1004160001 | 40.720992060105 | -73.990497583403 | 2020-01-08 13:10:40 | Passed Inspection | 2020-01-09 09:39:57 |
| PO2125410 | COMPLIANCE | 2 | 1010100046 | 40.765455462661 | -73.977827275154 | 2020-01-08 13:05:13 | Passed Inspection | 2020-01-09 08:54:36 |
| PO2167016 | COMPLIANCE | 2 | 1016390035 | 40.794940931638 | -73.942221265198 | 2020-01-08 13:00:09 | Passed Inspection | 2020-01-09 08:16:29 |


In [29]:
rats_df['bbl'].value_counts()

1003020001    428
1001650001    408
1001220001    327
1003890036    313
1004200001    282
             ... 
1016670013      1
1008490053      1
1012607501      1
1017708900      1
1005940074      1
Name: bbl, Length: 43542, dtype: int64

**restaurants.csv**

In [13]:
restaurants.shape

(401074, 26)

In [25]:
restaurants = restaurants.loc[restaurants['boro'] == 'Manhattan']

In [30]:
restaurants.isnull().mean()

camis                    0.000000
dba                      0.000811
boro                     0.000000
building                 0.000665
street                   0.000000
zipcode                  0.016704
phone                    0.000000
cuisine_description      0.000000
inspection_date          0.000000
action                   0.003598
violation_code           0.013454
violation_description    0.020105
critical_flag            0.020105
score                    0.042313
grade                    0.497169
grade_date               0.504136
record_date              0.000000
inspection_type          0.003598
latitude                 0.001051
longitude                0.001051
community_board          0.017755
council_district         0.017679
census_tract             0.017679
bin                      0.020764
bbl                      0.001051
nta                      0.017755
dtype: float64

In [23]:
restaurants.head(2)

|  | camis | dba | boro | building | street | zipcode | phone | cuisine_description | inspection_date | action | ... | record_date | inspection_type | latitude | longitude | community_board | council_district | census_tract | bin | bbl | nta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50017603 | SWEET GARDEN CHINESE TAKEOUT RESTAURANT | Queens | 7922 | PARSONS BLVD | 11366.0 | 7185913462 | Chinese | 2018-05-17T00:00:00.000 | Violations were cited in the following area(s). | ... | 2020-02-25T06:00:52.000 | Cycle Inspection / Re-inspection | 40.720644462955 | -73.809263463716 | 408.0 | 24.0 | 77904.0 | 4147723.0 | 4068170020 | QN37 |
| 1 | 40400049 | PANYA/AUTRE KYOYA | Manhattan | 810 | STUYVESANT STREET |  | 2125980454 | Japanese | 2017-01-03T00:00:00.000 | Violations were cited in the following area(s). | ... | 2020-02-25T06:00:52.000 | Cycle Inspection / Re-inspection | 0.0 | 0.0 |  |  |  |  | 1 |  |
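The second row above has `latitude` and `longitude` of `0.0`, which `isnull()` does not count as missing even though the location was evidently not geocoded. One possible cleanup, sketched on toy rows, is to convert the coordinates to numbers and mask exact zeros as missing. Treating `0.0` as "no location" is an assumption based on this sample row, not something documented in the source.

```python
import pandas as pd

# Toy rows mirroring the two restaurants shown above.
toy = pd.DataFrame({
    "camis": ["50017603", "40400049"],
    "latitude": ["40.720644462955", "0.0"],
    "longitude": ["-73.809263463716", "0.0"],
})

for col in ["latitude", "longitude"]:
    numeric = pd.to_numeric(toy[col], errors="coerce")
    # Assumption: an exact 0.0 coordinate means "not geocoded".
    toy[col] = numeric.mask(numeric == 0.0)

print(toy.isnull().mean())
```

After this step the placeholder coordinates are included in the missing-value share reported by `isnull().mean()`.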


In [9]:
set(rats['result'])

{'Active Rat Signs',
 'Bait applied',
 'Cleanup done',
 'Monitoring visit',
 'Passed Inspection',
 'Problem Conditions',
 nan}

In [15]:
set(rats['inspection_type'])

{'BAIT', 'CLEAN_UPS', 'COMPLIANCE', 'INITIAL'}

In [16]:
set(restaurants['action'])

{'Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.',
 'Establishment re-closed by DOHMH',
 'Establishment re-opened by DOHMH',
 'No violations were recorded at the time of this inspection.',
 'Violations were cited in the following area(s).',
 nan}