# EDA
#### By: _Zhan Yu_

## Table of Contents
- [Loading Libraries & Data](#Loading-Libraries-&-Data)

## Loading Libraries & Data

This project uses two public datasets from data.cityofnewyork.us: [Rodent Inspection in NYC](https://data.cityofnewyork.us/Health/Rodent-Inspection/p937-wjvj) and [Restaurant-Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j).

In [1]:
# Libraries: 
import pandas as pd

import os
from sodapy import Socrata

### SoQL (Socrata Query Language)

The Socrata APIs provide rich query functionality through the Socrata Query Language (SoQL).  
Before running the notebook, install `sodapy` in the active environment (for example, mine is `dsi`):
``` Terminal
pip install sodapy
```
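
To make the SoQL idea concrete, here is a minimal sketch of how query options map onto a request URL. It only builds the URL and sends nothing. `build_soql_url` is a hypothetical helper written for this illustration and is not part of `sodapy`; the `$select`, `$where`, and `$limit` parameters and the `/resource/<dataset id>.json` endpoint pattern are the standard Socrata ones, and `sodapy` exposes the same options as the keyword arguments `select`, `where`, and `limit`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_soql_url(domain, dataset_id, select=None, where=None, limit=None):
    """Hypothetical helper: build a Socrata endpoint URL with SoQL parameters.

    No request is sent; this only shows how the query options are encoded.
    """
    params = {}
    if select is not None:
        params["$select"] = select
    if where is not None:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = str(limit)
    base = f"https://{domain}/resource/{dataset_id}.json"
    return f"{base}?{urlencode(params)}" if params else base


# Same filter as the rodent query in this notebook, limited to a few rows.
url = build_soql_url(
    "data.cityofnewyork.us",
    "p937-wjvj",
    select="job_id, result, inspection_date",
    where="boro_code = 1",
    limit=5,
)
print(url)

# Decode the query string again to check what the server would receive.
print(parse_qs(urlsplit(url).query))
```

Filtering with `where` on the server side keeps the download smaller than fetching everything and filtering in pandas afterwards.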

In [5]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# In this case they are public datasets.

# Up to 1,000,000 rows with boro_code = 1 (Manhattan), returned as JSON from the API
# and converted to a Python list of dictionaries by sodapy.
results = client.get("p937-wjvj",                                # Rodent Inspection dataset
                     limit=1_000_000, where="boro_code = 1")

# Convert to pandas DataFrame
rats = pd.DataFrame.from_records(results)



In [12]:
# First 500,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("43nn-pn8j",                                # NYC Restaurant Inspection dataset
                     limit=500_000)

# Convert to pandas DataFrame
restaurants = pd.DataFrame.from_records(results)

In [17]:
# The raw datasets are large (about 200 MB each), so the CSV exports are left commented out.
# rats.to_csv('datasets/rats_raw.csv', index=False)
# restaurants.to_csv('datasets/restaurants_raw.csv', index=False)

**rats.csv**

In [6]:
rats.shape

(594536, 20)

In [20]:
rats.isnull().mean()

inspection_type                0.000000
job_ticket_or_work_order_id    0.000000
job_id                         0.000000
job_progress                   0.000000
bbl                            0.000000
boro_code                      0.000000
block                          0.000000
lot                            0.000000
house_number                   0.013893
street_name                    0.000631
zip_code                       0.004043
x_coord                        0.001759
y_coord                        0.001759
latitude                       0.000891
longitude                      0.000891
borough                        0.000000
inspection_date                0.000000
result                         0.000002
approved_date                  0.000000
location                       0.001097
dtype: float64

In [22]:
rats.dropna(inplace = True)

In [23]:
rats.shape

(583054, 20)
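
Comparing the two shapes, `dropna` removed 594,536 − 583,054 = 11,482 rows, roughly 1.9% of the data. The sketch below shows one way to report that loss directly. The `toy` frame and its values are made up for illustration and only stand in for `rats`.

```python
import pandas as pd

# Toy frame standing in for `rats`; the values are invented for this example.
toy = pd.DataFrame({
    "job_id": ["PO1", "PO2", "PO3", "PO4"],
    "zip_code": ["10025", None, "10003", "10034"],
    "result": ["Bait applied", "Passed Inspection", None, "Bait applied"],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops every row with at least one missing value
rows_after = len(cleaned)

dropped = rows_before - rows_after
print(f"dropped {dropped} of {rows_before} rows ({dropped / rows_before:.1%})")
```

Returning a new frame instead of using `inplace=True` keeps the original available for this kind of before/after comparison.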

In [32]:
rats.set_index('job_id',inplace = True)

In [24]:
rats.columns

Index(['inspection_type', 'job_ticket_or_work_order_id', 'job_id',
       'job_progress', 'bbl', 'boro_code', 'block', 'lot', 'house_number',
       'street_name', 'zip_code', 'x_coord', 'y_coord', 'latitude',
       'longitude', 'borough', 'inspection_date', 'result', 'approved_date',
       'location'],
      dtype='object')

In [33]:
rats_df = rats[['inspection_type',
       'job_progress', 'bbl','latitude',
       'longitude','inspection_date', 'result', 'approved_date']]
rats_df.head()

job_id,inspection_type,job_progress,bbl,latitude,longitude,inspection_date,result,approved_date
PO2075886,INITIAL,1,1022430001,40.868203550611,-73.919469362846,2019-08-02T11:11:55.000,Active Rat Signs,2019-08-07T11:34:09.000
PO333717,BAIT,5,1009620100,40.739489790595,-73.976623458224,2011-10-26T09:45:58.000,Bait applied,2011-10-28T07:14:07.000
PO431003,BAIT,1,1022290001,40.868266491555,-73.919201730342,2012-03-19T14:00:03.000,Bait applied,2012-03-20T07:20:09.000
PO333717,BAIT,4,1009620100,40.739489790595,-73.976623458224,2011-10-17T13:50:37.000,Bait applied,2011-10-18T07:09:09.000
PO422674,BAIT,1,1022290001,40.868266491555,-73.919201730342,2012-03-08T13:51:50.000,Bait applied,2012-03-09T12:23:58.000


In [27]:
rats.isnull().sum().sum()

0

In [36]:
# Copy first: rats_df is a slice of rats, and assigning to it directly triggers a SettingWithCopyWarning.
rats_df = rats_df.copy()
rats_df['inspection_date'] = pd.to_datetime(rats_df['inspection_date'])
rats_df['approved_date'] = pd.to_datetime(rats_df['approved_date'])


In [38]:
rats_df.dtypes

inspection_type            object
job_progress               object
bbl                        object
latitude                   object
longitude                  object
inspection_date    datetime64[ns]
result                     object
approved_date      datetime64[ns]
dtype: object
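
The dtypes above show that `job_progress`, `latitude`, and `longitude` are still `object`, presumably because the API delivers every field as a string. A minimal sketch of converting them with `pd.to_numeric` follows. The `toy` frame reuses two rows of values from the preview above, and the choice of which columns to convert is an assumption of this example.

```python
import pandas as pd

# Two rows copied from the rats_df preview; all values arrive as strings.
toy = pd.DataFrame({
    "job_progress": ["1", "5"],
    "latitude": ["40.868203550611", "40.739489790595"],
    "longitude": ["-73.919469362846", "-73.976623458224"],
})

numeric_cols = ["job_progress", "latitude", "longitude"]  # assumed selection

converted = toy.copy()
for col in numeric_cols:
    # errors="coerce" turns unparseable strings into NaN instead of raising.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
```

`bbl` is left out on purpose: it is an identifier rather than a quantity, so keeping it as a string is a reasonable choice.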

In [41]:
rats_df['inspection_date'].sort_values(ascending = False)

job_id
PO2167036   2020-01-08 13:20:54
PO2167008   2020-01-08 13:10:44
PO2189724   2020-01-08 13:10:40
PO2125410   2020-01-08 13:05:13
PO2167016   2020-01-08 13:00:09
                    ...        
PO1429204   1945-06-14 08:13:52
PO1992462   1935-04-24 14:35:49
PO1344062   1935-02-22 10:19:31
PO51132     1930-01-30 08:24:15
PO1836311   1918-10-19 14:34:44
Name: inspection_date, Length: 583054, dtype: datetime64[ns]
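
The oldest values in the sorted output (1918 to 1945) look out of place next to the rest of the data. Treating them as data-entry errors is an assumption, not something the dataset confirms. Under that assumption, a sketch of separating them with a cutoff date could look like this; the `cutoff` value is hypothetical and the three dates are copied from the outputs above.

```python
import pandas as pd

# Three inspection dates taken from the outputs above, indexed by job_id.
toy = pd.Series(
    pd.to_datetime([
        "2020-01-08 13:20:54",
        "1918-10-19 14:34:44",
        "2012-03-19 14:00:03",
    ]),
    index=["PO2167036", "PO1836311", "PO431003"],
    name="inspection_date",
)

cutoff = pd.Timestamp("2009-01-01")  # hypothetical threshold, adjust as needed

plausible = toy[toy >= cutoff]   # dates on or after the cutoff
suspicious = toy[toy < cutoff]   # dates to inspect by hand before dropping

print(suspicious)
```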

**restaurants.csv**

In [13]:
restaurants.shape

(401074, 26)

In [21]:
restaurants.isnull().mean()

camis                    0.000000
dba                      0.001010
boro                     0.000000
building                 0.000621
street                   0.000000
zipcode                  0.013661
phone                    0.000042
cuisine_description      0.000000
inspection_date          0.000000
action                   0.003486
violation_code           0.014182
violation_description    0.022617
critical_flag            0.022617
score                    0.042242
record_date              0.000000
inspection_type          0.003486
latitude                 0.001032
longitude                0.001032
community_board          0.014693
council_district         0.014663
census_tract             0.014663
bin                      0.019001
bbl                      0.001032
nta                      0.014693
grade                    0.494552
grade_date               0.500910
dtype: float64
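
Unlike the rodent data, `grade` and `grade_date` are missing in about half of the restaurant rows, so calling `dropna()` on the whole frame would discard at least half of the data. The sketch below shows one way to list such sparse columns before deciding how to handle them. The `toy` frame is invented and `threshold` is a hypothetical cutoff.

```python
import pandas as pd

# Toy frame standing in for `restaurants`; the values are invented.
toy = pd.DataFrame({
    "camis": ["1", "2", "3", "4"],
    "score": ["12", None, "7", "20"],
    "grade": ["A", None, None, "B"],
})

missing = toy.isnull().mean()  # fraction of missing values per column
threshold = 0.4                # hypothetical cutoff for "too sparse"

sparse_cols = missing[missing > threshold].index.tolist()
print(sparse_cols)

# Drop rows only where the remaining, mostly complete columns are missing.
dense_cols = [c for c in toy.columns if c not in sparse_cols]
cleaned = toy.dropna(subset=dense_cols)
print(cleaned.shape)
```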

In [7]:
rats.tail()

,inspection_type,job_ticket_or_work_order_id,job_id,job_progress,bbl,boro_code,block,lot,house_number,street_name,zip_code,x_coord,y_coord,latitude,longitude,borough,inspection_date,result,approved_date,location
594531,COMPLIANCE,1448107,PO1439162,2,1018380042,1,1838,42,14,WEST 103 STREET,10025,994678,229458,40.796472628049,-73.962337133531,Manhattan,2017-06-22T11:57:03.000,Passed Inspection,2017-06-26T09:10:19.000,"{'latitude': '40.796472628049', 'longitude': '..."
594532,INITIAL,1289507,PO1289507,1,1005500017,1,550,17,60,WASHINGTON MEWS,10003,985381,205835,40.731427971626,-73.99562690223,Manhattan,2016-11-22T11:59:08.000,Passed Inspection,2016-11-23T12:16:42.000,"{'latitude': '40.731427971626', 'longitude': '..."
594533,INITIAL,814770,PO814770,1,1021740077,1,2174,77,126,NAGLE AVENUE,10040,1004604,253094,40.861022449888,-73.92632894516,Manhattan,2014-06-11T11:53:38.000,Active Rat Signs,2014-06-13T11:05:49.000,"{'latitude': '40.861022449888', 'longitude': '..."
594534,INITIAL,44963,PO44963,1,1022260015,1,2226,15,125,VERMILYEA AVENUE,10034,1006063,254932,40.866486626423,-73.921311751905,Manhattan,2010-01-06T10:03:58.000,Active Rat Signs,2010-01-11T13:16:55.000,"{'latitude': '40.866486626423', 'longitude': '..."
594535,BAIT,484725,PO1954284,4,1022270019,1,2227,19,530,ISHAM STREET,10034,1006667,254965,40.866545234657,-73.918759119886,Manhattan,2019-06-06T12:43:54.000,Bait applied,2019-06-10T12:15:50.000,"{'latitude': '40.866545234657', 'longitude': '..."


In [14]:
restaurants.head()

,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,...,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,grade,grade_date
0,50073722,SAMWON GARDEN,Manhattan,37,W 32ND ST,10001,2126953131,Korean,2018-02-05T00:00:00.000,Violations were cited in the following area(s).,...,40.747629611437,-73.986610434278,105,4,7600,1015845,1008340021,MN17,,
1,50044714,THE HIDEOUT TAVERN,Bronx,143,E 233RD ST,10470,3472751105,Irish,2016-02-04T00:00:00.000,Violations were cited in the following area(s).,...,40.896425170898,-73.871547455917,212,11,44901,2018882,2033690024,BX62,A,2016-02-04T00:00:00.000
2,50002562,EXQUISITO RESTAURANT,Queens,2112,36TH AVENUE,11106,7187843505,Spanish,2018-11-28T00:00:00.000,Establishment Closed by DOHMH. Violations wer...,...,40.759769411236,-73.936581040398,401,26,3300,4004269,4003480019,QN68,,
3,41382110,CAFFEINA ESPRESSO BAR,Queens,4402,23 STREET,11101,7183616408,Café/Coffee/Tea,2020-01-31T00:00:00.000,Violations were cited in the following area(s).,...,40.748906949623,-73.944379896967,402,26,1900,4005192,4004390039,QN31,A,2020-01-31T00:00:00.000
4,50011059,PARDON MY FRENCH,Manhattan,103,AVENUE B,10009,2123589683,French,2018-04-10T00:00:00.000,Violations were cited in the following area(s).,...,40.724822636463,-73.981376497697,103,2,2602,1004672,1003890006,MN28,B,2018-04-10T00:00:00.000


In [9]:
set(rats['result'])

{'Active Rat Signs',
 'Bait applied',
 'Cleanup done',
 'Monitoring visit',
 'Passed Inspection',
 'Problem Conditions',
 nan}

In [15]:
set(rats['inspection_type'])

{'BAIT', 'CLEAN_UPS', 'COMPLIANCE', 'INITIAL'}

In [16]:
set(restaurants['action'])

{'Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.',
 'Establishment re-closed by DOHMH',
 'Establishment re-opened by DOHMH',
 'No violations were recorded at the time of this inspection.',
 'Violations were cited in the following area(s).',
 nan}
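
The `set(...)` calls above list the distinct categories, including `nan`, but not how often each one occurs. As a possible alternative, `value_counts(dropna=False)` reports the frequencies and keeps the missing values visible. The `toy` series below is made up for illustration.

```python
import pandas as pd

# Toy series standing in for a categorical column such as rats['result'].
toy = pd.Series(
    ["Bait applied", "Bait applied", "Passed Inspection", None],
    name="result",
)

# dropna=False keeps missing values as their own category in the counts.
counts = toy.value_counts(dropna=False)
print(counts)

n_missing = int(toy.isna().sum())
print(f"missing values: {n_missing}")
```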