This data is from: https://data.cityofnewyork.us/City-Government/Points-of-Interest/rxuy-2muj

PDF: https://data.cityofnewyork.us/api/views/t95h-5fsr/files/ebabcf1d-c6e7-43ca-a031-5168036b2fbb?download=true&filename=PointOfInterest.pdf

# Overview:

This refinement keeps only the columns needed to aggregate points of interest into **community districts** and to perform borough-level analysis within New York City. The rationale for each retained column is given below.

---
Some columns will be excluded; the retained ones are listed below.

## Retained Columns and Their Purpose:

### `the_geom` (Geometric Coordinates):
- **Reason for Inclusion:** Stores each point as a WKT string of the form `POINT (longitude latitude)`. These coordinates are necessary for spatial operations such as determining whether a point lies within a community district or borough.
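
As a rough illustration of the point-in-polygon test mentioned above, here is a minimal pure-Python sketch using ray casting. The function name `point_in_polygon` and the rectangular `district` polygon are hypothetical; real community district boundaries are irregular polygons and would normally be loaded from a boundary file and handled with a geospatial library.

```python
from typing import List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Coord, polygon: List[Coord]) -> bool:
    """Ray-casting test: count how many polygon edges a horizontal ray
    starting at the point crosses; an odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "district" (not a real boundary)
district = [(-74.02, 40.70), (-73.97, 40.70), (-73.97, 40.75), (-74.02, 40.75)]
holland = (-74.00701717096757, 40.724634757833414)     # first row of the dataset
whitestone = (-73.82661642130311, 40.797182526598505)  # second row of the dataset

print(point_in_polygon(holland, district))
print(point_in_polygon(whitestone, district))
```

Points that fall exactly on a boundary are not treated specially in this sketch.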

### `BOROUGH` (Borough Identifier):
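- **Reason for Inclusion:** Helps filter out data points outside New York City: codes 1 to 5 identify the five boroughs, and rows with any other value (missing or 8 in this dataset) are dropped. Essential for borough-level aggregation and validation.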

### `FACILITY_T` (Facility Type):
- **Reason for Inclusion:** Useful for understanding what kind of facility each point of interest represents. Adds contextual information that can enhance analysis or visualization.

### `B7SC` (Street Code):
- **Reason for Inclusion:** Serves as a street code identifier. Although it may not be immediately needed, it could be helpful later for linking data or detailed street-level analysis. Note that the column selection in the code below does not currently include it.

### `NAME` (Point Name):
- **Reason for Inclusion:** Adds a descriptive label for each point, making it easier to interpret and display the data during analysis or visualization.

---


In [98]:
import numpy as np
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import pickle
import typing

In [99]:

data = pd.read_csv('../data/POI.csv')
data.head()


Unnamed: 0,the_geom,SEGMENTID,COMPLEXID,SAFTYPE,SOS,PLACEID,FACI_DOM,BIN,BOROUGH,CREATED,MODIFIED,FACILITY_T,SOURCE,B7SC,PRI_ADD,NAME
0,POINT (-74.00701717096757 40.724634757833414),31895,0,N,1.0,567,9,0,1.0,05/14/2009 12:00:00 AM,11/18/2011 12:00:00 AM,6,DoITT,19743001.0,0,HOLLAND
1,POINT (-73.82661642130311 40.797182526598505),306303,3378,N,2.0,568,8,0,4.0,05/14/2009 12:00:00 AM,01/09/2017 12:00:00 AM,6,DoITT,49731001.0,0,WHITESTONE
2,POINT (-73.99395441100663 40.70384707235758),144842,3960,N,2.0,576,8,0,3.0,05/14/2009 12:00:00 AM,01/22/2018 12:00:00 AM,6,DoITT,39734001.0,0,BROOKLYN
3,POINT (-73.9919414213091 40.70960010711745),162664,0,N,1.0,580,8,0,1.0,05/14/2009 12:00:00 AM,05/11/2011 12:00:00 AM,6,DoITT,19795001.0,0,MANHATTAN
4,POINT (-73.9526609766105 40.73906602249743),157362,0,N,1.0,582,8,0,3.0,05/14/2009 12:00:00 AM,03/03/2017 12:00:00 AM,6,DoITT,39740001.0,0,PULASKI


In [100]:

df = data[['the_geom', 'BOROUGH', 'FACILITY_T',  'NAME']]
df.shape


(20576, 4)

In [101]:
df['BOROUGH'].unique()

array([ 1.,  4.,  3.,  2.,  5., nan,  8.])

Now exclude all rows whose borough is not within NYC, after first dropping any rows with no geometry.

In [102]:
df = df.dropna(subset=['the_geom']).copy()
print(df.shape)

(20576, 4)


In [103]:
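# Keep only the five NYC boroughs (codes 1 to 5); NaN and code 8 are dropped.
# .copy() avoids chained-assignment warnings when columns are added later.
df = df.loc[df['BOROUGH'] <= 5].copy()
print(df['BOROUGH'].unique())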


[1. 4. 3. 2. 5.]


In [104]:
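# Note: `describe` is referenced without parentheses, so this prints the bound
# method (which includes the DataFrame repr) rather than summary statistics.
# Call df.describe() to get the statistics.
print(df.describe)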


<bound method NDFrame.describe of                                             the_geom  BOROUGH  FACILITY_T  \
0      POINT (-74.00701717096757 40.724634757833414)      1.0           6   
1      POINT (-73.82661642130311 40.797182526598505)      4.0           6   
2       POINT (-73.99395441100663 40.70384707235758)      3.0           6   
3        POINT (-73.9919414213091 40.70960010711745)      1.0           6   
4        POINT (-73.9526609766105 40.73906602249743)      3.0           6   
...                                              ...      ...         ...   
20571  POINT (-73.91125345633935 40.776969338770535)      4.0           9   
20572   POINT (-73.93874332465784 40.59206990676358)      3.0           3   
20573  POINT (-73.95498698976225 40.622455918912394)      3.0           2   
20574      POINT (-73.955370475949 40.6223263507244)      3.0           9   
20575   POINT (-73.94519728674484 40.61507393637472)      3.0           9   

                                       NA

In [105]:
borough_names = {
    1: 'Manhattan',
    2: 'Bronx',
    3: 'Brooklyn',
    4: 'Queens',
    5: 'Staten Island',
}
df['BOROUGH'] = df['BOROUGH'].map(borough_names)

df.head()

Unnamed: 0,the_geom,BOROUGH,FACILITY_T,NAME
0,POINT (-74.00701717096757 40.724634757833414),Manhattan,6,HOLLAND
1,POINT (-73.82661642130311 40.797182526598505),Queens,6,WHITESTONE
2,POINT (-73.99395441100663 40.70384707235758),Brooklyn,6,BROOKLYN
3,POINT (-73.9919414213091 40.70960010711745),Manhattan,6,MANHATTAN
4,POINT (-73.9526609766105 40.73906602249743),Brooklyn,6,PULASKI


In [106]:
# the_geom is WKT: "POINT (longitude latitude)"
df['longitude'] = df['the_geom'].str.extract(r'POINT \((-?\d+\.\d+)')[0].astype(float)
df['latitude'] = df['the_geom'].str.extract(r' (-?\d+\.\d+)\)')[0].astype(float)

df.head()

Unnamed: 0,the_geom,BOROUGH,FACILITY_T,NAME,longitude,latitude
0,POINT (-74.00701717096757 40.724634757833414),Manhattan,6,HOLLAND,-74.007017,40.724635
1,POINT (-73.82661642130311 40.797182526598505),Queens,6,WHITESTONE,-73.826616,40.797183
2,POINT (-73.99395441100663 40.70384707235758),Brooklyn,6,BROOKLYN,-73.993954,40.703847
3,POINT (-73.9919414213091 40.70960010711745),Manhattan,6,MANHATTAN,-73.991941,40.7096
4,POINT (-73.9526609766105 40.73906602249743),Brooklyn,6,PULASKI,-73.952661,40.739066
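
The regular expressions above assume every coordinate contains a decimal point. A slightly more defensive variant is sketched below with a hypothetical helper, `parse_wkt_point`, which is not part of the original notebook. It also accepts integer coordinates and returns `None` for strings that are not WKT points.

```python
import re
from typing import Optional, Tuple

# Matches "POINT (<lon> <lat>)"; each number may omit its fractional part
_WKT_POINT = re.compile(
    r'^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$'
)


def parse_wkt_point(wkt: str) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) from a WKT point string, or None if it does not match."""
    match = _WKT_POINT.match(wkt)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_wkt_point('POINT (-74.00701717096757 40.724634757833414)'))
print(parse_wkt_point('POINT (-74 40)'))          # integer coordinates are accepted
print(parse_wkt_point('LINESTRING (0 0, 1 1)'))   # not a point, so None
```

On the real data this could be applied with `df['the_geom'].map(parse_wkt_point)`, which yields one `(longitude, latitude)` tuple per row.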


In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20361 entries, 0 to 20575
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   the_geom    20361 non-null  object 
 1   BOROUGH     20361 non-null  object 
 2   FACILITY_T  20361 non-null  int64  
 3   NAME        20361 non-null  object 
 4   longitude   20361 non-null  float64
 5   latitude    20361 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 1.1+ MB


In [108]:
# Facility type codes (1 to 13) and their names
facility_names = {
    1: 'Residential',
    2: 'Education Facility',
    3: 'Cultural Facility',
    4: 'Recreational Facility',
    5: 'Social Services',
    6: 'Transportation Facility',
    7: 'Commercial',
    8: 'Government Facility',  # non public safety
    9: 'Religious Institution',
    10: 'Health Services',
    11: 'Public Safety',
    12: 'Water',
    13: 'Miscellaneous'
}

# Create one 0/1 indicator column per facility type, named after the type
for code, name in facility_names.items():
    df[name] = (df['FACILITY_T'] == code).astype(int)

df.head()

Unnamed: 0,the_geom,BOROUGH,FACILITY_T,NAME,longitude,latitude,Residential,Education Facility,Cultural Facility,Recreational Facility,Social Services,Transportation Facility,Commercial,Government Facility,Religious Institution,Health Services,Public Safety,Water,Miscellaneous
0,POINT (-74.00701717096757 40.724634757833414),Manhattan,6,HOLLAND,-74.007017,40.724635,0,0,0,0,0,1,0,0,0,0,0,0,0
1,POINT (-73.82661642130311 40.797182526598505),Queens,6,WHITESTONE,-73.826616,40.797183,0,0,0,0,0,1,0,0,0,0,0,0,0
2,POINT (-73.99395441100663 40.70384707235758),Brooklyn,6,BROOKLYN,-73.993954,40.703847,0,0,0,0,0,1,0,0,0,0,0,0,0
3,POINT (-73.9919414213091 40.70960010711745),Manhattan,6,MANHATTAN,-73.991941,40.7096,0,0,0,0,0,1,0,0,0,0,0,0,0
4,POINT (-73.9526609766105 40.73906602249743),Brooklyn,6,PULASKI,-73.952661,40.739066,0,0,0,0,0,1,0,0,0,0,0,0,0
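
The same kind of indicator columns could also be produced in one step with `pd.get_dummies`. The sketch below uses a small made-up frame (`toy`) rather than the real data, and casts `FACILITY_T` to a categorical that lists all 13 codes so that facility types absent from the data still get an all-zero column.

```python
import pandas as pd

facility_names = {
    1: 'Residential', 2: 'Education Facility', 3: 'Cultural Facility',
    4: 'Recreational Facility', 5: 'Social Services',
    6: 'Transportation Facility', 7: 'Commercial', 8: 'Government Facility',
    9: 'Religious Institution', 10: 'Health Services', 11: 'Public Safety',
    12: 'Water', 13: 'Miscellaneous',
}

# Toy data for illustration only (not rows from the dataset)
toy = pd.DataFrame({'FACILITY_T': [6, 6, 9, 3, 2]})

# Declaring all 13 codes as categories keeps columns for unobserved types
codes = toy['FACILITY_T'].astype(pd.CategoricalDtype(categories=list(facility_names)))
dummies = pd.get_dummies(codes, dtype=int)

# Replace the numeric codes in the column labels with the facility names
dummies.columns = [facility_names[code] for code in dummies.columns]

toy = pd.concat([toy, dummies], axis=1)
print(toy[['FACILITY_T', 'Transportation Facility', 'Religious Institution']])
```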


In [109]:
df = df.drop(columns=['FACILITY_T', 'NAME','the_geom'])
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20361 entries, 0 to 20575
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   BOROUGH                  20361 non-null  object 
 1   longitude                20361 non-null  float64
 2   latitude                 20361 non-null  float64
 3   Residential              20361 non-null  int32  
 4   Education Facility       20361 non-null  int32  
 5   Cultural Facility        20361 non-null  int32  
 6   Recreational Facility    20361 non-null  int32  
 7   Social Services          20361 non-null  int32  
 8   Transportation Facility  20361 non-null  int32  
 9   Commercial               20361 non-null  int32  
 10  Government Facility      20361 non-null  int32  
 11  Religious Institution    20361 non-null  int32  
 12  Health Services          20361 non-null  int32  
 13  Public Safety            20361 non-null  int32  
 14  Water                 

In [110]:
df.to_csv('../exports/POI.csv', index=False)
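
For the borough-level analysis mentioned in the overview, the exported indicator columns can be summed per borough. This is a minimal sketch on a made-up frame (`sample`) with the same column layout; the values are illustrative and are not taken from the export.

```python
import pandas as pd

# Illustrative rows only; the real export has 13 facility columns
sample = pd.DataFrame({
    'BOROUGH': ['Manhattan', 'Queens', 'Brooklyn', 'Manhattan', 'Brooklyn'],
    'longitude': [-74.007, -73.827, -73.994, -73.992, -73.953],
    'latitude': [40.725, 40.797, 40.704, 40.710, 40.739],
    'Transportation Facility': [1, 1, 1, 0, 1],
    'Education Facility': [0, 0, 0, 1, 0],
})

# Every column other than the borough and coordinates is a 0/1 indicator
facility_cols = sample.columns.difference(['BOROUGH', 'longitude', 'latitude'])

# Summing 0/1 indicators per borough gives facility counts by type
counts = sample.groupby('BOROUGH')[list(facility_cols)].sum()
print(counts)
```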