## Introduction

In this notebook, we prepare the CSV files that will be used in ArcGIS.

The files need to meet the following requirements:
1) They must be small. We split the original dataset by borough and analyze/visualize each part separately to prevent ArcGIS glitches;
2) They must have a correct "BIN" column whose values are integers, so that joins work;
3) They are not grouped by neighborhood or zip code yet, so that we can produce visuals for each building. The average_year_eviction_count is computed per building over the 7 years.
4) **The key to successfully running a "join" in ArcGIS is to:**

*   Choose a good dataset from the portal, or download the original one and reduce it;
*   Export the portal layers (export their features) to form a new editable table;
*   Re-index the join columns in the table's properties section;
*   Join the attribute table (in this case, the building footprints) with the self-cleaned CSVs.

5) The final results of this notebook can be shapefiles or GeoJSON, but they don't have to be. A zipped shapefile bundles several component files (cpg, dbf, prj, shp, shx); GeoJSON is a separate, single-file format.
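
The join described in item 4 can be dry-run in pandas before opening ArcGIS. The sketch below is not part of the original workflow: `footprints` and `evictions` are small made-up tables, and it only illustrates why both sides need the same integer `BIN` key.

```python
import pandas as pd

# Hypothetical stand-ins: a building footprint attribute table and a cleaned eviction csv.
footprints = pd.DataFrame({"BIN": [1000001, 1000002, 1000003]})
evictions = pd.DataFrame({
    "BIN": [1000001.0, 1000003.0, 1000009.0],  # floats, as pandas often reads keys from a csv
    "average_year_eviction_count": [3.0, 2.25, 4.0],
})

# Cast the key to int so both sides share one dtype before joining.
evictions["BIN"] = evictions["BIN"].astype("int64")

# A left merge with indicator=True shows which eviction rows would find a footprint.
joined = evictions.merge(footprints, on="BIN", how="left", indicator=True)
match_rate = (joined["_merge"] == "both").mean()

print(joined)
print(f"share of eviction rows with a matching footprint: {match_rate:.0%}")
```

Rows marked `left_only` are the ones that would come back empty after an ArcGIS join, so a low match rate here is a hint to re-check the key column first.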

In the next notebook, we will re-aggregate the average eviction counts by neighborhood or zip code to create visuals comparable with SVI (Social Vulnerability Index) scores.



BBL data explanations:
https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/64uk-42ks/about_data

Very detailed NYC building information: https://s-media.nyc.gov/agencies/dcp/assets/files/pdf/data-tools/bytes/padgui.pdf

Some other info: https://www.nyc.gov/assets/finance/jump/hlpbldgcode.html


In [1]:
# !pip install geopandas folium matplotlib seaborn scipy
# !pip install esda
# !pip install splot
# # For Google Colab, some packages had to be reinstalled.

In [2]:
# !pip install geopandas folium matplotlib seaborn scipy esda splot

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import datetime as dt
import scipy

from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# visualization
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import seaborn as sns
import folium
from folium.plugins import HeatMap
from folium import Marker
from folium.plugins import MarkerCluster
import plotly.express as px
import plotly.io as pio

# spatial statistics
from esda.moran import Moran
from esda import Moran_Local
from esda.getisord import G_Local
from shapely.geometry import Point
from libpysal.weights import Queen, Rook

# system and utility
import warnings
import os
import io
from IPython.display import IFrame
from google.colab import files

# Queen, Rook, Moran and pyplot are already imported above
from splot.esda import moran_scatterplot

# suppress warnings
warnings.filterwarnings('ignore')

# inline
%matplotlib inline

# Part 1: Get the Evictions data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# data source:
# gdf already cleaned with lisa info
file_path = '/content/drive/My Drive/X999/evictions_cleaned_lisa.csv'

In [None]:
evictions_cleaned_raw = pd.read_csv(file_path)

In [None]:
evictions_cleaned = evictions_cleaned_raw.copy()

In [None]:
evictions_cleaned.columns

Index(['court_index_number', 'docket_number', 'eviction_address',
       'eviction_apartment_number', 'executed_date', 'borough',
       'eviction_postcode', 'ejectment', 'eviction/legal_possession',
       'latitude', 'longitude', 'community_board', 'council_district',
       'census_tract', 'bin', 'bbl', 'nta', 'geometry', 'eviction_count',
       'year', 'average_year_eviction_count', 'cluster', 'cluster_k',
       'same_cluster', 'lisa_cluster_rook', 'lisa_pvalue_rook',
       'lisa_cluster_queen', 'lisa_pvalue_queen'],
      dtype='object')

In [None]:
relevant_columns = [
    'eviction_address', 'eviction_apartment_number', 'borough', 'eviction_postcode',
    'latitude', 'longitude', 'bin', 'bbl', 'eviction_count', 'year',
    'average_year_eviction_count', 'geometry'
]

# Filter the DataFrame to keep only relevant columns
# .copy() so that later column assignments do not act on a view of evictions_cleaned
evictions_cleaned_filtered = evictions_cleaned[relevant_columns].copy()
evictions_cleaned_filtered.columns

Index(['eviction_address', 'eviction_apartment_number', 'borough',
       'eviction_postcode', 'latitude', 'longitude', 'bin', 'bbl',
       'eviction_count', 'year', 'average_year_eviction_count', 'geometry'],
      dtype='object')

In [None]:
from shapely.wkt import loads
evictions_cleaned_filtered['geometry'] = evictions_cleaned_filtered['geometry'].apply(loads)

In [None]:
evictions_cleaned_geo = gpd.GeoDataFrame(evictions_cleaned_filtered, geometry='geometry', crs="EPSG:4326")

In [None]:
evictions_cleaned_geo.geometry.type

Unnamed: 0,0
0,Point
1,Point
2,Point
3,Point
4,Point
...,...
76479,Point
76480,Point
76481,Point
76482,Point


In [None]:
evictions_cleaned_geo.head()

Unnamed: 0,eviction_address,eviction_apartment_number,borough,eviction_postcode,latitude,longitude,bin,bbl,eviction_count,year,average_year_eviction_count,geometry
0,710 61ST STREET,2ND FLOOR,BROOKLYN,11220,40.635941,-74.011883,3143881.0,3057940000.0,3,2024,3.0,POINT (-74.01188 40.63594)
1,462 60TH STREET,FOURTH FLOOR APT AKA,BROOKLYN,11220,40.640008,-74.017068,3143435.0,3057820000.0,3,2024,3.0,POINT (-74.01707 40.64001)
2,3400 PAUL AVENUE,15D,BRONX,10468,40.87719,-73.889569,2015444.0,2032510000.0,4,2018,4.0,POINT (-73.88957 40.87719)
3,480 CONCORD AVENUE,4E,BRONX,10455,40.811197,-73.90881,2003900.0,2025770000.0,9,2019,2.25,POINT (-73.90881 40.8112)
4,65 EAST 193RD ST,1B,BRONX,10468,40.866075,-73.896515,2013945.0,2031770000.0,8,2017,2.666667,POINT (-73.89652 40.86608)


In [None]:
evictions_cleaned_geo['borough'].unique()

array(['BROOKLYN', 'BRONX', 'STATEN ISLAND', 'QUEENS', 'MANHATTAN'],
      dtype=object)

In [None]:
manhattan = evictions_cleaned_geo[evictions_cleaned_geo['borough'] == 'MANHATTAN']
brooklyn = evictions_cleaned_geo[evictions_cleaned_geo['borough'] == 'BROOKLYN']
staten_island = evictions_cleaned_geo[evictions_cleaned_geo['borough'] == 'STATEN ISLAND']
bronx = evictions_cleaned_geo[evictions_cleaned_geo['borough'] == 'BRONX']
queens = evictions_cleaned_geo[evictions_cleaned_geo['borough'] == 'QUEENS']

In [None]:
manhattan = manhattan.rename(columns={'bin': 'BIN'})
brooklyn = brooklyn.rename(columns={'bin': 'BIN'})
staten_island = staten_island.rename(columns={'bin': 'BIN'})
bronx = bronx.rename(columns={'bin': 'BIN'})
queens = queens.rename(columns={'bin': 'BIN'})
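
The five filters and renames above could also be written as a single pass. A minimal sketch, assuming a small made-up frame in place of `evictions_cleaned_geo`; the `by_borough` dictionary is a hypothetical name that the notebook itself does not define.

```python
import pandas as pd

# Toy stand-in for evictions_cleaned_geo (values are made up).
df = pd.DataFrame({
    "borough": ["BROOKLYN", "BRONX", "BRONX", "QUEENS"],
    "bin": [3143881.0, 2015444.0, 2003900.0, 4436442.0],
})

# Rename the key once, then split into one independent DataFrame per borough.
df = df.rename(columns={"bin": "BIN"})
by_borough = {name: part.copy() for name, part in df.groupby("borough")}

print(sorted(by_borough))
print(len(by_borough["BRONX"]))
```

Keeping the pieces in a dictionary makes it possible to loop over boroughs for the later cleaning and export steps instead of repeating each cell five times.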

In [None]:
print(manhattan['BIN'].isnull().sum())
manhattan = manhattan.dropna(subset=['BIN'])
print(manhattan['BIN'].dtype)
# BIN needs to be converted to integers, because in ArcGIS the building data is numeric/long
manhattan['BIN'] = manhattan['BIN'].astype(int)

0
int64
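
The cast to `int` assumes every `BIN` is a whole number with no missing values. A minimal sketch of a guarded version follows; `to_int_bin` is a hypothetical helper, not part of the notebook, and the sample values are made up.

```python
import pandas as pd

def to_int_bin(df: pd.DataFrame, col: str = "BIN") -> pd.DataFrame:
    """Drop rows with a missing key and cast the key to int64, refusing fractional values."""
    out = df.dropna(subset=[col]).copy()
    # A fractional key would be silently truncated by astype, so fail loudly instead.
    if not (out[col] % 1 == 0).all():
        raise ValueError(f"{col} contains non-integer values")
    out[col] = out[col].astype("int64")
    return out

toy = pd.DataFrame({"BIN": [1084520.0, None, 1087539.0]})  # made-up sample
clean = to_int_bin(toy)
print(clean["BIN"].dtype)
print(clean["BIN"].tolist())
```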


In [None]:
print(manhattan['BIN'].unique())
manhattan = manhattan.dropna(subset=['BIN'])
manhattan['BIN'] = manhattan['BIN'].astype(int)
type(manhattan['BIN'].dtype)

[1084520 1087539 1063963 ... 1062905 1005429 1006684]


numpy.dtypes.Int64DType

In [None]:
manhattan.to_csv("manhattan_cleaned.csv", index=False)
files.download("manhattan_cleaned.csv")


In [None]:
print(brooklyn['BIN'].isnull().sum())
brooklyn = brooklyn.dropna(subset=['BIN'])
print(brooklyn['BIN'].dtype)
# BIN needs to be converted to integers, because in ArcGIS the building data is numeric/long
brooklyn['BIN'] = brooklyn['BIN'].astype(int)

0
float64


In [None]:
print(brooklyn['BIN'].unique())
brooklyn = brooklyn.dropna(subset=['BIN'])
brooklyn['BIN'] = brooklyn['BIN'].astype(int)
type(brooklyn['BIN'].dtype)

[3143881 3143435 3034442 ... 3085014 3327967 3147431]


numpy.dtypes.Int64DType

In [None]:
print(staten_island['BIN'].isnull().sum())
staten_island = staten_island.dropna(subset=['BIN'])
print(staten_island['BIN'].dtype)
staten_island['BIN'] = staten_island['BIN'].astype(int)
print(staten_island['BIN'].unique())
staten_island = staten_island.dropna(subset=['BIN'])
staten_island['BIN'] = staten_island['BIN'].astype(int)
type(staten_island['BIN'].dtype)

0
float64
[5087958 5033262 5050009 ... 5169530 5108656 5147292]


numpy.dtypes.Int64DType

In [None]:
print(queens['BIN'].isnull().sum())
queens = queens.dropna(subset=['BIN'])
print(queens['BIN'].dtype)
queens['BIN'] = queens['BIN'].astype(int)
print(queens['BIN'].unique())
queens = queens.dropna(subset=['BIN'])
queens['BIN'] = queens['BIN'].astype(int)
type(queens['BIN'].dtype)

0
float64
[4436442 4074666 4168635 ... 4036623 4518927 4011328]


numpy.dtypes.Int64DType

In [None]:
print(bronx['BIN'].isnull().sum())
bronx = bronx.dropna(subset=['BIN'])
print(bronx['BIN'].dtype)
# BIN needs to be converted to integers, because in ArcGIS the building data is numeric/long
bronx['BIN'] = bronx['BIN'].astype(int)
print(bronx['BIN'].unique())
bronx = bronx.dropna(subset=['BIN'])
bronx['BIN'] = bronx['BIN'].astype(int)
type(bronx['BIN'].dtype)

0
float64
[2015444 2003900 2013945 ... 2028895 1053888 1064154]


numpy.dtypes.Int64DType

In [None]:
manhattan.to_csv("manhattan_cleaned.csv", index=False)
files.download("manhattan_cleaned.csv")
brooklyn.to_csv("brooklyn_cleaned.csv", index=False)
files.download("brooklyn_cleaned.csv")
staten_island.to_csv("staten_island_cleaned.csv", index=False)
files.download("staten_island_cleaned.csv")
queens.to_csv("queens_cleaned.csv", index=False)
files.download("queens_cleaned.csv")
bronx.to_csv("bronx_cleaned.csv", index=False)
files.download("bronx_cleaned.csv")
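
One way to confirm that an exported file keeps integer keys is to read it back and inspect the dtype. The sketch below uses an in-memory buffer and made-up rows rather than the real exports, so it only illustrates the check.

```python
import io
import pandas as pd

toy = pd.DataFrame({"BIN": [1084520, 1087539], "eviction_count": [3, 9]})  # made-up rows

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same call shape as the exports above
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# BIN should come back as an integer column, with no ".0" suffix in the csv text.
print(reloaded["BIN"].dtype)
print(buffer.getvalue().splitlines()[1])
```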
