## Introduction

In this notebook, we take the previously cleaned eviction dataset and re-aggregate the average eviction count by ZIP code, neighborhood (NTA), and borough. \
We also rename columns so the outputs are better suited to a "join and relate" with the SVI dataset in ArcGIS.



BBL data explanations:
https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/64uk-42ks/about_data

Very detailed NYC building information: https://s-media.nyc.gov/agencies/dcp/assets/files/pdf/data-tools/bytes/padgui.pdf

Some other info: https://www.nyc.gov/assets/finance/jump/hlpbldgcode.html


In [2]:
# !pip install geopandas folium matplotlib seaborn scipy
# !pip install esda
# !pip install splot
# # for Google Colab, had to reinstall some packages.

In [3]:
import pandas as pd
import geopandas as gpd
import numpy as np
import datetime as dt
import scipy

from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# visualization
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors
import seaborn as sns
import folium
from folium.plugins import HeatMap
from folium import Marker
from folium.plugins import MarkerCluster
import plotly.express as px
import plotly.io as pio

# spatial statistics
from esda.moran import Moran
from esda import Moran_Local
from esda.getisord import G_Local
from shapely.geometry import Point
from libpysal.weights import Queen, Rook

# system and utility
import warnings
import os
import io
from IPython.display import IFrame
from google.colab import files

# Moran scatterplot helper (Queen, Rook, Moran, and plt are already imported above)
from splot.esda import moran_scatterplot

# suppress warnings
warnings.filterwarnings('ignore')

# inline
%matplotlib inline

## Part 1: Get the evictions data

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# data source:
# eviction GeoDataFrame exported to CSV, already cleaned, with LISA cluster info
file_path = '/content/drive/My Drive/X999/evictions_cleaned_lisa.csv'

In [None]:
# evictions_cleaned_raw.to_csv(file_path, index=False)

In [6]:
evictions_cleaned_raw = pd.read_csv(file_path)

In [7]:
evictions_cleaned = evictions_cleaned_raw.copy()

In [8]:
evictions_cleaned.columns

Index(['court_index_number', 'docket_number', 'eviction_address',
       'eviction_apartment_number', 'executed_date', 'borough',
       'eviction_postcode', 'ejectment', 'eviction/legal_possession',
       'latitude', 'longitude', 'community_board', 'council_district',
       'census_tract', 'bin', 'bbl', 'nta', 'geometry', 'eviction_count',
       'year', 'average_year_eviction_count', 'cluster', 'cluster_k',
       'same_cluster', 'lisa_cluster_rook', 'lisa_pvalue_rook',
       'lisa_cluster_queen', 'lisa_pvalue_queen'],
      dtype='object')

In [9]:
relevant_columns = [
    'eviction_address', 'eviction_apartment_number', 'borough', 'eviction_postcode',
    'latitude', 'longitude', 'bin', 'bbl', 'eviction_count', 'year',
    'average_year_eviction_count', 'geometry'
]

# Filter the DataFrame to keep only relevant columns
evictions_cleaned_filtered = evictions_cleaned[relevant_columns]
evictions_cleaned_filtered.columns

Index(['eviction_address', 'eviction_apartment_number', 'borough',
       'eviction_postcode', 'latitude', 'longitude', 'bin', 'bbl',
       'eviction_count', 'year', 'average_year_eviction_count', 'geometry'],
      dtype='object')

In [10]:
# rename the ZIP code column to match the SVI dataset's column name
evictions_cleaned.rename(columns={"eviction_postcode": "FIPS"}, inplace=True)
evictions_cleaned.head()

Unnamed: 0,court_index_number,docket_number,eviction_address,eviction_apartment_number,executed_date,borough,FIPS,ejectment,eviction/legal_possession,latitude,...,eviction_count,year,average_year_eviction_count,cluster,cluster_k,same_cluster,lisa_cluster_rook,lisa_pvalue_rook,lisa_cluster_queen,lisa_pvalue_queen
0,*313639/23,5202,710 61ST STREET,2ND FLOOR,2024-03-04,BROOKLYN,11220,Not an Ejectment,Possession,40.635941,...,3,2024,3.0,0,0,True,4,0.241,4,0.24
1,*324973/22,5308,462 60TH STREET,FOURTH FLOOR APT AKA,2024-08-13,BROOKLYN,11220,Not an Ejectment,Possession,40.640008,...,3,2024,3.0,0,0,True,4,0.201,4,0.211
2,*53336/16,170279,3400 PAUL AVENUE,15D,2018-10-17,BRONX,10468,Not an Ejectment,Possession,40.87719,...,4,2018,4.0,0,0,True,1,0.058,1,0.058
3,*5990/17,2703,480 CONCORD AVENUE,4E,2019-08-30,BRONX,10455,Not an Ejectment,Possession,40.811197,...,9,2019,2.25,0,0,True,3,0.052,3,0.059
4,000098/17,69483,65 EAST 193RD ST,1B,2017-05-04,BRONX,10468,Not an Ejectment,Possession,40.866075,...,8,2017,2.666667,0,0,True,1,0.431,1,0.437


## Part 2: Aggregate over FIPS (ZIP codes)

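FIPS codes normally identify geographic units such as states and counties, not ZIP codes, but in the particular SVI dataset I am using in ArcGIS the `FIPS` field happens to hold ZIP codes. That is why the eviction ZIP code column was renamed to `FIPS` above.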
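Because the ArcGIS join matches on this key, it may help to give the ZIP codes the same type and width on both sides. The sketch below uses toy values and assumes the SVI `FIPS` field is stored as 5-character text (an assumption; check the field type in ArcGIS first). The helper name `zip_to_key` is hypothetical.

```python
import pandas as pd

# Toy ZIP codes standing in for the eviction table's FIPS column (illustrative only).
toy = pd.DataFrame({"FIPS": [11220, 10468, 10455]})

def zip_to_key(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn numeric ZIP codes into 5-character strings.

    Assumes the column has no missing values.
    """
    return series.astype(int).astype(str).str.zfill(5)

toy["FIPS"] = zip_to_key(toy["FIPS"])
print(toy["FIPS"].tolist())
```

If the key is converted to text like this, reading the exported CSV back into pandas would need `dtype={"FIPS": str}` in `pd.read_csv` to keep it as text.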

In [11]:
# across all years for each ZIP code (FIPS)
average_evictions_all_years_z = evictions_cleaned.groupby('FIPS')['eviction_count'].mean().reset_index()
average_evictions_all_years_z.rename(columns={'eviction_count': 'average_eviction_count_all_years_zip'}, inplace=True)
average_evictions_all_years_z

Unnamed: 0,FIPS,average_eviction_count_all_years_zip
0,10000,12.000000
1,10001,4.606635
2,10002,6.373457
3,10003,1.625806
4,10004,3.111111
...,...,...
196,11691,18.455603
197,11692,63.342640
198,11693,4.842105
199,11694,6.375796


In [12]:
average_evictions_all_years_z.to_csv('zipcode_average_over_years.csv', index=False, encoding='utf-8')
# average_evictions

In [13]:
files.download('zipcode_average_over_years.csv')

## Part 3: Aggregate over boroughs (County name)

In [16]:
# rename to match svi dataset's column name
evictions_cleaned.rename(columns={'borough': 'County name'}, inplace=True)
evictions_cleaned.columns

Index(['court_index_number', 'docket_number', 'eviction_address',
       'eviction_apartment_number', 'executed_date', 'County name', 'FIPS',
       'ejectment', 'eviction/legal_possession', 'latitude', 'longitude',
       'community_board', 'council_district', 'census_tract', 'bin', 'bbl',
       'nta', 'geometry', 'eviction_count', 'year',
       'average_year_eviction_count', 'cluster', 'cluster_k', 'same_cluster',
       'lisa_cluster_rook', 'lisa_pvalue_rook', 'lisa_cluster_queen',
       'lisa_pvalue_queen'],
      dtype='object')

In [17]:
# across all years for each borough
average_evictions_all_years_b = evictions_cleaned.groupby('County name')['eviction_count'].mean().reset_index()
average_evictions_all_years_b.rename(columns={'eviction_count': 'average_eviction_count_all_years'}, inplace=True)
average_evictions_all_years_b


Unnamed: 0,County name,average_eviction_count_all_years
0,BRONX,13.813084
1,BROOKLYN,6.54954
2,MANHATTAN,6.651513
3,QUEENS,9.617091
4,STATEN ISLAND,8.326641


In [18]:
average_evictions_all_years_b.to_csv('borough_average_over_years.csv', index=False, encoding='utf-8')
# average_evictions

In [19]:
files.download('borough_average_over_years.csv')

## Part 4: Aggregate over neighborhood tabulation areas (NTA)

(kept the borough info)

In [20]:
len(evictions_cleaned.nta.unique())

190

In [21]:
# groupby nta and borough to calculate average eviction counts across all years
average_evictions_nta = evictions_cleaned.groupby(['nta', 'County name'])['eviction_count'].mean().reset_index()
average_evictions_nta.rename(columns={'eviction_count': 'average_eviction_count_all_years'}, inplace=True)
len(average_evictions_nta)

196

In [22]:
# check whether any NTAs are associated with more than one borough
nta_borough_check = evictions_cleaned.groupby('nta')['County name'].nunique().reset_index()
nta_borough_check = nta_borough_check[nta_borough_check['County name'] > 1]
nta_borough_check
# these NTAs are why the lengths do not match (196 groups vs. 190 unique NTAs)


Unnamed: 0,nta,County name
28,Central Harlem North-Polo Grounds,2
41,Cypress Hills-City Line,2
80,Highbridge,2
101,Marble Hill-Inwood,2
118,North Riverdale-Fieldston-Riverdale,2
171,Washington Heights North,2
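
Before choosing which borough label to keep, it can be useful to see how many rows fall under each label for the affected NTAs. This is a minimal sketch on toy rows; the counts are made up for illustration and are not taken from the eviction data, and "Some Other NTA" is a placeholder name.

```python
import pandas as pd

# Toy rows: one NTA labelled with two boroughs, one labelled consistently.
toy = pd.DataFrame({
    "nta": ["Highbridge", "Highbridge", "Highbridge", "Some Other NTA"],
    "County name": ["BRONX", "BRONX", "MANHATTAN", "QUEENS"],
})

# NTAs that appear under more than one borough label
n_boroughs = toy.groupby("nta")["County name"].nunique()
ambiguous = n_boroughs[n_boroughs > 1].index

# Row counts per (NTA, borough) for the ambiguous NTAs only
split_counts = (
    toy[toy["nta"].isin(ambiguous)]
    .groupby(["nta", "County name"])
    .size()
    .unstack(fill_value=0)
)
print(split_counts)
```

A heavily lopsided split would suggest that the minority label comes from a handful of miscoded rows rather than a neighborhood that truly spans two boroughs.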


### Fix this issue by manually assigning the correct borough to each neighborhood
### Info mostly from nyc.gov and Wikipedia

Sources: \
https://www.nyc.gov/assets/sbs/downloads/pdf/neighborhoods/avenyc-cdna-washington-heights.pdf \
https://en.wikipedia.org/wiki/Marble_Hill,_Manhattan#:~:text=%22Marble%20Hill%20is%20a%20neighborhood,within%20Bronx%20Community%20District%208.%22 \
https://en.wikipedia.org/wiki/Fieldston,_Bronx#:~:text=Fieldston%20is%20a%20privately%20owned,City%20borough%20of%20the%20Bronx. \
https://www.nyc.gov/html/mancb10/html/faq/faq.shtml#:~:text=Community%20Board%2010%20covers%20the%20neighborhood%20of%20Harlem%20and%20Polo,Fordham%20Cliffs%20to%20the%20west. \
https://www.cityneighborhoods.nyc/cypress-hills#:~:text=Cypress%20Hills%20is%20a%20diverse,the%20eastern%20section%20of%20Brooklyn. \
https://en.wikipedia.org/wiki/Highbridge,_Bronx#:~:text=Highbridge%20is%20a%20residential%20neighborhood,the%20Bronx%2C%20New%20York%20City.


In [46]:
nta_borough_mapping = {
    "Central Harlem North-Polo Grounds": "MANHATTAN",
    "Cypress Hills-City Line": "BROOKLYN",
    "Highbridge": "BRONX",
    "Marble Hill-Inwood": "MANHATTAN",
    "North Riverdale-Fieldston-Riverdale": "BRONX",
    "Washington Heights North": "MANHATTAN"
}

In [47]:
# overwrite the borough only for NTAs listed in the mapping (vectorized, no row loop)
mask = evictions_cleaned['nta'].isin(list(nta_borough_mapping))
evictions_cleaned.loc[mask, 'County name'] = evictions_cleaned.loc[mask, 'nta'].map(nta_borough_mapping)

In [48]:
updated_ntas = evictions_cleaned[evictions_cleaned['nta'].isin(nta_borough_mapping.keys())]
updated_ntas[['nta', 'County name']].drop_duplicates()

Unnamed: 0,nta,County name
13,Central Harlem North-Polo Grounds,MANHATTAN
19,Highbridge,BRONX
65,Washington Heights North,MANHATTAN
220,Marble Hill-Inwood,MANHATTAN
605,Cypress Hills-City Line,BROOKLYN
2796,North Riverdale-Fieldston-Riverdale,BRONX


In [49]:
# re-check whether any NTAs are still associated with more than one borough
nta_borough_check = evictions_cleaned.groupby('nta')['County name'].nunique().reset_index()
nta_borough_check = nta_borough_check[nta_borough_check['County name'] > 1]
nta_borough_check
# should be empty now

Unnamed: 0,nta,County name


In [50]:
# re-aggregations
average_evictions_nta_updated = evictions_cleaned.groupby(['nta', 'County name'])['eviction_count'].mean().reset_index()
average_evictions_nta_updated.rename(columns={'eviction_count': 'average_eviction_count_all_years'}, inplace=True)

In [51]:
average_evictions_nta_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 3 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   nta                               190 non-null    object 
 1   County name                       190 non-null    object 
 2   average_eviction_count_all_years  190 non-null    float64
dtypes: float64(1), object(2)
memory usage: 4.6+ KB
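
A one-to-one join on NTA only works if each NTA appears in exactly one row, so a small explicit check before exporting may be worthwhile. The sketch below runs on a toy table with made-up averages; `check_unique_key` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def check_unique_key(df: pd.DataFrame, key: str) -> None:
    """Hypothetical helper: raise if any value of `key` appears in more than one row."""
    dupes = df.loc[df[key].duplicated(keep=False), key].unique().tolist()
    if dupes:
        raise ValueError(f"{key!r} is not unique; repeated values: {dupes}")

# Toy aggregation table with the same columns as the NTA output (values are illustrative).
toy_agg = pd.DataFrame({
    "nta": ["Highbridge", "Marble Hill-Inwood"],
    "County name": ["BRONX", "MANHATTAN"],
    "average_eviction_count_all_years": [1.5, 2.0],
})

check_unique_key(toy_agg, "nta")  # passes silently when every NTA appears once
```

The same call on the real table would be `check_unique_key(average_evictions_nta_updated, "nta")`.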


In [53]:
average_evictions_nta_updated.to_csv('nta_average_over_years.csv', index=False, encoding='utf-8')

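### Note: the earlier length mismatch (196 vs. 190) is likely because some NTAs were associated with multiple boroughs, which happens when the same neighborhood spans more than one borough. The manual borough assignment above resolves this, so the exported file has one row per NTA.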

In [54]:
files.download('nta_average_over_years.csv')

In [55]:
file_path2 = '/content/drive/My Drive/X999/average_eviction_per_nta.csv'
file_path3 = '/content/drive/My Drive/X999/average_eviction_per_borough.csv'
file_path4 = '/content/drive/My Drive/X999/average_eviction_per_zipcode.csv'

In [56]:
# save the borough-corrected NTA aggregation (190 rows), not the earlier 196-row version
average_evictions_nta_updated.to_csv(file_path2, index=False)
average_evictions_all_years_b.to_csv(file_path3, index=False)
average_evictions_all_years_z.to_csv(file_path4, index=False)