# **Introduction**

In this notebook, we perform the final merge of all relevant data for the COVID period: BBL, evictions, SVI scores, and 311 complaints. We first load the already merged bbl_evictions_svi dataset and drop its NaNs for analysis (the previous version kept the NaNs for retrieval purposes, in case we need them later). Then we combine the 311 complaints for 2020 to 2022 and clean their NaNs. Next, we group the complaint data by bbl and complaint category and reshape the counts into a wide pivot table. Finally, we merge the pivot table with the bbl_evictions_svi df to arrive at the final merged and cleaned df.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
import os
import io
import geopandas as gpd
import seaborn as sns

# suppress warning
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
# display all columns

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Step 1: get the bbl evictions svi merged data**

In [4]:
file_path1 = '/content/drive/My Drive/X999/merged_df_clean_covid.csv'

In [5]:
bbl_evictions_svi = pd.read_csv(file_path1)

In [6]:
bbl_evictions_svi.shape

(6450, 69)

In [7]:
nan_counts = bbl_evictions_svi.isna().sum()
columns_with_nans = nan_counts[nan_counts > 0]
columns_with_nans

Unnamed: 0,0
yearbuilt,344
bldgclass,344
numfloors,344
unitsres,344
ownername,344
bldgarea,344
building_type,344
building_category,344
is_condo,344
floor_category,344


In [8]:
nan_percentage = (bbl_evictions_svi.isna().sum() / len(bbl_evictions_svi)) * 100
nan_percentage = nan_percentage[nan_percentage > 0]
nan_percentage = nan_percentage.sort_values(ascending=False)
nan_percentage

Unnamed: 0,0
yearbuilt,5.333333
bldgclass,5.333333
numfloors,5.333333
unitsres,5.333333
ownername,5.333333
bldgarea,5.333333
building_type,5.333333
building_category,5.333333
is_condo,5.333333
floor_category,5.333333


## **There is not much we can do with these NaN values, as they cannot be imputed with high confidence. They occurred because these bbls in the eviction dataset had no match in the bbl dataset. For pure retrieval purposes we can keep the NaNs, but for any other analysis (what we mainly care about here) we remove them.**
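As a minimal sketch of this step, the removed share can be computed from the row counts before and after `dropna()` rather than typed in by hand. The frame `toy` and its values are made up for illustration and only stand in for the merged table; the original frame is left untouched so the NaN rows remain available for retrieval.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged eviction/BBL table (all values are made up).
toy = pd.DataFrame({
    "primary_key": ["a", "b", "c", "d"],
    "bbl": [1, 2, 3, 4],
    "yearbuilt": [1920.0, np.nan, 1960.0, 2004.0],
    "numfloors": [2.0, np.nan, 13.0, 3.0],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # analysis copy; `toy` still holds the NaN rows
rows_removed = rows_before - len(toy_clean)
pct_removed = rows_removed / rows_before * 100

print(toy_clean.shape, rows_removed, f"{pct_removed:.2f} % removed")
```

Deriving the percentage from `rows_before` keeps the denominator tied to the table as it was before the drop.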

In [11]:
bbl_evictions_svi = bbl_evictions_svi.dropna()
bbl_evictions_svi.shape, 6450 - 344, f'{344 / 6450*100:.2f} % removed'

((6106, 69), 6106, '5.33 % removed')

In [12]:
bbl_evictions_svi.isna().sum().sum()

np.int64(0)

In [13]:
bbl_evictions_svi.columns, bbl_evictions_svi.shape

(Index(['primary_key', 'court_index_number', 'docket_number', 'eviction_address',
        'eviction_apartment_number', 'executed_date', 'borough', 'zipcode', 'ejectment',
        'eviction/legal_possession', 'latitude', 'longitude', 'community_board', 'council_district',
        'census_tract', 'bin', 'bbl', 'nta', 'year', 'month_year', 'geometry',
        'average_year_eviction_count', 'yearbuilt', 'bldgclass', 'numfloors', 'unitsres',
        'ownername', 'bldgarea', 'building_type', 'building_category', 'is_condo', 'floor_category',
        'rent_era', 'architectural_style', 'economic_period', 'residential_units_category',
        'is_llc', 'building_size_category', 'size_quartile', 'decade', 'fips', 'e_totpop',
        'rpl_theme1', 'rpl_theme2', 'rpl_theme3', 'rpl_theme4', 'rpl_themes', 'ep_pov150',
        'ep_unemp', 'ep_nohsdp', 'ep_uninsur', 'ep_age65', 'ep_age17', 'ep_disabl', 'ep_limeng',
        'ep_noveh', 'ep_crowd', 'ep_hburd', 'ep_afam', 'ep_hisp', 'ep_asian', 'ep_aian'

# **Step 2: get the combined 311 complaints data**

In [15]:
# saved_2017 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2017_reduced.csv"
# saved_2018 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2018_reduced.csv"
# saved_2019 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2019_reduced.csv"
saved_2020 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2020_reduced.csv"
saved_2021 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2021_reduced.csv"
saved_2022 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2022_reduced.csv"
# saved_2023 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2023_reduced.csv"
# saved_2024 = "/content/drive/My Drive/X999/311_different_years/filtered_df_2024_reduced.csv"

In [16]:
# df_2017 = pd.read_csv(saved_2017)
# df_2018 = pd.read_csv(saved_2018)
# df_2019 = pd.read_csv(saved_2019)
df_2020 = pd.read_csv(saved_2020)
df_2021 = pd.read_csv(saved_2021)
df_2022 = pd.read_csv(saved_2022)
# df_2023 = pd.read_csv(saved_2023)
# df_2024 = pd.read_csv(saved_2024)

In [17]:
covid_311_df = pd.concat([df_2020, df_2021, df_2022])

In [18]:
covid_311_df.head()

Unnamed: 0,unique_key,created_date,closed_date,complaint_type,incident_zip,incident_address,bbl,borough,latitude,longitude
0,48538697,2020-12-31 23:59:55,2021-01-01 01:07:04,Noise - Vehicle,10460.0,1569 HOE AVENUE,2029820000.0,BRONX,40.83582,-73.887516
1,48536596,2020-12-31 23:59:28,2021-01-01 01:33:12,Noise - Residential,10028.0,235 EAST 83 STREET,1015290000.0,MANHATTAN,40.776503,-73.954525
2,48536500,2020-12-31 23:58:55,2021-01-01 00:24:54,Noise - Residential,10468.0,2380 GRAND AVENUE,2031990000.0,BRONX,40.861553,-73.904168
3,48542024,2020-12-31 23:58:45,2021-01-14 16:49:17,Noise - Helicopter,10003.0,195 1 AVENUE,1004530000.0,MANHATTAN,40.729916,-73.983616
4,48543542,2020-12-31 23:58:39,2021-01-01 00:13:47,Noise - Residential,10034.0,571 ACADEMY STREET,1022218000.0,MANHATTAN,40.863565,-73.923221


In [19]:
covid_311_df.columns, covid_311_df.shape

(Index(['unique_key', 'created_date', 'closed_date', 'complaint_type', 'incident_zip',
        'incident_address', 'bbl', 'borough', 'latitude', 'longitude'],
       dtype='object'),
 (4052446, 10))

In [20]:
covid_311_df.bbl = covid_311_df.bbl.astype('int64')

In [21]:
covid_311_df.isna().sum().sum(), covid_311_df.duplicated().sum()

(np.int64(203), np.int64(0))

In [22]:
covid_311_df.isna().sum()

Unnamed: 0,0
unique_key,0
created_date,0
closed_date,0
complaint_type,0
incident_zip,99
incident_address,0
bbl,0
borough,2
latitude,51
longitude,51


## **In this case, it makes sense to simply fill the NaNs with the string 'unknown' or the integer 0, depending on the column. These columns are not essential: once the 311 data is merged with the bbl_evictions_svi data, the corresponding columns from the main table are used instead. The two missing borough values are left untouched, which is why two NaNs remain in the check below. We can drop these columns afterwards if they cause problems.**
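One compact way to express "fill by column type" is a single `fillna` call with a dict of per-column fill values. This is only a sketch on a made-up two-row frame (`toy_311` and `fill_values` are illustrative names). Unlike the notebook cells, it also fills `borough`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 311 frame; the second row is missing everything.
toy_311 = pd.DataFrame({
    "incident_zip": [10460.0, np.nan],
    "borough": ["BRONX", np.nan],
    "latitude": [40.83582, np.nan],
    "longitude": [-73.887516, np.nan],
})

# Numeric columns get 0, text columns get the string 'unknown'.
fill_values = {
    "incident_zip": 0,
    "latitude": 0,
    "longitude": 0,
    "borough": "unknown",
}
toy_311_filled = toy_311.fillna(value=fill_values)
remaining_nans = int(toy_311_filled.isna().sum().sum())

print(toy_311_filled)
print("remaining NaNs:", remaining_nans)
```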

In [24]:
covid_311_df['incident_address'] = covid_311_df['incident_address'].fillna('unknown')

In [23]:
# other_columns = ['incident_zip', 'latitude', 'longitude']
covid_311_df['incident_zip'] = covid_311_df['incident_zip'].fillna(0)
covid_311_df['latitude'] = covid_311_df['latitude'].fillna(0)
covid_311_df['longitude'] = covid_311_df['longitude'].fillna(0)

In [25]:
covid_311_df.shape, covid_311_df.isna().sum().sum(), covid_311_df.duplicated().sum()

((4052446, 10), np.int64(2), np.int64(0))

# **Step 3: merge bbl_evictions_svi with 311 complaints data**

### It turns out we do need a **pivot table**, but we need to group by first to make the merge more seamless. Doing so also sidesteps the NaN issues from the steps above, because the columns with problematic data are not used at all.

In [26]:
bbl_evictions_svi.columns, bbl_evictions_svi.shape

(Index(['primary_key', 'court_index_number', 'docket_number', 'eviction_address',
        'eviction_apartment_number', 'executed_date', 'borough', 'zipcode', 'ejectment',
        'eviction/legal_possession', 'latitude', 'longitude', 'community_board', 'council_district',
        'census_tract', 'bin', 'bbl', 'nta', 'year', 'month_year', 'geometry',
        'average_year_eviction_count', 'yearbuilt', 'bldgclass', 'numfloors', 'unitsres',
        'ownername', 'bldgarea', 'building_type', 'building_category', 'is_condo', 'floor_category',
        'rent_era', 'architectural_style', 'economic_period', 'residential_units_category',
        'is_llc', 'building_size_category', 'size_quartile', 'decade', 'fips', 'e_totpop',
        'rpl_theme1', 'rpl_theme2', 'rpl_theme3', 'rpl_theme4', 'rpl_themes', 'ep_pov150',
        'ep_unemp', 'ep_nohsdp', 'ep_uninsur', 'ep_age65', 'ep_age17', 'ep_disabl', 'ep_limeng',
        'ep_noveh', 'ep_crowd', 'ep_hburd', 'ep_afam', 'ep_hisp', 'ep_asian', 'ep_aian'

In [27]:
covid_311_df.columns, covid_311_df.shape, bbl_evictions_svi.columns, bbl_evictions_svi.shape

(Index(['unique_key', 'created_date', 'closed_date', 'complaint_type', 'incident_zip',
        'incident_address', 'bbl', 'borough', 'latitude', 'longitude'],
       dtype='object'),
 (4052446, 10),
 Index(['primary_key', 'court_index_number', 'docket_number', 'eviction_address',
        'eviction_apartment_number', 'executed_date', 'borough', 'zipcode', 'ejectment',
        'eviction/legal_possession', 'latitude', 'longitude', 'community_board', 'council_district',
        'census_tract', 'bin', 'bbl', 'nta', 'year', 'month_year', 'geometry',
        'average_year_eviction_count', 'yearbuilt', 'bldgclass', 'numfloors', 'unitsres',
        'ownername', 'bldgarea', 'building_type', 'building_category', 'is_condo', 'floor_category',
        'rent_era', 'architectural_style', 'economic_period', 'residential_units_category',
        'is_llc', 'building_size_category', 'size_quartile', 'decade', 'fips', 'e_totpop',
        'rpl_theme1', 'rpl_theme2', 'rpl_theme3', 'rpl_theme4', 'rpl_themes'

In [28]:
court_bbl_map = bbl_evictions_svi[['primary_key', 'bbl']].drop_duplicates()
court_bbl_map.shape
# there are actually no duplicates: still 6106 rows, good

(6106, 2)

In [29]:
def categorize_complaint(complaint_type):
    complaint = complaint_type.lower().strip()

    # building systems and utilities stuff
    if 'heat' in complaint or 'hot water' in complaint:
        return 'heat_hot_water'
    elif any(term in complaint for term in ['water leak', 'plumbing', 'sewage']):
        return 'plumbing_issues'
    elif 'electric' in complaint:
        return 'electrical_issues'
    elif 'elevator' in complaint:
        return 'elevator_issues'

    # building structure and maintenance
    elif 'door' in complaint or 'window' in complaint:
        return 'doors_windows'
    elif any(term in complaint for term in ['paint', 'plaster', 'mold']):
        return 'walls_ceilings'
    elif 'floor' in complaint or 'stair' in complaint:
        return 'floors_stairs'
    elif 'outside building' in complaint:
        return 'building_exterior'
    elif 'appliance' in complaint:
        return 'appliances'

    # health and environmental impact
    elif 'unsanitary' in complaint or 'condition' in complaint:
        return 'sanitation_issues'
    elif any(pest in complaint for pest in ['rodent', 'mosquito', 'bee', 'wasp', 'pigeon']):
        return 'pest_issues'
    elif 'air' in complaint or 'asbestos' in complaint or 'smoking' in complaint:
        return 'air_quality'

    # noise (all noise complaints together)
    elif 'noise' in complaint:
        return 'noise_complaints'

    # public space influences and nuances
    elif 'homeless' in complaint or 'encampment' in complaint:
        return 'homeless_issues'
    elif 'graffiti' in complaint or 'advertisement' in complaint:
        return 'graffiti_posting'
    elif any(nuisance in complaint for nuisance in ['disorderly', 'panhandling', 'drinking', 'urinating', 'fireworks']):
        return 'public_nuisance'

    # living safety and services
    elif 'safety' in complaint:
        return 'safety_concerns'
    elif 'animal' in complaint or 'abuse' in complaint:
        return 'animal_issues'
    elif 'police' in complaint:
        return 'police_matters'

    # miscellaneous
    elif 'general' in complaint:
        return 'general_complaints'
    else:
        return 'other_issues'

## **We replace the raw complaint types with broader categories to reduce the number of columns needed in the merged table. First, we regroup the complaint types and count the complaints in each category per bbl. Then we use a pivot table to turn the category names into columns holding those counts. Finally, we merge it with bbl_evictions_svi, so that the count of each complaint category associated with each bbl is preserved, and the table is smaller (than if we had not categorized) and easier to merge.**
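The whole count, pivot, and merge sequence can be sketched on toy data as follows. The frames `complaints` and `evictions` and their values are invented for illustration; only the column names `bbl`, `complaint_category`, and `primary_key` mirror the notebook.

```python
import pandas as pd

# Toy long-format complaints: one row per complaint (made-up values).
complaints = pd.DataFrame({
    "bbl": [101, 101, 101, 202],
    "complaint_category": [
        "noise_complaints", "noise_complaints",
        "heat_hot_water", "noise_complaints",
    ],
})
# Toy eviction table: two evictions share bbl 101.
evictions = pd.DataFrame({"primary_key": ["e1", "e2", "e3"],
                          "bbl": [101, 101, 202]})

# 1) count complaints per (bbl, category)
counts = (complaints.groupby(["bbl", "complaint_category"])
          .size().reset_index(name="count"))

# 2) pivot to one row per bbl, one column per category
wide = (counts.pivot(index="bbl", columns="complaint_category", values="count")
        .fillna(0).reset_index())

# 3) attach the per-bbl counts to every eviction row
merged = evictions.merge(wide, on="bbl", how="left")
print(merged)
```

Because the wide table has exactly one row per bbl, the left merge keeps the eviction table at one row per eviction.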

In [30]:
covid_311_df['complaint_category'] = covid_311_df['complaint_type'].apply(categorize_complaint)

In [31]:
covid_311_df.shape
# added a new column to label the complaint category of each row. Later we will use the wide form to expand all the
# values in this column into their own columns to form a pivot table

(4052446, 11)

In [32]:
covid_311_df.head()

Unnamed: 0,unique_key,created_date,closed_date,complaint_type,incident_zip,incident_address,bbl,borough,latitude,longitude,complaint_category
0,48538697,2020-12-31 23:59:55,2021-01-01 01:07:04,Noise - Vehicle,10460.0,1569 HOE AVENUE,2029820027,BRONX,40.83582,-73.887516,noise_complaints
1,48536596,2020-12-31 23:59:28,2021-01-01 01:33:12,Noise - Residential,10028.0,235 EAST 83 STREET,1015290018,MANHATTAN,40.776503,-73.954525,noise_complaints
2,48536500,2020-12-31 23:58:55,2021-01-01 00:24:54,Noise - Residential,10468.0,2380 GRAND AVENUE,2031990003,BRONX,40.861553,-73.904168,noise_complaints
3,48542024,2020-12-31 23:58:45,2021-01-14 16:49:17,Noise - Helicopter,10003.0,195 1 AVENUE,1004530034,MANHATTAN,40.729916,-73.983616,noise_complaints
4,48543542,2020-12-31 23:58:39,2021-01-01 00:13:47,Noise - Residential,10034.0,571 ACADEMY STREET,1022217501,MANHATTAN,40.863565,-73.923221,noise_complaints


In [33]:
covid_311_df.isna().sum().sum(), covid_311_df.duplicated().sum()

(np.int64(2), np.int64(0))

In [34]:
covid_311_df.shape
# no duplicates, 4052446

(4052446, 11)

In [35]:
covid_311_df.columns

Index(['unique_key', 'created_date', 'closed_date', 'complaint_type', 'incident_zip',
       'incident_address', 'bbl', 'borough', 'latitude', 'longitude', 'complaint_category'],
      dtype='object')

In [36]:
bbl_evictions_svi.columns

Index(['primary_key', 'court_index_number', 'docket_number', 'eviction_address',
       'eviction_apartment_number', 'executed_date', 'borough', 'zipcode', 'ejectment',
       'eviction/legal_possession', 'latitude', 'longitude', 'community_board', 'council_district',
       'census_tract', 'bin', 'bbl', 'nta', 'year', 'month_year', 'geometry',
       'average_year_eviction_count', 'yearbuilt', 'bldgclass', 'numfloors', 'unitsres',
       'ownername', 'bldgarea', 'building_type', 'building_category', 'is_condo', 'floor_category',
       'rent_era', 'architectural_style', 'economic_period', 'residential_units_category',
       'is_llc', 'building_size_category', 'size_quartile', 'decade', 'fips', 'e_totpop',
       'rpl_theme1', 'rpl_theme2', 'rpl_theme3', 'rpl_theme4', 'rpl_themes', 'ep_pov150',
       'ep_unemp', 'ep_nohsdp', 'ep_uninsur', 'ep_age65', 'ep_age17', 'ep_disabl', 'ep_limeng',
       'ep_noveh', 'ep_crowd', 'ep_hburd', 'ep_afam', 'ep_hisp', 'ep_asian', 'ep_aian', 'ep_nhpi'

In [37]:
bbl_evictions_svi.bbl.dtype

dtype('int64')

In [38]:
# count each category for each bbl
# group the complaints by bbl and categories and then count them
bbl_category_counts = covid_311_df.groupby(['bbl', 'complaint_category']).size().reset_index(name='count')

In [39]:
bbl_category_counts.bbl = bbl_category_counts.bbl.astype('int64')

In [40]:
bbl_category_counts

Unnamed: 0,bbl,complaint_category,count
0,0,animal_issues,1
1,0,appliances,7
2,0,doors_windows,8
3,0,electrical_issues,2
4,0,elevator_issues,16
...,...,...,...
639730,5270000501,sanitation_issues,1
639731,5270000504,sanitation_issues,1
639732,5270000506,noise_complaints,1
639733,5270000508,noise_complaints,1


## **A pivot table transformation is necessary here, because we want this table in a "wide" format so that:**

- each row represents a single bbl
- each complaint category becomes its own column
- the values show the count for each category
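A possible alternative for producing the same wide shape is `unstack(fill_value=0)`, sketched here on made-up data. It fills missing bbl/category combinations with 0 directly, so the counts stay integers instead of becoming floats as they do after `pivot(...).fillna(0)`. The names `long_counts` and `wide_int` are illustrative only.

```python
import pandas as pd

# Toy long-format counts: one row per (bbl, category) pair (made-up values).
long_counts = pd.DataFrame({
    "bbl": [101, 101, 202],
    "complaint_category": ["heat_hot_water", "noise_complaints",
                           "noise_complaints"],
    "count": [1, 2, 1],
})

# Move the category level into the columns; absent pairs become 0.
wide_int = (
    long_counts
    .set_index(["bbl", "complaint_category"])["count"]
    .unstack(fill_value=0)
    .reset_index()
)
print(wide_int)
print(wide_int.dtypes)
```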

In [41]:
# use a pivot here to make this a wide format
# with the complaint categories as columns
bbl_complaints_wide = bbl_category_counts.pivot(
    index='bbl',
    columns='complaint_category',
    values='count'
).fillna(0).reset_index()

In [42]:
bbl_complaints_wide.isna().sum().sum(), bbl_complaints_wide.duplicated().sum()

(np.int64(0), np.int64(0))

In [44]:
bbl_evictions_svi.bbl.nunique(), covid_311_df.bbl.nunique(), bbl_complaints_wide.bbl.nunique(), bbl_complaints_wide.shape

(4827, 282098, 282098, (282098, 22))

In [45]:
bbl_complaints_wide
# correct shape, (282098, 22)

complaint_category,bbl,air_quality,animal_issues,appliances,building_exterior,doors_windows,electrical_issues,elevator_issues,floors_stairs,general_complaints,graffiti_posting,heat_hot_water,homeless_issues,noise_complaints,other_issues,pest_issues,plumbing_issues,police_matters,public_nuisance,safety_concerns,sanitation_issues,walls_ceilings
0,0,0.0,1.0,7.0,0.0,8.0,2.0,16.0,6.0,9.0,1.0,43.0,1.0,244.0,22.0,19.0,47.0,3.0,0.0,0.0,17.0,13.0
1,1000010010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0
2,1000010101,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1000010201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1000020001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282093,5270000501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
282094,5270000504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
282095,5270000506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
282096,5270000508,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
len(bbl_complaints_wide.columns) - 1

21

In [47]:
all_categories = [
    'heat_hot_water', 'plumbing_issues', 'electrical_issues', 'elevator_issues',
    'doors_windows', 'walls_ceilings', 'floors_stairs', 'building_exterior',
    'appliances', 'sanitation_issues', 'pest_issues', 'air_quality',
    'noise_complaints', 'homeless_issues', 'graffiti_posting', 'public_nuisance',
    'safety_concerns', 'animal_issues', 'police_matters', 'general_complaints',
    'other_issues'
]
# complete
len(all_categories)

21

In [48]:
# add a total column
bbl_complaints_wide['total_complaints'] = bbl_complaints_wide[all_categories].sum(axis=1)

In [49]:
bbl_complaints_wide
# so far, we have the 311 complaint part figured out

complaint_category,bbl,air_quality,animal_issues,appliances,building_exterior,doors_windows,electrical_issues,elevator_issues,floors_stairs,general_complaints,graffiti_posting,heat_hot_water,homeless_issues,noise_complaints,other_issues,pest_issues,plumbing_issues,police_matters,public_nuisance,safety_concerns,sanitation_issues,walls_ceilings,total_complaints
0,0,0.0,1.0,7.0,0.0,8.0,2.0,16.0,6.0,9.0,1.0,43.0,1.0,244.0,22.0,19.0,47.0,3.0,0.0,0.0,17.0,13.0,459.0
1,1000010010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,12.0
2,1000010101,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,1000010201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1000020001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282093,5270000501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,8.0
282094,5270000504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
282095,5270000506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
282096,5270000508,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [50]:
bbl_evictions_svi.bbl.dtype, bbl_complaints_wide.bbl.dtype

(dtype('int64'), dtype('int64'))

In [51]:
bbl_complaints_wide.shape

(282098, 23)

In [52]:
bbl_evictions_svi.head()

Unnamed: 0,primary_key,court_index_number,docket_number,eviction_address,eviction_apartment_number,executed_date,borough,zipcode,ejectment,eviction/legal_possession,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,year,month_year,geometry,average_year_eviction_count,yearbuilt,bldgclass,numfloors,unitsres,ownername,bldgarea,building_type,building_category,is_condo,floor_category,rent_era,architectural_style,economic_period,residential_units_category,is_llc,building_size_category,size_quartile,decade,fips,e_totpop,rpl_theme1,rpl_theme2,rpl_theme3,rpl_theme4,rpl_themes,ep_pov150,ep_unemp,ep_nohsdp,ep_uninsur,ep_age65,ep_age17,ep_disabl,ep_limeng,ep_noveh,ep_crowd,ep_hburd,ep_afam,ep_hisp,ep_asian,ep_aian,ep_nhpi,ep_twomore,ep_otherrace,ep_minrty,ep_white,invalid_zip,svi_quartile
0,004123/20_209969,004123/20,209969,2541 A GRAND AVE,ROOM 3B,2022-08-22,BRONX,10468,Not an Ejectment,Possession,40.865396,-73.901317,7.0,14.0,265.0,2113173,2032140141,Kingsbridge Heights,2022,2022-08,POINT (-73.901317 40.865396),0.2,2004.0,C0,3.0,3.0,MONJU SARKER,3420.0,post-war,walk-up,False,low-rise,"1994–Present, vacancy decontrol","2001-present, New Architecture","1991–2008, modern economic growth",3-5 units,False,medium-small,Q4 (largest 25%),2000-2009,10468,81397.0,0.9954,0.9407,0.987,0.947,0.9874,39.5,11.6,28.3,9.2,11.2,26.4,12.2,26.9,71.8,19.2,56.7,15.6,78.0,2.3,0.0,0.0,0.5,0.5,96.9,3.1,False,Q3
1,0050153/20_106030,0050153/20,106030,98-05 67TH AVENUE,12F,2022-04-14,QUEENS,11375,Not an Ejectment,Possession,40.724241,-73.855552,6.0,29.0,71306.0,4074666,4031560133,Forest Hills,2022,2022-04,POINT (-73.855552 40.724241),0.2,1960.0,D3,13.0,181.0,MARSEILLES LEASING LIMITED PARTNERSHIP,177710.0,post-war,elevator,False,high-rise,"1947–1969, rent-control","1951–1980, the International Style, Alternative Modernism","1946–1975, pst war economic boom",100+ units,False,mega,Q4 (largest 25%),1960-1969,11375,75212.0,0.4759,0.5698,0.8789,0.8057,0.7322,12.0,4.8,6.1,3.7,20.4,18.0,10.5,7.9,41.9,5.8,25.4,2.7,16.4,28.5,0.1,0.0,4.6,0.7,53.0,47.0,False,Q1 (Low)
2,0052002/19_101926,0052002/19,101926,199 VERONICA PLACE,1ST FLOOR,2020-03-02,BROOKLYN,11226,Not an Ejectment,Possession,40.645404,-73.952578,17.0,40.0,792.0,3117969,3051370021,Erasmus,2020,2020-03,POINT (-73.952578 40.645404),0.6,1920.0,B3,2.0,2.0,"AANS, LLC.",1496.0,pre-war,two-family,False,low-rise,"Pre-1947, pre-rent-control","1900–1920, Beaux-Arts","Pre-1929, pre-great depression",2-unit,True,very small,Q2 (25-50%),1920-1929,11226,101053.0,0.93,0.4536,0.9639,0.9692,0.922,23.7,5.9,13.9,9.1,13.1,18.7,6.7,5.6,66.1,10.0,39.2,63.2,14.9,3.2,0.3,0.0,4.1,0.7,86.3,13.7,False,Q2
3,0057757/18_100889,0057757/18,100889,302 EASTERN PARKWAY,4B,2020-02-03,BROOKLYN,11225,Not an Ejectment,Possession,40.670832,-73.958843,9.0,35.0,213.0,3029673,3011850034,Crown Heights South,2020,2020-02,POINT (-73.958843 40.670832),0.8,1923.0,D1,6.0,48.0,302 EASTERN CORP,42984.0,pre-war,elevator,False,mid-rise,"Pre-1947, pre-rent-control","1921–1930, Art Deco Skyscrapers","Pre-1929, pre-great depression",21-100 units,False,very large,Q4 (largest 25%),1920-1929,11225,58476.0,0.8905,0.3157,0.933,0.8342,0.8538,23.1,6.6,11.5,5.9,15.3,16.7,9.6,2.2,66.2,6.9,37.3,53.7,10.8,3.3,0.0,0.0,3.9,0.9,72.6,27.4,False,Q1 (Low)
5,0061902/19_117253,0061902/19,117253,83-33 118TH STREET,5N,2020-02-14,QUEENS,11415,Not an Ejectment,Possession,40.706235,-73.834603,9.0,29.0,134.0,4079390,4033220043,Kew Gardens,2020,2020-02,POINT (-73.834603 40.706235),0.4,1979.0,D1,6.0,79.0,CIAMPA METROPOLITAN CO,72147.0,post-war,elevator,False,mid-rise,"1970–1993, deregularization","1951–1980, the International Style, Alternative Modernism","1976–1990, fiscal crisis and recovery",21-100 units,False,very large,Q4 (largest 25%),1970-1979,11415,20315.0,0.7661,0.5573,0.898,0.9396,0.8761,14.6,5.6,11.8,4.7,17.0,18.0,10.9,7.5,44.3,8.5,32.3,6.7,22.9,22.3,0.2,0.0,3.4,2.1,57.7,42.3,False,Q1 (Low)


In [53]:
bbl_evictions_svi.shape

(6106, 69)

In [54]:
bbl_evictions_svi_311 = bbl_evictions_svi.merge(
    bbl_complaints_wide,
    on='bbl',
    how='left'
)
# the final merge with bbl, evictions, svi with 311 complaints

In [55]:
bbl_evictions_svi_311.isna().sum()

Unnamed: 0,0
primary_key,0
court_index_number,0
docket_number,0
eviction_address,0
eviction_apartment_number,0
...,...
public_nuisance,720
safety_concerns,720
sanitation_issues,720
walls_ceilings,720


In [57]:
f"{720/bbl_evictions_svi_311.shape[0]*100:.2f} % of the rows have nans"

'11.79 % of the rows have nans'

In [58]:
nan_counts = bbl_evictions_svi_311.isna().sum()
columns_with_nans = nan_counts[nan_counts > 0]
columns_with_nans

Unnamed: 0,0
air_quality,720
animal_issues,720
appliances,720
building_exterior,720
doors_windows,720
electrical_issues,720
elevator_issues,720
floors_stairs,720
general_complaints,720
graffiti_posting,720


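## **These NaNs come from the left merge: 720 eviction rows have a bbl that does not appear in the 311 complaint table. In this case, it would make no sense to fill these NaNs, as doing so would only add inaccuracies to the dataset. We drop all the rows that contain NaNs.**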
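One way to confirm where such NaNs come from before dropping anything is the `indicator=True` option of `merge`, which adds a `_merge` column marking rows that found no partner. The sketch below uses invented toy frames (`evictions`, `wide`) rather than the real tables.

```python
import pandas as pd

# Toy eviction rows; bbl 303 has no complaints on record (made-up values).
evictions = pd.DataFrame({"primary_key": ["e1", "e2", "e3"],
                          "bbl": [101, 202, 303]})
wide = pd.DataFrame({"bbl": [101, 202], "noise_complaints": [2.0, 1.0]})

checked = evictions.merge(wide, on="bbl", how="left", indicator=True)

# 'left_only' marks eviction rows whose bbl is absent from the complaint table.
unmatched = int((checked["_merge"] == "left_only").sum())
pct_unmatched = unmatched / len(checked) * 100
print(f"{unmatched} rows ({pct_unmatched:.2f} %) have no 311 match")

# Keeping only matched rows has the same effect here as dropna().
matched_only = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(matched_only)
```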

In [59]:
bbl_evictions_svi_311 = bbl_evictions_svi_311.dropna()

In [60]:
bbl_evictions_svi_311.isna().sum().sum(), bbl_evictions_svi_311.duplicated().sum(), bbl_evictions_svi_311.shape

(np.int64(0), np.int64(0), (5386, 91))

In [61]:
bbl_evictions_svi_311

Unnamed: 0,primary_key,court_index_number,docket_number,eviction_address,eviction_apartment_number,executed_date,borough,zipcode,ejectment,eviction/legal_possession,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,year,month_year,geometry,average_year_eviction_count,yearbuilt,bldgclass,numfloors,unitsres,ownername,bldgarea,building_type,building_category,is_condo,floor_category,rent_era,architectural_style,economic_period,residential_units_category,is_llc,building_size_category,size_quartile,decade,fips,e_totpop,rpl_theme1,rpl_theme2,rpl_theme3,rpl_theme4,rpl_themes,ep_pov150,ep_unemp,ep_nohsdp,ep_uninsur,ep_age65,ep_age17,ep_disabl,ep_limeng,ep_noveh,ep_crowd,ep_hburd,ep_afam,ep_hisp,ep_asian,ep_aian,ep_nhpi,ep_twomore,ep_otherrace,ep_minrty,ep_white,invalid_zip,svi_quartile,air_quality,animal_issues,appliances,building_exterior,doors_windows,electrical_issues,elevator_issues,floors_stairs,general_complaints,graffiti_posting,heat_hot_water,homeless_issues,noise_complaints,other_issues,pest_issues,plumbing_issues,police_matters,public_nuisance,safety_concerns,sanitation_issues,walls_ceilings,total_complaints
0,004123/20_209969,004123/20,209969,2541 A GRAND AVE,ROOM 3B,2022-08-22,BRONX,10468,Not an Ejectment,Possession,40.865396,-73.901317,7.0,14.0,265.0,2113173,2032140141,Kingsbridge Heights,2022,2022-08,POINT (-73.901317 40.865396),0.2,2004.0,C0,3.0,3.0,MONJU SARKER,3420.0,post-war,walk-up,False,low-rise,"1994–Present, vacancy decontrol","2001-present, New Architecture","1991–2008, modern economic growth",3-5 units,False,medium-small,Q4 (largest 25%),2000-2009,10468,81397.0,0.9954,0.9407,0.9870,0.9470,0.9874,39.5,11.6,28.3,9.2,11.2,26.4,12.2,26.9,71.8,19.2,56.7,15.6,78.0,2.3,0.0,0.0,0.5,0.5,96.9,3.1,False,Q3,0.0,0.0,0.0,0.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,1.0,14.0
1,0050153/20_106030,0050153/20,106030,98-05 67TH AVENUE,12F,2022-04-14,QUEENS,11375,Not an Ejectment,Possession,40.724241,-73.855552,6.0,29.0,71306.0,4074666,4031560133,Forest Hills,2022,2022-04,POINT (-73.855552 40.724241),0.2,1960.0,D3,13.0,181.0,MARSEILLES LEASING LIMITED PARTNERSHIP,177710.0,post-war,elevator,False,high-rise,"1947–1969, rent-control","1951–1980, the International Style, Alternative Modernism","1946–1975, pst war economic boom",100+ units,False,mega,Q4 (largest 25%),1960-1969,11375,75212.0,0.4759,0.5698,0.8789,0.8057,0.7322,12.0,4.8,6.1,3.7,20.4,18.0,10.5,7.9,41.9,5.8,25.4,2.7,16.4,28.5,0.1,0.0,4.6,0.7,53.0,47.0,False,Q1 (Low),0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,62.0,0.0,34.0,0.0,0.0,4.0,1.0,0.0,0.0,2.0,5.0,112.0
2,0052002/19_101926,0052002/19,101926,199 VERONICA PLACE,1ST FLOOR,2020-03-02,BROOKLYN,11226,Not an Ejectment,Possession,40.645404,-73.952578,17.0,40.0,792.0,3117969,3051370021,Erasmus,2020,2020-03,POINT (-73.952578 40.645404),0.6,1920.0,B3,2.0,2.0,"AANS, LLC.",1496.0,pre-war,two-family,False,low-rise,"Pre-1947, pre-rent-control","1900–1920, Beaux-Arts","Pre-1929, pre-great depression",2-unit,True,very small,Q2 (25-50%),1920-1929,11226,101053.0,0.9300,0.4536,0.9639,0.9692,0.9220,23.7,5.9,13.9,9.1,13.1,18.7,6.7,5.6,66.1,10.0,39.2,63.2,14.9,3.2,0.3,0.0,4.1,0.7,86.3,13.7,False,Q2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3,0057757/18_100889,0057757/18,100889,302 EASTERN PARKWAY,4B,2020-02-03,BROOKLYN,11225,Not an Ejectment,Possession,40.670832,-73.958843,9.0,35.0,213.0,3029673,3011850034,Crown Heights South,2020,2020-02,POINT (-73.958843 40.670832),0.8,1923.0,D1,6.0,48.0,302 EASTERN CORP,42984.0,pre-war,elevator,False,mid-rise,"Pre-1947, pre-rent-control","1921–1930, Art Deco Skyscrapers","Pre-1929, pre-great depression",21-100 units,False,very large,Q4 (largest 25%),1920-1929,11225,58476.0,0.8905,0.3157,0.9330,0.8342,0.8538,23.1,6.6,11.5,5.9,15.3,16.7,9.6,2.2,66.2,6.9,37.3,53.7,10.8,3.3,0.0,0.0,3.9,0.9,72.6,27.4,False,Q1 (Low),0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,12.0,0.0,17.0,0.0,3.0,2.0,1.0,0.0,2.0,2.0,1.0,45.0
4,0061902/19_117253,0061902/19,117253,83-33 118TH STREET,5N,2020-02-14,QUEENS,11415,Not an Ejectment,Possession,40.706235,-73.834603,9.0,29.0,134.0,4079390,4033220043,Kew Gardens,2020,2020-02,POINT (-73.834603 40.706235),0.4,1979.0,D1,6.0,79.0,CIAMPA METROPOLITAN CO,72147.0,post-war,elevator,False,mid-rise,"1970–1993, deregularization","1951–1980, the International Style, Alternative Modernism","1976–1990, fiscal crisis and recovery",21-100 units,False,very large,Q4 (largest 25%),1970-1979,11415,20315.0,0.7661,0.5573,0.8980,0.9396,0.8761,14.6,5.6,11.8,4.7,17.0,18.0,10.9,7.5,44.3,8.5,32.3,6.7,22.9,22.3,0.2,0.0,3.4,2.1,57.7,42.3,False,Q1 (Low),0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,9.0,0.0,19.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0,38.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6100,R52623/19_101227,R52623/19,101227,185 ST.MARKS PLACE,13A,2020-02-19,STATEN ISLAND,10301,Not an Ejectment,Possession,40.645358,-74.080729,1.0,49.0,7.0,5108502,5000130008,West New Brighton-New Brighton-St. George,2020,2020-02,POINT (-74.080729 40.645358),0.6,1976.0,D3,20.0,454.0,NYC HOUSING DEVELOPMENT CORP.,524513.0,post-war,elevator,False,high-rise,"1970–1993, deregularization","1951–1980, the International Style, Alternative Modernism","1976–1990, fiscal crisis and recovery",100+ units,False,mega,Q4 (largest 25%),1970-1979,10301,40331.0,0.8784,0.7487,0.8992,0.9869,0.9329,20.4,7.1,13.7,5.2,15.6,20.7,11.6,6.5,25.5,8.1,32.2,19.9,26.3,7.6,0.7,0.0,3.7,0.3,58.6,41.4,False,Q2,0.0,4.0,3.0,0.0,19.0,3.0,2.0,22.0,4.0,0.0,24.0,0.0,253.0,1.0,0.0,27.0,6.0,9.0,9.0,20.0,16.0,422.0
6101,R52635/19_101039,R52635/19,101039,185 ST.MARKS PLACE,5E,2020-01-06,STATEN ISLAND,10301,Not an Ejectment,Possession,40.645358,-74.080729,1.0,49.0,7.0,5108502,5000130008,West New Brighton-New Brighton-St. George,2020,2020-01,POINT (-74.080729 40.645358),0.6,1976.0,D3,20.0,454.0,NYC HOUSING DEVELOPMENT CORP.,524513.0,post-war,elevator,False,high-rise,"1970–1993, deregularization","1951–1980, the International Style, Alternative Modernism","1976–1990, fiscal crisis and recovery",100+ units,False,mega,Q4 (largest 25%),1970-1979,10301,40331.0,0.8784,0.7487,0.8992,0.9869,0.9329,20.4,7.1,13.7,5.2,15.6,20.7,11.6,6.5,25.5,8.1,32.2,19.9,26.3,7.6,0.7,0.0,3.7,0.3,58.6,41.4,False,Q2,0.0,4.0,3.0,0.0,19.0,3.0,2.0,22.0,4.0,0.0,24.0,0.0,253.0,1.0,0.0,27.0,6.0,9.0,9.0,20.0,16.0,422.0
6102,R52697/19_101672,R52697/19,101672,351 ADELAIDE AVENUE,18,2020-01-22,STATEN ISLAND,10306,Not an Ejectment,Possession,40.558150,-74.118731,3.0,50.0,12805.0,5063024,5046760073,Oakwood-Oakwood Beach,2020,2020-01,POINT (-74.118731 40.55815),0.2,1965.0,C9,2.0,16.0,351 ADELAIDE AVENUE HOLDINGS LLC,8460.0,post-war,walk-up,False,low-rise,"1947–1969, rent-control","1951–1980, the International Style, Alternative Modernism","1946–1975, pst war economic boom",6-20 units,True,medium,Q4 (largest 25%),1960-1969,10306,56232.0,0.7769,0.8011,0.7994,0.8319,0.8739,16.6,6.7,11.5,4.3,18.8,21.3,11.1,8.5,14.7,4.3,28.0,3.0,16.2,13.8,0.0,0.0,3.5,0.4,36.8,63.2,False,Q1 (Low),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
6103,R52769/19_102174,R52769/19,102174,31 MARKHAM LANE,1A,2020-02-19,STATEN ISLAND,10310,Not an Ejectment,Possession,40.639034,-74.116274,1.0,49.0,97.0,5108650,5001690001,West New Brighton-New Brighton-St. George,2020,2020-02,POINT (-74.116274 40.639034),0.2,2007.0,C1,3.0,240.0,MARKHAM GARDENS L.P.,236523.0,post-war,walk-up,False,low-rise,"1994–Present, vacancy decontrol","2001-present, New Architecture","1991–2008, modern economic growth",100+ units,False,mega,Q4 (largest 25%),2000-2009,10310,26239.0,0.7569,0.7972,0.8975,0.8991,0.8928,21.0,4.2,12.6,4.5,12.6,25.7,8.7,6.0,21.8,4.4,28.0,20.0,29.0,5.3,0.1,0.0,3.0,0.2,57.6,42.4,False,Q1 (Low),0.0,0.0,0.0,0.0,3.0,1.0,0.0,2.0,3.0,0.0,11.0,0.0,25.0,0.0,4.0,9.0,0.0,4.0,1.0,8.0,8.0,79.0


In [62]:
zero_bbl_count = (bbl_evictions_svi_311['bbl'] == 0).sum()
zero_bbl_count
# no bbl == 0 rows

np.int64(0)

In [63]:
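all_columns = list(bbl_evictions_svi_311.columns),
# len(all_columns)
# all_columns
# note: the trailing comma above makes all_columns a 1-tuple wrapping the column list,
# so the list itself is all_columns[0]; that is why remove() on all_columns did not work
type(all_columns), len(all_columns[0])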

(tuple, 91)
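The `(tuple, 91)` result above comes from the trailing comma on the assignment, not from pandas. A minimal sketch of the same effect, using made-up column names:

```python
# Hypothetical column names, for illustration only
columns = ['primary_key', 'bbl', 'borough']

wrapped = list(columns),   # trailing comma: a 1-tuple whose only element is the list
plain = list(columns)      # no comma: the list itself

print(type(wrapped).__name__, len(wrapped), len(wrapped[0]))
print(type(plain).__name__, len(plain))
```

Indexing the tuple with `[0]` recovers the list, which is what the reordering cell further down relies on.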

## **There is one fewer column in this covid df than in the normal-time df: svi_group, which categorizes the SVI theme 1 score into levels from low to high. It exists only in the normal-time df because the regression analysis, which added this column, was run only on the normal-time SVI merged df. (See the evidence at the very end.)**
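As a rough illustration of how a column like `svi_group` could be derived from `rpl_theme1`, the sketch below bins the score with `pd.cut`. The evenly spaced bin edges are an assumption, not the thresholds used in the normal-time notebook; the labels are the values visible in the `svi_group` output at the end of this notebook, and the sample scores are made up.

```python
import pandas as pd

# Made-up theme 1 percentile scores, for illustration only
demo = pd.DataFrame({'rpl_theme1': [0.10, 0.40, 0.60, 0.93]})

# Assumed bin edges (equal-width); the real thresholds are not shown in this notebook
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
labels = ['low', 'medium-low', 'medium-high', 'high']

demo['svi_group'] = pd.cut(demo['rpl_theme1'], bins=bins, labels=labels,
                           include_lowest=True)
print(demo)
```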

In [None]:
# bbl_evictions_svi_311

In [64]:
# the goal is to move "bbl" to the front of the dataframe
# all_columns = merged_with_complaints.columns.tolist()
# print(all_columns)
# if 'court_index_number' in all_columns:
#     print("yes, court_index_number")
#     all_columns.remove('court_index_number')
# if 'bbl' in all_columns:
#     print("yes, bbl")
#     all_columns.remove('bbl')
# all_columns
# all_columns is a 1-tuple wrapping the column list, so take a copy of element 0
remaining_columns = list(all_columns[0])
print(len(remaining_columns))
remaining_columns.remove('primary_key')
remaining_columns.remove('bbl')

91


In [65]:
len(remaining_columns)
# good

89

In [66]:
new_column_order = ['primary_key', 'bbl'] + remaining_columns

In [67]:
# new order in place
bbl_evictions_svi_311 = bbl_evictions_svi_311[new_column_order]
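The reordering done in the last few cells can also be written in one pass. This is a minimal sketch on a made-up miniature frame; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical miniature frame; values are made up
df = pd.DataFrame({
    'court_index_number': ['A/20', 'B/19'],
    'bbl': [1, 2],
    'primary_key': ['A/20_1', 'B/19_2'],
    'borough': ['BRONX', 'QUEENS'],
})

front = ['primary_key', 'bbl']
# keep every other column in its original relative order
rest = [col for col in df.columns if col not in front]
df = df[front + rest]
print(list(df.columns))
```

Filtering with a list comprehension avoids calling `remove()` at all, so it does not fail if one of the front columns is missing from the filter step.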

In [68]:
display(bbl_evictions_svi_311.head())

Unnamed: 0,primary_key,bbl,court_index_number,docket_number,eviction_address,eviction_apartment_number,executed_date,borough,zipcode,ejectment,eviction/legal_possession,latitude,longitude,community_board,council_district,census_tract,bin,nta,year,month_year,geometry,average_year_eviction_count,yearbuilt,bldgclass,numfloors,unitsres,ownername,bldgarea,building_type,building_category,is_condo,floor_category,rent_era,architectural_style,economic_period,residential_units_category,is_llc,building_size_category,size_quartile,decade,fips,e_totpop,rpl_theme1,rpl_theme2,rpl_theme3,rpl_theme4,rpl_themes,ep_pov150,ep_unemp,ep_nohsdp,ep_uninsur,ep_age65,ep_age17,ep_disabl,ep_limeng,ep_noveh,ep_crowd,ep_hburd,ep_afam,ep_hisp,ep_asian,ep_aian,ep_nhpi,ep_twomore,ep_otherrace,ep_minrty,ep_white,invalid_zip,svi_quartile,air_quality,animal_issues,appliances,building_exterior,doors_windows,electrical_issues,elevator_issues,floors_stairs,general_complaints,graffiti_posting,heat_hot_water,homeless_issues,noise_complaints,other_issues,pest_issues,plumbing_issues,police_matters,public_nuisance,safety_concerns,sanitation_issues,walls_ceilings,total_complaints
0,004123/20_209969,2032140141,004123/20,209969,2541 A GRAND AVE,ROOM 3B,2022-08-22,BRONX,10468,Not an Ejectment,Possession,40.865396,-73.901317,7.0,14.0,265.0,2113173,Kingsbridge Heights,2022,2022-08,POINT (-73.901317 40.865396),0.2,2004.0,C0,3.0,3.0,MONJU SARKER,3420.0,post-war,walk-up,False,low-rise,"1994–Present, vacancy decontrol","2001-present, New Architecture","1991–2008, modern economic growth",3-5 units,False,medium-small,Q4 (largest 25%),2000-2009,10468,81397.0,0.9954,0.9407,0.987,0.947,0.9874,39.5,11.6,28.3,9.2,11.2,26.4,12.2,26.9,71.8,19.2,56.7,15.6,78.0,2.3,0.0,0.0,0.5,0.5,96.9,3.1,False,Q3,0.0,0.0,0.0,0.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,3.0,1.0,14.0
1,0050153/20_106030,4031560133,0050153/20,106030,98-05 67TH AVENUE,12F,2022-04-14,QUEENS,11375,Not an Ejectment,Possession,40.724241,-73.855552,6.0,29.0,71306.0,4074666,Forest Hills,2022,2022-04,POINT (-73.855552 40.724241),0.2,1960.0,D3,13.0,181.0,MARSEILLES LEASING LIMITED PARTNERSHIP,177710.0,post-war,elevator,False,high-rise,"1947–1969, rent-control","1951–1980, the International Style, Alternative Modernism","1946–1975, pst war economic boom",100+ units,False,mega,Q4 (largest 25%),1960-1969,11375,75212.0,0.4759,0.5698,0.8789,0.8057,0.7322,12.0,4.8,6.1,3.7,20.4,18.0,10.5,7.9,41.9,5.8,25.4,2.7,16.4,28.5,0.1,0.0,4.6,0.7,53.0,47.0,False,Q1 (Low),0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,62.0,0.0,34.0,0.0,0.0,4.0,1.0,0.0,0.0,2.0,5.0,112.0
2,0052002/19_101926,3051370021,0052002/19,101926,199 VERONICA PLACE,1ST FLOOR,2020-03-02,BROOKLYN,11226,Not an Ejectment,Possession,40.645404,-73.952578,17.0,40.0,792.0,3117969,Erasmus,2020,2020-03,POINT (-73.952578 40.645404),0.6,1920.0,B3,2.0,2.0,"AANS, LLC.",1496.0,pre-war,two-family,False,low-rise,"Pre-1947, pre-rent-control","1900–1920, Beaux-Arts","Pre-1929, pre-great depression",2-unit,True,very small,Q2 (25-50%),1920-1929,11226,101053.0,0.93,0.4536,0.9639,0.9692,0.922,23.7,5.9,13.9,9.1,13.1,18.7,6.7,5.6,66.1,10.0,39.2,63.2,14.9,3.2,0.3,0.0,4.1,0.7,86.3,13.7,False,Q2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3,0057757/18_100889,3011850034,0057757/18,100889,302 EASTERN PARKWAY,4B,2020-02-03,BROOKLYN,11225,Not an Ejectment,Possession,40.670832,-73.958843,9.0,35.0,213.0,3029673,Crown Heights South,2020,2020-02,POINT (-73.958843 40.670832),0.8,1923.0,D1,6.0,48.0,302 EASTERN CORP,42984.0,pre-war,elevator,False,mid-rise,"Pre-1947, pre-rent-control","1921–1930, Art Deco Skyscrapers","Pre-1929, pre-great depression",21-100 units,False,very large,Q4 (largest 25%),1920-1929,11225,58476.0,0.8905,0.3157,0.933,0.8342,0.8538,23.1,6.6,11.5,5.9,15.3,16.7,9.6,2.2,66.2,6.9,37.3,53.7,10.8,3.3,0.0,0.0,3.9,0.9,72.6,27.4,False,Q1 (Low),0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,12.0,0.0,17.0,0.0,3.0,2.0,1.0,0.0,2.0,2.0,1.0,45.0
4,0061902/19_117253,4033220043,0061902/19,117253,83-33 118TH STREET,5N,2020-02-14,QUEENS,11415,Not an Ejectment,Possession,40.706235,-73.834603,9.0,29.0,134.0,4079390,Kew Gardens,2020,2020-02,POINT (-73.834603 40.706235),0.4,1979.0,D1,6.0,79.0,CIAMPA METROPOLITAN CO,72147.0,post-war,elevator,False,mid-rise,"1970–1993, deregularization","1951–1980, the International Style, Alternative Modernism","1976–1990, fiscal crisis and recovery",21-100 units,False,very large,Q4 (largest 25%),1970-1979,11415,20315.0,0.7661,0.5573,0.898,0.9396,0.8761,14.6,5.6,11.8,4.7,17.0,18.0,10.9,7.5,44.3,8.5,32.3,6.7,22.9,22.3,0.2,0.0,3.4,2.1,57.7,42.3,False,Q1 (Low),0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,9.0,0.0,19.0,0.0,1.0,0.0,0.0,0.0,0.0,6.0,1.0,38.0


In [69]:
bbl_evictions_svi_311.shape

(5386, 91)

In [70]:
# remove rows with BBL = 0
bbl_evictions_svi_311 = bbl_evictions_svi_311[bbl_evictions_svi_311['bbl'] != 0] # good
len(bbl_evictions_svi_311)

5386

In [71]:
bbl_evictions_svi_311.isna().sum().sum(), bbl_evictions_svi_311.duplicated().sum() # all clean

(np.int64(0), np.int64(0))
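The check above counts missing cells and fully duplicated rows. A small sketch of a helper that also counts repeated key values is shown below; `summarize_cleanliness` is a hypothetical name and the demo data is made up.

```python
import pandas as pd

def summarize_cleanliness(df, key):
    """Return counts of missing cells, fully duplicated rows, and repeated key values."""
    return {
        'missing_cells': int(df.isna().sum().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
        'duplicate_keys': int(df[key].duplicated().sum()),
    }

# Made-up rows: one missing value and one repeated key
demo = pd.DataFrame({
    'primary_key': ['a_1', 'b_2', 'b_2'],
    'bbl': [1, 2, 2],
    'heat_hot_water': [1.0, None, 3.0],
})
report = summarize_cleanliness(demo, key='primary_key')
print(report)
```

In the demo, the two `b_2` rows differ in one column, so they are not duplicate rows but they do share a key. That case is invisible to `duplicated()` on the whole frame.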

In [72]:
bbl_evictions_svi_311.shape
# final shape

(5386, 91)

In [73]:
bbl_evictions_svi_311.info(), \
bbl_evictions_svi_311.shape

<class 'pandas.core.frame.DataFrame'>
Index: 5386 entries, 0 to 6104
Data columns (total 91 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   primary_key                  5386 non-null   object 
 1   bbl                          5386 non-null   int64  
 2   court_index_number           5386 non-null   object 
 3   docket_number                5386 non-null   int64  
 4   eviction_address             5386 non-null   object 
 5   eviction_apartment_number    5386 non-null   object 
 6   executed_date                5386 non-null   object 
 7   borough                      5386 non-null   object 
 8   zipcode                      5386 non-null   int64  
 9   ejectment                    5386 non-null   object 
 10  eviction/legal_possession    5386 non-null   object 
 11  latitude                     5386 non-null   float64
 12  longitude                    5386 non-null   float64
 13  community_board        

(None, (5386, 91))

In [74]:
complaint_cols = ['bbl'] + all_categories + ['total_complaints']
existing_cols = [col for col in complaint_cols if col in bbl_evictions_svi_311.columns]
existing_cols

['bbl',
 'heat_hot_water',
 'plumbing_issues',
 'electrical_issues',
 'elevator_issues',
 'doors_windows',
 'walls_ceilings',
 'floors_stairs',
 'building_exterior',
 'appliances',
 'sanitation_issues',
 'pest_issues',
 'air_quality',
 'noise_complaints',
 'homeless_issues',
 'graffiti_posting',
 'public_nuisance',
 'safety_concerns',
 'animal_issues',
 'police_matters',
 'general_complaints',
 'other_issues',
 'total_complaints']

In [75]:
# just take a look at the ones related to the 311 complaint part
display(bbl_evictions_svi_311[['primary_key'] + existing_cols].head())

Unnamed: 0,primary_key,bbl,heat_hot_water,plumbing_issues,electrical_issues,elevator_issues,doors_windows,walls_ceilings,floors_stairs,building_exterior,appliances,sanitation_issues,pest_issues,air_quality,noise_complaints,homeless_issues,graffiti_posting,public_nuisance,safety_concerns,animal_issues,police_matters,general_complaints,other_issues,total_complaints
0,004123/20_209969,2032140141,1.0,2.0,0.0,0.0,3.0,1.0,2.0,0.0,0.0,3.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0
1,0050153/20_106030,4031560133,62.0,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,34.0,0.0,0.0,0.0,0.0,2.0,1.0,2.0,0.0,112.0
2,0052002/19_101926,3051370021,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3,0057757/18_100889,3011850034,12.0,2.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,2.0,3.0,0.0,17.0,0.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,45.0
4,0061902/19_117253,4033220043,9.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,6.0,1.0,0.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.0


In [76]:
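# count how many rows have at least one complaint of each type
# (each row is one eviction, so a building with several evictions is counted once per eviction)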
buildings_with_complaints_clean = {col: (bbl_evictions_svi_311[col] > 0).sum() for col in existing_cols[1:]}
# sorted_counts = sorted(buildings_with_complaints.items(), key=lambda x: x[1], reverse=True)
# this is just a list
complaint_counts_df = pd.DataFrame(list(buildings_with_complaints_clean.items()),
                                  columns=['complaint_category', 'building_count'])

In [77]:
complaint_counts_df = complaint_counts_df.sort_values('building_count', ascending=False)
complaint_counts_df = complaint_counts_df.reset_index(drop=True)
complaint_counts_df

Unnamed: 0,complaint_category,building_count
0,total_complaints,5386
1,noise_complaints,4687
2,plumbing_issues,3986
3,heat_hot_water,3957
4,sanitation_issues,3773
5,doors_windows,3093
6,walls_ceilings,3089
7,electrical_issues,2608
8,general_complaints,2539
9,pest_issues,2430
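Because the merged frame has one row per eviction, the counts above are row counts. If counts of distinct buildings are wanted instead, one possible approach is to keep a single row per `bbl` first. The sketch below assumes the complaint columns are identical across rows that share a `bbl` (as they appear to be in the displayed rows); the data is made up.

```python
import pandas as pd

# Made-up rows: two evictions in one building, one in another
demo = pd.DataFrame({
    'primary_key': ['a_1', 'b_2', 'c_3'],
    'bbl': [5000130008, 5000130008, 5046760073],
    'heat_hot_water': [24.0, 24.0, 2.0],
    'elevator_issues': [2.0, 2.0, 0.0],
})
categories = ['heat_hot_water', 'elevator_issues']

# rows (evictions) with at least one complaint in each category
rows_with = (demo[categories] > 0).sum()

# distinct buildings with at least one complaint in each category
per_building = demo.drop_duplicates(subset='bbl')
buildings_with = (per_building[categories] > 0).sum()

print(rows_with.to_dict())
print(buildings_with.to_dict())
```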


# **Step 4: Save the final bbl_evictions_svi_311_merged dataset to the cloud for later use.**

### This is a thoroughly cleaned merged df with no NaNs or duplicate rows, ready for analysis.

In [78]:
bbl_evictions_svi_311.to_csv('/content/drive/My Drive/X999/bbl_evictions_311_svi_covid.csv', index=False)
# good, not too big, with all the necessary information
# great for analysis.
# if it were only for retrieval purposes, we could have kept some of the rows that had NaNs for completeness.

In [79]:
link1 = "/content/drive/My Drive/X999/bbl_evictions_311_svi_normal_times.csv"
link2 = "/content/drive/My Drive/X999/bbl_evictions_311_svi_covid.csv"

In [80]:
normal_times_df = pd.read_csv(link1)
covid_df = pd.read_csv(link2)

In [84]:
normal_columns = set(normal_times_df.columns)
covid_columns = set(covid_df.columns)
only_in_normal = normal_columns - covid_columns
only_in_normal

{'svi_group'}
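The cell above checks only one direction. A minimal sketch of checking both directions with plain sets is shown below; the column names are a small made-up subset, not the full 91 columns.

```python
# Made-up subsets of the two column sets, for illustration only
normal_columns = {'primary_key', 'bbl', 'svi_quartile', 'svi_group'}
covid_columns = {'primary_key', 'bbl', 'svi_quartile'}

only_in_normal = normal_columns - covid_columns
only_in_covid = covid_columns - normal_columns

print(only_in_normal)
print(only_in_covid)
```

An empty `only_in_covid` would confirm that the covid df has no columns the normal-time df lacks.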

In [86]:
normal_times_df.svi_group

Unnamed: 0,svi_group
0,medium-high
1,medium-high
2,medium-high
3,medium-high
4,high
...,...
66392,medium-low
66393,low
66394,low
66395,low
