### Data Cleaning

This notebook outlines the steps taken to process and clean the LL84 dataset, one of three datasets the model needs to predict natural gas and electricity use.

In [1]:
import pandas as pd

### Reading LL84 data

This is public data available to download from here: https://data.cityofnewyork.us/Environment/Energy-and-Water-Data-Disclosure-for-Local-Law-84-/wcm8-aq5w.

In [2]:
LL84_2019 = pd.read_csv("data/LL84_2019.csv")


In [3]:
LL84_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29084 entries, 0 to 29083
Columns: 254 entries, Property Id to NTA
dtypes: float64(6), int64(5), object(243)
memory usage: 56.4+ MB
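
Most columns above are loaded as `object`. One possible way to reduce this at read time is sketched below; it assumes the missing-value marker is the literal string "Not Available" (as seen in the preview of this dataset) and uses a tiny in-memory CSV with toy values rather than the real file.

```python
import io
import pandas as pd

# Toy CSV standing in for data/LL84_2019.csv (values are illustrative only)
csv_text = '''Property Id,"NYC Borough, Block and Lot (BBL)",ENERGY STAR Score
7365,1009970029,Not Available
8139,1013150001,75
'''

bbl_col = "NYC Borough, Block and Lot (BBL)"

# Keep BBL as text so identifiers are never parsed as numbers,
# and treat "Not Available" as missing while reading
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={bbl_col: str},
    na_values=["Not Available"],
)
print(df.dtypes)
```

With this approach the score column is parsed as numeric directly, so a later string replacement of "Not Available" may not be needed.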


In [4]:
LL84_2019.head()

Unnamed: 0,Property Id,Property Name,Parent Property Id,Parent Property Name,Year Ending,"NYC Borough, Block and Lot (BBL)",NYC Building Identification Number (BIN),Address 1,Address 2,City,...,Last Modified Date - Water Meters,Generation Date,Release Date,Borough,Latitude,Longitude,Community Board,Council District,Census Tract,NTA
0,7365,1155,Not Applicable: Standalone Property,Not Applicable: Standalone Property,12/31/2019,1009970029,1022631,1155 Avenue of the Americas,Not Available,Manhattan,...,Not Available,05/28/2020 04:27:22 AM,05/28/2020 11:31:28 AM,MANHATTAN,40.756631,-73.982826,105.0,4.0,119.0,Midtown-Midtown South
1,8139,200,Not Applicable: Standalone Property,Not Applicable: Standalone Property,12/31/2019,1013150001,1037545,200 East 42nd St.,Not Available,Manhattan,...,03/03/2020 05:46 PM EST,05/28/2020 04:27:23 AM,05/28/2020 11:31:28 AM,MANHATTAN,40.750698,-73.974306,106.0,4.0,88.0,Turtle Bay-East Midtown
2,8604,114,Not Applicable: Standalone Property,Not Applicable: Standalone Property,12/31/2019,1009990019,1022667,114 West 47th st,Not Available,Manhattan,...,Not Available,05/28/2020 04:27:23 AM,05/28/2020 11:31:28 AM,MANHATTAN,40.75831,-73.982504,105.0,4.0,125.0,Midtown-Midtown South
3,8841,733,Not Applicable: Standalone Property,Not Applicable: Standalone Property,12/31/2019,1013190047,1037596,733 Third Avenue,Not Available,Manhattan,...,Not Available,05/28/2020 04:27:24 AM,05/28/2020 11:31:28 AM,MANHATTAN,40.753074,-73.972753,106.0,4.0,90.0,Turtle Bay-East Midtown
4,11809,Conde Nast Building,Not Applicable: Standalone Property,Not Applicable: Standalone Property,12/31/2019,1009950005,1085682,4 Times Square,Not Available,Manhattan,...,Not Available,05/28/2020 04:27:25 AM,05/28/2020 11:31:28 AM,MANHATTAN,40.756181,-73.986244,105.0,4.0,119.0,Midtown-Midtown South


In [5]:
# Rename the column "NYC Borough, Block and Lot (BBL)" to "BBL"
LL84_2019.rename(columns={"NYC Borough, Block and Lot (BBL)": "BBL"}, inplace=True)

In [6]:
# Cast BBL values to strings and remove hyphens
LL84_2019["BBL"] = LL84_2019["BBL"].astype(str).str.replace("-", "", regex=False)
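
As a sanity check on the reformatted identifiers, the sketch below flags values that are not exactly 10 digits, which is the usual BBL layout (1 borough digit, 5 block digits, 4 lot digits). The sample values and the names `bbl_clean` and `is_valid` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy BBL values: one clean, one hyphenated, one unusable
bbl = pd.Series(["1009970029", "1-00997-0029", "Not Available"])

# Remove hyphens literally (no regex interpretation)
bbl_clean = bbl.astype(str).str.replace("-", "", regex=False)

# True where the cleaned value consists of exactly 10 digits
is_valid = bbl_clean.str.fullmatch(r"\d{10}")
print(bbl_clean[~is_valid])
```

Rows failing the check could then be inspected or dropped before any join on BBL.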

In [7]:
current_year = 2019

# Calculate building age
LL84_2019['Building Age'] = current_year - LL84_2019['Year Built']

In [8]:
# Retain only the specified columns
columns_to_retain = [
    "BBL",
    "Largest Property Use Type - Gross Floor Area (ft²)",
    "Building Age",
    "ENERGY STAR Score",
    "Weather Normalized Source EUI (kBtu/ft²)",
    "Weather Normalized Site Natural Gas Use (therms)",
    "Weather Normalized Site Energy Use (kBtu)",
    "Latitude",
    "Longitude",
    "NTA"
]
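# Take an explicit copy so the in-place operations below act on an independent DataFrame
LL84_2019 = LL84_2019[columns_to_retain].copy()
LL84_2019 = LL84_2019.replace("Not Available", pd.NA)  # Mark "Not Available" entries as missing
LL84_2019.dropna(inplace=True)  # Drop rows with any missing value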
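
Replacing "Not Available" leaves the affected columns as text. A minimal sketch of an alternative follows, assuming the measurement columns should be numeric: `pd.to_numeric` with `errors="coerce"` turns any unparseable entry into NaN in one step. The toy DataFrame is illustrative only.

```python
import pandas as pd

# Toy frame with a text column that mixes numbers and a missing-value marker
df = pd.DataFrame({
    "ENERGY STAR Score": ["75", "Not Available", "12"],
    "Latitude": [40.75, 40.76, 40.77],
})

# Coerce to numeric; anything that is not a number becomes NaN
df["ENERGY STAR Score"] = pd.to_numeric(df["ENERGY STAR Score"], errors="coerce")

# Drop rows where the coercion produced a missing value
df = df.dropna()
print(df.dtypes)
```

Note that coercion silently discards any non-numeric text, not only "Not Available", so it is worth counting the resulting NaN values before dropping rows.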

In [9]:
# Remove duplicate rows
LL84_2019.drop_duplicates(inplace=True)

In [10]:
LL84_2019.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19603 entries, 2 to 29080
Data columns (total 10 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   BBL                                                 19603 non-null  object 
 1   Largest Property Use Type - Gross Floor Area (ft²)  19603 non-null  object 
 2   Building Age                                        19603 non-null  int64  
 3   ENERGY STAR Score                                   19603 non-null  object 
 4   Weather Normalized Source EUI (kBtu/ft²)            19603 non-null  object 
 5   Weather Normalized Site Natural Gas Use (therms)    19603 non-null  object 
 6   Weather Normalized Site Energy Use (kBtu)           19603 non-null  object 
 7   Latitude                                            19603 non-null  float64
 8   Longitude                                           19603 non-null  float64


In [11]:
# Save file (the name must match the one read back in the integration step)
LL84_2019.to_csv(r"LL84_2019_processed.csv", index=False)

### Reading PLUTO data

This is public information exported from MapPLUTO™ - Shapefile 19v2 Releases and available to download from here: https://www.nyc.gov/site/planning/data-maps/open-data/bytes-archive.page.

In [12]:
PLUTO_2019 = pd.read_csv("data/PLUTO_2019.csv")


In [13]:
PLUTO_2019.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857298 entries, 0 to 857297
Data columns (total 89 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Borough     857298 non-null  object 
 1   Block       857298 non-null  int64  
 2   Lot         857298 non-null  int64  
 3   CD          857298 non-null  int64  
 4   CT2010      856730 non-null  float64
 5   CB2010      856730 non-null  float64
 6   SchoolDist  856166 non-null  float64
 7   Council     857298 non-null  int64  
 8   ZipCode     857298 non-null  int64  
 9   FireComp    856152 non-null  object 
 10  PolicePrct  857298 non-null  int64  
 11  HealthCent  857298 non-null  int64  
 12  HealthArea  857298 non-null  int64  
 13  Sanitboro   855967 non-null  float64
 14  SanitDistr  855967 non-null  float64
 15  SanitSub    855848 non-null  object 
 16  Address     856231 non-null  object 
 17  ZoneDist1   856596 non-null  object 
 18  ZoneDist2   19797 non-null   object 
 19  Zo

In [14]:
PLUTO_2019.head()

Unnamed: 0,Borough,Block,Lot,CD,CT2010,CB2010,SchoolDist,Council,ZipCode,FireComp,...,EDesigNum,APPBBL,APPDate,PLUTOMapID,FIRM07_FLA,PFIRM15_FL,Version,DCPEdited,Shape_Leng,Shape_Area
0,QN,15652,28,414,1032.01,2002.0,27.0,31,11691,L134,...,,4156520000.0,01/18/2007,1,1.0,1.0,19v2,,282.243624,3491.901845
1,QN,15652,29,414,1032.01,2002.0,27.0,31,11691,L134,...,,4156520000.0,01/18/2007,1,1.0,1.0,19v2,,264.24429,2509.830862
2,QN,15652,30,414,1032.01,2002.0,27.0,31,11691,L134,...,,4156520000.0,01/18/2007,1,1.0,1.0,19v2,,282.244024,3491.904073
3,QN,15652,118,414,1032.01,2002.0,27.0,31,11691,L134,...,,4156520000.0,01/18/2007,1,1.0,1.0,19v2,,288.175826,3422.459441
4,QN,15654,7,414,1032.01,2001.0,27.0,31,11691,L134,...,,0.0,,1,1.0,1.0,19v2,,254.40877,3766.653141


In [15]:
# Remove unwanted columns
columns_to_drop = [
    "Borough", "Block", "Lot", "CD", "CT2010", "CB2010", "SchoolDist",
    "Council", "ZipCode", "FireComp", "PolicePrct", "HealthCent", "HealthArea",
    "Sanitboro", "SanitDistr", "SanitSub", "BldgClass",
    "Easements", "OwnerType", "ComArea", "ResArea", "OfficeArea", "RetailArea",
    "GarageArea", "StrgeArea", "FactryArea", "OtherArea", "AreaSource",
    "UnitsTotal", "Ext", "IrrLotCode", "BsmtCode",
    "AssessTot", "ExemptTot", "YearBuilt", "YearAlter1",
    "YearAlter2", "HistDist", "ResidFAR", "CommFAR", "FacilFAR", "CondoNo",
    "Tract2010", "APPBBL", "APPDate", "Address", "ZoneDist1", "ZoneDist2",
    "ZoneDist3", "ZoneDist4", "Overlay1", "Overlay2", "SPDist1", "SPDist2", "SPDist3",
    "LtdHeight", "SplitZone", "OwnerName", "Landmark"
]
PLUTO_2019.drop(columns=columns_to_drop, inplace=True)

In [16]:
# Remove rows with an unavailable lot type (LotType == 0)
PLUTO_2019 = PLUTO_2019[PLUTO_2019["LotType"] != 0]

# Remove rows with an unavailable proximity code (ProxCode == 0)
PLUTO_2019 = PLUTO_2019[PLUTO_2019["ProxCode"] != 0]

In [17]:
# Fix exported BBLs: convert from float to 64-bit integer
# (10-digit BBLs do not fit in a 32-bit integer)
PLUTO_2019["BBL"] = PLUTO_2019["BBL"].round(0).astype("int64")
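
The sketch below illustrates why this conversion matters for the later join: a BBL stored as a float prints with a trailing `.0`, so it would not match the text BBLs in LL84. It uses toy values and requests `int64` explicitly, since the default integer width can depend on the platform and NumPy version.

```python
import pandas as pd

# Toy BBLs as they might appear after a float export (illustrative values)
bbl_float = pd.Series([4156520028.0, 4156520029.0])

# As text, the float form carries a trailing ".0"
as_text_float = bbl_float.astype(str)

# Round, cast to 64-bit integers, then to text for a clean 10-digit key
bbl_int = bbl_float.round(0).astype("int64")
bbl_str = bbl_int.astype(str)
print(as_text_float.tolist(), bbl_str.tolist())
```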

In [18]:
# Save file 
PLUTO_2019.to_csv(r"PLUTO_2019_processed.csv", index=False)

### Data integration according to BBL number

In [19]:
# Load the two processed datasets
df_pluto = pd.read_csv(r"PLUTO_2019_processed.csv")
df_ll84 = pd.read_csv(r"LL84_2019_processed.csv")

In [20]:
df_pluto['BBL'] = df_pluto['BBL'].astype(str)
df_ll84['BBL'] = df_ll84['BBL'].astype(str)

In [21]:
df_pluto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 711909 entries, 0 to 711908
Data columns (total 30 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   LandUse     711437 non-null  float64
 1   LotArea     711909 non-null  int64  
 2   BldgArea    711909 non-null  int64  
 3   NumBldgs    711909 non-null  int64  
 4   NumFloors   711909 non-null  float64
 5   UnitsRes    711909 non-null  int64  
 6   LotFront    711909 non-null  float64
 7   LotDepth    711909 non-null  float64
 8   BldgFront   711909 non-null  float64
 9   BldgDepth   711909 non-null  float64
 10  ProxCode    711536 non-null  float64
 11  LotType     711536 non-null  float64
 12  AssessLand  711909 non-null  float64
 13  BuiltFAR    711909 non-null  float64
 14  BoroCode    711909 non-null  int64  
 15  BBL         711909 non-null  object 
 16  XCoord      711909 non-null  int64  
 17  YCoord      711909 non-null  int64  
 18  ZoneMap     711501 non-null  object 
 19  ZM

In [22]:
df_ll84.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19603 entries, 0 to 19602
Data columns (total 10 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   BBL                                                 19603 non-null  object 
 1   Largest Property Use Type - Gross Floor Area (ft²)  19603 non-null  float64
 2   Building Age                                        19603 non-null  int64  
 3   ENERGY STAR Score                                   19603 non-null  int64  
 4   Weather Normalized Source EUI (kBtu/ft²)            19603 non-null  float64
 5   Weather Normalized Site Natural Gas Use (therms)    19603 non-null  float64
 6   Weather Normalized Site Energy Use (kBtu)           19603 non-null  float64
 7   Latitude                                            19603 non-null  float64
 8   Longitude                                           19603 non-null  float64


In [23]:
# Merge the two datasets on their BBL number
merged_df = pd.merge(df_pluto, df_ll84, on="BBL", how="inner")
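
An inner join keeps only BBLs present in both tables, so it can be useful to see how many rows fall on each side before committing to it. The following is a minimal sketch with toy frames (`df_left` and `df_right` are hypothetical stand-ins for the PLUTO and LL84 tables) using the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins: one lot has no energy record, one lot has two records
df_left = pd.DataFrame({
    "BBL": ["1009970029", "1013150001", "4156520028"],
    "NumFloors": [40.0, 30.0, 2.0],
})
df_right = pd.DataFrame({
    "BBL": ["1009970029", "1013150001", "1013150001"],
    "ENERGY STAR Score": [75, 60, 62],
})

# Outer merge with an indicator column showing where each row came from
check = pd.merge(df_left, df_right, on="BBL", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)

# The inner merge keeps only the rows labelled "both"
inner = pd.merge(df_left, df_right, on="BBL", how="inner")
```

Repeated BBLs on either side multiply rows in the result, so comparing `len(inner)` with the number of unique BBLs is a quick way to spot duplicates.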

In [24]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8711 entries, 0 to 8710
Data columns (total 39 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   LandUse                                             8711 non-null   float64
 1   LotArea                                             8711 non-null   int64  
 2   BldgArea                                            8711 non-null   int64  
 3   NumBldgs                                            8711 non-null   int64  
 4   NumFloors                                           8711 non-null   float64
 5   UnitsRes                                            8711 non-null   int64  
 6   LotFront                                            8711 non-null   float64
 7   LotDepth                                            8711 non-null   float64
 8   BldgFront                                           8711 non-null   float64
 9

In [25]:
# Save file
merged_df.to_csv(r"Merged_Data.csv", index=False)