# Predicting Building Energy Use

This tutorial shows how to use NYC Open Data on building energy and water usage to predict citywide building performance. It introduces the fundamental concepts of linear regression, training a statistical model, and evaluating model performance.

The first step is to import the necessary libraries:
- [GeoPandas](https://geopandas.org/index.html): similar to pandas, with extensions that make it easier to work with geospatial data.
- [Matplotlib](https://matplotlib.org/): the de facto library for drawing and plotting in Python.
- [Sodapy](https://github.com/xmunoz/sodapy): an interface for NYC Open Data (more broadly, for the Socrata API that NYC Open Data relies on).
- [Shapely](https://shapely.readthedocs.io/en/stable/manual.html): a library for manipulating and analyzing geometry.
- [scikit-learn](https://scikit-learn.org/stable/): a machine learning library.
- [Psycopg2](https://www.psycopg.org/docs/): a utility library that makes it easier to connect to PostgreSQL.

In [27]:
import os
import pandas as pd
import numpy as np
import geopandas as gpd
from sodapy import Socrata

In [5]:
client = Socrata("data.cityofnewyork.us", os.environ['nyc_soda_cuny_token'])

In [6]:
results = client.get("qb3v-bbre", limit=30000)
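
The request above works because the dataset has fewer rows than the `limit`. For larger datasets, one option is to page through the results. The sketch below is a minimal illustration, not part of the original notebook: `fetch_all`, `fake_get_page`, and the fake rows are hypothetical names and data, and the page source is passed in as a function so the logic can be tried without a network connection or an API token.

```python
from typing import Callable, Dict, List


def fetch_all(get_page: Callable[[int, int], List[Dict]],
              page_size: int = 1000) -> List[Dict]:
    """Request pages of `page_size` rows until a short (or empty) page arrives."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = get_page(page_size, offset)
        records.extend(page)
        if len(page) < page_size:
            # A page shorter than requested means there is nothing left.
            break
        offset += page_size
    return records


# Hypothetical stand-in for the API: 2,500 fake rows served in slices.
_fake_rows = [{"order": str(i)} for i in range(2500)]


def fake_get_page(limit: int, offset: int) -> List[Dict]:
    return _fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_get_page, page_size=1000)
print(len(all_rows))
```

With sodapy, the page function could plausibly be `lambda limit, offset: client.get("qb3v-bbre", limit=limit, offset=offset)`, since `client.get` accepts `limit` and `offset` keyword arguments. Adding a stable sort order to the query is advisable so that pages do not overlap.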

In [7]:
bldg_energy = pd.DataFrame.from_records(results)

In [8]:
print(bldg_energy.shape)
bldg_energy.head()

(28807, 67)


| | order | property_id | property_name | parent_property_id | parent_property_name | city_building | bbl_10_digits | nyc_borough_block_and_lot | nyc_building_identification | address_1_self_reported | ... | water_use_all_water_sources | water_use_intensity_all_water | water_required | generation_date | latitude | longitude | community_board | council_district | census_tract | nta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7365 | 1155 | Not Applicable: Standalone Property | Not Applicable: Standalone Property | No | 1009970029 | 1009970029 | 1022631 | 1155 Avenue of the Americas | ... | Not Available | Not Available | No | 2020-05-28T04:27:00.000 | 40.756631 | -73.982826 | 105 | 4 | 119 | Midtown-Midtown South |
| 1 | 2 | 8139 | 200 | Not Applicable: Standalone Property | Not Applicable: Standalone Property | No | 1013150001 | 1013150001 | 1037545 | 200 East 42nd St. | ... | 7310.6 | 19.02 | Yes | 2020-05-28T04:27:00.000 | 40.750698 | -73.974306 | 106 | 4 | 88 | Turtle Bay-East Midtown |
| 2 | 3 | 8604 | 114 | Not Applicable: Standalone Property | Not Applicable: Standalone Property | No | 1009990019 | 1009990019 | 1022667 | 114 West 47th st | ... | Not Available | Not Available | No | 2020-05-28T04:27:00.000 | 40.75831 | -73.982504 | 105 | 4 | 125 | Midtown-Midtown South |
| 3 | 4 | 8841 | 733 | Not Applicable: Standalone Property | Not Applicable: Standalone Property | No | 1013190047 | 1013190047 | 1037596 | 733 Third Avenue | ... | Not Available | Not Available | No | 2020-05-28T04:27:00.000 | 40.753074 | -73.972753 | 106 | 4 | 90 | Turtle Bay-East Midtown |
| 4 | 5 | 11809 | Conde Nast Building | Not Applicable: Standalone Property | Not Applicable: Standalone Property | No | 1009950005 | 1009950005 | 1085682 | 4 Times Square | ... | Not Available | Not Available | No | 2020-05-28T04:27:00.000 | 40.756181 | -73.986244 | 105 | 4 | 119 | Midtown-Midtown South |


In [28]:
bldg_energy.replace(to_replace={'Not Available': np.nan}, inplace=True)

In [39]:
bldg_energy['number_of_buildings'] = bldg_energy['number_of_buildings'].astype(int)

In [36]:
cols = ['energy_star_score', 'occupancy', 'dof_gross_floor_area_ft',
        'site_eui_kbtu_ft', 'weather_normalized_source', 'weather_normalized_site',
        'weather_normalized_site_1', 'fuel_oil_1_use_kbtu', 'fuel_oil_2_use_kbtu',
        'fuel_oil_4_use_kbtu', 'fuel_oil_5_6_use_kbtu', 'diesel_2_use_kbtu',
        'kerosene_use_kbtu', 'propane_use_kbtu', 'district_steam_use_kbtu',
        'district_hot_water_use_kbtu', 'district_chilled_water_use',
        'natural_gas_use_kbtu', 'weather_normalized_site_2',
        'electricity_use_grid_purchase', 'electricity_use_grid_purchase_1',
        'weather_normalized_site_3', 'annual_maximum_demand_kw',
        'total_ghg_emissions_metric', 'direct_ghg_emissions_metric',
        'water_use_all_water_sources', 'water_use_intensity_all_water',
        'latitude', 'longitude', 'indirect_ghg_emissions_metric',]

In [37]:
# Cast every measurement column to float in a single step
bldg_energy[cols] = bldg_energy[cols].astype(float)
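
The cleaning steps above (replacing `'Not Available'` with `NaN`, then casting) can be tried on a small frame first. The sketch below uses a made-up three-row DataFrame named `toy`, which is an assumption for illustration and not the real dataset. It also shows two alternatives the notebook does not use: `pd.to_numeric(..., errors="coerce")`, which turns any unparseable string into `NaN`, and the nullable `Int64` dtype, which keeps integer columns as integers even when values are missing (a plain `astype(int)` raises an error on `NaN`).

```python
import numpy as np
import pandas as pd

# Made-up sample: the API returns every value as a string.
toy = pd.DataFrame({
    "site_eui_kbtu_ft": ["81.2", "Not Available", "64.0"],
    "number_of_buildings": ["1", "2", "Not Available"],
})

# Same placeholder replacement as in the notebook.
toy = toy.replace("Not Available", np.nan)

# errors="coerce" converts anything that is not a number to NaN.
toy["site_eui_kbtu_ft"] = pd.to_numeric(toy["site_eui_kbtu_ft"], errors="coerce")

# Nullable Int64 keeps whole numbers as integers and missing values as <NA>.
toy["number_of_buildings"] = (
    pd.to_numeric(toy["number_of_buildings"], errors="coerce").astype("Int64")
)

print(toy.dtypes)
```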

# TO DO:

- Merge with PLUTO data
- Create features: adjacency/stand-alone/occlusion/program/material/year
- Run correlations
- Train model
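
The last two items on this list could look roughly like the sketch below. It runs on synthetic data only: the column names `floor_area_ft`, `year_built`, and `energy_kbtu`, and the relationship between them, are invented for illustration and say nothing about the real buildings. The aim is to show the shape of the workflow: a correlation matrix, a train/test split, a linear regression fit, and an R² score on held-out data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed names, not columns from the real dataset).
floor_area = rng.uniform(25_000, 500_000, n)
year_built = rng.integers(1900, 2020, n)

# Invented linear relationship plus noise, so the fit has something to recover.
energy = 60 * floor_area - 1000 * (year_built - 1900) + rng.normal(0, 50_000, n)

synthetic = pd.DataFrame({
    "floor_area_ft": floor_area,
    "year_built": year_built,
    "energy_kbtu": energy,
})

# Pairwise Pearson correlations between all columns.
corr = synthetic.corr()
print(corr.round(2))

X = synthetic[["floor_area_ft", "year_built"]]
y = synthetic["energy_kbtu"]

# Hold out 20% of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print("coefficients:", model.coef_)
print("R^2 on the test set:", r2)
```

On real data the fit will be much noisier than on this synthetic example, which is why evaluating on a held-out test set matters.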

In [18]:
# bldg_energy[bldg_energy['direct_ghg_emissions_metric']!='0']

In [None]:
# bldg_energy['largest_property_use_type'].loc[bldg_energy['largest_property_use_type'].str.contains('Housing')]
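
A filter like the commented-out one above can fail when the column contains missing values, because `str.contains` returns `NaN` for them and a mask with `NaN` cannot be used for boolean indexing. A small sketch, using an invented four-row frame named `toy` rather than the real data, shows the `na=False` argument that treats missing values as non-matches:

```python
import pandas as pd

# Invented sample values; one entry is missing on purpose.
toy = pd.DataFrame({
    "largest_property_use_type": [
        "Multifamily Housing",
        "Office",
        None,
        "Residence Hall/Dormitory",
    ]
})

# na=False makes missing values count as "does not contain".
is_housing = toy["largest_property_use_type"].str.contains("Housing", na=False)
housing = toy.loc[is_housing]
print(housing)
```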