# Great Energy Predictor - Featurization

<a id='content3'></a>
## Content

1. [Data Description](#description3)
2. [Imports](#imports3)
3. [Feature Selection](#seln)

<a id='description3'></a>
## 1. Data Description
[Back to top](#content3)

The data covers energy meter readings collected in 2016 from over 1,000 buildings at 16 sites around the world, together with weather measurements from the weather station nearest to each site. It consists of 3 relational files of tabular data with the following features:
##### 1. train.csv - contains energy consumption measurements from 4 types of building meters in 2016
    - building_id - identifies the building
        - There are 1,449 buildings across 16 sites around the world in this dataset
    - meter - meter type (not all buildings have all meter types)
        - 0 - electricity
        - 1 - chilledwater
        - 2 - steam
        - 3 - hotwater
    - timestamp - date and time of the meter reading
        - This dataset contains measurements over a span of an entire year
    - meter_reading - energy consumption, in kilowatt-hours (kWh) or equivalent
        - This is the target variable
##### 2. weather_train.csv - contains weather measurements in 2016 from the weather station that is closest to the site
    - site_id - identifies the site where the building is
    - timestamp - date and time of the weather measurements
    - air_temperature - air temperature, in degrees Celsius
    - cloud_coverage - portion of the sky covered by clouds, in oktas
    - dew_temperature - temperature at which dew forms, in degrees Celsius
    - precip_depth_1_hr - depth of rainfall over 1 hour, in millimeters (mm)
    - sea_level_pressure - atmospheric pressure at sea level, in millibars (mbar), equivalent to hectopascals (hPa)
    - wind_direction - compass direction of the wind, in degrees (0 - 360)
    - wind_speed - wind speed, in meters per second (m/s)
##### 3. building_metadata.csv - contains details about the buildings in the dataset
    - site_id - identifies the site where the building is
    - building_id - identifies the building
    - primary_use - what the building is used for (based on EnergyStar's property types)
    - square_feet - gross floor area of the building, in square feet (ft^2)
    - year_built - year the building was opened
    - floor_count - number of floors the building has
    
This data was retrieved from a public Kaggle competition hosted by ASHRAE.
##### Source: https://www.kaggle.com/c/ashrae-energy-prediction/data
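
The three files described above are relational: `building_id` links meter readings to building metadata, and `site_id` together with `timestamp` links buildings to weather measurements. The sketch below shows one way such a join could look. It uses tiny made-up frames rather than the real CSV files, and the variable names (`train`, `building`, `weather_demo`, `merged`) are illustrative only, not the ones used to build `eda_train.csv`.

```python
import pandas as pd

# Made-up rows that mimic the three files (values are illustrative only)
ts = pd.to_datetime('2016-01-01 00:00:00')
train = pd.DataFrame({
    'building_id': [0, 0, 1],
    'meter': [0, 1, 0],
    'timestamp': [ts, ts, ts],
    'meter_reading': [12.5, 3.0, 40.1],
})
building = pd.DataFrame({
    'site_id': [0, 1],
    'building_id': [0, 1],
    'primary_use': ['Education', 'Office'],
    'square_feet': [7432, 2720],
})
weather_demo = pd.DataFrame({
    'site_id': [0, 1],
    'timestamp': [ts, ts],
    'air_temperature': [25.0, 3.8],
})

# Left joins keep every meter reading even if metadata or weather is missing
merged = (
    train
    .merge(building, on='building_id', how='left')
    .merge(weather_demo, on=['site_id', 'timestamp'], how='left')
)
print(merged)
```

Left joins are used here so that the number of meter readings is unchanged by the merge; rows without a matching weather record would simply carry missing values.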

<a id='imports3'></a>
## 2. Imports
[Back to top](#content3)

##### Import libraries

In [1]:
%matplotlib inline

import gc
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import ray.dataframe as pd

In [2]:
# Default plot settings
sns.set(rc={'figure.figsize': (16, 6), 
            'font.size': 12})

##### Import data

In [3]:
data_path = '../data/output/'

In [21]:
# dtypes = {
#     'site_id': 'uint8',
#     'building_id': 'uint16',
#     'primary_use': 'category',
#     'square_feet': 'uint32',
#     'year_built': 'uint16',
#     'floor_count': 'uint8',
#     'use_encoded': 'uint8'
# }

In [22]:
# building = pd.read_csv(f'{data_path}clean_building.csv', dtype=dtypes).iloc[:, 1:]
# building.info()

In [23]:
# dtypes = {
#     'site_id': 'uint8',
#     'air_temperature': 'float32',
#     'cloud_coverage': 'uint8',
#     'dew_temperature': 'float32',
#     'precip_depth_1_hr': 'float32',
#     'sea_level_pressure': 'float32',
#     'wind_direction': 'uint16',
#     'wind_speed': 'float32'
# }

In [24]:
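# The weather dtype mapping above is commented out, so only site_id is cast explicitly
# (consistent with the info() output below)
weather = pd.read_csv(f'{data_path}clean_weather.csv', dtype={'site_id': 'uint8'}, parse_dates=['timestamp']).iloc[:, 1:]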
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140544 entries, 0 to 140543
Data columns (total 9 columns):
site_id               140544 non-null uint8
timestamp             140544 non-null datetime64[ns]
air_temperature       140544 non-null float64
cloud_coverage        140544 non-null int64
dew_temperature       140544 non-null float64
precip_depth_1_hr     140544 non-null float64
sea_level_pressure    140544 non-null float64
wind_direction        140544 non-null int64
wind_speed            140544 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(2), uint8(1)
memory usage: 8.7 MB
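
The `info()` output above shows most weather columns at 64-bit types. As a rough sketch, the compact types from the commented-out weather mapping could be applied after loading with `astype`. The frame below is a small synthetic stand-in (values are made up), and the cast assumes the integer columns contain no missing or negative values, since unsigned types cannot hold either.

```python
import pandas as pd

# Compact dtypes, copied from the commented-out weather mapping
weather_dtypes = {
    'site_id': 'uint8',
    'air_temperature': 'float32',
    'cloud_coverage': 'uint8',
    'dew_temperature': 'float32',
    'precip_depth_1_hr': 'float32',
    'sea_level_pressure': 'float32',
    'wind_direction': 'uint16',
    'wind_speed': 'float32',
}

# Synthetic stand-in for clean_weather.csv (illustrative values only)
demo = pd.DataFrame({
    'site_id': [0, 0, 1],
    'timestamp': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00', '2016-01-01 00:00']),
    'air_temperature': [25.0, 24.4, 3.8],
    'cloud_coverage': [6, 4, 0],
    'dew_temperature': [20.0, 21.1, 2.4],
    'precip_depth_1_hr': [0.0, 0.0, 0.0],
    'sea_level_pressure': [1019.7, 1020.2, 1020.9],
    'wind_direction': [0, 70, 240],
    'wind_speed': [0.0, 1.5, 3.1],
})

before = demo.memory_usage(index=False).sum()
compact = demo.astype(weather_dtypes)  # timestamp is not in the mapping and is left as-is
after = compact.memory_usage(index=False).sum()
print(f'{before} bytes -> {after} bytes')
```

Note that `float32` keeps roughly 7 significant digits, which is assumed here to be enough for these measurements.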


In [25]:
# dtypes = {
#     'site_id': 'uint8',
#     'building_id': 'uint16',
#     'meter': 'uint8',
#     'type': 'category',
#     'meter_reading': 'float32',
#     'weekday': 'uint8',
#     'hour': 'uint8'
# }

In [26]:
# meter = pd.read_csv(f'{data_path}eda_meter.csv', dtype=dtypes, parse_dates=['timestamp']).iloc[:, 1:]
# meter.info()

In [29]:
dtypes = {
    'site_id': 'uint8',
    'building_id': 'uint16',
    'use_encoded': 'uint8',
    'primary_use': 'category',
    'year_built': 'uint16',
    'floor_count': 'uint8',
    'square_feet': 'uint32',
    'month': 'uint8',
    'day': 'uint8',
    'hour': 'uint8',
    'weekday': 'uint8',
    'meter': 'uint8',
    'type': 'category',
    'meter_reading': 'float32',
    'air_temperature': 'float32',
    'dew_temperature': 'float32',
    'sea_level_pressure': 'float32',
    'cloud_coverage': 'uint8',
    'precip_depth_1_hr': 'float32',
    'wind_direction': 'uint16',
    'wind_speed': 'float32'
}

In [30]:
train_df = pd.read_csv(f'{data_path}eda_train.csv', dtype=dtypes, parse_dates=['timestamp']).iloc[:, 1:]
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20137746 entries, 0 to 20137745
Data columns (total 22 columns):
site_id               uint8
building_id           uint16
use_encoded           uint8
primary_use           category
year_built            uint16
floor_count           uint8
square_feet           uint32
timestamp             datetime64[ns]
month                 uint8
day                   uint8
hour                  uint8
weekday               uint8
meter                 uint8
type                  category
meter_reading         float32
air_temperature       float32
dew_temperature       float32
sea_level_pressure    float32
cloud_coverage        uint8
precip_depth_1_hr     float32
wind_direction        uint16
wind_speed            float32
dtypes: category(2), datetime64[ns](1), float32(6), uint16(3), uint32(1), uint8(9)
memory usage: 1017.9 MB
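
The reported memory usage can be checked by hand from the dtype summary above: each row costs the sum of the per-column item sizes. The sketch below does that arithmetic, assuming each of the two category columns stores one `int8` code per row (which holds when a column has fewer than 128 categories) and ignoring the small overhead of the index and category labels. The 8-bytes-per-column baseline is a hypothetical comparison, not a measurement.

```python
import numpy as np

n_rows = 20_137_746  # from the RangeIndex line of train_df.info()

# Column counts per dtype, taken from the dtypes summary line
column_counts = {
    'uint8': 9,
    'uint16': 3,
    'uint32': 1,
    'float32': 6,
    'datetime64[ns]': 1,
}
bytes_per_row = sum(np.dtype(name).itemsize * count
                    for name, count in column_counts.items())

# Assumption: 2 category columns, one int8 code per row each
bytes_per_row += 2 * np.dtype('int8').itemsize

total_mb = n_rows * bytes_per_row / 1024 ** 2
print(f'{bytes_per_row} bytes/row, about {total_mb:.1f} MB')
# prints "53 bytes/row, about 1017.9 MB"

# Hypothetical baseline: all 22 columns stored as 8-byte values
baseline_mb = n_rows * 22 * 8 / 1024 ** 2
print(f'baseline is about {baseline_mb / total_mb:.1f}x larger')
```

The result agrees with the `memory usage` line, which suggests the explicit `dtype` mapping was applied to every column as intended.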


In [31]:
del dtypes
gc.collect()

8877

<a id='seln'></a>
## 3. Feature Selection
[Back to top](#content3)