# ASHRAE - Great Energy Predictor III (Beginner)

# Introduction
* Q: How much does it cost to cool a skyscraper in the summer?
* A: A lot! And not just in dollars, but in environmental impact.

Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is, are the improvements working? That’s where you come in. Under pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model. Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don’t work with different building types.

In this competition, you’ll develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe. With better estimates of these energy-saving investments, large scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.

>About the Host

![image](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1095143%2Ff9ab8963dea5e7c1716f47310daa96ab%2FASHRAE_Logo_25.jpg?generation=1570808142334850&alt=media)

Founded in 1894, ASHRAE serves to advance the arts and sciences of heating, ventilation, air conditioning, refrigeration and their allied fields. ASHRAE members represent building system design and industrial process professionals around the world. With over 54,000 members serving in 132 countries, ASHRAE supports research, standards writing, publishing and continuing education - shaping tomorrow’s built environment today.

Banner photo by Federico Beccari on Unsplash

# Data

Assessing the value of energy efficiency improvements can be challenging as there's no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower cost financing.
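
To make the savings calculation concrete, here is a minimal sketch of the comparison described above. The numbers are hypothetical and do not come from the competition data; `modeled_baseline` stands in for the counterfactual model output and `metered_actual` for the readings after the retrofit.

```python
# Hypothetical hourly values in kWh (made up for illustration).
modeled_baseline = [120.0, 135.5, 150.0, 142.5]  # counterfactual: predicted usage without the retrofit
metered_actual = [100.0, 110.5, 125.0, 120.0]    # measured usage after the retrofit

# Savings per hour = what the building would have used minus what it actually used.
hourly_savings = [b - a for b, a in zip(modeled_baseline, metered_actual)]
total_savings = sum(hourly_savings)

print(hourly_savings)
print(total_savings)
```

Any error in the modeled baseline flows straight into the savings figure, which is why model accuracy matters for this kind of financing.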

This competition challenges you to build these counterfactual models across four energy types based on historic usage rates and observed weather. The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

## Files

### train.csv

* building_id - Foreign key for the building metadata.
* meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* timestamp - When the measurement was taken
* meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
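
As a small illustration of the meter codes listed above, the sketch below maps them to readable names and lists which meters each building has. The frame `toy_train` and its values are made up; only the column names and the code-to-name mapping come from the description.

```python
import pandas as pd

# Mapping taken from the data description above.
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Toy frame standing in for train.csv (values are hypothetical).
toy_train = pd.DataFrame({
    "building_id": [0, 0, 1, 2],
    "meter": [0, 1, 0, 3],
    "meter_reading": [10.5, 3.2, 7.0, 1.1],
})

# Translate each numeric meter code into its name.
toy_train["meter_name"] = toy_train["meter"].map(METER_NAMES)

# Collect the meter types present for each building.
meters_per_building = toy_train.groupby("building_id")["meter_name"].apply(list)
print(meters_per_building)
```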

### building_metadata.csv

* site_id - Foreign key for the weather files.
* building_id - Foreign key for train.csv
* primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* square_feet - Gross floor area of the building
* year_built - Year building was opened
* floor_count - Number of floors of the building
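
The foreign-key relationship above can be sketched as a left join on `building_id`. The frames `toy_train` and `toy_meta` and all their values are hypothetical stand-ins for the real files; `validate="many_to_one"` is an optional safety check that each building appears only once in the metadata.

```python
import pandas as pd

# Toy meter readings (hypothetical values).
toy_train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00"] * 3),
    "meter_reading": [10.5, 3.2, 7.0],
})

# Toy building metadata, one row per building (hypothetical values).
toy_meta = pd.DataFrame({
    "site_id": [0, 1],
    "building_id": [0, 1],
    "primary_use": ["Education", "Office"],
    "square_feet": [7432, 2720],
})

# Left join keeps every reading and attaches the building attributes to it.
merged = toy_train.merge(toy_meta, on="building_id", how="left", validate="many_to_one")
print(merged)
```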

### weather_[train/test].csv
Weather data from a meteorological station as close as possible to the site.

* site_id
* air_temperature - Degrees Celsius
* cloud_coverage - Portion of the sky covered in clouds, in oktas
* dew_temperature - Degrees Celsius
* precip_depth_1_hr - Millimeters
* sea_level_pressure - Millibar/hectopascals
* wind_direction - Compass direction (0-360)
* wind_speed - Meters per second
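
Weather rows can be attached to meter readings by joining on `site_id` and `timestamp`, once `site_id` is known for each reading. This is only a sketch on made-up data: `toy_readings` and `toy_weather` are hypothetical names, and the join keys assume both frames carry a `timestamp` column, as the loading code in this notebook suggests.

```python
import pandas as pd

# Toy readings that already carry site_id (hypothetical values).
toy_readings = pd.DataFrame({
    "site_id": [0, 0, 1],
    "building_id": [0, 0, 5],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]),
    "meter_reading": [10.5, 11.0, 7.0],
})

# Toy weather with one hour deliberately absent for site 0.
toy_weather = pd.DataFrame({
    "site_id": [0, 1],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:00"]),
    "air_temperature": [25.0, 3.8],
})

# A left join keeps every reading; hours without a weather row get NaN.
with_weather = toy_readings.merge(toy_weather, on=["site_id", "timestamp"], how="left")
missing_weather = int(with_weather["air_temperature"].isna().sum())
print(with_weather)
print(missing_weather)
```

Counting the NaN values after the join is a quick way to see how many readings have no matching weather observation.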

### test.csv
The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.
* row_id - Row id for your submission file
* building_id - Building id code
* meter - The meter id code
* timestamp - Timestamps for the test data period

### sample_submission.csv
A valid sample submission.

* All floats in the solution file were truncated to four decimal places; we recommend you do the same to save space on your file upload.
* There are gaps in some of the meter readings for both the train and test sets. Gaps in the test set are not revealed or scored.
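
A submission can be assembled by pairing each `row_id` from test.csv with a prediction and writing floats with four decimal places. The sketch below uses made-up predictions and assumes the submission columns are `row_id` and `meter_reading`. Note that `float_format="%.4f"` rounds rather than truncates, which should be close enough for saving space.

```python
import pandas as pd

# Toy stand-in for test.csv (hypothetical values).
toy_test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [0, 1, 2],
    "meter": [0, 0, 1],
})

# Hypothetical model output, in the same order as toy_test.
predictions = [1.23456789, 0.0, 250.5]

submission = pd.DataFrame({
    "row_id": toy_test["row_id"],
    "meter_reading": predictions,
})

# Write floats with four decimals to keep the file small.
csv_text = submission.to_csv(index=False, float_format="%.4f")
print(csv_text)
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning a string.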

## Input data files from Kaggle directory.

In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

#import numpy as np # linear algebra
#import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#df_sample_submission = pd.read_csv("/kaggle/input/ashrae-energy-prediction/sample_submission.csv")
#df_building_metadata = pd.read_csv("/kaggle/input/ashrae-energy-prediction/building_metadata.csv")
#df_weather_train = pd.read_csv("/kaggle/input/ashrae-energy-prediction/weather_train.csv")
#df_weather_test = pd.read_csv("/kaggle/input/ashrae-energy-prediction/weather_test.csv")
#df_train = pd.read_csv("/kaggle/input/ashrae-energy-prediction/train.csv")
#df_test = pd.read_csv("/kaggle/input/ashrae-energy-prediction/test.csv")

%%time

df_train['timestamp'] = pd.to_datetime(df_train['timestamp'])
df_test['timestamp'] = pd.to_datetime(df_test['timestamp'])
df_weather_train['timestamp'] = pd.to_datetime(df_weather_train['timestamp'])
df_weather_test['timestamp'] = pd.to_datetime(df_weather_test['timestamp'])

%%time

df_train.to_feather('train.feather')
df_test.to_feather('test.feather')
df_weather_train.to_feather('weather_train.feather')
df_weather_test.to_feather('weather_test.feather')
df_building_metadata.to_feather('building_metadata.feather')
df_sample_submission.to_feather('sample_submission.feather')

Based on the [ASHRAE: feather format for fast loading](https://www.kaggle.com/corochann/ashrae-simple-lgbm-submission) kernel. 'test.csv' is large and takes a long time to load, so I use the code from that notebook to convert the competition data to feather format for fast pandas.DataFrame loading!
1. Click Add Data in the top right-hand corner.
2. Search for 'ashrae-feather-format-for-fast-loading' and add it to the kernel.
3. Use .read_feather in pandas to read the feather files.

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df_train = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/train.feather')
df_weather_train = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/weather_train.feather')
df_test = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/test.feather')
df_weather_test = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/weather_test.feather')
df_building_metadata = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/building_metadata.feather')
df_sample_submission = pd.read_feather('/kaggle/input/ashrae-feather-format-for-fast-loading/sample_submission.feather')

# Exploratory Data Analysis
I would like to show you what the data says, step by step, using graphs and short points.
* I will describe my process and result before I run the code.
* I will do very basic data analysis and model prediction. There are just three features in my model, and I will not use the weather or building_metadata data.

    1. df_train has 20,216,100 rows and 4 columns:
        1.   building_id = 1448 unique. It means that there are 1448 buildings in this training data set.
        2.   meter = 4 values, read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. *Some buildings have all 4 meters, but some do not.
        3.   timestamp = each meter is read once per hour, and the recording period is the whole year of 2016. It means that a single building_id has around 8,000 meter_reading records per meter (at most 8,784, since 2016 is a leap year with 366 × 24 hours).
        4.   meter_reading
    2. I plot a time series line graph for the building_ids with the max and min mean meter readings, to give you an easy overview.
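
The checks listed above can be sketched on a toy frame as follows. `toy_train` and its values are hypothetical; on the real data the same calls would run on `df_train`. Picking the buildings with the highest and lowest mean reading via `idxmax` and `idxmin` is one possible way to choose which buildings to plot, not necessarily what the cells below do.

```python
import pandas as pd

# Toy stand-in for df_train (hypothetical values).
toy_train = pd.DataFrame({
    "building_id": [1, 1, 2, 2, 3, 3],
    "meter": [0, 0, 0, 0, 0, 0],
    "timestamp": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"] * 3),
    "meter_reading": [5.0, 7.0, 100.0, 120.0, 0.5, 1.5],
})

# Basic size checks: number of rows, columns and distinct buildings.
n_rows, n_cols = toy_train.shape
n_buildings = toy_train["building_id"].nunique()

# Mean reading per building, then the buildings with the largest and smallest mean.
mean_by_building = toy_train.groupby("building_id")["meter_reading"].mean()
highest = mean_by_building.idxmax()
lowest = mean_by_building.idxmin()

print(n_rows, n_cols, n_buildings)
print(highest, lowest)
```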


In [None]:
%%time

df_train["timestamp"] = pd.to_datetime(df_train["timestamp"])
df_test["timestamp"] = pd.to_datetime(df_test["timestamp"])

df_train = df_train.assign(hour=df_train.timestamp.dt.hour,
               day=df_train.timestamp.dt.day,
               month=df_train.timestamp.dt.month,
               year=df_train.timestamp.dt.year)

df_test = df_test.assign(hour=df_test.timestamp.dt.hour,
               day=df_test.timestamp.dt.day,
               month=df_test.timestamp.dt.month,
               year=df_test.timestamp.dt.year)

df_train

In [None]:
########################### Helpers
#################################################################################
## -------------------
## Memory Reducer
# :df pandas dataframe to reduce size             # type: pd.DataFrame()
# :verbose                                        # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
## -------------------

In [None]:
%%time
########################### Base check
#################################################################################
do_not_convert = ['category','datetime64[ns]','object']
for df in [df_train, df_test, df_building_metadata, df_weather_train, df_weather_test,df_sample_submission]:
    original = df.copy()
    df = reduce_mem_usage(df)

    for col in list(df):
        if df[col].dtype.name not in do_not_convert:
            if (df[col]-original[col]).sum()!=0:
                df[col] = original[col]
                print('Bad transformation', col)

In [None]:
df_train

%%time
meter_mean = df_train.groupby(['building_id', 'timestamp']).meter_reading.transform('mean')
len(meter_mean)

df_train['meter_mean'] = meter_mean

In [None]:
import random

# Pick three distinct building ids at random.
# random.sample avoids the duplicates that repeated randint calls can produce.
mylist = random.sample(range(1, 1449), 3)

print(mylist)

In [None]:
df_train_polt = df_train[df_train['building_id'].isin(mylist)]
df_train_polt 

In [None]:
# sort_index returns a new frame, so the result must be assigned to take effect.
df_train_polt = df_train_polt.sort_index(axis=1)
df_train_polt = df_train_polt.reset_index(drop=True)
df_train_polt

In [None]:
%%time
import seaborn as sns

# Readings of the sampled buildings for a single day (30 June).
df_1Day = df_train_polt[(df_train_polt.month == 6) & (df_train_polt.day == 30)]
sns.set(style="darkgrid", rc={'figure.figsize': (11.2, 8.27)})
ax = sns.lineplot(x="timestamp", y="meter_reading", hue='building_id', data=df_1Day)

In [None]:
%%time
import seaborn as sns

# Readings of the sampled buildings for the whole month of June.
df_1month = df_train_polt[df_train_polt.month == 6]
sns.set(style="darkgrid", rc={'figure.figsize': (11.2, 8.27)})
ax = sns.lineplot(x="timestamp", y="meter_reading", hue='building_id', data=df_1month)

In [None]:
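# Noon readings for the first sampled building.
# The original hard-coded id 329, which is only present if the random draw happens to include it.
df_train_polt.loc[(df_train_polt.building_id == mylist[0]) & (df_train_polt.hour == 12), 'meter_reading']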

In [None]:
%%time
import seaborn as sns
import matplotlib.pyplot as plt

# Keep only the noon reading of each day, giving one bar per day over the year.
df_1year = df_train_polt[df_train_polt.hour == 12]

# Set style, context and figure size in one call so a later sns.set does not reset them.
sns.set(style="dark", context="talk", rc={'figure.figsize': (20, 8.27)})

# Use the sampled ids instead of the hard-coded 329, 358 and 1019,
# so the cell still works after a fresh random draw.
building_ids = sorted(set(mylist))
f, axes = plt.subplots(len(building_ids), 1, figsize=(20, 8.27), squeeze=False)

for ax, building_id in zip(axes.ravel(), building_ids):
    df_building = df_1year[df_1year.building_id == building_id]
    sns.barplot(x=df_building['timestamp'], y=df_building['meter_reading'], ax=ax, color='black')
    ax.axhline(0, color="c", clip_on=False)
    ax.set_ylabel(str(building_id))

sns.despine(bottom=True)
plt.setp(f.axes, yticks=[])
plt.tight_layout(h_pad=2)