Q: How much does it cost to cool a skyscraper in the summer?
A: A lot! And not just in dollars, but in environmental impact.

Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is, are the improvements working? That’s where you come in. Under pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model. Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don’t work with different building types.
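
To make the pay-for-performance arithmetic concrete, here is a minimal sketch with made-up numbers. The `baseline_kwh` values stand in for the model's estimate of consumption without the retrofit, and `price_per_kwh` is a hypothetical tariff; neither comes from the competition data.

```python
import numpy as np

# Hypothetical monthly figures for one building (kWh); illustration only.
metered_kwh = np.array([120_000.0, 115_000.0, 130_000.0])   # measured after the retrofit
baseline_kwh = np.array([150_000.0, 140_000.0, 155_000.0])  # model estimate without the retrofit

price_per_kwh = 0.10  # assumed tariff, not from the source

# Savings are the gap between the counterfactual and what was actually used.
savings_kwh = baseline_kwh - metered_kwh
payment = savings_kwh * price_per_kwh

print(savings_kwh)    # energy saved per month
print(payment.sum())  # total payment for the period
```

Any error in the baseline estimate flows directly into the payment, which is why the accuracy of the model matters here.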

In this competition, you’ll develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe. With better estimates of these energy-saving investments, large-scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.

About the Host


Founded in 1894, ASHRAE serves to advance the arts and sciences of heating, ventilation, air conditioning, refrigeration and their allied fields. ASHRAE members represent building system design and industrial process professionals around the world. With over 54,000 members serving in 132 countries, ASHRAE supports research, standards writing, publishing and continuing education - shaping tomorrow’s built environment today.

Banner photo by Federico Beccari on Unsplash

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**Importing necessary files**

In [None]:
build_meta = pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv")
test = pd.read_csv("../input/ashrae-energy-prediction/test.csv")
train_df = pd.read_csv("../input/ashrae-energy-prediction/train.csv")
weat_train = pd.read_csv("../input/ashrae-energy-prediction/weather_train.csv")
weat_test = pd.read_csv("../input/ashrae-energy-prediction/weather_test.csv")

In [None]:
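## Function to reduce the DataFrame size by downcasting numeric columns
def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that fits its min/max."""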
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
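
One caveat worth checking before relying on the function above: it casts a float column to `float16` whenever the column's range fits, and `float16` keeps only about three significant decimal digits. The sketch below uses an illustrative pressure value (not taken from the data) to show the rounding this can introduce.

```python
import numpy as np

# A typical sea-level pressure reading in hPa (illustrative value).
pressure = 1013.3

as_f16 = float(np.float16(pressure))
as_f32 = float(np.float32(pressure))

# float16 has a 10-bit mantissa, so values between 512 and 1024
# can only be stored in steps of 0.5.
print(as_f16)
print(abs(as_f32 - pressure) < 1e-3)
```

If that loss of precision matters for a feature, one option is to stop the downcast at `float32` for that column.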

**Checking shapes of all the files**

In [None]:
print("Shape of building_metadata is: ", build_meta.shape)
print("Shape of train is: ", train_df.shape)
print("Shape of weat_train is: ", weat_train.shape)
print("Shape of test is: ", test.shape)
print("Shape of weat_test is: ", weat_test.shape)

**Submission**

In [None]:
sub = pd.read_csv("../input/ashrae-energy-prediction/sample_submission.csv")
print("Shape of submission is: ", sub.shape)
sub.head()

**Taking a quick look at all the files**

In [None]:
build_meta.head()

In [None]:
train_df.head()

In [None]:
weat_train.head()

In [None]:
test.head()

In [None]:
weat_test.head()

**Merging Train**

In [None]:
import gc

**Reduce memory usage**

In [None]:
build_meta = reduce_mem_usage(build_meta)
train_df = reduce_mem_usage(train_df)
weat_train = reduce_mem_usage(weat_train)


In [None]:
del test, sub, weat_test
gc.collect()

In [None]:
train = pd.merge(train_df, build_meta, on="building_id", how="left")
print("Shape after merging train and building metadata is:", train.shape)

del train_df
gc.collect()

train = pd.merge(train, weat_train, on=['site_id','timestamp'], how="left")
print("Shape after merging all train data is:", train.shape)

del weat_train
gc.collect()
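
A left merge only keeps the row count unchanged if the key is unique on the right-hand side. As a hedged sketch on toy frames (the data here is invented), pandas can check that assumption through the `validate` and `indicator` arguments of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the readings and the building metadata.
readings = pd.DataFrame({"building_id": [0, 0, 1], "meter_reading": [1.0, 2.0, 3.0]})
buildings = pd.DataFrame({"building_id": [0, 1], "site_id": [0, 0]})

# validate raises MergeError if building_id is duplicated in `buildings`;
# indicator adds a _merge column saying where each row found a match.
merged = pd.merge(readings, buildings, on="building_id", how="left",
                  validate="many_to_one", indicator=True)

# A left join against a unique key keeps the row count unchanged.
assert len(merged) == len(readings)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` in `_merge` would be readings with no matching metadata, which is one source of NaN values after merging.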


**Before we move further let's visualize**

In [None]:
import seaborn as sns

def bar_plot(feature, df):
    # Draw one bar per distinct value of `feature`, showing how often it occurs
    sns.set(style="darkgrid")
    sns.countplot(x=feature, data=df)


**Visualizing train**

In [None]:
train.head()

In [None]:
train.isna().sum()

> From above we can see that there are a lot of NaN values present, which we have to handle carefully
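
Raw NaN counts are hard to compare on a frame this large, so a percentage view can help. This is a small sketch on an invented toy frame; the column names mirror the dataset but the values do not.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries.
df = pd.DataFrame({
    "floor_count": [np.nan, np.nan, np.nan, 2.0],
    "year_built": [1990.0, np.nan, 2005.0, np.nan],
    "square_feet": [1000, 2000, 3000, 4000],
})

# isna().mean() gives the fraction of missing rows per column.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```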

**Label (meter_reading)**

In [None]:
#bar_plot("meter_reading", train)
print("Number of distinct meter_reading values:", train.meter_reading.nunique())
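
A small aside on counting distinct values, sketched on a toy series: `value_counts().shape` returns a one-element tuple, while `nunique()` returns the same count as a plain integer.

```python
import pandas as pd

s = pd.Series([0.0, 0.0, 1.5, 2.0, 2.0], name="meter_reading")

print(s.value_counts().shape)  # a one-element tuple
print(s.nunique())             # a plain integer
```

Both ignore NaN by default, so they agree on the count.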

**site_id**

In [None]:
bar_plot("site_id", train)

1. From the above plot we can conclude that there are 16 different site_id values in total
2. IDs 2, 3, 9, 13, 14, 15 appear the most often
3. IDs 7, 11, 12 appear the least often

**meter**

In [None]:
print("Value counts of meter:", train.meter.value_counts(), sep="\n")

bar_plot("meter", train)

1. From the above plot we can conclude that there are 4 different meter categories in total
2. Meter 0 appears the most often
3. Meter 3 appears the least often


**primary_use**

In [None]:
print("Value counts of primary_use:", train.primary_use.value_counts(), sep="\n")
bar_plot("primary_use", train)

1. This is a categorical variable
2. From above we can see that Office has the highest count and Religious worship has the lowest

**square feet**

In [None]:
print("Number of distinct square_feet values:", train.square_feet.nunique())
bar_plot("square_feet", train)

> There are 1397 distinct square_feet values in total
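
With this many distinct values, a count plot draws one bar per value, which is hard to read. One alternative is to bin the values first. The sketch below does this with `numpy.histogram` on synthetic log-normal data; the distribution and the choice of 20 bins are assumptions for illustration, not properties of the real column.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for square_feet; not real data.
square_feet = rng.lognormal(mean=11, sigma=1, size=1_000)

# Bin on a log scale so that a few very large buildings do not
# squash everything else into the first bin.
counts, edges = np.histogram(np.log1p(square_feet), bins=20)

print(counts.sum())  # every value lands in exactly one bin
print(len(edges))    # number of bin edges is bins + 1
```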

**year_built**

In [None]:
print("Value counts of year_built:", train.year_built.value_counts(), sep="\n")
bar_plot("year_built", train)

> 1994 is the most frequent year_built value

**floor count**

In [None]:
print("Value counts of floor_count:", train.floor_count.value_counts(), sep="\n")
bar_plot("floor_count", train)

> From above we can conclude that a floor count of 1 is the most frequent, while floor counts of 14 and 16 are the least frequent

**air_temperature**

In [None]:
print("Number of distinct air_temperature values:", train.air_temperature.nunique())
bar_plot("air_temperature", train)

> From above we can see that there are 619 distinct air temperature values in total

**cloud_coverage**

In [None]:
print("Value counts of cloud_coverage:", train.cloud_coverage.value_counts(), sep="\n")
bar_plot("cloud_coverage", train)

> From above we can see that there are 10 different cloud coverage values, ranging from 0 to 9, and 0 occurs most often

**dew_temperature**

In [None]:
print("Number of distinct dew_temperature values:", train.dew_temperature.nunique())
bar_plot("dew_temperature", train)

> There are 522 distinct dew temperature values

**precip_depth_1_hr**

In [None]:
print("Number of distinct precip_depth_1_hr values:", train.precip_depth_1_hr.nunique())
bar_plot("precip_depth_1_hr", train)

> This variable takes only 128 distinct values
> and we can see that one value appears far more often than the rest

**sea_level_pressure**

In [None]:
print("Number of distinct sea_level_pressure values:", train.sea_level_pressure.nunique())
bar_plot("sea_level_pressure", train)

**wind_direction**

In [None]:
print("Number of distinct wind_direction values:", train.wind_direction.nunique())
print(train.wind_direction.value_counts())
bar_plot("wind_direction", train)

> Here too, 0 appears far more often than other values

**wind_speed**

In [None]:
print("Number of distinct wind_speed values:", train.wind_speed.nunique())
print(train.wind_speed.value_counts())
bar_plot("wind_speed", train)

> Here too, a wind speed of 0 occurs far more often than other values

# Feature Engineering

**Date**

In [None]:
#convert into datetime
train["timestamp"] = pd.to_datetime(train["timestamp"])

#Extract year, month, week etc. from timestamp
train["year"] = train["timestamp"].dt.year
train["month"] = train["timestamp"].dt.month
train["day"] = train["timestamp"].dt.day
#ISO week number; DatetimeIndex.week was removed in pandas 2.0
train["week"] = train["timestamp"].dt.isocalendar().week.astype(int)
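
To make the week feature concrete, here is a short sketch on three hand-picked 2016 timestamps. It assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available. Note that ISO week numbers do not always line up with the calendar year: 1 January 2016 falls in ISO week 53 of the previous year.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2016-01-01 00:00:00",  # a Friday, still in the last ISO week of 2015
    "2016-01-04 00:00:00",  # first Monday of 2016, start of ISO week 1
    "2016-12-31 23:00:00",
]))

parts = pd.DataFrame({
    "month": ts.dt.month,
    "day": ts.dt.day,
    "week": ts.dt.isocalendar().week.astype(int),
})
print(parts)
```

A tree-based model can cope with week 53 appearing in early January, but it is worth knowing about when reading the week plot.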


In [None]:
train.head(2)

In [None]:
print("Number of distinct year values:", train.year.nunique())
print(train.year.value_counts())
bar_plot("year", train)

> From above we can see that the year is 2016 for all rows, so we do not need to consider it

In [None]:
train.drop("year", axis=1, inplace=True)

In [None]:
print("Number of distinct month values:", train.month.nunique())
#print(train.month.value_counts())
bar_plot("month", train)

In [None]:
print("Number of distinct day values:", train.day.nunique())
#print(train.day.value_counts())
bar_plot("day", train)

In [None]:
print("Number of distinct week values:", train.week.nunique())
#print(train.week.value_counts())
bar_plot("week", train)

**Sorting dataframe on basis of timestamp**

In [None]:
train.sort_values(by='timestamp', inplace=True)

In [None]:
import matplotlib.pyplot as plt


train['timestamp'].plot()


**Importing test data**

In [None]:
test_df = pd.read_csv("../input/ashrae-energy-prediction/test.csv")
weat_test = pd.read_csv("../input/ashrae-energy-prediction/weather_test.csv")

In [None]:
test_df = reduce_mem_usage(test_df)
weat_test = reduce_mem_usage(weat_test)


In [None]:
test = pd.merge(test_df, build_meta, on="building_id", how="left")
print("Shape after merging test and building metadata is:", test.shape)

del test_df
gc.collect()

test = pd.merge(test, weat_test, on=['site_id','timestamp'], how="left")
print("Shape after merging all test data is:", test.shape)

del weat_test, build_meta
gc.collect()


**Datetime features for test**

In [None]:
#convert into datetime
test["timestamp"] = pd.to_datetime(test["timestamp"])

#Extract month, day and week from timestamp (year is skipped because it was dropped from train)
test["month"] = test["timestamp"].dt.month
test["day"] = test["timestamp"].dt.day
#ISO week number; DatetimeIndex.week was removed in pandas 2.0
test["week"] = test["timestamp"].dt.isocalendar().week.astype(int)


In [None]:
test.head(2)

In [None]:
train.drop(["timestamp"], axis=1, inplace=True)
test.drop(["timestamp"], axis=1, inplace=True)

In [None]:
target = train["meter_reading"]
del train["meter_reading"]

In [None]:
gc.collect()

**Target encoding for categorical features**

In [None]:
from category_encoders import TargetEncoder

def target_encoder(feature):
    # Fit the encoder on train, then apply the same mapping to train and test
    cat_vectorizer = TargetEncoder().fit(train[feature].astype(str), target)

    train[feature] = cat_vectorizer.transform(train[feature].astype(str))
    test[feature] = cat_vectorizer.transform(test[feature].astype(str))
    print("Done")
    

In [None]:
cols = ["primary_use"]
for i in cols:
    target_encoder(feature = i)
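
For intuition about what target encoding does, here is a simplified sketch in plain pandas on invented data. It replaces each category with the mean target seen for it in train and falls back to the global mean for unseen categories. This is not what `TargetEncoder` computes exactly: the library additionally smooths each category mean towards the global mean, which this sketch omits. The frame names `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

train_toy = pd.DataFrame({
    "primary_use": ["Office", "Office", "Education", "Education", "Lodging"],
    "meter_reading": [10.0, 20.0, 30.0, 50.0, 5.0],
})
test_toy = pd.DataFrame({"primary_use": ["Office", "Education", "Retail"]})

# Mean target per category, learned from train only.
category_means = train_toy.groupby("primary_use")["meter_reading"].mean()
global_mean = train_toy["meter_reading"].mean()

# Categories never seen in train ("Retail" here) get the global mean.
test_toy["primary_use_encoded"] = (
    test_toy["primary_use"].map(category_means).fillna(global_mean)
)
print(test_toy)
```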

In [None]:
# cols = ["floor_count", "year_built", "air_temperature", "cloud_coverage",
#         "dew_temperature", "precip_depth_1_hr", "sea_level_pressure", "wind_direction", "wind_speed"]


In [None]:
train.fillna(-99, inplace=True)
test.fillna(-99, inplace=True)

**Model**

In [None]:
feat_cols = list(train.columns)

# Work in progress: will update this kernel with a few more insights
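
Before fitting a model on the feature list, it may be worth confirming that every training feature also exists in test. The sketch below uses invented one-row frames (`train_toy`, `test_toy`) purely to show the check; the extra `row_id` column is included as an example of a test-only column.

```python
import pandas as pd

train_toy = pd.DataFrame({"building_id": [0], "meter": [0], "month": [1]})
test_toy = pd.DataFrame({"row_id": [0], "building_id": [0], "meter": [0], "month": [1]})

feat_cols = list(train_toy.columns)

# Features the model expects but test lacks would break prediction.
missing_in_test = [c for c in feat_cols if c not in test_toy.columns]
# Columns only in test should be excluded before predicting.
extra_in_test = [c for c in test_toy.columns if c not in feat_cols]

print(missing_in_test, extra_in_test)
```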

**Please upvote if you like my work and give your suggestion on where I need to improve**