1.  [Introduction](#section-one)
2.  [Libraries](#section-two)
3.  [Understanding The Data](#section-tree)
    - [Train Set](#subsection-trainset)
        - [Train](#subsection-t)
        - [Building Metadata](#subsection-b)
        - [Weather train](#subsection-w)
4.  [Exploratory Data Analysis](#section-four)
    - [Merging Tables](#subsection-m)
    - [Analysis](#subsection-a)
        - [Removing Outliers](#subsection-o)
5.  [Modeling](#section-five)

<a id="section-one"></a>
# 1. Introduction

<a id="section-two"></a>
# 2. Libraries

In [None]:
# Set your own project id here
PROJECT_ID = 'ASHRAE - Energy Prediction1'
from google.cloud import storage
storage_client = storage.Client(project=PROJECT_ID)

In [None]:
!pip install seaborn==0.11.0

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import gc

In [None]:
print(f'''Pandas version: {pd.__version__}
NumPy version: {np.__version__}
Matplotlib version: {mpl.__version__}
Seaborn version: {sns.__version__}''')

In [None]:
gc.enable()

<a id="section-tree"></a>
# 3. Understanding The Data

Six tables are provided for predicting future building energy usage. One of them is the sample submission file, which is used for neither training nor testing.

Three tables are used for model training:
* train
* weather_train
* building_metadata

Two tables are used for testing:
* test
* weather_test

All tables are provided in CSV format. This notebook uses pandas to read and manipulate the data.


<a id="subsection-trainset"></a>
# Train Set

In [None]:
train = pd.read_csv('../input/ashrae-energy-prediction/train.csv')
weather_train = pd.read_csv('../input/ashrae-energy-prediction/weather_train.csv')
building = pd.read_csv('../input/ashrae-energy-prediction/building_metadata.csv')

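During the analysis I ran into memory limits. To work around this, the function in the cell below downcasts each numeric column to the smallest dtype whose range covers the column's minimum and maximum.
Integer columns keep their values exactly; float columns that are downcast to float16 or float32 may lose some precision.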

In [None]:
## Function to reduce the DF size
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train = reduce_mem_usage(train)
weather_train = reduce_mem_usage(weather_train)
building = reduce_mem_usage(building)
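
Downcasting floats is not always lossless, so it can be worth checking how much rounding a smaller dtype introduces. The following is a minimal sketch on made-up values (not the competition data); the series name `readings` is only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up readings, not taken from the competition data.
readings = pd.Series([0.25, 173.37, 1234.56], dtype=np.float64)

as_f16 = readings.astype(np.float16)
as_f32 = readings.astype(np.float32)

# Largest absolute rounding error introduced by each downcast.
err_f16 = (as_f16.astype(np.float64) - readings).abs().max()
err_f32 = (as_f32.astype(np.float64) - readings).abs().max()

print(f"max rounding error with float16: {err_f16:.4f}")
print(f"max rounding error with float32: {err_f32:.6f}")
```

In this toy case float16 can only store whole numbers above 1024, so 1234.56 is rounded to 1235, while float32 stays within a small fraction of the original value.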

<a id="subsection-t"></a>
## Train

* building_id - Foreign key for the building metadata.
* meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* timestamp - When the measurement was taken
* meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error. UPDATE: The site 0 electric meter readings are in kBTU

Let's take a look at the first few rows and the data types of the train set.

In [None]:
train.head()

In [None]:
train.info()

The timestamp column contains date information, so its data type is changed to datetime.
After the conversion, find the time period covered by the recordings (the first and last timestamps).

In [None]:
train['timestamp'] = pd.to_datetime(train['timestamp'])
print(f'''{train['timestamp'].dtype}
{train['timestamp'].min()}
{train['timestamp'].max()}''')

Check the missing values and the distribution of the values.

In [None]:
train.isnull().sum()

In [None]:
train['meter_reading'].describe()

There are no missing values in the train data frame. However, the meter_reading column contains zeros. Since meter_reading represents energy consumption, it is unexpected for a building to have zero consumption. Let's take a closer look at the rows with zero readings.

In [None]:
print('Fraction of zero readings:', "%.2f" % (train[train['meter_reading'] == 0].shape[0] / train.shape[0]))
train[train['meter_reading'] == 0]

Only 9% of the meter readings are zero. If these zeros turn out to be missing measurements, they could be imputed from the remaining 91% with reasonable accuracy. First, check whether there is a visible pattern among the zero values.


In [None]:
print(f'''Total number of buildings: {train['building_id'].nunique()}
Number of buildings with zero readings: {train[train['meter_reading']== 0]["building_id"].value_counts().shape[0]}
Percentage: {round(train[train['meter_reading']== 0]["building_id"].value_counts().shape[0] / train['building_id'].nunique(),2)}

Number of zero readings per building: 
{train[train['meter_reading']== 0]["building_id"].value_counts().sort_values(ascending = False)}

Number of buildings with only one zero readings: {sum(train[train['meter_reading']== 0]['building_id'].value_counts() == 1)}
Percentage of buildings with only one zero reading: {round( sum(train[train['meter_reading']== 0]['building_id'].value_counts() == 1) / train[train['meter_reading']== 0]["building_id"].value_counts().shape[0] , 2)}
''')

Although only 9% of the readings are zero, 66% of the buildings contain zero readings. The number of zero readings per building varies from 1 to 16031. A small number of zero readings might be due to an equipment error; for larger counts, however, it is most likely that there is another reason.
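
The per-building counts above can also be computed in one pass with a boolean mask and `groupby`. This is a minimal sketch on a made-up frame (the values and the name `toy` are assumptions for illustration, not the competition data).

```python
import pandas as pd

# Made-up readings for three buildings.
toy = pd.DataFrame({
    "building_id":   [1, 1, 1, 2, 2, 3, 3, 3],
    "meter_reading": [0.0, 5.0, 0.0, 3.0, 4.0, 0.0, 0.0, 0.0],
})

# True where the reading is exactly zero.
is_zero = toy["meter_reading"].eq(0)

# Count and share of zero readings for each building.
per_building = is_zero.groupby(toy["building_id"]).agg(
    zero_count="sum", zero_share="mean"
)

# Number of buildings with at least one zero reading.
buildings_with_zeros = int((per_building["zero_count"] > 0).sum())

print(per_building)
print("Buildings with zero readings:", buildings_with_zeros)
```

Working from a single mask avoids repeating the `train[train['meter_reading'] == 0]` filter for every statistic.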

Also, a building can have more than one meter type, so a zero reading does not have to mean that there is no energy consumption at all. A building may use different energy sources at different times.

With that in mind, let's see whether one kind of meter contains more zero values than the others.

In [None]:
(train[train['meter_reading']== 0]).groupby('building_id')['meter'].value_counts().sort_values(ascending = False)

In [None]:
print('Number of zero readings for each meter:\n',train[train['meter_reading']== 0]["meter"].value_counts(),'\n')

for i in range(train["meter"].nunique()):
    percent = round(train[train['meter_reading']== 0]["meter"].value_counts()[i] /train["meter"].value_counts()[i],2)
    print(f'% of zero reads for Meter {i}: {percent}')

The absolute number of zero readings is highest for meter types 1 and 0. Types 2 and 3 each have about half as many zero readings as type 1.

On the other hand, the percentage of zero readings is highest for meter 3, followed by 1 and 2. Type 0 has the lowest percentage with 0.4%.

This suggests that zero readings are related to the meter type. Seasons may also affect energy usage differently for each meter type. Analyzing how the zero readings are distributed over the recorded period, together with the application areas of the different meter types, may reveal the reason and, if necessary, the best imputation method for these readings.

Meter type information will be useful at this point:
* 0: electricity
* 1: chilledwater
* 2: steam
* 3: hotwater

PS. Not every building has all meter types.


In [None]:
train['meter'] = pd.Categorical(train['meter']).rename_categories({0: 'electricity', 1: 'chilledwater', 2: 'steam', 3: 'hotwater'})

In [None]:
g = sns.FacetGrid(train[train['meter_reading']== 0], col="meter",hue = 'meter',palette='coolwarm',col_wrap=2,height=3, aspect=2)
g.map(sns.histplot, 'timestamp', bins=12)

The occurrence of zero readings changes with time and meter type. For electricity meters, zeros were read mostly in the first 5 months of the year; after that, the zero-reading counts decrease significantly. Like electricity meters, chilled water meters have most of their zeros in the first months of the year. For chilled water, however, the decrease occurs only between months 5 and 9; the counts rise again after month 9 and return to the level of the first months. For steam and hot water meters, zero readings increase in the middle of the year, with a roughly bell-shaped distribution.
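
The pattern read off the histograms can be cross-checked numerically by counting zero readings per month and meter type. The sketch below uses `pd.crosstab` on a small made-up frame; the rows are invented for illustration and only mimic the column names of the train table.

```python
import pandas as pd

# Made-up rows that mimic the train table's columns.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-01-05", "2016-01-20", "2016-02-11",
        "2016-07-03", "2016-07-19", "2016-08-02",
    ]),
    "meter": ["electricity", "chilledwater", "chilledwater",
              "steam", "hotwater", "steam"],
    "meter_reading": [0.0, 0.0, 0.0, 0.0, 0.0, 12.5],
})

# Keep only the zero readings, then count them by month and meter type.
zeros = toy[toy["meter_reading"] == 0]
zero_table = pd.crosstab(zeros["timestamp"].dt.month, zeros["meter"])

print(zero_table)
```

On the real data, each column of such a table corresponds to one panel of the facet grid, with one row per month.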

These values are probably closely related to how the different energy sources are used. To gain deeper insight, let's review the meter types and their application areas:

* 0 - Electricity: An electric meter is a device used to measure the electrical energy usage of a home, building, or other electrically powered device. Digital meters simply display the number of kWh of electricity that have been used. Note that neither digital nor analog meters reset at the beginning of the month; the power company subtracts the starting value from the ending value to determine how much to bill the household [[1]](https://commons.wikimedia.org/wiki/File:Hydro_quebec_meter.JPG#/media/File:Hydro_quebec_meter.JPG) [[2]](https://www.hydro.mb.ca/customer_services/how_to_read/meter.shtml).
* 1 - Chilled water: Chilled water energy meters, commonly referred to as BTU meters, measure heating or chilled water energy consumption. The quantity of thermal energy transferred from the cooling water to the consumer over a defined period of time is proportional to the temperature difference between the flow and return and to the volume of cooling water that has flowed through [[3]](https://www.districtenergy.com/customer-resources/how-it-works/cooling-with-chilled-water/)[[4]](https://www.badgermeter.com/flow-measurement-solutions-for-chilled-water-applications/).
    * Application areas: cooling/heating systems with water as the cooling/heat carrier, transfer stations, and larger cooling/heating systems in apartment buildings (especially in commercial buildings) [[5]](https://www.ista.com/ae/solutions/technology/).
* 2 - Steam: A steam flowmeter can be used to directly measure the steam usage of an operational item of plant. Steam is one of the most widely used commodities for conveying heat energy. Its use is popular throughout industry for a broad range of tasks, from mechanical power production to space heating and process applications [[6]](https://www.spiraxsarco.com/learn-about-steam/introduction/steam---the-energy-fluid).
* 3 - Hot water: A hot water meter measures hot water consumption within a building or apartment so that the heated water can be charged for [[7]](https://www.originenergy.com.au/content/dam/origin/residential/docs/hot-water/your-centralised-hot-water.pdf).

In conclusion, I have decided to keep the zero readings as they are unless I notice anomalies later in the analysis.

**For electricity:** Zero electricity consumption is still implausible. Possible reasons for zero readings:
* A different kind of energy source was in use.
* No information about the consumption was available, and it was recorded as zero in the table.

**For chilled water:** Zero readings are plausible in cooler months, since chilled water systems are usually used for cooling, or during periods when the building is not actively used.

**For steam:** Fluctuations in consumption can be seasonal as a result of changes in weather. During hot months it is plausible to have low or zero consumption.

**For hot water:** Fluctuations in consumption can be seasonal as a result of changes in weather. During hot months it is plausible to have low or zero consumption.

Besides the zeros, there are also very high readings in the data. Let's use a box plot to check for outliers and look at the distribution of the readings.

In [None]:
fig, axes = plt.subplots(1, 1, figsize=(14, 6))
sns.boxplot(y='meter', x='meter_reading', data=train, showfliers=True)

Looks like we have a lot of outliers! Let's see what the plot looks like without outliers and zero values.

In [None]:
fig, axes = plt.subplots(1, 1, figsize=(14, 6))
sns.boxplot(y='meter', x='meter_reading', data=train[train['meter_reading'] !=  0 ], showfliers=False);

In [None]:
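# Sum only meter_reading so that non-numeric columns (e.g. the categorical meter) are not aggregated.
sns.lineplot(data=train.groupby('timestamp')['meter_reading'].sum().reset_index(), x="timestamp", y="meter_reading")

In [None]:
sns.relplot(data=train.groupby(['timestamp','meter'], observed=True)['meter_reading'].sum().reset_index(), x="timestamp", y="meter_reading", col="meter", hue="meter", kind="line")

In [None]:
# Same plot without steam, so the remaining meter types are easier to compare.
selected = ['electricity','chilledwater','hotwater']
sns.relplot(data=train[train['meter'].isin(selected)].groupby(['timestamp','meter'], observed=True)['meter_reading'].sum().reset_index(), x="timestamp", y="meter_reading", col="meter", col_order=selected, hue="meter", hue_order=selected, kind="line")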

It is clear that we have some very high values in the readings. It is better to identify outliers while taking the building and site information into account (e.g., for a large building with high energy consumption, the kWh-per-area value might not be that high).

In [None]:
gc.collect()

<a id="subsection-b"></a>
## Building Metadata

* site_id - Foreign key for the weather files.
* building_id - Foreign key for train.csv
* primary_use - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* square_feet - Gross floor area of the building
* year_built - Year building was opened
* floor_count - Number of floors of the building

The building metadata table contains information about the buildings.

In [None]:
building.head()

In [None]:
building.info()

In [None]:
building.describe()

In [None]:
building.describe(include = 'O')

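It is interesting that the mean of the year_built column is reported as inf. This is most likely a side effect of the memory reduction above: year_built was downcast to float16, whose largest finite value is 65504, so summing the years while computing the mean overflows.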
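
A likely explanation is the float16 downcast applied by reduce_mem_usage: float16 cannot represent values above 65504, so a sum of many four-digit years overflows. The sketch below reproduces the effect with NumPy on made-up years (not the competition data).

```python
import numpy as np

# Made-up construction years, not the competition data.
years = np.array([1968.0, 2004.0, 1991.0] * 20)
years_f16 = years.astype(np.float16)

# The sum of a float16 array is returned as float16 and overflows to inf.
with np.errstate(over="ignore"):
    total_f16 = years_f16.sum()

# Accumulating in float64 gives the expected total and mean.
total_f64 = years_f16.astype(np.float64).sum()
mean_f64 = total_f64 / len(years_f16)

print("float16 sum:", total_f16)
print("float64 mean:", round(mean_f64, 2))
```

Casting the column to float32 or float64 before calling describe() would be one way to obtain a finite mean.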

In [None]:
print(f'Null value counts:\n{building.isnull().sum()}\n')
for col in list(building.columns):
    if building[col].isnull().sum() > 0:
        print(f'% of null in column {col}: {round(building[col].isnull().sum() / building.shape[0], 2 )}' )

To analyze how the null values are distributed across the primary usage areas, NaN values are filled with 'Missing'.

In [None]:
building.fillna('Missing', inplace=True)

In [None]:
print('For Primary Usage Areas\n')
print('% of Null year_built Values Less Than 50%')
for usage in list(building['primary_use'].unique()):
    percent  = round(sum(building[building['primary_use']== usage]['year_built']== 'Missing') / building[building['primary_use']== usage].shape[0], 2)
    if percent < 0.5:
        print(f'{usage}: {percent}')
        
print('\n% of Null year_built Values Higher Than 50%')
for usage in list(building['primary_use'].unique()):
    percent  = round(sum(building[building['primary_use']== usage]['year_built']== 'Missing') / building[building['primary_use']== usage].shape[0], 2)
    if percent > 0.5:
        print(f'{usage}: {percent}')
    

In [None]:
print('For Primary Usage Areas\n')
print('\n% of Null floor_count Values Less Than 50%\n')
for usage in list(building['primary_use'].unique()):
    percent  = round(sum(building[building['primary_use']== usage]['floor_count']== 'Missing') / building[building['primary_use']== usage].shape[0], 2)
    if percent < 0.5:
        print(f'{usage}: {percent}')
        
print('\n% of Null floor_count Values Higher Than 50%\n')
for usage in list(building['primary_use'].unique()):
    percent  = round(sum(building[building['primary_use']== usage]['floor_count']== 'Missing') / building[building['primary_use']== usage].shape[0], 2)
    if percent > 0.5:
        print(f'{usage}: {percent}')
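
The same per-usage shares can be obtained without loops by grouping a boolean missing-value mask. This is a minimal sketch on a made-up frame, assuming the NaN values are still in place (that is, before they are replaced with 'Missing').

```python
import numpy as np
import pandas as pd

# Made-up building metadata with gaps in both columns.
toy = pd.DataFrame({
    "primary_use": ["Education", "Education", "Office", "Office", "Office", "Retail"],
    "year_built":  [1990.0, np.nan, np.nan, np.nan, 2005.0, 1975.0],
    "floor_count": [np.nan, np.nan, 3.0, np.nan, 5.0, np.nan],
})

# Share of missing values per primary_use, one column per feature.
missing_share = (
    toy[["year_built", "floor_count"]]
    .isna()
    .groupby(toy["primary_use"])
    .mean()
)

print(missing_share.round(2))
```

Filtering the result with `missing_share[missing_share["year_built"] > 0.5]` would give the usage areas where more than half of the values are missing.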

I decided to drop floor_count and year_built because there is not enough information to impute them.

In [None]:
building.drop(['floor_count','year_built'],axis=1,inplace=True)

In [None]:
gc.collect()

<a id="subsection-w"></a>
## Weather train

Weather data from a meteorological station as close as possible to the site.

* site_id
* air_temperature - Degrees Celsius
* cloud_coverage - Portion of the sky covered in clouds, in oktas
* dew_temperature - Degrees Celsius
* precip_depth_1_hr - Millimeters
* sea_level_pressure - Millibar/hectopascals
* wind_direction - Compass direction (0-360)
* wind_speed - Meters per second


In [None]:
weather_train.head()

In [None]:
weather_train.info()

The timestamp column contains date information, so its data type is changed to datetime.
After the conversion, find the time period covered by the recordings (the first and last timestamps).

In [None]:
weather_train['timestamp'] = pd.to_datetime(weather_train['timestamp'])
print(f'''{weather_train['timestamp'].dtype}
{weather_train['timestamp'].min()}
{weather_train['timestamp'].max()}''')

Null value check

In [None]:
print(f'Null value counts:\n{weather_train.isnull().sum()}\n')
for col in list(weather_train.columns):
    if weather_train[col].isnull().sum() > 0:
        print(f'% of null in column {col}: {round(weather_train[col].isnull().sum() / weather_train.shape[0], 4 )}' )

There are missing values in seven columns. Luckily, the percentage of missing values is mostly low, so they can be filled. I will drop only cloud_coverage and precip_depth_1_hr.

Weather characteristics change with season and location, and they fluctuate even within a day. Hence, to impute missing values accurately, site_id and date information must be used.

In [None]:
weather_train.drop(['cloud_coverage','precip_depth_1_hr'], axis = 1,inplace=True)

In [None]:
weather_train['hour'] = weather_train.timestamp.dt.hour
weather_train['month'] = weather_train.timestamp.dt.month

Following this [reference study](http://www.scienceasia.org/2008.34.n3/scias34_341.pdf), missing meteorological values are imputed with mean values.

In [None]:
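def site_mean_weather(table):
    # Fill each column's missing values with the mean for the same site, hour and month.
    for col in list(table.columns[table.isnull().any()]):
        imputation = table.groupby(['site_id','hour','month'])[col].transform('mean')
        # Assign back instead of calling fillna(inplace=True) on a column selection.
        table[col] = table[col].fillna(imputation)
    print('Imputation with mean values is completed.')

site_mean_weather(weather_train)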

In [None]:
weather_train.isnull().sum()

After imputation there are still missing values in the sea_level_pressure column, which happens when every value in a (site_id, hour, month) group is missing, so I dropped that column too.
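
A group mean is NaN whenever every value in the group is missing, which is why some gaps can survive site-level imputation. The sketch below shows this on made-up data, followed by a coarser fallback that averages over hour and month across all sites; the fallback is shown only as an illustration, not as a recommendation over dropping the column.

```python
import numpy as np
import pandas as pd

# Made-up weather rows: site 1 has no sea_level_pressure at all.
toy = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "hour":    [0, 0, 0, 0, 0, 0],
    "month":   [1, 1, 1, 1, 1, 1],
    "sea_level_pressure": [1010.0, np.nan, 1020.0, np.nan, np.nan, np.nan],
})
col = "sea_level_pressure"

# Step 1: mean per site, hour and month. Site 1 has nothing to average.
site_mean = toy.groupby(["site_id", "hour", "month"])[col].transform("mean")
step1 = toy[col].fillna(site_mean)

# Step 2: fall back to the mean per hour and month across all sites.
global_mean = toy.groupby(["hour", "month"])[col].transform("mean")
step2 = step1.fillna(global_mean)

print("Missing after step 1:", int(step1.isna().sum()))
print("Missing after step 2:", int(step2.isna().sum()))
```

The fallback borrows values from other sites, so it trades accuracy for completeness.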

In [None]:
weather_train.drop(['sea_level_pressure'], axis = 1, inplace = True)

In [None]:
gc.collect()

<a id="section-four"></a>
# 4. Exploratory Data Analysis

<a id="subsection-m"></a>
## Merging Tables 

To evaluate all the training data together, the train, building and weather tables are merged, and the tables that are no longer needed are deleted.

In [None]:
df = pd.merge(train,building, on="building_id", how="left")
df = df.merge(weather_train, on=["site_id", "timestamp"], how="left")
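
Left merges like these should leave the number of train rows unchanged. A hedged way to guard against accidental duplication is the `validate` argument of `merge`, sketched here on small made-up tables (the `*_toy` names and values are assumptions for illustration).

```python
import pandas as pd

t0, t1 = pd.Timestamp("2016-01-01 00:00"), pd.Timestamp("2016-01-01 01:00")

# Made-up stand-ins for the train, building and weather tables.
train_toy = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": [t0, t1, t0],
    "meter_reading": [10.0, 12.0, 7.5],
})
building_toy = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "square_feet": [1000, 2500],
})
# Site 0 has no weather record for t1.
weather_toy = pd.DataFrame({
    "site_id": [0, 1],
    "timestamp": [t0, t0],
    "air_temperature": [3.5, -1.0],
})

# validate raises MergeError if the right-hand keys are not unique.
merged = train_toy.merge(building_toy, on="building_id", how="left", validate="many_to_one")
merged = merged.merge(weather_toy, on=["site_id", "timestamp"], how="left", validate="many_to_one")

print(merged.shape)
print(merged["air_temperature"].isna().sum())
```

The row without a matching weather record ends up with NaN in the weather columns, which is why imputation is repeated after merging.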

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.nunique()

In [None]:
df.isnull().sum()

In [None]:
df['month'] = df.timestamp.dt.month
df['hour'] = df.timestamp.dt.hour

Fill the missing weather data with mean values.

In [None]:
site_mean_weather(df)

In [None]:
# Fallback for values that are still missing: mean per hour and month across all sites.
for col in list(df.columns[df.isnull().any()]):
    imputation = df.groupby(['hour','month'])[col].transform('mean')
    df[col] = df[col].fillna(imputation)
print('Imputation is completed.')

We don't need train, building and weather_train tables anymore.

In [None]:
del train
del building
del weather_train
gc.collect()

Check the dtypes for further memory reduction.

In [None]:
df.dtypes

In [None]:
df[['primary_use','hour','month','site_id','building_id','wind_direction']] = df[['primary_use','hour','month','site_id','building_id','wind_direction']].astype('category')

In [None]:
gc.collect()

As discussed before, it is more convenient to analyze meter readings after dividing them by the building area. The cons/sqft column is created to hold this information. Before creating that column, the units of the site 0 electric meter readings are changed to kWh ([they were given in kBTU](https://www.kaggle.com/c/ashrae-energy-prediction/discussion/119261)).

In [None]:
# Convert the site 0 electricity readings from kBTU to kWh (1 kBTU = 0.2931 kWh).
site0_electricity = (df['site_id'] == 0) & (df['meter'] == 'electricity')
df.loc[site0_electricity, 'meter_reading'] = df.loc[site0_electricity, 'meter_reading'] * 0.2931

In [None]:
# Compute consumption per square foot after the unit conversion.
df['cons/sqft'] = df['meter_reading'] / df['square_feet']
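
As a worked example of the conversion: a reading of 1000 kBTU corresponds to 1000 × 0.2931 = 293.1 kWh, and for a 5000 sq ft building that is 293.1 / 5000 ≈ 0.0586 kWh per square foot. The sketch below applies the same two steps to a made-up frame; the constant name `KBTU_TO_KWH` is only illustrative.

```python
import pandas as pd

KBTU_TO_KWH = 0.2931  # conversion factor used in this notebook

# Made-up rows: only the first one is a site 0 electricity reading.
toy = pd.DataFrame({
    "site_id": [0, 0, 1],
    "meter": ["electricity", "steam", "electricity"],
    "meter_reading": [1000.0, 1000.0, 1000.0],
    "square_feet": [5000, 5000, 5000],
})

# Convert only the site 0 electricity readings.
mask = (toy["site_id"] == 0) & (toy["meter"] == "electricity")
toy.loc[mask, "meter_reading"] *= KBTU_TO_KWH

# Normalize by floor area after the units are consistent.
toy["cons/sqft"] = toy["meter_reading"] / toy["square_feet"]

print(toy)
```

Computing the per-area column after the conversion keeps both columns in the same units.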

In [None]:
df['day'] = df.timestamp.dt.day

In [None]:
df = reduce_mem_usage(df)
gc.collect()

<a id="subsection-a"></a>
## Analysis

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4),constrained_layout=True)
fig.suptitle('Building Counts')
sns.barplot(ax=axes[0],y="site_id", x='building_id', data=df.groupby(['site_id'])['building_id'].nunique().reset_index())
axes[0].set(xlabel = 'Building Count', ylabel = 'Site id')  
sns.barplot(ax=axes[1],y="primary_use", x='building_id', data=df.groupby(['primary_use'])['building_id'].nunique().reset_index())
axes[1].set(xlabel = 'Building Count', ylabel = 'Primary Use')

In [None]:
gc.collect()

What is the distribution of energy consumption across the sites and primary uses?

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(20, 16),constrained_layout=True)
fig.suptitle('Energy Consumption for Primary Use')
sns.boxplot(ax=axes[0, 0], y='primary_use', x='meter_reading', data=df, showfliers=True)
sns.boxplot(ax=axes[0, 1], y='primary_use', x='meter_reading', data=df, showfliers=False)
sns.boxplot(ax=axes[1, 0], y='primary_use', x='cons/sqft', data=df, showfliers=True)
sns.boxplot(ax=axes[1, 1], y='primary_use', x='cons/sqft', data=df, showfliers=False);

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 16),constrained_layout=True)
fig.suptitle('Energy Consumption for Sites')
sns.boxplot(ax=axes[0, 0], y='site_id', x='meter_reading', data=df, showfliers=True)
sns.boxplot(ax=axes[0, 1], y='site_id', x='meter_reading', data=df, showfliers=False)
sns.boxplot(ax=axes[1, 0], y='site_id', x='cons/sqft', data=df, showfliers=True)
sns.boxplot(ax=axes[1, 1], y='site_id', x='cons/sqft', data=df, showfliers=False);

In [None]:
gc.collect()

Highest energy-consuming sites

In [None]:
df['day'] = df.timestamp.dt.day

In [None]:
df[df['meter']=='electricity'].groupby(['site_id','month','day','building_id'])['meter_reading'].sum().sort_values(ascending = False).reset_index().head(100)

In [None]:
df[df['meter']=='electricity'].groupby(['site_id','month','day','building_id'])['cons/sqft'].sum().sort_values(ascending = False).reset_index().head(100)

In [None]:
df.groupby(['site_id'])['meter_reading'].sum().sort_values(ascending = False).reset_index()
gc.collect()

<a id="subsection-o"></a>
### Removing Outliers

As can be seen from the box plots, sites 13 and 6 and the education and entertainment building types have lots of outliers.

In [None]:
gc.collect()

In [None]:
total = 0
print('Outlier distribution in meter types')
for col in list(df['meter'].unique()):
    r = np.percentile(df[df['meter'] == col]['cons/sqft'],75) + 1.5 * (np.percentile(df[df['meter'] == col]['cons/sqft'],75) - np.percentile(df[df['meter'] == col]['cons/sqft'],25))
    print(f'''Percentage of outliers in {col} to all readings: { round(df[(df['meter'] == col) & (df['cons/sqft'] > r)].shape[0]/ df.shape[0],5)}''')
    total +=  df[(df['meter'] == col) & (df['cons/sqft'] > r)].shape[0]
print(f'Total fraction of outliers {round(total / df.shape[0],5)}')
gc.collect()

In [None]:
total = 0
print('Outlier distribution in building types')
for col in list(df['primary_use'].unique()):
    r = np.percentile(df[df['primary_use'] == col]['cons/sqft'],75) + 1.5 * (np.percentile(df[df['primary_use'] == col]['cons/sqft'],75) - np.percentile(df[df['primary_use'] == col]['cons/sqft'],25))
    print(f'''Percentage of outliers in {col} to all readings: { round(df[(df['primary_use'] == col) & (df['cons/sqft'] > r)].shape[0]/ df.shape[0],5)}''')
    total +=  df[(df['primary_use'] == col) & (df['cons/sqft'] > r)].shape[0]
print(f'Total fraction of outliers {round(total / df.shape[0],5)}')
gc.collect()

In [None]:
total = 0
print('Outlier distribution in sites')
for col in list(df['site_id'].unique()):
    r = np.percentile(df[df['site_id'] == col]['cons/sqft'],75) + 1.5 * (np.percentile(df[df['site_id'] == col]['cons/sqft'],75) - np.percentile(df[df['site_id'] == col]['cons/sqft'],25))
    print(f'''Percentage of outliers in {col} to all readings: { round(df[(df['site_id'] == col) & (df['cons/sqft'] > r)].shape[0]/ df.shape[0],5)}''')
    total +=  df[(df['site_id'] == col) & (df['cons/sqft'] > r)].shape[0]
print(f'Total fraction of outliers {round(total / df.shape[0],5)}')
gc.collect()
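
The three cells above all apply the same rule: within each group, a value counts as an outlier if it exceeds Q3 + 1.5 × (Q3 − Q1). A vectorized sketch of that rule with `groupby(...).transform` is shown below on made-up values; the `is_outlier` column is a hypothetical name added for illustration.

```python
import pandas as pd

# Made-up consumption values for two meter types.
toy = pd.DataFrame({
    "meter": ["electricity"] * 5 + ["steam"] * 5,
    "cons/sqft": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 11.0, 12.0, 13.0, 14.0],
})

grouped = toy.groupby("meter")["cons/sqft"]

# Quartiles per group, broadcast back to the original rows.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))

# Upper fence of the IQR rule, computed within each group.
upper_fence = q3 + 1.5 * (q3 - q1)
toy["is_outlier"] = toy["cons/sqft"] > upper_fence

print(toy)
print("Fraction of outliers:", toy["is_outlier"].mean())
```

Changing the grouping key to primary_use or site_id gives the other two breakdowns without repeating the percentile arithmetic.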

In [None]:
df.groupby(['month','day','building_id','meter'])['meter_reading'].sum().sort_values(ascending = False).reset_index().head()

In [None]:
df.groupby(['month','day','building_id','meter'])['meter_reading'].sum().sort_values(ascending = False).reset_index().head(100)['building_id'].value_counts()

In [None]:
gc.collect()

In [None]:
df.groupby(['building_id','site_id','meter'])['meter_reading'].sum().sort_values(ascending = False).reset_index().head(10)

In [None]:
df.groupby(['building_id','site_id','meter'])['cons/sqft'].sum().sort_values(ascending = False).reset_index().head(10)

<a id="subsection-m"></a>
## Feature Selection & Preparation

Depending on the usage area, energy consumption may fluctuate significantly between weekdays and weekends, so a Weekend column is added to the dataframe.

In [None]:
df["weekday"] = df.timestamp.dt.weekday
# Saturday (5) and Sunday (6) are flagged as weekend.
df['Weekend'] = df['weekday'].isin([5, 6])
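
As a quick check of the weekday convention: pandas numbers Monday as 0 and Sunday as 6, so 5 and 6 mark the weekend. The sketch below verifies this on four consecutive dates starting with Friday, 1 January 2016.

```python
import pandas as pd

# Friday to Monday, 1-4 January 2016.
dates = pd.Series(pd.to_datetime(
    ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"]
))

weekday = dates.dt.weekday          # Monday = 0, ..., Sunday = 6
is_weekend = weekday.isin([5, 6])   # Saturday and Sunday

print(pd.DataFrame({"date": dates, "weekday": weekday, "Weekend": is_weekend}))
```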

<a id="section-five"></a>
# 5. Modeling

Due to memory constraints, modeling is carried out in another notebook. Please see [ASHRAE - Energy Prediction2](https://www.kaggle.com/fatmanuranl/ashrae-energy-prediction2)