# Objective

* Perform an exploratory data analysis and build a baseline model for the ASHRAE - Great Energy Predictor III competition

# Description of the problem

Energy savings is one of the important areas of focus in our current world. Measuring energy savings has two key elements:

* Forecasting future energy usage without improvements
* Forecasting energy use after a specific set of improvements have been implemented

In this competition, you'll develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe. With better estimates of these energy-saving investments, large scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.

# Libraries

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 1000)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Exploring train and test sets

In [None]:
print('Reading train set...')
train = pd.read_csv('/kaggle/input/ashrae-energy-prediction/train.csv')
print('Reading test set...')
test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/test.csv')
print('Train set has {} rows and {} columns'.format(train.shape[0], train.shape[1]))
print('Test set has {} rows and {} columns'.format(test.shape[0], test.shape[1]))

This dataset has a lot of rows to process in memory. Let's use a memory reduction function to handle it better.

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
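As a point of comparison, here is a minimal sketch of the same idea using pandas' built-in `pd.to_numeric(..., downcast=...)`. The toy frame and the helper name `downcast_numeric` are assumptions made for illustration, not part of the competition data or of the function above. One difference worth knowing: `downcast='float'` stops at float32, while the function above may go down to float16, which keeps only about three significant digits.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'building_id': np.array([0, 1, 1448], dtype='int64'),
    'meter': np.array([0, 1, 3], dtype='int64'),
    'meter_reading': np.array([0.0, 12.5, 1043.75], dtype='float64'),
})

def downcast_numeric(df):
    """Hypothetical helper: return a copy with numeric columns in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # smallest signed integer dtype that holds every value
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # never goes below float32
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

small = downcast_numeric(toy)
print(small.dtypes)
print('bytes before:', toy.memory_usage(deep=True).sum(),
      'bytes after:', small.memory_usage(deep=True).sum())
```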

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

We have no missing values in our train and test set.

Here is a description of each feature:

* building_id - Foreign key for the building metadata.
* meter - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* timestamp - When the measurement was taken
* meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.

The maximum building_id is also the same in train and test.

Each building_id combined with each meter category is a time series. In other words, we have a multiple time series problem. Let's first calculate how many time series we have in our training and test sets.

In [None]:
train.head()

In [None]:
print('We have {} time series in our training set'.format((train['building_id'].astype(str) + train['meter'].astype(str)).nunique()))
print('We have {} time series in our test set'.format((test['building_id'].astype(str) + test['meter'].astype(str)).nunique()))

We also need to check the consistency of these time series. In other words, let's check how many time series in the training set are also in the test set, and how many are in the test set but not in the training set.

In [None]:
train_series = list((train['building_id'].astype(str) + train['meter'].astype(str)).unique())
test_series = list((test['building_id'].astype(str) + test['meter'].astype(str)).unique())
print('Number of series that are in the training set and are also contained in the test set {}'.format(len([x for x in train_series if x in test_series])))
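The same comparison can be sketched with sets of (building_id, meter) pairs, which avoids string concatenation and makes "only in test" a one-line set difference. The toy frames and the helper name `series_keys` are assumptions for illustration only.

```python
import pandas as pd

# Toy rows standing in for train/test (values are made up)
toy_train = pd.DataFrame({'building_id': [1, 1, 11, 11, 2], 'meter': [0, 1, 0, 0, 3]})
toy_test = pd.DataFrame({'building_id': [1, 1, 11, 2, 7], 'meter': [0, 1, 0, 3, 0]})

def series_keys(df):
    """Hypothetical helper: set of (building_id, meter) pairs, one pair per time series."""
    pairs = df[['building_id', 'meter']].drop_duplicates()
    return set(pairs.itertuples(index=False, name=None))

train_keys = series_keys(toy_train)
test_keys = series_keys(toy_test)
print('series in train:', len(train_keys))
print('series in both sets:', len(train_keys & test_keys))
print('series only in test:', sorted(test_keys - train_keys))
```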

Great! We have consistency. The processing is very slow, though. Let's check the distribution of each meter (there are 4) and then split the training set into 4 parts to explore faster (maybe it's a good idea to build 4 models).

In [None]:
def plot_count(df, col):
    total = len(df)
    plt.figure(figsize = (12,8))
    plot_me = sns.countplot(x = df[col])  # recent seaborn versions require the keyword x
    plot_me.set_xlabel('{} type'.format(col), fontsize = 16)
    plot_me.set_ylabel('frequency', fontsize = 16)
    for p in plot_me.patches:
        height = p.get_height()
        plot_me.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center", fontsize=15)
        
plot_count(train, 'meter')

* 59.66% of the observations correspond to the electricity meter.

* 20.69% of the observations correspond to the chilled water meter.

* The remaining 19.65% of the observations are split between the steam and hot water meters (see the plot for the exact shares).

Let's check the test distribution. Should be almost the same.

In [None]:
plot_count(test, 'meter')

We can see some differences. Let's divide the dataset into 4 sets, one per meter type, and keep three hypotheses in mind:

* It's possible that not all of the time series have records for all the dates.
* It's possible that not all of the time series have records for all the hours in a specific date.
* It's possible that not all of the time series have the same time window (all start and end at the same date).

In [None]:
train['timestamp'] = pd.to_datetime(train['timestamp'])
test['timestamp'] = pd.to_datetime(test['timestamp'])
train_el = train[train['meter']==0]
train_ch = train[train['meter']==1]
train_st = train[train['meter']==2]
train_ho = train[train['meter']==3]
test_el = test[test['meter']==0]
test_ch = test[test['meter']==1]
test_st = test[test['meter']==2]
test_ho = test[test['meter']==3]

# Timestamp feature

* Let's start exploring the timestamp feature for each meter type

# Electricity Meter

In [None]:
# count the number of buildings for each timestamp for the electricity meter
def plot_time_freq(df, name = 'electricity', se = 'train'):
    print('We have {} series'.format(df['building_id'].nunique()))
    print('Min date: ', df.timestamp.min())
    print('Max date: ', df.timestamp.max())
    print('Time behaviour for {} meter for the {} set'.format(name, se))
    df['date'] = df['timestamp'].dt.date 
    df['week'] = df['timestamp'].dt.isocalendar().week.astype(int)  # .dt.week was removed in pandas 2.0
    df['dayofmonth'] = df['timestamp'].dt.day
    df['month'] = df['timestamp'].dt.month
    df['dayofweek'] = df['timestamp'].dt.dayofweek
    df['hour'] = df['timestamp'].dt.hour
    tmp1 = df.groupby(['date'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    tmp2 = df.groupby(['week'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    tmp3 = df.groupby(['dayofmonth'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    tmp4 = df.groupby(['hour'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    tmp5 = df.groupby(['month'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    tmp6 = df.groupby(['dayofweek'])['building_id'].count().reset_index().rename(columns = {'building_id': 'frequency'})
    fig, (ax1, ax2, ax3, ax4, ax5, ax6) = plt.subplots(6, 1, figsize = (12, 12))
    # recent seaborn versions require x and y to be passed as keywords
    sns.lineplot(x = tmp1['date'], y = tmp1['frequency'], ax = ax1)
    ax1.set_title('Date Frequency')
    ax1.set_xlabel('Date', fontsize = 10)
    ax1.set_ylabel('Frequency', fontsize = 10)
    sns.lineplot(x = tmp2['week'], y = tmp2['frequency'], ax = ax2)
    ax2.set_title('Week Frequency')
    ax2.set_xlabel('Week', fontsize = 10)
    ax2.set_ylabel('Frequency', fontsize = 10)
    sns.lineplot(x = tmp3['dayofmonth'], y = tmp3['frequency'], ax = ax3)
    ax3.set_title('Day of month frequency')
    ax3.set_xlabel('Day of month', fontsize = 10)
    ax3.set_ylabel('Frequency', fontsize = 10)
    sns.lineplot(x = tmp4['hour'], y = tmp4['frequency'], ax = ax4)
    ax4.set_title('Hour frequency')
    ax4.set_xlabel('Hour', fontsize = 10)
    ax4.set_ylabel('Frequency', fontsize = 10)
    sns.lineplot(x = tmp5['month'], y = tmp5['frequency'], ax = ax5)
    ax5.set_title('Month frequency')
    ax5.set_xlabel('Month', fontsize = 10)
    ax5.set_ylabel('Frequency', fontsize = 10)
    sns.lineplot(x = tmp6['dayofweek'], y = tmp6['frequency'], ax = ax6)
    ax6.set_title('Day of week frequency')
    ax6.set_xlabel('Day of week', fontsize = 10)
    ax6.set_ylabel('Frequency', fontsize = 10)
    plt.tight_layout()
    plt.show()

plot_time_freq(train_el, 'electricity', 'train')

* We have data for one entire year (2016)
* We can see some strange descending spikes in February, March, September, November and December
* We can see a stable day of month frequency
* We can see two peaks in the hour plot, 3-5 and 22-1; 15 and 1 are the lowest (1 is very strange)
* We can see a lower frequency in February and March (February is also the shortest month, with 29 days in 2016)
* We can see two spikes for day of week 4 and 5 (Friday and Saturday, since pandas counts Monday as 0)

There should be an explanation for this :), no idea why haha
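One candidate explanation, at least for the day-of-week spikes, is the calendar itself: 2016 is a leap year that starts on a Friday, so two weekdays occur 53 times instead of 52. The sketch below only counts calendar days and does not touch the competition data; whether this accounts for the full size of the spikes is an assumption that would need checking against the actual counts.

```python
import pandas as pd

# Every calendar day of 2016, the year covered by the train set
days_2016 = pd.date_range('2016-01-01', '2016-12-31', freq='D')

# Number of occurrences of each weekday (pandas convention: Monday=0 ... Sunday=6)
days_per_weekday = pd.Series(days_2016.dayofweek).value_counts().sort_index()

# With hourly data, each building with complete records contributes 24 rows per day
hours_per_weekday = days_per_weekday * 24

print(days_per_weekday)
print(hours_per_weekday)
```

Weekdays that occur 53 times get 24 extra rows per complete series, which would show up as a higher bar in a plain row count by day of week.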

Let's check the test set

In [None]:
plot_time_freq(test_el, 'electricity', 'test')

* Date and hour frequency are constant.

In [None]:
# let's check a month, in this case February and March
def build_sw(df, cols, p_cols, value):
    sw = df.groupby(cols)['meter'].count().reset_index()
    sw1 = sw[sw[p_cols]==value]
    plt.figure(figsize = (10,8))
    plt.scatter(sw1[cols[2]], sw1[cols[0]])
    plt.title('Observation for each serie for {} {}'.format(p_cols, value))
    plt.show()
build_sw(train_el, ['building_id', 'month', 'dayofmonth'], 'month', 2)

Eureka! The top building ids don't have records from day 11 to day 29.

In [None]:
build_sw(train_el, ['building_id', 'month', 'dayofmonth'], 'month', 3)

That's why we have those strange descending spikes in February and March

We confirm our first hypothesis:

Not all of the time series have records for all the dates in the electricity meter in the training set.

In [None]:
def check_hour(df):
    tmp = df.groupby(['building_id', 'date'])['meter'].count().reset_index()
    return tmp[tmp['meter']!=24].iloc[::10].head(10)
check_hour(train_el)

In [None]:
check_hour(test_el)

We can also see that some series don't have observations for all the hours of a specific date. But in the test set I could not find any!

Is it a good idea to predict the meter_reading for the series that have missing records, and then use those predictions in the final training? We need to explore further. Leave your comments :))
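To make the gaps explicit before deciding what to do with them, one option is to put a series on a complete hourly grid so that missing hours show up as NaN. The sketch below uses a made-up toy series; the time-based interpolation at the end is only one assumed way of filling gaps, not something this notebook has validated.

```python
import pandas as pd

# Toy hourly readings for one series with two hours missing (values are made up)
stamps = pd.to_datetime(['2016-02-10 00:00', '2016-02-10 01:00', '2016-02-10 04:00'])
toy = pd.Series([5.0, 6.0, 9.0], index=stamps, name='meter_reading')

# Complete hourly grid between the first and the last observation
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq=pd.Timedelta(hours=1))

# Reindexing keeps existing readings and inserts NaN for the missing hours
on_grid = toy.reindex(full_index)
n_missing = int(on_grid.isna().sum())
print('missing hours:', n_missing)

# One simple (assumed) way to fill the gaps: linear interpolation in time
filled = on_grid.interpolate(method='time')
print(filled)
```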

Let's check the start date for each series (we have 1413 series for the electricity meter).

In [None]:
def start_date(df):
    b_id = []
    min_date = []
    for i in list(df['building_id'].unique()):
        b_id.append(i)
        min_date.append(df[df['building_id']==i]['date'].min())
    tmp = pd.DataFrame({'building_id': b_id, 'min_date': min_date})
    tmp['min_date'] = tmp['min_date'].astype(str)
    print('There are {} series that start after 2016-01-01'.format(tmp[tmp['min_date']!='2016-01-01'].shape[0]))
start_date(train_el)
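The loop above filters the frame once per building, which gets slow with over a thousand series. A minimal sketch of the same check with a single `groupby` is shown below on a made-up toy frame; the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for one meter type of the train set (values are made up)
toy = pd.DataFrame({
    'building_id': [0, 0, 1, 1, 2],
    'timestamp': pd.to_datetime(['2016-01-01 00:00', '2016-01-01 01:00',
                                 '2016-01-01 00:00', '2016-03-05 00:00',
                                 '2016-02-11 07:00']),
})

# First observed day per building, computed in one pass
first_seen = toy.groupby('building_id')['timestamp'].min().dt.normalize()

# Buildings whose first observation falls after the first day of the year
late_starters = first_seen[first_seen != pd.Timestamp('2016-01-01')]
print('There are {} series that start after 2016-01-01'.format(len(late_starters)))
```

The end-date check works the same way with `.max()` and the last day of the test period.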

Let's check the max date. We already know that all the series in the training set are also in the test set, so we need to check the max date in the test set.

In [None]:
def end_date(df):
    b_id = []
    max_date = []
    for i in list(df['building_id'].unique()):
        b_id.append(i)
        max_date.append(df[df['building_id']==i]['date'].max())
    tmp = pd.DataFrame({'building_id': b_id, 'max_date': max_date})
    tmp['max_date'] = tmp['max_date'].astype(str)
    print('There are {} series that finish before 2018-12-31'.format(tmp[tmp['max_date']!='2018-12-31'].shape[0]))
end_date(test_el)

So all the series have observations in the last date of the test set

Let's build a function that can check the time frequency consistency of the dataframes

In [None]:
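def check_series(df, n_years = 1):
    # n_years is kept so the existing calls still work, but it is no longer needed:
    # the expected days per month are counted from the calendar span of the data,
    # which also handles leap years (February 2016 has 29 days, not 28)
    days = pd.date_range(df['timestamp'].min().normalize(), df['timestamp'].max().normalize(), freq = 'D')
    n_day_month = pd.Series(days.month).value_counts()
    df1 = df.groupby('month')['meter'].count().reset_index()
    df1['n_days'] = df1['month'].map(n_day_month)
    df1['meter_'] = df1['n_days'] * df.building_id.nunique() * 24
    df1['missing_observations_%'] = 100 - (df1['meter'] / df1['meter_']) * 100
    df1['missing_observations_%'] = df1['missing_observations_%'].astype(str) + '%'
    return df1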

check_series(train_el, 1)

In [None]:
check_series(test_el, 2)

* The test set has 1413 series, and each series has observations for every day of the year and all 24 hours of each day.
* This is not the case for the train set :(.

# Chilled Water

Let's check the building_ids (time series) for chilled water meter

In [None]:
plot_time_freq(train_ch, 'chilled water', 'train')

In [None]:
plot_time_freq(test_ch, 'chilled water', 'test')

In [None]:
build_sw(train_ch, ['building_id', 'month', 'dayofmonth'], 'month', 2)

In [None]:
build_sw(train_ch, ['building_id', 'month', 'dayofmonth'], 'month', 3)

In [None]:
check_hour(train_ch)

In [None]:
check_hour(test_ch)

In [None]:
start_date(train_ch)

In [None]:
check_series(train_ch, 1)

In [None]:
check_series(test_ch, 2)

* Same scenario for chilled water

# Steam

Let's check the building_ids (time series) for steam meter

In [None]:
plot_time_freq(train_st, 'steam', 'train')

In [None]:
plot_time_freq(test_st, 'steam', 'test')

In [None]:
check_series(train_st, 1)

In [None]:
check_series(test_st, 2)

* Same scenario for steam

# Hot Water

Let's check the building_ids (time series) for hot water

In [None]:
plot_time_freq(train_ho, 'hot water', 'train')

In [None]:
plot_time_freq(test_ho, 'hot water', 'test')

In [None]:
check_series(train_ho, 1)

In [None]:
check_series(test_ho, 2)

# Cross Series

Let's check how many buildings have 4, 3, 2 and 1 meter types

In [None]:
cross_series = train.groupby(['building_id'])['meter'].nunique().reset_index()
cross_series.columns = ['building_id', 'n_meter']
# the groupby above is per building_id, so these are counts of buildings
print('{} buildings have the 4 types of meters'.format(cross_series[cross_series['n_meter']==4].shape[0]))
print('{} buildings have 3 types of meters'.format(cross_series[cross_series['n_meter']==3].shape[0]))
print('{} buildings have 2 types of meters'.format(cross_series[cross_series['n_meter']==2].shape[0]))
print('{} buildings have only 1 meter'.format(cross_series[cross_series['n_meter']==1].shape[0]))
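The four filters above can also be collapsed into a single `value_counts`, and the same per-building counts give the total number of time series for free. This is a minimal sketch on a made-up toy frame, not the competition data.

```python
import pandas as pd

# Toy rows: building 0 has one meter type, building 1 has two, building 2 has three
toy = pd.DataFrame({
    'building_id': [0, 0, 1, 1, 1, 2, 2, 2],
    'meter':       [0, 0, 0, 1, 1, 0, 1, 3],
})

# Number of distinct meter types per building
n_meter = toy.groupby('building_id')['meter'].nunique()

# How many buildings have 1, 2, 3 ... meter types
buildings_per_count = n_meter.value_counts().sort_index()
print(buildings_per_count)

# Each building contributes one time series per meter type it has
n_series = int(n_meter.sum())
print('total series:', n_series)
```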

# Target Variable vs Time Analysis

Let's explore the target variable (meter_reading) vs time for each meter type

In [None]:
fig, ax = plt.subplots(2, 2, figsize = (12, 12))
sns.distplot(np.log1p(train_el['meter_reading']), ax = ax[0,0])
ax[0,0].set_title('Distribution for electricity meter')     
sns.distplot(np.log1p(train_ch['meter_reading']), ax = ax[0,1])
ax[0,1].set_title('Distribution for chilled water meter') 
sns.distplot(np.log1p(train_st['meter_reading']), ax = ax[1,0])
ax[1,0].set_title('Distribution for steam meter') 
sns.distplot(np.log1p(train_ho['meter_reading']), ax = ax[1,1])
ax[1,1].set_title('Distribution for hot water meter')

* We have a very different distribution for each meter type. Also, we have a lot of 0s.
* What is the meaning of the 0 readings? Can these 0 readings be related to the previous findings?
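To put a number on "a lot of 0s", the share of zero readings per meter type can be computed directly. The sketch below uses made-up readings; it also shows why `np.log1p` is used for the plots: it maps 0 to 0 instead of failing like a plain logarithm, and `np.expm1` inverts it.

```python
import numpy as np
import pandas as pd

# Toy readings (made up): meter 0 has one zero out of three, meter 1 is all zeros
toy = pd.DataFrame({
    'meter':         [0,   0,    0,     1,   1,   2],
    'meter_reading': [0.0, 10.0, 250.0, 0.0, 0.0, 1200.0],
})

# Fraction of readings equal to 0 for each meter type
zero_share = toy['meter_reading'].eq(0).groupby(toy['meter']).mean()
print(zero_share)

# log1p(x) = log(1 + x) keeps zeros finite; expm1 recovers the original scale
log_target = np.log1p(toy['meter_reading'])
recovered = np.expm1(log_target)
```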

Let's plot some random building_ids for each meter type

In [None]:
def plot_series(df, building_id, meter_type):
    plt.figure(figsize = (8, 8))
    df1 = df[df['building_id']==building_id]
    # keep the daily sums (the original discarded this result and plotted hourly rows)
    df1 = df1.groupby(['date'])['meter_reading'].sum().reset_index()
    sns.lineplot(x = df1['date'], y = df1['meter_reading'])
    plt.xlabel('Date', fontsize = 10)
    plt.ylabel('Meter reading (sum of the day)')
    plt.suptitle('Meter reading for building_id {} for {} meter type'.format(building_id, meter_type))
    plt.show()
plot_series(train_el, 0, 'electricity')

This series starts on 2016-01-01, but I believe it really starts in May. Maybe the first months were a meter calibration or a malfunction, or maybe it's right and all the people who live in that building were on vacation xD... I really don't know.

Let's plot more series

In [None]:
plot_series(train_el, 1, 'electricity')

Similar to the previous building_id. In May the readings increase a lot. Why is this happening?

In [None]:
plot_series(train_el, 1400, 'electricity')

Very different haha. Also look at February and March: we are missing some observations

In [None]:
plot_series(train_el, 800, 'electricity')

Also different

In [None]:
plot_series(train_el, 300, 'electricity')

To understand these series better we need more information!

In [None]:
plot_series(train_ch, 1000, 'chilled water')

In [None]:
plot_series(train_ch, 1350, 'chilled water')

Also missing some observations

# Check other time variables

In [None]:
def plot_time_variables(df1, df2, df3, df4, col, meter_type):
    df1 = df1.groupby([col])['meter_reading'].sum().reset_index()
    df2 = df2.groupby([col])['meter_reading'].sum().reset_index()
    df3 = df3.groupby([col])['meter_reading'].sum().reset_index()
    df4 = df4.groupby([col])['meter_reading'].sum().reset_index()
    fig, ax = plt.subplots(2, 2, figsize = (12, 12))
    # recent seaborn versions require x and y to be passed as keywords
    sns.lineplot(x = df1[col], y = df1['meter_reading'], ax = ax[0,0])
    sns.lineplot(x = df2[col], y = df2['meter_reading'], ax = ax[0,1])
    sns.lineplot(x = df3[col], y = df3['meter_reading'], ax = ax[1,0])
    sns.lineplot(x = df4[col], y = df4['meter_reading'], ax = ax[1,1])
     
plot_time_variables(train_el, train_ch, train_st, train_ho, 'hour', 'electricity')

The hour feature looks very predictive: the total meter reading changes clearly over the day.

In [None]:
plot_time_variables(train_el, train_ch, train_st, train_ho, 'week', 'electricity')

We definitely have a seasonality for each meter type

In [None]:
plot_time_variables(train_el, train_ch, train_st, train_ho, 'dayofweek', 'electricity')

Day of week is also predictive

Let's move forward and check the other files.

# What do we know so far

* We have the same series in the training and testing data (2380)
* We have 1413, 498, 324 and 145 series for electricity, chilled water, steam and hot water meters
* For some series in the training data we are missing some observations
* For the test data we don't have missing observations; each series has one observation for each hour of every day of the year (2017 and 2018).
* We have a lot of 0 meter readings in each meter type
* Some series have 0 meter readings at the beginning and very low readings at the end.
* 13 buildings have the 4 types of meters
* 331 buildings have 3 types of meters
* 230 buildings have 2 types of meters
* 875 buildings have only 1 meter
* Time features are very predictive
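As a quick sanity check, the building counts in this summary can be tied back to the total number of series: a building with k meter types contributes k series. The short sketch below only redoes that arithmetic with the numbers reported above.

```python
# Counts reported in the summary: {number of meter types: number of buildings}
buildings_by_n_meters = {4: 13, 3: 331, 2: 230, 1: 875}

# Total number of buildings
n_buildings = sum(buildings_by_n_meters.values())

# Each building contributes one series per meter type it has
n_series = sum(k * v for k, v in buildings_by_n_meters.items())

print('buildings:', n_buildings)
print('series:', n_series)
```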

# Don't know so far :(

* Why do we have so many 0 meter readings?
* Why do we have 0 meter readings at the beginning and very low readings at the end of the series?
* Why do we have missing observations in some series of the train set but not in the test set?

I'm going to try to find an answer to all these questions with the other features of the competition.

# Exploring Building Meta Data

In [None]:
bm = pd.read_csv('/kaggle/input/ashrae-energy-prediction/building_metadata.csv')
bm = reduce_mem_usage(bm)
bm.head()

In [None]:
# check if we have all the building metadata
len(list(set(train['building_id'].unique()).intersection(set(bm['building_id'].unique()))))

It's OK

In [None]:
# check for missing values
def missing_values(df):
    df1 = pd.DataFrame(df.isnull().sum()).reset_index()  # use the argument df, not the global bm
    df1.columns = ['feature', 'n_missing_values']
    df1['ratio'] = df1['n_missing_values'] / df.shape[0]
    df1['unique'] = df.nunique().values
    df1['max'] = df.max().values
    df1['min'] = df.min().values
    return df1
missing_values(bm)

* So we have a lot of missing values for year built and floor count. On the other hand, we have very small buildings (283 square feet) and very big ones (875000 square feet).
* The tallest building has 26 floors, the smallest has 1 floor
* We have 16 site_id values
* We have 16 primary_use types

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 5))
sns.distplot(bm['square_feet'], ax = ax1)
ax1.set_title('Square feet distribution')
sns.distplot(bm['year_built'].dropna(), ax = ax2)
ax2.set_title('Year built distribution')

In [None]:
train_el = train_el.merge(bm, on = 'building_id')
train_el.head()

In [None]:
# is there a relation between square feet and floor count
bm[['square_feet', 'floor_count']].corr()

# Building metadata meter readings

Let's check the meter reading for each building metadata information

In [None]:
def plot_group_meta(cols, df, name):
    df1 = df.groupby(cols)['meter_reading'].sum().reset_index()
    for i in list(df1[cols[0]].unique()):
        df2 = df1[df1[cols[0]]==i]
        plt.figure(figsize = (9, 9))
        sns.lineplot(x = df2[cols[1]], y = df2['meter_reading'])  # keyword x, y for recent seaborn
        plt.title('Meter readings for {} meter for {} {}'.format(name, cols[0], i))
        
        
plot_group_meta(['site_id', 'date'], train_el, 'electricity')

* Interesting, take a look at site_id 15. It seems that this site_id contains the buildings that have missing observations.
* Site_id 0 also has a very unique behaviour

In [None]:
build_sw(train_el[train_el['site_id']==15], ['building_id', 'month', 'dayofmonth'], 'month', 2)

In [None]:
build_sw(train_el[train_el['site_id']==15], ['building_id', 'month', 'dayofmonth'], 'month', 3)

Wow! Site_id 15 has missing observations. Maybe site_id refers to a country, or a place?

In [None]:
plot_group_meta(['primary_use', 'date'], train_el, 'electricity')

Let's check the correlation between square_feet, year_built, and floor count with meter reading

In [None]:
def plot_corr(df, files):
    plt.figure(figsize = (10,8))
    sns.heatmap(df.corr(), annot = True, cmap="YlGnBu")
    plt.title('Correlation analysis between target variable and {}'.format(files))
    plt.show()
corr_frame = train_el[['meter_reading', 'year_built', 'square_feet', 'floor_count']]
plot_corr(corr_frame, 'building metadata')

* Square feet is the feature with the highest positive correlation with meter_reading; floor count is second.

More to come, stay tuned!