Welcome to the EDA. The kernel's title, אַשְׁרֵי, is a pun: there is a Jewish prayer called the "Ashrei", which is pronounced the same way as ASHRAE.

In [None]:
import numpy as np
import pandas as pd
import os, gc
from tqdm import tqdm_notebook

### To begin, let's just look at the train and test.

In [None]:
tr = pd.read_csv('../input/ashrae-energy-prediction/train.csv')
te = pd.read_csv('../input/ashrae-energy-prediction/test.csv')

In [None]:
tr.head()

In [None]:
te.head()

In the raw files we are given just the building_id, the meter type, and the timestamp of each reading. A building can have more than one meter: per the competition's data description, the meter codes are 0 for electricity, 1 for chilled water, 2 for steam, and 3 for hot water, and not every building has every type. The dataset samples the readings from these meters over time. I bet if we group by building_id in the trainset, we will see each building_id come up multiple times.

In [None]:
tr.groupby(['building_id','meter']).size()

In [None]:
tr['meter'].value_counts()

In [None]:
# Zoom in onto one building, and one meter.
tr.query('building_id==0 & meter==0')

Indeed, we are seeing meter readings recorded every hour. The organizer noted that there are some gaps as well.
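
To put a number on those gaps, one option is to compare a meter's timestamps against a complete hourly range. This is a minimal sketch on a made-up toy frame; the `missing_hours` helper and the `toy` data are hypothetical and not part of the competition files.

```python
import pandas as pd

def missing_hours(timestamps):
    """Return the hourly timestamps absent between the first and last reading."""
    ts = pd.to_datetime(pd.Series(timestamps))
    # Every hour we would expect to see if there were no gaps.
    full = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    return full.difference(pd.DatetimeIndex(ts))

# Toy readings for one building and one meter; the 02:00 reading is missing.
toy = pd.DataFrame({
    'building_id': [0, 0, 0, 0],
    'meter': [0, 0, 0, 0],
    'timestamp': ['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                  '2016-01-01 03:00:00', '2016-01-01 04:00:00'],
})
gaps = missing_hours(toy['timestamp'])
print(gaps)
```

On the real data the same helper could be applied to the result of `tr.query('building_id==0 & meter==0')['timestamp']`.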

### Let's use building_metadata.csv to gain more information on these building_id's.

In [None]:
meta = pd.read_csv('../input/ashrae-energy-prediction/building_metadata.csv')

In [None]:
meta

In [None]:
for col in meta.columns:
    if meta[col].isnull().sum() > 0:  # If the column has any NaN values...
        print(col)

We should be careful that year_built and floor_count have Nulls. This is important for algorithms that don't tolerate Nulls.
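
For algorithms that reject NaN, it helps to know how much is missing and to decide on a fill strategy. The sketch below uses a made-up `toy_meta` frame; the median fill and the `_missing` flag columns are illustrative assumptions, not something this kernel applies.

```python
import numpy as np
import pandas as pd

# Toy metadata with nulls in the same two columns (values are made up).
toy_meta = pd.DataFrame({
    'building_id': [0, 1, 2, 3],
    'year_built': [2008.0, np.nan, 1991.0, np.nan],
    'floor_count': [np.nan, 2.0, np.nan, np.nan],
})

# Share of missing values per column, keeping only columns that have any.
null_share = toy_meta.isnull().mean()
print(null_share[null_share > 0])

# One possible treatment: flag the missing rows, then fill with the median.
filled = toy_meta.copy()
for col in ['year_built', 'floor_count']:
    filled[col + '_missing'] = filled[col].isnull().astype('int8')
    filled[col] = filled[col].fillna(filled[col].median())
```

Keeping the flag column lets a model still see which values were imputed.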

### Let's convert the categorical primary_use into integers to save RAM space.

In [None]:
# Map each primary_use label to an integer code (codes follow sorted label order), and back.
primary_use_cats = pd.Series(meta['primary_use'].unique()).astype('category')
primary_use_to_id = dict(zip(primary_use_cats, primary_use_cats.cat.codes))
id_to_primary_use = dict(zip(primary_use_cats.cat.codes, primary_use_cats))

In [None]:
primary_use_to_id

In [None]:
meta['primary_use'] = meta['primary_use'].map(primary_use_to_id).astype('int32')

### Now let's rowbind train and test so we can merge meta onto it all at once.

In [None]:
te['meter_reading'] = -1
tr['row_id'] = -1

tr_te = pd.concat([tr,te],axis=0,sort=True)
del tr,te; gc.collect()
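
The -1 placeholders are what let us tell the two sets apart after the concat. Here is a sketch of recovering the split on made-up toy frames, assuming real `row_id` values are never -1; `toy_tr`, `toy_te`, and `both` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for train (has meter_reading) and test (has row_id).
toy_tr = pd.DataFrame({'building_id': [0, 1], 'meter_reading': [10.0, 20.0]})
toy_te = pd.DataFrame({'building_id': [0, 1], 'row_id': [0, 1]})

# Same sentinel trick as above: fill the column each side lacks with -1.
toy_te['meter_reading'] = -1
toy_tr['row_id'] = -1

both = pd.concat([toy_tr, toy_te], axis=0, sort=True, ignore_index=True)

# Train rows carry the row_id sentinel; test rows carry real row ids.
train_part = both[both['row_id'] == -1]
test_part = both[both['row_id'] != -1]
print(train_part.shape, test_part.shape)
```

The sketch passes `ignore_index=True` so the combined frame gets a fresh index; without it, `pd.concat` keeps the original labels and the index contains duplicates.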

Before we merge, note that the timestamp strings are really expensive to store in RAM. Let's convert them to integer codes like we did with meta's primary_use.

In [None]:
# Map each timestamp string to an integer code (codes follow sorted string order), and back.
timestamp_cats = pd.Series(tr_te['timestamp'].unique()).astype('category')
timestamp_to_id = dict(zip(timestamp_cats, timestamp_cats.cat.codes))
id_to_timestamp = dict(zip(timestamp_cats.cat.codes, timestamp_cats))

In [None]:
# The number of unique timestamps we have in train + test.
print(len(timestamp_to_id))

In [None]:
tr_te.head()

In [None]:
tr_te['timestamp'] = tr_te['timestamp'].map(timestamp_to_id).astype('int64')
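
To see why this is worth doing, here is a rough sketch that measures the footprint of timestamp strings against integer codes. The data is made up (24 hourly strings repeated 1000 times), and the exact byte counts for the string column depend on the Python build.

```python
import pandas as pd

# Toy data: 24 hourly timestamp strings repeated 1000 times (made up for illustration).
hours = pd.date_range('2016-01-01', periods=24, freq=pd.Timedelta(hours=1))
stamps = pd.Series(hours.strftime('%Y-%m-%d %H:%M:%S').tolist() * 1000, dtype=object)

# Same idea as the mapping above: replace each string with a small integer code.
codes = stamps.astype('category').cat.codes.astype('int32')

# deep=True counts the Python string objects, not just the pointers to them.
obj_bytes = stamps.memory_usage(index=False, deep=True)
int_bytes = codes.memory_usage(index=False, deep=True)
print(obj_bytes, int_bytes)
```

Parsing with `pd.to_datetime` would be another option: it stores 8 bytes per row and keeps date arithmetic available, at the cost of being larger than an int32 code.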

In [None]:
tr_te.head()

In [None]:
gc.collect()

In [None]:
print('Original shape is',tr_te.shape) 
tr_te = tr_te.merge(meta, on='building_id')
print('Shape after merging is',tr_te.shape) 
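
Comparing shapes matters because `merge` defaults to an inner join, which silently drops any building_id that is missing from the metadata. The sketch below shows the difference on made-up toy frames. It also uses `validate='many_to_one'`, which makes pandas raise if the right-hand frame has duplicate keys.

```python
import pandas as pd

# Toy readings; building 2 has no row in the toy metadata.
readings = pd.DataFrame({'building_id': [0, 0, 1, 2],
                         'meter_reading': [1.0, 2.0, 3.0, 4.0]})
toy_meta = pd.DataFrame({'building_id': [0, 1], 'site_id': [0, 0]})

# Default inner join: the unmatched reading disappears.
inner = readings.merge(toy_meta, on='building_id')

# Left join keeps every reading and fills the missing metadata with NaN.
left = readings.merge(toy_meta, on='building_id', how='left',
                      validate='many_to_one')
print(inner.shape, left.shape)
```

If the row count is unchanged after an inner merge, every reading found its metadata.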

In [None]:
tr_te.head()

### Now let's actually begin some EDA. Let's look at how square_feet, year_built, and floor_count are distributed.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=20,15
import seaborn as sns

In [None]:
sns.kdeplot(meta['square_feet'], label='square_feet', shade=True, kernel='epa')

In [None]:
sns.kdeplot(meta['year_built'], label='year_built', shade=True, kernel='epa')

Wait, is that year_built in the future?

In [None]:
meta['year_built'].max(), meta['year_built'].min()

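Nope. It is just an effect of the kernel density estimate: the kernel spreads each observation over a window, so the smoothed curve extends a little past the smallest and largest observed years.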
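
A hand-rolled sketch of that effect, using the Epanechnikov kernel K(u) = 0.75 (1 - u^2) for |u| <= 1. The years and the bandwidth of 5 are made up for illustration and are not the values seaborn picks; `epanechnikov_kde` is a hypothetical helper.

```python
import numpy as np

def epanechnikov_kde(x, data, bandwidth):
    """Evaluate an Epanechnikov kernel density estimate of `data` at points `x`."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    # Scaled distance between every evaluation point and every observation.
    u = (x[:, None] - data[None, :]) / bandwidth
    kernel = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
    return kernel.mean(axis=1) / bandwidth

years = np.array([1990, 2000, 2010, 2017])   # made-up build years
grid = np.array([2017.0, 2020.0, 2030.0])    # the last two lie beyond the max
density = epanechnikov_kde(grid, years, bandwidth=5.0)
print(density)
```

The density at 2020 is positive even though no toy building is newer than 2017, while at 2030 (more than one bandwidth away) it is exactly zero.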

In [None]:
sns.kdeplot(meta['floor_count'], label='floor_count', shade=True, kernel='epa')

### Let's see how meter_reading may change with respect to square_feet, year_built, and floor_count.

In [None]:
# sns.scatterplot(tr_te['square_feet'], tr_te['meter_reading'])
# sns.scatterplot(tr_te['year_built'], tr_te['meter_reading'])
# sns.scatterplot(tr_te['floor_count'], tr_te['meter_reading'])

### Let's load the weather data and merge

In [None]:
weather_tr = pd.read_csv('../input/ashrae-energy-prediction/weather_train.csv')
weather_tr['timestamp'] = weather_tr['timestamp'].map(timestamp_to_id).astype('int64')
weather_te = pd.read_csv('../input/ashrae-energy-prediction/weather_test.csv')
weather_te['timestamp'] = weather_te['timestamp'].map(timestamp_to_id).astype('int64')

weather = pd.concat([weather_tr,weather_te],axis=0)
del weather_tr, weather_te; gc.collect()

Before we can merge, I need to change the data types to be smaller, or else we will get a MemoryError in Kaggle kernels.

In [None]:
tr_te['building_id'] = tr_te['building_id'].astype('int16')  # int8 only holds -128..127, too small for the building ids
tr_te['meter'] = tr_te['meter'].astype('int8')
tr_te['row_id'] = tr_te['row_id'].astype('int32')
tr_te['timestamp'] = tr_te['timestamp'].astype('int32')
tr_te['site_id'] = tr_te['site_id'].astype('int8')
tr_te['primary_use'] = tr_te['primary_use'].astype('int8')
tr_te['square_feet'] = tr_te['square_feet'].astype('int32')
tr_te['year_built'] = tr_te['year_built'].astype('float16')
tr_te['floor_count'] = tr_te['floor_count'].astype('float16')

weather['site_id'] = weather['site_id'].astype('int8')
weather['timestamp'] = weather['timestamp'].astype('int32')
weather['air_temperature'] = weather['air_temperature'].astype('float16')
weather['cloud_coverage'] = weather['cloud_coverage'].astype('float16')
weather['dew_temperature'] = weather['dew_temperature'].astype('float16')
weather['precip_depth_1_hr'] = weather['precip_depth_1_hr'].astype('float16')
weather['sea_level_pressure'] = weather['sea_level_pressure'].astype('float16')
weather['wind_direction'] = weather['wind_direction'].astype('float16')
weather['wind_speed'] = weather['wind_speed'].astype('float16')

gc.collect()
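
Downcasting is only safe when every value fits the target type, so a quick range check before each `astype` can save a silent corruption. This is a sketch on made-up values; `fits` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """True if every value of an integer series lies within dtype's range."""
    info = np.iinfo(dtype)
    return info.min <= series.min() and series.max() <= info.max

ids = pd.Series([0, 100, 1448])  # made-up ids, some above the int8 limit of 127
print(fits(ids, np.int8), fits(ids, np.int16))

# pandas can also pick the smallest integer type that holds the data.
safe = pd.to_numeric(ids, downcast='integer')
print(safe.dtype)

# float16 is lossy too: between 512 and 1024 it can only step by 0.5.
pressure = np.float16(1013.2)
print(float(pressure))
```

The last line is a reminder that float16 trades precision for space, which matters for columns whose values sit around a thousand, such as sea_level_pressure.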

In [None]:
print('Original shape is',tr_te.shape) 
tr_te = tr_te.merge(weather, on=['site_id','timestamp'], how='left')
print('Shape after merging is',tr_te.shape) 
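
With `how='left'` the row count stays the same even when a site and timestamp has no weather record, so the shapes alone do not reveal missing weather. One way to count unmatched rows is the `indicator=True` option of `merge`, sketched here on made-up toy frames.

```python
import pandas as pd

# Toy readings keyed by site and timestamp code; (0, 1) has no weather row.
rows = pd.DataFrame({'site_id': [0, 0, 1], 'timestamp': [0, 1, 0]})
toy_weather = pd.DataFrame({'site_id': [0, 1], 'timestamp': [0, 0],
                            'air_temperature': [25.0, 3.5]})

# indicator=True adds a '_merge' column saying where each row came from.
merged = rows.merge(toy_weather, on=['site_id', 'timestamp'],
                    how='left', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```

Rows marked `left_only` end up with NaN in every weather column.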

In [None]:
tr_te.dtypes

In [None]:
primary_use_to_id

In [None]:
tr_te.head()