 **Introduction**
 
This notebook summarizes the main characteristics of the training dataset and shows some basic relationships among the main variables.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
#from time import time
import datetime as dt

**Load data**

In [None]:
data_folder = '/kaggle/input/ashrae-energy-prediction/'
train_df = pd.read_csv(data_folder + 'train.csv')

I add log(1 + meter_reading) as a new column. This is the same transform of the target variable that the evaluation metric uses.

In [None]:
# log1p(x) computes log(1 + x) and stays accurate for readings close to zero
train_df['log_meter_reading'] = np.log1p(train_df['meter_reading'])
# convert timestamp into datetime object
train_df['timestamp'] = pd.to_datetime(train_df['timestamp'],format='%Y-%m-%d %H:%M:%S')
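
To make the link with the evaluation metric concrete, here is a minimal sketch. It assumes the metric is the root mean squared logarithmic error (RMSLE), which the notebook does not state explicitly; the helpers `rmse` and `rmsle` and the numbers are made up for illustration. Under that assumption, an ordinary RMSE computed on the log(1 + x) column gives the same score.

```python
import numpy as np

def rmse(a, b):
    """Plain root mean squared error between two arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def rmsle(y_true, y_pred):
    """RMSLE (assumed evaluation metric): RMSE of the log(1 + x) values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up readings and predictions, not taken from the competition data
y_true = np.array([0.0, 10.0, 250.0, 1200.0])
y_pred = np.array([1.0, 12.0, 200.0, 1500.0])

score_direct = rmsle(y_true, y_pred)
score_via_log = rmse(np.log1p(y_true), np.log1p(y_pred))
print(score_direct, score_via_log)

# np.expm1 undoes the transform when predictions are made on the log scale
recovered = np.expm1(np.log1p(y_true))
```

So a model trained on `log_meter_reading` with a squared-error loss would be optimizing the assumed metric directly, as long as its predictions are mapped back with `np.expm1`.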


In [None]:
train_df.head()

In [None]:
train_df.tail()

The training dataset contains data for 1448 different buildings.

However, each building can have more than one meter type.

In [None]:
train_df.groupby('building_id')['meter'].nunique().hist(bins= np.arange(0.5,5.5,1));
plt.xticks(np.arange(1,5));
plt.ylabel('# of buildings')
plt.xlabel('# of different meter values per building');

Most buildings have only one meter type, but the fraction of those with more than one is not negligible.

Only 13 buildings have all 4 meter types.
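
The counts behind the histogram can also be read off directly. A minimal sketch on a toy table (the ids and values are made up; only the column names `building_id` and `meter` come from the dataset):

```python
import pandas as pd

# Toy stand-in for train_df: one row per reading
toy = pd.DataFrame({
    'building_id': [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'meter':       [0, 0, 1, 0, 0, 0, 1, 2, 3],
})

# Number of distinct meter types for each building
meters_per_building = toy.groupby('building_id')['meter'].nunique()

# How many buildings have 1, 2, 3 or 4 meter types
buildings_by_meter_count = meters_per_building.value_counts().sort_index()

# Buildings that have all 4 meter types
n_all_four = int((meters_per_building == 4).sum())
print(buildings_by_meter_count)
print(n_all_four)
```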

As shown by other participants, the meter type influences the time trend of the target variable.

It therefore makes sense to analyze each building/meter combination separately.

In [None]:
building_meter_datapoints = train_df.groupby(['building_id','meter'])['meter_reading'].count().to_frame()
building_meter_datapoints.shape

There are 2380 building/meter combinations.

In [None]:
building_meter_datapoints.head()

As shown earlier, the training dataset covers the whole of 2016, with data sampled every hour.

Since 2016 was a leap year, each building/meter combination can have at most 366*24 = 8784 data points.

And in fact this is the maximum number of data points found.

In [None]:
building_meter_datapoints.meter_reading.max()

In [None]:
max_data_points = 366*24
sum(building_meter_datapoints.meter_reading==max_data_points)
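
As a cross-check of the 366*24 figure, the expected number of readings can be derived from the calendar, and each count can be turned into a coverage fraction. A minimal sketch; `toy_counts` and its values are made up and only stand in for `building_meter_datapoints`:

```python
import pandas as pd

# Every hourly timestamp of 2016, from the first to the last hour of the year
full_index = pd.date_range('2016-01-01 00:00:00', '2016-12-31 23:00:00',
                           freq=pd.Timedelta(hours=1))
expected_points = len(full_index)

# Toy counts per (building_id, meter), not real values
toy_counts = pd.Series(
    [8784, 4392, 479],
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 0), (2, 1)],
                                    names=['building_id', 'meter']),
    name='meter_reading',
)

# Fraction of the year covered by each building/meter combination
coverage = toy_counts / expected_points
is_complete = toy_counts == expected_points
print(expected_points)
print(coverage)
```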

In [None]:
bins = np.concatenate((np.arange(0, 9000, 1000), [max_data_points - 0.5]))
max_bin = np.array([max_data_points - 0.5, max_data_points + 100])

fig, ax = plt.subplots(figsize=(8, 6))
# Combinations with an incomplete year of data
ax.hist(building_meter_datapoints.meter_reading, bins=bins, histtype='bar')
# The red bar represents cases with a complete dataset (set explicitly; the default color is not red)
ax.hist(building_meter_datapoints.meter_reading, bins=max_bin, color='red')
ax.grid()
ax.set_yscale('log')  # NOTE: the y-axis is in log scale
ax.set_xlabel('# of data points')
ax.set_ylabel('building-meter occurrences');

Let's find the building/meter combination with the fewest data points.

In [None]:
building_meter_datapoints.loc[building_meter_datapoints.idxmin()]

In [None]:
b_403_0 = train_df[(train_df['building_id']==403) & (train_df['meter']==0)]
b_403_0.plot(x='timestamp',y='meter_reading', figsize=(8,6));
plt.grid();

Only 19 days of data collection. 

Good luck with predicting how the building behaves in the next one and a half years...
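
The same check can be run for every building/meter combination at once instead of plotting them one by one. A minimal sketch on a toy table; the building ids and timestamps are hypothetical:

```python
import pandas as pd

# Toy stand-in for train_df with two hypothetical buildings
toy = pd.DataFrame({
    'building_id': [7, 7, 7, 8, 8],
    'meter': [0, 0, 0, 0, 0],
    'timestamp': pd.to_datetime([
        '2016-01-01 00:00:00', '2016-01-10 00:00:00', '2016-01-20 00:00:00',
        '2016-01-01 00:00:00', '2016-12-31 23:00:00',
    ]),
})

# First reading, last reading and number of readings per combination
span = toy.groupby(['building_id', 'meter'])['timestamp'].agg(['min', 'max', 'count'])

# Whole days between the first and the last reading
span['days_covered'] = (span['max'] - span['min']).dt.days
print(span.sort_values('days_covered'))
```

Note that `days_covered` only measures the distance between the first and last reading; a long span can still hide gaps, so it is worth reading it together with `count`.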

**Correlation between quantities**


In [None]:
# Read building metadata
building_df = pd.read_csv(data_folder + 'building_metadata.csv')
building_df['log_square_feet'] = np.log(building_df['square_feet'])

In [None]:
plt.hist(np.log10(1+train_df['meter_reading']))
plt.yscale('log');
plt.grid();
plt.xlabel('log10[1 + meter_reading]')
plt.ylabel('datapoints');

The values of the target variable span several orders of magnitude.

Interesting to see a spike at the right end of the graph. Could these be outliers?
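
One way to follow up on that question is to check whether the largest readings come from many buildings or only from a few. A minimal sketch on toy data; the `threshold` value and all the readings are hypothetical and would have to be chosen after looking at the real histogram:

```python
import pandas as pd

# Toy stand-in for train_df, with one building producing extreme readings
toy = pd.DataFrame({
    'building_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'meter': [0, 0, 0, 0, 0, 0, 2, 2, 2],
    'meter_reading': [50.0, 60.0, 55.0, 120.0, 130.0, 110.0, 9.0e6, 8.5e6, 40.0],
})

threshold = 1e6  # hypothetical cut-off for "suspiciously large"

# Rows above the cut-off, counted per building/meter combination
high = toy[toy['meter_reading'] > threshold]
high_counts = high.groupby(['building_id', 'meter']).size().sort_values(ascending=False)
print(high_counts)
```

If nearly all the extreme rows belong to one or two combinations, the spike is more likely a property of those meters than of the dataset as a whole.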

In [None]:
plt.hist(np.log10(1+building_df['square_feet']))
plt.yscale('log');
plt.grid();
plt.xlabel('log10[1 + square_feet]')
plt.ylabel('# of buildings');

Building sizes vary across several orders of magnitude too.

Ignoring all other variables, the correlation between building size and meter reading is very weak.

In [None]:
# Merge on the building_id column; join(on=...) would match against building_df's index instead
trainbuilding_df = train_df.merge(building_df, on='building_id', how='left')

trainbuilding_df[['meter_reading','square_feet']].corr()

However, on a log scale the correlation becomes moderate.

Keep in mind that this figure ignores building primary use, weather, meter type and the variation of readings over time!

In [None]:
trainbuilding_df[['log_meter_reading','log_square_feet']].corr()
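
Why the log scale helps can be illustrated with synthetic data. A minimal sketch, assuming a multiplicative relationship between size and consumption with log-normal noise; the parameters are arbitrary and are not fitted to the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic buildings: log-normal sizes, readings proportional to size times noise
log_area = rng.normal(loc=10.0, scale=1.0, size=n)
log_reading = log_area - 6.0 + rng.normal(loc=0.0, scale=1.5, size=n)
synthetic = pd.DataFrame({
    'square_feet': np.exp(log_area),
    'meter_reading': np.exp(log_reading),
})

# Pearson correlation on the raw values is dominated by a few very large points
raw_r = synthetic['square_feet'].corr(synthetic['meter_reading'])

# The same correlation after the transforms used in this notebook
log_r = np.log(synthetic['square_feet']).corr(np.log1p(synthetic['meter_reading']))
print(raw_r, log_r)
```

With heavy-tailed variables like these, the Pearson coefficient on the raw values understates a relationship that is close to linear on the log scale.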

If we restrict the selection to buildings with a given primary use (e.g. Education) and meter type (e.g. 0), the correlation becomes larger.

In [None]:
trainbuilding_edu0_df = trainbuilding_df[(trainbuilding_df['primary_use']=='Education') & (trainbuilding_df['meter']==0)]
trainbuilding_edu0_df.head()

In [None]:
trainbuilding_edu0_df[['log_meter_reading','log_square_feet']].corr()
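
The same comparison can be repeated for every primary use and meter type in one pass. A minimal sketch on toy data, built so that each group is perfectly correlated internally while the pooled correlation is weak; the values are made up:

```python
import pandas as pd

# Toy stand-in for trainbuilding_df with two groups at different consumption levels
toy = pd.DataFrame({
    'primary_use': ['Education'] * 3 + ['Office'] * 3,
    'meter': [0] * 6,
    'log_square_feet': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'log_meter_reading': [2.0, 4.0, 6.0, 10.0, 11.0, 12.0],
})

# Correlation over all rows, ignoring the grouping
pooled_r = toy['log_meter_reading'].corr(toy['log_square_feet'])

# Correlation within each (primary_use, meter) group
per_group_r = {
    key: group['log_meter_reading'].corr(group['log_square_feet'])
    for key, group in toy.groupby(['primary_use', 'meter'])
}
print(pooled_r)
print(per_group_r)
```

On the real data, groups with only a handful of buildings will give noisy coefficients, so the group sizes should be checked alongside the correlations.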

To be continued with other variables...