# Energy Usage Prediction using LSTM

# Import Packages

In [None]:
## NumPy is a package in Python used for Scientific Computing. NumPy package is used to perform different operations. The ndarray (NumPy Array) is a multidimensional array used to store values of same datatype.
import numpy as np
## Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
import pandas as pd
## Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
import matplotlib
import matplotlib.pyplot as plt
## Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
import seaborn as sns

In [None]:
## `%matplotlib` is a magic function in IPython. With this, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.
%matplotlib inline

# Dataset Analysis

In [None]:
dataset = pd.read_csv("../input/smart-home-dataset-with-weather-information/HomeC.csv")
dataset.info()

In [None]:
dataset.head()

> Let's have a look at name and data type of each feature (column).

In [None]:
tmp_str = "Feature(attribute)     DataType"
print(tmp_str+"\n"+"-"*len(tmp_str))
print(dataset.dtypes)

### The weather and energy dataset
The dataset contains smart-meter readings of house appliances in kW at 1-minute intervals, together with the weather conditions of that particular region.

#### Data Columns Descriptions:
(Data source: https://www.kaggle.com/taranvee/smart-home-dataset-with-weather-information)
##### Index 
- **time**
    * Time of the readings, with a time span of 1 minute.

##### Energy Usage 
- **use [kW]**
    * Total energy consumption
- **gen [kW]**
    * Total energy generated by means of solar or other power generation resources
- **House overall [kW]**
    * overall house energy consumption
- **Dishwasher [kW]** 
    * energy consumed by specific appliance
- **Furnace 1 [kW]**
    * energy consumed by specific appliance
- **Furnace 2 [kW]**
    * energy consumed by specific appliance
- **Home office [kW]**
    * energy consumed by specific appliance
- **Fridge [kW]**
    * energy consumed by specific appliance
- **Wine cellar [kW]**
    * energy consumed by specific appliance
- **Garage door [kW]**
    * energy consumed by specific appliance
- **Kitchen 12 [kW]**
    * energy consumption in kitchen 1
- **Kitchen 14 [kW]**
    * energy consumption in kitchen 2
- **Kitchen 38 [kW]**
    * energy consumption in kitchen 3
- **Barn [kW]**
    * energy consumed by specific appliance
- **Well [kW]**
    * energy consumed by specific appliance
- **Microwave [kW]**
    * energy consumed by specific appliance
- **Living room [kW]**
    * energy consumption in Living room
- **Solar [kW]**
    * Solar power generation

##### Weather
- **temperature**:
    * Temperature is a physical quantity expressing hot and cold.
- **humidity**:
    * Humidity is the concentration of water vapour present in air.
- **visibility**:
    * Visibility sensors measure the meteorological optical range which is defined as the length of atmosphere over which a beam of light travels before its luminous flux is reduced to 5% of its original value.

- **apparentTemperature**:
    * Apparent temperature is the temperature equivalent perceived by humans, caused by the combined effects of air temperature, relative humidity and wind speed. The measure is most commonly applied to the perceived outdoor temperature.
- **pressure**: 
    * Falling air pressure indicates that bad weather is coming, while rising air pressure indicates good weather
- **windSpeed**:
    * Wind speed, or wind flow speed, is a fundamental atmospheric quantity caused by air moving from high to low pressure, usually due to changes in temperature.
- **cloudCover**:
    * Cloud cover (also known as cloudiness, cloudage, or cloud amount) refers to the fraction of the sky obscured by clouds when observed from a particular location. Okta is the usual unit of measurement of the cloud cover.
- **windBearing**:
    * The direction the wind blows from, in degrees clockwise from True North (the direction of the North Pole). In meteorology, an azimuth of 000° is used only when no wind is blowing, while 360° means the wind is from the North. Directions measured relative to True North may be called "true bearings."
- **dewPoint**:
    * the atmospheric temperature (varying according to pressure and humidity) below which water droplets begin to condense and dew can form.
- **precipProbability**:
    * A probability of precipitation (POP), also referred to as chance of precipitation or chance of rain, is a measure of the probability that at least some minimum quantity of precipitation will occur within a specified forecast period and location.
- **precipIntensity**:
    * The intensity of rainfall is a measure of the amount of rain that falls over time. The intensity of rain is measured in the height of the water layer covering the ground in a period of time. It means that if the rain stays where it falls, it would form a layer of a certain height.
 
##### Others
- **summary**:
    * Report generated by the data collection system (apparently!).
    * Including:
    ```
    Clear, Mostly Cloudy, Overcast, Partly Cloudy, Drizzle,
       Light Rain, Rain, Light Snow, Flurries, Breezy, Snow,
       Rain and Breezy, Foggy, Breezy and Mostly Cloudy,
       Breezy and Partly Cloudy, Flurries and Breezy, Dry,
       Heavy Snow.
    ```
- **icon**:
    * The icon that is used by the data collection system (apparently!).
    * Including:
    ```
    cloudy, clear-night, partly-cloudy-night, clear-day, partly-cloudy-day, rain, snow, wind, fog.
    ```
    

In [None]:
## Return a tuple representing the dimensionality of the DataFrame.
print("Shape of the data: {} --> n_rows = {}, n_cols = {}".format(dataset.shape, dataset.shape[0],dataset.shape[1]))

In [None]:
## pandas.DataFrame.head: This function returns the first n rows for the object based on position. 
#It is useful for quickly testing if your object has the right type of data in it.
dataset.head(10)

In [None]:
## This function returns last n rows from the object based on position. 
#It is useful for quickly verifying data, for example, after sorting or appending rows.
dataset.tail(10)

> We see that the last row is invalid, so let's remove it.

# Data Preprocessing

In [None]:
dataset = dataset[0:-1] ## == dataset[0:dataset.shape[0]-1] == dataset[0:len(dataset)-1] == dataset[:-1]
dataset.tail()

In [None]:
## pandas.DataFrame.columns: The column labels of the DataFrame.
dataset.columns

> Let's clean the column names by removing the `[kW]` unit.

In [None]:
# Python string method replace() returns a copy of the string in which the occurrences of old have been replaced with new, 
#optionally restricting the number of replacements to max.
dataset.columns = [col.replace(' [kW]', '') for col in dataset.columns]
dataset.columns

# Feature Engineering

> Sometimes we are only interested in an aggregated result. To make this easy, we can create a new column and store the desired result in it.
> For example: if we are interested in the `total` energy usage by both `furnaces` or the `average` usage of all `kitchens`:

In [None]:
dataset['sum_Furnace'] = dataset[['Furnace 1','Furnace 2']].sum(axis=1)
dataset['avg_Kitchen'] = dataset[['Kitchen 12','Kitchen 14','Kitchen 38']].mean(axis=1)

> If you do not need old columns, you can drop them.

In [None]:
dataset = dataset.drop(['Kitchen 12','Kitchen 14','Kitchen 38'], axis=1)
dataset = dataset.drop(['Furnace 1','Furnace 2'], axis=1)
dataset.columns

* In this dataset, time is recorded in the [Unix Time](https://en.wikipedia.org/wiki/Unix_time) format.
> Unix Time represents the number of seconds that have passed since `00:00:00 UTC Thursday, 1 January 1970`.
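
As a hedged illustration of the conversion, the sketch below uses `pd.to_datetime` with `unit="s"` on the first timestamp that appears in this notebook (`1451624400`). The variable names are illustrative only, and the notebook itself builds its index with `pd.date_range` further down rather than with this call.

```python
import pandas as pd

# First timestamp of the dataset, as used in the notebook (seconds since the epoch).
start_ts = 1451624400

# unit="s" tells pandas the integer counts seconds; utc=True makes the
# result timezone-aware instead of depending on the machine's local zone.
start = pd.to_datetime(start_ts, unit="s", utc=True)
print(start)  # 2016-01-01 05:00:00+00:00

# The same call works element-wise on a whole column of timestamps.
stamps = pd.Series([start_ts, start_ts + 60, start_ts + 120])
converted = pd.to_datetime(stamps, unit="s", utc=True)
print(converted)
```

The printed start time agrees with the `'2016-01-01 05:00'` used for the index later in this section.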

In [None]:
dataset['time'].head()

> We would like to convert this large number that represents a Unix timestamp (e.g. "1284101485") to a readable date. So, one idea is to find out what the `start time` is.

In [None]:
import time 
# gmtime interprets the timestamp in UTC, so the result does not depend on the machine's time zone
print(' start ' , time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(1451624400)))

In [None]:
import time 
# Same conversion, this time reading the first timestamp from the dataset (interpreted in UTC)
print(' start ' , time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(int(dataset['time'].iloc[0]))))

> The data publisher says the dataset contains the readings with a time span of `1 minute` of house appliances
> in `kW` from a `smart meter` and `weather conditions` of that particular region.
> So, we set `freq='min'` and convert Unix time to a readable date.

In [None]:
time_index = pd.date_range('2016-01-01 05:00', periods=len(dataset),  freq='min')  
time_index = pd.DatetimeIndex(time_index)
dataset = dataset.set_index(time_index)
dataset = dataset.drop(['time'], axis=1)
dataset.iloc[np.r_[0:5,-5:0]].iloc[:,0] # numpy.r_ is a simple way to build up arrays quickly;
# you can use the resulting array to index your dataframe. Here it selects the first and the last 5 samples.

In [None]:
dataset.shape

> We have about 500K rows, and each row shows the home status at a specific `minute`.
> Let's plot the `temperature` data and see what it looks like.

In [None]:
dataset['temperature'].plot(figsize=(25,5))

> It may seem too noisy to you. We can `resample` data by taking the `average temperature` every `day` and then plot it.

In [None]:
## pandas.DataFrame.resample: Convenience method for frequency conversion and resampling of time series. 
dataset['temperature'].resample(rule='D').mean().plot(figsize=(25,5)) #D calendar day frequency

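> Here are the `rule`s you can use (these aliases follow older pandas releases; pandas 2.2 and later rename several of them, e.g. `M` to `ME`, `H` to `h`, and `T` to `min`):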
- B         business day frequency
- C         custom business day frequency (experimental)
- D         calendar day frequency
- W         weekly frequency
- M         month end frequency
- SM        semi-month end frequency (15th and end of month)
- BM        business month end frequency
- CBM       custom business month end frequency
- MS        month start frequency
- SMS       semi-month start frequency (1st and 15th)
- BMS       business month start frequency
- CBMS      custom business month start frequency
- Q         quarter end frequency
- BQ        business quarter end frequency
- QS        quarter start frequency
- BQS       business quarter start frequency
- A         year end frequency
- BA, BY    business year end frequency
- AS, YS    year start frequency
- BAS, BYS  business year start frequency
- BH        business hour frequency
- H         hourly frequency
- T, min    minutely frequency
- S         secondly frequency
- L, ms     milliseconds
- U, us     microseconds
- N         nanoseconds
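
To make the effect of the `rule` argument concrete, here is a minimal sketch on hypothetical toy data (two days of one-minute readings, not the real dataset). It uses only `D` and a multiple of `min`, since multiples such as `15min` are also accepted as rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two full days of one-minute readings.
idx = pd.date_range("2016-01-01", periods=2 * 24 * 60, freq="min")
toy = pd.Series(np.arange(len(idx), dtype=float), index=idx)

daily = toy.resample("D").mean()               # one value per calendar day
quarter_hourly = toy.resample("15min").mean()  # one value per 15-minute bin

print(len(daily), len(quarter_hourly))  # 2 192
print(daily)
```

A coarser rule gives fewer, smoother points; a finer rule keeps more of the short-term variation.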

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (25,5)

Now let's look at the dataset columns.

In [None]:
dataset.columns

> It seems `use` and `House overall` show the same data. Let's visualize these two columns.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1)
dataset['use'].resample('D').mean().plot(ax=axes[0]) #D calendar day frequency
dataset['House overall'].resample('D').mean().plot(ax=axes[1]) #D calendar day frequency

> They are the same, so it's better to remove one of them.
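
A plot can hide small differences, so a numerical comparison is a useful complement. The following is a minimal sketch on hypothetical stand-in data; in the notebook, the same comparison would be applied to `dataset['use']` and `dataset['House overall']` before the latter is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two columns being compared.
toy = pd.DataFrame({
    "use": [0.93, 0.47, 1.20, 0.81],
    "House overall": [0.93, 0.47, 1.20, 0.81],
})

# Exact element-wise equality (note: NaN == NaN is False under this check).
exactly_equal = bool((toy["use"] == toy["House overall"]).all())

# Tolerant comparison that allows tiny floating-point differences
# and treats NaNs in the same position as equal.
close_enough = bool(np.allclose(toy["use"], toy["House overall"], equal_nan=True))

print(exactly_equal, close_enough)  # True True
```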

In [None]:
dataset = dataset.drop(columns=['House overall'])
dataset.shape

> Columns `summary` and `icon` are not numerical. Let's look at their values and then drop both columns.

In [None]:
## pandas.Series.value_counts: Return a Series containing counts of unique values.
dataset['icon'].value_counts()

In [None]:
## pandas.Series.value_counts: Return a Series containing counts of unique values.
dataset['summary'].value_counts()

In [None]:
dataset = dataset.drop(columns=['summary', 'icon'])
dataset.shape

In [None]:
## pandas.Series.unique: Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
dataset['cloudCover'].unique()

> We see that for some rows we have an invalid value for `cloudCover`. 

In [None]:
dataset[dataset['cloudCover']=='cloudCover'].shape

> There are plenty of ways to deal with this kind of invalid value. The simplest is to remove the rows that contain it, but a more sophisticated way is to replace it. See: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
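
The options mentioned above can be compared side by side. This is a minimal sketch on a hypothetical four-value column, using `pd.to_numeric(..., errors="coerce")` to turn the stray string into `NaN` first; it is one possible approach, not necessarily the one used in the next cells.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which a stray header string replaced one reading.
raw = pd.Series(["0.75", "cloudCover", "0.31", "0.44"])

# Turn anything non-numeric into NaN so it is treated as missing.
numeric = pd.to_numeric(raw, errors="coerce")

dropped = numeric.dropna()         # option 1: discard the invalid rows
back_filled = numeric.bfill()      # option 2: copy the next valid value
forward_filled = numeric.ffill()   # option 3: copy the previous valid value

print(len(dropped))             # 3
print(back_filled.tolist())     # [0.75, 0.31, 0.31, 0.44]
print(forward_filled.tolist())  # [0.75, 0.75, 0.31, 0.44]
```

Dropping rows would leave gaps in a minute-by-minute index, which is one reason filling is often preferred for time series.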

In [None]:
dataset['cloudCover'][56:60]

> We replace these missing values with the next valid observation.

In [None]:
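# Mark the invalid string as missing, convert to float, then back-fill with the next valid value.
# (The `method` argument of `replace` is deprecated in recent pandas versions.)
dataset['cloudCover'] = dataset['cloudCover'].replace('cloudCover', np.nan).astype('float').bfill()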
dataset['cloudCover'].unique()

In [None]:
dataset['cloudCover'][56:60]

In [None]:
dataset.info()

In [None]:
dataset = dataset.resample('D').mean()
print("Shape of daily dataset: {} --> n_rows = {}, n_cols = {}".format(dataset.shape, dataset.shape[0],dataset.shape[1]))

# Time-Series Prediction with LSTM



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from tensorflow import keras
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
usedf = dataset['use']

In [None]:
usedf.head()

In [None]:
usedf=usedf.to_frame()

In [None]:
usedf.head()

In [None]:
usedf.info()

In [None]:
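# Build lagged/lead copies of a DataFrame for supervised time-series learning.
# `times` holds offsets relative to the current step t: negative values are
# past steps (t-1, t-2, ...), 0 is the current step, positive values are future steps.
def add_lags(series, times):
  cols = []
  column_index = []
  for lag in times:  # named `lag` so the `time` module is not shadowed inside the function
    # shift(-lag) aligns the value observed at t+lag with row t
    cols.append(series.shift(-lag))
    lag_fmt = "t+{lag}" if lag > 0 else "t{lag}" if lag < 0 else "t"
    column_index += [(lag_fmt.format(lag=lag), col_name)
        for col_name in series.columns]
  df = pd.concat(cols, axis=1)
  # Two-level columns: (time step label, original column name)
  df.columns = pd.MultiIndex.from_tuples(column_index)
  return df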

In [None]:
X = add_lags(usedf, times=range(-30+1,1)).iloc[30:-5]
y = add_lags(usedf, times=[5]).iloc[30:-5]
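
The windowing performed by `add_lags` can be illustrated on a tiny series. This is a minimal sketch that uses plain `shift` calls rather than the helper, with a hypothetical window of 3 steps and a horizon of 2 (the notebook uses 30 and 5); all names in it are illustrative.

```python
import pandas as pd

# Hypothetical daily series 0, 1, ..., 9 so the alignment is easy to read.
s = pd.Series(range(10),
              index=pd.date_range("2016-01-01", periods=10, freq="D"),
              name="use")

window, horizon = 3, 2   # look back 3 days, predict 2 days ahead

# shift(k) with k > 0 moves values forward in time, so column "t-2" on a
# given row holds the value observed two days before that row's date.
lagged = {}
for k in range(window - 1, -1, -1):
    label = f"t-{k}" if k else "t"
    lagged[label] = s.shift(k)
X = pd.concat(lagged, axis=1)

# shift(-horizon) pulls the future value back onto the current row.
y = s.shift(-horizon).rename(f"t+{horizon}")

# Rows without a full window or without a target contain NaN; slice them off.
X, y = X.iloc[window - 1:-horizon], y.iloc[window - 1:-horizon]
print(X.head(3))
print(y.head(3))
```

Each row of `X` is one training sample (the last `window` observations), and the matching entry of `y` is the value `horizon` steps later.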

In [None]:
X.head()

In [None]:
y.head()

In [None]:
train_slice = slice(None, "2016-10-30")

In [None]:
test_slice = slice("2016-11-1", None)

In [None]:
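# Chronological split: everything up to 2016-10-30 is used for training and everything
# from 2016-11-01 onward for testing (the shapes printed below give the exact proportions)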
X_train, y_train = X.loc[train_slice], y.loc[train_slice]
X_test, y_test = X.loc[test_slice], y.loc[test_slice]
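
Label-based slicing on a `DatetimeIndex` includes both endpoints, which matters when choosing the boundary dates. A minimal sketch on a hypothetical two-month daily frame:

```python
import pandas as pd

# Hypothetical daily frame covering 1 Oct to 30 Nov 2016 (61 rows).
idx = pd.date_range("2016-10-01", "2016-11-30", freq="D")
toy = pd.DataFrame({"use": range(len(idx))}, index=idx)

train = toy.loc[:"2016-10-30"]   # label slices include the end date
test = toy.loc["2016-11-01":]    # and the start date

print(len(train), len(test), len(toy))  # 30 30 61

# 2016-10-31 lies between the two slices, so it lands in neither set.
gap_day = pd.Timestamp("2016-10-31")
print(gap_day in train.index, gap_day in test.index)  # False False
```

Splitting by date rather than at random keeps all test samples later than the training samples, which is the usual choice for forecasting.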

In [None]:
print(X_train.shape)

In [None]:
print(X_test.shape)

In [None]:
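# Convert a DataFrame with two-level columns (time step, feature)
# into a 3D array of shape (samples, time steps, features), as expected by Keras LSTM layers.
def multilevel_df_to_ndarray(df):
  shape = [-1] + [len(level) for level in df.columns.remove_unused_levels().levels]
  return df.values.reshape(shape)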
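
The reshape can be checked on a small example. The sketch below builds a hypothetical frame with 4 samples, 3 time steps and 2 features (the notebook's data has a single feature, `use`) and reshapes it directly with NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 4 samples, 3 time steps, 2 features per step.
steps, features = ["t-2", "t-1", "t"], ["use", "temperature"]
columns = pd.MultiIndex.from_product([steps, features])
toy = pd.DataFrame(np.arange(4 * 6).reshape(4, 6), columns=columns)

# Columns are ordered step-major (all features of t-2, then t-1, ...),
# so a plain reshape yields (samples, time steps, features).
arr = toy.values.reshape(-1, len(steps), len(features))
print(arr.shape)   # (4, 3, 2)
print(arr[0])      # first sample: one row per time step
```

The reshape only works as intended when the columns are grouped by time step first, which is the order `add_lags` produces.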

In [None]:
X_train_3D = multilevel_df_to_ndarray(X_train)
X_test_3D = multilevel_df_to_ndarray(X_test)

In [None]:
print(X_train_3D.shape)

In [None]:
print(X_test_3D.shape)

In [None]:
y_train = y_train.values
y_test = y_test.values

In [None]:
print(y_train.shape)

In [None]:
print(y_test.shape)

In [None]:
model_LSTM = keras.models.Sequential()
model_LSTM.add(keras.layers.LSTM(units = 100, return_sequences = True,input_shape = X_train_3D.shape[1:]))
model_LSTM.add(keras.layers.LSTM(units = 50))
model_LSTM.add(keras.layers.Dense(1))
model_LSTM.summary()


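The number of parameters of an LSTM layer:

- Input vectors of size $m$
- Output vectors of size $n$
- Without bias vectors: $4(nm + n^2)$
- With bias vectors (the default in Keras): $4(nm + n^2 + n)$

For the model above:

- First LSTM layer ($m = 1$, $n = 100$): $4(100 \times 1 + 100 \times 100 + 100) = 4 \times 10{,}200 = 40{,}800$
- Second LSTM layer ($m = 100$, $n = 50$): $4(50 \times 100 + 50 \times 50 + 50) = 4 \times 7{,}550 = 30{,}200$
- Dense layer: $50 + 1 = 51$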
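
The parameter-count formula can be checked with a few lines of plain Python. The helper names below are hypothetical and no Keras call is involved; the results can be compared with the output of `model_LSTM.summary()`.

```python
def lstm_params(input_dim: int, units: int, use_bias: bool = True) -> int:
    """Trainable parameters of one standard LSTM layer.

    Four gates, each with an input kernel (units x input_dim), a recurrent
    kernel (units x units) and, optionally, a bias vector (units).
    """
    per_gate = units * input_dim + units * units + (units if use_bias else 0)
    return 4 * per_gate


def dense_params(input_dim: int, units: int) -> int:
    """Weights plus one bias per output unit."""
    return input_dim * units + units


first = lstm_params(input_dim=1, units=100)    # 40,800
second = lstm_params(input_dim=100, units=50)  # 30,200
head = dense_params(input_dim=50, units=1)     # 51
print(first, second, head, first + second + head)  # 40800 30200 51 71051
```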

In [None]:
model_LSTM.compile(loss='mse', optimizer='adam', metrics=['mae'])

In [None]:
history_LSTM = model_LSTM.fit(x=X_train_3D, y=y_train,epochs=50, validation_split=0.1, batch_size=32)

In [None]:
test_loss, test_mae = model_LSTM.evaluate(x=X_test_3D, y=y_test)

In [None]:
print(test_loss, test_mae)

# Exercise: Hyper Parameter Tuning

In [None]:
model_LSTM = keras.models.Sequential()
model_LSTM.add(keras.layers.LSTM(units = 100, return_sequences = True,input_shape = X_train_3D.shape[1:]))
model_LSTM.add(keras.layers.LSTM(units = 50))
model_LSTM.add(keras.layers.Dense(1))
model_LSTM.summary()
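
One possible starting point for the exercise is to enumerate candidate configurations before training anything. The sketch below is only an illustration: the search space and its values are hypothetical placeholders, and it deliberately stops short of building or fitting any model.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    "units_first": [50, 100],
    "units_second": [25, 50],
    "batch_size": [16, 32],
}

# Cartesian product of all options -> one dict per candidate configuration.
names = list(search_space)
candidates = [dict(zip(names, combo)) for combo in product(*search_space.values())]

print(len(candidates))  # 8
print(candidates[0])    # {'units_first': 50, 'units_second': 25, 'batch_size': 16}
```

Each candidate could then be used to build and fit a model like the one above, comparing configurations on the validation loss rather than on the test set.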
