## Read data and basic data clean-up

In [None]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import dateutil.parser

In [None]:
PATH='../input/appliances-energy-prediction/KAG_energydata_complete.csv'
data = pd.read_csv(PATH)
data.columns = [x.lower() for x in data.columns]
data.head(5)

In [None]:
data.isnull().sum().sort_values(ascending=False)

In [None]:
data.nunique(dropna=False)

In [None]:
data.describe()

In [None]:
data.info()

----------------------------------------------------------------------------------------------------------------
### Inferences:
     1. There are 29 columns: 1 date-time column, 2 integer columns and 26 float columns
     2. One column (lights) has fewer than 10 unique values, so it can be treated as a categorical column
     3. There are no NULL values in any of the given columns
     4. The target to be predicted is appliances

## Feature Engineering

Checking for outliers and removing the top 1% of the appliances values.

In [None]:
sns.distplot(data["appliances"])

In [None]:
data = data[data['appliances'].between(data['appliances'].quantile(.0), data['appliances'].quantile(.99))]
sns.boxplot(data["appliances"],color="green")

### Adding new features to the dataset

In [None]:
data["exact_date"]=data['date'].str.split(' ').str[0]

data["hours"]=(data['date'].str.split(':').str[0].str.split(" ").str[1]).astype(str).astype(int)
data["seconds"]=((data['date'].str.split(':').str[1])).astype(str).astype(int).mul(60)

# Parse the date part with pandas, consistent with how the 'date' column is parsed further down
data["week"]=pd.to_datetime(data["exact_date"])
data["weekday"]=data['week'].dt.dayofweek.astype(int)
data["week"]=data['week'].dt.day_name()

data['log_appliances'] = np.log(data.appliances)
data['hour*lights'] = data.hours * data.lights
# Mean appliances consumption of the row's hour of day
data['hour_avg'] = data.groupby('hours')['appliances'].transform('mean')

data.head(5)

## Perform analysis & model development 

### Day wise Electricity consumption

In [None]:
dates=data["exact_date"].unique()
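arranged_day = pd.Categorical(data["exact_date"], categories=dates,ordered=True)
# Reuse data's index so the grouping key aligns row by row with the filtered frame
date_series = pd.Series(arranged_day, index=data.index)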
table = pd.pivot_table(data,values="appliances",index=date_series, aggfunc=[np.sum],fill_value=0)
table.plot(kind="bar",figsize=(20, 7))
plt.show()

### Weekend vs Weekday?

In [None]:
days=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
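arranged_day = pd.Categorical(data["week"], categories=days,ordered=True)
# Reuse data's index so the grouping key aligns row by row with the filtered frame
day_series = pd.Series(arranged_day, index=data.index)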
table = pd.pivot_table(data,index=["hours"],
               values="appliances",columns=day_series,
               aggfunc=[np.sum],fill_value=0)

fig, ax = plt.subplots(figsize=(20, 10))
ax.set_title('Heatmap: Appliances (Wh)')

heatmap = ax.pcolor(table)
plt.colorbar(heatmap)

# Centre one tick on each cell and label it with the weekday / hour
ax.set_xticks(np.arange(len(table.columns)) + 0.5)
ax.set_xticklabels(table.columns.get_level_values(-1))
ax.set_yticks(np.arange(len(table.index)) + 0.5)
ax.set_yticklabels(table.index)

ax.set_xlabel("Weekday")
ax.set_ylabel("Hour of day")
plt.show()

In [None]:
table.plot.box(figsize=(20, 7))

Weekends (Saturdays and Sundays) show higher electricity consumption than weekdays (more than 25% higher).
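
The weekend effect can also be checked numerically instead of being read off the plots. The following is only a minimal sketch: it assumes a frame with a `weekday` column (0 = Monday) and an `appliances` column like the ones built above, compares means rather than the sums used in the pivot table, and the helper name `weekend_uplift` and the toy values are illustrative, not part of the notebook.

```python
import pandas as pd

def weekend_uplift(df, day_col="weekday", value_col="appliances"):
    """Relative difference between mean weekend and mean weekday consumption."""
    is_weekend = df[day_col] >= 5          # 5 = Saturday, 6 = Sunday
    weekend_mean = df.loc[is_weekend, value_col].mean()
    weekday_mean = df.loc[~is_weekend, value_col].mean()
    return weekend_mean / weekday_mean - 1.0

# Toy data with hypothetical values, one row per day of the week
toy = pd.DataFrame({
    "weekday":    [0, 1, 2, 3, 4, 5, 6],
    "appliances": [80, 80, 80, 80, 80, 100, 100],
})
uplift = weekend_uplift(toy)
print(f"Weekend uplift: {uplift:.0%}")
```

On the real data the same call would be `weekend_uplift(data)`, provided the `weekday` column has been created.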

### Hour of the Day?

In [None]:
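# Data sets on a 30-minute and a 1-hour basis
data['date'] = pd.to_datetime(data['date'])
data = data.set_index('date')
# numeric_only=True skips the text columns (exact_date, week), which cannot be averaged
df_hour = data.resample('1H').mean(numeric_only=True)
df_30min = data.resample('30min').mean(numeric_only=True)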

In [None]:
# Qualitative predictors
# We assume a low (high) energy load when the appliances' consumption is lower (higher)
# than a given margin around the hourly average consumption. The margin depends on the
# time frequency of the data; the values below were chosen after several trials, based on
# the standard deviation of the appliances' consumption.

data['low_consum'] = (data.appliances+25<(data.hour_avg))*1
data['high_consum'] = (data.appliances+100>(data.hour_avg))*1

df_hour['low_consum'] = (df_hour.appliances+25<(df_hour.hour_avg))*1
df_hour['high_consum'] = (df_hour.appliances+25>(df_hour.hour_avg))*1

df_30min['low_consum'] = (df_30min.appliances+25<(df_30min.hour_avg))*1
df_30min['high_consum'] = (df_30min.appliances+35>(df_30min.hour_avg))*1

In [None]:
# Plot of Mean Energy Consumption per Hour of a Day

data.groupby('hours')['appliances'].mean().plot(figsize=(10,8))
plt.xlabel('Hour')
plt.ylabel('Appliances consumption in Wh')
ticks = list(range(0, 24, 1))
plt.title('Mean Energy Consumption per Hour of a Day')

plt.xticks(ticks);

High electricity consumption (>140 Wh) is observed during the evening hours, 16:00 to 20:00. During the night, from 23:00 to 6:00, the power load is below 50 Wh, meaning that most appliances are off or on standby. Between 9:00 and 13:00 the power load is above 100 Wh, and after lunch it drops below 100 Wh again. Later in the afternoon, the energy consumption ranges from 130 to 185 Wh, as family members are at home and many devices are on.

### Histogram of appliances' consumption

In [None]:
f, axes = plt.subplots(1, 2,figsize=(10,4))

sns.distplot(df_hour.appliances, hist=True, color = 'blue',hist_kws={'edgecolor':'black'},ax=axes[0])
axes[0].set_title("Appliances' consumption")
axes[0].set_xlabel('Appliances (Wh)')

sns.distplot(df_hour.log_appliances, hist=True, color = 'blue',hist_kws={'edgecolor':'black'},ax=axes[1])
axes[1].set_title("Log of appliances' consumption")
axes[1].set_xlabel('Appliances log(Wh)')

The distribution of the power load is not normal: it is asymmetric, with most of its mass on the left and a long tail to the right. For this reason we use log(power load), which is closer to a normal distribution, for further analysis.
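
The effect of the log transform can be quantified with the sample skewness. The sketch below uses synthetic log-normal values as a stand-in for the power load (an assumption made purely for illustration, not the real data) and compares the skewness before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "load" values (log-normal), standing in for appliances
load = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=5000))

skew_raw = load.skew()          # clearly positive: long right tail
skew_log = np.log(load).skew()  # close to 0: roughly symmetric
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Applied to the notebook's data, the equivalent check would be `df_hour.appliances.skew()` against `df_hour.log_appliances.skew()`.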

### Pearson Correlation among the variables

In [None]:

col = ['log_appliances', 'lights', 't1', 'rh_1', 't2', 'rh_2', 't3', 'rh_3', 't4',
       'rh_4', 't5', 'rh_5', 't6', 'rh_6', 't7', 'rh_7', 't8', 'rh_8', 't9',
       'rh_9', 't_out', 'press_mm_hg', 'rh_out', 'windspeed', 'visibility',
       'tdewpoint','hours']
corr = data[col].corr()
plt.figure(figsize = (18,18))
sns.set(font_scale=1)
sns.heatmap(corr, cbar = True, annot=True, square = True,cmap="RdYlGn", fmt = '.2f', xticklabels=col, yticklabels=col)
plt.show();

The energy consumption (log_appliances) is most strongly correlated with:
    1. Hours : 0.34
    2. Lights : 0.26
    3. T2 : 0.22
    4. T6 : 0.26

These correlations are modest. In addition, all temperature values inside the house are highly correlated with each other (> 0.8).
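
Strongly correlated pairs can be listed programmatically rather than read from the heatmap. This is a sketch under assumptions: the helper `strong_pairs`, its `threshold` parameter and the toy columns are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "t1": base,
    "t2": base + rng.normal(scale=0.1, size=500),  # nearly a copy of t1
    "windspeed": rng.normal(size=500),             # independent of both
})
result = strong_pairs(toy)
print(result)
```

With the notebook's variables, `strong_pairs(data[col])` would return the pairs above the 0.8 threshold mentioned in the text.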

### Linear dependency evaluation

In [None]:
col = ['t6','t2', 'rh_6','lights','hours','t_out','windspeed','tdewpoint']
sns.set(style="ticks", color_codes=True)
sns.pairplot(data[col])
plt.show();

Inside temperatures, outside temperatures and tdewpoint show linear relationships with one another. These features are well suited to linear regression modelling.

### Transforming categorical variables 

In [None]:
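df = data.copy()
for cat_feature in ['weekday', 'hours']:
    # Prefix the dummy columns so weekday and hour indicators do not share names
    df_hour = pd.concat([df_hour, pd.get_dummies(df_hour[cat_feature], prefix=cat_feature)], axis=1)
    df_30min = pd.concat([df_30min, pd.get_dummies(df_30min[cat_feature], prefix=cat_feature)], axis=1)
    # Extend df cumulatively; rebuilding it from data on every pass would keep only the last feature
    df = pd.concat([df, pd.get_dummies(df[cat_feature], prefix=cat_feature)], axis=1)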

Three data sets are now available, with time intervals of 10 minutes, 30 minutes and 1 hour respectively. The 1-hour data set is used for further analysis because it has less noise.
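
Why the coarser data is less noisy can be illustrated on a synthetic series: averaging over longer windows shrinks the random component while the hourly pattern stays intact. The sketch below assumes a made-up daily sine pattern with Gaussian noise; none of the values come from the real data, and `residual_std` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
idx = pd.date_range("2016-01-11", periods=6 * 24 * 7, freq="10min")  # one week at 10-minute steps

def hourly_pattern(index):
    """Noise-free load that depends only on the hour of day (synthetic)."""
    return 100 + 40 * np.sin(2 * np.pi * index.hour.to_numpy() / 24)

series = pd.Series(hourly_pattern(idx) + rng.normal(scale=30, size=len(idx)), index=idx)

def residual_std(s, rule):
    """Std of the resampled series around the noise-free pattern."""
    resampled = s.resample(rule).mean()
    return float((resampled - hourly_pattern(resampled.index)).std())

noise = {rule: residual_std(series, rule) for rule in ["10min", "30min", "60min"]}
print(noise)
```

The residual spread drops as the window grows, which matches the choice of the 1-hour data set for modelling.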

## Modelling

Trying out 6 Regression models:
    1. LinearRegression
    2. SVR
    3. RandomForestRegressor
    4. LGBMRegressor
    5. XGBRegressor
    6. CatBoostRegressor

In [None]:
feature_set = ['low_consum','high_consum','hours','t6','rh_6','lights','hour*lights',
               'tdewpoint','visibility','press_mm_hg','windspeed']

In [None]:
# to avoid warnings from StandardScaler
# (bracket assignment is needed: attribute assignment such as df_hour.hour = ... does not update a column)
df_hour['lights'] = df_hour['lights'].astype(float)
df_hour['log_appliances'] = df_hour['log_appliances'].astype(float)
df_hour['hours'] = df_hour['hours'].astype(float)
df_hour['low_consum'] = df_hour['low_consum'].astype(float)
df_hour['high_consum'] = df_hour['high_consum'].astype(float)

In [None]:
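# Creation of train/test sets: chronological split, no shuffling, since this is a time series
test_size=.2
# Drop incomplete rows first so the split index and the slices refer to the same frame
df_model = df_hour.dropna(subset=feature_set + ['log_appliances'])
test_index = int(len(df_model)*(1-test_size))

X_train, X_test = df_model[feature_set].iloc[:test_index], df_model[feature_set].iloc[test_index:]
y_train = df_model.log_appliances.iloc[:test_index]

y_test =  df_model.log_appliances.iloc[test_index:]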

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the X matrices to mean = 0 and standard deviation = 1 (scaler fitted on the training set only)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
from sklearn import linear_model

lin_model = linear_model.LinearRegression()
lin_model.fit(X_train,y_train)

In [None]:
from sklearn import svm

svr_model = svm.SVR(gamma='scale')
svr_model.fit(X_train,y_train)

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=100,random_state=1)            
rf_model.fit(X_train, y_train)

In [None]:
import xgboost as xgb
from xgboost import plot_importance
import lightgbm as lgb
from catboost import CatBoostRegressor as cbr

model_lgb = lgb.LGBMRegressor(num_leaves=41, n_estimators=200)
model_lgb.fit(X_train, y_train)

In [None]:
model_xgb = xgb.XGBRegressor(objective='reg:squarederror')
model_xgb.fit(X_train, y_train)

In [None]:
model_cbr = cbr(random_seed=242, verbose=0, early_stopping_rounds=10)
model_cbr.fit(X_train, y_train)

### Model Evaluation, Cross-validation & Selection

In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

# Function to evaluate the models
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    # The target is log(appliances), so the errors and the MAPE are measured in log space
    mape = 100 * np.mean(errors / test_labels)
    r_score = 100*r2_score(test_labels,predictions)
    accuracy = 100 - mape
    print(model,'\n')
    print('Average Error       : {:0.4f} log(Wh)'.format(np.mean(errors)))
    print('Variance score R^2  : {:0.2f}%' .format(r_score))
    print('Accuracy            : {:0.2f}%\n'.format(accuracy))

In [None]:
evaluate(lin_model, X_test, y_test)
evaluate(svr_model, X_test, y_test)
evaluate(rf_model, X_test, y_test)
evaluate(model_lgb, X_test, y_test)
evaluate(model_xgb, X_test, y_test)
evaluate(model_cbr, X_test, y_test)
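
Because the models are trained on `log_appliances`, the errors reported by `evaluate` are in log space. One possible way to express an error in Wh is to exponentiate both the targets and the predictions first. The helper `mae_in_wh` below is a hypothetical illustration with toy numbers, not part of the notebook; note also that for the hourly data `log_appliances` is an average of logs, so the back-transformed value corresponds to a geometric rather than an arithmetic mean.

```python
import numpy as np

def mae_in_wh(y_true_log, y_pred_log):
    """Mean absolute error in Wh, given targets and predictions in log(Wh)."""
    y_true = np.exp(np.asarray(y_true_log, dtype=float))
    y_pred = np.exp(np.asarray(y_pred_log, dtype=float))
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values: true loads of 50 and 100 Wh, predictions of 60 and 90 Wh
example = mae_in_wh(np.log([50.0, 100.0]), np.log([60.0, 90.0]))
print(round(example, 2))
```

With the fitted models this would be called as, for example, `mae_in_wh(y_test, rf_model.predict(X_test))`.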

In [None]:
# Instead of KFold, TimeSeriesSplit (10 splits) is used because the data is a time series
cv = TimeSeriesSplit(n_splits = 10)

cv_models = {
    'Linear Model': lin_model,
    'SVR Model': svr_model,
    'Random Forest Model': rf_model,
    'LGBMRegressor Model': model_lgb,
    'XGBRegressor Model': model_xgb,
    'CatBoostRegressor Model': model_cbr,
}

for name, model in cv_models.items():
    print(name + ':')
    # neg_mean_absolute_error scores are negative MAE values in log(Wh), so 100 + mean equals 100 - MAE
    scores = cross_val_score(model, X_train, y_train, cv=cv,scoring='neg_mean_absolute_error')
    print("Accuracy (100 - MAE): %0.2f (+/- %0.2f)" % (100+scores.mean(), scores.std() * 2))
    scores = cross_val_score(model, X_train, y_train, cv=cv,scoring='r2')
    print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The Random Forest model has the best accuracy and CatBoost has the highest R^2.
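
The reason for preferring TimeSeriesSplit over KFold is that every fold trains only on observations that precede its test window. The short sketch below shows this on a toy array of 22 ordered samples (the array and the variable names are illustrative).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(22).reshape(-1, 1)   # 22 time-ordered samples (toy data)
splits = list(TimeSeriesSplit(n_splits=10).split(X_demo))

for train_idx, test_idx in splits[:3]:
    print("train up to", train_idx.max(), "-> test", test_idx.min(), "to", test_idx.max())

# In every fold, all training indices precede all test indices
no_lookahead = all(tr.max() < te.min() for tr, te in splits)
print(no_lookahead)
```

The training window grows from fold to fold, so the early folds are fitted on much less data than the later ones, which is worth keeping in mind when reading the spread of the scores.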

## Model performance on test data

In [None]:
y1_pred = lin_model.predict(X_test)
y2_pred = svr_model.predict(X_test)
y3_pred = rf_model.predict(X_test)
y4_pred = model_lgb.predict(X_test)
y5_pred = model_xgb.predict(X_test)
y6_pred = model_cbr.predict(X_test)

In [None]:
fig, axs = plt.subplots(1, 6, figsize=(16,4), sharey=True)
axs[0].scatter(y1_pred,y_test-y1_pred)
axs[0].set_title('Linear Regression')
axs[1].scatter(y2_pred,y_test-y2_pred)
axs[1].set_title('SVR')
axs[2].scatter(y3_pred,y_test-y3_pred)
axs[2].set_title('Random Forest')
axs[3].scatter(y4_pred,y_test-y4_pred)
axs[3].set_title('LGB')
axs[4].scatter(y5_pred,y_test-y5_pred)
axs[4].set_title('XGB')
axs[5].scatter(y6_pred,y_test-y6_pred)
axs[5].set_title('CBR')
fig.text(0.06, 0.5, 'Residuals', ha='center', va='center', rotation='vertical')
fig.text(0.5, 0.01,'Fitted Values', ha='center', va='center')

The RF, LGB, XGB and CBR models appear to have random residuals with a mean close to 0 and a constant standard deviation.
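
The visual impression can be backed by numbers: the mean of the residuals, and their standard deviation within bins of the fitted values. The helper `residual_summary` and the synthetic data below are assumptions made for illustration only.

```python
import numpy as np

def residual_summary(y_true, y_pred, n_bins=3):
    """Overall mean of the residuals and their std within bins of the fitted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    order = np.argsort(y_pred)                      # sort residuals by fitted value
    bins = np.array_split(residuals[order], n_bins)
    return float(residuals.mean()), [float(b.std()) for b in bins]

rng = np.random.default_rng(3)
fitted = rng.uniform(3, 6, size=3000)                  # synthetic fitted values
observed = fitted + rng.normal(scale=0.3, size=3000)   # constant-variance noise
mean_res, bin_stds = residual_summary(observed, fitted)
print(mean_res, bin_stds)
```

A mean near 0 and similar values across `bin_stds` correspond to the pattern described above; for a model from this notebook the call would be, for example, `residual_summary(y_test, y3_pred)`.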

In [None]:
fig, axs = plt.subplots(1, 6, figsize=(16,4), sharey=True)
axs[0].scatter(y_test,y1_pred)
axs[0].set_title('Linear Regression')
axs[1].scatter(y_test,y2_pred)
axs[1].set_title('SVR')
axs[2].scatter(y_test, y3_pred)
axs[2].set_title('Random Forest')
axs[3].scatter(y_test, y4_pred)
axs[3].set_title('LGB')
axs[4].scatter(y_test, y5_pred)
axs[4].set_title('XGB')
axs[5].scatter(y_test, y6_pred)
axs[5].set_title('CBR')
fig.text(0.06, 0.5, 'Predictions', ha='center', va='center', rotation='vertical')
fig.text(0.5, 0.01,'True Values', ha='center', va='center')

The XGB model appears to be the one that best predicts both high and low values of energy consumption.

### Prediction of each model vs Test data

In [None]:
fig = plt.figure(figsize=(20,8))
plt.plot(y_test[:100].values,label='Target value',color='b')
plt.plot(y1_pred[:100],label='Linear Prediction ', linestyle='--', color='y')
plt.plot(y2_pred[:100],label='SVR Prediction ', linestyle='--', color='g')
plt.plot(y3_pred[:100],label='Random Forest', linestyle='--', color='r')
plt.plot(y4_pred[:100],label='LGB', linestyle='--', color='black')
plt.plot(y5_pred[:100],label='XGB', linestyle='--', color='orange')
plt.plot(y6_pred[:100],label='CBR', linestyle='--', color='purple')

plt.legend(loc=1)

XGB predicts the highs and lows better than the other models. Overall, Random Forest appears to fit the test data most closely.

### Parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

parameters = {
    'max_depth': [800,1000,1500],
    'min_samples_leaf': [5,8,10],
    'min_samples_split': [5,10,15],
    'n_estimators': [40,60,100],
    'random_state':[1]    
}

# Reuse the TimeSeriesSplit object `cv` defined above
grid_model = GridSearchCV(RandomForestRegressor(), parameters, cv=cv)

grid_model = grid_model.fit(X_train, y_train)
print(grid_model.best_estimator_)
print(grid_model.best_params_)

In [None]:
best_rf_model = grid_model.best_estimator_
# evaluate() only prints its metrics, so there is no return value to store
evaluate(best_rf_model, X_test, y_test)
y_best_pred = best_rf_model.predict(X_test)

#### The variance score (R^2) of the model improved from 65% to 68.26%.

### Final predictions on test set based on best RF model

In [None]:
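# Approximate 95% prediction interval, assuming normally distributed errors
sum_errs = np.sum((y_test - y_best_pred)**2)
# Residual standard deviation over the test samples that the errors come from
stdev = np.sqrt(1/(len(y_test)-2) * sum_errs)

interval = 1.96 * stdev # half-width of the 95% interval
lower, upper = y_best_pred - interval, y_best_pred + interval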

In [None]:
fig = plt.figure(figsize=(20,8))
plt.plot(y_test[:100].values,label='Target value',color='b')
#plt.plot(y_pred,label='Best Tree Prediction ', linestyle='-', color='b')
plt.plot(lower[:100],label='Lower Limit ', linestyle='--', color='r')
plt.plot(upper[:100],label='Upper Limit ', linestyle='--', color='y')
plt.title('Predicted Lower Limit and Upper Limit of best RF model')

plt.legend(loc=1)
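
Whether such an interval is well calibrated can be checked by counting how many true values fall inside it. The sketch below assumes normally distributed errors and uses synthetic predictions; `interval_coverage` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of true values that lie inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(inside.mean())

rng = np.random.default_rng(4)
pred = rng.uniform(3, 6, size=5000)                  # synthetic predictions
truth = pred + rng.normal(scale=0.4, size=5000)      # errors with known std 0.4
half_width = 1.96 * 0.4                              # 95% interval for normal errors
coverage = interval_coverage(truth, pred - half_width, pred + half_width)
print(f"coverage: {coverage:.3f}")
```

For the notebook's variables the call would be `interval_coverage(y_test, lower, upper)`; a value far from 0.95 would suggest that the normality or constant-variance assumption does not hold.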

### Factors influencing energy consumption

In [None]:
factor_list = feature_set
importances = list(rf_model.feature_importances_)

# Pair each feature name with its importance, most important first
factor_importances = [(factor, round(importance, 2))
    for factor, importance in zip(factor_list, importances)]
factor_importances = sorted(factor_importances, key = lambda pair: pair[1], reverse = True)

for pair in factor_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

In [None]:
x_values = list(range(len(importances)))
plt.bar(x_values, importances, orientation = 'vertical')
plt.xticks(x_values, factor_list, rotation='vertical')
plt.ylabel('Importance'); plt.xlabel('Variable')
plt.title('Factors influencing energy consumption')


#### Hour of the day is the most important factor influencing energy consumption.

## Observations and motivation for next steps

### Observations:
    1. Hour of the day is the most important factor influencing energy consumption
    2. XGB predicts the highs and lows better than the other models
    3. Overall, Random Forest appears to fit the test data most closely
    4. High electricity consumption (>140 Wh) is observed during the evening hours, 16:00 to 20:00
    5. Weekends (Saturdays and Sundays) show higher electricity consumption than weekdays (more than 25% higher)
    6. Although lights appeared to be among the features most correlated with appliance electricity consumption, they have very low importance as a feature

### Motivation for future steps:
    1. The available data covers only 1 house; analysing several houses would yield important additional information
    2. Further information, such as the house geometry and the number of people in the house over time, may give more insights
    3. Data needs to be captured over several months to bring in seasonal effects on energy consumption
    4. The positioning and quality of the sensors can be analysed to improve data capture
    5. The predictions of appliances' energy use could probably be better if the weather station were closer to the house
    6. Noise and CO2 levels in the rooms could also be important data for improving predictions
