# Prediction of Hourly Energy Consumption of Turkey

Predicting power demand accurately can create substantial value for a country, a city, or even a single household. Stakeholders can adjust their power production to reduce costs, or purchase the right amount of energy if they meet their needs from external sources. In some cases, such as tendering on a daily energy exchange, accurate forecasts may also generate additional profit.

In this notebook I introduce the basics of training a machine learning model that predicts Turkey's power consumption for the next 24 hours using ensemble methods.

## Importing and Processing the Data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/hourly-power-consumption-of-turkey-20162020/RealTimeConsumption-01012016-04082020.csv', encoding='cp1254')

In [None]:
df.head()

In [None]:
df['Date'] =pd.to_datetime(df['Date'] +' '+ df['Hour'], format='%d.%m.%Y %H:%M')

Let's check whether any entries are missing from the time series "Date" column:

In [None]:
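# The data is hourly, so compare against a complete hourly range (a daily range would only test the midnight rows)
pd.date_range(start='2016-01-01 00:00:00', end='2020-03-24 00:00:00', freq='h').difference(df.Date)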
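
A complementary check, as a minimal sketch: the helper below (`find_gaps_and_duplicates` is a hypothetical name, not part of this notebook) compares a set of timestamps against a complete hourly range and also reports repeated timestamps, which `difference` alone would not reveal. The toy timestamps are made up for illustration.

```python
import pandas as pd

def find_gaps_and_duplicates(timestamps, freq="h"):
    """Return (missing, duplicated) timestamps relative to a complete range."""
    ts = pd.DatetimeIndex(timestamps)
    # Every timestamp we would expect between the first and last observation
    expected = pd.date_range(start=ts.min(), end=ts.max(), freq=freq)
    missing = expected.difference(ts)
    # Timestamps that occur more than once
    duplicated = ts[ts.duplicated()].unique()
    return missing, duplicated

# Illustrative data: 02:00 is missing and 01:00 appears twice
toy = pd.to_datetime([
    "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 01:00",
    "2016-01-01 03:00",
])
missing, duplicated = find_gaps_and_duplicates(toy)
print(missing)
print(duplicated)
```

On the real data the call would be `find_gaps_and_duplicates(df['Date'])`, assuming the "Date" column has already been parsed to datetimes.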

In [None]:
df = df.drop('Hour', axis = 1)

In [None]:
df.head()

In [None]:
df['Consumption (MWh)'] = df['Consumption (MWh)'].str.replace(',','')
df['Consumption (MWh)'] = pd.to_numeric(df['Consumption (MWh)'])

In [None]:
df = df.sort_values('Date')

In [None]:
df.head()

In [None]:
print(df['Date'].min(), df['Date'].max())

For the purposes of this notebook, I exclude the Covid period by cutting the data off at roughly the point the outbreak started in Turkey:

In [None]:
df = df.set_index('Date').loc[:'2020-03-24 23:00:00', :].reset_index()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.set_index('Date').plot(style='.', figsize=(15,5), title='Consumption vs. Date')
plt.show()

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Autocorrelation for the first 720 hourly lags (30 days)
fig, ax = plt.subplots(figsize=(30, 5))
plot_acf(df.set_index('Date')['Consumption (MWh)'], lags=720, ax=ax)
plt.show()

# Partial autocorrelation for the first 205 hourly lags
sns.set(style='whitegrid')
fig, ax = plt.subplots(figsize=(35, 5))
plot_pacf(df.set_index('Date')['Consumption (MWh)'], lags=205, ax=ax)
plt.xticks(np.arange(0, 210, step=5))
plt.show()

In [None]:
plt.figure(figsize = (15, 7))
ax = sns.boxplot(x=df['Date'].dt.hour, y="Consumption (MWh)", data=df)
plt.title('Hourly Consumption', fontsize=11)

In [None]:
df['Consumption (MWh)'] = np.log1p(df['Consumption (MWh)'])
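
The target is now on a log scale, so predictions have to be mapped back with `np.expm1` before they can be read as MWh. A small sketch of the round trip, using made-up consumption values:

```python
import numpy as np

consumption_mwh = np.array([0.0, 25000.0, 40000.0])  # illustrative values only
log_values = np.log1p(consumption_mwh)  # log(1 + x), defined even at x = 0
restored = np.expm1(log_values)         # exp(y) - 1, the inverse of log1p

print(log_values)
print(restored)
```

This is why the evaluation cells later report both "Log" and "Non-Log" metrics: the latter apply `np.expm1` to the targets and the predictions first.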

## Basic Feature Engineering

### Lag Features

In [None]:
# Plain lags get their own "lag_" names so that no column is overwritten
df['lag_t38'] = df['Consumption (MWh)'].shift(38)
df['lag_t41'] = df['Consumption (MWh)'].shift(41)
df['lag_t48'] = df['Consumption (MWh)'].shift(48)
df['lag_t72'] = df['Consumption (MWh)'].shift(72)
df['lag_t168'] = df['Consumption (MWh)'].shift(168)

In [None]:
df

### Rolling Features

In [None]:
# All rolling statistics are computed on the series shifted by 38 hours
shifted = df['Consumption (MWh)'].shift(38)
df['rolling_mean_t38'] = shifted.rolling(12).mean()
df['rolling_mean_t50'] = shifted.rolling(24).mean()
df['rolling_mean_t62'] = shifted.rolling(48).mean()
df['rolling_median_t38'] = shifted.rolling(12).median()
df['rolling_median_t50'] = shifted.rolling(24).median()
df['rolling_median_t62'] = shifted.rolling(48).median()
df['rolling_std_t38'] = shifted.rolling(12).std()
df['rolling_std_t50'] = shifted.rolling(24).std()
df['rolling_std_t62'] = shifted.rolling(48).std()
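
To see what the lag and rolling features contain, here is a toy sketch on a six-value series (the numbers and the shift of 2 are arbitrary, chosen only to keep the output readable). Shifting before rolling means each row's statistic is built only from values that lie at least `shift` steps in the past.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

lag_2 = s.shift(2)                        # value observed 2 steps earlier
roll_mean = s.shift(2).rolling(2).mean()  # mean of the values 2 and 3 steps earlier

demo = pd.DataFrame({"value": s, "lag_2": lag_2, "rolling_mean": roll_mean})
print(demo)
```

The leading rows are NaN because there is not enough history yet, which is why the notebook drops rows with missing values after building these features.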

In [None]:
df

In [None]:
df = df.dropna(axis=0, how='any').reset_index(drop=True)

### Time Features

In [None]:
df['hourofday'] = df['Date'].dt.hour
df['quarter'] = df['Date'].dt.quarter
df['month'] = df['Date'].dt.month
df['year'] = df['Date'].dt.year
df['dayofyear'] = df['Date'].dt.dayofyear
df['dayofmonth'] = df['Date'].dt.day
df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)  # .dt.weekofyear was removed in pandas 2.0
df['days_in_month'] = df['Date'].dt.days_in_month

In [None]:
df.head()

In [None]:
df.tail()

### Train-Test Split

In [None]:
train_start = '01-Jan-2016'
train_end = '31-Dec-2019'
test_start = '01-Jan-2020'
test_end = '14-Mar-2020'
# Label-based slicing on the datetime index includes both end points
df_train = df.set_index('Date').loc[train_start:train_end, :].reset_index()
df_test = df.set_index('Date').loc[test_start:test_end, :].reset_index()

In [None]:
df_test[['Date','Consumption (MWh)']].set_index('Date').rename(columns={'Consumption (MWh)': 'TEST SET'})\
        .join(df_train[['Date','Consumption (MWh)']].set_index('Date')\
              .rename(columns={'Consumption (MWh)': 'TRAINING SET'}),how='outer').plot(figsize=(25,5), title='Consumption Amount (MWh, log1p scale)', style='.')
plt.ylim(9.8, 10.8)
plt.show()

In [None]:
df_train.to_csv('energy_cons_train.csv', index = None) #Keeping the train and test data for another notebook :)
df_test.to_csv('energy_cons_test.csv', index = None)

In [None]:
df_train = df_train.drop(['Date'], axis=1)
df_test = df_test.drop(['Date'], axis=1)

## The Model

In [None]:
def percentage_error(actual, predicted):
    res = np.empty(actual.shape)
    for j in range(actual.shape[0]):
        if actual[j] != 0:
            res[j] = (actual[j] - predicted[j]) / actual[j]
        else:
            res[j] = predicted[j] / np.mean(actual)
    return res

def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs(percentage_error(np.asarray(y_true), np.asarray(y_pred)))) * 100
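
The same metric can be written without the Python loop. The sketch below (`mape_vectorized` is a hypothetical name) keeps the same fallback for zero actuals and uses made-up numbers to show how MAPE relates to the "% Success" figure printed later in this notebook.

```python
import numpy as np

def mape_vectorized(y_true, y_pred):
    """MAPE in percent; zero actuals fall back to predicted / mean(actual)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    errors = np.empty_like(y_true)
    # Relative error wherever the actual value is non-zero
    errors[nonzero] = (y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero]
    # Same fallback as the loop version for zero actuals
    errors[~nonzero] = y_pred[~nonzero] / np.mean(y_true)
    return np.mean(np.abs(errors)) * 100

y_true = [100.0, 200.0, 400.0]  # illustrative values only
y_pred = [110.0, 180.0, 400.0]
score = mape_vectorized(y_true, y_pred)
print(score)        # MAPE in percent
print(100 - score)  # the "% Success" figure
```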

In [None]:
print(df_train.shape, df_test.shape) 

In [None]:
y_train = df_train['Consumption (MWh)'].values
X_train = df_train.drop('Consumption (MWh)', axis=1).values

y_test = df_test['Consumption (MWh)'].values
X_test = df_test.drop('Consumption (MWh)', axis=1).values

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
#!pip install lightgbm
from lightgbm import LGBMRegressor

In [None]:
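from lightgbm import early_stopping, log_evaluation
# Note: with max_depth=3 a tree has at most 2**3 = 8 leaves, so num_leaves=36 is never reached
model_lgbm = LGBMRegressor(objective='rmse', n_estimators=3000, learning_rate=0.01, num_leaves=36, min_child_samples=15,
                           n_jobs=-1, random_state=None, max_depth=3, reg_lambda=0.0, reg_alpha=0.0, min_split_gain=0.0)
eval_set_ALLRESTS = [(X_train, y_train), (X_test, y_test)]
# Recent LightGBM releases take early stopping and logging as callbacks instead of fit() arguments
model_lgbm.fit(X_train, y_train, eval_set=eval_set_ALLRESTS, eval_metric='rmse',
               callbacks=[early_stopping(stopping_rounds=15), log_evaluation(period=20)])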

In [None]:
y_train_lgbm = model_lgbm.predict(X_train)
print("Train set RMSE (Log): " + str(np.sqrt(mean_squared_error(y_train_lgbm, y_train))))
print("Train set MAPE (Log): " + str(mean_absolute_percentage_error(y_train, y_train_lgbm)))
print("Train set RMSE (Non-Log): " + str(np.sqrt(mean_squared_error(np.expm1(y_train_lgbm), np.expm1(y_train)))))
print("Train set MAPE (Non-Log): " + str(mean_absolute_percentage_error(np.expm1(y_train), np.expm1(y_train_lgbm))))
print("% Success (Non-Log): " + str(100 - mean_absolute_percentage_error(np.expm1(y_train), np.expm1(y_train_lgbm))))

In [None]:
y_test_lgbm = model_lgbm.predict(X_test)
print("Validation set RMSE (Log): " + str(np.sqrt(mean_squared_error(y_test_lgbm, y_test))))
print("Validation set MAPE (Log): " + str(mean_absolute_percentage_error(y_test, y_test_lgbm)))
print("Validation set RMSE (Non-Log): " + str(np.sqrt(mean_squared_error(np.expm1(y_test_lgbm), np.expm1(y_test)))))
print("Validation set MAPE (Non-Log): " + str(mean_absolute_percentage_error(np.expm1(y_test), np.expm1(y_test_lgbm))))
print("% Success (Non-Log): " + str(100 - mean_absolute_percentage_error(np.expm1(y_test), np.expm1(y_test_lgbm))))

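Measured as 100 - MAPE on the original scale, the model scores ~96.2% on the training set and ~96.7% on the test set. Not bad, is it? Keep in mind that the test set was also used for early stopping, so its score is likely somewhat optimistic.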

In [None]:
# Retrieve the RMSE recorded at each boosting iteration
results = model_lgbm.evals_result_
epochs = len(results['training']['rmse'])
x_axis = range(0, epochs)
# Plot RMSE for the training and validation sets
fig, ax = plt.subplots(figsize=(17, 8))
ax.plot(x_axis, results['training']['rmse'], label='Train')
ax.plot(x_axis, results['valid_1']['rmse'], label='Validation')
ax.legend()
ax.set_ylabel('RMSE')
ax.set_xlabel('# of iterations (or # of estimators)')
ax.set_title('LGBM RMSE')
plt.show()

In [None]:
# Create a pd.Series of features importances
importances = pd.Series(data=model_lgbm.feature_importances_,
                        index= df_train.drop('Consumption (MWh)', axis=1).columns)

# Sort importances
importances_sorted = importances.sort_values()
plt.figure(figsize=(12,20))
# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightblue')
plt.title('Feature Importances')
plt.show()

This code is for educational purposes only. An industry-level, production-ready application requires elaborate feature engineering and iterative hyperparameter tuning. Time-series-specific cross-validation would also help in understanding how well the model generalizes to unseen future data. In addition, many domain-specific or fundamental features can strongly affect model performance, for example: hourly weather conditions, holidays and special days, features describing energy-intensive factories, and sunrise and sunset times.

A good exercise is to try these and other features, keep the ones most correlated with the target, tune the hyperparameters, and finally try other models such as Random Forest, XGBoost, NGBoost, Prophet, or DNNs (LSTM, etc.).
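
The time-series-specific cross-validation mentioned above could be sketched with scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and validates on later ones. The 12-row array is only a stand-in for the chronologically sorted feature matrix, and the number of splits is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # stand-in for chronologically sorted features
tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, valid_idx in tscv.split(X_demo):
    # Each fold trains on an expanding window and validates on the block that follows it
    folds.append((train_idx, valid_idx))
    print("train:", train_idx, "validate:", valid_idx)
```

In a real run, the model would be refit on each training window and scored on the matching validation block, giving several out-of-sample estimates instead of one.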