# Prediction of Total Energy Consumption

In this case study we predict the total electricity load (consumption) from features such as the power generated by different sources and the weather conditions.

## Understanding the data features:

1) `generation biomass`: power generated by biomass

2) `generation fossil brown coal/lignite`: power generated by fossil brown coal/lignite

3) `generation fossil gas`: power generated by fossil gas

4) `generation fossil hard coal`: power generated by fossil hard coal

5) `generation fossil oil`: power generated by fossil oil

6) `generation hydro pumped storage consumption`: power generated by pumped storage consumption (this is used as an emergency power resource)

7) `generation hydro run-of-river and poundage`: power generated by run-of-river and poundage hydro plants

8) `generation hydro water reservoir`: power generated by water reservoirs

9) `generation nuclear`: power generated by nuclear energy

10) `generation other`: power generated by other sources

11) `generation other renewable`: power generated by renewable sources other than those listed in the dataset

12) `generation solar`: power generated by solar energy

13) `generation waste`: power generated by waste

14) `generation wind onshore`: power generated by onshore wind

15) `total load actual`: __This is the dependent variable. It tells us the total load consumption.__

16) `temp`: temperature of the area when the load consumption was recorded

17) `pressure`: pressure of the area when the load consumption was recorded

18) `humidity`: humidity of the area when the load consumption was recorded

19) `wind_speed`: wind speed of the area when the load consumption was recorded

20) `wind_deg`: wind direction of the area when the load consumption was recorded

21) `rain_1h`: intensity of rainfall

22) `snow_3h`: intensity of snowfall; it takes 4 distinct values

23) `weather_id`: identifier of the weather condition; it takes 23 distinct values

24) `weather_main`: the main weather condition, i.e. whether it was clear, cloudy, or raining when the load was recorded

25) `weather_description`: a more detailed description of the weather, such as whether it was overcast, raining, or sunny

26) `time`: date and time when the load was recorded


# Importing Libraries

In [None]:
# suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# 'Pandas' is used for data manipulation and analysis
import pandas as pd 

# 'Numpy' is used for mathematical operations on large, multi-dimensional arrays and matrices
import numpy as np

# 'Matplotlib' is a data visualization library for 2D and 3D plots, built on numpy
import matplotlib.pyplot as plt

# 'Seaborn' is based on matplotlib; used for plotting statistical graphics
import seaborn as sns

# 'Scikit-learn' (sklearn) emphasizes various regression, classification and clustering algorithms
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# 'Statsmodels' is used to build and analyze various statistical models
import statsmodels
import statsmodels.api as sm
from statsmodels.tools.eval_measures import rmse
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 'SciPy' is used to perform scientific computations
from scipy.stats import shapiro
from scipy import stats

# import functions to perform feature selection
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

#import functions for time series
import itertools
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Reading the dataframe

In [None]:
#reading the data
df_energy=pd.read_csv("../input/energy-dataset/energy_dataset.csv")

In [None]:
#displaying the first five records
df_energy.head()

# Understanding the dataset

In [None]:
#"df.shape" gives the number of rows and columns in the dataset
df_energy.shape

There are 35064 rows and 37 columns

In [None]:
#understanding the data types and null values in each column
df_energy.info()

1) From the output above we can see that the datatype of the `time` column is object, whereas it should be in datetime format.

2) The column "generation hydro pumped storage aggregated" contains only null values. There are also missing values in other columns, which need to be handled.

In [None]:
#check whether the other columns are really of float type: print the number of unique values and the datatype of each column
for i in df_energy.columns:
    print(i,"--->",df_energy[i].nunique(),"--->",df_energy[i].dtypes)

We see that the columns "generation fossil coal-derived gas", "generation fossil oil shale", "generation fossil peat", "generation geothermal", "generation marine" and "generation wind offshore" each have just one distinct value, so they are unlikely to carry any information for prediction. Let us explore them further.

In [None]:
#checking the values in the columns 'generation fossil coal-derived gas','generation fossil oil shale',
#'generation fossil peat','generation geothermal','generation marine','generation wind offshore'
cols=['generation fossil coal-derived gas','generation fossil oil shale','generation fossil peat',
      'generation geothermal','generation marine','generation wind offshore']
for values in cols:
    print(df_energy[values].unique())

We can see from the above output that these columns do not have any values other than 0. So we delete these columns.
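
The same check can be automated instead of listing the columns by hand. The sketch below is a minimal example on a made-up dataframe (`toy` and its values are illustrative and not taken from the dataset); it flags every column whose non-missing values are all identical.

```python
import pandas as pd

# toy dataframe standing in for df_energy (hypothetical values)
toy = pd.DataFrame({
    "generation solar": [10.0, 25.0, 0.0, 40.0],
    "generation marine": [0.0, 0.0, 0.0, 0.0],
    "generation geothermal": [0.0, None, 0.0, 0.0],
})

# nunique() ignores missing values by default,
# so a column holding only 0 and NaN also counts as constant
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)

# drop all constant columns in one step
toy_reduced = toy.drop(columns=constant_cols)
print(toy_reduced.columns.tolist())
```

On the real data the list comprehension would run over `df_energy.columns`, and the result could be passed straight to `drop()`.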

In [None]:
#deleting the columns above
df_energy=df_energy.drop(['generation fossil coal-derived gas','generation fossil oil shale','generation fossil peat','generation geothermal','generation marine','generation wind offshore'],axis=1)

In [None]:
#dropping the column 'generation hydro pumped storage aggregated' as there are no values in it
df_energy=df_energy.drop(['generation hydro pumped storage aggregated'],axis=1)

In [None]:
#conversion of datatypes of columns
cols=['rain_1h','snow_3h','weather_description','weather_main']
df_energy[cols]=df_energy[cols].astype(object)

In [None]:
#Changing the datatype of time column
df_energy[['Date','Time']]=df_energy['time'].str.split(" ",n=1,expand=True)
df_energy['Date']=pd.to_datetime(df_energy['Date'])
df_energy[['Time','Spare']]=df_energy['Time'].str.split("+",n=1,expand=True)


In [None]:
df_energy=df_energy.drop(["Spare","time"],axis=1)
df_energy['Time']=pd.to_datetime(df_energy['Time'],format='%H:%M:%S')
df_energy['Time']=df_energy['Time'].dt.time
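
As an alternative to the two string splits above, the timestamp can be parsed in a single pass. This is a sketch under the assumption that the `time` strings look like `2015-01-01 00:00:00+01:00` (a date, a space, a clock time and a UTC offset), which is what splitting on `" "` and `"+"` suggests; the sample values and the names `raw`, `local` and `parsed` are made up. Like the cells above, it discards the offset and keeps the local clock time, and it does not handle offsets written with `-`.

```python
import pandas as pd

# made-up sample values in the assumed format
raw = pd.Series(["2015-01-01 00:00:00+01:00", "2015-06-30 23:00:00+02:00"])

# keep everything before the '+' and parse it as a naive (local) timestamp
local = pd.to_datetime(raw.str.split("+", n=1).str[0], format="%Y-%m-%d %H:%M:%S")

parsed = pd.DataFrame({
    "Date": local.dt.normalize(),  # midnight of each day, still datetime64
    "Time": local.dt.time,         # datetime.time objects, as in the cells above
    "Hour": local.dt.hour,         # numeric hour of the day
})
print(parsed)
```

A numeric `Hour` column is often easier to feed to a model than `datetime.time` objects, which pandas stores with the object datatype.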

In [None]:
#Finally checking the columns and the datatype of all the columns 
df_energy.info()

We can see that the columns have been handled and the datatype of time column has been changed.

Now our dataset is ready for doing EDA

# Exploratory Data Analysis

In [None]:
#sns.distplot(df_energy['generation biomass'])
sns.set_color_codes()
sns.distplot(df_energy['generation biomass'], color="b")
plt.show()

In [None]:
#creating a new variable fossil and adding up all the power generated from fossil
fossil=df_energy['generation fossil brown coal/lignite']+df_energy['generation fossil gas']+df_energy['generation fossil hard coal']+df_energy['generation fossil oil']
sns.distplot(fossil, color="b")
plt.show()

In [None]:
#creating a variable renewable by adding up all the power generated from renewable sources of energy
renewable=df_energy['generation hydro run-of-river and poundage']+df_energy['generation hydro water reservoir']+df_energy['generation hydro pumped storage consumption']+df_energy['generation wind onshore']+df_energy['generation other renewable']+df_energy['generation solar']
sns.distplot(renewable, color="b")
plt.show()

In [None]:
sns.distplot(df_energy['generation other'])
plt.show()

In [None]:
sns.distplot(df_energy['total load actual'])
plt.show()

In [None]:
df_energy['total load actual'].skew()

In [None]:
sns.distplot(df_energy['temp'])
plt.show()

In [None]:
sns.distplot(df_energy['humidity'])
plt.show()

From the graphs plotted above we can see that most of the variables are approximately normally distributed. The new variable renewable is slightly right-skewed and the feature humidity is slightly left-skewed.

In [None]:
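# pass the column as 'x' explicitly; newer seaborn versions no longer treat the first positional argument as 'x'
sns.boxplot(x=df_energy['pressure'])
plt.show()

In [None]:
sns.boxplot(x=df_energy['temp'])

In [None]:
sns.boxplot(x=df_energy['wind_speed'])

In [None]:
sns.boxplot(x=df_energy['wind_deg'])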

We can see that there are many outliers in the pressure and temperature columns. In the cells below we handle outliers by winsorizing the pressure and wind_speed columns.

In [None]:
from scipy.stats.mstats import winsorize
df_energy['pressure']=winsorize(df_energy['pressure'],(0.1,0.1))


In [None]:
df_energy['wind_speed']=winsorize(df_energy['wind_speed'],(0.01,0.1))
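
To see what `winsorize` does to a column, here is a small sketch on made-up numbers (the array `values` is illustrative only). The two limits are the fractions of points to cap at the low end and at the high end; capped points are replaced by the nearest value that is kept.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# ten made-up observations with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=(0.1, 0.1): the lowest 10% and the highest 10% of the points are capped
capped = winsorize(values, limits=(0.1, 0.1))
print(capped.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]

# the original array is left untouched because inplace defaults to False
print(values.tolist())
```

With limits of `(0.1, 0.1)` a fifth of the observations are altered, so it is worth checking that the chosen limits match the share of points that really look like outliers in the boxplot.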

Checking the columns after handling the outliers

In [None]:
sns.boxplot(x=df_energy['pressure'])
plt.show()

In [None]:
sns.boxplot(x=df_energy['wind_speed'])
plt.show()

# Checking for the missing values

In [None]:
missing=df_energy.isnull().sum()
missing_percent=(df_energy.isna().mean())*100
pd.concat([missing,missing_percent],axis=1,keys=["missing","missing_percent"])

In [None]:
#making a dataframe of all missing values
df1=df_energy[df_energy.isnull().any(axis=1)]

In [None]:
#plotting a swarmplot of all missing values with respect to date
sns.swarmplot(x='Date', data=df1)
plt.xticks(rotation=60)
plt.title('Missing values with respect to time')
plt.show()

We can see that many of the missing values occur at the start of the dataframe.

In [None]:
#interpolating the missing values
df_energy.interpolate(method='linear', limit_direction='forward', inplace=True, axis=0)

In [None]:
#Checking the dataframe after handling the missing values
df_energy.isnull().sum()
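
Linear interpolation with `limit_direction='forward'` fills each gap from the values around it, but it cannot fill a gap at the very start of a column because there is no earlier value to start from. The sketch below shows this on a made-up series (`s` and its values are illustrative); the `bfill()` step is one possible way to cover such leading gaps, not something the notebook does.

```python
import numpy as np
import pandas as pd

# made-up series with a gap at the start, in the middle and at the end
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

filled = s.interpolate(method="linear", limit_direction="forward")
print(filled.tolist())  # the first entry is still NaN

# a backward fill is one way to cover gaps that remain at the very start
filled_all = filled.bfill()
print(filled_all.tolist())
```

This is why the `isnull().sum()` check after interpolating is useful: any column that starts with missing values would still show a non-zero count.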

In [None]:
#plotting to see if there is any other missing value in our dataset
#plt.figure(figsize=...) sets the figure size; assigning to plt.figsize has no effect
plt.figure(figsize=(15,10))
sns.heatmap(df_energy.isnull(),cbar=False)
plt.show()

So we have handled the missing values. Now our dataset is ready to be used for building a model.

In [None]:
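#Finding the correlation between the numeric variables
#(non-numeric columns are excluded explicitly, since recent pandas versions raise an error on them)
df_energy.select_dtypes(include=np.number).corr()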


In [None]:
#plotting the correlation between variables in which correlation is high
df5=df_energy.select_dtypes(include=np.number).corr()
plt.figure(figsize=(15, 10))
sns.heatmap(df5[(df5>0.5)|(df5<-0.5)],annot=True,cbar=False,linewidth=0.5,linecolor='blue')

The temperature minimum column (temp_min) and the temperature maximum column (temp_max) show high correlation. So we drop these columns to avoid multicollinearity.

In [None]:
df_energy=df_energy.drop(['temp_min','temp_max'],axis=1)

Now our dataset is ready for model building

# Data Preparation for model building

In [None]:
#segregating the categorical and numeric variables into two variables
df_cat=df_energy.select_dtypes(include=object)
df_num=df_energy.select_dtypes(include=np.number)

In [None]:
#getting dummies for categorical variables
df_dummy=pd.get_dummies(df_cat,drop_first=True)

In [None]:
#creating the final dataframe for model building
df_final=pd.concat([df_dummy,df_num],axis=1)
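
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up column (`toy_cat` and its values are illustrative): the first category in sorted order gets no dummy column of its own and acts as the baseline.

```python
import pandas as pd

# made-up categorical column with three levels
toy_cat = pd.DataFrame({"weather_main": ["clear", "clouds", "rain", "clear"]})

toy_dummy = pd.get_dummies(toy_cat, drop_first=True)
print(toy_dummy.columns.tolist())  # ['weather_main_clouds', 'weather_main_rain']

# rows with the baseline level 'clear' have zeros in every dummy column
print(toy_dummy.iloc[0].sum())
```

Dropping one level per variable avoids perfect collinearity between the dummy columns and the intercept that is added later with `add_constant()`.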

In [None]:
X=df_final.drop('total load actual',axis=1)

In [None]:
y=df_final['total load actual']

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

# Model 1(Ordinary least squares)

In [None]:
# build a full model using OLS()
linreg_full_model = sm.OLS(y_train, X_train).fit()


In [None]:
# print the summary output
linreg_full_model.summary()

In [None]:
# predict 'total load actual' for the test set using predict()
predicted = linreg_full_model.predict(X_test)

In [None]:
# calculate rmse using rmse()
linreg_full_model_rmse = rmse(y_test, predicted)

# R-squared of the fitted model (measured on the training data)
linreg_full_model_rsquared = linreg_full_model.rsquared

# Adjusted R-squared of the fitted model (measured on the training data)
linreg_full_model_rsquared_adj = linreg_full_model.rsquared_adj 

In [None]:
# create a list of column names
cols = ['Model', 'RMSE', 'R-Squared', 'Adj. R-Squared']

# create an empty dataframe with these columns
result_tabulation = pd.DataFrame(columns = cols)

# compile the required information
linreg_full_model_with_metrics = pd.Series({'Model': "Linreg full model",
                     'RMSE':linreg_full_model_rmse,
                     'R-Squared': linreg_full_model_rsquared,
                     'Adj. R-Squared': linreg_full_model_rsquared_adj     
                   })

# add the metrics as a new row of the result table
# DataFrame.append() was removed in pandas 2.0, so convert the Series to a one-row dataframe and use pd.concat()
# ignore_index=True: renumber the rows instead of keeping the index labels
result_tabulation = pd.concat([result_tabulation, linreg_full_model_with_metrics.to_frame().T], ignore_index = True)

# print the result table
result_tabulation
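
Note that the table combines two kinds of numbers: the RMSE is computed on the test set, while `rsquared` and `rsquared_adj` describe the fit on the training data. If a test-set R-squared is wanted as well, it could be computed along the lines of the sketch below. `holdout_metrics` is a hypothetical helper written for this example, not part of the notebook or of any library, and the numbers passed to it are made up.

```python
import numpy as np

def holdout_metrics(y_true, y_pred):
    """Return (rmse, r2) for predictions on data the model has not seen."""
    # flatten both inputs so that a one-column table and a 1-D vector line up
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    residual = y_true - y_pred
    rmse_value = np.sqrt(np.mean(residual ** 2))

    # R-squared = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return rmse_value, 1.0 - ss_res / ss_tot

# made-up values: three perfect predictions and one that is off by 2
rmse_value, r2_value = holdout_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
print(rmse_value, r2_value)
```

In the notebook this would be called as `holdout_metrics(y_test, predicted)` after each model's `predict()` step.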

# Model 2(using feature engineering(Total generation from fossil))

In [None]:
# create a new variable 'TotalFossil' by adding up 'generation fossil brown coal/lignite', 'generation fossil gas', 'generation fossil hard coal', and 'generation fossil oil'
# add the new variable to the dataframe 'df_energy'
df_energy['TotalFossil'] = df_energy['generation fossil brown coal/lignite'] + df_energy['generation fossil gas'] + df_energy['generation fossil hard coal'] + df_energy['generation fossil oil']



In [None]:
#segregating the variables into categorical and continuous
df_num=df_energy.select_dtypes(include=np.number)
df_cat=df_energy.select_dtypes(include=object)

In [None]:
#dropping the redundant variables
df_num=df_num.drop(['generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil'], axis=1)

In [None]:
#getting dummies for categorical variables
df_dummy=pd.get_dummies(df_cat,drop_first=True)

#creating the final dataframe for model building
df_final=pd.concat([df_dummy,df_num],axis=1)

X=df_final.drop('total load actual',axis=1)

y=df_final['total load actual']

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
linreg_full_model_fossil = sm.OLS(y_train, X_train).fit()


In [None]:
predicted = linreg_full_model_fossil.predict(X_test)

In [None]:
linreg_full_model_fossil_rmse = rmse(y_test, predicted)

# calculate R-squared using rsquared
linreg_full_model_fossil_rsquared = linreg_full_model_fossil.rsquared

# calculate Adjusted R-Squared using rsquared_adj
linreg_full_model_fossil_rsquared_adj = linreg_full_model_fossil.rsquared_adj 

In [None]:
# add this model's scores to the result table
# the measures considered for model comparison are RMSE, R-squared and Adjusted R-squared
# column names of the result table
cols = ['Model', 'RMSE', 'R-Squared', 'Adj. R-Squared']

# the result table 'result_tabulation' already exists, so it is not created again

# compile the required information
# (stored under a new name so that the fitted model 'linreg_full_model_fossil' is not overwritten)
linreg_full_model_fossil_metrics = pd.Series({'Model': "Linreg full model with new feature(Total generation by Fossil) ",
                     'RMSE':linreg_full_model_fossil_rmse,
                     'R-Squared': linreg_full_model_fossil_rsquared,
                     'Adj. R-Squared': linreg_full_model_fossil_rsquared_adj     
                   })

# add the metrics as a new row of the result table
# DataFrame.append() was removed in pandas 2.0, so convert the Series to a one-row dataframe and use pd.concat()
# ignore_index=True: renumber the rows instead of keeping the index labels
result_tabulation = pd.concat([result_tabulation, linreg_full_model_fossil_metrics.to_frame().T], ignore_index = True)

# print the result table
result_tabulation

# Model 3(Using feature engineering(Total generation from renewable energy))

In [None]:
#dropping the feature added
df_energy=df_energy.drop('TotalFossil',axis=1)

In [None]:
#creating a variable renewable in which total power generation by renewable energies are added
df_energy['renewable']=df_energy['generation other renewable']+df_energy['generation solar']+df_energy['generation wind onshore']+df_energy['generation hydro pumped storage consumption']+df_energy['generation hydro run-of-river and poundage']+df_energy['generation hydro water reservoir']

In [None]:
#segregating the categorical and numerical variables
df_num=df_energy.select_dtypes(include=np.number)
df_cat=df_energy.select_dtypes(include=object)

In [None]:
#dropping the redundant variables
df_num.drop(['generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation other renewable', 'generation solar',
       'generation wind onshore'],axis=1,inplace=True)

In [None]:
#getting dummies for categorical variables
df_dummy=pd.get_dummies(df_cat,drop_first=True)

#creating the final dataframe for model building
df_final=pd.concat([df_dummy,df_num],axis=1)

X=df_final.drop('total load actual',axis=1)

y=df_final['total load actual']

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
linreg_full_model_renewable = sm.OLS(y_train, X_train).fit()


In [None]:
predicted = linreg_full_model_renewable.predict(X_test)

In [None]:
linreg_full_model_renewable_rmse = rmse(y_test, predicted)

# calculate R-squared using rsquared
linreg_full_model_renewable_rsquared = linreg_full_model_renewable.rsquared

# calculate Adjusted R-Squared using rsquared_adj
linreg_full_model_renewable_rsquared_adj = linreg_full_model_renewable.rsquared_adj 

In [None]:
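# column names of the result table
cols = ['Model', 'RMSE', 'R-Squared', 'Adj. R-Squared']

# compile the required information
# (stored under a new name so that the fitted model 'linreg_full_model_renewable' is not overwritten)
linreg_full_model_renewable_metrics = pd.Series({'Model': "Linreg full model with new feature(renewable) ",
                     'RMSE':linreg_full_model_renewable_rmse,
                     'R-Squared': linreg_full_model_renewable_rsquared,
                     'Adj. R-Squared': linreg_full_model_renewable_rsquared_adj     
                   })

# add the metrics as a new row of the result table
# DataFrame.append() was removed in pandas 2.0, so convert the Series to a one-row dataframe and use pd.concat()
result_tabulation = pd.concat([result_tabulation, linreg_full_model_renewable_metrics.to_frame().T], ignore_index = True)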

# print the result table
result_tabulation

In [None]:
#dropping the column added
df_energy=df_energy.drop(['renewable'],axis=1)

# Model 4(Using VIF selecting the important features)

In [None]:
#dropping the dependent variable
df_features = df_energy.drop(['total load actual'], axis = 1)

# filter the numerical features in the dataset
df_numeric_features_vif = df_features.select_dtypes(include=[np.number])

In [None]:
# for each numeric variable, calculate VIF and save it in a dataframe 'vif'

# use for loop to iterate the VIF function 
for ind in range(len(df_numeric_features_vif.columns)):
    
    # create an empty dataframe
    vif = pd.DataFrame()

    # calculate VIF using list comprehension
    vif["VIF_Factor"] = [variance_inflation_factor(df_numeric_features_vif.values, i) for i in range(df_numeric_features_vif.shape[1])]

    # create a column of variable names
    vif["Features"] = df_numeric_features_vif.columns

    # filter the variables with VIF greater than 10 and store it in a dataframe 'multi' 
    # one can choose the threshold other than 10 (it depends on the business requirements)
    multi = vif[vif['VIF_Factor'] > 10]
    
    # if dataframe 'multi' is not empty, then sort the dataframe by VIF values
    # if dataframe 'multi' is empty (i.e. all VIF <= 10), then print the dataframe 'vif' and break the for loop using 'break' 
    if(multi.empty == False):
        df_sorted = multi.sort_values(by = 'VIF_Factor', ascending = False)
    else:
        print(vif)
        break
    
    # if any variable has a VIF above 10, drop the one with the highest VIF and repeat
    # otherwise print the final dataframe 'vif' (this branch is normally not reached, because the loop breaks above)
    if (df_sorted.empty == False):
        df_numeric_features_vif = df_numeric_features_vif.drop(df_sorted.Features.iloc[0], axis=1)
    else:
        print(vif)
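
For reference, the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ is the R-squared obtained by regressing feature $j$ on all the other features. The sketch below computes it by hand with NumPy on made-up data. `vif_manual` is a hypothetical helper written for this example; it always adds an intercept to the auxiliary regression, whereas `variance_inflation_factor` appears to use the columns exactly as they are passed, so the two can give different numbers when no constant column is included.

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])

    # least-squares fit of column j on the remaining columns
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef

    # with an intercept the residuals have zero mean, so this ratio is 1 - R^2
    r_squared = 1.0 - residual.var() / target.var()
    return 1.0 / (1.0 - r_squared)

# made-up data: x3 is almost a copy of x1, x2 is unrelated to both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.01 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2, x3])

vifs = [vif_manual(X_toy, j) for j in range(X_toy.shape[1])]
print(vifs)
```

The two nearly identical columns get very large VIF values while the unrelated column stays close to 1, which is the pattern the loop above uses to decide which variable to drop.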

In [None]:
#creating the final dataframe for model building
df_final = pd.concat([df_numeric_features_vif, df_dummy], axis=1)
X=df_final
y=df_energy[['total load actual']]

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
# build a model using OLS() on the features retained after the VIF filter
# use fit() to fit the model on train data
linreg_full_model_vif = sm.OLS(y_train, X_train).fit()

# print the summary output
print(linreg_full_model_vif.summary())

In [None]:
# predict 'total load actual' for the test set using predict()
predicted = linreg_full_model_vif.predict(X_test)

In [None]:
# calculate rmse using rmse()
# 'y_test' is a one-column dataframe here, so squeeze() it to a Series first;
# otherwise it is broadcast against 'predicted' and the result is not the usual RMSE
linreg_full_model_vif_rmse = rmse(y_test.squeeze(), predicted)

# calculate R-squared using rsquared
linreg_full_model_vif_rsquared = linreg_full_model_vif.rsquared

# calculate Adjusted R-Squared using rsquared_adj
linreg_full_model_vif_rsquared_adj = linreg_full_model_vif.rsquared_adj 

In [None]:
# append the accuracy scores to the table
# compile the required information
linreg_full_model_vif_metrics = pd.Series({'Model': "Linreg with VIF",
                                                'RMSE': linreg_full_model_vif_rmse,
                                                'R-Squared': linreg_full_model_vif_rsquared,
                                                'Adj. R-Squared': linreg_full_model_vif_rsquared_adj})

# add the metrics as a new row of the result table
# DataFrame.append() was removed in pandas 2.0, so convert the Series to a one-row dataframe and use pd.concat()
# ignore_index=True: renumber the rows instead of keeping the index labels
result_tabulation = pd.concat([result_tabulation, linreg_full_model_vif_metrics.to_frame().T], ignore_index = True)

# print the result table
result_tabulation
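
One thing worth checking in this model is the shape of the target. Here `y` is selected with double brackets, so `y_test` is a one-column dataframe, while the predictions are one-dimensional. The sketch below illustrates what can go wrong, assuming the RMSE helper works on the raw arrays along axis 0 the way `rmse_axis0` does. `rmse_axis0` is a stand-in written for this example, not the statsmodels function, and the numbers are made up.

```python
import numpy as np

def rmse_axis0(a, b):
    # root mean squared error along axis 0 on the raw arrays, with no shape checks
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2, axis=0))

y_true_col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like a one-column dataframe
y_pred = np.array([1.0, 2.0, 5.0])            # shape (3,), like a Series of predictions

# (3, 1) minus (3,) broadcasts to a (3, 3) matrix of all pairwise differences
broadcast_result = rmse_axis0(y_true_col, y_pred)
print(broadcast_result.shape)  # (3,)

# flattening the target first gives the usual single RMSE value
correct_result = rmse_axis0(y_true_col.ravel(), y_pred)
print(correct_result)
```

In the broadcast case, element `[0]` compares every true value with the first prediction only, so it is not the RMSE of the model. Selecting the target as a Series (`df_energy['total load actual']`) or calling `squeeze()` on it avoids the problem.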

# Model 5(Using forward selection)

In [None]:
# filter the numerical features in the dataset using select_dtypes()
df_numeric_features = df_energy.select_dtypes(include=np.number)

# filter the categorical features in the dataset using select_dtypes()
df_categoric_features = df_energy.select_dtypes(include = object)

In [None]:
# use 'get_dummies()' from pandas to create dummy variables
df_dummy = pd.get_dummies(df_categoric_features, drop_first = True)

In [None]:
# concatenate the numerical and dummy encoded categorical variables using concat()
df_final = pd.concat([df_numeric_features, df_dummy], axis=1)
X = df_final.drop(['total load actual'], axis = 1)
y = df_final[['total load actual']]

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
# initiate linear regression model to use in feature selection
linreg = LinearRegression()

# build step forward selection
linreg_forward = sfs(estimator = linreg, k_features = 'best', forward = True, verbose = 2, scoring = 'r2', n_jobs = -1)

sfs_forward = linreg_forward.fit(X_train, y_train)

In [None]:
# print the number of selected features
print('Number of features selected using forward selection method:', len(sfs_forward.k_feature_names_))

# print a blank line
print('\n')

# print the selected feature names when k_features = 'best'
print('Features selected using forward selection method are: ')
print(sfs_forward.k_feature_names_)
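
The idea behind forward selection can be sketched in a few lines. The version below is a simplified greedy procedure scored on in-sample R-squared with a stopping tolerance; it is not the cross-validated search that `SequentialFeatureSelector` performs, and `r2_in_sample`, `forward_select`, `tol` and the toy data are all hypothetical names and values chosen for illustration.

```python
import numpy as np

def r2_in_sample(X, y):
    """R-squared of a least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, tol=0.01):
    """Add one feature at a time, always the one that raises R-squared the most."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # score every candidate together with the features chosen so far
        scores = {j: r2_in_sample(X[:, selected + [j]], y) for j in remaining}
        best_j = max(scores, key=scores.get)
        # stop once the best candidate improves the score by less than tol
        if scores[best_j] - best_score < tol:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

# made-up data: only columns 0 and 2 influence the target
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 2] + 0.01 * rng.normal(size=200)

chosen, score = forward_select(X_toy, y_toy)
print(chosen, score)
```

The stopping tolerance is needed because in-sample R-squared never decreases when a feature is added; scoring on held-out folds, as the selector above does by default, is the more robust way to decide when to stop.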

In [None]:
# consider numeric features
df_numeric_features = df_energy.loc[:, ['generation biomass', 'generation fossil brown coal/lignite', 'generation fossil gas', 
                                        'generation fossil hard coal', 'generation fossil oil', 'generation hydro pumped storage consumption', 'generation hydro run-of-river and poundage', 
                                        'generation hydro water reservoir', 'generation nuclear', 'generation other', 'generation other renewable', 'generation solar', 'generation waste', 
                                        'generation wind onshore', 'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'clouds_all']]

# consider categoric features
df_categoric_features = df_energy.loc[:, ["rain_1h","snow_3h","weather_main","weather_description"]]

In [None]:
dummy_encoded_variables = pd.get_dummies(df_categoric_features, drop_first = True)

In [None]:
df_dummy = pd.concat([df_numeric_features, dummy_encoded_variables], axis=1)
X=df_dummy
y = df_energy[['total load actual']]


In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
# build a full model using OLS()
linreg_full_model_forward = sm.OLS(y_train, X_train).fit()

# print the summary output
print(linreg_full_model_forward.summary())

In [None]:
linreg_full_model_forward_predictions = linreg_full_model_forward.predict(X_test)

In [None]:
# calculate rmse using rmse()
# 'y_test' is a one-column dataframe here, so squeeze() it to a Series first;
# otherwise it is broadcast against the predictions and the result is not the usual RMSE
linreg_full_model_forward_rmse = rmse(y_test.squeeze(), linreg_full_model_forward_predictions)

# calculate R-squared using rsquared
linreg_full_model_forward_rsquared = linreg_full_model_forward.rsquared

# calculate Adjusted R-Squared using rsquared_adj
linreg_full_model_forward_rsquared_adj = linreg_full_model_forward.rsquared_adj 

In [None]:
# append the accuracy scores to the table
# compile the required information
linreg_full_model_forward_metrics = pd.Series({'Model': "Linreg with Forward Selection",
                                                'RMSE': linreg_full_model_forward_rmse,
                                                'R-Squared': linreg_full_model_forward_rsquared,
                                                'Adj. R-Squared': linreg_full_model_forward_rsquared_adj})

# add the metrics as a new row of the result table
# DataFrame.append() was removed in pandas 2.0, so convert the Series to a one-row dataframe and use pd.concat()
# ignore_index=True: renumber the rows instead of keeping the index labels
result_tabulation = pd.concat([result_tabulation, linreg_full_model_forward_metrics.to_frame().T], ignore_index = True)

# print the result table
result_tabulation

# Model 6(Using Backward elimination) 

In [None]:
# filter the numerical features in the dataset using select_dtypes()
df_numeric_features = df_energy.select_dtypes(include=np.number)

# filter the categorical features in the dataset using select_dtypes()
df_categoric_features = df_energy.select_dtypes(include = object)

In [None]:
# use 'get_dummies()' from pandas to create dummy variables
dummy_encoded_variables = pd.get_dummies(df_categoric_features, drop_first = True)

In [None]:
# concatenate the numerical and dummy encoded categorical variables using concat()
df_dummy = pd.concat([df_numeric_features, dummy_encoded_variables], axis=1)
X = df_dummy.drop(['total load actual'], axis = 1)

# extract the target variable from the data set
y = df_dummy[['total load actual']]

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split the data into train and test subsets for the predictors and the target
# 'test_size' sets the proportion of data to be included in the test set
# set 'random_state' to get the same split each time you run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train & test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
# initiate linear regression model to use in feature selection
linreg = LinearRegression()

# build step backward feature selection
linreg_backward = sfs(estimator = linreg, k_features = 'best', forward = False, verbose = 2, scoring = 'r2', n_jobs = -1)

# fit the backward elimination on train data using fit()
sfs_backward = linreg_backward.fit(X_train, y_train)

In [None]:
# print the number of selected features
print('Number of features selected using backward elimination method:', len(sfs_backward.k_feature_names_))

# print a blank line
print('\n')

# print the selected feature names when k_features = 'best'
print('Features selected using backward elimination method are: ')
print(sfs_backward.k_feature_names_)

In [None]:
# consider numeric features
df_numeric_features = df_energy.loc[:, ['generation biomass', 'generation fossil brown coal/lignite', 'generation fossil gas', 'generation fossil hard coal', 
                                        'generation fossil oil', 'generation hydro pumped storage consumption', 'generation hydro run-of-river and poundage', 
                                        'generation hydro water reservoir', 'generation nuclear', 'generation other', 'generation other renewable', 'generation solar', 'generation waste', 'generation wind onshore',
                                        'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'rain_3h', 'clouds_all', 'weather_id']]

# consider categoric features
df_categoric_features = df_energy.loc[:, ["rain_1h","snow_3h","weather_main","weather_description","Time"]]

In [None]:
# use 'get_dummies()' from pandas to create dummy variables
dummy_encoded_variables = pd.get_dummies(df_categoric_features, drop_first = True)

In [None]:
# concatenate the numerical and dummy encoded categorical variables using concat()
df_dummy = pd.concat([df_numeric_features, dummy_encoded_variables], axis=1)
X=df_dummy
y=df_energy[['total load actual']]

In [None]:
# add the intercept column using 'add_constant()'
X= sm.add_constant(X)



# split data into train subset and test subset for predictor and target variables
# 'test_size' returns the proportion of data to be included in the test set
# set 'random_state' to generate the same dataset each time you run the code 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train and test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
linreg_full_model_backward = sm.OLS(y_train, X_train).fit()

# print the summary output
print(linreg_full_model_backward.summary())

In [None]:
# predict 'total load actual' for the test set using predict()
linreg_full_model_backward_predictions = linreg_full_model_backward.predict(X_test)

In [None]:
# calculate rmse using rmse()
linreg_full_model_backward_rmse = rmse(y_test, linreg_full_model_backward_predictions)

# calculate R-squared using rsquared
linreg_full_model_backward_rsquared = linreg_full_model_backward.rsquared

# calculate Adjusted R-Squared using rsquared_adj
linreg_full_model_backward_rsquared_adj = linreg_full_model_backward.rsquared_adj 

In [None]:
# append the accuracy scores to the table
linreg_full_model_backward_metrics = pd.Series({'Model': "Linreg with Backward Elimination",
                                                'RMSE': linreg_full_model_backward_rmse[0],
                                                'R-Squared': linreg_full_model_backward_rsquared,
                                                'Adj. R-Squared': linreg_full_model_backward_rsquared_adj})

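# add the metrics as a new row of the result table
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
result_tabulation = pd.concat([result_tabulation, linreg_full_model_backward_metrics.to_frame().T], ignore_index = True)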

# print the result table
result_tabulation

# Model 7 (Linear Regression using SGD)

In [None]:
#segregating the categorical and numerical variables
df_num=df_energy.select_dtypes(include=np.number)
df_cat=df_energy.select_dtypes(include=object)


In [None]:
#getting dummies for categorical variables
df_dummy=pd.get_dummies(df_cat,drop_first=True)

#creating the final dataframe for model building
df_final=pd.concat([df_dummy,df_num],axis=1)

X=df_final.drop('total load actual',axis=1)

y=df_final['total load actual']


In [None]:
# split data into train subset and test subset for predictor and target variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# check the dimensions of the train and test subsets
# print dimension of predictors train set
print("The shape of X_train is:",X_train.shape)

# print dimension of predictors test set
print("The shape of X_test is:",X_test.shape)

# print dimension of target train set
print("The shape of y_train is:",y_train.shape)

# print dimension of target test set
print("The shape of y_test is:",y_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression
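# build the model
# note: LinearRegression fits by ordinary least squares;
# a stochastic gradient descent fit would use sklearn.linear_model.SGDRegressor instead
OLS_model = LinearRegression()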

# fit the model
OLS_model.fit(X_train, y_train)
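
The cell above fits scikit-learn's `LinearRegression`, which solves the least-squares problem directly. For comparison, here is a minimal pure-Python sketch of what fitting a one-feature linear model by stochastic gradient descent looks like. The function name `sgd_linear_fit`, the toy data, the learning rate and the epoch count are all illustrative assumptions, not settings taken from this notebook.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.01, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # gradient of 0.5 * error**2 with respect to w and b
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b

# Noise-free toy data generated from y = 3x + 2 (hypothetical, for illustration)
xs = [x / 10 for x in range(-20, 21)]
ys = [3 * x + 2 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(round(w, 3), round(b, 3))
```

On real data with many features, the inputs would normally be standardized first, since SGD is sensitive to feature scale.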


In [None]:
# predict the values
y_pred_OLS = OLS_model.predict(X_test)

In [None]:
# compute the R-squared on the training set
r_squared_OLS = OLS_model.score(X_train,y_train)

# number of observations in the training set (24544 in the original run)
n = X_train.shape[0]

# number of independent variables (85 in the original run)
p = X_train.shape[1]

# compute Adj-R-Squared
Adj_r_squared_OLS = 1 - (1-r_squared_OLS)*(n-1)/(n-p-1)

# Compute RMSE
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse_OLS = sqrt(mean_squared_error(y_test, y_pred_OLS))
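
The adjusted R-squared formula used in this cell, $1 - (1 - R^2)\frac{n-1}{n-p-1}$, can be checked in isolation. The sketch below is a small helper under assumptions: the name `adjusted_r2` and the R-squared value of 0.80 are hypothetical, while 24544 and 85 are the sample size and predictor count quoted in the notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# 0.80 is an illustrative R-squared, not a result from this notebook
example = adjusted_r2(0.80, n=24544, p=85)
print(round(example, 4))
```

With a sample this large relative to the number of predictors, the adjustment is small; it matters more when `p` is close to `n`.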



In [None]:
# append the accuracy scores to the table
linreg_full_model_SGD = pd.Series({'Model': "Linreg with SGD",
                                                'RMSE': rmse_OLS,
                                                'R-Squared': r_squared_OLS,
                                                'Adj. R-Squared':Adj_r_squared_OLS})

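# add the metrics as a new row of the result table
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
result_tabulation = pd.concat([result_tabulation, linreg_full_model_SGD.to_frame().T], ignore_index = True)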

# print the result table
result_tabulation

In [None]:
plt.rcParams['figure.figsize'] = [10,8]

In [None]:
# RMSE of each of the seven models, in the order they were built
result=pd.DataFrame({'Model':[1,2,3,4,5,6,7],
                     'RMSE':[1238.66,1280.40,1696.45,6584.66,6000.29,6552.28,1238.61]})
result.plot(kind='bar',x='Model',y='RMSE')

# Conclusion:

In total, 7 models were built to predict the total load consumption from various generation and weather factors.
Of these, we select the 7th model, linear regression using SGD, because it has the best adjusted R-squared value
and the lowest RMSE (1238.61).

Adjusted R-squared measures how much of the variance in the target the model explains while penalizing predictors
that do not improve the fit, and RMSE measures the typical difference between the actual and predicted values.
Since the RMSE for linear regression with SGD is the lowest, we select this model to predict the power consumption.

The statistical summaries point the same way: comparing the AIC, BIC and log-likelihood values, the AIC and BIC
of linear regression with SGD are the lowest. AIC and BIC score a model by its log-likelihood with a penalty for
the number of parameters, so lower values indicate a better balance between goodness of fit and complexity.
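
To make the selection criteria named in the conclusion concrete, the sketch below computes RMSE from a set of predictions, and AIC and BIC from a log-likelihood, using the standard definitions $\mathrm{AIC} = 2k - 2\ln L$ and $\mathrm{BIC} = k\ln n - 2\ln L$. All numbers and the function names are made up for illustration; none of them are results from the models in this notebook.

```python
from math import log, sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * log(n) - 2 * log_likelihood

# Hypothetical values, for illustration only
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [98.0, 123.0, 94.0, 109.0]
print(rmse(actual, predicted))
print(aic(-50.0, k=3), bic(-50.0, k=3, n=100))
```

For the same log-likelihood, BIC penalizes each extra parameter more heavily than AIC once `n` exceeds about 8, because $\ln n > 2$ there.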

# Time Series Analysis

In [None]:
#copying the dataframe in another dataframe
df=df_energy.copy(deep=True)

In [None]:
#displaying the first five records
df.head()

# Preparing the data

In [None]:
#Dropping all the columns except total load actual and Date,
#as time series forecasting requires only the column to be forecasted and the date
cols = ['generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 'rain_1h',
       'rain_3h', 'snow_3h', 'clouds_all', 'weather_id', 'weather_main',
       'weather_description','Time']
df=df.drop(cols,axis=1)
df = df.sort_values('Date')


In [None]:
#grouping the data by date and taking the sum of all the load on that date
df = df.groupby('Date')['total load actual'].sum().reset_index()

In [None]:
#setting the index of the dataframe to date
df.set_index('Date', inplace=True)

In [None]:
#displaying the final dataframe
df.head()

In [None]:
#plotting the dataframe in time axis
df.plot(figsize=(15, 6))
plt.show()

# Decomposing

Decomposing the time series into three distinct components: trend, seasonality, and noise.
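
As a rough illustration of what an additive decomposition computes, the sketch below implements a simplified version of the classical moving-average approach in pure Python: a centered moving average estimates the trend, the average detrended value at each position in the cycle estimates the seasonality, and what remains is the residual. The function names and the toy series are assumptions for illustration; the library routine used in this notebook handles more cases (for example odd periods and multiplicative models).

```python
def centered_moving_average(values, period):
    """2 x period centered moving average for an even period; None at the edges."""
    if period % 2 != 0:
        raise ValueError("this sketch only handles an even period")
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]
        # the two end points get half weight so the window is centered on t
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

def additive_decompose(values, period):
    """Split values into trend + seasonal + residual components."""
    trend = centered_moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    # average the detrended values at each position in the cycle
    seasonal_means = []
    for k in range(period):
        vals = [d for i, d in enumerate(detrended) if i % period == k and d is not None]
        seasonal_means.append(sum(vals) / len(vals))
    offset = sum(seasonal_means) / period  # center the seasonal component on zero
    seasonal = [seasonal_means[i % period] - offset for i in range(len(values))]
    resid = [v - t - s if t is not None else None
             for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, resid

# Toy series: linear trend plus a zero-sum pattern that repeats every 4 steps
pattern = [2.0, -1.0, -3.0, 2.0]
values = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, resid = additive_decompose(values, period=4)
print(seasonal[:4])
```

Because the toy series has no noise, the recovered seasonal component matches `pattern` and the residuals are zero wherever the trend is defined.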

In [None]:
#resampling to monthly means (month-start frequency), as the full series is too dense to work with directly
y = df['total load actual'].resample('MS').mean()

In [None]:

from statsmodels.tsa.seasonal import seasonal_decompose

# decompose the monthly series into trend, seasonal and residual components
decomposition = seasonal_decompose(y)

plt.plot(y, label = 'Original')
plt.legend(loc = 'best')
plt.show()

trend = decomposition.trend
plt.plot(trend, label = 'Trend')
plt.legend(loc = 'best')
plt.show()

seasonal = decomposition.seasonal
plt.plot(seasonal, label = 'Seasonal')
plt.legend(loc = 'upper right')
plt.show()

residual = decomposition.resid
plt.plot(residual, label = 'Residual')
plt.legend(loc='best')
plt.show()

# Checking Stationarity

In [None]:
from statsmodels.tsa.stattools import adfuller
# Augmented Dickey-Fuller test; the null hypothesis is that the series has a unit root (is non-stationary)
result = adfuller(y)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

As the p-value is greater than 0.05, we cannot reject the null hypothesis of a unit root, so the series is treated as non-stationary. The test statistic is also greater than the 1% critical value, which leads to the same conclusion.
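
The decision rule described here can be written down as a small helper. This is only a sketch: the function name `interpret_adf` and the numbers passed to it are hypothetical and are not the output of the test above.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result; the null hypothesis is a unit root (non-stationary)."""
    # rejecting the null at level alpha is taken as evidence of stationarity
    reject_null = p_value < alpha
    # the statistic must be more negative than a critical value to reject at that level
    below = {level: statistic < cv for level, cv in critical_values.items()}
    return {"stationary": reject_null, "below_critical": below}

# Hypothetical numbers, for illustration only
critical = {"1%": -3.6, "5%": -2.9, "10%": -2.6}
summary = interpret_adf(-1.9, 0.33, critical)
print(summary)
```

In the notebook's variables, the corresponding call would be `interpret_adf(result[0], result[1], result[4])`.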

In [None]:
#Differencing to make the series stationary
y = y - y.shift(1)

In [None]:
#plotting the series after differencing
y.dropna(inplace=True)
y.plot()
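
First-order differencing replaces each value with its change from the previous value, which removes a linear trend. The pure-Python sketch below shows this on a made-up series; the helper name `difference` is an assumption. In pandas, `y.diff()` gives the same result as `y - y.shift(1)`.

```python
def difference(series, lag=1):
    """Return series[i] - series[i - lag] for every i where both values exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Toy series with a pure linear trend (hypothetical data)
trended = [5 + 2 * t for t in range(8)]
diffed = difference(trended)
print(diffed)  # every difference equals the slope, 2
```

The differenced series is one element shorter than the original, which is why the cell above drops the leading missing value.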

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# decompose the differenced series
decomposition = seasonal_decompose(y)

plt.plot(y, label = 'Original')
plt.legend(loc = 'best')
plt.show()

trend = decomposition.trend
plt.plot(trend, label = 'Trend')
plt.legend(loc = 'best')
plt.show()

seasonal = decomposition.seasonal
plt.plot(seasonal, label = 'Seasonal')
plt.legend(loc = 'best')
plt.show()

residual = decomposition.resid
plt.plot(residual, label = 'Residual')
plt.legend(loc='best')
plt.show()

In [None]:
#splitting the series chronologically: the first 95% for training, the remainder for testing
size = int(len(y) * 0.95)
train, test = y[0:size], y[size:len(y)]

# Time Series Forecasting using ARIMA

In [None]:
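import itertools

# candidate values (0 or 1) for each of p, d and q
p = d = q = range(0, 2)
# all (p, d, q) combinations for the non-seasonal part
pdq = list(itertools.product(p, d, q))
# the same combinations with a seasonal period of 12 months
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]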
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

# Parameter Selection 

In [None]:
from pylab import rcParams
# fit every parameter combination and print its AIC; lower AIC is better
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(y, order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except Exception:
            # skip combinations for which the model cannot be fitted
            continue
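
Rather than reading the lowest AIC off the printed output, the search can keep track of the best combination directly. The sketch below shows the pattern on the same 8 x 8 grid. `placeholder_aic` is a made-up scoring function standing in for `results.aic`, so the sketch runs without fitting any model; the combination it selects carries no meaning for this dataset.

```python
import itertools

p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]

def placeholder_aic(order, seasonal_order):
    """Hypothetical stand-in for `results.aic`; not a real model score."""
    return (100 + 3 * sum(order) + 2 * sum(seasonal_order[:3])
            - 6 * order[2] * seasonal_order[2])

# evaluate every (order, seasonal_order) pair and keep the lowest score
best_score, best_order, best_seasonal = min(
    ((placeholder_aic(o, s), o, s) for o in pdq for s in seasonal_pdq),
    key=lambda item: item[0],
)
print(len(pdq) * len(seasonal_pdq), best_score, best_order, best_seasonal)
```

In the real search, the call to `placeholder_aic` would be replaced by fitting the model inside the `try` block and reading `results.aic`.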

# Fitting the ARIMA model

In [None]:
mod = sm.tsa.statespace.SARIMAX(y,
                                order=(0, 1, 1),
                                seasonal_order=(0, 1, 1, 12),
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

# Running Model Diagnostics

In [None]:
results.plot_diagnostics(figsize=(16, 8))
plt.show()

# Validating Forecasts

In [None]:
#one-step-ahead forecasts from 2017-01-01 to the end of the data
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'), dynamic=False)
pred_ci = pred.conf_int()
ax = y['2015':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Total Load Actual')
plt.legend()
plt.show()

In [None]:
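y_forecasted = pred.predicted_mean
# observed values from the forecast start date; the subtraction below aligns the two series
# on the date index, so only dates present in both contribute to the error
y_truth = train['2017-01-01':]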
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))

print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
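
When two pandas Series are subtracted, they are aligned on their index first, so the error above is computed only over dates that appear in both the forecast and the observed series. The pure-Python sketch below mimics that behaviour with dictionaries keyed by date; the helper name `aligned_rmse` and all values are hypothetical.

```python
from math import sqrt

def aligned_rmse(forecast, truth):
    """RMSE over the keys present in both mappings (mimics index alignment)."""
    common = sorted(set(forecast) & set(truth))
    if not common:
        raise ValueError("no overlapping dates")
    mse = sum((forecast[k] - truth[k]) ** 2 for k in common) / len(common)
    return sqrt(mse), common

# Hypothetical monthly values, for illustration only
forecast = {"2017-01": 10.0, "2017-02": 12.0, "2017-03": 9.0}
truth = {"2016-12": 8.0, "2017-01": 11.0, "2017-02": 10.0}
score, used = aligned_rmse(forecast, truth)
print(score, used)
```

Returning the list of dates actually used makes it easy to confirm that the evaluation window is the one intended.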