In [None]:
import numpy as np 
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Task

- No task is provided so I'll just approach this as an exercise in data exploration
- I should start by visualising the distribution of different energy sources in India by region
- The time series of energy production in different places is also interesting: is there any seasonality in the data that might allow us to forecast likely demand?


In [None]:
state_region = pd.read_csv('../input/daily-power-generation-in-india-20172020/State_Region_corrected.csv')
file_data = pd.read_csv('../input/daily-power-generation-in-india-20172020/file.csv')

In [None]:
!pip install -q pandas-profiling[notebook]

In [None]:
from pandas_profiling import ProfileReport

In [None]:
state_region_profile = ProfileReport(state_region, title="state_region Profiling Report")
file_data_profile = ProfileReport(file_data, title="file_data Profiling Report")

In [None]:
state_region_profile.to_notebook_iframe()

In [None]:
file_data_profile.to_notebook_iframe()

In [None]:
state_region.head()

In [None]:
# Some data is missing
state_region['National Share (%)'].sum()

In [None]:
# MU stands for million units. In India 1 unit of energy is a kilowatt-hour, so 1 MU == 1 gigawatt-hour
file_data
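
As a quick sanity check on the units, the conversion described in the comment above can be sketched as follows. The helper names `mu_to_gwh` and `mu_to_twh` are hypothetical and are not part of the dataset or any library.

```python
# Hypothetical helpers for the unit conversion described above.
KWH_PER_UNIT = 1            # 1 "unit" of electricity in India = 1 kWh
UNITS_PER_MU = 1_000_000    # MU = million units

def mu_to_gwh(mu):
    """Convert million units (MU) to gigawatt-hours (1 GWh = 1e6 kWh)."""
    return mu * UNITS_PER_MU * KWH_PER_UNIT / 1_000_000

def mu_to_twh(mu):
    """Convert million units (MU) to terawatt-hours (1 TWh = 1e9 kWh)."""
    return mu * UNITS_PER_MU * KWH_PER_UNIT / 1_000_000_000

print(mu_to_gwh(1))      # 1.0
print(mu_to_twh(1000))   # 1.0
```

So a daily figure of around 1000 MU corresponds to roughly 1 TWh of generation per day.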

In [None]:
# Date column spanning Sept 2017 to March 2020. We're going to need the datetime library ;)
file_data['Date'].unique()

# Exploring the percentages of total energy produced by each state

In [None]:
state_region.plot.bar(y = "National Share (%)", x = "State / Union territory (UT)", figsize = (15, 6));

### So Rajasthan produces the largest percentage of India's total energy. Let's see where all these places are on the map and color them by % of total energy production

# Geospatial Analysis

In [None]:
# grab a GeoJSON of Indian states
#!curl -O https://github.com/Subhash9325/GeoJson-Data-of-Indian-States/blob/master/Indian_States

In [None]:
# !curl -O https://www.kaggle.com/sauravmishra1710/indian-state-geojson-data/version/2?select=india_state_geo.json

In [None]:
# state_region.columns

In [None]:
from urllib.request import urlopen
import json

with urlopen('https://raw.githubusercontent.com/Subhash9325/GeoJson-Data-of-Indian-States/master/Indian_States') as response:
    counties = json.load(response)

In [None]:
# rename Odisha to Orissa so that it matches the state name used in the GeoJSON
state_region.replace('Odisha', 'Orissa', inplace = True)
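
Other state names may also differ between the table and the GeoJSON, and any that do will silently be left uncoloured on the map. A minimal sketch of a check for this, assuming the GeoJSON is a FeatureCollection whose features carry a `properties.NAME_1` field; the helper `unmatched_names` and the toy data are hypothetical:

```python
def unmatched_names(df_names, geojson, key="NAME_1"):
    """Return the names in df_names that match no feature's properties[key]."""
    geo_names = {feature["properties"][key] for feature in geojson["features"]}
    return sorted(set(df_names) - geo_names)

# Toy stand-in for the real GeoJSON (not the actual file contents)
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NAME_1": "Orissa"}, "geometry": None},
        {"type": "Feature", "properties": {"NAME_1": "Kerala"}, "geometry": None},
    ],
}

missing = unmatched_names(["Odisha", "Kerala"], toy_geojson)
print(missing)  # ['Odisha']
```

In this notebook the equivalent call would be `unmatched_names(state_region['State / Union territory (UT)'], counties)`.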

In [None]:
# The GeoJSON has no id field, so we match rows to features on properties.NAME_1 (via featureidkey).
# We also keep a copy of the state column under the same name (NAME_1) in our df.
import plotly.express as px

df = state_region
df['NAME_1'] = df['State / Union territory (UT)']

fig = px.choropleth_mapbox(df, geojson=counties, locations='State / Union territory (UT)', featureidkey="properties.NAME_1", color='National Share (%)',
                           color_continuous_scale="Viridis",
                           range_color=(0, max(df['National Share (%)'])),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 22.5934 , "lon": 77.2223},
                           opacity=0.5,
                           labels={'state':'State / Union territory (UT)'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Looking at the map of % energy production in India at a state level we can conclude:
- Our top three producing states, Rajasthan, Madhya Pradesh and Maharashtra, form a contiguous block (Rajasthan borders Madhya Pradesh, which in turn borders Maharashtra)
- I wonder if these are the areas of India with the greatest urban populations and/or economic/industrial activity? Maybe these areas just produce more energy because they have more thermal, hydroelectric or nuclear power plants? We could answer these questions with population and per capita productivity overlays


# Time Series Analysis

In [None]:
file_data.head()

In [None]:
# convert the thermal generation columns to valid numeric columns
file_data = file_data.replace(',','', regex=True)

In [None]:
file_data['Thermal Generation Actual (in MU)'] = file_data['Thermal Generation Actual (in MU)'].astype('float64')
file_data['Thermal Generation Estimated (in MU)'] = file_data['Thermal Generation Estimated (in MU)'].astype('float64')
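
A slightly more defensive variant of this conversion is sketched below. It strips the thousands separators only from the columns that need it, and turns anything unparseable into NaN instead of raising. The helper name `to_numeric_columns` and the toy values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

def to_numeric_columns(df, columns):
    """Return a copy of df with commas stripped from the given columns and values cast to float."""
    out = df.copy()
    for col in columns:
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        # errors="coerce" turns unparseable entries into NaN rather than raising
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out

# Toy frame with made-up values in the same string format
toy = pd.DataFrame({
    "Region": ["Northern", "Western"],
    "Thermal Generation Actual (in MU)": ["1,106.89", "624.23"],
})
clean = to_numeric_columns(toy, ["Thermal Generation Actual (in MU)"])
print(clean.dtypes)
```

Restricting the replacement to named columns avoids accidentally altering commas in text columns such as `Region`.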

In [None]:
# fairly strong seasonality in the data already visible
file_data.plot(alpha = 0.4);

In [None]:
file_data['Date'] = pd.to_datetime(file_data['Date'], format = '%Y/%m/%d')
file_data.index = file_data['Date']

In [None]:
file_data.dtypes

In [None]:
# with Date as the index, plot every generation column as its own subplot
ax = file_data.iloc[:,1::].plot(subplots=True, layout=(6,1), figsize = (15, 10));

- From the plot above we can see definite seasonality in the data:
 1.  Hydro peaks in the summer months until October and drops over the winter
 2.  Nuclear seems to pick up the slack in the winter
- Thermal generation seems to be fairly consistent in time and also produces far more energy than the others, in the region of 1000 MU compared to 200/400 MU for hydro and nuclear. Thermal represents the hydrocarbon-fired (coal/natural gas) power stations that make up the backbone of the energy grid

- Do these patterns hold if we now use the "Region" column to get a regional breakdown of the time series for each energy source?

In [None]:
cols = ['Thermal Generation Actual (in MU)',
       'Thermal Generation Estimated (in MU)',
       'Nuclear Generation Actual (in MU)',
       'Nuclear Generation Estimated (in MU)',
       'Hydro Generation Actual (in MU)',
       'Hydro Generation Estimated (in MU)']

In [None]:
# pd.pivot_table(file_data,index=file_data.index, columns='Region', values=cols)

In [None]:
# pd.pivot_table(file_data,index=file_data.index, columns='Region', values=cols).plot(subplots=True, layout = (6, 5), figsize = (25, 15), legend = True);
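
For reference, the reshaping that the commented-out `pivot_table` calls aim for can be sketched on a toy frame. The numbers below are invented for illustration and the frame only mimics the layout of `file_data`.

```python
import pandas as pd

# Toy long-format data: one row per (Date, Region), values are made up
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-01", "2017-09-01", "2017-09-02", "2017-09-02"]),
    "Region": ["Northern", "Western", "Northern", "Western"],
    "Hydro Generation Actual (in MU)": [273.0, 72.0, 280.0, 70.0],
})

# Wide format: one row per date, one column per region
wide = toy.pivot_table(index="Date", columns="Region",
                       values="Hydro Generation Actual (in MU)")
print(wide)
```

`pivot_table` aggregates with the mean by default, so with a single row per date and region the values pass through unchanged. Calling `wide.plot()` would then draw one line per region on a shared axis.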

In [None]:
import matplotlib.pyplot as plt
g = file_data.groupby('Region')

fig, axes = plt.subplots(g.ngroups, sharex=True, figsize=(10, 10))

for i, (region, d) in enumerate(g):
    ax = d.plot.line(x='Date', y=cols, ax=axes[i], title=region)
    if i == 0:
        box = ax.get_position()
        ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
        #ax.legend(loc = 'best')
        ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
        fig.tight_layout()
    else:
        ax.legend().remove()

- In these region-by-region time series, the seasonality seems most pronounced in the hydro power generation. Hydro energy is rainfall dependent and probably gives a greater output during the monsoon and in the Northern and NE regions of India, because this is where the foothills of the Himalayas are located
- At its peak, hydro generation in NE India is comparable to thermal generation there
- Nuclear and thermal energy also appear to have some seasonality, though this is less pronounced
- We can now see that Western India produces the most energy by thermal means, followed by the northern and southern regions
- There is no nuclear energy produced in Eastern or North Eastern India
- Clearly nuclear energy represents a tiny fraction of Indian energy production

## Trend, Seasonality and Stationarity - North Eastern Region
- Let's quantify the trend and seasonality in the data with an autocorrelation function (ACF) plot

In [None]:
NE_data = file_data.loc[file_data['Region'] == 'NorthEastern']

In [None]:
cols = ['Thermal Generation Actual (in MU)',
       'Thermal Generation Estimated (in MU)',
       'Hydro Generation Actual (in MU)',
       'Hydro Generation Estimated (in MU)']

In [None]:
# ACF plot - aka correlogram - correlation of the time series with a lagged copy of itself reveals trend and seasonality
import statsmodels.api as sm
fig, axes = plt.subplots(len(cols), sharex=False, figsize=(15, 15))
for i, j in enumerate(cols):
    ax = sm.graphics.tsa.plot_acf(NE_data[j].values.squeeze(), lags=40, ax = axes[i], title=j)
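
To make the correlogram less of a black box, here is a minimal NumPy sketch of the sample autocorrelation on a synthetic series with a 7-day cycle. It is not the statsmodels implementation, and the series is simulated, not the NE data.

```python
import numpy as np

def acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator, normalised by lag-0)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

# Synthetic daily series: 7-day cycle plus noise (an assumption for illustration)
rng = np.random.default_rng(0)
t = np.arange(365)
series = np.sin(2 * np.pi * t / 7) + 0.3 * rng.standard_normal(t.size)

r = acf(series, nlags=14)
print(np.round(r[[0, 3, 7, 14]], 2))
```

Seasonality shows up as peaks at multiples of the period (here lags 7 and 14) with troughs in between, whereas a trend shows up as autocorrelations that decay only slowly with lag.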

- Stationarity means the mean and variance are constant in time. The ADF test below takes the presence of a unit root (non-stationarity) as its null hypothesis, while the KPSS tests take stationarity as theirs
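
Because the two tests have opposite null hypotheses, it is easy to read one of them backwards. The sketch below spells out one common way of combining the two p-values. The function name and the 5% threshold are assumptions, and the labels for the two "disagreement" cases follow a widely used rule of thumb, not a formal result.

```python
def interpret_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationary) p-values into a verdict."""
    adf_rejects_unit_root = adf_pvalue < alpha          # evidence FOR stationarity
    kpss_rejects_stationarity = kpss_pvalue < alpha     # evidence AGAINST stationarity

    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return "stationary"
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return "non-stationary"
    if adf_rejects_unit_root and kpss_rejects_stationarity:
        # Tests disagree; often read as difference-stationary
        return "tests disagree: possibly difference-stationary"
    # Neither null rejected; often read as trend-stationary
    return "tests disagree: possibly trend-stationary"

print(interpret_stationarity(0.000005, 0.10))  # stationary
print(interpret_stationarity(0.30, 0.01))      # non-stationary
```

A small ADF p-value is therefore evidence for stationarity, while a small KPSS p-value is evidence against it.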

In [None]:
# Augmented Dickey Fuller (ADF) test. https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test
from statsmodels.tsa.stattools import adfuller
def adf_test(timeseries, name):
    print ('Results of Dickey-Fuller Test:{}'.format(name))
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
       dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

In [None]:
for i in NE_data[cols]:
    adf_test(NE_data[i], i)

### ADF test conclusions:
1. The test statistic for Thermal estimated is more negative than the 1% critical value (p = 0.000005), so we reject the null hypothesis of a unit root: this series looks stationary
2. For the other time series the test statistic is not significant, so we cannot reject the unit-root null and they may be non-stationary

In [None]:
# Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test. https://en.wikipedia.org/wiki/KPSS_test  regression='c': the null is that the data is stationary around a constant (default).
from statsmodels.tsa.stattools import kpss
def kpss_test(timeseries, name):
    print ('Results of KPSS Test:{}'.format(name))
    kpsstest = kpss(timeseries, regression='c', nlags=None)
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
    for key,value in kpsstest[3].items():
        kpss_output['Critical Value (%s)'%key] = value
    print (kpss_output)

In [None]:
for i in NE_data[cols]:
    kpss_test(NE_data[i], i)

### KPSS assuming stationarity around a constant test results:
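- Thermal generation actual and estimated exceed the 1% critical value (reported p = 0.01, the smallest p-value statsmodels reports for this test), so we reject the null of stationarity: these are non-stationary time series
- The Hydro values do not exceed any of the critical values, so we cannot reject stationarity and treat them as stationary time series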

In [None]:
# Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test. https://en.wikipedia.org/wiki/KPSS_test  regression='ct': the null is that the data is stationary around a trend.
from statsmodels.tsa.stattools import kpss
def kpss_test(timeseries, name):
    print ('Results of KPSS Test:{}'.format(name))
    kpsstest = kpss(timeseries, regression='ct', nlags=None)
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
    for key,value in kpsstest[3].items():
        kpss_output['Critical Value (%s)'%key] = value
    print (kpss_output)

In [None]:
for i in NE_data[cols]:
    kpss_test(NE_data[i], i)

### KPSS assuming stationarity around a trend test results:
- Thermal generation estimated exceeds the 1% critical value (p = 0.01), so we reject stationarity around a trend: the series is non-stationary even after allowing for a deterministic trend
- The Hydro values exceed the 5% critical values, so trend-stationarity is rejected at the 5% level and they may be non-stationary

# Time series forecasting in Facebook Prophet - Bayesian curve fitting (generalised additive model)
- Prophet is an alternative to classical time series forecasting methods (ARIMA, Holt-Winters). It fits a Bayesian generalised additive model, wrapped up in an easy to use API, and works best with daily data demonstrating seasonality. With the default (additive) settings and daily data, the model works like this:

#### y = trend + weekly_seasonality + yearly_seasonality + error

- Because we have daily data with seasonality Prophet should work quite well for forecasting.
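
To build some intuition for what an additive decomposition does, here is a minimal NumPy sketch that fits a straight-line trend plus Fourier-series seasonal terms by ordinary least squares on synthetic data. This is only an illustration of the additive idea under simplifying assumptions: Prophet itself uses a piecewise trend with changepoints and fits the model with priors, and the Fourier orders chosen here are arbitrary.

```python
import numpy as np

def fourier_terms(t, period, order):
    """Sine/cosine columns approximating a smooth seasonal curve with the given period."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Synthetic daily data: linear trend + weekly cycle + yearly cycle + noise
rng = np.random.default_rng(42)
t = np.arange(3 * 365, dtype=float)
y = (50 + 0.01 * t
     + 3 * np.sin(2 * np.pi * t / 7)
     + 10 * np.cos(2 * np.pi * t / 365.25)
     + rng.normal(0, 1, t.size))

# Design matrix: intercept, slope, 6 weekly columns, 10 yearly columns
X = np.column_stack([np.ones_like(t), t,
                     fourier_terms(t, 7, 3),
                     fourier_terms(t, 365.25, 5)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each component is just its own columns times its own coefficients
trend = X[:, :2] @ coef[:2]
weekly = X[:, 2:8] @ coef[2:8]
yearly = X[:, 8:] @ coef[8:]
fitted = trend + weekly + yearly

rmse = float(np.sqrt(np.mean((y - fitted) ** 2)))
print(round(float(coef[1]), 4), round(rmse, 2))
```

Because the components simply add up, each one can be plotted and inspected on its own, which is what `plot_components` does for a fitted Prophet model.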


- LSTMs are also appealing modern methods of time series forecasting but may be less effective at modelling the seasonality we have in our data here. Both Prophet and LSTMs have the attractive feature of not requiring user-defined input parameters in order to get a meaningful forecast, whereas ARIMA and Holt-Winters require the setting of parameters defining the seasonality and autoregressive properties of the input data. Some genius even combined LSTMs (good at out of context immediate trend prediction) and Prophet (good at modelling seasonal data), so maybe you can get the best of both worlds https://ieeexplore.ieee.org/document/8986377

In [None]:
!pip install -q fbprophet

In [None]:
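# Prophet expects a two-column frame: 'ds' (dates) and 'y' (values to forecast).
# Selecting the column by name is more robust than relying on its position.
df1 = NE_data[['Thermal Generation Actual (in MU)']].reset_index()
df1 = df1.rename(columns={'Date': 'ds', 'Thermal Generation Actual (in MU)': 'y'})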

In [None]:
from fbprophet import Prophet
m = Prophet()
m.fit(df1)
future = m.make_future_dataframe(periods=365*3)
forecast = m.predict(future)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
from fbprophet.plot import plot_plotly
import plotly.offline as py
py.init_notebook_mode()

fig = plot_plotly(m, forecast)  # This returns a plotly Figure
py.iplot(fig)

- Taking a look at the forecast for the 'Thermal Generation Actual' data, we can see that thermal energy production is expected to remain roughly constant over the next three years, although the uncertainty interval suggests it is more likely to increase than decrease
- Fridays show the least thermal energy generation on a weekly scale and August is the month with the lowest thermal energy output, which is expected to recur in the coming years

In [None]:
# Same 'ds'/'y' layout as before, this time for actual hydro generation
df2 = NE_data[['Hydro Generation Actual (in MU)']].reset_index()
df2 = df2.rename(columns={'Date': 'ds', 'Hydro Generation Actual (in MU)': 'y'})

In [None]:
m = Prophet()
m.fit(df2)
future = m.make_future_dataframe(periods=365*3)
forecast = m.predict(future)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
py.init_notebook_mode()

fig = plot_plotly(m, forecast)  # This returns a plotly Figure
py.iplot(fig)

- Taking a look at the forecast for the 'Hydro Generation Actual' data, we can see that hydroelectric energy production is expected to trend down over the next three years
- Again Fridays show the least hydro energy generation on a weekly scale. January through April shows little production, with a substantial increase through the summer months peaking in September. This yearly seasonality is expected to recur in the coming years
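
The weekly pattern read off the Prophet components can be cross-checked directly against the raw data by averaging per weekday. The sketch below shows the idea on a toy frame built so that Fridays are lowest. The helper `mean_by_weekday` is hypothetical and the numbers are invented.

```python
import numpy as np
import pandas as pd

def mean_by_weekday(df, value_col, date_col="Date"):
    """Average value_col per day of the week, ordered Monday to Sunday."""
    order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    means = df.groupby(df[date_col].dt.day_name())[value_col].mean()
    return means.reindex(order)

# Toy data: four full weeks, constructed so that Fridays are lower
dates = pd.date_range("2019-01-01", periods=28, freq="D")
values = np.where(dates.day_name() == "Friday", 5.0, 10.0)
toy = pd.DataFrame({"Date": dates, "y": values})

weekday_means = mean_by_weekday(toy, "y")
print(weekday_means.idxmin())  # Friday
```

Applied to `NE_data` with one of the generation columns, this would show whether the Friday dip is visible in the raw averages or only in the fitted model.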


### Forecasting conclusions:
1. It looks like thermal power generation is reduced as hydroelectric power generation increases through the summer. The rise in hydro output likely coincides with the seasonal melting of snow and/or the monsoon season in the foothills of the Himalayas
2. My common sense tells me that as economic and population growth continues, energy production in NE India will have to increase to match, by increasing energy production capacity. I think if more training data were available, the forecasts might indicate more of a trend towards increased thermal production capacity. The hydroelectric output forecast would also be affected by changes in temperature, precipitation and capacity over time
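
The first conclusion could be tested numerically by correlating monthly mean hydro and thermal output. The sketch below shows the computation only: the two series are synthetic and deliberately built to move in opposite directions, so the resulting coefficient says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with opposite seasonal swings (an assumption, not real data)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
season = np.sin(2 * np.pi * (doy - 100) / 365.25)
toy = pd.DataFrame({"hydro": 10 + 5 * season,
                    "thermal": 25 - 3 * season}, index=dates)

# Monthly means smooth out the weekly cycle before comparing the two sources
monthly = toy.resample("MS").mean()
corr = monthly["hydro"].corr(monthly["thermal"])
print(round(float(corr), 2))
```

On `NE_data`, the same two lines (`resample("MS").mean()` followed by `.corr`) applied to the actual hydro and thermal columns would give a coefficient near -1 if thermal really backs off when hydro ramps up.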

# Conclusions

- Clearly hydroelectric energy production is highly seasonal and is exploited more readily where there are mountains to create the necessary gradients and collect water

- Thermal production may be reduced as hydroelectric production increases

- The states with the most energy production in India are generating it by thermal means

- Nuclear energy is not widely exploited in India compared to Thermal and Hydroelectric energy production

# Questions to ponder

- I would like to overlay population and economic activity (per capita productivity/income) on the choropleth map of energy production. Is energy being generated where there are more people and more economic activity?

- In terms of forecasting, what are the factors determining future energy production requirements? Economic growth (you need capital investment to increase energy production capacity), population growth (more people need more electrical energy to live a modern lifestyle) and climate change (as rainfall patterns change hydroelectric output varies and the government attitude to Thermal power generation from burning hydrocarbons may also shift)