# <img src="https://www.theneweconomy.com/wp-content/uploads/2018/10/Off-grid-power.jpg" width = 400>

# Energy Generation in India

India is the world's third-largest producer and third-largest consumer of electricity. The national electric grid in India had an installed capacity of 370.106 GW as of 31 March 2020. Renewable power plants, which also include large hydroelectric plants, constitute 35.86% of India's total installed capacity.
India has a surplus power generation capacity but lacks adequate distribution infrastructure.

India's electricity sector is dominated by fossil fuels, in particular coal, which during the 2018-19 fiscal year produced about three-quarters of the country's electricity. The government is making efforts to increase investment in renewable energy. The government's National Electricity Plan of 2018 states that the country does not need more non-renewable power plants in the utility sector until 2027, with the commissioning of 50,025 MW coal-based power plants under construction and addition of 275,000 MW total renewable power capacity after the retirement of nearly 48,000 MW old coal-fired plants.

In this notebook we explore the given data to gain insights. Let's begin.

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = 8, 5
plt.style.use("fivethirtyeight")
pd.options.plotting.backend = "plotly"

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
state = pd.read_csv('../input/daily-power-generation-in-india-20172020/State_Region_corrected.csv')
data_df = pd.read_csv('../input/daily-power-generation-in-india-20172020/file.csv') 

In [None]:
state.head()

In [None]:
data_df.head()

In [None]:
# Draw the null-value map first, then render it; plt.show() does not take an Axes argument
sns.heatmap(data_df.isnull())
plt.show()

In [None]:
def findNull(df):
    """Print the percentage and count of null values for every column."""
    print("Column\t\t\tNull Percentage\t\t\tNull Records")
    for col in df.columns:
        null_sum = df[col].isnull().sum()
        print(f"{col}\t\t\t{null_sum/len(df)*100:.2f}%\t\t\t{null_sum} Null Records")

In [None]:
findNull(data_df)

On inspection, the Nuclear columns have 1854 missing values, i.e. about 40% of the records.
The null values seem to come from regions where there is no nuclear energy generation, so we impute them with 0.

In [None]:
data_df.fillna(0.0,inplace=True)
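
As a cross-check on the imputation above, the null counts can also be collected into a small table instead of printed text. The sketch below is illustrative only: `null_summary` is a hypothetical helper, and the toy frame uses made-up values in place of `data_df`.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, largest first."""
    summary = pd.DataFrame({
        "null_records": df.isnull().sum(),
        "null_percentage": df.isnull().mean() * 100,
    })
    return summary.sort_values("null_records", ascending=False)

# Toy frame with made-up values, standing in for data_df
toy = pd.DataFrame({
    "Region": ["Northern", "Eastern", "Western", "Southern"],
    "Nuclear Generation Actual (in MU)": [30.0, np.nan, 40.0, np.nan],
})
summary = null_summary(toy)
print(summary)
```

Grouping the same null mask by `Region` would show whether the missing values really are confined to particular regions.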

The Thermal columns are stored as strings with comma thousands separators, so they need to be cleaned and converted to floats.

In [None]:
data_df["Thermal Generation Actual (in MU)"] = data_df["Thermal Generation Actual (in MU)"].str.replace(',', '').astype(float)
data_df["Thermal Generation Estimated (in MU)"] = data_df["Thermal Generation Estimated (in MU)"].str.replace(',', '').astype(float)
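
A slightly more defensive variant of this conversion is sketched below. It assumes the only formatting problem is the comma separator; `to_float` is a hypothetical helper and the sample strings are made up. Casting to `str` first means that values already replaced by `0.0` in the earlier `fillna` call are handled too, and anything unparseable becomes `NaN` instead of raising.

```python
import pandas as pd

def to_float(series: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to float; bad values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample values, including one that cannot be parsed
raw = pd.Series(["1,106.89", "576.66", 0.0, "n/a"])
clean = to_float(raw)
print(clean)
```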

Convert Date field to datetime

In [None]:
data_df.Date = pd.to_datetime(data_df.Date) 

# Boxplots for outliers

In [None]:
def boxOut(df, feature1, feature2, title):
    fig = make_subplots(2,1)
    fig.add_trace(go.Box(x=df[feature1], 
                         name="Actual",
                        boxpoints='all',),
                         row=1, col=1)
    fig.add_trace(go.Box(x=df[feature2], 
                         name="Estimated",
                        boxpoints='all',),
                         row=2, col=1)
    fig.update_layout(height=800, 
                      width=800,
                      title=title)
    fig.show()

In [None]:
boxOut(data_df,"Thermal Generation Actual (in MU)","Thermal Generation Estimated (in MU)","Thermal Generation Outliers")

In [None]:
boxOut(data_df,"Nuclear Generation Actual (in MU)","Nuclear Generation Estimated (in MU)","Nuclear Generation Outliers")

In [None]:
boxOut(data_df,"Hydro Generation Actual (in MU)","Hydro Generation Estimated (in MU)","Hydro Generation Outliers")

Thermal and Hydro Generation have a substantial number of outliers in both the actual and the estimated values.
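
To put a number on "substantial", the points outside the box plot whiskers can be counted. The sketch below uses the common 1.5 × IQR rule with pandas' default quantile interpolation, which may differ slightly from the quartile method plotly uses when drawing the boxes. `count_iqr_outliers` is a hypothetical helper and the series is toy data.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())

# Toy series with one obvious outlier
toy = pd.Series([10, 11, 12, 13, 14, 15, 100])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

On the real data this could be called once per generation column, for example on `data_df["Hydro Generation Actual (in MU)"]`.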

# Exploring Generation Series by Region

## Northern

In [None]:
df = data_df[data_df['Region']=='Northern']
fig = df.plot(x='Date',y=['Thermal Generation Actual (in MU)', 'Thermal Generation Estimated (in MU)'])
fig.update_layout(title="Thermal Generation in Northern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)'])
fig.update_layout(title="Nuclear Generation in Northern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Hydro Generation Actual (in MU)', 'Hydro Generation Estimated (in MU)'])
fig.update_layout(title="Hydro Generation in Northern Region",legend_orientation="h")

## Southern

In [None]:
df = data_df[data_df['Region']=='Southern']
fig = df.plot(x='Date',y=['Thermal Generation Actual (in MU)', 'Thermal Generation Estimated (in MU)'])
fig.update_layout(title="Thermal Generation in Southern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)'])
fig.update_layout(title="Nuclear Generation in Southern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Hydro Generation Actual (in MU)', 'Hydro Generation Estimated (in MU)'])
fig.update_layout(title="Hydro Generation in Southern Region",legend_orientation="h")

## Western

In [None]:
df = data_df[data_df['Region']=='Western']
fig = df.plot(x='Date',y=['Thermal Generation Actual (in MU)', 'Thermal Generation Estimated (in MU)'])
fig.update_layout(title="Thermal Generation in Western Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)'])
fig.update_layout(title="Nuclear Generation in Western Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Hydro Generation Actual (in MU)', 'Hydro Generation Estimated (in MU)'])
fig.update_layout(title="Hydro Generation in Western Region",legend_orientation="h")

## Eastern

In [None]:
df = data_df[data_df['Region']=='Eastern']
fig = df.plot(x='Date',y=['Thermal Generation Actual (in MU)', 'Thermal Generation Estimated (in MU)'])
fig.update_layout(title="Thermal Generation in Eastern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)'])
fig.update_layout(title="Nuclear Generation in Eastern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Hydro Generation Actual (in MU)', 'Hydro Generation Estimated (in MU)'])
fig.update_layout(title="Hydro Generation in Eastern Region",legend_orientation="h")

## NorthEastern

In [None]:
df = data_df[data_df['Region']=='NorthEastern']
fig = df.plot(x='Date',y=['Thermal Generation Actual (in MU)', 'Thermal Generation Estimated (in MU)'])
fig.update_layout(title="Thermal Generation in NorthEastern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)'])
fig.update_layout(title="Nuclear Generation in NorthEastern Region",legend_orientation="h")

In [None]:
fig = df.plot(x='Date',y=['Hydro Generation Actual (in MU)', 'Hydro Generation Estimated (in MU)'])
fig.update_layout(title="Hydro Generation in NorthEastern Region",legend_orientation="h")
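
The fifteen cells above differ only in the region and the energy source, so they could be generated in a loop. The sketch below assumes every source follows the `"<Source> Generation Actual (in MU)"` / `"<Source> Generation Estimated (in MU)"` naming used in this notebook. `region_plot_specs` is a hypothetical helper, and the toy frame only exists to show the iteration.

```python
import pandas as pd

SOURCES = ["Thermal", "Nuclear", "Hydro"]

def region_plot_specs(df: pd.DataFrame, region_col: str = "Region"):
    """Yield (subset, y_columns, title) for every region/source pair."""
    for region, subset in df.groupby(region_col, sort=False):
        for source in SOURCES:
            y_cols = [f"{source} Generation Actual (in MU)",
                      f"{source} Generation Estimated (in MU)"]
            yield subset, y_cols, f"{source} Generation in {region} Region"

# Toy frame standing in for data_df (only the columns the helper needs)
toy = pd.DataFrame({
    "Region": ["Northern", "Southern", "Northern"],
    "Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
})
specs = list(region_plot_specs(toy))
for _, _, title in specs:
    print(title)

# With the plotly backend configured above, each spec could be drawn as:
# for subset, y_cols, title in region_plot_specs(data_df):
#     subset.plot(x="Date", y=y_cols).update_layout(title=title, legend_orientation="h").show()
```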

# Exploring Actual vs Predicted Values

In [None]:
def actualVpredicted(df,f1,f2,title):
    # Aggregate once so the x labels and the bar heights share the same region order
    totals = df.groupby('Region')[[f1, f2]].sum()
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=totals.index,
        y=totals[f1],
        name='Actual',
        marker_color='indianred'
    ))
    fig.add_trace(go.Bar(
        x=totals.index,
        y=totals[f2],
        name='Predicted',
        marker_color='lightsalmon'
    ))

    fig.update_layout(barmode='group', title = title)
    fig.show()

In [None]:
actualVpredicted(data_df,"Thermal Generation Actual (in MU)","Thermal Generation Estimated (in MU)","Thermal Generation: Actual vs Estimated")

In [None]:
actualVpredicted(data_df,"Nuclear Generation Actual (in MU)","Nuclear Generation Estimated (in MU)","Nuclear Generation: Actual vs Estimated")

In [None]:
actualVpredicted(data_df,"Hydro Generation Actual (in MU)","Hydro Generation Estimated (in MU)","Hydro Generation: Actual vs Estimated")
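
The bar charts show the totals side by side. The gap between them can also be expressed as a percentage per region, as in the sketch below. `deviation_by_region` is a hypothetical helper and the toy columns `Actual` and `Estimated` hold made-up values. Regions whose estimated total is 0 would produce `inf` or `NaN` and would need separate handling.

```python
import pandas as pd

def deviation_by_region(df: pd.DataFrame, actual: str, estimated: str) -> pd.DataFrame:
    """Sum both columns per region and add the relative deviation in percent."""
    totals = df.groupby("Region")[[actual, estimated]].sum()
    totals["deviation_pct"] = (totals[actual] - totals[estimated]) / totals[estimated] * 100
    return totals

# Toy frame with made-up values
toy = pd.DataFrame({
    "Region": ["Northern", "Northern", "Southern"],
    "Actual": [90.0, 110.0, 45.0],
    "Estimated": [100.0, 100.0, 50.0],
})
result = deviation_by_region(toy, "Actual", "Estimated")
print(result)
```

A negative `deviation_pct` means actual generation fell short of the estimate.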

In [None]:
data_df['Month'] = data_df.Date.dt.month

In [None]:
def monthly_distribution(df, groupby, dict_features, colors):
    # Aggregate each feature per group, then draw one bar trace per feature
    temp = df.groupby(groupby).agg(dict_features)
    fig = go.Figure()
    for f,c in zip(dict_features, colors):
        fig.add_trace(go.Bar(y=temp[f].values,
                             x=temp.index,
                             name=f,
                             text=temp[f].values,
                             marker=dict(color=c)
                            ))
    fig.update_traces(marker_line_color='rgb(255,255,255)',
                      marker_line_width=2.5,
                      opacity=0.7,
                      textposition="outside",
                      texttemplate='%{text:.2s}')
    fig.update_layout(
                    width=1000,
                    xaxis=dict(title="Month", showgrid=False),
                    yaxis=dict(title="MU", showgrid=False),
                    legend=dict(
                                x=0,
                                y=1.2))
                                
    fig.show()

In [None]:
dict_features = {
    "Thermal Generation Estimated (in MU)": "sum",
    "Thermal Generation Actual (in MU)": "sum",
}
monthly_distribution(data_df, groupby="Month", dict_features=dict_features, colors=px.colors.qualitative.Prism)

dict_features = {
    "Nuclear Generation Estimated (in MU)": "sum",
    "Nuclear Generation Actual (in MU)": "sum",
}
monthly_distribution(data_df, groupby="Month", dict_features=dict_features, colors=px.colors.qualitative.Antique)

dict_features = {
    "Hydro Generation Estimated (in MU)": "sum",
    "Hydro Generation Actual (in MU)": "sum",
}
monthly_distribution(data_df, groupby="Month", dict_features=dict_features, colors=px.colors.qualitative.Set3)

- Thermal and Nuclear generation go down in the summer months
- Hydro generation peaks in the monsoon season
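
One caveat about these observations: monthly sums can be skewed if the date range does not cover whole years, because some months then contribute more days than others. A rough way to check is to compare the average national generation per day for each month instead. `monthly_daily_mean` is a hypothetical helper, and the toy frame uses a made-up `Hydro` column.

```python
import pandas as pd

def monthly_daily_mean(df: pd.DataFrame, value_col: str, date_col: str = "Date") -> pd.Series:
    """Average generation per calendar day, grouped by month of the year."""
    daily = df.groupby(date_col)[value_col].sum()   # total across regions per day
    return daily.groupby(daily.index.month).mean()  # mean daily total per month

# Toy frame: two regions reporting on some days, made-up values
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02",
                            "2019-07-01", "2019-07-01"]),
    "Hydro": [10.0, 20.0, 50.0, 60.0, 40.0],
})
profile = monthly_daily_mean(toy, "Hydro")
print(profile)
```

If the seasonal pattern in the sums also appears in these per-day averages, it is less likely to be an artifact of uneven month coverage.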

# Exploring the State Data

In [None]:
state.head()

In [None]:
fig = px.bar(state, x='State / Union territory (UT)', y='National Share (%)', color='Area (km2)',height=400)
fig.update_layout(title = 'States Power Generation colored by area')
fig.show()

## References
https://www.kaggle.com/sayar1106/india-s-power-generation-statistics