# EDA of Power Generation in India (2017-2020)

Power generation and consumption are important aspects for any nation. Being able to provide electricity at cheap rates is important for the growth of the economy. 

## Initial Setup

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import csv files
power_wrt_region = pd.read_csv('/kaggle/input/daily-power-generation-in-india-20172020/State_Region_corrected.csv', thousands = ',')

power_wrt_time = pd.read_csv('/kaggle/input/daily-power-generation-in-india-20172020/file.csv', parse_dates = ['Date'], thousands = ',')

## 1. Power generation with respect to time analysis

In [None]:
power_wrt_time.head()

We see that there are NaN values in the data. These must be filled. Let us first get the columns which have missing values.

In [None]:
power_wrt_time.info()

In [None]:
# Check which columns contain NaN values
power_wrt_time.isnull().any()
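
`isnull().any()` only says whether a column has gaps, not how many. A minimal sketch of counting them is shown here on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for power_wrt_time; the values are made up for illustration.
toy = pd.DataFrame({
    "Region": ["Northern", "Eastern", "Southern", "NorthEastern"],
    "Thermal Generation Actual (in MU)": [624.2, 429.4, 576.7, 3.8],
    "Nuclear Generation Actual (in MU)": [30.4, np.nan, 62.7, np.nan],
})

# Number of missing entries per column
missing_count = toy.isnull().sum()

# Share of missing entries per column, in percent
missing_pct = toy.isnull().mean() * 100

# Show only the columns that actually have gaps
summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary[summary["missing"] > 0])
```

The same two calls should work unchanged on `power_wrt_time`.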

Let us also visualise the missing values using the __missingno__ library, which is well suited to visualising data with missing values.

In [None]:
# Import missingno
import missingno as msn

Now we visualise the dataset. We generate a bar plot of the non-missing counts for all columns using the `msn.bar` method.

In [None]:
msn.bar(power_wrt_time)

In [None]:
# Creates a matrix of missing values along with position
msn.matrix(power_wrt_time)

There are quite a lot of missing values in the dataset. It is also clear that the `NaN` values occur in pairs, in the same rows of both affected columns. A possible reason is that there are no nuclear power plants in those regions, perhaps because none could be set up there. So it seems reasonable to impute the `NaN` values with 0.
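
The "in pairs" observation can also be checked numerically instead of by eye. The sketch below uses a made-up frame; the column name `Nuclear Generation Estimated (in MU)` is an assumption modelled on the `... Actual (in MU)` columns used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; values and the "Estimated" column name are assumptions for illustration.
toy = pd.DataFrame({
    "Region": ["Northern", "Eastern", "Southern", "NorthEastern", "Eastern"],
    "Nuclear Generation Actual (in MU)": [30.4, np.nan, 62.7, np.nan, np.nan],
    "Nuclear Generation Estimated (in MU)": [35.6, np.nan, 65.0, np.nan, np.nan],
})

actual_missing = toy["Nuclear Generation Actual (in MU)"].isnull()
estimated_missing = toy["Nuclear Generation Estimated (in MU)"].isnull()

# True only if every row is missing in both columns or in neither
missing_in_pairs = bool((actual_missing == estimated_missing).all())

# Regions in which the gaps occur
regions_with_gaps = sorted(toy.loc[actual_missing, "Region"].unique())

print(missing_in_pairs, regions_with_gaps)
```

If the gaps in the real data are confined to a few regions, that would support reading them as "no nuclear generation" rather than as lost measurements.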

In [None]:
# Impute with 0
power_wrt_time.fillna(0, inplace = True)
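
Note that `fillna(0)` on the whole frame would also zero out gaps in any other column. A more conservative variant, sketched here on made-up data, fills only the nuclear columns (selecting them by the `Nuclear` prefix is an assumption about the column naming):

```python
import numpy as np
import pandas as pd

# Toy data with a gap in a non-nuclear column as well; values are made up.
toy = pd.DataFrame({
    "Thermal Generation Actual (in MU)": [624.2, np.nan, 576.7],
    "Nuclear Generation Actual (in MU)": [30.4, np.nan, 62.7],
})

# Select the columns to impute by name prefix (assumed naming convention)
nuclear_cols = [c for c in toy.columns if c.startswith("Nuclear")]

# Work on a copy so the original frame stays available for comparison
filled = toy.copy()
filled[nuclear_cols] = filled[nuclear_cols].fillna(0)

print(filled)
```

Here the thermal gap is left as `NaN`, so it stays visible for a separate decision.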

Another thing we notice is that this is time series data divided into different `Region` values. So, for convenience, let us extract the data for each region separately.

In [None]:
# Northern Region Power
north_power = power_wrt_time[power_wrt_time.Region == 'Northern'].drop(['Region'], axis = 1)  
#north_power.set_index('Date', inplace = True)

# Southern Region Power
south_power = power_wrt_time[power_wrt_time.Region == 'Southern'].drop(['Region'], axis = 1) 
#south_power.set_index('Date', inplace = True)

# Eastern Region Power
east_power = power_wrt_time[power_wrt_time.Region == 'Eastern'].drop(['Region'], axis = 1) 
#east_power.set_index('Date', inplace = True)

# Western Region Power
west_power = power_wrt_time[power_wrt_time.Region == 'Western'].drop(['Region'], axis = 1) 
#west_power.set_index('Date', inplace = True)

# North Eastern Region Power
northeast_power = power_wrt_time[power_wrt_time.Region == 'NorthEastern'].drop(['Region'], axis = 1)
#northeast_power.set_index('Date', inplace = True)
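
The five near-identical filters above could also be written as a single `groupby`. This is only a sketch on made-up rows; `region_frames` and `north_toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy data standing in for power_wrt_time; values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-09-01", "2017-09-01", "2017-09-02", "2017-09-02"]),
    "Region": ["Northern", "Southern", "Northern", "Southern"],
    "Thermal Generation Actual (in MU)": [624.2, 576.7, 628.6, 578.6],
})

# One frame per region, with the now-constant Region column dropped
region_frames = {
    region: frame.drop(columns="Region")
    for region, frame in toy.groupby("Region")
}

north_toy = region_frames["Northern"]
print(north_toy)
```

The dictionary keeps the regions addressable by name and avoids repeating the filter for each one.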

Now that we have prepared our data, let us visualise it.

In [None]:
north_power.head()

### Exploratory Data Analysis
We will explore the data now

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%matplotlib inline

Plot of actual thermal generation over time for the Northern region

In [None]:
plt.figure(figsize = (20, 10))
sns.lineplot(x = 'Date', y = 'Thermal Generation Actual (in MU)', data = north_power)


Thermal Power Generation seems to be increasing over time.
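
A visual impression of a trend can be backed up with a rolling mean and a least-squares slope. The sketch below uses a synthetic series with a known slope of 0.1 MU per day, so it says nothing about the real data; it only shows the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a known upward trend (0.1 MU per day); not real data.
dates = pd.date_range("2017-09-01", periods=365, freq="D")
toy = pd.DataFrame({
    "Date": dates,
    "Thermal Generation Actual (in MU)": 600 + 0.1 * np.arange(365),
})
series = toy.set_index("Date")["Thermal Generation Actual (in MU)"]

# 30-day rolling mean smooths out day-to-day noise
smooth = series.rolling(window=30).mean()

# Least-squares slope in MU per day, using days since the first observation
days = (series.index - series.index[0]).days.to_numpy()
slope, intercept = np.polyfit(days, series.to_numpy(), deg=1)

print(f"slope: {slope:.3f} MU/day")
```

On the real `north_power` frame, a clearly positive slope would support the statement above, while a slope near zero would suggest the rise is mostly seasonal.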

Let us plot all attributes for the Northern region in one chart.

In [None]:
north_power.plot(x = 'Date', figsize = (20, 10), title = 'All power Statistics for North Region')

#### Inferences
Thermal (fossil fuel) generation is still the largest source of electricity.

Thermal power generation seems to rise gradually with time.

There seems to be seasonal variation in hydroelectric generation: more hydroelectricity appears to be used during the winter months. One possible explanation is that during summer more water has to be released and cannot be stored in dams, so usage is lower in summer and higher in winter.

Nuclear power is used in very limited quantities.
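
The seasonality inference can be checked by averaging hydro generation per calendar month. The series below is synthetic and its shape (peak in January, trough in July) is chosen arbitrarily, so it is not evidence for or against the inference; it only illustrates the approach.

```python
import pandas as pd

hydro_col = "Hydro Generation Actual (in MU)"

# Synthetic two-year daily series; the seasonal shape is an arbitrary assumption.
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
toy = pd.DataFrame({"Date": dates})
toy[hydro_col] = 100 + 10 * (toy["Date"].dt.month - 7).abs()

# Average generation per calendar month (1 = January, ..., 12 = December)
month = toy["Date"].dt.month.rename("Month")
monthly_mean = toy.groupby(month)[hydro_col].mean()

peak_month = int(monthly_mean.idxmax())
low_month = int(monthly_mean.idxmin())
print(monthly_mean)
print("peak:", peak_month, "low:", low_month)
```

Running the same grouping on `north_power` would show directly which months carry the most hydro generation.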


Let us plot the power statistics for the Southern region.

In [None]:
south_power.plot(x = 'Date', figsize = (20, 10), title = 'All power Statistics for South Region')

#### Inferences

The Southern region shows roughly the same level of thermal electricity as the Northern region.

However, it has lower usage of hydroelectricity.

Nuclear power seems to be higher than in the Northern region. Let us verify this claim.

In [None]:
plt.figure(figsize = (20 ,10))
# Nuclear generation over time, one line per region
sns.lineplot(x = 'Date',
             y = 'Nuclear Generation Actual (in MU)',
             hue = 'Region',
             markers = True,
             data = power_wrt_time,
             palette = sns.color_palette("mako_r", 5))

In [None]:
# Mean nuclear generation per region
sns.barplot(x = 'Region',
            y = 'Nuclear Generation Actual (in MU)',
            data = power_wrt_time)

Our claim seems reasonable.
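
The bar plot shows the mean per region; the same comparison can be made as a table. A minimal sketch on made-up numbers (the values below are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

nuclear_col = "Nuclear Generation Actual (in MU)"

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Region": ["Northern", "Southern", "Northern", "Southern", "Eastern"],
    nuclear_col: [30.0, 60.0, 32.0, 64.0, 0.0],
})

# Mean nuclear generation per region, largest first
mean_by_region = (
    toy.groupby("Region")[nuclear_col]
       .mean()
       .sort_values(ascending=False)
)
print(mean_by_region)
```

Applied to `power_wrt_time`, the first row of `mean_by_region` names the region with the highest average nuclear generation.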

Let us look at a few more plots.

In [None]:
east_power.plot(x = 'Date',y = 'Hydro Generation Actual (in MU)', figsize = (20, 10), kind='kde')

In [None]:
plt.figure(figsize = (20 ,10))
sns.boxplot(x = 'Region', y = 'Hydro Generation Actual (in MU)', data = power_wrt_time)
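
The quartiles that a box plot draws can also be read off as numbers. The sketch below computes them per region on made-up values; `quartiles` and `iqr` are hypothetical names used only here.

```python
import pandas as pd

hydro_col = "Hydro Generation Actual (in MU)"

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Region": ["Northern"] * 5 + ["Eastern"] * 5,
    hydro_col: [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# Lower quartile, median and upper quartile per region (one row per region)
quartiles = toy.groupby("Region")[hydro_col].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range, i.e. the height of each box
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```

A table like this makes it easier to compare regions whose boxes sit at very different scales on the same axis.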

In [None]:
northeast_power.plot(x = 'Date', y = 'Thermal Generation Actual (in MU)',figsize = (20, 10), kind='kde')