<h1>Solar Power Plant - EDA</h1> 

In [None]:
#import the modules
import os
import numpy as np
import pandas as pd
import plotly as plt 
from plotly.subplots import make_subplots
import plotly.graph_objects as go


In [None]:
#read all the files using pandas' read_csv
plant1_pg = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv")
plant2_pg = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv")
plant1_ws = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv")
plant2_ws = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv")

#creating a map
files = {
    0: plant1_pg,
    1: plant1_ws,
    2: plant2_pg,
    3: plant2_ws,
}

**<h2>Power Plant Data</h2>**
* DATE_TIME -> date and time of the recorded observation
* SOURCE_KEY -> unique identifier of an inverter (multiple solar panels can be attached to a single inverter)
* DC_POWER -> DC power generated by the inverter in the 15-minute interval, in kW
* AC_POWER -> AC power generated by the inverter in the 15-minute interval, in kW
* DAILY_YIELD -> cumulative yield for the day, up to the recorded time
* TOTAL_YIELD -> cumulative yield of the inverter from the very beginning, up to the recorded time

In [None]:
#plant-1 power generation data
files[0].sample(5)

**<h2>Weather Sensor Data at Power Plant</h2>**
* DATE_TIME -> date and time of the recorded observation
* SOURCE_KEY -> unique identifier of the sensor (only one sensor per plant)
* AMBIENT_TEMPERATURE -> temperature of the environment surrounding the plant
* MODULE_TEMPERATURE -> temperature of the solar panel
* IRRADIATION -> a measure of the amount of sunlight falling on a unit area of the solar panels

**ASSUMPTIONS**
1. Module temperature is the average temperature of all the solar panels at a given time.
2. All the solar panels are of the same unit size, OR the irradiation is the mean value for all the panels in a power plant.

In [None]:
#plant-1 weather sensor data
files[1].sample(5)

**Observations from the 4 files, and the cleanup they call for**
1. None of the 4 datasets has null values, which is good.
2. DATE_TIME is stored as object and needs to be converted to datetime64.
3. PLANT_ID will be dropped.
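The checks behind these observations can be sketched as a small helper. This is only an illustration: `summarize_frames` and the two stand-in frames are hypothetical names and values, not part of the notebook or the real data.

```python
import pandas as pd

def summarize_frames(frames):
    """Return one row per frame: row count, total nulls, and DATE_TIME dtype."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "rows": len(df),
            "nulls": int(df.isna().sum().sum()),
            "date_time_dtype": str(df["DATE_TIME"].dtype),
        })
    return pd.DataFrame(rows).set_index("name")

# Tiny stand-in frames (hypothetical values, not the real Kaggle data).
demo_frames = {
    "generation": pd.DataFrame({
        "DATE_TIME": ["15-05-2020 00:00", "15-05-2020 00:15"],
        "PLANT_ID": [1, 1],
        "DC_POWER": [0.0, 0.0],
    }),
    "weather": pd.DataFrame({
        "DATE_TIME": ["2020-05-15 00:00:00", "2020-05-15 00:15:00"],
        "PLANT_ID": [1, 1],
        "IRRADIATION": [0.0, None],  # one deliberate null to show the count
    }),
}

summary = summarize_frames(demo_frames)
print(summary)
```

Passing the notebook's own `files` dictionary to the same helper would give the null counts and the DATE_TIME dtype for all four files in one table.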

In [None]:
#convert DATE_TIME to datetime64 using to_datetime
#drop the PLANT_ID column from every file
for i in range(len(files)):
    files[i]["DATE_TIME"] = pd.to_datetime(files[i]["DATE_TIME"])
    files[i] = files[i].drop(columns=["PLANT_ID"])
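The time strings used for slicing in this notebook come in two layouts: day-first ("25-05-2020 05:00") and year-first ("2020-05-25 05:00:00"). If the DATE_TIME columns themselves mix these layouts (an assumption, not verified here), an explicit `format` removes any guessing about day and month order. A minimal sketch on hypothetical sample strings:

```python
import pandas as pd

# Hypothetical sample strings in the two layouts: day-first and year-first.
day_first = pd.Series(["25-05-2020 05:00", "01-06-2020 05:00"])
year_first = pd.Series(["2020-05-25 05:00:00", "2020-06-01 05:00:00"])

# An explicit format tells pandas exactly which field is the day and which is the month.
parsed_day_first = pd.to_datetime(day_first, format="%d-%m-%Y %H:%M")
parsed_year_first = pd.to_datetime(year_first, format="%Y-%m-%d %H:%M:%S")

print(parsed_day_first.dt.month.tolist())   # [5, 6]
print(parsed_year_first.dt.month.tolist())  # [5, 6]
```

The second string is the interesting one: "01-06-2020" is 1 June under the day-first reading, and the explicit format keeps it from being read as 6 January.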
    

**<h2>Power Generation and Temperature</h2>**
In most parts of India, sunlight is typically strongest between 11:00 am and 4:00 pm. More sunlight means more power generation, and hotter weather as well.

From the data we can see that as the weather gets hotter, the panels get hotter too. At the same time the panels receive more sunlight, so power generation increases sharply, and the panels' temperature rises along with it.
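One way to put a number on this relationship is to join generation and weather readings on DATE_TIME and look at the correlations. The sketch below runs on synthetic one-day data (all values are made up, and `generation_demo` and `weather_demo` are hypothetical names), so it shows the mechanics only, not a result about the real plants.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic one-day series at 15-minute steps (hypothetical values).
times = pd.date_range("2020-05-25 05:00", "2020-05-25 20:00", freq="15min")
hours = (times.hour + times.minute / 60).to_numpy()
# Bell-shaped irradiation: zero before 05:30, peaking around 12:30.
irradiation = np.clip(np.sin((hours - 5.5) / 14 * np.pi), 0, None)

weather_demo = pd.DataFrame({
    "DATE_TIME": times,
    "IRRADIATION": irradiation,
    "AMBIENT_TEMPERATURE": 25 + 8 * irradiation + rng.normal(0, 0.5, len(times)),
    "MODULE_TEMPERATURE": 25 + 25 * irradiation + rng.normal(0, 1.0, len(times)),
})
generation_demo = pd.DataFrame({
    "DATE_TIME": times,
    "DC_POWER": np.clip(1000 * irradiation + rng.normal(0, 20, len(times)), 0, None),
})

# Align the two sources on their timestamps, then correlate the numeric columns.
merged = generation_demo.merge(weather_demo, on="DATE_TIME", how="inner")
corr = merged.drop(columns="DATE_TIME").corr()
print(corr["DC_POWER"].round(2))
```

On the real data, the generation frame would first need to be filtered to one inverter (or summed per timestamp), since each timestamp has one weather row but many inverter rows.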

In [None]:
import plotly.express as px

def LineChart(temp_df,columns,start_date_time,end_date_time, title):
    temp_df = temp_df.loc[start_date_time : end_date_time]
    fig = px.line(temp_df[columns])
    fig.update_layout(title_text = title, title_x=0.5)
    fig.show()    

In [None]:
data1 = files[0][files[0].SOURCE_KEY == "3PZuoBAID5Wc2HD"]
data1= data1.set_index('DATE_TIME')

stime = "25-05-2020 05:00"
etime = "25-05-2020 20:00"
LineChart(data1, ["DC_POWER","AC_POWER"],stime,etime,"Power Generation during the Day")

In [None]:
data2 = files[1].set_index('DATE_TIME')

stime = "2020-05-25 05:00:00"
etime = "2020-05-25 20:00:00"
LineChart(data2,["AMBIENT_TEMPERATURE","MODULE_TEMPERATURE"],stime,etime,"Temperature during the day")

In [None]:
data2 = files[1].set_index('DATE_TIME')
stime = "2020-05-25 05:00:00"
etime = "2020-05-25 20:00:00"
LineChart(data2,["IRRADIATION"],stime,etime,"Irradiation during the day")

In [None]:
#check the data distribution and outliers
def BoxPlots(files, column1, column2, titles):
    #one subplot row per file, so the row count follows the input
    fig = make_subplots(rows=len(files), cols=1, subplot_titles=titles)
    for i, file in enumerate(files):
        #values are truncated to integers before plotting
        fig.add_trace(go.Box(x=list(file[column1].astype('int64')), name=column1), row=i+1, col=1)
        fig.add_trace(go.Box(x=list(file[column2].astype('int64')), name=column2), row=i+1, col=1)
    fig.update_layout(height=800, width=1000)
    fig.show()
    

BoxPlots([files[0],files[2]],"DC_POWER","AC_POWER",["Plant-1","Plant-2"])



Plant-1 has a much higher capacity/production of DC power compared to AC power.
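The gap between DC and AC power can be summarized as a single ratio per plant. The helper below is a hedged sketch: `ac_dc_ratio` and the sample values are hypothetical, chosen only to show the arithmetic.

```python
import pandas as pd

def ac_dc_ratio(df):
    """Ratio of total AC_POWER to total DC_POWER over rows with DC_POWER > 0."""
    producing = df[df["DC_POWER"] > 0]
    return producing["AC_POWER"].sum() / producing["DC_POWER"].sum()

# Hypothetical readings: 1450 kW AC out of 1500 kW DC.
demo_power = pd.DataFrame({
    "DC_POWER": [0.0, 500.0, 1000.0],
    "AC_POWER": [0.0, 480.0, 970.0],
})

ratio = ac_dc_ratio(demo_power)
print(round(ratio, 3))  # 0.967
```

Calling `ac_dc_ratio(files[0])` and `ac_dc_ratio(files[2])` would put the two plants side by side. A ratio far from what an inverter could plausibly lose might point to a difference in units between the two columns rather than a real loss, which would be worth checking before drawing conclusions.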


In [None]:
BoxPlots([files[1],files[3]],"AMBIENT_TEMPERATURE","MODULE_TEMPERATURE",["Plant-1","Plant-2"])

In [None]:
def monthly_yield(df, year=2020):
    """Sum DAILY_YIELD per calendar month of `year`; months without data get 0."""
    df = df[df["DATE_TIME"].dt.year == year]
    totals = df.groupby(df["DATE_TIME"].dt.month)["DAILY_YIELD"].sum()
    #show all 12 months so both plants share the same x-axis
    totals = totals.reindex(range(1, 13), fill_value=0)
    return totals.rename_axis("MONTH").reset_index()

plant1 = monthly_yield(files[0])
plant2 = monthly_yield(files[2])

In [None]:
def BarMonth(temp_df,x,y, year, title):
    fig = px.bar(temp_df,x=x,y=y)
    fig.update_layout(title_text = title, title_x=0.5)
    fig.show()  

In [None]:
BarMonth(plant1,'MONTH','DAILY_YIELD',2020,"Plant-1 Monthly Yield")
BarMonth(plant2,'MONTH','DAILY_YIELD',2020,"Plant-2 Monthly Yield")

Months 5 (May) and 6 (June) are both summer months in India, and that is the time of year when we get the most heat.
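If DAILY_YIELD is a running total that resets each day (an assumption about the column, not something checked in this notebook), summing every reading counts the same energy many times. An alternative is to keep only each inverter's final reading per day and sum those. A minimal sketch with hypothetical values; `monthly_energy` is an illustrative name:

```python
import pandas as pd

def monthly_energy(df):
    """Sum each inverter's end-of-day DAILY_YIELD, grouped by month."""
    day = df["DATE_TIME"].dt.normalize()
    # Largest reading per inverter per day = that day's final running total.
    per_day = df.groupby(["SOURCE_KEY", day])["DAILY_YIELD"].max().reset_index()
    month = per_day["DATE_TIME"].dt.month
    return per_day.groupby(month)["DAILY_YIELD"].sum().rename_axis("MONTH")

# Hypothetical readings for one inverter over two days.
demo_yield = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-31 12:00", "2020-05-31 18:00",
        "2020-06-01 12:00", "2020-06-01 18:00",
    ]),
    "SOURCE_KEY": ["inv_a"] * 4,
    "DAILY_YIELD": [100.0, 250.0, 80.0, 300.0],
})

monthly = monthly_energy(demo_yield)
print(monthly)
```

Here May comes out as 250 and June as 300, whereas summing every row would give 350 and 380.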