# Descriptive Analytics

In this notebook, we carry out a descriptive analysis of two solar power plants in India. We cover the following topics:

1. [Load data](#load)
2. [Initial exploration](#explore)
3. [The mean value of daily yield for each plant](#mean)
4. [The total irradiation per day](#total)
5. [The max ambient and module temperature for each plant](#max)
6. [How many inverters are there for each plant?](#Q1)
7. [What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?](#Q2)
8. [Which inverter (source_key) has produced maximum DC/AC power?](#Q3)
9. [The Rank of the inverters based on the DC/AC power they produce](#Q4)
10. [Find the best solar power plant](#best)

Let's get started.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# Import all the packages needed
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import normaltest
import holoviews as hv
from holoviews import opts
import cufflinks as cf
hv.extension('bokeh')

<a id='load'></a>

# 1. Load data

In [None]:
file1 = '/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv'
file2 = '/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv'
file3 = '/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv'
file4 = '/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv'

In [None]:
plant1 = pd.read_csv(file1)
sensor1 = pd.read_csv(file3)

In [None]:
plant2 = pd.read_csv(file2)
sensor2 = pd.read_csv(file4)

In [None]:
plant1.tail()

In [None]:
sensor1.tail()

In [None]:
plant1.info()

In [None]:
sensor1.info()

In [None]:
plant2.tail()

In [None]:
sensor2.tail()

In [None]:
plant2.info()

In [None]:
sensor2.info()

In [None]:
# How many inverters do we have in Plant I and Plant II?
print('We have: \n 1- For Plant I: {} inverters. \n 2- For Plant II: {} inverters.'.format(plant1['SOURCE_KEY'].nunique(),
                                                                                         plant2['SOURCE_KEY'].nunique()))

<a id='explore'></a>

# 2. Initial Exploration

## For plant I

In [None]:
plant1.drop(columns = 'PLANT_ID').describe()

In [None]:
sensor1.drop(columns = 'PLANT_ID').describe()

In [None]:
fig = plt.figure(dpi=100, figsize=(15,10))
fig.subplots_adjust(wspace=0.2, hspace=0.2)
# Keep the original column order (a set has no guaranteed order between runs)
cols = [c for c in plant1.columns if c not in ('PLANT_ID', 'SOURCE_KEY', 'DATE_TIME')]
for i in range(1,5):
    ax = fig.add_subplot(2,2,i)
    sns.violinplot(x=plant1[cols[i-1]], ax=ax)

In [None]:
fid = plt.figure(dpi=100, figsize=(15,10))
fid.subplots_adjust(wspace=0.2, hspace=0.2)
# Keep the original column order (a set has no guaranteed order between runs)
cls = [c for c in sensor1.columns if c not in ('PLANT_ID', 'SOURCE_KEY', 'DATE_TIME')]
for i in range(1,4):
    ax = fid.add_subplot(2,2,i)
    sns.violinplot(x=sensor1[cls[i-1]], ax=ax)

In [None]:
def hist2D(df = None, col1 = None, col2 = None, xlabel = None, ylabel = None):
    '''
        df: DataFrame
        col1, col2: names of the columns in df plotted on the x and y axes
        xlabel, ylabel: axis labels for the plot
    '''
    plt.figure(figsize=(15,5))
    plt.hist2d(df[col1], df[col2], bins = (30, 30))
    plt.colorbar()
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    ax = plt.gca()
    ax.axis('tight')

## For plant II

In [None]:
plant2.drop(columns = 'PLANT_ID').describe()

In [None]:
sensor2.drop(columns = 'PLANT_ID').describe()

In [None]:
figu = plt.figure(dpi=100, figsize=(15,10))
figu.subplots_adjust(wspace=0.2, hspace=0.2)
for i in range(1,5):
    ax = figu.add_subplot(2,2,i)
    sns.violinplot(x=plant2[cols[i-1]], ax=ax)

In [None]:
fis = plt.figure(dpi=100, figsize=(15,10))
fis.subplots_adjust(wspace=0.2, hspace=0.2)
for i in range(1,4):
    ax = fis.add_subplot(2,2,i)
    sns.violinplot(x=sensor2[cls[i-1]], ax=ax)

<font color=blue>We learn: </font>

    1. The sensor data for Plant I is similar to the sensor data for Plant II.
    2. DC_POWER and AC_POWER for Plant II have a median equal to 0, whereas for Plant I the median is different from 0. This means that at least half of the DC/AC power records for Plant II are equal to 0.
    3. DAILY_YIELD for Plant I and Plant II are similar.
    4. Even though only about half of the DC/AC power records for Plant II are non-zero, TOTAL_YIELD for Plant II is much larger than TOTAL_YIELD for Plant I.
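
The link between a zero median and the share of zero readings can be checked directly. The snippet below is a minimal sketch on made-up values; `dc_power` is a hypothetical stand-in for a column such as `plant2['DC_POWER']`, not the real data.

```python
import pandas as pd

# Hypothetical toy readings (made-up values), standing in for plant2['DC_POWER'].
dc_power = pd.Series([0.0, 0.0, 0.0, 120.5, 340.2, 0.0], name="DC_POWER")

# Share of records that are exactly zero.
zero_share = (dc_power == 0).mean()

# With non-negative readings, a median of 0 implies at least half the records are 0.
median_is_zero = dc_power.median() == 0

print(f"zero share: {zero_share:.2f}, median is zero: {median_is_zero}")
```

Running the same two lines on the real columns would show how far above one half the zero share actually is.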

<a id='mean'></a>

# The mean value of daily yield for each plant. 

In [None]:
# Convert the DATE_TIME column from object (string) type to datetime
plant1['DATE_TIME'] = pd.to_datetime(plant1.pop('DATE_TIME'), format='%d-%m-%Y %H:%M')
plant2['DATE_TIME'] = pd.to_datetime(plant2.pop('DATE_TIME'), format='%Y-%m-%d %H:%M')
sensor2['DATE_TIME'] = pd.to_datetime(sensor2.pop('DATE_TIME'), format='%Y-%m-%d %H:%M')
sensor1['DATE_TIME'] = pd.to_datetime(sensor1.pop('DATE_TIME'), format='%Y-%m-%d %H:%M')
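
The conversion above passes an explicit `format` per file because the files do not all write dates in the same order. As a small sketch of why that matters, the example below parses two hypothetical sample strings (made up here, not read from the CSV files) that describe the same instants in two layouts.

```python
import pandas as pd

# Hypothetical sample strings; the real files may use other layouts.
day_first = pd.Series(["15-05-2020 00:15", "16-05-2020 12:30"])
iso_like = pd.Series(["2020-05-15 00:15", "2020-05-16 12:30"])

# An explicit format removes any ambiguity between day-first and year-first strings.
parsed_a = pd.to_datetime(day_first, format="%d-%m-%Y %H:%M")
parsed_b = pd.to_datetime(iso_like, format="%Y-%m-%d %H:%M")

# Both columns now hold the same timestamps and can be compared or merged.
print(parsed_a.equals(parsed_b))
```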

In [None]:
# Drop the time component of DATE_TIME to keep only the date.
plant1['DATE'] = plant1.DATE_TIME.dt.date
plant2['DATE'] = plant2.DATE_TIME.dt.date

In [None]:
mean_daily_yield1 = plant1.groupby(by='DATE')['DAILY_YIELD'].agg('mean').reset_index()
mean_daily_yield2 = plant2.groupby(by='DATE')['DAILY_YIELD'].agg('mean').reset_index()

In [None]:
# Plot the mean daily yield

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(x='DATE', y='DAILY_YIELD', data=mean_daily_yield1)
plt.grid(True)
plt.title('Mean Daily Yield for Plant I.',  weight='bold')
plt.ylabel('MEAN DAILY YIELD')
plt.ylim(2000,5500)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(x='DATE', y='DAILY_YIELD', data=mean_daily_yield2)
plt.grid(True)
plt.title('Mean Daily Yield for Plant II.',  weight='bold')
plt.ylabel('MEAN DAILY YIELD')
plt.ylim(1500,4500)
plt.show()

In [None]:
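# Average only the numeric DAILY_YIELD column; the DATE column cannot be averaged
mean = pd.DataFrame({
    'Mean_Daily_Yield_PLANTI': mean_daily_yield1[['DAILY_YIELD']].mean(),
    'Mean_Daily_Yield_PLANTII': mean_daily_yield2[['DAILY_YIELD']].mean(),
})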

In [None]:
mean.T.style.background_gradient('viridis')

In [None]:
gap = abs(mean.Mean_Daily_Yield_PLANTI.values[0] - mean.Mean_Daily_Yield_PLANTII.values[0])
print('Gap between the two plants for mean daily yield is {} kWh.'.format(round(gap, 2)))

In [None]:
mean.T.plot(kind='pie', subplots=True, figsize=(15,10))
plt.title('Mean Daily Yield Comparison',  weight='bold')
plt.show()

<font color=green> We learn: </font>

    1. Plant I reaches its maximum mean daily yield in May, whereas Plant II reaches its maximum mean daily yield in June.
    2. The gap between the mean daily yield of Plant I and Plant II is only 9.16 kWh.

<a id='total'></a>

# The total irradiation per day

In [None]:
sensor1['DATE'] = sensor1.DATE_TIME.dt.date
sensor2['DATE'] = sensor2.DATE_TIME.dt.date

In [None]:
total_irradiation1 = sensor1.groupby('DATE')['IRRADIATION'].agg('sum').reset_index()
total_irradiation2 = sensor2.groupby('DATE')['IRRADIATION'].agg('sum').reset_index()

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(x='DATE', y='IRRADIATION', data=total_irradiation1)
plt.grid(True)
plt.title('TOTAL IRRADIATION PER DAY FOR PLANT I.',  weight='bold')
plt.ylim(10,30)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(x='DATE', y='IRRADIATION', data=total_irradiation2)
plt.grid(True)
plt.title('TOTAL IRRADIATION PER DAY FOR PLANT II.',  weight='bold')
plt.ylim(10,30)
plt.show()

<font color=red> We learn:</font>

    1. Total irradiation per day for Plant I increases in May and then stabilizes in June.
    2. Total irradiation per day for Plant II decreases steadily from May to June.
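
Trends like these are read off the plots, but they can also be summarized numerically. The sketch below uses a hypothetical `daily` frame with made-up numbers, shaped like `total_irradiation1` except that `DATE` is kept as a datetime column so that `.dt.month` is available.

```python
import pandas as pd

# Hypothetical daily totals (made-up numbers), shaped like total_irradiation1.
daily = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-05-20", "2020-05-25", "2020-06-05", "2020-06-10"]),
    "IRRADIATION": [20.0, 24.0, 18.0, 16.0],
})

# Average daily total per calendar month (index: 5 for May, 6 for June).
monthly_mean = daily.groupby(daily["DATE"].dt.month)["IRRADIATION"].mean()

# True only if the series never increases from one day to the next.
always_decreasing = daily["IRRADIATION"].is_monotonic_decreasing

print(monthly_mean)
print(always_decreasing)
```

A strict monotonic check is rarely satisfied by noisy daily data, so the monthly averages are usually the more useful of the two summaries.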

<a id='max'></a>

# The max ambient and module temperature for each plant.

In [None]:
temp = plt.figure(figsize=(20,5), dpi=100)
temp.subplots_adjust(wspace=0.1)
ax3 = temp.add_subplot(1,2,1)
ax4 = temp.add_subplot(1,2,2)
sns.lineplot(x='DATE', y='AMBIENT_TEMPERATURE', data=sensor1, ax=ax3)
sns.lineplot(x='DATE', y='MODULE_TEMPERATURE', data=sensor1, ax=ax4)
ax3.set_title('AMBIENT TEMPERATURE FOR PLANT I',  weight='bold')
ax4.set_title('MODULE TEMPERATURE FOR PLANT I',  weight='bold')
ax3.grid(True)
ax4.grid(True)
plt.show()

In [None]:
te = plt.figure(figsize=(20,5), dpi=100)
te.subplots_adjust(wspace=0.1)
ax5 = te.add_subplot(1,2,1)
ax6 = te.add_subplot(1,2,2)
sns.lineplot(x='DATE', y='AMBIENT_TEMPERATURE', data=sensor2, ax=ax5)
sns.lineplot(x='DATE', y='MODULE_TEMPERATURE', data=sensor2, ax=ax6)
ax5.set_title('AMBIENT TEMPERATURE FOR PLANT II',  weight='bold')
ax6.set_title('MODULE TEMPERATURE FOR PLANT II',  weight='bold')
ax5.grid(True)
ax6.grid(True)
plt.show()

In [None]:
temp_plant1 = pd.DataFrame(sensor1[['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE']].max(), columns=['PLANT I'])
temp_plant2 = pd.DataFrame(sensor2[['AMBIENT_TEMPERATURE', 'MODULE_TEMPERATURE']].max(), columns=['PLANT II'])

In [None]:
temp_plant1.style.background_gradient('viridis')

In [None]:
temp_plant2.style.background_gradient('viridis')

In [None]:
pie = plt.figure(figsize=(20,10))
pie.subplots_adjust(wspace=0.2)
ax7 = pie.add_subplot(1,2,1)
ax8 = pie.add_subplot(1,2,2)

temp_plant1.plot(kind='pie', subplots=True, ax=ax7)
temp_plant2.plot(kind='pie', subplots=True, ax=ax8)
ax7.set_title('Plant I Max Temperature', weight='bold')
ax8.set_title('Plant II Max Temperature',  weight='bold')
plt.show()

<font color=green>We learn:</font>

    1. May is the hottest month.
    2. The gap between the max ambient temperature of Plant I and Plant II is around 4°C.
    3. The gap between the max module temperature of Plant I and Plant II is around 1°C.
    4. The ratio between max module and max ambient temperature is almost the same in the two plants.
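
The ratio in point 4 is a one-line computation once the column maxima are known. Here is a minimal sketch on a hypothetical `sensor` frame with made-up readings; on the real data the same lines would be applied to `sensor1` and `sensor2`.

```python
import pandas as pd

# Hypothetical sensor readings (made-up values).
sensor = pd.DataFrame({
    "AMBIENT_TEMPERATURE": [24.0, 30.0, 35.0],
    "MODULE_TEMPERATURE": [22.0, 48.0, 63.0],
})

# Column-wise maxima, as in temp_plant1 / temp_plant2.
max_temps = sensor[["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE"]].max()

# Ratio of the hottest module reading to the hottest ambient reading.
ratio = max_temps["MODULE_TEMPERATURE"] / max_temps["AMBIENT_TEMPERATURE"]
print(round(ratio, 2))
```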

<a id='Q1'></a>

# How many inverters are there for each plant?

In [None]:
print('Plant I has {} inverters.'.format(plant1['SOURCE_KEY'].nunique()))

In [None]:
print('Plant II has {} inverters.'.format(plant2['SOURCE_KEY'].nunique()))

<a id='Q2'></a>

# What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?

The answers are:

## Plant I

In [None]:
# The data is recorded every 15 minutes; we first sum over all inverters per timestamp, then move to hourly steps.
plant1_group = plant1.groupby('DATE_TIME')[['AC_POWER', 'DC_POWER']].agg('sum')

In [None]:
# slice [start:stop:step]: starting from index 0, keep every 4th record (one reading per hour).
plant1_group =  plant1_group[0::4].reset_index()
plant1_group['Date'] = plant1_group.DATE_TIME.dt.date

In [None]:
date1 = plant1_group.Date.unique()

In [None]:
maximun1 = []
minimun1 = []

for dt in date1:
    maximun1.append(plant1_group[plant1_group.Date==dt].max())
    minimun1.append(plant1_group[plant1_group.Date==dt].min())

In [None]:
min_plant1 = pd.DataFrame(minimun1)
max_plant1 = pd.DataFrame(maximun1)

### Minimum AC/DC POWER for Plant I

In [None]:
min_plant1

### Maximum AC/DC POWER for Plant I

In [None]:
max_plant1.style.background_gradient('viridis')
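
The cells above keep one 15-minute reading per hour and then loop over dates. As a hedged alternative sketch, the same steps can be written with `resample` and `groupby`. The `power` frame below is hypothetical toy data, and averaging within the hour (option B) is a different choice from the slicing used in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical plant-level power at 15-minute steps (made-up values).
idx = pd.date_range("2020-05-15 10:00", periods=8, freq="15min")
power = pd.DataFrame(
    {"AC_POWER": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]}, index=idx
)

# Option A (as in the notebook): keep one reading per hour.
sampled = power.iloc[0::4]

# Option B: average the four readings inside each hour instead.
hourly_mean = power.resample("60min").mean()

# Daily extremes in one step with groupby instead of a Python loop.
daily_extremes = sampled.groupby(sampled.index.date)["AC_POWER"].agg(["min", "max"])

print(sampled)
print(hourly_mean)
print(daily_extremes)
```

Option A depends on the timestamps being complete and evenly spaced; if some 15-minute records are missing, every 4th row no longer lines up with the hour, whereas `resample` groups by the clock.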

## Plant II

In [None]:
# The data is recorded every 15 minutes; we first sum over all inverters per timestamp, then move to hourly steps.
plant2_group = plant2.groupby('DATE_TIME')[['AC_POWER', 'DC_POWER']].agg('sum')

In [None]:
# slice [start:stop:step]: starting from index 0, keep every 4th record (one reading per hour).
plant2_group =  plant2_group[0::4].reset_index()
plant2_group['Date'] = plant2_group.DATE_TIME.dt.date

In [None]:
date2 = plant2_group.Date.unique()

In [None]:
maximun2 = []
minimun2 = []

for dt in date2:
    maximun2.append(plant2_group[plant2_group.Date==dt].max())
    minimun2.append(plant2_group[plant2_group.Date==dt].min())

In [None]:
min_plant2 = pd.DataFrame(minimun2)
max_plant2 = pd.DataFrame(maximun2)

In [None]:
min_plant2

In [None]:
max_plant2.style.background_gradient('viridis')

<a id = 'Q3'></a>

# Which inverter (source_key) has produced maximum DC/AC power?

In [None]:
inverter1 = plant1.groupby('SOURCE_KEY')[['AC_POWER', 'DC_POWER']].agg('sum')
inverter2 = plant2.groupby('SOURCE_KEY')[['AC_POWER', 'DC_POWER']].agg('sum')

In [None]:
inverter1.plot(kind='bar', subplots=True, figsize=(20,15))
plt.show()

In [None]:
stop1 = inverter1 == inverter1.max()

In [None]:
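# stop1.iloc[:,0] is the AC_POWER column, so this reports the inverter with the highest total AC power
print('The inverter that produced the maximum AC power for Plant I is: {}'.format(inverter1.index[stop1.iloc[:,0]].values[0]))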

In [None]:
inverter2.plot(kind='bar', subplots=True, figsize=(20,15))
plt.show()

In [None]:
stop2 = inverter2 == inverter2.max()

In [None]:
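# stop2.iloc[:,0] is the AC_POWER column, so this reports the inverter with the highest total AC power
print('The inverter that produced the maximum AC power for Plant II is: {}'.format(inverter2.index[stop2.iloc[:,0]].values[0]))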
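
The boolean mask used above only looks at the first column (AC_POWER). A shorter way to get the top inverter for each column separately is `DataFrame.idxmax`. This is a sketch on a hypothetical `inverter` frame whose keys and totals are made up.

```python
import pandas as pd

# Hypothetical per-inverter totals (keys and values are made up).
inverter = pd.DataFrame(
    {"AC_POWER": [900.0, 1200.0, 1100.0], "DC_POWER": [9500.0, 12400.0, 12600.0]},
    index=pd.Index(["inv_A", "inv_B", "inv_C"], name="SOURCE_KEY"),
)

# idxmax returns, for each column, the index label of the largest value.
best = inverter.idxmax()
print(best["AC_POWER"], best["DC_POWER"])
```

In this toy example the best inverter differs between AC and DC, which is the situation a single-column check would miss.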

<a id ='Q4'></a>

# The Rank of the inverters based on the DC/AC power they produce

## Plant I

In [None]:
inverter1.sort_values(by=['AC_POWER'], ascending=False).style.background_gradient('viridis')

## Plant II

In [None]:
inverter2.sort_values(by=['AC_POWER'], ascending=False).style.background_gradient('viridis')
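
Sorting shows the order, but an explicit rank number per inverter can be easier to compare across AC and DC. The following is a minimal sketch with `DataFrame.rank`; the `inverter` frame is hypothetical toy data and the `_RANK` column suffix is chosen here for illustration only.

```python
import pandas as pd

# Hypothetical per-inverter totals (keys and values are made up).
inverter = pd.DataFrame(
    {"AC_POWER": [900.0, 1200.0, 1100.0], "DC_POWER": [9500.0, 12400.0, 12600.0]},
    index=pd.Index(["inv_A", "inv_B", "inv_C"], name="SOURCE_KEY"),
)

# Rank 1 = highest total; method="min" gives tied inverters the same rank.
ranks = inverter.rank(ascending=False, method="min").astype(int)
ranks = ranks.add_suffix("_RANK").sort_values("AC_POWER_RANK")
print(ranks)
```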

# Conclusion

1. The distributions of total yield for Plant I and Plant II are different, but their daily yields are almost the same.

2. The sensor readings for Plant I and Plant II have similar distributions, but irradiation for Plant II decreases over the period.

3. Only about half of the DC/AC power records for Plant II are non-zero (median = 0.0), yet the plant still gives a good yield.

4. Plant I produces a very large DC power, but its AC power is in the same range as the AC power of Plant II. This means that Plant I loses a large amount of power in conversion.

5. The gap between the mean daily yield of Plant I and Plant II is only 9.16 kWh.
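
The conversion loss mentioned in point 4 can be quantified as the ratio of total AC output to total DC input. The sketch below uses a hypothetical `gen` frame with made-up values and assumes both columns are recorded in the same unit; that assumption would need to be checked against the dataset description before reading the result as a loss.

```python
import pandas as pd

# Hypothetical generation records (made-up values), assumed to share one unit.
gen = pd.DataFrame({
    "DC_POWER": [0.0, 1000.0, 2000.0],
    "AC_POWER": [0.0, 97.0, 196.0],
})

# Overall ratio of AC output to DC input; 1 - ratio is the apparent loss.
ratio = gen["AC_POWER"].sum() / gen["DC_POWER"].sum()
apparent_loss = 1 - ratio

print(f"AC/DC ratio: {ratio:.3f}, apparent loss: {apparent_loss:.3f}")
```

Summing before dividing avoids division by zero on the night-time records where both powers are 0.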