# Visualization and further exploration

In this notebook, we seek to understand the behaviour of two solar power plants through the data generated by the photovoltaic modules. To do so, we will talk about:

1. **Reminder on photovoltaic systems or PV systems**
2. **EDA on:**
    - ***DC and AC power***
    - ***Irradiation***
    - ***ambient and module temperature***
    - ***yield***
3. **Correlation of all features**
4. **Comparison of two power plants** 

## Reminder on PV systems

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a0/From_a_solar_cell_to_a_PV_system.svg" width="450"><br/>
</center>


**``PV system``** is a power system designed to supply usable solar power by means of photovoltaics.


**``PV Cell``** is an electrical device that converts the energy of light directly into electricity by the photovoltaic effect, which is a physical and chemical phenomenon. It is also the basic photovoltaic device and the building block of PV modules.

**``Photovoltaic effect``**  is the generation of voltage and electric current in a material upon exposure to light.

**``PV module``** is a group of PV cells connected in series and/or parallel and encapsulated in an environmentally protective laminate.

**``PV panel``** is a group of modules that is the basic building block of a PV array.

**``PV array``** is a group of panels that comprises the complete PV generating unit.

### PV inverter

<center>
<img src="https://www.futuregenerationenergy.ie/wp-content/uploads/2017/03/santnu_new.jpg" width="450"><br/>
</center>

**``PV inverter``** converts DC power from a battery or a PV array into AC power for use with conventional utility-powered appliances. It is the heart of a PV system: because the PV array is a DC source, an inverter is required to convert the DC power into the normal AC power used in our homes and offices.

PV systems are strongly influenced by weather conditions: in good weather we get the maximum yield, and in bad weather we get the minimum yield. That is why it is important to know how weather conditions affect the yield of the two solar power plants.

**Source**
- [Photovoltaic(PV) Tutorial](http://web.mit.edu/taalebi/www/scitech/pvtutorial.pdf)

- [PV Inverter](https://www.futuregenerationenergy.ie/domestic/solar-pv-inverters/)

- [PV Systems](https://en.wikipedia.org/wiki/Photovoltaic_system)


Based on how PV systems work, the important features are:

- *DC power*

- *AC power*

- *Yield*

- *ambient temperature*

- *module temperature*

- *irradiation*

Okay, let's go to the next section.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#import all package needed
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import normaltest
import holoviews as hv
from holoviews import opts
import cufflinks as cf
hv.extension('bokeh')

In [None]:
cf.set_config_file(offline = True)
sns.set(style="whitegrid")

## Plant I: Solar Power Generation data

Plant I contains 22 inverters, each connected to several PV arrays. Each inverter records its data every 15 minutes. So, to know how much power the plant produced at a given time step, we just sum the contributions of the 22 inverters.
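
The aggregation described above can be sketched on a toy table. The inverter keys and power values below are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical readings: two inverters, two 15-minute time steps
toy = pd.DataFrame({
    'DATE_TIME': ['15-05-2020 10:00', '15-05-2020 10:00',
                  '15-05-2020 10:15', '15-05-2020 10:15'],
    'SOURCE_KEY': ['inv_A', 'inv_B', 'inv_A', 'inv_B'],
    'AC_POWER': [100.0, 120.0, 110.0, 130.0],
})

# Plant-level power at each time step = sum over all inverters
plant_total = toy.groupby('DATE_TIME')['AC_POWER'].sum()
print(plant_total)
```

Grouping on ``DATE_TIME`` collapses one row per inverter into one row per time step, which is why the table becomes much shorter after this step.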

In [None]:
#we take file for plant 1 Generation data
file = '/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv'

In [None]:
plant1_data = pd.read_csv(file) # load data

In [None]:
plant1_data.tail()

In [None]:
print('The number of inverter for data_time {} is {}'.format('15-05-2020 23:00', plant1_data[plant1_data.DATE_TIME == '15-05-2020 23:00']['SOURCE_KEY'].nunique()))

In [None]:
plant1_data.info() # we check if there exist missing value

In [None]:
#we compute a sum of 22 inverters
plant1_data = plant1_data.groupby('DATE_TIME')[['DC_POWER','AC_POWER', 'DAILY_YIELD','TOTAL_YIELD']].agg('sum')

In [None]:
plant1_data = plant1_data.reset_index()

In [None]:
plant1_data.head()

**``Cleaning data``**

I convert ``DATE_TIME`` from object type to datetime type. Then I split ``DATE_TIME`` into **date** and **time**.
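
The Plant I generation file stores timestamps day-first (for example ``15-05-2020 23:00``). Depending on the pandas version, a string such as ``01-06-2020`` can be read as 6 January instead of 1 June when no format is given. Below is a minimal sketch, on two made-up strings, of parsing with an explicit format; the format string is an assumption based on the timestamp printed earlier in this notebook.

```python
import pandas as pd

# Two sample strings in the day-first layout seen in the Plant I file
raw = pd.Series(['15-05-2020 23:00', '01-06-2020 00:15'])

# An explicit format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format='%d-%m-%Y %H:%M')

dates = parsed.dt.normalize()  # midnight of each day, kept as datetime
times = parsed.dt.time         # time of day as datetime.time objects
print(parsed)
```

If dates such as ``2020-02-06`` or ``2020-03-06`` show up in later plots, a day/month swap of this kind is one possible cause.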

In [None]:
plant1_data['DATE_TIME'] = pd.to_datetime(plant1_data['DATE_TIME'], errors='coerce')

In [None]:
plant1_data['time'] = plant1_data['DATE_TIME'].dt.time
plant1_data['date'] = pd.to_datetime(plant1_data['DATE_TIME'].dt.date)

In [None]:
plant1_data.shape # the aggregation reduced the number of rows considerably

In [None]:
#we check
plant1_data.head()

In [None]:
plant1_data.info()

### EDA for ``DC power``, ``AC power`` and ``Yield``.

Here, we use

1. Line or scatter plot

2. change rate.

3. Box and Whisker plot

4. calendar plot

5. Bar chart.

## **``DC Power``**

In [None]:
#plant1_data.iplot(x= 'time', y='DC_POWER', xTitle='Time',  yTitle= 'DC Power', title='DC POWER plot')
plant1_data.plot(x= 'time', y='DC_POWER', style='.', figsize = (15, 8))
plant1_data.groupby('time')['DC_POWER'].agg('mean').plot(legend=True, colormap='Reds_r')
plt.ylabel('DC Power')
plt.title('DC POWER plot')
plt.show()

Between 05:33:20 and 18:00:00, the plant produces DC power; outside this window the output is zero. The reason is sunlight.
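
The production window can also be read off the data instead of the plot. The sketch below uses invented readings for a single day; with the real table, the same grouping on ``time`` would apply.

```python
import pandas as pd

# Hypothetical plant-level readings (values invented for illustration)
toy = pd.DataFrame({
    'time': ['05:30', '05:45', '06:00', '12:00', '18:15', '18:30', '18:45'],
    'DC_POWER': [0.0, 0.0, 35.0, 9000.0, 40.0, 0.0, 0.0],
})

# Mean DC power per time of day, then keep only times with production
mean_by_time = toy.groupby('time')['DC_POWER'].mean()
producing = mean_by_time[mean_by_time > 0]

first_time = producing.index.min()
last_time = producing.index.max()
print(first_time, last_time)
```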

In [None]:
#next, we look at the dc power produced by the plant on each day.
#calendar_dc holds, for each date (columns), the dc power at each time of day (rows).

calendar_dc = plant1_data.pivot_table(values='DC_POWER', index='time', columns='date')

In [None]:
calendar_dc.tail()

In [None]:
# helper that draws one small scatter plot per column of a pivoted table

def multi_plot(data=None, row=None, col=None, title='DC Power'):
    """Plot every column of `data` in its own subplot on a row x col grid."""
    cols = data.columns  # one column per date
    gp = plt.figure(figsize=(20, 20))

    gp.subplots_adjust(wspace=0.2, hspace=0.8)
    for i in range(1, len(cols) + 1):
        ax = gp.add_subplot(row, col, i)  # subplot indices start at 1
        data[cols[i-1]].plot(ax=ax, style='k.')
        ax.set_title('{} {}'.format(title, cols[i-1]))

In [None]:
multi_plot(data=calendar_dc, row=9, col=4)

Almost all the curves have the same shape, apart from some fluctuation between 11 am and 2 pm. The exceptions are the curves of May 20 and 25, which have a uniform shape.

In [None]:
daily_dc = plant1_data.groupby('date')['DC_POWER'].agg('sum')

In [None]:
daily_dc.plot.bar(figsize=(15,5), legend=True)
plt.title('Daily DC Power')
plt.show()

DC power reaches its maximum on **``2020-05-25``** only.

## **``Daily Yield``**

In [None]:
plant1_data.plot(x='time', y='DAILY_YIELD', style='b.', figsize=(15,5))
plant1_data.groupby('time')['DAILY_YIELD'].agg('mean').plot(legend=True, colormap='Reds_r')
plt.title('DAILY YIELD')
plt.ylabel('Yield')
plt.show()

The data follow a logistic-like curve, but after ``18:00`` the recorded energy decreases slowly and then drops abruptly at ``00:00``.

In [None]:
#pivot table data
daily_yield = plant1_data.pivot_table(values='DAILY_YIELD', index='time', columns='date')

In [None]:
# we plot all daily yield
multi_plot(data=daily_yield.interpolate(), row=9, col=4, title='DAILY YIELD')

As we can see, the daily yield on some dates (``2020-02-06``, ``2020-05-19``, ...) has a logistic shape with missing values, while on other dates it does not.

Data is recorded every 15 minutes, so after each 15-minute step we get a **new yield**. It is computed with this formula:

``new yield = current yield - previous yield``. This is a first difference, which the pandas ``.diff()`` method computes for us.
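
As a small illustration of the difference formula, here is ``.diff()`` applied to an invented cumulative yield series. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical cumulative daily yield at consecutive 15-minute steps
cumulative = pd.Series(
    [0.0, 0.0, 120.0, 450.0, 900.0],
    index=['05:45', '06:00', '06:15', '06:30', '06:45'],
)

# Each entry is the current value minus the previous one;
# the first entry has no predecessor and is NaN
new_yield = cumulative.diff()
print(new_yield)
```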

In [None]:
#plotting a change rate daily yield over time
multi_plot(data=daily_yield.diff()[daily_yield.diff()>0], row=9, col=4, title='new yield')

Between ``08:20`` and ``16:40``, every 15-minute step gives **$\text{new yield} > 2500$**, with some fluctuation.

**Daily Yield each day**

In [None]:
daily_yield.boxplot(figsize=(18,5), rot=90, grid=False)
plt.title('DAILY YIELD IN EACH DAY')
plt.show()

The daily yield changes from day to day, and on some days it is high. The boxes look clean: there are no outliers.

For further details see
Wikipedia's entry for [``boxplot``](<https://en.wikipedia.org/wiki/Box_plot>).
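
With the default whisker setting, a box plot marks as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to invented values.

```python
import pandas as pd

# Hypothetical readings with one value far from the rest
values = pd.Series([30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 60.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```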

In [None]:
daily_yield.diff()[daily_yield.diff()>0].boxplot(figsize=(18,5), rot=90, grid=False)
plt.title('DAILY YIELD CHANGE RATE EACH 15 MIN EACH DAY')
plt.show()

Only two days have an outlier: **2020-03-06** and **2020-05-21**.

In [None]:
#we compute a daily yield for each date.
dyield = plant1_data.groupby('date')['DAILY_YIELD'].agg('sum')

In [None]:
dyield.plot.bar(figsize=(15,5), legend=True)
plt.title('Daily YIELD')
plt.show()

## Plant_1: Weather Sensor Data

In [None]:
file1 = '/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv'

In [None]:
plant1_sensor = pd.read_csv(file1)

In [None]:
plant1_sensor.head()

In [None]:
plant1_sensor.info()

In [None]:
plant1_sensor['DATE_TIME'] = pd.to_datetime(plant1_sensor['DATE_TIME'], errors='coerce')

In [None]:
# same work cleaning data
plant1_sensor['date'] = pd.to_datetime(pd.to_datetime(plant1_sensor['DATE_TIME']).dt.date)
plant1_sensor['time'] = pd.to_datetime(plant1_sensor['DATE_TIME']).dt.time


del plant1_sensor['PLANT_ID']
del plant1_sensor['SOURCE_KEY']

In [None]:
plant1_sensor.tail()

### EDA for   ``Ambient Temperature``, ``Module Temperature`` and ``Irradiation``

Here, we do

1. Line or scatter plot

2. %change.

3. Box and Whisker plot

4. calendar plot

5. Bar chart.

6. Lag plot

### ``Ambient Temperature``

In [None]:
plant1_sensor.plot(x='time', y = 'AMBIENT_TEMPERATURE' , style='b.', figsize=(15,5))
plant1_sensor.groupby('time')['AMBIENT_TEMPERATURE'].agg('mean').plot(legend=True, colormap='Reds_r')
plt.title('Daily AMBIENT TEMPERATURE MEAN (RED)')
plt.ylabel('Temperature (°C)')
plt.show()

In [None]:
ambient = plant1_sensor.pivot_table(values='AMBIENT_TEMPERATURE', index='time', columns='date')

In [None]:
ambient.tail()

In [None]:
ambient.boxplot(figsize=(15,5), grid=False, rot=90)
plt.title('AMBIENT TEMPERATURE BOXES')
plt.ylabel('Temperature (°C)')

**On which date is the mean ambient temperature highest?**

In [None]:
am_temp = plant1_sensor.groupby('date')['AMBIENT_TEMPERATURE'].agg('mean')

In [None]:
am_temp.plot(grid=True, figsize=(15,5), legend=True, colormap='Oranges_r')
plt.title('AMBIENT TEMPERATURE 15 MAY- 17 JUNE')
plt.ylabel('Temperature (°C)')

**Comment**:

In May, the ambient temperature at Plant 1 was between 24 and 30°C, which means that May was very hot. In June, the ambient temperature dropped considerably, to between 24 and 26°C.

In the next cell, we look at the day-to-day % change of the ambient temperature.
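
A percentage change can be taken relative to the previous value or to the current one. The next cell divides the difference by the current day's value, while pandas' ``pct_change()`` divides by the previous day's value. The sketch below compares both on invented temperatures.

```python
import pandas as pd

# Hypothetical daily mean temperatures in °C
daily_mean = pd.Series(
    [28.0, 25.2, 29.0],
    index=pd.to_datetime(['2020-05-17', '2020-05-18', '2020-05-19']),
)

# Relative to the previous day: (x_t - x_{t-1}) / x_{t-1}
rel_prev = daily_mean.pct_change() * 100

# Relative to the current day: (x_t - x_{t-1}) / x_t
rel_curr = daily_mean.diff() / daily_mean * 100

print(rel_prev.round(2))
print(rel_curr.round(2))
```

The two conventions give close but not identical numbers, so percentages read off the plot depend on which one is used.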

In [None]:
am_change_temp = (am_temp.diff()/am_temp)*100

In [None]:
am_change_temp.plot(figsize=(15,5), grid=True, legend=True)
plt.ylabel('%change')
plt.title('AMBIENT TEMPERATURE %change')

**Comment**

1. From Sunday 17 May 2020 to Monday 18 May 2020, the ambient temperature decreases by about 10%.

2. From Monday 18 May 2020 to Tuesday 19 May 2020, the ambient temperature increases by about 15%, and the next day it decreases by about 5%.

3. From Wednesday 20 May 2020 to Thursday 21 May 2020, the ambient temperature increases by about 10%, and the next day it decreases by about 15%.

4. In June, the % change of the ambient temperature stabilizes between -2.5 and 2.5%.

## Ambient Temperature: seasonal, trend and residual.

In [None]:
from scipy.signal import periodogram

In [None]:
decomp = sm.tsa.seasonal_decompose(am_temp)

In [None]:
cols = ['trend', 'seasonal', 'resid'] # take all column
data = [decomp.trend, decomp.seasonal, decomp.resid]
gp = plt.figure(figsize=(15,15)) 
    
gp.subplots_adjust(hspace=0.5)
for i in range(1, len(cols)+1):
    ax = gp.add_subplot(3,1, i)
    data[i-1].plot(ax=ax)
    ax.set_title('{}'.format(cols[i-1]))

**Comment**

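The seasonal component of the ambient temperature has a period of **7 days**: the temperature maximum recurs every 7 days. Note that for a daily series ``seasonal_decompose`` uses a period of 7 by default when ``period`` is not passed, so this period comes from the decomposition settings.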

### ``Module Temperature``

In [None]:
plant1_sensor.plot(x='time', y='MODULE_TEMPERATURE', figsize=(15,8), style='b.')
plant1_sensor.groupby('time')['MODULE_TEMPERATURE'].agg('mean').plot(colormap='Reds_r', legend=True)
plt.title('DAILY MODULE TEMPERATURE & MEAN(red)')
plt.ylabel('Temperature(°C)')

In [None]:
module_temp = plant1_sensor.pivot_table(values='MODULE_TEMPERATURE', index='time', columns='date')

In [None]:
module_temp.boxplot(figsize=(15,5), grid=False, rot=90)
plt.title('MODULE TEMPERATURE BOXES')
plt.ylabel('Temperature (°C)')

**Comment**

Four dates contain outliers: **18-05-2020, 30-05-2020, 31-05-2020, 01-06-2020**. On these dates the outliers occur in the time interval $[11:06:40, 16:40]$, as the per-date plots in the next cell show.

In [None]:
multi_plot(module_temp, row=9,  col=4, title='Module Temp.')

In [None]:
#we can also look at the calendar plot
mod_temp = plant1_sensor.groupby('date')['MODULE_TEMPERATURE'].agg('mean')

In [None]:
mod_temp.plot(grid=True, figsize=(15,5), legend=True)
plt.title('MODULE TEMPERATURE 15 MAY- 17 JUNE')
plt.ylabel('Temperature (°C)')

**Comment**

In May, two dates stand out as very hot: the 21st and the 29th.

In [None]:
#we plot a %change of MODULE TEMPERATURE.
chan_mod_temp = (mod_temp.diff()/mod_temp)*100

In [None]:
chan_mod_temp.plot(grid=True, legend=True, figsize=(15,5))
plt.ylabel('%change')
plt.title('MODULE TEMPERATURE %change')

### ``Irradiation``

In [None]:
plant1_sensor.plot(x='time', y = 'IRRADIATION', style='.', legend=True, figsize=(15,5))
plant1_sensor.groupby('time')['IRRADIATION'].agg('mean').plot(legend=True, colormap='Reds_r')
plt.title('IRRADIATION')

In [None]:
irra = plant1_sensor.pivot_table(values='IRRADIATION', index='time', columns='date')

In [None]:
irra.tail()

In [None]:
irra.boxplot(figsize=(15,5), rot = 90, grid=False)
plt.title('IRRADIATION BOXES')

In [None]:
rad = plant1_sensor.groupby('date')['IRRADIATION'].agg('sum')

In [None]:
rad.plot(grid=True, figsize=(15,5), legend=True)
plt.title('IRRADIATION 15 MAY- 17 JUNE')

**N.B.** Thursday 21 May 2020 is the date on which Plant 1:

1. produces the most DC power.

2. reaches its maximum ambient temperature and module temperature.

This date is very special.

# Correlation

In this part, we compute the correlation between features to see how well one feature can explain another, or more generally how they are related.
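
The cells that follow use Spearman correlation, which works on ranks and therefore captures monotonic relations even when they are not linear. Here is a minimal sketch on invented data, comparing it with Pearson correlation.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})
toy['y'] = toy['x'] ** 3

# Rank-based correlation: perfect for any strictly increasing relation
spearman = toy.corr(method='spearman').loc['x', 'y']

# Linear correlation: below 1 because the relation is curved
pearson = toy.corr(method='pearson').loc['x', 'y']

print(spearman, pearson)
```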

In [None]:
# we merge the solar power generation data with the weather sensor data
power_sensor = plant1_sensor.merge(plant1_data, left_on='DATE_TIME', right_on='DATE_TIME')

In [None]:
power_sensor.tail(3)

In [None]:
#we remove the columns that we do not need
del power_sensor['date_x']
del power_sensor['date_y']
del power_sensor['time_x']
del power_sensor['time_y']

In [None]:
power_sensor.tail(3)

In [None]:
power_sensor.info()

In [None]:
#we start correlation (numeric columns only, so DATE_TIME is left out)
power_sensor.select_dtypes('number').corr(method = 'spearman')

**Comment**

``DAILY_YIELD`` shows little correlation with the other features, apart from a moderate correlation with ``AMBIENT_TEMPERATURE``.

``TOTAL_YIELD`` is also not correlated with the other features. I remove both yield columns from the correlation matrix.

In [None]:
corr = power_sensor.select_dtypes('number').drop(columns=['DAILY_YIELD', 'TOTAL_YIELD']).corr(method = 'spearman')

In [None]:
plt.figure(dpi=100)
sns.heatmap(corr, robust=True, annot=True, fmt='0.3f', linewidths=.5, square=True)
plt.show()

In [None]:
# we make pairplot
sns.pairplot(power_sensor.drop(columns=['DAILY_YIELD', 'TOTAL_YIELD']))
plt.show()

In [None]:
#we plot dc power vs ac power

In [None]:
plt.figure(dpi=100)
sns.lmplot(x='DC_POWER', y='AC_POWER', data=power_sensor)
plt.title('Regression plot')
plt.show()

**Comment**

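This graph shows that the inverters convert DC power to AC power linearly, with $\text{DC power} \approx 10 \times \text{AC power}$. Read literally, the inverters lose 90% of the power during conversion. A loss this large is far more than expected from a real inverter, so a difference in scale or units between the ``DC_POWER`` and ``AC_POWER`` columns may also explain the factor of 10.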
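
The factor relating the two columns can be estimated as the slope of a straight-line fit. The sketch below uses synthetic data built with a known ratio of 0.1, so it only illustrates the method and says nothing about the real plant.

```python
import numpy as np

# Synthetic data: AC = 0.1 * DC plus a little noise (assumed for illustration)
rng = np.random.default_rng(0)
dc = np.linspace(0.0, 10000.0, 50)
ac = 0.1 * dc + rng.normal(0.0, 1.0, size=dc.size)

# Degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(dc, ac, 1)

print('AC per unit of DC:', round(slope, 3))
print('DC per unit of AC:', round(1.0 / slope, 1))
```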

In [None]:
plt.figure(dpi=100)
sns.lmplot(x='AMBIENT_TEMPERATURE', y='DC_POWER', data=power_sensor)
plt.title('Regression plot')
plt.show()

**comment**

``DC_POWER`` increases nonlinearly with ``AMBIENT_TEMPERATURE``.

In [None]:
plt.figure(dpi=100)
sns.lmplot(x='MODULE_TEMPERATURE', y='DC_POWER', data=power_sensor)
plt.title('Regression plot')
plt.show()

**comment**

``DC_POWER`` increases roughly linearly with ``MODULE_TEMPERATURE``, with some variability.

In [None]:
plt.figure(dpi=100)
sns.lmplot(x='IRRADIATION', y='DC_POWER', data=power_sensor)
plt.title('Regression plot')
plt.show()

**Comment**

``DC_POWER`` increases with ``IRRADIATION``.

What happens if I introduce the temperature difference between ``AMBIENT_TEMPERATURE`` and ``MODULE_TEMPERATURE``?

In [None]:
# we introduce DELTA_TEMPERATURE
power_sensor['DELTA_TEMPERATURE'] = abs(power_sensor.AMBIENT_TEMPERATURE - power_sensor.MODULE_TEMPERATURE)

In [None]:
# we check if all is ok
power_sensor.tail(3)

In [None]:
#now we use correlation (numeric columns only)
power_sensor.select_dtypes('number').corr(method='spearman')['DELTA_TEMPERATURE']

**comment**

We see that the yield does not depend on ``DELTA_TEMPERATURE`` either.

In [None]:
sns.lmplot(x='DELTA_TEMPERATURE', y='DC_POWER', data=power_sensor)
plt.title('correlation between DC_POWER and DELTA_TEMPERATURE')

**comment**

We know that $\dot Q \propto \Delta T$. So we could say that ``DC_POWER`` is influenced by heat transfer.

In [None]:
sns.lmplot(x='DELTA_TEMPERATURE', y='IRRADIATION', data=power_sensor)
plt.title('Regression plot')

**comment**

The ``IRRADIATION`` on the module and the heat transfer between the ambient air and the module are very well correlated.

**short conclusion**

In this section, we conclude that:

1. Yield does not depend on the temperature, the DC/AC power or the irradiation.

2. The transfer function between DC and AC power is linear.

3. DC power is indeed influenced by the ambient temperature, by the module temperature, by the irradiation and finally by the heat transfer between the module and the air.

4. All 22 inverters of Plant I appear to lose 90% of their DC power during conversion.

## Comparison of two power plants

### Plant 1 data vs Plant2 data

In [None]:
file2 = '/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv'

In [None]:
plant2_data = pd.read_csv(file2)

In [None]:
plant2_data.head(3)

In [None]:
plant2_data.info()

In [None]:
#we compute a sum of 22 inverters
plant2_data = plant2_data.groupby('DATE_TIME')[['DC_POWER','AC_POWER', 'DAILY_YIELD','TOTAL_YIELD']].agg('sum').reset_index()

In [None]:
plant2_data['DATE_TIME'] = pd.to_datetime(plant2_data['DATE_TIME'], errors='coerce')
plant2_data['time'] = plant2_data['DATE_TIME'].dt.time
plant2_data['date'] = pd.to_datetime(plant2_data['DATE_TIME'].dt.date)

In [None]:
plant2_data.tail(3)

In [None]:
plant2_data.info()

In [None]:
#we compare the dc power of the two plants
ax = plant1_data.plot(x='time', y='DC_POWER', figsize=(15,5), legend=True, style='b.')
plant2_data.plot(x='time', y='DC_POWER', legend=True, style='r.', ax=ax)
plt.title('Plant1(blue) vs Plant2(red)')
plt.ylabel('Power (KW)')

Plant 1 produces about 6 times more DC power than Plant 2 each day.

In [None]:
#we compare the ac power of the two plants
ax1 = plant1_data.plot(x='time', y='AC_POWER', figsize=(15,5), legend=True, style='b.', )
plant2_data.plot(x='time', y='AC_POWER', legend=True, style='r.', ax=ax1)
plt.title('Plant1(blue) vs Plant2(red)')
plt.ylabel('Power (KW)')

The two plants produce almost the same AC power.

In [None]:
p2_daily_dc = plant2_data.groupby('date')['DC_POWER'].agg('sum')

In [None]:
axh = daily_dc.plot.bar(legend=True, figsize=(15,5), color='Blue', label='DC_POWER Plant I')
p2_daily_dc.plot.bar(legend=True, color='Red', label='DC_POWER Plant II', stacked=False)
plt.title('DC POWER COMPARISON')
plt.ylabel('Power (KW)')
plt.show()

On every date Plant I produces a huge amount of DC power, whereas Plant II reaches almost 1 GW.

In [None]:
daily_ac = plant1_data.groupby('date')['AC_POWER'].agg('sum')
p2_daily_ac = plant2_data.groupby('date')['AC_POWER'].agg('sum')

In [None]:
ac = daily_ac.plot.bar(legend=True, figsize=(15,5), color='Blue', label='AC_POWER Plant I')
p2_daily_ac.plot.bar(legend=True, color='Red', label='AC_POWER Plant II')
plt.title('AC POWER COMPARISON')
plt.ylabel('Power (KW)')
plt.show()

Plant I and Plant II produce almost the same AC power on each day.

In [None]:
#compute daily_yield for each date
p2_dyield = plant2_data.groupby('date')['DAILY_YIELD'].agg('sum')

In [None]:
dy = dyield.plot.bar(figsize=(15,5), legend=True, label='DAILY_YIELD PLANT I', color='Blue')
p2_dyield.plot.bar(legend=True, label='DAILY_YIELD PLANT II', color='Red')
plt.ylabel('Energy (KWh)')
plt.title('DAILY YIELD COMPARISON')

Plant I and Plant II have almost the same daily yield, although on certain days they differ.
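
One caveat on the bars above: if ``DAILY_YIELD`` is a running total that resets each day (an assumption, suggested by the logistic-like daily curves seen earlier), then summing its 15-minute readings counts the same energy several times. Taking the day's maximum is another way to read the energy per day. The sketch below contrasts the two on invented values.

```python
import pandas as pd

# Hypothetical cumulative daily yield, four readings per day
toy = pd.DataFrame({
    'date': ['2020-05-15'] * 4 + ['2020-05-16'] * 4,
    'DAILY_YIELD': [0.0, 100.0, 300.0, 300.0, 0.0, 50.0, 200.0, 200.0],
})

by_sum = toy.groupby('date')['DAILY_YIELD'].sum()  # what the bar charts use
by_max = toy.groupby('date')['DAILY_YIELD'].max()  # final running total

print(pd.DataFrame({'sum': by_sum, 'max': by_max}))
```

Both versions rank the days similarly in this toy case, but only the maximum is in energy units under the stated assumption.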

In [None]:
#compute a average total_yield for plant I for each day
tyield = plant1_data.groupby('date')['TOTAL_YIELD'].agg('mean')

#compute a average total_yield for plant II for each day
p2_tyield = plant2_data.groupby('date')['TOTAL_YIELD'].agg('mean')

In [None]:
aver = p2_tyield.plot.bar(figsize=(15,5), legend=True, label='AVERAGE TOTAL YIELD PLANT II', color='Red')
tyield.plot.bar(legend=True, label='AVERAGE TOTAL YIELD PLANT I', color='Blue',ax=aver)

On every date, the gap between the average total yield of Plant II and that of Plant I is very large.

## Plant I weather sensor vs Plant II weather sensor

In [None]:
file3 = '/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv'

In [None]:
plant2_sensor = pd.read_csv(file3)

In [None]:
plant2_sensor.tail()

In [None]:
plant2_sensor.info()

In [None]:
plant2_sensor['DATE_TIME'] = pd.to_datetime(plant2_sensor['DATE_TIME'], errors='coerce')

In [None]:
# same work cleaning data for plant II
plant2_sensor['date'] = pd.to_datetime(pd.to_datetime(plant2_sensor['DATE_TIME']).dt.date)
plant2_sensor['time'] = pd.to_datetime(plant2_sensor['DATE_TIME']).dt.time


del plant2_sensor['PLANT_ID']
del plant2_sensor['SOURCE_KEY']

In [None]:
plant2_sensor.head()

In [None]:
plant1_sensor[['AMBIENT_TEMPERATURE','MODULE_TEMPERATURE','time']].plot(x='time', label='Plant I', title='PLANT I', figsize=(15,5), style='.')
plant2_sensor[['AMBIENT_TEMPERATURE','MODULE_TEMPERATURE','time']].plot(x='time', label='Plant II', title='PLANT II', figsize=(15,5), style='.')
plt.ylabel('Temperature (°C)')

In [None]:
#compare IRRADIATION PLANT I VS PLANT II
aq = plant1_sensor.plot(x='time', y='IRRADIATION', legend=True, label='IRRADIATION PLANT I', color='Blue', style='.', figsize=(15,5))
plant2_sensor.plot(x='time', y='IRRADIATION', legend=True, label='IRRADIATION PLANT II',  color='Red', style='.', ax=aq)
plt.title('IRRADIATION COMPARISON')

Plant I and Plant II have the same IRRADIATION distribution between 05:33:20 and 18:00:00.

## Correlation for Plant II

In [None]:
# we are merging our solar power generation data and weather sensor data for plant 2
sensorData = plant2_sensor.merge(plant2_data, left_on='DATE_TIME', right_on='DATE_TIME')

In [None]:
#we remove the columns that we do not need
del sensorData['date_x']
del sensorData['date_y']
del sensorData['time_x']
del sensorData['time_y']

In [None]:
sensorData.tail()

I create six new features: DELTA_TEMPERATURE, NEW_DAILY_YIELD, NEW_TOTAL_YIELD, NEW_AMBIENT_TEMPERATURE, NEW_MODULE_TEMPERATURE and NEW_AC_POWER.

delta temperature = |module temperature - ambient temperature|. Every other new variable is just the first difference in time of the corresponding column:

1. new daily yield is the current daily yield - previous daily yield.
2. new total yield is the current total yield - previous total yield.
3. new ambient temperature is the current ambient temperature - previous ambient temperature.

and so on. Keep in mind that consecutive records are 15 minutes apart.
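
A plain difference over the whole table also subtracts across midnight, where the daily yield starts again from zero, and this produces large negative values. The sketch below, on invented readings, compares a plain ``.diff()`` with one computed separately within each day. The per-day grouping is shown only as an alternative; the notebook itself uses the plain difference.

```python
import pandas as pd

# Hypothetical readings around midnight (values invented)
toy = pd.DataFrame({
    'DATE_TIME': pd.to_datetime(['2020-05-15 23:30', '2020-05-15 23:45',
                                 '2020-05-16 00:00', '2020-05-16 00:15']),
    'DAILY_YIELD': [5000.0, 5000.0, 0.0, 0.0],
})

# Plain difference: the reset at midnight shows up as -5000
plain = toy['DAILY_YIELD'].diff()

# Difference within each calendar day: the first record of a day is NaN
per_day = toy.groupby(toy['DATE_TIME'].dt.date)['DAILY_YIELD'].diff()

print(pd.DataFrame({'plain': plain, 'per_day': per_day}))
```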

In [None]:
sensorData = sensorData.assign(DELTA_TEMPERATURE = abs(sensorData.MODULE_TEMPERATURE - sensorData.AMBIENT_TEMPERATURE),
                              NEW_DAILY_YIELD = sensorData.DAILY_YIELD.diff(),
                              NEW_TOTAL_YIELD = sensorData.TOTAL_YIELD.diff(),
                              NEW_AMBIENT_TEMPERATURE = sensorData.AMBIENT_TEMPERATURE.diff(),
                              NEW_MODULE_TEMPERATURE = sensorData.MODULE_TEMPERATURE.diff(),
                              NEW_AC_POWER = sensorData.AC_POWER.diff())

In [None]:
#see
sensorData.head()

In [None]:
sensorData.select_dtypes('number').corr(method='spearman').style.background_gradient('viridis')

In [None]:
plt.figure(dpi=100, figsize=(15,10))
sns.heatmap(sensorData.select_dtypes('number').corr(method='spearman'), robust=True, annot=True, fmt='0.2f', linewidths=.5, square=False)
plt.show()

In Plant II, ``TOTAL_YIELD`` is negatively correlated with every feature except ``DAILY_YIELD``.

In [None]:
#we plot ac vs dc power
sns.lmplot(x='DC_POWER', y='AC_POWER', data=sensorData)
plt.title('Regression plot')

In Plant II, DC power = AC power: the inverters appear to lose about 0% of the power.

In [None]:
#we plot New DAILY YIELD vs ac power
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(x='AC_POWER', y='NEW_DAILY_YIELD', data=sensorData)
plt.title('Regression plot')

**We learn**

1. For AC_POWER < 5000 KW, NEW_DAILY_YIELD is negative.
2. For AC_POWER between 5000 KW and 12000 KW, NEW_DAILY_YIELD is both positive and negative.
3. For AC_POWER > 12000 KW, NEW_DAILY_YIELD is positive.

In [None]:
#we plot New DAILY YIELD vs IRRADIATION
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(x='IRRADIATION', y='NEW_DAILY_YIELD', data=sensorData)
plt.title('Regression plot')

**We learn**

NEW_DAILY_YIELD takes both positive and negative values across the whole range of irradiation.

In [None]:
#we plot New DAILY YIELD vs MODULE TEMPERATURE
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(x='MODULE_TEMPERATURE', y='NEW_DAILY_YIELD', data=sensorData)
plt.title('Regression plot')

**We learn**

For MODULE_TEMPERATURE < 30°C, NEW_DAILY_YIELD is negative. This suggests that the PV panels produce energy when the module temperature is around 35°C.

In [None]:
#we plot New DAILY YIELD vs DELTA TEMPERATURE
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(x='DELTA_TEMPERATURE', y='NEW_DAILY_YIELD', data=sensorData)
plt.title('Regression plot')

**We learn**

NEW_DAILY_YIELD is negative only if DELTA_TEMPERATURE < 5°C. This means that the daily yield decreases from one 15-minute record to the next when the temperature difference between the module and the ambient air is less than 5°C.

In [None]:
#we plot New TOTAL YIELD vs New daily yield
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(y='NEW_TOTAL_YIELD', x='NEW_DAILY_YIELD', data=sensorData)
plt.title('Regression plot')

In [None]:
#we plot New AC POWER vs New MODULE TEMPERATURE
plt.figure(dpi=(100), figsize=(15,5))
sns.regplot(y='NEW_AC_POWER', x='NEW_MODULE_TEMPERATURE', data=sensorData)
plt.title('Regression plot')

New AC power is the change in AC power between two consecutive records. We gain AC power only when the new module temperature is between -5 and 5°C.

**We learn**

1. When new daily yield decreases, new total yield decreases; when new daily yield increases, new total yield increases.
2. When new daily yield is zero, new total yield is zero.

**General Conclusion**

Throughout this notebook, we have seen that:
1. Plant I produces 6 times more DC power than Plant II, and appears to lose 90% of it when converting to AC power.
2. Plant II loses almost nothing when converting DC power to AC power.

3. AC power output is almost the same for both plants.

4. The daily yield is almost the same for the two plants.

5. The gap between the average total yield of Plant I and that of Plant II is very large.

6. Daily yield decreases if delta temperature is less than 5°C.

7. Daily yield decreases for some values of AC power.

END.

**Feel free to comment, share and download, and give your opinion on this work. Thanks**