# Story on Power Plant in India

In this notebook, we seek to understand the behaviour of two solar power plants through the data generated by their photovoltaic modules and to extract relevant information. To do so, we will cover:

1. **Reminder on photovoltaic systems or PV systems**
2. **Comparison of two power plants**
    - *Descriptive analysis (Weather condition and Generation data)*
    - *EDA (Weather condition and Generation data)*
    
3. **Identify relevant information**
4. **Conclusion**

## Reminder on PV systems

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a0/From_a_solar_cell_to_a_PV_system.svg" width="450"><br/>
</center>


**``PV system``** is a power system designed to supply usable solar power by means of photovoltaics.


**``PV Cell``** is an electrical device that converts the energy of light directly into electricity by the photovoltaic effect, which is a physical and chemical phenomenon. It is also the basic photovoltaic device that is the building block of PV modules.

**``Photovoltaic effect``**  is the generation of voltage and electric current in a material upon exposure to light.

**``PV module``** is a group of PV cells connected in series and/or parallel and encapsulated in an environmentally protective laminate.

**``PV panel``** is a group of modules that is the basic building block of a PV array.

**``PV array``** is a group of panels that comprises the complete PV generating unit.

### PV inverter

<center>
<img src="https://www.futuregenerationenergy.ie/wp-content/uploads/2017/03/santnu_new.jpg" width="450"><br/>
</center>

**``PV inverter``** converts battery or PV array DC power to AC power for use with conventional utility-powered appliances. It is the heart of a PV system: because the PV array is a DC source, an inverter is required to convert the DC power to the normal AC power used in our homes and offices.

PV systems are strongly influenced by weather conditions: in good weather the yield is at its maximum, and in bad weather it is at its minimum. That is why it is important to know how weather conditions affect the yield of the two solar power plants.

**Source**
- [Photovoltaic(PV) Tutorial](http://web.mit.edu/taalebi/www/scitech/pvtutorial.pdf)

- [PV Inverter](https://www.futuregenerationenergy.ie/domestic/solar-pv-inverters/)

- [PV Systems](https://en.wikipedia.org/wiki/Photovoltaic_system)


Okay, let's go to the next section.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#import all package needed
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import normaltest
import holoviews as hv
from holoviews import opts
import cufflinks as cf
hv.extension('bokeh')

In [None]:
sns.set(style="whitegrid")

# Data loading and preparation

We start with the weather sensor data for plants I and II.

In [None]:
file1 ='/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv'
file2 = '/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv'

In [None]:
sensor1 = pd.read_csv(file1)
sensor2 = pd.read_csv(file2)

In [None]:
sensor1.tail()

In [None]:
sensor2.tail()

In [None]:
sensor1.info()

In [None]:
sensor2.info()

In [None]:
sensor1['date'] = pd.to_datetime(sensor1['DATE_TIME']).dt.date
sensor2['date'] = pd.to_datetime(sensor2['DATE_TIME']).dt.date
sensor1['time'] = pd.to_datetime(sensor1['DATE_TIME']).dt.time
sensor2['time'] = pd.to_datetime(sensor2['DATE_TIME']).dt.time
del sensor1['DATE_TIME']
del sensor2['DATE_TIME']
del sensor1['PLANT_ID']
del sensor2['PLANT_ID']
del sensor1['SOURCE_KEY']
del sensor2['SOURCE_KEY']

In [None]:
sensor2.head()

In [None]:
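# Note: merging on 'date' alone is a many-to-many join, so every Plant 1 reading of a day
# is paired with every Plant 2 reading of the same day (time_PLANT1 and time_PLANT2 can differ).
sensor = sensor1.merge(sensor2, left_on='date', right_on='date', suffixes=('_PLANT1', '_PLANT2'))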

In [None]:
sensor.head()

In [None]:
sensor.info()
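
Because the merge above joins on `date` alone, the row count of `sensor` grows much faster than the number of readings. The following is a minimal sketch, on toy data with hypothetical values, comparing that join with one that aligns the two sensors on the full timestamp. The names `s1`, `s2`, `aligned` and `by_date` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy sensor readings (hypothetical values), two timestamps per plant
s1 = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 00:00:00', '2020-05-15 00:15:00'],
    'AMBIENT_TEMPERATURE': [25.2, 25.1],
})
s2 = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 00:00:00', '2020-05-15 00:15:00'],
    'AMBIENT_TEMPERATURE': [27.0, 26.9],
})
for df in (s1, s2):
    df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])

# Joining on the full timestamp keeps one row per reading time
aligned = s1.merge(s2, on='DATE_TIME', suffixes=('_PLANT1', '_PLANT2'))

# Joining on the date only pairs every reading with every reading of that day
by_date = s1.assign(date=s1['DATE_TIME'].dt.date).merge(
    s2.assign(date=s2['DATE_TIME'].dt.date),
    on='date', suffixes=('_PLANT1', '_PLANT2'))

print(len(aligned), len(by_date))
```

With a timestamp join, cross-plant statistics compare readings taken at the same moment. The rest of this notebook keeps the date-only join.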

Next, we load the generation data for plants I and II.

In [None]:
file3 = '/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv'
file4 = '/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv'

In [None]:
data1 = pd.read_csv(file3)
data2 = pd.read_csv(file4)

In [None]:
data1.tail()

In [None]:
data2.tail()

In [None]:
data1.info()

In [None]:
data2.info()

In [None]:
print('Number of Inverters in Plant I: {}'.format(data1.SOURCE_KEY.nunique()))

In [None]:
print('Number of Inverters in Plant II: {}'.format(data2.SOURCE_KEY.nunique()))

In [None]:
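# Note: the two files are paired by row position here, not by timestamp or inverter,
# so a row may combine readings taken at different times in the two plants.
df1 = data1.reset_index()
df2 = data2.reset_index()
data = df1.merge(df2, left_on='index', right_on='index', suffixes=('_PLANT1', '_PLANT2'))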
del data['index']
del data['PLANT_ID_PLANT1']
del data['PLANT_ID_PLANT2']

In [None]:
data.head()

In [None]:
data.info()
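
The positional merge above pairs row *i* of Plant 1 with row *i* of Plant 2, whatever their timestamps. As a hedged sketch, one alternative is to sum over the inverters of each plant per timestamp first, then put the two totals side by side. The toy values, the inverter keys and the helper `plant_totals` are assumptions made for illustration.

```python
import pandas as pd

# Toy generation data (hypothetical values and inverter keys)
g1 = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 12:00', '2020-05-15 12:00', '2020-05-15 12:15'],
    'SOURCE_KEY': ['inv_a', 'inv_b', 'inv_a'],
    'AC_POWER': [100.0, 110.0, 120.0],
})
g2 = pd.DataFrame({
    'DATE_TIME': ['2020-05-15 12:00', '2020-05-15 12:15', '2020-05-15 12:15'],
    'SOURCE_KEY': ['inv_x', 'inv_x', 'inv_y'],
    'AC_POWER': [90.0, 95.0, 105.0],
})

def plant_totals(df):
    """Sum AC power over all inverters for each timestamp."""
    out = df.copy()
    out['DATE_TIME'] = pd.to_datetime(out['DATE_TIME'])
    return out.groupby('DATE_TIME')['AC_POWER'].sum()

# One row per timestamp, one column per plant
totals = pd.concat(
    [plant_totals(g1).rename('AC_POWER_PLANT1'),
     plant_totals(g2).rename('AC_POWER_PLANT2')],
    axis=1,
)
print(totals)
```

Correlations computed on such a frame compare the two plants at matching times, which the positional merge does not guarantee.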

# Comparison of two power plants


## Descriptive Analytics: Weather condition

In [None]:
sensor.head()

In [None]:
sensor.describe()

According to the summary statistics, the maximum, median and mean *ambient temperatures of the two plants* differ by about **2°C**, and the same holds for the module temperature. Irradiation, however, is almost the same at both plants.

**We learn**

- *Power plant 2 is hotter than power plant 1.* 

In [None]:
plt.figure(figsize=(10,10))
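# Keep numeric columns only: recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(sensor.select_dtypes('number').corr(), annot=True, center=0, square=True)
plt.title('Correlation between weather features')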
plt.show()

From this correlation matrix, we see that for both plants module temperature and irradiation are the most strongly correlated pair, and that ambient temperature and module temperature are also strongly correlated.
No relation appears between the weather features of one plant and those of the other.

**We learn**

- The two power plants are probably not installed in the same region/province: plant 2 appears to be in a hotter region/province than plant 1.

- The PV systems in plant 1 and plant 2 appear to be the same manufactured product.

In [None]:
print('Maximum values (°C for temperature, W/m^2 for irradiation):')
sensor.describe().loc['max', :]

### Total irradiation per day

In [None]:
total_irrad_per_day = sensor.groupby('date')[['IRRADIATION_PLANT1', 'IRRADIATION_PLANT2']].agg('sum')

In [None]:
total_irrad_per_day.head()

In [None]:
plt.figure(figsize=(15,15), dpi=100)
sns.heatmap(total_irrad_per_day, annot=True, fmt=".6g", center=0)
plt.title('TOTAL IRRADIATION PER DAY IN EACH PLANT')
plt.show()

On some days the two plants receive low total irradiation, and on others high total irradiation. We split the data to see on which dates each plant receives low or high total irradiation.

In [None]:
low_irrad = total_irrad_per_day[total_irrad_per_day < 2000]
high_irrad = total_irrad_per_day[total_irrad_per_day > 2000]

In [None]:
fig = plt.figure(figsize=(20,20), dpi=300)
fig.subplots_adjust(wspace=0.25)
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
sns.heatmap(low_irrad, annot=True, fmt='.6g', center=0, ax=ax1)
sns.heatmap(high_irrad, annot=True, fmt='.6g', center=0, ax=ax2)
ax1.set_title('LOW TOTAL IRRADIATION PER DAY', fontdict={'fontsize': 16})
ax2.set_title('HIGH TOTAL IRRADIATION PER DAY', fontdict={'fontsize': 16})
plt.show()

From 2020-05-15 to 2020-06-17, we have 34 days in total, of which:

1. Plant 1 has 15 days of low total irradiation and 19 days of high total irradiation.

2. Plant 2 has 14 days of low total irradiation and 20 days of high total irradiation.

The difference between Plant 1 and Plant 2 is only one day. This suggests that Plant 2 gets slightly more sunshine than Plant 1.

**We learn**

- Plant 2 gets slightly more sunshine than Plant 1.
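
The day counts can also be computed directly from the daily totals instead of being read off the heatmaps. A minimal sketch with hypothetical daily sums, assuming the same threshold of 2000 as in the cells above (the frame `daily` is a toy stand-in for `total_irrad_per_day`):

```python
import pandas as pd

# Hypothetical daily irradiation totals for four days
daily = pd.DataFrame(
    {'IRRADIATION_PLANT1': [1500.0, 2500.0, 1900.0, 2600.0],
     'IRRADIATION_PLANT2': [2100.0, 2400.0, 1800.0, 2700.0]},
    index=pd.to_datetime(['2020-05-15', '2020-05-16', '2020-05-17', '2020-05-18']),
)

threshold = 2000
# Boolean masks sum to the number of days on each side of the threshold
low_days = (daily < threshold).sum()
high_days = (daily >= threshold).sum()
summary = pd.DataFrame({'low': low_days, 'high': high_days})
print(summary)
```

Using `>=` for the high group ensures that a day exactly at the threshold is counted once rather than dropped from both groups.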

## Descriptive Analytics: Generation Data

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
print('Daily Yield mean value (Kwh): \n\n', data.describe().loc['mean',['DAILY_YIELD_PLANT1', 'DAILY_YIELD_PLANT2']])

In [None]:
plt.figure(figsize=(10,10))
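# Keep numeric columns only: recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(data.select_dtypes('number').corr(), annot=True, fmt='.2g', center=0, square=True)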
plt.show()

**We learn**
- DC power, AC power and daily yield of Plant 1 and Plant 2 appear negatively correlated, i.e. when plant 1 production is low, plant 2 production is high (see correlation matrix).

### The maximum/minimum amount of DC/AC Power generated in a time interval/day.

In [None]:
data['date'] = pd.to_datetime(data['DATE_TIME_PLANT2']).dt.date
data['time'] = pd.to_datetime(data['DATE_TIME_PLANT2']).dt.time

data = data.set_index(data.DATE_TIME_PLANT2)
del data['DATE_TIME_PLANT2']

In [None]:
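def min_max(value=None):
    """Return, for each day, the DC/AC power readings of each plant at the daily maximum (value='max') or minimum (any other value)."""
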
    dc1 = []
    ac1 = []
    dc2 = []
    ac2 = []

    for dt in np.unique(data.date.values):
    
        df = data[data.date == dt]
    
        if value == 'max':
            dc1.append(df[df['DC_POWER_PLANT1'] == df['DC_POWER_PLANT1'].max()])
            dc2.append(df[df['DC_POWER_PLANT2'] == df['DC_POWER_PLANT2'].max()])
    
            ac1.append(df[df['AC_POWER_PLANT1'] == df['AC_POWER_PLANT1'].max()])
            ac2.append(df[df['AC_POWER_PLANT2'] == df['AC_POWER_PLANT2'].max()])
        else:
            dc1.append(df[df['DC_POWER_PLANT1'] == df['DC_POWER_PLANT1'].min()])
            dc2.append(df[df['DC_POWER_PLANT2'] == df['DC_POWER_PLANT2'].min()])
    
            ac1.append(df[df['AC_POWER_PLANT1'] == df['AC_POWER_PLANT1'].min()])
            ac2.append(df[df['AC_POWER_PLANT2'] == df['AC_POWER_PLANT2'].min()])
            
    return pd.concat(dc1)['DC_POWER_PLANT1'], pd.concat(ac1)['AC_POWER_PLANT1'],\
                    pd.concat(dc2)['DC_POWER_PLANT2'], pd.concat(ac2)['AC_POWER_PLANT2']

In [None]:
def plot_data(value=None, kind=None):
    fig = plt.figure(figsize=(20,20), dpi=200)
    fig.subplots_adjust(hspace=0.4, wspace=0.2)
    for i in range(1, len(value)+1):
        ax = fig.add_subplot(2, 2, i)
        value[i-1].plot(kind=kind, ax=ax)
        ax.set_ylabel('kW')
        ax.set_title(value[i-1].name)

In [None]:
min_dc1, min_ac1, min_dc2, min_ac2 = min_max()

In [None]:
max_dc1, max_ac1, max_dc2, max_ac2 = min_max(value='max')

In [None]:
plot_data(value=[min_dc1, min_ac1, min_dc2, min_ac2], kind='line')

In [None]:
plot_data(value=[max_dc1, max_ac1, max_dc2, max_ac2], kind='bar')

### Inverter that produced the maximum DC/AC power

In [None]:
source1 = data[data.DC_POWER_PLANT1 == data.DC_POWER_PLANT1.max()] 
source2 = data[data.DC_POWER_PLANT2 == data.DC_POWER_PLANT2.max()] 

**For PLANT 1:**

In [None]:
source1[['DATE_TIME_PLANT1', 'SOURCE_KEY_PLANT1', 'DC_POWER_PLANT1', 'AC_POWER_PLANT1']]

**For PLANT 2:**

In [None]:
# DATE_TIME_PLANT2 is the index of `data`, so the Plant 2 timestamp is already displayed
source2[['SOURCE_KEY_PLANT2', 'DC_POWER_PLANT2', 'AC_POWER_PLANT2']]

Plant 1 produces 10.2 times the DC power of plant 2, but the AC power of plant 1 is the same as that of plant 2. Taken at face value, this means that plant 1 loses about 90% of its production.

**We learn**
- Plant 1 appears to lose about 90% of its DC power in the conversion.
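
The 90% figure follows from the ratio between DC and AC power. As a sketch of the arithmetic, the helper below (a hypothetical function, not part of the notebook) computes the fraction of DC power that does not appear as AC power, using the factor of 10.2 quoted above and an arbitrary AC reading.

```python
def conversion_loss(dc_power, ac_power):
    """Fraction of DC power that does not appear as AC power."""
    if dc_power <= 0:
        raise ValueError('dc_power must be positive')
    return 1.0 - ac_power / dc_power

# Hypothetical readings: DC is 10.2 times AC, the ratio quoted above
ac = 1000.0
dc = 10.2 * ac
loss = conversion_loss(dc, ac)
print(f'{loss:.1%}')
```

A loss of this size is far above what inverters normally show, so it is worth checking that `DC_POWER` is recorded in the same unit in both files before reading it as a physical loss.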

### Ranking the inverters based on the DC/AC power they produce

In [None]:
rk1 = data.groupby('SOURCE_KEY_PLANT1')[['DC_POWER_PLANT1', 'AC_POWER_PLANT1']].agg('sum').sort_values(by='DC_POWER_PLANT1')
rk2 = data.groupby('SOURCE_KEY_PLANT2')[['DC_POWER_PLANT2', 'AC_POWER_PLANT2']].agg('sum').sort_values(by='DC_POWER_PLANT2')

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(rk1, annot=True, fmt='.10g', center=0)
plt.title('Ranking Inverters for Plant 1 for two month production')
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(rk2, annot=True, fmt='.10g', center=0)
plt.title('Ranking Inverters for Plant 2 for two month production')
plt.show()

**Summary**

We learn:
1. The two power plants are not installed in the same region/province.
2. Plant 2 gets slightly more sunshine than Plant 1.
3. DC power, AC power and daily yield of Plant 1 and Plant 2 appear negatively correlated.
4. Plant 1 appears to lose about 90% of its DC power in the conversion.
5. Module temperature and irradiation are strongly correlated.

## EDA: Weather condition

### Distribution

In [None]:
figu = plt.figure(dpi=100, figsize=(15,10))
figu.subplots_adjust(wspace=0.2, hspace=0.2)
cols = list(set(sensor.columns) - set(['date', 'time_PLANT1', 'time_PLANT2']))
for i in range(1,len(cols)+1):
    ax = figu.add_subplot(2,3,i)
    sns.boxplot(x=sensor[sorted(cols)[i-1]], ax=ax)

The boxplot of module temperature for Plant 2 shows an outlier. This suggests that Plant 2 is installed in a region/province where sunshine brings very high temperatures.

In [None]:
dist = plt.figure(dpi=100, figsize=(15,10))
dist.subplots_adjust(wspace=0.2, hspace=0.2)
for i in range(1,len(cols)+1):
    ax = dist.add_subplot(2,3,i)
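    # distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest equivalent
    sns.histplot(sensor[sorted(cols)[i-1]], kde=True, stat='density', ax=ax)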

### Visualize time series

In [None]:
def meteo(feature=None, sun='date'):
    # Mean, max and min of `feature` for each value of `sun` ('date' or a time column).
    # Grouping on `sun` itself keeps max/min correct when `sun` is not 'date'.
    result = sensor.groupby(sun)[feature].agg(['mean', 'max', 'min'])
    result.columns = [feature + '_mean', feature + '_max', feature + '_min']
    return result

#### For date

In [None]:
meteo(feature='AMBIENT_TEMPERATURE_PLANT1').plot(rot=30, figsize=(15,5))
plt.title('AMBIENT TEMPERATURE history in PLANT 1')
plt.ylabel('°C')
plt.show()

meteo(feature='AMBIENT_TEMPERATURE_PLANT2').plot(rot=30, figsize=(15,5))
plt.title('AMBIENT TEMPERATURE history in PLANT 2')
plt.ylabel('°C')
plt.show()

The ambient temperature at plant 1 stays below 35°C on many days, whereas the ambient temperature at plant 2 exceeds 35°C, with maxima between 35°C and 40°C.

This graph suggests that Plant 2 is installed in a hotter region/province of India.

In [None]:
meteo(feature='MODULE_TEMPERATURE_PLANT1').plot(rot=30, figsize=(15,5))
plt.title('MODULE TEMPERATURE history in PLANT 1')
plt.ylabel('°C')
plt.show()

In [None]:
meteo(feature='MODULE_TEMPERATURE_PLANT2').plot(rot=30, figsize=(15,5))
plt.title('MODULE TEMPERATURE history in PLANT 2')
plt.ylabel('°C')
plt.show()

We note that the mean module temperature of the two plants does not exceed 40°C.

In [None]:
sensor.groupby('date')[['IRRADIATION_PLANT1', 'IRRADIATION_PLANT2']].agg('sum').plot(rot=30, figsize=(15,5))
plt.title('TOTAL IRRADIATION PER DAY FOR THE TWO PLANTS')
plt.ylabel('W/m²')
plt.show()

This graph shows the days of low and high total irradiation.


#### For Time

In [None]:
meteo(feature='AMBIENT_TEMPERATURE_PLANT1', sun='time_PLANT1').plot(rot=30, figsize=(15,5))
plt.title('AMBIENT TEMPERATURE daily history in PLANT 1')
plt.ylabel('°C')
plt.show()

In [None]:
meteo(feature='AMBIENT_TEMPERATURE_PLANT2', sun='time_PLANT2').plot(rot=30, figsize=(15,5))
plt.title('AMBIENT TEMPERATURE daily history in PLANT 2')
plt.ylabel('°C')
plt.show()

In [None]:
meteo(feature='MODULE_TEMPERATURE_PLANT1', sun='time_PLANT1').plot(rot=30, figsize=(15,5))
plt.title('MODULE TEMPERATURE daily history in PLANT 1')
plt.ylabel('°C')
plt.show()

In [None]:
meteo(feature='MODULE_TEMPERATURE_PLANT2', sun='time_PLANT2').plot(rot=30, figsize=(15,5))
plt.title('MODULE TEMPERATURE daily history in PLANT 2')
plt.ylabel('°C')
plt.show()

In [None]:
sensor.groupby('time_PLANT1')[['IRRADIATION_PLANT1']].agg('sum').plot(rot=30, figsize=(15,5))
plt.title('DAILY TOTAL IRRADIATION FOR PLANT1')
plt.ylabel('W/m²')
plt.show()

In [None]:
sensor.groupby('time_PLANT2')[['IRRADIATION_PLANT2']].agg('sum').plot(rot=30, figsize=(15,5))
plt.title('DAILY TOTAL IRRADIATION FOR PLANT2')
plt.ylabel('W/m²')
plt.show()

Over the course of a day, Plant 1 receives about the same irradiation as Plant 2.

### Relation between features

In [None]:
plt.figure(figsize=(10,5))
plt.hist2d(sensor['AMBIENT_TEMPERATURE_PLANT1'], sensor['MODULE_TEMPERATURE_PLANT1'], density=True, bins=250)
plt.xlabel('AMBIENT_TEMPERATURE_PLANT1')
plt.ylabel('MODULE_TEMPERATURE_PLANT1')
plt.title('Ambient and Module Temperature for Plant 1')
plt.show()

In [None]:
plt.figure(figsize=(10,5))
plt.hist2d(sensor['AMBIENT_TEMPERATURE_PLANT2'], sensor['MODULE_TEMPERATURE_PLANT2'], density=True, bins=250)
plt.xlabel('AMBIENT_TEMPERATURE_PLANT2')
plt.ylabel('MODULE_TEMPERATURE_PLANT2')
plt.title('Ambient and Module Temperature for Plant 2')
plt.show()

In [None]:
reg1 = plt.figure(figsize=(10,10))
reg1.subplots_adjust(hspace=0.4)
ar1 = reg1.add_subplot(2, 1, 1)
ar2 = reg1.add_subplot(2, 1, 2)
sns.regplot(x='MODULE_TEMPERATURE_PLANT1', y='IRRADIATION_PLANT1', data=sensor, ax=ar1)
sns.regplot(x='MODULE_TEMPERATURE_PLANT2', y='IRRADIATION_PLANT2', data=sensor, ax=ar2)
ar1.set_title('Regression IRRADIATION AND MODULE TEMPERATURE FOR PLANT 1')
ar2.set_title('Regression IRRADIATION AND MODULE TEMPERATURE FOR PLANT 2')
plt.show()

Strong correlation between Irradiation and Module Temperature.
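
To put a number on "strong", the Pearson coefficient can be computed for each pair of features. The function below is a minimal sketch of the standard formula on hypothetical readings. It is not the notebook's own method, which relies on `DataFrame.corr()`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    # Covariance and variances, all without the 1/n factor (it cancels out)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings: module temperature (°C) and irradiation
module_temp = [25.0, 30.0, 40.0, 50.0, 55.0]
irradiation = [0.0, 0.2, 0.5, 0.9, 1.0]
r = pearson(module_temp, irradiation)
print(round(r, 3))
```

A value close to 1 indicates that irradiation rises almost linearly with module temperature, which is what the regression plots suggest.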

## EDA: Generation data

### Distribution

In [None]:
#interesting columns
column = list(set(data.columns) - set(['SOURCE_KEY_PLANT1', 'DATE_TIME_PLANT1', 'SOURCE_KEY_PLANT2','date', 'time']))        

In [None]:
box = plt.figure(figsize=(20,20))
box.subplots_adjust(wspace=0.1, hspace=0.1)
for i in range(1, len(column)+1):
    ax = box.add_subplot(2, 4, i)
    sns.boxplot(x=data[sorted(column)[i-1]], ax=ax)
plt.show()

In [None]:
dpl = plt.figure(figsize=(20,20))
dpl.subplots_adjust(wspace=0.2, hspace=0.2)
for i in range(1, len(column)+1):
    ax = dpl.add_subplot(2, 4, i)
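    # distplot is deprecated in recent seaborn; histplot with a KDE overlay is the closest equivalent
    sns.histplot(data[sorted(column)[i-1]], kde=True, stat='density', ax=ax)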
plt.show()

Between 00:00 and 06:00 it is night and there is no sunshine. Many features therefore have most of their values around 0, which is why DC/AC power has a median of 0 at both plants.
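
The effect of night-time readings on the median can be illustrated on a toy series. This is a sketch with hypothetical values and an assumed daylight window of 06:00 to 18:00, which would need adjusting to the actual site.

```python
from statistics import median

# Hypothetical (hour, AC power) readings over one day
readings = [(0, 0.0), (3, 0.0), (5, 0.0), (7, 150.0), (10, 800.0),
            (13, 1100.0), (16, 600.0), (19, 0.0), (21, 0.0), (23, 0.0)]

# Median over the whole day, night readings included
all_day = median(p for _, p in readings)

# Median over the assumed daylight window only
daytime = median(p for h, p in readings if 6 <= h < 18)

print(all_day, daytime)
```

Restricting the distribution plots to daytime rows would show the spread of actual production rather than the mass of zeros.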



### Visualize time series

#### For date

In [None]:
data.groupby('date')[['DC_POWER_PLANT1', 'DC_POWER_PLANT2']].agg('sum').plot(rot=90, figsize=(15,5),
                                                                             kind='bar', logy=True)
plt.title('TOTAL DC POWER PER DAY')
plt.ylabel('KWh')
plt.show()

Plant 1 produces far more DC power per day than Plant 2.

In [None]:
data.groupby('date')[['AC_POWER_PLANT1', 'AC_POWER_PLANT2']].agg('sum').plot(rot=90, figsize=(15,5),
                                                                             kind='bar')
plt.title('TOTAL AC POWER PER DAY')
plt.ylabel('KWh')
plt.show()

Once again, Plant 1 appears to lose about 90% of its DC power in the conversion to AC.

In [None]:
data.groupby('date')[['TOTAL_YIELD_PLANT1', 'TOTAL_YIELD_PLANT2']].agg('mean').plot(rot=90, figsize=(15,5),
                                                                             kind='bar', logy=True)
plt.title('MEAN TOTAL YIELD PER DAY')
plt.ylabel('KWh')
plt.show()

Plant 1 has a serious problem: it produces much more DC power, and records more irradiation, than Plant 2, yet it has a lower total yield per day. I think the cause lies in the inverters, since some of them appear to lose about 90% of the power when converting DC to AC. Plant 1 needs panel maintenance.

#### For time

In [None]:
data.groupby('time')[['DAILY_YIELD_PLANT1', 'DAILY_YIELD_PLANT2']].agg('mean').plot(rot=30, figsize=(15,5))
plt.title('MEAN DAILY YIELD PER TIME')
plt.ylabel('KWh')
plt.show()

Over the day, production in Plant 1 and Plant 2 moves in opposite directions. That is why the daily yields of plant 1 and plant 2 are negatively correlated.

In [None]:
data.groupby('time')[['TOTAL_YIELD_PLANT1', 'TOTAL_YIELD_PLANT2']].agg('mean').plot(rot=30, figsize=(15,5))
plt.title('MEAN TOTAL YIELD PER TIME')
plt.ylabel('KWh')
plt.show()

Plant 1 needs panel cleaning and maintenance.

#### For date-time

In [None]:
data.reset_index()[:10000].groupby('DATE_TIME_PLANT2')[['DC_POWER_PLANT1', 'DC_POWER_PLANT2']].agg('sum').plot(rot=30,
                                                                                                              figsize=(15,5))
plt.title('TOTAL OF 22 INVERTERS DC POWER PER DATE-TIME')
plt.ylabel('KWh')
plt.show()

In [None]:
data.reset_index()[:10000].groupby('DATE_TIME_PLANT2')[['AC_POWER_PLANT1', 'AC_POWER_PLANT2']].agg('sum').plot(rot=30,
                                                                                                              figsize=(15,5))
plt.title('TOTAL OF 22 INVERTERS AC POWER PER DATE-TIME')
plt.ylabel('KWh')
plt.show()

In [None]:
data.reset_index()[:20000].groupby('DATE_TIME_PLANT2')[['DAILY_YIELD_PLANT1', 'DAILY_YIELD_PLANT2']].agg('sum').plot(rot=30,
                                                                                                              figsize=(20,5))
plt.title('TOTAL OF 22 INVERTERS DAILY YIELD PER DATE-TIME')
plt.ylabel('KWh')
plt.show()

After 5 days, the shape of the daily yield curve changes.

In [None]:
data.reset_index()[:10000].groupby('DATE_TIME_PLANT2')[['TOTAL_YIELD_PLANT1', 'TOTAL_YIELD_PLANT2']].agg('mean').plot(rot=30,
                                                                                                              figsize=(15,5))
plt.title('TOTAL OF 22 INVERTERS TOTAL YIELD PER DATE-TIME')
plt.ylabel('KWh')
plt.show()

By date-time, we see again that Plant 1 has a serious maintenance problem.

### Relation between features

In [None]:
reg2 = plt.figure(figsize=(10,10))
reg2.subplots_adjust(hspace=0.4)
ar11 = reg2.add_subplot(2, 1, 1)
ar22 = reg2.add_subplot(2, 1, 2)
sns.regplot(x='DC_POWER_PLANT1', y='AC_POWER_PLANT1', data=data, ax=ar11)
sns.regplot(x='DC_POWER_PLANT2', y='AC_POWER_PLANT2', data=data, ax=ar22)
ar11.set_title('Regression DC POWER AND AC POWER FOR PLANT 1')
ar22.set_title('Regression DC POWER AND AC POWER FOR PLANT 2')
plt.show()

**Summary**
1. Confirmation: plant 2 is installed in a hot region/province.
2. Confirmation: plant 1 appears to lose about 90% of its DC power in the conversion to AC power.
3. Plant 1 needs panel cleaning and maintenance.
4. Daily yield depends on the weather (we check this in the next section).
5. Production in plant 1 and plant 2 moves in opposite directions (checked in the next section).

# Identify relevant information.

In this section, we correlate the weather features with the generation features to find relevant information.

Given that plant 1 needs maintenance, we will look for the inverters that are deficient or that affect production.

In [None]:
grouped_data = data.reset_index().groupby('date')[column].agg('mean').reset_index()

In [None]:
grouped_data.tail()

In [None]:
grouped_data.info()

In [None]:
grouped_sensor = sensor.groupby('date')[cols].agg('mean').reset_index()

In [None]:
grouped_sensor.info()

In [None]:
assemble_data = grouped_data.merge(grouped_sensor, left_on='date', right_on='date', suffixes=('_data', '_sensor'))

In [None]:
assemble_data.info()

## Correlation between weather feature and generation data feature


In [None]:
plt.figure(figsize=(15,8))
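# Keep numeric columns only: recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(assemble_data.select_dtypes('number').corr(), annot=True, center=0, square=False)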
plt.show()

We remark that
1. corr(DAILY_YIELD_PLANT2, AC_POWER_PLANT2) = 0.69 
2. corr(AC_POWER_PLANT2, IRRADIATION_PLANT2) = 0.84
3. corr(DAILY_YIELD_PLANT2, IRRADIATION_PLANT2) = 0.71 (daily yield depends on weather conditions)
4. corr(MODULE_TEMPERATURE_PLANT1, IRRADIATION_PLANT2) = 0.68
5. corr(MODULE_TEMPERATURE_PLANT1, MODULE_TEMPERATURE_PLANT2) = 0.78
6. corr(MODULE_TEMPERATURE_PLANT1, AMBIENT_TEMPERATURE_PLANT2) = 0.8
7. corr(AMBIENT_TEMPERATURE_PLANT1, MODULE_TEMPERATURE_PLANT2) = 0.85
8. corr(AMBIENT_TEMPERATURE_PLANT1, AMBIENT_TEMPERATURE_PLANT2) = 0.85
9. corr(TOTAL_YIELD_PLANT1 , TOTAL_YIELD_PLANT2) = -0.52 (checked: plant1 and plant2 are opposite)

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(x='AC_POWER_PLANT2', y='DAILY_YIELD_PLANT2',lowess=True, data=assemble_data)
plt.show()

For plant 2, AC power can influence the daily yield.

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='AC_POWER_PLANT2', x='IRRADIATION_PLANT2',lowess=True, data=assemble_data)
plt.show()

In plant 2, irradiation can help to determine AC power.

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(x='AMBIENT_TEMPERATURE_PLANT2', y='AMBIENT_TEMPERATURE_PLANT1',lowess=True, data=assemble_data)
plt.show()

If we know the ambient temperature in plant 1, we can approximately determine the ambient temperature in plant 2.

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='IRRADIATION_PLANT2', x='IRRADIATION_PLANT1',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='TOTAL_YIELD_PLANT1', x='TOTAL_YIELD_PLANT2',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='DAILY_YIELD_PLANT2', x='MODULE_TEMPERATURE_PLANT2',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='TOTAL_YIELD_PLANT1', x='AMBIENT_TEMPERATURE_PLANT1',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='TOTAL_YIELD_PLANT1', x='MODULE_TEMPERATURE_PLANT1',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='TOTAL_YIELD_PLANT2', x='AMBIENT_TEMPERATURE_PLANT2',lowess=True, data=assemble_data)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.regplot(y='TOTAL_YIELD_PLANT2', x='MODULE_TEMPERATURE_PLANT2',lowess=True, data=assemble_data)
plt.show()

The relevant information is:

Because both plants are exposed to the same sun, the ambient temperatures of plant 1 and plant 2 are dependent. The irradiation of plant 1 and plant 2 is also dependent.

In plant 2, we can estimate the AC power if we know the irradiation value beforehand.

Also, knowing some features of plant 2 can be useful to predict some features of plant 1. If we suppose that the ambient temperature is the same in the two places (plant 1 and plant 2), plant 1 should normally perform as well as plant 2. But it does not.

Hence, the graphs show that TOTAL_YIELD_PLANT1 decreases when AMBIENT/MODULE_TEMPERATURE_PLANT1 increases. Plant 1 is deficient.

## Which inverter is deficient in the plant 1?

In [None]:
data1['date'] = pd.to_datetime(data1['DATE_TIME']).dt.date

In [None]:
dc_power = pd.pivot_table(data1, values='DC_POWER', index='date', columns='SOURCE_KEY')
ac_power = pd.pivot_table(data1, values='AC_POWER', index='date', columns='SOURCE_KEY')
daily_yield = pd.pivot_table(data1, values='DAILY_YIELD', index='date', columns='SOURCE_KEY')
total_yield = pd.pivot_table(data1, values='TOTAL_YIELD', index='date', columns='SOURCE_KEY')

In [None]:
dc = dc_power.describe().T
ac = ac_power.describe().T
dyield = daily_yield.describe().T
tyield = total_yield.describe().T

In [None]:
dc[['min', '50%', 'max']].plot(figsize=(15,5), kind='bar')
plt.title('DC POWER')
plt.ylabel('Kw')
plt.xlabel('Inverter')
plt.show()

We identify **1BY6WEcLGh8j5v7** and **bvBOhCH3iADSZry** as the inverters that produced the lowest DC power over the recorded period.
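
Low producers can also be flagged numerically instead of being read off the bar chart. The sketch below compares each inverter's mean power with the plant median. The function name, the 5% tolerance and the inverter keys are all assumptions chosen for illustration, not values from the dataset.

```python
from statistics import median

def flag_low_producers(mean_power, tolerance=0.05):
    """Return inverter keys whose mean power is more than `tolerance` below the plant median."""
    reference = median(mean_power.values())
    return sorted(k for k, v in mean_power.items() if v < (1 - tolerance) * reference)

# Hypothetical mean DC power per inverter; keys are placeholders, not real SOURCE_KEY values
mean_dc = {'inv_a': 3100.0, 'inv_b': 3150.0, 'inv_c': 2700.0,
           'inv_d': 3120.0, 'inv_e': 2650.0}
low = flag_low_producers(mean_dc)
print(low)
```

In the notebook, the input could be built from the `mean` column of `dc`, for example with `dc['mean'].to_dict()`.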

In [None]:
ac[['min', '50%', 'max']].plot(figsize=(15,5), kind='bar')
plt.title('AC POWER')
plt.ylabel('Kw')
plt.xlabel('Inverter')
plt.show()

This graph shows how the 22 inverters lose about 90% of their DC power in the conversion to AC, with 1BY6WEcLGh8j5v7 and bvBOhCH3iADSZry being the lowest producers of DC/AC power.

In [None]:
dyield[['min', '50%', 'max']].plot(figsize=(15,5), kind='bar')
plt.title('DAILY YIELD')
plt.ylabel('Kwh')
plt.xlabel('Inverter')
plt.show()

In [None]:
tyield[['min', '50%', 'max']].plot(figsize=(15,5), kind='bar')
plt.title('TOTAL YIELD')
plt.ylabel('Kwh')
plt.xlabel('Inverter')
plt.show()

All inverters give almost the same total and daily yield, which is curious.

**Summary**
1. Yield depends on weather conditions.
2. AC/DC power depends on weather conditions.

# Conclusion

Throughout this work, we noticed that power plant 1 is deficient and power plant 2 performs well.

The poor performance of power plant 1 is reflected in the fact that total yield decreases when ambient/module temperature increases. I think this is the reason why power plant 1 loses about 90% of its DC power in the inverters.

Power plant 2 performs well because it is installed where sunshine is ideal and its panels receive maintenance from time to time.
Power plant 1 needs cleaning and maintenance of its panels and inverters.

## See the next task...
**Feel free to download, share and comment**