<img src="https://i.imgur.com/6L1XKFq.jpeg">
<h1><center>Solar Power Plant - Interactive Exploratory Data Analysis & Condition Monitoring</center></h1>


# Table of Contents
* [1. Introduction](#section-one)
* [2. Import & Preprocessing](#section-two)
* [3. Data Exploration & Failure Detection](#section-three)
* [4. Condition Monitoring](#section-four)
* [5. Summary](#section-five)

<a id="section-one"></a>
# 1. Introduction & first ideas:

This notebook is going to focus on exploratory data analysis and condition monitoring/fault detection (challenges 2&3). However, the employed methods may be useful in combination with data from local weather forecasts for accurate short-term forecasting of power generation (challenge 1).


## 1.1 Solar Power Plants

In general, solar power plants convert energy from sunlight to electricity. The two most prominent technologies are
- **concentrated solar power**, where focused sunlight is used to heat up an absorber which is then employed to power a steam engine, and 
- **photovoltaic cells**, where the photovoltaic effect (photon absorption leads to injection of excited electrons into the conduction band) is employed to generate direct current (DC) power which is then converted into alternating current (AC) power by the use of inverters.

The data seems to be from solar power plants that use photovoltaic cell technology.

## 1.2 Dataset Description
According to the dataset description, the data contains two sets of information about two solar power plants located in India, acquired over a period of 34 days:
- power generation data  (measured at the inverter, each inverter corresponding to multiple lines of solar panels)
- weather data (measured with a single line of sensors optimally placed at the plant)

## 1.3 The challenges
1. Can we **predict the power generation for the next couple of days**? - this allows for better grid management
2. Can we **identify the need for panel cleaning/maintenance**?
3. Can we **identify faulty or suboptimally performing equipment**?

## 1.4 First thoughts and ideas
- Challenge 1: This challenge belongs to the category of **TIME SERIES FORECASTING**. Three options come to mind:

    1. Use available time series of AC or DC power data
    2. Use available time series from weather sensors 
    3. Use local weather forecast (external data) and combine with power data
    
    Option 1 does not take weather data into account and Option 2 might have difficulties with predicting power drop due to cloud formations since the provided weather data is relatively limited. Option 3 might be more promising since weather forecast stations have more data available (e.g. cloud forecasting). Then the task shifts to predicting power output from forecasted sunlight duration and irradiance.
- Challenge 2 & 3: These challenges belong to the categories of **CONDITION MONITORING**, **Anomaly Detection**, and **Fault Detection** and seem identical at first glance. However, a performance drop due to a need for cleaning/maintenance may be slower and less severe than a performance drop due to faulty equipment. Therefore, we should be able to distinguish between the two.
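The distinction between a slow decline and an abrupt drop can be illustrated with a small sketch. Everything here is hypothetical: the daily "performance ratio" series, the function name `classify_trend`, and both thresholds are illustrative assumptions, not values derived from the dataset.

```python
import numpy as np

def classify_trend(ratio, drop_limit=0.5, slope_limit=-0.005):
    """Label a daily performance-ratio series (hypothetical thresholds)."""
    ratio = np.asarray(ratio, dtype=float)
    # An abrupt day-to-day drop points towards equipment failure
    if np.diff(ratio).min() < -drop_limit:
        return "abrupt drop (possible fault)"
    # A small but persistent negative slope points towards soiling/maintenance
    slope = np.polyfit(np.arange(len(ratio)), ratio, deg=1)[0]
    if slope < slope_limit:
        return "gradual decline (possible soiling)"
    return "stable"

days = np.arange(10)
soiling = 1.0 - 0.01 * days                # loses 1 % per day
fault = np.array([1.0] * 6 + [0.0] * 4)    # drops to zero on day 6
stable = np.ones(10)

for name, series in [("soiling", soiling), ("fault", fault), ("stable", stable)]:
    print(name, "->", classify_trend(series))
```

The abrupt check runs first so that a sudden outage is not mistaken for a steep "gradual" trend.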


<a id="section-two"></a>
# 2. Import & Preprocessing

Let's have a first look at the data. For the sake of simplicity we are only looking at data of plant 1 (power generation and weather). Data of power plant 2 can then be explored in a similar manner.

In [None]:
import os
import pandas as pd
import numpy as np
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.dates as mdates 
xformatter = mdates.DateFormatter('%H:%M') # for time axis plots

# import plotly.offline as py
# py.init_notebook_mode(connected=True)


import sklearn
from scipy.optimize import curve_fit

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import all available data 
df_gen1 = pd.read_csv("../input/solar-power-generation-data/Plant_1_Generation_Data.csv")
df_gen2 = pd.read_csv("../input/solar-power-generation-data/Plant_2_Generation_Data.csv")

df_weather1 = pd.read_csv("../input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv")
df_weather2 = pd.read_csv("../input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv")

## 2.1 Check for issues

In [None]:
df_gen1.info()

Let's look at the column descriptions:
- **DATE_TIME:** Date and time for each observation. Observations recorded at 15 minute intervals.
- **PLANT_ID:**  this will be common for the entire file
- **SOURCE_KEY:** Source key in this file stands for the inverter id.
- **DC_POWER:** Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
- **AC_POWER:** Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
- **DAILY_YIELD:** Cumulative sum of power generated on that day, up to that point in time.
- **TOTAL_YIELD:** This is the total yield for the inverter till that point in time.

In [None]:
df_weather1.info()

DATE_TIME and PLANT_ID are identical with the description above. Other than that:
- **SOURCE_KEY:** Stands for the sensor panel id. This will be common for the entire file because there's only one sensor panel for the plant.
- **AMBIENT_TEMPERATURE:** This is the ambient temperature at the plant. *Note: After comparing this data with weather data in Gandikotta (Andhra), I assume the correct unit for this data is $°C$*
- **MODULE_TEMPERATURE:** There's a module (solar panel) attached to the sensor panel. This is the temperature reading for that module. *Note: After comparing this data with other publications, I assume the correct unit for this data is $°C$*
- **IRRADIATION:** Amount of irradiation for the 15 minute interval. *Note: After comparing this data with other publications, I assume the correct unit for this data is $kW/m^2$*

In [None]:
# quick stat overview of datasets
df_gen1.describe()

In [None]:
df_weather1.describe()

In [None]:
# Check missing values
df_gen1.isnull().sum().sort_values(ascending=False)


In [None]:
# Check missing values in the weather data as well
df_weather1.isnull().sum().sort_values(ascending=False)

It looks like we have no missing values. We can use the PLANT_ID column to check if our data only contains information of one power plant:

In [None]:
df_gen1['PLANT_ID'].nunique()

In [None]:
df_weather1['PLANT_ID'].nunique()

As expected, we only have data from one plant in this database. How many inverters are in the database and how many measurements are there per inverter?

In [None]:
df_gen1.SOURCE_KEY.value_counts()

In [None]:
df_gen1.SOURCE_KEY.value_counts().sum()

In [None]:
counts = df_gen1.SOURCE_KEY.value_counts() # number of measurements per inverter
print('There are {} different inverters. Number of measurements per inverter range from {} to {}.'.format(df_gen1.SOURCE_KEY.nunique(), counts.min(), counts.max()))

As we can see there are 22 different inverters with between 3104 and 3155 measurements. This difference may cause an issue with prediction models and should be taken into account. Since one entry corresponds to a 15 min measurement, the maximum difference of 51 entries corresponds to a difference of almost 13 hours.
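To see where records are missing rather than only how many, one option is to compare each inverter against the set of all observed timestamps. The sketch below uses a toy frame with two hypothetical inverters "A" and "B"; only the column names `DATE_TIME` and `SOURCE_KEY` come from the dataset.

```python
import pandas as pd

# Toy generation log: two hypothetical inverters sampled every 15 minutes.
# Inverter "B" is missing two timestamps.
full_index = pd.date_range("2020-05-15 00:00", periods=8, freq="15min")
toy = pd.concat([
    pd.DataFrame({"DATE_TIME": full_index, "SOURCE_KEY": "A"}),
    pd.DataFrame({"DATE_TIME": full_index.delete([2, 5]), "SOURCE_KEY": "B"}),
])

# Number of records expected if every inverter reported at every observed timestamp
expected = toy["DATE_TIME"].nunique()

# Records present per inverter and the resulting gap
present = toy.groupby("SOURCE_KEY")["DATE_TIME"].nunique()
missing = expected - present

# Convert missing 15-minute records into hours of missing data
missing_hours = missing * 15 / 60
print(missing_hours)

# Same arithmetic for the 51-record gap mentioned above
print(51 * 15 / 60, "hours")
```

Note that this only detects timestamps that at least one inverter reported; an interval missing for all inverters at once would need a comparison against a complete `pd.date_range`.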

## 2.2 Preprocess and merge datasets

We want to merge both datasets. To do this we adjust the DATE_TIME formats, drop unnecessary columns and merge along DATE_TIME. In addition, we are going to add separate date and time columns as well as name our inverters from 1 to 22.

In [None]:
# adjust datetime format
df_gen1['DATE_TIME'] = pd.to_datetime(df_gen1['DATE_TIME'],format = '%d-%m-%Y %H:%M')
df_weather1['DATE_TIME'] = pd.to_datetime(df_weather1['DATE_TIME'],format = '%Y-%m-%d %H:%M:%S')

# drop unnecessary columns and merge both dataframes along DATE_TIME
df_plant1 = pd.merge(df_gen1.drop(columns = ['PLANT_ID']), df_weather1.drop(columns = ['PLANT_ID', 'SOURCE_KEY']), on='DATE_TIME')
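A default `pd.merge` is an inner join, so generation rows without a matching weather timestamp are dropped silently. The following sketch, on toy frames with made-up values and hypothetical inverter keys, shows one way to make that visible with the `how`, `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy generation data: three records, two timestamps
gen = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00", "2020-05-15 06:00", "2020-05-15 06:15"]),
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a"],
    "DC_POWER": [10.0, 12.0, 30.0],
})
# Toy weather data: the 06:15 reading is missing
weather = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:00"]),
    "IRRADIATION": [0.05],
})

# Inner join (the default): the 06:15 generation record disappears
inner = pd.merge(gen, weather, on="DATE_TIME")

# Left join keeps every generation record; validate raises if a weather
# timestamp occurs more than once, indicator marks unmatched rows
merged = pd.merge(gen, weather, on="DATE_TIME", how="left",
                  validate="many_to_one", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(gen), len(inner), len(merged), len(unmatched))
```

Whether dropping or keeping unmatched rows is preferable depends on the later analysis; the point is only to know how many rows are affected.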

In [None]:
# add inverter number column to dataframe
sensorkeys = df_plant1.SOURCE_KEY.unique().tolist() # unique sensor keys
sensornumbers = list(range(1,len(sensorkeys)+1)) # sensor number
dict_sensor = dict(zip(sensorkeys, sensornumbers)) # dictionary of sensor numbers and corresponding keys

# add column by mapping each row's own SOURCE_KEY to its number
# (avoids chained assignment and does not rely on df_gen1 row order)
df_plant1['SENSOR_NUM'] = df_plant1['SOURCE_KEY'].map(dict_sensor)

# add Sensor Number as string
df_plant1["SENSOR_NAME"] = df_plant1["SENSOR_NUM"].apply(str) # add string column of sensor name

In [None]:
# adding separate time and date columns
df_plant1["DATE"] = df_plant1["DATE_TIME"].dt.date # add new column with date
df_plant1["TIME"] = df_plant1["DATE_TIME"].dt.time # add new column with time

# add hours and minutes for ml models (taken directly from the datetime column)
df_plant1['HOURS'] = df_plant1['DATE_TIME'].dt.hour
df_plant1['MINUTES'] = df_plant1['DATE_TIME'].dt.minute
df_plant1['MINUTES_PASS'] = df_plant1['MINUTES'] + df_plant1['HOURS']*60

# add date as string column
df_plant1["DATE_STR"] = df_plant1["DATE"].astype(str) # add column with date as string

In [None]:
df_plant1.head()

<a id="section-three"></a>
# 3. Data Exploration & Failure Detection

Now that we have a merged dataset we can take a closer look at data distributions and correlations.

## 3.1 Data Distribution and Correlations

In [None]:
cols_corr = ["DC_POWER", "AC_POWER", "DAILY_YIELD", "TOTAL_YIELD", "AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION", "SENSOR_NUM", "HOURS", "MINUTES", "MINUTES_PASS"]
corrMatrix = df_plant1[cols_corr].corr()
plt.figure(figsize=(15,5))
fig_corr = sns.heatmap(corrMatrix,cmap="YlGnBu", annot=True)
plt.show()

In [None]:
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [15, 5]})
cols_pair = ["DC_POWER", "AC_POWER", "DAILY_YIELD", "TOTAL_YIELD", "AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION", "SENSOR_NUM", "MINUTES_PASS"]
fig_pair = sns.pairplot(df_plant1[cols_pair])
plt.show()

From the last two figures we already can gain a lot of insight:

* high correlation between DC Power and AC Power
* high correlation between Power and Irradiation
* correlation between DC Power, AC Power and Module Temperature and Ambient Temperature
* correlation between Daily Yield and Ambient Temperature

and that there seem to be outliers in

* AC+DC Power - Irradiation
* DC Power - AC Power (very few)

We can use these outliers for equipment fault and need for maintenance detection!

* Outliers in Power-Irradiation indicate failure of the panel lines. If there is enough sunlight but no power is generated, this points to faulty photovoltaic cells.
* Outliers in DC-AC conversion indicate failure at the inverter. If there is DC power delivered but less AC power generated than expected the inverter may be malfunctioning.

If we take a close look at TOTAL_YIELD vs SENSOR_NUM, we see that there are two groups of inverters. One group starts with a higher total yield than the other one. This is most likely because this group was installed earlier than the other group.


**NOTE:** There are multiple entries where DAILY_YIELD decreased during a day, which should not be possible according to the definition of this column. **Since DAILY_YIELD is calculated from measured DC_POWER, there seems to be an issue with how this data was generated.**
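One way to locate such entries is to take the difference of DAILY_YIELD within each inverter and day and look for negative steps. The sketch below runs on a toy frame with made-up values and hypothetical inverter keys; the column `YIELD_STEP` is introduced only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "SOURCE_KEY": ["inv_a"] * 4 + ["inv_b"] * 4,
    "DATE_TIME": pd.to_datetime(
        ["2020-05-15 10:00", "2020-05-15 10:15", "2020-05-15 10:30", "2020-05-16 00:00"] * 2
    ),
    # inv_a drops from 150 to 120 within the same day
    "DAILY_YIELD": [100.0, 150.0, 120.0, 0.0, 90.0, 140.0, 190.0, 0.0],
})

toy = toy.sort_values(["SOURCE_KEY", "DATE_TIME"])
toy["DATE"] = toy["DATE_TIME"].dt.date

# Step between consecutive records of the same inverter on the same day;
# grouping by date keeps the legitimate midnight reset from being flagged
toy["YIELD_STEP"] = toy.groupby(["SOURCE_KEY", "DATE"])["DAILY_YIELD"].diff()

drops = toy[toy["YIELD_STEP"] < 0]
print(drops[["SOURCE_KEY", "DATE_TIME", "DAILY_YIELD", "YIELD_STEP"]])
```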

Let's look more closely at the pairplots where we identified outliers and see if these are spread out evenly across all inverters or if we can identify specific inverters.

In [None]:
cols_out = ["DC_POWER", "AC_POWER", "IRRADIATION", "SOURCE_KEY"]
sns.pairplot(df_plant1[cols_out], hue="SOURCE_KEY", diag_kind="hist", palette="tab10")
plt.show()

Most of the outliers seem to come from a small group of inverters!

## 3.2 AC Power vs DC Power

This indicates how well our inverter is converting DC power to AC Power. Any outliers in this power conversion process can be used to detect faulty inverters.

In [None]:
plt.figure(figsize=(10,5))
fig_power = sns.scatterplot(data=df_plant1, x="DC_POWER", y="AC_POWER", hue="SOURCE_KEY", palette="tab10")
fig_power.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.show()
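Beyond visual inspection, conversion outliers could be flagged numerically by comparing each record's AC/DC ratio with the typical ratio. This is a minimal sketch on made-up values; the 5 % tolerance and the column names `RATIO` and `CONV_OUTLIER` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "SOURCE_KEY": ["inv_a", "inv_a", "inv_b", "inv_b", "inv_b"],
    "DC_POWER": [1000.0, 2000.0, 1000.0, 2000.0, 3000.0],
    "AC_POWER": [98.0, 196.0, 98.0, 150.0, 294.0],
})

# Only records with DC input have a meaningful conversion ratio
producing = toy[toy["DC_POWER"] > 0].copy()
producing["RATIO"] = producing["AC_POWER"] / producing["DC_POWER"]

# The median is used as the "typical" ratio, so no absolute efficiency is assumed
typical = producing["RATIO"].median()
tolerance = 0.05  # assumed: flag deviations of more than 5 % from the typical ratio

producing["CONV_OUTLIER"] = (producing["RATIO"] - typical).abs() > tolerance * typical
print(producing)
```

Using the median rather than a fixed efficiency keeps the check independent of the units in which DC and AC power are reported.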

## 3.3 Power vs Temperature

From our pairplots we can clearly see some outliers in the DC&AC Power vs Module Temperature. Other Temperatures do not show any obvious outliers.

In [None]:
plt.figure(figsize=(10,5))
fig_temp = sns.scatterplot(data=df_plant1, x="MODULE_TEMPERATURE", y="DC_POWER", hue="SOURCE_KEY", palette="tab10")
fig_temp.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.show()

## 3.4 Power vs Irradiation
This indicates how well our photovoltaic panel lines can convert sunlight to DC power. Any outliers in this conversion of solar energy to electricity indicate malfunctioning photovoltaic panel lines.

In [None]:
plt.figure(figsize=(10,5))
fig_irr = sns.scatterplot(data=df_plant1, x="IRRADIATION", y="DC_POWER", hue="SOURCE_KEY", palette="tab10")
fig_irr.legend(loc='center left', bbox_to_anchor=(1, 0.5), ncol=1)
plt.show()

Our data clearly shows events where some inverters received no DC power even though there was more than enough sunlight to generate power. We clearly have some equipment malfunction in our data!

To illustrate this, we can take a closer look at the daily distribution of the generated power and the measured irradiation:

In [None]:
fig = px.scatter(df_plant1, x="TIME", y="DC_POWER", title="DC Power: Daily Distribution", color = "DATE_STR")
fig.update_traces(marker=dict(size=5, opacity=0.7), selector=dict(mode='markers'))
fig.show()

In [None]:
fig = px.scatter(df_plant1, x="TIME", y="IRRADIATION", title="Irradiation: Daily Distribution", color = "DATE_STR")
fig.update_traces(marker=dict(size=5, opacity=0.7), selector=dict(mode='markers'))
fig.show()

The first figure shows multiple occasions where there was no power generated during daytime. The second figure shows that the irradiation level never dropped low enough during the day to explain this loss of power. This indicates equipment failure!

FYI: These plots are interactive! To isolate a specific day, double click on that day in the legend on the right.

## 3.5 Temperature

In [None]:
df_plant1.MODULE_TEMPERATURE.describe()

In [None]:
df_plant1.AMBIENT_TEMPERATURE.describe()

In [None]:
date=["2020-05-16"]
df_day = df_plant1[df_plant1["DATE_STR"].isin(date)] # select the chosen day once
plt.plot(df_day.DATE_TIME, df_day.MODULE_TEMPERATURE, label="Module")
plt.plot(df_day.DATE_TIME, df_day.AMBIENT_TEMPERATURE, label="Ambient")
plt.gcf().axes[0].xaxis.set_major_formatter(xformatter) # set xaxis format
plt.legend()
plt.title("Daily Module and Ambient Temperature: {}" .format(date[0]))
plt.xlabel("Time")
plt.ylabel("Temperature (°C)")
plt.show()

Ambient temperatures range from 20 to 35°C, modules reach temperatures from 18 to 65 °C. Modules reach significantly higher temperatures than their ambient air during daytime. Ambient temperature is lagging behind daily module cooldown. This means the modules cool down quicker than their environment.
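The lag between the two temperature curves can be quantified, for example, by comparing the times at which each curve peaks. The readings below are invented hourly values for a single day, chosen only to illustrate the calculation.

```python
import pandas as pd

# Hypothetical hourly readings for one day (not taken from the dataset)
times = pd.date_range("2020-05-16 10:00", periods=8, freq="60min")
toy = pd.DataFrame({
    "MODULE_TEMPERATURE":  [40, 52, 60, 58, 50, 41, 33, 27],
    "AMBIENT_TEMPERATURE": [27, 30, 32, 34, 35, 34, 32, 30],
}, index=times)

# Timestamp of the daily maximum for each curve
peak_module = toy["MODULE_TEMPERATURE"].idxmax()
peak_ambient = toy["AMBIENT_TEMPERATURE"].idxmax()

# Positive lag: the ambient temperature peaks after the module temperature
lag = peak_ambient - peak_module
print("Module peak:", peak_module.time(), "| Ambient peak:", peak_ambient.time(), "| Lag:", lag)
```

On the real data the same comparison could be done per day with a `groupby` on the date column.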

In [None]:
plt.plot(df_plant1.DATE_TIME, df_plant1.MODULE_TEMPERATURE, label="Module")
plt.plot(df_plant1.DATE_TIME, df_plant1.AMBIENT_TEMPERATURE, label="Ambient")
plt.title("Long-term Module and Ambient Temperature:")
plt.legend()
plt.xlabel("Time")
plt.ylabel("Temperature (°C)")
plt.show()

There are two days with significantly lower temperatures ("bad weather"). Such events may be difficult to forecast without access to more weather data (air pressure, wind, humidity, cloud formation etc.) and advanced weather forecasting models.

<a id="section-four"></a>
# 4. Condition Monitoring

In section 3 we found evidence that points towards equipment failure. Now we are going to build models to detect equipment failure automatically.

## 4.1 Rule-based Fault Detection
During the data exploration we found a simple way to identify faulty equipment: If there is no power measured at the inverter during normal daytime operation, we can assume/identify equipment failure. Let's create a new column ("STATUS") that identifies faulty operation:

In [None]:
# Function to check if time is during daytime operation
def time_in_range(start, end, x):
    """Return true if x is in the range [start, end]"""
    if start <= end:
        return start <= x <= end
    else:
        return start <= x or x <= end
    
# set normal daytime operation range
start=datetime.time(6,30,0) # sunrise
end=datetime.time(17,30,0) # sunset

# Create new column to check proper operation:
# "Fault" if no DC power is measured during daytime operation, "Normal" otherwise
# (vectorized, avoids row-by-row chained assignment)
is_daytime = df_plant1["TIME"].apply(lambda t: time_in_range(start, end, t))
df_plant1["STATUS"] = np.where(is_daytime & (df_plant1["DC_POWER"] == 0), "Fault", "Normal")

In [None]:
fig = px.scatter(df_plant1, x="IRRADIATION", y="DC_POWER", title="Fault Identification", color="STATUS", labels={"DC_POWER":"DC Power (kW)", "IRRADIATION":"Irradiation"})
fig.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers')) # 'markers' is the scatter mode name; 'marker' matches no trace
fig.show()
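A fixed sunrise/sunset window is only an approximation of "enough sunlight". As an alternative sketch, the same rule can be written against the measured irradiation itself. The threshold `min_irradiation` and the column name `STATUS_IRR` are assumptions for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "IRRADIATION": [0.0, 0.02, 0.60, 0.75, 0.80],
    "DC_POWER":    [0.0, 0.0, 7000.0, 0.0, 9500.0],
})

# Assumed threshold (in the dataset's irradiation units) above which
# the panels should clearly be producing power
min_irradiation = 0.1

# Fault: enough sunlight, but no DC power measured at the inverter
toy["STATUS_IRR"] = np.where(
    (toy["IRRADIATION"] > min_irradiation) & (toy["DC_POWER"] == 0),
    "Fault",
    "Normal",
)
print(toy)
```

This variant does not flag zero power at dawn or dusk when irradiation is still negligible, which a clock-based window might.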

### 4.1.1 Days with faults

In [None]:
df_plant1[df_plant1["STATUS"]== "Fault"]["DATE"].value_counts()

In [None]:
fig=px.bar(df_plant1[df_plant1["STATUS"]== "Fault"]["DATE"].value_counts(), title="Fault Events: Rule-based", labels={"value":"Faults", "index":"Date", "SENSOR_NAME":"Inverter"})
fig.update(layout_showlegend=False)

In [None]:
# Date & Inverter Time series
# fig10 = px.scatter(df_plant1, x="DATE_TIME", y="STATUS", title="Fault Identification: Events", color="SOURCE_KEY")
# fig10.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers'))
# fig10.show()

In [None]:
# uncomment to see faulty days
# fig = px.scatter(df_plant1[df_plant1.DATE_STR=="2020-06-07"], x="DATE_TIME", y="DC_POWER", title="2020-06-07", color="SENSOR_NAME", labels={"DC_POWER":"DC Power", "DATE_TIME":"Time"})
# fig.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers'))
# fig.show()

In [None]:
# fig = px.scatter(df_plant1[df_plant1.DATE_STR=="2020-06-14"], x="DATE_TIME", y="DC_POWER", title="2020-06-14", color="SENSOR_NAME", labels={"DC_POWER":"DC Power", "DATE_TIME":"Time"})
# fig.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers'))
# fig.show()

### 4.1.2 Number of recorded faults

In [None]:
df_plant1.STATUS.value_counts()

In [None]:
print("There are {} records of faulty operation!" .format(df_plant1.STATUS.value_counts().Fault))

### 4.1.3 Inverters with faults

In [None]:
df_plant1[df_plant1["STATUS"]== "Fault"]["SOURCE_KEY"].value_counts()

In [None]:
fig=px.bar(df_plant1[df_plant1["STATUS"]== "Fault"]["SOURCE_KEY"].value_counts(), title="Inverter Faults: Rule-based", labels={"value":"Faults", "index":"Inverter", "SENSOR_NAME":"Inverter"})
fig.update(layout_showlegend=False)

### 4.1.4 Summary

In [None]:
print("The most faults were recorded on {} and {}." .format(df_plant1[df_plant1["STATUS"]== "Fault"]["DATE"].value_counts().index[0], df_plant1[df_plant1["STATUS"]== "Fault"]["DATE"].value_counts().index[1]))
print("Inverter {} and {} had the most failures." .format(df_plant1[df_plant1["STATUS"]== "Fault"]["SOURCE_KEY"].value_counts().index[0],df_plant1[df_plant1["STATUS"]== "Fault"]["SOURCE_KEY"].value_counts().index[1]))

## 4.2 Fault Detection with Regression Models
While the simple rule-based approach in the previous chapter was already successful at detecting severe failure, finding less obvious anomalies and more subtle indicators of equipment failure (or cleaning/maintenance need) requires more effort.

### 4.2.1 Linear Model

As a first attempt, we predict DC power from irradiance by assuming a linear relationship

\begin{equation}
    P(t) = a + b \cdot E(t)
\end{equation}

with the generated DC power $P(t)$, irradiance $E(t)$ and coefficients $a ,b$.

Note: PV module panels can reach up to 65°C. The efficiency of PV cells is usually lower at high temperatures. This leads to a non-linear relationship between irradiance and generated DC power. We are going to model this nonlinearity in the next section.
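To make the residual convention concrete, here is a minimal sketch of the linear fit on noise-free toy data, using `np.polyfit` in place of scikit-learn. The numbers are invented; the residual is defined as prediction minus measurement, so a large positive residual means less power than expected.

```python
import numpy as np

# Toy data following P = 100 + 14000 * E exactly (made-up coefficients)
irradiance = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
dc_power = np.array([100.0, 2900.0, 5700.0, 8500.0, 11300.0, 14100.0])

# Least-squares fit of P = a + b * E; polyfit returns the slope first
b, a = np.polyfit(irradiance, dc_power, deg=1)
print("a = {:.1f}, b = {:.1f}".format(a, b))

# A new record: high irradiance but little power
irradiance_new, measured_new = 0.8, 2000.0
predicted_new = a + b * irradiance_new
residual_new = predicted_new - measured_new  # prediction - measurement
print("predicted = {:.0f}, residual = {:.0f}".format(predicted_new, residual_new))
```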

In [None]:
from sklearn.linear_model import LinearRegression

# Model
reg = LinearRegression()

# choose training data (copy so that new columns are not written into a view of df_plant1)
train_dates = ["2020-05-16", "2020-05-17","2020-05-18" ,"2020-05-19", "2020-05-20", "2020-05-21"]
df_train = df_plant1[df_plant1["DATE_STR"].isin(train_dates)].copy()

# fit
reg.fit(df_train[["IRRADIATION"]], df_train.DC_POWER)

# save prediction, residual (prediction - measurement), and absolute residual
df_train["Prediction"] = reg.predict(df_train[["IRRADIATION"]])
df_train["Residual"] = df_train["Prediction"] - df_train["DC_POWER"]
df_plant1["Prediction"] = reg.predict(df_plant1[["IRRADIATION"]])
df_plant1["Residual"] = df_plant1["Prediction"] - df_plant1["DC_POWER"]
df_plant1["Residual_abs"] = df_plant1["Residual"].abs()

In [None]:
fig = px.scatter(df_plant1, x="DATE_TIME", y="DC_POWER", title="Fault Identification: Linear model (Zoomed in)", color="Residual_abs", labels={"DC_POWER":"DC Power (kW)", "DATE_TIME":"Date Time", "Residual_abs":"Residual"}, range_x=[datetime.date(2020, 6, 1), datetime.date(2020, 6, 17)])
fig.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers'))
fig.show()

In [None]:
# choose data
day = "2020-06-07"
inverter1 = "2"
inverter2 = "22"
df_pred = df_plant1[(df_plant1["DATE_STR"] == day)].copy()

fig, axes = plt.subplots(1, 2)

# left panel: inverter1, right panel: inverter2
# (x and y are passed by keyword and taken from the same filtered frame)
for ax, inverter in zip(axes, (inverter1, inverter2)):
    df_inv = df_pred[df_pred["SENSOR_NAME"] == inverter]
    sns.lineplot(data=df_inv, x="DATE_TIME", y="DC_POWER", label="Measured DC", color="b", ax=ax)
    sns.lineplot(data=df_inv, x="DATE_TIME", y="Residual", label="Residual", color="g", ax=ax)
    sns.lineplot(data=df_inv, x="DATE_TIME", y="Prediction", label="Predicted DC", color="r", ax=ax)
    ax.xaxis.set_major_formatter(xformatter) # set xaxis format
    ax.set_xlabel("Time")

axes[0].set_ylabel("DC Power (kW)")
axes[1].set_ylabel("")
axes[1].set_ylim(-2500, 14000)
axes[0].set_title("Example: Normal Operation")
axes[1].set_title("Example: Fault Detected")
plt.show()

In [None]:
inverter2= "22"
df_pred2 = df_plant1[df_plant1["SENSOR_NAME"]==inverter2].copy()
sns.lineplot(data=df_pred2, x="DATE_TIME", y="Prediction", label="Predicted DC", color="r")
sns.lineplot(data=df_pred2, x="DATE_TIME", y="DC_POWER", label="Measured DC", color="b")
sns.lineplot(x=df_pred2["DATE_TIME"], y=df_pred2["Residual"]-5000, label="Residual (Offset)", color="g") # shifted down for readability
plt.xlabel("Date")
plt.ylabel("Power (kW)")
plt.title("Fault Detection: Inverter 22")
plt.show()

### 4.2.2 Nonlinear Model

According to [Hooda et al. (2018)](https://www.researchgate.net/publication/309399733_PV_Power_Predictors_for_Condition_Monitoring) the generated power of a photovoltaic cell can be modeled by the nonlinear equation
\begin{equation}
    P(t) = a E(t) \left(1-b (T(t) + \frac{E(t)}{800} (c-20) - 25) - d \ln(E(t))\right)
\end{equation}
with irradiance $E(t)$, Temperature $T(t)$ and coefficients $a,b,c,d$.
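To get a feel for the equation, it can be evaluated at a few operating points. The sketch below assumes $E$ is given in W/m² and $T$ in °C; the function name `pv_power` and the coefficient values are hypothetical and are not the fitted values from this notebook.

```python
import numpy as np

def pv_power(E, T, a, b, c, d):
    """Evaluate P = a*E*(1 - b*(T + E/800*(c-20) - 25) - d*ln(E)).

    E: irradiance (assumed W/m^2), T: temperature (assumed deg C).
    A tiny offset inside the logarithm avoids log(0) at night.
    """
    return a * E * (1 - b * (T + E / 800 * (c - 20) - 25) - d * np.log(E + 1e-10))

# Hypothetical coefficients, chosen only for illustration
a, b, c, d = 10.0, 0.004, 45.0, 0.0

p_ref = pv_power(1000.0, 25.0, a, b, c, d)    # bright and cool
p_hot = pv_power(1000.0, 45.0, a, b, c, d)    # same irradiance, 20 degrees warmer
p_night = pv_power(0.0, 20.0, a, b, c, d)     # no irradiance

print(p_ref, p_hot, p_night)
```

With a positive `b`, the same irradiance yields less power at the higher temperature, which is the nonlinearity the linear model cannot represent. The leading factor `E` forces the prediction to zero at night regardless of the other terms.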

In [None]:
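def func(X, a, b, c, d):
    '''Nonlinear function to predict DC power output from Irradiation and Temperature.'''
    x,y = X # x: irradiation, y: module temperature
    x=x*1000 # rescale irradiation by 1000 (kW/m² -> W/m² under the unit assumption above)
    y=y*1000 # temperature is rescaled by the same factor; the fitted coefficients are tied to this scaling
    return a*x*(1-b*(y+x/800*(c-20)-25)-d*np.log(x+1e-10)) # small offset avoids log(0) at night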

# fit function
p0 = [1.,0.,-1.e4,-1.e-1] # starting values
popt, pcov = curve_fit(func, (df_train.IRRADIATION, df_train.MODULE_TEMPERATURE), df_train.DC_POWER, p0, maxfev=5000)
sigma_abcd = np.sqrt(np.diagonal(pcov))

# predict & save
df_train["Prediction_NL"] = func((df_train.IRRADIATION, df_train.MODULE_TEMPERATURE), *popt)
df_train["Residual_NL"] = df_train["Prediction_NL"] - df_train["DC_POWER"]

df_plant1["Prediction_NL"] = func((df_plant1.IRRADIATION, df_plant1.MODULE_TEMPERATURE), *popt)
df_plant1["Residual_NL"] = df_plant1["Prediction_NL"] - df_plant1["DC_POWER"]

In [None]:
plt.scatter(df_plant1.IRRADIATION, df_plant1.DC_POWER, label="Measured")
plt.scatter(df_plant1.IRRADIATION, df_plant1.Prediction_NL, color="r", label="NL Prediction")
plt.legend()
plt.xlabel("Irradiation (kW/m²)")
plt.ylabel("DC Power (kW)")
plt.title("Nonlinear Model Prediction")
plt.show()

In [None]:
# choose data
day = "2020-06-07"
inverter1 = "2"
inverter2 = "22"
df_pred = df_plant1[(df_plant1["DATE_STR"] == day)].copy()

fig, axes = plt.subplots(1, 2)

# left panel: inverter1, right panel: inverter2
# (x and y are passed by keyword and taken from the same filtered frame)
for ax, inverter in zip(axes, (inverter1, inverter2)):
    df_inv = df_pred[df_pred["SENSOR_NAME"] == inverter]
    sns.lineplot(data=df_inv, x="DATE_TIME", y="DC_POWER", label="Measured DC", color="b", ax=ax)
    sns.lineplot(data=df_inv, x="DATE_TIME", y="Residual_NL", label="NL Residual", color="g", ax=ax)
    sns.lineplot(data=df_inv, x="DATE_TIME", y="Prediction_NL", label="NL Predicted DC", color="r", ax=ax)
    ax.xaxis.set_major_formatter(xformatter) # set xaxis format
    ax.set_xlabel("Time")

axes[0].set_ylabel("DC Power (kW)")
axes[1].set_ylabel("")
axes[0].set_ylim(-3000, 14000)
axes[1].set_ylim(-3000, 14000)
axes[0].set_title("Example: Normal Operation")
axes[1].set_title("Example: Fault Detected")
plt.show()

In [None]:
inverter2= "1"
df_pred2 = df_plant1[df_plant1["SENSOR_NAME"]==inverter2].copy()
sns.lineplot(data=df_pred2, x="DATE_TIME", y="Prediction_NL", label="NL Prediction", color="r")
sns.lineplot(data=df_pred2, x="DATE_TIME", y="DC_POWER", label="Measured DC", color="b")
sns.lineplot(x=df_pred2["DATE_TIME"], y=df_pred2["Residual_NL"]-5000, label="NL Residual (Offset)", color="g") # shifted down for readability
plt.xlabel("Date")
plt.ylabel("Power (kW)")
plt.title("Fault Detection Example: Inverter {}".format(inverter2))
plt.show()

In [None]:
inverter2= "22"
df_pred2 = df_plant1[df_plant1["SENSOR_NAME"]==inverter2].copy()
sns.lineplot(data=df_pred2, x="DATE_TIME", y="Prediction_NL", label="NL Prediction", color="r")
sns.lineplot(data=df_pred2, x="DATE_TIME", y="DC_POWER", label="Measured DC", color="b")
sns.lineplot(x=df_pred2["DATE_TIME"], y=df_pred2["Residual_NL"]-5000, label="NL Residual (Offset)", color="g") # shifted down for readability
plt.xlabel("Date")
plt.ylabel("Power (kW)")
plt.title("Fault Detection Example: Inverter {}".format(inverter2))
plt.show()

### 4.2.3 Model Comparison

To compare the two models we can take a look at their respective residuals. The nonlinear model seems to perform slightly better than the linear model, especially at times of high irradiance.

In [None]:
# plot model comparison residual
plt.figure(figsize=(5,5))
sns.scatterplot(x=df_train.Prediction, y=df_train.Residual, color="b", label="LI Residual")
sns.scatterplot(x=df_train.Prediction_NL, y=df_train.Residual_NL, color="r", label="NL Residual")
plt.ylabel("Residual")
plt.xlabel("Predicted DC Power")
plt.title("Model Comparison")
plt.show()
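The visual comparison can be backed by summary numbers such as the root-mean-square error (RMSE) and the mean absolute error (MAE) of each model's residuals. The helper names `rmse` and `mae` and the residual values below are illustrative only; on the notebook data one would pass `df_train.Residual` and `df_train.Residual_NL`.

```python
import numpy as np

def rmse(residual):
    """Root-mean-square error of a residual series."""
    residual = np.asarray(residual, dtype=float)
    return float(np.sqrt(np.mean(residual ** 2)))

def mae(residual):
    """Mean absolute error of a residual series."""
    residual = np.asarray(residual, dtype=float)
    return float(np.mean(np.abs(residual)))

# Made-up residuals standing in for the two models
res_linear = np.array([300.0, -400.0, 0.0, 500.0])
res_nonlinear = np.array([100.0, -200.0, 0.0, 200.0])

for name, res in [("linear", res_linear), ("nonlinear", res_nonlinear)]:
    print("{:>9}: RMSE = {:.1f}, MAE = {:.1f}".format(name, rmse(res), mae(res)))
```

RMSE penalises large deviations more strongly than MAE, so reporting both gives a hint of whether a model's errors are dominated by a few large misses.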

### 4.2.4 NL Fault Detection

Let's now use the irradiance and temperature data to predict the expected DC power with the nonlinear model. This allows us to identify additional anomalies by comparing the measured DC power with the prediction. The additional anomalies indicate equipment underperformance or need for maintenance.


In [None]:
# residual threshold (prediction - measurement) above which a record counts as a fault
limit_fault=4000

# Create new column to check proper operation:
# "Fault" if the residual exceeds the threshold, "Normal" otherwise
# (vectorized, avoids row-by-row chained assignment)
df_plant1["STATUS_NL"] = np.where(df_plant1["Residual_NL"] > limit_fault, "Fault", "Normal")
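The fixed limit of 4000 is a manual choice. As an alternative sketch, a threshold could be derived from the spread of the residuals on fault-free training days, for example mean plus three standard deviations. The residuals below are synthetic (normally distributed with an assumed scale of 500), so the resulting number is not comparable to the notebook's limit.

```python
import numpy as np

# Synthetic stand-in for residuals on fault-free training days
rng = np.random.default_rng(0)
train_residual = rng.normal(loc=0.0, scale=500.0, size=1000)

# Data-driven threshold: mean + 3 standard deviations of the training residuals
limit = train_residual.mean() + 3 * train_residual.std()
print("threshold: {:.0f}".format(limit))

# Residual = prediction - measurement, so only large positive values
# (less power than predicted) are flagged
new_residual = np.array([-200.0, 800.0, 6000.0])
status = np.where(new_residual > limit, "Fault", "Normal")
print(status)
```

A quantile of the training residuals (e.g. the 99.9th percentile) would be a comparable choice if the residuals are clearly not normally distributed.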

In [None]:
fig=px.bar(df_plant1[df_plant1["STATUS_NL"]== "Fault"]["DATE"].value_counts(), title="Faults: Nonlinear Model", labels={"value":"Faults", "index":"Date", "SENSOR_NAME":"Inverter"}, )
fig.update(layout_showlegend=False)

In [None]:
fig=px.bar(df_plant1[df_plant1["STATUS_NL"]== "Fault"]["SOURCE_KEY"].value_counts(), title="Underperformance & Faults: Nonlinear Model", labels={"value":"Faults", "index":"Inverter", "SENSOR_NAME":"Inverter"})
fig.update(layout_showlegend=False)

In [None]:
fig = px.scatter(df_plant1, x="DATE_TIME", y="DC_POWER", title="Underperformance & Faults: Nonlinear Model (Zoomed in)", color="STATUS_NL", labels={"DC_POWER":"DC Power (kW)", "DATE_TIME":"Date Time", "STATUS_NL":"Status"},range_x=[datetime.date(2020, 6, 1), datetime.date(2020, 6, 17)])
fig.update_traces(marker=dict(size=3, opacity=0.7), selector=dict(mode='markers'))
fig.show()

In [None]:
print("The most anomalies were recorded on {} and {}." .format(df_plant1[df_plant1["STATUS_NL"]== "Fault"]["DATE"].value_counts().index[0], df_plant1[df_plant1["STATUS_NL"]== "Fault"]["DATE"].value_counts().index[1]))
print("Inverter {} and {} had the most events of failure/underperformance." .format(df_plant1[df_plant1["STATUS_NL"]== "Fault"]["SOURCE_KEY"].value_counts().index[0],df_plant1[df_plant1["STATUS_NL"]== "Fault"]["SOURCE_KEY"].value_counts().index[1]))

<a id="section-five"></a>
# 5. Summary

* **EDA:** Noticed some potential issues with the data: DAILY_YIELD is decreasing on some days even though this should not be possible by definition, and the inverters have different numbers of data points.

* **Challenge 1:** Even though this notebook did not focus on forecasting, the power-irradiance models may be helpful in combination with external data from local weather forecasts to predict the generated power for the next couple days.

* **Challenge 2&3:** Successfully identified **events of equipment failure and underperformance** with a rule-based method and linear/nonlinear modeling of the relationship between irradiance, temperature and DC power. This approach can be useful for real-time condition monitoring and fault detection.

### Thank you for reading and please upvote this notebook if you found it useful!