# Solar Power Generation: Anomaly Detection & Forecasting

This notebook shows how important data cleaning and exploration are.  Simply throwing this data into an AutoML program for forecasting would be reckless, because there are too many gotchas in the way the data is measured, and variables that are critical for solar power forecasting are not included in this dataset.  For example, there are no nulls, but that does not mean there are no missing observations.  

The notebook focuses on data cleaning and exploration, missing value imputation, visualization, anomaly and outlier detection, and forecasting.  However, it should be emphasized that the dataset has significant limitations when used for forecasting.  So it really should only be used for anomaly and outlier detection, or in other words, looking for faulty equipment.

In [None]:
import pandas as pd
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
import plotly.express as px

import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

In [None]:
g1 = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv")
g2 = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv")
w1 = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv")
w2 = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv")

# I prefer to work with lowercase column names
for df in [g1, g2, w1, w2]:
    df.columns = [col.lower() for col in df.columns]
    print(df.info(), "\n")

## Data Cleaning

Steps to take:
1. Convert date-times to Pandas datetime format.
2. Check for Null values and decide what to do with them.
3. Check for missing time steps (all sensors should have readings for the same datetimes), and insert Null values if there are missing time steps.
4. If Nulls are inserted for missing time steps, decide what to do with them.
5. Look for univariate outliers and decide what to do with them.

In [None]:
# convert date-times to pandas datetime format
g1['date_time']= pd.to_datetime(g1['date_time'],format='%d-%m-%Y %H:%M')
g2['date_time']= pd.to_datetime(g2['date_time'],format='%Y-%m-%d %H:%M:%S')
w1['date_time']= pd.to_datetime(w1['date_time'],format='%Y-%m-%d %H:%M:%S')
w2['date_time']= pd.to_datetime(w2['date_time'],format='%Y-%m-%d %H:%M:%S')

for df in [g1, g2, w1, w2]:
    print(df.date_time.head(), "\n")

In [None]:
# check for null values
for df in [g1, g2, w1, w2]:
    print(df.isna().sum(), "\n")

In [None]:
# there are no nulls, but are there date-times that are completely missing from some sensors?
Counter(g1.source_key)

There are date-times that are missing from some sensors.  I will fill these with NAs, and determine how to handle them.
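One way to make these gaps visible is to compare each sensor's observed timestamps with the 15-minute steps expected between its first and last reading.  The sketch below uses a tiny made-up frame (`toy`, its values, and the helper `count_missing_steps` are illustrative and not part of the dataset) and assumes the 15-minute sampling interval used in the rest of this notebook.

```python
import pandas as pd

# toy data: sensor "A" is complete, sensor "B" is missing its 00:15 and 00:30 readings
toy = pd.DataFrame({
    "source_key": ["A", "A", "A", "A", "B", "B"],
    "date_time": pd.to_datetime([
        "2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:30", "2020-05-15 00:45",
        "2020-05-15 00:00", "2020-05-15 00:45",
    ]),
})

def count_missing_steps(times, freq="15min"):
    """Count the expected time steps between the first and last reading that are absent."""
    expected = pd.date_range(times.min(), times.max(), freq=freq)
    return len(expected.difference(pd.DatetimeIndex(times)))

# one count per sensor
missing_per_sensor = toy.groupby("source_key")["date_time"].apply(count_missing_steps)
print(missing_per_sensor)
```

A count above zero for any sensor means rows are absent even though `isna()` reports no nulls.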

In [None]:
# insert rows for missing date-times to be filled with NAs
# https://stackoverflow.com/questions/62690513/python-pandas-insert-rows-for-missing-dates-time-series-in-groupby-dataframe

# store original dataframe shapes to validate row counts after inserting nulls for missing time steps
original_rowcounts = [df.shape[0] for df in [g1, g2, w1, w2]]

# define a function to do the time step insertions
def insert_missing_date_times(x):
    """
    Re-indexes a Pandas series to a date-time sequence with a frequency of 15 minutes.
    Any missing time steps in the original series index will be inserted.
    """
    return x.reindex(pd.date_range(x.index.min(), x.index.max(), freq='15min', name='date_time'))

# apply the insertion function to each dataframe
g1_new = g1.set_index('date_time').groupby('source_key').apply(insert_missing_date_times).drop('source_key', axis=1)
g1_new.reset_index(inplace=True)
g2_new = g2.set_index('date_time').groupby('source_key').apply(insert_missing_date_times).drop('source_key', axis=1)
g2_new.reset_index(inplace=True)
w1_new = w1.set_index('date_time').groupby('source_key').apply(insert_missing_date_times).drop('source_key', axis=1)
w1_new.reset_index(inplace=True)
w2_new = w2.set_index('date_time').groupby('source_key').apply(insert_missing_date_times).drop('source_key', axis=1)
w2_new.reset_index(inplace=True)

# now that the indices have been updated, it is safe to overwrite the originals and drop the temporary dfs
g1 = g1_new
g2 = g2_new
w1 = w1_new
w2 = w2_new
del g1_new, g2_new, w1_new, w2_new

# verify that the math works out: there should be just as many rows added as missing values
validated_rowcounts = [
    df.isna().sum().max() + original_rowcounts[idx] == df.shape[0] for idx, df in enumerate(
        [g1, g2, w1, w2]
    )
]
print(validated_rowcounts)  # should all be True

# check the number of missing values
print("\n")
for df in [g1, g2, w1, w2]:
    print(df.isna().sum(), "\n")


Now it should be easy to combine the generator and weather datasets, since the datetimes should match.

In [None]:
# prepare the weather data for merging
for df in [w1, w2]:
    df.rename(columns={"source_key": "weather_sensor_key"}, inplace=True)
    df.drop("plant_id", axis=1, inplace=True)

# merge the data
plant1 = pd.merge(g1, w1, how='left', on=['date_time'])
plant2 = pd.merge(g2, w2, how='left', on=['date_time'])

# count nulls and inspect the data
print(plant1.isna().sum(), "\n")
print(plant2.isna().sum(), "\n")

plant1.head()

In [None]:
# before filling null values, figure out when sunrise/sunset occur
p1_times_of_no_light = plant1[plant1['irradiation']==0.0].groupby('date_time')['source_key'].count().reset_index()
p1_times_of_no_light['hour'] = p1_times_of_no_light['date_time'].dt.hour

p2_times_of_no_light = plant2[plant2['irradiation']==0.0].groupby('date_time')['source_key'].count().reset_index()
p2_times_of_no_light['hour'] = p2_times_of_no_light['date_time'].dt.hour

print("plant1 hours of no sunlight \n", p1_times_of_no_light.groupby('hour')['source_key'].count(), "\n")
print("plant2 hours of no sunlight \n", p2_times_of_no_light.groupby('hour')['source_key'].count(), "\n")

It looks like the sun is shining between the hours of 5am and 6pm.  So any null values outside of this range can safely be set to 0.  To make it easier to fill these values, I will create a bunch of time features first.

In [None]:
# create additional time features
for df in [plant1, plant2]:
    df['date'] = df['date_time'].dt.date
    df['hour'] = df['date_time'].dt.hour
    df['day'] = df['date_time'].dt.day
    df['weekday'] = df['date_time'].dt.day_name()
    df['month'] = df['date_time'].dt.month
    df['year'] = df['date_time'].dt.year

# inspect
plant1.head()

In [None]:
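# determine how total yield is calculated so that its nulls can be filled
grouped = plant1.groupby(['source_key', 'date']).last().reset_index()
# shift within each inverter so one inverter's first day is not compared to another inverter's last day
previous_total_yield = grouped.groupby('source_key')['total_yield'].shift(1)
# this should be mostly True (each inverter's first day has no previous value, so it is False)
grouped['total_yield'] - grouped['daily_yield'] == previous_total_yield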

Total yield = previous total yield + daily yield at the end of the day, or at least, it seems like that is how it is calculated.  There are rows where this equation does not hold, so there may be adjustments made to yield that are not included as features in this dataset, so there is no way to know for certain.  Nevertheless, I will assume the equation holds mostly true for now, especially since I am just using it to fill a few missing values.
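To put a number on "mostly true", the share of inverter-days where the equation holds can be computed directly.  This is a minimal sketch on made-up end-of-day values (the frame `eod` and its numbers are hypothetical); it shifts within each inverter and ignores each inverter's first day, which has no previous total to compare against.

```python
import pandas as pd

# toy end-of-day values for two hypothetical inverters
eod = pd.DataFrame({
    "source_key": ["A", "A", "A", "B", "B", "B"],
    "date": ["2020-05-15", "2020-05-16", "2020-05-17"] * 2,
    "daily_yield": [100.0, 120.0, 90.0, 80.0, 70.0, 60.0],
    "total_yield": [1100.0, 1220.0, 1300.0, 2080.0, 2150.0, 2210.0],
})

# previous day's total yield, computed separately for each inverter
prev_total = eod.groupby("source_key")["total_yield"].shift(1)
has_prev = prev_total.notna()

# the equation holds when total_yield - daily_yield equals the previous total (within a small tolerance)
holds = (eod["total_yield"] - eod["daily_yield"] - prev_total).abs() < 1e-6
share_holding = holds[has_prev].mean()
print(f"equation holds on {share_holding:.0%} of comparable inverter-days")
```

The tolerance of `1e-6` is an arbitrary choice for floating point comparison; a looser one may be appropriate for real meter readings.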

To fill the missing values, I will fill anything outside the hours of sunlight with 0, except for temperatures.  Then I will fill missing daily yields after sunset with the last non-null value for that day.  Then I will linearly interpolate other missing values, because the observations are ordered by time and follow roughly linear trends throughout the day.  Linear interpolation may chop off the parabolic peaks in these patterns, corresponding to noon/max sunlight, but there are so few missing values, it should not matter very much.  I just need them to be filled close to what they would be.

After interpolation, there should be no more missing values.
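As an illustration of the interpolation step, here is a small sketch on made-up values (the frame `toy` is hypothetical).  It interpolates within each inverter, which is a slight variation on interpolating the whole column at once: it prevents a gap at the end of one inverter's series from being filled using the start of the next inverter's series.

```python
import numpy as np
import pandas as pd

# toy data: two hypothetical inverters, each with gaps in dc_power
toy = pd.DataFrame({
    "source_key": ["A"] * 5 + ["B"] * 5,
    "dc_power": [0.0, 10.0, np.nan, 30.0, 40.0, 5.0, np.nan, np.nan, 20.0, 25.0],
})

# linear interpolation, applied separately to each inverter's series
toy["dc_power_filled"] = toy.groupby("source_key")["dc_power"].transform(
    lambda s: s.interpolate(method="linear")
)
print(toy)
```

Each gap is filled on the straight line between its neighbours, which is why peaks around noon can be flattened when a gap spans them.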

In [None]:
# fill the null values created by inserting the missing time steps

# start by filling missing values outside the hours of sunlight with 0 (except for temperatures)
# daily_yield is cumulative, so only fill it before the sun rises
for df in [plant1, plant2]:
    df.loc[(df['dc_power'].isna()) & ((df['hour'] < 5) | (df['hour'] > 18)), 'dc_power'] = 0
    df.loc[(df['ac_power'].isna()) & ((df['hour'] < 5) | (df['hour'] > 18)), 'ac_power'] = 0
    df.loc[(df['daily_yield'].isna()) & (df['hour'] < 5), 'daily_yield'] = 0
    df.loc[(df['irradiation'].isna()) & ((df['hour'] < 5) | (df['hour'] > 18)), 'irradiation'] = 0

# fill missing daily_yield after sunset with the last non-null value for that day
# update in place with .loc so that the changes reach plant1 and plant2
for df in [plant1, plant2]:
    # groupby 'last' skips nulls, so this is the last non-null daily_yield per inverter and day
    fill_values = df.groupby(["source_key", "date"])["daily_yield"].transform("last")
    after_sunset_nulls = df['daily_yield'].isna() & (df['hour'] > 18)
    df.loc[after_sunset_nulls, 'daily_yield'] = fill_values[after_sunset_nulls]

# linearly interpolate other missing values, but ignore total_yield because it = previous total yield + daily yield
columns_to_interpolate = [
    'dc_power', 'ac_power', 'daily_yield',
    'module_temperature', 'ambient_temperature', 'irradiation',
]
for df in [plant1, plant2]:
    for col in columns_to_interpolate:
        # assign back instead of using inplace=True on a column selection, which may act on a copy
        df[col] = df[col].interpolate(method='linear')

# fill missing total_yield by getting the most recent non-null value and adding the current daily yield
for df in [plant1, plant2]:
    by_inverter_and_day = df.groupby(["source_key", "date"])
    # 'last' skips nulls, so these are the final non-null values per inverter and day
    final_non_null_total_yields = by_inverter_and_day["total_yield"].transform("last")
    final_daily_yields = by_inverter_and_day["daily_yield"].transform("last")
    total_yield_fill_value = final_non_null_total_yields + final_daily_yields
    # assign back to the column so that plant1 and plant2 themselves are updated
    df['total_yield'] = df['total_yield'].fillna(total_yield_fill_value)

# if there are still any missing total_yields, fill them with the most recent non-null value up to that time
# also fill missing plant IDs this way, since they are all the same
for df in [plant1, plant2]:
    df.sort_values(['source_key', 'date_time'], ascending=True, inplace=True)
    df['total_yield'] = df['total_yield'].ffill()
    df['plant_id'] = df['plant_id'].ffill()

# finally, ensure there are no more missing values
for df in [plant1, plant2]:
    print(df.isna().sum(), "\n")


Now I will identify outliers due to measurement error.  The tricky part here will be distinguishing between outliers that are indications of possible problems with the inverter versus sensor reading errors.  Right now I only want to identify the latter.  Those might be more extreme outliers or points that do not make any sense at all.  A visual spot check should reveal if there are any of these.

Something important to remember here is that sensor readings should be consistent overall, but they should be consistent by hour of the day as well.  So I will also need to check distributions by hour of the day.  It is possible that extreme outliers could be masked in the larger distribution, but become noticeable when viewed from the hourly level.
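To show how an outlier can hide in the overall distribution, the sketch below applies the common 1.5 x IQR rule twice: once to the whole column, and once within each hour of the day.  The data and the helper `iqr_flags` are made up for illustration, and the IQR rule is an assumption here; the checks in this notebook are visual.

```python
import pandas as pd

# toy data: at noon, one reading (100) is far below the others, but it looks normal next to the 6am readings
toy = pd.DataFrame({
    "hour": [12] * 6 + [6] * 6,
    "dc_power": [900.0, 950.0, 1000.0, 1050.0, 1100.0, 100.0,
                 90.0, 95.0, 100.0, 105.0, 110.0, 100.0],
})

def iqr_flags(values, q1, q3, k=1.5):
    """Flag values more than k * IQR below the first quartile or above the third."""
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# quartiles of the whole column
global_flags = iqr_flags(
    toy["dc_power"], toy["dc_power"].quantile(0.25), toy["dc_power"].quantile(0.75)
)

# quartiles computed separately for each hour and broadcast back to the rows
by_hour = toy.groupby("hour")["dc_power"]
hourly_flags = iqr_flags(
    toy["dc_power"],
    by_hour.transform(lambda s: s.quantile(0.25)),
    by_hour.transform(lambda s: s.quantile(0.75)),
)

print("flagged globally:", int(global_flags.sum()))
print("flagged within hour:", int(hourly_flags.sum()))
```

Only the hourly version flags the low noon reading, because the mix of morning and midday values makes the overall spread very wide.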

In [None]:
sns.set(rc={'figure.figsize':(11.7, 8.27)})

In [None]:
# plant 1
for col in plant1.columns[plant1.dtypes == 'float']:
    sns.histplot(plant1[col])  # histplot replaces the deprecated distplot
    plt.title(f"Plant 1 Distribution of {col}")
    plt.show()

In [None]:
# plant 2
for col in plant2.columns[plant2.dtypes == 'float']:
    sns.histplot(plant2[col])  # histplot replaces the deprecated distplot
    plt.title(f"Plant 2 Distribution of {col}")
    plt.show()

In [None]:
# plant 1 by hour of the day
# compute the hourly means once, using only the numeric columns
grouped = plant1.groupby('hour').mean(numeric_only=True).reset_index()
for col in plant1.columns[plant1.dtypes=='float']:
    sns.scatterplot(data=grouped, x='hour', y=col)
    plt.title(f"Plant 1 Mean {col} by Hour of the Day")
    plt.show()

In [None]:
# plant 2 by hour of the day
# compute the hourly means once, using only the numeric columns
grouped = plant2.groupby('hour').mean(numeric_only=True).reset_index()
for col in plant2.columns[plant2.dtypes=='float']:
    sns.scatterplot(data=grouped, x='hour', y=col)
    plt.title(f"Plant 2 Mean {col} by Hour of the Day")
    plt.show()

There are days that have daily_yields above 0 after midnight and before the sun has risen.  This is not possible and indicates a sensor reading error.  I will set these values to 0.

DC Power seems to be 10x higher for plant 1 than for plant 2.  So I will divide plant 1's values by 10 to make them comparable.  

Total yield is a weird field.  The dataset's documentation says it is supposed to be the total yield for an inverter up to a certain point in time.  So unless the inverter is brand new, it should never have a total yield of 0.  Plant 1 has none of these, but plant 2 has plenty of inverters with total yield > 0 one day that are suddenly 0 the next day.  It doesn't make sense.  Also, if the total_yield feature measures total yield up to a point in time, then it should equal the previous day's total yield, plus the current day's final daily yield.  Looking at the data, that math works out most of the time, but there are many rows where it does not.  There seem to be adjustments made that are not recorded as a feature in this dataset, causing them to show up as sudden drops in total yield.  Since it is uncertain how total yield is calculated, and why it varies, it is an unreliable field.  I will keep it in the data, but will avoid using it for modeling.

**TLDR: Total yield is an unreliable feature, so I will avoid using it for modeling.**
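A cumulative total should never decrease, so one simple way to count the suspicious rows is to look for negative changes within each inverter.  This is a sketch on made-up values (the frame `toy` and the column `total_yield_dropped` are hypothetical names).

```python
import pandas as pd

# toy data: inverter "A" resets to 0, inverter "B" grows as a cumulative total should
toy = pd.DataFrame({
    "source_key": ["A", "A", "A", "B", "B", "B"],
    "total_yield": [1000.0, 1100.0, 0.0, 500.0, 560.0, 620.0],
})

# change from the previous reading, computed separately for each inverter
change = toy.groupby("source_key")["total_yield"].diff()
toy["total_yield_dropped"] = change < 0

drops_per_inverter = toy.groupby("source_key")["total_yield_dropped"].sum()
print(drops_per_inverter)
```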

In [None]:
# fix unreasonable daily_yield values
for df in [plant1, plant2]:
    df.loc[(df['daily_yield'] > 0) & (df['hour'] < 5), 'daily_yield'] = 0

# re-scale plant 1's dc power values to make them comparable to plant 2's
plant1['dc_power'] /= 10.

## Feature Engineering

It would be nice to calculate features that capture the DC power, AC power, and yield at different levels.  Now that there are time features in the dataset, I can calculate features at the following levels: 
* the 15 minute level (1 value per row)
* the hour level (1 value per inverter per hour)
* the day level (1 value per inverter per day)
* the inverter level (1 value per inverter)  

Afterwards, I can do some more EDA on the new features.

In [None]:
#calculate other useful features

for df in [plant1, plant2]:
    df.sort_values(['source_key', 'date_time'], ascending=True, inplace=True)
    df.rename(columns={"daily_yield": "cumulative_daily_yield"}, inplace=True)
    
    # 15 minute level features (1 value per row)
    
    df['dc_ac_ratio'] = np.where(df['ac_power'] == 0, 0, df['dc_power']/df['ac_power'])
    df['yield'] = df['cumulative_daily_yield'].diff().fillna(0)
    # fix differences at the boundaries
    source_key_mask = df['source_key'] != df['source_key'].shift(1)
    day_mask = df['date'] != df['date'].shift(1)
    df.loc[source_key_mask, 'yield'] = 0
    df.loc[day_mask, 'yield'] = 0
    
    # hour level features (1 value per inverter per hour of the day, aggregated across all days)
    
    df['avg_hourly_dc_power'] = df.groupby(['source_key', 'hour'])['dc_power'].transform('mean')
    df['avg_hourly_ac_power'] = df.groupby(['source_key', 'hour'])['ac_power'].transform('mean')
    df['hourly_yield'] = df.groupby(['source_key', 'hour'])['yield'].transform('sum')
    
    # day level features (1 value per inverter per day)
    
    df['avg_daily_dc_power'] = df.groupby(['source_key', 'date'])['dc_power'].transform('mean')
    df['avg_daily_ac_power'] = df.groupby(['source_key', 'date'])['ac_power'].transform('mean')
    df['avg_daily_dc_ac_ratio'] = df.groupby(['source_key', 'date'])['dc_ac_ratio'].transform('mean')
    # daily_yield is cumulative, so the final value, when grouped by inverter and date, is the total daily yield
    df['total_daily_yield'] = df.groupby(['source_key', 'date'])['cumulative_daily_yield'].transform('last')
    
    # inverter level features (1 value per inverter)
    
    # average daily yield should be calculated using the daily total, so there should be 1 average per inverter
    df['avg_daily_yield'] = df.groupby(['source_key'])['total_daily_yield'].transform('mean')


In [None]:
# manually spot check the features
plant1[plant1['hour']>6].head(15)

In [None]:
for col in plant1.columns[plant1.dtypes=='float']:
    sns.histplot(plant1[col])  # histplot replaces the deprecated distplot
    plt.title(f"Plant 1 Distribution of {col}")
    plt.show()

In [None]:
for col in plant2.columns[plant2.dtypes=='float']:
    sns.histplot(plant2[col])  # histplot replaces the deprecated distplot
    plt.title(f"Plant 2 Distribution of {col}")
    plt.show()

### EDA on the New Features

Looking at the distributions above, a few things become clear.  

1. Many features will need to be transformed to have more normal distributions before modeling.  However, the large number of zero values is the main cause of the skewed distributions.  I will ponder that before making any transformations...
2. The new yield and hourly_yield features have some extreme outliers.
3. Plant 2 had days when no power was generated.
4. dc_ac_ratio is either 0 or 1, meaning it cannot be used to find faulty inverters, because the inverter either converts 100% of DC to AC or nothing at all.  That also means that either dc_power or ac_power can be used, while the other can be ignored.  The same can be said for all features derived from the relationship between dc_power and ac_power.  **I will keep the dc_power features, and drop the ac_power features and all dc/ac ratio features.**

**Note about zero values:**
There are many zero values in power generation due to hours with no sunlight, or days where no power was generated.  Scaling or standardizing the data, in preparation for modeling, will be hard with all of these zeros.  Instead, I will drop the rows where there is no sunlight, because who cares about forecasting or looking for anomalies when there is no sunlight?  There would be no power produced at all during these times.  Dropping these rows should remove most of the zeros and reduce or remove the skew in the distributions.  **I will not do anything about the skewed distributions now, but if I use models that require assuming normality, I will address them then.**

### Why not combine the data for both plants?

Some relationships between features, like the one between irradiation and yield, would be the same for both plants.  In these cases, it would make sense to combine the data and model the combined data so there are more samples to learn from.  Other relationships, however, are unique to inverters.  So combining the data would yield no benefit for the feature relationships specific to each inverter.  For example, a time series model fitted to one inverter would most likely be useless when applied to another inverter (unless the yield is so dependent on irradiation that the time series essentially captures irradiation patterns).  The time series models I plan to use for anomaly detection and forecasting will be specific to each inverter.  I may use a neural network, and at that time, I will combine the inputs, but until then, I will keep them separate.

In [None]:
# inspect the outliers in yield and hourly_yield
plant1[plant1['hourly_yield'] < -2000].head(15)

In [None]:
plant2[plant2['hourly_yield'] < -2000].head(15)

In [None]:
# see previous notes about why total_yield and ac_power related features are being removed
features_to_keep = [
    "source_key", "plant_id", 
    "date_time", "date", "hour", "day", "weekday", "month", "year",    
    'dc_power', 'cumulative_daily_yield', 
    'ambient_temperature', 'module_temperature', 'irradiation',
    'avg_hourly_dc_power',
    'avg_daily_dc_power', 
    'total_daily_yield', 'avg_daily_yield',
    'yield', 'hourly_yield',
]
plant1.drop([c for c in plant1.columns if c not in features_to_keep], axis=1, inplace=True)
plant2.drop([c for c in plant2.columns if c not in features_to_keep], axis=1, inplace=True)

## Exploratory Data Analysis

In [None]:
sns.set(rc={'figure.figsize':(11.7, 8.27)})

In [None]:
# view an inverter's DC output over time
sns.lineplot(data=plant1[plant1['source_key']=='YxYtjZvoooNbGkE'], x='date_time', y='dc_power')
plt.title("DC Power Output over Time")
plt.show()

In [None]:
# plots for all inverters

p1_grouped = plant1.groupby(['source_key', 'date']).last().reset_index()
p2_grouped = plant2.groupby(['source_key', 'date']).last().reset_index()

sns.lineplot(data=p1_grouped, x='date', y='total_daily_yield', hue='source_key')
plt.title("Plant 1 Daily Yield for all Inverters Over Time")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

sns.lineplot(data=p2_grouped, x='date', y='total_daily_yield', hue='source_key')
plt.title("Plant 2 Daily Yield for all Inverters Over Time")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

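# average only the numeric columns; text columns such as source_key cannot be averaged
p1_mean_grouped = plant1.groupby('date').mean(numeric_only=True).reset_index()
p2_mean_grouped = plant2.groupby('date').mean(numeric_only=True).reset_index()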

sns.lineplot(data=p1_mean_grouped, x='date', y='irradiation', label="Plant 1")
sns.lineplot(data=p2_mean_grouped, x='date', y='irradiation', label="Plant 2")
plt.title("Daily Mean Irradiation Over Time")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

p1_mean_grouped = plant1.groupby(['source_key', 'date']).mean(numeric_only=True).reset_index()
p2_mean_grouped = plant2.groupby(['source_key', 'date']).mean(numeric_only=True).reset_index()

sns.lineplot(data=p1_mean_grouped, x='date', y='dc_power', hue='source_key')
plt.title("Plant 1 Daily DC Power for all Inverters Over Time")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

sns.lineplot(data=p2_mean_grouped, x='date', y='dc_power', hue='source_key')
plt.title("Plant 2 Daily DC Power for all Inverters Over Time")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#### Insights from these Plots

Plant 1 has some inverters that had no yield some days, and some whose dc power dropped lower than the rest. Those could be signs of faulty inverters.

Plant 2 has a lot more variation.  A couple of inverters had no yield some days, and one had no dc power output for several consecutive days.  

Sunlight was similar for both plants, with some minor variation.

#### Things to Explore, Based on These Insights

Inverter performance might vary over time because of weather and other factors, but what about overall performance?  It would be interesting to see how the inverters perform by hour of the day, for the entire time period.

In [None]:
# inverter performance might vary over time bc of weather & other factors
# but what about overall performance?  for all hours of the day for the entire period, how do the inverters perform?
p1_grouped = plant1.groupby(['source_key', 'hour']).last().reset_index()
p2_grouped = plant2.groupby(['source_key', 'hour']).last().reset_index()

sns.lineplot(data=p1_grouped, x='hour', y='avg_hourly_dc_power', hue='source_key')
plt.title("Plant 1 Average DC Output for each Inverter by Hour of the Day")
plt.ylabel("Output (kW)")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

sns.lineplot(data=p2_grouped, x='hour', y='avg_hourly_dc_power', hue='source_key')
plt.title("Plant 2 Average DC Output for each Inverter by Hour of the Day")
plt.ylabel("Output (kW)")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

#### Insights from These Plots

Plant 1 has a couple inverters with lower dc power, on average, than the rest.

Plant 2 has so much variation.  On average, plant 1 seems to have higher performing inverters, at least in terms of consistency and dc output.  It's interesting that plant 2 has higher yields, on average, despite lower dc output.  

#### Things to Explore Next

I want to look at bivariate relationships.

In [None]:
sns.pairplot(data=plant1[['module_temperature', 'ambient_temperature', 'irradiation', 'yield', 'dc_power']])
plt.show()

In [None]:
sns.pairplot(data=plant2[['module_temperature', 'ambient_temperature', 'irradiation', 'yield', 'dc_power']])
plt.show()

#### Insights from These Plots

From these charts, it is apparent that:
1. Temperature and irradiation are all strongly correlated.  No surprise there.
2. The distribution of yield is strange, and because of it, it is hard to spot any trends.  It looks like dc_power might be a much more reliable indicator of output.  There are likely other factors affecting yield that are not in this dataset.  That would explain why yields sometimes drop during the day, as if they're being adjusted for some unobserved phenomenon. 
3. Plant 2 has so much variation that it hides the relationship between power output and temperature and irradiation.  

**After exploring the data, I have decided to use dc power as the target variable for forecasting, instead of yield.**  I will also use it for anomaly detection.

## Anomaly Detection

A good inverter should be consistently productive.  So I need to look at features that measure productivity and consistency.  That can be average daily yield divided by the standard deviation in daily yield, or the number of times the inverter stopped working, as counted by the number of zeros in DC power output.  I will use the latter, since periodic outages are signs of a faulty inverter.
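For reference, the first option (mean daily yield relative to its standard deviation) could be computed as in the sketch below.  The frame `daily` holds made-up daily totals for two hypothetical inverters, and `mean_over_std` is an illustrative column name; this metric is not used in the rest of the notebook.

```python
import pandas as pd

# toy daily totals: inverter "A" is steady, inverter "B" is erratic and has an outage day
daily = pd.DataFrame({
    "source_key": ["A"] * 4 + ["B"] * 4,
    "total_daily_yield": [100.0, 102.0, 98.0, 100.0, 100.0, 0.0, 150.0, 50.0],
})

summary = daily.groupby("source_key")["total_daily_yield"].agg(["mean", "std"])
# higher values mean more yield per unit of day-to-day variation
summary["mean_over_std"] = summary["mean"] / summary["std"]
print(summary)
```

A steady inverter scores much higher than an erratic one, even when their best days are comparable.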

In [None]:
a = plant1[plant1['dc_power'] == 0].groupby('source_key')['date_time'].count().reset_index().rename(columns={"date_time": "nbr_zeros_dc_power"}).sort_values('nbr_zeros_dc_power', ascending=False)
b = plant1.groupby('source_key')['avg_daily_yield'].mean().reset_index().sort_values('avg_daily_yield', ascending=False)
plant1_efficiency = pd.merge(a, b, how='inner', on='source_key')

a = plant2[plant2['dc_power'] == 0].groupby('source_key')['date_time'].count().reset_index().rename(columns={"date_time": "nbr_zeros_dc_power"}).sort_values('nbr_zeros_dc_power', ascending=False)
b = plant2.groupby('source_key')['avg_daily_yield'].mean().reset_index().sort_values('avg_daily_yield', ascending=False)
plant2_efficiency = pd.merge(a, b, how='inner', on='source_key')

fig = px.scatter(
    plant1_efficiency, x='nbr_zeros_dc_power', y='avg_daily_yield', 
    color='source_key', hover_data=['source_key'],
    title="Plant 1 Inverter Avg Daily Yield by Nbr Occurrences of 0 DC Output"
)
fig.update_layout(shapes = [
    {'type': 'line', 'yref': 'paper', 'xref': 'paper', 'y0': 0, 'y1': 1, 'x0': 0, 'x1': 1}
])
fig.show()

fig = px.scatter(
    plant2_efficiency, x='nbr_zeros_dc_power', y='avg_daily_yield', 
    color='source_key', hover_data=['source_key'],
    title="Plant 2 Inverter Avg Daily Yield by Nbr Occurrences of 0 DC Output"
)
fig.update_layout(shapes = [
    {'type': 'line', 'yref': 'paper', 'xref': 'paper', 'y0': 0, 'y1': 1, 'x0': 0, 'x1': 1}
])
fig.show()


#### Insights from These Plots

The inverters in the top left of these plots are the best performers.  The ones in the bottom right are the worst. 

**The inverters in the bottom right section of these plots could be faulty and need replacing.**

#### Things to Look for Next

The next step is to look for times when all inverters performed poorly and irradiation was good.  This could indicate that the panels are all dirty at those times, and should be cleaned.

In [None]:
# filter out possibly faulty inverters by finding them in the plots above
plant1_good = plant1[plant1['avg_daily_yield']>4000].copy()  # easier way to filter these out than by name (see graph)
plant2_good = plant2[plant2['avg_daily_yield']>4700].copy()

# regress dc_power on irradiation and module_temperature - they are linearly related, so a regression is appropriate
# do not fit an intercept, since there should be no power without irradiation
lr_model = LinearRegression(fit_intercept=False)  # normalize was ignored without an intercept and has been removed from scikit-learn
lr_model.fit(plant1_good[['module_temperature', 'irradiation']], plant1_good['dc_power'])
print("Plant 1 R^2", lr_model.score(plant1_good[['module_temperature', 'irradiation']], plant1_good['dc_power']))
lr_preds = lr_model.predict(plant1_good[['module_temperature', 'irradiation']])
residuals = plant1_good['dc_power'] - lr_preds
plant1_good['lower_dc_power_than_expected'] = np.where(residuals < 0, 1, 0)
plant1_good['1day_ma_of_lower_dc_power_than_expected'] = plant1_good.groupby('date')['lower_dc_power_than_expected'].transform(np.mean)

lr_model = LinearRegression(fit_intercept=False)  # normalize was ignored without an intercept and has been removed from scikit-learn
lr_model.fit(plant2_good[['module_temperature', 'irradiation']], plant2_good['dc_power'])
print("Plant 2 R^2", lr_model.score(plant2_good[['module_temperature', 'irradiation']], plant2_good['dc_power']))
lr_preds = lr_model.predict(plant2_good[['module_temperature', 'irradiation']])
residuals = plant2_good['dc_power'] - lr_preds
plant2_good['lower_dc_power_than_expected'] = np.where(residuals < 0, 1, 0)
plant2_good['1day_ma_of_lower_dc_power_than_expected'] = plant2_good.groupby('date')['lower_dc_power_than_expected'].transform(np.mean)

sns.scatterplot(data=plant1_good, x='date_time', y='dc_power', hue='lower_dc_power_than_expected')
plt.title("Plant 1 DC Power when Lower than Expected")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlim(plant1_good.date_time.min(), plant1_good.date_time.max())
plt.show()

sns.scatterplot(data=plant1_good, x='date_time', y='dc_power', hue='1day_ma_of_lower_dc_power_than_expected')
plt.title("Plant 1 Daily Average DC Power Lower than Expected")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlim(plant1_good.date_time.min(), plant1_good.date_time.max())
plt.show()

sns.scatterplot(data=plant2_good, x='date_time', y='dc_power', hue='lower_dc_power_than_expected')
plt.title("Plant 2 DC Power when Lower than Expected")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlim(plant2_good.date_time.min(), plant2_good.date_time.max())
plt.show()

sns.scatterplot(data=plant2_good, x='date_time', y='dc_power', hue='1day_ma_of_lower_dc_power_than_expected')
plt.title("Plant 2 Daily Average DC Power Lower than Expected")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlim(plant2_good.date_time.min(), plant2_good.date_time.max())
plt.show()

**The dark purple bands in these plots show when all of the inverters for a plant, on average, performed worse than expected.  It could indicate that the panels needed to be cleaned.**

Interestingly, there are times when the performance was worse than expected, and then suddenly turned around.  That could indicate that the panels were cleaned.  Signs pointing to this are the colors getting darker purple over time, then quickly reversing to light pink.

In the absence of labeled data or knowledge of when (or if) the panels were cleaned during this time period, guessing when panels are dirty based on changes in collective inverter performance is the best we can do.

## Forecasting Future Output

The problem with forecasting solar power generation is that it is so dependent on weather.  Fitting an autoregressive model to yield or output seems silly, because by definition, autoregression fits a variable to its previous values.  How can you predict power generation without knowing anything about the weather?  The weather data is not very helpful here either.  Yes, we can say that the number of hours of sunlight tomorrow should be similar to what they are today, but that is assuming there are no clouds.  There is no way to predict how cloud cover might impact irradiation with the features in this dataset.  So there is no clean way to forecast power generation.  Nevertheless, I will fit a model and see what happens.  

#### Approach

I explained before why yield was an unreliable feature.  So DC power will be my target variable. 

The first model I will fit is a model that forecasts DC Power, at the 15 minute level, for each inverter.  I will use irradiation and module temperature as features.  A simple regression model that predicts DC power output for given values of irradiation and module temperature would be useful, but it would be more useful to forecast future outputs.  So the model will make 1 step forecasts (forecasts of the DC power output over the next 15 minutes), and it will be specific to an inverter.   

This first model of DC power can be used to find outliers in the DC power output of an individual inverter.  By calculating the 95% prediction intervals, outliers can be found by comparing the true DC power outputs to the predicted value intervals.  Since we only really care about faulty inverters, the lower interval will be of the most importance.  So observations that are lower than expected (predicted) will be plotted, and any trends will become visible.  The trends may show whether an inverter is performing worse than expected for consecutive time steps, which may indicate that the inverter is faulty and needs replacing.  However, when making any judgements on inverter faultiness using the model, it will be important to remember that the benchmark for "good" performance is the inverter's own historical performance.  So if the inverter performed poorly from the start of the dataset, the model may not show anything useful, because it will have trained exclusively on poor performance. 
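To make the interval logic concrete, here is a minimal sketch on toy numbers.  The function name `flag_low_outliers` and the data are hypothetical, and the standard deviation is estimated from the residuals with n - 2 degrees of freedom, which is an assumption rather than the only reasonable choice.

```python
import numpy as np

def flag_low_outliers(y_true, y_pred, z=1.96):
    """Return a boolean mask of points that fall below the lower bound."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # residual standard error (assumes n - 2 degrees of freedom)
    stdev = np.sqrt(np.sum(residuals ** 2) / (len(y_true) - 2))
    # lower edge of an approximate 95% prediction interval
    lower = y_pred - z * stdev
    return y_true < lower

# toy data: predictions follow a smooth daily curve, and one reading collapses
y_pred = np.array([0.2, 0.5, 0.8, 0.9, 0.8, 0.5, 0.2])
y_true = y_pred + np.array([0.01, -0.02, 0.01, -0.6, 0.02, -0.01, 0.0])

mask = flag_low_outliers(y_true, y_pred)
print(np.flatnonzero(mask))  # positions of the lower-than-expected readings
```

Note that a single large residual inflates the estimated standard deviation, so the interval widens as anomalies get more extreme.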

Another thing that would be useful for a power plant to know is what its expected daily output would be, going into the future.  Each inverter has an average DC output for each day (the average DC power output of every 15 minute interval while there is sunlight).  A model could be fitted to these to forecast the average DC power output for the entire plant.  So I will fit a second model to forecast the average DC power output, for all of a plant's inverters, for the next day.  
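As a sketch of how such daily averages could be built, the example below uses a toy frame with the same column names as the notebook (`date_time`, `source_key`, `dc_power`, `irradiation`).  The values are made up, and "while there is sunlight" is approximated here as `irradiation > 0`, which is an assumption.

```python
import pandas as pd

# toy readings for two inverters over two days (values are made up)
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 06:00", "2020-05-15 12:00", "2020-05-15 23:00",
        "2020-05-15 06:00", "2020-05-15 12:00", "2020-05-15 23:00",
        "2020-05-16 12:00", "2020-05-16 12:00",
    ]),
    "source_key": ["A", "A", "A", "B", "B", "B", "A", "B"],
    "dc_power": [100.0, 900.0, 0.0, 200.0, 800.0, 0.0, 700.0, 500.0],
    "irradiation": [0.1, 0.9, 0.0, 0.1, 0.9, 0.0, 0.8, 0.8],
})

# keep daylight readings only, then tag each row with its calendar date
daylight = toy[toy["irradiation"] > 0].assign(
    date=lambda d: d["date_time"].dt.date
)

# average DC power per inverter per day: one row per day, one column per inverter
per_inverter = (
    daylight.groupby(["date", "source_key"])["dc_power"]
    .mean()
    .unstack("source_key")
)

# plant-level series: the mean across inverters for each day
plant_daily = per_inverter.mean(axis=1)
print(plant_daily)
```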

In [None]:
# get a list of inverters, so that 1 can be used as an example
set(plant1.source_key)

In [None]:
class InverterModel:
    def __init__(self, df, source_key):
        self.df = df
        self.source_key = source_key
        self.N = None
        self.T = 16  # use T previous time steps to predict the next one (there are 4 per hour)
        self.D = 2  # irradiation and module_temperature are the features to be used
        self.model = None
        self.r = None
        self.predictions = None
        self.validation_predictions = None
        self.anomalies = None

    def _prepare_input_data(self):
        """
        Prepares input data for an LSTM, for a given source_key.  
        Input will be a N x T x D array, where:
            N = number of samples, or rows
            T = number of previous time steps used to predict the next one
            D = number of features
        The data will be normalized, and the first half will be used as the training set,
        while the second half will be used as the validation set.  This method assumes 
        that the dataset has already been sorted by date_time.
        """
        # remove the times when irradiation = 0
        dsub = self.df[
            (self.df['irradiation'] > 0) & (self.df['source_key']==self.source_key)
        ][['dc_power', 'irradiation', 'module_temperature']].values

        # store the number of samples
        self.N = len(dsub)

        # normalize the data
        # use the first half of the dataset for training and the second half for validation
        # so fit the scaler to the first half, but transform all of it
        normalizer = MinMaxScaler()
        normalizer.fit(dsub[:-self.N//2])
        dsub = normalizer.transform(dsub)

        # reshape the data in preparation for input
        X = []
        Y = []
        for t in range(dsub.shape[0] - self.T):
            x = dsub[t:t+self.T, 1:]  # stop at time t+T, because t+T is the target
            X.append(x)
            y = dsub[t+self.T, 0]
            Y.append(y)

        # store arrays as object attributes, so they can be referenced in other methods
        self.X = np.array(X)
        self.Y = np.array(Y)

        # ensure that the X inputs are shape (N-T, T, D) and the Y inputs are shape (N-T,)
        assert self.X.shape == (len(dsub)-self.T, self.T, self.D)
        assert self.Y.shape[0] == len(dsub)-self.T

    def _fit_lstm(self):
        """
        Builds and trains an LSTM model on the given dataset.  The first half of the 
        dataset is used for training, and the second half is used for validation.
        """
        # build LSTM model
        i = Input(shape=(self.T, self.D))
        x = LSTM(10)(i)
        x = Dense(5)(x)
        x = Dense(1)(x)
        self.model = Model(i, x)
        self.model.compile(
            loss='mse',
            optimizer=Adam(learning_rate=0.05),  # `lr` is the deprecated name for this argument
        )

        # train the LSTM
        # use the first half of the dataset for training and the second half for validation
        self.r = self.model.fit(
            self.X[:-self.N//2], self.Y[:-self.N//2],
            batch_size=32,
            epochs=100,
            validation_data=(self.X[-self.N//2:], self.Y[-self.N//2:]),
            verbose=0,
        )

    def _plot_lstm_loss(self):
        """
        Plots the LSTM's training and validation loss.
        """
        plt.plot(self.r.history['loss'], label='Training Loss')
        plt.plot(self.r.history['val_loss'], label='Validation Loss')
        plt.legend()
        plt.title("Training and Validation Loss by Epoch")
        plt.xlabel("Epoch")
        plt.ylabel("Loss")
        plt.show()
    
    def _one_step_forecast(self, show_plot=False):
        """
        Makes one-step forecasts for every sample, using the observed input features
        """
        outputs = self.model.predict(self.X)
        print("Prediction shape:", outputs.shape)
        self.predictions = outputs[:,0]

        if show_plot:
            plt.plot(self.Y, label='targets')
            plt.plot(self.predictions, label='predictions')
            plt.title("One Step Forecast")
            plt.legend()
            plt.show()
    
    def _multi_step_forecast(self, show_plot=False):
        """
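        Makes a recursive multi-step forecast over the validation set, feeding 
        each prediction back into the input window
        """
        validation_target = self.Y[-self.N//2:]
        self.validation_predictions = []

        # first validation input (copied, so that self.X is never modified)
        last_x = self.X[-self.N//2].copy()  # array of shape (T, D)

        while len(self.validation_predictions) < len(validation_target):
            # get predictions and turn the 1x1 array into a scalar
            pred = self.model.predict(last_x.reshape(1, self.T, self.D))[0,0]

            # update the predictions list
            self.validation_predictions.append(pred)

            # make the new input by shifting the window one step along the time axis
            # (without axis=0, np.roll flattens the array and mixes the two features)
            # caveat: pred is a DC power value, but it overwrites the irradiation and
            # module temperature slots of the newest time step
            last_x = np.roll(last_x, -1, axis=0)
            last_x[-1] = pred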

        if show_plot:
            plt.plot(validation_target, label='True Value')
            plt.plot(self.validation_predictions, label='Forecasted Value')
            plt.title(f"{self.T} Step Forecast for the Validation Set")
            plt.legend()
            plt.show()
    
    def _find_anomalies(self, anomaly_type="low", show_plot=False):
        """
        Finds the anomalies in the one-step ahead predictions, using a 
        95% prediction interval.  The anomaly_type specifies which samples to 
        flag: "high" flags true values above the interval, "low" flags true 
        values below it, and any other value flags both.  For instance, we may 
        only wish to see samples where output was lower than expected.
        """
        # calculate the prediction interval to find anomalies
        sum_errors = np.sum((self.Y-self.predictions)**2)
        stdev = np.sqrt(1/(len(self.Y)-2) * sum_errors)
        
        # 95% prediction interval = 1.96 standard deviations
        interval = 1.96 * stdev
        lower, upper = self.predictions - interval, self.predictions + interval
        
        # samples where the true value was outside of the prediction interval
        if anomaly_type == "high":
            self.anomalies = np.where(self.Y > upper, 1, 0)
        elif anomaly_type == "low":
            self.anomalies = np.where(self.Y < lower, 1, 0)
        else:
            self.anomalies = np.where(((self.Y > upper) | (self.Y < lower)), 1, 0)

        if show_plot:
            plt.plot(self.Y, label='targets')
            # mark every anomalous sample with a red dot, in a single call
            anomaly_idx = np.flatnonzero(self.anomalies)
            plt.plot(anomaly_idx, self.Y[anomaly_idx], 'o', color='red', label='anomalies')
            plt.legend()
            plt.title("Anomalous DC Power Outputs")
            plt.show()

        print(f"{np.sum(self.anomalies)} total anomalies")

    def forecast(self, forecast_type='single_step', show_plot=False):
        """
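        Executes all methods in order.  One-step forecasts are always made, because 
        the anomaly detection depends on them.  If forecast_type is anything other 
        than 'single_step', a multi-step forecast of the validation set is made too.
        """
        self._prepare_input_data()
        self._fit_lstm()
        if show_plot:
            self._plot_lstm_loss()
        # only plot the one-step forecast when it is the requested forecast type
        self._one_step_forecast(show_plot=show_plot and forecast_type == 'single_step')
        if forecast_type != 'single_step':
            self._multi_step_forecast(show_plot=show_plot)
        self._find_anomalies(show_plot=show_plot)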

# run 1 inverter as a test/example
InverterModel(df=plant1, source_key='1BY6WEcLGh8j5v7').forecast(show_plot=True)

# fit a model to each inverter
p1_inverters = list(set(plant1.source_key))
p2_inverters = list(set(plant2.source_key))

inverter_models = dict.fromkeys(p1_inverters + p2_inverters)
'''
# forecast() returns None, so store each model first and then fit it
for i in p1_inverters:
    inverter_models[i] = InverterModel(df=plant1, source_key=i)
    inverter_models[i].forecast(show_plot=False)
for i in p2_inverters:
    inverter_models[i] = InverterModel(df=plant2, source_key=i)
    inverter_models[i].forecast(show_plot=False)
'''

My suspicions about the plausibility of forecasting output with the features in this dataset seem to have proven accurate, as the LSTM model is overfitting the training set.  The model has picked up on the daily cycles of sunlight, but its performance on the validation set is poor.  

When the model is used to make multi-step forecasts (15 minute forecasts for the next 4 hours), it performs horribly.  This may be due to the influence of weather, which cannot possibly be predicted with the provided data. 
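For context, a recursive multi-step forecast feeds each prediction back in as the newest input, so any error in one step contaminates every later step.  The sketch below shows only that feedback loop, on a univariate window with a stand-in "model" that predicts the mean of its window; `recursive_forecast` and `step_fn` are hypothetical names.  A model driven by irradiation and module temperature would also need future values of those features, which this dataset cannot supply.

```python
import numpy as np

def recursive_forecast(step_fn, window, n_steps):
    """Roll a one-step model forward by feeding its predictions back in."""
    window = np.array(window, dtype=float)  # copy, so the caller's data is untouched
    preds = []
    for _ in range(n_steps):
        pred = float(step_fn(window))
        preds.append(pred)
        # shift the window one step along the time axis and append the prediction
        window = np.roll(window, -1, axis=0)
        window[-1] = pred
    return np.array(preds)

# stand-in one-step model: predict the mean of the current window
preds = recursive_forecast(lambda w: w.mean(), [1.0, 2.0, 3.0], n_steps=3)
print(preds)
```

After the first step the window already contains a predicted value, and after as many steps as the window is long it contains nothing but predictions.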

Now I will move on to forecasting the average daily DC power output for each plant.  

In [None]:
# univariate modeling of avg daily DC power, for all inverters, at the day level
all_inverters = []
for i in p1_inverters:
    dsub = plant1[plant1['source_key']==i].groupby('date')['avg_daily_dc_power'].last().reset_index(drop=True).values
    all_inverters.append(dsub)

# create a N x D matrix, to be reshaped into N x T x D inputs below, where:
#  N = number of samples (34 days)
#  T = number of previous time steps used to predict the next one
#  D = number of features (22 inverters)
all_inverters = np.array(all_inverters).T

# store the number of samples
N = len(all_inverters)

# normalize the data
# use the first half of the dataset for training and the second half for validation
# so fit the scaler to the first half, but transform all of it
normalizer = MinMaxScaler()
normalizer.fit(all_inverters[:-N//2])
all_inverters = normalizer.transform(all_inverters)

# reshape the data in preparation for input
T = 1  # use T previous time steps (days) to predict the next one
D = all_inverters.shape[1]
X = []
Y = []
for t in range(all_inverters.shape[0] - T):
    x = all_inverters[t:t+T]  # stop at time t+T, because t+T is the target
    X.append(x)
    y = all_inverters[t+T]
    Y.append(y)

X = np.array(X)
Y = np.array(Y)

# X should have shape (N-T, T, D) and Y should have shape (N-T, D)
print(X.shape)
print(Y.shape)

In [None]:
# build LSTM model
i = Input(shape=(T, D))
x = LSTM(10)(i)
x = Dense(5)(x)
x = Dense(1)(x)
model = Model(i, x)
model.compile(
  loss='mse',
  optimizer=Adam(learning_rate=0.05),  # `lr` is the deprecated name for this argument
)

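# train the LSTM
# use the first half of the dataset for training and the second half for validation
# the model has a single output, so make the target explicit: the mean across
# inverters (Y has one column per inverter)
Y_mean = Y.mean(axis=1, keepdims=True)
r = model.fit(
  X[:-N//2], Y_mean[:-N//2],
  batch_size=32,  # data is small enough to fit into 1 batch
  epochs=100,
  validation_data=(X[-N//2:], Y_mean[-N//2:]),
)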

In [None]:
# Plot the loss
plt.plot(r.history['loss'], label='Training Loss')
plt.plot(r.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title("Training and Validation Loss by Epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()

In [None]:
# One-step forecast using true targets
outputs = model.predict(X)
print(outputs.shape)
predictions = outputs[:,0]

plt.plot(np.mean(Y, axis=1), label='targets')
plt.plot(predictions, label='predictions')
plt.title("Plant 1's Avg Daily DC Power Output Forecasts")
plt.legend()
plt.show()

All the model seems to be doing is learning a moving average.  That is not too surprising, as the model cannot learn anything about the weather, so predicting something close to the average output likely yields the lowest mean squared error.  **What this means is that unless a power plant has reliable weather forecasting available, the best it may be able to do when forecasting its output is to use climate data for irradiation, and use the historical average output as its forecast.**
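A historical-average forecast is simple enough to write in a few lines, which makes it a useful baseline for judging any fancier model.  The sketch below compares it with a persistence forecast (tomorrow equals today) on made-up daily values; the function names are hypothetical and the numbers are not from this dataset.

```python
import numpy as np

def expanding_mean_forecast(series):
    """Forecast each day as the mean of all earlier days (no look-ahead)."""
    series = np.asarray(series, dtype=float)
    cumsum = np.cumsum(series)
    # the forecast for day t uses days 0..t-1, so the first day has no forecast
    return cumsum[:-1] / np.arange(1, len(series))

def persistence_forecast(series):
    """Forecast each day as the previous day's value."""
    return np.asarray(series, dtype=float)[:-1]

# made-up average daily DC power values
daily = np.array([10.0, 12.0, 8.0, 11.0, 9.0])
actual = daily[1:]  # both forecasts start at the second day

mse_mean = np.mean((actual - expanding_mean_forecast(daily)) ** 2)
mse_persist = np.mean((actual - persistence_forecast(daily)) ** 2)
print(f"historical average MSE: {mse_mean:.3f}")
print(f"persistence MSE:        {mse_persist:.3f}")
```

If a trained model cannot beat both of these on the validation days, it has probably not learned anything beyond the average.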