# Introduction

### Welcome to this notebook!

My aim is to provide **rigorous** explanations for each step of the exploratory data analysis.

This notebook can be broken down into the following:

### Stages:

1. Data Manipulation & Aggregation
2. Creating functions for visualisation

> Once we have done this we can perform our **analysis**, which is divided into analysing:

3. The performance of plants,
4. The age of the plants,
5. The conditions each plant is working under, and
6. How the feature variables are related.

Finally, we draw some **conclusions**.

---

# Abstract

This data was gathered at two solar power plants in India over a 34-day period. It comes as two pairs of files: each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level, where each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at the plant level, from a single array of sensors optimally placed at the plant.

---

# Import the necessary libraries

In [None]:
# Reading files from directory
import os

# Data manipulation & analysis
import pandas as pd
import datetime as dt

# Linear Algebra
import numpy as np

# Statistical Tests
from scipy.stats import pearsonr

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

---

# Acquire Data

## Find the data

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

> Now that we have the filepaths for the data, we can read it using pandas.

---

## Reading the data

In [None]:
# Power generation data
plant_1_gd = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv")
plant_2_gd = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv")

# Weather sensor data
plant_1_wsd = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv")
plant_2_wsd = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv")

---

## What does our data look like?

### Power Generation Data:

|No.|Variable|Data Type|
|---|---|---|
|1|DATE_TIME|String|
|2|PLANT_ID|integer|
|3|SOURCE_KEY|String|
|4|DC_POWER|float|
|5|AC_POWER|float|
|6|DAILY_YIELD|float|
|7|TOTAL_YIELD|float|

### Plant Weather Sensor Data:

|No.|Variable|Data Type|
|---|---|---|
|1|DATE_TIME|String|
|2|PLANT_ID|integer|
|3|SOURCE_KEY|String|
|4|AMBIENT_TEMPERATURE|float|
|5|MODULE_TEMPERATURE|float|
|6|IRRADIATION|float|

---

# Data manipulation

### Change DATE_TIME ---> valid datetime data type.

In [None]:
# Change Datetime column to valid datetime format
plant_1_gd.DATE_TIME = pd.to_datetime(plant_1_gd.DATE_TIME)
plant_2_gd.DATE_TIME = pd.to_datetime(plant_2_gd.DATE_TIME)

plant_1_wsd.DATE_TIME = pd.to_datetime(plant_1_wsd.DATE_TIME)
plant_2_wsd.DATE_TIME = pd.to_datetime(plant_2_wsd.DATE_TIME)
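
When timestamps are stored as day-first strings, `pd.to_datetime` may guess the wrong field order, so it can be safer to state the format explicitly. The sketch below uses two made-up strings and assumes a `dd-mm-yyyy HH:MM` layout; that layout is an assumption for illustration only, so check the raw files before relying on it.

```python
import pandas as pd

# Hypothetical day-first timestamps (illustrative values, not read from the files)
sample = pd.Series(["15-05-2020 00:15", "01-06-2020 12:30"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(sample, format="%d-%m-%Y %H:%M")

print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```

If the files mix several layouts, each file would need its own `format` string.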

---

### Splitting DATETIME ---> DATE & TIME

In [None]:
def split_date_time(data):
    data['DATE'] = np.array([dt.datetime.date(x) for x in data['DATE_TIME']])
    data['TIME'] = np.array([dt.datetime.time(x) for x in data['DATE_TIME']])
    return data

In [None]:
# Applying the above function to split: DATETIME ---> DATE & TIME
plant_1_gd = split_date_time(plant_1_gd)
plant_2_gd = split_date_time(plant_2_gd)

plant_1_wsd = split_date_time(plant_1_wsd)
plant_2_wsd = split_date_time(plant_2_wsd)
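
The same split can also be sketched with the pandas `.dt` accessor instead of a Python-level loop. `split_date_time_vectorised` is a hypothetical name, and unlike the function above it works on a copy rather than modifying its input; the toy timestamps are illustrative.

```python
import datetime as dt
import pandas as pd

def split_date_time_vectorised(data):
    # Work on a copy so the caller's DataFrame is left untouched
    out = data.copy()
    # .dt.date and .dt.time return Python date / time objects per row
    out["DATE"] = out["DATE_TIME"].dt.date
    out["TIME"] = out["DATE_TIME"].dt.time
    return out

toy = pd.DataFrame(
    {"DATE_TIME": pd.to_datetime(["2020-05-15 06:00:00", "2020-05-15 06:15:00"])}
)
result = split_date_time_vectorised(toy)
print(result)
```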

---

## What does our updated data look like?

### *Updated* Power Generation Data:

|No.|Variable|Data Type|
|---|---|---|
|1|DATE_TIME|datetime|
|2|PLANT_ID|integer|
|3|SOURCE_KEY|String|
|4|DC_POWER|float|
|5|AC_POWER|float|
|6|DAILY_YIELD|float|
|7|TOTAL_YIELD|float|
|8|DATE|date|
|9|TIME|time|

### *Updated* Plant Weather Sensor Data:

|No.|Variable|Data Type|
|---|---|---|
|1|DATE_TIME|datetime|
|2|PLANT_ID|integer|
|3|SOURCE_KEY|String|
|4|AMBIENT_TEMPERATURE|float|
|5|MODULE_TEMPERATURE|float|
|6|IRRADIATION|float|
|7|DATE|date|
|8|TIME|time|

---

## Converting DATE ---> INT

To fit a linear best-fit line later on, we need the DATE field as a numeric value.

In [None]:
# Create a dictionary to map DATE ---> INT
def map_dates(dates):
    return {date:i for i,date in enumerate(dates)}

# Function to convert all dates to integer values
def date_to_int(dates):
    date_map = map_dates(dates.unique())
    ds = np.array([date_map[d] for d in dates])
    return ds
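
The mapping numbers dates in order of first appearance, so it assumes the rows are already sorted by date, and a missing day does not leave a gap in the integers. A minimal sketch of the same idea using `pd.factorize` on a made-up series (the dates are illustrative):

```python
import pandas as pd

# Illustrative dates; note that 2020-05-17 is missing
dates = pd.Series(["2020-05-15", "2020-05-15", "2020-05-16", "2020-05-18"])

# factorize assigns 0, 1, 2, ... to values in order of first appearance
codes, uniques = pd.factorize(dates)

print(codes.tolist())
print(list(uniques))
```

If real calendar spacing matters for the fitted slope, the number of days elapsed since the first date would be an alternative encoding.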

---

# Data aggregation

In order to perform analysis on the datasets, we need to group the data by relevant columns and apply necessary aggregations over the resulting columns, e.g. sum, mean & variance.

## Group by: DATE

We want to **sum** the numerical variables, or calculate their **mean**, over all the inverters (depending on the task) for each DATE level. If we take the sum, we obtain:


| DATE | $x_1$ | ... | $x_p$ |
| --- | --- | --- | --- |
| $d_1$ | $\sum_{i=1}^{Q}({x_{1i})^{d_{1}}}$ | ... | $\sum_{i=1}^{Q}({x_{pi})^{d_{1}}}$ |
| ... |  |  |  |
| $d_{\alpha}$ | $\sum_{i=1}^{Q}({x_{1i})^{d_{\alpha}}}$ | ... | $\sum_{i=1}^{Q}({x_{pi})^{d_{\alpha}}}$ |

where:
* $ p = $ number of numeric variables,
* $ \alpha = $ number of dates,
* $ Q = $ number of inverters,
* $\sum_{i=1}^{Q}({x_{ji})^{d_{k}}}$ is the sum of variable $x_j$ over all $Q$ inverters on date $d_{k}$.

And if we use the mean, we simply divide $\sum_{i=1}^{Q}({x_{ji})^{d_{k}}}$ by $Q$.

In [None]:
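# Function to group data by DATE and apply either sum or mean aggregation.
# method='avg' gives the mean; any other value gives the sum.
def group_by_date(data, method='sum'):
    grouped_df = data.groupby(['DATE'])
    # numeric_only=True skips non-numeric columns (e.g. SOURCE_KEY, TIME),
    # which recent pandas versions no longer drop automatically
    if method == 'avg':
        return grouped_df.mean(numeric_only=True).reset_index()
    else:
        return grouped_df.sum(numeric_only=True).reset_index()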

In [None]:
# PLANT 1 GENERATION DATA GROUPED BY DATE
gd_dd1 = group_by_date(plant_1_gd)
# PLANT 2 GENERATION DATA GROUPED BY DATE
gd_dd2 = group_by_date(plant_2_gd)

# PLANT 1 WEATHER SENSOR DATA GROUPED BY DATE
wsd_dd1 = group_by_date(plant_1_wsd,method='avg')
# PLANT 2 WEATHER SENSOR DATA GROUPED BY DATE
wsd_dd2 = group_by_date(plant_2_wsd,method='avg')
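
To make the aggregation concrete, here is a small sketch on a made-up table with two inverters and two dates. The column values and inverter names are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data: one reading per inverter per date
toy = pd.DataFrame({
    "DATE": ["d1", "d1", "d2", "d2"],
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_a", "inv_b"],
    "AC_POWER": [10.0, 30.0, 20.0, 40.0],
})

# Sum and mean of the numeric columns within each date
daily_sum = toy.groupby("DATE").sum(numeric_only=True).reset_index()
daily_mean = toy.groupby("DATE").mean(numeric_only=True).reset_index()

print(daily_sum)
print(daily_mean)
```

Note that the mean divides by the number of rows in each group, which only equals the number of inverters when every inverter contributes exactly one row per group.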

---

## Group by: DATE & TIME

We want to **sum** the numerical variables from all the inverters, for each DATE & TIME level. That is, 


| DATE | TIME | $x_1$ | ... | $x_p$ |
| --- | --- | --- | --- | --- |
| $d_1$ | $t_1$ | $\sum_{i=1}^{Q}({x_{1i})^{d_{1}t_{1}}}$ | ... | $\sum_{i=1}^{Q}({x_{pi})^{d_{1}t_{1}}}$ |
| $d_1$ | $t_2$ | $\sum_{i=1}^{Q}({x_{1i})^{d_{1}t_{2}}}$ | ... | $\sum_{i=1}^{Q}({x_{pi})^{d_{1}t_{2}}}$ |
| ... |  |  |  |  |
| $d_{\alpha}$ | $t_{\beta}$ | $\sum_{i=1}^{Q}({x_{1i})^{d_{\alpha}t_{\beta}}}$ | ... | $\sum_{i=1}^{Q}({x_{pi})^{d_{\alpha}t_{\beta}}}$ |

where:
* $ \beta = $ number of times,
* $\sum_{i=1}^{Q}({x_{ji})^{d_{k}t_{l}}}$ is the sum of variable $x_j$ over all $Q$ inverters on date $d_{k}$ at time $t_{l}$.

In [None]:
def group_by_date_time(data):
    grouped_df = data.groupby(['DATE','TIME'])
    # numeric_only=True skips non-numeric columns such as SOURCE_KEY
    df = grouped_df.sum(numeric_only=True).reset_index()
    return df

In [None]:
# PLANT 1 GENERATION DATA GROUPED BY DATE & TIME
gd_dt1 = group_by_date_time(plant_1_gd)
# PLANT 2 GENERATION DATA GROUPED BY DATE & TIME
gd_dt2 = group_by_date_time(plant_2_gd)

# PLANT 1 WEATHER SENSOR DATA GROUPED BY DATE & TIME
wsd_dt1 = group_by_date_time(plant_1_wsd)
# PLANT 2 WEATHER SENSOR DATA GROUPED BY DATE & TIME
wsd_dt2 = group_by_date_time(plant_2_wsd)
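
A sum over inverters is only comparable across timestamps if every inverter reported at each one. The following sketch, on made-up data, counts the rows behind each DATE & TIME group and flags the groups with fewer rows than there are inverters; the names `rows_per_slot` and `incomplete` are hypothetical.

```python
import pandas as pd

# Toy data: inv_c is missing at time t2
toy = pd.DataFrame({
    "DATE": ["d1", "d1", "d1", "d1", "d1"],
    "TIME": ["t1", "t1", "t1", "t2", "t2"],
    "SOURCE_KEY": ["inv_a", "inv_b", "inv_c", "inv_a", "inv_b"],
    "AC_POWER": [10.0, 12.0, 11.0, 20.0, 22.0],
})

n_inverters = toy["SOURCE_KEY"].nunique()

# Number of rows contributing to each (DATE, TIME) sum
rows_per_slot = toy.groupby(["DATE", "TIME"]).size()

# Groups where at least one inverter did not report
incomplete = rows_per_slot[rows_per_slot < n_inverters]
print(incomplete)
```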

---

## Group by: TIME

We want to analyse the **mean**, that is,

$$\mu_X = E[X] =\frac{1}{N}\sum_{i=1}^{N}{x_i}$$


and the **standard deviation**, that is,

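$$ \sigma_X = \sqrt{Var[X]} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}{(x_i-\mu_X)^2}}$$

for each numerical variable $X$, at each time.

(where: $N$ is the total number of observations of $X$; the $N-1$ divisor gives the sample standard deviation, which is what the pandas `std` aggregation used below computes by default.)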
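
The choice of divisor matters when comparing results across libraries: pandas `std` divides by $N-1$ by default (`ddof=1`), while NumPy's `np.std` divides by $N$ (`ddof=0`). A short sketch on made-up numbers shows the difference:

```python
import numpy as np
import pandas as pd

# Illustrative values with mean 5
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_sd = x.std(ddof=0)     # divides by N
sample_sd = x.std()               # pandas default, divides by N - 1
numpy_sd = np.std(x.to_numpy())   # NumPy default, divides by N

print(population_sd, sample_sd, numpy_sd)
```

For groups with many observations the two values are close, but for small groups the gap can be noticeable.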

The table will look like this,


| TIME | $E[x_1]$ | $sd[x_1]$ |... | $E[x_p]$ | $sd[x_p]$ |
| --- | --- | --- | --- | --- | --- |
| $t_1$ | $(\mu_{x_1})^{t_{1}}$ | $(sd[x_1])^{t_{1}}$  |... | $(\mu_{x_p})^{t_{1}}$ | $(sd[x_p])^{t_{1}}$ |
| ... |  |  |  |  |  |
| $t_{\beta}$ | $(\mu_{x_1})^{t_{\beta}}$ | $(sd[x_1])^{t_{\beta}}$ |... | $(\mu_{x_p})^{t_{\beta}}$ |$(sd[x_p])^{t_{\beta}}$ |

where:
* $(\mu_{x_j})^{t_{l}}$ is the mean for $x_j$ at time $t_l$, and
* $(sd[x_j])^{t_{l}}$ is the standard deviation for $x_j$ at time $t_l$.

In [None]:
# We can specify what dataset we are giving the below function by specifying data_type as either:
# 'gd' (i.e. generation data) or 'wsd' (i.e. weather sensor data).
# Note: the 'SE_' columns hold standard deviations ('std'), not standard errors.

def group_by_time(data, data_type='gd'):
    grouped_df = data.groupby(['TIME'])
    if data_type == 'gd':
        col_names = ['TIME','AVG_DC_POWER','SE_DC_POWER','AVG_AC_POWER','SE_AC_POWER',
                     'AVG_DAILY_YIELD','SE_DAILY_YIELD','MEDIAN_TOTAL_YIELD','SE_TOTAL_YIELD']
        df = grouped_df.agg({'DC_POWER':['mean','std'],
                             'AC_POWER':['mean','std'],
                             'DAILY_YIELD':['mean','std'],
                             'TOTAL_YIELD':['median','std']}).reset_index()
        df.columns = col_names
    else:
        col_names = ['TIME', 'AVG_AMBIENT_TEMPERATURE', 'SE_AMBIENT_TEMPERATURE', 'AVG_MODULE_TEMPERATURE',
                     'SE_MODULE_TEMPERATURE', 'AVG_IRRADIATION', 'SE_IRRADIATION']
        df = grouped_df.agg({'AMBIENT_TEMPERATURE': ['mean', 'std'],
                             'MODULE_TEMPERATURE': ['mean', 'std'],
                             'IRRADIATION': ['mean', 'std']}).reset_index()
        df.columns = col_names
    return df

In [None]:
# PLANT 1 GENERATION DATA GROUPED BY TIME
gd_t1 = group_by_time(gd_dt1,data_type='gd')
# PLANT 2 GENERATION DATA GROUPED BY TIME
gd_t2 = group_by_time(gd_dt2,data_type='gd')

# PLANT 1 WEATHER SENSOR DATA GROUPED BY TIME
wsd_t1 = group_by_time(wsd_dt1,data_type='wsd')
# PLANT 2 WEATHER SENSOR DATA GROUPED BY TIME
wsd_t2 = group_by_time(wsd_dt2,data_type='wsd')
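
As an alternative to renaming the columns by hand after the aggregation, pandas supports named aggregation, where each output column is declared together with its source column and function. A minimal sketch on made-up data, reusing the notebook's `AVG_` / `SE_` naming:

```python
import pandas as pd

toy = pd.DataFrame({
    "TIME": ["t1", "t1", "t2", "t2"],
    "AC_POWER": [10.0, 20.0, 30.0, 50.0],
})

# Each keyword is an output column: (source column, aggregation)
summary = (
    toy.groupby("TIME")
       .agg(AVG_AC_POWER=("AC_POWER", "mean"),
            SE_AC_POWER=("AC_POWER", "std"))
       .reset_index()
)
print(summary)
```

This keeps each name next to the statistic it describes, so the list of names cannot drift out of step with the aggregation dictionary.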

## Group by: (DATE or TIME) & INVERTERS

We only need to group the Power Generation Data by inverter, because the Weather Sensor Data comes from a single sensor source per plant. We want to calculate the sum, mean and standard deviation for each inverter. For example, if we group by DATE & inverter:


| DATE | INVERTER | $x_1$ | $E[x_1]$ | $sd[x_1]$ |... | $x_p$ | $E[x_p]$ | $sd[x_p]$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $d_1$ | $I_1$ | $\sum_{i=1}^{\beta}({x_{1i})^{d_{1}I_{1}}}$ | $(\mu_{x_1})^{d_{1}I_{1}}$ | $(sd[x_1])^{d_{1}I_{1}}$  |... | $\sum_{i=1}^{\beta}({x_{pi})^{d_{1}I_{1}}}$ |$(\mu_{x_p})^{d_{1}I_{1}}$ | $(sd[x_p])^{d_{1}I_{1}}$  |
| $d_1$ | $I_2$ | $\sum_{i=1}^{\beta}({x_{1i})^{d_{1}I_{2}}}$ |$(\mu_{x_1})^{d_{1}I_{2}}$ | $(sd[x_1])^{d_{1}I_{2}}$  | ... | $\sum_{i=1}^{\beta}({x_{pi})^{d_{1}I_{2}}}$ |$(\mu_{x_p})^{d_{1}I_{2}}$ | $(sd[x_p])^{d_{1}I_{2}}$  |
| ... |  |  |  |  | | | | |
| $d_{\alpha}$ | $I_Q$ | $\sum_{i=1}^{\beta}({x_{1i})^{d_{\alpha}I_{Q}}}$ | $(\mu_{x_1})^{d_{\alpha}I_{Q}}$ | $(sd[x_1])^{d_{\alpha}I_{Q}}$  |... | $\sum_{i=1}^{\beta}({x_{pi})^{d_{\alpha}I_{Q}}}$ |$(\mu_{x_p})^{d_{\alpha}I_{Q}}$ | $(sd[x_p])^{d_{\alpha}I_{Q}}$  |

where:
* $\sum_{i=1}^{\beta}({x_{ji})^{d_{k}I_{l}}}$ is the sum of the variable $x_j$ on date $d_{k}$ for inverter $I_{l}$ (over all times $t_1$ to $t_{\beta}$),
* $(\mu_{x_j})^{d_{k}I_{l}}$ is the mean of the variable $x_j$ on date $d_{k}$ for inverter $I_{l}$ (over all times $t_1$ to $t_{\beta}$), and
* $(sd[x_j])^{d_{k}I_{l}}$ is the standard deviation of the variable $x_j$ on date $d_{k}$ for inverter $I_{l}$ (over all times $t_1$ to $t_{\beta}$).

In [None]:
def group_by_x_inverters(x, data):
    grouped_df = data.groupby([x,'SOURCE_KEY'])
    col_names = [x, 'SOURCE_KEY','SUM_DC_POWER','AVG_DC_POWER', 'SE_DC_POWER',
                 'SUM_AC_POWER','AVG_AC_POWER','SE_AC_POWER',
                 'SUM_DAILY_YIELD','AVG_DAILY_YIELD', 'SE_DAILY_YIELD',
                 'SUM_TOTAL_YIELD','AVG_TOTAL_YIELD', 'SE_TOTAL_YIELD',]
    df = grouped_df.agg({'DC_POWER': ['sum','mean', 'std'],
                         'AC_POWER': ['sum','mean', 'std'],
                         'DAILY_YIELD': ['sum','mean', 'std'],
                         'TOTAL_YIELD': ['sum','mean', 'std']}).reset_index()
    df.columns = col_names
    return df

In [None]:
# PLANT 1 GENERATION DATA GROUPED BY DATE & INVERTERS
inv1_date = group_by_x_inverters('DATE', plant_1_gd)
# PLANT 2 GENERATION DATA GROUPED BY DATE & INVERTERS
inv2_date = group_by_x_inverters('DATE', plant_2_gd)

# PLANT 1 GENERATION DATA GROUPED BY TIME & INVERTERS
inv1_time = group_by_x_inverters('TIME', plant_1_gd)
# PLANT 2 GENERATION DATA GROUPED BY TIME & INVERTERS
inv2_time = group_by_x_inverters('TIME', plant_2_gd)

#### The good thing about the DATE, DATE & TIME and TIME grouping functions above is that they can be applied to both the Generation Data and the Weather Sensor Data, so in this way our functions generalise well.

## Combine: Generation & Weather Sensor Data

When we analyse correlations between all feature variables we will need the Generation Data & Weather Sensor Data in one pandas DataFrame.

In [None]:
# We will pass the data that has been grouped by DATE & TIME
def combine_gd_wsd(gds, wsds):
    
    #[Generation Data from plant 1, Generation Data from plant 2]
    [gd1, gd2] = gds
    
    #[Weather Sensor Data from plant 1, Weather Sensor Data from plant 2]
    [wsd1,wsd2] = wsds
    
    # When we grouped by DATE,TIME the plant IDs were treated as numbers and summed, so we must amend this
    gd1['PLANT_ID'] = 4135001
    gd2['PLANT_ID'] = 4136001

    wsd1['PLANT_ID'] = 4135001
    wsd2['PLANT_ID'] = 4136001

    both_plants_gd = pd.concat(gds)
    both_plants_wsd = pd.concat(wsds)

    return pd.merge(both_plants_gd,both_plants_wsd,on=['PLANT_ID','DATE','TIME'])
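
An inner merge silently drops timestamps that appear in only one of the two tables. The sketch below, on made-up frames, uses the `validate` and `indicator` options of `pd.merge` to check that the keys are unique and to count unmatched rows; all values are illustrative.

```python
import pandas as pd

gd = pd.DataFrame({
    "PLANT_ID": [1, 1, 2],
    "DATE": ["d1", "d1", "d1"],
    "TIME": ["t1", "t2", "t1"],
    "AC_POWER": [100.0, 200.0, 150.0],
})
wsd = pd.DataFrame({
    "PLANT_ID": [1, 2, 2],
    "DATE": ["d1", "d1", "d1"],
    "TIME": ["t1", "t1", "t2"],
    "IRRADIATION": [0.5, 0.6, 0.7],
})
keys = ["PLANT_ID", "DATE", "TIME"]

# validate raises an error if a key combination is duplicated on either side
merged = pd.merge(gd, wsd, on=keys, how="inner", validate="one_to_one")

# An outer merge with indicator=True shows which rows had no partner
outer = pd.merge(gd, wsd, on=keys, how="outer", indicator=True)
n_unmatched = int((outer["_merge"] != "both").sum())

print(len(merged), n_unmatched)
```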

---

# Functions for plotting

The plan is to create functions for plotting that can be applied to **any specified columns from both datasets** so that we do not end up repeating code unnecessarily.

## Plotting: Variable ~ $X$

Where,
* Variable is any numeric variable, and
* $X$ can be either DATE or TIME

In [None]:
# plants will be passed as a list : [plant1-data,plant2-data]
def plot_variable_vs_x(plants, x, variable, style='.', bestfit = False):
    # DATE or TIME
    x1 = plants[0][x].astype(str)
    x2 = plants[1][x].astype(str)
    
    # Numeric variable
    y1 = plants[0][variable]
    y2 = plants[1][variable]
    
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
    
    fig.set_figheight(8)
    fig.set_figwidth(8)

    fig.suptitle('PLANT 1 & 2 ' + variable.replace('_',' '))
    
    # Generator function to return colours
    my_colors = plt.rcParams['axes.prop_cycle']()

    ax1.plot(x1, y1, style, **next(my_colors), alpha=.7, ms=4)
    ax2.plot(x2, y2, style, **next(my_colors), alpha=.7, ms=4)
    
    # Calculate linear best fit line
    if bestfit:
        x1_int = date_to_int(x1)
        x2_int = date_to_int(x2)

        coef1 = np.polyfit(x1_int, y1, 1)
        poly1d_fn1 = np.poly1d(coef1)

        coef2 = np.polyfit(x2_int, y2, 1)
        poly1d_fn2 = np.poly1d(coef2)
        
        # We can repeatedly call next(my_colors) and it will return the next colour.
        ax1.plot(x1_int, poly1d_fn1(x1_int), **next(my_colors))
        ax2.plot(x2_int, poly1d_fn2(x2_int), **next(my_colors))
        
    # Set the x-ticks, i.e. the x-axis values (one tick location per unique label)
    x1_ticks = x1.unique()
    x1_tick_loc = np.arange(len(x1_ticks))
    
    x2_ticks = x2.unique()
    x2_tick_loc = np.arange(len(x2_ticks))

    ax1.set_xticks(x1_tick_loc[::8])
    ax1.set_xticklabels(x1_ticks[::8], rotation=90)

    ax2.set_xticks(x2_tick_loc[::8])
    ax2.set_xticklabels(x2_ticks[::8],rotation=90)
    
    # Set labels & title
    ax1.set_xlabel(x)
    ax2.set_xlabel(x)

    ax1.set_title('PLANT1')
    ax2.set_title('PLANT2')

    ax1.set_ylabel(variable.replace('_',' '))

    for a in fig.get_axes():
        a.label_outer()

    plt.show()
    return None
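
The best-fit step inside the function can be sketched in isolation. Here `np.polyfit` with degree 1 returns the slope and intercept of the least-squares line through integer-coded dates; the temperature values are made up so that the expected slope is easy to verify.

```python
import numpy as np

# Integer-coded dates (0, 1, 2, ...) and illustrative daily averages
day_index = np.arange(5)
temps = np.array([30.0, 29.5, 29.0, 28.5, 28.0])

# Degree-1 fit: coefficients come back highest power first
slope, intercept = np.polyfit(day_index, temps, 1)

# poly1d turns the coefficients into a callable line
fitted = np.poly1d([slope, intercept])(day_index)

print(slope, intercept)
```

The slope is in units of the variable per date step, so it is only a per-day rate when consecutive integers correspond to consecutive days.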

---

## Plotting: $E[X]$ & $sd[X]$ over time.

Plotting the mean, $E[X]$, and standard deviation, $sd[X]$, of a variable, $X$, is slightly different, so we define a new function.

In [None]:
# As above we pass plants as a list: [plant1-data,plant2-data]
def plot_mean_sd(plants, variable, style='k-'):
    x = 'TIME'
    x1 = plants[0][x].astype(str)
    x2 = plants[1][x].astype(str)

    median_variables = ['TOTAL_YIELD']

    if variable in median_variables:
        avg_variable = 'MEDIAN_' + variable
    else:
        avg_variable = 'AVG_' + variable
        
    # Variable we are plotting on the y-axis
    y1 = plants[0][avg_variable]
    y2 = plants[1][avg_variable]

    se_variable = 'SE_' + variable
    
    # Standard deviations
    error1 = plants[0][se_variable]
    error2 = plants[1][se_variable]
    
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)

    fig.suptitle('PLANT 1 & 2 AVERAGE ' + variable.replace('_', ' '))
    
    fig.set_figheight(8)
    fig.set_figwidth(8)

    ax1.plot(x1, y1, style)
    ax2.plot(x2, y2, style)

    ax1.set_title('PLANT1')
    ax2.set_title('PLANT2')
    
    ax1.fill_between(x1, y1 - error1, y1 + error1)
    ax2.fill_between(x2, y2 - error2, y2 + error2)

    x1_ticks = x1.unique()
    x1_tick_loc = np.arange(len(x1_ticks))
    
    x2_ticks = x2.unique()
    x2_tick_loc = np.arange(len(x2_ticks))

    ax1.set_xticks(x1_tick_loc[::8])
    ax1.set_xticklabels(x1_ticks[::8], rotation=90)

    ax2.set_xticks(x2_tick_loc[::8])
    ax2.set_xticklabels(x2_ticks[::8],rotation=90)

    ax1.set_xlabel(x)
    ax2.set_xlabel(x)

    ax1.set_ylabel('AVG ' + variable.replace('_', ' '))

    for a in fig.get_axes():
        a.label_outer()

    plt.show()
    return None

---

## Plotting: Variable ~ $X$ for each inverter

When we plot a variable for each inverter, we must create a grid to plot the 22 inverters on, so the method is slightly different.

Where again,
* Variable is any numeric variable, and
* $X$ can be either DATE or TIME

In [None]:
def plot_inverter_vs_variable(plant, x, y, title, style = 'k-'):
    var_name = 'SUM_' + y
    
    # Create the grid with 6 rows and 4 columns
    fig, axs = plt.subplots(6,4,sharex=True,sharey=True)
    
    fig.set_figheight(8)
    fig.set_figwidth(8)

    fig.suptitle(title + ' INVERTER ' + y.replace('_', ' '))

    inverters = plant.SOURCE_KEY.unique().astype(str)

    x_ticks = plant[x].unique().astype(str)
    x_tick_loc = np.arange(len(x_ticks))

    my_colors = plt.rcParams['axes.prop_cycle']()
    
    # Plot each inverter on the grid
    i, j = 0, 0
    for invert in inverters:
        inverter_data = plant.loc[plant['SOURCE_KEY'] == invert]

        x1 = inverter_data[x].astype(str)
        y1 = inverter_data[var_name]

        axs[i][j].set_title(invert,fontdict=dict(fontsize=7))

        axs[i][j].plot(x1, y1, style, **next(my_colors))

        axs[i][j].set_xticks(x_tick_loc[::8])
        axs[i][j].set_xticklabels(x_ticks[::8],rotation=90)

        if i == 5:
            i = 0
            j += 1
        else:
            i += 1

    axs[5][3].set_xticklabels(x_ticks[::8], rotation=90)

    for a in fig.get_axes():
        a.label_outer()

    plt.show()
    return None
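
The loop above fills the grid column by column: the row index runs from 0 to 5 and then resets while the column index advances. A small sketch of that placement rule (0-based positions, assuming 22 inverters on the 6 x 4 grid) can help when reading a panel's position off the figure:

```python
n_rows, n_cols = 6, 4
n_inverters = 22

# k-th inverter (0-based) lands at row k % n_rows, column k // n_rows
positions = [(k % n_rows, k // n_rows) for k in range(n_inverters)]

print(positions[0], positions[5], positions[6], positions[-1])
```

With 22 inverters, the last two panels of the final column stay empty.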

---

## Heatmap plot

A heatmap plot is a visual representation of the correlations between each pair of numeric variables $X$, $Y$. We will use the Pearson correlation coefficient, a statistic that measures the **linear correlation** between two variables. It is defined as:

$$cor(X,Y) = \frac{cov(X,Y)}{sd(X)\,sd(Y)}$$

where:
* $cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$ is the covariance between $X$ & $Y$.

In [None]:
def my_heat_map(data):
    # Compute the correlation matrix on the numeric columns only (DATE & TIME are skipped)
    corr = data.select_dtypes(include='number').corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    fig, ax = plt.subplots()
    fig.set_figheight(8)
    fig.set_figwidth(8)

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0, annot=True,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

    plt.show()
    return None
    
# CREDIT: https://seaborn.pydata.org/examples/many_pairwise_correlations.html
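
The Pearson coefficient behind each cell of the heatmap can be checked by hand. This sketch computes it from the covariance and standard deviations on made-up numbers, then compares the result with `np.corrcoef`:

```python
import numpy as np

# Illustrative paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance as the mean product of deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Divide by the (population) standard deviations
r_manual = cov_xy / (x.std() * y.std())

# Reference value from NumPy
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The divisor convention ($N$ or $N-1$) cancels between numerator and denominator, so both conventions give the same coefficient as long as one is used consistently.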

---

# Plant Efficiency

#### Plot the mean & standard deviation (over time) of the variables that indicate performance of the plants.

## AC POWER:

In [None]:
gd_time_data = [gd_t1, gd_t2]

plot_mean_sd(gd_time_data, 'AC_POWER')



> * Plant 1 produces more AC power than plant 2. 
> * The distribution of average AC power production for plant 1 is a bell curve with maximum value located at 12:00pm. 
> * The average AC power production for plant 2 increases until around 8:00am, then levels off and begins to decrease around 4:00pm.

---

## DC POWER:

In [None]:
plot_mean_sd(gd_time_data, 'DC_POWER')


> * Plant 1 produces over 10 times more DC power than plant 2 at its peak production time of 12:00pm.
> * The distribution of average DC power over time for plant 1 is again a bell curve with centre located at 12:00pm.

---

## DAILY YIELD:

#### Daily yield is **cumulative**.

In [None]:
plot_mean_sd(gd_time_data, 'DAILY_YIELD')

> * The (cumulative) daily yields for plants 1 and 2 are very similar.
> * On average, plant 1 produces around 30,000 more yield than plant 2 daily.
> * Power production begins around 8:00am and finishes around 6:00pm.
> * There is a dip in daily yield at 6:00pm for plant 1 which doesn't make sense as the yield is cumulative. This must be how the inverters are configured.
> * The daily yield decreases to zero at 6:00am for plant 2 which doesn't make sense, but again this must be how the inverters are configured for this plant.

---

# AGE INFERENCE

## Age of Plant 1 & 2

#### Inspect the total (cumulative) yield for plant 1 & 2.

In [None]:
date_data = [gd_dd1,gd_dd2]

plot_variable_vs_x(date_data, x='DATE', variable='TOTAL_YIELD', style='k-')

> * The total yield for plant 2 is ~ 70 times greater than the total yield for plant 1.
> * Therefore, **assume** that plant 2 is older than plant 1.

---

## Age of Inverters

#### Inspect the total (cumulative) yield for each inverter separately for both plant 1 & 2.

In [None]:
plot_inverter_vs_variable(inv1_date, 'DATE', 'TOTAL_YIELD', 'PLANT 1')
plot_inverter_vs_variable(inv2_date, 'DATE', 'TOTAL_YIELD', 'PLANT 2')
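
Because TOTAL_YIELD is a running total, one hedged way to rank inverters numerically is to take the largest reading per inverter. The sketch below does this on made-up data; the inverter names and yields are illustrative, and this is an alternative view rather than the quantity plotted above.

```python
import pandas as pd

toy = pd.DataFrame({
    "SOURCE_KEY": ["inv_a", "inv_a", "inv_b", "inv_b"],
    "DATE": ["d1", "d2", "d1", "d2"],
    "TOTAL_YIELD": [1000.0, 1050.0, 400.0, 460.0],
})

# Latest cumulative reading per inverter, largest first
latest_total = (
    toy.groupby("SOURCE_KEY")["TOTAL_YIELD"]
       .max()
       .sort_values(ascending=False)
)
print(latest_total)
```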

> * The total yields for the inverters from plant 1 are all similar,
> * The total yields for the inverters from plant 2 vary greatly,
> * **Assume**: The higher the total yield <=> the older the inverter,
> * **Assume**: The lower the total yield <=> the more recently the inverter has been replaced.

---

# Equipment Optimality

#### **Question:** Are all inverters performing optimally?

* **Assume**: Since plant 1 is **younger** and more **efficient**, the performance of its inverters is optimal.

## Inverter Performance:

#### Let's compare the **AC POWER** production for the inverters from plant 1 & 2.

### PLANT 1

In [None]:
plot_inverter_vs_variable(inv1_time, 'TIME', 'AC_POWER', 'PLANT 1')

> * Symmetric bell curve shape for all inverters centred at 12:00pm.

---

### PLANT 2

In [None]:
plot_inverter_vs_variable(inv2_time, 'TIME', 'AC_POWER', 'PLANT 2')

> * A few symmetric bell curves (which is what we would like to see) like the inverter located at (row 1, column 1),
> * Most production flatlines around 10:00am then begins to decrease at around 4:00pm, like the inverter located at (row 3, column 1).
> * We see a slump in the AC power production for a few of the inverters where we would expect to see the highest production, like the inverter in (row 6, column 1).

#### **Question**: Is the flatlining and slumping behaviour related to the age of the inverters?

> * Just by quick visual inspection there does not seem to be any relationship, so no need to perform a statistical test.

#### **Question**: What is causing the flatlining and slumping behaviour?

> * **Assuming** that all inverters are collecting data correctly my best guess is that the solar panels themselves are not performing optimally, i.e. once they reach a certain AC power level they malfunction.

### The same trend can be seen for **DC POWER** so I omit this.

---

## Inverter Malfunction

#### Inspecting the **DAILY YIELD** for plant 2 over time,

In [None]:
plot_inverter_vs_variable(inv2_date, 'DATE', 'DAILY_YIELD', 'PLANT 2')
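
Alongside the plot, a tabular check can show the last date on which each inverter reported. The sketch below runs on made-up data; `last_seen` and `stopped_early` are hypothetical names and the dates are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "SOURCE_KEY": ["inv_a", "inv_a", "inv_b", "inv_b", "inv_b"],
    "DATE": pd.to_datetime(
        ["2020-06-01", "2020-06-02", "2020-06-01", "2020-06-02", "2020-06-03"]
    ),
})

# Last date with any reading, per inverter
last_seen = toy.groupby("SOURCE_KEY")["DATE"].max()

# Inverters whose last reading is before the end of the observation window
stopped_early = last_seen[last_seen < toy["DATE"].max()]
print(stopped_early)
```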

> As you can see the inverters in:


| row | column |
|---|---|
|5|1|
|3|2|
|3|3|
|3|4|


> have all stopped on 08/06/2020 (8 June 2020), 8 days before they should have. The same can be seen for the other variables, that is, AC_POWER & DC_POWER, but I have omitted those plots.

* **Assuming** that they have not been turned off purposefully, we must assume that there has been a malfunction.

---

# Conditions

Next, we analyse the weather sensor data to assess the conditions that the solar plants are operating under.

## Ambient Temperature:

In [None]:
# This data uses average of values for each date
wsd_date_data = [wsd_dd1, wsd_dd2]
plot_variable_vs_x(wsd_date_data,'DATE', 'AMBIENT_TEMPERATURE', style='k-', bestfit=True)

> * The average ambient temperature at both plants is consistently decreasing from the 15th of May to the 16th of June.
> * Plant 1 is located in a cooler region; however, the temperature at plant 2 is decreasing at a much more rapid pace.

---

## Module Temperature

In [None]:
plot_variable_vs_x(wsd_date_data,'DATE', 'MODULE_TEMPERATURE', style='k-', bestfit=True)

> * Same trend in temperature over time for module temperature and ambient temperature.
> * The (inverter) module temperature, on average, is about 6 degrees hotter than the surrounding ambient temperature.

---

## Irradiation

In [None]:
plot_variable_vs_x(wsd_date_data,'DATE', 'IRRADIATION', style='k-', bestfit=True)

> * Irradiation levels for plant 1 are slowly decreasing over time but they are rapidly decreasing over time at plant 2.

#### We conclude that plant 2 is located in an area with much harsher conditions, that is, more intense temperatures and irradiation values, and greater variance in temperature and irradiation than plant 1.

---

# Correlations

We assess the Pearson correlation coefficient for each pair of numeric variables in the data.

In [None]:
# Combine all generator & weather sensor data
all_data = combine_gd_wsd([gd_dt1,gd_dt2],[wsd_dt1,wsd_dt2])

# Get correlations for all numeric variables
my_heat_map(all_data.loc[:, all_data.columns != 'PLANT_ID'])

From the above heat map of correlations we obtain the following notable results:

|Relationship|Correlation coefficient|Variable 1|Variable 2|
|---|---|---|---|
|**Very strong positive**|0.95|Module Temperature|Irradiation|
|**Very strong positive**|0.93|AC Power|Irradiation|
|**Very strong positive**|0.9|AC Power|Module Temperature|
|**Strong positive**|0.82|Ambient Temperature|Module Temperature|
|**Moderate positive**|0.75|AC Power|DC Power|
|**Mild positive**|0.65|Ambient Temperature|Irradiation|

---

# Concluding Thoughts:

* We have discovered that plant 1 is younger and is more efficient than plant 2,
* However, plant 2 seems to be located in a region with much harsher weather conditions; this, along with the fact that it appears to be older than plant 1, will factor into the somewhat unusual behaviour of its inverters,
* As the irradiation levels increase, we see an increase in the production of AC Power,
* However, with an increase in either irradiation or AC Power we also see an increase in module temperature which, over time, could degrade our equipment.


#### The next stage will be to perform time-series analysis to try to predict future power production; however, I will leave this to a separate notebook.


#### Thanks for reading; any constructive criticism is gladly accepted. Hopefully this notebook is a bit different to the others you have read using this dataset!