# EDA of Solar Power Generation Dataset

## Introduction

This exploratory data analysis aims to build an understanding of the dataset and to answer the following questions:

 - What is the mean value of daily yield?
 - What is the total irradiation per day?
 - What is the max ambient and module temperature?
 - How many inverters are there for each plant?
 - What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?
 - Which inverter (source_key) has produced maximum DC/AC power?
 - Rank the inverters based on the DC/AC power they produce
 - Is there any missing data?
 
After that, the conclusions drawn from the notebook are presented.

Finally, some questions and thoughts about the EDA and possible follow-up projects are listed.

## Methods

In [None]:
# Data methods 

import pandas as pd
import os

def get_plant_data(number=1, generation_data=False):
    """
    Loads a .csv into a pd.DataFrame instance and converts its DATE_TIME column
    using the right format
    """
    date_format = '%Y-%m-%d %H:%M'
    
    if number == 1 and generation_data:
        file_name = 'Plant_1_Generation_Data.csv'
        date_format = '%d-%m-%Y %H:%M'
        
    if number == 1 and not generation_data:
        file_name = 'Plant_1_Weather_Sensor_Data.csv'
        
    if number == 2 and generation_data:
        file_name = 'Plant_2_Generation_Data.csv'
        
    if number == 2 and not generation_data:
        file_name = 'Plant_2_Weather_Sensor_Data.csv'
        
    file_path = os.path.join(
        '/kaggle/input/solar-power-generation-data/', file_name)

    df = pd.read_csv(file_path)
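    # Assign the column directly so it takes the datetime64 dtype; recent pandas
    # versions may keep the old object dtype when assigning through .loc[:, ...]
    df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], format=date_format)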
    
    return df


def join_datasets(df_plant_1, df_plant_2):
    """
    Joins two related datasets
    """
    
    df = pd.concat([df_plant_1, df_plant_2])
    df.columns = [c.lower() for c in df.columns]
    
    return df

In [None]:
# Viz methods

from plotly.offline import init_notebook_mode
import plotly.graph_objs as go

try:
    init_notebook_mode(connected=True)
except Exception:
    # Notebook mode could not be initialised; carry on without it
    pass

def get_bar_trace(x, y, name):
    return go.Bar(name=name, x=x, y=y)

def get_simple_layout(title, x_axis_name, y_axis_name):
    return {
        "title": {"text": f'{title}'},
        "xaxis": {"title": {"text": f"{x_axis_name}"}},
        "yaxis": {"title": {"text": f"{y_axis_name}"}},
        "legend": {
            "x": 0.8,
            "y": 0
        },
        "autosize": True
    }

def plot_figure(data, layout):
    fig = go.Figure(data=data, layout=layout)
    fig.show()

## Loading the datasets

In [None]:
df_plant1_gd = get_plant_data(number=1, generation_data=True)
df_plant1_sensors = get_plant_data(number=1, generation_data=False)


df_plant2_gd = get_plant_data(number=2, generation_data=True)
df_plant2_sensors = get_plant_data(number=2, generation_data=False)

generation_dataset, sensors_dataset = [
    join_datasets(df_p1, df_p2)
    for (df_p1, df_p2) in [
        [df_plant1_gd, df_plant2_gd],
        [df_plant1_sensors, df_plant2_sensors],
    ]
]

## Datasets description

### Generation dataset

**Solar power generation data for each plant, gathered at 15-minute intervals over a 34-day period.**

#### Shape

In [None]:
generation_dataset.shape

#### Data by plant

In [None]:
generation_dataset.groupby(['plant_id']).size()

#### Columns

 - date_time: Date and time of each observation. Observations are recorded at 15-minute intervals.
 - plant_id: Plant ID, constant within each source file.
 - source_key: In this file, the source key stands for the inverter id.
 - dc_power: Amount of DC power generated by the inverter (source_key) in this 15-minute interval. Units: kW.
 - ac_power: Amount of AC power generated by the inverter (source_key) in this 15-minute interval. Units: kW.
 - daily_yield: Cumulative sum of the power generated on that day, up to that point in time.
 - total_yield: Total yield of the inverter up to that point in time.


In [None]:
generation_dataset.columns

#### Description

In [None]:
generation_dataset.describe()

### Weather sensor dataset

**Weather sensor data for each solar plant, gathered every 15 minutes over a 34-day period.**

#### Shape

In [None]:
sensors_dataset.shape

#### Data by plant

In [None]:
sensors_dataset.groupby(['plant_id']).size()

#### Columns

 - date_time: Date and time of each observation. Observations are recorded at 15-minute intervals.
 - plant_id: Plant ID, constant within each source file.
 - source_key: In this file, the source key stands for the sensor panel id.
 - ambient_temperature: The ambient temperature at the plant.
 - module_temperature: There's a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
 - irradiation: Amount of irradiation for the 15-minute interval.

In [None]:
sensors_dataset.columns

#### Description

In [None]:
sensors_dataset.describe()

## Presented questions:

### Q1: What is the mean value of daily yield?


In [None]:
result = generation_dataset.daily_yield.mean()

print(f'The mean value of daily yield is {result:.2f} kW')
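
Since `daily_yield` is described above as cumulative within a day, averaging every row mixes partial-day values with end-of-day totals. A minimal sketch of an alternative reading of the question is shown here: take the largest value per inverter and day, then average those. The toy frame and its values are made up for illustration, and the approach assumes the counter only resets at day boundaries.

```python
import pandas as pd

# Toy data (made-up values) mirroring the notebook's lower-cased columns.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 06:00", "2020-05-15 12:00", "2020-05-15 18:00",
        "2020-05-16 06:00", "2020-05-16 12:00", "2020-05-16 18:00",
    ]),
    "source_key": ["inv_a"] * 6,
    "daily_yield": [0.0, 50.0, 100.0, 0.0, 100.0, 200.0],
})

# Mean over every row: mixes partial-day cumulative values.
row_mean = toy["daily_yield"].mean()

# End-of-day yield per inverter: the largest cumulative value of each day.
# Assumption: the cumulative counter only resets at day boundaries.
end_of_day = (
    toy.groupby(["source_key", toy["date_time"].dt.date])["daily_yield"]
    .max()
)
end_of_day_mean = end_of_day.mean()

print(f"row-wise mean: {row_mean:.2f}")
print(f"mean end-of-day yield: {end_of_day_mean:.2f}")
```

On the real data the same grouping could be applied to `generation_dataset`; which of the two numbers answers the question depends on how "mean daily yield" is interpreted.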

### Q2: What is the total irradiation per day?


In [None]:
daily_irradiation = sensors_dataset.groupby(sensors_dataset.date_time.dt.date).agg({
    'irradiation': 'sum'
}).reset_index()

data = get_bar_trace(x=daily_irradiation['date_time'], y=daily_irradiation['irradiation'], name='daily_irradiation')
layout = get_simple_layout(title='Daily irradiation measured by weather sensors',
                           x_axis_name='date',
                           y_axis_name='irradiation amount')
plot_figure(data, layout)
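
The aggregation above sums the readings of both plants into a single daily figure. If a per-plant breakdown is wanted, a sketch along these lines could be used; the toy frame and its values are invented for illustration only.

```python
import pandas as pd

# Toy sensor readings (made-up values) for two plants.
toy_sensors = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 12:15",
        "2020-05-15 12:00", "2020-05-16 12:00",
    ]),
    "plant_id": [1, 1, 2, 2],
    "irradiation": [0.5, 0.7, 0.4, 0.6],
})

# Sum irradiation per plant and calendar day.
daily_by_plant = (
    toy_sensors
    .groupby(["plant_id", toy_sensors["date_time"].dt.date])["irradiation"]
    .sum()
    .rename_axis(["plant_id", "date"])
    .reset_index()
)
print(daily_by_plant)
```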

### Q3: What is the max ambient and module temperature?


In [None]:
max_ambient_temp, max_module_temp = sensors_dataset.ambient_temperature.max(), sensors_dataset.module_temperature.max()

print(f'The max ambient temperature is {max_ambient_temp:.2f} while {max_module_temp:.2f} is the max module temperature')

### Q4: How many inverters are there for each plant?


In [None]:
inverters_by_plant = generation_dataset.groupby(['plant_id']).agg({
    'source_key': 'nunique'
}).reset_index()

for _, row in inverters_by_plant.iterrows():
    print(f'There are {row["source_key"]} inverters in plant {row["plant_id"]}')

### Q5: What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?


In [None]:
min_dc_power, max_dc_power = generation_dataset.dc_power.min(), generation_dataset.dc_power.max()
min_ac_power, max_ac_power = generation_dataset.ac_power.min(), generation_dataset.ac_power.max()

generation_dataset_daily = generation_dataset.groupby(['plant_id', generation_dataset.date_time.dt.date]).agg({
    'dc_power': 'sum',
    'ac_power': 'sum'
})

# DC figures come from the dc_power column and AC figures from ac_power
min_dc_power_daily, max_dc_power_daily = generation_dataset_daily.dc_power.min(), generation_dataset_daily.dc_power.max()
min_ac_power_daily, max_ac_power_daily = generation_dataset_daily.ac_power.min(), generation_dataset_daily.ac_power.max()

print(f'The minimum DC generated power in the intervals is {min_dc_power:.2f}, the maximum is {max_dc_power:.2f}')
print(f'The minimum AC generated power in the intervals is {min_ac_power:.2f}, the maximum is {max_ac_power:.2f}')
print()
print(f'The minimum DC generated power in a whole day is {min_dc_power_daily:.2f} whereas the maximum is {max_dc_power_daily:.2f}')
print(f'The minimum AC generated power in a whole day is {min_ac_power_daily:.2f} while the maximum is {max_ac_power_daily:.2f}')
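
The daily figures above are sums of kW readings, not energy. A rough sketch of converting them to kWh follows, under the assumption that each reading represents the average power over its 15-minute interval; the toy values and the `INTERVAL_HOURS` name are illustrative and not part of the notebook.

```python
import pandas as pd

INTERVAL_HOURS = 15 / 60  # readings are 15 minutes apart

# Toy generation readings (made-up values), in kW.
toy_gen = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 12:15",
        "2020-05-15 12:30", "2020-05-15 12:45",
    ]),
    "plant_id": [1, 1, 1, 1],
    "ac_power": [100.0, 200.0, 200.0, 100.0],
})

# Energy per interval is power times duration, so the daily energy is the
# sum of the kW readings multiplied by the interval length in hours.
daily_energy_kwh = (
    toy_gen
    .groupby(["plant_id", toy_gen["date_time"].dt.date])["ac_power"]
    .sum()
    * INTERVAL_HOURS
)
print(daily_energy_kwh)
```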

### Q6: Which inverter (source_key) has produced maximum DC/AC power? 


In [None]:
inverters_generation = generation_dataset.groupby(['source_key']).agg({
    'dc_power': 'sum',
    'ac_power': 'sum'
}).reset_index()

inverters_generation.loc[:, 'total'] = inverters_generation['ac_power'] + inverters_generation['dc_power']

mask_max_dc = inverters_generation['dc_power'] == inverters_generation['dc_power'].max()
mask_max_ac = inverters_generation['ac_power'] == inverters_generation['ac_power'].max()
mask_max = inverters_generation['total'] == inverters_generation['total'].max()

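# Report each maximum separately: OR-ing the three masks would return the first matching row, whichever mask it satisfied
max_dc_inverter = inverters_generation.loc[mask_max_dc, 'source_key'].values[0]
max_ac_inverter = inverters_generation.loc[mask_max_ac, 'source_key'].values[0]
max_total_inverter = inverters_generation.loc[mask_max, 'source_key'].values[0]

print(f"The inverter that has generated the maximum DC power is {max_dc_inverter}")
print(f"The inverter that has generated the maximum AC power is {max_ac_inverter}")
print(f"The inverter that has generated the maximum combined DC and AC power is {max_total_inverter}")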

### Q7: Rank the inverters based on the DC/AC power they produce


In [None]:
inverters_generation.sort_values('total', ascending=False, inplace=True)

In [None]:
data = get_bar_trace(x=inverters_generation['source_key'], y=inverters_generation['total'], name='total_inverter_generation')
layout = get_simple_layout(title='Total power generation by inverter',
                           x_axis_name='inverter',
                           y_axis_name='power generation (kW)')
plot_figure(data, layout)
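
Besides sorting and plotting, an explicit rank column per power type can make the ordering easier to read. The following is a small sketch on made-up totals, using `Series.rank` with dense ranking so that ties share a position; the `dc_rank` and `ac_rank` column names are introduced here for illustration.

```python
import pandas as pd

# Toy per-inverter totals (made-up values).
toy_totals = pd.DataFrame({
    "source_key": ["inv_a", "inv_b", "inv_c"],
    "dc_power": [300.0, 500.0, 400.0],
    "ac_power": [290.0, 480.0, 390.0],
})

# Rank 1 is the highest producer; dense ranking leaves no gaps after ties.
ranked = toy_totals.assign(
    dc_rank=toy_totals["dc_power"].rank(ascending=False, method="dense").astype(int),
    ac_rank=toy_totals["ac_power"].rank(ascending=False, method="dense").astype(int),
).sort_values("dc_rank")
print(ranked)
```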

### Q8: Is there any missing data?


#### Generation missing data

In [None]:
date_range = pd.date_range(
    start=generation_dataset.date_time.min(), 
    end=generation_dataset.date_time.max(), 
    freq='15min'
)
df_expected_lectures = pd.DataFrame({
    'date_time': date_range
})

all_expected_lectures = pd.concat([
    pd.DataFrame({
        'date_time': date_range,
        'plant_id': [plant_id] * len(date_range),
        'source_key': [source_key] * len(date_range)
    })
    for plant_id in generation_dataset['plant_id'].unique()
    for source_key in generation_dataset.loc[generation_dataset['plant_id'] == plant_id]['source_key'].unique()
])

all_expected_lectures = all_expected_lectures.merge(
    generation_dataset, on=['plant_id', 'source_key', 'date_time'], how='left')

#### Missing readings by plant and inverter

In [None]:
total_missing_lectures = all_expected_lectures.loc[all_expected_lectures['dc_power'].isna()].groupby(['plant_id', 'source_key']).agg({
    'date_time': 'nunique'
}).reset_index()

total_missing_lectures

In [None]:
print(f'The total number of missing readings is {total_missing_lectures["date_time"].sum()}')

In [None]:
missing_lectures_by_date = all_expected_lectures.loc[
    all_expected_lectures['dc_power'].isna()
].groupby(['plant_id', 'source_key', all_expected_lectures.date_time.dt.date]).size().reset_index()


missing_lectures_by_date.columns = list(missing_lectures_by_date.columns[:-1]) + ['total']

data = get_bar_trace(x=missing_lectures_by_date['date_time'], y=missing_lectures_by_date['total'], name='missing_lectures_by_date')
layout = get_simple_layout(title='Missing power generation readings by date',
                           x_axis_name='date',
                           y_axis_name='missing readings')
plot_figure(data, layout)
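
The expected-readings grid can also be built more compactly. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses `pd.MultiIndex.from_product` for the grid and the `indicator` option of `merge` to mark rows with no match, so it does not depend on a value column being NaN. The toy data is made up.

```python
import pandas as pd

# Toy generation data (made-up): inv_b has no reading at 00:15.
toy_gen = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-05-15 00:00", "2020-05-15 00:15", "2020-05-15 00:30",
        "2020-05-15 00:00", "2020-05-15 00:30",
    ]),
    "source_key": ["inv_a", "inv_a", "inv_a", "inv_b", "inv_b"],
    "dc_power": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Every inverter is expected to report at every 15-minute timestamp.
expected = pd.MultiIndex.from_product(
    [
        toy_gen["source_key"].unique(),
        pd.date_range(toy_gen["date_time"].min(),
                      toy_gen["date_time"].max(), freq="15min"),
    ],
    names=["source_key", "date_time"],
).to_frame(index=False)

# indicator=True adds a "_merge" column; "left_only" rows have no reading.
merged = expected.merge(
    toy_gen, on=["source_key", "date_time"], how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", ["source_key", "date_time"]]
print(missing)
```

Using the merge indicator keeps genuine zero or NaN measurements distinct from timestamps that were never recorded.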

#### Weather sensors missing data

In [None]:
sensors_dataset

In [None]:
date_range = pd.date_range(
    start=sensors_dataset.date_time.min(), 
    end=sensors_dataset.date_time.max(), 
    freq='15min'
)
df_expected_lectures = pd.DataFrame({
    'date_time': date_range
})

all_expected_lectures = pd.concat([
    pd.DataFrame({
        'date_time': date_range,
        'plant_id': [plant_id] * len(date_range),
        'source_key': [source_key] * len(date_range)
    })
    for plant_id in sensors_dataset['plant_id'].unique()
    for source_key in sensors_dataset.loc[sensors_dataset['plant_id'] == plant_id]['source_key'].unique()
])

all_expected_lectures = all_expected_lectures.merge(
    sensors_dataset, on=['plant_id', 'source_key', 'date_time'], how='left')

#### Missing readings by plant and sensor

In [None]:
total_missing_lectures = all_expected_lectures.loc[all_expected_lectures['module_temperature'].isna()].groupby(['plant_id', 'source_key']).agg({
    'date_time': 'nunique'
}).reset_index()

total_missing_lectures

In [None]:
print(f'The total number of missing readings is {total_missing_lectures["date_time"].sum()}')

In [None]:
missing_lectures_by_date = all_expected_lectures.loc[
    all_expected_lectures['module_temperature'].isna()
].groupby(['plant_id', 'source_key', all_expected_lectures.date_time.dt.date]).size().reset_index()


missing_lectures_by_date.columns = list(missing_lectures_by_date.columns[:-1]) + ['total']

data = get_bar_trace(x=missing_lectures_by_date['date_time'], y=missing_lectures_by_date['total'], name='missing_lectures_by_date')
layout = get_simple_layout(title='Missing weather sensor readings by date',
                           x_axis_name='date',
                           y_axis_name='missing readings')
plot_figure(data, layout)

## Conclusions

### About the stats of the datasets

 - The median daily yield is 2.8k, while the median over 15-minute intervals is 6 for dc_power and 3.5 for ac_power.
 - The median irradiation measured by the sensors is 0.02, but its mean is 0.23. The median ambient temperature is 25.95 and the median module temperature is 26.39.

### About the asked questions 

 - The difference between the maximum module temperature and the maximum ambient temperature is 27.5 degrees.
 - The inverters of one plant clearly generate much more power than those of the other; they are probably installed where they receive a different amount of sun during the day, or the lower-producing ones are simply smaller.
 - The gap in missing readings between sensors and power generation is huge: the former has about 7k missing data points, whereas the latter has only 87.

## Questions and thoughts

- A possible way to check whether a panel was working at a certain time (with Predictive Maintenance in mind) is to consider the missing data together with the temperature difference between ambient and module in each interval.
- Why do the external sensors have missing data? If both sensor and module data are missing, it could be a stronger signal that the environment was unfavourable and both stopped working.
- For suboptimally performing equipment, one option is to analyze the power generation difference between sensors by hour: those that are outliers lying below Q1-1.5\*IQR could be the ones that aren't working as well as expected. Local Outlier Factor (LOF) may be a good choice for this.
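
The Q1-1.5\*IQR rule from the last point can be sketched as follows. This is only an illustration on made-up hourly values: the `hour` and `underperforming` columns are hypothetical, and it covers just the IQR fence, not LOF.

```python
import pandas as pd

# Toy hourly power per inverter (made-up values); inv_f is clearly low.
toy_hourly = pd.DataFrame({
    "hour": [12] * 6,
    "source_key": ["inv_a", "inv_b", "inv_c", "inv_d", "inv_e", "inv_f"],
    "ac_power": [100.0, 102.0, 98.0, 101.0, 99.0, 40.0],
})

# Quartiles are computed within each hour so that inverters are only
# compared with their peers at the same time of day.
grouped = toy_hourly.groupby("hour")["ac_power"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
lower_fence = q1 - 1.5 * (q3 - q1)

# Flag readings below the lower fence as potential underperformers.
toy_hourly["underperforming"] = toy_hourly["ac_power"] < lower_fence
print(toy_hourly[toy_hourly["underperforming"]])
```

A flagged reading is only a candidate for inspection; a single low hour may have other causes, such as shading.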