In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Given Tasks to Consider:

# **Descriptive Analytics -**

Task Details

1. Load the data from the CSV files
2. Explore each dataset - columns, counts, basic stats
3. Understand the domain context and explore underlying patterns in the data
4. Explore the data and try to answer questions like -

    • What is the mean value of daily yield?
    
    • What is the total irradiation per day?
    
    • What is the max ambient and module temperature?
    
    • How many inverters are there for each plant?
    
    • What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?
    
    • Which inverter (source_key) has produced maximum DC/AC power?
    
    • Rank the inverters based on the DC/AC power they produce
    
    • Is there any missing data?
    
You might have to pre-process the data to allow for some of the analysis (hint: date and time)

# **Visualization and Further Exploration -**

Task Details

1. Create a fresh notebook with visualizations that help understand the data and underlying patterns.
Start with graphs that explain the patterns for attributes independent of other variables. These will usually be tracked as changes of attributes against DATETIME, DATE, or TIME. Examples - how is DC or AC Power changing as time goes by? how is irradiation changing as time goes by? how are ambient and module temperature changing as time goes by? how does yield change as time goes by? Explore plotting variables against different granularities of DATETIME and which is the best option for each variable.

2. Plot two variables against each other to discover degree of correlation between them. Try out different variable pairs - ambient and module temperature, DC and AC Power, Irradiation and module/ambient temperature, irradiation and DC/AC Power. Can you find different ways of visualizing the above relationships?

As always, you will be evaluated on how thorough your work is. There are extra marks for creativity on this one: the more interesting and varied your graphs, the more points you get.

# **Competition -**

Task Details

You will be evaluated on the following criteria -

1. Thoroughness of your data analysis
2. Quality of data visualization
3. Understanding of tech as well as domain
4. Are you providing actionable insights?
5. Softer aspects - flow of the storyline, confidence in your outcomes, ability to justify your analysis and answer questions

# **Tell A Story -**

Task Details

This is probably the most important skill for a Data Scientist - the ability to communicate the story. You have worked hard over the last two tasks to understand the data and discover patterns and relationships. It is time to put all that to good use.

Find one story worth telling based on all your work. Create a notebook that walks the viewer through the entire story, one step at a time.

A few tips -

• Pick an interesting conclusion that you want to arrive at

• Build a logical progression from loading and pre-processing data to showing minor observations along the way and eventually building up to the grand finale

• Substantiate your argument with data along the way (you are a data scientist not just a story teller :))

• Every good story has some key elements - characters, setting, plot, complication and solution, try to build as many of them as you can.

• Deliver for the aha moment! Give the user an insight that can potentially impact the business. If the viewer doesn't get that then they won't appreciate your effort.

Evaluation

You will be graded on -

• how engaging is your story?

• how well did you substantiate it with data?

• how interesting was the conclusion/key insight?

• how good were your visualizations?

# **Task the 1st: Descriptive Analytics**

1. Load Data From CSV

In [None]:
# /kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv
# /kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv
# /kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv
# /kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv
plant_1_generation_df = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Generation_Data.csv")
plant_1_sensor_df = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_1_Weather_Sensor_Data.csv")
plant_2_generation_df = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Generation_Data.csv")
plant_2_sensor_df = pd.read_csv("/kaggle/input/solar-power-generation-data/Plant_2_Weather_Sensor_Data.csv")

2. Explore each dataset - columns, counts, basic stats

In [None]:
def df_exploration(df, df_name):
    print(f"First five rows of {df_name}:\n{df.head()}\n")
    print(f"Columns in {df_name}:\n{df.columns}\n")
    print(f"The dataframe contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
    print(f"Statistical summary of {df_name}:\n{df.describe()}\n")
    print(f"Statistical summary of object data of {df_name}:\n{df.describe(exclude='number')}\n")

In [None]:
# Explore plant_1_generation dataset
df_exploration(plant_1_generation_df, "plant_1_generation_df")

In [None]:
df_exploration(plant_1_sensor_df, "plant_1_sensor_df")

In [None]:
df_exploration(plant_2_generation_df, "plant_2_generation_df")

In [None]:
df_exploration(plant_2_sensor_df, "plant_2_sensor_df")

3. Understand the domain context and explore underlying patterns in the data

The columns included in both plant_generation datasets include:

• 'DATE_TIME': The date and time of the recorded observation, taken in intervals of 15 minutes.

• 'PLANT_ID': ID of the plant being observed. Single unique value per dataset.

• 'SOURCE_KEY': ID of the inverter observed. Each inverter has multiple solar panels assigned to it. The inverter acts as the "computer" of the solar panel process, inverting DC (direct current) energy into AC (alternating current).

• 'DC_POWER': Record of direct current power the inverter generated during the 15-minute period from previous observation. Metric used: kW

• 'AC_POWER': Record of alternating current power the inverter generated during the 15-minute period from previous observation. Metric used: kW

• 'DAILY_YIELD': The sum of power generated throughout the day up to that recorded time.

• 'TOTAL_YIELD': The sum of the total power generated in that inverter's lifetime up to that recorded time.


The columns included in both plant_sensor datasets include:

• 'DATE_TIME': The date and time of the recorded observation, taken in intervals of 15 minutes.

• 'PLANT_ID': ID of the plant being observed. Single unique value per dataset.

• 'SOURCE_KEY': ID of the sensor panel for that plant. Single unique value for each sensor dataset - there is only one sensor panel per plant, with multiple sensors on each panel.

• 'AMBIENT_TEMPERATURE': Ambient temperature at the plant at time of recording. Also known in layman's terms as "air temperature". Important for system design and thermal analysis.

• 'MODULE_TEMPERATURE': The temperature reading for that recorded module or solar panel.

• 'IRRADIATION': Amount of irradiation for that 15-minute interval. Irradiance in solar energy is the power per unit area (electromagnetic radiation) received from the sun (https://en.wikipedia.org/wiki/Solar_irradiance).
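
The claims in the column descriptions above (15-minute intervals, a single ID per dataset) can be checked directly rather than taken on trust. The following is a minimal sketch on a made-up frame; `toy_sensor_df` and its values are hypothetical stand-ins, not the real sensor data.

```python
import pandas as pd

# Toy stand-in for one sensor file (all values are made up for illustration).
toy_sensor_df = pd.DataFrame({
    "DATE_TIME": pd.date_range("2020-05-15 00:00", periods=5, freq="15min"),
    "PLANT_ID": [1001] * 5,
    "SOURCE_KEY": ["sensor_panel_A"] * 5,
    "IRRADIATION": [0.0, 0.0, 0.1, 0.3, 0.5],
})

# Gap between consecutive distinct timestamps. A single entry of 15 minutes
# would confirm the stated sampling interval; other entries point to gaps.
timestamps = toy_sensor_df["DATE_TIME"].drop_duplicates().sort_values()
interval_counts = timestamps.diff().dropna().value_counts()
print(interval_counts)

# Number of distinct IDs; 1 each would confirm "single unique value per dataset".
id_counts = toy_sensor_df[["PLANT_ID", "SOURCE_KEY"]].nunique()
print(id_counts)
```

Running the same two checks on the real dataframes, after DATE_TIME has been converted to datetime, would show whether any interval other than 15 minutes occurs.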

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(plant_1_generation_df)
plt.show()

Some observations from the pairplot of plant_1_generation:

1. DC power shows a near perfect positive correlation with AC power. This makes sense as the inverter makes the AC power dependent on the amount of DC power - its job is to invert DC to AC. 

2. We can see that DC and AC power both contribute to instances of DAILY_YIELD as well as TOTAL_YIELD, which intuitively makes sense as both DAILY_YIELD and TOTAL_YIELD should be direct results of AC/DC power.

In [None]:
sns.pairplot(plant_1_sensor_df)
plt.show()

Some observations of the plant_1_sensor_df pairplot:
    
1. Increasing ambient temperature is strongly correlated with increasing module temperature and increasing irradiation levels.

2. Increasing module temperature is very strongly correlated with increasing irradiation levels. This leads us to conclude that increasing energy received from the sun has the effect of increasing module temperature as well as the surrounding air (ambient) temperature. This intuitively makes sense because higher levels of energy from the sun cause increasing temperatures.
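
The visual impressions from the pairplot can be backed with numbers by computing a correlation matrix. This is a sketch on synthetic data: the coefficients and noise levels used to build `toy_sensor_df` are invented so that the example is self-contained, and say nothing about the real plants.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: temperatures are built as noisy linear functions of
# irradiation purely for illustration (coefficients are made up).
rng = np.random.default_rng(0)
irradiation = rng.uniform(0.0, 1.0, size=200)
toy_sensor_df = pd.DataFrame({
    "IRRADIATION": irradiation,
    "AMBIENT_TEMPERATURE": 22 + 8 * irradiation + rng.normal(0, 1.5, size=200),
    "MODULE_TEMPERATURE": 22 + 30 * irradiation + rng.normal(0, 2.0, size=200),
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = toy_sensor_df.corr(method="pearson")
print(corr_matrix.round(2))
```

On the real data, `plant_1_sensor_df[["AMBIENT_TEMPERATURE", "MODULE_TEMPERATURE", "IRRADIATION"]].corr()` would give the equivalent table, which could also be passed to `sns.heatmap` for a visual summary.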

In [None]:
sns.pairplot(plant_2_generation_df)
plt.show()

In [None]:
sns.pairplot(plant_2_sensor_df)
plt.show()

The plant_2_generation_df and plant_2_sensor_df pairplots show results very similar to the plant_1 pairplots, which suggests that both plants operate in a similar manner.

One thing of note, though it appears minor in the visualizations, is the relationship between ambient temperature and module temperature. The two increase together up to a certain point, after which module temperature begins to decrease marginally as ambient temperature continues to rise. The effect is slight and could be nothing, but it could also be indicative of a temperature regulation system that keeps the modules from rising above a certain level, or of the material used to make the modules (the solar panels).

A final thought, taking into consideration how heat and energy work: this could also be caused by the modules passing energy from the sun downstream to the inverter, and then on to further energy storage or management systems. The ambient air is not a controlled environment and will continue to gain heat as the energy received from the sun increases, whereas the module shifts some of that energy away, in a sense regulating its moment-to-moment temperature.

In [None]:
import datetime
# Let's take a look at some of the non-numeric data compared to the numeric data now
# Start by converting the date and time column to datetime format
# Note: pd.to_datetime infers the format here. If any file stores dates day-first
# (for example 15-05-2020), pass dayfirst=True or an explicit format= so that
# day and month are not swapped for ambiguous dates.
plant_1_generation_df["DATE_TIME"] = pd.to_datetime(plant_1_generation_df["DATE_TIME"])
plant_1_sensor_df["DATE_TIME"] = pd.to_datetime(plant_1_sensor_df["DATE_TIME"])
plant_2_generation_df["DATE_TIME"] = pd.to_datetime(plant_2_generation_df["DATE_TIME"])
plant_2_sensor_df["DATE_TIME"] = pd.to_datetime(plant_2_sensor_df["DATE_TIME"])



In [None]:
# Check the dtype of the DATE_TIME column to make sure we reformatted correctly
plant_1_generation_df['DATE_TIME'].dtype

In [None]:
# Check a few DATE_TIME rows
plant_1_generation_df[["DATE_TIME", "DC_POWER"]]

In [None]:
# Create 4 scatter plots comparing DATE_TIME to DC_POWER and IRRADIATION levels
# We want to see how the data looks across time.
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
# x and y are passed as keywords; recent seaborn releases no longer accept them positionally
sns.scatterplot(x=plant_1_generation_df["DATE_TIME"], y=plant_1_generation_df["DC_POWER"], ax=axes[0,0])
sns.scatterplot(x=plant_1_sensor_df["DATE_TIME"], y=plant_1_sensor_df["IRRADIATION"], ax=axes[0,1])
sns.scatterplot(x=plant_2_generation_df["DATE_TIME"], y=plant_2_generation_df["DC_POWER"], ax=axes[1,0])
sns.scatterplot(x=plant_2_sensor_df["DATE_TIME"], y=plant_2_sensor_df["IRRADIATION"], ax=axes[1,1])
plt.show()

All recordings appear to have occurred in the year 2020. Let's do some feature engineering to create columns for the month, day, and time. Since all instances are in 2020 we can ignore the year.

In [None]:
# Create month feature in all datasets from DATE_TIME.month
plant_1_generation_df["month"] = pd.DatetimeIndex(plant_1_generation_df["DATE_TIME"]).month
plant_1_sensor_df["month"] = pd.DatetimeIndex(plant_1_sensor_df["DATE_TIME"]).month
plant_2_generation_df["month"] = pd.DatetimeIndex(plant_2_generation_df["DATE_TIME"]).month
plant_2_sensor_df["month"] = pd.DatetimeIndex(plant_2_sensor_df["DATE_TIME"]).month

In [None]:
# Create day feature in all datasets from DATE_TIME.day
plant_1_generation_df["day"] = pd.DatetimeIndex(plant_1_generation_df["DATE_TIME"]).day
plant_1_sensor_df["day"] = pd.DatetimeIndex(plant_1_sensor_df["DATE_TIME"]).day
plant_2_generation_df["day"] = pd.DatetimeIndex(plant_2_generation_df["DATE_TIME"]).day
plant_2_sensor_df["day"] = pd.DatetimeIndex(plant_2_sensor_df["DATE_TIME"]).day

In [None]:
# Create hour feature in all datasets from DATE_TIME.hour
plant_1_generation_df["hour"] = pd.DatetimeIndex(plant_1_generation_df["DATE_TIME"]).hour
plant_1_sensor_df["hour"] = pd.DatetimeIndex(plant_1_sensor_df["DATE_TIME"]).hour
plant_2_generation_df["hour"] = pd.DatetimeIndex(plant_2_generation_df["DATE_TIME"]).hour
plant_2_sensor_df["hour"] = pd.DatetimeIndex(plant_2_sensor_df["DATE_TIME"]).hour

In [None]:
# Create minute feature in all datasets from DATE_TIME.minute
plant_1_generation_df["minute"] = pd.DatetimeIndex(plant_1_generation_df["DATE_TIME"]).minute
plant_1_sensor_df["minute"] = pd.DatetimeIndex(plant_1_sensor_df["DATE_TIME"]).minute
plant_2_generation_df["minute"] = pd.DatetimeIndex(plant_2_generation_df["DATE_TIME"]).minute
plant_2_sensor_df["minute"] = pd.DatetimeIndex(plant_2_sensor_df["DATE_TIME"]).minute
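
The four feature-engineering cells above repeat the same pattern for every dataframe and every date part. As a sketch of a more compact alternative, the `.dt` accessor can be used inside a small helper. `add_datetime_parts` is a hypothetical name introduced here for illustration, and the toy frame is made up.

```python
import pandas as pd

def add_datetime_parts(df, column="DATE_TIME"):
    """Return a copy of df with month, day, hour and minute columns added."""
    out = df.copy()
    for part in ("month", "day", "hour", "minute"):
        # Series.dt exposes each datetime component as an attribute.
        out[part] = getattr(out[column].dt, part)
    return out

# Toy frame with an already-parsed datetime column.
toy_df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(["2020-05-15 06:45:00", "2020-06-01 23:15:00"]),
})
toy_with_parts = add_datetime_parts(toy_df)
print(toy_with_parts)
```

The helper returns a copy instead of modifying its input, so it would be applied as `plant_1_sensor_df = add_datetime_parts(plant_1_sensor_df)` and so on for the other three frames.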

In [None]:
# Look at a single inverter id to ensure that our datetime feature engineering worked
plant_1_generation_df[plant_1_generation_df["SOURCE_KEY"]=="1BY6WEcLGh8j5v7"]

In [None]:
# Let's compare our datetimes to the rest of the features again
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
# x and y are passed as keywords; recent seaborn releases no longer accept them positionally
sns.scatterplot(x=plant_1_generation_df["month"], y=plant_1_generation_df["DC_POWER"], ax=axes[0,0])
sns.scatterplot(x=plant_1_sensor_df["month"], y=plant_1_sensor_df["IRRADIATION"], ax=axes[0,1])
sns.scatterplot(x=plant_2_generation_df["month"], y=plant_2_generation_df["DC_POWER"], ax=axes[1,0])
sns.scatterplot(x=plant_2_sensor_df["month"], y=plant_2_sensor_df["IRRADIATION"], ax=axes[1,1])
plt.show()

In [None]:
sns.pairplot(plant_1_generation_df)
plt.show()

Some new observations based on the new month, day, hour, and minute features:

1. There appears to be missing data across all features between days 1 and roughly 12 or 13 of the month. Looking into this further will yield more accurate numbers. Possibly due to a pause in gathering observations?

2. Daily yields appear to be slightly higher around the middle of the year. Likely due to a more sunny season?

3. Total yield shows a slow positively increasing trend through each passing month, as we would intuitively expect.

4. Most of the data appears to have been collected within months 5 and 6 (May and June). Why? Higher solar radiation periods for better data collection?

5. The DC and AC Power increases quickly as the hour of day increases, up to a certain point, then begins to decline as the hour increases. This is likely due to the sun's movement throughout the day. More sun mid-day, no sun early-day and late-day.

6. The daily yield begins a gradual increase as the hour increases, then rises sharply up to a certain point (around the same point that DC and AC power begin their decline), then flattens out as DC and AC power fall back to lower levels. Again this can be explained by the sun's movement throughout the day. The daily yield starts low, then as the day brings more sunlight, and thus more AC/DC power, daily yield begins to increase. This increase slows down later in the day as the amount of sunlight decreases, lowering DC and AC power levels. Eventually the increase becomes negligible and the daily yield flatlines.

7. Total yield shows a steady increase as we get later in the year as well as later in the day, which again makes intuitive sense. Total yield should be continuously increasing as time passes and more power is generated.

In [None]:
sns.pairplot(plant_1_sensor_df)
plt.show()

New observations based on new date features:

1. We can immediately see a glaring difference between this dataset and the previous dataset. In the month column, instead of having 12 values, as the previous dataset had, this dataset has only two values (5 and 6), indicating that observations were recorded only for those two months. Incidentally, they are the same two months that had the most observations in the previous dataset. A discrepancy like this can also arise when one file stores dates day-first and they are parsed month-first, so the raw DATE_TIME strings are worth checking before drawing conclusions.

2. As we would expect, we can see a positive correlation between Ambient Temperature, Module Temperature, and Irradiation as the hour of day increases, up to a certain point, then we see a negative correlation where the aforementioned features begin a decline as the hour of day increases. This can be explained in the same way as our first observation: the sun's movement through the sky throughout the day.

3. Perhaps an important note, perhaps not: the days we saw missing in the previous dataset, around days 2-13, do not appear to be missing here. 

In [None]:
sns.pairplot(plant_2_generation_df)
plt.show()

New observations based on new date features:

1. It appears that at this plant site recordings were only made for two months, 5 and 6 (May and June).

2. We see very similar observations in terms of correlations that we saw in the generation dataset for plant 1, between features DC Power, AC Power, and the time of day. 

3. The correlation between daily yield and the hour of the day seems drastically different at first glance; however, we can see that it undergoes the same positive correlation that we saw at plant site 1. The major difference here is that at plant site 2 we appear to start the day off with high levels of daily yield power collection. One possible explanation of this is that it is excess yield carried over from the previous day.

In [None]:
sns.pairplot(plant_2_sensor_df)
plt.show()

New observations after creation of new date features:

1. Only two months have recorded observations. Again months 5 and 6 (May and June).

2. As we observed in plant site 1's sensor data, Ambient Temperature, Module Temperature, and Irradiation all increase as the hour of day increases, up to a point, then begin to decline, likely tracking the amount of sunlight available throughout the day.

In [None]:
# First order of business, let's check out what the deal is with the month discrepancies across
# all of the datasets
print(plant_1_generation_df["month"].value_counts())
print(plant_1_sensor_df["month"].value_counts())
print(plant_2_generation_df["month"].value_counts())
print(plant_2_sensor_df["month"].value_counts())

For the time being we will simply keep note of these differences. They could be important later on, especially if we intend to build a prediction model.
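
One way to investigate the month differences is to look at how ambiguous date strings behave under different parsing rules. The sketch below assumes, purely for illustration, that a file stores dates as `DD-MM-YYYY HH:MM`; that format is an assumption and has not been confirmed against the raw CSVs here. The strings in `raw_dates` are made up.

```python
import pandas as pd

# Hypothetical raw strings in an assumed day-first layout (DD-MM-YYYY HH:MM).
raw_dates = pd.Series(["15-05-2020 00:00", "01-06-2020 00:15", "12-06-2020 23:45"])

# An explicit format removes any guessing about which field is the day.
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y %H:%M")
print(parsed.dt.month.value_counts().sort_index())

# Reading the same string month-first turns 1 June into 6 January.
swapped = pd.to_datetime("01-06-2020 00:15", format="%m-%d-%Y %H:%M")
print(swapped)
```

If the real files turn out to use different layouts, passing the matching `format=` string for each file when calling `pd.to_datetime` would keep the month and day columns consistent across all four datasets.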

4. Explore the data

In [None]:
# Let's create a dataframe that holds only those rows that were recorded at the 23 hour and 45 minute
# mark - this is the last recording in a day, and thus where we will find our total daily_yield
# for any given day.
plant_1_last_daily_recording = plant_1_generation_df[(plant_1_generation_df["hour"]==23) & (plant_1_generation_df["minute"]==45)]
plant_2_last_daily_recording = plant_2_generation_df[(plant_2_generation_df["hour"]==23) & (plant_2_generation_df["minute"]==45)]

# Now we can find the mean of the two last_daily_recording dataframes
plant_1_mean_daily_yield = plant_1_last_daily_recording["DAILY_YIELD"].mean()
plant_2_mean_daily_yield = plant_2_last_daily_recording["DAILY_YIELD"].mean()
print(f"Mean daily yield for plant site 1: {plant_1_mean_daily_yield}")
print(f"Mean daily yield for plant site 2: {plant_2_mean_daily_yield}")

Both plant sites appear to have very similar daily yield means.
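
The calculation above relies on every inverter having a row at exactly 23:45 each day. A possible alternative, sketched here on made-up data, takes the last available reading per inverter per calendar day with `groupby`, which still works when the final interval is missing. `toy_generation_df` and the inverter names are hypothetical.

```python
import pandas as pd

# Toy generation data: inv_A has no 23:45 row on 2020-05-15 (it stops at 23:30).
toy_generation_df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 12:00", "2020-05-15 23:30",
        "2020-05-15 12:00", "2020-05-15 23:45",
        "2020-05-16 12:00", "2020-05-16 23:45",
    ]),
    "SOURCE_KEY": ["inv_A", "inv_A", "inv_B", "inv_B", "inv_A", "inv_A"],
    "DAILY_YIELD": [2000.0, 5000.0, 2500.0, 6000.0, 1800.0, 4000.0],
})

# Sort by time, then keep the last DAILY_YIELD per inverter per calendar day.
ordered = toy_generation_df.sort_values("DATE_TIME")
calendar_day = ordered["DATE_TIME"].dt.normalize()
end_of_day = ordered.groupby(["SOURCE_KEY", calendar_day])["DAILY_YIELD"].last()
print(end_of_day)

# Mean of the end-of-day values across all inverters and days.
mean_daily_yield = end_of_day.mean()
print(mean_daily_yield)
```

Whether the last reading or the daily maximum is the better summary depends on how DAILY_YIELD resets in the real data, which would need to be checked.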

In [None]:
# What is the total irradiation per day?
# Let's get a look at our dataframe for a refresher
plant_1_sensor_df

In [None]:
plant_1_sensor_may = plant_1_sensor_df[plant_1_sensor_df["month"]==5]
plant_1_sensor_june = plant_1_sensor_df[plant_1_sensor_df["month"]==6]
plant_2_sensor_may = plant_2_sensor_df[plant_2_sensor_df["month"]==5]
plant_2_sensor_june = plant_2_sensor_df[plant_2_sensor_df["month"]==6]

In [None]:
# Now using the newly created dataframes we can come up with the total number of days
# accounted for in the data for each plant
plant_1_days_in_may = plant_1_sensor_may["day"].nunique()  # number of distinct days with records
plant_1_days_in_june = plant_1_sensor_june["day"].nunique()
plant_2_days_in_may = plant_2_sensor_may["day"].nunique()
plant_2_days_in_june = plant_2_sensor_june["day"].nunique()
plant_1_total_days = plant_1_days_in_may + plant_1_days_in_june
plant_2_total_days = plant_2_days_in_may + plant_2_days_in_june
print(plant_1_total_days, plant_2_total_days)

In [None]:
# Now let's find the irradiation sum for each plant
plant_1_irrad_sum = plant_1_sensor_df["IRRADIATION"].sum()
plant_2_irrad_sum = plant_2_sensor_df["IRRADIATION"].sum()
print(plant_1_irrad_sum, plant_2_irrad_sum)

In [None]:
# Now to find the average total irradiation per day we divide each plant's irradiation sum
# by the number of days accounted for in that plant's dataset
plant_1_rad_per_day = plant_1_irrad_sum / plant_1_total_days
plant_2_rad_per_day = plant_2_irrad_sum / plant_2_total_days
print(f"Plant 1 has a total irradiation per day of {plant_1_rad_per_day}")
print(f"Plant 2 has a total irradiation per day of {plant_2_rad_per_day}")
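
The figure above is a single average over all days. To see the total irradiation for each individual day, the readings can be grouped by calendar date. This is a sketch on made-up values; `toy_sensor_df` is a hypothetical stand-in for either sensor dataframe.

```python
import pandas as pd

# Toy sensor readings on two days (values chosen for easy arithmetic).
toy_sensor_df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 11:00", "2020-05-15 11:15",
        "2020-05-16 11:00", "2020-05-16 11:15",
    ]),
    "IRRADIATION": [0.5, 0.75, 0.25, 0.5],
})

# Sum the irradiation readings within each calendar date.
daily_irradiation = toy_sensor_df.groupby(toy_sensor_df["DATE_TIME"].dt.date)["IRRADIATION"].sum()
print(daily_irradiation)

# The mean of the daily totals matches "overall sum divided by number of days".
print(daily_irradiation.mean())
```

Grouping by the full date rather than by the day-of-month number also avoids mixing, say, 16 May with 16 June.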

In [None]:
# What is the max ambient and module temperature
# Let's start by visualizing the ambient and module temperature for 
# each plant site
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Graph for plant 1 sensor ambient temperature data in may and june
# label= ties each legend entry to its own line rather than to the shaded confidence band
sns.lineplot(x=plant_1_sensor_may["day"], y=plant_1_sensor_may["AMBIENT_TEMPERATURE"], label="may", ax=axes[0, 0])
sns.lineplot(x=plant_1_sensor_june["day"], y=plant_1_sensor_june["AMBIENT_TEMPERATURE"], label="june", ax=axes[0, 0])
axes[0,0].set_title("Plant 1 Ambient Temperature")
axes[0,0].legend(loc='upper right', shadow=True)

# Graph for plant 1 sensor module temperature data in may and june
sns.lineplot(x=plant_1_sensor_may["day"], y=plant_1_sensor_may["MODULE_TEMPERATURE"], label="may", ax=axes[0, 1])
sns.lineplot(x=plant_1_sensor_june["day"], y=plant_1_sensor_june["MODULE_TEMPERATURE"], label="june", ax=axes[0, 1])
axes[0,1].set_title("Plant 1 Module Temperature")
axes[0,1].legend(loc='upper right', shadow=True)

# Graph for plant 2 sensor ambient temperature data in may and june
sns.lineplot(x=plant_2_sensor_may["day"], y=plant_2_sensor_may["AMBIENT_TEMPERATURE"], label="may", ax=axes[1, 0])
sns.lineplot(x=plant_2_sensor_june["day"], y=plant_2_sensor_june["AMBIENT_TEMPERATURE"], label="june", ax=axes[1, 0])
axes[1,0].set_title("Plant 2 Ambient Temperature")
axes[1,0].legend(loc='upper right', shadow=True)

# Graph for plant 2 sensor module temperature data in may and june
sns.lineplot(x=plant_2_sensor_may["day"], y=plant_2_sensor_may["MODULE_TEMPERATURE"], label="may", ax=axes[1, 1])
sns.lineplot(x=plant_2_sensor_june["day"], y=plant_2_sensor_june["MODULE_TEMPERATURE"], label="june", ax=axes[1, 1])
axes[1,1].set_title("Plant 2 Module Temperature")
axes[1,1].legend(loc='upper right', shadow=True)

plt.show()

In [None]:
# Let's quantify the max temperatures now
plant_1_max_amb_temp = plant_1_sensor_df["AMBIENT_TEMPERATURE"].max()
plant_1_max_mod_temp = plant_1_sensor_df["MODULE_TEMPERATURE"].max()
plant_2_max_amb_temp = plant_2_sensor_df["AMBIENT_TEMPERATURE"].max()
plant_2_max_mod_temp = plant_2_sensor_df["MODULE_TEMPERATURE"].max()

print(f"For Plant site 1 the max ambient temperature recorded was {plant_1_max_amb_temp}")
print(f"For Plant site 1 the max module temperature recorded was {plant_1_max_mod_temp}")
print(f"For Plant site 2 the max ambient temperature recorded was {plant_2_max_amb_temp}")
print(f"For Plant site 2 the max module temperature recorded was {plant_2_max_mod_temp}")

In [None]:
# How many inverters are there for each plant?
# This should be a simple calculation. All we need to do is create a list
# of unique values for the SOURCE_KEY in both sites' generation datasets
plant_1_inverter_count = plant_1_generation_df["SOURCE_KEY"].unique()
plant_2_inverter_count = plant_2_generation_df["SOURCE_KEY"].unique()
print(f"Plant 1 site has a total of {len(plant_1_inverter_count)} inverters.")
print(f"Plant 2 site has a total of {len(plant_2_inverter_count)} inverters.")

In [None]:
# What is the maximum/minimum amount of DC/AC Power generated in a time interval/day?
# To solve this problem we need to find the sum of the last daily recording of the DAILY_YIELD column 
# for each unique inverter per day. For this we can use the may and june dataframes we created earlier 
# as well as the inverter list we made to solve the problem just prior to this one


In [None]:
# Create empty dicts for storing total daily yields by day
plant_1_daily_total_yield_may = {}
plant_1_daily_total_yield_june = {}
plant_2_daily_total_yield_may = {}
plant_2_daily_total_yield_june = {}

# Populate key values with keys that will match up to the days in our dataframe
for i in range(1, 32):
    plant_1_daily_total_yield_may[i] = 0  # Set the value of these keys to 0
    plant_1_daily_total_yield_june[i] = 0
    plant_2_daily_total_yield_may[i] = 0
    plant_2_daily_total_yield_june[i] = 0
    
# Iterate through each day key and sum the end-of-day DAILY_YIELD across all inverters
for day in plant_1_daily_total_yield_may.keys():
    temp_df = plant_1_last_daily_recording[(plant_1_last_daily_recording["day"]==day) & (plant_1_last_daily_recording["month"]==5)]
    plant_1_daily_total_yield_may[day] = temp_df["DAILY_YIELD"].sum()
    temp_df = plant_1_last_daily_recording[(plant_1_last_daily_recording["day"]==day) & (plant_1_last_daily_recording["month"]==6)]
    plant_1_daily_total_yield_june[day] = temp_df["DAILY_YIELD"].sum()
    
    temp_df = plant_2_last_daily_recording[(plant_2_last_daily_recording["day"]==day) & (plant_2_last_daily_recording["month"]==5)]
    plant_2_daily_total_yield_may[day] = temp_df["DAILY_YIELD"].sum()
    temp_df = plant_2_last_daily_recording[(plant_2_last_daily_recording["day"]==day) & (plant_2_last_daily_recording["month"]==6)]
    plant_2_daily_total_yield_june[day] = temp_df["DAILY_YIELD"].sum()

In [None]:
# Now that we have the daily_yield for every day with a record we can find the maximum for each month
# (days without records were initialised to 0 above, so they never affect the maximum)
plant_1_may_daily_yield_max = max(plant_1_daily_total_yield_may.values())
plant_1_june_daily_yield_max = max(plant_1_daily_total_yield_june.values())
plant_2_may_daily_yield_max = max(plant_2_daily_total_yield_may.values())
plant_2_june_daily_yield_max = max(plant_2_daily_total_yield_june.values())
        
print(f"The max daily power yield for the month of May at plant site 1 was {plant_1_may_daily_yield_max}")
print(f"The max daily power yield for the month of June at plant site 1 was {plant_1_june_daily_yield_max}")
print(f"The max daily power yield for the month of May at plant site 2 was {plant_2_may_daily_yield_max}")
print(f"The max daily power yield for the month of June at plant site 2 was {plant_2_june_daily_yield_max}")

In [None]:
# Now find the absolute max for each plant site
plant_1_abs_max = max([plant_1_may_daily_yield_max, plant_1_june_daily_yield_max])
plant_2_abs_max = max([plant_2_may_daily_yield_max, plant_2_june_daily_yield_max])

print(f"The absolute max daily yield at plant site 1 was {plant_1_abs_max}")
print(f"The absolute max daily yield at plant site 2 was {plant_2_abs_max}")
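
The cells above answer the question in terms of DAILY_YIELD and report only maxima. The original question also asks for the maximum and minimum DC/AC power per day, which can be sketched with a single `groupby` and `agg`. The frame below is made up, and `power_range_by_day` is a name introduced only for this example.

```python
import pandas as pd

# Toy generation readings on two days (values are invented).
toy_generation_df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime([
        "2020-05-15 06:00", "2020-05-15 12:00",
        "2020-05-16 06:00", "2020-05-16 12:00",
    ]),
    "DC_POWER": [100.0, 900.0, 50.0, 700.0],
    "AC_POWER": [10.0, 88.0, 5.0, 68.0],
})

# Minimum and maximum of each power column within every calendar date.
power_range_by_day = (
    toy_generation_df
    .groupby(toy_generation_df["DATE_TIME"].dt.date)[["DC_POWER", "AC_POWER"]]
    .agg(["min", "max"])
)
print(power_range_by_day)
```

On real data the daily minimum is likely to be dominated by night-time readings, so filtering to daylight hours first may give a more informative minimum.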

In [None]:
# For funsies let's look at the daily yield numbers visually
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0,0].plot(list(plant_1_daily_total_yield_may.keys()), list(plant_1_daily_total_yield_may.values()))
axes[0,0].set_title("Plant 1 Daily Yield Totals for May")

axes[0,1].plot(list(plant_1_daily_total_yield_june.keys()), list(plant_1_daily_total_yield_june.values()))
axes[0,1].set_title("Plant 1 Daily Yield Totals for June")

axes[1,0].plot(list(plant_2_daily_total_yield_may.keys()), list(plant_2_daily_total_yield_may.values()))
axes[1,0].set_title("Plant 2 Daily Yield Totals for May")

axes[1,1].plot(list(plant_2_daily_total_yield_june.keys()), list(plant_2_daily_total_yield_june.values()))
axes[1,1].set_title("Plant 2 Daily Yield Totals for June")

plt.show()

In [None]:
# Which inverter (source_key) has produced maximum DC/AC power?
# To solve this problem we will need the last daily yield dataframes
# we created earlier.
plant_1_source_key_yields = {}
plant_2_source_key_yields = {}

for inverter in plant_1_last_daily_recording["SOURCE_KEY"].unique():
    temp_df = plant_1_last_daily_recording[plant_1_last_daily_recording["SOURCE_KEY"]==inverter]
    plant_1_source_key_yields[inverter] = temp_df["DAILY_YIELD"].sum()
    
for inverter in plant_2_last_daily_recording["SOURCE_KEY"].unique():
    temp_df2 = plant_2_last_daily_recording[plant_2_last_daily_recording["SOURCE_KEY"]==inverter]
    plant_2_source_key_yields[inverter] = temp_df2["DAILY_YIELD"].sum()
    
plant_1_source_key_yields

In [None]:
# Pick the inverter with the largest summed end-of-day DAILY_YIELD at each site
plant_1_max_yield_inverter = max(plant_1_source_key_yields, key=plant_1_source_key_yields.get)
plant_1_max_yield = plant_1_source_key_yields[plant_1_max_yield_inverter]

plant_2_max_yield_inverter = max(plant_2_source_key_yields, key=plant_2_source_key_yields.get)
plant_2_max_yield = plant_2_source_key_yields[plant_2_max_yield_inverter]
        
print(f"The plant 1 site inverter {plant_1_max_yield_inverter} had the greatest max yield of {plant_1_max_yield}")
print(f"The plant 2 site inverter {plant_2_max_yield_inverter} had the greatest max yield of {plant_2_max_yield}")

In [None]:
# Let's visualize the summed end-of-day yield for each individual inverter
fig, axes = plt.subplots(1, 2, figsize=(20, 10))

axes[0].plot(list(plant_1_source_key_yields.keys()), list(plant_1_source_key_yields.values()))
axes[0].set_title("Plant 1 Max Yield Totals for Individual Inverters")

axes[1].plot(list(plant_2_source_key_yields.keys()), list(plant_2_source_key_yields.values()))
axes[1].set_title("Plant 2 Max Yield Totals for Individual Inverters")

plt.show()


In [None]:
# Rank the inverters based on the DC/AC power they produce
# Using the data prior to this we can determine a ranking system
# for our inverters
plant_1_source_key_yields_df = pd.DataFrame(plant_1_source_key_yields.items(), columns=["Inverter_ID", "Max_Power_Yield"])
plant_2_source_key_yields_df = pd.DataFrame(plant_2_source_key_yields.items(), columns=["Inverter_ID", "Max_Power_Yield"])

# The following is an ordered list of best inverter to worst inverter based on max power yield
print("PLANT 1")
print(plant_1_source_key_yields_df.sort_values("Max_Power_Yield", ascending=False).reset_index(drop=True))
print()
print("PLANT 2")
print(plant_2_source_key_yields_df.sort_values("Max_Power_Yield", ascending=False).reset_index(drop=True))
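
The ranking above is based on summed daily yield. If a ranking on the DC/AC power columns themselves is wanted, one possible approach is to aggregate per inverter and attach an explicit rank. The sketch uses made-up inverter names and values; `inverter_ranking` is a hypothetical name.

```python
import pandas as pd

# Toy generation rows for three hypothetical inverters.
toy_generation_df = pd.DataFrame({
    "SOURCE_KEY": ["inv_A", "inv_A", "inv_B", "inv_B", "inv_C", "inv_C"],
    "DC_POWER": [100.0, 300.0, 250.0, 350.0, 50.0, 150.0],
    "AC_POWER": [10.0, 29.0, 24.0, 34.0, 5.0, 14.0],
})

# Total DC and AC power per inverter, ordered from highest to lowest DC total.
inverter_ranking = (
    toy_generation_df
    .groupby("SOURCE_KEY")[["DC_POWER", "AC_POWER"]]
    .sum()
    .sort_values("DC_POWER", ascending=False)
)

# Rank 1 is the highest DC total; method="min" gives tied inverters the same rank.
inverter_ranking["rank"] = (
    inverter_ranking["DC_POWER"].rank(ascending=False, method="min").astype(int)
)
print(inverter_ranking)
```

Summing is only one choice of aggregate; using `.mean()` instead would reduce the influence of inverters that simply have more recorded rows.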

In [None]:
# Are there any empty values?
# This is an easy problem to answer
print(plant_1_generation_df.isnull().sum())
print(plant_1_sensor_df.isnull().sum())
print(plant_2_generation_df.isnull().sum())
print(plant_2_sensor_df.isnull().sum())

In [None]:
# There are no obvious empty values in any of the dataframes.
# However, from all of the observations and graphs made above we can see that for each site there appears to be a number of days
# at the beginning of the month in May and at the end of the month in June that show no data collection, with the exception of a single
# day in May at plant site 1 on May 6th - perhaps this was simply just a test to prepare for the actual data collection process that would begin
# on May 16?
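
`isnull()` only finds cells that are present but empty; rows that were never recorded do not show up at all. A sketch of how missing timestamps could be listed explicitly, assuming a regular 15-minute schedule, is to compare the observed timestamps against a complete `date_range`. The timestamps below are made up.

```python
import pandas as pd

# Toy observed timestamps with the 00:30 reading absent.
observed = pd.to_datetime([
    "2020-05-15 00:00", "2020-05-15 00:15",
    "2020-05-15 00:45", "2020-05-15 01:00",
])

# Every timestamp a complete 15-minute schedule would contain over the same span.
expected = pd.date_range(observed.min(), observed.max(), freq="15min")

# Timestamps that were expected but never recorded.
missing_timestamps = expected.difference(observed)
print(missing_timestamps)
```

For the generation data the same comparison would need to be done per SOURCE_KEY, since each inverter has its own series of readings.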

In [None]:
# Let's look at the data from the months prior to May and the months after June at plant site 1.
print(plant_1_generation_df[plant_1_generation_df["month"] < 5]["day"].value_counts())
print(plant_1_generation_df[plant_1_generation_df["month"] > 6]["day"].value_counts())
print(plant_1_generation_df[(plant_1_generation_df["month"] == 5)]["day"].value_counts())
print(plant_1_generation_df[(plant_1_generation_df["month"] == 6)]["day"].value_counts())

In [None]:
# Some interesting observations there. Data appears to have been gathered on the 6th day of every month of the year at plant site 1, and May and June
# are the only months with data outside of that 6th day. This could be for testing purposes or year-long research on a chosen day of the month, but a
# pattern like this is also what a day/month swap during date parsing would produce, so the raw DATE_TIME strings should be checked first.
# If we wanted to include year-long research we would need to remove all days except for the 6th day of each month, and even then we would only be able to
# use data from the first plant site. Whenever we compare data between the two plant sites, it will be important to remove the months preceding May and
# the months following June.