# Data Preprocessing & Splitting

## Imported Packages

In [1]:
import pandas as pd

## Final Dataset Information

Timeline:
- Start Date: 2017-01-01 00:00:00 UTC
- End Date: 2017-12-31 23:58:00 UTC

Location:
- Residential Rooftop PV Solar Panel in Utrecht, Netherlands
- Coordinates: (52.07, 5.13)

Time Resolutions: 5 mins and 30 mins

Missing Values Percentage: 
- 30 mins: 802/17520 ≈ 4.58%
- 5 mins: 4850/105120 ≈ 4.61%
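
The percentages above follow directly from the counts. As a quick check, assuming the totals are the expected row counts for a 365-day year (2017) at 48 and 288 samples per day:

```python
# Expected number of rows for one non-leap year (2017) at each resolution.
DAYS_IN_2017 = 365
rows_30min = DAYS_IN_2017 * 24 * 2    # 48 samples per day
rows_5min = DAYS_IN_2017 * 24 * 12    # 288 samples per day

# Missing-value counts reported in this notebook.
missing_30min = 802
missing_5min = 4850

pct_30min = 100 * missing_30min / rows_30min
pct_5min = 100 * missing_5min / rows_5min

print(rows_30min, rows_5min)          # 17520 105120
print(f"30 min: {pct_30min:.2f}%")    # 30 min: 4.58%
print(f"5 min: {pct_5min:.2f}%")      # 5 min: 4.61%
```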

Constants:
- Azimuth: 180 (degrees) 
- Tilt: 15 (degrees)
- Max AC Capacity: 2413
- Max DC Capacity: 2500

Important Notes: 
- For the time stamp and PV power variables, I downloaded 'unfiltered_pv_power_measurements.csv' (Version v2) from https://zenodo.org/records/10953360 and picked ID005 out of the 175 PV systems in the Netherlands because it has no missing values between 1/1/2017 and 12/31/2017. I found this dataset through this literature review (https://ieeexplore.ieee.org/document/10107594), which I also use to guide me through the forecasting/predictive modeling process.
- Negative PV power readings in the dataset are normal: they mean the system is drawing slightly more power than it generates, which only occurs at night (when irradiance is 0). Furthermore, the negative readings are minuscule (very close to 0), on the order of -0.5; the first rows of the 5-minute output later in this notebook range from about -0.43 to -0.64.
- For all weather variables (every variable below except Time and PV Power), I used https://toolkit.solcast.com.au/ to download the data for free at time resolutions of 5 and 30 mins for the coordinates (52.07, 5.13) between 1/1/2017 and 12/31/2017.
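
The "no missing values" criterion used to pick ID005 can be checked programmatically. The following is a minimal sketch on toy data, not the actual selection procedure: the frame stands in for 'unfiltered_pv_power_measurements.csv', and the column names `ID001` and `ID002` as well as all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw file: one column per PV system, indexed by time.
# ID001, ID002 and all values are hypothetical.
index = pd.date_range("2017-01-01 00:00", periods=6, freq="1min")
raw = pd.DataFrame(
    {
        "ID001": [0.0, np.nan, 1.2, 1.5, np.nan, 0.3],
        "ID002": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6],
        "ID005": [-0.4, -0.6, -0.6, -0.5, -0.5, -0.4],
    },
    index=index,
)

# Restrict to the period of interest, then count missing values per system.
window = raw.loc["2017-01-01":"2017-12-31"]
missing_per_system = window.isna().sum()

# Keep only the systems with no missing values in that period.
complete_systems = missing_per_system[missing_per_system == 0].index.tolist()

print(missing_per_system)
print(complete_systems)
```

On the toy frame only `ID005` survives the filter; on the real file the same pattern would list every system that is complete for 2017.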

## 8 Variables

Working with the 5- and 30-minute resolutions, with a start time of "6/1/17 0:00" and an end time of "6/30/17 23:55", there are 8 variables whose relationships we will explore and analyze. They are listed below with their definitions and units:

- Time Stamps (UTC): format='%m/%d/%y %H:%M'
- PV Power Output of a Single Residential Rooftop Solar Panel (watts)
- Ambient Temperature (°C)
- Wind Speed at 10 m above MSL (m/s)
- Wind Direction at 10 m above MSL (degrees)
- Direct Normal Irradiance (DNI): the direct-beam radiation received on a surface perpendicular to the sun's rays. It is the component used in thermal (concentrating solar power, CSP) and photovoltaic concentration (concentrated photovoltaic, CPV) technology.
- Global Horizontal Irradiance (GHI): the sum of direct and diffuse radiation received on a horizontal plane. GHI is a reference radiation for the comparison of climatic zones; it is also an essential parameter for calculating radiation on a tilted plane.
- Global Tilt Irradiance (GTI): the total radiation received on a surface with a defined tilt and azimuth, fixed or sun-tracking. It is the sum of the direct, scattered (diffuse), and reflected radiation. It is a reference for photovoltaic (PV) applications and can occasionally be affected by shadow.
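
As a small illustration of the time stamp format listed above, the sketch below parses a few sample strings with `pd.to_datetime` and marks them as UTC. The sample strings are illustrative, and localizing to UTC is an assumption based on the "(UTC)" label rather than something the files themselves encode.

```python
import pandas as pd

# Sample strings in the stated format '%m/%d/%y %H:%M' (illustrative values).
stamps = pd.Series(["6/1/17 0:00", "6/1/17 0:05", "6/30/17 23:55"])

# Parse with an explicit format, then attach the UTC timezone.
parsed = pd.to_datetime(stamps, format="%m/%d/%y %H:%M").dt.tz_localize("UTC")

# Span covered by the first and last sample time stamps.
span = parsed.iloc[-1] - parsed.iloc[0]

print(parsed.iloc[0])
print(span)
```

Passing an explicit `format` avoids ambiguity between month-first and day-first dates.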

---

## Splitting Unfiltered (Raw) PV Power Dataset

### 1-Min Resolution to 30-Min Resolution

In [2]:
# Load the dataset
data = pd.read_csv("./RAW_PV_POWER.csv")

# Convert the 'DateTime' column to datetime format
data['DateTime'] = pd.to_datetime(data['DateTime'], format='%Y-%m-%d %H:%M:%S UTC')

# Filter the rows with increments of exactly 30 minutes
filtered_data = data[data['DateTime'].dt.minute % 30 == 0]

# Preview the first few rows (Jupyter only displays this when it is the last expression in the cell)
filtered_data.head()

# Save the filtered dataset to a new CSV file
filtered_data.to_csv("./Filtered_PV_POWER_30.csv", index=False)

print("Filtered data saved to 'Filtered_PV_POWER_30.csv'")

Filtered data saved to 'Filtered_PV_POWER_30.csv'
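
Note that filtering on `minute % 30 == 0` keeps the instantaneous 1-minute reading at each half hour; it does not average the readings in between. The sketch below contrasts the two on toy data. The averaging variant is shown only as an alternative and is not what this notebook does; all values are made up.

```python
import pandas as pd

# Toy 1-minute series standing in for the ID005 column (values are made up).
index = pd.date_range("2017-06-01 12:00", periods=60, freq="1min")
toy = pd.DataFrame({"DateTime": index, "ID005": range(60)})

# Approach used in this notebook: keep only the rows that fall on :00 and :30.
subsampled = toy[toy["DateTime"].dt.minute % 30 == 0]

# Alternative (an assumption, not the notebook's method): mean of each 30-minute window.
averaged = toy.set_index("DateTime")["ID005"].resample("30min").mean()

print(subsampled["ID005"].tolist())   # [0, 30]
print(averaged.tolist())              # [14.5, 44.5]
```

Subsampling preserves actual measurements, while averaging smooths short fluctuations; which one is preferable depends on how the weather variables at the same resolution were aggregated.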


### 1-Min Resolution to 5-Min Resolution

In [3]:
# Load the dataset
data = pd.read_csv("./RAW_PV_POWER.csv")

# Convert the 'DateTime' column to datetime format
data['DateTime'] = pd.to_datetime(data['DateTime'], format='%Y-%m-%d %H:%M:%S UTC')

# Filter the rows with increments of exactly 5 minutes
filtered_data = data[data['DateTime'].dt.minute % 5 == 0]

# Display the first few rows of the filtered dataset to confirm
print(filtered_data.head())

# Save the filtered dataset to a new CSV file
filtered_data.to_csv("./Filtered_PV_Power_Data_5min.csv", index=False)

              DateTime     ID005
0  2017-01-01 00:00:00 -0.425226
5  2017-01-01 00:05:00 -0.641800
10 2017-01-01 00:10:00 -0.621484
15 2017-01-01 00:15:00 -0.489517
20 2017-01-01 00:20:00 -0.540800
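
A quick sanity check after filtering is to confirm that consecutive time stamps are exactly 5 minutes apart. This is a minimal sketch on one hour of toy data with made-up values; the same check could be run on `filtered_data`.

```python
import pandas as pd

# Toy 1-minute data for one hour (made-up values), filtered the same way as above.
index = pd.date_range("2017-01-01 00:00", periods=60, freq="1min")
toy = pd.DataFrame({"DateTime": index, "ID005": -0.5})
five_min = toy[toy["DateTime"].dt.minute % 5 == 0]

# Differences between consecutive time stamps (the first row has no predecessor).
steps = five_min["DateTime"].diff().dropna()

# True only if every step is exactly 5 minutes.
is_regular = bool((steps == pd.Timedelta(minutes=5)).all())

print(len(five_min), is_regular)   # 12 True
```

A `False` result would point to gaps or duplicate time stamps in the raw file.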


---

## Compiling Weather and Filtered PV Power Datasets into 5- and 30-Min Resolution Datasets

No programming is necessary for this step. I created a new Excel file and copied and pasted the variable columns from the 5-minute filtered PV power and weather datasets into it to create a single dataset. I then downloaded it as a .csv file and uploaded it to JupyterLab. For the 30-minute resolution, I followed the same process using the 30-minute PV power and weather datasets.
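
For reference, the same compilation could also be done in pandas by joining on the time stamp. The sketch below uses toy frames, not the actual files: the weather column names `air_temp` and `ghi` and all values are hypothetical, and it assumes both sources share identical time stamps in the same timezone.

```python
import pandas as pd

# Toy stand-ins for the filtered PV power file and the weather file.
# Column names other than 'DateTime' and 'ID005' are hypothetical.
times = pd.date_range("2017-06-01 00:00", periods=4, freq="5min")
pv = pd.DataFrame({"DateTime": times, "ID005": [-0.4, -0.6, -0.6, -0.5]})
weather = pd.DataFrame(
    {
        "DateTime": times,
        "air_temp": [12.1, 12.0, 11.9, 11.9],
        "ghi": [0, 0, 0, 0],
    }
)

# Join on the time stamp; validate= raises if either side has duplicate time stamps.
combined = pv.merge(weather, on="DateTime", how="inner", validate="one_to_one")

print(combined.shape)   # (4, 4)
```

Compared with copying columns by hand, a keyed join makes misaligned or missing rows visible: with `how="inner"`, any time stamp missing from either source is dropped, so the row count can be compared against the inputs.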