# Predicting Photovoltaic Energy Production Performance

We are going to perform a predictive analysis for...

Specifically, this will cover:

* Performing a train-test split to evaluate model performance on unseen data
* Applying appropriate preprocessing steps to training and test data
* Identifying overfitting and underfitting

### Data Understanding

I will be using the enhanced photovoltaic dataset, modeling the `...` based on all other numeric features of the dataset. ([dataset here](https://www.kaggle.com/datasets/ziya07/photovoltaic-plant-monitoring/data))

#### Let's import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_auc_score,
    RocCurveDisplay,
    accuracy_score
)

#### Quick EDA

In [2]:
# Load dataset
df = pd.read_csv("Data/enhanced_photovoltaic_data.csv", index_col=0)
# Preview
print(df.head())

                     Solar_Irradiance  Temperature   Humidity  Wind_Speed  \
Timestamp                                                                   
2024-01-01 00:00:00        499.632095    19.530827  70.534191   12.621713   
2024-01-01 01:00:00        960.571445    16.662407  49.937520    7.069308   
2024-01-01 02:00:00        785.595153    33.528016  88.132803   14.686574   
2024-01-01 03:00:00        678.926787    29.361828  25.943481    9.512132   
2024-01-01 04:00:00        324.814912    36.045719  72.336336    1.893971   

                     Panel_Angle  Energy_Output Energy_Output_Class  
Timestamp                                                            
2024-01-01 00:00:00     6.210683     117.090063                 Low  
2024-01-01 01:00:00    84.670607     478.957863                High  
2024-01-01 02:00:00    45.622839     190.098226                 Low  
2024-01-01 03:00:00    36.847086     503.067447                High  
2024-01-01 04:00:00    72.979072     506

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 721 entries, 2024-01-01 00:00:00 to 2024-01-31 00:00:00
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Solar_Irradiance     721 non-null    float64
 1   Temperature          721 non-null    float64
 2   Humidity             721 non-null    float64
 3   Wind_Speed           721 non-null    float64
 4   Panel_Angle          721 non-null    float64
 5   Energy_Output        721 non-null    float64
 6   Energy_Output_Class  721 non-null    object 
dtypes: float64(6), object(1)
memory usage: 45.1+ KB
None


In [5]:
# --- Missing values check ---
print("Missing values per column:")
print(df.isnull().sum())

Missing values per column:
Solar_Irradiance       0
Temperature            0
Humidity               0
Wind_Speed             0
Panel_Angle            0
Energy_Output          0
Energy_Output_Class    0
dtype: int64


In [6]:
print("\nPercentage of missing values:")
print((df.isnull().mean() * 100).round(2))


Percentage of missing values:
Solar_Irradiance       0.0
Temperature            0.0
Humidity               0.0
Wind_Speed             0.0
Panel_Angle            0.0
Energy_Output          0.0
Energy_Output_Class    0.0
dtype: float64


### Modeling

For this problem we will build **two models**:
1) A logistic regression as the baseline model.
2) A decision tree as a second, more complex model, which we will then tune for further improvement.

Before fitting either model, I will perform a **train-test split**, so that each model is fit on the training set and evaluated on the test set, which it has not seen.


### Requirements

#### 1. Perform a Train-Test Split

#### 2. Fit a `Logistic regression` Model

#### 3. Fit a `Decision tree` Model

#### 4. Fit a `Decision tree Tuned` Model (improve the previous model)

#### 5. Compare the models

#### 6. Determine feature importance

## 1. Train-Test Split
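As a minimal sketch of this step, the cell below splits a small synthetic frame that reuses the notebook's feature names. The frame `df_demo`, the two-level label rule, and the choice of `Energy_Output_Class` as target are assumptions made for illustration; the real column may contain more levels, and the real split should be run on `df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's column names (NOT the real data).
rng = np.random.default_rng(42)
n = 200
df_demo = pd.DataFrame({
    "Solar_Irradiance": rng.uniform(200, 1000, n),
    "Temperature": rng.uniform(15, 40, n),
    "Humidity": rng.uniform(20, 90, n),
    "Wind_Speed": rng.uniform(0, 15, n),
    "Panel_Angle": rng.uniform(0, 90, n),
})
# Assumed target: a two-level class label derived from an arbitrary rule.
df_demo["Energy_Output_Class"] = np.where(
    df_demo["Solar_Irradiance"] > 600, "High", "Low"
)

X = df_demo.drop(columns=["Energy_Output_Class"])
y = df_demo["Energy_Output_Class"]

# Hold out 20% of the rows; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

On the real data, a continuous column such as `Energy_Output` would probably need to be dropped from `X` as well if the class label is derived from it, otherwise the model could read the answer directly from that feature.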


## 2. Fit a Logistic Regression Model

This is our baseline model. We will use the `StandardScaler` class to scale the training and test sets.
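The scaling and fitting could look roughly like the sketch below. It runs on synthetic data from `make_classification` (an assumption, not the photovoltaic dataset), and the key point it illustrates is that the scaler is fit on the training set only and then reused on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 numeric features, binary target (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train_scaled, y_train)

train_acc = accuracy_score(y_train, logreg.predict(X_train_scaled))
test_acc = accuracy_score(y_test, logreg.predict(X_test_scaled))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Fitting the scaler on the full dataset instead would leak test-set statistics into training, which tends to make the test score look better than it should.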

## 3. Fit a `Decision Tree` Model
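A hedged sketch of this step: an unconstrained `DecisionTreeClassifier` fit on synthetic data (again an assumption, with some label noise added through `flip_y`). Comparing training and test accuracy is one simple way to spot overfitting, since a fully grown tree usually memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No depth limit: the tree keeps splitting until its leaves are pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"depth: {tree.get_depth()}")
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(f"gap (train - test): {train_acc - test_acc:.3f}")
```

A large positive gap suggests overfitting; low scores on both sets would instead point to underfitting.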



## 4. Improved Model — Hyperparameter Tuning
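One way to tune the tree is a cross-validated grid search with `GridSearchCV`, sketched below on synthetic data. The grid values for `max_depth` and `min_samples_leaf` are illustrative assumptions, not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid: limit depth and leaf size to restrain overfitting.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_  # refit on the full training set by default
test_acc = best_tree.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"cv accuracy: {search.best_score_:.3f}, test accuracy: {test_acc:.3f}")
```

The test set is touched only once, after the search, so it remains an honest estimate of performance on unseen data.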

## 5. Model Comparison
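The comparison can be collected into a single table, as in the sketch below. It uses synthetic data, wraps the scaler and logistic regression in a `make_pipeline` for brevity (a convenience swapped in here, not necessarily how the notebook does it), and uses `max_depth=3` as a placeholder for whatever the tuning step selects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise (NOT the real data).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    # Placeholder for the tuned tree; max_depth=3 is an assumed value.
    "Decision tree (depth-limited)": DecisionTreeClassifier(
        max_depth=3, random_state=42
    ),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    rows.append({
        "model": name,
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
        "test_roc_auc": roc_auc_score(y_test, proba),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

Reading `train_acc` next to `test_acc` for each row makes overfitting visible at a glance, while `test_roc_auc` gives a threshold-independent view of the same models.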

#### Observation


## 6. Feature Importance
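For a tree model, impurity-based importances are available through the `feature_importances_` attribute. The sketch below ranks them on a synthetic frame that reuses the notebook's feature names; the label is generated from `Solar_Irradiance` plus noise by construction, so the ranking it prints says nothing about the real plant data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the notebook's feature names (NOT the real data).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Solar_Irradiance": rng.uniform(200, 1000, n),
    "Temperature": rng.uniform(15, 40, n),
    "Humidity": rng.uniform(20, 90, n),
    "Wind_Speed": rng.uniform(0, 15, n),
    "Panel_Angle": rng.uniform(0, 90, n),
})
# Assumed rule: the label depends on irradiance plus Gaussian noise.
noise = rng.normal(0, 50, n)
y = np.where(X["Solar_Irradiance"] + noise > 600, "High", "Low")

# Fit on the whole demo frame for brevity; use the training split on real data.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Importances sum to 1; higher means the feature reduced impurity more.
importances = pd.Series(
    tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can favor features with many distinct values, so on the real data it may be worth cross-checking them with `sklearn.inspection.permutation_importance` on the test set.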

## Business Recommendations