# SIMPLER APPROACH: Supervised learning example: regression.

## ☀️ Prediction of photovoltaic generation for self-consumption.

**Objective:** Predict a household's PV generation for the next day, so that its consumption can be managed intelligently. 
* We will use historical data of the **target variable** we want to predict (historical PV generation data), together with other features that can help predict it.



<img src="figures/ml.png" alt="Data center diagram" width="800">

# **0. Import libraries and data**.

In [None]:
# We import libraries
import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings('ignore')



# We load the input data set
dataset = pd.read_csv('data/regression_PVforecasting.csv', delimiter=';')


# **1. Data analysis: Understanding the data**

It is necessary to visualize and understand the data we are going to work with, as well as to know its characteristics. 

1. How many rows do we have? How many attributes are there in the data?  
2. What are these attributes?
3. Is there any missing data?
4. Statistical summary of the input data set.

**How many attributes are there in the data?**

In [None]:
### Dataset shape
dataset.shape

**What do they mean?**

In [None]:
# First 5 rows
dataset.head()

In [None]:
# Last 5 rows
dataset.tail()


The ``.dtypes`` attribute is essential for data cleaning and preprocessing, because the data type determines what can be done with each column:

* ``int64`` or ``float64`` → numeric → can be used for math, stats, ML models

* ``object`` → usually text or mixed data → needs preprocessing

* ``datetime64`` → time-aware operations (resampling, trends, etc.)

In [None]:
# data format
dataset.dtypes

In [None]:
# Convert localhour to datetime
dataset['localhour'] = pd.to_datetime(dataset['localhour'])

In [None]:
dataset['localhour']

Let's check the data types again

In [None]:
# data format
dataset.dtypes
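
The effect of the conversion can be seen in isolation on a tiny frame. This is a minimal sketch: the values are made up, and only the column names ``localhour`` and ``pvgen`` are taken from the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; timestamps start out as plain strings
toy = pd.DataFrame({
    "localhour": ["2015-01-01 10:00:00", "2015-01-01 11:00:00"],
    "pvgen": [1.2, 1.5],
})
dtype_before = toy["localhour"].dtype   # a text dtype, no time-aware operations yet

toy["localhour"] = pd.to_datetime(toy["localhour"])
dtype_after = toy["localhour"].dtype    # a datetime64 dtype

# Once the column is datetime64, the .dt accessor exposes calendar fields
hours = toy["localhour"].dt.hour.tolist()
print(dtype_before, dtype_after, hours)
```

The same ``.dt`` accessor is what makes it possible to derive features such as the hour or the month later on.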

**3. Is any data missing?** We count the empty cells in each attribute. In this case, no data is missing.

In [None]:
# Check for missing data
dataset.isna().sum()

**4. Statistical Summary of the data.**

In [None]:
dataset.describe()

## Visualize the data.

A visual way to understand the input data. 

1. Boxplots and Density plots
2. Correlation matrix

### Boxplots

The boxplot lets us identify outliers and compare distributions. The box spans the interquartile range, so it shows where the middle 50% of the values lie.

In [None]:
atributos_boxplot = dataset.plot(kind='box', subplots=True, layout=(3, 3), figsize=(15, 10), sharex=False,
                                 sharey=False, fontsize=10)
plt.show()
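
To make the boxplot reading concrete, the quantities it draws can be computed by hand. The sketch below uses a made-up series and assumes the usual 1.5 × IQR whisker rule (the Matplotlib default); points beyond the fences are the ones drawn as outliers.

```python
import pandas as pd

# Hypothetical readings with one obvious outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box: middle 50% of the values

# Fences beyond which a point is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
```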

### **Correlation matrix** 

Why the correlation matrix is useful in Machine learning:

* **Feature selection:** It shows how strongly each input feature is correlated with the target variable.
    * Features with very low correlation (≈ 0) might add little predictive value, although Pearson correlation only captures linear relationships.
    * Highly correlated features with the target are often more relevant.
* **Detect multicollinearity**: It helps identify features that are strongly correlated with each other.
    * Using highly correlated predictors can make models (especially linear ones) unstable or redundant.
* **Improve interpretability and model performance**. By removing or combining correlated features, you can make the model simpler, faster, and less prone to overfitting.


In [None]:
# Seaborn visualization library
import seaborn as sns

# Calculation of correlation coefficients
# Pearson for linear correlation
# Spearman for monotonic correlation
# https://duchesnay.github.io/pystatsml/auto_gallery/ml_resampling.html

# numeric_only=True skips non-numeric columns such as the datetime 'localhour'
corr = dataset.corr(method='pearson', numeric_only=True)

# Mask the upper triangle (and the diagonal) to hide repeated values
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(12, 10))
# Generate the heat map
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", ax=ax)
plt.show()
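
The two uses of the correlation matrix listed above (ranking features against the target and spotting multicollinearity) can also be read off numerically. The following is only a sketch on synthetic data: the columns ``radiation_copy`` and ``noise`` and the 0.9 threshold are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
radiation = rng.uniform(0, 650, n)
toy = pd.DataFrame({
    "radiation": radiation,
    "radiation_copy": radiation + rng.normal(0, 5, n),  # nearly duplicate feature
    "noise": rng.normal(0, 1, n),                       # unrelated feature
})
toy["pvgen"] = 0.01 * radiation + rng.normal(0, 0.3, n)  # synthetic target

corr = toy.corr(method="pearson")

# 1) Rank features by absolute correlation with the target
target_corr = corr["pvgen"].drop("pvgen").abs().sort_values(ascending=False)

# 2) Flag feature pairs that are strongly correlated with each other
features = corr.drop(index="pvgen", columns="pvgen")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant_pairs = pairs[pairs.abs() > 0.9]  # 0.9 is an arbitrary threshold

print(target_corr)
print(redundant_pairs)
```

In this toy case ``noise`` ranks last against the target, and ``radiation`` / ``radiation_copy`` are flagged as a redundant pair, so one of them could be dropped.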

# 2. Prepare the data.



### Feature selection

We select, add, or remove features in an iterative process of training the model and testing its performance.


Add ``hour`` and ``month`` columns derived from the datetime column ``localhour``, then drop the original column. 
Scaling is applied later, after the data is split.

In [None]:
# Add month and hour columns derived from localhour
dataset['month'] = pd.DatetimeIndex(dataset['localhour']).month
dataset['hour'] = pd.DatetimeIndex(dataset['localhour']).hour
dataset.drop(['localhour'], axis=1, inplace=True)
dataset

Divide the data into **features** (X) and **labels** (y, the target).

In [None]:
# Features X ; Target y 
X = dataset.drop(['pvgen'], axis=1) 
y = dataset['pvgen']

In [None]:
X

In [None]:
y

### Impute missing data

First, let's check if there is missing data we should handle

In [None]:
print("Missing values in X:")
print(X.isnull().sum())

print("\nMissing values in y:")
print(y.isnull().sum())

In [None]:
# --- Interpolate missing values ---
# For the feature matrix X
X_interpolated = X.interpolate(method='linear')


In [None]:
print("Missing values in X:")
print(X_interpolated.isnull().sum())
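
What linear interpolation does is easiest to see on a short series. This sketch uses made-up values; note that, with the default settings, a gap at the very start of the series is not filled because there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with gaps
temperature = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])
filled = temperature.interpolate(method="linear")
print(filled.tolist())  # gaps are filled on the straight line between neighbours

# A leading gap stays missing with the default (forward) direction
leading_gap = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate(method="linear")
print(leading_gap.tolist())
```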

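# 3. Data separation: Split the data into train and test sets.

The data are divided into training data ``X_train``, ``y_train`` and test data ``X_test``, ``y_test``. In this simpler approach no separate validation set (``X_val``, ``y_val``) is created. Because ``shuffle=False``, the row order is preserved: the last 20% of the rows form the test set.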


In [None]:
from sklearn.model_selection import train_test_split

test_size = 0.2  # fraction of the input data held out to test the model

# Divide the data into training and test data, keeping the original row order.
X_train, X_test, y_train, y_test = train_test_split(X_interpolated, y, test_size=test_size, shuffle=False)
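
If a separate validation set is wanted, one possible approach is to call ``train_test_split`` twice. The sketch below runs on made-up data, and the 60/20/20 proportions and the names ``X_tr``, ``X_va``, ``X_te`` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: 100 consecutive samples
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(np.arange(100) * 0.1)

# First hold out the last 20% of the rows as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False)

# Then take the last 25% of the remainder (20% of the total) as validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=False)

print(len(X_tr), len(X_va), len(X_te))
```

With ``shuffle=False`` the three blocks stay in chronological order, so the model is always evaluated on rows that come after the ones it was trained on.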


In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

### Scale the data

Different scalers handle data in different ways, depending on the algorithm and the nature of the data:

* **StandardScaler (Z-score normalization)**   ``from sklearn.preprocessing import StandardScaler``
    * Centers each feature at mean 0 and scales it to standard deviation 1.
    * Keeps outliers but rescales the overall distribution.
* **MinMaxScaler (Normalization to a range)**  ``from sklearn.preprocessing import MinMaxScaler``
    * Rescales features to a fixed range (by default [0, 1]).
    * Preserves shape of original distribution, but sensitive to outliers.
* **RobustScaler (less sensitive to outliers)**  ``from sklearn.preprocessing import RobustScaler``
    * Uses median and interquartile range (IQR) instead of mean and std.
    * Great for datasets with outliers or skewed distributions.
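
The differences between the three scalers show up clearly on a single feature with one outlier. This is a small sketch on made-up numbers, not on the PV dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One hypothetical feature with a single large outlier (last value)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_std = StandardScaler().fit_transform(x)     # mean 0, standard deviation 1
x_minmax = MinMaxScaler().fit_transform(x)    # squeezed into [0, 1]
x_robust = RobustScaler().fit_transform(x)    # centered on the median, scaled by the IQR

print(x_std.ravel().round(2))
print(x_minmax.ravel().round(2))
print(x_robust.ravel().round(2))
```

With ``MinMaxScaler`` the outlier takes the value 1 and pushes the other points close to 0, whereas ``RobustScaler`` keeps the regular points evenly spread and leaves the outlier far away.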



For this scenario, the **data is scaled** with ``MinMaxScaler``, which scales and translates each attribute individually so that it lies within the range [0, 1] on the training data. This is needed when the attributes have different scales (e.g. radiation [0, 650], wind speed [2, 15]).

In [None]:
from sklearn.preprocessing import MinMaxScaler

# scale attributes/features
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),columns=X_train.columns)
X_train_scaled.head()

#### Apply the scaler to X_test

In [None]:
# Transform  test data using the same scaler

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns)

In [None]:
X_test_scaled
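
The reason for calling ``fit_transform`` on the training data but only ``transform`` on the test data can be illustrated with a few made-up radiation values: the minimum and maximum are learned from the training set alone, so a test value outside that range ends up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [200.0], [650.0]])  # hypothetical training values
test = np.array([[325.0], [700.0]])          # 700 exceeds the training maximum

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns min=0 and max=650 from train
test_scaled = scaler_demo.transform(test)        # reuses those values, no refitting

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting the scaler on the test data as well would leak information about the test set into the preprocessing step.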

# 4 & 6. Model building and final test evaluation of the model.

* The selected evaluation metric is **RMSE**.
* Check all the available evaluation metrics in Scikit Learn https://scikit-learn.org/stable/modules/model_evaluation.html
![available evaluation metrics](./Figures/regression_evaluationmetric.png)

### Important methods:

``fit()`` – Method used for training the ML model
* Purpose: Teaches the model the relationship between the input features (X_train) and the target variable (y_train).
* What it does: Finds model parameters (e.g., coefficients, tree splits, neural network weights, etc. depending on the ML model selected) that minimize prediction error.
* Output: A trained model ready to make predictions.

``predict()`` – Method for making predictions

* Purpose: Uses the trained model to make predictions on new or unseen data.
* What it does: Applies the learned parameters from training to compute output values (y_pred).
* Output: Predicted labels or values (regression outputs).
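
The ``fit()`` / ``predict()`` pattern is the same for every scikit-learn estimator. As a minimal sketch on made-up data (the values and the choice of ``n_neighbors=2`` are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature data following y = 2x
X_fit = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_fit = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

model = KNeighborsRegressor(n_neighbors=2)
model.fit(X_fit, y_fit)  # for KNN, training amounts to storing the samples

# predict() averages the targets of the 2 nearest training points (x=1 and x=2)
y_new = model.predict(np.array([[1.4]]))
print(y_new)
```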

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor


# --- Define and train each model ---
# RANDOM FOREST
rf = RandomForestRegressor()
rf.fit(X_train_scaled, y_train)
y_rf_pred = rf.predict(X_test_scaled)

# KNN
knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train)
y_knn_pred = knn.predict(X_test_scaled)

# NEURAL NETWORK MLP
mlp = MLPRegressor()
mlp.fit(X_train_scaled, y_train)
y_mlp_pred = mlp.predict(X_test_scaled)


### Evaluation metrics 

* The ``evaluate_model()`` function is a helper function used to assess the performance of regression models.
* This produces a clear comparison of the Random Forest, K-Nearest Neighbors, and Neural Network (MLP) models in terms of their predictive accuracy and overall performance.



In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

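def evaluate_model(name, y_true, y_pred):
    """Print MAE, MSE, RMSE and R² for one model's predictions."""
    mse = mean_squared_error(y_true, y_pred)
    print(f"=== {name} ===")
    print(f"MAE:  {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE:  {mse:.4f}")
    print(f"RMSE: {np.sqrt(mse):.4f}")  # RMSE is the square root of MSE
    print(f"R²:   {r2_score(y_true, y_pred):.4f}\n")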

evaluate_model("Random Forest", y_test, y_rf_pred)
evaluate_model("KNN", y_test, y_knn_pred)
evaluate_model("MLP", y_test, y_mlp_pred)
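
To see what these metrics measure, they can be computed by hand and compared with scikit-learn. The sketch below uses four made-up values in which only the last prediction is wrong.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.0, 1.0, 2.0, 5.0])  # last prediction is off by 2

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error, penalizes large errors more
rmse = np.sqrt(mse)             # same units as the target (e.g. kWh)
# R²: share of the target's variance explained by the predictions
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

Because RMSE is expressed in the same units as the target, it is often easier to interpret than MSE.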

Each model has been trained and evaluated; next, the results are compared visually.

### Now the models are trained, congrats! Let's plot their predictions and see whether the outcomes are good or not

## Plot the results obtained.

In [None]:
# Plot y_predict vs y_test

x = range(len(y_rf_pred))
plt.figure(figsize=(20,5))
plt.xlabel('Time', size=15)
plt.ylabel('Energy produced (kWh)', size=15)

plt.plot(x, y_rf_pred, alpha=0.4, color='blue', label='RF PV predict')
plt.plot(x, y_knn_pred, alpha=0.4, color='green', label='KNN PV predict')
plt.plot(x, y_mlp_pred, alpha=0.4, color='black', label='MLP PV predict')

plt.plot(x, y_test, alpha=0.4, color='red',  label='PV real')
plt.title('Prediction vs Real')
plt.legend()
plt.show()

### We need to Zoom in!

If necessary, install the Plotly library ``!pip install plotly``.


In [None]:
import plotly.graph_objects as go  # Import the Plotly library

init = list(range(len(y_rf_pred)))

y_predict_rf = pd.DataFrame(data=y_rf_pred, index=init, columns=['RF PV predict'])
y_predict_knn = pd.DataFrame(data=y_knn_pred, index=init, columns=['KNN PV predict'])
y_predict_mlp = pd.DataFrame(data=y_mlp_pred, index=init, columns=['MLP PV predict'])

# Reindex y_test so it starts from 0, ensuring y_predict and y_test share the same index for plotting
y_test_plot = pd.DataFrame(data=y_test.values, index=init, columns=['test'])

# Create the figure
fig = go.Figure()

fig.add_trace(go.Scatter(x=init, y=y_predict_rf['RF PV predict'][init],
                    mode='lines',
                    name='RF PV prediction'))
fig.add_trace(go.Scatter(x=init, y=y_predict_knn['KNN PV predict'][init],
                    mode='lines',
                    name='KNN PV prediction'))
fig.add_trace(go.Scatter(x=init, y=y_predict_mlp['MLP PV predict'][init],
                    mode='lines',
                    name='MLP PV prediction'))

fig.add_trace(go.Scatter(
    x=init,
    y=y_test_plot['test'][init],
    mode='lines',
    name='PV real',
    line=dict(color='black', width=0.7) 
))


# Edit the figure layout
fig.update_layout(autosize=False,
                  width=1000,
                  height=500,
                  title='Prediction vs Real',
                  xaxis_title='Periods',
                  yaxis_title='Energy (kWh)')


fig.show()