# Prediction of solar radiation data given weather conditions

This notebook presents a data analysis of a four-month dataset collected at the HI-SEAS weather station (Hawaii). The sampling interval is around 5 minutes, and the collected variables are the following:
* Solar radiation [W/m^2]
* Temperature [F]
* Atmospheric pressure [Hg]
* Humidity [%]
* Wind speed [miles/h]
* Wind direction [degrees]

The objective of the analysis is to build an ML model that predicts the solar radiation as a function of the available features.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#importing libraries
import pandas as pd
from datetime import datetime
from dateutil.tz import *
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV


In [None]:
#Importing the dataset
data = pd.read_csv('/kaggle/input/SolarEnergy/SolarPrediction.csv')

# Data cleaning and preparation

In this section, the dataset is analysed to identify whether there are missing values and whether all the data is identified by the correct data-type. After these evaluations are completed, more sophisticated analyses can be carried out on the dataset.

In [None]:
#Checking which data is available in the dataset and which data-type is associated to each column of the dataset
data.info()

In [None]:
#Checking if there are missing values
data.isnull().sum()

A preliminary analysis of the dataset indicates that there are no missing values, and therefore no strategy for handling missing records is needed.

It is possible to notice, however, that some features are not represented by the right class: the various dates are included as "objects", and hence they should be converted to "datetime" objects to facilitate their handling during the study.

As a first step, the UNIXTime is converted into a datetime object, and the right timezone is allocated to this feature (when converting UNIX time to datetime, the UTC timezone is assumed, and there is therefore the need to update this info, because the data is collected under the HST timezone).

In [None]:
#Converting UNIX time to datetime object
data['Date']= pd.to_datetime(data['UNIXTime'],unit='s')

#Setting the right timezone to the datetime object
data['Date'] = data['Date'].dt.tz_localize('UTC').dt.tz_convert('HST')
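As a quick illustration of why the timezone step matters, the following minimal sketch converts a single UNIX timestamp to UTC and then to Hawaii time using only the standard library. The timestamp is an example value, and the fixed UTC-10 offset is an assumption that holds because Hawaii does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Example timestamp in seconds since the epoch (illustrative value)
unix_seconds = 1475229326

# Hawaii Standard Time as a fixed offset (UTC-10, no daylight saving)
HST = timezone(timedelta(hours=-10), name='HST')

# A UNIX timestamp always refers to UTC, so it is interpreted as UTC first
as_utc = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# Converting to HST changes the wall-clock reading, not the instant in time
as_hst = as_utc.astimezone(HST)

print('UTC:', as_utc.isoformat())
print('HST:', as_hst.isoformat())
```

Both objects describe the same instant; only the displayed local time differs, which is what the `tz_localize('UTC')` and `tz_convert('HST')` calls achieve for the whole column.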

As a second step, the TimeSunRise and TimeSunSet columns are adjusted. These columns only contain the time of sunrise and sunset, while it would be beneficial for them to contain both the date and the time.

In order to update these columns, it is possible to proceed as follows:
1. Convert the "TimeSunRise" and "TimeSunSet" columns into datetime.time objects
2. Convert the "Data" column into a datetime.date object
3. Combine the datetime.date and datetime.time objects through the function datetime.combine()

After doing so, as was done previously, the correct timezone is allocated to the data.

In [None]:
#Extracting date from Data column
data['Data'] = pd.to_datetime(data['Data']).dt.date

#Converting Sunrise and Sunset columns into datetime.time objects
data['TimeSunRise'] = pd.to_datetime(data['TimeSunRise']).dt.time
data['TimeSunSet'] = pd.to_datetime(data['TimeSunSet']).dt.time

#Creating new sunset/sunrise columns featuring also the right date
#(datetime.combine is used because the pd.datetime alias was removed in pandas 2.0)
data['sunrise_time'] = data.apply(lambda row: datetime.combine(row['Data'], row['TimeSunRise']), axis = 1)
data['sunset_time'] = data.apply(lambda row: datetime.combine(row['Data'], row['TimeSunSet']), axis = 1)

#Adding appropriate timezone
data['sunrise_time'] = data['sunrise_time'].dt.tz_localize('HST')
data['sunset_time'] = data['sunset_time'].dt.tz_localize('HST')

Now that the date columns have been correctly identified as datetime objects, it is suitable to set the newly created "Date" column as index, sort the data by the index (ascending order), and drop the non-required columns.

In [None]:
#Setting 'Date' as index
data.set_index('Date', inplace = True)

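#Sorting by the index (sort_index returns a new DataFrame unless inplace = True)
data.sort_index(inplace = True)

#Dropping the columns that are no longer required
data.drop(columns = ['Data', 'Time', 'TimeSunRise',
                    'TimeSunSet'], inplace = True)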

In [None]:
#Inspecting the first rows of the dataset
data.head()

If the data handling has been carried out correctly, then it would be reasonable to expect that the solar radiation, for any considered day, would be approximately zero before the sunrise time, and after the sunset time. It is possible to check this by means of a graphical inspection.

In [None]:
data_one_day = data.loc['2016-09-29':'2016-09-30',:]

plt.figure(figsize = (12,3))
plt.plot(data_one_day.Radiation, 'o', markerfacecolor = 'w')

#Plotting vertical line at sunrise
plt.axvline(data_one_day.sunrise_time.iloc[0], label = 'Sunrise time', color = 'blue')

#Plotting vertical line at sunset
plt.axvline(data_one_day.sunset_time.iloc[0], label = 'Sunset time', color = 'red') 

#Adjusting timezone of x-axis
plt.gca().xaxis_date('HST')

plt.legend()
plt.show()

The plot suggests that the various dates have been correctly manipulated. It is now possible to proceed with the preliminary data analysis of the dataset.

# Preliminary Data Analysis

The objective of the preliminary data analysis is to get a sense of what the data looks like, and to confirm whether the data actually makes sense (e.g. negative values for the solar radiation would be questionable).

The first step of the preliminary data analysis is therefore to check the ranges of the various features of the dataset, and to cross-check whether these ranges are reasonable.

In [None]:
#Analysing the ranges of the various features of the dataset
data.describe()

The ranges identified here look reasonable. In particular:
* The solar radiation assumes only positive values, and has a maximum value of 1600 W/m^2 (this is of the same order as the solar constant, i.e. the mean solar irradiance at the top of the atmosphere, which is around 1361 W/m^2);
* The temperature ranges from 30.4 F to 71 F (this corresponds to a range from about -1 C to 22 C);
* The pressure varies very little, and in any case has a value of around 1 bar;
* The humidity has values over 100 %, but only very slightly. This can be considered acceptable for the scope of this work;
* Wind direction is correctly in the range from 0 to 360 degrees. Notice that the directions 0 degrees and 360 degrees are the same measurement, hence some data transformation could be required to correctly account for this phenomenon;
* Wind speed is always positive, and its maximum value (40.5 miles/hour or 18 m/s) is reasonable as it corresponds to grade 8 of the Beaufort scale.
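One common way to handle the circular nature of the wind direction (not applied in this notebook) is to replace the angle with its sine and cosine, so that 0 degrees and 360 degrees map to the same point. The sketch below is only an illustration; the helper names `encode_direction` and `angular_gap` are hypothetical.

```python
import math

def encode_direction(degrees):
    """Map an angle in degrees to (sin, cos) coordinates on the unit circle."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)

def angular_gap(a, b):
    """Euclidean distance between two encoded directions (0 = same direction)."""
    return math.dist(encode_direction(a), encode_direction(b))

# 0 and 360 degrees are numerically far apart but physically identical
print('Gap between 0 and 360 degrees  :', angular_gap(0.0, 360.0))
# 359 and 1 degrees are only two degrees apart on the circle
print('Gap between 359 and 1 degrees  :', angular_gap(359.0, 1.0))
# Opposite directions are as far apart as the encoding allows
print('Gap between 0 and 180 degrees  :', angular_gap(0.0, 180.0))
```

In a DataFrame the same idea would amount to adding two columns computed with `np.sin` and `np.cos` of the direction in radians.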

As a second step of the preliminary data analysis, it is reasonable to check the distribution of the data, in order to understand how the values are spread between the lower and upper limits. This can be carried out by plotting either a distribution plot or a boxplot.

Both are plotted in this case, as together they give a more comprehensive understanding of the data.

In [None]:
#Features to plot, together with their axis labels
features = [('Radiation', 'Solar radiation [W/m^2]'),
            ('Temperature', 'Temperature [F]'),
            ('Pressure', 'Pressure [Hg]'),
            ('Humidity', 'Humidity [%]'),
            ('Speed', 'Wind speed [miles/h]'),
            ('WindDirection(Degrees)', 'Wind direction [Degrees]')]

fig, ax = plt.subplots(nrows = 2, ncols = 6, figsize = (25, 10))

for col, (name, label) in enumerate(features):
    #Top row: histogram with kernel density estimate
    #(sns.histplot replaces the deprecated sns.distplot)
    sns.histplot(x = data[name], kde = True, stat = 'density', ax = ax[0, col])
    ax[0, col].set_xlabel(label, fontsize = 14)

    #Bottom row: horizontal boxplot of the same feature
    sns.boxplot(x = data[name], ax = ax[1, col])
    ax[1, col].set_xlabel(label, fontsize = 14)

fig.suptitle('Distribution and box plot of the various features', fontsize = 22)
fig.tight_layout()
fig.subplots_adjust(top=0.88)

plt.show()

Looking at the distribution of the data it is possible to conclude that most features have a skewed distribution, except for the wind direction, which is characterized by three peaks.

As could be expected, roughly 50 % of the values of the solar radiation are located in the range between 0 W/m^2 and 250 W/m^2 (there is little or no solar radiation at night). With respect to the wind speed, it seems that the high wind speeds are extreme outliers in a distribution that has most of its values in the range between 0 miles/h and 20 miles/h.

As a last step in the preliminary data analysis, it makes good sense to plot the data for a limited range of time. In this case, a five-day period is selected.

Plotting the data gives an understanding of the variability of the data during the day, which could provide some interesting insights to be accounted for when building an ML model to predict the solar radiation.

In addition to the raw data, the hourly median of the data is also represented in the following plots. This allows for an easier identification of potential patterns. The median is selected over the mean because it is less affected by the presence of potential outliers.

In [None]:
#Creation of the median dataset
data_median = data.resample('H').median().dropna()
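The robustness of the median with respect to outliers, mentioned above as the reason for preferring it over the mean, can be illustrated with a handful of made-up readings. The values below are hypothetical wind speeds within one hour and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical wind-speed readings within one hour [miles/h];
# the last value plays the role of an extreme outlier
readings = [5.6, 6.7, 5.6, 4.5, 6.7, 5.6, 40.5]

hourly_mean = mean(readings)
hourly_median = median(readings)

# The single outlier pulls the mean upwards, while the median is unaffected
print('Hourly mean  : {:.2f}'.format(hourly_mean))
print('Hourly median: {:.2f}'.format(hourly_median))
```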

In [None]:
#Extraction of the data for a five-day period
data_5 = data.loc['2016-10-03':'2016-10-08',:]
data_5_median = data_median.loc['2016-10-03':'2016-10-08',:]


fig, ax = plt.subplots(nrows =6, ncols = 1, figsize = (23,25))

ax[0].plot(data_5.Radiation,'o', markerfacecolor='w')
ax[0].plot(data_5_median.Radiation, linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[0].set_ylabel('Radiation [W/m^2]', fontsize = 14)
ax[0].legend(fontsize = 14)

ax[1].plot(data_5.Temperature,'o', markerfacecolor='w')
ax[1].plot(data_5_median.Temperature, linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[1].set_ylabel('Temperature [F]', fontsize = 14)
ax[1].legend(fontsize = 14)

ax[2].plot(data_5.Pressure,'o', markerfacecolor='w')
ax[2].plot(data_5_median.Pressure, linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[2].set_ylabel('Pressure [Hg]', fontsize = 14)
ax[2].legend(fontsize = 14)

ax[3].plot(data_5.Humidity,'o', markerfacecolor='w')
ax[3].plot(data_5_median.Humidity, linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[3].set_ylabel('Humidity [%]', fontsize = 14)
ax[3].legend(fontsize = 14)

ax[4].plot(data_5.Speed,'o', markerfacecolor='w')
ax[4].plot(data_5_median.Speed, linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[4].set_ylabel('Wind Speed [miles/h]', fontsize = 14)
ax[4].legend(fontsize = 14)

ax[5].plot(data_5['WindDirection(Degrees)'],'o', markerfacecolor='w')
ax[5].plot(data_5_median['WindDirection(Degrees)'], linewidth = 1.5, color = 'red', label = 'Hourly median')
ax[5].set_ylabel('Wind direction [degrees]', fontsize = 14)
ax[5].legend(fontsize = 14)

fig.suptitle('Trend of the various parameters over a five-day period', fontsize = 22)
fig.tight_layout(rect=[0, 0.03, 1, 0.97])

plt.show()

Looking at the plots it is possible to deduce the following:
1. The data for the temperature, humidity, and wind speed seems to assume only discrete values. This could be connected with the resolution of the sensors used in the measurement campaign;
2. The pressure data seems to follow a clear pattern in which high and low pressure values alternate;
3. The wind speed data is extremely volatile. The high volatility could make this feature a less "certain" one when carrying out the regression analysis;
4. As expected, solar radiation is constant at zero during the night, but high variability is experienced during the daylight hours;
5. The wind direction data is volatile, but clear trends can be identified. Sometimes the jump of the measurements between 0 degrees and 360 degrees creates the impression of a "change" of the wind direction, which in practice is not there.

# Feature engineering and correlation analysis

After carrying out a preliminary data analysis of the dataset, it is time to define which features will be used when building the ML regression model, and to carry out correlation analyses aimed at identifying whether there are clear patterns (linear or non-linear) between the variable to be predicted (the solar radiation) and the features.

As a first step, it is reasonable to consider that all the information contained in the dataset is useful for the prediction of the target variable. Therefore, the following features will be considered:
1. Temperature
2. Pressure
3. Humidity
4. Wind speed
5. Wind direction

On top of that, it is also important to include features giving an indication of the time and date, because the solar radiation changes according to the solar position in the sky, and to the duration of the solar day. For this reason, two new features are included in the dataset: the relative time of the day (Rel_time), and the duration of the solar day (Hours_of_light).

The relative time of the day is defined as follows:
(current time - sunrise_time)/(sunset_time - sunrise_time)

It assumes the following values:
* below 0 before sunrise
* equal to 0 at sunrise
* between 0 and 1 between sunrise and sunset
* equal to 1 at sunset
* above 1 after sunset

The duration of the solar day is instead computed by subtracting the sunrise time from the sunset time.

All the calculations are carried out by referring to the UNIX time, therefore the duration of the day is later divided by 3600 to have a value in hours.

In [None]:
#Converting sunrise and sunset times into timestamp
data['sunrise_timestamp'] = data.apply(lambda row: datetime.timestamp(row['sunrise_time']), axis = 1)
data['sunset_timestamp'] = data.apply(lambda row: datetime.timestamp(row['sunset_time']), axis = 1)

#Creating a column containing the number of daily light hours
data['Hours_of_light'] = (data['sunset_timestamp'] - data['sunrise_timestamp'])/60/60

#Creating column describing current time relative to sunrise/sunset
data['Rel_time'] = (data['UNIXTime']- data['sunrise_timestamp'])/(data['sunset_timestamp']-data['sunrise_timestamp'])
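The definition of Rel_time can be checked on a few hand-picked times. The sketch below is a minimal illustration that assumes a hypothetical day with sunrise at 06:00 and sunset at 18:00; the function name `relative_time` is not part of the notebook.

```python
from datetime import datetime

def relative_time(current, sunrise, sunset):
    """Return (current - sunrise) / (sunset - sunrise) as a float."""
    return (current - sunrise) / (sunset - sunrise)

# Hypothetical sunrise and sunset times (round values chosen for readability)
sunrise = datetime(2016, 10, 3, 6, 0)
sunset = datetime(2016, 10, 3, 18, 0)

# Duration of the solar day in hours
hours_of_light = (sunset - sunrise).total_seconds() / 3600

samples = {
    'before sunrise (03:00)': datetime(2016, 10, 3, 3, 0),
    'at sunrise (06:00)': sunrise,
    'midday (12:00)': datetime(2016, 10, 3, 12, 0),
    'at sunset (18:00)': sunset,
    'after sunset (21:00)': datetime(2016, 10, 3, 21, 0),
}

print('Hours of light:', hours_of_light)
for label, current in samples.items():
    print(label, '->', relative_time(current, sunrise, sunset))
```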

In [None]:
#Removing non-necessary columns
data.drop(columns = ['UNIXTime','sunrise_timestamp', 'sunset_timestamp', 
                     'sunset_time', 'sunrise_time'], inplace = True)

Now that all the features have been defined and included in the dataset, it is time to identify whether there are clear patterns between the features and the target variable. One way to do so is to plot the correlation matrix, which displays the linear (Pearson) correlation between each pair of variables.

A value close to 1 indicates a strong positive linear correlation, while a value close to -1 indicates a strong negative linear correlation.

Values around zero indicate that there is no linear correlation, but do not exclude other kinds of relationship (e.g. exponential or quadratic).
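The last point can be made concrete with synthetic data: a perfectly quadratic relationship on a symmetric interval yields a correlation coefficient of approximately zero even though the two variables are fully dependent. The sketch below computes the Pearson coefficient by hand; the function `pearson` and the data are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Synthetic data on a symmetric interval from -5 to 5
x = [i / 10 for i in range(-50, 51)]
y_linear = [2 * v + 1 for v in x]      # exactly linear in x
y_quadratic = [v ** 2 for v in x]      # fully determined by x, but not linearly

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)

print('Linear relationship   : r = {:.3f}'.format(r_linear))
print('Quadratic relationship: r = {:.3f}'.format(r_quadratic))
```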

In [None]:
#Plotting a heatmap of the various features in the dataset
fig, ax = plt.subplots(figsize = (6,6))
sns.heatmap(data.corr(), annot = True, cmap = 'YlGnBu')
fig.suptitle('Correlation matrix', fontsize = 16)
plt.show()

The correlation matrix indicates a positive linear correlation between the ambient temperature and the solar radiation (coefficient = 0.73). No clear linear correlation appears for the other features, and the second highest correlation value is identified for the humidity (yet it is only of -0.23).

As a second step in the correlation analysis, it is possible to draw scatter plots showing the values of the target variable (solar radiation) as a function of each feature. This makes it possible to identify potential non-linear trends.

In [None]:
fig, ax = plt.subplots(nrows =2, ncols = 4, figsize = (23,8))

ax[0,0].plot(data.Temperature, data.Radiation,'o', markerfacecolor='w')
ax[0,0].set_xlabel('Temperature [F]', fontsize = 14)
ax[0,0].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[0,1].plot(data.Pressure, data.Radiation,'o', markerfacecolor='w')
ax[0,1].set_xlabel('Pressure [Hg]', fontsize = 14)
ax[0,1].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[0,2].plot(data.Humidity, data.Radiation,'o', markerfacecolor='w')
ax[0,2].set_xlabel('Humidity [%]', fontsize = 14)
ax[0,2].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[0,3].plot(data.Hours_of_light, data.Radiation,'o', markerfacecolor='w')
ax[0,3].set_xlabel('Hours of light [h]', fontsize = 14)
ax[0,3].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[1,0].plot(data.Rel_time, data.Radiation,'o', markerfacecolor='w')
ax[1,0].set_xlabel('Rel_time', fontsize = 14)
ax[1,0].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[1,1].plot(data.Speed, data.Radiation,'o', markerfacecolor='w')
ax[1,1].set_xlabel('Wind speed [miles/h]', fontsize = 14)
ax[1,1].set_ylabel('Radiation [W/m^2]', fontsize = 14)

ax[1,2].plot(data['WindDirection(Degrees)'], data.Radiation,'o', markerfacecolor='w')
ax[1,2].set_xlabel('Wind direction [degrees]', fontsize = 14)
ax[1,2].set_ylabel('Radiation [W/m^2]', fontsize = 14)

fig.delaxes(ax[1,3])

fig.suptitle('Scatter plots of the solar radiation as a function of the various features', fontsize = 22)
fig.tight_layout()
fig.subplots_adjust(top=0.88)

plt.show()

The scatter plots suggest the following:
* A linear correlation between solar radiation and ambient temperature is confirmed;
* It seems that the highest values of the solar radiation occur when the ambient pressure is the highest;
* The Rel_time feature is correctly implemented: the solar radiation is > 0 between sunrise (Rel_time=0) and sunset (Rel_time=1);
* It seems that the maximum observed solar radiation decreases for high wind speeds.

# Prediction of the solar radiation with a linear model

The preliminary investigations indicated that the solar radiation is related to the various features in a non-linear way (except for the ambient temperature). It is however interesting to check how a linear model performs in this setting. In addition, the performance of the linear model can be used as a benchmark when considering more complex ML models.

In order to train a linear ML model, the dataset is divided into features (X) and the variable to be predicted (y = Solar radiation) and then split into train and test sets. Splitting the data into train and test sets makes it possible to check the accuracy of the ML model when predicting previously unseen data.

In [None]:
#Renaming dataset
df = data

#Splitting dataset into labels and features
X = df.drop(columns = 'Radiation')
y = df.Radiation

In [None]:
#Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.3,
                                                   random_state = 42)

In [None]:
#Initiating linear regression model
lr = LinearRegression()

#Training the linear regression model
lr.fit(X_train, y_train)

#Carrying out prediction with the linear model
lr_predict_train = lr.predict(X_train)
lr_predict_test = lr.predict(X_test)

In [None]:
#Checking the performance of the linear model

#Coefficient of determination (R^2)
print('Linear model, R^2 training set:{:.2f}'.format(r2_score(y_train, lr_predict_train)))
print('Linear model, R^2 test set:{:.2f}'.format(r2_score(y_test,lr_predict_test)))

#Mean squared error (MSE)
print('Linear model, MSE training set:{:.2f}'.format(MSE(y_train, lr_predict_train)))
print('Linear model, MSE test set:{:.2f}'.format(MSE(y_test,lr_predict_test)))

The results indicate a relatively poor fit (R squared of around 0.6). In addition, the model has similar scores for the train and test sets. This indicates that the model does not overfit the training data, and that its performance on the test set is comparable to the one on the training set.
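For reference, the two metrics used above can be written out explicitly. The sketch below implements them by hand on a few hypothetical radiation values (not results from the model), assuming the usual definitions: MSE as the mean of the squared residuals, and R^2 as one minus the ratio between the residual sum of squares and the total sum of squares.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted radiation values [W/m^2]
y_true = [0.0, 100.0, 400.0, 900.0]
y_pred = [50.0, 150.0, 350.0, 850.0]

# A model that always predicts the mean of the observations has R^2 = 0
y_mean = [sum(y_true) / len(y_true)] * len(y_true)

print('MSE: {:.2f}'.format(mse(y_true, y_pred)))
print('R^2: {:.3f}'.format(r2(y_true, y_pred)))
print('R^2 of the mean predictor: {:.3f}'.format(r2(y_true, y_mean)))
```

Note that the MSE is expressed in squared units of the target, so its magnitude depends on the scale of the solar radiation, while R^2 is dimensionless.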

As a way to have a visual understanding of the performance of the model, it is possible to plot the predicted and real solar radiation data over a five day-period.

In [None]:
#Carrying out predictions with linear model for the 5-day period
X_five = X.loc['2016-10-03':'2016-10-08',:]
y_five_lr = lr.predict(X_five)


fig, ax = plt.subplots(figsize = (23,5))

ax.plot(data_5.Radiation,'o', markerfacecolor='w')
ax.plot(data_5.index, y_five_lr, linewidth = 1.5, color = 'red', label = 'Linear model prediction')
ax.set_ylabel('Radiation [W/m^2]', fontsize = 14)
ax.legend(fontsize = 14)

plt.show()

The visual representation indicates that the linear model is capable of predicting where the peak solar radiation is located, and of matching (in most cases) the maximum value of the solar radiation. However, it seems that it is not capable of correctly predicting the periods where the solar radiation is zero.

This suggests that more complex models are required in order to capture the relationship between the solar radiation and the selected features.

# Prediction of the solar radiation with tree-based models

As a more complex class of models to predict the solar radiation, tree-based models are selected. This class of models is characterized by two main advantages:
1. They are capable of capturing non-linear relationships between features and labels
2. They do not require feature scaling (e.g. standardization or normalization)

Two models are selected here for prediction: Random Forest and Gradient Boosting. Both models are ensemble methods. This means that they are based on the combination of a set of simple models, which together lead to a more robust and accurate prediction model.

Given that this set of models is more complex than the simple linear regression, some parameters (hyper-parameters) have to be specified by the user before the model can be trained. The proper selection of these parameters is of the utmost importance in order to attain a suitable model. This selection process is generally called "hyper-parameter tuning" and it is carried out here by means of a grid-search approach: a list of potential values for the various hyper-parameters is defined, and then multiple models are trained as a way to identify the best performing set of hyper-parameters.
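The second point can be illustrated on synthetic data: a decision tree splits on the ordering of feature values, so standardizing the features should leave its predictions unchanged. The following is a minimal sketch under that assumption, using a single `DecisionTreeRegressor` and made-up data rather than the models trained in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two features on an arbitrary scale and a non-linear target
rng = np.random.default_rng(42)
X_raw = rng.uniform(0, 100, size=(200, 2))
y_toy = np.sin(X_raw[:, 0] / 10) + 0.01 * X_raw[:, 1]

# Standardize the features (zero mean, unit variance per column)
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Fit identical trees on the raw and on the standardized features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y_toy)

# Compare the predictions of the two trees on their respective inputs
same_predictions = np.allclose(tree_raw.predict(X_raw),
                               tree_scaled.predict(X_scaled))
print('Predictions unchanged by scaling:', same_predictions)
```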

The first model to be considered is Random Forest.

In [None]:
#Initiating Random Forest regressor
rf_model = RandomForestRegressor(random_state = 42)

#Define the grid of hyperparameters
params_rf = {
    'n_estimators': [500, 600, 700],
    'max_depth': [5, 6, 7],
    'min_samples_leaf': [0.075, 0.05, 0.025],
    'max_features': ['log2', 'sqrt']   
}


#Initiate Grid search
grid_rf = GridSearchCV(estimator = rf_model,
                       param_grid = params_rf,
                       cv = 3,
                       scoring = 'neg_mean_squared_error',
                       verbose = 1,
                       n_jobs = -1)
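To get a feeling for the computational cost of the grid search, the number of candidate models can be counted directly. The sketch below repeats the Random Forest grid and enumerates its combinations with `itertools.product`; it is only a counting exercise and does not train anything.

```python
from itertools import product

# Same grid as the one used for the Random Forest regressor
params_rf = {
    'n_estimators': [500, 600, 700],
    'max_depth': [5, 6, 7],
    'min_samples_leaf': [0.075, 0.05, 0.025],
    'max_features': ['log2', 'sqrt']
}

# Every combination of one value per hyper-parameter is a candidate model
combinations = list(product(*params_rf.values()))

# Each candidate is trained once per cross-validation fold (cv = 3)
n_folds = 3
total_fits = len(combinations) * n_folds

print('Candidate combinations:', len(combinations))
print('Model fits during the search:', total_fits)
```

With the default `refit = True`, `GridSearchCV` additionally trains the best candidate once more on the whole training set.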

In [None]:
#Fitting the grid search
grid_rf.fit(X_train, y_train)

In [None]:
#Extracting best hyperparameters
rf_best_hyperparams = grid_rf.best_params_
print('Best hyperparameters for RF: \n', rf_best_hyperparams)

In [None]:
#Extracting best rf model
rf = grid_rf.best_estimator_

Given that Random Forest and Gradient Boosting are complex models, the chances of overfitting the training set are considerable. When a model is overfitting the training set, it fits not only its trends, but also its noise. Therefore, an overfitted model will perform poorly on unseen data.

A way to check whether the model overfits the training set is to carry out a cross-validation procedure. During cross-validation, the training set is split into k folds and the model is trained k times, each time on k-1 folds, with the remaining fold held out to evaluate it. Each trained model will have a different prediction error. The average error over the k held-out folds is called the "cross-validation" error.

Two scenarios are possible:
1. The cross-validation error is considerably greater than the training set error -> in this case the model is overfitting the training set;
2. The cross-validation error is similar to the training set error -> it is possible to assume that the model is not overfitting the training set and will perform similarly on unseen data.
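The procedure described above can be written out as an explicit loop. The sketch below is a simplified illustration on synthetic data with a linear model, not the Random Forest used in this notebook; it shows how the cross-validation error is obtained and compared with the training error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(300, 1))
y_toy = 3 * X_toy[:, 0] + rng.normal(0, 0.1, size=300)

# Train on k-1 folds and evaluate on the held-out fold, k times
fold_errors = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    predictions = model.predict(X_toy[val_idx])
    fold_errors.append(mean_squared_error(y_toy[val_idx], predictions))

# Cross-validation error: average of the held-out errors
cv_mse = float(np.mean(fold_errors))

# Training error: the model is fitted and evaluated on the same data
full_model = LinearRegression().fit(X_toy, y_toy)
train_mse = mean_squared_error(y_toy, full_model.predict(X_toy))

print('CV MSE   : {:.4f}'.format(cv_mse))
print('Train MSE: {:.4f}'.format(train_mse))
```

For a simple model such as this one the two errors are expected to be close, which corresponds to the second scenario.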



In [None]:
#Checking if there is overfitting through the use of Cross validation
rf_MSE_CV = -cross_val_score(rf, X_train, y_train,
                            cv = 10, 
                            scoring = 'neg_mean_squared_error',
                            n_jobs = -1)

In [None]:
#Computing Random Forest predictions on the training and test sets
rf_predict_train = rf.predict(X_train)
rf_predict_test = rf.predict(X_test)


In [None]:
#Computing the MSE on the training set, on the test set, and in the cross-validation procedure
print('CV MSE for RF:{:.2f}'.format(rf_MSE_CV.mean()))
print('Train MSE for RF:{:.2f}'.format(MSE(y_train,rf_predict_train)))
print('Test MSE for RF:{:.2f}'.format(MSE(y_test,rf_predict_test)))

The training MSE and cross-validation MSE are fairly similar and, thus, it is possible to conclude that the model is not overfitting the data.

In [None]:
#Computing the R^2 on the training set and test set for the Random Forest Regressor
print('Random Forest, R^2 score training set:{:.2f}'.format(r2_score(y_train, rf_predict_train)))
print('Random Forest, R^2 score test set:{:.2f}'.format(r2_score(y_test,rf_predict_test)))

The same procedure carried out for the Random Forest regressor is now carried out for the Gradient Boosting regressor.

In [None]:
#Initiating Gradient Boosting regressor
gb_model = GradientBoostingRegressor(random_state = 42)

#Define the grid of hyperparameters
params_gb = {
    'n_estimators': [200, 300, 600],
    'max_depth': [2, 3,5],
    'min_samples_leaf': [0.125, 0.1, 0.075],
    'max_features': ['log2', 'sqrt']   
}


#Initiate Grid search
grid_gb = GridSearchCV(estimator = gb_model,
                       param_grid = params_gb,
                       cv = 3,
                       scoring = 'neg_mean_squared_error',
                       verbose = 1,
                       n_jobs = -1)

In [None]:
#Fitting the grid search
grid_gb.fit(X_train, y_train)

In [None]:
#Extracting best hyperparameters
gb_best_hyperparams = grid_gb.best_params_
print('Best hyperparameters for GB: \n', gb_best_hyperparams)

In [None]:
#Extracting best gb model
gb = grid_gb.best_estimator_

In [None]:
#Checking if there is overfitting through the use of Cross validation
gb_MSE_CV = -cross_val_score(gb, X_train, y_train,
                            cv = 10, 
                            scoring = 'neg_mean_squared_error',
                            n_jobs = -1)

In [None]:
#Computing Gradient Boosting predictions on the training and test sets
gb_predict_train = gb.predict(X_train)
gb_predict_test = gb.predict(X_test)


In [None]:
#Computing the MSE on the training set, on the test set, and in the cross-validation procedure
print('CV MSE for GB:{:.2f}'.format(gb_MSE_CV.mean()))
print('Train MSE for GB:{:.2f}'.format(MSE(y_train,gb_predict_train)))
print('Test MSE for GB:{:.2f}'.format(MSE(y_test,gb_predict_test)))

The training MSE and cross-validation MSE are fairly similar and, thus, it is possible to conclude that the model is not overfitting the data.

In [None]:
#Computing the R^2 on the training set and test set for the Gradient Boosting Regressor
print('Gradient Boosting, R^2 score training set:{:.2f}'.format(r2_score(y_train, gb_predict_train)))
print('Gradient Boosting, R^2 score test set:{:.2f}'.format(r2_score(y_test,gb_predict_test)))

Tree-based models make it possible to extract the importance that the various features have in determining the regression model. Looking at the importance of the various features, it is possible to understand whether the two models allocated the same importance to the various parameters.

In [None]:
#Plotting feature importances for Random Forest and Gradient boosting

#Creating a pd.Series of feature importances
importances_rf = pd.Series(rf.feature_importances_, index = X.columns)
importances_gb = pd.Series(gb.feature_importances_, index = X.columns)

#Sorting importances
sorted_importances_rf = importances_rf.sort_values()
sorted_importances_gb = importances_gb.sort_values()

#Plotting sorted importances
fig, ax = plt.subplots(ncols = 2, figsize = (27,7))
sorted_importances_rf.plot(kind = 'barh', color = 'lightblue', ax = ax[0])
sorted_importances_gb.plot(kind = 'barh', color = 'lightblue', ax = ax[1])
ax[0].set_title('Random Forest Regressor')
ax[1].set_title('Gradient Boosting Regressor')
fig.suptitle('Feature importances in the two ML models', fontsize = 24)
plt.show()

The plots indicate that the RF and GB selected the same order of importance for the available features, and that the most important features are the ambient temperature and the Rel_time (the variable that describes in which phase of the day the considered time-point is).

Lastly, a graphical understanding of how the two models perform is possible by plotting the predicted solar radiation over a five-day period.

In [None]:
#Computing predictions of the two ML models in the 5-day period
y_five_rf = rf.predict(X_five)
y_five_gb = gb.predict(X_five)

fig, ax = plt.subplots(figsize = (23,10))

ax.plot(data_5.Radiation,'o', markerfacecolor='w')
ax.plot(data_5.index, y_five_rf, linewidth = 1.5, color = 'red', label = 'Random Forest prediction')
ax.plot(data_5.index, y_five_gb, linewidth = 1.5, color = 'orange', label = 'Gradient Boosting prediction')

ax.set_ylabel('Radiation [W/m^2]', fontsize = 14)
ax.legend(fontsize = 14)

plt.show()

The graphical inspection indicates that the Gradient Boosting regressor leads to the most accurate predictions. This could be expected from the previously shown performance indicators:
* R^2 -> 0.9 for Gradient Boosting, and 0.77 for Random Forest
* MSE -> 10,000 for Gradient boosting, and 23,000 for Random Forest

Looking at the plots it seems that RF regressor is characterized by the following drawbacks:
1. It is characterized by high variance in the prediction
2. It seems to systematically overestimate the solar radiation after the sunset

Both aspects could be improved by refining the hyper-parameter tuning procedure (not all possible parameters were screened).


# Conclusions and recommendations for further analyses

A dataset containing weather data was analyzed and regression models were built with the objective of predicting the solar radiation given the weather data. The best performing model was a Gradient Boosting regressor (R^2 = 0.89) and the most significant features were found to be the ambient temperature and a parameter indicating the relative time in relation to the sunrise/sunset times.

Suggestions for improving the analysis/predictions are the following:
1. Carry out a more extensive hyper-parameter tuning procedure (only a simple grid search over a few parameters was carried out here);
2. Consider limiting the evaluations to the hours of the day when solar radiation is present (this makes it possible to build a model that does not have to account for the night hours);
3. Include uncertainty bounds in the predicted solar radiation;
4. Consider the use of ML models that account for the fact that the data is available as a time series (e.g. SARIMAX models).
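As an illustration of the second suggestion, the Rel_time feature could be used to keep only the records between sunrise and sunset. The sketch below works on a small made-up table rather than on the actual dataset, and the name `daylight` is hypothetical.

```python
import pandas as pd

# Made-up records: Rel_time below 0 or above 1 corresponds to night hours
toy = pd.DataFrame({
    'Rel_time': [-0.2, 0.0, 0.4, 1.0, 1.3],
    'Radiation': [1.2, 3.0, 650.0, 5.0, 1.1],
})

# Keep only the rows between sunrise (Rel_time = 0) and sunset (Rel_time = 1)
daylight = toy[toy['Rel_time'].between(0, 1)]

print(daylight)
```

A model trained on such a subset would only be valid for daylight hours, so night-time predictions would have to be handled separately (for example by assuming zero radiation).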