# Trees That Determine Solar Radiation

In the late 1700s, it was discovered that trees (plants) use solar radiation (sunlight) to make their food. Ever thought about trees that could determine the amount of solar radiation at any given time?

Yes, I'm talking about decision trees. A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label or, for regression, a predicted value (the decision reached after testing the attributes along the path). The paths from root to leaf represent classification or regression rules.

![img](https://static.javatpoint.com/tutorial/machine-learning/images/decision-tree-classification-algorithm.png)
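
To make the flowchart idea concrete, here is a minimal sketch of a hand-written regression tree in plain Python. The feature names, thresholds and leaf values are invented for illustration; nothing here is learned from the dataset.

```python
# A tiny hand-built regression tree stored as nested dicts.
# Internal nodes test one attribute; leaves hold a predicted value.
# All thresholds and leaf values below are made up for illustration.
toy_tree = {
    "feature": "hour", "threshold": 6.5,
    "left": {"value": 1.0},                 # early morning: almost no radiation
    "right": {
        "feature": "temperature", "threshold": 55.0,
        "left": {"value": 150.0},           # cooler daytime reading
        "right": {"value": 600.0},          # warmer daytime reading
    },
}


def predict(node, sample, path=()):
    """Walk from the root to a leaf, recording the tests taken on the way."""
    if "value" in node:                     # leaf reached
        return node["value"], list(path)
    goes_left = sample[node["feature"]] <= node["threshold"]
    op = "<=" if goes_left else ">"
    step = f'{node["feature"]} {op} {node["threshold"]}'
    child = node["left"] if goes_left else node["right"]
    return predict(child, sample, path + (step,))


value, rule = predict(toy_tree, {"hour": 12, "temperature": 60.0})
print(value, "because", " and ".join(rule))
```

The list of tests collected on the way down is exactly the "rule" that a root-to-leaf path represents.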


## Understanding the Problem and Data

Solar irradiance is the power per unit area received from the Sun in the form of electromagnetic radiation, as reported in the wavelength range of the measuring instrument. It is measured in watts per square metre (W/m<sup>2</sup>) in SI units. Solar irradiance is often integrated over a given time period in order to report the radiant energy emitted into the surrounding environment (joules per square metre, J/m<sup>2</sup>) during that time period. This integrated solar irradiance is called solar irradiation, solar exposure, solar insolation, or insolation.

![img](https://www.newport.com/medias/sys_master/images/images/hef/hb0/8798462345246/LS-158b-400w.gif)
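
The link between irradiance (W/m<sup>2</sup>) and irradiation (J/m<sup>2</sup>) is just an integral over time. The sketch below does that integration with the trapezoidal rule on made-up readings; the 5-minute spacing and the constant 500 W/m<sup>2</sup> are assumptions chosen so the arithmetic is easy to check by hand.

```python
def irradiation_j_per_m2(times_s, irradiance_w_per_m2):
    """Integrate irradiance (W/m^2) over time (s) with the trapezoidal rule.

    The result is in J/m^2, because 1 W = 1 J/s.
    """
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        mean_power = 0.5 * (irradiance_w_per_m2[i] + irradiance_w_per_m2[i - 1])
        total += mean_power * dt
    return total


# Hypothetical readings every 5 minutes for one hour at a constant 500 W/m^2.
times = [i * 300 for i in range(13)]        # 0, 300, ..., 3600 seconds
readings = [500.0] * len(times)

energy = irradiation_j_per_m2(times, readings)
print(energy)            # joules per square metre
print(energy / 3.6e6)    # the same energy in kWh per square metre
```

One hour at 500 W/m<sup>2</sup> gives 500 x 3600 = 1.8 MJ/m<sup>2</sup>, i.e. 0.5 kWh/m<sup>2</sup>.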

The dataset includes observations of:

- Solar Irradiance (W/m<sup>2</sup>)
- Temperature (&deg;F)
- Barometric Pressure (Hg)
- Humidity (%)
- Wind Direction (&deg;)
- Wind Speed (mph)
- Sun Rise/Set Time

It contains measurements for four months (2016-09-01 to 2016-12-31, Pacific/Honolulu time zone). The task is to predict the level of solar radiation.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
sns.set_style('darkgrid')

In [None]:
data = pd.read_csv('../input/SolarEnergy/SolarPrediction.csv')
print(data.shape)
data.head()

In [None]:
data.describe()

**Checking Missing Values**

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sns.heatmap(data.isnull(), cbar=False, yticklabels=False)

## Exploring the Data

**Parsing date time data**

In [None]:
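# 'Data' holds the calendar date; keep only the date part as a 'YYYY-MM-DD' string
data['Date'] = pd.to_datetime(data['Data']).dt.date.astype(str)
# Prefix the sunrise/sunset clock times with the date so they can be parsed as full timestamps later
data['TimeSunRise'] = data['Date'] + ' ' + data['TimeSunRise']
data['TimeSunSet'] = data['Date'] + ' ' + data['TimeSunSet']
# Combine date and clock time into one timestamp string per observation
data['Date'] = data['Date'] + ' ' + data['Time']

# Sort by timestamp, use it as the index, and drop the now-redundant raw columns
data = data.sort_values('Date').reset_index(drop=True)
data.set_index('Date', inplace=True)
data.drop(['Data', 'Time', 'UNIXTime'], axis=1, inplace=True)
data.index = pd.to_datetime(data.index)
data.head()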
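
The cell above works from the local date and time strings and drops `UNIXTime`. As a side note, a Unix timestamp can be mapped to Honolulu local time with the standard library alone. This is a minimal sketch: the fixed UTC-10 offset relies on Hawaii not observing daylight saving time, and the timestamp is an illustrative value rather than a row quoted from the dataset.

```python
from datetime import datetime, timedelta, timezone

# Hawaii Standard Time is UTC-10 all year round (no daylight saving time),
# so a fixed offset is enough for this sketch.
HST = timezone(timedelta(hours=-10), name="HST")


def unix_to_honolulu(unix_seconds):
    """Convert seconds since 1970-01-01 UTC to Honolulu local time."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).astimezone(HST)


# Illustrative timestamp (an assumption, not taken from the dataset).
local = unix_to_honolulu(1475229326)
print(local.strftime("%Y-%m-%d %H:%M:%S"))
```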

In [None]:
data.rename({
    'Radiation': 'Radiation(W/m2)', 'Temperature': 'Temperature(F)', 'Pressure': 'Pressure(mm Hg)', 'Humidity': 'Humidity(%)',
    'Speed': 'Speed(mph)'
}, axis=1, inplace=True)
data.head()

**Radiation as a time series**

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
data['Radiation(W/m2)'].plot(ax=ax, style=['--'], color='red')
ax.set_title('Radiation as a Time Series', fontsize=18)
ax.set_ylabel('W/m2')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
data.groupby(pd.Grouper(freq="D"))['Radiation(W/m2)'].mean().plot(ax=ax, style=['--'], color='red')
ax.set_title('Radiation as a Time Series (Daily)', fontsize=18)
ax.set_ylabel('W/m2')
plt.show()

**Feature Distribution**

In [None]:
for col in ['Radiation(W/m2)','Temperature(F)', 'Pressure(mm Hg)', 'Humidity(%)', 'WindDirection(Degrees)', 'Speed(mph)']:
    fig, ax = plt.subplots(figsize=(20, 3))
    data[col].plot.box(ax=ax, vert=False, color='red')
    ax.set_title(f'{col} Distribution', fontsize=18)
    plt.show()

**Feature Analysis**

In [None]:
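fig = plt.figure()
fig.suptitle('Feature Correlation', fontsize=18)
# numeric_only=True (pandas >= 1.5) skips the sunrise/sunset string columns,
# which recent pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='RdBu', center=0)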
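
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal sketch of that calculation in plain Python, on made-up readings (the numbers are assumptions, not values from the dataset):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up readings: radiation rises with temperature, humidity falls with it.
temperature = [48.0, 52.0, 55.0, 59.0, 63.0]
radiation = [5.0, 120.0, 310.0, 520.0, 800.0]
humidity = [95.0, 90.0, 80.0, 72.0, 60.0]

print(round(pearson(temperature, radiation), 3))   # strongly positive
print(round(pearson(temperature, humidity), 3))    # strongly negative
```

Values near +1 or -1 show up as the saturated cells of the `RdBu` colour map, and values near 0 as the pale ones.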

**Feature Extraction**

In [None]:
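def total_seconds(series):
    """Seconds since midnight for anything exposing .hour, .minute and .second
    (a DatetimeIndex or a Series' .dt accessor)."""
    return series.hour*60*60 + series.minute*60 + series.second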

In [None]:
data['MonthOfYear'] = data.index.strftime('%m').astype(int)
data['DayOfYear'] = data.index.strftime('%j').astype(int)
data['WeekOfYear'] = data.index.strftime('%U').astype(int)
data['TimeOfDay(h)'] = data.index.hour
data['TimeOfDay(m)'] = data.index.hour*60 + data.index.minute
data['TimeOfDay(s)'] = total_seconds(data.index)
data['TimeSunRise'] = pd.to_datetime(data['TimeSunRise'])
data['TimeSunSet'] = pd.to_datetime(data['TimeSunSet'])
data['DayLength(s)'] = total_seconds(data['TimeSunSet'].dt) - total_seconds(data['TimeSunRise'].dt)
data['TimeAfterSunRise(s)'] = total_seconds(data.index) - total_seconds(data['TimeSunRise'].dt)
data['TimeBeforeSunSet(s)'] = total_seconds(data['TimeSunSet'].dt) - total_seconds(data.index)
data['RelativeTOD'] = data['TimeAfterSunRise(s)'] / data['DayLength(s)']
data.drop(['TimeSunRise','TimeSunSet'], inplace=True, axis=1)
data.head()
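
To see what `RelativeTOD` encodes, here is the same arithmetic for single observations, in plain Python. The sunrise, sunset and observation times are hypothetical values picked to give round numbers.

```python
from datetime import datetime


def seconds_since_midnight(ts):
    """Same idea as total_seconds above, for a single datetime."""
    return ts.hour * 3600 + ts.minute * 60 + ts.second


def relative_tod(observed, sunrise, sunset):
    """Fraction of the daylight period elapsed at the time of observation."""
    day_length = seconds_since_midnight(sunset) - seconds_since_midnight(sunrise)
    after_sunrise = seconds_since_midnight(observed) - seconds_since_midnight(sunrise)
    return after_sunrise / day_length


# Hypothetical sunrise and sunset on one day (12 hours of daylight).
sunrise = datetime(2016, 9, 29, 6, 13, 0)
sunset = datetime(2016, 9, 29, 18, 13, 0)

midday = relative_tod(datetime(2016, 9, 29, 12, 13, 0), sunrise, sunset)
before_dawn = relative_tod(datetime(2016, 9, 29, 3, 13, 0), sunrise, sunset)
late_night = relative_tod(datetime(2016, 9, 29, 21, 13, 0), sunrise, sunset)
print(midday, before_dawn, late_night)
```

The feature is 0 at sunrise and 1 at sunset, negative before sunrise and above 1 after sunset, so a tree can separate day from night with a single threshold on it.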

In [None]:
fig, ax = plt.subplots(4, 2, figsize=(20, 20))
for j, timeunit in enumerate(['MonthOfYear', 'TimeOfDay(h)']):
    grouped_data=data.groupby(timeunit).mean().reset_index()
    palette = sns.color_palette("YlOrRd", len(grouped_data))
    for i, col in enumerate(['Radiation(W/m2)', 'Temperature(F)', 'Pressure(mm Hg)', 'Humidity(%)']):
        sns.barplot(data=grouped_data, x=timeunit, y=col, ax=ax[i][j], palette=palette)
        ax[i][j].set_title(f'Mean {col} by {timeunit}', fontsize=12)
        range_values = grouped_data[col].max() - grouped_data[col].min()
        ax[i][j].set_ylim(max(grouped_data[col].min() - range_values, 0), grouped_data[col].max() + 0.25*range_values)

* Solar radiation is positively correlated with temperature
* Atmospheric pressure and humidity are correlated with each other
* As expected, the hourly temperature profile is bell-shaped, peaking around 12 noon
* Temperature and solar radiation decrease slightly as winter arrives

In [None]:
fig = plt.figure(figsize=(20, 12))
fig.suptitle('Feature Correlation', fontsize=18)
sns.heatmap(data.corr(), annot=True, cmap='RdBu', center=0)

## Modelling

In [None]:
feats = [
    'Temperature(F)', 'Pressure(mm Hg)', 'Humidity(%)', 'WindDirection(Degrees)', 'Speed(mph)', 
    'MonthOfYear','DayOfYear', 'RelativeTOD',
]
X = data[feats].values
y = data['Radiation(W/m2)'].values

print(X.shape)

In [None]:
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

kf = KFold(shuffle=True, random_state=19)
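
`KFold` uses 5 splits by default, and with `shuffle=True` it shuffles the row indices once before cutting them into folds. The sketch below mimics that idea in plain Python. It is an illustration of the technique, not scikit-learn's implementation, and it will not reproduce the exact folds that `random_state=19` gives in scikit-learn.

```python
import random


def shuffled_kfold_indices(n_samples, n_splits=5, seed=19):
    """Yield (train_indices, test_indices) pairs for a shuffled K-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)    # shuffle once, reproducibly
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(shuffled_kfold_indices(10))
for train, test in folds:
    print("train:", sorted(train), "test:", sorted(test))
```

Every row lands in exactly one test fold, so averaging the per-fold scores uses each observation for evaluation once.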

**Baseline Model**

In [None]:
scores = []
rmse = []
mae = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = DummyRegressor(strategy='mean').fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    mae.append(mean_absolute_error(y_test, model.predict(X_test)))
    
print('Mean R2 Score:', round(np.mean(scores), 5))
print('Mean RMSE:', round(np.mean(rmse), 5))
print('Mean MAE:', round(np.mean(mae), 5))
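
The three numbers printed above are the bar every later model has to beat. A minimal sketch of how they are defined, applied to a mean predictor on made-up targets (plain Python, with hypothetical values):

```python
from math import sqrt


def r2_rmse_mae(y_true, y_pred):
    """Return (R^2, RMSE, MAE) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # spread of the target
    r2 = 1 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae


# Hypothetical train/test targets; the baseline always predicts the training mean.
y_train = [0.0, 100.0, 200.0, 300.0]
y_test = [50.0, 150.0, 250.0]
baseline = sum(y_train) / len(y_train)

r2, rmse, mae = r2_rmse_mae(y_test, [baseline] * len(y_test))
print(round(r2, 5), round(rmse, 5), round(mae, 5))
```

Here the training mean happens to equal the test mean, so R2 is exactly 0. When the two means differ, a mean predictor scores slightly below 0 on the held-out fold, which is why a baseline R2 close to zero is the expected outcome.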

**Decision Tree**

In [None]:
%%time

scores = []
rmse = []
mae = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    dtmodel = DecisionTreeRegressor(random_state=19).fit(X_train, y_train)
    scores.append(dtmodel.score(X_test, y_test))
    rmse.append(np.sqrt(mean_squared_error(y_test, dtmodel.predict(X_test))))
    mae.append(mean_absolute_error(y_test, dtmodel.predict(X_test)))
    
print('Mean R2 Score:', round(np.mean(scores), 5))
print('Mean RMSE:', round(np.mean(rmse), 5))
print('Mean MAE:', round(np.mean(mae), 5))

**Tree Ensembles**

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

1. **Random Forest**: A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 

2. **Extra Trees**: In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slight increase in bias.

3. **Gradient Boosting**: Boosting is a method of converting weak learners into strong learners by adding trees one at a time, each trained to correct the errors of the ensemble built so far. Gradient boosting starts from a constant prediction (for squared error, the mean of the target). At each step it computes the negative gradient of the loss with respect to the current predictions, which for squared error is simply the residual, fits a small tree to it, and adds that tree's output scaled by a learning rate. Re-weighting the observations that are hard to fit is how AdaBoost works, not gradient boosting.

4. **Light GBM**: LightGBM grows trees leaf-wise (best-first), whereas many other boosting implementations grow them level-wise (depth-wise). At each step it splits the leaf with the largest reduction in loss. For the same number of leaves, a leaf-wise algorithm can reduce the loss more than a level-wise algorithm.

5. **XG Boost**: XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

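6. **XG Boost RF**: XGBoost's random forest mode. `XGBRFRegressor` uses XGBoost's tree-building machinery to train a random forest, i.e. a set of trees grown on random subsamples of rows and columns and then combined, rather than a sequence of boosted trees.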

7. **Cat Boost**: The name “CatBoost” comes from the two words “Category” and “Boosting”. Fitting a model generally requires converting categorical data into a numerical format with pre-processing methods such as “label encoding” or “one hot encoding”. CatBoost can use categorical features directly and scales well.
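
For squared error, gradient boosting boils down to repeatedly fitting a small tree to the current residuals and adding a shrunken copy of it to the ensemble. The sketch below shows that loop with one-split "stumps" on a single feature. It is a toy illustration in plain Python on made-up data, not how scikit-learn, LightGBM, XGBoost or CatBoost implement it, and the learning rate and number of rounds are arbitrary choices.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one feature (needs >= 2 distinct x values)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=200, learning_rate=0.1):
    """Boost stumps under squared error; returns a prediction function."""
    base = sum(ys) / len(ys)                  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up "radiation by hour of day" curve.
xs = [6, 8, 10, 12, 14, 16, 18]
ys = [0.0, 200.0, 500.0, 700.0, 500.0, 200.0, 0.0]

model = gradient_boost(xs, ys)
base_mean = sum(ys) / len(ys)
mse_baseline = mse(ys, [base_mean] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
print(round(mse_baseline, 2), round(mse_boosted, 2))
```

A random forest would instead fit each tree to the targets themselves on a resampled copy of the data and average the results, which is the main structural difference between the bagging and boosting entries in the list above.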

In [None]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from catboost import CatBoostRegressor

trees = {
    'RandomForest': RandomForestRegressor(random_state=19), 'ExtraTrees': ExtraTreesRegressor(random_state=19),
    'GradientBoosting': GradientBoostingRegressor(random_state=19), 'LightGBM': LGBMRegressor(random_state=19),
    'XGBoost': XGBRegressor(random_state=19), 'XGBoostRF': XGBRFRegressor(random_state=19), 
    'CatBoost': CatBoostRegressor(random_state=19, silent=True)
}

In [None]:
%%time

performance = {'rmse':[], '100* r2':[], 'mae':[]}
for name, model in trees.items():
    scores = []
    rmse = []
    mae = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = model.fit(X_train, y_train)
        scores.append(100*model.score(X_test, y_test))
        rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
        mae.append(mean_absolute_error(y_test, model.predict(X_test)))
    performance['100* r2'].append(np.mean(scores))
    performance['rmse'].append(np.mean(rmse))
    performance['mae'].append(np.mean(mae))

In [None]:
fig = px.bar(pd.DataFrame(performance, index=trees.keys()), barmode='group', title='Model Comparison')
fig.show()

* Extra Trees seems to work best for our purpose
    * Highest R2 score and lowest RMSE and MAE
* Other good models are Random Forest and CatBoost

**Feature Importance Analysis**

In [None]:
# Collect each model's importances (every model was last fitted on the final CV fold)
feat_imp = {name: model.feature_importances_ for name, model in trees.items()}
feat_imp['DecisionTree'] = dtmodel.feature_importances_
feat_imp = pd.DataFrame(feat_imp, index=feats)

# Rescale each model's column to sum to 1 so the bars are comparable
feat_imp /= feat_imp.sum()

fig, ax = plt.subplots(figsize=(20, 6))
fig.suptitle('Feature Importance', fontsize=18)
feat_imp.plot.bar(ax=ax, color=sns.color_palette("summer", 8))
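
The division by the column sum matters because libraries do not report importances on a common scale. To my understanding, scikit-learn's tree models return fractions that already sum to 1, while LightGBM's default importance counts how often a feature is used in a split. A minimal sketch of the rescaling on hypothetical numbers:

```python
def normalise(importances):
    """Rescale a list of non-negative importances so that it sums to 1."""
    total = sum(importances)
    return [value / total for value in importances]


# Hypothetical raw importances for the same three features from two models.
raw = {
    "fractions": [0.6, 0.3, 0.1],      # already sums to 1
    "split_counts": [300, 150, 50],    # times each feature was used in a split
}
scaled = {name: normalise(values) for name, values in raw.items()}
print(scaled)
```

After rescaling, both hypothetical models tell the same story about the relative ranking, which is what the grouped bar chart is meant to compare.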

* As the visualizations suggested, temperature is the most important feature
* Relative time of day is also a key feature