### Table of Contents

1. [Introduction](#1.-Introduction)

1. [Base Libraries](#2.-Base-Libraries)

1. [Data Preprocessing](#3.-Data-PreProcessing)

1. [Testing K Factors](#4.-Testing-K-Factors)

1. [Tuning Hyperparameters](#5.-Tuning-Hyperparameters)

1. [Validating the Models with Metrics](#6.-Validating-the-Models-with-Metrics)

1. [Predicting Energy Generation](#7.-Predicting-Energy-Generation)

1. [Conclusions](#8.-Conclusions)

1. [References](#9.-References)

# 1. Introduction

In forecasting, historical data is used as input to predict future trends, but selecting and extracting meaningful features from that data is challenging. The *k*-nearest neighbors (*k*NN) algorithm, one of the simplest and most useful machine learning techniques, can nevertheless be used to build powerful regression models with little effort. In this notebook, I focus on *k*NN itself and on tuning its hyperparameters, to give a better understanding of the algorithm and to keep this a plain tutorial for anyone who needs it. Other resources, such as feature engineering, dimensionality reduction or other machine learning techniques, are not used.

The dataset is a compilation of energy generation for several European countries, measured in TWh between 2000 and 2019. Its contents were extracted from Our World in Data.

# 2. Base Libraries

First, the basic libraries are imported: **pandas**, **matplotlib** and **numpy** for DataFrames, graphs and numeric operations; **MinMaxScaler** to normalize our data between 0 and 1; **train_test_split** to help split the dataset (usually 70% training / 30% testing); **GridSearchCV** for hyperparameter tuning; and **neighbors** to generate our models using *k*NN.

In [None]:
# Base Libraries
import pandas as pd
import matplotlib.pyplot as plt  
import numpy as np
# Transformation
from sklearn.preprocessing import MinMaxScaler
# Models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import neighbors
# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score

To measure the quality of our generated models, we use a bunch of metrics (maybe more than we need):
* **mean_absolute_error** (MAE): the average absolute difference between the predicted and the actual values;
* **mean_squared_error** (used here through its square root, RMSE): roughly the standard deviation of the residuals (prediction errors);
* **mean_squared_log_error** (used here through its square root, RMSLE): compares the logarithms of the predicted and actual values, so it reflects the ratio between them rather than their absolute difference;
* **r2_score** (R2): coefficient of determination, the proportion of the variance in the dependent variable that is predictable from the independent variable(s); and
* **explained_variance_score** (EVS): the proportion of the variance in the actual data that the model's predictions account for.
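
The definitions above can be checked by hand. The following is a minimal sketch that computes each metric with plain NumPy and compares it with the scikit-learn function of the same name. The arrays `y_true_toy` and `y_pred_toy` are invented values for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# Invented toy values (assumption: already scaled to the 0-1 range).
y_true_toy = np.array([0.10, 0.20, 0.40, 0.80])
y_pred_toy = np.array([0.12, 0.18, 0.45, 0.70])

residuals = y_true_toy - y_pred_toy

# Manual version of each metric.
mae_manual = np.mean(np.abs(residuals))
rmse_manual = np.sqrt(np.mean(residuals ** 2))
rmsle_manual = np.sqrt(np.mean((np.log1p(y_true_toy) - np.log1p(y_pred_toy)) ** 2))
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_true_toy - y_true_toy.mean()) ** 2)
evs_manual = 1 - np.var(residuals) / np.var(y_true_toy)

# scikit-learn versions (RMSE and RMSLE are square roots of the library values).
mae_sk = mean_absolute_error(y_true_toy, y_pred_toy)
rmse_sk = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
rmsle_sk = np.sqrt(mean_squared_log_error(y_true_toy, y_pred_toy))
r2_sk = r2_score(y_true_toy, y_pred_toy)
evs_sk = explained_variance_score(y_true_toy, y_pred_toy)

for name, manual, sk in [
    ("MAE", mae_manual, mae_sk),
    ("RMSE", rmse_manual, rmse_sk),
    ("RMSLE", rmsle_manual, rmsle_sk),
    ("R2", r2_manual, r2_sk),
    ("EVS", evs_manual, evs_sk),
]:
    print(f"{name}: manual={manual:.6f} sklearn={sk:.6f}")
```

R2 and EVS differ only in that EVS subtracts the mean of the residuals before measuring their spread, so the two coincide when the predictions are unbiased.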

# 3. Data PreProcessing

Initially, the data is imported from the dataset, and the categorical columns are excluded (in this case, only "Country"). The scaler returns a plain NumPy array, so its output is wrapped in a new *DataFrame* with the year columns restored; this normalizes the data between 0 and 1 while keeping the DataFrame properties. After this, the *describe* function is called to show summary statistics for each column, such as mean, max, min and std.

In [None]:
pd.options.display.float_format = '{:.4f}'.format

scaler = MinMaxScaler(feature_range=(0, 1))

df_power = pd.read_csv('../input/hydropower-generation/Hydropower_Consumption.csv', sep=',')
df_power = df_power.drop(columns = ["Country"])
df_power = pd.DataFrame(scaler.fit_transform(df_power), 
                        columns=['2000','2001','2002','2003','2004','2005',
                                 '2006','2007','2008','2009','2010','2011',
                                 '2012','2013','2014','2015','2016','2017',
                                 '2018','2019'])
df_power.describe()
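
As a small sketch of the wrapping step, the example below scales a made-up three-row frame and rebuilds the DataFrame from the scaler output. The frame `toy`, its country names and its values are all hypothetical; reusing `numeric.columns` instead of typing the labels by hand is a design choice of this sketch, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame; names and values are invented for illustration.
toy = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "2000": [10.0, 20.0, 30.0],
        "2001": [5.0, 5.0, 15.0],
    }
)
numeric = toy.drop(columns=["Country"])

toy_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = pd.DataFrame(
    toy_scaler.fit_transform(numeric),  # returns a plain NumPy array
    columns=numeric.columns,            # reuse the original column labels
    index=numeric.index,                # keep the original row labels
)
print(scaled)

# Each column is scaled independently; inverse_transform recovers the original units.
restored = toy_scaler.inverse_transform(scaled.to_numpy())
print(restored)
```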

Now, my goal was to check whether it is possible to create a model that predicts power generation for 2019 from the previous 19 years (2000 to 2018) with a score of at least 75%. For this, the data set was separated into X and y, X being my prediction data, and y what I intended to predict. These are then divided into training (70%) and testing (30%) sets.

In [None]:
X = df_power.drop(columns = ["2019"])
y = df_power["2019"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Now that the training and test data are selected, we need to find the value of *k* that gives the best results for the algorithm.
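
To make the role of *k* concrete, here is a minimal sketch of what a *k*NN regressor does for a single query point: it finds the *k* closest training rows and averages their targets. The helper `knn_predict` and the random toy data are assumptions made for this illustration; the comparison relies on the scikit-learn defaults (uniform weights, Euclidean distance).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def knn_predict(X_fit, y_fit, x_query, k):
    """Average the targets of the k training rows closest to x_query."""
    distances = np.linalg.norm(X_fit - x_query, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]                   # indices of the k closest rows
    return y_fit[nearest].mean()


# Invented toy data: 30 rows, 3 features, target is the row sum.
rng = np.random.default_rng(0)
X_toy = rng.random((30, 3))
y_toy = X_toy.sum(axis=1)
query = np.array([0.5, 0.5, 0.5])

manual = knn_predict(X_toy, y_toy, query, k=4)
reference = KNeighborsRegressor(n_neighbors=4).fit(X_toy, y_toy).predict([query])[0]

print("manual:", manual)
print("sklearn:", reference)
```

A small *k* follows the nearest points closely and can be noisy, while a large *k* averages over more distant rows and smooths the prediction.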

# 4. Testing *k* Factors


One way to find this *k* factor is to run a test with several values and measure the error for each. It can almost be considered a "brute force test", since a large number of possible values of *k* has to be tried.

In [None]:
rmsle_val = []              # RMSLE for each k, in order
best_rmsle = float('inf')   # any real score will be lower than this
best_k = None

# Try every k from 1 to 20 and keep the one with the lowest test-set RMSLE
for k in range(1, 21):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k)

    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
    if rmsle < best_rmsle:
        best_rmsle = rmsle
        best_k = k
    rmsle_val.append(rmsle)
    print('RMSLE value for k =', k, 'is:', rmsle)

print(f"Best RMSLE: {best_rmsle}, Best k: {best_k}")

Our metric indicates that the smallest error occurs at *k* = 4 (RMSLE of 0.0390), which corresponds to a relative error of roughly 3.90% between the predicted and actual values.

Next, we plot the RMSLE values against the *k* values to see the results visually. This plot is known as the "elbow curve", after the bend that appears where the error stops falling and begins to rise, which marks the best value of *k*.

In [None]:
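# Elbow curve: index the RMSLE values by k so the x-axis shows k, not the list position
curve = pd.DataFrame({'RMSLE': rmsle_val}, index=range(1, len(rmsle_val) + 1))
ax = curve.plot(figsize=(8,5))
ax.set_xlabel('k')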

The graph shows a sharp drop in the RMSLE as *k* advances to 4, after which the error starts to rise again, leading us to the conclusion that 4 is the best value for *k*. There is also another way to find the best value for *k*: hyperparameter tuning using GridSearchCV.

# 5. Tuning Hyperparameters

A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.

In [None]:
params = {'n_neighbors':[2,3,4,5,6,7,8,9]}

knn = neighbors.KNeighborsRegressor()

model = GridSearchCV(knn, params, cv=5)
model.fit(X_train,y_train)
model.best_params_

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. For our dataset, GridSearchCV also selects 4 as the best value for *k*, matching the result of our earlier test.
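
When no `scoring` argument is given, GridSearchCV ranks candidates with the estimator's own `score` method, which for regressors is R2. The sketch below, on invented random data, shows how the guiding metric can be set explicitly and how a second hyperparameter (`weights`) can join the grid. The names `X_demo`, `y_demo`, `grid` and `search` are assumptions for this illustration, and the best parameters it prints say nothing about the energy dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Invented data: 60 rows, 5 features, noisy target.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 5))
y_demo = X_demo.mean(axis=1) + rng.normal(0, 0.02, size=60)

# Every combination in the grid is tried: 8 values of k times 2 weightings.
grid = {
    "n_neighbors": [2, 3, 4, 5, 6, 7, 8, 9],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsRegressor(),
    grid,
    cv=5,                                   # 5-fold cross-validation on the data passed to fit
    scoring="neg_root_mean_squared_error",  # negated so that "higher is better" still holds
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

Because the score is negated, `best_score_` is flipped back to a positive RMSE before printing.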

Once the best value for *k* is found, it is time to train the model and predict the results. We also use the *score* function, which returns the coefficient of determination (R2) of the model on the test data (80.16%).

In [None]:
knn = neighbors.KNeighborsRegressor(n_neighbors = 4)

knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
knn.score(X_test, y_test)
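
For a regressor, `score` is the same quantity as `r2_score` computed on the model's predictions. A minimal sketch on invented data (all names ending in `_s` are hypothetical) confirms this:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Invented data, split 70/30 as in the notebook.
rng = np.random.default_rng(1)
X_s = rng.random((50, 4))
y_s = X_s.sum(axis=1) + rng.normal(0, 0.05, size=50)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_s, y_s, test_size=0.3, random_state=0
)

knn_s = KNeighborsRegressor(n_neighbors=4).fit(X_train_s, y_train_s)

score_value = knn_s.score(X_test_s, y_test_s)             # R2 via the estimator
r2_value = r2_score(y_test_s, knn_s.predict(X_test_s))    # R2 via the metric function

print(score_value, r2_value)
```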

Once we have the model ready and the prediction made, we can apply the metrics and analyze the results.

# 6. Validating the Models with Metrics

Here the test values are compared with the predicted values using each metric, so that we can see the result for each one.

In [None]:
r2_valid = r2_score(y_test, y_pred)
mae_valid = mean_absolute_error(y_test, y_pred)
evs_valid = explained_variance_score(y_test, y_pred, multioutput='uniform_average')
rmse_valid = np.sqrt(mean_squared_error(y_test, y_pred))
rmsle_valid = np.sqrt(mean_squared_log_error(y_test, y_pred))

print('R2 Valid:',r2_valid)
print('EVS Valid:', evs_valid)
print('MAE Valid:', mae_valid)
print('RMSE Valid:',rmse_valid)
print('RMSLE Valid:', rmsle_valid)

The results show an R2 of 80.16% and an EVS of 81.31%, indicating that the model fits the sample well (R2) and that its predictions account for most of the variance in the actual data (EVS). The RMSLE considers only the relative error between the predicted and the actual value, so the scale of the error is not significant. The RMSE, on the other hand, grows in magnitude as the scale of the error grows. For our model, both are very low (RMSE of 4.99% and RMSLE of 3.90%).
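
The difference between the two error measures can be seen by rescaling the same predictions. In this sketch, with invented values and the hypothetical helpers `root_mse` and `root_msle`, multiplying both actual and predicted values by 1000 multiplies the RMSE by 1000 but leaves the RMSLE almost unchanged, because the ratios between predicted and actual values stay the same.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error


def root_mse(y_true, y_hat):
    return np.sqrt(mean_squared_error(y_true, y_hat))


def root_msle(y_true, y_hat):
    return np.sqrt(mean_squared_log_error(y_true, y_hat))


# Invented values for illustration only.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])
scale = 1000.0

rmse_small = root_mse(actual, predicted)
rmse_large = root_mse(actual * scale, predicted * scale)
rmsle_small = root_msle(actual, predicted)
rmsle_large = root_msle(actual * scale, predicted * scale)

print("RMSE :", rmse_small, "->", rmse_large)
print("RMSLE:", rmsle_small, "->", rmsle_large)
```

The RMSLE is not perfectly constant because scikit-learn takes the logarithm of `1 + value`, which matters less as the values grow.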

# 7. Predicting Energy Generation


Once we have the prediction ready and the model tested, we organize the results side by side to be able to make a comparison.

In [None]:
data_prediction = list(zip(y_test,y_pred))
data_prediction = pd.DataFrame(data_prediction, columns=['Test','Prediction'])
data_prediction.head(10)

The results show that the predicted values are very close to the test values, so the model we created met its proposed objective of scoring above 75%.

# 8. Conclusions

Power generation data sets follow a very regular pattern, which can be identified even with a simple regression tool and only minimal pre-processing of the data. Several other techniques and algorithms can also be applied to push the performance of the model higher.

# 9. References

[1] Ali, M., Khan, Z. A., Mujeeb, S., Abbas, S., & Javaid, N. (2019). Short-Term Electricity Price and Load Forecasting using Enhanced Support Vector Machine and K-Nearest Neighbor. 2019 Sixth HCT Information Technology Trends (ITT). doi:10.1109/itt48889.2019.9075063 

[2] Ashfaq, T., & Javaid, N. (2019). Short-Term Electricity Load and Price Forecasting using Enhanced KNN. 2019 International Conference on Frontiers of Information Technology (FIT). doi:10.1109/fit47737.2019.00057 

[3] Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData Mining, 10(1). doi:10.1186/s13040-017-0155-3 

[4] Claesen, M., & De Moor, B. (2015). Hyperparameter Search in Machine Learning. arXiv:1502.02127

[5] Zaki, M., & Meira, W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. New York City, New York: Cambridge University Press. doi:10.1017/CBO9780511810114