# Modelling a wind farm with AutoML
This example uses the same data as [this notebook](https://www.kaggle.com/matteodefelice/modelling-wind-power-generation). In this case, rather than specifying an ML model, we use an automated machine-learning (AutoML) tool. Many AutoML tools are available; in this example I use [TPOT](http://epistasislab.github.io/tpot/).

TPOT, available under an open license for all major operating systems (including Windows), optimises a scikit-learn pipeline with a Genetic Programming (GP) algorithm. In short, like all evolutionary algorithms, it evolves a population in which each individual represents a pipeline, using mutation and crossover operators.
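
To make the idea of mutation and crossover more concrete, here is a minimal toy sketch of an evolutionary loop. It is not TPOT's implementation: the step names, the fixed-length list encoding and the `fitness` function are all made up for illustration (real GP evolves tree-shaped pipelines and scores them on data).

```python
import random

# Hypothetical building blocks; TPOT's real search space is far larger.
STEPS = ["scale", "pca", "select", "poly", "ridge", "forest", "knn", "boost"]
TARGET = ["scale", "select", "forest"]  # made-up "ideal" pipeline for the toy fitness


def fitness(individual):
    """Toy score: number of positions that match the made-up target pipeline."""
    return sum(a == b for a, b in zip(individual, TARGET))


def mutate(individual, rng):
    """Replace one randomly chosen step with a random step."""
    child = list(individual)
    child[rng.randrange(len(child))] = rng.choice(STEPS)
    return child


def crossover(parent_a, parent_b, rng):
    """One-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def evolve(generations=20, population_size=10, seed=0):
    rng = random.Random(seed)
    population = [[rng.choice(STEPS) for _ in TARGET] for _ in range(population_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        offspring = []
        for _ in range(population_size):
            a, b = rng.sample(population, 2)
            offspring.append(mutate(crossover(a, b, rng), rng))
        # Keep the best individuals among parents and offspring (elitist selection)
        population = sorted(population + offspring, key=fitness, reverse=True)[:population_size]
        history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
print(best, history)
```

Because the best individuals always survive to the next generation in this sketch, the best score in `history` can never decrease.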

More information on TPOT can be found in this [Medium article](https://towardsdatascience.com/tpot-pipelines-optimization-with-genetic-algorithms-56ec44ef6ede) or in the [official scientific paper](https://dl.acm.org/doi/10.1145/2908812.2908918).

In [None]:
from tpot import TPOTRegressor

import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns

import matplotlib.pyplot as plt

Here we read the training data: the hourly generation of the Gordonbush wind farm for the years 2016-2018.

In [None]:
df = pd.read_csv('../input/gordonbush-wind/gordonbush-2016_2018.csv')
df.info()

A bit of data wrangling creates the input data: one column per grid point and weather variable (the two wind components and the wind speed).

In [None]:
x = df[['latitude', 'longitude', 'time', 'u100', 'v100', 'ws']]
x = x.assign(point = x['latitude'].astype(str) + x['longitude'].astype(str))
x = x.drop(['latitude', 'longitude'], axis = 1)
x = x.pivot(index = 'time', columns = ['point'], values = ['u100', 'v100', 'ws'])
x.columns = x.columns.to_flat_index().str.join('_')
x = x.reset_index().drop('time', axis = 1)
x.head()

We extract the output (more information on the data can be found in the [previous notebook](https://www.kaggle.com/matteodefelice/modelling-wind-power-generation)).

In [None]:
y = df.loc[(df['latitude'] == 58.25) & (df['longitude'] == -4), 'ActualGenerationOutput']
y.shape

We set up the TPOT algorithm. **Very important**: the parameters used in this example have been chosen for a quick computation (<5 minutes). For the training we use **only 5% of the total data** (`subsample` parameter). To maximise the performance I'd suggest using 100% of the data and possibly increasing the number of generations and the population size (both default to 100). The objective function is R squared.

In [None]:
tpot = TPOTRegressor(generations=5, population_size=10, verbosity=2, random_state=41,
                     scoring='r2',     # objective function: R squared
                     n_jobs=4,         # number of parallel processes
                     subsample=0.05)   # fraction of the training samples used
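
To get a feeling for how much cheaper these settings are than the defaults, we can count the pipelines evaluated in a run. The sketch below uses the formula given in the TPOT documentation (`population_size + generations × offspring_size`, with `offspring_size` defaulting to `population_size`); the helper name `pipelines_evaluated` is mine and is not part of TPOT.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Total pipelines evaluated in one run, following the formula in the TPOT docs."""
    if offspring_size is None:
        offspring_size = population_size  # documented default
    return population_size + generations * offspring_size


quick = pipelines_evaluated(generations=5, population_size=10)
default = pipelines_evaluated(generations=100, population_size=100)
print(quick, default)  # 60 10100
```

Each pipeline is also scored in cross-validation (5 folds by default, as far as I know) and, with `subsample = 1.0`, on twenty times more data, so a run with the default settings takes far longer than this quick example.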

The algorithm runs using `x` and `y`. The best score (R squared computed in cross-validation) normally improves from one generation to the next as the population evolves.

In [None]:
tpot.fit(x, y)
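
For readers unfamiliar with a cross-validated R squared, the following sketch computes it by hand with NumPy on synthetic data. It only illustrates the metric: the linear least-squares model, the contiguous 5-fold split and the helper names are my assumptions, not what TPOT does internally.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(X, y, n_folds=5):
    """R squared of a least-squares linear model on each held-out fold."""
    all_idx = np.arange(len(y))
    scores = []
    for test_idx in np.array_split(all_idx, n_folds):
        train_idx = np.setdiff1d(all_idx, test_idx)
        # Fit on the training folds only (the column of ones is the intercept)
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the fold that was left out
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        scores.append(r2_score(y[test_idx], A_test @ coef))
    return np.array(scores)


# Synthetic data: a noisy linear relationship (not the wind farm data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_val_r2(X, y)
print(scores.mean())
```

The score reported for a model is the average over the folds, so it reflects performance on data the model has not seen during fitting.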


Now `tpot` contains the best fitted pipeline and information on the run of the algorithm. We can inspect the best pipeline (the best individual).

In [None]:
tpot.fitted_pipeline_

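We can also look at the Pareto front, which considers both complexity and performance. Note that, according to the TPOT documentation, this attribute is only available when `verbosity=3`.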

In [None]:
tpot.pareto_front_fitted_pipelines_
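
As a reminder of what a Pareto front is, here is a small sketch that filters a list of candidates down to the non-dominated ones. The pipeline names, sizes and scores are made up, and the function is a plain illustration of the concept rather than TPOT's own code.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate.

    Each candidate is (name, n_steps, score): fewer steps and a higher score are better.
    """
    front = []
    for name, steps, score in candidates:
        dominated = any(
            (s2 <= steps and sc2 >= score) and (s2 < steps or sc2 > score)
            for _, s2, sc2 in candidates
        )
        if not dominated:
            front.append((name, steps, score))
    return front


# Made-up pipelines and scores, for illustration only
candidates = [
    ("A", 1, 0.70),
    ("B", 2, 0.78),
    ("C", 2, 0.75),  # dominated by B: same size, lower score
    ("D", 3, 0.77),  # dominated by B: larger and lower score
    ("E", 4, 0.80),
]
print([name for name, _, _ in pareto_front(candidates)])  # ['A', 'B', 'E']
```

Every pipeline on the front is the best available for its level of complexity, so a simpler pipeline can be chosen at a known cost in score.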

We want to evaluate the performance of this model on the year 2019. We therefore load the 2019 data and apply the same processing to create the test input and output datasets.

In [None]:


df_test = pd.read_csv('../input/gordonbush-wind/gordonbush-2019.csv')
xt = df_test[['latitude', 'longitude', 'time', 'u100', 'v100', 'ws']]
xt = xt.assign(point = xt['latitude'].astype(str) + xt['longitude'].astype(str))
xt = xt.drop(['latitude', 'longitude'], axis = 1)
xt = xt.pivot(index = 'time', columns = ['point'], values = ['u100', 'v100', 'ws'])
xt.columns = xt.columns.to_flat_index().str.join('_')
xt = xt.reset_index().drop('time', axis = 1)

yt = df_test.loc[(df_test['latitude'] == 58.25) & (df_test['longitude'] == -4), 'ActualGenerationOutput']
xt.shape, yt.shape
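
Since the same wrangling is applied to the training and the testing data, it could be wrapped in a function. The sketch below is one possible way to do it: `build_inputs` and `toy_frame` are hypothetical names, the data is made up, and the column names are joined with a list comprehension instead of `.str.join`.

```python
import pandas as pd


def build_inputs(df, variables=('u100', 'v100', 'ws')):
    """Hypothetical helper: one column per weather variable and grid point."""
    variables = list(variables)
    wide = df[['latitude', 'longitude', 'time'] + variables]
    wide = wide.assign(point=wide['latitude'].astype(str) + wide['longitude'].astype(str))
    wide = wide.pivot(index='time', columns='point', values=variables)
    wide.columns = ['_'.join(col) for col in wide.columns]
    return wide.reset_index(drop=True)


def toy_frame(times):
    """Made-up data on two grid points, only to exercise the helper."""
    rows = [
        {'latitude': lat, 'longitude': -4.0, 'time': t,
         'u100': 1.0, 'v100': 2.0, 'ws': 3.0}
        for t in times for lat in (58.0, 58.25)
    ]
    return pd.DataFrame(rows)


x_train = build_inputs(toy_frame(['2018-01-01 00:00', '2018-01-01 01:00']))
x_test = build_inputs(toy_frame(['2019-01-01 00:00']))

# The fitted pipeline expects the same columns, in the same order, at prediction time
assert list(x_train.columns) == list(x_test.columns)
print(x_train.shape, x_test.shape)
```

A single function makes it harder for the two datasets to drift apart, and the column check catches a mismatch before calling `predict`.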



We apply the best pipeline to this new data and calculate the correlation coefficient. The result is similar to the one we got using a neural network in the previous example, but in this case we didn't select the best model and, as said above, we used parameters chosen for a quick computation. Setting `subsample` to 1 and increasing the number of generations and the population size *might* therefore lead to even better results.

In [None]:
y_hat_test = tpot.fitted_pipeline_.predict(xt)
print(scipy.stats.pearsonr(yt.values, y_hat_test.flatten()))
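
One caveat when comparing numbers: the optimisation above used R squared, while here we report Pearson's correlation, and the two are not interchangeable. The sketch below uses made-up data to show that a prediction can be perfectly correlated with the observations and still have an R squared well below 1, because correlation ignores a linear rescaling or shift.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 70, size=500)  # made-up "observed" values

# A prediction that follows the observations exactly, but rescaled and shifted
y_pred = 0.5 * y_true + 20

# Pearson's correlation coefficient
r = np.corrcoef(y_true, y_pred)[0, 1]

# R squared: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(r, 3), round(r2, 3))
```

Computing R squared on the 2019 data as well would give a number directly comparable with the cross-validation score shown during the TPOT run.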

We plot the scatter data and the histograms. Here too, we can see that the ML model cannot predict the cases of zero generation, which are probably due to outages (or curtailment?).

In [None]:
plt.scatter(y = y_hat_test, x = yt.values)
plt.xlabel('Testing wind generation')
plt.ylabel('Best pipeline prediction')
plt.title('Wind generation on the testing data')
plt.grid(True)

fig, axs = plt.subplots(2, 1, figsize=(12, 5))
bins = np.linspace(0, 80, 20)

sns.histplot(y_hat_test, bins = bins, ax = axs[0], kde = False).set(title='Predicted')
sns.histplot(yt, bins = bins, ax = axs[1], kde = False).set(title = 'Observed (2019)')

plt.show()

Finally, we can plot the first two weeks of data: the actual generation in orange and the model output in blue.

In [None]:
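plt.figure(figsize=[20, 6])
n_hours = 14 * 24  # two weeks of hourly data
plt.plot(y_hat_test[:n_hours], label='Model output')
plt.plot(yt.values[:n_hours], label='Actual generation')
plt.legend()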