# Introduction to ML projects

<img src="https://www.mrtfuelcell.polimi.it/images/logo_poli.jpg" height="200">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg" height="150">

A2A ML Course - day 3 - 04/10/2024

Author: Maciej Sakwa <br>
Coauthors: Michael Wood, Emanuele Ogliari

## Outline

1. Machine Learning project structure
2. Hands-on ML project:
    - Getting the data and problem definition
    - Exploratory Data Analysis
    - Feature Engineering
    - Model development
    - Results

## Learning objectives

* Understand the necessary steps to complete a ML project
* Estimate the timeframe required for a ML project
* Gain first-hand experience programming a simple ML project

---

## Machine Learning project structure

<img src="https://freesvg.org/img/Brain-Computer.png" width="300">

Despite the wide variety of problems that can be tackled with Machine Learning and Deep Learning, the general project structure is almost the same every time.

Surprisingly, the biggest decisions are not about the ML or DL models: **they are about data**.

This lesson is designed to give you an understanding of the steps needed to complete any ML project, regardless of the size of the dataset. With hands-on experience, you will see where the actual difficulty of ML projects lies.

### Structure outline

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/project%20structure.png?raw=true" width="800">


To summarise, the main steps are:

| N° | Step | Details |
| --- | --- | --- |
|1. | **GET DATA** | Acquire and curate the dataset. Define the problem to solve |
|2. | **EDA** | Explore the data to understand how (and **if**!) you can use it to solve the problem |
|3. | **FEATURE ENG.** | Transform the data to better fit your needs |
|4. | **MODELING** | Develop ML/DL models, test them, and fine-tune them for better results |
|5. | **PRESENT AND LAUNCH** | Get feedback from domain experts, then launch your project on the platform of choice |

The **IF** in the second point is crucial for the development of *useful* ML projects.
ML is often called a **black box**. However, it is not a *magic* black box.
You cannot solve *any* problem using *any* data. The data has to make the solution of the problem possible.

>**A reasonable, data-driven problem definition is necessary for the development of good ML projects.**

Development of ML projects should come from cooperation between Data Scientists and Engineers who understand the topic through and through.

---

## Hands-on ML project

### 1. Get the data and define the problem

Any project has to start somewhere, and an ML project starts with data acquisition. In reality, data acquisition and preprocessing is a highly complex problem in itself, so big companies employ dedicated people for it (so-called *Data Engineers*).

Fortunately for us, we can consider this step already done and begin by loading our CSV datasets.


In [352]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We load two datasets:
1. The electricity generation in Spain
2. The weather data from 5 big cities in Spain

In [353]:
data_energy = pd.read_csv('./data/energy_supply.csv')
data_weather = pd.read_csv('./data/weather_features.csv')

This tutorial will serve as a guide to the **supervised learning** branch of Machine Learning. The classical, non-deep models that we use here are sometimes also called **shallow learning**.

**The task that we will try to perform is to predict the actual electricity price in Spain knowing the actual generation data (and weather parameters) in the country.**

### 2. Exploratory data analysis

In reality, it is very hard to *decouple* EDA from Feature Engineering. **It's a loop, an iterative process.** However, for the sake of learning, let's start the EDA by simply taking a look at what we loaded. For now let's focus on the energy dataset `data_energy`. The other one (`data_weather`) will be useful later.

Try using the `.describe()` and `.info()` methods. Also list all the columns in the dataset.

In [354]:
# Try your code before the "#" sign

Let's do a short exercise to refresh the pandas syntax. Do you remember that you can use a list to extract several columns? Let's try it here:

In [355]:
fossil_list = [
    'generation fossil brown coal/lignite',
    'generation fossil coal-derived gas',
    'generation fossil gas',
    'generation fossil hard coal',
    'generation fossil oil',
    'generation fossil oil shale',
    'generation fossil peat']

hydro_list = [
    'generation hydro pumped storage aggregated',
    'generation hydro pumped storage consumption',
    'generation hydro run-of-river and poundage',
    'generation hydro water reservoir',
    'generation marine']

In [223]:
data_energy_fossils = ... # Filter the data to contain only the fossil fuels

Also, do you remember that you can create new columns? For that we can use the aggregation functions that we learned last time. Create the `'total'` column by using the `.sum()` method. <br> For it to work we have to specify the axis along which we sum. The column axis is the second one, so we have to write `axis=1` in the parentheses.

In [None]:
data_energy_fossils['total'] = ... # Create the new column
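
As a small illustration of the two ideas above (selecting several columns with a list, then summing row by row), here is a sketch on a made-up table. The column names and values are invented for the example and are not part of the course dataset.

```python
import pandas as pd

# Toy data: names and values are invented for illustration only
toy = pd.DataFrame({
    'source a': [1.0, 2.0, 3.0],
    'source b': [10.0, 20.0, 30.0],
    'price': [5.0, 6.0, 7.0],
})

source_list = ['source a', 'source b']

# Passing a list of names selects several columns at once
toy_sources = toy[source_list].copy()

# axis=1 sums across the columns, giving one total per row
toy_sources['total'] = toy_sources.sum(axis=1)

print(toy_sources)
```

The `.copy()` keeps the new table independent of the original, so adding a column does not trigger a pandas warning about modifying a view.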

As you can see, the dataset that we are working with is a **time series** dataset.

---

**Working with time series data in pandas** 

<img src="https://miro.medium.com/v2/resize:fit:650/0*MJtKLn0wgompp9lJ.jpeg" width="500">

Time series are probably one of the most important and most common forms of data that we can acquire. Time flies (and does not stop), so it is no surprise that people want to see how things change with time. Time series data does exactly that. Working with a new form of data might be a bit frightening at first. However, there is nothing to worry about! Time series data is simply an organised form of tabular data. As pandas comes packed with tools to work with tabular data, it is no surprise that it handles time series datasets well. <br> pandas has many built-in functions and tools to help us with that.

The rule of thumb when working with time series datasets in pandas is to make the timestamp column the **index** of the Dataframe (the leftmost column when the table is displayed).



In [356]:
data_energy.index = pd.to_datetime(data_energy['time'], utc=True)
data_energy.drop(columns='time', inplace=True)

In this piece of code we set the **index** to be the time column, converting it to the datetime format with the function `pd.to_datetime()`. As the `'time'` column then becomes redundant, we remove it using the `.drop()` method.

Having the datetime as the Dataframe index allows for easy time-based filtering, for example:

In [None]:
data_energy.loc['2017-01-01':'2017-01-31']
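
To see how this kind of slicing behaves, here is a minimal sketch on a synthetic hourly series. The table `toy` and its values are assumptions made for the example; only the `.loc` syntax is the same as in the cell above.

```python
import numpy as np
import pandas as pd

# Toy hourly series covering 2016 (the values are just a counter)
idx = pd.date_range('2016-01-01', '2016-12-31 23:00', freq='h', tz='UTC')
toy = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# Partial date strings select whole periods; both ends of a slice are inclusive
june = toy.loc['2016-06']
first_week = toy.loc['2016-01-01':'2016-01-07']

print(len(june))        # 30 days * 24 hours
print(len(first_week))  # 7 days * 24 hours
```

Note that, unlike ordinary Python slices, the end label is included: the slice up to `'2016-01-07'` contains every hour of that day.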

As an exercise, try to extract the first three months of 2016 and save them as `data_energy_time_slice`.

In [11]:
data_energy_time_slice = ... # Remove the dots and complete the query, remember to add .copy() at the end

It will be useful when we try to plot the data.

For example, the code below lets you plot the energy generation by source. We can use time slicing to take a closer look at specific periods. <br>Try to experiment with the plot. You can add or remove columns, change the colors, scales, time range, etc.

In [None]:
# Initialize the figure
plt.figure(figsize=(16, 8))

# Single column data
plt.plot(data_energy_time_slice['generation solar'], label='Solar')
plt.plot(data_energy_time_slice['generation wind onshore'], label='Wind')
plt.plot(data_energy_time_slice['generation biomass'], label='Biomass')
plt.plot(data_energy_time_slice['generation nuclear'], label='Nuclear')

# Aggregated data from our defined lists
plt.plot(data_energy_time_slice[fossil_list].sum(axis=1), label='Fossils') 
plt.plot(data_energy_time_slice[hydro_list].sum(axis=1), label='Hydro')

# Visuals
plt.legend()
plt.xlabel('Time')
plt.ylabel('Power generation (MW)')
plt.grid(which='major', alpha = 0.5)

plt.show()

Now, let's see how the energy prices change in the same period. Try to plot the spot price and the day ahead price. The corresponding column names are `'price actual'` and `'price day ahead'`.

In [None]:
# Initialize the figure
plt.figure(figsize=(16, 8))

plt.plot(...)
plt.plot(...)

# Visuals
plt.legend()
plt.xlabel('Time')
plt.ylabel('Cost (€/MWh)')
plt.grid(which='major', alpha = 0.5)

plt.show()

*What are some conclusions we can draw here? Do you see some correlations?*

We can quantify the correlations by calling the `.corr()` method. By default, it calculates the Pearson correlation coefficient $\rho_{X, Y}$ between every pair of columns $X$ and $Y$ in the dataframe:

$$
 \rho_{X, Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}
$$

where $\operatorname{cov}(X, Y)$ is the covariance of the two columns, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of the columns. Let's try it out:

In [None]:
data_energy.corr()
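
As an aside, the formula can be checked by hand. The sketch below computes the coefficient from the covariance and the standard deviations of two invented columns and compares it with the value reported by `.corr()`; the toy numbers are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy columns, invented for illustration
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 7.8, 10.1],
})

# Covariance and standard deviations, all with the same (n - 1) normalisation
cov_xy = toy['x'].cov(toy['y'])
rho_manual = cov_xy / (toy['x'].std() * toy['y'].std())

# The same coefficient taken from the pandas correlation matrix
rho_pandas = toy.corr().loc['x', 'y']

print(rho_manual, rho_pandas)
```

Both numbers agree, and they are close to 1 because the toy `y` grows almost linearly with `x`.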

There are a lot of NaN values! Either many values are missing, or some columns contain only zeros. Let's check:

In [None]:
data_energy.sum()

There are a lot of columns that contain only zeros! Let's remove them:

In [361]:
zero_columns = [
    'generation fossil coal-derived gas', 
    'generation fossil oil shale', 
    'generation fossil peat', 
    'generation geothermal', 
    'generation hydro pumped storage aggregated', 
    'generation marine', 
    'generation wind offshore', 
    'forecast wind offshore eday ahead']

data_energy.drop(columns=zero_columns, inplace=True)
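
The list above was written out by hand after inspecting the sums. One possible way to find such columns programmatically is sketched below on a toy table; the names `toy`, `is_empty` and `empty_columns` are hypothetical and the approach is only one of several.

```python
import numpy as np
import pandas as pd

# Toy table: 'b' is all zeros, 'c' contains only zeros and missing values
toy = pd.DataFrame({
    'a': [1.0, 0.0, 3.0],
    'b': [0.0, 0.0, 0.0],
    'c': [0.0, np.nan, 0.0],
})

# A column carries no information if every entry is zero or missing
is_empty = (toy.fillna(0) == 0).all()
empty_columns = is_empty[is_empty].index.to_list()

print(empty_columns)
toy_clean = toy.drop(columns=empty_columns)
```

Treating missing values as zeros here is a deliberate simplification for the check only; the original table is not modified by `fillna`.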

Now that we got rid of the zero columns, let's call `.corr()` again. The output is also a Dataframe, so we can filter it the same way. Let's see which columns are most strongly correlated with the actual price:

In [None]:
data_energy.corr()['price actual'].sort_values(ascending=False)

So the actual price has a strong positive correlation with the day-ahead price (obviously), but also with fossil-fuel-based generation and with the total generation needed. So the more energy we need, and the more of it comes from fossil fuels, the more expensive the energy gets.

It is also negatively correlated with hydro-based and wind-based generation, which are cheap sources. Surprisingly, PV-based production has a low impact on the price.

Before we move on, let's check for missing data. Use the `.info()` and `.dropna()` methods to get rid of null values:

In [None]:
data_energy.dropna(inplace=True)
data_energy.info()

To make the task a bit more challenging, let's assume we have no day-ahead knowledge of the system. We simply remove the `'forecast'` columns and the day-ahead price:

In [364]:
forecast_list = [item for item in data_energy.columns.to_list() if 'forecast' in item] + ['price day ahead']
forecast_list

data_energy.drop(columns=forecast_list, inplace = True)

Now we can move on to splitting the dataset.

### 3. Feature engineering

Let's explain a few things first:

>The problem that we are trying to solve can be classified as a standard **regression** task. In theoretical terms, all regression problems are a subset of a broader Machine Learning group called **supervised learning**.
>
>In **supervised learning**, the term *supervised* means that we have prior knowledge of the desired outputs, which we can use in modeling. In other words, we can teach the model to use the variables we have to predict the desired output. In more mathematical terms, we have a set of variables $\mathbf{X}$ and a set of outputs $y$, and we are searching for a function $f(\mathbf{X})$ whose predictions $\hat{y} = f(\mathbf{X})$ approximate the outputs $y$. Of course, the approximation is almost never ideal, as most of the problems that we deal with are highly **nonlinear**, but we still try to minimize the difference between the real and the approximated outputs, $\min |y - \hat{y}|$. In practice, we rarely search for the approximating function on our own: there are countless **ML and DL models** ready to be used. *We only have to optimize the model to work on our data.*
>
>We often call the variables **features** or **inputs**, and the outputs **labels** or **targets**. Moreover, the process of optimizing the approximating function $f(\mathbf{X})$ is often referred to as **fitting** or **training**.
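
To make the idea of searching for $f(\mathbf{X})$ concrete, here is a minimal sketch that fits a linear function to synthetic data with ordinary least squares. The data, the true weights and all variable names are assumptions chosen for the example; real problems are rarely this clean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two features and a target built from known weights plus noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

# Linear model f(X) = X @ w + b; a column of ones lets lstsq estimate b as well
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Approximated outputs and their average distance from the real ones
y_hat = X_design @ coef
print('estimated weights and intercept:', coef)
print('mean absolute difference:', np.mean(np.abs(y - y_hat)))
```

The estimated coefficients land close to the values used to generate the data (3, -2 and 1), which is exactly what *fitting* means: recovering a function that maps the features to the labels.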

With all that theory laid out, let's define the **features** and **labels** for our prediction. 

In our case, we want to estimate the actual energy price using the generation data, so let's extract the corresponding columns:

In [365]:
input_columns = data_energy.columns.to_list() 
input_columns.remove('price actual')    # Inputs are all columns except for the 'price actual'
label_columns = ['price actual']        # Labels is the 'price actual' column

In [366]:
inputs = data_energy[input_columns].copy()
labels = data_energy[label_columns].copy()

For now we don't modify the **features** we have in the dataset. Let's see what results we get with almost *raw* data.


---

**Basics of Scikit Learn**

<img src="https://raw.githubusercontent.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/refs/heads/main/images/Scikit_learn_logo_small.svg" height="200">


Before we start processing data with the models, we should briefly discuss an amazing Python package called **Scikit Learn**. It is a very robust and surprisingly easy-to-use library that contains a plethora of basic ML algorithms and data transformations. As with most common libraries, it has an extensive documentation website that can be found [here](https://scikit-learn.org/stable/). It was first introduced in 2007, and obviously the ML and DL landscape has changed significantly since then.

However, **Scikit Learn** (or sklearn) is to this day often used by students and professionals alike thanks to its simplicity and efficiency, making it **a great entry point into the world of ML**.

The characteristic trait of sklearn is a very rigid module.Class.function structure, where each model or data transformer is loaded as a Python class and the various operations are performed by calling its functions. Some of the most common functions include:

- `.fit(X, y)` - fits a model to a dataset
- `.transform(X)` - transforms a dataset using a fitted transformer (can be combined with `.fit()` via `.fit_transform()`)
- `.predict(X)` - predicts the values for a given input
- `.metric(y_1, y_2)` - calculates an error metric between y_1 and y_2, where `.metric()` stands for the selected error type, e.g. `.mean_absolute_error(y_1, y_2)`

Of course, this list is not exhaustive. With that, let's import some submodules from sklearn package:

In [367]:
from sklearn import preprocessing, neighbors, ensemble, tree, linear_model
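
The calls listed above fit together as in the following sketch. It runs on synthetic data invented for the example, and it additionally imports the `metrics` submodule, which the lesson only loads later; treat it as an illustration of the pattern, not as part of the project code.

```python
import numpy as np
from sklearn import linear_model, metrics, preprocessing

rng = np.random.default_rng(0)

# Synthetic inputs and labels, invented for illustration
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=100)

# Transformer: learn the scaling parameters and apply them in one call
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)

# Model: fit on inputs and labels, then predict
model = linear_model.LinearRegression()
model.fit(X_scaled, y)
y_pred = model.predict(X_scaled)

# Metric: a plain function that compares two arrays
mae = metrics.mean_absolute_error(y, y_pred)
print(f'MAE: {mae:.3f}')
```

Every transformer and model in the library follows the same pattern, so swapping `LinearRegression` for another regressor usually changes only one line.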

 
 ---


### 4. Model development

4.1. Data Preprocessing

Before we feed the data into the model there is one important transformation that we shouldn't forget: **scaling**.

Each column of our dataframe represents one **feature** of our inputs, and quite naturally each feature takes values from a certain range (or mathematically, they are sampled from a certain population). 

For example, the range of values in the `'generation nuclear'` column is different from that of `'generation fossil hard coal'`, because the installed nuclear capacity is different from that of hard coal. We understand that, but the model might have a difficult time understanding why some features are bigger than others, and many models will naturally treat the bigger features as more important. In this way, we introduce **bias** into model fitting.

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/hist_fossil_fuels.png?raw=true" height="200">
<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/hist_nuclear.png?raw=true" height="200">


If we want the model to avoid this bias in fitting, we have to **scale** the input data. The two most common methods are:
 
- **min-max scaling**
- **standard scaling**


If you are interested, you can find more mathematical explanations in the appendix. In our case, we will use the `StandardScaler()` class from sklearn to scale the values.

In [368]:
scaler = preprocessing.StandardScaler() # Set up the class instance
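
For reference, both scaling methods boil down to one line of arithmetic each. The sketch below applies them to a toy feature with plain numpy; the values are invented, and the standard deviation is the population one (numpy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy feature with an arbitrary range
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling maps the values onto the interval [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling subtracts the mean and divides by the standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)    # 0, 0.25, 0.5, 0.75, 1
print(x_standard)  # zero mean, unit standard deviation
```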

Now, use `.fit_transform()` to transform our inputs. We do not have to transform the labels.

> **NB:** The output of the scaler comes in the form of a numpy array object. The numpy package will be explained further down the line.

In [369]:
inputs_scaled = scaler.fit_transform(inputs) # Get the inputs
labels_scaled = labels.values.ravel()        # Labels are not scaled, only flattened to a 1D array

The next step is to divide the inputs and the labels into train and test subgroups. As the names imply, we use the first subgroup to train the model and the second to test it. The test subgroup must remain unseen during training.

Typically the division is set to **70-30** or **80-20**. In time series data, we usually cut off the newest part of the dataset as the test set:

In [370]:
n = len(inputs) 
cutoff = int(0.7*n) # Set the cutoff threshold at 70% of the length of the input array

train_inputs = inputs_scaled[:cutoff]   # Cut before
train_labels = labels_scaled[:cutoff]   # Cut before

test_inputs = inputs_scaled[cutoff:]    # Cut after
test_labels = labels_scaled[cutoff:]    # Cut after
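
One refinement worth knowing: when a scaler is fitted on the full dataset, statistics of the test period leak into the training inputs. A common alternative is to split first and fit the scaler on the training part only. The sketch below shows the idea on synthetic data; the `toy_` names are hypothetical and this is not the workflow used in the rest of this lesson.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)

# Synthetic stand-in for a time-ordered input table (rows = time steps)
toy_inputs = rng.normal(loc=100.0, scale=20.0, size=(1000, 4))

# Chronological 70-30 split, done before any scaling
toy_cutoff = int(0.7 * len(toy_inputs))
toy_train = toy_inputs[:toy_cutoff]
toy_test = toy_inputs[toy_cutoff:]

# The scaler learns mean and standard deviation from the training part only
toy_scaler = preprocessing.StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# The test part is transformed with the training statistics, never refitted
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.shape, toy_test_scaled.shape)
```

On a dataset whose statistics are stable over time the difference is small, but the habit matters once the test period drifts away from the training period.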

You can visualise the division with the cell below:

In [None]:
plt.plot(np.arange(0, cutoff), train_inputs[:, 0])
plt.plot(np.arange(cutoff, n), test_inputs[:, 0])
plt.show()

4.2. Training models

Once the data is set up, the next step is usually benchmarking some of the common models that are used to solve similar problems.

In theory, some knowledge of the underlying models is necessary to set them up and use them properly. In practice, a lot can be done without knowing the theory behind the models: we can treat them as **black boxes**.

The general goal of this lesson is to get you familiar with the *structure* of the project, not with the ins and outs of the models. So for now we will just proceed with some selected models and see the results. However, if you want to learn more about the models that we use here, feel free to check out the appendix where you can find details on how each model functions.

We can declare each model in the same way we set up the scaler:

In [373]:
lr = linear_model.LinearRegression()

In the parentheses, you can tweak some parameters (also called **hyperparameters**), but for now let's keep the default options.

To train the model, we can call the `.fit()` method. As this is a *supervised* model, we specify the train inputs and labels. <br> *It might take a minute, but it depends on the model.*

In [None]:
lr.fit(X=train_inputs, y=train_labels)

Finally, we can call the `.predict()` method with test inputs to see how the model performs with unseen data:

In [375]:
results_lr = lr.predict(test_inputs)

We can quickly visualise the results with a plot (we can declare it in a function to quickly reuse it later):

In [None]:
def plot_results(results: np.ndarray, model_name: str, x_range=(0, 520)) -> None:
    plt.figure(figsize=(8, 4))

    # Plots
    plt.plot(test_labels, label='True values')
    plt.plot(results, label=model_name)

    # Visuals
    plt.xlim(x_range)
    plt.xlabel('Test sample')
    plt.ylabel('Price (€/MWh)')
    plt.grid(which='major', alpha = 0.5)

    # Tidy up
    plt.legend()
    plt.tight_layout()
    plt.show()

plot_results(results_lr, 'LR')

And that's it! We examined the data, explored some features, then picked and trained a model, and displayed the results. **Good work!**

However, we are not done yet. Let's benchmark some other common simple models, and get some detailed metrics. 

Let's quickly save the things that we did as functions:

In [377]:
def train_and_pred(model, X_train, y_train, X_test):
    """Fit the model on the training data and return its predictions for X_test."""

    # Fit the model
    model.fit(X_train, y_train)
    # Predict with the model
    y_pred = model.predict(X_test)

    return y_pred

Now let's train several predictors to check how they perform on our data:

- Lasso Regression

In [378]:
lasso = linear_model.Lasso()

results_lasso = train_and_pred(lasso, train_inputs, train_labels, test_inputs)

- Decision Tree regressor

In [379]:
dtr = tree.DecisionTreeRegressor()
results_dtr = train_and_pred(dtr, train_inputs, train_labels, test_inputs)

**Exercise:** Try to implement the K-Nearest Neighbors regressor and save the results as `results_knn`. You can find the syntax reference [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

- K-Nearest Neighbors

Let's collect the models' results in a dictionary: <br>
*You can add another entry if you did the exercise*

In [380]:
results = {
    'LR': results_lr,
    'Lasso': results_lasso,
    'DTR': results_dtr,
}

### 5. Results

Once the models are trained, we can check how they perform using various metrics. Let's load the submodule:

In [381]:
from sklearn import metrics

The most common metrics in regression problems are:

* the mean absolute error: $ MAE = \frac{1}{n}\sum_{i=1}^{n} |y_{true,i}-y_{pred,i}| $
* the root mean square error: $ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_{true,i}-y_{pred,i})^2}$

where $y_{true}$ is the vector of true values (or `test_labels` in our case), $y_{pred}$ is the vector of predicted values for a given model, and $n$ is the number of test samples.

> **NB:** You can also get the percentage versions of the metrics by dividing each error by the corresponding $y_{true}$ value. Be careful though! It is dangerous to do so when the labels are close to zero, as the error might skyrocket even for accurate predictions!

Let's define a function that takes the results and the labels and outputs the metrics:

In [382]:
def print_results(model:str, results:np.ndarray, labels:np.ndarray) -> None:
    
    # Calculate the metrics using the labels and the results
    mae = metrics.mean_absolute_error(labels, results)
    rmse = np.sqrt(metrics.mean_squared_error(labels, results))
    mape = np.mean(np.abs((labels - results) / labels))  # Relative error per sample, then the mean
    
    # Print out the results
    print(f'Results for model {model}: \n- MAE:\t{mae:.02f}\n- RMSE:\t{rmse:.02f}\n- MAPE:\t{mape*100:.02f}%')
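
To see what the three numbers mean, here is a hand-checkable sketch on four invented values, using plain numpy instead of the sklearn helpers. The arrays are assumptions made for the example.

```python
import numpy as np

# Toy values, invented for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 36.0])

errors = y_true - y_pred                 # [-2, 2, -3, 4]

mae = np.mean(np.abs(errors))            # (2 + 2 + 3 + 4) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((4 + 4 + 9 + 16) / 4)
mape = np.mean(np.abs(errors / y_true))  # Relative error per sample, then the mean

print(f'MAE:  {mae:.2f}')          # 2.75
print(f'RMSE: {rmse:.2f}')         # 2.87
print(f'MAPE: {mape * 100:.2f}%')  # 12.50%
```

The RMSE is slightly larger than the MAE because squaring gives more weight to the largest error.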

**Exercise:** Try to calculate the normalized MAE, defined as the MAE value divided by the average output value.

Now let's see the results:

In [None]:
for name, model in results.items():
    print_results(model=name, results=model, labels=test_labels)

We can also try to plot them:

In [None]:
plot_results(results_lasso, 'Lasso')

Try to plot the other models. *Can you create a plot that has all of them?*

Which model performs the best?

And that's it! You developed your first ML model. An error of roughly 20% is not that great though, so we might have to go back through the loop to improve our results...

---


## Going back through the loop

Now that the models are trained and benchmarked comes the hard part: we have to evaluate our work. If the performance is satisfactory, we are done. Unfortunately, our models' performance is not that great.

**Because of that, we have to take a few steps backwards...**

One idea is to try out different models: maybe some would work better with our dataset. But as we've seen, the difference in performance from model to model is not that big.

More generally, to boost the performance we have several options, such as:

- explore and benchmark different models, and check whether a larger or a smaller model would suit the problem better,
- explore the data and find new relations between features,
- explore the problem more deeply, relate it to real life, and understand the natural correlations between phenomena in order to add new features.

For example, understanding what drives the energy price in real life leads to better predictive models.

For now, let's experiment by adding some features to the model:

### Aggregated and mixed features

Sometimes simple mathematical operations on existing features can create new ones that are stronger. Let's check the correlation table again:

In [None]:
data_energy.corr()['price actual'].sort_values(ascending=False)

Some fossil fuels push the price up, while the renewables tend to reduce it.

Let's create new columns with the sums of the fossil and the alternative sources to check their total impact:

In [387]:
# Create lists with all the fossil fuels

fossil_list = [
    'generation fossil hard coal',
    'generation fossil gas',
    'generation fossil brown coal/lignite',
    'generation fossil oil']

# Create lists with all the alternative fuels

alternatives_list = [
    'generation other', 
    'generation other renewable',
    'generation solar', 
    'generation hydro water reservoir', 
    'generation nuclear', 
    'generation hydro run-of-river and poundage', 
    'generation biomass',
    'generation wind onshore',
    'generation hydro pumped storage consumption']

data_energy['generation fossil total'] = data_energy[fossil_list].sum(axis=1)
data_energy['generation alternatives total'] = data_energy[alternatives_list].sum(axis=1) 

Let's also calculate their share in the total energy mix at a given hour (we simply divide each summed generation by the total load):

In [388]:
data_energy['generation alternatives share'] = data_energy['generation alternatives total'] / data_energy['total load actual']
data_energy['generation fossil share'] = data_energy['generation fossil total'] / data_energy['total load actual']

Now let's check the impact of the new features we added:

In [None]:
data_energy.corr()['price actual'].sort_values(ascending=False)

Very nice, the `'total'` and `'share'` columns have quite high correlations with the price. Since the aggregates now carry this information, we drop the individual source columns and move on to time-based features.

In [390]:
data_energy.drop(columns=fossil_list+alternatives_list, inplace=True)

### Time features

Adding some time-based features might boost the model's performance if there are some linear temporal dependencies in our data. Let's quickly add some time-based features using our fancy datetime index:

In [391]:
data_energy['hour'] = data_energy.index.hour            # Adds hour column
data_energy['day_of_week'] = data_energy.index.weekday  # Adds weekday column
data_energy['month'] = data_energy.index.month          # Adds month column

To check that what we added is correct, we can quickly draw plots of the average price per hour, per week day and per month using grouping:

In [None]:
fig, ax = plt.subplots(ncols=3, figsize=(12, 3), sharey=True)

ax[0].plot(data_energy.groupby('hour')['price actual'].mean())
ax[0].set_xlabel('Hour')
ax[1].plot(data_energy.groupby('day_of_week')['price actual'].mean())
ax[1].set_xlabel('Day of week')
ax[2].plot(data_energy.groupby('month')['price actual'].mean())
ax[2].set_xlabel('Month')

ax[0].set_ylabel('Price (€/MWh)')

plt.show()

The trends are very clear!

But... they are not exactly linear, so the model might have problems finding the correct weights for them. This is why categorical features are often introduced.

### Categorical features

As the name implies, the categorical features specify a certain **category** to which each input belongs.

Let's consider for example the days of the week: we can clearly see that the energy price is quite constant throughout the workdays, but decreases on Saturdays and Sundays.

To help the model use that information, we can create a new column where a **category** is added:

In [393]:
def is_workday(X: pd.Series) -> list:
    weekdays_list = []

    for day in X:
        if day < 5:  # We start the count at 0: Monday = 0, ..., Friday = 4
            weekdays_list.append('weekday')
        elif day == 5:  # Saturday
            weekdays_list.append('sat')
        else:  # Sunday (6)
            weekdays_list.append('sun')
    
    return weekdays_list

Let's run the function and check the result:

In [None]:
data_energy['weekdays'] = is_workday(X = data_energy.day_of_week)
data_energy.head()
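
The same labelling can also be written without an explicit loop. The sketch below is one possible alternative using `Series.map` on a toy series of day numbers; the lookup table `day_to_category` is a hypothetical helper introduced for the example.

```python
import pandas as pd

# pandas counts weekdays from Monday = 0 to Sunday = 6
days = pd.Series([0, 1, 2, 3, 4, 5, 6])

# Hypothetical lookup table from day number to category label
day_to_category = {5: 'sat', 6: 'sun'}

# Days missing from the table become NaN, which fillna turns into 'weekday'
categories = days.map(day_to_category).fillna('weekday')

print(categories.to_list())
```

For a dataset of this size both versions are fast enough; the vectorised form mainly pays off on much longer series.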

We can do the same for the hours, as the trend is again highly nonlinear. The price is:

- higher in the business and rush hours
- lower in the middle of the day (siesta time)
- even lower at night

Let's create another category that uses that:

In [395]:
def is_rush_hour(X: pd.Series) -> list:
    hours_list = []

    for hour in X:
        if (8 < hour < 13) or (17 < hour < 21):  # Hours 9-12 and 18-20
            hours_list.append('rush_hour')
        elif 13 <= hour <= 17:  # Hours 13-17
            hours_list.append('siesta_hour')
        else:  # Remaining hours: 21-23 and 0-8
            hours_list.append('night_hour')

    return hours_list

Let's check:

In [None]:
data_energy['rush_hours'] = is_rush_hour(data_energy.hour)
data_energy.head()

Great!

However, we can't just feed the category labels to the model. It would not know what to do with text.

For that reason, we have to perform **encoding**, and more specifically One-Hot encoding.

One-Hot encoding transforms the input column into as many new columns as there are categories. Each column corresponds to one category. For each row, the column of the correct category is marked with 1 (hot) and the rest are left as 0 (cold).


<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/one_hot.png?raw=true" height="300">


This method allows the model to give independent weights to different categories.
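
As a quick illustration of what the encoded table looks like, the sketch below uses the pandas helper `pd.get_dummies` on a four-row toy column. This is only meant to show the idea on invented data; it is not the encoder used in the lesson.

```python
import pandas as pd

# Toy categorical column, invented for illustration
toy = pd.DataFrame({'weekdays': ['weekday', 'sat', 'sun', 'weekday']})

# One new 0/1 column per category; each row has exactly one 1
one_hot = pd.get_dummies(toy['weekdays'], prefix='weekdays', dtype=int)

print(one_hot)
```

The three resulting columns are named after the categories, and summing any row gives 1, since each sample belongs to exactly one category.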

Fortunately, there is a `sklearn` implementation of this method. Let's declare it and transform the categorical columns from our data:

In [397]:
encoder = preprocessing.OneHotEncoder(handle_unknown='ignore', categories='auto', sparse_output=False) # Load the engine
encoder.set_output(transform='pandas')                                                                 # Set the output to a pandas df
inputs_categorical = encoder.fit_transform(data_energy[['weekdays', 'rush_hours', 'month']])           # Transform the columns

We end up with a table like the one above:

In [None]:
inputs_categorical

Now, let's remove the 'text' category columns from our df. We don't need them anymore.

In [None]:
data_energy.drop(columns=['weekdays', 'rush_hours'], inplace = True)
data_energy.head()

### Auxiliary features

*Extra topic*

We can support our predictions by importing and processing additional data from other sources. Of course this significantly adds to the workload, but it is often necessary to combine various data sources to get good prediction quality.

At the beginning of the lesson we imported a secondary dataset that contains weather data for 5 big cities in Spain. Let's check it out:

In [None]:
data_weather

Extract the names of the cities:

In [None]:
data_weather.city_name.unique()

There are 5 cities: Madrid, Barcelona, Valencia, Seville, and Bilbao. For convenience, let's move each city into a separate `Dataframe`:

In [402]:
data_weather_valencia = data_weather[data_weather['city_name'] == 'Valencia'].copy()
data_weather_madrid = data_weather[data_weather['city_name'] == 'Madrid'].copy()
data_weather_bilbao = data_weather[data_weather['city_name'] == 'Bilbao'].copy()
data_weather_barcelona = data_weather[data_weather['city_name'] == ' Barcelona'].copy()  # Note the leading space in this filter value
data_weather_seville = data_weather[data_weather['city_name'] == 'Seville'].copy()

And now let's put them into a single dictionary:

In [403]:
dictionary_data_weather = {
    'Valencia': data_weather_valencia, 
    'Madrid': data_weather_madrid, 
    'Bilbao': data_weather_bilbao, 
    'Barcelona': data_weather_barcelona, 
    'Seville': data_weather_seville
}

As they are time series datasets, we should remember to add the date time index. We can do it in a loop:

In [404]:
for city, data in dictionary_data_weather.items():
    
    # Add TS index and remove the column
    data.index = pd.to_datetime(data['dt_iso'], utc=True)
    data.drop(columns='dt_iso', inplace=True)

    # Update the data
    dictionary_data_weather[city] = data

Studying all 5 datasets at the same time might be difficult, so let's focus only on the Madrid data for now.

We can connect the Madrid dataset with the energy prices from the previous dataset using the `.merge()` method. Merging is very useful when we want to make sure that the resulting dataset has the desired shape. The datasets are connected through a key (a column or the index) that is shared between both sources.

We have to specify the shared key (in our case the time series index) and the merging method (`'inner'` keeps only the rows whose key appears in both tables, `'outer'` keeps all rows from both tables, `'left'` keeps all rows from the left table, `'right'` is the opposite of left).

<img src="https://datacomy.com/data_analysis/pandas/merge/types-of-joins.png" height="200">


At first glance merging might seem quite complex. But don't worry, we won't use it much. If you want more information on methods of connecting two data sources, have a read [here](https://realpython.com/pandas-merge-join-and-concat/).
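To get a feeling for the four merging methods, here is a small sketch on two toy tables whose time indexes only partly overlap. The names `left`, `right` and `row_counts` and all the values are made up for this illustration.

```python
import pandas as pd

# Two toy tables indexed by time; only 01:00 and 02:00 appear in both
left = pd.DataFrame(
    {'temp': [10.0, 11.0, 12.0]},
    index=pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 02:00']),
)
right = pd.DataFrame(
    {'price': [50.0, 55.0, 60.0]},
    index=pd.to_datetime(['2024-01-01 01:00', '2024-01-01 02:00', '2024-01-01 03:00']),
)

# Merge on the index with each method and record the number of rows
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = left.merge(right, how=how, left_index=True, right_index=True)
    row_counts[how] = len(merged)
    print(f"{how:>5}: {len(merged)} rows")
```

On this toy data the four methods return 2, 4, 3 and 3 rows respectively. The rows that exist in only one table are filled with `NaN` in the columns coming from the other table.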

After merging with the price actual series it appears at the end of our dataset:

In [None]:
data_weather_madrid.merge(data_energy['price actual'], how='inner', left_index=True, right_index=True)

Let's save it as a new variable:

In [406]:
data_weather_madrid_with_price = data_weather_madrid.merge(data_energy['price actual'], how='inner', left_index=True, right_index=True)

Now that we have the price included in the dataset, we can check if it has any correlation with the weather parameters:

In [None]:
data_weather_madrid_with_price.corr(numeric_only=True)['price actual'].sort_values(ascending=False)

The correlations are not that strong, but in reality both the temperature and the wind will probably have some impact on the results. Let's extract them. As we have 5 cities we can either get them separately, or aggregated. 

Let's try aggregated first.

Before aggregating, let's check how many observations we have for each city:

In [None]:
for city, data in dictionary_data_weather.items():
    
    print(f"There are {data.shape[0]} observations about city: {city}.")

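And remove the duplicated timestamps. The timestamps now live in the index (we dropped the `dt_iso` column earlier), so we filter on the index:

In [None]:
for city, data in dictionary_data_weather.items():

    # Keep only the first observation for each timestamp
    clean_data = data[~data.index.duplicated(keep='first')]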

    dictionary_data_weather[city] = clean_data

    print(f"There are {clean_data.shape[0]} observations about city: {city}.")
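To see how dropping repeated timestamps from an index behaves, here is a minimal sketch on toy data. The names `toy_data`, `is_repeat` and `toy_clean` and the values are made up for this illustration.

```python
import pandas as pd

# Toy time series in which the 01:00 timestamp appears twice
toy_index = pd.to_datetime(
    ['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 01:00', '2024-01-01 02:00'],
    utc=True,
)
toy_data = pd.DataFrame({'temp': [10.0, 11.0, 99.0, 12.0]}, index=toy_index)

# Boolean mask: True for every repeated occurrence of a timestamp
is_repeat = toy_data.index.duplicated(keep='first')

# Keep only the first observation of each timestamp
toy_clean = toy_data[~is_repeat]
print(toy_clean)
```

With `keep='first'` the second 01:00 row (the one with `temp` equal to 99.0) is the one that gets dropped.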

Now, for the aggregation let's use the city's population as the weight for a weighted average:

In [418]:
pop_madrid = 6_791_667
pop_barcelona = 5_474_482
pop_valencia = 2_522_383
pop_seville = 1_519_639
pop_bilbao = 1_037_847

pop_total = pop_madrid + pop_barcelona + pop_valencia + pop_seville + pop_bilbao

Now we can define the weight by dividing the city's population by the total population:

In [147]:
dictionary_weights_temp = {
    'Valencia': pop_valencia / pop_total, 
    'Madrid': pop_madrid / pop_total, 
    'Bilbao': pop_bilbao / pop_total, 
    'Barcelona': pop_barcelona / pop_total, 
    'Seville': pop_seville / pop_total
}

And we can create the `'temp'` and `'wind_speed'` columns as a weighted average:

In [419]:
inputs_weather = pd.DataFrame()

inputs_weather['temp'] = \
    dictionary_data_weather['Valencia']['temp'] * dictionary_weights_temp['Valencia'] + \
    dictionary_data_weather['Madrid']['temp'] * dictionary_weights_temp['Madrid'] + \
    dictionary_data_weather['Bilbao']['temp'] * dictionary_weights_temp['Bilbao'] + \
    dictionary_data_weather['Barcelona']['temp'] * dictionary_weights_temp['Barcelona'] + \
    dictionary_data_weather['Seville']['temp'] * dictionary_weights_temp['Seville']

inputs_weather['wind_speed'] = \
    dictionary_data_weather['Valencia']['wind_speed'] * dictionary_weights_temp['Valencia'] + \
    dictionary_data_weather['Madrid']['wind_speed'] * dictionary_weights_temp['Madrid'] + \
    dictionary_data_weather['Bilbao']['wind_speed'] * dictionary_weights_temp['Bilbao'] + \
    dictionary_data_weather['Barcelona']['wind_speed'] * dictionary_weights_temp['Barcelona'] + \
    dictionary_data_weather['Seville']['wind_speed'] * dictionary_weights_temp['Seville']

inputs_weather['wind_speed_squared'] = ...
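The long sums above can also be written as a loop over the cities. The following is only a sketch on toy data with two hypothetical cities `'A'` and `'B'`; the names `toy_cities`, `toy_population`, `toy_weights` and `toy_inputs` and all the numbers are made up for this illustration.

```python
import pandas as pd

# Two toy cities sharing the same time index
toy_index = pd.to_datetime(['2024-01-01 00:00', '2024-01-01 01:00'], utc=True)
toy_cities = {
    'A': pd.DataFrame({'temp': [10.0, 12.0], 'wind_speed': [2.0, 4.0]}, index=toy_index),
    'B': pd.DataFrame({'temp': [20.0, 22.0], 'wind_speed': [6.0, 8.0]}, index=toy_index),
}

# Population-based weights that sum up to 1
toy_population = {'A': 3_000_000, 'B': 1_000_000}
total = sum(toy_population.values())
toy_weights = {city: population / total for city, population in toy_population.items()}

# Weighted average of each column over all the cities
toy_inputs = pd.DataFrame(index=toy_index)
for column in ['temp', 'wind_speed']:
    toy_inputs[column] = sum(
        toy_cities[city][column] * toy_weights[city] for city in toy_cities
    )

print(toy_inputs)
```

City `'A'` has three times the population of `'B'`, so its weight is 0.75 and the averages land closer to its values. The loop form also makes it easy to add another column without copying five more lines.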

We save it as a separate dataframe and concatenate with the original one:

In [None]:
inputs_weather

In [None]:
data_energy = pd.concat([data_energy, inputs_weather], axis=1)
data_energy.dropna(inplace=True)
data_energy.head()

Wind speed has a pretty high negative correlation with the price. It's quite straightforward: the more it blows, the higher the wind energy production and the lower the price. Also, strong winds in summer probably decrease the energy load from cooling.

In [None]:
data_energy.corr()['price actual'].sort_values(ascending=True)

### Retrain

Now, with new features in hand, we can redo the initial calculations. 

The steps should be all familiar.

First, we separate the inputs from the labels:

In [430]:
input_columns = data_energy.columns.to_list()
input_columns.remove('price actual')
label_columns = ['price actual']

inputs = data_energy[input_columns].copy()
labels = data_energy[label_columns].copy()

Then, we scale the inputs:

In [431]:
scaler = preprocessing.StandardScaler()
scaler.set_output(transform='pandas')

inputs_scaled = scaler.fit_transform(inputs)
labels_scaled = labels.values.ravel()

... and add the One-Hot categorical table...

In [432]:
inputs_scaled = inputs_scaled.merge(inputs_categorical, how='inner', left_index=True, right_index=True)

And separate the train and test datasets:

In [433]:
n = len(inputs) 
cutoff = int(0.7*n)

X_train = inputs_scaled[:cutoff]
y_train = labels_scaled[:cutoff]

X_test = inputs_scaled[cutoff:]
y_test = labels_scaled[cutoff:]

Let's check the final shape of the training dataset:

In [None]:
inputs_scaled

And finally, we can retrain the model. 

**Exercise:** Pick any model you want, train it with `.fit()` and predict with `.predict()`. Print the results with the `print_results()` function and plot them with the `plot_results()` function that we defined before.

In [None]:
# Write your code here

And we are done! We did a lot of work for quite a small improvement. Sadly, that is often the reality of Data Science... 

# Appendix - Theoretical background on the models used

### Feature Scaling

- Min-Max scaling - rescales the values to a selected range (usually 0-1). For the 0-1 range the equation takes the form:

$$
    X_{scaled} = \frac{X - \min(X)}{\max(X) - \min(X)}
$$

- Standard scaling - rescales the values by removing the mean and scaling to unit variance (in statistical terms we calculate the z-scores):

$$
    X_{scaled} = \frac{X - \mu_x}{\sigma_x}
$$
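Both formulas can be checked by hand on a small array. This is a minimal NumPy sketch on made-up numbers; the names `X_minmax` and `X_standard` are chosen only for this illustration.

```python
import numpy as np

# Toy feature values
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-Max scaling to the 0-1 range
X_minmax = (X - X.min()) / (X.max() - X.min())

# Standard scaling: subtract the mean and divide by the standard deviation
X_standard = (X - X.mean()) / X.std()

print(X_minmax)
print(X_standard.mean(), X_standard.std())
```

After Min-Max scaling the smallest value becomes 0 and the largest becomes 1. After standard scaling the mean is (numerically) zero and the standard deviation is one.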

### Linear regression

You probably all know Linear Regression by now. It is a very common statistical model that is often used as an introductory model to Machine Learning.

In Linear Regression the relationship of the output of the model (or the prediction) to the descriptive variables (or the inputs) is modeled as a linear response, which mathematically can be written as:

$$
    y_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} + \epsilon_i
$$

where $y_i$ is the $i$-th output, $x_{i1}$ to $x_{ip}$ are its features, $w_0$ is the intercept, and $w_1$ to $w_p$ are the corresponding learned weights. The term $\epsilon_i$ is random noise that has to be taken into account in modeling. In matrix terms (with a column of ones added to $\mathbf{X}$ for the intercept) the equation becomes simpler:

$$
    \mathbf{y} = \mathbf{X W} + \mathbf{\epsilon}
$$

Fitting such a model becomes a task of finding the parameter matrix $\mathbf{W}$ so that the residual $ \mathbf{\epsilon} = \mathbf{y} - \mathbf{X W}$ is as small as possible.

The classic method of solving this task is the Ordinary Least Squares (OLS) method. In practice, it minimizes the sum of squares of the differences between the observed dependent variable (the outputs) and the modeled outputs. The objective function for the minimization is given as:

$$ 
    \min_{\mathbf{W}} S(\mathbf{W}), \qquad S(\mathbf{W}) = \lVert \mathbf{y} - \mathbf{X W} \rVert ^ 2
$$
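As a small illustration, the least squares problem can be solved directly with NumPy. This is a sketch on synthetic data, assuming a single feature and a true relationship $y = 2 + 3x$ plus a little noise; the numbers are made up for the example.

```python
import numpy as np

# Synthetic data: y = 2 + 3*x plus a little random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Design matrix with a column of ones for the intercept w_0
X = np.column_stack([np.ones_like(x), x])

# Least squares solution: the W that minimizes ||y - XW||^2
W, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(W)
```

The two entries of `W` are the estimated intercept and slope, and they should come out close to the values 2 and 3 used to generate the data.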

Often it's easier to visualise the procedure as fitting a line so that the sum of squared vertical distances between the line and the observed points is the smallest possible:

<img src="https://miro.medium.com/v2/resize:fit:1280/1*nhGPRU12caIw7NK5Rr3p-w.gif" height="400">


Image by [Logan Yang](https://medium.com/swlh/from-animation-to-intuition-linear-regression-and-logistic-regression-f641a31e1caf) 

### Lasso regression

Lasso regression is a modification of Linear regression that introduces $L_1$ regularization. It is often preferred when we have many features, as it tends to produce solutions with fewer non-zero weights.

The only difference from the standard LR is the new penalty term in the objective function, whose strength is controlled by the parameter $\alpha$:

$$ 
    \min_{\mathbf{W}} S(\mathbf{W}), \qquad S(\mathbf{W}) = \frac{1}{2 n_{samples}} \lVert \mathbf{y} - \mathbf{X W} \rVert ^ 2 + \alpha \lVert \mathbf{W} \rVert_1
$$
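To see how the $L_1$ penalty pushes small weights to exactly zero, here is a minimal coordinate-descent sketch for the objective above. It is an illustration written for this appendix, not necessarily the solver used by any particular library. The helper names `soft_threshold` and `lasso_coordinate_descent`, the toy data and the value of `alpha` are all made up, and no intercept is fitted.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; small values become exactly 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iterations=100):
    """Minimize 1/(2n) * ||y - XW||^2 + alpha * ||W||_1, one weight at a time."""
    n_samples, n_features = X.shape
    W = np.zeros(n_features)
    for _ in range(n_iterations):
        for j in range(n_features):
            # Residual with the current contribution of feature j removed
            partial_residual = y - X @ W + X[:, j] * W[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            W[j] = soft_threshold(rho, alpha) / z
    return W

# Toy data: the first feature matters a lot, the second one barely
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

W_ols = np.linalg.lstsq(X, y, rcond=None)[0]
W_lasso = lasso_coordinate_descent(X, y, alpha=0.5)

print("OLS:  ", W_ols)
print("Lasso:", W_lasso)
```

On this toy data ordinary least squares recovers both weights (3.0 and 0.1), while Lasso shrinks the first weight to 2.5 and sets the second one to exactly zero. That is the feature-selection effect described above.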

### K-Nearest Neighbors

K-Nearest Neighbors is a very simple algorithm that is principally used in classification. However, there is also a quite convenient regression variant that is widely used due to its simplicity. KNN defines the output label by the distance to the nearest neighbors of the point in the feature space.

In other words, it calculates the distance between our point and all the other points in the dataset. Then it orders the points in ascending order of distance and keeps the $k$ nearest ones. The output label is estimated from these kept neighbors. Usually, the distance is calculated as the Euclidean norm of the difference $x$ between two points:

$$
    ||x|| = \sqrt{x_1^2 + \dots + x_n^2}
$$



<img src="https://images.datacamp.com/image/upload/v1686762721/image2_a2876c62d1.png" height="300">

Image by [DataCamp](https://www.datacamp.com/tutorial/k-nearest-neighbors-knn-classification-with-r-tutorial)

In regression the output is the mean of the neighbors' outputs.
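The whole regression procedure fits in a few lines of NumPy. This is a minimal sketch on toy data; the function name `knn_regress` and all the values are made up for this illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k):
    """Predict the output for x_query as the mean output of its k nearest neighbors."""
    # Euclidean distance between the query point and every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy training set: three points close to the origin and one far away
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_regress(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
print(prediction)
```

With `k=3` the far-away point at (5, 5) is ignored, so the prediction is the mean of 10, 20 and 30.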



### Decision Trees

A decision tree is again mainly a classification algorithm. In training, the model grows a decision tree with decision nodes and output leaves. When making a prediction we always start at the *root node*. At each division the model asks a question about the sample, e.g. is our fossil generation higher than 20 GW? Is our hydro generation lower than 5 GW? Based on the sequence of answers it arrives at an output *leaf node*.

In classification the quality of the divisions is decided using a metric called *gini impurity*. A node is purer the more uniform it is in terms of class membership. 
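The gini impurity of a node is one minus the sum of the squared class proportions in that node. Here is a minimal sketch in plain Python; the function name `gini_impurity` and the toy labels are made up for this illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = ['high', 'high', 'high', 'high']
mixed_node = ['high', 'high', 'low', 'low']

print(gini_impurity(pure_node))
print(gini_impurity(mixed_node))
```

A node that contains a single class has an impurity of 0, while a node split evenly between two classes has an impurity of 0.5. A good division produces child nodes with lower impurity than their parent.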

In case of regression the division is decided based on the *mean squared error*.

Decision tree models are very prone to overfitting if the tree is not *pruned*. We have to limit the depth of the tree or the number of nodes, or it will eventually grow a separate *leaf node* for each input sample (and that is severe overfitting).

<img src="https://miro.medium.com/v2/resize:fit:1400/1*ElW-ERvIfiV6RSbs74RO_A.png" height="300">

Image by Alan Jeffares via [Medium](https://towardsdatascience.com/decision-trees-60707f06e836)