<center><img src='../../img/ai4eo_logos.jpg' alt='Logos AI4EO MOOC' width='80%'></img></center>

<hr>

<a href='https://www.futurelearn.com/courses/artificial-intelligence-for-earth-monitoring/1/steps/1280510' target='_blank'><< Back to FutureLearn</a><br>

# Understanding Machine Learning Workflows

*by Julia Wagemann, MEEO S.r.l.*

<hr>

## Watch the video tutorial

In [None]:
from IPython.display import HTML
HTML('<div align="center"><iframe src="https://player.vimeo.com/video/636104119?h=47a1dff635" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen align="middle"></iframe></div>')     

<hr>

## Introduction

This notebook gives you an overview of the individual steps of a Machine Learning workflow. Before we start with the domain-specific applications, it provides a general introduction to the principles of Machine Learning, the commonly used nomenclature and a typical Machine Learning workflow.



A Machine Learning workflow can be divided into five main steps:


<img src='../../img/ml_workflow.png' alt='Logos AI4EO MOOC' width='60%'></img>



In the following sections, we explain each of the five workflow steps in detail.
If applicable, we link to domain-specific examples that will be offered in the domain-specific weeks of this course.


<hr>

## Overview

* [1 - Understanding your Machine Learning problem](#understand)
* [2 - Data preparation](#data_preparation)
* [3 - Building, training and testing the model](#model_training)
* [4 - Model evaluation](#model_evaluation)
* [5 - Inference](#inference)

<hr>

<br>

## <a id='understand'></a>1. Understanding your Machine Learning problem

<img src='../../img/ml_workflow_1.png' alt='Logos AI4EO MOOC' width='50%'></img>

The first step in every Machine Learning project is to get a clear understanding of the problem you would like to solve. Based on the problem, you then select a Machine Learning (ML) model. A good understanding of your problem is key to the success of your project. The ML model you select is only an algorithm; the accuracy of the prediction or classification depends on your understanding of the ML problem and on the data you provide to the ML model.

### Artificial Intelligence vs. Machine Learning vs. Deep Learning

Before we dive deeper into Machine Learning categories, it is helpful to better understand the difference between `Artificial Intelligence`, `Machine Learning` and `Deep Learning`.

<img style='float:left;' src='../../img/ai_ml_categories.png' alt='Logos AI4EO MOOC' width='25%'></img>

<br>
<br>

* **Artificial Intelligence**: a science that studies ways to build intelligent programs and machines that can creatively solve problems.

* **Machine Learning**: a subset of Artificial Intelligence that gives a system the ability to learn and improve from experience without being explicitly programmed. ML makes use of a variety of algorithms to solve problems.

* **Deep Learning**: a subset of Machine Learning in which Neural Networks are the backbone. The difference between a simple neural network and a deep learning algorithm is the number of node layers, i.e. the depth. A neural network with more than three node layers (including the input and output layers) can be considered a deep learning algorithm.

Below, we introduce the different categories of Machine Learning and list popular ML algorithms for each category.

### Machine Learning Categories

Machine Learning can be roughly divided into four basic approaches: `Supervised Learning`, `Unsupervised Learning`, `Reinforcement Learning` and `Deep Learning / Neural Networks`. The categories overlap and their boundaries are fuzzy, so a particular method can sometimes be hard to place in a single category.

#### Supervised Learning

In supervised learning, your algorithm learns from 'labelled' data. Based on the 'labelled' data, the algorithm learns to assign unseen data (e.g. test data) to specific categories or to predict a value for it. The goal in supervised learning is to predict outcomes for new data, and you know upfront the type of results to expect. Supervised learning algorithms fall into two categories:
* **Classification**: you have a classification problem if your target variable is categorical, e.g. the result can be classified into either class A or class B. Common classification algorithms are:
  * `K-Nearest Neighbor`
  * `Naive Bayes`
  * `Decision Trees / Random Forest`
  * `Support Vector Machine`
  * `Logistic Regression`


* **Regression**: regression is another type of supervised learning that uses an algorithm to model the relationship between dependent and independent variables. Regression models are used when your target variable is numerical, i.e. to predict numerical values from different data points. Note that `Logistic regression`, despite its name, is used for classification. Popular regression algorithms are:
  * `Linear regression`
  * `Polynomial regression`
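
The difference between the two categories can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are installed and uses small synthetic datasets that are made up purely for illustration; the choice of `KNeighborsClassifier` and `LinearRegression` is just one possible pairing from the lists above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# --- Classification: the target variable is categorical (class 0 or class 1) ---
# Synthetic example: two well-separated clusters of 2-D points with known labels.
X_class = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                     rng.normal(3.0, 0.5, size=(50, 2))])
y_class = np.array([0] * 50 + [1] * 50)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_class, y_class)
predicted_classes = classifier.predict(np.array([[0.1, -0.2], [2.9, 3.1]]))

# --- Regression: the target variable is numerical ---
# Synthetic example: y = 2x + 1 plus a small amount of noise.
X_reg = np.linspace(0, 10, 50).reshape(-1, 1)
y_reg = 2.0 * X_reg.ravel() + 1.0 + rng.normal(0.0, 0.1, size=50)

regressor = LinearRegression()
regressor.fit(X_reg, y_reg)
predicted_value = regressor.predict(np.array([[4.0]]))[0]

print("Predicted classes:", predicted_classes)
print("Predicted value for x=4:", round(float(predicted_value), 2))
```

The classifier returns a class label for each new point, while the regressor returns a continuous number.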


<br>

**Domain-specific ML examples using supervised learning algorithms for classification**<br>
>
> **Week 3**
> * [3D2 - Supervised land cover classification with Sentinel-2 data - Decision Tree / Random Forest / Support Vector Machine (SVM)](../../3_land/3D2_land_cover_classification_with_Sentinel-2_data/3D2_supervised_classification_using_Sentinel-2.ipynb)
>
> **Week 4**
> * [4Fa - Using ML for water type identification - Naive Bayes Classification](../../4_ocean/4F_ML_to_differentiate_between_sediment_and_chlorophyll/4Fa_Using_ML_for_water_type_identification.ipynb)
>
> **Week 6**
> * [6C - Global Cloud Classification with CUMULO - Gradient Boosting of Decision Trees](../../6_climate/6C_cumulo_cloud_classification/6C_cumulo_cloud_classification.ipynb)

<br>

#### Unsupervised Learning

In unsupervised learning, your algorithm learns from unlabeled, uncategorized data, i.e. it operates on the data without being trained on known answers. These algorithms discover hidden patterns in data without the need for human intervention. The goal in unsupervised learning is to get insights from large volumes of new data. Unsupervised learning can be divided into three categories:
* **Clustering**: is a data mining technique for grouping unlabeled data based on their similarities or differences.

* **Association**: is another type of an unsupervised learning method that uses different rules to find relationships between variables in a given dataset.

* **Dimensionality Reduction**: is a learning technique used when the number of features (or dimensions) in a given dataset is too high. Dimensionality reduction algorithms reduce the number of data inputs to a manageable size while also preserving the data integrity. This technique is often used in the pre-processing stage to bring the data input to a manageable size.

Common unsupervised learning algorithms are:
  * `Gaussian mixtures`
  * `K-Means Clustering`
  * `Hierarchical Clustering`
  * `Spectral Clustering`
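
As a minimal sketch of clustering, the example below groups unlabeled points with scikit-learn's `KMeans`. The data are synthetic and the number of clusters (`n_clusters=2`) is an assumption made for this illustration; in practice you usually have to choose it yourself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic, unlabeled data: two groups of 2-D points (no labels are passed on).
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(5.0, 0.3, size=(40, 2))])

# The algorithm only sees X and discovers the grouping on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("Cluster sizes:", np.bincount(cluster_ids))
print("Cluster centres:\n", kmeans.cluster_centers_)
```

Note that the cluster numbers (0 or 1) are arbitrary identifiers, not class labels with a meaning.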

<br>

**ML example using an unsupervised learning algorithm**<br>
>
> **Week 3**
> * [3D2 - Unsupervised land cover classification with Sentinel-2 data - K-means clustering](../../3_land/3D2_land_cover_classification_with_Sentinel-2_data/3D2_unsupervised_classification_using_Sentinel-2.ipynb)

<br>

#### Reinforcement Learning

Reinforcement Learning is commonly used in situations where an AI agent, e.g. a self-driving car, must operate in an environment where feedback about good or bad choices is available with some delay.

**ML example of reinforcement learning with Earth Observation data**<br>
>
> **Week 2**
> * [2B - Aesthetics Aware Reinforcement Learning for Image Cropping](../../2_ai4eo/2B_Reinforcement_Learning/2B_reinforcement_learning_image-cropping_training.ipynb)

<br>

#### Deep Learning / Neural Networks

Deep Learning / Neural Networks is a subset of Machine Learning. The main difference to other Machine Learning algorithms is the amount of data the algorithm uses to learn: deep learning models typically learn from much larger datasets. Neural Networks are loosely modelled on the structure and function of the human brain. They consist of artificial neurons, also known as nodes, which are organised in three types of layers: `input layer`, `hidden layer(s)` and `output layer`.

Neural networks can leverage labeled datasets (supervised learning) but can also work with unstructured, raw data (unsupervised learning). The *deep* in Deep Learning refers to the depth of layers in a neural network. A neural network that consists of more than three layers (including input and output layers) can be considered a deep learning algorithm.
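
To make the layer structure more concrete, here is a minimal NumPy sketch of a forward pass through a tiny network with one input layer, one hidden layer and one output layer. The layer sizes, the random (untrained) weights and the `relu` / `softmax` activation functions are assumptions chosen for illustration; the sketch shows only how data flows through the layers, not how a network is trained.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    """Activation used in the hidden layer: negative values become 0."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Turn the raw output scores of each sample into probabilities."""
    z = z - z.max(axis=1, keepdims=True)  # subtract row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical layer sizes: 4 input features, 8 hidden nodes, 3 output classes.
n_inputs, n_hidden, n_outputs = 4, 8, 3

# Randomly initialised weights; training would adjust these values.
W_hidden = rng.normal(size=(n_inputs, n_hidden))
b_hidden = np.zeros(n_hidden)
W_output = rng.normal(size=(n_hidden, n_outputs))
b_output = np.zeros(n_outputs)

def forward(x):
    hidden = relu(x @ W_hidden + b_hidden)          # input layer -> hidden layer
    return softmax(hidden @ W_output + b_output)    # hidden layer -> output layer

batch = rng.normal(size=(5, n_inputs))  # 5 samples with 4 features each
probabilities = forward(batch)
print(probabilities.shape)
```

Adding further hidden layers between the input and output layers is what makes such a network "deeper".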



Common algorithms in Deep Learning are:
* `Convolutional Neural Networks (CNNs)`
* `Recurrent Neural Networks (RNNs)`
* `Multilayer Perceptrons (MLPs)`
* `Self Organizing Maps (SOMs)`

<br>

**ML examples applying Deep Learning algorithms**<br>
>
> **Week 3**
> * [3B - Tile-based classification with EuroSAT data - Sequential Convolutional Neural Network](../../3_land/3B_tile-based_classification_with_EuroSAT_data/3B_tile-based_classification_with_EuroSAT_data_training.ipynb)
> * [3D1 - Soil moisture estimation - Sequential Convolutional Neural Network](../../3_land/3D1_soil_moisture_estimation/3D_soil_moisture_estimation.ipynb)
>
> **Week 4**
> * [4B - Ship classification with Sentinel-1 data - VGG16 Convolutional Neural Network](../../4_ocean/4B_ship_classification_with_Sentinel-1_data/4B_ship_classification_with_sentinel-1_data.ipynb)
> * [4Fb - Using ML to retrieve water quality indicators in coastal areas - Auto-associative Neural Network](../../4_ocean/4F_ML_to_differentiate_between_sediment_and_chlorophyll/4Fb_Using_ML_to_retrieve_water_quality_indicators_in_coastal_waters.ipynb)
>
> **Week 5**
> * [5B - Ozone concentration estimation - Sequential Convolutional Neural Network](../../5_atmosphere/5B_ozone_concentration_estimation/5B_Ozone_concentration_estimation.ipynb)
> * [5C - Physics-based Machine Learning for Copernicus Sentinel-5P Methane Retrieval - Sequential Convolutional Neural Network](../../5_atmosphere/5C_methane_retrievals_using_s5p/5C_Retrieving_methane_from_Sentinel_5P.ipynb)
>
> **Week 6**
> * [6F - ML4Floods - Convolutional Neural Network](../../6_climate/6F_ml4floods/6F_ml4floods_training.ipynb)

<br>

## <a id='data_preparation'></a>2. Data preparation

<img src='../../img/ml_workflow_2.png' alt='Logos AI4EO MOOC' width='50%'></img>

Data preparation, including cleaning and pre-processing, is often the most time-consuming part of a Machine Learning project. Note that many of the examples provided in this course work with already pre-processed data. Data is the foundation of machine learning and can impact the performance, fairness, robustness and scalability of your ML system. In the end, your result depends on the accuracy of your input data. Data in its raw form cannot be used by most Machine Learning algorithms. Depending on your ML problem, the algorithm and the Python package you are using, you have to pre-process the data.

Common pre-processing steps specifically for **Earth Observation** data are:
* **Data labelling**: this is often a very time-intensive manual process required for supervised learning algorithms. You manually select data points or images and assign them a label (or class). The Machine Learning algorithm uses the labelled information to learn to predict these labels (classes / categories) for unseen data.

* **One-hot encoding of labels**: a common way of pre-processing categorical features for Machine Learning. The labels are initially represented as class numbers, e.g. ranging from 0 to 2. In the one-hot encoding process, each class number is converted to a binary vector with a single 1 at the position of the class (e.g. class 1 out of the classes 0 to 2 becomes `[0, 1, 0]`), so that the algorithm does not assume any intrinsic hierarchy or numerical order between the classes.

* **Geo-referencing and subsetting**: depending on which input data you have, this step might be necessary for satellite images to bring them onto a common grid.


* **Resampling**: this step is necessary if your input data have different spatial or temporal resolutions. In resampling, you bring all the input data to the same spatial / temporal resolution.

* **Rearranging and reshaping**: this is a step that is specifically required in Machine Learning with Earth Observation data, as they are multi-dimensional in their nature. In this operation, you reshape the original data structure into a suitable structure for the ML algorithm.


* **Normalisation**: a common pre-processing step in Machine Learning to bring variables that are measured in different units to a common scale. Common normalisation functions are e.g. `minmax` or `mean_std`. `minmax` rescales the data based on its minimum and maximum values, and `mean_std` uses the *mean* and *standard deviation* to normalise the data.

* **Dimensionality reduction**: if your input data is large, you could apply e.g. a clustering algorithm in order to reduce the amount of input data. In this way you also reduce redundancy and collinearity.
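
Three of the steps above (one-hot encoding, reshaping and normalisation) can be sketched with NumPy alone. The helper functions `one_hot`, `minmax` and `mean_std` are written here for illustration and are not taken from a specific package; the tiny "image" with 3 bands and 2 x 2 pixels is made up, and normalising each band separately is one possible design choice.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Convert class numbers (0 .. n_classes-1) to binary vectors."""
    return np.eye(n_classes, dtype=int)[labels]

def minmax(x):
    """Rescale each column (feature) to the range 0 .. 1."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def mean_std(x):
    """Rescale each column (feature) to mean 0 and standard deviation 1."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# One-hot encoding: class numbers 0, 2, 1 become binary vectors.
labels = np.array([0, 2, 1])
encoded = one_hot(labels, n_classes=3)
print(encoded)

# Reshaping: a multi-band image with shape (bands, rows, columns) is
# rearranged into a table with one row per pixel and one column per band.
image = np.arange(3 * 2 * 2, dtype=float).reshape(3, 2, 2)
n_bands = image.shape[0]
pixels = image.reshape(n_bands, -1).T   # shape: (4 pixels, 3 bands)

# Normalisation: bring all bands to a common scale.
scaled = minmax(pixels)
standardised = mean_std(pixels)
print(pixels.shape, scaled.min(), scaled.max())
```

The "one row per sample, one column per feature" layout produced by the reshaping step is the input structure that many Python ML packages, e.g. scikit-learn, expect.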


**Read more on the impact and downstream effects of data issues in ML workflows**
> <a href='https://ai.googleblog.com/2021/06/data-cascades-in-machine-learning.html' target='_blank'>Data Cascades in Machine Learning</a>

<br>

## <a id='model_training'></a>3. Building, training and testing the model

<img src='../../img/ml_workflow_3.png' alt='Logos AI4EO MOOC' width='50%'></img>

After data pre-processing, you can start building your Machine Learning model. The first step is to split your data into `training`, `validation` and `test` data. Next, you build the model by defining its architecture and a set of `hyperparameters`. Afterwards, you train (fit) the model.

### Splitting the data
The first step is to define the input (X) and output (y) variables:
* **Input variable (X)**: the input variable is often referred to as `X` and is the variable / parameter you give to the ML model as input.
* **Output variable (y)**: the output variable is often referred to as `y` and is the variable / parameter you would like to predict with your Machine Learning model.

<br>

The next step is then to split the input and output variables into two or three subsets:
* **Training data**: the training data is used to train the model and is always a greater proportion (e.g. 70% or 80%) compared to the validation and test data subsets.
* **Test data**: test data is a subset of *unseen* data of a smaller proportion (e.g. 20%) used to assess the performance of a trained model.

Alternatively, you can split your training data set and use a proportion for cross-validation.
* **Validation data**: validation data is primarily used to estimate the skill of a machine learning model during the training process, which helps to fine-tune hyperparameters.

scikit-learn's function <a href='https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html' target='_blank'>train_test_split</a> is commonly used to split the data. Each call splits the data into two subsets, so to obtain training, validation and test subsets you apply it twice.
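
A minimal sketch of such a split is shown below. The 60% / 20% / 20% proportions, the dummy data and the fixed `random_state` are assumptions for illustration; `stratify` keeps the class proportions similar in all subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples with 2 features each and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: keep 20% of all samples aside as unseen test data.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: take 25% of the remaining 80% as validation data,
# which corresponds to 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

print(len(X_train), len(X_val), len(X_test))
```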

<br>

### Building the model
The next step is to build the Machine Learning model. This process involves defining the model architecture and a set of `hyperparameters`. Hyperparameters are settings that you choose before training and that you tune to improve the accuracy of your Machine Learning model.

Hyperparameters are classified into `model hyperparameters` and `algorithm hyperparameters`:
* **model hyperparameters**: they define the model itself (e.g. the number of hidden layers) and thereby influence its performance. They are set during the model setup and cannot be changed while fitting the model to the training data.
* **algorithm hyperparameters**: they do not change the model architecture but affect the speed and quality of the learning process. Examples are `learning rate` or `batch size`.
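
As an illustration, the sketch below sets up a small neural network with scikit-learn's `MLPClassifier`. Sorting its parameters into the two dictionaries follows the distinction made above and is our own grouping for this example, not something scikit-learn enforces; the chosen values are arbitrary.

```python
from sklearn.neural_network import MLPClassifier

# Model hyperparameters: they define the architecture of the network.
model_hyperparameters = {
    "hidden_layer_sizes": (32, 16),  # two hidden layers with 32 and 16 nodes
    "activation": "relu",
}

# Algorithm hyperparameters: they control the learning process.
algorithm_hyperparameters = {
    "learning_rate_init": 0.001,  # learning rate
    "batch_size": 32,
    "max_iter": 50,               # maximum number of epochs
}

model = MLPClassifier(
    **model_hyperparameters, **algorithm_hyperparameters, random_state=0
)

# All hyperparameters are fixed at this point, before any data is seen.
params = model.get_params()
print(params["hidden_layer_sizes"], params["learning_rate_init"], params["batch_size"])
```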


<br>

### Fitting (training) the model
The last step is the training process. During training, the defined model is *fitted* to the training data over multiple training cycles (called `epochs`). After each epoch, the results are checked against the validation dataset, while the test dataset is kept aside for the final evaluation. This step helps to assess and evaluate the model accuracy during the training process.

Some algorithms also allow you to set an early stopping hyperparameter. This parameter stops the training process if the model accuracy has not improved after a given number of epochs / training cycles.
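
The interplay of epochs, validation and early stopping can be sketched with a very simple model. The example below fits a straight line with gradient descent in plain NumPy; the synthetic data and the names `patience` and `min_delta` are assumptions for illustration (ML packages offer similar settings under their own names).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following y = 3x + 0.5 plus noise, split into training and validation.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=200)
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

def mse(w, b, X, y):
    """Mean squared error of the line y = w * x + b."""
    return float(np.mean((X[:, 0] * w + b - y) ** 2))

w, b = 0.0, 0.0          # model parameters, learned during training
learning_rate = 0.1      # algorithm hyperparameter
max_epochs = 500         # upper limit of training cycles
patience = 10            # stop after this many epochs without improvement
min_delta = 1e-6         # smallest change that counts as an improvement

best_val_loss = np.inf
epochs_without_improvement = 0
history = []

for epoch in range(max_epochs):
    # One epoch: a gradient descent step on the full training data.
    error = X_train[:, 0] * w + b - y_train
    w -= learning_rate * 2.0 * np.mean(error * X_train[:, 0])
    b -= learning_rate * 2.0 * np.mean(error)

    # After each epoch, check the model against the validation data.
    val_loss = mse(w, b, X_val, y_val)
    history.append(val_loss)

    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping

epochs_run = len(history)
print(f"Stopped after {epochs_run} epochs, w={w:.2f}, b={b:.2f}")
```

Early stopping saves computing time and reduces the risk of overfitting the training data.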


<br>

## <a id='model_evaluation'></a>4. Model evaluation

<img src='../../img/ml_workflow_4.png' alt='Logos AI4EO MOOC' width='50%'></img>

After your model is trained, you want to know how well it performs and how accurate it is. For this reason, the evaluation of the trained model is an important step. Below, we present a selection of common evaluation metrics for Machine Learning.

**Note:** We do not provide a full list of possible evaluation metrics, but rather a list of metrics that are used in this course. See the <a href='https://scikit-learn.org/stable/modules/model_evaluation.html' target='_blank'>documentation of scikit-learn's metrics and scoring APIs</a> to get a good overview of common evaluation metrics for different Machine Learning categories. 

**General metrics:**

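* **Mean Absolute Error (MAE)**: Mean Absolute Error is the average of the absolute differences between the original and the predicted values. MAE shows how far the predictions were from the actual output, but it does not provide any information on the direction of the error, i.e. whether the model is under- or over-predicting.

* **Root Mean Squared Error (RMSE) and Normalized Root Mean Squared Error (nRMSE)**: RMSE and nRMSE are standard ways to measure the error of a model. RMSE is the square root of the average squared difference between the original and the predicted values and therefore penalises large errors more strongly than MAE. nRMSE divides the RMSE by a reference value, such as the range or the mean of the observed values, which makes errors comparable across variables with different units. RMSE is often used to evaluate the usefulness / accuracy of a trained model.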

* **Pearson Correlation Coefficient**: The Pearson Correlation Coefficient is a measure of the strength of a linear association between two variables (actual vs. predicted).
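
The general metrics above can be computed with a few lines of NumPy. The values in `y_true` and `y_pred` are made up for illustration, and normalising the RMSE by the range of the observed values is only one of several conventions for nRMSE. The mean error (bias) is added here to show the direction of the error that MAE does not capture.

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])   # original (observed) values
y_pred = np.array([2.5, 3.5, 6.5, 8.5])   # values predicted by a model

errors = y_pred - y_true

mae = np.mean(np.abs(errors))                 # Mean Absolute Error
bias = np.mean(errors)                        # mean error: > 0 means over-prediction
rmse = np.sqrt(np.mean(errors ** 2))          # Root Mean Squared Error
nrmse = rmse / (y_true.max() - y_true.min())  # RMSE normalised by the observed range
pearson_r = np.corrcoef(y_true, y_pred)[0, 1] # Pearson Correlation Coefficient

print(f"MAE={mae:.3f}, bias={bias:.3f}, RMSE={rmse:.3f}, "
      f"nRMSE={nrmse:.3f}, r={pearson_r:.3f}")
```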

<br>

**Metrics specifically for classification problems:**
* **Overall Accuracy Score**: Accuracy is the ratio of number of correct predictions to the total number of input samples.

* **Confusion Matrix**: a confusion matrix is a matrix which describes the complete performance of the model. Each row of the matrix represents the actual class and each column represents the predicted class - or vice versa. The matrix allows you to quantify the accuracy of your classification - how much information has been correctly classified and how much information has not been correctly classified. The matrix can be provided as table or as graphical output (e.g. as a heatmap).

* **Recall (Sensitivity)**: Recall is the number of correct positive results divided by the number of all samples that are actually positive. The higher the Recall the better.

* **Precision (Positive predictive value)**: Precision is the number of correct positive results divided by the number of positive results predicted by the classifier. The higher the Precision the better.

* **F1 Score**: the F1 score combines the metrics *Recall* and *Precision* in a single number and is the harmonic mean of both metrics. The F1 score ranges between 0 and 1.0, with 1.0 being the best value.

* **Specificity (True Negative Rate)**: Specificity is the number of correct negative results divided by the number of all samples that are actually negative.
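
For a binary classification problem, all of these metrics can be derived from the four entries of the confusion matrix. The sketch below uses made-up labels (1 = positive class, 0 = negative class) and plain NumPy; scikit-learn offers ready-made functions for the same metrics.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # actual classes
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])  # predicted classes

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Rows represent the actual class, columns the predicted class.
confusion_matrix = np.array([[tn, fp],
                             [fn, tp]])

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)        # correct positives / all actual positives
precision = tp / (tp + fp)     # correct positives / all predicted positives
specificity = tn / (tn + fp)   # correct negatives / all actual negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(confusion_matrix)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision:.2f}, "
      f"specificity={specificity:.2f}, F1={f1:.2f}")
```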





<br>

## <a id='inference'></a> 5. Inference - Applying a trained ML model to make predictions

<img src='../../img/ml_workflow_5.png' alt='Logos AI4EO MOOC' width='50%'></img>

**Inference** is the process of applying a trained Machine Learning model to new data in order to make predictions. Logically, inference is the very last step of a Machine Learning workflow, because inference cannot happen without the previous steps.
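
In code, inference usually comes down to a single call on the trained model. The sketch below assumes scikit-learn; the tiny training set, the `DecisionTreeClassifier` and the new samples in `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Training (normally done in the previous workflow steps):
# one feature, two classes that are clearly separated.
X_train = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Inference: apply the trained model to new data it has not seen before.
X_new = np.array([[0.15], [0.85]])
predictions = model.predict(X_new)
print(predictions)
```

The new data have to go through the same pre-processing steps as the training data before they are passed to the model.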

**Domain-specific inference examples shown in this course**:
* [3B - Tile-based classification with EuroSAT data - Inference](../../3_land/3B_tile-based_classification_with_EuroSAT_data/3B_tile-based_classification_with_EuroSAT_data_inference.ipynb)
* [6F - ML4Floods - Inference of a flood extent segmentation model using Sentinel-2 data](../../6_climate/6F_ml4floods/6F_ml4floods_inference.ipynb)

<br>

<a href='https://www.futurelearn.com/courses/artificial-intelligence-for-earth-monitoring/1/steps/1280510' target='_blank'><< Back to FutureLearn</a><br>

<hr>

<img src='../../img/copernicus_logo.png' alt='Copernicus logo' align='left' width='20%'></img>

Course developed for <a href='https://www.eumetsat.int/' target='_blank'> EUMETSAT</a>, <a href='https://www.ecmwf.int/' target='_blank'> ECMWF</a> and <a href='https://www.mercator-ocean.fr/en/' target='_blank'> Mercator Ocean International</a> in support of the <a href='https://www.copernicus.eu/en' target='_blank'> EU's Copernicus Programme</a> and the <a href='https://wekeo.eu/' target='_blank'> WEkEO platform</a>.
