In [None]:
for_pdf <- TRUE
for_html <- FALSE


In [None]:
from book_funs2 import *



# Brief overview of methods of machine learning


## A framework for machine learning 

### Prediction machine for supervised/unsupervised  learning

Machine learning methods can be roughly split into two main approaches: unsupervised and supervised methods. Both can be described within a general framework, referred to here as a **prediction machine**. In short, a predictor, denoted by $\mathcal{P}_m$, is an extrapolation or interpolation operator 
$$
f_z = \mathcal{P}_{m}(X,Y=[],Z=X,f(X)).
  (\#eq:Pm)
$$
We use standard Python notation, and the brackets above indicate that the variables $Y, Z$ are optional input data.

* The choice of the method is indicated by the subscript $m$. Each method relies on a set of **external parameters**. Fine-tuning such parameters is sometimes very cumbersome and provides a source of error; in fact, some strategies in the literature rely on a learning machine in order to determine these external parameters. No performance indicator is provided for this parameter tuning step, and this issue should be taken into account in applications before selecting a particular method.

* The input data $X, Y, Z, f(X)$ are as follows. 
  * The non-optional parameter $X \in \mathbb{R}^{N_x \times D}$ is called the **training set**. The parameter $D$ is usually referred to as the **total number of features**.
  * The variable $f(X) \in \mathbb{R}^{N_x \times D_f}$ is called the **training set values**, while the parameter $D_f$ is the **number of training features**.
  * The variable $Z \in \mathbb{R}^{N_z \times D}$ is called the **test set**. If it is not specified, we tacitly assume that $Z=X$.
  * The variable $Y \in \mathbb{R}^{N_y \times D}$ is called the **internal parameter set**\footnote{also called weight set in neural network theory} and is necessary in order to define $\mathcal{P}_m$.
* The output data are as follows.
  * **Supervised learning**: this corresponds to choosing the input function values $f(X)$ and we then write 
$$  
f_z = \mathcal{P}_m(X,Y=[],Z=X, f(X)) \sim f(Z), (\#eq:Pms)
$$
  where the values $f_z \in \mathbb{R}^{N_z \times D_f}$ are called a **prediction**.
We distinguish between two cases. 
    * If the input data $Y$ is left empty, then the prediction machine \@ref(eq:Pm) is called a **feed-backward machine**. In this case, the method computes this set internally and then determines $f_z$.
    * If $Y$ is specified as input data, then the prediction machine \@ref(eq:Pm) is referred to as a **feed-forward machine**. In this case, the method uses the given set of internal parameters and computes the prediction $f_z$.
  * **Unsupervised learning**: we may also choose 
\begin{equation} 
  f_z = \mathcal{P}_m(X,Z=X), (\#eq:Pmu)
\end{equation}
where the output values $f_z \in \mathbb{R}^{N_z \times D}$ are sometimes called **clusters** for the so-called clustering methods (described later on).
  
Other machine learning methods can be described with the same notation. For instance, given two methods $m_1,m_2$, the following composition describes a feed-backward machine, which is quite close to the definition of **semi-supervised learning** in the literature: 
$$ 
  f_z = \mathcal{P}_{m_1}(X, \mathcal{P}_{m_2}(X,f(X)),Z,f(X)), (\#eq:Pmsu)
$$
We summarize our main notation in Table \@ref(tab:mainnotations). The sizes of the input data, that is, the integers $D, N_x, N_y, N_z, D_f$, are also considered as input parameters. The distinction between supervised and unsupervised learning is a matter of having, or not, optional input data and the correspondence will be clarified in the rest of this chapter.
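To make the calling convention of \@ref(eq:Pm) concrete, the following sketch implements a toy prediction machine. The function name `prediction_machine`, the nearest-neighbour rule chosen for the method $m$, and the default $Y=X$ are assumptions made for illustration only; they are not part of the CodPy library.

```python
import numpy as np

def prediction_machine(X, fX, Z=None, Y=None):
    """Toy predictor P_m(X, Y=[], Z=X, f(X)) based on a nearest-neighbour rule.

    X  : (Nx, D)  training set
    fX : (Nx, Df) training set values
    Z  : (Nz, D)  test set, defaults to X
    Y  : (Ny, D)  internal parameter set, defaults to X (hypothetical choice)
    """
    X, fX = np.asarray(X, float), np.asarray(fX, float)
    Z = X if Z is None else np.asarray(Z, float)
    Y = X if Y is None else np.asarray(Y, float)
    # Values attached to the internal parameters: value of the closest training point.
    fY = fX[np.argmin(np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2), axis=1)]
    # Prediction on Z: value attached to the closest internal parameter.
    return fY[np.argmin(np.linalg.norm(Z[:, None, :] - Y[None, :, :], axis=2), axis=1)]

X = np.array([[0.0], [1.0], [2.0]])      # Nx = 3, D = 1
fX = np.array([[0.0], [10.0], [20.0]])   # Df = 1
Z = np.array([[0.1], [1.8]])             # Nz = 2
f_z = prediction_machine(X, fX, Z)
print(f_z.shape)                         # Nz x Df
```

Calling the function without $Z$ evaluates the predictor on the training set itself, which for this particular rule simply reproduces $f(X)$.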


In [None]:
summary = data.frame(
  stringsAsFactors = FALSE,
       check.names = FALSE,
                 x = c("training set","size Nx * D"),
                 y = c("parameter set","size Ny * D"),
                 z = c("test set","size Nz * D"),
            `f(x)` = c("training values","size Nx * Df"),
           `fz` = c("predictions", "size Nz * Df")
)
knitr::kable(summary, label = "mainnotations", caption = "Main parameters for machine learning")


| $X$           |  $Y$           | $Z$       |          $f(X)$  |        $f_z$ |
|---            |          ---   |---        |---      |---     |
| training set  | parameter set  |  test set | training values  | predictions  |
| size $N_x \times D$ | size $N_y \times D$ | size $N_z \times D$ | size $N_x \times D_f$ | size $N_z \times D_f$ |

Table:  (\#tab:mainnotations) Main parameters for machine learning

Moreover, if a machine learning method $m$ also allows us to compute the gradient of a real-valued function $f=f(x_1, \ldots, x_D)$ by
$$
(\nabla f)_Z = (\nabla_Z \mathcal{P}_{m})(X,Y=[],Z=X,f(X)=[]) \sim \nabla f(Z), (\#eq:dm)
$$
where $\nabla:= (\partial_{x_1}, \ldots,\partial_{x_D})$, then we say that $m$ is a differentiable learning machine.

### Techniques of supervised learning 

Supervised learning \@ref(eq:Pms) corresponds to the situation where the function values $f(X)$ are part of the input data: 
\begin{equation} 
  f_z = \mathcal{P}_m(X,Y=[],Z=X,f(X)).
\end{equation}
Supervised learning can be best understood as a simple extrapolation procedure: from historical observations of a given function $X, f(X)$, one wants to predict (or extrapolate) the function on a new set of values $Z$.
Concerning the terminology, a method is said to be **multi-class** or multi-output if the function $f$ under consideration can be vector-valued, that is, $D_f \ge 1$ in our notation. Observe that one can always combine learning machines in order to produce multi-class methods. However, this usually comes at a heavy computational cost, and this motivates our definition. Moreover, the input function $f$ can be either: 

* discrete, that is, the set of unique values $f(\mathbb{R}^D)$ is a discrete set, denoted $Ran(f)$. The elements of this set are referred to as **labels**, and this set can always be mapped to the integers $[1,\ldots,\#(Ran(f))]$, where $\#(E)$ denotes the number of elements, or cardinality, of a set $E$.
* continuous, or
* mixed (some data being discrete, some data being continuous).

A classification of existing methods of supervised learning can be found on the scikit-learn website[^200].

[^200]:[a classification of methods is available using this link](https://scikit-learn.org/stable/supervised_learning.html). 


We distinguish between: 

* Different families of methods: linear models, support vector machines, neural networks, \ldots

![](.\CodPyFigs\SMLT.png){width=50%}

* Different particular methods: neural networks, Gaussian processes, \ldots

![](.\CodPyFigs\SML.png){width=50%}

* Different computational libraries: scikit-learn, TensorFlow,\ldots

### Techniques of unsupervised learning 

Unsupervised learning corresponds to the situation where the function values $f(X)$ are not part of the input data (see \@ref(eq:Pmu)):
\begin{equation} 
   \mathcal{P}_m(X, Y = [] , Z = X). 
\end{equation}
Unsupervised learning can be best understood as a simple interpolation procedure: from historical observations of a given distribution $X$, one wants to extract (or interpolate) $N_y$ features that best represent $X$.
The output data of a standard clustering method are the **cluster set**, denoted $Y \in \mathbb{R}^{N_y\times D}$.

There are natural connections between supervised and unsupervised learning.

* In the context of semi-supervised clustering methods, the clusters $y$ are used in a supervised learning machine to produce a prediction $f_z \in\mathbb{R}^{N_z \times D_f}$; see \@ref(eq:Pmsu).
* In the context of unsupervised clustering methods, a prediction $f_z \in\mathbb{R}^{N_z}$ can also be made. This prediction attaches each point $z^i$ of the test set to the cluster set $Y$, producing $f_z$ as a map $[1,\ldots,N_z] \mapsto [1,\ldots,N_y]$.

In the literature, many clustering methods are available for performing the task above; see for instance the dedicated Wikipedia page[^202].

[^202]:[link to cluster analysis Wikipedia page](https://en.wikipedia.org/wiki/Cluster_analysis). 

* Different families of methods are available: linear models, support vector machines, neural networks,\ldots

![](.\CodPyFigs\UMLT.png){width=50%}


* Different particular methods are available: neural networks, Gaussian processes,\ldots

![](.\CodPyFigs\UML.png){width=50%}


* Different libraries are also available: Scikit-learn,\ldots

Clustering represents one approach to unsupervised learning, and the library Scikit-learn does offer a quite impressive list of clustering methods; see [^203]. In Figure 2.1 we provide some illustration: 

[^203]:[link to scikit-learn clustering](https://scikit-learn.org/stable/modules/clustering.html)

![List of scikit-learn clustering methods.](CodPyFigs/scikitclustercomparaison.png)

* Each column corresponds to a particular clustering algorithm.
* Each row corresponds to a particular clustering, unsupervised problem:
  * Each image scatter shows the training set $X$ and the test set $Z$, which coincide here.
  * Each image color indicates the predicted values $f_z$.


## Exploratory data analysis

**Preliminaries**. Exploratory data analysis plays a central role in data engineering and allows one to understand the structure of a given dataset, including its correlation and statistical properties. For instance, we can study whether a data distribution is multimodal, skewed, or discontinuous, among other features. The technique can help in many different applications. For instance, in unsupervised learning, one can produce a first guess concerning the number of possible clusters associated with a given dataset, or concerning the type of kernel one should choose before applying a kernel regression method.

As an example, to illustrate the visualization tools that we are using, consider the Iris flower data set. The Iris data set was introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

**Non-parametric density estimations**. The density of the input data is estimated using a kernel density estimate (KDE). Let $(x^1, x^2, \dots, x^n)$ be independent and identically distributed samples, drawn from some univariate distribution with unknown density denoted by $f$ at any given point $x$. We are interested in estimating the shape of this function $f$ and the kernel density estimator is
$$
\widehat{f}_{h}(x)={\frac {1}{n}}\sum _{i=1}^{n}K_{h}(x-x^{i})={\frac {1}{nh}}\sum _{i=1}^{n}K{\Big (}{\frac {x-x^{i}}{h}}{\Big )},
$$
where $K$ is a kernel (say any non-negative function) and $h > 0$ is a smoothing parameter called the **bandwidth**. Among the range of possible kernels that are commonly used, we have: uniform, triangular, biweight, triweight, Epanechnikov, normal, and many others. The ability of the KDE to accurately represent the data depends on the choice of the smoothing bandwidth. An over-smoothed estimate can remove meaningful features, while an under-smoothed estimate can obscure the true shape within the random noise.
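The estimator above can be sketched in a few lines of NumPy. The helper name `gaussian_kde_1d`, the normal kernel, the synthetic bimodal samples, and the three bandwidth values are assumptions chosen for illustration; this is not the implementation used by the plotting functions of this chapter.

```python
import numpy as np

def gaussian_kde_1d(samples, points, h):
    """Evaluate the kernel density estimate with a normal kernel K and bandwidth h."""
    samples = np.asarray(samples, float)
    points = np.asarray(points, float)
    u = (points[:, None] - samples[None, :]) / h      # (x - x^i) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)    # standard normal kernel
    return K.sum(axis=1) / (samples.size * h)         # 1/(n h) * sum_i K(...)

rng = np.random.default_rng(0)
# Synthetic bimodal samples with modes at -2 and 2.
samples = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])
grid = np.linspace(-6.0, 6.0, 1201)
for h in (0.05, 0.3, 2.0):   # under-smoothed, intermediate, over-smoothed
    density = gaussian_kde_1d(samples, grid, h)
    mass = density.sum() * (grid[1] - grid[0])        # Riemann sum of the estimate
    print(f"h={h}: mass on the grid ~ {mass:.3f}, max density ~ {density.max():.3f}")
```

Plotting `density` against `grid` for the three bandwidths shows the trade-off described above: the smallest value follows the noise of the samples, while the largest one blurs the two modes.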


In [None]:
D,Nx,Ny,Nz= -1,-1,-1,-1
x, fx, y, fy, z, fz = iris_data_generator().get_data(D = D,Nx= Nx, Ny = Ny, Nz = Nz)
f_names = iris_data_generator().get_feature_names()
xfx = pd.DataFrame(x, columns = f_names)
multi_plot([xfx.T],data_plots.distribution_plot1D, f_names = f_names)


**Scatter plots**. Another way to visualize data is to rely on a scatter plot, where the data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.



In [None]:
multi_plot(x.T,data_plots.scatter_plot, f_names = f_names)



**Heat maps**. The correlation matrix of $n$ random variables $x^{1},\ldots ,x^{n}$ is the $n\times n$ matrix whose $(i,j)$ entry is $corr(x^{i},x^{j})$. Thus the diagonal entries are all identically unity. 
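As a minimal sketch of the quantity displayed by a heat map, the correlation matrix can be computed with `numpy.corrcoef`. The synthetic three-feature data set below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Three features: a, a noisy copy of a, and an independent variable.
data = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

corr = np.corrcoef(data, rowvar=False)   # columns are the variables
print(np.round(corr, 2))
```

The matrix is symmetric with unit diagonal; the first two features are strongly correlated, while the third is nearly uncorrelated with the others.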



In [None]:
data_plots.heatmap(x, title= "Correlation matrix", f_names = f_names)



**Summary plots**. The summary plot visualizes the density of each feature of the data on the diagonal, the KDE plot below the diagonal, and the scatter plot above the diagonal.



In [None]:
data_plots.density_scatter(xfx)



In [None]:
scenarios_list = [ (784, 2**(5), 2**(5-2), 100)]



## Performance indicators for machine learning

### Distances and divergences

**f-divergences**. The notion of distance between probability distributions has many applications in mathematical statistics and information theory, such as hypothesis and distribution testing, density estimation, etc. One well-studied family of distances/divergences between probability distributions is the family of so-called $f$-divergences, of which we give a brief classification. Let $f : (0, \infty) \mapsto \mathbb{R}$ be a convex function with $f(1) = 0$. Let $P$ and $Q$ be two probability distributions on a discrete measurable space $(\mathcal{X} , \mathcal{F})$. If $P$ is absolutely continuous with respect to $Q$, then the $f$-divergence is defined as

$$
D_f(P||Q) = \mathbb{E}^{Q}\left[f\left( \frac{dP}{dQ}\right)\right] = \sum_x Q(x) f\left( \frac{P(x)}{Q(x)}\right).
$$
We list the following common $f$-divergences:

* **Kullback-Leibler (KL) divergence** with $f(x)=x\log(x)$.

* **Squared Hellinger distance** with $f(x)=(1-\sqrt{x})^2$. The corresponding Hellinger distance $\mathcal{H}(P,Q)$, which satisfies $D_f(P||Q) = 2\mathcal{H}^2(P,Q)$, is given by

$$
\mathcal{H}(P,Q) = \frac{1}{\sqrt{2}}||\sqrt{dP} - \sqrt{dQ}||_2.
$$
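For discrete distributions, the two divergences listed above can be evaluated directly from the definition of $D_f$. The sketch below uses only the standard library; the helper names and the two probability vectors are assumptions chosen for illustration. With $f(x)=(1-\sqrt{x})^2$ one has $D_f(P||Q) = \sum_x (\sqrt{P(x)}-\sqrt{Q(x)})^2$, so the Hellinger distance is recovered as $\sqrt{D_f/2}$.

```python
import math

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for discrete distributions with Q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl(x):
    # f(x) = x log(x), extended by continuity with f(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def sq_hellinger(x):
    # f(x) = (1 - sqrt(x))^2
    return (1.0 - math.sqrt(x)) ** 2

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl_pq = f_divergence(P, Q, kl)
hellinger = math.sqrt(0.5 * f_divergence(P, Q, sq_hellinger))  # H = sqrt(D_f / 2)
print(kl_pq, hellinger)
```

Both quantities vanish when $P=Q$, and the Hellinger distance always lies between 0 and 1.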

**Maximum mean discrepancy**. Another popular family of distances is that of **the integral probability metrics** (IPMs)[^221], which includes the Wasserstein or Kantorovich distance, the total variation (TVD) or Kolmogorov distance, and the maximum mean discrepancy (MMD). MMD is defined in Section \@ref(error-estimates-based-on-the-generalized-maximum-mean-discrepancy).

[^221]: A. Muller, "Integral probability metrics and their generating classes of functions", Advances in Applied Probability, vol. 29, pp. 429–443, 1997.

### Indicators for supervised learning

**Comparison to ground truth values**. A huge family of indicators is available in order to evaluate the performance of a learning machine, most of them being readily described and implemented in scikit-learn[^204].


[^204]:[link to scikit-learn metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics). 

We do not discuss them all, but rather overview those that we have included in the CodPy library. First of all, in the context of supervised clustering methods, if the function $f$ is known in advance, then predictions of learning machines $f_z$ can be compared with **ground truth values**, $f(Z) \in \mathbb{R}^{N_z \times D_f}$. Below we list the main metrics that are used.

* For labeled functions (i.e., discrete functions), a common indicator is the **score**, defined as
$$
    \frac{1}{N_z} \#\{ f_z^n = f(Z)^n, n=1\ldots N_z\} (\#eq:score)
$$
producing an indicator between 0 and 1, the higher being the better.
* For continuous functions (i.e., non-discrete functions), a common indicator is given by the $\ell^p$ norms, defined as
$$
    \frac{1}{N_z}\| f_z - f(Z) \|_{\ell^p}, \ \quad \quad 1 \le p \le \infty. 
$$
The case $p=2$ is referred to as the *root-mean-square error (RMSE)*. 

* As the above indicator is not normalized, the following version is preferred.
$$
    \frac{\| f_z - f(Z)\|_{\ell^p}}{\| f_z\|_{\ell^p} +\|f(Z)\|_{\ell^p}}, \ \quad \quad
    1 \le p \le \infty. (\#eq:rmse)
$$
producing an indicator between 0 and 1, the smaller being the better, interpreted as error-percentages.
In finance, this notion is sometimes referred to as the basis point indicator.
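The score \@ref(eq:score) and the normalized error \@ref(eq:rmse) can be sketched as follows. The function names `score` and `normalized_error` and the small input vectors are assumptions made for illustration; they are not the CodPy implementations.

```python
import numpy as np

def score(f_z, f_Z):
    """Fraction of predicted labels that match the ground truth labels."""
    f_z, f_Z = np.asarray(f_z), np.asarray(f_Z)
    return np.mean(f_z == f_Z)

def normalized_error(f_z, f_Z, p=2):
    """Relative l^p error ||f_z - f(Z)|| / (||f_z|| + ||f(Z)||), between 0 and 1."""
    f_z = np.ravel(f_z).astype(float)
    f_Z = np.ravel(f_Z).astype(float)
    num = np.linalg.norm(f_z - f_Z, ord=p)
    den = np.linalg.norm(f_z, ord=p) + np.linalg.norm(f_Z, ord=p)
    return num / den if den > 0 else 0.0

print(score([0, 1, 2, 2], [0, 1, 1, 2]))                    # 3 matches out of 4
print(normalized_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.5]))   # small relative error
```

By the triangle inequality, the normalized error equals 0 for a perfect prediction and reaches 1 when the prediction is the exact opposite of the ground truth.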

**Cross validation scores**. The cross validation score consists in randomly selecting a part of the training set and values as test set and values, and performing a score or RMSE type error analysis on each run. See the [dedicated page on scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html).
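A minimal sketch of this procedure, written with NumPy only, is given below. The helper name `k_fold_rmse`, the linear least-squares model, and the synthetic data are assumptions made for illustration; in practice one would rather rely on the scikit-learn utilities referenced above.

```python
import numpy as np

def k_fold_rmse(x, fx, k=5, seed=0):
    """Hypothetical helper: RMSE of a linear least-squares fit on each of k random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for test_idx in folds:
        # Train on everything except the current fold, test on the fold.
        train_idx = np.setdiff1d(np.arange(len(x)), test_idx)
        slope, intercept = np.polyfit(x[train_idx], fx[train_idx], deg=1)
        residual = slope * x[test_idx] + intercept - fx[test_idx]
        errors.append(np.sqrt(np.mean(residual**2)))
    return np.array(errors)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
fx = 3.0 * x + 1.0 + 0.1 * rng.normal(size=200)   # linear trend plus noise
errors = k_fold_rmse(x, fx)
print(errors.mean(), errors.std())
```

The mean of the per-fold errors estimates the prediction error, and their spread indicates how sensitive the method is to the choice of the training subset.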

**Confusion matrix**. This indicator is available for labeled, supervised learning. It is a matrix in which each row corresponds to a ground-truth label and each column corresponds to a predicted label. The confusion matrix is a simple and efficient tool for visualizing errors; an example is shown in the following sections. Its common form is
$$
  M(i,j) = \#\{f(Z) = i \quad \text{and} \quad f_z = j\},
$$
so that the diagonal entries count correct predictions, while the off-diagonal entries count misclassified samples. Note that numerous other performance indicators can be straightforwardly deduced from the confusion matrix, such as the Rand index, the Fowlkes-Mallows score, etc.
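The definition of $M(i,j)$ translates directly into code. The sketch below assumes that labels are encoded as integers starting at 0 (Python convention); the helper name and the toy labels are chosen for illustration.

```python
import numpy as np

def confusion_matrix(f_Z, f_z, n_labels):
    """M[i, j] = number of samples with ground truth label i and predicted label j."""
    M = np.zeros((n_labels, n_labels), dtype=int)
    # np.add.at accumulates repeated (i, j) pairs correctly.
    np.add.at(M, (np.asarray(f_Z), np.asarray(f_z)), 1)
    return M

truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
M = confusion_matrix(truth, pred, n_labels=3)
print(M)
print(np.trace(M) / M.sum())   # fraction of correct predictions
```

The normalized trace of $M$ is the fraction of correctly labeled samples, that is, the score indicator for labeled functions.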

**Norm of output**. If no ground truth values are known, the quality of the prediction $f_z$ depends on **a priori error estimates** or error bounds. Such estimates exist only for kernel methods (to the best of the knowledge of the authors), and are described in the next chapter; see \@ref(functional-spaces-and-kolmogorov-decomposition). Such estimates use the norm of functions, which has proven to be a useful indicator in applications.

**ROC curves**. A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers starting in 1941, which led to its name.

The ROC curve is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summarized in the following table:


| Metric   |      Formula |  Equivalent |
|----------|:-------------:|------:|
| True Positive Rate TPR |  $\frac{TP}{TP + FN}$ | Recall, sensitivity |
| False Positive Rate FPR | $\frac{FP}{TN+FP}$ | 1-specificity |
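The construction of the curve can be sketched as follows: each threshold turns the classifier scores into binary predictions, from which one (FPR, TPR) point is computed. The helper name `roc_points` and the six scored samples are assumptions made for illustration.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) arrays obtained by thresholding the scores at each observed value."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    # Start above every score (nothing predicted positive), then lower the threshold.
    thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
    fpr, tpr = [], []
    for t in thresholds:
        predicted = scores >= t
        tp = np.sum(predicted & labels)
        fp = np.sum(predicted & ~labels)
        tpr.append(tp / labels.sum())        # TP / (TP + FN)
        fpr.append(fp / (~labels).sum())     # FP / (TN + FP)
    return np.array(fpr), np.array(tpr)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # area under the curve, trapezoidal rule
print(auc)
```

The curve runs from $(0,0)$ to $(1,1)$; the area under it summarizes the quality of the ranking induced by the scores, a value of 1 corresponding to a perfect separation of the two classes.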


We can use precision score ($PRE$) to measure the performance across all classes:

$$
PRE=\frac{TP}{TP+FP}.
$$
In “micro averaging”, we calculate the performance, e.g., precision, from the individual true positives, true negatives, false positives, and false negatives of the k-class model:
$$
PRE_{micro}=\frac{TP_{1}+\dots+TP_{k}}
{TP_{1}+\dots+TP_{k}+FP_{1}+\dots+FP_{k}}.
$$
And in macro-averaging, we average the performances of each individual class
$$
PRE_{macro}=\frac{PRE_{1}+\dots+PRE_{k}}{k}.
$$
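The difference between the two averages is easiest to see on an imbalanced example. The per-class counts below are hypothetical and chosen so that one rare class is poorly predicted.

```python
def precision_micro_macro(tp, fp):
    """tp[k], fp[k]: true and false positive counts of class k (one-vs-rest)."""
    micro = sum(tp) / (sum(tp) + sum(fp))
    per_class = [t / (t + f) for t, f in zip(tp, fp)]
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Hypothetical counts for a 3-class model with one rare, poorly predicted class.
tp = [90, 80, 1]
fp = [10, 20, 9]
micro, macro = precision_micro_macro(tp, fp)
print(micro, macro)
```

Micro-averaging pools all counts, so the large classes dominate; macro-averaging weights every class equally, so the poorly predicted rare class lowers the result markedly.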

### Indicators for unsupervised learning

**Maximum mean discrepancy**. Many performance indicators are available for evaluating clustering algorithms, a number of them being implemented in scikit-learn; [see this link](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 

As an alternative to standard unsupervised learning metrics, we propose to use MMD. It is used primarily to produce worst-case error estimates, together with the norm of functions, as described in \@ref(functional-spaces-and-kolmogorov-decomposition), but it was also found to be useful as a performance indicator for unsupervised learning machines.

**Inertia indicator**. The inertia indicator is used for the *k-means* algorithm. We describe it precisely, as it relies on notation that is used elsewhere in this text. It shares some similarities with the discrepancy error, but the two are not equivalent. 
To define the inertia, one first picks a distance, denoted $d(x, y)$, typically the squared Euclidean one, although other distances, such as the Manhattan or log-entropy ones, can be considered depending on the problem. Consider any point $w \in \mathbb{R}^D$. Then $w$ is naturally attached to a point $y^{\sigma(w,Y)}$, where the index function $\sigma(w,Y)$ is defined as 
$$
\sigma(w,Y) := \arg \inf_{j=1 \ldots N_y} d(w,y^j). (\#eq:sigmaw)
$$
Then the inertia is defined as
$$
I(X,Y)= \sum_{n=1}^{N_x}|x^n-y^{\sigma(x^n,Y)}|^2.
$$
Observe that this functional might not be convex, even if the distance under consideration is convex, as is the squared Euclidean distance. For k-means algorithms, the cluster centers $Y$ are computed by minimizing this functional. The parameter set $Y$ is called the set of **centroids** for k-means algorithms.
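The index function \@ref(eq:sigmaw) and the inertia can be sketched with NumPy as follows, assuming the squared Euclidean distance. The function names and the four-point example are chosen for illustration, and indices start at 0 following the Python convention.

```python
import numpy as np

def sigma(W, Y):
    """Index of the closest point of Y for each row of W (squared Euclidean distance)."""
    d = np.sum((W[:, None, :] - Y[None, :, :]) ** 2, axis=2)   # (Nw, Ny) distances
    return np.argmin(d, axis=1)

def inertia(X, Y):
    """I(X, Y) = sum_n |x^n - y^{sigma(x^n, Y)}|^2."""
    return np.sum((X - Y[sigma(X, Y)]) ** 2)

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
Y = np.array([[0.0, 0.5], [4.0, 0.5]])    # two candidate centroids
print(sigma(X, Y))     # cluster index attached to each training point
print(inertia(X, Y))   # each of the 4 points is at squared distance 0.25 from its centroid
```

The same function `sigma`, applied to a test set $Z$ instead of $X$, gives the map attaching each test point to a cluster.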

<!-- **Kolmogorov-Smirnov test**. We exhibit three statistical indicators to support our claims, measuring each some kind of distance between the two distributions $X$ and $Y$. The two first tests are one-dimensional based tests. We check it on every axes. The third one is based on the discrepancy error. -->

<!-- Th are one-dimensional tests, based on the cumulative distribution function. The test is -->
<!-- $$ -->
<!--   \|cdf_x  - cdf_z\|_\ell^\infty(\mathbb{R}^N) \ge \frac{c_N}{\sqrt{N}}  -->
<!-- $$ -->
<!-- where $c_N$ is a confidence level. -->


## General specification of tests

### Preliminaries 

We now overview a benchmark methodology and apply it to some supervised learning methods. For each machine, 

* we illustrate the prediction function $\mathcal{P}_m$, and 
* we illustrate the computation of some performance indicators.

We then present benchmarks using these indicators and restrict attention to toy examples while more practical cases will be studied in Chapter \@ref(application-to-supervised-machine-learning).

We begin by describing a general first quality assurance test for supervised learning machines. The goal of this framework is to measure the accuracy of any machine learning model, using the extrapolation operator (\#eq:EI). Hence all our unit tests are based on the following inputs:
$$
  \text{a function: f },\text{a method: m }, \text{five integers: } D, N_x, N_y, N_z, D_f
$$
To benchmark our model, we use a list of scenarios, that is a list of entries $D, N_x, N_y, N_z, D_f$. Table \@ref(tab:299) is an example of a list of five scenarios.


In [None]:
scenarios_list = [ (1, 100*i, 100*i ,100*i ) for i in np.arange(1,5,1)]
pd_scenarios_list = pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])
D = 2
scenarios_list = [ (D, 100*(i**2), 100*(i**2),100*(i**2) ) for i in np.arange(5,1,-1)]
pd_scenarios_list =pd.concat([pd_scenarios_list, pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])]) 


In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


In [None]:
scenarios_list = [ (1, 100*i, 100*i ,100*i ) for i in np.arange(1,5,1)]
pd_scenarios_list = pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])


For the function $f$ we choose the sum of a periodic function and an increasing function:
\begin{equation} \label{2D}
f(X) = \prod_{d=1}^{D} \cos (2\pi x_d) + \sum_{d=1}^{D} x_d.
\end{equation}


In [None]:
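def my_fun(x):
    # Imports are kept local so that the function stays self-contained.
    import numpy as np
    from math import pi
    # Reduce over the feature axis: axis 0 for a single point, axis 1 for a batch of points.
    axis = 0 if x.ndim == 1 else 1
    periodic = np.prod(np.cos(2 * x * pi), axis=axis)  # product of cosines
    trend = np.sum(x, axis=axis)                       # increasing part
    return trend + periodic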


### Extrapolation in one dimension

**Description**. During this experiment, we used a generator, configured to select $X$ (resp. $Y, Z$) as $N_x$ (resp. $N_y, N_z$) points regularly (resp. randomly, regularly) generated on a unit cube. A validation set $Z$ is distributed over a larger cube, to observe extrapolation and interpolation effects.


In [None]:
data_random_generator_ = data_random_generator(fun = my_fun,types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=1,Nx=100,Ny=100,Nz=100)


As an illustration, in Figure \@ref(fig:xfxzfz) we show both graphs $(X, f(X))$ (left, training set),$(Z, f(Z))$ (right, test set).



In [None]:
multi_plot([(x, fx),(z, fz)],plot1D)



**A comparison between methods**. We compare CodPy's periodic kernel with the following machine learning models: SciPy's RBF kernel regression; support vector regression (**SVR**), decision tree (**DT**), AdaBoost, and random forest (**RF**) from the scikit-learn library; XGBoost; and TensorFlow's neural network (**NN**) model.

The set of external parameters for kernel-based methods consists simply in picking a kernel, and is discussed in the next chapter; see Section \@ref(kernel-methods-for-machine-learning). For the SVR we chose an RBF kernel; for the DT we set the maximum depth to $10$; for XGBoost and the RF we set the number of estimators to $10$ and $5$, respectively, and the maximum depth to $5$. For the feed-forward NN we chose $50$ epochs with a batch size of $16$, the Adam optimization algorithm, and the mean squared error as the loss function. The NN is composed of one input layer ($8$ cells), two hidden layers ($64$ cells), and one output layer ($1$ cell), with the following sequence of activation functions: ReLU - ReLU - ReLU - Linear. All other hyperparameters are the defaults set by scikit-learn, SciPy, and TensorFlow.


In [None]:
set_per_kernel = kernel_setters.kernel_helper(kernel_setters.set_gaussianper_kernel,2,1e-8,None)



In [None]:
scenario_generator_ = scenario_generator()
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,
                                  codpyexRegressor(set_kernel = set_per_kernel),
                                  data_accumulator(), **get_codpy_param_chap2())
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = results
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results = list_results[0:1]


In [None]:
rbf_param = {'function': 'gaussian', 'epsilon':None, 'smooth':1e-8, 'norm':'euclidean'}
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_, ScipyRegressor(set_kernel = set_per_kernel), data_accumulator(), **get_codpy_param_chap2())
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
svm_param = {'kernel': 'rbf', 'gamma': 'auto', 'C': 1}
scenario_generator_.run_scenarios(scenarios_list, data_random_generator_, SVR(set_kernel = set_per_kernel), data_accumulator(), **get_codpy_param_chap2(), **svm_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
scenario_generator_.run_scenarios(scenarios_list,data_random_generator_,tfRegressor(set_kernel = set_per_kernel),data_accumulator(), **get_codpy_param_chap2())
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
DT_param = {'max_depth': 10}
scenario_generator_.run_scenarios(scenarios_list,
                                  data_random_generator_,
                                  DecisionTreeRegressor(set_kernel = set_per_kernel),
                                  data_accumulator(), **get_codpy_param_chap2(), **DT_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
ada_param = {'tree_no': 50, 'learning_rate': 1}
scenario_generator_.run_scenarios(scenarios_list,
                                  data_random_generator_,
                                  AdaBoostRegressor(set_kernel = set_per_kernel),
                                  data_accumulator(), **get_codpy_param_chap2(), **ada_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
xgb_param = {'max_depth': 5, 'n_estimators': 10}
scenario_generator_.run_scenarios(scenarios_list,
                                  data_random_generator_,
                                  XGBRegressor(set_kernel = set_per_kernel),
                                  data_accumulator(), **get_codpy_param_chap2(), **xgb_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results += list_results[0:1]


In [None]:
RF_param = {'max_depth': 5, 'n_estimators': 5}
scenario_generator_.run_scenarios(scenarios_list,
                                  data_random_generator_,
                                  RandomForestRegressor(set_kernel = set_per_kernel),
                                  data_accumulator(), **get_codpy_param_chap2(), **RF_param)
results = scenario_generator_.accumulator.get_output_datas().dropna(axis=1)
df_sup_results = pd.concat([df_sup_results, results])
list_results = [(s.z,s.f_z) for s in scenario_generator_.accumulator.predictors]
list_of_list_results = list_of_list_results + list_results


In [None]:
title_list = ["Periodic kernel:CodPy", "RBF kernel:SciPy", "SVR:Scikit", "NN:TensorFlow",
              "Decision tree:Scikit", "Adaboost:Scikit", "XGBoost", "RF:Scikit"]
multi_plot(list_of_list_results,plot1D,mp_ncols = 4, f_names=title_list, mp_max_items = 8)


In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 3 }
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","scores"),("Ny","discrepancy_errors"),("Ny","execution_time")],**kwargs)


Figure \@ref(fig:a1a2a3) visualizes the extrapolation produced by each method. We note that the periodic kernel gives a better extrapolation on $[-1.5,-1]$ and $[1,1.5]$, which is confirmed in Figure \@ref(fig:a1a2a4), showing the RMSE error for different sample sizes $N_x$.

Observe that function norms and discrepancy errors are not method-dependent. Clearly, for this example, the periodic kernel-based method outperforms the other ones. However, our goal is not to establish the supremacy of a particular method, but to illustrate a benchmark methodology, particularly in the context of extrapolating test set data far from the training set.

### Extrapolation in two dimensions

**Description**. We now show that the dimension of the problem under consideration does not change the benchmark methodology. To illustrate this point, we simply repeat the steps used for the one-dimensional case, with the dimension now set to two, that is $D=2$; the reader can experiment with this parameter. Only the data visualization changes.


In [None]:
data_random_generator_ = data_random_generator(fun = my_fun,types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=1,Nx=100,Ny=100,Nz=100)


In [None]:
D = 2
scenarios_list = [ (D, 100*(i**2), 100*(i**2),100*(i**2) ) for i in np.arange(5,1,-1)]
pd_scenarios_list = pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])


The data is generated using five scenarios from Table \@ref(tab:299), corresponding to a two-dimensional case. For illustration purposes, Figure \@ref(fig:xfxzfz2) shows both graphs $(X,f(X))$ (left, training set) and $(Z,f(Z))$ (right, test set), where $f$ is the two-dimensional periodic function defined in Section \@ref(preliminaries). Observe that, if the dimension is greater than two, we use a two-dimensional visualization, plotting $\tilde{X},f(X)$, where $\tilde{X}$ is obtained by either setting indices $\tilde{X}:=X[index1,index2]$ or performing a PCA over $X$ and setting $\tilde{X}:=PCA(X)[index1,index2]$.



In [None]:
data_random_generator_ = data_random_generator(fun = my_fun, types=["cart","sto","cart"])
x, fx, y, fy, z, fz =  data_random_generator_.get_data(D=D,Nx=2000,Ny=2000,Nz=2000)
multi_plot([(x,fx),(z,fz)],plot_trisurf,projection='3d')


**A comparison between methods**. We compare two models for function extrapolation: codpy's periodic Gaussian kernel and SciPy's RBF kernel.



In [None]:
list_results,list_of_list_results,df_sup_resultsN = a1a2a5()



In [None]:
multi_plot(list_results,plot_trisurf,mp_max_items = 4,mp_ncols = 4, projection='3d')



The first two graphs in Figure \@ref(fig:a1a2a5) show the RBF predictions for the first two scenarios defined in Table \@ref(tab:299), and the last two graphs show those of the periodic Gaussian kernel.



In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","scores"),("Nx","discrepancy_errors"),("Nx","execution_time")], **kwargs)


### Clustering

**Description**. The goal of this section is to give an overview of our own methodology (which will be fully described in the next chapter).

* We illustrate the prediction function $\mathcal{P}_m$ for some methods in the context of unsupervised learning.
* We illustrate the computation of some performance indicators and present a toy benchmark using these indicators.

The data is generated using a multimodal, multivariate Gaussian distribution with covariance matrix $\Sigma = \sigma I_d$. The problem is to identify the modes of the distribution using a clustering method.
In the following, we generate distributions with a predetermined number of modes, which allows us to test validation scores on this toy example.
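A sample from such a multimodal distribution can be drawn as sketched below. This is an illustrative NumPy version, not the generator used in the book: the function name, the centers and the value of $\sigma$ are assumptions. Since $\Sigma = \sigma I_d$, each coordinate has standard deviation $\sqrt{\sigma}$.

```python
import numpy as np

def sample_gaussian_mixture(n_samples, centers, sigma, rng):
    """Draw points from equally weighted Gaussian modes with covariance sigma * I.

    Returns the points and the index of the mode each point was drawn from.
    """
    centers = np.asarray(centers, dtype=float)
    n_modes, dim = centers.shape
    labels = rng.integers(0, n_modes, size=n_samples)
    # Covariance sigma * I means a standard deviation of sqrt(sigma) per axis.
    noise = rng.normal(scale=np.sqrt(sigma), size=(n_samples, dim))
    return centers[labels] + noise, labels

rng = np.random.default_rng(42)
centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]   # three assumed modes, D = 2
points, true_labels = sample_gaussian_mixture(3000, centers, sigma=0.05, rng=rng)
print(points.shape, np.bincount(true_labels))
```

Keeping the mode index of every point gives the ground truth against which a clustering method can be scored.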

**A comparison between methods**. We compute and compare codpy's clustering by MMD minimization with scikit-learn's implementation of the k-means algorithm. During this experiment, we generate distributions with different numbers of modes (between 2 and 6).
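For readers unfamiliar with k-means, the following is a minimal sketch of Lloyd's iteration in NumPy. It is not scikit-learn's implementation (which adds a smarter initialization and several restarts); the function names, the random initialization and the two-blob data are assumptions made for illustration.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center for every point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd iteration: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with distinct points drawn at random.
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(points, centers)

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = kmeans(points, n_clusters=2)
print(np.round(centers, 2))
```

The returned centers play the role of the set $Y$ in the prediction machine, and the labels are the predicted cluster indices.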


In [None]:
D = 2
scenarios_list = [ (D, 100*(i**2), 100*(i**2),100*(i**2) ) for i in np.arange(5,1,-1)]
pd_scenarios_list =pd.DataFrame(scenarios_list,columns = ["D","Nx","Ny","Nz"])


The first two graphs in Figure \@ref(fig:750) correspond to the clusters computed using the k-means algorithm, and the last two graphs correspond to the MMD minimization.



In [None]:
list_of_predictors,scenario_generator_,df_unsup_results = list_of_predictors()
multi_plot(list_of_predictors ,graphical_cluster_utilities.plot_clusters, mp_ncols = 4,xlabel = 'x', ylabel = 'y')


Figure \@ref(fig:750) illustrates four confusion matrices: the first row corresponds to the k-means algorithm and the second to the MMD minimization.
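A confusion matrix of this kind simply counts how many points of each true mode fall into each predicted cluster. The sketch below is a generic NumPy version with made-up labels; how the book's plotting utility aligns cluster indices with mode indices is not shown in this chapter, so no alignment is attempted here.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Entry (i, j) counts points of true class i assigned to cluster j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when an index pair appears several times.
    np.add.at(cm, (true_labels, pred_labels), 1)
    return cm

# Made-up labels: cluster indices are arbitrary, so clusters 0 and 1 are swapped.
true_labels = np.array([0, 0, 1, 1, 2, 2, 2])
pred_labels = np.array([1, 1, 0, 0, 2, 2, 0])

cm = confusion_matrix(true_labels, pred_labels, n_classes=3)
print(cm)
```

Because cluster indices are arbitrary, a good clustering shows one dominant entry per row, not necessarily on the diagonal.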



In [None]:
title_list = ["k-means", "k-means", "MMD-codpy", "MMD-codpy"]
multi_plot(list_of_predictors,add_confusion_matrix.plot_confusion_matrix,mp_ncols = 2, f_names=title_list, mp_max_items = 4)


We compare the methods under consideration by means of performance indicators, as illustrated in Figure \@ref(fig:740). Since there is no unambiguous definition of the best possible clustering, we choose inertia as the metric to evaluate the performance of the algorithms. The MMD error merely indicates that the two samples are the same: its values coincide at the different sample sizes. Table \@ref(tab:2999) in the appendix to this chapter summarizes the experiment.
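Both indicators can be written in a few lines. The sketch below assumes the usual definitions: inertia is the sum of squared distances from each point to its nearest center, and the squared MMD is the biased estimator built on a Gaussian kernel. The kernel and bandwidth actually used by codpy may differ, and the function names are hypothetical.

```python
import numpy as np

def sq_dists(a, b):
    """Matrix of squared Euclidean distances between the rows of a and b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return float(sq_dists(points, centers).min(axis=1).sum())

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian kernel (assumed kernel)."""
    def k(a, b):
        return np.exp(-sq_dists(a, b) / (2.0 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

print("inertia:", inertia(points, centers))
print("MMD^2 (identical samples):", mmd_squared(points, points))
print("MMD^2 (shifted sample):", mmd_squared(points, points + 5.0))
```

The squared MMD vanishes when the two samples are identical and is positive once one of them is shifted, which is why it is used as a distance between samples.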



In [None]:
scenario_generator_.compare_plots(axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")],mp_max_items=4, mp_ncols = 4)



## Bibliography

XGBoost[^222] is a computationally efficient implementation of the original gradient boosting algorithm. Standard libraries for neural networks are TensorFlow[^223] and PyTorch[^224]. The scikit-learn library[^225] offers a comprehensive set of models (linear models, SVMs, stochastic gradient methods and feature selection methods). More recently, TensorFlow added the "TensorFlow Probability"[^226] library, which contains modules on linear algebra, statistics, positive-definite kernels, Gaussian process methods, Monte Carlo methods, etc.


[^222]: see [this dedicated page for a description of XGBoost project](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
[^223]: see [this dedicated page for a description of TensorFlow neural networks](https://www.tensorflow.org/tutorials/customization/basics)
[^224]: see [this dedicated page for a description of PyTorch's neural networks](https://pytorch.org)
[^225]: see [this dedicated page for a description of the scikit-learn library](https://scikit-learn.org/stable/supervised_learning.html)
[^226]: see [this dedicated page for a description of the TensorFlow Probability library](https://www.tensorflow.org/probability/overview)


## Appendix to chapter 2

**Results of 1D extrapolation**. Table \@ref(tab:2998) illustrates the performance of supervised machine learning models in extrapolating the values of the periodic function defined in Section \@ref(preliminaries). We compare the performance using four measures: execution time, scores, the norm of the function to be predicted and MMD errors.


In [None]:
pyresults <- py$df_sup_results
knitr::kable(pyresults,  longtable = T, caption = "Supervised algorithms performance indicators", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "RMSE", "MMD"))  %>%
    kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


**Results of 2D extrapolation**. Table \@ref(tab:657) shows the indicators computed after running all the scenarios listed in Table \@ref(tab:299).



In [None]:
pyresults <- py$df_sup_resultsN
knitr::kable(pyresults,  caption = "Supervised algorithms performance indicators", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "RMSE", "MMD"))  %>% kable_styling(latex_options = "HOLD_position")


**Results of clustering methods**. Table \@ref(tab:2999) presents the results obtained in the clustering experiments. The performance is measured using four indicators: execution time, scores, MMD and inertia.



In [None]:
pyresults <- py$df_unsup_results
knitr::kable(pyresults,  longtable = T, caption = "Unsupervised algorithms performance indicators (Clustering)", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "MMD", "inertia"))  %>%
      kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")
