In [None]:
from book_funs5 import *



# Application to supervised machine learning 

## Aims of this chapter

In this chapter and the following ones, we present some examples of more concrete machine learning problems. Some of these tests are taken from Kaggle[^270].

[^270]: [kaggle, see this url](https://www.kaggle.com/).

Supervised learning problems can be split into regression and classification problems. Both aim at constructing a model that predicts the value of the output from the input variables. In the case of regression the output is a real-valued variable, whereas in the case of classification the output is a category (e.g. "disease" or "no disease"). Codpy's extrapolation and projection functions can be used to treat each of the above-mentioned problems.

We present two cases, one for each typical problem in supervised learning: Boston housing price prediction and MNIST classification.

## Regression problem: housing price prediction

**Description**. This database contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. There are 506 cases and 13 attributes (features), with a target column (price). More details can be found in the article by Harrison, D. and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.

**A comparison between methods**. We compare codpy's extrapolation operator defined in \@ref(eq:EI)-left with the following machine learning models: a decision tree (**DT**) from the scikit-learn library and a TensorFlow neural network (**NN**) model. Starting from the training set $X \in \RR^{N_x \times D}$, we extrapolate the labels $f_z$ on the test set $Z$ and compare them to the test set labels $f(Z)$.
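To make the idea of kernel extrapolation concrete, here is a minimal NumPy sketch. It is not codpy's API: since the formula \@ref(eq:EI) is not reproduced in this section, we assume the standard kernel interpolation form $f_z = K(Z,X) K(X,X)^{-1} f(X)$ with a Gaussian kernel. The names `gaussian_kernel`, `extrapolate`, `bandwidth` and `reg` are hypothetical and chosen for illustration only.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-|a_i - b_j|^2 / (2 h^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by round-off before exponentiating.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def extrapolate(x, fx, z, bandwidth=1.0, reg=1e-8):
    """Assumed form of kernel extrapolation: K(z, x) (K(x, x) + reg I)^{-1} fx.

    `reg` is a small ridge term added here for numerical stability only.
    """
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_zx = gaussian_kernel(z, x, bandwidth)
    weights = np.linalg.solve(k_xx + reg * np.eye(len(x)), fx)
    return k_zx @ weights

# Synthetic stand-in for a regression data set (not the housing data).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(50, 2))      # training inputs, N_x = 50, D = 2
fx = np.sin(x[:, :1]) + x[:, 1:] ** 2         # training values, D_f = 1
z = rng.uniform(-1.0, 1.0, size=(10, 2))      # test inputs, N_z = 10
fz = extrapolate(x, fx, z)                    # extrapolated values on z
```

The linear solve against the $N_x \times N_x$ kernel matrix is what makes this kind of method accurate but costly as $N_x$ grows.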

For the feed-forward NN we chose $50$ epochs with a batch size of $16$, the Adam optimization algorithm and MSE as the loss function. The NN is composed of one input layer ($8$ cells), two hidden layers ($64$ cells) and one output layer, with the following sequence of activation functions: RELU - RELU - RELU - Linear. All remaining hyperparameters are the defaults set by scikit-learn and TensorFlow.


In [None]:
scenarios_list, pd_scenarios_list = housing_scenario()



In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


The first plot in Figure \@ref(fig:59898979) compares methods in terms of scores, the second and third plots in terms of discrepancy errors and execution time, for the different scenarios defined in Table \@ref(tab:29909).

We give an interpretation of these results.

* First note that the RKHS-based method *codpy lab extra*, that is the extrapolation method, obtains both the best scores and the worst execution time.
* Note that if we subtract the discrepancy error from one, the result matches the scores of the method *codpy lab extra*. This indicates that the discrepancy error is an appropriate indicator.
* Another kernel method, *codpy lab proj*, that is the projection method above, is a more balanced method.
* Both kernel methods are shipped with a very standard kernel, the Gaussian one; the kernel is the only parameter of kernel methods. We emphasize that kernel engineering can easily improve these results. We do not present these improved kernel methods, as our purpose is to benchmark standard methods.

Observe that function norms and MMD errors are not method-dependent. Clearly, for this example, a periodic kernel-based method outperforms the two other ones. However, our goal is not to illustrate the supremacy of a particular method, but a benchmark methodology, particularly in the context of extrapolating test set data far from the training set.
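The remark that MMD errors are not method-dependent can be illustrated with a small sketch: the quantity is computed from the training and test inputs only, and no predictor enters the formula. The code below uses the usual biased estimator of the squared maximum mean discrepancy with a Gaussian kernel. The exact normalization used for the discrepancy error in this book may differ, so treat the function names and the `bandwidth` parameter as assumptions.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two points given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, zs, bandwidth=1.0):
    """Average of k(x, z) over all pairs (x, z)."""
    total = sum(gaussian_kernel(x, z, bandwidth) for x in xs for z in zs)
    return total / (len(xs) * len(zs))

def mmd_squared(xs, zs, bandwidth=1.0):
    """Biased estimator of the squared MMD between two point sets."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(zs, zs, bandwidth)
        - 2.0 * mean_kernel(xs, zs, bandwidth)
    )

# Toy training set and two candidate test sets (illustrative data only).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near_test = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
far_test = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]

mmd_near = mmd_squared(train, near_test)
mmd_far = mmd_squared(train, far_test)
```

A test set far from the training set yields a larger value, which is the situation where extrapolation is expected to be harder.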


In [None]:
housing_list = housing()



## Classification problem: handwritten digits

**Description**. This section contains an example of image classification, a typical academic example referred to as the MNIST problem, which allows us to benchmark our results against more popular methods.

MNIST ("Modified National Institute of Standards and Technology") contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset. Since its release in 1999, this classic database of handwritten images has served as the basis for benchmarking classification algorithms.

**Short introduction to MNIST**. The MNIST training set is composed of $60,000$ images of handwritten digits. Each image is a vector of dimension $784$ (a $28 \times 28$ grayscale image flattened in row order). There are $10$ digits, from $0$ to $9$. The test set is composed of $10,000$ images with their labels.

We formalize the problem as follows. Given the training set represented by a matrix $X\in \mathbb{R}^{N_x \times D}$, $D=784$, the labels $f(X) \in \mathbb{R}^{N_x \times D_f}$, $D_f=10$, and the test set $Z\in \mathbb{R}^{N_z \times D}$, $N_z= 10000$, predict the label function $f(Z) \in \mathbb{R}^{N_z \times D_f}$. Data are retrieved from Y. LeCun's MNIST home page (see [this dedicated page for a description of the MNIST database](http://yann.lecun.com/exdb/mnist/)), and we test different values for $N_x$.

The following picture shows an image of a handwritten digit, namely the first image $x^1$, as well as numerous others.

<center>
![](.\CodPyFigs\MNIST.png){width=50%}
</center>


In [None]:
scenarios_list = [ (784, 2**(i), 2**(i-2), 10000)  for i in np.arange(5,9,1)]
pd_scenarios_list = pd.DataFrame(scenarios_list)


**A comparison between methods**. We compare different machine learning models to classify MNIST digits: a support vector classifier (**SVC**), a decision tree classifier (**DT**), an AdaBoost classifier and a random forest classifier (**RF**) from the scikit-learn library, and a TensorFlow neural network (**NN**) model.

For the feed-forward NN we chose 10 epochs with a batch size of 16, the Adam optimization algorithm and sparse categorical cross-entropy as the loss function. The NN is composed of a first layer of 128 cells with a RELU activation function and an output layer of 10 cells. All remaining hyperparameters are the defaults set by scikit-learn and TensorFlow. We also straightforwardly apply the projection operator \@ref(eq:P) with the kernel function defined by a composition of the Gaussian kernel with a mean distance map, where the training set is $X \in \mathbb{R}^{N_x \times 784}$, and $Y \in \RR^{N_y \times 784}$, $Y \subset X$, is randomly chosen.
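The role of the randomly chosen subset $Y \subset X$ can be sketched as follows. This is not codpy's implementation: the operator \@ref(eq:P) is not reproduced here, so we assume a common least-squares form in which the prediction is expanded on kernel functions centered at $Y$ only. We also use a plain Gaussian kernel instead of the composition with a mean distance map. All function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-|a_i - b_j|^2 / (2 h^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def project(x, fx, y, z, bandwidth=1.0):
    """Assumed projection form: fit coefficients on centers y, evaluate on z."""
    k_xy = gaussian_kernel(x, y, bandwidth)            # shape (N_x, N_y)
    k_zy = gaussian_kernel(z, y, bandwidth)            # shape (N_z, N_y)
    # Least-squares fit of fx on the N_y kernel functions centered at y.
    coeffs, *_ = np.linalg.lstsq(k_xy, fx, rcond=None)
    return k_zy @ coeffs

# Synthetic stand-in for image data: N_x = 64 points, D = 5, D_f = 3 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5))
labels = rng.integers(0, 3, size=64)
fx = np.eye(3)[labels]                                  # one-hot labels
# Y is a random subset of X with N_y = N_x / 4, as in the scenario list.
y = x[rng.choice(len(x), size=16, replace=False)]
z = x[:8]
fz = project(x, fx, y, z)
predicted_labels = np.argmax(fz, axis=1)
```

Only an $N_x \times N_y$ system is solved, which is where the reduction in cost compared with the full extrapolation comes from.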


In [None]:
knitr::kable(py$pd_scenarios_list, col.names = c("D","Nx","Ny","Nz"),caption = "Scenario list") %>%
  kable_styling(latex_options = "HOLD_position")


Scores are computed using the formula \@ref(eq:score): a score is a scalar between 0 and 1 that measures the fraction of correctly predicted images.
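As an illustration, a score of this kind can be computed as below. The formula \@ref(eq:score) is not reproduced in this section, so this sketch assumes the plain fraction of correct predictions, with labels stored as rows of length $D_f$ and the predicted class taken as the largest entry of each row. The helper names are hypothetical, and three classes are used instead of ten for brevity.

```python
def argmax(row):
    """Index of the largest entry of a row of class scores."""
    return max(range(len(row)), key=lambda j: row[j])

def score(predicted, truth):
    """Fraction of rows whose predicted class matches the true class."""
    hits = sum(argmax(p) == argmax(t) for p, t in zip(predicted, truth))
    return hits / len(truth)

# One-hot ground truth for four samples and three classes (toy data).
truth = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
# Real-valued predictions; the third sample is misclassified.
predicted = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.6, 0.3],
]
accuracy = score(predicted, truth)  # three of four rows match
```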

Figure \@ref(fig:584) is a confusion matrix for the last scenario in Table \@ref(tab:538) for a neural network.


In [None]:
scenario_generator_, mnist_list = MNIST()



Figure \@ref(fig:65898) compares methods in terms of scores, MMD errors and execution time. We give an interpretation of these results.

* First notice that the kernel method *codpy class. extra*, a multiple-input/multiple-output classifier that is basically an extrapolation method, obtains both the best scores and the worst execution time.
* Notice also that one minus the discrepancy error matches the scores of the method *codpy class. extra*. This indicates that the discrepancy error is a pertinent indicator.
* Another RKHS-based method, *codpy class. proj*, reduces the computational complexity of extrapolation by projecting the input data to lower dimensions. It is a more balanced method with respect to accuracy versus complexity.
* Both kernel methods use a standard Gaussian kernel; the kernel is the only parameter of kernel methods. We emphasize that kernel engineering can easily improve these results. We do not present these improved kernel methods, as our purpose is to benchmark standard methods.

Observe that function norms and discrepancy errors are not method-dependent. Clearly, for this example, a periodic kernel-based method outperforms the two other ones. However, our goal is not to illustrate the supremacy of a particular method, but a benchmark methodology, particularly in the context of extrapolating test set data far from the training set.


In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
axis_field_labels = [("Nx","scores"),("Ny","discrepancy_errors"),("Ny","execution_time")],**kwargs)


## Reconstruction problems: learning from sub-sampled signals in tomography

**Description**. This numerical experiment illustrates an interesting application of learning machines to reconstruction problems from sub-sampled signals. Indeed, in this test, we learn from a well-established algorithm, namely SART, in order to speed up the reconstruction.

There are many applications of such problems. We illustrate this section with a problem coming from medical image reconstruction, which can also be used as a decision-support tool for medical diagnosis. However, such problems occur in a wide variety of other situations: biology, oceanography, astrophysics, ...

Poor input signal quality can sometimes be a choice. For instance, in nuclear medicine, it is better to work with lower radioisotope concentrations for obvious health reasons.
Another interesting motivation for sub-sampling signals is to accelerate data acquisition from expensive machines.

We illustrate this section with an example: reconstructing a signal from a sub-sampled SPECT (tomography) problem, which we now describe.

**A problem coming from SPECT tomography**. The purpose of this test is to illustrate a sub-sampling reconstruction in the context of medical imaging, more precisely from sub-sampled SPECT images. To that aim, we start by collecting a set of *high resolution* images[^594]. The set itself is not really important for our illustration in this section. However, it should be chosen carefully for a real production problem.

[^594]: the image set is available publicly at this [kaggle link](https://www.kaggle.com/vbookshelf/computed-tomography-ct-images/). 

This image database consists of high resolution $(512\times512)$ images, approximately 30 images for each of 82 patients. The training set is built on the first 81 patients. The 82-th patient is used for the test set. We first transform the training set database to produce our data. For each image in the training set (2470 images):

* We perform a "high" resolution $(256\times256)$ Radon transform [^357], called a **sinogram** [^324]. A sinogram is quite close to a Fourier transform of the original image, generating sinusoids.
* We perform a "low" resolution $(8\times256)$ Radon transform.
* We reconstruct the original image from the high resolution sinogram to simulate high resolution SPECT images from these data. The reconstruction algorithm consists in computing an inverse Radon transform [^424].

An example of training set construction is presented in Figure \ref{fig:SPECT}. The left panel shows the image reconstructed from the "high resolution" sinogram (middle). The low resolution sinogram is plotted on the right.
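The three steps above can be sketched with the scikit-image functions `radon` and `iradon_sart` cited in the footnotes. This is an illustrative sketch only, not the code used to produce the results of this section: it runs on a small synthetic disk instead of a CT image, the helper `make_training_pair` is hypothetical, and evenly spaced angles over $[0, 180)$ degrees are an assumption.

```python
import numpy as np
from skimage.transform import radon, iradon_sart

def make_training_pair(image, n_high=256, n_low=8, sart_iterations=3):
    """Return (low-res sinogram, high-res sinogram, SART reconstruction)."""
    # Assumption: projection angles evenly spaced over [0, 180) degrees.
    theta_high = np.linspace(0.0, 180.0, n_high, endpoint=False)
    theta_low = np.linspace(0.0, 180.0, n_low, endpoint=False)
    # skimage returns sinograms of shape (detector positions, angles).
    sino_high = radon(image, theta=theta_high, circle=True)
    sino_low = radon(image, theta=theta_low, circle=True)
    # Each call to iradon_sart performs one SART iteration; passing the
    # previous estimate through `image` refines it.
    recon = None
    for _ in range(sart_iterations):
        recon = iradon_sart(sino_high, theta=theta_high, image=recon)
    return sino_low, sino_high, recon

# Synthetic phantom: a centered disk, zero outside the inscribed circle.
size = 64
rows, cols = np.mgrid[:size, :size]
center = (size - 1) / 2.0
phantom = ((rows - center) ** 2 + (cols - center) ** 2 < (size / 4) ** 2).astype(float)

sino_low, sino_high, recon = make_training_pair(phantom, n_high=64, n_low=8)
```

In the learning problem, `sino_low` would play the role of an input and `recon` the role of the corresponding training value.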


In [None]:
knitr::include_graphics(path = "./CodPyFigs/SPECT.png")



[^357]: An introduction to the Radon transform can be found at [this Wikipedia page](https://en.wikipedia.org/wiki/Radon_transform).

[^324]: We used the standard Radon transform from scikit-image, [available at this url](https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.radon).

[^424]: We used a SART algorithm, 3 iterations, for reconstruction, [available at this url](https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.iradon_sart).

The test then consists in reconstructing all images of the 82-th patient using low-resolution sinograms.


**A comparison between methods**. We present here the results of a benchmark between a kernel-based method and the SART algorithm[^269].

[^269]: We did not succeed in finding competitive parameters for other methods.

Following the notations of Section \@ref(a-framework-for-machine-learning), we introduce:

* The training set $x \in \RR^{2473 \times 2304}$, consisting of 2473 sinograms of resolution $8 \times 256$: all low-resolution sinograms of the first 81 patients, plus the first one of the 82-th patient. This last sinogram is added to check an important feature in these problems: the learning machine must be able to retrieve an example it has already been given.
* The test set $z \in \RR^{29 \times 2304}$, consisting of 29 sinograms of the 82-th patient, of resolution $8 \times 256$.
* The training values set $f_x \in \RR^{2473 \times 65536}$, consisting of the 2473 images in "high-resolution".
* The ground truth values $f(Z) \in \RR^{29 \times 65536}$, consisting of 29 images in "high-resolution".

We perform the tests and output the results in Table \ref{tab:207}. The columns are the predictor identity, $D,N_x,N_y,N_z,D_f$, the execution time, and the score, computed with the RMSE \% error indicator, see \@ref(eq:rmse).

* The first line, named *exact*, simply outputs the original images, leading to zero error.
* The second one, named *SART*, reconstructs the images with the SART algorithm from the sub-sampled data.
* The third one, named *codpy*, reconstructs the images from the sub-sampled data with the kernel extrapolation method \@ref(eq:EI).

Figure \@ref(fig:359) plots the first 8 images, presenting the original one on the left, the reconstruction from the SART algorithm in the middle, and our algorithm on the right. One can check visually that this kernel method reconstructs the original image better. However, it would be erroneous to conclude that this reconstruction process performs better than the SART algorithm, and this is not at all our claim here. We simply illustrate the capacity of our algorithm to recognize existing patterns: indeed, note that the first image is perfectly reconstructed, as it is part of the training set. This property emphasizes that such methods are well suited to pattern recognition problems, as automated tools to support professionals' diagnoses.
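To fix ideas on the error indicator, the sketch below computes a relative root-mean-square error in percent. The formula \@ref(eq:rmse) is not reproduced in this section, so the normalization by the norm of the ground truth is an assumption, and `rmse_percent` is a hypothetical helper name. It does reproduce the behavior of the *exact* line, which returns zero error.

```python
import math

def rmse_percent(predicted, truth):
    """Assumed indicator: 100 * ||predicted - truth|| / ||truth||.

    Both arguments are lists of rows (one flattened image per row).
    """
    squared_error = sum(
        (p - t) ** 2
        for p_row, t_row in zip(predicted, truth)
        for p, t in zip(p_row, t_row)
    )
    squared_norm = sum(t ** 2 for t_row in truth for t in t_row)
    return 100.0 * math.sqrt(squared_error / squared_norm)

# Toy "images" with two pixels each (illustrative values only).
truth = [[3.0, 4.0], [0.0, 0.0]]
reconstruction = [[3.0, 1.0], [0.0, 0.0]]

error_exact = rmse_percent(truth, truth)           # predictor returning the originals
error_recon = rmse_percent(reconstruction, truth)  # imperfect reconstruction
```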


In [None]:
knitr::include_graphics(here::here("CodPyFigs", "reconstruction.png")) 



In [None]:
## Executed remotely from the file radon.py, as the execution time is too long.
#results = SPECT()


## Appendix 

Tables \@ref(tab:594) and \@ref(tab:601) report performance indicators for the Boston housing prices and MNIST datasets.


In [None]:
pyresults <- py$housing_list
knitr::kable(pyresults,  longtable = T, caption = "Performance indicators for housing prices database", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "MMD"))  %>%
        kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


In [None]:
pyresults <- py$mnist_list
knitr::kable(pyresults, longtable = T, caption = "Performance indicators for MNIST database", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "MMD"))  %>% 
      kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")
