In [None]:
from book_funs6 import *



# Applications to unsupervised machine learning

## Aims of this chapter

In this chapter we apply clustering methods to a number of use cases. We benchmark our kernel-based algorithms (see Section \@ref(clustering)) against the popular k-means algorithm. Both are distance-based minimization algorithms that aim to solve problem \@ref(eq:dist), which we recall here:

\[
Y = \arg \inf_{Y \in \mathbb{R}^{N_y \times D}} d(X,Y)
\]
The cluster centroids $Y\in \mathbb{R}^{N_y \times D}$ are the result of this minimization, where:

* For the k-means algorithm, the distance is called the *inertia*, see Section \@ref(kernel-methods-for-machine-learning).

* For kernel-based algorithms, the distance is the *MMD*, see Section \@ref(error-estimates-based-on-the-generalized-maximum-mean-discrepancy).

Importantly, if the distance functional $d(X,Y)$ is not convex, then a solution to \@ref(eq:dist) might not be unique. For instance, a k-means algorithm usually outputs different clusters on different runs.
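
The non-uniqueness mentioned above can be seen on a toy example. The sketch below is only an illustration in plain NumPy (the helper `inertia` is a hypothetical name, not a codpy or scikit function): for the four corners of the unit square and $N_y = 2$, two different centroid sets reach the same inertia.

```python
import numpy as np

def inertia(X, Y):
    """Sum of squared Euclidean distances from each row of X to its nearest row of Y."""
    # Pairwise squared distances, shape (N_x, N_y)
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

# Four corners of the unit square: N_x = 4, D = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Two different centroid sets with N_y = 2:
# one groups the corners left/right, the other bottom/top
Y_left_right = np.array([[0.0, 0.5], [1.0, 0.5]])
Y_bottom_top = np.array([[0.5, 0.0], [0.5, 1.0]])

print(inertia(X, Y_left_right), inertia(X, Y_bottom_top))
```

Both calls return the same value, so the inertia alone cannot single out one of the two partitions.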


## Classification problem: handwritten digits

**Description**. The MNIST test is also studied in Section \@ref(application-to-supervised-machine-learning). Here we treat it as a semi-supervised learning problem: we use the training set $X \in \mathbb{R}^{N_x \times D}$ to compute the cluster centroids $Y \in \mathbb{R}^{N_y \times D}$. We then use these clusters to predict the test labels $f_z \in \mathbb{R}^{N_z \times D_f}$, corresponding to the test set $Z \in \mathbb{R}^{N_z \times D}$.
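
One possible way to turn centroids into label predictions is sketched below. This is an assumption made for illustration, not necessarily the rule used in our benchmark: each centroid receives the majority training label of its cluster, and each test point inherits the label of its nearest centroid. The helper names `nearest_centroid` and `centroid_labels` are hypothetical, and labels are taken to be integers rather than vectors in $\mathbb{R}^{D_f}$.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of points."""
    sq = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq.argmin(axis=1)

def centroid_labels(X, fx, Y, n_classes):
    """Majority training label in each cluster (hypothetical labeling rule)."""
    assign = nearest_centroid(X, Y)
    labels = np.zeros(len(Y), dtype=int)
    for j in range(len(Y)):
        members = fx[assign == j]
        if members.size:  # an empty cluster keeps the default label 0
            labels[j] = np.bincount(members, minlength=n_classes).argmax()
    return labels

# Toy data: two well-separated groups with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])   # training set
fx = np.array([0, 0, 1, 1])                                       # training labels
Y = np.array([[0.05, 0.0], [0.95, 1.0]])                          # centroids
Z = np.array([[0.2, 0.1], [0.8, 0.9]])                            # test set

fz_pred = centroid_labels(X, fx, Y, n_classes=2)[nearest_centroid(Z, Y)]
print(fz_pred)  # [0 1]
```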

**A comparison between methods**. First we use scikit's k-means implementation, which partitions the input data $X \in \mathbb{R}^{N_x \times D}$ into $N_y$ sets so as to minimize the within-cluster sum of squares, called the "inertia". The inertia is the sum of squared distances from each point to the centroid of its cluster, the centroids being the rows of $Y \in \mathbb{R}^{N_y \times D}$. The k-means algorithm starts from a set of randomly initialized centroids and then iteratively updates their positions until the centroids stabilize or the prescribed number of iterations is reached.
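
The iteration just described can be sketched in a few lines of NumPy. This is a minimal, didactic version of the classical Lloyd iteration, not scikit's implementation (which, among other differences, uses a more careful seeding by default); the function name `lloyd_kmeans` and its parameters are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, n_clusters, max_iter=100, tol=1e-8, seed=0):
    """Basic Lloyd iteration: returns the centroids and the final inertia."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick n_clusters distinct training points
    Y = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        assign = sq.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        Y_new = Y.copy()
        for j in range(n_clusters):
            members = X[assign == j]
            if len(members):  # keep the old centroid if the cluster is empty
                Y_new[j] = members.mean(axis=0)
        shift = np.abs(Y_new - Y).max()
        Y = Y_new
        if shift < tol:  # centroids have stabilized
            break
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return Y, float(sq.min(axis=1).sum())

# Two well-separated synthetic blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
Y, final_inertia = lloyd_kmeans(X, n_clusters=2)
print(Y, final_inertia)
```

Changing `seed` changes the initial centroids, which is why different runs may end in different local minima on less clear-cut data.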

Second, we apply codpy's MMD minimization-based algorithm described in Section \@ref(a-kernel-based-clustering-algorithm), using the distance $d_k(x,y)$ induced by a Gaussian kernel: $k(x,y)=\exp(-|x-y|^2)$.
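
For orientation, the sketch below computes the standard squared MMD between the empirical measures of two point sets for the Gaussian kernel above. It is written in plain NumPy under our own assumptions: the helper names are hypothetical, and codpy's actual discrepancy functional may rescale the data or normalize the kernel differently.

```python
import numpy as np

def gaussian_kernel(A, B):
    """Gram matrix k(a, b) = exp(-|a - b|^2) for all rows a of A and b of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq)

def mmd_squared(X, Y):
    """Squared MMD between the empirical measures of X and Y (biased estimator)."""
    return float(gaussian_kernel(X, X).mean()
                 + gaussian_kernel(Y, Y).mean()
                 - 2.0 * gaussian_kernel(X, Y).mean())

X = np.array([[0.0], [1.0]])
print(mmd_squared(X, X))        # identical sets: zero discrepancy
print(mmd_squared(X, X + 3.0))  # shifted copy: strictly positive
```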


In [None]:
set_kernel = set_gaussian_kernel
scenarios_list = [ (-1, 1000, 2**i,1000) for i in np.arange(7,9,1)]
scenario_generator_ = scenario_generator()
pd_scenarios_list = pd.DataFrame(scenarios_list)


In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


In [None]:
scenario_generator_, mnist_results = MNIST_clustring()



The result of the k-means algorithm is $N_y$ clusters in $D=784$ dimensions, i.e. $Y \in \mathbb{R}^{N_y\times D}$. Note that the cluster centroids are themselves 784-dimensional points and can be interpreted as the "typical" digit within each cluster. Figure \@ref(fig:858) plots some examples of computed centroids, interpreted as images. As can be seen, they are clearly recognizable as digits.

Finally, we show a benchmark plot displaying the performance indicators of scikit's k-means and codpy's MMD minimization-based algorithm in terms of MMD, inertia, accuracy scores (when applicable) and execution time, using the scenarios in Table \@ref(tab:29908). Higher scores are better, whereas lower inertia and MMD are better.


In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")], **kwargs)


The scores are quite high compared to supervised methods for a similar training set size, see the results in Section \@ref(application-to-supervised-machine-learning). MMD-based minimization has an inertia indicator that is comparable to that of k-means. This is surprising, as k-means algorithms are based on inertia minimization. Moreover, the scores seem to indicate that the MMD distance is a more reliable criterion than inertia on this pattern recognition problem.

## German credit risk

**Description**. The original dataset[^601] contains 1000 entries with 20 categorical/symbolic attributes. In this database, each entry represents a person who takes out a credit from a bank. The goal is to categorize each person as a good or bad credit risk according to the set of attributes.

[^601]: The German credit risk dataset is described on this [Kaggle page](https://www.kaggle.com/uciml/german-credit).


In [None]:
scenarios_list = [(-1, -1, i,-1) for i in range(10, 21,10)]
scenario_generator_ = scenario_generator()
pd_scenarios_list = pd.DataFrame(scenarios_list)


In [None]:
scenario_generator_, german_credit_results= german_credit()



**A comparison between methods**. The result of k-means and of codpy's sharp discrepancy algorithm is $N_y$ clusters in $D$ dimensions. Notice that the cluster centroids are themselves $D$-dimensional points.



In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


Next we visualize the clusters and corresponding centroids of scikit's k-means and codpy's sharp discrepancy algorithm, where we vary the number of clusters $N_y$ from 1 to 8. In this example, a high number of clusters leads to overfitting, and the resulting clusters can no longer be interpreted when $N_y = 8$.

Finally, we show a benchmark plot displaying the performance indicators of scikit's k-means and codpy's sharp discrepancy algorithms, using the scenarios from Table \@ref(tab:29911).


In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")], **kwargs)


## Credit card marketing strategy

**Description**. The problem can be formalized as follows: develop a customer segmentation to define a marketing strategy. The sample dataset[^602] summarizes the usage behavior of 8,950 active credit card holders during the last 6 months. It contains 17 features and 8,950 records. The data describe customers' purchase and payment habits, such as how often a customer makes installment purchases or cash advances, how much they pay, etc. By inspecting each customer, we can find which type of purchase they favor, or whether they prefer cash advances over purchases.

[^602]: The credit card marketing strategy dataset is detailed on this dedicated [kaggle page](https://www.kaggle.com/arjunbhasin2013/ccdata).


In [None]:
scenarios_list = [(-1, -1, i,-1) for i in np.arange(2,21,3)]
scenario_generator_ = scenario_generator()
pd_scenarios_list = pd.DataFrame(scenarios_list)


**A comparison between methods**. The result of the k-means algorithm and of codpy's sharp discrepancy algorithm is $N_y$ clusters in $D$ dimensions. Note that the cluster centroids $Y \in \mathbb{R}^{N_y \times D}$ are themselves $D$-dimensional points.



In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


In [None]:
scenario_generator_, marketing_results = credit_card_marketing()



Next we visualize the clusters and corresponding centroids of scikit's k-means implementation and codpy's sharp discrepancy algorithm, where we vary the number of clusters $N_y$ from $2$ to $4$.

Finally, we show a benchmark plot displaying the performance indicators of scikit's k-means and codpy's sharp discrepancy algorithms, using the scenarios from Table \@ref(tab:29912).


In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")], **kwargs)


## Credit card fraud detection

**Description**. The database[^603] contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with $492$ frauds out of $284,807$ transactions. The database is highly unbalanced: the positive class (frauds) accounts for $0.172\%$ of all transactions.
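
The imbalance can be quantified directly from the two counts above. The short computation below also gives the accuracy of a trivial predictor that always answers "no fraud", which explains why raw accuracy has to be read with care on this dataset.

```python
n_fraud = 492        # fraudulent transactions
n_total = 284_807    # all transactions

fraud_rate = n_fraud / n_total
# Accuracy of a classifier that labels every transaction as non-fraudulent
baseline_accuracy = 1.0 - fraud_rate

print(f"fraud rate:        {100 * fraud_rate:.4f}%")
print(f"baseline accuracy: {100 * baseline_accuracy:.4f}%")
```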

The study addresses a fraud detection system that analyzes customer transactions in order to identify the patterns that lead to fraud. To facilitate this pattern recognition task, we use the k-means clustering algorithm, an unsupervised learning algorithm, to find the normal usage patterns of credit card users based on their past activity.


The database contains only numerical input variables, which are the result of a PCA transformation. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the database. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.

Feature 'Class' is the response variable and it takes value $1$ in case of fraud and $0$ otherwise.

[^603]: More details on this use case can be found on this [Kaggle page](https://www.kaggle.com/mlg-ulb/creditcardfraud).


In [None]:
# Scenarios (D, N_x, N_y, N_z): 500 training points, 1000 test points, N_y from 15 to 90
scenarios_list = [( -1, 500, i,1000 ) for i in np.arange(15,100,15)]
scenario_generator_ = scenario_generator()
pd_scenarios_list = pd.DataFrame(scenarios_list)


**A comparison between methods**. Table \@ref(tab:29913) defines the different scenarios of our experiment.



In [None]:
knitr::kable(py$pd_scenarios_list, caption = "scenario list", col.names = c("$D$","$N_x$","$N_y$","$N_z$"), escape = FALSE)%>%
  kable_styling(latex_options = "HOLD_position")


Figure \@ref(fig:580) illustrates confusion matrices for the last scenario of each approach.
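
As a reminder of how such a matrix is read, the sketch below builds a confusion matrix for a binary problem with NumPy. The labels are made up for the example and the helper `confusion_matrix` is our own illustrative function, not the routine used to produce the figure.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[i, j] counts the samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # unbuffered increment, handles repeated index pairs
    return cm

# Made-up labels: 0 = normal transaction, 1 = fraud
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
fraud_recall = cm[1, 1] / cm[1].sum()  # share of true frauds that were detected
print(cm)
print(fraud_recall)
```

On an unbalanced dataset, the second row of the matrix (the true frauds) is usually more informative than the overall accuracy.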



In [None]:
scenario_generator_, fraud_results = credit_card_fraud()



Finally, we show a benchmark plot of the performance of scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.



In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","scores"),("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")],**kwargs)


## Portfolio of stock clustering

**Description**. This case represents daily stock price movements $X \in \mathbb{R}^{N_{x} \times D}$ (i.e. the dollar difference between the closing and opening prices for each trading day) from $2010$ to $2015$.
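
The construction of $X$ from raw prices can be sketched as follows. The prices are invented for the example, and we assume that each row corresponds to one stock and each column to one trading day, so that $D$ is the number of trading days.

```python
import numpy as np

# Hypothetical opening and closing prices: 2 stocks (rows) over 3 trading days (columns)
open_prices = np.array([[10.0, 10.5, 11.0],
                        [50.0, 49.0, 48.5]])
close_prices = np.array([[10.4, 10.9, 10.8],
                         [49.5, 48.8, 49.0]])

# Daily movement: dollar difference between closing and opening prices
X = close_prices - open_prices   # shape (N_x, D) = (2, 3)
print(X)
```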


In [None]:
scenarios_list = [(-1, -1, i,-1) for i in range(10, 21,10)]
scenario_generator_ = scenario_generator()


In [None]:
scenario_generator_, idx, idx2, stocks_results = stocks_clustering()



In [None]:
idx = cbind(py$idx,py$idx2)
pander::pander(cbind(py$idx,py$idx2), split.cell = 80, split.table = Inf, style = "rmarkdown", caption = "Stock's clustering", col.names = c("k-means","MMD minimization"))


**A comparison between methods**. The table listing the stocks shows that both k-means clustering and MMD minimization sort the stocks into coherent groups. Finally, we show a benchmark plot of the performance of scikit's k-means and codpy's sharp discrepancy algorithms in terms of discrepancy errors, inertia, accuracy scores (when applicable) and execution time.



In [None]:
kwargs = {"mp_max_items" :4, "mp_ncols" : 4 }
scenario_generator_.compare_plots(
    axis_field_labels = [("Ny","discrepancy_errors"),("Ny","inertia"),("Ny","execution_time")], **kwargs)


## Appendix



In [None]:
pyresults <- py$mnist_results
knitr::kable(pyresults,  longtable = T, caption = "Performance indicators for MNIST dataset", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "MMD", "inertia")) %>%
        kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


In [None]:
pyresults <- py$german_credit_results
knitr::kable(pyresults,  longtable = T, caption = "Performance indicators for German credit database", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "MMD", "inertia")) %>%
        kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


In [None]:
pyresults <- py$marketing_results
knitr::kable(pyresults,  longtable = T, caption = "Performance indicators for credit card marketing database", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "MMD", "inertia")) %>%
       kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


In [None]:
pyresults <- py$fraud_results
knitr::kable(pyresults,  longtable = T, caption = "Performance indicators for credit card fraud database", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "scores", "MMD", "inertia")) %>%
      kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")


In [None]:
pyresults <- py$stocks_results
knitr::kable(pyresults,  longtable = T,  caption = "Performance indicators for stock price", escape = FALSE, col.names = c("$predictors$", "$D$", "$N_x$", "$N_y$", "$N_z$", "$D_f$", "time", "MMD", "inertia")) %>%
      kable_styling(latex_options = c("repeat_header","HOLD_position"),
              repeat_header_continued = "\\textit{(Continued on Next Page...)}")
