diff --git a/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst b/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst new file mode 100644 index 00000000000..2aef3a8a378 --- /dev/null +++ b/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst @@ -0,0 +1,80 @@ +=============== +Linear Time MMD +=============== + +The linear time MMD implements a nonparametric statistical hypothesis test to reject the null hypothesis that two distributions :math:`p` and :math:`q`, each only observed via :math:`n` samples, are the same, i.e. :math:`H_0:p=q`. + +The (unbiased) statistic is given by + +.. math:: + + \frac{2}{n}\sum_{i=1}^n k(x_{2i},x_{2i}) + k(x_{2i+1}, x_{2i+1}) - 2k(x_{2i},x_{2i+1}). + +See :cite:`gretton2012kernel` for a detailed introduction. + +------- +Example +------- + +Imagine we have samples from :math:`p` and :math:`q`. +As the linear time MMD is a streaming statistic, we need to pass it :sgclass:`CStreamingFeatures`. +Here, we use synthetic data generators, but it is possible to construct :sgclass:`CStreamingFeatures` from (large) files. +We create an instance of :sgclass:`CLinearTimeMMD`, passing it data and the kernel to use, + +.. sgexample:: linear_time_mmd.sg:create_instance + +An important parameter for controlling the efficiency of the linear time MMD is the block size, i.e. the number of samples that are processed at once. As a guideline, set it as large as memory allows. + +.. sgexample:: linear_time_mmd.sg:set_burst + +Computing the statistic is done as + +.. sgexample:: linear_time_mmd.sg:estimate_mmd + +We can perform the hypothesis test via computing a test threshold for a given :math:`\alpha`, or by directly computing a p-value. + +.. sgexample:: linear_time_mmd.sg:perform_test_threshold + +--------------- +Kernel learning +--------------- + +There are various options to learn a kernel. +All options allow learning a single kernel among a number of provided baseline kernels. 
+Furthermore, some of these criteria can be used to learn the coefficients of a convex combination of baseline kernels. + +There are different strategies to learn the kernel, see :sgclass:`CKernelSelectionStrategy`. + +We specify the desired baseline kernels to consider. Note the kernel above is not considered in the selection. + +.. sgexample:: linear_time_mmd.sg:add_kernels + +IMPORTANT: when learning the kernel for statistical testing, this needs to be done on different data than that used for performing the actual test. +One way to accomplish this is to manually provide a different set of features for testing. +In Shogun, it is also possible to automatically split the provided data by specifying the ratio between train and test data, via enabling the train-test mode. + +.. sgexample:: linear_time_mmd.sg:enable_train_test_mode + +A ratio of 1 means the data is split in half while learning the kernel, and subsequent tests are performed on the second half. + +We learn the kernel and extract the result, again see :sgclass:`CKernelSelectionStrategy` for more available strategies. Note that the kernel of the MMD itself is replaced. +If all kernels have the same type, we can convert the result into that type, for example to extract its parameters. + +.. sgexample:: linear_time_mmd.sg:select_kernel_single + +Note that in order to extract particular kernel parameters, we need to cast the kernel to its actual type. + +Similarly, a convex combination of kernels, in the form of :sgclass:`CCombinedKernel` can be learned and extracted as + +.. sgexample:: linear_time_mmd.sg:select_kernel_combined + +We can perform the test on the last learnt kernel. +Since we enabled the train-test mode, this automatically is done on the held out test data. + +.. sgexample:: linear_time_mmd.sg:perform_test + +---------- +References +---------- +.. 
bibliography:: ../../references.bib + :filter: docname in docnames diff --git a/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst b/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst new file mode 100644 index 00000000000..1882d19e7ae --- /dev/null +++ b/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst @@ -0,0 +1,93 @@ +================== +Quadratic Time MMD +================== + +The quadratic time MMD implements a nonparametric statistical hypothesis test to reject the null hypothesis that two distributions :math:`p` and :math:`q`, only observed via :math:`n` and :math:`m` samples respectively, are the same, i.e. :math:`H_0:p=q`. + +The (biased) test statistic is given by + +.. math:: + + \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m k(x_i,x_i) + k(x_j, x_j) - 2k(x_i,x_j). + + +See :cite:`gretton2012kernel` for a detailed introduction. + +------- +Example +------- + +Imagine we have samples from :math:`p` and :math:`q`, here in the form of CDenseFeatures (64-bit floats, aka RealFeatures). + +.. sgexample:: quadratic_time_mmd.sg:create_features + +We create an instance of :sgclass:`CQuadraticTimeMMD`, passing it data and the kernel. + +.. sgexample:: quadratic_time_mmd.sg:create_instance + +We can choose between multiple ways to compute the test statistic, see :sgclass:`CQuadraticTimeMMD` for details. +The biased statistic is computed as + +.. sgexample:: quadratic_time_mmd.sg:estimate_mmd + +There are multiple ways to perform the actual hypothesis test, see :sgclass:`CQuadraticTimeMMD` for details. The permutation version simulates from :math:`H_0` via repeatedly permuting the samples from :math:`p` and :math:`q`. We can perform the test via computing a test threshold for a given :math:`\alpha`, or by directly computing a p-value. + +.. 
sgexample:: quadratic_time_mmd.sg:perform_test + +---------------- +Multiple kernels +---------------- + +It is possible to perform all operations (computing statistics, performing tests, etc.) for multiple kernels at once, via the :sgclass:`CMultiKernelQuadraticTimeMMD` interface. + +.. sgexample:: quadratic_time_mmd.sg:multi_kernel + +Note that the results are now a vector with one entry per kernel. +Also note that the kernels for the single-kernel and multi-kernel interfaces are kept separately. + +--------------- +Kernel learning +--------------- + +There are various options to learn a kernel. +All options allow learning a single kernel among a number of provided baseline kernels. +Furthermore, some of these criteria can be used to learn the coefficients of a convex combination of baseline kernels. + +There are different strategies to learn the kernel, see :sgclass:`CKernelSelectionStrategy`. + +We specify the desired baseline kernels to consider. Note the kernel above is not considered in the selection. + +.. sgexample:: quadratic_time_mmd.sg:add_kernels + +IMPORTANT: when learning the kernel for statistical testing, this needs to be done on different data than that used for performing the actual test. +One way to accomplish this is to manually provide a different set of features for testing. +In Shogun, it is also possible to automatically split the provided data by specifying the ratio between train and test data, via enabling the train-test mode. + +.. sgexample:: quadratic_time_mmd.sg:enable_train_test_mode + +A ratio of 1 means the data is split in half while learning the kernel, and subsequent tests are performed on the second half. + +We learn the kernel and extract the result, again see :sgclass:`CKernelSelectionStrategy` for more available strategies. +Note that the kernel of the MMD itself is replaced. +If all kernels have the same type, we can convert the result into that type, for example to extract its parameters. + +.. 
sgexample:: quadratic_time_mmd.sg:select_kernel_single + +Note that in order to extract particular kernel parameters, we need to cast the kernel to its actual type. + +Similarly, a convex combination of kernels, in the form of :sgclass:`CCombinedKernel` can be learned and extracted as + +.. sgexample:: quadratic_time_mmd.sg:select_kernel_combined + +We can perform the test on the last learnt kernel. +Since we enabled the train-test mode, this automatically is done on the held out test data. + +.. sgexample:: quadratic_time_mmd.sg:perform_test_optimized + +---------- +References +---------- +.. bibliography:: ../../references.bib + :filter: docname in docnames + +:wiki:`Statistical_hypothesis_testing` diff --git a/doc/cookbook/source/index.rst b/doc/cookbook/source/index.rst index 30616938fcb..1b979cb0399 100644 --- a/doc/cookbook/source/index.rst +++ b/doc/cookbook/source/index.rst @@ -47,6 +47,15 @@ Regression examples/regression/** +Statistical Testing +------------------- + +.. toctree:: + :maxdepth: 1 + :glob: + + examples/statistical_testing/** + Kernels ------- diff --git a/doc/cookbook/source/references.bib b/doc/cookbook/source/references.bib index 5cba98e83a3..b32bec37852 100644 --- a/doc/cookbook/source/references.bib +++ b/doc/cookbook/source/references.bib @@ -25,7 +25,7 @@ @book{cristianini2000introduction publisher={Cambridge University Press} } @article{fan2008liblinear, - title={LIBLINEAR: A Library for Large Linear Classification}, + title={{LIBLINEAR: A Library for Large Linear Classification}}, author={R.E. Fan and K.W. Chang and C.J. Hsieh and X.R. Wang and C.J. Lin}, journal={Journal of Machine Learning Research}, volume={9}, @@ -36,7 +36,18 @@ @book{Rasmussen2005GPM author = {Rasmussen, C. E. and Williams, C. K. 
I.}, title = {Gaussian Processes for Machine Learning}, year = {2005}, - publisher = {The MIT Press} + publisher = {The MIT Press}, + year={2008}, +} + +@article{gretton2012kernel, + title={A kernel two-sample test}, + author={Gretton, A. and Borgwardt, K.M. and Rasch, M.J. and Sch{\"o}lkopf, B. and Smola, A.}, + journal={The Journal of Machine Learning Research}, + volume={13}, + number={1}, + pages={723--773}, + year={2012}, } @article{ueda2000smem, title={SMEM Algorithm for Mixture Models}, @@ -102,6 +113,13 @@ @inproceedings{shalev2011shareboost pages={1179--1187}, year={2011} } + +@inproceedings{gretton2012optimal, + author={Gretton, A. and Sriperumbudur, B. and Sejdinovic, D. and Strathmann, H. and Balakrishnan, S. and Pontil, M. and Fukumizu, K.}, + booktitle={Advances in Neural Information Processing Systems}, + title={{Optimal kernel choice for large-scale two-sample tests}}, + year={2012} +} @article{sonnenburg2006large, title={Large scale multiple kernel learning}, author={S. Sonnenburg and G. R{\"a}tsch and C. Sch{\"a}fer and B. 
Sch{\"o}lkopf}, diff --git a/doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb b/doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb similarity index 62% rename from doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb rename to doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb index 9169538737c..4b8acceaca5 100644 --- a/doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb +++ b/doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb @@ -11,22 +11,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de" + "#### Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de\n", + "#### Soumyajit De - soumyajitde.cse@gmail.com - github.com/lambday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook describes Shogun's framework for statistical hypothesis testing. We begin by giving a brief outline of the problem setting and then describe various implemented algorithms. All the algorithms discussed here are for Kernel two-sample testing with Maximum Mean Discrepancy and are based on embedding probability distributions into Reproducing Kernel Hilbert Spaces( RKHS )." + "This notebook describes Shogun's framework for statistical hypothesis testing. We begin by giving a brief outline of the problem setting and then describe various implemented algorithms.\n", + "All algorithms discussed here are instances of kernel two-sample testing with the *maximum mean discrepancy*, and are based on embedding probability distributions into Reproducing Kernel Hilbert Spaces (RKHS)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Methods for two-sample testing currently consist of tests based on the *Maximum Mean Discrepancy*. There are two types of tests available, a quadratic time test and a linear time test. 
Both come in various flavours.\n", - "Independence testing is currently based in the *Hilbert Schmidt Independence Criterion*." + "There are two types of tests available, a quadratic time test and a linear time test. Both come in various flavours." ] }, { @@ -39,8 +40,8 @@ "source": [ "%pylab inline\n", "%matplotlib inline\n", - "# import all Shogun classes\n", - "from modshogun import *" + "import modshogun as sg\n", + "import numpy as np" ] }, { @@ -54,7 +55,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called *null-hypothesis* $H_0$, which is the complement of an *alternative-hypothesis* $H_A$. \n", + "To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called *null-hypothesis* $H_0:p=q$, which is the complement of an *alternative-hypothesis* $H_A$. \n", "\n", "To distinguish the hypotheses, a test statistic is computed on sample data. Since sample data is finite, this corresponds to sampling the true distribution of the test statistic. There are two different distributions of the test statistic -- one for each hypothesis. The *null-distribution* corresponds to test statistic samples under the model that $H_0$ holds; the *alternative-distribution* corresponds to test statistic samples under the model that $H_A$ holds.\n", "\n", @@ -65,11 +66,11 @@ " * A *type I error* is made when $H_0: p=q$ is wrongly rejected. That is, the test says that the samples are from different distributions when they are not.\n", " * A *type II error* is made when $H_A: p\\neq q$ is wrongly accepted. 
That is, the test says that the samples are from the same distribution when they are not.\n", "\n", - "A so-called *consistent* test achieves zero type II error for a fixed type I error.\n", + "A so-called *consistent* test achieves zero type II error for a fixed type I error, as it sees more data.\n", "\n", "To decide whether to reject $H_0$, one could set a threshold, say at the $95\\%$ quantile of the null-distribution, and reject $H_0$ when the test statistic lies below that threshold. This means that the chance that the samples were generated under $H_0$ are $5\\%$. We call this number the *test power* $\\alpha$ (in this case $\\alpha=0.05$). It is an upper bound on the probability for a type I error. An alternative way is simply to compute the quantile of the test statistic in the null-distribution, the so-called *p-value*, and to compare the p-value against a desired test power, say $\\alpha=0.05$, by hand. The advantage of the second method is that one not only gets a binary answer, but also an upper bound on the type I error.\n", "\n", - "In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this for any two-sample test is called *bootstrapping*, or the *permutation* test, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for every of those configurations. While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. For many test statistics, there are more sophisticated methods of approximating the null distribution." + "In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this is called the *permutation test*, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for every of those configurations. 
While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. Shogun comes with an extremely optimized implementation though. For completeness, Shogun also includes a number of more sophisticated ways of approximating the null distribution." ] }, { @@ -83,15 +84,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. This class offers methods to\n", + "Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. We here focus on CTwoSampleTest. This class offers methods to\n", "\n", " * compute the implemented test statistic,\n", " * compute p-values for a given value of the test statistic,\n", " * compute a test threshold for a given p-value,\n", - " * sampling the null distribution, i.e. perform the permutation test or bootstrappig of the null-distribution, and\n", - " * performing a full two-sample test, and either returning a p-value or a binary rejection decision. This method is most useful in practice. Note that, depending on the used test statistic, it might be faster to call this than to compute threshold and test statistic seperately with the above methods.\n", - " \n", - "There are special subclasses for testing two distributions against each other (CTwoSampleTest, CIndependenceTest), kernel two-sample testing (CKernelTwoSampleTest), and kernel independence testing (CKernelIndependenceTest), which however mostly differ in internals and constructors." + " * approximate the null distribution, e.g. perform the permutation test, and\n", + " * perform a full two-sample test, and either return a p-value or a binary rejection decision. This method is most useful in practice. 
Note that, depending on the used test statistic, it might be faster to call this than to compute threshold and test statistic separately with the above methods." ] }, { @@ -123,7 +122,7 @@ " +\\textbf{E}_{y,y'}\\left[ k(y,y')\\right]\n", "\\end{align*}\n", "\n", - "Note that this formulation does not assume any form of the input data, we just need a kernel function whose feature space is a RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.\n", + "Note that this formulation does not assume any form of the input data, we just need a kernel function whose feature space is a RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.\n", "\n", "We here only describe how to use the MMD for two-sample testing. Shogun offers two types of test statistic based on the MMD, one with quadratic costs both in time and space, and one with linear time and constant space costs. Both come in different versions and with different methods how to approximate the null-distribution in order to construct a two-sample test." 
] @@ -159,11 +158,11 @@ "outputs": [], "source": [ "# use scipy for generating samples\n", - "from scipy.stats import norm, laplace\n", + "from scipy.stats import laplace, norm\n", "\n", - "def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=sqrt(0.5)): \n", + "def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)): \n", " # sample from both distributions\n", - " X=norm.rvs(size=n, loc=mu, scale=sigma2)\n", + " X=norm.rvs(size=n)*np.sqrt(sigma2)+mu\n", " Y=laplace.rvs(size=n, loc=mu, scale=b)\n", " \n", " return X,Y" @@ -179,31 +178,30 @@ "source": [ "mu=0.0\n", "sigma2=1\n", - "b=sqrt(0.5)\n", + "b=np.sqrt(0.5)\n", "n=220\n", "X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)\n", "\n", "# plot both densities and histograms\n", - "figure(figsize=(18,5))\n", - "suptitle(\"Gaussian vs. Laplace\")\n", - "subplot(121)\n", - "Xs=linspace(-2, 2, 500)\n", - "plot(Xs, norm.pdf(Xs, loc=mu, scale=sigma2))\n", - "plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))\n", - "title(\"Densities\")\n", - "xlabel(\"$x$\")\n", - "ylabel(\"$p(x)$\")\n", - "_=legend([ 'Gaussian','Laplace'])\n", - "\n", - "subplot(122)\n", - "hist(X, alpha=0.5)\n", - "xlim([-5,5])\n", - "ylim([0,100])\n", - "hist(Y,alpha=0.5)\n", - "xlim([-5,5])\n", - "ylim([0,100])\n", - "legend([\"Gaussian\", \"Laplace\"])\n", - "_=title('Histograms')" + "plt.figure(figsize=(18,5))\n", + "plt.suptitle(\"Gaussian vs. 
Laplace\")\n", + "plt.subplot(121)\n", + "Xs=np.linspace(-2, 2, 500)\n", + "plt.plot(Xs, norm.pdf(Xs, loc=mu, scale=sigma2))\n", + "plt.plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))\n", + "plt.title(\"Densities\")\n", + "plt.xlabel(\"$x$\")\n", + "plt.ylabel(\"$p(x)$\")\n", + "\n", + "plt.subplot(122)\n", + "plt.hist(X, alpha=0.5)\n", + "plt.xlim([-5,5])\n", + "plt.ylim([0,100])\n", + "plt.hist(Y,alpha=0.5)\n", + "plt.xlim([-5,5])\n", + "plt.ylim([0,100])\n", + "plt.legend([\"Gaussian\", \"Laplace\"])\n", + "plt.title('Samples');" ] }, { @@ -222,8 +220,8 @@ "outputs": [], "source": [ "print \"Gaussian vs. Laplace\"\n", - "print \"Sample means: %.2f vs %.2f\" % (mean(X), mean(Y))\n", - "print \"Samples variances: %.2f vs %.2f\" % (var(X), var(Y))" + "print \"Sample means: %.2f vs %.2f\" % (np.mean(X), np.mean(Y))\n", + "print \"Samples variances: %.2f vs %.2f\" % (np.var(X), np.var(Y))" ] }, { @@ -237,7 +235,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We now describe the quadratic time MMD, as described in [1, Lemma 6], which is implemented in Shogun. All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun, and use it on the above toy problem.\n", + "We now describe the quadratic time MMD, as described in [1, Lemma 6], which is implemented in Shogun. 
All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun, and use it on the above toy problem.\n", "\n", "An unbiased estimate for the MMD expression above can be obtained by estimating expected values with averaging over independent samples\n", "\n", @@ -251,7 +249,7 @@ "\\mmd_b[\\mathcal{F},X,Y]^2=\\frac{1}{m^2}\\sum_{i=1}^m\\sum_{j=1}^mk(x_i,x_j) + \\frac{1}{n^ 2}\\sum_{i=1}^n\\sum_{j=1}^nk(y_i,y_j)-\\frac{2}{mn}\\sum_{i=1}^m\\sum_{j\\neq i}^nk(x_i,y_j)\n", ".$$\n", "\n", - "Computing the test statistic using CQuadraticTimeMMD does exactly this, where it is possible to choose between the two above expressions. Note that some methods for approximating the null-distribution only work with one of both types. Both statistics' computational costs are quadratic both in time and space. Note that the method returns $m\\mmd_b[\\mathcal{F},X,Y]^2$ since null distribution approximations work on $m$ times null distribution. Here is how the test statistic itself is computed." + "Computing the test statistic using CQuadraticTimeMMD does exactly this, where it is possible to choose between the two above expressions. Note that some methods for approximating the null-distribution only work with one of both types. Both statistics' computational costs are quadratic both in time and space. Note that the method returns $m\\mmd_b[\\mathcal{F},X,Y]^2$ since null distribution approximations work on $m$ times null distribution. Here is how the test statistic itself is computed." ] }, { @@ -263,22 +261,25 @@ "outputs": [], "source": [ "# turn data into Shogun representation (columns vectors)\n", - "feat_p=RealFeatures(X.reshape(1,len(X)))\n", - "feat_q=RealFeatures(Y.reshape(1,len(Y)))\n", + "feat_p=sg.RealFeatures(X.reshape(1,len(X)))\n", + "feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))\n", "\n", "# choose kernel for testing. 
Here: Gaussian\n", "kernel_width=1\n", - "kernel=GaussianKernel(10, kernel_width)\n", + "kernel=sg.GaussianKernel(10, kernel_width)\n", "\n", "# create mmd instance of test-statistic\n", - "mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n", + "mmd=sg.QuadraticTimeMMD()\n", + "mmd.set_kernel(kernel)\n", + "mmd.set_p(feat_p)\n", + "mmd.set_q(feat_q)\n", "\n", "# compute biased and unbiased test statistic (default is unbiased)\n", - "mmd.set_statistic_type(BIASED)\n", + "mmd.set_statistic_type(sg.ST_BIASED_FULL)\n", "biased_statistic=mmd.compute_statistic()\n", "\n", - "mmd.set_statistic_type(UNBIASED)\n", - "unbiased_statistic=mmd.compute_statistic()\n", + "mmd.set_statistic_type(sg.ST_UNBIASED_FULL)\n", + "statistic=unbiased_statistic=mmd.compute_statistic()\n", "\n", "print \"%d x MMD_b[X,Y]^2=%.2f\" % (len(X), biased_statistic)\n", "print \"%d x MMD_u[X,Y]^2=%.2f\" % (len(X), unbiased_statistic)" @@ -288,7 +289,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Any sub-class of CHypothesisTest can compute approximate the null distribution using permutation/bootstrapping. This way always is guaranteed to produce consistent results, however, it might take a long time as for each sample of the null distribution, the test statistic has to be computed for a different permutation of the data. Note that each of the below calls samples from the null distribution. It is wise to choose one method in practice. Also note that we set the number of samples from the null distribution to a low value to reduce runtime. Choose larger in practice, it is in fact good to plot the samples." + "Any sub-class of CHypothesisTest can compute approximate the null distribution using permutation/bootstrapping. This way always is guaranteed to produce consistent results, however, it might take a long time as for each sample of the null distribution, the test statistic has to be computed for a different permutation of the data. 
Shogun's implementation is highly optimized, exploiting low-level CPU caching and multiple available cores." ] }, { @@ -299,18 +300,14 @@ }, "outputs": [], "source": [ - "# this is not necessary as bootstrapping is the default\n", - "mmd.set_null_approximation_method(PERMUTATION)\n", - "mmd.set_statistic_type(UNBIASED)\n", - "\n", - "# to reduce runtime, should be larger practice\n", - "mmd.set_num_null_samples(100)\n", + "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n", + "mmd.set_num_null_samples(200)\n", "\n", "# now show a couple of ways to compute the test\n", "\n", "# compute p-value for computed test statistic\n", - "p_value=mmd.compute_p_value(unbiased_statistic)\n", - "print \"P-value of MMD value %.2f is %.2f\" % (unbiased_statistic, p_value)\n", + "p_value=mmd.compute_p_value(statistic)\n", + "print \"P-value of MMD value %.2f is %.2f\" % (statistic, p_value)\n", "\n", "# compute threshold for rejecting H_0 for a given test power\n", "alpha=0.05\n", @@ -318,7 +315,7 @@ "print \"Threshold for rejecting H0 with a test power of %.2f is %.2f\" % (alpha, threshold)\n", "\n", "# performing the test by hand given the above results, note that those two are equivalent\n", - "if unbiased_statistic>threshold:\n", + "if statistic>threshold:\n", " print \"H0 is rejected with confidence %.2f\" % alpha\n", " \n", "if p_valueCCustomKernel class, which allows to precompute a kernel matrix (multithreaded) of a given kernel and store it in memory. Instances of this class can then be used as if they were standard kernels." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false - }, - "outputs": [], - "source": [ - "# precompute kernel to be faster for null sampling\n", - "p_and_q=mmd.get_p_and_q()\n", - "kernel.init(p_and_q, p_and_q);\n", - "precomputed_kernel=CustomKernel(kernel);\n", - "mmd.set_kernel(precomputed_kernel);\n", - "\n", - "# increase number of iterations since should be faster now\n", - "mmd.set_num_null_samples(500);\n", - "p_value_boot=mmd.perform_test();\n", - "print \"P-value of MMD test is %.2f\" % p_value_boot" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let us visualise distribution of MMD statistic under $H_0:p=q$ and $H_A:p\\neq q$. Sample both null and alternative distribution for that. Use the interface of CTwoSampleTest to sample from the null distribution (permutations, re-computing of test statistic is done internally). For the alternative distribution, compute the test statistic for a new sample set of $X$ and $Y$ in a loop. Note that the latter is expensive, as the kernel cannot be precomputed, and infinite data is needed. Though it is not needed in practice but only for illustrational purposes here." + "Now let us visualise distribution of MMD statistic under $H_0:p=q$ and $H_A:p\\neq q$. Sample both null and alternative distribution for that. Use the interface of CHypothesisTest to sample from the null distribution (permutations, re-computing of test statistic is done internally). For the alternative distribution, compute the test statistic for a new sample set of $X$ and $Y$ in a loop. Note that the latter is expensive, as the kernel cannot be precomputed, and infinite data is needed. Though it is not needed in practice but only for illustrational purposes here." 
] }, { @@ -388,18 +346,21 @@ "num_samples=500\n", "\n", "# sample null distribution\n", - "mmd.set_num_null_samples(num_samples)\n", "null_samples=mmd.sample_null()\n", "\n", "# sample alternative distribution, generate new data for that\n", - "alt_samples=zeros(num_samples)\n", + "alt_samples=np.zeros(num_samples)\n", "for i in range(num_samples):\n", " X=norm.rvs(size=n, loc=mu, scale=sigma2)\n", " Y=laplace.rvs(size=n, loc=mu, scale=b)\n", - " feat_p=RealFeatures(reshape(X, (1,len(X))))\n", - " feat_q=RealFeatures(reshape(Y, (1,len(Y))))\n", - " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n", - " alt_samples[i]=mmd.compute_statistic()" + " feat_p=sg.RealFeatures(np.reshape(X, (1,len(X))))\n", + " feat_q=sg.RealFeatures(np.reshape(Y, (1,len(Y))))\n", + " # TODO: reset pre-computed kernel here\n", + " mmd.set_p(feat_p)\n", + " mmd.set_q(feat_q)\n", + " alt_samples[i]=mmd.compute_statistic()\n", + "\n", + "np.std(alt_samples)" ] }, { @@ -428,26 +389,26 @@ "outputs": [], "source": [ "def plot_alt_vs_null(alt_samples, null_samples, alpha):\n", - " figure(figsize=(18,5))\n", + " plt.figure(figsize=(18,5))\n", " \n", - " subplot(131)\n", - " hist(null_samples, 50, color='blue')\n", - " title('Null distribution')\n", - " subplot(132)\n", - " title('Alternative distribution')\n", - " hist(alt_samples, 50, color='green')\n", + " plt.subplot(131)\n", + " plt.hist(null_samples, 50, color='blue')\n", + " plt.title('Null distribution')\n", + " plt.subplot(132)\n", + " plt.title('Alternative distribution')\n", + " plt.hist(alt_samples, 50, color='green')\n", " \n", - " subplot(133)\n", - " hist(null_samples, 50, color='blue')\n", - " hist(alt_samples, 50, color='green', alpha=0.5)\n", - " title('Null and alternative distriution')\n", + " plt.subplot(133)\n", + " plt.hist(null_samples, 50, color='blue')\n", + " plt.hist(alt_samples, 50, color='green', alpha=0.5)\n", + " plt.title('Null and alternative distriution')\n", " \n", " # find (1-alpha) element of null distribution\n", 
- " null_samples_sorted=sort(null_samples)\n", - " quantile_idx=int(num_samples*(1-alpha))\n", + " null_samples_sorted=np.sort(null_samples)\n", + " quantile_idx=int(len(null_samples)*(1-alpha))\n", " quantile=null_samples_sorted[quantile_idx]\n", - " axvline(x=quantile, ymin=0, ymax=100, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')\n", - " _=legend()" + " plt.axvline(x=quantile, ymin=0, ymax=100, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')\n", + " legend();" ] }, { @@ -472,7 +433,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As already mentioned, bootstrapping the null distribution is expensive business. There exist a couple of methods that are more sophisticated and either allow very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], which are implemented in Shogun.\n", + "As already mentioned, permuting the data to access the null distribution is probably the method of choice, due to the efficient implementation in Shogun. There exist a couple of methods that are more sophisticated (and slower) and either allow very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], which are implemented in Shogun.\n", "\n", "The first one is a spectral method that is based around the Eigenspectrum of the kernel matrix of the joint samples. It is faster than bootstrapping while being a consistent test. Effectively, the null-distribution of the biased statistic is sampled, but in a more efficient way than the bootstrapping approach. The converges as\n", "\n", @@ -482,12 +443,12 @@ "\n", "where $z_l\\sim \\mathcal{N}(0,2)$ are i.i.d. 
normal samples and $\\lambda_l$ are Eigenvalues of expression 2 in [2], which can be empirically estimated by $\\hat\\lambda_l=\\frac{1}{m}\\nu_l$ where $\\nu_l$ are the Eigenvalues of the centred kernel matrix of the joint samples $X$ and $Y$. The distribution above can be easily sampled. Shogun's implementation has two parameters:\n", "\n", - " * Number of samples from null-distribution. The more, the more accurate. As a rule of thumb, use 250.\n", + " * Number of samples from null-distribution. The more, the more accurate.\n", " * Number of Eigenvalues of the Eigen-decomposition of the kernel matrix to use. The more, the better the results get. However, the Eigen-spectrum of the joint gram matrix usually decreases very fast. Plotting the Spectrum can help. See [2] for details.\n", "\n", - "If the kernel matrices are diagonal dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are much lower than bootstrapping, which is the only consistent alternative. Since Eigenvalues of the gram matrix has to be computed, costs are in $\\mathcal{O}(m^3)$.\n", + "If the kernel matrices are diagonal dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are likely to be larger than permutation testing, due to the efficient implementation of the latter: Eigenvalues of the gram matrix cost $\\mathcal{O}(m^3)$.\n", "\n", - "Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic." + "Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic." 
] }, { @@ -499,23 +460,24 @@ "outputs": [], "source": [ "# optional: plot spectrum of joint kernel matrix\n", - "from numpy.linalg import eig\n", + "\n", + "# TODO: it would be good if there was a way to extract the joint kernel matrix for all kernel tests\n", "\n", "# get joint feature object and compute kernel matrix and its spectrum\n", "feats_p_q=mmd.get_p_and_q()\n", "mmd.get_kernel().init(feats_p_q, feats_p_q)\n", "K=mmd.get_kernel().get_kernel_matrix()\n", - "w,_=eig(K)\n", + "w,_=np.linalg.eig(K)\n", "\n", "# visualise K and its spectrum (only up to threshold)\n", - "figure(figsize=(18,5))\n", - "subplot(121)\n", - "imshow(K, interpolation=\"nearest\")\n", - "title(\"Kernel matrix K of joint data $X$ and $Y$\")\n", - "subplot(122)\n", + "plt.figure(figsize=(18,5))\n", + "plt.subplot(121)\n", + "plt.imshow(K, interpolation=\"nearest\")\n", + "plt.title(\"Kernel matrix K of joint data $X$ and $Y$\")\n", + "plt.subplot(122)\n", "thresh=0.1\n", - "plot(w[:len(w[w>thresh])])\n", - "_=title(\"Eigenspectrum of K until component %d\" % len(w[w>thresh]))" + "plt.plot(w[:len(w[w>thresh])])\n", + "plt.title(\"Eigenspectrum of K until component %d\" % len(w[w>thresh]));" ] }, { @@ -540,22 +502,23 @@ "num_eigen=len(w[w>thresh])\n", "\n", "# finally, do the test, use biased statistic\n", - "mmd.set_statistic_type(BIASED)\n", + "mmd.set_statistic_type(sg.ST_BIASED_FULL)\n", "\n", "#tell Shogun to use spectrum approximation\n", - "mmd.set_null_approximation_method(MMD2_SPECTRUM)\n", - "mmd.set_num_eigenvalues_spectrum(num_eigen)\n", - "mmd.set_num_samples_spectrum(num_samples)\n", + "mmd.set_null_approximation_method(sg.NAM_MMD2_SPECTRUM)\n", + "mmd.spectrum_set_num_eigenvalues(num_eigen)\n", + "mmd.set_num_null_samples(num_samples)\n", "\n", "# the usual test interface\n", - "p_value_spectrum=mmd.perform_test()\n", + "statistic=mmd.compute_statistic()\n", + "p_value_spectrum=mmd.compute_p_value(statistic)\n", "print \"Spectrum: P-value of MMD test is %.2f\" % 
p_value_spectrum\n", "\n", - "# compare with ground truth bootstrapping\n", - "mmd.set_null_approximation_method(PERMUTATION)\n", + "# compare with ground truth from permutation test\n", + "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n", "mmd.set_num_null_samples(num_samples)\n", - "p_value_boot=mmd.perform_test()\n", - "print \"Bootstrapping: P-value of MMD test is %.2f\" % p_value_spectrum" + "p_value_permutation=mmd.compute_p_value(statistic)\n", + "print \"Bootstrapping: P-value of MMD test is %.2f\" % p_value_permutation" ] }, { @@ -595,15 +558,16 @@ "outputs": [], "source": [ "# tell Shogun to use gamma approximation\n", - "mmd.set_null_approximation_method(MMD2_GAMMA)\n", + "mmd.set_null_approximation_method(sg.NAM_MMD2_GAMMA)\n", "\n", "# the usual test interface\n", - "p_value_gamma=mmd.perform_test()\n", + "statistic=mmd.compute_statistic()\n", + "p_value_gamma=mmd.compute_p_value(statistic)\n", "print \"Gamma: P-value of MMD test is %.2f\" % p_value_gamma\n", "\n", "# compare with ground truth bootstrapping\n", - "mmd.set_null_approximation_method(PERMUTATION)\n", - "p_value_boot=mmd.perform_test()\n", + "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n", + "p_value_spectrum=mmd.compute_p_value(statistic)\n", "print \"Bootstrapping: P-value of MMD test is %.2f\" % p_value_spectrum" ] }, @@ -637,32 +601,34 @@ " Z=hstack((X,Y))\n", " X=Z[:len(X)]\n", " Y=Z[len(X):]\n", - " feat_p=RealFeatures(reshape(X, (1,len(X))))\n", - " feat_q=RealFeatures(reshape(Y, (1,len(Y))))\n", + " feat_p=sg.RealFeatures(reshape(X, (1,len(X))))\n", + " feat_q=sg.RealFeatures(reshape(Y, (1,len(Y))))\n", " \n", " # gamma\n", - " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n", - " mmd.set_null_approximation_method(MMD2_GAMMA)\n", - " mmd.set_statistic_type(BIASED)\n", + " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n", + " mmd.set_kernel(kernel)\n", + " mmd.set_null_approximation_method(sg.NAM_MMD2_GAMMA)\n", + " mmd.set_statistic_type(sg.ST_BIASED_FULL) \n", " 
rejections_gamma[i]=mmd.perform_test(alpha)\n", " \n", " # spectrum\n", - " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n", - " mmd.set_null_approximation_method(MMD2_SPECTRUM)\n", - " mmd.set_num_eigenvalues_spectrum(num_eigen)\n", - " mmd.set_num_samples_spectrum(num_samples)\n", - " mmd.set_statistic_type(BIASED)\n", + " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n", + " mmd.set_kernel(kernel)\n", + " mmd.set_null_approximation_method(sg.NAM_MMD2_SPECTRUM)\n", + " mmd.spectrum_set_num_eigenvalues(num_eigen)\n", + " mmd.set_num_null_samples(num_samples)\n", + " mmd.set_statistic_type(sg.ST_BIASED_FULL)\n", " rejections_spectrum[i]=mmd.perform_test(alpha)\n", " \n", " # bootstrap (precompute kernel)\n", - " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n", + " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n", " p_and_q=mmd.get_p_and_q()\n", " kernel.init(p_and_q, p_and_q)\n", - " precomputed_kernel=CustomKernel(kernel)\n", + " precomputed_kernel=sg.CustomKernel(kernel)\n", " mmd.set_kernel(precomputed_kernel)\n", - " mmd.set_null_approximation_method(PERMUTATION)\n", + " mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n", " mmd.set_num_null_samples(num_samples)\n", - " mmd.set_statistic_type(BIASED)\n", + " mmd.set_statistic_type(sg.ST_BIASED_FULL)\n", " rejections_bootstrap[i]=mmd.perform_test(alpha)" ] }, @@ -701,9 +667,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. The linear time MMD statistic, implemented in CLinearTimeMMD can help here, as it accepts data under the streaming interface CStreamingFeatures, which deliver data one-by-one.\n", + "So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. 
The linear time MMD statistic, implemented in CLinearTimeMMD can help here, as it accepts data under the streaming interface CStreamingFeatures, which deliver data one-by-one.\n", "\n", - "And it can do more cool things, for example choose the best single (or combined) kernel for you. But we need a more fancy dataset for that to show its power. We will use one of Shogun's streaming based data generator, CGaussianBlobsDataGenerator for that. This dataset consists of two distributions which are a grid of Gaussians where in one of them, the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing." + "And it can do more cool things, for example choose the best single (or combined) kernel for you. But we need a more fancy dataset for that to show its power. We will use one of Shogun's streaming based data generator, CGaussianBlobsDataGenerator for that. This dataset consists of two distributions which are a grid of Gaussians where in one of them, the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing." ] }, { @@ -722,8 +688,8 @@ "angle=pi/4\n", "\n", "# these are streaming features\n", - "gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n", - "gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n", + "gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n", + "gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n", "\t\t\n", "# stream some data and plot\n", "num_plot=1000\n", @@ -755,7 +721,7 @@ "\n", "where $ m_2=\\lfloor\\frac{m}{2} \\rfloor$. While the above expression assumes that $m$ data are available from each distribution, the statistic in general works in an online setting where features are obtained one by one. Since only pairs of four points are considered at once, this allows to compute it on data streams. 
In addition, the computational costs are linear in the number of samples that are considered from each distribution. These two properties make the linear time MMD very applicable for large scale two-sample tests. In theory, any number of samples can be processed -- time is the only limiting factor.\n", "\n", - "We begin by illustrating how to pass data to CLinearTimeMMD. In order not to loose performance due to overhead, it is possible to specify a block size for the data stream." + "We begin by illustrating how to pass data to CLinearTimeMMD. In order not to loose performance due to overhead, it is possible to specify a block size for the data stream." ] }, { @@ -769,7 +735,11 @@ "block_size=100\n", "\n", "# if features are already under the streaming interface, just pass them\n", - "mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)\n", + "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n", + "mmd.set_kernel(kernel)\n", + "mmd.set_num_samples_p(m)\n", + "mmd.set_num_samples_q(m)\n", + "mmd.set_num_blocks_per_burst(block_size)\n", "\n", "# compute an unbiased estimate in linear time\n", "statistic=mmd.compute_statistic()\n", @@ -785,7 +755,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Sometimes, one might want to use CLinearTimeMMD with data that is stored in memory. In that case, it is easy to data in the form of for example CStreamingDenseFeatures into CDenseFeatures." + "Sometimes, one might want to use CLinearTimeMMD with data that is stored in memory. In that case, it is easy to data in the form of for example CStreamingDenseFeatures into CDenseFeatures." 
] }, { @@ -797,25 +767,21 @@ "outputs": [], "source": [ "# data source\n", - "gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n", - "gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n", - "\n", - "# retreive some points, store them as non-streaming data in memory\n", - "data_p=gen_p.get_streamed_features(100)\n", - "data_q=gen_q.get_streamed_features(data_p.get_num_vectors())\n", - "print \"Number of data is %d\" % data_p.get_num_vectors()\n", + "gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n", + "gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n", "\n", - "# cast data in memory as streaming features again (which now stream from the in-memory data)\n", - "streaming_p=StreamingRealFeatures(data_p)\n", - "streaming_q=StreamingRealFeatures(data_q)\n", + "num_samples=100\n", + "print \"Number of data is %d\" % num_samples\n", "\n", - "# it is important to start the internal parser to avoid deadlocks\n", - "streaming_p.start_parser()\n", - "streaming_q.start_parser()\n", + "# retreive some points, store them as non-streaming data in memory\n", + "data_p=gen_p.get_streamed_features(num_samples)\n", + "data_q=gen_q.get_streamed_features(num_samples)\n", "\n", - "# example to create mmd (note that m can be maximum the number of data in memory)\n", + "# example to create mmd (note that num_samples can be maximum the number of data in memory)\n", "\n", - "mmd=LinearTimeMMD(GaussianKernel(10,1), streaming_p, streaming_q, data_p.get_num_vectors(), 1)\n", + "mmd=sg.LinearTimeMMD(data_p, data_q)\n", + "mmd.set_kernel(sg.GaussianKernel(10, 1))\n", + "mmd.set_num_blocks_per_burst(100)\n", "print \"Linear time MMD statistic: %.2f\" % mmd.compute_statistic()" ] }, @@ -830,7 +796,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. 
The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, *new* data is taken from the stream in every iteration.\n", + "As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, *new* data is taken from the stream in every iteration.\n", "\n", "Bootstrapping is not really necessary since there exists a fast and consistent estimate of the null-distribution. However, to ensure that any approximation is accurate, it should always be checked against bootstrapping at least once.\n", "\n", @@ -848,7 +814,7 @@ "\n", "A normal distribution with this variance and zero mean can then be used as an approximation for the null-distribution. This results in a consistent test and is very fast. However, note that it is an approximation and its accuracy depends on the underlying data distributions. It is a good idea to compare to the bootstrapping approach first to determine an appropriate number of samples to use. This number is usually in the tens of thousands.\n", "\n", - "CLinearTimeMMD allows to approximate the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice since seperate calls of computing statistic and p-value will operator on different data from the stream. Below, we compute the test on a large amount of data (impossible to perform quadratic time MMD for this one as the kernel matrices cannot be stored in memory)" + "CLinearTimeMMD allows to approximate the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice since seperate calls of computing statistic and p-value will operator on different data from the stream. 
Below, we compute the test on a large amount of data (impossible to perform quadratic time MMD for this one as the kernel matrices cannot be stored in memory)" ] }, { @@ -859,10 +825,15 @@ }, "outputs": [], "source": [ - "mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)\n", + "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n", + "mmd.set_kernel(kernel)\n", + "mmd.set_num_samples_p(m)\n", + "mmd.set_num_samples_q(m)\n", + "mmd.set_num_blocks_per_burst(block_size)\n", + "\n", "print \"m=%d samples from p and q\" % m\n", "print \"Binary test result is: \" + (\"Rejection\" if mmd.perform_test(alpha) else \"No rejection\")\n", - "print \"P-value test result is %.2f\" % mmd.perform_test()" + "print \"P-value test result is %.2f\" % mmd.compute_p_value(mmd.compute_statistic())" ] }, { @@ -880,38 +851,38 @@ "\\DeclareMathOperator*{\\argmax}{arg\\,max}$\n", "Now which kernel do we actually use for our tests? So far, we just plugged in arbritary ones. However, for kernel two-sample testing, it is possible to do something more clever.\n", "\n", - "Shogun's kernel selection methods for MMD based two-sample tests are all based around [3, 4]. For the CLinearTimeMMD, [3] describes a way of selecting the *optimal* kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in literature is choosing the kernel that maximised the MMD statistic [4]. For convex combinations of kernels, this can be achieved via a $L2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].\n", + "Shogun's kernel selection methods for MMD based two-sample tests are all based around [3, 4]. 
For the CLinearTimeMMD, [3] describes a way of selecting the *optimal* kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in literature is choosing the kernel that maximised the MMD statistic [4]. For convex combinations of kernels, this can be achieved via a $L2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].\n", "\n", - "MMD Kernel selection in Shogun always involves an implementation of the base class CMMDKernelSelection, which defines the interface for kernel selection. If combinations of kernel should be considered, there is a sub-class CMMDKernelSelectionComb. In addition, it involves setting up a number of baseline kernels $\\mathcal{K}$ to choose from/combine in the form of a CCombinedKernel. All methods compute their results for a fixed set of these baseline kernels. We later give an example how to use these classes after providing a list of available methods.\n", + "MMD Kernel selection in Shogun always involves choosing one of the methods of CGaussianKernel. All methods compute their results for a fixed set of these baseline kernels. We later give an example how to use these classes after providing a list of available methods.\n", "\n", - " * CMMDKernelSelectionMedian Selects from a set CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousands. 
This method a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.\n", + " * *KSM_MEDIAN_HEURISTIC*: Selects from a set CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousands. This method a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.\n", "\n", - " * CMMDKernelSelectionMax Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic -- more specific its estimate.\n", + " * *KSM_MAXIMIZE_MMD*: Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic -- more specific its estimate.\n", "$$\n", "k^*=\\argmax_{k\\in\\mathcal{K}} \\hat \\eta_k,\n", "$$\n", "where $\\eta_k$ is an empirical MMD estimate for using a kernel $k$.\n", "This was first described in [4] and was empirically shown to perform better than the median heuristic above. However, it remains a heuristic that comes with no guarantees. Since MMD estimates can be computed in linear and quadratic time, this method works for both methods. However, for the linear time statistic, there exists a better method.\n", " \n", - " * CMMDKernelSelectionOpt Selects the optimal single kernel from a set of baseline kernels. 
This is done via maximising the ratio of the linear MMD statistic and its standard deviation.\n", + " * *KSM_MAXIMIZE_POWER*: Selects the optimal single kernel from a set of baseline kernels. This is done via maximising the ratio of the linear MMD statistic and its standard deviation.\n", "$$\n", "k^*=\\argmax_{k\\in\\mathcal{K}} \\frac{\\hat \\eta_k}{\\hat\\sigma_k+\\lambda},\n", "$$\n", "where $\\eta_k$ is a linear time MMD estimate for using a kernel $k$ and $\\hat\\sigma_k$ is a linear time variance estimate of $\\eta_k$ to which a small number $\\lambda$ is added to prevent division by zero.\n", - "These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error. *Important:* For this method to work, the kernel needs to be selected on *different* data than the test is performed on. Otherwise, the method will produce wrong results.\n", + "These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error. *Important:* For this method to work, the kernel needs to be selected on *different* data than the test is performed on. Otherwise, the method will produce wrong results.\n", " \n", - " * CMMDKernelSelectionCombMaxL2 Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogous to CMMDKernelSelectionMax. This is done via solving the convex program\n", + " * CMMDKernelSelectionCombMaxL2 Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogous to CMMDKernelSelectionMax. 
This is done via solving the convex program\n", "$$\n", "\\boldsymbol{\\beta}^*=\\min_{\\boldsymbol{\\beta}} \\{\\boldsymbol{\\beta}^T\\boldsymbol{\\beta} : \\boldsymbol{\\beta}^T\\boldsymbol{\\eta}=\\mathbf{1}, \\boldsymbol{\\beta}\\succeq 0\\},\n", "$$\n", "where $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. See [3] for details. Note that this method is unable to select a single kernel -- even when this would be optimal.\n", "Again, when using the linear time MMD, there are better methods available.\n", "\n", - " * CMMDKernelSelectionCombOpt Selects a convex combination of kernels that maximises the MMD statistic divided by its covariance. This corresponds to \\emph{optimal} kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogous. The convex program to solve is\n", + " * CMMDKernelSelectionCombOpt Selects a convex combination of kernels that maximises the MMD statistic divided by its covariance. This corresponds to \\emph{optimal} kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogous. The convex program to solve is\n", "$$\n", "\\boldsymbol{\\beta}^*=\\min_{\\boldsymbol{\\beta}} (\\hat Q+\\lambda I) \\{\\boldsymbol{\\beta}^T\\boldsymbol{\\beta} : \\boldsymbol{\\beta}^T\\boldsymbol{\\eta}=\\mathbf{1}, \\boldsymbol{\\beta}\\succeq 0\\},\n", "$$\n", - "where again $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. The matrix $\\hat Q$ is a linear time estimate of the covariance matrix of the vector $\\boldsymbol{\\eta}$ to whose diagonal a small number $\\lambda$ is added to prevent division by zero. See [3] for details. 
In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt." + "where again $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. The matrix $\\hat Q$ is a linear time estimate of the covariance matrix of the vector $\\boldsymbol{\\eta}$ to whose diagonal a small number $\\lambda$ is added to prevent division by zero. See [3] for details. In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt." ] }, { @@ -925,11 +896,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call\n", + "In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call\n", "\n", " * *compute_measures* to compute a vector kernel selection criteria if a single kernel selection method is used. It will return a vector of selected kernel weights if a combined kernel selection method is used. 
For \\shogunclass{CMMDKernelSelectionMedian}, the method does throw an error.\n", "\n", - " * *select\\_kernel* returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method. \n", + " * *select\\_kernel* returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method. \n", "\n", "In order to utilise the selected kernel, it has to be passed to an MMD instance. We now give an example how to select the optimal single and combined kernel for the Gaussian Blobs dataset." ] @@ -949,22 +920,29 @@ }, "outputs": [], "source": [ - "sigmas=[2**x for x in linspace(-5,5, 10)]\n", + "# mmd instance using streaming features\n", + "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n", + "mmd.set_num_samples_p(m)\n", + "mmd.set_num_samples_q(m)\n", + "mmd.set_num_blocks_per_burst(block_size)\n", + "\n", + "sigmas=[2**x for x in np.linspace(-5, 5, 11)]\n", "print \"Choosing kernel width from\", [\"{0:.2f}\".format(sigma) for sigma in sigmas]\n", - "combined=CombinedKernel()\n", - "for i in range(len(sigmas)):\n", - " combined.append_kernel(GaussianKernel(10, sigmas[i]))\n", "\n", - "# mmd instance using streaming features\n", - "block_size=1000\n", - "mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size)\n", + "for i in range(len(sigmas)):\n", + " mmd.add_kernel(sg.GaussianKernel(10, sigmas[i]))\n", "\n", "# optmal kernel choice is possible for linear time MMD\n", - "selection=MMDKernelSelectionOpt(mmd)\n", + "mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_POWER)\n", + "\n", + "# must be set true for kernel selection\n", + "mmd.set_train_test_mode(True)\n", "\n", "# 
select best kernel\n", - "best_kernel=selection.select_kernel()\n", - "best_kernel=GaussianKernel.obtain_from_generic(best_kernel)\n", + "mmd.select_kernel()\n", + "\n", + "best_kernel=mmd.get_kernel()\n", + "best_kernel=sg.GaussianKernel.obtain_from_generic(best_kernel)\n", "print \"Best single kernel has bandwidth %.2f\" % best_kernel.get_width()" ] }, @@ -983,10 +961,8 @@ }, "outputs": [], "source": [ - "alpha=0.05\n", - "mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, m, block_size)\n", - "mmd.set_null_approximation_method(MMD1_GAUSSIAN);\n", - "p_value_best=mmd.perform_test();\n", + "mmd.set_null_approximation_method(sg.NAM_MMD1_GAUSSIAN);\n", + "p_value_best=mmd.compute_p_value(mmd.compute_statistic());\n", "\n", "print \"Bootstrapping: P-value of MMD test with optimal kernel is %.2f\" % p_value_best" ] @@ -1006,19 +982,20 @@ }, "outputs": [], "source": [ - "mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, 5000, block_size)\n", + "m=5000\n", + "mmd.set_num_samples_p(m)\n", + "mmd.set_num_samples_q(m)\n", + "mmd.set_train_test_mode(False)\n", "num_samples=500\n", "\n", "# sample null and alternative distribution, implicitly generate new data for that\n", - "null_samples=zeros(num_samples)\n", + "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n", + "mmd.set_num_null_samples(num_samples)\n", + "null_samples=mmd.sample_null()\n", + "\n", "alt_samples=zeros(num_samples)\n", "for i in range(num_samples):\n", - " alt_samples[i]=mmd.compute_statistic()\n", - " \n", - " # tell MMD to merge data internally while streaming\n", - " mmd.set_simulate_h0(True)\n", - " null_samples[i]=mmd.compute_statistic()\n", - " mmd.set_simulate_h0(False)" + " alt_samples[i]=mmd.compute_statistic()" ] }, { @@ -1103,7 +1080,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", - "version": "2.7.12" + "version": "2.7.10" } }, "nbformat": 4, diff --git a/examples/meta/generator/targets/cpp.json b/examples/meta/generator/targets/cpp.json index 
d512e30b309..73a224161e0 100644 --- a/examples/meta/generator/targets/cpp.json +++ b/examples/meta/generator/targets/cpp.json @@ -86,7 +86,7 @@ "MethodCall": "$object->$method($arguments)", "StaticCall": "C$typeName::$method($arguments)", "Identifier": "$identifier", - "Enum":"$value" + "Enum":"$typeName::$value" }, "Element": { "Access": { diff --git a/examples/meta/src/statistical_testing/linear_time_mmd.sg b/examples/meta/src/statistical_testing/linear_time_mmd.sg new file mode 100644 index 00000000000..97e93642b83 --- /dev/null +++ b/examples/meta/src/statistical_testing/linear_time_mmd.sg @@ -0,0 +1,59 @@ +GaussianBlobsDataGenerator features_p() +GaussianBlobsDataGenerator features_q() + +#![create_instance] +LinearTimeMMD mmd() +GaussianKernel kernel(10, 1) +mmd.set_kernel(kernel) +mmd.set_p(features_p) +mmd.set_q(features_q) +mmd.set_num_samples_p(1000) +mmd.set_num_samples_q(1000) +real alpha = 0.05 +#![create_instance] + +#![set_burst] +mmd.set_num_blocks_per_burst(1000) +#![set_burst] + +#![estimate_mmd] +real statistic = mmd.compute_statistic() +#![estimate_mmd] + +#![perform_test] +real threshold = mmd.compute_threshold(alpha) +real p_value = mmd.compute_p_value(statistic) +#![perform_test] + +#![add_kernels] +GaussianKernel kernel1(10, 0.1) +GaussianKernel kernel2(10, 1) +GaussianKernel kernel3(10, 10) +mmd.add_kernel(kernel1) +mmd.add_kernel(kernel2) +mmd.add_kernel(kernel3) +#![add_kernels] + +#![enable_train_test_mode] +mmd.set_train_test_mode(True) +mmd.set_train_test_ratio(1) +#![enable_train_test_mode] + +#![select_kernel_single] +mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_POWER) +mmd.select_kernel() +GaussianKernel learnt_kernel_single = GaussianKernel:obtain_from_generic(mmd.get_kernel()) +real width = learnt_kernel_single.get_width() +#![select_kernel_single] + +#![select_kernel_combined] +mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_POWER, True) +mmd.select_kernel() +CombinedKernel 
learnt_kernel_combined = CombinedKernel:obtain_from_generic(mmd.get_kernel()) +RealVector weights = learnt_kernel_combined.get_subkernel_weights() +#![select_kernel_combined] + +#![perform_test_optimized] +real statistic_optimized = mmd.compute_statistic() +real p_value_optimized = mmd.compute_p_value(statistic_optimized) +#![perform_test_optimized] diff --git a/examples/meta/src/statistical_testing/quadratic_time_mmd.sg b/examples/meta/src/statistical_testing/quadratic_time_mmd.sg new file mode 100644 index 00000000000..e6ec822ab40 --- /dev/null +++ b/examples/meta/src/statistical_testing/quadratic_time_mmd.sg @@ -0,0 +1,71 @@ +CSVFile f_features_p("../../data/two_sample_test_gaussian.dat") +CSVFile f_features_q("../../data/two_sample_test_laplace.dat") + +#![create_features] +RealFeatures features_p(f_features_p) +RealFeatures features_q(f_features_q) +#![create_features] + +#![create_instance] +QuadraticTimeMMD mmd(features_p, features_q) +GaussianKernel kernel(10, 1) +mmd.set_kernel(kernel) +real alpha = 0.05 +#![create_instance] + +#![estimate_mmd] +mmd.set_statistic_type(enum EStatisticType.ST_BIASED_FULL) +real statistic = mmd.compute_statistic() +#![estimate_mmd] + +#![perform_test] +mmd.set_null_approximation_method(enum ENullApproximationMethod.NAM_PERMUTATION) +mmd.set_num_null_samples(200) +real threshold = mmd.compute_threshold(alpha) +real p_value = mmd.compute_p_value(statistic) +#![perform_test] + +#![add_kernels] +GaussianKernel kernel1(10, 0.1) +GaussianKernel kernel2(10, 1) +GaussianKernel kernel3(10, 10) +mmd.add_kernel(kernel1) +mmd.add_kernel(kernel2) +mmd.add_kernel(kernel3) +#![add_kernels] + +#![multi_kernel] +MultiKernelQuadraticTimeMMD mk = mmd.multikernel() +mk.add_kernel(kernel1) +mk.add_kernel(kernel2) +mk.add_kernel(kernel2) + +RealVector mk_statistic = mk.compute_statistic() +RealVector mk_p_value = mk.compute_p_value() +#![multi_kernel] + +#![enable_train_test_mode] +mmd.set_train_test_mode(True) +mmd.set_train_test_ratio(1) 
+#![enable_train_test_mode] + +#![select_kernel_single] +int num_runs = 1 +int num_folds = 3 +mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_CROSS_VALIDATION, num_runs, num_folds, alpha) +mmd.select_kernel() +GaussianKernel learnt_kernel_single = GaussianKernel:obtain_from_generic(mmd.get_kernel()) +real width = learnt_kernel_single.get_width() +#![select_kernel_single] + +#![select_kernel_combined] +mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_MMD, True) +mmd.select_kernel() +CombinedKernel learnt_kernel_combined = CombinedKernel:obtain_from_generic(mmd.get_kernel()) +RealVector weights = learnt_kernel_combined.get_subkernel_weights() +#![select_kernel_combined] + +#![perform_test_optimized] +real statistic_optimized = mmd.compute_statistic() +real p_value_optimized = mmd.compute_p_value(statistic_optimized) +#![perform_test_optimized] diff --git a/examples/undocumented/libshogun/statistics_hsic.cpp b/examples/undocumented/libshogun/statistics_hsic.cpp deleted file mode 100644 index 196d9874a90..00000000000 --- a/examples/undocumented/libshogun/statistics_hsic.cpp +++ /dev/null @@ -1,172 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2012 Heiko Strathmann - */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -void create_fixed_data_kernel_small(CFeatures*& features_p, - CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q) -{ - index_t m=2; - index_t d=3; - - SGMatrix p(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - p.matrix[i]=i; - -// p.display_matrix("p"); - - SGMatrix q(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - q.matrix[i]=i+10; - -// q.display_matrix("q"); - - features_p=new CDenseFeatures(p); - features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is different */ - kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); -} - -void create_fixed_data_kernel_big(CFeatures*& features_p, - CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q) -{ - index_t m=10; - index_t d=7; - - SGMatrix p(d,m); - for (index_t i=0; i q(d,m); - for (index_t i=0; i(p); - features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is different */ - kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); -} - -/** tests the hsic statistic for a single fixed data case and ensures - * equality with sma implementation */ -void test_hsic_fixed() -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_small(features_p, features_q, kernel_p, kernel_q); - - index_t m=features_p->get_num_vectors(); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - /* assert matlab result, note that compute statistic computes m*hsic */ - float64_t 
difference=hsic->compute_statistic(); - SG_SPRINT("hsic fixed: %f\n", difference); - ASSERT(CMath::abs(difference-m*0.164761446385339)<10E-16); - - - SG_UNREF(hsic); -} - -void test_hsic_gamma() -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - hsic->set_null_approximation_method(HSIC_GAMMA); - float64_t p=hsic->compute_p_value(0.05); - SG_SPRINT("p-value: %f\n", p); - - // disabled as I think previous inverse_gamma_cdf was faulty - // now unit test fails. Needs to be investigated statistically - //ASSERT(CMath::abs(p-0.172182287884256)<10E-15); - - SG_UNREF(hsic); -} - -void test_hsic_sample_null() -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - /* do sampling null */ - hsic->set_null_approximation_method(PERMUTATION); - float64_t p=hsic->compute_p_value(0.05); - SG_SPRINT("p-value: %f\n", p); - - /* ensure that sampling null of hsic leads to same results as using - * CKernelIndependenceTest */ - CMath::init_random(1); - float64_t mean1=CStatistics::mean(hsic->sample_null()); - float64_t var1=CStatistics::variance(hsic->sample_null()); - SG_SPRINT("mean1=%f, var1=%f\n", mean1, var1); - - CMath::init_random(1); - float64_t mean2=CStatistics::mean( - hsic->CKernelIndependenceTest::sample_null()); - float64_t var2=CStatistics::variance(hsic->sample_null()); - SG_SPRINT("mean2=%f, var2=%f\n", mean2, var2); - - /* assert than results are the same from bot sampling null impl. 
*/ - ASSERT(CMath::abs(mean1-mean2)<10E-8); - ASSERT(CMath::abs(var1-var2)<10E-8); - - SG_UNREF(hsic); -} - -int main(int argc, char** argv) -{ - init_shogun_with_defaults(); - -// sg_io->set_loglevel(MSG_DEBUG); - - test_hsic_fixed(); - test_hsic_gamma(); - test_hsic_sample_null(); - - exit_shogun(); - return 0; -} - diff --git a/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp b/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp deleted file mode 100644 index 3687cd49d1a..00000000000 --- a/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp +++ /dev/null @@ -1,93 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -void linear_time_mmd() -{ - /* note that the linear time statistic is designed for much larger datasets - * so increase to get reasonable results */ - index_t m=1000; - index_t dim=2; - float64_t difference=0.5; - - /* streaming data generator for mean shift distributions */ - CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim); - CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim); - - /* set kernel a-priori. usually one would do some kernel selection. See - * other examples for this. */ - float64_t width=10; - CGaussianKernel* kernel=new CGaussianKernel(10, width); - - /* create linear time mmd instance */ - index_t blocksize=1000; - CLinearTimeMMD* mmd=new CLinearTimeMMD(kernel, gen_p, gen_q, m, blocksize); - - /* perform test: compute p-value and test if null-hypothesis is rejected for - * a test level of 0.05 */ - float64_t alpha=0.05; - - /* using bootstrapping (not reccomended for linear time MMD, since slow). 
- * Also, in practice, use at least 250 iterations */ - mmd->set_null_approximation_method(PERMUTATION); - mmd->set_num_null_samples(10); - float64_t p_value_bootstrap=mmd->perform_test(); - /* reject if p-value is smaller than test level */ - SG_SPRINT("bootstrap: p!=q: %d\n", p_value_bootstrapset_null_approximation_method(MMD1_GAUSSIAN); - float64_t p_value_gaussian=mmd->perform_test(); - /* reject if p-value is smaller than test level */ - SG_SPRINT("gaussian approx: p!=q: %d\n", p_value_gaussian typeIerrors(num_trials); - SGVector typeIIerrors(num_trials); - for (index_t i=0; iset_simulate_h0(true); - typeIerrors[i]=mmd->perform_test()>alpha; - mmd->set_simulate_h0(false); - - typeIIerrors[i]=mmd->perform_test()>alpha; - } - - SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors)); - SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors)); - - SG_UNREF(mmd); -} - -int main(int argc, char** argv) -{ - init_shogun_with_defaults(); -// sg_io->set_loglevel(MSG_DEBUG); - - linear_time_mmd(); - - exit_shogun(); - return 0; -} - diff --git a/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp b/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp deleted file mode 100644 index a3f5df0764f..00000000000 --- a/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp +++ /dev/null @@ -1,216 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2012 Heiko Strathmann - */ - -#include -#include -#include -#ifdef USE_GPL_SHOGUN -#include -#include -#endif //USE_GPL_SHOGUN -#include -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -void kernel_choice_linear_time_mmd_opt_single() -{ - /* Note that the linear time mmd is designed for large datasets. Results on - * this small number will be bad (unstable, type I error wrong) */ - index_t m=1000; - index_t num_blobs=3; - float64_t distance=3; - float64_t stretch=10; - float64_t angle=CMath::PI/4; - - CGaussianBlobsDataGenerator* gen_p=new CGaussianBlobsDataGenerator( - num_blobs, distance, stretch, angle); - - CGaussianBlobsDataGenerator* gen_q=new CGaussianBlobsDataGenerator( - num_blobs, distance, 1, 1); - - /* create kernels */ - CCombinedKernel* combined=new CCombinedKernel(); - float64_t sigma_from=-3; - float64_t sigma_to=10; - float64_t sigma_step=1; - float64_t sigma=sigma_from; - while (sigma<=sigma_to) - { - /* shoguns kernel width is different */ - float64_t width=CMath::pow(2.0, sigma); - float64_t sq_width_twice=width*width*2; - combined->append_kernel(new CGaussianKernel(10, sq_width_twice)); - sigma+=sigma_step; - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined, gen_p, gen_q, m); - - /* kernel selection instance with regularisation term. May be replaced by - * other methods for selecting single kernels */ - CMMDKernelSelectionOpt* selection= - new CMMDKernelSelectionOpt(mmd, 10E-5); -// - /* select kernel that maximised MMD */ -// CMMDKernelSelectionMax* selection= -// new CMMDKernelSelectionMax(mmd); - -// /* select kernel with width closest to median data distance */ -// CMMDKernelSelectionMedian* selection= -// new CMMDKernelSelectionMedian(mmd, 10E-5); - - /* compute measures. - * For Opt: ratio of MMD and standard deviation - * For Max: MMDs of single kernels - * for Medigan: Does not work! 
*/ - SG_SPRINT("computing ratios\n"); - SGVector ratios=selection->compute_measures(); - ratios.display_vector("ratios"); - - /* select kernel using the maximum ratio (and cast) */ - SG_SPRINT("selecting kernel\n"); - CKernel* selected=selection->select_kernel(); - CGaussianKernel* casted=CGaussianKernel::obtain_from_generic(selected); - SG_SPRINT("selected kernel width: %f\n", casted->get_width()); - mmd->set_kernel(selected); - SG_UNREF(casted); - SG_UNREF(selected); - - mmd->set_null_approximation_method(MMD1_GAUSSIAN); - - /* compute tpye I and II error (use many more trials). Type I error is only - * estimated to check MMD1_GAUSSIAN method for estimating the null - * distribution. Note that testing has to happen on difference data than - * kernel selecting, but the linear time mmd does this implicitly */ - float64_t alpha=0.05; - index_t num_trials=5; - SGVector typeIerrors(num_trials); - SGVector typeIIerrors(num_trials); - for (index_t i=0; iset_simulate_h0(true); - typeIerrors[i]=mmd->perform_test()>alpha; - mmd->set_simulate_h0(false); - - typeIIerrors[i]=mmd->perform_test()>alpha; - } - - SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors)); - SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors)); - - - SG_UNREF(selection); -} - -void kernel_choice_linear_time_mmd_opt_comb() -{ -#ifdef USE_GPL_SHOGUN - /* Note that the linear time mmd is designed for large datasets. 
Results on - * this small number will be bad (unstable, type I error wrong) */ - index_t m=1000; - index_t num_blobs=3; - float64_t distance=3; - float64_t stretch=10; - float64_t angle=CMath::PI/4; - - CGaussianBlobsDataGenerator* gen_p=new CGaussianBlobsDataGenerator( - num_blobs, distance, stretch, angle); - - CGaussianBlobsDataGenerator* gen_q=new CGaussianBlobsDataGenerator( - num_blobs, distance, 1, 1); - - /* create kernels */ - CCombinedKernel* combined=new CCombinedKernel(); - float64_t sigma_from=-3; - float64_t sigma_to=10; - float64_t sigma_step=1; - float64_t sigma=sigma_from; - index_t num_kernels=0; - while (sigma<=sigma_to) - { - /* shoguns kernel width is different */ - float64_t width=CMath::pow(2.0, sigma); - float64_t sq_width_twice=width*width*2; - combined->append_kernel(new CGaussianKernel(10, sq_width_twice)); - sigma+=sigma_step; - num_kernels++; - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined, gen_p, gen_q, m); - - /* kernel selection instance with regularisation term. May be replaced by - * other methods for selecting single kernels */ - CMMDKernelSelectionCombOpt* selection= - new CMMDKernelSelectionCombOpt(mmd, 10E-5); - - /* maximise L2 regularised MMD */ -// CMMDKernelSelectionCombMaxL2* selection= -// new CMMDKernelSelectionCombMaxL2(mmd, 10E-5); - - /* select kernel (does the same as above, but sets weights to kernel) */ - SG_SPRINT("selecting kernel\n"); - CKernel* selected=selection->select_kernel(); - CCombinedKernel* casted=CCombinedKernel::obtain_from_generic(selected); - casted->get_subkernel_weights().display_vector("weights"); - mmd->set_kernel(selected); - SG_UNREF(casted); - SG_UNREF(selected); - - /* compute tpye I and II error (use many more trials). Type I error is only - * estimated to check MMD1_GAUSSIAN method for estimating the null - * distribution. 
Note that testing has to happen on difference data than - * kernel selecting, but the linear time mmd does this implicitly */ - mmd->set_null_approximation_method(MMD1_GAUSSIAN); - float64_t alpha=0.05; - index_t num_trials=5; - SGVector typeIerrors(num_trials); - SGVector typeIIerrors(num_trials); - for (index_t i=0; iset_simulate_h0(true); - typeIerrors[i]=mmd->perform_test()>alpha; - mmd->set_simulate_h0(false); - - typeIIerrors[i]=mmd->perform_test()>alpha; - } - - SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors)); - SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors)); - - - SG_UNREF(selection); -#endif //USE_GPL_SHOGUN -} - -int main(int argc, char** argv) -{ - init_shogun_with_defaults(); -// sg_io->set_loglevel(MSG_DEBUG); - - /* select a single kernel for linear time MMD */ - kernel_choice_linear_time_mmd_opt_single(); - - /* select combined kernels for linear time MMD */ - kernel_choice_linear_time_mmd_opt_comb(); - - exit_shogun(); - return 0; -} - diff --git a/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp b/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp deleted file mode 100644 index 5ffe8cdc3c4..00000000000 --- a/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp +++ /dev/null @@ -1,135 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -void quadratic_time_mmd() -{ - /* number of examples kept low in order to make things fast */ - index_t m=30; - index_t dim=2; - float64_t difference=0.5; - - /* streaming data generator for mean shift distributions */ - CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim); - CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim); - - /* stream some data from generator */ - CFeatures* feat_p=gen_p->get_streamed_features(m); - CFeatures* feat_q=gen_q->get_streamed_features(m); - - /* set kernel a-priori. usually one would do some kernel selection. See - * other examples for this. */ - float64_t width=10; - CGaussianKernel* kernel=new CGaussianKernel(10, width); - - /* create quadratic time mmd instance. Note that this constructor - * copies p and q and does not reference them */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, feat_p, feat_q); - - /* perform test: compute p-value and test if null-hypothesis is rejected for - * a test level of 0.05 */ - float64_t alpha=0.05; - - /* using permutation (slow, not the most reliable way. Consider pre- - * computing the kernel when using it, see below). 
- * Also, in practice, use at least 250 iterations */ - mmd->set_null_approximation_method(PERMUTATION); - mmd->set_num_null_samples(3); - float64_t p_value=mmd->perform_test(); - /* reject if p-value is smaller than test level */ - SG_SPRINT("bootstrap: p!=q: %d\n", p_valueset_statistic_type(BIASED); - mmd->set_null_approximation_method(MMD2_SPECTRUM); - mmd->set_num_eigenvalues_spectrum(3); - mmd->set_num_samples_spectrum(250); - p_value=mmd->perform_test(); - /* reject if p-value is smaller than test level */ - SG_SPRINT("spectrum: p!=q: %d\n", p_valueset_statistic_type(BIASED); - mmd->set_null_approximation_method(MMD2_GAMMA); - p_value=mmd->perform_test(); - /* reject if p-value is smaller than test level */ - SG_SPRINT("gamma: p!=q: %d\n", p_valueset_null_approximation_method(PERMUTATION); - mmd->set_num_null_samples(5); - index_t num_trials=5; - SGVector type_I_errors(num_trials); - SGVector type_II_errors(num_trials); - SGVector inds(2*m); - inds.range_fill(); - CFeatures* p_and_q=mmd->get_p_and_q(); - - /* use a precomputed kernel to be faster */ - kernel->init(p_and_q, p_and_q); - CCustomKernel* precomputed=new CCustomKernel(kernel); - mmd->set_kernel(precomputed); - for (index_t i=0; iadd_row_subset(inds); - precomputed->add_col_subset(inds); - type_I_errors[i]=mmd->perform_test()>alpha; - precomputed->remove_row_subset(); - precomputed->remove_col_subset(); - - /* on normal data, this gives type II error */ - type_II_errors[i]=mmd->perform_test()>alpha; - } - SG_UNREF(p_and_q); - - SG_SPRINT("type I error: %f\n", CStatistics::mean(type_I_errors)); - SG_SPRINT("type II error: %f\n", CStatistics::mean(type_II_errors)); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(gen_p); - SG_UNREF(gen_q); - - /* convienience constructor of MMD was used, these were not referenced */ - SG_UNREF(feat_p); - SG_UNREF(feat_q); -} - -int main(int argc, char** argv) -{ - init_shogun_with_defaults(); -// sg_io->set_loglevel(MSG_DEBUG); - - quadratic_time_mmd(); - - 
exit_shogun(); - return 0; -} - diff --git a/examples/undocumented/python_modular/statistics_hsic.py b/examples/undocumented/python_modular/statistics_hsic.py deleted file mode 100644 index ba1f3470bc3..00000000000 --- a/examples/undocumented/python_modular/statistics_hsic.py +++ /dev/null @@ -1,107 +0,0 @@ -#!/usr/bin/env python -# -# This program is free software you can redistribute it and/or modify -# it under the terms of the GNU General Public License as published by -# the Free Software Foundation either version 3 of the License, or -# (at your option) any later version. -# -# Written (C) 2012-2013 Heiko Strathmann -# -import numpy as np -from math import pi - -parameter_list = [[150,3,3]] - -def statistics_hsic (n, difference, angle): - from modshogun import RealFeatures - from modshogun import DataGenerator - from modshogun import GaussianKernel - from modshogun import HSIC - from modshogun import PERMUTATION, HSIC_GAMMA - from modshogun import EuclideanDistance - from modshogun import Statistics, Math - - # for reproducable results (the numpy one might not be reproducible across - # different OS/Python-distributions - Math.init_random(1) - np.random.seed(1) - - # note that the HSIC has to store kernel matrices - # which upper bounds the sample size - - # use data generator class to produce example data - data=DataGenerator.generate_sym_mix_gauss(n,difference,angle) - #plot(data[0], data[1], 'x');show() - - # create shogun feature representation - features_x=RealFeatures(np.array([data[0]])) - features_y=RealFeatures(np.array([data[1]])) - - # compute median data distance in order to use for Gaussian kernel width - # 0.5*median_distance normally (factor two in Gaussian kernel) - # However, shoguns kernel width is different to usual parametrization - # Therefore 0.5*2*median_distance^2 - # Use a subset of data for that, only 200 elements. 
Median is stable - subset=np.random.permutation(features_x.get_num_vectors()).astype(np.int32) - subset=subset[0:200] - features_x.add_subset(subset) - dist=EuclideanDistance(features_x, features_x) - distances=dist.get_distance_matrix() - features_x.remove_subset() - median_distance=np.median(distances) - sigma_x=median_distance**2 - features_y.add_subset(subset) - dist=EuclideanDistance(features_y, features_y) - distances=dist.get_distance_matrix() - features_y.remove_subset() - median_distance=np.median(distances) - sigma_y=median_distance**2 - #print "median distance for Gaussian kernel on x:", sigma_x - #print "median distance for Gaussian kernel on y:", sigma_y - kernel_x=GaussianKernel(10,sigma_x) - kernel_y=GaussianKernel(10,sigma_y) - - hsic=HSIC(kernel_x,kernel_y,features_x,features_y) - - # perform test: compute p-value and test if null-hypothesis is rejected for - # a test level of 0.05 using different methods to approximate - # null-distribution - statistic=hsic.compute_statistic() - #print "HSIC:", statistic - alpha=0.05 - - #print "computing p-value using sampling null" - hsic.set_null_approximation_method(PERMUTATION) - # normally, at least 250 iterations should be done, but that takes long - hsic.set_num_null_samples(100) - # sampling null allows usage of unbiased or biased statistic - p_value_boot=hsic.compute_p_value(statistic) - thresh_boot=hsic.compute_threshold(alpha) - #print "p_value:", p_value_boot - #print "threshold for 0.05 alpha:", thresh_boot - #print "p_value <", alpha, ", i.e. 
test sais p and q are dependend:", p_value_bootalpha - mmd.set_simulate_h0(False) - - typeIIerrors[i]=mmd.perform_test()>alpha - - #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors) - - return statistic, p_value_boot, p_value_gaussian, null_samples, typeIerrors, typeIIerrors - -if __name__=='__main__': - print('LinearTimeMMD') - statistics_linear_time_mmd(*parameter_list[0]) diff --git a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py b/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py deleted file mode 100644 index 677a48672f2..00000000000 --- a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py +++ /dev/null @@ -1,115 +0,0 @@ -#!/usr/bin/env python -# -# This program is free software you can redistribute it and/or modify -# it under the terms of the GNU General Public License as published by -# the Free Software Foundation either version 3 of the License, or -# (at your option) any later version. 
-# -# Written (C) 2012-2013 Heiko Strathmann -# -from numpy import * -#from pylab import * - -parameter_list = [[1000,10,5,3,pi/4, "opt"], [1000,10,5,3,pi/4, "l2"]] - - -def statistics_mmd_kernel_selection_combined(m,distance,stretch,num_blobs,angle,selection_method): - from modshogun import RealFeatures - from modshogun import GaussianBlobsDataGenerator - from modshogun import GaussianKernel, CombinedKernel - from modshogun import LinearTimeMMD - try: - from modshogun import MMDKernelSelectionCombMaxL2 - except ImportError: - print("MMDKernelSelectionCombMaxL2 not available") - exit(0) - try: - from modshogun import MMDKernelSelectionCombOpt - except ImportError: - print("MMDKernelSelectionCombOpt not available") - exit(0) - - from modshogun import PERMUTATION, MMD1_GAUSSIAN - from modshogun import EuclideanDistance - from modshogun import Statistics, Math - - # init seed for reproducability - Math.init_random(1) - - # note that the linear time statistic is designed for much larger datasets - # results for this low number will be bad (unstable, type I error wrong) - - # streaming data generator - gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0) - gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle) - - # stream some data and plot - num_plot=1000 - features=gen_p.get_streamed_features(num_plot) - features=features.create_merged_copy(gen_q.get_streamed_features(num_plot)) - data=features.get_feature_matrix() - - #figure() - #subplot(2,2,1) - #grid(True) - #plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$') - #title('$X\sim p$') - #subplot(2,2,2) - #grid(True) - #plot(data[0][num_plot+1:2*num_plot], data[1][num_plot+1:2*num_plot], 'b.', label='$x$', alpha=0.5) - #title('$Y\sim q$') - - # create combined kernel with Gaussian kernels inside (shoguns Gaussian kernel is - # different to the standard form, see documentation) - sigmas=[2**x for x in range(-3,10)] - widths=[x*x*2 for x in sigmas] - combined=CombinedKernel() - for i 
in range(len(sigmas)): - combined.append_kernel(GaussianKernel(10, widths[i])) - - # mmd instance using streaming features, blocksize of 10000 - block_size=10000 - mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size) - - # kernel selection instance (this can easily replaced by the other methods for selecting - # combined kernels - if selection_method=="opt": - selection=MMDKernelSelectionCombOpt(mmd) - elif selection_method=="l2": - selection=MMDKernelSelectionCombMaxL2(mmd) - - # perform kernel selection (kernel is automatically set) - kernel=selection.select_kernel() - kernel=CombinedKernel.obtain_from_generic(kernel) - #print "selected kernel weights:", kernel.get_subkernel_weights() - #subplot(2,2,3) - #plot(kernel.get_subkernel_weights()) - #title("Kernel weights") - - # compute tpye I and II error (use many more trials). Type I error is only - # estimated to check MMD1_GAUSSIAN method for estimating the null - # distribution. Note that testing has to happen on difference data than - # kernel selecting, but the linear time mmd does this implicitly - mmd.set_null_approximation_method(MMD1_GAUSSIAN) - - # number of trials should be larger to compute tight confidence bounds - num_trials=5; - alpha=0.05 # test power - typeIerrors=[0 for x in range(num_trials)] - typeIIerrors=[0 for x in range(num_trials)] - for i in range(num_trials): - # this effectively means that p=q - rejecting is tpye I error - mmd.set_simulate_h0(True) - typeIerrors[i]=mmd.perform_test()>alpha - mmd.set_simulate_h0(False) - - typeIIerrors[i]=mmd.perform_test()>alpha - - #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors) - - return kernel,typeIerrors,typeIIerrors - -if __name__=='__main__': - print('MMDKernelSelectionCombined') - statistics_mmd_kernel_selection_combined(*parameter_list[0]) - #show() diff --git a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py 
b/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py deleted file mode 100644 index ffa291afbde..00000000000 --- a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py +++ /dev/null @@ -1,124 +0,0 @@ -#!/usr/bin/env python -# -# This program is free software you can redistribute it and/or modify -# it under the terms of the GNU General Public License as published by -# the Free Software Foundation either version 3 of the License, or -# (at your option) any later version. -# -# Written (C) 2012-2013 Heiko Strathmann -# -from numpy import * -#from pylab import * - -parameter_list = [[1000,10,5,3,pi/4, "opt"], [1000,10,5,3,pi/4, "max"], [1000,10,5,3,pi/4, "median"]] - -def statistics_mmd_kernel_selection_single(m,distance,stretch,num_blobs,angle,selection_method): - from modshogun import RealFeatures - from modshogun import GaussianBlobsDataGenerator - from modshogun import GaussianKernel, CombinedKernel - from modshogun import LinearTimeMMD - from modshogun import MMDKernelSelectionMedian - from modshogun import MMDKernelSelectionMax - from modshogun import MMDKernelSelectionOpt - from modshogun import PERMUTATION, MMD1_GAUSSIAN - from modshogun import EuclideanDistance - from modshogun import Statistics, Math - - # init seed for reproducability - Math.init_random(1) - - # note that the linear time statistic is designed for much larger datasets - # results for this low number will be bad (unstable, type I error wrong) - m=1000 - distance=10 - stretch=5 - num_blobs=3 - angle=pi/4 - - # streaming data generator - gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0) - gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle) - - # stream some data and plot - num_plot=1000 - features=gen_p.get_streamed_features(num_plot) - features=features.create_merged_copy(gen_q.get_streamed_features(num_plot)) - data=features.get_feature_matrix() - - #figure() - #subplot(2,2,1) - #grid(True) - 
#plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$') - #title('$X\sim p$') - #subplot(2,2,2) - #grid(True) - #plot(data[0][num_plot+1:2*num_plot], data[1][num_plot+1:2*num_plot], 'b.', label='$x$', alpha=0.5) - #title('$Y\sim q$') - - - # create combined kernel with Gaussian kernels inside (shoguns Gaussian kernel is - # different to the standard form, see documentation) - sigmas=[2**x for x in range(-3,10)] - widths=[x*x*2 for x in sigmas] - combined=CombinedKernel() - for i in range(len(sigmas)): - combined.append_kernel(GaussianKernel(10, widths[i])) - - # mmd instance using streaming features, blocksize of 10000 - block_size=1000 - mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size) - - # kernel selection instance (this can easily replaced by the other methods for selecting - # single kernels - if selection_method=="opt": - selection=MMDKernelSelectionOpt(mmd) - elif selection_method=="max": - selection=MMDKernelSelectionMax(mmd) - elif selection_method=="median": - selection=MMDKernelSelectionMedian(mmd) - - # print measures (just for information) - # in case Opt: ratios of MMD and standard deviation - # in case Max: MMDs for each kernel - # Does not work for median method - if selection_method!="median": - ratios=selection.compute_measures() - #print "Measures:", ratios - - #subplot(2,2,3) - #plot(ratios) - #title('Measures') - - # perform kernel selection - kernel=selection.select_kernel() - kernel=GaussianKernel.obtain_from_generic(kernel) - #print "selected kernel width:", kernel.get_width() - - # compute tpye I and II error (use many more trials). Type I error is only - # estimated to check MMD1_GAUSSIAN method for estimating the null - # distribution. 
Note that testing has to happen on difference data than - # kernel selecting, but the linear time mmd does this implicitly - mmd.set_kernel(kernel) - mmd.set_null_approximation_method(MMD1_GAUSSIAN) - - # number of trials should be larger to compute tight confidence bounds - num_trials=5; - alpha=0.05 # test power - typeIerrors=[0 for x in range(num_trials)] - typeIIerrors=[0 for x in range(num_trials)] - for i in range(num_trials): - # this effectively means that p=q - rejecting is tpye I error - mmd.set_simulate_h0(True) - typeIerrors[i]=mmd.perform_test()>alpha - mmd.set_simulate_h0(False) - - typeIIerrors[i]=mmd.perform_test()>alpha - - #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors) - - return kernel,typeIerrors,typeIIerrors - -if __name__=='__main__': - print('MMDKernelSelection') - statistics_mmd_kernel_selection_single(*parameter_list[0]) - #show() diff --git a/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py b/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py deleted file mode 100644 index 343a03c4ed2..00000000000 --- a/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py +++ /dev/null @@ -1,115 +0,0 @@ -#!/usr/bin/env python -# -# This program is free software you can redistribute it and/or modify -# it under the terms of the GNU General Public License as published by -# the Free Software Foundation either version 3 of the License, or -# (at your option) any later version. 
-# -# Written (C) 2012-2013 Heiko Strathmann -# -import numpy as np - -parameter_list = [[30,2,0.5]] - -def statistics_quadratic_time_mmd (m,dim,difference): - from modshogun import RealFeatures - from modshogun import MeanShiftDataGenerator - from modshogun import GaussianKernel, CustomKernel - from modshogun import QuadraticTimeMMD - from modshogun import PERMUTATION, MMD2_SPECTRUM, MMD2_GAMMA, BIASED, BIASED_DEPRECATED - from modshogun import Statistics, IntVector, RealVector, Math - - # for reproducable results (the numpy one might not be reproducible across - # different OS/Python-distributions - Math.init_random(1) - np.random.seed(1) - - # number of examples kept low in order to make things fast - - # streaming data generator for mean shift distributions - gen_p=MeanShiftDataGenerator(0, dim); - #gen_p.parallel.set_num_threads(1) - gen_q=MeanShiftDataGenerator(difference, dim); - - # stream some data from generator - feat_p=gen_p.get_streamed_features(m); - feat_q=gen_q.get_streamed_features(m); - - # set kernel a-priori. usually one would do some kernel selection. See - # other examples for this. - width=10; - kernel=GaussianKernel(10, width); - - # create quadratic time mmd instance. Note that this constructor - # copies p and q and does not reference them - mmd=QuadraticTimeMMD(kernel, feat_p, feat_q); - - # perform test: compute p-value and test if null-hypothesis is rejected for - # a test level of 0.05 - alpha=0.05; - - # using permutation (slow, not the most reliable way. Consider pre- - # computing the kernel when using it, see below). 
- # Also, in practice, use at least 250 iterations - mmd.set_null_approximation_method(PERMUTATION); - mmd.set_num_null_samples(3); - p_value_null=mmd.perform_test(); - # reject if p-value is smaller than test level - #print "bootstrap: p!=q: ", p_value_nullalpha; - precomputed.remove_row_subset(); - precomputed.remove_col_subset(); - - # on normal data, this gives type II error - type_II_errors[i]=mmd.perform_test()>alpha; - - return type_I_errors,type_I_errors,p_value_null,p_value_spectrum,p_value_gamma, - -if __name__=='__main__': - print('QuadraticTimeMMD') - statistics_quadratic_time_mmd(*parameter_list[0]) diff --git a/src/interfaces/modular/Preprocessor.i b/src/interfaces/modular/Preprocessor.i index 0086099701e..fdb05143421 100644 --- a/src/interfaces/modular/Preprocessor.i +++ b/src/interfaces/modular/Preprocessor.i @@ -29,9 +29,9 @@ %rename(SortWordString) CSortWordString; /* Feature selection framework */ -%rename(DependenceMaximization) CDependenceMaximization; -%rename(KernelDependenceMaximization) CDependenceMaximization; -%rename(BAHSIC) CBAHSIC; +#%rename(DependenceMaximization) CDependenceMaximization; +#%rename(KernelDependenceMaximization) CDependenceMaximization; +#%rename(BAHSIC) CBAHSIC; %newobject shogun::CFeatureSelection::apply; %newobject shogun::CFeatureSelection::remove_feats; @@ -145,7 +145,3 @@ namespace shogun %include %include - -%include -%include -%include diff --git a/src/interfaces/modular/Preprocessor_includes.i b/src/interfaces/modular/Preprocessor_includes.i index 95a101c4f86..35076c98410 100644 --- a/src/interfaces/modular/Preprocessor_includes.i +++ b/src/interfaces/modular/Preprocessor_includes.i @@ -25,7 +25,4 @@ #include #include -#include -#include -#include %} diff --git a/src/interfaces/modular/Statistics.i b/src/interfaces/modular/Statistics.i index e542c6c6fb1..4c63f8b4e58 100644 --- a/src/interfaces/modular/Statistics.i +++ b/src/interfaces/modular/Statistics.i @@ -7,45 +7,36 @@ * Written (W) 2012-2013 Heiko 
Strathmann */ +/* These functions return new Objects */ +%newobject shogun::CTwoDistributionTest::compute_distance(CDistance*); +%newobject shogun::CTwoDistributionTest::compute_joint_distance(CDistance*); +%newobject shogun::CQuadraticTimeMMD::get_p_and_q(); + /* Remove C Prefix */ %rename(HypothesisTest) CHypothesisTest; +%rename(OneDistributionTest) COneDistributionTest; +%rename(TwoDistributionTest) CTwoDistributionTest; %rename(IndependenceTest) CIndependenceTest; %rename(TwoSampleTest) CTwoSampleTest; -%rename(KernelTwoSampleTest) CKernelTwoSampleTest; +%rename(MMD) CMMD; %rename(StreamingMMD) CStreamingMMD; %rename(LinearTimeMMD) CLinearTimeMMD; +%rename(BTestMMD) CBTestMMD; %rename(QuadraticTimeMMD) CQuadraticTimeMMD; -%rename(KernelIndependenceTest) CKernelIndependenceTest; -%rename(HSIC) CHSIC; -%rename(NOCCO) CNOCCO; -%rename(KernelMeanMatching) CKernelMeanMatching; -%rename(KernelSelection) CKernelSelection; -%rename(MMDKernelSelection) CMMDKernelSelection; -%rename(MMDKernelSelectionComb) CMMDKernelSelectionComb; -%rename(MMDKernelSelectionMedian) CMMDKernelSelectionMedian; -%rename(MMDKernelSelectionMax) CMMDKernelSelectionMax; -%rename(MMDKernelSelectionOpt) CMMDKernelSelectionOpt; -%rename(MMDKernelSelectionCombOpt) CMMDKernelSelectionCombOpt; -%rename(MMDKernelSelectionCombMaxL2) CMMDKernelSelectionCombMaxL2; - +%rename(MultiKernelQuadraticTimeMMD) CMultiKernelQuadraticTimeMMD; +%rename(KernelSelectionStrategy) CKernelSelectionStrategy; /* Include Class Headers to make them visible from within the target language */ -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include -%include +%include +%include +%include +%include +%include +%include +%include +%include +%include +%include +%include +%include +%include diff --git a/src/interfaces/modular/Statistics_includes.i b/src/interfaces/modular/Statistics_includes.i index 
8cf811edeac..48aa564c620 100644 --- a/src/interfaces/modular/Statistics_includes.i +++ b/src/interfaces/modular/Statistics_includes.i @@ -1,22 +1,16 @@ %{ - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include - #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include %} diff --git a/src/shogun/distance/Distance.cpp b/src/shogun/distance/Distance.cpp index e3a581a9a04..7284d412299 100644 --- a/src/shogun/distance/Distance.cpp +++ b/src/shogun/distance/Distance.cpp @@ -56,13 +56,13 @@ bool CDistance::init(CFeatures* l, CFeatures* r) { REQUIRE(check_compatibility(l, r), "Features are not compatible!\n"); - //remove references to previous features - remove_lhs_and_rhs(); - //increase reference counts SG_REF(l); SG_REF(r); + //remove references to previous features + remove_lhs_and_rhs(); + lhs=l; rhs=r; diff --git a/src/shogun/features/DenseFeatures.cpp b/src/shogun/features/DenseFeatures.cpp index d88b5cd2fcd..da012aa2031 100644 --- a/src/shogun/features/DenseFeatures.cpp +++ b/src/shogun/features/DenseFeatures.cpp @@ -641,14 +641,14 @@ template CFeatures* CDenseFeatures::shallow_subset_copy() { CFeatures* shallow_copy_features=NULL; - + SG_SDEBUG("Using underlying feature matrix with %d dimensions and %d feature vectors!\n", num_features, num_vectors); SGMatrix shallow_copy_matrix(feature_matrix); shallow_copy_features=new CDenseFeatures(shallow_copy_matrix); SG_REF(shallow_copy_features); if (m_subset_stack->has_subsets()) shallow_copy_features->add_subset(m_subset_stack->get_last_subset()->get_subset_idx()); - + return shallow_copy_features; } diff --git a/src/shogun/features/streaming/StreamingDenseFeatures.cpp b/src/shogun/features/streaming/StreamingDenseFeatures.cpp index d47a1ec49d0..1db8f72ac5a 100644 --- 
a/src/shogun/features/streaming/StreamingDenseFeatures.cpp +++ b/src/shogun/features/streaming/StreamingDenseFeatures.cpp @@ -70,6 +70,7 @@ template void CStreamingDenseFeatures::reset_stream() parser.exit_parser(); parser.init(working_file, has_labels, 1); parser.set_free_vector_after_release(false); + parser.set_free_vectors_on_destruct(false); parser.start_parser(); } } diff --git a/src/shogun/io/streaming/InputParser.h b/src/shogun/io/streaming/InputParser.h index 73fdec0812f..652d0db3a1d 100644 --- a/src/shogun/io/streaming/InputParser.h +++ b/src/shogun/io/streaming/InputParser.h @@ -428,6 +428,7 @@ template else example_type = E_UNLABELLED; + SG_UNREF(examples_ring); examples_ring = new CParseBuffer(size); SG_REF(examples_ring); @@ -466,7 +467,8 @@ template } SG_SDEBUG("creating parse thread\n") - examples_ring->init_vector(); + if (examples_ring) + examples_ring->init_vector(); #ifdef HAVE_CXX11 parse_thread.reset(new std::thread(&parse_loop_entry_point, this)); #elif defined(HAVE_PTHREAD) diff --git a/src/shogun/kernel/CombinedKernel.cpp b/src/shogun/kernel/CombinedKernel.cpp index 31292136f8c..9c39a2c8fb7 100644 --- a/src/shogun/kernel/CombinedKernel.cpp +++ b/src/shogun/kernel/CombinedKernel.cpp @@ -811,7 +811,7 @@ CCombinedKernel* CCombinedKernel::obtain_from_generic(CKernel* kernel) if (kernel->get_kernel_type()!=K_COMBINED) { SG_SERROR("CCombinedKernel::obtain_from_generic(): provided kernel is " - "not of type CGaussianKernel!\n"); + "not of type CCombinedKernel!\n"); } /* since an additional reference is returned */ diff --git a/src/shogun/kernel/CustomKernel.h b/src/shogun/kernel/CustomKernel.h index 613a71a852d..165e9a724d9 100644 --- a/src/shogun/kernel/CustomKernel.h +++ b/src/shogun/kernel/CustomKernel.h @@ -550,13 +550,13 @@ class CCustomKernel: public CKernel */ SGMatrix get_float32_kernel_matrix() { - REQUIRE(!m_row_subset_stack, "%s::get_float32_kernel_matrix(): " + REQUIRE(!m_row_subset_stack->has_subsets(), 
"%s::get_float32_kernel_matrix(): " "Not possible with row subset active! If you want to" " create a %s from another one with a subset, use " "get_kernel_matrix() and the SGMatrix constructor!\n", get_name(), get_name()); - REQUIRE(!m_col_subset_stack, "%s::get_float32_kernel_matrix(): " + REQUIRE(!m_col_subset_stack->has_subsets(), "%s::get_float32_kernel_matrix(): " "Not possible with collumn subset active! If you want to" " create a %s from another one with a subset, use " "get_kernel_matrix() and the SGMatrix constructor!\n", diff --git a/src/shogun/kernel/ShiftInvariantKernel.h b/src/shogun/kernel/ShiftInvariantKernel.h index b52544b3277..3f75646a45d 100644 --- a/src/shogun/kernel/ShiftInvariantKernel.h +++ b/src/shogun/kernel/ShiftInvariantKernel.h @@ -39,6 +39,11 @@ namespace shogun { +namespace internal +{ + class KernelManager; +} + /** @brief Base class for the family of kernel functions that only depend on * the difference of the inputs, i.e. whose values does not change if the * inputs are shifted by the same amount. More precisely, @@ -49,6 +54,9 @@ namespace shogun */ class CShiftInvariantKernel: public CKernel { + + friend class internal::KernelManager; + public: /** Default constructor. 
*/ CShiftInvariantKernel(); diff --git a/src/shogun/labels/BinaryLabels.cpp b/src/shogun/labels/BinaryLabels.cpp index f46890d9718..6f1e93f0484 100644 --- a/src/shogun/labels/BinaryLabels.cpp +++ b/src/shogun/labels/BinaryLabels.cpp @@ -149,6 +149,6 @@ CLabels* CBinaryLabels::shallow_subset_copy() ((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector); if (m_subset_stack->has_subsets()) shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx()); - + return shallow_copy_labels; } diff --git a/src/shogun/labels/BinaryLabels.h b/src/shogun/labels/BinaryLabels.h index 248608483e2..462397596bd 100644 --- a/src/shogun/labels/BinaryLabels.h +++ b/src/shogun/labels/BinaryLabels.h @@ -119,7 +119,6 @@ class CBinaryLabels : public CDenseLabels #ifndef SWIG // SWIG should skip this part virtual CLabels* shallow_subset_copy(); #endif - }; } #endif diff --git a/src/shogun/labels/MulticlassLabels.cpp b/src/shogun/labels/MulticlassLabels.cpp index ef65ea092e0..3133efdbbfd 100644 --- a/src/shogun/labels/MulticlassLabels.cpp +++ b/src/shogun/labels/MulticlassLabels.cpp @@ -144,6 +144,6 @@ CLabels* CMulticlassLabels::shallow_subset_copy() ((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector); if (m_subset_stack->has_subsets()) shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx()); - - return shallow_copy_labels; + + return shallow_copy_labels; } diff --git a/src/shogun/labels/RegressionLabels.cpp b/src/shogun/labels/RegressionLabels.cpp index eb85c368526..5870eafd2d2 100644 --- a/src/shogun/labels/RegressionLabels.cpp +++ b/src/shogun/labels/RegressionLabels.cpp @@ -35,6 +35,6 @@ CLabels* CRegressionLabels::shallow_subset_copy() ((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector); if (m_subset_stack->has_subsets()) shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx()); - + return shallow_copy_labels; } diff --git 
a/src/shogun/labels/RegressionLabels.h b/src/shogun/labels/RegressionLabels.h index 831b69961c8..56698d149ce 100644 --- a/src/shogun/labels/RegressionLabels.h +++ b/src/shogun/labels/RegressionLabels.h @@ -69,7 +69,6 @@ class CRegressionLabels : public CDenseLabels #ifndef SWIG // SWIG should skip this part virtual CLabels* shallow_subset_copy(); #endif - }; } #endif diff --git a/src/shogun/machine/BaggingMachine.cpp b/src/shogun/machine/BaggingMachine.cpp index 5edeef58c7c..ad3aa046082 100644 --- a/src/shogun/machine/BaggingMachine.cpp +++ b/src/shogun/machine/BaggingMachine.cpp @@ -76,7 +76,7 @@ SGVector CBaggingMachine::apply_get_outputs(CFeatures* data) SGMatrix output(data->get_num_vectors(), m_num_bags); output.zero(); - + #pragma omp parallel for for (int32_t i = 0; i < m_num_bags; ++i) { @@ -178,7 +178,7 @@ bool CBaggingMachine::train_machine(CFeatures* data) labels->remove_subset(); #pragma omp critical - { + { // get out of bag indexes CDynamicArray* oob = get_oob_indices(idx); m_oob_indices->push_back(oob); diff --git a/src/shogun/multiclass/tree/CARTree.cpp b/src/shogun/multiclass/tree/CARTree.cpp index cd7c261da1d..ac0a310ed42 100644 --- a/src/shogun/multiclass/tree/CARTree.cpp +++ b/src/shogun/multiclass/tree/CARTree.cpp @@ -105,7 +105,7 @@ CMulticlassLabels* CCARTree::apply_multiclass(CFeatures* data) // apply multiclass starting from root bnode_t* current=dynamic_cast(get_root()); - + REQUIRE(current, "Tree machine not yet trained.\n"); CLabels* ret=apply_from_current_node(dynamic_cast*>(data), current); @@ -289,7 +289,7 @@ bool CCARTree::train_machine(CFeatures* data) void CCARTree::set_sorted_features(SGMatrix& sorted_feats, SGMatrix& sorted_indices) { - m_pre_sort=true; + m_pre_sort=true; m_sorted_features=sorted_feats; m_sorted_indices=sorted_indices; } @@ -414,7 +414,7 @@ CBinaryTreeMachineNode* CCARTree::CARTtrain(CFeatures* data, SG int32_t c_left=-1; int32_t c_right=-1; int32_t best_attribute; - + SGVector indices(num_vecs); if (m_pre_sort) 
{ @@ -532,13 +532,13 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S SGVector& left, SGVector& right, SGVector& is_left_final, int32_t &num_missing_final, int32_t &count_left, int32_t &count_right, int32_t subset_size, const SGVector& active_indices) { - SGVector labels_vec=(dynamic_cast(labels))->get_labels(); + SGVector labels_vec=(dynamic_cast(labels))->get_labels(); int32_t num_vecs=labels->get_num_labels(); int32_t num_feats; if (m_pre_sort) num_feats=mat.num_cols; else - num_feats=mat.num_rows; + num_feats=mat.num_rows; int32_t n_ulabels; SGVector ulabels=get_unique_labels(labels_vec,n_ulabels); @@ -567,7 +567,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S } } } - + SGVector idx(num_feats); idx.range_fill(); if (subset_size) @@ -579,7 +579,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S float64_t max_gain=MIN_SPLIT_GAIN; int32_t best_attribute=-1; float64_t best_threshold=0; - + SGVector indices_mask; SGVector count_indices(mat.num_rows); count_indices.zero(); @@ -603,6 +603,8 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S { SGVector feats(num_vecs); SGVector sorted_args(num_vecs); + SGVector temp_count_indices(count_indices.size()); + memcpy(temp_count_indices.vector, count_indices.vector, sizeof(int32_t)*count_indices.size()); if (m_pre_sort) { @@ -708,7 +710,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S if(dupes[j]!=j) is_left[j]=is_left[dupes[j]]; } - + float64_t g=0; if (m_mode==PT_MULTICLASS) g=gain(wleft,wright,total_wclasses); @@ -806,7 +808,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S count_right=1; if (m_pre_sort) { - SGVector temp_vec(mat.get_column_vector(best_attribute), mat.num_rows, false); + SGVector temp_vec(mat.get_column_vector(best_attribute), mat.num_rows, false); SGVector sorted_indices(m_sorted_indices.get_column_vector(best_attribute), mat.num_rows, false); int32_t count=0; 
for(int32_t i=0;i& mat, const S if(dupes[i]!=i) is_left_final[i]=is_left_final[dupes[i]]; } - - } + + } else { for (int32_t i=0;i& feats, co { Map map_weights(weights.vector, weights.size()); - Map map_feats(feats.vector, weights.size()); + Map map_feats(feats.vector, weights.size()); float64_t mean=map_weights.dot(map_feats); total_weight=map_weights.sum(); @@ -1104,7 +1106,7 @@ CLabels* CCARTree::apply_from_current_node(CDenseFeatures* feats, bno { int32_t num_vecs=feats->get_num_vectors(); REQUIRE(num_vecs>0, "No data provided in apply\n"); - + SGVector labels(num_vecs); for (int32_t i=0;i* tree) { REQUIRE(tree, "Tree not provided for pruning.\n"); - + CDynamicObjectArray* trees=new CDynamicObjectArray(); SG_UNREF(m_alphas); m_alphas=new CDynamicArray(); diff --git a/src/shogun/preprocessor/BAHSIC.h b/src/shogun/preprocessor/BAHSIC.h deleted file mode 100644 index e58a89e9d1d..00000000000 --- a/src/shogun/preprocessor/BAHSIC.h +++ /dev/null @@ -1,92 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef BAHSIC_H__ -#define BAHSIC_H__ - -#include -#include - -namespace shogun -{ - -/** @brief Class CBAHSIC, that extends CKernelDependenceMaximization and uses - * HSIC [1] to compute dependence measures for feature selection using a - * backward elimination approach as described in [1]. This class serves as a - * convenience class that initializes the CDependenceMaximization#m_estimator - * with an instance of CHSIC and allows only shogun::BACKWARD_ELIMINATION algorithm - * to use which is set internally. Therefore, trying to use other algorithms - * by set_algorithm() will not work. Plese see the class documentation of CHSIC - * and [2] for more details on mathematical description of HSIC. - * - * Refrences: - * [1] Song, Le and Bedo, Justin and Borgwardt, Karsten M. and Gretton, Arthur - * and Smola, Alex. (2007). Gene Selection via the BAHSIC Family of Algorithms. - * Journal Bioinformatics. Volume 23 Issue Pages i490-i498. Oxford University - * Press Oxford, UK - * [2]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel - * statistical test of independence. Advances in Neural Information Processing - * Systems, 1-8. 
- */ -class CBAHSIC : public CKernelDependenceMaximization -{ -public: - /** Default constructor */ - CBAHSIC(); - - /** Destructor */ - virtual ~CBAHSIC(); - - /** - * Since only shogun::BACKWARD_ELIMINATION algorithm is applicable for BAHSIC, - * and this is set internally, this method is overridden to prevent this - * to be set from public API. - * - * @param algorithm the feature selection algorithm to use - */ - virtual void set_algorithm(EFeatureSelectionAlgorithm algorithm); - - /** @return the preprocessor type */ - virtual EPreprocessorType get_type() const; - - /** @return the class name */ - virtual const char* get_name() const - { - return "BAHSIC"; - } - -private: - /** Register params and initialize with default values */ - void initialize_parameters(); - -}; - -} -#endif // BAHSIC_H__ diff --git a/src/shogun/preprocessor/DependenceMaximization.cpp b/src/shogun/preprocessor/DependenceMaximization.cpp index ac636f7fad9..16cb71576bc 100644 --- a/src/shogun/preprocessor/DependenceMaximization.cpp +++ b/src/shogun/preprocessor/DependenceMaximization.cpp @@ -31,7 +31,7 @@ #include #include #include -#include +#include #include #include diff --git a/src/shogun/preprocessor/KernelDependenceMaximization.cpp b/src/shogun/preprocessor/KernelDependenceMaximization.cpp deleted file mode 100644 index b8292ab166c..00000000000 --- a/src/shogun/preprocessor/KernelDependenceMaximization.cpp +++ /dev/null @@ -1,141 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. 
Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#include -#include -#include - -using namespace shogun; - -CKernelDependenceMaximization::CKernelDependenceMaximization() - : CDependenceMaximization() -{ - initialize_parameters(); -} - -void CKernelDependenceMaximization::initialize_parameters() -{ - SG_ADD((CSGObject**)&m_kernel_features, "kernel_features", - "the kernel to be used for features", MS_NOT_AVAILABLE); - SG_ADD((CSGObject**)&m_kernel_labels, "kernel_labels", - "the kernel to be used for labels", MS_NOT_AVAILABLE); - - m_kernel_features=NULL; - m_kernel_labels=NULL; -} - -CKernelDependenceMaximization::~CKernelDependenceMaximization() -{ - SG_UNREF(m_kernel_features); - SG_UNREF(m_kernel_labels); -} - -void CKernelDependenceMaximization::precompute() -{ - SG_DEBUG("Entering!\n"); - - REQUIRE(m_labels_feats, "Features for labels is not initialized!\n"); - REQUIRE(m_kernel_labels, "Kernel for labels is not initialized!\n"); - - // ASSERT here because the estimator is set internally and cannot - // be set via public API - ASSERT(m_estimator); - - CFeatureSelection::precompute(); - - // make sure that we have an instance of CKernelIndependenceTest via - // proper cast and set this kernel to the estimator - CKernelIndependenceTest* estimator - =dynamic_cast(m_estimator); - ASSERT(estimator); - - // precompute the kernel for labels - m_kernel_labels->init(m_labels_feats, m_labels_feats); - CCustomKernel* precomputed - =new CCustomKernel(m_kernel_labels->get_kernel_matrix()); - - // replace the kernel for labels with precomputed kernel - SG_UNREF(m_kernel_labels); - m_kernel_labels=precomputed; - SG_REF(m_kernel_labels); - - // we can safely SG_UNREF the feature object for labels now - SG_UNREF(m_labels_feats); - m_labels_feats=NULL; - - // finally set this as kernel for the labels - estimator->set_kernel_q(m_kernel_labels); - - SG_DEBUG("Leaving!\n"); -} - -void CKernelDependenceMaximization::set_kernel_features(CKernel* kernel) -{ - // sanity check. 
using assert here because estimator instances are - // set internally and cannot be set via public API. - ASSERT(m_estimator); - CKernelIndependenceTest* estimator - =dynamic_cast(m_estimator); - ASSERT(estimator); - - SG_REF(kernel); - SG_UNREF(m_kernel_features); - m_kernel_features=kernel; - - estimator->set_kernel_p(m_kernel_features); -} - -void CKernelDependenceMaximization::set_kernel_labels(CKernel* kernel) -{ - // sanity check. using assert here because estimator instances are - // set internally and cannot be set via public API. - ASSERT(m_estimator); - CKernelIndependenceTest* estimator - =dynamic_cast(m_estimator); - ASSERT(estimator); - - SG_REF(kernel); - SG_UNREF(m_kernel_labels); - m_kernel_labels=kernel; - - estimator->set_kernel_q(m_kernel_labels); -} - -CKernel* CKernelDependenceMaximization::get_kernel_features() const -{ - SG_REF(m_kernel_features); - return m_kernel_features; -} - -CKernel* CKernelDependenceMaximization::get_kernel_labels() const -{ - SG_REF(m_kernel_labels); - return m_kernel_labels; -} diff --git a/src/shogun/preprocessor/KernelDependenceMaximization.h b/src/shogun/preprocessor/KernelDependenceMaximization.h deleted file mode 100644 index 9d4159088dd..00000000000 --- a/src/shogun/preprocessor/KernelDependenceMaximization.h +++ /dev/null @@ -1,105 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. 
- * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef KERNEL_DEPENDENCE_MAXIMIZATION_H__ -#define KERNEL_DEPENDENCE_MAXIMIZATION_H__ - -#include -#include - -namespace shogun -{ - -class CFeatures; -class CKernelSelection; - -/** @brief Class CKernelDependenceMaximization, that uses an implementation - * of CKernelIndependenceTest to compute dependence measures for feature - * selection. Different kernels are used for labels and data. 
For the sake - * of computational convenience, the precompute() method is overridden to - * precompute the kernel for labels and save as an instance of CCustomKernel - */ -class CKernelDependenceMaximization : public CDependenceMaximization -{ -public: - /** Default constructor */ - CKernelDependenceMaximization(); - - /** Destructor */ - virtual ~CKernelDependenceMaximization(); - - /** @param kernel the kernel for features (data) */ - void set_kernel_features(CKernel* kernel); - - /** @return the kernel for features */ - CKernel* get_kernel_features() const; - - /** @param kernel the kernel for labels */ - void set_kernel_labels(CKernel* kernel); - - /** @return the kernel for labels */ - CKernel* get_kernel_labels() const; - - /** - * Abstract method which is overridden in the subclasses to set accepted - * feature selection algorithm - * - * @param algorithm the feature selection algorithm to use - */ - virtual void set_algorithm(EFeatureSelectionAlgorithm algorithm)=0; - - /** @return the class name */ - virtual const char* get_name() const - { - return "KernelDependenceMaximization"; - } - -protected: - /** - * Precomputes the kernel on labels and replaces the #m_kernel_labels - * with an instance of CCustomKernel. Labels features are set via - * CDependenceMaximization::set_labels call. 
- */ - virtual void precompute(); - - /** The kernel for data (features) to be used in CKernelIndependenceTest */ - CKernel* m_kernel_features; - - /** The kernel for labels to be used in CKernelIndependenceTest */ - CKernel* m_kernel_labels; - -private: - /** Register params and initialize with default values */ - void initialize_parameters(); - -}; - -} -#endif // KERNEL_DEPENDENCE_MAXIMIZATION_H__ diff --git a/src/shogun/statistical_testing/BTestMMD.cpp b/src/shogun/statistical_testing/BTestMMD.cpp new file mode 100644 index 00000000000..d1be47bd3dc --- /dev/null +++ b/src/shogun/statistical_testing/BTestMMD.cpp @@ -0,0 +1,117 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +CBTestMMD::CBTestMMD() : CStreamingMMD() +{ +} + +CBTestMMD::~CBTestMMD() +{ +} + +void CBTestMMD::set_blocksize(index_t blocksize) +{ + get_data_mgr().set_blocksize(blocksize); +} + +void CBTestMMD::set_num_blocks_per_burst(index_t num_blocks_per_burst) +{ + get_data_mgr().set_num_blocks_per_burst(num_blocks_per_burst); +} + +const std::function)> CBTestMMD::get_direct_estimation_method() const +{ + return mmd::WithinBlockDirect(); +} + +float64_t CBTestMMD::normalize_statistic(float64_t statistic) const +{ + const DataManager& data_mgr=get_data_mgr(); + const index_t Nx=data_mgr.num_samples_at(0); + const index_t Ny=data_mgr.num_samples_at(1); + const index_t Bx=data_mgr.blocksize_at(0); + const index_t By=data_mgr.blocksize_at(1); + return Nx*Ny*statistic*CMath::sqrt((Bx+By)/float64_t(Nx+Ny))/(Nx+Ny); +} + +const float64_t CBTestMMD::normalize_variance(float64_t variance) const +{ + const DataManager& data_mgr=get_data_mgr(); + const index_t Bx=data_mgr.blocksize_at(0); + const index_t By=data_mgr.blocksize_at(1); + return variance*CMath::sq(Bx*By/float64_t(Bx+By)); +} + +float64_t CBTestMMD::compute_p_value(float64_t statistic) +{ + float64_t result=0; + switch (get_null_approximation_method()) + { + case NAM_MMD1_GAUSSIAN: + { + float64_t sigma_sq=compute_variance(); + float64_t std_dev=CMath::sqrt(sigma_sq); + result=1.0-CStatistics::normal_cdf(statistic, std_dev); + break; + } + default: + { + result=CHypothesisTest::compute_p_value(statistic); + break; + } + } + return result; +} + +float64_t CBTestMMD::compute_threshold(float64_t alpha) +{ + float64_t result=0; + switch (get_null_approximation_method()) + { + case NAM_MMD1_GAUSSIAN: + { + float64_t sigma_sq=compute_variance(); + float64_t std_dev=CMath::sqrt(sigma_sq); + result=1.0-CStatistics::inverse_normal_cdf(1-alpha, 0, std_dev); + break; + } + default: + { + 
result=CHypothesisTest::compute_threshold(alpha); + break; + } + } + return result; +} + +const char* CBTestMMD::get_name() const +{ + return "BTestMMD"; +} diff --git a/src/shogun/statistical_testing/BTestMMD.h b/src/shogun/statistical_testing/BTestMMD.h new file mode 100644 index 00000000000..03439818c17 --- /dev/null +++ b/src/shogun/statistical_testing/BTestMMD.h @@ -0,0 +1,48 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#ifndef B_TEST_MMD_H_ +#define B_TEST_MMD_H_ + +#include + +namespace shogun +{ + +class CBTestMMD : public CStreamingMMD +{ +public: + typedef std::function)> operation; + CBTestMMD(); + virtual ~CBTestMMD(); + + void set_blocksize(index_t blocksize); + void set_num_blocks_per_burst(index_t num_blocks_per_burst); + + virtual float64_t compute_p_value(float64_t statistic); + virtual float64_t compute_threshold(float64_t alpha); + + virtual const char* get_name() const; +private: + virtual const operation get_direct_estimation_method() const; + virtual float64_t normalize_statistic(float64_t statistic) const; + virtual const float64_t normalize_variance(float64_t variance) const; +}; + +} +#endif // B_TEST_MMD_H_ diff --git a/src/shogun/statistics/HypothesisTest.cpp b/src/shogun/statistical_testing/HypothesisTest.cpp similarity index 50% rename from src/shogun/statistics/HypothesisTest.cpp rename to src/shogun/statistical_testing/HypothesisTest.cpp index d8167fd9e24..9afd853d094 100644 --- a/src/shogun/statistics/HypothesisTest.cpp +++ b/src/shogun/statistical_testing/HypothesisTest.cpp @@ -1,6 +1,7 @@ /* * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De * All rights reserved. * * Redistribution and use in source and binary forms, with or without @@ -28,98 +29,84 @@ * either expressed or implied, of the Shogun Development Team. 
*/ -#include -#include +#include #include #include +#include +#include using namespace shogun; +using namespace internal; -CHypothesisTest::CHypothesisTest() : CSGObject() +struct CHypothesisTest::Self +{ + explicit Self(index_t num_distributions); + DataManager data_mgr; +}; + +CHypothesisTest::Self::Self(index_t num_distributions) : data_mgr(num_distributions) { - init(); } -CHypothesisTest::~CHypothesisTest() +CHypothesisTest::CHypothesisTest() { + SG_WARNING("An empty instance of this class should not be used! If you are seeing \ + this error, please contact Shogun developers!\n"); } -void CHypothesisTest::init() +CHypothesisTest::CHypothesisTest(index_t num_distributions) : CSGObject() { - SG_ADD(&m_num_null_samples, "num_null_samples", - "Number of permutation iterations for sampling null", - MS_NOT_AVAILABLE); - SG_ADD((machine_int_t*)&m_null_approximation_method, - "null_approximation_method", - "Method for approximating null distribution", - MS_NOT_AVAILABLE); + self=std::unique_ptr(new CHypothesisTest::Self(num_distributions)); +} - m_num_null_samples=250; - m_null_approximation_method=PERMUTATION; +CHypothesisTest::~CHypothesisTest() +{ } -void CHypothesisTest::set_null_approximation_method( - ENullApproximationMethod null_approximation_method) +void CHypothesisTest::set_train_test_mode(bool on) { - m_null_approximation_method=null_approximation_method; + self->data_mgr.set_train_test_mode(on); } -void CHypothesisTest::set_num_null_samples(index_t num_null_samples) +void CHypothesisTest::set_train_test_ratio(float64_t ratio) { - m_num_null_samples=num_null_samples; + self->data_mgr.set_train_test_ratio(ratio); + self->data_mgr.reset(); } float64_t CHypothesisTest::compute_p_value(float64_t statistic) { - float64_t result=0; - - if (m_null_approximation_method==PERMUTATION) - { - /* sample a bunch of MMD values from null distribution */ - SGVector values=sample_null(); - - /* find out percentile of parameter "statistic" in null distribution */ - 
CMath::qsort(values); - float64_t i=values.find_position_to_insert(statistic); - - /* return corresponding p-value */ - result=1.0-i/values.vlen; - } - else - SG_ERROR("Unknown method to approximate null distribution!\n"); - - return result; + SGVector values=sample_null(); + std::sort(values.vector, values.vector + values.vlen); + float64_t i=values.find_position_to_insert(statistic); + return 1.0-i/values.vlen; } float64_t CHypothesisTest::compute_threshold(float64_t alpha) { - float64_t result=0; - - if (m_null_approximation_method==PERMUTATION) - { - /* sample a bunch of MMD values from null distribution */ - SGVector values=sample_null(); + SGVector values=sample_null(); + std::sort(values.vector, values.vector + values.vlen); + return values[index_t(CMath::floor(values.vlen*(1-alpha)))]; +} - /* return value of (1-alpha) quantile */ - CMath::qsort(values); - result=values[index_t(CMath::floor(values.vlen*(1-alpha)))]; - } - else - SG_ERROR("Unknown method to approximate null distribution!\n"); +bool CHypothesisTest::perform_test(float64_t alpha) +{ + auto statistic=compute_statistic(); + auto p_value=compute_p_value(statistic); + return p_valuedata_mgr; } -bool CHypothesisTest::perform_test(float64_t alpha) +const DataManager& CHypothesisTest::get_data_mgr() const { - float64_t p_value=perform_test(); - return p_valuedata_mgr; } diff --git a/src/shogun/statistics/HypothesisTest.h b/src/shogun/statistical_testing/HypothesisTest.h similarity index 52% rename from src/shogun/statistics/HypothesisTest.h rename to src/shogun/statistical_testing/HypothesisTest.h index e9607760887..2346e9ef5ee 100644 --- a/src/shogun/statistics/HypothesisTest.h +++ b/src/shogun/statistical_testing/HypothesisTest.h @@ -1,6 +1,7 @@ /* * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De * All rights reserved. 
* * Redistribution and use in source and binary forms, with or without @@ -31,40 +32,30 @@ #ifndef HYPOTHESIS_TEST_H_ #define HYPOTHESIS_TEST_H_ +#include #include - #include namespace shogun { -/** enum for different statistic types */ -enum EStatisticType -{ - S_LINEAR_TIME_MMD, - S_QUADRATIC_TIME_MMD, - S_HSIC, - S_NOCCO -}; +class CFeatures; -/** enum for different method to approximate null-distibution */ -enum ENullApproximationMethod +namespace internal { - PERMUTATION, - MMD2_SPECTRUM_DEPRECATED, - MMD2_SPECTRUM, - MMD2_GAMMA, - MMD1_GAUSSIAN, - HSIC_GAMMA -}; -/** @brief Hypothesis test base class. Provides an interface for statistical +class DataManager; + +} + +/** + * @brief Hypothesis test base class. Provides an interface for statistical * hypothesis testing via three methods: compute_statistic(), compute_p_value() * and compute_threshold(). The second computes a p-value for the statistic - * computed by the first method. - * The p-value represents the position of the statistic in the null-distribution, - * i.e. the distribution of the statistic population given the null-hypothesis - * is true. (1-position = p-value). + * computed by the first method. The p-value represents the position of the + * statistic in the null-distribution, i.e. the distribution of the statistic + * population given the null-hypothesis is true. (1-position = p-value). + * * The third method, compute_threshold(), computes a threshold for a given * test level which is needed to reject the null-hypothesis. * @@ -78,20 +69,50 @@ enum ENullApproximationMethod class CHypothesisTest : public CSGObject { public: - /** default constructor */ + /** Default constructor */ CHypothesisTest(); - /** destructor */ + /** Destructor */ virtual ~CHypothesisTest(); - /** @return test statistic for the given data/parameters/methods */ - virtual float64_t compute_statistic()=0; + /** + * Method that enables/disables the training-testing mode. 
If this option + * is turned on, then the samples would be split in two pieces: one chunk + * would be used for training algorithms and the other chunk would be used + * for performing tests. If this option is turned off, the entire data + * would be used for performing the test. Before running any training + * algorithms, make sure to turn this mode on. + * + * By default, the training-testing mode is turned off. + * + * \sa {set_train_test_ratio()} + * + * @param on Whether to enable/disable the training-testing mode + */ + void set_train_test_mode(bool on); + + /** + * Method that specifies the ratio of training-testing data split for the + * algorithms. Note that this is NOT the percentage of samples to be used + * for training, rather the ratio of the number of samples to be used for + * training and that of testing. + * + * By default, an equal 50-50 split (ratio = 1) is made. + * + * \sa {set_train_test_mode()} + * + * @param ratio The ratio of the number of samples to be used for training + * and that of testing + */ + void set_train_test_ratio(float64_t ratio); - /** computes a p-value based on current method for approximating the - * null-distribution. The p-value is the 1-p quantile of the null- + /** + * Method that computes a p-value based on current method for approximating + * the null-distribution. The p-value is the 1-p quantile of the null- * distribution where the given statistic lies in. + * * This method depends on the implementation of sample_null method - * which should be implemented in its sub-classes + * which should be implemented by the sub-classes. * * @param statistic statistic value to compute the p-value for * @return p-value parameter statistic is the (1-p) percentile of the @@ -99,38 +120,24 @@ class CHypothesisTest : public CSGObject */ virtual float64_t compute_p_value(float64_t statistic); - /** computes a threshold based on current method for approximating the - * null-distribution. 
The threshold is the value that a statistic has + /** + * Method that computes a threshold based on current method for approximating + * the null-distribution. The threshold is the value that a statistic has * to have in ordner to reject the null-hypothesis. + * * This method depends on the implementation of sample_null method - * which should be implemented in its sub-classes + * which should be implemented by the sub-classes. * * @param alpha test level to reject null-hypothesis - * @return threshold for statistics to reject null-hypothesis + * @return Threshold for statistics to reject null-hypothesis */ virtual float64_t compute_threshold(float64_t alpha); - /** Performs the complete two-sample test on current data and returns a - * p-value. - * - * This is a wrapper that calls compute_statistic first and then - * calls compute_p_value using the obtained statistic. In some statistic - * classes, it might be possible to compute statistic and p-value in - * one single run which is more efficient. Therefore, this method might - * be overwritten in subclasses. + /** + * Method that performs the complete hypothesis test on current data and + * returns a binary answer: whether the null hypothesis is rejected or not. * - * The method for computing the p-value can be set via - * set_null_approximation_method(). - * - * @return p-value such that computed statistic is the (1-p) quantile - * of the estimated null distribution - */ - virtual float64_t perform_test(); - - /** Performs the complete two-sample test on current data and returns - * a binary answer wheter null hypothesis is rejected or not. - * - * This is just a wrapper for the above perform_test() method that + * This is just a wrapper for the above compute_p_value() method that * returns a p-value. If this p-value lies below the test level alpha, * the null hypothesis is rejected. 
* @@ -141,42 +148,34 @@ class CHypothesisTest : public CSGObject */ bool perform_test(float64_t alpha); - /** computes the test statistic m_num_null_samples times, exact - * computation depends on the implementations. + /** + * Interface for computing the test-statistic for the hypothesis test. * - * @return vector of all statistics + * @return Test statistic for the given data/parameters/methods */ - virtual SGVector sample_null()=0; + virtual float64_t compute_statistic()=0; - /** sets the number of permutation iterations for sample_null() + /** + * Interface for computing the samples under the null-hypothesis. * - * @param num_null_samples how often permutation shall be done + * @return Vector of all statistics */ - virtual void set_num_null_samples(index_t num_null_samples); - - /** sets the method how to approximate the null-distribution - * @param null_approximation_method method to use - */ - virtual void set_null_approximation_method( - ENullApproximationMethod null_approximation_method); - - /** returns the statistic type of this test statistic */ - virtual EStatisticType get_statistic_type() const=0; - - virtual const char* get_name() const=0; - -private: - /** register parameters and initialize with default values */ - void init(); + virtual SGVector sample_null()=0; + /** @return The name of the class */ + virtual const char* get_name() const; protected: - /** number of iterations for sampling from null-distributions */ - index_t m_num_null_samples; + explicit CHypothesisTest(index_t num_distributions); + internal::DataManager& get_data_mgr(); + const internal::DataManager& get_data_mgr() const; +private: + CHypothesisTest(const CHypothesisTest& other)=delete; + CHypothesisTest& operator=(const CHypothesisTest& other)=delete; - /** Defines how the the null distribution is approximated */ - ENullApproximationMethod m_null_approximation_method; + struct Self; + std::unique_ptr self; }; } -#endif /* HYPOTHESIS_TEST_H_ */ +#endif // HYPOTHESIS_TEST_H_ diff 
--git a/src/shogun/statistical_testing/IndependenceTest.cpp b/src/shogun/statistical_testing/IndependenceTest.cpp new file mode 100644 index 00000000000..77b8013a401 --- /dev/null +++ b/src/shogun/statistical_testing/IndependenceTest.cpp @@ -0,0 +1,79 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +struct CIndependenceTest::Self +{ + Self(index_t num_kernels); + KernelManager kernel_mgr; +}; + +CIndependenceTest::Self::Self(index_t num_kernels) : kernel_mgr(num_kernels) +{ +} + +CIndependenceTest::CIndependenceTest() : CTwoDistributionTest() +{ + self=std::unique_ptr(new Self(IndependenceTest::num_kernels)); +} + +CIndependenceTest::~CIndependenceTest() +{ +} + +void CIndependenceTest::set_kernel_p(CKernel* kernel_p) +{ + self->kernel_mgr.kernel_at(0)=kernel_p; +} + +CKernel* CIndependenceTest::get_kernel_p() const +{ + return self->kernel_mgr.kernel_at(0); +} + +void CIndependenceTest::set_kernel_q(CKernel* kernel_q) +{ + self->kernel_mgr.kernel_at(1)=kernel_q; +} + +CKernel* CIndependenceTest::get_kernel_q() const +{ + return self->kernel_mgr.kernel_at(1); +} + +const char* CIndependenceTest::get_name() const +{ + return "IndependenceTest"; +} + +KernelManager& 
CIndependenceTest::get_kernel_mgr() +{ + return self->kernel_mgr; +} + +const KernelManager& CIndependenceTest::get_kernel_mgr() const +{ + return self->kernel_mgr; +} diff --git a/src/shogun/statistical_testing/IndependenceTest.h b/src/shogun/statistical_testing/IndependenceTest.h new file mode 100644 index 00000000000..492fd46d998 --- /dev/null +++ b/src/shogun/statistical_testing/IndependenceTest.h @@ -0,0 +1,105 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#ifndef INDEPENDENCE_TEST_H_ +#define INDEPENDENCE_TEST_H_ + +#include +#include + +namespace shogun +{ + +class CKernel; + +namespace internal +{ + class KernelManager; +} + +/** + * @brief Provides an interface for performing the independence test. + * Given samples \f$Z=\{(x_i,y_i)\}_{i=1}^m\f$ from the joint distribution + * \f$\textbf{P}_{xy}\f$, the test checks whether the joint distribution factorizes as + * \f$\textbf{P}_{xy}=\textbf{P}_x\textbf{P}_y\f$, i.e. as the product of the marginals. + * The null-hypothesis says yes, i.e. no dependence; the alternative hypothesis + * says no. + * + * Abstract base class. Provides all interfaces and implements approximating + * the null distribution via permutation, i.e. 
shuffling the samples from + * one distribution repeatedly using subsets while keeping the samples from + * the other distribution in its original order + * + */ +class CIndependenceTest : public CTwoDistributionTest +{ +public: + /** Default constructor */ + CIndependenceTest(); + + /** Destructor */ + virtual ~CIndependenceTest(); + + /** + * Method that sets the kernel to be used for performing the test for the + * samples from p. + * + * @param kernel_p The kernel instance to be used for samples from p + */ + void set_kernel_p(CKernel* kernel_p); + + /** @return The kernel instance that is used for samples from p */ + CKernel* get_kernel_p() const; + + /** + * Method that sets the kernel to be used for performing the test for the + * samples from q. + * + * @param kernel_q The kernel instance to be used for samples from q + */ + void set_kernel_q(CKernel* kernel_q); + + /** @return The kernel instance that is used for samples from q */ + CKernel* get_kernel_q() const; + + /** + * Interface for computing the test-statistic for the hypothesis test. + * + * @return test statistic for the given data/parameters/methods + */ + virtual float64_t compute_statistic()=0; + + /** + * Interface for computing the samples under the null-hypothesis. + * + * @return vector of all statistics + */ + virtual SGVector sample_null()=0; + + /** @return The name of the class */ + virtual const char* get_name() const; +protected: + internal::KernelManager& get_kernel_mgr(); + const internal::KernelManager& get_kernel_mgr() const; +private: + struct Self; + std::unique_ptr self; +}; + +} +#endif // INDEPENDENCE_TEST_H_ diff --git a/src/shogun/statistical_testing/LinearTimeMMD.cpp b/src/shogun/statistical_testing/LinearTimeMMD.cpp new file mode 100644 index 00000000000..c1e9fac7de5 --- /dev/null +++ b/src/shogun/statistical_testing/LinearTimeMMD.cpp @@ -0,0 +1,152 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. 
+ * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +CLinearTimeMMD::CLinearTimeMMD() : CStreamingMMD() +{ +} + +CLinearTimeMMD::CLinearTimeMMD(CFeatures* samples_from_p, CFeatures* samples_from_q) : CStreamingMMD() +{ + set_p(samples_from_p); + set_q(samples_from_q); +} + +CLinearTimeMMD::~CLinearTimeMMD() +{ +} + +void CLinearTimeMMD::set_num_blocks_per_burst(index_t num_blocks_per_burst) +{ + auto& data_mgr=get_data_mgr(); + auto min_blocksize=data_mgr.get_min_blocksize(); + if (min_blocksize==2) + { + // only possible when number of samples from both the distributions are the same + auto N=data_mgr.num_samples_at(0); + for (auto i=2; i)> CLinearTimeMMD::get_direct_estimation_method() const +{ + return mmd::WithinBlockDirect(); +} + +float64_t CLinearTimeMMD::normalize_statistic(float64_t statistic) const +{ + const DataManager& data_mgr = get_data_mgr(); + const index_t Nx = data_mgr.num_samples_at(0); + const index_t Ny = data_mgr.num_samples_at(1); + return CMath::sqrt(Nx * Ny / float64_t(Nx + Ny)) * statistic; +} + +const float64_t CLinearTimeMMD::normalize_variance(float64_t variance) const +{ + const DataManager& data_mgr = get_data_mgr(); + const index_t Bx = data_mgr.blocksize_at(0); + const index_t By 
= data_mgr.blocksize_at(1); + const index_t B = Bx + By; + if (get_statistic_type() == ST_UNBIASED_INCOMPLETE) + { + return variance * B * (B - 2) / 16; + } + return variance * Bx * By * (Bx - 1) * (By - 1) / (B - 1) / (B - 2); +} + +const float64_t CLinearTimeMMD::gaussian_variance(float64_t variance) const +{ + const DataManager& data_mgr = get_data_mgr(); + const index_t Bx = data_mgr.blocksize_at(0); + const index_t By = data_mgr.blocksize_at(1); + const index_t B = Bx + By; + if (get_statistic_type() == ST_UNBIASED_INCOMPLETE) + { + return variance * 4 / (B - 2); + } + return variance * (B - 1) * (B - 2) / (Bx - 1) / (By - 1) / B; +} + +float64_t CLinearTimeMMD::compute_p_value(float64_t statistic) +{ + float64_t result = 0; + switch (get_null_approximation_method()) + { + case NAM_MMD1_GAUSSIAN: + { + float64_t sigma_sq = gaussian_variance(compute_variance()); + float64_t std_dev = CMath::sqrt(sigma_sq); + result = 1.0 - CStatistics::normal_cdf(statistic, std_dev); + break; + } + default: + { + result = CHypothesisTest::compute_p_value(statistic); + break; + } + } + return result; +} + +float64_t CLinearTimeMMD::compute_threshold(float64_t alpha) +{ + float64_t result = 0; + switch (get_null_approximation_method()) + { + case NAM_MMD1_GAUSSIAN: + { + float64_t sigma_sq = gaussian_variance(compute_variance()); + float64_t std_dev = CMath::sqrt(sigma_sq); + result = 1.0 - CStatistics::inverse_normal_cdf(1 - alpha, 0, std_dev); + break; + } + default: + { + result = CHypothesisTest::compute_threshold(alpha); + break; + } + } + return result; +} + +const char* CLinearTimeMMD::get_name() const +{ + return "LinearTimeMMD"; +} diff --git a/src/shogun/statistical_testing/LinearTimeMMD.h b/src/shogun/statistical_testing/LinearTimeMMD.h new file mode 100644 index 00000000000..fba6013d31e --- /dev/null +++ b/src/shogun/statistical_testing/LinearTimeMMD.h @@ -0,0 +1,49 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. 
+ * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#ifndef LINEAR_TIME_MMD_H_ +#define LINEAR_TIME_MMD_H_ + +#include + +namespace shogun +{ + +class CLinearTimeMMD : public CStreamingMMD +{ +public: + typedef std::function)> operation; + CLinearTimeMMD(); + CLinearTimeMMD(CFeatures* samples_from_p, CFeatures* samples_from_q); + virtual ~CLinearTimeMMD(); + + void set_num_blocks_per_burst(index_t num_blocks_per_burst); + + virtual float64_t compute_p_value(float64_t statistic); + virtual float64_t compute_threshold(float64_t alpha); + + virtual const char* get_name() const; +private: + virtual const operation get_direct_estimation_method() const; + virtual float64_t normalize_statistic(float64_t statistic) const; + virtual const float64_t normalize_variance(float64_t variance) const; + const float64_t gaussian_variance(float64_t variance) const; +}; + +} +#endif // LINEAR_TIME_MMD_H_ diff --git a/src/shogun/statistical_testing/MMD.cpp b/src/shogun/statistical_testing/MMD.cpp new file mode 100644 index 00000000000..6ede25d88aa --- /dev/null +++ b/src/shogun/statistical_testing/MMD.cpp @@ -0,0 +1,162 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using std::unique_ptr; +using std::shared_ptr; + +struct CMMD::Self +{ + Self() + { + num_null_samples = DEFAULT_NUM_NULL_SAMPLES; + stype = DEFAULT_STYPE; + null_approximation_method = DEFAULT_NULL_APPROXIMATION_METHOD; + strategy=unique_ptr(new CKernelSelectionStrategy()); + } + + index_t num_null_samples; + EStatisticType stype; + ENullApproximationMethod null_approximation_method; + std::unique_ptr strategy; + + static constexpr index_t DEFAULT_NUM_NULL_SAMPLES = 250; + static constexpr EStatisticType DEFAULT_STYPE = ST_UNBIASED_FULL; + static constexpr ENullApproximationMethod DEFAULT_NULL_APPROXIMATION_METHOD = NAM_PERMUTATION; +}; + +CMMD::CMMD() : CTwoSampleTest() +{ + init(); +} + +CMMD::CMMD(CFeatures* samples_from_p, CFeatures* samples_from_q) : CTwoSampleTest(samples_from_p, samples_from_q) +{ + init(); +} + +void CMMD::init() +{ +#if EIGEN_VERSION_AT_LEAST(3,1,0) + Eigen::initParallel(); +#endif + self=unique_ptr(new Self()); +} + +CMMD::~CMMD() +{ + cleanup(); +} + +void CMMD::set_kernel_selection_strategy(EKernelSelectionMethod method, bool weighted) +{ + self->strategy->use_method(method) + .use_weighted(weighted); +} + +void CMMD::set_kernel_selection_strategy(EKernelSelectionMethod method, index_t num_runs, + index_t num_folds, float64_t alpha) +{ + self->strategy->use_method(method) + .use_num_runs(num_runs) + .use_num_folds(num_folds) + .use_alpha(alpha); +} + +CKernelSelectionStrategy const * CMMD::get_kernel_selection_strategy() const +{ + return self->strategy.get(); +} + +void CMMD::add_kernel(CKernel* kernel) +{ + self->strategy->add_kernel(kernel); +} + +void CMMD::select_kernel() +{ + SG_DEBUG("Entering!\n"); + auto& data_mgr=get_data_mgr(); + data_mgr.set_train_mode(true); + CMMD::set_kernel(self->strategy->select_kernel(this)); + data_mgr.set_train_mode(false); + SG_DEBUG("Leaving!\n"); +} + +void 
CMMD::cleanup() +{ + get_kernel_mgr().restore_kernel_at(0); +} + +void CMMD::set_num_null_samples(index_t null_samples) +{ + self->num_null_samples=null_samples; +} + +index_t CMMD::get_num_null_samples() const +{ + return self->num_null_samples; +} + +void CMMD::set_statistic_type(EStatisticType stype) +{ + self->stype=stype; +} + +EStatisticType CMMD::get_statistic_type() const +{ + return self->stype; +} + +void CMMD::set_null_approximation_method(ENullApproximationMethod nmethod) +{ + self->null_approximation_method=nmethod; +} + +ENullApproximationMethod CMMD::get_null_approximation_method() const +{ + return self->null_approximation_method; +} + +const char* CMMD::get_name() const +{ + return "MMD"; +} diff --git a/src/shogun/statistical_testing/MMD.h b/src/shogun/statistical_testing/MMD.h new file mode 100644 index 00000000000..1da59ae14b3 --- /dev/null +++ b/src/shogun/statistical_testing/MMD.h @@ -0,0 +1,256 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef MMD_H_ +#define MMD_H_ + +#include +#include +#include +#include +#include + +namespace shogun +{ + +class CKernel; +class CKernelSelectionStrategy; +template class SGVector; + +/** @brief Abstract base class that provides an interface for performing kernel + * two-sample test using Maximum Mean Discrepancy (MMD) as the test statistic. + * The MMD is the distance of two probability distributions \f$p\f$ and \f$q\f$ + * in a RKHS (see [1] for formal description). + * + * \f[ + * \text{MMD}[\mathcal{F},p,q]^2=||\mu_p - \mu_q||^2_\mathcal{F}= + * \textbf{E}_{x,x'}\left[ k(x,x')\right] + * -2\textbf{E}_{x,y}\left[ k(x,y)\right] + * +\textbf{E}_{y,y'}\left[ k(y,y')\right] + * \f] + * + * where \f$x,x'\sim p\f$ and \f$y,y'\sim q\f$. 
+ * + * Given two sets of samples \f$\{x_i\}_{i=1}^{n_x}\sim p\f$ and + * \f$\{y_i\}_{i=1}^{n_y}\sim q\f$, \f$n_x+n_y=n\f$, + * the unbiased estimate of the above statistic is computed as + * \f[ + * \hat{\eta}_{k,U}=\frac{1}{n_x(n_x-1)}\sum_{i=1}^{n_x}\sum_{j\neq i} + * k(x_i,x_j)+\frac{1}{n_y(n_y-1)}\sum_{i=1}^{n_y}\sum_{j\neq i}k(y_i,y_j) + * -\frac{2}{n_xn_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}k(x_i,y_j) + * \f] + * + * A biased version is + * \f[ + * \hat{\eta}_{k,V}=\frac{1}{n_x^2}\sum_{i=1}^{n_x}\sum_{j=1}^{n_x} + * k(x_i,x_j)+\frac{1}{n_y^2}\sum_{i=1}^{n_y}\sum_{j=1}^{n_y}k(y_i,y_j) + * -\frac{2}{n_xn_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}k(x_i,y_j) + * \f] + * + * When \f$n_x=n_y=\frac{n}{2}\f$, an incomplete version can also be computed + * as follows + * \f[ + * \hat{\eta}_{k,U^-}=\frac{1}{\frac{n}{2}(\frac{n}{2}-1)}\sum_{i\neq j} + * h(z_i,z_j) + * \f] + * where for each pair \f$z=(x,y)\f$, \f$h(z,z')=k(x,x')+k(y,y')-k(x,y')- + * k(x',y)\f$. + * + * The type (biased/unbiased/incomplete) can be selected via set_statistic_type() + * using the enum values from EStatisticType, ST_BIASED, ST_UNBIASED and ST_INCOMPLETE, + * respectively. The estimate returned by compute_statistic() + * is \f$\frac{n_xn_y}{n_x+n_y}\hat{\eta}_k\f$. + * + * This class provides an interface for adding multiple kernels and then + * selecting the best kernel based on specified strategies. For details on the + * various learning algorithms for optimal kernel selection, please refer to [2]. + * + * Along with the statistic comes a method to compute a p-value based on + * different methods. Permutation test is possible. If unsure which one to + * use, sampling with 250 permutation iterations is usually a safe choice. + * + * To choose, call set_null_approximation_method() with one of the following values: + * + * NAM_MMD2_SPECTRUM: For a fast, consistent test based on the spectrum of + * the kernel matrix, as described in [2]. Only supported if Eigen3 is installed.
+ * Only applicable for CQuadraticTimeMMD. + * + * NAM_MMD2_GAMMA: For a very fast, but not consistent test based on moment matching + * of a Gamma distribution, as described in [2]. + * Only applicable for CQuadraticTimeMMD. + * + * NAM_PERMUTATION: For permuting the available samples to sample the null distribution. + * + * [1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & + * Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning + * Research, 13, 671-721. + * + * [2]: Arthur Gretton, Bharath K. Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, + * Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu: Optimal kernel choice + * for large-scale two-sample tests. NIPS 2012: 1214-1222. + */ +class CMMD : public CTwoSampleTest +{ +public: + /** Default constructor */ + CMMD(); + + /** + * Convenience constructor that initializes the samples from two distributions. + * + * @param samples_from_p Samples from \f$p\f$ + * @param samples_from_q Samples from \f$q\f$ + */ + CMMD(CFeatures* samples_from_p, CFeatures* samples_from_q); + + /** Destructor */ + virtual ~CMMD(); + + /** + * Method that sets the specific kernel selection strategy based on the + * specific parameters provided. Please see class documentation for details. + * Use this method for every strategy other than KSM_CROSS_VALIDATION. + * + * @param method The kernel selection method as specified in EKernelSelectionMethod. + * @param weighted If true, then a weighted combination of the kernels is used after + * solving an optimization problem. If false, only a single kernel is selected among the + * provided ones. + */ + void set_kernel_selection_strategy(EKernelSelectionMethod method, bool weighted = false); + + /** + * Method that sets the specific kernel selection strategy based on the + * specific parameters provided. Please see class documentation for details. + * Use this method for KSM_CROSS_VALIDATION.
+ * + * @param method The kernel selection method as specified in EKernelSelectionMethod. + * @param num_runs The total number of runs of the cross-validation algorithm. + * @param num_folds The number of folds (k) to be used in k-fold stratified cross-validation. + * @param alpha The threshold to be used when performing the test on the test folds. + */ + void set_kernel_selection_strategy(EKernelSelectionMethod method, index_t num_runs, index_t num_folds, float64_t alpha); + + /** + * Method that adds a kernel instance to be used for kernel selection. Please + * note that the kernels added by this method are NOT set as the main test kernel + * unless the select_kernel() method is executed. + * + * This method is NOT thread safe. Please DO NOT use this method from multiple threads. + * + * @param kernel One of the kernel instances with which the learning algorithm will work. + */ + void add_kernel(CKernel *kernel); + + /** + * Method that selects/learns the kernel based on the defined kernel selection strategy. + * If no explicit kernel selection strategy was set using the set_kernel_selection_strategy() + * method, then a default strategy is used. Please see EKernelSelectionMethod for the + * default strategy. + * + * This method is NOT thread safe. It replaces the internal kernel set by the set_kernel() + * method, if there was any. Please DO NOT use this method from multiple threads. + * + * The learned/selected kernel can be obtained from a subsequent get_kernel() call. + * + * This method expects train-test mode to be turned on at the time of invocation. Please + * see the class documentation of CHypothesisTest. + */ + virtual void select_kernel(); + + /** + * Method that returns the kernel selection strategy wrapper object that will be/ + * was used in the last kernel learning algorithm. Use this method when results of + * intermediate steps taken by the kernel selection algorithms are of interest.
+ * + * @return The internal instance of CKernelSelectionStrategy that holds intermediate + * measures computed at the time of the last kernel selection algorithm invocation. + */ + CKernelSelectionStrategy const * get_kernel_selection_strategy() const; + + /** + * Interface for computing the test-statistic for the hypothesis test. + * + * @return test statistic for the given data/parameters/methods + */ + virtual float64_t compute_statistic() = 0; + + /** + * Interface for computing the samples under the null-hypothesis. + * + * @return vector of all statistics + */ + virtual SGVector sample_null() = 0; + + /** Method that releases the pre-computed kernel that is used in the computation. */ + void cleanup(); + + /** + * Method that sets the number of null-samples used for computing p-value. + * + * @param null_samples Number of null-samples. + */ + void set_num_null_samples(index_t null_samples); + + /** @return Number of null-samples */ + index_t get_num_null_samples() const; + + /** + * Method that sets the type of the estimator for MMD^2 + * + * @param stype The type of the estimator for MMD^2 + */ + void set_statistic_type(EStatisticType stype); + + /** @return The type of the estimator for MMD^2 */ + EStatisticType get_statistic_type() const; + + /** + * Method that sets the approach to be taken while approximating the null-samples. 
+ * + * @param nmethod The null-approximation method + */ + void set_null_approximation_method(ENullApproximationMethod nmethod); + + /** @return The null-approximation method */ + ENullApproximationMethod get_null_approximation_method() const; + + /** @return The name of this class */ + virtual const char* get_name() const; +protected: + virtual float64_t normalize_statistic(float64_t statistic) const = 0; +private: + struct Self; + std::unique_ptr<Self> self; + void init(); +}; + +} +#endif // MMD_H_ diff --git a/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.cpp b/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.cpp new file mode 100644 index 00000000000..67fb0327812 --- /dev/null +++ b/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.cpp @@ -0,0 +1,308 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; +using std::unique_ptr; + +struct CMultiKernelQuadraticTimeMMD::Self +{ + Self(CQuadraticTimeMMD* owner); + void update_pairwise_distance(CDistance *distance); + + CQuadraticTimeMMD *m_owner; + unique_ptr m_pairwise_distance; + EDistanceType m_dtype; + KernelManager m_kernel_mgr; + ComputeMMD statistic_job; + VarianceH1 variance_h1_job; + PermutationMMD permutation_job; +}; + +CMultiKernelQuadraticTimeMMD::Self::Self(CQuadraticTimeMMD *owner) : m_owner(owner), + m_pairwise_distance(nullptr), m_dtype(D_UNKNOWN) +{ +} + +void CMultiKernelQuadraticTimeMMD::Self::update_pairwise_distance(CDistance* distance) +{ + ASSERT(distance); + if (m_dtype==distance->get_distance_type()) + { + ASSERT(m_pairwise_distance!=nullptr); + SG_SINFO("Precomputed distance exists for %s!\n", distance->get_name()); + } + else + { + auto precomputed_distance=m_owner->compute_joint_distance(distance); + m_pairwise_distance=unique_ptr(precomputed_distance); + m_dtype=distance->get_distance_type(); + } +} + 
+CMultiKernelQuadraticTimeMMD::CMultiKernelQuadraticTimeMMD() : CSGObject() +{ + self=unique_ptr(new Self(nullptr)); +} + +CMultiKernelQuadraticTimeMMD::CMultiKernelQuadraticTimeMMD(CQuadraticTimeMMD* owner) : CSGObject() +{ + self=unique_ptr(new Self(owner)); +} + +CMultiKernelQuadraticTimeMMD::~CMultiKernelQuadraticTimeMMD() +{ + cleanup(); +} + +void CMultiKernelQuadraticTimeMMD::add_kernel(CShiftInvariantKernel *kernel) +{ + ASSERT(self->m_owner); + REQUIRE(kernel, "Kernel instance cannot be NULL!\n"); + self->m_kernel_mgr.push_back(kernel); +} + +void CMultiKernelQuadraticTimeMMD::cleanup() +{ + self->m_kernel_mgr.clear(); + invalidate_precomputed_distance(); +} + +void CMultiKernelQuadraticTimeMMD::invalidate_precomputed_distance() +{ + self->m_pairwise_distance=nullptr; + self->m_dtype=D_UNKNOWN; +} + +SGVector CMultiKernelQuadraticTimeMMD::compute_statistic() +{ + ASSERT(self->m_owner); + return statistic(self->m_kernel_mgr); +} + +SGVector CMultiKernelQuadraticTimeMMD::compute_variance_h0() +{ + ASSERT(self->m_owner); + SG_NOTIMPLEMENTED; + return SGVector(); +} + +SGVector CMultiKernelQuadraticTimeMMD::compute_variance_h1() +{ + ASSERT(self->m_owner); + return variance_h1(self->m_kernel_mgr); +} + +SGVector CMultiKernelQuadraticTimeMMD::compute_test_power() +{ + ASSERT(self->m_owner); + return test_power(self->m_kernel_mgr); +} + +SGMatrix CMultiKernelQuadraticTimeMMD::sample_null() +{ + ASSERT(self->m_owner); + return sample_null(self->m_kernel_mgr); +} + +SGVector CMultiKernelQuadraticTimeMMD::compute_p_value() +{ + ASSERT(self->m_owner); + return p_values(self->m_kernel_mgr); +} + +SGVector CMultiKernelQuadraticTimeMMD::perform_test(float64_t alpha) +{ + SGVector pvalues=compute_p_value(); + SGVector rejections(pvalues.size()); + for (auto i=0; i CMultiKernelQuadraticTimeMMD::statistic(const KernelManager& kernel_mgr) +{ + SG_DEBUG("Entering"); + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels (%d) have to be greater than 0!\n", 
kernel_mgr.num_kernels()); + + const auto nx=self->m_owner->get_num_samples_p(); + const auto ny=self->m_owner->get_num_samples_q(); + const auto stype = self->m_owner->get_statistic_type(); + + CDistance* distance=kernel_mgr.get_distance_instance(); + self->update_pairwise_distance(distance); + kernel_mgr.set_precomputed_distance(self->m_pairwise_distance.get()); + SG_UNREF(distance); + + self->statistic_job.m_n_x=nx; + self->statistic_job.m_n_y=ny; + self->statistic_job.m_stype=stype; + SGVector result=self->statistic_job(kernel_mgr); + + kernel_mgr.unset_precomputed_distance(); + + for (auto i=0; im_owner->normalize_statistic(result[i]); + + SG_DEBUG("Leaving"); + return result; +} + +SGVector CMultiKernelQuadraticTimeMMD::variance_h1(const KernelManager& kernel_mgr) +{ + SG_DEBUG("Entering"); + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels (%d) have to be greater than 0!\n", kernel_mgr.num_kernels()); + + const auto nx=self->m_owner->get_num_samples_p(); + const auto ny=self->m_owner->get_num_samples_q(); + + CDistance* distance=kernel_mgr.get_distance_instance(); + self->update_pairwise_distance(distance); + kernel_mgr.set_precomputed_distance(self->m_pairwise_distance.get()); + SG_UNREF(distance); + + self->variance_h1_job.m_n_x=nx; + self->variance_h1_job.m_n_y=ny; + SGVector result=self->variance_h1_job(kernel_mgr); + + kernel_mgr.unset_precomputed_distance(); + + SG_DEBUG("Leaving"); + return result; +} + +SGVector CMultiKernelQuadraticTimeMMD::test_power(const KernelManager& kernel_mgr) +{ + SG_DEBUG("Entering"); + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels (%d) have to be greater than 0!\n", kernel_mgr.num_kernels()); + REQUIRE(self->m_owner->get_statistic_type()==ST_UNBIASED_FULL, "Only possible with UNBIASED_FULL!\n"); + + const auto nx=self->m_owner->get_num_samples_p(); + const auto ny=self->m_owner->get_num_samples_q(); + + CDistance* distance=kernel_mgr.get_distance_instance(); + self->update_pairwise_distance(distance); + 
kernel_mgr.set_precomputed_distance(self->m_pairwise_distance.get()); + SG_UNREF(distance); + + self->variance_h1_job.m_n_x=nx; + self->variance_h1_job.m_n_y=ny; + SGVector result=self->variance_h1_job.test_power(kernel_mgr); + + kernel_mgr.unset_precomputed_distance(); + + SG_DEBUG("Leaving"); + return result; +} + +SGMatrix CMultiKernelQuadraticTimeMMD::sample_null(const KernelManager& kernel_mgr) +{ + SG_DEBUG("Entering"); + REQUIRE(self->m_owner->get_null_approximation_method()==NAM_PERMUTATION, + "Multi-kernel tests requires the H0 approximation method to be PERMUTATION!\n"); + + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels (%d) have to be greater than 0!\n", kernel_mgr.num_kernels()); + + const auto nx=self->m_owner->get_num_samples_p(); + const auto ny=self->m_owner->get_num_samples_q(); + const auto stype = self->m_owner->get_statistic_type(); + const auto num_null_samples = self->m_owner->get_num_null_samples(); + + CDistance* distance=kernel_mgr.get_distance_instance(); + self->update_pairwise_distance(distance); + kernel_mgr.set_precomputed_distance(self->m_pairwise_distance.get()); + SG_UNREF(distance); + + self->permutation_job.m_n_x=nx; + self->permutation_job.m_n_y=ny; + self->permutation_job.m_num_null_samples=num_null_samples; + self->permutation_job.m_stype=stype; + SGMatrix result=self->permutation_job(kernel_mgr); + + kernel_mgr.unset_precomputed_distance(); + + for (size_t i=0; im_owner->normalize_statistic(result.matrix[i]); + + SG_DEBUG("Leaving"); + return result; +} + +SGVector CMultiKernelQuadraticTimeMMD::p_values(const KernelManager& kernel_mgr) +{ + SG_DEBUG("Entering"); + REQUIRE(self->m_owner->get_null_approximation_method()==NAM_PERMUTATION, + "Multi-kernel tests requires the H0 approximation method to be PERMUTATION!\n"); + + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels (%d) have to be greater than 0!\n", kernel_mgr.num_kernels()); + + const auto nx=self->m_owner->get_num_samples_p(); + const auto 
ny=self->m_owner->get_num_samples_q(); + const auto stype = self->m_owner->get_statistic_type(); + const auto num_null_samples = self->m_owner->get_num_null_samples(); + + CDistance* distance=kernel_mgr.get_distance_instance(); + self->update_pairwise_distance(distance); + kernel_mgr.set_precomputed_distance(self->m_pairwise_distance.get()); + SG_UNREF(distance); + + self->permutation_job.m_n_x=nx; + self->permutation_job.m_n_y=ny; + self->permutation_job.m_num_null_samples=num_null_samples; + self->permutation_job.m_stype=stype; + SGVector result=self->permutation_job.p_value(kernel_mgr); + + kernel_mgr.unset_precomputed_distance(); + + SG_DEBUG("Leaving"); + return result; +} + +const char* CMultiKernelQuadraticTimeMMD::get_name() const +{ + return "MultiKernelQuadraticTimeMMD"; +} diff --git a/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.h b/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.h new file mode 100644 index 00000000000..71997025a6f --- /dev/null +++ b/src/shogun/statistical_testing/MultiKernelQuadraticTimeMMD.h @@ -0,0 +1,176 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef MULTI_KERNEL_QUADRATIC_TIME_MMD_H_ +#define MULTI_KERNEL_QUADRATIC_TIME_MMD_H_ + +#include +#include + +namespace shogun +{ + +class CFeatures; +class CQuadraticTimeMMD; +class CShiftInvariantKernel; +template class SGVector; + +namespace internal +{ +class KernelManager; +class MaxMeasure; +class MaxTestPower; +} + +/** + * @brief Class that performs quadratic time MMD test optimized for multiple + * shift-invariant kernels. If the kernels are not shift-invariant, then the + * class CQuadraticTimeMMD should be used multiple times instead of this one. + * + * If the features are updated, then (if any) existing precomputed distance + * instance has to be invalidated by the owner (CQuadraticTimeMMD instance). + * This is already taken care of internally. A separate instance of this class + * should never be created by invoking the constructor. 
One should always + * call the CQuadraticTimeMMD::multikernel() method to get an instance of this + * class. + */ +class CMultiKernelQuadraticTimeMMD : public CSGObject +{ + friend class CQuadraticTimeMMD; + friend class internal::MaxMeasure; + friend class internal::MaxTestPower; +private: + CMultiKernelQuadraticTimeMMD(CQuadraticTimeMMD* owner); +public: + /** + * Default constructor. Should never be invoked by the user. Please use + * CQuadraticTimeMMD::multikernel() to obtain an instance of this class. + */ + CMultiKernelQuadraticTimeMMD(); + + /** Destructor */ + virtual ~CMultiKernelQuadraticTimeMMD(); + + /** + * Method that adds instances of shift-invariant kernels (e.g. CGaussianKernel). + * Invoke multiple times to add the desired number of kernels. All the estimators + * obtained from the computation will be in the same order in which the kernels + * were added. + * + * @param kernel The kernel instance. + */ + void add_kernel(CShiftInvariantKernel *kernel); + + /** + * Method that does internal cleanups (essentially releases memory from the + * internally stored pair-wise distance instance). + */ + void cleanup(); + + /** + * Method that returns normalized estimates of the MMD^2 for all the kernels. + * + * @return A vector of values for normalized estimates of the MMD^2 for all + * the kernels. + */ + SGVector compute_statistic(); + + /** + * Method that returns variance estimates of the unbiased MMD^2 estimator + * for all the kernels under the assumption that the null hypothesis is true. + * + * @return A vector of values for variance estimates of the unbiased MMD^2 + * estimator for all the kernels under the null hypothesis. + */ + SGVector compute_variance_h0(); + + /** + * Method that returns variance estimates of the unbiased MMD^2 estimator + * for all the kernels under the assumption that the alternative hypothesis is true. + * + * @return A vector of values for variance estimates of the unbiased MMD^2 + * estimator for all the kernels under the alternative hypothesis.
+ */ + SGVector compute_variance_h1(); + + /** + * Method that returns proxy measures of the test-power computed as the + * ratio of the unbiased MMD^2 estimator and sqrt of the variance estimate + * of it under alternative. + * + * @return A vector of values for proxy measures of test-power for all kernels. + */ + SGVector compute_test_power(); + + /* + * Method that computes the null-samples for all the kernels, one column per kernel. + * This method uses permutation as the null-approximation technique. + * + * @return Null-samples for all the kernels. + */ + SGMatrix sample_null(); + + /** + * Method that computes the p-values for all the kernels. The API is different + * here than CQuadraticTimeMMD since the test-statistics for the kernels are computed + * internally on the fly. This method uses permutation as the null-approximation + * technique. + * + * @return A vector of p-values for all the kernels. + */ + SGVector compute_p_value(); + + /** + * Method that performs the test and returns whether the null hypothesis was + * accepted or rejected, based on the provided significance level. + * + * @param alpha The significance level of the hypothesis test. Should be between + * 0 and 1. + * @return A vector of values of the test results (true - null hypothesis was + * accepted, false - otherwise) for all the kernels. 
+ */ + SGVector perform_test(float64_t alpha); + + /** @return The name of the class */ + virtual const char* get_name() const; +private: + struct Self; + std::unique_ptr self; + void invalidate_precomputed_distance(); + SGVector statistic(const internal::KernelManager& kernel_mgr); + SGVector variance_h1(const internal::KernelManager& kernel_mgr); + SGVector test_power(const internal::KernelManager& kernel_mgr); + SGMatrix sample_null(const internal::KernelManager& kernel_mgr); + SGVector p_values(const internal::KernelManager& kernel_mgr); +}; + +} +#endif // MULTI_KERNEL_QUADRATIC_TIME_MMD_H_ diff --git a/src/shogun/statistical_testing/OneDistributionTest.cpp b/src/shogun/statistical_testing/OneDistributionTest.cpp new file mode 100644 index 00000000000..d116270dc86 --- /dev/null +++ b/src/shogun/statistical_testing/OneDistributionTest.cpp @@ -0,0 +1,61 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include + +using namespace shogun; +using namespace internal; + +COneDistributionTest::COneDistributionTest() : CHypothesisTest(OneDistributionTest::num_feats) +{ +} + +COneDistributionTest::~COneDistributionTest() +{ +} + +void COneDistributionTest::set_samples(CFeatures* samples) +{ + auto& data_mgr=get_data_mgr(); + data_mgr.samples_at(0)=samples; +} + +CFeatures* COneDistributionTest::get_samples() const +{ + const auto& data_mgr=get_data_mgr(); + return data_mgr.samples_at(0); +} + +void COneDistributionTest::set_num_samples(index_t num_samples) +{ + auto& data_mgr=get_data_mgr(); + data_mgr.num_samples_at(0)=num_samples; +} + +index_t COneDistributionTest::get_num_samples() const +{ + const auto& data_mgr=get_data_mgr(); + return data_mgr.num_samples_at(0); +} + +const char* COneDistributionTest::get_name() const +{ + return "OneDistributionTest"; +} diff --git a/src/shogun/statistical_testing/OneDistributionTest.h b/src/shogun/statistical_testing/OneDistributionTest.h new file mode 100644 index 00000000000..0de9c62c25f --- /dev/null +++ b/src/shogun/statistical_testing/OneDistributionTest.h @@ -0,0 +1,85 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#ifndef ONE_DISTRIBUTION_TEST_H_ +#define ONE_DISTRIBUTION_TEST_H_ + +#include + +namespace shogun +{ + +/** + * @brief Class OneDistributionTest is the base class for statistical + * hypothesis testing with samples from one distribution, \f$\mathbf{P}\f$. + */ +class COneDistributionTest : public CHypothesisTest +{ +public: + /** Default constructor */ + COneDistributionTest(); + + /** Destructor */ + virtual ~COneDistributionTest(); + + /** + * Method that initializes the samples from \f$\mathbf{P}\f$. + * + * @param samples The CFeatures instance representing the samples + * from \f$\mathbf{P}\f$. + */ + void set_samples(CFeatures* samples); + + /** @return The samples from \f$\mathbf{P}\f$. */ + CFeatures* get_samples() const; + + /** + * Method that initializes the number of samples to be drawn from distribution + * \f$\mathbf{P}\f$. Please make sure to call this method if you intend to + * use streaming data generators that generate the samples on the fly. For + * other types of features, the number of samples is set internally from the + * features object itself, therefore this method should not be used. + * + * @param num_samples The number of samples to be drawn + * from \f$\mathbf{P}\f$. + */ + void set_num_samples(index_t num_samples); + + /** @return The number of samples from \f$\mathbf{P}\f$. */ + index_t get_num_samples() const; + + /** + * Interface for computing the test-statistic for the hypothesis test. + * + * @return test statistic for the given data/parameters/methods + */ + virtual float64_t compute_statistic()=0; + + /** + * Interface for computing the samples under the null-hypothesis.
+ * + * @return vector of all statistics + */ + virtual SGVector sample_null()=0; + + /** @return The name of the class */ + virtual const char* get_name() const; +}; + +} +#endif // ONE_DISTRIBUTION_TEST_H_ diff --git a/src/shogun/statistical_testing/QuadraticTimeMMD.cpp b/src/shogun/statistical_testing/QuadraticTimeMMD.cpp new file mode 100644 index 00000000000..7043272f966 --- /dev/null +++ b/src/shogun/statistical_testing/QuadraticTimeMMD.cpp @@ -0,0 +1,635 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; +using std::unique_ptr; + +struct CQuadraticTimeMMD::Self +{ + Self(CQuadraticTimeMMD&); + + void init_statistic_job(); + void init_permutation_job(); + void init_variance_h1_job(); + void init_kernel(); + SGMatrix get_kernel_matrix(); + + SGVector sample_null_spectrum(); + SGVector sample_null_permutation(); + SGVector gamma_fit_null(); + + CQuadraticTimeMMD& owner; + unique_ptr multi_kernel; + + /** + * Whether to precompute the kernel matrix. By default this is true. + * It can be changed by the precompute_kernel_matrix() call. Keep in mind that + * precompute is always true as long as the underlying kernel itself is a + * precomputed kernel. Further updates to this value are ignored unless the + * kernel is changed to a non-precomputed one. + */ + bool precompute; + + /** + * Whether the kernel is initialized with the joint features. Once a kernel is + * initialized, this becomes true. It can then become false only when + * (a) the features are updated, or + * (b) the kernel is updated later, or + * (c) the internally precomputed kernel is removed and the underlying kernel is in use. + * However, for (a), if the underlying kernel itself is a pre-computed one, it + * stays true even when the features are updated. Also, for (b), if the newly + * updated kernel is a pre-computed one, it also stays true. 
+ */ + bool is_kernel_initialized; + + index_t num_eigenvalues; + + ComputeMMD statistic_job; + VarianceH0 variance_h0_job; + VarianceH1 variance_h1_job; + PermutationMMD permutation_job; + + static constexpr bool DEFAULT_PRECOMPUTE = true; + static constexpr index_t DEFAULT_NUM_EIGENVALUES = 10; +}; + +CQuadraticTimeMMD::Self::Self(CQuadraticTimeMMD& mmd) : owner(mmd) +{ + is_kernel_initialized=false; + precompute=DEFAULT_PRECOMPUTE; + num_eigenvalues=DEFAULT_NUM_EIGENVALUES; +} + +void CQuadraticTimeMMD::Self::init_statistic_job() +{ + REQUIRE(owner.get_num_samples_p()>0, + "Number of samples from P (was %d) has to be > 0!\n", owner.get_num_samples_p()); + REQUIRE(owner.get_num_samples_q()>0, + "Number of samples from Q (was %d) has to be > 0!\n", owner.get_num_samples_q()); + + statistic_job.m_n_x=owner.get_num_samples_p(); + statistic_job.m_n_y=owner.get_num_samples_q(); + statistic_job.m_stype=owner.get_statistic_type(); +} + +void CQuadraticTimeMMD::Self::init_variance_h1_job() +{ + REQUIRE(owner.get_num_samples_p()>0, + "Number of samples from P (was %d) has to be > 0!\n", owner.get_num_samples_p()); + REQUIRE(owner.get_num_samples_q()>0, + "Number of samples from Q (was %d) has to be > 0!\n", owner.get_num_samples_q()); + + variance_h1_job.m_n_x=owner.get_num_samples_p(); + variance_h1_job.m_n_y=owner.get_num_samples_q(); +} + +void CQuadraticTimeMMD::Self::init_permutation_job() +{ + REQUIRE(owner.get_num_samples_p()>0, + "Number of samples from P (was %d) has to be > 0!\n", owner.get_num_samples_p()); + REQUIRE(owner.get_num_samples_q()>0, + "Number of samples from Q (was %d) has to be > 0!\n", owner.get_num_samples_q()); + REQUIRE(owner.get_num_null_samples()>0, + "Number of null samples (was %d) has to be > 0!\n", owner.get_num_null_samples()); + + permutation_job.m_n_x=owner.get_num_samples_p(); + permutation_job.m_n_y=owner.get_num_samples_q(); + permutation_job.m_stype=owner.get_statistic_type(); + 
permutation_job.m_num_null_samples=owner.get_num_null_samples(); +} + +void CQuadraticTimeMMD::Self::init_kernel() +{ + ASSERT(owner.get_kernel()); + if (!is_kernel_initialized) + { + ASSERT(owner.get_kernel()->get_kernel_type()!=K_CUSTOM); + auto samples_p_and_q=owner.get_p_and_q(); + + auto kernel=owner.get_kernel(); + kernel->init(samples_p_and_q, samples_p_and_q); + is_kernel_initialized=true; + SG_SINFO("Kernel is initialized with joint features of %d total samples!\n", samples_p_and_q->get_num_vectors()); + } +} + +SGMatrix CQuadraticTimeMMD::Self::get_kernel_matrix() +{ + ASSERT(precompute); + ASSERT(owner.get_kernel()); + ASSERT(is_kernel_initialized); + + if (owner.get_kernel()->get_kernel_type()!=K_CUSTOM) + { + auto kernel=owner.get_kernel(); + owner.get_kernel_mgr().precompute_kernel_at(0); + kernel->remove_lhs_and_rhs(); + } + + ASSERT(owner.get_kernel()->get_kernel_type()==K_CUSTOM); + auto precomputed_kernel=static_cast(owner.get_kernel()); + return precomputed_kernel->get_float32_kernel_matrix(); +} + +CQuadraticTimeMMD::CQuadraticTimeMMD() : CMMD() +{ + init(); +} + +CQuadraticTimeMMD::CQuadraticTimeMMD(CFeatures* samples_from_p, CFeatures* samples_from_q) : CMMD(samples_from_p, samples_from_q) +{ + init(); +} + +void CQuadraticTimeMMD::init() +{ + self=unique_ptr(new Self(*this)); + self->multi_kernel=unique_ptr(new CMultiKernelQuadraticTimeMMD(this)); +} + +CQuadraticTimeMMD::~CQuadraticTimeMMD() +{ + CMMD::cleanup(); +} + +void CQuadraticTimeMMD::set_p(CFeatures* samples_from_p) +{ + if (samples_from_p!=get_p()) + { + CTwoDistributionTest::set_p(samples_from_p); + get_kernel_mgr().restore_kernel_at(0); + self->is_kernel_initialized=false; + self->multi_kernel->invalidate_precomputed_distance(); + + if (get_kernel() && get_kernel()->get_kernel_type()==K_CUSTOM) + { + SG_WARNING("Existing kernel is already precomputed. 
Features provided will be\ + ignored unless the kernel is updated with a non-precomputed one!\n"); + self->is_kernel_initialized=true; + } + } + else + { + SG_INFO("Provided features are the same as the existing one. Ignoring!\n"); + } +} + +void CQuadraticTimeMMD::set_q(CFeatures* samples_from_q) +{ + if (samples_from_q!=get_q()) + { + CTwoDistributionTest::set_q(samples_from_q); + get_kernel_mgr().restore_kernel_at(0); + self->is_kernel_initialized=false; + self->multi_kernel->invalidate_precomputed_distance(); + + if (get_kernel() && get_kernel()->get_kernel_type()==K_CUSTOM) + { + SG_WARNING("Existing kernel is already precomputed. Features provided will be\ + ignored unless the kernel is updated with a non-precomputed one!\n"); + self->is_kernel_initialized=true; + } + } + else + { + SG_INFO("Provided features are the same as the existing one. Ignoring!\n"); + } +} + +CFeatures* CQuadraticTimeMMD::get_p_and_q() +{ + CFeatures* samples_p_and_q=nullptr; + REQUIRE(get_p(), "Samples from P are not set!\n"); + REQUIRE(get_q(), "Samples from Q are not set!\n"); + + DataManager& data_mgr=get_data_mgr(); + data_mgr.start(); + auto samples=data_mgr.next(); + if (!samples.empty()) + { + CFeatures *samples_p=samples[0][0].get(); + CFeatures *samples_q=samples[1][0].get(); + samples_p_and_q=FeaturesUtil::create_merged_copy(samples_p, samples_q); + samples.clear(); + } + else + { + SG_SERROR("Could not fetch samples!\n"); + } + data_mgr.end(); + return samples_p_and_q; +} + +void CQuadraticTimeMMD::set_kernel(CKernel* kernel) +{ + if (kernel!=get_kernel()) + { + // removing any pre-computed kernel is done in the base already + CTwoSampleTest::set_kernel(kernel); + self->is_kernel_initialized=false; + + if (kernel->get_kernel_type()==K_CUSTOM) + { + SG_INFO("Setting a precomputed kernel. Features provided will be ignored!\n"); + self->is_kernel_initialized=true; + } + } + else + { + SG_INFO("Provided kernel is the same as the existing one. 
Ignoring!\n"); + } +} + +void CQuadraticTimeMMD::select_kernel() +{ + CMMD::select_kernel(); + self->is_kernel_initialized=false; + + ASSERT(get_kernel()); + if (get_kernel()->get_kernel_type()==K_CUSTOM) + { + SG_WARNING("Selected kernel is already precomputed. Features provided will be\ + ignored unless the kernel is updated with a non-precomputed one!\n"); + self->is_kernel_initialized=true; + } +} + +float64_t CQuadraticTimeMMD::normalize_statistic(float64_t statistic) const +{ + const index_t Nx=get_num_samples_p(); + const index_t Ny=get_num_samples_q(); + return Nx*Ny*statistic/(Nx+Ny); +} + + +float64_t CQuadraticTimeMMD::compute_statistic() +{ + SG_DEBUG("Entering\n"); + REQUIRE(get_kernel(), "Kernel is not set!\n"); + + self->init_statistic_job(); + self->init_kernel(); + + float64_t statistic=0; + if (self->precompute) + { + SGMatrix kernel_matrix=self->get_kernel_matrix(); + statistic=self->statistic_job(kernel_matrix); + } + else + { + auto kernel=get_kernel(); + if (kernel->get_kernel_type()==K_CUSTOM) + SG_INFO("Precompute is turned off, but provided kernel is already precomputed!\n"); + auto kernel_functor=internal::Kernel(kernel); + statistic=self->statistic_job(kernel_functor); + } + + statistic=normalize_statistic(statistic); + + SG_DEBUG("Leaving\n"); + return statistic; +} + +SGVector CQuadraticTimeMMD::Self::sample_null_permutation() +{ + SG_SDEBUG("Entering\n"); + REQUIRE(owner.get_kernel(), "Kernel is not set!\n"); + + init_permutation_job(); + init_kernel(); + + SGVector result; + if (precompute) + { + SGMatrix kernel_matrix=get_kernel_matrix(); + result=permutation_job(kernel_matrix); + } + else + { + auto kernel=owner.get_kernel(); + if (kernel->get_kernel_type()==K_CUSTOM) + SG_SINFO("Precompute is turned off, but provided kernel is already precomputed!\n"); + auto kernel_functor=internal::Kernel(kernel); + result=permutation_job(kernel_functor); + } + + SGVector null_samples(result.vlen); + for (auto i=0; i 
CQuadraticTimeMMD::Self::sample_null_spectrum() +{ + SG_SDEBUG("Entering\n"); + REQUIRE(owner.get_kernel(), "Kernel is not set!\n"); + REQUIRE(precompute, "MMD2_SPECTRUM is not possible without precomputing the kernel matrix!\n"); + + index_t m=owner.get_num_samples_p(); + index_t n=owner.get_num_samples_q(); + + REQUIRE(num_eigenvalues>0 && num_eigenvalues kernel_matrix=get_kernel_matrix(); + SGMatrix K(kernel_matrix.num_rows, kernel_matrix.num_cols); + std::copy(kernel_matrix.data(), kernel_matrix.data()+kernel_matrix.size(), K.data()); + + /* center matrix K=H*K*H */ + K.center(); + + /* compute eigenvalues and select num_eigenvalues largest ones */ + Eigen::Map c_kernel_matrix(K.matrix, K.num_rows, K.num_cols); + Eigen::SelfAdjointEigenSolver eigen_solver(c_kernel_matrix); + REQUIRE(eigen_solver.info()==Eigen::Success, "Eigendecomposition failed!\n"); + index_t max_num_eigenvalues=eigen_solver.eigenvalues().rows(); + + SGVector null_samples(owner.get_num_null_samples()); + + /* finally, sample from null distribution */ + for (auto i=0; i CQuadraticTimeMMD::Self::gamma_fit_null() +{ + SG_SDEBUG("Entering\n"); + + REQUIRE(owner.get_kernel(), "Kernel is not set!\n"); + REQUIRE(precompute, "MMD2_GAMMA is not possible without precomputing the kernel matrix!\n"); + REQUIRE(owner.get_statistic_type()==ST_BIASED_FULL, "Provided statistic has to be BIASED!\n"); + + index_t m=owner.get_num_samples_p(); + index_t n=owner.get_num_samples_q(); + REQUIRE(m==n, "Number of samples from p (%d) and q (%d) must be equal.\n", m, n); + + SGVector result(2); + std::fill(result.vector, result.vector+result.vlen, 0); + + init_kernel(); + + /* imaginary matrix K=[K KL; KL' L] (MATLAB notation) + * K is matrix for XX, L is matrix for YY, KL is XY, LK is YX + * works since X and Y are concatenated here */ + SGMatrix kernel_matrix=get_kernel_matrix(); + + /* compute mean under H0 of MMD, which is + * meanMMD =2/m * ( 1 - 1/m*sum(diag(KL)) ); + * in MATLAB. 
+ * Remove diagonals on the fly */ + float64_t mean_mmd=0; + for (index_t i=0; iprecompute, + "Computing variance estimate is not possible without precomputing the kernel matrix!\n"); + + self->init_kernel(); + SGMatrix kernel_matrix=self->get_kernel_matrix(); + return self->variance_h0_job(kernel_matrix); +} + +float64_t CQuadraticTimeMMD::compute_variance_h1() +{ + REQUIRE(get_kernel(), "Kernel is not set!\n"); + self->init_kernel(); + self->init_variance_h1_job(); + float64_t variance_estimate=0; + if (self->precompute) + { + SGMatrix kernel_matrix=self->get_kernel_matrix(); + variance_estimate=self->variance_h1_job(kernel_matrix); + } + else + { + auto kernel=get_kernel(); + if (kernel->get_kernel_type()==K_CUSTOM) + SG_INFO("Precompute is turned off, but provided kernel is already precomputed!\n"); + auto kernel_functor=internal::Kernel(kernel); + variance_estimate=self->variance_h1_job(kernel_functor); + } + return variance_estimate; +} + +float64_t CQuadraticTimeMMD::compute_p_value(float64_t statistic) +{ + REQUIRE(get_kernel(), "Kernel is not set!\n"); + float64_t result=0; + switch (get_null_approximation_method()) + { + case NAM_MMD2_GAMMA: + { + SGVector params=self->gamma_fit_null(); + result=CStatistics::gamma_cdf(statistic, params[0], params[1]); + break; + } + default: + result=CHypothesisTest::compute_p_value(statistic); + break; + } + return result; +} + +float64_t CQuadraticTimeMMD::compute_threshold(float64_t alpha) +{ + REQUIRE(get_kernel(), "Kernel is not set!\n"); + float64_t result=0; + switch (get_null_approximation_method()) + { + case NAM_MMD2_GAMMA: + { + SGVector params=self->gamma_fit_null(); + result=CStatistics::gamma_inverse_cdf(alpha, params[0], params[1]); + break; + } + default: + result=CHypothesisTest::compute_threshold(alpha); + break; + } + return result; +} + +SGVector CQuadraticTimeMMD::sample_null() +{ + REQUIRE(get_kernel(), "Kernel is not set!\n"); + SGVector null_samples; + switch (get_null_approximation_method()) + { + 
case NAM_MMD2_SPECTRUM: + null_samples=self->sample_null_spectrum(); + break; + case NAM_PERMUTATION: + null_samples=self->sample_null_permutation(); + break; + default: break; + } + return null_samples; +} + +CMultiKernelQuadraticTimeMMD* CQuadraticTimeMMD::multikernel() +{ + return self->multi_kernel.get(); +} + +void CQuadraticTimeMMD::spectrum_set_num_eigenvalues(index_t num_eigenvalues) +{ + self->num_eigenvalues=num_eigenvalues; +} + +index_t CQuadraticTimeMMD::spectrum_get_num_eigenvalues() const +{ + return self->num_eigenvalues; +} + +void CQuadraticTimeMMD::precompute_kernel_matrix(bool precompute) +{ + if (self->precompute && !precompute) + { + if (get_kernel()) + { + get_kernel_mgr().restore_kernel_at(0); + self->is_kernel_initialized=false; + if (get_kernel()->get_kernel_type()==K_CUSTOM) + { + SG_WARNING("The existing kernel itself is a precomputed kernel!\n"); + precompute=true; + self->is_kernel_initialized=true; + } + } + } + self->precompute=precompute; +} + +void CQuadraticTimeMMD::save_permutation_inds(bool save_inds) +{ + self->permutation_job.m_save_inds=save_inds; +} + +SGMatrix CQuadraticTimeMMD::get_permutation_inds() const +{ + return self->permutation_job.m_all_inds; +} + +const char* CQuadraticTimeMMD::get_name() const +{ + return "QuadraticTimeMMD"; +} diff --git a/src/shogun/statistical_testing/QuadraticTimeMMD.h b/src/shogun/statistical_testing/QuadraticTimeMMD.h new file mode 100644 index 00000000000..1f64cbc916a --- /dev/null +++ b/src/shogun/statistical_testing/QuadraticTimeMMD.h @@ -0,0 +1,283 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef QUADRATIC_TIME_MMD_H_ +#define QUADRATIC_TIME_MMD_H_ + +#include +#include + +namespace shogun +{ + +class CMultiKernelQuadraticTimeMMD; +template class SGVector; + +/** + * @brief This class implements the quadratic time Maximum Mean Discrepancy (MMD) as + * described in [1]. 
+ * The MMD is the distance between two probability distributions \f$p\f$ and \f$q\f$ + * in an RKHS, which we denote by + * \f[ + * \hat{\eta_k}=\text{MMD}[\mathcal{F},p,q]^2=\textbf{E}_{x,x'} + * \left[ k(x,x')\right]-2\textbf{E}_{x,y}\left[ k(x,y)\right] + * +\textbf{E}_{y,y'}\left[ k(y,y')\right]=||\mu_p - \mu_q||^2_\mathcal{F} + * \f] + * + * Estimating the variance of the asymptotic distribution of the statistic under + * the null and alternative hypotheses can be done using the compute_variance_h0() and + * compute_variance_h1() methods. + * + * Note that all these operations can be done for multiple kernels + * at once as well. To use this functionality, use the multikernel() method to + * obtain a CMultiKernelQuadraticTimeMMD instance and then call methods on that. + * + * If you only have a pre-computed kernel matrix rather than the data itself, + * pass a CCustomKernel via set_kernel() and initialize the features as + * CDummyFeatures. Everything else will work as usual. + * + * To make the computation faster, this class by default pre-computes the kernel + * and stores the Gram matrix using merged samples from p and q. It essentially + * keeps a backup of the old kernel and uses the pre-computed one instead for as + * long as the present kernel is valid. Therefore, after a computation phase + * is executed, upon calling get_kernel() we will obtain the pre-computed + * kernel matrix as a CCustomKernel object. However, if subsequently the + * features are updated or the underlying kernel itself is updated, it discards + * the pre-computed kernel matrix (frees memory) and pulls the old kernel from + * backup (or simply replaces it if a new kernel is provided) and then + * pre-computes that in the next run. + * + * It is possible to turn off the above feature by calling + * precompute_kernel_matrix(false). However, + * this will degrade the performance of the algorithms, since they are optimized + * for pre-computed kernel matrices. 
Therefore, this should only be turned off + * if the storage of the kernel is a major concern. Please note that only + * the lower triangular part of the Gram matrix is stored, in order to exploit + * the symmetry. + * + * Since the methods modify the object's state, using the methods of this + * class from multiple threads may result in incorrect results or undefined behavior. + * + * NOTE: \f$n_x\f$ and \f$n_y\f$ are represented by \f$m\f$ and \f$n\f$, + * respectively, in the implementation. + * + * [1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. (2012). + * A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671-721. + * + * [2]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). + * A fast, consistent kernel two-sample test. + */ +class CQuadraticTimeMMD : public CMMD +{ + friend class CMultiKernelQuadraticTimeMMD; + +public: + /** Default constructor */ + CQuadraticTimeMMD(); + + /** + * Convenience constructor. Initializes the features representing samples + * from both distributions. + * + * @param samples_from_p Samples from p. + * @param samples_from_q Samples from q. + */ + CQuadraticTimeMMD(CFeatures* samples_from_p, CFeatures* samples_from_q); + + /** Destructor */ + virtual ~CQuadraticTimeMMD(); + + /** + * Method that initializes/replaces samples from p. It will invalidate + * any existing pre-computed kernel from a previous run. However, if + * the underlying kernel is already set by this point and is itself an instance of + * CCustomKernel, the supplied features will be ignored. + * + * @param samples_from_p Samples from p. + */ + virtual void set_p(CFeatures* samples_from_p); + + /** + * Method that initializes/replaces samples from q. It will invalidate + * any existing pre-computed kernel from a previous run. However, if + * the underlying kernel is already set by this point and is itself an instance of + * CCustomKernel, the supplied features will be ignored. 
+ * + * @param samples_from_q Samples from q. + */ + virtual void set_q(CFeatures* samples_from_q); + + /** + * Method that creates a merged copy of the CFeatures instances, + * concatenating the samples from p and q. This method does not + * cache the merged copy from a previous call, so calling this method will + * create a new instance every time. + * + * @return The merged samples. + */ + CFeatures* get_p_and_q(); + + /** + * Method that sets the kernel instance to be used. If a CCustomKernel is + * set, then the features passed are effectively ignored. Therefore, + * if this is the intended behavior, simply pass two instances of + * CDummyFeatures (since the features cannot be left null as of now). + * + * If a pre-computed instance already exists from previous runs, this will + * invalidate that one and free memory. + * + * @param kernel The kernel instance. + */ + virtual void set_kernel(CKernel* kernel); + + /** + * Method that learns/selects the kernel from a set of provided kernel + * instances added via the add_kernel() method. Upon selection, it + * internally replaces the kernel instance, if any, that was already + * present. + * + * Please make sure to enable the train-test mode before using this method. + */ + virtual void select_kernel(); + + /** + * Method that computes the estimator of MMD^2 (biased/unbiased/incomplete) + * as set via the set_statistic_type() method. Default is unbiased. + * + * @return A normalized value of the MMD^2 estimator. + */ + virtual float64_t compute_statistic(); + + /** + * Method that returns a number of null-samples, based on the null approximation + * method that was set using set_null_approximation_method(). Default is permutation. + * + * @return Normalized values of the MMD^2 estimates under the null hypothesis. + */ + virtual SGVector sample_null(); + + /** + * Method that computes the p-value from the provided statistic. 
+ * + * @param statistic The test statistic + * @return The p-value computed using the null-approximation method specified. + */ + virtual float64_t compute_p_value(float64_t statistic); + + /** + * Method that computes the threshold from the provided significance level (alpha). + * + * @param alpha The significance level (value should be between 0 and 1) + * @return The threshold computed using the null-approximation method specified. + */ + virtual float64_t compute_threshold(float64_t alpha); + + /** + * Method that computes an estimate of the variance of the unbiased MMD^2 estimator + * under the assumption that the null hypothesis is true. + * + * @return The variance estimate of the unbiased MMD^2 estimator under the null. + */ + float64_t compute_variance_h0(); + + /** + * Method that computes an estimate of the variance of the unbiased MMD^2 estimator + * under the assumption that the alternative hypothesis is true. + * + * @return The variance estimate of the unbiased MMD^2 estimator under the alternative. + */ + float64_t compute_variance_h1(); + + /** + * Method that returns the internal instance of CMultiKernelQuadraticTimeMMD, which + * provides a similar API to this class to compute the estimates for multiple kernels + * all at once. This internal instance shares the same set of samples with this one, + * but the kernels have to be added separately using the multikernel()->add_kernel() method. + * + * @return An internal instance of CMultiKernelQuadraticTimeMMD. + */ + CMultiKernelQuadraticTimeMMD* multikernel(); + + /** + * Method that sets the number of eigenvalues to be used when spectral estimation + * of the null samples is used. It is ignored if any other null-approximation method + * is in use. + * + * @param num_eigenvalues The number of eigenvalues to be used from the eigenspectrum + * of the Gram matrix. 
+ */ + void spectrum_set_num_eigenvalues(index_t num_eigenvalues); + + /** @return The number of eigenvalues in use for the spectral test */ + index_t spectrum_get_num_eigenvalues() const; + + /** + * Use this method when pre-computation of the kernel matrix is NOT desired. By default + * this class precomputes the Gram matrix. Please note that performance will + * suffer if this option is turned off. + * + * @param precompute Flag indicating whether to pre-compute the kernel matrix internally. + * If false, the kernel matrix is NOT pre-computed, otherwise it is. Default is true. + */ + void precompute_kernel_matrix(bool precompute); + + /** + * Method that controls whether the permutation indices used while sampling from the + * null distribution are saved, in case the permutation approach is used. The indices will be + * available only after a successful run of the permutation test. By default, the indices + * are never saved. + * + * @param save_inds Whether to save the permutation indices or not. If true, the indices + * are saved, otherwise not. + */ + void save_permutation_inds(bool save_inds); + + /** + * Method that returns the permutation indices, if that option was turned on by calling + * save_permutation_inds(true). + * + * @return The permutation indices, one column per null-sample. 
+ */ + SGMatrix get_permutation_inds() const; + + /** @return The name of the class */ + virtual const char* get_name() const; + +protected: + virtual float64_t normalize_statistic(float64_t statistic) const; + +private: + struct Self; + std::unique_ptr self; + void init(); +}; + +} +#endif // QUADRATIC_TIME_MMD_H_ diff --git a/src/shogun/statistical_testing/StreamingMMD.cpp b/src/shogun/statistical_testing/StreamingMMD.cpp new file mode 100644 index 00000000000..a16b6dc9d3d --- /dev/null +++ b/src/shogun/statistical_testing/StreamingMMD.cpp @@ -0,0 +1,499 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +struct CStreamingMMD::Self +{ + Self(CStreamingMMD& cmmd); + + void create_statistic_job(); + void create_variance_job(); + void create_computation_jobs(); + + void merge_samples(NextSamples&, std::vector&) const; + void compute_kernel(ComputationManager&, std::vector&, CKernel*) const; + void compute_jobs(ComputationManager&) const; + + std::pair compute_statistic_variance(); + std::pair, SGMatrix> compute_statistic_and_Q(const KernelManager&); + SGVector sample_null(); + + CStreamingMMD& owner; + + bool use_gpu; + index_t num_null_samples; + + EStatisticType statistic_type; + EVarianceEstimationMethod variance_estimation_method; + ENullApproximationMethod null_approximation_method; + + std::function&)> statistic_job; + std::function&)> permutation_job; + std::function&)> variance_job; +}; + +CStreamingMMD::Self::Self(CStreamingMMD& cmmd) : owner(cmmd), + use_gpu(false), num_null_samples(250), + statistic_type(ST_UNBIASED_FULL), + variance_estimation_method(VEM_DIRECT), + null_approximation_method(NAM_PERMUTATION), + statistic_job(nullptr), variance_job(nullptr) +{ +} + +void CStreamingMMD::Self::create_computation_jobs() +{ + create_statistic_job(); + create_variance_job(); +} + +void CStreamingMMD::Self::create_statistic_job() +{ + const DataManager& data_mgr=owner.get_data_mgr(); + + auto Bx=data_mgr.blocksize_at(0); + auto By=data_mgr.blocksize_at(1); + + REQUIRE(Bx>0, "Blocksize for samples from P cannot be 0!\n"); + REQUIRE(By>0, "Blocksize for samples from 
Q cannot be 0!\n"); + + auto mmd=mmd::ComputeMMD(); + mmd.m_n_x=Bx; + mmd.m_n_y=By; + mmd.m_stype=statistic_type; + + statistic_job=mmd; + permutation_job=mmd::WithinBlockPermutation(Bx, By, statistic_type); +} + +void CStreamingMMD::Self::create_variance_job() +{ + switch (variance_estimation_method) + { + case VEM_DIRECT: + variance_job=owner.get_direct_estimation_method(); + break; + case VEM_PERMUTATION: + variance_job=permutation_job; + break; + default : break; + }; +} + +void CStreamingMMD::Self::merge_samples(NextSamples& next_burst, std::vector& blocks) const +{ + blocks.resize(next_burst.num_blocks()); +#pragma omp parallel for + for (size_t i=0; i& blocks, CKernel* kernel) const +{ + REQUIRE(kernel->get_kernel_type()!=K_CUSTOM, "Underlying kernel cannot be custom!\n"); + cm.num_data(blocks.size()); +#pragma omp parallel for + for (size_t i=0; i(static_cast(kernel->clone())); + kernel_clone->init(blocks[i], blocks[i]); + cm.data(i)=kernel_clone->get_kernel_matrix(); + kernel_clone->remove_lhs_and_rhs(); + } + catch (ShogunException e) + { + SG_SERROR("%s, Try using less number of blocks per burst!\n", e.get_exception_string()); + } + } +} + +void CStreamingMMD::Self::compute_jobs(ComputationManager& cm) const +{ + if (use_gpu) + cm.use_gpu().compute_data_parallel_jobs(); + else + cm.use_cpu().compute_data_parallel_jobs(); +} + +std::pair CStreamingMMD::Self::compute_statistic_variance() +{ + const KernelManager& kernel_mgr=owner.get_kernel_mgr(); + auto kernel=kernel_mgr.kernel_at(0); + REQUIRE(kernel != nullptr, "Kernel is not set!\n"); + + float64_t statistic=0; + float64_t permuted_samples_statistic=0; + float64_t variance=0; + index_t statistic_term_counter=1; + index_t variance_term_counter=1; + + DataManager& data_mgr=owner.get_data_mgr(); + data_mgr.start(); + auto next_burst=data_mgr.next(); + if (!next_burst.empty()) + { + ComputationManager cm; + create_computation_jobs(); + cm.enqueue_job(statistic_job); + cm.enqueue_job(variance_job); + + 
std::vector blocks; + + while (!next_burst.empty()) + { + merge_samples(next_burst, blocks); + compute_kernel(cm, blocks, kernel); + blocks.resize(0); + compute_jobs(cm); + + auto mmds=cm.result(0); + auto vars=cm.result(1); + + for (size_t i=0; i, SGMatrix > CStreamingMMD::Self::compute_statistic_and_Q(const KernelManager& kernel_selection_mgr) +{ +// const size_t num_kernels=0; +// SGVector statistic(num_kernels); +// SGMatrix Q(num_kernels, num_kernels); +// return std::make_pair(statistic, Q); + REQUIRE(kernel_selection_mgr.num_kernels()>0, "No kernels specified for kernel learning! " + "Please add kernels using add_kernel() method!\n"); + + const size_t num_kernels=kernel_selection_mgr.num_kernels(); + SGVector statistic(num_kernels); + SGMatrix Q(num_kernels, num_kernels); + + std::fill(statistic.data(), statistic.data()+statistic.size(), 0); + std::fill(Q.data(), Q.data()+Q.size(), 0); + + std::vector term_counters_statistic(num_kernels, 1); + SGMatrix term_counters_Q(num_kernels, num_kernels); + std::fill(term_counters_Q.data(), term_counters_Q.data()+term_counters_Q.size(), 1); + + DataManager& data_mgr=owner.get_data_mgr(); + ComputationManager cm; + create_computation_jobs(); + cm.enqueue_job(statistic_job); + + data_mgr.start(); + auto next_burst=data_mgr.next(); + std::vector blocks; + std::vector > mmds(num_kernels); + while (!next_burst.empty()) + { + const size_t num_blocks=next_burst.num_blocks(); + REQUIRE(num_blocks%2==0, + "The number of blocks per burst (%d this burst) has to be even!\n", + num_blocks); + merge_samples(next_burst, blocks); + std::for_each(blocks.begin(), blocks.end(), [](CFeatures* ptr) { SG_REF(ptr); }); + for (size_t k=0; k CStreamingMMD::Self::sample_null() +{ + const KernelManager& kernel_mgr=owner.get_kernel_mgr(); + auto kernel=kernel_mgr.kernel_at(0); + REQUIRE(kernel != nullptr, "Kernel is not set!\n"); + + SGVector statistic(num_null_samples); + std::vector term_counters(num_null_samples); + + 
std::fill(statistic.vector, statistic.vector+statistic.vlen, 0); + std::fill(term_counters.data(), term_counters.data()+term_counters.size(), 1); + + DataManager& data_mgr=owner.get_data_mgr(); + ComputationManager cm; + + create_statistic_job(); + cm.enqueue_job(permutation_job); + + std::vector blocks; + + data_mgr.start(); + auto next_burst=data_mgr.next(); + + while (!next_burst.empty()) + { + merge_samples(next_burst, blocks); + compute_kernel(cm, blocks, kernel); + blocks.resize(0); + + for (auto j=0; j(new Self(*this)); +} + +CStreamingMMD::~CStreamingMMD() +{ +} + +float64_t CStreamingMMD::compute_statistic() +{ + return self->compute_statistic_variance().first; +} + +float64_t CStreamingMMD::compute_variance() +{ + return self->compute_statistic_variance().second; +} + +SGVector CStreamingMMD::compute_multiple() +{ + return self->compute_statistic_and_Q(get_kernel_selection_strategy()->get_kernel_mgr()).first; +} + +std::pair CStreamingMMD::compute_statistic_variance() +{ + return self->compute_statistic_variance(); +} + +std::pair, SGMatrix > CStreamingMMD::compute_statistic_and_Q(const KernelManager& kernel_selection_mgr) +{ + return self->compute_statistic_and_Q(kernel_selection_mgr); +} + +SGVector CStreamingMMD::sample_null() +{ + return self->sample_null(); +} + +void CStreamingMMD::set_num_null_samples(index_t null_samples) +{ + self->num_null_samples=null_samples; +} + +const index_t CStreamingMMD::get_num_null_samples() const +{ + return self->num_null_samples; +} + +void CStreamingMMD::use_gpu(bool gpu) +{ + self->use_gpu=gpu; +} + +bool CStreamingMMD::use_gpu() const +{ + return self->use_gpu; +} + +void CStreamingMMD::cleanup() +{ + for (size_t i=0; istatistic_type=stype; +} + +const EStatisticType CStreamingMMD::get_statistic_type() const +{ + return self->statistic_type; +} + +void CStreamingMMD::set_variance_estimation_method(EVarianceEstimationMethod vmethod) +{ + // TODO overload this +/* if (std::is_same::value && vmethod == 
VEM_PERMUTATION) + { + std::cerr << "cannot use permutation method for quadratic time MMD" << std::endl; + }*/ + self->variance_estimation_method=vmethod; +} + +const EVarianceEstimationMethod CStreamingMMD::get_variance_estimation_method() const +{ + return self->variance_estimation_method; +} + +void CStreamingMMD::set_null_approximation_method(ENullApproximationMethod nmethod) +{ + // TODO overload this +/* if (std::is_same::value && nmethod == NAM_MMD1_GAUSSIAN) + { + std::cerr << "cannot use gaussian method for quadratic time MMD" << std::endl; + } + else if ((std::is_same::value || std::is_same::value) && + (nmethod == NAM_MMD2_SPECTRUM || nmethod == NAM_MMD2_GAMMA)) + { + std::cerr << "cannot use spectrum/gamma method for B-test/linear time MMD" << std::endl; + }*/ + self->null_approximation_method=nmethod; +} + +const ENullApproximationMethod CStreamingMMD::get_null_approximation_method() const +{ + return self->null_approximation_method; +} + +const char* CStreamingMMD::get_name() const +{ + return "StreamingMMD"; +} diff --git a/src/shogun/statistical_testing/StreamingMMD.h b/src/shogun/statistical_testing/StreamingMMD.h new file mode 100644 index 00000000000..5336b6b37e1 --- /dev/null +++ b/src/shogun/statistical_testing/StreamingMMD.h @@ -0,0 +1,107 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef STREAMING_MMD_H_ +#define STREAMING_MMD_H_ + +#include +#include +#include +#include +#include + +namespace shogun +{ + +/** forward declarations */ +class CKernel; +class CKernelSelectionStrategy; +template class SGVector; +template class SGMatrix; + +namespace internal +{ + +class KernelManager; +class MaxTestPower; +class MaxCrossValidation; +class WeightedMaxTestPower; + +} + +class CStreamingMMD : public CMMD +{ + friend class internal::MaxTestPower; + friend class internal::WeightedMaxTestPower; + friend class internal::MaxCrossValidation; +public: + typedef std::function)> operation; + + CStreamingMMD(); + virtual ~CStreamingMMD(); + + virtual float64_t compute_statistic(); + virtual float64_t compute_variance(); + + virtual SGVector compute_multiple(); + + virtual SGVector sample_null(); + + void use_gpu(bool gpu); + void cleanup(); + + void set_statistic_type(EStatisticType stype); + const EStatisticType get_statistic_type() const; + + void set_variance_estimation_method(EVarianceEstimationMethod vmethod); + const EVarianceEstimationMethod get_variance_estimation_method() const; + + void set_num_null_samples(index_t null_samples); + const index_t get_num_null_samples() const; + + void set_null_approximation_method(ENullApproximationMethod nmethod); + const ENullApproximationMethod get_null_approximation_method() const; + + virtual const char* get_name() const; +protected: + virtual const operation get_direct_estimation_method() const=0; + virtual float64_t normalize_statistic(float64_t statistic) const=0; + virtual const float64_t normalize_variance(float64_t variance) const=0; + bool use_gpu() const; + std::shared_ptr get_strategy(); +private: + struct Self; + std::unique_ptr self; + virtual std::pair compute_statistic_variance(); + std::pair, SGMatrix > compute_statistic_and_Q(const internal::KernelManager&); +}; + +} +#endif // STREAMING_MMD_H_ diff --git a/src/shogun/preprocessor/BAHSIC.cpp b/src/shogun/statistical_testing/TestEnums.h 
similarity index 70% rename from src/shogun/preprocessor/BAHSIC.cpp rename to src/shogun/statistical_testing/TestEnums.h index 47cdc9c6a4e..3c912f59bca 100644 --- a/src/shogun/preprocessor/BAHSIC.cpp +++ b/src/shogun/statistical_testing/TestEnums.h @@ -1,6 +1,7 @@ /* * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De * All rights reserved. * * Redistribution and use in source and binary forms, with or without @@ -28,35 +29,43 @@ * either expressed or implied, of the Shogun Development Team. */ -#include -#include +#ifndef TEST_ENUMS_H_ +#define TEST_ENUMS_H_ -using namespace shogun; +#include -CBAHSIC::CBAHSIC() : CKernelDependenceMaximization() +namespace shogun { - initialize_parameters(); -} -void CBAHSIC::initialize_parameters() +enum EStatisticType { - m_estimator=new CHSIC(); - SG_REF(m_estimator); - m_algorithm=BACKWARD_ELIMINATION; -} + ST_UNBIASED_FULL, + ST_UNBIASED_INCOMPLETE, + ST_BIASED_FULL +}; -CBAHSIC::~CBAHSIC() +enum EVarianceEstimationMethod { - // estimator is SG_UNREF'ed in base CDependenceMaximization destructor -} + VEM_DIRECT, + VEM_PERMUTATION +}; -void CBAHSIC::set_algorithm(EFeatureSelectionAlgorithm algorithm) +enum ENullApproximationMethod { - SG_INFO("Algorithm is set to BACKWARD_ELIMINATION for %s and therefore " - "cannot be set externally!\n", get_name()); -} + NAM_PERMUTATION, + NAM_MMD1_GAUSSIAN, + NAM_MMD2_SPECTRUM, + NAM_MMD2_GAMMA +}; -EPreprocessorType CBAHSIC::get_type() const +enum EKernelSelectionMethod { - return P_BAHSIC; + KSM_MEDIAN_HEURISTIC, + KSM_MAXIMIZE_MMD, + KSM_MAXIMIZE_POWER, + KSM_CROSS_VALIDATION, + KSM_AUTO = KSM_MAXIMIZE_POWER +}; + } +#endif // TEST_ENUMS_H_ diff --git a/src/shogun/statistical_testing/TwoDistributionTest.cpp b/src/shogun/statistical_testing/TwoDistributionTest.cpp new file mode 100644 index 00000000000..68698c8bb3e --- /dev/null +++ 
b/src/shogun/statistical_testing/TwoDistributionTest.cpp @@ -0,0 +1,164 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +CTwoDistributionTest::CTwoDistributionTest() : CHypothesisTest(TwoDistributionTest::num_feats) +{ +} + +CTwoDistributionTest::~CTwoDistributionTest() +{ +} + +void CTwoDistributionTest::set_p(CFeatures* samples_from_p) +{ + REQUIRE(samples_from_p, "Samples from P cannot be NULL!\n"); + auto& dm=get_data_mgr(); + dm.samples_at(0)=samples_from_p; +} + +CFeatures* CTwoDistributionTest::get_p() const +{ + const auto& dm=get_data_mgr(); + return dm.samples_at(0); +} + +void CTwoDistributionTest::set_q(CFeatures* samples_from_q) +{ + REQUIRE(samples_from_q, "Samples from Q cannot be NULL!\n"); + auto& dm=get_data_mgr(); + dm.samples_at(1)=samples_from_q; +} + +CFeatures* CTwoDistributionTest::get_q() const +{ + const auto& dm=get_data_mgr(); + return dm.samples_at(1); +} + +void CTwoDistributionTest::set_num_samples_p(index_t num_samples_from_p) +{ + auto& dm=get_data_mgr(); + dm.num_samples_at(0)=num_samples_from_p; +} + +const index_t CTwoDistributionTest::get_num_samples_p() const +{ + const auto& dm=get_data_mgr(); + return dm.num_samples_at(0); +} + +void CTwoDistributionTest::set_num_samples_q(index_t num_samples_from_q) +{ + auto& dm=get_data_mgr(); + dm.num_samples_at(1)=num_samples_from_q; +} + +const index_t CTwoDistributionTest::get_num_samples_q() const +{ + const auto& dm=get_data_mgr(); + return dm.num_samples_at(1); +} + +CCustomDistance* CTwoDistributionTest::compute_distance(CDistance* distance) +{ + auto& data_mgr=get_data_mgr(); + bool is_blockwise=data_mgr.is_blockwise(); + data_mgr.set_blockwise(false); + + data_mgr.start(); + auto samples=data_mgr.next(); + REQUIRE(!samples.empty(), "Could not fetch samples!\n"); + + CFeatures *samples_p=samples[0][0].get(); + CFeatures *samples_q=samples[1][0].get(); + SG_REF(samples_p); + SG_REF(samples_q); + + distance->cleanup(); + distance->remove_lhs_and_rhs(); + 
REQUIRE(distance->init(samples_p, samples_q), "Could not initialize distance instance!\n"); + auto dist_mat=distance->get_distance_matrix(); + distance->remove_lhs_and_rhs(); + distance->cleanup(); + + samples.clear(); + data_mgr.end(); + data_mgr.set_blockwise(is_blockwise); + + auto precomputed_distance=new CCustomDistance(); + precomputed_distance->set_full_distance_matrix_from_full(dist_mat.data(), dist_mat.num_rows, dist_mat.num_cols); + return precomputed_distance; +} + +CCustomDistance* CTwoDistributionTest::compute_joint_distance(CDistance* distance) +{ + auto& data_mgr=get_data_mgr(); + bool is_blockwise=data_mgr.is_blockwise(); + data_mgr.set_blockwise(false); + + data_mgr.start(); + auto samples=data_mgr.next(); + REQUIRE(!samples.empty(), "Could not fetch samples!\n"); + + CFeatures *samples_p=samples[0][0].get(); + CFeatures *samples_q=samples[1][0].get(); + auto p_and_q=FeaturesUtil::create_merged_copy(samples_p, samples_q); + + samples.clear(); + data_mgr.end(); + data_mgr.set_blockwise(is_blockwise); + + distance->cleanup(); + distance->remove_lhs_and_rhs(); + REQUIRE(distance->init(p_and_q, p_and_q), "Could not initialize distance instance!\n"); + auto dist_mat=distance->get_distance_matrix(); + distance->remove_lhs_and_rhs(); + distance->cleanup(); + + auto precomputed_distance=new CCustomDistance(); + precomputed_distance->set_triangle_distance_matrix_from_full(dist_mat.data(), dist_mat.num_rows, dist_mat.num_cols); + return precomputed_distance; +} + +const char* CTwoDistributionTest::get_name() const +{ + return "TwoDistributionTest"; +} diff --git a/src/shogun/statistical_testing/TwoDistributionTest.h b/src/shogun/statistical_testing/TwoDistributionTest.h new file mode 100644 index 00000000000..6706d723cde --- /dev/null +++ b/src/shogun/statistical_testing/TwoDistributionTest.h @@ -0,0 +1,157 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * 
All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef TWO_DISTRIBUTION_TEST_H_ +#define TWO_DISTRIBUTION_TEST_H_ + +#include + +namespace shogun +{ + +class CDistance; +class CCustomDistance; + +/** + * @brief Class TwoDistributionTest is the base class for statistical + * hypothesis testing with samples from two distributions, \f$\mathbf{P}\f$ + * and \f$\mathbf{Q}\f$. 
+ * + * \sa {CTwoSampleTest} + */ +class CTwoDistributionTest : public CHypothesisTest +{ +public: + /** Default constructor */ + CTwoDistributionTest(); + + /** Destructor */ + virtual ~CTwoDistributionTest(); + + /** + * Method that initializes the samples from \f$\mathbf{P}\f$. This method + * is kept virtual for the sub-classes to perform additional initialization + * tasks that have to be performed every time features are set/updated. + * + * @param samples_from_p The CFeatures instance representing the samples + * from \f$\mathbf{P}\f$. + */ + virtual void set_p(CFeatures* samples_from_p); + + /** @return The samples from \f$\mathbf{P}\f$. */ + CFeatures* get_p() const; + + /** + * Method that initializes the samples from \f$\mathbf{Q}\f$. This method + * is kept virtual for the sub-classes to perform additional initialization + * tasks that have to be performed every time features are set/updated. + * + * @param samples_from_q The CFeatures instance representing the samples + * from \f$\mathbf{Q}\f$. + */ + virtual void set_q(CFeatures* samples_from_q); + + /** @return The samples from \f$\mathbf{Q}\f$. */ + CFeatures* get_q() const; + + /** + * Method that initializes the number of samples to be drawn from distribution + * \f$\mathbf{P}\f$. Please ensure to call this method if you are intending to + * use streaming data generators that generate the samples on the fly. For + * other types of features, the number of samples is set internally from the + * features object itself, therefore this method should not be used. + * + * @param num_samples_from_p The number of samples to be drawn from + * \f$\mathbf{P}\f$. + */ + void set_num_samples_p(index_t num_samples_from_p); + + /** @return The number of samples from \f$\mathbf{P}\f$. */ + const index_t get_num_samples_p() const; + + /** + * Method that initializes the number of samples to be drawn from distribution + * \f$\mathbf{Q}\f$. 
Please ensure to call this method if you are intending to + * use streaming data generators that generate the samples on the fly. For + * other types of features, the number of samples is set internally from the + * features object itself, therefore this method should not be used. + * + * @param num_samples_from_q The number of samples to be drawn from + * \f$\mathbf{Q}\f$. + */ + void set_num_samples_q(index_t num_samples_from_q); + + /** @return The number of samples from \f$\mathbf{Q}\f$. */ + const index_t get_num_samples_q() const; + + /** + * Method that pre-computes the pair-wise distance between the samples using + * the provided distance instance. + * + * @param distance The distance instance used for pre-computing the pair-wise + * distance. + * @return A newly created CCustomDistance instance representing the + * pre-computed pair-wise distance between the samples. + */ + CCustomDistance* compute_distance(CDistance* distance); + + /** + * Method that pre-computes the pair-wise distance between the joint samples using + * the provided distance instance. A temporary object appending the samples from + * both the distributions is created in order to perform the task. + * + * @param distance The distance instance used for pre-computing the pair-wise + * distance. + * @return A newly created CCustomDistance instance representing the + * pre-computed pair-wise distance between the joint samples. + */ + CCustomDistance* compute_joint_distance(CDistance* distance); + + /** + * Interface for computing the test-statistic for the hypothesis test. + * + * @return test statistic for the given data/parameters/methods + */ + virtual float64_t compute_statistic()=0; + + /** + * Interface for computing the samples under the null-hypothesis. 
+ * + * @return vector of all statistics + */ + virtual SGVector sample_null()=0; + + /** @return The name of the class */ + virtual const char* get_name() const; +}; + +} +#endif // TWO_DISTRIBUTION_TEST_H_ diff --git a/src/shogun/statistical_testing/TwoSampleTest.cpp b/src/shogun/statistical_testing/TwoSampleTest.cpp new file mode 100644 index 00000000000..9f440cf5fba --- /dev/null +++ b/src/shogun/statistical_testing/TwoSampleTest.cpp @@ -0,0 +1,78 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +struct CTwoSampleTest::Self +{ + Self(index_t num_kernels); + KernelManager kernel_mgr; +}; + +CTwoSampleTest::Self::Self(index_t num_kernels) : kernel_mgr(num_kernels) +{ +} + +CTwoSampleTest::CTwoSampleTest() : CTwoDistributionTest() +{ + self=std::unique_ptr(new Self(TwoSampleTest::num_kernels)); +} + +CTwoSampleTest::CTwoSampleTest(CFeatures* samples_from_p, CFeatures* samples_from_q) : CTwoDistributionTest() +{ + self=std::unique_ptr(new Self(TwoSampleTest::num_kernels)); + set_p(samples_from_p); + set_q(samples_from_q); +} + +CTwoSampleTest::~CTwoSampleTest() +{ +} + +void CTwoSampleTest::set_kernel(CKernel* kernel) +{ + REQUIRE(kernel, "Kernel cannot be NULL!\n"); + self->kernel_mgr.kernel_at(0)=kernel; + self->kernel_mgr.restore_kernel_at(0); +} + +CKernel* CTwoSampleTest::get_kernel() const +{ + return get_kernel_mgr().kernel_at(0); +} + +const char* CTwoSampleTest::get_name() const +{ + return "TwoSampleTest"; +} + +KernelManager& CTwoSampleTest::get_kernel_mgr() +{ + return self->kernel_mgr; +} + +const KernelManager& CTwoSampleTest::get_kernel_mgr() const +{ + return self->kernel_mgr; +} diff --git a/src/shogun/statistical_testing/TwoSampleTest.h b/src/shogun/statistical_testing/TwoSampleTest.h new file mode 100644 index 00000000000..93659722d70 --- /dev/null +++ b/src/shogun/statistical_testing/TwoSampleTest.h @@ -0,0 +1,99 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#ifndef TWO_SAMPLE_TEST_H_ +#define TWO_SAMPLE_TEST_H_ + +#include +#include + +namespace shogun +{ + +class CKernel; +class CFeatures; + +namespace internal +{ + class KernelManager; +} + +/** @brief Kernel two sample test base class. Provides an interface for + * performing a two-sample test using a kernel, i.e. given samples from two + * distributions \f$p\f$ and \f$q\f$, the null-hypothesis is: \f$H_0: p=q\f$, + * the alternative hypothesis: \f$H_1: p\neq q\f$. + * + * In this class, this is done using a single kernel for the data. + * + * Abstract base class. + */ +class CTwoSampleTest : public CTwoDistributionTest +{ +public: + /** Default constructor */ + CTwoSampleTest(); + + /** + * Convenience constructor that initializes the samples from two distributions. + * + * @param samples_from_p Samples from \f$p\f$ + * @param samples_from_q Samples from \f$q\f$ + */ + CTwoSampleTest(CFeatures* samples_from_p, CFeatures* samples_from_q); + + /** Destructor */ + virtual ~CTwoSampleTest(); + + /** + * Method that sets the kernel that is used for performing the two-sample test. + * It is kept virtual so that sub-classes can perform other initialization tasks + * that have to be triggered every time a kernel is set/updated. + * + * @param kernel The kernel instance. + */ + virtual void set_kernel(CKernel* kernel); + + /** @return The kernel instance that is presently being used for performing the test */ + CKernel* get_kernel() const; + + /** + * Interface for computing the test-statistic for the hypothesis test. 
+ * + * @return test statistic for the given data/parameters/methods + */ + virtual float64_t compute_statistic()=0; + + /** + * Interface for computing the samples under the null-hypothesis. + * + * @return vector of all statistics + */ + virtual SGVector sample_null()=0; + + /** @return The name of the class */ + virtual const char* get_name() const; +protected: + internal::KernelManager& get_kernel_mgr(); + const internal::KernelManager& get_kernel_mgr() const; +private: + struct Self; + std::unique_ptr self; +}; + +} +#endif // TWO_SAMPLE_TEST_H_ diff --git a/src/shogun/statistical_testing/internals/Block.cpp b/src/shogun/statistical_testing/internals/Block.cpp new file mode 100644 index 00000000000..7750f7c892c --- /dev/null +++ b/src/shogun/statistical_testing/internals/Block.cpp @@ -0,0 +1,85 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +Block::Block(CFeatures* feats, index_t index, index_t size) : m_feats(feats) +{ + REQUIRE(m_feats!=nullptr, "Underlying feature object cannot be null!\n"); + + // increase the refcount of the underlying feature object + // we want this object to be alive till the last block is free'd + SG_REF(m_feats); + + // create a shallow copy and subset current block separately + CFeatures* block=FeaturesUtil::create_shallow_copy(feats); + ASSERT(block->ref_count()==0); + + SGVector inds(size); + std::iota(inds.vector, inds.vector+inds.vlen, index*size); + block->add_subset(inds); + + // since this block object is internal, we simply use a shared_ptr + m_block=std::shared_ptr(block); +} + +Block::Block(const Block& other) : m_block(other.m_block), m_feats(other.m_feats) +{ + SG_REF(m_feats); +} + +Block& Block::operator=(const Block& other) +{ + // take the new reference first so that self-assignment stays safe, + // then release the feature object held so far to avoid leaking a reference + SG_REF(other.m_feats); + SG_UNREF(m_feats); + m_block=other.m_block; + m_feats=other.m_feats; + return *this; +} + 
+Block::~Block() +{ + SG_UNREF(m_feats); +} + +std::vector Block::create_blocks(CFeatures* feats, index_t num_blocks, index_t size) +{ + std::vector vec; + for 
(index_t i=0; i +#include +#include + +#ifndef BLOCK_H__ +#define BLOCK_H__ + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +/** + * @brief Class that holds a block feature. A block feature is a shallow + * copy of an underlying (non-owning) feature object. In its constructor, + * it increases the refcount of the original object (since it has to be + * alive as long as the block is alive) and it decreases the refcount of + * the original object in destructor. + */ +class Block +{ +private: + /** + * Constructor to create a block object. It makes a shallow copy of + * the underlying feature object, and adds subset according to the + * block begin index and the blocksize. + * + * Increases the reference count of the underlying feature object. + * + * @param feats The underlying feature object. + * @param index The index of the block. + * @param size The size of the block (number of feature vectors). + */ + Block(CFeatures* feats, index_t index, index_t size); +public: + /** + * Copy constructor. Every time a block is copied or assigned, the underlying + * feature object is SG_REF'd. + */ + Block(const Block& other); + + /** + * Assignment operator. Every time a block is copied or assigned, the underlying + * feature object is SG_REF'd. + */ + Block& operator=(const Block& other); + + /** + * Destructor. Decreases the reference count of the underlying feature object. + */ + ~Block(); + + /** + * Method that creates a number of block objects. See @Block for details. + * + * @param feats The underlying feature object. + * @param num_blocks The number of blocks to be formed. + * @param size The size of the block (number of feature vectors). + */ + static std::vector create_blocks(CFeatures* feats, index_t num_blocks, index_t size); + + /** + * Operator overloading for getting the block object as a shared ptr (non-const). 
+ */ + inline operator std::shared_ptr() + { + return m_block; + } + + /** + * Operator overloading for getting the block object as a naked ptr (non-const, unsafe). + */ + inline operator CFeatures*() + { + return m_block.get(); + } + + /** + * Operator overloading for getting the block object as a naked ptr (const). + */ + inline operator const CFeatures*() const + { + return m_block.get(); + } + + /** + * @return the block feature object (non-const, unsafe). + */ + inline CFeatures* get() + { + return static_cast(*this); + } + + /** + * @return the block feature object (const). + */ + inline const CFeatures* get() const + { + return static_cast(*this); + } +private: + /** Shallow copy representing the block */ + std::shared_ptr m_block; + + /** Underlying feature object */ + CFeatures* m_feats; +}; + +} + +} +#endif // BLOCK_H__ diff --git a/src/shogun/statistical_testing/internals/BlockwiseDetails.cpp b/src/shogun/statistical_testing/internals/BlockwiseDetails.cpp new file mode 100644 index 00000000000..e8ceb4519b9 --- /dev/null +++ b/src/shogun/statistical_testing/internals/BlockwiseDetails.cpp @@ -0,0 +1,54 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include + +using namespace shogun; +using namespace internal; + +BlockwiseDetails::BlockwiseDetails() : m_blocksize(0), m_num_blocks_per_burst(1), + m_max_num_samples_per_burst(0), m_next_block_index(0), m_total_num_blocks(0), + m_full_data(true) +{ +} + +BlockwiseDetails& BlockwiseDetails::with_blocksize(index_t blocksize) +{ + m_blocksize = blocksize; + m_max_num_samples_per_burst = m_blocksize * m_num_blocks_per_burst; + return *this; +} + +BlockwiseDetails& BlockwiseDetails::with_num_blocks_per_burst(index_t num_blocks_per_burst) +{ + m_num_blocks_per_burst = num_blocks_per_burst; + m_max_num_samples_per_burst = m_blocksize * m_num_blocks_per_burst; + return *this; +} diff --git a/src/shogun/statistical_testing/internals/BlockwiseDetails.h b/src/shogun/statistical_testing/internals/BlockwiseDetails.h new file mode 100644 index 00000000000..916fd53e578 --- /dev/null +++ b/src/shogun/statistical_testing/internals/BlockwiseDetails.h @@ -0,0 +1,97 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include + +#ifndef BLOCK_WISE_DETAILS_H__ +#define BLOCK_WISE_DETAILS_H__ + +namespace shogun +{ + +namespace internal +{ + +/** + * @brief Class that holds block-details for the data-fetchers. + * There is one instance of this class per fetcher. + */ +class BlockwiseDetails +{ + friend class DataFetcher; + friend class StreamingDataFetcher; + friend class DataManager; + +public: + + /** + * Default constructor. + */ + BlockwiseDetails(); + + /** + * Method that sets the blocksize for the current fetcher. + * @param blocksize the size of the block + * @return a reference to the current object + */ + BlockwiseDetails& with_blocksize(index_t blocksize); + + /** + * Method that sets the number of blocks to be fetched per burst for the current fetcher.
+ * @param num_blocks_per_burst the number of blocks to be fetched per burst + * @return an instance of the current object + */ + BlockwiseDetails& with_num_blocks_per_burst(index_t num_blocks_per_burst); + +private: + + /** The size of the blocks */ + index_t m_blocksize; + + /** The number of blocks fetched per burst */ + index_t m_num_blocks_per_burst; + + /** The maximum number of samples fetched per burst */ + index_t m_max_num_samples_per_burst; + + /** Index for the next block to be fetched. Set by data fetchers */ + index_t m_next_block_index; + + /** Total number of blocks to be fetched. Set by data fetchers */ + index_t m_total_num_blocks; + + /** Whether the block should consist of full data (i.e. no block at all) */ + bool m_full_data; +}; + +} + +} +#endif // BLOCK_WISE_DETAILS_H__ diff --git a/src/shogun/statistical_testing/internals/ComputationManager.cpp b/src/shogun/statistical_testing/internals/ComputationManager.cpp new file mode 100644 index 00000000000..3551f299c9d --- /dev/null +++ b/src/shogun/statistical_testing/internals/ComputationManager.cpp @@ -0,0 +1,131 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include + +using namespace shogun; +using namespace internal; + +ComputationManager::ComputationManager() +{ +} + +ComputationManager::~ComputationManager() +{ +} + +void ComputationManager::num_data(index_t n) +{ + data_array.resize(n); +} + +SGMatrix& ComputationManager::data(index_t i) +{ + return data_array[i]; +} + +void ComputationManager::enqueue_job(std::function)> job) +{ + job_array.push_back(job); +} + +void ComputationManager::compute_data_parallel_jobs() +{ + // this is used when there are more number of data blocks to be processed + // than there are jobs + result_array.resize(job_array.size()); + for (size_t j=0; j current_data_results(job_array.size()); + for (size_t j=0; j& ComputationManager::result(index_t i) +{ + return result_array[i]; +} + +ComputationManager& ComputationManager::use_gpu() +{ + gpu=true; + return *this; +} + +ComputationManager& ComputationManager::use_cpu() +{ + gpu=false; + return *this; +} diff --git a/src/shogun/statistical_testing/internals/ComputationManager.h b/src/shogun/statistical_testing/internals/ComputationManager.h new file mode 100644 index 00000000000..1eb54894294 --- /dev/null +++ b/src/shogun/statistical_testing/internals/ComputationManager.h @@ -0,0 +1,62 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#ifndef COMPUTATION_MANAGER_H__ +#define COMPUTATION_MANAGER_H__ + +#include +#include +#include + +namespace shogun +{ + +template class SGMatrix; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class ComputationManager +{ +public: + ComputationManager(); + ~ComputationManager(); + + void num_data(index_t n); + SGMatrix& data(index_t i); + + void enqueue_job(std::function)> job); + void compute_data_parallel_jobs(); + void compute_task_parallel_jobs(); + void done(); + std::vector& result(index_t i); + + ComputationManager& use_cpu(); + ComputationManager& use_gpu(); +private: + bool gpu; + std::vector > data_array; + std::vector&)> > job_array; + std::vector > result_array; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS + +} // namespace internal + +} // namespace shogun +#endif // COMPUTATION_MANAGER_H__ diff --git a/src/shogun/statistical_testing/internals/DataFetcher.cpp b/src/shogun/statistical_testing/internals/DataFetcher.cpp new file mode 100644 index 00000000000..602c7bafc15 --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataFetcher.cpp @@ -0,0 +1,277 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +DataFetcher::DataFetcher() : m_num_samples(0), train_test_mode(false), + train_mode(false), m_samples(nullptr), features_shuffled(false) +{ +} + +DataFetcher::DataFetcher(CFeatures* samples) : train_test_mode(false), + train_mode(false), m_samples(samples), features_shuffled(false) +{ + REQUIRE(m_samples!=nullptr, "Samples cannot be null!\n"); + SG_REF(m_samples); + m_num_samples=m_samples->get_num_vectors(); +} + +DataFetcher::~DataFetcher() +{ + SG_UNREF(m_samples); +} + +void DataFetcher::set_blockwise(bool blockwise) +{ + if (blockwise) + { + m_block_details=last_blockwise_details; + SG_SDEBUG("Restoring the blockwise details!\n"); + m_block_details.m_full_data=false; + } + else + { + last_blockwise_details=m_block_details; + SG_SDEBUG("Saving the blockwise details!\n"); + m_block_details=BlockwiseDetails(); + } +} + +void DataFetcher::set_train_test_mode(bool on) +{ + train_test_mode=on; +} + +bool DataFetcher::is_train_test_mode() const +{ + return train_test_mode; +} + +void DataFetcher::set_train_mode(bool on) +{ + train_mode=on; +} + +bool DataFetcher::is_train_mode() const +{ + return train_mode; +} + +void DataFetcher::set_train_test_ratio(float64_t ratio) +{ + train_test_ratio=ratio; +} + +float64_t DataFetcher::get_train_test_ratio() const +{ + return train_test_ratio; +} + +void DataFetcher::shuffle_features() +{ + REQUIRE(train_test_mode, "This method is allowed only when Train/Test method is active!\n"); + if (features_shuffled) + { + SG_SWARNING("Features are already shuffled! Call to shuffle_features() has no effect." 
+ "If you want to reshuffle, please call unshuffle_features() first and then call this method!\n"); + } + else + { + const index_t size=m_samples->get_num_vectors(); + SG_SDEBUG("Current number of feature vectors = %d\n", size); + if (shuffle_subset.size()(size); + } + std::iota(shuffle_subset.data(), shuffle_subset.data()+shuffle_subset.size(), 0); + CMath::permute(shuffle_subset); +// shuffle_subset.display_vector("shuffle_subset"); + + SG_SDEBUG("Shuffling %d feature vectors\n", size); + m_samples->add_subset(shuffle_subset); + + features_shuffled=true; + } +} + +void DataFetcher::unshuffle_features() +{ + REQUIRE(train_test_mode, "This method is allowed only when Train/Test method is active!\n"); + if (features_shuffled) + { + m_samples->remove_subset(); + features_shuffled=false; + } + else + { + SG_SWARNING("Features are NOT shuffled! Call to unshuffle_features() has no effect." + "If you want to reshuffle, please call shuffle_features() instead!\n"); + } +} + +void DataFetcher::use_fold(index_t idx) +{ + allocate_active_subset(); + auto num_samples_per_fold=get_num_samples()/get_num_folds(); + auto start_idx=idx*num_samples_per_fold; + if (train_mode) + { + std::iota(active_subset.data(), active_subset.data()+active_subset.size(), 0); + if (start_idxget_num_vectors()*train_test_ratio/(train_test_ratio+1); + std::iota(active_subset.data(), active_subset.data()+active_subset.size(), start_index); +// active_subset.display_vector("active_subset"); +} + +void DataFetcher::start() +{ + REQUIRE(get_num_samples()>0, "Number of samples is 0!\n"); + if (train_test_mode) + { + m_samples->add_subset(active_subset); + SG_SDEBUG("Added active subset!\n"); + SG_SINFO("Currently active number of samples is %d\n", get_num_samples()); + } + + if (m_block_details.m_full_data || m_block_details.m_blocksize>get_num_samples()) + { + SG_SINFO("Fetching entire data (%d samples)!\n", get_num_samples()); + m_block_details.with_blocksize(get_num_samples()); + } + 
m_block_details.m_total_num_blocks=get_num_samples()/m_block_details.m_blocksize; + reset(); +} + +CFeatures* DataFetcher::next() +{ + CFeatures* next_samples=nullptr; + // figure out how many samples to fetch in this burst + auto num_already_fetched=m_block_details.m_next_block_index*m_block_details.m_blocksize; + auto num_more_samples=get_num_samples()-num_already_fetched; + if (num_more_samples>0) + { + // create a shallow copy and add proper index subset + next_samples=FeaturesUtil::create_shallow_copy(m_samples); + auto num_samples_this_burst=std::min(m_block_details.m_max_num_samples_per_burst, num_more_samples); + if (num_samples_this_burstget_num_vectors()) + { + SGVector inds(num_samples_this_burst); + std::iota(inds.vector, inds.vector+inds.vlen, num_already_fetched); + next_samples->add_subset(inds); + } + m_block_details.m_next_block_index+=m_block_details.m_num_blocks_per_burst; + } + return next_samples; +} + +void DataFetcher::reset() +{ + m_block_details.m_next_block_index=0; +} + +void DataFetcher::end() +{ + if (train_test_mode) + { + m_samples->remove_subset(); + SG_SDEBUG("Removed active subset!\n"); + SG_SINFO("Currently active number of samples is %d\n", get_num_samples()); + } +} + +index_t DataFetcher::get_num_samples() const +{ + if (train_test_mode) + { + if (train_mode) + return m_num_samples*train_test_ratio/(train_test_ratio+1); + else + return m_num_samples/(train_test_ratio+1); + } + return m_samples->get_num_vectors(); +} + +index_t DataFetcher::get_num_folds() const +{ + return 1+ceil(get_train_test_ratio()); +} + +index_t DataFetcher::get_num_training_samples() const +{ + return get_num_samples()*get_train_test_ratio()/(get_train_test_ratio()+1); +} + +index_t DataFetcher::get_num_testing_samples() const +{ + return get_num_samples()/(get_train_test_ratio()+1); +} + +BlockwiseDetails& DataFetcher::fetch_blockwise() +{ + m_block_details.m_full_data=false; + return m_block_details; +} + +void DataFetcher::allocate_active_subset() +{ 
+ REQUIRE(train_test_mode, "This method is allowed only when Train/Test method is active!\n"); + index_t num_active_samples=0; + if (train_mode) + { + num_active_samples=m_samples->get_num_vectors()*train_test_ratio/(train_test_ratio+1); + SG_SINFO("Using %d number of samples for this fold as training samples!\n", num_active_samples); + } + else + { + num_active_samples=m_samples->get_num_vectors()/(train_test_ratio+1); + SG_SINFO("Using %d number of samples for this fold as testing samples!\n", num_active_samples); + } + + ASSERT(num_active_samples>0); + if (active_subset.size()!=num_active_samples) + { + SG_SDEBUG("Resizing the active subset from %d to %d\n", active_subset.size(), num_active_samples); + active_subset=SGVector(num_active_samples); + } +} diff --git a/src/shogun/statistical_testing/internals/DataFetcher.h b/src/shogun/statistical_testing/internals/DataFetcher.h new file mode 100644 index 00000000000..bb22ea2f19c --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataFetcher.h @@ -0,0 +1,109 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include + +#ifndef DATA_FETCHER_H__ +#define DATA_FETCHER_H__ + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +class DataManager; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class DataFetcher +{ + friend class DataManager; + friend class InitPerFeature; +public: + DataFetcher(CFeatures* samples); + virtual ~DataFetcher(); + + void set_blockwise(bool blockwise); + + void set_train_test_mode(bool on); + bool is_train_test_mode() const; + + void set_train_mode(bool on); + bool is_train_mode() const; + + void set_train_test_ratio(float64_t ratio); + float64_t get_train_test_ratio() const; + + virtual void shuffle_features(); + virtual void unshuffle_features(); + + virtual void use_fold(index_t i); + virtual void init_active_subset(); + + virtual void start(); + virtual CFeatures* next(); + virtual void reset(); + virtual void end(); + + virtual index_t get_num_samples() const; + + index_t get_num_folds() const; + index_t get_num_training_samples() const; + index_t get_num_testing_samples() const; + + BlockwiseDetails& fetch_blockwise(); + virtual const char* get_name() const + { + return "DataFetcher"; + } +protected: + DataFetcher(); + BlockwiseDetails 
m_block_details; + index_t m_num_samples; + bool train_test_mode; + bool train_mode; + float64_t train_test_ratio; +private: + CFeatures* m_samples; + SGVector shuffle_subset; + SGVector active_subset; + bool features_shuffled; + BlockwiseDetails last_blockwise_details; + void allocate_active_subset(); +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} +#endif // DATA_FETCHER_H__ diff --git a/src/shogun/statistical_testing/internals/DataFetcherFactory.cpp b/src/shogun/statistical_testing/internals/DataFetcherFactory.cpp new file mode 100644 index 00000000000..1cb915e7398 --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataFetcherFactory.cpp @@ -0,0 +1,37 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +DataFetcher* DataFetcherFactory::get_instance(CFeatures* feats) +{ + EFeatureClass fclass = feats->get_feature_class(); + if (fclass == C_STREAMING_DENSE || fclass == C_STREAMING_SPARSE || fclass == C_STREAMING_STRING) + { + return new StreamingDataFetcher(static_cast(feats)); + } + return new DataFetcher(feats); +} + diff --git a/src/shogun/statistical_testing/internals/DataFetcherFactory.h b/src/shogun/statistical_testing/internals/DataFetcherFactory.h new file mode 100644 index 00000000000..702744c5861 --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataFetcherFactory.h @@ -0,0 +1,48 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include + +#ifndef DATA_FETCHER_FACTORY_H__ +#define DATA_FETCHER_FACTORY_H__ + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +class DataFetcher; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct DataFetcherFactory +{ + DataFetcherFactory() = delete; + DataFetcherFactory(const DataFetcherFactory& other) = delete; + DataFetcherFactory& operator=(const DataFetcherFactory& other) = delete; + ~DataFetcherFactory() = delete; + + static DataFetcher* get_instance(CFeatures* feats); +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} +#endif // DATA_FETCHER_FACTORY_H__ diff --git a/src/shogun/statistical_testing/internals/DataManager.cpp b/src/shogun/statistical_testing/internals/DataManager.cpp new file mode 100644 index 00000000000..baa96d8aa65 --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataManager.cpp @@ -0,0 +1,426 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +DataManager::DataManager(size_t num_distributions) +{ + SG_SDEBUG("Data manager instance initialized with %d data sources!\n", num_distributions); + fetchers.resize(num_distributions); + std::fill(fetchers.begin(), fetchers.end(), nullptr); + + train_test_mode=default_train_test_mode; + train_mode=default_train_mode; + train_test_ratio=default_train_test_ratio; + cross_validation_mode=default_cross_validation_mode; +} + +DataManager::~DataManager() +{ +} + +index_t DataManager::get_num_samples() const +{ + SG_SDEBUG("Entering!\n"); + index_t n=0; + typedef const std::unique_ptr fetcher_type; + if (std::any_of(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { return f->m_num_samples==0; })) + SG_SERROR("number of samples from all the distributions are not set!") + else + std::for_each(fetchers.begin(), fetchers.end(), [&n](fetcher_type& f) { n+=f->m_num_samples; }); + SG_SDEBUG("Leaving!\n"); + return n; +} + +index_t DataManager::get_min_blocksize() const +{ + SG_SDEBUG("Entering!\n"); + index_t min_blocksize=0; + typedef const std::unique_ptr 
fetcher_type; + if (std::any_of(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { return f->m_num_samples==0; })) + SG_SERROR("number of samples from all the distributions are not set!") + else + { + index_t divisor=0; + std::function gcd=[&gcd](index_t m, index_t n) + { + return n==0?m:gcd(n, m%n); + }; + for (size_t i=0; im_num_samples); + min_blocksize=get_num_samples()/divisor; + } + SG_SDEBUG("min blocksize is %d!\n", min_blocksize); + SG_SDEBUG("Leaving!\n"); + return min_blocksize; +} + +void DataManager::set_blocksize(index_t blocksize) +{ + SG_SDEBUG("Entering!\n"); + auto n=get_num_samples(); + + REQUIRE(n>0, + "Total number of samples is 0! Please set the number of samples!\n"); + REQUIRE(blocksize>0 && blocksize<=n, + "The blocksize has to be within (0, %d], given = %d!\n", + n, blocksize); + REQUIRE(n%blocksize==0, + "Total number of samples (%d) has to be divisible by the blocksize (%d)!\n", + n, blocksize); + + for (size_t i=0; im_num_samples; + REQUIRE((blocksize*m)%n==0, + "Blocksize (%d) cannot be evenly distributed with a ratio of %f!\n", + blocksize, float64_t(m)/n); + fetchers[i]->fetch_blockwise().with_blocksize(blocksize*m/n); + SG_SDEBUG("block[%d].size = %d\n", i, blocksize*m/n); + } + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::set_num_blocks_per_burst(index_t num_blocks_per_burst) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(num_blocks_per_burst>0, + "Number of blocks per burst (%d) has to be greater than 0!\n", + num_blocks_per_burst); + + index_t blocksize=0; + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [&blocksize](fetcher_type& f) + { + blocksize+=f->m_block_details.m_blocksize; + }); + REQUIRE(blocksize>0, + "Blocksizes are not set!\n"); + + index_t max_num_blocks_per_burst=get_num_samples()/blocksize; + if (num_blocks_per_burst>max_num_blocks_per_burst) + { + SG_SINFO("There can only be %d blocks per burst given the blocksize (%d)!\n", max_num_blocks_per_burst, blocksize); + SG_SINFO("Setting num
blocks per burst to be %d instead!\n", max_num_blocks_per_burst); + num_blocks_per_burst=max_num_blocks_per_burst; + } + + for (size_t i=0; ifetch_blockwise().with_num_blocks_per_burst(num_blocks_per_burst); + SG_SDEBUG("Leaving!\n"); +} + +InitPerFeature DataManager::samples_at(size_t i) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(im_samples; + else + return nullptr; +} + +index_t& DataManager::num_samples_at(size_t i) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(im_num_samples; +} + +const index_t DataManager::num_samples_at(size_t i) const +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(iget_num_samples(); + else + return 0; +} + +const index_t DataManager::blocksize_at(size_t i) const +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(im_block_details.m_blocksize; + else + return 0; +} + +void DataManager::set_blockwise(bool blockwise) +{ + SG_SDEBUG("Entering!\n"); + for (size_t i=0; iset_blockwise(blockwise); + SG_SDEBUG("Leaving!\n"); +} + +const bool DataManager::is_blockwise() const +{ + SG_SDEBUG("Entering!\n"); + bool blockwise=true; + for (size_t i=0; im_block_details.m_full_data; + SG_SDEBUG("Leaving!\n"); + return blockwise; +} + +void DataManager::set_train_test_mode(bool on) +{ + train_test_mode=on; + if (!train_test_mode) + { + train_mode=default_train_mode; + train_test_ratio=default_train_test_ratio; + cross_validation_mode=default_cross_validation_mode; + } + REQUIRE(fetchers.size()>0, "Features are not set!"); + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [this, on](fetcher_type& f) + { + f->set_train_test_mode(on); + if (on) + { + f->set_train_mode(train_mode); + f->set_train_test_ratio(train_test_ratio); + } + }); +} + +bool DataManager::is_train_test_mode() const +{ + return train_test_mode; +} + +void DataManager::set_train_mode(bool on) +{ + if (train_test_mode) + train_mode=on; + else + { + SG_SERROR("Train mode cannot be used without turning on Train/Test mode first!" 
+ "Please call set_train_test_mode(True) before using this method!\n"); + } +} + +bool DataManager::is_train_mode() const +{ + return train_mode; +} + +void DataManager::set_cross_validation_mode(bool on) +{ + if (train_test_mode) + cross_validation_mode=on; + else + { + SG_SERROR("Cross-validation mode cannot be used without turning on Train/Test mode first!" + "Please call set_train_test_mode(True) before using this method!\n"); + } +} + +bool DataManager::is_cross_validation_mode() const +{ + return cross_validation_mode; +} + +void DataManager::set_train_test_ratio(float64_t ratio) +{ + if (train_test_mode) + train_test_ratio=ratio; + else + { + SG_SERROR("Train-test ratio cannot be set without turning on Train/Test mode first!" + "Please call set_train_test_mode(True) before using this method!\n"); + } +} + +float64_t DataManager::get_train_test_ratio() const +{ + return train_test_ratio; +} + +index_t DataManager::get_num_folds() const +{ + return ceil(get_train_test_ratio())+1; +} + +void DataManager::shuffle_features() +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { f->shuffle_features(); }); + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::unshuffle_features() +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { f->unshuffle_features(); }); + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::init_active_subset() +{ + SG_SDEBUG("Entering!\n"); + + REQUIRE(train_test_mode, + "Train-test subset cannot be used without turning on Train/Test mode first!" 
+ "Please call set_train_test_mode(True) before using this method!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [this](fetcher_type& f) + { + f->set_train_mode(train_mode); + f->set_train_test_ratio(train_test_ratio); + f->init_active_subset(); + }); + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::use_fold(index_t idx) +{ + SG_SDEBUG("Entering!\n"); + + REQUIRE(train_test_mode, + "Fold subset cannot be used without turning on Train/Test mode first!" + "Please call set_train_test_mode(True) before using this method!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + REQUIRE(idx>=0, "Fold index has to be in [0, %d]!", get_num_folds()-1); + REQUIRE(idx fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [this, idx](fetcher_type& f) + { + f->set_train_mode(train_mode); + f->set_train_test_ratio(train_test_ratio); + f->use_fold(idx); + }); + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::start() +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + + if (train_test_mode && !cross_validation_mode) + init_active_subset(); + + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { f->start(); }); + SG_SDEBUG("Leaving!\n"); +} + +NextSamples DataManager::next() +{ + SG_SDEBUG("Entering!\n"); + + // sets the number of feature objects (number of distributions) + NextSamples next_samples(fetchers.size()); + + // fetch a number of blocks (per burst) from each distribution + for (size_t i=0; inext(); + if (feats!=nullptr) + { + ASSERT(feats->ref_count()==0); + + auto blocksize=fetchers[i]->m_block_details.m_blocksize; + auto num_blocks_curr_burst=feats->get_num_vectors()/blocksize; + + // use same number of blocks from all the distributions + if (next_samples.m_num_blocks==0) + next_samples.m_num_blocks=num_blocks_curr_burst; + else + 
ASSERT(next_samples.m_num_blocks==num_blocks_curr_burst); + + next_samples[i]=Block::create_blocks(feats, num_blocks_curr_burst, blocksize); + } + } + SG_SDEBUG("Leaving!\n"); + return next_samples; +} + +void DataManager::end() +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { f->end(); }); + SG_SDEBUG("Leaving!\n"); +} + +void DataManager::reset() +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(fetchers.size()>0, "Features are not set!"); + typedef std::unique_ptr fetcher_type; + std::for_each(fetchers.begin(), fetchers.end(), [](fetcher_type& f) { f->reset(); }); + SG_SDEBUG("Leaving!\n"); +} diff --git a/src/shogun/statistical_testing/internals/DataManager.h b/src/shogun/statistical_testing/internals/DataManager.h new file mode 100644 index 00000000000..70324daebf1 --- /dev/null +++ b/src/shogun/statistical_testing/internals/DataManager.h @@ -0,0 +1,238 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef DATA_MANAGER_H__ +#define DATA_MANAGER_H__ + +#include +#include +#include +#include + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +class DataFetcher; +class NextSamples; + +/** + * @brief Class DataManager for fetching/streaming test data block-wise. + * It can handle data coming from multiple sources. The number of data + * sources is represented by the num_distributions parameter in the constructor + * of the data manager. It can handle heterogeneous data sources, and it can + * stream multiple blocks per burst, as the computation would require. The size + * of the blocks and the number of blocks to be fetched per burst can be set + * externally. + * + * This class is designed to be used on the stack. An instance of DataManager + * should not be serialized, copied, or moved around. In Shogun, it is intended + * for use only within the implementation class of a PIMPL. + */ +class DataManager +{ +public: + /** + * Constructor. + * + * @param num_distributions number of data sources (i.e. 
CFeatures objects) + */ + DataManager(size_t num_distributions); + + /** + * Disabled copy constructor + * @param other other instance + */ + DataManager(const DataManager& other) = delete; + + /** + * Disabled assignment operator + * @param other other instance + */ + DataManager& operator=(const DataManager& other) = delete; + + /** + * Destructor + */ + ~DataManager(); + + /** + * Sets the blocksize for block-wise data fetching. It splits the blocksize + * across the data sources in proportion to the number of feature vectors + * available from each source. More formally, if there are \f$K\f$ data sources, + * \f$X_k\f$, \f$k\in[1,K]\f$, with number of feature vectors \f$n_{X_k}\f$ from + * each, then setting a blocksize of \f$B\f$ means that every block fetched by a + * next() call of the data manager instance contains \f$\rho_{X_k} B\f$ samples + * from each \f$X_k\f$, where \f$\rho_{X_k}=n_{X_k}/n\f$, \f$n=\sum_k n_{X_k}\f$. + * + * @param blocksize The size of the block consisting of data from all the sources. + */ + void set_blocksize(index_t blocksize); + + /** + * In order to speed up the computation, usually a number of blocks are fetched at + * once per next() call. This method sets that number. + * + * @param num_blocks_per_burst The number of blocks to be fetched in a burst. + */ + void set_num_blocks_per_burst(index_t num_blocks_per_burst); + + /** + * Setter for feature object as a data source. Since multiple data sources are + * supported, this method takes the index at which the feature object is set. + * Internally, it initializes a data fetcher object for the provided feature + * object. + * + * Example usage: + * @code + * + * DataManager data_mgr(2); + * // feats_0 = some CFeatures instance + * // feats_1 = some CFeatures instance + * data_mgr.samples_at(0) = feats_0; + * data_mgr.samples_at(1) = feats_1; + * + * @endcode + * + * @param i The data source index, at which the feature object is to be set as a + * data source. 
+ * @return An initializer for the specified data source (that sets up a fetcher + * for this feature), to be used as lvalue. + */ + InitPerFeature samples_at(size_t i); + + /** + * Getter for feature object at a given data source index. + * + * @param i The data source index, from which the feature object is to be obtained + * @return The underlying CFeatures object at the specified data source. + */ + CFeatures* samples_at(size_t i) const; + + /** + * Setter for the number of samples. Setting this number is mandatory for + * streaming features. For other types of feature objects, this number equals + * the number of vectors, and is set internally. + * + * Example usage: + * @code + * + * DataManager data_mgr(2); + * data_mgr.num_samples_at(0) = 10; + * data_mgr.num_samples_at(1) = 15; + * + * @endcode + * + * @param i The data source index, at which the number of samples is to be set. + * @return A reference for the number of samples for the specified data source + * to be used as lvalue. + */ + index_t& num_samples_at(size_t i); + + /** + * Getter for the number of samples. + * + * @param i The data source index, from which the number of samples is to be obtained. + * @return The number of samples for the specified data source. + */ + const index_t num_samples_at(size_t i) const; + + /** + * Getter for the number of samples from a specified data source in a block. + * + * @param i The data source index. + * @return The number of samples from i-th data source in a block. + */ + const index_t blocksize_at(size_t i) const; + + /** + * @return Total number of samples that can be fetched from all the data sources. + */ + index_t get_num_samples() const; + + /** + * @return The minimum block-size that can be fetched from the specified data sources. + * For example, if there are two data sources, with 20 and 30 samples, respectively, + * then the minimum blocksize is 5 (2 from the 1st data source, 3 from the 2nd), and + * there can then be 10 such blocks. 
+ */ + index_t get_min_blocksize() const; +#ifndef DOXYGEN_SHOULD_SKIP_THIS + void set_blockwise(bool blockwise); + const bool is_blockwise() const; + + void set_train_test_mode(bool on); + bool is_train_test_mode() const; + + void set_train_mode(bool on); + bool is_train_mode() const; + + void set_cross_validation_mode(bool on); + bool is_cross_validation_mode() const; + + void set_train_test_ratio(float64_t ratio); + float64_t get_train_test_ratio() const; + + index_t get_num_folds() const; + + void shuffle_features(); + void unshuffle_features(); + + void use_fold(index_t i); + void init_active_subset(); + + void start(); + NextSamples next(); + void end(); + void reset(); +#endif // DOXYGEN_SHOULD_SKIP_THIS +private: + std::vector<std::unique_ptr<DataFetcher> > fetchers; + + bool train_test_mode; // -> if ON, then train/test/fold subset is used (in start()); in end() method, we remove these subsets. + bool cross_validation_mode; // -> if ON, then shuffle subset is used, remove it after train_test mode in end() + bool train_mode; // -> if train/test mode or cross-validation mode is ON, this one is used. + float64_t train_test_ratio; + + constexpr static bool default_train_test_mode=false; + constexpr static bool default_train_mode=false; + constexpr static bool default_cross_validation_mode=false; + constexpr static float64_t default_train_test_ratio=1.0; +}; + +} + +} + +#endif // DATA_MANAGER_H__ diff --git a/src/shogun/statistical_testing/internals/FeaturesUtil.cpp b/src/shogun/statistical_testing/internals/FeaturesUtil.cpp new file mode 100644 index 00000000000..b765688ea1f --- /dev/null +++ b/src/shogun/statistical_testing/internals/FeaturesUtil.cpp @@ -0,0 +1,149 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +CFeatures* FeaturesUtil::create_shallow_copy(CFeatures* other) +{ + SG_SDEBUG("Entering!\n"); + CFeatures* shallow_copy=nullptr; + if (other->get_feature_type()==F_DREAL && other->get_feature_class()==C_DENSE) + { + auto casted=static_cast<CDenseFeatures<float64_t>*>(other); + + // use the same underlying feature matrix, no ref-count + int32_t num_feats=0, num_vecs=0; + float64_t* data=casted->get_feature_matrix(num_feats, num_vecs); + SG_SDEBUG("Using underlying feature matrix with %d dimensions and %d feature vectors!\n", num_feats, num_vecs); + SGMatrix<float64_t> feats_matrix(data, num_feats, num_vecs, false); + shallow_copy=new CDenseFeatures<float64_t>(feats_matrix); + clone_subset_stack(other, shallow_copy); + } + else + SG_SNOTIMPLEMENTED; + SG_SDEBUG("Leaving!\n"); + return shallow_copy; +} + +CFeatures* FeaturesUtil::create_merged_copy(CFeatures* feats_a, CFeatures* feats_b) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(feats_a->get_feature_type()==feats_b->get_feature_type(), + "The feature types of the underlying feature objects should be same!\n"); + REQUIRE(feats_a->get_feature_class()==feats_b->get_feature_class(), + "The feature classes of the underlying feature objects should be same!\n"); + + CFeatures* merged_copy=nullptr; + + if (feats_a->get_feature_type()==F_DREAL && feats_a->get_feature_class()==C_DENSE) + { + auto casted_a=static_cast<CDenseFeatures<float64_t>*>(feats_a); + auto casted_b=static_cast<CDenseFeatures<float64_t>*>(feats_b); + + REQUIRE(casted_a->get_num_features()==casted_b->get_num_features(), + "The number of features from a (%d) has to be equal to that of b (%d)!\n", + casted_a->get_num_features(), casted_b->get_num_features()); + + SGMatrix<float64_t> data_a=casted_a->get_feature_matrix(); + SGMatrix<float64_t> data_b=casted_b->get_feature_matrix(); + ASSERT(data_a.num_rows==data_b.num_rows); + + SGMatrix<float64_t> merged(data_a.num_rows, data_a.num_cols+data_b.num_cols); + 
std::copy(data_a.data(), data_a.data()+data_a.size(), merged.data()); + std::copy(data_b.data(), data_b.data()+data_b.size(), merged.data()+data_a.size()); + + merged_copy=new CDenseFeatures(merged); + } + else + SG_SNOTIMPLEMENTED; + + SG_SDEBUG("Leaving!\n"); + return merged_copy; +} + +void FeaturesUtil::clone_subset_stack(CFeatures* src, CFeatures* dst) +{ + SG_SDEBUG("Entering!\n"); + CSubsetStack* src_subset_stack=src->get_subset_stack(); + if (src_subset_stack->has_subsets()) + { + SG_SDEBUG("Subset present, cloning the subsets!\n"); + CSubsetStack* subset_stack=static_cast(src_subset_stack->clone()); + std::stack > stack; + while (subset_stack->has_subsets()) + { + stack.push(subset_stack->get_last_subset()->get_subset_idx()); + subset_stack->remove_subset(); + } + SG_UNREF(subset_stack); + SG_SDEBUG("Number of subsets to be added is %d!\n", stack.size()); + if (stack.size()>1) + { + SGVector ref=stack.top(); + dst->add_subset(ref); + stack.pop(); + do + { + SGVector inds=stack.top(); + for (auto i=0, j=0; iadd_subset(inds); + inds=ref; + stack.pop(); + } while (!stack.empty()); + } + else + { + while (!stack.empty()) + { + dst->add_subset(stack.top()); + stack.pop(); + } + } + } + SG_UNREF(src_subset_stack); + SG_SDEBUG("Leaving!\n"); +} diff --git a/src/shogun/statistical_testing/internals/FeaturesUtil.h b/src/shogun/statistical_testing/internals/FeaturesUtil.h new file mode 100644 index 00000000000..760e5be3316 --- /dev/null +++ b/src/shogun/statistical_testing/internals/FeaturesUtil.h @@ -0,0 +1,83 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef FEATURES_UTIL_H__ +#define FEATURES_UTIL_H__ + +#include + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +/** + * @brief Class FeaturesUtil for providing generic helper methods for + * handling Shogun's feature objects for the big-testing framework. + */ +struct FeaturesUtil +{ + /** + * This creates a shallow copy of the feature object. It uses the same + * underlying feature storage as the original object, but it clones all + * the subsets. + * + * @param other The feature object whose shallow copy has to be created. + * @return A shallow copy of the feature object. + */ + static CFeatures* create_shallow_copy(CFeatures* other); + + /** + * This creates a merged copy of the two feature objects. 
+ * + * @param feats_a First feature object. + * @param feats_b Second feature object. + * @return A merged copy of the feature objects with total number of feature + * vectors of feats_a.num_vectors+feats_b.num_vectors. + */ + static CFeatures* create_merged_copy(CFeatures* feats_a, CFeatures* feats_b); + + /** + * This copies the subset stack from the src features object to the dst. + * + * @param src The source features object + * @param dst The destination features object + */ + static void clone_subset_stack(CFeatures* src, CFeatures* dst); +}; + +} + +} + +#endif // FEATURES_UTIL_H__ diff --git a/src/shogun/statistical_testing/internals/InitPerFeature.cpp b/src/shogun/statistical_testing/internals/InitPerFeature.cpp new file mode 100644 index 00000000000..b36223d5e0f --- /dev/null +++ b/src/shogun/statistical_testing/internals/InitPerFeature.cpp @@ -0,0 +1,44 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2014 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ + +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +InitPerFeature::InitPerFeature(std::unique_ptr<DataFetcher>& fetcher) : m_fetcher(fetcher) +{ +} + +InitPerFeature::~InitPerFeature() +{ +} + +InitPerFeature& InitPerFeature::operator=(CFeatures* feats) +{ + m_fetcher = std::unique_ptr<DataFetcher>(DataFetcherFactory::get_instance(feats)); + return *this; +} + +InitPerFeature::operator const CFeatures*() const +{ + return m_fetcher->m_samples; +} diff --git a/src/shogun/statistical_testing/internals/InitPerFeature.h b/src/shogun/statistical_testing/internals/InitPerFeature.h new file mode 100644 index 00000000000..8e94c993b36 --- /dev/null +++ b/src/shogun/statistical_testing/internals/InitPerFeature.h @@ -0,0 +1,53 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ + +#ifndef INIT_PER_FEATURE_H__ +#define INIT_PER_FEATURE_H__ + +#include +#include + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +class DataFetcher; +class DataManager; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class InitPerFeature +{ + friend class DataManager; +private: + explicit InitPerFeature(std::unique_ptr<DataFetcher>& fetcher); +public: + ~InitPerFeature(); + InitPerFeature& operator=(CFeatures* feats); + operator const CFeatures*() const; +private: + std::unique_ptr<DataFetcher>& m_fetcher; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // INIT_PER_FEATURE_H__ diff --git a/src/shogun/statistical_testing/internals/InitPerKernel.cpp b/src/shogun/statistical_testing/internals/InitPerKernel.cpp new file mode 100644 index 00000000000..1d80e531f56 --- /dev/null +++ b/src/shogun/statistical_testing/internals/InitPerKernel.cpp @@ -0,0 +1,43 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ + +#include +#include + +using namespace shogun; +using namespace internal; + +InitPerKernel::InitPerKernel(std::shared_ptr<CKernel>& kernel) : m_kernel(kernel) +{ +} + +InitPerKernel::~InitPerKernel() +{ +} + +InitPerKernel& InitPerKernel::operator=(CKernel* kernel) +{ + SG_REF(kernel); + m_kernel = std::shared_ptr<CKernel>(kernel, [](CKernel* ptr) { SG_UNREF(ptr); }); + return *this; +} + +InitPerKernel::operator CKernel*() const +{ + return m_kernel.get(); +} diff --git a/src/shogun/statistical_testing/internals/InitPerKernel.h b/src/shogun/statistical_testing/internals/InitPerKernel.h new file mode 100644 index 00000000000..94cc3b2778a --- /dev/null +++ b/src/shogun/statistical_testing/internals/InitPerKernel.h @@ -0,0 +1,50 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ + +#ifndef INIT_PER_KERNEL_H__ +#define INIT_PER_KERNEL_H__ + +#include +#include + +namespace shogun +{ + +class CKernel; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class InitPerKernel +{ + friend class KernelManager; +private: + explicit InitPerKernel(std::shared_ptr<CKernel>& kernel); +public: + ~InitPerKernel(); + InitPerKernel& operator=(CKernel* kernel); + operator CKernel*() const; +private: + std::shared_ptr<CKernel>& m_kernel; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // INIT_PER_KERNEL_H__ diff --git a/src/shogun/statistical_testing/internals/Kernel.h b/src/shogun/statistical_testing/internals/Kernel.h new file mode 100644 index 00000000000..aaf93c7e637 --- /dev/null +++ b/src/shogun/statistical_testing/internals/Kernel.h @@ -0,0 +1,107 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include + +#ifndef KERNEL_FUNCTOR_H__ +#define KERNEL_FUNCTOR_H__ + +namespace shogun +{ + +class CKernel; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class Kernel +{ +public: + explicit Kernel(CKernel* kernel) : m_kernel(kernel) + { + } + + inline float32_t operator()(int32_t i, int32_t j) const + { + return m_kernel->kernel(i, j); + } +private: + CKernel* m_kernel; +}; + +class SelfAdjointPrecomputedKernel +{ +public: + SelfAdjointPrecomputedKernel() : m_num_feat_vec(0) + { + } + explicit SelfAdjointPrecomputedKernel(SGVector self_adjoint_kernel_matrix) : m_num_feat_vec(0) + { + REQUIRE(self_adjoint_kernel_matrix.size()>0, "Provided kernel matrix cannot be of size 0!\n"); + m_self_adjoint_kernel_matrix=self_adjoint_kernel_matrix; + } + void precompute(CKernel* kernel) + { + REQUIRE(kernel, "Kernel instance cannot be NULL!\n"); + REQUIRE(kernel->get_num_vec_lhs()==kernel->get_num_vec_rhs(), + "Kernel instance is not symmetric (%dx%d)!\n", kernel->get_num_vec_lhs(), kernel->get_num_vec_rhs()); + m_num_feat_vec=kernel->get_num_vec_lhs(); + auto size=m_num_feat_vec*(m_num_feat_vec+1)/2; + if (m_self_adjoint_kernel_matrix.size()==0 || 
m_self_adjoint_kernel_matrix.size()!=size) + m_self_adjoint_kernel_matrix=SGVector(size); + for (auto i=0; ikernel(i, j); + } + } + } + inline float32_t operator()(int32_t i, int32_t j) const + { + ASSERT(m_num_feat_vec); + ASSERT(i>=0 && i=0 && jj) + std::swap(i, j); + auto index=i*m_num_feat_vec-i*(i+1)/2+j; + return m_self_adjoint_kernel_matrix[index]; + } +private: + SGVector m_self_adjoint_kernel_matrix; + index_t m_num_feat_vec; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} +#endif // KERNEL_FUNCTOR_H__ diff --git a/src/shogun/statistical_testing/internals/KernelManager.cpp b/src/shogun/statistical_testing/internals/KernelManager.cpp new file mode 100644 index 00000000000..03262d2c667 --- /dev/null +++ b/src/shogun/statistical_testing/internals/KernelManager.cpp @@ -0,0 +1,225 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +KernelManager::KernelManager() +{ + SG_SDEBUG("Kernel manager instance initialized!\n"); +} + +KernelManager::KernelManager(index_t num_kernels) +{ + SG_SDEBUG("Kernel manager instance initialized with %d kernels!\n", num_kernels); + m_kernels.resize(num_kernels); + m_precomputed_kernels.resize(num_kernels); + std::fill(m_kernels.begin(), m_kernels.end(), nullptr); + std::fill(m_precomputed_kernels.begin(), m_precomputed_kernels.end(), nullptr); +} + +KernelManager::~KernelManager() +{ + clear(); +} + +void KernelManager::clear() +{ + m_kernels.resize(0); + m_precomputed_kernels.resize(0); +} + +InitPerKernel KernelManager::kernel_at(size_t i) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(i(kernel, [](CKernel* ptr) { SG_UNREF(ptr); })); + m_precomputed_kernels.push_back(nullptr); + SG_SDEBUG("Leaving!\n"); +} + +const size_t KernelManager::num_kernels() const +{ + return m_kernels.size(); +} + +void KernelManager::precompute_kernel_at(size_t i) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(iget_kernel_type()!=K_CUSTOM) + { + // TODO give option to use 
different policies to precompute the kernel matrix + // this one here is default setting : use shogun's pthread parallelism to compute + // the kernel matrix. + SGMatrix kernel_matrix=kernel->get_kernel_matrix(); + m_precomputed_kernels[i]=std::shared_ptr(new CCustomKernel(kernel_matrix)); + SG_SDEBUG("Kernel type %s is precomputed and replaced internally with %s!\n", + kernel->get_name(), m_precomputed_kernels[i]->get_name()); + } + SG_SDEBUG("Leaving!\n"); +} + +void KernelManager::restore_kernel_at(size_t i) +{ + SG_SDEBUG("Entering!\n"); + REQUIRE(i0); + bool same=false; + EDistanceType distance_type=D_UNKNOWN; + for (size_t i=0; i(kernel_at(i)); + if (shift_invariant_kernel!=nullptr) + { + auto current_distance_type=shift_invariant_kernel->get_distance_type(); + if (distance_type==D_UNKNOWN) + { + distance_type=current_distance_type; + same=true; + } + else if (distance_type==current_distance_type) + same=true; + else + { + same=false; + break; + } + } + else + { + same=false; + SG_SINFO("Kernel at location %d is not of CShiftInvariantKernel type (was of %s type)!\n", + i, kernel_at(i)->get_name()); + break; + } + } + return same; +} + +CDistance* KernelManager::get_distance_instance() const +{ + REQUIRE(same_distance_type(), "Distance types for all the kernels are not the same!\n"); + + CDistance* distance=nullptr; + CShiftInvariantKernel* kernel_0=dynamic_cast(kernel_at(0)); + REQUIRE(kernel_0, "Kernel (%s) must be of CShiftInvariantKernel type!\n", kernel_at(0)->get_name()); + if (kernel_0->get_distance_type()==D_EUCLIDEAN) + { + auto euclidean_distance=new CEuclideanDistance(); + euclidean_distance->set_disable_sqrt(true); + distance=euclidean_distance; + } + else if (kernel_0->get_distance_type()==D_MANHATTAN) + { + auto manhattan_distance=new CManhattanMetric(); + distance=manhattan_distance; + } + else + { + SG_SERROR("Unsupported distance type!\n"); + } + return distance; +} + +void KernelManager::set_precomputed_distance(CCustomDistance* distance) 
const +{ + for (size_t i=0; i<num_kernels(); ++i) + { + CKernel* kernel=kernel_at(i); + CShiftInvariantKernel* shift_inv_kernel=dynamic_cast<CShiftInvariantKernel*>(kernel); + REQUIRE(shift_inv_kernel!=nullptr, "Kernel instance (was %s) must be of CShiftInvariantKernel type!\n", kernel->get_name()); + shift_inv_kernel->m_precomputed_distance=distance; + shift_inv_kernel->num_lhs=distance->get_num_vec_lhs(); + shift_inv_kernel->num_rhs=distance->get_num_vec_rhs(); + } +} + +void KernelManager::unset_precomputed_distance() const +{ + for (size_t i=0; i<num_kernels(); ++i) + { + CKernel* kernel=kernel_at(i); + CShiftInvariantKernel* shift_inv_kernel=dynamic_cast<CShiftInvariantKernel*>(kernel); + REQUIRE(shift_inv_kernel!=nullptr, "Kernel instance (was %s) must be of CShiftInvariantKernel type!\n", kernel->get_name()); + shift_inv_kernel->m_precomputed_distance=nullptr; + shift_inv_kernel->num_lhs=0; + shift_inv_kernel->num_rhs=0; + } +} diff --git a/src/shogun/statistical_testing/internals/KernelManager.h b/src/shogun/statistical_testing/internals/KernelManager.h new file mode 100644 index 00000000000..51a398bf47c --- /dev/null +++ b/src/shogun/statistical_testing/internals/KernelManager.h @@ -0,0 +1,80 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef KERNEL_MANAGER_H__ +#define KERNEL_MANAGER_H__ + +#include +#include +#include +#include + +namespace shogun +{ + +class CKernel; +class CDistance; +class CCustomDistance; +class CCustomKernel; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class KernelManager +{ +public: + KernelManager(); + explicit KernelManager(index_t num_kernels); + ~KernelManager(); + + InitPerKernel kernel_at(size_t i); + CKernel* kernel_at(size_t i) const; + + void push_back(CKernel* kernel); + const size_t num_kernels() const; + + void precompute_kernel_at(size_t i); + void restore_kernel_at(size_t i); + + void clear(); + bool same_distance_type() const; + CDistance* get_distance_instance() const; + void set_precomputed_distance(CCustomDistance* distance) const; + void unset_precomputed_distance() const; +private: + std::vector > m_kernels; + std::vector > m_precomputed_kernels; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // KERNEL_MANAGER_H__ diff --git a/src/shogun/statistical_testing/internals/NextSamples.cpp b/src/shogun/statistical_testing/internals/NextSamples.cpp new file mode 100644 index 00000000000..cfaa1adae4f --- /dev/null +++ 
b/src/shogun/statistical_testing/internals/NextSamples.cpp @@ -0,0 +1,75 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include + +using namespace shogun; +using namespace internal; + +NextSamples::NextSamples(index_t num_distributions) : m_num_blocks(0) +{ + next_samples.resize(num_distributions); +} + +NextSamples& NextSamples::operator=(const NextSamples& other) +{ + clear(); + m_num_blocks=other.m_num_blocks; + next_samples=other.next_samples; + return *this; +} + +NextSamples::~NextSamples() +{ + clear(); +} + +std::vector& NextSamples::operator[](size_t i) +{ + REQUIRE(i>=0 && i& NextSamples::operator[](size_t i) const +{ + REQUIRE(i>=0 && i type; + return std::any_of(next_samples.cbegin(), next_samples.cend(), [](type& f) { return f.size()==0; }); +} + +void NextSamples::clear() +{ + typedef std::vector type; + std::for_each(next_samples.begin(), next_samples.end(), [](type& f) { f.clear(); }); + next_samples.clear(); +} diff --git a/src/shogun/statistical_testing/internals/NextSamples.h b/src/shogun/statistical_testing/internals/NextSamples.h new file mode 100644 index 00000000000..04980496f09 --- /dev/null +++ b/src/shogun/statistical_testing/internals/NextSamples.h @@ -0,0 +1,116 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. 
+ * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#ifndef NEXT_SAMPLES_H__ +#define NEXT_SAMPLES_H__ + +#include +#include +#include +#include + +namespace shogun +{ + +class CFeatures; + +namespace internal +{ + +/** + * @brief class NextSamples is the return type for next() call in DataManager. + * If there are no more samples (from any one of the distributions), an empty + * instance of NextSamples is supposed to be returned. The caller can verify + * this by calling the empty() method. Otherwise, operator[] with an + * appropriate index gives the samples from that distribution. + * If an inappropriate index is provided, e.g. index 2 for a two-sample test, + * a runtime exception is thrown. + * + * Example usage: + * @code + * NextSamples next_samples(2); + * next_samples[0] = fetchers[0].next(); + * next_samples[1] = fetchers[1].next(); + * if (!next_samples.empty()) + * { + * auto first = next_samples[0]; + * auto second = next_samples[1]; + * auto third = next_samples[2]; // Runtime Error + * } + * @endcode + */ +class NextSamples +{ + friend class DataManager; +private: + NextSamples(index_t num_distributions); +public: + /** + * Assignment operator. Clears the current blocks. 
+ */ + NextSamples& operator=(const NextSamples& other); + + /** + * Destructor + */ + ~NextSamples(); + + /** + * Returns the blocks (of samples) fetched in the current burst from a + * specified distribution. + * + * @param i index of the distribution whose samples are requested + * @return a vector of fetched blocks of features from the specified distribution + */ + std::vector& operator[](size_t i); + + /** + * Const version of the above. This is called when a const instance of NextSamples + * is returned. + */ + const std::vector& operator[](size_t i) const; + + /** + * @return number of blocks fetched from each of the distributions. It is assumed + * that this number is the same for all the distributions. + */ + const index_t num_blocks() const; + + /** + * Returns true if any of the distributions fetched 0 blocks (checked from the + * size of the vector for that distribution). + * + * @return whether at least one of the distributions has no blocks of samples + * in this instance + */ + const bool empty() const; + + /** + * Method that clears the memory occupied by the feature objects inside. + */ + void clear(); +private: + index_t m_num_blocks; + std::vector > next_samples; +}; + +} + +} + +#endif // NEXT_SAMPLES_H__ diff --git a/src/shogun/statistical_testing/internals/StreamingDataFetcher.cpp b/src/shogun/statistical_testing/internals/StreamingDataFetcher.cpp new file mode 100644 index 00000000000..378b9c8c9e8 --- /dev/null +++ b/src/shogun/statistical_testing/internals/StreamingDataFetcher.cpp @@ -0,0 +1,121 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +StreamingDataFetcher::StreamingDataFetcher(CStreamingFeatures* samples) +: DataFetcher(), parser_running(false) +{ + REQUIRE(samples!=nullptr, "Samples cannot be null!\n"); + SG_REF(samples); + m_samples=std::shared_ptr(samples, [](CStreamingFeatures* ptr) { SG_UNREF(ptr); }); + m_num_samples=0; +} + +StreamingDataFetcher::~StreamingDataFetcher() +{ + end(); +} + +void StreamingDataFetcher::set_num_samples(index_t num_samples) +{ + m_num_samples=num_samples; +} + +void StreamingDataFetcher::shuffle_features() +{ +} + +void StreamingDataFetcher::unshuffle_features() +{ +} + +void StreamingDataFetcher::use_fold(index_t i) +{ +} + +void StreamingDataFetcher::init_active_subset() +{ +} + +index_t StreamingDataFetcher::get_num_samples() const +{ + if (train_test_mode) + { + if (train_mode) + return m_num_samples*train_test_ratio/(train_test_ratio+1); + else + return m_num_samples/(train_test_ratio+1); + } + return m_num_samples; +} + +void StreamingDataFetcher::start() +{ + REQUIRE(get_num_samples()>0, "Number of samples is not set! 
It is MANDATORY for streaming features!\n"); + if (m_block_details.m_full_data || m_block_details.m_blocksize>get_num_samples()) + { + SG_SINFO("Fetching entire data (%d samples)!\n", get_num_samples()); + m_block_details.with_blocksize(get_num_samples()); + } + m_block_details.m_total_num_blocks=get_num_samples()/m_block_details.m_blocksize; + m_block_details.m_next_block_index=0; + if (!parser_running) + { + m_samples->start_parser(); + parser_running=true; + } +} + +CFeatures* StreamingDataFetcher::next() +{ + CFeatures* next_samples=nullptr; + // figure out how many samples to fetch in this burst + auto num_already_fetched=m_block_details.m_next_block_index*m_block_details.m_blocksize; + auto num_more_samples=get_num_samples()-num_already_fetched; + if (num_more_samples>0) + { + auto num_samples_this_burst=std::min(m_block_details.m_max_num_samples_per_burst, num_more_samples); + next_samples=m_samples->get_streamed_features(num_samples_this_burst); + m_block_details.m_next_block_index+=m_block_details.m_num_blocks_per_burst; + } + return next_samples; +} + +void StreamingDataFetcher::reset() +{ + m_block_details.m_next_block_index=0; + m_samples->reset_stream(); +} + +void StreamingDataFetcher::end() +{ + if (parser_running) + { + m_samples->end_parser(); + parser_running=false; + } +} diff --git a/src/shogun/statistical_testing/internals/StreamingDataFetcher.h b/src/shogun/statistical_testing/internals/StreamingDataFetcher.h new file mode 100644 index 00000000000..370aa8390ee --- /dev/null +++ b/src/shogun/statistical_testing/internals/StreamingDataFetcher.h @@ -0,0 +1,68 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2016 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include +#include + +#ifndef STREMING_DATA_FETCHER_H__ +#define STREMING_DATA_FETCHER_H__ + +namespace shogun +{ + +class CStreamingFeatures; + +namespace internal +{ + +class DataManager; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class StreamingDataFetcher : public DataFetcher +{ + friend class DataManager; +public: + StreamingDataFetcher(CStreamingFeatures* samples); + virtual ~StreamingDataFetcher(); + void set_num_samples(index_t num_samples); + + virtual void shuffle_features(); + virtual void unshuffle_features(); + + virtual void use_fold(index_t i); + virtual void init_active_subset(); + + virtual void start(); + virtual CFeatures* next(); + virtual void reset(); + virtual void end(); + + virtual index_t get_num_samples() const; + virtual const char* get_name() const + { + return "StreamingDataFetcher"; + } +private: + std::shared_ptr m_samples; + bool parser_running; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} +#endif // STREMING_DATA_FETCHER_H__ diff --git a/src/shogun/statistical_testing/internals/TestTypes.h b/src/shogun/statistical_testing/internals/TestTypes.h new file mode 100644 index 00000000000..47f786f21cd --- /dev/null +++ b/src/shogun/statistical_testing/internals/TestTypes.h @@ -0,0 +1,98 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef TEST_TYPES_H__ +#define TEST_TYPES_H__ + +namespace shogun +{ + +namespace internal +{ + +/** + * @brief Meta test-type for 1-distribution statistical tests. + */ +struct OneDistributionTest +{ + /** defines the number of feature objects required */ + static constexpr index_t num_feats = 1; +}; + +/** + * @brief Meta test-type for 2-distribution statistical tests. + */ +struct TwoDistributionTest +{ + /** defines the number of feature objects required */ + static constexpr index_t num_feats = 2; +}; + +/** + * @brief Meta test-type for 3-distribution statistical tests. 
+ */ +struct ThreeDistributionTest +{ + /** defines the number of feature objects required */ + static constexpr index_t num_feats = 3; +}; + +/** + * @brief Meta test-type for goodness-of-fit test. + */ +struct GoodnessOfFitTest : OneDistributionTest +{ + /** defines the number of kernel objects required */ + static constexpr index_t num_kernels = 1; +}; + +/** + * @brief Meta test-type for two-sample test. + */ +struct TwoSampleTest : TwoDistributionTest +{ + /** defines the number of kernel objects required */ + static constexpr index_t num_kernels = 1; +}; + +/** + * @brief Meta test-type for independence test. + */ +struct IndependenceTest : TwoDistributionTest +{ + /** defines the number of kernel objects required */ + static constexpr index_t num_kernels = 2; +}; + +} + +} + +#endif // TEST_TYPES_H__ diff --git a/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h b/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h new file mode 100644 index 00000000000..ab599a2f5a8 --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h @@ -0,0 +1,261 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef COMPUTE_MMD_H_ +#define COMPUTE_MMD_H_ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace shogun +{ + +namespace internal +{ + +namespace mmd +{ + +struct terms_t +{ + std::array term{}; + std::array diag{}; +}; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +/** + * @brief Struct ComputeMMD computes the MMD statistic from a kernel functor, + * a precomputed kernel matrix, or the kernels held by a KernelManager. The + * member m_stype selects between the biased, unbiased and incomplete estimates. 
+ */ +struct ComputeMMD +{ + ComputeMMD() : m_n_x(0), m_n_y(0), m_stype(ST_UNBIASED_FULL) + { + } + + template + float32_t operator()(const Kernel& kernel) const + { + ASSERT(m_n_x>0 && m_n_y>0); + const index_t size=m_n_x+m_n_y; + terms_t terms; + for (auto i=0; i + float32_t operator()(const SGMatrix& kernel_matrix) const + { + ASSERT(m_n_x>0 && m_n_y>0); + const index_t size=m_n_x+m_n_y; + ASSERT(kernel_matrix.num_rows==size && kernel_matrix.num_cols==size); + + typedef Eigen::Matrix MatrixXt; + typedef Eigen::Block > BlockXt; + + Eigen::Map map(kernel_matrix.matrix, kernel_matrix.num_rows, kernel_matrix.num_cols); + + const BlockXt& b_x=map.block(0, 0, m_n_x, m_n_x); + const BlockXt& b_y=map.block(m_n_x, m_n_x, m_n_y, m_n_y); + const BlockXt& b_xy=map.block(m_n_x, 0, m_n_y, m_n_x); + + terms_t terms; + terms.diag[0]=b_x.diagonal().sum(); + terms.diag[1]=b_y.diagonal().sum(); + terms.diag[2]=b_xy.diagonal().sum(); + + terms.term[0]=(b_x.sum()-terms.diag[0])/2+terms.diag[0]; + terms.term[1]=(b_y.sum()-terms.diag[1])/2+terms.diag[1]; + terms.term[2]=b_xy.sum(); + + return compute(terms); + } + + SGVector operator()(const KernelManager& kernel_mgr) const + { + ASSERT(m_n_x>0 && m_n_y>0); + std::vector terms(kernel_mgr.num_kernels()); + const index_t size=m_n_x+m_n_y; + for (auto j=0; jkernel(i, j); + add_term_lower(terms[k], kernel, i, j); + } + } + } + + SGVector result(kernel_mgr.num_kernels()); + for (size_t k=0; k + inline void add_term_lower(terms_t& terms, T kernel_value, index_t i, index_t j) const + { + ASSERT(m_n_x>0 && m_n_y>0); + if (i=j) + { + SG_SDEBUG("Adding Kernel(%d, %d)=%f to term_0!\n", i, j, kernel_value); + terms.term[0]+=kernel_value; + if (i==j) + terms.diag[0]+=kernel_value; + } + else if (i>=m_n_x && j>=m_n_x && i>=j) + { + SG_SDEBUG("Adding Kernel(%d, %d)=%f to term_1!\n", i, j, kernel_value); + terms.term[1]+=kernel_value; + if (i==j) + terms.diag[1]+=kernel_value; + } + else if (i>=m_n_x && j + inline void add_term_upper(terms_t& terms, 
T kernel_value, index_t i, index_t j) const + { + ASSERT(m_n_x>0 && m_n_y>0); + if (i=m_n_x && j>=m_n_x && i<=j) + { + SG_SDEBUG("Adding Kernel(%d, %d)=%f to term_1!\n", i, j, kernel_value); + terms.term[1]+=kernel_value; + if (i==j) + terms.diag[1]+=kernel_value; + } + else if (i=m_n_x) + { + SG_SDEBUG("Adding Kernel(%d, %d)=%f to term_2!\n", i, j, kernel_value); + terms.term[2]+=kernel_value; + if (i+m_n_x==j) + terms.diag[2]+=kernel_value; + } + } + + inline float64_t compute(terms_t& terms) const + { + ASSERT(m_n_x>0 && m_n_y>0); + terms.term[0]=2*(terms.term[0]-terms.diag[0]); + terms.term[1]=2*(terms.term[1]-terms.diag[1]); + SG_SDEBUG("term_0 sum (without diagonal) = %f!\n", terms.term[0]); + SG_SDEBUG("term_1 sum (without diagonal) = %f!\n", terms.term[1]); + if (m_stype!=ST_BIASED_FULL) + { + terms.term[0]/=m_n_x*(m_n_x-1); + terms.term[1]/=m_n_y*(m_n_y-1); + } + else + { + terms.term[0]+=terms.diag[0]; + terms.term[1]+=terms.diag[1]; + SG_SDEBUG("term_0 sum (with diagonal) = %f!\n", terms.term[0]); + SG_SDEBUG("term_1 sum (with diagonal) = %f!\n", terms.term[1]); + terms.term[0]/=m_n_x*m_n_x; + terms.term[1]/=m_n_y*m_n_y; + } + SG_SDEBUG("term_0 (normalized) = %f!\n", terms.term[0]); + SG_SDEBUG("term_1 (normalized) = %f!\n", terms.term[1]); + + SG_SDEBUG("term_2 sum (with diagonal) = %f!\n", terms.term[2]); + if (m_stype==ST_UNBIASED_INCOMPLETE) + { + terms.term[2]-=terms.diag[2]; + SG_SDEBUG("term_2 sum (without diagonal) = %f!\n", terms.term[2]); + terms.term[2]/=m_n_x*(m_n_x-1); + } + else + terms.term[2]/=m_n_x*m_n_y; + SG_SDEBUG("term_2 (normalized) = %f!\n", terms.term[2]); + + auto result=terms.term[0]+terms.term[1]-2*terms.term[2]; + SG_SDEBUG("result = %f!\n", result); + return result; + } + + index_t m_n_x; + index_t m_n_y; + EStatisticType m_stype; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} +#endif // COMPUTE_MMD_H_ diff --git a/src/shogun/statistical_testing/internals/mmd/CrossValidationMMD.h 
b/src/shogun/statistical_testing/internals/mmd/CrossValidationMMD.h new file mode 100644 index 00000000000..390ef6ad93f --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/CrossValidationMMD.h @@ -0,0 +1,241 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef CROSS_VALIDATION_MMD_H_ +#define CROSS_VALIDATION_MMD_H_ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using std::unique_ptr; + +namespace shogun +{ + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct CrossValidationMMD : PermutationMMD +{ + CrossValidationMMD(index_t n_x, index_t n_y, index_t num_folds, index_t num_null_samples) + { + ASSERT(n_x>0 && n_y>0); + ASSERT(num_folds>0); + ASSERT(num_null_samples>0); + + m_n_x=n_x; + m_n_y=n_y; + m_num_folds=num_folds; + m_num_null_samples=num_null_samples; + m_num_runs=DEFAULT_NUM_RUNS; + m_alpha=DEFAULT_ALPHA; + + init(); + } + + void operator()(const KernelManager& kernel_mgr) + { + REQUIRE(m_rejections.num_rows==m_num_runs*m_num_folds, + "Number of rows in the measure matrix (was %d), has to be equal to %d*%d = %d!\n", + m_rejections.num_rows, m_num_runs, m_num_folds, m_num_runs*m_num_folds); + REQUIRE(size_t(m_rejections.num_cols)==kernel_mgr.num_kernels(), + "Number of columns in the measure matrix (was %d), has to be equal to the number of kernels (%d)!\n", + m_rejections.num_cols, kernel_mgr.num_kernels()); + + const index_t size=m_n_x+m_n_y; + const index_t orig_n_x=m_n_x; + const index_t orig_n_y=m_n_y; + SGVector null_samples(m_num_null_samples); + SGVector precomputed_km(size*(size+1)/2); + + for (size_t k=0; kkernel(i, j); + } + } + + for (auto current_run=0; current_runbuild_subsets(); + m_kfold_y->build_subsets(); + for (auto current_fold=0; current_fold xy_wrapper(m_xy_inds.data(), m_xy_inds.size(), false); + m_stack->add_subset(xy_wrapper); + + m_permuted_inds.resize(m_xy_inds.size()); + SGVector permutation_wrapper(m_permuted_inds.data(), m_permuted_inds.size(), false); + for (auto n=0; nadd_subset(permutation_wrapper); + SGVector inds=m_stack->get_last_subset()->get_subset_idx(); + m_stack->remove_subset(); + + std::fill(m_inverted_permuted_inds[n].data(), m_inverted_permuted_inds[n].data()+size, -1); + for (int 
idx=0; idxremove_subset(); + + terms_t terms; + for (auto i=0; i m_kfold_x; + unique_ptr m_kfold_y; + unique_ptr m_stack; + + std::vector m_xy_inds; + SGVector m_inverted_inds; + SGMatrix m_rejections; + + void init() + { + SGVector dummy_labels_x(m_n_x); + SGVector dummy_labels_y(m_n_y); + + auto instance_x=new CCrossValidationSplitting(new CBinaryLabels(dummy_labels_x), m_num_folds); + auto instance_y=new CCrossValidationSplitting(new CBinaryLabels(dummy_labels_y), m_num_folds); + m_kfold_x=unique_ptr(instance_x); + m_kfold_y=unique_ptr(instance_y); + + m_stack=unique_ptr(new CSubsetStack()); + + const index_t size=m_n_x+m_n_y; + m_inverted_inds=SGVector(size); + + m_inverted_permuted_inds.resize(m_num_null_samples); + for (auto i=0; i x_inds=m_kfold_x->generate_subset_inverse(current_fold); + SGVector y_inds=m_kfold_y->generate_subset_inverse(current_fold); + std::for_each(y_inds.data(), y_inds.data()+y_inds.size(), [this](index_t& val) { val += m_n_x; }); + + m_n_x=x_inds.size(); + m_n_y=y_inds.size(); + + m_xy_inds.resize(x_inds.size()+y_inds.size()); + std::copy(x_inds.data(), x_inds.data()+x_inds.size(), m_xy_inds.data()); + std::copy(y_inds.data(), y_inds.data()+y_inds.size(), m_xy_inds.data()+x_inds.size()); + } +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} +#endif // CROSS_VALIDATION_MMD_H_ diff --git a/src/shogun/statistical_testing/internals/mmd/PermutationMMD.h b/src/shogun/statistical_testing/internals/mmd/PermutationMMD.h new file mode 100644 index 00000000000..69c48ef05cd --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/PermutationMMD.h @@ -0,0 +1,254 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef PERMUTATION_MMD_H_ +#define PERMUTATION_MMD_H_ + +#include +#include +#include +#include +#include +#include + +namespace shogun +{ + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct PermutationMMD : ComputeMMD +{ + PermutationMMD() : m_save_inds(false) + { + } + + template + SGVector operator()(const Kernel& kernel) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_num_null_samples>0); + precompute_permutation_inds(); + + const index_t size=m_n_x+m_n_y; + SGVector null_samples(m_num_null_samples); +#pragma omp parallel for + for (auto n=0; n=inverted_col) + add_term_lower(terms, kernel(i, j), inverted_row, inverted_col); + else + add_term_lower(terms, kernel(i, j), inverted_col, inverted_row); + } + } + null_samples[n]=compute(terms); + SG_SDEBUG("null_samples[%d] = %f!\n", n, null_samples[n]); + } + return null_samples; + } + + SGMatrix operator()(const KernelManager& kernel_mgr) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_num_null_samples>0); + precompute_permutation_inds(); + + const index_t size=m_n_x+m_n_y; + SGMatrix null_samples(m_num_null_samples, kernel_mgr.num_kernels()); + SGVector km(size*(size+1)/2); + for (size_t k=0; kkernel(i, j); + } + } + +#pragma omp parallel for + for (auto n=0; n + float64_t p_value(const Kernel& kernel) + { + auto statistic=ComputeMMD::operator()(kernel); + auto null_samples=operator()(kernel); + return compute_p_value(null_samples, statistic); + } + + SGVector p_value(const KernelManager& kernel_mgr) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_num_null_samples>0); + precompute_permutation_inds(); + + const index_t size=m_n_x+m_n_y; + SGVector null_samples(m_num_null_samples); + SGVector result(kernel_mgr.num_kernels()); + + SGVector km(size*(size+1)/2); + for (size_t k=0; kkernel(i, j); + add_term_upper(terms, km[index], i, j); + } + } + float32_t statistic=compute(terms); + SG_SDEBUG("Kernel(%d): statistic=%f\n", k, statistic); + +#pragma omp parallel for + for (auto n=0; 
n0); + allocate_permutation_inds(); + SGVector sg_wrapper(m_permuted_inds.data(), m_permuted_inds.size(), false); + for (auto n=0; n& null_samples, float32_t statistic) const + { + std::sort(null_samples.data(), null_samples.data()+null_samples.size()); + float64_t idx=null_samples.find_position_to_insert(statistic); + return 1.0-idx/null_samples.size(); + } + + inline void allocate_permutation_inds() + { + const index_t size=m_n_x+m_n_y; + if (m_permuted_inds.size()!=size_t(size)) + m_permuted_inds.resize(size); + + if (m_inverted_permuted_inds.size()!=size_t(m_num_null_samples)) + m_inverted_permuted_inds.resize(m_num_null_samples); + + for (auto i=0; i(size, m_num_null_samples); + } + + index_t m_num_null_samples; + bool m_save_inds; + std::vector m_permuted_inds; + std::vector > m_inverted_permuted_inds; + SGMatrix m_all_inds; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} + +#endif // PERMUTATION_MMD_H_ diff --git a/src/shogun/statistical_testing/internals/mmd/VarianceH0.h b/src/shogun/statistical_testing/internals/mmd/VarianceH0.h new file mode 100644 index 00000000000..e67344f9e5c --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/VarianceH0.h @@ -0,0 +1,81 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef VARIANCE_H0__H_ +#define VARIANCE_H0__H_ + +#include +#include +#include +#include + +namespace shogun +{ + +template class SGMatrix; + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct VarianceH0 +{ + template + T operator()(const SGMatrix& kernel_matrix) + { + typedef Eigen::Matrix MatrixXt; + typedef Eigen::Matrix VectorXt; + + Eigen::Map map(kernel_matrix.matrix, kernel_matrix.num_rows, kernel_matrix.num_cols); + index_t B=map.rows(); + + VectorXt diag=map.diagonal(); + map.diagonal().setZero(); + + auto term_1=CMath::sq(map.array().sum()/B/(B-1)); + auto term_2=map.array().square().sum()/B/(B-1); + auto term_3=(map.colwise().sum()/(B-1)).array().square().sum()/B; + + map.diagonal()=diag; + + auto variance_estimate=2*(term_1+term_2-2*term_3); + return variance_estimate; + } +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} + +#endif // VARIANCE_H0__H_ diff --git a/src/shogun/statistical_testing/internals/mmd/VarianceH1.h b/src/shogun/statistical_testing/internals/mmd/VarianceH1.h new file mode 100644 index 00000000000..67fbcde012e --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/VarianceH1.h @@ -0,0 +1,269 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef VARIANCE_H1__H_ +#define VARIANCE_H1__H_ + +#include +#include +#include +#include +#include +#include +#include +#include + +using std::vector; + +namespace shogun +{ + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct VarianceH1 +{ + VarianceH1() : m_lambda(1E-5), m_free_terms(true) + { + } + + void init_terms() + { + m_sum_x=0; + m_sum_y=0; + m_sum_xy=0; + m_sum_sq_x=0; + m_sum_sq_y=0; + m_sum_sq_xy=0; + + m_sum_colwise_x.resize(m_n_x); + m_sum_colwise_y.resize(m_n_y); + m_sum_rowwise_xy.resize(m_n_x); + m_sum_colwise_xy.resize(m_n_y); + std::fill(m_sum_colwise_x.begin(), m_sum_colwise_x.end(), 0); + std::fill(m_sum_colwise_y.begin(), m_sum_colwise_y.end(), 0); + std::fill(m_sum_rowwise_xy.begin(), m_sum_rowwise_xy.end(), 0); + std::fill(m_sum_colwise_xy.begin(), m_sum_colwise_xy.end(), 0); + + if (m_second_order_terms.rows()==m_n_x && m_second_order_terms.cols()==m_n_x) + m_second_order_terms.setZero(); + else + m_second_order_terms=Eigen::MatrixXd::Zero(m_n_x, m_n_x); + } + + void free_terms() + { + if (m_free_terms) + { + m_sum_colwise_x.resize(0); + m_sum_colwise_y.resize(0); + m_sum_rowwise_xy.resize(0); + m_sum_colwise_xy.resize(0); + m_second_order_terms=Eigen::MatrixXd::Zero(0, 0); + } + } + + template + void add_terms(T kernel_value, index_t i, index_t j) + { + if (i=m_n_x && j>=m_n_x) + { + m_sum_y+=2*kernel_value; + m_sum_sq_y+=2*kernel_value*kernel_value; + m_sum_colwise_y[i-m_n_x]+=kernel_value; + m_sum_colwise_y[j-m_n_x]+=kernel_value; + m_second_order_terms(i-m_n_x, j-m_n_x)+=kernel_value; + m_second_order_terms(j-m_n_x, i-m_n_x)+=kernel_value; + } + else if (i=m_n_x) + { + m_sum_xy+=kernel_value; + m_sum_sq_xy+=kernel_value*kernel_value; + if (j-i!=m_n_x) + { + m_second_order_terms(i, j-m_n_x)-=kernel_value; + m_second_order_terms(j-m_n_x, i)-=kernel_value; + } + m_sum_rowwise_xy[i]+=kernel_value; + m_sum_colwise_xy[j-m_n_x]+=kernel_value; + } + } + + float64_t compute_variance_estimate() + { + 
Eigen::Map map_sum_colwise_x(m_sum_colwise_x.data(), m_sum_colwise_x.size()); + Eigen::Map map_sum_colwise_y(m_sum_colwise_y.data(), m_sum_colwise_y.size()); + Eigen::Map map_sum_rowwise_xy(m_sum_rowwise_xy.data(), m_sum_rowwise_xy.size()); + Eigen::Map map_sum_colwise_xy(m_sum_colwise_xy.data(), m_sum_colwise_xy.size()); + + auto t_0=(map_sum_colwise_x.dot(map_sum_colwise_x)-m_sum_sq_x)/m_n_x/(m_n_x-1)/(m_n_x-2); + auto t_1=CMath::sq(m_sum_x/m_n_x/(m_n_x-1)); + + auto t_2=map_sum_colwise_x.dot(map_sum_rowwise_xy)*2/m_n_x/(m_n_x-1)/m_n_y; + auto t_3=m_sum_x*m_sum_xy*2/m_n_x/m_n_x/(m_n_x-1)/m_n_y; + + auto t_4=(map_sum_colwise_y.dot(map_sum_colwise_y)-m_sum_sq_y)/m_n_y/(m_n_y-1)/(m_n_y-2); + auto t_5=CMath::sq(m_sum_y/m_n_y/(m_n_y-1)); + + auto t_6=map_sum_colwise_y.dot(map_sum_colwise_xy)*2/m_n_y/(m_n_y-1)/m_n_x; + auto t_7=m_sum_y*m_sum_xy*2/m_n_y/m_n_y/(m_n_y-1)/m_n_x; + + auto t_8=(map_sum_rowwise_xy.dot(map_sum_rowwise_xy)-m_sum_sq_xy)/m_n_y/(m_n_y-1)/m_n_x; + auto t_9=2*CMath::sq(m_sum_xy/m_n_x/m_n_y); + auto t_10=(map_sum_colwise_xy.dot(map_sum_colwise_xy)-m_sum_sq_xy)/m_n_x/(m_n_x-1)/m_n_y; + + auto var_first=(t_0-t_1)-t_2+t_3+(t_4-t_5)-t_6+t_7+(t_8-t_9+t_10); + var_first*=4.0*(m_n_x-2)/m_n_x/(m_n_x-1); + + auto var_second=2.0/m_n_x/m_n_y/(m_n_x-1)/(m_n_y-1)*m_second_order_terms.array().square().sum(); + + auto variance_estimate=var_first+var_second; + if (variance_estimate<0) + variance_estimate=var_second; + + return variance_estimate; + } + + template + float64_t operator()(const Kernel& kernel) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_n_x==m_n_y); + const index_t size=m_n_x+m_n_y; + init_terms(); + for (auto j=0; j operator()(const KernelManager& kernel_mgr) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_n_x==m_n_y); + ASSERT(kernel_mgr.num_kernels()>0); + + const index_t size=m_n_x+m_n_y; + SGVector result(kernel_mgr.num_kernels()); + SelfAdjointPrecomputedKernel kernel_functor(SGVector(size*(size+1)/2)); + for (size_t k=0; k test_power(const 
KernelManager& kernel_mgr) + { + ASSERT(m_n_x>0 && m_n_y>0); + ASSERT(m_n_x==m_n_y); + ASSERT(kernel_mgr.num_kernels()>0); + ComputeMMD compute_mmd_job; + compute_mmd_job.m_n_x=m_n_x; + compute_mmd_job.m_n_y=m_n_y; + compute_mmd_job.m_stype=ST_UNBIASED_FULL; + + const index_t size=m_n_x+m_n_y; + SGVector result(kernel_mgr.num_kernels()); + SelfAdjointPrecomputedKernel kernel_functor(SGVector(size*(size+1)/2)); + for (size_t k=0; k m_sum_colwise_x; + vector m_sum_colwise_y; + vector m_sum_rowwise_xy; + vector m_sum_colwise_xy; + Eigen::MatrixXd m_second_order_terms; + + bool m_free_terms; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} + +#endif // VARIANCE_H1__H_ diff --git a/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.cpp b/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.cpp new file mode 100644 index 00000000000..7e9fcb162f4 --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.cpp @@ -0,0 +1,45 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2014 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; + +float32_t WithinBlockDirect::operator()(const SGMatrix& km) +{ + Eigen::Map map(km.matrix, km.num_rows, km.num_cols); + index_t B=km.num_rows; + + Eigen::VectorXf diag=map.diagonal(); + map.diagonal().setZero(); + + auto term_1=map.array().square().sum(); + auto term_2=CMath::sq(map.array().sum()); + auto term_3=(map*map).array().sum(); + + map.diagonal()=diag; + + auto variance_estimate=2*(term_1+term_2/(B-1)/(B-2)-2*term_3/(B-2))/B/(B-3); + return variance_estimate; +} diff --git a/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.h b/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.h new file mode 100644 index 00000000000..253f7d3c7e0 --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/WithinBlockDirect.h @@ -0,0 +1,49 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2014 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#ifndef WITHIN_BLOCK_DIRECT_H_ +#define WITHIN_BLOCK_DIRECT_H_ + +#include + +namespace shogun +{ + +template class SGMatrix; +template class CGPUMatrix; + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +struct WithinBlockDirect +{ + typedef float32_t return_type; + return_type operator()(const SGMatrix& kernel_matrix); +// return_type operator()(const CGPUMatrix& kernel_matrix); +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} + +#endif // WITHIN_BLOCK_DIRECT_H_ diff --git a/src/shogun/statistical_testing/internals/mmd/WithinBlockPermutation.cpp b/src/shogun/statistical_testing/internals/mmd/WithinBlockPermutation.cpp new file mode 100644 index 00000000000..26e7d794113 --- /dev/null +++ b/src/shogun/statistical_testing/internals/mmd/WithinBlockPermutation.cpp @@ -0,0 +1,117 @@ +/* + * Restructuring Shogun's statistical hypothesis testing framework. + * Copyright (C) 2014 Soumyajit De + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; + +WithinBlockPermutation::WithinBlockPermutation(index_t nx, index_t ny, EStatisticType type) +: n_x(nx), n_y(ny), stype(type), terms() +{ + SG_SDEBUG("number of samples are %d and %d!\n", n_x, n_y); + permuted_inds=SGVector(n_x+n_y); + inverted_permuted_inds=SGVector(permuted_inds.vlen); +} + +void WithinBlockPermutation::add_term(float32_t val, index_t i, index_t j) +{ + if (i=n_x && j>=n_x && i<=j) + { + SG_SDEBUG("Adding Kernel(%d,%d)=%f to term_1!\n", i, j, val); + terms.term[1]+=val; + if (i==j) + terms.diag[1]+=val; + } + else if (i>=n_x && j& km) +{ + SG_SDEBUG("Entering!\n"); + + std::iota(permuted_inds.vector, permuted_inds.vector+permuted_inds.vlen, 0); + CMath::permute(permuted_inds); + for (int i=0; i. + */ + +#ifndef WITHIN_BLOCK_PERMUTATION_H_ +#define WITHIN_BLOCK_PERMUTATION_H_ + +#include +#include +#include + +namespace shogun +{ + +template class SGMatrix; +template class CGPUMatrix; + +namespace internal +{ + +namespace mmd +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class WithinBlockPermutation +{ + typedef float32_t return_type; +public: + WithinBlockPermutation(index_t, index_t, EStatisticType); + return_type operator()(const SGMatrix& kernel_matrix); +// return_type operator()(const CGPUMatrix& kernel_matrix); +private: + void add_term(float32_t, index_t, index_t); + + const index_t n_x; + const index_t n_y; + const EStatisticType stype; + SGVector permuted_inds; + SGVector inverted_permuted_inds; + struct terms_t + { + float32_t term[3]; + float32_t diag[3]; + }; + terms_t terms; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +} + +#endif // WITHIN_BLOCK_PERMUTATION_H_ diff --git a/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.cpp b/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.cpp new file mode 100644 index 00000000000..b6382b3058f 
--- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.cpp @@ -0,0 +1,265 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +struct CKernelSelectionStrategy::Self +{ + Self(); + + KernelManager kernel_mgr; + std::unique_ptr policy; + + EKernelSelectionMethod method; + bool weighted; + index_t num_runs; + index_t num_folds; + float64_t alpha; + + void init_policy(CMMD* estimator); + + const static EKernelSelectionMethod default_method; + const static bool default_weighted; + const static index_t default_num_runs; + const static index_t default_num_folds; + const static float64_t default_alpha; +}; + +const EKernelSelectionMethod CKernelSelectionStrategy::Self::default_method=KSM_AUTO; +const bool CKernelSelectionStrategy::Self::default_weighted=false; +const index_t CKernelSelectionStrategy::Self::default_num_runs=10; +const index_t CKernelSelectionStrategy::Self::default_num_folds=3; +const float64_t CKernelSelectionStrategy::Self::default_alpha=0.05; + +CKernelSelectionStrategy::Self::Self() : policy(nullptr), method(default_method), + weighted(default_weighted), num_runs(default_num_runs), num_folds(default_num_folds), alpha(default_alpha) +{ +} + +void CKernelSelectionStrategy::Self::init_policy(CMMD* estimator) +{ + switch (method) + { + case KSM_MEDIAN_HEURISTIC: + { + REQUIRE(!weighted, "Weighted kernel selection is not possible with MEDIAN_HEURISTIC!\n"); + policy=std::unique_ptr(new MedianHeuristic(kernel_mgr, estimator)); + } + break; + case KSM_CROSS_VALIDATION: + { + REQUIRE(!weighted, "Weighted kernel selection is not possible with CROSS_VALIDATION!\n"); + policy=std::unique_ptr(new MaxCrossValidation(kernel_mgr, estimator, + num_runs, num_folds, alpha)); + } + break; + case KSM_MAXIMIZE_MMD: + { + if (weighted) + policy=std::unique_ptr(new WeightedMaxMeasure(kernel_mgr, estimator)); + else + policy=std::unique_ptr(new MaxMeasure(kernel_mgr, estimator)); + 
} + break; + case KSM_MAXIMIZE_POWER: + { + if (weighted) + { + auto casted_estimator=dynamic_cast(estimator); + REQUIRE(casted_estimator, "Weighted kernel selection is not possible with MAXIMIZE_POWER!\n"); + policy=std::unique_ptr(new WeightedMaxTestPower(kernel_mgr, estimator)); + } + else + policy=std::unique_ptr(new MaxTestPower(kernel_mgr, estimator)); + } + break; + default: + { + SG_SERROR("Unsupported kernel selection method specified! Accepted strategies are " + "MAXIMIZE_MMD (single, weighted), " + "MAXIMIZE_POWER (single, weighted), " + "CROSS_VALIDATION (single) and " + "MEDIAN_HEURISTIC (single)!\n"); + } + break; + } +} + +CKernelSelectionStrategy::CKernelSelectionStrategy() +{ + init(); +} + +CKernelSelectionStrategy::CKernelSelectionStrategy(EKernelSelectionMethod method, bool weighted) +{ + init(); + self->method=method; + self->weighted=weighted; +} + +CKernelSelectionStrategy::CKernelSelectionStrategy(EKernelSelectionMethod method, index_t num_runs, + index_t num_folds, float64_t alpha) +{ + init(); + self->method=method; + self->num_runs=num_runs; + self->num_folds=num_folds; + self->alpha=alpha; +} + +void CKernelSelectionStrategy::init() +{ + self=std::unique_ptr(new Self()); +} + +CKernelSelectionStrategy::~CKernelSelectionStrategy() +{ + self->kernel_mgr.clear(); +} + +CKernelSelectionStrategy& CKernelSelectionStrategy::use_method(EKernelSelectionMethod method) +{ + self->method=method; + return *this; +} + +CKernelSelectionStrategy& CKernelSelectionStrategy::use_num_runs(index_t num_runs) +{ + self->num_runs=num_runs; + return *this; +} + +CKernelSelectionStrategy& CKernelSelectionStrategy::use_num_folds(index_t num_folds) +{ + self->num_folds=num_folds; + return *this; +} + +CKernelSelectionStrategy& CKernelSelectionStrategy::use_alpha(float64_t alpha) +{ + self->alpha=alpha; + return *this; +} + +CKernelSelectionStrategy& CKernelSelectionStrategy::use_weighted(bool weighted) +{ + self->weighted=weighted; + return *this; +} + 
+EKernelSelectionMethod CKernelSelectionStrategy::get_method() const +{ + return self->method; +} + +index_t CKernelSelectionStrategy::get_num_runs() const +{ + return self->num_runs; +} + +index_t CKernelSelectionStrategy::get_num_folds() const +{ + return self->num_folds; +} + +float64_t CKernelSelectionStrategy::get_alpha() const +{ + return self->alpha; +} + +bool CKernelSelectionStrategy::get_weighted() const +{ + return self->weighted; +} + +void CKernelSelectionStrategy::add_kernel(CKernel* kernel) +{ + self->kernel_mgr.push_back(kernel); +} + +CKernel* CKernelSelectionStrategy::select_kernel(CMMD* estimator) +{ + auto num_kernels=self->kernel_mgr.num_kernels(); + REQUIRE(num_kernels>0, "Number of kernels is 0. Please add kernels using add_kernel method!\n"); + SG_DEBUG("Selecting kernels from a total of %d kernels!\n", num_kernels); + + self->init_policy(estimator); + ASSERT(self->policy!=nullptr); + + return self->policy->select_kernel(); +} + +// TODO call this method when test train mode is turned off +void CKernelSelectionStrategy::erase_intermediate_results() +{ + self->policy=nullptr; + self->kernel_mgr.clear(); +} + +SGMatrix CKernelSelectionStrategy::get_measure_matrix() +{ + REQUIRE(self->policy!=nullptr, "The kernel selection policy is not initialized!\n"); + return self->policy->get_measure_matrix(); +} + +SGVector CKernelSelectionStrategy::get_measure_vector() +{ + REQUIRE(self->policy!=nullptr, "The kernel selection policy is not initialized!\n"); + return self->policy->get_measure_vector(); +} + +const char* CKernelSelectionStrategy::get_name() const +{ + return "KernelSelectionStrategy"; +} + +const KernelManager& CKernelSelectionStrategy::get_kernel_mgr() const +{ + return self->kernel_mgr; +} diff --git a/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.h b/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.h new file mode 100644 index 00000000000..456fdab28c9 --- /dev/null +++ 
b/src/shogun/statistical_testing/kernelselection/KernelSelectionStrategy.h @@ -0,0 +1,95 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2012 - 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef KERNEL_SELECTION_STRATEGY_H_ +#define KERNEL_SELECTION_STRATEGY_H_ + +#include +#include +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +class CQuadraticTimeMMD; +template class SGVector; +template class SGMatrix; + +namespace internal +{ + +class KernelManager; + +} +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class CKernelSelectionStrategy : public CSGObject +{ + friend class CMMD; + friend class CStreamingMMD; + friend class CQuadraticTimeMMD; +public: + CKernelSelectionStrategy(); + CKernelSelectionStrategy(EKernelSelectionMethod method, bool weighted = false); + CKernelSelectionStrategy(EKernelSelectionMethod method, index_t num_runs, index_t num_folds, float64_t alpha); + CKernelSelectionStrategy(const CKernelSelectionStrategy& other)=delete; + CKernelSelectionStrategy& operator=(const CKernelSelectionStrategy& other)=delete; + virtual ~CKernelSelectionStrategy(); + + CKernelSelectionStrategy& use_method(EKernelSelectionMethod method); + CKernelSelectionStrategy& use_num_runs(index_t num_runs); + CKernelSelectionStrategy& use_num_folds(index_t num_folds); + CKernelSelectionStrategy& use_alpha(float64_t alpha); + CKernelSelectionStrategy& use_weighted(bool weighted); + + EKernelSelectionMethod get_method() const; + index_t get_num_runs() const; + index_t get_num_folds() const; + float64_t get_alpha() const; + bool get_weighted() const; + + void add_kernel(CKernel* kernel); + CKernel* select_kernel(CMMD* estimator); + virtual const char* get_name() const; + void erase_intermediate_results(); + + SGMatrix get_measure_matrix(); + SGVector get_measure_vector(); +private: + struct Self; + std::unique_ptr self; + void init(); + const internal::KernelManager& get_kernel_mgr() const; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} +#endif // KERNEL_SELECTION_STRATEGY_H_ diff --git a/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.cpp b/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.cpp new file mode
100644 index 00000000000..555748278ec --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.cpp @@ -0,0 +1,51 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +KernelSelection::KernelSelection(KernelManager& km, CMMD* est) : kernel_mgr(km), estimator(est) +{ + REQUIRE(kernel_mgr.num_kernels()>0, "Number of kernels is %d!\n", kernel_mgr.num_kernels()); + REQUIRE(estimator!=nullptr, "Estimator is not set!\n"); +} + +KernelSelection::~KernelSelection() +{ +} diff --git a/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.h b/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.h new file mode 100644 index 00000000000..d2bf3cfe5d5 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/KernelSelection.h @@ -0,0 +1,71 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef KERNEL_SELECTION_H__ +#define KERNEL_SELECTION_H__ + +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +template class SGVector; +template class SGMatrix; + +namespace internal +{ + +class KernelManager; +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class KernelSelection +{ +public: + KernelSelection(KernelManager&, CMMD*); + KernelSelection(const KernelSelection& other)=delete; + virtual ~KernelSelection(); + KernelSelection& operator=(const KernelSelection& other)=delete; + virtual CKernel* select_kernel()=0; + virtual SGMatrix get_measure_matrix()=0; + virtual SGVector get_measure_vector()=0; +protected: + const KernelManager& kernel_mgr; + CMMD* estimator; + virtual void init_measures()=0; + virtual void compute_measures()=0; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // KERNEL_SELECTION_H__ diff --git a/src/shogun/statistical_testing/kernelselection/internals/MaxCrossValidation.cpp b/src/shogun/statistical_testing/kernelselection/internals/MaxCrossValidation.cpp new file mode 100644 index 00000000000..cefb31eae32 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MaxCrossValidation.cpp @@ -0,0 +1,174 @@ +/* + * Copyright (c) The Shogun 
Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; + +MaxCrossValidation::MaxCrossValidation(KernelManager& km, CMMD* est, const index_t& M, const index_t& K, const float64_t& alp) +: KernelSelection(km, est), num_runs(M), num_folds(K), alpha(alp) +{ + REQUIRE(num_runs>0, "Number of runs (%d) must be positive!\n", num_runs); + REQUIRE(num_folds>0, "Number of folds (%d) must be positive!\n", num_folds); + REQUIRE(alpha>=0.0 && alpha<=1.0, "Threshold (%f) has to be in [0, 1]!\n", alpha); +} + +MaxCrossValidation::~MaxCrossValidation() +{ +} + +SGVector<float64_t> MaxCrossValidation::get_measure_vector() +{ + return measures; +} + +SGMatrix<float64_t> MaxCrossValidation::get_measure_matrix() +{ + return rejections; +} + +void MaxCrossValidation::init_measures() +{ + const index_t num_kernels=kernel_mgr.num_kernels(); + if (rejections.num_rows!=num_folds*num_runs || rejections.num_cols!=num_kernels) + rejections=SGMatrix<float64_t>(num_folds*num_runs, num_kernels); + std::fill(rejections.data(), rejections.data()+rejections.size(), 0); + if (measures.size()!=num_kernels) + measures=SGVector<float64_t>(num_kernels); + std::fill(measures.data(), measures.data()+measures.size(), 0); +} + +void MaxCrossValidation::compute_measures() +{ + SG_SDEBUG("Performing %d fold cross-validation!\n", num_folds); + const size_t num_kernels=kernel_mgr.num_kernels(); + + CQuadraticTimeMMD* quadratic_time_mmd=dynamic_cast<CQuadraticTimeMMD*>(estimator); + if (quadratic_time_mmd) + { + REQUIRE(estimator->get_null_approximation_method()==NAM_PERMUTATION, + "Only supported with PERMUTATION method for null distribution approximation!\n"); + + auto Nx=estimator->get_num_samples_p(); + auto Ny=estimator->get_num_samples_q(); + auto num_null_samples=estimator->get_num_null_samples(); + auto stype=estimator->get_statistic_type(); + CrossValidationMMD compute(Nx, Ny, num_folds,
num_null_samples); + compute.m_stype=stype; + compute.m_alpha=alpha; + compute.m_num_runs=num_runs; + compute.m_rejections=rejections; + + if (kernel_mgr.same_distance_type()) + { + CDistance* distance=kernel_mgr.get_distance_instance(); + kernel_mgr.set_precomputed_distance(estimator->compute_joint_distance(distance)); + SG_UNREF(distance); + compute(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + } + else + { + auto samples_p_and_q=quadratic_time_mmd->get_p_and_q(); + SG_REF(samples_p_and_q); + + for (size_t k=0; kinit(samples_p_and_q, samples_p_and_q); + } + + compute(kernel_mgr); + + for (size_t k=0; kremove_lhs_and_rhs(); + } + + SG_UNREF(samples_p_and_q); + } + } + else // TODO put check, this one assumes infinite data + { + auto existing_kernel=estimator->get_kernel(); + for (auto i=0; iset_kernel(kernel); + auto statistic=estimator->compute_statistic(); + rejections(i*num_folds+j, k)=estimator->compute_p_value(statistic)cleanup(); + } + } + } + if (existing_kernel) + estimator->set_kernel(existing_kernel); + } + + for (auto j=0; j +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +template class SGVector; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class MaxCrossValidation : public KernelSelection +{ +public: + MaxCrossValidation(KernelManager&, CMMD*, const index_t&, const index_t&, const float64_t&); + MaxCrossValidation(const MaxCrossValidation& other)=delete; + ~MaxCrossValidation(); + MaxCrossValidation& operator=(const MaxCrossValidation& other)=delete; + virtual CKernel* select_kernel() override; + virtual SGVector get_measure_vector(); + virtual SGMatrix get_measure_matrix(); +protected: + virtual void init_measures(); + virtual void compute_measures(); + const index_t num_runs; + const index_t num_folds; + const float64_t alpha; + SGMatrix rejections; + SGVector measures; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // MAX_CROSS_VALIDATION_H__ diff --git 
a/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.cpp b/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.cpp new file mode 100644 index 00000000000..d0748e86230 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.cpp @@ -0,0 +1,104 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +MaxMeasure::MaxMeasure(KernelManager& km, CMMD* est) : KernelSelection(km, est) +{ +} + +MaxMeasure::~MaxMeasure() +{ +} + +SGVector<float64_t> MaxMeasure::get_measure_vector() +{ + return measures; +} + +SGMatrix<float64_t> MaxMeasure::get_measure_matrix() +{ + SG_SNOTIMPLEMENTED; + return SGMatrix<float64_t>(); +} + +void MaxMeasure::init_measures() +{ + const index_t num_kernels=kernel_mgr.num_kernels(); + REQUIRE(num_kernels>0, "Number of kernels is %d!\n", kernel_mgr.num_kernels()); + if (measures.size()!=num_kernels) + measures=SGVector<float64_t>(num_kernels); + std::fill(measures.data(), measures.data()+measures.size(), 0); +} + +void MaxMeasure::compute_measures() +{ + REQUIRE(estimator!=nullptr, "Estimator is not set!\n"); + CQuadraticTimeMMD* mmd=dynamic_cast<CQuadraticTimeMMD*>(estimator); + if (mmd!=nullptr && kernel_mgr.same_distance_type()) + measures=mmd->multikernel()->statistic(kernel_mgr); + else + { + init_measures(); + auto existing_kernel=estimator->get_kernel(); + const size_t num_kernels=kernel_mgr.num_kernels(); + for (size_t i=0; i<num_kernels; ++i) + { + auto kernel=kernel_mgr.kernel_at(i); + estimator->set_kernel(kernel); + measures[i]=estimator->compute_statistic(); + estimator->cleanup(); + } + if (existing_kernel) + estimator->set_kernel(existing_kernel); + } +} + +CKernel* MaxMeasure::select_kernel() +{ + compute_measures(); + ASSERT(size_t(measures.size())==kernel_mgr.num_kernels()); + auto max_element=std::max_element(measures.vector, measures.vector+measures.vlen); + auto max_idx=std::distance(measures.vector, max_element); + SG_SDEBUG("Selected kernel at %d position!\n", max_idx); + return kernel_mgr.kernel_at(max_idx); +} diff --git
a/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.h b/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.h new file mode 100644 index 00000000000..fff5b81a4df --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MaxMeasure.h @@ -0,0 +1,69 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef MAX_MEASURE_H__ +#define MAX_MEASURE_H__ + +#include +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +template class SGVector; +template class SGMatrix; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class MaxMeasure : public KernelSelection +{ +public: + MaxMeasure(KernelManager&, CMMD*); + MaxMeasure(const MaxMeasure& other)=delete; + ~MaxMeasure(); + MaxMeasure& operator=(const MaxMeasure& other)=delete; + virtual CKernel* select_kernel(); + virtual SGVector get_measure_vector(); + virtual SGMatrix get_measure_matrix(); +protected: + virtual void init_measures(); + virtual void compute_measures(); + SGVector measures; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // MAX_MEASURE_H__ diff --git a/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.cpp b/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.cpp new file mode 100644 index 00000000000..cb1635bfc43 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.cpp @@ -0,0 +1,83 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +MaxTestPower::MaxTestPower(KernelManager& km, CMMD* est) : MaxMeasure(km, est), lambda(1E-5) +{ +} + +MaxTestPower::~MaxTestPower() +{ +} + +void MaxTestPower::compute_measures() +{ + init_measures(); + REQUIRE(estimator!=nullptr, "Estimator is not set!\n"); + const auto m=estimator->get_num_samples_p(); + const auto n=estimator->get_num_samples_q(); + auto existing_kernel=estimator->get_kernel(); + const size_t num_kernels=kernel_mgr.num_kernels(); + auto streaming_mmd=dynamic_cast<CStreamingMMD*>(estimator); + if (streaming_mmd) + { + for (size_t i=0; i<num_kernels; ++i) + { + auto kernel=kernel_mgr.kernel_at(i); + estimator->set_kernel(kernel); + auto estimates=streaming_mmd->compute_statistic_variance(); + auto var_est=estimates.first; + auto mmd_est=estimates.second*(m+n)/m/n; + measures[i]=mmd_est/CMath::sqrt(var_est+lambda); + estimator->cleanup(); + } + } + else + { + auto quadratictime_mmd=dynamic_cast<CQuadraticTimeMMD*>(estimator); + ASSERT(quadratictime_mmd); + measures=quadratictime_mmd->multikernel()->test_power(kernel_mgr); + } + if (existing_kernel) + estimator->set_kernel(existing_kernel); +} diff --git a/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.h b/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.h new file mode 100644 index 00000000000..5229f67d502 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MaxTestPower.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2.
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef MAX_TEST_POWER_H__ +#define MAX_TEST_POWER_H__ + +#include +#include + +namespace shogun +{ + +class CKernel; +class CMMD; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class MaxTestPower : public MaxMeasure +{ +public: + MaxTestPower(KernelManager&, CMMD*); + MaxTestPower(const MaxTestPower& other)=delete; + ~MaxTestPower(); + MaxTestPower& operator=(const MaxTestPower& other)=delete; +protected: + virtual void compute_measures(); + float64_t lambda; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // MAX_TEST_POWER_H__ diff --git a/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.cpp b/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.cpp new file mode 100644 index 00000000000..d7e98d69f04 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.cpp @@ -0,0 +1,115 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +MedianHeuristic::MedianHeuristic(KernelManager& km, CMMD* est) : KernelSelection(km, est), distance(nullptr) +{ + for (size_t i=0; iget_kernel_type()==K_GAUSSIAN, + "The underlying kernel has to be a GaussianKernel (was %s)!\n", + kernel_mgr.kernel_at(i)->get_name()); + } +} + +MedianHeuristic::~MedianHeuristic() +{ +} + +void MedianHeuristic::init_measures() +{ + SG_SNOTIMPLEMENTED; +} + +void MedianHeuristic::compute_measures() +{ + auto tmp=new CEuclideanDistance(); + tmp->set_disable_sqrt(false); + SG_REF(tmp); + distance=std::shared_ptr(estimator->compute_joint_distance(tmp)); + SG_UNREF(tmp); + + n=distance->get_num_vec_lhs(); + REQUIRE(distance->get_num_vec_lhs()==distance->get_num_vec_rhs(), + "Distance matrix is supposed to be a square matrix (was of dimension %dX%d)!\n", + distance->get_num_vec_lhs(), distance->get_num_vec_rhs()); + measures=SGVector((n*(n-1))/2); + size_t write_idx=0; + for (auto j=0; jdistance(i, j); + } + std::sort(measures.data(), measures.data()+measures.size()); +} + +SGVector 
MedianHeuristic::get_measure_vector() +{ + return measures; +} + +SGMatrix MedianHeuristic::get_measure_matrix() +{ + REQUIRE(distance!=nullptr, "Distance is not initialized!\n"); + return distance->get_distance_matrix(); +} + +CKernel* MedianHeuristic::select_kernel() +{ + compute_measures(); + auto median_distance=measures[measures.size()/2]; + SG_SDEBUG("kernel width (shogun): %f\n", median_distance); + + const size_t num_kernels=kernel_mgr.num_kernels(); + measures=SGVector(num_kernels); + for (size_t i=0; i(kernel_mgr.kernel_at(i)); + measures[i]=CMath::abs(kernel->get_width()-median_distance); + } + + size_t kernel_idx=std::distance(measures.data(), std::min_element(measures.data(), measures.data()+measures.size())); + SG_SDEBUG("Selected kernel at %d position!\n", kernel_idx); + return kernel_mgr.kernel_at(kernel_idx); +} diff --git a/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.h b/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.h new file mode 100644 index 00000000000..a59d457a647 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/MedianHeuristic.h @@ -0,0 +1,72 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#ifndef MEDIAN_HEURISTIC_H__ +#define MEDIAN_HEURISTIC_H__ + +#include +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +class CCustomDistance; +template class SGVector; +template class SGMatrix; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class MedianHeuristic : public KernelSelection +{ +public: + MedianHeuristic(KernelManager&, CMMD*); + MedianHeuristic(const MedianHeuristic& other)=delete; + ~MedianHeuristic(); + MedianHeuristic& operator=(const MedianHeuristic& other)=delete; + virtual CKernel* select_kernel() override; + virtual SGVector get_measure_vector(); + virtual SGMatrix get_measure_matrix(); +protected: + virtual void init_measures(); + virtual void compute_measures(); + std::shared_ptr distance; + SGVector measures; + int32_t n; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // MEDIAN_HEURISTIC_H__ diff --git a/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.cpp b/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.cpp new file mode 100644 index 00000000000..77788ae8aea --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.cpp @@ -0,0 +1,164 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 3 of the License, or + * (at your option) any later version. 
+ * + * Written (W) 2011-2013 Heiko Strathmann + * Written (W) 2016 Soumyajit De + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +//#ifdef USE_GPL_SHOGUN +#include +//#endif // USE_GPL_SHOGUN + +using namespace shogun; +using namespace internal; + +struct OptimizationSolver::Self +{ + Self(SGVector<float64_t> mmds, SGMatrix<float64_t> Q); +//#ifdef USE_GPL_SHOGUN + SGVector<float64_t> solve() const; + void init(); + static const float64_t* get_Q_col(uint32_t i); + static void print_state(libqp_state_T state); + + index_t opt_max_iterations; + float64_t opt_epsilon; + float64_t opt_low_cut; + SGVector<float64_t> m_mmds; + static SGMatrix<float64_t> m_Q; +//#endif // USE_GPL_SHOGUN +}; + +//#ifdef USE_GPL_SHOGUN +SGMatrix<float64_t> OptimizationSolver::Self::m_Q=SGMatrix<float64_t>(); +//#endif // USE_GPL_SHOGUN + +OptimizationSolver::Self::Self(SGVector<float64_t> mmds, SGMatrix<float64_t> Q) +{ +//#ifdef USE_GPL_SHOGUN + m_Q=Q; + m_mmds=mmds; + init(); +//#endif // USE_GPL_SHOGUN +} + +//#ifdef USE_GPL_SHOGUN +void OptimizationSolver::Self::init() +{ + opt_max_iterations=10000; + opt_epsilon=1E-14; + opt_low_cut=1E-6; +} + +const float64_t* OptimizationSolver::Self::get_Q_col(uint32_t i) +{ + return &m_Q[m_Q.num_rows*i]; +} + +void OptimizationSolver::Self::print_state(libqp_state_T state) +{ + SG_SDEBUG("libqp state: primal=%f\n", state.QP); +} + +SGVector<float64_t> OptimizationSolver::Self::solve() const +{ + const index_t num_kernels=m_mmds.size(); + float64_t sum_m_mmds=std::accumulate(m_mmds.data(), m_mmds.data()+m_mmds.size(), 0.0); + SGVector<float64_t> weights(num_kernels); + if (std::any_of(m_mmds.data(), m_mmds.data()+m_mmds.size(), [](float64_t& value) { return value > 0; })) + { + SG_SDEBUG("At least one MMD entry is positive, performing optimisation\n") + + std::vector<float64_t> Q_diag(num_kernels); + std::vector<float64_t> f(num_kernels, 0); + std::vector<float64_t> lb(num_kernels, 0); + std::vector<float64_t> ub(num_kernels, CMath::INFTY); + + // initial point has to be feasible, i.e.
m_mmds'*x = b + std::fill(weights.data(), weights.data()+weights.size(), 1.0/sum_m_mmds); + + for (index_t i=0; i(); + + // set really small entries to zero and sum up for normalization + float64_t sum_weights=0; + for (index_t i=0; i& mmds, const SGMatrix& Q) +{ + self=std::unique_ptr(new Self(mmds, Q)); +} + +OptimizationSolver::~OptimizationSolver() +{ +} + +SGVector OptimizationSolver::solve() const +{ +//#ifdef USE_GPL_SHOGUN + return self->solve(); +//#else // USE_GPL_SHOGUN +// SG_SWARNING("Presently this feature is only available with GNU GPLv3 license!"); +// return SGVector(); +//#endif // USE_GPL_SHOGUN +} diff --git a/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.h b/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.h new file mode 100644 index 00000000000..d763c097c12 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/OptimizationSolver.h @@ -0,0 +1,64 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef OPTIMIZATION_SOLVER_H__ +#define OPTIMIZATION_SOLVER_H__ + +#include +#include + +namespace shogun +{ + +template class SGVector; +template class SGMatrix; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class OptimizationSolver +{ +public: + OptimizationSolver(const SGVector& mmds, const SGMatrix& Q); + OptimizationSolver(const OptimizationSolver& other)=delete; + OptimizationSolver& operator=(const OptimizationSolver& other)=delete; + ~OptimizationSolver(); + SGVector solve() const; +private: + struct Self; + std::unique_ptr self; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // OPTIMIZATION_SOLVER_H__ diff --git a/src/shogun/statistics/MMDKernelSelection.cpp b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.cpp similarity index 52% rename from src/shogun/statistics/MMDKernelSelection.cpp rename to src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.cpp index 4882423ad57..4ce92028c0d 100644 --- a/src/shogun/statistics/MMDKernelSelection.cpp +++ b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.cpp @@ -1,7 +1,7 @@ /* * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 
2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De * All rights reserved. * * Redistribution and use in source and binary forms, with or without @@ -29,61 +29,58 @@ * either expressed or implied, of the Shogun Development Team. */ -#include +#include +#include +#include #include -#include -#include -#include +#include +#include +#include +#include using namespace shogun; +using namespace internal; -CMMDKernelSelection::CMMDKernelSelection() +WeightedMaxMeasure::WeightedMaxMeasure(KernelManager& km, CMMD* est) : MaxMeasure(km, est) { } -CMMDKernelSelection::CMMDKernelSelection(CKernelTwoSampleTest* mmd) - : CKernelSelection(mmd) +WeightedMaxMeasure::~WeightedMaxMeasure() { - /* ensure that mmd contains an instance of a MMD related class - TODO - Add S_BTEST_MMD when feature/mmd is merged with develop */ - REQUIRE(mmd->get_statistic_type()==S_LINEAR_TIME_MMD || - mmd->get_statistic_type()==S_QUADRATIC_TIME_MMD, - "Provided instance for kernel two sample testing has to be a MMD-" - "based class! 
The provided is of class \"%s\"\n", mmd->get_name()); } -CMMDKernelSelection::~CMMDKernelSelection() +void WeightedMaxMeasure::compute_measures() { + MaxMeasure::compute_measures(); + const size_t num_kernels=kernel_mgr.num_kernels(); + if (Q.num_rows!=num_kernels || Q.num_cols!=num_kernels) + Q=SGMatrix(num_kernels, num_kernels); + std::fill(Q.data(), Q.data()+Q.size(), 0); + for (size_t i=0; i WeightedMaxMeasure::get_measure_matrix() { - SG_DEBUG("entering\n") + return Q; +} + +CKernel* WeightedMaxMeasure::select_kernel() +{ + init_measures(); + compute_measures(); - /* compute measures and return single kernel with maximum measure */ - SGVector measures=compute_measures(); + OptimizationSolver solver(measures, Q); + SGVector weights=solver.solve(); - /* find maximum and return corresponding kernel */ - float64_t max=measures[0]; - index_t max_idx=0; - for (index_t i=1; imax) - { - max=measures[i]; - max_idx=i; - } + if (!kernel->append_kernel(kernel_mgr.kernel_at(i))) + SG_SERROR("Error while creating a combined kernel! 
Please contact Shogun developers!\n"); } - - /* find kernel with corresponding index */ - CCombinedKernel* combined=(CCombinedKernel*)m_estimator->get_kernel(); - CKernel* current=combined->get_kernel(max_idx); - - SG_UNREF(combined); - SG_DEBUG("leaving\n"); - - /* current is not SG_UNREF'ed nor SG_REF'ed since the counter needs to be - * incremented exactly by one */ - return current; + kernel->set_subkernel_weights(weights); + SG_SDEBUG("Created a weighted kernel!\n"); + return kernel; } - diff --git a/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.h b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.h new file mode 100644 index 00000000000..51c757db967 --- /dev/null +++ b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxMeasure.h @@ -0,0 +1,65 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#ifndef WEIGHTED_MAX_MEASURE_H__ +#define WEIGHTED_MAX_MEASURE_H__ + +#include +#include + +namespace shogun +{ + +class CKernel; +class CMMD; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class WeightedMaxMeasure : public MaxMeasure +{ +public: + WeightedMaxMeasure(KernelManager&, CMMD*); + WeightedMaxMeasure(const WeightedMaxMeasure& other)=delete; + ~WeightedMaxMeasure(); + WeightedMaxMeasure& operator=(const WeightedMaxMeasure& other)=delete; + virtual CKernel* select_kernel(); + virtual SGMatrix get_measure_matrix(); +protected: + virtual void compute_measures(); + SGMatrix Q; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // WEIGHTED_MAX_MEASURE_H__ diff --git a/src/shogun/statistics/KernelSelection.cpp b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxTestPower.cpp similarity index 55% rename from src/shogun/statistics/KernelSelection.cpp rename to src/shogun/statistical_testing/kernelselection/internals/WeightedMaxTestPower.cpp index 84a45fa74c7..88b07b465b4 100644 --- a/src/shogun/statistics/KernelSelection.cpp +++ b/src/shogun/statistical_testing/kernelselection/internals/WeightedMaxTestPower.cpp @@ -1,7 +1,7 @@ /* * Copyright (c) The Shogun 
Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De + * Written (W) 2013 Heiko Strathmann + * Written (w) 2014 - 2016 Soumyajit De * All rights reserved. * * Redistribution and use in source and binary forms, with or without @@ -29,58 +29,37 @@ * either expressed or implied, of the Shogun Development Team. */ -#include +#include +#include +#include #include -#include -#include -#include +#include +#include +#include +#include using namespace shogun; +using namespace internal; -CKernelSelection::CKernelSelection() +WeightedMaxTestPower::WeightedMaxTestPower(KernelManager& km, CMMD* est) : WeightedMaxMeasure(km, est), lambda(1E-5) { - init(); } -CKernelSelection::CKernelSelection(CKernelTwoSampleTest* estimator) +WeightedMaxTestPower::~WeightedMaxTestPower() { - init(); - set_estimator(estimator); } -CKernelSelection::~CKernelSelection() +void WeightedMaxTestPower::init_measures() { - SG_UNREF(m_estimator); } -void CKernelSelection::init() +void WeightedMaxTestPower::compute_measures() { - SG_ADD((CSGObject**)&m_estimator, "estimator", - "Underlying CKernelTwoSampleTest instance", MS_NOT_AVAILABLE); - - m_estimator=NULL; -} - -void CKernelSelection::set_estimator(CKernelTwoSampleTest* estimator) -{ - REQUIRE(estimator, "No CKernelTwoSampleTest instance provided!\n"); - - /* ensure that there is a combined kernel */ - CKernel* kernel=estimator->get_kernel(); - REQUIRE(kernel, "Underlying \"%s\" has no kernel set!\n", - estimator->get_name()); - REQUIRE(kernel->get_kernel_type()==K_COMBINED, "Kernel of underlying \"%s\" " - "is of type \"%s\" but is has to be CCombinedKernel\n", - estimator->get_name(), kernel->get_name()); - SG_UNREF(kernel); - - SG_REF(estimator); - SG_UNREF(m_estimator); - m_estimator=estimator; -} - -CKernelTwoSampleTest* CKernelSelection::get_estimator() const -{ - SG_REF(m_estimator); - return m_estimator; + auto casted_estimator=dynamic_cast(estimator); + ASSERT(casted_estimator); + const auto& 
estimates=casted_estimator->compute_statistic_and_Q(kernel_mgr); + measures=estimates.first; + Q=estimates.second; + for (index_t i=0; i +#include + +namespace shogun +{ + +class CKernel; +class CMMD; +template class SGVector; + +namespace internal +{ +#ifndef DOXYGEN_SHOULD_SKIP_THIS +class WeightedMaxTestPower : public WeightedMaxMeasure +{ +public: + WeightedMaxTestPower(KernelManager&, CMMD*); + WeightedMaxTestPower(const WeightedMaxTestPower& other)=delete; + ~WeightedMaxTestPower(); + WeightedMaxTestPower& operator=(const WeightedMaxTestPower& other)=delete; +protected: + virtual void init_measures(); + virtual void compute_measures(); + float64_t lambda; +}; +#endif // DOXYGEN_SHOULD_SKIP_THIS +} + +} + +#endif // WEIGHTED_MAX_TEST_POWER_H__ diff --git a/src/shogun/statistics/HSIC.cpp b/src/shogun/statistics/HSIC.cpp deleted file mode 100644 index 7c2eb90a37d..00000000000 --- a/src/shogun/statistics/HSIC.cpp +++ /dev/null @@ -1,293 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -CHSIC::CHSIC() : CKernelIndependenceTest() -{ - init(); -} - -CHSIC::CHSIC(CKernel* kernel_p, CKernel* kernel_q, CFeatures* p, - CFeatures* q) : - CKernelIndependenceTest(kernel_p, kernel_q, p, q) -{ - init(); - - if (p && q && p->get_num_vectors()!=q->get_num_vectors()) - { - SG_ERROR("Only features with equal number of vectors are currently " - "possible\n"); - } - else - m_num_features=p->get_num_vectors(); -} - -CHSIC::~CHSIC() -{ -} - -void CHSIC::init() -{ - SG_ADD(&m_num_features, "num_features", - "Number of features from each of the distributions", - MS_NOT_AVAILABLE); - - m_num_features=0; -} - -float64_t CHSIC::compute_statistic() -{ - SG_DEBUG("entering!\n"); - - REQUIRE(m_kernel_p && m_kernel_q, "No or only one kernel specified!\n"); - - REQUIRE(m_p && m_q, "features needed!\n") - - /* compute kernel matrices */ - SGMatrix K=get_kernel_matrix_K(); - SGMatrix L=get_kernel_matrix_L(); - - /* center matrices (MATLAB: Kc=H*K*H) */ - K.center(); - - /* compute MATLAB: sum(sum(Kc' .* (L))), which is biased HSIC */ - index_t m=m_num_features; - SG_DEBUG("Number of samples %d!\n", m); - - float64_t result=0; - for 
(index_t i=0; i params=fit_null_gamma(); - result=CStatistics::gamma_cdf(statistic, params[0], params[1]); - break; - } - - default: - /* sampling null is handled there */ - result=CIndependenceTest::compute_p_value(statistic); - break; - } - - return result; -} - -float64_t CHSIC::compute_threshold(float64_t alpha) -{ - float64_t result=0; - switch (m_null_approximation_method) - { - case HSIC_GAMMA: - { - /* fit gamma and return inverse cdf at statistic */ - SGVector params=fit_null_gamma(); - - // alpha, beta are shape and rate parameter - result=CStatistics::gamma_inverse_cdf(alpha, params[0], params[1]); - break; - } - - default: - /* sampling null is handled there */ - result=CIndependenceTest::compute_threshold(alpha); - break; - } - - return result; -} - -SGVector CHSIC::fit_null_gamma() -{ - REQUIRE(m_kernel_p && m_kernel_q, "No or only one kernel specified!\n"); - - REQUIRE(m_p && m_q, "features needed!\n") - - index_t m=m_num_features; - - /* compute kernel matrices */ - SGMatrix K=get_kernel_matrix_K(); - SGMatrix L=get_kernel_matrix_L(); - - /* compute sum and trace of uncentered kernel matrices, needed later */ - float64_t trace_K=0; - float64_t trace_L=0; - float64_t sum_K=0; - float64_t sum_L=0; - for (index_t i=0; i result(2); - result[0]=a; - result[1]=b; - - SG_DEBUG("leaving!\n") - return result; -} - -SGVector CHSIC::sample_null() -{ - SG_DEBUG("entering!\n") - - /* replace current kernel via precomputed custom kernel and call superclass - * method */ - - /* backup references to old kernels */ - CKernel* kernel_p=m_kernel_p; - CKernel* kernel_q=m_kernel_q; - - /* init kernels before to be sure that everything is fine - * kernel function between two samples from different distributions - * is never computed - in fact, they may as well have different features */ - m_kernel_p->init(m_p, m_p); - m_kernel_q->init(m_q, m_q); - - /* precompute kernel matrices */ - CCustomKernel* precomputed_p=new CCustomKernel(m_kernel_p); - CCustomKernel* 
precomputed_q=new CCustomKernel(m_kernel_q); - SG_REF(precomputed_p); - SG_REF(precomputed_q); - - /* temporarily replace own kernels */ - m_kernel_p=precomputed_p; - m_kernel_q=precomputed_q; - - /* use superclass sample_null which shuffles the entries for one - * distribution using index permutation on rows and columns of - * kernel matrix from one distribution, while accessing the other - * in its original order and then compute statistic */ - SGVector null_samples=CKernelIndependenceTest::sample_null(); - - /* restore kernels */ - m_kernel_p=kernel_p; - m_kernel_q=kernel_q; - - SG_UNREF(precomputed_p); - SG_UNREF(precomputed_q); - - SG_DEBUG("leaving!\n") - return null_samples; -} - -void CHSIC::set_p(CFeatures* p) -{ - CIndependenceTest::set_p(p); - m_num_features=p->get_num_vectors(); -} - -void CHSIC::set_q(CFeatures* q) -{ - CIndependenceTest::set_q(q); - m_num_features=q->get_num_vectors(); -} - diff --git a/src/shogun/statistics/HSIC.h b/src/shogun/statistics/HSIC.h deleted file mode 100644 index 5122222fde2..00000000000 --- a/src/shogun/statistics/HSIC.h +++ /dev/null @@ -1,207 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. 
- * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef __HSIC_H_ -#define __HSIC_H_ - -#include - -#include - -namespace shogun -{ - -template class SGMatrix; - - -/** @brief This class implements the Hilbert Schmidtd Independence Criterion - * based independence test as described in [1]. - * - * Given samples \f$Z=\{(x_i,y_i)\}_{i=1}^m\f$ from the joint - * distribution \f$\textbf{P}_{xy}\f$, does the joint distribution - * factorize as \f$\textbf{P}_{xy}=\textbf{P}_x\textbf{P}_y\f$? - * - * The HSIC is a kernel based independence criterion, which is based on the - * largest singular value of a Cross-Covariance Operator in a reproducing - * kernel Hilbert space (RKHS). Its population expression is zero if and only - * if the two underlying distributions are independent. 
- * - * This class can compute empirical biased estimates: - * \f[ - * m\text{HSIC}(Z)[,p,q]^2)=\frac{1}{m^2}\text{trace}\textbf{KHLH} - * \f] - * where \f$\textbf{H}=\textbf{I}-\frac{1}{m}\textbf{11}^T\f$ is a centering - * matrix and \f$\textbf{K}, \textbf{L}\f$ are kernel matrices of both sets - * of samples. - * - * Note that computing the statistic returns m*MMD; same holds for the null - * distribution samples. - * - * Along with the statistic comes a method to compute a p-value based on - * different methods. Sampling from null is also possible. If unsure which one to - * use, sampling with 250 iterations always is correct (but slow). - * - * To choose, use set_null_approximation_method() and choose from - * - * HSIC_GAMMA: for a very fast, but not consistent test based on moment matching - * of a Gamma distribution, as described in [1]. - * - * PERMUTATION: For permuting available samples to sample null-distribution. - * This is done on precomputed kernel matrices, since they have to - * be stored anyway when the statistic is computed. - * - * A very basic method for kernel selection when using CGaussianKernel is to - * use the median distance of the underlying data. See examples how to do that. - * More advanced methods will follow in the near future. However, the median - * heuristic works in quite some cases. See [1]. - * - * [1]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). - * A kernel statistical test of independence. - * Advances in Neural Information Processing Systems, 1-8. - * - */ -class CHSIC : public CKernelIndependenceTest -{ -public: - /** Constructor */ - CHSIC(); - - /** Constructor. 
- * - * Initializes the kernels and features from the two distributions and - * SG_REFs them - * - * @param kernel_p kernel to use on samples from p - * @param kernel_q kernel to use on samples from q - * @param p samples from distribution p - * @param q samples from distribution q - */ - CHSIC(CKernel* kernel_p, CKernel* kernel_q, CFeatures* p, CFeatures* q); - - /** destructor */ - virtual ~CHSIC(); - - /** Computes the HSIC statistic (see class description) for underlying - * kernels and data. Note that it is multiplied by the number of used - * samples. It is a biased estimator. Note that it is m*HSIC_b. - * - * Note that since kernel matrices have to be stored, it has quadratic - * space costs. - * - * @return m*HSIC (unbiased estimate) - */ - virtual float64_t compute_statistic(); - - /** computes a p-value based on current method for approximating the - * null-distribution. The p-value is the 1-p quantile of the null- - * distribution where the given statistic lies in. - * - * @param statistic statistic value to compute the p-value for - * @return p-value parameter statistic is the (1-p) percentile of the - * null distribution - */ - virtual float64_t compute_p_value(float64_t statistic); - - /** computes a threshold based on current method for approximating the - * null-distribution. The threshold is the value that a statistic has - * to have in ordner to reject the null-hypothesis. 
- * - * @param alpha test level to reject null-hypothesis - * @return threshold for statistics to reject null-hypothesis - */ - virtual float64_t compute_threshold(float64_t alpha); - - /** @return the class name */ - virtual const char* get_name() const - { - return "HSIC"; - } - - /** returns the statistic type of this test statistic */ - virtual EStatisticType get_statistic_type() const - { - return S_HSIC; - } - - /** Setter for features from distribution p, SG_REFs it - * - * @param p features from p - */ - virtual void set_p(CFeatures* p); - - /** Setter for features from distribution q, SG_REFs it - * - * @param q features from q - */ - virtual void set_q(CFeatures* q); - - /** Approximates the null-distribution by a two parameter gamma - * distribution. Returns parameters. - * - * NOTE: the gamma distribution is fitted to m*HSIC_b. But since - * compute_statistic() returnes the biased estimate, you can safely call - * this with values from compute_statistic(). - * However, the attached features have to be the SAME size, as these, the - * statistic was computed on. If compute_threshold() or compute_p_value() - * are used, this is ensured automatically. Note that m*Null-distribution is - * fitted, which is fine since the statistic is also m*HSIC. - * - * Has quadratic computational costs in terms of samples. - * - * Called by compute_p_value() if null approximation method is set to - * MMD2_GAMMA. - * - * @return vector with two parameters for gamma distribution. To use: - * call gamma_cdf(statistic, a, b). - */ - SGVector fit_null_gamma(); - - /** merges both sets of samples and computes the test statistic - * m_num_null_sample times. This version precomputes the kenrel matrix - * once by hand, then samples using this one. The matrix has - * to be stored anyway when statistic is computed. 
- * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - -private: - /** register parameters and initialize with defaults */ - void init(); - - /** number of features from the distributions (should be equal for both) */ - index_t m_num_features; - -}; - -} - -#endif /* __HSIC_H_ */ diff --git a/src/shogun/statistics/IndependenceTest.cpp b/src/shogun/statistics/IndependenceTest.cpp deleted file mode 100644 index 89c16cb3ad2..00000000000 --- a/src/shogun/statistics/IndependenceTest.cpp +++ /dev/null @@ -1,132 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include - -using namespace shogun; - -CIndependenceTest::CIndependenceTest() : CHypothesisTest() -{ - init(); -} - -CIndependenceTest::CIndependenceTest(CFeatures* p, CFeatures* q) - : CHypothesisTest() -{ - init(); - - SG_REF(p); - SG_REF(q); - - m_p=p; - m_q=q; -} - -CIndependenceTest::~CIndependenceTest() -{ - SG_UNREF(m_p); - SG_UNREF(m_q); -} - -void CIndependenceTest::init() -{ - SG_ADD((CSGObject**)&m_p, "p", "Samples from p", MS_NOT_AVAILABLE); - SG_ADD((CSGObject**)&m_q, "q", "Samples from q", MS_NOT_AVAILABLE); - - m_p=NULL; - m_q=NULL; -} - -SGVector CIndependenceTest::sample_null() -{ - SG_DEBUG("entering!\n") - - REQUIRE(m_p, "No features p!\n"); - REQUIRE(m_q, "No features q!\n"); - - /* compute sample statistics for null distribution */ - SGVector results(m_num_null_samples); - - /* memory for index permutations. Adding of subset has to happen - * inside the loop since it may be copied if there already is one set. - * - * subset for selecting samples from p. 
In this case we want to - * shuffle only samples from p while keeping samples from q fixed */ - SGVector ind_permutation(m_p->get_num_vectors()); - ind_permutation.range_fill(); - - for (index_t i=0; iadd_subset(ind_permutation); - results[i]=compute_statistic(); - m_p->remove_subset(); - } - - SG_DEBUG("leaving!\n") - return results; -} - -void CIndependenceTest::set_p(CFeatures* p) -{ - /* ref before unref to avoid problems when instances are equal */ - SG_REF(p); - SG_UNREF(m_p); - m_p=p; -} - -void CIndependenceTest::set_q(CFeatures* q) -{ - /* ref before unref to avoid problems when instances are equal */ - SG_REF(q); - SG_UNREF(m_q); - m_q=q; -} - -CFeatures* CIndependenceTest::get_p() -{ - SG_REF(m_p); - return m_p; -} - -CFeatures* CIndependenceTest::get_q() -{ - SG_REF(m_q); - return m_q; -} - diff --git a/src/shogun/statistics/IndependenceTest.h b/src/shogun/statistics/IndependenceTest.h deleted file mode 100644 index faac0eee492..00000000000 --- a/src/shogun/statistics/IndependenceTest.h +++ /dev/null @@ -1,124 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef INDEPENDENCE_TEST_H_ -#define INDEPENDENCE_TEST_H_ - -#include - -#include - -namespace shogun -{ - -class CFeatures; - -/** @brief Provides an interface for performing the independence test. - * Given samples \f$Z=\{(x_i,y_i)\}_{i=1}^m\f$ from the joint distribution - * \f$\textbf{P}_{xy}\f$, does the joint distribution factorize as - * \f$\textbf{P}_{xy}=\textbf{P}_x\textbf{P}_y\f$, i.e. product of the marginals? - * The null-hypothesis says yes, i.e. no dependence, the alternative hypothesis - * says no. - * - * Abstract base class. Provides all interfaces and implements approximating - * the null distribution via permutation, i.e. shuffling the samples from - * one distribution repeatedly using subsets while keeping the samples from - * the other distribution in its original order - * - */ -class CIndependenceTest : public CHypothesisTest -{ -public: - /** default constructor */ - CIndependenceTest(); - - /** Constructor. 
- * - * Initializes the features from the two distributions and SG_REFs them - * - * @param p samples from distribution p - * @param q samples from distribution q - */ - CIndependenceTest(CFeatures* p, CFeatures* q); - - /** destructor */ - virtual ~CIndependenceTest(); - - /** shuffles samples from one distribution keeping the samples from another - * distribution in the original order and computes the test statistic - * m_num_null_sample times - * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - - /** Setter for features from distribution p, SG_REFs it - * - * @param p features from p - */ - virtual void set_p(CFeatures* p); - - /** Setter for features from distribution q, SG_REFs it - * - * @param q features from q - */ - virtual void set_q(CFeatures* q); - - /** Getter for features from p, SG_REF'ed - * - * @return feature object from p - */ - virtual CFeatures* get_p(); - - /** Getter for features from q, SG_REF'ed - * - * @return feature object from q - */ - virtual CFeatures* get_q(); - - /** @return class name */ - virtual const char* get_name() const=0; - -private: - /** register parameters and initialize with default values */ - void init(); - -protected: - /** samples of the distribution p */ - CFeatures* m_p; - - /** samples of the distribution q */ - CFeatures* m_q; -}; - -} - -#endif /* INDEPENDENCE_TEST_H_ */ diff --git a/src/shogun/statistics/KernelIndependenceTest.cpp b/src/shogun/statistics/KernelIndependenceTest.cpp deleted file mode 100644 index e667b2978ce..00000000000 --- a/src/shogun/statistics/KernelIndependenceTest.cpp +++ /dev/null @@ -1,208 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. 
Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -CKernelIndependenceTest::CKernelIndependenceTest() : - CIndependenceTest() -{ - init(); -} - -CKernelIndependenceTest::CKernelIndependenceTest(CKernel* kernel_p, - CKernel* kernel_q, CFeatures* p, CFeatures* q) : - CIndependenceTest(p, q) -{ - init(); - - m_kernel_p=kernel_p; - SG_REF(kernel_p); - - m_kernel_q=kernel_q; - SG_REF(kernel_q); -} - -CKernelIndependenceTest::~CKernelIndependenceTest() -{ - SG_UNREF(m_kernel_p); - SG_UNREF(m_kernel_q); -} - -void CKernelIndependenceTest::init() -{ - SG_ADD((CSGObject**)&m_kernel_p, "kernel_p", "Kernel for samples from p", - MS_AVAILABLE); - SG_ADD((CSGObject**)&m_kernel_q, "kernel_q", "Kernel for samples from q", - MS_AVAILABLE); - - m_kernel_p=NULL; - m_kernel_q=NULL; -} - -SGVector CKernelIndependenceTest::sample_null() -{ - SG_DEBUG("entering!\n") - - /* compute sample statistics for null distribution */ - SGVector results; - - /* only do something if a custom kernel is used: use the power of pre- - * computed kernel matrices - */ - if (m_kernel_p->get_kernel_type()==K_CUSTOM && - m_kernel_q->get_kernel_type()==K_CUSTOM) - { - /* allocate memory */ - results=SGVector(m_num_null_samples); - - /* memory for index permutations */ - SGVector ind_permutation(m_p->get_num_vectors()); - ind_permutation.range_fill(); - - /* check if kernel is a custom kernel. 
In that case, changing features is - * not what we want but just subsetting the kernel itself */ - CCustomKernel* custom_kernel_p=(CCustomKernel*)m_kernel_p; - - for (index_t i=0; iadd_row_subset(ind_permutation); - custom_kernel_p->add_col_subset(ind_permutation); - - /* compute statistic for this permutation of mixed samples */ - results[i]=compute_statistic(); - - /* remove subsets */ - custom_kernel_p->remove_row_subset(); - custom_kernel_p->remove_col_subset(); - } - } - else - { - /* in this case, just use superclass method */ - results=CIndependenceTest::sample_null(); - } - - - SG_DEBUG("leaving!\n") - return results; -} - -void CKernelIndependenceTest::set_kernel_p(CKernel* kernel_p) -{ - /* ref before unref to avoid problems when instances are equal */ - SG_REF(kernel_p); - SG_UNREF(m_kernel_p); - m_kernel_p=kernel_p; -} - -void CKernelIndependenceTest::set_kernel_q(CKernel* kernel_q) -{ - /* ref before unref to avoid problems when instances are equal */ - SG_REF(kernel_q); - SG_UNREF(m_kernel_q); - m_kernel_q=kernel_q; -} - -CKernel* CKernelIndependenceTest::get_kernel_p() -{ - SG_REF(m_kernel_p); - return m_kernel_p; -} - -CKernel* CKernelIndependenceTest::get_kernel_q() -{ - SG_REF(m_kernel_q); - return m_kernel_q; -} - -SGMatrix CKernelIndependenceTest::get_kernel_matrix_K() -{ - SG_DEBUG("entering!\n"); - - SGMatrix K; - - /* distinguish between custom and normal kernels */ - if (m_kernel_p->get_kernel_type()==K_CUSTOM) - { - /* custom kernels need to to be initialised when a subset is added */ - CCustomKernel* custom_kernel_p=(CCustomKernel*)m_kernel_p; - K=custom_kernel_p->get_kernel_matrix(); - } - else - { - /* need to init the kernel if kernel is not precomputed - if subsets of - * features are in the stack (for permutation), this will handle it */ - m_kernel_p->init(m_p, m_p); - K=m_kernel_p->get_kernel_matrix(); - } - - SG_DEBUG("leaving!\n"); - - return K; -} - -SGMatrix CKernelIndependenceTest::get_kernel_matrix_L() -{ - 
SG_DEBUG("entering!\n"); - - SGMatrix L; - - /* now second half of data for L */ - if (m_kernel_q->get_kernel_type()==K_CUSTOM) - { - /* custom kernels need to to be initialised - no subsets here */ - CCustomKernel* custom_kernel_q=(CCustomKernel*)m_kernel_q; - L=custom_kernel_q->get_kernel_matrix(); - } - else - { - /* need to init the kernel if kernel is not precomputed */ - m_kernel_q->init(m_q, m_q); - L=m_kernel_q->get_kernel_matrix(); - } - - SG_DEBUG("leaving!\n"); - - return L; -} - diff --git a/src/shogun/statistics/KernelIndependenceTest.h b/src/shogun/statistics/KernelIndependenceTest.h deleted file mode 100644 index 077d56462d9..00000000000 --- a/src/shogun/statistics/KernelIndependenceTest.h +++ /dev/null @@ -1,145 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef KERNEL_INDEPENDENCE_TEST_H_ -#define KERNEL_INDEPENDENCE_TEST_H_ - -#include - -#include - -namespace shogun -{ - -class CFeatures; -class CKernel; - -/** @brief Kernel independence test base class. Provides an interface for - * performing an independence test. Given samples \f$Z=\{(x_i,y_i)\}_{i=1}^m\f$ - * from the joint distribution \f$\textbf{P}_{xy}\f$, does the joint - * distribution factorize as \f$\textbf{P}_{xy}=\textbf{P}_x\textbf{P}_y\f$, - * i.e. product of the marginals? - * - * The null-hypothesis says yes, i.e. no dependence, the alternative hypothesis - * says no. - * - * In this class, this is done using a single kernel for each of the two sets - * of samples - * - * The class also re-implements the sample_null() method. If the underlying - * kernel is a custom one (precomputed), the rows and column of the kernel - * matrix for p is permuted using subsets. The computation falls back to - * CIndependenceTest::sample_null() otherwise, which requires to re-compute - * the kernel in each iteration via subsets on the features instead - * - * Abstract base class. 
- */ -class CKernelIndependenceTest: public CIndependenceTest -{ -public: - /** default constructor */ - CKernelIndependenceTest(); - - /** Constructor. - * - * Initializes the kernels and features from the two distributions and - * SG_REFs them - * - * @param kernel_p kernel to use on samples from p - * @param kernel_q kernel to use on samples from q - * @param p samples from distribution p - * @param q samples from distribution q - */ - CKernelIndependenceTest(CKernel* kernel_p, CKernel* kernel_q, - CFeatures* p, CFeatures* q); - - /** destructor */ - virtual ~CKernelIndependenceTest(); - - /** shuffles the indeices that corresponds to the kernel entries of - * samples from p while accessing samples from q in the original order and - * computes the test statistic m_num_null_samples times. This version - * checks if a precomputed custom kernel is used, and, if so, just permutes - * the indices of the kernel corresponding to p instead of re-computing it - * in every iteration. - * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - - /** Setter for kernel for features from distribution p, SG_REFs it - * - * @param kernel_p kernel for features from p - */ - virtual void set_kernel_p(CKernel* kernel_p); - - /** Setter for kernel for features from distribution q, SG_REFs it - * - * @param kernel_q kernel for features from q - */ - virtual void set_kernel_q(CKernel* kernel_q); - - /** Getter for kernel for features from p, SG_REF'ed - * - * @return kernel for features from p - */ - virtual CKernel* get_kernel_p(); - - /** Getter for kernel for features from q, SG_REF'ed - * - * @return kernel for features from q - */ - virtual CKernel* get_kernel_q(); - - /** @return the class name */ - virtual const char* get_name() const=0; - -private: - /** register parameters and intiailize with default values */ - void init(); - -protected: - /** @return kernel matrix on samples from p. 
Distinguishes CustomKernels */ - SGMatrix get_kernel_matrix_K(); - - /** @return kernel matrix on samples from q. Distinguishes CustomKernels */ - SGMatrix get_kernel_matrix_L(); - - /** underlying kernel for p */ - CKernel* m_kernel_p; - - /** underlying kernel for q */ - CKernel* m_kernel_q; -}; - -} - -#endif /* KERNEL_INDEPENDENCE_TEST_H_ */ diff --git a/src/shogun/statistics/KernelMeanMatching.cpp b/src/shogun/statistics/KernelMeanMatching.cpp deleted file mode 100644 index 71942054b6d..00000000000 --- a/src/shogun/statistics/KernelMeanMatching.cpp +++ /dev/null @@ -1,109 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Copyright (W) 2012 Sergey Lisitsyn - */ - -#include -#ifdef USE_GPL_SHOGUN - -#include -#include - - -static float64_t* kmm_K = NULL; -static int32_t kmm_K_ld = 0; - -static const float64_t* kmm_get_col(uint32_t i) -{ - return kmm_K + kmm_K_ld*i; -} - -namespace shogun -{ -CKernelMeanMatching::CKernelMeanMatching() : - CSGObject(), m_kernel(NULL) -{ -} - -CKernelMeanMatching::CKernelMeanMatching(CKernel* kernel, SGVector training_indices, - SGVector test_indices) : - CSGObject(), m_kernel(NULL) -{ - set_kernel(kernel); - set_training_indices(training_indices); - set_test_indices(test_indices); -} - -SGVector CKernelMeanMatching::compute_weights() -{ - int32_t i,j; - ASSERT(m_kernel) - ASSERT(m_training_indices.vlen) - ASSERT(m_test_indices.vlen) - - int32_t n_tr = m_training_indices.vlen; - int32_t n_te = m_test_indices.vlen; - - SGVector weights(n_tr); - weights.zero(); - - kmm_K = SG_MALLOC(float64_t, n_tr*n_tr); - kmm_K_ld = n_tr; - float64_t* diag_K = SG_MALLOC(float64_t, n_tr); - for (i=0; ikernel(m_training_indices[i], m_training_indices[i]); - diag_K[i] = d; - kmm_K[i*n_tr+i] = d; - for (j=i+1; 
jkernel(m_training_indices[i],m_training_indices[j]); - kmm_K[i*n_tr+j] = d; - kmm_K[j*n_tr+i] = d; - } - } - float64_t* kappa = SG_MALLOC(float64_t, n_tr); - for (i=0; ikernel(m_training_indices[i],m_test_indices[j]); - - avg *= float64_t(n_tr)/n_te; - kappa[i] = -avg; - } - float64_t* a = SG_MALLOC(float64_t, n_tr); - for (i=0; i -#ifdef USE_GPL_SHOGUN - -#include -#include - -namespace shogun -{ - -/** @brief Kernel Mean Matching */ -class CKernelMeanMatching: public CSGObject -{ -public: - - /** constructor */ - CKernelMeanMatching(); - - /** constructor */ - CKernelMeanMatching(CKernel* kernel, SGVector training_indices, SGVector test_indices); - - /** get kernel */ - CKernel* get_kernel() const { SG_REF(m_kernel); return m_kernel; } - /** set kernel */ - void set_kernel(CKernel* kernel) { SG_REF(kernel); SG_UNREF(m_kernel); m_kernel = kernel; } - /** get training indices */ - SGVector get_training_indices() const { return m_training_indices; } - /** set training indices */ - void set_training_indices(SGVector training_indices) { m_training_indices = training_indices; } - /** get test indices */ - SGVector get_test_indices() const { return m_test_indices; } - /** set test indices */ - void set_test_indices(SGVector test_indices) { m_test_indices = test_indices; } - - /** compute weights */ - SGVector compute_weights(); - - virtual const char* get_name() const { return "KernelMeanMatching"; } - -protected: - - /** kernel */ - CKernel* m_kernel; - /** training indices */ - SGVector m_training_indices; - /** test indices */ - SGVector m_test_indices; -}; - -} -#endif //USE_GPL_SHOGUN -#endif diff --git a/src/shogun/statistics/KernelSelection.h b/src/shogun/statistics/KernelSelection.h deleted file mode 100644 index 8d2cdf29abe..00000000000 --- a/src/shogun/statistics/KernelSelection.h +++ /dev/null @@ -1,104 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights 
reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef KERNEL_SELECTION_H_ -#define KERNEL_SELECTION_H_ - -#include -#include - -namespace shogun -{ - -class CKernelTwoSampleTest; -class CKernel; - -/** @brief Base class for kernel selection for kernel two-sample test - * statistic implementations (e.g. MMD). - * Provides abstract methods for selecting kernels and computing criteria or - * kernel weights for the implemented method. 
In order to implement new methods - * for kernel selection, simply write a new implementation of this class. - */ -class CKernelSelection: public CSGObject -{ -public: - /** Default constructor */ - CKernelSelection(); - - /** Constructor that initialises the underlying CKernelTwoSampleTest instance - * - * @param estimator CKernelTwoSampleTest instance to use. - */ - CKernelSelection(CKernelTwoSampleTest* estimator); - - /** Destructor */ - virtual ~CKernelSelection(); - - /** If the the implemented method selects a single kernel, this computes - * criteria for all underlying kernels. If the method selects combined - * kernels, this method returns weights for the baseline kernels - * - * @return vector with criteria or kernel weights - */ - virtual SGVector compute_measures()=0; - - /** Abstract method that performs kernel selection on the base of the - * compute_measures() method and returns the selected kernel which is - * either a single or a combined one (with weights set) - * - * @return selected kernel (SG_REF'ed) - */ - virtual CKernel* select_kernel()=0; - - /** @param estimator the underlying CKernelTwoSampleTest instance */ - void set_estimator(CKernelTwoSampleTest* estimator); - - /** @return the underlying CKernelTwoSampleTest instance */ - CKernelTwoSampleTest* get_estimator() const; - - /** @return name of the SGSerializable */ - virtual const char* get_name() const - { - return "KernelSelection"; - } - -private: - /** Register parameters and initialize with default */ - void init(); - -protected: - /** Underlying kernel two-sample test instance */ - CKernelTwoSampleTest* m_estimator; -}; - -} - -#endif /* KERNEL_SELECTION_H_ */ diff --git a/src/shogun/statistics/KernelTwoSampleTest.cpp b/src/shogun/statistics/KernelTwoSampleTest.cpp deleted file mode 100644 index 0035f1bb9d5..00000000000 --- a/src/shogun/statistics/KernelTwoSampleTest.cpp +++ /dev/null @@ -1,117 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * 
it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -CKernelTwoSampleTest::CKernelTwoSampleTest() : - CTwoSampleTest() -{ - init(); -} - -CKernelTwoSampleTest::CKernelTwoSampleTest(CKernel* kernel, - CFeatures* p_and_q, index_t q_start) : - CTwoSampleTest(p_and_q, q_start) -{ - init(); - - m_kernel=kernel; - SG_REF(kernel); -} - -CKernelTwoSampleTest::CKernelTwoSampleTest(CKernel* kernel, - CFeatures* p, CFeatures* q) : CTwoSampleTest(p, q) -{ - init(); - - m_kernel=kernel; - SG_REF(kernel); -} - -CKernelTwoSampleTest::~CKernelTwoSampleTest() -{ - SG_UNREF(m_kernel); -} - -void CKernelTwoSampleTest::init() -{ - SG_ADD((CSGObject**)&m_kernel, "kernel", "Kernel for two sample test", - MS_AVAILABLE); - m_kernel=NULL; -} - -SGVector CKernelTwoSampleTest::sample_null() -{ - SG_DEBUG("entering!\n"); - - REQUIRE(m_kernel, "No kernel set!\n"); - REQUIRE(m_kernel->get_kernel_type()==K_CUSTOM || m_p_and_q, - "No features and no custom kernel set!\n"); - - /* compute sample statistics for null distribution */ - SGVector results; - - /* only do something if a custom kernel is used: use the power of pre- - * computed kernel matrices - */ - if (m_kernel->get_kernel_type()==K_CUSTOM) - { - /* allocate memory */ - results=SGVector(m_num_null_samples); - - /* in case of custom kernel, there are no features */ - index_t num_data; - if (m_kernel->get_kernel_type()==K_CUSTOM) - num_data=m_kernel->get_num_vec_lhs(); - else - num_data=m_p_and_q->get_num_vectors(); - - /* memory for index permutations, (would slow down loop) */ - SGVector ind_permutation(num_data); - ind_permutation.range_fill(); - - /* check if kernel is a custom kernel. 
In that case, changing features is - * not what we want but just subsetting the kernel itself */ - CCustomKernel* custom_kernel=(CCustomKernel*)m_kernel; - - for (index_t i=0; iadd_row_subset(ind_permutation); - custom_kernel->add_col_subset(ind_permutation); - - /* compute statistic for this permutation of mixed samples */ - results[i]=compute_statistic(); - - /* remove subsets */ - custom_kernel->remove_row_subset(); - custom_kernel->remove_col_subset(); - } - } - else - { - /* in this case, just use superclass method */ - results=CTwoSampleTest::sample_null(); - } - - SG_DEBUG("leaving!\n"); - - return results; -} diff --git a/src/shogun/statistics/KernelTwoSampleTest.h b/src/shogun/statistics/KernelTwoSampleTest.h deleted file mode 100644 index ba7b9d26160..00000000000 --- a/src/shogun/statistics/KernelTwoSampleTest.h +++ /dev/null @@ -1,126 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#ifndef KERNEL_TWO_SAMPLE_TEST_H_ -#define KERNEL_TWO_SAMPLE_TEST_H_ - -#include - -#include -#include - -namespace shogun -{ - -class CFeatures; -class CKernel; - -/** @brief Kernel two sample test base class. Provides an interface for - * performing a two-sample test using a kernel, i.e. Given samples from two - * distributions \f$p\f$ and \f$q\f$, the null-hypothesis is: \f$H_0: p=q\f$, - * the alternative hypothesis: \f$H_1: p\neq q\f$. - * - * In this class, this is done using a single kernel for the data. - * - * The class also re-implements the sample_null() method. If the underlying - * kernel is a custom one (precomputed), the rows and column of the kernel - * matrix is permuted using subsets. The computation falls back to - * CTwoSampleTest::sample_null() otherwise. - * - * Abstract base class. 
- */ -class CKernelTwoSampleTest : public CTwoSampleTest -{ - public: - /** default constructor */ - CKernelTwoSampleTest(); - - /** Constructor - * - * @param p_and_q feature data. Is assumed to contain samples from both - * p and q. First all samples from p, then from index q_start all - * samples from q - * - * @param kernel kernel to use - * @param p_and_q samples from p and q, appended - * @param q_start index of first sample of q - */ - CKernelTwoSampleTest(CKernel* kernel, CFeatures* p_and_q, - index_t q_start); - - /** Constructor. - * This is a convienience constructor which copies both features to one - * element and then calls the other constructor. Needs twice the memory - * for a short time - * - * @param kernel kernel for MMD - * @param p samples from distribution p, will be copied and NOT - * SG_REF'ed - * @param q samples from distribution q, will be copied and NOT - * SG_REF'ed - */ - CKernelTwoSampleTest(CKernel* kernel, CFeatures* p, - CFeatures* q); - - /** destructor */ - virtual ~CKernelTwoSampleTest(); - - /** Setter for the underlying kernel - * @param kernel new kernel to use - */ - inline virtual void set_kernel(CKernel* kernel) - { - /* ref before unref to prevent deleting in case objects are the same */ - SG_REF(kernel); - SG_UNREF(m_kernel); - m_kernel=kernel; - } - - /** @return underlying kernel, is SG_REF'ed */ - inline virtual CKernel* get_kernel() - { - SG_REF(m_kernel); - return m_kernel; - } - - /** merges both sets of samples and computes the test statistic - * m_num_null_samples times. This version checks if a precomputed - * custom kernel is used, and, if so, just permutes it instead of re- - * computing it in every iteration. 
- * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - - /** Same as compute_statistic(), but with the possibility to perform on - * multiple kernels at once - * - * @param multiple_kernels if true, and underlying kernel is K_COMBINED, - * method will be executed on all subkernels on the same data - * @return vector of results for subkernels - */ - virtual SGVector compute_statistic( - bool multiple_kernels)=0; - - /** Wrapper for compute_statistic(false) */ - virtual float64_t compute_statistic()=0; - - virtual const char* get_name() const=0; - - private: - void init(); - - protected: - /** underlying kernel */ - CKernel* m_kernel; -}; - -} - -#endif /* KERNEL_TWO_SAMPLE_TEST_H_ */ diff --git a/src/shogun/statistics/LinearTimeMMD.cpp b/src/shogun/statistics/LinearTimeMMD.cpp deleted file mode 100644 index 735d7dc502f..00000000000 --- a/src/shogun/statistics/LinearTimeMMD.cpp +++ /dev/null @@ -1,458 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include -#include -#include -#include -#include -#include - -#include - -using namespace shogun; - -CLinearTimeMMD::CLinearTimeMMD() : CStreamingMMD() -{ -} - -CLinearTimeMMD::CLinearTimeMMD(CKernel* kernel, CStreamingFeatures* p, - CStreamingFeatures* q, index_t m, index_t blocksize) - : CStreamingMMD(kernel, p, q, m, blocksize) -{ -} - -CLinearTimeMMD::~CLinearTimeMMD() -{ -} - -void CLinearTimeMMD::compute_squared_mmd(CKernel* kernel, CList* data, - SGVector& current, SGVector& pp, - SGVector& qq, SGVector& pq, - SGVector& qp, index_t num_this_run) -{ - SG_DEBUG("entering!\n"); - - REQUIRE(data->get_num_elements()==4, "Wrong number of blocks!\n"); - - /* cast is safe the list is passed inside the class - * features will be SG_REF'ed once again by these get methods */ - CFeatures* p1=(CFeatures*)data->get_first_element(); - CFeatures* p2=(CFeatures*)data->get_next_element(); - CFeatures* q1=(CFeatures*)data->get_next_element(); - CFeatures* q2=(CFeatures*)data->get_next_element(); - - SG_DEBUG("computing MMD values for current kernel!\n"); - - /* compute kernel matrix diagonals */ - kernel->init(p1, p2); - kernel->get_kernel_diagonal(pp); - - kernel->init(q1, q2); - 
kernel->get_kernel_diagonal(qq); - - kernel->init(p1, q2); - kernel->get_kernel_diagonal(pq); - - kernel->init(q1, p2); - kernel->get_kernel_diagonal(qp); - - /* cleanup */ - SG_UNREF(p1); - SG_UNREF(p2); - SG_UNREF(q1); - SG_UNREF(q2); - - /* compute sum of current h terms for current kernel */ - - for (index_t i=0; i CLinearTimeMMD::compute_squared_mmd(CKernel* kernel, - CList* data, index_t num_this_run) -{ - /* wrapper method used for convenience for using preallocated memory */ - SGVector current(num_this_run); - SGVector pp(num_this_run); - SGVector qq(num_this_run); - SGVector pq(num_this_run); - SGVector qp(num_this_run); - compute_squared_mmd(kernel, data, current, pp, qq, pq, qp, num_this_run); - return current; -} - -void CLinearTimeMMD::compute_statistic_and_variance( - SGVector& statistic, SGVector& variance, - bool multiple_kernels) -{ - SG_DEBUG("entering!\n") - - REQUIRE(m_streaming_p, "streaming features p required!\n"); - REQUIRE(m_streaming_q, "streaming features q required!\n"); - - REQUIRE(m_kernel, "kernel needed!\n"); - - /* make sure multiple_kernels flag is used only with a combined kernel */ - REQUIRE(!multiple_kernels || m_kernel->get_kernel_type()==K_COMBINED, - "multiple kernels specified, but underlying kernel is not of type " - "K_COMBINED\n"); - - /* m is number of samples from each distribution, m_2 is half of it - * using names from JLMR paper (see class documentation) */ - index_t m_2=m_m/2; - - SG_DEBUG("m_m=%d\n", m_m) - - /* find out whether single or multiple kernels (cast is safe, check above) */ - index_t num_kernels=1; - if (multiple_kernels) - { - num_kernels=((CCombinedKernel*)m_kernel)->get_num_subkernels(); - SG_DEBUG("computing MMD and variance for %d sub-kernels\n", - num_kernels); - } - - /* allocate memory for results if vectors are empty */ - if (!statistic.vector) - statistic=SGVector(num_kernels); - - if (!variance.vector) - variance=SGVector(num_kernels); - - /* ensure right dimensions */ - 
REQUIRE(statistic.vlen==num_kernels, - "statistic vector size (%d) does not match number of kernels (%d)\n", - statistic.vlen, num_kernels); - - REQUIRE(variance.vlen==num_kernels, - "variance vector size (%d) does not match number of kernels (%d)\n", - variance.vlen, num_kernels); - - /* temp variable in the algorithm */ - float64_t delta; - - /* initialise statistic and variance since they are cumulative */ - statistic.zero(); - variance.zero(); - - /* needed for online mean and variance */ - SGVector term_counters(num_kernels); - term_counters.set_const(1); - - /* term counter to compute online mean and variance */ - index_t num_examples_processed=0; - while (num_examples_processedget_kernel(i); - - /* compute linear time MMD values */ - SGVector current=compute_squared_mmd(kernel, data, - num_this_run); - - /* single variances for all kernels. Update mean and variance - * using Knuth's online variance algorithm. - * C.f. for example Wikipedia */ - for (index_t j=0; jget_loglevel()==MSG_DEBUG || io->get_loglevel()==MSG_GCDEBUG) - statistic.display_vector("statistics"); - - /* variance of terms can be computed using mean (statistic). 
- * Note that the variance needs to be divided by m_2 in order to get - * variance of null-distribution */ - for (index_t i=0; iget_loglevel()==MSG_DEBUG || io->get_loglevel()==MSG_GCDEBUG) - variance.display_vector("variances"); - - SG_DEBUG("leaving!\n") -} - -void CLinearTimeMMD::compute_statistic_and_Q( - SGVector& statistic, SGMatrix& Q) -{ - SG_DEBUG("entering!\n") - - REQUIRE(m_streaming_p, "streaming features p required!\n"); - REQUIRE(m_streaming_q, "streaming features q required!\n"); - - REQUIRE(m_kernel, "kernel needed!\n"); - - /* make sure multiple_kernels flag is used only with a combined kernel */ - REQUIRE(m_kernel->get_kernel_type()==K_COMBINED, - "underlying kernel is not of type K_COMBINED\n"); - - /* cast combined kernel */ - CCombinedKernel* combined=(CCombinedKernel*)m_kernel; - - /* m is number of samples from each distribution, m_4 is quarter of it */ - REQUIRE(m_m>=4, "Need at least m>=4\n"); - index_t m_4=m_m/4; - - SG_DEBUG("m_m=%d\n", m_m) - - /* find out whether single or multiple kernels (cast is safe, check above) */ - index_t num_kernels=combined->get_num_subkernels(); - REQUIRE(num_kernels>0, "At least one kernel is needed\n"); - - /* allocate memory for results if vectors are empty */ - if (!statistic.vector) - statistic=SGVector(num_kernels); - - if (!Q.matrix) - Q=SGMatrix(num_kernels, num_kernels); - - /* ensure right dimensions */ - REQUIRE(statistic.vlen==num_kernels, - "statistic vector size (%d) does not match number of kernels (%d)\n", - statistic.vlen, num_kernels); - - REQUIRE(Q.num_rows==num_kernels, - "Q number of rows does (%d) not match number of kernels (%d)\n", - Q.num_rows, num_kernels); - - REQUIRE(Q.num_cols==num_kernels, - "Q number of columns (%d) does not match number of kernels (%d)\n", - Q.num_cols, num_kernels); - - /* initialise statistic and variance since they are cumulative */ - statistic.zero(); - Q.zero(); - - /* produce two kernel lists to iterate doubly nested */ - CList* list_i=new CList(); - 
CList* list_j=new CList(); - - for (index_t k_idx=0; k_idxget_num_kernels(); k_idx++) - { - CKernel* kernel = combined->get_kernel(k_idx); - list_i->append_element(kernel); - list_j->append_element(kernel); - SG_UNREF(kernel); - } - - /* needed for online mean and variance */ - SGVector term_counters_statistic(num_kernels); - SGMatrix term_counters_Q(num_kernels, num_kernels); - term_counters_statistic.set_const(1); - term_counters_Q.set_const(1); - - index_t num_examples_processed=0; - while (num_examples_processedget_num_elements(); - CFeatures* current=(CFeatures*)data->get_first_element(); - data_a->append_element(current); - SG_UNREF(current); - current=(CFeatures*)data->get_next_element(); - data_b->append_element(current); - SG_UNREF(current); - num_elements-=2; - /* loop counter is safe since num_elements can only be even */ - while (num_elements) - { - current=(CFeatures*)data->get_next_element(); - data_a->append_element(current); - SG_UNREF(current); - current=(CFeatures*)data->get_next_element(); - data_b->append_element(current); - SG_UNREF(current); - num_elements-=2; - } - /* safely unref previous list of data, decreases refcounts of features - * but doesn't delete them */ - SG_UNREF(data); - - /* now for each of these streamed data instances, iterate through all - * kernels and update Q matrix while also computing MMD statistic */ - - /* preallocate some memory for faster processing */ - SGVector pp(num_this_run); - SGVector qq(num_this_run); - SGVector pq(num_this_run); - SGVector qp(num_this_run); - SGVector h_i_a(num_this_run); - SGVector h_i_b(num_this_run); - SGVector h_j_a(num_this_run); - SGVector h_j_b(num_this_run); - - /* iterate through Q matrix and update values, compute mmd */ - CKernel* kernel_i=(CKernel*)list_i->get_first_element(); - for (index_t i=0; iget_first_element(); - for (index_t j=0; j<=i; ++j) - { - /* compute all necessary 8 h-vectors for this burst. 
- * h_delta-terms for each kernel, expression 7 of NIPS paper */ - - /* second kernel, a-part */ - compute_squared_mmd(kernel_j, data_a, h_j_a, pp, qq, pq, qp, - num_this_run); - - /* second kernel, b-part */ - compute_squared_mmd(kernel_j, data_b, h_j_b, pp, qq, pq, qp, - num_this_run); - - float64_t term; - for (index_t it=0; itget_next_element(); - } - - /* update MMD statistic online computation for kernel i, using - * vectors that were computed above */ - SGVector h(num_this_run*2); - for (index_t it=0; itget_next_element(); - } - - /* clean up streamed data */ - SG_UNREF(data_a); - SG_UNREF(data_b); - - /* add number of processed examples for this run */ - num_examples_processed+=num_this_run; - } - - /* clean up */ - SG_UNREF(list_i); - SG_UNREF(list_j); - - SG_DEBUG("Done compouting statistic, processed 4*%d examples.\n", - num_examples_processed); - - SG_DEBUG("leaving!\n") -} - diff --git a/src/shogun/statistics/LinearTimeMMD.h b/src/shogun/statistics/LinearTimeMMD.h deleted file mode 100644 index 84049792803..00000000000 --- a/src/shogun/statistics/LinearTimeMMD.h +++ /dev/null @@ -1,158 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. 
- * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef LINEAR_TIME_MMD_H_ -#define LINEAR_TIME_MMD_H_ - -#include - -#include - -namespace shogun -{ - -class CStreamingFeatures; -class CFeatures; - -/** @brief This class implements the linear time Maximum Mean Statistic as - * described in [1] for streaming data (see CStreamingMMD for description). - * - * Given two sets of samples \f$\{x_i\}_{i=1}^m\sim p\f$ and - * \f$\{y_i\}_{i=1}^m\sim q\f$ - * the (unbiased) statistic is computed as - * \f[ - * \text{MMD}_l^2[\mathcal{F},X,Y]=\frac{1}{m_2}\sum_{i=1}^{m_2} - * h(z_{2i},z_{2i+1}) - * \f] - * where - * \f[ - * h(z_{2i},z_{2i+1})=k(x_{2i},x_{2i+1})+k(y_{2i},y_{2i+1})-k(x_{2i},y_{2i+1})- - * k(x_{2i+1},y_{2i}) - * \f] - * and \f$ m_2=\lfloor\frac{m}{2} \rfloor\f$. - * - * [1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., - * & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning - * Research, 13, 671-721. 
- */ -class CLinearTimeMMD: public CStreamingMMD -{ -public: - /** default constructor */ - CLinearTimeMMD(); - - /** Constructor. - * @param kernel kernel to use - * @param p streaming features p to use - * @param q streaming features q to use - * @param m number of samples from each distribution - * @param blocksize size of examples that are processed at once when - * computing statistic/threshold. If larger than m/2, all examples will be - * processed at once. Memory consumption increased linearly in the - * blocksize. Choose as large as possible regarding available memory. - */ - CLinearTimeMMD(CKernel* kernel, CStreamingFeatures* p, - CStreamingFeatures* q, index_t m, index_t blocksize=10000); - - /** destructor */ - virtual ~CLinearTimeMMD(); - - /** Computes squared MMD and a variance estimate, in linear time. - * If multiple_kernels is set to true, each subkernel is evaluated on the - * same data. - * - * @param statistic return parameter for statistic, vector with entry for - * each kernel. May be allocated before but doesn not have to be - * - * @param variance return parameter for statistic, vector with entry for - * each kernel. May be allocated before but doesn not have to be - * - * @param multiple_kernels optional flag, if set to true, it is assumed that - * the underlying kernel is of type K_COMBINED. Then, the MMD is computed on - * all subkernel separately rather than computing it on the combination. - * This is used by kernel selection strategies that need to evaluate - * multiple kernels on the same data. Since the linear time MMD works on - * streaming data, one cannot simply compute MMD, change kernel since data - * would be different for every kernel. - */ - virtual void compute_statistic_and_variance( - SGVector& statistic, SGVector& variance, - bool multiple_kernels=false); - - /** Same as compute_statistic_and_variance, but computes a linear time - * estimate of the covariance of the multiple-kernel-MMD. - * See [1] for details. 
- */ - virtual void compute_statistic_and_Q( - SGVector& statistic, SGMatrix& Q); - - /** returns the statistic type of this test statistic */ - virtual EStatisticType get_statistic_type() const - { - return S_LINEAR_TIME_MMD; - } - - /** @return the class name */ - virtual const char* get_name() const - { - return "LinearTimeMMD"; - } - -protected: - /** method that computes the squared MMD in linear time (see class - * description for the equation) - * - * @param kernel the kernel to be used for computing MMD. This will be - * useful when multiple kernels are used - * @param data the list of data on which kernels are computed. The order - * of data in the list is \f$x,x',\cdots\sim p\f$ followed by - * \f$y,y',\cdots\sim q\f$. It is assumed that detele_data flag is set - * inside the list - * @param num_this_run number of data points in current blocks - * @return the MMD values (the h-vectors) - */ - virtual SGVector compute_squared_mmd(CKernel* kernel, - CList* data, index_t num_this_run); - -private: - /** helper method, same as compute_squared_mmd with an option to use - * preallocated memory for faster processing */ - void compute_squared_mmd(CKernel* kernel, CList* data, - SGVector& current, SGVector& pp, - SGVector& qq, SGVector& pq, - SGVector& qp, index_t num_this_run); - -}; - -} - -#endif /* LINEAR_TIME_MMD_H_ */ - diff --git a/src/shogun/statistics/MMDKernelSelection.h b/src/shogun/statistics/MMDKernelSelection.h deleted file mode 100644 index 5616b2bbdd5..00000000000 --- a/src/shogun/statistics/MMDKernelSelection.h +++ /dev/null @@ -1,100 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. 
Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef __MMDKERNELSELECTION_H_ -#define __MMDKERNELSELECTION_H_ - -#include -#include - -namespace shogun -{ - -class CKernelTwoSampleTest; -class CKernel; - -/** @brief Base class for kernel selection for MMD-based two-sample test - * statistic implementations. - * Provides abstract methods for selecting kernels and computing criteria or - * kernel weights for the implemented method. In order to implement new methods - * for kernel selection, simply write a new implementation of this class. 
- * - * Kernel selection works this way: One passes an instance of CCombinedKernel - * to the MMD statistic and appends all kernels that should be considered. - * Depending on the type of kernel selection implementation, a single one or - * a combination of those baseline kernels is selected and returned to the user. - * This kernel can then be passed to the MMD instance to perform a test. - * - */ -class CMMDKernelSelection: public CKernelSelection -{ -public: - - /** Default constructor */ - CMMDKernelSelection(); - - /** Constructor that initialises the underlying MMD instance - * - * @param mmd MMD instance to use. Has to be an MMD based kernel two-sample - * test. Currently: linear or quadratic time MMD. - */ - CMMDKernelSelection(CKernelTwoSampleTest* mmd); - - /** Destructor */ - virtual ~CMMDKernelSelection(); - - /** If the the implemented method selects a single kernel, this computes - * criteria for all underlying kernels. If the method selects combined - * kernels, this method returns weights for the baseline kernels - * - * @return vector with criteria or kernel weights - */ - virtual SGVector compute_measures()=0; - - /** Performs kernel selection on the base of the compute_measures() method - * and returns the selected kernel which is either a single or a combined - * one (with weights set) - * - * @return selected kernel (SG_REF'ed) - */ - virtual CKernel* select_kernel(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const - { - return "MMDKernelSelection"; - } - -}; - -} - -#endif /* __MMDKERNELSELECTION_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionComb.cpp b/src/shogun/statistics/MMDKernelSelectionComb.cpp deleted file mode 100644 index aa1da177181..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionComb.cpp +++ /dev/null @@ -1,171 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * 
the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#ifdef USE_GPL_SHOGUN - -#include -#include -#include - -using namespace shogun; - -CMMDKernelSelectionComb::CMMDKernelSelectionComb() : - CMMDKernelSelection() -{ - init(); -} - -CMMDKernelSelectionComb::CMMDKernelSelectionComb( - CKernelTwoSampleTest* mmd) : CMMDKernelSelection(mmd) -{ - init(); -} - -CMMDKernelSelectionComb::~CMMDKernelSelectionComb() -{ -} - -void CMMDKernelSelectionComb::init() -{ - SG_ADD(&m_opt_max_iterations, "opt_max_iterations", "Maximum number of " - "iterations for qp solver", MS_NOT_AVAILABLE); - SG_ADD(&m_opt_epsilon, "opt_epsilon", "Stopping criterion for qp solver", - MS_NOT_AVAILABLE); - SG_ADD(&m_opt_low_cut, "opt_low_cut", "Low cut value for optimization " - "kernel weights", MS_NOT_AVAILABLE); - - /* sensible values for optimization */ - m_opt_max_iterations=10000; - m_opt_epsilon=10E-15; - m_opt_low_cut=10E-7; -} - -CKernel* CMMDKernelSelectionComb::select_kernel() -{ - /* cast is safe due to assertion in constructor */ - CCombinedKernel* combined=(CCombinedKernel*)m_estimator->get_kernel(); - - /* optimise for kernel weights and set them */ - SGVector weights=compute_measures(); - combined->set_subkernel_weights(weights); - - /* note that kernel is SG_REF'ed from getter above */ - return combined; -} - -/* no reference counting, use the static context constructor of SGMatrix */ -SGMatrix CMMDKernelSelectionComb::m_Q=SGMatrix(false); - -const float64_t* CMMDKernelSelectionComb::get_Q_col(uint32_t i) -{ - return &m_Q[m_Q.num_rows*i]; -} - -/** helper function that prints current state */ -void CMMDKernelSelectionComb::print_state(libqp_state_T state) -{ - SG_SDEBUG("libqp state: primal=%f\n", state.QP); -} - -SGVector CMMDKernelSelectionComb::solve_optimization( - SGVector mmds) -{ - /* readability */ - index_t num_kernels=mmds.vlen; - - /* compute sum of mmds 
to generate feasible point for convex program */ - float64_t sum_mmds=0; - for (index_t i=0; i Q_diag(num_kernels); - SGVector f(num_kernels); - SGVector lb(num_kernels); - SGVector ub(num_kernels); - SGVector weights(num_kernels); - - /* init everything, there are two cases possible: i) at least one mmd is - * is positive, ii) all mmds are negative */ - bool one_pos=false; - for (index_t i=0; i0) - { - SG_DEBUG("found at least one positive MMD\n") - one_pos=true; - break; - } - } - - if (!one_pos) - { - SG_WARNING("All mmd estimates are negative. This is techically possible," - "although extremely rare. Consider using different kernels. " - "This combination will lead to a bad two-sample test. Since any" - "combination is bad, will now just return equally distributed " - "kernel weights\n"); - - /* if no element is positive, we can choose arbritary weights since - * the results will be bad anyway */ - weights.set_const(1.0/num_kernels); - } - else - { - SG_DEBUG("one MMD entry is positive, performing optimisation\n") - /* do optimisation, init vectors */ - for (index_t i=0; i - -#ifdef USE_GPL_SHOGUN - -#include -#include -#include - -namespace shogun -{ - -class CLinearTimeMMD; - -/** @brief Base class for kernel selection of combined kernels. Given an MMD - * instance whose underlying kernel is a combined one, this class provides an - * interface to select weights of this combined kernel. - */ -class CMMDKernelSelectionComb: public CMMDKernelSelection -{ -public: - - /** Default constructor */ - CMMDKernelSelectionComb(); - - /** Constructor that initialises the underlying MMD instance. 
Currently, - * only the linear time MMD is supported - * - * @param mmd MMD instance to use - */ - CMMDKernelSelectionComb(CKernelTwoSampleTest* mmd); - - /** Destructor */ - virtual ~CMMDKernelSelectionComb(); - - /** @return computes weights for the underlying kernel, sets them to it, and - * returns it (SG_REF'ed) - * - * @return underlying kernel with weights set - */ - virtual CKernel* select_kernel(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const=0; - -protected: - /** Solves the quadratic program - * \f[ - * \min_\beta \{\beta^T Q \beta \quad \text{s.t.}\quad \beta^T \eta=1, \beta\succeq 0\}, - * \f] - * where \f$\eta\f$ is a given parameter and \f$Q\f$ is the m_Q member. - * - * Note that at least one element is assumed \f$\eta\f$ has to be positive. - * - * @param mmds values that will be put into \f$\eta\f$. At least one element - * is assumed to be positive - * @return result of optimization \f$\beta\f$ - */ - virtual SGVector solve_optimization(SGVector mmds); - - /** return pointer to i-th column of m_Q. 
Helper for libqp */ - static const float64_t* get_Q_col(uint32_t i); - - /** helper function that prints current state */ - static void print_state(libqp_state_T state); - - /** maximum number of iterations of qp solver */ - index_t m_opt_max_iterations; - - /** stopping accuracy of qp solver */ - float64_t m_opt_epsilon; - - /** low cut for weights, if weights are under this value, are set to zero */ - float64_t m_opt_low_cut; - - /** matrix for selection of kernel weights (static because of libqp) */ - static SGMatrix m_Q; - -private: - /** initializer */ - void init(); -}; - -} - -#endif //USE_GPL_SHOGUN -#endif /* __MMDKERNELSELECTIONCOMB_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionCombMaxL2.cpp b/src/shogun/statistics/MMDKernelSelectionCombMaxL2.cpp deleted file mode 100644 index e150e7ce8c2..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionCombMaxL2.cpp +++ /dev/null @@ -1,74 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#ifdef USE_GPL_SHOGUN - -#include -#include -#include -#include - - -using namespace shogun; - -CMMDKernelSelectionCombMaxL2::CMMDKernelSelectionCombMaxL2() : - CMMDKernelSelectionComb() -{ -} - -CMMDKernelSelectionCombMaxL2::CMMDKernelSelectionCombMaxL2( - CKernelTwoSampleTest* mmd) : CMMDKernelSelectionComb(mmd) -{ - /* currently, this method is only developed for the linear time MMD */ - REQUIRE(mmd->get_statistic_type()==S_QUADRATIC_TIME_MMD || - mmd->get_statistic_type()==S_LINEAR_TIME_MMD, "%s::%s(): Only " - "CLinearTimeMMD is currently supported! 
Provided instance is " - "\"%s\"\n", get_name(), get_name(), mmd->get_name()); -} - -CMMDKernelSelectionCombMaxL2::~CMMDKernelSelectionCombMaxL2() -{ -} - -SGVector CMMDKernelSelectionCombMaxL2::compute_measures() -{ - /* cast is safe due to assertion in constructor */ - CCombinedKernel* kernel=(CCombinedKernel*)m_estimator->get_kernel(); - index_t num_kernels=kernel->get_num_subkernels(); - SG_UNREF(kernel); - - /* compute mmds for all underlying kernels and create identity matrix Q - * (see NIPS paper) */ - SGVector mmds=m_estimator->compute_statistic(true); - - /* free matrix by hand since it is static */ - SG_FREE(m_Q.matrix); - m_Q.matrix=NULL; - m_Q.num_rows=0; - m_Q.num_cols=0; - m_Q=SGMatrix(num_kernels, num_kernels, false); - for (index_t i=0; i result=CMMDKernelSelectionComb::solve_optimization(mmds); - - /* free matrix by hand since it is static (again) */ - SG_FREE(m_Q.matrix); - m_Q.matrix=NULL; - m_Q.num_rows=0; - m_Q.num_cols=0; - - return result; -} -#endif //USE_GPL_SHOGUN diff --git a/src/shogun/statistics/MMDKernelSelectionCombMaxL2.h b/src/shogun/statistics/MMDKernelSelectionCombMaxL2.h deleted file mode 100644 index 430300f97cc..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionCombMaxL2.h +++ /dev/null @@ -1,81 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2013 Heiko Strathmann - */ - -#ifndef __MMDKERNELSELECTIONCOMBMAXL2_H_ -#define __MMDKERNELSELECTIONCOMBMAXL2_H_ - -#include -#ifdef USE_GPL_SHOGUN -#include -#include - -namespace shogun -{ - -/** @brief Implementation of maximum MMD kernel selection for combined kernel. - * This class selects a combination of baseline kernels that maximises the - * the MMD for a combined kernel based on a L2-regularization approach. 
This - * boils down to solve the convex program - * \f[ - * \min_\beta \{\beta^T \beta \quad \text{s.t.}\quad \beta^T \eta=1, \beta\succeq 0\}, - * \f] - * where \f$\eta\f$ is a vector whose elements are the MMDs of the baseline - * kernels. - * - * This is meant to work for the CQuadraticTimeMMD statistic. - * Optimal weight selecton for CLinearTimeMMD can be found in - * CMMDKernelSelectionCombOpt. - * - * The method is described in - * Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., - * Balakrishnan, S., Pontil, M., & Fukumizu, K. (2012). - * Optimal kernel choice for large-scale two-sample tests. - * Advances in Neural Information Processing Systems. - */ -class CMMDKernelSelectionCombMaxL2: public CMMDKernelSelectionComb -{ -public: - - /** Default constructor */ - CMMDKernelSelectionCombMaxL2(); - - /** Constructor that initialises the underlying MMD instance - * - * @param mmd MMD instance to use. Has to be an MMD based kernel two-sample - * test. Currently: linear or quadratic time MMD. - */ - CMMDKernelSelectionCombMaxL2(CKernelTwoSampleTest* mmd); - - /** Destructor */ - virtual ~CMMDKernelSelectionCombMaxL2(); - - /** Computes kernel weights which maximise the MMD of the underlying - * combined kernel using L2-regularization. - * - * This boils down to solving a convex program which is quadratic in the - * number of kernels. See class description. - * - * SHOGUN has to be compiled with LAPACK to make this available. See - * set_opt* methods for optimization parameters. - * - * IMPORTANT: Kernel weights have to be learned on different data than is - * used for testing/evaluation! 
- */ - virtual SGVector compute_measures(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const - { - return "MMDKernelSelectionCombMaxL2"; - } -}; - -} -#endif //USE_GPL_SHOGUN -#endif /* __MMDKERNELSELECTIONCOMBMAXL2_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionCombOpt.cpp b/src/shogun/statistics/MMDKernelSelectionCombOpt.cpp deleted file mode 100644 index ceecf63b500..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionCombOpt.cpp +++ /dev/null @@ -1,99 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#ifdef USE_GPL_SHOGUN - -#include -#include -#include - - -using namespace shogun; - -CMMDKernelSelectionCombOpt::CMMDKernelSelectionCombOpt() : - CMMDKernelSelectionComb() -{ - init(); -} - -CMMDKernelSelectionCombOpt::CMMDKernelSelectionCombOpt( - CKernelTwoSampleTest* mmd, float64_t lambda) : - CMMDKernelSelectionComb(mmd) -{ - /* currently, this method is only developed for the linear time MMD */ - REQUIRE(dynamic_cast(mmd), "%s::%s(): Only " - "CLinearTimeMMD is currently supported! 
Provided instance is " - "\"%s\"\n", get_name(), get_name(), mmd->get_name()); - - init(); - - m_lambda=lambda; -} - -CMMDKernelSelectionCombOpt::~CMMDKernelSelectionCombOpt() -{ -} - -void CMMDKernelSelectionCombOpt::init() -{ - /* set to a sensible standard value that proved to be useful in - * experiments, see NIPS paper */ - m_lambda=1E-5; - - SG_ADD(&m_lambda, "lambda", "Regularization parameter lambda", - MS_NOT_AVAILABLE); -} - -SGVector CMMDKernelSelectionCombOpt::compute_measures() -{ - /* cast is safe due to assertion in constructor */ - CCombinedKernel* kernel=(CCombinedKernel*)m_estimator->get_kernel(); - index_t num_kernels=kernel->get_num_subkernels(); - SG_UNREF(kernel); - - /* allocate space for MMDs and Q matrix */ - SGVector mmds(num_kernels); - - /* free matrix by hand since it is static */ - SG_FREE(m_Q.matrix); - m_Q.matrix=NULL; - m_Q.num_rows=0; - m_Q.num_cols=0; - m_Q=SGMatrix(num_kernels, num_kernels, false); - - /* online compute mmds and covariance matrix Q of kernels */ - ((CLinearTimeMMD*)m_estimator)->compute_statistic_and_Q(mmds, m_Q); - - /* evtl regularize to avoid numerical problems (see NIPS paper) */ - if (m_lambda) - { - SG_DEBUG("regularizing matrix Q by adding %f to diagonal\n", m_lambda) - for (index_t i=0; iget_loglevel()==MSG_DEBUG) - { - m_Q.display_matrix("(regularized) Q"); - mmds.display_vector("mmds"); - } - - /* solve the generated problem */ - SGVector result=CMMDKernelSelectionComb::solve_optimization(mmds); - - /* free matrix by hand since it is static (again) */ - SG_FREE(m_Q.matrix); - m_Q.matrix=NULL; - m_Q.num_rows=0; - m_Q.num_cols=0; - - return result; -} -#endif //USE_GPL_SHOGUN diff --git a/src/shogun/statistics/MMDKernelSelectionCombOpt.h b/src/shogun/statistics/MMDKernelSelectionCombOpt.h deleted file mode 100644 index 9e7223ea6ee..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionCombOpt.h +++ /dev/null @@ -1,95 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify 
- * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#ifndef __MMDKERNELSELECTIONCOMBOPT_H_ -#define __MMDKERNELSELECTIONCOMBOPT_H_ - -#include -#ifdef USE_GPL_SHOGUN - -#include - -namespace shogun -{ - -class CLinearTimeMMD; - -/** @brief Implementation of optimal kernel selection for combined kernel. - * This class selects a combination of baseline kernels that maximises the - * ratio of the MMD and its standard deviation for a combined kernel. This - * boils down to solve the convex program - * \f[ - * \min_\beta \{\beta^T (Q+\lambda_m) \beta \quad \text{s.t.}\quad \beta^T \eta=1, \beta\succeq 0\}, - * \f] - * where \f$\eta\f$ is a vector whose elements are the MMDs of the baseline - * kernels and \f$Q\f$ is a linear time estimate of the covariance of \f$\eta\f$. - * - * This only works for the CLinearTimeMMD statistic. * - * IMPORTANT: The kernel has to be selected on different data than the two-sample - * test is performed on. - * - * The method is described in - * Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., - * Balakrishnan, S., Pontil, M., & Fukumizu, K. (2012). - * Optimal kernel choice for large-scale two-sample tests. - * Advances in Neural Information Processing Systems. - */ -class CMMDKernelSelectionCombOpt: public CMMDKernelSelectionComb -{ -public: - - /** Default constructor */ - CMMDKernelSelectionCombOpt(); - - /** Constructor that initialises the underlying MMD instance - * - * @param mmd linear time mmd MMD instance to use. 
- * @param lambda ridge that is added to standard deviation, a sensible value - * is 10E-5 which is the default - */ - CMMDKernelSelectionCombOpt(CKernelTwoSampleTest* mmd, - float64_t lambda=10E-5); - - /** Destructor */ - virtual ~CMMDKernelSelectionCombOpt(); - - /** Computes optimal kernel weights using the ratio of the squared MMD by its - * standard deviation as a criterion, where both expressions are estimated - * in linear time. - * - * This boils down to solving a convex program which is quadratic in the - * number of kernels. See class description. - * - * SHOGUN has to be compiled with LAPACK to make this available. See - * set_opt* methods for optimization parameters. - * - * IMPORTANT: Kernel weights have to be learned on different data than is - * used for testing/evaluation! - */ - virtual SGVector compute_measures(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const - { - return "MMDKernelSelectionCombOpt"; - } - -private: - /** Initializer */ - void init(); - -protected: - /** Ridge that is added to the diagonal of the Q matrix in the optimization - * problem */ - float64_t m_lambda; -}; - -} -#endif //USE_GPL_SHOGUN -#endif /* __MMDKERNELSELECTIONCOMBOPT_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionMax.cpp b/src/shogun/statistics/MMDKernelSelectionMax.cpp deleted file mode 100644 index fe2c2c7ca6e..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionMax.cpp +++ /dev/null @@ -1,32 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#include - -using namespace shogun; - -CMMDKernelSelectionMax::CMMDKernelSelectionMax() : CMMDKernelSelection() -{ -} - -CMMDKernelSelectionMax::CMMDKernelSelectionMax( - CKernelTwoSampleTest* mmd) : CMMDKernelSelection(mmd) -{ -} - -CMMDKernelSelectionMax::~CMMDKernelSelectionMax() -{ -} - -SGVector CMMDKernelSelectionMax::compute_measures() -{ - /* simply return vector with MMDs */ - return m_estimator->compute_statistic(true); -} diff --git a/src/shogun/statistics/MMDKernelSelectionMax.h b/src/shogun/statistics/MMDKernelSelectionMax.h deleted file mode 100644 index 34bb9fcadc0..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionMax.h +++ /dev/null @@ -1,60 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#ifndef __MMDKERNELSELECTIONMAX_H_ -#define __MMDKERNELSELECTIONMAX_H_ - -#include - -#include - -namespace shogun -{ - -/** @brief Kernel selection class that selects the single kernel that maximises - * the MMD statistic. Works for CQuadraticTimeMMD and CLinearTimeMMD. This leads - * to a heuristic that is better than the standard median heuristic for - * Gaussian kernels. However, it comes with no guarantees. - * - * Optimal selection of single kernels can be found in the class - * CMMDKernelSelectionOpt - * - * This method was first described in - * Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., - * & Schoelkopf, B. - * Kernel choice and classifiability for RKHS embeddings of probability - * distributions. Advances in Neural Information Processing Systems (2009). 
- */ -class CMMDKernelSelectionMax: public CMMDKernelSelection -{ -public: - - /** Default constructor */ - CMMDKernelSelectionMax(); - - /** Constructor that initialises the underlying MMD instance - * - * @param mmd MMD instance to use. Has to be an MMD based kernel two-sample - * test. Currently: linear or quadratic time MMD. - */ - CMMDKernelSelectionMax(CKernelTwoSampleTest* mmd); - - /** Destructor */ - virtual ~CMMDKernelSelectionMax(); - - /** @return vector the MMD of all single baseline kernels */ - virtual SGVector compute_measures(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const { return "MMDKernelSelectionMax"; } -}; - -} - -#endif /* __MMDKERNELSELECTIONMAX_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionMedian.cpp b/src/shogun/statistics/MMDKernelSelectionMedian.cpp deleted file mode 100644 index fad79dcb3d0..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionMedian.cpp +++ /dev/null @@ -1,237 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - - -using namespace shogun; - -CMMDKernelSelectionMedian::CMMDKernelSelectionMedian() : - CMMDKernelSelection() -{ - init(); -} - -CMMDKernelSelectionMedian::CMMDKernelSelectionMedian( - CKernelTwoSampleTest* mmd, index_t num_data_distance) : - CMMDKernelSelection(mmd) -{ - /* assert that a combined kernel is used */ - CKernel* kernel=mmd->get_kernel(); - CFeatures* lhs=kernel->get_lhs(); - CFeatures* rhs=kernel->get_rhs(); - REQUIRE(kernel, "%s::%s(): No kernel set!\n", get_name(), get_name()); - REQUIRE(kernel->get_kernel_type()==K_COMBINED, "%s::%s(): Requires " - "CombinedKernel as kernel. 
Yours is %s", get_name(), get_name(), - kernel->get_name()); - - /* assert that all subkernels are Gaussian kernels */ - CCombinedKernel* combined=(CCombinedKernel*)kernel; - - for (index_t k_idx=0; k_idxget_num_kernels(); k_idx++) - { - CKernel* subkernel=combined->get_kernel(k_idx); - REQUIRE(kernel, "%s::%s(): Subkernel (%d) of current kernel is not" - " of type GaussianKernel\n", get_name(), get_name(), k_idx); - SG_UNREF(subkernel); - } - - /* assert 64 bit dense features since EuclideanDistance can only handle - * those */ - if (m_estimator->get_statistic_type()==S_QUADRATIC_TIME_MMD) - { - CFeatures* features=((CQuadraticTimeMMD*)m_estimator)->get_p_and_q(); - REQUIRE(features->get_feature_class()==C_DENSE && - features->get_feature_type()==F_DREAL, "%s::select_kernel(): " - "Only 64 bit float dense features allowed, these are \"%s\"" - " and of type %d\n", - get_name(), features->get_name(), features->get_feature_type()); - SG_UNREF(features); - } - else if (m_estimator->get_statistic_type()==S_LINEAR_TIME_MMD) - { - CStreamingFeatures* p=((CLinearTimeMMD*)m_estimator)->get_streaming_p(); - CStreamingFeatures* q=((CLinearTimeMMD*)m_estimator)->get_streaming_q(); - REQUIRE(p->get_feature_class()==C_STREAMING_DENSE && - p->get_feature_type()==F_DREAL, "%s::select_kernel(): " - "Only 64 bit float streaming dense features allowed, these (p) " - "are \"%s\" and of type %d\n", - get_name(), p->get_name(), p->get_feature_type()); - - REQUIRE(p->get_feature_class()==C_STREAMING_DENSE && - p->get_feature_type()==F_DREAL, "%s::select_kernel(): " - "Only 64 bit float streaming dense features allowed, these (q) " - "are \"%s\" and of type %d\n", - get_name(), q->get_name(), q->get_feature_type()); - SG_UNREF(p); - SG_UNREF(q); - } - - SG_UNREF(kernel); - SG_UNREF(lhs); - SG_UNREF(rhs); - - init(); - - m_num_data_distance=num_data_distance; -} - -CMMDKernelSelectionMedian::~CMMDKernelSelectionMedian() -{ -} - -void CMMDKernelSelectionMedian::init() -{ - 
SG_ADD(&m_num_data_distance, "m_num_data_distance", "Number of elements to " - "to compute median distance on", MS_NOT_AVAILABLE); - - /* this is a sensible value */ - m_num_data_distance=1000; -} - -SGVector CMMDKernelSelectionMedian::compute_measures() -{ - SG_ERROR("%s::compute_measures(): Not implemented. Use select_kernel() " - "method!\n", get_name()); - return SGVector(); -} - -CKernel* CMMDKernelSelectionMedian::select_kernel() -{ - /* number of data for distace */ - index_t num_data=CMath::min(m_num_data_distance, m_estimator->get_m()); - - SGMatrix dists; - - /* compute all pairwise distances, depends which mmd statistic is used */ - if (m_estimator->get_statistic_type()==S_QUADRATIC_TIME_MMD) - { - /* fixed data, create merged copy of a random subset */ - - /* create vector with that correspond to the num_data first points of - * each distribution, remember data is stored jointly */ - SGVector subset(num_data*2); - index_t m=m_estimator->get_m(); - for (index_t i=0; iget_p_and_q(); - features->add_subset(subset); - - /* cast is safe, see constructor */ - CDenseFeatures* dense_features= - (CDenseFeatures*) features; - - CEuclideanDistance* distance=new CEuclideanDistance(dense_features, - dense_features); - dists=distance->get_distance_matrix(); - features->remove_subset(); - SG_UNREF(distance); - SG_UNREF(features); - } - else if (m_estimator->get_statistic_type()==S_LINEAR_TIME_MMD) - { - /* just stream the desired number of points */ - CLinearTimeMMD* linear_mmd=(CLinearTimeMMD*)m_estimator; - - CStreamingFeatures* p=linear_mmd->get_streaming_p(); - CStreamingFeatures* q=linear_mmd->get_streaming_q(); - - /* cast is safe, see constructor */ - CDenseFeatures* p_streamed=(CDenseFeatures*) - p->get_streamed_features(num_data); - CDenseFeatures* q_streamed=(CDenseFeatures*) - q->get_streamed_features(num_data); - - /* for safety */ - SG_REF(p_streamed); - SG_REF(q_streamed); - - /* create merged feature object */ - CDenseFeatures* merged=(CDenseFeatures*) 
- p_streamed->create_merged_copy(q_streamed); - - /* compute pairwise distances */ - CEuclideanDistance* distance=new CEuclideanDistance(merged, merged); - dists=distance->get_distance_matrix(); - - /* clean up */ - SG_UNREF(distance); - SG_UNREF(p_streamed); - SG_UNREF(q_streamed); - SG_UNREF(p); - SG_UNREF(q); - } - - /* create a vector where the zeros have been removed, use upper triangle - * only since distances are symmetric */ - SGVector dist_vec(dists.num_rows*(dists.num_rows-1)/2); - index_t write_idx=0; - for (index_t i=0; i(dist_vec.vector, dist_vec.vlen); - float64_t median_distance=dist_vec[dist_vec.vlen/2]; - SG_DEBUG("median_distance: %f\n", median_distance); - - /* shogun has no square and factor two in its kernel width, MATLAB does - * median_width = sqrt(0.5*median_distance), we do this */ - float64_t shogun_sigma=median_distance; - SG_DEBUG("kernel width (shogun): %f\n", shogun_sigma); - - /* now of all kernels, find the one which has its width closest - * Cast is safe due to constructor of MMDKernelSelection class */ - CCombinedKernel* combined=(CCombinedKernel*)m_estimator->get_kernel(); - float64_t min_distance=CMath::MAX_REAL_NUMBER; - CKernel* min_kernel=NULL; - float64_t distance; - for (index_t i=0; iget_num_subkernels(); ++i) - { - CKernel* current=combined->get_kernel(i); - REQUIRE(current->get_kernel_type()==K_GAUSSIAN, "%s::select_kernel(): " - "%d-th kernel is not a Gaussian but \"%s\"!\n", get_name(), i, - current->get_name()); - - /* check if width is closer to median width */ - distance=CMath::abs(((CGaussianKernel*)current)->get_width()- - shogun_sigma); - - if (distance - -#include - -namespace shogun -{ - -/** @brief Implements MMD kernel selection for a number of Gaussian baseline - * kernels via selecting the one with a bandwidth parameter that is closest to - * the median of all pairwise distances in the underlying data. 
Therefore, it - * only works for data to which a GaussianKernel can be applied, which are - * grouped under the class CDotFeatures in SHOGUN. - * - * This method works reasonably well if distinguishing characteristics of data are not - * hidden at a different length-scale than the overall one. In addition it is - * fast to compute. In other cases, it is a bad choice. - * - * Optimal selection of single kernels can be found in the class - * CMMDKernelSelectionOpt - * - * Described among other places in - * Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. - * (2012). - * A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671-721. - */ -class CMMDKernelSelectionMedian: public CMMDKernelSelection -{ -public: - - /** Default constructor */ - CMMDKernelSelectionMedian(); - - /** Constructor that initialises the underlying MMD instance - * - * @param mmd MMD instance to use. Has to be an MMD based kernel two-sample - * test. - * @param num_data_distance Number of points used to compute the - * median distance. Since the median is stable, this does not need to be - * all data; a small subset is sufficient.
- */ - CMMDKernelSelectionMedian(CKernelTwoSampleTest* mmd, - index_t num_data_distance=1000); - - /** Destructor */ - virtual ~CMMDKernelSelectionMedian(); - - /** @return Throws an error and should not be used */ - virtual SGVector<float64_t> compute_measures(); - - /** Returns the baseline kernel whose bandwidth parameter is closest to the - * median of the pairwise distances of the underlying data - * - * @return selected kernel (SG_REF'ed) - */ - virtual CKernel* select_kernel(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const { return "MMDKernelSelectionMedian"; } - -private: - /* initialises and registers member variables */ - void init(); - -protected: - /** maximum number of data to be used for median distance computation */ - index_t m_num_data_distance; -}; - -} - -#endif /* __MMDKERNELSELECTIONMEDIAN_H_ */ diff --git a/src/shogun/statistics/MMDKernelSelectionOpt.cpp b/src/shogun/statistics/MMDKernelSelectionOpt.cpp deleted file mode 100644 index b03dfaee18f..00000000000 --- a/src/shogun/statistics/MMDKernelSelectionOpt.cpp +++ /dev/null @@ -1,62 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#include -#include - -using namespace shogun; - -CMMDKernelSelectionOpt::CMMDKernelSelectionOpt() : - CMMDKernelSelection() -{ - init(); -} - -CMMDKernelSelectionOpt::CMMDKernelSelectionOpt( - CKernelTwoSampleTest* mmd, float64_t lambda) : - CMMDKernelSelection(mmd) -{ - init(); - - /* currently, this method is only developed for the linear time MMD */ - REQUIRE(dynamic_cast<CLinearTimeMMD*>(mmd), "%s::%s(): Only " - "CLinearTimeMMD is currently supported!
Provided instance is " - "\"%s\"\n", get_name(), get_name(), mmd->get_name()); - - m_lambda=lambda; -} - -CMMDKernelSelectionOpt::~CMMDKernelSelectionOpt() -{ -} - -SGVector<float64_t> CMMDKernelSelectionOpt::compute_measures() -{ - /* compute mmd on all subkernels using the same data. Note that underlying - * kernel was asserted to be a combined one */ - SGVector<float64_t> mmds; - SGVector<float64_t> vars; - ((CLinearTimeMMD*)m_estimator)->compute_statistic_and_variance(mmds, vars, true); - - /* we know that the underlying MMD is linear time version, cast is safe */ - SGVector<float64_t> measures(mmds.vlen); - - for (index_t i=0; i - -#include - -namespace shogun -{ - -class CLinearTimeMMD; - -/** @brief Implements optimal kernel selection for single kernels. - * Given a number of baseline kernels, this method selects the one that - * minimizes the type II error for a given type I error for a two-sample test. - * This only works for the CLinearTimeMMD statistic. - * - * The idea is to maximise the ratio of MMD and its standard deviation. - * - * IMPORTANT: The kernel has to be selected on different data than the two-sample - * test is performed on. - * - * Described in - * Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., - * Balakrishnan, S., Pontil, M., & Fukumizu, K. (2012). - * Optimal kernel choice for large-scale two-sample tests. - * Advances in Neural Information Processing Systems. - */ -class CMMDKernelSelectionOpt: public CMMDKernelSelection -{ -public: - - /** Default constructor */ - CMMDKernelSelectionOpt(); - - /** Constructor that initialises the underlying MMD instance. Currently, - * only the linear time MMD is supported - * - * @param mmd MMD instance to use - * @param lambda ridge that is added to standard deviation in order to - * prevent division by zero. A sensible value is for example 1E-5.
- */ - CMMDKernelSelectionOpt(CKernelTwoSampleTest* mmd, - float64_t lambda=10E-5); - - /** Destructor */ - virtual ~CMMDKernelSelectionOpt(); - - /** Overwrites superclass method and ensures that all statistics are - * computed on the same data. Since linear time MMD is a streaming - * statistic, just computing all statistics one after another would use - * different data. This method makes sure that all kernels are used at once - * - * @return vector with kernel criterion values for all attached kernels - */ - virtual SGVector compute_measures(); - - /** @return name of the SGSerializable */ - virtual const char* get_name() const { return "MMDKernelSelectionOpt"; } - -private: - /** Initializer */ - void init(); - -protected: - /** Ridge that is added to the denumerator of the ratio of MMD and its - * standard deviation */ - float64_t m_lambda; -}; - -} - -#endif /* __MMDKERNELSELECTIONOPTSINGLE_H_ */ diff --git a/src/shogun/statistics/NOCCO.cpp b/src/shogun/statistics/NOCCO.cpp deleted file mode 100644 index e2dc9ad4a4c..00000000000 --- a/src/shogun/statistics/NOCCO.cpp +++ /dev/null @@ -1,268 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * Written (w) 2012-2013 Heiko Strathmann - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. 
- * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include - -#include -#include -#include -#include -#include - -using namespace shogun; -using namespace Eigen; - -CNOCCO::CNOCCO() : CKernelIndependenceTest() -{ - init(); -} - -CNOCCO::CNOCCO(CKernel* kernel_p, CKernel* kernel_q, CFeatures* p, CFeatures* q) - : CKernelIndependenceTest(kernel_p, kernel_q, p, q) -{ - init(); - - // only equal number of samples are allowed - if (p && q) - { - REQUIRE(p->get_num_vectors()==q->get_num_vectors(), - "Only equal number of samples from both the distributions are " - "possible. 
Provided %d samples from p and %d samples from q!\n", - p->get_num_vectors(), q->get_num_vectors()); - - m_num_features=p->get_num_vectors(); - } -} - -CNOCCO::~CNOCCO() -{ -} - -void CNOCCO::init() -{ - SG_ADD(&m_num_features, "num_features", - "Number of features from each of the distributions", - MS_NOT_AVAILABLE); - SG_ADD(&m_epsilon, "epsilon", "The regularization constant", - MS_NOT_AVAILABLE); - - m_num_features=0; - m_epsilon=0.0; - - // we need PERMUTATION as the null approximation method here - m_null_approximation_method=PERMUTATION; -} - -void CNOCCO::set_p(CFeatures* p) -{ - CIndependenceTest::set_p(p); - REQUIRE(m_p, "Provided feature for p cannot be null!\n"); - m_num_features=m_p->get_num_vectors(); -} - -void CNOCCO::set_q(CFeatures* q) -{ - CIndependenceTest::set_q(q); - REQUIRE(m_q, "Provided feature for q cannot be null!\n"); - m_num_features=m_q->get_num_vectors(); -} - -void CNOCCO::set_epsilon(float64_t epsilon) -{ - m_epsilon=epsilon; -} - -float64_t CNOCCO::get_epsilon() const -{ - return m_epsilon; -} - -SGMatrix CNOCCO::compute_helper(SGMatrix m) -{ - SG_DEBUG("Entering!\n"); - - const index_t n=m_num_features; - Map mat(m.matrix, n, n); - - // the result matrix res = m * inv(m + n*epsilon*eye(n,n)) - SGMatrix res(n, n); - MatrixXd to_inv=mat+n*m_epsilon*MatrixXd::Identity(n,n); - - // since the matrix is SPD, instead of directly computing the inverse, - // we compute the Cholesky decomposition and solve systems (see class - // documentation for details) - LLT chol(to_inv); - - // compute the matrix times inverse by solving systems - VectorXd e=VectorXd::Zero(n); - for (index_t i=0; i Gx=get_kernel_matrix_K(); - SGMatrix Gy=get_kernel_matrix_L(); - - // center the kernel matrices - Gx.center(); - Gy.center(); - - SGMatrix Rx=compute_helper(Gx); - SGMatrix Ry=compute_helper(Gy); - - Map Rx_map(Rx.matrix, Rx.num_rows, Rx.num_cols); - Map Ry_map(Ry.matrix, Ry.num_rows, Ry.num_cols); - - // compute the trace of the matrix multiplication 
without computing the - // off-diagonal entries of the final matrix and just the diagonal entries - float64_t result=0.0; - for (index_t i=0; i CNOCCO::sample_null() -{ - SG_DEBUG("Entering!\n") - - /* replace current kernel via precomputed custom kernel and call superclass - * method */ - - /* backup references to old kernels */ - CKernel* kernel_p=m_kernel_p; - CKernel* kernel_q=m_kernel_q; - - /* init kernels before to be sure that everything is fine - * kernel function between two samples from different distributions - * is never computed - in fact, they may as well have different features */ - m_kernel_p->init(m_p, m_p); - m_kernel_q->init(m_q, m_q); - - /* precompute kernel matrices */ - CCustomKernel* precomputed_p=new CCustomKernel(m_kernel_p); - CCustomKernel* precomputed_q=new CCustomKernel(m_kernel_q); - SG_REF(precomputed_p); - SG_REF(precomputed_q); - - /* temporarily replace own kernels */ - m_kernel_p=precomputed_p; - m_kernel_q=precomputed_q; - - /* use superclass sample_null which shuffles the entries for one - * distribution using index permutation on rows and columns of - * kernel matrix from one distribution, while accessing the other - * in its original order and then compute statistic */ - SGVector null_samples=CKernelIndependenceTest::sample_null(); - - /* restore kernels */ - m_kernel_p=kernel_p; - m_kernel_q=kernel_q; - - SG_UNREF(precomputed_p); - SG_UNREF(precomputed_q); - - SG_DEBUG("Leaving!\n") - return null_samples; -} diff --git a/src/shogun/statistics/NOCCO.h b/src/shogun/statistics/NOCCO.h deleted file mode 100644 index 7989a95ea55..00000000000 --- a/src/shogun/statistics/NOCCO.h +++ /dev/null @@ -1,224 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * Written (w) 2012-2013 Heiko Strathmann - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. 
Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef NOCCO_H_ -#define NOCCO_H_ - -#include - -#include - -namespace shogun -{ - -template class SGMatrix; - -/** @brief This class implements the NOrmalized Cross Covariance Operator - * (NOCCO) based independence test as described in [1]. - * - * The test of independence is performed as follows: Given samples \f$Z=\{(x_i, - * y_i)\}_{i=1}^n\f$ from the joint distribution \f$\textbf{P}_{XY}\f$, - * does the joint distribution factorize as \f$\textbf{P}_{XY}=\textbf{P}_X - * \textbf{P}_Y\f$? The null hypothesis says yes and the alternative hypothesis - * says no. 
- * - * The dependence of the random variables \f$\mathbf X=\{x_i\}\f$ and \f$ - * \mathbf Y=\{y_i\}\f$ can be measured via the cross-covariance operator - * \f$\boldsymbol\Sigma_{YX}\f$ which becomes \f$\mathbf{0}\f$ if and only if - * \f$\mathbf X\f$ and \f$\mathbf Y\f$ are independent. This term factorizes as - * \f$\boldsymbol\Sigma_{YX}=\boldsymbol\Sigma_{YY}^{\frac{1}{2}}\mathbf{V}_ - * {YX}\boldsymbol\Sigma_{XX}^{\frac{1}{2}}\f$, where \f$\boldsymbol\Sigma_ - * {XX}\f$ and \f$\boldsymbol\Sigma_{YY}\f$ are known as covariance operator and - * \f$\mathbf{V}_{YX}\f$ is known as normalized cross-covariance operator. The - * paper uses the Hilbert-Schmidt norm of \f$\mathbf V_{YX}\f$ as a dependence - * measure of the independence test (see paper for theroretical details). - * - * This class overrides the compute_statistic() method of the superclass which - * computes an unbiased estimate of the normalized cross covariance operator - * norm. Given the kernels \f$K\f$ (for \f$\mathbf X\f$) and \f$L\f$ (for - * \f$\mathbf Y\f$), if we denote the doubly centered Gram matrices as - * \f$\mathbf{G}_X=\mathbf{HKH}\f$ and \f$\mathbf{G}_Y=\mathbf{HLH}\f$ - * (where \f$\mathbf H=\mathbf I-\frac{1}{n}\mathbf{1}\f$), then the operator - * norm is estimated as - * \f[ - * \hat{I}^{\text{NOCCO}}=\text{Trace}\left[\mathbf{R_X R_Y}\right] - * \f] - * where \f$\mathbf{R}_X=\mathbf{G}_X(\mathbf{G}_X+n\varepsilon_n\mathbf{I}) - * ^{-1}\f$ and \f$\mathbf{R}_Y=\mathbf{G}_Y(\mathbf{G}_Y+n\varepsilon_n - * \mathbf{I})^{-1}\f$ and \f$\varepsilon_n\gt 0\f$ is a regularization - * constant. 
- * - * To avoid the numerical issues of computing the inverses in the above terms - * directly, this class uses the Cholesky decomposition - * \f$\mathbf{GG}_*=\mathbf{LL}^\top\f$ (where \f$\mathbf{GG}_*=\mathbf{G}_*+ - * n\varepsilon_n\mathbf{I}\f$) and solves the systems \f$\mathbf{GG}_* - * \mathbf x_i=\mathbf{LL}^\top\mathbf x_i=\mathbf e_i\f$ (\f$\mathbf e_i\f$ - * being the \f$i^{\text{th}}\f$ column of \f$\mathbf I_n\f$) one by one. On - * the fly it then uses the solution vectors \f$\mathbf x_i\f$ to compute the - * matrix-matrix product \f$\mathbf C_*=\mathbf G_*\mathbf{GG}_*^{-1}\f$ - * using \f$\mathbf C_{*,(j,i)}=\mathbf G_{*,j}\cdot \mathbf x_i\f$, where - * \f$\mathbf G_{*,j}\f$ is the \f$j^{\text{th}}\f$ row of \f$\mathbf G_*\f$ (or - * column, since it is symmetric), and then discards the vector. - * - * The final trace computation is also simplified using the symmetry of the - * matrices \f$\mathbf R_X\f$ and \f$\mathbf R_Y\f$. Computation of the - * off-diagonal elements is avoided using - * \f[ - * \text{Trace}\left[\mathbf R_X \mathbf R_Y\right ]=\sum_{i=1}^n - * \mathbf R_X^i\cdot \mathbf R_Y^i - * \f] - * - * For performing the independence test, a PERMUTATION test is used by first - * randomly shuffling the samples from one of the distributions while keeping - * the samples from the other distribution in the original order. This way we - * sample the null distribution and compute p-value and threshold for a given - * test level. - * - * [1]: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Scholkopf: - * Kernel Measures of Conditional Dependence. NIPS 2007 - */ -class CNOCCO : public CKernelIndependenceTest -{ -public: - /** Constructor */ - CNOCCO(); - - /** Constructor.
- * - * Initializes the kernels and features from the two distributions and - * SG_REFs them - * - * @param kernel_p kernel to use on samples from p - * @param kernel_q kernel to use on samples from q - * @param p samples from distribution p - * @param q samples from distribution q - */ - CNOCCO(CKernel* kernel_p, CKernel* kernel_q, CFeatures* p, CFeatures* q); - - /** Destructor */ - virtual ~CNOCCO(); - - /** Computes the NOCCO statistic (see class description) for underlying - * kernels and data. - * - * Note that since kernel matrices have to be stored, it has quadratic - * space costs. - * - * @return unbiased estimate of NOCCO - */ - virtual float64_t compute_statistic(); - - /** Computes a p-value based on current method for approximating the - * null-distribution. For a p-value p, the given statistic lies at the - * (1-p) quantile of the null-distribution. - * - * @param statistic statistic value to compute the p-value for - * @return p-value; the given statistic is the (1-p) quantile of the - * null distribution - */ - virtual float64_t compute_p_value(float64_t statistic); - - /** Computes a threshold based on current method for approximating the - * null-distribution. The threshold is the value that a statistic has - * to have in order to reject the null-hypothesis.
- * - * @param alpha test level to reject null-hypothesis - * @return threshold for statistics to reject null-hypothesis - */ - virtual float64_t compute_threshold(float64_t alpha); - - /** @return the class name */ - virtual const char* get_name() const - { - return "NOCCO"; - } - - /** @return the statistic type of this test statistic */ - virtual EStatisticType get_statistic_type() const - { - return S_NOCCO; - } - - /** Setter for features from distribution p, SG_REFs it - * - * @param p features from p - */ - virtual void set_p(CFeatures* p); - - /** Setter for features from distribution q, SG_REFs it - * - * @param q features from q - */ - virtual void set_q(CFeatures* q); - - /** - * Setter for regularization parameter epsilon - * @param epsilon the regularization parameter - */ - void set_epsilon(float64_t epsilon); - - /** @return epsilon the regularization parameter */ - float64_t get_epsilon() const; - - /** Merges both sets of samples and computes the test statistic - * m_num_null_sample times. This version precomputes the kernel matrix - * once by hand, then samples using this one. The matrix has - * to be stored anyway when statistic is computed. - * - * @return vector of all statistics - */ - virtual SGVector<float64_t> sample_null(); - -protected: - /** - * Helper method which computes the matrix times matrix inverse using LLT - * solve (Cholesky) without storing the inverse (see class documentation).
- * - * @param m the centered Gram matrix - * @return the result matrix of the multiplication - */ - SGMatrix compute_helper(SGMatrix m); - -private: - /** Register parameters and initialize with defaults */ - void init(); - - /** Number of features from the distributions (should be equal for both) */ - index_t m_num_features; - - /** The regularization constant */ - float64_t m_epsilon; - -}; - -} - -#endif // NOCCO_H_ diff --git a/src/shogun/statistics/QuadraticTimeMMD.cpp b/src/shogun/statistics/QuadraticTimeMMD.cpp deleted file mode 100644 index fff724e5e54..00000000000 --- a/src/shogun/statistics/QuadraticTimeMMD.cpp +++ /dev/null @@ -1,1115 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -#include - -using namespace Eigen; - -CQuadraticTimeMMD::CQuadraticTimeMMD() : CKernelTwoSampleTest() -{ - init(); -} - -CQuadraticTimeMMD::CQuadraticTimeMMD(CKernel* kernel, CFeatures* p_and_q, - index_t m) : - CKernelTwoSampleTest(kernel, p_and_q, m) -{ - init(); -} - -CQuadraticTimeMMD::CQuadraticTimeMMD(CKernel* kernel, CFeatures* p, - CFeatures* q) : CKernelTwoSampleTest(kernel, p, q) -{ - init(); -} - -CQuadraticTimeMMD::CQuadraticTimeMMD(CCustomKernel* custom_kernel, index_t m) : - CKernelTwoSampleTest(custom_kernel, NULL, m) -{ - init(); -} - -CQuadraticTimeMMD::~CQuadraticTimeMMD() -{ -} - -void CQuadraticTimeMMD::init() -{ - SG_ADD(&m_num_samples_spectrum, "num_samples_spectrum", "Number of samples" - " for spectrum method null-distribution approximation", - MS_NOT_AVAILABLE); - SG_ADD(&m_num_eigenvalues_spectrum, "num_eigenvalues_spectrum", "Number of " - " Eigenvalues for spectrum method null-distribution approximation", - MS_NOT_AVAILABLE); - SG_ADD((machine_int_t*)&m_statistic_type, "statistic_type", - "Biased or unbiased MMD", MS_NOT_AVAILABLE); - - m_num_samples_spectrum=0; - 
m_num_eigenvalues_spectrum=0; - m_statistic_type=UNBIASED; -} - -SGVector CQuadraticTimeMMD::compute_unbiased_statistic_variance( - int m, int n) -{ - SG_DEBUG("Entering!\n"); - - /* init kernel with features. NULL check is handled in compute_statistic */ - m_kernel->init(m_p_and_q, m_p_and_q); - - /* computing kernel values and their sums on the fly that are used both in - computing statistic and variance */ - - /* the following matrix stores row-wise sum of kernel values k(X,X') in - the first column and row-wise squared sum of kernel values k^2(X,X') - in the second column. m entries in both column */ - SGMatrix xx_sum_sq_sum_rowwise=m_kernel-> - row_wise_sum_squared_sum_symmetric_block(0, m); - - /* row-wise sum of kernel values k(Y,Y'), n entries */ - SGVector yy_sum_rowwise=m_kernel-> - row_wise_sum_symmetric_block(m, n); - - /* row-wise and col-wise sum of kernel values k(X,Y), m+n entries - first m entries are row-wise sum, rest n entries are col-wise sum */ - SGVector xy_sum_rowcolwise=m_kernel-> - row_col_wise_sum_block(0, m, m, n); - - /* computing overall sum and squared sum from above for convenience */ - - SGVector xx_sum_rowwise(m); - std::copy(xx_sum_sq_sum_rowwise.matrix, xx_sum_sq_sum_rowwise.matrix+m, - xx_sum_rowwise.vector); - - SGVector xy_sum_rowwise(m); - std::copy(xy_sum_rowcolwise.vector, xy_sum_rowcolwise.vector+m, - xy_sum_rowwise.vector); - - SGVector xy_sum_colwise(n); - std::copy(xy_sum_rowcolwise.vector+m, xy_sum_rowcolwise.vector+m+n, - xy_sum_colwise.vector); - - float64_t xx_sq_sum=0.0; - for (index_t i=0; i results(3); - results[0]=statistic; - results[1]=var_null; - results[2]=var_alt; - - SG_DEBUG("Leaving!\n"); - - return results; -} - -SGVector CQuadraticTimeMMD::compute_biased_statistic_variance(int m, int n) -{ - SG_DEBUG("Entering!\n"); - - /* init kernel with features. 
NULL check is handled in compute_statistic */ - m_kernel->init(m_p_and_q, m_p_and_q); - - /* computing kernel values and their sums on the fly that are used both in - computing statistic and variance */ - - /* the following matrix stores row-wise sum of kernel values k(X,X') in - the first column and row-wise squared sum of kernel values k^2(X,X') - in the second column. m entries in both column */ - SGMatrix xx_sum_sq_sum_rowwise=m_kernel-> - row_wise_sum_squared_sum_symmetric_block(0, m, false); - - /* row-wise sum of kernel values k(Y,Y'), n entries */ - SGVector yy_sum_rowwise=m_kernel-> - row_wise_sum_symmetric_block(m, n, false); - - /* row-wise and col-wise sum of kernel values k(X,Y), m+n entries - first m entries are row-wise sum, rest n entries are col-wise sum */ - SGVector xy_sum_rowcolwise=m_kernel-> - row_col_wise_sum_block(0, m, m, n); - - /* computing overall sum and squared sum from above for convenience */ - - SGVector xx_sum_rowwise(m); - std::copy(xx_sum_sq_sum_rowwise.matrix, xx_sum_sq_sum_rowwise.matrix+m, - xx_sum_rowwise.vector); - - SGVector xy_sum_rowwise(m); - std::copy(xy_sum_rowcolwise.vector, xy_sum_rowcolwise.vector+m, - xy_sum_rowwise.vector); - - SGVector xy_sum_colwise(n); - std::copy(xy_sum_rowcolwise.vector+m, xy_sum_rowcolwise.vector+m+n, - xy_sum_colwise.vector); - - float64_t xx_sq_sum=0.0; - for (index_t i=0; i results(3); - results[0]=statistic; - results[1]=var_null; - results[2]=var_alt; - - SG_DEBUG("Leaving!\n"); - - return results; -} - -SGVector CQuadraticTimeMMD::compute_incomplete_statistic_variance(int n) -{ - SG_DEBUG("Entering!\n"); - - /* init kernel with features. 
NULL check is handled in compute_statistic */ - m_kernel->init(m_p_and_q, m_p_and_q); - - /* computing kernel values and their sums on the fly that are used both in - computing statistic and variance */ - - /* the following matrix stores row-wise sum of kernel values k(X,X') in - the first column and row-wise squared sum of kernel values k^2(X,X') - in the second column. n entries in both column */ - SGMatrix xx_sum_sq_sum_rowwise=m_kernel-> - row_wise_sum_squared_sum_symmetric_block(0, n); - - /* row-wise sum of kernel values k(Y,Y'), n entries */ - SGVector yy_sum_rowwise=m_kernel-> - row_wise_sum_symmetric_block(n, n); - - /* row-wise and col-wise sum of kernel values k(X,Y), 2n entries - first n entries are row-wise sum, rest n entries are col-wise sum */ - SGVector xy_sum_rowcolwise=m_kernel-> - row_col_wise_sum_block(0, n, n, n, true); - - /* computing overall sum and squared sum from above for convenience */ - - SGVector xx_sum_rowwise(n); - std::copy(xx_sum_sq_sum_rowwise.matrix, xx_sum_sq_sum_rowwise.matrix+n, - xx_sum_rowwise.vector); - - SGVector xy_sum_rowwise(n); - std::copy(xy_sum_rowcolwise.vector, xy_sum_rowcolwise.vector+n, - xy_sum_rowwise.vector); - - SGVector xy_sum_colwise(n); - std::copy(xy_sum_rowcolwise.vector+n, xy_sum_rowcolwise.vector+2*n, - xy_sum_colwise.vector); - - float64_t xx_sq_sum=0.0; - for (index_t i=0; i results(3); - results[0]=statistic; - results[1]=var_null; - results[2]=var_alt; - - SG_DEBUG("Leaving!\n"); - - return results; -} - -float64_t CQuadraticTimeMMD::compute_unbiased_statistic(int m, int n) -{ - return compute_unbiased_statistic_variance(m, n)[0]; -} - -float64_t CQuadraticTimeMMD::compute_biased_statistic(int m, int n) -{ - return compute_biased_statistic_variance(m, n)[0]; -} - -float64_t CQuadraticTimeMMD::compute_incomplete_statistic(int n) -{ - return compute_incomplete_statistic_variance(n)[0]; -} - -float64_t CQuadraticTimeMMD::compute_statistic() -{ - REQUIRE(m_kernel, "No kernel specified!\n") - - 
index_t m=m_m; - index_t n=0; - - /* check if kernel is precomputed (custom kernel) */ - if (m_kernel->get_kernel_type()==K_CUSTOM) - n=m_kernel->get_num_vec_lhs()-m; - else - { - REQUIRE(m_p_and_q, "The samples are not initialized!\n"); - n=m_p_and_q->get_num_vectors()-m; - } - - SG_DEBUG("Computing MMD with %d samples from p and %d samples from q!\n", - m, n); - - float64_t result=0; - switch (m_statistic_type) - { - case UNBIASED: - result=compute_unbiased_statistic(m, n); - result*=m*n/float64_t(m+n); - break; - case UNBIASED_DEPRECATED: - result=compute_unbiased_statistic(m, n); - result*=m==n ? m : (m+n); - break; - case BIASED: - result=compute_biased_statistic(m, n); - result*=m*n/float64_t(m+n); - break; - case BIASED_DEPRECATED: - result=compute_biased_statistic(m, n); - result*=m==n? m : (m+n); - break; - case INCOMPLETE: - REQUIRE(m==n, "Only possible with equal number of samples from both" - "distribution!\n") - result=compute_incomplete_statistic(n); - result*=n/2; - break; - default: - SG_ERROR("Unknown statistic type!\n"); - break; - } - - return result; -} - -SGVector CQuadraticTimeMMD::compute_variance() -{ - REQUIRE(m_kernel, "No kernel specified!\n") - - index_t m=m_m; - index_t n=0; - - /* check if kernel is precomputed (custom kernel) */ - if (m_kernel->get_kernel_type()==K_CUSTOM) - n=m_kernel->get_num_vec_lhs()-m; - else - { - REQUIRE(m_p_and_q, "The samples are not initialized!\n"); - n=m_p_and_q->get_num_vectors()-m; - } - - SG_DEBUG("Computing MMD with %d samples from p and %d samples from q!\n", - m, n); - - SGVector result(2); - switch (m_statistic_type) - { - case UNBIASED: - case UNBIASED_DEPRECATED: - { - SGVector res=compute_unbiased_statistic_variance(m, n); - result[0]=res[1]; - result[1]=res[2]; - break; - } - case BIASED: - case BIASED_DEPRECATED: - { - SGVector res=compute_biased_statistic_variance(m, n); - result[0]=res[1]; - result[1]=res[2]; - break; - } - case INCOMPLETE: - { - REQUIRE(m==n, "Only possible with equal number 
of samples from both" - "distribution!\n") - SGVector res=compute_incomplete_statistic_variance(n); - result[0]=res[1]; - result[1]=res[2]; - break; - } - default: - SG_ERROR("Unknown statistic type!\n"); - break; - } - - return result; -} - -float64_t CQuadraticTimeMMD::compute_variance_under_null() -{ - return compute_variance()[0]; -} - -float64_t CQuadraticTimeMMD::compute_variance_under_alternative() -{ - return compute_variance()[1]; -} - -SGVector CQuadraticTimeMMD::compute_statistic(bool multiple_kernels) -{ - SGVector mmds; - if (!multiple_kernels) - { - /* just one mmd result */ - mmds=SGVector(1); - mmds[0]=compute_statistic(); - } - else - { - REQUIRE(m_kernel, "No kernel specified!\n") - REQUIRE(m_kernel->get_kernel_type()==K_COMBINED, - "multiple kernels specified, but underlying kernel is not of type " - "K_COMBINED\n"); - - /* cast and allocate memory for results */ - CCombinedKernel* combined=(CCombinedKernel*)m_kernel; - SG_REF(combined); - mmds=SGVector(combined->get_num_subkernels()); - - /* iterate through all kernels and compute statistic */ - /* TODO this might be done in parallel */ - for (index_t i=0; iget_kernel(i); - /* temporarily replace underlying kernel and compute statistic */ - m_kernel=current; - mmds[i]=compute_statistic(); - - SG_UNREF(current); - } - - /* restore combined kernel */ - m_kernel=combined; - SG_UNREF(combined); - } - - return mmds; -} - -SGMatrix CQuadraticTimeMMD::compute_variance(bool multiple_kernels) -{ - SGMatrix vars; - if (!multiple_kernels) - { - /* just one mmd result */ - vars=SGMatrix(1, 2); - SGVector result=compute_variance(); - vars(0, 0)=result[0]; - vars(0, 1)=result[1]; - } - else - { - REQUIRE(m_kernel, "No kernel specified!\n") - REQUIRE(m_kernel->get_kernel_type()==K_COMBINED, - "multiple kernels specified, but underlying kernel is not of type " - "K_COMBINED\n"); - - /* cast and allocate memory for results */ - CCombinedKernel* combined=(CCombinedKernel*)m_kernel; - SG_REF(combined); - 
vars=SGMatrix(combined->get_num_subkernels(), 2); - - /* iterate through all kernels and compute variance */ - /* TODO this might be done in parallel */ - for (index_t i=0; iget_kernel(i); - /* temporarily replace underlying kernel and compute variance */ - m_kernel=current; - SGVector result=compute_variance(); - vars(i, 0)=result[0]; - vars(i, 1)=result[1]; - - SG_UNREF(current); - } - - /* restore combined kernel */ - m_kernel=combined; - SG_UNREF(combined); - } - - return vars; -} - -float64_t CQuadraticTimeMMD::compute_p_value(float64_t statistic) -{ - float64_t result=0; - - switch (m_null_approximation_method) - { - case MMD2_SPECTRUM: - { - /* get samples from null-distribution and compute p-value of statistic */ - SGVector null_samples=sample_null_spectrum( - m_num_samples_spectrum, m_num_eigenvalues_spectrum); - CMath::qsort(null_samples); - index_t pos=null_samples.find_position_to_insert(statistic); - result=1.0-((float64_t)pos)/null_samples.vlen; - break; - } - - case MMD2_SPECTRUM_DEPRECATED: - { - /* get samples from null-distribution and compute p-value of statistic */ - SGVector null_samples=sample_null_spectrum_DEPRECATED( - m_num_samples_spectrum, m_num_eigenvalues_spectrum); - CMath::qsort(null_samples); - index_t pos=null_samples.find_position_to_insert(statistic); - result=1.0-((float64_t)pos)/null_samples.vlen; - break; - } - - case MMD2_GAMMA: - { - /* fit gamma and return cdf at statistic */ - SGVector params=fit_null_gamma(); - result=CStatistics::gamma_cdf(statistic, params[0], params[1]); - break; - } - - default: - result=CKernelTwoSampleTest::compute_p_value(statistic); - break; - } - - return result; -} - -float64_t CQuadraticTimeMMD::compute_threshold(float64_t alpha) -{ - float64_t result=0; - - switch (m_null_approximation_method) - { - case MMD2_SPECTRUM: - { - /* get samples from null-distribution and compute threshold */ - SGVector null_samples=sample_null_spectrum( - m_num_samples_spectrum, m_num_eigenvalues_spectrum); - 
CMath::qsort(null_samples); - result=null_samples[index_t(CMath::floor(null_samples.vlen*(1-alpha)))]; - break; - } - - case MMD2_SPECTRUM_DEPRECATED: - { - /* get samples from null-distribution and compute threshold */ - SGVector null_samples=sample_null_spectrum_DEPRECATED( - m_num_samples_spectrum, m_num_eigenvalues_spectrum); - CMath::qsort(null_samples); - result=null_samples[index_t(CMath::floor(null_samples.vlen*(1-alpha)))]; - break; - } - - case MMD2_GAMMA: - { - /* fit gamma and return inverse cdf at alpha */ - SGVector params=fit_null_gamma(); - result=CStatistics::gamma_inverse_cdf(alpha, params[0], params[1]); - break; - } - - default: - /* sampling null is handled here */ - result=CKernelTwoSampleTest::compute_threshold(alpha); - break; - } - - return result; -} - - -SGVector CQuadraticTimeMMD::sample_null_spectrum(index_t num_samples, - index_t num_eigenvalues) -{ - REQUIRE(m_kernel, "(%d, %d): No kernel set!\n", num_samples, - num_eigenvalues); - REQUIRE(m_kernel->get_kernel_type()==K_CUSTOM || m_p_and_q, - "(%d, %d): No features set and no custom kernel in use!\n", - num_samples, num_eigenvalues); - - index_t m=m_m; - index_t n=0; - - /* check if kernel is precomputed (custom kernel) */ - if (m_kernel && m_kernel->get_kernel_type()==K_CUSTOM) - n=m_kernel->get_num_vec_lhs()-m; - else - { - REQUIRE(m_p_and_q, "The samples are not initialized!\n"); - n=m_p_and_q->get_num_vectors()-m; - } - - if (num_samples<=2) - { - SG_ERROR("Number of samples has to be at least 2, " - "better in the hundreds"); - } - - if (num_eigenvalues>m+n-1) - SG_ERROR("Number of Eigenvalues too large\n"); - - if (num_eigenvalues<1) - SG_ERROR("Number of Eigenvalues too small\n"); - - /* imaginary matrix K=[K KL; KL' L] (MATLAB notation) - * K is matrix for XX, L is matrix for YY, KL is XY, LK is YX - * works since X and Y are concatenated here */ - m_kernel->init(m_p_and_q, m_p_and_q); - SGMatrix K=m_kernel->get_kernel_matrix(); - - /* center matrix K=H*K*H */ - K.center(); - 
- /* compute eigenvalues and select num_eigenvalues largest ones */ - Map c_kernel_matrix(K.matrix, K.num_rows, K.num_cols); - SelfAdjointEigenSolver eigen_solver(c_kernel_matrix); - REQUIRE(eigen_solver.info()==Eigen::Success, - "Eigendecomposition failed!\n"); - index_t max_num_eigenvalues=eigen_solver.eigenvalues().rows(); - - /* finally, sample from null distribution */ - SGVector null_samples(num_samples); - for (index_t i=0; i CQuadraticTimeMMD::sample_null_spectrum_DEPRECATED( - index_t num_samples, index_t num_eigenvalues) -{ - REQUIRE(m_kernel, "(%d, %d): No kernel set!\n", num_samples, - num_eigenvalues); - REQUIRE(m_kernel->get_kernel_type()==K_CUSTOM || m_p_and_q, - "(%d, %d): No features set and no custom kernel in use!\n", - num_samples, num_eigenvalues); - - index_t m=m_m; - index_t n=0; - - /* check if kernel is precomputed (custom kernel) */ - if (m_kernel && m_kernel->get_kernel_type()==K_CUSTOM) - n=m_kernel->get_num_vec_lhs()-m; - else - { - REQUIRE(m_p_and_q, "The samples are not initialized!\n"); - n=m_p_and_q->get_num_vectors()-m; - } - - if (num_samples<=2) - { - SG_ERROR("Number of samples has to be at least 2, " - "better in the hundreds"); - } - - if (num_eigenvalues>m+n-1) - SG_ERROR("Number of Eigenvalues too large\n"); - - if (num_eigenvalues<1) - SG_ERROR("Number of Eigenvalues too small\n"); - - /* imaginary matrix K=[K KL; KL' L] (MATLAB notation) - * K is matrix for XX, L is matrix for YY, KL is XY, LK is YX - * works since X and Y are concatenated here */ - m_kernel->init(m_p_and_q, m_p_and_q); - SGMatrix K=m_kernel->get_kernel_matrix(); - - /* center matrix K=H*K*H */ - K.center(); - - /* compute eigenvalues and select num_eigenvalues largest ones */ - Map c_kernel_matrix(K.matrix, K.num_rows, K.num_cols); - SelfAdjointEigenSolver eigen_solver(c_kernel_matrix); - REQUIRE(eigen_solver.info()==Eigen::Success, - "Eigendecomposition failed!\n"); - index_t max_num_eigenvalues=eigen_solver.eigenvalues().rows(); - - /* precomputing 
terms with rho_x and rho_y of equation 10 in [1] - * (see documentation) */ - float64_t rho_x=float64_t(m)/(m+n); - float64_t rho_y=1-rho_x; - - /* instead of using two Gaussian rv's ~ N(0,1), we'll use just one rv - * ~ N(0, 1/rho_x+1/rho_y) (derived from eq 10 in [1]) */ - float64_t std_dev=CMath::sqrt(1/rho_x+1/rho_y); - float64_t inv_rho_x_y=1/(rho_x*rho_y); - - SG_DEBUG("Using Gaussian samples ~ N(0,%f)\n", std_dev*std_dev); - - /* finally, sample from null distribution */ - SGVector null_samples(num_samples); - for (index_t i=0; i CQuadraticTimeMMD::fit_null_gamma() -{ - REQUIRE(m_kernel, "No kernel set!\n"); - REQUIRE(m_kernel->get_kernel_type()==K_CUSTOM || m_p_and_q, - "No features set and no custom kernel in use!\n"); - - index_t n=0; - - /* check if kernel is precomputed (custom kernel) */ - if (m_kernel && m_kernel->get_kernel_type()==K_CUSTOM) - n=m_kernel->get_num_vec_lhs()-m_m; - else - { - REQUIRE(m_p_and_q, "The samples are not initialized!\n"); - n=m_p_and_q->get_num_vectors()-m_m; - } - REQUIRE(m_m==n, "Only possible with equal number of samples " - "from both distribution!\n") - - index_t num_data; - if (m_kernel->get_kernel_type()==K_CUSTOM) - num_data=m_kernel->get_num_vec_rhs(); - else - num_data=m_p_and_q->get_num_vectors(); - - if (m_m!=num_data/2) - SG_ERROR("Currently, only equal sample sizes are supported\n"); - - /* evtl. warn user not to use wrong statistic type */ - if (m_statistic_type!=BIASED_DEPRECATED) - { - SG_WARNING("Note: provided statistic has " - "to be BIASED. Please ensure that! To get rid of warning," - "call %s::set_statistic_type(BIASED_DEPRECATED)\n", get_name()); - } - - /* imaginary matrix K=[K KL; KL' L] (MATLAB notation) - * K is matrix for XX, L is matrix for YY, KL is XY, LK is YX - * works since X and Y are concatenated here */ - m_kernel->init(m_p_and_q, m_p_and_q); - - /* compute mean under H0 of MMD, which is - * meanMMD = 2/m * ( 1 - 1/m*sum(diag(KL)) ); - * in MATLAB. 
- * Remove diagonals on the fly */ - float64_t mean_mmd=0; - for (index_t i=0; ikernel(i, m_m+i); - } - mean_mmd=2.0/m_m*(1.0-1.0/m_m*mean_mmd); - - /* compute variance under H0 of MMD, which is - * varMMD = 2/m/(m-1) * 1/m/(m-1) * sum(sum( (K + L - KL - KL').^2 )); - * in MATLAB, so sum up all elements */ - float64_t var_mmd=0; - for (index_t i=0; ikernel(i, j); - to_add+=m_kernel->kernel(m_m+i, m_m+j); - to_add-=m_kernel->kernel(i, m_m+j); - to_add-=m_kernel->kernel(m_m+i, j); - var_mmd+=CMath::pow(to_add, 2); - } - } - var_mmd*=2.0/m_m/(m_m-1)*1.0/m_m/(m_m-1); - - /* parameters for gamma distribution */ - float64_t a=CMath::pow(mean_mmd, 2)/var_mmd; - float64_t b=var_mmd*m_m / mean_mmd; - - SGVector result(2); - result[0]=a; - result[1]=b; - - return result; -} - -void CQuadraticTimeMMD::set_num_samples_spectrum(index_t - num_samples_spectrum) -{ - m_num_samples_spectrum=num_samples_spectrum; -} - -void CQuadraticTimeMMD::set_num_eigenvalues_spectrum( - index_t num_eigenvalues_spectrum) -{ - m_num_eigenvalues_spectrum=num_eigenvalues_spectrum; -} - -void CQuadraticTimeMMD::set_statistic_type(EQuadraticMMDType - statistic_type) -{ - m_statistic_type=statistic_type; -} - diff --git a/src/shogun/statistics/QuadraticTimeMMD.h b/src/shogun/statistics/QuadraticTimeMMD.h deleted file mode 100644 index f9981b0cec9..00000000000 --- a/src/shogun/statistics/QuadraticTimeMMD.h +++ /dev/null @@ -1,487 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. 
Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef QUADRATIC_TIME_MMD_H_ -#define QUADRATIC_TIME_MMD_H_ - -#include - -#include - -namespace shogun -{ - -class CFeatures; -class CKernel; -class CCustomKernel; - -/** Enum to select which statistic type of quadratic time MMD should be computed */ -enum EQuadraticMMDType -{ - BIASED, - BIASED_DEPRECATED, - UNBIASED, - UNBIASED_DEPRECATED, - INCOMPLETE -}; - -/** @brief This class implements the quadratic time Maximum Mean Statistic as - * described in [1]. 
- * The MMD is the distance of two probability distributions \f$p\f$ and \f$q\f$ - * in a RKHS which we denote by - * \f[ - * \hat{\eta_k}=\text{MMD}[\mathcal{F},p,q]^2=\textbf{E}_{x,x'} - * \left[ k(x,x')\right]-2\textbf{E}_{x,y}\left[ k(x,y)\right] - * +\textbf{E}_{y,y'}\left[ k(y,y')\right]=||\mu_p - \mu_q||^2_\mathcal{F} - * \f] - * - * Given two sets of samples \f$\{x_i\}_{i=1}^{n_x}\sim p\f$ and - * \f$\{y_i\}_{i=1}^{n_y}\sim q\f$, \f$n_x+n_y=n\f$, - * the unbiased estimate of the above statistic is computed as - * \f[ - * \hat{\eta}_{k,U}=\frac{1}{n_x(n_x-1)}\sum_{i=1}^{n_x}\sum_{j\neq i} - * k(x_i,x_j)+\frac{1}{n_y(n_y-1)}\sum_{i=1}^{n_y}\sum_{j\neq i}k(y_i,y_j) - * -\frac{2}{n_xn_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}k(x_i,y_j) - * \f] - * - * A biased version is - * \f[ - * \hat{\eta}_{k,V}=\frac{1}{n_x^2}\sum_{i=1}^{n_x}\sum_{j=1}^{n_x} - * k(x_i,x_j)+\frac{1}{n_y^2}\sum_{i=1}^{n_y}\sum_{j=1}^{n_y}k(y_i,y_j) - * -\frac{2}{n_xn_y}\sum_{i=1}^{n_x}\sum_{j=1}^{n_y}k(x_i,y_j) - * \f] - * - * When \f$n_x=n_y=\frac{n}{2}\f$, an incomplete version can also be computed - * as the following - * \f[ - * \hat{\eta}_{k,U^-}=\frac{1}{\frac{n}{2}(\frac{n}{2}-1)}\sum_{i\neq j} - * h(z_i,z_j) - * \f] - * where for each pair \f$z=(x,y)\f$, \f$h(z,z')=k(x,x')+k(y,y')-k(x,y')- - * k(x',y)\f$. - * - * The type (biased/unbiased/incomplete) can be selected via set_statistic_type(). - * Note that there are presently two setups for computing statistic. While using - * BIASED, UNBIASED or INCOMPLETE, the estimate returned by compute_statistic() - * is \f$\frac{n_xn_y}{n_x+n_y}\hat{\eta}_k\f$. If DEPRECATED ones are used, then - * this returns \f$(n_x+n_y)\hat{\eta}_k\f$ in general and \f$(\frac{n}{2}) - * \hat{\eta}_k\f$ when \f$n_x=n_y=\frac{n}{2}\f$. This holds for the null - * distribution samples as well. - * - * Estimating variance of the asymptotic distribution of the statistic under - * null and alternative hypothesis can be done using compute_variance() method. 
- * This is internally done alongwise computing statistics to avoid recomputing - * the kernel. - * - * Variance under null is computed as - * \f$\sigma_{k,0}^2=2\hat{\kappa}_2=2(\kappa_2-2\kappa_1+\kappa_0)\f$ - * where - * \f$\kappa_0=\left(\mathbb{E}_{X,X'}k(X,X')\right )^2\f$, - * \f$\kappa_1=\mathbb{E}_X\left[(\mathbb{E}_{X'}k(X,X'))^2\right]\f$, and - * \f$\kappa_2=\mathbb{E}_{X,X'}k^2(X,X')\f$ - * and variance under alternative is computed as - * \f[ - * \sigma_{k,A}^2=4\rho_y\left\{\mathbb{E}_X\left[\left(\mathbb{E}_{X'} - * k(X,X')-\mathbb{E}_Yk(X,Y)\right)^2 \right ] -\left(\mathbb{E}_{X,X'} - * k(X,X')-\mathbb{E}_{X,Y}k(X,Y) \right)^2\right \}+4\rho_x\left\{ - * \mathbb{E}_Y\left[\left(\mathbb{E}_{Y'}k(Y,Y')-\mathbb{E}_Xk(X,Y) - * \right)^2\right ] -\left(\mathbb{E}_{Y,Y'}k(Y,Y')-\mathbb{E}_{X,Y} - * k(X,Y) \right)^2\right \} - * \f] - * where \f$\rho_x=\frac{n_x}{n}\f$ and \f$\rho_y=\frac{n_y}{n}\f$. - * - * Note that statistic and variance estimation can be done for multiple kernels - * at once as well. - * - * Along with the statistic comes a method to compute a p-value based on - * different methods. Permutation test is also possible. If unsure which one to - * use, sampling with 250 permutation iterations always is correct (but slow). - * - * To choose, use set_null_approximation_method() and choose from. - * - * MMD2_SPECTRUM_DEPRECATED: For a fast, consistent test based on the spectrum of - * the kernel matrix, as described in [2]. Only supported if Eigen3 is installed. - * - * MMD2_SPECTRUM: Similar to the deprecated version except it estimates the - * statistic under null as \f$\frac{n_xn_y}{n_x+n_y}\hat{\eta}_{k,U}\rightarrow - * \sum_r\lambda_r(Z_r^2-1)\f$ instead (see method description for more details). - * - * MMD2_GAMMA: for a very fast, but not consistent test based on moment matching - * of a Gamma distribution, as described in [2]. 
- * - * PERMUTATION: For permuting available samples to sample null-distribution - * - * If you do not know about your data, but want to use the MMD from a kernel - * matrix, just use the custom kernel constructor. Everything else will work as - * usual. - * - * For kernel selection see CMMDKernelSelection. - * - * NOTE: \f$n_x\f$ and \f$n_y\f$ are represented by \f$m\f$ and \f$n\f$, - * respectively in the implementation. - * - * [1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. (2012). - * A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671-721. - * - * [2]: Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). - * A fast, consistent kernel two-sample test. - * - */ -class CQuadraticTimeMMD : public CKernelTwoSampleTest -{ -public: - /** default constructor */ - CQuadraticTimeMMD(); - - /** Constructor - * - * @param p_and_q feature data. Is assumed to contain samples from both - * p and q. First m samples from p, then from index m all samples from q - * - * @param kernel kernel to use - * @param p_and_q samples from p and q, appended - * @param m index of first sample of q - */ - CQuadraticTimeMMD(CKernel* kernel, CFeatures* p_and_q, index_t m); - - /** Constructor. - * This is a convienience constructor which copies both features to one - * element and then calls the other constructor. Needs twice the memory - * for a short time - * - * @param kernel kernel for MMD - * @param p samples from distribution p, will be copied and NOT SG_REF'ed - * @param q samples from distribution q, will be copied and NOT SG_REF'ed - */ - CQuadraticTimeMMD(CKernel* kernel, CFeatures* p, CFeatures* q); - - /** Constructor. - * This is a convienience constructor which allows to only specify - * a custom kernel. 
In this case, the features are completely ignored - * and all computations will be done on the custom kernel - * - * @param custom_kernel custom kernel for MMD, which is a kernel between - * the appended features p and q - * @param m index of first sample of q - */ - CQuadraticTimeMMD(CCustomKernel* custom_kernel, index_t m); - - /** destructor */ - virtual ~CQuadraticTimeMMD(); - - /** Computes the squared quadratic time MMD for the current data. Note - * that the type (biased/unbiased/incomplete) can be specified with - * set_statistic_type() method. - * - * @return (biased, unbiased or incomplete) \f$\frac{mn}{m+n}\hat{\eta}_k\f$. - * If DEPRECATED types are used, then it returns \f$(m+m)\hat{\eta}_k\f$ in - * general and \f$m\hat{\eta}_k\f$ when \f$m=n\f$. - */ - virtual float64_t compute_statistic(); - - /** Same as compute_statistic(), but with the possibility to perform on - * multiple kernels at once - * - * @param multiple_kernels if true, and underlying kernel is K_COMBINED, - * method will be executed on all subkernels on the same data - * @return vector of results for subkernels - */ - SGVector compute_statistic(bool multiple_kernels); - - /** - * Wrapper for computing variance estimate of the asymptotic distribution - * of the statistic (unbisaed/biased/incomplete) under null and alternative - * hypothesis (see class description for details) - * - * @return a vector of two values containing asymptotic variance estimate - * under null and alternative, respectively - */ - virtual SGVector compute_variance(); - - /** Same as compute_variance(), but with the possibility to perform on - * multiple kernels at once - * - * @param multiple_kernels if true, and underlying kernel is K_COMBINED, - * method will be executed on all subkernels on the same data - * @return matrix of results for subkernels, one row for each subkernel - */ - SGMatrix compute_variance(bool multiple_kernels); - - /** - * Wrapper method for compute_variance() - * - * @return variance 
estimation of asymptotic distribution of statistic - * under null hypothesis - */ - float64_t compute_variance_under_null(); - - /** - * Wrapper method for compute_variance() - * - * @return variance estimation of asymptotic distribution of statistic - * under alternative hypothesis - */ - float64_t compute_variance_under_alternative(); - - /** computes a p-value based on current method for approximating the - * null-distribution. The p-value is the 1-p quantile of the null- - * distribution where the given statistic lies in. - * - * Not all methods for computing the p-value are compatible with all - * methods of computing the statistic (biased/unbiased/incomplete). - * - * @param statistic statistic value to compute the p-value for - * @return p-value parameter statistic is the (1-p) percentile of the - * null distribution - */ - virtual float64_t compute_p_value(float64_t statistic); - - /** computes a threshold based on current method for approximating the - * null-distribution. The threshold is the value that a statistic has - * to have in ordner to reject the null-hypothesis. - * - * Not all methods for computing the p-value are compatible with all - * methods of computing the statistic (biased/unbiased/incomplete). - * - * @param alpha test level to reject null-hypothesis - * @return threshold for statistics to reject null-hypothesis - */ - virtual float64_t compute_threshold(float64_t alpha); - - /** @return the class name */ - virtual const char* get_name() const - { - return "QuadraticTimeMMD"; - }; - - /** returns the statistic type of this test statistic */ - virtual EStatisticType get_statistic_type() const - { - return S_QUADRATIC_TIME_MMD; - } - - /** Returns a set of samples of an estimate of the null distribution - * using the Eigen-spectrum of the centered kernel matrix of the merged - * samples of p and q. May be used to compute p-value (easy). 
- * - * The estimate is computed as - * \f[ - * \frac{n_xn_y}{n_x+n_y}\hat{\eta}_{k,U}\rightarrow\sum_{l=1}^\infty - * \lambda_l\left(Z^2_l-1 \right) - * \f] - * where \f${Z_l}\stackrel{i.i.d.}{\sim}\mathcal{N}(0,1)\f$ and - * \f$\lambda_l\f$ are the eigenvalues of centered kernel matrix HKH. - * - * kernel matrix needs to be stored in memory - * - * Note that m*n/(m+n)*Null-distribution is returned, - * which is fine since the statistic is also m*n/(m+n)*MMD^2 - * - * Works well if the kernel matrix is NOT diagonal dominant. - * See Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). - * A fast, consistent kernel two-sample test. - * - * @param num_samples number of samples to draw - * @param num_eigenvalues number of eigenvalues to use to draw samples - * Maximum number of m+n-1 where m and n are the sizes of samples from - * p and q respectively. - * @return samples from the estimated null distribution - */ - SGVector sample_null_spectrum(index_t num_samples, - index_t num_eigenvalues); - - /** Returns a set of samples of an estimate of the null distribution - * using the Eigen-spectrum of the centered kernel matrix of the merged - * samples of p and q. May be used to compute p-value (easy). - * - * The unbiased version uses - * \f[ - * t\text{MMD}_u^2[\mathcal{F},X,Y]\rightarrow\sum_{l=1}^\infty - * \lambda_l\left((a_l\rho_x^{-\frac{1}{{2}}} - * -b_l\rho_y^{-\frac{1}{{2}}})^2-(\rho_x\rho_y)^{-1} \right) - * \f] - * where \f$t=m+n\f$, \f$\lim_{m,n\rightarrow\infty}m/t\rightarrow - * \rho_x\f$ and \f$\rho_y\f$ likewise (equation 10 from [1]) and - * \f$\lambda_l\f$ are estimated as \f$\frac{\nu_l}{(m+n)}\f$, where - * \f$\nu_l\f$ are the eigenvalues of centered kernel matrix HKH. 
- * - * The biased version uses - * \f[ - * t\text{MMD}_b^2[\mathcal{F},X,Y]\rightarrow\sum_{l=1}^\infty - * \lambda_l\left((a_l\rho_x^{-\frac{1}{{2}}}- - * b_l\rho_y^{-\frac{1}{{2}}})^2\right) - * \f] - * - * kernel matrix needs to be stored in memory - * - * Note that (m+n)*Null-distribution is returned, - * which is fine since the statistic is also (m+n)*MMD: - * except when m and n are equal, then m*MMD^2 is returned - * - * Works well if the kernel matrix is NOT diagonal dominant. - * See Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). - * A fast, consistent kernel two-sample test. - * - * @param num_samples number of samples to draw - * @param num_eigenvalues number of eigenvalues to use to draw samples - * Maximum number of m+n-1 where m and n are the sizes of samples from - * p and q respectively. - * It is usually safe to use a smaller number since they decay very - * fast, however, a conservative approach would be to use all (-1 does - * this). See paper for details. - * @return samples from the estimated null distribution - */ - SGVector sample_null_spectrum_DEPRECATED(index_t num_samples, - index_t num_eigenvalues); - - /** setter for number of samples to use in spectrum based p-value - * computation. - * - * @param num_samples_spectrum number of samples to draw from - * approximate null-distributrion - */ - void set_num_samples_spectrum(index_t num_samples_spectrum); - - /** setter for number of eigenvalues to use in spectrum based p-value - * computation. Maximum is m_m+m_n-1 - * - * @param num_eigenvalues_spectrum number of eigenvalues to use to - * approximate null-distributrion - */ - void set_num_eigenvalues_spectrum(index_t num_eigenvalues_spectrum); - - /** @param statistic_type statistic type (biased/unbiased/incomplete) to use */ - void set_statistic_type(EQuadraticMMDType statistic_type); - - /** Approximates the null-distribution by the two parameter gamma - * distribution. 
It works in O(m^2) where m is the number of samples - * from each distribution. Its very fast, but may be inaccurate. - * However, there are cases where it performs very well. - * Returns parameters of gamma distribution that is fitted. - * - * Called by compute_p_value() if null approximation method is set to - * MMD2_GAMMA. - * - * Note that when being used for constructing a test, the provided - * statistic HAS to be the biased version (see paper for details). To use, - * set BIASED_DEPRECATED as statistic type. Note that m*Null-distribution - * is fitted, which is fine since the statistic is also m*MMD. - * - * See Gretton, A., Fukumizu, K., & Harchaoui, Z. (2011). - * A fast, consistent kernel two-sample test. - * - * @return vector with two parameter for gamma distribution. To use: - * call gamma_cdf(statistic, a, b). - */ - SGVector fit_null_gamma(); - -protected: - /** - * Helper method to compute unbiased estimate of squared quadratic time MMD - * and variance estimate under null and alternative hypothesis - * - * @param m number of samples from p - * @param n number of samples from q - * @return a vector of three values - * first - unbiased \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,U}\f$ - * second - variance under null hypothesis (see class documentation) - * third - variance under alternative hypothesis (see class documentation) - */ - SGVector compute_unbiased_statistic_variance(int m, int n); - - /** - * Helper method to compute biased estimate of squared quadratic time MMD - * and variance estimate under null and alternative hypothesis - * - * @param m number of samples from p - * @param n number of samples from q - * @return a vector of three values - * first - biased \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,V}\f$ - * second - variance under null hypothesis (see class documentation) - * third - variance under alternative hypothesis (see class documentation) - */ - SGVector compute_biased_statistic_variance(int m, int n); - - /** - * Helper 
method to compute incomplete estimate of squared quadratic time MMD - * and variance estimate under null and alternative hypothesis - * - * @param n number of samples from p and q - * @return a vector of three values - * first - incomplete \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,U^-}\f$ - * second - variance under null hypothesis (see class documentation) - * third - variance under alternative hypothesis (see class documentation) - */ - SGVector compute_incomplete_statistic_variance(int n); - - /** Wrapper method for computing unbiased estimate of MMD^2 - * - * @param m number of samples from p - * @param n number of samples from q - * @return unbiased \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,U}\f$ - */ - float64_t compute_unbiased_statistic(int m, int n); - - /** Wrapper method for computing biased estimate of MMD^2 - * - * @param m number of samples from p - * @param n number of samples from q - * @return biased \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,V}\f$ - */ - float64_t compute_biased_statistic(int m, int n); - - /** Wrapper method for computing incomplete estimate of MMD^2 - * - * @param n number of samples from p and q - * @return incomplete \f$\text{MMD}^2\f$ estimate \f$\hat{\eta}_{k,U^-}\f$ - */ - float64_t compute_incomplete_statistic(int n); - -private: - /** register parameters and initialize with defaults */ - void init(); - -protected: - /** number of samples for spectrum null-dstribution-approximation */ - index_t m_num_samples_spectrum; - - /** number of Eigenvalues for spectrum null-dstribution-approximation */ - index_t m_num_eigenvalues_spectrum; - - /** type of statistic (biased/unbiased/incomplete as well as deprecated - * versions of biased/unbiased) - */ - EQuadraticMMDType m_statistic_type; -}; - -} - -#endif /* QUADRATIC_TIME_MMD_H_ */ diff --git a/src/shogun/statistics/StreamingMMD.cpp b/src/shogun/statistics/StreamingMMD.cpp deleted file mode 100644 index 5ba3d10a796..00000000000 --- 
a/src/shogun/statistics/StreamingMMD.cpp +++ /dev/null @@ -1,325 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -CStreamingMMD::CStreamingMMD() : CKernelTwoSampleTest() -{ - init(); -} - -CStreamingMMD::CStreamingMMD(CKernel* kernel, CStreamingFeatures* p, - CStreamingFeatures* q, index_t m, index_t blocksize) : - CKernelTwoSampleTest(kernel, NULL, m) -{ - init(); - - m_streaming_p=p; - SG_REF(m_streaming_p); - - m_streaming_q=q; - SG_REF(m_streaming_q); - - m_blocksize=blocksize; -} - -CStreamingMMD::~CStreamingMMD() -{ - SG_UNREF(m_streaming_p); - SG_UNREF(m_streaming_q); - - /* m_kernel is SG_UNREFed in base desctructor */ -} - -void CStreamingMMD::init() -{ - SG_ADD((CSGObject**)&m_streaming_p, "streaming_p", "Streaming features p", - MS_NOT_AVAILABLE); - SG_ADD((CSGObject**)&m_streaming_q, "streaming_q", "Streaming features p", - MS_NOT_AVAILABLE); - SG_ADD(&m_blocksize, "blocksize", "Number of elements processed at once", - MS_NOT_AVAILABLE); - SG_ADD(&m_simulate_h0, "simulate_h0", "Whether p and q are mixed", - MS_NOT_AVAILABLE); - - m_streaming_p=NULL; - m_streaming_q=NULL; - m_blocksize=10000; - m_simulate_h0=false; -} - -float64_t CStreamingMMD::compute_statistic() -{ - /* use wrapper method and compute for single kernel */ - SGVector statistic; - SGVector variance; - compute_statistic_and_variance(statistic, variance, false); - - return statistic[0]; -} - -SGVector CStreamingMMD::compute_statistic(bool multiple_kernels) -{ - /* make sure multiple_kernels flag is used only with a combined kernel */ - REQUIRE(!multiple_kernels || m_kernel->get_kernel_type()==K_COMBINED, - "multiple kernels specified, but underlying kernel is not of type " - "K_COMBINED\n"); - - SGVector statistic; - SGVector variance; - compute_statistic_and_variance(statistic, variance, multiple_kernels); - - return statistic; -} - -float64_t CStreamingMMD::compute_variance_estimate() -{ - /* use wrapper method and compute for single kernel */ - SGVector statistic; - SGVector variance; - 
compute_statistic_and_variance(statistic, variance, false); - - return variance[0]; -} - -float64_t CStreamingMMD::compute_p_value(float64_t statistic) -{ - float64_t result=0; - - switch (m_null_approximation_method) - { - case MMD1_GAUSSIAN: - { - /* compute variance and use to estimate Gaussian distribution */ - float64_t std_dev=CMath::sqrt(compute_variance_estimate()); - result=1.0-CStatistics::normal_cdf(statistic, std_dev); - } - break; - - default: - /* sampling null is handled here */ - result=CKernelTwoSampleTest::compute_p_value(statistic); - break; - } - - return result; -} - -float64_t CStreamingMMD::compute_threshold(float64_t alpha) -{ - float64_t result=0; - - switch (m_null_approximation_method) - { - case MMD1_GAUSSIAN: - { - /* compute variance and use to estimate Gaussian distribution */ - float64_t std_dev=CMath::sqrt(compute_variance_estimate()); - result=1.0-CStatistics::inverse_normal_cdf(1-alpha, 0, std_dev); - } - break; - - default: - /* sampling null is handled here */ - result=CKernelTwoSampleTest::compute_threshold(alpha); - break; - } - - return result; -} - -float64_t CStreamingMMD::perform_test() -{ - float64_t result=0; - - switch (m_null_approximation_method) - { - case MMD1_GAUSSIAN: - { - /* compute variance and use to estimate Gaussian distribution, use - * wrapper method and compute for single kernel */ - SGVector statistic; - SGVector variance; - compute_statistic_and_variance(statistic, variance, false); - - /* estimate Gaussian distribution */ - result=1.0-CStatistics::normal_cdf(statistic[0], - CMath::sqrt(variance[0])); - } - break; - - default: - /* sampling null can be done separately in superclass */ - result=CHypothesisTest::perform_test(); - break; - } - - return result; -} - -SGVector CStreamingMMD::sample_null() -{ - SGVector samples(m_num_null_samples); - - /* instead of permutating samples, just samples new data all the time. 
*/ - CStreamingFeatures* p=m_streaming_p; - CStreamingFeatures* q=m_streaming_q; - SG_REF(p); - SG_REF(q); - - bool old=m_simulate_h0; - set_simulate_h0(true); - for (index_t i=0; iget_streamed_features(num_this_run); - data->append_element(block); - } - - SG_DEBUG("streaming %d blocks from q of blocksize %d!\n", num_blocks, - num_this_run); - - /* stream data from q num_blocks of time*/ - for (index_t i=0; iget_streamed_features(num_this_run); - data->append_element(block); - } - - /* check whether h0 should be simulated and permute if so */ - if (m_simulate_h0) - { - /* create merged copy of all feature instances to permute */ - SG_DEBUG("merging and premuting features!\n"); - - /* use the first element to merge rest of the data into */ - CFeatures* first=(CFeatures*)data->get_first_element(); - - /* this delete element doesn't deallocate first element but just removes - * from the list and does a SG_UNREF. But its not deleted because - * get_first_element() does a SG_REF before returning so we need to later - * manually take care of its destruction via SG_UNREF here itself */ - data->delete_element(); - - CFeatures* merged=first->create_merged_copy(data); - - /* now we can get rid of unnecessary feature objects */ - SG_UNREF(first); - data->delete_all_elements(); - - /* permute */ - SGVector inds(merged->get_num_vectors()); - inds.range_fill(); - CMath::permute(inds); - merged->add_subset(inds); - - /* copy back */ - SGVector copy(num_this_run); - copy.range_fill(); - for (index_t i=0; i<2*num_blocks; ++i) - { - CFeatures* current=merged->copy_subset(copy); - data->append_element(current); - /* SG_UNREF'ing since copy_subset does a SG_REF, this is - * safe since the object is already SG_REF'ed inside the list */ - SG_UNREF(current); - - if (i<2*num_blocks-1) - copy.add(num_this_run); - } - - /* clean up */ - SG_UNREF(merged); - } - - SG_REF(data); - - SG_DEBUG("leaving!\n"); - return data; -} - -void CStreamingMMD::set_p_and_q(CFeatures* p_and_q) -{ - 
SG_ERROR("Method not implemented since linear time mmd is based on " - "streaming features\n"); -} - -CFeatures* CStreamingMMD::get_p_and_q() -{ - SG_ERROR("Method not implemented since linear time mmd is based on " - "streaming features\n"); - return NULL; -} - -CStreamingFeatures* CStreamingMMD::get_streaming_p() -{ - SG_REF(m_streaming_p); - return m_streaming_p; -} - -CStreamingFeatures* CStreamingMMD::get_streaming_q() -{ - SG_REF(m_streaming_q); - return m_streaming_q; -} - diff --git a/src/shogun/statistics/StreamingMMD.h b/src/shogun/statistics/StreamingMMD.h deleted file mode 100644 index d2e8d0a3e0d..00000000000 --- a/src/shogun/statistics/StreamingMMD.h +++ /dev/null @@ -1,310 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#ifndef STREAMING_MMD_H_ -#define STREAMING_MMD_H_ - -#include - -#include - -namespace shogun -{ - -class CStreamingFeatures; -class CFeatures; - -/** @brief Abstract base class that provides an interface for performing kernel - * two-sample test on streaming data using Maximum Mean Discrepancy (MMD) as - * the test statistic. The MMD is the distance of two probability distributions - * \f$p\f$ and \f$q\f$ in a RKHS (see [1] for formal description). - * - * \f[ - * \text{MMD}[\mathcal{F},p,q]^2=\textbf{E}_{x,x'}\left[ k(x,x')\right]- - * 2\textbf{E}_{x,y}\left[ k(x,y)\right] - * +\textbf{E}_{y,y'}\left[ k(y,y')\right]=||\mu_p - \mu_q||^2_\mathcal{F} - * \f] - * - * where \f$x,x'\sim p\f$ and \f$y,y'\sim q\f$. The data has to be provided as - * streaming features, which are processed in blocks for a given blocksize. - * The blocksize determines how many examples are processed at once. A method - * for getting a specified number of blocks of data is provided which can - * optionally merge and permute the data within the current burst. 
The exact - * computation of kernel functions for MMD computation is abstract and has to - * be defined by its subclasses, which should return a vector of function - * values. Please note that for streaming MMD, the number of data points from - * both the distributions has to be equal. - * - * Along with the statistic comes a method to compute a p-value based on a - * Gaussian approximation of the null-distribution which is possible in - * linear time and constant space. Sampling from null is also possible (no - * permutations but new examples will be used here). - * If unsure which one to use, sampling with 250 iterations always is - * correct (but slow). When the sample size is large (>1000) at least, - * the Gaussian approximation is an accurate and much faster choice. - * - * To choose, use set_null_approximation_method() and choose from - * - * MMD1_GAUSSIAN: Approximates the null-distribution with a Gaussian. Only use - * from at least 1000 samples. If using, check if type I error equals the - * desired value. - * - * PERMUTATION: For permuting available samples to sample null-distribution. - * - * For kernel selection see CMMDKernelSelection. - * - * [1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & - * Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning - * Research, 13, 671-721. - */ -class CStreamingMMD: public CKernelTwoSampleTest -{ -public: - /** default constructor */ - CStreamingMMD(); - - /** constructor. - * - * @param kernel kernel to use - * @param p streaming features p to use - * @param q streaming features q to use - * @param m number of samples from each distribution - * @param blocksize size of examples that are processed at once when - * computing statistic/threshold. - */ - CStreamingMMD(CKernel* kernel, CStreamingFeatures* p, - CStreamingFeatures* q, index_t m, index_t blocksize=10000); - - /** destructor */ - virtual ~CStreamingMMD(); - - /** Computes the squared MMD for the current data. 
This is an unbiased - * estimate. This method relies on compute_statistic_and_variance which - * has to be defined in the subclasses - * - * Note that the underlying streaming feature parser has to be started - * before this is called. Otherwise deadlock. - * - * @return squared MMD - */ - virtual float64_t compute_statistic(); - - /** Same as compute_statistic(), but with the possibility to perform on - * multiple kernels at once - * - * @param multiple_kernels if true, and underlying kernel is K_COMBINED, - * method will be executed on all subkernels on the same data - * @return vector of results for subkernels - */ - virtual SGVector compute_statistic(bool multiple_kernels); - - /** computes a p-value based on current method for approximating the - * null-distribution. The p-value is the 1-p quantile of the null- - * distribution where the given statistic lies in. - * - * The method for computing the p-value can be set via - * set_null_approximation_method(). - * Since the null- distribution is normal, a Gaussian approximation - * is available. - * - * @param statistic statistic value to compute the p-value for - * @return p-value parameter statistic is the (1-p) percentile of the - * null distribution - */ - virtual float64_t compute_p_value(float64_t statistic); - - /** Performs the complete two-sample test on current data and returns a - * p-value. - * - * In case null distribution should be estimated with MMD1_GAUSSIAN, - * statistic and p-value are computed in the same loop, which is more - * efficient than first computing statistic and then computung p-values. - * - * In case of sampling null, superclass method is called. - * - * The method for computing the p-value can be set via - * set_null_approximation_method(). 
- * - * @return p-value such that computed statistic is the (1-p) quantile - * of the estimated null distribution - */ - virtual float64_t perform_test(); - - /** computes a threshold based on current method for approximating the - * null-distribution. The threshold is the value that a statistic has - * to have in ordner to reject the null-hypothesis. - * - * The method for computing the p-value can be set via - * set_null_approximation_method(). - * Since the null- distribution is normal, a Gaussian approximation - * is available. - * - * @param alpha test level to reject null-hypothesis - * @return threshold for statistics to reject null-hypothesis - */ - virtual float64_t compute_threshold(float64_t alpha); - - /** computes a linear time estimate of the variance of the squared mmd, - * which may be used for an approximation of the null-distribution - * The value is the variance of the vector of which the MMD is the mean. - * - * @return variance estimate - */ - virtual float64_t compute_variance_estimate(); - - /** Abstract method that computes MMD and a linear time variance estimate. - * If multiple_kernels is set to true, each subkernel is evaluated on the - * same data. - * - * @param statistic return parameter for statistic, vector with entry for - * each kernel. May be allocated before but doesn not have to be - * - * @param variance return parameter for statistic, vector with entry for - * each kernel. May be allocated before but doesn not have to be - * - * @param multiple_kernels optional flag, if set to true, it is assumed that - * the underlying kernel is of type K_COMBINED. Then, the MMD is computed on - * all subkernel separately rather than computing it on the combination. - * This is used by kernel selection strategies that need to evaluate - * multiple kernels on the same data. Since the linear time MMD works on - * streaming data, one cannot simply compute MMD, change kernel since data - * would be different for every kernel. 
- */ - virtual void compute_statistic_and_variance( - SGVector& statistic, SGVector& variance, - bool multiple_kernels=false)=0; - - /** Same as compute_statistic_and_variance, but computes a linear time - * estimate of the covariance of the multiple-kernel-MMD. - * See [1] for details. - */ - virtual void compute_statistic_and_Q( - SGVector& statistic, SGMatrix& Q)=0; - - /** Mimics sampling null for MMD. However, samples are not permutated but - * constantly streamed and then merged. Usually, this is not necessary - * since there is the Gaussian approximation for the null distribution. - * However, in certain cases this may fail and sampling the null - * distribution might be numerically more stable. Ovewrite superclass - * method that merges samples. - * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - - /** Setter for the blocksize of examples to be processed at once - * @param blocksize new blocksize to use - */ - void set_blocksize(index_t blocksize) - { - m_blocksize=blocksize; - } - - /** Streams num_blocks data from each distribution with blocks of size - * num_this_run. If m_simulate_h0 is set, it merges the blocks together, - * shuffles and redistributes between the blocks. - * - * @param num_blocks number of blocks to be streamed from each distribution - * @param num_this_run number of data points to be streamed for one block - * @return an ordered list of blocks of data. The order in the - * list is \f$x,x',\cdots\sim p\f$ followed by \f$y,y',\cdots\sim q\f$. 
- * The features inside the list are SG_REF'ed and delete_data is set in the - * list, which will destroy the at CList's destructor call - */ - CList* stream_data_blocks(index_t num_blocks, index_t num_this_run); - - /** Not implemented for streaming MMD since it uses streaming feautres */ - virtual void set_p_and_q(CFeatures* p_and_q); - - /** Not implemented for streaming MMD since it uses streaming feautres */ - virtual CFeatures* get_p_and_q(); - - /** Getter for streaming features of p distribution. - * @return streaming features object for p distribution, SG_REF'ed - */ - virtual CStreamingFeatures* get_streaming_p(); - - /** Getter for streaming features of q distribution. - * @return streaming features object for q distribution, SG_REF'ed - */ - virtual CStreamingFeatures* get_streaming_q(); - - /** @param simulate_h0 if true, samples from p and q will be mixed and - * permuted - */ - inline void set_simulate_h0(bool simulate_h0) - { - m_simulate_h0=simulate_h0; - } - - /** @return the class name */ - virtual const char* get_name() const - { - return "StreamingMMD"; - } - -protected: - /** abstract method that computes the squared MMD - * - * @param kernel the kernel to be used for computing MMD. This will be - * useful when multiple kernels are used - * @param data the list of data on which kernels are computed. The order - * of data in the list is \f$x,x',\cdots\sim p\f$ followed by - * \f$y,y',\cdots\sim q\f$. It is assumed that detele_data flag is set - * inside the list - * @param num_this_run number of data points in current blocks - * @return the MMD values - */ - virtual SGVector compute_squared_mmd(CKernel* kernel, - CList* data, index_t num_this_run)=0; - - /** Streaming feature objects that are used instead of merged samples */ - CStreamingFeatures* m_streaming_p; - - /** Streaming feature objects that are used instead of merged samples*/ - CStreamingFeatures* m_streaming_q; - - /** Number of examples processed at once, i.e. 
in one burst */ - index_t m_blocksize; - - /** If this is true, samples will be mixed between p and q in any method - * that computes the statistic */ - bool m_simulate_h0; - -private: - /** register parameters and initialize with defaults */ - void init(); -}; - -} - -#endif /* STREAMING_MMD_H_ */ - diff --git a/src/shogun/statistics/TwoSampleTest.cpp b/src/shogun/statistics/TwoSampleTest.cpp deleted file mode 100644 index 0510f3b5e48..00000000000 --- a/src/shogun/statistics/TwoSampleTest.cpp +++ /dev/null @@ -1,176 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include -#include - -using namespace shogun; - -CTwoSampleTest::CTwoSampleTest() : CHypothesisTest() -{ - init(); -} - -CTwoSampleTest::CTwoSampleTest(CFeatures* p_and_q, index_t m) : - CHypothesisTest() -{ - init(); - - m_p_and_q=p_and_q; - SG_REF(m_p_and_q); - - m_m=m; -} - -CTwoSampleTest::CTwoSampleTest(CFeatures* p, CFeatures* q) : - CHypothesisTest() -{ - init(); - - m_p_and_q=p->create_merged_copy(q); - SG_REF(m_p_and_q); - - m_m=p->get_num_vectors(); -} - -CTwoSampleTest::~CTwoSampleTest() -{ - SG_UNREF(m_p_and_q); -} - -void CTwoSampleTest::init() -{ - SG_ADD((CSGObject**)&m_p_and_q, "p_and_q", "Concatenated samples p and q", - MS_NOT_AVAILABLE); - SG_ADD(&m_m, "m", "Index of first sample of q", - MS_NOT_AVAILABLE); - - m_p_and_q=NULL; - m_m=0; -} - -SGVector CTwoSampleTest::sample_null() -{ - SG_DEBUG("entering!\n") - - REQUIRE(m_p_and_q, "No appended features p and q!\n"); - - /* compute sample statistics for null distribution */ - SGVector results(m_num_null_samples); - - /* memory for index permutations. 
Adding of subset has to happen - * inside the loop since it may be copied if there already is one set */ - SGVector ind_permutation(m_p_and_q->get_num_vectors()); - ind_permutation.range_fill(); - - for (index_t i=0; iadd_subset(ind_permutation); - results[i]=compute_statistic(); - m_p_and_q->remove_subset(); - } - - SG_DEBUG("leaving!\n") - return results; -} - -float64_t CTwoSampleTest::compute_p_value(float64_t statistic) -{ - float64_t result=0; - - if (m_null_approximation_method==PERMUTATION) - { - /* sample a bunch of MMD values from null distribution */ - SGVector values=sample_null(); - - /* find out percentile of parameter "statistic" in null distribution */ - CMath::qsort(values); - float64_t i=values.find_position_to_insert(statistic); - - /* return corresponding p-value */ - result=1.0-i/values.vlen; - } - else - SG_ERROR("Unknown method to approximate null distribution!\n"); - - return result; -} - -float64_t CTwoSampleTest::compute_threshold(float64_t alpha) -{ - float64_t result=0; - - if (m_null_approximation_method==PERMUTATION) - { - /* sample a bunch of MMD values from null distribution */ - SGVector values=sample_null(); - - /* return value of (1-alpha) quantile */ - result=values[index_t(CMath::floor(values.vlen*(1-alpha)))]; - } - else - SG_ERROR("Unknown method to approximate null distribution!\n"); - - return result; -} - -void CTwoSampleTest::set_p_and_q(CFeatures* p_and_q) -{ - /* ref before unref to avoid problems when instances are equal */ - SG_REF(p_and_q); - SG_UNREF(m_p_and_q); - m_p_and_q=p_and_q; -} - -void CTwoSampleTest::set_m(index_t m) -{ - REQUIRE(m_p_and_q, "Samples are not specified!\n"); - REQUIRE(m_p_and_q->get_num_vectors()>m, "Provided sample size for p" - "(%d) is greater than total number of samples (%d)!\n", - m, m_p_and_q->get_num_vectors()); - m_m=m; -} - -CFeatures* CTwoSampleTest::get_p_and_q() -{ - SG_REF(m_p_and_q); - return m_p_and_q; -} - diff --git a/src/shogun/statistics/TwoSampleTest.h 
b/src/shogun/statistics/TwoSampleTest.h deleted file mode 100644 index aa57c5a1ccb..00000000000 --- a/src/shogun/statistics/TwoSampleTest.h +++ /dev/null @@ -1,144 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann - * Written (w) 2014 Soumyajit De - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#ifndef TWO_SAMPLE_TEST_H_ -#define TWO_SAMPLE_TEST_H_ - -#include - -#include - -namespace shogun -{ - -class CFeatures; - -/** @brief Provides an interface for performing the classical two-sample test - * i.e. Given samples from two distributions \f$p\f$ and \f$q\f$, the - * null-hypothesis is: \f$H_0: p=q\f$, the alternative hypothesis: - * \f$H_1: p\neq q\f$. - * - * Abstract base class. Provides all interfaces and implements approximating - * the null distribution via permutation, i.e. repeatedly merging both samples - * and them compute the test statistic on them. - * - */ -class CTwoSampleTest : public CHypothesisTest -{ -public: - /** default constructor */ - CTwoSampleTest(); - - /** Constructor - * - * @param p_and_q feature data. Is assumed to contain samples from both - * p and q. First all samples from p, then from index m all - * samples from q - * - * @param p_and_q samples from p and q, appended - * @param m index of first sample of q - */ - CTwoSampleTest(CFeatures* p_and_q, index_t m); - - /** Constructor. - * This is a convienience constructor which copies both features to one - * element and then calls the other constructor. Needs twice the memory - * for a short time - * - * @param p samples from distribution p, will be copied and NOT - * SG_REF'ed - * @param q samples from distribution q, will be copied and NOT - * SG_REF'ed - */ - CTwoSampleTest(CFeatures* p, CFeatures* q); - - /** destructor */ - virtual ~CTwoSampleTest(); - - /** merges both sets of samples and computes the test statistic - * m_num_permutation_iteration times - * - * @return vector of all statistics - */ - virtual SGVector sample_null(); - - /** computes a p-value based on current method for approximating the - * null-distribution. The p-value is the 1-p quantile of the null- - * distribution where the given statistic lies in. 
- * - * @param statistic statistic value to compute the p-value for - * @return p-value parameter statistic is the (1-p) percentile of the - * null distribution - */ - virtual float64_t compute_p_value(float64_t statistic); - - /** computes a threshold based on current method for approximating the - * null-distribution. The threshold is the argument of the \f$1-\alpha\f$ - * quantile of the null. \f$\alpha\f$ is provided. - * - * @param alpha \f$\alpha\f$ quantile to get the threshold for - * @return threshold which is the \f$1-\alpha\f$ quantile of the null - * distribution - */ - virtual float64_t compute_threshold(float64_t alpha); - - /** Setter for joint features - * @param p_and_q joint features from p and q to set - */ - virtual void set_p_and_q(CFeatures* p_and_q); - - /** Getter for joint features, SG_REF'ed - * @return joint feature object - */ - virtual CFeatures* get_p_and_q(); - - /** @param m number of samples from first distribution p */ - void set_m(index_t m); - - /** @return number of to be used samples m */ - index_t get_m() { return m_m; } - - virtual const char* get_name() const=0; - -private: - void init(); - -protected: - /** concatenated samples of the two distributions (two blocks) */ - CFeatures* m_p_and_q; - - /** defines the first index of samples of q */ - index_t m_m; -}; - -} - -#endif /* TWO_SAMPLE_TEST_H_ */ diff --git a/tests/unit/base/SGObject_unittest.cc b/tests/unit/base/SGObject_unittest.cc index af312927b11..e81520d530c 100644 --- a/tests/unit/base/SGObject_unittest.cc +++ b/tests/unit/base/SGObject_unittest.cc @@ -18,7 +18,6 @@ #include #include #include -#include #include #include "MockObject.h" #include @@ -44,13 +43,13 @@ TEST(SGObject,equals_NULL_parameter) CDenseFeatures* feats=new CDenseFeatures(data); CGaussianKernel* kernel=new CGaussianKernel(); - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, feats, 5); - CQuadraticTimeMMD* mmd2=new CQuadraticTimeMMD(NULL, feats, 5); + CGaussianKernel* kernel2=new 
CGaussianKernel(); + kernel2->init(feats, feats); - mmd->equals(mmd2); + EXPECT_FALSE(kernel->equals(kernel2)); - SG_UNREF(mmd); - SG_UNREF(mmd2); + SG_UNREF(kernel); + SG_UNREF(kernel2); } #ifdef USE_REFERENCE_COUNTING @@ -385,4 +384,4 @@ TEST(SGObject, tags_has) EXPECT_EQ(obj->has("foo"), false); EXPECT_EQ(obj->has("foo"), false); EXPECT_EQ(obj->has(Tag("foo")), false); -} \ No newline at end of file +} diff --git a/tests/unit/features/DenseFeatures_unittest.cc b/tests/unit/features/DenseFeatures_unittest.cc index 076dfc1511b..649bbede44c 100644 --- a/tests/unit/features/DenseFeatures_unittest.cc +++ b/tests/unit/features/DenseFeatures_unittest.cc @@ -145,12 +145,12 @@ TEST(DenseFeaturesTest, shallow_copy_subset_data) SGMatrix orig_matrix=features->get_feature_matrix(); SGMatrix copy_matrix=static_cast*>(features_copy)->get_feature_matrix(); - + for (index_t i=0; i data(dim,n); + for (index_t i=0; istd_normal_distrib(); + + CDenseFeatures* orig_feats=new CDenseFeatures(data); + CStreamingDenseFeatures* feats=new CStreamingDenseFeatures(orig_feats); + + feats->start_parser(); + + CDenseFeatures* streamed=dynamic_cast*>(feats->get_streamed_features(n)); + ASSERT_TRUE(streamed!=nullptr); + ASSERT_TRUE(orig_feats->equals(streamed)); + SG_UNREF(streamed); + + feats->reset_stream(); + + streamed=dynamic_cast*>(feats->get_streamed_features(n)); + ASSERT_TRUE(streamed!=nullptr); + ASSERT_TRUE(orig_feats->equals(streamed)); + SG_UNREF(streamed); + + feats->end_parser(); + SG_UNREF(feats); +} diff --git a/tests/unit/preprocessor/BAHSIC_unittest.cc b/tests/unit/preprocessor/BAHSIC_unittest.cc deleted file mode 100644 index 0d1458c6e0d..00000000000 --- a/tests/unit/preprocessor/BAHSIC_unittest.cc +++ /dev/null @@ -1,147 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * All rights reserved. 
- * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(BAHSIC, get_selected_feats) -{ - const index_t dim=25; - const index_t num_data=100; - - // use fix seed for reproducibility - CMath::init_random(12345); - - SGMatrix data(dim, num_data); - for (index_t i=0; i labels_vec(num_data); - for (index_t i=0; i* feats=new CDenseFeatures(data); - CBinaryLabels* labels=new CBinaryLabels(labels_vec); - - float64_t sigma=0.5; - - CGaussianKernel* kernel_p=new CGaussianKernel(10, 2*CMath::sq(sigma)); - CGaussianKernel* kernel_q=new CGaussianKernel(10, 2*CMath::sq(sigma)); - - CBAHSIC* fs=new CBAHSIC(); - - index_t target_dim=dim/5; - - fs->set_labels(labels); - fs->set_target_dim(target_dim); - fs->set_kernel_features(kernel_p); - fs->set_kernel_labels(kernel_q); - fs->set_policy(N_LARGEST); - - // remove one feature at a time - fs->set_num_remove(1); - - CFeatures* selected=fs->apply(feats); - - SGMatrix selected_data - =((CDenseFeatures*)selected)->get_feature_matrix(); - - SGVector inds=fs->get_selected_feats(); - - for (index_t i=0; i data(dim, num_data); - for (index_t i=0; i labels_vec(num_data); - for (index_t i=0; i* feats=new CDenseFeatures(data); - CBinaryLabels* labels=new CBinaryLabels(labels_vec); - float64_t sigma=1.0; - CGaussianKernel* kernel_p=new CGaussianKernel(10, 2*CMath::sq(sigma)); - CGaussianKernel* kernel_q=new CGaussianKernel(10, 2*CMath::sq(sigma)); - - CBAHSIC* fs=new CBAHSIC(); - index_t target_dim=dim/2; - fs->set_labels(labels); - fs->set_target_dim(target_dim); - fs->set_kernel_features(kernel_p); - fs->set_kernel_labels(kernel_q); - fs->set_policy(N_LARGEST); - fs->set_num_remove(dim-target_dim); - CFeatures* selected=fs->apply(feats); - - SGVector selected_inds=fs->get_selected_feats(); - - // ensure that selected feats are the same as computed in local machine - SGVector inds(target_dim); - inds[0]=0; - inds[1]=1; - inds[2]=2; - inds[3]=3; - - 
EXPECT_EQ(selected_inds.vlen, inds.vlen); - - for (index_t i=0; i -#include -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(FeatureSelection, remove_feats) -{ - const index_t dim=8; - const index_t num_data=5; - - // use fix seed for reproducibility - CMath::init_random(1); - - SGMatrix data(dim, num_data); - for (index_t i=0; i* feats=new CDenseFeatures(data); - - CBAHSIC* fs=new CBAHSIC(); - index_t target_dim=dim/2; - fs->set_num_remove(dim-target_dim); - fs->set_policy(N_LARGEST); - - // create a dummy argsorted vector to remove last dim/2 features - SGVector argsorted(dim); - argsorted.range_fill(); - - CFeatures* reduced=fs->remove_feats(feats, argsorted); - SGMatrix reduced_data - =((CDenseFeatures*)reduced)->get_feature_matrix(); - - for (index_t i=0; i data(dim, num_data); - for (index_t i=0; i labels_vec(num_data); - for (index_t i=0; i* feats=new CDenseFeatures(data); - CBinaryLabels* labels=new CBinaryLabels(labels_vec); - float64_t sigma=1.0; - CGaussianKernel* kernel_p=new CGaussianKernel(10, 2*CMath::sq(sigma)); - CGaussianKernel* kernel_q=new CGaussianKernel(10, 2*CMath::sq(sigma)); - - // SG_REF'ing the kernel for q because it is SG_UNREF'ed in precompute - // call and to replace by a CCustomKernel - SG_REF(kernel_q); - - CBAHSIC* fs=new CBAHSIC(); - fs->set_labels(labels); - fs->set_kernel_features(kernel_p); - fs->set_kernel_labels(kernel_q); - - // compute the measure removing dimension 0 - float64_t measure=fs->compute_measures(feats, 0); - - // recreate this using HSIC - SGVector inds(dim-1); - for (index_t i=0; icopy_dimension_subset(inds); - - SGMatrix l_data(1, num_data); - memcpy(l_data.matrix, labels_vec.vector, sizeof(float64_t)*num_data); - CDenseFeatures* l_feats=new CDenseFeatures(l_data); - - CHSIC* hsic=new CHSIC(); - hsic->set_p(transformed); - hsic->set_q(l_feats); - hsic->set_kernel_p(kernel_p); - hsic->set_kernel_q(kernel_q); - - EXPECT_NEAR(measure, 
hsic->compute_statistic(), 1E-15); - - SG_UNREF(fs); - SG_UNREF(hsic); - SG_UNREF(kernel_q); - SG_UNREF(feats); - SG_UNREF(transformed); -} diff --git a/tests/unit/statistical_testing/KernelSelection_unittest.cc b/tests/unit/statistical_testing/KernelSelection_unittest.cc new file mode 100644 index 00000000000..d6717a25f82 --- /dev/null +++ b/tests/unit/statistical_testing/KernelSelection_unittest.cc @@ -0,0 +1,390 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2012-2013 Heiko Strathmann + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; + +TEST(KernelSelectionMaxMMD, linear_time_single_kernel_streaming) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_MMD); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.03125, 1E-10); +} + +TEST(KernelSelectionMaxMMD, quadratic_time_single_kernel_dense) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto mmd=some(feats_p, feats_q); + mmd->set_statistic_type(ST_BIASED_FULL); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_MMD); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto 
selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.25, 1E-10); +} + +TEST(KernelSelectionMaxMMD, linear_time_weighted_kernel_streaming) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_MMD, true); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto weighted_kernel=dynamic_cast(mmd->get_kernel()); + ASSERT_TRUE(weighted_kernel!=nullptr); + ASSERT_TRUE(weighted_kernel->get_num_subkernels()==num_kernels); + + SGVector weights=weighted_kernel->get_subkernel_weights(); + for (auto i=0; iset_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_POWER); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.03125, 1E-10); +} + +TEST(KernelSelectionMaxTestPower, quadratic_time_single_kernel) +{ + const index_t m=10; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); 
+ + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_POWER); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.25, 1E-10); +} + +TEST(KernelSelectionMaxTestPower, linear_time_weighted_kernel_streaming) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MAXIMIZE_POWER, true); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto weighted_kernel=dynamic_cast(mmd->get_kernel()); + ASSERT_TRUE(weighted_kernel!=nullptr); + ASSERT_TRUE(weighted_kernel->get_num_subkernels()==num_kernels); + + SGVector weights=weighted_kernel->get_subkernel_weights(); + for (auto i=0; iset_seed(12345); + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto mmd=some(feats_p, feats_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_null_approximation_method(NAM_PERMUTATION); + mmd->set_num_null_samples(10); + for (auto i=0, sigma=-5; 
iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_CROSS_VALIDATION, num_runs, num_folds, alpha); + + mmd->set_train_test_mode(true); + mmd->set_train_test_ratio(train_test_ratio); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.25, 1E-10); +} + +TEST(KernelSelectionMaxCrossValidation, linear_time_single_kernel_dense) +{ + const index_t m=8; + const index_t n=12; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + const index_t num_runs=1; + const index_t num_folds=3; + const float64_t train_test_ratio=3; + const float64_t alpha=0.05; + + sg_rand->set_seed(12345); + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto mmd=some(feats_p, feats_q); + mmd->set_statistic_type(ST_BIASED_FULL); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_CROSS_VALIDATION, num_runs, num_folds, alpha); + + mmd->set_train_test_mode(true); + mmd->set_train_test_ratio(train_test_ratio); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 0.03125, 1E-10); +} + +TEST(KernelSelectionMedianHeuristic, quadratic_time_single_kernel_dense) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } 
+ mmd->set_kernel_selection_strategy(KSM_MEDIAN_HEURISTIC); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 1.0, 1E-10); +} + +TEST(KernelSelectionMedianHeuristic, linear_time_single_kernel_dense) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + + sg_rand->set_seed(12345); + + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + auto mmd=some(gen_p, gen_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + for (auto i=0, sigma=-5; iadd_kernel(new CGaussianKernel(10, tau)); + } + mmd->set_kernel_selection_strategy(KSM_MEDIAN_HEURISTIC); + + mmd->set_train_test_mode(true); + mmd->select_kernel(); + mmd->set_train_test_mode(false); + + auto selected_kernel=static_cast(mmd->get_kernel()); + EXPECT_NEAR(selected_kernel->get_width(), 1.0, 1E-10); +} diff --git a/tests/unit/statistical_testing/LinearTimeMMD_unittest.cc b/tests/unit/statistical_testing/LinearTimeMMD_unittest.cc new file mode 100644 index 00000000000..c74b92e4189 --- /dev/null +++ b/tests/unit/statistical_testing/LinearTimeMMD_unittest.cc @@ -0,0 +1,546 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2012-2013 Heiko Strathmann + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; + +TEST(LinearTimeMMD, biased_same_num_samples) +{ + const index_t m=4; + const index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.090438791828373444, 1E-5); +} + +TEST(LinearTimeMMD, unbiased_same_num_samples) +{ + const index_t m=4; + const index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t 
sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.066491458266665582, 1E-5); +} + +TEST(LinearTimeMMD, incomplete_same_num_samples) +{ + const index_t m=4; + const index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert local machine computed result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.083423196012644057, 1E-5); +} + +TEST(LinearTimeMMD, biased_different_null_samples) +{ + const index_t m=4; + const index_t n=6; + const index_t d=3; + SGMatrix data(d,m+n); + for (index_t i=0; i data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, n); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), 
sizeof(float64_t)*d*n); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.06525051478776954, 1E-5); +} + +TEST(LinearTimeMMD, unbiased_different_null_samples) +{ + const index_t m=4; + const index_t n=6; + const index_t d=3; + SGMatrix data(d,m+n); + for (index_t i=0; i data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, n); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*n); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.039823645725702045, 1E-5); +} + +TEST(LinearTimeMMD, compute_variance_null) +{ + const index_t m=8; + const index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + 
data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + // assert local machine computed result + mmd->set_statistic_type(ST_UNBIASED_FULL); + float64_t var=mmd->compute_variance(); + EXPECT_NEAR(var, 0.0022330284118652344, 1E-10); + + mmd->set_statistic_type(ST_BIASED_FULL); + var=mmd->compute_variance(); + EXPECT_NEAR(var, 0.0022330284118652344, 1E-10); + + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + var=mmd->compute_variance(); + EXPECT_NEAR(var, 0.0022330284118652344, 1E-10); +} + +TEST(LinearTimeMMD, perform_test_permutation_biased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(gen_p, gen_q); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + 
mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_BIASED_FULL); + float64_t p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.0, 1E-10); +} + +TEST(LinearTimeMMD, perform_test_permutation_unbiased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(gen_p, gen_q); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_FULL); + float64_t p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.0, 1E-10); +} + +TEST(LinearTimeMMD, perform_test_permutation_unbiased_incomplete) +{ + const index_t m=20; + const index_t n=20; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new 
CMeanShiftDataGenerator(difference, dim, 0); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(gen_p, gen_q); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + float64_t p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.59999999999999998, 1E-10); +} + +TEST(LinearTimeMMD, perform_test_gaussian_biased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(gen_p, gen_q); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_MMD1_GAUSSIAN); + + // biased case + + // compute p-value using Gaussian approximation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_BIASED_FULL); + float64_t 
p_value_gaussian=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value_gaussian, 0.0, 1E-10); +} + +TEST(LinearTimeMMD, perform_test_gaussian_unbiased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + // Shogun's kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convenience constructor + auto mmd=some(gen_p, gen_q); + mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_MMD1_GAUSSIAN); + + // unbiased case + + // compute p-value using Gaussian approximation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_FULL); + float64_t p_value_gaussian=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value_gaussian, 0.060947882185221292, 1E-6); +} + +TEST(LinearTimeMMD, perform_test_gaussian_unbiased_incomplete) +{ + const index_t m=20; + const index_t n=20; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=new CMeanShiftDataGenerator(0, dim, 0); + auto gen_q=new CMeanShiftDataGenerator(difference, dim, 0); + + // Shogun's kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convenience constructor + auto mmd=some(gen_p, gen_q); + 
mmd->set_num_samples_p(m); + mmd->set_num_samples_q(n); + mmd->set_num_blocks_per_burst(1000); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_MMD1_GAUSSIAN); + + // unbiased case + + // compute p-value using Gaussian approximation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + float64_t p_value_gaussian=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value_gaussian, 0.40645354706402292, 1E-6); +} diff --git a/tests/unit/statistical_testing/QuadraticTimeMMD_unittest.cc b/tests/unit/statistical_testing/QuadraticTimeMMD_unittest.cc new file mode 100644 index 00000000000..e410172cc8a --- /dev/null +++ b/tests/unit/statistical_testing/QuadraticTimeMMD_unittest.cc @@ -0,0 +1,680 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (W) 2012-2013 Heiko Strathmann + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace Eigen; + +TEST(QuadraticTimeMMD, biased_same_num_samples) +{ + index_t m=8; + index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t 
statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.17882546486779649, 1E-5); +} + +TEST(QuadraticTimeMMD, unbiased_same_num_samples) +{ + index_t m=8; + index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + mmd->set_kernel(kernel); + + // assert matlab result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.13440094336133723, 1E-5); +} + +TEST(QuadraticTimeMMD, incomplete_same_num_samples) +{ + index_t m=8; + index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + // normalise data + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* 
kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + mmd->set_kernel(kernel); + + // assert local machine computed result + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.16743977201175841, 1E-5); +} + +TEST(QuadraticTimeMMD, unbiased_different_num_samples) +{ + const index_t m=5; + const index_t n=6; + const index_t d=1; + float64_t data[] = {0.61318059, -0.69222999, 0.94424411, -0.48769626, + -0.00709551, 0.35025598, 0.20741384, -0.63622519, -1.21315264, + -0.77349617, -0.42707091}; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); + + SGMatrix data_q(d, n); + memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); + + CDenseFeatures* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + CGaussianKernel* kernel=new CGaussianKernel(10, 2); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + mmd->set_kernel(kernel); + + // assert python result at + // https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, -0.037500338130199401, 1E-5); +} + +TEST(QuadraticTimeMMD, biased_different_num_samples) +{ + const index_t m=5; + const index_t n=6; + const index_t d=1; + float64_t data[] = {-0.47616889, -2.1767364, -0.04185537, -1.20787529, + 1.94875193, -0.16695709, 2.51282666, -0.58116389, 1.52366887, + 0.18985099, 0.76120258}; + + // create data matrix for each features (appended is not supported) + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); + + SGMatrix data_q(d, n); + 
memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); + + CDenseFeatures* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + // shoguns kernel width is different + CGaussianKernel* kernel=new CGaussianKernel(10, 2); + + // create MMD instance, convienience constructor + auto mmd=some(features_p, features_q); + mmd->set_statistic_type(ST_BIASED_FULL); + mmd->set_kernel(kernel); + + // assert python result at + // https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py + float64_t statistic=mmd->compute_statistic(); + EXPECT_NEAR(statistic, 0.54418915736201567, 1E-5); +} + +TEST(QuadraticTimeMMD, compute_variance_h0) +{ + index_t m=8; + index_t d=3; + SGMatrix data(d,2*m); + for (index_t i=0; i<2*d*m; ++i) + data.matrix[i]=i; + + SGMatrix data_p(d, m); + memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); + + SGMatrix data_q(d, m); + memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); + + float64_t max_p=data_p.max_single(); + float64_t max_q=data_q.max_single(); + + for (index_t i=0; i* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + auto mmd=some(features_p, features_q); + mmd->set_kernel(kernel); + + float64_t var=mmd->compute_variance_h0(); + EXPECT_NEAR(var, 0.0042963027954101562, 1E-10); +} + +TEST(QuadraticTimeMMD, compute_variance_h1) +{ + const index_t m=5; + const index_t d=1; + const float64_t sigma=0.1; + + SGVector samples(2*m); + samples[0]=1.935070; + samples[1]=-0.068707; + samples[2]=0.022104; + samples[3]=-0.454249; + samples[4]=0.926944; + samples[5]=-0.62854; + samples[6]=0.91924; + samples[7]=-0.25241; + samples[8]=1.64107; + samples[9]=-0.65426; + + SGMatrix data_p(d, m); + std::copy(samples.data(), samples.data()+m, data_p.data()); + + SGMatrix 
data_q(d, m); + std::copy(samples.data()+m, samples.data()+samples.size(), data_q.data()); + + CDenseFeatures* features_p=new CDenseFeatures(data_p); + CDenseFeatures* features_q=new CDenseFeatures(data_q); + + CGaussianKernel* kernel=new CGaussianKernel(10, sigma*sigma*2); + + auto mmd=some(features_p, features_q); + mmd->set_kernel(kernel); + float64_t var=mmd->compute_variance_h1(); + EXPECT_NEAR(var, 0.017511, 1E-6); + + mmd->precompute_kernel_matrix(false); + var=mmd->compute_variance_h1(); + EXPECT_NEAR(var, 0.017511, 1E-6); +} + +TEST(QuadraticTimeMMD, perform_test_permutation_biased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + // stream some data from generator + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(feat_p, feat_q); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_BIASED_FULL); + float64_t p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.0, 1E-10); +} + +TEST(QuadraticTimeMMD, perform_test_permutation_unbiased_full) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto 
gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + // stream some data from generator + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(feat_p, feat_q); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_FULL); + float64_t p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.0, 1E-10); +} + +TEST(QuadraticTimeMMD, perform_test_permutation_unbiased_incomplete) +{ + const index_t m=20; + const index_t n=20; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + // stream some data from generator + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(feat_p, feat_q); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + // compute p-value using permutation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_INCOMPLETE); + float64_t 
p_value=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value, 0.0, 1E-10); +} + +TEST(QuadraticTimeMMD, perform_test_spectrum) +{ + const index_t m=20; + const index_t n=30; + const index_t dim=3; + + // use fixed seed + sg_rand->set_seed(12345); + + float64_t difference=0.5; + + // streaming data generator for mean shift distributions + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + // stream some data from generator + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + // shoguns kernel width is different + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + // create MMD instance, convienience constructor + auto mmd=some(feat_p, feat_q); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + index_t num_eigenvalues=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_MMD2_SPECTRUM); + mmd->spectrum_set_num_eigenvalues(num_eigenvalues); + + // biased case + + // compute p-value using spectrum approximation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_BIASED_FULL); + float64_t p_value_spectrum=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value_spectrum, 0.0, 1E-10); + + // unbiased case + + // compute p-value using spectrum approximation for null distribution and + // assert against local machine computed result + mmd->set_statistic_type(ST_UNBIASED_FULL); + p_value_spectrum=mmd->compute_p_value(mmd->compute_statistic()); + EXPECT_NEAR(p_value_spectrum, 0.0, 1E-10); +} + +TEST(QuadraticTimeMMD, precomputed_vs_nonprecomputed) +{ + const index_t m=20; + const index_t n=20; + const index_t dim=3; + + float64_t difference=0.5; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* 
feat_q=gen_q->get_streamed_features(n); + + float64_t sigma=2; + float64_t sq_sigma_twice=sigma*sigma*2; + CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); + + auto mmd=some(feat_p, feat_q); + mmd->set_kernel(kernel); + + index_t num_null_samples=10; + mmd->set_num_null_samples(num_null_samples); + mmd->set_null_approximation_method(NAM_PERMUTATION); + + sg_rand->set_seed(12345); + SGVector result_1=mmd->sample_null(); + + mmd->precompute_kernel_matrix(false); + sg_rand->set_seed(12345); + SGVector result_2=mmd->sample_null(); + + ASSERT_EQ(result_1.size(), result_2.size()); + for (auto i=0; iset_seed(12345); + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + auto mmd=some(feat_p, feat_q); + for (auto i=0, sigma=-5; imultikernel()->add_kernel(new CGaussianKernel(10, tau)); + } + SGVector mmd_multiple=mmd->multikernel()->compute_statistic(); + mmd->multikernel()->cleanup(); + + SGVector mmd_single(num_kernels); + for (auto i=0, sigma=-5; iset_kernel(new CGaussianKernel(10, tau)); + mmd_single[i]=mmd->compute_statistic(); + } + + ASSERT_EQ(mmd_multiple.size(), mmd_single.size()); + for (auto i=0; iset_seed(12345); + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + auto mmd=some(feat_p, feat_q); + for (auto i=0, sigma=-5; imultikernel()->add_kernel(new CGaussianKernel(10, tau)); + } + SGVector var_est_multiple=mmd->multikernel()->compute_variance_h1(); + mmd->multikernel()->cleanup(); + + SGVector var_est_single(num_kernels); + for (auto i=0, sigma=-5; iset_kernel(new CGaussianKernel(10, tau)); + var_est_single[i]=mmd->compute_variance_h1(); + } + + ASSERT_EQ(var_est_multiple.size(), var_est_single.size()); + for (auto i=0; iset_seed(12345); + + auto gen_p=some(0, dim, 0); + auto 
gen_q=some(difference, dim, 0); + + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + auto mmd=some(feat_p, feat_q); + mmd->set_statistic_type(ST_UNBIASED_FULL); + for (auto i=0, sigma=-5; imultikernel()->add_kernel(new CGaussianKernel(10, tau)); + } + SGVector test_power_multiple=mmd->multikernel()->compute_test_power(); + mmd->multikernel()->cleanup(); + + SGVector test_power_single(num_kernels); + for (auto i=0, sigma=-5; iset_kernel(new CGaussianKernel(10, tau)); + test_power_single[i]=mmd->compute_statistic()*(m+n)/m/n/CMath::sqrt(mmd->compute_variance_h1()+1E-5); + } + + ASSERT_EQ(test_power_multiple.size(), test_power_single.size()); + for (auto i=0; i(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + CFeatures* feat_p=gen_p->get_streamed_features(m); + CFeatures* feat_q=gen_q->get_streamed_features(n); + + auto mmd=some(feat_p, feat_q); + mmd->set_num_null_samples(num_null_samples); + for (auto i=0, sigma=-5; imultikernel()->add_kernel(new CGaussianKernel(cache_size, tau)); + } + sg_rand->set_seed(12345); + SGVector rejections_multiple=mmd->multikernel()->perform_test(alpha); + mmd->multikernel()->cleanup(); + + SGVector rejections_single(num_kernels); + for (auto i=0, sigma=-5; iset_kernel(new CGaussianKernel(cache_size, tau)); + sg_rand->set_seed(12345); + rejections_single[i]=mmd->perform_test(alpha); + } + + ASSERT_EQ(rejections_multiple.size(), rejections_single.size()); + for (auto i=0; i +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace shogun +{ + +class CTwoDistributionTestMock : public CTwoDistributionTest +{ +public: + MOCK_METHOD0(compute_statistic, float64_t()); + MOCK_METHOD0(sample_null, SGVector()); +}; + +} + +using namespace shogun; + +TEST(TwoDistributionTest, compute_distance_dense) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + + auto gen_p=some(0, dim, 
0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=static_cast*>(gen_p->get_streamed_features(m)); + auto feats_q=static_cast*>(gen_q->get_streamed_features(n)); + + auto mock_obj=some(); + mock_obj->set_p(feats_p); + mock_obj->set_q(feats_q); + + auto euclidean_distance=some(); + auto distance=mock_obj->compute_distance(euclidean_distance); + auto distance_mat1=distance->get_distance_matrix(); + SG_UNREF(distance); + + euclidean_distance->init(feats_p, feats_q); + auto distance_mat2=euclidean_distance->get_distance_matrix(); + + EXPECT_TRUE(distance_mat1.num_rows==distance_mat2.num_rows); + EXPECT_TRUE(distance_mat1.num_cols==distance_mat2.num_cols); + for (size_t i=0; i(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=static_cast*>(gen_p->get_streamed_features(m)); + auto feats_q=static_cast*>(gen_q->get_streamed_features(n)); + + auto mock_obj=some(); + mock_obj->set_p(feats_p); + mock_obj->set_q(feats_q); + + auto euclidean_distance=some(); + auto distance=mock_obj->compute_joint_distance(euclidean_distance); + auto distance_mat1=distance->get_distance_matrix(); + + SGMatrix data_p_and_q(dim, m+n); + auto data_p=feats_p->get_feature_matrix(); + auto data_q=feats_q->get_feature_matrix(); + std::copy(data_p.data(), data_p.data()+data_p.size(), data_p_and_q.data()); + std::copy(data_q.data(), data_q.data()+data_q.size(), data_p_and_q.data()+data_p.size()); + auto feats_p_and_q=some >(data_p_and_q); + + euclidean_distance->init(feats_p_and_q, feats_p_and_q); + auto distance_mat2=euclidean_distance->get_distance_matrix(); + + EXPECT_TRUE(distance_mat1.num_rows==distance_mat2.num_rows); + EXPECT_TRUE(distance_mat1.num_cols==distance_mat2.num_cols); + for (size_t i=0; i(); + mock_obj->set_p(gen_p); + mock_obj->set_q(gen_q); + mock_obj->set_num_samples_p(m); + mock_obj->set_num_samples_q(n); + + sg_rand->set_seed(12345); + auto euclidean_distance=some(); + auto distance=mock_obj->compute_distance(euclidean_distance); + auto 
distance_mat1=distance->get_distance_matrix(); + + sg_rand->set_seed(12345); + auto feats_p=static_cast*>(gen_p->get_streamed_features(m)); + auto feats_q=static_cast*>(gen_q->get_streamed_features(n)); + euclidean_distance->init(feats_p, feats_q); + auto distance_mat2=euclidean_distance->get_distance_matrix(); + + EXPECT_TRUE(distance_mat1.num_rows==distance_mat2.num_rows); + EXPECT_TRUE(distance_mat1.num_cols==distance_mat2.num_cols); + for (size_t i=0; i(); + mock_obj->set_p(gen_p); + mock_obj->set_q(gen_q); + mock_obj->set_num_samples_p(m); + mock_obj->set_num_samples_q(n); + + sg_rand->set_seed(12345); + auto euclidean_distance=some(); + auto distance=mock_obj->compute_joint_distance(euclidean_distance); + auto distance_mat1=distance->get_distance_matrix(); + + sg_rand->set_seed(12345); + auto feats_p=static_cast*>(gen_p->get_streamed_features(m)); + auto feats_q=static_cast*>(gen_q->get_streamed_features(n)); + + SGMatrix data_p_and_q(dim, m+n); + auto data_p=feats_p->get_feature_matrix(); + auto data_q=feats_q->get_feature_matrix(); + std::copy(data_p.data(), data_p.data()+data_p.size(), data_p_and_q.data()); + std::copy(data_q.data(), data_q.data()+data_q.size(), data_p_and_q.data()+data_p.size()); + auto feats_p_and_q=new CDenseFeatures(data_p_and_q); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + euclidean_distance->init(feats_p_and_q, feats_p_and_q); + auto distance_mat2=euclidean_distance->get_distance_matrix(); + + EXPECT_TRUE(distance_mat1.num_rows==distance_mat2.num_rows); + EXPECT_TRUE(distance_mat1.num_cols==distance_mat2.num_cols); + for (size_t i=0; i +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(Block, create_blocks) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + + // check whether 
correct number of blocks has been formed + auto blocks=Block::create_blocks(feats_p, num_vec/blocksize, blocksize); + ASSERT_TRUE(blocks.size()==size_t(num_vec/blocksize)); + + // check const cast operator + for (auto it=blocks.begin(); it!=blocks.end(); ++it) + { + const Block& block=*it; + auto block_feats=static_cast(block); + ASSERT_TRUE(block_feats->get_num_vectors()==blocksize); + } + + // check non-const cast operator + for (auto it=blocks.begin(); it!=blocks.end(); ++it) + { + Block& block=*it; + auto block_feats=static_cast>(block); + ASSERT_TRUE(block_feats->get_num_vectors()==blocksize); + } + + // check const get() method + for (auto it=blocks.begin(); it!=blocks.end(); ++it) + { + const Block& block=*it; + auto block_feats=block.get(); + ASSERT_TRUE(block_feats->get_num_vectors()==blocksize); + } + + // check non-const get() method + for (auto it=blocks.begin(); it!=blocks.end(); ++it) + { + Block& block=*it; + auto block_feats=block.get(); + ASSERT_TRUE(block_feats->get_num_vectors()==blocksize); + } + + // check for proper block-wise organizing + SGVector inds(blocksize); + std::iota(inds.vector, inds.vector+inds.vlen, 0); + for (size_t i=0; iadd_subset(inds); + SGMatrix subset=feats_p->get_feature_matrix(); + SGMatrix blockd=static_cast(blocks[i].get())->get_feature_matrix(); + ASSERT_TRUE(subset.equals(blockd)); + feats_p->remove_subset(); + std::for_each(inds.vector, inds.vector+inds.vlen, [&blocksize](index_t& val) { val+=blocksize; }); + } + + // no clean-up should be required +} diff --git a/tests/unit/statistical_testing/internals/CrossValidationMMD_unittest.cc b/tests/unit/statistical_testing/internals/CrossValidationMMD_unittest.cc new file mode 100644 index 00000000000..9e15ca824d2 --- /dev/null +++ b/tests/unit/statistical_testing/internals/CrossValidationMMD_unittest.cc @@ -0,0 +1,323 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; + +TEST(CrossValidationMMD, biased_full) +{ + const index_t n=24; + const index_t m=15; + const index_t dim=2; + const index_t num_null_samples=5; + const index_t num_folds=3; + const index_t num_runs=2; + const index_t num_kernels=4; + const index_t cache_size=10; + const float64_t difference=0.5; + const float64_t alpha=0.05; + const auto stype=ST_BIASED_FULL; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + + kernel_mgr.set_precomputed_distance(precomputed_distance); + auto cv=CrossValidationMMD(n, m, num_folds, num_null_samples); + cv.m_stype=stype; + cv.m_alpha=alpha; + cv.m_num_runs=num_runs; + cv.m_rejections=SGMatrix(num_runs*num_folds, num_kernels); + sg_rand->set_seed(12345); + cv(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + SGVector dummy_labels_p(n); + SGVector dummy_labels_q(m); + + auto kfold_p=some(new CBinaryLabels(dummy_labels_p), num_folds); + auto kfold_q=some(new CBinaryLabels(dummy_labels_q), num_folds); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + for (auto k=0; kbuild_subsets(); + kfold_q->build_subsets(); + + for (auto current_fold=0; 
current_foldgenerate_subset_inverse(current_fold); + auto current_train_subset_q=kfold_q->generate_subset_inverse(current_fold); + + feats_p->add_subset(current_train_subset_p); + feats_q->add_subset(current_train_subset_q); + + permutation_mmd.m_n_x=feats_p->get_num_vectors(); + permutation_mmd.m_n_y=feats_q->get_num_vectors(); + + auto current_merged_feats=static_cast*> + (FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + kernel->init(current_merged_feats, current_merged_feats); + auto p_value=permutation_mmd.p_value(kernel->get_kernel_matrix()); + + EXPECT_EQ(cv.m_rejections(current_run*num_folds+current_fold, k), p_valueremove_lhs_and_rhs(); + feats_p->remove_subset(); + feats_q->remove_subset(); + } + } + } +} + +TEST(CrossValidationMMD, unbiased_full) +{ + const index_t n=24; + const index_t m=15; + const index_t dim=2; + const index_t num_null_samples=5; + const index_t num_folds=3; + const index_t num_runs=2; + const index_t num_kernels=4; + const index_t cache_size=10; + const float64_t difference=0.5; + const float64_t alpha=0.05; + const auto stype=ST_UNBIASED_FULL; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + + kernel_mgr.set_precomputed_distance(precomputed_distance); + auto cv=CrossValidationMMD(n, m, num_folds, num_null_samples); + cv.m_stype=stype; + cv.m_alpha=alpha; + cv.m_num_runs=num_runs; + cv.m_rejections=SGMatrix(num_runs*num_folds, num_kernels); + sg_rand->set_seed(12345); + cv(kernel_mgr); + 
kernel_mgr.unset_precomputed_distance(); + + SGVector dummy_labels_p(n); + SGVector dummy_labels_q(m); + + auto kfold_p=some(new CBinaryLabels(dummy_labels_p), num_folds); + auto kfold_q=some(new CBinaryLabels(dummy_labels_q), num_folds); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + for (auto k=0; kbuild_subsets(); + kfold_q->build_subsets(); + + for (auto current_fold=0; current_foldgenerate_subset_inverse(current_fold); + auto current_train_subset_q=kfold_q->generate_subset_inverse(current_fold); + + feats_p->add_subset(current_train_subset_p); + feats_q->add_subset(current_train_subset_q); + + permutation_mmd.m_n_x=feats_p->get_num_vectors(); + permutation_mmd.m_n_y=feats_q->get_num_vectors(); + + auto current_merged_feats=static_cast*> + (FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + kernel->init(current_merged_feats, current_merged_feats); + auto p_value=permutation_mmd.p_value(kernel->get_kernel_matrix()); + + EXPECT_EQ(cv.m_rejections(current_run*num_folds+current_fold, k), p_valueremove_lhs_and_rhs(); + feats_p->remove_subset(); + feats_q->remove_subset(); + } + } + } +} + +TEST(CrossValidationMMD, unbiased_incomplete) +{ + const index_t n=18; + const index_t m=18; + const index_t dim=2; + const index_t num_null_samples=5; + const index_t num_folds=3; + const index_t num_runs=2; + const index_t num_kernels=4; + const index_t cache_size=10; + const float64_t difference=0.5; + const float64_t alpha=0.05; + const auto stype=ST_UNBIASED_INCOMPLETE; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto 
distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + + kernel_mgr.set_precomputed_distance(precomputed_distance); + auto cv=CrossValidationMMD(n, m, num_folds, num_null_samples); + cv.m_stype=stype; + cv.m_alpha=alpha; + cv.m_num_runs=num_runs; + cv.m_rejections=SGMatrix(num_runs*num_folds, num_kernels); + sg_rand->set_seed(12345); + cv(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + SGVector dummy_labels_p(n); + SGVector dummy_labels_q(m); + + auto kfold_p=some(new CBinaryLabels(dummy_labels_p), num_folds); + auto kfold_q=some(new CBinaryLabels(dummy_labels_q), num_folds); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + for (auto k=0; kbuild_subsets(); + kfold_q->build_subsets(); + + for (auto current_fold=0; current_foldgenerate_subset_inverse(current_fold); + auto current_train_subset_q=kfold_q->generate_subset_inverse(current_fold); + + feats_p->add_subset(current_train_subset_p); + feats_q->add_subset(current_train_subset_q); + + permutation_mmd.m_n_x=feats_p->get_num_vectors(); + permutation_mmd.m_n_y=feats_q->get_num_vectors(); + + auto current_merged_feats=static_cast*> + (FeaturesUtil::create_merged_copy(feats_p, feats_q)); + + kernel->init(current_merged_feats, current_merged_feats); + auto p_value=permutation_mmd.p_value(kernel->get_kernel_matrix()); + + EXPECT_EQ(cv.m_rejections(current_run*num_folds+current_fold, k), p_valueremove_lhs_and_rhs(); + feats_p->remove_subset(); + feats_q->remove_subset(); + } + } + } +} diff --git a/tests/unit/statistical_testing/internals/DataFetcherFactory_unittest.cc b/tests/unit/statistical_testing/internals/DataFetcherFactory_unittest.cc new file mode 100644 index 00000000000..36fd29a552e --- /dev/null +++ 
b/tests/unit/statistical_testing/internals/DataFetcherFactory_unittest.cc
@@ -0,0 +1,64 @@
+/*
+ * Copyright (c) The Shogun Machine Learning Toolbox
+ * Written (w) 2016 Soumyajit De
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * The views and conclusions contained in the software and documentation are those
+ * of the authors and should not be interpreted as representing official policies,
+ * either expressed or implied, of the Shogun Development Team.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+using namespace shogun;
+using namespace internal;
+
+TEST(DataFetcherFactory, get_instance)
+{
+	const index_t dim=1;
+	const index_t num_vec=1;
+
+	SGMatrix data_p(dim, num_vec);
+	data_p(0, 0)=0;
+
+	using feat_type=CDenseFeatures;
+	auto feats_p=new feat_type(data_p);
+
+	std::unique_ptr fetcher(DataFetcherFactory::get_instance(feats_p));
+	ASSERT_TRUE(strcmp(fetcher->get_name(), "DataFetcher")==0);
+
+	CStreamingDenseFeatures *streaming_p=new CStreamingDenseFeatures(feats_p);
+	SG_REF(streaming_p);
+
+	std::unique_ptr streaming_fetcher(DataFetcherFactory::get_instance(streaming_p));
+	ASSERT_TRUE(strcmp(streaming_fetcher->get_name(), "StreamingDataFetcher")==0);
+}
diff --git a/tests/unit/statistical_testing/internals/DataFetcher_unittest.cc b/tests/unit/statistical_testing/internals/DataFetcher_unittest.cc
new file mode 100644
index 00000000000..d68ac582c88
--- /dev/null
+++ b/tests/unit/statistical_testing/internals/DataFetcher_unittest.cc
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) The Shogun Machine Learning Toolbox
+ * Written (w) 2016 Soumyajit De
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * The views and conclusions contained in the software and documentation are those
+ * of the authors and should not be interpreted as representing official policies,
+ * either expressed or implied, of the Shogun Development Team.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+using namespace shogun;
+using namespace internal;
+
+TEST(DataFetcher, full_data)
+{
+	const index_t dim=3;
+	const index_t num_vec=8;
+
+	SGMatrix data_p(dim, num_vec);
+	std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0);
+
+	using feat_type=CDenseFeatures;
+	auto feats_p=new feat_type(data_p);
+
+	DataFetcher fetcher(feats_p);
+
+	fetcher.start();
+	auto curr=fetcher.next();
+	ASSERT_TRUE(curr!=nullptr);
+
+	auto tmp=dynamic_cast(curr);
+	ASSERT_TRUE(tmp!=nullptr);
+
+	SG_UNREF(curr);
+
+	curr=fetcher.next();
+	ASSERT_TRUE(curr==nullptr);
+	fetcher.end();
+}
+
+TEST(DataFetcher, block_data)
+{
+	const index_t dim=3;
+	const index_t num_vec=8;
+	const index_t blocksize=2;
+	const index_t num_blocks_per_burst=2;
+
+	SGMatrix data_p(dim, num_vec);
+	std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0);
+
+	using feat_type=CDenseFeatures;
+	auto feats_p=new feat_type(data_p);
+
+	DataFetcher fetcher(feats_p);
+
+	fetcher.fetch_blockwise()
+		.with_blocksize(blocksize)
+		.with_num_blocks_per_burst(num_blocks_per_burst);
+
+	fetcher.start();
+	auto curr=fetcher.next();
+	ASSERT_TRUE(curr!=nullptr);
+	while (curr!=nullptr)
+	{
+		auto tmp=dynamic_cast(curr);
+		ASSERT_TRUE(tmp!=nullptr);
+		ASSERT_TRUE(tmp->get_num_vectors()==blocksize*num_blocks_per_burst);
+
+		SG_UNREF(curr);
+		curr=fetcher.next();
+	}
+	fetcher.end();
+}
+
+TEST(DataFetcher, reset_functionality)
+{
+	const index_t dim=3;
+	const index_t num_vec=8;
+	const index_t blocksize=2;
+	const index_t num_blocks_per_burst=2;
+
+	SGMatrix data_p(dim, num_vec);
+	std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0);
+
+	using feat_type=CDenseFeatures;
+	auto feats_p=new feat_type(data_p);
+
+	DataFetcher fetcher(feats_p);
+
+	fetcher.start();
+	auto curr=fetcher.next();
+	ASSERT_TRUE(curr!=nullptr);
+
+	auto tmp=dynamic_cast(curr);
+	ASSERT_TRUE(tmp!=nullptr);
+
+	SG_UNREF(curr);
+
+	curr=fetcher.next();
+	ASSERT_TRUE(curr==nullptr);
+
+	fetcher.reset();
+	fetcher.fetch_blockwise()
+		.with_blocksize(blocksize)
+		.with_num_blocks_per_burst(num_blocks_per_burst);
+
+	fetcher.start();
+	curr=fetcher.next();
+	ASSERT_TRUE(curr!=nullptr);
+	while (curr!=nullptr)
+	{
+		tmp=dynamic_cast(curr);
+		ASSERT_TRUE(tmp!=nullptr);
+		ASSERT_TRUE(tmp->get_num_vectors()==blocksize*num_blocks_per_burst);
+		SG_UNREF(curr);
+		curr=fetcher.next();
+	}
+	fetcher.end();
+}
diff --git a/tests/unit/statistical_testing/internals/DataManager_unittest.cc b/tests/unit/statistical_testing/internals/DataManager_unittest.cc
new file mode 100644
index 00000000000..fab238474ae
--- /dev/null
+++ b/tests/unit/statistical_testing/internals/DataManager_unittest.cc
@@ -0,0 +1,915 @@
+/*
+ * Copyright (c) The Shogun Machine Learning Toolbox
+ * Written (w) 2016 Soumyajit De
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ * 2.
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(DataManager, full_data_one_distribution_normal_feats) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t num_distributions=1; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + auto feats_p=new CDenseFeatures(data_p); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp=dynamic_cast*>(next_burst[0][0].get()); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + + mgr.end(); +} + +TEST(DataManager, full_data_one_distribution_streaming_feats) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t num_distributions=1; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + auto feats_p=new CDenseFeatures(data_p); + auto streaming_p=new CStreamingDenseFeatures(feats_p); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=streaming_p; + mgr.num_samples_at(0)=num_vec; + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp=dynamic_cast*>(next_burst[0][0].get()); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + + mgr.end(); +} + +TEST(DataManager, full_data_two_distributions_normal_feats) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t num_distributions=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, 
dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.samples_at(1)=feats_q; + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp_p=dynamic_cast(next_burst[0][0].get()); + auto tmp_q=dynamic_cast(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); +} + +TEST(DataManager, full_data_two_distributions_streaming_feats) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t num_distributions=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + auto streaming_p=new CStreamingDenseFeatures(feats_p); + auto streaming_q=new CStreamingDenseFeatures(feats_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=streaming_p; + mgr.samples_at(1)=streaming_q; + mgr.num_samples_at(0)=num_vec; + mgr.num_samples_at(1)=num_vec; + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp_p=dynamic_cast(next_burst[0][0].get()); + auto tmp_q=dynamic_cast(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); +} + +TEST(DataManager, block_data_one_distribution_normal_feats) +{ + const index_t dim=3; + const 
index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=1; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + auto feats_p=new CDenseFeatures(data_p); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i*>(next_burst[0][i].get()); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==blocksize); + total+=tmp->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); +} + +TEST(DataManager, block_data_one_distribution_streaming_feats) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=1; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + auto feats_p=new CDenseFeatures(data_p); + auto streaming_p=new CStreamingDenseFeatures(feats_p); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=streaming_p; + mgr.num_samples_at(0)=num_vec; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i*>(next_burst[0][i].get()); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==blocksize); + total+=tmp->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); +} + +TEST(DataManager, block_data_two_distributions_normal_feats_equal_blocksize) +{ + const index_t dim=3; + const 
index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.samples_at(1)=feats_q; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); +} + +TEST(DataManager, block_data_two_distributions_streaming_feats_equal_blocksize) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + auto streaming_p=new CStreamingDenseFeatures(feats_p); + auto streaming_q=new CStreamingDenseFeatures(feats_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=streaming_p; + mgr.samples_at(1)=streaming_q; + mgr.num_samples_at(0)=num_vec; + 
mgr.num_samples_at(1)=num_vec; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); +} + +TEST(DataManager, block_data_two_distributions_normal_feats_different_blocksize) +{ + const index_t dim=3; + const index_t num_vec_p=8; + const index_t num_vec_q=12; + const index_t blocksize=5; + const index_t num_blocks_per_burst=3; + const index_t num_distributions=2; + + auto blocksize_p=blocksize*num_vec_p/(num_vec_p+num_vec_q); + auto blocksize_q=blocksize*num_vec_q/(num_vec_p+num_vec_q); + + SGMatrix data_p(dim, num_vec_p); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec_p, 0); + + SGMatrix data_q(dim, num_vec_q); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec_q, dim*num_vec_p); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.samples_at(1)=feats_q; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total_p=0; + auto total_q=0; + + while (!next_burst.empty()) + { + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize_p); + 
ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize_q); + total_p+=tmp_p->get_num_vectors(); + total_q+=tmp_q->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total_p==num_vec_p); + ASSERT_TRUE(total_q==num_vec_q); +} + +TEST(DataManager, block_data_two_distributions_streaming_feats_different_blocksize) +{ + const index_t dim=3; + const index_t num_vec_p=8; + const index_t num_vec_q=12; + const index_t blocksize=5; + const index_t num_blocks_per_burst=3; + const index_t num_distributions=2; + + auto blocksize_p=blocksize*num_vec_p/(num_vec_p+num_vec_q); + auto blocksize_q=blocksize*num_vec_q/(num_vec_p+num_vec_q); + + SGMatrix data_p(dim, num_vec_p); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec_p, 0); + + SGMatrix data_q(dim, num_vec_q); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec_q, dim*num_vec_p); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + auto streaming_p=new CStreamingDenseFeatures(feats_p); + auto streaming_q=new CStreamingDenseFeatures(feats_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=streaming_p; + mgr.samples_at(1)=streaming_q; + mgr.num_samples_at(0)=num_vec_p; + mgr.num_samples_at(1)=num_vec_q; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total_p=0; + auto total_q=0; + + while (!next_burst.empty()) + { + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize_p); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize_q); + total_p+=tmp_p->get_num_vectors(); + total_q+=tmp_q->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total_p==num_vec_p); + ASSERT_TRUE(total_q==num_vec_q); +} + +TEST(DataManager, train_test_whole_dense) +{ + const index_t 
dim=3; + const index_t num_vec=8; + const index_t num_distributions=2; + const index_t train_test_ratio=3; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.samples_at(1)=feats_q; + + mgr.set_train_test_mode(true); + mgr.set_train_test_ratio(train_test_ratio); + + // training data + mgr.set_train_mode(true); + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp_p=dynamic_cast(next_burst[0][0].get()); + auto tmp_q=dynamic_cast(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec*train_test_ratio/(train_test_ratio+1)); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec*train_test_ratio/(train_test_ratio+1)); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); + + // test data + mgr.set_train_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + tmp_p=dynamic_cast(next_burst[0][0].get()); + tmp_q=dynamic_cast(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec/(train_test_ratio+1)); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec/(train_test_ratio+1)); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); + + // full data + mgr.set_train_test_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + tmp_p=dynamic_cast(next_burst[0][0].get()); + tmp_q=dynamic_cast(next_burst[1][0].get()); + + 
ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); +} + +TEST(DataManager, train_test_blockwise_dense) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=2; + const index_t train_test_ratio=3; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + SGMatrix data_q(dim, num_vec); + std::iota(data_q.matrix, data_q.matrix+dim*num_vec, dim*num_vec); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + auto feats_q=new feat_type(data_q); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.samples_at(1)=feats_q; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.set_train_test_mode(true); + mgr.set_train_test_ratio(train_test_ratio); + + // train data + mgr.set_train_mode(true); + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec*train_test_ratio/(train_test_ratio+1)); + mgr.end(); + + // test data + mgr.set_train_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i(next_burst[0][i].get()); + 
auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec/(train_test_ratio+1)); + mgr.end(); + + // full data + mgr.set_train_test_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i(next_burst[0][i].get()); + auto tmp_q=dynamic_cast(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); + mgr.end(); +} + +TEST(DataManager, train_test_whole_streaming) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t num_distributions=2; + const index_t train_test_ratio=3; + const float64_t difference=0.5; + + DataManager mgr(num_distributions); + mgr.samples_at(0)=new CMeanShiftDataGenerator(0, dim, 0); + mgr.samples_at(1)=new CMeanShiftDataGenerator(difference, dim, 0); + mgr.num_samples_at(0)=num_vec; + mgr.num_samples_at(1)=num_vec; + + typedef CDenseFeatures feat_type; + + mgr.set_train_test_mode(true); + mgr.set_train_test_ratio(train_test_ratio); + + // training data + mgr.set_train_mode(true); + mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + auto tmp_p=dynamic_cast(next_burst[0][0].get()); + auto tmp_q=dynamic_cast(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec*train_test_ratio/(train_test_ratio+1)); + 
ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec*train_test_ratio/(train_test_ratio+1)); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); + + // test data + mgr.set_train_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + tmp_p=dynamic_cast<feat_type*>(next_burst[0][0].get()); + tmp_q=dynamic_cast<feat_type*>(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec/(train_test_ratio+1)); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec/(train_test_ratio+1)); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); + + // full data + mgr.set_train_test_mode(false); + mgr.reset(); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + + tmp_p=dynamic_cast<feat_type*>(next_burst[0][0].get()); + tmp_q=dynamic_cast<feat_type*>(next_burst[1][0].get()); + + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==num_vec); + ASSERT_TRUE(tmp_q->get_num_vectors()==num_vec); + + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); +} + +TEST(DataManager, train_test_blockwise_streaming) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=2; + const index_t train_test_ratio=3; + const float64_t difference=0.5; + + DataManager mgr(num_distributions); + mgr.samples_at(0)=new CMeanShiftDataGenerator(0, dim, 0); + mgr.samples_at(1)=new CMeanShiftDataGenerator(difference, dim, 0); + mgr.num_samples_at(0)=num_vec; + mgr.num_samples_at(1)=num_vec; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + typedef CDenseFeatures<float64_t> feat_type; + + mgr.set_train_test_mode(true); + mgr.set_train_test_ratio(train_test_ratio); + + // train data + mgr.set_train_mode(true); + 
mgr.start(); + + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + auto total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i<next_burst.num_blocks(); ++i) + { + auto tmp_p=dynamic_cast<feat_type*>(next_burst[0][i].get()); + auto tmp_q=dynamic_cast<feat_type*>(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec*train_test_ratio/(train_test_ratio+1)); + mgr.end(); + + // test data + mgr.set_train_mode(false); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i<next_burst.num_blocks(); ++i) + { + auto tmp_p=dynamic_cast<feat_type*>(next_burst[0][i].get()); + auto tmp_q=dynamic_cast<feat_type*>(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec/(train_test_ratio+1)); + mgr.end(); + + // full data + mgr.set_train_test_mode(false); + mgr.reset(); + mgr.start(); + + next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + + total=0; + + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i<next_burst.num_blocks(); ++i) + { + auto tmp_p=dynamic_cast<feat_type*>(next_burst[0][i].get()); + auto tmp_q=dynamic_cast<feat_type*>(next_burst[1][i].get()); + ASSERT_TRUE(tmp_p!=nullptr); + ASSERT_TRUE(tmp_q!=nullptr); + ASSERT_TRUE(tmp_p->get_num_vectors()==blocksize/2); + ASSERT_TRUE(tmp_q->get_num_vectors()==blocksize/2); + total+=tmp_p->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); + mgr.end(); +} + +TEST(DataManager, set_blockwise_on_off) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t 
blocksize=2; + const index_t num_blocks_per_burst=2; + const index_t num_distributions=1; + + SGMatrix<float64_t> data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + auto feats_p=new CDenseFeatures<float64_t>(data_p); + + DataManager mgr(num_distributions); + mgr.samples_at(0)=feats_p; + mgr.set_blocksize(blocksize); + mgr.set_num_blocks_per_burst(num_blocks_per_burst); + + mgr.set_blockwise(false); + mgr.start(); + auto next_burst=mgr.next(); + ASSERT_TRUE(!next_burst.empty()); + ASSERT_TRUE(next_burst.num_blocks()==1); + auto casted=dynamic_cast<CDenseFeatures<float64_t>*>(next_burst[0][0].get()); + ASSERT_TRUE(casted!=nullptr); + ASSERT_TRUE(casted->get_num_vectors()==num_vec); + next_burst=mgr.next(); + ASSERT_TRUE(next_burst.empty()); + mgr.end(); + + mgr.reset(); + mgr.set_blockwise(true); + mgr.start(); + auto total=0; + next_burst=mgr.next(); + while (!next_burst.empty()) + { + ASSERT_TRUE(next_burst.num_blocks()==num_blocks_per_burst); + for (auto i=0; i<next_burst.num_blocks(); ++i) + { + auto tmp=dynamic_cast<CDenseFeatures<float64_t>*>(next_burst[0][i].get()); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==blocksize); + total+=tmp->get_num_vectors(); + } + next_burst=mgr.next(); + } + ASSERT_TRUE(total==num_vec); +} diff --git a/tests/unit/statistical_testing/internals/FeaturesUtil_unittest.cc b/tests/unit/statistical_testing/internals/FeaturesUtil_unittest.cc new file mode 100644 index 00000000000..5d48e79c007 --- /dev/null +++ b/tests/unit/statistical_testing/internals/FeaturesUtil_unittest.cc @@ -0,0 +1,141 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(FeaturesUtil, create_shallow_copy) +{ + const index_t dim=2; + const index_t num_vec=10; + + SGMatrix<float64_t> data(dim, num_vec); + std::iota(data.matrix, data.matrix+dim*num_vec, 0); + + auto feats=new CDenseFeatures<float64_t>(data); + SGVector<index_t> inds(5); + std::iota(inds.data(), inds.data()+inds.size(), 3); + feats->add_subset(inds); + SGVector<index_t> inds2(2); + std::iota(inds2.data(), inds2.data()+inds2.size(), 1); + feats->add_subset(inds2); + + auto shallow_copy=static_cast<CDenseFeatures<float64_t>*>(FeaturesUtil::create_shallow_copy(feats)); + int32_t num_feats=0, num_vecs=0; + float64_t* copied_data=shallow_copy->get_feature_matrix(num_feats, num_vecs); + ASSERT_TRUE(data.data()==copied_data); + ASSERT_TRUE(dim==num_feats); + ASSERT_TRUE(num_vec==num_vecs); + + SGMatrix<float64_t> src=feats->get_feature_matrix(); + SGMatrix<float64_t> dst=shallow_copy->get_feature_matrix(); + ASSERT(src.equals(dst)); + + shallow_copy->remove_all_subsets(); + SG_UNREF(shallow_copy); + + feats->remove_all_subsets(); + SG_UNREF(feats); +} + +TEST(FeaturesUtil, create_merged_copy) +{ + const index_t dim=2; + const index_t num_vec=3; + + SGMatrix<float64_t> data(dim, num_vec); + std::iota(data.matrix, data.matrix+dim*num_vec, 0); + + auto feats_a=new CDenseFeatures<float64_t>(data); + SGVector<index_t> inds_a(2); + inds_a[0]=1; + inds_a[1]=2; + feats_a->add_subset(inds_a); + SGMatrix<float64_t> data_a=feats_a->get_feature_matrix(); + + auto feats_b=new CDenseFeatures<float64_t>(data); + SGVector<index_t> inds_b(2); + inds_b[0]=0; + inds_b[1]=2; + feats_b->add_subset(inds_b); + SGMatrix<float64_t> data_b=feats_b->get_feature_matrix(); + + SGMatrix<float64_t> merged(dim, data_a.num_cols+data_b.num_cols); + std::copy(data_a.data(), data_a.data()+data_a.size(), merged.data()); + std::copy(data_b.data(), data_b.data()+data_b.size(), merged.data()+data_a.size()); + + auto merged_copy=static_cast<CDenseFeatures<float64_t>*>(FeaturesUtil::create_merged_copy(feats_a, feats_b)); + SGMatrix<float64_t> copied(merged_copy->get_feature_matrix()); 
+ ASSERT_TRUE(merged.equals(copied)); + + SG_UNREF(merged_copy); + SG_UNREF(feats_a); + SG_UNREF(feats_b); +} + +TEST(FeaturesUtil, clone_subset_stack) +{ + const index_t dim=2; + const index_t num_vec=10; + + SGMatrix<float64_t> data(dim, num_vec); + std::iota(data.matrix, data.matrix+dim*num_vec, 0); + + auto feats=new CDenseFeatures<float64_t>(data); + SGVector<index_t> inds(5); + std::iota(inds.data(), inds.data()+inds.size(), 3); + feats->add_subset(inds); + SGVector<index_t> inds2(2); + std::iota(inds2.data(), inds2.data()+inds2.size(), 1); + feats->add_subset(inds2); + + auto copy=new CDenseFeatures<float64_t>(data); + FeaturesUtil::clone_subset_stack(feats, copy); + + auto src_subset_stack=feats->get_subset_stack(); + auto dst_subset_stack=copy->get_subset_stack(); + ASSERT_TRUE(src_subset_stack->equals(dst_subset_stack)); + SG_UNREF(src_subset_stack); + SG_UNREF(dst_subset_stack); + + copy->remove_all_subsets(); + SG_UNREF(copy); + + feats->remove_all_subsets(); + SG_UNREF(feats); +} diff --git a/tests/unit/statistical_testing/internals/InitPerFeature_unittest.cc b/tests/unit/statistical_testing/internals/InitPerFeature_unittest.cc new file mode 100644 index 00000000000..22a8577336a --- /dev/null +++ b/tests/unit/statistical_testing/internals/InitPerFeature_unittest.cc @@ -0,0 +1,69 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(InitPerFeature, assignment_and_cast_operators) +{ + const index_t dim=1; + const index_t num_vec=1; + const index_t num_distributions=1; + + SGMatrix<float64_t> data_p(dim, num_vec); + data_p(0, 0)=0; + auto feats_p=new CDenseFeatures<float64_t>(data_p); + + DataManager data_mgr(num_distributions); + data_mgr.samples_at(0)=feats_p; + const DataManager& const_data_mgr=data_mgr; + + auto stored_feats=data_mgr.samples_at(0); + bool typecheck=std::is_same::value; + ASSERT_TRUE(typecheck); + ASSERT_TRUE(feats_p==stored_feats); + + auto stored_feats2=const_data_mgr.samples_at(0); + typecheck=std::is_same::value; + ASSERT_TRUE(typecheck); + ASSERT_TRUE(feats_p==stored_feats2); + + const CFeatures* samples=static_cast<const CFeatures*>(stored_feats); + ASSERT_TRUE(feats_p==samples); +} diff --git a/tests/unit/statistical_testing/internals/KernelManager_unittest.cc b/tests/unit/statistical_testing/internals/KernelManager_unittest.cc new file mode 100644 index 00000000000..480c156f4e1 --- /dev/null +++ b/tests/unit/statistical_testing/internals/KernelManager_unittest.cc @@ -0,0 +1,69 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(KernelManager, store_precompute_restore) +{ + const index_t dim=1; + const index_t num_vec=1; + const index_t num_kernels=1; + + SGMatrix<float64_t> data_p(dim, num_vec); + data_p(0, 0)=0; + + auto feats=new CDenseFeatures<float64_t>(data_p); + auto kernel=new CGaussianKernel(); + kernel->set_width(0.5); + + KernelManager kernel_mgr(num_kernels); + const KernelManager& const_kernel_mgr=kernel_mgr; + + kernel_mgr.kernel_at(0)=kernel; + ASSERT_TRUE(const_kernel_mgr.kernel_at(0)->get_kernel_type()==K_GAUSSIAN); + + CKernel* k=const_kernel_mgr.kernel_at(0); + k->init(feats, feats); + kernel_mgr.precompute_kernel_at(0); + ASSERT_TRUE(const_kernel_mgr.kernel_at(0)!=kernel); + ASSERT_TRUE(const_kernel_mgr.kernel_at(0)->get_kernel_type()==K_CUSTOM); + + kernel_mgr.restore_kernel_at(0); + ASSERT_TRUE(const_kernel_mgr.kernel_at(0)==kernel); + ASSERT_TRUE(const_kernel_mgr.kernel_at(0)->get_kernel_type()==K_GAUSSIAN); +} diff --git a/tests/unit/statistical_testing/internals/Kernel_unittest.cc b/tests/unit/statistical_testing/internals/Kernel_unittest.cc new file mode 100644 index 00000000000..0812c37948f --- /dev/null +++ b/tests/unit/statistical_testing/internals/Kernel_unittest.cc @@ -0,0 +1,63 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(SelfAdjointKernelFunctor, kernel) +{ + const index_t dim=3; + const index_t num_vec=8; + const float64_t sigma=0.1; + + SGMatrix data(dim, num_vec); + for (auto i=0; irandom(0.0, 0.1); + auto feats=some >(data); + + auto kernel=some(10, 2*sigma*sigma); + kernel->init(feats, feats); + + SelfAdjointPrecomputedKernel kernel_functor; + kernel_functor.precompute(kernel); + + for (auto i=0; ikernel(i, j), kernel_functor(i, j), 1E-6); + } +} diff --git a/tests/unit/statistical_testing/internals/MultiKernelMMD_unittest.cc b/tests/unit/statistical_testing/internals/MultiKernelMMD_unittest.cc new file mode 100644 index 00000000000..412483a18f2 --- /dev/null +++ b/tests/unit/statistical_testing/internals/MultiKernelMMD_unittest.cc @@ -0,0 +1,260 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +namespace shogun +{ + +class CTwoDistributionTestMock : public CTwoDistributionTest +{ +public: + MOCK_METHOD0(compute_statistic, float64_t()); + MOCK_METHOD0(sample_null, SGVector()); +}; + +} + +using namespace shogun; +using namespace internal; +using namespace mmd; +using Eigen::Map; +using Eigen::MatrixXd; + +TEST(MultiKernelMMD, biased_full) +{ + const index_t m=5; + const index_t n=10; + const index_t dim=1; + const float64_t difference=0.5; + const index_t num_kernels=10; + const EStatisticType stype=ST_BIASED_FULL; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto test=some(); + test->set_p(feats_p); + test->set_q(feats_q); + + KernelManager kernel_mgr(num_kernels); + for (auto i=0, sigma=-5; icompute_joint_distance(distance)); + + ComputeMMD tester; + tester.m_n_x=m; + tester.m_n_y=n; + tester.m_stype=stype; + SGVector values=tester(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + 
auto data_p=static_cast*>(feats_p)->get_feature_matrix(); + auto data_q=static_cast*>(feats_q)->get_feature_matrix(); + SGMatrix data_p_and_q(dim, m+n); + std::copy(data_p.data(), data_p.data()+data_p.size(), data_p_and_q.data()); + std::copy(data_q.data(), data_q.data()+data_q.size(), data_p_and_q.data()+data_p.size()); + auto feats_p_and_q=new CDenseFeatures(data_p_and_q); + SG_REF(feats_p_and_q); + + SGVector ref(kernel_mgr.num_kernels()); + for (size_t i=0; iinit(feats_p_and_q, feats_p_and_q); + SGMatrix km=kernel->get_kernel_matrix(); + Map map(km.data(), km.num_rows, km.num_cols); + auto term_0=map.block(0, 0, m, m).sum(); + auto term_1=map.block(m, m, n, n).sum(); + auto term_2=map.block(m, 0, n, m).sum(); + term_0/=m*m; + term_1/=n*n; + term_2/=m*n; + ref[i]=term_0+term_1-2*term_2; + kernel->remove_lhs_and_rhs(); + } + SG_UNREF(feats_p_and_q); + + ASSERT_EQ(ref.size(), values.size()); + for (auto i=0; i(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto test=some(); + test->set_p(feats_p); + test->set_q(feats_q); + + KernelManager kernel_mgr(num_kernels); + for (auto i=0, sigma=-5; icompute_joint_distance(distance)); + + ComputeMMD tester; + tester.m_n_x=m; + tester.m_n_y=n; + tester.m_stype=stype; + SGVector values=tester(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + auto data_p=static_cast*>(feats_p)->get_feature_matrix(); + auto data_q=static_cast*>(feats_q)->get_feature_matrix(); + SGMatrix data_p_and_q(dim, m+n); + std::copy(data_p.data(), data_p.data()+data_p.size(), data_p_and_q.data()); + std::copy(data_q.data(), data_q.data()+data_q.size(), data_p_and_q.data()+data_p.size()); + auto feats_p_and_q=new CDenseFeatures(data_p_and_q); + SG_REF(feats_p_and_q); + + SGVector ref(kernel_mgr.num_kernels()); + for (size_t i=0; iinit(feats_p_and_q, feats_p_and_q); + SGMatrix km=kernel->get_kernel_matrix(); + Map map(km.data(), km.num_rows, 
km.num_cols); + auto term_0=map.block(0, 0, m, m).sum()-map.diagonal().head(m).sum(); + auto term_1=map.block(m, m, n, n).sum()-map.diagonal().tail(n).sum(); + auto term_2=map.block(m, 0, n, m).sum(); + term_0/=m*(m-1); + term_1/=n*(n-1); + term_2/=m*n; + ref[i]=term_0+term_1-2*term_2; + kernel->remove_lhs_and_rhs(); + } + SG_UNREF(feats_p_and_q); + + ASSERT_EQ(ref.size(), values.size()); + for (auto i=0; i(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(m); + auto feats_q=gen_q->get_streamed_features(n); + + auto test=some(); + test->set_p(feats_p); + test->set_q(feats_q); + + KernelManager kernel_mgr(num_kernels); + for (auto i=0, sigma=-5; icompute_joint_distance(distance)); + + ComputeMMD tester; + tester.m_n_x=m; + tester.m_n_y=n; + tester.m_stype=stype; + SGVector values=tester(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + auto data_p=static_cast*>(feats_p)->get_feature_matrix(); + auto data_q=static_cast*>(feats_q)->get_feature_matrix(); + SGMatrix data_p_and_q(dim, m+n); + std::copy(data_p.data(), data_p.data()+data_p.size(), data_p_and_q.data()); + std::copy(data_q.data(), data_q.data()+data_q.size(), data_p_and_q.data()+data_p.size()); + auto feats_p_and_q=new CDenseFeatures(data_p_and_q); + SG_REF(feats_p_and_q); + + SGVector ref(kernel_mgr.num_kernels()); + for (size_t i=0; iinit(feats_p_and_q, feats_p_and_q); + SGMatrix km=kernel->get_kernel_matrix(); + Map map(km.data(), km.num_rows, km.num_cols); + auto term_0=map.block(0, 0, m, m).sum()-map.diagonal().head(m).sum(); + auto term_1=map.block(m, m, n, n).sum()-map.diagonal().tail(n).sum(); + auto term_2=map.block(m, 0, n, m).sum()-map.block(m, 0, n, m).diagonal().sum(); + term_0/=m*(m-1); + term_1/=n*(n-1); + term_2/=m*(n-1); + ref[i]=term_0+term_1-2*term_2; + kernel->remove_lhs_and_rhs(); + } + SG_UNREF(feats_p_and_q); + + ASSERT_EQ(ref.size(), values.size()); + for (auto i=0; i +#include +#include +#include +#include +#include +#include 
+#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; +using namespace mmd; + +using Eigen::Map; +using Eigen::MatrixXf; +using Eigen::Dynamic; +using Eigen::PermutationMatrix; + +TEST(PermutationMMD, biased_full_single_kernel) +{ + const index_t dim=2; + const index_t n=13; + const index_t m=7; + const index_t num_null_samples=5; + const auto stype=ST_BIASED_FULL; + + SGMatrix data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, m); + std::iota(data_q.matrix, data_q.matrix+dim*m, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*m, [&m](float64_t& val) { val/=2*m; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto kernel_matrix=kernel->get_kernel_matrix(); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGVector result_1=permutation_mmd(kernel_matrix); + + auto compute_mmd=ComputeMMD(); + compute_mmd.m_n_x=n; + compute_mmd.m_n_y=m; + compute_mmd.m_stype=stype; + + Map map(kernel_matrix.matrix, kernel_matrix.num_rows, kernel_matrix.num_cols); + SGVector result_2(num_null_samples); + sg_rand->set_seed(12345); + for (auto i=0; i perm(kernel_matrix.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), perm.indices().size(), false); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), permuted.cols(), false); + 
result_2[i]=compute_mmd(permuted_km); + } + + SGVector inds(kernel_matrix.num_rows); + SGVector result_3(num_null_samples); + sg_rand->set_seed(12345); + for (auto i=0; iadd_subset(inds); + kernel->init(feats, feats); + kernel_matrix=kernel->get_kernel_matrix(); + result_3[i]=compute_mmd(kernel_matrix); + feats->remove_subset(); + } + + for (auto i=0; i data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, m); + std::iota(data_q.matrix, data_q.matrix+dim*m, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*m, [&m](float64_t& val) { val/=2*m; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto kernel_matrix=kernel->get_kernel_matrix(); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGVector result_1=permutation_mmd(kernel_matrix); + + auto compute_mmd=ComputeMMD(); + compute_mmd.m_n_x=n; + compute_mmd.m_n_y=m; + compute_mmd.m_stype=stype; + + Map map(kernel_matrix.matrix, kernel_matrix.num_rows, kernel_matrix.num_cols); + SGVector result_2(num_null_samples); + sg_rand->set_seed(12345); + for (auto i=0; i perm(kernel_matrix.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), perm.indices().size(), false); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), permuted.cols(), false); + result_2[i]=compute_mmd(permuted_km); + } + + SGVector inds(kernel_matrix.num_rows); + SGVector result_3(num_null_samples); + sg_rand->set_seed(12345); + 
for (auto i=0; iadd_subset(inds); + kernel->init(feats, feats); + kernel_matrix=kernel->get_kernel_matrix(); + result_3[i]=compute_mmd(kernel_matrix); + feats->remove_subset(); + } + + for (auto i=0; i data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, n); + std::iota(data_q.matrix, data_q.matrix+dim*n, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*n, [&n](float64_t& val) { val/=2*n; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto kernel_matrix=kernel->get_kernel_matrix(); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=n; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGVector result_1=permutation_mmd(kernel_matrix); + + auto compute_mmd=ComputeMMD(); + compute_mmd.m_n_x=n; + compute_mmd.m_n_y=n; + compute_mmd.m_stype=stype; + + Map map(kernel_matrix.matrix, kernel_matrix.num_rows, kernel_matrix.num_cols); + SGVector result_2(num_null_samples); + sg_rand->set_seed(12345); + for (auto i=0; i perm(kernel_matrix.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), perm.indices().size(), false); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), permuted.cols(), false); + result_2[i]=compute_mmd(permuted_km); + } + + SGVector inds(kernel_matrix.num_rows); + SGVector result_3(num_null_samples); + sg_rand->set_seed(12345); + for (auto i=0; iadd_subset(inds); + kernel->init(feats, feats); + kernel_matrix=kernel->get_kernel_matrix(); + result_3[i]=compute_mmd(kernel_matrix); + 
feats->remove_subset(); + } + + for (auto i=0; i data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, m); + std::iota(data_q.matrix, data_q.matrix+dim*m, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*m, [&m](float64_t& val) { val/=2*m; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto kernel_matrix=kernel->get_kernel_matrix(); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGVector result_1=permutation_mmd(kernel_matrix); + + sg_rand->set_seed(12345); + SGVector result_2=permutation_mmd(Kernel(kernel)); + + EXPECT_TRUE(result_1.size()==result_2.size()); + for (auto i=0; i(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + SG_REF(merged_feats); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + kernel_mgr.set_precomputed_distance(precomputed_distance); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGMatrix 
null_samples=permutation_mmd(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + ASSERT_EQ(null_samples.num_cols, num_kernels); + ASSERT_EQ(null_samples.num_rows, num_null_samples); + + for (auto k=0; kinit(merged_feats, merged_feats); + sg_rand->set_seed(12345); + SGVector curr_null_samples=permutation_mmd(kernel->get_kernel_matrix()); + + ASSERT_EQ(curr_null_samples.size(), null_samples.num_rows); + for (auto i=0; iremove_lhs_and_rhs(); + } + SG_UNREF(merged_feats); +} + +TEST(PermutationMMD, unbiased_full_multi_kernel) +{ + const index_t n=24; + const index_t m=15; + const index_t dim=2; + const index_t num_null_samples=5; + const index_t num_kernels=4; + const index_t cache_size=10; + const float64_t difference=0.5; + const auto stype=ST_UNBIASED_FULL; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + SG_REF(merged_feats); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + kernel_mgr.set_precomputed_distance(precomputed_distance); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGMatrix null_samples=permutation_mmd(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + ASSERT_EQ(null_samples.num_cols, num_kernels); + ASSERT_EQ(null_samples.num_rows, num_null_samples); + + for (auto k=0; kinit(merged_feats, merged_feats); + sg_rand->set_seed(12345); + SGVector 
curr_null_samples=permutation_mmd(kernel->get_kernel_matrix()); + + ASSERT_EQ(curr_null_samples.size(), null_samples.num_rows); + for (auto i=0; iremove_lhs_and_rhs(); + } + SG_UNREF(merged_feats); +} + +TEST(PermutationMMD, unbiased_incomplete_multi_kernel) +{ + const index_t n=18; + const index_t m=18; + const index_t dim=2; + const index_t num_null_samples=5; + const index_t num_kernels=4; + const index_t cache_size=10; + const float64_t difference=0.5; + const auto stype=ST_UNBIASED_INCOMPLETE; + + auto gen_p=some(0, dim, 0); + auto gen_q=some(difference, dim, 0); + + auto feats_p=gen_p->get_streamed_features(n); + auto feats_q=gen_q->get_streamed_features(m); + auto merged_feats=static_cast*>(FeaturesUtil::create_merged_copy(feats_p, feats_q)); + SG_REF(merged_feats); + + KernelManager kernel_mgr; + for (auto i=0; iinit(merged_feats, merged_feats); + auto precomputed_distance=some(); + auto distance_matrix=distance_instance->get_distance_matrix(); + precomputed_distance->set_triangle_distance_matrix_from_full(distance_matrix.data(), n+m, n+m); + SG_UNREF(distance_instance); + kernel_mgr.set_precomputed_distance(precomputed_distance); + + auto permutation_mmd=PermutationMMD(); + permutation_mmd.m_n_x=n; + permutation_mmd.m_n_y=m; + permutation_mmd.m_stype=stype; + permutation_mmd.m_num_null_samples=num_null_samples; + + sg_rand->set_seed(12345); + SGMatrix null_samples=permutation_mmd(kernel_mgr); + kernel_mgr.unset_precomputed_distance(); + + ASSERT_EQ(null_samples.num_cols, num_kernels); + ASSERT_EQ(null_samples.num_rows, num_null_samples); + + for (auto k=0; kinit(merged_feats, merged_feats); + sg_rand->set_seed(12345); + SGVector curr_null_samples=permutation_mmd(kernel->get_kernel_matrix()); + + ASSERT_EQ(curr_null_samples.size(), null_samples.num_rows); + for (auto i=0; iremove_lhs_and_rhs(); + } + SG_UNREF(merged_feats); +} diff --git a/tests/unit/statistical_testing/internals/StreamingDataFetcher_unittest.cc 
b/tests/unit/statistical_testing/internals/StreamingDataFetcher_unittest.cc new file mode 100644 index 00000000000..d6b4146af03 --- /dev/null +++ b/tests/unit/statistical_testing/internals/StreamingDataFetcher_unittest.cc @@ -0,0 +1,153 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace internal; + +TEST(StreamingDataFetcher, full_data) +{ + const index_t dim=3; + const index_t num_vec=8; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + CStreamingFeatures *streaming_p = new CStreamingDenseFeatures(feats_p); + + StreamingDataFetcher fetcher(streaming_p); + fetcher.set_num_samples(num_vec); + + fetcher.start(); + auto curr=fetcher.next(); + ASSERT_TRUE(curr!=nullptr); + + auto tmp=dynamic_cast(curr); + ASSERT_TRUE(tmp!=nullptr); + + SG_UNREF(curr); + + curr=fetcher.next(); + ASSERT_TRUE(curr==nullptr); + fetcher.end(); +} + +TEST(StreamingDataFetcher, block_data) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + using feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + CStreamingFeatures *streaming_p = new CStreamingDenseFeatures(feats_p); + + StreamingDataFetcher fetcher(streaming_p); + fetcher.set_num_samples(num_vec); + + fetcher.fetch_blockwise() + .with_blocksize(blocksize) + .with_num_blocks_per_burst(num_blocks_per_burst); + + fetcher.start(); + auto curr=fetcher.next(); + ASSERT_TRUE(curr!=nullptr); + while (curr!=nullptr) + { + auto tmp=dynamic_cast(curr); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==blocksize*num_blocks_per_burst); + SG_UNREF(curr); + curr=fetcher.next(); + } + fetcher.end(); +} + +TEST(StreamingDataFetcher, DISABLED_reset_functionality) +{ + const index_t dim=3; + const index_t num_vec=8; + const index_t blocksize=2; + const index_t num_blocks_per_burst=2; + + SGMatrix data_p(dim, num_vec); + std::iota(data_p.matrix, data_p.matrix+dim*num_vec, 0); + + using 
feat_type=CDenseFeatures; + auto feats_p=new feat_type(data_p); + CStreamingFeatures *streaming_p = new CStreamingDenseFeatures(feats_p); + + StreamingDataFetcher fetcher(streaming_p); + fetcher.set_num_samples(num_vec); + + fetcher.start(); + auto curr=fetcher.next(); + ASSERT_TRUE(curr!=nullptr); + + auto tmp=dynamic_cast(curr); + ASSERT_TRUE(tmp!=nullptr); + + SG_UNREF(curr); + + curr=fetcher.next(); + ASSERT_TRUE(curr==nullptr); + + fetcher.reset(); + fetcher.fetch_blockwise() + .with_blocksize(blocksize) + .with_num_blocks_per_burst(num_blocks_per_burst); + + fetcher.start(); + curr=fetcher.next(); + ASSERT_TRUE(curr!=nullptr); + while (curr!=nullptr) + { + tmp=dynamic_cast(curr); + ASSERT_TRUE(tmp!=nullptr); + ASSERT_TRUE(tmp->get_num_vectors()==blocksize*num_blocks_per_burst); + SG_UNREF(curr); + curr=fetcher.next(); + } + fetcher.end(); +} diff --git a/tests/unit/statistical_testing/internals/WithinBlockPermutation_unittest.cc b/tests/unit/statistical_testing/internals/WithinBlockPermutation_unittest.cc new file mode 100644 index 00000000000..eb568aa77b6 --- /dev/null +++ b/tests/unit/statistical_testing/internals/WithinBlockPermutation_unittest.cc @@ -0,0 +1,254 @@ +/* + * Copyright (c) The Shogun Machine Learning Toolbox + * Written (w) 2016 Soumyajit De + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR + * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * The views and conclusions contained in the software and documentation are those + * of the authors and should not be interpreted as representing official policies, + * either expressed or implied, of the Shogun Development Team. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +using namespace shogun; +using namespace Eigen; + +TEST(WithinBlockPermutation, biased_full) +{ + const index_t dim=2; + const index_t n=13; + const index_t m=7; + + using operation=std::function)>; + + SGMatrix data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, m); + std::iota(data_q.matrix, data_q.matrix+dim*m, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*m, [&m](float64_t& val) { val/=2*m; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto mat=kernel->get_kernel_matrix(); + + // compute using within-block-permutation functor + operation compute=shogun::internal::mmd::WithinBlockPermutation(n, m, ST_BIASED_FULL); + sg_rand->set_seed(12345); + auto result_1=compute(mat); + + auto mmd=shogun::internal::mmd::ComputeMMD(); + mmd.m_n_x=n; + mmd.m_n_y=m; + mmd.m_stype=ST_BIASED_FULL; + compute=mmd; + + // compute a row-column permuted temporary matrix first + // then compute a biased-full statistic on this matrix + Map map(mat.matrix, mat.num_rows, mat.num_cols); + PermutationMatrix perm(mat.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), perm.indices().size(), false); + sg_rand->set_seed(12345); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), permuted.cols(), false); + auto result_2=compute(permuted_km); + + // shuffle the features first, recompute the kernel matrix using + // shuffled samples, then compute a biased-full 
statistic on this matrix + SGVector inds(mat.num_rows); + std::iota(inds.vector, inds.vector+inds.vlen, 0); + sg_rand->set_seed(12345); + CMath::permute(inds); + feats->add_subset(inds); + kernel->init(feats, feats); + mat=kernel->get_kernel_matrix(); + auto result_3=compute(mat); + + EXPECT_NEAR(result_1, result_2, 1E-6); + EXPECT_NEAR(result_1, result_3, 1E-6); + + SG_UNREF(feats); +} + +TEST(WithinBlockPermutation, unbiased_full) +{ + const index_t dim=2; + const index_t n=13; + const index_t m=7; + + using operation=std::function)>; + + SGMatrix data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, m); + std::iota(data_q.matrix, data_q.matrix+dim*m, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*m, [&m](float64_t& val) { val/=2*m; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto mat=kernel->get_kernel_matrix(); + + // compute using within-block-permutation functor + operation compute=shogun::internal::mmd::WithinBlockPermutation(n, m, ST_UNBIASED_FULL); + sg_rand->set_seed(12345); + auto result_1=compute(mat); + + auto mmd=shogun::internal::mmd::ComputeMMD(); + mmd.m_n_x=n; + mmd.m_n_y=m; + mmd.m_stype=ST_UNBIASED_FULL; + compute=mmd; + + // compute a row-column permuted temporary matrix first + // then compute unbiased-full statistic on this matrix + Map map(mat.matrix, mat.num_rows, mat.num_cols); + PermutationMatrix perm(mat.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), perm.indices().size(), false); + sg_rand->set_seed(12345); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), 
permuted.cols(), false); + auto result_2=compute(permuted_km); + + // shuffle the features first, recompute the kernel matrix using + // shuffled samples, then compute unbiased-full statistic on this matrix + SGVector inds(mat.num_rows); + std::iota(inds.vector, inds.vector+inds.vlen, 0); + sg_rand->set_seed(12345); + CMath::permute(inds); + feats->add_subset(inds); + kernel->init(feats, feats); + mat=kernel->get_kernel_matrix(); + auto result_3=compute(mat); + + EXPECT_NEAR(result_1, result_2, 1E-6); + EXPECT_NEAR(result_1, result_3, 1E-6); + + SG_UNREF(feats); +} + +TEST(WithinBlockPermutation, unbiased_incomplete) +{ + const index_t dim=2; + const index_t n=10; + + using operation=std::function)>; + + SGMatrix data_p(dim, n); + std::iota(data_p.matrix, data_p.matrix+dim*n, 1); + std::for_each(data_p.matrix, data_p.matrix+dim*n, [&n](float64_t& val) { val/=n; }); + + SGMatrix data_q(dim, n); + std::iota(data_q.matrix, data_q.matrix+dim*n, n+1); + std::for_each(data_q.matrix, data_q.matrix+dim*n, [&n](float64_t& val) { val/=2*n; }); + + auto feats_p=new CDenseFeatures(data_p); + auto feats_q=new CDenseFeatures(data_q); + auto feats=feats_p->create_merged_copy(feats_q); + SG_REF(feats); + SG_UNREF(feats_p); + SG_UNREF(feats_q); + + auto kernel=some(); + kernel->set_width(2.0); + + kernel->init(feats, feats); + auto mat=kernel->get_kernel_matrix(); + + // compute using within-block-permutation functor + operation compute=shogun::internal::mmd::WithinBlockPermutation(n, n, ST_UNBIASED_INCOMPLETE); + sg_rand->set_seed(12345); + auto result_1=compute(mat); + + auto mmd=shogun::internal::mmd::ComputeMMD(); + mmd.m_n_x=n; + mmd.m_n_y=n; + mmd.m_stype=ST_UNBIASED_INCOMPLETE; + compute=mmd; + + // compute a row-column permuted temporary matrix first + // then compute unbiased-incomplete statistic on this matrix + Map map(mat.matrix, mat.num_rows, mat.num_cols); + PermutationMatrix perm(mat.num_rows); + perm.setIdentity(); + SGVector perminds(perm.indices().data(), 
perm.indices().size(), false); + sg_rand->set_seed(12345); + CMath::permute(perminds); + MatrixXf permuted = perm.transpose()*map*perm; + SGMatrix permuted_km(permuted.data(), permuted.rows(), permuted.cols(), false); + auto result_2=compute(permuted_km); + + // shuffle the features first, recompute the kernel matrix using + // shuffled samples, then compute unbiased-incomplete statistic on this matrix + SGVector inds(mat.num_rows); + std::iota(inds.vector, inds.vector+inds.vlen, 0); + sg_rand->set_seed(12345); + CMath::permute(inds); + feats->add_subset(inds); + kernel->init(feats, feats); + mat=kernel->get_kernel_matrix(); + auto result_3=compute(mat); + + EXPECT_NEAR(result_1, result_2, 1E-6); + EXPECT_NEAR(result_1, result_3, 1E-6); + + SG_UNREF(feats); +} diff --git a/tests/unit/statistics/HSIC_unittest.cc b/tests/unit/statistics/HSIC_unittest.cc deleted file mode 100644 index 756039245ce..00000000000 --- a/tests/unit/statistics/HSIC_unittest.cc +++ /dev/null @@ -1,166 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2012-2013 Heiko Strathmann, pl8787 - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. - */ - -#include -#include -#include -#include -#include - -using namespace shogun; - -void create_fixed_data_kernel_small(CFeatures*& features_p, - CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q) -{ - index_t m=2; - index_t d=3; - - SGMatrix p(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - p.matrix[i]=i; - - SGMatrix q(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - q.matrix[i]=i+10; - - features_p=new CDenseFeatures(p); - features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is different */ - kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); -} - -void create_fixed_data_kernel_big(CFeatures*& features_p, - CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q) -{ - index_t m=10; - index_t d=7; - - SGMatrix p(d,m); - for (index_t i=0; i q(d,m); - for (index_t i=0; i(p); - features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is 
different */ - kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); -} - -/** tests the hsic statistic for a single fixed data case and ensures - * equality with sma implementation */ -TEST(HSIC, hsic_fixed) -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_small(features_p, features_q, kernel_p, kernel_q); - - index_t m=features_p->get_num_vectors(); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - /* assert matlab result, note that compute statistic computes m*hsic */ - float64_t difference=hsic->compute_statistic(); - - EXPECT_NEAR(difference, m*0.164761446385339, 1e-15); - - SG_UNREF(hsic); -} - -// disabled as I think previous inverse_gamma_cdf was faulty -// now unit test fails. Needs to be investigated statistically -TEST(DISABLED_HSIC, hsic_gamma) -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - hsic->set_null_approximation_method(HSIC_GAMMA); - float64_t p=hsic->compute_p_value(0.05); - - EXPECT_NEAR(p, 0.172182287884256, 1e-14); - - SG_UNREF(hsic); -} - -TEST(HSIC, hsic_sample_null) -{ - CFeatures* features_p=NULL; - CFeatures* features_q=NULL; - CKernel* kernel_p=NULL; - CKernel* kernel_q=NULL; - create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q); - - CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q); - - /* do sampling null */ - hsic->set_null_approximation_method(PERMUTATION); - hsic->compute_p_value(0.05); - - /* ensure that sampling null of hsic leads to same results as using - * CKernelIndependenceTest */ - CMath::init_random(1); - float64_t mean1=CStatistics::mean(hsic->sample_null()); - float64_t 
var1=CStatistics::variance(hsic->sample_null()); - - CMath::init_random(1); - float64_t mean2=CStatistics::mean( - hsic->CKernelIndependenceTest::sample_null()); - float64_t var2=CStatistics::variance(hsic->sample_null()); - - /* assert than results are the same from bot sampling null impl. */ - EXPECT_NEAR(mean1, mean2, 1e-7); - EXPECT_NEAR(var1, var2, 1e-7); - - SG_UNREF(hsic); -} - diff --git a/tests/unit/statistics/LinearTimeMMD_unittest.cc b/tests/unit/statistics/LinearTimeMMD_unittest.cc deleted file mode 100644 index 8fca231d14b..00000000000 --- a/tests/unit/statistics/LinearTimeMMD_unittest.cc +++ /dev/null @@ -1,284 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -/** tests the linear mmd statistic for a single data case and ensures - * equality with matlab implementation. Since data from memory is used, - * this is rather complicated, i.e. create dense features and then create - * streaming dense features from them. Normally, just use streaming features - * directly. 
*/ -TEST(LinearTimeMMD,test_linear_mmd_fixed) -{ - index_t m=2; - index_t d=3; - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - SGMatrix data(d, 2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - CDenseFeatures* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p=new CStreamingDenseFeatures( - features_p); - CStreamingFeatures* streaming_q=new CStreamingDenseFeatures( - features_q); - - /* shoguns kernel width is different */ - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(kernel, streaming_p, streaming_q, m); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - /* assert matlab result */ - float64_t statistic=mmd->compute_statistic(); - //SG_SPRINT("statistic=%f\n", statistic); - float64_t difference=statistic-0.034218118311602; - EXPECT_LE(CMath::abs(difference), 10E-16); - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(mmd); -} - -TEST(LinearTimeMMD,test_linear_mmd_statistic_and_Q_fixed) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d, 2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q 
matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p_1=new CStreamingDenseFeatures( - features_p); - CStreamingFeatures* streaming_q_1=new CStreamingDenseFeatures( - features_q); - CStreamingFeatures* streaming_p_2=new CStreamingDenseFeatures( - features_p); - CStreamingFeatures* streaming_q_2=new CStreamingDenseFeatures( - features_q); - - /* create combined kernel with values 2^5 to 2^7 */ - CCombinedKernel* kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd_1=new CLinearTimeMMD(kernel, streaming_p_1, - streaming_q_1, m); - CLinearTimeMMD* mmd_2=new CLinearTimeMMD(kernel, streaming_p_2, - streaming_q_2, m); - - /* results only equal if blocksize is larger than number of samples (other- - * wise, samples are processed in a different combination). 
In practice, - * just use some large value */ - mmd_1->set_blocksize(m); - mmd_2->set_blocksize(m); - - /* start streaming features parser */ - streaming_p_1->start_parser(); - streaming_q_1->start_parser(); - streaming_p_2->start_parser(); - streaming_q_2->start_parser(); - - /* test method */ - SGVector mmds_1; - SGMatrix Q; - mmd_1->compute_statistic_and_Q(mmds_1, Q); - SGVector mmds_2=mmd_2->compute_statistic(true); - - /* display results */ - //Q.display_matrix("Q"); - //mmds_1.display_vector("mmds_1"); - //mmds_2.display_vector("mmds_2"); - - /* assert that both MMD methods give the same results */ - EXPECT_EQ(mmds_1.vlen, mmds_2.vlen); - for (index_t i=0; iend_parser(); - streaming_q_1->end_parser(); - streaming_p_2->end_parser(); - streaming_q_2->end_parser(); - - SG_UNREF(mmd_1); - SG_UNREF(mmd_2); -} - -TEST(LinearTimeMMD,test_linear_mmd_statistic_and_variance_fixed) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d, 2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p=new CStreamingDenseFeatures( - features_p); - CStreamingFeatures* streaming_q=new CStreamingDenseFeatures( - features_q); - - /* create combined kernel with values 2^5 to 2^7 */ - CCombinedKernel* kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t 
sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(kernel, streaming_p, streaming_q, m); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - /* test method */ - SGVector mmds; - SGVector vars; - mmd->compute_statistic_and_variance(mmds, vars, true); - - /* display results */ - //vars.display_vector("vars"); - //mmds.display_vector("mmds"); - - /* assert actual result against fixed MATLAB code */ -// mmds= -// 1.0e-03 * -// 0.156085264965383 -// 0.039043151854851 -// 0.009762153067083 - EXPECT_LE(CMath::abs(mmds[0]-0.000156085264965383), 10E-18); - EXPECT_LE(CMath::abs(mmds[1]-0.000039043151854851), 10E-18); - EXPECT_LE(CMath::abs(mmds[2]-0.000009762153067083), 10E-18); - - /* assert correctness of variance estimates */ -// vars = -// 1.0e-08 * -// 0.418667765635434 -// 0.026197180636036 -// 0.001637799815771 - EXPECT_LE(CMath::abs(vars[0]-0.418667765635434E-8), 10E-23); - EXPECT_LE(CMath::abs(vars[1]-0.026197180636036E-8), 10E-23); - EXPECT_LE(CMath::abs(vars[2]-0.001637799815771E-8), 10E-23); - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(mmd); -} diff --git a/tests/unit/statistics/MMDKernelSelectionCombMaxL2_unittest.cc b/tests/unit/statistics/MMDKernelSelectionCombMaxL2_unittest.cc deleted file mode 100644 index 2b8910a885a..00000000000 --- a/tests/unit/statistics/MMDKernelSelectionCombMaxL2_unittest.cc +++ /dev/null @@ -1,107 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#ifdef USE_GPL_SHOGUN - -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(MMDKernelSelectionCombMaxL2, select_kernel) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p= - new CStreamingDenseFeatures(features_p); - CStreamingFeatures* streaming_q= - new CStreamingDenseFeatures(features_q); - - /* create kernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined_kernel, streaming_p, - streaming_q, m); - - /* kernel selection instance */ - CMMDKernelSelectionCombMaxL2* selection=new CMMDKernelSelectionCombMaxL2( - mmd); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - CKernel* result=selection->select_kernel(); - CCombinedKernel* casted=dynamic_cast(result); - ASSERT(casted); - SGVector 
weights=casted->get_subkernel_weights(); - //weights.display_vector("weights"); - - /* assert weights against matlab */ -// w_l2 = -// 0.761798188424313 -// 0.190556119182660 -// 0.047645692393028 - EXPECT_LE(CMath::abs(weights[0]-0.761798188424313), 10E-15); - EXPECT_LE(CMath::abs(weights[1]-0.190556119182660), 10E-15); - EXPECT_LE(CMath::abs(weights[2]-0.047645692393028), 10E-15); - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(selection); - SG_UNREF(result); -} -#endif //USE_GPL_SHOGUN diff --git a/tests/unit/statistics/MMDKernelSelectionCombOpt_unittest.cc b/tests/unit/statistics/MMDKernelSelectionCombOpt_unittest.cc deleted file mode 100644 index c207397209d..00000000000 --- a/tests/unit/statistics/MMDKernelSelectionCombOpt_unittest.cc +++ /dev/null @@ -1,107 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#ifdef USE_GPL_SHOGUN -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(MMDKernelSelectionCombOpt, select_kernel) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p= - new CStreamingDenseFeatures(features_p); - CStreamingFeatures* streaming_q= - new CStreamingDenseFeatures(features_q); - - /* create kernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined_kernel, streaming_p, - streaming_q, m); - - /* kernel selection instance with regularisation term */ - CMMDKernelSelectionCombOpt* selection=new CMMDKernelSelectionCombOpt(mmd, - 10E-5); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - CKernel* result=selection->select_kernel(); - CCombinedKernel* casted=dynamic_cast(result); - ASSERT(casted); - SGVector 
weights=casted->get_subkernel_weights(); - //weights.display_vector("weights"); - - /* assert weights against matlab */ -// w_opt = -// 0.761798190146441 -// 0.190556117891148 -// 0.047645691962411 - EXPECT_LE(CMath::abs(weights[0]-0.761798190146441), 10E-15); - EXPECT_LE(CMath::abs(weights[1]-0.190556117891148), 10E-15); - EXPECT_LE(CMath::abs(weights[2]-0.047645691962411), 10E-15); - - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(selection); - SG_UNREF(result); -} -#endif //USE_GPL_SHOGUN diff --git a/tests/unit/statistics/MMDKernelSelectionMax_unittest.cc b/tests/unit/statistics/MMDKernelSelectionMax_unittest.cc deleted file mode 100644 index 232dd2c6a01..00000000000 --- a/tests/unit/statistics/MMDKernelSelectionMax_unittest.cc +++ /dev/null @@ -1,182 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(MMDKernelSelectionMax,select_kernel_quadratic_time_mmd) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create kernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(combined_kernel, features_p, - features_q); - - /* kernel selection instance */ - CMMDKernelSelectionMax* selection= - new CMMDKernelSelectionMax(mmd); - - /* assert correct mmd values, maxmmd criterion is already checked with - * linear time mmd maxmmd selection. 
Do biased and unbiased m*MMD */ - - /* unbiased m*MMD */ - mmd->set_statistic_type(UNBIASED_DEPRECATED); - SGVector measures=selection->compute_measures(); - //measures.display_vector("unbiased mmd"); -// unbiased_quad_mmds = -// 0.001164382204818 0.000291185913881 0.000072802127661 - EXPECT_LE(CMath::abs(measures[0]-0.001164382204818), 10E-15); - EXPECT_LE(CMath::abs(measures[1]-0.000291185913881), 10E-15); - EXPECT_LE(CMath::abs(measures[2]-0.000072802127661), 10E-15); - - /* biased m*MMD */ - mmd->set_statistic_type(BIASED_DEPRECATED); - measures=selection->compute_measures(); - //measures.display_vector("biased mmd"); -// biased_quad_mmds = -// 0.001534961982492 0.000383849322208 0.000095969134022 - EXPECT_LE(CMath::abs(measures[0]-0.001534961982492), 10E-15); - EXPECT_LE(CMath::abs(measures[1]-0.000383849322208), 10E-15); - EXPECT_LE(CMath::abs(measures[2]-0.000095969134022), 10E-15); - - /* since convienience constructor was use for mmd, features have to be - * cleaned up by hand */ - SG_UNREF(features_p); - SG_UNREF(features_q); - - SG_UNREF(selection); -} - -TEST(MMDKernelSelectionMax,select_kernel_linear_time_mmd) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p= - new CStreamingDenseFeatures(features_p); - CStreamingFeatures* 
streaming_q= - new CStreamingDenseFeatures(features_q); - - /* create kernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t sq_sigma_twice=sigma*sigma*2; - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined_kernel, streaming_p, - streaming_q, m); - - /* kernel selection instance */ - CMMDKernelSelectionMax* selection= - new CMMDKernelSelectionMax(mmd); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - /* assert that the correct kernel is returned since I checked the MMD - * already very often */ - CKernel* result=selection->select_kernel(); - CGaussianKernel* casted=dynamic_cast(result); - ASSERT(casted); - - /* assert weights against matlab */ - CKernel* reference=combined_kernel->get_first_kernel(); - ASSERT(result==reference); - SG_UNREF(reference); - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(selection); - SG_UNREF(result); -} diff --git a/tests/unit/statistics/MMDKernelSelectionMedian_unittest.cc b/tests/unit/statistics/MMDKernelSelectionMedian_unittest.cc deleted file mode 100644 index f27ca8e9dff..00000000000 --- a/tests/unit/statistics/MMDKernelSelectionMedian_unittest.cc +++ /dev/null @@ -1,87 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. 
- * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(MMDKernelSelectionMedian,select_kernel) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create Gaussian kernelkernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - //SG_SPRINT("adding widths (std)(shogun): "); - for (index_t i=-5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2.0, i); - float64_t sq_sigma_twice=sigma*sigma*2; - //SG_SPRINT("(%f)(%f) ", sigma, sq_sigma_twice); - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - //SG_SPRINT("\n"); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(combined_kernel, features_p, - features_q); - - /* kernel selection instance */ - CMMDKernelSelectionMedian* selection= - new CMMDKernelSelectionMedian(mmd); - - /* we know that a Gaussian kernel is returned when using median, the - * fifth one here one here */ - CGaussianKernel* kernel=(CGaussianKernel*)selection->select_kernel(); - //SG_SPRINT("median kernel width: %f\n", kernel->get_width()); - EXPECT_EQ(kernel->get_width(), 0.5); - - SG_UNREF(kernel); - - /* since convienience constructor was use for mmd, features have to be - * 
cleaned up by hand */ - SG_UNREF(features_p); - SG_UNREF(features_q); - - SG_UNREF(selection); -} diff --git a/tests/unit/statistics/MMDKernelSelectionOpt_unittest.cc b/tests/unit/statistics/MMDKernelSelectionOpt_unittest.cc deleted file mode 100644 index 510e777230f..00000000000 --- a/tests/unit/statistics/MMDKernelSelectionOpt_unittest.cc +++ /dev/null @@ -1,99 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -TEST(MMDKernelSelectionOpt,select_kernel) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data to get some reasonable values for Q matrix */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - //SG_SPRINT("%f, %f\n", max_p, max_q); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* create stremaing features from dense features */ - CStreamingFeatures* streaming_p= - new CStreamingDenseFeatures(features_p); - CStreamingFeatures* streaming_q= - new CStreamingDenseFeatures(features_q); - - /* create kernels with sigmas 2^5 to 2^7 */ - CCombinedKernel* combined_kernel=new CCombinedKernel(); - for (index_t i=5; i<=7; ++i) - { - /* shoguns kernel width is different */ - float64_t sigma=CMath::pow(2, i); - float64_t 
sq_sigma_twice=sigma*sigma*2; - combined_kernel->append_kernel(new CGaussianKernel(10, sq_sigma_twice)); - } - - /* create MMD instance */ - CLinearTimeMMD* mmd=new CLinearTimeMMD(combined_kernel, streaming_p, - streaming_q, m); - - /* kernel selection instance with regularisation term */ - CMMDKernelSelectionOpt* selection= - new CMMDKernelSelectionOpt(mmd, 10E-5); - - /* start streaming features parser */ - streaming_p->start_parser(); - streaming_q->start_parser(); - - SGVector ratios=selection->compute_measures(); - //ratios.display_vector("ratios"); - - /* assert weights against matlab */ -// ratios = -// 0.947668253683719 -// 0.336041393822230 -// 0.093824478467851 - EXPECT_LE(CMath::abs(ratios[0]-0.947668253683719), 10E-15); - EXPECT_LE(CMath::abs(ratios[1]-0.336041393822230), 10E-15); - EXPECT_LE(CMath::abs(ratios[2]-0.093824478467851), 10E-15); - - /* start streaming features parser */ - streaming_p->end_parser(); - streaming_q->end_parser(); - - SG_UNREF(selection); -} diff --git a/tests/unit/statistics/NOCCO_unittest.cc b/tests/unit/statistics/NOCCO_unittest.cc deleted file mode 100644 index 38469b3a32c..00000000000 --- a/tests/unit/statistics/NOCCO_unittest.cc +++ /dev/null @@ -1,183 +0,0 @@ -/* - * Copyright (c) The Shogun Machine Learning Toolbox - * Written (w) 2014 Soumyajit De - * Written (w) 2012-2013 Heiko Strathmann - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * - * 1. Redistributions of source code must retain the above copyright notice, this - * list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright notice, - * this list of conditions and the following disclaimer in the documentation - * and/or other materials provided with the distribution. 
- * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR - * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - * The views and conclusions contained in the software and documentation are those - * of the authors and should not be interpreted as representing official policies, - * either expressed or implied, of the Shogun Development Team. 
- */ - -#include -#include -#include -#include -#include -#include - -using namespace shogun; - -using namespace Eigen; - -/** tests the nocco statistic for a single fixed data case and ensures - * equality with matlab implementation */ -TEST(NOCCO, compute_statistic) -{ - const index_t m=2; - const index_t d=3; - const float64_t epsilon=0.1; - - SGMatrix p(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - p.matrix[i]=i; - - SGMatrix q(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - q.matrix[i]=i+10; - - CFeatures* features_p=new CDenseFeatures(p); - CFeatures* features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is different */ - CKernel* kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - CKernel* kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); - - CNOCCO* nocco=new CNOCCO(kernel_p, kernel_q, features_p, features_q); - nocco->set_epsilon(epsilon); - - float64_t statistic=nocco->compute_statistic(); - - /* compute the statistic locally */ - kernel_p->init(features_p, features_p); - kernel_q->init(features_q, features_q); - - SGMatrix K=kernel_p->get_kernel_matrix(); - SGMatrix L=kernel_q->get_kernel_matrix(); - - K.center(); - L.center(); - - Map Km(K.matrix, K.num_rows, K.num_cols); - Map Lm(L.matrix, L.num_rows, L.num_cols); - - const MatrixXd& Km_inv=(Km+2*m*epsilon*MatrixXd::Identity(2*m, 2*m)).inverse(); - const MatrixXd& Lm_inv=(Lm+2*m*epsilon*MatrixXd::Identity(2*m, 2*m)).inverse(); - - float64_t naive=(Km*Km_inv*Lm*Lm_inv).trace(); - - /* assert locally computed naive result */ - EXPECT_NEAR(statistic, naive, 1E-15); - - SG_UNREF(nocco); -} - -TEST(NOCCO, compute_p_value) -{ - const index_t m=2; - const index_t d=3; - const float64_t epsilon=0.1; - - SGMatrix p(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - p.matrix[i]=i; - - SGMatrix q(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - q.matrix[i]=i+10; - - CFeatures* 
features_p=new CDenseFeatures(p); - CFeatures* features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shoguns kernel width is different */ - CKernel* kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - CKernel* kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); - - CNOCCO* nocco=new CNOCCO(kernel_p, kernel_q, features_p, features_q); - nocco->set_epsilon(epsilon); - - /* compute p-value via sampling null */ - nocco->set_null_approximation_method(PERMUTATION); - EXPECT_NEAR(nocco->compute_p_value(0.05), 1.0, 1E-15); - - SG_UNREF(nocco); -} - -TEST(NOCCO, sample_null) -{ - const index_t m=10; - const index_t d=7; - const float64_t epsilon=0.1; - - SGMatrix p(d,m); - for (index_t i=0; i q(d,m); - for (index_t i=0; i(p); - CFeatures* features_q=new CDenseFeatures(q); - - float64_t sigma_x=2; - float64_t sigma_y=3; - float64_t sq_sigma_x_twice=sigma_x*sigma_x*2; - float64_t sq_sigma_y_twice=sigma_y*sigma_y*2; - - /* shogun's kernel width is different */ - CKernel* kernel_p=new CGaussianKernel(10, sq_sigma_x_twice); - CKernel* kernel_q=new CGaussianKernel(10, sq_sigma_y_twice); - - CNOCCO* nocco=new CNOCCO(kernel_p, kernel_q, features_p, features_q); - nocco->set_epsilon(epsilon); - - /* do sampling null */ - - /* ensure that sampling null of nocco leads to same results as using - * CKernelIndependenceTest */ - CMath::init_random(1); - float64_t mean1=CStatistics::mean(nocco->sample_null()); - float64_t var1=CStatistics::variance(nocco->sample_null()); - - CMath::init_random(1); - float64_t mean2=CStatistics::mean( - nocco->CKernelIndependenceTest::sample_null()); - float64_t var2=CStatistics::variance(nocco->sample_null()); - - /* assert than results are the same from bot sampling null impl. 
*/ - EXPECT_NEAR(mean1, mean2, 1E-8); - EXPECT_NEAR(var1, var2, 1E-8); - - SG_UNREF(nocco); -} diff --git a/tests/unit/statistics/QuadraticTimeMMD_unittest.cc b/tests/unit/statistics/QuadraticTimeMMD_unittest.cc deleted file mode 100644 index 5bf7d8c5d0b..00000000000 --- a/tests/unit/statistics/QuadraticTimeMMD_unittest.cc +++ /dev/null @@ -1,838 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 3 of the License, or - * (at your option) any later version. - * - * Written (W) 2012-2013 Heiko Strathmann - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -using namespace shogun; -using namespace Eigen; - -TEST(QuadraticTimeMMD,test_quadratic_mmd_biased) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(BIASED); - - /* assert matlab result */ - float64_t statistic=mmd->compute_statistic(); - //SG_SPRINT("statistic=%f\n", statistic); - EXPECT_NEAR(statistic, 0.17882546486779649, 1E-15); - - 
/* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD,test_quadratic_mmd_biased_DEPRECATED) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(BIASED_DEPRECATED); - - /* assert matlab result */ - float64_t statistic=mmd->compute_statistic(); - //SG_SPRINT("statistic=%f\n", statistic); - EXPECT_NEAR(statistic, 0.357650929735592, 10E-15); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD,test_quadratic_mmd_unbiased) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* 
features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(UNBIASED); - - /* assert matlab result */ - float64_t statistic=mmd->compute_statistic(); - //SG_SPRINT("statistic=%f\n", statistic); - EXPECT_NEAR(statistic, 0.13440094336133723, 1E-15); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD,test_quadratic_mmd_unbiased_DEPRECATED) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(UNBIASED_DEPRECATED); - - /* assert matlab result */ - float64_t statistic=mmd->compute_statistic(); - //SG_SPRINT("statistic=%f\n", statistic); - float64_t difference=statistic-0.268801886722675; - EXPECT_LE(CMath::abs(difference), 10E-15); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - 
-TEST(QuadraticTimeMMD,test_quadratic_mmd_incomplete) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(INCOMPLETE); - - /* assert local machine computed result */ - float64_t statistic=mmd->compute_statistic(); - EXPECT_NEAR(statistic, 0.16743977201175841, 1E-15); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD, test_quadratic_mmd_unbiased_different_num_samples) -{ - const index_t m=5; - const index_t n=6; - const index_t d=1; - float64_t data[] = {0.61318059, -0.69222999, 0.94424411, -0.48769626, - -0.00709551, 0.35025598, 0.20741384, -0.63622519, -1.21315264, - -0.77349617, -0.42707091}; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); - - SGMatrix data_q(d, n); - memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); - - CDenseFeatures* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - CGaussianKernel* 
kernel=new CGaussianKernel(10, 2); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(UNBIASED); - - /* assert python result at - * https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py */ - float64_t statistic=mmd->compute_statistic(); - EXPECT_NEAR(statistic, -0.037500338130199401, 1E-9); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD, test_quadratic_mmd_unbiased_different_num_samples_DEPRECATED) -{ - const index_t m=5; - const index_t n=6; - const index_t d=1; - float64_t data[] = {0.61318059, -0.69222999, 0.94424411, -0.48769626, - -0.00709551, 0.35025598, 0.20741384, -0.63622519, -1.21315264, - -0.77349617, -0.42707091}; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); - - SGMatrix data_q(d, n); - memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); - - CDenseFeatures* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - CGaussianKernel* kernel=new CGaussianKernel(10, 2); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(UNBIASED_DEPRECATED); - - /* assert python result at - * https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py */ - float64_t statistic=mmd->compute_statistic(); - EXPECT_NEAR(statistic, -0.151251364436, 1E-9); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD, test_quadratic_mmd_biased_different_num_samples) -{ - const index_t m=5; - const index_t n=6; - const index_t d=1; - float64_t data[] = {-0.47616889, -2.1767364, -0.04185537, -1.20787529, - 
1.94875193, -0.16695709, 2.51282666, -0.58116389, 1.52366887, - 0.18985099, 0.76120258}; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); - - SGMatrix data_q(d, n); - memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); - - CDenseFeatures* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - CGaussianKernel* kernel=new CGaussianKernel(10, 2); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(BIASED); - - /* assert python result at - * https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py */ - float64_t statistic=mmd->compute_statistic(); - EXPECT_NEAR(statistic, 0.54418915736201567, 1E-8); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD, test_quadratic_mmd_biased_different_num_samples_DEPRECATED) -{ - const index_t m=5; - const index_t n=6; - const index_t d=1; - float64_t data[] = {-0.47616889, -2.1767364, -0.04185537, -1.20787529, - 1.94875193, -0.16695709, 2.51282666, -0.58116389, 1.52366887, - 0.18985099, 0.76120258}; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data[0]), sizeof(float64_t)*m); - - SGMatrix data_q(d, n); - memcpy(&(data_q.matrix[0]), &(data[m]), sizeof(float64_t)*n); - - CDenseFeatures* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - CGaussianKernel* kernel=new CGaussianKernel(10, 2); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - mmd->set_statistic_type(BIASED_DEPRECATED); - - 
/* assert python result at - * https://github.com/lambday/shogun-hypothesis-testing/blob/master/mmd.py */ - float64_t statistic=mmd->compute_statistic(); - EXPECT_NEAR(statistic, 2.1948962593, 1E-8); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD,compute_variance_null) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - - /* assert local machine computed result */ - mmd->set_statistic_type(UNBIASED); - float64_t var=mmd->compute_variance_under_null(); - EXPECT_NEAR(var, 0.0064888052500351456, 1E-10); - - mmd->set_statistic_type(BIASED); - var=mmd->compute_variance_under_null(); - EXPECT_NEAR(var, 0.0071464012090942663, 1E-10); - - mmd->set_statistic_type(INCOMPLETE); - var=mmd->compute_variance_under_null(); - EXPECT_NEAR(var, 0.0064888052500342575, 1E-10); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD,compute_variance_alternative) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features 
(appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, features_p, features_q); - - /* assert local machine computed result */ - mmd->set_statistic_type(UNBIASED); - float64_t var=mmd->compute_variance_under_alternative(); - EXPECT_NEAR(var, 0.0065377436264417842, 1E-15); - - mmd->set_statistic_type(BIASED); - var=mmd->compute_variance_under_alternative(); - EXPECT_NEAR(var, 0.0065069769045954847, 1E-15); - - mmd->set_statistic_type(INCOMPLETE); - var=mmd->compute_variance_under_alternative(); - EXPECT_NEAR(var, 0.0080742069013913682, 1E-15); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); -} - -TEST(QuadraticTimeMMD, null_approximation_spectrum_different_num_samples) -{ - const index_t m=20; - const index_t n=30; - const index_t dim=3; - - /* use fixed seed */ - sg_rand->set_seed(12345); - - float64_t difference=0.5; - - /* streaming data generator for mean shift distributions */ - CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim, 0); - CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim, 0); - - /* stream some data from generator */ - CFeatures* feat_p=gen_p->get_streamed_features(m); - CFeatures* feat_q=gen_q->get_streamed_features(n); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t 
sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, feat_p, feat_q); - - index_t num_null_samples=250; - index_t num_eigenvalues=10; - mmd->set_num_samples_spectrum(num_null_samples); - mmd->set_null_approximation_method(MMD2_SPECTRUM); - mmd->set_num_eigenvalues_spectrum(num_eigenvalues); - - /* biased case */ - - /* compute p-value using spectrum approximation for null distribution and - * assert against local machine computed result */ - mmd->set_statistic_type(BIASED); - float64_t p_value_spectrum=mmd->perform_test(); - EXPECT_NEAR(p_value_spectrum, 0.0, 1E-10); - - /* unbiased case */ - - /* compute p-value using spectrum approximation for null distribution and - * assert against local machine computed result */ - mmd->set_statistic_type(UNBIASED); - p_value_spectrum=mmd->perform_test(); - EXPECT_NEAR(p_value_spectrum, 0.004, 1E-10); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(feat_p); - SG_UNREF(feat_q); - SG_UNREF(gen_p); - SG_UNREF(gen_q); -} - -TEST(QuadraticTimeMMD, null_approximation_spectrum_different_num_samples_DEPRECATED) -{ - const index_t m=20; - const index_t n=30; - const index_t dim=3; - - /* use fixed seed */ - sg_rand->set_seed(12345); - - float64_t difference=0.5; - - /* streaming data generator for mean shift distributions */ - CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim, 0); - CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim, 0); - - /* stream some data from generator */ - CFeatures* feat_p=gen_p->get_streamed_features(m); - CFeatures* feat_q=gen_q->get_streamed_features(n); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance, convienience constructor */ - CQuadraticTimeMMD* mmd=new 
CQuadraticTimeMMD(kernel, feat_p, feat_q); - - index_t num_null_samples=250; - index_t num_eigenvalues=10; - mmd->set_num_samples_spectrum(num_null_samples); - mmd->set_null_approximation_method(MMD2_SPECTRUM_DEPRECATED); - mmd->set_num_eigenvalues_spectrum(num_eigenvalues); - - /* biased case */ - - /* compute p-value using spectrum approximation for null distribution and - * assert against local machine computed result */ - mmd->set_statistic_type(BIASED_DEPRECATED); - float64_t p_value_spectrum=mmd->perform_test(); - EXPECT_NEAR(p_value_spectrum, 0.0, 1E-10); - - /* unbiased case */ - - /* compute p-value using spectrum approximation for null distribution and - * assert against local machine computed result */ - mmd->set_statistic_type(UNBIASED_DEPRECATED); - p_value_spectrum=mmd->perform_test(); - EXPECT_NEAR(p_value_spectrum, 0.004, 1E-10); - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(feat_p); - SG_UNREF(feat_q); - SG_UNREF(gen_p); - SG_UNREF(gen_q); -} - -TEST(QuadraticTimeMMD,test_quadratic_mmd_precomputed_kernel) -{ - index_t m=8; - index_t d=3; - SGMatrix data(d,2*m); - for (index_t i=0; i<2*d*m; ++i) - data.matrix[i]=i; - - /* create data matrix for each features (appended is not supported) */ - SGMatrix data_p(d, m); - memcpy(&(data_p.matrix[0]), &(data.matrix[0]), sizeof(float64_t)*d*m); - - SGMatrix data_q(d, m); - memcpy(&(data_q.matrix[0]), &(data.matrix[d*m]), sizeof(float64_t)*d*m); - - /* normalise data */ - float64_t max_p=data_p.max_single(); - float64_t max_q=data_q.max_single(); - - for (index_t i=0; i* features_p=new CDenseFeatures(data_p); - CDenseFeatures* features_q=new CDenseFeatures(data_q); - CFeatures* p_and_q=features_p->create_merged_copy(features_q); - SG_REF(p_and_q); - - /* shoguns kernel width is different */ - float64_t sigma=2; - float64_t sq_sigma_twice=sigma*sigma*2; - CGaussianKernel* kernel=new CGaussianKernel(10, sq_sigma_twice); - - /* create MMD instance */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, 
p_and_q, m); - mmd->set_num_null_samples(10); - - /* use fixed seed */ - sg_rand->set_seed(12345); - SGVector null_samples=mmd->sample_null(); - - float64_t mean=CStatistics::mean(null_samples); - float64_t var=CStatistics::variance(null_samples); - - //SG_SPRINT("mean %f, var %f\n", mean, var); - - /* now again but with a precomputed kernel, same features. - * This avoids re-computing the kernel matrix in every permutation - * iteration and should be num_iterations times faster */ - - /* re-init kernel before kernel matrix is computed: this is due to a design - * error in subsets and should be worked on! */ - kernel->init(p_and_q, p_and_q); - CCustomKernel* precomputed_kernel=new CCustomKernel(kernel); - SG_UNREF(mmd); - mmd=new CQuadraticTimeMMD(precomputed_kernel, p_and_q, m); - mmd->set_num_null_samples(10); - sg_rand->set_seed(12345); - null_samples=mmd->sample_null(); - - /* assert that results do not change */ - //SG_SPRINT("mean %f, var %f\n", CStatistics::mean(null_samples), - // CStatistics::variance(null_samples)); - EXPECT_LE(CMath::abs(mean-CStatistics::mean(null_samples)), 10E-8); - EXPECT_LE(CMath::abs(var-CStatistics::variance(null_samples)), 10E-8); - - SG_UNREF(mmd); - SG_UNREF(features_p); - SG_UNREF(features_q); - SG_UNREF(p_and_q); -} - -TEST(QuadraticTimeMMD,custom_kernel_vs_normal_kernel_DEPRECATED) -{ - /* number of examples kept low in order to make things fast */ - index_t m=20; - index_t dim=2; - float64_t difference=0.5; - - /* streaming data generator for mean shift distributions */ - CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim, 0); - CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim, 0); - - /* stream some data from generator */ - CFeatures* feat_p=gen_p->get_streamed_features(m); - CFeatures* feat_q=gen_q->get_streamed_features(m); - - /* set kernel a-priori. usually one would do some kernel selection. See - * other examples for this. 
*/ - float64_t width=10; - CGaussianKernel* kernel=new CGaussianKernel(10, width); - - /* create quadratic time mmd instance. Note that this constructor - * copies p and q and does not reference them */ - CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, feat_p, feat_q); - - /* set up for a precomputed custom kernel using merged features p_and_q */ - CGaussianKernel* kernel2=new CGaussianKernel(10, width); - CFeatures* p_and_q=mmd->get_p_and_q(); - kernel2->init(p_and_q, p_and_q); - CCustomKernel* precomputed=new CCustomKernel(kernel2); - CQuadraticTimeMMD* mmd2=new CQuadraticTimeMMD(precomputed, m); - SG_UNREF(p_and_q); - SG_UNREF(kernel2); - - /* perform test: compute p-value and test if null-hypothesis is rejected for - * a test level of 0.05 */ - float64_t alpha=0.05; - - mmd->set_null_approximation_method(PERMUTATION); - mmd->set_statistic_type(BIASED_DEPRECATED); - mmd->set_num_null_samples(3); - mmd->set_num_eigenvalues_spectrum(3); - mmd->set_num_samples_spectrum(250); - - mmd2->set_null_approximation_method(PERMUTATION); - mmd2->set_statistic_type(BIASED_DEPRECATED); - mmd2->set_num_null_samples(3); - mmd2->set_num_eigenvalues_spectrum(3); - mmd2->set_num_samples_spectrum(250); - - /* compute tpye I and II error using normal and precomputed kernel */ - index_t num_trials=3; - - SGVector inds(2*m); - inds.range_fill(); - - /* use fixed seed */ - CMath::init_random(1); - for (index_t i=0; iset_seed(1); - - /* first, we compute using normal kernel */ - p_and_q->add_subset(inds); - float64_t type_I_mmds=mmd->compute_statistic(); - mmd->set_null_approximation_method(PERMUTATION); - float64_t type_I_threshs_boot=mmd->compute_threshold(alpha); - mmd->set_null_approximation_method(MMD2_SPECTRUM_DEPRECATED); - float64_t type_I_threshs_spectrum=mmd->compute_threshold(alpha); - mmd->set_null_approximation_method(MMD2_GAMMA); - float64_t type_I_threshs_gamma=mmd->compute_threshold(alpha); - p_and_q->remove_subset(); - - float64_t 
type_II_mmds=mmd->compute_statistic(); - mmd->set_null_approximation_method(PERMUTATION); - float64_t type_II_threshs_boot=mmd->compute_threshold(alpha); - mmd->set_null_approximation_method(MMD2_SPECTRUM_DEPRECATED); - float64_t type_II_threshs_spectrum=mmd->compute_threshold(alpha); - mmd->set_null_approximation_method(MMD2_GAMMA); - float64_t type_II_threshs_gamma=mmd->compute_threshold(alpha); - - /* now compute using precomputed custom kernel */ - - /* setting seed for Gaussian samples used in spectrum approximation method */ - sg_rand->set_seed(1); - - precomputed->add_row_subset(inds); - precomputed->add_col_subset(inds); - float64_t type_I_mmds_pre=mmd2->compute_statistic(); - mmd2->set_null_approximation_method(PERMUTATION); - float64_t type_I_threshs_boot_pre=mmd2->compute_threshold(alpha); - mmd2->set_null_approximation_method(MMD2_SPECTRUM_DEPRECATED); - float64_t type_I_threshs_spectrum_pre=mmd2->compute_threshold(alpha); - mmd2->set_null_approximation_method(MMD2_GAMMA); - float64_t type_I_threshs_gamma_pre=mmd2->compute_threshold(alpha); - precomputed->remove_row_subset(); - precomputed->remove_col_subset(); - - float64_t type_II_mmds_pre=mmd2->compute_statistic(); - mmd2->set_null_approximation_method(PERMUTATION); - float64_t type_II_threshs_boot_pre=mmd2->compute_threshold(alpha); - mmd2->set_null_approximation_method(MMD2_SPECTRUM_DEPRECATED); - float64_t type_II_threshs_spectrum_pre=mmd2->compute_threshold(alpha); - mmd2->set_null_approximation_method(MMD2_GAMMA); - float64_t type_II_threshs_gamma_pre=mmd2->compute_threshold(alpha); - - /* assert results from both */ - EXPECT_NEAR(type_I_mmds, type_I_mmds_pre, 1E-6); - EXPECT_NEAR(type_I_threshs_boot, type_I_threshs_boot_pre, 1E-6); - EXPECT_NEAR(type_I_threshs_spectrum, type_I_threshs_spectrum_pre, 1E-6); - EXPECT_NEAR(type_I_threshs_gamma, type_I_threshs_gamma_pre, 1E-6); - EXPECT_NEAR(type_II_mmds, type_II_mmds_pre, 1E-5); - EXPECT_NEAR(type_II_threshs_boot, type_II_threshs_boot_pre, 1E-6); - 
EXPECT_NEAR(type_II_threshs_spectrum, type_II_threshs_spectrum_pre, 1E-6); - EXPECT_NEAR(type_II_threshs_gamma, type_II_threshs_gamma_pre, 1E-6); - } - - /* clean up */ - SG_UNREF(mmd); - SG_UNREF(mmd2); - SG_UNREF(gen_p); - SG_UNREF(gen_q); - - /* convienience constructor of MMD was used, these were not referenced */ - SG_UNREF(feat_p); - SG_UNREF(feat_q); -}