diff --git a/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst b/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst
new file mode 100644
index 00000000000..2aef3a8a378
--- /dev/null
+++ b/doc/cookbook/source/examples/statistical_testing/linear_time_mmd.rst
@@ -0,0 +1,80 @@
+===============
+Linear Time MMD
+===============
+
+The linear time MMD implements a nonparametric statistical hypothesis test to reject the null hypothesis that two distributions :math:`p` and :math:`q`, each only observed via :math:`n` samples, are the same, i.e. :math:`H_0:p=q`.
+
+The (unbiased) statistic is given by
+
+.. math::
+
+ \frac{2}{n}\sum_{i=1}^{n/2} k(x_{2i-1},x_{2i}) + k(y_{2i-1},y_{2i}) - k(x_{2i-1},y_{2i}) - k(x_{2i},y_{2i-1}),
+
+where :math:`x_i` denote the samples from :math:`p` and :math:`y_i` those from :math:`q`.
+
+See :cite:`gretton2012kernel` for a detailed introduction.
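The estimator can be sketched outside of Shogun in plain NumPy; the Gaussian kernel, its bandwidth, and the 1-D toy data below are illustrative assumptions, not part of the cookbook listing:

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # elementwise Gaussian kernel for 1-D arrays (illustrative choice)
    return np.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def linear_time_mmd2(x, y, sigma=1.0):
    # average the h-statistic h = k(x1,x2) + k(y1,y2) - k(x1,y2) - k(x2,y1)
    # over non-overlapping pairs of samples; O(n) time, O(1) memory if streamed
    n = (min(len(x), len(y)) // 2) * 2
    x1, x2 = x[0:n:2], x[1:n:2]
    y1, y2 = y[0:n:2], y[1:n:2]
    h = (gauss_k(x1, x2, sigma) + gauss_k(y1, y2, sigma)
         - gauss_k(x1, y2, sigma) - gauss_k(x2, y1, sigma))
    return h.mean()

rng = np.random.RandomState(0)
same = linear_time_mmd2(rng.randn(2000), rng.randn(2000))        # p = q: close to 0
diff = linear_time_mmd2(rng.randn(2000), rng.randn(2000) + 1.0)  # p != q: clearly positive
```

For :math:`p=q` the estimate fluctuates around zero, while a mean shift produces a clearly positive value.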
+
+-------
+Example
+-------
+
+Imagine we have samples from :math:`p` and :math:`q`.
+As the linear time MMD is a streaming statistic, we need to pass it :sgclass:`CStreamingFeatures`.
+Here, we use synthetic data generators, but it is possible to construct :sgclass:`CStreamingFeatures` from (large) files.
+We create an instance of :sgclass:`CLinearTimeMMD`, passing it data and the kernel to use,
+
+.. sgexample:: linear_time_mmd.sg:create_instance
+
+An important parameter for controlling the efficiency of the linear time MMD is the block size, i.e. the number of samples that is processed at once. As a guideline, set it as large as memory allows.
+
+.. sgexample:: linear_time_mmd.sg:set_burst
+
+Computing the statistic is done as
+
+.. sgexample:: linear_time_mmd.sg:estimate_mmd
+
+We can perform the hypothesis test by computing a test threshold for a given :math:`\alpha`, or by directly computing a p-value.
+
+.. sgexample:: linear_time_mmd.sg:perform_test_threshold
+
+---------------
+Kernel learning
+---------------
+
+There are various options to learn a kernel.
+All options allow learning a single kernel among a number of provided baseline kernels.
+Furthermore, some of these criteria can be used to learn the coefficients of a convex combination of baseline kernels.
+
+There are different strategies to learn the kernel, see :sgclass:`CKernelSelectionStrategy`.
+
+We specify the desired baseline kernels to consider. Note that the kernel above is not considered in the selection.
+
+.. sgexample:: linear_time_mmd.sg:add_kernels
+
+IMPORTANT: when learning the kernel for statistical testing, this needs to be done on different data than the data used for performing the actual test.
+One way to accomplish this is to manually provide a different set of features for testing.
+In Shogun, it is also possible to automatically split the provided data by specifying the ratio between train and test data, via enabling the train-test mode.
+
+.. sgexample:: linear_time_mmd.sg:enable_train_test_mode
+
+A ratio of 1 means the data is split in half while learning the kernel, and subsequent tests are performed on the second half.
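The train-test logic can be illustrated outside of Shogun with a plain NumPy sketch: among hypothetical baseline Gaussian bandwidths, pick the one maximizing a biased MMD estimate on the first half of the data, then evaluate on the held-out half. Maximizing the raw statistic is only one possible selection criterion; see :sgclass:`CKernelSelectionStrategy` for what Shogun actually offers.

```python
import numpy as np

def mmd2_biased(x, y, sigma):
    # biased quadratic-time MMD^2 with a Gaussian kernel of bandwidth sigma
    k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.RandomState(0)
x = rng.randn(1000)
y = rng.randn(1000) + 0.5

# split in half: the kernel is learned on the first half only
x_train, x_test = x[:500], x[500:]
y_train, y_test = y[:500], y[500:]

sigmas = [2.0 ** e for e in range(-3, 4)]  # baseline kernels (bandwidths)
best = max(sigmas, key=lambda s: mmd2_biased(x_train, y_train, s))

# the test is then performed on the held-out half with the selected kernel
stat = mmd2_biased(x_test, y_test, best)
```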
+
+We learn the kernel and extract the result; again, see :sgclass:`CKernelSelectionStrategy` for more available strategies. Note that the kernel of the MMD itself is replaced.
+If all kernels have the same type, we can convert the result into that type, for example to extract its parameters.
+
+.. sgexample:: linear_time_mmd.sg:select_kernel_single
+
+Note that in order to extract particular kernel parameters, we need to cast the kernel to its actual type.
+
+Similarly, a convex combination of kernels, in the form of :sgclass:`CCombinedKernel`, can be learned and extracted as
+
+.. sgexample:: linear_time_mmd.sg:select_kernel_combined
+
+We can perform the test on the last learnt kernel.
+Since we enabled the train-test mode, this is automatically done on the held-out test data.
+
+.. sgexample:: linear_time_mmd.sg:perform_test
+
+----------
+References
+----------
+.. bibliography:: ../../references.bib
+ :filter: docname in docnames
diff --git a/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst b/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst
new file mode 100644
index 00000000000..1882d19e7ae
--- /dev/null
+++ b/doc/cookbook/source/examples/statistical_testing/quadratic_time_mmd.rst
@@ -0,0 +1,93 @@
+==================
+Quadratic Time MMD
+==================
+
+The quadratic time MMD implements a nonparametric statistical hypothesis test to reject the null hypothesis that two distributions :math:`p` and :math:`q`, only observed via :math:`n` and :math:`m` samples respectively, are the same, i.e. :math:`H_0:p=q`.
+
+The (biased) test statistic is given by
+
+.. math::
+
+ \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i,x_j) + \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m k(y_i,y_j) - \frac{2}{nm}\sum_{i=1}^n\sum_{j=1}^m k(x_i,y_j),
+
+where :math:`x_i` denote the :math:`n` samples from :math:`p` and :math:`y_j` the :math:`m` samples from :math:`q`.
+
+
+See :cite:`gretton2012kernel` for a detailed introduction.
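As a reference, the biased statistic can be sketched in plain NumPy (independent of Shogun; the Gaussian kernel and 1-D toy data are illustrative assumptions):

```python
import numpy as np

def gauss_gram(a, b, sigma=1.0):
    # pairwise Gaussian kernel matrix for 1-D samples (illustrative choice)
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    # biased quadratic-time MMD^2: all three terms are full averages
    return (gauss_gram(x, x, sigma).mean()
            + gauss_gram(y, y, sigma).mean()
            - 2.0 * gauss_gram(x, y, sigma).mean())

rng = np.random.RandomState(0)
x = rng.randn(200)        # n samples from p
y = rng.randn(100) + 1.0  # m samples from q (n and m may differ)
stat = mmd2_biased(x, y)  # clearly positive here since p != q
```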
+
+-------
+Example
+-------
+
+Imagine we have samples from :math:`p` and :math:`q`, here in the form of :sgclass:`CDenseFeatures` (64 bit floats aka RealFeatures).
+
+.. sgexample:: quadratic_time_mmd.sg:create_features
+
+We create an instance of :sgclass:`CQuadraticTimeMMD`, passing it the data and the kernel.
+
+.. sgexample:: quadratic_time_mmd.sg:create_instance
+
+We can select multiple ways to compute the test statistic, see :sgclass:`CQuadraticTimeMMD` for details.
+The biased statistic is computed as
+
+.. sgexample:: quadratic_time_mmd.sg:estimate_mmd
+
+There are multiple ways to perform the actual hypothesis test, see :sgclass:`CQuadraticTimeMMD` for details. The permutation version simulates from :math:`H_0` by repeatedly permuting the samples from :math:`p` and :math:`q`. We can perform the test by computing a test threshold for a given :math:`\alpha`, or by directly computing a p-value.
+
+.. sgexample:: quadratic_time_mmd.sg:perform_test
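The permutation idea can be sketched in plain NumPy (independent of Shogun's optimized implementation; the kernel, bandwidth, and toy data are illustrative assumptions):

```python
import numpy as np

def mmd2_biased(x, y, sigma=1.0):
    # biased quadratic-time MMD^2 with a Gaussian kernel (illustrative)
    k = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def permutation_pvalue(x, y, statistic, num_null_samples=200, seed=1):
    # simulate from H0: p = q by repeatedly permuting the pooled samples and
    # recomputing the statistic; the p-value is the fraction of permuted
    # statistics at least as large as the observed one
    rng = np.random.RandomState(seed)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y])
    null_samples = []
    for _ in range(num_null_samples):
        rng.shuffle(pooled)
        null_samples.append(statistic(pooled[:len(x)], pooled[len(x):]))
    return np.mean(np.array(null_samples) >= observed)

rng = np.random.RandomState(0)
p_diff = permutation_pvalue(rng.randn(100), rng.randn(100) + 1.0, mmd2_biased)
p_same = permutation_pvalue(rng.randn(100), rng.randn(100), mmd2_biased)
```

For clearly different distributions the p-value is small; for identical distributions it is roughly uniform on :math:`[0,1]`.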
+
+----------------
+Multiple kernels
+----------------
+
+It is possible to perform all operations (computing statistics, performing tests, etc.) for multiple kernels at once, via the :sgclass:`CMultiKernelQuadraticTimeMMD` interface.
+
+.. sgexample:: quadratic_time_mmd.sg:multi_kernel
+
+Note that the results are now a vector with one entry per kernel.
+Also note that the kernels for single and multiple are kept separately.
+
+---------------
+Kernel learning
+---------------
+
+There are various options to learn a kernel.
+All options allow learning a single kernel among a number of provided baseline kernels.
+Furthermore, some of these criteria can be used to learn the coefficients of a convex combination of baseline kernels.
+
+There are different strategies to learn the kernel, see :sgclass:`CKernelSelectionStrategy`.
+
+We specify the desired baseline kernels to consider. Note that the kernel above is not considered in the selection.
+
+.. sgexample:: quadratic_time_mmd.sg:add_kernels
+
+IMPORTANT: when learning the kernel for statistical testing, this needs to be done on different data than the data used for performing the actual test.
+One way to accomplish this is to manually provide a different set of features for testing.
+In Shogun, it is also possible to automatically split the provided data by specifying the ratio between train and test data, via enabling the train-test mode.
+
+.. sgexample:: quadratic_time_mmd.sg:enable_train_test_mode
+
+A ratio of 1 means the data is split in half while learning the kernel, and subsequent tests are performed on the second half.
+
+We learn the kernel and extract the result; again, see :sgclass:`CKernelSelectionStrategy` for more available strategies.
+Note that the kernel of the MMD itself is replaced.
+If all kernels have the same type, we can convert the result into that type, for example to extract its parameters.
+
+.. sgexample:: quadratic_time_mmd.sg:select_kernel_single
+
+Note that in order to extract particular kernel parameters, we need to cast the kernel to its actual type.
+
+Similarly, a convex combination of kernels, in the form of :sgclass:`CCombinedKernel`, can be learned and extracted as
+
+.. sgexample:: quadratic_time_mmd.sg:select_kernel_combined
+
+We can perform the test on the last learnt kernel.
+Since we enabled the train-test mode, this is automatically done on the held-out test data.
+
+.. sgexample:: quadratic_time_mmd.sg:perform_test_optimized
+
+----------
+References
+----------
+.. bibliography:: ../../references.bib
+ :filter: docname in docnames
+
+:wiki:`Statistical_hypothesis_testing`
diff --git a/doc/cookbook/source/index.rst b/doc/cookbook/source/index.rst
index 30616938fcb..1b979cb0399 100644
--- a/doc/cookbook/source/index.rst
+++ b/doc/cookbook/source/index.rst
@@ -47,6 +47,15 @@ Regression
examples/regression/**
+Statistical Testing
+-------------------
+
+.. toctree::
+ :maxdepth: 1
+ :glob:
+
+ examples/statistical_testing/**
+
Kernels
-------
diff --git a/doc/cookbook/source/references.bib b/doc/cookbook/source/references.bib
index 5cba98e83a3..b32bec37852 100644
--- a/doc/cookbook/source/references.bib
+++ b/doc/cookbook/source/references.bib
@@ -25,7 +25,7 @@ @book{cristianini2000introduction
publisher={Cambridge University Press}
}
@article{fan2008liblinear,
- title={LIBLINEAR: A Library for Large Linear Classification},
+ title={{LIBLINEAR: A Library for Large Linear Classification}},
author={R.E. Fan and K.W. Chang and C.J. Hsieh and X.R. Wang and C.J. Lin},
journal={Journal of Machine Learning Research},
volume={9},
@@ -36,7 +36,18 @@ @book{Rasmussen2005GPM
author = {Rasmussen, C. E. and Williams, C. K. I.},
title = {Gaussian Processes for Machine Learning},
year = {2005},
- publisher = {The MIT Press}
+ publisher = {The MIT Press},
+}
+
+@article{gretton2012kernel,
+ title={A kernel two-sample test},
+ author={Gretton, A. and Borgwardt, K.M. and Rasch, M.J. and Sch{\"o}lkopf, B. and Smola, A.},
+ journal={The Journal of Machine Learning Research},
+ volume={13},
+ number={1},
+ pages={723--773},
+ year={2012},
}
@article{ueda2000smem,
title={SMEM Algorithm for Mixture Models},
@@ -102,6 +113,13 @@ @inproceedings{shalev2011shareboost
pages={1179--1187},
year={2011}
}
+
+@inproceedings{gretton2012optimal,
+ author={Gretton, A. and Sriperumbudur, B. and Sejdinovic, D. and Strathmann, H. and Balakrishnan, S. and Pontil, M. and Fukumizu, K.},
+ booktitle={Advances in Neural Information Processing Systems},
+ title={{Optimal kernel choice for large-scale two-sample tests}},
+ year={2012}
+}
@article{sonnenburg2006large,
title={Large scale multiple kernel learning},
author={S. Sonnenburg and G. R{\"a}tsch and C. Sch{\"a}fer and B. Sch{\"o}lkopf},
diff --git a/doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb b/doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb
similarity index 62%
rename from doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb
rename to doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb
index 9169538737c..4b8acceaca5 100644
--- a/doc/ipython-notebooks/statistics/mmd_two_sample_testing.ipynb
+++ b/doc/ipython-notebooks/statistical_testing/mmd_two_sample_testing.ipynb
@@ -11,22 +11,23 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "#### By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de"
+ "#### Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de\n",
+ "#### Soumyajit De - soumyajitde.cse@gmail.com - github.com/lambday"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "This notebook describes Shogun's framework for statistical hypothesis testing. We begin by giving a brief outline of the problem setting and then describe various implemented algorithms. All the algorithms discussed here are for Kernel two-sample testing with Maximum Mean Discrepancy and are based on embedding probability distributions into Reproducing Kernel Hilbert Spaces( RKHS )."
+ "This notebook describes Shogun's framework for statistical hypothesis testing. We begin by giving a brief outline of the problem setting and then describe various implemented algorithms.\n",
+ "All algorithms discussed here are instances of kernel two-sample testing with the *maximum mean discrepancy*, and are based on embedding probability distributions into Reproducing Kernel Hilbert Spaces (RKHS)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Methods for two-sample testing currently consist of tests based on the *Maximum Mean Discrepancy*. There are two types of tests available, a quadratic time test and a linear time test. Both come in various flavours.\n",
- "Independence testing is currently based in the *Hilbert Schmidt Independence Criterion*."
+ "There are two types of tests available, a quadratic time test and a linear time test. Both come in various flavours."
]
},
{
@@ -39,8 +40,8 @@
"source": [
"%pylab inline\n",
"%matplotlib inline\n",
- "# import all Shogun classes\n",
- "from modshogun import *"
+ "import modshogun as sg\n",
+ "import numpy as np"
]
},
{
@@ -54,7 +55,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called *null-hypothesis* $H_0$, which is the complement of an *alternative-hypothesis* $H_A$. \n",
+ "To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called *null-hypothesis* $H_0:p=q$, which is the complement of an *alternative-hypothesis* $H_A$. \n",
"\n",
"To distinguish the hypotheses, a test statistic is computed on sample data. Since sample data is finite, this corresponds to sampling the true distribution of the test statistic. There are two different distributions of the test statistic -- one for each hypothesis. The *null-distribution* corresponds to test statistic samples under the model that $H_0$ holds; the *alternative-distribution* corresponds to test statistic samples under the model that $H_A$ holds.\n",
"\n",
@@ -65,11 +66,11 @@
" * A *type I error* is made when $H_0: p=q$ is wrongly rejected. That is, the test says that the samples are from different distributions when they are not.\n",
" * A *type II error* is made when $H_A: p\\neq q$ is wrongly accepted. That is, the test says that the samples are from the same distribution when they are not.\n",
"\n",
- "A so-called *consistent* test achieves zero type II error for a fixed type I error.\n",
+ "A so-called *consistent* test achieves zero type II error for a fixed type I error, as it sees more data.\n",
"\n",
"To decide whether to reject $H_0$, one could set a threshold, say at the $95\\%$ quantile of the null-distribution, and reject $H_0$ when the test statistic lies below that threshold. This means that the chance that the samples were generated under $H_0$ are $5\\%$. We call this number the *test power* $\\alpha$ (in this case $\\alpha=0.05$). It is an upper bound on the probability for a type I error. An alternative way is simply to compute the quantile of the test statistic in the null-distribution, the so-called *p-value*, and to compare the p-value against a desired test power, say $\\alpha=0.05$, by hand. The advantage of the second method is that one not only gets a binary answer, but also an upper bound on the type I error.\n",
"\n",
- "In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this for any two-sample test is called *bootstrapping*, or the *permutation* test, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for every of those configurations. While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. For many test statistics, there are more sophisticated methods of approximating the null distribution."
+ "In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this is called the *permutation test*, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for each of those configurations. While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. Shogun comes with an extremely optimized implementation though. For completeness, Shogun also includes a number of more sophisticated ways of approximating the null distribution."
]
},
{
@@ -83,15 +84,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. This class offers methods to\n",
+ "Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. We here focus on CTwoSampleTest. This class offers methods to\n",
"\n",
" * compute the implemented test statistic,\n",
" * compute p-values for a given value of the test statistic,\n",
" * compute a test threshold for a given p-value,\n",
- " * sampling the null distribution, i.e. perform the permutation test or bootstrappig of the null-distribution, and\n",
- " * performing a full two-sample test, and either returning a p-value or a binary rejection decision. This method is most useful in practice. Note that, depending on the used test statistic, it might be faster to call this than to compute threshold and test statistic seperately with the above methods.\n",
- " \n",
- "There are special subclasses for testing two distributions against each other (CTwoSampleTest, CIndependenceTest), kernel two-sample testing (CKernelTwoSampleTest), and kernel independence testing (CKernelIndependenceTest), which however mostly differ in internals and constructors."
+ " * approximate the null distribution, e.g. perform the permutation test, and\n",
+ " * perform a full two-sample test, and either return a p-value or a binary rejection decision. This method is most useful in practice. Note that, depending on the used test statistic, it might be faster to call this than to compute threshold and test statistic separately with the above methods."
]
},
{
@@ -123,7 +122,7 @@
" +\\textbf{E}_{y,y'}\\left[ k(y,y')\\right]\n",
"\\end{align*}\n",
"\n",
- "Note that this formulation does not assume any form of the input data, we just need a kernel function whose feature space is a RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.\n",
+ "Note that this formulation does not assume any form of the input data, we just need a kernel function whose feature space is a RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.\n",
"\n",
"We here only describe how to use the MMD for two-sample testing. Shogun offers two types of test statistic based on the MMD, one with quadratic costs both in time and space, and one with linear time and constant space costs. Both come in different versions and with different methods how to approximate the null-distribution in order to construct a two-sample test."
]
@@ -159,11 +158,11 @@
"outputs": [],
"source": [
"# use scipy for generating samples\n",
- "from scipy.stats import norm, laplace\n",
+ "from scipy.stats import laplace, norm\n",
"\n",
- "def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=sqrt(0.5)): \n",
+ "def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)): \n",
" # sample from both distributions\n",
- " X=norm.rvs(size=n, loc=mu, scale=sigma2)\n",
+ " X=norm.rvs(size=n)*np.sqrt(sigma2)+mu\n",
" Y=laplace.rvs(size=n, loc=mu, scale=b)\n",
" \n",
" return X,Y"
@@ -179,31 +178,30 @@
"source": [
"mu=0.0\n",
"sigma2=1\n",
- "b=sqrt(0.5)\n",
+ "b=np.sqrt(0.5)\n",
"n=220\n",
"X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)\n",
"\n",
"# plot both densities and histograms\n",
- "figure(figsize=(18,5))\n",
- "suptitle(\"Gaussian vs. Laplace\")\n",
- "subplot(121)\n",
- "Xs=linspace(-2, 2, 500)\n",
- "plot(Xs, norm.pdf(Xs, loc=mu, scale=sigma2))\n",
- "plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))\n",
- "title(\"Densities\")\n",
- "xlabel(\"$x$\")\n",
- "ylabel(\"$p(x)$\")\n",
- "_=legend([ 'Gaussian','Laplace'])\n",
- "\n",
- "subplot(122)\n",
- "hist(X, alpha=0.5)\n",
- "xlim([-5,5])\n",
- "ylim([0,100])\n",
- "hist(Y,alpha=0.5)\n",
- "xlim([-5,5])\n",
- "ylim([0,100])\n",
- "legend([\"Gaussian\", \"Laplace\"])\n",
- "_=title('Histograms')"
+ "plt.figure(figsize=(18,5))\n",
+ "plt.suptitle(\"Gaussian vs. Laplace\")\n",
+ "plt.subplot(121)\n",
+ "Xs=np.linspace(-2, 2, 500)\n",
+ "plt.plot(Xs, norm.pdf(Xs, loc=mu, scale=sigma2))\n",
+ "plt.plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))\n",
+ "plt.title(\"Densities\")\n",
+ "plt.xlabel(\"$x$\")\n",
+ "plt.ylabel(\"$p(x)$\")\n",
+ "\n",
+ "plt.subplot(122)\n",
+ "plt.hist(X, alpha=0.5)\n",
+ "plt.xlim([-5,5])\n",
+ "plt.ylim([0,100])\n",
+ "plt.hist(Y,alpha=0.5)\n",
+ "plt.xlim([-5,5])\n",
+ "plt.ylim([0,100])\n",
+ "plt.legend([\"Gaussian\", \"Laplace\"])\n",
+ "plt.title('Samples');"
]
},
{
@@ -222,8 +220,8 @@
"outputs": [],
"source": [
"print \"Gaussian vs. Laplace\"\n",
- "print \"Sample means: %.2f vs %.2f\" % (mean(X), mean(Y))\n",
- "print \"Samples variances: %.2f vs %.2f\" % (var(X), var(Y))"
+ "print \"Sample means: %.2f vs %.2f\" % (np.mean(X), np.mean(Y))\n",
+ "print \"Samples variances: %.2f vs %.2f\" % (np.var(X), np.var(Y))"
]
},
{
@@ -237,7 +235,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We now describe the quadratic time MMD, as described in [1, Lemma 6], which is implemented in Shogun. All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun, and use it on the above toy problem.\n",
+ "We now describe the quadratic time MMD, as described in [1, Lemma 6], which is implemented in Shogun. All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun, and use it on the above toy problem.\n",
"\n",
"An unbiased estimate for the MMD expression above can be obtained by estimating expected values with averaging over independent samples\n",
"\n",
@@ -251,7 +249,7 @@
"\\mmd_b[\\mathcal{F},X,Y]^2=\\frac{1}{m^2}\\sum_{i=1}^m\\sum_{j=1}^mk(x_i,x_j) + \\frac{1}{n^ 2}\\sum_{i=1}^n\\sum_{j=1}^nk(y_i,y_j)-\\frac{2}{mn}\\sum_{i=1}^m\\sum_{j\\neq i}^nk(x_i,y_j)\n",
".$$\n",
"\n",
- "Computing the test statistic using CQuadraticTimeMMD does exactly this, where it is possible to choose between the two above expressions. Note that some methods for approximating the null-distribution only work with one of both types. Both statistics' computational costs are quadratic both in time and space. Note that the method returns $m\\mmd_b[\\mathcal{F},X,Y]^2$ since null distribution approximations work on $m$ times null distribution. Here is how the test statistic itself is computed."
+ "Computing the test statistic using CQuadraticTimeMMD does exactly this, where it is possible to choose between the two above expressions. Note that some methods for approximating the null-distribution only work with one of both types. Both statistics' computational costs are quadratic both in time and space. Note that the method returns $m\\mmd_b[\\mathcal{F},X,Y]^2$ since null distribution approximations work on $m$ times null distribution. Here is how the test statistic itself is computed."
]
},
{
@@ -263,22 +261,25 @@
"outputs": [],
"source": [
"# turn data into Shogun representation (columns vectors)\n",
- "feat_p=RealFeatures(X.reshape(1,len(X)))\n",
- "feat_q=RealFeatures(Y.reshape(1,len(Y)))\n",
+ "feat_p=sg.RealFeatures(X.reshape(1,len(X)))\n",
+ "feat_q=sg.RealFeatures(Y.reshape(1,len(Y)))\n",
"\n",
"# choose kernel for testing. Here: Gaussian\n",
"kernel_width=1\n",
- "kernel=GaussianKernel(10, kernel_width)\n",
+ "kernel=sg.GaussianKernel(10, kernel_width)\n",
"\n",
"# create mmd instance of test-statistic\n",
- "mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n",
+ "mmd=sg.QuadraticTimeMMD()\n",
+ "mmd.set_kernel(kernel)\n",
+ "mmd.set_p(feat_p)\n",
+ "mmd.set_q(feat_q)\n",
"\n",
"# compute biased and unbiased test statistic (default is unbiased)\n",
- "mmd.set_statistic_type(BIASED)\n",
+ "mmd.set_statistic_type(sg.ST_BIASED_FULL)\n",
"biased_statistic=mmd.compute_statistic()\n",
"\n",
- "mmd.set_statistic_type(UNBIASED)\n",
- "unbiased_statistic=mmd.compute_statistic()\n",
+ "mmd.set_statistic_type(sg.ST_UNBIASED_FULL)\n",
+ "statistic=unbiased_statistic=mmd.compute_statistic()\n",
"\n",
"print \"%d x MMD_b[X,Y]^2=%.2f\" % (len(X), biased_statistic)\n",
"print \"%d x MMD_u[X,Y]^2=%.2f\" % (len(X), unbiased_statistic)"
@@ -288,7 +289,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Any sub-class of CHypothesisTest can compute approximate the null distribution using permutation/bootstrapping. This way always is guaranteed to produce consistent results, however, it might take a long time as for each sample of the null distribution, the test statistic has to be computed for a different permutation of the data. Note that each of the below calls samples from the null distribution. It is wise to choose one method in practice. Also note that we set the number of samples from the null distribution to a low value to reduce runtime. Choose larger in practice, it is in fact good to plot the samples."
+ "Any sub-class of CHypothesisTest can approximate the null distribution using permutation/bootstrapping. This approach is always guaranteed to produce consistent results; however, it might take a long time, as the test statistic has to be computed for a different permutation of the data for each sample of the null distribution. Shogun's implementation is highly optimized, exploiting low-level CPU caching and multiple available cores."
]
},
{
@@ -299,18 +300,14 @@
},
"outputs": [],
"source": [
- "# this is not necessary as bootstrapping is the default\n",
- "mmd.set_null_approximation_method(PERMUTATION)\n",
- "mmd.set_statistic_type(UNBIASED)\n",
- "\n",
- "# to reduce runtime, should be larger practice\n",
- "mmd.set_num_null_samples(100)\n",
+ "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n",
+ "mmd.set_num_null_samples(200)\n",
"\n",
"# now show a couple of ways to compute the test\n",
"\n",
"# compute p-value for computed test statistic\n",
- "p_value=mmd.compute_p_value(unbiased_statistic)\n",
- "print \"P-value of MMD value %.2f is %.2f\" % (unbiased_statistic, p_value)\n",
+ "p_value=mmd.compute_p_value(statistic)\n",
+ "print \"P-value of MMD value %.2f is %.2f\" % (statistic, p_value)\n",
"\n",
"# compute threshold for rejecting H_0 for a given test power\n",
"alpha=0.05\n",
@@ -318,7 +315,7 @@
"print \"Threshold for rejecting H0 with a test power of %.2f is %.2f\" % (alpha, threshold)\n",
"\n",
"# performing the test by hand given the above results, note that those two are equivalent\n",
- "if unbiased_statistic>threshold:\n",
+ "if statistic>threshold:\n",
" print \"H0 is rejected with confidence %.2f\" % alpha\n",
" \n",
 "if p_value<alpha:\n",
-     "We can use the CCustomKernel class, which allows to precompute a kernel matrix (multithreaded) of a given kernel and store it in memory. Instances of this class can then be used as if they were standard kernels."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "# precompute kernel to be faster for null sampling\n",
- "p_and_q=mmd.get_p_and_q()\n",
- "kernel.init(p_and_q, p_and_q);\n",
- "precomputed_kernel=CustomKernel(kernel);\n",
- "mmd.set_kernel(precomputed_kernel);\n",
- "\n",
- "# increase number of iterations since should be faster now\n",
- "mmd.set_num_null_samples(500);\n",
- "p_value_boot=mmd.perform_test();\n",
- "print \"P-value of MMD test is %.2f\" % p_value_boot"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now let us visualise distribution of MMD statistic under $H_0:p=q$ and $H_A:p\\neq q$. Sample both null and alternative distribution for that. Use the interface of CTwoSampleTest to sample from the null distribution (permutations, re-computing of test statistic is done internally). For the alternative distribution, compute the test statistic for a new sample set of $X$ and $Y$ in a loop. Note that the latter is expensive, as the kernel cannot be precomputed, and infinite data is needed. Though it is not needed in practice but only for illustrational purposes here."
+ "Now let us visualise the distribution of the MMD statistic under $H_0:p=q$ and $H_A:p\\neq q$. Sample both the null and the alternative distribution for that. Use the interface of CHypothesisTest to sample from the null distribution (permutations and re-computing of the test statistic are done internally). For the alternative distribution, compute the test statistic for a new sample set of $X$ and $Y$ in a loop. Note that the latter is expensive, as the kernel cannot be precomputed and infinite data is needed. It is not needed in practice, and is done here only for illustration purposes."
]
},
{
@@ -388,18 +346,21 @@
"num_samples=500\n",
"\n",
"# sample null distribution\n",
- "mmd.set_num_null_samples(num_samples)\n",
"null_samples=mmd.sample_null()\n",
"\n",
"# sample alternative distribution, generate new data for that\n",
- "alt_samples=zeros(num_samples)\n",
+ "alt_samples=np.zeros(num_samples)\n",
"for i in range(num_samples):\n",
" X=norm.rvs(size=n, loc=mu, scale=sigma2)\n",
" Y=laplace.rvs(size=n, loc=mu, scale=b)\n",
- " feat_p=RealFeatures(reshape(X, (1,len(X))))\n",
- " feat_q=RealFeatures(reshape(Y, (1,len(Y))))\n",
- " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n",
- " alt_samples[i]=mmd.compute_statistic()"
+ " feat_p=sg.RealFeatures(np.reshape(X, (1,len(X))))\n",
+ " feat_q=sg.RealFeatures(np.reshape(Y, (1,len(Y))))\n",
+ " # TODO: reset pre-computed kernel here\n",
+ " mmd.set_p(feat_p)\n",
+ " mmd.set_q(feat_q)\n",
+ " alt_samples[i]=mmd.compute_statistic()\n",
+ "\n",
+ "np.std(alt_samples)"
]
},
{
@@ -428,26 +389,26 @@
"outputs": [],
"source": [
"def plot_alt_vs_null(alt_samples, null_samples, alpha):\n",
- " figure(figsize=(18,5))\n",
+ " plt.figure(figsize=(18,5))\n",
" \n",
- " subplot(131)\n",
- " hist(null_samples, 50, color='blue')\n",
- " title('Null distribution')\n",
- " subplot(132)\n",
- " title('Alternative distribution')\n",
- " hist(alt_samples, 50, color='green')\n",
+ " plt.subplot(131)\n",
+ " plt.hist(null_samples, 50, color='blue')\n",
+ " plt.title('Null distribution')\n",
+ " plt.subplot(132)\n",
+ " plt.title('Alternative distribution')\n",
+ " plt.hist(alt_samples, 50, color='green')\n",
" \n",
- " subplot(133)\n",
- " hist(null_samples, 50, color='blue')\n",
- " hist(alt_samples, 50, color='green', alpha=0.5)\n",
- " title('Null and alternative distriution')\n",
+ " plt.subplot(133)\n",
+ " plt.hist(null_samples, 50, color='blue')\n",
+ " plt.hist(alt_samples, 50, color='green', alpha=0.5)\n",
+ "    plt.title('Null and alternative distribution')\n",
" \n",
" # find (1-alpha) element of null distribution\n",
- " null_samples_sorted=sort(null_samples)\n",
- " quantile_idx=int(num_samples*(1-alpha))\n",
+ " null_samples_sorted=np.sort(null_samples)\n",
+ " quantile_idx=int(len(null_samples)*(1-alpha))\n",
" quantile=null_samples_sorted[quantile_idx]\n",
- " axvline(x=quantile, ymin=0, ymax=100, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')\n",
- " _=legend()"
+ " plt.axvline(x=quantile, ymin=0, ymax=100, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')\n",
+ "    plt.legend();"
]
},
{
@@ -472,7 +433,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As already mentioned, bootstrapping the null distribution is expensive business. There exist a couple of methods that are more sophisticated and either allow very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], which are implemented in Shogun.\n",
+ "As already mentioned, permuting the data to access the null distribution is usually the method of choice, due to the efficient implementation in Shogun. There also exist more sophisticated methods that either give very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], all of which are implemented in Shogun.\n",
"\n",
"The first one is a spectral method that is based around the Eigenspectrum of the kernel matrix of the joint samples. It is faster than bootstrapping while being a consistent test. Effectively, the null-distribution of the biased statistic is sampled, but in a more efficient way than the bootstrapping approach. It converges as\n",
"\n",
@@ -482,12 +443,12 @@
"\n",
"where $z_l\\sim \\mathcal{N}(0,2)$ are i.i.d. normal samples and $\\lambda_l$ are Eigenvalues of expression 2 in [2], which can be empirically estimated by $\\hat\\lambda_l=\\frac{1}{m}\\nu_l$ where $\\nu_l$ are the Eigenvalues of the centred kernel matrix of the joint samples $X$ and $Y$. The distribution above can be easily sampled. Shogun's implementation has two parameters:\n",
"\n",
- " * Number of samples from null-distribution. The more, the more accurate. As a rule of thumb, use 250.\n",
+ " * Number of samples from null-distribution. The more, the more accurate.\n",
" * Number of Eigenvalues of the Eigen-decomposition of the kernel matrix to use. The more, the better the results get. However, the Eigen-spectrum of the joint gram matrix usually decreases very fast. Plotting the Spectrum can help. See [2] for details.\n",
"\n",
- "If the kernel matrices are diagonal dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are much lower than bootstrapping, which is the only consistent alternative. Since Eigenvalues of the gram matrix has to be computed, costs are in $\\mathcal{O}(m^3)$.\n",
+ "If the kernel matrices are diagonally dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are likely to be larger than permutation testing, due to the efficient implementation of the latter: Eigenvalues of the gram matrix cost $\\mathcal{O}(m^3)$.\n",
"\n",
- "Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic."
+ "Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic."
]
},
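The sampling scheme above is easy to sketch in plain numpy: eigendecompose the centred kernel matrix of the joint samples, rescale the eigenvalues by $1/m$, and draw weighted sums of squared normals with variance 2. This is an illustrative sketch of the idea only, not Shogun's implementation; all names are made up.

```python
import numpy as np

def sample_spectrum_null(K, m, num_null_samples=250, num_eigen=10):
    # centre the kernel matrix of the joint samples X and Y
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    K_centred = H.dot(K).dot(H)

    # empirical estimates lambda_l = nu_l / m, largest eigenvalues first
    nu = np.linalg.eigvalsh(K_centred)[::-1]
    lambdas = nu[:num_eigen] / m

    # null samples: sum_l lambda_l * z_l^2 with z_l ~ N(0, 2)
    z = np.sqrt(2) * np.random.randn(num_null_samples, num_eigen)
    return (z ** 2).dot(lambdas)
```

A p-value is then simply the fraction of null samples that exceed the observed biased statistic.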
{
@@ -499,23 +460,24 @@
"outputs": [],
"source": [
"# optional: plot spectrum of joint kernel matrix\n",
- "from numpy.linalg import eig\n",
+ "\n",
+ "# TODO: it would be good if there was a way to extract the joint kernel matrix for all kernel tests\n",
"\n",
"# get joint feature object and compute kernel matrix and its spectrum\n",
"feats_p_q=mmd.get_p_and_q()\n",
"mmd.get_kernel().init(feats_p_q, feats_p_q)\n",
"K=mmd.get_kernel().get_kernel_matrix()\n",
- "w,_=eig(K)\n",
+ "w,_=np.linalg.eig(K)\n",
"\n",
"# visualise K and its spectrum (only up to threshold)\n",
- "figure(figsize=(18,5))\n",
- "subplot(121)\n",
- "imshow(K, interpolation=\"nearest\")\n",
- "title(\"Kernel matrix K of joint data $X$ and $Y$\")\n",
- "subplot(122)\n",
+ "plt.figure(figsize=(18,5))\n",
+ "plt.subplot(121)\n",
+ "plt.imshow(K, interpolation=\"nearest\")\n",
+ "plt.title(\"Kernel matrix K of joint data $X$ and $Y$\")\n",
+ "plt.subplot(122)\n",
"thresh=0.1\n",
- "plot(w[:len(w[w>thresh])])\n",
- "_=title(\"Eigenspectrum of K until component %d\" % len(w[w>thresh]))"
+ "plt.plot(w[:len(w[w>thresh])])\n",
+ "plt.title(\"Eigenspectrum of K until component %d\" % len(w[w>thresh]));"
]
},
{
@@ -540,22 +502,23 @@
"num_eigen=len(w[w>thresh])\n",
"\n",
"# finally, do the test, use biased statistic\n",
- "mmd.set_statistic_type(BIASED)\n",
+ "mmd.set_statistic_type(sg.ST_BIASED_FULL)\n",
"\n",
"#tell Shogun to use spectrum approximation\n",
- "mmd.set_null_approximation_method(MMD2_SPECTRUM)\n",
- "mmd.set_num_eigenvalues_spectrum(num_eigen)\n",
- "mmd.set_num_samples_spectrum(num_samples)\n",
+ "mmd.set_null_approximation_method(sg.NAM_MMD2_SPECTRUM)\n",
+ "mmd.spectrum_set_num_eigenvalues(num_eigen)\n",
+ "mmd.set_num_null_samples(num_samples)\n",
"\n",
"# the usual test interface\n",
- "p_value_spectrum=mmd.perform_test()\n",
+ "statistic=mmd.compute_statistic()\n",
+ "p_value_spectrum=mmd.compute_p_value(statistic)\n",
"print \"Spectrum: P-value of MMD test is %.2f\" % p_value_spectrum\n",
"\n",
- "# compare with ground truth bootstrapping\n",
- "mmd.set_null_approximation_method(PERMUTATION)\n",
+ "# compare with ground truth from permutation test\n",
+ "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n",
"mmd.set_num_null_samples(num_samples)\n",
- "p_value_boot=mmd.perform_test()\n",
- "print \"Bootstrapping: P-value of MMD test is %.2f\" % p_value_spectrum"
+ "p_value_permutation=mmd.compute_p_value(statistic)\n",
+ "print \"Permutation: P-value of MMD test is %.2f\" % p_value_permutation"
]
},
{
@@ -595,15 +558,16 @@
"outputs": [],
"source": [
"# tell Shogun to use gamma approximation\n",
- "mmd.set_null_approximation_method(MMD2_GAMMA)\n",
+ "mmd.set_null_approximation_method(sg.NAM_MMD2_GAMMA)\n",
"\n",
"# the usual test interface\n",
- "p_value_gamma=mmd.perform_test()\n",
+ "statistic=mmd.compute_statistic()\n",
+ "p_value_gamma=mmd.compute_p_value(statistic)\n",
"print \"Gamma: P-value of MMD test is %.2f\" % p_value_gamma\n",
"\n",
"# compare with ground truth bootstrapping\n",
- "mmd.set_null_approximation_method(PERMUTATION)\n",
- "p_value_boot=mmd.perform_test()\n",
+ "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n",
+ "p_value_permutation=mmd.compute_p_value(statistic)\n",
- "print \"Bootstrapping: P-value of MMD test is %.2f\" % p_value_spectrum"
+ "print \"Permutation: P-value of MMD test is %.2f\" % p_value_permutation"
]
},
@@ -637,32 +601,34 @@
" Z=hstack((X,Y))\n",
" X=Z[:len(X)]\n",
" Y=Z[len(X):]\n",
- " feat_p=RealFeatures(reshape(X, (1,len(X))))\n",
- " feat_q=RealFeatures(reshape(Y, (1,len(Y))))\n",
+ "    feat_p=sg.RealFeatures(np.reshape(X, (1,len(X))))\n",
+ "    feat_q=sg.RealFeatures(np.reshape(Y, (1,len(Y))))\n",
" \n",
" # gamma\n",
- " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n",
- " mmd.set_null_approximation_method(MMD2_GAMMA)\n",
- " mmd.set_statistic_type(BIASED)\n",
+ " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n",
+ " mmd.set_kernel(kernel)\n",
+ " mmd.set_null_approximation_method(sg.NAM_MMD2_GAMMA)\n",
+ " mmd.set_statistic_type(sg.ST_BIASED_FULL) \n",
" rejections_gamma[i]=mmd.perform_test(alpha)\n",
" \n",
" # spectrum\n",
- " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n",
- " mmd.set_null_approximation_method(MMD2_SPECTRUM)\n",
- " mmd.set_num_eigenvalues_spectrum(num_eigen)\n",
- " mmd.set_num_samples_spectrum(num_samples)\n",
- " mmd.set_statistic_type(BIASED)\n",
+ " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n",
+ " mmd.set_kernel(kernel)\n",
+ " mmd.set_null_approximation_method(sg.NAM_MMD2_SPECTRUM)\n",
+ " mmd.spectrum_set_num_eigenvalues(num_eigen)\n",
+ " mmd.set_num_null_samples(num_samples)\n",
+ " mmd.set_statistic_type(sg.ST_BIASED_FULL)\n",
" rejections_spectrum[i]=mmd.perform_test(alpha)\n",
" \n",
" # bootstrap (precompute kernel)\n",
- " mmd=QuadraticTimeMMD(kernel, feat_p, feat_q)\n",
+ " mmd=sg.QuadraticTimeMMD(feat_p, feat_q)\n",
" p_and_q=mmd.get_p_and_q()\n",
" kernel.init(p_and_q, p_and_q)\n",
- " precomputed_kernel=CustomKernel(kernel)\n",
+ " precomputed_kernel=sg.CustomKernel(kernel)\n",
" mmd.set_kernel(precomputed_kernel)\n",
- " mmd.set_null_approximation_method(PERMUTATION)\n",
+ " mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n",
" mmd.set_num_null_samples(num_samples)\n",
- " mmd.set_statistic_type(BIASED)\n",
+ " mmd.set_statistic_type(sg.ST_BIASED_FULL)\n",
" rejections_bootstrap[i]=mmd.perform_test(alpha)"
]
},
@@ -701,9 +667,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. The linear time MMD statistic, implemented in CLinearTimeMMD can help here, as it accepts data under the streaming interface CStreamingFeatures, which deliver data one-by-one.\n",
+ "So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. The linear time MMD statistic, implemented in CLinearTimeMMD can help here, as it accepts data under the streaming interface CStreamingFeatures, which deliver data one-by-one.\n",
"\n",
- "And it can do more cool things, for example choose the best single (or combined) kernel for you. But we need a more fancy dataset for that to show its power. We will use one of Shogun's streaming based data generator, CGaussianBlobsDataGenerator for that. This dataset consists of two distributions which are a grid of Gaussians where in one of them, the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing."
+ "It can also do more cool things, for example choose the best single (or combined) kernel for you. But we need a fancier dataset to show its power. We will use one of Shogun's streaming based data generators, CGaussianBlobsDataGenerator, for that. This dataset consists of two distributions which are a grid of Gaussians, where in one of them the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing."
]
},
{
@@ -722,8 +688,8 @@
"angle=pi/4\n",
"\n",
"# these are streaming features\n",
- "gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n",
- "gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n",
+ "gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n",
+ "gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n",
"\t\t\n",
"# stream some data and plot\n",
"num_plot=1000\n",
@@ -755,7 +721,7 @@
"\n",
"where $ m_2=\\lfloor\\frac{m}{2} \\rfloor$. While the above expression assumes that $m$ data are available from each distribution, the statistic in general works in an online setting where features are obtained one by one. Since only pairs of four points are considered at once, this allows to compute it on data streams. In addition, the computational costs are linear in the number of samples that are considered from each distribution. These two properties make the linear time MMD very applicable for large scale two-sample tests. In theory, any number of samples can be processed -- time is the only limiting factor.\n",
"\n",
- "We begin by illustrating how to pass data to CLinearTimeMMD. In order not to loose performance due to overhead, it is possible to specify a block size for the data stream."
+ "We begin by illustrating how to pass data to CLinearTimeMMD. In order not to lose performance due to overhead, it is possible to specify a block size for the data stream."
]
},
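In numpy, averaging the h-values over non-overlapping sample pairs might look as follows. This is a plain illustration of the formula above under an assumed Gaussian kernel, not Shogun's CLinearTimeMMD, which additionally streams and blocks the data.

```python
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    # Gaussian kernel between row-wise paired samples
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def linear_time_mmd(X, Y, sigma=1.0):
    # split both sample sets into non-overlapping pairs (x_2i, x_2i+1)
    m2 = min(len(X), len(Y)) // 2
    x1, x2 = X[:2 * m2:2], X[1:2 * m2:2]
    y1, y2 = Y[:2 * m2:2], Y[1:2 * m2:2]
    # h(x, x', y, y') = k(x,x') + k(y,y') - k(x,y') - k(x',y)
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    # unbiased estimate: average of the h-values
    return h.mean()
```

Since every sample is touched exactly once and nothing is stored, the cost is linear in the number of samples.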
{
@@ -769,7 +735,11 @@
"block_size=100\n",
"\n",
"# if features are already under the streaming interface, just pass them\n",
- "mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)\n",
+ "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n",
+ "mmd.set_kernel(kernel)\n",
+ "mmd.set_num_samples_p(m)\n",
+ "mmd.set_num_samples_q(m)\n",
+ "mmd.set_num_blocks_per_burst(block_size)\n",
"\n",
"# compute an unbiased estimate in linear time\n",
"statistic=mmd.compute_statistic()\n",
@@ -785,7 +755,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Sometimes, one might want to use CLinearTimeMMD with data that is stored in memory. In that case, it is easy to data in the form of for example CStreamingDenseFeatures into CDenseFeatures."
+ "Sometimes, one might want to use CLinearTimeMMD with data that is stored in memory. In that case, it is easy to convert data in the form of, for example, CStreamingDenseFeatures into CDenseFeatures."
]
},
{
@@ -797,25 +767,21 @@
"outputs": [],
"source": [
"# data source\n",
- "gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n",
- "gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n",
- "\n",
- "# retreive some points, store them as non-streaming data in memory\n",
- "data_p=gen_p.get_streamed_features(100)\n",
- "data_q=gen_q.get_streamed_features(data_p.get_num_vectors())\n",
- "print \"Number of data is %d\" % data_p.get_num_vectors()\n",
+ "gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)\n",
+ "gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)\n",
"\n",
- "# cast data in memory as streaming features again (which now stream from the in-memory data)\n",
- "streaming_p=StreamingRealFeatures(data_p)\n",
- "streaming_q=StreamingRealFeatures(data_q)\n",
+ "num_samples=100\n",
+ "print \"Number of data is %d\" % num_samples\n",
"\n",
- "# it is important to start the internal parser to avoid deadlocks\n",
- "streaming_p.start_parser()\n",
- "streaming_q.start_parser()\n",
+ "# retrieve some points, store them as non-streaming data in memory\n",
+ "data_p=gen_p.get_streamed_features(num_samples)\n",
+ "data_q=gen_q.get_streamed_features(num_samples)\n",
"\n",
- "# example to create mmd (note that m can be maximum the number of data in memory)\n",
+ "# example to create mmd (note that num_samples can be at most the number of data points in memory)\n",
"\n",
- "mmd=LinearTimeMMD(GaussianKernel(10,1), streaming_p, streaming_q, data_p.get_num_vectors(), 1)\n",
+ "mmd=sg.LinearTimeMMD(data_p, data_q)\n",
+ "mmd.set_kernel(sg.GaussianKernel(10, 1))\n",
+ "mmd.set_num_blocks_per_burst(100)\n",
"print \"Linear time MMD statistic: %.2f\" % mmd.compute_statistic()"
]
},
@@ -830,7 +796,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, *new* data is taken from the stream in every iteration.\n",
+ "As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, *new* data is taken from the stream in every iteration.\n",
"\n",
"Bootstrapping is not really necessary since there exists a fast and consistent estimate of the null-distribution. However, to ensure that any approximation is accurate, it should always be checked against bootstrapping at least once.\n",
"\n",
@@ -848,7 +814,7 @@
"\n",
"A normal distribution with this variance and zero mean can then be used as an approximation for the null-distribution. This results in a consistent test and is very fast. However, note that it is an approximation and its accuracy depends on the underlying data distributions. It is a good idea to compare to the bootstrapping approach first to determine an appropriate number of samples to use. This number is usually in the tens of thousands.\n",
"\n",
- "CLinearTimeMMD allows to approximate the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice since seperate calls of computing statistic and p-value will operator on different data from the stream. Below, we compute the test on a large amount of data (impossible to perform quadratic time MMD for this one as the kernel matrices cannot be stored in memory)"
+ "CLinearTimeMMD allows approximating the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice since separate calls of computing statistic and p-value will operate on different data from the stream. Below, we compute the test on a large amount of data (impossible to perform quadratic time MMD for this one as the kernel matrices cannot be stored in memory)"
]
},
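The h-values needed for the statistic directly yield the variance estimate, so the normal approximation of the null distribution comes almost for free in the same pass. A sketch of the idea in numpy (hypothetical names, Gaussian kernel assumed; in Shogun this corresponds to the NAM_MMD1_GAUSSIAN approximation):

```python
import math
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def linear_mmd_gaussian_test(X, Y, sigma=1.0):
    # compute the h-values of the linear time MMD in one pass
    m2 = min(len(X), len(Y)) // 2
    x1, x2 = X[:2 * m2:2], X[1:2 * m2:2]
    y1, y2 = Y[:2 * m2:2], Y[1:2 * m2:2]
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    statistic = h.mean()
    # null approximation: N(0, var(h) / m2), the variance of the mean
    std_null = math.sqrt(h.var(ddof=1) / m2)
    # one-sided p-value under the normal approximation
    p_value = 0.5 * math.erfc(statistic / (std_null * math.sqrt(2)))
    return statistic, p_value
```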
{
@@ -859,10 +825,15 @@
},
"outputs": [],
"source": [
- "mmd=LinearTimeMMD(kernel, gen_p, gen_q, m, block_size)\n",
+ "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n",
+ "mmd.set_kernel(kernel)\n",
+ "mmd.set_num_samples_p(m)\n",
+ "mmd.set_num_samples_q(m)\n",
+ "mmd.set_num_blocks_per_burst(block_size)\n",
+ "\n",
"print \"m=%d samples from p and q\" % m\n",
"print \"Binary test result is: \" + (\"Rejection\" if mmd.perform_test(alpha) else \"No rejection\")\n",
- "print \"P-value test result is %.2f\" % mmd.perform_test()"
+ "print \"P-value test result is %.2f\" % mmd.compute_p_value(mmd.compute_statistic())"
]
},
{
@@ -880,38 +851,38 @@
"\\DeclareMathOperator*{\\argmax}{arg\\,max}$\n",
"Now which kernel do we actually use for our tests? So far, we just plugged in arbritary ones. However, for kernel two-sample testing, it is possible to do something more clever.\n",
"\n",
- "Shogun's kernel selection methods for MMD based two-sample tests are all based around [3, 4]. For the CLinearTimeMMD, [3] describes a way of selecting the *optimal* kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in literature is choosing the kernel that maximised the MMD statistic [4]. For convex combinations of kernels, this can be achieved via a $L2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].\n",
+ "Shogun's kernel selection methods for MMD based two-sample tests are all based around [3, 4]. For the CLinearTimeMMD, [3] describes a way of selecting the *optimal* kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in the literature is choosing the kernel that maximises the MMD statistic [4]. For convex combinations of kernels, this can be achieved via an $L2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].\n",
"\n",
- "MMD Kernel selection in Shogun always involves an implementation of the base class CMMDKernelSelection, which defines the interface for kernel selection. If combinations of kernel should be considered, there is a sub-class CMMDKernelSelectionComb. In addition, it involves setting up a number of baseline kernels $\\mathcal{K}$ to choose from/combine in the form of a CCombinedKernel. All methods compute their results for a fixed set of these baseline kernels. We later give an example how to use these classes after providing a list of available methods.\n",
+ "MMD kernel selection in Shogun always involves choosing one of the available kernel selection strategies, as well as setting up a number of baseline kernels to choose from or combine. All methods compute their results for a fixed set of these baseline kernels. We later give an example how to use these strategies after providing a list of available methods.\n",
"\n",
- " * CMMDKernelSelectionMedian Selects from a set CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousands. This method a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.\n",
+ " * *KSM_MEDIAN_HEURISTIC*: Selects from a set of CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousand. This method is a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.\n",
"\n",
- " * CMMDKernelSelectionMax Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic -- more specific its estimate.\n",
+ " * *KSM_MAXIMIZE_MMD*: Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic -- more specifically, its estimate.\n",
"$$\n",
"k^*=\\argmax_{k\\in\\mathcal{K}} \\hat \\eta_k,\n",
"$$\n",
"where $\\eta_k$ is an empirical MMD estimate for using a kernel $k$.\n",
"This was first described in [4] and was empirically shown to perform better than the median heuristic above. However, it remains a heuristic that comes with no guarantees. Since MMD estimates can be computed in linear and quadratic time, this method works for both methods. However, for the linear time statistic, there exists a better method.\n",
" \n",
- " * CMMDKernelSelectionOpt Selects the optimal single kernel from a set of baseline kernels. This is done via maximising the ratio of the linear MMD statistic and its standard deviation.\n",
+ " * *KSM_MAXIMIZE_POWER*: Selects the optimal single kernel from a set of baseline kernels. This is done via maximising the ratio of the linear MMD statistic and its standard deviation.\n",
"$$\n",
"k^*=\\argmax_{k\\in\\mathcal{K}} \\frac{\\hat \\eta_k}{\\hat\\sigma_k+\\lambda},\n",
"$$\n",
"where $\\eta_k$ is a linear time MMD estimate for using a kernel $k$ and $\\hat\\sigma_k$ is a linear time variance estimate of $\\eta_k$ to which a small number $\\lambda$ is added to prevent division by zero.\n",
- "These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error. *Important:* For this method to work, the kernel needs to be selected on *different* data than the test is performed on. Otherwise, the method will produce wrong results.\n",
+ "These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error. *Important:* For this method to work, the kernel needs to be selected on *different* data than the test is performed on. Otherwise, the method will produce wrong results.\n",
" \n",
- " * CMMDKernelSelectionCombMaxL2 Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogous to CMMDKernelSelectionMax. This is done via solving the convex program\n",
+ " * CMMDKernelSelectionCombMaxL2 Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogue of CMMDKernelSelectionMax. This is done via solving the convex program\n",
"$$\n",
"\\boldsymbol{\\beta}^*=\\min_{\\boldsymbol{\\beta}} \\{\\boldsymbol{\\beta}^T\\boldsymbol{\\beta} : \\boldsymbol{\\beta}^T\\boldsymbol{\\eta}=\\mathbf{1}, \\boldsymbol{\\beta}\\succeq 0\\},\n",
"$$\n",
"where $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. See [3] for details. Note that this method is unable to select a single kernel -- even when this would be optimal.\n",
"Again, when using the linear time MMD, there are better methods available.\n",
"\n",
- " * CMMDKernelSelectionCombOpt Selects a convex combination of kernels that maximises the MMD statistic divided by its covariance. This corresponds to \\emph{optimal} kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogous. The convex program to solve is\n",
+ " * CMMDKernelSelectionCombOpt Selects a convex combination of kernels that maximises the MMD statistic divided by its covariance. This corresponds to \\emph{optimal} kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogue. The convex program to solve is\n",
"$$\n",
"\\boldsymbol{\\beta}^*=\\min_{\\boldsymbol{\\beta}} (\\hat Q+\\lambda I) \\{\\boldsymbol{\\beta}^T\\boldsymbol{\\beta} : \\boldsymbol{\\beta}^T\\boldsymbol{\\eta}=\\mathbf{1}, \\boldsymbol{\\beta}\\succeq 0\\},\n",
"$$\n",
- "where again $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. The matrix $\\hat Q$ is a linear time estimate of the covariance matrix of the vector $\\boldsymbol{\\eta}$ to whose diagonal a small number $\\lambda$ is added to prevent division by zero. See [3] for details. In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt."
+ "where again $\\boldsymbol{\\beta}$ is a vector of the resulting kernel weights and $\\boldsymbol{\\eta}$ is a vector of which each component contains a MMD estimate for a baseline kernel. The matrix $\\hat Q$ is a linear time estimate of the covariance matrix of the vector $\\boldsymbol{\\eta}$ to whose diagonal a small number $\\lambda$ is added to prevent division by zero. See [3] for details. In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt."
]
},
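For a single Gaussian kernel, the ratio criterion behind *KSM_MAXIMIZE_POWER* can be sketched in a few lines of numpy: compute the linear-time h-values for each candidate bandwidth and maximise the mean divided by the standard deviation of the estimate. This is a hypothetical standalone sketch; as stressed above, the selection must run on different data than the final test.

```python
import numpy as np

def gaussian_k(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def select_width_max_power(X, Y, sigmas, lmbda=1e-5):
    m2 = min(len(X), len(Y)) // 2
    x1, x2 = X[:2 * m2:2], X[1:2 * m2:2]
    y1, y2 = Y[:2 * m2:2], Y[1:2 * m2:2]
    ratios = []
    for sigma in sigmas:
        h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
             - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
        # criterion: MMD estimate divided by the std of the estimate,
        # with a small lambda against division by zero
        ratios.append(h.mean() / (h.std(ddof=1) / np.sqrt(m2) + lmbda))
    return sigmas[int(np.argmax(ratios))]
```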
{
@@ -925,11 +896,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call\n",
+ "In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel and append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call\n",
"\n",
" * *compute_measures* to compute a vector of kernel selection criteria if a single kernel selection method is used. It will return a vector of selected kernel weights if a combined kernel selection method is used. For \\shogunclass{CMMDKernelSelectionMedian}, the method does throw an error.\n",
"\n",
- " * *select\\_kernel* returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method. \n",
+ " * *select\\_kernel* returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method. \n",
"\n",
"In order to utilise the selected kernel, it has to be passed to an MMD instance. We now give an example how to select the optimal single and combined kernel for the Gaussian Blobs dataset."
]
@@ -949,22 +920,29 @@
},
"outputs": [],
"source": [
- "sigmas=[2**x for x in linspace(-5,5, 10)]\n",
+ "# mmd instance using streaming features\n",
+ "mmd=sg.LinearTimeMMD(gen_p, gen_q)\n",
+ "mmd.set_num_samples_p(m)\n",
+ "mmd.set_num_samples_q(m)\n",
+ "mmd.set_num_blocks_per_burst(block_size)\n",
+ "\n",
+ "sigmas=[2**x for x in np.linspace(-5, 5, 11)]\n",
"print \"Choosing kernel width from\", [\"{0:.2f}\".format(sigma) for sigma in sigmas]\n",
- "combined=CombinedKernel()\n",
- "for i in range(len(sigmas)):\n",
- " combined.append_kernel(GaussianKernel(10, sigmas[i]))\n",
"\n",
- "# mmd instance using streaming features\n",
- "block_size=1000\n",
- "mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size)\n",
+ "for i in range(len(sigmas)):\n",
+ " mmd.add_kernel(sg.GaussianKernel(10, sigmas[i]))\n",
"\n",
"# optimal kernel choice is possible for linear time MMD\n",
- "selection=MMDKernelSelectionOpt(mmd)\n",
+ "mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_POWER)\n",
+ "\n",
+ "# must be set true for kernel selection\n",
+ "mmd.set_train_test_mode(True)\n",
"\n",
"# select best kernel\n",
- "best_kernel=selection.select_kernel()\n",
- "best_kernel=GaussianKernel.obtain_from_generic(best_kernel)\n",
+ "mmd.select_kernel()\n",
+ "\n",
+ "best_kernel=mmd.get_kernel()\n",
+ "best_kernel=sg.GaussianKernel.obtain_from_generic(best_kernel)\n",
"print \"Best single kernel has bandwidth %.2f\" % best_kernel.get_width()"
]
},
@@ -983,10 +961,8 @@
},
"outputs": [],
"source": [
- "alpha=0.05\n",
- "mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, m, block_size)\n",
- "mmd.set_null_approximation_method(MMD1_GAUSSIAN);\n",
- "p_value_best=mmd.perform_test();\n",
+ "mmd.set_null_approximation_method(sg.NAM_MMD1_GAUSSIAN);\n",
+ "p_value_best=mmd.compute_p_value(mmd.compute_statistic());\n",
"\n",
"print \"Bootstrapping: P-value of MMD test with optimal kernel is %.2f\" % p_value_best"
]
@@ -1006,19 +982,20 @@
},
"outputs": [],
"source": [
- "mmd=LinearTimeMMD(best_kernel, gen_p, gen_q, 5000, block_size)\n",
+ "m=5000\n",
+ "mmd.set_num_samples_p(m)\n",
+ "mmd.set_num_samples_q(m)\n",
+ "mmd.set_train_test_mode(False)\n",
"num_samples=500\n",
"\n",
"# sample null and alternative distribution, implicitly generate new data for that\n",
- "null_samples=zeros(num_samples)\n",
+ "mmd.set_null_approximation_method(sg.NAM_PERMUTATION)\n",
+ "mmd.set_num_null_samples(num_samples)\n",
+ "null_samples=mmd.sample_null()\n",
+ "\n",
"alt_samples=zeros(num_samples)\n",
"for i in range(num_samples):\n",
- " alt_samples[i]=mmd.compute_statistic()\n",
- " \n",
- " # tell MMD to merge data internally while streaming\n",
- " mmd.set_simulate_h0(True)\n",
- " null_samples[i]=mmd.compute_statistic()\n",
- " mmd.set_simulate_h0(False)"
+ " alt_samples[i]=mmd.compute_statistic()"
]
},
{
@@ -1103,7 +1080,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
- "version": "2.7.12"
+ "version": "2.7.10"
}
},
"nbformat": 4,
diff --git a/examples/meta/generator/targets/cpp.json b/examples/meta/generator/targets/cpp.json
index d512e30b309..73a224161e0 100644
--- a/examples/meta/generator/targets/cpp.json
+++ b/examples/meta/generator/targets/cpp.json
@@ -86,7 +86,7 @@
"MethodCall": "$object->$method($arguments)",
"StaticCall": "C$typeName::$method($arguments)",
"Identifier": "$identifier",
- "Enum":"$value"
+ "Enum":"$typeName::$value"
},
"Element": {
"Access": {
diff --git a/examples/meta/src/statistical_testing/linear_time_mmd.sg b/examples/meta/src/statistical_testing/linear_time_mmd.sg
new file mode 100644
index 00000000000..97e93642b83
--- /dev/null
+++ b/examples/meta/src/statistical_testing/linear_time_mmd.sg
@@ -0,0 +1,59 @@
+GaussianBlobsDataGenerator features_p()
+GaussianBlobsDataGenerator features_q()
+
+#![create_instance]
+LinearTimeMMD mmd()
+GaussianKernel kernel(10, 1)
+mmd.set_kernel(kernel)
+mmd.set_p(features_p)
+mmd.set_q(features_q)
+mmd.set_num_samples_p(1000)
+mmd.set_num_samples_q(1000)
+real alpha = 0.05
+#![create_instance]
+
+#![set_burst]
+mmd.set_num_blocks_per_burst(1000)
+#![set_burst]
+
+#![estimate_mmd]
+real statistic = mmd.compute_statistic()
+#![estimate_mmd]
+
+#![perform_test]
+real threshold = mmd.compute_threshold(alpha)
+real p_value = mmd.compute_p_value(statistic)
+#![perform_test]
+
+#![add_kernels]
+GaussianKernel kernel1(10, 0.1)
+GaussianKernel kernel2(10, 1)
+GaussianKernel kernel3(10, 10)
+mmd.add_kernel(kernel1)
+mmd.add_kernel(kernel2)
+mmd.add_kernel(kernel3)
+#![add_kernels]
+
+#![enable_train_test_mode]
+mmd.set_train_test_mode(True)
+mmd.set_train_test_ratio(1)
+#![enable_train_test_mode]
+
+#![select_kernel_single]
+mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_POWER)
+mmd.select_kernel()
+GaussianKernel learnt_kernel_single = GaussianKernel:obtain_from_generic(mmd.get_kernel())
+real width = learnt_kernel_single.get_width()
+#![select_kernel_single]
+
+#![select_kernel_combined]
+mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_POWER, True)
+mmd.select_kernel()
+CombinedKernel learnt_kernel_combined = CombinedKernel:obtain_from_generic(mmd.get_kernel())
+RealVector weights = learnt_kernel_combined.get_subkernel_weights()
+#![select_kernel_combined]
+
+#![perform_test_optimized]
+real statistic_optimized = mmd.compute_statistic()
+real p_value_optimized = mmd.compute_p_value(statistic_optimized)
+#![perform_test_optimized]
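The statistic that `compute_statistic` returns for the linear time MMD averages an h-statistic over disjoint sample pairs. A minimal NumPy sketch of the unbiased linear-time estimate, assuming a Gaussian kernel of the Shogun form exp(-||x-y||^2/width) (function names are illustrative, not Shogun API):

```python
import numpy as np

def gaussian_kernel(X, Y, width=2.0):
    # elementwise kernel between matched rows of X and Y
    return np.exp(-((X - Y) ** 2).sum(axis=-1) / width)

def linear_time_mmd(X, Y, width=2.0):
    # unbiased streaming estimate: average h over disjoint consecutive pairs
    n = (min(len(X), len(Y)) // 2) * 2
    x1, x2 = X[0:n:2], X[1:n:2]
    y1, y2 = Y[0:n:2], Y[1:n:2]
    h = (gaussian_kernel(x1, x2, width) + gaussian_kernel(y1, y2, width)
         - gaussian_kernel(x1, y2, width) - gaussian_kernel(x2, y1, width))
    return float(h.mean())
```

Each sample is touched only once, which is why the estimator works on streaming features and why larger burst sizes only affect efficiency, not the result.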
diff --git a/examples/meta/src/statistical_testing/quadratic_time_mmd.sg b/examples/meta/src/statistical_testing/quadratic_time_mmd.sg
new file mode 100644
index 00000000000..e6ec822ab40
--- /dev/null
+++ b/examples/meta/src/statistical_testing/quadratic_time_mmd.sg
@@ -0,0 +1,71 @@
+CSVFile f_features_p("../../data/two_sample_test_gaussian.dat")
+CSVFile f_features_q("../../data/two_sample_test_laplace.dat")
+
+#![create_features]
+RealFeatures features_p(f_features_p)
+RealFeatures features_q(f_features_q)
+#![create_features]
+
+#![create_instance]
+QuadraticTimeMMD mmd(features_p, features_q)
+GaussianKernel kernel(10, 1)
+mmd.set_kernel(kernel)
+real alpha = 0.05
+#![create_instance]
+
+#![estimate_mmd]
+mmd.set_statistic_type(enum EStatisticType.ST_BIASED_FULL)
+real statistic = mmd.compute_statistic()
+#![estimate_mmd]
+
+#![perform_test]
+mmd.set_null_approximation_method(enum ENullApproximationMethod.NAM_PERMUTATION)
+mmd.set_num_null_samples(200)
+real threshold = mmd.compute_threshold(alpha)
+real p_value = mmd.compute_p_value(statistic)
+#![perform_test]
+
+#![add_kernels]
+GaussianKernel kernel1(10, 0.1)
+GaussianKernel kernel2(10, 1)
+GaussianKernel kernel3(10, 10)
+mmd.add_kernel(kernel1)
+mmd.add_kernel(kernel2)
+mmd.add_kernel(kernel3)
+#![add_kernels]
+
+#![multi_kernel]
+MultiKernelQuadraticTimeMMD mk = mmd.multikernel()
+mk.add_kernel(kernel1)
+mk.add_kernel(kernel2)
+mk.add_kernel(kernel3)
+
+RealVector mk_statistic = mk.compute_statistic()
+RealVector mk_p_value = mk.compute_p_value()
+#![multi_kernel]
+
+#![enable_train_test_mode]
+mmd.set_train_test_mode(True)
+mmd.set_train_test_ratio(1)
+#![enable_train_test_mode]
+
+#![select_kernel_single]
+int num_runs = 1
+int num_folds = 3
+mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_CROSS_VALIDATION, num_runs, num_folds, alpha)
+mmd.select_kernel()
+GaussianKernel learnt_kernel_single = GaussianKernel:obtain_from_generic(mmd.get_kernel())
+real width = learnt_kernel_single.get_width()
+#![select_kernel_single]
+
+#![select_kernel_combined]
+mmd.set_kernel_selection_strategy(enum EKernelSelectionMethod.KSM_MAXIMIZE_MMD, True)
+mmd.select_kernel()
+CombinedKernel learnt_kernel_combined = CombinedKernel:obtain_from_generic(mmd.get_kernel())
+RealVector weights = learnt_kernel_combined.get_subkernel_weights()
+#![select_kernel_combined]
+
+#![perform_test_optimized]
+real statistic_optimized = mmd.compute_statistic()
+real p_value_optimized = mmd.compute_p_value(statistic_optimized)
+#![perform_test_optimized]
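Once null samples are available, `compute_p_value` and `compute_threshold` reduce to simple tail computations on the empirical null distribution. A hedged NumPy sketch (function names are illustrative):

```python
import numpy as np

def p_value_from_null(statistic, null_samples):
    # right-tailed p-value: fraction of null samples at or above the observation
    null_samples = np.asarray(null_samples)
    return float((null_samples >= statistic).mean())

def threshold_from_null(alpha, null_samples):
    # test threshold: the (1 - alpha) quantile of the null distribution
    return float(np.quantile(np.asarray(null_samples), 1.0 - alpha))
```

Rejecting when the statistic exceeds the threshold and rejecting when the p-value falls below :math:`\alpha` are two views of the same decision.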
diff --git a/examples/undocumented/libshogun/statistics_hsic.cpp b/examples/undocumented/libshogun/statistics_hsic.cpp
deleted file mode 100644
index 196d9874a90..00000000000
--- a/examples/undocumented/libshogun/statistics_hsic.cpp
+++ /dev/null
@@ -1,172 +0,0 @@
-/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 3 of the License, or
- * (at your option) any later version.
- *
- * Written (W) 2012 Heiko Strathmann
- */
-
-#include
-#include
-#include
-#include
-#include
-
-using namespace shogun;
-
-void create_fixed_data_kernel_small(CFeatures*& features_p,
- CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q)
-{
- index_t m=2;
- index_t d=3;
-
- SGMatrix<float64_t> p(d,2*m);
- for (index_t i=0; i<2*d*m; ++i)
- p.matrix[i]=i;
-
-// p.display_matrix("p");
-
- SGMatrix<float64_t> q(d,2*m);
- for (index_t i=0; i<2*d*m; ++i)
- q.matrix[i]=i+10;
-
-// q.display_matrix("q");
-
- features_p=new CDenseFeatures<float64_t>(p);
- features_q=new CDenseFeatures<float64_t>(q);
-
- float64_t sigma_x=2;
- float64_t sigma_y=3;
- float64_t sq_sigma_x_twice=sigma_x*sigma_x*2;
- float64_t sq_sigma_y_twice=sigma_y*sigma_y*2;
-
- /* shoguns kernel width is different */
- kernel_p=new CGaussianKernel(10, sq_sigma_x_twice);
- kernel_q=new CGaussianKernel(10, sq_sigma_y_twice);
-}
-
-void create_fixed_data_kernel_big(CFeatures*& features_p,
- CFeatures*& features_q, CKernel*& kernel_p, CKernel*& kernel_q)
-{
- index_t m=10;
- index_t d=7;
-
- SGMatrix<float64_t> p(d,m);
- for (index_t i=0; i<d*m; ++i)
- p.matrix[i]=CMath::randn_double();
-
- SGMatrix<float64_t> q(d,m);
- for (index_t i=0; i<d*m; ++i)
- q.matrix[i]=CMath::randn_double();
-
- features_p=new CDenseFeatures<float64_t>(p);
- features_q=new CDenseFeatures<float64_t>(q);
-
- float64_t sigma_x=2;
- float64_t sigma_y=3;
- float64_t sq_sigma_x_twice=sigma_x*sigma_x*2;
- float64_t sq_sigma_y_twice=sigma_y*sigma_y*2;
-
- /* shoguns kernel width is different */
- kernel_p=new CGaussianKernel(10, sq_sigma_x_twice);
- kernel_q=new CGaussianKernel(10, sq_sigma_y_twice);
-}
-
-/** tests the hsic statistic for a single fixed data case and ensures
- * equality with sma implementation */
-void test_hsic_fixed()
-{
- CFeatures* features_p=NULL;
- CFeatures* features_q=NULL;
- CKernel* kernel_p=NULL;
- CKernel* kernel_q=NULL;
- create_fixed_data_kernel_small(features_p, features_q, kernel_p, kernel_q);
-
- index_t m=features_p->get_num_vectors();
-
- CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q);
-
- /* assert matlab result, note that compute statistic computes m*hsic */
- float64_t difference=hsic->compute_statistic();
- SG_SPRINT("hsic fixed: %f\n", difference);
- ASSERT(CMath::abs(difference-m*0.164761446385339)<10E-16);
-
-
- SG_UNREF(hsic);
-}
-
-void test_hsic_gamma()
-{
- CFeatures* features_p=NULL;
- CFeatures* features_q=NULL;
- CKernel* kernel_p=NULL;
- CKernel* kernel_q=NULL;
- create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q);
-
- CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q);
-
- hsic->set_null_approximation_method(HSIC_GAMMA);
- float64_t p=hsic->compute_p_value(0.05);
- SG_SPRINT("p-value: %f\n", p);
-
- // disabled as I think previous inverse_gamma_cdf was faulty
- // now unit test fails. Needs to be investigated statistically
- //ASSERT(CMath::abs(p-0.172182287884256)<10E-15);
-
- SG_UNREF(hsic);
-}
-
-void test_hsic_sample_null()
-{
- CFeatures* features_p=NULL;
- CFeatures* features_q=NULL;
- CKernel* kernel_p=NULL;
- CKernel* kernel_q=NULL;
- create_fixed_data_kernel_big(features_p, features_q, kernel_p, kernel_q);
-
- CHSIC* hsic=new CHSIC(kernel_p, kernel_q, features_p, features_q);
-
- /* do sampling null */
- hsic->set_null_approximation_method(PERMUTATION);
- float64_t p=hsic->compute_p_value(0.05);
- SG_SPRINT("p-value: %f\n", p);
-
- /* ensure that sampling null of hsic leads to same results as using
- * CKernelIndependenceTest */
- CMath::init_random(1);
- float64_t mean1=CStatistics::mean(hsic->sample_null());
- float64_t var1=CStatistics::variance(hsic->sample_null());
- SG_SPRINT("mean1=%f, var1=%f\n", mean1, var1);
-
- CMath::init_random(1);
- float64_t mean2=CStatistics::mean(
- hsic->CKernelIndependenceTest::sample_null());
- float64_t var2=CStatistics::variance(hsic->sample_null());
- SG_SPRINT("mean2=%f, var2=%f\n", mean2, var2);
-
- /* assert than results are the same from bot sampling null impl. */
- ASSERT(CMath::abs(mean1-mean2)<10E-8);
- ASSERT(CMath::abs(var1-var2)<10E-8);
-
- SG_UNREF(hsic);
-}
-
-int main(int argc, char** argv)
-{
- init_shogun_with_defaults();
-
-// sg_io->set_loglevel(MSG_DEBUG);
-
- test_hsic_fixed();
- test_hsic_gamma();
- test_hsic_sample_null();
-
- exit_shogun();
- return 0;
-}
-
diff --git a/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp b/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp
deleted file mode 100644
index 3687cd49d1a..00000000000
--- a/examples/undocumented/libshogun/statistics_linear_time_mmd.cpp
+++ /dev/null
@@ -1,93 +0,0 @@
-/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 3 of the License, or
- * (at your option) any later version.
- *
- * Written (W) 2013 Heiko Strathmann
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-using namespace shogun;
-
-void linear_time_mmd()
-{
- /* note that the linear time statistic is designed for much larger datasets
- * so increase to get reasonable results */
- index_t m=1000;
- index_t dim=2;
- float64_t difference=0.5;
-
- /* streaming data generator for mean shift distributions */
- CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim);
- CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim);
-
- /* set kernel a-priori. usually one would do some kernel selection. See
- * other examples for this. */
- float64_t width=10;
- CGaussianKernel* kernel=new CGaussianKernel(10, width);
-
- /* create linear time mmd instance */
- index_t blocksize=1000;
- CLinearTimeMMD* mmd=new CLinearTimeMMD(kernel, gen_p, gen_q, m, blocksize);
-
- /* perform test: compute p-value and test if null-hypothesis is rejected for
- * a test level of 0.05 */
- float64_t alpha=0.05;
-
- /* using bootstrapping (not reccomended for linear time MMD, since slow).
- * Also, in practice, use at least 250 iterations */
- mmd->set_null_approximation_method(PERMUTATION);
- mmd->set_num_null_samples(10);
- float64_t p_value_bootstrap=mmd->perform_test();
- /* reject if p-value is smaller than test level */
- SG_SPRINT("bootstrap: p!=q: %d\n", p_value_bootstrap<alpha);
-
- /* using Gaussian approximation of null distribution */
- mmd->set_null_approximation_method(MMD1_GAUSSIAN);
- float64_t p_value_gaussian=mmd->perform_test();
- /* reject if p-value is smaller than test level */
- SG_SPRINT("gaussian approx: p!=q: %d\n", p_value_gaussian<alpha);
-
- index_t num_trials=5;
- SGVector<float64_t> typeIerrors(num_trials);
- SGVector<float64_t> typeIIerrors(num_trials);
- for (index_t i=0; i<num_trials; ++i)
- {
- /* this effectively means that p=q - rejecting is tpye I error */
- mmd->set_simulate_h0(true);
- typeIerrors[i]=mmd->perform_test()>alpha;
- mmd->set_simulate_h0(false);
-
- typeIIerrors[i]=mmd->perform_test()>alpha;
- }
-
- SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors));
- SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors));
-
- SG_UNREF(mmd);
-}
-
-int main(int argc, char** argv)
-{
- init_shogun_with_defaults();
-// sg_io->set_loglevel(MSG_DEBUG);
-
- linear_time_mmd();
-
- exit_shogun();
- return 0;
-}
-
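The deleted example estimates type I and type II errors by repeated testing, simulating :math:`H_0` via `set_simulate_h0`. The bookkeeping reduces to a rejection rate over trials; a minimal sketch (function name illustrative):

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of trials whose p-value falls below the test level.

    Applied to trials simulated under H0 this estimates the type I error;
    applied to trials under H1 it estimates the power, whose complement
    is the type II error.
    """
    p_values = list(p_values)
    return sum(p < alpha for p in p_values) / len(p_values)
```

A well-calibrated test should give a rejection rate close to :math:`\alpha` on H0 trials.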
diff --git a/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp b/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp
deleted file mode 100644
index a3f5df0764f..00000000000
--- a/examples/undocumented/libshogun/statistics_mmd_kernel_selection.cpp
+++ /dev/null
@@ -1,216 +0,0 @@
-/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 3 of the License, or
- * (at your option) any later version.
- *
- * Written (W) 2012 Heiko Strathmann
- */
-
-#include
-#include
-#include
-#ifdef USE_GPL_SHOGUN
-#include
-#include
-#endif //USE_GPL_SHOGUN
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-
-using namespace shogun;
-
-void kernel_choice_linear_time_mmd_opt_single()
-{
- /* Note that the linear time mmd is designed for large datasets. Results on
- * this small number will be bad (unstable, type I error wrong) */
- index_t m=1000;
- index_t num_blobs=3;
- float64_t distance=3;
- float64_t stretch=10;
- float64_t angle=CMath::PI/4;
-
- CGaussianBlobsDataGenerator* gen_p=new CGaussianBlobsDataGenerator(
- num_blobs, distance, stretch, angle);
-
- CGaussianBlobsDataGenerator* gen_q=new CGaussianBlobsDataGenerator(
- num_blobs, distance, 1, 1);
-
- /* create kernels */
- CCombinedKernel* combined=new CCombinedKernel();
- float64_t sigma_from=-3;
- float64_t sigma_to=10;
- float64_t sigma_step=1;
- float64_t sigma=sigma_from;
- while (sigma<=sigma_to)
- {
- /* shoguns kernel width is different */
- float64_t width=CMath::pow(2.0, sigma);
- float64_t sq_width_twice=width*width*2;
- combined->append_kernel(new CGaussianKernel(10, sq_width_twice));
- sigma+=sigma_step;
- }
-
- /* create MMD instance */
- CLinearTimeMMD* mmd=new CLinearTimeMMD(combined, gen_p, gen_q, m);
-
- /* kernel selection instance with regularisation term. May be replaced by
- * other methods for selecting single kernels */
- CMMDKernelSelectionOpt* selection=
- new CMMDKernelSelectionOpt(mmd, 10E-5);
-//
- /* select kernel that maximised MMD */
-// CMMDKernelSelectionMax* selection=
-// new CMMDKernelSelectionMax(mmd);
-
-// /* select kernel with width closest to median data distance */
-// CMMDKernelSelectionMedian* selection=
-// new CMMDKernelSelectionMedian(mmd, 10E-5);
-
- /* compute measures.
- * For Opt: ratio of MMD and standard deviation
- * For Max: MMDs of single kernels
- * for Medigan: Does not work! */
- SG_SPRINT("computing ratios\n");
- SGVector<float64_t> ratios=selection->compute_measures();
- ratios.display_vector("ratios");
-
- /* select kernel using the maximum ratio (and cast) */
- SG_SPRINT("selecting kernel\n");
- CKernel* selected=selection->select_kernel();
- CGaussianKernel* casted=CGaussianKernel::obtain_from_generic(selected);
- SG_SPRINT("selected kernel width: %f\n", casted->get_width());
- mmd->set_kernel(selected);
- SG_UNREF(casted);
- SG_UNREF(selected);
-
- mmd->set_null_approximation_method(MMD1_GAUSSIAN);
-
- /* compute tpye I and II error (use many more trials). Type I error is only
- * estimated to check MMD1_GAUSSIAN method for estimating the null
- * distribution. Note that testing has to happen on difference data than
- * kernel selecting, but the linear time mmd does this implicitly */
- float64_t alpha=0.05;
- index_t num_trials=5;
- SGVector<float64_t> typeIerrors(num_trials);
- SGVector<float64_t> typeIIerrors(num_trials);
- for (index_t i=0; i<num_trials; ++i)
- {
- /* this effectively means that p=q - rejecting is tpye I error */
- mmd->set_simulate_h0(true);
- typeIerrors[i]=mmd->perform_test()>alpha;
- mmd->set_simulate_h0(false);
-
- typeIIerrors[i]=mmd->perform_test()>alpha;
- }
-
- SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors));
- SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors));
-
-
- SG_UNREF(selection);
-}
-
-void kernel_choice_linear_time_mmd_opt_comb()
-{
-#ifdef USE_GPL_SHOGUN
- /* Note that the linear time mmd is designed for large datasets. Results on
- * this small number will be bad (unstable, type I error wrong) */
- index_t m=1000;
- index_t num_blobs=3;
- float64_t distance=3;
- float64_t stretch=10;
- float64_t angle=CMath::PI/4;
-
- CGaussianBlobsDataGenerator* gen_p=new CGaussianBlobsDataGenerator(
- num_blobs, distance, stretch, angle);
-
- CGaussianBlobsDataGenerator* gen_q=new CGaussianBlobsDataGenerator(
- num_blobs, distance, 1, 1);
-
- /* create kernels */
- CCombinedKernel* combined=new CCombinedKernel();
- float64_t sigma_from=-3;
- float64_t sigma_to=10;
- float64_t sigma_step=1;
- float64_t sigma=sigma_from;
- index_t num_kernels=0;
- while (sigma<=sigma_to)
- {
- /* shoguns kernel width is different */
- float64_t width=CMath::pow(2.0, sigma);
- float64_t sq_width_twice=width*width*2;
- combined->append_kernel(new CGaussianKernel(10, sq_width_twice));
- sigma+=sigma_step;
- num_kernels++;
- }
-
- /* create MMD instance */
- CLinearTimeMMD* mmd=new CLinearTimeMMD(combined, gen_p, gen_q, m);
-
- /* kernel selection instance with regularisation term. May be replaced by
- * other methods for selecting single kernels */
- CMMDKernelSelectionCombOpt* selection=
- new CMMDKernelSelectionCombOpt(mmd, 10E-5);
-
- /* maximise L2 regularised MMD */
-// CMMDKernelSelectionCombMaxL2* selection=
-// new CMMDKernelSelectionCombMaxL2(mmd, 10E-5);
-
- /* select kernel (does the same as above, but sets weights to kernel) */
- SG_SPRINT("selecting kernel\n");
- CKernel* selected=selection->select_kernel();
- CCombinedKernel* casted=CCombinedKernel::obtain_from_generic(selected);
- casted->get_subkernel_weights().display_vector("weights");
- mmd->set_kernel(selected);
- SG_UNREF(casted);
- SG_UNREF(selected);
-
- /* compute tpye I and II error (use many more trials). Type I error is only
- * estimated to check MMD1_GAUSSIAN method for estimating the null
- * distribution. Note that testing has to happen on difference data than
- * kernel selecting, but the linear time mmd does this implicitly */
- mmd->set_null_approximation_method(MMD1_GAUSSIAN);
- float64_t alpha=0.05;
- index_t num_trials=5;
- SGVector<float64_t> typeIerrors(num_trials);
- SGVector<float64_t> typeIIerrors(num_trials);
- for (index_t i=0; i<num_trials; ++i)
- {
- /* this effectively means that p=q - rejecting is tpye I error */
- mmd->set_simulate_h0(true);
- typeIerrors[i]=mmd->perform_test()>alpha;
- mmd->set_simulate_h0(false);
-
- typeIIerrors[i]=mmd->perform_test()>alpha;
- }
-
- SG_SPRINT("type I error: %f\n", CStatistics::mean(typeIerrors));
- SG_SPRINT("type II error: %f\n", CStatistics::mean(typeIIerrors));
-
-
- SG_UNREF(selection);
-#endif //USE_GPL_SHOGUN
-}
-
-int main(int argc, char** argv)
-{
- init_shogun_with_defaults();
-// sg_io->set_loglevel(MSG_DEBUG);
-
- /* select a single kernel for linear time MMD */
- kernel_choice_linear_time_mmd_opt_single();
-
- /* select combined kernels for linear time MMD */
- kernel_choice_linear_time_mmd_opt_comb();
-
- exit_shogun();
- return 0;
-}
-
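Both deleted kernel-selection examples repeat the conversion `sq_width_twice = width*width*2`, because Shogun's GaussianKernel is parametrized as exp(-||x-y||^2/width) rather than the textbook exp(-||x-y||^2/(2 sigma^2)). A small sketch of that conversion and of the kernel family the examples build (helper name illustrative):

```python
def shogun_gaussian_width(sigma):
    # Shogun-style width: k(x,y) = exp(-||x-y||^2 / tau) with tau = 2*sigma^2
    return 2.0 * sigma ** 2

# the family from the example: sigma = 2^-3 ... 2^10, one kernel per exponent
widths = [shogun_gaussian_width(2.0 ** e) for e in range(-3, 11)]
```

Forgetting this conversion silently changes every bandwidth by a constant factor, so it is worth isolating in one helper.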
diff --git a/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp b/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp
deleted file mode 100644
index 5ffe8cdc3c4..00000000000
--- a/examples/undocumented/libshogun/statistics_quadratic_time_mmd.cpp
+++ /dev/null
@@ -1,135 +0,0 @@
-/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 3 of the License, or
- * (at your option) any later version.
- *
- * Written (W) 2013 Heiko Strathmann
- */
-
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-
-using namespace shogun;
-
-void quadratic_time_mmd()
-{
- /* number of examples kept low in order to make things fast */
- index_t m=30;
- index_t dim=2;
- float64_t difference=0.5;
-
- /* streaming data generator for mean shift distributions */
- CMeanShiftDataGenerator* gen_p=new CMeanShiftDataGenerator(0, dim);
- CMeanShiftDataGenerator* gen_q=new CMeanShiftDataGenerator(difference, dim);
-
- /* stream some data from generator */
- CFeatures* feat_p=gen_p->get_streamed_features(m);
- CFeatures* feat_q=gen_q->get_streamed_features(m);
-
- /* set kernel a-priori. usually one would do some kernel selection. See
- * other examples for this. */
- float64_t width=10;
- CGaussianKernel* kernel=new CGaussianKernel(10, width);
-
- /* create quadratic time mmd instance. Note that this constructor
- * copies p and q and does not reference them */
- CQuadraticTimeMMD* mmd=new CQuadraticTimeMMD(kernel, feat_p, feat_q);
-
- /* perform test: compute p-value and test if null-hypothesis is rejected for
- * a test level of 0.05 */
- float64_t alpha=0.05;
-
- /* using permutation (slow, not the most reliable way. Consider pre-
- * computing the kernel when using it, see below).
- * Also, in practice, use at least 250 iterations */
- mmd->set_null_approximation_method(PERMUTATION);
- mmd->set_num_null_samples(3);
- float64_t p_value=mmd->perform_test();
- /* reject if p-value is smaller than test level */
- SG_SPRINT("bootstrap: p!=q: %d\n", p_value<alpha);
-
- /* using spectrum method */
- mmd->set_statistic_type(BIASED);
- mmd->set_null_approximation_method(MMD2_SPECTRUM);
- mmd->set_num_eigenvalues_spectrum(3);
- mmd->set_num_samples_spectrum(250);
- p_value=mmd->perform_test();
- /* reject if p-value is smaller than test level */
- SG_SPRINT("spectrum: p!=q: %d\n", p_value<alpha);
-
- /* using gamma method */
- mmd->set_statistic_type(BIASED);
- mmd->set_null_approximation_method(MMD2_GAMMA);
- p_value=mmd->perform_test();
- /* reject if p-value is smaller than test level */
- SG_SPRINT("gamma: p!=q: %d\n", p_value<alpha);
-
- /* compute type I and II errors using permutation */
- mmd->set_null_approximation_method(PERMUTATION);
- mmd->set_num_null_samples(5);
- index_t num_trials=5;
- SGVector<float64_t> type_I_errors(num_trials);
- SGVector<float64_t> type_II_errors(num_trials);
- SGVector<index_t> inds(2*m);
- inds.range_fill();
- CFeatures* p_and_q=mmd->get_p_and_q();
-
- /* use a precomputed kernel to be faster */
- kernel->init(p_and_q, p_and_q);
- CCustomKernel* precomputed=new CCustomKernel(kernel);
- mmd->set_kernel(precomputed);
- for (index_t i=0; i<num_trials; ++i)
- {
- /* this effectively means that p=q - rejecting is tpye I error */
- CMath::permute_vector(inds);
- precomputed->add_row_subset(inds);
- precomputed->add_col_subset(inds);
- type_I_errors[i]=mmd->perform_test()>alpha;
- precomputed->remove_row_subset();
- precomputed->remove_col_subset();
-
- /* on normal data, this gives type II error */
- type_II_errors[i]=mmd->perform_test()>alpha;
- }
- SG_UNREF(p_and_q);
-
- SG_SPRINT("type I error: %f\n", CStatistics::mean(type_I_errors));
- SG_SPRINT("type II error: %f\n", CStatistics::mean(type_II_errors));
-
- /* clean up */
- SG_UNREF(mmd);
- SG_UNREF(gen_p);
- SG_UNREF(gen_q);
-
- /* convienience constructor of MMD was used, these were not referenced */
- SG_UNREF(feat_p);
- SG_UNREF(feat_q);
-}
-
-int main(int argc, char** argv)
-{
- init_shogun_with_defaults();
-// sg_io->set_loglevel(MSG_DEBUG);
-
- quadratic_time_mmd();
-
- exit_shogun();
- return 0;
-}
-
diff --git a/examples/undocumented/python_modular/statistics_hsic.py b/examples/undocumented/python_modular/statistics_hsic.py
deleted file mode 100644
index ba1f3470bc3..00000000000
--- a/examples/undocumented/python_modular/statistics_hsic.py
+++ /dev/null
@@ -1,107 +0,0 @@
-#!/usr/bin/env python
-#
-# This program is free software you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation either version 3 of the License, or
-# (at your option) any later version.
-#
-# Written (C) 2012-2013 Heiko Strathmann
-#
-import numpy as np
-from math import pi
-
-parameter_list = [[150,3,3]]
-
-def statistics_hsic (n, difference, angle):
- from modshogun import RealFeatures
- from modshogun import DataGenerator
- from modshogun import GaussianKernel
- from modshogun import HSIC
- from modshogun import PERMUTATION, HSIC_GAMMA
- from modshogun import EuclideanDistance
- from modshogun import Statistics, Math
-
- # for reproducable results (the numpy one might not be reproducible across
- # different OS/Python-distributions
- Math.init_random(1)
- np.random.seed(1)
-
- # note that the HSIC has to store kernel matrices
- # which upper bounds the sample size
-
- # use data generator class to produce example data
- data=DataGenerator.generate_sym_mix_gauss(n,difference,angle)
- #plot(data[0], data[1], 'x');show()
-
- # create shogun feature representation
- features_x=RealFeatures(np.array([data[0]]))
- features_y=RealFeatures(np.array([data[1]]))
-
- # compute median data distance in order to use for Gaussian kernel width
- # 0.5*median_distance normally (factor two in Gaussian kernel)
- # However, shoguns kernel width is different to usual parametrization
- # Therefore 0.5*2*median_distance^2
- # Use a subset of data for that, only 200 elements. Median is stable
- subset=np.random.permutation(features_x.get_num_vectors()).astype(np.int32)
- subset=subset[0:200]
- features_x.add_subset(subset)
- dist=EuclideanDistance(features_x, features_x)
- distances=dist.get_distance_matrix()
- features_x.remove_subset()
- median_distance=np.median(distances)
- sigma_x=median_distance**2
- features_y.add_subset(subset)
- dist=EuclideanDistance(features_y, features_y)
- distances=dist.get_distance_matrix()
- features_y.remove_subset()
- median_distance=np.median(distances)
- sigma_y=median_distance**2
- #print "median distance for Gaussian kernel on x:", sigma_x
- #print "median distance for Gaussian kernel on y:", sigma_y
- kernel_x=GaussianKernel(10,sigma_x)
- kernel_y=GaussianKernel(10,sigma_y)
-
- hsic=HSIC(kernel_x,kernel_y,features_x,features_y)
-
- # perform test: compute p-value and test if null-hypothesis is rejected for
- # a test level of 0.05 using different methods to approximate
- # null-distribution
- statistic=hsic.compute_statistic()
- #print "HSIC:", statistic
- alpha=0.05
-
- #print "computing p-value using sampling null"
- hsic.set_null_approximation_method(PERMUTATION)
- # normally, at least 250 iterations should be done, but that takes long
- hsic.set_num_null_samples(100)
- # sampling null allows usage of unbiased or biased statistic
- p_value_boot=hsic.compute_p_value(statistic)
- thresh_boot=hsic.compute_threshold(alpha)
- #print "p_value:", p_value_boot
- #print "threshold for 0.05 alpha:", thresh_boot
- #print "p_value <", alpha, ", i.e. test sais p and q are dependend:", p_value_boot<alpha
- typeIerrors[i]=mmd.perform_test()>alpha
- mmd.set_simulate_h0(False)
-
- typeIIerrors[i]=mmd.perform_test()>alpha
-
- #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors)
-
- return statistic, p_value_boot, p_value_gaussian, null_samples, typeIerrors, typeIIerrors
-
-if __name__=='__main__':
- print('LinearTimeMMD')
- statistics_linear_time_mmd(*parameter_list[0])
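The deleted HSIC example sets Gaussian kernel widths via the median pairwise distance on a 200-point subset. A NumPy sketch of the same median heuristic, restricted to distinct pairs (function name illustrative, not Shogun API):

```python
import numpy as np

def median_heuristic_width(X, max_points=200, seed=1):
    # Shogun-style width from the median heuristic:
    # sigma = median pairwise distance, width = 0.5 * 2 * sigma^2 = sigma^2
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))[:max_points]
    Z = X[idx]
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    pairwise = d[np.triu_indices(len(Z), k=1)]
    return float(np.median(pairwise) ** 2)
```

The median is stable under subsampling, which is why the example only looks at 200 points even for large samples.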
diff --git a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py b/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py
deleted file mode 100644
index 677a48672f2..00000000000
--- a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_combined.py
+++ /dev/null
@@ -1,115 +0,0 @@
-#!/usr/bin/env python
-#
-# This program is free software you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation either version 3 of the License, or
-# (at your option) any later version.
-#
-# Written (C) 2012-2013 Heiko Strathmann
-#
-from numpy import *
-#from pylab import *
-
-parameter_list = [[1000,10,5,3,pi/4, "opt"], [1000,10,5,3,pi/4, "l2"]]
-
-
-def statistics_mmd_kernel_selection_combined(m,distance,stretch,num_blobs,angle,selection_method):
- from modshogun import RealFeatures
- from modshogun import GaussianBlobsDataGenerator
- from modshogun import GaussianKernel, CombinedKernel
- from modshogun import LinearTimeMMD
- try:
- from modshogun import MMDKernelSelectionCombMaxL2
- except ImportError:
- print("MMDKernelSelectionCombMaxL2 not available")
- exit(0)
- try:
- from modshogun import MMDKernelSelectionCombOpt
- except ImportError:
- print("MMDKernelSelectionCombOpt not available")
- exit(0)
-
- from modshogun import PERMUTATION, MMD1_GAUSSIAN
- from modshogun import EuclideanDistance
- from modshogun import Statistics, Math
-
- # init seed for reproducability
- Math.init_random(1)
-
- # note that the linear time statistic is designed for much larger datasets
- # results for this low number will be bad (unstable, type I error wrong)
-
- # streaming data generator
- gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
- gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)
-
- # stream some data and plot
- num_plot=1000
- features=gen_p.get_streamed_features(num_plot)
- features=features.create_merged_copy(gen_q.get_streamed_features(num_plot))
- data=features.get_feature_matrix()
-
- #figure()
- #subplot(2,2,1)
- #grid(True)
- #plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$')
- #title('$X\sim p$')
- #subplot(2,2,2)
- #grid(True)
- #plot(data[0][num_plot+1:2*num_plot], data[1][num_plot+1:2*num_plot], 'b.', label='$x$', alpha=0.5)
- #title('$Y\sim q$')
-
- # create combined kernel with Gaussian kernels inside (shoguns Gaussian kernel is
- # different to the standard form, see documentation)
- sigmas=[2**x for x in range(-3,10)]
- widths=[x*x*2 for x in sigmas]
- combined=CombinedKernel()
- for i in range(len(sigmas)):
- combined.append_kernel(GaussianKernel(10, widths[i]))
-
- # mmd instance using streaming features, blocksize of 10000
- block_size=10000
- mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size)
-
- # kernel selection instance (this can easily replaced by the other methods for selecting
- # combined kernels
- if selection_method=="opt":
- selection=MMDKernelSelectionCombOpt(mmd)
- elif selection_method=="l2":
- selection=MMDKernelSelectionCombMaxL2(mmd)
-
- # perform kernel selection (kernel is automatically set)
- kernel=selection.select_kernel()
- kernel=CombinedKernel.obtain_from_generic(kernel)
- #print "selected kernel weights:", kernel.get_subkernel_weights()
- #subplot(2,2,3)
- #plot(kernel.get_subkernel_weights())
- #title("Kernel weights")
-
- # compute tpye I and II error (use many more trials). Type I error is only
- # estimated to check MMD1_GAUSSIAN method for estimating the null
- # distribution. Note that testing has to happen on difference data than
- # kernel selecting, but the linear time mmd does this implicitly
- mmd.set_null_approximation_method(MMD1_GAUSSIAN)
-
- # number of trials should be larger to compute tight confidence bounds
- num_trials=5;
- alpha=0.05 # test power
- typeIerrors=[0 for x in range(num_trials)]
- typeIIerrors=[0 for x in range(num_trials)]
- for i in range(num_trials):
- # this effectively means that p=q - rejecting is tpye I error
- mmd.set_simulate_h0(True)
- typeIerrors[i]=mmd.perform_test()>alpha
- mmd.set_simulate_h0(False)
-
- typeIIerrors[i]=mmd.perform_test()>alpha
-
- #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors)
-
- return kernel,typeIerrors,typeIIerrors
-
-if __name__=='__main__':
- print('MMDKernelSelectionCombined')
- statistics_mmd_kernel_selection_combined(*parameter_list[0])
- #show()
diff --git a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py b/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py
deleted file mode 100644
index ffa291afbde..00000000000
--- a/examples/undocumented/python_modular/statistics_mmd_kernel_selection_single.py
+++ /dev/null
@@ -1,124 +0,0 @@
-#!/usr/bin/env python
-#
-# This program is free software you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation either version 3 of the License, or
-# (at your option) any later version.
-#
-# Written (C) 2012-2013 Heiko Strathmann
-#
-from numpy import *
-#from pylab import *
-
-parameter_list = [[1000,10,5,3,pi/4, "opt"], [1000,10,5,3,pi/4, "max"], [1000,10,5,3,pi/4, "median"]]
-
-def statistics_mmd_kernel_selection_single(m,distance,stretch,num_blobs,angle,selection_method):
- from modshogun import RealFeatures
- from modshogun import GaussianBlobsDataGenerator
- from modshogun import GaussianKernel, CombinedKernel
- from modshogun import LinearTimeMMD
- from modshogun import MMDKernelSelectionMedian
- from modshogun import MMDKernelSelectionMax
- from modshogun import MMDKernelSelectionOpt
- from modshogun import PERMUTATION, MMD1_GAUSSIAN
- from modshogun import EuclideanDistance
- from modshogun import Statistics, Math
-
- # init seed for reproducability
- Math.init_random(1)
-
- # note that the linear time statistic is designed for much larger datasets
- # results for this low number will be bad (unstable, type I error wrong)
- m=1000
- distance=10
- stretch=5
- num_blobs=3
- angle=pi/4
-
- # streaming data generator
- gen_p=GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
- gen_q=GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)
-
- # stream some data and plot
- num_plot=1000
- features=gen_p.get_streamed_features(num_plot)
- features=features.create_merged_copy(gen_q.get_streamed_features(num_plot))
- data=features.get_feature_matrix()
-
- #figure()
- #subplot(2,2,1)
- #grid(True)
- #plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$')
- #title('$X\sim p$')
- #subplot(2,2,2)
- #grid(True)
- #plot(data[0][num_plot+1:2*num_plot], data[1][num_plot+1:2*num_plot], 'b.', label='$x$', alpha=0.5)
- #title('$Y\sim q$')
-
-
- # create combined kernel with Gaussian kernels inside (shoguns Gaussian kernel is
- # different to the standard form, see documentation)
- sigmas=[2**x for x in range(-3,10)]
- widths=[x*x*2 for x in sigmas]
- combined=CombinedKernel()
- for i in range(len(sigmas)):
- combined.append_kernel(GaussianKernel(10, widths[i]))
-
- # mmd instance using streaming features, blocksize of 10000
- block_size=1000
- mmd=LinearTimeMMD(combined, gen_p, gen_q, m, block_size)
-
- # kernel selection instance (this can easily replaced by the other methods for selecting
- # single kernels
- if selection_method=="opt":
- selection=MMDKernelSelectionOpt(mmd)
- elif selection_method=="max":
- selection=MMDKernelSelectionMax(mmd)
- elif selection_method=="median":
- selection=MMDKernelSelectionMedian(mmd)
-
- # print measures (just for information)
- # in case Opt: ratios of MMD and standard deviation
- # in case Max: MMDs for each kernel
- # Does not work for median method
- if selection_method!="median":
- ratios=selection.compute_measures()
- #print "Measures:", ratios
-
- #subplot(2,2,3)
- #plot(ratios)
- #title('Measures')
-
- # perform kernel selection
- kernel=selection.select_kernel()
- kernel=GaussianKernel.obtain_from_generic(kernel)
- #print "selected kernel width:", kernel.get_width()
-
- # compute tpye I and II error (use many more trials). Type I error is only
- # estimated to check MMD1_GAUSSIAN method for estimating the null
- # distribution. Note that testing has to happen on difference data than
- # kernel selecting, but the linear time mmd does this implicitly
- mmd.set_kernel(kernel)
- mmd.set_null_approximation_method(MMD1_GAUSSIAN)
-
- # number of trials should be larger to compute tight confidence bounds
- num_trials=5;
- alpha=0.05 # test power
- typeIerrors=[0 for x in range(num_trials)]
- typeIIerrors=[0 for x in range(num_trials)]
- for i in range(num_trials):
- # this effectively means that p=q - rejecting is tpye I error
- mmd.set_simulate_h0(True)
- typeIerrors[i]=mmd.perform_test()>alpha
- mmd.set_simulate_h0(False)
-
- typeIIerrors[i]=mmd.perform_test()>alpha
-
- #print "type I error:", mean(typeIerrors), ", type II error:", mean(typeIIerrors)
-
- return kernel,typeIerrors,typeIIerrors
-
-if __name__=='__main__':
- print('MMDKernelSelection')
- statistics_mmd_kernel_selection_single(*parameter_list[0])
- #show()
diff --git a/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py b/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py
deleted file mode 100644
index 343a03c4ed2..00000000000
--- a/examples/undocumented/python_modular/statistics_quadratic_time_mmd.py
+++ /dev/null
@@ -1,115 +0,0 @@
-#!/usr/bin/env python
-#
-# This program is free software you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation either version 3 of the License, or
-# (at your option) any later version.
-#
-# Written (C) 2012-2013 Heiko Strathmann
-#
-import numpy as np
-
-parameter_list = [[30,2,0.5]]
-
-def statistics_quadratic_time_mmd (m,dim,difference):
- from modshogun import RealFeatures
- from modshogun import MeanShiftDataGenerator
- from modshogun import GaussianKernel, CustomKernel
- from modshogun import QuadraticTimeMMD
- from modshogun import PERMUTATION, MMD2_SPECTRUM, MMD2_GAMMA, BIASED, BIASED_DEPRECATED
- from modshogun import Statistics, IntVector, RealVector, Math
-
- # for reproducable results (the numpy one might not be reproducible across
- # different OS/Python-distributions
- Math.init_random(1)
- np.random.seed(1)
-
- # number of examples kept low in order to make things fast
-
- # streaming data generator for mean shift distributions
- gen_p=MeanShiftDataGenerator(0, dim);
- #gen_p.parallel.set_num_threads(1)
- gen_q=MeanShiftDataGenerator(difference, dim);
-
- # stream some data from generator
- feat_p=gen_p.get_streamed_features(m);
- feat_q=gen_q.get_streamed_features(m);
-
- # set kernel a-priori. usually one would do some kernel selection. See
- # other examples for this.
- width=10;
- kernel=GaussianKernel(10, width);
-
- # create quadratic time mmd instance. Note that this constructor
- # copies p and q and does not reference them
- mmd=QuadraticTimeMMD(kernel, feat_p, feat_q);
-
- # perform test: compute p-value and test if null-hypothesis is rejected for
- # a test level of 0.05
- alpha=0.05;
-
- # using permutation (slow, not the most reliable way. Consider pre-
- # computing the kernel when using it, see below).
- # Also, in practice, use at least 250 iterations
- mmd.set_null_approximation_method(PERMUTATION);
- mmd.set_num_null_samples(3);
- p_value_null=mmd.perform_test();
- # reject if p-value is smaller than test level
-	#print "bootstrap: p!=q: ", p_value_null<alpha;
- precomputed.remove_row_subset();
- precomputed.remove_col_subset();
-
- # on normal data, this gives type II error
- type_II_errors[i]=mmd.perform_test()>alpha;
-
- return type_I_errors,type_I_errors,p_value_null,p_value_spectrum,p_value_gamma,
-
-if __name__=='__main__':
- print('QuadraticTimeMMD')
- statistics_quadratic_time_mmd(*parameter_list[0])
diff --git a/src/interfaces/modular/Preprocessor.i b/src/interfaces/modular/Preprocessor.i
index 0086099701e..fdb05143421 100644
--- a/src/interfaces/modular/Preprocessor.i
+++ b/src/interfaces/modular/Preprocessor.i
@@ -29,9 +29,9 @@
%rename(SortWordString) CSortWordString;
/* Feature selection framework */
-%rename(DependenceMaximization) CDependenceMaximization;
-%rename(KernelDependenceMaximization) CDependenceMaximization;
-%rename(BAHSIC) CBAHSIC;
+#%rename(DependenceMaximization) CDependenceMaximization;
+#%rename(KernelDependenceMaximization) CDependenceMaximization;
+#%rename(BAHSIC) CBAHSIC;
%newobject shogun::CFeatureSelection::apply;
%newobject shogun::CFeatureSelection::remove_feats;
@@ -145,7 +145,3 @@ namespace shogun
%include
%include
-
-%include
-%include
-%include
diff --git a/src/interfaces/modular/Preprocessor_includes.i b/src/interfaces/modular/Preprocessor_includes.i
index 95a101c4f86..35076c98410 100644
--- a/src/interfaces/modular/Preprocessor_includes.i
+++ b/src/interfaces/modular/Preprocessor_includes.i
@@ -25,7 +25,4 @@
#include
#include
-#include
-#include
-#include
%}
diff --git a/src/interfaces/modular/Statistics.i b/src/interfaces/modular/Statistics.i
index e542c6c6fb1..4c63f8b4e58 100644
--- a/src/interfaces/modular/Statistics.i
+++ b/src/interfaces/modular/Statistics.i
@@ -7,45 +7,36 @@
* Written (W) 2012-2013 Heiko Strathmann
*/
+/* These functions return new objects */
+%newobject shogun::CTwoDistributionTest::compute_distance(CDistance*);
+%newobject shogun::CTwoDistributionTest::compute_joint_distance(CDistance*);
+%newobject shogun::CQuadraticTimeMMD::get_p_and_q();
+
/* Remove C Prefix */
%rename(HypothesisTest) CHypothesisTest;
+%rename(OneDistributionTest) COneDistributionTest;
+%rename(TwoDistributionTest) CTwoDistributionTest;
%rename(IndependenceTest) CIndependenceTest;
%rename(TwoSampleTest) CTwoSampleTest;
-%rename(KernelTwoSampleTest) CKernelTwoSampleTest;
+%rename(MMD) CMMD;
%rename(StreamingMMD) CStreamingMMD;
%rename(LinearTimeMMD) CLinearTimeMMD;
+%rename(BTestMMD) CBTestMMD;
%rename(QuadraticTimeMMD) CQuadraticTimeMMD;
-%rename(KernelIndependenceTest) CKernelIndependenceTest;
-%rename(HSIC) CHSIC;
-%rename(NOCCO) CNOCCO;
-%rename(KernelMeanMatching) CKernelMeanMatching;
-%rename(KernelSelection) CKernelSelection;
-%rename(MMDKernelSelection) CMMDKernelSelection;
-%rename(MMDKernelSelectionComb) CMMDKernelSelectionComb;
-%rename(MMDKernelSelectionMedian) CMMDKernelSelectionMedian;
-%rename(MMDKernelSelectionMax) CMMDKernelSelectionMax;
-%rename(MMDKernelSelectionOpt) CMMDKernelSelectionOpt;
-%rename(MMDKernelSelectionCombOpt) CMMDKernelSelectionCombOpt;
-%rename(MMDKernelSelectionCombMaxL2) CMMDKernelSelectionCombMaxL2;
-
+%rename(MultiKernelQuadraticTimeMMD) CMultiKernelQuadraticTimeMMD;
+%rename(KernelSelectionStrategy) CKernelSelectionStrategy;
/* Include Class Headers to make them visible from within the target language */
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
-%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
+%include
diff --git a/src/interfaces/modular/Statistics_includes.i b/src/interfaces/modular/Statistics_includes.i
index 8cf811edeac..48aa564c620 100644
--- a/src/interfaces/modular/Statistics_includes.i
+++ b/src/interfaces/modular/Statistics_includes.i
@@ -1,22 +1,16 @@
%{
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
- #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
+ #include
%}
diff --git a/src/shogun/distance/Distance.cpp b/src/shogun/distance/Distance.cpp
index e3a581a9a04..7284d412299 100644
--- a/src/shogun/distance/Distance.cpp
+++ b/src/shogun/distance/Distance.cpp
@@ -56,13 +56,13 @@ bool CDistance::init(CFeatures* l, CFeatures* r)
{
REQUIRE(check_compatibility(l, r), "Features are not compatible!\n");
- //remove references to previous features
- remove_lhs_and_rhs();
-
//increase reference counts
SG_REF(l);
SG_REF(r);
+ //remove references to previous features
+ remove_lhs_and_rhs();
+
lhs=l;
rhs=r;
diff --git a/src/shogun/features/DenseFeatures.cpp b/src/shogun/features/DenseFeatures.cpp
index d88b5cd2fcd..da012aa2031 100644
--- a/src/shogun/features/DenseFeatures.cpp
+++ b/src/shogun/features/DenseFeatures.cpp
@@ -641,14 +641,14 @@ template
CFeatures* CDenseFeatures::shallow_subset_copy()
{
CFeatures* shallow_copy_features=NULL;
-
+
SG_SDEBUG("Using underlying feature matrix with %d dimensions and %d feature vectors!\n", num_features, num_vectors);
SGMatrix shallow_copy_matrix(feature_matrix);
shallow_copy_features=new CDenseFeatures(shallow_copy_matrix);
SG_REF(shallow_copy_features);
if (m_subset_stack->has_subsets())
shallow_copy_features->add_subset(m_subset_stack->get_last_subset()->get_subset_idx());
-
+
return shallow_copy_features;
}
diff --git a/src/shogun/features/streaming/StreamingDenseFeatures.cpp b/src/shogun/features/streaming/StreamingDenseFeatures.cpp
index d47a1ec49d0..1db8f72ac5a 100644
--- a/src/shogun/features/streaming/StreamingDenseFeatures.cpp
+++ b/src/shogun/features/streaming/StreamingDenseFeatures.cpp
@@ -70,6 +70,7 @@ template void CStreamingDenseFeatures::reset_stream()
parser.exit_parser();
parser.init(working_file, has_labels, 1);
parser.set_free_vector_after_release(false);
+ parser.set_free_vectors_on_destruct(false);
parser.start_parser();
}
}
diff --git a/src/shogun/io/streaming/InputParser.h b/src/shogun/io/streaming/InputParser.h
index 73fdec0812f..652d0db3a1d 100644
--- a/src/shogun/io/streaming/InputParser.h
+++ b/src/shogun/io/streaming/InputParser.h
@@ -428,6 +428,7 @@ template
else
example_type = E_UNLABELLED;
+ SG_UNREF(examples_ring);
examples_ring = new CParseBuffer(size);
SG_REF(examples_ring);
@@ -466,7 +467,8 @@ template
}
SG_SDEBUG("creating parse thread\n")
- examples_ring->init_vector();
+ if (examples_ring)
+ examples_ring->init_vector();
#ifdef HAVE_CXX11
parse_thread.reset(new std::thread(&parse_loop_entry_point, this));
#elif defined(HAVE_PTHREAD)
diff --git a/src/shogun/kernel/CombinedKernel.cpp b/src/shogun/kernel/CombinedKernel.cpp
index 31292136f8c..9c39a2c8fb7 100644
--- a/src/shogun/kernel/CombinedKernel.cpp
+++ b/src/shogun/kernel/CombinedKernel.cpp
@@ -811,7 +811,7 @@ CCombinedKernel* CCombinedKernel::obtain_from_generic(CKernel* kernel)
if (kernel->get_kernel_type()!=K_COMBINED)
{
SG_SERROR("CCombinedKernel::obtain_from_generic(): provided kernel is "
- "not of type CGaussianKernel!\n");
+ "not of type CCombinedKernel!\n");
}
/* since an additional reference is returned */
diff --git a/src/shogun/kernel/CustomKernel.h b/src/shogun/kernel/CustomKernel.h
index 613a71a852d..165e9a724d9 100644
--- a/src/shogun/kernel/CustomKernel.h
+++ b/src/shogun/kernel/CustomKernel.h
@@ -550,13 +550,13 @@ class CCustomKernel: public CKernel
*/
SGMatrix get_float32_kernel_matrix()
{
- REQUIRE(!m_row_subset_stack, "%s::get_float32_kernel_matrix(): "
+ REQUIRE(!m_row_subset_stack->has_subsets(), "%s::get_float32_kernel_matrix(): "
"Not possible with row subset active! If you want to"
" create a %s from another one with a subset, use "
"get_kernel_matrix() and the SGMatrix constructor!\n",
get_name(), get_name());
- REQUIRE(!m_col_subset_stack, "%s::get_float32_kernel_matrix(): "
+ REQUIRE(!m_col_subset_stack->has_subsets(), "%s::get_float32_kernel_matrix(): "
"Not possible with collumn subset active! If you want to"
" create a %s from another one with a subset, use "
"get_kernel_matrix() and the SGMatrix constructor!\n",
diff --git a/src/shogun/kernel/ShiftInvariantKernel.h b/src/shogun/kernel/ShiftInvariantKernel.h
index b52544b3277..3f75646a45d 100644
--- a/src/shogun/kernel/ShiftInvariantKernel.h
+++ b/src/shogun/kernel/ShiftInvariantKernel.h
@@ -39,6 +39,11 @@
namespace shogun
{
+namespace internal
+{
+ class KernelManager;
+}
+
/** @brief Base class for the family of kernel functions that only depend on
* the difference of the inputs, i.e. whose values does not change if the
* inputs are shifted by the same amount. More precisely,
@@ -49,6 +54,9 @@ namespace shogun
*/
class CShiftInvariantKernel: public CKernel
{
+
+ friend class internal::KernelManager;
+
public:
/** Default constructor. */
CShiftInvariantKernel();
diff --git a/src/shogun/labels/BinaryLabels.cpp b/src/shogun/labels/BinaryLabels.cpp
index f46890d9718..6f1e93f0484 100644
--- a/src/shogun/labels/BinaryLabels.cpp
+++ b/src/shogun/labels/BinaryLabels.cpp
@@ -149,6 +149,6 @@ CLabels* CBinaryLabels::shallow_subset_copy()
((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector);
if (m_subset_stack->has_subsets())
shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx());
-
+
return shallow_copy_labels;
}
diff --git a/src/shogun/labels/BinaryLabels.h b/src/shogun/labels/BinaryLabels.h
index 248608483e2..462397596bd 100644
--- a/src/shogun/labels/BinaryLabels.h
+++ b/src/shogun/labels/BinaryLabels.h
@@ -119,7 +119,6 @@ class CBinaryLabels : public CDenseLabels
#ifndef SWIG // SWIG should skip this part
virtual CLabels* shallow_subset_copy();
#endif
-
};
}
#endif
diff --git a/src/shogun/labels/MulticlassLabels.cpp b/src/shogun/labels/MulticlassLabels.cpp
index ef65ea092e0..3133efdbbfd 100644
--- a/src/shogun/labels/MulticlassLabels.cpp
+++ b/src/shogun/labels/MulticlassLabels.cpp
@@ -144,6 +144,6 @@ CLabels* CMulticlassLabels::shallow_subset_copy()
((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector);
if (m_subset_stack->has_subsets())
shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx());
-
- return shallow_copy_labels;
+
+ return shallow_copy_labels;
}
diff --git a/src/shogun/labels/RegressionLabels.cpp b/src/shogun/labels/RegressionLabels.cpp
index eb85c368526..5870eafd2d2 100644
--- a/src/shogun/labels/RegressionLabels.cpp
+++ b/src/shogun/labels/RegressionLabels.cpp
@@ -35,6 +35,6 @@ CLabels* CRegressionLabels::shallow_subset_copy()
((CDenseLabels*) shallow_copy_labels)->set_labels(shallow_copy_vector);
if (m_subset_stack->has_subsets())
shallow_copy_labels->add_subset(m_subset_stack->get_last_subset()->get_subset_idx());
-
+
return shallow_copy_labels;
}
diff --git a/src/shogun/labels/RegressionLabels.h b/src/shogun/labels/RegressionLabels.h
index 831b69961c8..56698d149ce 100644
--- a/src/shogun/labels/RegressionLabels.h
+++ b/src/shogun/labels/RegressionLabels.h
@@ -69,7 +69,6 @@ class CRegressionLabels : public CDenseLabels
#ifndef SWIG // SWIG should skip this part
virtual CLabels* shallow_subset_copy();
#endif
-
};
}
#endif
diff --git a/src/shogun/machine/BaggingMachine.cpp b/src/shogun/machine/BaggingMachine.cpp
index 5edeef58c7c..ad3aa046082 100644
--- a/src/shogun/machine/BaggingMachine.cpp
+++ b/src/shogun/machine/BaggingMachine.cpp
@@ -76,7 +76,7 @@ SGVector CBaggingMachine::apply_get_outputs(CFeatures* data)
SGMatrix output(data->get_num_vectors(), m_num_bags);
output.zero();
-
+
#pragma omp parallel for
for (int32_t i = 0; i < m_num_bags; ++i)
{
@@ -178,7 +178,7 @@ bool CBaggingMachine::train_machine(CFeatures* data)
labels->remove_subset();
#pragma omp critical
- {
+ {
// get out of bag indexes
CDynamicArray* oob = get_oob_indices(idx);
m_oob_indices->push_back(oob);
diff --git a/src/shogun/multiclass/tree/CARTree.cpp b/src/shogun/multiclass/tree/CARTree.cpp
index cd7c261da1d..ac0a310ed42 100644
--- a/src/shogun/multiclass/tree/CARTree.cpp
+++ b/src/shogun/multiclass/tree/CARTree.cpp
@@ -105,7 +105,7 @@ CMulticlassLabels* CCARTree::apply_multiclass(CFeatures* data)
// apply multiclass starting from root
 	bnode_t* current=dynamic_cast<bnode_t*>(get_root());
-
+
REQUIRE(current, "Tree machine not yet trained.\n");
CLabels* ret=apply_from_current_node(dynamic_cast*>(data), current);
@@ -289,7 +289,7 @@ bool CCARTree::train_machine(CFeatures* data)
void CCARTree::set_sorted_features(SGMatrix& sorted_feats, SGMatrix& sorted_indices)
{
- m_pre_sort=true;
+ m_pre_sort=true;
m_sorted_features=sorted_feats;
m_sorted_indices=sorted_indices;
}
@@ -414,7 +414,7 @@ CBinaryTreeMachineNode* CCARTree::CARTtrain(CFeatures* data, SG
int32_t c_left=-1;
int32_t c_right=-1;
int32_t best_attribute;
-
+
SGVector indices(num_vecs);
if (m_pre_sort)
{
@@ -532,13 +532,13 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
SGVector& left, SGVector& right, SGVector& is_left_final, int32_t &num_missing_final, int32_t &count_left,
int32_t &count_right, int32_t subset_size, const SGVector& active_indices)
{
- SGVector labels_vec=(dynamic_cast(labels))->get_labels();
+ SGVector labels_vec=(dynamic_cast(labels))->get_labels();
int32_t num_vecs=labels->get_num_labels();
int32_t num_feats;
if (m_pre_sort)
num_feats=mat.num_cols;
else
- num_feats=mat.num_rows;
+ num_feats=mat.num_rows;
int32_t n_ulabels;
SGVector ulabels=get_unique_labels(labels_vec,n_ulabels);
@@ -567,7 +567,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
}
}
}
-
+
SGVector idx(num_feats);
idx.range_fill();
if (subset_size)
@@ -579,7 +579,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
float64_t max_gain=MIN_SPLIT_GAIN;
int32_t best_attribute=-1;
float64_t best_threshold=0;
-
+
SGVector indices_mask;
SGVector count_indices(mat.num_rows);
count_indices.zero();
@@ -603,6 +603,8 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
{
SGVector feats(num_vecs);
SGVector sorted_args(num_vecs);
+ SGVector temp_count_indices(count_indices.size());
+ memcpy(temp_count_indices.vector, count_indices.vector, sizeof(int32_t)*count_indices.size());
if (m_pre_sort)
{
@@ -708,7 +710,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
if(dupes[j]!=j)
is_left[j]=is_left[dupes[j]];
}
-
+
float64_t g=0;
if (m_mode==PT_MULTICLASS)
g=gain(wleft,wright,total_wclasses);
@@ -806,7 +808,7 @@ int32_t CCARTree::compute_best_attribute(const SGMatrix& mat, const S
count_right=1;
if (m_pre_sort)
{
- SGVector temp_vec(mat.get_column_vector(best_attribute), mat.num_rows, false);
+ SGVector temp_vec(mat.get_column_vector(best_attribute), mat.num_rows, false);
SGVector sorted_indices(m_sorted_indices.get_column_vector(best_attribute), mat.num_rows, false);
int32_t count=0;
for(int32_t i=0;i& mat, const S
if(dupes[i]!=i)
is_left_final[i]=is_left_final[dupes[i]];
}
-
- }
+
+ }
else
{
for (int32_t i=0;i& feats, co
{
Map map_weights(weights.vector, weights.size());
- Map map_feats(feats.vector, weights.size());
+ Map map_feats(feats.vector, weights.size());
float64_t mean=map_weights.dot(map_feats);
total_weight=map_weights.sum();
@@ -1104,7 +1106,7 @@ CLabels* CCARTree::apply_from_current_node(CDenseFeatures* feats, bno
{
int32_t num_vecs=feats->get_num_vectors();
REQUIRE(num_vecs>0, "No data provided in apply\n");
-
+
SGVector labels(num_vecs);
for (int32_t i=0;i* tree)
{
REQUIRE(tree, "Tree not provided for pruning.\n");
-
+
CDynamicObjectArray* trees=new CDynamicObjectArray();
SG_UNREF(m_alphas);
m_alphas=new CDynamicArray();
diff --git a/src/shogun/preprocessor/BAHSIC.h b/src/shogun/preprocessor/BAHSIC.h
deleted file mode 100644
index e58a89e9d1d..00000000000
--- a/src/shogun/preprocessor/BAHSIC.h
+++ /dev/null
@@ -1,92 +0,0 @@
-/*
- * Copyright (c) The Shogun Machine Learning Toolbox
- * Written (w) 2014 Soumyajit De
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this
- * list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- * this list of conditions and the following disclaimer in the documentation
- * and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
- * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- * The views and conclusions contained in the software and documentation are those
- * of the authors and should not be interpreted as representing official policies,
- * either expressed or implied, of the Shogun Development Team.
- */
-
-#ifndef BAHSIC_H__
-#define BAHSIC_H__
-
-#include
-#include
-
-namespace shogun
-{
-
-/** @brief Class CBAHSIC, that extends CKernelDependenceMaximization and uses
- * HSIC [1] to compute dependence measures for feature selection using a
- * backward elimination approach as described in [1]. This class serves as a
- * convenience class that initializes the CDependenceMaximization#m_estimator
- * with an instance of CHSIC and allows only shogun::BACKWARD_ELIMINATION algorithm
- * to use which is set internally. Therefore, trying to use other algorithms
- * by set_algorithm() will not work. Plese see the class documentation of CHSIC
- * and [2] for more details on mathematical description of HSIC.
- *
- * Refrences:
- * [1] Song, Le and Bedo, Justin and Borgwardt, Karsten M. and Gretton, Arthur
- * and Smola, Alex. (2007). Gene Selection via the BAHSIC Family of Algorithms.
- * Journal Bioinformatics. Volume 23 Issue Pages i490-i498. Oxford University
- * Press Oxford, UK
- * [2]: Gretton, A., Fukumizu, K., Teo, C., & Song, L. (2008). A kernel
- * statistical test of independence. Advances in Neural Information Processing
- * Systems, 1-8.
- */
-class CBAHSIC : public CKernelDependenceMaximization
-{
-public:
- /** Default constructor */
- CBAHSIC();
-
- /** Destructor */
- virtual ~CBAHSIC();
-
- /**
- * Since only shogun::BACKWARD_ELIMINATION algorithm is applicable for BAHSIC,
- * and this is set internally, this method is overridden to prevent this
- * to be set from public API.
- *
- * @param algorithm the feature selection algorithm to use
- */
- virtual void set_algorithm(EFeatureSelectionAlgorithm algorithm);
-
- /** @return the preprocessor type */
- virtual EPreprocessorType get_type() const;
-
- /** @return the class name */
- virtual const char* get_name() const
- {
- return "BAHSIC";
- }
-
-private:
- /** Register params and initialize with default values */
- void initialize_parameters();
-
-};
-
-}
-#endif // BAHSIC_H__
diff --git a/src/shogun/preprocessor/DependenceMaximization.cpp b/src/shogun/preprocessor/DependenceMaximization.cpp
index ac636f7fad9..16cb71576bc 100644
--- a/src/shogun/preprocessor/DependenceMaximization.cpp
+++ b/src/shogun/preprocessor/DependenceMaximization.cpp
@@ -31,7 +31,7 @@
#include
#include
#include
-#include
+#include
#include
#include
diff --git a/src/shogun/preprocessor/KernelDependenceMaximization.cpp b/src/shogun/preprocessor/KernelDependenceMaximization.cpp
deleted file mode 100644
index b8292ab166c..00000000000
--- a/src/shogun/preprocessor/KernelDependenceMaximization.cpp
+++ /dev/null
@@ -1,141 +0,0 @@
-/*
- * Copyright (c) The Shogun Machine Learning Toolbox
- * Written (w) 2014 Soumyajit De
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this
- * list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- * this list of conditions and the following disclaimer in the documentation
- * and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
- * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- * The views and conclusions contained in the software and documentation are those
- * of the authors and should not be interpreted as representing official policies,
- * either expressed or implied, of the Shogun Development Team.
- */
-
-#include
-#include
-#include
-
-using namespace shogun;
-
-CKernelDependenceMaximization::CKernelDependenceMaximization()
- : CDependenceMaximization()
-{
- initialize_parameters();
-}
-
-void CKernelDependenceMaximization::initialize_parameters()
-{
- SG_ADD((CSGObject**)&m_kernel_features, "kernel_features",
- "the kernel to be used for features", MS_NOT_AVAILABLE);
- SG_ADD((CSGObject**)&m_kernel_labels, "kernel_labels",
- "the kernel to be used for labels", MS_NOT_AVAILABLE);
-
- m_kernel_features=NULL;
- m_kernel_labels=NULL;
-}
-
-CKernelDependenceMaximization::~CKernelDependenceMaximization()
-{
- SG_UNREF(m_kernel_features);
- SG_UNREF(m_kernel_labels);
-}
-
-void CKernelDependenceMaximization::precompute()
-{
- SG_DEBUG("Entering!\n");
-
- REQUIRE(m_labels_feats, "Features for labels is not initialized!\n");
- REQUIRE(m_kernel_labels, "Kernel for labels is not initialized!\n");
-
- // ASSERT here because the estimator is set internally and cannot
- // be set via public API
- ASSERT(m_estimator);
-
- CFeatureSelection::precompute();
-
- // make sure that we have an instance of CKernelIndependenceTest via
- // proper cast and set this kernel to the estimator
- CKernelIndependenceTest* estimator
- =dynamic_cast(m_estimator);
- ASSERT(estimator);
-
- // precompute the kernel for labels
- m_kernel_labels->init(m_labels_feats, m_labels_feats);
- CCustomKernel* precomputed
- =new CCustomKernel(m_kernel_labels->get_kernel_matrix());
-
- // replace the kernel for labels with precomputed kernel
- SG_UNREF(m_kernel_labels);
- m_kernel_labels=precomputed;
- SG_REF(m_kernel_labels);
-
- // we can safely SG_UNREF the feature object for labels now
- SG_UNREF(m_labels_feats);
- m_labels_feats=NULL;
-
- // finally set this as kernel for the labels
- estimator->set_kernel_q(m_kernel_labels);
-
- SG_DEBUG("Leaving!\n");
-}
-
-void CKernelDependenceMaximization::set_kernel_features(CKernel* kernel)
-{
- // sanity check. using assert here because estimator instances are
- // set internally and cannot be set via public API.
- ASSERT(m_estimator);
- CKernelIndependenceTest* estimator
- =dynamic_cast(m_estimator);
- ASSERT(estimator);
-
- SG_REF(kernel);
- SG_UNREF(m_kernel_features);
- m_kernel_features=kernel;
-
- estimator->set_kernel_p(m_kernel_features);
-}
-
-void CKernelDependenceMaximization::set_kernel_labels(CKernel* kernel)
-{
- // sanity check. using assert here because estimator instances are
- // set internally and cannot be set via public API.
- ASSERT(m_estimator);
- CKernelIndependenceTest* estimator
- =dynamic_cast(m_estimator);
- ASSERT(estimator);
-
- SG_REF(kernel);
- SG_UNREF(m_kernel_labels);
- m_kernel_labels=kernel;
-
- estimator->set_kernel_q(m_kernel_labels);
-}
-
-CKernel* CKernelDependenceMaximization::get_kernel_features() const
-{
- SG_REF(m_kernel_features);
- return m_kernel_features;
-}
-
-CKernel* CKernelDependenceMaximization::get_kernel_labels() const
-{
- SG_REF(m_kernel_labels);
- return m_kernel_labels;
-}
diff --git a/src/shogun/preprocessor/KernelDependenceMaximization.h b/src/shogun/preprocessor/KernelDependenceMaximization.h
deleted file mode 100644
index 9d4159088dd..00000000000
--- a/src/shogun/preprocessor/KernelDependenceMaximization.h
+++ /dev/null
@@ -1,105 +0,0 @@
-/*
- * Copyright (c) The Shogun Machine Learning Toolbox
- * Written (w) 2014 Soumyajit De
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright notice, this
- * list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright notice,
- * this list of conditions and the following disclaimer in the documentation
- * and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
- * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
- * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
- * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
- * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
- * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- * The views and conclusions contained in the software and documentation are those
- * of the authors and should not be interpreted as representing official policies,
- * either expressed or implied, of the Shogun Development Team.
- */
-
-#ifndef KERNEL_DEPENDENCE_MAXIMIZATION_H__
-#define KERNEL_DEPENDENCE_MAXIMIZATION_H__
-
-#include
-#include
-
-namespace shogun
-{
-
-class CFeatures;
-class CKernelSelection;
-
-/** @brief Class CKernelDependenceMaximization, that uses an implementation
- * of CKernelIndependenceTest to compute dependence measures for feature
- * selection. Different kernels are used for labels and data. For the sake
- * of computational convenience, the precompute() method is overridden to
- * precompute the kernel for labels and save as an instance of CCustomKernel
- */
-class CKernelDependenceMaximization : public CDependenceMaximization
-{
-public:
- /** Default constructor */
- CKernelDependenceMaximization();
-
- /** Destructor */
- virtual ~CKernelDependenceMaximization();
-
- /** @param kernel the kernel for features (data) */
- void set_kernel_features(CKernel* kernel);
-
- /** @return the kernel for features */
- CKernel* get_kernel_features() const;
-
- /** @param kernel the kernel for labels */
- void set_kernel_labels(CKernel* kernel);
-
- /** @return the kernel for labels */
- CKernel* get_kernel_labels() const;
-
- /**
- * Abstract method which is overridden in the subclasses to set accepted
- * feature selection algorithm
- *
- * @param algorithm the feature selection algorithm to use
- */
- virtual void set_algorithm(EFeatureSelectionAlgorithm algorithm)=0;
-
- /** @return the class name */
- virtual const char* get_name() const
- {
- return "KernelDependenceMaximization";
- }
-
-protected:
- /**
- * Precomputes the kernel on labels and replaces the #m_kernel_labels
- * with an instance of CCustomKernel. Labels features are set via
- * CDependenceMaximization::set_labels call.
- */
- virtual void precompute();
-
- /** The kernel for data (features) to be used in CKernelIndependenceTest */
- CKernel* m_kernel_features;
-
- /** The kernel for labels to be used in CKernelIndependenceTest */
- CKernel* m_kernel_labels;
-
-private:
- /** Register params and initialize with default values */
- void initialize_parameters();
-
-};
-
-}
-#endif // KERNEL_DEPENDENCE_MAXIMIZATION_H__
diff --git a/src/shogun/statistical_testing/BTestMMD.cpp b/src/shogun/statistical_testing/BTestMMD.cpp
new file mode 100644
index 00000000000..d1be47bd3dc
--- /dev/null
+++ b/src/shogun/statistical_testing/BTestMMD.cpp
@@ -0,0 +1,117 @@
+/*
+ * Restructuring Shogun's statistical hypothesis testing framework.
+ * Copyright (C) 2016 Soumyajit De
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+using namespace shogun;
+using namespace internal;
+
+CBTestMMD::CBTestMMD() : CStreamingMMD()
+{
+}
+
+CBTestMMD::~CBTestMMD()
+{
+}
+
+void CBTestMMD::set_blocksize(index_t blocksize)
+{
+ get_data_mgr().set_blocksize(blocksize);
+}
+
+void CBTestMMD::set_num_blocks_per_burst(index_t num_blocks_per_burst)
+{
+ get_data_mgr().set_num_blocks_per_burst(num_blocks_per_burst);
+}
+
+const std::function<float32_t(SGMatrix<float32_t>)> CBTestMMD::get_direct_estimation_method() const
+{
+ return mmd::WithinBlockDirect();
+}
+
+float64_t CBTestMMD::normalize_statistic(float64_t statistic) const
+{
+ const DataManager& data_mgr=get_data_mgr();
+ const index_t Nx=data_mgr.num_samples_at(0);
+ const index_t Ny=data_mgr.num_samples_at(1);
+ const index_t Bx=data_mgr.blocksize_at(0);
+ const index_t By=data_mgr.blocksize_at(1);
+ return Nx*Ny*statistic*CMath::sqrt((Bx+By)/float64_t(Nx+Ny))/(Nx+Ny);
+}
+
+const float64_t CBTestMMD::normalize_variance(float64_t variance) const
+{
+ const DataManager& data_mgr=get_data_mgr();
+ const index_t Bx=data_mgr.blocksize_at(0);
+ const index_t By=data_mgr.blocksize_at(1);
+ return variance*CMath::sq(Bx*By/float64_t(Bx+By));
+}
+
+float64_t CBTestMMD::compute_p_value(float64_t statistic)
+{
+ float64_t result=0;
+ switch (get_null_approximation_method())
+ {
+ case NAM_MMD1_GAUSSIAN:
+ {
+ float64_t sigma_sq=compute_variance();
+ float64_t std_dev=CMath::sqrt(sigma_sq);
+ result=1.0-CStatistics::normal_cdf(statistic, std_dev);
+ break;
+ }
+ default:
+ {
+ result=CHypothesisTest::compute_p_value(statistic);
+ break;
+ }
+ }
+ return result;
+}
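For context on the `NAM_MMD1_GAUSSIAN` branch above: under the null, the normalized statistic is approximately Gaussian with zero mean, so the p-value is one minus the Gaussian CDF evaluated at the statistic. A minimal standalone sketch of that computation (`normal_cdf` and `gaussian_p_value` are hypothetical helpers for illustration, not Shogun API):

```cpp
#include <cassert>
#include <cmath>

// CDF of N(0, std_dev^2) at x, via the complementary error function.
double normal_cdf(double x, double std_dev)
{
	return 0.5 * std::erfc(-x / (std_dev * std::sqrt(2.0)));
}

// One-sided p-value of an observed statistic under N(0, variance):
// p = 1 - Phi(statistic / sigma), mirroring the branch above.
double gaussian_p_value(double statistic, double variance)
{
	return 1.0 - normal_cdf(statistic, std::sqrt(variance));
}
```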
+
+float64_t CBTestMMD::compute_threshold(float64_t alpha)
+{
+ float64_t result=0;
+ switch (get_null_approximation_method())
+ {
+ case NAM_MMD1_GAUSSIAN:
+ {
+ float64_t sigma_sq=compute_variance();
+ float64_t std_dev=CMath::sqrt(sigma_sq);
+ result=1.0-CStatistics::inverse_normal_cdf(1-alpha, 0, std_dev);
+ break;
+ }
+ default:
+ {
+ result=CHypothesisTest::compute_threshold(alpha);
+ break;
+ }
+ }
+ return result;
+}
+
+const char* CBTestMMD::get_name() const
+{
+ return "BTestMMD";
+}
diff --git a/src/shogun/statistical_testing/BTestMMD.h b/src/shogun/statistical_testing/BTestMMD.h
new file mode 100644
index 00000000000..03439818c17
--- /dev/null
+++ b/src/shogun/statistical_testing/BTestMMD.h
@@ -0,0 +1,48 @@
+/*
+ * Restructuring Shogun's statistical hypothesis testing framework.
+ * Copyright (C) 2016 Soumyajit De
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef B_TEST_MMD_H_
+#define B_TEST_MMD_H_
+
+#include
+
+namespace shogun
+{
+
+class CBTestMMD : public CStreamingMMD
+{
+public:
+ typedef std::function<float32_t(SGMatrix<float32_t>)> operation;
+ CBTestMMD();
+ virtual ~CBTestMMD();
+
+ void set_blocksize(index_t blocksize);
+ void set_num_blocks_per_burst(index_t num_blocks_per_burst);
+
+ virtual float64_t compute_p_value(float64_t statistic);
+ virtual float64_t compute_threshold(float64_t alpha);
+
+ virtual const char* get_name() const;
+private:
+ virtual const operation get_direct_estimation_method() const;
+ virtual float64_t normalize_statistic(float64_t statistic) const;
+ virtual const float64_t normalize_variance(float64_t variance) const;
+};
+
+}
+#endif // B_TEST_MMD_H_
diff --git a/src/shogun/statistics/HypothesisTest.cpp b/src/shogun/statistical_testing/HypothesisTest.cpp
similarity index 50%
rename from src/shogun/statistics/HypothesisTest.cpp
rename to src/shogun/statistical_testing/HypothesisTest.cpp
index d8167fd9e24..9afd853d094 100644
--- a/src/shogun/statistics/HypothesisTest.cpp
+++ b/src/shogun/statistical_testing/HypothesisTest.cpp
@@ -1,6 +1,7 @@
/*
* Copyright (c) The Shogun Machine Learning Toolbox
- * Written (w) 2012-2013 Heiko Strathmann
+ * Written (w) 2012 - 2013 Heiko Strathmann
+ * Written (w) 2014 - 2016 Soumyajit De
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -28,98 +29,84 @@
* either expressed or implied, of the Shogun Development Team.
*/
-#include
-#include
+#include
#include
#include
+#include
+#include
using namespace shogun;
+using namespace internal;
-CHypothesisTest::CHypothesisTest() : CSGObject()
+struct CHypothesisTest::Self
+{
+ explicit Self(index_t num_distributions);
+ DataManager data_mgr;
+};
+
+CHypothesisTest::Self::Self(index_t num_distributions) : data_mgr(num_distributions)
{
- init();
}
-CHypothesisTest::~CHypothesisTest()
+CHypothesisTest::CHypothesisTest()
{
+ SG_WARNING("An empty instance of this class should not be used! If you are seeing \
+ this error, please contact Shogun developers!\n");
}
-void CHypothesisTest::init()
+CHypothesisTest::CHypothesisTest(index_t num_distributions) : CSGObject()
{
- SG_ADD(&m_num_null_samples, "num_null_samples",
- "Number of permutation iterations for sampling null",
- MS_NOT_AVAILABLE);
- SG_ADD((machine_int_t*)&m_null_approximation_method,
- "null_approximation_method",
- "Method for approximating null distribution",
- MS_NOT_AVAILABLE);
+ self=std::unique_ptr<Self>(new CHypothesisTest::Self(num_distributions));
+}
- m_num_null_samples=250;
- m_null_approximation_method=PERMUTATION;
+CHypothesisTest::~CHypothesisTest()
+{
}
-void CHypothesisTest::set_null_approximation_method(
- ENullApproximationMethod null_approximation_method)
+void CHypothesisTest::set_train_test_mode(bool on)
{
- m_null_approximation_method=null_approximation_method;
+ self->data_mgr.set_train_test_mode(on);
}
-void CHypothesisTest::set_num_null_samples(index_t num_null_samples)
+void CHypothesisTest::set_train_test_ratio(float64_t ratio)
{
- m_num_null_samples=num_null_samples;
+ self->data_mgr.set_train_test_ratio(ratio);
+ self->data_mgr.reset();
}
float64_t CHypothesisTest::compute_p_value(float64_t statistic)
{
- float64_t result=0;
-
- if (m_null_approximation_method==PERMUTATION)
- {
- /* sample a bunch of MMD values from null distribution */
- SGVector<float64_t> values=sample_null();
-
- /* find out percentile of parameter "statistic" in null distribution */
- CMath::qsort(values);
- float64_t i=values.find_position_to_insert(statistic);
-
- /* return corresponding p-value */
- result=1.0-i/values.vlen;
- }
- else
- SG_ERROR("Unknown method to approximate null distribution!\n");
-
- return result;
+ SGVector<float64_t> values=sample_null();
+ std::sort(values.vector, values.vector + values.vlen);
+ float64_t i=values.find_position_to_insert(statistic);
+ return 1.0-i/values.vlen;
}
float64_t CHypothesisTest::compute_threshold(float64_t alpha)
{
- float64_t result=0;
-
- if (m_null_approximation_method==PERMUTATION)
- {
- /* sample a bunch of MMD values from null distribution */
- SGVector<float64_t> values=sample_null();
+ SGVector<float64_t> values=sample_null();
+ std::sort(values.vector, values.vector + values.vlen);
+ return values[index_t(CMath::floor(values.vlen*(1-alpha)))];
+}
- /* return value of (1-alpha) quantile */
- CMath::qsort(values);
- result=values[index_t(CMath::floor(values.vlen*(1-alpha)))];
- }
- else
- SG_ERROR("Unknown method to approximate null distribution!\n");
+bool CHypothesisTest::perform_test(float64_t alpha)
+{
+ auto statistic=compute_statistic();
+ auto p_value=compute_p_value(statistic);
+ return p_value<alpha;
+}
+
+DataManager& CHypothesisTest::get_data_mgr()
+{
+ return self->data_mgr;
}
-bool CHypothesisTest::perform_test(float64_t alpha)
+const DataManager& CHypothesisTest::get_data_mgr() const
{
- float64_t p_value=perform_test();
- return p_value<alpha;
+ return self->data_mgr;
}
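For reviewers: the permutation scheme in `compute_p_value`/`compute_threshold` above (sort the sampled null statistics, locate the observed statistic, return one minus its relative position for the p-value, and the (1-alpha)-quantile for the threshold) can be sketched standalone as follows. `empirical_p_value` and `empirical_threshold` are hypothetical stand-ins for the class methods, not part of this patch:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// p-value: fraction of null samples that are >= the observed statistic,
// computed as 1 - (insert position in the sorted null sample) / n.
double empirical_p_value(std::vector<double> null_samples, double statistic)
{
	std::sort(null_samples.begin(), null_samples.end());
	auto pos = std::lower_bound(null_samples.begin(), null_samples.end(), statistic)
		- null_samples.begin();
	return 1.0 - double(pos) / null_samples.size();
}

// Threshold: the (1 - alpha)-quantile of the sorted null samples,
// indexed via floor(n * (1 - alpha)) as in the patched code.
double empirical_threshold(std::vector<double> null_samples, double alpha)
{
	std::sort(null_samples.begin(), null_samples.end());
	auto idx = std::size_t(std::floor(null_samples.size() * (1.0 - alpha)));
	return null_samples[idx];
}
```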
diff --git a/src/shogun/statistics/HypothesisTest.h b/src/shogun/statistical_testing/HypothesisTest.h
similarity index 52%
rename from src/shogun/statistics/HypothesisTest.h
rename to src/shogun/statistical_testing/HypothesisTest.h
index e9607760887..2346e9ef5ee 100644
--- a/src/shogun/statistics/HypothesisTest.h
+++ b/src/shogun/statistical_testing/HypothesisTest.h
@@ -1,6 +1,7 @@
/*
* Copyright (c) The Shogun Machine Learning Toolbox
- * Written (w) 2012-2013 Heiko Strathmann
+ * Written (w) 2012 - 2013 Heiko Strathmann
+ * Written (w) 2014 - 2016 Soumyajit De
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -31,40 +32,30 @@
#ifndef HYPOTHESIS_TEST_H_
#define HYPOTHESIS_TEST_H_
+#include
#include
-
#include
namespace shogun
{
-/** enum for different statistic types */
-enum EStatisticType
-{
- S_LINEAR_TIME_MMD,
- S_QUADRATIC_TIME_MMD,
- S_HSIC,
- S_NOCCO
-};
+class CFeatures;
-/** enum for different method to approximate null-distibution */
-enum ENullApproximationMethod
+namespace internal
{
- PERMUTATION,
- MMD2_SPECTRUM_DEPRECATED,
- MMD2_SPECTRUM,
- MMD2_GAMMA,
- MMD1_GAUSSIAN,
- HSIC_GAMMA
-};
-/** @brief Hypothesis test base class. Provides an interface for statistical
+class DataManager;
+
+}
+
+/**
+ * @brief Hypothesis test base class. Provides an interface for statistical
* hypothesis testing via three methods: compute_statistic(), compute_p_value()
* and compute_threshold(). The second computes a p-value for the statistic
- * computed by the first method.
- * The p-value represents the position of the statistic in the null-distribution,
- * i.e. the distribution of the statistic population given the null-hypothesis
- * is true. (1-position = p-value).
+ * computed by the first method. The p-value represents the position of the
+ * statistic in the null-distribution, i.e. the distribution of the statistic
+ * population given the null-hypothesis is true. (1-position = p-value).
+ *
* The third method, compute_threshold(), computes a threshold for a given
* test level which is needed to reject the null-hypothesis.
*
@@ -78,20 +69,50 @@ enum ENullApproximationMethod
class CHypothesisTest : public CSGObject
{
public:
- /** default constructor */
+ /** Default constructor */
CHypothesisTest();
- /** destructor */
+ /** Destructor */
virtual ~CHypothesisTest();
- /** @return test statistic for the given data/parameters/methods */
- virtual float64_t compute_statistic()=0;
+ /**
+ * Method that enables/disables the training-testing mode. If this option
+ * is turned on, then the samples would be split in two pieces: one chunk
+ * would be used for training algorithms and the other chunk would be used
+ * for performing tests. If this option is turned off, the entire data
+ * would be used for performing the test. Before running any training
+ * algorithms, make sure to turn this mode on.
+ *
+ * By default, the training-testing mode is turned off.
+ *
+ * \sa {set_train_test_ratio()}
+ *
+ * @param on Whether to enable/disable the training-testing mode
+ */
+ void set_train_test_mode(bool on);
+
+ /**
+ * Method that specifies the ratio of training-testing data split for the
+ * algorithms. Note that this is NOT the percentage of samples to be used
+ * for training, rather the ratio of the number of samples to be used for
+ * training and that of testing.
+ *
+ * By default, an equal 50-50 split (ratio = 1) is made.
+ *
+ * \sa {set_train_test_mode()}
+ *
+ * @param ratio The ratio of the number of samples to be used for training
+ * and that of testing
+ */
+ void set_train_test_ratio(float64_t ratio);
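A note on the ratio semantics documented above: a ratio r of training to testing samples corresponds to a training fraction of r/(1+r), so the default r=1 yields the 50-50 split. A tiny sketch of that arithmetic (`train_fraction` is a hypothetical helper for illustration, not part of this patch):

```cpp
#include <cassert>

// Convert a train/test ratio r (num_train / num_test) into the
// fraction of all samples used for training: r / (1 + r).
double train_fraction(double ratio)
{
	return ratio / (1.0 + ratio);
}
```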
- /** computes a p-value based on current method for approximating the
- * null-distribution. The p-value is the 1-p quantile of the null-
+ /**
+ * Method that computes a p-value based on current method for approximating
+ * the null-distribution. The p-value is the 1-p quantile of the null-
* distribution where the given statistic lies in.
+ *
* This method depends on the implementation of sample_null method
- * which should be implemented in its sub-classes
+ * which should be implemented by the sub-classes.
*
* @param statistic statistic value to compute the p-value for
* @return p-value parameter statistic is the (1-p) percentile of the
@@ -99,38 +120,24 @@ class CHypothesisTest : public CSGObject
*/
virtual float64_t compute_p_value(float64_t statistic);
- /** computes a threshold based on current method for approximating the
- * null-distribution. The threshold is the value that a statistic has
+ /**
+ * Method that computes a threshold based on current method for approximating
+ * the null-distribution. The threshold is the value that a statistic has
 * to have in order to reject the null-hypothesis.
+ *
* This method depends on the implementation of sample_null method
- * which should be implemented in its sub-classes
+ * which should be implemented by the sub-classes.
*
* @param alpha test level to reject null-hypothesis
- * @return threshold for statistics to reject null-hypothesis
+ * @return Threshold for statistics to reject null-hypothesis
*/
virtual float64_t compute_threshold(float64_t alpha);
- /** Performs the complete two-sample test on current data and returns a
- * p-value.
- *
- * This is a wrapper that calls compute_statistic first and then
- * calls compute_p_value using the obtained statistic. In some statistic
- * classes, it might be possible to compute statistic and p-value in
- * one single run which is more efficient. Therefore, this method might
- * be overwritten in subclasses.
+ /**
+ * Method that performs the complete hypothesis test on current data and
+ * returns a binary answer: whether the null hypothesis is rejected or not.
*
- * The method for computing the p-value can be set via
- * set_null_approximation_method().
- *
- * @return p-value such that computed statistic is the (1-p) quantile
- * of the estimated null distribution
- */
- virtual float64_t perform_test();
-
- /** Performs the complete two-sample test on current data and returns
- * a binary answer wheter null hypothesis is rejected or not.
- *
- * This is just a wrapper for the above perform_test() method that
+ * This is just a wrapper for the above compute_p_value() method that
* returns a p-value. If this p-value lies below the test level alpha,
* the null hypothesis is rejected.
*
@@ -141,42 +148,34 @@ class CHypothesisTest : public CSGObject
*/
bool perform_test(float64_t alpha);
- /** computes the test statistic m_num_null_samples times, exact
- * computation depends on the implementations.
+ /**
+ * Interface for computing the test-statistic for the hypothesis test.
*
- * @return vector of all statistics
+ * @return Test statistic for the given data/parameters/methods
*/
- virtual SGVector<float64_t> sample_null()=0;
+ virtual float64_t compute_statistic()=0;
- /** sets the number of permutation iterations for sample_null()
+ /**
+ * Interface for computing the samples under the null-hypothesis.
*
- * @param num_null_samples how often permutation shall be done
+ * @return Vector of all statistics
*/
- virtual void set_num_null_samples(index_t num_null_samples);
-
- /** sets the method how to approximate the null-distribution
- * @param null_approximation_method method to use
- */
- virtual void set_null_approximation_method(
- ENullApproximationMethod null_approximation_method);
-
- /** returns the statistic type of this test statistic */
- virtual EStatisticType get_statistic_type() const=0;
-
- virtual const char* get_name() const=0;
-
-private:
- /** register parameters and initialize with default values */
- void init();
+ virtual SGVector<float64_t> sample_null()=0;
+ /** @return The name of the class */
+ virtual const char* get_name() const;
protected:
- /** number of iterations for sampling from null-distributions */
- index_t m_num_null_samples;
+ explicit CHypothesisTest(index_t num_distributions);
+ internal::DataManager& get_data_mgr();
+ const internal::DataManager& get_data_mgr() const;
+private:
+ CHypothesisTest(const CHypothesisTest& other)=delete;
+ CHypothesisTest& operator=(const CHypothesisTest& other)=delete;
- /** Defines how the the null distribution is approximated */
- ENullApproximationMethod m_null_approximation_method;
+ struct Self;
+ std::unique_ptr<Self> self;
};
}
-#endif /* HYPOTHESIS_TEST_H_ */
+#endif // HYPOTHESIS_TEST_H_
diff --git a/src/shogun/statistical_testing/IndependenceTest.cpp b/src/shogun/statistical_testing/IndependenceTest.cpp
new file mode 100644
index 00000000000..77b8013a401
--- /dev/null
+++ b/src/shogun/statistical_testing/IndependenceTest.cpp
@@ -0,0 +1,79 @@
+/*
+ * Restructuring Shogun's statistical hypothesis testing framework.
+ * Copyright (C) 2016 Soumyajit De
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include
+#include
+#include
+#include
+
+using namespace shogun;
+using namespace internal;
+
+struct CIndependenceTest::Self
+{
+ Self(index_t num_kernels);
+ KernelManager kernel_mgr;
+};
+
+CIndependenceTest::Self::Self(index_t num_kernels) : kernel_mgr(num_kernels)
+{
+}
+
+CIndependenceTest::CIndependenceTest() : CTwoDistributionTest()
+{
+ self=std::unique_ptr<Self>(new Self(IndependenceTest::num_kernels));
+}
+
+CIndependenceTest::~CIndependenceTest()
+{
+}
+
+void CIndependenceTest::set_kernel_p(CKernel* kernel_p)
+{
+ self->kernel_mgr.kernel_at(0)=kernel_p;
+}
+
+CKernel* CIndependenceTest::get_kernel_p() const
+{
+ return self->kernel_mgr.kernel_at(0);
+}
+
+void CIndependenceTest::set_kernel_q(CKernel* kernel_q)
+{
+ self->kernel_mgr.kernel_at(1)=kernel_q;
+}
+
+CKernel* CIndependenceTest::get_kernel_q() const
+{
+ return self->kernel_mgr.kernel_at(1);
+}
+
+const char* CIndependenceTest::get_name() const
+{
+ return "IndependenceTest";
+}
+
+KernelManager& CIndependenceTest::get_kernel_mgr()
+{
+ return self->kernel_mgr;
+}
+
+const KernelManager& CIndependenceTest::get_kernel_mgr() const
+{
+ return self->kernel_mgr;
+}
diff --git a/src/shogun/statistical_testing/IndependenceTest.h b/src/shogun/statistical_testing/IndependenceTest.h
new file mode 100644
index 00000000000..492fd46d998
--- /dev/null
+++ b/src/shogun/statistical_testing/IndependenceTest.h
@@ -0,0 +1,105 @@
+/*
+ * Restructuring Shogun's statistical hypothesis testing framework.
+ * Copyright (C) 2016 Soumyajit De
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef INDEPENDENCE_TEST_H_
+#define INDEPENDENCE_TEST_H_
+
+#include
+#include
+
+namespace shogun
+{
+
+class CKernel;
+
+namespace internal
+{
+ class KernelManager;
+}
+
+/**
+ * @brief Provides an interface for performing the independence test.
+ * Given samples \f$Z=\{(x_i,y_i)\}_{i=1}^m\f$ from the joint distribution
+ * \f$\textbf{P}_{xy}\f$, the test checks whether the joint distribution
+ * factorizes as \f$\textbf{P}_{xy}=\textbf{P}_x\textbf{P}_y\f$, i.e. into the
+ * product of the marginals. The null-hypothesis says yes, i.e. no dependence;
+ * the alternative hypothesis says no.
+ *
+ * Abstract base class. Provides all interfaces and implements approximating
+ * the null distribution via permutation, i.e. shuffling the samples from
+ * one distribution repeatedly using subsets while keeping the samples from
+ * the other distribution in their original order.
+ *
+ */
+class CIndependenceTest : public CTwoDistributionTest
+{
+public:
+ /** Default constructor */
+ CIndependenceTest();
+
+ /** Destructor */
+ virtual ~CIndependenceTest();
+
+ /**
+ * Method that sets the kernel to be used for performing the test for the
+ * samples from p.
+ *
+ * @param kernel_p The kernel instance to be used for samples from p
+ */
+ void set_kernel_p(CKernel* kernel_p);
+
+ /** @return The kernel instance that is used for samples from p */
+ CKernel* get_kernel_p() const;
+
+ /**
+ * Method that sets the kernel to be used for performing the test for the
+ * samples from q.
+ *
+ * @param kernel_q The kernel instance to be used for samples from q
+ */
+ void set_kernel_q(CKernel* kernel_q);
+
+ /** @return The kernel instance that is used for samples from q */
+ CKernel* get_kernel_q() const;
+
+ /**
+ * Interface for computing the test-statistic for the hypothesis test.
+ *
+ * @return test statistic for the given data/parameters/methods
+ */
+ virtual float64_t compute_statistic()=0;
+
+ /**
+ * Interface for computing the samples under the null-hypothesis.
+ *
+ * @return vector of all statistics
+ */
+ virtual SGVector<float64_t> sample_null()=0;
+
+ /** @return The name of the class */
+ virtual const char* get_name() const;
+protected:
+ internal::KernelManager& get_kernel_mgr();
+ const internal::KernelManager& get_kernel_mgr() const;
+private:
+ struct Self;
+ std::unique_ptr<Self> self;
+};
+
+}
+#endif // INDEPENDENCE_TEST_H_
diff --git a/src/shogun/statistical_testing/LinearTimeMMD.cpp b/src/shogun/statistical_testing/LinearTimeMMD.cpp
new file mode 100644
index 00000000000..c1e9fac7de5
--- /dev/null
+++ b/src/shogun/statistical_testing/LinearTimeMMD.cpp
@@ -0,0 +1,152 @@
+/*
+ * Restructuring Shogun's statistical hypothesis testing framework.
+ * Copyright (C) 2016 Soumyajit De
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+using namespace shogun;
+using namespace internal;
+
+CLinearTimeMMD::CLinearTimeMMD() : CStreamingMMD()
+{
+}
+
+CLinearTimeMMD::CLinearTimeMMD(CFeatures* samples_from_p, CFeatures* samples_from_q) : CStreamingMMD()
+{
+ set_p(samples_from_p);
+ set_q(samples_from_q);
+}
+
+CLinearTimeMMD::~CLinearTimeMMD()
+{
+}
+
+void CLinearTimeMMD::set_num_blocks_per_burst(index_t num_blocks_per_burst)
+{
+ auto& data_mgr=get_data_mgr();
+ auto min_blocksize=data_mgr.get_min_blocksize();
+ if (min_blocksize==2)
+ {
+ // only possible when number of samples from both the distributions are the same
+ auto N=data_mgr.num_samples_at(0);
+ for (auto i=2; i