Refactoring statistical hypothesis testing framework #2495
Comments
I think we need to distinguish between methods in the hypothesis test base class that stream data the way it's done for all tests (with certain ways to permute things), and the methods that are called to compute things on those blocks. It would be great if the difference between an independence and a two-sample test were just a few lines of code in an abstract method implementation, and the overall logic (as it is mathematically the same) stayed in a base class. |
@karlnapf just wondering, can we put these under the computation framework? We send different blocks (or better, one whole burst) as different computation jobs to the computation engine, and the job result aggregator computes the running average? It will be ready to work on a cluster then. |
Yes, absolutely. A good thing to have would be a callback function job or something. Ideas on this? |
Reading the Modern C++ Design book now.. Lots of nice ideas here.. Let's try some of them. Will get back with a class diagram soon...
|
I absolutely agree on this. Let's do proper design thinking before we start hacking things. |
The hack will be a great opportunity to talk (and start pushing) such things. |
@karlnapf a few thoughts regarding this: we can rely on a policy-based design pattern for the job. Firstly: as we already saw while designing … Secondly: talking about this problem particularly, I think it fits the policy design pattern intuitively, because what we want to achieve here is to plug in a number of different policies for data fetching and for computing the statistic combinatorially, instead of relying on an inefficient hierarchical structure. For example, in the previous design I proposed, there is no reason that … Another very interesting point is that we can readily make big tests available not just for real numerical data types but (since we can soon move into them) for other feature types as well - strings, sparse, graphs, signals - without much effort! So here is how I am thinking it can work: the main three internal components here are
… I was writing the POC code for … A few other points I can think of that we should resolve (before I forget):
@lisitsyn I need your comments as well on this - especially we need to discuss whether this design can fit into a plugin architecture - which part goes where, etc. I apologize in advance for such a long post and the messed-up status of the example src. If all these pieces fit together, then maybe we can restructure kernels as well (gotta get sparse kernel working). Please share your thoughts and suggestions on this. |
Great great suggestions. I suggest we discuss at the Stammtisch tonight? |
I was in a bit later yesterday (sorry) but could not catch you then. Shall we meet some other day? I'll try to hang out a bit later this afternoon |
@karlnapf absolutely. I also slept off earlier than usual yesterday. Let me know when you're free :) |
I will try to hang out in IRC a bit this week, so then we can talk. |
@karlnapf my apologies for the long delay. Just came back from a trip home (festive season in India). While working with `template <class Kernel> class KernelManager` … But to use this in a non-template hypothesis test base class, we'd have to use a specialized kernel, which is restrictive. Also, for independence tests we may use two different types of kernels. A similar argument can be given for … A few ways out I could think of -
I'm in favor of (2) because making our wrapper's setters templated would require that we send it by the actual type and not via base pointers. Then we cannot use this to instantiate it inside a polymorphic method (say, inside …). Having all these in mind, I thought of the following structure. On one side we have:

```cpp
template <typename TestType>
class CHypothesisTest : public CSGObject {
	typedef TestType test_type;
	....
	DataManager<test_type> data_manager;
};

class CTwoSampleTest : public CHypothesisTest<TwoSampleTest> {
	typedef TwoSampleTest test_type;
	....
};

class CStreamingTwoSampleTest : public CHypothesisTest<StreamingTwoSampleTest> {
	typedef StreamingTwoSampleTest test_type;
	....
};

class CKernelTwoSampleTest : public CTwoSampleTest {
	typedef CTwoSampleTest::test_type test_type;
	....
	KernelManager<test_type> kernel_manager;
};

// and so on...
```

This way the internal details are perfectly hidden from the wrappers, and the rest is left up to the `DataManager`:

```cpp
template <typename TestType>
struct DataManager {
	template <typename Features>
	void push_back(Features* feats);
	....
	vector<CFeatures*> samples;
};
```

The idea is: based on the feature type, we'd set the fetcher policy and the permutation policy, and for that we'd rely on the fact that … A couple of other ideas -
I haven't chalked up the kernel manager yet, but I suppose it can work similarly to the data manager. There are some linking issues with the current POC code that I'll try to get rid of before our next meeting. Again, sorry for the long post - just noting down the things related to the design decisions so that I don't forget. Hope to talk to you soon! |
Hi @lambday |
@karlnapf Hi, hope you had a nice holiday. I updated a working example on the POC code here.
Some more testing is needed. Before our next meeting I will try to code up KernelManager - it will be good for the discussion if we have a working example ready. Would next Monday be fine then? |
Hi, just cloned this and tried to compile, but I think you might have forgotten to commit the |
@karlnapf just updated. Please check. |
I think the benefit of using the DataManager interface is mainly the ability to mix and match different feature classes and types without the wrapper having to worry about it. I'll try to do something for the string feature class. All it requires is knowing how to permute the features. Same goes for sparse. |
yep, I agree, that is nice. Checking now. BTW, if this turns out to be nice, I suggest we approach the CFeatures base class in that way. Seems way cleaner than the mess we currently have. Let me know your initial thoughts (maybe via mail, not in this thread, or another github issue, or even wiki) |
@karlnapf yeah, that would be good. Also, putting the computation part into policies can help us refactor kernels in general - plug in different policies for computing a kernel for different features, and we have a dot kernel ready for dense, sparse and others. So even when we rely on a hierarchical class structure for our wrappers (for exposing to the modular interface), we can exploit the similarities between the branches. I'll prepare a draft for these ideas. |
Done in feature/bigtest. Closing. |
This follows from a discussion with @karlnapf and @sejdino. Currently, kernel two-sample test and kernel independence test work under two different branches in the class tree with many things in common. In addition, in the near future we'd want to perform independence tests with streaming data as well, which is presently only possible with the two-sample test (`CStreamingMMD`). In order to fit everything nicely and reuse the components that are already there, we think it's better to separate the data-fetching components from the computation components.

Here are some initial thoughts on this (needs further discussion):

**`CTestData` base class (not abstract)**

This class will be responsible for providing an interface for fetching data, either as a whole or blockwise, either merged or unmerged. So it will provide all four methods:

- `get_samples()`: returns a `CList` of feature objects of all `num_samples` for all the features
- `get_merged_samples()`: returns a `CFeatures` instance of merged features (as a whole) created by `CFeatures::create_merged_copy()`
- `get_blocks(index_t num_blocks_current_burst)`: returns a `CList` of feature objects fetched `num_blocks` times blockwise (maybe for dense features we can use `linalg::blocks`, which just creates a wrapper for the data without having to allocate separate memory for them)
- `get_merged_blocks(index_t num_blocks_current_burst)`: returns a merged `CFeatures` instance for `num_blocks` from all the underlying distributions.

This class will also be able to handle more than two distributions (since we may soon move into three-variable interaction or higher, this will make it flexible). So we'll maintain an `m_num_distributions` inside `CTestData` for that, which can be set by the user. Then we'll register the samples from each distribution using a `register_samples(index_t idx, CFeatures* samples)` method, which internally keeps them in an array of size `m_num_distributions`, keeping track of `num_samples` for each via the `CFeatures->get_num_vectors()` interface, `idx` being the index in that array (for example, if we have two distributions p and q, p will be set at `arr[0]` and so on). For the case where we want to process these samples blockwise, a `set_blocksize(index_t idx)` method will be available for the user to set the blocksize per distribution. So the B_x, B_y things are taken care of here.

Also, we can add the shuffling code in `CTestData` itself. Computation components shouldn't bother about data at all. Inside `CTestData` we can have methods to shuffle the samples from just one distribution, or to merge both, shuffle, and redistribute in the same proportion for sampling null. `CStreamingTestData` will just be a subclass of this class which uses streaming features to provide the same interface to the computation components.

Now, for the computation hierarchy, we will have one `CTestData* m_test_data` as a component and a bunch of compute methods. The base class is still `CHypothesisTest`, which will still provide an interface for:

- `compute_statistic()`
- `compute_blockwise_statistic()` [maybe put this somewhere down the hierarchy]
- `compute_variance()`
- `compute_p_value()`
- `compute_threshold()`
- `perform_test()`
- `sample_null()`
The rest of the hierarchy can be the same. In the subclasses of `CIndependenceTest`, we'll use `m_test_data->get_samples()`/`m_test_data->get_blocks(index_t num_blocks_current_burst)` inside the `compute_statistic()`/`compute_blockwise_statistic()` methods, and for the subclasses of `CTwoSampleTest` we'll use `m_test_data->get_merged_samples()`/`m_test_data->get_merged_blocks(index_t num_blocks_current_burst)` accordingly.

The way it will work is: first, the user will create an instance of `CTestData` (for `CDenseFeatures`) or `CStreamingTestData` (for `CStreamingFeatures`). Then he'll feed the test data to some `CHypothesisTest` instance and use the rest just as before. This will work for independence test/conditional-independence test and two-sample test/three-variable interaction test. This will also work for different types of features, as required by the independence test framework. Also, if someone wants to perform a B-test on non-streaming samples, he can (it may not be useful, but we don't have to create an instance of `CStreamingFeatures` from `CFeatures` each time someone wants to do that).

Also (thanks to @karlnapf for the suggestion), since "the code to compute all these statistics is the same (all current tests in some way stream through the data and compute statistics, and it is really just the way the statistic is computed that changes from say hsic to mmd) so it would be great to generalise this too. So that the difference is really just plugging in different statistic computation methods."