Changing splitting strategy to clean up Cross Validation API #4180

luisffranca · 2018-02-19T06:23:19Z

Changing splitting strategy to clean up Cross Validation API (#4049)

generate_subset_indices() and generate_subset_inverse() are renamed,
respectively, to validation() and train();
The methods are now const and return const vectors;
build_subsets() is no longer called explicitly; it is called internally
in the constructors of the derived classes;
Examples, unit tests and notebooks were changed accordingly.

karlnapf

Thx for the patch!
I think the constness of the train/validation methods should be achieved differently

karlnapf · 2018-02-19T18:38:09Z

src/shogun/evaluation/CrossValidationSplitting.cpp

@@ -23,7 +23,7 @@ CCrossValidationSplitting::CCrossValidationSplitting(
 	m_rng = sg_rand;
 }

-void CCrossValidationSplitting::build_subsets()
+void CCrossValidationSplitting::build_subsets() const


this method cannot be const, it changes members
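
To illustrate the point, here is a minimal sketch with a hypothetical `Splitting` class (a stand-in, not Shogun's `CCrossValidationSplitting`). A method that assigns to members has to stay non-const, while read-only accessors can be const.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a splitting strategy; not Shogun's actual class.
class Splitting
{
public:
	// Must be non-const: it assigns to m_subsets. Declaring it
	// `void build_subsets() const` would be a compile error unless
	// m_subsets were declared mutable.
	void build_subsets()
	{
		m_subsets.assign(3, std::vector<int>());
		for (int i = 0; i < 6; ++i)
			m_subsets[i % 3].push_back(i);
	}

	// Read-only access, so const is appropriate here.
	std::size_t num_subsets() const
	{
		return m_subsets.size();
	}

private:
	std::vector<std::vector<int>> m_subsets;
};
```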

karlnapf · 2018-02-19T18:38:32Z

src/shogun/evaluation/SplittingStrategy.cpp

@@ -33,7 +33,7 @@ CSplittingStrategy::CSplittingStrategy(CLabels* labels, int32_t num_subsets)
 	reset_subsets();
 }

-void CSplittingStrategy::reset_subsets()
+void CSplittingStrategy::reset_subsets() const


cannot be const ...

karlnapf · 2018-02-19T18:39:13Z

src/shogun/evaluation/SplittingStrategy.cpp

@@ -69,13 +69,11 @@ CSplittingStrategy::~CSplittingStrategy()
 	SG_UNREF(m_subset_indices);
 }

-SGVector<index_t> CSplittingStrategy::generate_subset_indices(index_t subset_idx)
+const SGVector<index_t> CSplittingStrategy::validation(index_t subset_idx) const


I think it is a good idea to make this method const (if things are precomputed)
However, there is a "build_subsets" call below, which cannot be const, so the current const doesn't make sense

the function arg can be const :)
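
A rough sketch of the "precompute, then const" idea follows. `PrecomputedSplitting` and its round-robin assignment are assumptions made for illustration; only the method names `validation()` and `train()` come from this PR.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplified strategy; names mirror the PR but this is not Shogun code.
class PrecomputedSplitting
{
public:
	PrecomputedSplitting(int num_labels, int num_subsets)
	    : m_subsets(num_subsets)
	{
		// All mutation happens here, once, at construction time.
		for (int i = 0; i < num_labels; ++i)
			m_subsets[i % num_subsets].push_back(i);
	}

	// Indices of the held-out subset; const because nothing is rebuilt.
	std::vector<int> validation(int subset_idx) const
	{
		return m_subsets.at(subset_idx);
	}

	// Indices of all other subsets, concatenated in subset order.
	std::vector<int> train(int subset_idx) const
	{
		std::vector<int> result;
		for (int j = 0; j < static_cast<int>(m_subsets.size()); ++j)
		{
			if (j == subset_idx)
				continue;
			result.insert(
			    result.end(), m_subsets[j].begin(), m_subsets[j].end());
		}
		return result;
	}

private:
	std::vector<std::vector<int>> m_subsets;
};
```

Because the subsets exist as soon as the object does, both accessors can be called on a const instance without any `mutable` member.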

karlnapf · 2018-02-19T18:39:35Z

src/shogun/evaluation/SplittingStrategy.h

 	 */
-	SGVector<index_t> generate_subset_indices(index_t subset_idx);


I like this renaming

see comment above

karlnapf · 2018-02-19T18:40:14Z

src/shogun/evaluation/SplittingStrategy.h


 	/** @return number of subsets. */
 	index_t get_num_subsets() const;

+protected:


so you make this method private. Why not call it in the constructor then? Then we can make the train/validation method const..

I've changed that, thanks.

karlnapf · 2018-02-19T18:41:30Z

src/shogun/evaluation/SplittingStrategy.h

@@ -99,14 +96,14 @@ class CSplittingStrategy: public CSGObject
 	CLabels* m_labels;

 	/** subset indices */
-	CDynamicObjectArray* m_subset_indices;
+	mutable CDynamicObjectArray* m_subset_indices;


I see now what your thinking was.
I don't think this is the best style, making it mutable and the other things const.
There are simpler ways of reaching this goal imo

luisffranca · 2018-02-26T04:23:45Z

Hi @karlnapf @vigsterkr !

Thanks for the review. I've changed the constness strategy and now build_subsets() is called in the constructors of the derived classes.

After these changes, CrossValidationMMD unit tests started to fail. That was due to the use of randomness in the creation of the subsets. Given that now build_subsets() is in the constructor, I've adapted the test in order to fix this.
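
The effect described above can be sketched as follows. This is an illustration only: it uses `std::mt19937` as a stand-in for Shogun's random generator, and `ShuffledSplitting` and `split_with_seed` are hypothetical names. Once the random draw happens in the constructor, the generator has to be seeded before the object is created for a test to be reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical strategy that shuffles indices in its constructor.
class ShuffledSplitting
{
public:
	ShuffledSplitting(int num_labels, std::mt19937& rng)
	    : m_indices(num_labels)
	{
		std::iota(m_indices.begin(), m_indices.end(), 0);
		// The random draw happens at construction, so the generator
		// state at this point determines the split.
		std::shuffle(m_indices.begin(), m_indices.end(), rng);
	}

	const std::vector<int>& indices() const
	{
		return m_indices;
	}

private:
	std::vector<int> m_indices;
};

// Builds a splitting from a freshly seeded generator and returns a copy
// of its shuffled indices.
inline std::vector<int> split_with_seed(unsigned seed, int num_labels)
{
	std::mt19937 rng(seed);
	return ShuffledSplitting(num_labels, rng).indices();
}
```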

karlnapf · 2018-02-27T12:49:05Z

src/shogun/evaluation/SplittingStrategy.h

 	 */
-	SGVector<index_t> generate_subset_indices(index_t subset_idx);
+	const SGVector<index_t> validation(const index_t subset_idx) const;


not sure why we want the return type to be const?

I changed this to match the issue description. I don't think it should be a problem, since the result can be assigned to either a const or a non-const variable. Should I remove it?

I think it should be ok
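
A small sketch of what a const by-value return means for callers, using `std::vector<int>` as a stand-in for `SGVector<index_t>` and hypothetical function names:

```cpp
#include <cassert>
#include <vector>

// Hypothetical function returning a const value, as in the signature under review.
inline const std::vector<int> validation_indices()
{
	return std::vector<int>{0, 1, 2};
}

inline std::vector<int> copy_and_extend()
{
	// Allowed: the non-const variable is initialised from the const result.
	std::vector<int> indices = validation_indices();
	indices.push_back(3); // the caller's own object is freely modifiable
	return indices;
}
```

In general C++, one cost of a const by-value return is that assigning the temporary to an existing object selects copy assignment rather than move assignment, which is a common argument for dropping the qualifier.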

karlnapf · 2018-02-27T12:51:24Z

src/shogun/evaluation/TimeSeriesSplitting.cpp

@@ -45,6 +45,7 @@ CTimeSeriesSplitting::CTimeSeriesSplitting(CLabels* labels, index_t num_subsets)
    : CSplittingStrategy(labels, num_subsets)
 {
 	init();
+	build_subsets();


the method seems to be called in init, why call it again?

build_subsets() you mean? It isn't called in init() in this class.

karlnapf · 2018-02-27T12:52:18Z

src/shogun/evaluation/TimeSeriesSplitting.cpp

@@ -99,6 +100,9 @@ void CTimeSeriesSplitting::set_min_subset_size(index_t min_size)
 	    "subsets and labels.",
 	    num_labels - (num_subsets - 1) * (num_labels / num_subsets) - 1);
 	m_min_subset_size = min_size;
+
+	// Rebuild subsets considering the new minimum subset size


so it is built twice?

In init() m_min_subset_size is set to 1, which is then used by build_subsets() (called by the constructor). If set_min_subset_size() is called, we should rebuild the subsets. I've found this due to a failure of timeseries_subset_linear_splits unit test.

Let's try to avoid this. Maybe you can shift things around so that it is only built once?

One way to achieve this would be to allow m_min_subset_size to be passed as an argument in the constructor (default value 1) and remove the method set_min_subset_size(). Do you think that would be a problem?
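
The proposal above could look roughly like this. `TimeSeriesSplit`, its members, and the build counter are hypothetical simplifications, and the body of `build_subsets()` is a placeholder rather than the real index computation.

```cpp
#include <cassert>

// Hypothetical simplification of a time-series splitting strategy.
class TimeSeriesSplit
{
public:
	// min_subset_size defaults to 1, the value previously set in init().
	TimeSeriesSplit(int num_labels, int num_subsets, int min_subset_size = 1)
	    : m_num_labels(num_labels), m_num_subsets(num_subsets),
	      m_min_subset_size(min_subset_size), m_base_size(0),
	      m_build_count(0)
	{
		build_subsets();
	}

	int min_subset_size() const { return m_min_subset_size; }
	int base_size() const { return m_base_size; }
	int build_count() const { return m_build_count; }

private:
	void build_subsets()
	{
		// Placeholder computation; the real method would fill the
		// subset index arrays using m_min_subset_size.
		m_base_size = m_num_labels / m_num_subsets;
		++m_build_count;
	}

	int m_num_labels;
	int m_num_subsets;
	int m_min_subset_size;
	int m_base_size;
	int m_build_count;
};
```

With no setter left, the minimum size is fixed before the only call to `build_subsets()`, so the subsets are built exactly once per object.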

karlnapf · 2018-02-27T12:52:59Z

src/shogun/evaluation/SplittingStrategy.cpp

@@ -69,7 +69,8 @@ CSplittingStrategy::~CSplittingStrategy()
 	SG_UNREF(m_subset_indices);
 }

-SGVector<index_t> CSplittingStrategy::generate_subset_indices(index_t subset_idx)
+const SGVector<index_t>
+CSplittingStrategy::validation(const index_t subset_idx) const


no need for const on index_t
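
For context, a short sketch of why the qualifier is optional here: top-level const on a by-value parameter is not part of the function type, so it only affects the function body. `validation_size` is a hypothetical function used for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Declaration without top-level const on the parameter...
int validation_size(int subset_idx);

// ...and a definition with it. Both name the same function: top-level
// const on a by-value parameter is not part of the function type.
int validation_size(const int subset_idx)
{
	return subset_idx + 1; // hypothetical body, for illustration only
}

static_assert(
    std::is_same<int(int), int(const int)>::value,
    "top-level const does not change the function type");
```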

karlnapf · 2018-02-27T12:53:51Z

Added a few comments. The build seems to be fine, so the tests are passing.
Once we have cleaned up I think we can merge this.
Did you check all ipython notebooks and meta examples for the new API?

karlnapf · 2018-04-02T16:28:23Z

Any updates here?

luisffranca · 2018-04-08T03:02:55Z

Hi @karlnapf ! Sorry for my late reply.

I'm sending a new patch set changing build_subsets() in TimeSeriesSplitting according to your last comments. I've adapted the unit tests and notebook accordingly.

As far as I've checked, all python notebooks and meta examples for the new API were changed.

karlnapf · 2018-04-09T14:44:50Z

examples/undocumented/libshogun/splitting_LOO_crossvalidation.cpp

-			SGVector<index_t> subset=splitting->generate_subset_indices(i);
-			SGVector<index_t> inverse=splitting->generate_subset_inverse(i);
+			SGVector<index_t> subset = splitting->validation(i);
+			SGVector<index_t> inverse = splitting->train(i);


just realised a minor inconsistency. Should be called "training" if the counterpart is called "validation"
But I would even prefer "train" and "test" ..

karlnapf · 2018-04-09T14:45:35Z

src/shogun/evaluation/CrossValidationSplitting.h

 	/** custom rng if using cross validation across different threads */
 	CRandom * m_rng;
+
+protected:
+	/** implementation of the standard cross-validation splitting strategy */


this comment seems weird to me

I agree and will remove it.

karlnapf · 2018-04-09T14:48:40Z

This is almost ready.
Just my comment on the naming. What are your thoughts here?

luisffranca · 2018-04-10T02:00:35Z

I like "train" and "test"! I'll make the changes and submit another patch set.

Thanks!

luisffranca · 2018-05-19T01:48:16Z

Hi @karlnapf !

Any other comments in this PR? Can I squash the commits to clean it up?

karlnapf · 2018-05-21T09:44:58Z

Thanks for the update and ping! :)

Looks good! No need to squash, as GitHub can do that these days

One thing I want to make sure of: did you run all unit and integration tests?

vinx13 · 2018-09-06T03:45:49Z

src/shogun/evaluation/CrossValidationSplitting.cpp

@@ -21,6 +21,7 @@ CCrossValidationSplitting::CCrossValidationSplitting(
 	CSplittingStrategy(labels, num_subsets)
 {
 	m_rng = sg_rand;
+	build_subsets();


a problem here is that we can only call the default ctor from swig, so we may still need to call build_subsets() from the outside? @karlnapf
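
One possible way around this, sketched under assumptions: if only the default constructor is reachable from a binding, a setter could trigger the rebuild so that callers still never call `build_subsets()` themselves. `Strategy` and `set_parameters()` are hypothetical names, not part of Shogun's API, and this is not something the PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strategy that supports both construction paths.
class Strategy
{
public:
	// Default constructor: nothing to build yet.
	Strategy() : m_num_labels(0), m_num_subsets(0) {}

	Strategy(int num_labels, int num_subsets)
	    : m_num_labels(num_labels), m_num_subsets(num_subsets)
	{
		build_subsets();
	}

	// Hypothetical setter: stores the parameters and rebuilds internally.
	void set_parameters(int num_labels, int num_subsets)
	{
		m_num_labels = num_labels;
		m_num_subsets = num_subsets;
		build_subsets();
	}

	std::size_t num_subsets() const { return m_subsets.size(); }

private:
	void build_subsets()
	{
		m_subsets.clear();
		if (m_num_subsets <= 0)
			return; // nothing to split into
		m_subsets.assign(m_num_subsets, std::vector<int>());
		for (int i = 0; i < m_num_labels; ++i)
			m_subsets[i % m_num_subsets].push_back(i);
	}

	int m_num_labels;
	int m_num_subsets;
	std::vector<std::vector<int>> m_subsets;
};
```

The trade-off is that a default-constructed object is valid but empty until parameters are set, so const accessors have to tolerate that state.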

stale · 2020-02-26T15:52:30Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-03-04T16:29:09Z

This issue is now being closed due to a lack of activity. Feel free to reopen it.

karlnapf requested changes Feb 19, 2018

luisffranca force-pushed the feature/splitAsIter branch from 35357cb to 694d066 Compare February 26, 2018 04:10

karlnapf reviewed Feb 27, 2018

luisffranca force-pushed the feature/splitAsIter branch from 694d066 to 74f35a6 Compare April 8, 2018 02:53

karlnapf reviewed Apr 9, 2018

luisffranca added 4 commits April 10, 2018 23:13

Changing splitting strategy to clean up Cross Validation API (#4049)

319008c

Code review: changing consts and fixing unit test

e5723bd

Code review: removing constness of indices and changing TimeSeriesSplitting

26cb0a1

Code review: change "validation" to "test"

0a2a42a

luisffranca force-pushed the feature/splitAsIter branch from 74f35a6 to 0a2a42a Compare April 11, 2018 02:14

karlnapf approved these changes May 21, 2018

vinx13 reviewed Sep 6, 2018

stale bot added the stale label Feb 26, 2020

stale bot closed this Mar 4, 2020
