Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing splitting strategy to clean up Cross Validation API #4180

Closed

Conversation

luisffranca
Copy link
Contributor

@luisffranca luisffranca commented Feb 19, 2018

Changing splitting strategy to clean up Cross Validation API (#4049)

  • generate_subset_indices() and generate_subset_inverse() are renamed,
    respectively, to validation() and train();
  • The methods are now const and return const vectors;
  • build_subsets() won't be explicitly called, but rather internally
    in the constructor of the derived classes;
  • Examples, unit tests and notebooks were changed accordingly.

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the patch!
I think the constness of the train/validation methods should be achieved otherwise

@@ -23,7 +23,7 @@ CCrossValidationSplitting::CCrossValidationSplitting(
m_rng = sg_rand;
}

void CCrossValidationSplitting::build_subsets()
void CCrossValidationSplitting::build_subsets() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method cannot be const, it changes members

@@ -33,7 +33,7 @@ CSplittingStrategy::CSplittingStrategy(CLabels* labels, int32_t num_subsets)
reset_subsets();
}

void CSplittingStrategy::reset_subsets()
void CSplittingStrategy::reset_subsets() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cannot be const ...

@@ -69,13 +69,11 @@ CSplittingStrategy::~CSplittingStrategy()
SG_UNREF(m_subset_indices);
}

SGVector<index_t> CSplittingStrategy::generate_subset_indices(index_t subset_idx)
const SGVector<index_t> CSplittingStrategy::validation(index_t subset_idx) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea to make this method const (if things are precomputed)
However, there is a "build_subsets" call below, which cannot be const, so the current const doenst make sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function arg can be const :)

*/
SGVector<index_t> generate_subset_indices(index_t subset_idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this renaming

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above


/** @return number of subsets. */
index_t get_num_subsets() const;

protected:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you make this method private. Why not call it in the constructor then? Then we can make the train/validation method const..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed that, thanks.

@@ -99,14 +96,14 @@ class CSplittingStrategy: public CSGObject
CLabels* m_labels;

/** subset indices */
CDynamicObjectArray* m_subset_indices;
mutable CDynamicObjectArray* m_subset_indices;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now what your thinking was.
I don't think this is the best style, making it mutable and the other things const.
There are simpler ways of reaching this goal imo

@luisffranca
Copy link
Contributor Author

Hi @karlnapf @vigsterkr !

Thanks for the review. I've changed the constness strategy and now build_subsets() is called in the constructors of the derived classes.

After these changes, CrossValidationMMD unit tests started to fail. That was due to the use of randomness in the creation of the subsets. Given that now build_subsets() is in the constructor, I've adapted the test in order to fix this.

*/
SGVector<index_t> generate_subset_indices(index_t subset_idx);
const SGVector<index_t> validation(const index_t subset_idx) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why we want the return type to be const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this according to the Issue description. I think this shouldn't be a problem, given the possibility to return it to const or non-const variables. Should I remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be ok

@@ -45,6 +45,7 @@ CTimeSeriesSplitting::CTimeSeriesSplitting(CLabels* labels, index_t num_subsets)
: CSplittingStrategy(labels, num_subsets)
{
init();
build_subsets();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method seem to be called in init, why call it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_subsets() you mean? It isn't called on init() in this class.

@@ -99,6 +100,9 @@ void CTimeSeriesSplitting::set_min_subset_size(index_t min_size)
"subsets and labels.",
num_labels - (num_subsets - 1) * (num_labels / num_subsets) - 1);
m_min_subset_size = min_size;

// Rebuild subsets considering the new minimum subset size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it is built twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In init() m_min_subset_size is set to 1, which is then used by build_subsets() (called by the constructor). If set_min_subset_size() is called, we should rebuild the subsets. I've found this due to a failure of timeseries_subset_linear_splits unit test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to avoid this. Maybe you can shift things around that it is only built ones .. ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to achieve this would be to allow m_min_subset_size to be passed as an argument in the constructor (default value 1) and remove the method set_min_subset_size(). Do you think that would be a problem?

@@ -69,7 +69,8 @@ CSplittingStrategy::~CSplittingStrategy()
SG_UNREF(m_subset_indices);
}

SGVector<index_t> CSplittingStrategy::generate_subset_indices(index_t subset_idx)
const SGVector<index_t>
CSplittingStrategy::validation(const index_t subset_idx) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for const on index_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@karlnapf
Copy link
Member

Added a few comments, the build seems to be fine. So tests passing.
Once we have cleaned up I think we can merge this.
Did you check all ipython notebooks and meta examples for the new API?

@karlnapf
Copy link
Member

karlnapf commented Apr 2, 2018

Any updates here?

@luisffranca
Copy link
Contributor Author

Hi @karlnapf ! Sorry for my late reply.

I'm sending a new patch set changing build_subsets() in TimeSeriesSplitting according to your last comments. I've adapted the unit tests and notebook accordingly.

As far as I've checked, all python notebooks and meta examples for the new API were changed.

SGVector<index_t> subset=splitting->generate_subset_indices(i);
SGVector<index_t> inverse=splitting->generate_subset_inverse(i);
SGVector<index_t> subset = splitting->validation(i);
SGVector<index_t> inverse = splitting->train(i);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realised a minor inconsistency. Should be called "training" if the counterpart is called "validation"
But I would even prefer "train" and "test" ..

/** custom rng if using cross validation across different threads */
CRandom * m_rng;

protected:
/** implementation of the standard cross-validation splitting strategy */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment seems weird to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and will remove it.

@karlnapf
Copy link
Member

karlnapf commented Apr 9, 2018

This is almost ready.
Just my comment on the naming. What are your thoughts here?

@luisffranca
Copy link
Contributor Author

I like "train" and "test"! I'll make the changes and submit another patch set.

Thanks!

@luisffranca
Copy link
Contributor Author

Hi @karlnapf !

Any other comments in this PR? Can I squash the commits to clean it up?

@karlnapf
Copy link
Member

Thanks for the udpate and ping! :)

Looks good! No need to squash as github can do that these days

One thing I want to make sure.Did you run all unit and integration tests?

@@ -21,6 +21,7 @@ CCrossValidationSplitting::CCrossValidationSplitting(
CSplittingStrategy(labels, num_subsets)
{
m_rng = sg_rand;
build_subsets();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a problem here is that we can only call the default ctor from swig, then we may still need to call build_subset from the outside? @karlnapf

@stale
Copy link

stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 26, 2020
@stale
Copy link

stale bot commented Mar 4, 2020

This issue is now being closed due to a lack of activity. Feel free to reopen it.

@stale stale bot closed this Mar 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants