-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft for subset access of features #34
Conversation
@@ -48,6 +48,9 @@ CFeatures::CFeatures(CFile* loader) | |||
|
|||
CFeatures::~CFeatures() | |||
{ | |||
if (subset_inds) | |||
delete subset_inds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be delete[]
I am actually very happy about this initial draft. I have tried to give you as much feedback as possible. Hopefully it will simplify the code a bit and I am all for doing this for the other feature types (when that one is finished and tested - I would suggest to write a python_modular function to test it, where you set a subset_idx that is exactly 0...(num_vec-1) - i.e. identical and then run some of the svm examples). I think this will enable us to only have to specify the data set once, setting different indices for training/validation/testing etc. Should be one of the most difficult parts already. The other difficult part is how to register the model parameters. When these two are resolved you are on track. |
-some changes to names -some errors -some design changes
Hi there, |
*/ | ||
virtual int32_t get_num_vectors()=0 ; | ||
//virtual int32_t get_num_vectors_all()=0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to have
virtual int32_t get_num_vectors_all()
{
return get_num_vectors();
}
here - don't you agree?
…modular appending from another string features instances now only appends a subset, if one is defined
* @param m_subset_idx index matrix | ||
* @param m_subset_len number of subset indices | ||
*/ | ||
virtual void set_feature_subset(int32_t* subset_idx, int32_t subset_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh sigh. same stupid issue with swig interfaces. we need to copy the data here. or have two functions copy_feature_subset and set_feature_subset. Don't change it just now - I seriously consider fixing up the swig interfaces.
@@ -533,7 +569,7 @@ template <class ST> class CStringFeatures : public CFeatures | |||
*/ | |||
virtual inline int32_t get_max_vector_length() | |||
{ | |||
return max_string_length; | |||
return m_subset_idx ? max_string_length_total : max_string_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should always return max_string_length here and only adjust max_string_length to have to correct value.
small changed to std getter method subset_idx_conversion() is now not pure virtual anymore added comments
copy_features() now works woth subsets code cleanups
I left the setter for the subset matrix as it is for the moment. |
data files changed corresponding to shogun-toolbox#1915
Attached my first draft for subset access for features, which is needed for cross-validation, which is my proposed gsoc project.
Features stay in memory as before, but it is possible to tell the class to use a subset for all feature access methods.
If this is ok for you, I will implement the same thing for all feature types.
Feedback and suggestions are welcome!
Heiko