New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft for subset access of features #34
Changes from all commits
4c0b38f
9d840cd
87d0bba
ea6b156
1a5cadc
0bd912d
601b8ab
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,6 +49,9 @@ namespace shogun | |
* (dense matrices), CSparseFeatures (sparse matrices), CStringFeatures (a | ||
* set of strings) from which all the specific features like CSimpleFeatures<float64_t> | ||
* (dense real valued feature matrices) are derived. | ||
* | ||
* It is possible to set/get/reset subset matrices. | ||
* These may be used by subclasses for access to subsets of features. | ||
*/ | ||
class CFeatures : public CSGObject | ||
{ | ||
|
@@ -149,13 +152,21 @@ class CFeatures : public CSGObject | |
*/ | ||
inline int32_t get_cache_size() { return cache_size; }; | ||
|
||
/** get number of examples/vectors | ||
/** get number of examples/vectors, possibly corresponding to the current subset | ||
* | ||
* abstract base method | ||
* | ||
* @return number of examples/vectors | ||
* @return number of examples/vectors (possibly of subset, if implemented) | ||
*/ | ||
virtual int32_t get_num_vectors()=0 ; | ||
virtual int32_t get_num_vectors()=0; | ||
|
||
/** get number of ALL examples/vectors (no subset here) | ||
* | ||
* abstract base method | ||
* | ||
* @ return number of examples/vectors (ALL, nosubset) | ||
*/ | ||
//virtual int32_t get_num_vectors_all()=0; | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it makes sense to have virtual int32_t get_num_vectors_all() return get_num_vectors(); } here - don't you agree? |
||
/** in case there is a feature matrix allow for reshaping | ||
* | ||
|
@@ -232,6 +243,45 @@ class CFeatures : public CSGObject | |
properties &= (properties | p) ^ p; | ||
} | ||
|
||
/** removes (and deletes) the current subset indices matrix */ | ||
virtual void remove_feature_subset(); | ||
|
||
/** getter for the subset indices | ||
* | ||
* @param m_subset_idx (copy of) subset indices matrix (returned) | ||
* @param m_subset_len (copy of) number ofsubset indices (returned) | ||
*/ | ||
void get_feature_subset(int32_t** subset_idx, int32_t* subset_len); | ||
|
||
/** getter for the subset indices | ||
* | ||
* @param m_subset_len reference to number of subset indices (returned) | ||
* @return subset indices array | ||
*/ | ||
int32_t* get_feature_subset(int32_t& subset_len) | ||
{ | ||
subset_len=m_subset_len; | ||
return m_subset_idx; | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you cannot know this - and I guess we will change this at at some later point. but for now: The get_feature_subset() function above would be callable by swig wrapped interfaces (one line %apply line in modular/Features.i is required for that). But unfortunately these interfaces require an allocated pointer and a copy of the data. Not nice I know but this is the way it is for now. So please introduce two such get_feature_subset functions - one like you have now but using references to set the data sizes that returns the subset index and one that allocates memory for the index (malloc this time!) / so a copy. Look at features/SimpleFeatures.h line 361 and below - it will be clear from that. |
||
|
||
/** sets the subset indices matrix which is afterwards used for feature access | ||
* | ||
* @param m_subset_idx index matrix | ||
* @param m_subset_len number of subset indices | ||
*/ | ||
virtual void set_feature_subset(int32_t* subset_idx, int32_t subset_len); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ahh sigh. same stupid issue with swig interfaces. we need to copy the data here. or have two functions copy_feature_subset and set_feature_subset. Don't change it just now - I seriously consider fixing up the swig interfaces. |
||
|
||
protected: | ||
/** returns the corresponding real index (in array) of a subset index | ||
* (if there is a subset) | ||
* | ||
* should/has to be be overwritten when subset support is | ||
* implemented in a subclass. for now, it just returns identity | ||
* | ||
* @ return array index of the provided subset index | ||
*/ | ||
inline int32_t subset_idx_conversion(int32_t idx) { return idx; } | ||
|
||
private: | ||
/** feature properties */ | ||
uint64_t properties; | ||
|
@@ -247,6 +297,16 @@ class CFeatures : public CSGObject | |
|
||
/// i'th entry is true if features were already preprocessed with preproc i | ||
bool* preprocessed; | ||
|
||
protected: | ||
/** (possibly) contains a matrix of indices that form a subset of the stored | ||
* features. If set, all vector access methods work on the specified subset. | ||
* If it is NULL, it is ignored. | ||
*/ | ||
int32_t* m_subset_idx; | ||
|
||
/** length of the subset */ | ||
int32_t m_subset_len; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. please add one line of documentation for both varialbes. |
||
}; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right we indeed need a function like this