Add transformer base class and adapt preprocessor / converter to fit + apply #4285

Merged

Conversation

@vinx13 (Member) commented May 14, 2018

  • add a transformer base class and make converter / preprocessor its direct subclasses (see the interface sketch below)
  • adapt CPreprocessor::init(CFeatures*) to fit; fit may call cleanup first if already fitted, to support re-fitting
  • clean up the type checks in the preprocessors' init; use CFeatures::as in the apply phase instead
  • add a parameter 'inplace' (default true) to preprocessor / converter apply, which transforms in place when possible
  • drop CDimensionReductionPreprocessor, since it is only a wrapper around converters and provides things (kernel, distance, ...) that are irrelevant to its subtypes
  • make subclasses of CDimensionReductionPreprocessor inherit directly from CDensePreprocessor<float64_t>
  • make KernelPCA inherit from CPreprocessor, since it can work with other types of features
  • refactor the ICA converters into fit + apply
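
Sketched from the bullet points above (not the exact header), the new base-class interface looks roughly like this:

class CTransformer : public CSGObject
{
public:
	/** Fit transformer to features; the default is a no-op, since many
	 *  transformers have no state to estimate */
	virtual void fit(CFeatures* features)
	{
	}

	/** Apply transformer to features; if inplace is true, transform
	 *  in place where the concrete transformer supports it */
	virtual CFeatures* apply(CFeatures* features, bool inplace = true) = 0;
};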

@vigsterkr (Member) left a comment:

great start!

}

/** Fit transformer to features */
virtual void fit(CFeatures* features)
Member:

this could be a pure virtual function in order to make this an abstract class, or?

@vinx13 (Member, author):

some transformers actually do not need fitting, so I leave this as a no-op

@@ -163,7 +156,7 @@ void CHomogeneousKernelMap::init()

SGMatrix<float64_t> CHomogeneousKernelMap::apply_to_feature_matrix (CFeatures* features)
{
- CDenseFeatures<float64_t>* simple_features = (CDenseFeatures<float64_t>*)features;
+ auto simple_features = features->as<CDenseFeatures<float64_t>>();
Member:

needs a quick fix to check whether it has been trained or not... :)
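
For example, something along these lines at the top of the method (m_fitted is a hypothetical flag here, not necessarily what the PR ends up using):

SGMatrix<float64_t> CHomogeneousKernelMap::apply_to_feature_matrix(CFeatures* features)
{
	// hypothetical guard: fail early with a clear message if fit() was never called
	REQUIRE(m_fitted, "HomogeneousKernelMap has not been fitted.\n");
	auto simple_features = features->as<CDenseFeatures<float64_t>>();
	// ...
}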

kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);
linalg::scale(bias_tmp, bias_tmp, -1.0 / n);
float64_t s = linalg::sum(bias_tmp) / n;
Member:

auto
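
Presumably referring to the line this comment is anchored on, i.e.:

auto s = linalg::sum(bias_tmp) / n;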

SGMatrix<float64_t> eigenvectors(kernel_matrix.num_rows, m_target_dim);
linalg::eigen_solver_symmetric(
kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);
Member:

auto
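
Again presumably the anchored line, i.e.:

auto bias_tmp = linalg::rowwise_sum(kernel_matrix);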

- float64_t* var=SG_MALLOC(float64_t, num_features);
  int32_t i,j;
+ m_mean.resize_vector(num_features);
+ float64_t* var = SG_MALLOC(float64_t, num_features);
Member:

SGVector could be used, or?
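
i.e., a sketch of the suggestion, letting SGVector manage the allocation instead of a raw SG_MALLOC/SG_FREE pair:

SGVector<float64_t> var(num_features);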

- int32_t num_ok=0;
- int32_t* idx_ok=SG_MALLOC(int32_t, num_features);
+ int32_t num_ok = 0;
+ int32_t* idx_ok = SG_MALLOC(int32_t, num_features);
Member:

SGVector?
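
Similarly, a sketch:

SGVector<int32_t> idx_ok(num_features);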


for (i=0; i<num_vec; i++)
{
int32_t len=0;
bool free_vec;
- uint64_t* vec=((CStringFeatures<uint64_t>*)f)->
- get_feature_vector(i, len, free_vec);
+ uint64_t* vec = sf->get_feature_vector(i, len, free_vec);
Member:

auto?

@@ -56,14 +47,14 @@ bool CSortUlongString::save(FILE* f)
bool CSortUlongString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint64_t>*)f)->get_num_vectors();
+ auto sf = f->as<CStringFeatures<uint64_t>>();
+ int32_t num_vec = sf->get_num_vectors();
Member:

auto

@@ -56,13 +47,14 @@ bool CSortWordString::save(FILE* f)
bool CSortWordString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint16_t>*)f)->get_num_vectors() ;
+ auto sf = f->as<CStringFeatures<uint16_t>>();
+ int32_t num_vec = sf->get_num_vectors();
Member:

auto


for (i=0; i<num_vec; i++)
{
int32_t len = 0 ;
bool free_vec;
- uint16_t* vec = ((CStringFeatures<uint16_t>*)f)->get_feature_vector(i, len, free_vec);
+ uint16_t* vec = sf->get_feature_vector(i, len, free_vec);
Member:

auto

@karlnapf (Member):

This is a cool push!

@vinx13 vinx13 changed the title [WIP] Add transformer base class and adapt preprocessor / converter to fit + apply Add transformer base class and adapt preprocessor / converter to fit + apply May 27, 2018
@vigsterkr (Member) left a comment:

initial review

Normalize.apply_to_feature_matrix(features_train)
Normalize.apply_to_feature_matrix(features_test)
SubMean.fit(features_train)
Features features_train1 = SubMean.apply(features_train)
Member:

a more 'representative' name than just features_train1 would be desirable :P

Features features_train1 = SubMean.apply(features_train)
Features features_test1 = SubMean.apply(features_test)
Normalize.fit(features_train)
Features features_train2 = Normalize.apply(features_train1)
Member:

maybe a better name would be normalized_features_train/test

{
ASSERT(features);
SG_REF(features);

- SGMatrix<float64_t> X = ((CDenseFeatures<float64_t>*)features)->get_feature_matrix();
+ auto X = features->as<CDenseFeatures<float64_t>>()->get_feature_matrix();
Member:

we could add an assertion for get_feature_class() == C_DENSE before this, just to have a sane error message?

Member:

although .as has quite a good message as well...

@@ -59,3 +61,26 @@ float64_t CICAConverter::get_tol() const
return tol;
}

CFeatures* CICAConverter::apply(CFeatures* features, bool inplace)
{
REQUIRE(m_mixing_matrix.matrix, "ICAConverter not fitted.");
Member:

ICAConverter has not been fitted.

*/
- virtual CFeatures* apply(CFeatures* features) = 0;
+ virtual void fit(CFeatures* features) = 0;
Member:

if we know that this class is an abstract class for other classes that do conversion over CDenseFeatures<float64_t>, then it would be good to actually do the type mapping here, meaning not to have this function pure virtual but to implement something like this:

virtual void fit(CFeatures* features)
{
    REQUIRE(features->get_feature_class() == C_DENSE, "ICA converters only work with dense features");
    fit((CDenseFeatures<float64_t>*)features);
}

and have a pure virtual protected member function with the declaration:

virtual void fit(CDenseFeatures<float64_t>*) = 0;

this way you only need to do the cast in one place; everywhere else you can already assume CDenseFeatures<float64_t> as input.
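
For illustration, a concrete converter would then only ever see dense features; CSomeICA below is a hypothetical subclass name:

class CSomeICA : public CICAConverter
{
protected:
	// the base class has already validated and cast the input
	virtual void fit(CDenseFeatures<float64_t>* features) override
	{
		SGMatrix<float64_t> X = features->get_feature_matrix();
		// ... estimate m_mixing_matrix from X ...
	}
};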

namespace shogun
{

template <>
Member:

why move this into the implementation?

@vinx13 (Member, author):

the header mixed declaration and implementation, so I split it into .h and .cpp as is done in other template classes, e.g. DensePreprocessor

Member:

gotcha!

[[deprecated]]
#endif
virtual bool
apply_to_string_features(CFeatures* f);
Member:

whitespace problem.

* @param inplace whether transform in place
* @return the result feature object after applying the transformer
*/
virtual CFeatures* apply(CFeatures* features, bool inplace = true) = 0;
Member:

if this is pure virtual, then surely fit could be as well?
I'm not sure I follow the logic here: why does fit have a default implementation while apply does not, or vice versa.

@vinx13 (Member, author):

there are many transformers that do not need fitting, so we don't need to put an empty fit in every subclass, only in this one place.
As for apply, it should always be implemented (except by abstract classes).

Member:

ah yeah, I remember now... sorry, short memory :(

Member:

mmm actually I've realised... don't we actually want to follow the actual transformer API, meaning
.fit and .transform? :)

feats_train.apply_preprocessor()
preproc = SortWordString()
preproc.fit(feats_train)
feats_train = preproc.apply(feats_train)
Member:

indent

@@ -16,7 +16,6 @@
%rename(PNorm) CPNorm;
%rename(RescaleFeatures) CRescaleFeatures;

- %rename(DimensionReductionPreprocessor) CDimensionReductionPreprocessor;
Member:

why do we drop CDimensionReductionPreprocessor? :)

@vinx13 (Member, author):

CDimensionReductionPreprocessor is dropped from both the C++ code and the interface.
It has fields like m_converter, m_kernel, m_distance, but they are not used by subclasses in most cases (e.g. PCA), so it is not very helpful as a superclass. And we no longer need to wrap a converter with it.

Member:

okie!

@vigsterkr (Member):

have you seen these, @vinx13?

The following tests FAILED:
	 84 - python_legacy-distance_canberraword (SEGFAULT)
	 86 - python_legacy-distance_hammingword (SEGFAULT)
	 87 - python_legacy-distance_manhattenword (OTHER_FAULT)
	137 - python_legacy-kernel_comm_ulong_string (SEGFAULT)
	138 - python_legacy-kernel_comm_word_string (SEGFAULT)
	180 - python_legacy-kernel_weighted_comm_word_string (Failed)
	207 - python_legacy-preprocessor_sortulongstring (SEGFAULT)
	208 - python_legacy-preprocessor_sortwordstring (SEGFAULT)
	214 - python_legacy-serialization_string_kernels (SEGFAULT)
	227 - python_legacy-tests_check_commwordkernel_memleak (SEGFAULT)
Errors while running CTest

@vigsterkr vigsterkr merged commit d2baac3 into shogun-toolbox:feature/transformers May 30, 2018
@@ -13,7 +13,8 @@ ica.set_tol(0.00001)
#![set_parameters]

#![apply_convert]
- Features converted = ica.apply(feats)
+ ica.fit(feats)
Member:

I like the new API here!
