Add transformer base class and adapt preprocessor / converter to fit + apply #4285

Merged
merged 61 commits into shogun-toolbox:feature/transformers on May 30, 2018

Conversation

@vinx13
Contributor

vinx13 commented May 14, 2018

  • add a transformer base class and make converter / preprocessor its direct subclasses
  • adapt CPreprocessor::init(CFeatures*) to fit, possibly calling cleanup in fit if already fitted, to support re-fitting
  • clean up type checks in init of preprocessors; use CFeatures::as instead in the apply phase
  • add a parameter 'inplace' (default true) to preprocessor / converter :: apply; when set, the transform is done in place if possible
  • drop CDimensionReductionPreprocessor, since it is only a wrapper around converters and provides irrelevant members (kernel, distance, ...) for its subtypes
  • subclasses of CDimensionReductionPreprocessor now inherit directly from CDensePreprocessor<float64_t>
  • make KernelPCA inherit from CPreprocessor, since it can work with other feature types
  • refactor ICA converters into fit + apply
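The fit + apply split described above can be sketched in a minimal, self-contained form. Transformer and SubtractMean below are illustrative stand-ins, not Shogun's actual classes (the real base class works on CFeatures, not std::vector): fit has a no-op default for stateless transformers, while apply is pure virtual.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the transformer base class from this PR.
class Transformer {
public:
    virtual ~Transformer() = default;

    // Default no-op: many transformers need no fitting at all.
    virtual void fit(const std::vector<double>& features) {}

    // Pure virtual: every concrete transformer must implement apply.
    // 'inplace' hints that the input may be modified directly.
    virtual std::vector<double> apply(std::vector<double> features,
                                      bool inplace = true) = 0;
};

// Example of a transformer that must be fitted before use.
class SubtractMean : public Transformer {
    double m_mean = 0.0;
    bool m_fitted = false;

public:
    void fit(const std::vector<double>& features) override {
        double sum = 0.0;
        for (double v : features)
            sum += v;
        m_mean = sum / features.size();
        m_fitted = true;
    }

    std::vector<double> apply(std::vector<double> features,
                              bool inplace = true) override {
        // Guard against applying an unfitted transformer.
        assert(m_fitted && "SubtractMean has not been fitted.");
        for (double& v : features)
            v -= m_mean;
        return features;
    }
};
```

Giving fit a no-op default means stateless transformers (e.g. simple normalizers) do not have to repeat an empty override, which matches the reasoning in the discussion below.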

@vinx13 vinx13 force-pushed the vinx13:feature/transformers branch from 187ce2c to e813594 May 14, 2018

@vinx13 vinx13 force-pushed the vinx13:feature/transformers branch from e813594 to 73bc7ec May 15, 2018

@vigsterkr

great start!

}
/** Fit transformer to features */
virtual void fit(CFeatures* features)

@vigsterkr

vigsterkr May 15, 2018

Member

this could be a pure virtual function in order to be an abstract class, or?

@vinx13

vinx13 May 15, 2018

Contributor

some transformers actually do not need fitting, so i leave this as a no-op

@@ -163,7 +156,7 @@ void CHomogeneousKernelMap::init()
SGMatrix<float64_t> CHomogeneousKernelMap::apply_to_feature_matrix (CFeatures* features)
{
- CDenseFeatures<float64_t>* simple_features = (CDenseFeatures<float64_t>*)features;
+ auto simple_features = features->as<CDenseFeatures<float64_t>>();

@vigsterkr

vigsterkr May 15, 2018

Member

needs a quickfix to check whether it has been trained or not... :)
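The C-style casts replaced throughout this PR by features->as<...>() follow a checked-downcast idiom that can be sketched as follows. SGObject, DenseFeatures, and StringFeatures here are simplified stand-ins, not Shogun's actual implementation; the point is that a failed conversion raises a clear error instead of silently producing a bad pointer.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch of the checked-cast idiom behind obj->as<T>().
struct SGObject {
    virtual ~SGObject() = default;

    // Checked downcast: throws with a readable message on type mismatch,
    // unlike a raw (T*) C-style cast.
    template <typename T>
    T* as() {
        T* casted = dynamic_cast<T*>(this);
        if (!casted)
            throw std::invalid_argument(
                "Object cannot be cast to the requested type.");
        return casted;
    }
};

struct DenseFeatures : SGObject {};
struct StringFeatures : SGObject {};
```

This is why the review notes below observe that .as already produces a sane error message, making an extra get_feature_class() assertion mostly redundant.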

kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);
linalg::scale(bias_tmp, bias_tmp, -1.0 / n);
float64_t s = linalg::sum(bias_tmp) / n;

@vigsterkr
SGMatrix<float64_t> eigenvectors(kernel_matrix.num_rows, m_target_dim);
linalg::eigen_solver_symmetric(
kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);

@vigsterkr
float64_t* var=SG_MALLOC(float64_t, num_features);
int32_t i,j;
m_mean.resize_vector(num_features);
float64_t* var = SG_MALLOC(float64_t, num_features);

@vigsterkr

vigsterkr May 15, 2018

Member

sgvector could be used, or?

- int32_t num_ok=0;
- int32_t* idx_ok=SG_MALLOC(int32_t, num_features);
+ int32_t num_ok = 0;
+ int32_t* idx_ok = SG_MALLOC(int32_t, num_features);

@vigsterkr

vigsterkr May 15, 2018

Member

sgvector?

for (i=0; i<num_vec; i++)
{
int32_t len=0;
bool free_vec;
- uint64_t* vec=((CStringFeatures<uint64_t>*)f)->
- 	get_feature_vector(i, len, free_vec);
+ uint64_t* vec = sf->get_feature_vector(i, len, free_vec);

@vigsterkr
@@ -56,14 +47,14 @@ bool CSortUlongString::save(FILE* f)
bool CSortUlongString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint64_t>*)f)->get_num_vectors();
+ auto sf = f->as<CStringFeatures<uint64_t>>();
+ int32_t num_vec = sf->get_num_vectors();

@vigsterkr
@@ -56,13 +47,14 @@ bool CSortWordString::save(FILE* f)
bool CSortWordString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint16_t>*)f)->get_num_vectors() ;
+ auto sf = f->as<CStringFeatures<uint16_t>>();
+ int32_t num_vec = sf->get_num_vectors();

@vigsterkr
for (i=0; i<num_vec; i++)
{
int32_t len = 0 ;
bool free_vec;
- uint16_t* vec = ((CStringFeatures<uint16_t>*)f)->get_feature_vector(i, len, free_vec);
+ uint16_t* vec = sf->get_feature_vector(i, len, free_vec);

@vigsterkr

note the indenting!

vinx13 added some commits May 16, 2018

Cleanup and refactor PCA and FisherLDA
Implement apply in PCA and FisherLDA
Inherit directly from dense preproc
Fix swig
Drop DimensionReductionPreprocessor in swig
Guard deprecated in swig
Remove deprecated flag for apply_to_vector
@karlnapf

whitespace

@karlnapf

Member

karlnapf commented May 21, 2018

This is a cool push!

@vinx13 vinx13 changed the title from [WIP] Add transformer base class and adapt preprocessor / converter to fit + apply to Add transformer base class and adapt preprocessor / converter to fit + apply May 27, 2018

@vigsterkr

initial review

- Normalize.apply_to_feature_matrix(features_train)
- Normalize.apply_to_feature_matrix(features_test)
+ SubMean.fit(features_train)
+ Features features_train1 = SubMean.apply(features_train)

@vigsterkr

vigsterkr May 29, 2018

Member

a more 'representative' naming would be more desirable :P than just features_train1

Features features_train1 = SubMean.apply(features_train)
Features features_test1 = SubMean.apply(features_test)
Normalize.fit(features_train)
Features features_train2 = Normalize.apply(features_train1)

@vigsterkr

vigsterkr May 29, 2018

Member

maybe a better name would be normalized_features_train/test

{
ASSERT(features);
SG_REF(features);
- SGMatrix<float64_t> X = ((CDenseFeatures<float64_t>*)features)->get_feature_matrix();
+ auto X = features->as<CDenseFeatures<float64_t>>()->get_feature_matrix();

@vigsterkr

vigsterkr May 29, 2018

Member

we could do before this an assertion for get_feature_class() == C_DENSE just to have a sane error message?

@vigsterkr

vigsterkr May 29, 2018

Member

although .as has a quite good message as well...

@@ -59,3 +61,26 @@ float64_t CICAConverter::get_tol() const
return tol;
}
CFeatures* CICAConverter::apply(CFeatures* features, bool inplace)
{
REQUIRE(m_mixing_matrix.matrix, "ICAConverter not fitted.");

@vigsterkr

vigsterkr May 29, 2018

Member

ICAConverter has not been fitted.

*/
virtual CFeatures* apply(CFeatures* features) = 0;
virtual void fit(CFeatures* features) = 0;

@vigsterkr

vigsterkr May 29, 2018

Member

if we know that this class is an abstract class for other classes that do conversion over CDenseFeatures<float64_t>, then it would be good to actually do the type mapping here, meaning not to have this function pure virtual but to implement something like this:

virtual void fit(CFeatures* features)
{
    REQUIRE(features->get_feature_class() == C_DENSE, "ICA converters only work with dense features");
    fit((CDenseFeatures<float64_t>*)features);
}

and have a pure virtual protected member function with the declaration:

virtual void fit(CDenseFeatures<float64_t>*) = 0;

this way the casting needs to be done in only one place; everywhere else you can already assume CDenseFeatures<float64_t> as input.
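The dispatch pattern suggested here can be sketched in self-contained form. FeatureClass, Features, DenseFeatures, StringFeatures, and DemoConverter below are simplified stand-ins for Shogun's classes: the public fit() does the type check and cast exactly once, and subclasses implement a protected hook that can already assume dense input.

```cpp
#include <cassert>
#include <stdexcept>

// Simplified stand-ins for CFeatures / CDenseFeatures.
enum class FeatureClass { Dense, String };

struct Features {
    virtual ~Features() = default;
    virtual FeatureClass get_feature_class() const = 0;
};

struct DenseFeatures : Features {
    FeatureClass get_feature_class() const override { return FeatureClass::Dense; }
};

struct StringFeatures : Features {
    FeatureClass get_feature_class() const override { return FeatureClass::String; }
};

class ICAConverterBase {
public:
    virtual ~ICAConverterBase() = default;

    // Single public entry point: type checking and casting happen here only.
    void fit(Features* features) {
        if (features->get_feature_class() != FeatureClass::Dense)
            throw std::invalid_argument(
                "ICA converters only work with dense features");
        fit_dense(static_cast<DenseFeatures*>(features));
    }

protected:
    // Subclasses override this and may assume dense features.
    virtual void fit_dense(DenseFeatures* features) = 0;
};

// Minimal concrete converter used to demonstrate the dispatch.
struct DemoConverter : ICAConverterBase {
    bool fitted = false;

protected:
    void fit_dense(DenseFeatures*) override { fitted = true; }
};
```

This is the non-virtual interface (NVI) idiom: callers always go through the checked public fit, so a wrong feature type fails with one clear message rather than a scattered cast somewhere in a subclass.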

namespace shogun
{
template <>

@vigsterkr

vigsterkr May 29, 2018

Member

why move this into implementation ?

@vinx13

vinx13 May 29, 2018

Contributor

the header mixed declarations and implementation, so i split it into .h and .cpp as is done in other template classes, e.g. DensePreprocessor

@vigsterkr

vigsterkr May 29, 2018

Member

gotcha!

[[deprecated]]
#endif
virtual bool
apply_to_string_features(CFeatures* f);

@vigsterkr

vigsterkr May 29, 2018

Member

whitespace problem.

* @param inplace whether transform in place
* @return the result feature object after applying the transformer
*/
virtual CFeatures* apply(CFeatures* features, bool inplace = true) = 0;

@vigsterkr

vigsterkr May 29, 2018

Member

if this is pure virtual, then surely fit could be as well?
i'm not sure if i follow the logic here why fit has a default implementation and why the apply does not or vice versa.

@vinx13

vinx13 May 29, 2018

Contributor

there are many transformers that do not need fitting; we don't need to put an empty fit in every subclass, only in this one place here.
but as for apply, it should always be implemented (except in abstract classes)

@vigsterkr

vigsterkr May 29, 2018

Member

ah yeah i remember now... sorry short memory :(

@vigsterkr

vigsterkr May 29, 2018

Member

mmm actually i've realised... don't we actually want to follow the usual transformer api, meaning .fit and .transform? :)

- feats_train.apply_preprocessor()
  preproc = SortWordString()
+ preproc.fit(feats_train)
+ feats_train = preproc.apply(feats_train)

@vigsterkr
@@ -16,7 +16,6 @@
%rename(PNorm) CPNorm;
%rename(RescaleFeatures) CRescaleFeatures;
- %rename(DimensionReductionPreprocessor) CDimensionReductionPreprocessor;

@vigsterkr

vigsterkr May 29, 2018

Member

why do we drop CDimensionReductionPreprocessor? :)

@vinx13

vinx13 May 29, 2018

Contributor

CDimensionReductionPreprocessor is dropped from both the cpp code and the interface.
It has fields like m_converter, m_kernel, m_distance, but they are not used by its subclasses in most cases (e.g. PCA), so it is not very helpful as a superclass. And we don't need to wrap a converter with it now


Member

vigsterkr commented May 30, 2018

have u seen these @vinx13

The following tests FAILED:
	 84 - python_legacy-distance_canberraword (SEGFAULT)
	 86 - python_legacy-distance_hammingword (SEGFAULT)
	 87 - python_legacy-distance_manhattenword (OTHER_FAULT)
	137 - python_legacy-kernel_comm_ulong_string (SEGFAULT)
	138 - python_legacy-kernel_comm_word_string (SEGFAULT)
	180 - python_legacy-kernel_weighted_comm_word_string (Failed)
	207 - python_legacy-preprocessor_sortulongstring (SEGFAULT)
	208 - python_legacy-preprocessor_sortwordstring (SEGFAULT)
	214 - python_legacy-serialization_string_kernels (SEGFAULT)
	227 - python_legacy-tests_check_commwordkernel_memleak (SEGFAULT)
Errors while running CTest

@vigsterkr vigsterkr merged commit d2baac3 into shogun-toolbox:feature/transformers May 30, 2018

1 of 2 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr The Travis CI build passed
@@ -13,7 +13,8 @@ ica.set_tol(0.00001)
#![set_parameters]
#![apply_convert]
Features converted = ica.apply(feats)
+ ica.fit(feats)

@karlnapf

karlnapf May 30, 2018

Member

I like the new API here!
