
Add transformer base class and adapt preprocessor / converter to fit + apply #4285

Merged

Conversation

@vinx13 vinx13 (Member) commented May 14, 2018

  • add a transformer base class and make converter / preprocessor its direct subclasses
  • adapt CPreprocessor::init(CFeatures*) to fit; call cleanup in fit if already fitted, to support re-fitting
  • clean up the type checks in the init of preprocessors; use CFeatures::as instead in the apply phase
  • add a parameter 'inplace' (default true) to preprocessor / converter apply, which transforms in place when possible
  • drop CDimensionReductionPreprocessor, since it is only a wrapper around converters and provides irrelevant things (kernel, distance, ..) to its subtypes
  • subclasses of CDimensionReductionPreprocessor now inherit directly from `CDensePreprocessor<float64_t>`
  • make KernelPCA inherit from CPreprocessor, since it can work with other types of features
  • refactor ICA converters into fit + apply
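The transformer design summarized above can be sketched in a minimal, dependency-free C++ form (a sketch only: Features, Transformer, and SubtractMean here are simplified stand-ins, not the actual shogun classes):

```cpp
#include <stdexcept>
#include <vector>

// Simplified stand-in for CFeatures.
struct Features
{
	std::vector<double> data;
};

// Sketch of the transformer base class: fit() defaults to a no-op,
// since many transformers need no fitting, while apply() is pure
// virtual and takes an 'inplace' flag that defaults to true.
class Transformer
{
public:
	virtual ~Transformer() = default;

	// Default no-op; stateful transformers override this and may
	// clean up previous state first, to support re-fitting.
	virtual void fit(Features* /*features*/)
	{
	}

	virtual Features* apply(Features* features, bool inplace = true) = 0;
};

// Example of a transformer that needs fitting: it subtracts the mean
// learned in fit() and refuses to apply before being fitted.
class SubtractMean : public Transformer
{
public:
	void fit(Features* features) override
	{
		double sum = 0;
		for (double v : features->data)
			sum += v;
		m_mean = sum / features->data.size();
		m_fitted = true;
	}

	Features* apply(Features* features, bool inplace = true) override
	{
		if (!m_fitted)
			throw std::runtime_error("SubtractMean has not been fitted.");
		Features* result = inplace ? features : new Features(*features);
		for (double& v : result->data)
			v -= m_mean;
		return result;
	}

private:
	double m_mean = 0;
	bool m_fitted = false;
};
```

With inplace = false the input is copied before transforming; with the default inplace = true the feature data is modified directly, mirroring the "transform in place when possible" semantics described above.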
@vinx13 vinx13 force-pushed the vinx13:feature/transformers branch from 187ce2c to e813594 May 14, 2018
@vinx13 vinx13 force-pushed the vinx13:feature/transformers branch from e813594 to 73bc7ec May 15, 2018
@vigsterkr vigsterkr (Member) left a comment

great start!

}

/** Fit transformer to features */
virtual void fit(CFeatures* features)

@vigsterkr vigsterkr (Member) commented May 15, 2018

this could be a pure virtual function in order to be an abstract class, or?

@vinx13 vinx13 (Author, Member) commented May 15, 2018

some transformers actually do not need fitting, so I leave this as a no-op

@@ -163,7 +156,7 @@ void CHomogeneousKernelMap::init()

SGMatrix<float64_t> CHomogeneousKernelMap::apply_to_feature_matrix (CFeatures* features)
{
- CDenseFeatures<float64_t>* simple_features = (CDenseFeatures<float64_t>*)features;
+ auto simple_features = features->as<CDenseFeatures<float64_t>>();

@vigsterkr vigsterkr (Member) commented May 15, 2018

needs a quick fix to check whether it has been trained or not... :)

kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);
linalg::scale(bias_tmp, bias_tmp, -1.0 / n);
float64_t s = linalg::sum(bias_tmp) / n;


SGMatrix<float64_t> eigenvectors(kernel_matrix.num_rows, m_target_dim);
linalg::eigen_solver_symmetric(
kernel_matrix, eigenvalues, eigenvectors, m_target_dim);
SGVector<float64_t> bias_tmp = linalg::rowwise_sum(kernel_matrix);


- float64_t* var=SG_MALLOC(float64_t, num_features);
- int32_t i,j;
+ m_mean.resize_vector(num_features);
+ float64_t* var = SG_MALLOC(float64_t, num_features);

@vigsterkr vigsterkr (Member) commented May 15, 2018

SGVector could be used, or?

- int32_t num_ok=0;
- int32_t* idx_ok=SG_MALLOC(int32_t, num_features);
+ int32_t num_ok = 0;
+ int32_t* idx_ok = SG_MALLOC(int32_t, num_features);

@vigsterkr vigsterkr (Member) commented May 15, 2018

SGVector?


for (i=0; i<num_vec; i++)
{
int32_t len=0;
bool free_vec;
- uint64_t* vec=((CStringFeatures<uint64_t>*)f)->
- get_feature_vector(i, len, free_vec);
+ uint64_t* vec = sf->get_feature_vector(i, len, free_vec);


@@ -56,14 +47,14 @@ bool CSortUlongString::save(FILE* f)
bool CSortUlongString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint64_t>*)f)->get_num_vectors();
+ auto sf = f->as<CStringFeatures<uint64_t>>();
+ int32_t num_vec = sf->get_num_vectors();


@@ -56,13 +47,14 @@ bool CSortWordString::save(FILE* f)
bool CSortWordString::apply_to_string_features(CFeatures* f)
{
int32_t i;
- int32_t num_vec=((CStringFeatures<uint16_t>*)f)->get_num_vectors() ;
+ auto sf = f->as<CStringFeatures<uint16_t>>();
+ int32_t num_vec = sf->get_num_vectors();



for (i=0; i<num_vec; i++)
{
int32_t len = 0 ;
bool free_vec;
- uint16_t* vec = ((CStringFeatures<uint16_t>*)f)->get_feature_vector(i, len, free_vec);
+ uint16_t* vec = sf->get_feature_vector(i, len, free_vec);

@vigsterkr

note the indenting!

vinx13 added 8 commits May 16, 2018
Implement apply in PCA and FisherLDA
Inherit directly from dense preproc
Fix swig
Drop DimensionReductionPreprocessor in swig
Guard deprecated in swig
Remove deprecated flag for apply_to_vector
@karlnapf


whitespace

@karlnapf karlnapf (Member) commented May 21, 2018

This is a cool push!

@vinx13 vinx13 changed the title [WIP] Add transformer base class and adapt preprocessor / converter to fit + apply Add transformer base class and adapt preprocessor / converter to fit + apply May 27, 2018
@vigsterkr vigsterkr (Member) left a comment

initial review

Normalize.apply_to_feature_matrix(features_train)
Normalize.apply_to_feature_matrix(features_test)
SubMean.fit(features_train)
Features features_train1 = SubMean.apply(features_train)

@vigsterkr vigsterkr (Member) commented May 29, 2018

a more 'representative' name than just features_train1 would be desirable :P

Features features_train1 = SubMean.apply(features_train)
Features features_test1 = SubMean.apply(features_test)
Normalize.fit(features_train)
Features features_train2 = Normalize.apply(features_train1)

@vigsterkr vigsterkr (Member) commented May 29, 2018

maybe a better name would be normalized_features_train/test

{
ASSERT(features);
SG_REF(features);

- SGMatrix<float64_t> X = ((CDenseFeatures<float64_t>*)features)->get_feature_matrix();
+ auto X = features->as<CDenseFeatures<float64_t>>()->get_feature_matrix();

@vigsterkr vigsterkr (Member) commented May 29, 2018

we could add an assertion for get_feature_class() == C_DENSE before this, just to have a sane error message?

@vigsterkr vigsterkr (Member) commented May 29, 2018

although .as has quite a good message as well...
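The kind of checked cast being discussed can be sketched standalone (a sketch: FeaturesBase and the two feature structs below are hypothetical stand-ins, not the actual CFeatures::as implementation):

```cpp
#include <stdexcept>
#include <string>
#include <typeinfo>

// Stand-in base with an as<T>() helper in the spirit of CFeatures::as:
// a checked downcast that fails with a descriptive error message
// instead of silently miscasting like a C-style cast would.
struct FeaturesBase
{
	virtual ~FeaturesBase() = default;

	template <class T>
	T* as()
	{
		T* result = dynamic_cast<T*>(this);
		if (!result)
			throw std::invalid_argument(
			    std::string("Object cannot be cast to ") + typeid(T).name());
		return result;
	}
};

struct DenseFeatures64 : FeaturesBase
{
	int num_vectors = 0;
};

struct StringFeatures16 : FeaturesBase
{
};
```

A plain C-style cast such as ((CDenseFeatures<float64_t>*)features) would happily reinterpret a wrong feature type; the checked variant surfaces the mistake at the call site.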

@@ -59,3 +61,26 @@ float64_t CICAConverter::get_tol() const
return tol;
}

CFeatures* CICAConverter::apply(CFeatures* features, bool inplace)
{
REQUIRE(m_mixing_matrix.matrix, "ICAConverter not fitted.");

@vigsterkr vigsterkr (Member) commented May 29, 2018

ICAConverter has not been fitted.

*/
virtual CFeatures* apply(CFeatures* features) = 0;
virtual void fit(CFeatures* features) = 0;

@vigsterkr vigsterkr (Member) commented May 29, 2018

if we know that this class is an abstract base for classes that do conversion over CDenseFeatures<float64_t>, then it would be good to do the type mapping here, i.e. not make this function pure virtual but implement something like this:

virtual void fit(CFeatures* features)
{
    REQUIRE(features->get_feature_class() == C_DENSE, "ICA converters only work with dense features")
    fit((CDenseFeatures<float64_t>*)features);
}

and have a pure virtual protected member function that has a declaration:

virtual void fit(CDenseFeatures<float64_t>*) = 0;

this way you need to do the cast in only one place; everywhere else you can already assume CDenseFeatures<float64_t> as input.

namespace shogun
{

template <>

@vigsterkr vigsterkr (Member) commented May 29, 2018

why move this into the implementation?

@vinx13 vinx13 (Author, Member) commented May 29, 2018

the header mixed declaration and implementation, so I split it into .h and .cpp as is done for other template classes, e.g. DensePreprocessor

@vigsterkr vigsterkr (Member) commented May 29, 2018

gotcha!

[[deprecated]]
#endif
virtual bool
apply_to_string_features(CFeatures* f);

@vigsterkr vigsterkr (Member) commented May 29, 2018

whitespace problem.

* @param inplace whether transform in place
* @return the result feature object after applying the transformer
*/
virtual CFeatures* apply(CFeatures* features, bool inplace = true) = 0;

@vigsterkr vigsterkr (Member) commented May 29, 2018

if this is pure virtual, then surely fit could be as well?
I'm not sure I follow the logic here: why does fit have a default implementation while apply does not, or vice versa?

@vinx13 vinx13 (Author, Member) commented May 29, 2018

there are many transformers that do not need fitting; instead of putting an empty fit in every subclass, we provide it in one place here.
As for apply, it should always be implemented (except by abstract classes).

@vigsterkr vigsterkr (Member) commented May 29, 2018

ah yeah, I remember now... sorry, short memory :(

@vigsterkr vigsterkr (Member) commented May 29, 2018

mmm actually I've realised... don't we want to follow the usual transformer API, meaning .fit and .transform? :)

feats_train.apply_preprocessor()
preproc = SortWordString()
preproc.fit(feats_train)
feats_train = preproc.apply(feats_train)


@@ -16,7 +16,6 @@
%rename(PNorm) CPNorm;
%rename(RescaleFeatures) CRescaleFeatures;

- %rename(DimensionReductionPreprocessor) CDimensionReductionPreprocessor;

@vigsterkr vigsterkr (Member) commented May 29, 2018

why do we drop CDimensionReductionPreprocessor? :)

@vinx13 vinx13 (Author, Member) commented May 29, 2018

CDimensionReductionPreprocessor is dropped from both the C++ code and the interfaces.
It has fields like m_converter, m_kernel, and m_distance, but in most cases they are not used by its subclasses (e.g. PCA), so it is not very helpful as a superclass. And we no longer need to wrap a converter with it.

@vigsterkr vigsterkr (Member) commented May 30, 2018

have you seen these, @vinx13?

The following tests FAILED:
	 84 - python_legacy-distance_canberraword (SEGFAULT)
	 86 - python_legacy-distance_hammingword (SEGFAULT)
	 87 - python_legacy-distance_manhattenword (OTHER_FAULT)
	137 - python_legacy-kernel_comm_ulong_string (SEGFAULT)
	138 - python_legacy-kernel_comm_word_string (SEGFAULT)
	180 - python_legacy-kernel_weighted_comm_word_string (Failed)
	207 - python_legacy-preprocessor_sortulongstring (SEGFAULT)
	208 - python_legacy-preprocessor_sortwordstring (SEGFAULT)
	214 - python_legacy-serialization_string_kernels (SEGFAULT)
	227 - python_legacy-tests_check_commwordkernel_memleak (SEGFAULT)
Errors while running CTest
@vigsterkr vigsterkr merged commit d2baac3 into shogun-toolbox:feature/transformers May 30, 2018
1 of 2 checks passed
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr The Travis CI build passed
@@ -13,7 +13,8 @@ ica.set_tol(0.00001)
#![set_parameters]

#![apply_convert]
Features converted = ica.apply(feats)
ica.fit(feats)

@karlnapf karlnapf (Member) commented May 30, 2018

I like the new API here!

@samdbrice samdbrice added the Tag: GSoC label Sep 6, 2019

4 participants