Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cookbook of cross validation with pipeline #4380

Merged
merged 39 commits into from Jan 31, 2019

Conversation

vinx13
Copy link
Member

@vinx13 vinx13 commented Jul 20, 2018

data pr shogun-toolbox/shogun-data#164
also need to merge #4377 first

@@ -0,0 +1,34 @@
============================
Cross Validation on Pipeline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on "a" pipeline

Cross Validation on Pipeline
============================

In this example, we illustrate how to use cross-validation with :sgclass:`CPipeline`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe link to some cross-validation cookbook. There is some way to link to notebooks, I think MKL does this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha it is just below, nevermind :)

-------
Example
-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly weight English:
We demonstrate a pipeline consisting of a transformer ..., and LibLinear for binary classification.
(maybe also link to the liblinear cookbook)

-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.

Imagine we have files with training data. We create :sgclass:`CDenseFeatures` (here 64 bit floats aka RealFeatures) as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is copy pasted, but this is now outdates when we use factories.
Just say: "We create CFeatures and CLabels via loading from files"

.. sgexample:: cross_validation_pipeline:create_features


We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence basically repeats the intro, so I would just remove it



We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.
The transformer and the machine are chained as a :sgclass:`CPipeline`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create a Cpipeline, and chain the transformer and the classifier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I would mention that "we first chain all transformers, and then finalize the pipeline with the classifier" (since you use "then")

.. sgexample:: cross_validation_pipeline:create_pipeline

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see CEvaluation

.. sgexample:: cross_validation_pipeline:create_pipeline

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last word model (singular)

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.
In this case, we use :sgclass:`CAccuracyMeasure`.
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also mention something like "The pipeline instance behaves just like a machine and this can be directly passed to CCrossValidation"

#![create_pipeline]

#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont we have factories for those things?
I would prefer if all newly added examples would fully use the new api so we dont have to refactor later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have factory for CSplittingStrategy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind creating one? That should be pretty easy!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlnapf since there are multiple subclasses of CSplittingStrategy, are we going to do some string comparison by name?

Copy link
Member

@karlnapf karlnapf Jul 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the factory machine in factory.h, you just have to add some macro lines
the call should be splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, k=2)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we need to do initialization in init() method, instead of the constructor
e.g.

CStratifiedCrossValidationSplitting::CStratifiedCrossValidationSplitting(

this need refactor right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, this will be moved into the method that is called from outside, i.e. build_subsets ... putting it into a helper method makes sense

Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")

PipelineBuilder builder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, could you pls use a factory for this. We dont want to use constructors in the examples

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain how to use factory here?
we could create a factory PipelineBuilder* pipeline_builder(), or?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, though I would just name it pipeline

#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)
Evaluation evaluation_criteron = evaluation("AccuracyMeasure")
CrossValidation cross(pipeline, feats_train, labels_train, splitting_strategy, evaluation_criteron, False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factory if possible (this might be more tricky)

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great example! I made some comments

@karlnapf
Copy link
Member

Let me know if the factory creating worked....

@@ -14,7 +14,7 @@ Labels labels_test = labels(f_labels_test)
Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")

PipelineBuilder builder()
PipelineBuilder builder = pipeline()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly like this! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW why is the C++ ype not just called Pipeline and the Machine called PipelineMachine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think Pipeline and PipelineMachine might be a bit confusing, while PipelineBuilder can indicate its usage as a builder

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I dont really agree, I think then should actually return CMachine (since the subsequent object will be used in this fashion) ... why do we need to know that a machine is a pipeline if it behaves as a machine?
And then Pipeline is the thing that builds the stuff.
Makes for a cleaner API imo

@vigsterkr @lisitsyn @iglesias what are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlnapf are we talking now about PipelineMachine and PipelineBuilder, or why then returns Pipeline* ? :) imo those are different things or?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So CPipeline -> CPipelineMachine, CPipelineBuilder ->CPipeline, then returns CMachine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo builder is just clearer but i dont have any strong feelings about it...
note my second comment about extraction and observation of pipeline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the stages thing is a problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. until we cannot register and cleanly extract stages from a pipeline this sort of explicit exposure is required :) otherwise the whole thing becomes totally opaque once built

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! thx for clarifying.
I still would slightly prefer different names (CPipelineMachine, CPipeline) but it is only a minor difference, also no strong feelings. Can leave it as it is

@karlnapf
Copy link
Member

rebase and merge :)


.. sgexample:: cross_validation_pipeline:create_features

We first chain all transformers, and then finalize the pipeline with the classifier.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it one as said above? Then change or remove "all".

Labels labels_test = labels(f_labels_test)
#![create_features]

#![create_pipeline
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing ]?


PipelineBuilder builder = pipeline()
builder.over(subMean)
Pipeline pipeline = builder.then(svm)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the API contain these two builder and pipeline concepts? What about just a pipeline where steps are added (and of course the order of addition matters).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is to separate construction of pipeline into a single class so that the pipeline object is immutable, and then we don't need to verify elements of pipeline everytime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much the same thing that got me confused about the builder vs the pipeline.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second one, Pipeline can be Machine here because that's enough, we use it as machine in xval. however, if we want to get elements in the pipeline, we still need Pipeline type for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding a factory for it? So users can call
PipelineMachine(machine)?
Alternatively, for the c++ folks, there is as

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that there might be mutability reason justifying two types. Why both part of the interface though?

@vigsterkr
Copy link
Member

vigsterkr commented Jul 27, 2018 via email

@vigsterkr
Copy link
Member

vigsterkr commented Jul 27, 2018 via email

@iglesias
Copy link
Collaborator

iglesias commented Jul 27, 2018 via email


PipelineBuilder builder = pipeline()
builder.over(subMean)
Machine pipeline = builder.then(svm)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this overloading of var name and the factory work? Just asking

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this overload the factory name, the factory works in this case. thanks for letting me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flow of this, so good to merge from my side

@karlnapf
Copy link
Member

karlnapf commented Aug 8, 2018

Can you rebase data and then we can merge this as well!

@vinx13 vinx13 force-pushed the cookbook/pipeline_xval branch 2 times, most recently from 7651aa9 to 8752675 Compare August 14, 2018 12:00
@karlnapf
Copy link
Member

Shall we get this in soon? :)

@vinx13
Copy link
Member Author

vinx13 commented Nov 12, 2018

@karlnapf As I can remember, previously we faced the choice on whether pipeline builder should return CMachine* or CPipeline*. The first choice, returning CMachine* and then providing a pipeline factory, works though being ugly. According to @vigsterkr 's idea, in my last commit I tried to make CPipeline as base class, but got some different errors on travis

what():  �[1;31m[ERROR]�[0m In file /opt/shogun/src/shogun/base/SGObject.h line 369: Cannot put parameter CrossValidation::machine of type shogun::CMachine*, incompatible provided type shogun::CPipeline*.

Maybe @lisitsyn has idea?

@karlnapf
Copy link
Member

Ah I remember now.
Well the real question is whether Pipeline should be part of the modular interfaces (i.e. the base types) or not.

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool that it works. I like that this is a general solution to the problem, despite being a not super elegant.

Any further thoughts on this idea of implicitly casting a pipeline to CMachine via an argument of a factory? @lisitsyn @vigsterkr

@@ -61,7 +61,8 @@
"get_int_vector": "$object.get($arguments)",
"get_real": "$object.get($arguments)",
"get_real_vector": "$object.get($arguments)",
"get_real_matrix": "$object.get($arguments)"
"get_real_matrix": "$object.get($arguments)",
"put_machine": "$object.put($arguments)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not work as we cannot extract the arguments yet and you would need to pass the second argument to the machine factory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do this once #4490 is solved


PipelineBuilder builder = pipeline()
builder.over(subMean)
Machine pipeline = builder.then(svm)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type here should be "Pipeline" as that is what the builder returns (and now Pipeline is part of swig as well)

SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2)
Evaluation evaluation_criterion = evaluation("AccuracyMeasure")
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2)
cross.put_machine("machine", pipeline)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to cross.put(machine(pipeline)) and it should work
we can change that to put_machine later once #4490 is in

#![create_cross_validation]
SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2)
Evaluation evaluation_criterion = evaluation("AccuracyMeasure")
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2)
Copy link
Member

@karlnapf karlnapf Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OT: Can I suggest that we don't provide features and labels to crossvalidation but instead as parameters of CMachineEvaluation::evaluate(CFeatures*, CLabels*) ? Different PR though
@lisitsyn @vigsterkr @iglesias

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey mate, I can't say much based on this snippet. What's your point? Also based on the last line, CrossValidation here seems to be a type of MachineEvaluation, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d like to pass it as function arguments if the evaluation function rather than as fields before that....

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

;)

@karlnapf
Copy link
Member

Great that this now works !
Shall we address the other comments and then finally merge this?

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only some minor things are left

examples/meta/generator/targets/cpp.json Outdated Show resolved Hide resolved
@@ -1371,9 +1373,9 @@ def _internal_factory_wrapper(object_name, new_name, docstring=None):
via .put
"""
_obj = getattr(sys.modules[__name__], object_name)
def _internal_factory(name, **kwargs):
def _internal_factory(name, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed abymore

@@ -256,4 +256,9 @@ namespace shogun
{
return get_machine()->get_machine_problem_type();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can’t this be in factory?

@@ -48,7 +48,8 @@ TEST_F(PipelineTest, fit_predict)
auto pipeline = some<CPipelineBuilder>()
->over(transformer1)
->over(transformer2)
->then(machine);
->then(machine)
->as<CPipeline>();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that needed?

@karlnapf
Copy link
Member

karlnapf commented Jan 31, 2019

Something about include paths is broken in the example ... sorry

@@ -0,0 +1,33 @@
============================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor. the ==== should be as long as the text

We also have to decide on an evaluation criterion class (see :sgclass:`CEvaluation`) to evaluate the performance of the trained model.
In this case, we use :sgclass:`CAccuracyMeasure`.
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs.
The pipeline instance behaves just like a :sgclass:`CMachine` and this can be directly passed to :sgclass:`CCrossValidation`.
Copy link
Member