Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cookbook of cross validation with pipeline #4380

Merged
merged 39 commits into from Jan 31, 2019

Conversation

@vinx13
Copy link
Member

@vinx13 vinx13 commented Jul 20, 2018

data pr shogun-toolbox/shogun-data#164
also need to merge #4377 first

@@ -0,0 +1,34 @@
============================
Cross Validation on Pipeline

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

on "a" pipeline

Cross Validation on Pipeline
============================

In this example, we illustrate how to use cross-validation with :sgclass:`CPipeline`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

maybe link to some cross-validation cookbook. There is some way to link to notebooks, I think MKL does this

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

ha it is just below, nevermind :)

-------
Example
-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

Slightly weight English:
We demonstrate a pipeline consisting of a transformer ..., and LibLinear for binary classification.
(maybe also link to the liblinear cookbook)

-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.

Imagine we have files with training data. We create :sgclass:`CDenseFeatures` (here 64 bit floats aka RealFeatures) as

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

I know this is copy pasted, but this is now outdates when we use factories.
Just say: "We create CFeatures and CLabels via loading from files"

.. sgexample:: cross_validation_pipeline:create_features


We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

This sentence basically repeats the intro, so I would just remove it



We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.
The transformer and the machine are chained as a :sgclass:`CPipeline`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

We create a Cpipeline, and chain the transformer and the classifier.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

actually, I would mention that "we first chain all transformers, and then finalize the pipeline with the classifier" (since you use "then")

.. sgexample:: cross_validation_pipeline:create_pipeline

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

see CEvaluation

.. sgexample:: cross_validation_pipeline:create_pipeline

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

last word model (singular)

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.
In this case, we use :sgclass:`CAccuracyMeasure`.
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

Please also mention something like "The pipeline instance behaves just like a machine and this can be directly passed to CCrossValidation"

#![create_pipeline]

#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

dont we have factories for those things?
I would prefer if all newly added examples would fully use the new api so we dont have to refactor later

This comment has been minimized.

@vinx13

vinx13 Jul 21, 2018
Author Member

we don't have factory for CSplittingStrategy

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

Would you mind creating one? That should be pretty easy!

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018
Author Member

sure

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018
Author Member

@karlnapf since there are multiple subclasses of CSplittingStrategy, are we going to do some string comparison by name?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

check the factory machine in factory.h, you just have to add some macro lines
the call should be splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, k=2)

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018
Author Member

so we need to do initialization in init() method, instead of the constructor
e.g.

CStratifiedCrossValidationSplitting::CStratifiedCrossValidationSplitting(

this need refactor right?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

ah yes, this will be moved into the method that is called from outside, i.e. build_subsets ... putting it into a helper method makes sense

Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")

PipelineBuilder builder()

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

Same here, could you pls use a factory for this. We dont want to use constructors in the examples

This comment has been minimized.

@vinx13

vinx13 Jul 22, 2018
Author Member

could you explain how to use factory here?
we could create a factory PipelineBuilder* pipeline_builder(), or?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

yes exactly, though I would just name it pipeline

#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)
Evaluation evaluation_criteron = evaluation("AccuracyMeasure")
CrossValidation cross(pipeline, feats_train, labels_train, splitting_strategy, evaluation_criteron, False)

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018
Member

factory if possible (this might be more tricky)

Copy link
Member

@karlnapf karlnapf left a comment

Great example! I made some comments

@karlnapf
Copy link
Member

@karlnapf karlnapf commented Jul 23, 2018

Let me know if the factory creating worked....

@@ -14,7 +14,7 @@ Labels labels_test = labels(f_labels_test)
Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")

PipelineBuilder builder()
PipelineBuilder builder = pipeline()

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

exactly like this! :)

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

BTW why is the C++ ype not just called Pipeline and the Machine called PipelineMachine?

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018
Author Member

i think Pipeline and PipelineMachine might be a bit confusing, while PipelineBuilder can indicate its usage as a builder

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

TBH, I dont really agree, I think then should actually return CMachine (since the subsequent object will be used in this fashion) ... why do we need to know that a machine is a pipeline if it behaves as a machine?
And then Pipeline is the thing that builds the stuff.
Makes for a cleaner API imo

@vigsterkr @lisitsyn @iglesias what are your thoughts?

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018
Member

@karlnapf are we talking now about PipelineMachine and PipelineBuilder, or why then returns Pipeline* ? :) imo those are different things or?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

So CPipeline -> CPipelineMachine, CPipelineBuilder ->CPipeline, then returns CMachine

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018
Member

imo builder is just clearer but i dont have any strong feelings about it...
note my second comment about extraction and observation of pipeline.

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

yes the stages thing is a problem

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018
Member

i.e. until we cannot register and cleanly extract stages from a pipeline this sort of explicit exposure is required :) otherwise the whole thing becomes totally opaque once built

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018
Member

got it! thx for clarifying.
I still would slightly prefer different names (CPipelineMachine, CPipeline) but it is only a minor difference, also no strong feelings. Can leave it as it is

@karlnapf
Copy link
Member

@karlnapf karlnapf commented Jul 23, 2018

rebase and merge :)


.. sgexample:: cross_validation_pipeline:create_features

We first chain all transformers, and then finalize the pipeline with the classifier.

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018
Contributor

Is it one as said above? Then change or remove "all".

Labels labels_test = labels(f_labels_test)
#![create_features]

#![create_pipeline

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018
Contributor

Missing ]?


PipelineBuilder builder = pipeline()
builder.over(subMean)
Pipeline pipeline = builder.then(svm)

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018
Contributor

Why does the API contain these two builder and pipeline concepts? What about just a pipeline where steps are added (and of course the order of addition matters).

This comment has been minimized.

@vinx13

vinx13 Jul 26, 2018
Author Member

the idea is to separate construction of pipeline into a single class so that the pipeline object is immutable, and then we don't need to verify elements of pipeline everytime

This comment has been minimized.

@karlnapf

karlnapf Jul 26, 2018
Member

This is pretty much the same thing that got me confused about the builder vs the pipeline.

This comment has been minimized.

@vinx13

vinx13 Jul 27, 2018
Author Member

the second one, Pipeline can be Machine here because that's enough, we use it as machine in xval. however, if we want to get elements in the pipeline, we still need Pipeline type for now

This comment has been minimized.

@karlnapf

karlnapf Jul 27, 2018
Member

what about adding a factory for it? So users can call
PipelineMachine(machine)?
Alternatively, for the c++ folks, there is as

This comment has been minimized.

@iglesias

iglesias Jul 27, 2018
Contributor

I understand that there might be mutability reason justifying two types. Why both part of the interface though?

@vigsterkr
Copy link
Member

@vigsterkr vigsterkr commented Jul 27, 2018

@vigsterkr
Copy link
Member

@vigsterkr vigsterkr commented Jul 27, 2018

@iglesias
Copy link
Contributor

@iglesias iglesias commented Jul 27, 2018

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 0e3b84b to f8120b9 Jul 30, 2018

PipelineBuilder builder = pipeline()
builder.over(subMean)
Machine pipeline = builder.then(svm)

This comment has been minimized.

@karlnapf

karlnapf Jul 30, 2018
Member

does this overloading of var name and the factory work? Just asking

This comment has been minimized.

@vinx13

vinx13 Jul 30, 2018
Author Member

yes this overload the factory name, the factory works in this case. thanks for letting me know.

This comment has been minimized.

@karlnapf

karlnapf Jul 30, 2018
Member

I like the flow of this, so good to merge from my side

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from f8120b9 to 5bd958c Aug 8, 2018
@karlnapf
Copy link
Member

@karlnapf karlnapf commented Aug 8, 2018

Can you rebase data and then we can merge this as well!

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch 2 times, most recently from 7651aa9 to 8752675 Aug 12, 2018
@karlnapf
Copy link
Member

@karlnapf karlnapf commented Nov 12, 2018

Shall we get this in soon? :)

@vinx13
Copy link
Member Author

@vinx13 vinx13 commented Nov 12, 2018

@karlnapf As I can remember, previously we faced the choice on whether pipeline builder should return CMachine* or CPipeline*. The first choice, returning CMachine* and then providing a pipeline factory, works though being ugly. According to @vigsterkr 's idea, in my last commit I tried to make CPipeline as base class, but got some different errors on travis

what():  �[1;31m[ERROR]�[0m In file /opt/shogun/src/shogun/base/SGObject.h line 369: Cannot put parameter CrossValidation::machine of type shogun::CMachine*, incompatible provided type shogun::CPipeline*.

Maybe @lisitsyn has idea?

@karlnapf
Copy link
Member

@karlnapf karlnapf commented Nov 12, 2018

Ah I remember now.
Well the real question is whether Pipeline should be part of the modular interfaces (i.e. the base types) or not.

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 6fe7b65 to 7a00f6e Jan 29, 2019
@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 7a00f6e to d2a3a74 Jan 29, 2019
@@ -61,7 +61,8 @@
"get_int_vector": "$object.get($arguments)",
"get_real": "$object.get($arguments)",
"get_real_vector": "$object.get($arguments)",
"get_real_matrix": "$object.get($arguments)"
"get_real_matrix": "$object.get($arguments)",
"put_machine": "$object.put($arguments)"

This comment has been minimized.

@karlnapf

karlnapf Jan 29, 2019
Member

this will not work as we cannot extract the arguments yet and you would need to pass the second argument to the machine factory

This comment has been minimized.

@karlnapf

karlnapf Jan 29, 2019
Member

we can do this once #4490 is solved


PipelineBuilder builder = pipeline()
builder.over(subMean)
Machine pipeline = builder.then(svm)

This comment has been minimized.

@karlnapf

karlnapf Jan 29, 2019
Member

This type here should be "Pipeline" as that is what the builder returns (and now Pipeline is part of swig as well)

SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2)
Evaluation evaluation_criterion = evaluation("AccuracyMeasure")
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2)
cross.put_machine("machine", pipeline)

This comment has been minimized.

@karlnapf

karlnapf Jan 29, 2019
Member

change to cross.put(machine(pipeline)) and it should work
we can change that to put_machine later once #4490 is in

#![create_cross_validation]
SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2)
Evaluation evaluation_criterion = evaluation("AccuracyMeasure")
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2)

This comment has been minimized.

@karlnapf

karlnapf Jan 29, 2019
Member

OT: Can I suggest that we don't provide features and labels to crossvalidation but instead as parameters of CMachineEvaluation::evaluate(CFeatures*, CLabels*) ? Different PR though
@lisitsyn @vigsterkr @iglesias

This comment has been minimized.

@iglesias

iglesias Jan 29, 2019
Contributor

Hey mate, I can't say much based on this snippet. What's your point? Also based on the last line, CrossValidation here seems to be a type of MachineEvaluation, no?

This comment has been minimized.

@karlnapf

karlnapf Jan 30, 2019
Member

I’d like to pass it as function arguments if the evaluation function rather than as fields before that....

vinx13 added 3 commits Jan 30, 2019
Copy link
Member

@karlnapf karlnapf left a comment

;)

@karlnapf
Copy link
Member

@karlnapf karlnapf commented Jan 30, 2019

Great that this now works !
Shall we address the other comments and then finally merge this?

Copy link
Member

@karlnapf karlnapf left a comment

I think only some minor things are left

examples/meta/generator/targets/cpp.json Outdated Show resolved Hide resolved
@@ -1371,9 +1373,9 @@ def _internal_factory_wrapper(object_name, new_name, docstring=None):
via .put
"""
_obj = getattr(sys.modules[__name__], object_name)
def _internal_factory(name, **kwargs):
def _internal_factory(name, *args, **kwargs):

This comment has been minimized.

@karlnapf

karlnapf Jan 30, 2019
Member

Not needed abymore

@@ -256,4 +256,9 @@ namespace shogun
{
return get_machine()->get_machine_problem_type();
}

This comment has been minimized.

@karlnapf

karlnapf Jan 30, 2019
Member

Can’t this be in factory?

@@ -48,7 +48,8 @@ TEST_F(PipelineTest, fit_predict)
auto pipeline = some<CPipelineBuilder>()
->over(transformer1)
->over(transformer2)
->then(machine);
->then(machine)
->as<CPipeline>();

This comment has been minimized.

@karlnapf

karlnapf Jan 30, 2019
Member

Is that needed?

vinx13 added 4 commits Jan 30, 2019
Fix
@karlnapf
Copy link
Member

@karlnapf karlnapf commented Jan 31, 2019

Something about include paths is broken in the example ... sorry

@@ -0,0 +1,33 @@
============================

This comment has been minimized.

@karlnapf

karlnapf Jan 31, 2019
Member

minor. the ==== should be as long as the text

We also have to decide on an evaluation criterion class (see :sgclass:`CEvaluation`) to evaluate the performance of the trained model.
In this case, we use :sgclass:`CAccuracyMeasure`.
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs.
The pipeline instance behaves just like a :sgclass:`CMachine` and this can be directly passed to :sgclass:`CCrossValidation`.

This comment has been minimized.

@karlnapf

karlnapf Jan 31, 2019
Member

This is not exactly true anymore, as we need to call the factory now. I suggest we just avoid this and say
"We next create a cross-validation instance and pass the generated pipeline."

* See also CPipelineBuilder and CPipeline.
* @return new instance of CPipelineBuilder
*/
CPipelineBuilder* pipeline()

This comment has been minimized.

@karlnapf

karlnapf Jan 31, 2019
Member

@gf712 would you be up for making the definitions of the factory methods a bit neater? :)

This comment has been minimized.

@gf712

gf712 Feb 1, 2019
Member

sure! what do you have in mind?

This comment has been minimized.

@karlnapf

karlnapf Feb 5, 2019
Member

Actually, just checked, not sure factory.h can even be made that much simpler?

@karlnapf
Copy link
Member

@karlnapf karlnapf commented Jan 31, 2019

Cool!
Thanks for the big amount of work. It is awesome to have this in now :)

@karlnapf karlnapf merged commit ab0aa3c into shogun-toolbox:develop Jan 31, 2019
1 check was pending
1 check was pending
shogun-CI in progress
Details
vigsterkr added a commit to vigsterkr/shogun that referenced this pull request Mar 9, 2019
* Add meta example and cookbook of pipeline cross validation
ktiefe added a commit to ktiefe/shogun that referenced this pull request Jul 30, 2019
* Add meta example and cookbook of pipeline cross validation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants