Add cookbook of cross validation with pipeline #4380
@@ -0,0 +1,34 @@ | |||
============================ | |||
Cross Validation on Pipeline |
on "a" pipeline
Cross Validation on Pipeline | ||
============================ | ||
|
||
In this example, we illustrate how to use cross-validation with :sgclass:`CPipeline`. |
maybe link to some cross-validation cookbook. There is some way to link to notebooks, I think MKL does this
ha it is just below, nevermind :)
------- | ||
Example | ||
------- | ||
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`. |
Slightly weird English:
We demonstrate a pipeline consisting of a transformer ..., and LibLinear for binary classification.
(maybe also link to the liblinear cookbook)
------- | ||
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`. | ||
|
||
Imagine we have files with training data. We create :sgclass:`CDenseFeatures` (here 64 bit floats aka RealFeatures) as |
I know this is copy-pasted, but this is now outdated since we use factories.
Just say: "We create CFeatures and CLabels via loading from files"
.. sgexample:: cross_validation_pipeline:create_features | ||
|
||
|
||
We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification. |
This sentence basically repeats the intro, so I would just remove it
|
||
|
||
We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification. | ||
The transformer and the machine are chained as a :sgclass:`CPipeline`. |
We create a CPipeline, and chain the transformer and the classifier.
actually, I would mention that "we first chain all transformers, and then finalize the pipeline with the classifier" (since you use "then")
.. sgexample:: cross_validation_pipeline:create_pipeline | ||
|
||
Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation. | ||
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models. |
see CEvaluation
.. sgexample:: cross_validation_pipeline:create_pipeline | ||
|
||
Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation. | ||
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models. |
last word model (singular)
Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation. | ||
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models. | ||
In this case, we use :sgclass:`CAccuracyMeasure`. | ||
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs. |
Please also mention something like "The pipeline instance behaves just like a machine and this can be directly passed to CCrossValidation"
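As an aside on why it helps that the pipeline "behaves just like a machine": a minimal Python sketch of stratified k-fold cross-validation that only relies on `fit()` and `predict()`. Everything here (`stratified_folds`, `cross_validate`, `CenteredThresholdPipeline`, the toy data) is hypothetical and for illustration only; it is not Shogun's implementation or API.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def accuracy(predicted, truth):
    """Fraction of predictions that match the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def cross_validate(machine, features, labels, k):
    """Mean accuracy over k folds; `machine` only needs fit() and predict()."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        machine.fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        predicted = machine.predict([features[i] for i in test_idx])
        scores.append(accuracy(predicted, [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)


class CenteredThresholdPipeline:
    """Toy stand-in for a pipeline: subtract the training mean, threshold at 0."""

    def fit(self, xs, ys):
        # The "transformer" is fitted on the training fold only.
        self.mean = sum(xs) / len(xs)
        return self

    def predict(self, xs):
        return [1 if x - self.mean > 0 else -1 for x in xs]


features = [-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
folds = stratified_folds(labels, 2)
score = cross_validate(CenteredThresholdPipeline(), features, labels, 2)
```

Because the cross-validation loop calls `fit()` on the whole object, the preprocessing step in this sketch is re-estimated on each training fold and never sees the held-out fold.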
#![create_pipeline] | ||
|
||
#![create_cross_validation] | ||
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2) |
Don't we have factories for those things?
I would prefer it if all newly added examples fully used the new API so we don't have to refactor later
we don't have factory for CSplittingStrategy
Would you mind creating one? That should be pretty easy!
sure
@karlnapf since there are multiple subclasses of CSplittingStrategy, are we going to do some string comparison by name?
check the factory machine in factory.h, you just have to add some macro lines
the call should be splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, k=2)
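To illustrate the "lookup by name" idea being discussed, here is a minimal Python sketch of a string-keyed factory. It is an assumption-laden illustration: the registry, the decorator, and the attribute handling are all hypothetical and do not reflect the macros in Shogun's factory.h. Only the class name and the keyword parameters are taken from this thread.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a class name to a zero-argument constructor.
_SPLITTING_STRATEGIES: Dict[str, Callable[[], object]] = {}


def register_splitting_strategy(cls):
    """Class decorator that registers a strategy under its class name."""
    _SPLITTING_STRATEGIES[cls.__name__] = cls
    return cls


@register_splitting_strategy
class StratifiedCrossValidationSplitting:
    def __init__(self):
        # Parameters are set after construction, so no work happens here.
        self.labels = None
        self.num_subsets = None


def splitting_strategy(name, **params):
    """Look up a strategy by name and set its parameters by keyword."""
    if name not in _SPLITTING_STRATEGIES:
        raise ValueError(f"unknown splitting strategy: {name!r}")
    obj = _SPLITTING_STRATEGIES[name]()
    for key, value in params.items():
        if not hasattr(obj, key):
            raise AttributeError(f"{name} has no parameter {key!r}")
        setattr(obj, key, value)
    return obj


strategy = splitting_strategy(
    "StratifiedCrossValidationSplitting", labels=[0, 1, 0, 1], num_subsets=2
)
```

Note that in this sketch the constructor does nothing but declare parameters, so any real initialization has to happen later, once the parameters are set.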
so we need to do the initialization in an init() method instead of the constructor, e.g.
CStratifiedCrossValidationSplitting::CStratifiedCrossValidationSplitting( |
this needs refactoring, right?
ah yes, this will be moved into the method that is called from outside, i.e. build_subsets
... putting it into a helper method makes sense
Transformer subMean = transformer("PruneVarSubMean") | ||
Machine svm = machine("LibLinear") | ||
|
||
PipelineBuilder builder() |
Same here, could you pls use a factory for this? We don't want to use constructors in the examples
could you explain how to use a factory here?
we could create a factory PipelineBuilder* pipeline_builder(), or?
yes exactly, though I would just name it pipeline
#![create_cross_validation] | ||
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2) | ||
Evaluation evaluation_criteron = evaluation("AccuracyMeasure") | ||
CrossValidation cross(pipeline, feats_train, labels_train, splitting_strategy, evaluation_criteron, False) |
factory if possible (this might be more tricky)
Great example! I made some comments
Let me know if the factory creating worked.... |
@@ -14,7 +14,7 @@ Labels labels_test = labels(f_labels_test) | |||
Transformer subMean = transformer("PruneVarSubMean") | |||
Machine svm = machine("LibLinear") | |||
|
|||
PipelineBuilder builder() | |||
PipelineBuilder builder = pipeline() |
exactly like this! :)
BTW why is the C++ type not just called Pipeline and the Machine called PipelineMachine?
i think Pipeline and PipelineMachine might be a bit confusing, while PipelineBuilder can indicate its usage as a builder
TBH, I don't really agree. I think then() should actually return CMachine (since the subsequent object will be used in this fashion) ... why do we need to know that a machine is a pipeline if it behaves as a machine?
And then Pipeline is the thing that builds the stuff.
Makes for a cleaner API imo
@vigsterkr @lisitsyn @iglesias what are your thoughts?
@karlnapf are we talking now about PipelineMachine and PipelineBuilder, or about why then() returns Pipeline*? :) imo those are different things, or?
So CPipeline -> CPipelineMachine, CPipelineBuilder -> CPipeline, and then() returns CMachine
imo builder is just clearer but i dont have any strong feelings about it...
note my second comment about extraction and observation of pipeline.
yes the stages thing is a problem
i.e. until we can register and cleanly extract stages from a pipeline, this sort of explicit exposure is required :) otherwise the whole thing becomes totally opaque once built
got it! thx for clarifying.
I still would slightly prefer different names (CPipelineMachine, CPipeline) but it is only a minor difference, also no strong feelings. Can leave it as it is
rebase and merge :) |
|
||
.. sgexample:: cross_validation_pipeline:create_features | ||
|
||
We first chain all transformers, and then finalize the pipeline with the classifier. |
Is it one as said above? Then change or remove "all".
Labels labels_test = labels(f_labels_test) | ||
#![create_features] | ||
|
||
#![create_pipeline |
Missing ]?
|
||
PipelineBuilder builder = pipeline() | ||
builder.over(subMean) | ||
Pipeline pipeline = builder.then(svm) |
Why does the API contain these two builder and pipeline concepts? What about just a pipeline where steps are added (and of course the order of addition matters).
the idea is to separate construction of pipeline into a single class so that the pipeline object is immutable, and then we don't need to verify elements of pipeline everytime
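The separation described here can be sketched in a few lines of Python. This is only an illustration of the builder pattern under stated assumptions: the `over`/`then` method names mirror the meta example in this PR, while the frozen dataclass, the `apply` method, and the toy `sub_mean` and `sign_classifier` functions are hypothetical and not Shogun code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transform = Callable[[List[float]], List[float]]
Classifier = Callable[[List[float]], List[int]]


@dataclass(frozen=True)
class Pipeline:
    """Immutable result: stages are fixed once the builder has finished."""
    transformers: Tuple[Transform, ...]
    machine: Classifier

    def apply(self, data: List[float]) -> List[int]:
        for transform in self.transformers:
            data = transform(data)
        return self.machine(data)


class PipelineBuilder:
    """Mutable builder: collects transformers, then finalizes with a machine."""

    def __init__(self) -> None:
        self._transformers: List[Transform] = []

    def over(self, transformer: Transform) -> "PipelineBuilder":
        self._transformers.append(transformer)
        return self

    def then(self, machine: Classifier) -> Pipeline:
        # All validation could happen once, here, instead of on every use.
        return Pipeline(tuple(self._transformers), machine)


def sub_mean(xs: List[float]) -> List[float]:
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def sign_classifier(xs: List[float]) -> List[int]:
    return [1 if x > 0 else -1 for x in xs]


pipeline = PipelineBuilder().over(sub_mean).then(sign_classifier)
predictions = pipeline.apply([1.0, 2.0, 6.0])
```

In this sketch the user touches the builder only during construction; afterwards the frozen `Pipeline` cannot have stages swapped out, which is the property the comment above is arguing for.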
This is pretty much the same thing that got me confused about the builder vs the pipeline.
the second one, Pipeline, can be Machine here because that's enough: we use it as a machine in xval. However, if we want to get elements in the pipeline, we still need the Pipeline type for now
what about adding a factory for it? So users can call PipelineMachine(machine)?
Alternatively, for the C++ folks, there is as<>()
I understand that there might be mutability reason justifying two types. Why both part of the interface though?
We always have these multipurpose classes where multiple functionality/concept is crammed into one class... here’s a clear cut between a builder pattern and cmachine
In examples/meta/src/evaluation/cross_validation_pipeline.sg:
> +File f_labels_test = csv_file("../../data/classifier_binary_2d_linear_labels_test.dat")
+
+#![create_features]
+Features feats_train = features(f_feats_train)
+Features feats_test = features(f_feats_test)
+Labels labels_train = labels(f_labels_train)
+Labels labels_test = labels(f_labels_test)
+#![create_features]
+
+#![create_pipeline
+Transformer subMean = transformer("PruneVarSubMean")
+Machine svm = machine("LibLinear")
+
+PipelineBuilder builder = pipeline()
+builder.over(subMean)
+Pipeline pipeline = builder.then(svm)
You need the builder for building, but u would like to query a trained pipeline for various stages of it as well. Or what do you mean?
What I was wondering was why a user needs to use two different classes to
use a pipeline: e.g. the pipeline_builder to build/construct a pipeline
object, and then this pipeline object to do all the preprocessing,
cross-validation, etc, etc, she wants.
Of course I also understand your point. It is not clean if a class
implements many different concepts altogether. It gets confusing quickly,
classes get bloated, etc.
It is just a trade-off I think.
To clarify :-) These were not comments criticizing any design choice.
Far from that! I asked to gain understanding.
0e3b84b to f8120b9
|
||
PipelineBuilder builder = pipeline() | ||
builder.over(subMean) | ||
Machine pipeline = builder.then(svm) |
does this overloading of var name and the factory work? Just asking
yes, this overloads the factory name, and the factory still works in this case. thanks for letting me know.
I like the flow of this, so good to merge from my side
f8120b9 to 5bd958c
Can you rebase data and then we can merge this as well! |
7651aa9 to 8752675
Shall we get this in soon? :) |
@karlnapf As I can remember, previously we faced the choice on whether pipeline builder should return
Maybe @lisitsyn has idea? |
Ah I remember now. |
Cool that it works. I like that this is a general solution to the problem, despite not being super elegant.
Any further thoughts on this idea of implicitly casting a pipeline to CMachine via an argument of a factory? @lisitsyn @vigsterkr
6fe7b65 to 7a00f6e
7a00f6e to d2a3a74
@@ -61,7 +61,8 @@ | |||
"get_int_vector": "$object.get($arguments)", | |||
"get_real": "$object.get($arguments)", | |||
"get_real_vector": "$object.get($arguments)", | |||
"get_real_matrix": "$object.get($arguments)" | |||
"get_real_matrix": "$object.get($arguments)", | |||
"put_machine": "$object.put($arguments)" |
this will not work as we cannot extract the arguments yet and you would need to pass the second argument to the machine factory
we can do this once #4490 is solved
|
||
PipelineBuilder builder = pipeline() | ||
builder.over(subMean) | ||
Machine pipeline = builder.then(svm) |
This type here should be "Pipeline" as that is what the builder returns (and now Pipeline is part of swig as well)
SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2) | ||
Evaluation evaluation_criterion = evaluation("AccuracyMeasure") | ||
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2) | ||
cross.put_machine("machine", pipeline) |
change to cross.put(machine(pipeline)) and it should work
we can change that to put_machine later once #4490 is in
#![create_cross_validation] | ||
SplittingStrategy strategy = splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, num_subsets=2) | ||
Evaluation evaluation_criterion = evaluation("AccuracyMeasure") | ||
MachineEvaluation cross = machine_evaluation("CrossValidation", features=feats_train, labels=labels_train, splitting_strategy=strategy, evaluation_criterion=evaluation_criterion, autolock=False, num_runs=2) |
OT: Can I suggest that we don't provide features and labels to cross-validation up front, but instead pass them as parameters of CMachineEvaluation::evaluate(CFeatures*, CLabels*)? Different PR though
@lisitsyn @vigsterkr @iglesias
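The difference between the two API shapes in this suggestion can be shown with a small Python sketch. Both classes and the `sign_accuracy` metric are hypothetical and exist only to contrast "data as fields" with "data as arguments of evaluate()"; neither is Shogun's CMachineEvaluation.

```python
def sign_accuracy(features, labels):
    """Toy metric: fraction of features whose sign agrees with the label."""
    hits = sum((x > 0) == (y > 0) for x, y in zip(features, labels))
    return hits / len(labels)


class FieldStyleEvaluation:
    """Data is stored on the object before evaluate() is called."""

    def __init__(self, metric, features, labels):
        self.metric = metric
        self.features = features
        self.labels = labels

    def evaluate(self):
        return self.metric(self.features, self.labels)


class ArgumentStyleEvaluation:
    """Data is passed to evaluate(), so one object serves many datasets."""

    def __init__(self, metric):
        self.metric = metric

    def evaluate(self, features, labels):
        return self.metric(features, labels)


field_result = FieldStyleEvaluation(sign_accuracy, [1.0, -2.0], [1, -1]).evaluate()

argument_style = ArgumentStyleEvaluation(sign_accuracy)
train_result = argument_style.evaluate([1.0, -2.0], [1, -1])
test_result = argument_style.evaluate([3.0, 4.0], [1, -1])
```

With the argument style, the same configured evaluation object is reused on two datasets without being mutated in between.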
Hey mate, I can't say much based on this snippet. What's your point? Also based on the last line, CrossValidation here seems to be a type of MachineEvaluation, no?
I’d like to pass it as function arguments of the evaluation function rather than as fields before that....
;)
Great that this now works ! |
I think only some minor things are left
@@ -1371,9 +1373,9 @@ def _internal_factory_wrapper(object_name, new_name, docstring=None): | |||
via .put | |||
""" | |||
_obj = getattr(sys.modules[__name__], object_name) | |||
def _internal_factory(name, **kwargs): | |||
def _internal_factory(name, *args, **kwargs): |
Not needed anymore
src/shogun/machine/Pipeline.cpp
Outdated
@@ -256,4 +256,9 @@ namespace shogun | |||
{ | |||
return get_machine()->get_machine_problem_type(); | |||
} | |||
|
Can’t this be in factory?
@@ -48,7 +48,8 @@ TEST_F(PipelineTest, fit_predict) | |||
auto pipeline = some<CPipelineBuilder>() | |||
->over(transformer1) | |||
->over(transformer2) | |||
->then(machine); | |||
->then(machine) | |||
->as<CPipeline>(); | |||
|
Is that needed?
|
@@ -0,0 +1,33 @@ | |||
============================ |
minor. the ==== should be as long as the text
We also have to decide on an evaluation criterion class (see :sgclass:`CEvaluation`) to evaluate the performance of the trained model. | ||
In this case, we use :sgclass:`CAccuracyMeasure`. | ||
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs. | ||
The pipeline instance behaves just like a :sgclass:`CMachine` and this can be directly passed to :sgclass:`CCrossValidation`. |
data pr shogun-toolbox/shogun-data#164
also need to merge #4377 first