New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cookbook of cross validation with pipeline #4380

Open
wants to merge 24 commits into
base: develop
from

Conversation

Projects
None yet
4 participants
@vinx13
Contributor

vinx13 commented Jul 20, 2018

data pr shogun-toolbox/shogun-data#164
also need to merge #4377 first

@@ -0,0 +1,34 @@
============================
Cross Validation on Pipeline

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

on "a" pipeline

Cross Validation on Pipeline
============================
In this example, we illustrate how to use cross-validation with :sgclass:`CPipeline`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

maybe link to some cross-validation cookbook. There is some way to link to notebooks, I think MKL does this

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

ha it is just below, nevermind :)

-------
Example
-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

Slightly weight English:
We demonstrate a pipeline consisting of a transformer ..., and LibLinear for binary classification.
(maybe also link to the liblinear cookbook)

-------
We'll use as example a binary classification problem solvable by a pipeline consisted of a transformer :sgclass:`CPruneVarSubMean` and a machine :sgclass:`CLibLinear`.
Imagine we have files with training data. We create :sgclass:`CDenseFeatures` (here 64 bit floats aka RealFeatures) as

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

I know this is copy pasted, but this is now outdates when we use factories.
Just say: "We create CFeatures and CLabels via loading from files"

.. sgexample:: cross_validation_pipeline:create_features
We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

This sentence basically repeats the intro, so I would just remove it

We use :sgclass:`CPruneVarSubMean` to normalize the features and then use :sgclass:`CLibLinear` for classification.
The transformer and the machine are chained as a :sgclass:`CPipeline`.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

We create a Cpipeline, and chain the transformer and the classifier.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

actually, I would mention that "we first chain all transformers, and then finalize the pipeline with the classifier" (since you use "then")

.. sgexample:: cross_validation_pipeline:create_pipeline
Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

see CEvaluation

.. sgexample:: cross_validation_pipeline:create_pipeline
Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

last word model (singular)

Next, we initialize a splitting strategy :sgclass:`CStratifiedCrossValidationSplitting` to divide the dataset into :math:`k-` folds for the :math:`k-` fold cross validation.
We also have to decide on an evaluation criterion class (from :sgclass:`CEvaluation`) to evaluate the performance of the trained models.
In this case, we use :sgclass:`CAccuracyMeasure`.
We then instantiate :sgclass:`CCrossValidation` and set the number of cross validation's runs.

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

Please also mention something like "The pipeline instance behaves just like a machine and this can be directly passed to CCrossValidation"

#![create_pipeline]
#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

dont we have factories for those things?
I would prefer if all newly added examples would fully use the new api so we dont have to refactor later

This comment has been minimized.

@vinx13

vinx13 Jul 21, 2018

Contributor

we don't have factory for CSplittingStrategy

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

Would you mind creating one? That should be pretty easy!

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018

Contributor

sure

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018

Contributor

@karlnapf since there are multiple subclasses of CSplittingStrategy, are we going to do some string comparison by name?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

check the factory machine in factory.h, you just have to add some macro lines
the call should be splitting_strategy("StratifiedCrossValidationSplitting", labels=labels_train, k=2)

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018

Contributor

so we need to do initialization in init() method, instead of the constructor
e.g.

CStratifiedCrossValidationSplitting::CStratifiedCrossValidationSplitting(

this need refactor right?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

ah yes, this will be moved into the method that is called from outside, i.e. build_subsets ... putting it into a helper method makes sense

Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")
PipelineBuilder builder()

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

Same here, could you pls use a factory for this. We dont want to use constructors in the examples

This comment has been minimized.

@vinx13

vinx13 Jul 22, 2018

Contributor

could you explain how to use factory here?
we could create a factory PipelineBuilder* pipeline_builder(), or?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

yes exactly, though I would just name it pipeline

#![create_cross_validation]
StratifiedCrossValidationSplitting splitting_strategy(labels_train, 2)
Evaluation evaluation_criteron = evaluation("AccuracyMeasure")
CrossValidation cross(pipeline, feats_train, labels_train, splitting_strategy, evaluation_criteron, False)

This comment has been minimized.

@karlnapf

karlnapf Jul 20, 2018

Member

factory if possible (this might be more tricky)

@karlnapf

Great example! I made some comments

@karlnapf

This comment has been minimized.

Member

karlnapf commented Jul 23, 2018

Let me know if the factory creating worked....

@@ -14,7 +14,7 @@ Labels labels_test = labels(f_labels_test)
Transformer subMean = transformer("PruneVarSubMean")
Machine svm = machine("LibLinear")
PipelineBuilder builder()
PipelineBuilder builder = pipeline()

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

exactly like this! :)

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

BTW why is the C++ ype not just called Pipeline and the Machine called PipelineMachine?

This comment has been minimized.

@vinx13

vinx13 Jul 23, 2018

Contributor

i think Pipeline and PipelineMachine might be a bit confusing, while PipelineBuilder can indicate its usage as a builder

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

TBH, I dont really agree, I think then should actually return CMachine (since the subsequent object will be used in this fashion) ... why do we need to know that a machine is a pipeline if it behaves as a machine?
And then Pipeline is the thing that builds the stuff.
Makes for a cleaner API imo

@vigsterkr @lisitsyn @iglesias what are your thoughts?

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018

Member

@karlnapf are we talking now about PipelineMachine and PipelineBuilder, or why then returns Pipeline* ? :) imo those are different things or?

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

So CPipeline -> CPipelineMachine, CPipelineBuilder ->CPipeline, then returns CMachine

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018

Member

imo builder is just clearer but i dont have any strong feelings about it...
note my second comment about extraction and observation of pipeline.

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

yes the stages thing is a problem

This comment has been minimized.

@vigsterkr

vigsterkr Jul 23, 2018

Member

i.e. until we cannot register and cleanly extract stages from a pipeline this sort of explicit exposure is required :) otherwise the whole thing becomes totally opaque once built

This comment has been minimized.

@karlnapf

karlnapf Jul 23, 2018

Member

got it! thx for clarifying.
I still would slightly prefer different names (CPipelineMachine, CPipeline) but it is only a minor difference, also no strong feelings. Can leave it as it is

@karlnapf

This comment has been minimized.

Member

karlnapf commented Jul 23, 2018

rebase and merge :)

.. sgexample:: cross_validation_pipeline:create_features
We first chain all transformers, and then finalize the pipeline with the classifier.

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018

Contributor

Is it one as said above? Then change or remove "all".

Labels labels_test = labels(f_labels_test)
#![create_features]
#![create_pipeline

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018

Contributor

Missing ]?

PipelineBuilder builder = pipeline()
builder.over(subMean)
Pipeline pipeline = builder.then(svm)

This comment has been minimized.

@iglesias

iglesias Jul 26, 2018

Contributor

Why does the API contain these two builder and pipeline concepts? What about just a pipeline where steps are added (and of course the order of addition matters).

This comment has been minimized.

@vinx13

vinx13 Jul 26, 2018

Contributor

the idea is to separate construction of pipeline into a single class so that the pipeline object is immutable, and then we don't need to verify elements of pipeline everytime

This comment has been minimized.

@karlnapf

karlnapf Jul 26, 2018

Member

This is pretty much the same thing that got me confused about the builder vs the pipeline.

This comment has been minimized.

@vinx13

vinx13 Jul 27, 2018

Contributor

the second one, Pipeline can be Machine here because that's enough, we use it as machine in xval. however, if we want to get elements in the pipeline, we still need Pipeline type for now

This comment has been minimized.

@karlnapf

karlnapf Jul 27, 2018

Member

what about adding a factory for it? So users can call
PipelineMachine(machine)?
Alternatively, for the c++ folks, there is as

This comment has been minimized.

@iglesias

iglesias Jul 27, 2018

Contributor

I understand that there might be mutability reason justifying two types. Why both part of the interface though?

@vigsterkr

This comment has been minimized.

Member

vigsterkr commented Jul 27, 2018

@vigsterkr

This comment has been minimized.

Member

vigsterkr commented Jul 27, 2018

@iglesias

This comment has been minimized.

Contributor

iglesias commented Jul 27, 2018

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 0e3b84b to f8120b9 Jul 30, 2018

PipelineBuilder builder = pipeline()
builder.over(subMean)
Machine pipeline = builder.then(svm)

This comment has been minimized.

@karlnapf

karlnapf Jul 30, 2018

Member

does this overloading of var name and the factory work? Just asking

This comment has been minimized.

@vinx13

vinx13 Jul 30, 2018

Contributor

yes this overload the factory name, the factory works in this case. thanks for letting me know.

This comment has been minimized.

@karlnapf

karlnapf Jul 30, 2018

Member

I like the flow of this, so good to merge from my side

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from f8120b9 to 5bd958c Aug 8, 2018

@karlnapf

This comment has been minimized.

Member

karlnapf commented Aug 8, 2018

Can you rebase data and then we can merge this as well!

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 1a15b06 to e5fbcf9 Aug 12, 2018

@vinx13 vinx13 force-pushed the vinx13:cookbook/pipeline_xval branch from 7651aa9 to 8752675 Aug 14, 2018

vinx13 added some commits Aug 14, 2018

@karlnapf

This comment has been minimized.

Member

karlnapf commented Nov 12, 2018

Shall we get this in soon? :)

@vinx13

This comment has been minimized.

Contributor

vinx13 commented Nov 12, 2018

@karlnapf As I can remember, previously we faced the choice on whether pipeline builder should return CMachine* or CPipeline*. The first choice, returning CMachine* and then providing a pipeline factory, works though being ugly. According to @vigsterkr 's idea, in my last commit I tried to make CPipeline as base class, but got some different errors on travis

what():  �[1;31m[ERROR]�[0m In file /opt/shogun/src/shogun/base/SGObject.h line 369: Cannot put parameter CrossValidation::machine of type shogun::CMachine*, incompatible provided type shogun::CPipeline*.

Maybe @lisitsyn has idea?

@karlnapf

This comment has been minimized.

Member

karlnapf commented Nov 12, 2018

Ah I remember now.
Well the real question is whether Pipeline should be part of the modular interfaces (i.e. the base types) or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment