
[Question]: Split dataset in training, validation and test #118

Closed
aleSuglia opened this issue Nov 15, 2016 · 12 comments
@aleSuglia
Member

Hi there,

I'm looking for a way to divide a dataset into the common experimental setting of training, validation, and test sets. I've found the BASH script in the rival-split module that can be used to generate data for holdout evaluation. In my experience, to obtain the desired result I can split the dataset into training+dev and test and, after that, split the former into training and dev.
My question is: is there an idiomatic way in RiVal to do the same?
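
For concreteness, this is roughly what I am doing now (a minimal sketch; only splitter.split(data) comes from the split module, while the RandomSplitter constructor arguments and the variable data are my own assumptions for illustration):

// two-step holdout: first split off the test set, then split the
// remainder into training and dev; data is a DataModelIF<Long, Long>
// loaded beforehand
Splitter<Long, Long> splitter = new RandomSplitter<>(0.8f, false, 2016L, false);

// first pass: [0] = training+dev, [1] = test
DataModelIF<Long, Long>[] trainTest = splitter.split(data);

// second pass on the first part: [0] = training, [1] = dev
DataModelIF<Long, Long>[] trainDev = splitter.split(trainTest[0]);

DataModelIF<Long, Long> training = trainDev[0];
DataModelIF<Long, Long> dev = trainDev[1];
DataModelIF<Long, Long> test = trainTest[1];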

Thank you in advance for your answer!

Alessandro

@abellogin
Member

Hi Alessandro,

I would say that the split module is more convenient than the BASH script for fine-tuning the splitting process. However, it is true that there is no available method to divide a dataset into three parts, only into two; hence, the best way to obtain a training+validation+test partition is to follow the process you described (splitting one of the splits obtained from the whole dataset).

Hope this helps.
Best,
Alejandro

@aleSuglia
Member Author

Hi Alejandro,

first of all, thank you for your clarification. I really appreciate it!
Anyway, in my opinion a common method to split the dataset should be included in the framework. This would be a way to standardize how people split their datasets and allow them to obtain more reproducible results.

Maybe I can implement this kind of method by inheriting from the Splitter class. Would this kind of feature be useful for the RiVal framework?

Alessandro

@abellogin
Member

Hi Alessandro,

what kind of feature are you suggesting? As you are already aware, RiVal already supports CrossValidation, Random, and Temporal splits. If you are proposing to create a generic splitter that returns 3 splits instead of 2 by running whatever Splitter the user has selected, then this would indeed be a very useful feature. If you are thinking of something else, please elaborate a bit and we can discuss it.

Cheers,
Alejandro

@aleSuglia
Member Author

aleSuglia commented Nov 16, 2016

Hi Alejandro,

at first I was wondering whether I could implement a custom RandomSplitter class that returns three splits instead of two but, as you correctly say, this kind of feature could be implemented in every class that implements Splitter.

In your opinion, what is the proper way to let the user generate the validation set too? I think a boolean flag called generateValidation could be added to the splitter constructors and used differently in each subclass according to its specific behaviour. Would that be consistent with your current API?
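
Just to make the idea concrete, something along these lines (purely hypothetical; AbstractSplitter is only a name I made up to illustrate where the flag could live):

// hypothetical base class holding the proposed flag; the concrete
// splitters (RandomSplitter, CrossValidationSplitter, ...) would
// extend it and interpret the flag according to their own behaviour
public abstract class AbstractSplitter<U, I> implements Splitter<U, I> {

    // when true, split() returns (training, validation, test) triples
    // instead of (training, test) pairs
    protected final boolean generateValidation;

    protected AbstractSplitter(final boolean generateValidation) {
        this.generateValidation = generateValidation;
    }
}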

Cheers,
Alessandro

@abellogin
Member

Hi,

I was thinking more of a ValidationSplitter that implements Splitter and takes another Splitter as a parameter. Then, the split method could be as simple as:

DataModelIF<U, I>[] trainingTestSplit = splitter.split(data);
DataModelIF<U, I>[] trainingValSplit = splitter.split(trainingTestSplit[0]);

return new DataModelIF[]{trainingValSplit[0], trainingValSplit[1], trainingTestSplit[1]};
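
Note that if the wrapped splitter keeps, say, 80% of the data for training in each pass, the final proportions would be roughly 64% training, 16% validation, and 20% test.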

Would this work as you expect?

Best,
Alejandro

@aleSuglia
Member Author

aleSuglia commented Nov 16, 2016

Hi,

I think it would be perfect for the RandomSplitter, but it won't work for the CrossValidationSplitter as is, because that splitter returns an array containing 2 * numFolds elements.

Am I missing something or do you expect to introduce a preliminary check to avoid this incorrect behaviour?

Best,
Alessandro

@abellogin
Member

No, you are right. I knew there was something missing (it couldn't be that easy). I guess a loop like the one in SplitterRunner should do the job:

DataModelIF<U, I>[] trainingTestSplits = splitter.split(data);
DataModelIF<U, I>[] newSplits = new DataModelIF[trainingTestSplits.length / 2 * 3];
for (int i = 0; i < trainingTestSplits.length / 2; i++) {
    DataModelIF<U, I> trainingVal = trainingTestSplits[2 * i];
    DataModelIF<U, I>[] trainingValSplit = splitter.split(trainingVal);
    DataModelIF<U, I> test = trainingTestSplits[2 * i + 1];

    newSplits[3 * i] = trainingValSplit[0];
    newSplits[3 * i + 1] = trainingValSplit[1];
    newSplits[3 * i + 2] = test;
}
return newSplits;
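
(With numFolds folds, the wrapped CrossValidationSplitter returns 2 * numFolds models ordered as [train_0, test_0, train_1, test_1, ...], so the loop produces 3 * numFolds models ordered as [train_0, val_0, test_0, train_1, val_1, test_1, ...].)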

The problem with this is that, since it expects the splitter to return pairs of DataModels, if you pass a ValidationSplitter to the ValidationSplitter everything will get messed up, but I think this is something we are not interested in supporting right now, are we?

Best,
Alejandro

@aleSuglia
Member Author

aleSuglia commented Nov 16, 2016

I was thinking of exactly the same implementation. Before adding the full implementation of the class to my fork, I want to share my current implementation with you:

@SuppressWarnings("unchecked")
@Override
public DataModelIF<U, I>[] split(DataModelIF<U, I> data) {
    if (splitter instanceof ValidationSplitter) {
        throw new IllegalArgumentException("Unable to apply a validation splitter recursively!");
    }

    // if the basic splitter does cross validation, generate a dev set for each fold
    if (splitter instanceof CrossValidationSplitter) {
        DataModelIF<U, I>[] trainingTestSplits = this.splitter.split(data);
        DataModelIF<U, I>[] newSplits = new DataModelIF[trainingTestSplits.length / 2 * 3];

        for (int i = 0; i < trainingTestSplits.length / 2; i++) {
            DataModelIF<U, I> trainingVal = trainingTestSplits[2 * i];
            DataModelIF<U, I>[] trainingValSplit = splitter.split(trainingVal);
            DataModelIF<U, I> test = trainingTestSplits[2 * i + 1];

            newSplits[3 * i] = trainingValSplit[0];      // training
            newSplits[3 * i + 1] = trainingValSplit[1];  // validation
            newSplits[3 * i + 2] = test;                 // test
        }
        return newSplits;
    }

    // otherwise the splitter returns a single (training, test) pair:
    // split the training part again to obtain the validation set
    DataModelIF<U, I>[] trainingTestSplit = splitter.split(data);
    DataModelIF<U, I>[] trainingValSplit = splitter.split(trainingTestSplit[0]);

    return new DataModelIF[]{trainingValSplit[0], trainingValSplit[1], trainingTestSplit[1]};
}
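
For reference, this is how I plan to exercise it in the tests (a sketch; the RandomSplitter constructor arguments are again an assumption on my side):

Splitter<Long, Long> inner = new RandomSplitter<>(0.8f, false, 2016L, false);
ValidationSplitter<Long, Long> validationSplitter = new ValidationSplitter<>(inner);

// [0] = training, [1] = validation, [2] = test
DataModelIF<Long, Long>[] splits = validationSplitter.split(data);

// with a CrossValidationSplitter instead, the array would hold
// 3 * numFolds models, one (training, validation, test) triple per fold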

In your opinion, is this a correct way to avoid the problem of a possible recursive call to the ValidationSplitter and to check whether the user has specified a CrossValidationSplitter instance? Do you have any other suggestions to improve my code?

Thank you, Alejandro, for your help.

Cheers,
Alessandro

@abellogin
Member

Hi Alessandro,

yes, I think your implementation is correct, although I would simplify it in the following way:

  • include the ValidationSplitter check in the constructor, as soon as the splitter is provided
  • do not include the other check, since the loop should always work (even for the RandomSplitter), and remove the last three lines; see the sketch below.
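
Something like this (just a sketch of the simplified method; the check on the wrapped splitter would live in the constructor):

@SuppressWarnings("unchecked")
@Override
public DataModelIF<U, I>[] split(DataModelIF<U, I> data) {
    // works for any wrapped splitter: a two-way splitter such as
    // RandomSplitter is simply the case with a single (training, test) pair
    DataModelIF<U, I>[] trainingTestSplits = splitter.split(data);
    DataModelIF<U, I>[] newSplits = new DataModelIF[trainingTestSplits.length / 2 * 3];
    for (int i = 0; i < trainingTestSplits.length / 2; i++) {
        DataModelIF<U, I>[] trainingValSplit = splitter.split(trainingTestSplits[2 * i]);
        newSplits[3 * i] = trainingValSplit[0];               // training
        newSplits[3 * i + 1] = trainingValSplit[1];           // validation
        newSplits[3 * i + 2] = trainingTestSplits[2 * i + 1]; // test
    }
    return newSplits;
}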

It will be great to have this feature added in RiVal!

Best,
Alejandro

@aleSuglia
Member Author

Completely agree with you! Thank you for your suggestions; I really appreciate your help.

Regarding the implementation of the split method for a TemporalDataModel, I think it is the same as the one I have written now. The only difference is the data model instance to be specified, so everything should be fine. Is that correct?

I am happy to contribute to RiVal and try to make it better :)

Cheers,
Alessandro

@abellogin
Member

Thank you for your interest!

Indeed, the other split method would be equivalent; the only change is the data model instance used.

Cheers,
Alejandro

@aleSuglia
Member Author

aleSuglia commented Nov 16, 2016

I am writing some test cases for my implementation. After that I will send you a pull request.

Best,
Alessandro
