Save model/ model parameters after cross-validation #603

Closed
rudra0713 opened this issue Mar 4, 2020 · 18 comments

@rudra0713
Hi,
I am using skorch and sklearn with PyTorch for cross-validation. This is my code right now:

```python
import torch
from torch import optim
from torch.nn import CrossEntropyLoss
from sklearn.model_selection import cross_val_score
from skorch import NeuralNetClassifier
from skorch.helper import SliceDict

train_data_skorch = SliceDict(**train_data)

logistic = NeuralNetClassifier(
    model,
    lr=opt.lr,
    batch_size=opt.batch_size,
    max_epochs=opt.num_epochs,
    train_split=None,
    criterion=CrossEntropyLoss,
    optimizer=optim.Adam,
    iterator_train__shuffle=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

scores = cross_val_score(logistic, train_data_skorch, all_labels_numpy, cv=10, scoring="accuracy")
```
How can I save the model after cross-validation is done?

@BenjaminBossan
Collaborator

When using cross_val_score, you can't get back the trained model. In fact, with your settings, 10 models are being trained; which one would you want to save? I would suggest that you just train a model on all the data after cross-validation and save that.

Tip: If you're annoyed by the print log from skorch during cross_val_score, set verbose=0.
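A minimal sketch of that suggestion, reusing the objects from your snippet:

```python
# after cross-validation, fit a single model on all of the training data
logistic.fit(train_data_skorch, all_labels_numpy)

# persist the learned weights with skorch
logistic.save_params(f_params='model_params.pt')

# to restore later: rebuild the net with the same arguments, then
# logistic.initialize()
# logistic.load_params(f_params='model_params.pt')
```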

@rudra0713
Author

Shouldn't cv=10 mean creating 10 folds of my data and, for 10 iterations, using 9 folds for training and one fold for validation while keeping the model the same? Also, I am doing cross-validation because my dataset is imbalanced, not to select a set of hyperparameters.

@BenjaminBossan
Collaborator

> while keeping the model same

Well, the model hyperparameters stay the same of course, but the model itself is trained 10 times (i.e. the weights and biases of the neural net are reset and learned from scratch). cross_val_score will only return the scores to you; the 10 trained models are discarded.

@rudra0713
Author

So, basically, if I learn a model using cross-validation, there is no way for me to store the model, i.e. no way to test the model's performance on completely new and unseen data?

@thomasjpfan
Member

You can use cross_validate with return_estimator=True to keep the models around.
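For example, with the objects from the snippet above:

```python
from sklearn.model_selection import cross_validate

results = cross_validate(
    logistic, train_data_skorch, all_labels_numpy,
    cv=10, scoring="accuracy", return_estimator=True,
)
fold_scores = results["test_score"]   # one accuracy value per fold
fold_models = results["estimator"]    # the 10 fitted NeuralNetClassifier instances
```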

@BenjaminBossan
Collaborator

Cool, I didn't know about that option. Of course, you'd get 10 models back; I'm still not sure what you would do with those.

@rudra0713
Author

I want to use the best performing model out of those 10 models on unseen test data. I want to use cross-validation because my data set is not balanced, not to find the best set of hyper-parameters. So, without a model, there's no way for me to test on unseen data.

@BenjaminBossan
Collaborator

You could also consider the Checkpoint callback but I think it's worth taking a step back here.

Let me try to understand your problem: You use cross_val_score to train 10 models. cross_val_score will split your X internally and validate each of the 10 models (that will have the same hyper-parameters) on that internal split. You get back the 10 models, and you want to use the best of the 10 models on yet another dataset, X_test, that you've split off beforehand. Is this a correct description?

I think your approach is a bit misguided. Since the 10 models are basically the same, there is really no point in choosing one of the 10 to work on your test data. In fact, I'm pretty certain that if you just train the model on the whole X, it should generalize the best because it was trained on the largest amount of data (the 10 models are only trained on 90% of the available data each).

You say that you do this because your data is unbalanced. I assume what you mean is that your labels are unbalanced. E.g. you work with classification and one class is much more predominant than the other. There are a few ways to deal with unbalanced datasets, but the suggestion you made doesn't really solve the problem. You should think about other ideas like up- or downsampling your data (see e.g. here) or using different weights for different classes (e.g. here).
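With skorch, one way to do the class weighting is to pass the weights through to the criterion. A rough sketch, assuming all_labels_numpy holds integer class labels as in your snippet (compute_class_weight is just one way to derive the weights):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from torch.nn import CrossEntropyLoss

# inverse-frequency weights, one per class
classes = np.unique(all_labels_numpy)
weights = compute_class_weight("balanced", classes=classes, y=all_labels_numpy)

logistic = NeuralNetClassifier(
    model,
    criterion=CrossEntropyLoss,
    # skorch routes criterion__* arguments to the criterion's __init__
    criterion__weight=torch.tensor(weights, dtype=torch.float32),
    # ... other arguments as in your snippet ...
)
```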

The main reason why you would need many folds for cross validation when working with unbalanced datasets is because you will encounter more variance in your validation scores. Using more folds gives you more confidence in how well the model would really generalize to unseen data. By itself, using many folds will not lead to a better model.

I hope this clarifies some things for you. All of this is unrelated to skorch; it's a more general machine learning/statistics question. You should be able to find more resources on this online.

@rudra0713
Author

Thanks a lot @BenjaminBossan for your valuable suggestions. I have decided to follow two approaches. In the first approach, I will use sklearn's cross_val_predict to get predictions over the whole X and then create a confusion matrix.
In the second approach, I will try to use a stratified split for my dataset. Since my dataset is a dictionary, I had to use SliceDict to be able to use NeuralNetClassifier. Do you have any suggestions for how to combine SliceDict and CVSplit? Also, is there any way to get direct access to the stratified folds?
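Roughly what I have in mind, as a sketch (same objects as in my snippet above; the shuffle and random_state values are just placeholders):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# first approach: out-of-fold predictions over the whole X, then a confusion matrix
y_pred = cross_val_predict(logistic, train_data_skorch, all_labels_numpy, cv=10)
print(confusion_matrix(all_labels_numpy, y_pred))

# second approach: direct access to the stratified folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(all_labels_numpy, all_labels_numpy):
    X_fold = train_data_skorch[train_idx]   # SliceDict supports integer-array indexing
    y_fold = all_labels_numpy[train_idx]
    # fit/evaluate per fold here
```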

@BenjaminBossan
Collaborator

> Since my dataset is a dictionary

Can you specify: Is your X a dictionary or your y (or both)? If y is a dict, I don't think you can get a stratified split from it.

@rudra0713
Author

This is what I am doing right now:

```python
all_inputs, all_masks, all_labels, all_labels_numpy = create_input(
    opt.data_path, opt.train_sheet, opt.rm_sw, opt.rm_punct
)

train_data = {
    'input_ids': all_inputs,
    'attention_mask': all_masks,
    'labels': all_labels,
}

train_data_skorch = SliceDict(**train_data)

logistic = NeuralNetClassifier(
    model,
    lr=opt.lr,
    batch_size=opt.batch_size,
    max_epochs=opt.num_epochs,
    train_split=None,
    criterion=CrossEntropyLoss,
    optimizer=optim.Adam,
    iterator_train__shuffle=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

scores = cross_val_score(logistic, train_data_skorch, all_labels_numpy, cv=10, scoring="accuracy")
```

@BenjaminBossan
Collaborator

If your all_labels_numpy is a numpy array of ints, I think sklearn should automatically use a stratified split. To be certain, you could set cv=StratifiedKFold(10) (docs).
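For example, with the objects from your snippet:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

scores = cross_val_score(
    logistic, train_data_skorch, all_labels_numpy,
    cv=StratifiedKFold(n_splits=10), scoring="accuracy",
)
```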

@BenjaminBossan
Collaborator

@rudra0713 any updates or is the issue solved?

@rudra0713
Author

rudra0713 commented Apr 15, 2020 via email

@Hguimaraes

Hi @BenjaminBossan ,

Is it possible to save the 10 models (considering cv=10) using cross_val_score? I would like to create an ensemble of those models to apply to the test set (e.g. get a majority vote on a classification problem). Is it possible to achieve this using skorch?

Cheers,

@BenjaminBossan
Collaborator

@Hguimaraes

Is this not covered by Thomas's response?

@Hguimaraes

@BenjaminBossan

You are right, sorry about that.
When I first read it, I thought that only the best model was returned at the end, but after reading the docs I think this is what I want.

Thank you again and sorry for the repeated question.

Cheers,
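For reference, this is roughly what I will try, as a sketch; X_test here stands for a held-out test set in the same SliceDict format as the training data, and the labels are assumed to be non-negative integer class ids:

```python
import numpy as np
from sklearn.model_selection import cross_validate

# keep the fitted model from each of the 10 folds
results = cross_validate(
    logistic, train_data_skorch, all_labels_numpy,
    cv=10, scoring="accuracy", return_estimator=True,
)

# majority vote of the 10 models on the held-out test set
all_preds = np.stack([est.predict(X_test) for est in results["estimator"]])
majority_vote = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), axis=0, arr=all_preds
)
```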

@BenjaminBossan
Collaborator

@Hguimaraes No worries, good luck with your models.
