Save model/ model parameters after cross-validation #603

Closed
rudra0713 opened this issue Mar 4, 2020 · 18 comments

@rudra0713
Hi,
I am using skorch and sklearn with PyTorch for cross-validation. This is my code right now:

```python
import torch
from torch import optim
from torch.nn import CrossEntropyLoss
from sklearn.model_selection import cross_val_score
from skorch import NeuralNetClassifier
from skorch.helper import SliceDict

train_data_skorch = SliceDict(**train_data)

logistic = NeuralNetClassifier(
    model,
    lr=opt.lr,
    batch_size=opt.batch_size,
    max_epochs=opt.num_epochs,
    train_split=None,
    criterion=CrossEntropyLoss,
    optimizer=optim.Adam,
    iterator_train__shuffle=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

scores = cross_val_score(logistic, train_data_skorch, all_labels_numpy, cv=10, scoring="accuracy")
```
How can I save the model after cross-validation is done?

@BenjaminBossan
Collaborator

When using cross_val_score, you can't get back the trained model. In fact, with your settings, 10 models are being trained; which one would you want to save? I would suggest that you just train a model on all the data after cross-validation and save that.

Tip: If you're annoyed by the print log from skorch during cross_val_score, set verbose=0.
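A minimal sketch of that suggestion, reusing the objects from your snippet:

```python
# after cross-validation, fit a single model on all of the training data
logistic.fit(train_data_skorch, all_labels_numpy)

# persist the learned weights with skorch
logistic.save_params(f_params='model_params.pt')

# to restore later: rebuild the net with the same arguments, then
# logistic.initialize()
# logistic.load_params(f_params='model_params.pt')
```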

@rudra0713
Author

Shouldn't cv=10 mean creating 10 folds of my data and, for 10 iterations, using 9 folds for training and one fold for validation while keeping the model the same? Also, I am doing cross-validation because my dataset is imbalanced, not to select a set of hyperparameters.

@BenjaminBossan
Collaborator

> while keeping the model same

Well, the model hyperparameters stay the same of course, but the model itself is trained 10 times (i.e. the weights and biases of the neural net are reset and learned from scratch). cross_val_score will only return the scores to you; the 10 trained models are discarded.

@rudra0713
Author

So, basically, if I learn a model using cross-validation, there is no way for me to store the model, i.e. no way to test the model's performance on completely new and unseen data?

@thomasjpfan
Member

You can use cross_validate with return_estimator=True to keep the models around.
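For example, with the objects from the snippet above:

```python
from sklearn.model_selection import cross_validate

results = cross_validate(
    logistic, train_data_skorch, all_labels_numpy,
    cv=10, scoring="accuracy", return_estimator=True,
)
fold_scores = results["test_score"]   # one accuracy value per fold
fold_models = results["estimator"]    # the 10 fitted NeuralNetClassifier instances
```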

@BenjaminBossan
Collaborator

Cool, I didn't know about that option. Of course, you'd get 10 models back; I'm still not sure what you would do with those.

@rudra0713
Author

I want to use the best performing model out of those 10 models on unseen test data. I want to use cross-validation because my data set is not balanced, not to find the best set of hyper-parameters. So, without a model, there's no way for me to test on unseen data.

@BenjaminBossan
Collaborator

You could also consider the Checkpoint callback but I think it's worth taking a step back here.

Let me try to understand your problem: You use cross_val_score to train 10 models. cross_val_score will split your X internally and validate each of the 10 models (that will have the same hyper-parameters) on that internal split. You get back the 10 models, and you want to use the best of the 10 models on yet another dataset, X_test, that you've split off beforehand. Is this a correct description?

I think your approach is a bit misguided. Since the 10 models are basically the same, there is really no point in choosing one of the 10 to work on your test data. In fact, I'm pretty certain that if you just train the model on the whole X, it should generalize the best because it was trained on the largest amount of data (the 10 models are only trained on 90% of the available data each).

You say that you do this because your data is unbalanced. I assume what you mean is that your labels are unbalanced. E.g. you work with classification and one class is much more predominant than the other. There are a few ways to deal with unbalanced datasets, but the suggestion you made doesn't really solve the problem. You should think about other ideas like up- or downsampling your data (see e.g. here) or using different weights for different classes (e.g. here).
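With skorch, one way to do the class weighting is to pass the weights through to the criterion. A rough sketch, assuming all_labels_numpy holds integer class labels as in your snippet (compute_class_weight is just one way to derive the weights):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from torch.nn import CrossEntropyLoss

# inverse-frequency weights, one per class
classes = np.unique(all_labels_numpy)
weights = compute_class_weight("balanced", classes=classes, y=all_labels_numpy)

logistic = NeuralNetClassifier(
    model,
    criterion=CrossEntropyLoss,
    # skorch routes criterion__* arguments to the criterion's __init__
    criterion__weight=torch.tensor(weights, dtype=torch.float32),
    # ... other arguments as in your snippet ...
)
```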

The main reason why you would need many folds for cross validation when working with unbalanced datasets is because you will encounter more variance in your validation scores. Using more folds gives you more confidence in how well the model would really generalize to unseen data. By itself, using many folds will not lead to a better model.

I hope this clarifies some things for you. All of this is unrelated to skorch; it's a more general machine learning/statistics question. You should be able to find more resources on this online.

@rudra0713
Author

Thanks a lot @BenjaminBossan for your valuable suggestions. I have decided to follow two approaches. In the first approach, I will use sklearn's cross_val_predict to get predictions over the whole X and then create a confusion matrix.
In the second approach, I will try to use a stratified split for my dataset. Since my dataset is a dictionary, I had to use SliceDict to be able to use NeuralNetClassifier. Do you have any suggestions for how to combine SliceDict and CVSplit? Also, is there any way to get direct access to the stratified folds?
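Roughly what I have in mind, as a sketch (same objects as in my snippet above; the shuffle and random_state values are just placeholders):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# first approach: out-of-fold predictions over the whole X, then a confusion matrix
y_pred = cross_val_predict(logistic, train_data_skorch, all_labels_numpy, cv=10)
print(confusion_matrix(all_labels_numpy, y_pred))

# second approach: direct access to the stratified folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, valid_idx in skf.split(all_labels_numpy, all_labels_numpy):
    X_fold = train_data_skorch[train_idx]   # SliceDict supports integer-array indexing
    y_fold = all_labels_numpy[train_idx]
    # fit/evaluate per fold here
```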

@BenjaminBossan
Collaborator

> Since my dataset is a dictionary

Can you specify: Is your X a dictionary or your y (or both)? If y is a dict, I don't think you can get a stratified split from it.

@rudra0713
Author

This is what I am doing right now:

```python
all_inputs, all_masks, all_labels, all_labels_numpy = create_input(
    opt.data_path, opt.train_sheet, opt.rm_sw, opt.rm_punct
)

train_data = {
    'input_ids': all_inputs,
    'attention_mask': all_masks,
    'labels': all_labels,
}

train_data_skorch = SliceDict(**train_data)

logistic = NeuralNetClassifier(
    model,
    lr=opt.lr,
    batch_size=opt.batch_size,
    max_epochs=opt.num_epochs,
    train_split=None,
    criterion=CrossEntropyLoss,
    optimizer=optim.Adam,
    iterator_train__shuffle=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

scores = cross_val_score(logistic, train_data_skorch, all_labels_numpy, cv=10, scoring="accuracy")
```

@BenjaminBossan
Collaborator

If your all_labels_numpy is a numpy array of ints, I think sklearn should automatically use a stratified split. To be certain, you could set cv=StratifiedKFold(10) (docs).
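For example, with the objects from your snippet:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

scores = cross_val_score(
    logistic, train_data_skorch, all_labels_numpy,
    cv=StratifiedKFold(n_splits=10), scoring="accuracy",
)
```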

@BenjaminBossan
Collaborator

@rudra0713 any updates or is the issue solved?

@rudra0713
Author

rudra0713 commented Apr 15, 2020 via email

@Hguimaraes

Hi @BenjaminBossan ,

Is it possible to save the 10 models (considering cv=10) using cross_val_score? I would like to create an ensemble of those models to apply to the test set (e.g. get a majority vote on a classification problem). Is it possible to achieve this using skorch?

Cheers,

@BenjaminBossan
Collaborator

@Hguimaraes

Is this not covered by Thomas's response?

@Hguimaraes

@BenjaminBossan

You are right, sorry about that.
When I first read it, I thought that only the best model was returned at the end, but after reading the docs I think this is what I want.

Thank you again and sorry for the repeated question.

Cheers,
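For reference, this is roughly what I will try, as a sketch; X_test here stands for a held-out test set in the same SliceDict format as the training data, and the labels are assumed to be non-negative integer class ids:

```python
import numpy as np
from sklearn.model_selection import cross_validate

# keep the fitted model from each of the 10 folds
results = cross_validate(
    logistic, train_data_skorch, all_labels_numpy,
    cv=10, scoring="accuracy", return_estimator=True,
)

# majority vote of the 10 models on the held-out test set
all_preds = np.stack([est.predict(X_test) for est in results["estimator"]])
majority_vote = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), axis=0, arr=all_preds
)
```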

@BenjaminBossan
Collaborator

@Hguimaraes No worries, good luck with your models.
