Can History be more powerful? #804

Open
TheAutumnOfRice opened this issue Sep 28, 2021 · 5 comments

@TheAutumnOfRice
Contributor

The current History class (ver 0.10.0) has some limitations:

  1. The history is currently saved as JSON, so recorded values are limited to simple numbers and strings. Other objects cannot be saved in history files directly.
  2. Saving as JSON takes a lot of time and space because numbers are stored as decimal text. This gets worse as the number of training epochs grows.
  3. In some cases, additional content (e.g. best_loss, ...) needs to be saved. But because History is a list, content that does not belong to any epoch cannot be saved in History directly.

Given these limitations, it might be better to save History as a binary file (on my computer I modified history.py, changing all json calls to pickle, and it works pretty well). In addition, History could get a dict attribute (like the attrs dict in pandas.DataFrame) so that additional content can be saved easily.
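To illustrate the first limitation, here is a minimal sketch using only the standard library. It does not use skorch: the `history` list below is a stand-in for the list-of-dicts structure of a history, and the `finished_at` key is a made-up example of a value that JSON cannot represent.

```python
import json
import pickle
from datetime import datetime

# Stand-in for a history: a list with one dict per epoch.
# "finished_at" is a hypothetical key holding a non-JSON value.
epoch = {"epoch": 1, "train_loss": 0.25, "finished_at": datetime(2021, 9, 28, 12, 0)}
history = [epoch]

try:
    json.dumps(history)
    json_ok = True
except TypeError:
    # json only handles dict, list, str, int, float, bool and None
    json_ok = False

# pickle round-trips the same structure, including the datetime
restored = pickle.loads(pickle.dumps(history))
```

The JSON dump fails on the datetime value, while the pickle round trip returns an equal structure.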

@BenjaminBossan
Collaborator

I assume that you are suggesting to (optionally) use pickle for the history when calling net.save_params.

> Given these limitations, it might be better to save History as a binary file (on my computer I modified history.py, changing all json calls to pickle, and it works pretty well).

Persisting the net.history with pickle is of course always an option. Since that is quite easy to do if desired, I'm not sure we need another mechanism to achieve the same result when you can just pickle.dump the net.history attribute.

Another thing to be aware of: you can easily dump the history into a pandas DataFrame and then use pandas to persist it in many different formats, such as CSV (but this is a one-way street):

import pandas as pd

df = pd.DataFrame(net.history[:, ['train_loss', 'valid_loss', 'valid_loss_best']])
df.to_csv('history.csv')

(this won't work for the batches though)

> In addition, History could get a dict attribute (like the attrs dict in pandas.DataFrame) so that additional content can be saved easily.

What exactly do you mean? Could you give an example of what that looks like in usage? The only reference I could find was this experimental feature without any further explanation :)

@TheAutumnOfRice
Contributor Author

TheAutumnOfRice commented Sep 29, 2021

@BenjaminBossan
Thank you for your reply!

> Persisting the net.history with pickle is of course always an option. Since that is quite easy to do if desired, I'm not sure we need another mechanism to achieve the same result when you can just pickle.dump the net.history attribute.

That's reasonable. If I want to add this feature without modifying History.to_file, I just need to override net.save_params, copy the original code, and modify a few lines of it, though that is not very elegant :)

In my opinion, saving the history as JSON is feasible, but I wonder whether anyone actually opens that JSON file to check the values by eye :P So, just as a suggestion, it may be better to persist it as a binary file. Or at least, the persistence method could be chosen through an additional argument, a global setting, or something similar.

> What exactly do you mean? Could you give an example of what that looks like in usage? The only reference I could find was this experimental feature without any further explanation :)

I didn't expect the official pandas documentation to say nothing about attrs!
But luckily I found this: https://docs.h5py.org/en/stable/high/attr.html. The usage is similar: just a dict to save additional content.

An expected usage:

import time

import numpy as np
from skorch.history import History

H = History()
# Proposed API: H.attrs starts out as an empty dict
# Assuming H has been recorded for many epochs
H.attrs["last_record_time"] = time.time()
H.attrs["best_valid_epoch"] = np.argmin(H[:, "valid_loss"])
H.attrs["earlystopping_count"] = 20
# ... Other attributes, which do not belong to any epoch, so they can't be saved in History list.

H.to_file()  # Save history list and H.attrs together
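One way such a combined save could work is sketched below, independent of skorch. The helpers `save_history_with_attrs` and `load_history_with_attrs` are hypothetical names, and the epoch records are a plain list of dicts standing in for a real history. The idea is simply to wrap both parts in one payload before pickling.

```python
import pickle
import tempfile
from pathlib import Path


def save_history_with_attrs(epochs, attrs, path):
    """Hypothetical helper: pickle the epoch records and the attrs dict together."""
    payload = {"epochs": list(epochs), "attrs": dict(attrs)}
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_history_with_attrs(path):
    """Hypothetical helper: return (epochs, attrs) as written by the saver above."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    return payload["epochs"], payload["attrs"]


# Stand-in data: per-epoch records plus metadata that belongs to no epoch
epochs = [{"epoch": 1, "valid_loss": 0.9}, {"epoch": 2, "valid_loss": 0.7}]
attrs = {"best_valid_epoch": 2, "earlystopping_count": 20}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "history.pkl"
    save_history_with_attrs(epochs, attrs, path)
    loaded_epochs, loaded_attrs = load_history_with_attrs(path)
```

Storing plain lists and dicts in the payload, rather than a custom class instance, keeps the file loadable without importing the class that produced it.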

@BenjaminBossan
Collaborator

> If I want to add this feature without modifying History.to_file, I just need to override net.save_params, copy the original code, and modify a few lines of it, though that is not very elegant :)

Why not override History.to_file, though? That sounds easier to me.

> Or at least, the persistence method could be chosen through an additional argument, a global setting, or something similar.

I think it would be possible to add an argument to to_file to optionally store as a pickle file, even though pickle is also not super efficient. Another advantage of JSON to keep in mind is that it will work across versions. This is also the idea with the other components you can save with save_params. With pickle, there is always the risk of a breaking change sometime in the future.
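A rough sketch of what such an optional argument might look like is given below. This is not skorch's History.to_file: `dump_history`, `load_history`, and the `fmt` parameter are hypothetical names, and a plain list of dicts stands in for the history.

```python
import json
import pickle
import tempfile
from pathlib import Path


def dump_history(epochs, path, fmt="json"):
    """Hypothetical helper: write epoch records as JSON (default) or pickle."""
    if fmt == "json":
        with open(path, "w") as f:
            json.dump(epochs, f)
    elif fmt == "pickle":
        with open(path, "wb") as f:
            pickle.dump(epochs, f)
    else:
        raise ValueError(f"unknown format: {fmt!r}")


def load_history(path, fmt="json"):
    """Hypothetical helper: read epoch records written by dump_history."""
    if fmt == "json":
        with open(path) as f:
            return json.load(f)
    if fmt == "pickle":
        with open(path, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"unknown format: {fmt!r}")


# Stand-in history: one dict per epoch
epochs = [{"epoch": 1, "train_loss": 0.5}, {"epoch": 2, "train_loss": 0.25}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "history.json"
    pkl_path = Path(tmp) / "history.pkl"
    dump_history(epochs, json_path)                # default stays JSON
    dump_history(epochs, pkl_path, fmt="pickle")   # opt in to pickle
    from_json = load_history(json_path)
    from_pickle = load_history(pkl_path, fmt="pickle")
```

Keeping JSON as the default preserves the cross-version behavior described above, while pickle remains an explicit opt-in.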

> An expected usage: ...

As long as you're happy to save the history as a pickle file, you can just add your own attrs. E.g.:

class MyNet(NeuralNet):
    def on_train_end(self, net, X=None, y=None, **kwargs):
        h = self.history
        h.attrs = {}
        h.attrs['foo'] = 123
        ...

Obviously, if dumped as JSON, attrs would not be saved.

@TheAutumnOfRice
Contributor Author

Yeah, the easiest way is to modify the History.to_file method. But that requires modifying the source code. So I'd really appreciate it if your team were willing to make these changes, for example by adding an argument.

Saving as JSON guarantees compatibility, that's right, and I think that makes sense. But in exchange, it loses flexibility, and perhaps time and space. For now I'm using my customized History.to_file, and it works pretty well for me. Anyway, thanks for all the replies and advice :)

@BenjaminBossan
Collaborator

Okay, so let's boil down the discussion so far:

  1. It would be good to be able to tell History.to_file to use pickle instead of JSON.
  2. Furthermore, it would be nice to tell net.save_params to use pickle for history instead of JSON.
  3. Maybe even allow a more data-efficient storage method (possibly via pandas), but that would probably be one-way.
  4. Give the history a pandas attrs-style attribute to store metadata.
  5. Automatically create a training summary on_train_end to fill this attribute, e.g. total training duration, best loss, etc.

These all sound reasonable to me. When I have time in the future, I'll take a look at them, at least 1. shouldn't be too hard. Of course, if you want to create a PR in the meantime, that would be highly welcome ;)
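As a rough idea of what the training summary in point 5 could contain, here is a small sketch over a stand-in history (a plain list of dicts). `summarize_training` and the summary keys are hypothetical, and the records are assumed to carry `epoch`, `dur`, and `valid_loss` entries.

```python
def summarize_training(epochs):
    """Hypothetical helper: build a metadata dict from per-epoch records."""
    best = min(epochs, key=lambda row: row["valid_loss"])
    return {
        "n_epochs": len(epochs),
        "total_duration": sum(row["dur"] for row in epochs),
        "best_valid_loss": best["valid_loss"],
        "best_valid_epoch": best["epoch"],
    }


# Stand-in history: one dict per epoch, durations in seconds
epochs = [
    {"epoch": 1, "dur": 1.5, "valid_loss": 0.9},
    {"epoch": 2, "dur": 1.25, "valid_loss": 0.6},
    {"epoch": 3, "dur": 1.25, "valid_loss": 0.7},
]
summary = summarize_training(epochs)
```

A dict like this is the kind of content that could go into the attrs-style attribute from point 4 at the end of training.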
