Event representation with multi-track support #27
Haven't reviewed the code yet, but this is awesome! Thank you! Another way to support multiple tracks in an event-like representation is to use a text-based representation. For example,
Yeah, I was thinking exactly of these 2 examples. What I'm proposing should be equivalent to LakhNES, only I use Python tuples instead of strings, and I index tracks with numbers. E.g. MuseNet is similar, except that it encodes velocity differently. A different option would be to use track names instead of indices, e.g.
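The tuple-based scheme described here could look something like the following sketch (the event names, track indices, and programs are hypothetical illustrations, not MusPy's actual API):

```python
# Hypothetical tuple-based multi-track events: each event is a Python
# tuple, and tracks are indexed by number (second element).
events = [
    ("program", 0, 0),   # track 0 plays program 0 (piano)
    ("program", 1, 33),  # track 1 plays program 33 (electric bass)
    ("note_on", 0, 60),  # note-on, track 0, pitch 60
    ("note_on", 1, 36),  # note-on, track 1, pitch 36
    ("time_shift", 12),  # advance time by 12 ticks
    ("note_off", 0, 60),
    ("note_off", 1, 36),
]

# Mapping tuples to integer token IDs is then a plain dictionary lookup.
vocab = {event: i for i, event in enumerate(sorted(set(events)))}
ids = [vocab[e] for e in events]
```

Strings would work identically here; the tuples are just never flattened to text before being indexed.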
There are several issues in the implementation. Let me start with the biggest one.
I was expecting these events to be more compact and decoupled. For example, we currently represent a note by a series of velocity, note-on, time-shift and note-off events. We should represent a note in a track in a similar way: a series of track, velocity, note-on, time-shift and note-off events, where track and velocity events are optional if unchanged and work like switches (more like `set_program` and `set_velocity`).
There are several benefits of doing so:
- The vocabulary size would be independent of the number of tracks.
- We don't need `num_tracks` in encode anymore.
- Argument `ignore_empty_tracks` can be removed. This argument is quite confusing, as it has different effects in encoding and decoding.
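As an illustration of the switch-style encoding suggested above, here is a minimal sketch (a hypothetical helper, not the actual MusPy implementation), where `track` and `velocity` events are emitted only when their value changes:

```python
def encode_notes(notes):
    """Encode notes as switch-style events.

    notes: list of (track, velocity, pitch, duration) tuples, assumed
    already ordered by onset time for simplicity.
    """
    events = []
    current_track = None
    current_velocity = None
    for track, velocity, pitch, duration in notes:
        # "track" and "velocity" act like switches: only emitted on change.
        if track != current_track:
            events.append(("track", track))
            current_track = track
        if velocity != current_velocity:
            events.append(("velocity", velocity))
            current_velocity = velocity
        events.append(("note_on", pitch))
        events.append(("time_shift", duration))
        events.append(("note_off", pitch))
    return events

notes = [(0, 64, 60, 12), (0, 64, 62, 12), (1, 80, 36, 24)]
events = encode_notes(notes)
# The second note reuses the active track/velocity switches, so no extra
# ("track", ...) or ("velocity", ...) events are emitted for it.
```

With this scheme the vocabulary is the union of the per-type event sets, so its size does not grow with the number of tracks times the number of pitches.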
Some other comments:
- Is the current implementation used in any literature? We could add the current implementation as a new representation if you like.
- We can base the text-based representation on the current implementation.
- I would still prefer to keep the main implementation in `muspy.inputs` and `muspy.outputs`. The processor interface is just a way to store the configurations for the functional interface.
Thank you for working on this. Let me know what you think.
`muspy/processors/event.py` (outdated)

```python
tracks = [[n for track in music.tracks for n in track.notes]]
else:
    # Keep only num_tracks non-empty tracks
    # TODO: Maybe warn or throw exception if too many tracks
```
Maybe warn for the processor interface, but pass for the functional interface.
I added a warning. If not wanted, it can be suppressed using `warnings.filterwarnings`.
I completely agree with these benefits, but a big disadvantage is longer sequences. There is a trade-off (small vocabulary vs. short sequences), and in some scenarios (typically when the number of tracks is small and fixed), the latter would be preferable. Other benefits of this representation are that:
The LakhNES representation is equivalent to this representation (or rather a special case – without the program and velocity tokens). And MuseNet is not that different either. That's why I wanted to contribute it here.
That would make sense. Though keep in mind that my implementation still supports the single-track version as a special case. This one doesn't encode programs, but that could be easily implemented exactly the way you wrote (by adding a program token in front of every note-on). So I think the two ideas are not incompatible and could be part of the same class (or perhaps extensions of the same base class).
To me, there's no real difference between the tuple-based and the string-based representation, since we immediately convert to integers anyway. But sure, we could instead format the tuples as strings.
I chose to move it to the class because I need to store the vocab. Of course you can re-generate the vocab every time, but then it's not accessible from outside, so anyone using the representation will have to figure out the vocab size on their own. Also, you will either have duplicate code (mainly for generating the vocab) or we'll have to find a common place to put it.
Also, if you plan to add support for control change events, it really makes sense to have a track-based representation. You don't want the piano's sustain pedal to apply to other instruments.
I just changed
Also, I'm not sure about the usefulness of the reshaping. If I load the dataset with the PyTorch See also #45.
Sorry for the delay. I have been thinking about how we should handle this for days. My main concern lies in the necessity of the vocabulary dictionary. For such a large vocabulary, I think it's more flexible to return the raw vocabularies (like a text-based representation) rather than a series of integers that cannot be decoded without some vocabulary dictionary. With the raw vocabularies, users can later implement their own indexer to index the vocabulary. Also, such a large vocabulary makes it inefficient to run using the functional API. Here are my suggestions.
What do you think? Is there any benefit to having this implemented in the event representation?
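A minimal sketch of the "raw tokens plus user-side indexer" idea (the `make_indexer` helper and the example tokens are invented for illustration):

```python
def make_indexer(tokens):
    """Build token <-> id mappings from the tokens actually seen."""
    # dict.fromkeys deduplicates while preserving first-seen order.
    vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
    inverse = {i: tok for tok, i in vocab.items()}
    return vocab, inverse

# The representation would emit raw, human-readable tokens...
tokens = ["track_0", "velocity_64", "note_on_60", "time_shift_12", "note_off_60"]

# ...and indexing is left entirely to the user.
vocab, inverse = make_indexer(tokens)
ids = [vocab[t] for t in tokens]
decoded = [inverse[i] for i in ids]  # round-trips back to the raw tokens
```

The output is then self-describing: it can be decoded without shipping a fixed vocabulary dictionary alongside it.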
I'm not completely against a text representation and it has its advantages, but some things to consider:
So what we could do is merge this as a separate
That's true. People who want to use a more advanced representation would probably want to use other fancy stuff, I guess. I think that would be fine. Most machine learning frameworks should support an indexer (e.g., Hugging Face).
That's a trade-off, in fact: having a large vocabulary with some words that never get used, or having a smaller vocabulary that depends on the data. It actually depends on how people implement the indexer, but we could certainly have a function that returns the complete list of possible tokens.
We still have the vocabulary implicitly. The key is that the output itself does not depend on the vocabulary. We could have another function that returns the complete list (e.g., to check a user's inputs).
Sounds like a plan! We might also need some documentation.
@salu133445 OK, I moved this to a separate class and also added some tests and docs.
I also added the
I implemented multi-track support for the event-based representation. Here is a summary of the changes:
- Added a `vocab` dictionary, mapping human-readable event tuples (e.g. `('time_shift', 12)`) to integers. This removes the need for doing index arithmetic to work out the ranges for different events, which would be extra complicated in the multi-track case. Moreover, the vocabulary size is simply `len(vocab)`, and users can customize `vocab`, e.g. if they want to exclude some events they know will never appear in their data.
- Instruments (`program` + `is_drum` values) are (optionally) represented by special events, which always appear at the very beginning of the sequence (the instrument setting applies to the whole track). By default, we don't encode the drum program (drum kit).

To consider:
- `enum.Flag` ...

Sorry for this big PR, and I'm happy to discuss everything.
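As a rough illustration of the `vocab` idea described above (the event set, track count, and time-shift range here are hypothetical, not the actual implementation), the vocabulary can be enumerated up front so that sizes come from `len(vocab)` rather than index arithmetic:

```python
# Hypothetical parameters for the sketch.
NUM_TRACKS = 2
NUM_PITCHES = 128
MAX_TIME_SHIFT = 100

# Enumerate every possible event tuple once, in a fixed order.
events = []
for track in range(NUM_TRACKS):
    for pitch in range(NUM_PITCHES):
        events.append(("note_on", track, pitch))
        events.append(("note_off", track, pitch))
for shift in range(1, MAX_TIME_SHIFT + 1):
    events.append(("time_shift", shift))

vocab = {event: i for i, event in enumerate(events)}

# Vocabulary size without any index arithmetic:
vocab_size = len(vocab)  # 2 * 2 * 128 + 100 = 612
```

Excluding events (e.g. pitches that never occur in a dataset) is then just a matter of building `events` differently before constructing `vocab`.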