Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mid3v2: custom delimiters in COMM, TXXX, POPM descriptions #159

Closed
lazka opened this issue Jul 4, 2014 · 14 comments
Closed

mid3v2: custom delimiters in COMM, TXXX, POPM descriptions #159

lazka opened this issue Jul 4, 2014 · 14 comments
Labels

Comments

@lazka
Copy link
Member

@lazka lazka commented Jul 4, 2014

Originally reported by: Christoph Reiter (Bitbucket: lazka, GitHub: lazka)


From thuerrsc...@gmail.com on August 30, 2013 04:06:28

I use mid3v2 as a tool for skripted tagging of my music collection for Quod Libet, and I'm quite happy with it. There's just one limitation that I find a little irksome. A command like the following just won't work as intended:

mid3v2 --TXXX "QuodLibet::albumartist:The Examples" track.mp3

The reason, of course, are the double quotes in the prefix as used by Quod Libet (and Ex Falso), since they conflict with the delimiter colon in the description key for TXXX (as well as COMM and POMP) frames.

My first thought was to add some special handling for QuodLibet:: prefixes and/or double colons. A better idea, in my view, is to allow users to specify their own delimiters from the command line as needed, like this:

mid3v2 --delimiter=# --TXXX "QuodLibet::albumartist#The Examples" track.mp3

The colon would remain the default delimiter. There's no risk of breaking existing skripts with the added option. Backward compatibility with id3v2 would be preserved.

My tentative patch shows how mid3v2 (and its documentation) might be adapted for a new --delimiter option. I know very little python, so my code is certainly far from perfect. Most of my changes are built after the --verbose option.

Attachment: mid3v2-deliminator.diff

Original issue: http://code.google.com/p/mutagen/issues/detail?id=159


@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on August 30, 2013 10:27:54

An escaping mechanism would surely be just as good as supporting custom delimiters. In fact, when I first ran into the double colon problem, prefixing them with backslashes was one of the first things I tried. The reasons for my disappointed became clear when I looked into mid3v2's source ...

As for implementing escaping in python, I have no idea how to do that. Maybe there's a library function, since it's such a common task? Python's string.split() method doesn't seem to support regular expressions, which would offer one way of do this.
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on August 30, 2013 04:44:24

Thanks for looking into this.

I'd prefer if it was possible to escape the delimiter instead of redefining it. To keep it compatible we could add an "-e" option like "/bin/echo" which enables this new behavior.

mid3v2 -e --TXXX "QuodLibet\:\:albumartist:The Examples" track.mp3

What do you think?

This also seems to be an unsolved problem in the original id3v2: http://www.jetmore.org/john/blog/2012/07/patch-allow-id3v21-to-include-colons-in-comment-fields/

Status: Accepted

@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on August 31, 2013 06:25:30

There is no such thing in the stdlib afaik.

Something like this should work: http://bpaste.net/show/hVJFGP0WtDJDbZp8WwbR/
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on September 03, 2013 10:22:17

> I don't think we need a full-blown parser for our simple unescaping. A
> couple of regular expressions will do, I think. See the attached diff
> for a draft implementation.

Why not? The regexp doesn't handle all cases and is imho harder to
understand.

> So far there are no provisions for common escape sequences like \n for
> a newline or \t for a tab. We'd need a mapping table for those and/or
> some more regex trickery. I'm not sure if newlines are even allowed in
> ID3 tags. Maybe a "big" escaping mechanism would carry things too far.
> Just solving the colon problem is good enough for me.

Python's string_escape codec should make this easy:

"foobar\\n".decode("string_escape") == "foobar\n"

Sounds like a good idea.

> If you decide to go for a escaping mechanism like the one proposed
> here, a note should be added to the release notes and/or man page that
> (a) escaped colons are supported now and (b) less obviously and
> possibly breaking existing stuff, that backslashes need to be doubled
> for the frames concerned. 

I would prefer the "-e" switch to turn it on and default to the old
behavior.
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 01, 2013 19:38:44

I don't think we need a full-blown parser for our simple unescaping. A couple of regular expressions will do, I think. See the attached diff for a draft implementation.

I've added a split_value function that splits key defititions for COMM, TXXX, and POPM frames on non-escaped colons only, replaces double backslashes with single ones (so that \ or even \: can be written into tags), and removes any other backslashes. As I said, I have practically no Python experience (I'm just starting to like it), so please be lenient on my code.

So far there are no provisions for common escape sequences like \n for a newline or \t for a tab. We'd need a mapping table for those and/or some more regex trickery. I'm not sure if newlines are even allowed in ID3 tags. Maybe a "big" escaping mechanism would carry things too far. Just solving the colon problem is good enough for me.

If you decide to go for a escaping mechanism like the one proposed here, a note should be added to the release notes and/or man page that (a) escaped colons are supported now and (b) less obviously and possibly breaking existing stuff, that backslashes need to be doubled for the frames concerned.

Attachment: mutagen-escape-colon.diff

@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on September 04, 2013 03:51:00

This issue was closed by revision 42885877ff28 .

Status: Fixed

@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 03, 2013 12:53:49

> Why not? The regexp doesn't handle all cases and is imho harder to
> understand.

I guess regexp vs. parser is a matter of personal preference. Coming from a perl background, I'm naturally biassed towards regular expressions, maybe too much so.

Anyway, my regexp was intended to handle just one case, i.e. colon delimiters escaped with a backslash. What we're talking about now is a more general escaping mechanism, optionally activated with an -e switch, that deals with a much larger set of escape sequences, for all kinds of ID3 tags.

I really like your ideas, although I know far too little about python's string handling to even try to implement them. From what I gather from http://docs.python.org/2/reference/lexical_analysis.html#string-literals , the string_escape encoding handles, among other things, \xhh sequences. So writing colons as \x3A would be an easy way of including them in TXXX etc. frames, without the need for any special escaping. Which would solve the initial problem quite elegantly.
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 04, 2013 15:13:25

Thanks for the quick fix! I haven't really tested the finished --escape option yet, but from I've seen, it looks like a clean and simple solution that just works. Well done!
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on September 05, 2013 01:28:02

Thanks for catching this.
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 04, 2013 21:00:38

Looks I've run into a problem:

 mid3v2 --TALB "aäā" -e test.mp3 
Traceback (most recent call last):
  File "/home/thuerrsch/bin/mid3v2", line 344, in <module>
    main(sys.argv)
  File "/home/thuerrsch/bin/mid3v2", line 328, in main
    write_files(parser.edits, args, options.escape)
  File "/home/thuerrsch/bin/mid3v2", line 148, in write_files
    edits = dict((k, map(unescape_string, v)) for (k, v) in edits.items())
  File "/home/thuerrsch/bin/mid3v2", line 148, in <genexpr>
    edits = dict((k, map(unescape_string, v)) for (k, v) in edits.items())
  File "/home/thuerrsch/bin/mid3v2", line 80, in unescape_string
    return string.decode("unicode_escape")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

It seems that with the -e switch, all non-ASCII characters *must* be escaped. Is this intentional? I'd prefer the escaping to be optional, not mandatory for each character when -e is used. Would that be possible?
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 05, 2013 09:44:54

Sorry to keep bothering you. I found to more cases that lead to unexpected errors:

mid3v2 --TALB 'a vs. ä, \x61 vs. \xe4, \141 vs. \344' -e test.mp3
mid3v2 --TALB 'trailing backslash\' --TIT2 "trailing backslash\\" -e test.mp3

Thanks for looking into this.
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on September 05, 2013 02:03:15

Should be fixed: revision 3a4f8997ce1d thanks again
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From reiter.christoph@gmail.com on September 05, 2013 11:28:15

Thanks! revision 7a3f88a0be2b $ mid3v2 --TALB 'a vs. ä, \x61 vs. \xe4, \141 vs. \344' -e test.mp3
TALB: 'utf8' codec can't decode byte 0xe4 in position 16: invalid continuation byte

$ mid3v2 --TALB 'trailing backslash\' --TIT2 "trailing backslash\\" -e test.mp3
TALB: Trailing \ in string
@lazka

This comment has been minimized.

Copy link
Member Author

@lazka lazka commented Jul 4, 2014

Original comment by Christoph Reiter (Bitbucket: lazka, GitHub: lazka):


From thuerrsc...@gmail.com on September 05, 2013 20:30:55

Thanks again for the quick fix! Everything's fine now, as far as I can tell.

As an aside, the ä character used in my examples above would have to be escaped as \xc3\xa4 rather than \xe4 (UTF-8 byte sequence vs. character code). The former is interpreted correctly by mid3v2, the latter produces an error message as it should.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.