Matching [] characters with regex #18612

Closed
Kerndog73 opened this issue Dec 22, 2018 · 4 comments
@Kerndog73 Kerndog73 commented Dec 22, 2018

Make sure you are using the latest version: run youtube-dl --version and ensure your version is 2018.12.17. If it's not, read this FAQ entry and update. Issues with outdated version will be rejected.

  • I've verified and I assure that I'm running youtube-dl 2018.12.17

Before submitting an issue make sure you have:

  • At least skimmed through the README, most notably the FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones
  • Checked that provided video/audio/playlist URLs (if any) are alive and playable in a browser

What is the purpose of your issue?

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

Description of your issue, suggested solution and other information

I'm using the --metadata-from-title flag with regex and I would like to match the [ character. This is the full regex pattern.

(\[(?P<genre>.+?)\] -)?(?P<artist>.+?) - (?P<track>.+?)( \[[\w ]+\])?

I'm surrounding the pattern in 'single quotes' in the shell (bash). I'm not sure if I should be using "double quotes". I'm not really familiar with the difference between the two.

This URL has a sufficiently complex title for testing (although it doesn't have a multi-word genre).

These are some URLs that I've been testing with.

https://www.youtube.com/watch?v=5tAsUGEqob4
[Electro] - Sound Remedy & Nitro Fun - Turbo Penguin [Monstercat Release]

No genre
https://www.youtube.com/watch?v=YnwsMEabmSo
Marshmello - Alone [Monstercat Official Music Video]

No genre or [Ignored part]
https://www.youtube.com/watch?v=eOofWzI3flA
Skrillex - Rock n Roll (Will Take You to the Mountain)

I would like to match this title. For the first one, genre should be set to Electro, artist should be set to Sound Remedy & Nitro Fun and the track should be set to Turbo Penguin. [Monstercat Release] should not be included in the track. [Electro] - and [Monstercat Release] are optional.

This is how the title is parsed with the current regex

[fromtitle] parsed genre: Electro
[fromtitle] parsed track: T
[fromtitle] parsed artist: Sound Remedy & Nitro Fun

The problem I'm having is with the track. The part that is supposed to match anything within [] seems to be matching some of the track characters. With some variations of the pattern, the track is parsed as Turbo but I can't seem to get it to parse as Turbo Penguin. I think my problem has something to do with []. Also, I've never used ? in regex before. Nor have I used named capture groups (didn't even know they existed).

My full youtube-dl invocation is this:

youtube-dl --extract-audio --audio-format mp3 --audio-quality 320K --format bestaudio --add-metadata --metadata-from-title '(\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>[\w ]+?)( \[[\w ]+\])?' --output 'song.%(ext)s' "https://www.youtube.com/watch?v=5tAsUGEqob4"

I've fiddled around with the regex for a while and I can't seem to figure it out. Maybe bash or youtube-dl are doing something funky with []? I don't know.
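The parse shown above can be reproduced outside youtube-dl. This is a minimal sketch, assuming youtube-dl applies the pattern with Python's `re.match`, which anchors at the start of the title but not at the end. The variable names are illustrative only.

```python
import re

# Pattern and title copied from the invocation above.
pattern = r"(\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>[\w ]+?)( \[[\w ]+\])?"
title = "[Electro] - Sound Remedy & Nitro Fun - Turbo Penguin [Monstercat Release]"

# Assumption: the title is matched with re.match, so the pattern only has
# to match a prefix of the title, not the whole string.
match = re.match(pattern, title)
fields = match.groupdict()
print(fields)

# The part of the title that the match actually consumed.
consumed = title[:match.end()]
print(repr(consumed))
```

If that assumption holds, bash and the escaped brackets are not the culprits: the same one-character track appears when the pattern is run directly in Python.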

@Kerndog73 Kerndog73 changed the title Using [] in regex Matching [] characters with regex Dec 22, 2018
@rautamiekka rautamiekka commented Dec 22, 2018

Some info:

The single quotes tell the shell not to scan and potentially mangle the string, so you should always use them when you can. That way you also introduce some additional speed to the processing.
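The quoting difference can be approximated with Python's `shlex` module, which follows POSIX-style word splitting. This is only a sketch: `shlex` does not perform the variable or command expansion that a real shell does inside double quotes, and the short `\\d` strings are made up for illustration.

```python
import shlex

pattern = r"(?P<track>.+?)(?:\s*\[|$)"

# Single quotes: every character reaches the program unchanged.
single = shlex.split("youtube-dl --metadata-from-title '" + pattern + "'")
print(single[-1] == pattern)

# Inside double quotes a backslash is still special before another
# backslash, so the two forms below yield different arguments.
single_quoted = shlex.split(r"'\\d'")
double_quoted = shlex.split(r'"\\d"')
print(single_quoted)
print(double_quoted)
```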

Use (?: ) instead of full ( ) to both

  • not introduce an unnamed capture group
  • introduce some RAM savings.

You're doing the brackets correctly since you're escaping them.
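The effect of `(?: )` on capture groups can be checked with the `groups` attribute of a compiled pattern. A small sketch using the pattern from this thread in both forms; the variable names are illustrative.

```python
import re

capturing = re.compile(
    r"(\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>.+?)( \[[\w ]+\])?"
)
non_capturing = re.compile(
    r"(?:\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>.+?)(?: \[[\w ]+\])?"
)

# Total number of capture groups (named and unnamed) in each pattern.
print(capturing.groups)
print(non_capturing.groups)

# The named groups are unaffected by the change.
title = "Marshmello - Alone [Monstercat Official Music Video]"
same_fields = capturing.match(title).groupdict() == non_capturing.match(title).groupdict()
print(same_fields)
```

Switching to non-capturing groups tidies the pattern but does not change which text the named groups capture, so on its own it does not fix the truncated track.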

@Kerndog73 Kerndog73 (Author) commented Dec 22, 2018

@rautamiekka I changed the pattern to this:

(?:\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>.+?)(?: \[[\w ]+\])?

Are you sure I'm doing the [] correctly? How can (?: \[[\w ]+\])? match urbo Penguin? It should match a space and then a [. I don't understand.


Maybe (?P<track>.+?) is matching T and then the rest is just ignored??? I must be doing something wrong.
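That guess can be tested directly. A sketch, again assuming the title is matched with Python's `re.match`: the lazy `.+?` takes one character, the optional trailing group then matches nothing, and the match ends there because nothing forces it to reach the end of the title. Appending `$` is one way to force it.

```python
import re

title = "[Electro] - Sound Remedy & Nitro Fun - Turbo Penguin [Monstercat Release]"
pattern = r"(?:\[(?P<genre>.+?)\] - )?(?P<artist>.+?) - (?P<track>.+?)(?: \[[\w ]+\])?"

# Unanchored: the match may stop as soon as every required part is satisfied.
unanchored = re.match(pattern, title)
print(unanchored.group("track"))
remainder = title[unanchored.end():]
print(repr(remainder))  # text the match never looked at

# Anchored at the end: the lazy group must keep growing until the rest fits.
anchored = re.match(pattern + "$", title)
print(anchored.group("track"))
```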

@dstftw dstftw (Collaborator) commented Dec 22, 2018

Invalid regex. Must be something like (?P<track>.+?)(?:\s*\[|$).

@dstftw dstftw closed this Dec 22, 2018
@Kerndog73 Kerndog73 (Author) commented Dec 22, 2018

Final regex is this:

(?:\[(?P<genre>.+?)\]\s*-\s*)?(?P<artist>.+?)\s*-\s*(?P<track>.+?)(?:\s*\[|$)

Works on all the tests I could find. I thought the regex had to match the whole title but apparently not. We can just find the [ and ignore everything after it.

I'm really happy with this! Thank you for your help!
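The final pattern can be checked against the three test titles from the original post. A sketch, assuming `re.match` semantics; `FINAL`, `titles` and `results` are illustrative names.

```python
import re

FINAL = re.compile(
    r"(?:\[(?P<genre>.+?)\]\s*-\s*)?"   # optional "[Genre] - " prefix
    r"(?P<artist>.+?)\s*-\s*"           # artist, up to the first dash
    r"(?P<track>.+?)(?:\s*\[|$)"        # track, up to the next "[" or the end
)

titles = [
    "[Electro] - Sound Remedy & Nitro Fun - Turbo Penguin [Monstercat Release]",
    "Marshmello - Alone [Monstercat Official Music Video]",
    "Skrillex - Rock n Roll (Will Take You to the Mountain)",
]

results = [FINAL.match(t).groupdict() for t in titles]
for t, fields in zip(titles, results):
    print(t)
    print("   ", fields)
```

One limitation to be aware of: because `\s*` also allows zero spaces around the dash, an artist name containing a bare hyphen (a hypothetical title such as `A-B - Song`) would be split at that first hyphen.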
