Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a function for pattern-matching and error checking? #847

Closed
FiloSottile opened this issue May 20, 2013 · 12 comments
Closed

Create a function for pattern-matching and error checking? #847

FiloSottile opened this issue May 20, 2013 · 12 comments
Assignees

Comments

@FiloSottile
Copy link
Collaborator

@FiloSottile FiloSottile commented May 20, 2013

We have two common base patterns:

mobj = re.search(REGEX, TEXT)
if mobj is None:
    raise ExtractorError(u'Unable to extract %s' % THING)
video_title = mobj.group(1)

mobj = re.search(REGEX, TEXT)
if mobj is None:
    self._downloader.report_warning(u'unable to extract %s' % THING)
    video_description = None
else:
    video_description = mobj.group(1)

So I was thinking about a helper function like

def search_regex(pattern, string, name, fatal=True, flags=0):
    mobj = re.search(pattern, string, flags)
    if mobj is None and fatal:
        raise ExtractorError(u'Unable to extract %s' % name)
    elif mobj is None:
        self._downloader.report_warning(u'unable to extract %s' % name)
        return None
    else:
        return mobj.group(1) # Or maybe the first matched group,
        # to accomodate cases like (?:"([^"]+)"|\'([^\']+)\')

If you like it I can put up a PR.

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 20, 2013

(pinging devs @phihag @jaimeMF)

@phihag
Copy link
Contributor

@phihag phihag commented May 20, 2013

Go ahead and start changing, this looks fine to me. Although I'd really prefer HTML parsing + xpaths for a lot of these checks.

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 20, 2013

Eh, me too, I'm a huge BeautifulSoup fan, but I fear that without
introducing dependencies there are not elegant and reliable HTML (not
simply XML) 2.6 - 3.3 parsing solutions...

Filippo Valsorda

2013/5/20 Philipp Hagemeister notifications@github.com

Go ahead and start changing, this looks fine to me. Although I'd really
prefer HTML parsing + xpaths for a lot of these checks.


Reply to this email directly or view it on GitHubhttps://github.com//issues/847#issuecomment-18136112
.

@jaimeMF
Copy link
Collaborator

@jaimeMF jaimeMF commented May 20, 2013

I would personally return the mobj, since you can pass a pattern with multiple groups.

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 20, 2013

Umh... there would not be much gain, then, as you have to check if the mobj
is Null...

My idea is that of supporting only a single return value, and to take any
matched group (see the comments to the code above)

Filippo Valsorda

2013/5/20 Jaime Marquínez Ferrándiz notifications@github.com

I would personally return the mobj, since you can pass a pattern with
multiple groups.


Reply to this email directly or view it on GitHubhttps://github.com//issues/847#issuecomment-18144591
.

@jaimeMF
Copy link
Collaborator

@jaimeMF jaimeMF commented May 20, 2013

You're right, I was thinking just in fatal errors.

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 21, 2013

Speaking about errors, I and @jaimeMF thought that we should discuss which extraction failures should be fatal, and which should issue warnings.

My take is:

  • title, id, url, and extension (that is, the required info_dict fields) yield fatals
  • all the other only warnings, but with a message along the lines of "please report this issue"
@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 21, 2013

So this would be the function

def search_regex(pattern, string, name, fatal=True, flags=0):
    mobj = re.search(pattern, string, flags)
    if mobj is None and fatal:
        raise ExtractorError(u'Unable to extract %s; '
            u'please report this issue on GitHub.' % name)
    elif mobj is None:
        self._downloader.report_warning(u'unable to extract %s; '
            u'please report this issue on GitHub.' % name)
        return None
    else:
        # return the first matched group
        return next(g for g in mobj.groups() if g is not None)
@jaimeMF
Copy link
Collaborator

@jaimeMF jaimeMF commented May 29, 2013

Just one note, some IEs assign '' to fields that can't be found, we should choose to use it or None, or maybe pass an optional default_value to search_regex

@phihag
Copy link
Contributor

@phihag phihag commented May 29, 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 05/29/2013 08:10 AM, Jaime Marquínez Ferrándiz wrote:

Just one note, some IEs assign '' to fields that can't be found
(...)
Those IEs should be fixed. If you can't find a field, leave it out. If
you can find it, but it's unset, set it to None.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEAREKAAYFAlGlrikACgkQ9eq1gvr7CFy+eQCbBSWFO78LrhVa2UL0DFydIHL/
2I4AniNbJiwik8dRb7zJr4HD4Z99KBHA
=PZ+T
-----END PGP SIGNATURE-----

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented May 29, 2013

Ok, great, Anna is in the process of deploying the function, I'll have her
enforce this.

However, I feel like that in output filenames we should represent unset or
None as 'nd' and/or 'unknown', or something like that.

While we are at it, @phihag have you expressed your preference on which
fields missing should raise a fatal and which a warning? I would go for id,
url and title fatal.

On Wednesday, May 29, 2013, Philipp Hagemeister wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 05/29/2013 08:10 AM, Jaime Marquínez Ferrándiz wrote:

Just one note, some IEs assign '' to fields that can't be found
(...)
Those IEs should be fixed. If you can't find a field, leave it out. If
you can find it, but it's unset, set it to None.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEAREKAAYFAlGlrikACgkQ9eq1gvr7CFy+eQCbBSWFO78LrhVa2UL0DFydIHL/
2I4AniNbJiwik8dRb7zJr4HD4Z99KBHA
=PZ+T
-----END PGP SIGNATURE-----


Reply to this email directly or view it on GitHubhttps://github.com//issues/847#issuecomment-18600170
.

Filippo Valsorda

@FiloSottile
Copy link
Collaborator Author

@FiloSottile FiloSottile commented Jun 1, 2013

My latest iteration of the function, including a default value and support for multiple fallback patterns:

def search_regex(pattern, string, name, default=None, fatal=True, flags=0):
    """
    Perform a regex search on the given string, using a single or a list of
    patterns returning the first matching group.
    In case of failure return a default value or raise a WARNING or a
    ExtractorError, depending on fatal, specifying the field name.
    """
    if type(pattern) in (type(''), type(u''), re.RegexObject):
        mobj = re.search(pattern, string, flags)
    else:
        for p in pattern:
            mobj = re.search(p, string, flags)
            if mobj: break

    if sys.stderr.isatty() and os.name != 'nt':
        _name = u'\033[0;34m%s\033[0m' % name
    else:
        _name = name

    if mobj:
        # return the first matching group
        return next(g for g in mobj.groups() if g is not None)
    elif default is not None:
        return default
    elif fatal:
        raise ExtractorError(u'Unable to extract %s; '
            u'please report this issue on GitHub.' % _name)
    else:
        self._downloader.report_warning(u'unable to extract %s; '
            u'please report this issue on GitHub.' % _name)
        return None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.