Implement re_first for Selector #52

Closed
Tethik opened this issue Aug 8, 2016 · 3 comments

Tethik commented Aug 8, 2016

Copied from scrapy/scrapy#1907

Currently only SelectorList supports the re_first shortcut method. It would be useful to have this method in Selector too.

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).re_first
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 're_first'

Tethik commented Aug 8, 2016

I've started implementing this change here:
https://github.com/Tethik/parsel/tree/re_first_for_selector

I'm also looking to see if I can improve the performance a bit, specifically that of the extract_regex function in the utils module:

def extract_regex(regex, text):
    """Extract a list of unicode strings from the given text/encoding using the following policies:
    * if the regex contains a named group called "extract" that will be returned
    * if the regex contains multiple numbered groups, all those will be returned (flattened)
    * if the regex doesn't contain any group the entire regex matching is returned
    """
    if isinstance(regex, six.string_types):
        regex = re.compile(regex, re.UNICODE)

    try:
        strings = [regex.search(text).group('extract')]   # named group
    except:
        strings = regex.findall(text)    # full regex or numbered groups
    return [replace_entities(s, keep=['lt', 'amp']) for s in flatten(strings)]

Because of the try/except, a regex without a named extract group runs twice: search() runs first, the group('extract') lookup fails, and then findall() scans the text again. For large documents, or when called often, this may become expensive. In addition, fetching every match is unnecessary when only the first one is needed, as in re_first.
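
One possible way to avoid the double execution, as a rough sketch only: check the compiled pattern's groupindex up front instead of relying on an exception. The function names extract_regex_single_pass and extract_regex_first are hypothetical, not part of parsel, and the sketch leaves out the replace_entities step and the six compatibility check to stay self-contained.

```python
import re


def extract_regex_single_pass(regex, text):
    """Hypothetical variant of extract_regex that scans the text only once.

    Assumption: entity replacement is omitted; only the matching policy is shown.
    """
    if isinstance(regex, str):
        regex = re.compile(regex, re.UNICODE)

    # groupindex maps named groups to their numbers, so no exception is needed
    # to find out whether a named "extract" group exists.
    if "extract" in regex.groupindex:
        match = regex.search(text)
        return [match.group("extract")] if match else []

    results = []
    for item in regex.findall(text):
        # findall yields tuples when the pattern has several groups; flatten them.
        if isinstance(item, tuple):
            results.extend(item)
        else:
            results.append(item)
    return results


def extract_regex_first(regex, text, default=None):
    """Hypothetical first-match helper: stops at the first match."""
    if isinstance(regex, str):
        regex = re.compile(regex, re.UNICODE)

    match = regex.search(text)
    if match is None:
        return default
    if "extract" in regex.groupindex:
        return match.group("extract")
    if regex.groups:
        # findall reports a non-participating group as '', so mirror that here.
        value = match.group(1)
        return "" if value is None else value
    return match.group(0)
```

With this shape, a first-match lookup never builds the full list of matches, which is the cost described above.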

I created another branch for benchmarking (using pytest-benchmark). I'll test there if I can improve it.
https://github.com/Tethik/parsel/tree/re_benchmark_tests

@starrify

Regarding Selector.re_first / Selector.extract_first, is there any obvious disadvantage if we simply use something like this:

class Selector(object):
    def re_first(self, regex, default=None):
        return SelectorList([self]).re_first(regex, default)
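
To illustrate how that delegation would behave, here is a self-contained sketch using stand-in classes. MiniSelector and MiniSelectorList are hypothetical and only mimic the shape of the real classes; they match against raw text with re.findall rather than parsing anything.

```python
import re


class MiniSelectorList(list):
    """Hypothetical stand-in for SelectorList: a list of selectors."""

    def re(self, regex):
        # Collect the matches of every selector in the list.
        results = []
        for selector in self:
            results.extend(selector.re(regex))
        return results

    def re_first(self, regex, default=None):
        # Return the first collected match, or the default if there is none.
        for value in self.re(regex):
            return value
        return default


class MiniSelector:
    """Hypothetical stand-in for Selector, holding plain text."""

    def __init__(self, text):
        self.text = text

    def re(self, regex):
        return re.findall(regex, self.text)

    def re_first(self, regex, default=None):
        # Delegate by wrapping this selector in a one-element list.
        return MiniSelectorList([self]).re_first(regex, default)


selector = MiniSelector("<html><body><span>good</span></body></html>")
print(selector.re_first(r"<span>(\w+)</span>"))
print(selector.re_first(r"<div>(\w+)</div>", default="missing"))
```

In this sketch the delegation keeps the two classes consistent with very little code, but it still collects every match before returning the first one, so it does not address the performance concern raised earlier in the thread.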

@redapple

Fixed in #86
