How to work with a very large “allowed_domains” attribute in scrapy? #1908

Closed
15310944349 opened this issue Apr 7, 2016 · 15 comments

Comments

@15310944349

Because allowed_domains is very large, the following line raises an exception:
regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
How do I solve this problem?

@redapple
Contributor

redapple commented Apr 7, 2016

@kmike
Member

kmike commented Apr 7, 2016

For such large regexes with lots of ORs, the pyre2 library (https://github.com/axiak/pyre2) works much better than stdlib re.

@kmike
Member

kmike commented Apr 7, 2016

A regex with, e.g., 50K domains should be super-fast with pyre2. For such regexes, stdlib re matching is O(N) in the number of domains, but re2 can match in O(1) time with respect to the number of domains in the regex. I'm using a similar approach in https://github.com/scrapinghub/adblockparser.

@redapple
Contributor

redapple commented Apr 7, 2016

@15310944349
Author

@kmike How do I use re2 in a modified get_host_regex method? Could you give a complete sample? I haven't been able to solve this myself. Thank you.

@kmike
Member

kmike commented Apr 11, 2016

@15310944349 Install the pyre2 library and use it instead of stdlib re in your get_host_regex function from the SO question. It should be a drop-in replacement: just import re2 instead of re.
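
A minimal sketch of that drop-in swap, assuming pyre2 exposes the same `compile`, `escape`, and `search` names as stdlib re (which is what "drop-in replacement" implies here). The fallback import and the example domains are illustrative assumptions, not part of the original question.

```python
# Prefer pyre2 when it is installed; otherwise fall back to stdlib re so the
# sketch still runs (only slower for very large domain lists).
try:
    import re2 as re
except ImportError:
    import re


def get_host_regex(allowed_domains):
    """Build one regex that matches each allowed domain and its subdomains."""
    if not allowed_domains:
        return re.compile('')  # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(d) for d in allowed_domains if d is not None
    )
    return re.compile(regex)


# Hypothetical domains, for illustration only.
host_regex = get_host_regex(['example.com', 'example.org'])
allowed = bool(host_regex.search('www.example.com'))
blocked = bool(host_regex.search('notexample.com'))
print(allowed, blocked)
```

Only the import line differs from the stdlib version, so the rest of the function can stay as it was.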

@15310944349
Author

Yes, I wrote the following test code:

import re2 as re

def get_host_regex(allowed_domains):
    """Override this method to implement a different offsite policy"""
    if not allowed_domains:
        return re.compile('')  # allow all by default
    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)

def getDomain():
    # Read one domain per line, stopping at the first blank line or end of file.
    domains = []
    with open('/spider/distributed/wzws/sample/domain', 'r') as fop:
        for line in fop:
            line = line.strip()
            if not line:
                break
            domains.append(line)
    return domains

domains = getDomain()
get_host_regex(domains)

It produced the following error:

AttributeError: 'module' object has no attribute 'compile'

I don't know which re2 methods replace escape and compile. Do you have any suggestions? Thank you.

@15310944349
Author

@kmike My mistake. Thank you very much, re2 is very powerful. It worked with just import re2 as re, and everything is OK. Thank you.
Hahaha

@15310944349
Author

@kmike Oh no, when I run my Scrapy project, it reports that it is out of memory:

re2/dfa.cc:457: DFA out of memory: prog size 517126 mem 533686

@kmike
Member

kmike commented Apr 11, 2016

@15310944349 re2.compile has a max_mem argument; you can use it to increase the memory available to the re2 regex engine.

@15310944349
Author

@kmike I found the related information, but I don't know how to give the regex more memory. Can you give me a link for reference? Thank you.

@kmike
Member

kmike commented Apr 18, 2016

@15310944349 Just pass the max_mem argument to re.compile, as is done here.
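
A hedged sketch of passing `max_mem`, based only on the keyword named in this thread. The helper name `compile_with_budget`, the 64 MiB figure, the assumption that the value is a byte count, and the generated domains are all illustrative. Other re2 bindings may not accept this keyword, so the sketch falls back to a plain compile in that case.

```python
import re  # stdlib, used for escaping and as a fallback engine

try:
    import re2  # pyre2, as suggested in this thread
except ImportError:
    re2 = None


def compile_with_budget(pattern, max_mem=64 * 1024 * 1024):
    """Compile with re2 and a larger memory budget when that is possible."""
    if re2 is None:
        return re.compile(pattern)  # stdlib fallback: no max_mem setting
    try:
        # max_mem is assumed to be a byte count (64 MiB here, chosen arbitrarily).
        return re2.compile(pattern, max_mem=max_mem)
    except TypeError:
        # This re2 binding does not accept a max_mem keyword.
        return re2.compile(pattern)


# Hypothetical domain list, generated only to make the pattern large.
domains = ['site%d.example' % i for i in range(1000)]
pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in domains)
host_regex = compile_with_budget(pattern)

matched = host_regex.search('www.site42.example') is not None
unmatched = host_regex.search('www.site42.example.evil') is None
print(matched, unmatched)
```

If the "DFA out of memory" message still appears, raising the budget further is the natural next step; the right value depends on the size of the domain list.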

@15310944349
Author

@kmike Oh, thank you very much.

@redapple
Contributor

Worth documenting in the FAQ.
Closing for now.
