allowed_domains: Allow only root domain and no subdomains #3412
Note that you can always subclass |
If anyone else stumbles upon this issue, here is a simple solution for allowing the root domain only:

    # OffsiteMiddleware.py
    import re

    from scrapy.spidermiddlewares import offsite


    # Unlike the original implementation, this OffsiteMiddleware only allows URLs
    # on the root domain and rejects every subdomain.
    # With allowed_domains = ["example.com"] it allows example.com, but not
    # www.example.com or sub.example.com.
    # Original implementation:
    # https://github.com/scrapy/scrapy/blob/master/scrapy/spidermiddlewares/offsite.py
    class OffsiteMiddleware(offsite.OffsiteMiddleware):
        def get_host_regex(self, spider):
            regex = super().get_host_regex(spider)
            # Remove the optional "(.*\.)?" prefix (any subdomain) from the regex
            regex = regex.pattern.replace(r"(.*\.)?", "", 1)
            return re.compile(regex)

and in your Scrapy settings:

    # settings.py
    # Import our new offsite middleware
    from .OffsiteMiddleware import OffsiteMiddleware

    # Various Scrapy settings, such as BOT_NAME, USER_AGENT, etc...
    # ...

    # Replace the original OffsiteMiddleware with the customized OffsiteMiddleware
    SPIDER_MIDDLEWARES = {
        "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
        OffsiteMiddleware: 500,
    }

If you want the middleware to allow the root domain and the www subdomain only (as I needed), you can use this line in your middleware instead:

    # Allow root domain and www-domain only
    regex = regex.pattern.replace(r"(.*\.)?", r"(www\.)?", 1)
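To see what the `replace` call in the middleware does to the pattern, here is a minimal sketch that runs without Scrapy. It assumes the host regex for `allowed_domains = ["example.com"]` has the shape `^(.*\.)?(example\.com)$`; the exact pattern may differ between Scrapy versions, and the helper name `restrict` is made up for illustration.

```python
import re

# Assumption: this mirrors the pattern returned by get_host_regex() for
# allowed_domains = ["example.com"]. Check your Scrapy version before relying on it.
default_regex = re.compile(r"^(.*\.)?(example\.com)$")

# The optional "any subdomain" prefix that the middleware above strips out.
SUBDOMAIN_GROUP = r"(.*\.)?"


def restrict(compiled, replacement=""):
    """Hypothetical helper: swap the subdomain prefix for `replacement`."""
    return re.compile(compiled.pattern.replace(SUBDOMAIN_GROUP, replacement, 1))


root_only = restrict(default_regex)                  # root domain only
root_and_www = restrict(default_regex, r"(www\.)?")  # root domain plus www

for host in ("example.com", "www.example.com", "sub.example.com"):
    print(host, bool(root_only.search(host)), bool(root_and_www.search(host)))
```

Because the replacement is a plain string substitution on the pattern text, it silently does nothing if a future Scrapy release builds the regex differently, so it may be worth asserting that the pattern actually changed.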
We are trying to configure the allowed_domains list to only include the root domain and not any subdomains. As of now it doesn't seem possible.
Desired behavior
OK to crawl:
http://example.com
Shouldn't be crawled:
http://www.example.com
http://ww2.example.com
http://subdomain1.example.com
The following configuration allows root domain and ALL subdomains:
allowed_domains = ['example.com']
A solution that only allows the root domains should be added :)
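The gap between the current and the desired behavior can be sketched with the four URLs from this report. This is only an illustration, not Scrapy's code: `default_host_regex` is an assumed approximation of the stock pattern (optional subdomain prefix followed by the escaped domains), and `is_root_only` is a hypothetical exact-host check.

```python
import re
from urllib.parse import urlparse


def default_host_regex(allowed_domains):
    # Assumption: approximates the stock offsite pattern, which accepts
    # the listed domains and any subdomain of them.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_root_only(url, allowed_domains):
    # Desired behavior: the host must equal an allowed domain exactly.
    return (urlparse(url).hostname or "") in allowed_domains


urls = [
    "http://example.com",
    "http://www.example.com",
    "http://ww2.example.com",
    "http://subdomain1.example.com",
]
allowed_domains = ["example.com"]

regex = default_host_regex(allowed_domains)
current = [u for u in urls if regex.search(urlparse(u).hostname or "")]
desired = [u for u in urls if is_root_only(u, allowed_domains)]

print("current:", current)
print("desired:", desired)
```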