allowed_domains: Allow only root domain and no subdomains #3412

Open
ZakariaAhmed opened this issue Sep 3, 2018 · 6 comments
Comments

@ZakariaAhmed

We are trying to configure the allowed_domains list to only include the root domain and not any subdomains. As of now it doesn't seem possible.

Desired behavior
OK to crawl:
http://example.com

Shouldn't be crawled:
http://www.example.com
http://ww2.example.com
http://subdomain1.example.com

The following configuration allows root domain and ALL subdomains:
allowed_domains = ['example.com']

A solution that only allows the root domains should be added :)

@wRAR
Member

wRAR commented Sep 3, 2018

Note that you can always subclass OffsiteMiddleware and change the regex in get_host_regex().

@FinnWoelm

If anyone else stumbles upon this issue, here is a simple solution for allowing root domain only:

# OffsiteMiddleware.py

import re
from scrapy.spidermiddlewares import offsite

# Unlike the original implementation, this OffsiteMiddleware only allows URLs on
# the root domain and rejects every subdomain.
# With allowed_domains = ["example.com"], example.com is allowed, but
# www.example.com and sub.example.com are not.
# Original implementation:
# https://github.com/scrapy/scrapy/blob/master/scrapy/spidermiddlewares/offsite.py
class OffsiteMiddleware(offsite.OffsiteMiddleware):
    def get_host_regex(self, spider):
        regex = super().get_host_regex(spider)
        # Remove the optional "(.*\.)?" group (any subdomain) from the pattern.
        # A raw string keeps "\." from being read as an invalid string escape.
        regex = regex.pattern.replace(r"(.*\.)?", "", 1)
        return re.compile(regex)

and in your Scrapy settings.py:

# settings.py

# Import our new offsite middleware
from .OffsiteMiddleware import OffsiteMiddleware

# Various Scrapy settings, such as BOT_NAME, USER_AGENT, etc... 
# ...

# Overwrite the original OffsiteMiddleware with new customized OffsiteMiddleware
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    OffsiteMiddleware: 500,
}
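
A hedged variation, in case your Scrapy version does not accept class objects as SPIDER_MIDDLEWARES keys: refer to the custom middleware by its import path string instead. The package name `myproject` is a placeholder assumed for illustration and does not come from this thread; substitute the package that contains your OffsiteMiddleware.py.

```python
# settings.py (sketch; "myproject" is a hypothetical package name)

SPIDER_MIDDLEWARES = {
    # Disable the built-in offsite middleware
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Enable the customized one by import path, with the same order value as above
    "myproject.OffsiteMiddleware.OffsiteMiddleware": 500,
}
```

With string keys, the `from .OffsiteMiddleware import OffsiteMiddleware` line in settings.py is no longer needed.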

If you want the middleware to allow root domain and www subdomain only (as I needed), then you can use this line in your middleware instead:

# Allow root domain and www subdomain only
regex = regex.pattern.replace(r"(.*\.)?", r"(www\.)?", 1)
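
To sanity-check the rewrites without running a crawl, here is a minimal standalone sketch using only the `re` module. It assumes the stock get_host_regex() produces a pattern of the shape `^(.*\.)?(example\.com)$` for allowed_domains = ['example.com'], and that the middleware matches it against the request's hostname. The `allowed` helper and the host list are made up for the illustration.

```python
import re

# Assumed shape of the pattern built by the stock get_host_regex()
# for allowed_domains = ["example.com"]
default = re.compile(r"^(.*\.)?(example\.com)$")

# The same string rewrites as in the custom middleware
root_only = re.compile(default.pattern.replace(r"(.*\.)?", "", 1))
root_and_www = re.compile(default.pattern.replace(r"(.*\.)?", r"(www\.)?", 1))

# Hostnames from the issue description
hosts = [
    "example.com",
    "www.example.com",
    "ww2.example.com",
    "subdomain1.example.com",
]


def allowed(pattern, hostnames):
    """Return the hostnames that the given pattern accepts (hypothetical helper)."""
    return [h for h in hostnames if pattern.search(h)]


print(allowed(default, hosts))       # all four hosts
print(allowed(root_only, hosts))     # ['example.com']
print(allowed(root_and_www, hosts))  # ['example.com', 'www.example.com']
```

If the pattern in your Scrapy version has a different shape, the `replace()` call silently changes nothing, so it is worth checking `regex.pattern` once before relying on it.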
