allowed_domains: Allow only root domain and no subdomains #3412

ZakariaAhmed · 2018-09-03T11:42:58Z

We are trying to configure the allowed_domains list to only include the root domain and not any subdomains. As of now it doesn't seem possible.

Desired behavior
OK to crawl:
http://example.com

Shouldn't be crawled:
http://www.example.com
http://ww2.example.com
http://subdomain1.example.com

The following configuration allows the root domain and ALL subdomains:
allowed_domains = ['example.com']

An option that allows only the root domain should be added :)

wRAR · 2018-09-03T12:00:10Z

Note that you can always subclass OffsiteMiddleware and change the regex in get_host_regex().
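For context, here is a rough sketch of why `allowed_domains = ['example.com']` also matches subdomains. It assumes the middleware builds a pattern of the form `^(.*\.)?(domain1|domain2)$`, as in the implementation linked later in this thread; `build_host_regex` is a hypothetical stand-in, not a Scrapy function.

```python
import re


def build_host_regex(allowed_domains):
    """Hypothetical approximation of the pattern get_host_regex() builds."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    # The optional group (.*\.)? is what lets any subdomain through.
    return re.compile(rf"^(.*\.)?({domains})$")


host_regex = build_host_regex(["example.com"])

for host in ["example.com", "www.example.com", "subdomain1.example.com", "notexample.com"]:
    print(host, bool(host_regex.search(host)))
```

Under that assumption, removing or narrowing the optional `(.*\.)?` group is enough to change which hosts count as on-site.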

FinnWoelm · 2022-01-08T23:06:51Z

If anyone else stumbles upon this issue, here is a simple solution for allowing the root domain only:

# OffsiteMiddleware.py

import re
from scrapy.spidermiddlewares import offsite

# Unlike the original implementation, this OffsiteMiddleware only allows URLs on
# the root domain and rejects all subdomains.
# With allowed_domains = ["example.com"], it allows example.com, but not
# www.example.com or sub.example.com.
# Original implementation:
# https://github.com/scrapy/scrapy/blob/master/scrapy/spidermiddlewares/offsite.py
class OffsiteMiddleware(offsite.OffsiteMiddleware):
    def get_host_regex(self, spider):
        regex = super().get_host_regex(spider)
        # Remove the optional "(.*\.)?" group (any subdomain) from the pattern.
        # A raw string avoids the invalid "\." escape-sequence warning.
        regex = regex.pattern.replace(r"(.*\.)?", "", 1)
        return re.compile(regex)

and in your Scrapy settings.py:

# settings.py

# Import our new offsite middleware
from .OffsiteMiddleware import OffsiteMiddleware

# Various Scrapy settings, such as BOT_NAME, USER_AGENT, etc... 
# ...

# Disable the built-in OffsiteMiddleware and enable the customized one instead
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    OffsiteMiddleware: 500,
}

If you want the middleware to allow the root domain and the www subdomain only (as I needed), then you can use this line in your middleware instead:

# Allow the root domain and the www subdomain only
regex = regex.pattern.replace(r"(.*\.)?", r"(www\.)?", 1)
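To sanity-check both variants without running a crawl, the replacements can be tried against a hand-written pattern. This is only a sketch: `original` is an assumed example of what the stock `get_host_regex()` returns for `allowed_domains = ["example.com"]`, and `allowed` is a hypothetical helper.

```python
import re

# Assumed shape of the stock pattern for allowed_domains = ["example.com"].
original = re.compile(r"^(.*\.)?(example\.com)$")

# Variant 1: drop the optional subdomain group entirely.
root_only = re.compile(original.pattern.replace(r"(.*\.)?", "", 1))

# Variant 2: narrow the optional group to the www subdomain.
root_and_www = re.compile(original.pattern.replace(r"(.*\.)?", r"(www\.)?", 1))

hosts = [
    "example.com",
    "www.example.com",
    "ww2.example.com",
    "subdomain1.example.com",
]


def allowed(regex, candidates):
    """Return the hosts that the given pattern would treat as on-site."""
    return [h for h in candidates if regex.search(h)]


print(allowed(root_only, hosts))
print(allowed(root_and_www, hosts))
```

If the pattern in your Scrapy version does not contain the literal `(.*\.)?` group, `str.replace` silently changes nothing, so a check like this is worth running once after upgrading.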
