
Obey Robots.txt #180

Open
JohnMTrimbleIII opened this issue Jul 1, 2018 · 9 comments

Comments

@JohnMTrimbleIII

Is scrapy-splash not compatible with obeying robots.txt? Every time I make a request it attempts to download robots.txt from the Docker instance of Splash. Below is my settings file. I'm thinking it may be a misordering of the middlewares, but I'm not sure what it should look like.

# -*- coding: utf-8 -*-

# Scrapy settings for ishop project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ishop'

SPIDER_MODULES = ['ishop.spiders']
NEWSPIDER_MODULE = 'ishop.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'ishop (+http://www.ishop.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ishop.middlewares.IshopSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ishop.middlewares.IshopDownloaderMiddleware': 543,
#}



# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ishop.pipelines.HbasePipeline': 100
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 25,
    'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 650,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,    
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}


SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'


# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]





FRONTERA_SETTINGS = 'ishop.frontera.spiders'  # module path to your Frontera spider config module



SPLASH_URL = 'http://127.0.0.1:8050'
# SPLASH_URL= 'http://172.17.0.2:8050'


DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
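For context on why the Splash host shows up here at all: scrapy-splash's SplashMiddleware rewrites each SplashRequest into a request to the Splash HTTP API at SPLASH_URL, and it is that rewritten request that RobotsTxtMiddleware ends up checking. A minimal sketch with a hypothetical spider and target URL (none of these names come from the settings above):

    import scrapy
    from scrapy_splash import SplashRequest


    class ExampleSpider(scrapy.Spider):
        # hypothetical spider, only to illustrate the URL rewrite
        name = 'example'

        def start_requests(self):
            # the request starts out pointing at the target site...
            yield SplashRequest('https://www.example.com/', self.parse,
                                args={'wait': 0.5})

        def parse(self, response):
            # ...but on its way to the downloader it is rewritten into a POST to
            # SPLASH_URL (e.g. http://127.0.0.1:8050/render.html), so robots.txt
            # also ends up being fetched from the Splash host.
            self.logger.info('final page: %s', response.url)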




@ArthurJ

ArthurJ commented Feb 14, 2019

+1
I'm having the same problem. The spider looks for http://localhost:8050/robots.txt, which does not exist, and I'm having trouble getting the rules of my target site applied.

@Tobias-Keller

Tobias-Keller commented Feb 16, 2019

Same problem here. The spider first downloads the correct robots.txt and then tries to download the localhost robots.txt:
2019-02-16 21:51:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testwebsite.de/robots.txt> (referer: None)

2019-02-16 21:51:02 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
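For what it's worth, the stock RobotsTxtMiddleware derives the robots.txt URL from the scheme and netloc of whatever request it is handed, which would explain the second fetch above once the request has been rewritten to point at Splash. Roughly (paraphrased from the Scrapy 1.5-era source, so treat as approximate):

    # inside scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware.robot_parser()
    url = urlparse_cached(request)
    robotsurl = "%s://%s/robots.txt" % (url.scheme, url.netloc)
    # for a request that has been rewritten to the Splash endpoint this becomes
    # http://localhost:8050/robots.txt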

@JavierRuano

JavierRuano commented Feb 16, 2019 via email

@ArthurJ

ArthurJ commented Feb 17, 2019

I disabled the robots.txt middleware, subclassed it, and changed the line that builds the robots.txt URL, so it fetched the right URL and worked.

In my case I wanted to obey the robots.txt file, so just turning it off was not a solution.

@Tobias-Keller

> I disabled the robots.txt middleware, subclassed it, and changed the line that builds the robots.txt URL, so it fetched the right URL and worked.
>
> In my case I wanted to obey the robots.txt file, so just turning it off was not a solution.

Can you share this? Disabling robots.txt handling entirely is not an option.

@ArthurJ

ArthurJ commented Feb 18, 2019

    from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
    from scrapy.http import Request
    from twisted.internet.defer import Deferred

    from scrapy.utils.httpobj import urlparse_cached


    class MyRobotsTxtMiddleware(RobotsTxtMiddleware):
        
        def robot_parser(self, request, spider):
            url = urlparse_cached(request)
            netloc = url.netloc

            if netloc not in self._parsers:
                self._parsers[netloc] = Deferred()
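                # changed line vs. stock Scrapy: hardcode the target site's robots.txt
                # instead of deriving it from the request, which may already point at Splash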
                robotsurl = "https://www.example.com/robots.txt"
                robotsreq = Request(
                    robotsurl,
                    priority=self.DOWNLOAD_PRIORITY,
                    meta={'dont_obey_robotstxt': True}
                )
                dfd = self.crawler.engine.download(robotsreq, spider)
                dfd.addCallback(self._parse_robots, netloc)
                dfd.addErrback(self._logerror, robotsreq, spider)
                dfd.addErrback(self._robots_error, netloc)
                self.crawler.stats.inc_value('robotstxt/request_count')

            if isinstance(self._parsers[netloc], Deferred):
                d = Deferred()
                def cb(result):
                    d.callback(result)
                    return result
                self._parsers[netloc].addCallback(cb)
                return d
            else:
                return self._parsers[netloc]

@ArthurJ

ArthurJ commented Feb 18, 2019

SPIDER_MIDDLEWARES = {
    'mycrawler.middlewares.MyRobotsTxtMiddleware': 1,
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}
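One note on the registration above: in stock Scrapy, RobotsTxtMiddleware is a downloader middleware (it ships in DOWNLOADER_MIDDLEWARES_BASE at priority 100), so if the subclass is meant to replace it, something along these lines may be what is actually needed (a sketch reusing the module/class names from the snippet, which are project-specific):

    DOWNLOADER_MIDDLEWARES = {
        # disable the stock robots.txt middleware and slot the subclass in at the
        # same priority
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
        'mycrawler.middlewares.MyRobotsTxtMiddleware': 100,
    }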

@laisbsc

laisbsc commented Feb 8, 2020

@ArthurJ where did you add this code (the MyRobotsTxtMiddleware snippet above), though? I'm quite new to web crawling and have been having a lot of trouble with my crawler not returning what it should.
Thanks.
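Assuming a standard scrapy startproject layout (the project name below is a placeholder), a subclass like this normally lives in the project's middlewares.py, and the string used in the middleware settings dict is simply its dotted path, i.e. <project package>.middlewares.<class name>:

    # myproject/middlewares.py  -- "myproject" is a placeholder project name
    from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware


    class MyRobotsTxtMiddleware(RobotsTxtMiddleware):
        """Same override of robot_parser() as in the comment above."""

With that in place, 'myproject.middlewares.MyRobotsTxtMiddleware' is what goes into the settings dict shown earlier.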

@jazminebarroga

The same thing happens to me: the spider first downloads the correct robots.txt and then tries to download the localhost robots.txt. However, I still see in my logs that some links are Forbidden by robots.txt, so I'm a bit confused about whether the spider really obeys robots.txt or not.
