
[docs] quotes.toscrape.com unavailable via HTTP #5395

Closed
peter-gy opened this issue Feb 6, 2022 · 4 comments

Comments

@peter-gy (Contributor)

peter-gy commented Feb 6, 2022

Description

In some cases there is no automatic redirect from HTTP to HTTPS when accessing quotes.toscrape.com as part of the official tutorial. Following the steps outlined in the Our first Spider section of the docs and then executing scrapy crawl quotes results in a crawling process that does not terminate.

Steps to Reproduce

  1. Execute scrapy startproject tutorial
  2. Create a Spider as outlined in the Our first Spider section of the docs
  3. Execute scrapy crawl quotes
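The tutorial spider saves each crawled page as quotes-<page>.html. A stdlib-only sketch of that fetch-and-save behaviour follows; note this is not the Scrapy code from the docs (the tutorial uses a scrapy.Spider), urllib merely stands in for the spider's request machinery, and the timeout parameter is an addition for illustration:

```python
# Stdlib-only approximation of the tutorial spider's parse() step:
# derive quotes-<page>.html from the URL and save the response body.
# NOT the actual docs code -- the tutorial uses a scrapy.Spider.
from urllib.request import urlopen


def quotes_filename(url: str) -> str:
    """Derive quotes-<page>.html the way the tutorial's parse() does."""
    page = url.split("/")[-2]  # ".../page/1/" -> "1"
    return f"quotes-{page}.html"


def save_page(url: str, timeout: float = 10.0) -> str:
    """Fetch url and write its body to the derived filename."""
    filename = quotes_filename(url)
    # With a plain http:// URL and no server-side redirect, this call
    # would block until the timeout -- the hang described in this issue.
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read()
    with open(filename, "wb") as f:
        f.write(body)
    return filename
```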

Expected behavior: The crawling process completes successfully, with output similar to:

2022-02-06 18:46:12 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: tutorial)
2022-02-06 18:46:12 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.10 (main, Jan 15 2022, 11:40:53) - [Clang 13.0.0 (clang-1300.0.29.3)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-12.2-arm64-arm-64bit
2022-02-06 18:46:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-06 18:46:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2022-02-06 18:46:12 [scrapy.extensions.telnet] INFO: Telnet Password: 7011135517cc2745
2022-02-06 18:46:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-02-06 18:46:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-06 18:46:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-06 18:46:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-06 18:46:12 [scrapy.core.engine] INFO: Spider opened
2022-02-06 18:46:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-06 18:46:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-02-06 18:46:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2022-02-06 18:46:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2022-02-06 18:46:13 [quotes] DEBUG: Saved file quotes-1.html
2022-02-06 18:46:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2022-02-06 18:46:14 [quotes] DEBUG: Saved file quotes-2.html
2022-02-06 18:46:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-02-06 18:46:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5866,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 2.134587,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 2, 6, 17, 46, 14, 197612),
 'httpcompression/response_bytes': 24787,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'memusage/max': 64536576,
 'memusage/startup': 64536576,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2022, 2, 6, 17, 46, 12, 63025)}
2022-02-06 18:46:14 [scrapy.core.engine] INFO: Spider closed (finished)

Actual behavior: The crawling process hangs after the following output:

2022-02-06 18:47:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: tutorial)
2022-02-06 18:47:35 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.10 (main, Jan 15 2022, 11:40:53) - [Clang 13.0.0 (clang-1300.0.29.3)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-12.2-arm64-arm-64bit
2022-02-06 18:47:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-02-06 18:47:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tutorial',
 'NEWSPIDER_MODULE': 'tutorial.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tutorial.spiders']}
2022-02-06 18:47:35 [scrapy.extensions.telnet] INFO: Telnet Password: ebd272840e134208
2022-02-06 18:47:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-02-06 18:47:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-02-06 18:47:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-02-06 18:47:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-06 18:47:35 [scrapy.core.engine] INFO: Spider opened
2022-02-06 18:47:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-06 18:47:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

Reproduces how often: 100%

Versions

Scrapy       : 2.5.1
lxml         : 4.7.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.10 (main, Jan 15 2022, 11:40:53) - [Clang 13.0.0 (clang-1300.0.29.3)]
pyOpenSSL    : 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021)
cryptography : 36.0.1
Platform     : macOS-12.2-arm64-arm-64bit
@wRAR (Member)
wRAR commented Feb 7, 2022

quotes.toscrape.com unavailable via HTTP

This is not true.

2022-02-07 11:43:59 [scrapy.core.engine] INFO: Spider opened
2022-02-07 11:44:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)

Execute scrapy crawl quotes

This works for me:

2022-02-07 11:45:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2022-02-07 11:45:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)

@peter-gy (Contributor, Author)

peter-gy commented Feb 7, 2022

Hi @wRAR!
The issue may be specific to my environment; however, I keep experiencing the behaviour described above.

Using plain HTTP, the request times out:

[~]$ curl http://quotes.toscrape.com
curl: (28) Connection timed out after 300124 milliseconds

Using HTTPS, the request succeeds:

[~]$ curl https://quotes.toscrape.com
<!DOCTYPE html>
...

Please note that I do not have any custom networking setup, so my environment probably should not be treated as an "exotic edge case".
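The same comparison can be reproduced without curl via a timed TCP probe (hypothetical stdlib-only helper; port 80 corresponds to plain HTTP, 443 to HTTPS):

```python
# Timed TCP connectivity probe, approximating what the curl timeout shows:
# can we open a connection to the given host/port within the deadline?
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Network-dependent examples (mirroring the curl runs above):
# can_connect("quotes.toscrape.com", 80)   # timed out in this report
# can_connect("quotes.toscrape.com", 443)  # succeeded
```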

@Gallaecio (Member)

Gallaecio commented Feb 7, 2022

quotes.toscrape.com unavailable via HTTP

This is not true.

I suspected as much (and I have just confirmed that curl http://quotes.toscrape.com works here), but I also believe that we should be using HTTPS by default everywhere, so I am glad about #5396.
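The docs-side change amounts to using https:// URLs directly instead of relying on a redirect. As a hedged illustration of that normalization (hypothetical helper, not code from #5396):

```python
# Upgrade an http:// URL to https:// -- the effect of the docs change:
# start from HTTPS rather than depending on a server-side redirect.
from urllib.parse import urlsplit, urlunsplit


def force_https(url: str) -> str:
    """Return url with an http scheme upgraded to https; others pass through."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```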

