Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request cookies not updated with values from previous response #4717

Closed
alfredsasko opened this issue Aug 10, 2020 · 4 comments · Fixed by #4823
Closed

Request cookies not updated with values from previous response #4717

alfredsasko opened this issue Aug 10, 2020 · 4 comments · Fixed by #4823
Labels
Milestone

Comments

@alfredsasko
Copy link

alfredsasko commented Aug 10, 2020

Description

[Description of the issue]

Steps to Reproduce

  1. Create 2 conda environments one scrapy160 with scrapy 1.6.0 version installed and scrapy230 with scrapy 2.3.0. version installed.
  2. conda activate scrapy160
  3. scrapy startproject test_login
  4. cd test_login
  5. scrapy genspider login quotes.toscrape.com/login
  6. Adapt login.py
import scrapy
from scrapy import FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains = ['quotes.toscrape.com']

    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        return FormRequest.from_response(
            response,
            # formname='',
            # formcss='',
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        self.logger.info(logout)
  1. scrapy crawl login
  2. conda activate scrapy230
  3. scrapy crawl login

Expected behavior:
After login into http://quotes.toscrape.com/login the spider is redirected to http://quotes.toscrape.com/ and locates and prints the text of the log out link on the right top side of the page for version 1.6.0 and also 2.3.0. Refer to valid [login] INFO: Logout vs invalid [login] INFO: None

Actual behavior: Version 1.6.0. is able to locate log out link while version 2.3.0 can't

Reproduces how often: 100%

Output of the Version 1.6.0 - OK

$ scrapy crawl login
2020-08-10 12:49:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: test_login)
2020-08-10 12:49:55 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.5 (default, Aug  5 2020, 09:44:06) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-7-6.1.7601-SP1
2020-08-10 12:49:55 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'test_login', 'NEWSPIDER_MODULE': 'test_login.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['test_login.spiders']}
2020-08-10 12:49:55 [scrapy.extensions.telnet] INFO: Telnet Password: e0f8343159af7408
2020-08-10 12:49:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-08-10 12:49:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-10 12:49:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-10 12:49:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-10 12:49:55 [scrapy.core.engine] INFO: Spider opened
2020-08-10 12:49:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-10 12:49:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-10 12:49:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-08-10 12:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2020-08-10 12:49:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2020-08-10 12:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2020-08-10 12:49:55 [login] INFO: Logout
2020-08-10 12:49:55 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-10 12:49:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1437,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 4763,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 10, 10, 49, 55, 999506),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 1,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 8, 10, 10, 49, 55, 497942)}
2020-08-10 12:49:56 [scrapy.core.engine] INFO: Spider closed (finished)

Output of the Version 2.3.0 - NOT OK

$ scrapy crawl login
2020-08-10 12:51:37 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: test_login)
2020-08-10 12:51:37 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-7-6.1.7601-SP1
2020-08-10 12:51:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-10 12:51:37 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'test_login',
 'NEWSPIDER_MODULE': 'test_login.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['test_login.spiders']}
2020-08-10 12:51:37 [scrapy.extensions.telnet] INFO: Telnet Password: 445dedc895a440a5
2020-08-10 12:51:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-08-10 12:51:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-10 12:51:38 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-10 12:51:38 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-10 12:51:38 [scrapy.core.engine] INFO: Spider opened
2020-08-10 12:51:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-10 12:51:38 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-10 12:51:38 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-08-10 12:51:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2020-08-10 12:51:38 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2020-08-10 12:51:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2020-08-10 12:51:38 [login] INFO: None
2020-08-10 12:51:38 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-10 12:51:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1431,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 4585,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.502564,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 10, 10, 51, 38, 622537),
 'log_count/DEBUG': 4,
 'log_count/INFO': 11,
 'request_depth_max': 1,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 8, 10, 10, 51, 38, 119973)}
2020-08-10 12:51:38 [scrapy.core.engine] INFO: Spider closed (finished)

Additional context

@Gallaecio Gallaecio added the bug label Aug 10, 2020
@Gallaecio
Copy link
Member

Thanks for the detailed report.

@wRAR
Copy link
Member

wRAR commented Aug 24, 2020

I ran similar code with COOKIES_DEBUG=True, looks like the cookie in Set-Cookie sent by the 302 response is ignored since Scrapy 2.2 (so probably #2400 is the cause? cc: @elacuesta )

@elacuesta
Copy link
Member

elacuesta commented Aug 24, 2020

Indeed, that seems to be the cause of the issue. The breaking point is 2.2.0, as @wRAR mentions, the included snippet works as expected in 2.1.0.
This particular page always returns a "session" cookie (even for the initial GET / request), which gets stored in the jar and added to the headers for subsequent requests. When the 302 response returns a new cookie, it's stored in the jar correctly, but it gets overwritten in these lines with the previous value, taken from the redirect request.
This is tricky, I think respecting the request headers is necessary, but in this case we would need to update them with the more recent value from the response. This means we would need a way to determine which cookie is the newest, something I don't think we can do at the moment.
Sorry about that, seems like I only traded two bugs for one 😅

@elacuesta elacuesta changed the title scrapy.FormRequest.from_response() method failed to login Request cookies not updated with values from previous response Aug 24, 2020
@elacuesta
Copy link
Member

I think this is what's happening, please take a look @Gallaecio @kmike @wRAR

Cookies issue

  1. GET request to "/login" is processed normally, no cookies are added to the request
  2. 200 response is processed by the cookies middleware, a first session cookie ("cookie A") is stored in the cookiejar, the response reaches the engine normally
  3. POST request to "/login" is processed, cookie A is added from the cookiejar
  4. 302 response is processed, a new session cookie ("cookie B") replaces cookie A in the cookie jar. Cookie B is the one that represents the logged user.
  5. Redirect request is scheduled, cookie A is in the request headers (the request from step 3).
  6. Redirect request is processed, cookie A is read from the request headers and stored in the cookiejar, overriding cookie B.

The problem is in step 6. Could we solve this by removing cookies from redirect requests and letting the cookies middleware add the necessary ones back? That sounds too easy to be true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants