Persist cookies between spider runs #5930

Closed
vigandika opened this issue May 11, 2023 · 9 comments

@vigandika

Summary

Scrapy's CookiesMiddleware persists cookies and shares them between subsequent requests from the same spider. Once the spider is closed, the cookies are lost. If another spider (or the same one) that interacts with the same website is run next, it cannot use the cookies retrieved from the earlier run.

Motivation

Picture this scenario:

I have four different spiders which need to scrape different pages of the same website, and they need to be run 3 times a day, every day. The pages I need to scrape are protected, i.e. I need to be logged in, and the website implements cookie-based authentication to create and maintain sessions. The cookie expires after 24hrs.

I have already implemented the login in the middleware. To emphasize the need for this issue, the website I am dealing with happens to be very slow.

Current flow

Currently, each spider will perform the following:

  1. Request the target page
  2. Perform the login (since the request is not authenticated)
  3. Redirect to the target page
  4. Scrape data

This same process is repeated 12 times a day (3 runs a day for each of the 4 spiders).

Ideal flow (after implementing this feature)

If we could persist the session between spider runs, the flow would be as follows:

On 1st spider run:

  1. Request the target page
  2. Perform the login (since the request is not authenticated)
  3. Redirect to the target page
  4. Scrape data

On subsequent spider runs:

  1. Request the target page
  2. Scrape data

By persisting the session between different spider runs, the login would've only been triggered once. The other spider runs can use the cookie retrieved by the first spider run which correctly underwent the login flow, and directly scrape the data, avoiding any delay and potential issues caused by the login process.
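
To make the ideal flow concrete, here is a minimal sketch using only the Python standard library, not Scrapy's own API. The names `save_session`, `load_session`, `cookies_for_run`, and the `login` callable are hypothetical, and the 24-hour lifetime is taken from the scenario above. It assumes the session cookies can be represented as a plain dict.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

# Assumption from the scenario above: the session cookie expires after 24 hours.
COOKIE_TTL_SECONDS = 24 * 60 * 60


def save_session(path: Path, cookies: dict, now: Optional[float] = None) -> None:
    """Write the cookies and the time they were obtained to a JSON file."""
    saved_at = time.time() if now is None else now
    path.write_text(json.dumps({"saved_at": saved_at, "cookies": cookies}))


def load_session(path: Path, now: Optional[float] = None) -> Optional[dict]:
    """Return the stored cookies if the file exists and is younger than the TTL."""
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    current = time.time() if now is None else now
    if current - payload["saved_at"] >= COOKIE_TTL_SECONDS:
        return None  # too old: the caller has to log in again
    return payload["cookies"]


def cookies_for_run(path: Path, login: Callable[[], dict],
                    now: Optional[float] = None) -> dict:
    """Reuse stored cookies while they are valid; otherwise log in and store them."""
    cookies = load_session(path, now)
    if cookies is None:
        cookies = login()  # hypothetical callable that performs the slow login
        save_session(path, cookies, now)
    return cookies
```

In a spider, the returned dict could then be passed as the `cookies` argument of `scrapy.Request`. The file holds live session credentials, so it would need to be stored with restricted permissions.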

Alternatives used until Scrapy 2.8.0

This feature was available in the scrapy-cookies lib (see https://scrapy-cookies.readthedocs.io/en/latest/intro/tutorial.html#save-cookies-and-restore-in-your-next-run), and proved to be very useful, especially with the login cookie. Unfortunately, scrapy-cookies is incompatible with Scrapy 2.8.0 or newer.

The breaking point was the removal of the deprecated method scrapy.utils.python.to_native_str, which was removed in Scrapy 2.8.0 but still used by scrapy-cookies.

@vigandika changed the title from "Persist cookie sessions between spider runs" to "Persist cookies between spider runs" on May 11, 2023
@GeorgeA92
Contributor

  1. The related change in Scrapy's cookies middleware was made in commit 397e883, where scrapy.utils.python.to_native_str was replaced by scrapy.utils.python.to_unicode.
  2. Comparing the _debug_cookie and _debug_set_cookie methods in scrapy-cookies with those in current Scrapy, I don't see any other differences.
  3. The middleware calls scrapy.utils.python.to_native_str only when COOKIES_DEBUG is True; switching that setting to False may restore it (not sure).
  4. It looks like the middleware in scrapy-cookies is a hardcoded copy of Scrapy's original cookies middleware (the latest version available at that time) with custom functionality added.

cc @grammy-jiang

@wRAR
Member

wRAR commented May 11, 2023

The breaking point was the removal of the deprecated method scrapy.utils.python.to_native_str, which was removed in Scrapy 2.8.0 but still used by scrapy-cookies.

Updating the code to not use it should be trivial though.
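
As a rough illustration of how small such a change could be, the sketch below picks the first helper name that a module still provides. `import_first` is a hypothetical helper, not part of Scrapy or scrapy-cookies, and it assumes the newer function is a drop-in replacement for the removed one, as the commit mentioned earlier in this thread suggests. A plain `try`/`except ImportError` around the two imports would work equally well.

```python
import importlib
from typing import Any, Sequence


def import_first(module_name: str, candidates: Sequence[str]) -> Any:
    """Return the first attribute in `candidates` that `module_name` provides."""
    module = importlib.import_module(module_name)
    for name in candidates:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"none of {list(candidates)} found in {module_name}")


# Hypothetical use inside the extension (requires Scrapy to be installed):
# to_native_str = import_first("scrapy.utils.python",
#                              ["to_native_str", "to_unicode"])

# Standard-library demonstration: the first name is missing, so the second is used.
join = import_first("os.path", ["no_such_helper", "join"])
```

Because the lookup happens once at import time, the rest of the extension can keep calling the old name unchanged.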

@GeorgeA92
Contributor

The pages I need to scrape are protected, i.e. I need to be logged in, and the website implements cookie-based authentication to create and maintain sessions

BTW, I am strictly against adding this or any similar functionality to Scrapy itself, especially as a feature enabled by default.

Storing any kind of sensitive information, such as session data obtained after login, will definitely be an additional source of security concerns.

@vigandika
Author

Thanks @GeorgeA92 and @wRAR for the comments.

middleware call that scrapy.utils.python.to_native_str only on COOKIES_DEBUG:True - switching that setting value to False may restore (not sure) it.

An ImportError is raised here when the module is initialized, before the method can even be called.
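
A small standard-library sketch of why the setting cannot help: a name imported at the top of a module is resolved when the module is loaded, whether or not the function that uses it ever runs. The module source and the names `helper_that_was_removed` and `import_outcome` are made up for this demonstration and are not the actual scrapy-cookies code.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Stand-in for a middleware module: the helper is only *used* when debugging,
# but it is *imported* unconditionally at the top of the file.
BROKEN_MODULE = textwrap.dedent(
    """
    from string import helper_that_was_removed  # this name does not exist

    COOKIES_DEBUG = False

    def debug_cookie(value):
        if COOKIES_DEBUG:
            return helper_that_was_removed(value)
        return value
    """
)


def import_outcome(source: str, name: str = "demo_middleware") -> str:
    """Write `source` to a temporary module, import it, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{name}.py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(name)
            return "ok"
        except ImportError as exc:
            return type(exc).__name__
        finally:
            # Clean up so repeated calls start from a fresh state.
            sys.path.remove(tmp)
            sys.modules.pop(name, None)
```

Calling `import_outcome(BROKEN_MODULE)` reports an `ImportError` even though `COOKIES_DEBUG` is `False` and `debug_cookie` is never called.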

Updating the code to not use it should be trivial though.

Because of the ImportError mentioned above, the only option would be to fork the scrapy-cookies repository. This would be our last resort.

Storing any kind of.. sensitive information like session(under login) data - definitely will be additional source of.. related security concerns.

I can understand your concerns, and our guess as to why the feature is not yet implemented was exactly that. But the feature would also bring a big advantage. In my opinion, as long as the functionality is disabled by default and Scrapy users can decide whether they want it (e.g. a COOKIES_PERSISTENCE = True/False flag in the settings), they bear the responsibility.

@wRAR
Member

wRAR commented May 12, 2023

Because of the ImportError mentioned above, the only option would be to fork the scrapy-cookies repository. This would be our last resort.

Yes, I was talking about fixing scrapy-cookies. If you can't or don't want to fork it you can wait for someone else to do that or to fix the upstream. That's at least comparable to waiting for a feature to be implemented in Scrapy.

@wRAR
Member

wRAR commented Jun 21, 2023

As there is a 3rd-party extension that does this it makes sense to just use it instead of adding this to Scrapy, unless there are things that can only be done in Scrapy (but it looks like the extension already solves the stated problem).

@wRAR closed this as not planned on Jun 21, 2023
@wRAR
Member

wRAR commented Jun 21, 2023

Related: #5431 #5463

@grammy-jiang
Contributor

grammy-jiang commented Jul 13, 2023

@vigandika @GeorgeA92 Sorry, I haven't worked with Scrapy for a while and lost track of the latest version. I can fix this compatibility issue.

@grammy-jiang
Contributor

grammy-jiang commented Jul 13, 2023

@vigandika I just fixed the import you mentioned. My development environment has some issues with tests. Please give it a try and let me know if there are any further problems.

Please use pip install from git with tag 0.4.
