Persist cookies between spider runs #5930

vigandika · 2023-05-11T15:08:28Z

Summary

Scrapy's CookiesMiddleware persists cookies and shares them between subsequent requests from the same spider. Once the spider is closed, the cookies are lost. If another spider (or the same one) that interacts with the same website is run next, it cannot use the cookies retrieved during the earlier run.

Motivation

Picture this scenario:

I have four different spiders that scrape different pages of the same website, and each needs to run 3 times a day, every day. The pages I need to scrape are protected, i.e. I need to be logged in, and the website uses cookie-based authentication to create and maintain sessions. The cookie expires after 24 hours.

I have already implemented the login in the middleware. What makes this more pressing is that the website I am dealing with happens to be very slow.

Current flow

Currently, each spider will perform the following:

1. Request the target page
2. Perform the login (since the request is not authenticated)
3. Redirect to the target page
4. Scrape data

This same process is repeated 12 times a day (3 runs a day for each of the 4 spiders).

Ideal flow (after implementing this feature)

If we could persist the session between spider runs, the flow would be as follows:

On 1st spider run:

1. Request the target page
2. Perform the login (since the request is not authenticated)
3. Redirect to the target page
4. Scrape data

On subsequent spider runs:

1. Request the target page
2. Scrape data

By persisting the session between spider runs, the login would be triggered only once. The other runs can reuse the cookie retrieved by the first run, which went through the login flow, and scrape the data directly, avoiding the delay and potential issues caused by the login process.
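To make the idea concrete, here is a minimal sketch of saving cookies at the end of one run and restoring the unexpired ones at the start of the next. It uses only the standard library and is not how Scrapy or scrapy-cookies does it; the function names, the JSON file layout, and the `expires` key (a Unix timestamp, or None for no expiry) are assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def save_cookies(cookies, path):
    """Write a list of cookie dicts to a JSON file (hypothetical layout)."""
    Path(path).write_text(json.dumps(cookies))


def load_cookies(path, now=None):
    """Return the stored cookies that have not expired yet.

    Returns an empty list when no cookie file exists, which is the
    "first spider run" case where the login still has to happen.
    """
    file = Path(path)
    if not file.exists():
        return []
    now = time.time() if now is None else now
    cookies = json.loads(file.read_text())
    # Keep cookies without an expiry and cookies that expire in the future.
    return [c for c in cookies if c.get("expires") is None or c["expires"] > now]
```

In a spider, the name/value pairs returned by `load_cookies` could then be passed to a request through the `cookies` argument of `scrapy.Request`. Note that the file holds live session data, so it should be stored with the same care as a password.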

Alternatives used until Scrapy 2.8.0

This feature was available in the scrapy-cookies lib (see https://scrapy-cookies.readthedocs.io/en/latest/intro/tutorial.html#save-cookies-and-restore-in-your-next-run), and proved to be very useful, especially with the login cookie. Unfortunately, scrapy-cookies is incompatible with Scrapy 2.8.0 or newer.

The breaking point was the removal of the deprecated function scrapy.utils.python.to_native_str in Scrapy 2.8.0, which scrapy-cookies still uses.


GeorgeA92 · 2023-05-11T19:06:48Z

A related change was made to Scrapy's cookies middleware in commit 397e883.
In that commit, scrapy.utils.python.to_native_str was replaced by scrapy.utils.python.to_unicode.
Comparing the _debug_cookie and _debug_set_cookie methods in scrapy-cookies with those in current Scrapy, I don't see any other differences.
The middleware calls scrapy.utils.python.to_native_str only when COOKIES_DEBUG is True, so switching that setting to False may (not sure) restore it.
It looks like the middleware in scrapy-cookies is a hardcoded copy of Scrapy's original cookies middleware (the latest version available at that time) with custom functionality added.

cc @grammy-jiang

wRAR · 2023-05-11T19:07:30Z

> The breaking point was the removal of the deprecated method scrapy.utils.python.to_native_str, which was removed in Scrapy 2.8.0 but still used by scrapy-cookies.

Updating the code to not use it should be trivial though.

GeorgeA92 · 2023-05-11T19:07:43Z

> The pages I need to scrape are protected, i.e. I need to be logged in, and the website implements cookie-based authentication to create and maintain sessions

BTW, I am strictly against adding this or any similar functionality to Scrapy itself, especially as a feature enabled by default.

Storing any kind of sensitive information, such as the data of a logged-in session, will definitely be an additional source of security concerns.

vigandika · 2023-05-12T16:11:13Z

Thanks @GeorgeA92 and @wRAR for the comments.

> middleware call that scrapy.utils.python.to_native_str only on COOKIES_DEBUG:True - switching that setting value to False may restore (not sure) it.

An ImportError is raised as soon as the module is imported, before the method can even be called, so changing COOKIES_DEBUG does not help.

> Updating the code to not use it should be trivial though.

Because of the ImportError mentioned above, the only option would be to fork the scrapy-cookies repository. This would be our last resort.
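For whoever ends up patching scrapy-cookies (in a fork or upstream), a compatibility shim along these lines might be enough. This is only a sketch: `import_first_available` is a hypothetical helper, not part of Scrapy or scrapy-cookies, and the commented-out usage assumes that `to_unicode` is an acceptable stand-in for `to_native_str`, as the comment about commit 397e883 above suggests.

```python
import importlib


def import_first_available(module_name, *names):
    """Return the first attribute in `names` that exists on the module.

    Hypothetical helper, not part of Scrapy or scrapy-cookies.
    """
    module = importlib.import_module(module_name)
    for name in names:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"none of {names} found in {module_name}")


# Standard-library demonstration: the first name does not exist,
# so the helper falls back to the second one.
join = import_first_available("os.path", "no_such_function", "join")

# Assumed usage inside scrapy-cookies (needs Scrapy, so left commented out):
# to_native_str = import_first_available(
#     "scrapy.utils.python", "to_native_str", "to_unicode"
# )
```

With a lookup like this, the module would import on both older and newer Scrapy versions instead of failing at import time.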

> Storing any kind of.. sensitive information like session(under login) data - definitely will be additional source of.. related security concerns.

I understand your concerns, and our guess as to why the feature is not yet implemented was exactly that. But the feature would also bring a big advantage. In my opinion, as long as the functionality is disabled by default and Scrapy users can decide whether they want it (e.g. a COOKIES_PERSISTENCE = True/False flag in the settings), they bear the responsibility.

wRAR · 2023-05-12T17:30:11Z

> Because of the ImportError mentioned above, the only option would be to fork the scrapy-cookies repository. This would be our last resort.

Yes, I was talking about fixing scrapy-cookies. If you can't or don't want to fork it, you can wait for someone else to do that or to fix it upstream. That's at least comparable to waiting for a feature to be implemented in Scrapy.

wRAR · 2023-06-21T09:20:39Z

As there is a 3rd-party extension that does this, it makes sense to just use it instead of adding this to Scrapy, unless there are things that can only be done in Scrapy (but it looks like the extension already solves the stated problem).

wRAR · 2023-06-21T09:53:54Z

Related: #5431 #5463

grammy-jiang · 2023-07-13T09:47:15Z

@vigandika @GeorgeA92 Sorry, I haven't worked with Scrapy for a while and lost track of the latest version. I can fix this compatibility issue.

grammy-jiang · 2023-07-13T10:24:01Z

@vigandika I just fixed the import you mentioned. My development environment has some issues with the tests. Please give it a try and let me know if there are any further problems.

Please use pip install from git with tag 0.4.

vigandika changed the title from "Persist cookie sessions between spider runs" to "Persist cookies between spider runs" on May 11, 2023

wRAR closed this as not planned on Jun 21, 2023
