Persist cookies between spider runs #5930
Comments
Updating the code to not use it should be trivial, though.
BTW, I am strictly against applying this or any similar functionality to Scrapy itself, especially as a feature enabled by default. Storing any kind of sensitive information, like session (logged-in) data, will definitely be an additional source of related security concerns.
Thanks @GeorgeA92 and @wRAR for the comments.
I can understand your concerns, and our guess as to why the feature is not yet implemented was exactly that. But there is also a big advantage that the feature would bring. In my opinion, as long as the functionality is disabled by default and as long as Scrapy users can decide whether they want it (e.g. via an opt-in setting), it would be a valuable addition.
Yes, I was talking about fixing that.
As there is a 3rd-party extension that does this, it makes sense to just use it instead of adding this to Scrapy, unless there are things that can only be done in Scrapy (but it looks like the extension already solves the stated problem).
@vigandika @GeorgeA92 Sorry, I haven't worked with Scrapy for a while and lost track of the latest version. I can fix this compatibility issue.
@vigandika I just fixed the import you mentioned. My development environment has some issues with tests, so please give it a try and let me know if there is any further problem. Please use pip install from git with tag 0.4.
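The install-from-tag instruction above can be run roughly as follows (the repository URL is an assumption based on the extension's docs; substitute the actual upstream repo if it differs):

```shell
# Install scrapy-cookies directly from git at tag 0.4
# (repository URL is an assumption; adjust to the actual upstream repo)
pip install "git+https://github.com/grammy-jiang/scrapy-cookies.git@0.4"
```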
Summary
Scrapy's CookiesMiddleware persists cookies and shares them between subsequent requests from the same spider. Once the spider is closed, the cookies are lost. If another spider (or the same one) that interacts with the same website is run next, it cannot use the cookies retrieved during the earlier run.
Motivation
Picture this scenario:
I have four different spiders which need to scrape different pages of the same website, and they need to be run 3 times a day, every day. The pages I need to scrape are protected, i.e. I need to be logged in, and the website implements cookie-based authentication to create and maintain sessions. The cookie expires after 24hrs.
I have already implemented the login in the middleware. To emphasize the need for this issue, the website I am dealing with happens to be very slow.
Current flow
Currently, each spider will perform the following:
1. Go through the (slow) login flow against the website.
2. Receive the session cookie from the login response.
3. Scrape the protected pages using that cookie.

This same process will be repeated 12 times a day (3 runs for each of the 4 spiders).
Ideal flow (after implementing this feature)
If we could persist the session between spider runs, the flow would be as follows:
On the 1st spider run:
1. Go through the login flow and receive the session cookie.
2. Persist the cookie.
3. Scrape the protected pages.

On subsequent spider runs:
1. Restore the persisted cookie.
2. Scrape the protected pages directly, skipping the login.
By persisting the session between different spider runs, the login would've only been triggered once. The other spider runs can use the cookie retrieved by the first spider run which correctly underwent the login flow, and directly scrape the data, avoiding any delay and potential issues caused by the login process.
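As a minimal illustration of the persistence idea, here is a sketch using the stdlib `http.cookiejar` rather than Scrapy's actual middleware API (Scrapy's jar wraps `http.cookiejar` internally, so the same save/restore idea applies; the helper names here are hypothetical):

```python
import http.cookiejar
import os

# Sketch: persist a cookie jar across runs so a later run can reuse the
# session cookie obtained by an earlier run's login. Helper names are
# hypothetical, not part of Scrapy's API.

def save_cookies(jar: http.cookiejar.MozillaCookieJar, path: str) -> None:
    # Keep session cookies (no expiry) too, since they carry the login.
    jar.save(path, ignore_discard=True, ignore_expires=True)

def load_cookies(path: str) -> http.cookiejar.MozillaCookieJar:
    jar = http.cookiejar.MozillaCookieJar()
    if os.path.exists(path):
        # Restore cookies persisted by an earlier run, if any.
        jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar
```

On spider open, a run would load the jar and attach its cookies to the first request; on spider close, it would save the jar back to disk for the next run.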
Alternatives used until Scrapy 2.8.0
This feature was available in the scrapy-cookies lib (see https://scrapy-cookies.readthedocs.io/en/latest/intro/tutorial.html#save-cookies-and-restore-in-your-next-run) and proved to be very useful, especially with the login cookie. Unfortunately, scrapy-cookies is incompatible with Scrapy 2.8.0 or newer. The breaking point was the deprecated method scrapy.utils.python.to_native_str, which was removed in Scrapy 2.8.0 but is still used by scrapy-cookies.
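For reference, enabling persistence with the extension followed roughly the pattern below (a sketch based on the linked tutorial; the setting names come from scrapy-cookies, not Scrapy core, and may differ across versions):

```python
# settings.py — sketch per the scrapy-cookies tutorial (setting names are
# from that extension and are an assumption for current versions)
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in cookies middleware and use the extension's one
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy_cookies.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}
COOKIES_PERSISTENCE = True          # save cookies on close, restore on open
COOKIES_PERSISTENCE_DIR = 'cookies' # where the persisted cookies are stored
```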