Improve cookie handling #5463

Open · wants to merge 22 commits into base: master
4 changes: 4 additions & 0 deletions docs/index.rst
@@ -170,6 +170,7 @@ Solving specific problems
topics/jobs
topics/coroutines
topics/asyncio
topics/storage

:doc:`faq`
Get answers to most frequently asked questions.
@@ -216,6 +217,9 @@ Solving specific problems
:doc:`topics/asyncio`
Use :mod:`asyncio` and :mod:`asyncio`-powered libraries.

:doc:`topics/storage`
Use the storage functionality to persist cookies across crawling sessions.

.. _extending-scrapy:

Extending Scrapy
16 changes: 16 additions & 0 deletions docs/topics/downloader-middleware.rst
@@ -219,6 +219,22 @@ The following settings can be used to configure the cookie middleware:

.. reqmeta:: cookiejar

AccessCookiesMiddleware
-----------------------

.. module:: scrapy.downloadermiddlewares.cookies
:synopsis: Access Cookies Downloader Middleware

.. class:: AccessCookiesMiddleware

Extension of the CookiesMiddleware that gives the spider access to the cookie jar of a session.

The following settings can be used to configure the AccessCookiesMiddleware:

* :setting:`COOKIES_PERSISTENCE`
* :setting:`COOKIES_PERSISTENCE_DIR`
* :setting:`COOKIES_STORAGE`
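A hypothetical settings.py fragment wiring these settings together; the middleware path and priority value below are assumptions based on this PR, not confirmed released-Scrapy values:

```python
# settings.py -- hypothetical wiring; the AccessCookiesMiddleware path and
# priority below are assumptions based on this PR, not released Scrapy values.
DOWNLOADER_MIDDLEWARES = {
    # replace the stock cookies middleware with the extended one
    "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": None,
    "scrapy.downloadermiddlewares.cookies.AccessCookiesMiddleware": 700,
}

COOKIES_PERSISTENCE = True           # persist jars between crawling sessions
COOKIES_PERSISTENCE_DIR = "cookies"  # file used by InMemoryStorage
COOKIES_STORAGE = "scrapy.storage.in_memory.InMemoryStorage"
```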

Multiple cookie sessions per spider
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

26 changes: 26 additions & 0 deletions docs/topics/spiders.rst
@@ -188,6 +188,32 @@ scrapy.Spider
:param response: the response to parse
:type response: :class:`~scrapy.http.Response`

.. method:: set_cookie_jar(cj)

Initialize the spider's cookie jar from an existing one.

:param cj: the cookie jar to use
:type cj: :class:`~scrapy.http.cookies.CookieJar`

.. method:: add_cookie(cookie)

Add a cookie to the spider's cookie jar.

.. method:: get_cookies(name, names, return_type)

:param name: name of a single cookie to fetch, default = None
:type name: str

:param names: names of the cookies to fetch, default = None
:type names: List[str]

:param return_type: container type to return when fetching multiple cookies, default = list, options are list and dict
:type return_type: type

Get a cookie by name, or all cookies whose name is in names.
If names is used, the return value is in the format given by return_type.

.. method:: clear_cookies()

Replace the spider's cookie jar with an empty one.
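The filtering behaviour documented for get_cookies can be sketched with stand-in cookie records (hypothetical namedtuples; the real jar holds http.cookiejar.Cookie objects with the same .name/.value attributes):

```python
from collections import namedtuple

# Hypothetical stand-in for a cookie record.
Cookie = namedtuple("Cookie", ["name", "value"])

jar = [
    Cookie("sessionid", "abc123"),
    Cookie("csrftoken", "xyz"),
    Cookie("theme", "dark"),
]

# get_cookies(names=[...], return_type=list) keeps only matching cookies:
selected = [c for c in jar if c.name in ("sessionid", "csrftoken")]

# get_cookies(names=[...], return_type=dict) maps names to values instead:
as_dict = {c.name: c.value for c in jar if c.name in ("sessionid", "csrftoken")}
```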

Member comment on lines +191 to +216:

This is a big API change, which should probably be discussed in #1878 before working on an implementation.

.. method:: log(message, [level, component])

Wrapper that sends a log message through the Spider's :attr:`logger`,
49 changes: 49 additions & 0 deletions docs/topics/storage.rst
@@ -0,0 +1,49 @@
.. _topics-storage:

=======
Storage
=======

The storage functionality can be used to store information locally or globally, driven by the events of opening and closing a spider. Its original purpose is to handle cookie storage across spiders, but it can be extended for other purposes.

.. _topics-base-storage:

BaseStorage
===========

BaseStorage is the interface that defines how a storage implementation should behave. Its main methods are the following:

.. method:: open_spider(spider)

This method is called upon the event of a spider being opened.

:param spider: the spider that is being opened
:type spider: :class:`~scrapy.Spider` object

.. method:: close_spider(spider)

This method is called upon the event of a spider being closed.

:param spider: the spider that is being closed
:type spider: :class:`~scrapy.Spider` object

.. _topics-in-memory-storage:

InMemoryStorage
===============

The InMemoryStorage keeps cookies in memory and allows storing them in a local file. If the COOKIES_PERSISTENCE setting is set to ``True`` in the project settings, the cookies are saved to a file and loaded from it on demand.

.. method:: open_spider(spider)

This method is called upon the event of a spider being opened. When the spider is opened, the cookies are loaded from the file, if they were saved there by a spider from a previous crawling session.

:param spider: the spider that is being opened
:type spider: :class:`~scrapy.Spider` object

.. method:: close_spider(spider)

This method is called upon the event of a spider being closed. When the spider is closed, the cookies are saved to the file in order to allow another spider to reuse those existing cookies at a later point in time.

:param spider: the spider that is being closed
:type spider: :class:`~scrapy.Spider` object
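The open/close cycle described above boils down to pickling the jar mapping on close and unpickling it on open. A self-contained sketch of that round trip (plain dicts stand in for CookieJar objects, and the path is hypothetical):

```python
import os
import pickle
import tempfile

# Hypothetical cookies file, mirroring COOKIES_PERSISTENCE_DIR.
cookies_path = os.path.join(tempfile.mkdtemp(), "cookies")

# Stand-in for the {cookiejar_key: CookieJar} mapping kept by the storage.
jars = {"session1": {"sessionid": "abc123"}}

# close_spider: persist the jars to disk.
with open(cookies_path, "wb") as f:
    pickle.dump(jars, f)

# open_spider: restore them for the next crawling session.
with open(cookies_path, "rb") as f:
    restored = pickle.load(f)
```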
54 changes: 52 additions & 2 deletions scrapy/downloadermiddlewares/cookies.py
@@ -7,12 +7,12 @@
from scrapy.http import Response
from scrapy.http.cookies import CookieJar
from scrapy.utils.httpobj import urlparse_cached
from scrapy.utils.project import get_project_settings
from scrapy.utils.python import to_unicode

from scrapy.utils.misc import load_object

logger = logging.getLogger(__name__)


_split_domain = TLDExtract(include_psl_private_domains=True)


@@ -136,3 +136,53 @@ def _get_request_cookies(self, jar, request):
formatted = filter(None, (self._format_cookie(c, request) for c in cookies))
response = Response(request.url, headers={"Set-Cookie": formatted})
return jar.make_cookies(response, request)


class AccessCookiesMiddleware(CookiesMiddleware):
def __init__(self, debug=False):
self.settings = get_project_settings()
self.jars = load_object(self.settings["COOKIES_STORAGE"]).from_middleware(self)
self.debug = debug

def spider_opened(self, spider):
"""
Whenever a spider is opened, load its cookies from the storage.
"""
self.jars.open_spider(spider)

def spider_closed(self, spider):
"""
Whenever a spider is closed, persist its cookies to the storage.
"""
self.jars.close_spider(spider)

def process_request(self, request, spider):
if request.meta.get('dont_merge_cookies', False):
# Create a clean CookieJar to add the cookies
jar = CookieJar()
else:
cookiejarkey = request.meta.get("cookiejar")
jar = self.jars[cookiejarkey]
cookies = self._get_request_cookies(jar, request)
self._process_cookies(cookies, jar=jar, request=request)

# set Cookie header
request.headers.pop('Cookie', None)
jar.add_cookie_header(request)

def process_response(self, request, response, spider):
if request.meta.get('dont_merge_cookies', False):
# Create a clean CookieJar to add the cookies
jar = CookieJar()
else:
# extract cookies from Set-Cookie and drop invalid/expired cookies
cookiejarkey = request.meta.get("cookiejar")
jar = self.jars[cookiejarkey]
cookies = jar.make_cookies(response, request)
self._process_cookies(cookies, jar=jar, request=request)

self._debug_set_cookie(response, spider)

spider.set_cookie_jar(jar)

return response
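The cookiejar meta key above selects which jar a request uses, and unknown keys transparently get a fresh jar. That keyed-jar behaviour can be sketched with stdlib stand-ins (a defaultdict here, where the real middleware delegates to the configured storage object):

```python
from collections import defaultdict
from http.cookiejar import CookieJar  # stdlib stand-in for scrapy's CookieJar

# Sketch: jars keyed by request.meta["cookiejar"].
jars = defaultdict(CookieJar)

meta = {"cookiejar": "session-1"}
jar = jars[meta.get("cookiejar")]  # first access creates an empty jar

# Later requests carrying the same key share that jar.
same = jars["session-1"]
```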
29 changes: 29 additions & 0 deletions scrapy/http/cookies.py
@@ -86,6 +86,35 @@ def set_cookie(self, cookie):
def set_cookie_if_ok(self, cookie, request):
self.jar.set_cookie_if_ok(cookie, WrappedRequest(request))

def get_cookie(self, name):
cookie_dict = self.dict_from_cookiejar()
return cookie_dict.get(name)

def dict_from_cookiejar(self):
"""Returns a key/value dictionary from a CookieJar.

:rtype: dict
"""

cookie_dict = {}

for cookie in self.jar:
cookie_dict[cookie.name] = cookie.value

return cookie_dict

def list_from_cookiejar(self):
"""Returns a list of all cookies in the CookieJar.

:rtype: list
"""
return list(self.jar)


def potential_domain_matches(domain):
"""Potential domain matches for a cookie
5 changes: 5 additions & 0 deletions scrapy/settings/default_settings.py
@@ -47,6 +47,11 @@
COOKIES_ENABLED = True
COOKIES_DEBUG = False

COOKIES_PERSISTENCE = False
COOKIES_PERSISTENCE_DIR = "cookies"

COOKIES_STORAGE = "scrapy.storage.in_memory.InMemoryStorage"

DEFAULT_ITEM_CLASS = 'scrapy.item.Item'

DEFAULT_REQUEST_HEADERS = {
28 changes: 27 additions & 1 deletion scrapy/spiders/__init__.py
@@ -4,10 +4,11 @@
See documentation in docs/topics/spiders.rst
"""
import logging
from typing import Optional
from typing import Optional, List

from scrapy import signals
from scrapy.http import Request
from scrapy.http.cookies import CookieJar
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import url_is_from_spider

@@ -19,6 +20,7 @@ class Spider(object_ref):

name: Optional[str] = None
custom_settings: Optional[dict] = None
_cookie_jar: Optional[CookieJar] = None

def __init__(self, name=None, **kwargs):
if name is not None:
@@ -88,6 +90,30 @@ def __str__(self):

__repr__ = __str__

def set_cookie_jar(self, cj: CookieJar):
self._cookie_jar = cj

def add_cookie(self, cookie):
self._cookie_jar.set_cookie(cookie)

# TODO Maybe simplify
def get_cookies(self, name: str = None, names: List[str] = None, return_type=list):
if name is not None:
return self._cookie_jar.get_cookie(name)
if return_type is list:
cookies_list = self._cookie_jar.list_from_cookiejar()
if names is not None:
return [cookie for cookie in cookies_list if cookie.name in names]
return cookies_list
# return_type is dict: build a name -> value mapping
cookies_dict = self._cookie_jar.dict_from_cookiejar()
if names is not None:
return {n: cookies_dict.get(n) for n in names}
return cookies_dict

def clear_cookies(self):
self._cookie_jar = CookieJar()


# Top-level imports
from scrapy.spiders.crawl import CrawlSpider, Rule
36 changes: 36 additions & 0 deletions scrapy/storage/__init__.py
@@ -0,0 +1,36 @@
from collections.abc import MutableMapping

from scrapy.spiders import Spider


class BaseStorage(MutableMapping):
name = None

def __init__(self, settings):
self.settings = settings

@classmethod
def from_middleware(cls, middleware):
obj = cls(middleware.settings)
return obj

def open_spider(self, spider: Spider):
pass

def close_spider(self, spider: Spider):
pass

def __delitem__(self, v):
pass

def __getitem__(self, k):
pass

def __iter__(self):
pass

def __len__(self):
pass

def __setitem__(self, k, v):
pass
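A custom storage plugged in via COOKIES_STORAGE would need to satisfy this MutableMapping-plus-lifecycle contract. A hypothetical minimal implementation backed by a plain dict, just to show the required surface:

```python
from collections.abc import MutableMapping


class DictStorage(MutableMapping):
    """Hypothetical storage: a plain dict plus the lifecycle hooks."""

    def __init__(self, settings=None):
        self.settings = settings
        self._data = {}

    def open_spider(self, spider):
        pass  # nothing to load

    def close_spider(self, spider):
        pass  # nothing to persist

    def __getitem__(self, k):
        return self._data[k]

    def __setitem__(self, k, v):
        self._data[k] = v

    def __delitem__(self, k):
        del self._data[k]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


storage = DictStorage()
storage["session-1"] = {"sessionid": "abc123"}
```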
37 changes: 37 additions & 0 deletions scrapy/storage/in_memory.py
@@ -0,0 +1,37 @@
import io
import logging
import os
import pickle
from collections import UserDict
from typing import Dict

from scrapy.http.cookies import CookieJar
from scrapy.spiders import Spider
from scrapy.storage import BaseStorage
from scrapy.utils.project import data_path

logger = logging.getLogger(__name__)


class InMemoryStorage(UserDict, BaseStorage):
def __init__(self, settings):
super(InMemoryStorage, self).__init__()
self.settings = settings
self.cookies_dir = data_path(settings["COOKIES_PERSISTENCE_DIR"])

def open_spider(self, spider):
if not self.settings["COOKIES_PERSISTENCE"]:
return
if not os.path.exists(self.cookies_dir):
return
with io.open(self.cookies_dir, "rb") as f:
self.data: Dict = pickle.load(f)

def close_spider(self, spider):
if self.settings["COOKIES_PERSISTENCE"]:
with io.open(self.cookies_dir, "wb") as f:
pickle.dump(self.data, f)

def __missing__(self, key) -> CookieJar:
self.data.update({key: CookieJar()})
return self.data[key]
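The __missing__ hook above means an unknown cookiejar key silently receives a fresh jar instead of raising KeyError. A minimal stand-in (a set replaces CookieJar) shows the semantics UserDict provides:

```python
from collections import UserDict


class JarStore(UserDict):
    # Mirrors InMemoryStorage.__missing__: unknown keys get a fresh,
    # empty container instead of raising KeyError.
    def __missing__(self, key):
        self.data[key] = set()  # stand-in for CookieJar()
        return self.data[key]


store = JarStore()
jar = store["new-session"]  # no KeyError; an empty jar is created and kept
```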