Skip to content

CaseInsensitiveDict (deprecate CaselessDict) #5146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 21, 2023

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented May 12, 2021

I think the data model for scrapy.http.headers.Headers (and scrapy.utils.datatypes.CaselessDict) is a little bit inconsistent. For instance:

In [1]: from scrapy.http.headers import Headers

In [2]: h = Headers({"a": "aaa", "b": "bbb"})

In [3]: h
Out[3]: {b'A': b'aaa', b'B': b'bbb'}

In [4]: dict(h)
Out[4]: {b'A': [b'aaa'], b'B': [b'bbb']}

In [5]: list(h.items())
Out[5]: [(b'A', [b'aaa']), (b'B', [b'bbb'])]

In [6]: list(h.values())
Out[6]: [b'aaa', b'bbb']

This has appeared before in #1459 (not reproducible anymore on py3, but the idea remains). dict is highly optimized at the C level and it seems like it doesn't always use overriden data access methods. For instance, MutableMapping implementations (like itemadapter.ItemAdapter) use a combination of keys (which I believe relies on __iter__) and __getitem__ when wrapped with dict, while dict subclasses seem to bypass this:

import inspect

from itemadapter import ItemAdapter


class InspectMixin:
    def _inspect(self):
        current_frame = inspect.currentframe()
        outer_frames = inspect.getouterframes(current_frame)
        for frame in outer_frames:
            print(frame[1:-2])

    def __getitem__(self, key):
        self._inspect()
        return super().__getitem__(key)

    def keys(self):
        self._inspect()
        return super().keys()


class CustomDict(InspectMixin, dict):
    pass


class CustomItemAdapter(InspectMixin, ItemAdapter):
    pass


d = CustomDict({"a": 1})
print("CustomDict:", dict(d))

print()

d = CustomItemAdapter({"a": 1})
print("CustomItemAdapter", dict(d))
$ python dictionary.py
CustomDict: {'a': 1}

('dictionary.py', 9, '_inspect')
('dictionary.py', 18, 'keys')
('dictionary.py', 36, '<module>')
('dictionary.py', 9, '_inspect')
('dictionary.py', 14, '__getitem__')
('dictionary.py', 36, '<module>')
CustomItemAdapter {'a': 1}

This article shows a few reasons why inheriting from dict is not a good idea.

I tried to use CaseInsensitiveDict as base class for Headers, but some tests fail (because they wrap the headers objects with dict before checking the contents). Personally, I believe the Headers data model needs a little refactoring, I'd love to hear your thought about it (should I continue this PR to include changes to the Headers class, for instance?).

@codecov
Copy link

codecov bot commented May 12, 2021

Codecov Report

Merging #5146 (60b9e1f) into master (e9e1034) will increase coverage by 46.52%.
The diff coverage is 100.00%.

❗ Current head 60b9e1f differs from pull request most recent head 39282d7. Consider uploading reports for the commit 39282d7 to get more accurate results

@@             Coverage Diff             @@
##           master    #5146       +/-   ##
===========================================
+ Coverage   42.36%   88.88%   +46.52%     
===========================================
  Files         162      162               
  Lines       11206    11243       +37     
  Branches     1825     1826        +1     
===========================================
+ Hits         4747     9993     +5246     
+ Misses       6103      965     -5138     
+ Partials      356      285       -71     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/httpcache.py 93.97% <ø> (+60.24%) ⬆️
scrapy/settings/default_settings.py 98.77% <ø> (ø)
scrapy/downloadermiddlewares/decompression.py 100.00% <100.00%> (+67.24%) ⬆️
scrapy/http/headers.py 98.52% <100.00%> (+23.52%) ⬆️
scrapy/pipelines/files.py 72.00% <100.00%> (+35.00%) ⬆️
scrapy/utils/datatypes.py 100.00% <100.00%> (+18.18%) ⬆️
scrapy/utils/gz.py 100.00% <100.00%> (+75.00%) ⬆️
scrapy/utils/python.py 88.75% <100.00%> (+46.87%) ⬆️

... and 134 files with indirect coverage changes

Comment on lines 83 to 86
return CaseInsensitiveDict(
(to_unicode(key, encoding=self.encoding), to_unicode(b','.join(value), encoding=self.encoding))
for key, value in self.items()
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should be safe I think, CaseInsensitiveDict's interface is the same as CaselessDict. The method is only used here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s still a potential issue for isinstance usages. I’m not saying I’m against this change, but I am unsure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should address the isinstance check:

diff --git scrapy/utils/datatypes.py scrapy/utils/datatypes.py
index 6eeabe1e..63c07b48 100644
--- scrapy/utils/datatypes.py
+++ scrapy/utils/datatypes.py
@@ -21,7 +21,7 @@ class CaselessDict(dict):
     def __new__(cls, *args, **kwargs):
         from scrapy.http.headers import Headers
 
-        if issubclass(cls, CaselessDict) and not issubclass(cls, Headers):
+        if issubclass(cls, CaselessDict) and not issubclass(cls, (CaseInsensitiveDict, Headers)):
             warnings.warn(
                 "scrapy.utils.datatypes.CaselessDict is deprecated,"
                 " please use scrapy.utils.datatypes.CaseInsensitiveDict instead",
@@ -79,7 +79,7 @@ class CaselessDict(dict):
         return dict.pop(self, self.normkey(key), *args)
 
 
-class CaseInsensitiveDict(collections.UserDict):
+class CaseInsensitiveDict(collections.UserDict, CaselessDict):
     """A dict-like structure that accepts strings or bytes as keys and allows case-insensitive lookups.
 
     It also allows overriding key and value normalization by defining custom `normkey` and `normvalue` methods.

Although as we discussed this might not be a problem, mostly because CaselessDict is not documented.

def __delitem__(self, key: AnyStr) -> None:
super().__delitem__(self.normkey(key))

def __contains__(self, key: AnyStr) -> bool: # type: ignore[override]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if this goes against Liskov substitution principle, it makes sense because being case-insensitive only applies to str or bytes, not to all hashable types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's possible to inherit from some type-enabled dict

@elacuesta elacuesta closed this May 12, 2021
@elacuesta elacuesta reopened this May 12, 2021
@wRAR wRAR closed this Jul 26, 2021
@wRAR wRAR reopened this Jul 26, 2021
@Gallaecio Gallaecio self-requested a review August 24, 2021 06:40
@wRAR wRAR merged commit 1b78e48 into scrapy:master Jun 21, 2023
@wRAR
Copy link
Member

wRAR commented Jun 21, 2023

Thanks!

@elacuesta elacuesta deleted the case-insensitive-dict branch August 5, 2023 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants