Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache proxy bypass #3885

Merged
merged 1 commit into from Feb 27, 2017
Merged

Cache proxy bypass #3885

merged 1 commit into from Feb 27, 2017

Conversation

dfontenot
Copy link
Contributor

Added simple cache over urllib.proxy_bypass to remedy #2988. Holds on to entries for a minute. Once it reaches its max size, will evict the first entry that is the oldest.

Added tests over the structure, did not add a test that should_bypass_proxies invokes the cache.

This is my first pull request; please let me know if I've overlooked something.

Additionally, when I check out master, run make coverage in pipenv, the test named 'TestRequests.test_proxy_error' fails. This is still the only failing test when I use my branch. The error is ConnectionError: ('Connection aborted.', BadStatusLine("''",)). Is there some additional setup I need to do to run the test suite?

@codecov-io
Copy link

codecov-io commented Feb 22, 2017

Codecov Report

Merging #3885 into master will increase coverage by 0.24%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3885      +/-   ##
==========================================
+ Coverage   89.08%   89.32%   +0.24%     
==========================================
  Files          15       15              
  Lines        1896     1940      +44     
==========================================
+ Hits         1689     1733      +44     
  Misses        207      207
Impacted Files Coverage Δ
requests/utils.py 85.02% <100%> (+0.08%)
requests/structures.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6f4818...c121b98. Read the comment docs.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, so this is a good start! I've left some notes inline in the diff.

@@ -103,3 +104,94 @@ def __getitem__(self, key):

def get(self, key, default=None):
return self.__dict__.get(key, default)


class TimedCache(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is basically a collections.MutableMapping! We should make it actually be one, because that gives us a number of handy-dandy convenience methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update to have TimedCache inherit collections.MutableMapping

:return: the value in the cache, or None
"""

return self.get(key, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, generally speaking it is better to implement get in terms of __getitem__ than the other way around. __getitem__ should almost always maintain dict-like behaviour of raising KeyError if the entry is not present, rather than returning None. The core concern here is that it needs to be possible to distinguish between "key not present in dictionary" and "key present with a value of None".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update.


found = self._dict.get(key, default)
if not found:
return default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is wrong. Happily, it doesn't matter: if you make this a subclass of MutableMapping you don't actually need to implement this method at all, we get it for free.

return default

occurred, value = found
now = int(time.time())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is a problem.

If possible we'd like to use time.monotonic instead, as time.time can go backwards. So I think in this module you'll need to feature detect on import to see whether time.monotonic is present and to prefer it if it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. To clarify, you're saying if time.monotonic is not present, it's ok to fall back to time.time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

:param value: the value to be added to the cache
"""

now = int(time.time())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note about monotonic time.

Locates the value at lookup key,
if cache is full, will evict the first
found expired entry, otherwise will evict
the oldest non-expired item
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're wrapping this paragraph very short. Feel free to go all the way out to 89 columns!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update.

Locates the value at lookup key,
if cache is full, will evict the first
found expired entry, otherwise will evict
the oldest non-expired item
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're here, can you add a blank line separating the paragraph blurb from the :param: directives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update.

if len(self._dict) >= self.maxlen:
replacement_candidate_key = None
max_time_delta = 0
for cur_key, (occurred, _) in self._dict.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This complete search can probably be avoided by using an OrderedDict, can't it? urllib3 includes one that works on all the Python versions we support, so you can import that and use it as the base implementation to save ourselves this iterative walk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will replace self._dict with an OrderedDict.

PROXY_BYPASS_CACHE[netloc] = result
return result
else:
return cached_result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is that this be reimplemented as a generic decorator, rather than written to specifically handle proxy_bypass. That way we get a much easier time of testing the timed cache and all the relevant code. =)

@@ -74,3 +74,56 @@ def test_getitem(self, key, value):
@get_item_parameters
def test_get(self, key, value):
assert self.lookup_dict.get(key) == value


class TestTimedCache:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid making this an old-style class by putting (object) as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update.

@dfontenot
Copy link
Contributor Author

@Lukasa I see a line like this already in the code base

current_time = getattr(time, 'monotonic', time.time)

as a top level definition. I've attempted to mock those, but my attempts so far have had some code smells. I've added a new commit that attempts to mock those in the least smelly way, but it's not ideal. Can you help me iron these out so they aren't so strange?

My approaches so far:

  1. I set current_time in the class and mock it using mocker.patch.object. If I set it as a staticmethod or a classmethod, I'm not able to replace the function with a lambda w/o any args or with one anonymous arg. I'd rather not set it as an instance method of the class as that's a bit strange.
  2. I use mocker.patch and set current_time at the global level of requests.structures. The only problem then is I have to import requests.structures (or just requests.structures.current_time), and pyflakes complains about an unused import.

Would appreciate any suggestions.

@Lukasa
Copy link
Member

Lukasa commented Feb 22, 2017

Your mocking will get a lot easier if you make the _current_time callable a module-level object rather than putting it at class scope. =)

@dfontenot
Copy link
Contributor Author

I think I must have had a typo, mocker.patch with the fully-qualified package name for current_time works without doing any extra non-used imports.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this is generally looking really good: substantial progress! I've written some more notes.

"""
Evicts entries after expiration_secs.
If none are expired and maxlen is hit,
will evict the oldest cached entry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have a slightly wider docstring than this! 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

self._dict = OrderedDict()

def __repr__(self):
return '<timed-cache maxlen:%d len:%d>' % \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be TimedCache, and it should also provide the expiration_secs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

try:
return self.__getitem__(key)
except KeyError:
return default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to implement this: collections.MutableMapping will do it for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


if len(self._dict) >= self.maxlen:
for _ in range(len(self._dict) - self.maxlen + 1):
self._dict.popitem(last=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be more cleanly expressed as:

while len(self._dict) >= self.maxlen:
    self._dict.popitem(last=False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Updated.


def clear(self):
"""Clears the cache"""
return self._dict.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one we also technically get for free from MutableMapping, but in this case we can do it much faster than the generic implementation so I'm happy to have this override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. I've left the override intact.

assert fnc_stub.call_count is 1
assert len(cache) is 1

assert invocation_spy.call_count is 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, so this is a good test, but it mostly tests implementation details. For this, it's enough to just mock out the time and then write a small function that constantly returns an incrementing value, then wrap it in the decorator and use the timer to demonstrate the behaviour.

Generally speaking, avoiding mocks is a good thing wherever we can. =)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I've re-made the test more along these lines. Let me know what you think.

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, really close now! A couple of minor stylistic notes and one request to expand a test slightly and we're good to go!

:param expiration_secs: the number of seconds to hold on
to entries
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really minor nit, we don't need this blank line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


:rtype: bool
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same again, don't need this blank line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Looks for netloc in the cache,
if not found, will call proxy_bypass
for the netloc and store its result
in the cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This again can be flowed out to column 79 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

first_result = some_method(1)
assert first_result is 1
second_result = some_method(1)
assert second_result is 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also check that a different argument doesn't return a cached value.

@dfontenot
Copy link
Contributor Author

@Lukasa thanks for your help with this. As for squashing up all of the commits this PR has accumulated--who should do that? Do the maintainers of this lib do that? Or I can squash locally and push to a new PR (or just force push).

@Lukasa
Copy link
Member

Lukasa commented Feb 23, 2017

If you squash locally and force push that's fine, GitHub will work it all out. 👍

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed some extra whitespace. 😔 Let's clear all of that up and then we will be good to go. 👍

"""
Wrap a function call in a timed cache
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one.

Evicts entries after expiration_secs. If none are expired and maxlen is hit,
will evict the oldest cached entry
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one

:param key: which entry to look up
:return: the value in the cache, or None
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one

:param key: the key to search the cache for
:param value: the value to be added to the cache
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one



class TestTimedCache(object):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one



class TestTimedCacheManagedDecorator(object):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one

class TestTimedCacheManagedDecorator(object):

def test_caches_repeated_calls(self, mocker):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one.

Used to alleviate long gethostbyaddr calls
Made new TimedCache and decorator to wrap
a function with a cache
* Entries looked up older than a minute (default amount)
are evicted.
* When full, evicts the oldest entry
@dfontenot
Copy link
Contributor Author

Force pushed commit addresses the whitespace issues

Copy link
Member

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'm happy with this! As this is a big-ish change in functionality, I'd like to give @sigmavirus24 a chance to review before I merge, but if he doesn't get time I'll merge it Monday. Great work @davidfontenot!

Copy link
Contributor

@sigmavirus24 sigmavirus24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll submit these updates in a separate pull request

(self.maxlen, len(self._dict), self.expiration_secs)

def __iter__(self):
return map(lambda kv: (kv[0], kv[1][1]), self._dict.items()).__iter__()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we don't just build a generator here.

  1. map() is slow and lambda s are slow. This intentional on Guido's part.
  2. Calling __iter__ is really poor form.
  3. This would be far simpler and easier to read as
return ((key, value[1]) for key, value in self._dict.items())

return map(lambda kv: (kv[0], kv[1][1]), self._dict.items()).__iter__()

def __delitem__(self, item):
return self._dict.__delitem__(item)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not write this as:

del self._dict[item]

__delitem__ doesn't return anything so there's no need to worry about returning something here. And it would avoid calling the dunder method directly, which I'll repeat, is poor form.


if now - occurred > self.expiration_secs:
del self._dict[key]
raise KeyError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should raise a KeyError with the key as an argument, e.g., KeyError(key).

while len(self._dict) >= self.maxlen:
self._dict.popitem(last=False)

return self._dict.__setitem__(key, (now, value))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, __setitem__ doesn't return anything. This should just be

self._dict[key] = (now, value)

There's no need to obfuscate any of this.

sigmavirus24 added a commit to sigmavirus24/requests that referenced this pull request Mar 4, 2017
There were some odd decisions made about the implementation of some of
the required methods for MuttableMapping in the TimedCache object. This
cleans those up and makes the implementation, ever so slightly, easier
to read.

See also psf#3885
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants