Improve caching
- Adds caching to all Parse* functions, using a unified cache (so the
  functions can be intermixed, and the cache size applies to everything).
- Increases the default cache size: 20 seems very small, and 200 doesn't
  seem too big (though the value is completely arbitrary). Playing around
  with random samplings (using non-linear distributions) didn't change
  the picture much either way.
- Updates the cache replacement policy to a FIFO-ish policy.

Unified Cache
=============

Split each parser into a caching frontend and a "raw" backend (the old
behaviour), so that `Parse` doesn't have to pay for multiple cache
lookups when filling an entry.

Entries are also handed out initialized, so `_lookup` *always* returns
an entry, possibly a partial one. The caller then checks whether the
key it handles (for the specialised parsers) or all the sub-keys (for
`Parse`) are filled, and fills in whatever is missing. This makes for
relatively straightforward code, even if it bounces around a bit.
Because the cache is unified, the functions are intermixed and benefit
from one another's activity.

Also added a `**jsParseBits` parameter to `ParseDevice`: it's probably
never used in practice, but `Parse` forwarded its own `jsParseBits`,
which would have led to a `TypeError`.
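
That failure is just standard keyword forwarding: passing a keyword
argument to a function whose signature doesn't accept it raises
`TypeError`. A minimal illustration, with hypothetical names:

```python
def parse_device(ua):  # old-style signature: no keyword parameters
    return {"family": "Other"}

def parse(ua, **js_parse_bits):
    # Forwarding the bits blows up as soon as any are actually passed.
    return parse_device(ua, **js_parse_bits)

parse("Mozilla/5.0")  # fine: nothing to forward
try:
    parse("Mozilla/5.0", js_user_agent_family="Chrome Frame")
except TypeError as exc:
    print(exc)  # unexpected keyword argument
```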

Cache Policy
============

The cache now uses a FIFO policy (similar to recent updates to the
stdlib's `re` module), made possible by `dict` being ordered since
3.6. In my limited benchmarks, which may not show much (possibly
because the workload was so artificial), LRU didn't score much better
than FIFO on hit/miss ratios, and FIFO is simpler, so for now FIFO it
is. That's easy to change anyway.
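
The trick is that `dict` preserves insertion order (an implementation
detail in CPython 3.6, a language guarantee since 3.7), so
`next(iter(cache))` is always the oldest key and popping it gives FIFO
eviction with no extra bookkeeping. A standalone sketch of the policy:

```python
MAX_CACHE_SIZE = 3  # tiny so the eviction is visible
cache = {}

def cache_entry(key):
    # On a miss, evict the single oldest entry once the cache is full,
    # instead of clearing everything.
    entry = cache.get(key)
    if entry is None:
        if len(cache) >= MAX_CACHE_SIZE:
            cache.pop(next(iter(cache)))
        entry = cache[key] = {"key": key}
    return entry

for k in ["a", "b", "c", "d"]:
    cache_entry(k)

print(list(cache))  # ['b', 'c', 'd'] -- 'a' went in first, so it went out first
```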

Anyway, the point was mostly that any cache which doesn't throw the
entire contents away when full is almost certain to be an improvement.

A related change: the cache used to be cleared only once it held
`MAX_CACHE_SIZE + 1` entries, as it was emptied

- on a cache miss
- if the cache size was strictly larger than `MAX_CACHE_SIZE`.

This meant the effective size of the cache was 21 (a fairly large
overshoot, relative to such a small cache).

This has been changed to top out at `MAX_CACHE_SIZE`.
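
For the record, a small standalone simulation of the old
clear-when-strictly-larger policy (not the module's code) reproduces
the overshoot:

```python
MAX_CACHE_SIZE = 20
cache = {}
peak = 0

for i in range(100):          # 100 distinct keys: every probe is a miss
    key = f"ua-{i}"
    if cache.get(key) is None:
        if len(cache) > MAX_CACHE_SIZE:  # strict ">": only fires at 21 entries
            cache.clear()
        cache[key] = i
    peak = max(peak, len(cache))

print(peak)  # 21, i.e. MAX_CACHE_SIZE + 1
```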

Fixes #97
masklinn committed May 1, 2022
1 parent d72fff8 commit a32a59e
Showing 3 changed files with 63 additions and 25 deletions.
15 changes: 8 additions & 7 deletions README.rst
@@ -79,13 +79,14 @@ Extract browser data from user-agent string
..
-⚠️The convenience parsers (``ParseUserAgent``, ``ParseOs``, and
-``ParseDevice``) currently have no caching, which can result in
-degraded performances when parsing large amounts of identical
-user-agents (which might occur for real-world datasets).
-
-In that case, prefer using ``Parse`` and extracting the
-sub-component you need from the resulting dictionary.
+⚠️Before 0.15, the convenience parsers (``ParseUserAgent``,
+``ParseOs``, and ``ParseDevice``) were not cached, which could
+result in degraded performance when parsing large amounts of
+identical user-agents (which might occur for real-world datasets).
+
+For these versions (up to 0.14 included), prefer using ``Parse``
+and extracting the sub-component you need from the resulting
+dictionary.

Extract OS information from user-agent string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
68 changes: 50 additions & 18 deletions ua_parser/user_agent_parser.py
@@ -215,8 +215,29 @@ def Parse(self, user_agent_string):
return device, brand, model


-MAX_CACHE_SIZE = 20
-_parse_cache = {}
+MAX_CACHE_SIZE = 200
+_PARSE_CACHE = {}


+def _lookup(ua, args):
+    key = (ua, tuple(sorted(args.items())))
+    entry = _PARSE_CACHE.get(key)
+    if entry is not None:
+        return entry
+
+    if len(_PARSE_CACHE) >= MAX_CACHE_SIZE:
+        _PARSE_CACHE.pop(next(iter(_PARSE_CACHE)))
+
+    v = _PARSE_CACHE[key] = {"string": ua}
+    return v
+
+
+def _cached(ua, args, key, fn):
+    entry = _lookup(ua, args)
+    r = entry.get(key)
+    if not r:
+        r = entry[key] = fn(ua, args)
+    return r


def Parse(user_agent_string, **jsParseBits):
@@ -227,21 +248,20 @@ def Parse(user_agent_string, **jsParseBits):
    Returns:
        A dictionary containing all parsed bits
    """
-    jsParseBits = jsParseBits or {}
-    key = (user_agent_string, repr(jsParseBits))
-    cached = _parse_cache.get(key)
-    if cached is not None:
-        return cached
-    if len(_parse_cache) > MAX_CACHE_SIZE:
-        _parse_cache.clear()
-    v = {
-        "user_agent": ParseUserAgent(user_agent_string, **jsParseBits),
-        "os": ParseOS(user_agent_string, **jsParseBits),
-        "device": ParseDevice(user_agent_string, **jsParseBits),
-        "string": user_agent_string,
-    }
-    _parse_cache[key] = v
-    return v
+    entry = _lookup(user_agent_string, jsParseBits)
+    # entry is complete, return directly
+    if len(entry) == 4:
+        return entry
+
+    # entry is partially or entirely empty
+    if "user_agent" not in entry:
+        entry["user_agent"] = _ParseUserAgent(user_agent_string, jsParseBits)
+    if "os" not in entry:
+        entry["os"] = _ParseOS(user_agent_string, jsParseBits)
+    if "device" not in entry:
+        entry["device"] = _ParseDevice(user_agent_string, jsParseBits)
+
+    return entry


def ParseUserAgent(user_agent_string, **jsParseBits):
@@ -252,6 +272,10 @@ def ParseUserAgent(user_agent_string, **jsParseBits):
    Returns:
        A dictionary containing parsed bits.
    """
+    return _cached(user_agent_string, jsParseBits, "user_agent", _ParseUserAgent)
+
+
+def _ParseUserAgent(user_agent_string, jsParseBits):
    if (
        "js_user_agent_family" in jsParseBits
        and jsParseBits["js_user_agent_family"] != ""
@@ -298,6 +322,10 @@ def ParseOS(user_agent_string, **jsParseBits):
    Returns:
        A dictionary containing parsed bits.
    """
+    return _cached(user_agent_string, jsParseBits, "os", _ParseOS)
+
+
+def _ParseOS(user_agent_string, jsParseBits):
    for osParser in OS_PARSERS:
        os, os_v1, os_v2, os_v3, os_v4 = osParser.Parse(user_agent_string)
        if os:
@@ -312,14 +340,18 @@ def ParseOS(user_agent_string, **jsParseBits):
}


-def ParseDevice(user_agent_string):
+def ParseDevice(user_agent_string, **jsParseBits):
    """Parses the user-agent string for device info.
    Args:
        user_agent_string: The full user-agent string.
        ua_family: The parsed user agent family name.
    Returns:
        A dictionary containing parsed bits.
    """
+    return _cached(user_agent_string, jsParseBits, "device", _ParseDevice)
+
+
+def _ParseDevice(user_agent_string, jsParseBits):
    for deviceParser in DEVICE_PARSERS:
        device, brand, model = deviceParser.Parse(user_agent_string)
        if device:
5 changes: 5 additions & 0 deletions ua_parser/user_agent_parser_test.py
@@ -182,6 +182,11 @@ def runUserAgentTestsFromYAML(self, file_name):
result["patch"],
),
)
+            self.assertLessEqual(
+                len(user_agent_parser._PARSE_CACHE),
+                user_agent_parser.MAX_CACHE_SIZE,
+                "verify that the cache size never exceeds the configured setting",
+            )

def runOSTestsFromYAML(self, file_name):
yamlFile = open(os.path.join(TEST_RESOURCES_DIR, file_name))
