
ParseUserAgent, ParseOS and ParseDevice are not using the cache #97

Closed
Greenscreen23 opened this issue Oct 31, 2021 · 8 comments · Fixed by #104

Comments

@Greenscreen23

Hi,

I was using this library for a project recently, and was really surprised that ParseUserAgent was significantly slower than Parse. After digging a bit into the code, I think this is because ParseUserAgent is not using the internal cache, whereas Parse is.

As both methods are documented in the README, I was expecting both to have similar performance, with ParseUserAgent being a bit faster because it has to parse less. Currently, both methods seem to be recommended for regular use, but ParseUserAgent seems significantly worse than Parse.

In my opinion it might be helpful to document this behavior in the README or to change it, because it is currently faster to use Parse(ua_string)['user_agent'] than ParseUserAgent.
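For illustration, a minimal sketch of the two call styles being compared here (the user agent string is just a placeholder):

```python
from ua_parser import user_agent_parser

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Goes through the library's internal cache, then picks out one key.
browser = user_agent_parser.Parse(ua)["user_agent"]

# Does less work per call, but (at the time of this issue) re-runs the
# regexes on every call because it bypasses the cache.
browser_direct = user_agent_parser.ParseUserAgent(ua)
```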

@xmo-odoo

xmo-odoo commented Apr 27, 2022

Adding a warning to the readme sounds like a good idea. The current caching behaviour is really basic (similar to the old lru cache: the cache is just nuked when full) as well as quite small (20 entries by default), so it would need to be reworked more fundamentally to use either FIFO replacement (which is what re now does) or an LRU.

A good option might be to stop caching in Parse and slap an lru_cache on the sub-functions; that would increase the cost of a cache hit from Parse but would avoid duplicating entries between Parse and the sub-functions.

It'd also allow e.g. Parse and ParseUserAgent to feed each other's cache, though I don't know how common a need it is to use both Parse and the sub-functions in the same codebase.
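Roughly, that idea could look like the following sketch, shown here as wrappers around the existing functions rather than the actual library change (the wrapper names and cache size are arbitrary):

```python
from functools import lru_cache

from ua_parser import user_agent_parser

# Each sub-function gets its own LRU cache; the combined parse is not cached.
cached_user_agent = lru_cache(maxsize=200)(user_agent_parser.ParseUserAgent)
cached_os = lru_cache(maxsize=200)(user_agent_parser.ParseOS)
cached_device = lru_cache(maxsize=200)(user_agent_parser.ParseDevice)


def cached_parse(ua_string):
    # A cache hit now costs three lookups instead of one, but full parses
    # and the sub-functions share (and warm) the same caches.
    return {
        "string": ua_string,
        "user_agent": cached_user_agent(ua_string),
        "os": cached_os(ua_string),
        "device": cached_device(ua_string),
    }
```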

An interesting alternative would be to keep a bespoke cache (though probably with a better replacement policy than the current one) and allow partially filled entries: e.g. ParseUserAgent would look for an entry in the global cache, check whether the user_agent key is set and set it if not, while Parse would look for the same entry, check whether all keys are set, and only fill in the ones missing (if any).

It would make Parse a fair bit more complicated and increase the cost of a cache hit slightly, as Parse would need to check for the presence of 3 keys, but it seems like it could behave nicely (and unlike lru_cache it wouldn't require dropping 2.7; dropping 2.7 and 3.6 might be a good plan regardless, but according to pypistats.org they still represent about 4 and 15% of usage respectively, which... seems a lot).
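As a sketch of that partial-entry scheme (eviction left out for brevity; the `_cache`, `_lookup` and `cached_*` names are purely illustrative, delegating to the currently uncached library functions):

```python
from ua_parser import user_agent_parser

_cache = {}  # shared by the full parse and all the sub-function wrappers


def _lookup(ua_string):
    # Every caller gets the same (possibly partially filled) entry;
    # eviction is omitted here to keep the sketch short.
    entry = _cache.get(ua_string)
    if entry is None:
        entry = _cache[ua_string] = {}
    return entry


def cached_user_agent(ua_string):
    entry = _lookup(ua_string)
    if "user_agent" not in entry:
        entry["user_agent"] = user_agent_parser.ParseUserAgent(ua_string)
    return entry["user_agent"]


def cached_parse(ua_string):
    entry = _lookup(ua_string)
    # Only fill whichever of the three sub-keys are still missing.
    if "user_agent" not in entry:
        entry["user_agent"] = user_agent_parser.ParseUserAgent(ua_string)
    if "os" not in entry:
        entry["os"] = user_agent_parser.ParseOS(ua_string)
    if "device" not in entry:
        entry["device"] = user_agent_parser.ParseDevice(ua_string)
    return dict(entry, string=ua_string)
```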

@masklinn
Contributor

masklinn commented Apr 27, 2022

@Greenscreen23 could you explain how you were using things? Did you explicitly bump up the cache size when using the library? With how small the cache is and the rough replacement policy, I wouldn't have expected "significant" performance degradation.

In the meantime I've opened #104 to at least add a note to the readme.

masklinn added a commit to masklinn/uap-python that referenced this issue Apr 27, 2022
An even better solution would be to cache them, but *just* caching
them seems redundant with the existing cache, so things would need to
be rethought.

Fixes ua-parser#97
@Greenscreen23
Author

Greenscreen23 commented Apr 27, 2022

@masklinn I stumbled across this when we had to parse a dataset in one of my courses at the university. We had to parse a logfile with ~300,000 user agents, most of them being identical (real world data set). Here are the times I measured with the different functions:

Using Parse(ua_string)['user_agent']:
real 0m4,845s
user 0m4,813s
sys 0m0,027s

Using ParseUserAgent(ua_string):
real 1m7,321s
user 1m7,150s
sys 0m0,076s

I also did not increase the cache size, I just imported the functions and used them.
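(For reference, the measurement boils down to a loop of roughly this shape; the file name and the one-user-agent-per-line format are assumptions:)

```python
from ua_parser import user_agent_parser

with open("user_agents.log") as f:
    for line in f:
        user_agent_parser.ParseUserAgent(line.strip())
        # vs. the cached path:
        # user_agent_parser.Parse(line.strip())["user_agent"]
```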

Thanks for opening the pr :)

@masklinn
Contributor

> @masklinn I stumbled across this when we had to parse a dataset in one of my courses at the university. We had to parse a logfile with ~300,000 user agents, most of them being identical (real world data set). Here are the times I measured with the different functions:
>
> Using Parse(ua_string)['user_agent']: real 0m4,845s user 0m4,813s sys 0m0,027s
>
> Using ParseUserAgent(ua_string): real 1m7,321s user 1m7,150s sys 0m0,076s
>
> I also did not increase the cache size, I just imported the functions and used them.
>
> Thanks for opening the pr :)

Cool, yeah, that is significant indeed. Is the dataset public by any chance? Is it some sort of log file collected over some period of time? It might be interesting both as a way to build up plain benchmarks, and as a way to investigate changes to caching on something of a real-world dataset.

@masklinn
Contributor

And thinking about it more, it does make a lot of sense: if the logs are "raw" and each page hit from a given visitor is an entry, identical entries would tend to cluster significantly, meaning even the small cache and simplistic eviction strategy would be helpful. Even more so if you decided to just sort the UAs before processing them.

@Greenscreen23
Author

It's not public as far as I know and I don't think it will be published. It looks like it was collected on demo servers to include as many interesting user agents as possible, but there are still quite a lot of duplicates and "real world traffic".

@masklinn
Contributor

masklinn commented Apr 27, 2022

> It's not public as far as I know and I don't think it will be published. It looks like it was collected on demo servers to include as many interesting user agents as possible, but there are still quite a lot of duplicates and "real world traffic".

Got it, thanks nonetheless. I guess I'll just add a note for now so unfortunate souls get the information in the meantime, and try to add caching before we cut 0.15, because it's a lot more significant than I would have initially expected. I'll look around for a test dataset to bench with; hopefully one of the sibling projects has one somewhere.

masklinn added a commit to masklinn/uap-python that referenced this issue Apr 27, 2022
This is an intermediate stopgap before modifying caching so it works
with and across all the functions, as the issue does seem pretty
significant (cf ua-parser#97)
masklinn added a commit that referenced this issue Apr 27, 2022
This is an intermediate stopgap before modifying caching so it works
with and across all the functions, as the issue does seem pretty
significant (cf #97)
masklinn reopened this Apr 27, 2022
@masklinn
Contributor

masklinn commented Apr 27, 2022

TIL GitHub picks up "fixes" stanzas in PR descriptions as well as commit messages? That's weird. Anyway, as noted above, this was not intended to be closed; I want to implement better caching.

masklinn added a commit to masklinn/uap-python that referenced this issue May 1, 2022
- Adds caching to all Parse* functions, using a unified cache (so the
  functions can be intermixed, and the cache size applies to everything).
- Increases the default cache size: 20 seems very small, 200 doesn't
  seem too big (though it's completely arbitrary). Tried playing
  around with random samplings (using non-linear distributions) and
  it didn't exactly change anything, but...
- Updates the cache replacement policy to a FIFO-ish one.

Unified Cache
=============

Split the parsers between a caching frontend and a "raw" backend (the
old behaviour) to avoid Parse having to pay for multiple cache lookups
in order to fill entries.

Also decided to have the entries be handed out initialized, so
`_lookup` *always* returns an entry, which can be partial. The caller
is to check whether the key they handle (for specialised parsers) or
all the sub-keys are filled, and fill them if necessary. This makes
for relatively straightforward code even if it bounces around a
bit. The unified cache makes it so the functions are intermixed and
benefit from one another's activity.

Also added a `**jsParseBits` parameter to `ParseDevice`: I guess that's
basically never used, but `Parse` forwarded its own `jsParseBits`,
which would have led to a `TypeError`.

Cache Policy
============

The cache now uses a FIFO policy (similar to recent updates to the
stdlib's re module) thanks to dict being ordered since 3.6. In my
limited benchmarking (which possibly didn't show much because the
workload was so artificial) LRU didn't do much better than FIFO (based
on hit/miss ratios), and FIFO is simpler, so for now FIFO it is. That's
easy to change anyway.

Anyway the point was mostly that any cache which doesn't blow the
entire thing when full is almost certain to be an improvement.

A related change is that the cache used to be blown after it had
MAX_CACHE_SIZE+1 entries, as it was cleared

- on a cache miss
- if the cache size was strictly larger than `MAX_CACHE_SIZE`.

Meaning the effective size of the cache was 21 (which is a pretty
large increment given the small cache size).

This has been changed to top out at `MAX_CACHE_SIZE`.

Fixes ua-parser#97
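For illustration, the FIFO-style replacement described in the commit message above can be sketched like this (the names and cache size are illustrative, not the library's actual internals):

```python
MAX_CACHE_SIZE = 200
_cache = {}


def _cache_entry(ua_string):
    entry = _cache.get(ua_string)
    if entry is None:
        if len(_cache) >= MAX_CACHE_SIZE:
            # Dicts preserve insertion order (guaranteed since 3.7, an
            # implementation detail in 3.6), so the first key is the oldest
            # entry; dropping it gives FIFO replacement instead of clearing
            # the whole cache on overflow.
            del _cache[next(iter(_cache))]
        entry = _cache[ua_string] = {}
    return entry
```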