
ParseUserAgent, ParseOS and ParseDevice are not using the cache #97

Closed
Greenscreen23 opened this issue Oct 31, 2021 · 8 comments · Fixed by #104

Comments

@Greenscreen23

Hi,

I was using this library for a project recently, and was really surprised that ParseUserAgent was significantly slower than Parse. After digging a bit into the code, I think this is because ParseUserAgent is not using the internal cache, whereas Parse is.

As both methods are documented in the README, I was expecting both to have similar performance, with ParseUserAgent being a bit faster because it has to parse less. Currently, both methods seem to be recommended for regular use, but ParseUserAgent seems significantly worse than Parse.

In my opinion it might be helpful to document this behavior in the README or to change it, because it is currently faster to use Parse(ua_string)['user_agent'] than ParseUserAgent.
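For illustration, a minimal sketch of the two call styles being compared here (the user agent string is just a placeholder):

```python
from ua_parser import user_agent_parser

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Goes through the library's internal cache, then picks out one key.
browser = user_agent_parser.Parse(ua)["user_agent"]

# Does less work per call, but (at the time of this issue) re-runs the
# regexes on every call because it bypasses the cache.
browser_direct = user_agent_parser.ParseUserAgent(ua)
```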

@xmo-odoo

xmo-odoo commented Apr 27, 2022

Adding a warning to the readme sounds like a good idea. The current caching behaviour is really basic (similar to the old lru cache: the cache is just nuked when full) as well as quite small (20 entries by default), so it would need to be reworked more fundamentally to use either FIFO replacement (which is what re now does) or an LRU.

A good option might be to stop caching in Parse and slap an lru_cache on the sub-functions; that would increase the cost of a cache hit from Parse but would avoid duplicating entries between Parse and the sub-functions.

It'd also allow e.g. Parse and ParseUserAgent to feed each other's cache, though I don't know how common a need it is to use both Parse and the sub-functions in the same codebase.
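Roughly, that idea could look like the following sketch, shown here as wrappers around the existing functions rather than the actual library change (the wrapper names and cache size are arbitrary):

```python
from functools import lru_cache

from ua_parser import user_agent_parser

# Each sub-function gets its own LRU cache; the combined parse is not cached.
cached_user_agent = lru_cache(maxsize=200)(user_agent_parser.ParseUserAgent)
cached_os = lru_cache(maxsize=200)(user_agent_parser.ParseOS)
cached_device = lru_cache(maxsize=200)(user_agent_parser.ParseDevice)


def cached_parse(ua_string):
    # A cache hit now costs three lookups instead of one, but full parses
    # and the sub-functions share (and warm) the same caches.
    return {
        "string": ua_string,
        "user_agent": cached_user_agent(ua_string),
        "os": cached_os(ua_string),
        "device": cached_device(ua_string),
    }
```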

An interesting alternative would be to keep a bespoke cache (though probably with a better replacement policy than the current one) and allow partially filled entries: e.g. ParseUserAgent would look for an entry in the global cache, check whether the user_agent key is set and set it if not, while Parse would look for the same entry, check whether all keys are set, and only fill in the ones missing (if any).

It would make Parse a fair bit more complicated and increase the cost of a cache hit slightly, as Parse would need to check for the presence of 3 keys, but it seems like it could behave nicely (and unlike lru_cache it wouldn't require dropping 2.7; dropping 2.7 and 3.6 might be a good plan regardless, but according to pypistats.org they still represent about 4 and 15% of usage respectively, which... seems a lot).
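As a sketch of that partial-entry scheme (eviction left out for brevity; the `_cache`, `_lookup` and `cached_*` names are purely illustrative, delegating to the currently uncached library functions):

```python
from ua_parser import user_agent_parser

_cache = {}  # shared by the full parse and all the sub-function wrappers


def _lookup(ua_string):
    # Every caller gets the same (possibly partially filled) entry;
    # eviction is omitted here to keep the sketch short.
    entry = _cache.get(ua_string)
    if entry is None:
        entry = _cache[ua_string] = {}
    return entry


def cached_user_agent(ua_string):
    entry = _lookup(ua_string)
    if "user_agent" not in entry:
        entry["user_agent"] = user_agent_parser.ParseUserAgent(ua_string)
    return entry["user_agent"]


def cached_parse(ua_string):
    entry = _lookup(ua_string)
    # Only fill whichever of the three sub-keys are still missing.
    if "user_agent" not in entry:
        entry["user_agent"] = user_agent_parser.ParseUserAgent(ua_string)
    if "os" not in entry:
        entry["os"] = user_agent_parser.ParseOS(ua_string)
    if "device" not in entry:
        entry["device"] = user_agent_parser.ParseDevice(ua_string)
    return dict(entry, string=ua_string)
```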

@masklinn
Contributor

masklinn commented Apr 27, 2022

@Greenscreen23 could you explain how you were using things? Did you explicitly bump up the cache size when using the library? With how small the cache is and the rough replacement policy, I wouldn't have expected "significant" performance degradation.

In the meantime I've opened #104 to at least add a note to the readme.

masklinn added a commit to masklinn/uap-python that referenced this issue Apr 27, 2022
An even better solution would be to cache them, but *just* caching
them seems redundant with the existing cache, so things would need to
be rethought.

Fixes ua-parser#97
@Greenscreen23
Author

Greenscreen23 commented Apr 27, 2022

@masklinn I stumbled across this when we had to parse a dataset in one of my courses at the university. We had to parse a logfile with ~300,000 user agents, most of them being identical (real world data set). Here are the times I measured with the different functions:

Using Parse(ua_string)['user_agent']:
real 0m4,845s
user 0m4,813s
sys 0m0,027s

Using ParseUserAgent(ua_string):
real 1m7,321s
user 1m7,150s
sys 0m0,076s

I also did not increase the cache size, I just imported the functions and used them.
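(For reference, the measurement boils down to a loop of roughly this shape; the file name and the one-user-agent-per-line format are assumptions:)

```python
from ua_parser import user_agent_parser

with open("user_agents.log") as f:
    for line in f:
        user_agent_parser.ParseUserAgent(line.strip())
        # vs. the cached path:
        # user_agent_parser.Parse(line.strip())["user_agent"]
```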

Thanks for opening the pr :)

@masklinn
Contributor

> @masklinn I stumbled across this when we had to parse a dataset in one of my courses at the university. We had to parse a logfile with ~300,000 user agents, most of them being identical (real world data set). Here are the times I measured with the different functions:
>
> Using Parse(ua_string)['user_agent']: real 0m4,845s user 0m4,813s sys 0m0,027s
>
> Using ParseUserAgent(ua_string): real 1m7,321s user 1m7,150s sys 0m0,076s
>
> I also did not increase the cache size, I just imported the functions and used them.
>
> Thanks for opening the pr :)

Cool, yeah, that is significant indeed. Is the dataset public by any chance? Is it some sort of log file collected over some period of time? It might be interesting both as a way to build up plain benchmarks, and as a way to investigate changes to caching on something of a real-world dataset.

@masklinn
Contributor

And thinking about it more, it does make a lot of sense: if the logs are "raw" and each page hit from a given visitor is an entry, identical entries would tend to cluster significantly, meaning even the small cache and simplistic eviction strategy would be helpful. Even more so if you decided to just sort the UAs before processing them.

@Greenscreen23
Author

It's not public as far as I know and I don't think it will be published. It looks like it was collected on demo servers to include as many interesting user agents as possible, but there are still quite a lot of duplicates and "real world traffic".

@masklinn
Contributor

masklinn commented Apr 27, 2022

> It's not public as far as I know and I don't think it will be published. It looks like it was collected on demo servers to include as many interesting user agents as possible, but there are still quite a lot of duplicates and "real world traffic".

Got it, thanks nonetheless. I guess I'll just add a note for now so unfortunate souls get the information in the meantime, and try to add caching before we cut 0.15, because it's a lot more significant than I would have initially expected. I'll look around for a test dataset to bench with; hopefully one of the sibling projects has one somewhere.

masklinn added a commit to masklinn/uap-python that referenced this issue Apr 27, 2022
This is an intermediate stopgap before modifying caching so it works
with and across all the functions, as the issue does seem pretty
significant (cf ua-parser#97)
masklinn added a commit that referenced this issue Apr 27, 2022
This is an intermediate stopgap before modifying caching so it works
with and across all the functions, as the issue does seem pretty
significant (cf #97)
masklinn reopened this Apr 27, 2022
@masklinn
Contributor

masklinn commented Apr 27, 2022

TIL GitHub picks up "fixes" stanzas in PR descriptions as well as commit messages? That's weird. Anyway, as noted above, this was not intended to be closed; I want to implement better caching.

masklinn added a commit to masklinn/uap-python that referenced this issue May 1, 2022
- Adds caching to all Parse* functions, using a unified cache (so the
  functions can be intermixed, and the cache size applies to everything).
- Increases the default cache size: 20 seems very small, 200 doesn't
  seem too big (though it's completely arbitrary). Tried playing
  around with random samplings (using non-linear distributions) and
  it didn't exactly change anything, but...
- Updates the cache replacement policy to a FIFO-ish one.

Unified Cache
=============

Split the parsers between a caching frontend and a "raw" backend (the
old behaviour) to avoid Parse having to pay for multiple cache lookups
in order to fill entries.

Also decided to have the entries be handed out initialized, so
`_lookup` *always* returns an entry, which can be partial. The caller
is to check whether the key they handle (for specialised parsers) or
all the sub-keys are filled, and fill them if necessary. This makes
for relatively straightforward code even if it bounces around a
bit. The unified cache makes it so the functions are intermixed and
benefit from one another's activity.

Also added a `**jsParseBits` parameter to `ParseDevice`: I guess that's
basically never used, but `Parse` forwarded its own `jsParseBits`,
which would have led to a `TypeError`.

Cache Policy
============

The cache now uses a FIFO policy (similar to recent updates to the
stdlib's re module) thanks to dict being ordered since 3.6. In my
limited benchmarking (which possibly didn't show much because the
workload was so artificial) LRU didn't do much better than FIFO (based
on hit/miss ratios), and FIFO is simpler, so for now FIFO it is. That's
easy to change anyway.

Anyway the point was mostly that any cache which doesn't blow the
entire thing when full is almost certain to be an improvement.

A related change is that the cache used to be blown after it had
MAX_CACHE_SIZE+1 entries, as it was cleared

- on a cache miss
- if the cache size was strictly larger than `MAX_CACHE_SIZE`.

Meaning the effective size of the cache was 21 (which is a pretty
large increment given the small cache size).

This has been changed to top out at `MAX_CACHE_SIZE`.

Fixes ua-parser#97
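For illustration, the FIFO-style replacement described in the commit message above can be sketched like this (the names and cache size are illustrative, not the library's actual internals):

```python
MAX_CACHE_SIZE = 200
_cache = {}


def _cache_entry(ua_string):
    entry = _cache.get(ua_string)
    if entry is None:
        if len(_cache) >= MAX_CACHE_SIZE:
            # Dicts preserve insertion order (guaranteed since 3.7, an
            # implementation detail in 3.6), so the first key is the oldest
            # entry; dropping it gives FIFO replacement instead of clearing
            # the whole cache on overflow.
            del _cache[next(iter(_cache))]
        entry = _cache[ua_string] = {}
    return entry
```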