Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent mutations by entry cache callers #3215

Merged
merged 2 commits into from Jul 1, 2022

Conversation

azdagron
Copy link
Member

@azdagron azdagron commented Jul 1, 2022

Recently, we introduced a change wherein the agents supply an output mask to the server (#3123) to reduce bandwidth usage.

This exposed a bug in the interactions between the SVID API handler and the entry cache. The cache currently returns its owned copy of the entry to callers. This was done for performance reasons.... making a copy of each entry increases memory pressure in one of the hottest codepaths in the server.

Due to this behavior however, the SVID handler, when applying the mask to remove fields from the entries before including them in the response, was inadvertently stripping off fields from entries within the cache. This was not only resulting in temporary data loss (e.g. dns names) on the entries (next cache refresh would restore the fields) but could easily become a data race, wherein entries could get mutated by multiple entities at once (since the fields are mutated concurrently without any sort of lock protection).

This change updates the cache to clone the entries before returning them to the caller. Although this results in some increase in memory pressure, it is the cleanest, and most robust approach. If the increase in memory pressure turns out to be too much, we can explore other options, though those may come with a large cost in code complexity (e.g. on-demand cloning of shared data structure). Even if we did something cute, the GetAuthorizedEntries RPC is by far the most called RPC in the agent and would need to clone anyway to apply the mask.

Fixes: #3184

Recently, we introduced a change wherein the agents supply an output
mask to the server (spiffe#3123) to reduce bandwidth usage.

This exposed a bug in the interactions between the SVID API handler and
the entry cache. The cache currently returns its owned copy of the entry
to callers. This was done for performance reasons.... making a copy of
each entry increases memory pressure in one of the hottest codepaths in
the server.

Due to this behavior however, the SVID handler, when applying the mask
to remove fields from the entries before including them in the response,
was inadvertently stripping off fields from entries within the cache.
This was not only resulting in temporary data loss (e.g. dns names) on
the entries (next cache refresh would restore the fields) but could
easily become a data race, wherein entries could get mutated by multiple
entities at once (since the fields are mutated concurrently without any
sort of lock protection).

This change updates the cache to clone the entries before returning them
to the caller. Although this results in some increase in memory
pressure, it is the cleanest, and most robust approach. If the increase
in memory pressure turns out to be too much, we can explore other
options, though those may come with a large cost in code complexity
(e.g. on-demand cloning of shared data structure). Even if we did
something cute, the GetAuthorizedEntries RPC is by far the most called
RPC in the agent and would need to clone anyway to apply the mask.

Fixes: spiffe#3184

Signed-off-by: Andrew Harding <aharding@vmware.com>
Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this, @azdagron!
LGTM.

@azdagron azdagron merged commit 7ea53c4 into spiffe:main Jul 1, 2022
@azdagron azdagron added this to the 1.3.2 milestone Jul 1, 2022
@azdagron azdagron deleted the fix-entry-cache-mutations branch July 1, 2022 20:28
stevend-uber pushed a commit to stevend-uber/spire that referenced this pull request Oct 16, 2023
Recently, we introduced a change wherein the agents supply an output
mask to the server (spiffe#3123) to reduce bandwidth usage.

This exposed a bug in the interactions between the SVID API handler and
the entry cache. The cache currently returns its owned copy of the entry
to callers. This was done for performance reasons.... making a copy of
each entry increases memory pressure in one of the hottest codepaths in
the server.

Due to this behavior however, the SVID handler, when applying the mask
to remove fields from the entries before including them in the response,
was inadvertently stripping off fields from entries within the cache.
This was not only resulting in temporary data loss (e.g. dns names) on
the entries (next cache refresh would restore the fields) but could
easily become a data race, wherein entries could get mutated by multiple
entities at once (since the fields are mutated concurrently without any
sort of lock protection).

This change updates the cache to clone the entries before returning them
to the caller. Although this results in some increase in memory
pressure, it is the cleanest, and most robust approach. If the increase
in memory pressure turns out to be too much, we can explore other
options, though those may come with a large cost in code complexity
(e.g. on-demand cloning of shared data structure). Even if we did
something cute, the GetAuthorizedEntries RPC is by far the most called
RPC in the agent and would need to clone anyway to apply the mask.

Fixes: spiffe#3184

Signed-off-by: Andrew Harding <aharding@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SpiffeID CRD dnsNames no longer populated in SAN
2 participants