Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Catalog corruption issue when running CLI in parallel #472

Merged
merged 14 commits into from
Sep 15, 2023

Conversation

anujc25
Copy link
Contributor

@anujc25 anujc25 commented Aug 24, 2023

What this PR does / why we need it

Which issue(s) this PR fixes

Fixes #471

Describe testing done for PR

  • Wrote a bash script to manually verify the fix
#!/bin/bash

TANZU=tanzu
$TANZU plugin install --group vmware-tkg/default
for i in {1..10}
do
    for x in {1..50}
    do
        $TANZU version &
        echo "running: $x $!"        
    done
    echo "waiting"
    wait
    echo "done"    
done

$TANZU plugin list
  • Also added E2E tests to run tanzu commands in parallel when 2 telemetry plugins are present

  • Also added E2E tests to install multiple Plugins in parallel

  • User can verify the E2E test by running tests with latest tanzu binary with the fix and the old `tanzu binary without the fix as below:

cd test/e2e/

export TANZU_CLI_E2E_TEST_BINARY_PATH=tanzu-old ## Without the fix
make e2e-catalog-tests. ## This should fail


export TANZU_CLI_E2E_TEST_BINARY_PATH=tanzu-new ## With the fix
make e2e-catalog-tests ## This should be successful

Release note

Fix the Catalog corruption issue (missing installed plugins) when running CLI in parallel

Additional information

Special notes for your reviewer

Copy link
Contributor

@marckhouzam marckhouzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look very nice to me.
There are, I think, some important comments to address, but this is a very clean solution.
Thanks @anujc25 !

pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/cleanup.go Outdated Show resolved Hide resolved
pkg/pluginmanager/manager.go Outdated Show resolved Hide resolved
pkg/pluginmanager/manager.go Show resolved Hide resolved
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch from 6f2c2e7 to cbfc837 Compare August 29, 2023 14:12
@vuil
Copy link
Contributor

vuil commented Aug 30, 2023

Do we have any e2e or integ tests (or doing multiple concurrent catalog reads/writes) that can help uncover the issue we are trying to fix here? Having them earlier would have help catch the issue, having them would have helped giving confidence that the issue is sufficiently addressed. Let's make sure we at least file a TODO to provide them.

pkg/catalog/catalog.go Outdated Show resolved Hide resolved
@anujc25
Copy link
Contributor Author

anujc25 commented Aug 31, 2023

Do we have any e2e or integ tests (or doing multiple concurrent catalog reads/writes) that can help uncover the issue we are trying to fix here? Having them earlier would have help catch the issue, having them would have helped giving confidence that the issue is sufficiently addressed. Let's make sure we at least file a TODO to provide them.

I am planning to add e2e tests for this. I wanted to get high-level feedback on the approach before I deep dive into writing tests for this PR and hence a draft PR. If the approach looks good, I will add some e2e tests for this change.

@vuil
Copy link
Contributor

vuil commented Aug 31, 2023

Do we have any e2e or integ tests (or doing multiple concurrent catalog reads/writes) that can help uncover the issue we are trying to fix here? Having them earlier would have help catch the issue, having them would have helped giving confidence that the issue is sufficiently addressed. Let's make sure we at least file a TODO to provide them.

I am planning to add e2e tests for this. I wanted to get high-level feedback on the approach before I deep dive into writing tests for this PR and hence a draft PR. If the approach looks good, I will add some e2e tests for this change.

sgtm. The not-so-great part is an expectation that any valid unlock function returned has to be called by the caller, but the approach is a reasonable compromise.

pkg/catalog/cleanup.go Outdated Show resolved Hide resolved
pkg/pluginmanager/manager.go Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch from e7dd402 to aa9360d Compare September 7, 2023 16:17
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch from 8292dce to 6df47bb Compare September 8, 2023 04:46
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch from 6df47bb to a99b6ca Compare September 8, 2023 05:09
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
@anujc25 anujc25 marked this pull request as ready for review September 8, 2023 05:11
@anujc25 anujc25 requested a review from a team as a code owner September 8, 2023 05:11
Copy link
Contributor

@marckhouzam marckhouzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice effort. This is complicated stuff.
I haven't looked at the E2E tests yet because more concurrency questions came to my mind.

I'm wondering, in saveCatalogCache() do we need to use the lockedFile file descriptor to write to the file? I think doing that would make unit tests proper fail if Unlock() is called too early (which we had before but you corrected in your latest changes), because a call to Upsert() after the Unlock() would not be able to write to the closed file.

pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/catalog.go Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
pkg/catalog/catalog.go Outdated Show resolved Hide resolved
If the cc.unlock is `nil` consider it as catalog has been unlocked
already and throw an meaningful error when running Upsert/Delete calls
on the unlocked catalog
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch 2 times, most recently from abfd39e to 75c2b96 Compare September 11, 2023 22:18
Copy link
Contributor

@marckhouzam marckhouzam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for this complicated and important change!

Copy link
Contributor

@vuil vuil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates and improvements after the reviews.
There is a nit a typo in the comments, but changes lgtm.

pkg/catalog/catalog.go Outdated Show resolved Hide resolved
@anujc25 anujc25 force-pushed the fix-parallel-run-catalog-corruption branch from 75c2b96 to 118ba57 Compare September 15, 2023 22:59
@anujc25 anujc25 merged commit e1eaa9a into vmware-tanzu:main Sep 15, 2023
4 checks passed
@marckhouzam marckhouzam added this to the 1.1.0 milestone Oct 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running different tanzu cli commands concurrently can corrupt the catalog cache
4 participants