Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for race condition loading management.cattle.io.cluster schema #5319

Merged
merged 2 commits into from
Mar 9, 2022

Conversation

nwmac
Copy link
Member

@nwmac nwmac commented Mar 8, 2022

Fixes #5313
Fixes #4967

This PR fixes the issue reported in both the above issues - there is a rare race condition that leads to the fail-whale page being displayed:

image

Tracing through the code, the problem is that middleware/authenticated.js on line 265, we dispatch calls to load the management and cluster stores in parallel. The management store will load all schemas (see store/index.js line 538). The cluster store assumes that the schema for the management cluster is avaialable (see store.js line 677).

There is a race condition where if the schemas are not loaded, the call to load the cluster will fail and you'll get bumped to the fail-whale page.

This PR fixes this by adding a wait into the loadCluster code - we don't want to load the schemas in multiple places - we assume that the schemas are in the process of being loaded and wait up to 15 seconds for them - after 10 seconds we log a message to the browser console and after 15 if we still don't have the schemas, we request them synchronously (this should not be needed, but is a fail-safe).

This is hard to test, but from the description above you should be able to understand how this can happen.

Developers can test by:

Commenting out line 673 in store/index.js - this essentially removes the fix in this PR. The line to comment out is:

    await waitForMgmtClusterSchemaAvailable({ getters, dispatch });

Now, change line 538 in store/index.js from:

      mgmtSchemas:    dispatch('management/loadSchemas', true),

to:

      mgmtSchemas:    (new Promise(resolve => setTimeout(resolve, 5000))).then(() => dispatch('management/loadSchemas', true)),

This will add a 5 second delay to the loading of schemas.

In the UI, refresh to the page for the local cluster, i.e. https://127.0.0.1:8005/c/local/explorer

You should see the fail-whale page.

Now uncomment line 673 in store/index.js to put back in the fix from this PR and try again and everything should load correctly.

@nwmac nwmac requested a review from richard-cox March 8, 2022 08:31
@nwmac nwmac self-assigned this Mar 8, 2022
@nwmac nwmac added this to the v2.6.4 milestone Mar 8, 2022
Copy link
Member

@richard-cox richard-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew this rang a bell, I hit something similar before and fixed it via store/index.js ln 708

    const fetchProjects = async() => {
      let limit = 30000;
      const sleep = 100;

      while ( limit > 0 && !state.managementReady ) {
        await setTimeout(() => {}, sleep);
        limit -= sleep;
      }

      if ( getters['management/schemaFor'](MANAGEMENT.PROJECT) ) {
        return dispatch('management/findAll', projectArgs);
      }
    };

(#4650)

The state.managementReady flag is set once loadManegement completes.

I don't think we should have two approaches doing roughly the same thing in the same function. We could do one of....

  • change fetchProjects to use a more generic waitForMgmtClusterSchemaAvailable,
  • waitForMgmtClusterSchemaAvailable could block on managementReady and the fetchProjects removed (projects could just be fetched)

Pluses and minuses to both, am I'm happy to implement either?

@nwmac
Copy link
Member Author

nwmac commented Mar 8, 2022

@richard-cox I think we should add waitForSchema to the store then - it can take the type to wait for and maybe an optional timeout (or we just choose 30 seconds, say).

It could return true or false to indicate if the schema was available - in the case of the fix in this PR, the code could then load all of the schemas if they were still unavailable at the end of the wait.

@nwmac
Copy link
Member Author

nwmac commented Mar 8, 2022

@richard-cox can you do that refactor?

- This was going to apply to `fetchProjects` as well, however I realised
  it covers a different use case (the schema may not exist and that's fine)
- The generic `<store>/waitForSchema` action might come in handy for other
  products (harvesters `loadVirtual` might have a similar issue,
  i think Epinio also).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants