Error in the CI on Github #870

Open
ununhexium opened this issue Apr 2, 2024 · 5 comments
Labels
kind/bug Something isn't working. The software does not behave as expected or specified.

Comments

@ununhexium

Problem summary

This is a summary of the attempts to make PR#842 work.

No solution was found, no root cause was found, and the problem may happen again.

This is here to document what's been attempted and resume the investigation the next time this problem happens.

The problematic PR

#842

This PR has the following symptoms:

  • works locally
  • doesn't work on GH due to a timeout

The investigation revealed that this PR fails only when a specific combination of 3 factors is met. That is, when it:

  • runs in the CI of the edc-extensions repo
  • calls client.uiApi().getCatalogPageDataOffers(TestUtils.PROTOCOL_ENDPOINT)
  • runs on GitHub

Minimum code to reproduce the error

After a few unfruitful attempts directly in the original pull request, this second PR was created to isolate and reproduce the problem.

#864

Not all the tests were run on that PR. The ones whose results are not mentioned here were run in the original PR (which I later force-pushed to clean up the history). Almost all the runs there failed. The ones that didn't fail are the ones that didn't have the problematic combination.

Here is a list of all fixing attempts:

Remove any of the 3 elements above and it's fine:

  • it works as it is on local machines
  • it works on GH if this specific network call is removed
  • it works fine when started in the CI of another repo (see next section).

Note: the faulty code "worked" once.

385f538 The one time that it worked (after doing localhost -> 127.0.0.1)

2e7bcf8 And then an empty commit to double check. But this second run showed the original error again.

This problem is therefore very reliably reproducible, but only in a very specific set of circumstances.
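
Since the one passing run followed the localhost -> 127.0.0.1 change, it may be worth logging how the runner resolves localhost. The following is a minimal sketch using only the JDK; the class name HostResolution is hypothetical, and whether name resolution plays any role here is an assumption that the investigation has not confirmed.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper, not part of the EDC code base. */
final class HostResolution {

    /** Returns one line per address that the given host name resolves to. */
    static List<String> describe(String host) throws UnknownHostException {
        List<String> lines = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            String family = address instanceof Inet6Address ? "IPv6" : "IPv4";
            lines.add(family + " " + address.getHostAddress()
                    + " loopback=" + address.isLoopbackAddress());
        }
        return lines;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Compare this output between a local machine and the GitHub runner.
        describe("localhost").forEach(System.out::println);
    }
}
```

Running this at the start of the failing test, locally and on the runner, would show whether both environments resolve localhost to the same address family.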

On a different repo

https://github.com/sovity/edc-ce-copy/pull/3/

This is the same code as the original repo.

I added 2 commits for code style and info, just to make the Gradle build part finish.

It works fine. It can fail due to port allocation, but that's a different issue (it happens at startup instead of during the request) and it has a very clear error message.

Minimum reproducible code on the copy repo

https://github.com/sovity/edc-ce-copy/pull/2

The build runs fine in that case, without the 127.0.0.1 change.
It fails later at deployment but that's a credentials issue.

https://github.com/sovity/edc-ce-copy/actions/runs/8452890706/job/23154317197?pr=2#step:8:27687

Answers to the commit questions:

  • Is CatalogApiTest the problem?

yes

  • Is the problem coming from the setup or the running of the test?

running, when calling getCatalogPageDataOffers

  • Is asset creation the problem?

no

  • Is the problem in the arrange section?

no

  • Is the problem in the act section?

yes, getCatalogPageDataOffers

  • Is the error in the createContractDefinition?

no

  • Is purely getCatalogPageDataOffers the problem?

Seems so. A test that runs with just this method crashes. A test that runs without it but everything else doesn't crash.

  • Can the client report a mistake about the endpoint?

Doesn't complain about the invalid URL

  • Does it throw any exception?

No

  • Is it a JVM system Error?

No Throwable (includes java system Errors) is thrown
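
The two answers above (no exception, no JVM Error) suggest the call hangs rather than fails. A small harness along the following lines can tell the three outcomes apart. It is only a sketch: CallProbe and Outcome are hypothetical names, and the call under test would be supplied by the caller.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: classifies how a blocking call ends. */
final class CallProbe {

    enum Outcome { COMPLETED, THREW, HUNG }

    static Outcome probe(Callable<?> call, Duration limit) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "call-probe");
            thread.setDaemon(true); // a hung call must not keep the JVM alive
            return thread;
        });
        Future<?> future = executor.submit(call);
        try {
            future.get(limit.toMillis(), TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (ExecutionException e) {
            // The cause is the original Throwable, including java.lang.Error subclasses.
            System.err.println("Call threw: " + e.getCause());
            return Outcome.THREW;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return Outcome.HUNG;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the failing test, the lambda passed to probe would wrap the getCatalogPageDataOffers call. An outcome of HUNG would be consistent with the observation that nothing is thrown.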

  • Can it find the URL on its own?

I really don't remember the result for this. Doesn't really matter.

  • Is the problem related to this specific call or to the general callin…

Only happened when calling getCatalogPageDataOffers.

  • Quest for minimum code to reproducible the error
  • Quest for minimum code to reproducible the error
  • Quest for minimum code to reproducible the error

This was to isolate the problem as much as possible.

  • Does it happen but less often?
  • Does it happen but less often? 2

Empty commits to try to make the PR fail in the copy repo.
No problem, the code worked 3 times out of 3 attempts.

The build in the original repo fails 95%+ of the time (just 1 "miracle" when localhost was changed to 127.0.0.1).

  • Does it help to give gradle a bit more memory?

No

  • Does the ServiceLocator problem still occur if the call is not made?

Yes

  • Can the problem be triggered from another test class?

Yes

  • List IP addresses
  • Try to use the real IP

The server correctly binds to the correct port and IP, checked with ss, doesn't help with the issue.
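
For reference, the IP listing can also be produced from inside the JVM, which shows what the test process itself sees rather than what the shell sees. This is a sketch with a hypothetical class name (InterfaceLister); it is not the code used in the commits above.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic helper: lists every address of every network interface. */
final class InterfaceLister {

    static List<String> listAddresses() throws SocketException {
        List<String> lines = new ArrayList<>();
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        if (interfaces == null) {
            return lines; // no interfaces could be found
        }
        for (NetworkInterface nic : Collections.list(interfaces)) {
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                lines.add(nic.getName() + " up=" + nic.isUp() + " " + address.getHostAddress());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws SocketException {
        listAddresses().forEach(System.out::println);
    }
}
```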

  • ulimit -a

The allocated system resources are more than sufficient. free -m also shows enough memory available.

  • Is the OpenApi base URL wrong???

no

Other questions:


  • Is it really just a timeout issue?

No, setting a 30s timeout on the http client doesn't help

  • Does it help to dynamically allocate ports for this failing test?

Doesn't help.
A port allocation error would happen before the call can be made.
Also the EDC shows that it got a port, and ss shows that it's correctly allocated, even without dynamic port allocation.
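
To illustrate the two checks described above, here is a minimal JDK-only sketch of dynamic port allocation and of a programmatic equivalent of the ss check. PortCheck is a hypothetical name, and the actual test setup may allocate its ports differently.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical helper for port diagnostics. */
final class PortCheck {

    /** Asks the OS for a currently free TCP port by binding to port 0. */
    static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    /** Returns true if something accepts TCP connections on host:port within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }
}
```

Note that findFreePort releases the port before returning, so another process could claim it before the server binds. That race would surface as a startup failure, not as the timeout seen here.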

Remote debugging, but in the public repo. Is there a way to do it safely?

Is there any stickiness in the running node?

  • Are the tests always running on the same nodes?
  • The same kind of nodes?
  • Does it work better if we use our own runner?

More ideas to try

As the issue seems network-related, double-check the network calls:

  • Add an interceptor in OkHttp
  • Make the same call with curl
  • tcpdump and check the calls
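
The curl idea can also be approximated from inside the test JVM with the JDK's own java.net.http.HttpClient, which bypasses OkHttp and the generated client entirely. The sketch below assumes a plain GET is enough to see whether the server answers at all. EndpointProbe is a hypothetical name, and the real request made by getCatalogPageDataOffers may use a different method, path and body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical helper: issues one GET and reports the status or failure with timing. */
final class EndpointProbe {

    static String probe(String url, Duration timeout) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout) // limit on waiting for the response
                .GET()
                .build();
        long start = System.nanoTime();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return "HTTP " + response.statusCode() + " after " + elapsedMillis(start) + " ms";
        } catch (IOException e) {
            // Covers connection refusals and java.net.http.HttpTimeoutException.
            return "FAILED after " + elapsedMillis(start) + " ms: " + e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED after " + elapsedMillis(start) + " ms";
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }
}
```

If this probe gets a response while the OkHttp-based client still hangs, the client side becomes the prime suspect. If both hang, the server side (Jersey/HK2) is more likely.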

Make the relevant parts of HK2/GlassFish log more information.

Very similar problems have been reported where the root cause was only narrowed down to Jersey, and the fix was to replace Jersey.

openhab/openhab-distro#587 (comment)

openhab/openhab-distro#587

Notes

java.lang.IllegalStateException: ServiceLocatorImpl(__HK2_Generated_X,Y,Z) has been shut down

This is an error raised by HK2.

https://javaee.github.io/hk2/

HK2 is the dependency injection framework used by Jersey.

@ununhexium ununhexium added the kind/bug Something isn't working. The software does not behave as expected or specified. label Apr 2, 2024
@ununhexium ununhexium self-assigned this Apr 2, 2024
@AbdullahMuk AbdullahMuk added the clean-backlog requires backlog cleaning label May 2, 2024
@ununhexium ununhexium removed the clean-backlog requires backlog cleaning label May 8, 2024
@ununhexium

ununhexium commented Jun 20, 2024

This PR also triggered the issue:
#970

Faulty branch tracked as reference/edc-ce-issue-870 in EDC CE

@ununhexium

100% reproducible error at

de.sovity.edc.ext.wrapper.api.ui.pages.catalog.CatalogApiTest#testDistributionKey

on reference/edc-ce-issue-870-repro

@ununhexium

Similar error but probably not related:
when failing to process a message sent over the EDC protocol:

2024-07-10 11:08:46 5.15.0 WARNING An exception mapping did not successfully produce and processed a response. Logging the exception propagated to the default exception mapper.
java.lang.IllegalStateException: ServiceLocatorImpl(__HK2_Generated_5,5,236544568) has been shut down

@ununhexium

I have the impression that adding tests triggers more of these timeouts, and that adding one more @DisabledOnGithub on the failing test then hides the issue.

@ununhexium

ununhexium commented Jul 29, 2024

Another problem, unrelated:
reference/gh-lombok-issue-missing-builder

Fails with


> Task :utils:test-utils:javadoc
/home/runner/work/edc-ce/edc-ce/utils/test-utils/src/main/java/de/sovity/edc/extension/e2e/extension/CeE2eTestExtensionConfigFactory.java:22: error: cannot find symbol
    public static E2eTestExtensionConfig.E2eTestExtensionConfigBuilder defaultBuilder() {
                                        ^
  symbol:   class E2eTestExtensionConfigBuilder
  location: class E2eTestExtensionConfig
/home/runner/work/edc-ce/edc-ce/utils/test-utils/src/main/java/de/sovity/edc/extension/e2e/extension/CeE2eTestExtensionConfigFactory.java:26: error: cannot find symbol
    public static E2eTestExtensionConfig.E2eTestExtensionConfigBuilder withModule(String module) {
                                        ^
  symbol:   class E2eTestExtensionConfigBuilder
  location: class E2eTestExtensionConfig
2 errors

on GH, but it runs fine in IntelliJ and locally.
