Conversation

@sreuland (Contributor)

Remove quickstart as a base image, along with all other references to it. This also means removing the TARGET_NETWORK option for running the standalone network locally in the system-test image; instead, tests must now provide TARGET_NETWORK_RPC_URL with a remote RPC URL.

Add a callable workflow for system-test build/run.

Closes #92
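For illustration only, a hypothetical invocation against a remote RPC endpoint; the image tag, URL, and the exact way the image consumes the variable are placeholders, not an interface defined by this PR:

    $ docker run --rm \
        -e TARGET_NETWORK_RPC_URL="https://rpc.example.org" \
        stellar/system-test:dev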

@sreuland marked this pull request as ready for review on October 15, 2025 at 17:38.
Review thread on the workflow step that tags and pushes the system-test image to the cache registry:

      if: needs.prepare-config.outputs.registry_allowed == 'true'
      run: |
        docker tag $SYSTEM_TEST_IMAGE ${{ needs.prepare-config.outputs.system-test-image-cache-registry }}/$SYSTEM_TEST_IMAGE
        docker push ${{ needs.prepare-config.outputs.system-test-image-cache-registry }}/$SYSTEM_TEST_IMAGE
Member

Do we get enough benefit from shipping the system-test image to Docker Hub? I see there's also code in here for using artifacts. Would using the GitHub cache suffice instead of Docker Hub? That's what we do in quickstart.

Contributor Author (@sreuland)

The cache doesn't work on forked repos, same as ghcr.io. The public Docker Hub lets forked and non-forked PRs read the same cache, but only the non-forked PRs can push to it, since they have the GitHub token.
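A rough sketch of that pattern, with illustrative step names, image tags, and fork check (not the exact steps in this PR): every PR can pull the public cache image, but only runs holding repository credentials push an update back.

    - name: Pull shared cache image (works for forked and non-forked PRs)
      run: docker pull docker.io/stellar/system-test:cache || true

    - name: Push refreshed cache image (only runs that have registry credentials)
      if: github.event.pull_request.head.repo.fork == false
      run: |
        docker tag stellar/system-test:dev docker.io/stellar/system-test:cache
        docker push docker.io/stellar/system-test:cache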

Member @leighmcculloch (Oct 17, 2025)

See below. Forks can access the cache. It's important to note that GitHub is pretty protective of the cache, in a good way: new PRs only access cache entries from their own branch and from the base branch (e.g. main), so cache entries from one PR don't affect other PRs.

> Additionally, forks of a repository can create pull requests on the base branch and access caches on the base branch.

https://docs.github.com/en/actions/reference/workflows-and-actions/dependency-caching

Another thing to note: when the quickstart workflow runs in the system-test repo, the caches it creates are stored inside the system-test repo.

These limitations are all good things, though. Sharing caches across PRs and across repos is a recipe for cache poisoning, exploits, and surprising behavior.

Contributor Author (@sreuland)

OK, I need to understand the 10 GB cache quota per repo. In the case of nested callable workflows, is it enforced on the original initiating repo or on the nearest calling repo? The chain here is stellar-rpc -> system-test workflow -> quickstart workflow.

If the quota is enforced on a single repo in this case, and both the quickstart and now the system-test flows are storing images in the cache, it seems like we may hit the 10 GB quota quickly, and/or evictions will negate some of the gains. The system-test image is about 3 GB.

I will try pushing up a change for caching and see how it runs, since the advantages mentioned and the simplicity are nice.

Member

The cache is stored in the calling repo, not the called repo.

For example, look at the cache on the stellar/stellar-rpc repo; it has images there created by the quickstart workflow: https://github.com/stellar/stellar-rpc/actions/caches

This means the first use of the same images will result in a rebuild, but that's not a big deal imo. If it becomes a problem, there are other, safer things we can do with Docker Hub than pushing everything to it. For example, we could push all the intermediary images to Docker Hub from the quickstart main branch only, and then pull those for layer reuse. That would be a bit safer than using Docker Hub to reuse the images by name.
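A sketch of that alternative, assuming a registry-type layer cache and illustrative image names; this is not what the PR ended up implementing:

    # On the default branch only: publish the layer cache to Docker Hub.
    - name: Build and export layer cache
      if: github.ref == 'refs/heads/main'   # adjust to the repo's default branch
      run: |
        docker buildx build . -t stellar/quickstart:dev \
          --cache-to=type=registry,ref=docker.io/stellar/quickstart:buildcache,mode=max

    # On pull requests (forked or not): reuse those published layers read-only.
    - name: Build reusing published layers
      run: |
        docker buildx build . -t stellar/quickstart:dev \
          --cache-from=type=registry,ref=docker.io/stellar/quickstart:buildcache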

Contributor Author (@sreuland)

I can compress the docker tar file before caching it to GHA, which helps:

$ docker save stellar/system-test:cache_b54d49fa4159ec38 | zstd -19 -T0 -o myimage.tar.zstd
/*stdin*\            : 18.81%   (  3.08 GiB =>    593 MiB, myimage.tar.zstd)   
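As a sketch of the full round trip with GHA caching (paths, keys, and tags here are illustrative), the compressed tarball would be restored and loaded before the build, then saved again afterwards:

    - uses: actions/cache@v4
      with:
        path: /tmp/system-test-image.tar.zstd
        key: system-test-image-${{ hashFiles('Dockerfile') }}

    - name: Load cached image if present
      run: |
        if [ -f /tmp/system-test-image.tar.zstd ]; then
          zstd -d --stdout /tmp/system-test-image.tar.zstd | docker load
        fi

    - name: Save compressed image for the cache
      run: docker save stellar/system-test:dev | zstd -19 -T0 -o /tmp/system-test-image.tar.zstd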

The cache capacity needed depends on how many relevant cache versions are present; for the system-test workflow the cache is keyed on permutations of the system-test, stellar-cli, and js-stellar-sdk refs. Usage from stellar-rpc wouldn't change those much, as most PRs don't change e2e.yml.

> we could push all the intermediary images to Docker Hub from the quickstart main branch only, and then pull those for layer reuse. That would be a bit safer than using Docker Hub to reuse the images by name.

I think I could do docker buildx with --cache-from=type=gha and --cache-to=type=gha right here in the workflow, without any explicit cache-key checks: just run the system-test image build every time, and the duration will depend on how many layers it finds in the cache. This approach would achieve the same obfuscation without pulling conflating changes into quickstart?

I'll try this out, starting with compression as well (--cache-to=type=gha,mode=max,compression=zstd), and hopefully see a reduction similar to the one I noted above. Later, if we start maxing out the cache quota, we could revisit this and consider type=registry,ref=docker.io/stellar/system-test:cache.
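A sketch of the build invocation being described (the image tag is illustrative):

    $ docker buildx build . -t stellar/system-test:dev --load \
        --cache-from=type=gha \
        --cache-to=type=gha,mode=max,compression=zstd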

Member @leighmcculloch (Oct 29, 2025)

> I think I could do docker buildx with --cache-from=type=gha and --cache-to=type=gha right here in the workflow

I tried using the GHA cache on the quickstart repo itself in stellar/quickstart#815, but it's not a good fit for that repo's build process. It resulted in inefficient cache usage that wasn't only a size problem but also slowed the build down.

It might work well for this repo, but something to watch out for is that you won't have any control over what gets cached, and inefficient cache usage can slow a build down a lot.

Contributor Author (@sreuland)

Yes, on consecutive reruns of this workflow from stellar-rpc/520, where no source code versions change, the docker builds still appear to be rebuilding layers I would have expected to be present in the GHA cache from prior runs. I will look into verbose docker logging and debug for a root cause.

Contributor Author (@sreuland)

@leighmcculloch, I found the immediate issue with type=gha not working here: it was due to not including the crazy-max action that exposes the GitHub runtime settings to BuildKit. After adding that, I was able to see some export-to-cache activity logged during builds via --cache-to, but nothing was ever imported from the cache via --cache-from. In case it was just an issue with gha or the builder, I tried type=local and manually ran the BuildKit daemon as a container using the moby:latest image instead of using docker/setup-buildx-action, and still saw the same one-way cache behavior. It seems like there's a lower-level issue between the buildkitd process, the builders using that daemon, and the docker daemon on the runner, as caching doesn't work between them.
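For context, the setup being referred to looks roughly like this, assuming the action meant is crazy-max/ghaction-github-runtime (which exposes the ACTIONS_* runtime variables that the type=gha cache backend needs); version pins and the image tag are illustrative:

    - uses: docker/setup-buildx-action@v3

    # Expose ACTIONS_RUNTIME_TOKEN / ACTIONS_CACHE_URL to later steps so that
    # docker buildx build can use --cache-from/--cache-to with type=gha.
    - uses: crazy-max/ghaction-github-runtime@v3

    - name: Build with the GitHub Actions cache backend
      run: |
        docker buildx build . -t stellar/system-test:dev --load \
          --cache-from=type=gha \
          --cache-to=type=gha,mode=max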

So, I reverted back to caching the system-test and cli images individually, and this callable workflow now has the expected cache-hit behavior, verified by invoking the workflow and seeing cache hits on a stellar-rpc/e2e run.

If you can re-review this PR for the final cache setup, then once it merges I'll point stellar-rpc/520 at system-test:master and move that up for review as well. Thanks!

Contributor Author (@sreuland)

@leighmcculloch, bumping this in case it was lost in the emails; I was asking for a re-review of the final caching strategy I used, thanks!

Member @leighmcculloch left a comment

Looks good. It looks like this reduces the infra in here a lot, which is a good thing.

In terms of the long-term vision, system-test is less clear to me. It depends on how the tests get built out, how the tests evolve here, and what gets put here to test. Looking forward to seeing how that plays out.

@sreuland merged commit b140302 into master on Nov 25, 2025; 5 checks passed.
@sreuland deleted the no_quickstart branch on November 25, 2025 at 18:42.
Linked issue (may be closed by this pull request): Poc: refactor to pure tests, no build or embed of quickstart image