TEP-0036: Start measuring Pipelines performance #277
Conversation
🎉 Some questions but otherwise looks good to me!
* To be able to understand and communicate about Tekton performance on an ongoing basis
* Be able to answer questions like:
* How much overhead does a Task add?
I read this and immediately wondered, "add to what?" Is that intentionally left out because it could be... anything? Or would it be possible to specify this a tad more to something like "How much overhead does a Task add to Pod performance?"?
I think the sticking point for me here is probably just the word "overhead" which implies a baseline over and above which the Task adds "something". What are those baselines? Are we saying that identifying those baselines is itself a Motivation that's worth calling out? (Reading ahead it sounds like it?)
Similar question for "How much does a Pipeline add?" and "How much does a Bundle add?" below.
"How much overhead does a Task add to Pod performance?"
I think pod performance depends on the cloud provider or on the environment that it's getting created in. I believe whatever benchmarking we put in place will be against our own CI system to begin with, please correct me if I am wrong. I think what we are trying to evaluate here is "how much overhead does this change or this release add (or remove) when creating a TaskRun", which includes setting up resources or workspaces or volumes etc before a pod is actually created.
Great point @sbwsg ! I've added a bit more detail, still really vague tho so lemme know if you want me to be more precise.
@pritidesai it's true that the performance depends on the cloud provider - specifically what we're interested in here is the relative performance, and @sbwsg pointed out that half the statement was missing XD
The idea is: if I run some specific image in a pod, it takes X length of time to execute (probably on average, but maybe we'll be more interested in some other way of looking at the metric over time); how much overhead does it add if I execute that same image / pod via a Task? And what if I put that Task in a Pipeline? You're right that the overhead might be different in different environments; it could be interesting to compare.
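To make that concrete, here's a rough sketch (not part of the TEP; the timestamps and durations are made up for illustration) of how the overhead could be computed from the start/completion timestamps Kubernetes records on a bare Pod and on a TaskRun wrapping the same image:

```go
package main

import (
	"fmt"
	"time"
)

// durationBetween parses two RFC3339 timestamps (the format Kubernetes uses
// in status fields like startTime/completionTime) and returns the elapsed time.
func durationBetween(start, end string) time.Duration {
	s, _ := time.Parse(time.RFC3339, start)
	e, _ := time.Parse(time.RFC3339, end)
	return e.Sub(s)
}

func main() {
	// Hypothetical measurements: the same image run once as a bare Pod and
	// once via a TaskRun.
	podDuration := durationBetween("2021-03-01T10:00:00Z", "2021-03-01T10:00:12Z")
	taskRunDuration := durationBetween("2021-03-01T10:05:00Z", "2021-03-01T10:05:19Z")

	overhead := taskRunDuration - podDuration
	fmt.Printf("bare pod: %v, taskrun: %v, overhead: %v (+%.0f%%)\n",
		podDuration, taskRunDuration, overhead,
		100*float64(overhead)/float64(podDuration))
}
```

The same comparison could then be repeated with the Task wrapped in a single-task Pipeline to get the Pipeline's incremental overhead.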
* How much overhead does using a Tekton Bundle vs referencing a Task in my cluster add?
* If we switch out X component for Y component in the controller implementation, how does this impact performance?
* How much better or worse does release X perform than release Y?
* What characteristics should I look for when choosing what kind of machine to tun the Tekton Pipelines controller on?
nit: tun -> run
### Goals

The goal of this proposal is to avoid trying to boil the performance and loadtesting ocean all at once, and to instead
Non-goal: boiling the performance and loadtesting ocean
😄
* Chaos testing, i.e. how does the system perform in the presence of errors
* Other Tekton projects (e.g. Triggers, CLI, Dashboard, etc.)

### Use Cases (optional)
Suggestion: As a maintainer of Tekton Pipelines I can identify through some documented process (or dashboard or tool) where in the commit history a performance regression was introduced.
It sounds to me like (at least initially?) we don't plan to perform this performance measurement against every PR. It'd be nice if we built this stuff out with an eye to working backwards such that on a bleary-eyed pre-coffee Monday morning I can start my build captain rotation by following the runbook (or something) to figure out where a performance regression started rearing its head.
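As a sketch of what that runbook step might boil down to (assuming we end up storing a per-commit benchmark result somewhere, which the TEP doesn't specify yet; the data and the 20% threshold below are invented), the build captain could walk the recorded history and flag the first commit where the measured duration jumps:

```go
package main

import "fmt"

// result is a hypothetical per-commit benchmark record: the commit SHA and
// the median end-to-end TaskRun duration (in seconds) measured against it.
type result struct {
	commit  string
	seconds float64
}

// firstRegression returns the first commit whose duration exceeds the previous
// commit's by more than the given ratio (e.g. 0.2 = 20% slower).
func firstRegression(history []result, ratio float64) (result, bool) {
	for i := 1; i < len(history); i++ {
		if history[i].seconds > history[i-1].seconds*(1+ratio) {
			return history[i], true
		}
	}
	return result{}, false
}

func main() {
	// Hypothetical benchmark history, oldest first.
	history := []result{
		{"a1b2c3d", 14.8}, {"d4e5f6a", 15.1}, {"b7c8d9e", 21.4}, {"f0a1b2c", 21.6},
	}
	if r, ok := firstRegression(history, 0.2); ok {
		fmt.Printf("regression likely introduced at %s (%.1fs)\n", r.commit, r.seconds)
	}
}
```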
## Requirements

* Start with a bare minimum set of SLIs/SLOs and iterate from there
* Access to the new infrastructure should be given to all “build cops”
Is there an assumption here that this role will also now be reviewing and acting on the perf information being collected? Or is this purely about managing the machines the performance measurements are running on?
If the role is also doing something tangible with the info, suggest maybe putting that in a Goal / Requirement too?
I think I'm gonna cheat a bit and add a requirement about deciding who is responsible for acting on this info and what they're expected to do 😇
Reference: [Definitions of SLIs and SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
These are worded in terms of running observable services. Since Tekton Pipelines is providing a service that can be run
by others (vs hosting a "Tekton Pipelines" instance we expect users to use) our situation is a bit different, but I
think the same SLI and SLO concepts can be applied.
I agree that the concepts can be applied but I wonder if we should just call them Performance Indicators and Performance Objectives? Reason being that SLI + SLO implies (to my ear) SLA. And SLA is kinda more of a contractual thing that sometimes gets used by corporations to bash each other over the head or hold back payments or get money back or w/e.
What we're offering here isn't really contractual, I don't think? So I wonder if just naming them something slightly different (while 100% still linking to the SRE book as reference material for meaning) will dispel some of that impression?
Conversely, let's say we do stick with the SLI/SLO language. What's the SLA? Or, in other words, what are the outcomes for the Tekton Pipelines project when a performance regression is identified? Do we hold back releases? Drop all other work to focus on it? Have a dedicated build captain role specific to performance? Perhaps this also needs to be a Goal / Requirement that we have to hash out in WGs?
yup I agree, I like Performance Indicators and Objectives better, great suggestion @sbwsg.
I'm tempted to stick with SLO and SLI b/c they are well known terms, and switching to performance indicators and objectives feels a bit like we're trying to invent something new when we don't have to.
@sbwsg I think you're totally right that SLO and SLI are often used in the same context as SLA but really I don't think they imply SLA
sgtm!
sure, I am open to going with SLO and SLI, we can always change it if needed.
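For what it's worth, whichever name we land on, a first SLI/SLO pair could be as small as this sketch (the specific indicator, the sample data, and the 5 second objective are all hypothetical, just to show the shape):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile of samples using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// SLI (hypothetical): seconds from TaskRun creation until its pod starts,
	// collected over some window of benchmark runs.
	samples := []float64{2.1, 1.8, 2.4, 3.0, 2.2, 2.9, 8.5, 2.0, 2.3, 2.7}

	// SLO (hypothetical): the 95th percentile of that SLI stays under 5 seconds.
	const objective = 5.0
	p95 := percentile(samples, 95)
	fmt.Printf("SLI p95 = %.1fs, SLO = %.1fs, met: %v\n", p95, objective, p95 <= objective)
}
```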
I think this is ready for another look @sbwsg and @pritidesai ! Also quick update on my plan going forward: I'd like to merge the problem statement, but after that I'm planning to put this on the back burner and would be happy for anyone else to take this over if they are interested.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbwsg. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm This is a great start.
@psschwei and I are interested in giving it a go if that's fine. Our goal would be to start small with just one SLI and measure it with the Pipeline.
That sounds great @pritidesai and @psschwei !!! 🙏 🎉 Thank you :D
This PR starts a TEP to begin to measure tekton pipelines performance
and address tektoncd/pipeline#540
This first iteration just tries to describe the problem vs suggesting
the solution.
It DOES recommend measuring SLOs and SLIs as a goal, which is kind of
part of the solution, so if we think it's useful we could step back even
further, but I think this is a reasonable path forward, curious what
other folks think!