TEP-0036: Start measuring Pipelines performance #277
Conversation
🎉 Some questions but otherwise looks good to me!
* To be able to understand and communicate about Tekton performance on an ongoing basis
* Be able to answer questions like:
* How much overhead does a Task add?
I read this and immediately wondered, "add to what?" Is that intentionally left out because it could be... anything? Or would it be possible to specify this a tad more to something like "How much overhead does a Task add to Pod performance?"?
I think the sticking point for me here is probably just the word "overhead" which implies a baseline over and above which the Task adds "something". What are those baselines? Are we saying that identifying those baselines is itself a Motivation that's worth calling out? (Reading ahead it sounds like it?)
Similar question for "How much does a Pipeline add?" and "How much does a Bundle add?" below.
"How much overhead does a Task add to Pod performance?"
I think pod performance depends on the cloud provider or on the environment that it's getting created in. I believe whatever benchmarking we put in place will be against our own CI system to begin with, please correct me if I am wrong. I think what we are trying to evaluate here is "how much overhead does this change or this release add (or remove) when creating a TaskRun", which includes setting up resources or workspaces or volumes etc before a pod is actually created.
Great point @sbwsg ! I've added a bit more detail, still really vague tho so lemme know if you want me to be more precise.
@pritidesai it's true that the performance depends on the cloud provider - specifically what we're interested in here is the relative performance, and @sbwsg pointed out that half the statement was missing XD
The idea is: if I run some specific image in a pod, it takes X length of time to execute (probably on average, but maybe we'll be more interested in some other way of looking at the metric over time); how much overhead does it add if I execute that same image / pod via a Task? And what if I put that Task in a Pipeline? You're right that the overhead might be different in different environments; it could be interesting to compare.
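To make that concrete, here's a rough sketch (not part of the TEP; the timestamps and durations are made up for illustration) of how the overhead could be computed from the start/completion timestamps Kubernetes records on a bare Pod and on a TaskRun wrapping the same image:

```go
package main

import (
	"fmt"
	"time"
)

// durationBetween parses two RFC3339 timestamps (the format Kubernetes uses
// in status fields like startTime/completionTime) and returns the elapsed time.
func durationBetween(start, end string) time.Duration {
	s, _ := time.Parse(time.RFC3339, start)
	e, _ := time.Parse(time.RFC3339, end)
	return e.Sub(s)
}

func main() {
	// Hypothetical measurements: the same image run once as a bare Pod and
	// once via a TaskRun.
	podDuration := durationBetween("2021-03-01T10:00:00Z", "2021-03-01T10:00:12Z")
	taskRunDuration := durationBetween("2021-03-01T10:05:00Z", "2021-03-01T10:05:19Z")

	overhead := taskRunDuration - podDuration
	fmt.Printf("bare pod: %v, taskrun: %v, overhead: %v (+%.0f%%)\n",
		podDuration, taskRunDuration, overhead,
		100*float64(overhead)/float64(podDuration))
}
```

The same comparison could then be repeated with the Task wrapped in a single-task Pipeline to get the Pipeline's incremental overhead.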
* How much overhead does using a Tekton Bundle vs referencing a Task in my cluster add?
* If we switch out X component for Y component in the controller implementation, how does this impact performance?
* How much better or worse does release X perform than release Y?
* What characteristics should I look for when choosing what kind of machine to tun the Tekton Pipelines controller on?
nit: tun -> run
### Goals

The goal of this proposal is to avoid trying to boil the performance and loadtesting ocean all at once, and to instead
Non-goal: boiling the performance and loadtesting ocean
😄
* Chaos testing, i.e. how does the system perform in the presence of errors
* Other Tekton projects (e.g. Triggers, CLI, Dashboard, etc.)

### Use Cases (optional)
Suggestion: As a maintainer of Tekton Pipelines I can identify through some documented process (or dashboard or tool) where in the commit history a performance regression was introduced.
It sounds to me like (at least initially?) we don't plan to perform this performance measurement against every PR. It'd be nice if we built this stuff out with an eye to working backwards such that on a bleary-eyed pre-coffee Monday morning I can start my build captain rotation by following the runbook (or something) to figure out where a performance regression started rearing its head.
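As a sketch of what that runbook step might boil down to (assuming we end up storing a per-commit benchmark result somewhere, which the TEP doesn't specify yet; the data and the 20% threshold below are invented), the build captain could walk the recorded history and flag the first commit where the measured duration jumps:

```go
package main

import "fmt"

// result is a hypothetical per-commit benchmark record: the commit SHA and
// the median end-to-end TaskRun duration (in seconds) measured against it.
type result struct {
	commit  string
	seconds float64
}

// firstRegression returns the first commit whose duration exceeds the previous
// commit's by more than the given ratio (e.g. 0.2 = 20% slower).
func firstRegression(history []result, ratio float64) (result, bool) {
	for i := 1; i < len(history); i++ {
		if history[i].seconds > history[i-1].seconds*(1+ratio) {
			return history[i], true
		}
	}
	return result{}, false
}

func main() {
	// Hypothetical benchmark history, oldest first.
	history := []result{
		{"a1b2c3d", 14.8}, {"d4e5f6a", 15.1}, {"b7c8d9e", 21.4}, {"f0a1b2c", 21.6},
	}
	if r, ok := firstRegression(history, 0.2); ok {
		fmt.Printf("regression likely introduced at %s (%.1fs)\n", r.commit, r.seconds)
	}
}
```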
## Requirements

* Start with a bare minimum set of SLIs/SLOs and iterate from there
* Access to the new infrastructure should be given to all “build cops”
Is there an assumption here that this role will also now be reviewing and acting on the perf information being collected? Or is this purely about managing the machines the performance measurements are running on?
If the role is also doing something tangible with the info, suggest maybe putting that in a Goal / Requirement too?
I think I'm gonna cheat a bit and add a requirement about deciding who is responsible for acting on this info and what they're expected to do 😇
Reference: [Definitions of SLIs and SLOs](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
These are worded in terms of running observable services. Since Tekton Pipelines is providing a service that can be run
by others (vs hosting a "Tekton Pipelines" instance we expect users to use) our situation is a bit different, but I
think the same SLI and SLO concepts can be applied.
I agree that the concepts can be applied but I wonder if we should just call them Performance Indicators and Performance Objectives? Reason being that SLI + SLO implies (to my ear) SLA. And SLA is kinda more of a contractual thing that sometimes gets used by corporations to bash each other over the head or hold back payments or get money back or w/e.
What we're offering here isn't really contractual, I don't think? So I wonder if just naming them something slightly different (while 100% still linking to the SRE book as reference material for meaning) will dispel some of that impression?
Conversely, let's say we do stick with the SLI/SLO language. What's the SLA? Or, in other words, what are the outcomes for the Tekton Pipelines project when a performance regression is identified? Do we hold back releases? Drop all other work to focus on it? Have a dedicated build captain role specific to performance? Perhaps this also needs to be a Goal / Requirement that we have to hash out in WGs?
yup I agree, I like Performance Indicators and Objectives better, great suggestion @sbwsg.
I'm tempted to stick with SLO and SLI b/c they are well known terms, and switching to performance indicators and objectives feels a bit like we're trying to invent something new when we don't have to.
@sbwsg I think you're totally right that SLO and SLI are often used in the same context as SLA but really I don't think they imply SLA
sgtm!
sure, I am open to going with SLO and SLI, we can always change it if needed.
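For what it's worth, whichever name we land on, a first SLI/SLO pair could be as small as this sketch (the specific indicator, the sample data, and the 5 second objective are all hypothetical, just to show the shape):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile of samples using the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// SLI (hypothetical): seconds from TaskRun creation until its pod starts,
	// collected over some window of benchmark runs.
	samples := []float64{2.1, 1.8, 2.4, 3.0, 2.2, 2.9, 8.5, 2.0, 2.3, 2.7}

	// SLO (hypothetical): the 95th percentile of that SLI stays under 5 seconds.
	const objective = 5.0
	p95 := percentile(samples, 95)
	fmt.Printf("SLI p95 = %.1fs, SLO = %.1fs, met: %v\n", p95, objective, p95 <= objective)
}
```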
I think this is ready for another look @sbwsg and @pritidesai ! Also quick update on my plan going forward: I'd like to merge the problem statement, but after that I'm planning to put this on the back burner and would be happy for anyone else to take this over if they are interested.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbwsg. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/lgtm This is a great start.
@psschwei and I are interested in giving it a go if that's fine. Our goal would be to start small with just one SLI and measure it with the Pipeline.
That sounds great @pritidesai and @psschwei !!! 🙏 🎉 Thank you :D
This PR starts a TEP to begin to measure tekton pipelines performance
and address tektoncd/pipeline#540
This first iteration just tries to describe the problem vs suggesting
the solution.
It DOES recommend measuring SLOs and SLIs as a goal, which is kind of
part of the solution, so if we think it's useful we could step back even
further, but I think this is a reasonable path forward, curious what
other folks think!