
KPI calculated even if too little data supplied #11

Open
arurke opened this issue May 31, 2022 · 4 comments

Comments

@arurke
Contributor

arurke commented May 31, 2022

This might be an issue in TriScale, or me misunderstanding a use-case.

TL;DR: analysis_kpi() returns a valid value when too few data points are supplied, provided the "unintuitive" bound is selected (upper for percentile < 50, and vice versa).

Background: The intuitive way to calculate a KPI is to specify a bound which gives us the "worst case" (upper when percentile > 50, and vice versa). This allows us to make the "performance is at least X"-statements. However, I was thinking there was information in the other bound as well. This would show the width of the CI, and we could learn if the given metric varies a lot between runs. The first example coming to mind is industrial scenarios, where not only the maximum latency is interesting, but also its variability.

With this background, I was routinely calling analysis_kpi() twice, once with bound set to upper and once with lower. Doing this, I noticed I would get a valid value when the "unintuitive" bound was selected (upper for percentile < 50, and vice versa), even when I had too little data.

Example with too little data:

import numpy as np
import triscale

# Only 5 data points: far too few for a CI on the 99th percentile
data = np.random.randint(0, 10, size=5)

settings = {"bound": "lower", "percentile": 99,
            "confidence": 95, "bounds": [min(data), max(data)]}

independent, kpi = triscale.analysis_kpi(data, settings, verbose=False)
print("KPI: " + str(kpi))

With bound set to "upper", the KPI correctly returns NaN. With bound set to "lower", a number is returned.

@romain-jacob
Owner

Sorry for the delay! I have been quite busy recently.


TL;DR: Everything works as expected AFAIU

The number of data points you need depends on the bound (upper/lower) that you pick, for a given percentile (except for the median, of course). To understand why, let's stick with your example: P=99, C=95.

  • When we compute the "upper" CI for that percentile, what we are actually doing is checking whether we have one data point that has at least 95% probability to be larger than the 99th percentile. This requires more than 5 samples, so the method returns NaN.
  • If one computes the "lower" CI for the same percentile, we check whether we have one data point that has at least 95% probability to be smaller than the 99th percentile. And that's easy because most samples (99% of them) are expected to be smaller than the 99th percentile. So one needs very few samples.
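The asymmetry described above can be sketched with a back-of-the-envelope calculation. This is just the order-statistics reasoning underlying the method, not TriScale's actual implementation: a one-sided CI on the p-th percentile exists only once the probability that all n samples fall on the "wrong" side of the percentile drops below 1 - C.

```python
import math

def min_samples_one_sided(p, c, bound):
    """Rough minimum sample count for a one-sided CI on the p-th percentile
    with confidence c (both as fractions). Hypothetical helper, for intuition only."""
    # For "upper": all n samples fall below the percentile with probability p**n.
    # For "lower": all n samples fall above it with probability (1 - p)**n.
    q = p if bound == "upper" else 1 - p
    return math.ceil(math.log(1 - c) / math.log(q))

print(min_samples_one_sided(0.99, 0.95, "upper"))  # ~299 samples needed
print(min_samples_one_sided(0.99, 0.95, "lower"))  # 1 sample suffices
```

This matches the behavior in the issue: with 5 samples, the upper bound on the 99th percentile is unavailable (NaN), while the lower bound is trivially cheap.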

Side-note
If you are interested in the variability of a given KPI, you might want to look at the two-sided option. In short, it spares you calling the method twice (plus, you are sure that the percentile lies between the two returned bounds with 95% confidence).

@arurke
Contributor Author

arurke commented Jun 8, 2022

Thanks a lot for a very detailed and enlightening explanation. It makes a lot of sense. I tunnel-visioned and assumed both bounds had the same requirements. Regarding the side note: you mean in analysis_kpi()? It forces one-sided on master right now. But I do see there seems to be support for it in ThompsonCI(). Is that ready to be used?

Sorry for the delay! I have been quite busy recently.

No need to apologize, I am grateful for you taking the time!

@romain-jacob
Owner

Ah yes, you're right. You'll need to go back to the ThompsonCI() function to get access to the two-sided option (or you just overwrite the TriScale function to allow that option).

The two-sided option is reliable. JSYK, I opened a PR ages ago to include this ThompsonCI() function in scipy but never got around to finishing it... which is a shame, but you know... life. :-/

@arurke
Contributor Author

arurke commented Jun 12, 2022

I played a bit with the two-sided option. I modified analysis_kpi() to basically call ThompsonCI() directly and return the lower and upper bounds it calculates. I then call it with 1000 data points, varying the class and percentile, for example:

data = np.random.randint(1, 10, size=1000)
settings = {"bound": "lower", "percentile": 90,
            "confidence": 95, "bounds": [min(data), max(data)],
            "class": "two-sided"}

The lower and upper bounds I get are as follows:

  • "one-sided":
    • 90p: 883 - 915. # With 95c, the true 90p is between index 883 and 915.
    • 10p: 84 - 116. # With 95c, the true 10p is between index 84 and 116.
  • "two-sided":
    • 90p: 84 - 915. # With 95c, 90 % of the data is between index 84 and 915
    • 10p: 84 - 915. # With 95c, 10 % of the data is between index 84 and 915?!
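For what it's worth, the one-sided indices above can be reproduced (up to indexing conventions and off-by-one differences) from the fact that the number of samples falling below the true percentile is binomially distributed. This is a hedged sketch using scipy, not TriScale's own code:

```python
from scipy.stats import binom

n, p, conf = 1000, 0.90, 0.95

# Count of samples below the true 90th percentile ~ Binomial(n, p).
# One-sided quantiles of that count give the order-statistic indices
# bracketing the percentile with the requested confidence.
lo = int(binom.ppf(1 - conf, n, p))  # roughly index 884
hi = int(binom.ppf(conf, n, p))      # roughly index 916
print(lo, hi)
```

These land close to the 883 / 915 reported above, which suggests the one-sided numbers are the order-statistic indices of the sorted data.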

I am struggling to reconcile my understanding of CIs and "bounds", the terms in TriScale, and the data I am seeing. Since I was conflicted, I added interpretation statements after the bounds above. Could I ask you to comment, clarify, or confirm?
