[ENHANCEMENT] alternative to slo_event_producer by expression evaluation #53

Open
lksv opened this issue Apr 25, 2021 · 0 comments
Labels
enhancement New feature or request

This proposal follows the example of slo_rules.yaml, but with new semantics.

The example consists of two parts:

  1. thresholds for each class and category
  2. rules with expressions. For the expressions, https://github.com/antonmedv/expr seems to me like a great choice.

The first part should also be exported as Prometheus metrics, in the same (compatible) format as the SLO metadata. This could simplify the configuration of slo-exporter.

Note that the term category is used for availability, latency, etc., whereas slo_type must explicitly identify a particular metric/SLI/SLO. Therefore, when more than one SLI is used for a category, slo_type identifies the exact one:

  • category: "latency", slo_type: "latency99", percentile: "99"
  • category: "latency", slo_type: "latency90", percentile: "90"
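
As a rough illustration of this naming, and of the shorthand forms described in the comments of the example configuration in this issue, the expansion into one entry per slo_type could look like the Python sketch below. The function name `normalize_class` is hypothetical, and the rule that a list entry's slo_type is the category name followed by its threshold is an assumption inferred from the `latency99`/`latency90` examples; neither is an existing slo-exporter API.

```python
def normalize_class(spec):
    """Expand one SLO class definition into {slo_type: settings}."""
    result = {}
    for key, value in spec.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Bare number: shorthand for {"threshold": <number>}.
            result[key] = {"threshold": value, "slo_category": key, "slo_type": key}
        elif isinstance(value, list):
            # List of dicts: one slo_type per entry, named <key><threshold>
            # (assumed naming rule, e.g. "latency" + 99 -> "latency99").
            for item in value:
                slo_type = f"{key}{item['threshold']:g}"
                result[slo_type] = {**item, "slo_category": key, "slo_type": slo_type}
        elif isinstance(value, dict):
            # Dict: keep as is; missing identifiers default to the key.
            result[key] = {"slo_category": key, "slo_type": key, **value}
        else:
            raise TypeError(f"unsupported definition for {key!r}: {value!r}")
    return result


critical = normalize_class({
    "availability": 99.9,
    "latency": [
        {"threshold": 99, "maxDuration": 0.5},
        {"threshold": 90, "maxDuration": 0.2},
    ],
})
```

With this shape, a rule expression such as `class.latency99.maxDuration` would resolve to `critical["latency99"]["maxDuration"]`.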

The following example does not describe a useful SLO definition; it is intended as a showcase of possible usage.

classes:    
  - version: "1"    
    # Keys are SLO classes; under each key is a dictionary whose keys define
    # slo_types (availability, latency90, latency99, etc.).
    #
    # If a value of the dict is:
    # * a number, then it is interpreted as `threshold => <number>`, e.g.
    #   `{ "availability" => 99.9 }` is just shorthand for
    #   ```    
    #   {    
    #     "availability" => { "threshold" => 99.9, "slo_category" => "availability", "slo_type" => "availability"}
    #   }    
    #   ```    
    # * an array of dicts, then it is expanded as the example shows:
    #   From `"latency" => [{ "threshold" => 99, "maxDuration" => 0.5}, { "threshold" => 90, "maxDuration" => 0.2 }]` to
    #   ```    
    #   {    
    #     "latency99" => { "threshold" => 99, "maxDuration" => 0.5, slo_category => "latency", slo_type => "latency99" },
    #     "latency90" => { "threshold" => 90, "maxDuration" => 0.2, slo_category => "latency", slo_type => "latency90" }
    #   }    
    #   ```    
    # * a dict:
    #   If the keys `slo_category` or `slo_type` are not present, they are set
    #   to the same value as the key pointing to the dict. The dict is then
    #   accessible from rule expressions, and its `threshold` value is passed
    #   to Prometheus to be used as the SLO threshold.
    #
    # The first version might implement only the dict form.
    #    
    # The following lines are intentionally long, without line breaks.
    # This makes it visually straightforward to compare the individual
    # SLO classes and categories (slo_class & slo_types) with each other.
    #    
    critical:  { "availability" => 99.9, "latency" => [{ "threshold" => 99, "maxDuration" => 0.5}, { "threshold" => 90, "maxDuration" => 0.2 }] }
    high_fast: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 1.5}, { "threshold" => 90, "maxDuration" => 0.5 }] }
    high_slow: { "availability" => 99.5, "latency" => [{ "threshold" => 99, "maxDuration" => 3.0}, { "threshold" => 90, "maxDuration" => 2.0 }] }
    low:       { "availability" => 99.0, "latency" => [{ "threshold" => 99, "maxDuration" => 6.0}, { "threshold" => 90, "maxDuration" => 3.0 }] }
  - version: "2"    
    critical:  { "availability" => { "threshold" => 99.9, "maxDuration" => 0.2 } } 
    high_fast: { "availability" => { "threshold" => 99.0, "maxDuration" => 1.5 } } 
    high_slow: { "availability" => { "threshold" => 99.0, "maxDuration" => 3.0 } } 
    low:       { "availability" => { "threshold" => 95.0, "maxDuration" => 6.0 } } 
    
    
# Evaluation workflow:
# 1. The class of the input event is determined first (e.g. `slo_class=critical`).
# 2. For each version in the class table:
#    1. Only rule groups whose group_expr evaluates to true are evaluated.
#    2. When rules are evaluated, all variables from the `classes` definition table are accessible.
#    3. When additional_metadata is defined:
#       * all values that are strings are added to the slo_event
#       * all values that are dicts with only an `expr` key are evaluated and the result is added to the slo_event
#       * otherwise an error metric is incremented.
    
slo_domain: 'autoadmins'    
rule_groups:
    - group_expr: 'version == "1"'    
      rules:    
      - slo_type: 'availability'    
        slo_result_expr: "statusCode < 500"
      - slo_type: 'latency90'    
        slo_result_expr: "requestDuration < class.latency90.maxDuration"
        additional_metadata:
          percentile: 90
          le: 0.2  # hardcoded, same number as `class.latency90.maxDuration`
      - slo_type: 'latency99'
        slo_result_expr: "requestDuration < class.latency99.maxDuration"
        additional_metadata:    
          percentile: 99    
          le:    
            - expr: 'class.latency99.maxDuration'    
    
    - group_expr: 'version == "2"'    
      rules:    
      - slo_type: 'availability&latency'    
        slo_result_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
        additional_metadata:    
          percentile: 100    
          le:    
            - expr: 'class.availability.maxDuration'    
      # example of one category (slo_type) instead of three    
      - slo_type: 'availability&latency'    
        default_expr: "statusCode < 500 && requestDuration < availability.maxDuration"
    
      # Example of additional metadata defined by an expression:
      # the result of the expression is used as the slo_event.
      # The `slo_type` key is added to the expression result, and the result is checked to contain `slo_result` as a boolean.
      - slo_type: 'availability&latency'    
        slo_event_expr: "{ le: availability.maxDuration, percentile: availability, slo_result: (statusCode < 500 && requestDuration < availability.maxDuration) }"
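
A minimal sketch of the evaluation workflow described in the comments above, assuming the expressions have already been compiled. Here they are modelled as plain Python callables that take an environment dict, standing in for whichever expression library ends up being used. The names `evaluate_event`, `class_versions` and the `errors` counter are hypothetical. The sketch follows the workflow text literally, so only string values and `{expr: ...}` dicts are accepted in `additional_metadata`; how numeric values such as `percentile: 99` should be treated is left open.

```python
from collections import Counter


def evaluate_event(event, class_versions, rule_groups, errors):
    """Return the slo_events produced for one input event.

    class_versions: {version: {slo_class: {slo_type: settings}}}
    rule_groups:    list of {"group_expr": callable, "rules": [...]}
    errors:         Counter standing in for an error metric
    """
    slo_events = []
    slo_class = event["slo_class"]                      # step 1
    for version, classes in class_versions.items():     # step 2
        env = {**event, "version": version, "class": classes[slo_class]}
        for group in rule_groups:
            if not group["group_expr"](env):            # step 2.1
                continue
            for rule in group["rules"]:                 # step 2.2
                slo_event = {
                    "slo_type": rule["slo_type"],
                    "slo_result": bool(rule["slo_result_expr"](env)),
                }
                metadata = rule.get("additional_metadata", {})
                for key, value in metadata.items():     # step 2.3
                    if isinstance(value, str):
                        slo_event[key] = value
                    elif isinstance(value, dict) and set(value) == {"expr"}:
                        slo_event[key] = value["expr"](env)
                    else:
                        errors[key] += 1
                slo_events.append(slo_event)
    return slo_events


class_versions = {
    "1": {"critical": {"latency90": {"threshold": 90, "maxDuration": 0.2}}},
}
rule_groups = [{
    "group_expr": lambda env: env["version"] == "1",
    "rules": [{
        "slo_type": "latency90",
        "slo_result_expr": lambda env: (
            env["requestDuration"] < env["class"]["latency90"]["maxDuration"]
        ),
        "additional_metadata": {
            "percentile": "90",
            "le": {"expr": lambda env: env["class"]["latency90"]["maxDuration"]},
            "unsupported": None,  # neither a string nor an expr dict
        },
    }],
}]
errors = Counter()
events = evaluate_event(
    {"slo_class": "critical", "statusCode": 200, "requestDuration": 0.15},
    class_versions, rule_groups, errors,
)
```

In this run the single rule group matches version "1", the event is classified as within the `latency90` limit, and the unsupported metadata value only bumps the error counter instead of aborting the evaluation.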
@lksv lksv added the enhancement New feature or request label Apr 25, 2021