
Tuning the Mesh

Service Mesh Tuning Requirements

When tuning a Service Mesh, the two parties that usually provide non-functional requirements are the Application Team and the Platform Team. They have expressed the following types of tuning expectations:

  • The Travel Agency Application Team (includes Product Owners, Tech Leads, Mesh Developers), who need to tune the application side to handle the expected customer load, have provided us with the following expectations.

Expected load is 250k requests per day with a peak of 250 requests per second (rps).

  • The Travel Agency Platform Team (includes Cluster Operators, Mesh Operators, Platform (Application Ops) Team), who need to tune the control plane side of the Service Mesh to handle observability, ingress/egress, application runtime and configuration needs, have the following concerns as internal cloud providers:

What are the best practices / sizing guides / benchmarks so that we can anticipate the best evolution of the OSSM instance and give the best experience to the project teams?

  • How to define the maximum capacity of an OSSM instance?

  • How to define the maximum number of applications that can join the mesh in the future?

  • Which criteria/metrics should be used for capacity rules?

  • What are the current limits of an OSSM instance on the control plane and the data plane side?

Service Mesh Tuning Focus

As a first practice in tuning a Service Mesh, it is important to have an architecture decision on the Deployment Model to be used with that mesh and to have established the Purpose and Principles around the use and setup of this Service Mesh, as these will determine the type of tuning required.

Furthermore, be aware that in a cloud based environment there are many components that could be tuned (firewalls, loadbalancers, container platform etc.); the following guidance focuses only on the two major areas of Red Hat OSSM:

  • The data plane, which consists of all the istio-proxy sidecars (Envoy) injected into the Pods of every workload and responsible for handling a workload's incoming traffic, as well as the ingress/egress gateway components.

  • The control plane, responsible for keeping the proxies up-to-date with the latest configuration, certificates etc, and the observability stack.

In the following sections we focus guidance around:

  • How to test the performance of the mesh (the data plane's specific non-functional needs as well as the control plane components),

  • How to measure sizing needs (eg. with a set of apps and established requests how much storage for tracing/metrics, how many istiod components to use, what CPU/RAM is required by the sidecar etc.)

  • What can be tuned (eg. the configuration visibility, % of traces collected, replicas, threads etc.)

Sizing the Data Plane

The Application Team will need to tune the data plane components (Ingress/Egress workloads and istio-proxy sidecars) for memory, CPU and threads in such a way that they meet the solution's latency and throughput targets.

As in any other tuning exercise, correctly sizing a mesh data plane requires a set of scenarios based on real-world expected load. These scenarios provide separate load test configurations against which the components are tuned until the output required by the requirements is reached. In the end, the sizing of the solution is determined by the expected performance output.

Practical Data Plane Tuning exercise

In this exercise we showcase a process of tuning the flights and mysqldb data plane components for better performance when receiving up to 250 rps from the external partner via the gto-external-ingressgateway gateway. The siege HTTP multi-threaded load testing and benchmarking utility has been employed to perform the tests.

  1. In preparation for the tests relax the TLS settings for Gateway/travel-api-gateway to SIMPLE TLS by executing ./scripts/set-simple-tls-travel-api-gateway.sh

  2. Prepare to observe statistics and metrics around the performance and resources utilization which constitute application performance Service Level Indicators (SLIs).

    • Using the script ./scripts/containers-mem-cpu.sh "<POD NAME> <mysqldb|flights|istio-proxy>" (eg. get the filter by executing `kubectl top pods --containers | sort -rk1 | grep "mysqldb-v1-56f4f9d879-gjltv"`), CPU/Memory resource needs can be captured for the main workload and istio-proxy containers. In 4 separate command prompts monitor:

      ./containers-mem-cpu.sh  "mysqldb-v1-56f4f9d879-gjltv     mysqldb"
      ./containers-mem-cpu.sh  "mysqldb-v1-56f4f9d879-gjltv     istio-proxy"
      ./containers-mem-cpu.sh  "flights-v1-5c4bfff4b7-brwr8     flights"
      ./containers-mem-cpu.sh  "flights-v1-5c4bfff4b7-brwr8     istio-proxy"
    • prepare to monitor in PROMETHEUS the increase of istio_request_duration_milliseconds_sum over istio_request_duration_milliseconds_count (these metrics can be used to calculate the average request duration, an application SLI, over an arbitrary lookbehind window specified in the square brackets of the following query). There are many dimensions for the metric (eg. response_code=200/500, reporter=source/destination etc.); it is important to capture possible average duration increases (ie. added latency), the throughput of requests, and whether there are any failures (eg. response_code=500) due to resource contention.

      increase(istio_request_duration_milliseconds_sum{destination_canonical_service="flights"}[5m])   / increase(istio_request_duration_milliseconds_count{destination_canonical_service="flights"}[5m])
  3. Get a valid JWT TOKEN for the requests and execute a load test via the siege tool (test duration: 1 minute).

    TOKEN=$(curl -sLk --data "username=gtouser&password=gtouser&grant_type=password&client_id=istio&client_secret=bcd06d5bdd1dbaaf81853d10a66aeb989a38dd51" https://keycloak-rhsso.apps.ocp4.rhlab.de/auth/realms/servicemesh-lab/protocol/openid-connect/token | jq -r .access_token)
    siege -b -c500 -t60s https://gto-external-prod-istio-system.apps.ocp4.rhlab.de/flights/Tallinn --header="Authorization: Bearer $TOKEN"

Test 1 - Default data plane values

  • siege results show 97.51% availability with 7049 successful vs 174 failed transactions and a 1.13s average response time

    Lifting the server siege...
    Transactions:               6822 hits
    Availability:               97.51%
    Elapsed time:               59.14 secs
    Data transferred:           0.83 MB
    Response time:	            1.13 secs
    Transaction rate:           115.35 trans/sec
    Throughput:                 0.01 MB/sec
    Concurrency:	            130.67
    Successful transactions:    7049
    Failed transactions:	    174
    Longest transaction:	    10.16
    Shortest transaction:	    0.24
  • In PROMETHEUS we noticed that there was a:

    • 660 ms average increase in the duration spent on requests yielding response_code=500 responses, indicating not all requests were successful as the setup could not handle the requested load.

    • 114 ms increase in the duration spent to handle requests from the travels service (which is non-partner constant traffic) with 200 responses (normally avg is 28ms), due to the added load.

    • 473 ms increase in the duration spent to handle partner requests yielding 200 responses, again due to the added load.

Test 2 - Increase sidecar concurrency with extra worker threads

In the second test tune the istio-proxy to take advantage of 4 concurrent worker threads (from the default 2) in serving requests.

  • Apply the following annotation to the flights and mysqldb deployments:

          annotations:
            proxy.istio.io/config: |
              concurrency: 4
    • once the POD has been restarted verify the available worker threads are now 4 by executing

      oc exec <POD-NAME> -c istio-proxy -- curl localhost:15000/stats |grep worker
      
      server.worker_0.watchdog_mega_miss: 0
      server.worker_0.watchdog_miss: 0
      server.worker_1.watchdog_mega_miss: 0
      server.worker_1.watchdog_miss: 0
      server.worker_2.watchdog_mega_miss: 0
      server.worker_2.watchdog_miss: 0
      server.worker_3.watchdog_mega_miss: 0
      server.worker_3.watchdog_miss: 0
  • Taking the same observability actions and executing the previous siege loadtest we receive the following results:

    Lifting the server siege...
    Transactions:	            8092 hits
    Availability:	            98.73%
    Elapsed time:	            59.82 secs
    Data transferred:           0.96 MB
    Response time:	            0.80 secs
    Transaction rate:           135.27 trans/sec
    Throughput:                 0.02 MB/sec
    Concurrency:	            108.69
    Successful transactions:    8188
    Failed transactions:	    104
    Longest transaction:	    6.92
    Shortest transaction:	    0.24

The results yield the following observations:

  • with 8188 successful transactions:

    • an improvement of the throughput application SLI by 16% and a 42% decrease in failed transactions (98.73% availability and 104 failed), and

    • a 29% decrease of the response time application SLI (down to 0.8s),

    • overall a 14% increase in throughput and a 40% reduction of the longest transaction.

  • In PROMETHEUS we observe that during this test there is:

    • a 390 ms average duration increase spent on requests yielding response_code=500 responses. There are still failed requests, but with a 41% smaller increase than in Test 1.

    • a 69 ms increase in the duration spent to handle requests from the travels service (normal non-partner traffic) with 200 responses, again a 40% smaller increase than in Test 1, which indicates we can handle more load successfully.

    • a 181 ms increase in the duration spent to handle partner requests yielding 200 responses. With a 66% reduction from Test 1 this is another indicator that the change has increased the capability to handle more requests.

Overall we notice that by tuning the worker threads on the data plane for these two components we managed to increase throughput whilst the CPU and memory utilized by the istio-proxy remain largely unchanged (see the data captured below with containers-mem-cpu.sh).

(figure: istio-proxy CPU/memory utilization captured with containers-mem-cpu.sh)

Test 3 - Increase database concurrency

One final tuning action is performed against the actual mysql database. Utilizing the mysql-credentials and the root user, check in the mysqldb POD for the available connections and notice that max_connections is set to 151, which has already been reached (see Max_used_connections) and presents a bottleneck. In response, tune the workload connections and repeat the tests.

select version();
show variables like "%max_connections%";
show global status like "%Max_used%";
show status like "%thread%";
show global status like "%Aborted%";
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| max_connections        | 151   |
| mysqlx_max_connections | 100   |
+------------------------+-------+
+---------------------------+---------------------+
| Variable_name             | Value               |
+---------------------------+---------------------+
| Max_used_connections      | 152                 |
| Max_used_connections_time | 2022-10-11 13:08:32 |
+---------------------------+---------------------+

Increase mysqld max_connections to 250

set global max_connections = 250;

Following the same observability activities and executing the siege loadtest, the results show:

  • An additional 10% increase of throughput with 8955 successful transactions and a 100% success rate.

  • At 0.69s, an additional 14% decrease in response time.

  • At 148.42 trans/sec, an additional 14% increase in transaction rate.

  • An additional 40% reduction on the longest transaction.

  • However, at 148.42 trans/sec the transaction rate is still below the 250 rps target.

Lifting the server siege...
Transactions:               8785 hits
Availability:               100.00 %
Elapsed time:               59.19 secs
Data transferred:           1.05 MB
Response time:              0.69 secs
Transaction rate:           148.42 trans/sec
Throughput:                 0.02 MB/sec
Concurrency:                102.48
Successful transactions:    8955
Failed transactions:        0
Longest transaction:        9.44
Shortest transaction:       0.23

In addition, the Max_used_connections reported by the database during these tests reached 199, which is less than the available 250, and therefore there is additional capacity.

+---------------------------+---------------------+
| Variable_name             | Value               |
+---------------------------+---------------------+
| Max_used_connections      | 199                 |
| Max_used_connections_time | 2022-10-11 15:30:55 |
+---------------------------+---------------------+

Test 4 - 500 concurrent users

In a final test, increasing max_connections to 400 and the concurrent siege users to 500 (the default is 255), we reach 210 trans/sec without 5xx responses but with a slight increase in latency.

With the target throughput almost reached we can look at the resources required by a single POD, which are:

  • 800m CPU time for the istio-proxy and 200m for the flights container

  • 800Mi memory for istio-proxy and 45Mi for the flights container

For a further understanding of the needs and capabilities of the environment, contrast these measurements against the expected performance of Istio CPU and memory consumption.
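One way to feed these measurements back into the Deployment is via the sidecar resource annotations and the container resources section. The fragment below is a sketch only: it assumes the upstream sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory injection annotations are honored by the OSSM version in use, and it simply reuses the values observed above as requests.

    # Deployment fragment (sketch) - sizing the flights Pod from the observed values
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/proxyCPU: 800m       # observed istio-proxy CPU under load
            sidecar.istio.io/proxyMemory: 800Mi   # observed istio-proxy memory under load
        spec:
          containers:
          - name: flights
            resources:
              requests:
                cpu: 200m                         # observed flights container CPU
                memory: 45Mi                      # observed flights container memory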

Following the same technique the remaining components in the flow can be tuned and instances scaled out to reach the desired throughput. In addition, with this information the Application and Platform teams can start calculating capacity in the mesh and cluster.

What to monitor in the data plane

Following the example of how to test the performance of the data plane we proceed to determine what to monitor in order to make sizing decisions.

  1. Istio, on which OSSM is based, defines a list of metrics which we can monitor for HTTP, HTTP/2, gRPC and TCP traffic. In particular:

    • istio_requests_total, a COUNTER measuring the total number of requests

    • istio_request_duration_milliseconds, a DISTRIBUTION measuring the latency of requests

      • In addition to monitoring for successful responses (response_code=200) this metric can also be used to monitor failed requests which may be increasing due to performance issues (ie. istio_request_duration_milliseconds_bucket{response_code="400"}, istio_request_duration_milliseconds_bucket{response_code="503"}).

        The grafana and kiali observability components (as does the output from siege) make it possible to determine both throughput and latency.

        (figures: Grafana and Kiali throughput/latency views)

        With prometheus, alerts can be set against metrics such as the distribution of the request duration (istio_request_duration_milliseconds) in order to review and accordingly tune the data plane (a hedged alert-rule sketch follows this list).

    • Needs for tuning between services with DestinationRules and configured pool connections may be uncovered when monitoring client latency, averaged over the past minute, by source and destination service names and namespaces:

      histogram_quantile(0.95,
        sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
        by (
          destination_canonical_service,
          destination_workload_namespace,
          source_canonical_service,
          source_workload_namespace,
          le
        )
      )
  2. Tuning of the individual container resources is equally important. The script provided during the tuning exercise offers a means of retrieving the CPU/Memory of the istio-proxy and main workload containers, whilst prometheus also exposes the envoy memory metrics (envoy_server_memory_allocated{app="gto-external-ingressgateway"}, envoy_server_memory_heap_size{app="gto-external-ingressgateway"}). The Envoy admin endpoint can also be polled directly:

    while true; do oc exec gto-external-ingressgateway-5d9b4c5b6d-8ddqt -n prod-istio-system -- curl -s localhost:15000/memory; sleep 5; done
    {
      "allocated": "54066928",
      "heap_size": "128974848",
      "pageheap_unmapped": "0",
      "pageheap_free": "12517376",
      "total_thread_cache": "29052632",
      "total_physical_bytes": "131989504"
    }
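As an illustration of the alerting approach mentioned in point 1, the following is a minimal PrometheusRule sketch that fires when the 95th percentile request duration towards the flights service stays above one second; the rule name, namespace and threshold are assumptions, and it presumes a Prometheus Operator instance is watching the chosen namespace.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: mesh-data-plane-latency      # hypothetical name
      namespace: prod-istio-system       # assumption: namespace watched by the mesh Prometheus
    spec:
      groups:
      - name: data-plane
        rules:
        - alert: FlightsP95LatencyHigh
          expr: |
            histogram_quantile(0.95,
              sum(rate(istio_request_duration_milliseconds_bucket{destination_canonical_service="flights"}[5m])) by (le)
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: flights p95 request duration has been above 1s for 5 minutes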

Data Plane Tuning Advice

Normal HA Microservice Guidelines affect performance within a Service Mesh and therefore need to be taken into account in addition to tuning the data plane. They include:

  • POD Priority and Preemption (most important PODs have scheduling priority)

  • Configure Liveness, Readiness, Startup probes

  • Realistic compute resources set for containers (use existing known limits for each container) and autoscaling (HPA) settings.

  • Deployment Strategy selection (RollingUpdate rollout strategy with maxUnavailable=1 and maxSurge=0)

  • Tuning and configuration of application/database managed connection pools (beyond the sidecar).

Proxy (Envoy) tuning would include:

  • increasing proxy concurrency when it is too low. This can be achieved by increasing the worker threads on the envoy (default=2), which can improve throughput.

  • upgrading traffic to HTTP2 as multiplexing several requests over the same connection avoids new connection creation overheads.

  • tuning the pool connections via Istio configurations can also improve the performance of the network (a DestinationRule sketch follows this list). Specifically monitor the:

    • Number of client connections

    • Target request rate

  • An additional tuning which can affect both the data plane and the control plane is the size of the configuration used by the proxy. This grows linearly as more services are added to the mesh. As this configuration needs to be transferred to, accepted and maintained by each proxy, it is important that only the necessary configs reach a particular proxy.
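    To make the HTTP/2 and connection pool points above concrete, the following DestinationRule is a minimal sketch; the resource name, namespace, host and limit values are illustrative assumptions rather than measured settings.

      apiVersion: networking.istio.io/v1beta1
      kind: DestinationRule
      metadata:
        name: flights-pool                 # hypothetical name
        namespace: travel-agency           # assumption: the application namespace
      spec:
        host: flights.travel-agency.svc.cluster.local   # assumption: FQDN of the flights service
        trafficPolicy:
          connectionPool:
            tcp:
              maxConnections: 200              # cap upstream TCP connections
            http:
              http1MaxPendingRequests: 100     # queue size for pending HTTP/1.1 requests
              http2MaxRequests: 500            # cap concurrent requests over HTTP/2
              h2UpgradePolicy: UPGRADE         # upgrade eligible HTTP/1.1 traffic to HTTP/2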

Observability optimizations (we shall look at these during control plane tuning), such as reducing trace sampling rates, can also significantly improve throughput.
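A sketch of such an optimization in the SMCP, assuming the spec.tracing.sampling field (a value out of 10000) is available in the installed SMCP v2 API; the resource name is hypothetical.

    apiVersion: maistra.io/v2
    kind: ServiceMeshControlPlane
    metadata:
      name: production                   # assumption: name of the production SMCP
      namespace: prod-istio-system
    spec:
      tracing:
        type: Jaeger
        sampling: 500                    # 500 out of 10000 = 5% of requests traced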

Tuning for high-throughput demands

For very high-throughput demands from workloads in the mesh consider:

  • placing the Ingress/Egress Gateway PODS in dedicated Kubernetes nodes and possibly split for SNI proxies.

  • tuning the appropriate balance between worker threads (scale up), based also on the number of cores available on the node, versus increasing the number of such pods (scale out), in order to match the necessary requirements

  • limiting the number of connections (connection_limit) on overloaded listeners (downstream connections) to improve loadbalancing between available pods

  • loadbalancing between multiple threads on the sidecar may not be so efficiently applied. Add the following annotation:

          annotations:
            proxy.istio.io/config: |
              proxyStatsMatcher:
                inclusionRegexps:
                - ".*_cx_.*"
    • and check the distribution of connections across the different downstream/upstream threads (see starving threads)

      oc exec <POD NAME> -c istio-proxy -- curl localhost:15000/stats | grep worker
      ...
      listener.0.0.0.0_8000.worker_0.downstream_cx_active: 1
      listener.0.0.0.0_8000.worker_0.downstream_cx_total: 4
      listener.0.0.0.0_8000.worker_1.downstream_cx_active: 0
      listener.0.0.0.0_8000.worker_1.downstream_cx_total: 0
      listener.0.0.0.0_8000.worker_2.downstream_cx_active: 0
      listener.0.0.0.0_8000.worker_2.downstream_cx_total: 1
      listener.0.0.0.0_8000.worker_3.downstream_cx_active: 0
      listener.0.0.0.0_8000.worker_3.downstream_cx_total: 1
    • A LEAST_CONN rather than ROUND_ROBIN loadbalancing policy in the DestinationRules can also help with more efficient placement of requests (a minimal sketch follows).
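      A minimal sketch of the LEAST_CONN policy mentioned above; the resource name and host are assumptions.

        apiVersion: networking.istio.io/v1beta1
        kind: DestinationRule
        metadata:
          name: flights-least-conn         # hypothetical name
        spec:
          host: flights.travel-agency.svc.cluster.local   # assumption: FQDN of the target service
          trafficPolicy:
            loadBalancer:
              simple: LEAST_CONN           # prefer endpoints with fewer active connections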

Sizing the Control Plane

The main outcome for a control plane tuning exercise should be the answer to the following questions:

  • Can the control plane support the data plane, ie. can it keep it up-to-date with the latest configurations at an acceptable rate?

  • How much more data plane capacity can it handle?

  • What are the required resources for the observability stack?

istiod metrics to monitor

The answer to these questions can be extracted by focusing on a number of metrics:

  • pilot_xds: The number of endpoints connected to this pilot (istiod) using xDS or simply clients who need to be kept up-to-date by the control plane.

    (figure: pilot_xds connected endpoints over time)

    If istiod is using memory or CPU more heavily than usual check if there has been an increase of xDS clients and adjust either the resource limits for pilot or the replicas of the pilot (istiod) deployment instances.

  • pilot_xds_pushes: The count of xDS messages sent, as well as errors building or sending xDS messages. What we are looking for from this metric is throughput and errors in distributing the configurations. The rate of xDS pushes increases with the number of clients connected to pilot (istiod) as well as with the number of pilot configuration changes. You can group this metric by the type tag to count xDS pushes by API (e.g. eds or rds); if there are errors, pilot records this metric with a different type.

    • If high pilot demand is a problem adjust either the resource limits for pilot or replicas of the pilot(istiod) deployment instances.

    • It is also possible to set the PILOT_PUSH_THROTTLE environment variable for istiod, reducing the maximum number of concurrent pushes from the default of 100 (see the SMCP sketch in the istiod Tuning Advice section below).

  • pilot_proxy_convergence_time: The time it takes for pilot to push new configurations to the Envoy proxies (in milliseconds). Once more this is an indication of the increase/decrease of pilot (istiod) performance in pushing new configurations. The speed of this operation depends on the size of the configuration being pushed to the Envoy proxies (istio-proxy), but it is necessary for keeping each proxy up to date with the routes, endpoints, and listeners in the mesh. Monitor that it is kept at a reasonable level (eg. increase(pilot_proxy_convergence_time_sum[30m]) / increase(pilot_proxy_convergence_time_count[30m])).

    • Increase of the clients handled by a single istiod can hurt this metric, therefore increasing replicas of istiod by applying appropriate HPA policies would help here.

    • An increase in the PODs that are part of the data plane also results in a larger configuration (dependent on how many clusters, routes, listeners and endpoints) being transferred to each sidecar. Separating the mesh, ie. ensuring configurations are only visible to the appropriate namespaces (a Sidecar resource sketch follows this list), separating unrelated services into different meshes, or excluding services from the mesh would be some solutions.
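      One way to limit what each proxy sees, as described above, is a namespace-scoped Sidecar resource. The following is a sketch; the namespace names are assumptions for a typical application namespace in this mesh.

        apiVersion: networking.istio.io/v1beta1
        kind: Sidecar
        metadata:
          name: default
          namespace: travel-agency           # assumption: an application namespace in the mesh
        spec:
          egress:
          - hosts:
            - "./*"                          # only services in the same namespace
            - "prod-istio-system/*"          # plus the control plane / gateway namespace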

istiod sizing

In the Travel Agency production service mesh the configuration includes 10 services, 67 xDS cluster configurations and 83 Endpoint configurations. Adding new namespaces and services increases the demands on istiod as follows:

  • Adding 1 namespace with 8 new services results in the addition of 7 new xDS clusters and 14 endpoints and the pilot_xds shows 36 connected endpoints to be kept up to date.

    ./add-new-travel-services-namespaces-in-mesh.sh cp-size-1 prod-istio-system
    Table 1. istioD resource requirements

    istiod       Memory Change      CPU Change
    istiod-1     128Mi → 134Mi      2.36m → 3.0m
    istiod-2     103Mi → 130Mi      3.2m → 4.7m

  • As the connected clients are not equally distributed between the instances of istiod the total increase is attributed to the additional xDS clients and therefore we expect an increase of Memory 4.71Mi/client and CPU 0.3m/client.

  • Adding 3 additional namespaces with 24 new services results in the addition of 21 new xDS clusters and 42 endpoints and the pilot_xds shows 94 connected endpoints to be kept up to date. The increase of the data plane size has affected the istioD resource requirements as follows:

    Table 2. istioD new resource requirements

    istiod       Memory Change      CPU Change
    istiod-1     134Mi → 167Mi      3.0m → 4.5m
    istiod-2     130Mi → 142Mi      4.7m → 7.1m

  • The total increase is attributed to the additional xDS clients and therefore we expect an increase of Memory 2.14Mi/client(+1%) and CPU 0.18m/client(+3%).

With the introduction of the new xDS clients the xDS update activities have significantly increased on istiod:

  • EDS updates

    (figure: EDS update rate)
  • RDS updates

    (figure: RDS update rate)
  • In addition, the 99th percentile of configuration transfers has seen an increase in the time required, and it will be monitored along with the istiod resource utilization for possible HPA or manual scaling.

    (figure: 99th percentile of configuration push time)
  • For additional guidance on resource allocations for the control plane see the OSSM Performance and scalability.

istiod Tuning Advice

In the case that the mesh data plane increases significantly (eg. many 100s of PODs) it is advisable to:

  1. review the Deployment Model of the service mesh. For instance, choosing multi-tenancy over a single mesh per cluster, in order to have mesh instances focused on the solutions they include, will have to be evaluated.

  2. separate the visibility of the resource configurations in the same mesh by applying the Sidecar resource to segregate unrelated namespaces.

  3. apply appropriate HPA settings for the istiod components, set for a pre-defined increase of a set of new xDS clients (see the SMCP sketch below).
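A hedged SMCP fragment combining point 3 with the PILOT_PUSH_THROTTLE note from the istiod metrics section; the runtime fields are assumed to be available in the installed SMCP v2 API and all values are illustrative.

    apiVersion: maistra.io/v2
    kind: ServiceMeshControlPlane
    metadata:
      name: production                   # assumption: name of the production SMCP
      namespace: prod-istio-system
    spec:
      runtime:
        components:
          pilot:
            deployment:
              autoScaling:
                enabled: true
                minReplicas: 2
                maxReplicas: 5
                targetCPUUtilizationPercentage: 85   # scale istiod on CPU pressure
            container:
              env:
                PILOT_PUSH_THROTTLE: "50"            # reduce concurrent xDS pushes (default 100)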

Observability Stack sizing

Capacity planning for the observability stack involves the sizing of:

  • Runtime components (Kiali, Jaeger, ElasticSearch - for Jaeger Storage-, Prometheus, Grafana)

  • Persistence for long-term storage of metrics, traces, graphs etc.

The capacity requirements are directly dependent on the size of the data plane (sidecars), the number of incoming requests, and the configuration of metrics and traces capture as well as their retention period. In the Production Setup scenario we established a Final Service Mesh Production Setup based on which the production SMCP has been configured. We shall now look at whether this configuration is appropriate for the established non-functional requirements.

Prometheus sizing

During the activity to Configure Prometheus for Production a PersistentVolume of 10Gi in size was allocated to store metrics for the production environment.

In order to establish whether this allocation is sufficient to handle the expected load, consider the following expectations:

  • Traffic of 250k requests per day

  • Retention of metrics for 7 days

  • Full Istio metrics collection, ie. no Prometheus Metric Tuning has been applied.

To establish the sizing needs use the following prometheus queries:

  • prometheus_tsdb_head_samples_appended_total shows how many samples have been stored, whilst rate(prometheus_tsdb_head_samples_appended_total[1d]) gives the average rate of samples appended per second.

  • rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d]) shows the average byte size of each ingested sample.

  • Therefore, for the 7-day (604800 seconds) metrics retention period, the current total of 90908542 samples (avg 1052 samples per second) and an average byte size of 1.28 per ingested sample, the required storage space is approximately 1.21 GB:

    (604800* (rate(prometheus_tsdb_head_samples_appended_total[1d]) *
    (rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d]))))/1000000000
  • Currently, the total requests for 1 day (istio_requests_total{reporter="source"}) is almost 252286, therefore the capacity allocated will meet the expected demands.

If tuning of the prometheus metrics collection is deemed necessary, this can be applied either to the timeseries collected, eg. via Prometheus Metric Tuning, or additionally through the SMCP settings addons.prometheus.install.retention and addons.prometheus.install.scrapeInterval.
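For instance, a hedged SMCP fragment adjusting the two settings named above (the values are illustrative, matching the 7-day retention requirement):

    apiVersion: maistra.io/v2
    kind: ServiceMeshControlPlane
    metadata:
      name: production                   # assumption: name of the production SMCP
      namespace: prod-istio-system
    spec:
      addons:
        prometheus:
          install:
            retention: 7d                # keep metrics for the 7 day requirement
            scrapeInterval: 30s          # a longer scrape interval reduces sample volume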

Jaeger and ElasticSearch sizing

During the Jaeger Configuration for Production the Jaeger resource was externalized from the SMCP and configured with an ElasticSearch single node cluster storage of 1Gi in size.

The three key considerations to remember before choosing the appropriate ElasticSearch cluster settings are as follows:

  • Calculating the storage requirements.

    To calculate the index size follow How to check Elasticsearch index usage with CLI in OpenShift Container Platform. In the current jaeger-small-production Jaeger resource for the production SMCP, the size of a single shard (replica) of the index, for traces collected over 7 days, is 519MB. As the strategy is to Rollover Index, the size of 1Gi should be sufficient.

    It is crucial for the sizing calculations to take into account the sampling rate applied on the data plane. In the production SMCP the sampling rate, applied across all sidecars, is set to 5%; however, if a service specifies a different sampling rate, be aware that the sampling rate of traces is determined by the first microservice in the flow, where the span is generated, and from that point onwards it is respected by all other services in the flow.

    oc exec elasticsearch-cdm-prodistiosystemjaegersmallproduction-1-7pwcmt -c elasticsearch -- curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -X GET "https://localhost:9200/_cat/shards?v"
    
    index                     shard prirep state      docs  store ip          node
    jaeger-service-2022-10-17 0     p      STARTED      34 15.2kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-11    0     p      STARTED  911317 32.3mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-14 0     p      STARTED      38 16.4kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-15    0     p      STARTED  230716  8.2mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-18    0     p      STARTED  408303 14.1mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-16 0     p      STARTED      21  9.9kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-13 0     p      STARTED      27 26.2kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-17    0     p      STARTED  552569 19.3mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-12 0     p      STARTED      29 39.5kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-11 0     p      STARTED      36 23.1kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-12    0     p      STARTED  933030 32.9mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-14    0     p      STARTED  584263 20.5mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    .security                 0     p      STARTED       6   33kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-15 0     p      STARTED      21 21.4kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-16    0     p      STARTED  187394  6.7mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-service-2022-10-18 0     p      STARTED      33 28.6kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
    jaeger-span-2022-10-13    0     p      STARTED 1083306 38.5mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
  • Choosing the number of shards (ie. how an index is divided into primary shards and replicas)

    The second consideration is choosing the right indexing strategy for the indices. In ES, by default, every index is divided into a number of primary and replica shards (for example, with 2 primary shards and 1 replica each, the total count of shards is 4). The primary shard count for an existing index cannot be changed once created.

    A rule of thumb is to ensure that the shard size is between 10-50 GiB, and therefore a formula for calculating the approximate number of shards is:

    Number of Primary Shards = (Source Data + Room to Grow) * (1 + Indexing Overhead) / Desired Shard Size
    eg.
    with 30 GiB of data, and assuming it will not grow over time (ie. no new services added or sampling rates changed), the number of shards should be (30 * 1.1 / 20) = 2 (rounded up).
  • Choosing the instance types and testing.

    A stable ElasticSearch cluster requires the nodes to establish a quorum. The size of the quorum (3 at minimum) depends on the size of the ElasticSearch cluster (for more information see Resilience in small clusters).

Grafana Persistence sizing

Although there is no specific sizing information for grafana, it is useful to note that the persistence and runtime requirements for Grafana are affected by the number of timeseries monitored by prometheus (sum(prometheus_tsdb_head_series)) and the frequency at which the metrics are captured, as well as by the dashboards monitored.

Tuning across service mesh layers

The above information provides guidance to Application and Platform teams on uncovering capacity needs. However, in order to fine-tune a service mesh across its control plane and data plane, aspects such as TLS settings in and out of a cluster, service-to-service communication requirements, bootstrapping configuration, latency tuning, infrastructure configuration, etc., need to be well understood before arriving at a stable set of benchmarks.

Important
Next, in Day-2 - Upgrade, help the Travel Agency personnel get an understanding of the OSSM versioned components and the work involved in an upgrade.