When tuning a Service Mesh, the two parties that usually provide non-functional requirements are the Application Team and the Platform Team. They will have expressed the following types of tuning expectations:
- The Travel Agency Application Team (including Product Owners, Tech Leads, and Mesh Developers), who need to tune the application side to handle the expected customer load, have provided the following expectation:
  - Expected load is 250k requests per day, with a peak of 250 requests per second (rps).
- The Travel Agency Platform Team (including Cluster Operators, Mesh Operators, and the Platform (Application Ops) Team), who need to tune the control plane side of the Service Mesh to handle observability, ingress/egress, application runtime, and configuration needs, have the following concerns as internal cloud providers:
  - What are the best practices, sizing guides, and benchmarks that would let us anticipate the evolution of the OSSM instance and give the best experience to the project teams?
  - How do we define the maximum capacity of an OSSM instance?
  - How do we define the maximum number of applications that can join the mesh in the future?
  - Which criteria/metrics should be used for capacity rules?
  - What are the current limits of an OSSM instance on the control plane and the data plane side?
As a first practice in tuning a Service Mesh, it is important to have an architecture decision on the Deployment Model to be used with the mesh, and to have established the Purpose and Principles around its use and setup, as these determine the type of tuning required.
Furthermore, be aware that in a cloud-based environment there are many components that should be tuned (firewalls, load balancers, the container platform, etc.); the following guidance focuses only on the two major areas of Red Hat OSSM:

- The data plane, which consists of all the istio-proxy sidecars (Envoy) injected into the Pods of every workload and responsible for handling each workload's incoming traffic, as well as the ingress/egress gateway components.
- The control plane, responsible for keeping the proxies up to date with the latest configuration, certificates, etc., and the observability stack.
The following sections focus guidance on:

- How to test the performance of the mesh (the data plane's specific non-functional needs as well as the control plane components),
- How to measure sizing needs (e.g. given a set of apps and established request rates, how much storage is needed for tracing/metrics, how many istiod components to use, what CPU/RAM is required by the sidecar, etc.),
- What can be tuned (e.g. the configuration visibility, the percentage of traces collected, replicas, threads, etc.).
The Application Team will need to tune the data plane components (ingress/egress workloads and istio-proxy sidecars) for memory, CPU, and threads so that they meet the solution's latency and throughput targets.
As in any tuning exercise, correctly sizing a mesh data plane requires a set of scenarios based on real-world expected load. These scenarios provide separate load test configurations against which the components are tuned until the required output is reached. In the end, the sizing of the solution is determined by the expected performance output.
In this exercise we showcase a process of tuning the flights and mysqldb data plane components for better performance when receiving up to 250 rps from the external partner via the gto-external-ingressgateway gateway. The siege HTTP multi-threaded load testing and benchmarking utility has been employed to perform the tests.
- In preparation for the tests, relax the TLS settings for Gateway/travel-api-gateway to SIMPLE TLS by executing ./scripts/set-simple-tls-travel-api-gateway.sh
- Prepare to observe statistics and metrics around performance and resource utilization, which constitute application performance Service Level Indicators (SLIs).
  - Using the script ./scripts/containers-mem-cpu.sh "POD_NAME <mysqldb|flights|istio-proxy>" (e.g. get the filter by executing kubectl top pods --containers | sort -rk1 | grep "mysqldb-v1-56f4f9d879-gjltv"), CPU/memory resource needs can be captured for the main workload and istio-proxy containers. In 4 separate command prompts monitor:

        ./containers-mem-cpu.sh "mysqldb-v1-56f4f9d879-gjltv mysqldb"
        ./containers-mem-cpu.sh "mysqldb-v1-56f4f9d879-gjltv istio-proxy"
        ./containers-mem-cpu.sh "flights-v1-5c4bfff4b7-brwr8 flights"
        ./containers-mem-cpu.sh "flights-v1-5c4bfff4b7-brwr8 istio-proxy"
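As a hedged sketch of the core filtering step such a monitoring script could use (the lab's actual containers-mem-cpu.sh may differ), the pod/container pair can be selected from `kubectl top pods --containers` output like this:

```shell
# Keep only the requested pod/container pair from `kubectl top pods --containers`
# output. Pod and container names below are examples from this lab's output.
filter_container() {  # $1 = pod name, $2 = container name; reads lines on stdin
  awk -v p="$1" -v c="$2" '$1 == p && $2 == c { print $1, $2, $3, $4 }'
}

# Real usage would poll the cluster, e.g.:
#   while true; do kubectl top pods --containers --no-headers \
#     | filter_container "$POD" "$CONTAINER"; sleep 5; done

# Demo with canned `kubectl top` output instead of a live cluster:
printf '%s\n' \
  'mysqldb-v1-56f4f9d879-gjltv mysqldb     12m 180Mi' \
  'mysqldb-v1-56f4f9d879-gjltv istio-proxy 35m 60Mi' \
  | filter_container mysqldb-v1-56f4f9d879-gjltv istio-proxy
```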
- Prepare to monitor in PROMETHEUS the increase of istio_request_duration_milliseconds_sum over istio_request_duration_milliseconds_count (these metrics can be used to calculate the average request duration, an application SLI, over an arbitrary lookbehind window specified in the square brackets of the following query). The metric has many dimensions (e.g. response_code=200/500, reporter=source/destination, etc.); it is important to capture possible average duration increases (i.e. added latency), the throughput of requests, and whether there are any failures (e.g. response_code=500) due to resource contention.

        increase(istio_request_duration_milliseconds_sum{destination_canonical_service="flights"}[5m]) / increase(istio_request_duration_milliseconds_count{destination_canonical_service="flights"}[5m])
- Get a valid JWT TOKEN for the requests and execute a load test via the siege tool (test duration: 1 minute).

        TOKEN=$(curl -sLk --data "username=gtouser&password=gtouser&grant_type=password&client_id=istio&client_secret=bcd06d5bdd1dbaaf81853d10a66aeb989a38dd51" https://keycloak-rhsso.apps.ocp4.rhlab.de/auth/realms/servicemesh-lab/protocol/openid-connect/token | jq .access_token)
        siege -b -c500 -t60s https://gto-external-prod-istio-system.apps.ocp4.rhlab.de/flights/Tallinn --header="Authorization: Bearer $TOKEN"
- The siege results show 97.51% availability, with 7049 successful transactions vs 174 failures and a 1.13s average response time.

        Lifting the server siege...
        Transactions:                   6822 hits
        Availability:                  97.51 %
        Elapsed time:                  59.14 secs
        Data transferred:               0.83 MB
        Response time:                  1.13 secs
        Transaction rate:             115.35 trans/sec
        Throughput:                     0.01 MB/sec
        Concurrency:                  130.67
        Successful transactions:        7049
        Failed transactions:             174
        Longest transaction:           10.16
        Shortest transaction:           0.24
- In PROMETHEUS we noticed that there was:
  - a 660 ms average increase in the duration spent on requests yielding response_code=500 responses, indicating not all requests were successful as the setup could not handle the requested load,
  - a 114 ms increase in the duration spent handling requests from the travels service (which is non-partner constant traffic) with 200 responses (normally the average is 28ms), due to the added load,
  - a 473 ms increase in the duration spent handling partner requests yielding 200 responses, again due to the added load.
- In the second test, tune the istio-proxy to take advantage of 4 concurrent worker threads (up from the default of 2) in serving requests.
  - Apply to the flights and mysqldb deployments the annotation:

        annotations:
          proxy.istio.io/config: |
            concurrency: 4
  - Once the POD has been restarted, verify that the available worker threads are now 4 by executing:

        oc exec <POD-NAME> -c istio-proxy -- curl localhost:15000/stats | grep worker
        server.worker_0.watchdog_mega_miss: 0
        server.worker_0.watchdog_miss: 0
        server.worker_1.watchdog_mega_miss: 0
        server.worker_1.watchdog_miss: 0
        server.worker_2.watchdog_mega_miss: 0
        server.worker_2.watchdog_miss: 0
        server.worker_3.watchdog_mega_miss: 0
        server.worker_3.watchdog_miss: 0
- Taking the same observability actions and executing the previous siege load test, we receive the following results:

        Lifting the server siege...
        Transactions:                   8092 hits
        Availability:                  98.73 %
        Elapsed time:                  59.82 secs
        Data transferred:               0.96 MB
        Response time:                  0.80 secs
        Transaction rate:             135.27 trans/sec
        Throughput:                     0.02 MB/sec
        Concurrency:                  108.69
        Successful transactions:        8188
        Failed transactions:             104
        Longest transaction:            6.92
        Shortest transaction:           0.24
The results yield the following observations:

- With 8188 successful transactions:
  - an improvement of the throughput application SLI by 16%, and a 42% decrease in failed transactions (98.73% successes and 104 failed),
  - a 29% decrease in the response time application SLI (down to 0.8s),
  - overall a 14% increase in transaction rate and a 40% reduction of the longest transaction.
- In PROMETHEUS we observe that during this test there is:
  - a 390 ms average duration increase spent on requests yielding response_code=500 responses; there are still failed requests, but with a 41% smaller increase than in Test 1,
  - a 69 ms increase in the duration spent handling requests from the travels service (which is normal non-partner traffic) with 200 responses; again a 40% reduction compared to Test 1, which indicates we can handle more load successfully,
  - a 181 ms increase in the duration spent handling partner requests yielding 200 responses; with a 66% reduction from Test 1, we have another indicator that the change has increased the capability to handle more requests.
Overall we notice that by tuning the worker threads on the data plane for these two components we managed to increase throughput, whilst at the same time the CPU and memory utilized by the istio-proxy remain largely unchanged (see below data captured with containers-mem-cpu.sh).
One final tuning action performed is against the actual mysql database. Utilizing the mysql-credentials secret and the root user, check in the mysqldb POD for the available connections and notice that max_connections is set to 151, which has already been reached (see Max_used_connections) and presents a bottleneck. In response, tune the workload connections and repeat the tests.
    select version();
    show variables like "%max_connections%";
    show global status like "%Max_used%";
    show status like "%thread%";
    show global status like "%Aborted%";

    +------------------------+-------+
    | Variable_name          | Value |
    +------------------------+-------+
    | max_connections        | 151   |
    | mysqlx_max_connections | 100   |
    +------------------------+-------+
    +---------------------------+---------------------+
    | Variable_name             | Value               |
    +---------------------------+---------------------+
    | Max_used_connections      | 152                 |
    | Max_used_connections_time | 2022-10-11 13:08:32 |
    +---------------------------+---------------------+
Increase the mysqld max_connections to 250 (note that SET GLOBAL does not persist across a server restart; for a permanent change the setting must also be applied in the server configuration):

    set global max_connections = 250;
Following the same observability activities and executing the siege load test, the following results show:
- An additional 10% increase of throughput, with 8955 successful transactions and a 100% success rate.
- At 0.69s, an additional 14% decrease in response time.
- With 148.42 trans/sec, an additional 14% increase in transaction rate.
- An additional 40% reduction of the longest transaction.
- However, the transactions are at 148.42 trans/sec and therefore below the 250 rps target.
    Lifting the server siege...
    Transactions:                   8785 hits
    Availability:                 100.00 %
    Elapsed time:                  59.19 secs
    Data transferred:               1.05 MB
    Response time:                  0.69 secs
    Transaction rate:             148.42 trans/sec
    Throughput:                     0.02 MB/sec
    Concurrency:                  102.48
    Successful transactions:        8955
    Failed transactions:              0
    Longest transaction:            9.44
    Shortest transaction:           0.23
In addition, Max_used_connections at the database during these tests reached 199, which is less than the available 250, and therefore there is additional capacity.
    +---------------------------+---------------------+
    | Variable_name             | Value               |
    +---------------------------+---------------------+
    | Max_used_connections      | 199                 |
    | Max_used_connections_time | 2022-10-11 15:30:55 |
    +---------------------------+---------------------+
In a final test, increasing max_connections=400 and the concurrent siege users to 500 (default is 255), we reach 210 trans/sec without 5xx responses but with a slight increase in latency.
With the target throughput almost reached, we can look at the resources required by a single POD, which are:

- 800m CPU time for the istio-proxy and 200m for the flights container
- 800Mi memory for istio-proxy and 45Mi for the flights container
For further understanding of the needs and capabilities of the environment, contrast these measurements against the expected performance of Istio CPU and memory consumption.
Following the same technique, the remainder of the components in the flow can be tuned and instances scaled out to reach the desired throughput. In addition, with this information the Application and Platform teams can start calculating capacity in the mesh and cluster.
Following the example of how to test the performance of the data plane, we proceed to determine what to monitor in order to make sizing decisions.
- Istio, on which OSSM is based, defines a list of metrics which we can monitor for HTTP, HTTP/2, GRPC, and TCP traffic. In particular:
  - istio_requests_total, a COUNTER measuring the total number of requests,
  - istio_request_duration_milliseconds, a DISTRIBUTION measuring the latency of requests.
    - In addition to monitoring for successful responses (response_code=200), this metric can also be used to monitor failed requests, which may be increasing due to performance issues (i.e. istio_request_duration_milliseconds_bucket{response_code="400"}, istio_request_duration_milliseconds_bucket{response_code="503"}).
    - The grafana and kiali observability components (as does the output from siege) allow determining both throughput and latency.
    - With prometheus, alerts can be set against metrics such as the distribution of the request duration (istio_request_duration_milliseconds) in order to review and tune the data plane accordingly.
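As a hedged illustration of such an alert (the rule group name, threshold, and evaluation window are assumptions chosen for illustration, not values from this lab), a Prometheus rule file could look like:

```yaml
# Illustrative alert on p99 request duration across mesh services.
groups:
- name: mesh-latency
  rules:
  - alert: HighRequestLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(istio_request_duration_milliseconds_bucket{reporter="source"}[5m]))
        by (le, destination_canonical_service)
      ) > 1000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency above 1s for {{ $labels.destination_canonical_service }}"
```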
- Needs for tuning between services with DestinationRules and configured pool connections may be uncovered when monitoring client latency, averaged over the past minute, by source and destination service names and namespace:

        histogram_quantile(0.95,
          sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
          by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)
        )
- Tuning the individual container resources is equally important. The script provided during the tuning exercise offers a means of retrieving the CPU/memory of the istio-proxy and main workload containers, whilst prometheus also exposes the Envoy memory metrics (envoy_server_memory_allocated{app="gto-external-ingressgateway"}, envoy_server_memory_heap_size{app="gto-external-ingressgateway"}).

        oc exec gto-external-ingressgateway-5d9b4c5b6d-8ddqt -n prod-istio-system -- curl -s localhost:15000/memory
        {
          "allocated": "54066928",
          "heap_size": "128974848",
          "pageheap_unmapped": "0",
          "pageheap_free": "12517376",
          "total_thread_cache": "29052632",
          "total_physical_bytes": "131989504"
        }
Normal HA microservice guidelines affect the performance within a Service Mesh and therefore need to be taken into account in addition to tuning the data plane. They include:

- POD Priority and Preemption (the most important PODs have scheduling priority),
- configuring Liveness, Readiness, and Startup probes,
- realistic compute resources set for containers (use existing known limits for each container) and autoscaling (HPA) settings,
- Deployment strategy selection (RollingUpdate rollout strategy with maxUnavailable=1 and maxSurge=0),
- application/database managed (beyond the sidecar) connection pool tuning and configuration.
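The Deployment strategy guideline above can be expressed as the following fragment of a Deployment spec (surrounding fields such as selector and template are omitted):

```yaml
# Rolling update with at most one pod down and no surge pods during rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
```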
Proxy (Envoy) tuning would include:
- Increasing application concurrency when it is too thin. This can be achieved by increasing the worker threads on the Envoy (default=2), which can improve throughput.
- Upgrading traffic to HTTP/2, as multiplexing several requests over the same connection avoids the overhead of creating new connections.
- Tuning the pool connections via Istio configurations can also improve the performance of the network. Specifically, monitor:
  - the number of client connections,
  - the target request rate.
- An additional tuning, which can affect both the data plane and the control plane, is the size of the configuration used by the proxy. This increases linearly as more services are added to the mesh. As this configuration needs to be transferred to, accepted, and maintained by each proxy, it is important that only the necessary configuration reaches a particular proxy.
Observability optimizations (we shall look at this during control plane tuning), such as reducing trace sampling rates, can also significantly improve throughput.
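The sampling rate is exposed in the SMCP under spec.tracing; a hedged fragment follows, assuming the SMCP v2 API where sampling is scaled 0-10000 (so 500 corresponds to the 5% rate mentioned later for this lab's production setup; verify field names against your SMCP version):

```yaml
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: production
  namespace: prod-istio-system
spec:
  tracing:
    type: Jaeger
    sampling: 500   # scaled in 0.01% increments: 500 = 5% of traces
```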
For very high-throughput demands from workloads in the mesh consider:
- Placing the Ingress/Egress Gateway PODs on dedicated Kubernetes nodes, and possibly splitting off SNI proxies.
- Finding the appropriate balance between worker threads (scale up), based also on the number of cores available on the node, versus increasing the number of such pods (scale out), in order to match the necessary requirements.
- Limiting the number of connections (connection_limit) on overloaded listeners (downstream connections) to improve load balancing between available pods.
- Load balancing between multiple threads on the sidecar may not be applied efficiently. Add the following annotation:

        annotations:
          proxy.istio.io/config: |
            proxyStatsMatcher:
              inclusionRegexps:
              - ".*_cx_.*"

  and check the distribution of connections to the different downstream/upstream threads (see starving threads):

        oc exec <POD NAME> -- curl localhost:15000/stats | grep worker
        ...
        listener.0.0.0.0_8000.worker_0.downstream_cx_active: 1
        listener.0.0.0.0_8000.worker_0.downstream_cx_total: 4
        listener.0.0.0.0_8000.worker_1.downstream_cx_active: 0
        listener.0.0.0.0_8000.worker_1.downstream_cx_total: 0
        listener.0.0.0.0_8000.worker_2.downstream_cx_active: 0
        listener.0.0.0.0_8000.worker_2.downstream_cx_total: 1
        listener.0.0.0.0_8000.worker_3.downstream_cx_active: 0
        listener.0.0.0.0_8000.worker_3.downstream_cx_total: 1

- A LEAST_CONN rather than ROUND_ROBIN load balancing policy in the DestinationRules can also help with more efficient placement of requests.
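A minimal sketch of such a DestinationRule, assuming the flights service lives in a travel-agency namespace (host and namespace are assumptions from this lab's naming; verify against your environment):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: flights-least-conn
  namespace: travel-agency
spec:
  host: flights.travel-agency.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN   # instead of the default ROUND_ROBIN
```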
The main outcome of a control plane tuning exercise should be the answers to the following questions:
- Can the control plane support the data plane, i.e. can it keep it up to date with the latest configurations at an acceptable rate?
- How much more data plane capacity can it handle?
- What are the required resources for the observability stack?
The answer to these questions can be extracted by focusing on a number of metrics:
- pilot_xds: the number of endpoints connected to this pilot (istiod) using xDS, or simply the clients that need to be kept up to date by the control plane. If istiod is using memory or CPU more heavily than usual, check whether there has been an increase in xDS clients and adjust either the resource limits for pilot or the replicas of the pilot (istiod) deployment.
pilot_xds_pushes
: The count of xDS messages sent, as well as errors building or sending xDS messages. What we are looking from this metric is throughput and errors in distributing the configurations. The rate of xDS pushes increases with the number of clients connected to pilot (istiod
) as well as the number of pilot configuration changes. Thepilot_xds_pushes
metric counts the messages that pilot has pushed to xDS APIs, including any errors in building or sending xDS messages. You can group this metric by the type tag to count xDS pushes by API (e.g., eds or rds)—if there are errors, pilot will record this metric with a different type.-
If high pilot demand is a problem adjust either the
resource
limits for pilot or replicas of the pilot(istiod
) deployment instances. -
It is also possible to edit the
PILOT_PUSH_THROTTLE
environment variable within foristiod
reducing the maximum number of concurrent pushes from the default of100
.
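One possible way to apply such an environment variable in OSSM is through the SMCP runtime overrides; the following is a sketch assuming a production SMCP in prod-istio-system and the maistra.io/v2 field layout (verify the field names and the chosen value against your SMCP version before use):

```yaml
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: production
  namespace: prod-istio-system
spec:
  runtime:
    components:
      pilot:
        container:
          env:
            PILOT_PUSH_THROTTLE: "50"   # illustrative value, default is 100
```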
- pilot_proxy_convergence_time: the time it takes for pilot to push new configurations to the Envoy proxies (in milliseconds). Once more, this is an indication of the increase/decrease in the performance of pilot (istiod) when pushing new configurations. The speed of this operation depends on the size of the configuration being pushed to the Envoy proxies (istio-proxy), but it is necessary for keeping each proxy up to date with the routes, endpoints, and listeners in the mesh. Monitor that it is kept at a reasonable level (e.g. increase(pilot_proxy_convergence_time_sum[30m]) / increase(pilot_proxy_convergence_time_count[30m])).
  - An increase in the clients handled by a single istiod can hurt this metric, therefore increasing the replicas of istiod by applying appropriate HPA policies would help here.
  - An increase in the PODs that are part of the data plane would also result in larger configuration (dependent on how many clusters, routes, listeners, and endpoints) being transferred to a sidecar. Separating the mesh, i.e. ensuring configurations are only visible to the appropriate namespaces, separating unrelated services into different meshes, or excluding services from the mesh, would be some solutions.
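A sketch of such an HPA follows. The deployment name istiod-production is an assumption (in OSSM the istiod deployment name includes the SMCP name), and the CPU target is illustrative; also check whether your SMCP version manages pilot replicas itself before adding a standalone HPA:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istiod-production
  namespace: prod-istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istiod-production
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```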
In the Travel Agency production service mesh the configuration includes 10 services, 67 xDS cluster configurations, and 83 endpoint configurations. Adding new namespaces and services increases the demands on istiod as follows:

- Adding 1 namespace with 8 new services results in the addition of 7 new xDS clusters and 14 endpoints, and pilot_xds shows 36 connected endpoints to be kept up to date.

        ./add-new-travel-services-namespaces-in-mesh.sh cp-size-1 prod-istio-system

        istiod     Memory Change   CPU Change
        istiod-1   128Mi → 134Mi   2.36m → 3.0m
        istiod-2   103Mi → 130Mi   3.2m → 4.7m
- As the connected clients are not equally distributed between the istiod instances, the total increase is attributed to the additional xDS clients, and therefore we expect an increase of 4.71Mi of memory and 0.3m of CPU per client.
. -
Adding 3 additional namespaces with 24 new services results in the addition of 21 new xDS clusters and 42 endpoints and the
Table 2. istioD new resource requirementspilot_xds
shows 94 connected endpoints to be kept up to date. The increase of the data plane size has affected theistioD
resource requirements as follows:istiod Memory Change CPU Change istiod-1
134Mi → 167Mi
3.0m → 4.5m
istiod-2
130Mi → 142Mi
4.7m - 7.1m
- The total increase is attributed to the additional xDS clients, and therefore we expect an increase of 2.14Mi of memory (+1%) and 0.18m of CPU (+3%) per client.
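With a per-client cost measured, future control plane demands can be extrapolated. The calculation below uses the per-client figures from the last measurement; the count of 100 additional xDS clients is a hypothetical planning figure, not a measured value:

```shell
# Back-of-the-envelope extrapolation: extra istiod resources for n new xDS clients.
per_client_mem_mi=2.14   # Mi per client, from the measurement above
per_client_cpu_m=0.18    # millicores per client, from the measurement above
new_clients=100          # hypothetical planning figure
awk -v m="$per_client_mem_mi" -v c="$per_client_cpu_m" -v n="$new_clients" \
  'BEGIN { printf "Estimated extra istiod memory: %.0fMi, CPU: %.0fm\n", m*n, c*n }'
```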
With the introduction of new xDS clients, the xDS update activities have significantly increased on the istiod:

- EDS updates
- RDS updates
- In addition, the 99th percentile of configuration transfer time has increased, and it will be monitored along with the istiod resource utilization for possible HPA or manual scaling.
- For additional guidance on resource allocations for the control plane see the OSSM Performance and scalability documentation.
In the case that the mesh data plane increases significantly (e.g. many hundreds of PODs), it is advisable to:

- Review the Deployment Model of the service mesh. For instance, choosing multi-tenancy over a single mesh in a cluster, in order to have mesh clusters focused on the solutions they include, will have to be evaluated.
- Separate the visibility of the resource configurations in the same mesh by applying the Sidecar resource to segregate unrelated namespaces.
- Set appropriate HPA settings for the istiod components for a pre-defined increase of a set of new xDS clients.
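A minimal Sidecar resource sketch that restricts configuration visibility for one data plane namespace (the namespace names are assumptions based on this lab's setup):

```yaml
# Workloads in travel-agency only receive xDS configuration for their own
# namespace and the control plane namespace, shrinking per-proxy config size.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: travel-agency
spec:
  egress:
  - hosts:
    - "./*"
    - "prod-istio-system/*"
```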
Capacity planning for the observability stack involves the sizing of:

- the runtime components (Kiali, Jaeger, ElasticSearch (for Jaeger storage), Prometheus, Grafana),
- the persistence for long-term storage of metrics, traces, graphs, etc.
The capacity requirements depend directly on the size of the data plane (sidecars), the number of incoming requests, the configuration of metrics and traces capture, and their retention period. In the Production Setup scenario we established a Final Service Mesh Production Setup, based on which the production SMCP has been configured. We shall now look at whether this configuration is appropriate for the established non-functional requirements.
During the activity to Configure Prometheus for Production, a PersistenceVolume of 10Gi in size was allocated to store metrics for the production environment.
In order to establish whether this allocation is sufficient for handling the expected load, consider the following expectations:
- traffic of 250k requests per day,
- retention of metrics for 7 days,
- full Istio metrics collection, i.e. no Prometheus Metric Tuning has been applied.
To establish the sizing needs, use the following prometheus queries:
- prometheus_tsdb_head_samples_appended_total shows how many samples are stored, whilst rate(prometheus_tsdb_head_samples_appended_total[1d]) gives the average ingestion rate.
- rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d]) shows the average byte size of each ingested sample.
- Therefore, for 7 days (or 604800 seconds), which is the metrics retention period, with the current total of 90908542 requests (avg 1052 samples per second) and an average ingested sample size of 1.28 bytes, the result is that 1.21 GB of storage is required.

        (604800 * (rate(prometheus_tsdb_head_samples_appended_total[1d]) * (rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d])))) / 1000000000
- Currently, the total requests for 1 day (istio_requests_total{reporter="source"}) is almost at 252286, therefore the capacity allocated will meet the expected demands.
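The storage formula above can also be evaluated offline. The sample rate and bytes-per-sample below are illustrative placeholders, not this lab's measured values; substitute the numbers returned by the Prometheus queries for your environment:

```shell
# Estimated TSDB storage = retention * samples/sec * bytes/sample.
retention_seconds=604800   # 7-day retention
samples_per_second=2000    # hypothetical: rate(prometheus_tsdb_head_samples_appended_total[1d])
bytes_per_sample=1.0       # hypothetical: compaction chunk bytes / chunk samples
awk -v r="$retention_seconds" -v s="$samples_per_second" -v b="$bytes_per_sample" \
  'BEGIN { printf "Estimated TSDB storage: %.2f GB\n", r*s*b/1e9 }'
```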
If tuning of the prometheus metrics collection is deemed necessary, it can be applied either to the timeseries collected (e.g. Prometheus Metric Tuning) or additionally in the SMCP settings for addons.prometheus.install.retention and addons.prometheus.install.scrapeInterval.
During the Jaeger Configuration for Production, the Jaeger resource was externalized from the SMCP and configured with an ElasticSearch single-node cluster storage of 1Gi in size.
The three key components to consider before choosing the appropriate ElasticSearch cluster settings are as follows:
- Calculating the storage requirements.
  To calculate the index size, follow How to check Elasticsearch index usage with CLI in OpenShift Container Platform. In the current jaeger-small-production Jaeger resource for the production SMCP, the size of a single shard (replica) of the index, for traces collected over 7 days, is 519MB. As the strategy is to Rollover Index, the size of 1Gi should be sufficient.
  It is crucial for the sizing calculations to take into account the sampling rate applied on the data plane. In the production SMCP the sampling rate is applied across all sidecars, set to 5%; however, in the case that a service contains a different sampling rate, it is important to be aware that the sampling rate of traces is determined by the first microservice in the flow, where the span is generated, and from that point onwards it is respected by all other services in the flow.

        oc exec elasticsearch-cdm-prodistiosystemjaegersmallproduction-1-7pwcmt -c elasticsearch -- curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -X GET "https://localhost:9200/_cat/shards?v"
        index                     shard prirep state     docs    store  ip          node
        jaeger-service-2022-10-17 0     p      STARTED        34 15.2kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-11    0     p      STARTED    911317 32.3mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-14 0     p      STARTED        38 16.4kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-15    0     p      STARTED    230716  8.2mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-18    0     p      STARTED    408303 14.1mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-16 0     p      STARTED        21  9.9kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-13 0     p      STARTED        27 26.2kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-17    0     p      STARTED    552569 19.3mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-12 0     p      STARTED        29 39.5kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-11 0     p      STARTED        36 23.1kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-12    0     p      STARTED    933030 32.9mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-14    0     p      STARTED    584263 20.5mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        .security                 0     p      STARTED         6   33kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-15 0     p      STARTED        21 21.4kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-16    0     p      STARTED    187394  6.7mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-service-2022-10-18 0     p      STARTED        33 28.6kb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
        jaeger-span-2022-10-13    0     p      STARTED   1083306 38.5mb 10.130.0.18 elasticsearch-cdm-prodistiosystemjaegersmallproduction-1
- Choosing the number of shards (i.e. the number of replications of an index).
  The second component to consider is choosing the right indexing strategy for the indices. In ElasticSearch, by default, every index is divided into a number of primary shards and their replicas (for example, with 2 primary shards and 1 replica each, the total count of shards is 4). The primary shard count of an existing index cannot be changed once created.
  A rule of thumb is to ensure that the shard size is between 10 and 50 GiB, and therefore a formula for calculating the approximate number of shards is:

        Number of Primary Shards = (Source Data + Room to Grow) * (1 + Indexing Overhead) / Desired Shard Size

  e.g. with 30 GiB of data, and whilst we do not expect it to grow over time (i.e. no new services added or sampling rates changed), the number of shards should be (30 * 1.1 / 20) = 2.
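The rule of thumb above can be reproduced as a small calculation (30 GiB of data, 10% indexing overhead, and a 20 GiB target shard size are the example values from the text; the result is rounded up to a whole shard count):

```shell
# shards = ceil(data * (1 + overhead) / target_shard_size)
awk -v data=30 -v overhead=0.1 -v target=20 \
  'BEGIN { n = data * (1 + overhead) / target; print (n == int(n)) ? n : int(n) + 1 }'
```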
- Choosing the instance types and testing.
  A stable ElasticSearch cluster requires the nodes to establish a quorum. The size of the quorum (3 at minimum) depends on the size of the ElasticSearch cluster (for more information see Resilience in small clusters).
Although there is no specific sizing information on grafana, it is useful to note that the persistence and runtime requirements for Grafana are affected by the number of timeseries monitored by prometheus (sum(prometheus_tsdb_head_series)) and the frequency at which the metrics are captured, as well as by the dashboards monitored.
The above information provides guidance to Application and Platform teams on uncovering capacity needs. However, in order to fine-tune a service mesh across the control plane and the data plane, aspects such as TLS settings in and out of a cluster, service-to-service communication requirements, bootstrapping configuration, latency tuning, infrastructure configuration, etc., need to be well understood before arriving at a stable set of benchmarks.
Important: Next, in Day-2 - Upgrade, help the Travel Agency personnel get an understanding of the OSSM versioned components and the work involved in an upgrade.