---
# Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Gathering Metrics from Vespa"
---
<p>
Vespa processes expose APIs for <em>metrics</em> and <em>health</em>.
The default port is 8080.
The Health API is used for heartbeating.
Use the Metrics API to integrate with graphing and alerting services.
</p><p>
To find the metrics port of a service, first list the running services in a cluster with
<code><a href="vespa-cmdline-tools.html#vespa-model-inspect">vespa-model-inspect</a> services</code>,
then run
<code><a href="vespa-cmdline-tools.html#vespa-model-inspect">vespa-model-inspect</a> service &lt;service name&gt;</code>
to list the ports for the given service (e.g. <em>searchnode</em>).
</p>
<h2 id="health-api">Health API</h2>
<p>
Health status is found at <em>http://host:port/state/v1/health</em>
</p><p>
Example:
<pre>
{
"status" : {
"code" : "up",
"message" : "Everything ok here"
}
}
</pre>
The status code is either <code>up</code> or
<code>down</code>. Status <code>up</code> means that the service is fully up
and ready to serve traffic. If the page cannot be
downloaded, the service is typically assumed to be <code>down</code>.
The message is optional. It is typically empty when the service is
up, and contains a textual reason when the service is
unavailable.
</p>
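<p>
The heartbeat check described above can be sketched in Python as follows.
This is a minimal illustration, not part of Vespa: the function names
<code>parse_health</code> and <code>check_health</code>, the two-second timeout
and the fallback messages are assumptions made for the example.
It treats a page that cannot be downloaded, or a body that cannot be parsed, as <code>down</code>.
</p>

```python
import json
import urllib.error
import urllib.request


def parse_health(body: str) -> tuple[bool, str]:
    """Return (is_up, message) from a /state/v1/health JSON body."""
    try:
        status = json.loads(body)["status"]
        code = status["code"]
    except (ValueError, KeyError, TypeError):
        # Assumption: a malformed payload is handled like a failed download.
        return False, "unparseable health response"
    # The message part is optional, so default to an empty string.
    return code == "up", status.get("message", "")


def check_health(base_url: str, timeout: float = 2.0) -> tuple[bool, str]:
    """Fetch the health page; a page that cannot be downloaded counts as down."""
    url = base_url.rstrip("/") + "/state/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_health(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as error:
        return False, f"health page unavailable: {error}"
```

<p>
For example, <code>check_health("http://host:port")</code> returns a pair of
a boolean and the optional message, which a monitoring loop could log or alert on.
</p>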
<h2 id="metrics-api">Metrics API</h2>
<p>
Metrics are found at <em>http://host:port/state/v1/metrics</em>
</p><p>
Metrics are reported in snapshots, where each snapshot specifies the
time window the metrics
were gathered from. Typically, a service aggregates metrics as
they are reported. After each snapshot period, it takes a snapshot
of the current values and resets them. This approach makes it possible to
track min and max values, and to compute values like the
95th percentile, for each complete snapshot period.
</p><p>
The <code>from</code> and <code>to</code> times are specified in seconds since the Unix epoch (1970-01-01 UTC).
Milliseconds or microseconds can be added as decimals.
</p><p>
Vespa supports <a href="../jdisc/metrics.html#metrics-from-custom-components">custom metrics</a>.
</p>
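<p>
As a small illustration of the snapshot time format, the sketch below converts
the <code>from</code> and <code>to</code> fields to UTC timestamps and computes the
length of the window. The helper name <code>snapshot_window</code> is hypothetical,
and the input values are taken from the example response on this page.
</p>

```python
from datetime import datetime, timezone


def snapshot_window(snapshot: dict) -> tuple[datetime, datetime, float]:
    """Convert 'from'/'to' epoch seconds to UTC datetimes plus the window length."""
    start = datetime.fromtimestamp(snapshot["from"], tz=timezone.utc)
    end = datetime.fromtimestamp(snapshot["to"], tz=timezone.utc)
    # The decimals carry sub-second precision, so the length is a float.
    return start, end, snapshot["to"] - snapshot["from"]


start, end, seconds = snapshot_window({"from": 1334134640.089, "to": 1334134700.088})
print(start.isoformat(), end.isoformat(), round(seconds, 3))
```

<p>
For these values the window is just under 60 seconds long.
</p>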
<p>
Example:
<pre>
{
"status" : {
"code" : "up",
"message" : "Everything ok here"
},
"metrics" : {
"snapshot" : {
"from" : 1334134640.089,
"to" : 1334134700.088
},
"values" : [
{
"name" : "queries",
"description" : "Number of queries executed during snapshot interval",
"values" : {
"count" : 28,
"rate" : 0.4667
},
"dimensions" : {
"searcherid" : "x"
}
},
{
"name" : "query_hits",
"description" : "Number of documents matched per query during snapshot interval",
"values" : {
"count" : 28,
"rate" : 0.4667,
"average" : 128.3,
"min" : 0,
"max" : 10000,
"sum" : 3584,
"median" : 124.0,
"std_deviation": 5.43
},
"dimensions" : {
"searcherid" : "x"
}
}
]
}
}
</pre>
A flat list of metrics is returned. Each metric value reported by a component
should be a separate metric. For related metrics,
prefix the metric names with a common part and separate the parts with dots,
e.g. <code>memory.free</code> and <code>memory.virtual</code>.
Each metric has one or more values set. Valid values are:
</p>
<table class="table">
<thead>
<tr><th>Value</th><th>Description</th></tr>
</thead><tbody>
<tr>
<th>count</th>
<td>The number of times the metric has been set. For a count
metric, for instance one counting operations, this is the
number of operations added during the snapshot period. For a value
metric, for instance one recording the latency of operations, this is
the number of times a latency has been added to the
metric.</td>
</tr><tr>
<th>average</th>
<td>The average of all values recorded during the snapshot
period. Typically sum divided by count.</td>
</tr><tr>
<th>rate</th>
<td>The count per second (count/s).</td>
</tr><tr>
<th>min</th>
<td>The smallest value seen in this snapshot period.</td>
</tr><tr>
<th>max</th>
<td>The largest value seen in this snapshot period.</td>
</tr><tr>
<th>sum</th>
<td>The sum of all values seen in this snapshot period.</td>
</tr>
</tbody>
</table>
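<p>
To feed a graphing or alerting service, the metrics response can be flattened
into one row per metric and value name. The sketch below is one way to do that;
the function names <code>flatten_metrics</code> and <code>derived_rate</code> and the
<code>SAMPLE</code> payload (a shortened copy of the example above) are made up for
illustration. Recomputing <code>rate</code> as <code>count</code> divided by the
snapshot length is an assumption based on the "count/s" description, not a
documented guarantee.
</p>

```python
SAMPLE = {
    "metrics": {
        "snapshot": {"from": 1334134640.089, "to": 1334134700.088},
        "values": [
            {
                "name": "queries",
                "values": {"count": 28, "rate": 0.4667},
                "dimensions": {"searcherid": "x"},
            },
            {
                "name": "query_hits",
                "values": {"count": 28, "rate": 0.4667, "average": 128.3,
                           "min": 0, "max": 10000, "sum": 3584},
                "dimensions": {"searcherid": "x"},
            },
        ],
    }
}


def flatten_metrics(payload: dict) -> list[dict]:
    """Return one row per (metric, value name) pair in a metrics payload."""
    rows = []
    for metric in payload.get("metrics", {}).get("values", []):
        for value_name, value in metric.get("values", {}).items():
            rows.append({
                "name": metric["name"],
                "dimensions": metric.get("dimensions", {}),
                "value_name": value_name,
                "value": value,
            })
    return rows


def derived_rate(payload: dict, metric_name: str) -> float:
    """Recompute rate as count / snapshot length (assumed meaning of count/s)."""
    snapshot = payload["metrics"]["snapshot"]
    seconds = snapshot["to"] - snapshot["from"]
    for metric in payload["metrics"]["values"]:
        if metric["name"] == metric_name:
            return metric["values"]["count"] / seconds
    raise KeyError(metric_name)


for row in flatten_metrics(SAMPLE):
    print(row["name"], row["value_name"], row["value"])
```

<p>
Keeping the dimensions on every row lets a consumer tell apart metrics that share
a name but come from different components.
</p>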