
fiber.top() #2694

Closed
kostja opened this issue Aug 17, 2017 · 1 comment
Labels: feature A new functionality
@kostja
Contributor

kostja commented Aug 17, 2017

Implement fiber.top(), which returns a list of fibers ordered by average cpu usage over the last second.

Arguments:

  • seconds
  • number of fibers
@kostja kostja added this to the wishlist milestone Oct 27, 2017
@kostja kostja added feature A new functionality and removed feature A new functionality labels Feb 13, 2018
@kyukhin kyukhin modified the milestones: wishlist, 2.3.1 Aug 21, 2019
sergepetrenko added a commit that referenced this issue Sep 20, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694
sergepetrenko added a commit that referenced this issue Oct 3, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 1:
    cpu average (%): 10.779696493982
    cpu instant (%): 10.435256168573
  115:
    cpu average (%): 5.4571279061075
    cpu instant (%): 5.9653973440576
  120:
    cpu average (%): 21.944382148464
    cpu instant (%): 23.849021825646
  116:
    cpu average (%): 8.6603872318158
    cpu instant (%): 9.6812031335093
  119:
    cpu average (%): 21.933168871944
    cpu instant (%): 20.007540530351
  cpu misses: 0
  118:
    cpu average (%): 19.342901995963
    cpu instant (%): 16.932679820703
  117:
    cpu average (%): 11.549674814981
    cpu instant (%): 13.128901177161
...
```
In the table above, keys are fiber ids (plus a single 'cpu misses' key,
which indicates the number of times the tx thread was rescheduled on a
different cpu core; more on that later).
The two metrics available for each fiber are:
1) cpu instant (per cent), which indicates the share of time the fiber
was executing during the previous event loop iteration;
2) cpu average (per cent), which is calculated as an exponential moving
average of `cpu instant` values over all previous event loop iterations.
sergepetrenko added a commit that referenced this issue Oct 8, 2019
Add a wrapper around the swim test function passed to swim_run_test
which breaks out of the default cord event loop. Otherwise we would hang
indefinitely in ev_run whenever there is a started watcher.

Found during work on #2694
sergepetrenko added a commit that referenced this issue Oct 8, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 1:
    cpu average (%): 10.779696493982
    cpu instant (%): 10.435256168573
  115:
    cpu average (%): 5.4571279061075
    cpu instant (%): 5.9653973440576
  120:
    cpu average (%): 21.944382148464
    cpu instant (%): 23.849021825646
  116:
    cpu average (%): 8.6603872318158
    cpu instant (%): 9.6812031335093
  119:
    cpu average (%): 21.933168871944
    cpu instant (%): 20.007540530351
  cpu misses: 0
  118:
    cpu average (%): 19.342901995963
    cpu instant (%): 16.932679820703
  117:
    cpu average (%): 11.549674814981
    cpu instant (%): 13.128901177161
...
```
In the table above, keys are fiber ids (plus a single 'cpu misses' key,
which indicates the number of times the tx thread was rescheduled on a
different cpu core; more on that later).
The two metrics available for each fiber are:
1) cpu instant (per cent), which indicates the share of time the fiber
was executing during the previous event loop iteration;
2) cpu average (per cent), which is calculated as an exponential moving
average of `cpu instant` values over all previous event loop iterations.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the last measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.
sergepetrenko added a commit that referenced this issue Oct 17, 2019
sergepetrenko added a commit that referenced this issue Oct 17, 2019
@sergepetrenko
Collaborator

sergepetrenko commented Oct 25, 2019

Performance measurements with and without fiber.top():
Invoked with `tarantool test.lua num_fibers`.
The test measures context switches per second with `num_fibers` fibers, each running a loop `while true do fiber.yield() end`.

The results are averaged over 5 runs.

| num_fibers | default version (csw/sec) | fiber.top() version (csw/sec) |
|-----------:|--------------------------:|------------------------------:|
| 1 | 484 266 | 429 150 (- 11.3 %) |
| 10 | 2 006 201 | 1 728 187 (- 13.9 %) |
| 100 | 2 950 172 | 2 467 650 (- 16.4 %) |
| 1000 | 2 400 037 | 2 020 417 (- 15.8 %) |
| 10000 | 8 824 | 8 910 (+ 1.0 %) |

File used to test fiber.top() version:

```lua
fiber = require('fiber')
num_fibers = tonumber(arg[1])
delta_time = 1

fiber_func = function()
    while true do
        fiber.yield()
    end
end

-- create fibers but do not let them run yet
for i = 1, num_fibers do
    fiber.new(fiber_func)
end

fiber.top_enable()

t_start = fiber.time()
fiber.sleep(delta_time)
t_end = fiber.time()

fiber.top_disable()

tbl = fiber.info()
csw = 0
for k, v in pairs(tbl) do
    csw = csw + v["csw"]
end

print("delta time:", t_end - t_start)
print("csw/sec:", csw / (t_end - t_start))
```

File used to test default version:

```lua
fiber = require('fiber')
num_fibers = tonumber(arg[1])
delta_time = 1

fiber_func = function()
    while true do
        fiber.yield()
    end
end

-- create fibers but do not let them run yet
for i = 1, num_fibers do
    fiber.new(fiber_func)
end

t_start = fiber.time()
fiber.sleep(delta_time)
t_end = fiber.time()

tbl = fiber.info()
csw = 0
for k, v in pairs(tbl) do
    csw = csw + v["csw"]
end

print("delta time:", t_end - t_start)
print("csw/sec:", csw / (t_end - t_start))
```

sergepetrenko added a commit that referenced this issue Oct 25, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- sched:
    average: 10.779696493982
    instant: 10.435256168573
  115:
    average: 5.4571279061075
    instant: 5.9653973440576
  120:
    average: 21.944382148464
    instant: 23.849021825646
  116:
    average: 8.6603872318158
    instant: 9.6812031335093
  119:
    average: 21.933168871944
    instant: 20.007540530351
  cpu misses: 0
  118:
    average: 19.342901995963
    instant: 16.932679820703
  117:
    average: 11.549674814981
    instant: 13.128901177161
...
```
In the table above, keys are fiber ids (the only exceptions are the
`sched` fiber, which is indexed by name, and a single 'cpu misses' key,
which indicates the number of times the tx thread was rescheduled on a
different cpu core; more on that later).
The two metrics available for each fiber are:
1) instant (per cent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (per cent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.

At the moment, fiber.top() doesn't work on the ARM architecture.

fiber.top() is disabled by default. To enable it, issue
`fiber.top_enable()`. You can disable it again later using
`fiber.top_disable()`.
sergepetrenko added a commit that referenced this issue Oct 25, 2019
sergepetrenko added a commit that referenced this issue Oct 28, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 104/lua:
    instant: 9.8578003709556
    average: 7.1213095296968
  103/lua:
    instant: 10.242495019578
    average: 7.2647247910588
  107/lua:
    instant: 33.337913031531
    average: 34.602640818951
  101/on_shutdown:
    instant: 0
    average: 0
  105/lua:
    instant: 10.558494195233
    average: 7.4526482369814
  106/lua:
    instant: 21.19255341073
    average: 23.668463478562
  102/interactive:
    instant: 0
    average: 0
  cpu misses: 0
  1/sched:
    instant: 14.810743971972
    average: 19.494584837545
...

```
In the table above, keys are strings containing fiber ids and names
(the only exception is a single 'cpu misses' key, which indicates the
number of times the tx thread was rescheduled on a different cpu core;
more on that later).
The two metrics available for each fiber are:
1) instant (per cent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (per cent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.

At the moment, fiber.top() doesn't work on the ARM architecture.

fiber.top() is disabled by default. To enable it, issue
`fiber.top_enable()`. You can disable it again later using
`fiber.top_disable()`.
sergepetrenko added a commit that referenced this issue Oct 29, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 104/lua:
    instant: 9.8578003709556
    average: 7.1213095296968
  103/lua:
    instant: 10.242495019578
    average: 7.2647247910588
  107/lua:
    instant: 33.337913031531
    average: 34.602640818951
  101/on_shutdown:
    instant: 0
    average: 0
  105/lua:
    instant: 10.558494195233
    average: 7.4526482369814
  106/lua:
    instant: 21.19255341073
    average: 23.668463478562
  102/interactive:
    instant: 0
    average: 0
  cpu misses: 0
  1/sched:
    instant: 14.810743971972
    average: 19.494584837545
...

```
In the table above, keys are strings containing fiber ids and names
(the only exception is a single 'cpu misses' key, which indicates the
number of times the tx thread was rescheduled on a different cpu core;
more on that later).
The two metrics available for each fiber are:
1) instant (per cent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (per cent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.

At the moment, fiber.top() doesn't work on the ARM architecture.

Please note that enabling fiber.top() slows down fiber switching by
about 15 per cent, so it is disabled by default.
To enable it, issue `fiber.top_enable()`.
Once you have finished debugging, you can disable it again using
`fiber.top_disable()`.
sergepetrenko added a commit that referenced this issue Nov 1, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 104/lua:
    instant: 18.433514726042
    time: 0.677505865
    average: 21.98826143184
  103/lua:
    instant: 19.131392015951
    time: 0.689521917
    average: 20.807772656431
  107/lua:
    instant: 18.624600174469
    time: 0.681585168
    average: 17.78194117452
  101/on_shutdown:
    instant: 0
    time: 0
    average: 0
  105/lua:
    instant: 18.562289702156
    time: 0.682085309
    average: 15.513811055476
  106/lua:
    instant: 18.441822789017
    time: 0.677320271
    average: 15.427595583115
  102/interactive:
    instant: 0
    time: 0.000367182
    average: 0
  cpu misses: 0
  1/sched:
    instant: 6.8063805923649
    time: 0.253035056
    average: 8.3479789103691
...

```
In the table above, keys are strings containing fiber ids and names
(the only exception is a single 'cpu misses' key, which indicates the
number of times the tx thread was rescheduled on a different cpu core;
more on that later).
The three metrics available for each fiber are:
1) instant (per cent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (per cent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations;
3) time (seconds), which estimates how much cpu time the fiber has spent
processing during its lifetime.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.

At the moment, fiber.top() doesn't work on the ARM architecture.

Please note that enabling fiber.top() slows down fiber switching by
about 15 per cent, so it is disabled by default.
To enable it, issue `fiber.top_enable()`.
Once you have finished debugging, you can disable it again using
`fiber.top_disable()`.
sergepetrenko added a commit that referenced this issue Nov 1, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has two
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, and current cpu consumption,
which shows fiber's cpu usage over the last event loop iteration.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 104/lua:
    instant: 18.433514726042
    time: 0.677505865
    average: 21.98826143184
  103/lua:
    instant: 19.131392015951
    time: 0.689521917
    average: 20.807772656431
  107/lua:
    instant: 18.624600174469
    time: 0.681585168
    average: 17.78194117452
  101/on_shutdown:
    instant: 0
    time: 0
    average: 0
  105/lua:
    instant: 18.562289702156
    time: 0.682085309
    average: 15.513811055476
  106/lua:
    instant: 18.441822789017
    time: 0.677320271
    average: 15.427595583115
  102/interactive:
    instant: 0
    time: 0.000367182
    average: 0
  cpu misses: 0
  1/sched:
    instant: 6.8063805923649
    time: 0.253035056
    average: 8.3479789103691
...

```
In the table above, keys are strings containing fiber ids and names
(the only exception is a single 'cpu misses' key, which indicates the
number of times the tx thread was rescheduled on a different cpu core;
more on that later).
The three metrics available for each fiber are:
1) instant (per cent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (per cent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations;
3) time (seconds), which estimates how much cpu time the fiber has spent
processing during its lifetime.

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even turn out negative).
When the tx thread is rescheduled to a different cpu core, tarantool
simply assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the `cpu misses`
value, the lower the precision of fiber.top() results.

At the moment, fiber.top() doesn't work on the ARM architecture.

Please note that enabling fiber.top() slows down fiber switching by
about 15 per cent, so it is disabled by default.
To enable it, issue `fiber.top_enable()`.
Once you have finished debugging, you can disable it again using
`fiber.top_disable()`.
A "time" entry is also added to each fiber's output in fiber.info()
(it duplicates the "time" entry from fiber.top()).
Note that "time" is only counted while fiber.top is enabled.
sergepetrenko added a commit that referenced this issue Nov 1, 2019
sergepetrenko added a commit that referenced this issue Nov 1, 2019
Implement a new function in Lua fiber library: top(). It returns a table
of alive fibers (including the scheduler). Each table entry has three
fields: average cpu consumption, which is calculated with exponential
moving average over event loop iterations, current cpu consumption,
which shows the fiber's cpu usage over the last event loop iteration,
and the cpu time the fiber has used up during its lifespan.
The patch relies on CPU timestamp counter to measure each fiber's time
share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at the example:
```
tarantool> fiber.top()
---
- 104/lua:
    instant: 18.433514726042
    time: 0.677505865
    average: 21.98826143184
  103/lua:
    instant: 19.131392015951
    time: 0.689521917
    average: 20.807772656431
  107/lua:
    instant: 18.624600174469
    time: 0.681585168
    average: 17.78194117452
  101/on_shutdown:
    instant: 0
    time: 0
    average: 0
  105/lua:
    instant: 18.562289702156
    time: 0.682085309
    average: 15.513811055476
  106/lua:
    instant: 18.441822789017
    time: 0.677320271
    average: 15.427595583115
  102/interactive:
    instant: 0
    time: 0.000367182
    average: 0
  cpu misses: 0
  1/sched:
    instant: 6.8063805923649
    time: 0.253035056
    average: 8.3479789103691
...

```
In the table above, keys are strings containing fiber ids and names
(the only exception is the single 'cpu misses' key, which indicates the
number of times the tx thread was rescheduled on a different cpu core;
more on that later).
The three metrics available for each fiber are:
1) instant (percent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (percent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations;
3) time (seconds), which estimates how much cpu time the fiber spent
processing during its lifetime.
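How the three metrics relate can be sketched in illustrative Python (the `FiberStat` class and its parameters are hypothetical; Tarantool does this accounting in C with TSC deltas, and the 15/16 smoothing factor matches the EMA formula it uses):

```python
# Sketch of the per-fiber metrics: "instant" is the share of the last
# event loop iteration, "average" is an exponential moving average of
# per-iteration clocks, "time" is the accumulated cpu time.
class FiberStat:
    def __init__(self):
        self.clock_delta = 0   # clocks consumed this iteration
        self.clock_ema = 0.0   # smoothed clocks per iteration
        self.time = 0.0        # total cpu seconds

    def end_of_loop_iteration(self, cord_delta, cord_ema, clocks_per_sec):
        """Fold this iteration's delta into the stats; return (instant, average)."""
        self.time += self.clock_delta / clocks_per_sec
        # Same smoothing as the thread-wide EMA: 15/16 old + 1/16 new.
        self.clock_ema = self.clock_ema * 15 / 16 + self.clock_delta / 16
        instant = 100.0 * self.clock_delta / cord_delta if cord_delta else 0.0
        average = 100.0 * self.clock_ema / cord_ema if cord_ema else 0.0
        self.clock_delta = 0
        return instant, average

s = FiberStat()
s.clock_delta = 50
instant, average = s.end_of_loop_iteration(cord_delta=100, cord_ema=100.0,
                                           clocks_per_sec=1000.0)
assert instant == 50.0            # half of the last iteration
```

Dividing by the cord-wide totals is what turns raw clock counts into the percentages shown by `fiber.top()`.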

More info on the `cpu misses` field returned by `fiber.top()`:
`cpu misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even be negative).
When the tx thread is rescheduled to a different cpu core, tarantool
just assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the
`cpu misses` value, the lower the precision of fiber.top() results.

fiber.top() doesn't work on the ARM architecture at the moment.

Please note that enabling fiber.top() slows down fiber switching by
about 15 percent, so it is disabled by default.
To enable it, issue `fiber.top_enable()`.
You can disable it again after you have finished debugging using
`fiber.top_disable()`.
A "time" entry is also added to each fiber's output in fiber.info()
(it duplicates the "time" entry from fiber.top()).
Note that "time" is only counted while fiber.top is enabled.
sergepetrenko added a commit that referenced this issue Nov 5, 2019
Implement a new function in the Lua fiber library: top(). It returns a
table containing fiber cpu usage stats. The table has two entries:
"cpu_misses" and "cpu". "cpu" itself is a table listing all the alive
fibers and their cpu consumption.
The patch relies on the CPU timestamp counter to measure each fiber's
time share.

Closes #2694

@TarantoolBot document
Title: fiber: new function `fiber.top()`

`fiber.top()` returns a table of all alive fibers and lists their cpu
consumption. Let's take a look at an example:
```
tarantool> fiber.top()
---
- cpu:
    107/lua:
      instant: 30.967324490456
      time: 0.351821993
      average: 25.582738345233
    104/lua:
      instant: 9.6473633128437
      time: 0.110869897
      average: 7.9693406131877
    101/on_shutdown:
      instant: 0
      time: 0
      average: 0
    103/lua:
      instant: 9.8026528631511
      time: 0.112641118
      average: 18.138387232255
    106/lua:
      instant: 20.071174377224
      time: 0.226901357
      average: 17.077908441831
    102/interactive:
      instant: 0
      time: 9.6858e-05
      average: 0
    105/lua:
      instant: 9.2461986412164
      time: 0.10657528
      average: 7.7068458630827
    1/sched:
      instant: 20.265286315108
      time: 0.237095335
      average: 23.141537169257
  cpu_misses: 0
...

```
The two entries in a table returned by `fiber.top()` are
`cpu_misses` and `cpu`.

`cpu` itself is a table whose keys are strings containing fiber ids and
names.
The three metrics available for each fiber are:
1) instant (percent), which indicates the share of time the fiber was
executing during the previous event loop iteration;
2) average (percent), which is calculated as an exponential moving
average of `instant` values over all previous event loop iterations;
3) time (seconds), which estimates how much cpu time the fiber spent
processing during its lifetime.

More info on the `cpu_misses` field returned by `fiber.top()`:
`cpu_misses` indicates the number of times the tx thread detected it was
rescheduled on a different cpu core during the last event loop
iteration.
fiber.top() uses the cpu timestamp counter to measure each fiber's
execution time. However, each cpu core may have its own counter value
(you can only rely on counter deltas if both measurements were taken on
the same core; otherwise the delta may even be negative).
When the tx thread is rescheduled to a different cpu core, tarantool
just assumes the cpu delta was zero for the latest measurement. This
lowers the precision of the computations, so the bigger the
`cpu_misses` value, the lower the precision of fiber.top() results.

fiber.top() doesn't work on the ARM architecture at the moment.

Please note that enabling fiber.top() slows down fiber switching by
about 15 percent, so it is disabled by default.
To enable it, issue `fiber.top_enable()`.
You can disable it again after you have finished debugging using
`fiber.top_disable()`.
A "time" entry is also added to each fiber's output in fiber.info()
(it duplicates the "time" entry from fiber.top().cpu for each fiber).
Note that "time" is only counted while fiber.top is enabled.
@kyukhin kyukhin closed this as completed in 77fa45b Nov 9, 2019
sergepetrenko added a commit that referenced this issue Nov 12, 2019
fiber.top() fills in statistics every event loop iteration,
so if it was just enabled, fiber.top() may contain 'inf's and
'nan's in fiber cpu usage averages because the total time consumed by
the main thread has not yet been accounted for.
The same goes for viewing top() results for a freshly created fiber:
its metrics will be zero since it hasn't lived a full ev loop iteration
yet.
Fix this by delaying the test till top() results are meaningful, and
apply minor refactoring.

Follow-up #2694
sergepetrenko added a commit that referenced this issue Nov 15, 2019
We didn't refresh the last remembered clock on fiber.top_enable().
This means that the fiber issuing fiber.top_enable() would get a huge
chunk of cpu time on its first yield. Fix this.
Also reset clock_delta and cpu_miss_count.

If fiber.top() is issued on the same ev loop iteration as
fiber.top_enable(), clock_delta_last is 0 for both the cord and all the
fibers, so report the "instant" and "average" stats for the last
iteration as 0 instead of NaN.

Follow-up #2694
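A minimal sketch of this fix in illustrative Python (the `Cord` fields and `top_enable` signature are hypothetical stand-ins for Tarantool's C struct members):

```python
# Sketch of the fix: fiber.top_enable() must refresh the remembered
# clock stamp and reset accumulated counters, otherwise the enabling
# fiber inherits the whole gap since the last (stale) stamp.
class Cord:
    def __init__(self):
        self.top_enabled = False
        self.last_tsc = 0
        self.clock_delta = 0
        self.cpu_miss_count = 0

def top_enable(cord, now_tsc):
    cord.top_enabled = True
    cord.last_tsc = now_tsc   # refresh instead of keeping a stale stamp
    cord.clock_delta = 0      # drop anything accumulated while disabled
    cord.cpu_miss_count = 0

cord = Cord()
cord.last_tsc = 100           # stale stamp from long before enabling
top_enable(cord, now_tsc=1_000_000)
assert cord.last_tsc == 1_000_000 and cord.clock_delta == 0
```

Without the refresh, the first yield after enabling would be charged the entire `now_tsc - 100` gap.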
sergepetrenko added a commit that referenced this issue Nov 15, 2019
fiber.top() fills in statistics every event loop iteration,
so if it was just enabled, fiber.top() may contain 'inf's and
'nan's in fiber cpu usage averages because total time consumed by
the main thread was not yet accounted for.
Same stands for viewing top() results for a freshly created fiber:
its metrics will be zero since it hasn't lived a full ev loop iteration
yet.
Fix this by delaying the test till top() results are meaningful and add
minor refactoring.

Follow-up #2694
sergepetrenko added a commit that referenced this issue Nov 15, 2019
fiber.top() fills in statistics every event loop iteration,
so if it was just enabled, fiber.top() returns zero in fiber cpu
usage statistics because the total time consumed by the main thread has
not yet been accounted for.
The same goes for viewing top() results for a freshly created fiber:
its metrics will be zero since it hasn't lived a full ev loop iteration
yet.
Fix this by delaying the test till top() results are meaningful, and
apply minor refactoring.

Follow-up #2694
sergepetrenko added a commit that referenced this issue Nov 18, 2019
Unify all the members related to a fiber's clock statistics into struct
clock_stat, and all the members related to the cord's knowledge of cpu
state and clocks into struct cpu_stat.
Reset the stats of all alive fibers on fiber.top_enable().

Follow-up #2694
sergepetrenko added a commit that referenced this issue Nov 18, 2019
When the fiber EMA is 0 and the first non-zero observation is added to
it, we assumed that the EMA should be equal to this observation (i.e.
the average value should be the same as the observed one). This breaks
the following invariant: the sum of clock EMAs of all fibers equals the
clock EMA of the thread.
If one of the fibers is just spawned and has a big clock delta, it
will assign this delta to its EMA, while the thread will calculate the
new EMA as 15 * EMA / 16 + delta / 16, which may lead to a situation
where the fiber EMA is greater than the cord EMA.

This caused occasional test failures:
```
[001] Test failed! Result content mismatch:
[001] --- app/fiber.result	Mon Nov 18 17:00:48 2019
[001] +++ app/fiber.reject	Mon Nov 18 17:33:10 2019
[001] @@ -1511,7 +1511,7 @@
[001]  -- not exact due to accumulated integer division errors
[001]  sum_avg > 99 and sum_avg < 101 or sum_avg
[001]  ---
[001] -- true
[001] +- 187.59585601717
[001]  ...
[001]  tbl = nil
[001]  ---

```

Follow-up #2694
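The invariant and the bug can be demonstrated with a short sketch (illustrative Python; the delta value is arbitrary):

```python
# EMA update used by the thread (cord): 15/16 old value + 1/16 new delta.
def ema_update(ema, delta):
    return ema * 15 / 16 + delta / 16

# Buggy initialization: a freshly spawned fiber takes its first delta
# as-is, so the sum of fiber EMAs can exceed the cord EMA.
cord_ema = ema_update(0.0, 1600.0)         # -> 100.0
buggy_fiber_ema = 1600.0                   # first observation taken as-is
fixed_fiber_ema = ema_update(0.0, 1600.0)  # -> 100.0

assert buggy_fiber_ema > cord_ema          # invariant broken
assert fixed_fiber_ema == cord_ema         # invariant holds
```

Applying the same update formula to fibers and the cord keeps the per-fiber averages summing to the thread-wide average.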
Gumix added a commit to Gumix/tarantool that referenced this issue Jun 24, 2022
clock_gettime(CLOCK_MONOTONIC) is implemented via the RDTSCP instruction
on x86 and has the following advantages over the raw instruction:

* It checks for RDTSCP availability in CPUID.
  If RDTSCP is not supported, it switches to RDTSC.
* Linux guarantees that the clock is monotonic, hence the CPU miss
  detection is not needed.
* It works on ARM.

As for the disadvantage, this function is about 2x slower compared to a
single RDTSCP instruction. The performance degradation measured by the
fiber switch benchmark [1] is about 8% for num_fibers = 100/1000.

Closes tarantool#5869

[1] tarantool#2694 (comment)

NO_DOC=bugfix
NO_TEST=<Tested in test/app/fiber.test.lua>
Gumix added a commit to Gumix/tarantool that referenced this issue Jun 27, 2022
clock_gettime(CLOCK_MONOTONIC) is implemented via the RDTSCP instruction
on x86 and has the following advantages over the raw instruction:

* It checks for RDTSCP availability in CPUID.
  If RDTSCP is not supported, it switches to RDTSC.
* Linux guarantees that the clock is monotonic, hence the CPU miss
  detection is not needed.
* It works on ARM.

As for the disadvantage, this function is about 2x slower compared to a
single RDTSCP instruction. The performance degradation measured by the
fiber switch benchmark [1] is about 3-7% for num_fibers == 10-1000.

Closes tarantool#5869

[1] tarantool#2694 (comment)

NO_DOC=bugfix
NO_TEST=<Tested in test/app/fiber.test.lua>
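The change can be sketched with a Python analogue (`time.monotonic_ns()` wraps `clock_gettime(CLOCK_MONOTONIC)` on Linux; Tarantool itself makes the call from C):

```python
import time

# Monotonic clock deltas are comparable across cores and never
# negative, so no cpu-miss detection is needed. Within a single-thread
# scheduler a fiber runs uninterrupted for its slice, so the monotonic
# delta approximates the fiber's cpu time for that slice.
def slice_ns(fn):
    start = time.monotonic_ns()
    fn()
    return time.monotonic_ns() - start

delta = slice_ns(lambda: sum(range(10_000)))
assert delta >= 0   # guaranteed by monotonicity
```

The trade-off noted above is cost per call: a library call is heavier than one RDTSCP instruction, but it is portable and core-safe.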
Gumix added a commit to Gumix/tarantool that referenced this issue Feb 20, 2023
clock_gettime(CLOCK_MONOTONIC) is implemented via the RDTSCP instruction
on x86 and has the following advantages over the raw instruction:

* It checks for RDTSCP availability in CPUID.
  If RDTSCP is not supported, it switches to RDTSC.
* Linux guarantees that the clock is monotonic, hence the CPU miss
  detection is not needed.
* It works on ARM.

As for the disadvantage, this function is about 2x slower compared to a
single RDTSCP instruction. The performance degradation measured by the
fiber switch benchmark [1] is about 3-7% for num_fibers == 10-1000.

Backport of tarantool#5869

[1] tarantool#2694 (comment)

NO_DOC=bugfix
NO_TEST=<Tested in test/app/fiber.test.lua>

(cherry picked from commit 2fb7ffc)