Ratio algorithm for dynamic load balancing #718
Conversation
The TfwPrcntl{} structure is removed. All data related to percentiles is flattened into TfwPrcntlStats{}. That makes it easier to maintain a single (or identical) copy of percentile numbers, as well as to copy just the percentile values on request. Also, it looks better with the unnecessary hierarchy removed.
Minimum, maximum, and average values now sit together with the percentile values in the same indexed array. That makes it possible to specify the index of the value to use in scheduling.
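As a rough illustration of the flattened layout (a sketch only; the index names here are illustrative, not necessarily the patch's exact definitions):

/*
 * Sketch of the flattened stats layout. Min, max, and average live
 * in the same indexed array as the percentiles, so a scheduler can
 * select any of them by a single index.
 */
enum {
	TFW_PSTATS_IDX_MIN = 0,
	TFW_PSTATS_IDX_MAX,
	TFW_PSTATS_IDX_AVG,
	TFW_PSTATS_IDX_P50,
	TFW_PSTATS_IDX_P75,
	TFW_PSTATS_IDX_P90,
	TFW_PSTATS_IDX_P95,
	TFW_PSTATS_IDX_P99,
	_TFW_PSTATS_IDX_CNT
};

typedef struct {
	unsigned int	ith[_TFW_PSTATS_IDX_CNT];	/* which values are tracked */
	unsigned int	val[_TFW_PSTATS_IDX_CNT];	/* the current values themselves */
	unsigned int	psz;				/* number of entries in use */
} TfwPrcntlStats;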
"server" directive has new optional argument "weight=<N>". "sched" directive has a number of new options: one of "static" or "dynamic", if "dynamic", then one of "minimum", "maximum", "average", or "percentile". If percentile, then a percentile number, one of those known to Tempesta (curently 50, 75, 90, 95, 99)/. The "sched" directive has changed with this patch. There's no more "sched" option in "srv_group" directive. Instead, "sched" directive may be specified for a server group. Only one "sched" directive is allowed per group. Same is true for the implicit group "default". It doesn't matter where the "sched" directive is specified in the "srv_group" section, or in the file for servers outside all groups. If "sched" option is specified at the start of the configuration file, and a "srv_group" section doesn't have the "sched" directive, then the values of that outer "sched" directive are propagated to the server group. If no earlier outer "sched" directive is given, a default scheduler with default options is used. Add server connections only when a sheduler is set for a group.
The scheduler API is called only after all "srv_group", "server", and "sched" directives in all groups in the configuration file are processed. That way the number of servers in each group, and the number of connections for each server, are known to a scheduler by the time its API is called. The number of servers in a group and the number of connections to a server are stored in the respective structures TfwSrvGroup{} and TfwServer{}. They need to be kept somewhere anyway, and they serve as upper limits for control purposes, especially in the scheduler API. Also, that's useful for dynamic memory allocation in schedulers. The round-robin and hash schedulers are modified to use the actual values for the number of servers in a group and the number of connections for each server, instead of using predefined constants.
This implements the complete data structure layout. Internal ratio scheduler data is allocated dynamically and populated with server group data required for the scheduler's functionality.
For dynamic weights, the weight of each server in a group is initialized to the default value, which makes their weights equal initially. At this time only static equal weights are supported.
The function is called in user context when it's guaranteed that all activity has stopped. Holding the locks would not allow code that executes as part of tfw_sg_release_all() to sleep, which should be allowed in user context.
Static weights, possibly different per server, are specified in the configuration file. Dynamic weights are derived indirectly from RTT values provided by the APM module on a periodic basis. The weights are converted to ratios that the ratio scheduler uses to distribute requests proportionally to each server's weight; a sketch of this conversion follows the conflicts list below.
Conflicts:
	tempesta_fw/sched/tfw_sched_hash.c
	tempesta_fw/sched/tfw_sched_rr.c
	tempesta_fw/server.c
	tempesta_fw/server.h
	tempesta_fw/sock_srv.c
	tempesta_fw/t/unit/sched_helper.c
	tempesta_fw/t/unit/test_sched_hash.c
	tempesta_fw/t/unit/test_sched_rr.c
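A minimal sketch of the weight-to-ratio conversion step (illustrative code, not the patch itself; in dynamic mode the weight would first be derived from the APM-provided RTT value, e.g. lower RTT giving a higher weight):

/*
 * Sketch: convert per-server weights into ratios proportional to
 * each server's share of the group's total weight.
 */
static void
tfw_ratio_calc_sketch(const unsigned int *weight, unsigned int *ratio,
		      size_t srv_n)
{
	size_t si;
	unsigned long wsum = 0;

	for (si = 0; si < srv_n; ++si)
		wsum += weight[si];
	/* Each server's ratio is its percent share of the total weight. */
	for (si = 0; si < srv_n; ++si)
		ratio[si] = wsum ? (weight[si] * 100UL) / wsum : 0;
}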
Good after minor fixes
tempesta_fw/sched/tfw_sched_hash.c
Outdated
TfwHashSrvList *sl = sg->sched_data;
unsigned long msg_hash;
unsigned long tries = sl->conn_n;;
doubled semicolon
tempesta_fw/sched/tfw_sched_hash.c
Outdated
 * of connections for each server.
 */
si = 0;
list_for_each_entry(srv, &sg->srv_list, list) {
I think these checks are redundant. We already make the checks in sock_srv.c when creating server groups, servers, and connections. Actually, the scheduler API has changed: schedulers expect a server group with the needed number of servers and connections by the time the .add_grp
callback is called. The .add_conn
callback is not used any more.
I also can't find a comment describing the new behaviour.
I'd rather leave the validation check. I've moved it to a separate function. It's called only once for each group at configuration processing time, and I consider it a good safety measure.
As for the comment on the new behaviour, it's here:
Line 116 in bab5e18:
 * Called only after the group is set up with all servers;
I have extended it a little bit to be more visible and clear.
-	TfwSrvConn *conn[TFW_SRV_MAX_CONN];
-	unsigned long hash[TFW_SRV_MAX_CONN];
+	TfwSrvConn **conn;
+	unsigned long *hash;
The change is rather significant; a module version update is needed.
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
 *
 * @lock - must be in the same cache line for faster operations.
 * @csidx - index of current server data entry.
 * @reidx - index of next server data entry which ratio we need
spaces instead of tab
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
diff = max_val_idx = 0;
for (si = 0; si < ratio->srv_n; ++si) {
	if (srvdata[max_val_idx].weight < srvdata[si].weight)
		max_val_idx = si;
max_val_idx and si should have the same type.
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
static inline bool
tfw_sched_ratio_is_srv_turn(TfwRatio *ratio, size_t csidx)
{
	unsigned int headsum2, tailsum2;
Both should be of type unsigned long.
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
static TfwSrvConn *
tfw_sched_ratio_sched_srv_conn(TfwMsg *msg, TfwServer *srv)
{
	int skipnip = 1, nipconn = 0;
spaces instead of tabs
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
tfw_sched_ratio_sched_sg_conn(TfwMsg *msg, TfwSrvGroup *sg)
{
	size_t srv_tried_n = 0;
	int skipnip = 1, nipconn = 0;
spaces instead of tabs
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
}

/*
 * * Get a free for use entry from the RCU pool.
Extra asterisks
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
 * the next server reqires a lock, so perhaps it makes sense to
 * to skip these repetitive servers while under the lock.
 */
while (srv_tried_n < ratio->srv_n) {
Actually, the condition while (srv_tried_n < ratio->srv_n)
never means that we have tried all possible servers, and you described that in the comment above. Better to rename srv_tried_n to tries:

tries = ratio->srv_n + 1;
while (--tries) {
	...

At least then this code doesn't lie that it tries all the servers. Even more, the nature of this counter is almost the same as that of the tries
counter in the hash scheduler.
Also, the integrity of a group is validated in a separate function at the time the .add_grp() callback is called. It's called just once for each server group. A sketch of such a helper is shown below.
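A sketch of what such a validation helper could look like (the connection list and counter field names are assumptions based on the snippets in this review, not the patch's exact code):

/*
 * Sketch: verify at .add_grp() time that the counts recorded during
 * configuration processing match the actual lists. Runs once per group.
 */
static int
tfw_sched_ratio_validate_grp(TfwSrvGroup *sg)
{
	size_t si = 0, ci;
	TfwServer *srv;
	TfwSrvConn *srv_conn;

	list_for_each_entry(srv, &sg->srv_list, list) {
		ci = 0;
		list_for_each_entry(srv_conn, &srv->conn_list, list)
			++ci;
		/* Each server must have its configured connections. */
		if (!ci || ci != srv->conn_n)
			return -EINVAL;
		++si;
	}
	/* The group must hold exactly the configured number of servers. */
	return (si && si == sg->srv_n) ? 0 : -EINVAL;
}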
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
 * end of the array. Reverse the sequence of server descriptor
 * indices in that part of the array.
 */
if (has_one_val) {
Should be if (!has_one_val) {
The serialization is implemented via lockless work queues. Stats data for each server is put on a work queue (one work queue per CPU). Work queues are processed periodically, and the stats data for each server is updated in a serialized manner. After that, new stats values are calculated for each server with the updated stats data. This provides a number of important benefits. Among those are:
- Stats data for each server is now consistent at all times. There are no concurrent updates while new stats values are calculated.
- Both stats updates and stats values calculations are now totally lockless.

Note that APM data is now decoupled from the server structure. It's attached to or detached from a server's instance, but it's managed completely by the APM module with the help of a reference counter. Thus it may have a different life cycle than a server's instance.

It's possible that a different solution may be needed to process work queues in a timely manner in high-load situations. Periodic processing on a timer may be insufficient to handle the workload.
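A condensed sketch of that processing loop (the queue API follows the snippets later in this review; the work-item field names and the helpers marked hypothetical are this sketch's assumptions):

/*
 * Sketch: the periodic timer function drains every CPU's work queue
 * and applies all updates from a single thread, so percentile
 * calculation never races with stats updates.
 */
static void
tfw_apm_wq_tmfn_sketch(unsigned long data)
{
	int cpu;
	TfwApmWqItem wq_item;

	for_each_online_cpu(cpu) {
		TfwRBQueue *wq = &per_cpu(tfw_apm_wq, cpu);

		while (!tfw_wq_pop(wq, &wq_item)) {
			/* Apply the RTT sample to the server's APM data. */
			tfw_apm_update(wq_item.data, wq_item.jtstamp,
				       wq_item.rtt);
			/* Hypothetical helper: remember the server once
			 * for the calculation pass below. */
			apm_mark_updated(wq_item.data);
		}
	}
	/* All updates applied; recalculate stats values serially. */
	apm_calc_updated();	/* hypothetical helper */
}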
good after fixing minor issues
tempesta_fw/server.h
Outdated
@@ -73,6 +83,7 @@ struct tfw_srv_group_t {
	rwlock_t lock;
	TfwScheduler *sched;
	void *sched_data;
+	int srv_n;
srv_n should be of type size_t.
* **round-robin** - Rotates all servers in a group in round-robin manner so
  that requests are distributed uniformly across servers. This is the default
  scheduler.
* **ratio** - Balances the load across servers in a group based on each
This patch replaces the round-robin scheduler with ratio. The description of the new options in the Readme is great, but the sample configuration file etc/tempesta_fw.conf was not updated at all.
Functional tests set the scheduler to a default value (which was round-robin) in this file. This should be fixed; otherwise almost all the functional tests will be broken.
tempesta_fw/sched/tfw_sched_hash.c
Outdated
}

kfree(sl);
sg->sched_data = NULL;
Actually this wasn't done before, but maybe we should reset srv->sched_data for every server in the group in the cycle above?
Actually, by the time the group is deleted, each server in the group is destroyed already.
tempesta_fw/apm.c
Outdated
 * @tfw_apm_rearm - Atomic flag, tells if the timer needs re-arming.
 * @tfw_apm_timer - The periodic timer handle.
 */
#define TFW_APM_DATA_F_REARM (0x0001) /* Re-arm the timer. */
Starting from line 502, comments are aligned with spaces instead of tabs.
tempesta_fw/t/unit/sched_helper.c
Outdated
-	sg = test_create_sg("test", sched_helper->sched);
+	sg = test_create_sg("test");
+	sg->flags = sched_helper->flags;
+	test_start_sg(sg, sched_helper->sched);
The last two lines are always used together in this patch. Why not just put the copying of flags into test_start_sg?
sched_helper_ratio.free_sched_arg(msg);
EXPECT_EQ(conn_acc, conn_acc_check);
sched_helper_ratio.free_sched_arg(msg);
Double usage of sched_helper_ratio.free_sched_arg(msg).
#define module_exit(func)
#endif

#include "../../sched/tfw_sched_ratio.c"
Tests are provided only for the static ratio scheduler with default server weights, which is equivalent to the removed round-robin scheduler. Please add a task to the bug tracker to improve the tests and cover all the possible use cases.
Perhaps some test cases will be simpler to implement as functional tests. E.g. there is an already existing test that asserts equal load distribution among all servers in a group. It can easily be extended to verify static ratio scheduling with non-equal weights. Dynamic scheduling can be a little bit tricky, though.
Yes, that was the reason I didn't add unit tests for the new functionality. I didn't feel like they would be useful enough. These kinds of tests would better be done with functional testing.
I will open a task for this, thanks.
cnt_full = atomic_read(&rng->cnt[r][TFW_STATS_BCKTS / 2 + i]);
cnt_half = cnt_full / 2;
atomic_set(&rng->cnt[r][i * 2], cnt_half);
atomic_set(&rng->cnt[r][i * 2 + 1], cnt_full - cnt_half);
Please update t/unit/user_space/percentiles.c's code as well, and add a test for the fix. The test should be used to debug the math code.
tempesta_fw/apm.c
Outdated
 * Process work queues on all CPUs and update stats with data
 * from each work item in the queue. Add servers with updated
 * stats to the list for calculation of stats. Each server is
 * is added to the list just once.
Double "is"
tempesta_fw/apm.c
Outdated
TfwApmWqItem wq_item;
TfwRBQueue *wq = &per_cpu(tfw_apm_wq, cpu);

while (!tfw_wq_pop(wq, &wq_item)) {
Seems like a wrong thing: we can get stuck there for a while - 2K items in a queue and 80 CPUs = 160K iterations. Meantime, 2K items in a queue with an HZ/20 timer shot gives about 40K HTTP responses per second for each CPU. Meanwhile, we had 1.8M RPS on 8 CPUs (4 cores), i.e. ~225K RPS for each CPU, so only the first 1/5 of the events will be handled. 1/5 is fine, but "first" isn't.
The timer is basically softirq context, so why can't each CPU just process its queue on its own? There will be less contention and the logic will be more responsive.
I think the ring buffer, which is actually designed for the MPMC case, is overkill for statistics. What do you think if we just create a per-CPU array of, say, 512 items and fill it with events on each HZ/20 shot? We can update the statistics like this:

static const size_t SHOT_SZ = 512 * 20 / HZ;
static unsigned long last_jiffies = jiffies;
static size_t s = 0;

if (last_jiffies == jiffies && s < SHOT_SZ)
	// Probably it makes sense to use 32 or 16 instead of 20.
	update(samples[(jiffies % (HZ / 20)) + s++]);
else
	s = 0; // reset the iterator for the next shot

This will collect only the first 10 samples, but for each millisecond this time. The algorithm seems simpler and more correct.
I am not sure I completely understand your suggestion. Are you suggesting saving/storing just one RTT sample per millisecond from Tempesta to APM's per-CPU array? Otherwise, an array of just 512 items (which is significantly less than the 2K items in the WQ now) will not be able to hold all the RTT data until the timer function wakes up.
The major point was to avoid concurrency while updating data and calculating stats values for any particular server. That way the data is consistent at the time stats values are calculated. Concurrency caused data inconsistency due to incomplete updates, and data inconsistency caused failures in the calculation of stats values. So the point was that samples for any particular server should be submitted in a serialized manner (without concurrency), and the calculation of stats values should be part of that sequence.
tempesta_fw/apm.c
Outdated
 * If the calculation cannot be completed with the current data,
 * then move that server to a separate list. When stats data is
 * updated, the calculation will be repeated.
 */
At this point we really have to merge the per-CPU statistics. But now the data is smaller, with lower contention.
tempesta_fw/sched/tfw_sched_hash.c
Outdated
cleanup:
	tfw_sched_hash_cleanup(sg);
	return ret;
The ret variable makes no sense; it's always -ENOMEM.
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
 * under restrictions that are slightly relaxed. It's likely
 * that servers probed in these two passes are not the same.
 */
attempts = rpool->srv_n * 2 + 1;
I think there is no sense in doing so many attempts, even with the rerun for skipped connections in mind. Typically, if we cannot choose a connection/server in so many tries, there is something wrong with the upstream cluster, and spinning in the loop will just emphasize the issue on Tempesta's side. I believe rpool->srv_n is the highest value that makes sense...
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
sg->sched_data = NULL;

if (tfw_sched_ratio_validate_grp(sg))
	return -EINVAL;
The same issue as with the hash scheduler: unclear purpose for the validation check, and too many allocations, which can lead to memory fragmentation and poor cache line usage.
tempesta_fw/sched/tfw_sched_ratio.c
Outdated
TfwRatioSchData *schdata = &ratio->schdata;

/* Start with server that has the highest ratio. */
spin_lock(&schdata->lock);
This is a bad thing. We have to use the lock as described in the comment, but we still want the RR (ratio) scheduler to be the quickest one. However, this would probably be an issue if we, for example, run it on a 4-processor machine (e.g. 40 and more CPU cores). Thus, let's introduce 2 versions of the function: this one for dynamic ratios, and the old one, from the RR scheduler, for static ratios.
Do you mean the old one, from the RR scheduler, for static equal ratios?
For static non-equal ratios we have to use the new algorithm with the lock.
@@ -249,25 +229,14 @@ tfw_sg_release_all(void)
	TfwServer *srv, *srv_tmp;
	TfwSrvGroup *sg, *sg_tmp;

	write_lock(&sg_lock);
Please comment that we don't need the lock because the function is called at shutdown time, when there are no users any more.
tempesta_fw/sock_srv.c
Outdated
-	return 0;
+	return tfw_peer_for_each_conn(srv, srv_conn, list,
+				      __tfw_sock_srv_connect_try_later_cb);
Why do we need to call tfw_sock_srv_connect_try_later() under conn_lock?
Conflicts: tempesta_fw/t/unit/test_sched_rr.c
Eliminate integer promotion from signed to unsigned.
The algorithm is brought in line with the code in the Ratio Scheduler (predictive algorithm).
Add the number of schedulable connections for each server.
Updates are stored in per-CPU arrays for each server. Processing of accumulated updates is done by a single thread that submits the updates and then runs the calculation of percentiles. That removes the concurrency between the updates and the calculation of percentiles. Different arrays are used for accumulating incoming update data and for processing the accumulated data.
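A sketch of the double-buffered layout described above (the TfwApmUBuf/TfwApmUBEnt names and the ubent/ubufsz/.data fields appear in the snippets below; the counter field and the comments are this sketch's assumptions):

/*
 * Sketch: two per-CPU sample buffers per server. softirq writers fill
 * the current buffer locklessly; the single processing thread switches
 * buffers, reads the now-quiescent one, and then runs the percentile
 * calculation. ULONG_MAX marks an unused entry.
 */
typedef struct {
	unsigned long	jtstamp;	/* jiffies timestamp of the sample */
	unsigned long	data;		/* RTT sample; ULONG_MAX if empty */
} TfwApmUBEnt;

typedef struct {
	TfwApmUBEnt	*ubent[2];	/* fill one buffer, process the other */
	size_t		ubufsz;		/* number of entries per buffer */
	atomic64_t	counter;	/* write cursor; selects the buffer */
} TfwApmUBuf;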
tempesta_fw/sock_srv.c
Outdated
{
	TfwServer *srv;

	list_for_each_entry(srv, tfw_cfg_slst, list)
I see the reason why you work with tfw_cfg_slst here, but as for me, the way the function is implemented is awful. The function takes sg
as an argument, but instead of using it, it works with a globally defined list of servers. The function is also needed in unit tests: https://github.com/tempesta-tech/tempesta/pull/734/files#diff-da5998b2c263193d00b380780843618bR87 but it cannot be used there without dirty hacks.
I think it's better to pass a list_head
as the function argument here.
Consistently use global variables in configuration processing code.
Good to merge with small cleanups.
tempesta_fw/apm.c
Outdated
@@ -29,6 +29,7 @@
 #include "log.h"
 #include "pool.h"
 #include "procfs.h"
+#include "work_queue.h"
The include isn't needed any more.
tempesta_fw/apm.c
Outdated
for (i = 0; i < ubuf->ubufsz; ++i)
	WRITE_ONCE(ubuf->ubent[0][i].data, ULONG_MAX);
for (i = 0; i < ubuf->ubufsz; ++i)
	WRITE_ONCE(ubuf->ubent[1][i].data, ULONG_MAX);
There is no sense in doing 2 loops instead of one loop initializing both arrays.
tempesta_fw/apm.c
Outdated
for_each_online_cpu(icpu) {
	TfwApmUBEnt *ubent;
	TfwApmUBuf *ubuf = per_cpu_ptr(data->ubuf, icpu);
	if (!(ubent = kzalloc(size, GFP_ATOMIC)))
It seems there should be kzalloc_node() with the node derived from icpu.
Also, this includes dynamic memory allocation for server and connection data, so that the maximum number of servers or connections is not limited by constants.
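As a rough sketch of that pattern for the hash scheduler (sizing from a passed-in connection count, since the exact source of the total is this sketch's assumption; using one allocation also addresses the fragmentation concern raised earlier in this review):

/*
 * Sketch: size the scheduler's arrays from the actual number of
 * connections instead of TFW_SRV_MAX_CONN, in a single allocation
 * to avoid memory fragmentation.
 */
static int
tfw_sched_hash_alloc_sketch(TfwSrvGroup *sg, size_t conn_n)
{
	TfwHashSrvList *sl;
	size_t size = sizeof(*sl)
		    + conn_n * (sizeof(TfwSrvConn *) + sizeof(unsigned long));

	if (!(sl = kzalloc(size, GFP_KERNEL)))
		return -ENOMEM;
	/* Carve both arrays out of the tail of the same block. */
	sl->conn = (TfwSrvConn **)(sl + 1);
	sl->hash = (unsigned long *)(sl->conn + conn_n);
	sl->conn_n = conn_n;
	sg->sched_data = sl;
	return 0;
}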