
Ratio algorithm for dynamic load balancing #718

Merged
merged 39 commits into from
Jun 3, 2017

Conversation

keshonok
Contributor

@keshonok keshonok commented Apr 3, 2017

Also includes dynamic memory allocation for server and connection data
so that the maximum number of servers or connections is not limited by
constants.

TfwPrcntl{} structure is removed. All data related to percentiles
is flattened into TfwPrcntlStats{}. That makes it easier to maintain
a single (or identical) copy of percentile numbers, as well as copy
just the percentile values on requests. Also, it looks better with
unnecessary hierarchy removed.
Minimum, maximum, and average values are now together with percentile
values in the same indexed array. That makes it possible to specify
an index of the value to use in scheduling.
The "server" directive has a new optional argument "weight=<N>". The
"sched" directive has a number of new options: one of "static" or
"dynamic"; if "dynamic", then one of "minimum", "maximum", "average",
or "percentile"; if "percentile", then a percentile number, one of
those known to Tempesta (currently 50, 75, 90, 95, 99).

The "sched" directive has changed with this patch. There's no more
"sched" option in "srv_group" directive. Instead, "sched" directive
may be specified for a server group. Only one "sched" directive is
allowed per group. Same is true for the implicit group "default".
It doesn't matter where the "sched" directive is specified in the
"srv_group" section, or in the file for servers outside all groups.

If "sched" option is specified at the start of the configuration
file, and a "srv_group" section doesn't have the "sched" directive,
then the values of that outer "sched" directive are propagated to
the server group. If no earlier outer "sched" directive is given,
a default scheduler with default options is used.
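A sketch of how the directives described above might be combined in a configuration file (directive spellings as given in this description; the exact Tempesta syntax and defaults may differ):

```
# Outer "sched" propagates to groups that don't specify their own.
sched ratio dynamic percentile 90;

srv_group static_backends {
    # Overrides the outer "sched"; only one "sched" per group is allowed.
    sched ratio static;
    server 10.0.0.1:8080 weight=50;
    server 10.0.0.2:8080 weight=30;
}

srv_group dynamic_backends {
    # No "sched" here: inherits "ratio dynamic percentile 90" from above.
    server 10.0.0.3:8080;
}
```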

Add server connections only when a scheduler is set for a group.
The scheduler API is called only after all 'srv_group', 'server',
and 'sched' directives in all groups in the configuration file are
processed. That way the number of servers in each group, and the
number of connections for each server are known to a scheduler at
the time its API is called.

The number of servers in a group and the number of connections
to each server are stored in the respective structures TfwSrvGroup{}
and TfwServer{}. They need to be kept somewhere anyway, and they
serve as upper limits for control purposes, especially in the
scheduler API.

Also, that's useful for dynamic memory allocation in schedulers.
Round-robin and hash schedulers are modified to use the actual
values for the number of servers in a group, and the number of
connections for each server, instead of using predefined constants.
This implements the complete data structure layout. Internal ratio
scheduler data is allocated dynamically and populated with server
group data required for the scheduler's functionality.
For dynamic weights initialize the weight of each server in a group
to the default value. That makes their weights equal initially.

At this time only static equal weights are supported.
The function is called in user context when it's guaranteed that
all activity has stopped. The locks would prevent code executed as
part of tfw_sg_release_all() from running code that may sleep, which
should be allowed in user context.
Static weights, possibly different, are specified in the configuration
file. Dynamic weights are derived indirectly from RTT values provided
by APM module on periodic basis. The weights are converted to ratios
used by the ratio scheduler to distribute requests proportionally to
each server's weight.
Conflicts:
	tempesta_fw/sched/tfw_sched_hash.c
	tempesta_fw/sched/tfw_sched_rr.c
	tempesta_fw/server.c
	tempesta_fw/server.h
	tempesta_fw/sock_srv.c
	tempesta_fw/t/unit/sched_helper.c
	tempesta_fw/t/unit/test_sched_hash.c
	tempesta_fw/t/unit/test_sched_rr.c
Contributor

@vankoven vankoven left a comment

Good after minor fixes

TfwHashSrvList *sl = sg->sched_data;
unsigned long msg_hash;
unsigned long tries = sl->conn_n;;
Contributor

doubled semicolon

* of connections for each server.
*/
si = 0;
list_for_each_entry(srv, &sg->srv_list, list) {
Contributor

I think these checks are redundant. We make the checks in sock_srv.c when creating server groups, servers, and connections. Actually, the scheduler API has changed: schedulers expect a server group with the needed number of servers and connections when the .add_grp callback is called. The .add_conn callback is not used any more.

I also can't find a comment describing the new behaviour.

Contributor Author

I'd rather leave the validation check. I've moved it to a separate function. It's called only once for each group at configuration processing time, and I consider it a good safety measure.

As for the comment on the new behaviour, it's here:

* Called only after the group is set up with all servers;

I have extended it a little bit to be more visible and clear.

TfwSrvConn *conn[TFW_SRV_MAX_CONN];
unsigned long hash[TFW_SRV_MAX_CONN];
TfwSrvConn **conn;
unsigned long *hash;
Contributor

The change is rather significant; a module version update is needed.

*
* @lock - must be in the same cache line for faster operations.
* @csidx - index of current server data entry.
* @reidx - index of next server data entry which ratio we need
Contributor

spaces instead of tab

diff = max_val_idx = 0;
for (si = 0; si < ratio->srv_n; ++si) {
if (srvdata[max_val_idx].weight < srvdata[si].weight)
max_val_idx = si;
Contributor

max_val_idx and si should have the same type.

static inline bool
tfw_sched_ratio_is_srv_turn(TfwRatio *ratio, size_t csidx)
{
unsigned int headsum2, tailsum2;
Contributor

Both should be of type unsigned long.

static TfwSrvConn *
tfw_sched_ratio_sched_srv_conn(TfwMsg *msg, TfwServer *srv)
{
int skipnip = 1, nipconn = 0;
Contributor

spaces instead of tabs

tfw_sched_ratio_sched_sg_conn(TfwMsg *msg, TfwSrvGroup *sg)
{
size_t srv_tried_n = 0;
int skipnip = 1, nipconn = 0;
Contributor

spaces instead of tabs

}

/*
* * Get a free for use entry from the RCU pool.
Contributor

Extra asterisks

* the next server reqires a lock, so perhaps it makes sense to
* to skip these repetitive servers while under the lock.
*/
while (srv_tried_n < ratio->srv_n) {
Contributor

Actually, the condition while (srv_tried_n < ratio->srv_n) will never mean that we have tried all possible servers, and you described that in the comment above. Better to rename srv_tried_n to tries:

tries = ratio->srv_n + 1;
while(--tries) {
    ...

At least this code doesn't lie that it tries all the servers. Even more, the nature of this counter is almost the same as for the tries counter in the hash scheduler.

Also, validate the integrity of a group in a separate function at the time
the .add_grp() callback is called. It's called just once for each server group.
* end of the array. Reverse the sequence of server descriptor
* indices in that part of the array.
*/
if (has_one_val) {
Contributor Author

Should be if (!has_one_val) {

The serialization is implemented via lockless work queues. Stats
data for each server is put on a work queue (one work queue per CPU).
Work queues are processed periodically, and the stats data for each
server is updated in serialized manner. After that new stats values
are calculated for each server with updated stats data.

This provides a number of important benefits. Among those are:
- Stats data for each server is now consistent at all times. There
  are no concurrent updates while new stats values are calculated.
- Both stats updates and stats values calculations are now totally
  lockless.

Note that APM data is now decoupled from the server structure. It's
attached to or detached from a server's instance, but it's managed
completely by the APM module with the help of a reference counter. Thus
it may have a different life cycle than a server's instance.

It's possible that a different solution may be needed to process
work queues in a timely manner in high load situations. Periodic
processing on timer may be insufficient to handle the workload.
Contributor

@vankoven vankoven left a comment

good after fixing minor issues

@@ -73,6 +83,7 @@ struct tfw_srv_group_t {
rwlock_t lock;
TfwScheduler *sched;
void *sched_data;
int srv_n;
Contributor

srv_n should be of type size_t.

* **round-robin** - Rotates all servers in a group in round-robin manner so
that requests are distributed uniformly across servers. This is the default
scheduler.
* **ratio** - Balances the load across servers in a group based on each
Contributor

This patch replaces round-robin scheduler with ratio. Description of new options in Readme is great, but the sample configuration file etc/tempesta_fw.conf was not updated at all.

Contributor

Functional tests set the scheduler to a default value (which was round-robin) in this file. This should be fixed, otherwise almost all the functional tests will be broken.

}

kfree(sl);
sg->sched_data = NULL;
Contributor

Actually this wasn't done before, but maybe we should reset srv->sched_data for every server in the group in the loop above?

Contributor Author

@keshonok keshonok May 2, 2017

Actually, by the time the group is deleted, each server in the group is destroyed already.

* @tfw_apm_rearm - Atomic flag, tells if the timer needs re-arming.
* @tfw_apm_timer - The periodic timer handle.
*/
#define TFW_APM_DATA_F_REARM (0x0001) /* Re-arm the timer. */
Contributor

starting from line 502 comments are aligned with spaces instead of tabs

sg = test_create_sg("test", sched_helper->sched);
sg = test_create_sg("test");
sg->flags = sched_helper->flags;
test_start_sg(sg, sched_helper->sched);
Contributor

The last two lines are always used together in this patch. Why not just move the copying of flags into test_start_sg()?


sched_helper_ratio.free_sched_arg(msg);
EXPECT_EQ(conn_acc, conn_acc_check);
sched_helper_ratio.free_sched_arg(msg);
Contributor

Duplicate call to sched_helper_ratio.free_sched_arg(msg).

#define module_exit(func)
#endif

#include "../../sched/tfw_sched_ratio.c"
Contributor

Tests are provided only for the static ratio scheduler with servers at default weights, which is equivalent to the removed round-robin scheduler. Please add a task to the bug tracker to improve the tests and cover all the possible use cases.

Perhaps some test cases will be simpler to implement as functional tests. E.g., there is already an existing test that asserts equal load distribution among all servers in a group. It can be easily extended to cover static ratio scheduling with non-equal weights. Dynamic scheduling can be a little bit tricky though.

Contributor Author

Yes, that was the reason I didn't add unit tests for the new functionality. I didn't feel they would be useful enough. These kinds of tests would be better done with functional testing.

I will open a task for this, thanks.

cnt_full = atomic_read(&rng->cnt[r][TFW_STATS_BCKTS / 2 + i]);
cnt_half = cnt_full / 2;
atomic_set(&rng->cnt[r][i * 2], cnt_half);
atomic_set(&rng->cnt[r][i * 2 + 1], cnt_full - cnt_half);
Contributor

Please update t/unit/user_space/percentiles.c's code as well as add a test for the fix. The test should be used to debug the math code.

* Process work queues on all CPUs and update stats with data
* from each work item in the queue. Add servers with updated
* stats to the list for calculation of stats. Each server is
* is added to the list just once.
Contributor

Double "is"

TfwApmWqItem wq_item;
TfwRBQueue *wq = &per_cpu(tfw_apm_wq, cpu);

while (!tfw_wq_pop(wq, &wq_item)) {
Contributor

Seems like the wrong thing: we can get stuck there for a while. 2K items in a queue and 80 CPUs = 160K iterations. Meanwhile, 2K items in a queue with a HZ/20 timer shot gives about 40K HTTP responses per second for each CPU, yet we had 1.8M RPS on 8 CPUs (4 cores), i.e. ~225K RPS for each CPU, so only the first 1/5 of the events will be handled. 1/5 is fine, but "first" isn't.

The timer is basically softirq context, so why can't each CPU just process its own queue? There will be less contention and the logic will be more responsive.

I think the ring buffer, which is actually designed for the MPMC case, is overkill for statistics. What do you think if we just create a per-CPU array of, say, 512 items and fill it with events on each HZ/20 shot? We can update statistics like:

	static const size_t SHOT_SZ = 512 * 20 / HZ;
	static unsigned long last_jiffies = jiffies;
	static size_t s = 0;
	if (last_jiffies == jiffies && s < SHOT_SZ)
		// Probably it has sense to use 32 or 16 instead of 20.
		update(samples[(jiffies % (HZ / 20)) + s++]);
	else
		s = 0; // reset the iterator for the next shot

This will collect only the first 10 samples, but per millisecond this time. The algorithm seems simpler and more correct.

Contributor Author

I am not sure I completely understand your suggestion. Are you suggesting saving/storing just one RTT sample per millisecond from Tempesta to APM's per-CPU array? Otherwise, an array of just 512 items (which is significantly less than the 2K items in the WQ now) will not be able to hold all RTT data until the timer function wakes up.

The major point was to avoid concurrency while updating data and calculating stats values for any particular server. That way the data is consistent at the time stats values are calculated. Concurrency caused data inconsistency due to incomplete updates, and data inconsistency caused the stats values calculation failures. So the point was that samples for any particular server should be submitted in a serialized manner (without concurrency), and stats values calculation should be part of that sequence.

* If the calculation cannot be completed with the current data,
* then move that server to a separate list. When stats data is
* updated, the calculation will be repeated.
*/
Contributor

At this point we really have to merge per-CPU statistics. But now the data is smaller, with lower contention.


cleanup:
tfw_sched_hash_cleanup(sg);
return ret;
Contributor

The ret variable makes no sense; it's always -ENOMEM.

* under restrictions that are slightly relaxed. It's likely
* that servers probed in these two passes are not the same.
*/
attempts = rpool->srv_n * 2 + 1;
Contributor

I think there is no sense in doing so many attempts, even with the rerun for skipped connections in mind. Typically, if we cannot choose a connection/server in so many tries, there is something wrong with the upstream cluster, and spinning in the loop will just emphasize the issue on Tempesta's side. I believe rpool->srv_n is the highest value that makes sense...

sg->sched_data = NULL;

if (tfw_sched_ratio_validate_grp(sg))
return -EINVAL;
Contributor

The same issue as with hash scheduler: unclear purpose for the validation check and too many allocations which can lead to memory fragmentation and poor cache lines usage.

TfwRatioSchData *schdata = &ratio->schdata;

/* Start with server that has the highest ratio. */
spin_lock(&schdata->lock);
Contributor

This is a bad thing. We have to use the lock as described in the comment, but we still want the RR (ratio) scheduler to be the quickest one. However, this would probably be an issue if we, for example, try to run it on a 4-processor machine (e.g. 40 or more CPU cores). Thus, let's introduce 2 versions of the function: this one for dynamic ratios, and the old one, from the RR scheduler, for static ratios.

Contributor Author

@keshonok keshonok May 11, 2017

Do you mean, the old one, from RR scheduler, for static equal ratios?

For static non-equal ratios we have to use the new algorithm with the lock.

@@ -249,25 +229,14 @@ tfw_sg_release_all(void)
TfwServer *srv, *srv_tmp;
TfwSrvGroup *sg, *sg_tmp;

write_lock(&sg_lock);
Contributor

Please comment that we don't need the lock because the function is called at shutdown time, when there are no users any more.


return 0;
return tfw_peer_for_each_conn(srv, srv_conn, list,
__tfw_sock_srv_connect_try_later_cb);
Contributor

Why do we need to call tfw_sock_srv_connect_try_later() under conn_lock?

Add the number of schedulable connections for each server.
Updates are stored in per-CPU arrays for each server. Processing
of accumulated updates is done by a single thread that submits
the updates and then runs the calculation of percentiles. That
removes the concurrency between the updates and the calculation
of percentiles.

Different arrays are used for accumulating incoming update data
and for processing the accumulated data.
{
TfwServer *srv;

list_for_each_entry(srv, tfw_cfg_slst, list)
Contributor

I see the reason why you work with tfw_cfg_slst here, but in my opinion the way the function is implemented is awful. The function takes sg as an argument, but instead of using it, it works with a globally defined list of servers. The function is also needed in unit tests: https://github.com/tempesta-tech/tempesta/pull/734/files#diff-da5998b2c263193d00b380780843618bR87 but it cannot be used there without dirty hacks.

I think it's better to pass a list_head as the function argument here.

Consistently use global variables in configuration processing code.
Contributor

@krizhanovsky krizhanovsky left a comment

Good to merge with small cleanups.

@@ -29,6 +29,7 @@
#include "log.h"
#include "pool.h"
#include "procfs.h"
#include "work_queue.h"
Contributor

The include isn't needed any more.

for (i = 0; i < ubuf->ubufsz; ++i)
WRITE_ONCE(ubuf->ubent[0][i].data, ULONG_MAX);
for (i = 0; i < ubuf->ubufsz; ++i)
WRITE_ONCE(ubuf->ubent[1][i].data, ULONG_MAX);
Contributor

There is no sense in doing 2 loops instead of one that initializes both arrays.

for_each_online_cpu(icpu) {
TfwApmUBEnt *ubent;
TfwApmUBuf *ubuf = per_cpu_ptr(data->ubuf, icpu);
if (!(ubent = kzalloc(size, GFP_ATOMIC)))
Contributor

It seems there should be kzalloc_node() with node derived from icpu.

@keshonok keshonok merged commit bf829a7 into master Jun 3, 2017
@keshonok keshonok mentioned this pull request Jun 3, 2017
@keshonok keshonok deleted the ab-alb-2 branch June 10, 2017 12:56