
Requests scheduling to massive farm of backend servers #76

Closed
krizhanovsky opened this issue Mar 11, 2015 · 13 comments

Comments

@krizhanovsky (Contributor) commented Mar 11, 2015

Currently TFW_SCHED_MAX_SERVERS is defined as 64 backend servers at maximum, which is not enough for virtualized environments (virtual hosting or clouds), so it should be extended to at least 64K, and all scheduling algorithms must also be updated accordingly to process such a number of backend servers.

We still do not expect too many servers per site group among which client requests are scheduled, but we do expect a lot of independent sites.

CRUCIAL NOTE: it's quite atypical to have 64K servers in the same server group. A virtualized environment means many small sites behind Tempesta FW, i.e. the point of the issue is the HTTP scheduler, which must schedule a request among thousands of server groups. In this case most server groups may have only one server. Meanwhile, there can be really powerful installations with hundreds of upstream servers. Thus 2-tier schedulers (ratio, hash, etc.) still must have dynamic arrays for connections and servers (a sketch follows below), but we probably don't need to introduce special data structures able to efficiently handle thousands of servers in the same server group.
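For illustration only, a minimal sketch of what "dynamic arrays for connections and servers" in a 2-tier scheduler could look like. It assumes the kernel krealloc() API and the project's TfwServer type; the array type and helper below are hypothetical, not Tempesta FW code.

    #include <linux/slab.h>

    /* Hypothetical growable server array for a scheduler; TfwServer is
     * the project's server descriptor. */
    typedef struct {
            size_t          n;      /* servers in use */
            size_t          cap;    /* allocated slots */
            TfwServer       **srvs;
    } TfwSrvArray;

    static int
    tfw_srv_array_add(TfwSrvArray *a, TfwServer *srv)
    {
            if (a->n == a->cap) {
                    /* Double the capacity so a group can grow from one
                     * server to a few hundred without a compile-time cap. */
                    size_t cap = a->cap ? 2 * a->cap : 8;
                    TfwServer **p = krealloc(a->srvs, cap * sizeof(*p),
                                             GFP_KERNEL);
                    if (!p)
                            return -ENOMEM;
                    a->srvs = p;
                    a->cap = cap;
            }
            a->srvs[a->n++] = srv;
            return 0;
    }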

krizhanovsky added this to the 0.6.0 Performance milestone Mar 11, 2015
@vdmit11 (Contributor) commented Mar 11, 2015

Well, the TFW_SCHED_MAX_SERVERS was added for a purpose.
A fixed-size array of servers has certain advantages over a linked list:

  • You can use binary search (a linked list may be transformed into a skip list, but that is more complicated).
  • You can allocate per-CPU arrays easily. The small value of TFW_SCHED_MAX_SERVERS allows doing that statically (see the sketch after this list). Dynamic per-CPU linked lists are not that easy.
  • All the memory is packed together, which is good for caching, hence performance is better.
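A minimal sketch of the per-CPU point, assuming the standard kernel per-CPU API and the project's TfwServer type (the state struct and its name are hypothetical): a compile-time bound turns per-CPU scheduler state into a single static definition.

    #include <linux/percpu.h>

    #define TFW_SCHED_MAX_SERVERS   64

    /* Hypothetical per-CPU scheduler state; TfwServer is the project's
     * server descriptor. */
    typedef struct {
            size_t          srv_n;                          /* live servers */
            TfwServer       *srvs[TFW_SCHED_MAX_SERVERS];   /* densely packed */
    } TfwSchedCpuState;

    /* One statically sized copy per CPU, no dynamic allocation needed. */
    static DEFINE_PER_CPU(TfwSchedCpuState, tfw_sched_state);

With a dynamically sized or list-based container, the same thing needs alloc_percpu() plus explicit per-CPU initialization and teardown paths.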

Of course, we can re-allocate the array as needed and so on,
but to me it looks easier to have two separate modules for the two cases:

  • A module for small groups of servers that are online most of the time.
  • Another module for large groups of servers that go offline all the time.

These two cases require different implementations and involve different optimizations, so I think we really need separate modules.

@krizhanovsky (Contributor, Author)

    ....to have two separate modules for the two cases:

    A module for small groups of servers that are online most of the time.
    Another module for large groups of servers that go offline all the time.

This is not different logic, these are just different cases, so they should be handled in the same code base. Probably you can just allocate an array for a small server set, or use a hash table or a tree to handle thousands of servers. But the different containers should be processed by the same logic (a sketch follows at the end of this comment).

Or please give an example of logic which is fundamentally different for the two cases.
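To illustrate the "same logic, different containers" point, a hedged sketch (hypothetical names, not Tempesta FW APIs): the scheduling code only needs a lookup operation, and a small array, hash table, or tree can all sit behind it.

    /* Hypothetical container interface shared by all scheduler logic. */
    typedef struct TfwSgContainer {
            TfwServer *(*lookup)(struct TfwSgContainer *c, unsigned long key);
            int        (*add)(struct TfwSgContainer *c, TfwServer *srv);
    } TfwSgContainer;

    /*
     * The scheduling code is identical whether 'c' is backed by a small
     * array scanned linearly or by a hash table / tree holding thousands
     * of servers; only the two callbacks differ.
     */
    static inline TfwServer *
    tfw_sg_sched(TfwSgContainer *c, unsigned long key)
    {
            return c->lookup(c, key);
    }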

@krizhanovsky (Contributor, Author) commented Feb 17, 2016

The system should dynamically establish new connections to busy upstream servers and also dynamically shrink redundant connections (also applicable to the forward proxy case).

UPD. It still makes sense to be able to change the number of connections to upstream servers. However, Tempesta FW will not support forward proxying. With wide HTTPS usage, forward proxying is limited to corporate networks and other small installations which do not process millions of requests per second. There is no ISP usage any more. So this is a completely different use case with a different environment and requirements.

UPD 2. I created a new issue #710 for the functionality, so no need to implement it this time.

@krizhanovsky (Contributor, Author) commented Jan 29, 2017

As we've seen in our performance benchmarks, and as shown in third-party benchmarks, HTTP servers like Nginx or Apache HTTPD show quite low performance with 4 concurrent connections, so our current default of 4 server connections and the maximum of 32 are just inadequate. I'd say 32 as the default number of connections (with VMs running Tempesta together with a user-space HTTP server in mind), and 32768 as the maximum (USHORT_MAX - 1024, which is 64512, is the maximum number of ephemeral ports).

The main consequence of the issue is that all current scheduling algorithms must be reworked to support dynamically sized arrays.

A naive solution could be to keep scheduler data per CPU and establish a number of upstream connections equal to N * CPU_NUM. However, Tempesta FW can service thousands of weak virtualized servers, so if it's running on, for example, 128-core hardware, we would have to maintain too many redundant connections (even N = 32 per CPU already gives 4096 connections to every upstream), which would put unnecessary load on the weak servers.

The issue relates to #51 since that also updates schedulers code.

@krizhanovsky (Contributor, Author) commented Mar 18, 2017

While the 2-tier schedulers certainly should be modified to support dynamically sized arrays, the real performance issue is with the HTTP scheduler, which in practice must be able to process thousands of server groups. The problem is in tfw_http_match_req(), which traverses a list of thousands of rules and performs string matching against each item. The matcher must be reworked to keep the rules in a hash table, such that we can make a quick jump by a rule key. The key can be calculated from the string and the ID of the HTTP field (see the sketch below).
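A hedged sketch of the hash-table idea (assuming the kernel hashtable and jhash APIs; the rule structure and helpers are hypothetical, not the actual tfw_http_match_req() code): bucket the rules by a key derived from the HTTP field ID and the match string, so a request does one bucket lookup instead of O(rules) string comparisons.

    #include <linux/hashtable.h>
    #include <linux/jhash.h>
    #include <linux/string.h>

    #define TFW_MATCH_HT_BITS       12

    /* Hypothetical rule entry. */
    typedef struct {
            struct hlist_node       node;
            int                     field_id;       /* e.g. Host, URI */
            const char              *str;
            size_t                  len;
            /* ... target server group ... */
    } TfwMatchRule;

    static DEFINE_HASHTABLE(tfw_match_ht, TFW_MATCH_HT_BITS);

    static u32
    tfw_match_key(int field_id, const char *s, size_t len)
    {
            return jhash(s, len, field_id);
    }

    static TfwMatchRule *
    tfw_match_lookup(int field_id, const char *s, size_t len)
    {
            TfwMatchRule *r;

            /* Walk only the rules that hash to the same bucket. */
            hash_for_each_possible(tfw_match_ht, r, node,
                                   tfw_match_key(field_id, s, len))
                    if (r->field_id == field_id && r->len == len
                        && !strncasecmp(r->str, s, len))
                            return r;
            return NULL;
    }

Rules would be inserted with hash_add() under the same key; exact-match rules fit this scheme directly, while prefix or wildcard rules would still need a fallback list.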

In the current milestone these constants should be eliminated in PRs #670 and #666.

UPD. This comment is split out into a new issue #732, so it shouldn't be done in the context of #76.

krizhanovsky modified the milestones: 0.5.0 Web Server, 0.5 alpha Jan 8, 2018
@vankoven (Contributor) commented Jan 9, 2018

All the requirements are already implemented or moved to separate issues/tasks.

vankoven closed this as completed Jan 9, 2018
@krizhanovsky (Contributor, Author)

It seems the issue is done, but we still have no results from the #680 test. Let's close it if the test shows that we really are able to efficiently handle 1M hosts.

krizhanovsky reopened this Jan 9, 2018
@vladtcvs (Contributor)

Creating many backends, with 1 backend per server group, causes problems. Creating 16 interfaces with 64 ports per interface produces:

ERROR: start() for module 'sock_srv' returned the error: -12 - ENOMEM

8x32: TCP: Too many orphaned sockets and kmemleak messages
8x128: many more TCP: Too many orphaned sockets messages and much more kmemleak output

Backends are created with nginx, a single nginx instance per interface; the nginx config contains a server {} block for each port.

Ports used: 16384, 16375, etc. for each interface.

@vladtcvs (Contributor)

testing: test_1M.py from vlts-680-1M

@krizhanovsky (Contributor, Author) commented Feb 10, 2018

I didn't notice the TCP: Too many orphaned sockets messages, but using the tempesta_fw.conf generated by @ikoveshnikov's script and his nginx.conf (both are attached), I see that sysctl -w net.tempesta.state=stop (on --restart) or sysctl -w net.tempesta.state=start (on --reload) takes about 20 seconds for a 1000-backend config. The call stack for the sysctl process:

[<ffffffffafecb533>] __wait_rcu_gp+0xc3/0xf0
[<ffffffffafecfc9c>] synchronize_sched.part.65+0x3c/0x60
[<ffffffffafecfdf0>] synchronize_sched+0x30/0x90
[<ffffffffc027f169>] tfw_sched_ratio_del_grp+0x49/0x80 [tfw_sched_ratio]
[<ffffffffc043e462>] tfw_sg_release+0x22/0x80 [tempesta_fw]
[<ffffffffc043e512>] tfw_sg_release_all+0x52/0xb0 [tempesta_fw]
[<ffffffffc0443656>] tfw_sock_srv_stop+0xb6/0xd0 [tempesta_fw]
[<ffffffffc043c19c>] tfw_ctlfn_state_io+0x19c/0x530 [tempesta_fw]
[<ffffffffb004e025>] proc_sys_call_handler+0xe5/0x100
[<ffffffffb004e04f>] proc_sys_write+0xf/0x20
[<ffffffffaffca322>] __vfs_write+0x32/0x160
[<ffffffffaffcb660>] vfs_write+0xb0/0x190
[<ffffffffaffcca83>] SyS_write+0x53/0xc0
[<ffffffffb03dd72e>] entry_SYSCALL_64_fastpath+0x1c/0xb1
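
For reference, a hedged sketch of the cost pattern visible in this stack: if every group release runs its own synchronize_sched(), stopping about 1000 groups serializes about 1000 RCU grace periods. Deferring the free with call_rcu() (or doing one synchronize for the whole batch) would amortize that. The structure and helpers below are illustrative, not the real tfw_sg_release() path.

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Hypothetical group descriptor with an embedded RCU head. */
    typedef struct {
            struct rcu_head rcu;
            /* ... scheduler data ... */
    } TfwSrvGroupSketch;

    static void
    tfw_sg_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, TfwSrvGroupSketch, rcu));
    }

    static void
    tfw_sg_release_deferred(TfwSrvGroupSketch *sg)
    {
            /* Returns immediately; the free happens after one grace
             * period shared with all other pending callbacks. */
            call_rcu(&sg->rcu, tfw_sg_free_rcu);
    }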

scrip_cfg.tar.gz

@krizhanovsky (Contributor, Author)

After the fix 6d11ff1, perf top shows the following for a 100K-server reconfiguration:

    76.25%  [kernel]            [k] strcasecmp
    16.01%  [tempesta_fw]       [k] tfw_cfgop_begin_srv_group
     5.64%  [tempesta_fw]       [k] tfw_sg_lookup_reconfig

@krizhanovsky (Contributor, Author) commented Feb 17, 2018

After the fix 94b18ed the performance profile became:

    62.33%  [tempesta_fw]  [k] tfw_cfgop_begin_srv_group
     9.25%  [tempesta_fw]  [k] tfw_apm_prcntl_tmfn
     7.98%  [tempesta_fw]  [k] __tfw_stricmp_avx2

However, reloading 10K server groups takes about 30 seconds, the same as a full restart. The tempesta_fw.conf for 10K servers is about 1MB, so all the parsing and server group manipulations, e.g. tfw_cfgop_begin_srv_group(), take time.

@krizhanovsky (Contributor, Author)

With the commit c58993a (also https://github.com/tempesta-tech/linux-4.9.35-tfw/commit/f20d5703592ce3078d3415edbc5b2703f614d9b7 for the kernel) I still cannot normally start Tempesta FW with 30K backends using the configuration from #680 (comment). (Surely it would be better to use many IP addresses and ports to avoid lock contention on a single TCP socket.) The system hangs on softirq softlockups. Only the following patch allows Tempesta FW to start "normally":

diff --git a/tempesta_fw/apm.c b/tempesta_fw/apm.c
index b82a3ce..5f78ee1 100644
--- a/tempesta_fw/apm.c
+++ b/tempesta_fw/apm.c
@@ -1034,9 +1034,10 @@ tfw_apm_add_srv(TfwServer *srv)
 
        /* Start the timer for the percentile calculation. */
        set_bit(TFW_APM_DATA_F_REARM, &data->flags);
+       goto AK_DBG;
        setup_timer(&data->timer, tfw_apm_prcntl_tmfn, (unsigned long)data);
        mod_timer(&data->timer, jiffies + TFW_APM_TIMER_INTVL);
-
+AK_DBG:
        srv->apmref = data;
 
        return 0;
diff --git a/tempesta_fw/sock_srv.c b/tempesta_fw/sock_srv.c
index dc9e0ba..3b4e361 100644
--- a/tempesta_fw/sock_srv.c
+++ b/tempesta_fw/sock_srv.c
@@ -227,7 +227,12 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
        /* Don't rearm the reconnection timer if we're about to shutdown. */
        if (unlikely(!ss_active()))
                return;
-
+{
+       static unsigned long delta = 0;
+       timeout = 1000 + delta;
+       delta += 10;
+       goto AK_DBG_end;
+}
        if (srv_conn->recns < ARRAY_SIZE(tfw_srv_tmo_vals)) {
                if (srv_conn->recns)
                        TFW_DBG_ADDR("Cannot establish connection",
@@ -249,7 +254,7 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
                timeout = tfw_srv_tmo_vals[ARRAY_SIZE(tfw_srv_tmo_vals) - 1];
        }
        srv_conn->recns++;
-
+AK_DBG_end:
        mod_timer(&srv_conn->timer, jiffies + msecs_to_jiffies(timeout));
 }
 
@@ -2119,7 +2124,7 @@ static TfwCfgSpec tfw_srv_group_specs[] = {
        },
        {
                .name = "server_connect_retries",
-               .deflt = "10",
+               .deflt = "1", // AK_DBG "10",
                .handler = tfw_cfgop_in_conn_retries,
                .spec_ext = &(TfwCfgSpecInt) {
                        .range = { 0, INT_MAX },

The reason is #736: TIMER_SOFTIRQ has the highest priority among the softirq functions, we set up about 60K timers for the 30K-group test, and the timers aren't so lightweight. So the timers just block any other activity in the system and don't allow it to make progress.
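
For illustration, a hedged sketch of what the debug patch above effectively does: stagger the reconnect timers so tens of thousands of them don't expire in the same jiffy and monopolize TIMER_SOFTIRQ. The helper is hypothetical, not the Tempesta FW implementation.

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    /*
     * Hypothetical helper: arm a reconnect timer with a per-connection
     * offset (~10 ms apart), so 60K timers spread over roughly ten
     * minutes of jiffies instead of all firing back to back.
     */
    static void
    tfw_srv_conn_arm_staggered(struct timer_list *timer, unsigned long base_ms,
                               unsigned int conn_idx)
    {
            unsigned long timeout = base_ms + 10UL * conn_idx;

            mod_timer(timer, jiffies + msecs_to_jiffies(timeout));
    }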
