
Requests scheduling to massive farm of backend servers #76

Closed
krizhanovsky opened this issue Mar 11, 2015 · 13 comments

Comments

@krizhanovsky (Contributor) commented Mar 11, 2015

Currently TFW_SCHED_MAX_SERVERS is defined as 64 backend servers at maximum, which is not enough for virtualized environments (virtual hosting or clouds), so it should be extended to at least 64K, and all scheduling algorithms must also be updated accordingly to process such a number of backend servers.

We still do not expect too many servers per site group among which client requests are scheduled, but we do expect a lot of independent sites.

CRUCIAL NOTE: it's quite atypical to have 64K servers in the same server group. A virtualized environment means many small sites behind Tempesta FW, i.e. the point of the issue is the HTTP scheduler, which must schedule a request among thousands of server groups. In this case most server groups may have only one server. Meanwhile, there can be really powerful installations with hundreds of upstream servers. Thus 2-tier schedulers (ratio, hash, etc.) still must have dynamic arrays for connections and servers (a sketch follows below), but we probably don't need to introduce special data structures able to efficiently handle thousands of servers in the same server group.
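For illustration only, a minimal sketch of what "dynamic arrays for connections and servers" in a 2-tier scheduler could look like. It assumes the kernel krealloc() API and the project's TfwServer type; the array type and helper below are hypothetical, not Tempesta FW code.

    #include <linux/slab.h>

    /* Hypothetical growable server array for a scheduler; TfwServer is
     * the project's server descriptor. */
    typedef struct {
            size_t          n;      /* servers in use */
            size_t          cap;    /* allocated slots */
            TfwServer       **srvs;
    } TfwSrvArray;

    static int
    tfw_srv_array_add(TfwSrvArray *a, TfwServer *srv)
    {
            if (a->n == a->cap) {
                    /* Double the capacity so a group can grow from one
                     * server to a few hundred without a compile-time cap. */
                    size_t cap = a->cap ? 2 * a->cap : 8;
                    TfwServer **p = krealloc(a->srvs, cap * sizeof(*p),
                                             GFP_KERNEL);
                    if (!p)
                            return -ENOMEM;
                    a->srvs = p;
                    a->cap = cap;
            }
            a->srvs[a->n++] = srv;
            return 0;
    }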

krizhanovsky added this to the 0.6.0 Performance milestone Mar 11, 2015
@vdmit11 (Contributor) commented Mar 11, 2015

Well, the TFW_SCHED_MAX_SERVERS was added for a purpose.
A fixed-size array of servers has certain advantages over a linked list:

  • You can use binary search (a linked list may be transformed into a skip list, but that is more complicated).
  • You can allocate per-CPU arrays easily. The small value of TFW_SCHED_MAX_SERVERS allows doing that statically (see the sketch after this list). Dynamic per-CPU linked lists are not that easy.
  • All the memory is packed together, which is good for caching, hence performance is better.
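A minimal sketch of the per-CPU point, assuming the standard kernel per-CPU API and the project's TfwServer type (the state struct and its name are hypothetical): a compile-time bound turns per-CPU scheduler state into a single static definition.

    #include <linux/percpu.h>

    #define TFW_SCHED_MAX_SERVERS   64

    /* Hypothetical per-CPU scheduler state; TfwServer is the project's
     * server descriptor. */
    typedef struct {
            size_t          srv_n;                          /* live servers */
            TfwServer       *srvs[TFW_SCHED_MAX_SERVERS];   /* densely packed */
    } TfwSchedCpuState;

    /* One statically sized copy per CPU, no dynamic allocation needed. */
    static DEFINE_PER_CPU(TfwSchedCpuState, tfw_sched_state);

With a dynamically sized or list-based container, the same thing needs alloc_percpu() plus explicit per-CPU initialization and teardown paths.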

Of course, we can re-allocate the array as needed and so on,
but to me it looks easier to have two separate modules for the two cases:

  • A module for small groups of servers that are online most of the time.
  • Another module for large groups of servers that go offline all the time.

These two cases require different implementations and involve different optimizations, so I think we really need separate modules.

@krizhanovsky (Contributor, Author)

    ....to have two separate modules for the two cases:

    A module for small groups of servers that are online most of the time.
    Another module for large groups of servers that go offline all the time.

This is not different logic, these are just different cases, so they should be handled in the same code base. Probably you can just allocate an array for a small server set, or use a hash table or a tree to handle thousands of servers. But the different containers should be processed by the same logic (a sketch follows at the end of this comment).

Or please give an example of logic which is fundamentally different for the two cases.
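To illustrate the "same logic, different containers" point, a hedged sketch (hypothetical names, not Tempesta FW APIs): the scheduling code only needs a lookup operation, and a small array, hash table, or tree can all sit behind it.

    /* Hypothetical container interface shared by all scheduler logic. */
    typedef struct TfwSgContainer {
            TfwServer *(*lookup)(struct TfwSgContainer *c, unsigned long key);
            int        (*add)(struct TfwSgContainer *c, TfwServer *srv);
    } TfwSgContainer;

    /*
     * The scheduling code is identical whether 'c' is backed by a small
     * array scanned linearly or by a hash table / tree holding thousands
     * of servers; only the two callbacks differ.
     */
    static inline TfwServer *
    tfw_sg_sched(TfwSgContainer *c, unsigned long key)
    {
            return c->lookup(c, key);
    }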

@krizhanovsky (Contributor, Author) commented Feb 17, 2016

The system should dynamically establish new connections to busy upstream servers and also dynamically shrink redundant connections (also applicable to the forward proxy case).

UPD. It still makes sense to be able to change the number of connections to upstream servers. However, Tempesta FW will not support forward proxying. With wide HTTPS usage, forward proxying is limited to corporate networks and other small installations which do not process millions of requests per second. There is no ISP usage any more. So this is a completely different use case with a different environment and requirements.

UPD 2. I created a new issue #710 for the functionality, so no need to implement it this time.

@krizhanovsky (Contributor, Author) commented Jan 29, 2017

As we've seen in our performance benchmarks, and as shown in third-party benchmarks, HTTP servers like Nginx or Apache HTTPD show quite low performance with 4 concurrent connections, so our current default of 4 server connections and the maximum of 32 are just inadequate. I'd say 32 as the default number of connections (with VMs running Tempesta together with a user-space HTTP server in mind), and 32768 as the maximum (USHORT_MAX - 1024, which is 64512, is the maximum number of ephemeral ports).

The main consequence of the issue is that all current scheduling algorithms must be reworked to support dynamically sized arrays.

A naive solution could be to keep scheduler data per CPU and establish a number of upstream connections equal to N * CPU_NUM. However, Tempesta FW can service thousands of weak virtualized servers, so if it's running on, for example, 128-core hardware, we would have to maintain too many redundant connections (even N = 32 per CPU already gives 4096 connections to every upstream), which would put unnecessary load on the weak servers.

The issue relates to #51 since that also updates schedulers code.

@krizhanovsky (Contributor, Author) commented Mar 18, 2017

While the 2-tier schedulers certainly should be modified to support dynamically sized arrays, the real performance issue is with the HTTP scheduler, which in practice must be able to process thousands of server groups. The problem is in tfw_http_match_req(), which traverses a list of thousands of rules and performs string matching against each item. The matcher must be reworked to keep the rules in a hash table, such that we can make a quick jump by a rule key. The key can be calculated from the string and the ID of the HTTP field (see the sketch below).
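A hedged sketch of the hash-table idea (assuming the kernel hashtable and jhash APIs; the rule structure and helpers are hypothetical, not the actual tfw_http_match_req() code): bucket the rules by a key derived from the HTTP field ID and the match string, so a request does one bucket lookup instead of O(rules) string comparisons.

    #include <linux/hashtable.h>
    #include <linux/jhash.h>
    #include <linux/string.h>

    #define TFW_MATCH_HT_BITS       12

    /* Hypothetical rule entry. */
    typedef struct {
            struct hlist_node       node;
            int                     field_id;       /* e.g. Host, URI */
            const char              *str;
            size_t                  len;
            /* ... target server group ... */
    } TfwMatchRule;

    static DEFINE_HASHTABLE(tfw_match_ht, TFW_MATCH_HT_BITS);

    static u32
    tfw_match_key(int field_id, const char *s, size_t len)
    {
            return jhash(s, len, field_id);
    }

    static TfwMatchRule *
    tfw_match_lookup(int field_id, const char *s, size_t len)
    {
            TfwMatchRule *r;

            /* Walk only the rules that hash to the same bucket. */
            hash_for_each_possible(tfw_match_ht, r, node,
                                   tfw_match_key(field_id, s, len))
                    if (r->field_id == field_id && r->len == len
                        && !strncasecmp(r->str, s, len))
                            return r;
            return NULL;
    }

Rules would be inserted with hash_add() under the same key; exact-match rules fit this scheme directly, while prefix or wildcard rules would still need a fallback list.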

In the current milestone these constants should be eliminated in PRs #670 and #666.

UPD. This comment is split out into a new issue #732, so it shouldn't be done in the context of #76.

krizhanovsky modified the milestones: 0.5.0 Web Server, 0.5 alpha Jan 8, 2018
@vankoven (Contributor) commented Jan 9, 2018

All the requirements are already implemented or moved to separate issues/tasks.

vankoven closed this as completed Jan 9, 2018
@krizhanovsky (Contributor, Author)

It seems the issue is done, but we still have no results from the #680 test. Let's close it if the test shows that we really are able to efficiently handle 1M hosts.

krizhanovsky reopened this Jan 9, 2018
@vladtcvs (Contributor)

Creating many backends, with 1 backend per server group, causes problems. Creating 16 interfaces with 64 ports per interface produces:

ERROR: start() for module 'sock_srv' returned the error: -12 - ENOMEM

8x32: TCP: Too many orphaned sockets and kmemleak messages
8x128: many more TCP: Too many orphaned sockets messages and much more kmemleak output

Backends are created with nginx, a single nginx instance per interface; the nginx config contains a server {} block for each port.

Ports used: 16384, 16375, etc. for each interface.

@vladtcvs (Contributor)

testing: test_1M.py from vlts-680-1M

@krizhanovsky (Contributor, Author) commented Feb 10, 2018

I didn't notice the TCP: Too many orphaned sockets messages, but using the tempesta_fw.conf generated by @ikoveshnikov's script and his nginx.conf (both are attached), I see that sysctl -w net.tempesta.state=stop (on --restart) or sysctl -w net.tempesta.state=start (on --reload) takes about 20 seconds for a 1000-backend config. The call stack for the sysctl process:

[<ffffffffafecb533>] __wait_rcu_gp+0xc3/0xf0
[<ffffffffafecfc9c>] synchronize_sched.part.65+0x3c/0x60
[<ffffffffafecfdf0>] synchronize_sched+0x30/0x90
[<ffffffffc027f169>] tfw_sched_ratio_del_grp+0x49/0x80 [tfw_sched_ratio]
[<ffffffffc043e462>] tfw_sg_release+0x22/0x80 [tempesta_fw]
[<ffffffffc043e512>] tfw_sg_release_all+0x52/0xb0 [tempesta_fw]
[<ffffffffc0443656>] tfw_sock_srv_stop+0xb6/0xd0 [tempesta_fw]
[<ffffffffc043c19c>] tfw_ctlfn_state_io+0x19c/0x530 [tempesta_fw]
[<ffffffffb004e025>] proc_sys_call_handler+0xe5/0x100
[<ffffffffb004e04f>] proc_sys_write+0xf/0x20
[<ffffffffaffca322>] __vfs_write+0x32/0x160
[<ffffffffaffcb660>] vfs_write+0xb0/0x190
[<ffffffffaffcca83>] SyS_write+0x53/0xc0
[<ffffffffb03dd72e>] entry_SYSCALL_64_fastpath+0x1c/0xb1
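
For reference, a hedged sketch of the cost pattern visible in this stack: if every group release runs its own synchronize_sched(), stopping about 1000 groups serializes about 1000 RCU grace periods. Deferring the free with call_rcu() (or doing one synchronize for the whole batch) would amortize that. The structure and helpers below are illustrative, not the real tfw_sg_release() path.

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Hypothetical group descriptor with an embedded RCU head. */
    typedef struct {
            struct rcu_head rcu;
            /* ... scheduler data ... */
    } TfwSrvGroupSketch;

    static void
    tfw_sg_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, TfwSrvGroupSketch, rcu));
    }

    static void
    tfw_sg_release_deferred(TfwSrvGroupSketch *sg)
    {
            /* Returns immediately; the free happens after one grace
             * period shared with all other pending callbacks. */
            call_rcu(&sg->rcu, tfw_sg_free_rcu);
    }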

scrip_cfg.tar.gz

@krizhanovsky (Contributor, Author)

After the fix 6d11ff1, perf top shows the following for a 100K-server reconfiguration:

    76.25%  [kernel]            [k] strcasecmp
    16.01%  [tempesta_fw]       [k] tfw_cfgop_begin_srv_group
     5.64%  [tempesta_fw]       [k] tfw_sg_lookup_reconfig

@krizhanovsky (Contributor, Author) commented Feb 17, 2018

After the fix 94b18ed the performance profile became:

    62.33%  [tempesta_fw]  [k] tfw_cfgop_begin_srv_group
     9.25%  [tempesta_fw]  [k] tfw_apm_prcntl_tmfn
     7.98%  [tempesta_fw]  [k] __tfw_stricmp_avx2

However, reloading 10K server groups takes about 30 seconds, the same as a full restart. The tempesta_fw.conf for 10K servers is about 1MB, so all the parsing and server group manipulations, e.g. tfw_cfgop_begin_srv_group(), take time.

@krizhanovsky (Contributor, Author)

With the commit c58993a (also https://github.com/tempesta-tech/linux-4.9.35-tfw/commit/f20d5703592ce3078d3415edbc5b2703f614d9b7 for the kernel) I still cannot normally start Tempesta FW with 30K backends using the configuration from #680 (comment). (Surely it would be better to use many IP addresses and ports to avoid lock contention on a single TCP socket.) The system hangs on softirq softlockups. Only the following patch allows Tempesta FW to start "normally":

diff --git a/tempesta_fw/apm.c b/tempesta_fw/apm.c
index b82a3ce..5f78ee1 100644
--- a/tempesta_fw/apm.c
+++ b/tempesta_fw/apm.c
@@ -1034,9 +1034,10 @@ tfw_apm_add_srv(TfwServer *srv)
 
        /* Start the timer for the percentile calculation. */
        set_bit(TFW_APM_DATA_F_REARM, &data->flags);
+       goto AK_DBG;
        setup_timer(&data->timer, tfw_apm_prcntl_tmfn, (unsigned long)data);
        mod_timer(&data->timer, jiffies + TFW_APM_TIMER_INTVL);
-
+AK_DBG:
        srv->apmref = data;
 
        return 0;
diff --git a/tempesta_fw/sock_srv.c b/tempesta_fw/sock_srv.c
index dc9e0ba..3b4e361 100644
--- a/tempesta_fw/sock_srv.c
+++ b/tempesta_fw/sock_srv.c
@@ -227,7 +227,12 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
        /* Don't rearm the reconnection timer if we're about to shutdown. */
        if (unlikely(!ss_active()))
                return;
-
+{
+       static unsigned long delta = 0;
+       timeout = 1000 + delta;
+       delta += 10;
+       goto AK_DBG_end;
+}
        if (srv_conn->recns < ARRAY_SIZE(tfw_srv_tmo_vals)) {
                if (srv_conn->recns)
                        TFW_DBG_ADDR("Cannot establish connection",
@@ -249,7 +254,7 @@ tfw_sock_srv_connect_try_later(TfwSrvConn *srv_conn)
                timeout = tfw_srv_tmo_vals[ARRAY_SIZE(tfw_srv_tmo_vals) - 1];
        }
        srv_conn->recns++;
-
+AK_DBG_end:
        mod_timer(&srv_conn->timer, jiffies + msecs_to_jiffies(timeout));
 }
 
@@ -2119,7 +2124,7 @@ static TfwCfgSpec tfw_srv_group_specs[] = {
        },
        {
                .name = "server_connect_retries",
-               .deflt = "10",
+               .deflt = "1", // AK_DBG "10",
                .handler = tfw_cfgop_in_conn_retries,
                .spec_ext = &(TfwCfgSpecInt) {
                        .range = { 0, INT_MAX },

The reason is #736: TIMER_SOFTIRQ has the highest priority among the softirq functions, we set up about 60K timers for the 30K-group test, and the timers aren't so lightweight. So the timers just block any other activity in the system and don't allow it to make progress.
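
For illustration, a hedged sketch of what the debug patch above effectively does: stagger the reconnect timers so tens of thousands of them don't expire in the same jiffy and monopolize TIMER_SOFTIRQ. The helper is hypothetical, not the Tempesta FW implementation.

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    /*
     * Hypothetical helper: arm a reconnect timer with a per-connection
     * offset (~10 ms apart), so 60K timers spread over roughly ten
     * minutes of jiffies instead of all firing back to back.
     */
    static void
    tfw_srv_conn_arm_staggered(struct timer_list *timer, unsigned long base_ms,
                               unsigned int conn_idx)
    {
            unsigned long timeout = base_ms + 10UL * conn_idx;

            mod_timer(timer, jiffies + msecs_to_jiffies(timeout));
    }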
