
TDB eviction and stale response processing #2074

Open · wants to merge 14 commits into base: master
Conversation

@const-t commented Mar 8, 2024

Wiki branch related to this PR is kt-use-stale-desc

@const-t force-pushed the kt-522-eviction branch 2 times, most recently from 5d4eaf5 to d02eeaa on March 8, 2024, 12:40
@const-t marked this pull request as ready for review on March 15, 2024, 12:49
@const-t marked this pull request as draft on April 9, 2024, 07:16
@const-t marked this pull request as ready for review on April 13, 2024, 19:09
@const-t marked this pull request as draft on April 13, 2024, 20:10
@const-t marked this pull request as ready for review on April 17, 2024, 11:07
	cfg = kzalloc(sizeof(TfwCacheUseStale), GFP_KERNEL);
	if (!cfg)
		return -ENOMEM;
	loc->cache_use_stale = cfg;
Contributor:

It seems we have a memory leak here. For such a simple config we allocate cfg twice and overwrite the already set cache_use_stale:
listen 443 proto=h2;

srv_group default {
	server 127.0.0.1:8000;
}
vhost default {
	proxy_pass default;
	tls_certificate /home/evgeny/workdir/certs/tempesta-tech.test.crt;
	tls_certificate_key /home/evgeny/workdir/certs/tempesta-tech.test.key;
	tls_match_any_server_name;
	cache_use_stale 200;
}
cache 2;
cache_fulfill * *;
cache_use_stale 500;

Contributor:

Moreover, I am not sure about this implementation, where we add cfg only for a location. In the previous config we overwrite cache_use_stale 200 with cache_use_stale 500; for all other options we don't do that.

Contributor (author):

This overriding happens only for the default vhost, and I don't see a reason to fix it, because the default vhost will be removed soon. For now I just forbade this directive for the default vhost.

About inheritance: it has the same logic as req_hdr_set, and we don't have inheritance for those directives either. I think we should fix them together in another task; we already have such a task for inheritance.

Contributor:

No, the error is still present:

vhost xxx {
	proxy_pass default;
	tls_certificate /home/evgeny/workdir/certs/tempesta-tech.test.crt;
	tls_certificate_key /home/evgeny/workdir/certs/tempesta-tech.test.key;
	cache_use_stale 400;
	cache_use_stale 500;
}

And we have a memory leak, because we don't check for duplicate values.

Contributor:

For http_resp_code_block I get `duplicate entry: 'http_resp_code_block'` for such a config. Moreover:

	if (cfg->http_resp_code_block) {
		if (ce->dflt_value)
			return 0;
		kfree(cfg->http_resp_code_block);
		cfg->http_resp_code_block = NULL;
	}

Contributor:

We also need to check how we handle this directive when it is present both in a vhost and in a location inside that vhost.

Contributor:

It seems we should set .allow_repeat = false for this option.

Contributor (author):

Fixed

	loc->cache_use_stale = cfg;

	for (i = 0; i < ce->val_n; i++) {
		if (!mask4x && !strcasecmp(ce->vals[i], "4*")) {
Contributor:

Maybe we should check for duplicate values in the config and return an error?

Contributor:

We have TFW_CFG_CHECK_VAL_DUP on the master branch; please use it to avoid duplicates.
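
For illustration only, a naive duplicate check over the parsed entry could look like the sketch below; TFW_CFG_CHECK_VAL_DUP should be preferred on master, and the error-reporting call here is illustrative, not the macro's expansion:

	/* Illustrative only: reject repeated values in a config entry. */
	for (i = 0; i < ce->val_n; i++)
		for (j = i + 1; j < ce->val_n; j++)
			if (!strcasecmp(ce->vals[i], ce->vals[j])) {
				T_ERR_NL("duplicate value '%s'\n",
					 ce->vals[i]);
				return -EINVAL;
			}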

				  HTTP2_ECODE_PROTO);

	if (tfw_http_resp_should_fwd_stale(bad_req, 502)) {
		__tfw_http_resp_fwd_stale(hmresp);
Contributor:

I think we should set r = 0 here. Otherwise we drop the connection with this backend. Or is that the expected behavior?

Contributor (author) @const-t, Jul 19, 2024:

I'm not sure; it's an error case, so I think we need to drop the connection, but without any effect on the client.

Contributor:

@krizhanovsky pay attention here. I think that if we fail to send a stale response, we should go to the else branch and call tfw_http_req_block or tfw_http_req_drop. Also, I am not sure that we can send a stale response if we are blocked by Frang.

Contributor:

I think it's OK to drop the upstream connection if we receive a broken response; basically this means that something on the upstream went wrong, and this is the reason why Nginx and Apache by default terminate connections every 100 requests.

Also, I agree that we should not send a stale response on a security event, e.g. at least if r == T_BLOCK.

As I see, we don't have a return code for __tfw_http_resp_fwd_stale(). @EvgeniiMekhanik, do you propose making the function (and all called functions) return a code? I didn't get your proposal:

> As I think if we fail to send stale response we should go to else branch and call tfw_http_req_block or tfw_http_req_drop


	if (resp && stale) {
		req->resp = NULL;
		req->stale_resp = resp;
Contributor @EvgeniiMekhanik, Jul 5, 2024:

I don't agree with this commit ("Add ability to respond with a stale response"). What is the purpose of sending a request to the server and then, after receiving the response, sending a stale response? It is very strange to me! If we receive a fresh response, we should send it. If this directive is set, we should send the stale response to the client without accessing the backend: we should check the status of the response saved in the cache and, if it is listed in cache_use_stale, send it without accessing the backend. For error cases we can send stale successful responses after receiving an error from the backend.
For example, if I add param(name="200", status=200, use_stale="200", r2_hdrs=[]) to TestCacheUseStale in your branch, I receive a stale 200 response after receiving a successful response from the server!

Contributor (author):

You are wrong. You got a stale response because you specified cache_use_stale 200 in the config. It's the desired behavior; however, maybe we should forbid status codes other than 4xx and 5xx.

This feature is needed for the case when you, for instance, get a 500 status code but have a stale response, and you reply with the stale response if that is configured.

It would be an error if you specified cache_use_stale 500 but received a 200 stale response.

Contributor:

After discussion we decided that the administrator can't set 2xx or 3xx in cache_use_stale.

Contributor:

To me the logic seems aligned with the proposed implementation in #522. I'm OK with allowing only 4xx and 5xx codes.

		      Resp_I_CC_StaleIfErrorV);

	__FSM_STATE(Resp_I_CC_StaleIfErrorV) {
		__fsm_sz = __data_remain(p);
Contributor:

Should we do it like for the previous directives?

	if (unlikely(resp->cache_ctl.flags & TFW_HTTP_CC_STALE_IF_ERROR))
		__FSM_EXIT(T_DROP);

Contributor (author):

Why? We can have repeating headers with different values.

Contributor:

Are you sure? I checked that without this check we can set `Cache-Control: stale-if-error=20, stale-if-error=25`. Is that OK? For other Cache-Control directives it is prohibited.


		b_tmp = bckt;
		bckt = TDB_HTRIE_BUCKET_NEXT(dbh, bckt);
		write_lock_bh(&bckt->lock);
Contributor:

As I understand this function, we should lock the previous bucket and the current bucket; we should hold no more than two locks. But here we try to take a third lock. And why don't we unlock b_tmp? I think we should do:

	write_unlock_bh(&b_tmp->lock);
	write_lock_bh(&bckt->lock);

Please explain the logic here.

Contributor (author):

We do it in this sequence because we can't free the bucket whose lock we currently hold; we rely on the lock from the previous bucket instead. However, maybe it's over-engineering; for simplicity we could lock/unlock only the head bucket, to lock the whole list.
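
For reference, the classic hand-over-hand ("lock coupling") traversal the reviewer describes looks roughly like the sketch below. This is a minimal illustration with hypothetical types and names, not the TDB chain-walking code, which additionally keeps the previous lock long enough to free the current bucket:

	/* Lock-coupling sketch (kernel context assumed): at most two locks
	 * are held at a time, and a locked bucket can't be freed under
	 * the walker. */
	struct bucket {
		rwlock_t	lock;
		struct bucket	*next;
	};

	static void
	chain_walk(struct bucket *head)
	{
		struct bucket *prev = head, *b;

		write_lock_bh(&prev->lock);
		while ((b = prev->next)) {
			write_lock_bh(&b->lock);	/* take the next lock */
			write_unlock_bh(&prev->lock);	/* then drop the previous */
			prev = b;
		}
		write_unlock_bh(&prev->lock);
	}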

To remove an entry from TDB we must be able to free
allocated blocks. For this purpose the block allocation
logic was reworked: now every block has its own
refcounter; when it reaches zero, the block is moved to
a per-CPU freelist of blocks and will be reused on the
next block allocation on the current CPU. It seems we
can use plain atomic inc/dec for the refcounter, because
blocks are allocated per-CPU and a race between get/put
is not possible.

The refcounter is not part of the block; all refcounters
are stored in the last block of the current extent. It's
not a great solution, but I guess it's enough for the
current fix.

Removal applies only to buckets in a collision chain;
the first bucket in a chain can't be removed.

Essentially, entry removal is just freeing blocks in the
collision chain; an index node is never removed.
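
A sketch of the put path this commit describes, reusing identifiers that appear in the hunks reviewed below (tdb_ext(), tdb_get_reftbl(), TDB_BLK_ID()); tdb_freelist_put() is a hypothetical name for the per-CPU freelist push:

	static void
	tdb_put_blk(TdbHdr *dbh, unsigned long ptr)
	{
		TdbExt *e = tdb_ext(dbh, TDB_PTR(dbh, ptr));
		atomic_t *reftbl = tdb_get_reftbl(dbh, e);
		unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;

		/* Per-CPU allocation means get/put can't race across CPUs,
		 * so a plain atomic dec-and-test is sufficient. */
		if (atomic_dec_and_test(&reftbl[blkoff]))
			tdb_freelist_put(dbh, ptr & TDB_BLK_MASK); /* hypothetical */
	}
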
This directive has the following syntax:
cache_use_stale 1*(status_code | status_code_wildcard)

Examples:
`cache_use_stale 404 502 500;` - respond with a stale
response in case of a 404, 502, or 500 status;

`cache_use_stale 5*;` - respond with a stale response
in case of any 5xx status code;
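
As an illustration of how the wildcard form can be matched at response time; the field names below are hypothetical, and only the "4*"/"5*" semantics come from the directive description and the parsing code quoted earlier in this review:

	/* Hypothetical config representation for cache_use_stale. */
	typedef struct {
		bool	mask4x;			/* "4*" configured */
		bool	mask5x;			/* "5*" configured */
		DECLARE_BITMAP(codes, 200);	/* explicit codes 400..599 */
	} cache_use_stale_cfg;

	static bool
	use_stale_match(const cache_use_stale_cfg *cfg, int status)
	{
		if (status < 400 || status > 599)
			return false;	/* only 4xx/5xx are allowed */
		if (status < 500 ? cfg->mask4x : cfg->mask5x)
			return true;
		return test_bit(status - 400, cfg->codes);
	}
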
Depending on the response status code and the codes
specified with the `cache_use_stale` directive, Tempesta
can respond from the cache with a stale response. There
are two paths:
 - Tempesta received a response without errors.
   It checks the status code, and if the status code
   matches a code specified in cache_use_stale, it
   responds with the stale response from the cache
   instead of the received response.
 - Tempesta can't send the response due to a scheduling
   error and tries to send an error response to the
   client; at this stage the status code of Tempesta's
   prepared error response is matched against the status
   codes specified in the `cache_use_stale` directive,
   and if it matches, the stale response from the cache
   is sent instead of the prepared error response.

Some htrie entries can't be allocated in one "shot":
only the "head" of such a record is added to the htrie,
and afterwards the record is copied into the htrie part
by part until it is complete. However, after adding the
"head" the entry is already accessible, and therefore
might be read or removed, which leads to races.

To fix this we mark these entries as incomplete and
ignore them during removal or reading.

Also, the behaviour of tdb_htrie_lookup() was changed:
now it locks the bucket. Otherwise we would potentially
have a race between checking completeness and removing
the record returned by tdb_htrie_lookup().

For instance: in tdb_rec_get() the record might be
removed before it reaches tdb_htrie_bscan_for_rec().
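
The protocol boils down to the two snippets below, reusing identifiers that appear later in this review (TDB_HTRIE_COMPLETE_BIT, tdb_bucket_is_complete()); the surrounding code is elided, so treat this as a sketch rather than the exact implementation:

	/* Writer: once the record is fully copied in, publish it. */
	write_lock_bh(&b->lock);
	b->flags |= TDB_HTRIE_COMPLETE_BIT;
	write_unlock_bh(&b->lock);

	/* Reader: tdb_htrie_lookup() returns with the bucket read-locked
	 * and skips entries that are not yet complete. */
	read_lock_bh(&b->lock);
	if (!tdb_bucket_is_complete(b)) {
		read_unlock_bh(&b->lock);
		return NULL;
	}
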
Simply remove expired clients with the same key when
obtaining a new client.

Client removal is not fully completed by this commit;
it's only a temporary fix. Currently we can't remove a
client right at disconnection time, because we need to
keep the Frang client info for a while even after
disconnection: this info is used for the connection rate
calculation when the client tries to reconnect.
This logic might be reused and should be revisited only
after issue #934 is implemented.

For client identification we must calculate the hash
over the address only, excluding the client's port,
because on the next connection the port will be
different.
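
As a minimal illustration (assuming an IPv6 or v4-mapped address; the actual hash function TDB uses may differ):

	#include <linux/jhash.h>

	/* Hash only the 16 address bytes; the port is deliberately
	 * excluded so a reconnecting client maps to the same entry. */
	static unsigned long
	client_key(const struct in6_addr *a)
	{
		return jhash2(a->s6_addr32, 4, 0);
	}
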
Implemented the logic of using a stale response when the
upstream server returns some type of error (500, 502,
503, 504) or when such an error code was generated on
the Tempesta side.

Defined by RFC 5861.

Fix the Warning header for stale responses.

A response is stale when its age is greater than its
lifetime. We must compare the cache entry's lifetime
with its age instead of the total calculated lifetime
of the entry.
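
In RFC 7234 terms the corrected check is simply the following (names are illustrative):

	/* A stored response is fresh while freshness_lifetime > current_age
	 * (RFC 7234, Section 4.2); compare age against lifetime directly. */
	static bool
	cache_entry_is_stale(unsigned long lifetime, unsigned long age)
	{
		return age > lifetime;
	}
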
Added two counters to the cache statistics:
`Cache objects` - the number of cached objects.
`Cache bytes`   - the number of bytes currently stored
                  in the cache.

`Cache bytes` calculation: length of key + length of
status + length of reason phrase + length of headers
+ length of body + length of the cache entry descriptor
(part of TfwCacheEntry). The size of `TdbRec` is not
included.

The total length of the record was added to the debug
print as well.
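
Expressed as code (field names are illustrative; only the summed components come from the commit text):

	static size_t
	cache_entry_bytes(size_t key_len, size_t status_len,
			  size_t reason_len, size_t hdrs_len,
			  size_t body_len, size_t ce_desc_len)
	{
		/* sizeof(TdbRec) is deliberately not included. */
		return key_len + status_len + reason_len + hdrs_len
		       + body_len + ce_desc_len;
	}
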
New caching logic was introduced, and this patch changes
the removal logic to follow the new requirements. Now we
do not remove records with a different method at the
cleaning stage, which is invoked before adding a new
record to the cache. The main reason is that HEAD and
GET with the same key may have two different records:
one with a body for the GET and a second one without a
body for the HEAD.
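
The cleaning predicate this implies could look like the following sketch; the method field is an assumption, while tfw_cache_entry_key_eq() is the helper visible later in this review:

	/* Evict only records matching both the key and the request
	 * method, so GET (with body) and HEAD (without body) entries
	 * for the same URI can coexist. */
	static bool
	rec_should_evict(TDB *db, TfwCacheEntry *ce, TfwHttpReq *req)
	{
		return ce->method == req->method
		       && tfw_cache_entry_key_eq(db, req, ce);
	}
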
 - Fix style and comments.
 - Forbid use of `cache_use_stale` to simplify
   inheritance.
 - Rewrite validation in `tfw_cfgop_cache_use_stale()`.
 - Add `cache_use_stale` to the config file.
 - Forbid status codes below 400 for
   `cache_use_stale`.
Contributor @krizhanovsky left a comment:

I think we can do better in terms of performance and memory efficiency

{
	TdbExt *e = tdb_ext(dbh, TDB_PTR(dbh, ptr));
	atomic_t *reftbl = tdb_get_reftbl(dbh, e);
	unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
Contributor:

diff --git a/ktest/linux/slab.h b/ktest/linux/slab.h
index cc421400..8d8ac230 100644
--- a/ktest/linux/slab.h
+++ b/ktest/linux/slab.h
@@ -35,6 +35,7 @@
 /* asm/page.h */
 #define PAGE_SIZE      4096UL
 #define PAGE_MASK      (~(PAGE_SIZE-1))
+#define PAGE_SHIFT     12
 /* arch/x86/include/asm/page_64_types.h */
 #define THREAD_SIZE    (PAGE_SIZE << 2)

or the unit test compilation fails with:

~/tempesta/db/t# make
cc -O2 -msse4.2 -ggdb -Wall -Werror -fno-strict-aliasing -lpthread -DL1_CACHE_BYTES=64 -I../../ktest -I../.. -Wno-address-of-packed-member   -c -o tdb_htrie.o tdb_htrie.c
In file included from tdb_htrie.c:39:
../core/htrie.c: In function ‘tdb_get_blk’:
../core/htrie.c:131:65: error: ‘PAGE_SHIFT’ undeclared (first use in this function); did you mean ‘PAGE_SIZE’?
  131 |         unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
      |                                                                 ^~~~~~~~~~
      |                                                                 PAGE_SIZE
../core/htrie.c:131:65: note: each undeclared identifier is reported only once for each function it appears in
../core/htrie.c: In function ‘tdb_put_blk’:
../core/htrie.c:143:65: error: ‘PAGE_SHIFT’ undeclared (first use in this function); did you mean ‘PAGE_SIZE’?
  143 |         unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
      |                                                                 ^~~~~~~~~~
      |                                                                 PAGE_SIZE
../core/htrie.c: In function ‘tdb_alloc_index’:
../core/htrie.c:455:33: error: variable ‘old_rptr’ set but not used [-Werror=unused-but-set-variable]
  455 |         unsigned long rptr = 0, old_rptr;
      |                                 ^~~~~~~~
cc1: all warnings being treated as errors
make: *** [<builtin>: tdb_htrie.o] Error 1

@@ -356,10 +462,17 @@ tdb_alloc_index(TdbHdr *dbh)
		     || TDB_BLK_O(rptr + sizeof(TdbHtrieNode) - 1)
			> TDB_BLK_O(rptr)))
	{
		/* Use a new page and/or extent for local CPU. */
		rptr = tdb_alloc_blk(dbh);
		old_rptr = ((rptr - 1) & TDB_BLK_MASK);
Contributor:

We never free index blocks, so old_rptr isn't used.

	if (!r)
		TDB_ERR("Cannot allocate cache entry for key=%#lx\n", key);

	return r;
}
EXPORT_SYMBOL(tdb_entry_alloc);

void
Contributor:

Please write a comment describing what the function is required for, just an API doc.

@@ -37,11 +37,17 @@ MODULE_LICENSE("GPL");

Contributor:

Let's advance the TDB version to 0.2.0 :)

	b->flags |= TDB_HTRIE_COMPLETE_BIT;
	write_unlock_bh(&b->lock);
}
EXPORT_SYMBOL(tdb_entry_mark_complete);
Contributor:

The function is used in t/tdb_htrie.c, so it seems the HTrie isn't usable without it; shouldn't we move it to htrie.c?

			r = 0;
		} else {
			has_alive = true;
		}
Contributor:

It was a poor design to write polymorphic code for fixed- and variable-size records through macros, and in the new implementation I got rid of that approach. You can get the record type from the dbh handler (a TDB table can handle only one type of records). You can write a function that iterates over all the records using the record type from dbh; both callbacks can be passed to that function, and the function can return has_alive.
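
A self-contained sketch of that suggestion; all names here are hypothetical, not the actual TDB API:

	enum rec_type { REC_FIXED, REC_VAR };

	struct tbl_hdr {
		enum rec_type	type;	/* one record type per table */
		size_t		rec_n;
		void		**recs;
	};

	typedef bool (*rec_alive_cb)(void *rec);

	/* Dispatch once on the table's record type and report whether
	 * any live record remains. */
	static bool
	tbl_scan(struct tbl_hdr *h, rec_alive_cb fixed_cb, rec_alive_cb var_cb)
	{
		rec_alive_cb cb = (h->type == REC_FIXED) ? fixed_cb : var_cb;
		bool has_alive = false;
		size_t i;

		for (i = 0; i < h->rec_n; i++)
			if (cb(h->recs[i]))
				has_alive = true;

		return has_alive;
	}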

	    tfw_cache_entry_key_eq(db, req, ce))
	{
		if (!ce_best || tfw_cache_entry_age(ce)
		    < tfw_cache_entry_age(ce_best))
Contributor:

I think the code is wrong, as is the statement "Cache may store one or more responses to the effective Request URI." in the comment above. Why store so many entries? We recently discussed this with @EvgeniiMekhanik.

We can #ifdef 0 the code with a TODO for #508, but the point is that at the moment we are able to use, and do use, only one cache entry per key, and the rest of the entries are just garbage and extra iterations. We need to store only one cache entry at any given point in time. If the entry is stale, we remove it on insertion of a new response.

Contributor:

The key is the URI, so for one URI we can have HEAD and GET records which are both valid.

	 * Remove each entry with the current key, even fresh entries. We
	 * assume that if we reached this point, no valid entries exist.
	 */
	tdb_entry_remove(db, key, &tfw_cache_rec_eq_req, resp->req,
Contributor:

The same issue as for tfw_client_obtain(): a double index traversal, possibly in vain, plus bucket locking.

@@ -602,6 +718,11 @@ tdb_htrie_create_rec(TdbHdr *dbh, unsigned long off, unsigned long key,
	if (data)
		memcpy_fast(ptr, data, len);

	bckt = (TdbBucket *)((unsigned long)r & TDB_HTRIE_DMASK);
	bckt->flags |= (TDB_HTRIE_COMPLETE_BIT * complete);
Contributor:

This is for variable-length records only, right? It should be under if (TDB_HTRIE_VARLENRECS(dbh)).

Why is this applied to the whole bucket? If we have a fully resolved key, we may have collisions in the bucket, so even for variable-length records we can have many TdbVRecs in one bucket. I'd expect the bit to be set per TdbVRec.

	TdbHtrieNode *root = TDB_HTRIE_ROOT(dbh);

	o = tdb_htrie_descend(dbh, &root, key, &bits);
	if (!o)
		return NULL;
	b = TDB_PTR(dbh, o);
	read_lock_bh(&(b)->lock);
	if (!tdb_bucket_is_complete(b)) {
Contributor:

I don't think we need to implement #500 right now, but there is a comment in #500 which is aligned with incomplete buckets (why not records?), so it would be good to write a good TODO #500 comment and align the current implementation with the upcoming task.

Successfully merging this pull request may close these issues:

- TDB: kernel panic in tfw_client_ent_init
- Servicing stale cached responses and immediate purging