
TDB eviction and stale response processing #2074

Open · wants to merge 14 commits into base: master
Conversation

@const-t commented Mar 8, 2024

Wiki branch related to this PR is kt-use-stale-desc

@const-t force-pushed the kt-522-eviction branch 2 times, most recently from 5d4eaf5 to d02eeaa on March 8, 2024, 12:40
@const-t marked this pull request as ready for review on March 15, 2024, 12:49
@const-t marked this pull request as draft on April 9, 2024, 07:16
@const-t marked this pull request as ready for review on April 13, 2024, 19:09
@const-t marked this pull request as draft on April 13, 2024, 20:10
@const-t marked this pull request as ready for review on April 17, 2024, 11:07
	cfg = kzalloc(sizeof(TfwCacheUseStale), GFP_KERNEL);
	if (!cfg)
		return -ENOMEM;
	loc->cache_use_stale = cfg;
Contributor:

It seems we have a memory leak here. For such a simple config we allocate cfg twice and overwrite the already set cache_use_stale:
listen 443 proto=h2;

srv_group default {
	server 127.0.0.1:8000;
}
vhost default {
	proxy_pass default;
	tls_certificate /home/evgeny/workdir/certs/tempesta-tech.test.crt;
	tls_certificate_key /home/evgeny/workdir/certs/tempesta-tech.test.key;
	tls_match_any_server_name;
	cache_use_stale 200;
}
cache 2;
cache_fulfill * *;
cache_use_stale 500;

Contributor:

Moreover, I am not sure about this implementation, where we add cfg only for a location. In the previous config we overwrite cache_use_stale 200 with cache_use_stale 500; for all other options we don't do that.

Contributor (author):

This overriding happens only for the default vhost, and I don't see a reason to fix it, because the default vhost will be removed soon. For now I just forbade this directive for the default vhost.

About inheritance: it has the same logic as req_hdr_set, and we don't have inheritance for those directives either. I think we should fix them together in another task; we already have such a task for inheritance.

Contributor:

No, the error is still present:

vhost xxx {
	proxy_pass default;
	tls_certificate /home/evgeny/workdir/certs/tempesta-tech.test.crt;
	tls_certificate_key /home/evgeny/workdir/certs/tempesta-tech.test.key;
	cache_use_stale 400;
	cache_use_stale 500;
}

And we have a memory leak, because we don't check for duplicate values.

Contributor:

For http_resp_code_block I get `duplicate entry: 'http_resp_code_block'` for such a config. Moreover:

	if (cfg->http_resp_code_block) {
		if (ce->dflt_value)
			return 0;
		kfree(cfg->http_resp_code_block);
		cfg->http_resp_code_block = NULL;
	}

Contributor:

We also need to check how we handle this directive when it is present both in a vhost and in a location inside that vhost.

Contributor:

It seems we should set .allow_repeat = false for this option.

Contributor (author):

Fixed

	loc->cache_use_stale = cfg;

	for (i = 0; i < ce->val_n; i++) {
		if (!mask4x && !strcasecmp(ce->vals[i], "4*")) {
Contributor:

Maybe we should check for duplicate values in the config and return an error?

Contributor:

We have TFW_CFG_CHECK_VAL_DUP on the master branch; please use it to avoid duplicates.
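
For illustration only, a naive duplicate check over the parsed entry could look like the sketch below; TFW_CFG_CHECK_VAL_DUP should be preferred on master, and the error-reporting call here is illustrative, not the macro's expansion:

	/* Illustrative only: reject repeated values in a config entry. */
	for (i = 0; i < ce->val_n; i++)
		for (j = i + 1; j < ce->val_n; j++)
			if (!strcasecmp(ce->vals[i], ce->vals[j])) {
				T_ERR_NL("duplicate value '%s'\n",
					 ce->vals[i]);
				return -EINVAL;
			}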

				  HTTP2_ECODE_PROTO);

	if (tfw_http_resp_should_fwd_stale(bad_req, 502)) {
		__tfw_http_resp_fwd_stale(hmresp);
Contributor:

I think we should set r = 0 here. Otherwise we drop the connection with this backend. Or is that the expected behavior?

Contributor (author) @const-t, Jul 19, 2024:

I'm not sure; it's an error case, so I think we need to drop the connection, but without any effect on the client.

Contributor:

@krizhanovsky pay attention here. I think that if we fail to send a stale response, we should go to the else branch and call tfw_http_req_block or tfw_http_req_drop. Also, I am not sure that we can send a stale response if we are blocked by Frang.

Contributor:

I think it's OK to drop the upstream connection if we receive a broken response; basically this means that something on the upstream went wrong, and this is the reason why Nginx and Apache by default terminate connections every 100 requests.

Also, I agree that we should not send a stale response on a security event, e.g. at least if r == T_BLOCK.

As I see, we don't have a return code for __tfw_http_resp_fwd_stale(). @EvgeniiMekhanik, do you propose making the function (and all called functions) return a code? I didn't get your proposal:

> As I think if we fail to send stale response we should go to else branch and call tfw_http_req_block or tfw_http_req_drop


	if (resp && stale) {
		req->resp = NULL;
		req->stale_resp = resp;
Contributor @EvgeniiMekhanik, Jul 5, 2024:

I don't agree with this commit ("Add ability to respond with a stale response"). What is the purpose of sending a request to the server and then, after receiving the response, sending a stale response? It is very strange to me! If we receive a fresh response, we should send it. If this directive is set, we should send the stale response to the client without accessing the backend: we should check the status of the response saved in the cache and, if it is listed in cache_use_stale, send it without accessing the backend. For error cases we can send stale successful responses after receiving an error from the backend.
For example, if I add param(name="200", status=200, use_stale="200", r2_hdrs=[]) to TestCacheUseStale in your branch, I receive a stale 200 response after receiving a successful response from the server!

Contributor (author):

You are wrong. You got a stale response because you specified cache_use_stale 200 in the config. It's the desired behavior; however, maybe we should forbid status codes other than 4xx and 5xx.

This feature is needed for the case when you, for instance, get a 500 status code but have a stale response, and you reply with the stale response if that is configured.

It would be an error if you specified cache_use_stale 500 but received a 200 stale response.

Contributor:

After discussion we decided that the administrator can't set 2xx or 3xx in cache_use_stale.

Contributor:

To me the logic seems aligned with the proposed implementation in #522. I'm OK with allowing only 4xx and 5xx codes.

		      Resp_I_CC_StaleIfErrorV);

	__FSM_STATE(Resp_I_CC_StaleIfErrorV) {
		__fsm_sz = __data_remain(p);
Contributor:

Should we do it like for the previous directives?

	if (unlikely(resp->cache_ctl.flags & TFW_HTTP_CC_STALE_IF_ERROR))
		__FSM_EXIT(T_DROP);

Contributor (author):

Why? We can have repeating headers with different values.

Contributor:

Are you sure? I checked that without this check we can set `Cache-Control: stale-if-error=20, stale-if-error=25`. Is that OK? For other Cache-Control directives it is prohibited.


		b_tmp = bckt;
		bckt = TDB_HTRIE_BUCKET_NEXT(dbh, bckt);
		write_lock_bh(&bckt->lock);
Contributor:

As I understand this function, we should lock the previous bucket and the current bucket; we should hold no more than two locks. But here we try to take a third lock. And why don't we unlock b_tmp? I think we should do:

	write_unlock_bh(&b_tmp->lock);
	write_lock_bh(&bckt->lock);

Please explain the logic here.

Contributor (author):

We do it in this sequence because we can't free the bucket whose lock we currently hold; we rely on the lock from the previous bucket instead. However, maybe it's over-engineering; for simplicity we could lock/unlock only the head bucket, to lock the whole list.
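
For reference, the classic hand-over-hand ("lock coupling") traversal the reviewer describes looks roughly like the sketch below. This is a minimal illustration with hypothetical types and names, not the TDB chain-walking code, which additionally keeps the previous lock long enough to free the current bucket:

	/* Lock-coupling sketch (kernel context assumed): at most two locks
	 * are held at a time, and a locked bucket can't be freed under
	 * the walker. */
	struct bucket {
		rwlock_t	lock;
		struct bucket	*next;
	};

	static void
	chain_walk(struct bucket *head)
	{
		struct bucket *prev = head, *b;

		write_lock_bh(&prev->lock);
		while ((b = prev->next)) {
			write_lock_bh(&b->lock);	/* take the next lock */
			write_unlock_bh(&prev->lock);	/* then drop the previous */
			prev = b;
		}
		write_unlock_bh(&prev->lock);
	}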

To remove an entry from TDB we must be able to free
allocated blocks. For this purpose the block allocation
logic was reworked: now every block has its own
refcounter; when it reaches zero, the block is moved to
a per-CPU freelist of blocks and will be reused on the
next block allocation on the current CPU. It seems we
can use plain atomic inc/dec for the refcounter, because
blocks are allocated per-CPU and a race between get/put
is not possible.

The refcounter is not part of the block; all refcounters
are stored in the last block of the current extent. It's
not a great solution, but I guess it's enough for the
current fix.

Removal applies only to buckets in a collision chain;
the first bucket in a chain can't be removed.

Essentially, entry removal is just freeing blocks in the
collision chain; an index node is never removed.
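
A sketch of the put path this commit describes, reusing identifiers that appear in the hunks reviewed below (tdb_ext(), tdb_get_reftbl(), TDB_BLK_ID()); tdb_freelist_put() is a hypothetical name for the per-CPU freelist push:

	static void
	tdb_put_blk(TdbHdr *dbh, unsigned long ptr)
	{
		TdbExt *e = tdb_ext(dbh, TDB_PTR(dbh, ptr));
		atomic_t *reftbl = tdb_get_reftbl(dbh, e);
		unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;

		/* Per-CPU allocation means get/put can't race across CPUs,
		 * so a plain atomic dec-and-test is sufficient. */
		if (atomic_dec_and_test(&reftbl[blkoff]))
			tdb_freelist_put(dbh, ptr & TDB_BLK_MASK); /* hypothetical */
	}
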
This directive has the following syntax:
cache_use_stale 1*(status_code | status_code_wildcard)

Examples:
`cache_use_stale 404 502 500;` - respond with a stale
response in case of a 404, 502, or 500 status;

`cache_use_stale 5*;` - respond with a stale response
in case of any 5xx status code;
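
As an illustration of how the wildcard form can be matched at response time; the field names below are hypothetical, and only the "4*"/"5*" semantics come from the directive description and the parsing code quoted earlier in this review:

	/* Hypothetical config representation for cache_use_stale. */
	typedef struct {
		bool	mask4x;			/* "4*" configured */
		bool	mask5x;			/* "5*" configured */
		DECLARE_BITMAP(codes, 200);	/* explicit codes 400..599 */
	} cache_use_stale_cfg;

	static bool
	use_stale_match(const cache_use_stale_cfg *cfg, int status)
	{
		if (status < 400 || status > 599)
			return false;	/* only 4xx/5xx are allowed */
		if (status < 500 ? cfg->mask4x : cfg->mask5x)
			return true;
		return test_bit(status - 400, cfg->codes);
	}
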
Depending on the response status code and the codes
specified with the `cache_use_stale` directive, Tempesta
can respond from the cache with a stale response. There
are two paths:
 - Tempesta received a response without errors.
   It checks the status code, and if the status code
   matches a code specified in cache_use_stale, it
   responds with the stale response from the cache
   instead of the received response.
 - Tempesta can't send the response due to a scheduling
   error and tries to send an error response to the
   client; at this stage the status code of Tempesta's
   prepared error response is matched against the status
   codes specified in the `cache_use_stale` directive,
   and if it matches, the stale response from the cache
   is sent instead of the prepared error response.

Some htrie entries can't be allocated in one "shot":
only the "head" of such a record is added to the htrie,
and afterwards the record is copied into the htrie part
by part until it is complete. However, after adding the
"head" the entry is already accessible, and therefore
might be read or removed, which leads to races.

To fix this we mark these entries as incomplete and
ignore them during removal or reading.

Also, the behaviour of tdb_htrie_lookup() was changed:
now it locks the bucket. Otherwise we would potentially
have a race between checking completeness and removing
the record returned by tdb_htrie_lookup().

For instance: in tdb_rec_get() the record might be
removed before it reaches tdb_htrie_bscan_for_rec().
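
The protocol boils down to the two snippets below, reusing identifiers that appear later in this review (TDB_HTRIE_COMPLETE_BIT, tdb_bucket_is_complete()); the surrounding code is elided, so treat this as a sketch rather than the exact implementation:

	/* Writer: once the record is fully copied in, publish it. */
	write_lock_bh(&b->lock);
	b->flags |= TDB_HTRIE_COMPLETE_BIT;
	write_unlock_bh(&b->lock);

	/* Reader: tdb_htrie_lookup() returns with the bucket read-locked
	 * and skips entries that are not yet complete. */
	read_lock_bh(&b->lock);
	if (!tdb_bucket_is_complete(b)) {
		read_unlock_bh(&b->lock);
		return NULL;
	}
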
Simply remove expired clients with the same key when
obtaining a new client.

Client removal is not fully completed by this commit;
it's only a temporary fix. Currently we can't remove a
client right at disconnection time, because we need to
keep the Frang client info for a while even after
disconnection: this info is used for the connection rate
calculation when the client tries to reconnect.
This logic might be reused and should be revisited only
after issue #934 is implemented.

For client identification we must calculate the hash
over the address only, excluding the client's port,
because on the next connection the port will be
different.
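
As a minimal illustration (assuming an IPv6 or v4-mapped address; the actual hash function TDB uses may differ):

	#include <linux/jhash.h>

	/* Hash only the 16 address bytes; the port is deliberately
	 * excluded so a reconnecting client maps to the same entry. */
	static unsigned long
	client_key(const struct in6_addr *a)
	{
		return jhash2(a->s6_addr32, 4, 0);
	}
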
Implemented the logic of using a stale response when the
upstream server returns some type of error (500, 502,
503, 504) or when such an error code was generated on
the Tempesta side.

Defined by RFC 5861.

Fix the Warning header for stale responses.

A response is stale when its age is greater than its
lifetime. We must compare the cache entry's lifetime
with its age instead of the total calculated lifetime
of the entry.
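
In RFC 7234 terms the corrected check is simply the following (names are illustrative):

	/* A stored response is fresh while freshness_lifetime > current_age
	 * (RFC 7234, Section 4.2); compare age against lifetime directly. */
	static bool
	cache_entry_is_stale(unsigned long lifetime, unsigned long age)
	{
		return age > lifetime;
	}
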
Added two counters to the cache statistics:
`Cache objects` - the number of cached objects.
`Cache bytes`   - the number of bytes currently stored
                  in the cache.

`Cache bytes` calculation: length of key + length of
status + length of reason phrase + length of headers
+ length of body + length of the cache entry descriptor
(part of TfwCacheEntry). The size of `TdbRec` is not
included.

The total length of the record was added to the debug
print as well.
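
Expressed as code (field names are illustrative; only the summed components come from the commit text):

	static size_t
	cache_entry_bytes(size_t key_len, size_t status_len,
			  size_t reason_len, size_t hdrs_len,
			  size_t body_len, size_t ce_desc_len)
	{
		/* sizeof(TdbRec) is deliberately not included. */
		return key_len + status_len + reason_len + hdrs_len
		       + body_len + ce_desc_len;
	}
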
New caching logic was introduced, and this patch changes
the removal logic to follow the new requirements. Now we
do not remove records with a different method at the
cleaning stage, which is invoked before adding a new
record to the cache. The main reason is that HEAD and
GET with the same key may have two different records:
one with a body for the GET and a second one without a
body for the HEAD.
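
The cleaning predicate this implies could look like the following sketch; the method field is an assumption, while tfw_cache_entry_key_eq() is the helper visible later in this review:

	/* Evict only records matching both the key and the request
	 * method, so GET (with body) and HEAD (without body) entries
	 * for the same URI can coexist. */
	static bool
	rec_should_evict(TDB *db, TfwCacheEntry *ce, TfwHttpReq *req)
	{
		return ce->method == req->method
		       && tfw_cache_entry_key_eq(db, req, ce);
	}
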
 - Fix style and comments.
 - Forbid use of `cache_use_stale` to simplify
   inheritance.
 - Rewrite validation in `tfw_cfgop_cache_use_stale()`.
 - Add `cache_use_stale` to the config file.
 - Forbid status codes below 400 for
   `cache_use_stale`.
Contributor @krizhanovsky left a comment:

I think we can do better in terms of performance and memory efficiency

{
	TdbExt *e = tdb_ext(dbh, TDB_PTR(dbh, ptr));
	atomic_t *reftbl = tdb_get_reftbl(dbh, e);
	unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
Contributor:

diff --git a/ktest/linux/slab.h b/ktest/linux/slab.h
index cc421400..8d8ac230 100644
--- a/ktest/linux/slab.h
+++ b/ktest/linux/slab.h
@@ -35,6 +35,7 @@
 /* asm/page.h */
 #define PAGE_SIZE      4096UL
 #define PAGE_MASK      (~(PAGE_SIZE-1))
+#define PAGE_SHIFT     12
 /* arch/x86/include/asm/page_64_types.h */
 #define THREAD_SIZE    (PAGE_SIZE << 2)

or the unit test compilation fails with:

~/tempesta/db/t# make
cc -O2 -msse4.2 -ggdb -Wall -Werror -fno-strict-aliasing -lpthread -DL1_CACHE_BYTES=64 -I../../ktest -I../.. -Wno-address-of-packed-member   -c -o tdb_htrie.o tdb_htrie.c
In file included from tdb_htrie.c:39:
../core/htrie.c: In function ‘tdb_get_blk’:
../core/htrie.c:131:65: error: ‘PAGE_SHIFT’ undeclared (first use in this function); did you mean ‘PAGE_SIZE’?
  131 |         unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
      |                                                                 ^~~~~~~~~~
      |                                                                 PAGE_SIZE
../core/htrie.c:131:65: note: each undeclared identifier is reported only once for each function it appears in
../core/htrie.c: In function ‘tdb_put_blk’:
../core/htrie.c:143:65: error: ‘PAGE_SHIFT’ undeclared (first use in this function); did you mean ‘PAGE_SIZE’?
  143 |         unsigned int blkoff = TDB_BLK_ID(ptr & TDB_BLK_MASK) >> PAGE_SHIFT;
      |                                                                 ^~~~~~~~~~
      |                                                                 PAGE_SIZE
../core/htrie.c: In function ‘tdb_alloc_index’:
../core/htrie.c:455:33: error: variable ‘old_rptr’ set but not used [-Werror=unused-but-set-variable]
  455 |         unsigned long rptr = 0, old_rptr;
      |                                 ^~~~~~~~
cc1: all warnings being treated as errors
make: *** [<builtin>: tdb_htrie.o] Error 1

@@ -356,10 +462,17 @@ tdb_alloc_index(TdbHdr *dbh)
		     || TDB_BLK_O(rptr + sizeof(TdbHtrieNode) - 1)
			> TDB_BLK_O(rptr)))
	{
		/* Use a new page and/or extent for local CPU. */
		rptr = tdb_alloc_blk(dbh);
		old_rptr = ((rptr - 1) & TDB_BLK_MASK);
Contributor:

We never free index blocks, so old_rptr isn't used.

	if (!r)
		TDB_ERR("Cannot allocate cache entry for key=%#lx\n", key);

	return r;
}
EXPORT_SYMBOL(tdb_entry_alloc);

void
Contributor:

Please write a comment describing what the function is required for, just an API doc.

@@ -37,11 +37,17 @@ MODULE_LICENSE("GPL");

Contributor:

Let's advance the TDB version to 0.2.0 :)

	b->flags |= TDB_HTRIE_COMPLETE_BIT;
	write_unlock_bh(&b->lock);
}
EXPORT_SYMBOL(tdb_entry_mark_complete);
Contributor:

The function is used in t/tdb_htrie.c, so it seems the HTrie isn't usable without it; shouldn't we move it to htrie.c?

			r = 0;
		} else {
			has_alive = true;
		}
Contributor:

It was a poor design to write polymorphic code for fixed- and variable-size records through macros, and in the new implementation I got rid of that approach. You can get the record type from the dbh handler (a TDB table can handle only one type of records). You can write a function that iterates over all the records using the record type from dbh; both callbacks can be passed to that function, and the function can return has_alive.
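
A self-contained sketch of that suggestion; all names here are hypothetical, not the actual TDB API:

	enum rec_type { REC_FIXED, REC_VAR };

	struct tbl_hdr {
		enum rec_type	type;	/* one record type per table */
		size_t		rec_n;
		void		**recs;
	};

	typedef bool (*rec_alive_cb)(void *rec);

	/* Dispatch once on the table's record type and report whether
	 * any live record remains. */
	static bool
	tbl_scan(struct tbl_hdr *h, rec_alive_cb fixed_cb, rec_alive_cb var_cb)
	{
		rec_alive_cb cb = (h->type == REC_FIXED) ? fixed_cb : var_cb;
		bool has_alive = false;
		size_t i;

		for (i = 0; i < h->rec_n; i++)
			if (cb(h->recs[i]))
				has_alive = true;

		return has_alive;
	}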

	    tfw_cache_entry_key_eq(db, req, ce))
	{
		if (!ce_best || tfw_cache_entry_age(ce)
		    < tfw_cache_entry_age(ce_best))
Contributor:

I think the code is wrong, as is the statement "Cache may store one or more responses to the effective Request URI." in the comment above. Why store so many entries? We recently discussed this with @EvgeniiMekhanik.

We can #ifdef 0 the code with a TODO for #508, but the point is that at the moment we are able to use, and do use, only one cache entry per key, and the rest of the entries are just garbage and extra iterations. We need to store only one cache entry at any given point in time. If the entry is stale, we remove it on insertion of a new response.

Contributor:

The key is the URI, so for one URI we can have HEAD and GET records which are both valid.

	 * Remove each entry with the current key, even fresh entries. We
	 * assume that if we reached this point, no valid entries exist.
	 */
	tdb_entry_remove(db, key, &tfw_cache_rec_eq_req, resp->req,
Contributor:

The same issue as for tfw_client_obtain(): a double index traversal, possibly in vain, plus bucket locking.

@@ -602,6 +718,11 @@ tdb_htrie_create_rec(TdbHdr *dbh, unsigned long off, unsigned long key,
	if (data)
		memcpy_fast(ptr, data, len);

	bckt = (TdbBucket *)((unsigned long)r & TDB_HTRIE_DMASK);
	bckt->flags |= (TDB_HTRIE_COMPLETE_BIT * complete);
Contributor:

This is for variable-length records only, right? It should be under if (TDB_HTRIE_VARLENRECS(dbh)).

Why is this applied to the whole bucket? If we have a fully resolved key, we may have collisions in the bucket, so even for variable-length records we can have many TdbVRecs in one bucket. I'd expect the bit to be set per TdbVRec.

	TdbHtrieNode *root = TDB_HTRIE_ROOT(dbh);

	o = tdb_htrie_descend(dbh, &root, key, &bits);
	if (!o)
		return NULL;
	b = TDB_PTR(dbh, o);
	read_lock_bh(&(b)->lock);
	if (!tdb_bucket_is_complete(b)) {
Contributor:

I don't think we need to implement #500 right now, but there is a comment in #500 which is aligned with incomplete buckets (why not records?), so it would be good to write a good TODO #500 comment and align the current implementation with the upcoming task.

Successfully merging this pull request may close these issues:

- TDB: kernel panic in tfw_client_ent_init
- Servicing stale cached responses and immediate purging