[4.1/master] clock-step related crash #1874

Closed
cinek810 opened this Issue Mar 11, 2016 · 30 comments

Projects

None yet

10 participants

@cinek810

Expected Behavior

In general child process restart is problematic, because we are loosing whole cache, which can lead to efficiency problems under high load. I'm attaching output of varnishadm panic.show

Steps to Reproduce (for bugs)

This happened only once after last varnish update, varnish mgmt process was running for 16 days. I don't think it ease to reproduce.

Your Environment

My varnishadm panic.show:

Last panic at: Thu, 10 Mar 2016 15:04:48 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 546:
  Condition(now >= sp->t_open) not true.
errno = 11 (Resource temporarily unavailable)
thread = (cache-worker)
version = varnish-4.1.1 revision 66bb824
ident = Linux,2.6.32-573.8.1.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x4328b3: varnishd() [0x4328b3]
  0x43976f: varnishd() [0x43976f]
  0x44e395: varnishd(HTTP1_Session+0x855) [0x44e395]
  0x439dbd: varnishd(SES_Proto_Req+0x5d) [0x439dbd]
  0x448a1d: varnishd() [0x448a1d]
  0x448e4b: varnishd() [0x448e4b]
  0x7f503ed2caa1: libpthread.so.0(+0x7aa1) [0x7f503ed2caa1]
  0x7f503ea7993d: libc.so.6(clone+0x6d) [0x7f503ea7993d]
req = 0x7f4949882020 {
  vxid = 973177293, step = R_STP_MISS,
  req_body = R_BODY_NONE,
  restarts = 0, esi_level = 1,
  sp = 0x7f4d43942020 {
    fd = 85, vxid = 973177236,
    client = 10.3.7.201 34115,
    step = S_STP_H1PROC,
  },
  worker = 0x7f503e6f7bc0 {
    stack = {0x7f503e6f8000 -> 0x7f503e6ec000},
    ws = 0x7f503e6f7db8 {
      id = \"wrk\",
      {s,f,r,e} = {0x7f503e6f7380,0x7f503e6f7380,(nil),+2040},
    },
    VCL::method = MISS,
    VCL::return = fetch,
    VCL::methods = {RECV, PASS, HASH, MISS, HIT, DELIVER},
  },
  ws = 0x7f4949882200 {
    id = \"req\",
    {s,f,r,e} = {0x7f4949884000,+80,(nil),+57336},
  },
  http[req] = 0x7f4949882298 {
    ws[req] = 0x7f4949882200,
    hdrs {
      \"GET\",
      \"/_fragment?_path=doctorId%3D170878%26doctor_updated_at%3D2016-03-09T02%253A40%253A14%252B01%253A00%26_format%3Dhtml%26_locale%3Dpl%26_controller%3DDoctorBundle%253AItems%253Aitems&_hash=oyUFV8q6E7Sq0axakCFry0NcSAkqHcerFZU5hz%2Bh2L0%3D\",
      \"HTTP/1.1\",
      \"X-UA-Device: desktop\",
      \"Host: www.XXXl\",
      \"X-Real-IP: X.X.X.X\",
      \"X-Forwarded-Port: 443\",
      \"X-Forwarded-Protocol: https\",
      \"X-Forwarded-Proto: https\",
      \"user-agent: Mozilla/5.0 (Windows NT 6.0; rv:44.0) Gecko/20100101 Firefox/44.0\",
      \"accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",
      \"referer: https://www.google.pl\",
      \"accept-language: pl,en-US;q=0.7,en;q=0.3\",
      \"cookie: __utma=151343218.1275728339.1404376761.1414331937.1418310542.5; smuuid=146fb60c8c7-491faf585a46-bbcb7cc7-be9b9fca-e81497cd-9bd6fd59b481; _ga=GA1.2.1275728339.1404376761; km_ai=%2B2YqZZQKAMdrm80K8LIM0V4v08M%3D; km_uq=; km_lv=x; __hstc=23043236.97eb64dfeb056498ff0609525fc06970.1418310543991.1418310543991.1418310543991.1; hsfirstvisit=XXX",
      \"X-Forwarded-For: 178.42.119.218, 10.3.7.201\",
      \"Surrogate-Capability: ESI/1.0\",
    },
  },
  vcl = {
    temp = warm
    srcname = {
      \"/etc/varnish/default.vcl\",
      \"Builtin\",
    },
  },
  objcore[REQ] = 0x7f4579a69580 {
    refcnt = 2,
    flags = 0xa,
    exp_flags = 0x0,
    exp = { 0.000000, 0.000000, 0.000000, 0.000000 }
    objhead = 0x7f4255c4a320,
    stevedore = (nil),
  },
  busyobj = 0x7f4d2609a020 {
    ws = 0x7f4d2609a0e0 {
      id = \"bo\",
      {s,f,r,e} = {0x7f4d2609bf98,+1824,(nil),+57440},
    },
    refcnt = 2,
    retries = 0, failed = 0, state = 1,
    flags = {do_stream},
    http_conn = 0x7f4d2609c608 {
      fd = 35,
      doclose = NULL,
      ws = (nil),
      {rxbuf_b, rxbuf_e} = {(nil), (nil)},
      {pipeline_b, pipeline_e} = {(nil), (nil)},
      content_length = 0,
      body_status = none,
      first_byte_timeout = 60.000000,
      between_bytes_timeout = 60.000000,
    },
    director_req = 0x7f42851a3240 {
      vcl_name = all,
      type = round-robin {
      },
    },
    director_resp = 0x7f489a56fd78 {
      vcl_name = dp10_3_7_24,
      type = backend {
        display_name = reload_2016-03-09T12:47:23.dp10_3_7_24,
        ipv4 = 10.3.7.24,
        port = 81,
        hosthdr = 10.3.7.24,
        health=healthy, admin_health=probe, changed=1457618158.0,
        n_conn = 5,
      },
    },
    http[bereq] = 0x7f4d2609a6a8 {
      ws[bo] = 0x7f4d2609a0e0,
      hdrs {
        \"GET\",
        \"/_fragment?_path=doctorId%3D170878%26doctor_updated_at%3D2016-03-09T02%253A40%253A14%252B01%253A00%26_format%3Dhtml%26_locale%3Dpl%26_controller%3DDoctorBundle%253AItems%253Aitems&_hash=oyUFV8q6E7Sq0axakCFry0NcSAkqHcerFZU5hz%2Bh2L0%3D\",
        \"HTTP/1.1\",
        \"X-UA-Device: desktop\",
        \"Host: www.XXX.XX\",
        \"X-Real-IP: X.X.X.X\",
        \"X-Forwarded-Port: 443\",
        \"X-Forwarded-Protocol: https\",
        \"X-Forwarded-Proto: https\",
        \"user-agent: Mozilla/5.0 (Windows NT 6.0; rv:44.0) Gecko/20100101 Firefox/44.0\",
        \"accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",
        \"referer: https://www.google.pl\",
        \"accept-language: pl,en-US;q=0.7,en;q=0.3\",
        \"cookie: __utma=151343218.1275728339.1404376761.1414331937.1418310542.5; smuuid=146fb60c8c7-491faf585a46-bbcb7cc7-be9b9fca-e81497cd-9bd6fd59b481; _ga=GA1.2.1275728339.1404376761; km_ai=%2B2YqZZQKAMdrm80K8LIM0V4v08M%3D; km_uq=; km_lv=x; __hstc=23043236.97eb64dfeb056498ff0609525fc06970.1418310543991.1418310543991.1418310543991.1; hsfirstvisit=XXXX",
        \"X-Forwarded-For: 178.42.119.218, 10.3.7.201\",
        \"Surrogate-Capability: ESI/1.0\",
        \"Accept-Encoding: gzip\",
        \"X-Varnish: 973177294\",
      },
    },
    http[beresp] = 0x7f4d2609ab20 {
      ws[bo] = 0x7f4d2609a0e0,
      hdrs {
      },
    },
    objcore[fetch] = 0x7f4579a69580 {
      refcnt = 2,
      flags = 0xa,
      exp_flags = 0x0,
      exp = { 0.000000, 0.000000, 0.000000, 0.000000 }
      objhead = 0x7f4255c4a320,
      stevedore = (nil),
    },
    vcl = {
      temp = warm
      srcname = {
        \"/etc/varnish/default.vcl\",
        \"Builtin\",
      },
    },
  },
  flags = {
  },
},
@nigoroll
Contributor

This looks like your clock has stepped back.
With ntpd, slewalways should help

@Dridi Dridi changed the title from Varnish child process restarted to [4.1] "Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 11, 2016
@fgsch
Member
fgsch commented Mar 12, 2016

@cinek810, thanks for submitting it!

@nigoroll we have a discussion about this on irc. @bsdphk is not convinced it was due to ntp.
I'm adding some code to dump t_open and t_idle. In any case, we should not crash.

@fgsch fgsch changed the title from [4.1] "Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. to Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 12, 2016
@fgsch fgsch changed the title from Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. to [4.1/master] Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 13, 2016
@jtru
jtru commented Mar 31, 2016

We recorded a number of similar panics after recently rebooting our machines running Varnish 4.1.2 on Debian Jessie into the new 4.4-based Debian bpo Kernel (Linux 4.4.0-0.bpo.1-amd64, ~ Linux 4.4.6; we had the Linux 4.3-based Debian bpo kernel before). I initially suspected that the systems' clocksources could pose a problem (because of a fair share of trouble with jumpy TSCs some 10 years back on SMP machines), but I was unable to verify that - the TSCs of all CPUs seem to work OK and increase monotonically.

Right now, I can't say if it's the new kernel, or the reboot that caused this problem to surface, but I'll try to collect more data (and revert one of the machines to run 4.3 again) and report in with any new data and/or findings.

@fgsch
Member
fgsch commented Mar 31, 2016

@jtru did ntp correct the clocks shortly after rebooting?

@jtru
jtru commented Mar 31, 2016

Unfortunately, we didn't log system time re-adjustments when the panic was first reported.

Since the values compared in the assert() that fails appear to use CLOCK_REALTIME clocks on our platform, the ntp theory could have merit. Maybe it'd make sense to use CLOCK_MONOTONIC or CLOCK_MONOTONIC_COARSE instead for these values? (I do not know what purpose the entries in the struct serve apart from appearing in that assert() statement, and/or why that assert() is there in the first place - maybe it's safe to just drop it?)

@nigoroll
Contributor

@jtru , ages ago I suggested the same but @bsdphk rejected it. IIUC the main reason was the impact of the change - for a marginal benefit, because clock stepping can easily be avoided.
can't you just make sure that you call ntpdate to step-set your clock before starting varnishd and use slewalways with ntpd as I suggested before?

@jtru
jtru commented Mar 31, 2016

@nigoroll the documentation of the ntp implementation we use doesn't mention that keyword/option/command; we have Debian Jessie's ntp package, version 1:4.2.6.p5+dfsg-7+deb8u1 (upstream: http://support.ntp.org) - can you refer me to where you found that documented?

Also, there's a lot of ntp-esque solutions out there that consume ntp updates (systemd-timesyncd, openntpd, chrony, the ntp reference implementation that we happen to have running, that new timesync-thing that @bsdphk is cooking up and I'm personally very much looking forward to ;), etc.), many of which probably don't even have a feature like that. Future users of Varnish who do not have their system-time-fixing services configured exactly the way Varnish needs them to are bound to experience this issue, if the root cause is related to system time re-adjustments (which we don't know for sure yet).

Of course we can try to implement a (somewhat fragile) workaround by having our varnish init script (actually: systemd service) depend on a somewhat fragile ntpdate invocation, but it probably makes more sense to avoid this issue in Varnish in the first place, to not have requests/responses thrown out for a reason that doesn't seem very good.

@nigoroll
Contributor

@jtru I am 100% with you. I still think we really should use monotonic time for all time-keeping which does not need wall clock time - and probably even base the wall clock calculations on a single wall clock base time (e.g. per-session) offset by monotonic time deltas.
The suggestion I made was meant as an attempt to gather more evidence pro/con the hypothesis.

@nigoroll
Contributor

ntpd: Seems like they've dropped slewalays, but there is -x

@jtru
jtru commented Mar 31, 2016

@nigoroll aiui, -x is supposed to disable the careful approach to slowly adjusting the system clock which it uses by default to avoid jumps. So setting -x should only make the problem worse, shouldn't it? :) (For the record, we're running without -x in ntpd's argv.)

@nigoroll
Contributor

(man ntpd)

   -x     Normally, the time is slewed if the offset is less than the step threshold, which is 128  ms  by  default,  and stepped  if  above  the  threshold.  This option sets the threshold to 600 s

I read this as: "with -x slew if the offset is less than 600 s

@nigoroll
Contributor

BTW, I have an idea regarding varnish use of wallclock time, I'll attempt a PoC

@fgsch
Member
fgsch commented Mar 31, 2016

From IRC on March 11:

10:22 < fgs> Someone reported another crash in SES_Delete:   Condition(now >= sp->t_open) not true.
10:22 < fgs> Just looking at it, if ntp is running or the date was changed that condition might not be true AFAICT
10:23 < phk> if the date changes, we can be screwed so many ways...
10:23 < scn> haven't we seen that a couple of times already?
10:23 < phk> originally I tried to write varnish to be NTP resistent, but that proved unworkable.
10:23 < daghf> #1820
10:23 -beb:#varnish-hacking- #1820: Assert error in SES_Delete [closed] https://www.varnish-cache.org/trac/ticket/1820
10:23 < gofer> #1820: Assert error in SES_Delete [closed] https://github.com/varnishcache/varnish-cache/issues/1820
10:23 < fgs> but what about small changes?
10:23 < phk> did they report it happening soon after reboot ?
10:24 < fgs> i'm trying to get more details
10:24 < fgs> he's running 4.1.1 but the assertion is still there
10:24 < fgs> so I don't think it matters
10:25 < phk> I'm not terribly keen on adding a mutext to VTIM to see if time moves backwards
10:25 < fgs> Can we just use the monotonic clock?
10:25 < phk> I guess we could do it on a per-thread basis
10:25 < phk> fgs, not for timestamps we have to report in VSL
10:25 < phk> (like t_open)
10:25 < fgs> Or just cope with such cases?
10:26 < phk> well, the assert is there to guard against programming/logic mistakes.
10:27 < fgs> maybe relax it a bit to cope with fluctuations?
10:28 < phk> I'd like to be a bit more certain of the diagnosis before I do that.
10:28 < fgs> system's been up for 94 days
10:28 < fgs> varnish for 16 before the crash
10:28 < phk> fgs,  and nothing in syslog about NTPD stepping time ?
10:28 < fgs> not that he could find
10:28 < fgs> he's going to open an issue
10:28 < phk> so I don't think that is the case then.
10:28 < phk> NTPD is quite noise about that.
10:29 < phk> fgs, would you by any chance have time to backport the graphviz thing to 4.1 and 4.0 so I don't have to work around them for the automatic doc-builds ?
10:29 < fgs> yes, assuming he's not sending the logs to /dev/null and he didn't fail with the search
10:30 < fgs> I can do that, yes
10:30 < fgs> But since it's basically impossible to trigger that assert any other way
10:30 < fgs> (or so far we failed to find another explanation)
10:30 < fgs> it might very well be the only possibility
10:32 < phk> it might be, but to me a code-bug sounds more probable still.
10:32 < phk> in particular when it doesn't happen in the first 10 minutes after reboot
@nigoroll
Contributor

@jtru have you been able to avoid the issue by avoiding clock stepping?

@nigoroll nigoroll self-assigned this Apr 21, 2016
@nigoroll
Contributor

@jtru please respond so we know what we're at

@jtru
jtru commented Apr 21, 2016

It's difficult to tell if adding -x to ntpd's argument vector actually solved the problem for us for good, but since we've done so, we could not reproduce the problem with varnishd segfaulting the way it did without it.

@nigoroll
Contributor
nigoroll commented Apr 21, 2016 edited

@jtru Thats the bit of information I was asking for, thanks.
Yes, the solution is not to fiddle ntpd. The solution is to not assert on real time deltas.

@nigoroll nigoroll added 4.0 master and removed more-info-needed labels Apr 21, 2016
@jay78238
jay78238 commented May 6, 2016

I have the same problem.

If the time was changed on my system, which it probably was, then it was only marginally (~1 second) since time update runs daily on the affected server.

@nigoroll
Contributor
nigoroll commented May 6, 2016

@jay78238 even small steps backwards in time will trigger the issue

@bsdphk
Contributor
bsdphk commented May 9, 2016 edited

@jay78238

You wrote "since time update runs daily on the affected server."

If that means that you are running some program (ntpdate ?) once a day in cron(8) to kick the time into submission, then I have say that is a really bad way to run a server. You should run ntpd or similar to get a continous timescale.

Also, if that is the root cause, then this crash should happen very soon after the time syncing event.

@bsdphk bsdphk changed the title from [4.1/master] Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. to [4.1/master] clock-step related crash ? May 9, 2016
@jay78238

@bsdphk: We currently dont use ntpd due to security.

I have one varnish-server that crashes probably for every time update, but I also have another that doesnt crash. It is/should be running on the same config/kernel/hardware/everything as the first one, but maybe I'm missing something.

@nigoroll
Contributor

Have now seen it myself on a system I got under control, will need to continue working on the solution

Panic at: Fri, 27 May 2016 08:05:49 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 524:
  Condition(now >= sp->t_open) not true.
thread = (cache-worker)
version = varnish-trunk revision c7db60b
ident = Linux,3.16.0-4-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x433412: pan_ic+0x152
  0x43b3af: varnishd() [0x43b3af]
  0x4351ce: Req_Cleanup+0x1ee
  0x44f332: HTTP1_Session+0xd2
  0x44fe17: http1_req+0x57
  0x44985a: WRK_Thread+0x46a
  0x449cbb: pool_thread+0x2b
  0x7f72c07600a4: libpthread.so.0(+0x80a4) [0x7f72c07600a4]
  0x7f72c049587d: libc.so.6(clone+0x6d) [0x7f72c049587d]
req = 0x7f726620a020 {
  MAGIC 0x00000000 (Should:REQ_MAGIC/0x2751aaa1)
  vxid = 0, transport = NULL
  step = R_STP_RESTART,
  req_body = R_BODY_INIT,
  restarts = 0, esi_level = 0,
  ws = 0x7f726620a1f8 {
    MAGIC 0x00000000 (Should:WS_MAGIC/0x35fac554)
    OVERFLOWED id = \"\",
    {s, f, r, e} = {(nil), (nil), (nil), (nil)},
  },
  http[req] = (nil) {
"

the code is bff2e6d plus #1945

@nigoroll nigoroll changed the title from [4.1/master] clock-step related crash ? to [4.1/master] clock-step related crash May 29, 2016
@nigoroll
Contributor

I am now really confident that this is clock stepping related.

@lkarsten
Contributor
lkarsten commented Jun 14, 2016 edited

Removing backport label so it doesn't clutter up the todo list for 4.1.3. Will be added again automatically when the issue is resolved.

@Rod-Gomes
Rod-Gomes commented Jul 15, 2016 edited

Same problem here:

Jul 15 01:33:03 server systemd: Time has been changed
Jul 15 01:33:03 server abrt-hook-ccpp: Can't open 'core.2114' at '/var/lib/varnish/server.example.com': Permission denied
Jul 15 01:33:36 server abrt-hook-ccpp: /var/spool/abrt is 8979056333 bytes (more than 1279MiB), deleting 'ccpp-2016-07-13-17:52:21-28439'
Jul 15 01:33:36 server abrt-hook-ccpp: /var/spool/abrt is 8942707575 bytes (more than 1279MiB), deleting 'ccpp-2016-07-07-04:14:02-692'
Jul 15 01:33:37 server varnishd[2090]: Child (2114) not responding to CLI, killing it.
Jul 15 01:33:38 server varnishd[2090]: Child (2114) not responding to CLI, killing it.
Jul 15 01:33:38 server varnishd[2090]: Child (2114) died signal=6 (core dumped)
Jul 15 01:33:38 server varnishd[2090]: Child (2114) Last panic at: Fri, 15 Jul 2016 04:33:38 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 544:
  Condition(now >= sp->t_open) not true.
thread = (cache-worker)
version = varnish-4.1.3 revision 5e3b6d2
ident = Linux,3.10.0-327.22.2.el7.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x4342f6: varnishd() [0x4342f6]
  0x43bd0f: varnishd() [0x43bd0f]
  0x450625: varnishd(HTTP1_Session+0x585) [0x450625]
  0x43a91d: varnishd(SES_Proto_Req+0x5d) [0x43a91d]
  0x44aa61: varnishd() [0x44aa61]
  0x44aecb: varnishd() [0x44aecb]
  0x7fdd1b99bdc5: libpthread.so.0(+0x7dc5) [0x7fdd1b99bdc5]
  0x7fdd1b6c8ced: libc.so.6(clone+0x6d) [0x7fdd1b6c8ced]
req = 0x7fdbf18e8020 {
  vxid = 0, step = R_STP_DELIVER,
  req_body = R_BODY_INIT,
  restarts = 0, esi_level = 0,
  sp = 0x7fdc9dbc7420 {
    fd = 345, vxid = 71459936,
    client = 172.68.26.118 15481,
    t_open = 1468557180.519868,
    t_idle = 1468557183.111095,
    step = S_STP_H1NEWREQ,
  },
  ws = 0x7fdbf18e8200 {
    id = \"req\",
    {s,f,r,e} = {0x7fdbf18ea000,0x7fdbf18ea000,+32768,+57336},
  },
  http_conn = 0x7fdbf18e8128 {
    fd = 345,
    doclose = NULL,
    ws = 0x7fdbf18e8200,
    {rxbuf_b, rxbuf_e} = {0x7fdbf18ea000, 0x7fdbf18ea000},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = -1,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7fdbf18e8298 {
    ws[req] = 0x7fdbf18e8200,
    hdrs {
      \"\",
      \"/wp-content/plugins/foobox-image-lightbox/js/foobox.free.min.js\",
      \"HTTP/1.1\",
      \"Host: example.com\",
      \"Connection: Keep-Alive\",
      \"CF-IPCountry: BR\",
      \"CF-RAY: 2c2a7e7c490b4bc9-GRU\",
      \"X-Forwarded-Proto: http\",
      \"CF-Visitor: {\"scheme\":\"http\"}\",
      \"Accept: */*\",
      \"CF-Connecting-IP: 177.139.127.108\",
      \"Varnish-Client-IP: 177.139.127.108\",
      \"X-Forwarded-For: 177.139.127.108, 172.68.26.118, 172.68.26.118\",
      \"Accept-Encoding: gzip\",
    },
  },
  flags = {
  },
},

"
Jul 15 01:33:38 server varnishd[2090]: Child (19227) said Child starts
Jul 15 01:33:39 server abrt-server: Gerando core_backtrace
time:           Sex 15 Jul 2016 01:33:03 BRT
cmdline:        /usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a 123.456.789.10:80 -T 127.0.0.1:6082 -S /etc/varnish/secret -s malloc,10G -p thread_pool_min=2000 -p thread_pool_max=5000 -p thread_pool_timeout=300
uid:            978 (varnish)
abrt_version:   2.1.11
event_log:
executable:     /usr/sbin/varnishd
global_pid:     2114
hostname:       server.example.com
kernel:         3.10.0-327.22.2.el7.x86_64
last_occurrence: 1468557183
pid:            2114
pkg_arch:       x86_64
pkg_epoch:      0
pkg_name:       varnish
pkg_release:    1.el7
pkg_version:    4.1.3
pwd:            /var/lib/varnish/server.example.com
runlevel:       N 5
username:       varnish
@nigoroll
Contributor

@Rod-Gomes thank you for the report. For now, please apply the workaround to avoid stepping the clock backwards. We want to change how varnish keeps time internally which will solve this issue. Maybe we should consider to still point-fix this issue - I'll try to bring this up during the next bugwash.

@Rod-Gomes

@nigoroll I have applied the workaround. Thanks!

@nigoroll nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Sep 12, 2016
@nigoroll nigoroll remove an assertion which we may easily trip over when the clock is s…
…tepped

The assertion itself is semantically correct, but using wallclock time for
this kind of check is inappropriate.

The mid-term plan is to re-work time keeping in varnish and re-check where
using a monotonic clock is more appropriate, but for now we should
at least not panic just because wall clock time was stepped back.

Fixes #1874 for now
b295062
@nigoroll nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Sep 12, 2016
@nigoroll nigoroll remove an assertion which we may easily trip over when the clock is s…
…tepped

The assertion itself is semantically correct, but using wallclock time for
this kind of check is inappropriate.

The mid-term plan is to re-work time keeping in varnish and re-check where
using a monotonic clock is more appropriate, but for now we should
at least not panic just because wall clock time was stepped back.

Fixes #1874 for now
5f2cc24
@mbgrydeland mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Oct 28, 2016
@mbgrydeland mbgrydeland Allow up to clock_step error when checking the session timestamps
When checking the timestamps on deleting a session, allow up to
clock_step seconds error before bailing.

Fixes: #1874
13e0d67
@nigoroll
Contributor
nigoroll commented Oct 28, 2016 edited

Brief writeup of an irc discussion between @mbgrydeland any myself on #2122 and #2087 - my own perspective, please feel free to correct:

  • The overall goal should be to become real time stepping agnostic and use monotonic wherever possible
  • offsetting some real time start point with a mono delta can help us in many situations
  • I'd like to work on a new VTIM api, but had to postpone work on this for the last couple of months

My personal opinion is that we should not promote anything which suggests that we rely more on a continuous real clock time which does not exist. If I understand Martins opinion correctly, he thinks that as long as we rely on the real clock not stepping, having some protection is better than none.

So my recommendation:

  • merge Martins panic patch c04f040
  • merge my session patch 5f2cc24
  • create an optional, configurable time guardian thread, which tries to detect clock stepping and can either log warnings or trigger a panic. This would use a tolerance margin comparing deltas of real and mono time. A short log of past events could be kept in-core and dumped with a panic

Mid/Long term I think we should

  • replace real with mono wherever possible
  • use real + mono offset in all other places
@nigoroll
Contributor
nigoroll commented Nov 1, 2016

during yesterdays bugwash my recommendation did not find agreement and it was decided that #2122 is to go in

@mbgrydeland mbgrydeland added a commit that closed this issue Nov 4, 2016
@mbgrydeland mbgrydeland Allow up to clock_step error when checking the session timestamps
When checking the timestamps on deleting a session, allow up to
clock_step seconds error before bailing.

Fixes: #1874
c2a62c3
@hermunn hermunn added a commit that referenced this issue Nov 16, 2016
@mbgrydeland @hermunn mbgrydeland + hermunn Allow up to clock_step error when checking the session timestamps
When checking the timestamps on deleting a session, allow up to
clock_step seconds error before bailing.

Fixes: #1874
4f58bb8
@hermunn
Contributor
hermunn commented Nov 16, 2016

Backport review: Since #2122 has been packported, there is nothing more to do here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment