[4.1/master] clock-step related crash #1874

Closed
cinek810 opened this Issue Mar 11, 2016 · 33 comments

cinek810 commented Mar 11, 2016

Expected Behavior

In general, a child process restart is problematic because we lose the whole cache, which can lead to efficiency problems under high load. I'm attaching the output of varnishadm panic.show.

Steps to Reproduce (for bugs)

This happened only once, after the last varnish update; the varnish mgmt process had been running for 16 days. I don't think it is easy to reproduce.

Your Environment

My varnishadm panic.show:

Last panic at: Thu, 10 Mar 2016 15:04:48 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 546:
  Condition(now >= sp->t_open) not true.
errno = 11 (Resource temporarily unavailable)
thread = (cache-worker)
version = varnish-4.1.1 revision 66bb824
ident = Linux,2.6.32-573.8.1.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x4328b3: varnishd() [0x4328b3]
  0x43976f: varnishd() [0x43976f]
  0x44e395: varnishd(HTTP1_Session+0x855) [0x44e395]
  0x439dbd: varnishd(SES_Proto_Req+0x5d) [0x439dbd]
  0x448a1d: varnishd() [0x448a1d]
  0x448e4b: varnishd() [0x448e4b]
  0x7f503ed2caa1: libpthread.so.0(+0x7aa1) [0x7f503ed2caa1]
  0x7f503ea7993d: libc.so.6(clone+0x6d) [0x7f503ea7993d]
req = 0x7f4949882020 {
  vxid = 973177293, step = R_STP_MISS,
  req_body = R_BODY_NONE,
  restarts = 0, esi_level = 1,
  sp = 0x7f4d43942020 {
    fd = 85, vxid = 973177236,
    client = 10.3.7.201 34115,
    step = S_STP_H1PROC,
  },
  worker = 0x7f503e6f7bc0 {
    stack = {0x7f503e6f8000 -> 0x7f503e6ec000},
    ws = 0x7f503e6f7db8 {
      id = \"wrk\",
      {s,f,r,e} = {0x7f503e6f7380,0x7f503e6f7380,(nil),+2040},
    },
    VCL::method = MISS,
    VCL::return = fetch,
    VCL::methods = {RECV, PASS, HASH, MISS, HIT, DELIVER},
  },
  ws = 0x7f4949882200 {
    id = \"req\",
    {s,f,r,e} = {0x7f4949884000,+80,(nil),+57336},
  },
  http[req] = 0x7f4949882298 {
    ws[req] = 0x7f4949882200,
    hdrs {
      \"GET\",
      \"/_fragment?_path=doctorId%3D170878%26doctor_updated_at%3D2016-03-09T02%253A40%253A14%252B01%253A00%26_format%3Dhtml%26_locale%3Dpl%26_controller%3DDoctorBundle%253AItems%253Aitems&_hash=oyUFV8q6E7Sq0axakCFry0NcSAkqHcerFZU5hz%2Bh2L0%3D\",
      \"HTTP/1.1\",
      \"X-UA-Device: desktop\",
      \"Host: www.XXXl\",
      \"X-Real-IP: X.X.X.X\",
      \"X-Forwarded-Port: 443\",
      \"X-Forwarded-Protocol: https\",
      \"X-Forwarded-Proto: https\",
      \"user-agent: Mozilla/5.0 (Windows NT 6.0; rv:44.0) Gecko/20100101 Firefox/44.0\",
      \"accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",
      \"referer: https://www.google.pl\",
      \"accept-language: pl,en-US;q=0.7,en;q=0.3\",
      \"cookie: __utma=151343218.1275728339.1404376761.1414331937.1418310542.5; smuuid=146fb60c8c7-491faf585a46-bbcb7cc7-be9b9fca-e81497cd-9bd6fd59b481; _ga=GA1.2.1275728339.1404376761; km_ai=%2B2YqZZQKAMdrm80K8LIM0V4v08M%3D; km_uq=; km_lv=x; __hstc=23043236.97eb64dfeb056498ff0609525fc06970.1418310543991.1418310543991.1418310543991.1; hsfirstvisit=XXX",
      \"X-Forwarded-For: 178.42.119.218, 10.3.7.201\",
      \"Surrogate-Capability: ESI/1.0\",
    },
  },
  vcl = {
    temp = warm
    srcname = {
      \"/etc/varnish/default.vcl\",
      \"Builtin\",
    },
  },
  objcore[REQ] = 0x7f4579a69580 {
    refcnt = 2,
    flags = 0xa,
    exp_flags = 0x0,
    exp = { 0.000000, 0.000000, 0.000000, 0.000000 }
    objhead = 0x7f4255c4a320,
    stevedore = (nil),
  },
  busyobj = 0x7f4d2609a020 {
    ws = 0x7f4d2609a0e0 {
      id = \"bo\",
      {s,f,r,e} = {0x7f4d2609bf98,+1824,(nil),+57440},
    },
    refcnt = 2,
    retries = 0, failed = 0, state = 1,
    flags = {do_stream},
    http_conn = 0x7f4d2609c608 {
      fd = 35,
      doclose = NULL,
      ws = (nil),
      {rxbuf_b, rxbuf_e} = {(nil), (nil)},
      {pipeline_b, pipeline_e} = {(nil), (nil)},
      content_length = 0,
      body_status = none,
      first_byte_timeout = 60.000000,
      between_bytes_timeout = 60.000000,
    },
    director_req = 0x7f42851a3240 {
      vcl_name = all,
      type = round-robin {
      },
    },
    director_resp = 0x7f489a56fd78 {
      vcl_name = dp10_3_7_24,
      type = backend {
        display_name = reload_2016-03-09T12:47:23.dp10_3_7_24,
        ipv4 = 10.3.7.24,
        port = 81,
        hosthdr = 10.3.7.24,
        health=healthy, admin_health=probe, changed=1457618158.0,
        n_conn = 5,
      },
    },
    http[bereq] = 0x7f4d2609a6a8 {
      ws[bo] = 0x7f4d2609a0e0,
      hdrs {
        \"GET\",
        \"/_fragment?_path=doctorId%3D170878%26doctor_updated_at%3D2016-03-09T02%253A40%253A14%252B01%253A00%26_format%3Dhtml%26_locale%3Dpl%26_controller%3DDoctorBundle%253AItems%253Aitems&_hash=oyUFV8q6E7Sq0axakCFry0NcSAkqHcerFZU5hz%2Bh2L0%3D\",
        \"HTTP/1.1\",
        \"X-UA-Device: desktop\",
        \"Host: www.XXX.XX\",
        \"X-Real-IP: X.X.X.X\",
        \"X-Forwarded-Port: 443\",
        \"X-Forwarded-Protocol: https\",
        \"X-Forwarded-Proto: https\",
        \"user-agent: Mozilla/5.0 (Windows NT 6.0; rv:44.0) Gecko/20100101 Firefox/44.0\",
        \"accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",
        \"referer: https://www.google.pl\",
        \"accept-language: pl,en-US;q=0.7,en;q=0.3\",
        \"cookie: __utma=151343218.1275728339.1404376761.1414331937.1418310542.5; smuuid=146fb60c8c7-491faf585a46-bbcb7cc7-be9b9fca-e81497cd-9bd6fd59b481; _ga=GA1.2.1275728339.1404376761; km_ai=%2B2YqZZQKAMdrm80K8LIM0V4v08M%3D; km_uq=; km_lv=x; __hstc=23043236.97eb64dfeb056498ff0609525fc06970.1418310543991.1418310543991.1418310543991.1; hsfirstvisit=XXXX",
        \"X-Forwarded-For: 178.42.119.218, 10.3.7.201\",
        \"Surrogate-Capability: ESI/1.0\",
        \"Accept-Encoding: gzip\",
        \"X-Varnish: 973177294\",
      },
    },
    http[beresp] = 0x7f4d2609ab20 {
      ws[bo] = 0x7f4d2609a0e0,
      hdrs {
      },
    },
    objcore[fetch] = 0x7f4579a69580 {
      refcnt = 2,
      flags = 0xa,
      exp_flags = 0x0,
      exp = { 0.000000, 0.000000, 0.000000, 0.000000 }
      objhead = 0x7f4255c4a320,
      stevedore = (nil),
    },
    vcl = {
      temp = warm
      srcname = {
        \"/etc/varnish/default.vcl\",
        \"Builtin\",
      },
    },
  },
  flags = {
  },
},
nigoroll (Contributor) commented Mar 11, 2016

This looks like your clock has stepped back.
With ntpd, slewalways should help

Dridi changed the title Varnish child process restarted [4.1] "Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 11, 2016

fgsch (Member) commented Mar 12, 2016

@cinek810, thanks for submitting it!

@nigoroll we had a discussion about this on IRC. @bsdphk is not convinced it was due to ntp.
I'm adding some code to dump t_open and t_idle. In any case, we should not crash.

fgsch changed the title [4.1] "Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 12, 2016

fgsch changed the title Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. [4.1/master] Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. Mar 13, 2016

jtru commented Mar 31, 2016

We recorded a number of similar panics after recently rebooting our machines running Varnish 4.1.2 on Debian Jessie into the new 4.4-based Debian bpo Kernel (Linux 4.4.0-0.bpo.1-amd64, ~ Linux 4.4.6; we had the Linux 4.3-based Debian bpo kernel before). I initially suspected that the systems' clocksources could pose a problem (because of a fair share of trouble with jumpy TSCs some 10 years back on SMP machines), but I was unable to verify that - the TSCs of all CPUs seem to work OK and increase monotonically.

Right now, I can't say if it's the new kernel, or the reboot that caused this problem to surface, but I'll try to collect more data (and revert one of the machines to run 4.3 again) and report in with any new data and/or findings.

fgsch (Member) commented Mar 31, 2016

@jtru did ntp correct the clocks shortly after rebooting?

jtru commented Mar 31, 2016

Unfortunately, we didn't log system time re-adjustments when the panic was first reported.

Since the values compared in the assert() that fails appear to use CLOCK_REALTIME clocks on our platform, the ntp theory could have merit. Maybe it'd make sense to use CLOCK_MONOTONIC or CLOCK_MONOTONIC_COARSE instead for these values? (I do not know what purpose the entries in the struct serve apart from appearing in that assert() statement, and/or why that assert() is there in the first place - maybe it's safe to just drop it?)

nigoroll (Contributor) commented Mar 31, 2016

@jtru, ages ago I suggested the same, but @bsdphk rejected it. IIUC the main reason was the impact of the change for a marginal benefit, because clock stepping can easily be avoided.
Can't you just make sure that you call ntpdate to step-set your clock before starting varnishd, and use slewalways with ntpd as I suggested before?

jtru commented Mar 31, 2016

@nigoroll the documentation of the ntp implementation we use doesn't mention that keyword/option/command; we have Debian Jessie's ntp package, version 1:4.2.6.p5+dfsg-7+deb8u1 (upstream: http://support.ntp.org) - can you refer me to where you found that documented?

Also, there's a lot of ntp-esque solutions out there that consume ntp updates (systemd-timesyncd, openntpd, chrony, the ntp reference implementation that we happen to have running, that new timesync-thing that @bsdphk is cooking up and I'm personally very much looking forward to ;), etc.), many of which probably don't even have a feature like that. Future users of Varnish who do not have their system-time-fixing services configured exactly the way Varnish needs them to are bound to experience this issue, if the root cause is related to system time re-adjustments (which we don't know for sure yet).

Of course we can try to implement a workaround by having our varnish init script (actually: systemd service) depend on a somewhat fragile ntpdate invocation, but it probably makes more sense to avoid this issue in Varnish in the first place, so that requests/responses aren't thrown out for a reason that doesn't seem very good.

nigoroll (Contributor) commented Mar 31, 2016

@jtru I am 100% with you. I still think we really should use monotonic time for all time-keeping which does not need wall clock time - and probably even base the wall clock calculations on a single wall clock base time (e.g. per-session) offset by monotonic time deltas.
The suggestion I made was meant as an attempt to gather more evidence pro/con the hypothesis.

nigoroll (Contributor) commented Mar 31, 2016

ntpd: Seems like they've dropped slewalways, but there is -x

jtru commented Mar 31, 2016

@nigoroll aiui, -x is supposed to disable the careful approach to slowly adjusting the system clock which it uses by default to avoid jumps. So setting -x should only make the problem worse, shouldn't it? :) (For the record, we're running without -x in ntpd's argv.)

nigoroll (Contributor) commented Mar 31, 2016

(man ntpd)

   -x     Normally, the time is slewed if the offset is less than the step threshold, which is 128 ms by default, and stepped if above the threshold. This option sets the threshold to 600 s

I read this as: "with -x, slew if the offset is less than 600 s".
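On distributions that start ntpd through an init wrapper, the -x flag is typically passed via the daemon options variable. A sketch for the Debian/Ubuntu ntp package is below; the exact file path and variable name are distribution-specific assumptions, so check your own init configuration.

```shell
# /etc/default/ntp  (Debian/Ubuntu ntp package; path varies by distro)
# Add -x so offsets up to 600 s are slewed instead of stepped; see ntpd(8).
NTPD_OPTS='-g -x'
```

After editing, restart the ntpd service for the option to take effect.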

nigoroll (Contributor) commented Mar 31, 2016

BTW, I have an idea regarding varnish's use of wallclock time; I'll attempt a PoC

fgsch (Member) commented Mar 31, 2016

From IRC on March 11:

10:22 < fgs> Someone reported another crash in SES_Delete:   Condition(now >= sp->t_open) not true.
10:22 < fgs> Just looking at it, if ntp is running or the date was changed that condition might not be true AFAICT
10:23 < phk> if the date changes, we can be screwed so many ways...
10:23 < scn> haven't we seen that a couple of times already?
10:23 < phk> originally I tried to write varnish to be NTP resistant, but that proved unworkable.
10:23 < daghf> #1820
10:23 -beb:#varnish-hacking- #1820: Assert error in SES_Delete [closed] https://www.varnish-cache.org/trac/ticket/1820
10:23 < gofer> #1820: Assert error in SES_Delete [closed] https://github.com/varnishcache/varnish-cache/issues/1820
10:23 < fgs> but what about small changes?
10:23 < phk> did they report it happening soon after reboot ?
10:24 < fgs> i'm trying to get more details
10:24 < fgs> he's running 4.1.1 but the assertion is still there
10:24 < fgs> so I don't think it matters
10:25 < phk> I'm not terribly keen on adding a mutex to VTIM to see if time moves backwards
10:25 < fgs> Can we just use the monotonic clock?
10:25 < phk> I guess we could do it on a per-thread basis
10:25 < phk> fgs, not for timestamps we have to report in VSL
10:25 < phk> (like t_open)
10:25 < fgs> Or just cope with such cases?
10:26 < phk> well, the assert is there to guard against programming/logic mistakes.
10:27 < fgs> maybe relax it a bit to cope with fluctuations?
10:28 < phk> I'd like to be a bit more certain of the diagnosis before I do that.
10:28 < fgs> system's been up for 94 days
10:28 < fgs> varnish for 16 before the crash
10:28 < phk> fgs,  and nothing in syslog about NTPD stepping time ?
10:28 < fgs> not that he could find
10:28 < fgs> he's going to open an issue
10:28 < phk> so I don't think that is the case then.
10:28 < phk> NTPD is quite noisy about that.
10:29 < phk> fgs, would you by any chance have time to backport the graphviz thing to 4.1 and 4.0 so I don't have to work around them for the automatic doc-builds ?
10:29 < fgs> yes, assuming he's not sending the logs to /dev/null and he didn't fail with the search
10:30 < fgs> I can do that, yes
10:30 < fgs> But since it's basically impossible to trigger that assert any other way
10:30 < fgs> (or so far we failed to find another explanation)
10:30 < fgs> it might very well be the only possibility
10:32 < phk> it might be, but to me a code-bug sounds more probable still.
10:32 < phk> in particular when it doesn't happen in the first 10 minutes after reboot
nigoroll (Contributor) commented Apr 18, 2016

@jtru have you been able to avoid the issue by avoiding clock stepping?

nigoroll (Contributor) commented Apr 21, 2016

@jtru please respond so we know where we're at

jtru commented Apr 21, 2016

It's difficult to tell whether adding -x to ntpd's argument vector actually solved the problem for good, but since we did so, we have not been able to reproduce the crash varnishd exhibited without it.

nigoroll (Contributor) commented Apr 21, 2016

@jtru That's the bit of information I was asking for, thanks.
Yes, the solution is not to fiddle with ntpd. The solution is to not assert on real-time deltas.

jay78238 commented May 6, 2016

I have the same problem.

If the time was changed on my system, which it probably was, then it was only marginally (~1 second), since the time update runs daily on the affected server.

nigoroll (Contributor) commented May 6, 2016

@jay78238 even small steps backwards in time will trigger the issue

bsdphk (Contributor) commented May 9, 2016

@jay78238

You wrote "since time update runs daily on the affected server."

If that means that you are running some program (ntpdate ?) once a day from cron(8) to kick the time into submission, then I have to say that is a really bad way to run a server. You should run ntpd or similar to get a continuous timescale.

Also, if that is the root cause, then this crash should happen very soon after the time syncing event.

bsdphk changed the title [4.1/master] Assert error in SES_Delete(), cache/cache_session.c line 546: Condition(now >= sp->t_open) not true. [4.1/master] clock-step related crash ? May 9, 2016

jay78238 commented May 10, 2016

@bsdphk: We currently don't use ntpd due to security concerns.

I have one varnish server that crashes on probably every time update, but I also have another that doesn't crash. It is/should be running the same config/kernel/hardware/everything as the first one, but maybe I'm missing something.

nigoroll (Contributor) commented May 27, 2016

I have now seen it myself on a system I have under control; I will need to continue working on the solution.

Panic at: Fri, 27 May 2016 08:05:49 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 524:
  Condition(now >= sp->t_open) not true.
thread = (cache-worker)
version = varnish-trunk revision c7db60b
ident = Linux,3.16.0-4-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x433412: pan_ic+0x152
  0x43b3af: varnishd() [0x43b3af]
  0x4351ce: Req_Cleanup+0x1ee
  0x44f332: HTTP1_Session+0xd2
  0x44fe17: http1_req+0x57
  0x44985a: WRK_Thread+0x46a
  0x449cbb: pool_thread+0x2b
  0x7f72c07600a4: libpthread.so.0(+0x80a4) [0x7f72c07600a4]
  0x7f72c049587d: libc.so.6(clone+0x6d) [0x7f72c049587d]
req = 0x7f726620a020 {
  MAGIC 0x00000000 (Should:REQ_MAGIC/0x2751aaa1)
  vxid = 0, transport = NULL
  step = R_STP_RESTART,
  req_body = R_BODY_INIT,
  restarts = 0, esi_level = 0,
  ws = 0x7f726620a1f8 {
    MAGIC 0x00000000 (Should:WS_MAGIC/0x35fac554)
    OVERFLOWED id = \"\",
    {s, f, r, e} = {(nil), (nil), (nil), (nil)},
  },
  http[req] = (nil) {
"

The code is bff2e6d plus #1945.

nigoroll changed the title [4.1/master] clock-step related crash ? [4.1/master] clock-step related crash May 29, 2016

nigoroll (Contributor) commented May 29, 2016

I am now really confident that this is clock stepping related.

lkarsten (Contributor) commented Jun 14, 2016

Removing the backport label so it doesn't clutter up the todo list for 4.1.3. It will be added again automatically when the issue is resolved.

Rod-Gomes commented Jul 15, 2016

Same problem here:

Jul 15 01:33:03 server systemd: Time has been changed
Jul 15 01:33:03 server abrt-hook-ccpp: Can't open 'core.2114' at '/var/lib/varnish/server.example.com': Permission denied
Jul 15 01:33:36 server abrt-hook-ccpp: /var/spool/abrt is 8979056333 bytes (more than 1279MiB), deleting 'ccpp-2016-07-13-17:52:21-28439'
Jul 15 01:33:36 server abrt-hook-ccpp: /var/spool/abrt is 8942707575 bytes (more than 1279MiB), deleting 'ccpp-2016-07-07-04:14:02-692'
Jul 15 01:33:37 server varnishd[2090]: Child (2114) not responding to CLI, killing it.
Jul 15 01:33:38 server varnishd[2090]: Child (2114) not responding to CLI, killing it.
Jul 15 01:33:38 server varnishd[2090]: Child (2114) died signal=6 (core dumped)
Jul 15 01:33:38 server varnishd[2090]: Child (2114) Last panic at: Fri, 15 Jul 2016 04:33:38 GMT
"Assert error in SES_Delete(), cache/cache_session.c line 544:
  Condition(now >= sp->t_open) not true.
thread = (cache-worker)
version = varnish-4.1.3 revision 5e3b6d2
ident = Linux,3.10.0-327.22.2.el7.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
  0x4342f6: varnishd() [0x4342f6]
  0x43bd0f: varnishd() [0x43bd0f]
  0x450625: varnishd(HTTP1_Session+0x585) [0x450625]
  0x43a91d: varnishd(SES_Proto_Req+0x5d) [0x43a91d]
  0x44aa61: varnishd() [0x44aa61]
  0x44aecb: varnishd() [0x44aecb]
  0x7fdd1b99bdc5: libpthread.so.0(+0x7dc5) [0x7fdd1b99bdc5]
  0x7fdd1b6c8ced: libc.so.6(clone+0x6d) [0x7fdd1b6c8ced]
req = 0x7fdbf18e8020 {
  vxid = 0, step = R_STP_DELIVER,
  req_body = R_BODY_INIT,
  restarts = 0, esi_level = 0,
  sp = 0x7fdc9dbc7420 {
    fd = 345, vxid = 71459936,
    client = 172.68.26.118 15481,
    t_open = 1468557180.519868,
    t_idle = 1468557183.111095,
    step = S_STP_H1NEWREQ,
  },
  ws = 0x7fdbf18e8200 {
    id = \"req\",
    {s,f,r,e} = {0x7fdbf18ea000,0x7fdbf18ea000,+32768,+57336},
  },
  http_conn = 0x7fdbf18e8128 {
    fd = 345,
    doclose = NULL,
    ws = 0x7fdbf18e8200,
    {rxbuf_b, rxbuf_e} = {0x7fdbf18ea000, 0x7fdbf18ea000},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = -1,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7fdbf18e8298 {
    ws[req] = 0x7fdbf18e8200,
    hdrs {
      \"\",
      \"/wp-content/plugins/foobox-image-lightbox/js/foobox.free.min.js\",
      \"HTTP/1.1\",
      \"Host: example.com\",
      \"Connection: Keep-Alive\",
      \"CF-IPCountry: BR\",
      \"CF-RAY: 2c2a7e7c490b4bc9-GRU\",
      \"X-Forwarded-Proto: http\",
      \"CF-Visitor: {\"scheme\":\"http\"}\",
      \"Accept: */*\",
      \"CF-Connecting-IP: 177.139.127.108\",
      \"Varnish-Client-IP: 177.139.127.108\",
      \"X-Forwarded-For: 177.139.127.108, 172.68.26.118, 172.68.26.118\",
      \"Accept-Encoding: gzip\",
    },
  },
  flags = {
  },
},

"
Jul 15 01:33:38 server varnishd[2090]: Child (19227) said Child starts
Jul 15 01:33:39 server abrt-server: Gerando core_backtrace
time:           Sex 15 Jul 2016 01:33:03 BRT
cmdline:        /usr/sbin/varnishd -P /var/run/varnish.pid -f /etc/varnish/default.vcl -a 123.456.789.10:80 -T 127.0.0.1:6082 -S /etc/varnish/secret -s malloc,10G -p thread_pool_min=2000 -p thread_pool_max=5000 -p thread_pool_timeout=300
uid:            978 (varnish)
abrt_version:   2.1.11
event_log:
executable:     /usr/sbin/varnishd
global_pid:     2114
hostname:       server.example.com
kernel:         3.10.0-327.22.2.el7.x86_64
last_occurrence: 1468557183
pid:            2114
pkg_arch:       x86_64
pkg_epoch:      0
pkg_name:       varnish
pkg_release:    1.el7
pkg_version:    4.1.3
pwd:            /var/lib/varnish/server.example.com
runlevel:       N 5
username:       varnish
nigoroll (Contributor) commented Jul 15, 2016

@Rod-Gomes thank you for the report. For now, please apply the workaround to avoid stepping the clock backwards. We want to change how varnish keeps time internally, which will solve this issue. Maybe we should still consider a point fix for this issue; I'll try to bring this up during the next bugwash.

Rod-Gomes commented Jul 15, 2016

@nigoroll I have applied the workaround. Thanks!

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Sep 12, 2016

remove an assertion which we may easily trip over when the clock is stepped

The assertion itself is semantically correct, but using wallclock time for
this kind of check is inappropriate.

The mid-term plan is to re-work time keeping in varnish and re-check where
using a monotonic clock is more appropriate, but for now we should
at least not panic just because wall clock time was stepped back.

Fixes varnishcache#1874 for now

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Sep 12, 2016

remove an assertion which we may easily trip over when the clock is stepped

The assertion itself is semantically correct, but using wallclock time for
this kind of check is inappropriate.

The mid-term plan is to re-work time keeping in varnish and re-check where
using a monotonic clock is more appropriate, but for now we should
at least not panic just because wall clock time was stepped back.

Fixes varnishcache#1874 for now

mbgrydeland added a commit to mbgrydeland/varnish-cache that referenced this issue Oct 28, 2016

Allow up to clock_step error when checking the session timestamps
When checking the timestamps on deleting a session, allow up to
clock_step seconds error before bailing.

Fixes: varnishcache#1874
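The relaxed check described in the commit message above could look roughly like the sketch below. All names and the default value here are hypothetical, not the actual varnish-cache source: the idea is just to tolerate a bounded backwards step instead of asserting `now >= sp->t_open` outright.

```c
/* Hypothetical sketch of the commit's idea: instead of asserting
 * now >= t_open, tolerate a backwards step of up to clock_step
 * seconds before treating the timestamps as inconsistent. */
static double clock_step = 1.0;   /* tolerated step, seconds (assumed value) */

int ses_timestamps_ok(double now, double t_open)
{
    /* A small backwards step (now slightly before t_open) passes;
     * anything beyond clock_step still fails. */
    return now >= t_open - clock_step;
}
```

The trade-off, discussed later in this thread, is that this still assumes the real clock never steps by more than the tolerance.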
nigoroll (Contributor) commented Oct 28, 2016

Brief writeup of an IRC discussion between @mbgrydeland and myself on #2122 and #2087 (my own perspective; please feel free to correct):

  • The overall goal should be to become real time stepping agnostic and use monotonic wherever possible
  • offsetting some real time start point with a mono delta can help us in many situations
  • I'd like to work on a new VTIM api, but had to postpone work on this for the last couple of months

My personal opinion is that we should not promote anything which suggests that we rely more on a continuous real clock time, which does not exist. If I understand Martin's opinion correctly, he thinks that as long as we rely on the real clock not stepping, having some protection is better than none.

So my recommendation:

  • merge Martin's panic patch c04f040
  • merge my session patch 5f2cc24
  • create an optional, configurable time guardian thread, which tries to detect clock stepping and can either log warnings or trigger a panic. This would use a tolerance margin comparing deltas of real and mono time. A short log of past events could be kept in-core and dumped with a panic

Mid/Long term I think we should

  • replace real with mono wherever possible
  • use real + mono offset in all other places
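The "time guardian" idea in the recommendations above could be sketched as follows. This is a hypothetical illustration under the assumption stated in the bullet list: between two samples, the realtime delta and the monotonic delta should agree, so a disagreement beyond a tolerance indicates the wall clock was stepped.

```c
#include <math.h>

/* Difference between how far the realtime clock moved and how far the
 * monotonic clock moved over the same interval.  Near zero when the
 * wall clock only drifts; large (positive or negative) when stepped. */
double clock_drift(double real_prev, double mono_prev,
                   double real_now, double mono_now)
{
    return (real_now - real_prev) - (mono_now - mono_prev);
}

/* A guardian thread would call this on each sampling round and either
 * log a warning or trigger a panic, per the proposal above. */
int clock_was_stepped(double drift, double tolerance)
{
    return fabs(drift) > tolerance;
}
```

A real implementation would keep a short in-core log of past drift values to dump with a panic, as suggested above.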
nigoroll (Contributor) commented Nov 1, 2016

During yesterday's bugwash my recommendation did not find agreement, and it was decided that #2122 is to go in.

hermunn added a commit that referenced this issue Nov 16, 2016

Allow up to clock_step error when checking the session timestamps
When checking the timestamps on deleting a session, allow up to
clock_step seconds error before bailing.

Fixes: #1874
hermunn (Contributor) commented Nov 16, 2016

Backport review: Since #2122 has been backported, there is nothing more to do here.

tsuppa commented Apr 24, 2017

Should this be fixed in 4.1.4? I have had several crashes after NTP time sync (it always happens a few minutes after server reboot):

Apr 21 13:58:42 web-server ntpd[750]: 0.0.0.0 c612 02 freq_set kernel -2125.794 PPM
Apr 21 13:58:42 web-server ntpd[750]: 0.0.0.0 c61c 0c clock_step -2.024965 s
Apr 21 13:58:42 web-server systemd: Time has been changed
Apr 21 13:58:42 web-server kernel: varnishd[2426]: segfault at 0 ip 0000000000433b19 sp 00007f55b52b2170 error 4 in varnishd[400000+96000]
Apr 21 13:58:42 web-server varnishd[2320]: Child (2323) died signal=11
Apr 21 13:58:42 web-server varnishd[2320]: Child (2323) Last panic at: Fri, 21 Apr 2017 10:58:42 GMT#012"Wrong turn at cache/cache_session.c:542:#012Clock step detected [...]
Apr 21 13:58:42 web-server varnishd[2320]: Child (6858) said Child starts

web-server # varnishadm panic.show
Last panic at: Fri, 21 Apr 2017 10:58:42 GMT
"Wrong turn at cache/cache_session.c:542:
Clock step detected
thread = (cache-worker)
version = varnish-4.1.4 revision 4529ff7
ident = Linux,3.10.0-514.6.1.el7.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 958.382731 (mono), 1492772322.442557 (real)
Backtrace:
0x434612: pan_ic+0x192
0x43c083: varnishd() [0x43c083]
0x436117: Req_Cleanup+0x1f7
0x450e86: HTTP1_Session+0x4f6
0x43ac6d: SES_Proto_Req+0x5d
0x44b089: WRK_Thread+0x4c9
0x44b4fb: pool_thread+0x2b
0x7f55bd42cdc5: libpthread.so.0(+0x7dc5) [0x7f55bd42cdc5]
0x7f55bd15b73d: libc.so.6(clone+0x6d) [0x7f55bd15b73d]
req = 0x7f55a0b79020 {
MAGIC 0x00000000 (Should:REQ_MAGIC/0x2751aaa1)
vxid = 0, step = R_STP_RESTART,
req_body = R_BODY_INIT,
restarts = 0, esi_level = 0,
ws = 0x7f55a0b79200 {
MAGIC 0x00000000 (Should:WS_MAGIC/0x35fac554)
id = "",
{s,f,r,e} = {(nil),(nil),(nil),(nil)},
},
http[req] = (nil) {
"
The server is RHEL 7.3.

nigoroll (Contributor) commented Apr 24, 2017

@tsuppa either fixing your ntp config (see earlier comments in this ticket) or tuning the clock_step parameter should help.
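For the second option, the clock_step runtime parameter (present as of 4.1.4 according to this thread) can be inspected and adjusted with varnishadm; the value of 30 below is only an example, and check param.show for your build's default and units.

```shell
# Inspect the current tolerance, then raise it (value in seconds).
varnishadm param.show clock_step
varnishadm param.set clock_step 30
```

A param.set change applies at runtime but is lost on restart unless also added to the varnishd startup options with -p clock_step=30.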

hermunn (Contributor) commented Apr 24, 2017

As @nigoroll says, you should fix your ntp config. The problem was fixed in 4.1.4-beta3 (see 4f58bb8), and the pan_ic above contains the new error message introduced in the patch, "Clock step detected". You need to make sure that ntp does not change the time by more than clock_step, either by increasing the clock_step parameter or, preferably, by configuring the system so that time drifts instead of steps.
