New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion To Unix Timestamps Is 24 Hours Behind #830

Closed
ckbuckley opened this Issue Mar 2, 2016 · 4 comments

Comments

Projects
None yet
2 participants
@ckbuckley

ckbuckley commented Mar 2, 2016

Hi guys,

All of our servers (Amazon Linux 2015.03 running rsyslog 8.16.0-3.el6) have suddenly stopped recording Unix timestamps in log messages correctly, as of midnight UTC on 29/02/2016.

We use the following GELF template to format log messages and ship them to our Graylog server:

template(name="gelf" type="list") {
    constant(value="\"host\":\"")
    property(name="hostname")
    constant(value="\",\"Syslog-Tag\":\"")
    property(name="syslogtag" format="json")
    constant(value="\",\"short_message\":\"")
    property(name="msg" format="json")
    constant(value="\",\"timestamp\":\"")
    property(name="timegenerated" dateformat="unixtimestamp")
    constant(value="\",\"Level\":\"")
    property(name="syslogseverity")
    constant(value="\",\"Severity\":\"")
    property(name="syslogseverity-text")
    constant(value="\"}")
}

But all 'timestamp' components of the message are 24 hours in the past. As an experiment, I tried changing the dateformat to 'rfc3339' and all of the timestamps began being reported correctly. I then changed it back to unixtimestamp and it went back to being 24 hours in the past.

A quick example is below for some log messages:

When 'dateformat' is set to 'unixtimestamp':
Message time in syslog: 2016-03-02T04:24:25+00:00
Message time in 'timestamp' field: 1456806270
Converted to readable time: 2016-03-01T04:24:30+00:00

When 'dateformat' is set to 'rfc3339':
Message time in syslog: 2016-03-02T04:34:00+00:00
Message time in 'timestamp' field: 2016-03-02T04:34:02.110830+00:00

This information was pulled from running a tcpdump on one of our application servers and capturing the logging traffic as it left the server in order to eliminate Graylog as the sou
rce of the issues - the behaviour detailed above seems to point to rsyslog as the culprit.

I can't think of any reason why this could have suddenly happened (and Graylog shows that it happened to all servers at exactly the same time - midnight UTC on 29/02/2016). We made no change to any of our systems or applications from 48 hours before this occurred and have been in a change freeze since.

Thanks for your help!

@rgerhards rgerhards added the bug label Mar 2, 2016

@rgerhards rgerhards self-assigned this Mar 2, 2016

@rgerhards

This comment has been minimized.

Member

rgerhards commented Mar 2, 2016

I strongly think this is a bug in rsyslog: we use some custom code to compute the Unix timestamp. I think the reason was that we missed some features in the usual api because of the different time zones we need to support concurrently. My guess is that this code has a problem with this years leap year. I will investigate today and report back.

@rgerhards rgerhards added this to the v8.17 release milestone Mar 2, 2016

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 2, 2016

testbench: add tests for timegenerated
among others, showcases a problem with leap year handling

see also rsyslog#830
@rgerhards

This comment has been minimized.

Member

rgerhards commented Mar 2, 2016

I have confirmed this problem with the tests I have just added. They also show we have been off for one second in other cases, too. Now working on the analysis/fix.

@rgerhards

This comment has been minimized.

Member

rgerhards commented Mar 2, 2016

I think I have found the culprit. We are using lookup tables for year second values, but it was an oversight on how to handle month within the leap year itself. In other words: it'll auto-heal on 2017-01-01... I also noticed that we seem to be consistently one second too low.

Now working on a simple fix to the current algo. When done, I'll see if I can refactor this to mktime(), not sure what was the orginal cause why we didn't use it from day one.

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 2, 2016

testbench: add tests for timegenerated
among others, showcases a problem with leap year handling

see also rsyslog#830

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 2, 2016

testbench: add tests for timegenerated
among others, showcases a problem with leap year handling

see also rsyslog#830

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 2, 2016

bugfix: unixtimestamp was incorrectly computed
The problem happened in leap year from March til then end
of year and healed itself at the begining of the next year.
During the problem period, the timestamp was 24 hours too
low.

This is primarily a simple fix that can also be applied to
older rsyslog versions. However, we will see if we can
refactor the code to make use of mktime(). Originally, that
was not done for some issues seen, which may no longer
apply.

fixes rsyslog#830
@rgerhards

This comment has been minimized.

Member

rgerhards commented Mar 2, 2016

Unfortunately, there is solid reason not to use mktime(). From the (new) code comment:

 * Note: it would be better to use something similar to mktime() here.
 * Unfortunately, mktime() semantics are problematic: first of all, it
 * works on local time, on the machine's time zone. In syslog, we have
 * to deal with multiple time zones at once, so we cannot plainly rely
 * on the local zone, and so we cannot rely on mktime(). One solution would
 * be to refactor all time-related functions so that they are all guarded 
 * by a mutex to ensure TZ consistency (which would also enable us to
 * change the TZ at will for specific function calls). But that would
 * potentially mean a lot of overhead.
 * Also, mktime() has some side effects, at least setting of tzname. With
 * a refactoring as described above that should probably not be a problem,
 * but would also need more work. For some more thoughts on this topic,
 * have a look here:
 * http://stackoverflow.com/questions/18355101/is-standard-c-mktime-thread-safe-on-linux
 * In conclusion, we keep our own code for generating the unix timestamp.

So we need to stay with the "home grown" code, at least the testbench now has good tests up until 2100 (which is when rsyslog needs at least some new lookup tables, so it is it's current expiration date ;)).

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 3, 2016

testbench: add tests for timegenerated
among others, showcases a problem with leap year handling

see also rsyslog#830

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 3, 2016

bugfix: unixtimestamp was incorrectly computed
The problem happened in leap year from March til then end
of year and healed itself at the begining of the next year.
During the problem period, the timestamp was 24 hours too
low.

This is primarily a simple fix that can also be applied to
older rsyslog versions. However, we will see if we can
refactor the code to make use of mktime(). Originally, that
was not done for some issues seen, which may no longer
apply.

fixes rsyslog#830

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 3, 2016

testbench: add tests for timegenerated
among others, showcases a problem with leap year handling

see also rsyslog#830

rgerhards added a commit to rgerhards/rsyslog that referenced this issue Mar 3, 2016

bugfix: unixtimestamp was incorrectly computed
The problem happened in leap year from March til then end
of year and healed itself at the begining of the next year.
During the problem period, the timestamp was 24 hours too
low.

This is primarily a simple fix that can also be applied to
older rsyslog versions. However, we will see if we can
refactor the code to make use of mktime(). Originally, that
was not done for some issues seen, which may no longer
apply.

fixes rsyslog#830

Whissi added a commit to Whissi/gentoo that referenced this issue Mar 12, 2016

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Mar 12, 2016

app-admin/rsyslog: Rev bump to fix a leap year issue
Bug: rsyslog/rsyslog#830

Package-Manager: portage-2.2.28
Closes: #1034

JindongChen pushed a commit to jopstools/rsyslog that referenced this issue Mar 15, 2016

bugfix: unixtimestamp was incorrectly computed
The problem happened in leap year from March til then end
of year and healed itself at the begining of the next year.
During the problem period, the timestamp was 24 hours too
low.

This is primarily a simple fix that can also be applied to
older rsyslog versions. However, we will see if we can
refactor the code to make use of mktime(). Originally, that
was not done for some issues seen, which may no longer
apply.

fixes rsyslog#830

janmejay added a commit to janmejay/rsyslog that referenced this issue Apr 4, 2016

bugfix: unixtimestamp was incorrectly computed
The problem happened in leap year from March til then end
of year and healed itself at the begining of the next year.
During the problem period, the timestamp was 24 hours too
low.

This is primarily a simple fix that can also be applied to
older rsyslog versions. However, we will see if we can
refactor the code to make use of mktime(). Originally, that
was not done for some issues seen, which may no longer
apply.

fixes rsyslog#830
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment