Do we need a "regex" parser? #143

Open
rgerhards opened this Issue Aug 27, 2015 · 6 comments


@rgerhards
Member

Some late versions of liblognorm v1 had a regex parser, but it was disabled by default and users were strongly advised not to activate it due to bad performance. There is also the general fear that the availability of "regex" makes users reach for it simply because it is something they know well. Given that the whole point of liblognorm is to gain speed by avoiding regex processing, this sounds like a very bad thing. Note that we avoid regex because all the major implementations have very bad runtime, often exponential (PCRE is a "good" example here).

In v1, I accepted the regex parser because there were some border cases which were hard to process otherwise. With the much-increased capabilities of v2, I see an even lower need for regex processing. So it is currently not supported in v2.

If you have real trouble without regex, please post your use case here (no theoretical thoughts, please -- I would like to see where it really hurts!). This can lead to changes in liblognorm, or maybe to the re-introduction of a regex feature. In the latter case, we may go for one of those small libraries that actually are O(n), with n being the max message size -- a complexity that can easily be achieved for regex if regex processing is not abused.
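To illustrate the runtime concern, here is a hypothetical stand-alone demo (not liblognorm code; it assumes libpcre, i.e. PCRE1, is installed and is built with gcc demo.c -lpcre). A nested quantifier such as ^(a+)+$ forces a backtracking engine through an exponential number of attempts on input that almost matches:

#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void)
{
    const char *error;
    int erroffset;
    /* nested quantifier: a classic catastrophic-backtracking pattern */
    pcre *re = pcre_compile("^(a+)+$", 0, &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
        return 1;
    }

    /* 30 'a's followed by 'b': no match is possible, so a backtracking
     * engine explores on the order of 2^30 ways to split the 'a' run */
    char subject[32];
    memset(subject, 'a', 30);
    subject[30] = 'b';
    subject[31] = '\0';

    int ovector[30];
    int rc = pcre_exec(re, NULL, subject, 31, 0, 0, ovector, 30);
    /* with default limits this typically returns PCRE_ERROR_MATCHLIMIT
     * rather than finishing in any reasonable time */
    printf("pcre_exec rc=%d\n", rc);
    pcre_free(re);
    return 0;
}

A guaranteed-linear engine rejects the same input in a single left-to-right pass, which is what the O(n) remark above refers to.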

@Whissi added a commit to Whissi/gentoo-overlay that referenced this issue Oct 9, 2015:
dev-libs/liblognorm: PCRE support removed due to upstream's suggestion (6dae520)
See rsyslog/liblognorm#143 for details.
@janmejay
Member

Logging frameworks that allow context-building (like picking variables from thread context or a manually created logging context) often generate logs that have a lot of easily parsed fields alongside a free-form fragment that can have arbitrary content (the content comes from many places in the application, or is generated by an external service call, etc.). I have come across a handful of such cases (from different users, of course).

In such cases, the free-form portion is perfectly human-readable (which is why it was coded up that way), but not easy to handle without regex while parsing.

For instance, consider this rule:

rule=:"%client_ip:char-to:"%" %tcp_peer_ip:ipv4% - [%req_time:char-to:]%] "%verb:word% %url:word% %protocol:char-to:"%" %status:interpret:int:number% %latency:interpret:float:word% %bytes_sent:interpret:int:number% "%referrer:char-to:"%" "%user_agent:char-to:"%" %upstream_addrs:tokenized:, :tokenized: \x3a :regex:[^ ,]+% %upstream_response_times:tokenized:, :tokenized: \x3a :interpret:float:regex:[^ ,]+% %pipe:word% \t %host:word% cache_%cache:word%

Nginx generates multiple upstream_addr and upstream_status values separated by commas and colons. Regex provides a clean way to handle this.
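For illustration (addresses invented), $upstream_addr for a request that tried two servers in one group and then followed an internal redirect to another group can look like:

192.168.1.10:8080, 192.168.1.11:8080 : 10.0.0.5:9000

Addresses within a group are separated by ", " and groups by " : ", which is what the nested tokenized fields in the rule above take apart.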

Also, it allows a large set of unforeseen cases to be handled at the user's level without being blocked on code changes. The user is unblocked immediately and, if the use case is important enough, can bring it to our notice (or create a PR). Once it is supported first-class, the regex call can be replaced.

@rgerhards
Member

Can you post a sample log line that matches the rule?

I am very concerned that once we add regex support, users will begin to use it simply because they know regex. That would make liblognorm just another dull, low-performance solution. I think we should try to do better.

@rgerhards
Member

I should stress that I am really after actual samples rather than theoretical thoughts. I know that there are different schools of thought, and I would like to see where we are really hurt in practice without regex.

@davidelang
Contributor

On Wed, 25 Nov 2015, Janmejay Singh wrote:


Take a look at the liblognorm v2 capabilities (repeated, custom types, alternates). I think they provide a way to address this much more readily (see the sketch below).

We still need a more generic name-value capability in v2 (where we can define the name limits and the two separator types).
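A hedged sketch of how the upstream-address list could look with the v2 "repeat" parser (untested, field names invented; it covers the comma-separated address list, not the colon-separated group case):

rule=:%{"name":"upstream_addrs", "type":"repeat",
        "parser":[
                  {"type":"ipv4", "name":"addr"},
                  {"type":"literal", "text":":"},
                  {"type":"number", "name":"port"}
                 ],
        "while":[
                  {"type":"literal", "text":", "}
                 ]
     }%

This would turn something like "192.168.1.10:8080, 192.168.1.11:8080" into an array of addr/port pairs without any regex.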

David Lang

@rgerhards
Member

The current discussion around different date formats might provide a valid argument in favor of regex support. I can't provide more details at this moment, but I thought I'd at least add the information to this tracker.

@davidelang
Contributor

Full regex support wouldn't solve the date parsing problem. It would make it easier to extract the various components, but not to assemble the components into a timestamp result.
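To make that concrete (a hypothetical stand-alone C sketch, not liblognorm code): even after a regex has isolated the date fragment, turning it into a timestamp still requires format-aware parsing, e.g. via strptime(3)/mktime(3):

#define _XOPEN_SOURCE 700  /* for strptime(3) on glibc */
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    /* the fragment a regex might have extracted from a log line */
    const char *frag = "25/Nov/2015:10:02:03";
    struct tm tm;
    memset(&tm, 0, sizeof(tm));
    if (strptime(frag, "%d/%b/%Y:%H:%M:%S", &tm) == NULL) {
        fprintf(stderr, "date fragment did not parse\n");
        return 1;
    }
    /* mktime() interprets the broken-down time as local time */
    printf("epoch: %ld\n", (long)mktime(&tm));
    return 0;
}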

However, the grok compatibility module is a regex engine, so it may address the request.

There are some regex engines that are far more efficient than the standard ones when dealing with large bodies of rules (IIRC Google released one, RE2, that turns the pile of regexes into a single parse tree).
