Add built-in CEF access-log format#2418

Open
oltdaniel wants to merge 4 commits into
squid-cache:masterfrom
oltdaniel:feat/logformat-cef
Conversation

@oltdaniel

@oltdaniel oltdaniel commented May 9, 2026

Introduce the "cef" logformat, emitting ArcSight Common Event Format
lines for SIEM ingestion. The format is rendered directly by Squid
so that CEF-reserved bytes are escaped per the spec and derived values
not exposed to logformat (notably severity) can be included.

Header: Vendor=Squid, Product="Squid Cache", DeviceVersion=VERSION,
SignatureID=<Squid cache code>, Name="Proxy Request", Severity derived
from LogTags and error category (preferring proxy signals over upstream
HTTP status). Extensions cover client/server addressing, request/
response sizing, timing, user, URL, hierarchy code, content-type, and
error reason; HTTP status is exposed via cn2/cn2Label=HttpStatus.
Header pipe/backslash and extension =/CR/LF are escaped per the CEF
Implementation Standard.

Also add %squid::hostname and %squid::version logformat tokens so
administrators can replicate the built-in shape via a custom logformat,
in consideration of the above-mentioned restrictions, when their SIEM
schema needs adjustments.

Header and extension values follow the standard specification, prior
work from other products, and best-effort mapping of Squid-specific
values to CEF fields. Field and value choices remain open for
discussion.
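
The escaping rules named in the description (backslash and pipe in header fields; backslash, equals sign, CR, and LF in extension values) can be sketched as follows. This is a minimal illustration, not the PR's code: `EscapeCefHeader` and `EscapeCefExtension` are hypothetical names, and the sketch assumes CR and LF are written as the two-character sequences `\r` and `\n`.

```cpp
#include <string>

// Hypothetical helper (not from the PR): escape a CEF header field.
// Backslash and pipe are each prefixed with a backslash.
std::string
EscapeCefHeader(const std::string &raw)
{
    std::string out;
    out.reserve(raw.size());
    for (const char c : raw) {
        if (c == '\\' || c == '|')
            out += '\\';
        out += c;
    }
    return out;
}

// Hypothetical helper (not from the PR): escape a CEF extension value.
// Backslash and '=' are prefixed with a backslash; CR and LF are
// replaced by the two-character sequences \r and \n (an assumption).
std::string
EscapeCefExtension(const std::string &raw)
{
    std::string out;
    out.reserve(raw.size());
    for (const char c : raw) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '=':  out += "\\=";  break;
        case '\r': out += "\\r";  break;
        case '\n': out += "\\n";  break;
        default:   out += c;      break;
        }
    }
    return out;
}
```

Note that the two contexts differ: in this sketch a pipe inside an extension value passes through unchanged, while an equals sign inside a header field does.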

@squid-anubis squid-anubis added the M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels label May 9, 2026

@oltdaniel
Author

oltdaniel commented May 9, 2026

Due to the strict description and title format requirements, I stripped all linked sources and examples from the description. If anybody is interested, see the initial version of the description.

@squid-anubis squid-anubis removed the M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels label May 9, 2026
@rousskov rousskov self-requested a review May 9, 2026 19:42
@rousskov rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label May 9, 2026
Contributor

@yadij yadij left a comment

This is an initial review. There are likely more details that need to be attended to after the conversion to SBufStream is done.

Comment thread src/log/FormatSquidCEF.cc Outdated
Comment thread src/log/Formats.h Outdated
Comment thread src/cf.data.pre Outdated
Comment thread src/cf.data.pre Outdated
Comment thread src/log/Formats.h Outdated
Comment thread src/log/FormatSquidCEF.cc Outdated
Comment thread src/log/FormatSquidCEF.cc Outdated
Comment thread src/log/FormatSiemCef.cc
Comment thread src/log/FormatSquidCEF.cc Outdated
Rename FormatSquidCEF.cc to FormatSiemCef.cc and Log::Format::SquidCEF to Log::Format::SiemCef, dropping the ArcSight reference now that CEF is an open standard. Update the cf.data.pre description accordingly.

Switch the output buffer from SBuf to SBufStream so header and extension fields can be written via stream operators. Collapse FieldWriter's typed methods (str/literal/integer) into a templated put() plus putStr() for escaped strings, and move the extension-escape helper inside FieldWriter as a private static. Make file-local helpers static and UpperCamelCase (cefTransport -> CefTransport, cefSeverity -> CefSeverity) per Squid's file-local function policy.
@oltdaniel oltdaniel force-pushed the feat/logformat-cef branch from 11e6297 to fea3807 Compare May 10, 2026 11:56
@oltdaniel
Author

Explanation for the force-push: I forgot to add the name change of the function/format in two files.

  • src/log/access_log.cc
  • src/tests/stub_liblog.cc

Running ./test-builds.sh revealed a function signature error, which this commit fixes. In addition, I changed the position of the SIEM CEF format entries to match the ordering used in similar files.
Contributor

@rousskov rousskov left a comment

Thank you for working on improving CEF/SIEM support.

> Introduce the "cef" logformat, emitting ArcSight Common Event Format
> lines for SIEM ingestion.

Adding a new hard-coded logformat is a showstopper IMO: Squid should get rid of the few remaining legacy hard-coded logformat implementations, not add new ones. The number of real problems hard-coded logformats solve does not justify their development and maintenance overheads, especially since most important use cases usually require some environment-specific customizations that hard-coded logformats do not support well.

In cf.data.pre, this PR currently says: "If the built-in "cef" format does not fit your SIEM schema, you can build a CEF-shaped line yourself with logformat." If that claim is accurate, then let's remove the built-in logformat. Documenting (in cf.data.pre) a logformat configuration matching a popular or common SIEM schema sounds like a good idea.

> The format is rendered directly by Squid so that CEF-reserved bytes are escaped per the spec

I hope this part can be implemented in Squid code by adding support for an additional escaping mechanism (that could be applied to other logformat %codes). Do you think that is feasible? What are the examples of the missing escaping mechanisms? Or does the problem below make additional escaping mechanisms in Squid unnecessary because the helper will implement them?

> ... and derived values not exposed to logformat (notably severity) can be included.

Defining what transactions represent a "problem" and determining that problem "severity" does not belong to Squid code. Different Squid admins are very likely to classify different transactions differently. If tight integration with Squid is desirable, that transaction/log analysis should be done via annotation ACLs, in an external ACL helper, or in an access log daemon.

N.B. I have not reviewed the proposed low-level code changes yet. I wanted to log this blocker ASAP to reduce the work on the parts of code that I think should be removed from this PR. Let's focus on resolving the high-level concerns/questions above first.

@rousskov rousskov added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels May 10, 2026
@oltdaniel
Author

@rousskov #2418 (review)

> Adding a new hard-coded logformat is a showstopper IMO: Squid should get rid of the few remaining legacy hard-coded logformat implementations, not add new ones. The number of real problems hard-coded logformats solve does not justify their development and maintenance overheads, especially since most important use cases usually require some environment-specific customizations that hard-coded logformats do not support well.

That is a fair point. I already expected that I would be required to port every value in this format to custom logformat codes so that the format can be fully replicated. In practice, the values involved, especially those in the header, are very unlikely to include any reserved characters.

Fully replicating the current format as is would require the following additions to the formatter:

  • [cef-header] quoting mode for any header-related values (affects characters \ and |)
  • [cef-extension] quoting mode for any extension values (affects characters \, =, \n, \r)
  • squid::hostname (already added)
  • squid::version (already added)
  • cef::severity: debatable whether it is meaningful enough, but given its position in the header I think it is important enough
  • cef::outcome: very debatable whether it is meaningful enough
  • squid::transport: most debatable, as the current "detection" method is already questionable

With these additions (without optimized naming), the logformat would look like this (at least I think so; I'm losing the overview at this length):

logformat cef CEF:0|Squid|Squid Cache|%[cef-header]squid::version|%[cef-header]Ss|Proxy Request|%cef::severity|rt=%ts%03tu src=%>a spt=%>p dst=%<a dpt=%<p dhost=%[cef-extension]>rd app=%[cef-extension]>rs/%>rv suser=%[cef-extension]un requestMethod=%rm request=%[cef-extension]ru requestClientApplication=%[cef-extension]{User-Agent}>h in=%>st out=%<st act=%Ss outcome=%cef::outcome cn1=%tr cn1Label=ResponseTime cn2=%>Hs cn2Label=HttpStatus cs1=%[cef-extension]{Referer}>h cs1Label=Referer cs2=%Sh cs2Label=Hierarchy fileType=%[cef-extension]mt reason=%err_code dvc=%>la dvchost=%squid::hostname

However, choosing from a list of predefined log formats would still be nice. But as far as I understand the codebase (which is new to me), the cleanest way would be to replace all builtin formats with a lookup table for custom formats. But I'm far from qualified in regards to the squid project to make/suggest such a decision.

> Defining what transactions represent a "problem" and determining that problem "severity" does not belong to Squid code. Different Squid admins are very likely to classify different transactions differently. If tight integration with Squid is desirable, that transaction/log analysis should be done via annotation ACLs, in an external ACL helper, or in an access log daemon.

Here I need to disagree. Yes, different admins might have different opinions about severity. However, this is a technical value calculated by the server as an initial default. Offering such a default would save admins from replicating a set of ACLs just to derive it. It also enables easier initial rating within a SIEM for further analysis (e.g. anomaly detection on Squid severity to highlight potential issues or unusual behavior).
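
To make the idea of a server-calculated default concrete, here is a purely illustrative sketch. It does not reproduce the PR's `CefSeverity()`: the `TransactionSummary` struct, the `DefaultCefSeverity` function, and every threshold below are hypothetical. The only facts assumed are that CEF severity is an integer from 0 to 10 and that, per the description, proxy-side signals are preferred over the upstream HTTP status.

```cpp
// Hypothetical, simplified view of a finished transaction.
// None of these names come from Squid or from the PR.
struct TransactionSummary {
    bool proxyError = false; // the proxy itself failed to handle the request
    bool denied = false;     // the request was blocked by access controls
    int httpStatus = 0;      // response status code, 0 if none
};

// Illustrative default only: the numeric levels are invented for this
// sketch. Proxy-side signals are checked before the HTTP status.
int
DefaultCefSeverity(const TransactionSummary &t)
{
    if (t.proxyError)
        return 7;
    if (t.denied)
        return 5;
    if (t.httpStatus >= 500)
        return 4;
    if (t.httpStatus >= 400)
        return 2;
    return 0;
}
```

Any such table encodes a policy choice, which is the point under debate: an admin who disagrees with the levels would still need a way to override them.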

@yadij
Contributor

yadij commented May 12, 2026

@rousskov #2418 (review)

> Adding a new hard-coded logformat is a showstopper IMO: Squid should get rid of the few remaining legacy hard-coded logformat implementations, not add new ones.

On this we disagree. Some formats can be built in as src/cf.data.pre entries for the logformat directive. Others cannot be represented that way without deep adjustments to Squid logging mechanisms; for those, a Format:: function is appropriate IMO.

> In order to fully replicate the current format as is, it would require the following additions to the formatter:
>
> * `[cef-header]` quoting mode for any header related values (affects characters `\` and `|`)
>
> * `[cef-extension]` quoting mode for any extension values (affects characters `\`, `=`, `\n`, `\r`)

The shell-escape quoting would be useful here, provided a mechanism were added that lets {arg} supply additional custom "reserved characters" to the quoting mechanism.
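
A quoting routine parameterized by a caller-supplied reserved set could look roughly like the sketch below. This is not Squid's existing shell-escape implementation; `QuoteReserved` is a hypothetical name, and the assumption is that each reserved byte (plus the backslash itself) is simply prefixed with a backslash.

```cpp
#include <string>

// Hypothetical sketch: prefix the backslash and every byte listed in
// `reserved` with a backslash. How `reserved` would be populated from a
// logformat {arg} is left open here.
std::string
QuoteReserved(const std::string &raw, const std::string &reserved)
{
    std::string out;
    out.reserve(raw.size());
    for (const char c : raw) {
        if (c == '\\' || reserved.find(c) != std::string::npos)
            out += '\\';
        out += c;
    }
    return out;
}
```

With `reserved` set to `"|"` this covers the header case, and with `"="` most of the extension case. CR and LF in extension values would still need separate handling if they are to be replaced by letter sequences rather than merely prefixed.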

> However, choosing from a list of predefined log formats would still be nice. But as far as I understand the codebase (which is new to me), the cleanest way would be to replace all builtin formats with a lookup table for custom formats. But I'm far from qualified in regards to the squid project to make/suggest such a decision.

When the log format can actually be represented using a squid.conf logformat definition, we could add the pattern to src/cf.data.pre instead of C++ code. For example:

NAME: logformat
DEFAULT: referrer %ts.%03tu %>a %{Referer}>h %ru
DEFAULT: useragent %>a [%tl] "%{User-Agent}>h"
DEFAULT: cef CEF:0|Squid|Squid Cache|%/squid::version|%Ss|Proxy Request|0| ...

The problem is that the Apache and Squid native formats differ slightly in output for some values that the custom logformat codes do not represent properly. The worst case is fields that are expected to display '0' instead of the '-' that our custom codes use when a field value does not exist.
Or, as in the case of this CEF format, some fields need unusual encoding for special cases.

Comment thread src/log/FormatSiemCef.cc
Comment on lines +272 to +274
out << '|';
appendHeader(out, cacheCode);
out << "|Proxy Request|" << CefSeverity(*al) << '|';
Contributor

The cacheCode string is a set of alphanumeric tags with _ delimiter. It does not contain any of the special CEF header characters and thus does not need filtering.

Suggested change
- out << '|';
- appendHeader(out, cacheCode);
- out << "|Proxy Request|" << CefSeverity(*al) << '|';
+ out << '|' << cacheCode << "|Proxy Request|" << CefSeverity(*al) << '|';

Comment thread src/log/FormatSiemCef.cc

/* Time (rt = receipt time; start/end mark activity boundaries) */
if (al->cache.start_time.tv_sec > 0) {
w.put("rt", startMs);
Contributor

@yadij yadij May 12, 2026

If this code is to stay, I believe it is better to design the FieldWriter class with (key, value) constructor parameters and a std::ostream &operator <<(std::ostream &os) method that does the stream output.

So it can be used like this:

Suggested change
- w.put("rt", startMs);
+ out << FieldWriter("rt", startMs);
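
One way to get the suggested call shape is sketched below. This is a simplified stand-in, not the PR's FieldWriter: it assumes the object prints a single `key=value` pair, leaves separators to the caller, omits extension-value escaping, and uses a non-member `operator <<` (declared as a friend) because a stream object on the left-hand side cannot invoke a member operator of FieldWriter.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch of a stream-insertable key=value field.
// Escaping of the value is intentionally omitted.
class FieldWriter
{
public:
    // Render any streamable value to text once, at construction time.
    template <typename Value>
    FieldWriter(const char *aKey, const Value &aValue): key(aKey)
    {
        std::ostringstream os;
        os << aValue;
        value = os.str();
    }

    // Non-member friend so that `out << FieldWriter(...)` works.
    friend std::ostream &operator <<(std::ostream &os, const FieldWriter &field)
    {
        return os << field.key << '=' << field.value;
    }

private:
    const char *key;   // expected to be a string literal
    std::string value; // pre-rendered value text
};
```

Under these assumptions, `out << FieldWriter("rt", startMs) << ' ' << FieldWriter("requestMethod", "GET")` writes two space-separated pairs. Rendering the value in the constructor costs an extra string per field; a class template holding the value directly would avoid that at the price of a more involved declaration.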
