journal ip anonymization #2447

Closed
micah opened this Issue Jan 27, 2016 · 16 comments

@micah
Contributor

micah commented Jan 27, 2016

By default, services on most systems log useful information, and that information often includes IP addresses that identify the individuals using the service. This data stays on your systems until the logs are rotated. It is useful for debugging, analysis, dealing with abuse, and many other things.

Regardless of the defaults, and the typical use cases, system administrators are required to implement site logging policies deemed appropriate by their organization, abiding by their local jurisdictional requirements. In some cases, having easy control over exactly what data is retained in logfiles is incredibly important. Data retention is a hot legal topic in many different countries and jurisdictions. In some places retention of such data is mandated by law, but in many cases where such laws have been proposed, they have faltered. It turns out that there are many instances where it is preferable to keep less information on users than is collected by default on many systems. For example, in the United States it is not currently required to retain data on users of a server, but you may be required to provide all data on a user which you have retained. Online service providers protect themselves from legal hassles and added work by choosing what data they wish to retain.

The EFF recommends, as a general best practice, mitigating this problem by logging only enough information to maintain the provider's intended services: no more, no less. This applies especially in the USA, where the PATRIOT Act gives the government expanded powers to request this information.

Regardless of the state of data retention laws around the globe, syslog implementations seem to agree that a mechanism for removing data defined by site policy is desirable. In some places this may be the wrong thing to do, but in others it may be the right thing. Keeping the current default makes sense, but the ability to adjust that default as needed is important too: it lets an organization implement a site policy and decide for itself whether the trade-off between privacy and analysis is worthwhile. The EFF has made it very clear that anonymizing or stripping personally identifying information from logs is (a) perfectly legal in the U.S. and (b) advisable.

The wish to give organizations that choice, when they consider avoiding the retention of sensitive data more important than keeping a full history of everything logged, is what led to a 2004 patch for syslog-ng that stripped any given regexp out of log messages before they were written to disk. In 2008, syslog-ng added log anonymization capabilities by default [0]. Around the same time dsyslog was released, which included full IP anonymization capabilities from the beginning [1]. In 2013, rsyslog added log anonymization capabilities [2].

Now with journald, these problems are back. I've seen many people trying to disable journald, or route it through rsyslog so that they can manipulate the logs with rsyslog's mmanon capability and continue to adhere to their organization's data retention requirements. Some organizations I've worked with were unpleasantly surprised to find that, on upgrading to Debian Jessie, the regexp filter their rsyslog applied to logs before writing them to disk was being "subverted" by journald's logging, which includes all the information that they did not want to keep, or that could get them into serious trouble if they kept it. There seems to be a great deal of confusion out there about whether setting Storage=none and ForwardToSyslog=yes will work, or whether some socket needs to be added to rsyslog to pick up journal entries and push them through rsyslog, etc.

One of the advantages of journald is the chance to centralize all logging, including filtering/anonymizing. What would it take to get a profile for journald that lets administrators specify regexps, so that they can put aside these hacks? At minimum, an anonymizing profile that removes or replaces IPv4/IPv6 addresses would be a good start, since I suspect this is where most usage of such log replacement filters has focused. What needs to be done to enhance journald so that the logs we produce respect the privacy and reflect the needs of users and organizations?
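
To make the request concrete, here is a minimal sketch of the kind of IPv4/IPv6 replacement filter being asked for. It is not journald code and journald offers no such hook; the function name `anonymize`, the loose candidate pattern, and the policy of keeping a /24 for IPv4 and a /48 for IPv6 are all assumptions chosen for illustration. The pattern is heuristic, so anything that merely looks like an address (a four-part version number, say) would be rewritten too.

```python
import ipaddress
import re

# Loose candidate patterns. Every candidate is validated with the ipaddress
# module below, so look-alikes such as the timestamp 12:34:56 are left alone.
_CANDIDATE = re.compile(
    r"(?<![\w:])(?:[0-9A-Fa-f]{0,4}:){2,7}[0-9A-Fa-f]{0,4}(?![\w:])"  # IPv6-like
    r"|(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?!\w)(?!\.\d)"  # IPv4-like
)

# Assumed policy: keep the /24 of IPv4 and the /48 of IPv6 addresses.
_KEEP_BITS = {4: 24, 6: 48}


def _mask(match: re.Match) -> str:
    token = match.group(0)
    try:
        addr = ipaddress.ip_address(token)
    except ValueError:
        return token  # not a real address, leave the text untouched
    network = ipaddress.ip_network(
        f"{addr}/{_KEEP_BITS[addr.version]}", strict=False
    )
    return str(network.network_address)


def anonymize(message: str) -> str:
    """Return message with the host part of every IP address zeroed."""
    return _CANDIDATE.sub(_mask, message)
```

Truncating rather than deleting the address keeps coarse network information for abuse handling while no longer identifying a single host; a stricter site policy could return a fixed placeholder from `_mask` instead.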

  [0] https://lists.balabit.hu/pipermail/syslog-ng/2008-November/012183.html
  [1] https://web.archive.org/web/20080215234516/http://nenolod.net/dsyslog
  [2] http://blog.gerhards.net/2013/04/log-anonymization-with-rsyslog.html

Further references:
EFF’s Best Practices for Online Service Providers http://www.eff.org/wp/osp
EPIC’s International Data Retention Page http://www.epic.org/privacy/intl/data_retention.html
Working paper on Usage Log Data Management from the Computers, Freedom, and Privacy conference http://cryptome.org/usage-logs.htm

@poettering

Member

poettering commented Jan 27, 2016

I fear this is out of focus for journald. We really don't want to do regexp munging. Our focus is on collecting data and filtering at display time. If that's not suitable for your use case, that's OK, but journald is not for you then. In this case, use rsyslog or something else, and turn off journald's own storage.

Sorry.

@poettering poettering closed this Jan 27, 2016

@poettering poettering added the journal label Jan 27, 2016

@anarcat

anarcat commented Jan 27, 2016

the problem here is that it's pretty hard to actually turn off journald, no? the thing is still running and collecting those logs in memory somewhere...

I think it's really unfortunate that such a proposal is being refused right off the bat. At least explain why you refuse to do what you call "regex munging" (which is only one possible implementation)...

@poettering

Member

poettering commented Jan 27, 2016

Storage=none turns off storage in both /run and /var. That's pretty easy, no?

The whole idea of the journal is as mentioned that we filter during display, not during collection or storage. We also put a focus on structured log data that is indexed, rather than trying to heuristically derive structure from english language messages. If you want something else, then that's completely OK, but journald really isn't that.

@anarcat

anarcat commented Jan 27, 2016

On 2016-01-27 10:44:47, Lennart Poettering wrote:

> Storage=none turns off storage in both /run and /var. That's pretty easy, no?

But doesn't that keep it in memory? I admit I am not familiar enough
with journald to tell.

Furthermore, the original post mentioned a lack of clarity to that
effect, so the clarification is certainly appreciated.

> The whole idea of the journal is as mentioned that we filter during display, not during collection or storage. We also put a focus on structured log data that is indexed, rather than trying to heuristically derive structure from english language messages. If you want something else, then that's completely OK, but journald really isn't that.

There's an ethical issue with creating a standard system that
deliberately keeps all private data. It could even be a legal liability
in some countries. Having the capacity to not only censor the display,
but refuse or censor the input would be a requirement in certain
situations.

With the current popularity of systemd and journald, 'just wanting
something else' is becoming more difficult...

A.

It is wise to reconcile ourselves with our adolescence; to hate, despise,
deny, or simply forget the adolescent we once were is in itself an
adolescent attitude.
- Daniel Pennac, Comme un roman

@dkg

Contributor

dkg commented Jan 27, 2016

It's disappointing to hear that you don't think this is in-scope for as important a project as journald -- the ability to filter globally during collection (not just at display time) is a critical feature when implementing and maintaining privacy-preserving systems.

Storage=none is a huge switch to flip -- administrators often do need some log data, but many responsible administrators have log data that they explicitly do not want (because possession of the data puts the users or the system itself at risk) or cannot legally retain.

There doesn't appear to be any standard structured field in the systemd logs, as i understand them, that corresponds to micah's proposal about anonymizing IP addresses, so presumably that means that services logging through systemd are just logging the data mixed into the MESSAGE= journal field. If there were a well-known field for clients to log (something like PEER_IP_ADDR= ) then journald could be configured to anonymize those particular fields during collection without any regex-wrangling, which would in turn encourage applications to log more-structured data (and would provide log display filtering based on IP addresses for non-anonymized systems, which would be an additional win).
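
As a rough illustration of how little machinery field-based anonymization would need, here is a sketch that treats a journal entry as a plain dictionary of fields. `PEER_IP_ADDR` is the hypothetical field name suggested above (journald defines no such field), and `anonymize_entry` with its prefix-length parameters is likewise an assumption, not an existing API.

```python
import ipaddress
from typing import Dict

# Hypothetical field name from the proposal above; journald defines no such field.
PEER_FIELD = "PEER_IP_ADDR"


def anonymize_entry(
    entry: Dict[str, str], v4_bits: int = 24, v6_bits: int = 48
) -> Dict[str, str]:
    """Return a copy of entry with the peer address truncated to a prefix."""
    result = dict(entry)
    value = result.get(PEER_FIELD)
    if value is None:
        return result
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        # Unparseable value: drop it rather than risk storing an address.
        del result[PEER_FIELD]
        return result
    bits = v4_bits if addr.version == 4 else v6_bits
    network = ipaddress.ip_network(f"{addr}/{bits}", strict=False)
    result[PEER_FIELD] = str(network.network_address)
    return result
```

No regular expression is involved: the collector only needs to know which field carries the address, and MESSAGE= is passed through untouched.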

Please reconsider the categorical opposition to this feature. It is an important step for modern system administration, and journald is in a position to fulfill this role better than many other frameworks.

@pqwy

pqwy commented Jan 29, 2016

@poettering's reasoning is actually quite clear. Well defined scope tends to make or break a software project, and pulling in a filtering engine into journald, especially a regex-based one, together with the associated configuration provisions, sounds like a design wart.

At the same time, @micah and @anarcat raise a critical issue: there are situations where not retaining certain data, which naturally tends to be aggregated via syslog, is a legal necessity. And dealing with that in journald, as opposed to turning it off, is desirable not simply "because it's everywhere", but because it is shaping up to be the best tool for the task of centrally collecting and managing logs. For every unfortunate Debian user unpleasantly surprised by Jessie, there are several of us who were already pulling systemd from git five years ago.

I think a compromise might be possible here. How about optionally siphoning out all received (and decorated) data to an external process, and reading it back in before persisting it or passing it on?

I can imagine at least several ways to flesh this out in detail, but then again, I'm certain you can too. It might be as simple as giving journald a unix socket to connect to, deciding on a policy of what happens when the other end is not there, and writing a systemd service to maintain the filtering process. This would add some complexity to journald, in maintaining the link, printing the structured entries, and parsing them back in, but this is pretty limited, especially if the format is KEY=value-like. It would also incur some overhead, but only for those who use the feature. In return, it would create a general extension point for journald, allowing users to arbitrarily shape all logged data before it hits the disks.

And it would solve this particular concern as a special case.
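
A sketch of what the external filtering process could look like, under loudly stated assumptions: no such extension point exists in journald, the wire format here (one KEY=value pair per line, one text block per entry) is a simplification invented for this example, and every function name is hypothetical. The `PEER_IP_ADDR` field is the one floated earlier in this thread.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Entry = Dict[str, str]
Transform = Callable[[Entry], Optional[Entry]]


def parse_entry(block: str) -> Entry:
    """Parse one entry: KEY=value per line, split on the first '='."""
    entry: Entry = {}
    for line in block.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            entry[key] = value
    return entry


def format_entry(entry: Entry) -> str:
    """Serialize an entry back into KEY=value lines."""
    return "".join(f"{key}={value}\n" for key, value in entry.items())


def filter_stream(blocks: Iterable[str], transform: Transform) -> Iterator[str]:
    """Apply transform to each entry; a None result drops the entry."""
    for block in blocks:
        result = transform(parse_entry(block))
        if result is not None:
            yield format_entry(result)


def drop_peer_address(entry: Entry) -> Entry:
    """Example site policy: never store the (hypothetical) peer address field."""
    return {k: v for k, v in entry.items() if k != "PEER_IP_ADDR"}
```

In a real design the blocks would arrive over the unix socket described above, and the policy for an absent filter (block, drop, or store unfiltered) would still have to be decided; the sketch only shows that the parse, transform, and print round trip is small.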

@poettering

Member

poettering commented Jan 29, 2016

Why would you even send data to journald if you don't want it to stay in memory or on disk? If you want to suppress log data, why not attack it at the source of the problem, instead of turning the destination into a selective null sink?

@duckdalbe

duckdalbe commented Jan 29, 2016

I'm actually having problems using systemd-based Linux-distros at work because we have a no-IP-storing policy. I'm trying to argue because I (grew to) like systemd, but I'm having a hard time.

Some mechanism to filter/alter data before it's stored would be very helpful!

@lelutin

lelutin commented Feb 2, 2016

@poettering turning your question around: would you rather force every project that can ever send data to journald/syslog to add mechanisms for filtering or removing specified data before it's sent to the logging facility, or would it be more productive to have one point on the system implement such functionality?

The reason why other syslog systems have implemented such filtering capability is exactly because of this question's obvious answer: pushing everyone to do the right thing wrt logging is never going to happen, whereas making the log aggregation system do it is simpler (in terms of how many ppl need to be bothered to implement it) and more productive.

@poettering

Member

poettering commented Feb 2, 2016

@lelutin i am not forcing anyone. I am just saying that journald has no clue what is supposed to be private data and what not, and is certainly not going to employ regex orgies to reconstruct this information from english language text. Hence: the clients have to provide that info. But if they do that, they might as well never send it to the system logger in the first place.

@xshadow

xshadow commented Feb 19, 2016

First of all @poettering and all other involved people, thanks for systemd and thanks for moving it forward regardless of all the troubles you had with it and the linux community. And another big up for the courage and your posting about the linux community.

For many years, people all around the world have been fighting data retention in many ways and on many different levels (e.g. http://www.vorratsdatenspeicherung.de/?lang=en or http://www.wirspeichernnicht.de/ ).

For me it's a technical and political question, and systemd/journald has the power to enable admins who respect the privacy of their users to do so without losing all the useful information gathered through systemd.

As @dkg pointed out before, no regex orgies are needed to implement such a feature:

If there were a well-known field for clients to log (something like PEER_IP_ADDR= ) then journald could be configured to anonymize those particular fields during collection without any regex-wrangling, which would in turn encourage applications to log more-structured data (and would provide log display filtering based on IP addresses for non-anonymized systems, which would be an additional win).

No one has to be forced to use this feature, but the earlier developments @micah pointed out show that there are people out there who are willing to take care of users' privacy. In my opinion we should support them. If there is an easy way to do it (not piping stuff through other tools), more and more people will start to think and care about it.

I think journald is in the position to give privacy-oriented admins the possibility to anonymize logs without much overhead and still use the useful information that journald provides. A filter / switch to rule out IP addresses to preserve the anonymity of all users would be a great feature and a step in the right direction.

@RalfJung

RalfJung commented Mar 30, 2016

I think I just ran into the same issue. The isc-dhcp-server sends all DHCPREQUEST and DHCPACK to the syslog. This does not seem configurable (beyond the syslog facility to use), based on the reasonable assumption that the syslog daemon will allow fine-tuning.

Migrating to systemd brought a big regression here -- suddenly we are storing the assignment from IP addresses to MAC addresses, which we never wanted to do. With systemd sucking in all data sent to the syslog, and offering no configuration of what happens prior to storage, it is clearly sub-par compared to syslog implementations. It seems we now have the choice of either entirely disabling journald (making system administration much harder), or violating our own policies of not storing user data.

It is really sad to see such a central and otherwise well-designed piece of system software force users to make such a decision; clearly, this will lead to some users taking convenience over privacy and storing more data than they should.

(There is also a practical issue here, isc-dhcp-server is sending so many messages that other services' messages are rotated out within ~24 hours. Even if we would not bother storing user data, that would still be a major issue.)

@dkg

Contributor

dkg commented Mar 30, 2016

OpenSSH is an example of a popular upstream project that logs IP addresses and expects filtering to be done outside of it. This is from a recent discussion:

https://marc.info/?l=openssh-unix-dev&m=145932982624862&w=2

We've always logged IP addresses in some circumstances, we're just being more consistent
in doing so. Anyone who has had special requirements around log privacy should have
implemented filtering years ago.

@coretemp

coretemp commented Oct 27, 2018

@RalfJung Why are you filing an issue with systemd as opposed to whoever is your system vendor?

@dkg Interesting observation. However, if this is still the case, it just means that openssh would have to be forked, not that it would require a systemd change.

(Yes, I know this is an old issue.)

@dkg

Contributor

dkg commented Oct 29, 2018

@coretemp "just" forking openssh‽ No one said that this would require a systemd change. we've just been observing here that there are a number of reasons why systemd is an eminently reasonable place to make a change that supports all of these different use cases.

@anarcat

anarcat commented Oct 30, 2018

maybe we could "just" fork systemd. :p
