Version 0.2.1
A tool for anonymising postgresql dump files made using the Postgresql
pg_dump
command using row-wise filters.
The tool takes advantage of the structure of COPY
lines in dump files,
that is those between a COPY <schema>.<tablename> (...column list...) FROM stdin;
and a \.
terminating line, to separate the lines into
columns and to either remove lines or replace the columns specified.
The filtering strategy is speedy and multi-gigabyte dump files can typically be processed in under a minute. Note that the use of the reference (or "foreign key") filter requires two scans of the dump file and to hold the referenced table/s in memory.
Usage:
gopg-anonymise : a simple postgresql dump file anonymiser.
Anonymise a postgresql dump file using a toml settings file setting out
the deletion, or columnar uuid, string, file or reference replacement
filters to use.
gopg-anonymise -s <settings.toml> [-o output or stdout] [-t test] [Input]
Application Options:
-s, --settings= settings toml file
-o, --output= output file (otherwise stdout)
-t, --testmode show only changed lines for testing
Help Options:
-h, --help Show this help message
Arguments:
Input: input file or stdin
A toml file is used to describe tables that should be anonymised. For each table to be anonymised one or more filters may be provided.
Presently, apart from the row delete filter, four column replacement filters are provided:
-
uuid replaces one or more columns with a new uuid
-
string replace replaces the data in one or more columns with replacement values.
-
file replace replaces the data in one or more columns with corresponding lines in the source file. If the source file is exhausted, cycle the inputs starting from the first line of the source.
-
reference replace replaces data in one or more columns with the value from one or more values from a reference table, requiring the specification of a local and remote key to make matches along the lines of a simple foreign key lookup. Note that the reference table is read into memory. Reference tables track their original and new values so that, for example, a local table can update its
username
value from the new anonymised value in ausers
table even if the foreign key value has been updated.
Each filter can be qualified by If
and NotIf
filters which determine
if the filter should be run based on the contents of one or more columns
in the row. Conditionals match if any of their criteria are true.
The order of filters is important. Beware of changing a row ahead of using a conditional.
[["example_schema.events"]]
filter = "delete"
[["public.users"]]
filter = "string replace"
# column names must equal those in the dump file "COPY" load header
# one or more columns must be provided
columns = ["password"]
# the number of replacements must equal the number of columns
# give all users the same password
replacements = ["$2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC"]
# don't replace if lastname column has the value "langoustine"
notif = {"lastname" = "langoustine"}
[["public.users"]]
filter = "file replace"
# column names must equal those in the dump file "COPY" load header
# the number of columns in the source must equal the columns listed
columns = ["firstname", "lastname"]
source = "testdata/newnames.txt"
[["public.users"]]
filter = "uuid"
columns = ["uuid"]
[["public.users"]]
filter = "file replace"
columns = ["notes"]
# cycle through 2 notes in the file
source = "testdata/newnotes.txt"
# only replace if the notes column is not null
# Postgresql NULL columns must be recorded as '\N' or "\\N"
notif = {"notes" = '\N'}
[["public.fkexample"]]
filter = "reference replace"
# columns are the local table column names
columns = ["firstname_materialized"]
# replacements are the foreign table column names
replacements = ["firstname"]
# fklookups.0 is the localkey, .1 is the schema.table.column fk
optargs = {"fklookup" = ["user_id", "public.users.id"]}
This example uses the dump and settings toml files provided in the testdata directory.
The events table contents are removed.
The fkexample table values in the firstname_materialized column are replaced by the anonymised values in the users.firstname column achieved by the match between public.fkexample.user_id and public.users.id.
For the users table, the password column is replaced verbatim except for user "langoustine", and all uuids are newly generated. The first name and last name of each user is replaced from the file "testdata/newnames.txt", although because there are only 5 entries in the file, the 6th entry is recycled (with the firstname "zachary"). The two notes in "testdata/newnotes.txt" are cycled through the three entries which are not NULL.
Original tables:
egrep -A 5 "COPY example_schema.events"
COPY example_schema.events (id, flags, data) FROM stdin;
1 {flag1,flag2} {"a": "b"}
2 {"flag1,a","flag2,b"} {"a": "c,b", "b": [1, 0]}
3 {flag3} {"c": null}
4 {"flag3\ttab"} {"d": "x y"}
\.
egrep -A 4 "COPY public.fkexample" testdata/pg_dump.sql
COPY public.fkexample (id, user_id, firstname_materialized) FROM stdin;
1 1 ariadne
2 3 lucius
3 5 asterix
\.
egrep -A 7 "COPY public.users" testdata/pg_dump.sql
COPY public.users (id, firstname, lastname, password, uuid, notes) FROM stdin;
1 ariadne augustus $2a$06$xyhc3ZN0KLlw4XSM8YypjueqptvViUdTBQq3m2as3QMZ/lL6gH6ie 6b1b3a33-484a-4870-b6ec-58a8d72fc306 \N
2 james joyce $2a$06$YpMDzzGDmUz.tgGtkYotaeFnGliNymZTBIHPGPyCd8D9jXHLsnC/a 95ae2b5a-56a6-412d-b7af-e7d0eb1a412f \N
3 lucius langoustine $2a$06$cj4Coa76ZPud2KiFW4wPDuTL98N8p4mFjJoV5mJ2Id9.2QiAcJ6bO db761046-e61e-4b5f-8dc5-64b89ed0dd77 \N
4 biggles barrymore $2a$06$eS8/gKhuPcwdklVWwqgK0.9Z30Bk5hwveYBdyVQny1GwtnSoEEQ8C f7a53cb0-454e-43f7-8559-e1e5097e1f3a a 'note'
5 asterix a gaul $2a$06$Llerb92vQ763qEX3e/v9WueqoCJdYu4F0mI65xo8Y1uif/vMTlsLq 46cebc75-8b9a-4666-94f3-8142e73c23d2 a "note", with commas, etc.
6 wormtail wyckenhof $2a$06$BEOCQhB5i5zPkAqe2pKq5O6zJmafmwjxkn4NB0mek3w5o70ytkxzm 708fd360-34bb-4ea4-8096-71920bfa7809 a note with a tab here:"\t"
\.
Output (in test -t
mode):
./gopg-anonymise -t -s testdata/settings.toml testdata/pg_dump.sql
COPY example_schema.events (id, flags, data) FROM stdin;
COPY public.fkexample (id, user_id, firstname_materialized) FROM stdin;
1 1 zachary
2 3 xavier
3 5 vanessa
COPY public.users (id, firstname, lastname, password, uuid, notes) FROM stdin;
1 zachary zaiden $2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC 6a474264-5cd7-4184-a1ed-9049ed4aecaf \N
2 yael yaeger $2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC f3da4fee-c05a-447f-a619-a4b0c626c456 \N
3 xavier xander $2a$06$.d8FVKIVagQaHU.6ouHGKegL85H8.cFIvXDNGC/wb8dXAWt3fmukq 6837753f-4dc9-4e95-9d14-375aa62cd021 \N
4 william williamson $2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC 242704da-6ba9-47b7-a84e-e2216b713324 this is a second note\twith a tab
5 vanessa vaccarelli $2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC 913aae23-d9df-4c66-9f93-9ce59f94c907 this is the first note
6 zachary zaiden $2a$06$.wHg4l7yz1ijSfMwa7fNruq3ASx1plpkC.XcI1wXdghCb4ZJQsrtC 3395c1a1-d9ea-4803-81ba-65333a7698e3 this is a second note\twith a tab
This software is provided under an MIT licence.