Skip to content

Anonymize or modify data on the fly when creating MySQL data dumps using simple YAML configuration file

Notifications You must be signed in to change notification settings

rosmo/mydumper-anon

Repository files navigation

== What is mydumper? Why? ==

* Parallelism (hence, speed) and performance (avoids expensive character set conversion routines, efficient code overall)
* Easier to manage output (separate files for tables, dump metadata, etc, easy to view/parse data)
* Consistency - maintains snapshot across all threads, provides accurate master and slave log positions, etc
* Manageability - supports PCRE for specifying database and tables inclusions and exclusions

It does not support schema dumping and leaves that to 'mysqldump --no-data'

== How to build it? ==

Run:
 cmake .
 make

One needs to install development versions of required libaries (MySQL, GLib, ZLib, PCRE):
NOTE: you must use the correspondent mysql devel package.

* Ubuntu or Debian: apt-get install libglib2.0-dev libmysqlclient15-dev zlib1g-dev libpcre3-dev libssl-dev
* Fedora, RedHat and CentOS: yum install glib2-devel mysql-devel zlib-devel pcre-devel openssl-devel
* openSUSE: zypper install glib2-devel libmysqlclient-devel pcre-devel zlib-devel
* MacOSX: port install glib2 mysql5 pcre pkgconfig cmake
 (You may want to run 'port select mysql mysql5' afterwards)

One has to make sure, that pkg-config, mysql_config, pcre-config are all in $PATH

Binlog dump is disabled by default to compile with it you need to add -DWITH_BINLOG=ON to cmake options

=== MacOSX homebrew formula ===

[Homebrew](http://brew.sh/) is a package manager for MacOSX.

```
brew install https://raw.githubusercontent.com/rosmo/mydumper-anon/master/homebrew/mydumper-anon.rb
```


== How does consistent snapshot work? ==

This is all done following best MySQL practices and traditions:

* As a precaution, slow running queries on the server either abort the dump, or get killed
* Global write lock is acquired ("FLUSH TABLES WITH READ LOCK")
* Various metadata is read ("SHOW SLAVE STATUS","SHOW MASTER STATUS")
* Other threads connect and establish snapshots ("START TRANSACTION WITH CONSISTENT SNAPSHOT")
** On pre-4.1.8 it creates dummy InnoDB table, and reads from it.
* Once all worker threads announce the snapshot establishment, master executes "UNLOCK TABLES" and starts queueing jobs.

This for now does not provide consistent snapshots for non-transactional engines - support for that is expected in 0.2 :)

== How to exclude (or include) databases? ==

Once can use --regex functionality, for example not to dump mysql and test databases:

 mydumper --regex '^(?!(mysql|test))'

Of course, regex functionality can be used to describe pretty much any list of tables.

== How to exclude MERGE or Federated tables ==

Use same --regex exclusion syntax. Again, engine-specific behaviors are targetted for 0.2

== How to anonymize tables ==

You can specifying a YAML configuration file to the --anonymize flag. Anonymization
configuration supports truncation of tables, column editing (replace contents,
randomize with wordlists, randomized date/time/datetime).

Example of all operations currently implemented:

 anonymizer_settings:
   randomize:
     # wordlist_id: file_with_words_one_per_line.txt
     firstname: firstnames.txt
     lastname: lastnames.txt
     fullname: fullnames.txt
    
 database_name:
   table_name_1:
     truncate: yes
  
   table_name_2:
     edit:
       - when: {folder: '^config$', path: '^secret/key1'}
         set: {value: ''}
       - when: {folder: '^config$', path: '^secret/key2'}
         set: {value: ''}
   
   table_name_3:
     edit:
       - when: {}
         set: {customer_name: 'John {{user_id}}', customer_address: 'Fixed Street 1'}

   table_name_4:
     randomize:
       column_name:
         type: 'fullname'
       datetime_column_name:
         type: 'datetime'
       time_column_name:
         type: 'time'
       date_column_name:
         type: 'date'
	
   # Modifiers are separated with |, currently only md5 supported. Additionally you can 
   # specify a ":n" after the modifier to only take n amount of characters.
   table_name_5:
     edit:
       - when: {email: '.+'}
         set: {email: '{{id_user|md5:10}}@anonymized.com', name: '{{name|md5}}'}

About

Anonymize or modify data on the fly when creating MySQL data dumps using simple YAML configuration file

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages