Large Hadron Migrator
This is the Shopify downstream fork of https://github.com/soundcloud/lhm.
Rails style database migrations are a useful way to evolve your data schema in an agile manner. Most Rails projects start like this, and at first, making changes is fast and easy.
That is until your tables grow to millions of records. At this point, the
locking nature of
ALTER TABLE may take your site down for an hour or more
while critical tables are migrated. In order to avoid this, developers begin
to design around the problem by introducing join tables or moving the data
into another layer. Development gets less and less agile as tables grow and
grow. To make the problem worse, adding or changing indices to optimize data
access becomes just as difficult.
Side effects may include black holes and universe implosion.
There are few things that can be done at the server or engine level. It is
possible to change default values in an
ALTER TABLE without locking the
table. The InnoDB Plugin provides facilities for online index creation, which
is great if you are using this engine, but only solves half the problem.
At SoundCloud we started having migration pains quite a while ago, and after looking around for third party solutions, we decided to create our own. We called it Large Hadron Migrator, and it is a gem for online ActiveRecord migrations.
The Large Hadron is a test driven Ruby solution which can easily be dropped
into an ActiveRecord migration. It presumes a single auto
incremented numerical primary key called id as per the Rails convention. Unlike
the twitter solution, it does not require the presence of an indexed
Lhm currently only works with MySQL databases and requires an established ActiveRecord connection.
It is compatible and continuously tested with MRI 2.0.x, 2.1.x, ActiveRecord 3.2.x and 4.x (mysql and mysql2 adapters).
Due to the Chunker implementation, Lhm requires that the table to migrate has a
a single integer numeric key column called
Another note about the Chunker, it performs static sized row copies against the
column. Therefore sparse assignment of
id can cause performance problems for the
backfills. Typically LHM assumes that
id is an
auto_increment style column.
Install it via
gem install lhm or add
gem "lhm" to your Gemfile.
You can invoke Lhm directly from a plain ruby file after connecting ActiveRecord to your mysql instance:
require 'lhm' ActiveRecord::Base.establish_connection( :adapter => 'mysql', :host => '127.0.0.1', :database => 'lhm' ) # and migrate Lhm.change_table :users do |m| m.add_column :arbitrary, "INT(12)" m.add_index [:arbitrary_id, :created_at] m.ddl("alter table %s add column flag tinyint(1)" % m.name) end
To use Lhm from an ActiveRecord::Migration in a Rails project, add it to your Gemfile, then invoke as follows:
require 'lhm' class MigrateUsers < ActiveRecord::Migration def self.up Lhm.change_table :users do |m| m.add_column :arbitrary, "INT(12)" m.add_index [:arbitrary_id, :created_at] m.ddl("alter table %s add column flag tinyint(1)" % m.name) end end def self.down Lhm.change_table :users do |m| m.remove_index [:arbitrary_id, :created_at] m.remove_column :arbitrary end end end
Note: Lhm won't delete the old, leftover table. This is on purpose, in order to prevent accidental data loss.
Lhm is using a throttle mechanism to read data in your original table.
By default, 2000 rows are read each 0.1 second.
If you want to change that behaviour, you can pass an instance of a throttler with the
In this example, 1000 rows will be read with a 10 seconds delay between each processing:
my_throttler = Lhm::Throttler::Time.new(stride: 1000, delay: 10) Lhm.change_table :users, throttler: my_throttler do |m| # end
Lhm uses by default the time throttler, however a better solution is to throttle the copy of the data depending on the time that the slaves are behind. To use the SlaveLag throttler:
Lhm.change_table :users, throttler: :slave_lag_throttler do |m| # end
Or to set that as default throttler, use the following (for instance in a Rails initializer):
Table rename strategies
There are two different table rename strategies available: LockedSwitcher and AtomicSwitcher.
The LockedSwitcher strategy locks the table being migrated and issues two ALTER TABLE statements. The AtomicSwitcher uses a single atomic RENAME TABLE query and is the favored solution.
Lhm chooses AtomicSwitcher if no strategy is specified, unless your version of MySQL is affected by binlog bug #39675. If your version is affected, Lhm will raise an error if you don't specify a strategy. You're recommended to use the LockedSwitcher in these cases to avoid replication issues.
To specify the strategy in your migration:
Lhm.change_table :users, :atomic_switch => true do |m| # ... end
Limiting the data that is migrated
For instances where you want to limit the data that is migrated to the new table by some conditions, you may tell the migration to filter by a set of conditions:
Lhm.change_table(:sounds) do |m| m.filter("inner join users on users.`id` = sounds.`user_id` and sounds.`public` = 1") end
Note that this SQL will be inserted into the copy directly after the "from" statement - so be sure to use inner/outer join syntax and not cross joins. These conditions will not affect the triggers, so any modifications to the table during the run will happen on the new table as well.
Cleaning up after an interrupted Lhm run
If an Lhm migration is interrupted, it may leave behind the temporary tables and/or triggers used in the migration. If the migration is re-started, the unexpected presence of these tables will cause an error.
In this case,
Lhm.cleanup can be used to drop any orphaned Lhm temporary tables or triggers.
To see what Lhm tables/triggers are found:
To remove any Lhm tables/triggers found:
Optionally only remove tables up to a specific Time, if you want to retain previous migrations.
Lhm.cleanup(true, until: 1.day.ago)
Lhm.cleanup(true, until: Time.now - 86400)
First, get set up for local development:
dev clone lhm dev up
To run the tests:
dev unit # unit tests dev int # integration tests dev test # all tests
You can run an individual test as follows:
bundle exec rake unit TEST=spec/integration/atomic_switcher_spec.rb
You can get code coverage reporting for an individual test as follows:
rm -rf coverage; COV=1 bundle exec rake unit TEST=spec/integration/atomic_switcher_spec.rb; open coverage/index.html # test one file
or get code coverage for all tests:
The integration tests rely on a master/slave replication setup of MySQL.
We're using dbdeployer to set this up via
dbdeployer provides scripts for operating and accessing the nodes in
There is a lot in there, and most of time you shouldn't need to work with the nodes directly, but it's good
to know where to go!
Releasing new versions:
This gem is published to Shopify's internal PackageCloud
The procedure to publish a new version:
- Run bundle install to bump the Gemfile.lock version of the gem
- Add CHANGELOG entries by seeing what has not been shipped yet in https://shipit.shopify.io/shopify/lhm/production
- Open PR for version bump and merge to master
- Create release on GitHub with a version number that matches version.rb
- Visit https://shipit.shopify.io/shopify/lhm/production and deploy
- See your new version on https://gems.shopify.io/packages/lhm
The license is included as LICENSE in this directory.