Making database migrations to MarkLogic easy
This is a starter kit for creating an application that uses Spring Batch and marklogic-spring-batch for migrating data from any RDBMS into MarkLogic. The intent is to simplify the process of creating a migration application using Spring Batch by leveraging the reusable components in marklogic-spring-batch, and by organizing a Gradle-based project for you that you can clone/fork/etc to quickly extend and customize for your specific needs.
This project provides the following features:
- Provides a simple way to get started with Spring Batch - just configure the JDBC/MarkLogic connection properties, provide your database's JDBC driver, write a SQL query, and you're off and running
- Given a SQL query, or for every table in the database, uses a Spring JDBC ColumnMapRowMapper to read every row as a map
- Uses StAX to convert that map into an XML document
- Utilizes the Data Movement SDK to write batches of documents across many hosts in a cluster.
This project has the following defaults in place that you can use as a starting point:
- Sets up and reads data from an H2 test database - this is only used for a quick demonstration and can be easily removed from the project. See gradle task "loadH2Data".
- Defaults to writing to MarkLogic using admin:admin@localhost:8000. These defaults can be changed in gradle.properties or via the command line.
- Has a Gradle task for launching the migration - "./gradlew migrate"
The fact that this uses H2 by default is only so you can kick the tires on it. Instructions are below for how to change the JDBC configuration properties to point to your own database.
How do I try this out?
To try this out locally with the default configuration for H2, just do the following:
- Clone this repo
- Verify you have ML 8+ installed locally and that port 8000 (the default one) points to the Documents database (you can of course modify this to write to any database you want)
- Verify that the username/password properties gradle.properties are correct for your MarkLogic cluster (it's best not to use the admin user unless absolutely necessary, but this defaults to it for the sake of convenience)
- Run ./gradlew migrate
You should see some logging like this:
10:18:26.934 [main] INFO c.m.s.b.i.rdbms.AllTablesItemReader - Finished reading rows for query: SELECT * FROM CUSTOMER 10:18:26.942 [main] INFO c.m.s.b.i.rdbms.AllTablesItemReader - Finished reading rows for query: SELECT * FROM INVOICE 10:18:27.068 [main] INFO c.m.s.b.i.rdbms.AllTablesItemReader - Finished reading rows for query: SELECT * FROM ITEM 10:18:27.077 [main] INFO c.m.s.b.i.rdbms.AllTablesItemReader - Finished reading rows for query: SELECT * FROM PRODUCT
The connection properties to the RDBMS and to MarkLogic, along with all other job properties, can be modified in gradle.properties. You can modify those properties on the command line via Gradle's -P mechanism, e.g.
./gradlew migrate -Pthread_count=32
Or load the data as JSON instead of XML:
./gradlew migrate -Pdocument_type=json
Or specify different collections and a different URI prefix:
./gradlew migrate -Pcollections=one,two -Poutput_uri_prefix=/test/
Or just modify gradle.properties and start building your own application.
You can also see all the supported arguments:
How do I make this work with my own database?
To try this on your own database, you'll need to change the JDBC connection properties - jdbc_url, jdbc_username, and jdbc_password.
Assuming you're not using H2, you'll also need to change the jdbc_driver property. You'll need to add your JDBC driver to the Gradle "runtime" classpath. The H2 driver is currently in that classpath via the following line in build.gradle:
You can either change that line to reference your own JDBC driver in a Maven repository, or remove it and replace it with a line that specifies where your driver is on the classpath, e.g.:
Next, by default, the migration program tries to migrate all tables that it finds in the database. You can either retain this behavior by keeping the "all_tables" property set to "true", or you can set that property to "false" and specify a SQL query, a local name for the root element of each XML document that's written, and the collections to put each document into - example:
sql=SELECT * FROM Customer root_local_name=customer collections=customer,migrated
If you do keep "all_tables" set to "true", it's best to keep "collections" set to something generic that each document will be written to - example:
At this point, you're able to reconfigure the migration job to talk to any database, run any SQL query, and set any local name for the root element, any collection, and any permission.
But how do I modify the XML that's inserted into MarkLogic?
The org.example.MigrationConfig class defines the components used by Spring Batch to migrate data. This class creates a Spring Batch Reader, Writer, and Processor (for more information on these concepts, check out the Spring Batch user manual).
To modify how this works, you'll need to write code, which opens the door to all the batch-processing power and flexibility provided by Spring Batch. Here are a few paths to consider:
- Modify the ColumnMapProcessor with your own method for converting a ColumnMap into a String of XML, or JSON.
- Write your own Processor implementation for converting a ColumnMap into something that can be written to MarkLogic.
- Write your own Reader that returns something besides a ColumnMap (you'll need to modify the Processor and Writer as well; note that the ) Processor is optional, so you can omit that if needed).
- You can even replace the MarkLogicItemWriter, which depends on a MarkLogic Java Client DocumentWriteOperation instance. Typically though, you'll be able to retain this part by having your Reader and/or Processor return a DocumenteWriteOperation, which encapsulates all the information needed to write a single document to MarkLogic.