
Exporting the MusicBrainz database into Neo4j

What you'll need:

Create a mirror of the MusicBrainz PostgreSQL database

I recommend Lukáš Lalinský's mbslave (

mbslave setup

$ git clone
$ cd mbslave/

Follow mbslave's

(For step 2, I usually create a new cluster just for MusicBrainz:

As root,

# mkdir musicbrainz
# cd musicbrainz/
# mkdir dbdata
# chown postgres:postgres dbdata
# su postgres

As 'postgres' user,

postgres@host$ pg_createcluster -d $PWD/dbdata --locale C -e UTF8 9.1 musicbrainz
postgres@host$ pg_ctlcluster 9.1 musicbrainz start

Create a superuser, e.g. "mbadmin", and create a UTF8 database called "musicbrainz":

postgres@host$ createuser -p 5433 -P
postgres@host$ createdb -p 5433 -l C -E UTF-8 -T template0 -O mbadmin musicbrainz

and cp mbslave.conf.default mbslave.conf and adapt it to your settings (database name, host and port number, password))

Continue following mbslave's instructions (steps 3, 4 and 5): prepare the schema, download dumps and import data

I usually use only mbdump.tar.bz2 (it's 1.5GB)

After step 5, it should be sufficient to only run ./ <sql/CreatePrimaryKeys.sql | ./

Generate the SQL export script

$ cd /path/to/sql2graph/
$ python > musicbrainz2neo4j.sql

Then feed this script to psql on the musicbrainz database; you can of course use mbslave for that too:

$ cd /path/to/mbslave
$ cat /path/to/musicbrainz2neo4j.sql | ./

By default, the generated TSV/CSV files are in /tmp:

user@host$ ls -la /tmp/musicbrainz__*
-rw-r--r-- 1 postgres postgres 355724444 oct.  28 10:23 /tmp/musicbrainz__nodes__full.csv
-rw-r--r-- 1 postgres postgres 138827549 oct.  28 10:23 /tmp/musicbrainz__rels__full.csv

File sizes vary with the entities you choose to export; you can limit the entities you export using something like python --limit label,artist,url

By default, all core entities in MusicBrainz are exported
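Before moving on to the import, a quick look at the generated files can catch an empty or truncated export. The following is a minimal sketch, not part of sql2graph: summarize_export is a hypothetical helper name, and it assumes each file has a single header line and tab-separated fields.

```shell
# Print "<data rows> <columns>" for a delimited export file.
# Assumption (not confirmed by the source): one header line,
# tab-separated fields.
summarize_export() {
    awk -F '\t' '
        NR == 1 { cols = NF }    # take the column count from the header
        END     { printf "%d %d\n", (NR > 0 ? NR - 1 : 0), cols }
    ' "$1"
}
```

For example, summarize_export /tmp/musicbrainz__nodes__full.csv prints the number of data rows followed by the number of columns found in the header.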

Import into Neo4j

$ cd /path/to/sql2graph/
$ python > musicbrainz2neo4j.sql

$ # export just 1000 rows per table
$ python --limit 1000 > musicbrainz2neo4j.sql
$ cd /path/to/mbslave
$ cat /path/to/musicbrainz2neo4j.sql | ./

By default, the generated TSV/CSV files are in /tmp/musicbrainz__nodes__full.csv and /tmp/musicbrainz__rels__full.csv.


  • use the batch-import project to import the csv files into neo4j,
  • make sure you switch to the "20" branch of batch-import (for labels support),
  • using a mb_fulltext and mb_exact index in a custom ./ file,
  • putting the database under ./musicbrainz.db.
  • (this will erase your current neo4j datastore)
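Since the import replaces the existing datastore, it may be worth moving the old directory aside first. This is a minimal sketch under the assumption that the store is a plain directory (such as ./musicbrainz.db) and that Neo4j is stopped; backup_store is a hypothetical helper, not something batch-import provides.

```shell
# Move an existing store directory aside instead of losing it.
# Prints the backup path, or nothing if there was nothing to move.
# Assumption: the store is an ordinary directory passed as $1.
backup_store() {
    store=$1
    [ -e "$store" ] || return 0
    # Strip a trailing slash, then append a timestamped suffix.
    backup="${store%/}.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$store" "$backup" && printf '%s\n' "$backup"
}
```

Running backup_store ./musicbrainz.db before the import leaves the previous store under a timestamped name next to the new one.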
$ cd /path/to/jexp/batch-import
$ git checkout -b neo4j-2.0 origin/20
$ # build batch-import...
$ # prepare a file:
$ echo -e "batch_import.node_index.mb_exact=exact\nbatch_import.node_index.mb_fulltext=fulltext" >
$ MAVEN_OPTS="-server -Xmx10G -Dfile.encoding=UTF-8" mvn exec:java -Dfile.encoding=UTF-8 -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args=" musicbrainz.db /tmp/musicbrainz__nodes__full.csv /tmp/musicbrainz__rels__full.csv"
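The echo -e call used above to prepare the index configuration is not portable: some shells (dash, for instance) print the -e literally. A sketch of a printf-based alternative follows; write_index_config is a hypothetical helper name, and the target file is left as a parameter because it depends on your batch-import setup.

```shell
# Write the two index declarations to the properties file given as $1.
# The file name is an argument on purpose: use whatever file your
# batch-import checkout reads its configuration from.
write_index_config() {
    printf '%s\n' \
        'batch_import.node_index.mb_exact=exact' \
        'batch_import.node_index.mb_fulltext=fulltext' > "$1"
}
```

The resulting file contains exactly the two lines that the echo command is meant to produce, one per index.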

Finally, restart your Neo4j instance (you had stopped it before running the batch-import, right?) and play around with MusicBrainz data with the sample queries in examples/musicbrainz/queries.cyp