SO2DB provides an API for importing the Stack Overflow/Stack Exchange data dumps into a database. It also provides a PostgreSQL import utility (so2pg) out of the box.
Download a Stack Exchange Data Dump and extract the data set that you want to import (e.g., math.stackexchange.com.7z).
Ensure that you have the PostgreSQL client (
psqlis used to bulk copy data into your database.
Create a PostgreSQL database for the data.
Install SO2DB with the command
gem install so2db. You may have to use
sudodepending on your setup.
Call so2pg to import the data with the command
so2pg -O -R -d db_name -D /path/to/data/dir. The data directory is the directory containing the clean XML files. The -O and -R flags indicate that you wish to include optional tables and table relationships, respectively. You can see all the options by running
so2pg --helpor in the list below.
Wait impatiently! Depending on your hardware and the data set you choose to import,
so2pgmay take a while to run. With all that free time on your hands, why don't you help me figure out how to make so2db faster?!
-H --host HOST The database host -d --database DBNAME The name of the database (REQUIRED) -D --directory DIRECTORY The data directory path (REQUIRED) -u --user USER The user name -p --password PASSWORD The user's password -P --port PORT_NUMBER The port number -O --include-optionals Includes optional tables -R --include-relationships Includes table relationships -h --help Show this help screen
Be sure to check out the Tekpub Hello PostgreSQL series. It presents a lot of useful information quickly and in an accessible manner. I built this gem simply so I could practice along with the videos. Highly recommended!
We have run so2pg several times on two machines with slightly hardware configurations. The CPUs and RAM were comparable (2.66 vs. 2.4 GHz, 4GB RAM), but one had an SSD while the other had a ...slow... HD. When importing the Apr 2012 Stack Overflow dump on each machine, we noticed about 43% faster import times on the SSD (2 hrs vs. 3.5 hrs).
Your import may take quite some time to complete, and the performance is very dependent on your hard drive speed.
If you are running
so2pg on OS X, you may want to run the
purge command in a terminal periodically. This will free up "Inactive Memory" and reduce the number of pages to disk. This should help performance, though I haven't specifically made measurements on it.
Before you create your own custom importer, you should check to see if someone is already working on one with the same purpose. Otherwise, let us know what you're working on and get started!
SO2DB depends on ActiveRecord and Foreigner to build tables and relationships, and so SO2DB is limited to the databases supported by both projects. (At the time of writing, Foreigner only supports PostgreSQL, MySQL, and SQLite. ActiveRecord supports these and more.)
Creating a custom importer requires you to provide two classes: an ActiveRecord monkey patch and the importer implementation.
First, you need to create a monkey patch that adds a uuid method to the associated ActiveRecord connection adapter. The uuid method defines the database type associated with a universally unique identifier (e.g., 'uuid' in PostgreSQL, 'CHAR(16)' in MySQL).
Next, you must create a subclass of SO2DB::Importer that contains a method with the following definition:
The formatter will be a SO2DB::Formatter, which generates formatted output from a StackOverflow XML data file when you call the its
format method. The
format method accepts a stream and pumps formatted data to that stream. This approach allows for you to deal with the data as you see fit - pass it over STDIN to another command (this is what so2pg does) or send it to a file that you then pass to a database system. If you need the database parameters provided to SO2DB, they are provided in the importer conn_opts property.
The data coming from the formatter will use a delimiter specified in the Importer implementation. A vertical tab, '\v' (0xB), is used by default. This can be changed by setting the format_delimiter property in the importer.
The formatter also offers a couple of convenience routines,
file_name simply provides the name of the file to be formatted.
value_str provides a partial SQL string based on the data to be formatted. It provides the table name and the ordered value names associated with the formatted data. For example, assume you are provided a formatter for badges.xml. The convenience routines produce the following:
puts formatter.file_name # => badges.xml puts formatter.value_str # => badges(id,date,name,user_id)
Note that the field names are alphabetized. The field values from the formatter will have the same ordering. This is to ensure consistent and predictable field ordering, even if fields are not provided in the XML file. If a field is not provided in the XML file, an empty string is inserted into the formatted data.
I encourage you to check out the implementation of so2pg if you are interested in developing your own importer; there are additional notes throughout the source to make your development experience a bit smoother.
When you are finished, consider creating a pull request. I would love to include your labor of love in a future release!
SO2DB is released under the MIT License.
This project has been tested under OS X (Snow Leopard) and Ubuntu 12.04 using Ruby version 1.9.2 and 1.9.3. If anyone wants to test on other platforms, I would appreciate it!
SO2DB has been tested against the following data dumps:
Sept 2011 Apr 2012
Thanks to @nclaburn for being a great mentor and helping me get this project released!