Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
  • Loading branch information
aseering committed Apr 9, 2013
2 parents ca48045 + 3efb6fe commit b4c2632
Show file tree
Hide file tree
Showing 8 changed files with 773 additions and 1 deletion.
34 changes: 34 additions & 0 deletions odbc_loader_package/Makefile
@@ -0,0 +1,34 @@

TMPDIR ?= /tmp
SDK_HOME ?= /opt/vertica/sdk
SHELL = /bin/bash
VSQL ?= /opt/vertica/bin/vsql

CXXFLAGS := $(CXXFLAGS) -I $(SDK_HOME)/include -I $(SDK_HOME)/examples/HelperLibraries -fPIC -shared -Wall -g

build: build/ODBCLoader.so
## See targets below for actual build logic

clean:
rm build/ODBCLoader.so

install: build
$(VSQL) -f ddl/install.sql

uninstall:
$(VSQL) -f ddl/uninstall.sql

test:
## Try uninstalling first, just in case we have a stale version around
-@$(MAKE) uninstall
@$(MAKE) install
-diff -u Tests.out <($(VSQL) -f examples/sample_usage.sql 2>&1)
@$(MAKE) uninstall

.PHONY: build clean install uninstall test


## Actual build target
build/ODBCLoader.so: ODBCLoader.cpp $(SDK_HOME)/include/Vertica.cpp $(SDK_HOME)/include/BuildInfo.h
mkdir -p build
$(CXX) $(CXXFLAGS) -o $@ $(SDK_HOME)/include/Vertica.cpp ODBCLoader.cpp -lodbc
538 changes: 538 additions & 0 deletions odbc_loader_package/ODBCLoader.cpp

Large diffs are not rendered by default.

88 changes: 88 additions & 0 deletions odbc_loader_package/README
@@ -0,0 +1,88 @@
-------------------------------
INTRODUCTION
-------------------------------

ODBCLoader

This package contains a pair of User-Defined Load functions, ODBCSource() and ODBCLoader(), that can be used to load data directly from a remote database.

These Load functions connect to a remote database and load data directly from that database into Vertica. Data is not converted into an intermediate file format; the data is fetched over the network via ODBC and each record is copied directly into Vertica as it is received.

-------------------------------
BUILDING
-------------------------------

To build:

$ make

The ODBC Loader depends on ODBC. It is tested with the unixODBC driver manager. unixODBC is available through all major Linux distribution's package managers; we recommend getting the supported version through your distribution. It is also available from <http://www.unixodbc.org>.

unixODBC needs to be installed on ALL COMPUTERS IN YOUR VERTICA CLUSTER in order to be able to use the ODBC Loader!

In order to compile the ODBC Loader, you will need the unixODBC development headers as well. Most Linux distributions package these separately, in a package named something like "unixodbc-dev" or "unixodbc-devel". You only need these installed on the machine you're using to build the ODBC Loader.

-------------------------------
INSTALLING / UNINSTALLING
-------------------------------

Assuming vsql is in your path, just do:

$ make install
$ make uninstall

Alternately, you can find the DDL that 'make install' uses in:
ddl/install.sql
and
ddl/uninstall.sql

As mentioned under 'BUILDING', to install the ODBC Loader you will need to have an ODBC driver manager installed on your cluster.

In order for the ODBC Loader to be useful, you will also need to install one or more ODBC drivers. These must be installed on ALL COMPUTERS IN YOUR VERTICA CLUSTER! The drivers are what enable the ODBC Loader to talk to other non-Vertica databases. Each database product typically has its own driver; you'll need to install the driver per their documentation and packages.

-------------------------------
USAGE
-------------------------------

See examples/sample_usage.sql

In general, usage looks like:

COPY tbl WITH SOURCE ODBCSource() PARSER ODBCLoader(connect='DSN=some_odbc_dsn', query='select * from remote_table;');

This will cause Vertica to connect to the remote database identified by the given "connect" string and execute the given query. It will then fetch the results of the query and load them into the table "tbl".

"tbl" must have the same number of columns as 'remote_table'. The column types must also match up, or the ODBC driver for the remote database must be able to cast the column types to the Vertica types. If necessary, you can always explicitly cast on the remote side by modifying the query, or on the local side with a Vertica COPY expression.

The 'query' argument to ODBCLoader can be any valid SQL query, or any other statement that the remote database will recognize and that will cause it to return a row-set.

The 'connect' argument to ODBCLoader can be any valid ODBC connect string. It is common to configure /etc/odbc.ini and /etc/odbcinst.ini with all the necessary information, then simply reference the DSN listing in /etc/odbc.ini in each query. For help configuring these files, or for more information on valid 'connect' strings, please see the documentation that came with the ODBC driver for the remote database product that you are connecting to, as the format of the string is specified by the driver.

WARNING: The ODBC Loader CAN CRASH VERTICA if not used with THREAD-SAFE ODBC drivers! Please see your driver's documentation to see if it is thread-safe by default or can be configured to operate in a thread-safe mode.

NOTE: There are essentially an unlimited number of permutations of remote database software, query types, data types, etc. All of them behave a little differently. If something doesn't work for you, patches welcome :-)

-------------------------------
DATABASE-SPECIFIC USAGE
-------------------------------

ORACLE: All integers in Vertica are 64-bit integers. Oracle doesn't support 64-bit integers; their ODBC driver can't even cast to them on request. This code contains a quirk/workaround for Oracle that retrieves integers as C strings and re-parses them. However, the quirk doesn't reliably detect Oracle database servers right now. You can force Oracle with an obvious modification to the setQuirksMode() function in ODBCLoader.cpp. If you know of a more-reliable way to detect Oracle, or a better workaround, patches welcome :-)

MYSQL: The MySQL ODBC driver comes in both a thread-safe and thread-unsafe build and configuration. The thread-unsafe version is KNOWN TO CRASH VERTICA if used in multiple COPY statements concurrently! (Vertica is, after all, highly multithreaded.) And distributions aren't consistently careful to package thread-safe defaults. So if you're connecting to MySQL, be very careful to set up a thread-safe configuration.

VERTICA: Why are you using this tool to connect one Vertica database to another Vertica database? Don't do that! Use Vertica's built-in IMPORT/EXPORT; it's dramatically faster. Two big clusters talking through a single ODBC translation layer / pipe just isn't the best way to do it :-)

-------------------------------
PERFORMANCE
-------------------------------

The ODBC Loader has not been meaningfully performance-benchmarked to date.

It is constrained by how fast the remote database can execute the specified query. It is also constrained by the network between the Vertica cluster and the remote database, as well as by any overhead in the ODBC driver being used. For this reason, it's recommended to use a relatively simple query, to make sure the network between the two database systems is high-bandwidth and low-latency, and to read the documentation for the ODBC driver in use for the remote database to verify that it is properly configured for optimal performance.

-------------------------------
LICENSE
-------------------------------

Please see LICENSE.txt

65 changes: 65 additions & 0 deletions odbc_loader_package/Tests.out
@@ -0,0 +1,65 @@
CREATE TABLE
Rows Loaded
-------------
10
(1 row)

i | b | f | v | c | lv | bn | vb | lvb | d | t | ts | tz | tsz | n
---+---+-----+--------+--------+--------+----------------+--------+--------+------------+----------+---------------------+-------------+------------------------+-------------
| | | | | | | | | | | | | |
1 | t | 1.5 | test 1 | test 1 | test 1 | test 1\000\000 | test 1 | test 1 | 2000-01-01 | 00:01:04 | 2038-01-01 08:14:07 | 01:21:00-04 | 2000-06-01 03:21:00-04 | 123456.7890
2 | t | 2.5 | test 2 | test 2 | test 2 | test 2\000\000 | test 2 | test 2 | 2000-01-02 | 00:02:04 | 2038-01-02 08:14:07 | 01:22:00-04 | 2000-06-01 03:22:00-04 | 123456.7890
3 | t | 3.5 | test 3 | test 3 | test 3 | test 3\000\000 | test 3 | test 3 | 2000-01-03 | 00:03:04 | 2038-01-03 08:14:07 | 01:23:00-04 | 2000-06-01 03:23:00-04 | 123456.7890
4 | t | 4.5 | test 4 | test 4 | test 4 | test 4\000\000 | test 4 | test 4 | 2000-01-04 | 00:04:04 | 2038-01-04 08:14:07 | 01:24:00-04 | 2000-06-01 03:24:00-04 | 123456.7890
5 | t | 5.5 | test 5 | test 5 | test 5 | test 5\000\000 | test 5 | test 5 | 2000-01-05 | 00:05:04 | 2038-01-05 08:14:07 | 01:25:00-04 | 2000-06-01 03:25:00-04 | 123456.7890
6 | t | 6.5 | test 6 | test 6 | test 6 | test 6\000\000 | test 6 | test 6 | 2000-01-06 | 00:06:04 | 2038-01-06 08:14:07 | 01:26:00-04 | 2000-06-01 03:26:00-04 | 123456.7890
7 | t | 7.5 | test 7 | test 7 | test 7 | test 7\000\000 | test 7 | test 7 | 2000-01-07 | 00:07:04 | 2038-01-07 08:14:07 | 01:27:00-04 | 2000-06-01 03:27:00-04 | 123456.7890
8 | t | 8.5 | test 8 | test 8 | test 8 | test 8\000\000 | test 8 | test 8 | 2000-01-08 | 00:08:04 | 2038-01-08 08:14:07 | 01:28:00-04 | 2000-06-01 03:28:00-04 | 123456.7890
9 | t | 9.5 | test 9 | test 9 | test 9 | test 9\000\000 | test 9 | test 9 | 2000-01-09 | 00:09:04 | 2038-01-09 08:14:07 | 01:29:00-04 | 2000-06-01 03:29:00-04 | 123456.7890
(10 rows)

i b f v c lv bn vb lvb d t ts tz tsz n
NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
1 1 1.5 test 1 test 1 test 1 test 1\0\0 test 1 test 1 2000-01-01 04:01:00 2038-01-01 03:14:07 1:21:00 June 1, 2000 03:21 EDT 123456.7890
2 1 2.5 test 2 test 2 test 2 test 2\0\0 test 2 test 2 2000-01-02 04:02:00 2038-01-02 03:14:07 1:22:00 June 1, 2000 03:22 EDT 123456.7890
3 1 3.5 test 3 test 3 test 3 test 3\0\0 test 3 test 3 2000-01-03 04:03:00 2038-01-03 03:14:07 1:23:00 June 1, 2000 03:23 EDT 123456.7890
4 1 4.5 test 4 test 4 test 4 test 4\0\0 test 4 test 4 2000-01-04 04:04:00 2038-01-04 03:14:07 1:24:00 June 1, 2000 03:24 EDT 123456.7890
5 1 5.5 test 5 test 5 test 5 test 5\0\0 test 5 test 5 2000-01-05 04:05:00 2038-01-05 03:14:07 1:25:00 June 1, 2000 03:25 EDT 123456.7890
6 1 6.5 test 6 test 6 test 6 test 6\0\0 test 6 test 6 2000-01-06 04:06:00 2038-01-06 03:14:07 1:26:00 June 1, 2000 03:26 EDT 123456.7890
7 1 7.5 test 7 test 7 test 7 test 7\0\0 test 7 test 7 2000-01-07 04:07:00 2038-01-07 03:14:07 1:27:00 June 1, 2000 03:27 EDT 123456.7890
8 1 8.5 test 8 test 8 test 8 test 8\0\0 test 8 test 8 2000-01-08 04:08:00 2038-01-08 03:14:07 1:28:00 June 1, 2000 03:28 EDT 123456.7890
9 1 9.5 test 9 test 9 test 9 test 9\0\0 test 9 test 9 2000-01-09 04:09:00 2038-01-09 03:14:07 1:29:00 June 1, 2000 03:29 EDT 123456.7890
CREATE TABLE
Rows Loaded
-------------
10
(1 row)

i | b | f | v | c | lv | d | t | ts | tz | tsz | n
---+---+-----+--------+--------+--------+------------+----------+---------------------+-------------+------------------------+-------------
| | | | | | | | | | |
| | | | | | | | | | |
1 | t | 1.5 | test 1 | test 1 | test 1 | 2000-01-01 | 00:01:04 | 2038-01-01 08:14:07 | 01:21:00-04 | 2000-06-01 03:21:00-04 | 123456.7890
1 | t | 1.5 | test 1 | test 1 | test 1 | 2000-01-01 | 04:01:00 | 2038-01-01 13:14:07 | 01:21:00-04 | 2000-06-01 03:21:00-04 | 123456.7890
2 | t | 2.5 | test 2 | test 2 | test 2 | 2000-01-02 | 00:02:04 | 2038-01-02 08:14:07 | 01:22:00-04 | 2000-06-01 03:22:00-04 | 123456.7890
2 | t | 2.5 | test 2 | test 2 | test 2 | 2000-01-02 | 04:02:00 | 2038-01-02 13:14:07 | 01:22:00-04 | 2000-06-01 03:22:00-04 | 123456.7890
3 | t | 3.5 | test 3 | test 3 | test 3 | 2000-01-03 | 00:03:04 | 2038-01-03 08:14:07 | 01:23:00-04 | 2000-06-01 03:23:00-04 | 123456.7890
3 | t | 3.5 | test 3 | test 3 | test 3 | 2000-01-03 | 04:03:00 | 2038-01-03 13:14:07 | 01:23:00-04 | 2000-06-01 03:23:00-04 | 123456.7890
4 | t | 4.5 | test 4 | test 4 | test 4 | 2000-01-04 | 00:04:04 | 2038-01-04 08:14:07 | 01:24:00-04 | 2000-06-01 03:24:00-04 | 123456.7890
4 | t | 4.5 | test 4 | test 4 | test 4 | 2000-01-04 | 04:04:00 | 2038-01-04 13:14:07 | 01:24:00-04 | 2000-06-01 03:24:00-04 | 123456.7890
5 | t | 5.5 | test 5 | test 5 | test 5 | 2000-01-05 | 00:05:04 | 2038-01-05 08:14:07 | 01:25:00-04 | 2000-06-01 03:25:00-04 | 123456.7890
5 | t | 5.5 | test 5 | test 5 | test 5 | 2000-01-05 | 04:05:00 | 2038-01-05 13:14:07 | 01:25:00-04 | 2000-06-01 03:25:00-04 | 123456.7890
6 | t | 6.5 | test 6 | test 6 | test 6 | 2000-01-06 | 00:06:04 | 2038-01-06 08:14:07 | 01:26:00-04 | 2000-06-01 03:26:00-04 | 123456.7890
6 | t | 6.5 | test 6 | test 6 | test 6 | 2000-01-06 | 04:06:00 | 2038-01-06 13:14:07 | 01:26:00-04 | 2000-06-01 03:26:00-04 | 123456.7890
7 | t | 7.5 | test 7 | test 7 | test 7 | 2000-01-07 | 00:07:04 | 2038-01-07 08:14:07 | 01:27:00-04 | 2000-06-01 03:27:00-04 | 123456.7890
7 | t | 7.5 | test 7 | test 7 | test 7 | 2000-01-07 | 04:07:00 | 2038-01-07 13:14:07 | 01:27:00-04 | 2000-06-01 03:27:00-04 | 123456.7890
8 | t | 8.5 | test 8 | test 8 | test 8 | 2000-01-08 | 00:08:04 | 2038-01-08 08:14:07 | 01:28:00-04 | 2000-06-01 03:28:00-04 | 123456.7890
8 | t | 8.5 | test 8 | test 8 | test 8 | 2000-01-08 | 04:08:00 | 2038-01-08 13:14:07 | 01:28:00-04 | 2000-06-01 03:28:00-04 | 123456.7890
9 | t | 9.5 | test 9 | test 9 | test 9 | 2000-01-09 | 00:09:04 | 2038-01-09 08:14:07 | 01:29:00-04 | 2000-06-01 03:29:00-04 | 123456.7890
9 | t | 9.5 | test 9 | test 9 | test 9 | 2000-01-09 | 04:09:00 | 2038-01-09 13:14:07 | 01:29:00-04 | 2000-06-01 03:29:00-04 | 123456.7890
(20 rows)

DROP TABLE
vsql:Tests.sql:33: ERROR 0: Error calling setup() in User Function UDParser at [ODBCConnector.cpp:194], error code: 0, message: ODBC Error: SQLDriverConnect() failed with error code IM002, native code 0 [[unixODBC][Driver Manager]Data source name not found, and no default driver specified]
vsql:Tests.sql:34: ERROR 0: Error calling setup() in User Function UDParser at [ODBCConnector.cpp:194], error code: 0, message: ODBC Error: SQLExecDirect() failed with error code 42000, native code 1064 [[MySQL][ODBC 5.1 Driver][mysqld-5.5.24-0ubuntu0.12.04.1]You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'hi!' at line 1]
DROP TABLE
4 changes: 4 additions & 0 deletions odbc_loader_package/ddl/install.sql
@@ -0,0 +1,4 @@
\set libfile '\''`pwd`'/build/ODBCLoader.so\''
CREATE LIBRARY ODBCLoaderLib AS :libfile;
CREATE PARSER ODBCLoader AS LANGUAGE 'C++' NAME 'ODBCLoaderFactory' LIBRARY ODBCLoaderLib;
CREATE SOURCE ODBCSource AS LANGUAGE 'C++' NAME 'ODBCSourceFactory' LIBRARY ODBCLoaderLib;
1 change: 1 addition & 0 deletions odbc_loader_package/ddl/uninstall.sql
@@ -0,0 +1 @@
DROP LIBRARY ODBCLoaderLib CASCADE;
35 changes: 35 additions & 0 deletions odbc_loader_package/examples/sample_usage.sql
@@ -0,0 +1,35 @@
-- Try some stuff out
-- Requires that you have a local MySQL instance with no root password,
-- and ODBC properly installed and configured to connect to it under the "MySQL" DSN.
\! echo "CREATE DATABASE testdb;" | mysql -u root

-- Populate a table in MySQL
-- Create the timezone-based entries as varchar's; MySQL doesn't do timezones at all AFAICT...
-- MySQL Timestamps automagically rewrite 'null' to CURRENT_TIMESTAMP() unless the column in question explicitly allows null values. (Older MySQL's have other nonstandard behavior re: timestamps.)
\! echo "CREATE TABLE test_mysql (i integer, b boolean, f float, v varchar(32), c char(32), bn binary(32), vb varbinary(32), d date, t time, ts timestamp null, tz varchar(80), tsz varchar(80), n numeric(10,4));" | mysql -u root testdb
\! (echo "INSERT INTO test_mysql VALUES (null, null, null, null, null, null, null, null, null, null, null, null, null, null, null);"; for i in `seq 1 9`; do echo "INSERT INTO test_mysql VALUES ($i, 1, $i.5, 'test $i', 'test $i', 'test $i', 'test $i', 'test $i', 'test $i', '2000/1/$i', '4:0$i', '2038-01-0$i 03:14:07', '1:2$i:00', 'June 1, 2000 03:2$i EDT', '123456.7890');"; done) | mysql -u root testdb

-- Create the corresponding table in Vertica
CREATE TABLE test_vertica (i integer, b boolean, f float, v varchar(32), c char(32), bn binary(32), vb varbinary(32), d date, t time, ts timestamp, tz timetz, tsz timestamptz, n numeric(18,4));

-- Copy from MySQL into Vertica
COPY test_vertica WITH SOURCE ODBCSource() PARSER ODBCLoader(connect='DSN=Default', query='SELECT * FROM test_mysql;');

-- Verify that Vertica and MySQL have the same contents
SELECT i,b,f,v,trim(c::varchar) as c,bn::binary(8) as bn,vb,d,t,ts,tz,tsz,n FROM test_vertica ORDER BY i,b,f,v;
\! echo "SELECT i,b,f,v,trim(c) as c,cast(bn as binary(8)) as bn,vb,d,t,ts,tz,tsz,n FROM test_mysql ORDER BY i,b,f,v;" | mysql -u root testdb

-- Now try copying from ourselves into ourselves; see what happens
CREATE TABLE test_vertica_no_bin AS SELECT i,b,f,v,c,d,t,ts,tz,tsz,n FROM test_vertica;
COPY test_vertica_no_bin WITH SOURCE ODBCSource() PARSER ODBCLoader(connect='DSN=VerticaDSN', query='SELECT * FROM test_vertica_no_bin;');
SELECT i,b,f,v,trim(c::varchar) as c,d,t,ts,tz,tsz,n FROM test_vertica_no_bin ORDER BY i,b,f,v;
DROP TABLE test_vertica_no_bin;

-- Try some invalid commands; make sure they error out correctly
COPY test_vertica WITH SOURCE ODBCSource() PARSER ODBCLoader(connect='DSN=InvalidDSN', query='SELECT * FROM test_mysql;');
COPY test_vertica WITH SOURCE ODBCSource() PARSER ODBCLoader(connect='DSN=Default', query='hi!');

-- Clean up after ourselves
DROP TABLE test_vertica;
\! echo "DROP DATABASE testdb;" | mysql -u root

9 changes: 8 additions & 1 deletion shell_load_package/README
Expand Up @@ -62,11 +62,18 @@ The actual execution is accomplished by forking and running: sh -c <command>
Because the command is actually interpreted by the shell, pipes,
redirects, looping and other shell functionality is available.

Example:
Example 1:
Creating a table and populating it once.

create table t (i int);
copy t with source ExternalSource(cmd='seq 1 10');
copy t from 'list_of_numbers.txt.gz' with filter ExternalFilter(cmd='gunzip');

Example 2:
Creating an external table without state (is populated on every query).

create external table t (i int) as copy source ExternalSource(cmd='seq 1 10');

The following environment variables are available from ExternalSource:

- CURRENT_NODE_NAME - name of the current node, e.g. 'v_test_node0001'
Expand Down

0 comments on commit b4c2632

Please sign in to comment.