Skip to content

Latest commit

 

History

History
330 lines (269 loc) · 12.4 KB

README.md

File metadata and controls

330 lines (269 loc) · 12.4 KB

DBLINK()

DBLINK() is a Vertica User Defined Transform Function coded in C++ to run SQL against other databases.

For example, the following statement runs a row count in PostgreSQL and retrieves the result (6,001,215) in Vertica:

=> SELECT DBLINK(USING PARAMETERS
    cid='pgdb',
    query='SELECT COUNT(*) FROM tpch.lineitem'
) OVER();
count
--------
6,001,215
--- 1 row selected in 0.228s (prep 0.17s, exec 0.211s, fetch 0.000/0.000s)

Usage

DBLINK() is a Vertica function that pushes SQL to other databases and retrieves the result of the remote execution back in Vertica. DBLINK() can push any type of SQL commands that the remote database accepts:

  • DDL statements. For example, CREATE a table in the remote database.
  • DML statements to manipulate data in the remote database (INSERT, UPDATE, etc.).
  • DQL statements to SELECT data from the remote database using the SQL dialect and functions available on the remote database.
  • DCL statements like GRANT and REVOKE.

Examples

The following statement creates a table in Vertica named public.customer that contains 10% of randomly selected data from the PostreSQL table tpch.customer:

=> CREATE TABLE public.customer AS
    SELECT DBLINK(USING PARAMETERS
        cid='pgdb',
        query='SELECT * FROM tpch.customer WHERE RANDOM() < 0.1')
OVER();

This statement creates an empty table in Vertica corresponding to the table definition in the remote database:

=> CREATE TABLE public.customer AS
    SELECT DBLINK(USING PARAMETERS
        cid='pgdb',
        query='SELECT * FROM tpch.customer LIMIT 0')
OVER();

This statement will group-by the result of a JOIN between the Vertica table tpch.nation and the MySQL table tpch.region:

=> SELECT r.r_name, count(*)
FROM tpch.nation n
    LEFT OUTER JOIN
        ( SELECT DBLINK(USING PARAMETERS
            cid='mypg’,
            query='SELECT r_name, r_regionkey FROM tpch.region'
            ) OVER()) r
    ON n.n_regionkey = r.r_regionkey
GROUP BY 1 ;

This statement drops a PostgreSQL table if exists:

=> SELECT DBLINK(USING PARAMETERS
	cid='pgdb',
	query='DROP TABLE IF EXISTS public.t1') OVER();

Sometimes the SQL that you want to push to the remote database is quite complex. In these cases, you might find useful to write the SQL in a file using your preferred editor, and then pass the file containing the SQL text to DBLINK() using the following syntax:

=> SELECT DBLINK(USING PARAMETERS
	cid='mysql',
	query='@/tmp/myscript.sql') OVER()";

Installation

You can install DBLINK() from the latest released binaries without needing to clone this repository. Or, you can clone the repository and build from source. Whichever way you chose, ODBC has to be configured on all nodes in order to use DBLINK()..

Prerequisites

DBLINK() uses ODBC to interact with the remote databases. You must install and configure the following on all nodes in your cluster:

Installing pre-built binaries

  1. Download the appropriate build of ldblink.so from the latest release. Example: ldblink.so.centos-v23.3.0 -> binary for vertica v23.3.0 on CentOS
  2. Rename the binary to ldblink.so
  3. Copy ldblink.so to the initiator node.
  4. Execute this SQL substituting the full path of ldblink.so on the initiator node:
	    CREATE OR REPLACE LIBRARY DBLink AS '/full/path/to/ldblink.so' LANGUAGE 'C++';
	    CREATE OR REPLACE TRANSFORM FUNCTION dblink AS LANGUAGE 'C++' NAME 'DBLinkFactory' LIBRARY DBLink ;
            GRANT EXECUTE ON TRANSFORM FUNCTION dblink() TO PUBLIC ;
  1. Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

  2. (optional) Delete ldblink.so from initiator node.

Build DBLINK() From Source

Before you run make commands, review the Makefile and make any necessary changes.

On your own build environment

  • First you need to Setup a C++ Development Environment
  • Then to build using the installed Vertica SDK and devtoolset (centos): just run

    make

  • Install the library in Vertica (as dbadmin):

    make install

  • Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

Without setting up a build environment

This is mainly for testing and release, or when you want to build dblink for a specific vertica version on centos/ubuntu but you do not have the tools to do it. It uses the verticasdk docker image owned by Vertica to build dblink for any specified vertica version.

  1. Compile the DBLINK source code with for the appropriate Vertica version and Linux distribution.

    • To build without needing to set up a build environment, specify the version and target OS. For example:
      $ make -C docker-dblink VERTICA_VERSION=12.0.2 OSTAG=ubuntu
      
      it will create ldblink.s0.ubuntu-v12.0.2 in the docker-dblink directory
  2. Install the library in Vertica (as dbadmin): Follow Installing pre-built libraries from step 2.

  3. Create a Connection Identifier Database (a simple text file) under /usr/local/etc/dblink.cids. You can use a different location by changing the DBLINK_CIDS define in the source code. For details, see Configure DBLINK().

Uninstall DBLINK()

You can uninstall the library with DROP LIBRARY DBLink in vsql or by running make clean.

Configure DBLINK()

DBLINK() requires two parameters and accepts one optional parameter with the following syntax:

DBLINK(USING PARAMETERS cid=value, query=value[, rowset=value]);

Parameters

Name Required Description
cid No Connection Identifier Database. Identifies an entry in the connection identifier database.
connect_secret No The ODBC connection string containing the DSN and credentials.
query Yes The query being pushed on the remote database. If the first character of this parameter is @, the rest is interpreted as the name of the file containing the query.
rowset No Number of rows retrieved from the remote database during each SQLFetch() cycle. Default is 100.

For example, the following query retrieves data from the remote database 500 rows at a time:

=> SELECT DBLINK(USING PARAMETERS
    cid='pgdb',
    query='SELECT c_custkey, c_nationkey, c_phone FROM tpch.customer ORDER BY 1',
    rowset=500) OVER();
 c_custkey | c_nationkey |     c_phone
-----------+-------------+-----------------
         1 |          15 | 25-989-741-2988
         2 |          13 | 23-768-687-3665
         3 |           1 | 11-719-748-3364
         4 |           4 | 14-128-190-5944
         5 |           3 | 13-750-942-6364
         6 |          20 | 30-114-968-4951
         7 |          18 | 28-190-982-9759
...

Connection parameters

Connection Identifier Database

One way to specify the connection parameters is to use a Connection Identifier Database -- a simple text file containing the codes used with cid. The cid file must exist in the same location on all vertica nodes. For example:

$ cat /usr/local/etc/dblink.cids
# Vertica DBLINK Configuration File
#
# Connection IDs lines have the following format:
#    <mnemonic code>:<ODBC configuration>
# and are terminated by a SINGLE '\n' (ASCII dec 10, ASCII hex 0x0a)
# Be aware of this! Windows editors might end lines with \r\n. In
# this case the Carriage Return is considered part of the ODBC config
# and can cause undefined ODBC Driver Behavior.
#
# Lines starting with '#' are considered comments
#
# Sample configuration:

pgdb:UID=mauro;PWD=xxx;DSN=pmf
myver:UID=mauro;PWD=xxx;DSN=vmf
mysql:USER=mauro;PASSWORD=xxx;DSN=mmf

Then use the cid parameter to pick our connection:

SELECT DBLINK(USING PARAMETERS cid='myconnecction', query=...) ...

DBLINK Parameters

Another methods you can use to specify the connection parameters is to use connect_secret. This way you don't have to create a the dblink.cids database however defining the connection parameters in the command line is not safe before Vertica 12.0.4. All queries are recorded under v_monitor.query_requests and in the log file, and that can expose passwords in the connect_secret parameter.

SELECT DBLINK(USING PARAMETERS connect_secret='UID=mauro;PWD=secret;DSN=pmf', query=...) ...

Session Parameters

Lastly, a final method is to set a UDPARAMETER in the session after connecting to Vertica. The value for this SESSION PARAMETER won't be recorded in query_requests.

ALTER SESSION SET UDPARAMETER FOR ldblink dblink_secret = 'UID=mauro;PWD=secret;...' ;
SELECT DBLINK(USING PARAMETERS query='my first query') ...
SELECT DBLINK(USING PARAMETERS query='my second query') ...
SELECT DBLINK(USING PARAMETERS query='my third query') ...

Configure the ODBC Layer

You must install and configure the ODBC layer on all nodes for each database that you want to connect to.

The ODBC configuration depends on the specific ODBC Driver Manager and Database ODBC Drivers. The following is an example configuration file that configures unixODBC and the the PostgreSQL/MySQL drivers:

$ cat /etc/odbc.ini
[ODBC Data Sources]
PSQLODBC  = PostgreSQL ODBC
MYODBC  = MySQL ODBC

[pmf]
Description  = PostgreSQL mftest
Driver = PSQLODBC
Trace  = No
TraceFile  = sql.log
Database = pmf
Servername = mftest
UserName =
Password =
Port = 5432
SSLmode  = allow
ReadOnly = 0
Protocol = 7.4-1
FakeOidIndex = 0
ShowOidColumn  = 0
RowVersioning  = 0
ShowSystemTables = 0
ConnSettings =
Fetch  = 1000
Socket = 4096
UnknownSizes = 0
MaxVarcharSize = 1024
MaxLongVarcharSize = 8190
Debug  = 0
CommLog  = 0
Optimizer  = 0
Ksqo = 0
UseDeclareFetch  = 0
TextAsLongVarchar  = 1
UnknownsAsLongVarchar  = 0
BoolsAsChar  = 1
Parse  = 0
CancelAsFreeStmt = 0
ExtraSysTablePrefixes  = dd_
LFConversion = 0
UpdatableCursors = 0
DisallowPremature  = 0
TrueIsMinus1 = 0
BI = 0
ByteaAsLongVarBinary = 0
LowerCaseIdentifier  = 0
GssAuthUseGSS  = 0
XaOpt  = 1
UseServerSidePrepare = 0

[mmf]
Description  = MySQL mftest
Driver = MYODBC
SERVER = mftest
PORT = 3306
SQL-MODE = 'ANSI_QUOTES'

$ cat /etc/odbcinst.ini
[ODBC]
Trace=off
Tracefile=/tmp/uodbc.trc

[PSQLODBC]
Description=PostgreSQL ODBC Driver
Driver64=/usr/lib64/psqlodbcw.so
UsageCount=1

[MYODBC]
Driver=/usr/lib64/libmyodbc8w.so
UsageCount=1

[MySQL ODBC 8.0 ANSI Driver]
Driver=/usr/lib64/libmyodbc8a.so
UsageCount=1

Report an issue

To ask a question, start a discussion. To report an issue, open up an issue and provide following information:

  • The command that you ran and the associated output as shown on your screen by using the standard Vertica SQL client vsql.
  • Vertica version: SELECT VERSION();.
  • DBLINK library metadata by running statement the following as dbadmin:
    => SELECT * FROM USER_LIBRARIES WHERE lib_name = 'ldblink';
    
  • Attach the following ODBC configuration files:
    • odbc.ini (please remove passwords or other confidential information)
    • odbcinst.ini
  • ODBC Driver Manager version and configuration. For example, with unixODBC, the output of the command odbcinst -j.
  • ODBC traces obtained while running the command (see 1.). To enable the ODBC traces you have to set Trace = on in odbcinst.ini. Do not forget to switch ODBC tracing off at the end.