RDBMS Serde swaps the traditional file based storage used by Apache Hive (https://hive.apache.org/) for database tables.
- Build this project with mvn clean package
- Create the directory /tmp/rdbmsSerde (this is not currently configurable) on your HiveServer2 host and on HDFS and copyt the built Jar to it
- Create an empty file named tmpfile and copy this to /tmp/rdbmsSerde in hdfs (touch tmpfile; hdfs dfs -put tmpfile /tmp/rdbmsSerde;)
- Configure the following properties in Hive:
Name | Value | |
hive.aux.jars.path | /tmp/rdbmsSerde | |
hive.rdbmsSerde.nodes | comma seperated list of rdbms nodes (e.g. 10.17.101.175,10.17.101.176,10.17.101.177) | |
hive.rdbmsSerde.connectionString | The connection string used to connect to the RDBMS instances e.g. jdbc:mysql://[host]:3306/rdbmsSerde. [host] is replaced with one of the hosts above as appropriate | |
hive.rdbmsSerde.userName | The username used to connect to the RDBMS instances | |
hive.rdbmsSerde.password | The password used to connect to the RDBMS instances |
All tables using RDBMS serde should be external and us the following template:
CREATE EXTERNAL TABLE <table name>
(
<column spec e.g. col1
string> )
ROW FORMAT SERDE
'com.threefi.hive.RDBMS.RDBMSSerde’
STORED AS INPUTFORMAT
'com.threefi.hive.RDBMS.RDBMSInputFormat'
OUTPUTFORMAT
'com.threefi.hive.RDBMS.RDBMSOutputFormat'
LOCATION
'/tmp/rdbmsSerde'
The table location should always be /tmp/rdbmsSerde as it is not relevant when reading from RDBMS instances
rdbmsSerde requires tables to be created in the underlying databases that provide storage. A post exec hook is provided to automate this. To configure this add the following property:
Name | Value | |
hive.exec.post.hooks | com.threefi.hive.RDBMS.hooks.CreateTableHook |
Important Note: For the post exec hook to work the create table statement must include the database name.
The following Hive types are supports in RDBMS serde:
- INT
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL
- TIMESTAMP
Currently Select * type queries with no predicates do not work (as they do not launch mapreduce jobs). To query data add a predicate like the following: Select * from [table name] where [column name] <= 1 or [column name] > 1
INSERT INTO statements should work with RDBMSSerde (this is still under development). INSERT OVERWRITE cases do not work.
- Tom Scott
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0