forked from hevoio/trino-hive

A sample docker setup for Trino (Presto) with Hive & Iceberg connectors


tj---/trino-hive

 
 


Trino (Presto) with Hive & Iceberg connectors

The instructions here describe how to run Trino (Presto) with data stored in GCS. This setup has been adapted from this repository: Hive connector over MinIO file storage. Tables can be created in two flavors:

  • Hive
  • Iceberg

Stack

  • Trino (Presto)
  • Hive GCS connector
  • Iceberg connector
  • Hive Metastore with MariaDB persistence

Steps

  • Download the GCS connector and place the jar file in the lib directory.
  • Obtain a GCP service-account credentials JSON, with the service account having admin permissions on GCS. Place this file in the creds directory.
  • Set the GCP project ID against fs.gs.project.id in metastore-site.xml.
  • Run docker-compose up -d.
  • It should start instances of the database, the metastore, and Trino:
Creating network "trino-hive_trino-network" with driver "bridge"
Creating trino-hive_mariadb_1           ... done
Creating trino-hive_trino-coordinator_1 ... done
Creating trino-hive_hive-metastore_1    ... done
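
For reference, the fs.gs.project.id setting from the steps above would look roughly like this inside metastore-site.xml (the project ID below is a placeholder; substitute your own):

```xml
<property>
  <name>fs.gs.project.id</name>
  <!-- Replace with your GCP project ID -->
  <value>my-gcp-project</value>
</property>
```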

Access Trino

docker container exec -it trino-hive_trino-coordinator_1 trino

# The following prompt should appear
trino>

Create Entities

View Catalogs

trino> show catalogs;
 Catalog
---------
 hive
 iceberg
 system
 tpcds
 tpch
(5 rows)

1. HIVE

Create a Schema in the Hive Catalog

trino> CREATE SCHEMA hive.hive_gcs WITH (location = 'gs://bucket-test-tj-1/');
CREATE SCHEMA

Create some data in GCS. Ensure that partition directories like partition_name=foo are present in the structure and that each file has 5 columns with headers col_1, col_2, col_3, col_4, col_5.
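
Assuming the external location gs://bucket-test-tj-1/ping used in the table definition below, the expected layout might look like this (directory values and file name are illustrative):

```
gs://bucket-test-tj-1/ping/
  xing=bar/
    fing=io/
      data.csv   # header line: col_1,col_2,col_3,col_4,col_5
```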

Create a partitioned Hive Table

USE hive.hive_gcs;

CREATE TABLE sample_table1 (
  col_1 varchar, col_2 varchar, col_3 varchar, col_4 varchar, col_5 varchar,
  xing varchar, fing varchar
)
WITH (
  format = 'CSV',
  partitioned_by = ARRAY['xing','fing'],
  external_location = 'gs://bucket-test-tj-1/ping',
  skip_header_line_count = 1
);

Update partitions

CALL system.sync_partition_metadata(schema_name=>'hive_gcs', table_name=>'sample_table1', mode=>'FULL');

Read Data


trino:hive_gcs> select * from sample_table1 where xing = 'bar' and fing = 'io';

 col_1 | col_2 | col_3 | col_4 | col_5 | xing | fing
-------+-------+-------+-------+-------+------+------
 91    | 92    | 93    | 94    | 95    | bar  | io
 81    | 82    | 83    | 84    | 85    | bar  | io

2. ICEBERG

Create a Schema in the Iceberg Catalog

trino> CREATE SCHEMA iceberg.iceberg_gcs WITH (location = 'gs://bucket-test-tj-1/');
CREATE SCHEMA

Create a partitioned Iceberg Table

USE iceberg.iceberg_gcs;

CREATE TABLE sample_table (
  id bigint, name varchar, known varchar, country varchar, fact varchar
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['country', 'fact']
);

Insert some data

INSERT INTO sample_table VALUES
  (25, 'James Bond', 'Agent', 'American', 'NYC'),
  (40, 'George Bush', 'President', 'American', 'Texas'),
  (82, 'A1', 'Football Player', 'Austria', 'Vienna')
;

A sample mechanism for writing to an Iceberg table is present in this repository

Read Data


trino:iceberg_gcs> select * from sample_table where country = 'American';

 id |    name     |   known   | country  | fact
----+-------------+-----------+----------+-------
 40 | George Bush | President | American | Texas
 25 | James Bond  | Agent     | American | NYC

Sample code to read from the tables

Maven dependency

<dependency>
    <groupId>io.trino</groupId>
    <artifactId>trino-jdbc</artifactId>
    <version>382</version>
</dependency>

Java Code

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoReader {
    public static void main(String[] args) throws Exception {
        // JDBC URL format: jdbc:trino://<Trino_Coordinator>/Catalog/Schema
        try (Connection conn = DriverManager.getConnection(
                "jdbc:trino://localhost:9080/hive/hive_gcs?user=anything");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table1")) {
            while (rs.next()) {
                String col1 = rs.getString("col_1");
                String col2 = rs.getString("col_2");
                System.out.printf("col_1=%s and col_2=%s%n", col1, col2);
            }
        }
    }
}
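
The same driver can read the Iceberg table by switching the catalog and schema segments of the JDBC URL. A minimal sketch of the URL format, assuming the coordinator host and port from the docker setup above (the jdbcUrl helper is hypothetical, added here only to illustrate the pattern):

```java
class TrinoUrls {
    // Hypothetical helper: assembles a Trino JDBC URL from its parts.
    static String jdbcUrl(String host, int port, String catalog, String schema) {
        return String.format("jdbc:trino://%s:%d/%s/%s", host, port, catalog, schema);
    }

    public static void main(String[] args) {
        // URL targeting the Iceberg schema created earlier
        System.out.println(jdbcUrl("localhost", 9080, "iceberg", "iceberg_gcs"));
        // prints jdbc:trino://localhost:9080/iceberg/iceberg_gcs
    }
}
```

Passing this URL to DriverManager.getConnection (with a user parameter appended) lets the earlier read loop run unchanged against sample_table.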

References

MinIO

MinIO is a multi-cloud object storage abstraction for any cloud. It has S3-like semantics and is compatible with most clouds. Hence, it can be a good alternative for object storage when a rich abstraction is a necessity. A sample experiment.
