Skip to content
This repository has been archived by the owner on Oct 31, 2019. It is now read-only.

Connector Configuration

Siva Narayanan edited this page Jul 3, 2015 · 4 revisions

You will need to add a Kinesis catalog to Presto by adding this line to ${PRESTO_HOME}/etc/config.properties.

datasources=jmx,hive,kinesis

Next, to configure the Kinesis Connector, create a catalog properties file etc/catalog/kinesis.properties with the following contents, replacing the properties as appropriate.

connector.name=kinesis
kinesis.table-names=table1,table2
kinesis.access-key=<amazon-access-key> (optional)
kinesis.secret-key=<amazon-secret-key> (optional)

The access and secret keys are optional. If they are not provided, the default credentials chain will be used to access the kinesis streams. The following configuration properties are available :

Property Name Description
kinesis.table-names List of all the tables provided by the catalog
kinesis.default-schema Default schema name for tables
kinesis.table-description-dir Directory containing table description files
kinesis.access-key Access key to aws account
kinesis.secret-key Secret key to aws account
kinesis.hide-internal-columns Controls whether internal columns are part of the table schema or not
kinesis.aws-region Aws region to be used to read kinesis stream from
kinesis.batch-size Maximum number of records to return. Maximum Limit 10000
kinesis.fetch-attempts Attempts to be made to fetch data from kinesis streams
kinesis.sleep-time Time till which thread sleep waiting to make next attempt to fetch data

kinesis.table-names

Comma-separated list of all tables provided by this catalog. A table name can be unqualified (simple name) and will be put into the default schema (see below) or qualified with a scheme name (<schema-name>.<table-name>).

For each table defined here, a table description file (see below) may exist. If no table description file exists, the table name is used as the stream name on Kinesis and no data columns are mapped into the table. The table will contain all internal columns (see below).

This property is required; there is no defualt and at least one table must be defined.

kinesis.default-schema

Defines the schema which will contain all tables that were defined without a qualifying schema name.

This property is optional; the default is default.

kinesis.table-description-dir

References a folder within Presto deployment that holds one or more JSON files (must end wiht .json) which contain table description files.

This property is optional; the default is etc/kinesis.

kinesis.access-key

Defines the access key ID for AWS root account or IAM roles, which is used to sign programmatic requests to AWS Kinesis.

This property is optional; if not defined, connector will try to follow Defualt- Credential-Provider-Chain provided by aws in the following order -

  • Environment Variable: Load credentials from environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  • Java System Variable: Load from java system as aws.accessKeyId and aws.secretKey
  • Profile Credentials File: Load from file typically located at ~/.aws/credentials
  • Instance profile credentials: These credentials can be used on EC2 instances, and are delivered through the Amazon EC2 metadata service.

kinesis.secret-key

Defines the secret key for AWS root account or IAM roles, which together with Access Key ID, is used to sign programmatic requests to AWS Kinesis.

This property is optional; if not defined, connector will try to follow Defualt- Credential-Provider-Chain same as above.

kinesis.aws-region

Defines AWS Kinesis regional endpoint. Selecting appropriate region may reduce latency in fetching data.

This field is optional; The default region is us-east-1 referring to end point 'kinesis.us-east-1.amazonaws.com'.

Amazon Kinesis Regions

For each Amazon Kinesis account, following availabe regions can be used:

Region name Region Endpoint
us-east-1 US East (N. Virginia) kinesis.us-east-1.amazonaws.com
us-west-1 US West (N. California) kinesis.us-west-1.amazonaws.com
us-west-2 US West (Oregon) kinesis.us-west-2.amazonaws.com
eu-west-1 EU (Ireland) kinesis.eu-west-1.amazonaws.com
eu-central-1 EU (Frankfurt) kinesis.eu-central-1.amazonaws.com
ap-southeast-1 Asia Pacific (Singapore) kinesis.ap-southeast-1.amazonaws.com
ap-southeast-2 Asia Pacific (Sydney) kinesis.ap-southeast-2.amazonaws.com
ap-northeast-1 Asia Pacific (Tokyo) kinesis.ap-northeast-1.amazonaws.com

kinesis.batch-size

Defines maximum number of records to return in one request to Kinesis Streams. Maximum Limit is 10000 records. If a value greater than 10000 is specified, will throw InvalidArgumentException.

This field is optional; the default value is 10000.

kinesis.fetch-attempts

Defines number of failed attempts to be made to fetch data from Kinesis Streams. If first attempt returns non empty records, then no further attempts are made.

It has been found that sometimes GetRecordResult returns empty records, when shard is not empty. That is why multiple attempts need to be made.

This field is optional; the default value is 3.

kinesis.sleep-time

Defines the milliseconds for which thread needs to sleep between get-record-attempts made to fetch data. The quantity should be followed by 'ms' string.

This field is optional; the default value is 1000ms.

kinesis.hide-internal-columns

In addition to the data columns defined in a table description file, the connector maintains a number of additional columns for each table. If these columns are hidden, they can still be used in queries but do not show up in DESCRIBE <table-name> or SELECT *.

This property is optional; the default is true.

Internal Columns

For each defined table, the connector maintains the following columns:

Column name Type Description
_shard_id VARCHAR ID of the Kinesis stream shard which contains this row
_shard_sequence_id VARCHAR Sequence id within the Kinesis shard for this row
_segment_start VARCHAR Lowest sequence id in the segment (inclusive) which contains this row. This sequence id is shrard specific
_segment_end VARCHAR Highest sequence id in the segment (exclusive) which contains this row. The sequence id is shard specific. If stream is open, then this is not defined
_segment_count BIGINT Running count of for the current row within the segment
_message_valid BOOLEAN True if the decoder could decode the message successfully for this row. When false, data columns mapped f
rom the message should be treated as invalid
_message VARCHAR Message bytes as an UTF-8 encoded string
_message_length BIGINT Number of bytes in the message
_partition_key VARCHAR Partition Key bytes as an UTF-8 encoded string

For tables without a table definition file, the _message_valid column will always be true.