dbt-athena2


This is an adapter derived from this repo to better serve our custom needs. It supports the additional capabilities below:

  • Runs on dbt-core version 1.1.x
  • Supports a boto3 session that takes credentials from an AWS profile name
  • Supports on_schema_change for newly added columns
  • Adds an S3 bucket for storing data instead of writing to a random location under the staging dir

Quick start

Within your Python environment, proceed with the steps below to initiate a first project. There will be some prompts during the initial steps; refer to the Configuring your profile section below to set it up properly.

pip install dbt-athena2
dbt init my_dbt_project
cd my_dbt_project
export DBT_PROFILES_DIR=`pwd`
dbt debug # to test connection
dbt seed # to dump your seed data
dbt run # to run all models
dbt run --select model_name # to run specific model

#...and more...

Basic usage

Model configuration

Below is an example of how a model can be configured.

  • There are 4 supported materialization modes: view, table, incremental and ephemeral. Details here.
  • incremental_strategy supports insert_overwrite and append. If a partition is specified, it only overwrites the partitions present in the source data.
  • on_schema_change supports fail, ignore and append_new_columns only, and only for the incremental materialization. Details here. These settings can also be applied project-wide; see the sketch after the example below.
  • There are some useful macros, such as run_started_at, that can be referenced from here to enhance the flexibility of the model.
{{ config(
    materialized="incremental",
    partitioned_by=["report_month"],
    incremental_strategy="insert_overwrite",
    on_schema_change="append_new_columns"
) }}

{% set run_date = run_started_at.astimezone(modules.pytz.timezone("Asia/Saigon")).strftime("%Y-%m-%d") %}

select
    cast(working_day as timestamp) as working_day,
    sum(spend_balance) as spend_balance,
    sum(td_balance) as td_balance,
    sum(gs_balance) as gs_balance,
    cast(date_format(date_parse('{{ run_date }}', '%Y-%m-%d') - interval '1' month, '%Y%m') as int) as report_month
from {{ source('analytics', 'eod_balance') }}
where cast(working_day as date) >= date_trunc('month', cast(date_parse('{{ run_date }}', '%Y-%m-%d') as date) - interval '2' month)
  and cast(working_day as date) < date_trunc('month', cast(date_parse('{{ run_date }}', '%Y-%m-%d') as date) - interval '1' month)
group by working_day
order by working_day desc
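
The incremental settings above can also be applied project-wide in dbt_project.yml rather than per model. A minimal sketch, assuming the my_dbt_project name from the quick start (the settings mirror the config block above):

models:
  my_dbt_project:
    +materialized: incremental
    +incremental_strategy: insert_overwrite
    +on_schema_change: append_new_columns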

Seed

Under the seeds folder, place the CSV seed file (e.g. c_ecom_rate.csv) and the YAML config (e.g. c_ecom_rate.yml) as in the example below. Then run dbt seed.

version: 2

seeds:
  - name: c_ecom_rate
    config:
      enabled: true
      quote_columns: true
      tags:
        - accounting
        - report
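
If Athena infers an unwanted type for a seed column (see the date-type note under Further notes below), the seed config also accepts column_types overrides. A minimal sketch with hypothetical column names:

version: 2

seeds:
  - name: c_ecom_rate
    config:
      column_types:
        rate: double              # hypothetical column
        effective_date: varchar   # keep dates as string/timestamp rather than date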

Further notes

  • If work_group is specified in profiles.yml without s3_staging_dir, the adapter uses the default S3 output location attached to that workgroup, provided Override client-side settings is enabled on it.

  • The boto3 session inherits credentials from the development environment; once deployed, it should obtain permissions via a role such as an EC2 instance profile or a K8s service account role.

  • Athena limits ALTER TABLE ADD COLUMNS for the date data type; it is recommended to cast such columns to timestamp or string when implementing the model. Details here.

  • Athena does not accept comments like /* ... */. To suppress these auto-generated comments from dbt, set query-comment: null in the dbt_project.yml file, as in the sketch below.
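
A minimal dbt_project.yml fragment disabling the auto-generated query comments (project name assumed from the quick start):

name: my_dbt_project
version: "1.0.0"
profile: athena

# Athena rejects the /* ... */ header dbt would otherwise prepend to each query
query-comment: null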

Configuring your profile

A dbt profile can be configured to run against AWS Athena using the following configuration:

Option            Description                                                      Required?  Example
s3_staging_dir    S3 location to store Athena query results and metadata          Required   s3://athena-output-bucket/data_services/
region_name       AWS region of your Athena instance                               Required   ap-southeast-1
schema            Schema (Athena database) to build models into (lowercase only)   Required   dbt
database          Database (Data catalog) to build models into (lowercase only)    Required   awsdatacatalog
poll_interval     Interval in seconds for polling Athena query status              Optional   5
aws_profile_name  Profile to use from your AWS shared credentials file             Optional   my-profile
work_group        Identifier of the Athena workgroup                               Optional   my-custom-workgroup
num_retries       Number of times to retry a failing query                         Optional   3
s3_data_dir       Prefix for storing tables, if different from s3_staging_dir      Optional   s3://athena-data-bucket/{schema_name}/{table_name}/

Example profiles.yml entry:

athena:
  target: dev
  outputs:
    dev:
      database: awsdatacatalog
      region_name: ap-southeast-1
      aws_profile_name: dl-dev-process
      s3_staging_dir: s3://athena-output-bucket/data_services/
      s3_data_dir: s3://athena-data-bucket/{schema_name}/{table_name}/
      schema: accounting
      type: athena

Additional information

  • threads is supported (see the sketch below)
  • database and catalog can be used interchangeably
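
For example, threads can be added to the profile entry shown earlier (the value 4 is illustrative):

athena:
  target: dev
  outputs:
    dev:
      type: athena
      threads: 4  # number of models dbt may run concurrently
      schema: accounting
      database: awsdatacatalog
      region_name: ap-southeast-1
      s3_staging_dir: s3://athena-output-bucket/data_services/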

Running tests

First, install the adapter and its dependencies using make (see Makefile):

make install_deps

Next, configure the environment variables in dev.env to match your Athena development environment. Finally, run the tests using make:

make run_tests

References