# PyIceberg 🐍 Tabular CDC Automation Guide
Hey, welcome! 

This guide shows how you can automate the creation and configuration of Tabular-managed CDC target (or mirror) tables with PyIceberg.

### Installation:
- clone this repo (just do it, it takes 2 seconds)
- cd into this folder
- install packages:
  - `pip install pyiceberg`
  - Or, if you use pipenv like I do (`pip install pipenv jupyterlab`), you can just run `pipenv install` and it'll handle everything

See, that wasn't so bad.

### Tabular Requirements:
- head over to app.tabular.io and log in (or sign up if you don't already have an account)
- go to connections > security > service account and hit the big + button to create a new credential
- assign your service account credential to a role that has the correct access for what you want to do (if you don't know, `EVERYONE` is a pretty safe default)
  - We will need full read access on your CDC changelog tables
  - We will also need write access to the location you want your mirror tables to end up in
- copy that credential!
- come back here and create a `.env` file in this directory (`guides/pyiceberg_cdc_automation/.env`). Edit it to look like below and make sure to SAVE IT.
```
TABULAR_CREDENTIAL=t-asdf:1234
```
⬆️ replace `t-asdf:1234` with the Tabular credential you just created.
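
If you aren't using pipenv, nothing loads that `.env` file for you. Here is a minimal sketch of a loader using only the standard library. The name `load_env_file` is made up for this guide, and it only handles plain `KEY=VALUE` lines (no multi-line values, no variable expansion), so treat it as a starting point rather than a full dotenv implementation.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and copy them into os.environ."""
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # drop surrounding whitespace and optional quotes around the value
        loaded[key.strip()] = value.strip().strip("'\"")
    # variables already set in your shell win over the file
    for key, value in loaded.items():
        os.environ.setdefault(key, value)
    return loaded
```

Call `load_env_file()` at the top of the notebook, before anything reads `os.environ['TABULAR_CREDENTIAL']`.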

Good job! Now we're ready to get down to business 💪

### Starting Jupyter Lab:
- Seriously, make sure you save that env file. 
- pipenv users can just run `pipenv run jupyter lab` to fire up Jupyter Lab. pipenv will load your credential from the `.env` file for you and all will be well
- if this is scary, you can ignore the `.env` file and just paste your credential in plaintext directly in this notebook -- but you should feel bad about your craftsmanship.


*One last note* -- you definitely don't have the same CDC data I do. Make sure you use your own configs as required, but this should be a really simple starting point for you.

In [1]:
# Establish our connection with Tabular 💪

import os

from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import TableAlreadyExistsError

# You'll need a tabular credential. Member credential or service account will work fine
TABULAR_CREDENTIAL       = os.environ['TABULAR_CREDENTIAL']
TABULAR_TARGET_WAREHOUSE = 'enterprise_data_warehouse' # replace this with your tabular warehouse name
TABULAR_CATALOG_URI      = 'https://api.tabular.io/ws' # unless you're a single tenant user, you don't need to change this

catalog_properties = {
    'uri':        TABULAR_CATALOG_URI,
    'credential': TABULAR_CREDENTIAL,
    'warehouse':  TABULAR_TARGET_WAREHOUSE
}
catalog = load_catalog(**catalog_properties)
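
Before going further, it can help to confirm that the connection works and that your changelog namespace is actually visible to this credential. The sketch below is one way to do that. `namespace_exists` is a hypothetical helper, and it only checks top-level namespaces, relying on `list_namespaces()` returning identifiers as tuples.

```python
def namespace_exists(catalog, namespace):
    """Return True if a top-level `namespace` shows up in the catalog listing."""
    # list_namespaces() returns identifiers as tuples, e.g. ('kafka_connect_raw',)
    wanted = (namespace,)
    return any(tuple(ns) == wanted for ns in catalog.list_namespaces())


# usage, with the `catalog` object from the cell above (needs a live connection):
# namespace_exists(catalog, 'kafka_connect_raw')
```

If this raises instead of returning, the problem is most likely the credential, the warehouse name, or the URI, so it's worth fixing before building any mirrors.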

In [6]:
# set some configs for finding our changelog tables 💪
changelog_db = 'kafka_connect_raw' # this should already exist
changelog_table_postfix = '_changelog' # set this to '' or None if you don't use one of these. But you should really consider it.

# set some configs for placing the mirror tables 💪
mirror_db = 'cdc_mirrors' # this doesn't have to exist. If it does, awesome -- if not we'll create it
mirror_table_postfix = '_mirror' # set this to '' or None if you don't want one of these.
mirror_table_should_expand_key_col_struct = True # set to False if your key column is a scalar

# CDC configs for processing the mirror tables 💪
cdc_properties = { # https://docs.tabular.io/change-data-capture for more details on this ⬇️
    'cdc.type':       'DMS', # don't change this unless you know what you're doing, even if you're not using DMS
    'etl.job-type':   'cdc', # don't change this unless you know what you're doing. Even then, maybe don't change this as of 2024-04-01
    'etl.target-lag': '0',  # target lag for the mirror; '0' as set here, adjust to your needs (see the docs linked above)
    'cdc.key-column-default': '_cdc.key', # can be the ID of the table as well. This _cdc.key works for any debezium changelog data using the kafka connect iceberg sink with tabular's dbz transform
    'cdc.ts-column':  '_cdc.ts',  # the timestamp column of when the change happened
}
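
A quick note on what `mirror_table_should_expand_key_col_struct` does. When it's `True`, the build cell further down walks the fields of the `_cdc.key` struct and lists each one as a dotted path in `cdc.key-column`. Here's a small standalone sketch of that string-building step. `expand_key_columns` and the field names `id` and `tenant_id` are made up for illustration, so your key struct will have its own fields.

```python
def expand_key_columns(key_column, field_names):
    """Build a comma-separated list of dotted paths, one per struct field."""
    return ",".join(f"{key_column}.{name}" for name in field_names)


# hypothetical example: a _cdc.key struct with two fields, `id` and `tenant_id`
print(expand_key_columns("_cdc.key", ["id", "tenant_id"]))
# prints: _cdc.key.id,_cdc.key.tenant_id
```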

In [7]:
# get changelog tables to build mirrors for 💪
changelog_tables = []
for _, tablename in catalog.list_tables(changelog_db):
    if not changelog_table_postfix or tablename.endswith(changelog_table_postfix):
        changelog_tables.append(catalog.load_table(f"{changelog_db}.{tablename}"))
        print(f"Found changelog table: '{changelog_tables[-1].identifier[-1]}'")

Found changelog table: 'dbz_pg_reactions_changelog'
Found changelog table: 'dbz_zodiac_changelog'
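
With the changelog tables found, the postfix configs decide what each mirror will be called. This is a rough sketch of that mapping as a pure function so you can preview the names before creating anything. `mirror_identifier` is a hypothetical helper name, and it strips the changelog postfix only from the end of the table name.

```python
def mirror_identifier(changelog_name, changelog_postfix, mirror_db, mirror_postfix):
    """Map a changelog table name to its '<mirror_db>.<mirror table>' identifier."""
    base = changelog_name
    # only strip the postfix when one is configured and actually present
    if changelog_postfix and base.endswith(changelog_postfix):
        base = base[: -len(changelog_postfix)]
    # a postfix of '' or None simply adds nothing
    return f"{mirror_db}.{base}{mirror_postfix or ''}"


for name in ["dbz_pg_reactions_changelog", "dbz_zodiac_changelog"]:
    print(mirror_identifier(name, "_changelog", "cdc_mirrors", "_mirror"))
# prints:
# cdc_mirrors.dbz_pg_reactions_mirror
# cdc_mirrors.dbz_zodiac_mirror
```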


In [8]:
# Build mirrors 💪

# init mirror db then process a mirror for each changelog, skipping those that may already exist
from pyiceberg.exceptions import NamespaceAlreadyExistsError

try:
    catalog.create_namespace(mirror_db)
    print(f"Successfully created mirror namespace '{mirror_db}'. 🪐\n")
except NamespaceAlreadyExistsError:
    print(f"namespace '{mirror_db}' already exists, moving on 💪\n") # any other error (auth, network) now surfaces instead of being swallowed


# process the mirror tables
for changelog_table in changelog_tables:
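    # strip the changelog postfix only when one is set and present, then add the mirror postfix (if any)
    base_table_name = changelog_table.identifier[-1]
    if changelog_table_postfix and base_table_name.endswith(changelog_table_postfix):
        base_table_name = base_table_name[:-len(changelog_table_postfix)]
    mirror_table_name = f"{mirror_db}.{base_table_name}{mirror_table_postfix or ''}"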
    try:
        # create the table with the same schema as the changelog
        mirror_table = catalog.create_table(
            identifier=mirror_table_name,
            schema=changelog_table.schema()
        )

        # get the unique key(s) for this change log:
        if mirror_table_should_expand_key_col_struct:
            # get to choppin up that _cdc.key struct
            cdc_key_columns = []
            for field in changelog_table.schema().find_field(cdc_properties['cdc.key-column-default']).field_type.fields:
                cdc_key_columns.append(cdc_properties['cdc.key-column-default'] + '.' + field.name)

            cdc_properties['cdc.key-column'] = ','.join(cdc_key_columns)
        else:
            cdc_properties['cdc.key-column'] = cdc_properties['cdc.key-column-default']

        # set the cdc table props for this new mirror
        with mirror_table.transaction() as transaction:
            transaction.set_properties(**cdc_properties)

        # update the changelog table so it triggers CDC processing against this mirror
        changelog_cdc_property = { # https://docs.tabular.io/change-data-capture for more details
            'dependent-tables': mirror_table_name
        }
        with changelog_table.transaction() as transaction:
            transaction.set_properties(**changelog_cdc_property)

        print(f"\tSuccessfully created and configured cdc mirror '{mirror_table_name}'🪐")
        
        
    except TableAlreadyExistsError:
        print(f"Mirror table already exists for table '{mirror_table_name}'. Taking no action here and moving on to the next table ⚡")

print('\nYour CDC changes will begin to flow AFTER new change records hit your changelog tables')

Successfully created mirror namespace 'cdc_mirrors'. 🪐

	Successfully created and configured cdc mirror 'cdc_mirrors.dbz_pg_reactions_mirror'🪐
	Successfully created and configured cdc mirror 'cdc_mirrors.dbz_zodiac_mirror'🪐

Your CDC changes will begin to flow AFTER new change records hit your changelog tables
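
If you want to double check that the properties actually landed, you can reload a mirror and compare its properties against what you meant to set. Below is a minimal sketch of that comparison. `missing_properties` is a hypothetical helper, and it assumes the service stores the values exactly as you sent them, which I haven't verified for every property.

```python
def missing_properties(actual, expected):
    """Return the expected entries that are absent or different in `actual`."""
    return {key: value for key, value in expected.items() if actual.get(key) != value}


# usage with the objects from the cells above (needs a live catalog):
# mirror = catalog.load_table('cdc_mirrors.dbz_zodiac_mirror')
# print(missing_properties(mirror.properties, cdc_properties))  # {} means everything matched
```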
