# Automating Boilerplate

Setting up a `dbt` project from scratch often involves writing a lot of boilerplate, from configuring the project to bringing in the sources and creating staging models. While there are tools that semi-automate this process, a lot of manual heavy lifting is still required. In this notebook, I explore ways to automate this flow based on a highly opinionated way of organizing staging models. I will turn this into a Python package once I have settled on the API.

## Initialize Project

```bash
dbt init dbt-greenery --adapter postgres
sed -i 's/my_project_name/dbt_greenery/g' dbt-greenery/dbt_project.yml 
sed -i 's/default/greenery/g' dbt-greenery/dbt_project.yml 
```

## Identify Sources

The next step is to identify the sources on which to build the data models. Candidate sources can be found by listing the schemas in the database connection configured in `~/.dbt/profiles.yml`.

In [1]:
%load_ext sql
%sql postgresql://corise:corise@localhost:5432/dbt
%config SqlMagic.displaylimit=5
%config SqlMagic.displaycon = False

In [5]:
!psql -U postgres -c 'SELECT nspname AS schema FROM pg_catalog.pg_namespace;'


       schema       
--------------------
 pg_toast
 pg_temp_1
 pg_toast_temp_1
 pg_catalog
 public
 information_schema
(6 rows)



Before we go on to the next step, let us import some useful Python packages and write some handy utility functions that will let us run `dbt` command-line operations from the notebook.

In [24]:
import subprocess
import yaml
import json
from pathlib import Path

In [1]:
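def dbt_run_operation(operation, **kwargs):
    # Pass the keyword arguments to the macro as a JSON string via --args.
    args_json = json.dumps(kwargs)
    # `tail -n +2` drops the first line of the command's output.
    cmd = f"dbt run-operation {operation} --args '{args_json}' | tail -n +2"
    out = subprocess.getoutput(cmd)
    return out

def write_as_yaml(x, file=None):
    # Print the YAML when no file is given; otherwise write it to `file`.
    x_yaml = yaml.dump(x, sort_keys=False)
    if file is None:
        print(x_yaml)
    else:
        Path(file).write_text(x_yaml)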
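
Building the command as a single shell string makes the JSON arguments sensitive to quoting (a single quote inside a value would break the command). As a minimal sketch, assuming the same `dbt run-operation ... --args` invocation, the command can instead be passed as an argument list so no shell is involved. The names `build_run_operation_cmd` and `dbt_run_operation_checked` are hypothetical and not part of the notebook's API.

```python
import json
import subprocess


def build_run_operation_cmd(operation, **kwargs):
    """Build the argv list for `dbt run-operation` (hypothetical helper)."""
    return ["dbt", "run-operation", operation, "--args", json.dumps(kwargs)]


def dbt_run_operation_checked(operation, **kwargs):
    """Run the macro without a shell and raise on a non-zero exit code."""
    cmd = build_run_operation_cmd(operation, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Mirror `tail -n +2`: drop the first line of the captured stdout.
    return "\n".join(result.stdout.splitlines()[1:])
```

Unlike `subprocess.getoutput`, this variant captures only stdout and raises `CalledProcessError` when `dbt` fails, which may or may not be what you want in an exploratory notebook.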

## Generate Source

The next step in modeling with `dbt` is to identify the sources that need to be modeled. The `generate_source` macro, run through `dbt run-operation`, queries a database schema and lists the tables in it. The `dbt_generate_source` function uses this macro to generate the source configuration for a given `database` and `schema`. The `dbt_write_source` function writes a YAML file for the source config to `models/staging/<source_name>/src_<source_name>.yml`. This is a highly opinionated way of organizing the staging layer, and is based on the setup recommended by [dbt Labs](https://github.com/dbt-labs/corp/blob/master/dbt_style_guide.md).

In [20]:
def dbt_generate_source(database, schema, name=None):
    if name is None:
        name = schema
    source_yaml = dbt_run_operation('generate_source', database_name=database, schema_name=schema)
    source_dict = yaml.safe_load(source_yaml)
    return ({
       "version": source_dict['version'],
       "sources": [{
           "name": name,
           "database": database,
           "schema": schema,
           "tables": source_dict['sources'][0]['tables']
       }]
    })

def dbt_write_source(source):
    source_name = source['sources'][0]['name']
    source_dir = Path(f"models/staging/{source_name}")
    source_dir.mkdir(parents=True, exist_ok=True)
    source_file = source_dir / f"src_{source_name}.yml"
    print(f"Writing source yaml for {source_name} to {source_file}")
    # Write the `source` argument (not a global) to the computed file path.
    write_as_yaml(source, source_file)

source_greenery = dbt_generate_source('dbt', 'public', 'greenery')
dbt_write_source(source_greenery)
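
To make the shape of the resulting dictionary concrete without a database connection, here is a minimal sketch of the reshaping step on its own. It assumes the parsed macro output has the `version` and `sources[0].tables` keys that `dbt_generate_source` reads; the `raw` payload is hypothetical and `reshape_source` is an illustrative name, not part of the notebook.

```python
def reshape_source(raw, database, schema, name=None):
    """Rebuild the source config from an already-parsed generate_source result."""
    return {
        "version": raw["version"],
        "sources": [{
            # Fall back to the schema name when no source name is given.
            "name": schema if name is None else name,
            "database": database,
            "schema": schema,
            "tables": raw["sources"][0]["tables"],
        }],
    }


# Hypothetical parsed output covering two tables in the `public` schema.
raw = {
    "version": 2,
    "sources": [{"name": "public", "tables": [{"name": "addresses"}, {"name": "events"}]}],
}
source = reshape_source(raw, database="dbt", schema="public", name="greenery")
print(source["sources"][0]["name"], [t["name"] for t in source["sources"][0]["tables"]])
# prints: greenery ['addresses', 'events']
```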

## Generate Staging Models

The next step is to bootstrap staging models for every source table. Once again, `dbt` provides a handy way to generate the models and their configuration: the `generate_base_model` macro. The `dbt_generate_staging_models` function uses it to generate the boilerplate SQL for the staging model of every source table. The `dbt_write_staging_models` function writes these models to `models/staging/<source_name>/stg_<source_name>__<table_name>.sql` (note the double underscore between the source name and the table name).

In [None]:
def dbt_generate_staging_models(source):
    source_name = source['sources'][0]['name']
    table_names = [table['name'] for table in source['sources'][0]['tables']]
    # Map each table name to the boilerplate SQL generated for its staging model.
    staging_models = {"name": source_name, "models": {}}
    for table_name in table_names:
        print(table_name)
        sql = dbt_run_operation('generate_base_model', source_name=source_name, table_name=table_name)
        staging_models['models'][table_name] = sql
    return staging_models

def dbt_write_staging_models(staging_models):
    source_name = staging_models['name']
    for staging_model_name, staging_model_sql in staging_models['models'].items():
        staging_model_dir = Path(f"models/staging/{source_name}")
        staging_model_dir.mkdir(parents=True, exist_ok=True)
        staging_model_file = staging_model_dir / f"stg_{source_name}__{staging_model_name}.sql"
        print(f"Writing staging model for {staging_model_name} to {staging_model_file}")
        staging_model_file.write_text(staging_model_sql)

staging_models_greenery = dbt_generate_staging_models(source_greenery)
dbt_write_staging_models(staging_models_greenery)

It is very important to think documentation-first while building data models. Once again, `dbt` has a very useful utility to bootstrap the documentation for a single model. The `dbt_generate_staging_models_yaml` function uses this utility to loop through all staging models and returns a dictionary with the boilerplate documentation for all of them. The `dbt_write_staging_models_yaml` function then writes this to `models/staging/<source_name>/stg_<source_name>.yml`. Run `dbt run` before calling these two functions; otherwise, the column documentation is not generated.

In [93]:
def dbt_generate_staging_models_yaml(staging_models):
    source_name = staging_models['name']
    models_yaml = []  # accumulates the `models` entries of every staging model
    for table_name in staging_models['models']:
        staging_model_name = f"stg_{source_name}__{table_name}"
        print(f"Generating yaml for staging model {staging_model_name}")
        staging_model_yaml = dbt_run_operation('generate_model_yaml', model_name=staging_model_name)
        models_yaml += yaml.safe_load(staging_model_yaml)['models']

    return {'name': source_name, 'models': models_yaml}

def dbt_write_staging_models_yaml(staging_models_yaml):
    source_name = staging_models_yaml['name']
    staging_model_yaml_file = Path(f"models/staging/{source_name}/stg_{source_name}.yml")
    out = {'version': 2, 'models': staging_models_yaml['models']}
    write_as_yaml(out, staging_model_yaml_file)

staging_models_greenery_yaml = dbt_generate_staging_models_yaml(staging_models_greenery)
dbt_write_staging_models_yaml(staging_models_greenery_yaml)

    

Generating yaml for staging model stg_greenery__addresses
Generating yaml for staging model stg_greenery__events
Generating yaml for staging model stg_greenery__order_items
Generating yaml for staging model stg_greenery__orders
Generating yaml for staging model stg_greenery__products
Generating yaml for staging model stg_greenery__promos
Generating yaml for staging model stg_greenery__superheroes
Generating yaml for staging model stg_greenery__users
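
The file-naming conventions used by the three writer functions can be summarized in one place. The following is a minimal sketch, assuming the layout described in this notebook; `staging_paths` is a hypothetical helper that only computes paths and does not touch the file system.

```python
from pathlib import Path


def staging_paths(source_name, table_names, root="models/staging"):
    """Return the files the staging layer is expected to contain for one source."""
    source_dir = Path(root) / source_name
    return {
        # Source definition: src_<source_name>.yml
        "source_yaml": source_dir / f"src_{source_name}.yml",
        # Model documentation: stg_<source_name>.yml
        "models_yaml": source_dir / f"stg_{source_name}.yml",
        # One SQL file per table: stg_<source_name>__<table_name>.sql
        "models": [source_dir / f"stg_{source_name}__{t}.sql" for t in table_names],
    }


paths = staging_paths("greenery", ["addresses", "events"])
for model_path in paths["models"]:
    print(model_path.as_posix())
```

Keeping the conventions in a single function like this would make it easier to change the layout later without editing each writer separately.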


In [None]:
!cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})'
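
The same listing can be produced in Python, which is handy when `jq` is not installed. This is a minimal sketch that assumes only the manifest keys the `jq` filter above reads (`nodes`, `config`, `materialized`); `materializations` is a hypothetical helper name and the sample manifest is made up for illustration.

```python
def materializations(manifest):
    """List each node id with its configured materialization (None if absent)."""
    return [
        {"node": node_id, "materialized": node.get("config", {}).get("materialized")}
        for node_id, node in manifest.get("nodes", {}).items()
    ]


# Hypothetical, heavily trimmed manifest with a single node.
sample_manifest = {
    "nodes": {
        "model.dbt_greenery.stg_greenery__users": {"config": {"materialized": "view"}},
    }
}
print(materializations(sample_manifest))
# prints: [{'node': 'model.dbt_greenery.stg_greenery__users', 'materialized': 'view'}]
```

For the real project, load the file first, for example with `json.loads(Path("target/manifest.json").read_text())`, and pass the result to the function.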

In [25]:
import json
from pathlib import Path
from typing import Dict, List, Optional
from enum import Enum

from pydantic import BaseModel, validator


class DbtResourceType(str, Enum):
    model = 'model'
    analysis = 'analysis'
    test = 'test'
    operation = 'operation'
    seed = 'seed'
    source = 'source'
    snapshot = 'snapshot'


class DbtMaterializationType(str, Enum):
    table = 'table'
    view = 'view'
    incremental = 'incremental'
    ephemeral = 'ephemeral'
    seed = 'seed'
    snapshot = 'snapshot'
    test = 'test'


class NodeDeps(BaseModel):
    nodes: List[str]


class NodeConfig(BaseModel):
    materialized: Optional[DbtMaterializationType]


class Node(BaseModel):
    unique_id: str
    path: Path
    resource_type: DbtResourceType
    description: str
    depends_on: Optional[NodeDeps]
    config: NodeConfig


class Manifest(BaseModel):
    nodes: Dict[str, Node]
    sources: Dict[str, Node]

    @validator('nodes', 'sources')
    def filter(cls, val):
        return {k: v for k, v in val.items() if v.resource_type.value in ('model', 'seed', 'source')}


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)


In [None]:
data['nodes']['model.dbt_greenery.dim_address']

In [None]:
%%bash
cat target/manifest.json | \
    jq '.nodes | to_entries | map({node: .key, compiled_sql: .value.compiled_sql, dependencies: .value.depends_on.nodes})'

In [None]:
import networkx as nx


class GraphManifest(Manifest):
    @property
    def node_list(self):
        return list(self.nodes.keys()) + list(self.sources.keys())

    @property
    def edge_list(self):
        # Each edge is (node, dependency); skip nodes that have no `depends_on`.
        return [
            (k, d)
            for k, v in self.nodes.items()
            if v.depends_on is not None
            for d in v.depends_on.nodes
        ]

    def build_graph(self) -> nx.Graph:
        G = nx.Graph()
        G.add_nodes_from(self.node_list)
        G.add_edges_from(self.edge_list)
        return G


m = GraphManifest(**data)
G = m.build_graph()
nx.degree_centrality(G)
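
Because `nx.Graph` is undirected, the centrality scores do not distinguish a node's dependencies from its dependents. As a minimal standard-library sketch, assuming edges are `(node, dependency)` pairs like those in `edge_list`, the two directions can be counted separately. The function name `dependency_counts` and the node ids below are hypothetical.

```python
from collections import Counter


def dependency_counts(edges):
    """Count direct dependencies (upstream) and direct dependents (downstream)."""
    upstream = Counter(node for node, _ in edges)      # edges leaving each node
    downstream = Counter(dep for _, dep in edges)      # edges pointing at each node
    nodes = sorted(set(upstream) | set(downstream))
    return {n: {"upstream": upstream[n], "downstream": downstream[n]} for n in nodes}


# Hypothetical lineage: dim_address -> stg_greenery__addresses -> source table.
edges = [
    ("model.dim_address", "model.stg_greenery__addresses"),
    ("model.stg_greenery__addresses", "source.greenery.addresses"),
]
counts = dependency_counts(edges)
print(counts["model.stg_greenery__addresses"])
# prints: {'upstream': 1, 'downstream': 1}
```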

In [4]:
(
    dbt.fct_register_event()
       .merge(dbt.dim_event(), how='left', on='event_id',  suffixes=('', '_y'))
       [['session_id', 'event_created_at', 'event_type']]
       .to_csv('events.csv', index=False)
)

In [27]:
import pandas as pd

fct_register_event = pd.read_sql('SELECT * FROM dbt_ramnath_v.fct_register_event', con)
dim_event = pd.read_sql('SELECT * FROM dbt_ramnath_v.dim_event', con)
