## DABs from the Ground Up

### 1. Create a `databricks.yml` file
```yaml
bundle:
  name: thhart_dab_python

include:
  - ./*.job.yml

# specify variables but do not assign values
variables:
  warehouse_id: 
    description: The warehouse to use
  catalog: 
    description: The catalog to use
  schema:
    description: The schema to use

# specify variable values in targets
targets:
  
  # where do we want to develop
  thhart_target_dev:
    mode: development
    default: true
    workspace: 
      host: https://dbc-446db140-571a.cloud.databricks.com
    variables:
      catalog: thhart
      schema: dab_python_dev
  
  # where do we want to run in production
  thhart_target_prod:
    mode: production
    workspace: 
      host: https://dbc-446db140-571a.cloud.databricks.com
    variables:
      catalog: thhart
      schema: dab_python_prod
```
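
Neither target assigns `warehouse_id`, so it most likely needs a value from somewhere before the bundle can be validated or deployed. A minimal sketch of one option is to assign it per target next to `catalog` and `schema`. The ID shown is a placeholder, not a real warehouse:

```yaml
targets:
  thhart_target_dev:
    variables:
      catalog: thhart
      schema: dab_python_dev
      # placeholder: replace with the ID of an existing SQL warehouse
      warehouse_id: "<your-warehouse-id>"
```

The value can also be supplied at deploy time with `--var="warehouse_id=<id>"` or through a `BUNDLE_VAR_warehouse_id` environment variable.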

In [0]:
%sql 
create catalog if not exists thhart;
create schema if not exists thhart.dab_python_dev;
create schema if not exists thhart.dab_python_prod;

### 2. Explore Data

In [0]:
%sql list '/databricks-datasets/retail-org/sales_orders'

In [0]:
sdf = (
    spark
    .read
    .format('json')
    .load('/databricks-datasets/retail-org/sales_orders')
    .limit(5))
display(sdf)

### 3. Create Logic

In [0]:
import pyspark.sql.functions as f
sdf_etl = (
    spark
    .read
    .format('json')
    .load('/databricks-datasets/retail-org/sales_orders')
    # treat order_datetime as epoch seconds; try_cast returns NULL instead of failing on malformed values
    .select(
        'customer_id'
        , 'order_number'
        , f.expr('from_unixtime(try_cast(order_datetime as bigint))').alias('order_datetime')))
display(sdf_etl.limit(5))

In [0]:
import pyspark.sql.functions as f
sdf_agg = (
    sdf_etl
    .groupBy(f.col('order_datetime').cast('date').alias('order_date'))
    .agg(f.count('*').alias('n_orders'))
)
display(sdf_agg.limit(5))

### 4. Link SQL Together
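Two ways to achieve this:
* Use the Jobs interface to create a job, then copy its YAML
* Write the YAML by hand

The job definition, saved in a file matching `./*.job.yml` so that `databricks.yml` includes it: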
```yaml
resources:
  jobs:
    thhart_dab_python_job_yml:
      name: thhart_dab_python_job_yml

      trigger:
        periodic:
          interval: 1
          unit: DAYS
      
      email_notifications:
        on_failure:
          - thomas.hart@databricks.com

      parameters:
        - name: catalog
          default: ${var.catalog}
        - name: schema
          default: ${var.schema}
        - name: bundle_target
          default: ${bundle.target}

      tasks:
        - task_key: orders_raw
          notebook_task:
            notebook_path: ./orders_raw
            
        - task_key: orders_daily
          # run only after orders_raw has finished
          depends_on:
            - task_key: orders_raw
          notebook_task:
            notebook_path: ./orders_daily
```

### 5. Create GitHub Actions
1. In Databricks, create a personal access token (PAT)
2. In GitHub, add the PAT to the repository secrets
3. Create `.github/workflows/deploy_to_dev_workflow.yml`
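
A minimal sketch of what `.github/workflows/deploy_to_dev_workflow.yml` might contain. It assumes the PAT is stored in a repository secret named `DATABRICKS_TOKEN` (a hypothetical name) and that pushes to `main` should deploy to the `thhart_target_dev` target. The trigger branch, workflow name, and job name are also assumptions. The workspace host is read from the target in `databricks.yml`.

```yaml
# Sketch only: the secret name, branch, and job name are assumptions.
name: Deploy to dev

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # the CLI reads the PAT from this environment variable
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      # check out the repository so databricks.yml and the notebooks are available
      - uses: actions/checkout@v4
      # install the Databricks CLI on the runner
      - uses: databricks/setup-cli@main
      # check the bundle configuration, then deploy it to the dev target
      - run: databricks bundle validate -t thhart_target_dev
      - run: databricks bundle deploy -t thhart_target_dev
```

A production workflow could follow the same shape with `-t thhart_target_prod` and a different trigger, such as a release or a manual dispatch.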