<img src="https://github.com/richardcerny/bricksflow/raw/rc-bricksflow2.1/docs/img/databricks_icon.png?raw=true" width=100/>
# Bricksflow example 1.

## Create new table from CSV

This is the first template notebook. It gives a brief overview of how to develop a pipeline using Bricksflow.

You will learn how to organize cells and functions, use `@decorators`, and pass variables from `config.yaml`.

There are other template notebooks in this project, so look for _template_ notebooks in the workspace.

### Requirements for running this notebook
You can run this demo notebook to see it in action. The data source is a public dataset from Databricks, so you should be able to access it.

The following set-up is expected to be configured already.

##### Environment variables defined on a cluster
```
APP_ENV=dev
```
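
A quick way to confirm that the variable is visible from a notebook cell is sketched below. The helper name `require_app_env` and its error message are illustrative assumptions, not part of Bricksflow.

```python
import os


def require_app_env(expected="dev", environ=None):
    """Return APP_ENV, raising if it is missing or differs from `expected`."""
    # Hypothetical helper: `environ` defaults to the real process environment
    environ = os.environ if environ is None else environ
    value = environ.get("APP_ENV")
    if value != expected:
        raise RuntimeError(f"APP_ENV is {value!r}, expected {expected!r}")
    return value


# An explicit mapping is passed here so the snippet also runs outside a cluster
print(require_app_env("dev", {"APP_ENV": "dev"}))
```

On a cluster you would call `require_app_env()` without arguments so that it reads the real environment.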

##### Database `dev_bronze_covid`

If you want to run this notebook, you need to create the database using this command (the cell is prepared below):
```
%sql
create database if not exists dev_bronze_covid
```
__NOTE:__ Tested on a cluster running Databricks 7.3.

In [0]:
%sql
-- this cell is only for demo purposes
create database if not exists dev_bronze_covid;
create database if not exists dev_silver_covid;
create database if not exists dev_gold_reporting

### This command loads Bricksflow framework and its dependencies

In [0]:
%run ../../../app/install_master_package

### All your imports should be placed up here

In [0]:
from datetime import datetime
from pyspark.sql import functions as F

from logging import Logger
from datalakebundle.table.TableManager import TableManager
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from databricksbundle.notebook.decorators import dataFrameLoader, transformation, dataFrameSaver
from datalakebundle.table.TableNames import TableNames

### Cells and functions

Bricksflow's best practice is to write transformations using a function-per-cell approach: each transformation has its own function, and each function lives in its own cell. Organizing cells and functions this way significantly improves the debuggability of each step and brings other advantages.

We try to avoid complex dataframe manipulation within one function. A function name should briefly describe what the function does.

We are able to create a so-called *Lineage* that shows all aggregations and their input/output tables. This is especially useful for business analysts, as it gives them a better idea of what is happening.

#### Lineage example

<img src="https://github.com/richardcerny/bricksflow/raw/rc-bricksflow2.1/docs/img/lineage.png?raw=true" width=1200/>

In [0]:
# Check 
@dataFrameLoader("%datalakebundle.tables%", display=False)
def read_csv_mask_usage(parameters_datalakebundle, spark: SparkSession, logger: Logger):
    source_csv_path = parameters_datalakebundle['bronze_covid.tbl_template_1_mask_usage']['params']['source_csv_path']
    logger.info(f"Reading CSV from source path: `{source_csv_path}`.")
    return (
        spark
            .read
            .format('csv')
            .option('header', 'true')
            .option('inferSchema', 'true') # Tip: defining an explicit schema might be a better idea!
            .load(source_csv_path)
            .limit(10) # only for test
    )
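
The tip in the loader cell suggests defining a schema instead of inferring it. One possible approach, sketched here, is to build a DDL string and pass it to Spark's `DataFrameReader.schema`. The column types are assumptions read off the sample rows shown in this notebook, and the names `MASK_USAGE_COLUMNS` and `to_ddl` are hypothetical.

```python
# Assumed column types, based on the sample rows shown in this notebook
MASK_USAGE_COLUMNS = [
    ("COUNTYFP", "INT"),
    ("NEVER", "DOUBLE"),
    ("RARELY", "DOUBLE"),
    ("SOMETIMES", "DOUBLE"),
    ("FREQUENTLY", "DOUBLE"),
    ("ALWAYS", "DOUBLE"),
]


def to_ddl(columns):
    """Build a DDL schema string such as 'A INT, B DOUBLE'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


schema_ddl = to_ddl(MASK_USAGE_COLUMNS)
print(schema_ddl)

# In the loader, the inferSchema option could then be replaced with:
#   spark.read.format('csv').option('header', 'true').schema(schema_ddl).load(source_csv_path)
```

An explicit schema avoids the extra pass over the file that schema inference needs and keeps column types stable between runs.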

### @decorators
Did you notice the piece of code starting with "@" above the function? It's a standard Python element called a _decorator_. Bricksflow uses decorators to enable software engineering practices while keeping the advantages of an interactive notebook. A decorated function runs without being explicitly called, which simulates an interactive cell and allows the notebook to run as a script without any modification. Decorators also make it possible to generate Lineage documentation based on the order of transformations, among other things, with more to come in the future.
- *@dataFrameLoader* - use when loading a table or data from a source. Accepts variables from config and returns a dataframe.
- *@transformation* - use for any kind of dataframe transformation or step. You will probably use many of these. Accepts an input dataframe and variables from config, and returns a dataframe.
- *@dataFrameSaver* - use when saving a dataframe to a table. Accepts only an input dataframe and variables from config.
- *@notebookFunction* - use when running any other Python code, such as MLflow, widgets, or secrets.
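
To make the idea of "running a function without explicitly calling it" concrete, here is a minimal pure-Python sketch. It is not Bricksflow's implementation; the decorator `run_on_definition` is a hypothetical toy that only illustrates the mechanism.

```python
def run_on_definition(func):
    """Toy decorator: call `func` as soon as it is defined and keep the result."""
    func.result = func()
    return func


@run_on_definition
def load_numbers():
    return [1, 2, 3]


# No explicit call was needed; the result is already available
print(load_numbers.result)
```

Because the call happens when the function is defined, the same code behaves identically whether it is executed cell by cell or as a plain script.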

#### Decorators parameters
Some functionality can be configured through decorator parameters. You have these options:
- Variables from config -> see section _Define param in config.yaml_ below
- `display=True/False`
  Do you use the display(df) function to show the content of a dataframe? This parameter does exactly the same. Because it is a decorator parameter, it can easily be deactivated in production, where it is not necessary. Set the parameter to True to show a data preview or to False to skip the preview.
  
  <img src="https://github.com/richardcerny/bricksflow/raw/rc-bricksflow2.1/docs/img/display_true.png?raw=true" width=800/>

### Set parameter display=True to show results in this cell

In [0]:
@transformation(read_csv_mask_usage, display=True)
def add_column_insert_ts(df: DataFrame, logger: Logger):
    logger.info("Adding Insert timestamp")
    return df.withColumn('INSERT_TS', F.lit(datetime.now()))
    

COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS,INSERT_TS
1001,0.053,0.074,0.134,0.295,0.444,2021-01-13T09:32:20.557+0000
1003,0.083,0.059,0.098,0.323,0.436,2021-01-13T09:32:20.557+0000
1005,0.067,0.121,0.12,0.201,0.491,2021-01-13T09:32:20.557+0000
1007,0.02,0.034,0.096,0.278,0.572,2021-01-13T09:32:20.557+0000
1009,0.053,0.114,0.18,0.194,0.459,2021-01-13T09:32:20.557+0000
1011,0.031,0.04,0.144,0.286,0.5,2021-01-13T09:32:20.557+0000
1013,0.102,0.053,0.257,0.137,0.451,2021-01-13T09:32:20.557+0000
1015,0.152,0.108,0.13,0.167,0.442,2021-01-13T09:32:20.557+0000
1017,0.117,0.037,0.15,0.136,0.56,2021-01-13T09:32:20.557+0000
1019,0.135,0.027,0.161,0.158,0.52,2021-01-13T09:32:20.557+0000


### Passing dataframe between functions
Normally you would pass dataframes between transformations like this:
```
df_1 = df.select('xxx',...)
df_2 = df_1.withColumn(...)
df_2.write...
```
*Bricksflow does it a bit differently!*

Basically, you take the name of the original function and pass it as an input parameter to the @decorator of the following (or any other) function. Thanks to this, you can easily navigate between functions in your IDE.
See below how to pass a dataframe from one function to another.

![Passing dataframe between functions](https://github.com/richardcerny/bricksflow/raw/rc-bricksflow2.1/docs/img/df_passing.png)

You can see this in action across this notebook.
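
The mechanism of passing a result by referring to the function that produced it can be sketched in plain Python. This is an illustration under stated assumptions, not Bricksflow's actual code: the decorator `step`, its `display` flag, and the `result` attribute are hypothetical stand-ins, and lists are used in place of dataframes.

```python
def step(*sources, display=False):
    """Toy decorator: run `func` at definition time, feeding it the results of `sources`."""
    def decorator(func):
        # Collect the stored results of the upstream functions
        inputs = [source.result for source in sources]
        func.result = func(*inputs)
        if display:
            # Stand-in for a data preview
            print(f"{func.__name__}: {func.result}")
        return func
    return decorator


@step()
def load_rows():
    return [1, 2, 3]


@step(load_rows, display=True)
def double_rows(rows):
    return [value * 2 for value in rows]


@step(double_rows)
def sum_rows(rows):
    return sum(rows)
```

Since each step names its input function directly, an IDE's "go to definition" leads straight to the upstream step, which is the navigation benefit described above.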

In [0]:
@dataFrameSaver(add_column_insert_ts)
def save_table_bronze_covid_tbl_template_1_mask_usage(df: DataFrame, logger: Logger, tableNames: TableNames,  tableManager: TableManager):
    
    # Recreate = remove table and create again
    tableManager.recreate('bronze_covid.tbl_template_1_mask_usage')
    
    outputTableName = tableNames.getByAlias('bronze_covid.tbl_template_1_mask_usage')
    logger.info(f"Saving data to table: {outputTableName}")
    (
        df
            .select(
                'COUNTYFP',
                'NEVER',
                'RARELY',
                'SOMETIMES',
                'FREQUENTLY',
                'ALWAYS',
                'INSERT_TS'
            )
            .write
            .option('partitionOverwriteMode', 'dynamic')
            .insertInto(outputTableName)
    )
    logger.info(f"Data successfully saved to: {outputTableName}")