# Overview

Welcome to a demo of snapshot and real time replication to Databricks.

Use this notebook to customize the schema, data, workload, and **legacy** Arcion.

**NOTE**: **Databricks Personal Access Token** and **Arcion License** are required. 

- Initial Setup
  - Open `Table of Contents` (Outline)
  - Enter `Arcion License`
  - Enter `Databricks Personal Access Token`
  - Click `Run All`
  - Click `View` -> `Results Only`
  - Click `View` -> `Web Terminal`
    - Enter `tmux attach`.
      - If it fails with `session not found`, wait a bit and retry.
    - In the `tmux` console window, `htop` is displayed during the setup.
    - Once the setup is complete, the Arcion snapshot summary is displayed.
    - Wait for the setup to finish and the snapshot to complete.
    - The setup takes about 5 minutes to finish.
- Iterate with the following:
  - Configure Schema and Data
  - Configure Workload
  - Configure Arcion
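
The "wait a bit and retry" step for `tmux attach` can also be scripted. The following is a minimal sketch, not part of the notebook: it polls `tmux has-session` until the session appears. The session name `arcdst`, the retry count, and the delay are assumptions chosen for illustration.

```python
import subprocess
import time


def tmux_session_exists(session: str) -> bool:
    """Return True if `tmux has-session` reports that the session exists."""
    try:
        result = subprocess.run(
            ["tmux", "has-session", "-t", session],
            capture_output=True,
        )
    except FileNotFoundError:
        # tmux itself is not installed (yet)
        return False
    return result.returncode == 0


def wait_for(check, retries: int = 30, delay_sec: float = 10.0) -> bool:
    """Call check() until it returns True or the retries run out."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            time.sleep(delay_sec)
    return False


# Example (assumed session name "arcdst"): wait up to about 5 minutes
# ready = wait_for(lambda: tmux_session_exists("arcdst"))
```

`wait_for` takes the check as a callable so the retry logic can be exercised without a running tmux server.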

## Where to Find the Data in Databricks
  - Spark (Delta Lake) uses the **Hive Metastore** catalog:
    - Open a new tab: Catalog -> hive_metastore -> <your username>
    - Find the ycsbdense and ycsbsparse tables
  - Lakehouse uses the **Unity Catalog** catalog:
    - Open a new tab: Catalog -> <your username>
    - Find the ycsbdense and ycsbsparse tables
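
The `<your username>` entry is not the raw login. The setup cell derives it from `current_user()` by replacing `.` and `@` with `_`. A small sketch of that substitution follows; the function name `schema_name_for` and the example address are hypothetical.

```python
import re


def schema_name_for(login: str) -> str:
    """Replace '.' and '@' with '_' the same way the setup cell does."""
    return re.sub(r"[.@]", "_", login)


# Hypothetical login used only for illustration
print(schema_name_for("jane.doe@example.com"))
```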

## Frequent Demo Configurations
- Step 1
  - Click Real-Time
  - Run just Arcion
  - Change YCSB Size
  - Watch real-time performance
- Step 2
  - Click Unity Catalog target
  - Select full replication mode
  - Run just Arcion

# Personal Compute Cluster

Choose at least 16GB of RAM for a demo.

Each process needs RAM.  The figures below are minimums.  The server needs enough RAM to avoid swapping.
- Databricks: 5GB 
- SQL Server: 2GB
- Arcion: 10% of server RAM.
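
As a rough check, the minimums above can be added up for a given server size. This is a sketch under the stated figures only. It leaves no headroom for the operating system or YCSB, and the function name `min_ram_gb` is made up for the example.

```python
def min_ram_gb(server_ram_gb: float, arcion_percent: float = 10.0) -> float:
    """Sum the documented minimums: Databricks 5GB, SQL Server 2GB, Arcion % of RAM."""
    databricks_gb = 5.0
    sqlserver_gb = 2.0
    arcion_gb = server_ram_gb * arcion_percent / 100.0
    return databricks_gb + sqlserver_gb + arcion_gb


# For a 16GB server: 5 + 2 + 1.6
print(round(min_ram_gb(16), 1))
```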

Note:
- Run `vmstat 5`.  Any non-zero values in the `si` and `so` columns (swap in and swap out) indicate a RAM shortage.
- DBR 13 does not print the output of `subprocess.run`.
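
The `si`/`so` check can be automated by parsing captured `vmstat` text. The sketch below assumes the usual procps column layout with a header row naming the columns. The sample text is invented for illustration and is not output from this demo.

```python
def swap_samples(vmstat_text: str):
    """Return a list of (si, so) tuples parsed from `vmstat` text output."""
    si_idx = so_idx = None
    samples = []
    for line in vmstat_text.splitlines():
        tokens = line.split()
        if "si" in tokens and "so" in tokens:
            # column header row: remember where si and so are
            si_idx, so_idx = tokens.index("si"), tokens.index("so")
            continue
        if si_idx is None or len(tokens) <= max(si_idx, so_idx):
            continue
        if not all(t.isdigit() for t in tokens):
            continue
        samples.append((int(tokens[si_idx]), int(tokens[so_idx])))
    return samples


def is_swapping(vmstat_text: str) -> bool:
    """True if any sample shows swap-in or swap-out activity."""
    return any(si or so for si, so in swap_samples(vmstat_text))


# Invented sample: the second data row shows swap activity
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  10240 204800    0    0     5    12  150  300  3  1 96  0  0
 2  0   2048 102400  10240 204800   64  128    40    90  400  900 20  5 70  5  0
"""
print(is_swapping(SAMPLE))
```

Note that the first report from `vmstat` gives averages since boot, so later samples are the ones that reflect current behavior.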

In [9]:
%pip install file-read-backwards
%pip install deepdiff


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [29]:
# prep python env
import subprocess
import math
import pandas as pd
import re
import ipywidgets as widgets
import os
import pathlib
import json
import requests
import deepdiff
from ipywidgets import HBox, VBox, Label
from file_read_backwards import FileReadBackwards

# all exp parameters 
def exp_params():
    all_params={
    # arcion
    "arcion_download_url": arcion_download_url.value,
    "srcdb_arc_user": src_username.value,
    "repl_type": repl_mode.value,
    "replicant_memory_percentage": ram_percent.value,
    "srcdb_snapshot_threads": snapshot_threads.value,
    "srcdb_realtime_threads": realtime_threads.value, 
    "srcdb_delta": delta_threads.value,
    "dstdb_type": dbx_destinations.value,
    "dstdb_stage": dbx_staging.value,
    "dbx_spark_url": dbx_spark_url.value,
    "dbx_databricks_url": dbx_databricks_url.value,
    "dbx_hostname": dbx_hostname.value,
    "dbx_dbfs_root": dbx_username.value,
    "dbx_username": dbx_username.value,

    # schema and data
    "sparse_cntstart": sparse_cntstart.value,
    "sparse_cnt": sparse_cnt.value , 
    "sparse_fieldcount": sparse_fieldcount.value, 
    "sparse_fieldlength": sparse_fieldlength.value, 
    "sparse_recordcount": sparse_recordcount.value, 
    "sparse_fillpct_start": sparse_fillpct.value[0],
    "sparse_fillpct_end": sparse_fillpct.value[1],
    "dense_cntstart": dense_cntstart.value, 
    "dense_cnt": dense_cnt.value, 
    "dense_fieldcount": dense_fieldcount.value, 
    "dense_fieldlength": dense_fieldlength.value, 
    "dense_recordcount": dense_recordcount.value, 
    "dense_fillpct_start": dense_fillpct.value[0],
    "dense_fillpct_end": dense_fillpct.value[1],

    # workload
    "sparse_tps": sparse_tps.value,
    "dense_tps": dense_tps.value,
    "sparse_threads": sparse_threads.value,
    "dense_threads": dense_threads.value,
    "sparse_multiUpdateSize": sparse_multiupdatesize.value,
    "sparse_multiInsertSize": sparse_multiinsertsize.value,
    "sparse_multiDeleteSize": sparse_multideletesize.value,
    "dense_multiUpdateSize": dense_multiupdatesize.value,
    "dense_multiInsertSize": dense_multiinsertsize.value,
    "dense_multiDeleteSize": dense_multideletesize.value,
    }

    # cluster
    try:
        all_params["spark.databricks.clusterUsageTags.clusterNodeType"] = spark.conf.get("spark.databricks.clusterUsageTags.clusterNodeType")
        all_params["spark.databricks.clusterUsageTags.cloudProvider"]  =  spark.conf.get("spark.databricks.clusterUsageTags.cloudProvider")
    except:
        pass

    return(all_params)

# used to start a new MLflow run when parameters change
try:
    mlflow_proc_state
except NameError:
    mlflow_proc_state={}

try:
    previous_exp_params
except NameError:
    previous_exp_params={}
try:
    current_exp_params
except NameError:
    current_exp_params={}

try:
    ycsb_logfile_positions
except NameError:
    ycsb_logfile_positions={}
try:
    ycsb_metrics
except NameError:
    ycsb_metrics={}
try:
    previous_log_time
except NameError:
    previous_log_time=None

# arcion statistics CSV
arcion_stats_csv_header_lines="catalog_name,schema_name,table_name,snapshot_start_range,snapshot_end_range,start_time,end_time,insert_count,update_count,upsert_count,delete_count,elapsed_time_sec,replicant_lag,total_lag"
arcion_key_index={'insert_count':7,'update_count':8,'upsert_count':9,'delete_count':10,'elapsed_time_sec':11,'replicant_lag':12,'total_lag':13}

try:
    arcion_stats_csv_positions
except NameError:
    arcion_stats_csv_positions={}

# setup GUI elements

repl_mode = widgets.Dropdown(options=['snapshot', 'real-time', 'full'],value='real-time',
    description='Replication:',
)
cdc_mode = widgets.Dropdown(options=['change', 'cdc'],value='change',
    description='CDC Method:',
)
ram_percent = widgets.BoundedIntText(value=10,min=10,max=80,
    description='RAM %:',
)

snapshot_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Snapshot Threads:',
)

realtime_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Real Time Threads:',
)    

delta_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Delta Snapshot Threads:',
)    

dbx_destinations = widgets.Dropdown(options=['null', 'deltalake', 'unitycatalog'],value='null',
    description='Destinations:',
)
dbx_staging = widgets.Dropdown(options=['dbfs'],value='dbfs',
    description='Staging:',
)

sparse_cnt = widgets.BoundedIntText(value=4,min=1,max=1000,
    description='Tbl End:',
)
sparse_cntstart = widgets.BoundedIntText(value=1,min=1,max=1000,
    description='Tbl Start:',
)

sparse_fieldcount = widgets.BoundedIntText(value=50,min=0,max=9000,
    description='# of Fields:',
)
sparse_fieldlength = widgets.BoundedIntText(value=10,min=1,max=1000,
    description='Field Len:',
)

sparse_tps = widgets.BoundedIntText(value=2000,min=0,max=10000,
    description='TPS:',
)
sparse_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)
sparse_recordcount = widgets.Text(value="1M",
    description='Rec Cnt:',
)

sparse_fillpct = widgets.IntRangeSlider(value=[0,0],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

dense_cnt = widgets.BoundedIntText(value=2,min=1,max=1000,
    description='Tbl End:',
)
dense_cntstart = widgets.BoundedIntText(value=1,min=1,max=1000,
    description='Tbl Start:',
)

dense_fieldcount = widgets.BoundedIntText(value=10,min=0,max=9000,
    description='# of Fields:',
)
dense_fieldlength = widgets.BoundedIntText(value=100,min=1,max=1000,
    description='Field Len:',
)
dense_recordcount = widgets.Text(value="100K",
    description='Rec Cnt:',
)

dense_tps = widgets.BoundedIntText(value=2000,min=0,max=10000,
    description='TPS:',
)
dense_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)

delupdins_proportion = widgets.IntRangeSlider(value=[1,999],min=0,max=1000,step=1,
    description='Del Upd Ins:', orientation='horizontal', readout=True
)

dense_multiupdatesize = widgets.BoundedIntText(value=1024,min=0,max=10240, description='Upd Size:')
dense_multiinsertsize = widgets.BoundedIntText(value=0,min=0,max=10240, description='Ins Size:')
dense_multideletesize = widgets.BoundedIntText(value=0,min=0,max=10240, description='Del Size:')

sparse_multiupdatesize = widgets.BoundedIntText(value=1024,min=0,max=10240, description='Upd Size:')
sparse_multiinsertsize = widgets.BoundedIntText(value=0,min=0,max=10240, description='Ins Size:')
sparse_multideletesize = widgets.BoundedIntText(value=0,min=0,max=10240, description='Del Size:')

dense_fillpct = widgets.IntRangeSlider(value=[1,99],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

dbx_spark_url = widgets.Textarea(value='',
    description='Spark URL:',
)

dbx_databricks_url = widgets.Textarea(value='',
    description='Databricks URL:',
)

dbx_hostname = widgets.Textarea(value='',
    description='Hostname:',
)

src_username = widgets.Textarea(value='',
    description='SRC User:',
)

dbx_username = widgets.Textarea(value='',
    description='DST User:',
)

arcion_license = widgets.Textarea(value='',
    description='Lic',
)

arcion_download_url = widgets.Textarea(value='https://arcion-releases.s3.us-west-1.amazonaws.com/general/replicant/replicant-cli-24.01.25.7.zip',
    description='Download URL',
)

dbx_access_token = widgets.Password(value='',
    description='Access Token',
)

dbx_default_catalog = widgets.Textarea(value='',
    description='HMS Catalog',
)


# cluster where the notebook is running to auto populate the destinations
spark_url=""
databricks_url=""
workspaceUrl=""
username=""
try:
    cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    workspace_id =spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")

    # clusterName = spark.conf.get("spark.databricks.clusterUsageTags.clusterName")

    workspaceUrl = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())['tags']['browserHostName']

    # below does not work on GCP
    # sc.getConf().getAll() to see what is avail
    # workspaceUrl = spark.conf.get("spark.databricks.workspaceUrl") # host name

    http_path = f"sql/protocolv1/o/{workspace_id}/{cluster_id}"

    spark_url=f"jdbc:spark://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;UID=token;"
    databricks_url=f"jdbc:databricks://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;UID=token;"

except:
    pass
dbx_spark_url.value = spark_url
dbx_databricks_url.value = databricks_url
dbx_hostname.value = workspaceUrl

try:
    username = spark.sql("SELECT current_user()").collect()[0][0]
    dbx_username.value = re.sub('[.@]','_',username)
    src_username.value = re.sub('[.@]','_',username)
except:
    src_username.value='arcsrc'
    dbx_username.value='arcdst'

try:
    dbx_default_catalog.value=spark.conf.get("spark.databricks.sql.initial.catalog.name")
except:
    pass

# check arcion license via os env
try:
    arclicenv=os.environ["ARCION_LICENSE"]
    if arclicenv != "": 
        arcion_license.value=arclicenv
except:
    pass

# check arcion license via dbx widget
try:
    arclicwidget=dbutils.widgets.get("Arcion License")
    if arclicwidget != "": 
        arcion_license.value=arclicwidget
        arcion_license.disabled = True
except:
    pass

# check access token via dbx widget
try:
    acctokwidget=dbutils.widgets.get("Access Token")
    if acctokwidget != "": 
        dbx_access_token.value=acctokwidget
        dbx_access_token.disabled = True
except:
    pass

# check dpkg dir via dbx widget
pkg_src_dir=widgets.Textarea(value='',
    description='Pkg Src Dir:',
)
try:
    pkgsrcdirwidget=dbutils.widgets.get("Package Source Dir")
    if pkgsrcdirwidget != "": 
        pkg_src_dir.value=pkgsrcdirwidget
        pkg_src_dir.disabled = True
except:
    pass

# check if os env has ARCION_LICENSE (os.getenv returns None when it is unset)
try:
    arclicenv=os.getenv('ARCION_LICENSE')
    if arclicenv: 
        arcion_license.value=arclicenv
except:
    pass

# gcp does not change cwd to notebook path
pwd_result= subprocess.run(f"""pwd""",capture_output = True, text = True )
if (pwd_result.stdout == "/databricks/driver\n"):
    notebookpath="/Workspace" + str(pathlib.Path(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()).parent)
else:
    notebookpath = None

# optional MLflow
experiment_id=None
try:
    import mlflow
    experiment_id=dbutils.widgets.get("Experiment ID")
except:
    pass

# src_db
src_db_type = widgets.Dropdown(value='SQL Server', options=['MySQL', 'Postgres', 'SQL Server'])
src_db_host = widgets.Text(value='localhost', placeholder='hostname or IP')
src_db_port = widgets.Text(value='', placeholder='port #')
src_db_user = widgets.Text(value='', placeholder='username')
src_db_pass = widgets.Text(value='', placeholder='user password')
src_db_root_user = widgets.Text(value='', placeholder='root username')
src_db_root_pass = widgets.Text(value='', placeholder='root password')

# dst_db

# Setup
  - Enter `Arcion License`
  - Enter `Personal Access Token` (generate one with a **One Day** lifetime and delete it afterwards)
  - Click **Menu Bar** ->  Run -> Run All Below 

## Configure

In [None]:
# enter license and DBX personal access token
VBox([HBox([Label('Arcion'), arcion_license, arcion_download_url]),
      HBox([Label('Local Stage'), pkg_src_dir]),
      HBox([Label('DBX'), dbx_access_token, dbx_default_catalog]),
      HBox([Label('Username'), src_username, dbx_username]),
      HBox([Label('Workspace'), dbx_spark_url, dbx_databricks_url, dbx_hostname, ]),
       ])

VBox(children=(HBox(children=(Label(value='Arcion'), Textarea(value='ewogICJsaWNlbnNlIiA6IHsKICAgICJ1dWlkIiA6I…

## Start

In [None]:
# setup tmux, arcion, ycsb, sql server
subprocess.run(f""". ./bin/setup-tmux.sh; setup_tmux '{dbx_username.value}'""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""bin/download-jars.sh""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""ARCION_LICENSE='{arcion_license.value}' ARCION_DOWNLOAD_URL='{arcion_download_url.value}' bin/install-arcion.sh""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""bin/install-ycsb.sh""",shell=True,executable="bash",cwd=notebookpath)

# mysql

# pg


# sqlserver
subprocess.run(f"""SQL_SERVER_DPKG='{pkg_src_dir.value}' bin/install-sqlserver.sh""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value}; . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; ping_sql_cli;""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value}; . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; create_user;""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value}; . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; set_sqlserver_ram '{dbx_username.value}';""",shell=True,executable="bash",cwd=notebookpath)
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value}; bin/install-prometheus.sh""",shell=True,executable="bash",cwd=notebookpath)

tmux session ready. session arcdst already exists
installing apt-utils
installing mssql-server


open terminal failed: not a terminal
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
100   983  100   983    0     0   2611      0 --:--:-- --:--:-- --:--:--  2607
curl: Failed writing body
bin/install-sqlserver.sh: line 52: lsb_release: command not found
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
sudo: a terminal is required to read the password; either use th

installing mssql-tools18
installing unixodbc-dev
sqlserver start failed. 1
deltalake /opt/stage/libs/SparkJDBC42.jar found
lakehouse  /opt/stage/libs/DatabricksJDBC42.jar found
postgres  /opt/stage/libs/postgresql-42.7.1.jar found
mariadb  /opt/stage/libs/mariadb-java-client-3.3.2.jar found
oracle /opt/stage/libs/ojdbc8.jar found
log4j /opt/stage/libs/log4j-1.2.17.jar found
sqlserver /opt/stage/libs/mssql-jdbc-12.6.1.jre8.jar found


sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required
sudo: unknown user mssql
sudo: error initializing audit plugin sudoers_audit
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  396M  100  396M    0     0  3137k      0  0:02:09  0:02:09 --:--:-- 3464k


arcion   downloaded
checking jar(s) in /opt/stage/arcion/replicant-cli/lib for updates
checking jar(s) in /opt/stage/arcion/replicant-cli-24.01.25.7/replicant-cli/lib for updates
'/opt/stage/libs//SparkJDBC42.jar' -> '/opt/stage/arcion/replicant-cli-24.01.25.7/replicant-cli/lib/./SparkJDBC42.jar'
'/opt/stage/libs//ojdbc8.jar' -> '/opt/stage/arcion/replicant-cli-24.01.25.7/replicant-cli/lib/./ojdbc8.jar'
'/opt/stage/libs//log4j-1.2.17.jar' -> '/opt/stage/arcion/replicant-cli-24.01.25.7/replicant-cli/lib/./log4j-1.2.17.jar'
'/opt/stage/libs//DatabricksJDBC42.jar' -> '/opt/stage/arcion/replicant-cli-24.01.25.7/replicant-cli/lib/./DatabricksJDBC42.jar'
setting /opt/stage/arcion/replicant.lic from $ARCION_LICENSE
{
  "license" : {
    "uuid" : "e87e0e0f-1b26-4117-aa86-e490a8f64f92",
    "owner" : "Robert Lee",
    "created" : "2023-02-22T00:00Z",
    "expires" : "2024-02-22T00:00Z",
    "type" : "OFFLINE",
    "edition" : "ENTERPRISE",
    "src" : [ "ALL" ],
    "dst" : [ "ALL" ]
  },
  "ke

bash: line 1: bin/install-ycsb.sh: Permission denied


/opt/stage/arcion/replicant-cli/bin/replicant 23.09.29.11 23.09


Error: SQL Server is not running.
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


1: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


2: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


3: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


4: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


5: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


6: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


7: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


8: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


9: waiting for db


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


10: waiting for db
/opt/stage/arcion/replicant-cli/bin/replicant 23.09.29.11 23.09


Error: SQL Server is not running.
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 42: /create_user.sql: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found
cat: /create_user.sql: No such file or directory


creating user arcsrc
/opt/stage/arcion/replicant-cli/bin/replicant 23.09.29.11 23.09


Error: SQL Server is not running.
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 348: sqlcmd: command not found


prometheus already downloaded
prometheus node_exporter already downloaded
prometheus sql_exporter being downloaded


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 24.3M  100 24.3M    0     0  4987k      0  0:00:04  0:00:04 --:--:-- 5833k
x sql_exporter-0.14.0.darwin-arm64/
x sql_exporter-0.14.0.darwin-arm64/README.md
x sql_exporter-0.14.0.darwin-arm64/mssql_standard.collector.yml
x sql_exporter-0.14.0.darwin-arm64/sql_exporter

started /opt/stage/prom/sql_exporter-0.14.0.linux-amd64/sql_exporter.  log at /var/tmp/arcsrc/sqlserver/logs/sql_exporter.log
started /opt/stage/prom/node_exporter-1.7.0.linux-amd64/node_exporter.  log at /var/tmp/arcsrc/sqlserver/logs/node_exporter.log



x sql_exporter-0.14.0.darwin-arm64/LICENSE
x sql_exporter-0.14.0.darwin-arm64/sql_exporter.yml
sed: /opt/stage/prom/sql_exporter-0.14.0.linux-amd64/sql_exporter.yml: No such file or directory
bin/install-prometheus.sh: line 79: pushd: /opt/stage/prom/sql_exporter-0.14.0.linux-amd64: No such file or directory
bin/install-prometheus.sh: line 83: popd: directory stack empty
bin/install-prometheus.sh: line 81: /var/tmp/arcsrc/sqlserver/logs/sql_exporter.log: No such file or directory
bin/install-prometheus.sh: line 89: /var/tmp/arcsrc/sqlserver/logs/node_exporter.log: No such file or directory


CompletedProcess(args='export SRCDB_ARC_USER=arcsrc; bin/install-prometheus.sh', returncode=0)

# Schema and Data

Existing tables will be appended with additional rows if the `Fill Range` is the same.  
Increase the `Table Count` to create additional tables.  

The following options are available:
- Table count (Table Cnt): The number of tables to create.  
  - Table names are `ycsbdense`, `ycsbdense2`, `ycsbdense3`, ... and `ycsbsparse`, `ycsbsparse2`, `ycsbsparse3`, ...
- Number of Fields (# of Fields): The number of fields per table.  
  - The field names are `FIELD0`, `FIELD1`, `FIELD2`, ...
  - Note the use of `K`,`M`,`B` ... suffix at the end.
- Field Length (Field Len): The length of random character data populated per field.  
  - Note the use of `K`,`M`,`B` ... suffix at the end.
- Record Count (Rec Cnt): The number of records per table generated.
  - Note the use of `K`,`M`,`B` ... suffix at the end.
- Fill Range: The relative start and end range of fields that are populated with data.  By default: 
    - sparse tables contain all NULLs because the fill range is 0% to 0%
    - dense tables have all fields populated because the fill range is 0% to 100%
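
The `K`, `M`, `B` suffixes let a value such as `1M` or `100K` stand for a large count. The notebook passes these strings to the shell scripts unchanged, so the sketch below is only an illustration of how such a value could be expanded. Decimal multipliers (1K = 1,000) and the name `parse_count` are assumptions.

```python
# Assumed decimal multipliers; the demo scripts may interpret suffixes differently
_SUFFIX = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_count(text: str) -> int:
    """Expand a count such as '1M' or '100K' into an integer."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIX:
        return int(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


print(parse_count("1M"), parse_count("100K"))
```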

```sql
[localhost][arcsrc] 1> \describe ycsbsparse
+-------------+-------------+-----------+-------------+----------------+-------------+
| TABLE_SCHEM | COLUMN_NAME | TYPE_NAME | COLUMN_SIZE | DECIMAL_DIGITS | IS_NULLABLE |
+-------------+-------------+-----------+-------------+----------------+-------------+
| dbo         | YCSB_KEY    | int       |          10 |              0 | NO          |
| dbo         | FIELD0      | text      |  2147483647 |         [NULL] | YES         |
| dbo         | FIELD1      | text      |  2147483647 |         [NULL] | YES         |
```
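
The `Fill Range` percentages are converted to field positions before the load starts. The sketch below repeats the arithmetic used by the load cell in this notebook (`math.ceil` for the start, `int` truncation for the end). How the shell script then interprets the two bounds is not shown here, and the function name `fill_bounds` is made up.

```python
import math


def fill_bounds(fill_pct, field_count):
    """Convert a (start %, end %) fill range into (y_fillstart, y_fillend)."""
    start_pct, end_pct = fill_pct
    fill_start = math.ceil(start_pct * field_count / 100)
    fill_end = int(end_pct * field_count / 100)
    return fill_start, fill_end


# Widget defaults from the setup cell
print(fill_bounds((0, 0), 50))   # sparse: 50 fields, 0% to 0%
print(fill_bounds((1, 99), 10))  # dense: 10 fields, 1% to 99%
```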

## Configure
Make changes below and click `Run All Below`.  

In [28]:
VBox([
    HBox([Label(''),Label('Type'),Label('Host'),Label('Port'),Label('Username'),Label('User Password'),Label('Root User'),Label('Root Password')]),
    HBox([Label('SRC'),src_db_type,src_db_host,src_db_port,src_db_user,src_db_pass,src_db_root_user,src_db_root_pass]),
    ])


VBox(children=(HBox(children=(Label(value=''), Label(value='Type'), Label(value='Host'), Label(value='Port'), …

In [None]:
# show YCSB Data Controls
VBox([HBox([Label('Sparse'), sparse_cntstart,sparse_cnt, sparse_fieldcount, sparse_fieldlength, sparse_recordcount, sparse_fillpct]),
    HBox([Label('Dense'),  dense_cntstart, dense_cnt, dense_fieldcount, dense_fieldlength, dense_recordcount, dense_fillpct])])

VBox(children=(HBox(children=(Label(value='Sparse'), BoundedIntText(value=1, description='Tbl Start:', max=100…

## Start

In [None]:
# run load_sparse_data_cnt and load_dense_data_cnt 
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value}; 
    . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    y_fieldcount={sparse_fieldcount.value} 
    y_fieldlength={sparse_fieldlength.value}  
    y_recordcount={sparse_recordcount.value} 
    y_fillstart={math.ceil((sparse_fillpct.value[0] * sparse_fieldcount.value) / 100)}      
    y_fillend={int((sparse_fillpct.value[1] * sparse_fieldcount.value) / 100)}      
    load_sparse_data_cnt {sparse_cnt.value} {sparse_cntstart.value};
    y_fieldcount={dense_fieldcount.value} 
    y_fieldlength={dense_fieldlength.value} 
    y_recordcount={dense_recordcount.value} 
    y_fillstart={math.ceil((dense_fillpct.value[0] * dense_fieldcount.value) / 100)}      
    y_fillend={int((dense_fillpct.value[1] * dense_fieldcount.value) / 100)}      
    load_dense_data_cnt {dense_cnt.value} {dense_cntstart.value};
    dump_schema;
    list_table_counts""",
    shell=True,executable="bash",cwd=notebookpath) 
# show tables
pd.read_csv (f"/var/tmp/{src_username.value}/sqlserver/config/list_table_counts.csv",header=None, names= ['table name','min key','max key','field count'])

/opt/stage/arcion/replicant-cli/bin/replicant 23.09.29.11 23.09
starting sparse load. /ycssparse.load.log

Error: SQL Server is not running.
touch: cannot touch '/ycsbsparse.load.log': Read-only file system


 1 2 3 4
starting dense load. /ycsbdense.load.log 1 2


./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 126: /ycsbsparse.load.log: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 126: /ycsbsparse.load.log: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 126: /ycsbsparse.load.log: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 126: /ycsbsparse.load.log: Read-only file system
touch: cannot touch '/ycsbdense.load.log': Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 106: /ycsbdense.load.log: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 106: /ycsbdense.load.log: Read-only file system
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 540: /schema_dump.csv: Read-only file system
schema dump at /schema_dump.csv
./demo/sqlserver/run-ycsb-sqlserver-source.sh: line 331: sqlcmd: command not found
touch: cannot touch '/list_table_counts.sql': Read-only file system
table count at /list_table_counts.csv


FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/arcsrc/sqlserver/config/list_table_counts.csv'

# Workload

Choose the options in the UI and run the cell below it to start the workload (YCSB).  

The YCSB update workload (workload A) has separate controls for the Dense and Sparse table groups.  All tables within a group share that group's settings.  
1. Each table's TPS (throughput per second)
   1. 0=fast as possible
   2. 1=1 TPS
   3. 10=10 TPS
2. Each table's threads (concurrency) used to achieve the desired TPS.
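
The `DML Ratio` slider under Configure is a two-handle range over 0 to 1000 that is split into delete, update, and insert shares before YCSB starts. The sketch below mirrors the arithmetic in the start cell, including the cap that limits deletes to 1/1000 of operations. The function name `dml_proportions` is hypothetical.

```python
def dml_proportions(slider_value):
    """Split a (low, high) slider value over 0..1000 into DML proportions."""
    low, high = slider_value
    low = min(low, 1)  # the start cell caps the delete share at 1/1000
    return {
        "delete": low / 1000.0,
        "update": (high - low) / 1000.0,
        "insert": (1000 - high) / 1000.0,
    }


# Slider default [1, 999]
print(dml_proportions((1, 999)))
```

With the default slider value, almost all operations are updates, with a small share of deletes and inserts.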

## Configure

In [None]:
# show YCSB run controls
VBox([HBox([Label('DML Ratio'), delupdins_proportion]),
      HBox([Label('Sparse'), sparse_tps, sparse_threads, sparse_multiupdatesize, sparse_multiinsertsize, sparse_multideletesize]), 
      HBox([Label('Dense'),  dense_tps, dense_threads, dense_multiupdatesize, dense_multiinsertsize, dense_multideletesize])])


## Start

In [None]:
# start/restart YCSB run
# 1 = 0.1% of total TPS = delete.  Total of 0.1% * 100 = 10% (up to this many records can be deleted)
if (delupdins_proportion.value[0] > 1):
    delupdins_proportion.value = [1,delupdins_proportion.value[1]]
y_del_proportion=(delupdins_proportion.value[0]) / 1000.0
y_upd_proportion=(delupdins_proportion.value[1] - delupdins_proportion.value[0]) / 1000.0
y_ins_proportion=(1000 - delupdins_proportion.value[1]) / 1000.0
# set the min TPS to match the largest multi row size.  Otherwise, the txn will try to fill up the size
min_tps=max(sparse_multiupdatesize.value, sparse_multideletesize.value, sparse_multiinsertsize.value)
if (sparse_tps.value < min_tps):
    sparse_tps.value = min_tps
min_tps=max(dense_multiupdatesize.value, dense_multideletesize.value, dense_multiinsertsize.value)    
if (dense_tps.value < min_tps):
    dense_tps.value = min_tps
# start the actual run
subprocess.run(f"""export SRCDB_ARC_USER={src_username.value};
    . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    kill_ycsb;
    list_table_counts;       
    y_del_proportion={y_del_proportion}
    y_upd_proportion={y_upd_proportion}
    y_ins_proportion={y_ins_proportion}
    y_target_sparse={sparse_tps.value} 
    y_target_dense={dense_tps.value} 
    y_threads_sparse={sparse_threads.value} 
    y_threads_dense={dense_threads.value} 
    y_multiinsertsize_dense={dense_multiinsertsize.value} 
    y_multiupdatesize_dense={dense_multiupdatesize.value} 
    y_multideletesize_dense={dense_multideletesize.value} 
    y_multiinsertsize_sparse={sparse_multiinsertsize.value} 
    y_multiupdatesize_sparse={sparse_multiupdatesize.value} 
    y_multideletesize_sparse={sparse_multideletesize.value} 
    y_fieldlength_sparse={sparse_fieldlength.value} 
    y_fieldlength_dense={dense_fieldlength.value} 
    start_ycsb;""",
    shell=True,executable="bash",cwd=notebookpath)

# Arcion

Choose the options in the UI and run the cell below it to start the replication.  

The following controls are available in the demo.  
- Arcion - replication type and CDC method  
- Threads - control the parallelism.
- Target - null, Unity Catalog, or Delta Lake

NOTE: Full mode does not work at this time.

For SQL Server, change tracking and CDC are available in the demo.  

Performance is mainly controlled by the thread counts of the extract and apply processes.
Additional controls can be customized by editing the YAML files below directly.
- [CDC YAML files](./demo/sqlserver/yaml/cdc/)
- [Change Tracking YAML files](./demo/sqlserver/yaml/change/)
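
Before replication starts, the start cell checks that a personal access token was entered whenever the target is not `null`, and it selects the shell function from the CDC method (`start_change_arcion` or `start_cdc_arcion`). A minimal sketch of that decision follows; the function name `arcion_preflight` is hypothetical.

```python
def arcion_preflight(destination: str, access_token: str, cdc_mode: str):
    """Return the shell function to call, or None if a token is required but missing."""
    if destination != "null" and access_token == "":
        return None
    return f"start_{cdc_mode}_arcion"


print(arcion_preflight("null", "", "change"))
print(arcion_preflight("deltalake", "", "cdc"))
```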

## Configure
Make changes below and click `Run All Below`.  

In [None]:
# show Arcion and DBX controls
VBox([
      HBox([Label('RAM'), ram_percent]),
      HBox([Label('Modes'), repl_mode, cdc_mode]),
      HBox([Label('Target'), dbx_destinations, dbx_staging ]),
      HBox([Label('Threads'), snapshot_threads, realtime_threads, delta_threads]),
      ])

## Start

In [None]:
# start/restart Arcion
if ( f"{dbx_access_token.value}" == "" ) and ( f"{dbx_destinations.value}" != "null" ):
    print("personal access token not entered.")
else:
    # start a new run
    print (f"""{cdc_mode.value} {repl_mode.value}""")
    arcion_run = subprocess.run(f"""export ARCION_DOWNLOAD_URL='{arcion_download_url.value}';
    export SRCDB_ARC_USER={src_username.value};
    . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    kill_arcion;
    disable_cdc;
    disable_change_tracking;
    echo prog_dir=$PROG_DIR arcion_bin=$ARCION_BIN;
    cd $PROG_DIR;
    a_repltype='{repl_mode.value}'
    REPLICANT_MEMORY_PERCENTAGE='{ram_percent.value}.0'
    SRCDB_SNAPSHOT_THREADS='{snapshot_threads.value}' 
    SRCDB_REALTIME_THREADS='{realtime_threads.value}' 
    SRCDB_DELTA='{delta_threads.value}'
    DSTDB_TYPE='{dbx_destinations.value}'
    DSTDB_STAGE='{dbx_staging.value}'
    DBX_SPARK_URL='{dbx_spark_url.value}'
    DBX_DATABRICKS_URL='{dbx_databricks_url.value}'
    DBX_ACCESS_TOKEN='{dbx_access_token.value}'
    DBX_HOSTNAME='{dbx_hostname.value}'
    DBX_DBFS_ROOT='/{dbx_username.value}'
    DBX_USERNAME='{dbx_username.value}'
    start_{cdc_mode.value}_arcion;""",
    shell=True,executable="bash",cwd=notebookpath)
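
The start cell above interpolates widget values, including the access token, directly into a bash string, so a value that contains a space or a single quote could break the command. The following is a minimal sketch of quoting each value with the standard library's `shlex.quote` before building the assignment lines. The helper name `shell_assignments` and the sample values are hypothetical and are not part of the notebook.

```python
import shlex

def shell_assignments(values):
    """Render NAME=value lines with every value quoted for a POSIX shell."""
    return "\n".join(
        f"{name}={shlex.quote(str(value))}" for name, value in values.items()
    )

# Hypothetical stand-ins for the widget values used in the start cell.
demo_values = {
    "DSTDB_TYPE": "unity catalog",      # contains a space
    "DBX_ACCESS_TOKEN": "abc'def",      # contains a single quote
    "SRCDB_SNAPSHOT_THREADS": 4,        # non-string values are converted with str()
}
print(shell_assignments(demo_values))
```

The rendered lines could be placed into the command string where the hand-quoted assignments are today. Whether that is worth the extra step depends on how much the widget inputs are trusted.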

# MLflow

Save the artifacts in MLflow.

Artifacts are collected for 5 min (300 sec): 5 intervals of 60 sec each by default.
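
The cell under Start converts YCSB status lines into MLflow metrics. As a worked example of that conversion, the sketch below parses the sample status line quoted in that cell's comments and produces the same metric key layout (`ycsb_<op>_count_<table>` and `ycsb_<op>_avg_microsec_<table>`). The function name `ycsb_line_to_metrics` is hypothetical. It handles a single line only and does not track file positions.

```python
import datetime
import re

YCSB_SAMPLE = (
    "2024-03-07 10:05:38:240 410 sec: 409 operations; 1 current ops/sec; "
    "est completion in 116 days [UPDATE: Count=10, Max=15383, Min=6792, "
    "Avg=9264.6, 90=15359, 99=15383, 99.9=15383, 99.99=15383]"
)

def ycsb_line_to_metrics(line, table_name):
    """Return (log_time, metrics) for one status line, or None without a timestamp."""
    m = re.search(r"^(?P<dt>[0-9\-]+ [0-9:]+)", line)
    if m is None:
        return None
    log_time = datetime.datetime.strptime(m.group("dt"), "%Y-%m-%d %H:%M:%S:%f")
    metrics = {}
    # Each [...] group looks like "update: count=10, max=..., avg=..."
    for block in re.findall(r"\[([^]]*)\]", line.lower()):
        op, _, stats = block.partition(":")
        fields = dict(
            (key.strip(), value.strip())
            for key, _, value in (item.partition("=") for item in stats.split(","))
        )
        # Missing or empty fields fall back to 0, as the notebook cell does.
        metrics[f"ycsb_{op}_count_{table_name}"] = float(fields.get("count") or 0)
        metrics[f"ycsb_{op}_avg_microsec_{table_name}"] = float(fields.get("avg") or 0)
    return log_time, metrics

print(ycsb_line_to_metrics(YCSB_SAMPLE, "ycsbsparse"))
```

Looking fields up by name rather than by position is a design choice of this sketch. The notebook cell indexes the comma-separated values by position instead.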

## Start

In [None]:
# Run MLflow logging in a separate process so the notebook is not blocked; threads do not work with MLflow.

import mlflow
import time
import os
import re
import numpy as np
import pandas as pd
import requests
from multiprocessing import Process

def log_artifacts():
    pass

from file_read_backwards import FileReadBackwards
import datetime

# convert a YCSB log line to MLflow metrics
# time                    elapsed  cumulative      time period                                   per operations metric
#                         sec      operations      ops/sec
# 2024-03-07 10:05:38:240 410 sec: 409 operations; 1 current ops/sec; est completion in 116 days [UPDATE: Count=10, Max=15383, Min=6792, Avg=9264.6, 90=15359, 99=15383, 99.9=15383, 99.99=15383]
# ycsb_[update|update-failed]_count_tablename=x
# ycsb_[update|update-failed]_avg_microsec_tablename=x

ycsb_date_time_pattern = r"^(?P<dt>[0-9\-]+ [0-9\:]+)"  # at the beginning
ycsb_op_val_pattern = r'\[([^]]*)\]'                    # [Update: ] [Insert: ] ...

def parse_ycsb_log_to_metric(table_name="ycsbsparse",
                    file="/var/tmp/arcsrc/sqlserver/logs/ycsb/ycsb.ycsbsparse.log",
                    previous_log_time=None, 
                    metrics={},
                    ycsb_logfile_positions={}):
    with FileReadBackwards(file, encoding="utf-8") as ycsb_log_file:
        count=0
        for line in ycsb_log_file:            
            # parse date time and bail if already processed
            m = re.search(ycsb_date_time_pattern, line)
            if m is None:
                continue

            log_time=datetime.datetime.strptime(m.group('dt'), '%Y-%m-%d %H:%M:%S:%f')
            if log_time == ycsb_logfile_positions.get(table_name):
                return

            # parse [update: ...]
            m = re.findall(ycsb_op_val_pattern, line.lower())
            if not m:       # re.findall returns an empty list, never None
                continue
            
            ycsb_logfile_positions[table_name] = log_time
            for ops in m:
                op_vals=ops.split(":")                  # update: ....
                vals_array=op_vals[1].split(",")        # count=?, max=?, ...
                try:    
                    op_count=float(vals_array[0].split("=")[1])    # [0] count=?
                except (IndexError, ValueError):
                    op_count=0.0
                try:
                    op_avg=float(vals_array[3].split("=")[1])      # [3] avg=?  missing or unparsable when count=0
                except (IndexError, ValueError):
                    op_avg=0.0
                metrics[f"ycsb_{op_vals[0]}_count_{table_name}"]=op_count
                metrics[f"ycsb_{op_vals[0]}_avg_microsec_{table_name}"]=op_avg
            return

def get_ycsb_metrics(metrics=None):
    # metrics is unused; kept so existing callers do not break
    ycsb_current_metrics={}
    print(ycsb_logfile_positions)
    ycsb_tables = pd.read_csv(f"/var/tmp/{src_username.value}/sqlserver/config/list_table_counts.csv",
                              header=None, names=['table name','min key','max key','field count'])
    for table_name in ycsb_tables['table name']:
        table_name = table_name.lower()
        # previous_log_time is not passed: it is not defined in this cell and the parser ignores it
        parse_ycsb_log_to_metric(
            table_name=table_name, 
            file=f"/var/tmp/{src_username.value}/sqlserver/logs/ycsb/ycsb.{table_name}.log",
            metrics=ycsb_current_metrics,
            ycsb_logfile_positions=ycsb_logfile_positions)
    return ycsb_current_metrics

def get_prom_metrics(prom_metric_url=None,metric_prefix="",metric_step=None):
    # there is a limit on the number of metrics that you can log in a single log_batch call. This limit is typically 1000. 
    # timestamp=If unspecified, the number of milliseconds since the Unix epoch is used.
    # step=If unspecified, the default value of zero is used
    contents = requests.get(prom_metric_url, timeout=10)   # do not hang the logging loop if the exporter is down
    all_metrics = {}
    for line in contents.text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue    # skip blank lines and HELP/TYPE comments
        key_val=line.rsplit(' ',1)  # split from the end in case the key has spaces
        all_metrics[re.sub('[" {}=,]',"_",key_val[0])]=float(key_val[1])
    return all_metrics


def start_mlflow(max_intervals=5,experiment_id=None, log_interval_sec=60, all_params={}):
    # stop previous run
    mlflow_run = mlflow.active_run()
    if mlflow_run is not None:
        # upload final artifacts
        log_artifacts()
        print(f"""stopping previous MLflow {mlflow_run.info.run_id}""")
        mlflow.end_run()

    # start a new run
    if experiment_id == '':
        experiment_id=None
    mlflow.start_run(experiment_id=experiment_id, log_system_metrics=True)

    # params
    mlflow.log_params(params=all_params)

    # schema
    dataset_source=f"/var/tmp/{src_username.value}/sqlserver/config/list_table_counts.csv"
    mlflow.log_artifact(dataset_source)
    
    # data
    dataset_shape = pd.read_csv(dataset_source, header=None, names= ['table name','min key','max key','field count'])
    dataset = mlflow.data.from_pandas(dataset_shape, source=dataset_source)
    mlflow.log_input(dataset, context="training")    

    # wait to end
    # TODO: Make this smarter by checking whether the process is still running
    wait_count=0
    while wait_count < max_intervals:
        mlflow.log_metrics(metrics=get_prom_metrics(prom_metric_url="http://localhost:9399/metrics"))
        mlflow.log_metrics(metrics=get_prom_metrics(prom_metric_url="http://localhost:9100/metrics"))
        mlflow.log_metrics(metrics=get_ycsb_metrics())
        time.sleep(log_interval_sec)
        wait_count += 1

    # upload the rest of the artifacts generated /var/tmp/{src_username.value}/sqlserver/logs
    log_artifacts()
    # experiment done
    mlflow.end_run()

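def register_mlflow(exp_params):
    # stop the previous logger first so two processes never log at the same time
    try:
        mlflow_proc_state['proc'].terminate()
        print("previous MLflow process terminated")
    except (KeyError, ValueError):
        pass    # first run (no previous process recorded) or process already closed
    # pass the exp_params argument rather than the global current_exp_params
    mlflow_proc = Process(target=start_mlflow, kwargs={"experiment_id":experiment_id, "all_params":exp_params})
    mlflow_proc.start()   
    mlflow_proc_state['proc']       = mlflow_proc
    mlflow_proc_state['exp_params'] = exp_params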


current_exp_params=exp_params()
if not ('exp_params' in mlflow_proc_state):
    print("first run of mlflow")
    register_mlflow(current_exp_params)
elif current_exp_params != mlflow_proc_state['exp_params']:
    print("param changed. starting new mlflow")
    register_mlflow(current_exp_params)
elif not(mlflow_proc_state['proc'].is_alive()):
    print("mlflow stopped. starting new mlflow with new step")
    register_mlflow(current_exp_params)
else:
    print("no parameters changed. New MLflow experiment not needed.")
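
The comments in `get_prom_metrics` note that a single batch call is typically limited to about 1000 metrics. The sketch below shows one way to parse the Prometheus text format and then split the result into chunks of at most 1000 entries. It assumes metric lines carry no trailing timestamp, and the names `parse_prom_text`, `batches`, and `PROM_SAMPLE` are hypothetical. It does not call MLflow itself.

```python
import re

def parse_prom_text(text):
    """Parse Prometheus text-format lines into {sanitized_name: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        # Same character replacement as the notebook cell uses.
        metrics[re.sub(r'[" {}=,]', "_", key)] = float(value)
    return metrics

def batches(metrics, size=1000):
    """Yield dicts holding at most `size` metrics each."""
    items = list(metrics.items())
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])

PROM_SAMPLE = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25

node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

for chunk in batches(parse_prom_text(PROM_SAMPLE)):
    print(len(chunk), "metrics in this chunk")
```

Each chunk could then be passed to `mlflow.log_metrics` in turn. Whether chunking is needed depends on how many series the exporters on ports 9399 and 9100 expose.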

# Manually Kill Processes
Uncomment the lines below to kill the desired processes.

In [None]:
# subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_arcion;""",shell=True,executable="bash",cwd=notebookpath)
# subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_ycsb;""",shell=True,executable="bash",cwd=notebookpath)
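
After running one of the kill commands, it can help to confirm that the process is really gone. The sketch below is a minimal POSIX-only liveness check using signal 0. It assumes the process ID is already known; the notebook does not show how the Arcion or YCSB PID is obtained, so `pid` here is hypothetical.

```python
import os

def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)       # signal 0 checks for existence and sends nothing
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # process exists but belongs to another user
    return True

# The current Python process is always running.
print(pid_is_running(os.getpid()))
```

The same check could serve the TODO in `start_mlflow` about stopping the logging loop once the replication process has exited.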

In [None]:
import datetime
from file_read_backwards import FileReadBackwards

def parse_arcion_stats():
    with FileReadBackwards("/var/tmp/arcsrc/sqlserver/logs/3fb22d2a4/3fb22d2a4/replication_statistics_history_2024-03-07_3.CSV", encoding="utf-8") as BigFile:
        max_start_time = None
        all_metrics = {}
        unprocessed_tables = arcion_stats_csv_positions.copy()
        for line in BigFile:
            if line==arcion_stats_csv_header_lines:
                break
            tokens=line.split(",")
            print (line)

            key=tokens[0]+"_"+tokens[1]+"_"+tokens[2]

            # max start_time
            start_time_str=tokens[5]
            start_time=datetime.datetime.strptime(start_time_str, '%Y-%m-%dT%H:%M:%S.%f%z')

            if max_start_time is None:
                max_start_time = start_time
            
            if key in arcion_stats_csv_positions:
                if arcion_stats_csv_positions[key] < start_time:
                    # line has newer data
                    arcion_stats_csv_positions[key]=start_time
                else:
                    print("skip")
            else:
                arcion_stats_csv_positions[key]=start_time

            # continue until all tables are processed or time out
            unprocessed_tables.pop(key, None)
            if len(unprocessed_tables) == 0:
                break
            if (max_start_time - start_time).total_seconds() > 10:
                break