# Overview

Welcome to the YCSB, SQL Server, Arcion, and Databricks demo.

- Initial Setup
  - Click `View` -> `Results Only`
  - Click `Run All` ( takes about 5 min )
  - [Enter Arcion License and DBX Personal Access Token](#Enter-Arcion-License-and-DBX-Personal-Access-Token)
  - Click `Run All` ( takes less than 1 min )
- Iterate with the following:
  - [Change YCSB Data Size](#change-ycsb-data-size)
  - [Change YCSB TPS](#change-ycsb-tps)
  - [Change Arcion Target](#change-arcion-target)
  - [Start Arcion](#start-arcion)

Use Personal Compute:
- Standard DS4_v2 (28 GB RAM), which shows up as 22 GB available, for performance tests and demos
- Standard DS3_v2 (14 GB RAM), which shows up as 8.8 GB available, works but is a bit tight

- Step 1
  - Click `Run All` ( takes about 5 min )
  - Click **Menu Bar** -> Cluster -> Right Caret -> Web Terminal
    - type `tmux ls` 
    - type `tmux attach -t robert_lee_databricks_com`
- Step 2
  - Scroll Down to [Enter Arcion License and DBX Personal Access Token](#Enter-Arcion-License-and-DBX-Personal-Access-Token)
  - Enter `Arcion License`
  - Enter `Personal Access Token` (generate one with a **One Day** lifetime and delete it afterwards)
  - Click **Menu Bar** ->  Run -> Run All Below
- Step 3
  - for **Hive Meta Store**: Open new tab Catalog -> hive_metastore -> <your username>
    - find ycsbdense and ycsbsparse tables 
  - for **Unity Catalog**: Open new tab Catalog -> <your username> 
    - find ycsbdense and ycsbsparse tables 

- Step 4
  - Click Real-Time
  - Run just Arcion
  - Change YCSB Size
  - Watch real-time performance
- Step 5
  - Click Unity Catalog target
  - Select full replication mode
  - Run just Arcion

Note:
- When Arcion is running, bulk insert will wait if not enough CPU / RAM is available
- DBR 13 does not print output of subprocess.run
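
One possible workaround for the missing `subprocess.run` output is to capture the output and print it explicitly. The helper below is a minimal sketch: `run_and_print` is a hypothetical name that is not part of this notebook, and whether it restores the output on DBR 13 has not been verified here.

```python
import shutil
import subprocess

# Assumption: use whichever bash is on PATH (the notebook cells use /usr/bin/bash).
BASH = shutil.which("bash") or "/bin/sh"

def run_and_print(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capture stdout/stderr, and print both explicitly."""
    result = subprocess.run(cmd, shell=True, executable=BASH,
                            capture_output=True, text=True)
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        print(result.stderr, end="")
    return result
```

The returned `CompletedProcess` still carries `returncode`, so a cell can check for failures after printing.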


In [2]:
# prep python env
import subprocess
import math
import pandas as pd
import re
import ipywidgets as widgets
from ipywidgets import HBox, VBox, Label

# setup GUI elements

repl_mode = widgets.Dropdown(options=['snapshot', 'real-time', 'full'],value='snapshot',
    description='Replication:',
)
cdc_mode = widgets.Dropdown(options=['change', 'cdc'],value='change',
    description='CDC Method:',
)

snapshot_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Snapshot Threads:',
)

realtime_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Real Time Threads:',
)    

delta_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Delta Snapshot Threads:',
)    

dbx_destinations = widgets.Dropdown(options=['null', 'deltalake', 'unitycatalog'],value='null',
    description='Destinations:',
)
dbx_staging = widgets.Dropdown(options=['dbfs'],value='dbfs',
    description='Staging:',
)

sparse_cnt = widgets.BoundedIntText(value=1,min=1,max=100,
    description='Table Cnt:',
)
sparse_fieldcount = widgets.BoundedIntText(value=50,min=0,max=9000,
    description='# of Fields:',
)
sparse_fieldlength = widgets.BoundedIntText(value=10,min=1,max=1000,
    description='Field Len:',
)

sparse_tps = widgets.BoundedIntText(value=1,min=0,max=1000,
    description='TPS:',
)
sparse_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)
sparse_recordcount = widgets.Text(value="2K",
    description='Rec Cnt:',
)

sparse_fillpct = widgets.IntRangeSlider(value=[0,0],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

dense_cnt = widgets.BoundedIntText(value=1,min=1,max=100,
    description='Table Cnt:',
)
dense_fieldcount = widgets.BoundedIntText(value=10,min=0,max=9000,
    description='# of Fields:',
)
dense_fieldlength = widgets.BoundedIntText(value=100,min=1,max=1000,
    description='Field Len:',
)
dense_recordcount = widgets.Text(value="1K",
    description='Rec Cnt:',
)

dense_tps = widgets.BoundedIntText(value=1,min=0,max=1000,
    description='TPS:',
)
dense_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)

dense_fillpct = widgets.IntRangeSlider(value=[0,100],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

dbx_spark_url = widgets.Textarea(value='',
    description='Spark URL:',
)

dbx_databricks_url = widgets.Textarea(value='',
    description='Databricks URL:',
)

dbx_hostname = widgets.Textarea(value='',
    description='Hostname:',
)

dbx_username = widgets.Textarea(value='',
    description='Username:',
)

arcion_license = widgets.Textarea(value='',
    description='Arcion Lic',
)

dbx_access_token = widgets.Textarea(value='',
    description='Access Token',
)

dbx_default_catalog = widgets.Textarea(value='',
    description='Default Catalog',
)


# cluster where the notebook is running to auto populate the destinations
spark_url=""
databricks_url=""
workspaceUrl=""
username=""
try:
    cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    workspace_id =spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")

    # clusterName = spark.conf.get("spark.databricks.clusterUsageTags.clusterName")

    workspaceUrl = spark.conf.get("spark.databricks.workspaceUrl") # host name

    http_path = f"sql/protocolv1/o/{workspace_id}/{cluster_id}"

    spark_url=f"jdbc:spark://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;UID=token;"
    databricks_url=f"jdbc:databricks://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;UID=token;"

except Exception:
    pass
dbx_spark_url.value = spark_url
dbx_databricks_url.value = databricks_url
dbx_hostname.value = workspaceUrl

try:
    username = spark.sql("SELECT current_user()").collect()[0][0]
except Exception:
    username='arcdst'
dbx_username.value = re.sub('[.@]','_',username)

try:
    dbx_default_catalog.value=spark.conf.get("spark.databricks.sql.initial.catalog.name")
except Exception:
    pass
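
The setup cell derives the `Username` value, which is later passed to `setup_tmux` as the session name, by replacing `.` and `@` in the Databricks user name with `_`. The sketch below repeats that substitution in isolation; `session_name` is a hypothetical helper and the e-mail address is a made-up example.

```python
import re

def session_name(username: str) -> str:
    """Replace '.' and '@' with '_', as the setup cell does for dbx_username."""
    return re.sub('[.@]', '_', username)

print(session_name("jane.doe@example.com"))  # jane_doe_example_com
```

The result is the name to use with `tmux attach -t <session>` in the Web Terminal.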


## Enter Arcion License and DBX Personal Access Token

  - Enter `Arcion License`
  - Enter `Personal Access Token` (generate one with a **One Day** lifetime and delete it afterwards)
  - Click **Menu Bar** ->  Run -> Run All Below

Links: [Overview](#overview) | [Change YCSB Data Size](#change-ycsb-data-size) | [Change YCSB TPS](#change-ycsb-tps) | [Change Arcion Target](#change-arcion-target) | [Start Arcion](#start-arcion)

In [13]:
# enter license and DBX personal access token
VBox([HBox([Label('License'), arcion_license, dbx_access_token, dbx_default_catalog]),
      HBox([Label('Workspace'), dbx_spark_url, dbx_databricks_url, dbx_hostname, dbx_username]),
       ])

VBox(children=(HBox(children=(Label(value='License'), Textarea(value='', description='Arcion Lic'), Textarea(v…

## Setup Arcion, YCSB and SQL Server

In [4]:
# setup tmux, arcion, ycsb, sql server
subprocess.run(f""". ./bin/setup-tmux.sh; setup_tmux '{dbx_username.value}'""",shell=True,executable="/usr/bin/bash")
subprocess.run(f"""bin/install-sqlserver.sh""",shell=True,executable="/usr/bin/bash")
subprocess.run(f"""bin/download-jars.sh""",shell=True,executable="/usr/bin/bash")
subprocess.run(f"""ARCION_LICENSE='{arcion_license.value}' bin/install-arcion.sh""",shell=True,executable="/usr/bin/bash")
subprocess.run(f"""bin/install-ycsb.sh""",shell=True,executable="/usr/bin/bash")
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; ping_sql_cli;""",shell=True,executable="/usr/bin/bash")
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; create_user;""",shell=True,executable="/usr/bin/bash")
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; set_sqlserver_ram '{dbx_username.value}';""",shell=True,executable="/usr/bin/bash")

tmux session ready. session arcdst already exists
apt-utils already installed
mssql-server already installed
mssql-tools18 already installed
unixodbc-dev alrady installed
sqlserver already started
deltalake /opt/stage/libs/SparkJDBC42.jar found
lakehouse  /opt/stage/libs/DatabricksJDBC42.jar found
postgres  /opt/stage/libs/postgresql-42.7.1.jar found
mariadb  /opt/stage/libs/mariadb-java-client-3.3.2.jar found
oracle /opt/stage/libs/ojdbc8.jar found
log4j /opt/stage/libs/log4j-1.2.17.jar found
arcion  /opt/stage/arcion/replicant-cli/bin/replicant found
checking jar(s) in /opt/stage/arcion/24.01.25.1/replicant-cli/lib for updates
checking jar(s) in /opt/stage/arcion/24.01.25.1/lib for updates
checking jar(s) in /opt/stage/arcion/replicant-cli/lib for updates
checking jar(s) in /opt/stage/arcion/replicate-cli-23.05.31.29/lib for updates
checking jar(s) in /opt/stage/arcion/23.05.31.31/lib for updates
checking jar(s) in /opt/stage/arcion/24.01.25.7/lib for updates
checking jar(s) in /opt/

open terminal failed: not a terminal


24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
Msg 15025, Level 16, State 1, Server ron, Line 1
The server principal 'arcsrc' already exists.
Msg 1801, Level 16, State 3, Server ron, Line 1
Database 'arcsrc' already exists. Choose a different database name.
Changed database context to 'arcsrc'.
Msg 15023, Level 16, State 5, Server ron, Line 1
User, group, or role 'arcsrc' already exists in the current database.
replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
Configuration option 'show advanced options' changed from 1 to 1. Run the RECONFIGURE statement to install.
Configuration option 'max server memory (MB)' changed from 2048 to 2048. Run the RECONFIGURE statement to install.


CompletedProcess(args=". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; set_sqlserver_ram 'arcdst';", returncode=0)
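
Because the cell interpolates values into the command string, they are echoed back in the `CompletedProcess(args=...)` output. For secrets such as the license, one alternative is to pass the value through the process environment. This is a sketch only: `run_with_secret` and `DEMO_SECRET` are hypothetical names, and the install scripts are assumed to read the variable from the environment.

```python
import os
import shutil
import subprocess

# Assumption: use whichever bash is on PATH (the notebook cells use /usr/bin/bash).
BASH = shutil.which("bash") or "/bin/sh"

def run_with_secret(cmd: str, name: str, secret: str) -> subprocess.CompletedProcess:
    """Run cmd with `secret` exported as environment variable `name`."""
    env = {**os.environ, name: secret}  # the secret never enters the command string
    return subprocess.run(cmd, shell=True, executable=BASH, env=env,
                          capture_output=True, text=True)

# DEMO_SECRET and its value are placeholders for illustration only.
result = run_with_secret('printf %s "$DEMO_SECRET"', "DEMO_SECRET", "not-a-real-license")
print("not-a-real-license" in str(result.args))  # False
```

Note that captured `stdout` is still part of the `CompletedProcess` repr, so the called script should not print the secret either.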

# Customize schema and size

Existing tables will be appended with additional rows if the `Fill Range` is the same.  
Increase the `Table Count` to create additional tables.  

The following options are available:
- Table count (Table Cnt): The number of tables to create.  
  - Table names are `ycsbdense`, `ycsbdense2`, `ycsbdense3`, ... and `ycsbsparse`, `ycsbsparse2`, `ycsbsparse3`, ...
- Number of Fields (# of Fields): The number of fields per table.  
  - The field names are `FIELD0`, `FIELD1`, `FIELD2`, ...
- Field Length (Field Len): The length of random character data populated per field.  
- Record Count (Rec Cnt): The number of records generated per table.
  - Note the use of the `K`, `M`, `B`, ... suffix at the end, for example `2K`.
- Fill Range: The relative start and end range of fields that are populated with data.  By default: 
    - sparse tables are all NULLs because the fill range is 0% to 0%
    - dense tables have all fields populated because the fill range is 0% to 100%

```sql
[localhost][arcsrc] 1> \describe ycsbsparse
+-------------+-------------+-----------+-------------+----------------+-------------+
| TABLE_SCHEM | COLUMN_NAME | TYPE_NAME | COLUMN_SIZE | DECIMAL_DIGITS | IS_NULLABLE |
+-------------+-------------+-----------+-------------+----------------+-------------+
| dbo         | YCSB_KEY    | int       |          10 |              0 | NO          |
| dbo         | FIELD0      | text      |  2147483647 |         [NULL] | YES         |
| dbo         | FIELD1      | text      |  2147483647 |         [NULL] | YES         |
```
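
The `Rec Cnt` suffixes can be read as decimal multipliers. The sketch below expands them in Python for illustration; the actual expansion happens inside the demo's shell scripts, `expand_count` is a hypothetical helper, and the values for `M` and `B` are assumptions (only `K` = 1000 is visible in the cell output).

```python
# Assumed multipliers: K is confirmed by the load output, M and B are assumptions.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}

def expand_count(text: str) -> int:
    """Expand a record count such as '2K' into an integer."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

print(expand_count("2K"))  # 2000
print(expand_count("1M"))  # 1000000
```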

## Change YCSB Data Size
Make changes below and click `Run All Below`.  
Links: [Overview](#overview) | [Change YCSB Data Size](#change-ycsb-data-size) | [Change YCSB TPS](#change-ycsb-tps) | [Change Arcion Target](#change-arcion-target)

In [5]:
# show YCSB Data Controls
VBox([HBox([Label('Sparse'), sparse_cnt, sparse_fieldcount, sparse_fieldlength, sparse_recordcount, sparse_fillpct]),
    HBox([Label('Dense'),  dense_cnt, dense_fieldcount, dense_fieldlength, dense_recordcount, dense_fillpct])])

VBox(children=(HBox(children=(Label(value='Sparse'), BoundedIntText(value=1, description='Table Cnt:', min=1),…

In [6]:
# run load_sparse_data_cnt and load_dense_data_cnt 
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    y_fieldcount={sparse_fieldcount.value} 
    y_fieldlength={sparse_fieldlength.value}  
    y_recordcount={sparse_recordcount.value} 
    y_fillstart={math.ceil((sparse_fillpct.value[0] * sparse_fieldcount.value) / 100)}      
    y_fillend={int((sparse_fillpct.value[1] * sparse_fieldcount.value) / 100)}      
    load_sparse_data_cnt {sparse_cnt.value};
    y_fieldcount={dense_fieldcount.value} 
    y_fieldlength={dense_fieldlength.value} 
    y_recordcount={dense_recordcount.value} 
    y_fillstart={math.ceil((dense_fillpct.value[0] * dense_fieldcount.value) / 100)}      
    y_fillend={int((dense_fillpct.value[1] * dense_fieldcount.value) / 100)}      
    load_dense_data_cnt {dense_cnt.value};
    dump_schema;
    list_table_counts""",
    shell=True,executable="/usr/bin/bash") 


replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
Starting type=sparse inst=1
skip table create
skip load need existing count 2000 -gt 1000000 && field 50 -eq 50 
Starting type=dense inst=1
skip table create
skip load need existing count 1000 -gt 100000 && field 10 -eq 10 


schema dump at /tmp/schema_dump.csv
table count at /tmp/list_table_counts.csv


CompletedProcess(args='. ./demo/sqlserver/run-ycsb-sqlserver-source.sh; \n    y_fieldcount=50 \n    y_fieldlength=10  \n    y_recordcount=2K \n    y_fillstart=0      \n    y_fillend=0      \n    load_sparse_data_cnt 1;\n    y_fieldcount=10 \n    y_fieldlength=100 \n    y_recordcount=1K \n    y_fillstart=0      \n    y_fillend=10      \n    load_dense_data_cnt 1;\n    dump_schema;\n    list_table_counts', returncode=0)
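
The load cell converts the `Fill Range` percentages into field positions: the start is rounded up and the end is truncated. The sketch below isolates that arithmetic; `fill_bounds` is a hypothetical helper, not a function in the notebook.

```python
import math

def fill_bounds(fill_pct, field_count):
    """Map a (start %, end %) fill range to (y_fillstart, y_fillend)."""
    start_pct, end_pct = fill_pct
    fill_start = math.ceil(start_pct * field_count / 100)  # round up, as in the cell
    fill_end = int(end_pct * field_count / 100)            # truncate, as in the cell
    return fill_start, fill_end

print(fill_bounds((0, 0), 50))    # (0, 0)   sparse default
print(fill_bounds((0, 100), 10))  # (0, 10)  dense default
print(fill_bounds((25, 75), 10))  # (3, 7)
```

The first two results match the `y_fillstart` and `y_fillend` values shown in the `CompletedProcess` output for the default settings.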

In [7]:
# show loaded tables
pd.read_csv ('/tmp/list_table_counts.csv',header=None, names= ['table name','min key','max key','field count'])

Unnamed: 0,table name,min key,max key,field count
0,YCSBDENSE,0,99999,10
1,YCSBSPARSE,0,999999,50
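
The `record_count` values that the YCSB run later reports appear to equal `max key - min key + 1`. The sketch below checks that relationship against the two rows shown above; `record_counts` is a hypothetical helper and the sample text is copied from the output, not read from `/tmp/list_table_counts.csv`.

```python
import csv
import io

# Rows copied from the table-count output above.
SAMPLE = "YCSBDENSE,0,99999,10\nYCSBSPARSE,0,999999,50\n"

def record_counts(csv_text):
    """Return {table name: record count} assuming keys are contiguous."""
    counts = {}
    for name, min_key, max_key, _field_count in csv.reader(io.StringIO(csv_text)):
        counts[name] = int(max_key) - int(min_key) + 1
    return counts

print(record_counts(SAMPLE))  # {'YCSBDENSE': 100000, 'YCSBSPARSE': 1000000}
```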


# Run YCSB and Arcion in the background

## Start/Restart YCSB workload at 1 TPS

Choose the options in the UI and run the cell below it to start the workload (YCSB).  


The YCSB update workload (workload A) has separate controls for the Dense and Sparse table groups. All tables within a group share that group's settings.  
1. Each table's TPS (target transactions per second)
   1. 0=as fast as possible
   2. 1=1 TPS
   3. 10=10 TPS
2. Each table's threads (concurrency) used to achieve the desired TPS.
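
To see how TPS and threads interact, the sketch below models the pacing under the assumption that the target rate is split evenly across threads. This is an illustrative model, not YCSB's actual code, and `per_thread_interval` is a hypothetical helper.

```python
def per_thread_interval(tps: int, threads: int) -> float:
    """Seconds each thread waits between operations; 0.0 means unthrottled."""
    if tps <= 0:
        return 0.0  # TPS of 0 is treated as "as fast as possible"
    return threads / tps

print(per_thread_interval(0, 1))   # 0.0
print(per_thread_interval(1, 1))   # 1.0
print(per_thread_interval(10, 2))  # 0.2
```

Under this model, adding threads does not raise the total rate when a TPS target is set; it only helps when a single thread cannot keep up with the target.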

## Change YCSB TPS 
Make changes below and click `Run All Below`.  
Links: [Overview](#overview) | [Change YCSB Data Size](#change-ycsb-data-size) | [Change YCSB TPS](#change-ycsb-tps) | [Change Arcion Target](#change-arcion-target) | [Start Arcion](#start-arcion)

In [8]:
# show YCSB run controls
VBox([HBox([Label('Sparse'), sparse_tps, sparse_threads]), HBox([Label('Dense'),  dense_tps, dense_threads])])

VBox(children=(HBox(children=(Label(value='Sparse'), BoundedIntText(value=1, description='TPS:', max=1000), Bo…

In [9]:
# start/restart YCSB run
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    kill_ycsb;
    y_target_sparse={sparse_tps.value} 
    y_target_dense={dense_tps.value} 
    y_threads_sparse={sparse_threads.value} 
    y_threads_dense={dense_threads.value} 
    y_fieldlength_sparse={sparse_fieldlength.value} 
    y_fieldlength_dense={dense_fieldlength.value} 
    start_ycsb;""",
    shell=True,executable="/usr/bin/bash")

replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
running ycsb on /tmp/list_table_counts.csv
YCSBDENSE,0,99999,10
table_name=ycsbdense tabletype=dense record_count=100000 field_count=10 _y_threads=1 _y_target=1 _y_fieldlength=100
ycsb ycsbdense pid 3960416
ycsb ycsbdense log is at /var/tmp/sqlserver/logs/ycsb.ycsbdense.log
ycsb ycsbdense can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_ycsb)
YCSBSPARSE,0,999999,50
table_name=ycsbsparse tabletype=sparse record_count=1000000 field_count=50 _y_threads=1 _y_target=1 _y_fieldlength=10
ycsb ycsbsparse pid 3960425
ycsb ycsbsparse log is at /var/tmp/sqlserver/logs/ycsb.ycsbsparse.log
ycsb ycsbsparse can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_ycsb)


CompletedProcess(args='. ./demo/sqlserver/run-ycsb-sqlserver-source.sh; \n    kill_ycsb;\n    y_target_sparse=1 \n    y_target_dense=1 \n    y_threads_sparse=1 \n    y_threads_dense=1 \n    y_fieldlength_sparse=10 \n    y_fieldlength_dense=100 \n    start_ycsb;', returncode=0)

## About Arcion

Choose the options in the UI and run the cell below it to start the replication.  

The following controls are available in the demo.  
- Arcion - replication type and CDC method  
- Threads - controls the parallelism.
- Target - null, Unity Catalog or Delta Lake

NOTE: Full mode does not work at this time.

For SQL Server, both change tracking and CDC are available in the demo.  

Performance is mainly controlled by the thread counts of the extract and apply processes.
Additional settings can be customized by editing the YAML files below directly.
- [CDC YAML files](./demo/sqlserver/yaml/cdc/)
- [Change Tracking YAML files](./demo/sqlserver/yaml/change/)
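
A selection could be checked before starting the replication. The sketch below uses the option lists from the widget definitions and the note that full mode does not work; `check_selection` is a hypothetical helper that the notebook does not call.

```python
# Option values copied from the widget definitions in the setup cell.
REPL_MODES = ("snapshot", "real-time", "full")
CDC_MODES = ("change", "cdc")
TARGETS = ("null", "deltalake", "unitycatalog")

def check_selection(repl_mode, cdc_mode, target):
    """Return a list of problems with the selection; an empty list means OK."""
    problems = []
    if repl_mode not in REPL_MODES:
        problems.append(f"unknown replication mode: {repl_mode}")
    if cdc_mode not in CDC_MODES:
        problems.append(f"unknown CDC method: {cdc_mode}")
    if target not in TARGETS:
        problems.append(f"unknown target: {target}")
    if repl_mode == "full":
        problems.append("full mode does not work at this time")
    return problems

print(check_selection("snapshot", "change", "null"))  # []
```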

## Change Arcion Target
Make changes below and click `Run All Below`.  
Links: [Overview](#overview) | [Change YCSB Data Size](#change-ycsb-data-size) | [Change YCSB TPS](#change-ycsb-tps) | [Change Arcion Target](#change-arcion-target) | [Start Arcion](#start-arcion)

In [10]:
# show Arcion and DBX controls

VBox([
      HBox([Label('Arcion'), repl_mode, cdc_mode]),
      HBox([Label('Target'), dbx_destinations, dbx_staging]),
      HBox([Label('Threads'), snapshot_threads, realtime_threads, delta_threads]),
      ])

VBox(children=(HBox(children=(Label(value='Arcion'), Dropdown(description='Replication:', options=('snapshot',…

## Start Arcion
Click `Run All Below` and watch the progress in `tmux`.  
Links: [Overview](#overview) | [Change YCSB Data Size](#change-ycsb-data-size) | [Change YCSB TPS](#change-ycsb-tps) | [Change Arcion Target](#change-arcion-target) | [Start Arcion](#start-arcion)

In [11]:
# start/restart Arcion
print (f"""{cdc_mode.value} {repl_mode.value}""")
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    echo $PROG_DIR;
    cd $PROG_DIR;
    kill_arcion;
    a_repltype='{repl_mode.value}'
    SRCDB_SNAPSHOT_THREADS='{snapshot_threads.value}' 
    SRCDB_REALTIME_THREADS='{realtime_threads.value}' 
    SRCDB_DELTA='{delta_threads.value}'
    DSTDB_TYPE='{dbx_destinations.value}'
    DSTDB_STAGE='{dbx_staging.value}'
    DBX_SPARK_URL='{dbx_spark_url.value}'
    DBX_DATABRICKS_URL='{dbx_databricks_url.value}'
    DBX_ACCESS_TOKEN='{dbx_access_token.value}'
    DBX_HOSTNAME='{dbx_hostname.value}'
    DBX_DBFS_ROOT='/{dbx_username.value}'
    DBX_USERNAME='{dbx_username.value}'
    start_{cdc_mode.value}_arcion;""",
    shell=True,executable="/usr/bin/bash")


change snapshot
replicant


24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
/home/rslee/github/dbx/ingestion/demo/sqlserver
enable change tracking on database arcsrc
skip ALTER DATABASE arcsrc SET CHANGE_TRACKING = ON  (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
skip ALTER TABLE replicate_io_audit_ddl ENABLE CHANGE_TRACKING;
skip ALTER TABLE replicate_io_audit_tbl_cons ENABLE CHANGE_TRACKING;
skip ALTER TABLE replicate_io_audit_tbl_schema ENABLE CHANGE_TRACKING;
skip ALTER TABLE YCSBDENSE ENABLE CHANGE_TRACKING;
skip ALTER TABLE YCSBSPARSE ENABLE CHANGE_TRACKING;
replicant
arcion pid 3960749
arcion console is at /var/tmp/sqlserver/logs/3fa794eaa/arcion.log
arcion log is at /var/tmp/sqlserver/logs/3fa794eaa
arcion can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_arcion)


+ cd /var/tmp/sqlserver/logs/3fa794eaa
+ set +x
+ JAVA_HOME=
+ REPLICANT_MEMORY_PERCENTAGE=25.0
+ JAVA_OPTS='"-Djava.security.egd=file:/dev/urandom" "-Doracle.jdbc.javaNetNio=false" "-XX:-UseCompressedOops"'
+ /opt/stage/arcion/24.01.25.1/bin/replicant snapshot /var/tmp/sqlserver/logs/3fa794eaa/src.yaml /var/tmp/sqlserver/logs/3fa794eaa/dst.yaml --applier /var/tmp/sqlserver/logs/3fa794eaa/applier.yaml --general /var/tmp/sqlserver/logs/3fa794eaa/general.yaml --extractor /var/tmp/sqlserver/logs/3fa794eaa/extractor.yaml --filter /var/tmp/sqlserver/logs/3fa794eaa/filter.yaml --metadata /var/tmp/sqlserver/logs/3fa794eaa/metadata.yaml --overwrite --id 3fa794eaa --replace


CompletedProcess(args=". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; \n    echo $PROG_DIR;\n    cd $PROG_DIR;\n    kill_arcion;\n    a_repltype='snapshot'\n    SRCDB_SNAPSHOT_THREADS='1' \n    SRCDB_REALTIME_THREADS='1' \n    SRCDB_DELTA='1'\n    DSTDB_TYPE='null'\n    DSTDB_STAGE='dbfs'\n    DBX_SPARK_URL=''\n    DBX_DATABRICKS_URL=''\n    DBX_ACCESS_TOKEN=''\n    DBX_HOSTNAME=''\n    DBX_DBFS_ROOT='/arcdst'\n    DBX_USERNAME='arcdst'\n    start_change_arcion;", returncode=0)
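
The start cell wraps each value in single quotes, which breaks if a value itself contains a single quote. A more defensive variant is sketched below with `shlex.quote` from the standard library; `shell_assignments` is a hypothetical helper and is not used by the notebook.

```python
import shlex

def shell_assignments(settings: dict) -> str:
    """Build 'NAME=value ...' with every value quoted safely for the shell."""
    return " ".join(f"{name}={shlex.quote(str(value))}"
                    for name, value in settings.items())

print(shell_assignments({"DSTDB_TYPE": "null",
                         "DBX_DBFS_ROOT": "/arcdst",
                         "DBX_HOSTNAME": ""}))
# DSTDB_TYPE=null DBX_DBFS_ROOT=/arcdst DBX_HOSTNAME=''
```

The resulting string can be placed ahead of a function call such as `start_change_arcion` in the command passed to `subprocess.run`.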