## Purpose of Demo
This demo environment is designed to help answer the following common questions:
- How does Arcion work?
- How do snapshot and real-time (CDC) replication perform?
- Can Arcion handle my capacity?

Customers should use a pilot for:
- Schema and data type validation
- Tuning based on environmental factors
- Data value conversion
- HA and resilience testing 
  
## The demo is a simplified model of a complex environment

The following diagram shows the demo environment.
```text
  +--------Databricks Personal Compute Cluster--------------------------+  
  |  +-----------+    +-----------+    +----------+     +------------+  |
  |  |  Workload |    | Source DB |    |  Arcion  |     | Target DB  |  | 
  |  |           |    |           | <--|          |     |            |  |
  |  |   YCSB    | -->| SQL Server|    | Notebook | --> | Databricks |  |
  |  |           |    |           | -->|   CLI    |     |            |  |
  |  +-----------+    +-----------+    +----------+     +------------+  |
  +---------------------------------------------------------------------+
```

In production, the following separation is expected.
```text
  +----------Customer Cloud---------+   F   +---- Databricks Serverless ------+  
  |  +-----------+    +-----------+ |   I   | +----------+     +------------+ |
  |  |  Workload |    | Source DB | |   R   | |  Arcion  |     | Target DB  | | 
  |  |           |    |           | | <-E-- | |          |     |            | |
  |  |   YCSB    | -->| SQL Server| |   W   | | Notebook |  -->| Databricks | |
  |  |           |    |           | | --A-> | |    UI    |     |            | |
  |  +-----------+    +-----------+ |   L   | +----------+     +------------+ |  
  +---------------------------------+   L   +---------------------------------+
```
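
Because the production layout places a firewall between the source database and Arcion, it can help to confirm network reachability before starting replication. The following is a minimal sketch and not part of the demo scripts: `can_reach` is a hypothetical helper, and the host is a placeholder. Port 1433 is the one the demo's `bcp` commands connect to.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Placeholder address: replace with the source database host as seen from Arcion.
print(can_reach("127.0.0.1", 1433))
```

A `False` result only shows that the TCP connection failed. It does not distinguish a firewall rule from a stopped database.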

## Schema 

- An arbitrary number of dense and sparse tables can be defined.  
- Each table can have a defined number of fields.  
- The amount of data in each field can be defined.  

Dense tables have all fields populated with data up to the maximum length of the field.
Sparse tables have their fields populated with NULLs.

```text
+-------+    +-------+  +--------+    +--------+
| Dense |    | Dense |  | Sparse |    | Sparse | 
| Table | ...| Table |  | Table  | ...| Table  |
|   1   |    |   n   |  |   1    |    |    n   |
+-------+    +-------+  +--------+    +--------+
```
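
To make the dense/sparse distinction concrete, here is a minimal sketch of what one row of each kind might look like. `make_row` and the `FIELD<n>` column names are illustrative assumptions. The demo itself loads data with `bcp` from generated files.

```python
import random
import string

def make_row(fieldcount: int, fieldlength: int, dense: bool) -> dict:
    """Build one illustrative row: full-length strings if dense, else NULLs."""
    row = {}
    for i in range(fieldcount):
        if dense:
            # Dense: every field is filled to the maximum field length.
            row[f"FIELD{i}"] = "".join(
                random.choices(string.ascii_letters, k=fieldlength)
            )
        else:
            # Sparse: every field is NULL (None in Python).
            row[f"FIELD{i}"] = None
    return row

dense_row = make_row(fieldcount=3, fieldlength=5, dense=True)
sparse_row = make_row(fieldcount=3, fieldlength=5, dense=False)
print(dense_row)
print(sparse_row)
```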

This makes it possible to model the capacity and performance of:
- star schemas
- IoT data
- big data

## Workload

The YCSB workload in this demo is update-only:
- Updates only
- No inserts
- No deletes
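
An update-only mix can be expressed with the standard YCSB core workload properties. The sketch below renders such a properties file as text. `update_only_properties` is a hypothetical helper, and the demo's own scripts may set these values differently.

```python
def update_only_properties(fieldcount: int, fieldlength: int, recordcount: int) -> str:
    """Render YCSB core workload properties for an update-only run."""
    props = {
        "readproportion": 0,
        "updateproportion": 1.0,   # all operations are updates
        "insertproportion": 0,
        "scanproportion": 0,
        "fieldcount": fieldcount,
        "fieldlength": fieldlength,
        "recordcount": recordcount,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())

print(update_only_properties(fieldcount=10, fieldlength=100, recordcount=100000))
```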

## Install Arcion, YCSB and SQL Server

In [15]:
# prep python env
%pip install ipywidgets
import subprocess
import math

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [12]:
# install YCSB, Arcion, Database, JAR files
print (subprocess.run("bin/download-jars.sh",stdout=subprocess.PIPE).stdout.decode('utf-8'))
print (subprocess.run("bin/install-arcion.sh",stdout=subprocess.PIPE).stdout.decode('utf-8'))
print (subprocess.run("bin/install-ycsb.sh",stdout=subprocess.PIPE).stdout.decode('utf-8'))
print (subprocess.run("bin/install-sqlserver.sh",stdout=subprocess.PIPE).stdout.decode('utf-8'))

deltalake /opt/stage/libs/SparkJDBC42.jar found
lakehouse  /opt/stage/libs/DatabricksJDBC42.jar found
postgres  /opt/stage/libs/postgresql-42.7.1.jar found
mariadb  /opt/stage/libs/mariadb-java-client-3.3.2.jar found
oracle /opt/stage/libs/ojdbc8.jar found
log4j /opt/stage/libs/log4j-1.2.17.jar found



arcion  /opt/stage/arcion/replicant-cli/bin/replicant found
checking jar(s) in /opt/stage/arcion/24.01.25.1/replicant-cli/lib for updates
checking jar(s) in /opt/stage/arcion/24.01.25.1/lib for updates
checking jar(s) in /opt/stage/arcion/replicant-cli/lib for updates
checking jar(s) in /opt/stage/arcion/replicate-cli-23.05.31.29/lib for updates
checking jar(s) in /opt/stage/arcion/23.05.31.31/lib for updates
checking jar(s) in /opt/stage/arcion/24.01.25.7/lib for updates
checking jar(s) in /opt/stage/arcion/23.09.29.11/lib for updates

YCSB  /opt/stage/ycsb/ycsb-jdbc-binding-0.18.0-SNAPSHOT  found
numfmt found
checking jar(s) in /opt/stage/ycsb/ycsb-jdbc-binding-0.18.0-SNAPSHOT/lib for updates

sqlserver found
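
The install cell above prints each script's output but does not check whether the script succeeded. A minimal sketch of a wrapper that raises on a non-zero exit code follows. `run_script` is a hypothetical helper, and `echo` stands in for the real install scripts.

```python
import subprocess

def run_script(cmd: list[str]) -> str:
    """Run a command, raise CalledProcessError on failure, return its output."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so errors are visible
        text=True,
        check=True,                # raise if the exit code is non-zero
    )
    return result.stdout

# Stand-in command; in the notebook this could be ["bin/install-ycsb.sh"].
print(run_script(["echo", "ok"]))
```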



In [8]:
# setup GUI elements
from libpython.arcion_control import *    
from libpython.ycsb_control import *    
#show_arcion_config()
#show_ycsb_config()

repl_mode = widgets.Dropdown(options=['snapshot', 'real-time', 'full'],value='snapshot',
    description='Replication:',
)
cdc_mode = widgets.Dropdown(options=['change', 'cdc'],value='change',
    description='CDC Method:',
)

snapshot_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Snapshot Threads:',
)

realtime_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Real Time Threads:',
)    

delta_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Delta Snapshot Threads:',
)    


sparse_cnt = widgets.BoundedIntText(value=1,min=1,max=100,
    description='Table Cnt:',
)
sparse_fieldcount = widgets.BoundedIntText(value=50,min=0,max=9000,
    description='# of Fields:',
)
sparse_fieldlength = widgets.BoundedIntText(value=10,min=1,max=1000,
    description='Field Len:',
)

sparse_tps = widgets.BoundedIntText(value=1,min=0,max=1000,
    description='TPS:',
)
sparse_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)
sparse_recordcount = widgets.Text(value="1M",
    description='Rec Cnt:',
)

sparse_fillpct = widgets.IntRangeSlider(value=[0,0],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

dense_cnt = widgets.BoundedIntText(value=1,min=1,max=100,
    description='Table Cnt:',
)
dense_fieldcount = widgets.BoundedIntText(value=10,min=0,max=9000,
    description='# of Fields:',
)
dense_fieldlength = widgets.BoundedIntText(value=100,min=1,max=1000,
    description='Field Len:',
)
dense_recordcount = widgets.Text(value="100K",
    description='Rec Cnt:',
)

dense_tps = widgets.BoundedIntText(value=1,min=0,max=1000,
    description='TPS:',
)
dense_threads = widgets.BoundedIntText(value=1,min=1,max=8,
    description='Threads:',
)

dense_fillpct = widgets.IntRangeSlider(value=[0,100],min=0,max=100,step=1,
    description='Fill Range:', orientation='horizontal', readout=False
)

# Customize YCSB workload characteristics

In [9]:
# show YCSB Data Controls
VBox([HBox([Label('Sparse'), sparse_cnt, sparse_fieldcount, sparse_fieldlength, sparse_recordcount, sparse_fillpct]),
    HBox([Label('Dense'),  dense_cnt, dense_fieldcount, dense_fieldlength, dense_recordcount, dense_fillpct])])

VBox(children=(HBox(children=(Label(value='Sparse'), BoundedIntText(value=1, description='Table Cnt:', min=1),…
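
The `Rec Cnt` fields hold text such as `1M` or `100K` rather than integers. A minimal sketch of expanding such values in Python is shown below. `parse_count` is a hypothetical helper, and the SI convention (1K = 1000) is an assumption about how the shell scripts interpret the suffix.

```python
def parse_count(text: str) -> int:
    """Convert '100K', '1M', '2G', or a plain integer string to an int."""
    multipliers = {"K": 10**3, "M": 10**6, "G": 10**9}  # SI, assumed
    text = text.strip().upper()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("1M"))    # 1000000
print(parse_count("100K"))  # 100000
```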

## Create SQL Server user, create and load YCSB data sets

In [14]:
print(f"dense={dense_fillpct.value}")
print(f"sparse={sparse_fillpct.value}")

# run load_sparse_data_cnt and load_dense_data_cnt 
subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    create_user;
    y_fieldcount={sparse_fieldcount.value} 
    y_fieldlength={sparse_fieldlength.value}  
    y_recordcount={sparse_recordcount.value} 
    y_fillstart={math.ceil((sparse_fillpct.value[0] * sparse_fieldcount.value) / 100)}      
    y_fillend={int((sparse_fillpct.value[1] * sparse_fieldcount.value) / 100)}      
    load_sparse_data_cnt {sparse_cnt.value};
    y_fieldcount={dense_fieldcount.value} 
    y_fieldlength={dense_fieldlength.value} 
    y_recordcount={dense_recordcount.value} 
    y_fillstart={math.ceil((dense_fillpct.value[0] * dense_fieldcount.value) / 100)}      
    y_fillend={int((dense_fillpct.value[1] * dense_fieldcount.value) / 100)}      
    load_dense_data_cnt {dense_cnt.value};
    dump_schema;
    list_table_counts""",
    shell=True,executable="/usr/bin/bash") 


dense=(30, 67)
sparse=(0, 0)
replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
Msg 15025, Level 16, State 1, Server ron, Line 2
The server principal 'arcsrc' already exists.
Msg 1801, Level 16, State 3, Server ron, Line 1
Database 'arcsrc' already exists. Choose a different database name.
Changed database context to 'arcsrc'.
Msg 15023, Level 16, State 5, Server ron, Line 1
User, group, or role 'arcsrc' already exists in the current database.
Starting type=sparse inst=1
Msg 2714, Level 16, State 6, Server ron, Line 2
There is already an object named 'YCSBSPARSE' in the database.
/home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_ycsbsparse.sql
inserting insertstart=0 insert ends at ycsb_key=9
y_fillcount=
/home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_sparsetable.fmt
data file to be purged /home/rslee/github/dbx/ingestion/demo/sqlserver/config/tmp.tXtvL97crl
SQLState = S1002, NativeError = 0
Error = [Microsoft][ODBC Driver 18 for SQL Ser

+ bcp YCSBSPARSE in /home/rslee/github/dbx/ingestion/demo/sqlserver/config/tmp.tXtvL97crl -S 127.0.0.1,1433 -U arcsrc -P Passw0rd -u -d arcsrc -f /home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_sparsetable.fmt -b 1
+ tee /home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_sparsetable.log

real	0m0.030s
user	0m0.006s
sys	0m0.005s
+ set +x
+ bcp YCSBDENSE in /home/rslee/github/dbx/ingestion/demo/sqlserver/config/tmp.4x4wGw3D3y -S 127.0.0.1,1433 -U arcsrc -P Passw0rd -u -d arcsrc -f /home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_densetable.fmt -b 2
+ tee /home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_densetable.log

real	0m0.024s
user	0m0.003s
sys	0m0.008s
+ set +x
schema dump at /tmp/schema_dump.csv


/home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_densetable.fmt
data file to be purged /home/rslee/github/dbx/ingestion/demo/sqlserver/config/tmp.4x4wGw3D3y
SQLState = S1002, NativeError = 0
Error = [Microsoft][ODBC Driver 18 for SQL Server]Invalid Descriptor Index
bcp log at /home/rslee/github/dbx/ingestion/demo/sqlserver/config/03_densetable.log
Finished dense table 1


table count at /tmp/list_table_counts.csv


CompletedProcess(args='. ./demo/sqlserver/run-ycsb-sqlserver-source.sh; \n    create_user;\n    y_fieldcount=3 \n    y_fieldlength=9  \n    y_recordcount=10 \n    y_fillstart=0      \n    y_fillend=0      \n    load_sparse_data_cnt 1;\n    y_fieldcount=6 \n    y_fieldlength=7 \n    y_recordcount=20 \n    y_fillstart=1      \n    y_fillend=4      \n    load_dense_data_cnt 1;\n    dump_schema;\n    list_table_counts', returncode=0)
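
The load cell converts each `Fill Range` percentage pair into the `y_fillstart` and `y_fillend` values by rounding the start up and truncating the end. The sketch below isolates that arithmetic. `fill_bounds` is a name introduced here for illustration.

```python
import math

def fill_bounds(fill_range: tuple[int, int], fieldcount: int) -> tuple[int, int]:
    """Map a (start %, end %) range to the (y_fillstart, y_fillend) values."""
    start_pct, end_pct = fill_range
    fill_start = math.ceil(start_pct * fieldcount / 100)  # rounds up
    fill_end = int(end_pct * fieldcount / 100)            # truncates
    return fill_start, fill_end

print(fill_bounds((33, 67), 10))   # (4, 6)
print(fill_bounds((0, 100), 10))   # (0, 10)
print(fill_bounds((0, 0), 50))     # (0, 0)
```

Because the start rounds up and the end rounds down, a narrow range on a table with few fields can yield a start greater than the end.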

# Run YCSB and Arcion in the background

## Start/Restart YCSB workload at 1 TPS
The YCSB update workload (workload A) has separate controls for the Dense and Sparse table groups. All tables within a group share that group's settings.  
1. Each table's TPS (target transactions per second)
   1. 0 = as fast as possible
   2. 1 = 1 TPS
   3. 10 = 10 TPS
2. Each table's thread count (concurrency) used to achieve the desired TPS.
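
The relationship between the TPS target and the thread count can be sketched as follows, assuming the target is split evenly across threads. `per_thread_interval_ms` is a hypothetical helper for reasoning about pacing, not part of YCSB.

```python
from typing import Optional

def per_thread_interval_ms(target_tps: int, threads: int) -> Optional[float]:
    """Milliseconds between operations on each thread, or None if unthrottled."""
    if target_tps == 0:
        return None  # 0 means run as fast as possible
    per_thread_tps = target_tps / threads
    return 1000.0 / per_thread_tps

print(per_thread_interval_ms(1, 1))   # 1000.0
print(per_thread_interval_ms(10, 2))  # 200.0
print(per_thread_interval_ms(0, 4))   # None
```

Under this assumption, adding threads does not raise a non-zero target. It only spreads the same rate over more connections.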

In [6]:
# show YCSB run controls
VBox([HBox([Label('Sparse'), sparse_tps, sparse_threads]), HBox([Label('Dense'),  dense_tps, dense_threads])])

VBox(children=(HBox(children=(Label(value='Sparse'), BoundedIntText(value=1, description='TPS:', max=1000), Bo…

In [17]:
# start/restart YCSB run
print (subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    kill_ycsb;
    y_target_sparse={sparse_tps.value} y_target_dense={dense_tps.value} y_threads_sparse={sparse_threads.value} y_threads_dense={dense_threads.value} y_fieldcount_sparse={sparse_fieldcount.value} y_fieldcount_dense={dense_fieldcount.value} y_fieldlength_sparse={sparse_fieldlength.value} y_fieldlength_dense={dense_fieldlength.value} 
    start_ycsb;""",
    shell=True,executable="/usr/bin/bash",stdout=subprocess.PIPE).stdout.decode('utf-8'))

replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
YCSBDENSE
dense
ycsb YCSBDENSE pid 880188
ycsb YCSBDENSE log is at /home/rslee/github/dbx/ingestion/demo/sqlserver/logs/ycsb.YCSBDENSE.log
ycsb YCSBDENSE can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_ycsb)
YCSBSPARSE
sparse
ycsb YCSBSPARSE pid 880195
ycsb YCSBSPARSE log is at /home/rslee/github/dbx/ingestion/demo/sqlserver/logs/ycsb.YCSBSPARSE.log
ycsb YCSBSPARSE can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_ycsb)
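
The output above reports the pid of each background YCSB process. A minimal sketch for checking whether such a process is still alive follows. `is_running` is a hypothetical helper and assumes a POSIX system.

```python
import os

def is_running(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the check without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

# The current process is always running; substitute a pid from the output.
print(is_running(os.getpid()))  # True
```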



## Start Arcion

NOTE: Full mode does not work at this time.

For SQL Server, both change tracking and CDC are available in this demo.  
Performance is controlled mainly by the thread counts of the extract and apply processes. 
- [CDC YAML files](./demo/sqlserver/yaml/cdc/)
- [Change Tracking YAML files](./demo/sqlserver/yaml/change/)
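
The mode choices can be validated before Arcion is started. The sketch below rejects the `full` mode noted above as not working and maps the CDC method to its YAML directory. `resolve_arcion_config` is a hypothetical helper, and treating the directory as the only per-method difference is an assumption.

```python
SUPPORTED_REPL_MODES = {"snapshot", "real-time"}  # 'full' does not work yet
YAML_DIRS = {
    "cdc": "./demo/sqlserver/yaml/cdc/",
    "change": "./demo/sqlserver/yaml/change/",
}

def resolve_arcion_config(repl_mode: str, cdc_mode: str) -> str:
    """Validate the selections and return the YAML directory for the method."""
    if repl_mode not in SUPPORTED_REPL_MODES:
        raise ValueError(f"unsupported replication mode: {repl_mode!r}")
    if cdc_mode not in YAML_DIRS:
        raise ValueError(f"unknown CDC method: {cdc_mode!r}")
    return YAML_DIRS[cdc_mode]

print(resolve_arcion_config("snapshot", "change"))
```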

In [5]:
# show Arcion controls
VBox([
      HBox([Label('Arcion'), repl_mode, cdc_mode]),
      HBox([Label('Threads'), snapshot_threads, realtime_threads, delta_threads])
      ])

VBox(children=(HBox(children=(Label(value='Arcion'), Dropdown(description='Replication:', options=('snapshot',…

In [19]:
# start/restart Arcion
print (f"""{cdc_mode.value} {repl_mode.value}""")

print (subprocess.run(f""". ./demo/sqlserver/run-ycsb-sqlserver-source.sh; 
    echo $PROG_DIR;
    cd $PROG_DIR;
    kill_arcion;
    a_repltype={repl_mode.value} 
    SRCDB_SNAPSHOT_THREADS={snapshot_threads.value} 
    SRCDB_REALTIME_THREADS={realtime_threads.value} 
    SRCDB_DELTA_THREADS={delta_threads.value}
    start_{cdc_mode.value}_arcion;""",
    shell=True,executable="/usr/bin/bash",stdout=subprocess.PIPE).stdout.decode('utf-8'))


change snapshot
replicant
24.01.25.1 24.01
PATH=/opt/stage/bin/jsqsh-dist-3.0-SNAPSHOT/bin added
/home/rslee/github/dbx/ingestion/demo/sqlserver
enable change tracking on database arcsrc
skip ALTER DATABASE arcsrc SET CHANGE_TRACKING = ON  (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
skip ALTER TABLE replicate_io_audit_ddl ENABLE CHANGE_TRACKING;
skip ALTER TABLE replicate_io_audit_tbl_cons ENABLE CHANGE_TRACKING;
skip ALTER TABLE replicate_io_audit_tbl_schema ENABLE CHANGE_TRACKING;
ALTER TABLE YCSBDENSE ENABLE CHANGE_TRACKING;
ALTER TABLE YCSBSPARSE ENABLE CHANGE_TRACKING;
replicant
arcion pid 880422
arcion log is at /home/rslee/github/dbx/ingestion/demo/sqlserver/logs/arcion.log
arcion can be killed with . ./demo/sqlserver/run-ycsb-sqlserver-source.sh; kill_arcion)
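
Since Arcion runs in the background, its log file is the main way to follow progress. The following is a minimal sketch for reading the last lines of a log from the notebook. `tail_log` is a hypothetical helper, and the relative path is a placeholder for the log location printed above.

```python
from collections import deque
from pathlib import Path

def tail_log(path: str, lines: int = 20) -> list[str]:
    """Return the last `lines` lines of a text file, without newlines."""
    with Path(path).open(errors="replace") as handle:
        # A bounded deque keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]

log_path = Path("logs/arcion.log")  # placeholder path
if log_path.exists():
    print("\n".join(tail_log(str(log_path), lines=10)))
else:
    print(f"{log_path} not found")
```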



In [20]:

cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")

workspace_id =spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")

# clusterName = spark.conf.get("spark.databricks.clusterUsageTags.clusterName")

workspaceUrl = spark.conf.get("spark.databricks.workspaceUrl") # host name

http_path = f"sql/protocolv1/o/{workspace_id}/{cluster_id}"

spark_url=f"jdbc:spark://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3"
databricks_url=f"jdbc:databricks://{workspaceUrl}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3"

NameError: name 'spark' is not defined
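
The `NameError` above occurs because `spark` is only predefined when the notebook runs on a Databricks cluster. One way to make the cell tolerate other environments is sketched below. `build_jdbc_urls` is a name introduced here, and the URL formats are copied from the cell above.

```python
def build_jdbc_urls(workspace_url: str, workspace_id: str, cluster_id: str) -> dict:
    """Build the Spark and Databricks JDBC URLs from the cluster identifiers."""
    http_path = f"sql/protocolv1/o/{workspace_id}/{cluster_id}"
    suffix = (
        f":443/default;transportMode=http;ssl=1;"
        f"httpPath={http_path};AuthMech=3"
    )
    return {
        "spark": f"jdbc:spark://{workspace_url}{suffix}",
        "databricks": f"jdbc:databricks://{workspace_url}{suffix}",
    }

spark_session = globals().get("spark")  # predefined only on Databricks
if spark_session is None:
    print("spark is not defined; run this cell on a Databricks cluster")
else:
    urls = build_jdbc_urls(
        spark_session.conf.get("spark.databricks.workspaceUrl"),
        spark_session.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId"),
        spark_session.conf.get("spark.databricks.clusterUsageTags.clusterId"),
    )
    print(urls["databricks"])
```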