# Real-time Workload for Scylla
### Web Sales | Web Returns | Store Sales | Store Returns | Catalog Sales

### Installing NoSQLBench 

#### Download:

In [4]:
!curl -L -O https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb
!chmod +x nb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   146  100   146    0     0    368      0 --:--:-- --:--:-- --:--:--   368
100   641  100   641    0     0   1030      0 --:--:-- --:--:-- --:--:--  1030
100  217M  100  217M    0     0  20.1M      0  0:00:10  0:00:10 --:--:-- 49.7M


### Getting variable values

Importing the required modules and starting the Spark session:

In [5]:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy, DowngradingConsistencyRetryPolicy, ConsistencyLevel, RoundRobinPolicy
from cassandra.query import tuple_factory
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession, SQLContext
from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, FloatType, DateType, LongType

## Starting Spark
## NOTE: "setMaster" is not a standard Spark property key; to target a specific
## master, call .master(<master URL>) on the builder instead.
spark = SparkSession\
    .builder\
    .appName("TPCDS-Scylla")\
    .config("setMaster","172.19.0.2")\
    .config("spark.jars", "target/scala-2.12/spark3-scylla4-example-assembly-0.1.jar")\
    .config("spark.cassandra.connection.host", "172.19.0.2")\
    .config('spark.cassandra.output.consistency.level','LOCAL_QUORUM')\
    .config("spark.driver.memory", "28g")\
    .config("spark.executor.memory", "28g")\
    .getOrCreate()
sc = spark.sparkContext
## Start the SQL context, which lets you run SQL queries.
## SQLContext takes the SparkContext first and the SparkSession second.
sqlContext = SQLContext(sc, spark)

Creating Spark temporary views backed by the ScyllaDB tables:

In [6]:
# TPC-DS tables stored in the "tpcds" keyspace
tpcds_tables = [
    "call_center", "catalog_page", "catalog_returns", "catalog_sales",
    "customer", "customer_address", "customer_demographics", "date_dim",
    "household_demographics", "income_band", "inventory", "item",
    "promotion", "reason", "ship_mode", "store", "store_returns",
    "store_sales", "time_dim", "warehouse", "web_page", "web_returns",
    "web_sales", "web_site",
]

# Load each table as a DataFrame and register it as a temporary view so it can
# be queried with Spark SQL (createOrReplaceTempView supersedes the deprecated
# registerTempTable). Each DataFrame stays reachable as dataframes["<name>"].
dataframes = {}
for name in tpcds_tables:
    dataframes[name] = spark.read.format("org.apache.spark.sql.cassandra").options(table=name, keyspace="tpcds").load()
    dataframes[name].createOrReplaceTempView(name)



21/11/15 14:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/15 14:40:02 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


## Creating workload variables based on the size of the dimension tables already loaded into ScyllaDB

Install requirements:

In [7]:
!pip install pyyaml



Reading template file:

In [8]:
import yaml

with open(r'workload-nosqlbench.template.yaml') as file:
    # The FullLoader parameter handles the conversion from YAML
    # scalar values to the Python dictionary format
    workload = yaml.load(file, Loader=yaml.FullLoader)
    columns = workload["bindings"]
 


Since the size of the environment can change, we need to find the maximum value of each key, so that the generated data references rows that actually exist in the dimension tables and the queries return data. For this, we use Spark to get the MAX of each key.

In [6]:
import re

column_to_get_max = []
variables_values = []
# Keep only the surrogate-key bindings (names ending in "_sk")
for column, val in columns.items():
    if column.endswith("_sk"):
        column_to_get_max.append([column, val])
        print([column, val])
 


Capturing only the variable names used by the key bindings in the template file:

In [9]:
# Extract the <<variable>> placeholders used by each *_sk binding
full_variables_list = []
for column, value in column_to_get_max:
    match = re.findall('<<(.+?)>>', value)
    if match:
        full_variables_list.append(match)

# Drop duplicates while preserving the order of first appearance
variable_list = []
for x in full_variables_list:
    if x not in variable_list:
        variable_list.append(x)
print(variable_list)

[['max_item_sk'], ['max_web_page_sk'], ['start_date', 'end_date'], ['max_c_address_sk'], ['max_cdemo_sk'], ['max_customer_sk'], ['max_reason_sk'], ['max_warehouse'], ['max_promo_sk'], ['max_web_site_sk'], ['max_store_sk'], ['max_warehouse_sk'], ['max_call_center_sk'], ['max_catalog_page_sk']]


Creating the relationship between foreign keys, tables, and variables:


In [10]:
# Each entry: [dimension key column, dimension table, template variable]
array = [
    ["cp_catalog_page_sk", "catalog_page", "max_catalog_page_sk"],
    ["cc_call_center_sk", "call_center", "max_call_center_sk"],
    ["s_store_sk", "store", "max_store_sk"],
    ["p_promo_sk", "promotion", "max_promo_sk"],
    ["r_reason_sk", "reason", "max_reason_sk"],
    ["i_item_sk", "item", "max_item_sk"],
    ["c_customer_sk", "customer", "max_customer_sk"],
    ["cd_demo_sk", "customer_demographics", "max_cdemo_sk"],
    ["ca_address_sk", "customer_address", "max_c_address_sk"],
    ["wp_web_page_sk", "web_page", "max_web_page_sk"],
    ["web_site_sk", "web_site", "max_web_site_sk"],
    ["w_warehouse_sk", "warehouse", "max_warehouse"],
]
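Before running the queries, it may help to check which template variables have no entry in this mapping, since those placeholders would remain unresolved. The following is a minimal sketch using only the standard library; the helper name `unmapped_variables` is hypothetical, and the two lists are copied from the output and the mapping shown above.

```python
# Placeholder names found in the template (copied from the printed variable_list)
variable_list = [
    ["max_item_sk"], ["max_web_page_sk"], ["start_date", "end_date"],
    ["max_c_address_sk"], ["max_cdemo_sk"], ["max_customer_sk"],
    ["max_reason_sk"], ["max_warehouse"], ["max_promo_sk"],
    ["max_web_site_sk"], ["max_store_sk"], ["max_warehouse_sk"],
    ["max_call_center_sk"], ["max_catalog_page_sk"],
]

# [dimension key column, dimension table, template variable]
array = [
    ["cp_catalog_page_sk", "catalog_page", "max_catalog_page_sk"],
    ["cc_call_center_sk", "call_center", "max_call_center_sk"],
    ["s_store_sk", "store", "max_store_sk"],
    ["p_promo_sk", "promotion", "max_promo_sk"],
    ["r_reason_sk", "reason", "max_reason_sk"],
    ["i_item_sk", "item", "max_item_sk"],
    ["c_customer_sk", "customer", "max_customer_sk"],
    ["cd_demo_sk", "customer_demographics", "max_cdemo_sk"],
    ["ca_address_sk", "customer_address", "max_c_address_sk"],
    ["wp_web_page_sk", "web_page", "max_web_page_sk"],
    ["web_site_sk", "web_site", "max_web_site_sk"],
    ["w_warehouse_sk", "warehouse", "max_warehouse"],
]


def unmapped_variables(variable_groups, mapping):
    """Return template variables that no mapping entry provides a value for."""
    mapped = {variable for _column, _table, variable in mapping}
    return [name for group in variable_groups for name in group if name not in mapped]


print(unmapped_variables(variable_list, array))
# ['start_date', 'end_date', 'max_warehouse_sk']
```

With the lists as shown, `start_date`, `end_date`, and `max_warehouse_sk` have no mapping, so any binding that relies on them would need a value from another source.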


Generating and executing the queries and capturing the values:

In [11]:

# Run one MAX query per dimension table and keep [value, column, variable]
queries = []
for column, table, variable in array:
    query = 'select max({}) as max from {}'.format(column, table)

    result = sqlContext.sql(query).collect()[0]
    print(query + " = " + str(result[0]) + "-> {}".format(variable))
    print("result: "+ str(result[0]) + " | column: " + column + "| variable :" + variable)
    queries.append([result[0], column, variable])



                                                                                

select max(cp_catalog_page_sk) as max from catalog_page = 11718-> max_catalog_page_sk
result: 11718 | column: cp_catalog_page_sk| variable :max_catalog_page_sk
select max(cc_call_center_sk) as max from call_center = 6-> max_call_center_sk
result: 6 | column: cc_call_center_sk| variable :max_call_center_sk
select max(s_store_sk) as max from store = 37-> max_store_sk
result: 37 | column: s_store_sk| variable :max_store_sk
select max(p_promo_sk) as max from promotion = 2003-> max_promo_sk
result: 2003 | column: p_promo_sk| variable :max_promo_sk
select max(r_reason_sk) as max from reason = 233-> max_reason_sk
result: 233 | column: r_reason_sk| variable :max_reason_sk
select max(i_item_sk) as max from item = 281758-> max_item_sk
result: 281758 | column: i_item_sk| variable :max_item_sk
select max(c_customer_sk) as max from customer = 1920799-> max_customer_sk
result: 1920799 | column: c_customer_sk| variable :max_customer_sk
select max(cd_demo_sk) as max from customer_demographics = 1920799-> max_cdemo_sk
result: 1920799 | column: cd_demo_sk| variable :max_cdemo_sk
select max(ca_address_sk) as max from customer_address = 156532-> max_c_address_sk
result: 156532 | column: ca_address_sk| variable :max_c_address_sk
select max(wp_web_page_sk) as max from web_page = 400-> max_web_page_sk
result: 400 | column: wp_web_page_sk| variable :max_web_page_sk
select max(web_site_sk) as max from web_site = 29-> max_web_site_sk
result: 29 | column: web_site_sk| variable :max_web_site_sk
select max(w_warehouse_sk) as max from warehouse = 11-> max_warehouse
result: 11 | column: w_warehouse_sk| variable :max_warehouse
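One pitfall worth guarding against: `max()` over an empty table yields no value, and the text `None` would then end up inside a binding. The sketch below, which is only an illustration, turns rows shaped like the `queries` list into a variable-to-value dictionary and fails early in that case. The helper name `to_variable_values` is hypothetical, and the sample rows are copied from the output above.

```python
def to_variable_values(rows):
    """Map each template variable to its maximum key, rejecting missing values."""
    values = {}
    for result, column, variable in rows:
        if result is None:
            # max() over an empty table returns NULL, which arrives here as None
            raise ValueError(f"max({column}) returned no value; is the table empty?")
        values[variable] = int(result)
    return values


# Rows shaped like the `queries` list: [result, column, variable]
sample = [
    [11718, "cp_catalog_page_sk", "max_catalog_page_sk"],
    [6, "cc_call_center_sk", "max_call_center_sk"],
]
print(to_variable_values(sample))
# {'max_catalog_page_sk': 11718, 'max_call_center_sk': 6}
```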


#### Now that we have the value of each variable, we need to create the workload file, replacing the variables with their values

In [8]:
## TODO: ask the team for help dumping the new values into a new file, replacing the placeholder strings
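One possible way to do this replacement is to work on the template text directly with a regular expression. This is a minimal sketch using only the standard library, not the approach taken in the cells that follow; the helper name `fill_placeholders` and the sample lines are hypothetical, while the values are taken from the query output above.

```python
import re

# Matches template parameters written as <<name>>
PLACEHOLDER = re.compile(r"<<(.+?)>>")


def fill_placeholders(text, values):
    """Replace each <<name>> whose name is in `values`; leave the others untouched."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return PLACEHOLDER.sub(substitute, text)


# Values taken from the query output; the two template lines are hypothetical
values = {"max_item_sk": 281758, "max_store_sk": 37}
print(fill_placeholders("cs_item_sk: Uniform(0,<<max_item_sk>>)", values))
# cs_item_sk: Uniform(0,281758)
print(fill_placeholders("d_date: <<start_date>>", values))
# d_date: <<start_date>>
```

Because it operates on raw text, this keeps the comments and layout of the template intact; applying it to the whole file would amount to reading the template, calling the function once, and writing the result to a new file.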

In [12]:
# queries holds the [result, column, variable] rows collected in the previous step
import re
import pandas as pd
import pandasql as psql

with open(r'workload-nosqlbench.template.yaml') as file:
    # The FullLoader parameter handles the conversion from YAML
    # scalar values to the Python dictionary format
    workload = yaml.load(file, Loader=yaml.FullLoader)

columns = workload["bindings"]

df_values = pd.DataFrame(queries, columns=['result', "column", "variable"])
df = pd.DataFrame(columns.items(), columns=["column", "value"])
# Extract the first <<variable>> placeholder name from each binding (NaN when there is none)
df['parsed'] = df['value'].str.extract(r'<<(.+?)>>', expand=False)
df = df[["column", "value", "parsed"]]
df['column'] = df['column'].astype('string')
df_values['column'] = df_values['column'].astype('string')

# Match each binding to its computed maximum through the placeholder name only.
# The two "column" fields must not be compared: df holds binding names such as
# cs_item_sk, while df_values holds dimension keys such as i_item_sk.
new_df = psql.sqldf("select distinct df.column, df.value, cast(df_values.result as text) as result from df_values join df on df.parsed = df_values.variable")

pat = re.compile(r"<<.*?>>")

# Replace the placeholder in each binding with the computed maximum
new_df["value"] = new_df[["value", "result"]].apply(lambda x: pat.sub(repl=x["result"], string=x["value"]), axis=1)

new_df[["column", "value"]]


|   | column | value |
|---|--------|-------|
| 0 | cs_catalog_page_sk | Uniform(0,11718) |
| 1 | cs_call_center_sk | Uniform(0,6) |
| 2 | sr_store_sk | Uniform(0,37)) |
| 3 | ss_store_sk | Uniform(0,37)) |
| 4 | cs_promo_sk | Uniform(0,2003) |
| 5 | ss_promo_sk | Uniform(0,2003) |
| 6 | ws_promo_sk | Uniform(0,2003) |
| 7 | sr_reason_sk | Uniform(0,233) |
| 8 | wr_reason_sk | Uniform(0,233) |
| 9 | cs_item_sk | Uniform(0,281758) |


In [171]:
import sys
import ruamel.yaml
import ruamel.yaml.util

template_file = 'workload-nosqlbench.template.yaml'

# Load the template with ruamel.yaml so comments and indentation survive the round trip
with open(template_file) as fp:
    config, ind, bsi = ruamel.yaml.util.load_yaml_guess_indent(fp)

columns_pop = config['bindings']

# Replace every binding for which a value was computed
for column, new_value, result in new_df.values:
    if column in columns_pop:
        columns_pop[column] = new_value

# Use a separate name so the PyYAML module imported earlier as `yaml` is not shadowed
ryaml = ruamel.yaml.YAML()
ryaml.indent(mapping=ind, sequence=ind, offset=bsi)

# Write the file only after the bindings have been replaced
with open('output.yaml', 'w') as fp:
    ryaml.dump(config, fp)

# Print the updated workload for inspection
ryaml.dump(config, sys.stdout)


In [193]:
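import jinja2
from jinja2 import meta
templateLoader = jinja2.FileSystemLoader(searchpath="./")
templateEnv = jinja2.Environment(loader=templateLoader)
TEMPLATE_FILE = "workload-nosqlbench.template.yaml"
# parse() expects the template source text, not a compiled Template object
template_source, _, _ = templateLoader.get_source(templateEnv, TEMPLATE_FILE)
parsed_content = templateEnv.parse(template_source)
# Lists variables written with Jinja2's default {{ name }} delimiters;
# <<name>> placeholders are not recognized with this default configuration
meta.find_undeclared_variables(parsed_content)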

set()

## Using NoSQLBench against ScyllaDB

cycles - standard parameter; however, the cql activity type defaults it to the number of statements included in the current activity, after tag filtering.

pooling (default: none) - applies the connection pooling options to the policy. Examples:

    pooling=4:10 - keep between 4 and 10 connections to LOCAL hosts
    pooling=4:10,2:5 - keep 4-10 connections to LOCAL hosts and 2-5 to REMOTE hosts
    pooling=4:10:2000 - keep between 4 and 10 connections to LOCAL hosts, with up to 2000 requests per connection
    pooling=5:10:2000,2:4:1000 - keep between 5 and 10 connections to LOCAL hosts with up to 2000 requests per connection, and 2-4 connections to REMOTE hosts with up to 1000 requests per connection

Additionally, you may provide the following options on pooling. Any that are provided must appear in this order: ,heartbeat_interval_s:n,idle_timeout_s:n,pool_timeout_ms:n. A full example with all options set:

       pooling=5:10:2000,2:4:1000,heartbeat_interval_s:30,idle_timeout_s:120,pool_timeout_ms:5
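The workload command later in this notebook carries a sizing hint of 2*shards:4*shards for the core and maximum connection counts. As a rough sketch, that hint can be turned into a pooling argument with a small helper. The function name `pooling_option` is hypothetical, and the shard count of 4 is an assumption chosen only because it reproduces the 8:16 values used in that command.

```python
def pooling_option(shards, requests_per_connection, remote=True):
    """Build a pooling argument from the 2*shards:4*shards sizing hint."""
    core = 2 * shards       # connections opened up front
    maximum = 4 * shards    # upper bound on connections
    spec = f"{core}:{maximum}:{requests_per_connection}"
    # Use the same limits for REMOTE hosts when requested
    return f"pooling={spec},{spec}" if remote else f"pooling={spec}"


# Assumed shard count of 4, for illustration only
print(pooling_option(4, 10000))
# pooling=8:16:10000,8:16:10000
```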

lbp - configures the load balancing policies for the Java driver. With this parameter, you can configure nested load balancing policies in short-hand form.

The policies available are documented in detail under the help topic cql-loadbalancing. See that guide if you need more than the examples below.

Examples:

        lbp=LAP(retry_period=3,scale=10) - Latency aware policy with a retry period of 3 seconds and a scale of 10. (Seconds is the default time unit unless an _ms parameter is used.)
        lbp=LAP(rp=3,s=10) - Same as above, using the equivalent but terser form.
        lbp=LAP(rp_ms=3000,s_ms=10000) - Same as above, with milliseconds instead of seconds.
        loadbalancing=LAP(s=10),TAP() - Latency aware policy, followed by token aware policy.

#### Start Schema 

In [1]:
!./nb run driver=cqld3 \
    host=172.19.0.2 \
    workload=workload-nosqlbench.yaml \
    tags=phase:schema \
    cycles=11 \
    loadbalancing='TAP()'


#### Start workload

In [2]:
!./nb run driver=cqld3 \
    host=172.19.0.2 \
    workload=workload-nosqlbench.yaml \
    threads=auto \
    cycles=3000000 \
    async=64 \
    tags=phase:rampup \
    --progress console:10s \
    loadbalancing='TAP()' pooling=8:16:10000,8:16:10000
# pooling format: start_connections:max_connections:ops_per_connection (sizing hint: 2*shards:4*shards:5000)

workload-nosqlbench.yaml: 3.61%/Running (details: min=0 cycle=108352 max=3000000)
workload-nosqlbench.yaml: 8.62%/Running (details: min=0 cycle=258656 max=3000000)
workload-nosqlbench.yaml: 13.64%/Running (details: min=0 cycle=409056 max=3000000)
workload-nosqlbench.yaml: 18.65%/Running (details: min=0 cycle=559424 max=3000000)
workload-nosqlbench.yaml: 23.65%/Running (details: min=0 cycle=709600 max=3000000)
workload-nosqlbench.yaml: 28.66%/Running (details: min=0 cycle=859872 max=3000000)
workload-nosqlbench.yaml: 33.66%/Running (details: min=0 cycle=1009728 max=3000000)
workload-nosqlbench.yaml: 38.67%/Running (details: min=0 cycle=1160224 max=3000000)
workload-nosqlbench.yaml: 43.69%/Running (details: min=0 cycle=1310560 max=3000000)
workload-nosqlbench.yaml: 48.66%/Running (details: min=0 cycle=1459936 max=3000000)
workload-nosqlbench.yaml: 53.67%/Running (details: min=0 cycle=1609984 max=3000000)
workload-nosqlbench.yaml: 58.66%/Running (details: min=0 cycle=1759840 max=3000000)
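The progress lines can be turned into a rough throughput estimate. The sketch below assumes that consecutive lines are exactly 10 seconds apart, as requested with `--progress console:10s`; the helper name `throughput` is hypothetical, and the sample lines are copied from the output above.

```python
import re

# Captures the current cycle from a progress line
PROGRESS = re.compile(r"cycle=(\d+) max=(\d+)")


def throughput(lines, interval_s=10):
    """Return cycles per second between consecutive progress lines."""
    cycles = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            cycles.append(int(match.group(1)))
    return [(after - before) / interval_s for before, after in zip(cycles, cycles[1:])]


sample_lines = [
    "workload-nosqlbench.yaml: 3.61%/Running (details: min=0 cycle=108352 max=3000000)",
    "workload-nosqlbench.yaml: 8.62%/Running (details: min=0 cycle=258656 max=3000000)",
    "workload-nosqlbench.yaml: 13.64%/Running (details: min=0 cycle=409056 max=3000000)",
]
print(throughput(sample_lines))
# [15030.4, 15040.0]
```

Under that assumption, the run shown here advances at roughly 15,000 cycles per second.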
