Import data from SST files is not working #5278

Open
porscheme opened this issue Jan 25, 2023 · 7 comments
Labels
affects/none · severity/none · type/question

Comments

@porscheme

@wey-gu

  1. 'SHOW STATS' returns all zeroes after ingesting data from SST files (see the detailed steps below; an nGQL sketch of the stats check follows this list).
  2. Based on the logs collected from one of the storage nodes, the SUBMIT JOB INGEST command completed successfully.
  3. Even though move_files=true is set in the nebula-storaged.conf files (see below), the SST data files are still present at /usr/local/nebula/data/storage/nebula/<SPACE_ID>/download on every storage node. My expectation was that they would be moved.
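
For reference, the check in item 1 boils down to the following nGQL sequence (the space name basketballplayer is only a placeholder; SHOW STATS reports the result of the most recently completed stats job, so SUBMIT JOB STATS is run first):

# Switch to the target graph space (placeholder name).
USE basketballplayer;
# Refresh the statistics; SHOW STATS only reflects the latest finished stats job.
SUBMIT JOB STATS;
# Once the STATS job shows FINISHED, inspect the vertex/edge counts.
SHOW JOBS;
SHOW STATS;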

Any help is really appreciated.
Thanks

Below is what I have done.

  1. Generated the SST files.
  2. Copied the same set of generated SST files to every storage node at /usr/local/nebula/data/storage/nebula/<SPACE_ID>/download.
     (Screenshot sst_files2: SST parts under /usr/local/nebula/data/storage/nebula/<SPACE_ID>/download)
     (Screenshot sst_files: the files copied to all storage nodes)
  3. 'SUBMIT JOB INGEST' ran successfully. Below is what appears in the log files on the storaged node (the command sequence is sketched after the logs).
I20230125 00:33:50.223367   360 IngestTask.cpp:38] Ingest files: 2
I20230125 00:33:50.224948   360 EventListener.h:133] Ingest external SST file: column family default, the external file path /usr/local/nebula/data/storage/nebula/861/download/199/199-678-1.sst, the internal file path /usr/local/nebula/data/storage/nebula/861/data/001027.sst, the properties of the table: # data blocks=825; # entries=23082; # deletions=0; # merge operands=0; # range deletions=0; raw key size=1292592; raw average key size=56.000000; raw value size=2123544; raw average value size=92.000000; data block size=2349260; index block size (user-key? 1, delta-value? 1)=15784; filter block size=0; # entries for filter=0; (estimated) table size=2365044; filter policy name=N/A; prefix extractor name=nullptr; column family ID=N/A; column family name=N/A; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ; creation time=0; time stamp of earliest key=0; file creation time=0; slow compression estimated data size=0; fast compression estimated data size=0; DB identity=SST Writer; DB session identity=b90c8a87-8479-4304-b2ad-3161b66b5de9; DB host id=spark-cluster-pyspark-0; original file number=0; unique ID=N/A; Sequence number to time mapping=;
I20230125 00:33:50.224968   360 EventListener.h:133] Ingest external SST file: column family default, the external file path /usr/local/nebula/data/storage/nebula/861/download/199/199-402-1.sst, the internal file path /usr/local/nebula/data/storage/nebula/861/data/001028.sst, the properties of the table: # data blocks=10087; # entries=393353; # deletions=0; # merge operands=0; # range deletions=0; raw key size=22027768; raw average key size=56.000000; raw value size=20262273; raw average value size=51.511678; data block size=24075394; index block size (user-key? 1, delta-value? 1)=210980; filter block size=0; # entries for filter=0; (estimated) table size=24286374; filter policy name=N/A; prefix extractor name=nullptr; column family ID=N/A; column family name=N/A; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ; creation time=0; time stamp of earliest key=0; file creation time=0; slow compression estimated data size=0; fast compression estimated data size=0; DB identity=SST Writer; DB session identity=14ff285f-8173-4ca3-8443-754fb12e6b75; DB host id=spark-cluster-pyspark-0; original file number=0; unique ID=N/A; Sequence number to time mapping=;
I20230125 00:33:50.225014   360 IngestTask.cpp:38] Ingest files: 2
I20230125 00:33:50.226675   360 EventListener.h:133] Ingest external SST file: column family default, the external file path /usr/local/nebula/data/storage/nebula/861/download/200/200-679-1.sst, the internal file path /usr/local/nebula/data/storage/nebula/861/data/001029.sst, the properties of the table: # data blocks=824; # entries=23064; # deletions=0; # merge operands=0; # range deletions=0; raw key size=1291584; raw average key size=56.000000; raw value size=2121888; raw average value size=92.000000; data block size=2345756; index block size (user-key? 1, delta-value? 1)=15752; filter block size=0; # entries for filter=0; (estimated) table size=2361508; filter policy name=N/A; prefix extractor name=nullptr; column family ID=N/A; column family name=N/A; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ; creation time=0; time stamp of earliest key=0; file creation time=0; slow compression estimated data size=0; fast compression estimated data size=0; DB identity=SST Writer; DB session identity=d0c43ed3-c5e7-47e3-9255-6d603785f458; DB host id=spark-cluster-pyspark-0; original file number=0; unique ID=N/A; Sequence number to time mapping=;
I20230125 00:33:50.226701   360 EventListener.h:133] Ingest external SST file: column family default, the external file path /usr/local/nebula/data/storage/nebula/861/download/200/200-403-1.sst, the internal file path /usr/local/nebula/data/storage/nebula/861/data/001030.sst, the properties of the table: # data blocks=10058; # entries=392240; # deletions=0; # merge operands=0; # range deletions=0; raw key size=21965440; raw average key size=56.000000; raw value size=20205210; raw average value size=51.512365; data block size=24008572; index block size (user-key? 1, delta-value? 1)=210341; filter block size=0; # entries for filter=0; (estimated) table size=24218913; filter policy name=N/A; prefix extractor name=nullptr; column family ID=N/A; column family name=N/A; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; max_dict_buffer_bytes=0; ; creation time=0; time stamp of earliest key=0; file creation time=0; slow compression estimated data size=0; fast compression estimated data size=0; DB identity=SST Writer; DB session identity=2e0a12fa-4628-466c-aaf9-6b6f541cbd9d; DB host id=spark-cluster-pyspark-0; original file number=0; unique ID=N/A; Sequence number to time mapping=;
I20230125 00:33:50.226717   360 AdminTaskManager.cpp:318] subtask of task(881, 0) finished, unfinished task 0
I20230125 00:33:50.226727   360 AdminTask.h:129] task(881, 0) finished, rc=[SUCCEEDED]
I20230125 00:33:50.226816   142 AdminTaskManager.cpp:92] reportTaskFinish(), job=881, task=0, rc=SUCCEEDED
I20230125 00:33:50.226884    60 MetaClient.cpp:716] Send request to meta "nebula-cluster-metad-0.nebula-cluster-metad-headless.graph.svc.cluster.local":9559
I20230125 00:33:50.229092   142 AdminTaskManager.cpp:134] reportTaskFinish(), job=881, task=0, rc=SUCCEEDED
I20230125 00:33:50.504428    80 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 2 files into 1, base level is 0, output level is 1
I20230125 00:33:50.504573    81 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 2 files into 0, base level is 0, output level is 1
I20230125 00:33:50.504621    81 CompactionFilter.h:92] Do default minor compaction!
I20230125 00:33:50.845562    81 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 2 files into 1, base level is 0, output level is 1
I20230125 00:33:50.845701    82 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 2 files into 0, base level is 0, output level is 1
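
For completeness, the ingest in step 3 was triggered with roughly the following nGQL (the space name basketballplayer is only a placeholder; SHOW JOBS is just a way to confirm the INGEST job reaches FINISHED):

# Switch to the graph space whose download/ directories hold the SST files (placeholder name).
USE basketballplayer;
# Ingest every SST file found under data/storage/nebula/<SPACE_ID>/download on each storaged.
SUBMIT JOB INGEST;
# Confirm the INGEST job finished.
SHOW JOBS;
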
  4. nebula-storaged.conf (the storaged ConfigMap)
Name:         nebula-cluster-storaged
Namespace:    graph
Labels:       app.kubernetes.io/cluster=nebula-cluster
              app.kubernetes.io/component=storaged
              app.kubernetes.io/managed-by=nebula-operator
              app.kubernetes.io/name=nebula-graph
Annotations:  <none>

Data
====
nebula-storaged.conf:
----

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged.pid
# Whether to use the configuration obtained from the configuration file
--local_config=true

########## logging ##########
# The directory to host logging files
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=1
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2
# Whether logging files' names contain a timestamp.
--timestamp_in_logfile_name=true

########## networking ##########
# Comma separated Meta server addresses
--meta_server_addrs=127.0.0.1:9559
# Local IP used to identify the nebula-storaged process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=127.0.0.1
# Storage daemon listening port
--port=9779
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19779
# heartbeat with meta service
--heartbeat_interval_secs=10

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=14400

########## Disk ##########
# Root data path. split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
--data_path=data/storage

# Minimum reserved bytes of each data path
--minimum_reserved_bytes=268435456

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=40960
# Disable page cache to better control memory used by rocksdb.
# Caution: Make sure to allocate enough block cache if disabling page cache!
--disable_page_cache=false

# Compression algorithm, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
# For the sake of binary compatibility, the default value is snappy.
# Recommend to use:
#   * lz4 to gain more CPU performance, with the same compression ratio with snappy
#   * zstd to occupy less disk space
#   * lz4hc for the read-heavy write-light scenario
--rocksdb_compression=lz4

# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=

############## rocksdb Options ##############
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"4","max_background_jobs":"4"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}

# Whether or not to enable rocksdb's statistics, disabled by default
--enable_rocksdb_statistics=false

# Statslevel used by rocksdb to collection statistics, optional values are
#   * kExceptHistogramOrTimers, disable timer stats, and skip histogram stats
#   * kExceptTimers, Skip timer stats
#   * kExceptDetailedTimers, Collect all stats except time inside mutex lock AND time spent on compression.
#   * kExceptTimeForMutex, Collect all stats except the counters requiring to get time inside the mutex lock.
#   * kAll, Collect all stats
--rocksdb_stats_level=kExceptHistogramOrTimers

# Whether or not to enable rocksdb's prefix bloom filter, enabled by default.
--enable_rocksdb_prefix_filtering=true
# Whether or not to enable rocksdb's whole key bloom filter, disabled by default.
--enable_rocksdb_whole_key_filtering=false

############## Key-Value separation ##############
# Whether or not to enable BlobDB (RocksDB key-value separation support)
--rocksdb_enable_kv_separation=false
# RocksDB key value separation threshold in bytes. Values at or above this threshold will be written to blob files during flush or compaction.
--rocksdb_kv_separation_threshold=100
# Compression algorithm for blobs, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
--rocksdb_blob_compression=lz4
# Whether to garbage collect blobs during compaction
--rocksdb_enable_blob_garbage_collection=true

############## storage cache ##############
# Whether to enable storage cache
--enable_storage_cache=false
# Total capacity reserved for storage in memory cache in MB
--storage_cache_capacity=0
# Number of buckets in base 2 logarithm. E.g., in case of 20, the total number of buckets will be 2^20.
# A good estimate can be ceil(log2(cache_entries * 1.6)). The maximum allowed is 32.
--storage_cache_buckets_power=20
# Number of locks in base 2 logarithm. E.g., in case of 10, the total number of locks will be 2^10.
# A good estimate can be max(1, buckets_power - 10). The maximum allowed is 32.
--storage_cache_locks_power=10

# Whether to add vertex pool in cache. Only valid when storage cache is enabled.
--enable_vertex_pool=false
# Vertex pool size in MB
--vertex_pool_capacity=50
# TTL in seconds for vertex items in the cache
--vertex_item_ttl=300

# Whether to add empty key pool in cache. Only valid when storage cache is enabled.
--enable_empty_key_pool=false
# Empty key pool size in MB
--empty_key_pool_capacity=50
# TTL in seconds for empty key items in the cache
--empty_key_item_ttl=300

############### misc ####################
--snapshot_part_rate_limit=10485760
--snapshot_batch_size=1048576
--rebuild_index_part_rate_limit=4194304
--rebuild_index_batch_size=1048576

########## Custom ##########
--enable_partitioned_index_filter=true
--max_edge_returned_per_vertex=100000
--move_files=true


BinaryData
====

Events:  <none>

Your Environments (required)

  • OS: uname -a
 Linux nebula-cluster-storaged-0 5.4.0-1098-azure #104~18.04.2-Ubuntu SMP Tue Nov 29 12:13:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Compiler: g++ --version or clang++ --version
  • Using official images
  • CPU: lscpu
  • Commit id (e.g. a3ffc7d8)

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Step 1
  2. Step 2
  3. Step 3

Expected behavior

Additional context

@porscheme porscheme added the type/bug Type: something is unexpected label Jan 25, 2023
@github-actions github-actions bot added affects/none PR/issue: this bug affects none version. severity/none Severity of bug labels Jan 25, 2023
@porscheme
Author

Upon further investigation, I was able to root-cause the failure.

  • I was trying to import SST data files that were generated on a different cluster (with the same configuration, though).
  • This raises an important question that should be documented: do the SST files need to be generated and ingested on the same cluster?

@wey-gu
Contributor

wey-gu commented Feb 6, 2023

SST file generation and ingest is meant to offload the sorting computation from the NebulaGraph cluster in order to accelerate batch data importing.

It's not built for ingesting data across clusters: the SST data is tied to the cluster it was generated for, and the file structure depends on that cluster's internal state, e.g. the space ID (a quick way to compare space IDs is sketched below).
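
As a quick illustration, the space ID that names the download directory is assigned per cluster, so it can be compared on the source and target clusters with DESC SPACE (the space name basketballplayer is only a placeholder):

# The ID column of DESC SPACE is the <SPACE_ID> used in
# /usr/local/nebula/data/storage/nebula/<SPACE_ID>/download;
# it is assigned by each cluster and generally differs between clusters.
DESC SPACE basketballplayer;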

@porscheme
Author

Thanks for the reply.

Can we do incremental updates to the graph through SST files?

@wey-gu
Contributor

wey-gu commented Feb 6, 2023

Yes, SST files are not just for full data imports; seen from the whole-graph perspective an ingest is actually incremental, so it could, for example, be run every night (a rough sketch of such a nightly run is below).
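
A rough sketch of such a nightly run, assuming the generated SST files are staged on HDFS (the HDFS path and space name are placeholders, and the exact spelling of the download job may vary by NebulaGraph version):

# Switch to the target graph space (placeholder name).
USE basketballplayer;
# Pull the night's SST files from HDFS into each storaged's download/ directory.
SUBMIT JOB DOWNLOAD HDFS "hdfs://namenode:9000/sst";
# Ingest the downloaded files.
SUBMIT JOB INGEST;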

@Sophie-Xie Sophie-Xie added type/question Type: question about the product and removed type/bug Type: something is unexpected labels Feb 7, 2023
@porscheme
Author

Yes, SST files are not just for full data imports; seen from the whole-graph perspective an ingest is actually incremental, so it could, for example, be run every night

Oh nice, how does it handle deletes?

@wey-gu
Contributor

wey-gu commented Feb 23, 2023

From my understanding, there is no pure deletion in Exchange (we do have insert and update, though). @Nicole00, correct me if I'm wrong :)

@porscheme
Author

Any update on this? How do we do deletions through SST?

If it's not possible to do deletions using SST, what's the best alternative?
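
For reference, deletions that cannot go through SST ingest are normally issued as plain nGQL DELETE statements; a minimal sketch, with placeholder space name, edge type, and vertex IDs:

# Switch to the target graph space (placeholder name).
USE basketballplayer;
# Delete a single vertex by its ID.
DELETE VERTEX "player100";
# Delete a specific edge by edge type and endpoint IDs.
DELETE EDGE follow "player100" -> "player101";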
