Few Questions: Importing SST files into Nebula graph #116

Closed
porscheme opened this issue Nov 29, 2022 · 14 comments
Labels
type/question Type: question about the product

Comments

@porscheme

General Question

I was following the Nebula official documentation.

Below are a few questions:

  1. Are the space's partition_num and the tags/edges partition we define in application.conf related?
  2. Should the tags/edges partition be equal to the space's partition_num?
  3. Can I submit SST file generation Spark jobs multiple times with different application.conf settings?
  4. Should I copy the SST folder from HDFS to all storage nodes?

@wey-gu Can you please help me here?

@Sophie-Xie added the type/question label on Nov 29, 2022
@wey-gu
Contributor

wey-gu commented Nov 29, 2022

  1. The partition in application.conf is Exchange's (i.e. the Spark app's) partition count. Say your graph space has partition_num 100: in the richest case (from a Spark-resource perspective), you could have Exchange running 100 tasks (configure 100 partitions) in parallel. But the number of partitions in Exchange/Spark is also limited by the Spark cores (the Spark community recommends not setting the partition count larger than 2-3 times the number of Spark cores).
  2. As in 1, you could, but if you don't have that many Spark cores, set it to 2-3 times the number of Spark cores (see the example config below).
  3. Can I submit SST file generation Spark jobs multiple times with different application.conf settings? I think yes; different Spark applications/Exchange runs will read storaged in parallel if you do so (I assume "multiple times" means generating for different spaces?).
  4. Yes, you should copy it to all storage nodes (those where the corresponding graph space's partitions are placed).
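
For illustration, a minimal per-tag sketch of where that partition setting lives in application.conf (the layout follows the Exchange docs; the tag name, fields, and paths are placeholders, not values from this thread):

```hocon
tags: [
  {
    name: player
    type: {
      source: parquet
      sink: sst
    }
    path: "hdfs://namenode:9000/input/player.parquet"
    fields: [name, age]
    nebula.fields: [name, age]
    vertex: { field: id }
    batch: 256
    # concurrency of data processing; keep it within ~2-3x of your Spark cores
    partition: 100
  }
]
```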

@Nicole00 correct me if anything is incorrect :)

@Nicole00
Contributor

Thanks @wey-gu for your answers.
@wey-gu's description is right; I just want to add something more. The process for generating SST files is: read the data source, process the data into SST format, sort the SST-format data inside each Spark partition, and write the SST-format data into SST files.
The partition in the configuration file is the concurrency used for data processing, and the concurrency of writing to SST files depends on the configuration item repartitionWithNebula. If repartitionWithNebula is true, the writing concurrency is the same as the NebulaGraph space's partition number; if false, the writing concurrency is the same as spark.sql.shuffle.partitions.
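
As a rough sketch (option names follow the Exchange docs; the values are placeholders), the two knobs above sit in the per-tag section of application.conf:

```hocon
tags: [
  {
    # ... source/sink/fields as usual ...
    # concurrency of reading and processing the data source
    partition: 100
    # true  -> write concurrency follows the space's partition_num
    # false -> write concurrency follows spark.sql.shuffle.partitions
    repartitionWithNebula: true
  }
]
```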

@Nicole00
Contributor

Another point: when you submit the SST file generation job multiple times, please make sure your remote path (HDFS path) is different, or there may be some overwrite issues.
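
For example (a sketch only; the namenode address and folder names are placeholders), point each run at its own remote folder in application.conf:

```hocon
nebula: {
  path: {
    local: "/tmp"
    # run 1 wrote to /sst1; give run 2 a different folder, e.g. /sst2
    remote: "/sst1"
    hdfs.namenode: "hdfs://namenode:9000"
  }
}
```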

@porscheme
Author

porscheme commented Nov 29, 2022

> Another point: when you submit the SST file generation job multiple times, please make sure your remote path (HDFS path) is different, or there may be some overwrite issues.

  1. Should I copy all the remote paths (HDFS paths) as subfolders of the 'download' folder under the space ID in the data/storage/nebula directory? This did not work for me (see below).
  2. Copying the contents of the remote HDFS path directly into the 'download' folder under the space ID in the data/storage/nebula directory worked. Please confirm this is how it's supposed to be.
  3. After a successful INGEST, can I recreate the download folder and use it for another remote folder?

@wey-gu, @Nicole00
Please help

For example, say I have the following:
-- my space ID is 2
-- partition_num is 100
-- I have 3 remote HDFS paths: sst1, sst2, sst3

Scenario - 1
I copied all remote SST folders to all storage nodes as below. Running the SUBMIT JOB INGEST command did nothing.
-- data/storage/nebula/2/download/sst1
-- data/storage/nebula/2/download/sst2
-- data/storage/nebula/2/download/sst3

Scenario - 2
I copied just one remote HDFS folder to all storage nodes as below. Running the SUBMIT JOB INGEST command did nothing.
-- data/storage/nebula/2/download/sst1

Scenario - 3
I copied just the contents of one remote HDFS folder to all storage nodes as below. Running the SUBMIT JOB INGEST command successfully ingested all the values into the graph.

-- data/storage/nebula/2/download/1
-- data/storage/nebula/2/download/2
-- data/storage/nebula/2/download/3
.
.
-- data/storage/nebula/2/download/100
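
For anyone following along, scenario 3 roughly amounts to the following sketch (the namenode address and source folder are assumptions; run it on, or push the files to, every storage node):

```bash
# Copy the *contents* of one remote SST folder (the numbered part dirs)
# straight into the space's download directory on each storage node.
hdfs dfs -get "hdfs://namenode:9000/sst1/*" data/storage/nebula/2/download/

# Then, from a console session on the target space:
#   USE <space>; SUBMIT JOB INGEST;
```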

@wey-gu
Contributor

wey-gu commented Nov 30, 2022

@Nicole00 correct me if I'm wrong: I think putting the hdfs://.../sst1 files in place as in scenario 3, running the ingest, then deleting the download files from all storage nodes and doing the same thing for hdfs://.../sst2 is correct.

@Nicole00
Contributor

Nicole00 commented Dec 1, 2022

The ingest operation reads the part folders to ingest data into RocksDB, so for scenarios 1 and 2 the directory layout is not correct for ingest.

And to wey-gu's point, just one thing to add: whether the download folder is deleted after the ingest depends on your nebula-storaged config --move_files (refer to the doc). If your config is false, please delete the download folder manually, then continue to do the same thing for sst2.
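
So the per-folder loop, as a sketch (the paths are the ones from the scenarios above):

```bash
# Only needed when --move_files=false in nebula-storaged.conf;
# with --move_files=true the ingested files are moved out of download during ingest.
rm -rf data/storage/nebula/2/download/*

# Then copy the contents of the next remote folder (sst2) into download
# and run SUBMIT JOB INGEST again.
```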

@porscheme
Author

porscheme commented Dec 1, 2022

Thanks, now I'm clear on SST-file-based data ingestion into the graph. Essentially, it involves 3 parts:

  1. From Parquet files, generate SST files to remote HDFS
  2. Copy the SST files from HDFS to all storage nodes
  3. Run 'SUBMIT JOB INGEST'

Please read HDFS in the above steps as an Azure Data Lake Gen2 Storage Account.
I used the AzCopy tool for step #2 (sketched below).
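
Roughly, the three steps look like this; this is only a sketch, and the jar name, class, storage account URL, SAS token, and addresses are placeholders rather than anything from this thread:

```bash
# 1. Generate SST files with Exchange (written to the remote path set in application.conf)
spark-submit --master yarn --class com.vesoft.nebula.exchange.Exchange \
  nebula-exchange.jar -c application.conf

# 2. Copy the generated part folders from ADLS Gen2 to every storage node
azcopy copy \
  "https://<account>.dfs.core.windows.net/<container>/sst1/*?<sas-token>" \
  "data/storage/nebula/2/download/" --recursive

# 3. Trigger the ingest from the console
nebula-console -addr <graphd-host> -port 9669 -u root -p nebula \
  -e 'USE <space>; SUBMIT JOB INGEST;'
```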

Too many manual processes? I wish there were some automation.

@wey-gu
Contributor

wey-gu commented Dec 1, 2022

cc @veezhang @MegaByte875 could we add an HDFS client in the operator, so that we could do DOWNLOAD in k8s NebulaGraph, too?

@porscheme
Author

porscheme commented Dec 1, 2022

I forgot to mention, I had to make a few changes to the nebula-exchange code to add support for Azure Data Lake Gen2 Storage Account.

@wey-gu
Contributor

wey-gu commented Dec 1, 2022

> I forgot to mention, I had to make a few changes to the nebula-exchange code to add support for Azure Data Lake Gen2 Storage Account.

Thanks @porscheme, if that's worth applying upstream, your PR is more than welcome :)

@MegaByte875

> cc @veezhang @MegaByte875 could we add an HDFS client in the operator, so that we could do DOWNLOAD in k8s NebulaGraph, too?

I think a k8s Job could achieve importing SST files automatically.

@porscheme
Author

I'm having an issue with the SST data file import, please see vesoft-inc/nebula#5278 (comment).

Any help is really appreciated.

@wey-gu
Contributor

wey-gu commented Feb 6, 2023

Dear @porscheme,
It was the Chinese Lunar New Year holidays when you raised this question, sorry for replying to you so late.

I replied to you here: vesoft-inc/nebula#5278 (comment) :)

Thanks!

@QingZ11

QingZ11 commented Feb 23, 2023

@porscheme I noticed that you have opened a new issue to discuss the same problem, so this issue has been closed for now. If you have any updates, you can reopen this issue.
