Few Questions: Importing SST files into Nebula graph #116

Closed
porscheme opened this issue Nov 29, 2022 · 14 comments
Labels
type/question Type: question about the product

Comments

@porscheme

General Question

I was following the Nebula official documentation.

Below are a few questions:

  1. Are the space's partition_num and the tags/edges partition we define in application.conf related?
  2. Should the tags/edges partition be equal to the space's partition_num?
  3. Can I submit SST file generation Spark jobs multiple times with different application.conf settings?
  4. Should I copy the SST folder from HDFS to all storage nodes?

@wey-gu Can you please help me here?

@Sophie-Xie added the type/question label on Nov 29, 2022
@wey-gu
Contributor

wey-gu commented Nov 29, 2022

  1. The partition in application.conf is Exchange's (i.e. the Spark app's) partition count. Say your graph space has partition_num 100: in the richest case (from a Spark-resource perspective), you could have Exchange running 100 tasks (configure 100 partitions) in parallel. But the number of partitions in Exchange/Spark is also limited by the Spark cores (the Spark community recommends not setting the partition count larger than 2-3 times the number of Spark cores).
  2. As in 1, you could, but if you don't have that many Spark cores, set it to 2-3 times the number of Spark cores (see the example config below).
  3. Can I submit SST file generation Spark jobs multiple times with different application.conf settings? I think yes; different Spark applications/Exchange runs will read storaged in parallel if you do so (I assume "multiple times" means generating for different spaces?).
  4. Yes, you should copy it to all storage nodes (those where the corresponding graph space's partitions are placed).
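
For illustration, a minimal per-tag sketch of where that partition setting lives in application.conf (the layout follows the Exchange docs; the tag name, fields, and paths are placeholders, not values from this thread):

```hocon
tags: [
  {
    name: player
    type: {
      source: parquet
      sink: sst
    }
    path: "hdfs://namenode:9000/input/player.parquet"
    fields: [name, age]
    nebula.fields: [name, age]
    vertex: { field: id }
    batch: 256
    # concurrency of data processing; keep it within ~2-3x of your Spark cores
    partition: 100
  }
]
```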

@Nicole00 correct me if anything is incorrect :)

@Nicole00
Contributor

Thanks @wey-gu for your answers.
@wey-gu's description is right; I just want to add something more. The process for generating SST files is: read the data source, process the data into SST format, sort the SST-format data inside each Spark partition, and write the SST-format data into SST files.
The partition in the configuration file is the concurrency used for data processing, and the concurrency of writing to SST files depends on the configuration item repartitionWithNebula. If repartitionWithNebula is true, the writing concurrency is the same as the NebulaGraph space's partition number; if false, the writing concurrency is the same as spark.sql.shuffle.partitions.
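
As a rough sketch (option names follow the Exchange docs; the values are placeholders), the two knobs above sit in the per-tag section of application.conf:

```hocon
tags: [
  {
    # ... source/sink/fields as usual ...
    # concurrency of reading and processing the data source
    partition: 100
    # true  -> write concurrency follows the space's partition_num
    # false -> write concurrency follows spark.sql.shuffle.partitions
    repartitionWithNebula: true
  }
]
```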

@Nicole00
Contributor

Another point: when you submit the SST file generation job multiple times, please make sure your remote path (HDFS path) is different, or there may be some overwrite issues.
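
For example (a sketch only; the namenode address and folder names are placeholders), point each run at its own remote folder in application.conf:

```hocon
nebula: {
  path: {
    local: "/tmp"
    # run 1 wrote to /sst1; give run 2 a different folder, e.g. /sst2
    remote: "/sst1"
    hdfs.namenode: "hdfs://namenode:9000"
  }
}
```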

@porscheme
Author

porscheme commented Nov 29, 2022

> Another point: when you submit the SST file generation job multiple times, please make sure your remote path (HDFS path) is different, or there may be some overwrite issues.

  1. Should I copy all the remote paths (HDFS paths) as subfolders of the 'download' folder under the space ID in the data/storage/nebula directory? This did not work for me (see below).
  2. Copying the contents of the remote HDFS path directly into the 'download' folder under the space ID in the data/storage/nebula directory worked. Please confirm this is how it's supposed to be.
  3. After a successful INGEST, can I recreate the download folder and use it for another remote folder?

@wey-gu, @Nicole00
Please help

For example, say I have the following:
-- my space ID is 2
-- partition_num is 100
-- I have 3 remote HDFS paths: sst1, sst2, sst3

Scenario - 1
I copied all remote SST folders to all storage nodes as below. Running the SUBMIT JOB INGEST command did nothing.
-- data/storage/nebula/2/download/sst1
-- data/storage/nebula/2/download/sst2
-- data/storage/nebula/2/download/sst3

Scenario - 2
I copied just one remote HDFS folder to all storage nodes as below. Running the SUBMIT JOB INGEST command did nothing.
-- data/storage/nebula/2/download/sst1

Scenario - 3
I copied just the contents of one remote HDFS folder to all storage nodes as below. Running the SUBMIT JOB INGEST command successfully ingested all the values into the graph.

-- data/storage/nebula/2/download/1
-- data/storage/nebula/2/download/2
-- data/storage/nebula/2/download/3
.
.
-- data/storage/nebula/2/download/100
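
For anyone following along, scenario 3 roughly amounts to the following sketch (the namenode address and source folder are assumptions; run it on, or push the files to, every storage node):

```bash
# Copy the *contents* of one remote SST folder (the numbered part dirs)
# straight into the space's download directory on each storage node.
hdfs dfs -get "hdfs://namenode:9000/sst1/*" data/storage/nebula/2/download/

# Then, from a console session on the target space:
#   USE <space>; SUBMIT JOB INGEST;
```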

@wey-gu
Contributor

wey-gu commented Nov 30, 2022

@Nicole00 correct me if I'm wrong: I think putting the hdfs://.../sst1 files in place as in scenario 3, running the ingest, then deleting the download files from all storage nodes and doing the same thing for hdfs://.../sst2 is correct.

@Nicole00
Contributor

Nicole00 commented Dec 1, 2022

The ingest operation reads the part folders to ingest data into RocksDB, so for scenarios 1 and 2 the directory layout is not correct for ingest.

And to wey-gu's point, just one thing to add: whether the download folder is deleted after the ingest depends on your nebula-storaged config --move_files (refer to the doc). If your config is false, please delete the download folder manually, then continue to do the same thing for sst2.
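
So the per-folder loop, as a sketch (the paths are the ones from the scenarios above):

```bash
# Only needed when --move_files=false in nebula-storaged.conf;
# with --move_files=true the ingested files are moved out of download during ingest.
rm -rf data/storage/nebula/2/download/*

# Then copy the contents of the next remote folder (sst2) into download
# and run SUBMIT JOB INGEST again.
```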

@porscheme
Author

porscheme commented Dec 1, 2022

Thanks, now I'm clear on SST-file-based data ingestion into the graph. Essentially, it involves 3 parts:

  1. From Parquet files, generate SST files to remote HDFS
  2. Copy the SST files from HDFS to all storage nodes
  3. Run 'SUBMIT JOB INGEST'

Please read HDFS in the above steps as an Azure Data Lake Gen2 Storage Account.
I used the AzCopy tool for step #2 (sketched below).
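
Roughly, the three steps look like this; this is only a sketch, and the jar name, class, storage account URL, SAS token, and addresses are placeholders rather than anything from this thread:

```bash
# 1. Generate SST files with Exchange (written to the remote path set in application.conf)
spark-submit --master yarn --class com.vesoft.nebula.exchange.Exchange \
  nebula-exchange.jar -c application.conf

# 2. Copy the generated part folders from ADLS Gen2 to every storage node
azcopy copy \
  "https://<account>.dfs.core.windows.net/<container>/sst1/*?<sas-token>" \
  "data/storage/nebula/2/download/" --recursive

# 3. Trigger the ingest from the console
nebula-console -addr <graphd-host> -port 9669 -u root -p nebula \
  -e 'USE <space>; SUBMIT JOB INGEST;'
```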

Too many manual processes? I wish there were some automation.

@wey-gu
Contributor

wey-gu commented Dec 1, 2022

cc @veezhang @MegaByte875 could we add an HDFS client in the operator, so that we could do DOWNLOAD in k8s NebulaGraph, too?

@porscheme
Author

porscheme commented Dec 1, 2022

I forgot to mention, I had to make a few changes to the nebula-exchange code to add support for Azure Data Lake Gen2 Storage Account.

@wey-gu
Contributor

wey-gu commented Dec 1, 2022

> I forgot to mention, I had to make a few changes to the nebula-exchange code to add support for Azure Data Lake Gen2 Storage Account.

Thanks @porscheme, if that's worth applying upstream, your PR is more than welcome :)

@MegaByte875

> cc @veezhang @MegaByte875 could we add an HDFS client in the operator, so that we could do DOWNLOAD in k8s NebulaGraph, too?

I think a k8s Job could achieve importing SST files automatically.

@porscheme
Author

I'm having an issue with the SST data file import, please see vesoft-inc/nebula#5278 (comment).

Any help is really appreciated.

@wey-gu
Contributor

wey-gu commented Feb 6, 2023

Dear @porscheme,
It was the Chinese Lunar New Year holidays when you raised this question, sorry for replying to you so late.

I replied to you here: vesoft-inc/nebula#5278 (comment) :)

Thanks!

@QingZ11

QingZ11 commented Feb 23, 2023

@porscheme I noticed that you have opened a new issue to discuss the same problem, so this issue has been closed for now. If you have any updates, you can reopen this issue.
