
Should repartitioning be forcibly enabled when the number of nebula space partitions is greater than 1? #71

Closed
df1-df1 opened this issue Mar 7, 2022 · 3 comments · Fixed by #102
Labels
doc affected PR: improvements or additions to documentation

Comments

@df1-df1

df1-df1 commented Mar 7, 2022

I found a problem that results in the generated SST file containing only the key, without the TagID:

[screenshot: keys in the generated SST file, missing the TagID]

Description: According to the structure of the 3.0 vertex data:

[screenshot: structure of the Nebula 3.0 vertex key]

If all goes well, the SST file will contain data for both keys when the Exchange job finishes:

    {
      name: tag-name-1
      type: {
        source: csv
        sink: sst
      }
      path: hdfs tag path 2

      fields: [csv-field-0, csv-field-1, csv-field-2]
      nebula.fields: [nebula-field-0, nebula-field-1, nebula-field-2]
      vertex: {
        field: csv-field-0
      }
      separator: ","
      header: true
      batch: 256
      partition: 32
      repartitionWithNebula: false
    }

However, with the configuration above, the generated SST files contain only the key without the TagID.

Here's why: the SST writer is re-created whenever the Nebula part of the incoming key changes, so later data in the same task overwrites earlier data for the same part:

https://github.com/DemocracyAndLiberty/nebula-exchange/blob/master/exchange-common/src/main/scala/com/vesoft/exchange/common/writer/FileBaseWriter.scala

        // A new SST writer is created whenever the Nebula part of the
        // incoming key differs from the part currently being written.
        if (part != currentPart) {
          if (writer != null) {
            writer.close()
            val localFile = s"$localPath/$currentPart-$taskID.sst"
            // The remote name depends only on the part and the task id, so
            // if the same part appears again later in this task, the next
            // upload replaces this file.
            HDFSUtils.upload(localFile,
                             s"$remotePath/${currentPart}/$currentPart-$taskID.sst",
                             namenode)
            Files.delete(Paths.get(localFile))
          }
          currentPart = part
          val tmp = s"$localPath/$currentPart-$taskID.sst"
          writer = new NebulaSSTWriter(tmp)
          writer.prepare()
        }
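
To make the overwrite concrete, here is a minimal sketch (hypothetical data, with a map standing in for HDFS; this is not the actual Exchange code) showing that when part ids interleave within one task, a second visit to the same part reuses the same file name and discards what was uploaded before:

    object OverwriteSketch {
      def main(args: Array[String]): Unit = {
        val taskID = 0
        // Keys as they may arrive with repartitionWithNebula = false:
        // part 1 appears, then part 2, then part 1 again.
        val keys = Seq((1, "vertexKey-A"), (2, "vertexKey-B"), (1, "tagKey-A"))

        val uploaded = scala.collection.mutable.Map.empty[String, List[String]]
        var currentPart = -1
        var buffer = List.empty[String]

        // "Upload": the file name depends only on part and task id, so a
        // later visit to the same part replaces the earlier file.
        def upload(): Unit =
          uploaded(s"$currentPart-$taskID.sst") = buffer.reverse

        for ((part, key) <- keys) {
          if (part != currentPart) {
            if (currentPart != -1) upload()
            currentPart = part
            buffer = Nil
          }
          buffer ::= key
        }
        upload()

        // Map(2-0.sst -> List(vertexKey-B), 1-0.sst -> List(tagKey-A)):
        // vertexKey-A for part 1 has been overwritten.
        println(uploaded)
      }
    }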

According to https://github.com/DemocracyAndLiberty/nebula-exchange/blob/master/exchange-common/src/main/scala/com/vesoft/exchange/common/processor/Processor.scala, I noticed that setting repartitionWithNebula to true solves this problem when the number of nebula space partitions is greater than 1.
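
For reference, repartitioning by the Nebula part before writing, sketched below with Spark's RDD API (the partitioner and method names are illustrative, not the actual Processor.scala code), groups and sorts all keys of one part into a single Spark task, so the writer visits each part exactly once:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    object RepartitionSketch {
      // Hypothetical partitioner: route every record to the Spark partition
      // that matches its Nebula part id, so one task writes one part.
      class NebulaPartPartitioner(numParts: Int) extends Partitioner {
        override def numPartitions: Int = numParts
        override def getPartition(key: Any): Int = {
          val (part, _) = key.asInstanceOf[(Int, Array[Byte])]
          (part - 1) % numParts
        }
      }

      // Records are keyed by (nebula part id, encoded SST key). Sorting
      // within each Spark partition also satisfies the RocksDB SST writer,
      // which requires keys in ascending order.
      def repartitionByNebulaPart(
          data: RDD[((Int, Array[Byte]), Array[Byte])],
          numParts: Int): RDD[((Int, Array[Byte]), Array[Byte])] = {
        implicit val ord: Ordering[(Int, Array[Byte])] =
          new Ordering[(Int, Array[Byte])] {
            def compare(a: (Int, Array[Byte]), b: (Int, Array[Byte])): Int = {
              val byPart = Integer.compare(a._1, b._1)
              if (byPart != 0) byPart
              else java.util.Arrays.compareUnsigned(a._2, b._2) // Java 9+
            }
          }
        data.repartitionAndSortWithinPartitions(new NebulaPartPartitioner(numParts))
      }
    }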

So, should repartitioning be forcibly enabled when the number of nebula space partitions is greater than 1?

@wey-gu wey-gu added the doc affected PR: improvements or additions to documentation label Mar 8, 2022
@wey-gu
Contributor

wey-gu commented Mar 8, 2022

Dear @DemocracyAndLiberty

Thanks a lot for your excellent analysis and suggestions!

  • The default value (false) could be revisited
    • What is the cost of turning repartitionWithNebula to true?
  • The impact of this value on v3.0.0 (repartitionWithNebula: false by default will result in losing the TagID in the SST file) should be highlighted in the documentation

ref:

cc @Aiee @Sophie-Xie

@wey-gu
Contributor

wey-gu commented May 7, 2022

We should revisit when repartitionWithNebula: false is OK to use.

@Nicole00

@Minnull

Minnull commented Sep 26, 2022

I found a problem after turning on repartitioning: it limits task concurrency, so execution speed is constrained.
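
One way to see this limit (a hypothetical illustration, not Exchange code): repartitioning to the space's partition count caps the number of Spark tasks that can run in parallel at that count:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    object ConcurrencySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[8]")
          .appName("concurrency-sketch")
          .getOrCreate()
        val numNebulaParts = 10
        val data = spark.sparkContext.parallelize(1 to 1000000, 128)
        // After partitionBy, at most numNebulaParts tasks can run in
        // parallel, no matter how many cores or executors are available.
        val repartitioned = data.map(i => (i % numNebulaParts, i))
          .partitionBy(new HashPartitioner(numNebulaParts))
        println(repartitioned.getNumPartitions) // 10, down from 128
        spark.stop()
      }
    }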
