Blocking tree are saved in parquet file #82 #120

navinrathore · 2022-01-08T21:47:44Z

"zingg.block" is now a directory comprising parquet files. Previously it was a single file.
BlockingtreePipe uses Overwrite mode by default. That should be harmless in read().

sonalgoyal · 2022-01-10T08:43:46Z

client/src/main/java/zingg/client/util/Util.java

@@ -469,7 +471,18 @@ public static String getCurrentLabelPath(String path) throws IOException {
 		return dupesActual;
 	}

-
-


Should there be some kind of buffering, e.g. BufferedOutputStream?

navinrathore · 2022-01-10

I feel buffering is not needed here. Buffering mainly helps with relatively expensive operations such as disk I/O or network activity, and only when the operation is performed frequently.
What is your opinion?

sonalgoyal · 2022-01-11

Buffering is pretty standard and I don't see how it harms. Please add it.

navinrathore · 2022-01-11

Yes, added. I hope this is the correct approach and behaves as expected.
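The buffering change itself is not shown in this thread, so the following is only a minimal sketch of what buffered streams around Java serialization can look like. It assumes the blocking tree is a `Serializable` object that is turned into a byte array before being stored. The class `BufferedTreeSerde` and its methods are hypothetical names used for illustration, not the code merged in this PR.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not taken from the Zingg code base.
final class BufferedTreeSerde {

    private BufferedTreeSerde() {
    }

    // Serializes an object to a byte array through a BufferedOutputStream.
    static byte[] toBytes(Serializable tree) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                new ObjectOutputStream(new BufferedOutputStream(sink))) {
            out.writeObject(tree);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the ObjectOutputStream flushes the buffer into the sink,
        // so the byte array is complete at this point.
        return sink.toByteArray();
    }

    // Reads the object back through a BufferedInputStream.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new ByteArrayInputStream(bytes)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of serialized object not found", e);
        }
    }
}
```

With an in-memory sink such as `ByteArrayOutputStream`, the buffer brings little measurable benefit, which is the point raised in the first reply. It pays off when the underlying stream is a file or network stream, and the wrapper costs almost nothing either way.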

sonalgoyal · 2022-01-10T08:44:38Z

core/src/main/java/zingg/util/BlockingTreeUtil.java

+		List<Object> objList = new ArrayList<>();
+		objList.add(byteArray);
+		JavaRDD<Row> rowRDD = ctx.parallelize(objList).map((Object row) -> RowFactory.create(row));
+		Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();


Shouldn't we coalesce the df before writing?

sonalgoyal

please take a look at the comments

sonalgoyal

buffering

navinrathore mentioned this pull request Jan 8, 2022

Blocking tree can not be saved in cloud environment #82

Closed

sonalgoyal reviewed Jan 10, 2022

sonalgoyal requested changes Jan 10, 2022

sonalgoyal requested changes Jan 11, 2022

navinrathore added 3 commits January 11, 2022 14:59

Blocking tree are saved in parquet file zinggAI#82

46e23ef

using coalescs() before saving blocking tree zinggAI#82

f1c6a28

using Buffered Streams in blocking tree read/write zinggAI#82

d70ef74

navinrathore force-pushed the zBlockingTree branch from b513ed9 to d70ef74 Compare January 11, 2022 09:35

sonalgoyal merged commit a88c043 into zinggAI:main Jan 11, 2022

navinrathore deleted the zBlockingTree branch March 9, 2022 12:11
