Blocking tree is saved in a parquet file #82 #120

Merged
merged 3 commits into from Jan 11, 2022

Conversation

navinrathore
Contributor

"zingg.block" is now a directory comprising parquet files; previously it was a single file.
BlockingtreePipe uses Overwrite mode by default. That should be OK in read(), since the save mode only matters when writing.

@@ -469,7 +471,18 @@ public static String getCurrentLabelPath(String path) throws IOException {
return dupesActual;
}



Member

Should there be some kind of buffering here, e.g. a BufferedOutputStream?

Contributor Author

I feel there should not be any buffering here. Buffering mainly pays off for relatively expensive operations such as disk I/O or network activity, and only when the operation is performed frequently.
What is your opinion?

Member

Buffering is pretty standard, and I don't see how it would do any harm. Please add it.

Contributor Author

Yes, added. I hope this is the correct way to do it and that it behaves as expected.
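For readers following the thread, here is a minimal sketch of the kind of buffering being discussed: an object is serialized to a byte array with a BufferedOutputStream placed between the ObjectOutputStream and the in-memory sink. The class name `SerializationSketch` and the methods `toBytes` and `fromBytes` are hypothetical and are not taken from this pull request; the actual change may be structured differently.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not Zingg's API.
final class SerializationSketch {

    // Serializes obj into a byte array through a buffered stream chain.
    static byte[] toBytes(Object obj) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // try-with-resources closes the whole chain, which also flushes the
        // buffer, so no bytes are left behind before toByteArray() is called.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new BufferedOutputStream(sink))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Restores an object from bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The point that is easy to miss with buffering is the flush: if the byte array is read before the buffered stream is flushed or closed, the result can be truncated. With an in-memory sink such as ByteArrayOutputStream the performance gain is likely small, which is consistent with both views expressed above.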

List<Object> objList = new ArrayList<>();
objList.add(byteArray);
JavaRDD<Row> rowRDD = ctx.parallelize(objList).map((Object row) -> RowFactory.create(row));
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();

Member

Shouldn't we coalesce the df before writing?

Contributor Author

Added.

Member

@sonalgoyal sonalgoyal left a comment

Please take a look at the comments.

Member

@sonalgoyal sonalgoyal left a comment

buffering

@sonalgoyal sonalgoyal merged commit a88c043 into zinggAI:main Jan 11, 2022
@navinrathore navinrathore deleted the zBlockingTree branch March 9, 2022 12:11