Blocking tree are saved in parquet file #82 #120
@@ -469,7 +471,18 @@ public static String getCurrentLabelPath(String path) throws IOException {
    return dupesActual;
}
Should there be some kind of buffering here, e.g. a BufferedOutputStream?
I don't think buffering is needed here. Buffering mainly helps with relatively expensive operations such as disk I/O or network activity, and only when the operation is performed often.
What do you think?
Buffering is pretty standard, and I don't see how it does any harm. Please add it.
Yes, added. I hope this is the correct way to do it and that it behaves as expected.
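For reference, a minimal sketch of what buffered serialization to a byte array could look like. The PR's actual code is not shown in this thread, so the class name `BufferedSerializationSketch`, the helpers `toBytes` and `fromBytes`, and the `ArrayList` standing in for the blocking tree are all hypothetical. The point to watch is that the buffered stream must be flushed or closed before the bytes are read back out, otherwise the array can be truncated.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the code from this PR.
public class BufferedSerializationSketch {

    // Serialize an object to bytes through a BufferedOutputStream.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // try-with-resources closes the ObjectOutputStream, which flushes
        // the BufferedOutputStream into the underlying byte array stream.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new BufferedOutputStream(bytes))) {
            out.writeObject(obj);
        }
        // Safe to read only after the buffered stream has been flushed.
        return bytes.toByteArray();
    }

    // Read the object back through a BufferedInputStream.
    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                 new BufferedInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the blocking tree; any Serializable object works.
        ArrayList<String> tree = new ArrayList<>(List.of("root", "left", "right"));
        byte[] serialized = toBytes(tree);
        Object restored = fromBytes(serialized);
        System.out.println("round trip equal: " + tree.equals(restored));
    }
}
```

When the target is an in-memory `ByteArrayOutputStream`, the extra buffer gains little; it matters more once the stream is backed by a file or a socket.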
List<Object> objList = new ArrayList<>();
objList.add(byteArray);
JavaRDD<Row> rowRDD = ctx.parallelize(objList).map((Object row) -> RowFactory.create(row));
Dataset<Row> df = spark.sqlContext().createDataFrame(rowRDD, schema).toDF();
Shouldn't we coalesce the df before writing?
Added.
Please take a look at the comments.
buffering
b513ed9 to d70ef74
"zingg.block" is now a directory of parquet files. Previously it was a single file.
BlockingtreePipe uses Overwrite mode by default. That should be OK in read().