Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PR Meta Issue #22

Open
9 of 10 tasks
fschueler opened this issue Apr 18, 2016 · 7 comments
Open
9 of 10 tasks

Update PR Meta Issue #22

fschueler opened this issue Apr 18, 2016 · 7 comments

Comments

@fschueler
Copy link

fschueler commented Apr 18, 2016

I copy and paste Matthias' suggestions here. Let's keep track of it and coordinate the work in this issue.

  • APIs and interfaces:
    Unfortunately, you've put already effort into API extensions. I would recommend to remove the modifications of MLOutput and MLContext because MLOutput is not maintained and MLContext is about to get a major refactoring. Let's use the command line interface for now and extend the APIs once the backend is ready.
  • Isolation of Flink dependencies: We need to be very careful about introducing Flink API dependencies into shared classes in order to ensure that Spark and Flink backends can be used without dependency on the other. For example, there are (1) references to Flink in runtime/controlprogram/Program, and (2) references to the SparkExecutionContext in Flink execution modes. Could you please systematically go over shared classes and double check for isolation?
  • Bufferpool integration: I agree that the duplication of RDDObject and DataSetObject creates unnecessary redundancy in a core place like our bufferpool (CacheableData and ExecutionContext). Let's abstract this to something like DistDataObject with two implementations RDDObject and DataSetObject. Just let me know, I'd be happy to prepare this for you guys.
  • Broadcast memory budget: Currently, the broadcast memory budget is obtained from the SparkExecutionContext. It would be good to automatically obtain this size (e.g., size for user objects) to remove the dependency and because it is really important for operator selection.
  • Checkpointing: I see a CheckpointFLInstruction which always does a pass-through. Let's remove this instruction (until Flink supports distributed caching) and simply do not apply the rewrites for injecting these checkpoints (RewriteInjectSparkLoopCheckpointing, RewriteInjectSparkPReadCheckpointing).
  • Instructions w/ broadcasts: There seem to be some inconsistencies wrt when and how we broadcast data (e.g., join vs join w/ hint vs withBroadcastSet). I understand that Flink offers optimization capabilities in that regard. However, from a SystemML perspective I would recommend to stick to physical operator semantics (e.g., MAPMM with broadcast, CPMM/RMM with join). If experiments show that Flink actually does better than our operator selection, we could simply compile earlier (for smaller data sizes) CPMM/RMM. Btw, is there a way to pin the broadcast in distributed memory and reuse it across multiple invocations?
  • Advanced operations: You already added compiler support for advanced operations like weighted squared loss and cumulative aggregates. Let's put these aside for now and get a narrow prototype running well first before we broaden the scope.
  • Tests: It would be good if we could add a few tests for flink mapmm and tsmm to ensure they are producing correct results as this is the precondition for any experiments. Please have a look into FullMatrixMultiplicationTransposeSelfTest and FullDistributedMatrixMultiplicationTest.
  • Exception handling: There are a couple of instances where exceptions are simply put to stderr. Could you please double check that all caught exceptions are rethrown in order to simplify debugging. Also, please replace the assertions for dimensions mismatch checks with exceptions as these checks need to happen even if assertions are disabled (by default).
  • Formatting: Well, we don't have a style guide defined yet but the project-wide convention is to use tabs instead of spaces (e.g., issues in DMLScript). Furthermore, you've also done some code cleanups of formatting existing code. Could we separate these changes out of this PR to simplify the review? Feel free to submit them as an individual PR though.
@FelixNeutatz
Copy link

I am working on AggBinary problem and try to make the linear test run on "FLINK" mode

@carabolic
Copy link
Member

I'll have a look at the Instructions w/ broadcasts issue. I will replace the general implementation with the one given by the name. That is replace Flink's join operator with BroadcastSets.

@fschueler
Copy link
Author

I am having a look at the unnecessary API changes and Flink dependencies. The removal of the Flink dependency from Program.java touches the question of where to put the env.execute() parts ( #15 )

@fschueler
Copy link
Author

I think for the CheckpointFLInstruction, we can just remove it since it was not generated during optimization anyways...

@fschueler
Copy link
Author

I went over the Flink/Spark interdependencies but we should double check before we check this off the list.

@fschueler
Copy link
Author

Is this up to date? What is left to do for the next update of the PR?

@j143-zz
Copy link

j143-zz commented Jul 21, 2017

Hi @fschueler !
May be we can have a fresh look on this flink integration status and relist what are the current bottlenecks or things to be taken care off. I think this integration is important for engineering applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants