Roadmap after v0.2 #26

Closed
spirali opened this issue Apr 8, 2018 · 9 comments
spirali (Collaborator) commented Apr 8, 2018

We have received a lot of feedback on our reddit post (https://www.reddit.com/r/rust/comments/89yppv/rain_rust_based_computational_framework/). I think now is the time to recap our plans and maybe reconsider priorities to reflect the real needs of users. This is meant as a kind of brainstorming whiteboard; each individual task should get its own issue in the end. Also, I would like to focus on a relatively short-term roadmap (say, things that could be finished within 3 months); maybe we can create a similar post for our long-term plans.

EDIT (gavento): Moved @spirali's TODO to a post below.

gavento changed the title from "Roadmap" to "Roadmap after 0.2" Apr 12, 2018
gavento (Contributor) commented Apr 12, 2018

Based on the feedback from the mentioned Reddit discussion, our long-term goals and internal discussion, this is the list of issues to work on, with their [priority], (assignee) and their sub-tasks.

Prioritized enhancements

Custom tasks (subworkers) in more languages

Requested by several people in the discussion, and it seems like a good idea anyway. For now, with Capnp.

Easier deployment in the cloud

Packaging for easier deployment

Multiple options, priorities may vary. (@spirali)

  • Changelog [high]
  • PyPI package for easy client installation [high]
    • Needs a name (rain is taken). Suggestions?
  • Crates.io package [medium]
    • Needs a different name (rain is taken). Suggestions?
  • AppImage/Snap packages [low] (we already have static binaries)
  • Deb/other distro packages [low]

Fix current bugs

Improve Python API

Pythonize the client API.
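As a rough illustration only (the names below are hypothetical, not necessarily the current rain.client API), "pythonized" could mean context-managed sessions and plain Python values:

```python
# Hypothetical sketch of a more "pythonic" client API; the exact names
# of the Client, tasks and blob methods are illustrative assumptions.
from rain.client import Client, tasks, blob

client = Client("localhost", 7210)

# Sessions as context managers instead of explicit open/close calls.
with client.new_session() as session:
    task = tasks.concat(blob(b"Hello "), blob(b"world!"))
    task.output.keep()
    session.submit()
    print(task.output.fetch())  # -> b"Hello world!"
```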

Improve testing infrastructure

Client-side protocols

Replace the capnp RPC and the current monitoring dashboard HTTP API with a common protocol.
Part of #11 (more discussion there) but specific to the public API.

Improve the dashboard with more information and post-mortem analysis

More real-world code examples

Lower priority, best driven by real use-cases. Ideas: numpy subtasks, C++/Rust subworkers.

Enhancements to revisit in the (not so distant) future

  • Integration with some popular libraries
    • Apache Arrow content-type
    • XGBoost tasks, etc.
    • Why not now: Not clear what the demand would be
  • Worker configuration files (needed for common (CPU) and special resources (GPU), different subworker locations and configurations, ...)
    • Why not now: Needs to be thought through (esp. w.r.t. resources), not needed right now
  • Separate session construction and running (save/load session)
    • Why not now: Not clear what the use-cases would be; not difficult once the API has stabilized
  • Clients in other languages: Rust, C++, Java, ...
    • Why not now: Not clear what the demand would be. Easier after the protocol and Python API stabilize.
  • Scale the scheduler, benchmarks
    • There is a benchmark in utils/bench/simple_task_scaling.py. The results as of 0.2 are here.
    • Why not now: While eventually crucial, the scheduler is sufficient as long as there are <1000 tasks to be scheduled at once.

gavento (Contributor) commented Apr 12, 2018

@spirali's original TODO notes

First, I start with my TODO list as it looked before the reddit post:

  • Basic resilience, so that a crash of a worker (or subworker) does not take down the whole infrastructure. This needs some cleanup in the server, which has already begun.
  • Datastore revamp - The goal is to simplify the data-fetching API. The current implementation is the only serious usage of capnp's distributed objects. Even though I really love the idea, it causes problems if we want to support additional RPCs (e.g. a REST API for clients) or even get rid of capnp. The current design also makes it difficult to redirect fetching to a different source, which is necessary for resilience.
  • The possibility to just write client = Client("my.env") in the client to connect to the Rain infrastructure, where "my.env" is a file created by our Exoscale scripts and (in the near future) by "rain start", plus some additional tweaks to make starting the infrastructure easier (see the sketch after this list).
  • Make some tweaks to the dashboard, especially when a session contains a large graph.
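A minimal sketch of how such a one-line connection could work, assuming "my.env" is a simple key=value file; the key names and the client_from_env helper are illustrative, not an actual format:

```python
# Hypothetical sketch: build a Client from an env-style file.
# SERVER_HOST/SERVER_PORT and client_from_env are illustrative
# assumptions, not a defined "my.env" format.
from rain.client import Client

def client_from_env(path):
    conf = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                conf[key.strip()] = value.strip()
    return Client(conf["SERVER_HOST"], int(conf["SERVER_PORT"]))

client = client_from_env("my.env")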

The following items were already in our long-term goals, but we should reconsider their priority:

  • Implement additional subworkers. It should be relatively easy to implement a simple library for e.g. Rust and C++ that provides the basic subworker interface and allows Rust/C++ code to be used in Rain directly. There are some open questions about how exactly the API should look, but preparing an initial prototype should be no problem. I think the only real question is whether we should wait for the decision about the new RPC protocol. The good thing is that worker<->subworker communication is quite separated from the rest (because it refers to local objects - e.g. it does not even use the datastore API, hence we do not have to wait for the datastore revamp).

  • Implement additional clients. It seems that having non-Python clients is more popular than we expected. As far as I know, we did not discuss this option much; however, creating a working prototype should be relatively easy. The question of waiting for the new RPC protocol is more serious here, though.

gavento added the enhancement (New feature or request) and question (Further information is requested) labels Apr 12, 2018
spirali (Collaborator, Author) commented Apr 12, 2018

Is a "Python subworker as a library" necessary? I have the feeling that for every environment where we can transfer a function from the client to a subworker in a reasonable (and portable) way, we should do it that way. The overhead of transferring a function is minimal (it is done only once) and the flexibility is huge. I consider building a "fixed" subworker a kind of side-step for environments where there is no such option (C++/Rust [?*]).

  • Having the possibility to annotate a Rust function in a Rust client and send it to a generic Rust subworker would be a killer feature, but I do not know whether this is possible.
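For reference, the transfer mechanism itself is just a (cloud)pickle round trip; a minimal sketch with all Rain specifics left out:

```python
import pickle
import cloudpickle

# Client side: serialize the task function once, by value
# (its bytecode and closure travel with it).
def my_task(x, y):
    return x + y

payload = cloudpickle.dumps(my_task)

# Subworker side: plain pickle.loads is enough to rebuild the function.
fn = pickle.loads(payload)
assert fn(1, 2) == 3
```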

gavento (Contributor) commented Apr 12, 2018

I can imagine some scenarios where a standalone Python subworker could be useful:

  • When your task code is extensive and stable across pipelines. Currently you need to somehow import it and let cloudpickle transfer it. (Not really elegant.)
  • When you have Python 2.7 (legacy scientific code) or PyPy (for speed) code to run. We can probably make the subworker Python 2 compatible relatively easily if we want.
  • When you have Cython or other binary extensions (does cloudpickle transfer them reliably / at all?). Even if you deploy the binaries manually, you may have a different Python ABI version on the workers than on your client.

Also, the built-in pytask subworker can be trivially implemented as one such subworker task (with a bit of unpacking logic; see the sketch below), so it is not much more work.
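The "bit of unpacking logic" could be as small as this sketch; run_pickled_task and its (payload, inputs) contract are illustrative assumptions, not the actual pytask wire format:

```python
import pickle

# Sketch of a generic "run a pickled function" task on top of a fixed
# subworker; the (payload, inputs) contract is an assumption.
def run_pickled_task(payload, inputs):
    """Rebuild the cloudpickled function and apply it to the task inputs."""
    fn = pickle.loads(payload)
    return fn(*inputs)
```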

spirali (Collaborator, Author) commented Apr 13, 2018

  • I think that even if you have a stable pipeline, sending it to the subworkers costs nothing (especially compared to the fact that you are using Python), while setting up the infrastructure with your own subworkers is always more painful (admin work, changes & updates).

  • PyPy actually works right now - there is no problem cloudpickling an object in CPython and unpickling it in PyPy. Capnp works in PyPy.

  • Cloudpickle transports Python bytecode, not binary code, so if you have a library installed on both ends, there is no ABI problem (see the sketch below).

However, I now see that it can be useful to define tasks that may be called e.g. from a Java client, where cloudpickle is not easily accessible.
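A small sketch of the bytecode point: cloudpickle serializes your own function by value, while an installed library is only referenced by module name, so no binary code crosses the wire:

```python
import pickle
import cloudpickle
import numpy as np

def my_task(xs):
    # numpy is referenced by module name; its binary code never travels,
    # so the other end just needs some numpy installed (any ABI).
    return float(np.mean(xs))

payload = cloudpickle.dumps(my_task)  # my_task's bytecode goes by value
fn = pickle.loads(payload)            # e.g. in PyPy or another CPython
print(fn([1.0, 2.0, 3.0]))            # -> 2.0
```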

spirali (Collaborator, Author) commented Apr 17, 2018

PR #40 implements the replacement of the DataStore API with direct calls.

vojtechcima (Collaborator) commented

PR #52 implements Exoscale deployment scripts.

yingfeng commented Jun 9, 2018

These are the things I think would be useful in the future:

  • Being able to be orchestrated by Kubernetes.
  • Providing APIs compatible with popular Python libraries, including sklearn, numpy and pandas. For example, some ideas could be borrowed from dask, a distributed data processing framework with such compatible APIs; however, dask suffers from performance as well as fault-tolerance issues. As @gavento listed above, integration with XGBoost or other machine learning libraries will be required in the future; however, that integration would not be very useful without such a design, which would serve as the data pre-processing stage for those machine learning libraries. On the other hand, Spark is a strong contender.
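For context, this is the kind of numpy-compatible API that dask already offers (real dask code, shown only to illustrate what a compatible Rain API could aim for):

```python
# dask.array mirrors the numpy API while executing lazily/distributed.
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
result = (x + x.T).mean(axis=0)  # written exactly like numpy
print(result.compute())          # evaluation happens here
```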

gavento mentioned this issue Jul 2, 2018
gavento (Contributor) commented Jul 2, 2018

Transitioned to #64 after v0.3 release

gavento closed this as completed Jul 2, 2018
gavento changed the title from "Roadmap after 0.2" to "Roadmap after v0.2" Jul 2, 2018
gavento added the meta label and removed the question (Further information is requested) label Jul 2, 2018
gavento added this to the v0.3 milestone Jul 2, 2018