Launch Notebooks on AWS with 1-click #16

Open
amit1rrr opened this issue Mar 23, 2019 · 6 comments
Labels
Feature Request A new feature that's under consideration.

Comments

@amit1rrr
Member

amit1rrr commented Mar 23, 2019

Problem

Reproducibility is a sore problem in data science. Simply put: given an experiment or analysis, how can someone else quickly rerun it? Reruns can be for verification, to inspect intermediate states, to modify some parameters, or simply to rerun the analysis with up-to-date data. Reruns should be easy, right? Not quite. Here are the challenges:

  • One needs to set up the exact same environment again. This includes dependent packages, Python versions, environment variables, data files, etc. Notebooks don't capture environment information anywhere.

  • Resource-intensive, long-running scripts require powerful machines in the cloud (e.g. GPU instances). It's time-consuming to manually set up the environment on these machines every time you want to run something. Often data scientists have to coordinate with DevOps folks for help with infrastructure, which adds coordination delays and costs time.
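To make the first challenge concrete, here is a minimal sketch of recording the environment that a notebook does not capture on its own. It uses only the Python standard library. The function names `snapshot_environment` and `to_requirements` are hypothetical, chosen for illustration, and are not part of any existing tool.

```python
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            packages[name] = version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


def to_requirements(snapshot):
    """Render the recorded package versions as pinned requirements.txt lines."""
    return [f"{name}=={version}" for name, version in snapshot["packages"].items()]
```

Writing the output of `to_requirements(snapshot_environment())` to a file next to the notebook would pin package versions, though it would not cover environment variables or data files.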

Solution

What if users could codify environment information only once and then launch any experiment/analysis with a single click?

What you do

  • Specify the environment once, as a Dockerfile or requirements.txt in the repo.
  • Create and specify AWS API keys so EC2 instances can be launched in your own account.
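As a rough sketch of what these two inputs could feed into, the snippet below assembles the arguments for an EC2 launch. The parameter names `ImageId`, `InstanceType`, `MinCount`, `MaxCount`, and `UserData` are those accepted by boto3's EC2 client `run_instances` call. Everything else is an assumption for illustration: the function names, the `/opt/notebooks` path, and the bootstrap steps. The issue does not describe the actual launch mechanism.

```python
import os
import shlex


def build_launch_request(ami_id, instance_type, repo_url, ref):
    """Build keyword arguments for boto3's EC2 client `run_instances` call."""
    # Startup script run on first boot: fetch the repo at the requested ref
    # and install its dependencies. Path and steps are illustrative only.
    user_data = "\n".join([
        "#!/bin/bash",
        f"git clone {shlex.quote(repo_url)} /opt/notebooks",
        "cd /opt/notebooks",
        f"git checkout {shlex.quote(ref)}",
        "pip install -r requirements.txt",
    ])
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
    }


def credentials_present(env=os.environ):
    """Check for the standard AWS credential environment variables."""
    return all(env.get(key) for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"))


# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("ec2").run_instances(**build_launch_request(...))
```

Keeping the request construction separate from the network call makes it possible to inspect what would be launched before any instance is started in the user's account.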

What you get

  • You can launch any Notebook (or an entire repository) on an EC2 instance type of your choice with a single click.
  • You can modify Notebooks, run them on EC2, and share access with your team members so they can see the results.
  • You get notified via email when a long-running notebook cell finishes execution.
  • These 1-click launch buttons/URLs can be shared anywhere, e.g. Slack, email, internal documentation, a GitHub README, etc.
  • These 1-click URLs will specify the repo's branch/commit/PR, so the links keep working as intended even as the repo content evolves over time.
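One way a shareable launch link could pin the repo state is to carry the ref in its query string. The sketch below is purely illustrative: the `launch.example.com` endpoint, the parameter names, and both function names are assumptions, since the issue does not define a URL format.

```python
from urllib.parse import parse_qs, urlencode, urlparse

LAUNCH_BASE = "https://launch.example.com/run"  # hypothetical endpoint


def make_launch_url(repo, ref, path, instance_type):
    """Encode the repo, pinned ref, notebook path, and instance type in a URL."""
    query = urlencode({
        "repo": repo,
        "ref": ref,  # branch name, commit SHA, or PR reference
        "path": path,
        "instance": instance_type,
    })
    return f"{LAUNCH_BASE}?{query}"


def parse_launch_url(url):
    """Recover the launch parameters from a URL built by make_launch_url."""
    params = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in params.items()}
```

A link that pins a commit SHA always opens the same content, while a link that names a branch follows the branch as it moves.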

Use cases

Hypothetical scenarios to give you a flavour of what's possible with this feature. Not a comprehensive list.

  • You want to run your experiment on a beefy machine in the cloud and be notified when the results arrive.
  • You want to run an analysis created by your teammate and build on top of it.
  • You want to share weekly engagement metrics as a Notebook that anyone in the company can run (Excel files and PDFs become stale from the moment they are sent out).
  • There is an open source implementation of the random forest algorithm on GitHub that you want to try out on your own dataset.
  • You want to review a pull request by actually trying out the proposed changes.
  • You want to enable non-technical people in the organisation to interact with data directly (via visualisations, reports, etc.).

All of the above are possible just by clicking around the UI, once the Notebook repositories are set up with the environment config.

Please see FAQ below.


Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature. I also welcome additional questions, comments, and discussion on the issue.

@amit1rrr amit1rrr self-assigned this Mar 23, 2019
@amit1rrr amit1rrr added the Feature Request A new feature that's under consideration. label Mar 23, 2019
@sebinsua

It's a nice idea but what about other cloud providers? For example, the company I'm at uses Azure.

Will this be specific to AWS?

@amit1rrr
Member Author

amit1rrr commented Mar 23, 2019

FAQs

What about other cloud providers?

  • We are starting with AWS and will extend support to GCP and Azure over time.

Doesn't BinderHub already solve this?

  • To launch private repos, you have to maintain your own BinderHub (non-trivial)
  • Can't choose instance type per launch
  • Private access tokens are shared between all users of BinderHub. See warning

@jason-curtis

Another similar product you may want to check for inspiration is Google's Colaboratory. I'm not sure what options they have in terms of instance types or provisioning (though I just checked that you can run !pip install to install pip packages). Naturally, their integration with Google Drive appears to be pretty baked-in, which could be a pro or a con depending on your organization!

@lamberta

Google Colab integrates with GitHub (as well as Drive) and is how it's used on tensorflow.org.
For example, this notebook lives in GitHub; you pass the GitHub path as part of the URL directly to Colab: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/basic_classification.ipynb
You must log in to run the notebook and your account basically gets a container to install packages, etc.
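The URL pattern in the example above can be sketched as a small helper. The function name `colab_url` is hypothetical, and the pattern is inferred only from the one URL quoted in this comment, so treat it as an assumption rather than a documented Colab API.

```python
from urllib.parse import quote


def colab_url(owner, repo, branch, notebook_path):
    """Build a Colab link for a notebook stored in a GitHub repository."""
    # quote() keeps "/" intact by default and escapes characters such as spaces.
    return (
        "https://colab.research.google.com/github/"
        f"{quote(owner)}/{quote(repo)}/blob/{quote(branch)}/{quote(notebook_path)}"
    )
```

Calling `colab_url("tensorflow", "docs", "master", "site/en/tutorials/keras/basic_classification.ipynb")` reproduces the link quoted above.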

@amit1rrr
Member Author

amit1rrr commented Mar 26, 2019

@thatneat @lamberta Thanks for the note about Google Colab. From what I've heard, there are a couple of limitations:

  • Accessing notebooks in a private repo is not that straightforward. Reference
  • You can't really choose the instance size/type (I guess it's either GPU or normal hardware).
  • All your private data, models, and code run on Google's Colab infrastructure.

@gramhagen

Azure Notebooks may be worth considering; leveraging Azure DevOps for build pipelines is also really nice. I understand you're planning to look at other cloud providers later, but imo Azure might be able to get you to an MVP faster.

@amit1rrr amit1rrr removed their assignment Jan 10, 2020
5 participants