Launch Notebooks on AWS with 1-click #16

Open
@amit1rrr

Problem

Reproducibility is a persistent problem in data science. Simply put: given an experiment or analysis, how can someone else quickly rerun it? Reruns can be for verification, to inspect intermediate states, to modify some parameters, or simply to rerun the analysis with up-to-date data. Reruns should be easy, right? Not quite. Here are the challenges:

  • One needs to set up the exact same environment again. This includes dependent packages, Python versions, environment variables, data files etc. Notebooks don't capture environment information anywhere.

  • Resource-intensive, long-running scripts require powerful machines in the cloud (e.g. with GPUs). It's time-consuming to manually set up the environment on these machines every time you want to run something. Data scientists often have to coordinate with DevOps folks for help with infrastructure, which adds delays and time cost.
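To make the first challenge concrete, here is a minimal sketch of the kind of environment information a notebook file does not record on its own. The function name `capture_environment` and the shape of the returned dictionary are illustrative assumptions, not part of this proposal; it only uses the Python standard library.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect a minimal snapshot of the running Python environment.

    Hypothetical helper for illustration only; it does not cover
    environment variables or data files.
    """
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages.add(f"{name}=={dist.version}")
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": sorted(packages),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it could be stored next to a notebook.
    print(json.dumps(capture_environment(), indent=2))
```

A snapshot like this only describes the environment; something still has to rebuild it on another machine, which is the gap a Dockerfile/requirements.txt is meant to fill.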

Solution

What if users could codify environment information just once and then launch any experiment/analysis with 1-click?

What you do

  • Specify the environment once, in the form of a Dockerfile/requirements.txt in the repo.
  • Create & specify AWS API keys to launch EC2 instances in your own account.
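As a rough illustration of how these two inputs could fit together, the sketch below assembles the arguments for an EC2 launch request whose startup script builds the repo's Dockerfile. This is not the implementation behind this feature. The helper name `build_run_instances_kwargs`, the image name `notebook-env`, the `/opt/project` path, port 8888, and the assumption that the chosen AMI already has git and Docker installed are all hypothetical.

```python
import shlex


def build_run_instances_kwargs(repo_url: str, ref: str,
                               instance_type: str, ami_id: str) -> dict:
    """Build keyword arguments for an EC2 launch request (illustrative).

    The startup script assumes an AMI with git and Docker preinstalled
    and a Dockerfile at the repository root.
    """
    user_data = "\n".join([
        "#!/bin/bash",
        "set -euo pipefail",
        f"git clone {shlex.quote(repo_url)} /opt/project",
        "cd /opt/project",
        f"git checkout {shlex.quote(ref)}",
        "docker build -t notebook-env .",             # hypothetical image name
        "docker run -d -p 8888:8888 notebook-env",    # hypothetical port
    ])
    return {
        "ImageId": ami_id,              # caller supplies a real AMI ID
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,          # script run when the instance boots
    }
```

With the user's own AWS API keys configured, a dictionary like this could be passed to boto3 as `boto3.client("ec2").run_instances(**kwargs)`. Networking, security groups, and credentials handling are left out of the sketch.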

What you get

  • You can launch any Notebook (or an entire repository) on an EC2 instance type of your choice with a single click.
  • You can make modifications to Notebooks, run them on EC2, and share access with your team members so they can see the results.
  • You get notified via email when a long-running notebook cell finishes execution.
  • These 1-click launch buttons/URLs can be shared anywhere, e.g. Slack, email, internal documentation, a GitHub README etc.
  • These 1-click URLs specify the repo's branch/commit/PR, so the links keep working as intended even as the repo content evolves over time.
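The last point, links that pin a branch/commit/PR, could look something like the sketch below. The endpoint `https://launch.example.com/run` and the query parameter names (`repo`, `ref`, `path`, `instance_type`) are placeholders invented for illustration; the actual URL format is not specified in this issue.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint; the real launch URL format is not defined here.
LAUNCH_BASE_URL = "https://launch.example.com/run"


def build_launch_url(repo: str, ref: str,
                     notebook_path: Optional[str] = None,
                     instance_type: str = "t3.medium") -> str:
    """Build a shareable launch URL pinned to a specific git ref.

    `ref` can be a branch name, a commit SHA, or a PR reference, so the
    link keeps pointing at the same content as the repo evolves.
    """
    params = {"repo": repo, "ref": ref, "instance_type": instance_type}
    if notebook_path:
        params["path"] = notebook_path  # omit to launch the whole repository
    return f"{LAUNCH_BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_launch_url("example-org/analysis", "3f2a1bc",
                           notebook_path="reports/weekly.ipynb")
    print(url)
    # Round-trip the query string to show what a receiver would parse.
    print(parse_qs(urlparse(url).query))
```

Pinning to a commit SHA gives the most reproducible link, while a branch name gives a link that always launches the latest version of that branch.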

Use cases

Here are some hypothetical scenarios to give you a flavour of what's possible with this feature. This is not a comprehensive list.

  • You want to run your experiment on a beefy machine in the cloud & be notified when the results arrive
  • You want to run an analysis created by your teammate & build on top of it
  • You want to share weekly engagement metrics in the form of a Notebook that anyone in the company can run (Excel files and PDFs become stale from the moment they are sent out)
  • There is an open source implementation of the random forest algorithm on GitHub that you want to try out on your own dataset
  • You want to review a pull request by actually trying out the proposed changes
  • You want to enable non-technical people in the organisation to interact with data directly (via visualisations, reports etc.)

All of the above are possible by just clicking around the UI, once the Notebook repositories are set up with the environment config.

Please see the FAQ below.

Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature or not. I also welcome additional questions/comments/discussion on the issue.

Labels: Feature Request (a new feature that's under consideration)