python/go grpc interface for async uploads #408

andreasjansson · 2020-12-15T01:01:19Z

Removes duplicate logic in Python by reusing the Go implementation via a gRPC API.

Apologies for this massive PR. I couldn't see a way of splitting it up, since it touches everything.

Closes #317
Closes #344

bfirsh · 2020-12-22T18:06:45Z

go/pkg/project/project.go

+		return nil, errors.IncompatibleRepositoryVersion(p.repository.RootURL())
+	}
+
+	hostIP, err := localIP()


I think this changes behavior, doesn't it? IIRC, this got bumped because we wanted to come up with some sensible behavior for localhost. See #203.

The problem in the previous review still applies, I think. The main problem, off the top of my head, is that experiments run on your laptop will have different hosts when you have different local IPs, which is very odd behavior. The "HOST" column will suddenly appear when you connect to a different network or get a new DHCP lease.

Maybe we should make RFC 1918 addresses the blank string until we come up with a better solution?

I think this function will always return local addresses, so maybe the solution is to just blank it out for now. It's a shame since it's useful information when you're running on multiple hosts, but I can't think of a good solution.
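To make the "blank out RFC 1918 addresses" idea concrete, here is a minimal sketch. It is written in Python for brevity, while `localIP()` in this PR is Go. The helper name `display_host` is hypothetical and not part of the PR, and treating loopback addresses the same way is an added assumption.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def display_host(ip: str) -> str:
    """Return "" for loopback and RFC 1918 addresses, else the address unchanged.

    Hypothetical helper for illustration only; not the PR's implementation.
    """
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        # Assumption: loopback is as uninformative as a private address.
        return ""
    if addr.version == 4 and any(addr in net for net in RFC1918_NETWORKS):
        return ""
    return ip
```

With this, an experiment run from a laptop would record the same (empty) host regardless of which network or DHCP lease it happened to be on, at the cost of losing the host on private multi-machine setups.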

bfirsh

Broadly looks good! I realize there is a lot of unfinished stuff in here so I won't review in any detail.

A few high-level thoughts:

Have you thought about how the user interface might work? Maybe some of it is in here already, but I can't see it. I can think of things like displaying errors, what is printed when the training process is finished and things are still uploading, etc.
This is also currently unresolved, but have you put any more thought into partial writes? This might become more of an issue if things are done in the background. Checking out a partially written checkpoint might be particularly destructive (you could lose your current work, and the checked out stuff is corrupted!).
We need to make sure there's a bit of developer documentation, otherwise it is going to be very hard for people to add, e.g., a new bit of metadata to an experiment.
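On the partial-writes point, one common approach for local disk is to write to a temporary file in the same directory and rename it into place, so a reader sees either the old file or the complete new one. The sketch below assumes a local filesystem; `atomic_write` is a hypothetical helper, not code from this PR, and remote storage backends would need their own handling.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so that readers never observe a partial file.

    Hypothetical helper for illustration; assumes a local filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies mid-write, the destination still holds the previous complete contents, and only a stray `.tmp-` file is left behind.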

andreasjansson · 2021-01-06T18:43:58Z

Since the data hasn't yet been uploaded before saving the checkpoint, it doesn't have much use to the user. But I see your point about it being weird the way it looks now.

What I as a user want from the logs is to know that something successfully finished, not that it started. I want to trust that Replicate has uploaded my data and that the checkpoint is consistent.

I agree that it reads a little strange how messages show up out of order, but since uploads happen asynchronously I'm expecting that. In a sense it's nice, because it tells me that Replicate isn't blocking my training loop. Putting the step number in the Replicate log message would make that abundantly clear.

bfirsh · 2021-01-06T19:37:58Z

> What I as a user want from the logs is to know that something successfully finished, not that it started. I want to trust that Replicate has uploaded my data and that the checkpoint is consistent.

On the other hand, you could argue that if Replicate prints nothing when you create a checkpoint, it looks broken.

The broader point is that this is a change in behavior, and I don't think we should change behavior. The old behavior didn't print a message on success either.

andreasjansson · 2021-01-06T19:39:31Z

> The broader point is that this is a change in behavior, and I don't think we should change behavior. The old behavior didn't print a message on success either.

There is a change in the logic though, in that checkpoints are now uploaded in the background. I think that should be reflected in the log output.

bfirsh · 2021-01-06T19:57:44Z

TODO, discussed on zoom: Copy to temp directory before uploading (and block)
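A minimal sketch of what this TODO could look like: the copy happens synchronously, so the training loop cannot modify files mid-snapshot, and only the upload runs in the background. `checkpoint_async` and the `upload` callback are hypothetical names, and this is not the implementation that was eventually merged.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Callable


def checkpoint_async(src: str, upload: Callable[[str], None]) -> threading.Thread:
    """Snapshot src synchronously, then upload the snapshot in the background.

    Hypothetical helper; upload receives the path of the snapshot directory.
    """
    tmp_dir = tempfile.mkdtemp(prefix="checkpoint-")
    snapshot = str(Path(tmp_dir) / "snapshot")
    # Blocks the caller until the copy is complete, so later writes to src
    # cannot leak into this checkpoint.
    shutil.copytree(src, snapshot)

    def worker() -> None:
        try:
            upload(snapshot)
        finally:
            # Remove the snapshot whether or not the upload succeeded.
            shutil.rmtree(tmp_dir, ignore_errors=True)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread
```

The trade-off is that the caller pays for one local copy up front, in exchange for a checkpoint that reflects a single point in time.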

bfirsh · 2021-01-11T00:35:59Z

I have created a bunch of issues for things mentioned in this PR. They are mentioned in the reference messages above.

bfirsh · 2021-01-11T00:36:05Z

> TODO, discussed on zoom: Copy to temp directory before uploading (and block)

For future reference, this was fixed in https://github.com/replicate/replicate/pull/464

andreasjansson force-pushed the async-uploads branch from 2616767 to a5f2ae8 on December 18, 2020 12:23

bfirsh reviewed Dec 22, 2020

This was referenced Dec 30, 2020

Make writes atomic #436

Open

Disable flakey heartbeat test #437

Merged

bfirsh added this to the v0.3.0 milestone Dec 30, 2020

andreasjansson force-pushed the async-uploads branch from a5f2ae8 to d496848 on December 30, 2020 20:53

andreasjansson force-pushed the async-uploads branch from d496848 to ff4c8df on January 1, 2021 14:20

andreasjansson force-pushed the async-uploads branch from ff4c8df to e99b937 on January 3, 2021 23:15

andreasjansson force-pushed the async-uploads branch from e99b937 to 9980599 on January 3, 2021 23:17

andreasjansson force-pushed the async-uploads branch from 9980599 to 2a6ce21 on January 4, 2021 12:44

andreasjansson force-pushed the async-uploads branch from 2a6ce21 to 85c0894 on January 4, 2021 12:51

andreasjansson force-pushed the async-uploads branch from 85c0894 to fc446b3 on January 4, 2021 12:54

andreasjansson force-pushed the async-uploads branch from fc446b3 to d8119f0 on January 4, 2021 12:58

andreasjansson force-pushed the async-uploads branch from d8119f0 to de02470 on January 4, 2021 13:11

andreasjansson force-pushed the async-uploads branch from de02470 to 7db690a on January 4, 2021 14:22

andreasjansson force-pushed the async-uploads branch from 7db690a to 4ec155b on January 4, 2021 14:59

make logging consistent with current

7802941

Signed-off-by: Andreas Jansson <andreas@replicate.ai>

andreasjansson force-pushed the async-uploads branch from b1b09ea to 7802941 on January 6, 2021 18:43

fix flaky checkout test

f970fe4

Signed-off-by: Andreas Jansson <andreas@replicate.ai>

wrap unwrapped gcs WriteError

1551fee

Signed-off-by: Andreas Jansson <andreas@replicate.ai>

move log message to before work happens

86f356d

Signed-off-by: Andreas Jansson <andreas@replicate.ai>

fix final flaky test

6ffe401

Signed-off-by: Andreas Jansson <andreas@replicate.ai>

bfirsh approved these changes Jan 6, 2021

andreasjansson merged commit 1adde49 into replicate:main Jan 6, 2021

This was referenced Jan 7, 2021

Don't show incomplete experiments in user interface #453

Open

Only display message on exit if work to be done #456

Merged

andreasjansson mentioned this pull request Jan 8, 2021

Speed up reading data in Python #287

Closed
