
copyFromLocal not implemented? #37

Open
interskh opened this Issue Dec 4, 2013 · 48 comments

interskh (Contributor) commented Dec 4, 2013

I notice copyFromLocal exists in commandlineparser.py but not in client.py. Is it not implemented yet?

Thanks!

wouterdebie (Contributor) commented Dec 4, 2013

Yes, that shouldn't be there. put was commented out, but I forgot copyFromLocal. I'll submit a patch this week, because this is confusing.

interskh (Contributor, Author) commented Dec 5, 2013

Thanks.

BlondAngel commented Dec 18, 2013

So, this means that copyFromLocal/put is not implemented? Do we use 'hadoop fs -copyFromLocal' instead?

I note that in the spotify blog [http://labs.spotify.com/2013/05/07/snakebite/], it states:
there are plans to also implement actions that also involve interaction with the DataNode

In addition, the documentation [http://spotify.github.io/snakebite/] has a 'To Do' section where it states:
put [paths] dst copy sources from local file system to destination

What is the timeline for this 'put'/'copyFromLocal' feature?

wouterdebie (Contributor) commented Mar 4, 2014

Sorry for the late reply, but we haven't prioritized this. It would be nice to have (just like full YARN support).

sodul commented Jun 5, 2014

+1
I want to use snakebite to replace several slow steps in our deployment automation; unfortunately we use copyFromLocal a lot. So this is definitely a must-have feature for a lot of people.

Thanks for the good work.

carolinux commented Sep 17, 2014

seconding sodul's comment

ravwojdyla self-assigned this Sep 17, 2014

briancline commented Sep 29, 2014

Thanks for an excellent and straightforward client -- just throwing in a makeshift vote for the ability to use put/copyFromLocal to speed up a few data ingress scripts.

ptrxyz commented Dec 13, 2014

Great work, keep it up. Would also like to see put/copyfromlocal in the future.

DandyDev commented Jan 31, 2015

Still no word on this?
If communicating through protobuf makes it hard to implement features that require direct access to datanodes (such as the put and append operations), it would be wise to have a look at WebHDFS. Using WebHDFS in Snakebite instead of Protobuf would make it trivial to implement copyFromLocal/put and other file write operations.

I think it's a shame that such a promising project gets stuck on something that is really needed, like copyFromLocal.

wouterdebie (Contributor) commented Jan 31, 2015

@ravwojdyla and I have been discussing this and currently there doesn't seem to be much time to implement this, so it's very hard to give any ETA on this feature.
I don't think we want to add WebHDFS support, since that sort of defeats the purpose of snakebite and requires additional infrastructure.

simonellistonball commented Jan 31, 2015

I agree with @wouterdebie; webhdfs wouldn't have the speed of snakebite. I'm working on implementing put in RPC at the moment; if anyone has any thoughts or progress they can share to accelerate it, it would be great to work together.

DandyDev commented Jan 31, 2015

Where can I find the RPC documentation?

zachmullen commented Mar 4, 2015

Has there been progress toward implementing put? I was going to take a crack at it for a project I'm working on, and was considering contributing it upstream, but don't want to duplicate effort if someone already has a handle on this.

Tarrasch (Contributor) commented Mar 5, 2015

I'm pretty sure it has not, maybe @ravwojdyla can confirm.

ravwojdyla (Contributor) commented Mar 9, 2015

I started working on this feature some time ago - I can probably upload what I have right now (it's far from complete). That said, if anyone feels like working on this problem, please create issues for what you plan to work on, and if you need help, please ping me/us. Thanks!

zachmullen commented Mar 9, 2015

@ravwojdyla I'd love to help. I started to do it, but the problem that ended up blocking me was that I couldn't find documentation on what RPCs I should even call to do something like an append, and the ones I tried didn't return what they claimed in the auto-generated protobuf spec. I might be able to help with this effort if you could point me to good documentation about the protocol, but I was unable to find any in sufficient detail.

wouterdebie (Contributor) commented Mar 9, 2015

The problem with Hadoop is that the protocols are pretty badly documented. When I started snakebite, I spent a lot of time reading Hadoop code and tcpdumping to figure out what was going on...

aman572 commented May 1, 2015

Is there any ETA on when copyFromLocal/put support will be present?

tothandor commented Aug 6, 2015

+1

ligao101 commented Aug 31, 2015

+1

mbultrow commented Sep 18, 2015

+1 :)

ctimmins commented Oct 9, 2015

In the meantime:

import subprocess

subprocess.check_call(['hdfs', 'dfs', '-put', '/path/to/src', 'path/to/dst'], shell=False)
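A slightly more defensive version of the same CLI workaround might look like this (the helper name and the dry_run flag are my own additions for illustration, not part of snakebite or the Hadoop tools):

```python
import subprocess

def hdfs_put(src, dst, hdfs_bin="hdfs", dry_run=False):
    """Shell out to the Hadoop CLI until snakebite grows a native put()."""
    cmd = [hdfs_bin, "dfs", "-put", src, dst]
    if dry_run:
        # Return the command instead of running it, for inspection/testing.
        return cmd
    # Raises CalledProcessError if the copy fails, so callers notice.
    subprocess.check_call(cmd)
    return cmd
```

Note this still pays the JVM startup cost on every call, which is exactly the overhead snakebite was written to avoid.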

jtaryma commented Oct 14, 2015

+1

jwszolek commented Oct 28, 2015

@ravwojdyla - is there a separate branch for that issue? Did you have a chance to push what you had already done? Thanks!

aeroevan commented Nov 7, 2015

It looks like a Go library similar to snakebite has started making progress on writing to HDFS:
colinmarc/hdfs#12

Condla commented Dec 22, 2015

+1

sodul commented Dec 22, 2015

An alternative that is relatively snappy is to use httpfs, a service that provides an HTTP interface to HDFS. We actually ended up writing our own REST API in Groovy to access HDFS and the hbase shell (which has no API).

https://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
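For anyone tempted by the REST route, here is a minimal sketch of what an HttpFS/WebHDFS file upload involves. The hostname and helper function are illustrative assumptions, not part of snakebite; port 14000 is the HttpFS default, while WebHDFS served directly by the NameNode listens on a different port:

```python
from urllib.parse import urlencode

def webhdfs_create_url(host, path, port=14000, user="hdfs", overwrite=False):
    """Build the first-step CREATE URL for the WebHDFS/HttpFS REST API."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

# Writing a file is a two-step operation: first PUT to the URL above with
# an empty body and redirects disabled; the server answers 307 with a
# Location header (a DataNode, or the HttpFS gateway itself), and the file
# bytes are then PUT to that Location (201 Created on success).
```

Per the linked Stack Overflow discussion later in this thread, this path is simpler to implement but measurably slower than talking to the DataNodes directly.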

tworec commented Jan 7, 2016

+1

crorella commented Feb 17, 2016

👍

wouterdebie (Contributor) commented Feb 17, 2016

Because it was never implemented.

austintrombley commented Mar 17, 2016

+1

philpot commented Mar 24, 2016

+1

zachmullen commented Mar 24, 2016

GitHub recently added a handy feature to avoid all the "+1" comment spam: you can now vote on an issue by going to the top comment, clicking the little smiley face in the upper right, and then clicking the thumbs-up.

agrebin commented May 31, 2016

+1
I've given my smiley, but just in case :)

cherrot commented Jun 22, 2016

I planned to implement a storage service for big files using snakebite because I really like its implementation. Sadly, it doesn't support writing files.

Maybe I'll switch back to it when this feature has been implemented :)

francisar commented Jul 27, 2016

Waiting for this to be implemented.

wesmadrigal commented Aug 24, 2016

This was opened 3 years ago and still not implemented...wtf?

weikai2 commented Sep 1, 2016

Use webhdfs instead.

wangwenpei commented Sep 2, 2016

Unbelievable, 3 years 😱
I've been using snakebite since yesterday.

zachmullen commented Sep 2, 2016

If this feature is truly critical to you, I'd suggest checking out hdfs3; it's BSD-licensed, implements this capability, and also supports Python 3.

wangwenpei commented Sep 3, 2016

libhdfs3 is so hard to configure on Mac OS X 😓 I use webhdfs to write data.

arudyk commented Sep 13, 2016

3 years later...

wouterdebie (Contributor) commented Sep 13, 2016

Honestly, I'm not sure how complaining and pointing out the obvious is going to help getting this implemented.

Yes, this feature has been open for a very long time, but writing is a complicated operation in HDFS. Snakebite was conceived to work around long JVM startup times, which matters mostly for operations that you do often and that should be relatively short (ls, test, etc.). In cases where you read or write, the overhead of the JVM startup time has less impact. At Spotify we haven't had the need to invest time in write functionality, but of course if someone feels like it, please do so. That said, please refrain from complaining when software is open source, since people do this in their spare time or companies invest in getting software out there.


DandyDev commented Sep 13, 2016

To be fair, this feature was mentioned both in an earlier version of the documentation and in the original blog post announcing Snakebite. You can't blame people for interpreting this as some sort of promise :)

Snakebite was conceived to work around long JVM startup times

Is that really all there is to it? I suggested WebHDFS before, which is just a REST API, so it doesn't have anything to do with JVM startup times, but it does give you a much easier path to implementing features than the undocumented Protobuf interface.
I was told back then that using WebHDFS "defeats the purpose of Snakebite", but now I don't see how, if the purpose is to circumvent the JVM startup times that using hdfs dfs incurs. WebHDFS eliminates the JVM overhead just as well as Protobuf does.

tworec commented Sep 28, 2016

@DandyDev maybe this SO thread will (at least partially) explain why WebHDFS is not exactly what we want:

http://stackoverflow.com/questions/31580832/hdfs-put-vs-webhdfs

It seems that webhdfs is about 4x slower. Furthermore, if we were using WebHDFS, there would be no need to read Hadoop code and port it to Python.

tworec commented Jan 6, 2017

Nice reading for this thread:
http://wesmckinney.com/blog/python-hdfs-interfaces/

baiyunping333 commented Jun 10, 2017

I will implement the feature!

spyzzz commented Apr 10, 2018

Still not implemented?
I really need this feature :|
