write back into text format after reading a file stored into HDFS and running rhwatch in rhipe #42

Open
bedantaguru opened this issue May 10, 2016 · 2 comments

@bedantaguru

I was trying out rhipe and RHadoop [rmr rhdfs rhbase etc.] series of packages.

In both packages [rhipe and rmr] I can ingest / read data stored in CSV or text files. Both of them support creating new file formats to some extent, but I find rmr has more support for it, or at least more resources to get started. This matters when one plans to do some processing on raw data stored in HDFS and then store the result back to HDFS in a format recognizable by other Hadoop components such as Hive, Impala, etc. Both packages can write in their native format, which only the package itself recognizes. The rmr package also supports a few other formats.

For a reference on rmr, have a look at this page: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/getting-data-in-and-out.md

However, for rhipe I could not find any such document, and the various approaches I tried all failed.

So my question is: how can I write back to text [just as an example; any other recognizable format will also work] after reading a file stored in HDFS and running rhwatch in rhipe?

I have asked the same question here: https://stackoverflow.com/questions/37129039/getting-data-in-and-out-of-rhipe-r-hadoop

@saptarshiguha
Contributor

I'll respond to this tomorrow.
Cheers
SG

@saptarshiguha
Contributor

Hello,

It is true that RHIPE doesn't have many input/output formats, but it does
have some; sadly, they are not documented.

If you load RHIPE and type rhoptions()$ioformats, you'll see the options
available:

  • text input/output
  • sequencefile storing protobufs (RHIPE's serialization) input/output
  • mapfile (which behaves like an on-disk hash table) input/output

An example of text output:

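# `reducers`, `i` (input) and `o` (output folder) are not shown here
y <- rhwatch(
  map = function(a, b) {
    # format the numeric values in fixed notation with no decimals
    b <- formatC(b, format = "f", digits = 0)
    # NULL key: only the value c(a, b) is emitted
    rhcollect(NULL, c(a, b))
  },
  reduce = reducers,
  input  = i,
  output = rhfmt(type = "text",
                 folder = o,
                 writeKey = FALSE,   # do not write the key
                 field.sep = "\t",   # tab-separated fields
                 stringquote = ""),  # no quotes around strings
  read = FALSE)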

This converts an input file in which 'a' is a character vector and 'b' is
a numeric vector. The output options mean:

  • fields are written without a key (writeKey=FALSE)
  • elements are separated by \t (field.sep="\t")
  • strings are not quoted (stringquote="")
  • output is placed in 'folder'
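
To make the options above concrete, here is a minimal base-R sketch of what one output record would look like under those settings. It does not call RHIPE or Hadoop; `format_record` is a hypothetical helper written only for this illustration, and it assumes the text writer simply joins the unquoted elements with the field separator.

```r
# Base-R illustration only; no Hadoop or RHIPE needed.
# `format_record` is a hypothetical helper, not part of RHIPE.
format_record <- function(a, b, field.sep = "\t") {
  # same formatting step as in the map function: fixed notation, no decimals
  b <- formatC(b, format = "f", digits = 0)
  # join the unquoted elements with the field separator
  paste(c(a, b), collapse = field.sep)
}

line <- format_record("alpha", c(12345678, 3.7))
cat(line, "\n")
```

Without the formatC step, a large number such as 12345678 could be rendered in scientific notation when converted to character, which tools like Hive would not parse as an integer column.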

Also present in the package is HBase input (not sure if output works)

See
https://github.com/tesseradata/RHIPE/blob/d3eed56735ece58a7a39e44cd48cfd3522212766/src/main/R/rhfmt.R

But for that to work you'll need this JAR file, which translates HBase to
RHIPE ( https://github.com/saptarshiguha/RhipeHbaseMozilla )

RHIPE I/O formats are fairly pluggable, i.e. you can write your own.

HTH
Saptarshi
