
Compatibility with Hive #24

Closed
michaeltrobinson opened this issue Aug 21, 2017 · 2 comments

@michaeltrobinson

Has anyone had any luck getting Hive to read ORC files written with this library?

hive --orcfiledump test.orc

Currently, I'm getting the following error from Hive if I write the file with zlib compression:

Processing data file test.orc [length: 3403]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 8026884
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:257)
	at java.io.InputStream.read(InputStream.java:101)
	at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
	at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
	at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11063)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
	at org.apache.orc.tools.FileDump.main(FileDump.java:134)
	at org.apache.orc.tools.FileDump.main(FileDump.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

And this error if I write it without compression:

Processing data file test.orc [length: 3945]
Structure for test.orc
File Version: 0.12 with ORIGINAL
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
	at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
	at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124)
	at com.google.protobuf.CodedInputStream.readGroup(CodedInputStream.java:241)
	at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:488)
	at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11069)
	at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11027)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11132)
	at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11127)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
	at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
	at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11360)
	at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:267)
	at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:296)
	at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:953)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:915)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1081)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1116)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:272)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:598)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:592)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:308)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:273)
	at org.apache.orc.tools.FileDump.main(FileDump.java:134)
	at org.apache.orc.tools.FileDump.main(FileDump.java:141)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

I'm using this schema:

		struct<
			t_int:int,
			t_int64:bigint,
			t_float32:float,
			t_float64:double,
			t_string:string,
			t_bool:boolean,
			t_timestamp:timestamp,
			t_list:array<int>,
			t_map:map<string,int>,
			t_nested:struct<
				t_int:int,
				t_int64:bigint,
				t_float32:float,
				t_float64:double,
				t_string:string,
				t_bool:boolean,
				t_timestamp:timestamp
			>
		>
@scritchley
Owner

Hi, I've just tested a file with the same schema and had no issue. Please find the file attached. Can you confirm whether this file also produces an error?

Note: the file is zipped so it could be uploaded to GitHub; please unzip it before testing.
test.orc.zip

Here is the CREATE TABLE statement I used to create a table over this file:

create external table orctest (
	t_int int,
	t_int64 bigint,
	t_float32 float,
	t_float64 double,
	t_string string,
	t_bool boolean,
	t_timestamp timestamp,
	t_list array<int>,
	t_map map<string,int>,
	t_nested struct<
		t_int:int,
		t_int64:bigint,
		t_float32:float,
		t_float64:double,
		t_string:string,
		t_bool:boolean,
		t_timestamp:timestamp
	>
)
stored as orc
location '/user/admin/orctest';

@michaeltrobinson
Author

michaeltrobinson commented Aug 22, 2017

@scritchley Yes, I can confirm that file works for me.

Could you take a look at this example of how I am writing the file: https://gist.github.com/michaeltrobinson/ab2b7c2ed2e43fc24ca3ecf65965c21f

EDIT: I figured it out; it was a silly mistake on my end. I had modified the vendored copy of this library, adding an fmt.Println to debug a schema-parsing issue. Since I was writing the ORC file to stdout, that debug print injected bad bytes into my output.
