Apache ORC Support in TensorFlow IO #1372

Open
3 of 8 tasks
oliverhu opened this issue Apr 21, 2021 · 7 comments


@oliverhu
Contributor

oliverhu commented Apr 21, 2021

(Creating this issue for visibility so people interested can join the discussion... )

Overview

Load Apache ORC formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS, local disk, etc.
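
As a rough sketch of what this could look like from the user's side: the helper below wraps the `tfio.IODataset.from_orc(...)` call that appears later in this thread. The function names `path_scheme` and `load_orc` are hypothetical and only for illustration. The `capacity` keyword is copied from that later call, and its exact meaning is not described in this issue.

```python
from urllib.parse import urlparse


def path_scheme(path: str) -> str:
    """Return the URI scheme of a dataset path, or 'file' for plain local paths.

    Hypothetical helper, only used to show which file system a path targets.
    """
    scheme = urlparse(path).scheme
    return scheme or "file"


def load_orc(path: str, capacity: int, batch_size: int = 1):
    """Build a batched dataset from an ORC file (sketch, not the official API wrapper).

    tensorflow_io is imported lazily so that path_scheme() stays usable
    on machines where tensorflow-io is not installed.
    """
    import tensorflow_io as tfio  # requires tensorflow and tensorflow-io

    # Same call shape as the snippet posted later in this thread.
    dataset = tfio.IODataset.from_orc(path, capacity=capacity)
    return dataset.batch(batch_size)
```

With this, `load_orc("iris.orc", capacity=15)` would read from local disk, while a path such as `"hdfs://xxx/yy/iris.orc"` would depend on HDFS support being available in the installed build.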

Motivation

We have traditionally used Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format. (Not planning to argue Parquet vs. ORC here ;))

Design Discussions

Milestones

  • Add Apache ORC build dependency.
  • Implement a simple ORC dataset that maps records in ORC files into Tensors.
  • Add a tutorial for the ORC reader.
  • Feature schema support: SparseTensor and VarLenFeature.
  • Feature schema support: dense Tensor with FixedLenFeature only (following parse_example_v2).
  • Usability improvements.
  • Performance tuning.
  • Feature schema support: RaggedTensor.
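
The three feature-schema milestones differ mainly in how a column of values would be represented per record. The following is a minimal pure-Python sketch of those three layouts, not the tfio implementation. The function names are hypothetical, and the strict length check in `to_dense` is an assumption modelled on how a fixed-length feature expects every record to have the declared size.

```python
from typing import List, Tuple

Rows = List[List[int]]


def to_dense(rows: Rows, length: int) -> Rows:
    """Fixed-length view: every row must already hold exactly `length` values."""
    for i, row in enumerate(rows):
        if len(row) != length:
            raise ValueError(f"row {i} has {len(row)} values, expected {length}")
    return [list(row) for row in rows]


def to_sparse(rows: Rows) -> Tuple[List[Tuple[int, int]], List[int], Tuple[int, int]]:
    """COO-style view (indices, values, dense_shape), in the spirit of a SparseTensor."""
    indices = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    values = [v for row in rows for v in row]
    width = max((len(row) for row in rows), default=0)
    return indices, values, (len(rows), width)


def to_ragged(rows: Rows) -> Tuple[List[int], List[int]]:
    """Ragged view: flat values plus row_splits offsets, in the spirit of a RaggedTensor."""
    values: List[int] = []
    row_splits = [0]
    for row in rows:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


fixed_rows = [[1, 2], [3, 4]]      # every record has two values
variable_rows = [[1, 2, 3], [], [4]]  # record lengths differ

dense = to_dense(fixed_rows, length=2)
sparse = to_sparse(variable_rows)
ragged = to_ragged(variable_rows)
```

The dense layout only works when every record has the same length, which is why variable-length columns need either the sparse or the ragged form.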
@kvignesh1420
Member

@oliverhu any update on this?

@oliverhu
Contributor Author

no update recently @kvignesh1420

@kvignesh1420
Member

@oliverhu can we document the current feature in the form of a tutorial?

@oliverhu
Contributor Author

Sure, will add that!

@kvignesh1420
Member

For your reference: https://github.com/tensorflow/io/tree/master/docs/tutorials

@372046933
Contributor

Is HDFS supported now? Loading from HDFS path results in coredump

dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

@372046933
Contributor

> Is HDFS supported now? Loading from HDFS path results in coredump
>
> dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

HDFS is now supported (with Kerberos) by #1674.
