Feature Request: Support directly operating on files #372

Open
voycey opened this issue May 24, 2019 · 5 comments


voycey commented May 24, 2019

One thing we have noticed during our big data journey is that while DB/query engines are a convenient way to access data, if you want to do any kind of meaningful heavy lifting it is sometimes better to skip them altogether.

One thing we have been working on lately is directly operating on ORC files via Spark.

For example, doing a simple table copy from an unpartitioned table into a partitioned one (`insert into <x> select * from <y>` or `load data inpath ...`) with the size of data and number of partitions we have (50 × 90 = 4,500 partitions) takes days to complete via Hive with several well-powered nodes (16 CPUs × 104 GB memory each), even with stats generation turned off for inserted rows!

However, looping the same query in Spark over the dataset like this:

```scala
val day = spark.read.orc("gs://pipeline/unpartitioned/ts_date=2019-02-15/state_code=CA/")
day.write
  .option("orc.bloom.filter.columns", "id,place_cat,place_name,city")
  .option("orc.compress", "ZLIB")
  .option("orc.compress.size", "262144")
  .option("orc.create.index", "true")
  .option("orc.row.index.stride", "50000")
  .orc("gs://pipeline/partitioned/ts_date=2019-02-15/state_code=CA/")
```

finishes in hours on a very small, basic 3-node cluster.
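
For reference, here is a minimal sketch of that loop, assuming the bucket layout shown above; the date and state lists are illustrative stand-ins for the full ~90 × ~50 set:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of the per-partition copy loop described above.
val spark = SparkSession.builder().appName("repartition-orc").getOrCreate()

val dates  = Seq("2019-02-15", "2019-02-16") // ~90 dates in practice
val states = Seq("CA", "NY")                 // ~50 states in practice

for (d <- dates; s <- states) {
  val day = spark.read.orc(s"gs://pipeline/unpartitioned/ts_date=$d/state_code=$s/")
  day.write
    .option("orc.bloom.filter.columns", "id,place_cat,place_name,city")
    .option("orc.compress", "ZLIB")
    .orc(s"gs://pipeline/partitioned/ts_date=$d/state_code=$s/")
}
```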

As scramble generation is basically a transmutation of the data in one table into another, avoiding the overhead of these query engines would go a long way toward making it more efficient.

Currently, most of our plans for this have involved scaling back the scramble generation somehow. While it is possible to get the scrambles created using a reduced partitioning scheme, that might not be necessary if you could take full advantage of an MPP system such as Spark directly (rather than relying on the HiveContext).

What do you think?


voycey commented May 24, 2019

Naturally this wouldn't just be limited to ORC. Spark supports all sorts of writers, from Parquet and Avro to even CSV and JSON files!
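
For instance, the same `day` DataFrame from the example above could be written through any of Spark's built-in writers (target paths are illustrative; Avro needs the external spark-avro package):

```scala
// Same data, different output formats; target paths are illustrative.
day.write.parquet("gs://pipeline/partitioned-parquet/")
day.write.format("avro").save("gs://pipeline/partitioned-avro/") // needs spark-avro
day.write.option("header", "true").csv("gs://pipeline/partitioned-csv/")
day.write.json("gs://pipeline/partitioned-json/")
```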

pyongjoo commented

Thank you for the suggestion! I agree with you on the usefulness of directly operating on files, but it won't always be magical (in comparison to SQL-based approaches).

I see several orthogonal things in your comment:

  1. The possibility of operating directly on files (either compressed or raw): this would be very useful. We keep debating common use cases, and we will definitely work on this.
  2. Spark is faster than Hive: I agree, but that is mainly due to Spark vs. Hive (or Hive's legacy Hadoop engine), not Spark SQL vs. Spark. Spark SQL is eventually translated into Spark's map-reduce functions. Also, Spark SQL's external tables simply work on files (e.g., ORC, Parquet, CSV); see the sketch after this list.
  3. Operating on files may allow a larger number of partitions: if so, that would be useful, but there may be a negative performance impact that I don't yet have a good understanding of.
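
A minimal sketch of the external-table point, run through a Spark session (table and path names are illustrative):

```scala
// An external table defined directly over ORC files; queries on it
// read the files in place, no load step required.
spark.sql("""
  CREATE TABLE unpartitioned_ext
  USING ORC
  LOCATION 'gs://pipeline/unpartitioned/'
""")
spark.sql("SELECT COUNT(*) FROM unpartitioned_ext").show()
```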

Eventually, I think Verdict should support direct data loads from your local disk, Google Cloud Storage, etc., without requiring all that messy data transformation from you.

pyongjoo pinned this issue May 24, 2019

voycey commented May 25, 2019

Sorry, I just want to clarify: I am only talking about scramble creation here, not necessarily running Verdict directly on files as its own "query engine".

I find that Tez (the default engine underneath Hive in more recent versions) is not great at memory management compared to MR and Spark as execution engines; we see jobs fail much more regularly, even jobs that have previously completed on the same hardware. The only reliable way of processing anything on datasets like ours is to actually "reduce" the amount of processing that has to be done.

For example, to convert to ORC at the moment, the "easiest" way in terms of effort is to create an external table on the CSV files, create an ORC table, and then `insert into orc select * from csv`.
Spark takes a bit more effort to get set up initially but can process the files much faster on a single node than Hive can on a multi-node cluster doing the above.
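
For comparison, a minimal sketch of that conversion done directly in Spark, assuming headered CSV files at an illustrative path:

```scala
// Read the raw CSVs and rewrite them as compressed ORC in one pass,
// skipping the external-table / insert-select dance entirely.
val csv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("gs://pipeline/raw-csv/")
csv.write.option("orc.compress", "ZLIB").orc("gs://pipeline/orc/")
```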

I think for the time being, a way to simply create a scramble from an RDD/DataFrame and write it out to ORC files would be a huge step, and I don't think it would be too much effort to develop (we had a brief try, but we have deadlines to meet), as all the code to do this is basically there; it just uses HiveContext to materialise the scrambles. If we can do the computationally difficult stuff such as creating scrambles in Spark, and then keep the SQL interface so it can do all of the query translation, that would be perfect :)
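
To make the suggestion concrete, here is a rough sketch of the idea, reusing the `day` DataFrame from the earlier example and assuming a uniform scramble can be approximated by tagging each row with a random block id; the `blockid` column name and block count are illustrative, not VerdictDB's actual scramble layout:

```scala
import org.apache.spark.sql.functions.{floor, rand}

// Tag each row with a uniformly random block id, then write the result
// as ORC, one directory per block. Purely a sketch of the idea.
val numBlocks = 100
val scramble  = day.withColumn("blockid", floor(rand() * numBlocks))
scramble.write
  .partitionBy("blockid")
  .orc("gs://pipeline/scrambles/my_table/")
```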


pyongjoo commented Jun 4, 2019

@voycey Assuming you will be using Google BigQuery (or at least the Google Cloud ecosystem), what do you think of this workflow? The goal is to figure out the simplest way for you to use VerdictDB; once the interface is determined, we can engineer its internals for higher efficiency.

  1. You store raw (or compressed) files in Google Cloud Storage. Say on Jan 1 you collected mydata_jan1.csv, on Jan 2 you collected mydata_jan2.csv, and so on.
  2. You let VerdictDB load your data into its own data store by pointing it at the locations of those files. That data store would simply be BigQuery tables (managed by VerdictDB).
  3. You run your queries through VerdictDB. (A purely illustrative sketch of this follows the list.)
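
As a strawman for the interface discussion only, steps 1-3 might look something like this from Scala; the connection URL and the `LOAD DATA` statement are hypothetical and do not exist in VerdictDB today:

```scala
import java.sql.DriverManager

// Hypothetical sketch: neither the BigQuery JDBC URL nor the LOAD DATA
// statement exists in VerdictDB today; this only shows the workflow's shape.
val conn = DriverManager.getConnection("jdbc:verdict:bigquery://my-project") // hypothetical
val stmt = conn.createStatement()

// Steps 1-2: point VerdictDB at the raw files in GCS; it materializes
// and manages the backing BigQuery tables itself.
stmt.execute("LOAD DATA 'gs://mybucket/mydata_jan1.csv' INTO mydata") // hypothetical
stmt.execute("LOAD DATA 'gs://mybucket/mydata_jan2.csv' INTO mydata") // hypothetical

// Step 3: query through VerdictDB as usual.
val rs = stmt.executeQuery("SELECT COUNT(*) FROM mydata")
```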


voycey commented Jun 4, 2019 via email
