How to cache dataframe? #191

Closed

mshanm10 opened this issue Jul 13, 2015 · 10 comments

@mshanm10

Jobserver supports RDD caching. How do we cache a DataFrame (Spark 1.3+)? Is there any workaround to cache DataFrames? Thanks.

@velvia
Contributor

velvia commented Jul 13, 2015

Hi, the SQLContext has a cacheTable method; have you tried that? It is more efficient than RDD caching because the data gets stored in a compressed columnar format.
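(For illustration, a minimal sketch of that approach against a Spark 1.3-era SQLContext; the path and table name here are made up, and sqlContext is assumed to be an existing SQLContext.)

// Load a DataFrame, register it under a name, and cache it in
// Spark SQL's compressed in-memory columnar format.
val df = sqlContext.parquetFile("s3n://my-bucket/some-data")  // illustrative path
df.registerTempTable("my_table")
sqlContext.cacheTable("my_table")

// Subsequent queries read from the cached columnar data.
val count = sqlContext.sql("SELECT count(*) FROM my_table").collect()

// Release the cache when it is no longer needed.
sqlContext.uncacheTable("my_table")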


@mshanm10
Author

I want to share the cache across jobs. Will this work in that scenario?

@velvia
Contributor

velvia commented Jul 13, 2015

Yes, the cached tables will be shared as long as you use a SqlJob (or HiveJob) and not the regular SparkJob.
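(A rough sketch of such a job, assuming the SparkSqlJob trait from job-server-extras; the exact trait name and method signatures may differ between job-server releases, so treat this as a sketch rather than the definitive API.)

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkSqlJob, SparkJobValid, SparkJobValidation}

object CacheTableJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  def runJob(sql: SQLContext, config: Config): Any = {
    // Illustrative path: load once, register under a name, and cache so
    // later SQL jobs submitted to the same context can reuse it.
    val df = sql.parquetFile("s3n://my-bucket/some-data")
    df.registerTempTable("my_table")
    sql.cacheTable("my_table")
    "cached my_table"
  }
}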


@mshanm10
Author

Couldn't find SqlJob or HiveJob. I am using 0.5.1. Should I be using a different release?

@velvia
Contributor

velvia commented Jul 13, 2015

I might not remember the right name, but a SqlJob is definitely in 0.5.1. You need the job-server-extras package though.


@mshanm10
Author

SparkSqlJob it is, and it works fine. Thanks very much for your help.

velvia closed this as completed Jul 14, 2015
@mjanson

mjanson commented Sep 9, 2015

What if we only want to cache the pre-computed DataFrame? This workaround only works when all of the data can be loaded into memory. For our use case (many TB of data) that is not possible.

@velvia
Contributor

velvia commented Sep 11, 2015

@mjanson Sorry, I'm not sure I understand your question. If you have many TB of data, then you cannot cache DataFrames or RDDs in memory.

Using SparkSqlJob and named tables is not about caching them in memory; it just registers a table in the SQLContext so you can refer to it from different jobs sharing the same context. If you use HiveContext and the Hive metastore, then the tables can be shared between contexts too. They don't have to be in memory at all! In fact, by default they are computed from source.

Hope that answers your question.
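(To make the HiveContext/metastore point concrete, a rough sketch, assuming a HiveContext backed by a configured Hive metastore; the path and table name are made up for illustration.)

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.parquetFile("s3n://my-bucket/some-data")  // illustrative path

// Persists the data and records "shared_table" in the Hive metastore,
// so other contexts pointed at the same metastore can see it. Nothing
// is held in memory unless you also call cacheTable.
df.saveAsTable("shared_table")

// From any job/context that uses the same metastore:
val shared = hiveContext.table("shared_table")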

@mjanson

mjanson commented Nov 20, 2015

Specifically, we have a Parquet DataFrame on S3 which takes many minutes to scan before we can execute queries against it. We want to re-use the (precomputed, post-scan) DataFrame for multiple queries. I don't understand how different jobs can re-use the same DataFrame. If, say, Job A scans and completes, and Job B is submitted afterwards, we want Job B to be able to re-use the pre-computed DataFrame that Job A generated.

@velvia
Contributor

velvia commented Nov 20, 2015

@mjanson So you want to share a precomputed DF across multiple jobs, and you are writing jobs against a SQLContext. In that case, let's say Job A scans, and at the end of its runJob() it does this:

myScannedDF.registerTempTable("scanned_table")

Job B, which runs in the same context and is also passed a sqlContext, then does this:

val scannedDF = sqlContext.table("scanned_table")

If I understand what you are saying, then the above should work.
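(Put together as complete job skeletons, this might look roughly like the following. It assumes the SparkSqlJob trait from job-server-extras as in the sketch above; the signatures are a best guess for this era of the job server, and both jobs must be submitted to the same context.)

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkSqlJob, SparkJobValid, SparkJobValidation}

// Job A: scan the Parquet data once and register the result under a name.
object ScanJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  def runJob(sql: SQLContext, config: Config): Any = {
    val myScannedDF = sql.parquetFile("s3n://my-bucket/parquet-data")  // illustrative path
    myScannedDF.registerTempTable("scanned_table")
    "registered scanned_table"
  }
}

// Job B: submitted later to the same context; looks the table up by name.
object QueryJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  def runJob(sql: SQLContext, config: Config): Any = {
    val scannedDF = sql.table("scanned_table")
    scannedDF.count()
  }
}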
