How to cache dataframe? #191

Closed

mshanm10 opened this issue Jul 13, 2015 · 10 comments

@mshanm10

Jobserver supports RDD caching. How do we cache a DataFrame (Spark 1.3+)? Is there any workaround to cache DataFrames? Thanks.

@velvia
Contributor

velvia commented Jul 13, 2015

Hi, the SQLContext has a cacheTable method; have you tried that? It is more efficient than RDD caching because the data gets stored in a compressed columnar format.
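(For illustration, a minimal sketch of that approach against a Spark 1.3-era SQLContext; the path and table name here are made up, and sqlContext is assumed to be an existing SQLContext.)

// Load a DataFrame, register it under a name, and cache it in
// Spark SQL's compressed in-memory columnar format.
val df = sqlContext.parquetFile("s3n://my-bucket/some-data")  // illustrative path
df.registerTempTable("my_table")
sqlContext.cacheTable("my_table")

// Subsequent queries read from the cached columnar data.
val count = sqlContext.sql("SELECT count(*) FROM my_table").collect()

// Release the cache when it is no longer needed.
sqlContext.uncacheTable("my_table")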


@mshanm10
Author

I want to share the cache across jobs. Will this work in that scenario?

@velvia
Contributor

velvia commented Jul 13, 2015

Yes, the cached tables will be shared as long as you use a SqlJob (or HiveJob) and not the regular SparkJob.
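(A rough sketch of such a job, assuming the SparkSqlJob trait from job-server-extras; the exact trait name and method signatures may differ between job-server releases, so treat this as a sketch rather than the definitive API.)

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkSqlJob, SparkJobValid, SparkJobValidation}

object CacheTableJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  def runJob(sql: SQLContext, config: Config): Any = {
    // Illustrative path: load once, register under a name, and cache so
    // later SQL jobs submitted to the same context can reuse it.
    val df = sql.parquetFile("s3n://my-bucket/some-data")
    df.registerTempTable("my_table")
    sql.cacheTable("my_table")
    "cached my_table"
  }
}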


@mshanm10
Author

Couldn't find SqlJob or HiveJob. I am using 0.5.1. Should I be using a different release?

@velvia
Contributor

velvia commented Jul 13, 2015

I might not remember the right name, but a SqlJob is definitely in 0.5.1. You need the job-server-extras package though.


@mshanm10
Author

SparkSqlJob it is, and it works fine. Thanks very much for your help.

velvia closed this as completed Jul 14, 2015
@mjanson

mjanson commented Sep 9, 2015

What if we only want to cache the pre-computed DataFrame? This workaround only works when all of the data can be loaded into memory. For our use case (many TB of data) that is not possible.

@velvia
Contributor

velvia commented Sep 11, 2015

@mjanson Sorry, I'm not sure I understand your question. If you have many TB of data, then you cannot cache DataFrames or RDDs in memory.

Using SparkSqlJob and named tables is not about caching them in memory; it just registers a table in the SQLContext so you can refer to it from different jobs sharing the same context. If you use HiveContext and the Hive metastore, then the tables can be shared between contexts too. They don't have to be in memory at all! In fact, by default they are computed from source.

Hope that answers your question.
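(To make the HiveContext/metastore point concrete, a rough sketch, assuming a HiveContext backed by a configured Hive metastore; the path and table name are made up for illustration.)

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.parquetFile("s3n://my-bucket/some-data")  // illustrative path

// Persists the data and records "shared_table" in the Hive metastore,
// so other contexts pointed at the same metastore can see it. Nothing
// is held in memory unless you also call cacheTable.
df.saveAsTable("shared_table")

// From any job/context that uses the same metastore:
val shared = hiveContext.table("shared_table")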

@mjanson

mjanson commented Nov 20, 2015

Specifically, we have a Parquet DataFrame on S3 which takes many minutes to scan before we can execute queries against it. We want to re-use the (precomputed, post-scan) DataFrame for multiple queries. I don't understand how different jobs can re-use the same DataFrame. If, say, Job A scans and completes, and Job B is submitted afterwards, we want Job B to be able to re-use the pre-computed DataFrame that Job A generated.

@velvia
Contributor

velvia commented Nov 20, 2015

@mjanson So you want to share a precomputed DF across multiple jobs, and you are writing jobs against a SQLContext. In that case, let's say Job A scans, and at the end of its runJob() it does this:

myScannedDF.registerTempTable("scanned_table")

Job B, which runs in the same context and is also passed a sqlContext, then does this:

val scannedDF = sqlContext.table("scanned_table")

If I understand what you are saying, then the above should work.
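(Put together as complete job skeletons, this might look roughly like the following. It assumes the SparkSqlJob trait from job-server-extras as in the sketch above; the signatures are a best guess for this era of the job server, and both jobs must be submitted to the same context.)

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkSqlJob, SparkJobValid, SparkJobValidation}

// Job A: scan the Parquet data once and register the result under a name.
object ScanJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  def runJob(sql: SQLContext, config: Config): Any = {
    val myScannedDF = sql.parquetFile("s3n://my-bucket/parquet-data")  // illustrative path
    myScannedDF.registerTempTable("scanned_table")
    "registered scanned_table"
  }
}

// Job B: submitted later to the same context; looks the table up by name.
object QueryJob extends SparkSqlJob {
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  def runJob(sql: SQLContext, config: Config): Any = {
    val scannedDF = sql.table("scanned_table")
    scannedDF.count()
  }
}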
