How to cache dataframe? #191
Jobserver supports RDD caching. How do we cache a DataFrame (Spark 1.3+)? Is there any workaround to cache DataFrames? Thanks.

Comments
Hi, DataFrames (via SQLContext) have a cacheTable method; have you tried that?
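For context, a minimal sketch of what that looks like, assuming the Spark 1.3-era API; the table name and S3 path are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Spark 1.3-era API; "events" and the S3 path are made up for illustration.
val sc = new SparkContext("local[*]", "cache-demo")
val sqlContext = new SQLContext(sc)
val df = sqlContext.parquetFile("s3n://my-bucket/events.parquet")
df.registerTempTable("events")     // expose the DataFrame under a table name
sqlContext.cacheTable("events")    // cache the table's contents in memory
sqlContext.sql("SELECT count(*) FROM events").show()  // first query materializes the cache
```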
I want to share the cache across jobs. Will this work in that scenario?
Yes, the cached tables will be shared so long as you use a SqlJob (or HiveJob).
I couldn't find SqlJob or HiveJob. I am using 0.5.1; should I be using a different release?
I might not remember the right name, but a SqlJob is definitely in 0.5.1.
SparkSqlJob it is, and it works fine. Thanks very much for your help.
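For anyone else landing here, a skeletal job against that API might look like the following. This is a sketch based on the 0.5.x-era job-server-extras trait; the exact method signatures are hedged from memory, and "events" is a hypothetical table name:

```scala
import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkSqlJob, SparkJobValid, SparkJobValidation}

object CachedTableJob extends SparkSqlJob {
  // No input to validate in this sketch, so always accept the job.
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  def runJob(sql: SQLContext, config: Config): Any = {
    // Queries here run against whatever tables were registered (and cached)
    // earlier in this persistent SQLContext.
    sql.sql("SELECT count(*) FROM events").collect()
  }
}
```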
What if we only want to cache a pre-computed DataFrame? This workaround would only work in cases where all the data can be loaded into memory. For our use case (many TB of data) that is not possible.
@mjanson sorry, I'm not sure I understand your question. If you have many TB of data, then you cannot cache DataFrames or RDDs in memory. Hope that answers your question.
Specifically, we have a Parquet DataFrame on S3 which takes many minutes to scan before we can execute queries against it. We want to re-use the (pre-computed, post-scan) DataFrame for multiple queries. I don't understand how different jobs can re-use the same DataFrame. If, say, Job A scans and completes, and afterwards Job B is submitted, we want Job B to be able to re-use the pre-computed DataFrame which Job A generated.
@mjanson so you want to share a pre-computed DataFrame across multiple jobs, and you are writing jobs against a SQLContext. Let's say Job A scans; at the end of runJob(), it will do this:
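(The original snippet appears to have been lost in scraping; presumably it registered and cached the scanned DataFrame under a well-known name, along these lines. The table name "preScanned" is hypothetical.)

```scala
// End of Job A's runJob(): publish the expensive scan result under a
// well-known name so later jobs in the same persistent context can reuse it.
val df = sqlContext.parquetFile("s3n://my-bucket/big-dataset")  // the slow scan
df.registerTempTable("preScanned")
sqlContext.cacheTable("preScanned")
```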
In Job B, which runs in the same context, and is also passed a sqlContext, it does this:
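(Again, the original snippet is missing; presumably something like this, reusing the hypothetical "preScanned" table from above.)

```scala
// Job B's runJob(), in the same persistent context: the temp table registered
// and cached by Job A is still available, so this query hits the in-memory data.
val results = sqlContext.sql("SELECT count(*) FROM preScanned").collect()
```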
If I understand what you are saying, then the above should work.