temporary spark_serialize_*.csv files don't get deleted #496
Comments
Do the files remain after you close the Spark session (spark_disconnect)?
Thanks for the fast response. Yes, the files remain after closing the Spark connection.
OK, I'd suggest changing the path to the scratch folder to see whether this has to do with server access:

> `spark.local.dir` (default: `/tmp`): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

http://spark.apache.org/docs/latest/configuration.html
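For reference, a minimal sketch of setting that option through the sparklyr config (the path here is just an example, not a recommendation):

```r
library(sparklyr)

# Point Spark's scratch space at a directory with enough room
# ("/data/spark-scratch" is an example path).
config <- spark_config()
config$spark.local.dir <- "/data/spark-scratch"

sc <- spark_connect(master = "local", config = config)
```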
Hey, I tried changing the scratch folder, but this did not help. The files Spark itself writes to that tmp folder do indeed get deleted when the Spark connection is closed, but the spark_serialize_*.csv files remain.
@adder Right, we can't immediately clean them up, since the actual Spark DataFrame uses these files.
In my case, I load files sequentially in R, process them, copy the result into a Spark DataFrame, and append it to a Parquet file. The Spark DataFrame gets overwritten each time with new data, but the old serialized spark_serialize_*.csv files remain.
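Roughly this pattern, as a sketch (`process_file()`, the input pattern, the table name, and the output path are placeholders, not the actual code):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

for (f in list.files("input", pattern = "\\.txt$", full.names = TRUE)) {
  df <- process_file(f)  # placeholder for the R-side processing

  # copy_to() serializes df to a temporary spark_serialize_*.csv on disk
  # before Spark reads it; overwrite = TRUE replaces the previous table.
  tbl <- copy_to(sc, df, name = "staging", overwrite = TRUE)

  spark_write_parquet(tbl, path = "out.parquet", mode = "append")
  # the temporary CSV written by copy_to() is the file that lingers
}
```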
Haven't given this much thought, but it's worth exploring cleaning temp files more aggressively rather than waiting until the R session restarts.
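As a stopgap, one could delete the leftover files by hand; a sketch, assuming they land in the R session's temp directory as described in this thread:

```r
# Stopgap sketch: remove leftover serialization CSVs manually.
# Only safe once no Spark DataFrame still reads from them (see above);
# assumes the files sit in the R session's temp directory.
leftovers <- list.files(tempdir(),
                        pattern = "^spark_serialize_.*\\.csv$",
                        full.names = TRUE)
file.remove(leftovers)
```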
On *nix platforms one can just unlink the temp file immediately after creation: the underlying file is gone as soon as all its file descriptors are closed, and it can still be read from in the meantime. Sadly, none of this would work on Windows, last time I checked. We probably don't want anything too platform-specific in sparklyr, so this will require a bit of research, I guess.
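The trick in question, sketched in R (POSIX-only; the file stays readable through the open connection even after its directory entry is removed):

```r
# POSIX-only sketch of the unlink-after-open trick.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "w+")  # create and open the file
unlink(tmp)                    # remove the directory entry right away

writeLines("a,b", con)
writeLines("1,2", con)

seek(con, 0, rw = "read")      # rewind to read what was written
readLines(con)                 # still readable via the open descriptor

close(con)                     # only now is the disk space reclaimed
```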
Starting from
Dear,
I'm using sparklyr to read hundreds of txt files, process them in R, copy the resulting data frame into Spark memory, and write it to a Parquet file.
sparklyr creates intermediate spark_serialize_*.csv files in the /tmp folder in root.
However, these files don't get deleted after the data is loaded into Spark.
After hundreds of files, these add up to gigabytes and I run out of disk space in my root partition.
I'm not sure whether it's expected behavior to keep these files in tmp or not.
I'm under the impression that these files are only created as an intermediate step before loading an R data frame into Spark, and that they are not needed afterwards.
Best