temporary spark_serialize_*.csv files don't get deleted #496

Closed
adder opened this issue Feb 16, 2017 · 9 comments

Comments

@adder

adder commented Feb 16, 2017

Hi,
I'm using sparklyr to read hundreds of txt files, process them in R, copy the resulting data frame into Spark memory, and write it out to a Parquet file.
sparklyr creates intermediate spark_serialize_*.csv files in the /tmp folder on the root partition.
However, these files don't get deleted after the data is loaded into Spark.
After hundreds of files these add up to gigabytes and I run out of disk space on my root partition.
Is it expected behavior to keep these files in /tmp?

I'm under the impression that these files are only created as an intermediate step before loading an R data frame into Spark and that they are not needed afterwards.
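For reference, the workflow looks roughly like the sketch below (paths, file names, and the table name are placeholders, not the real job):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# loop over the input files (placeholder path and pattern)
for (f in list.files("data", pattern = "\\.txt$", full.names = TRUE)) {
  df  <- read.delim(f)                                       # read and process one file in R
  tbl <- copy_to(sc, df, "staged", overwrite = TRUE)         # copy the R data frame into Spark
  spark_write_parquet(tbl, "out.parquet", mode = "append")   # append to the Parquet output
}
```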

Best

@edgararuiz-zz
Contributor

Do the files remain after you close the Spark session (spark_disconnect)?

@adder
Author

adder commented Feb 16, 2017

Thanks for the fast response,

Yes, the files remain after closing the spark connection.

@edgararuiz-zz
Contributor

Ok, I'd suggest changing the path of the scratch folder to see whether this has to do with server access:

spark.local.dir (default: /tmp): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager. (http://spark.apache.org/docs/latest/configuration.html)

@adder
Author

adder commented Feb 16, 2017

Hey,
I did the following:

conf = spark_config()
conf$spark.local.dir = paste0(getwd(),'/tmp')
sc = spark_connect(master = "local", version = "2.0.2", hadoop_version = "2.7",config = conf)

But this did not help. The files that Spark writes to this tmp folder do indeed get deleted when the Spark connection is closed.
The spark_serialize_*.csv files, however, are written to a different tmp folder; I believe it's the R session's temp folder (/tmp/RtmpkkztvF).
Those files remain after closing the Spark connection.
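A quick way to see where they end up and how much space they take (the file-name pattern below is inferred from the names reported above):

```r
tempdir()   # R's per-session temp folder, e.g. /tmp/RtmpkkztvF

# list leftover serialization files and report their total size in MB
leftovers <- list.files(tempdir(), pattern = "^spark_serialize_.*\\.csv$",
                        full.names = TRUE)
sum(file.size(leftovers)) / 1024^2
```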

@javierluraschi
Collaborator

javierluraschi commented Feb 17, 2017

@adder right, the spark_serialize_*.csv files are currently cleaned up when the R session restarts; however, we could consider cleaning them up when the Spark connection closes, which would be an incremental improvement over the current behavior.

We can't clean them up immediately, since the actual Spark DataFrame is backed by these files.

@adder
Author

adder commented Feb 17, 2017

In my case, I load files sequentially into R, process them, load them as a Spark data frame, and append-write to a Parquet file. The Spark data frame gets overwritten each time with new data, but the old serialized CSV files remain.
Closing and reopening the connection each time is also quite time consuming.
Would removing these files when the Spark data frame is deleted or overwritten not work?
Is there another way to get them removed?
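Or would something like the following be safe once the data has been written out to Parquet? (Just a sketch; clean_serialize_files is a made-up helper name.)

```r
# Remove stale spark_serialize_*.csv files from R's session temp folder.
# Only safe once the data they back has been persisted (e.g. to Parquet).
clean_serialize_files <- function() {
  stale <- list.files(tempdir(), pattern = "^spark_serialize_.*\\.csv$",
                      full.names = TRUE)
  unlink(stale)
}

clean_serialize_files()
```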

@javierluraschi
Collaborator

Haven't given this much thought, but it's worth exploring cleaning temp files more aggressively rather than waiting until the R session restarts.

@yitao-li yitao-li self-assigned this Apr 22, 2020
@yitao-li
Contributor

yitao-li commented Apr 22, 2020

On *nix platforms one can just unlink the temp file immediately after creation, and the underlying file is gone as soon as all of its file descriptors are closed. The file can also be read from /proc/self/fd/<file descriptor number> while there is still an open file descriptor pointing to it.

Sadly, none of this works on Windows, last time I checked. We probably don't want anything too platform-specific in sparklyr, so this will require a bit of research, I guess.
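To illustrate the *nix behavior in plain R (POSIX-only; just a sketch of the concept, not anything sparklyr does today):

```r
path <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", path)

con <- file(path, open = "r")   # keep an open handle to the file
unlink(path)                    # the directory entry disappears immediately...
file.exists(path)               # FALSE
readLines(con)                  # ...but the data is still readable through the open handle
close(con)                      # disk space is reclaimed once the last handle is closed
```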

@yitao-li
Contributor

Starting from sparklyr 1.5, CSV serialization is no longer used, and the new RDS-based serialization format does not rely on temp files.
