temporary spark_serialize_*.csv files don't get deleted #496

Closed
adder opened this issue Feb 16, 2017 · 9 comments

Comments

@adder

adder commented Feb 16, 2017

Hi,
I'm using sparklyr to read hundreds of txt files, process them in R, copy the resulting data frame into Spark memory, and write it out to a Parquet file.
sparklyr creates intermediate spark_serialize_*.csv files in the /tmp folder on the root partition.
However, these files don't get deleted after the data is loaded into Spark.
After hundreds of files these add up to gigabytes and I run out of disk space on my root partition.
Is it expected behavior to keep these files in /tmp?

I'm under the impression that these files are only created as an intermediate step before loading an R data frame into Spark and that they are not needed afterwards.
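For reference, the workflow looks roughly like the sketch below (paths, file names, and the table name are placeholders, not the real job):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# loop over the input files (placeholder path and pattern)
for (f in list.files("data", pattern = "\\.txt$", full.names = TRUE)) {
  df  <- read.delim(f)                                       # read and process one file in R
  tbl <- copy_to(sc, df, "staged", overwrite = TRUE)         # copy the R data frame into Spark
  spark_write_parquet(tbl, "out.parquet", mode = "append")   # append to the Parquet output
}
```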

Best

@edgararuiz-zz
Contributor

Do the files remain after you close the Spark session (spark_disconnect)?

@adder
Author

adder commented Feb 16, 2017

Thanks for the fast response,

Yes, the files remain after closing the spark connection.

@edgararuiz-zz
Contributor

Ok, I'd suggest changing the path of the scratch folder to see whether this has to do with server access:

spark.local.dir (default: /tmp): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager. (http://spark.apache.org/docs/latest/configuration.html)

@adder
Author

adder commented Feb 16, 2017

Hey,
I did the following:

conf = spark_config()
conf$spark.local.dir = paste0(getwd(),'/tmp')
sc = spark_connect(master = "local", version = "2.0.2", hadoop_version = "2.7",config = conf)

But this did not help. The files that Spark writes to this tmp folder do indeed get deleted when the Spark connection is closed.
The spark_serialize_*.csv files, however, are written to a different tmp folder; I believe it's the R session's temp folder (/tmp/RtmpkkztvF).
Those files remain after closing the Spark connection.
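A quick way to see where they end up and how much space they take (the file-name pattern below is inferred from the names reported above):

```r
tempdir()   # R's per-session temp folder, e.g. /tmp/RtmpkkztvF

# list leftover serialization files and report their total size in MB
leftovers <- list.files(tempdir(), pattern = "^spark_serialize_.*\\.csv$",
                        full.names = TRUE)
sum(file.size(leftovers)) / 1024^2
```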

@javierluraschi
Collaborator

javierluraschi commented Feb 17, 2017

@adder right, the spark_serialize_*.csv files are currently cleaned up when the R session restarts; however, we could consider cleaning them up when the Spark connection closes, which would be an incremental improvement over the current behavior.

We can't clean them up immediately, since the actual Spark DataFrame is backed by these files.

@adder
Author

adder commented Feb 17, 2017

In my case, I load files sequentially into R, process them, load them as a Spark data frame, and append-write to a Parquet file. The Spark data frame gets overwritten each time with new data, but the old serialized CSV files remain.
Closing and reopening the connection each time is also quite time consuming.
Would removing these files when the Spark data frame is deleted or overwritten not work?
Is there another way to get them removed?
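Or would something like the following be safe once the data has been written out to Parquet? (Just a sketch; clean_serialize_files is a made-up helper name.)

```r
# Remove stale spark_serialize_*.csv files from R's session temp folder.
# Only safe once the data they back has been persisted (e.g. to Parquet).
clean_serialize_files <- function() {
  stale <- list.files(tempdir(), pattern = "^spark_serialize_.*\\.csv$",
                      full.names = TRUE)
  unlink(stale)
}

clean_serialize_files()
```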

@javierluraschi
Collaborator

Haven't given this much thought, but it's worth exploring cleaning temp files more aggressively rather than waiting until the R session restarts.

@yitao-li yitao-li self-assigned this Apr 22, 2020
@yitao-li
Contributor

yitao-li commented Apr 22, 2020

On *nix platforms one can just unlink the temp file immediately after creation, and the underlying file is gone as soon as all of its file descriptors are closed. The file can also be read from /proc/self/fd/<file descriptor number> while there is still an open file descriptor pointing to it.

Sadly, none of this works on Windows, last time I checked. We probably don't want anything too platform-specific in sparklyr, so this will require a bit of research, I guess.
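To illustrate the *nix behavior in plain R (POSIX-only; just a sketch of the concept, not anything sparklyr does today):

```r
path <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", path)

con <- file(path, open = "r")   # keep an open handle to the file
unlink(path)                    # the directory entry disappears immediately...
file.exists(path)               # FALSE
readLines(con)                  # ...but the data is still readable through the open handle
close(con)                      # disk space is reclaimed once the last handle is closed
```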

@yitao-li
Contributor

Starting from sparklyr 1.5, CSV serialization is no longer used, and the new RDS-based serialization format does not rely on temp files.
