DataFrameSerializer
Serialization support for DataFrames is implemented by the DataFrameSerializer class. It is responsible for both serializing and deserializing DataFrames.
NOTE: For serialization it is recommended to use the static functions provided by the DataFrame interface instead of directly calling the functions of the DataFrameSerializer class.
Serializing a DataFrame is really easy. Just pass the DataFrame you want to serialize to the static serialize() method:

    byte[] bytes = DataFrameSerializer.serialize(df);

That's it. The returned array of bytes represents your DataFrame in serialized form. All information about column names and types is preserved.
You can persist any DataFrame to a file. Don't write the bytes from the example above into a file yourself, though. Use the writeFile() method instead:

    DataFrameSerializer.writeFile("myFile.df", df);

The above code creates a file named "myFile.df" which persists your DataFrame to the filesystem. The bytes written to that file are compressed, which makes the file smaller. As a consequence, you can't open .df files in a standard text editor and change individual entries of your DataFrame. You can use the Icecrusher editor to view and modify any .df file. If you need your file to be human-readable, consider using a CSVWriter.

If your DataFrame is relatively large, writing the entire file might take several seconds. In many cases it is not desirable to perform such heavy operations on the calling thread. Therefore both readFile() and writeFile() have *Async() counterparts which perform the same operation on a background thread. To persist your DataFrame in the background, call writeFileAsync() instead of writeFile(). Asynchronous methods do not block the calling thread and return a CompletableFuture which is completed when the background thread finishes.

    CompletableFuture<Void> future = DataFrameSerializer.writeFileAsync("myFile.df", df);
    // do some work
    // wait for the background thread to complete
    future.get();

When writing to a file asynchronously, the future's get() method returns null on completion. Depending on your use case, you may ignore the CompletableFuture object returned by asynchronous methods. Keep in mind that get() declares the checked exceptions InterruptedException and ExecutionException, so the call must handle or propagate them.
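Because the asynchronous methods return a CompletableFuture, you can also react to completion or failure instead of only calling get(). The following is a minimal sketch that uses only the JDK and does not call the DataFrameSerializer API. The names AsyncWriteSketch, writeBlocking(), writeAsync() and outcome() are hypothetical, and writeBlocking() merely stands in for a blocking call such as writeFile(). How the library schedules its own background thread is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration; not part of the DataFrameSerializer API.
final class AsyncWriteSketch {

    // Stand-in for a blocking write such as writeFile().
    static void writeBlocking(Path file, byte[] bytes) {
        try {
            Files.write(file, bytes);
        } catch (IOException ex) {
            // Wrap the checked exception so the method can be used in a Runnable.
            throw new UncheckedIOException(ex);
        }
    }

    // Runs the blocking write on a background thread and returns immediately.
    static CompletableFuture<Void> writeAsync(Path file, byte[] bytes) {
        return CompletableFuture.runAsync(() -> writeBlocking(file, bytes));
    }

    // Waits for the write and maps success or failure to a short status text.
    static String outcome(CompletableFuture<Void> future) {
        return future
                .handle((ignored, error) -> error == null ? "written" : "failed")
                .join();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sketch", ".df");
        try {
            System.out.println(outcome(writeAsync(file, new byte[] {1, 2, 3})));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With handle(), a failed write surfaces as a value rather than as an exception thrown from get(), which can be convenient when the result is only logged. If you ignore the returned future entirely, a failure goes unnoticed.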
You can also serialize your DataFrame to a Base64-encoded string.

    String encoded = DataFrameSerializer.toBase64(df);

As mentioned in the introduction, the DataFrameSerializer class is also responsible for deserialization. If you still have the array of bytes from the first example, you can get the original DataFrame back like this:

    DataFrame df = DataFrameSerializer.deserialize(bytes);

You can read a .df file from the filesystem by calling:

    DataFrame df = DataFrameSerializer.readFile("myFile.df");

If the DataFrame had any column names set when you persisted it, they are restored along with all concrete column types. You can also perform a read operation in the background. The future's get() method returns the read DataFrame when completed:

    CompletableFuture<DataFrame> future = DataFrameSerializer.readFileAsync("myFile.df");
    // do something else
    // wait for the background thread to complete
    DataFrame df = future.get();

Of course, you can also deserialize the Base64-encoded string we created earlier:

    DataFrame df = DataFrameSerializer.fromBase64(encoded);
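
To illustrate what a Base64 round trip of serialized bytes looks like in general, here is a minimal sketch that uses the JDK's java.util.Base64 class. It assumes that a Base64 string is simply a text encoding of a byte array. Whether toBase64() and fromBase64() use the basic or another Base64 alphabet internally is not stated here, and Base64RoundTripSketch with its encode() and decode() methods is a hypothetical name, not part of the library.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical illustration; not part of the DataFrameSerializer API.
final class Base64RoundTripSketch {

    // Encodes raw bytes as text using the basic Base64 alphabet.
    static String encode(byte[] serialized) {
        return Base64.getEncoder().encodeToString(serialized);
    }

    // Decodes the text back into the original bytes.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Placeholder bytes; in practice these would come from serialize(df).
        byte[] bytes = {1, 2, 3};
        String encoded = encode(bytes);
        byte[] restored = decode(encoded);
        System.out.println(encoded + " restored=" + Arrays.equals(bytes, restored));
    }
}
```

A string form like this is handy when the serialized DataFrame has to travel through a text-only channel such as JSON or a configuration value. Base64 text is roughly a third larger than the bytes it encodes.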