<font size="6"><b>SERDE: SERIALIZATION/DESERIALIZATION</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(jsonlite)
library(qs)
library(fst)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

Serialization is transforming a data object into a format that can be stored or transmitted, while deserialization is the inverse: reconstructing the data object from that stored or transmitted format.

**Serde** is an alias for serialization/deserialization.

In R, several options exist for serde operations.

# Text formats

The advantage of text formats is the ease of conversion and the ubiquity of tools across every platform and programming language.

So text-based serialized data can easily be read and written by almost all applications.

The downside is that metadata cannot easily be stored in text format, so it is sometimes not recreated during deserialization (however, there are some methods to overcome this problem).

Another downside is that text formats can take up more space than binary formats.
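As a rough illustration of the size difference, the sketch below writes the same numeric vector once as text and once as raw binary doubles with base R's `writeBin`. The vector, the temporary files and the variable names are made up for this example, and actual ratios depend on the data and on any compression applied.

```r
set.seed(1)
x <- rnorm(1000)                       # 1000 double-precision values

txt_file <- tempfile(fileext = ".txt")
bin_file <- tempfile(fileext = ".bin")

writeLines(as.character(x), txt_file)  # one decimal string per line
writeBin(x, bin_file)                  # 8 bytes per double, no separators

file.size(txt_file)
file.size(bin_file)

# Reading the binary file back requires knowing the type and length in advance
x_bin <- readBin(bin_file, what = "double", n = length(x))
identical(x, x_bin)
```

The raw binary file is compact but carries no description of its own contents, which is why the binary formats discussed later store metadata alongside the values.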

## base writeLines, readLines

writeLines and readLines write and read character vectors as is, one element per line, without any further conversion.

In [None]:
set.seed(1000)
charvec <- sample(letters, 100, replace = TRUE)

In [None]:
charvec %>% head

In [None]:
writeLines(charvec, "~/databb/temp/charvec")

In [None]:
charvec2 <- readLines("~/databb/temp/charvec")

In [None]:
charvec2 %>% head

In [None]:
identical(charvec, charvec2)

Of course, metadata is not kept with this simple method.
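To make the loss of metadata concrete, here is a minimal sketch with a small factor. The object names and the temporary file are hypothetical and chosen only for this illustration.

```r
# A factor carries metadata: its class and the order of its levels
sizes <- factor(c("low", "high", "low"), levels = c("low", "high"))

tmp <- tempfile()
writeLines(as.character(sizes), tmp)    # writeLines only accepts character vectors
sizes2 <- readLines(tmp)

class(sizes)                            # "factor"
class(sizes2)                           # "character"
identical(as.character(sizes), sizes2)  # the values themselves survive
```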

## read_file from readr

The readr package has its own functions to serialize/deserialize several formats, including csv, tsv and rds (Excel files are handled by the separate readxl package).

In contrast with readLines, which reads each line into a separate vector element, read_file reads the whole file into a single character value:

In [None]:
charvec3 <- read_file("~/databb/temp/charvec")

In [None]:
charvec3

This is identical to collapsing the original vector with newline and appending a trailing newline at the end:

In [None]:
charvec %>% paste(collapse="\n") %>% paste("\n", sep = "") %>% identical(charvec3)

## base write/read into/from csv/tsv

The easiest way is to use the base functions read.table, read.csv, read.delim, write.table and write.csv.

In [None]:
iris1 <- iris

In [None]:
iris1 %>% write.csv("~/databb/temp/iris1.csv", row.names = FALSE)

In [None]:
iris2 <- read.csv("~/databb/temp/iris1.csv")

While serde in text format can recreate the numeric types, the format cannot preserve the metadata of factor columns, so factor attributes are lost.

In [None]:
iris2

The `stringsAsFactors` option controls whether character columns are read as factors, but the levels are then assigned in sorted (alphanumeric) order, so the original level order is not guaranteed to be preserved.
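The level order issue is not visible with iris, because the Species levels already happen to be in alphabetical order. The sketch below uses a small made-up data frame (`df`, with a hypothetical `size` column) whose levels are deliberately not alphabetical, and shows one way to restore the order when the intended levels are known from elsewhere.

```r
df <- data.frame(size = factor(c("small", "large", "medium"),
                               levels = c("small", "medium", "large")))

tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)

df2 <- read.csv(tmp, stringsAsFactors = TRUE)
levels(df$size)    # original order
levels(df2$size)   # sorted order after the round trip

# Restore the original order explicitly; the csv file itself does not store it
df2$size <- factor(df2$size, levels = levels(df$size))
identical(levels(df$size), levels(df2$size))
```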

## fread/fwrite from data.table

The advantage of `fread` and `fwrite` is multithreaded reading and writing of very large objects from and to disk.

The current number of threads used is retrieved by:

In [None]:
getDTthreads()

The thread number can be increased:

In [None]:
setDTthreads(getDTthreads() + 2)

In [None]:
getDTthreads()
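When the thread count is changed only for part of a session, it can be convenient to restore the earlier setting afterwards. The following is a small sketch of that pattern; it relies on `setDTthreads` returning the previous value invisibly, and the variable name `old_threads` is just an illustrative choice.

```r
library(data.table)

old_threads <- setDTthreads(1)   # returns the previous setting invisibly
getDTthreads()                   # now 1

# ... thread-sensitive code would go here ...

setDTthreads(old_threads)        # restore the earlier setting
getDTthreads()
```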

In [None]:
iris1 %>% fwrite("~/databb/temp/iris2.csv", row.names = FALSE)

In [None]:
iris3 <- fread("~/databb/temp/iris2.csv")

While numeric columns are recreated, the factor is now a character column, as is the case with `read.csv`.

In [None]:
iris3

The same `stringsAsFactors` option can be used here, but the order of levels is again not guaranteed to coincide with the original one.

## JSON serde with jsonlite

We will spend more time with JSON format, because it has become almost a universal standard for serde of semistructured and complex data in text format, especially in web applications.

The advantage of serde with JSON using jsonlite is that the metadata can also be preserved, so the object is reconstructed to a large extent (the only limitation being the precision limit of numeric values).

In [None]:
iris_json1 <- serializeJSON(iris1)

In [None]:
iris_json1 %>% class

In [None]:
iris_json1

In [None]:
writeLines(iris_json1, "~/databb/temp/iris1.json")

In [None]:
iris_json2 <- readLines("~/databb/temp/iris1.json")

In [None]:
class(iris_json2) <- "json"

In [None]:
iris_json2

In [None]:
identical(iris_json1, iris_json2)

The JSON read back from disk and the original JSON are identical.

Now let's deserialize the json into the native object format:

In [None]:
iris4 <- unserializeJSON(iris_json2)

In [None]:
iris4

Objects are identical:

In [None]:
identical(iris1, iris4)
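The iris values have only one decimal place, so the precision limit does not show up in the comparison. A minimal sketch with a value that has more digits (here `pi`, chosen purely for illustration) makes the limitation visible. It assumes the default `digits` setting of `serializeJSON`; the argument can be raised when more precision is needed.

```r
library(jsonlite)

x <- pi
x_default <- unserializeJSON(serializeJSON(x))   # uses the default digits setting

identical(x, x_default)           # FALSE: digits beyond the limit are dropped
isTRUE(all.equal(x, x_default))   # TRUE: still equal within numeric tolerance
```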

# Binary formats

While binary formats cannot easily be shared across applications and languages, they take up less space, can be serialized/deserialized faster, and preserve metadata better.

## Base rds

Except for very large objects, rds is the most convenient way to serialize and deserialize R objects within R. rds can handle any kind of R object, not just data.frames or similar.

In [None]:
saveRDS(iris1, "~/databb/temp/iris1.rds")

In [None]:
iris5 <- readRDS("~/databb/temp/iris1.rds")

In [None]:
identical(iris1, iris5)
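To illustrate that rds is not limited to data.frames, the sketch below round-trips a list mixing several object types. The list `obj` and its contents are invented for this example.

```r
obj <- list(
  ids   = 1:3,
  grade = factor(c("b", "a", "b"), levels = c("b", "a")),
  when  = as.Date("2020-01-01"),
  mat   = matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
)

tmp <- tempfile(fileext = ".rds")
saveRDS(obj, tmp)
obj2 <- readRDS(tmp)

identical(obj, obj2)   # classes, level order and dimnames all survive
levels(obj2$grade)
```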

## Base rda

The difference with rda is that an rda file can hold multiple R objects, and when an rda file is deserialized, the objects it holds are automatically assigned to their original names.

In [None]:
iris6 <- iris1

In [None]:
iris7 <- iris1

Serialize both objects into a single rda file:

In [None]:
save(iris6, iris7, file = "~/databb/temp/iris1.rda")

Remove the objects and check that they don't exist anymore:

In [None]:
rm(iris6, iris7)

In [None]:
exists("iris6")

In [None]:
exists("iris7")

Load them again:

In [None]:
load("~/databb/temp/iris1.rda")

Check that they exist and are identical to the original copy:

In [None]:
exists("iris6")

In [None]:
exists("iris7")

In [None]:
identical(iris1, iris6)

In [None]:
identical(iris1, iris7)

rda is useful when a large number of objects are used in a session and serializing/deserializing each object separately would take too much effort.
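Because `load` assigns objects under their original names, it can overwrite objects of the same name that already exist in the workspace. One way to avoid this, sketched below with made-up objects `a` and `b`, is to load into a separate environment through the `envir` argument and inspect the names that `load` returns.

```r
a <- 1:5
b <- letters[1:3]

tmp <- tempfile(fileext = ".rda")
save(a, b, file = tmp)

# Load into a dedicated environment instead of the current workspace
restored <- new.env()
loaded_names <- load(tmp, envir = restored)

loaded_names                # names of the objects found in the file
identical(restored$a, a)
identical(restored$b, b)
```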

## fst

`fst` is a performance tool for serde of very large tabular objects.

Multithreaded read/write is possible, compression is available, and the columns and rows to be read can also be selected.

The number of threads can be retrieved or set:

In [None]:
threads_fst()

In [None]:
write_fst(iris1, "~/databb/temp/iris1.fst")

In [None]:
iris8 <- read_fst("~/databb/temp/iris1.fst")

In [None]:
identical(iris1, iris8)

Selected columns and rows can also be read:

In [None]:
iris9 <- read_fst("~/databb/temp/iris1.fst", columns = c("Sepal.Length", "Sepal.Width"), from = 40, to = 120)

In [None]:
iris9

Note that fst can serialize/deserialize only data.frame and similar tabular objects.

## qs

`qs` is also optimized for very large data objects and supports parallel threads and compression, just like fst.

While fst has more fine-tuned control for reading data row-wise and column-wise, qs reads the whole object at once. However, qs can serialize/deserialize any kind of R object, including lists, visualizations, etc.

In [None]:
qsave(iris1, "~/databb/temp/iris1.qs", nthreads = 8)

In [None]:
iris10 <- qread("~/databb/temp/iris1.qs", nthreads = 8)

In [None]:
identical(iris1, iris10)

Now let's create a list object:

In [None]:
irisl1 <- split(iris1, f = iris1$Species)

In [None]:
irisl1 %>% str

In [None]:
qsave(irisl1, "~/databb/temp/irisl1.qs", nthreads = 8)

In [None]:
irisl2 <- qread("~/databb/temp/irisl1.qs", nthreads = 8)

In [None]:
identical(irisl1, irisl2)

The following call would fail, since fst cannot handle list objects (it is left commented out):

In [None]:
# write_fst(irisl1, "~/databb/temp/irisl1.fst")
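The failure can be observed without interrupting the session by wrapping the call in `tryCatch`. This is a sketch that assumes `write_fst` rejects non-data.frame input with an error, as stated above; the names `list_obj` and `result` are hypothetical.

```r
library(fst)

list_obj <- split(iris, f = iris$Species)   # a list of data.frames, not a data.frame

result <- tryCatch(
  {
    write_fst(list_obj, tempfile(fileext = ".fst"))
    "written"
  },
  error = function(e) conditionMessage(e)   # capture the error text instead of stopping
)
result
```

If the write is rejected, `result` holds the error message rather than the string "written".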