<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R_Beginner/01-02-03-data-import-export-datatable-feather-arrow-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

#  Big-Data Import/Export with {data.table}, {Feather} and {Arrow}




Working with large, complex datasets in R calls for efficient ways to import and export data. Standard data-transfer techniques are often too slow for data of this size, so choosing faster readers, writers, and file formats keeps a big-data analysis workflow quick and manageable.


## Install rpy2

An easy way to run R in Colab with the Python runtime is the **rpy2** Python package. Install it with the pip command:

In [None]:
!pip uninstall rpy2 -y
! pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.5.1
Uninstalling rpy2-3.5.1:
  Successfully uninstalled rpy2-3.5.1
Collecting rpy2==3.5.1
  Using cached rpy2-3.5.1-cp310-cp310-linux_x86_64.whl
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


##  Mount Google Drive

Next, create a folder named "R" in Google Drive so that installed R packages persist between sessions. Before installing R packages in the Python runtime, mount Google Drive and follow the on-screen instructions:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Check and Install Required R Packages

In [None]:
%%R
packages <- c(
          'tidyverse',
          'data.table',
          'feather',
          'arrow'
)

In [None]:
%%R
# Install missing packages
new.packages <- packages[!(packages %in% installed.packages(lib='drive/My Drive/R/')[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib='drive/My Drive/R/')

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

## Load Packages

In [None]:
%%R
# set library path
.libPaths('drive/My Drive/R')
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors





In [None]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

## Data

All datasets used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or [GitHub](https://github.com/zia207/r-colab/tree/main/Data/R_Beginners) account.


In [None]:
%%R
dataFolder = "/content/drive/MyDrive/R_Website/R_Bigenner/Data/"
df<-readr::read_csv(paste0(dataFolder,"nepal_df_balance.csv")) |>
  glimpse()

Rows: 17865 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Foodstatus_ID, Sex_ID, Region_ID, Livelihood_ID
dbl (16): Foodstatus, Schooling_year, Age, Household_size, Rainfed_area, Irr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 17,865
Columns: 20
$ Foodstatus           <dbl> 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0…
$ Schooling_year       <dbl> 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Age                  <dbl> 55, 25, 25, 84, 84, 16, 65, 49, 60, 74, 69, 45, 6…
$ Household_size       <dbl> 3, 3, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 3, 3, 1, 3, 3…
$ Rainfed_area         <dbl> 0.175, 0.258, 0.257, 0.334, 0.334, 0.127, 0.000, …
$ Irrigated_area       <dbl> 0.076, 0.000, 0.000, 0.000, 0.000, 0.051, 0.000, …
$ Remittance           <dbl> 0.000, 0.000, 0.000, 30.600, 30.600, 0.000, 0.000…
$ 

## data.table

The [**data.table**](https://rdatatable.gitlab.io/data.table/) package offers a high-performance alternative to the standard `data.frame` object in base R, with syntax and feature enhancements aimed at ease of use, convenience, and programming speed. Whether you are working with large datasets or complex queries, data.table is a versatile and efficient tool for data manipulation. Its concise syntax, indexing capabilities, and compatibility with other R packages make it well suited to data scientists and analysts who want to speed up their workflow.

Install the released version from CRAN:

> install.packages("data.table")

Install the latest development version (only if a newer one is available):

> data.table::update_dev_pkg()

Install the latest development version (force install):

> install.packages("data.table", repos="https://rdatatable.gitlab.io/data.table")

**Important Features of data.table**

-   fast and friendly delimited **file reader**: [**`?fread`**](https://rdatatable.gitlab.io/data.table/reference/fread.html), see also [convenience features for *small* data](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread)

-   fast and feature rich delimited **file writer**: [**`?fwrite`**](https://rdatatable.gitlab.io/data.table/reference/fwrite.html)

-   low-level **parallelism**: many common operations are internally parallelized to use multiple CPU threads

-   fast and scalable aggregations; e.g. 100GB in RAM (see [benchmarks](https://h2oai.github.io/db-benchmark/) on up to **two billion rows**)

-   fast and feature rich joins: **ordered joins** (e.g. rolling forwards, backwards, nearest and limited staleness), [**overlapping range joins**](https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf) (similar to `IRanges::findOverlaps`), [**non-equi joins**](https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanUseR2016.pdf) (i.e. joins using operators `>, >=, <, <=`), **aggregate on join** (`by=.EACHI`), **update on join**

-   fast add/update/delete columns **by reference** by group using no copies at all

-   fast and feature rich **reshaping** data: [**`?dcast`**](https://rdatatable.gitlab.io/data.table/reference/dcast.data.table.html) (*pivot/wider/spread*) and [**`?melt`**](https://rdatatable.gitlab.io/data.table/reference/melt.data.table.html) (*unpivot/longer/gather*)

-   **any R function from any R package** can be used in queries not just the subset of functions made available by a database backend, also columns of type `list` are supported

-   has [**no dependencies**](https://en.wikipedia.org/wiki/Dependency_hell) at all other than base R itself, for simpler production/maintenance

-   the R dependency is **as old as possible for as long as possible**, dated April 2014, and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0



### Create data.table object

We can create data.table object using the `data.table()` function. Here is an example:

In [None]:
%%R
DT <- data.table(
  Variety = c("BR1", "BR3", "BR16", "BR17", "BR18", "BR19", "BR26",
              "BR27", "BR28", "BR29", "BR35", "BR36"),
  Yield   = c(5.2, 6.0, 6.6, 5.6, 4.7, 5.2, 5.7,
              5.9, 5.3, 6.8, 6.2, 5.8)
)
class(DT)


[1] "data.table" "data.frame"


### Convert data.frame to data.table

You can also convert an existing `data.frame` to a `data.table` with `setDT()`:

In [None]:
%%R
DT<-setDT(df)
class(DT)

[1] "data.table" "data.frame"
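
Note that `setDT()` converts the object in place, so after the cell above `df` itself is a data.table as well. The following minimal sketch contrasts this with `as.data.table()`, which returns a converted copy. The toy data frame `df_demo` is hypothetical and used only for illustration:

```r
library(data.table)

# Hypothetical toy data frame, used only for illustration
df_demo <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# as.data.table() returns a converted copy; the original is left untouched
dt_copy <- as.data.table(df_demo)
still_plain <- !is.data.table(df_demo)

# setDT() converts the object itself, by reference, without copying
setDT(df_demo)
now_dt <- is.data.table(df_demo)

print(c(still_plain = still_plain, now_dt = now_dt))
```

Converting by reference avoids duplicating a large object in memory, which is usually what you want with big data. Use `as.data.table()` when the original data frame must stay unchanged.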


### Reading/Writing CSV file with data.table  `fread()` and `fwrite()`

For reading large files into R as data tables, the **data.table** package provides `fread()`. It is designed for large datasets and is typically much faster than base alternatives such as `read.csv()` or `read.table()`.

`fread()` reads many kinds of delimited files. You can specify the delimiter, select specific columns, and set column types while reading to reduce memory usage, which makes it especially useful for large datasets.

In [None]:
%%R
# read with fread()
df.DT<-data.table::fread(paste0(dataFolder,"nepal_df_balance.csv"), header= TRUE)
str(df.DT)

Classes ‘data.table’ and 'data.frame':	17865 obs. of  20 variables:
 $ Foodstatus          : int  1 1 1 0 0 1 0 1 1 1 ...
 $ Schooling_year      : int  0 5 5 0 0 0 0 0 0 0 ...
 $ Age                 : int  55 25 25 84 84 16 65 49 60 74 ...
 $ Household_size      : int  3 3 3 2 2 2 3 3 3 1 ...
 $ Rainfed_area        : num  0.175 0.258 0.257 0.334 0.334 0.127 0 0.052 0.267 0 ...
 $ Irrigated_area      : num  0.076 0 0 0 0 0.051 0 0.016 0 0 ...
 $ Remittance          : num  0 0 0 30.6 30.6 0 0 0 0 0 ...
 $ No_livestock        : num  3.21 1.96 1.96 2.83 2.83 ...
 $ Infrastructure_Index: num  0.381 0.726 0.727 0.765 0.765 0.773 0.785 0.809 0.82 0.823 ...
 $ Region              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Sex                 : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Caste               : int  0 0 0 1 1 1 1 1 1 1 ...
 $ Livelihood          : int  1 1 1 0 0 1 1 1 1 1 ...
 $ School_Class        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Household_Class     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Remitance_Class
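
The options mentioned above (delimiter, column selection, and column types) can be sketched as follows. The small CSV file and its column names are hypothetical and are written to a temporary file so that the example is self-contained:

```r
library(data.table)

# Hypothetical small CSV written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score,note",
             "1,a,0.5,x",
             "2,b,1.5,y",
             "3,a,2.5,z"), csv_path)

# Read only two columns, force `id` to character, and stop after two rows
dt_small <- fread(
  csv_path,
  sep        = ",",
  select     = c("id", "score"),
  colClasses = c(id = "character"),
  nrows      = 2
)
str(dt_small)
```

Reading only the columns and rows you need is often the simplest way to reduce memory use when a file is large.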

In the data.table package of R, `fwrite()` serves as the counterpart to `fread()`. It is primarily utilized for writing data tables to files, usually in CSV or other delimited formats. With a focus on speed and efficiency, `fwrite()` is optimized to handle large datasets effectively. Therefore, it is an excellent option for saving such datasets.

In [None]:
%%R
# write with fwrite()
data.table::fwrite(df.DT, paste0(dataFolder, "DT.csv"), row.names = FALSE, quote = TRUE)
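
`fwrite()` can also write other delimited formats and append to an existing file. A minimal sketch, using two hypothetical toy tables and a temporary file:

```r
library(data.table)

# Hypothetical toy tables, used only for illustration
dt_a <- data.table(id = 1:2, label = c("x", "y"))
dt_b <- data.table(id = 3:4, label = c("z", "w"))

tsv_path <- tempfile(fileext = ".tsv")

# Write a tab-separated file, then append a second table;
# with append = TRUE the header is not written again by default
fwrite(dt_a, tsv_path, sep = "\t")
fwrite(dt_b, tsv_path, sep = "\t", append = TRUE)

# Read the file back to confirm that both tables were written
dt_back <- fread(tsv_path)
print(dt_back)
```

Appending is handy when a large result is produced in chunks and each chunk is written as soon as it is ready.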

## Feather: A Fast On-Disk Format for Data Frames

Feather is a binary columnar serialization tool that is specifically designed to make reading and writing data frames highly efficient, while also making it easier to share data across various data analysis languages. It offers bindings for both Python (written by Wes McKinney) and R (written by Hadley Wickham) and uses the Apache Arrow columnar memory specification to represent binary data on disk, which results in fast read and write operations. This feature is particularly useful when it comes to encoding null/NA values and variable-length types like UTF8 strings. Feather is an integral part of the Apache Arrow project and defines its own simplified schemas and metadata for on-disk representation.





![alt text](http://drive.google.com/uc?export=view&id=1olj1URtrJ9-vnmvEw3SY3IhBvoMT1GG1)



Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

-   Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible

-   Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today's laptop computers. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk [source](https://www.rstudio.com/blog/feather/).

Feather currently supports the following column types:

-   A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).

-   Logical/boolean values.

-   Dates, times, and timestamps.

-   Factors/categorical variables that have fixed set of possible values.

-   UTF-8 encoded strings.

-   Arbitrary binary data.

All column types support NA/null values.

> install.packages("feather")

### Read/Write with feather

The feather package in R provides functions to read and write data in the Feather file format. Feather is a fast, lightweight, and cross-language columnar storage file format designed for efficient data interchange between programming languages.

In [None]:
%%R
# write_feather()
feather::write_feather(df, paste0(dataFolder, "napal_data.feather"))

### Read feather file

Then we use the `read_feather()` function, which reads data from a Feather file into an R data frame:


In [None]:
%%R
df.feather <- feather::read_feather(paste0(dataFolder, "napal_data.feather"))
str(df.feather)

tibble [17,865 × 20] (S3: tbl_df/tbl/data.frame)
 $ Foodstatus          : num [1:17865] 1 1 1 0 0 1 0 1 1 1 ...
 $ Schooling_year      : num [1:17865] 0 5 5 0 0 0 0 0 0 0 ...
 $ Age                 : num [1:17865] 55 25 25 84 84 16 65 49 60 74 ...
 $ Household_size      : num [1:17865] 3 3 3 2 2 2 3 3 3 1 ...
 $ Rainfed_area        : num [1:17865] 0.175 0.258 0.257 0.334 0.334 0.127 0 0.052 0.267 0 ...
 $ Irrigated_area      : num [1:17865] 0.076 0 0 0 0 0.051 0 0.016 0 0 ...
 $ Remittance          : num [1:17865] 0 0 0 30.6 30.6 0 0 0 0 0 ...
 $ No_livestock        : num [1:17865] 3.21 1.96 1.96 2.83 2.83 ...
 $ Infrastructure_Index: num [1:17865] 0.381 0.726 0.727 0.765 0.765 0.773 0.785 0.809 0.82 0.823 ...
 $ Region              : num [1:17865] 1 1 1 1 1 1 1 1 1 1 ...
 $ Sex                 : num [1:17865] 0 0 0 1 1 0 0 0 0 0 ...
 $ Caste               : num [1:17865] 0 0 0 1 1 1 1 1 1 1 ...
 $ Livelihood          : num [1:17865] 1 1 1 0 0 1 1 1 1 1 ...
 $ School_Class        : num
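
Because Feather is a columnar format, it is also possible to read only some of the columns. The sketch below assumes the `columns` argument of `feather::read_feather()` and uses a hypothetical toy data frame written to a temporary file:

```r
library(feather)

# Hypothetical toy data frame written to a temporary Feather file
toy <- data.frame(id = 1:3,
                  score = c(0.5, 1.5, 2.5),
                  label = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
feather_path <- tempfile(fileext = ".feather")
write_feather(toy, feather_path)

# Read back only the columns that are needed
subset_cols <- read_feather(feather_path, columns = c("id", "score"))
print(subset_cols)
```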

## Apache Arrow

[Apache Arrow](https://arrow.apache.org/docs/r/index.html) is a cross-language development platform for processing data, both in-memory and larger-than-memory. It provides a standardized, language-independent columnar memory format for flat and hierarchical data, organized to support fast analytic operations on modern hardware. Additionally, it offers computational libraries and zero-copy streaming, messaging, and interprocess communication.

![alt text](http://drive.google.com/uc?export=view&id=1uJnX1RjsWQXSuGVZxVcnlBCxGsrsrc8_)


The arrow R package exposes an interface to the `Arrow C++ library`, allowing access to many of its features in R. It provides not only low-level access to the Arrow `C++ library API` but also higher-level access through a `dplyr` backend and familiar R functions.

The arrow package boasts several key features, including interoperability, columnar data representation, and high performance. Arrow offers seamless communication between different systems and languages, making it easy to exchange data between R and other programming languages such as `Python`, `Julia`, and `C++`. Arrow uses a columnar memory layout, which can be more efficient for many analytical tasks than traditional row-based formats. Arrow is designed for high-performance data processing, making it suitable for big data and parallel computing environments.

The arrow package also provides several functionalities. It allows importing data from various sources into R and exporting R data to Arrow files. Arrow data can be manipulated in R for various tasks such as filtering, sorting, and aggregating. Arrow can be integrated with other R packages for advanced data analysis and visualization tasks.
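
As a minimal sketch of such manipulation, the example below filters and aggregates an Arrow Table through the `dplyr` backend. The toy table and its column names are hypothetical:

```r
library(arrow)
library(dplyr)

# Hypothetical toy table; column names are illustrative only
tbl <- arrow_table(
  region = c("east", "west", "east", "west"),
  yield  = c(5.2, 6.0, 6.6, 5.6)
)

# The dplyr verbs build a query on the Arrow Table; collect() runs it
# and returns the result as an R data frame (tibble)
result <- tbl |>
  filter(yield > 5.5) |>
  group_by(region) |>
  summarise(mean_yield = mean(yield)) |>
  arrange(region) |>
  collect()

print(result)
```

Only the final `collect()` brings data into an R data frame, so the intermediate steps are carried out by Arrow rather than on R objects.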



### Tabular data in Arrow

Apache Arrow relies on its in-memory columnar format, a standardized, programming language-independent definition for representing structured, table-like datasets in memory. The arrow R package uses the Table class to store these objects, which behave like data frames. You can use the `arrow_table()` function to create new Arrow Tables, much as `data.frame()` is used to create new data frames.

In [None]:
%%R
dat <- arrow_table(x = 1:4, y = c("a", "b", "c", "d"))
dat

Table
4 rows x 2 columns
$x <int32>
$y <string>



You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:

In [None]:
%%R
dat[1:3, 1:2]

Table
3 rows x 2 columns
$x <int32>
$y <string>


Along the same lines, the `$` operator can be used to extract named columns:

In [None]:
%%R
dat$y

ChunkedArray
<string>
[
  [
    "a",
    "b",
    "c",
    "d"
  ]
]


### Converting Arrow Tables to data frames

In [None]:
%%R
as.data.frame(dat)

  x y
1 1 a
2 2 b
3 3 c
4 4 d


### Convert data.frame to Arrow Table

We can also convert an existing data.frame to an Arrow Table:

In [None]:
%%R
df.arrow <- arrow_table(name = rownames(df), df)
dim(df.arrow)

[1] 17865    21


### Reading and writing data with Arrow

One of the key features of Arrow is its ability to handle data in different formats, including `CSV`, `Parquet`, and `Arrow` (also called Feather). While many packages support `CSV`, Arrow's fast CSV reading and writing make it stand out. Arrow also supports the Parquet and Arrow formats, which are less widely supported in other packages, making it a good choice for handling complex data structures.

Another unique feature of Arrow is its support for multi-file datasets. It can store a single rectangular dataset across multiple files, thus making it possible to work with large datasets that cannot fit into memory. This feature is handy for data scientists and analysts who work with big data and must process large datasets efficiently.
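
A minimal sketch of a multi-file dataset, using `write_dataset()` and `open_dataset()` from the arrow package. The toy data frame, the partitioning column, and the temporary directory are hypothetical choices for illustration:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  region = c("east", "west", "east", "west"),
  yield  = c(5.2, 6.0, 6.6, 5.6)
)

# Write one Parquet file per region into a temporary directory
ds_dir <- file.path(tempdir(), "toy_dataset")
write_dataset(toy, ds_dir, format = "parquet", partitioning = "region")

# open_dataset() does not load the rows; they are read when collect() is called
ds <- open_dataset(ds_dir)
east <- ds |>
  filter(region == "east") |>
  collect()

print(list.files(ds_dir, recursive = TRUE))
print(east)
```

Partitioning by a column that you often filter on lets Arrow skip files that cannot contain matching rows.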

When the goal is to read a single data file into memory, there are several functions you can use:

-   `read_parquet()`: read a file in Parquet format

-   `read_feather()`: read a file in Arrow/Feather format

-   `read_delim_arrow()`: read a delimited text file

-   `read_csv_arrow()`: read a comma-separated values (CSV) file

-   `read_tsv_arrow()`: read a tab-separated values (TSV) file

-   `read_json_arrow()`: read a JSON data file

For writing data to single files, the arrow package provides the following functions, which can be used with both R data frames and Arrow Tables:

-   `write_parquet()`: write a file in Parquet format

-   `write_feather()`: write a file in Arrow IPC format

-   `write_csv_arrow()`: write a file in CSV format


We will write the data frame `df` to a Parquet file using the `write_parquet()` function:

In [None]:
%%R
arrow::write_parquet(df, paste0(dataFolder, "napal_data.parquet"))

We can then use `read_parquet()` to load the data from this file. The default behavior, used here, is to return a data frame; with `as_data_frame = FALSE` the data are read as an Arrow Table instead:

In [None]:
%%R
df.parquet<- arrow::read_parquet(paste0(dataFolder, "napal_data.parquet"))
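
The two return types can be compared with a small sketch. The toy data frame and temporary file are hypothetical; `col_select` is used here to read only some of the columns:

```r
library(arrow)

# Hypothetical toy data frame written to a temporary Parquet file
toy <- data.frame(id = 1:3, score = c(0.5, 1.5, 2.5), label = c("a", "b", "c"))
pq_path <- tempfile(fileext = ".parquet")
write_parquet(toy, pq_path)

# Default: the file is returned as an R data frame (tibble)
as_df <- read_parquet(pq_path)

# as_data_frame = FALSE keeps the data as an Arrow Table;
# col_select limits the read to the named columns
as_tbl <- read_parquet(pq_path,
                       col_select = c("id", "score"),
                       as_data_frame = FALSE)

print(class(as_df))
print(as_tbl)
```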


## Comparison

### File Size

Comparing the size of the same data stored in different file formats (CSV, Feather, and Parquet) gives insight into their storage efficiency. Note, however, that the actual file size depends on factors such as the data types, the compression settings, and the nature of the data itself.

Now, check the disk space used by these three formats (sizes are in kilobytes):

In [None]:
%%R
# CSV file
file.info(paste0(dataFolder,"nepal_df_balance.csv"))$size/1000


[1] 1687.577


In [None]:
%%R
# Feather
file.info(paste0(dataFolder,"napal_data.feather"))$size/1000


[1] 3402.08


In [None]:
%%R
# parquet
file.info(paste0(dataFolder,"napal_data.parquet"))$size/1000

[1] 198.24
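
To see how the compression setting affects Parquet file size, the same data can be written with several codecs. This is a sketch on hypothetical synthetic data. It uses only codecs that `codec_is_available()` reports for the local arrow build, and the sizes will differ for other data:

```r
library(arrow)

# Hypothetical repetitive data, used only for illustration
set.seed(1)
toy <- data.frame(
  group = sample(c("a", "b", "c"), 50000, replace = TRUE),
  value = round(runif(50000), 1)
)

# Keep only the codecs supported by this arrow installation
codecs <- c("uncompressed", "snappy", "gzip")
codecs <- codecs[vapply(codecs, codec_is_available, logical(1))]

# Write the same data with each codec and record the file size in kB
sizes_kb <- vapply(codecs, function(codec) {
  path <- tempfile(fileext = ".parquet")
  write_parquet(toy, path, compression = codec)
  file.info(path)$size / 1000
}, numeric(1))

print(sizes_kb)
```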


### Reading time

In [None]:
%%R
# R-base function `read.csv()`
system.time(read.csv(paste0(dataFolder,"nepal_df_balance.csv"), header= TRUE))

   user  system elapsed 
  0.110   0.004   0.117 


In [None]:
%%R
# data.table `fread()`
system.time(data.table::fread(paste0(dataFolder,"nepal_df_balance.csv"), header= TRUE))

   user  system elapsed 
  0.013   0.001   0.021 


In [None]:
%%R
# Feather `read_feather()`
system.time(feather::read_feather(paste0(dataFolder, "napal_data.feather")))

   user  system elapsed 
  0.007   0.002   0.014 


In [None]:
%%R
# Arrow `read_parquet()`
system.time( arrow::read_parquet(paste0(dataFolder, "napal_data.parquet")))

   user  system elapsed 
  0.014   0.006   0.016 
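
A single `system.time()` call on a file this small can vary noticeably from run to run. One way to get steadier numbers is to repeat each read and take the median, as in the sketch below. The helper `median_elapsed()` and the synthetic data are hypothetical and not part of any package:

```r
library(data.table)

# Hypothetical synthetic data, written once to a temporary CSV file
set.seed(42)
toy <- data.frame(id = 1:20000,
                  x  = rnorm(20000),
                  g  = sample(letters, 20000, replace = TRUE))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Median elapsed time (seconds) of `expr_fun` over `times` repetitions
median_elapsed <- function(expr_fun, times = 5) {
  median(replicate(times, system.time(expr_fun())[["elapsed"]]))
}

timings <- c(
  read.csv = median_elapsed(function() read.csv(csv_path)),
  fread    = median_elapsed(function() fread(csv_path))
)
print(timings)
```

The same helper can wrap any of the read or write calls in this section, which makes it easier to compare formats on your own data.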


### Writing time

In [None]:
%%R
# R-base function `write.csv()`
system.time(write.csv(df,  paste0(dataFolder, "df.csv"), row.names=F))

   user  system elapsed 
  0.347   0.002   0.374 


In [None]:
%%R
# data.table `fwrite()`
system.time(data.table::fwrite(df.DT,  paste0(dataFolder, "DT.csv"), row.names=F, quote=TRUE))

   user  system elapsed 
  0.018   0.000   0.035 


In [None]:
%%R
# Feather `write_feather()`
system.time(feather::write_feather(df.feather, paste0(dataFolder, "napal_data.feather")))

   user  system elapsed 
  0.004   0.004   0.033 


In [None]:
%%R
# Arrow `write_parquet()`
system.time(arrow::write_parquet(df.feather, paste0(dataFolder, "napal_data.parquet")))

   user  system elapsed 
  0.027   0.003   0.043 



## Summary

Dealing with big data in R requires efficient import and export methods to ensure performance and scalability. Columnar storage formats such as Parquet and Feather, possibly alongside database connections and distributed computing frameworks (not covered here), can help you handle and analyze large datasets in R. Compression can further reduce the cost of storing and transferring big data files.

This tutorial covers efficient data import and export using the R packages **data.table**, **Arrow**, and **Feather**, which handle large datasets with speed and ease. We explored data.table's functions for importing and exporting data and Feather's binary columnar format for exchanging data between R and other programming languages. With these packages, data scientists can handle large datasets efficiently in terms of storage, speed, and readability.

To optimize data manipulation workflows, consider exploring advanced features of data.table, and experimenting with feather's compatibility with various data science ecosystems. On the other hand, the arrow package provides a powerful platform for efficient analytic operations on data, with its standardized columnar memory format, computational libraries, and zero-copy streaming capabilities. Its interoperability, columnar data representation, and high performance make it a valuable tool for big data and parallel computing environments.

In this comparison, the **Parquet** file (about 198 kB) was far smaller on disk than the **CSV** (about 1,688 kB) and **Feather** (about 3,402 kB) files, which makes Parquet an attractive option for big data. **Feather** read and wrote slightly faster than Parquet in the timings above, but took up more space on disk. Parquet's small size comes from its support for compression; the actual file size depends on factors such as the compression codec used (e.g., Snappy, Gzip) and the nature of the data itself. The Parquet format is therefore a good choice for storing large datasets efficiently while keeping storage costs low.



## References

1. [Feather](https://posit.co/blog/feather/)

2. [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)

3. [Arrow](https://arrow.apache.org/docs/r/)