<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R%20for%20Beginners/data_table_import_export.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

# **Data Import/Export with data.table and Feather**



## Introduction

### Data-table

The [**data.table**](https://rdatatable.gitlab.io/data.table/) package provides a high-performance alternative to the standard `data.frame` object in base R. It extends `data.frame` with a concise syntax for subsetting, grouping, updating, and joining, and it is designed for speed and memory efficiency on large datasets. Because a `data.table` is also a `data.frame`, it works with other R packages that accept data frames.

![alt text](http://drive.google.com/uc?export=view&id=1ok_8au0GAsR6nvk5qpejgcn5beGxdWQ5)

**Features**

-   fast and friendly delimited **file reader**: [**`?fread`**](https://rdatatable.gitlab.io/data.table/reference/fread.html), see also [convenience features for *small* data](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread)

-   fast and feature rich delimited **file writer**: [**`?fwrite`**](https://rdatatable.gitlab.io/data.table/reference/fwrite.html)

-   low-level **parallelism**: many common operations are internally parallelized to use multiple CPU threads

-   fast and scalable aggregations; e.g. 100GB in RAM (see [benchmarks](https://h2oai.github.io/db-benchmark/) on up to **two billion rows**)

-   fast and feature rich joins: **ordered joins** (e.g. rolling forwards, backwards, nearest and limited staleness), [**overlapping range joins**](https://github.com/Rdatatable/data.table/wiki/talks/EARL2014_OverlapRangeJoin_Arun.pdf) (similar to `IRanges::findOverlaps`), [**non-equi joins**](https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanUseR2016.pdf) (i.e. joins using operators `>, >=, <, <=`), **aggregate on join** (`by=.EACHI`), **update on join**

-   fast add/update/delete columns **by reference** by group using no copies at all

-   fast and feature rich **reshaping** data: [**`?dcast`**](https://rdatatable.gitlab.io/data.table/reference/dcast.data.table.html) (*pivot/wider/spread*) and [**`?melt`**](https://rdatatable.gitlab.io/data.table/reference/melt.data.table.html) (*unpivot/longer/gather*)

-   **any R function from any R package** can be used in queries not just the subset of functions made available by a database backend, also columns of type `list` are supported

-   has [**no dependencies**](https://en.wikipedia.org/wiki/Dependency_hell) at all other than base R itself, for simpler production/maintenance

-   the R dependency is **as old as possible for as long as possible**, dated April 2014, and we continuously test against that version; e.g. v1.11.0 released on 5 May 2018 bumped the dependency up from 5 year old R 3.0.0 to 4 year old R 3.1.0


> install.packages("data.table")

The latest development version (only if newer available)

> data.table::update_dev_pkg()

The latest development version (force install)

> install.packages("data.table", repos="https://rdatatable.gitlab.io/data.table")

### Feather: A Fast On-Disk Format for Data Frames

Feather is a binary columnar serialization tool that is specifically designed to make reading and writing data frames highly efficient, while also making it easier to share data across various data analysis languages. It offers bindings for both Python (written by Wes McKinney) and R (written by Hadley Wickham) and uses the Apache Arrow columnar memory specification to represent binary data on disk, which results in fast read and write operations. This feature is particularly useful when it comes to encoding null/NA values and variable-length types like UTF8 strings. Feather is an integral part of the Apache Arrow project and defines its own simplified schemas and metadata for on-disk representation.





![alt text](http://drive.google.com/uc?export=view&id=1olj1URtrJ9-vnmvEw3SY3IhBvoMT1GG1)



Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

-   Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible

-   Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today's laptop computers. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk [source](https://www.rstudio.com/blog/feather/).

Feather currently supports the following column types:

-   A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).

-   Logical/boolean values.

-   Dates, times, and timestamps.

-   Factors/categorical variables that have a fixed set of possible values.

-   UTF-8 encoded strings.

-   Arbitrary binary data.

All column types support NA/null values.

> install.packages("feather")

## Install rpy2

An easy way to run R in Colab with the Python runtime is the **rpy2** Python package. Install it with `pip`, then load its IPython extension:

In [None]:
!pip uninstall rpy2 -y
! pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.4.2
Uninstalling rpy2-3.4.2:
  Successfully uninstalled rpy2-3.4.2
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.7/201.7 kB 2.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp310-cp310-linux_x86_64.whl size=314933 sha256=98839c496cc15441f335f68bb835401b55502fe06cdf291bed5d0bed4dfb055a
  Stored in directory: /root/.cache/pip/wheels/73/a6/ff/4e75dd1ce1cfa2b9a670cbccf6a1e41c553199e9b25f05d953
Successfully built rpy2
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


##  Mount Google Drive

Next, create a folder named "R" in Google Drive so that installed R packages persist between sessions. Before installing any R package in the Python runtime, mount Google Drive and follow the on-screen instructions:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Check and Install Required R Packages

In [None]:
%%R
# packages required for this tutorial
pkg <- c('tidyverse', 'data.table', 'feather')
new.packages <- pkg[!(pkg %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages,lib='drive/My Drive/R/')

## Load Libraries

In [None]:
%%R
# set library path
.libPaths('drive/My Drive/R')
library(tidyverse)
library(data.table)
library(feather)

## Data


All data sets used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or my [GitHub](https://github.com/zia207/r-colab/tree/main/Data/R_Beginners) account.



## Reading CSV file with **fread**

If you're dealing with large datasets and looking for an efficient way to read files into R as data tables, the **data.table** package has got you covered with its highly efficient function called `fread()`. This function outperforms other alternatives like read.csv or read.table and is specifically designed to handle large datasets. So, if you want to save time and increase your productivity, consider using `fread()` for your file reading needs.

The fread function in data.table offers a great level of versatility when it comes to efficiently reading various types of delimited files. You can easily specify delimiters, select specific columns, and even set particular data types while reading to optimize memory usage. This function proves to be especially powerful when dealing with large datasets due to its exceptional speed and memory efficiency.
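The options mentioned above can be sketched as follows. This is a minimal illustration, assuming a small semicolon-delimited file written to a temporary path; the file contents and column names are hypothetical and are not part of the tutorial's data sets.

```r
library(data.table)

# Write a small semicolon-delimited file to a temporary location
# (hypothetical data, used only to keep the example self-contained)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score;note",
             "001;alpha;1.5;x",
             "002;beta;2.5;y"), tmp)

# sep:        the field delimiter
# select:     read only the listed columns
# colClasses: force column types (here, keep `id` as character)
dt <- fread(tmp,
            sep = ";",
            select = c("id", "score"),
            colClasses = list(character = "id"))
str(dt)

unlink(tmp)
```

Reading only the needed columns and fixing their types up front avoids loading and converting data that will not be used.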

In [None]:
%%R
dataFolder<- "/content/drive/MyDrive/R_Website/R_Bigenner/Data/"
DT<-data.table::fread(paste0(dataFolder,"LBC_Data.csv"), header= TRUE)
str(DT)


Classes ‘data.table’ and 'data.frame':	3110 obs. of  25 variables:
 $ FIPS            : int  1003 1013 1013 1017 1023 1025 1031 1035 1039 1041 ...
 $ REGION_ID       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ STATE           : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ County          : chr  "Baldwin County" "Butler County" "Butler County" "Chambers County" ...
 $ X               : num  789778 877732 877732 984215 726606 ...
 $ Y               : num  884557 1007286 1007286 1148649 1023616 ...
 $ Empty_Column    : logi  NA NA NA NA NA NA ...
 $ LCB Rate        : num  48.1 38.3 38.3 49.6 31.8 42 53.7 46.9 65.5 57.1 ...
 $ Smoking         : num  20.8 26 26 25.1 21.8 22.6 21.2 24.9 25.9 22.9 ...
 $ PM25            : num  7.89 8.46 8.46 8.87 8.58 8.42 8.42 8.23 8.24 8.45 ...
 $ NO2             : num  0.794 0.634 0.634 0.844 0.593 ...
 $ SO2             : num  0.0353 0.0135 0.0135 0.0482 0.024 ...
 $ Ozone           : num  39.8 38.3 38.3 40.1 37.1 ...
 $ Pop 65          : num  19.5 19 19 18.

We can also create a data.table object using the `data.table()` function. Here is an example:

In [None]:
%%R
DT= data.table(
    Variety =c("BR1","BR3", "BR16", "BR17", "BR18", "BR19","BR26",
	      "BR27","BR28","BR29","BR35","BR36"),
    Yield = c(5.2,6.0,6.6,5.6,4.7,5.2,5.7,
	            5.9,5.3,6.8,6.2,5.8))
DT

    Variety Yield
 1:     BR1   5.2
 2:     BR3   6.0
 3:    BR16   6.6
 4:    BR17   5.6
 5:    BR18   4.7
 6:    BR19   5.2
 7:    BR26   5.7
 8:    BR27   5.9
 9:    BR28   5.3
10:    BR29   6.8
11:    BR35   6.2
12:    BR36   5.8


You can also convert existing objects to a `data.table` using `setDT()` (for `data.frame`s and `list`s) and `as.data.table()` (for other structures); the difference is beyond the scope of this tutorial, see `?setDT` and `?as.data.table` for more details.
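As a brief illustration of how the two calls are used, here is a minimal sketch with a hypothetical two-row data frame. `as.data.table()` returns a new object, while `setDT()` converts its argument in place; the help pages cover the details.

```r
library(data.table)

# Hypothetical data frame used only for this illustration
df <- data.frame(Variety = c("BR1", "BR3"), Yield = c(5.2, 6.0))

# as.data.table() returns a new data.table; df itself stays a data.frame
dt_copy <- as.data.table(df)
is.data.table(df)       # FALSE
is.data.table(dt_copy)  # TRUE

# setDT() converts df in place (by reference), without making a copy
setDT(df)
is.data.table(df)       # TRUE
```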

Now we compare the reading time of `fread()` with that of base R's `read.csv()`:

In [None]:
%%R
# r-base
system.time(read.csv(paste0(dataFolder,"LBC_Data.csv"), header= TRUE))

   user  system elapsed 
  0.021   0.000   0.024 


In [None]:
%%R
# data.table
system.time(data.table::fread(paste0(dataFolder,"LBC_Data.csv"), header= TRUE))


   user  system elapsed 
  0.005   0.000   0.009 


## Writing CSV file with **fwrite**


In the data.table package of R, `fwrite()` serves as the counterpart to **fread**. It is primarily utilized for writing data tables to files, usually in CSV or other delimited formats. With a focus on speed and efficiency, `fwrite()` is optimized to handle large datasets effectively. Therefore, it is an excellent option for saving such datasets.


In [None]:
%%R
# write with fwrite()
data.table::fwrite(DT,  paste0(dataFolder, "DT.csv"), row.names=F, quote=TRUE)

Now we compare the writing time of `fwrite()` with that of base R's `write.csv()`.

In [None]:
%%R
#r-base
system.time(write.csv(DT,  paste0(dataFolder, "DT.csv"), row.names=F))

   user  system elapsed 
  0.007   0.001   0.011 


In [None]:
%%R
## data.table
system.time(data.table::fwrite(DT,  paste0(dataFolder, "DT.csv"), row.names=F, quote=TRUE))

   user  system elapsed 
  0.000   0.000   0.005 
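The table written above has only 12 rows, so both timings mostly reflect fixed overhead. The sketch below repeats the comparison on a larger synthetic table. The table, its column names, and the temporary file paths are assumptions made for illustration, and the timings will vary with hardware, disk, and the number of threads available to data.table.

```r
library(data.table)

# Synthetic table (hypothetical): 200,000 rows, three columns
set.seed(1)
n <- 200000L
big <- data.table(id = seq_len(n),
                  x  = rnorm(n),
                  g  = sample(letters, n, replace = TRUE))

csv_base <- tempfile(fileext = ".csv")
csv_dt   <- tempfile(fileext = ".csv")

# Time the base R writer and the data.table writer on the same table
t_base <- system.time(write.csv(big, csv_base, row.names = FALSE))
t_dt   <- system.time(fwrite(big, csv_dt))
print(c(write.csv = t_base[["elapsed"]], fwrite = t_dt[["elapsed"]]))

# Check that the file written by fwrite() reads back with the same shape
back <- fread(csv_dt)
dim(back)

unlink(c(csv_base, csv_dt))
```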


## Write with feather

First, we write the data to a Feather file using the `write_feather()` function:

In [None]:
%%R
# write_feather()
feather::write_feather(DT, paste0(dataFolder, "LBC_data.feather"))

We can then read the Feather file back quickly using the `read_feather()` function:

In [None]:
%%R
DT_feather <- feather::read_feather(paste0(dataFolder, "LBC_data.feather"))
str(DT_feather)

tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
 $ Variety: chr [1:12] "BR1" "BR3" "BR16" "BR17" ...
 $ Yield  : num [1:12] 5.2 6 6.6 5.6 4.7 5.2 5.7 5.9 5.3 6.8 ...


In [None]:
%%R
system.time(feather::write_feather(DT, paste0(dataFolder, "LBC_data.feather")))

   user  system elapsed 
  0.001   0.000   0.006 
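To illustrate the column types and NA handling described in the introduction, here is a minimal round-trip sketch. It assumes the `feather` package is installed; the data frame and its column names are hypothetical.

```r
library(feather)

# Hypothetical data frame with a factor, a numeric, and a Date column,
# each containing one missing value
df <- data.frame(
  variety = factor(c("BR1", "BR3", NA)),
  yield   = c(5.2, NA, 6.6),
  sown    = as.Date(c("2023-01-15", "2023-02-01", NA))
)

path <- tempfile(fileext = ".feather")
write_feather(df, path)
back <- read_feather(path)

# Compare column classes before and after the round trip
sapply(df, class)
sapply(back, class)

unlink(path)
```

Unlike a CSV round trip, the column types do not have to be re-specified when the file is read back.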


## Summary

This tutorial covers efficient data export-import processes using the R packages data.table and feather, which handle large datasets with speed and ease. We explore data.table's syntax for importing and exporting data and feather's binary columnar data format for seamless data exchange between R and other programming languages. Using these packages, data scientists can handle large datasets efficiently, ensuring speed and readability in data operations. To optimize data manipulation workflows, consider exploring advanced features of data.table and experimenting with feather's compatibility with various data science ecosystems.

## Further Reading

1.  [Feather: A Fast On-Disk Format for Data Frames](https://posit.co/blog/feather/)

2.  [Introduction to data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)