<p style="font-family: Arial; font-size:3.75em;color:purple; font-weight:bold"><br>
Benchmarking Read and Write Performance Across File Formats</p><br>

**_DataFrames_** is a Julia library for tabular data manipulation. It offers a number of data exploration, cleaning, and transformation operations that are critical when working with data in Julia, similar to pandas in Python and to data.table and dplyr in R. DataFrames.jl works well with a range of file formats, such as CSV (using CSV.jl), Apache Arrow (using Arrow.jl), Stata, SPSS, and SAS files (using StatFiles.jl), and Parquet (using Parquet.jl).


## Advanced Topics: Working with Other File Formats

Let us compare read and write performance across formats using a very large file.

XML, JSON, BSON, YAML, MessagePack, and protobuf are some commonly used data serialization formats.

JDF is a serialization format for DataFrames provided by the JDF.jl package. JDF stores a DataFrame in a folder, with each column stored as a separate file. There is also a metadata.jls file that stores metadata about the original DataFrame. Collectively, the column files, the metadata file, and the folder are called a JDF "file".

JDF.jl is a pure-Julia solution, which makes it easier to do nifty things such as compression and encapsulating the underlying structure of the arrays, something that is hard to do in R and Python. For example, Python's NumPy arrays are C objects, whereas all the vector types used in JDF are Julia data types.

JDF is a DataFrames serialization format with the following goals:

* Fast save and load times
* Compressed storage on disk
* Enable disk-based data manipulation (not yet achieved; from v0.4.0)
* Supports machine learning workloads, e.g. mini-batch, sampling (not yet achieved; from v0.4.0)


More here: https://github.com/xiaodaigh/JDF.jl
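
The folder layout described above can be checked on a tiny table. This is a minimal sketch, assuming the same `JDF.save` and `JDF.load` calls that the benchmark cells later in this notebook use; the three-row DataFrame and the `small.jdf` name are made up for illustration.

```julia
using DataFrames
using JDF

# Hypothetical toy table standing in for the large dataset.
df = DataFrame(id = 1:3, title = ["a", "b", "c"])

# Write into a scratch directory so nothing in the working directory is touched.
path = joinpath(mktempdir(), "small.jdf")
JDF.save(path, df)

# A JDF "file" is a folder: list what was written into it.
println(readdir(path))

# Load it back and convert to a DataFrame, as in the benchmark cells.
df2 = JDF.load(path) |> DataFrame
println(df2 == df)
```

Listing the folder is a quick way to see the per-column files and the metadata file before running the much slower benchmark on the full data.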

## Data: Seattle Library Collection Inventory

This dataset includes monthly snapshots of all of the physical items in the Seattle Public Library’s collection. Consistent monthly data begins with a snapshot taken August 1, 2016, continuing to the present. Additionally, this dataset contains snapshots taken on January 1 in the years 2012, 2013, 2014, and 2016.

* Approx. 12 GB of data from August 2016 to November 2019.
* Dimensions of the data: 35,531,308 × 13 (approx. 35 million observations across 13 variables)

The code below compares read and write times for this large file. The data was taken from https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory


Memory management in R: http://adv-r.had.co.nz/memory.html

## Importing library

In [1]:
using Pkg

In [2]:
#Pkg.add("DataFrames")
#Pkg.add("CSV") ## To load a file
#Pkg.add("TableView") ## optional - to render a table
#Pkg.add("WebIO")
#Pkg.add("DataFramesMeta")

In [1]:
using CSV
using DataFrames

In [4]:
#using WebIO
#WebIO.install_jupyter_nbextension()
#using TableView

## General settings 

By default, this session uses a display width of 80 characters for columns and a display height of 30 lines for rows.

* Check the default number of rows and columns that are displayed.
* Change the default setting.
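
Both steps can also be done without permanently changing the session. The following is a minimal sketch using only Base Julia: it reads the two variables with a fallback in case they are unset, and uses `withenv` to override them only inside a block. The values `"1000"` and `"50"` are arbitrary choices for illustration.

```julia
# Read the current values, with fallbacks in case the variables are not set.
cols = get(ENV, "COLUMNS", "80")
rows = get(ENV, "LINES", "30")
println("COLUMNS = ", cols, ", LINES = ", rows)

# Override both values only for the duration of the do-block.
wide = withenv("COLUMNS" => "1000", "LINES" => "50") do
    displaysize()   # the no-argument form reads LINES and COLUMNS from ENV
end
println(wide)

# After the block, the previous settings are back in place.
println(get(ENV, "COLUMNS", "80") == cols)
```

Assigning to `ENV["COLUMNS"]` directly, as done later in this notebook, keeps the new width for the rest of the session instead.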

In [2]:
ENV["COLUMNS"]

"80"

In [3]:
ENV["LINES"]

"30"

In [4]:
# Home directory of the current user
homedir()

"/Users/Rahul"

In [5]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

In [6]:
cd("../")

In [7]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code"

In [8]:
cd("./Julia_Practice")

In [9]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

## Read the dataset, which is in csv format

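pandas offers many `read_*` functions for reading data from multiple sources and formats, such as JSON, SQL databases, Excel, and pickle (Python serialized objects). In Julia, each format is typically handled by its own package, for example CSV.jl for CSV files.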

In [10]:
ENV["COLUMNS"] = 1000

1000

In [11]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

In [12]:
cd("../../../../")
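
Chains of relative hops like the one above are easy to get wrong, and they leave the session in a different directory. A possible alternative, sketched here with Base functions only, is the do-block form of `cd`, which changes directory just for the block and then returns to where it started. The scratch directory tree below is made up so the example does not depend on any real path.

```julia
# Build a throwaway directory tree that mimics a nested project layout.
root = mktempdir()
practice = joinpath(root, "Julia_Code", "Julia_Practice")
mkpath(practice)

start = pwd()

# The working directory is changed only while the block runs.
inside = cd(practice) do
    pwd()
end

println(basename(inside))
println(pwd() == start)
```

Building paths with `joinpath` also avoids hard-coding separators such as `../../`.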

In [13]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts"

In [14]:
Threads.nthreads()

6

In [15]:
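# Parse with 6 concurrent tasks, matching Threads.nthreads() above
# (this keyword was renamed to `ntasks` in later CSV.jl releases).
big_df = CSV.read("./R/seattle-library-collection-inventory/library-collection-inventory.csv", DataFrame, tasks=6)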
#big_df = CSV.read("./data/bigdata.csv", DataFrame)

#CSV.File("abc.csv"; allowmissing=:none, limit=5)

Row,BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount
,Int64,String?,String?,String?,String?,String?,String?,String,String,String,String,DateTim…,Int64
1,3011076,"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.","O'Ryan, Ellie","1481425730, 1481425749, 9781481425735, 9781481425742",2014.,"Simon Spotlight,","Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction",jcbk,ncrdr,Floating,qna,2017-09-01T00:00:00,1
2,2248846,"Naruto. Vol. 1, Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].","Kishimoto, Masashi, 1974-",1569319006,"2003, c1999.","Viz,","Ninja Japan Comic books strips etc, Comic books strips etc Japan Translations into English, Graphic novels",acbk,nycomic,,lcy,2017-09-01T00:00:00,1
3,3209270,"Peace, love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.","Scott, Jerry, 1955-","144945867X, 9781449458676",2014.,"Andrews McMeel Publishing,","Duncan Jeremy Fictitious character Comic books strips etc, Teenagers United States Comic books strips etc, Parent and teenager Comic books strips etc, Families Comic books strips etc, Comic books strips etc, Comics Graphic works, Humorous comics",acbk,nycomic,,bea,2017-09-01T00:00:00,1
4,1907265,The Paris pilgrims : a novel / Clancy Carlile.,"Carlile, Clancy, 1930-",0786706155,c1999.,"Carroll & Graf,","Hemingway Ernest 1899 1961 Fiction, Biographical fiction, Historical fiction",acbk,cafic,,cen,2017-09-01T00:00:00,1
5,1644616,"Erotic by nature : a celebration of life, of love, and of our wonderful bodies / edited by David Steinberg.",missing,094020813X,"1991, c1988.","Red Alder Books/Down There Press,","Erotic literature American, American literature 20th century",acbk,canf,,cen,2017-09-01T00:00:00,1
6,1736505,Children of Cambodia's killing fields : memoirs by survivors / compiled by Dith Pran ; introduction by Ben Kiernan ; edited by Kim DePaul.,missing,"0300068395, 0300078730",c1997.,"Yale University Press,","Political atrocities Cambodia, Children Cambodia Biography, Cambodia History 1975",acbk,canf,,cen,2017-09-01T00:00:00,1
7,1749492,"Anti-Zionism : analytical reflections / editors: Roselle Tekiner, Samir Abed-Rabbo, Norton Mezvinsky.",missing,091559773X,c1989.,"Amana Books,","Berger Elmer 1908 1996, Zionism Controversial literature",acbk,canf,,cen,2017-09-01T00:00:00,1
8,3270562,Hard-hearted Highlander / Julia London.,"London, Julia","0373789998, 037380394X, 9780373789993, 9780373803941",[2017],"HQN,","Man woman relationships Fiction, Betrothal Fiction, Governesses Fiction, Highlands Scotland Fiction, Romance fiction, Historical fiction",acbk,nanew,,lcy,2017-09-01T00:00:00,1
9,3264577,The Sandcastle Empire / Kayla Olson.,"Olson, Kayla","0062484877, 9780062484871",2017.,"HarperTeen,","Survival Juvenile fiction, Islands Juvenile fiction, Dystopias Juvenile fiction, Fantasy fiction, Young adult fiction",acbk,nynew,,nga,2017-09-01T00:00:00,1
10,3236819,Doctor Who. The return of Doctor Mysterio / BBC ; BBC Wales ; produced by Peter Bennett ; [written] by Steven Moffat ; directed by Ed Bazalgette.,missing,missing,[2017],"BBC Worldwide,","Doctor Fictitious character Drama, Time travel Drama, Human alien encounters Drama, Science fiction television programs, Fiction television programs, Television series, Video recordings for the hearing impaired",acdvd,nadvd,Floating,wts,2017-09-01T00:00:00,2
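
In the type row of the output above, `String?` marks a column that allows missing values, that is `Union{Missing, String}`. Before committing to a multi-minute read, it can help to preview a few rows. The sketch below assumes CSV.jl's `limit` keyword and reads from an in-memory buffer; the three sample rows are abridged from the output above, and `preview` is a made-up name.

```julia
using CSV
using DataFrames

# Abridged sample rows; the empty Author field in the second row is read as missing.
sample = """
BibNum,Title,Author
3011076,A tale of two friends,"O'Ryan, Ellie"
3236819,Doctor Who. The return of Doctor Mysterio,
1907265,The Paris pilgrims,"Carlile, Clancy, 1930-"
"""

# Read only the first two data rows.
preview = CSV.read(IOBuffer(sample), DataFrame; limit = 2)

println(size(preview))
println(ismissing(preview.Author[2]))
```

The same keyword can be passed when reading from a file path, which gives a quick look at column names and inferred types.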


Timing the same read with `@elapsed @time` gives:

244.506704 seconds (300.00 M allocations: 20.825 GiB, 73.34% gc time)
Out[17]:
244.545922116
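
The two numbers above come from two different macros: `@time` prints the report line (time, allocations, GC share), while `@elapsed` returns the elapsed seconds as a value that can be stored. As a hedged sketch, `@timed` can be used to capture the same measurements programmatically; this assumes Julia 1.5 or later, where `@timed` returns a named tuple, and `work` is a stand-in workload rather than the CSV read.

```julia
# Stand-in workload; any expression can be timed the same way.
work() = sum(rand(10^6))

# @time prints the report, @elapsed returns the seconds as a Float64.
seconds = @elapsed @time work()

# @timed returns the measurements instead of printing them.
stats = @timed work()
gc_share = stats.time > 0 ? 100 * stats.gctime / stats.time : 0.0

println("elapsed: ", seconds, " s")
println("allocated: ", stats.bytes, " bytes, gc share: ", round(gc_share; digits = 2), " %")
```

Because `@elapsed` wraps the `@time` call, including its printing, the returned value is slightly larger than the time shown in the report line.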

In [16]:
typeof(big_df)

DataFrame

### Benchmark - Read Write

In [None]:
#using Pkg
#Pkg.add("DataFrames")
#Pkg.add("JSONTables")
#Pkg.add("JLSO")
#Pkg.add("JDF")
#Pkg.add("Arrow")
#Pkg.add("Serialization")
#Pkg.add("StatsPlots")


In [17]:
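using JSONTables       # objecttable, jsontable (used in later cells)
#using DataFrames      # already loaded above
using JLSO             # JLSO.save, JLSO.load (used in later cells)
using JDF
using Arrow
using Serialization    # serialize, deserialize (used in later cells)
#using StatsPlots      # loaded further below, just before plotting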

In [21]:
println("First run")
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdata.csv", big_df)

First run
CSV.jl
674.821815 seconds (2.60 G allocations: 83.748 GiB, 59.96% gc time)


675.136428251

In [None]:
#println("Serialization")
#serializewrite1 = @elapsed @time open(io -> serialize(io, big_df), "bigdata.bin", "w")
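
The `open(io -> serialize(io, ...), path, "w")` pattern in the commented cell above pairs with `open(deserialize, path)` on the read side. Here is a minimal round trip using only the standard library; the named tuple of column vectors is a stand-in for a DataFrame, and `small.bin` is a made-up file name.

```julia
using Serialization

# Stand-in for a table: a named tuple of column vectors.
table = (BibNum = [3011076, 2248846], ItemCount = [1, 1])

path = joinpath(mktempdir(), "small.bin")

# Write: open the file, serialize into the stream, close automatically.
open(io -> serialize(io, table), path, "w")

# Read: open the file and hand the stream to deserialize.
restored = open(deserialize, path)

println(restored == table)
```

Note that this format is tied to the Julia version and the loaded package definitions, so it is better suited to short-lived caches than to long-term storage.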

In [18]:
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdata.jdf", big_df)

JDF.jl
6667.948019 seconds (3.60 M allocations: 17.884 GiB, 89.95% gc time)


6668.017561573

In [None]:
#println("JLSO.jl")
#jlsowrite1 = @elapsed @time JLSO.save("bigdata.jlso", :data => big_df)

In [None]:
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdata.arrow", big_df)

In [None]:
#println("JSONTables.jl arraytable")
#jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, big_df), "bigdata1.json", "w")

#println("JSONTables.jl objecttable")
#jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, big_df), "bigdata.json", "w")

In [None]:
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdata.csv", big_df)

In [None]:
#println("Serialization")
#serializewrite2 = @elapsed @time open(io -> serialize(io, big_df), "bigdata.bin", "w")

In [None]:
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdata.jdf", big_df)

In [None]:
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdata.jlso", :data => big_df)

In [None]:
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdata.arrow", big_df)

In [None]:
#println("JSONTables.jl arraytable")
#jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, big_df), "bigdata1.json", "w")

println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, big_df), "bigdata.json", "w")
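
Each format is timed twice because the first call in a fresh session typically includes compilation, while the second reflects steady-state speed. The repeated cells could be condensed with a small helper. This is only a sketch: `time_twice` and the `timings` dictionary are hypothetical names, and the workloads are stand-ins for the actual write calls.

```julia
# Hypothetical helper: run `f` twice and return both timings in seconds.
function time_twice(f)
    t1 = @elapsed f()   # first call, usually includes compilation
    t2 = @elapsed f()   # second call, already compiled
    return (first = t1, second = t2)
end

# Collect the results by label so they can be plotted later.
timings = Dict{String, NamedTuple}()
timings["sum"] = time_twice(() -> sum(rand(10^6)))
timings["sort"] = time_twice(() -> sort(rand(10^5)))

for (name, t) in timings
    println(name, ": 1st = ", t.first, " s, 2nd = ", t.second, " s")
end
```

With the real benchmark, the closure would wrap a call such as `CSV.write("bigdata.csv", big_df)`.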

In [None]:
using StatsPlots

In [None]:
# Every timing listed here must exist: run (uncomment) the Serialization, JLSO,
# and JSONTables objecttable cells of both runs before plotting.
groupedbar(
    # Exclude JSONTables.jl arraytable due to timing
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
            inner = 2),
    [csvwrite1, csvwrite2, serializewrite1, serializewrite2, jdfwrite1, jdfwrite2,
     jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
    group = repeat(["1st", "2nd"], outer = 6),
    ylab = "Seconds",
    title = "Write Performance\nDataFrame: big_df\nSize: $(size(big_df))"
)
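
The plot call relies on two different uses of `repeat` so that labels, run numbers, and timings line up position by position. A small sketch with three formats shows the pattern; the `formats` vector here is shortened for illustration.

```julia
formats = ["CSV.jl", "JDF.jl", "Arrow.jl"]

# inner = 2 repeats each element in place: one label per (format, run) pair.
labels = repeat(formats, inner = 2)

# outer = n repeats the whole vector: 1st, 2nd, 1st, 2nd, ...
runs = repeat(["1st", "2nd"], outer = length(formats))

for (label, pass) in zip(labels, runs)
    println(label, " => ", pass)
end
```

The vector of timings must follow the same order (first run, second run, per format), otherwise bars end up under the wrong label.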

In [None]:
# One entry per single-file format written above (the JDF folder is added separately below).
data_files = ["bigdata.csv", "bigdata.bin", "bigdata.jlso", "bigdata.arrow", "bigdata.json"]
df = DataFrame(file = data_files, size = getfield.(stat.(data_files), :size))

In [None]:
pwd()

In [None]:
append!(df, DataFrame(file = "bigdata.jdf", size=reduce((x,y)->(x+y.size),
                                                      stat.(joinpath.("bigdata.jdf", readdir("bigdata.jdf"))),
                                                      init=0)))
sort!(df, :size)
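
Because a JDF "file" is a folder, its size has to be computed by adding up the files inside it, which is what the `reduce` call above does. An equivalent formulation, sketched with Base functions only, uses `filesize` and `sum`. It assumes Julia 1.6 or later for the `init` keyword of `sum`; `dir_size` is a hypothetical helper name, and the two dummy files stand in for column files.

```julia
# Total size in bytes of all regular files directly inside `dir`.
dir_size(dir) = sum(filesize, filter(isfile, readdir(dir; join = true)); init = 0)

# Demo on a scratch folder with two dummy "column" files of known size.
dir = mktempdir()
write(joinpath(dir, "col1"), zeros(UInt8, 100))
write(joinpath(dir, "col2"), zeros(UInt8, 50))

println(dir_size(dir))
```

The helper only looks at files directly inside the folder, which matches the flat listing used in the cell above.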

In [None]:
@df df plot(:file, :size/1024^2, seriestype=:bar, title = "Format File Size (MB)", label="Size", ylab="MB")

In [None]:
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdata.csv", DataFrame)

println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdata.bin")

println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdata.jdf") |> DataFrame

println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdata.jlso")

println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdata.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)

#println("JSONTables.jl arraytable")
#jsontablesaread1 = @elapsed @time open(jsontable, "bigdata.json")

println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdata.json")

In [None]:
println("Second run")
println("CSV.jl")
csvread2 = @elapsed @time CSV.read("bigdata.csv", DataFrame)

println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdata.bin")

println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdata.jdf") |> DataFrame

println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdata.jlso")

println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdata.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)

#println("JSONTables.jl arraytable")
#jsontablesaread2 = @elapsed @time open(jsontable, "bigdata.json")

println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdata.json");

In [None]:
# Exclude JSONTables.jl arraytable due to much longer timing
groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", #"JSON\narraytable",
            "JSON\nobjecttable"], inner = 2),
    [csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
     arrowread1, arrowread2, arrowread1+arrowread1copy, arrowread2+arrowread2copy,
     # jsontablesaread1, jsontablesaread2,
     jsontablesoread1, jsontablesoread2],
    group = repeat(["1st", "2nd"], outer = 7),
    ylab = "Seconds",
    title = "Read Performance\nDataFrame: big_df\nSize: $(size(big_df))"
)

Reference: https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb

## Thank You