<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Benchmarking Read and Write Performance Across File Formats</p><br>

**_DataFrames_** is a Julia library for tabular data manipulation. It offers data exploration, cleaning, and transformation operations that are critical when working with data in Julia, similar to pandas in Python and to data.table and dplyr in R. DataFrames.jl works well with a range of file formats, such as CSV (using CSV.jl), Apache Arrow (using Arrow.jl), Stata, SPSS, and SAS files (using StatFiles.jl), and Parquet (using Parquet.jl).


## Advanced Topics: Working with Other File Formats

Let us compare read and write performance using a very large file.

XML, JSON, BSON, YAML, MessagePack, and protobuf are some commonly used data serialization formats.

JDF is a serialization format supported by Julia. JDF stores a DataFrame in a folder, with each column stored as a separate file. There is also a metadata.jls file that stores metadata about the original DataFrame. Collectively, the column files, the metadata file, and the folder are called a JDF "file".

JDF.jl is a pure-Julia solution, which makes it possible to do nifty things, such as compression and encapsulating the underlying structure of the arrays, that are hard to do in R and Python. For example, Python's numpy arrays are C objects, but all the vector types used in JDF are Julia data types.

JDF is a DataFrames serialization format with the following goals:

* Fast save and load times
* Compressed storage on disk
* Enable disk-based data manipulation (not yet achieved; from v0.4.0)
* Supports machine learning workloads, e.g. mini-batch, sampling (not yet achieved; from v0.4.0)


More here: https://github.com/xiaodaigh/JDF.jl
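
The folder layout described above can be seen with a small round trip. This is a minimal sketch, assuming JDF.jl is installed and exposes `JDF.save` and `JDF.load` as in its recent README (older releases used `savejdf` and `loadjdf`); the toy DataFrame and the `demo.jdf` path are made up for illustration.

```julia
using DataFrames
using JDF  # assumption: a recent JDF.jl release with JDF.save / JDF.load

# Toy data standing in for a real table (hypothetical columns)
df = DataFrame(id = 1:5, title = ["a", "b", "c", "d", "e"])

# A JDF "file" is a folder; write it into a temporary directory
path = joinpath(mktempdir(), "demo.jdf")
JDF.save(path, df)

# The folder holds the column files plus the metadata file
println(readdir(path))

# Load the folder back into a DataFrame
df2 = DataFrame(JDF.load(path))
println(df2 == df)
```

Because the result is a folder rather than a single file, tools that report the size of one file will not give the full on-disk size of a JDF "file".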

## Data: Seattle Library Collection Inventory

This dataset includes a monthly snapshot of all of the physical items in the Seattle Public Library’s collection. Consistent monthly data begins with a snapshot taken August 1, 2016, continuing to the present. Additionally, this dataset contains snapshots taken on January 1 in the years 2012, 2013, 2014, and 2016.

* Approx. 12 GB of data from August 2016 to November 2019.
* Dimensions of the data: 35,531,308 × 13 (approx. 35.5 million observations across 13 variables)

This notebook compares the read and write times for a large file. The data was taken from https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory


Memory management in R: http://adv-r.had.co.nz/memory.html

## Importing libraries

In [1]:
#using Pkg
#Pkg.add("DataFrames")
#Pkg.add("CSV") ## To load a file
#Pkg.add("TableView") ## optional - to render a table
#Pkg.add("WebIO")
#Pkg.add("DataFramesMeta")
#Pkg.add("JSONTables")
#Pkg.add("JLSO")
#Pkg.add("JDF")
#Pkg.add("Arrow")
#Pkg.add("Serialization")
#Pkg.add("StatsPlots")
#Pkg.add("BenchmarkTools")

In [2]:
using CSV
using DataFrames
#using BenchmarkTools

## General settings 

By default, the notebook display is 80 characters wide (which limits how many columns are shown) and 30 lines tall (which limits how many rows are shown).

* Check the default number of rows and columns that are displayed.
* Change the default setting.
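
Both steps can be sketched with Base Julia alone. This is a minimal sketch, assuming output goes to a non-terminal display such as a notebook, where the zero-argument `displaysize()` reads the `LINES` and `COLUMNS` environment variables; the values 60 and 200 are arbitrary.

```julia
# Remember the current values so they can be restored afterwards
old_cols  = get(ENV, "COLUMNS", "80")
old_lines = get(ENV, "LINES", "30")

# Widen and lengthen the display area (arbitrary example values)
ENV["COLUMNS"] = "200"
ENV["LINES"]   = "60"

# displaysize() with no arguments reports (lines, columns) from ENV
new_size = displaysize()
println(new_size)

# Restore the previous settings
ENV["COLUMNS"] = old_cols
ENV["LINES"]   = old_lines
```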

In [3]:
ENV["COLUMNS"]

"80"

In [4]:
ENV["LINES"]

"30"

In [5]:
# Home directory of the current user (pwd() is the Julia counterpart of Python's os.getcwd())
homedir()

"/Users/Rahul"

In [6]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

In [7]:
cd("../")

In [8]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code"

In [9]:
cd("./Julia_Practice")

In [10]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

## Read the dataset, which is in csv format

Pandas has many read_* functions to read data from multiple data sources or formats such as JSON, SQL databases, Excel, and pickle (Python serialized objects). In Julia, the equivalent readers live in separate packages; CSV.jl is used here.
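
Before loading a multi-gigabyte CSV in full, it can help to preview a few rows and columns. The sketch below assumes CSV.jl's `limit` and `select` keyword arguments; the tiny file and its column names are stand-ins written to a temporary directory, not the real dataset.

```julia
using CSV
using DataFrames

# Write a tiny stand-in CSV (hypothetical contents)
path = joinpath(mktempdir(), "sample.csv")
write(path, "BibNum,Title,ItemCount\n1,A,2\n2,B,1\n3,C,5\n4,D,1\n")

# Parse only the first 2 rows and only two of the columns
preview = CSV.read(path, DataFrame; limit = 2, select = [:BibNum, :ItemCount])
println(preview)
```

Restricting rows and columns this way keeps memory use small while checking that column names and types are parsed as expected.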

In [11]:
ENV["COLUMNS"] = 1000

1000

In [12]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts/Julia/ML_using_Julia/Julia_Code/Julia_Practice"

In [13]:
cd("../../../../")

In [14]:
pwd()

"/Users/Rahul/Documents/Rahul Office/IIMB/Concepts"

In [15]:
Threads.nthreads()

6

In [None]:
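# Note: CSV.jl 0.9 and later name this keyword `ntasks`; CSV.jl 0.8 called it `tasks`
big_df = @time CSV.read("./R/seattle-library-collection-inventory/library-collection-inventory.csv", DataFrame, ntasks=Threads.nthreads())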

first(big_df, 6)

Timing reported by `@time` for the read above:

* 201.937353 seconds (269.14 M allocations: 19.501 GiB, 62.60% gc time)
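
`@time` prints wall time, allocations, and the share spent in garbage collection, while `@elapsed` returns only the elapsed seconds as a number that can be stored and plotted. The first call of any function also includes compilation, which is why the benchmarks in this notebook record a first and a second run. A minimal sketch with a made-up workload (`work` is hypothetical and stands in for an expensive read):

```julia
# Hypothetical workload standing in for an expensive operation
work(n) = sum(sqrt(i) for i in 1:n)

# First call: elapsed time includes compiling `work`
first_run = @elapsed work(10^6)

# Second call: the compiled code is reused
second_run = @elapsed work(10^6)

println("first: ", first_run, " s, second: ", second_run, " s")

# The two macros can be combined: @time prints details, @elapsed returns seconds
third_run = @elapsed @time work(10^6)
```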

In [None]:
typeof(big_df)

### Benchmark - Read Write

In [None]:
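println("First run")
println("CSV.jl")
# @elapsed returns the run time in seconds; @time prints timing and allocation details
csvwrite1 = @elapsed @time CSV.write("bigdata.csv", big_df)

In [None]:
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdata.csv", big_df)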

In [None]:
using StatsPlots

In [None]:
groupedbar(
    repeat(["CSV.jl"], inner = 2),
    [csvwrite1, csvwrite2],
    group = ["1st", "2nd"],  # one label per timing value
    ylab = "Seconds",
    title = "Write Performance\nDataFrame: big_df\nSize: $(size(big_df))"
)

In [None]:
data_files = ["bigdata.csv"]
df = DataFrame(file = data_files, size = filesize.(data_files))

In [None]:
#append!(df, DataFrame(file = "bigdata.jdf", size=reduce((x,y)->(x+y.size),
                                                      #stat.(joinpath.("bigdata.jdf", readdir("bigdata.jdf"))),
                                                      #init=0)))
#sort!(df, :size)
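
Since a JDF "file" is a folder, its size has to be computed by summing the files inside it, which is what the commented-out cell above does with `reduce`. A minimal Base-only sketch of the same idea; `folder_size` is a hypothetical helper, and the temporary folder with two dummy files stands in for a real JDF folder.

```julia
# Total size in bytes of the files directly inside `dir` (not recursive)
folder_size(dir) = mapreduce(filesize, +, joinpath.(dir, readdir(dir)); init = 0)

# Stand-in folder with two dummy "column" files of known size
dir = mktempdir()
write(joinpath(dir, "col1.bin"), zeros(UInt8, 100))
write(joinpath(dir, "col2.bin"), zeros(UInt8, 250))

total = folder_size(dir)
println(total, " bytes")
```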

In [None]:
@df df plot(:file, :size/1024^2, seriestype=:bar, title = "Format File Size (MB)", label="Size", ylab="MB")

In [None]:
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdata.csv", DataFrame)

In [None]:
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdata.csv", DataFrame)

In [None]:
# JSONTables (arraytable) is excluded because it takes much longer
groupedbar(
    repeat(["CSV.jl"], inner = 2),
    [csvread1, csvread2],
    group = ["1st", "2nd"],  # one label per timing value
    ylab = "Seconds",
    title = "Read Performance\nDataFrame: big_df\nSize: $(size(big_df))"
)

See also: https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb

## Thank You