## Reading a CSV file into a DataFrame in Julia (programming language)
Julia often offers several ways to do the same thing, and reading a CSV file is one example. In all cases, you will need the `CSV` and `DataFrames` packages. If you don't have them installed, run this in the Julia REPL: `import Pkg; Pkg.add("CSV"); Pkg.add("DataFrames")`

In [97]:
import CSV
import DataFrames.DataFrame
using StringEncodings
using Dates

In [2]:
DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

| Row | A | B |
|---|---|---|
| | Int64 | String |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | M |


Calling `CSV.File` returns a `CSV.File` object, which you can iterate over to get `CSV.Row`s. See the complete documentation of the [Julia CSV parser](https://csv.juliadata.org/stable/).

In [3]:
csv_reader = CSV.File("file.csv")
typeof(csv_reader)

CSV.File{false}

In [4]:
for row in csv_reader
    println(typeof(row))
end

CSV.Row
CSV.Row


A great thing is that CSV.jl supports the Tables.jl interface, which lets you easily access column values using dot notation.

In [6]:
for row in csv_reader
    println("first col. value: $(row.col1) ... $(row.col3)")
end

first col. value: A ... 2.0
first col. value: B ... 5.1


## Read CSV option 1
Load the data with `CSV.File` (the Julia reader) and pass the result to the `DataFrame` constructor.

`DataFrame(CSV.File("file.csv"); kwargs)`

By default, `CSV.File` tries to detect the delimiter from the first 10 lines and falls back to `","` if it cannot. Note that keyword arguments (kwargs) are separated from positional arguments by a semicolon `;` by convention, though a comma `,` works too.

In [19]:
DataFrame(CSV.File("file.csv"; delim=","))

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | A | 12 | 2.0 |
| 2 | B | 22 | 5.1 |
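To illustrate the two ways of separating keyword arguments, here is a small sketch using semicolon-delimited data held in memory (the data and the variable names are assumptions made for the example, not part of the original notebook). The delimiter is passed explicitly so the result does not depend on auto-detection.

```julia
using CSV, DataFrames

# Semicolon-delimited sample data, inlined for a self-contained example.
data = "col1;col2;col3\nA;12;2.0\nB;22;5.1\n"

# Keyword arguments after a semicolon (the usual convention) ...
df_semicolon = DataFrame(CSV.File(IOBuffer(data); delim=';'))

# ... or after a comma; both calls pass the same keyword argument.
df_comma = DataFrame(CSV.File(IOBuffer(data), delim=';'))

println(df_semicolon == df_comma)
```

A fresh `IOBuffer` is created for each call because reading consumes the buffer.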


## Read CSV option 2
You can also read the CSV file and pipe the result into the `DataFrame` constructor.

In [20]:
CSV.File("file.csv", header=1) |> DataFrame

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | A | 12 | 2.0 |
| 2 | B | 22 | 5.1 |


## Read CSV option 3
To be similar to other languages, there is also `CSV.read(file, DataFrame; kwargs...)`, a shorthand that reads the file and passes the result to the given sink (here `DataFrame`).

In [23]:
CSV.read("file.csv", DataFrame; ignoreemptylines=true)

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | A | 12 | 2.0 |
| 2 | B | 22 | 5.1 |
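The three options should produce the same `DataFrame`. A quick way to convince yourself is the sketch below, which uses in-memory data assumed to mirror `file.csv` (the variable names are illustrative only).

```julia
using CSV, DataFrames

# Assumed to mirror the contents of `file.csv`.
data = "col1,col2,col3\nA,12,2.0\nB,22,5.1\n"

df1 = DataFrame(CSV.File(IOBuffer(data)))      # option 1: constructor call
df2 = CSV.File(IOBuffer(data)) |> DataFrame    # option 2: pipe
df3 = CSV.read(IOBuffer(data), DataFrame)      # option 3: CSV.read with a sink

println(df1 == df2 == df3)
```

Which form to use is mostly a matter of taste; `CSV.read` is the shortest when you know you want a `DataFrame` right away.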


## Encoding
You can handle non-UTF-8 encodings by passing the encoding to the `read` method. Encodings are provided by [StringEncodings.jl](https://github.com/JuliaStrings/StringEncodings.jl), which must first be loaded with `using StringEncodings`.

In [49]:
# the letters in column `col1` are encoded in windows-1250, so reading them as UTF-8 garbles them
DataFrame(CSV.File("file_encoding.csv"))

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | `\xc8` | 12 | 2.0 |
| 2 | `\xf8` | 22 | 5.1 |


To create an instance of the `Encoding` type, you can use the `enc` string prefix in front of the encoding's name.

In [5]:
DataFrame(CSV.File(read("file_encoding.csv", enc"windows-1250")))

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | Č | 12 | 2.0 |
| 2 | ř | 22 | 5.1 |


In [7]:
# using full syntax
DataFrame(CSV.File(read("file_encoding.csv", Encoding("windows-1250"))))

| Row | col1 | col2 | col3 |
|---|---|---|---|
| | String | Int64 | Float64 |
| 1 | Č | 12 | 2.0 |
| 2 | ř | 22 | 5.1 |
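If you want to experiment with encodings without an encoded file at hand, the following sketch simulates one in memory. It relies on the `encode` and `decode` functions from StringEncodings.jl; the sample text and variable names are assumptions made for illustration, and this is an alternative route to the `read(..., enc"...")` calls above rather than a replacement for them.

```julia
using CSV, DataFrames, StringEncodings

utf8_text = "col1,col2\nČ,12\nř,22\n"

# Simulate the bytes of a windows-1250 file (Č becomes the single byte 0xc8).
raw_bytes = encode(utf8_text, enc"windows-1250")

# Convert the bytes back to a regular (UTF-8) Julia string before parsing.
decoded_text = decode(raw_bytes, enc"windows-1250")

df = CSV.read(IOBuffer(decoded_text), DataFrame)
println(df)
```

The key point is that CSV.jl expects UTF-8 input, so the conversion has to happen before the parser sees the data.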


## CSV.File parameters
Often you don't have a perfectly clean CSV file, and you have to set a few parameters of the reader. You can read about all the parameters in the [CSV.jl documentation](https://csv.juliadata.org/stable/). Key parameters are:
* `delim` - the delimiter (separator), which can be char or string.
* `header` - which row is the header, or a list of names used to rename the columns
* `select` - include only these columns (identified by int, string or symbol)
* `drop` - inverse of `select`, which columns to drop
* `skipto` or `datarow` - the row at which the data start
* `footerskip` - rows to skip at the end of the file
* `limit` - how many rows to read (only reliable if `threaded=false`)
* `type` - single type to use to parse whole file
* `types` - specify type for each column

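A few other parameters, mainly related to error handling, are described in the [CSV.jl documentation](https://csv.juliadata.org/stable/). Parameter names have changed between CSV.jl releases (`datarow`, for instance, is the older name for `skipto`), so check the documentation for the version you have installed.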

In [48]:
data = """

c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

CSV.read(IOBuffer(data), DataFrame; delim='|', skipto=2, quotechar='"')

| Row | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| | String | Int64 | String | Float64 |
| 1 | 1 | 2 | c | 1.5 |
| 2 | C\|D | 16 | x | 2.33 |
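As a further illustration of the parameters listed above, here is a small sketch combining `header`, `skipto` and `limit`. The sample data (a header row, a units row to be skipped, then three data rows) and all names are invented for the example.

```julia
using CSV, DataFrames

# Row 1 is the header, row 2 holds units we do not want, rows 3-5 are data.
data = "name|score\n-|points\nAnn|3\nBob|5\nCid|8\n"

df = CSV.read(IOBuffer(data), DataFrame;
    delim='|',   # explicit delimiter
    header=1,    # column names come from row 1
    skipto=3,    # data start at row 3, so the units row is skipped
    limit=2)     # read at most two data rows

println(df)
```

Because the units row is never parsed, `score` can still be detected as an integer column.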


### Select parameter

In [24]:
# select only the second column (Julia indexing starts at 1)
CSV.read(IOBuffer(data), DataFrame;select=[2])

| Row | c2 |
|---|---|
| | Int64 |
| 1 | 2 |
| 2 | 16 |


In [25]:
# select using string
CSV.read(IOBuffer(data), DataFrame;select=["c2","c4"])

| Row | c2 | c4 |
|---|---|---|
| | Int64 | Float64 |
| 1 | 2 | 1.5 |
| 2 | 16 | 2.33 |


In [50]:
# select using symbols (symbols identify the columns and are written with a leading : in Julia)
CSV.read(IOBuffer(data), DataFrame;select=[:c1,:c2])

| Row | c1 | c2 |
|---|---|---|
| | String | Int64 |
| 1 | 1 | 2 |
| 2 | C\|D | 16 |


In [53]:
# select using symbols created with the Symbol constructor
CSV.read(IOBuffer(data), DataFrame;select=[Symbol("c3"),:c2])

| Row | c2 | c3 |
|---|---|---|
| | Int64 | String |
| 1 | 2 | c |
| 2 | 16 | x |


### Drop parameter
Drop is the opposite of select. It says which columns to drop.

In [49]:
CSV.read(IOBuffer(data), DataFrame;drop=[Symbol("c3"),:c2])

| Row | c1 | c4 |
|---|---|---|
| | String | Float64 |
| 1 | 1 | 1.5 |
| 2 | C\|D | 2.33 |
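Since `drop` is the inverse of `select`, selecting two of the four columns should give the same result as dropping the other two. A minimal sketch that checks this on the sample data (the variable names `kept` and `dropped` are illustrative):

```julia
using CSV, DataFrames

data = """
c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

# Keep c1 and c4 ...
kept = CSV.read(IOBuffer(data), DataFrame; delim='|', select=[:c1, :c4])

# ... or, equivalently, drop c2 and c3.
dropped = CSV.read(IOBuffer(data), DataFrame; delim='|', drop=[:c2, :c3])

println(kept == dropped)
```

Use whichever list is shorter for the file at hand.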


### Header parameter
With `header` you can specify:

* which row is the header (it can span more than one row, e.g. the range `1:2`)
* that there is no header
* new names for the columns

In [43]:
data = """c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

# specify that header is on the first row
CSV.read(IOBuffer(data), DataFrame; header=1)

| Row | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| | String | Int64 | String | Float64 |
| 1 | 1 | 2 | c | 1.5 |
| 2 | C\|D | 16 | x | 2.33 |


In [44]:
# if there's no header, arbitrary names will be created
CSV.read(IOBuffer(data), DataFrame; header=0)

| Row | Column1 | Column2 | Column3 | Column4 |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | c1 | c2 | c3 | c4 |
| 2 | 1 | 2 | c | 1.5 |
| 3 | C\|D | 16 | x | 2.33 |


In [47]:
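# use header to rename the columns; every row of the file, including its own header rows, is then read as data
# (the output below was produced with the four-row `data` defined in cell In [45])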
CSV.read(IOBuffer(data), DataFrame; header=["A","B","C","D"])

| Row | A | B | C | D |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | c | c | c | c |
| 2 | 1 | 2 | 3 | 4 |
| 3 | 1 | 2 | c | 1.5 |
| 4 | C\|D | 16 | x | 2.33 |
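In the output above, the original header rows ended up as data, which also turned every column into `String`. One way to avoid that, sketched below under the assumption that the file has a single header row, is to combine the new names with `skipto=2` so that parsing starts after the old header.

```julia
using CSV, DataFrames

data = """
c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

# New names replace the header; skipto=2 skips the file's own header row.
df = CSV.read(IOBuffer(data), DataFrame;
    delim='|',
    header=["A", "B", "C", "D"],
    skipto=2)

println(df)
```

With the old header row out of the way, numeric columns such as `B` are again detected as numbers.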


In [45]:
data = """
c|c|c|c
1|2|3|4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

CSV.read(IOBuffer(data), DataFrame; header=1:2, skipto=3)

| Row | c_1 | c_2 | c_3 | c_4 |
|---|---|---|---|---|
| | String | Int64 | String | Float64 |
| 1 | 1 | 2 | c | 1.5 |
| 2 | C\|D | 16 | x | 2.33 |


### Type and Types
* `type` - set the same type to all columns
* `types` - Vector or Dict of types

In [60]:
# type turns all the columns to the same type
CSV.read(IOBuffer(data), DataFrame; type=String)

| Row | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| | String | String | String | String |
| 1 | 1 | 2 | c | 1.5 |
| 2 | C\|D | 16 | x | 2.33 |


In [63]:
data = """c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

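# specify the types of selected columns; c4 holds decimals (1.5, 2.33), so forcing Int64 fails:
# CSV emits warnings and fills the column with missing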
CSV.read(IOBuffer(data), DataFrame; types=Dict(:c2=>String, :c4=>Int64))

│ ", error=INVALID: OK | NEWLINE | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606
│ ", error=INVALID: OK | NEWLINE | EOF | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606


| Row | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| | String | String | String | Int64? |
| 1 | 1 | 2 | c | missing |
| 2 | C\|D | 16 | x | missing |


In [67]:
# you can silence the warnings using `silencewarnings`
CSV.read(IOBuffer(data), DataFrame; types=Dict("c2"=>String, "c4"=>Int64), silencewarnings=true)

| Row | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| | String | String | String | Int64? |
| 1 | 1 | 2 | c | missing |
| 2 | C\|D | 16 | x | missing |
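The `missing` values above come from asking for `Int64` in a column that contains decimals. For comparison, here is a sketch where the requested types match the data, so no values are lost (same sample data; the variable name `df` is illustrative).

```julia
using CSV, DataFrames

data = """
c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

# c2 is kept as text, c4 is parsed as a floating-point number.
df = CSV.read(IOBuffer(data), DataFrame;
    delim='|',
    types=Dict(:c2 => String, :c4 => Float64))

println(df)
```

Columns not mentioned in the `Dict` keep their automatically detected types.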


### Date Formats

In [92]:
data = """c1|c2|c3|c4|d1
"XY"|2|c|1.5|2020-01-05
"AB"|16|x|2.33|2021-01-05
"""

CSV.read(IOBuffer(data), DataFrame; 
    dateformat="yyyy-mm-dd")

| Row | c1 | c2 | c3 | c4 | d1 |
|---|---|---|---|---|---|
| | String | Int64 | String | Float64 | Date… |
| 1 | XY | 2 | c | 1.5 | 2020-01-05 |
| 2 | AB | 16 | x | 2.33 | 2021-01-05 |


In [200]:
data = """c1|c2|c3|c4|d1|d2
"XY"|2|c|1.5|2020-01-05|01/12/20
"AB"|16|x|2.33|2021-01-05|15/10/20
"""

# specify that columns are dates and then specify the dateformat
df = CSV.read(IOBuffer(data), DataFrame; 
    types=Dict("d1"=>Date, "d2"=>Date), 
    dateformats=Dict(
        "d1"=>"yyyy-mm-dd",
        "d2"=>"dd/mm/yy"
    )
)
df

| Row | c1 | c2 | c3 | c4 | d1 | d2 |
|---|---|---|---|---|---|---|
| | String | Int64 | String | Float64 | Date | Date |
| 1 | XY | 2 | c | 1.5 | 2020-01-05 | 0020-12-01 |
| 2 | AB | 16 | x | 2.33 | 2021-01-05 | 0020-10-15 |


In [201]:
# add 2000 years to the column d2, which contains 0020-MM-DD
# careful: run this only once, since every run adds another 2000 years (both df[:, :d2] and df[!, :d2] modify the column)
df[!, :d2] += Dates.Year(2000)
df

| Row | c1 | c2 | c3 | c4 | d1 | d2 |
|---|---|---|---|---|---|---|
| | String | Int64 | String | Float64 | Date | Date |
| 1 | XY | 2 | c | 1.5 | 2020-01-05 | 2020-12-01 |
| 2 | AB | 16 | x | 2.33 | 2021-01-05 | 2020-10-15 |
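The two-digit-year correction can also be written so that running it twice is harmless. The sketch below uses only the `Dates` standard library on the raw strings from column `d2`; the helper `fix_century` is a hypothetical name introduced here, and the cutoff (years below 100 are shifted by 2000) is an assumption that suits this data but not every dataset.

```julia
using Dates

raw = ["01/12/20", "15/10/20"]          # day/month/two-digit year, as in column d2
fmt = dateformat"dd/mm/yy"

parsed = [Date(s, fmt) for s in raw]    # two-digit years are read literally (year 20)

# Hypothetical helper: shift only dates whose year is still below 100,
# so applying it a second time changes nothing.
fix_century(d::Date) = year(d) < 100 ? d + Year(2000) : d

fixed = fix_century.(parsed)
println(fixed)
```

On a data frame, the same helper could be applied with `df.d2 = fix_century.(df.d2)`.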


## Speed

In [228]:
path = "/home/vaclav/Data/Kaggle/EEE-CIS_Fraud_Detection/train_transaction.csv"

"/home/vaclav/Data/Kaggle/EEE-CIS_Fraud_Detection/train_transaction.csv"

In [234]:
%%timeit

The analogue of IPython's `%time statement` (also `%timeit`) in Julia is `@time statement`.  The analogue of `%%time ...code...` is

```
@time begin
    ...code...
end
```

Note, however, that you should put all performance-critical code into a function, avoiding global variables, before doing performance measurements in Julia; see the [performance tips in the Julia manual](http://docs.julialang.org/en/latest/manual/performance-tips/).

The `@time` macro prints the timing results, and returns the value of evaluating the expression.  To instead return the time (in seconds), use `@elapsed statement`.

For more extensive benchmarking tools, including the ability to collect statistics from multiple runs, see the [BenchmarkTools package](https://github.com/JuliaCI/BenchmarkTools.jl).
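Following the advice above, a timing measurement might look like the sketch below. It wraps the read in a function, calls it once to get compilation out of the way, and then uses `@elapsed`. The helper name `load_csv` and the generated temporary file are assumptions made so the snippet is self-contained; substitute your own path.

```julia
using CSV, DataFrames

# Hypothetical helper: wrapping the call in a function keeps it out of global scope.
load_csv(path) = CSV.read(path, DataFrame)

# Write a small throwaway CSV file so the sketch runs anywhere.
path, io = mktemp()
println(io, "id,value")
for i in 1:10_000
    println(io, i, ",", i / 2)
end
close(io)

load_csv(path)                      # first call includes compilation time
seconds = @elapsed load_csv(path)   # second call measures mostly the read itself
println("read took $(seconds) s")
```

For statistically meaningful numbers, the BenchmarkTools package mentioned above is the better tool; a single `@elapsed` is only a rough indication.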


In [233]:
@time begin
    CSV.read(path, DataFrame)
end

  8.834676 seconds (3.62 M allocations: 2.048 GiB, 2.77% gc time)


| Row | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64 | String | Int64 | Float64? |
| 1 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | missing |
| 2 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 |
| 3 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 |
| 4 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 |
| 5 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 |
| 6 | 2987005 | 0 | 86510 | 49.0 | W | 5937 | 555.0 |
| 7 | 2987006 | 0 | 86522 | 159.0 | W | 12308 | 360.0 |
| 8 | 2987007 | 0 | 86529 | 422.5 | W | 12695 | 490.0 |
| 9 | 2987008 | 0 | 86535 | 15.0 | H | 2803 | 100.0 |
| 10 | 2987009 | 0 | 86536 | 117.0 | W | 17399 | 111.0 |
