## Reading a CSV file into a DataFrame in Julia (programming language)
Julia often offers several ways to do the same thing, and reading a CSV file is one example. In all cases, you will need the `CSV` and `DataFrames` packages. If you don't have them installed, run this in the Julia REPL: `import Pkg; Pkg.add("CSV"); Pkg.add("DataFrames")`

In [2]:
VERSION

v"1.4.1"

In [3]:
# load the libraries; you can also use `import`, but the behavior differs slightly:
# for example, you would then have to refer to DataFrames.DataFrame
using CSV
using DataFrames
using StringEncodings
using Dates

In [4]:
DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

Unnamed: 0_level_0,A,B
Unnamed: 0_level_1,Int64,String
1,1,M
2,2,F
3,3,F
4,4,M


Calling `CSV.File` returns a `CSV.File` object, which you can iterate over to get `CSV.Row`s. See the complete documentation of the [Julia CSV parser](https://csv.juliadata.org/stable/).

In [5]:
csv_reader = CSV.File("file.csv")
println(typeof(csv_reader))

CSV.File{false}


In [6]:
for row in csv_reader
    println(typeof(row))
end

CSV.Row
CSV.Row


A great thing is that CSV.jl supports the Tables.jl interface, which lets you easily access the columns using dot notation.

In [7]:
for row in csv_reader
    println("values: $(row.col1), $(row.col2), $(row.col3)")
end

values: A, 12, 2.0
values: B, 22, 5.1


The `header` parameter of `CSV.File` defaults to `1` (in Julia this means the first row), and the delimiter defaults to `","`. You can always specify both explicitly. By convention, keyword arguments (kwargs) are passed after a semicolon.

In [8]:
CSV.File("file.csv"; header=1, delim=",")

2-element CSV.File{false}:
 CSV.Row: (col1 = "A", col2 = 12, col3 = 2.0)
 CSV.Row: (col1 = "B", col2 = 22, col3 = 5.1)

A comma before the keyword arguments is also accepted.

In [9]:
CSV.File("file.csv", header=1, delim=",")

2-element CSV.File{false}:
 CSV.Row: (col1 = "A", col2 = 12, col3 = 2.0)
 CSV.Row: (col1 = "B", col2 = 22, col3 = 5.1)
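
The two call styles above are a general Julia feature, not something specific to CSV.jl. The sketch below uses a made-up function, `show_options` (hypothetical, defined only for this illustration), to show that both forms pass the same keyword arguments, and that the semicolon is needed when the options are splatted from a `NamedTuple`.

```julia
# Hypothetical stand-in for a reader function; it only echoes its arguments.
show_options(path; header=1, delim=",") = "path=$path header=$header delim=$delim"

a = show_options("file.csv"; header=1, delim=",")   # semicolon form
b = show_options("file.csv", header=1, delim=",")   # comma form
println(a == b)                                     # true

# Options collected in a NamedTuple can be splatted after the semicolon.
opts = (header=2, delim="|")
c = show_options("file.csv"; opts...)
println(c)                                          # path=file.csv header=2 delim=|
```

Collecting the options in one place like this can be handy when the same reader settings are reused for several files.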

## Read CSV option 1
Load the data with `CSV.File` (the Julia reader) and pass the result to the `DataFrame` constructor.

`DataFrame(CSV.File("file.csv"; kwargs...))`

By default, `CSV.File` will try to detect the delimiter from the first 10 lines. The default is `","`. Notice that keyword arguments (kwargs) are separated by a semicolon `;` by convention, though a comma `,` works too.

In [10]:
df = DataFrame(CSV.File("file.csv"))

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,A,12,2.0
2,B,22,5.1


## Read CSV option 2
You can also read the CSV file and pipe the result to the `DataFrames.DataFrame` constructor.

In [11]:
df = CSV.File("file.csv") |> DataFrame

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,A,12,2.0
2,B,22,5.1


## Read CSV option 3
To look similar to other languages, there is also `CSV.read(file, DataFrame; kwargs)`, which is syntactic sugar for the options above.

In [12]:
df = CSV.read("file.csv", DataFrame; ignoreemptylines=true)

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,A,12,2.0
2,B,22,5.1


Some older tutorials show `CSV.read()` without the `DataFrame` argument, which now leads to an error.

In [13]:
CSV.read("file.csv")

ArgumentError: ArgumentError: provide a valid sink argument, like `using DataFrames; CSV.read(source, DataFrame)`

## Encoding
You can handle non-UTF-8 encodings by using the `read` function and specifying the encoding. Encodings are provided by [StringEncodings.jl](https://github.com/JuliaStrings/StringEncodings.jl), which must first be loaded with `using StringEncodings`.

In [14]:
# letter in the column `col1` are encoded in windows-1250
DataFrame(CSV.File("file_encoding.csv"))

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,\xc8,12,2.0
2,\xf8,22,5.1


To create an instance of the `Encoding` type, you can put the `enc` prefix in front of the encoding's name string.

In [15]:
DataFrame(CSV.File(open(read,"file_encoding.csv", enc"windows-1250")))

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,Č,12,2.0
2,ř,22,5.1


In [16]:
open(read,"file_encoding.csv", enc"windows-1250") |> CSV.File |> DataFrame

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,Č,12,2.0
2,ř,22,5.1


In [17]:
DataFrame(CSV.File(read("file_encoding.csv", enc"windows-1250")))

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,Č,12,2.0
2,ř,22,5.1


In [18]:
# using full syntax
DataFrame(CSV.File(read("file_encoding.csv", Encoding("windows-1250"))))

Unnamed: 0_level_0,col1,col2,col3
Unnamed: 0_level_1,String,Int64,Float64
1,Č,12,2.0
2,ř,22,5.1


## CSV.File parameters
Often you don't have a gold-plated CSV file and you have to set a few parameters of the reader. You can read about all the parameters in the [CSV.jl documentation](https://csv.juliadata.org/stable/). Key parameters are:
* `delim` - the delimiter (separator), which can be a char or a string
* `header` - which row is the header, or a list of names used to rename the columns
* `select` - include only these columns (identified by int, string or symbol)
* `drop` - the inverse of `select`: which columns to drop
* `skipto` or `datarow` - the row at which the data starts
* `footerskip` - number of rows to skip at the end of the file
* `limit` - how many rows to read (only reliable if `threaded=false`)
* `type` - a single type used to parse the whole file
* `types` - the type of each column

There are a few other parameters, mainly related to error handling, described in the [CSV.jl documentation](https://csv.juliadata.org/stable/).
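
As a small sketch of two parameters from the list that are not shown elsewhere in this notebook, the cell below combines `skipto` and `limit`. The data string and its column names are made up for the illustration, and the cell assumes a single-threaded read of a tiny input, where `limit` should be exact.

```julia
using CSV
using DataFrames

# Made-up sample: the header is on line 1, line 2 is a row we want to skip.
sample = """
id,name
0,skipped
1,a
2,b
3,c
"""

# Start reading the data at line 3 and stop after two rows.
part = CSV.read(IOBuffer(sample), DataFrame; skipto=3, limit=2)
println(size(part))
println(part.id)
```

The header is still taken from the first line, because `header` keeps its default value of `1`.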

In [19]:
data = """

c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

CSV.read(IOBuffer(data), DataFrame; delim='|', skipto=2, quotechar='"')

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,c,1.5
2,C|D,16,x,2.33


### Select parameter

In [20]:
# select only the second column (Julia indexing starts at 1)
CSV.read(IOBuffer(data), DataFrame;select=[2])

Unnamed: 0_level_0,c2
Unnamed: 0_level_1,Int64
1,2
2,16


In [21]:
# select using string
CSV.read(IOBuffer(data), DataFrame;select=["c2","c4"])

Unnamed: 0_level_0,c2,c4
Unnamed: 0_level_1,Int64,Float64
1,2,1.5
2,16,2.33


In [22]:
# select using symbol (symbols identify the columns and have : notation in Julia)
CSV.read(IOBuffer(data), DataFrame;select=[:c1,:c2])

Unnamed: 0_level_0,c1,c2
Unnamed: 0_level_1,String,Int64
1,1,2
2,C|D,16


In [23]:
# select using symbols created with the Symbol constructor
CSV.read(IOBuffer(data), DataFrame;select=[Symbol("c3"),:c2])

Unnamed: 0_level_0,c2,c3
Unnamed: 0_level_1,Int64,String
1,2,c
2,16,x


### Drop parameter
Drop is the opposite of select. It says which columns to drop.

In [24]:
CSV.read(IOBuffer(data), DataFrame;drop=[Symbol("c3"),:c2])

Unnamed: 0_level_0,c1,c4
Unnamed: 0_level_1,String,Float64
1,1,1.5
2,C|D,2.33


### Header parameter
With the `header` parameter you can specify:

* which row is the header (it can be more than one row, e.g. the range `1:2`)
* that there is no header
* new names for the columns

We will parse the following data, passing it to the CSV reader through an `IOBuffer`.

In [26]:
data = """
c|c|c|d
1|2|3|4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

"c|c|c|d\n1|2|3|4\n\"1\"|2|c|1.5\n\"C|D\"|16|x|2.33\n"

#### Default header on the first row

In [27]:
# default header is the first row, duplicated column names are postfixed
df = CSV.read(IOBuffer(data), DataFrame)

Unnamed: 0_level_0,c,c_1,c_2,d
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,3,4.0
2,1,2,c,1.5
3,C|D,16,x,2.33


#### No header

In [28]:
# Julia convention is to separate the keyword arguments (kwargs) with a semicolon (;), though a comma works as well
CSV.read(IOBuffer(data), DataFrame; header=0)

Unnamed: 0_level_0,Column1,Column2,Column3,Column4
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


In [29]:
CSV.read(IOBuffer(data), DataFrame; header=false)

Unnamed: 0_level_0,Column1,Column2,Column3,Column4
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


#### Specify own column names using headers

You can pass a vector of column names (as strings or symbols) to specify the headers.

In [30]:
CSV.read(IOBuffer(data), DataFrame; header=["first","second","third","fourth"])

Unnamed: 0_level_0,first,second,third,fourth
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


In [31]:
CSV.read(IOBuffer(data), DataFrame; header=[:a,:b,:c,:d])

Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


In [32]:
CSV.read(IOBuffer(data), DataFrame; header=[Symbol("z"),Symbol("zy"),Symbol("zyx"),Symbol("zyxw")])

Unnamed: 0_level_0,z,zy,zyx,zyxw
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


#### Header on the first row

In [33]:
# Remember that Julia starts indexing at 1
CSV.read(IOBuffer(data), DataFrame; header=1)

Unnamed: 0_level_0,c,c_1,c_2,d
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,3,4.0
2,1,2,c,1.5
3,C|D,16,x,2.33


#### Header on the x-th row
Everything above the header is ignored

In [34]:
CSV.read(IOBuffer(data), DataFrame; header=2)

Unnamed: 0_level_0,1,2,3,4
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,c,1.5
2,C|D,16,x,2.33


#### Multirow header
A multirow header can be specified using a range, for example `1:2`, or a list such as `[1,2]`. Some rows can even be skipped, e.g. `[1,3]`. The column names are the concatenation of the values on these rows.

In [35]:
# no header, as a reminder of what the data looks like
df = CSV.read(IOBuffer(data), DataFrame; header=false)

Unnamed: 0_level_0,Column1,Column2,Column3,Column4
Unnamed: 0_level_1,String,String,String,String
1,c,c,c,d
2,1,2,3,4
3,1,2,c,1.5
4,C|D,16,x,2.33


In [36]:
# headers are formed by concatenating the values of rows 1 to 2
CSV.read(IOBuffer(data), DataFrame; header=1:2)

Unnamed: 0_level_0,c_1,c_2,c_3,d_4
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,c,1.5
2,C|D,16,x,2.33


In [37]:
# the same using the list
CSV.read(IOBuffer(data), DataFrame; header=[1,2])

Unnamed: 0_level_0,c_1,c_2,c_3,d_4
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,c,1.5
2,C|D,16,x,2.33


In [38]:
# skipping row number 2
CSV.read(IOBuffer(data), DataFrame; header=[1,3])

Unnamed: 0_level_0,c_1,c_2,c_c,d_1.5
Unnamed: 0_level_1,String,Int64,String,Float64
1,C|D,16,x,2.33
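
The column names in the outputs above can be reproduced by hand. The following sketch mimics the concatenation in plain Julia; `concat_header` is a hypothetical helper written for this illustration and is not how CSV.jl is implemented internally.

```julia
# The first three rows of the sample data, already split on the `|` delimiter.
row1 = ["c", "c", "c", "d"]
row2 = ["1", "2", "3", "4"]
row3 = ["1", "2", "c", "1.5"]

# Hypothetical helper: join the values found in the same position with "_".
concat_header(rows...) = [join(parts, "_") for parts in zip(rows...)]

names_1_2 = concat_header(row1, row2)
names_1_3 = concat_header(row1, row3)
println(names_1_2)   # ["c_1", "c_2", "c_3", "d_4"]
println(names_1_3)   # ["c_1", "c_2", "c_c", "d_1.5"]
```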


### Type and Types
* `type` - set the same type to all columns
* `types` - Vector or Dict of types for each column

More about [Julia types](https://docs.julialang.org/en/v1/manual/types/)

All examples are based on string input, which is passed to Julia's CSV reader through `IOBuffer`

In [39]:
data = """c1|c2|c3|c4
"1"|2|c|1.5
"C|D"|16|x|2.33
"""

"c1|c2|c3|c4\n\"1\"|2|c|1.5\n\"C|D\"|16|x|2.33\n"

You can set the same type for all columns using `type` parameter, e.g. string

In [40]:
# type turns all the columns to the same type
CSV.read(IOBuffer(data), DataFrame; type=String)

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,String,String,String
1,1,2,c,1.5
2,C|D,16,x,2.33


Or specify the type for each column, or just for some columns, using a Dict. If a value cannot be parsed as the requested type, it becomes `missing`, Julia's counterpart of pandas's `NaN`.

In [41]:
for r in CSV.File(IOBuffer(data), types=Dict(:c2=>String, :c4=>Int64))
    println(r)
end

│ ", error=INVALID: OK | NEWLINE | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606
│ ", error=INVALID: OK | NEWLINE | EOF | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606


CSV.Row:
 :c1  "1"
 :c2  "2"
 :c3  "c"
 :c4  missing
CSV.Row:
 :c1  "C|D"
 :c2  "16"
 :c3  "x"
 :c4  missing


In [42]:
# specify types of the columns
CSV.read(IOBuffer(data), DataFrame; types=Dict(:c2=>String, :c4=>Int64))

│ ", error=INVALID: OK | NEWLINE | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606
│ ", error=INVALID: OK | NEWLINE | EOF | INVALID_DELIMITER 
└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606


Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,String,String,Int64?
1,1,2,c,missing
2,C|D,16,x,missing


You can silence these warnings with `silencewarnings=true`.

In [43]:
# specify types of the columns
CSV.read(IOBuffer(data), DataFrame; types=Dict(:c2=>String, :c4=>Int64), silencewarnings=true)

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,String,String,Int64?
1,1,2,c,missing
2,C|D,16,x,missing
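
Once a column contains `missing`, later computations have to account for it. The sketch below uses only Base Julia on a hand-made vector (an assumption for the illustration; it is not read from the CSV above) to show the usual tools: `ismissing`, `skipmissing` and `coalesce`.

```julia
# A column where some values failed to parse, plus one valid value for contrast.
col = Union{Missing, Int}[missing, 7, missing]

println(ismissing(col[1]))      # true
println(col[1] == 1)            # missing, because comparisons propagate `missing`

total  = sum(skipmissing(col))  # ignore the missing entries
filled = coalesce.(col, 0)      # replace the missing entries with a default
println(total)                  # 7
println(filled)                 # [0, 7, 0]
```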


In [44]:
# specify valid type for columns
CSV.read(IOBuffer(data), DataFrame; types=Dict(:c2=>String, :c4=>Float32))

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,String,String,Float32
1,1,2,c,1.5
2,C|D,16,x,2.33


Or specify types for all columns using a **Vector**

In [45]:
# specify Array with type
types = Array{DataType,1}([String, Int, String, Float64])
CSV.read(IOBuffer(data), DataFrame; types=types)

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,Int64,String,Float64
1,1,2,c,1.5
2,C|D,16,x,2.33


In [46]:
[String, Int32, String, Float32]

4-element Array{DataType,1}:
 String
 Int32
 String
 Float32

In [47]:
# or just pass the array
CSV.read(IOBuffer(data), DataFrame; types=[String, Int32, String, Float32])

Unnamed: 0_level_0,c1,c2,c3,c4
Unnamed: 0_level_1,String,Int32,String,Float32
1,1,2,c,1.5
2,C|D,16,x,2.33


### Date Formats

In [48]:
data = """c1|c2|c3|c4|d1
"XY"|2|c|1.5|2020-01-05
"AB"|16|x|2.33|2021-01-05
"""

CSV.read(IOBuffer(data), DataFrame; 
    dateformat="yyyy-mm-dd")

Unnamed: 0_level_0,c1,c2,c3,c4,d1
Unnamed: 0_level_1,String,Int64,String,Float64,Date
1,XY,2,c,1.5,2020-01-05
2,AB,16,x,2.33,2021-01-05


In [49]:
data = """c1|c2|c3|c4|d1|d2
"XY"|2|c|1.5|2020-01-05|01/12/20
"AB"|16|x|2.33|2021-01-05|15/10/20
"""

# specify that columns are dates and then specify the dateformat
df = CSV.read(IOBuffer(data), DataFrame; 
    types=Dict("d1"=>Date, "d2"=>Date), 
    dateformats=Dict(
        "d1"=>"yyyy-mm-dd",
        "d2"=>"dd/mm/yy"
    )
)
df

Unnamed: 0_level_0,c1,c2,c3,c4,d1,d2
Unnamed: 0_level_1,String,Int64,String,Float64,Date,Date
1,XY,2,c,1.5,2020-01-05,0020-12-01
2,AB,16,x,2.33,2021-01-05,0020-10-15


In [50]:
# add 2000 years to the column d2 containing 0020-MM-DD
# be careful to run this only once, since both df[:, :d2] and df[!, :d2] modify the column
df[!, :d2] += Dates.Year(2000)
df

Unnamed: 0_level_0,c1,c2,c3,c4,d1,d2
Unnamed: 0_level_1,String,Int64,String,Float64,Date,Date
1,XY,2,c,1.5,2020-01-05,2020-12-01
2,AB,16,x,2.33,2021-01-05,2020-10-15
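
The `0020` year comes from the `Dates` standard library itself, not from the CSV reader. A minimal sketch on two hand-written strings (copied from the `d2` column above) shows the same parsing and the same correction without touching a DataFrame:

```julia
using Dates

raw = ["01/12/20", "15/10/20"]
fmt = dateformat"dd/mm/yy"

# A two-digit year is taken literally, so "20" becomes the year 0020.
parsed = [Date(s, fmt) for s in raw]
println(parsed[1])   # 0020-12-01

# Build a new vector instead of updating in place, so re-running is harmless.
fixed = [d + Year(2000) for d in parsed]
println(fixed[1])    # 2020-12-01
```

Keeping the corrected dates in a separate variable avoids the run-only-once caveat of the in-place update.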


## Truestrings and Falsestrings
These arguments let you set the list of inputs to be considered as true and false boolean values.

In [51]:
data = """
b01|b002|c1|c2|c3|c4|d1
"t"|"fa"|"XY"|2|c|1.5|2020-01-05
"f"|"tr"|"AB"|16|x|2.33|2021-01-05
"""

CSV.read(IOBuffer(data), DataFrame; 
    truestrings=["t","tr"],
    falsestrings=["f","fa"])

Unnamed: 0_level_0,b01,b002,c1,c2,c3,c4,d1
Unnamed: 0_level_1,Bool,Bool,String,Int64,String,Float64,Date
1,1,0,XY,2,c,1.5,2020-01-05
2,0,1,AB,16,x,2.33,2021-01-05


## missingstrings

In [52]:
data = """
m01|b002|c1|c2|c3|c4|d1
999|"fa"|"XY"|2|c|1.5|2020-01-05
-1|"tr"|"AB"|16|x|2.33|2021-01-05
"""

CSV.read(IOBuffer(data), DataFrame; 
    missingstrings=["999"])

Unnamed: 0_level_0,m01,b002,c1,c2,c3,c4,d1
Unnamed: 0_level_1,Int64?,String,String,Int64,String,Float64,Date
1,missing,fa,XY,2,c,1.5,2020-01-05
2,-1,tr,AB,16,x,2.33,2021-01-05


## Pool
Pooling is similar to the pandas category type. The strings are stored in a `PooledArrays.PooledArray`, which can make some operations much faster. The `pool` argument sets the threshold at which a string column is turned into a pooled array.

In [104]:
data = """unique,cat
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC
3A42E,GC"""

"unique,cat\nA18E9,AT\nBF392,GC\n93EBC,AT\n54EE1,AT\n8CD2E,GC\n3A42E,GC"

In [124]:
df = CSV.read(IOBuffer(data), DataFrame; pool=0.29)

Unnamed: 0_level_0,unique,cat
Unnamed: 0_level_1,String,String
1,A18E9,AT
2,BF392,GC
3,93EBC,AT
4,54EE1,AT
5,8CD2E,GC
6,3A42E,GC


In [125]:
# column `unique` has a different value on each row, so it's not pooled
df[:,:unique]

6-element Array{String,1}:
 "A18E9"
 "BF392"
 "93EBC"
 "54EE1"
 "8CD2E"
 "3A42E"

In [126]:
# column `cat` contains only 2 distinct values in 6 rows (2/6 = 0.33), so the column is pooled
df[:,:cat]

6-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "AT"
 "GC"
 "AT"
 "AT"
 "GC"
 "GC"

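The ratio mentioned in the comment above is easy to compute by hand. This sketch uses plain Julia vectors copied from the sample data; `unique_fraction` is a hypothetical helper for the illustration. How CSV.jl compares this ratio with the `pool` value is not reproduced here, so check the documentation of the CSV.jl version you use.

```julia
unique_col = ["A18E9", "BF392", "93EBC", "54EE1", "8CD2E", "3A42E"]
cat_col    = ["AT", "GC", "AT", "AT", "GC", "GC"]

# Hypothetical helper: share of distinct values among all values of a column.
unique_fraction(col) = length(unique(col)) / length(col)

println(unique_fraction(unique_col))            # 1.0
println(round(unique_fraction(cat_col); digits=2))  # 0.33
```
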
## Comment parameter
Sometimes your file contains comment lines. You can exclude them using `comment` parameter.

In [60]:
data = """c1|c2|c3|c4|d1
"XY"|2|c|1.5|2020-01-05
~ this is a comment
"AB"|16|x|2.33|2021-01-05
"""

"c1|c2|c3|c4|d1\n\"XY\"|2|c|1.5|2020-01-05\n~ this is a comment\n\"AB\"|16|x|2.33|2021-01-05\n"

In [62]:
# rows starting with `~` will be skipped
CSV.read(IOBuffer(data), DataFrame; comment="~")

Unnamed: 0_level_0,c1,c2,c3,c4,d1
Unnamed: 0_level_1,String,Int64,String,Float64,Date
1,XY,2,c,1.5,2020-01-05
2,AB,16,x,2.33,2021-01-05


In [66]:
# the comment marker can be longer than one character
data = """c1|c2|c3|c4|d1
"XY"|2|c|1.5|2020-01-05
!! this is a comment
"AB"|16|x|2.33|2021-01-05
"""
CSV.read(IOBuffer(data), DataFrame; comment="!!")

Unnamed: 0_level_0,c1,c2,c3,c4,d1
Unnamed: 0_level_1,String,Int64,String,Float64,Date
1,XY,2,c,1.5,2020-01-05
2,AB,16,x,2.33,2021-01-05


In [63]:
# you can only specify one string to mark the comments
CSV.read(IOBuffer(data), DataFrame; comment=["~","#"])

TypeError: TypeError: in keyword argument comment, expected Union{Nothing, String}, got Array{String,1}

## Transpose parameter
When you want to transpose your data, so that each row of the file becomes a column, you can set `transpose=true`.

In [68]:
data = """
c1|X|1.0|2
c2|Y|2.0|5
"""
CSV.read(IOBuffer(data), DataFrame; transpose=true)

Unnamed: 0_level_0,c1,c2
Unnamed: 0_level_1,String,String
1,X,Y
2,1.0,2.0
3,2,5


## Fixed Width File with ignorerepeated

In [38]:
data = """
A   B   C  
1   2.0 "X"
"""

df = CSV.read(IOBuffer(data), DataFrame; 
    delim=" ",
    ignorerepeated=true)
df

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,String
1,1,2.0,X


In [39]:
names(df)

3-element Array{String,1}:
 "A"
 "B"
 "C"

In [43]:
length.(df)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,1


### Dataframes with different delimiter

In [90]:
data = """
| A |   B  | C  
| 1 |  2.0 | X 
"""

df = CSV.read(IOBuffer(data), DataFrame; 
    delim="|",
    ignorerepeated=true)
df

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,String
1,1,2.0,X


Since the extra characters are blank spaces " " and not pipe delimiters, `ignorerepeated=true` removes nothing. We need to remove the extra spaces manually.

In [91]:
# header names contain blank spaces
names(df)

3-element Array{String,1}:
 " A "
 "   B  "
 " C  "

In [92]:
# the length of the string values can also be increased by the blank spaces
length.(df)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,3


In [93]:
# applying strip (trim) element-wise (using . dot operator)
strip.(names(df))

3-element Array{SubString{String},1}:
 "A"
 "B"
 "C"

In [94]:
# rename the names, by applying `strip` element wise to the column names vector
rename!(df, strip.(names(df)))

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,String
1,1,2.0,X


In [95]:
# now the names are stripped
names(df)

3-element Array{String,1}:
 "A"
 "B"
 "C"

In [81]:
# apply strip to specific column
transform!(df, :C => ByRow(x -> strip(x)) => :C)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,SubStri…
1,1,2.0,X


In [98]:
# select string columns


Unnamed: 0_level_0,C
Unnamed: 0_level_1,String
1,X


UndefVarError: UndefVarError: x not defined

In [101]:
# applying strip to whole dataframe fails, because there's no strip method for ints or floats
strip.(df)

MethodError: MethodError: no method matching strip(::Int64)
Closest candidates are:
  strip(::Any, !Matched::AbstractString) at strings/util.jl:222
  strip(::Any, !Matched::CategoricalValue{String,R} where R<:Integer) at deprecated.jl:65
  strip(!Matched::AbstractString) at strings/util.jl:220
  ...
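
One way around this error is to strip only the values that are strings and leave everything else untouched. The sketch below shows the idea on a single hand-written row in plain Julia; `strip_if_string` is a hypothetical helper, not part of DataFrames.jl.

```julia
# Hypothetical helper: strip strings, pass every other value through unchanged.
strip_if_string(x) = x isa AbstractString ? strip(x) : x

row = (1, 2.0, " X ")                 # one row shaped like the DataFrame above
cleaned = map(strip_if_string, row)
println(cleaned)                      # (1, 2.0, "X")
```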

In [103]:
# you can apply strip to only string column
transform!(df, :C => ByRow(x -> strip(x)) => :C)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,SubStri…
1,1,2.0,X


In [104]:
length.(df)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,1


## Speed

In [59]:
path = "/home/vaclav/Data/Kaggle/EEE-CIS_Fraud_Detection/train_transaction.csv"

"/home/vaclav/Data/Kaggle/EEE-CIS_Fraud_Detection/train_transaction.csv"

In [60]:
@time begin
    CSV.read(path, DataFrame)
end

  7.818338 seconds (3.62 M allocations: 2.048 GiB, 1.98% gc time)


Unnamed: 0_level_0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2
Unnamed: 0_level_1,Int64,Int64,Int64,Float64,String,Int64,Float64?
1,2987000,0,86400,68.5,W,13926,missing
2,2987001,0,86401,29.0,W,2755,404.0
3,2987002,0,86469,59.0,W,4663,490.0
4,2987003,0,86499,50.0,W,18132,567.0
5,2987004,0,86506,50.0,H,4497,514.0
6,2987005,0,86510,49.0,W,5937,555.0
7,2987006,0,86522,159.0,W,12308,360.0
8,2987007,0,86529,422.5,W,12695,490.0
9,2987008,0,86535,15.0,H,2803,100.0
10,2987009,0,86536,117.0,W,17399,111.0


In [61]:
# number of threads
Threads.nthreads() 

1

To start the notebook with more threads, run the following from the REPL:

```julia
ENV["JULIA_NUM_THREADS"] = 4
using IJulia
notebook()
```