# Outline

The Julia DataFrames package provides tools for working with and manipulating tabular data in Julia. It is well suited to heterogeneous data, where the columns have different types, and to datasets that fit in memory. It supports a variety of data manipulation operations such as subsetting rows, selecting columns, aggregating by group, and joining. We will explore all of these and more.

What we'll be covering today:

#### I. Getting started
#### II. Working with dataframes
#### III. Joining and concatenating
#### IV. Handling missing values
#### V. Split-apply-combine
#### VI. Using Query.jl

# I. Getting started

In [1]:
#To install
#using Pkg
#Pkg.add("DataFrames")

In [2]:
#To load the DataFrames package once installed
using DataFrames

### Dataframes fundamentals:

The basic data structure you will be working with is the **DataFrame**. This type is defined in the DataFrames package. In this section we'll see a few different ways of manually creating dataframes using the DataFrame constructor. You'll rarely use this constructor directly to create your dataframes, except for maybe testing out ideas, but it's good to have an understanding of how to do this.


Let's start by creating a DataFrame explicitly using keyword arguments. We'll create a small dataframe **df** with five columns named <i>A</i>, <i>B</i>, <i>C</i>, <i>D</i>, and <i>E</i>.

In [3]:
using Random

In [4]:
df = DataFrame(B = [0, 1, 1, 0], C = [0, 0, 1, 1], A = [0, 1, 0, 1], D = [randstring(9) for j in 1:4], E = 1:4)

| Row | B | C | A | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | krLzKYfPX | 1 |
| 2 | 1 | 0 | 1 | q7H41ej3q | 2 |
| 3 | 1 | 1 | 0 | nAToTQrKJ | 3 |
| 4 | 0 | 1 | 1 | bzm87wkyC | 4 |


In [5]:
typeof(df)

DataFrame

Note that column names in Julia are actually **Symbols** and not **Strings**. In Julia, symbols are prefixed with ":", which is how you can tell that an object is a symbol. Also notice that Julia has typed each column.

For us it's not too important to know exactly what a symbol is in Julia. Just be aware that this tutorial mostly refers to columns by symbol (using the symbol notation), although string forms such as `df."col"` appear in a few places.
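
To make the symbol/string distinction concrete, here is a minimal sketch using only Base Julia. The variable names `s` and `t` are illustrative and not part of the tutorial's dataframe.

```julia
# Symbols are interned identifiers; Strings are ordinary text.
s = :A                      # a Symbol literal
t = Symbol("A")             # the same Symbol, built from a String

println(typeof(s))          # Symbol
println(s === t)            # true: both refer to the same Symbol
println(String(s) == "A")   # true: a Symbol can be converted to a String
println(:A == "A")          # false: a Symbol is never equal to a String
```

Converting with `Symbol(...)` and `String(...)` is handy when column names arrive as text, for example from user input.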

You can initialize an empty dataframe using the `DataFrame()` constructor and then build it up by adding columns. You can also build a dataframe from a dictionary, in which case you pass the dictionary as an argument to the `DataFrame()` constructor. Note in the output below that the columns end up ordered by key.

In [6]:
df = DataFrame(); #initialize an empty dataframe

In [7]:
d = Dict("A" => [0, 1, 0, 1], "B" => [0, 1, 1, 0], "C" => [0, 0, 1, 1], "D" => [randstring(9) for j in 1:4], 
    "E" => 1:4)

Dict{String,AbstractArray{T,1} where T} with 5 entries:
  "B" => [0, 1, 1, 0]
  "A" => [0, 1, 0, 1]
  "C" => [0, 0, 1, 1]
  "D" => ["ZMRDwcRv1", "QG6vAsCWY", "SlBocwMeI", "zYajYbiII"]
  "E" => 1:4

In [8]:
df = DataFrame(d)

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | ZMRDwcRv1 | 1 |
| 2 | 1 | 1 | 0 | QG6vAsCWY | 2 |
| 3 | 0 | 1 | 1 | SlBocwMeI | 3 |
| 4 | 1 | 0 | 1 | zYajYbiII | 4 |


The following syntax also works: pass the dataframe constructor a comma-separated list of **Pairs**, where the first element of each pair is a **Symbol** naming the column and the second element holds the values. Note that with this method the order in which the columns were passed is maintained in the resulting dataframe.

In [9]:
df = DataFrame(:B => [0, 1, 1, 0], :A => [0, 1, 0, 1], :C => [0, 0, 1, 1], :D => [randstring(9) for j in 1:4], 
    :E => 1:4)

| Row | B | A | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | NfgMMkH7s | 1 |
| 2 | 1 | 1 | 0 | 09r6rwhbg | 2 |
| 3 | 1 | 0 | 1 | JBQmjqYN9 | 3 |
| 4 | 0 | 1 | 1 | xd8UZ32MW | 4 |
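
The difference in column order between the dictionary and pair constructors can be checked directly. The following is a small sketch with made-up two-row data; `df_pairs` and `df_dict` are illustrative names, and `String.(names(...))` is used so the comparison does not depend on whether `names` returns strings or symbols in your DataFrames version.

```julia
using DataFrames

# Pairs are positional, so the column order is kept as written.
df_pairs = DataFrame(:B => [1, 2], :A => [3, 4])

# A Dict has no defined order; DataFrames sorts the columns by key.
df_dict = DataFrame(Dict("B" => [1, 2], "A" => [3, 4]))

println(String.(names(df_pairs)))   # ["B", "A"]
println(String.(names(df_dict)))    # ["A", "B"]
```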


You can build up an empty dataframe by explicitly adding columns using dot notation to refer to columns.

In [10]:
df = DataFrame();
df.B = [0, 1, 1, 0];
df.C = [0, 0, 1, 1];
df.A = [0, 1, 0, 1];
df.D = [randstring(9) for j in 1:4];
df.E = 1:4;

In [11]:
df

| Row | B | C | A | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | 2OFpjFUVh | 1 |
| 2 | 1 | 0 | 1 | mBme76yTQ | 2 |
| 3 | 1 | 1 | 0 | uHaerFmAX | 3 |
| 4 | 0 | 1 | 1 | dGf63DEzC | 4 |


You can also create a dataframe by passing in the column values and symbols as separate arguments to the `DataFrame()` constructor. The first argument is an array of vectors where each vector is a column of data; the second argument is the array of symbols designating the column names.

In [12]:
df = DataFrame([[0,1,1,0], [0,0,1,1], [0,1,0,1], [randstring(9) for j in 1:4], 1:4], [:B, :C, :A, :D, :E])

| Row | B | C | A | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | G5gjascaS | 1 |
| 2 | 1 | 0 | 1 | 91m4oiXfs | 2 |
| 3 | 1 | 1 | 0 | cF5KPzl02 | 3 |
| 4 | 0 | 1 | 1 | VOVNq9JTH | 4 |


If you want to convert your dataframe to an array, wrap the dataframe in a call to `Matrix`.

In [13]:
m = Matrix(df)

4×5 Array{Any,2}:
 0  0  0  "G5gjascaS"  1
 1  0  1  "91m4oiXfs"  2
 1  1  0  "cF5KPzl02"  3
 0  1  1  "VOVNq9JTH"  4

And then to convert it back to a dataframe you can wrap the array in a call to `DataFrame()`. Note the arbitrary column names <i>x1</i>, <i>x2</i>, etc.

In [14]:
df = DataFrame(m)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Any | Any | Any | Any | Any |
| 1 | 0 | 0 | 0 | G5gjascaS | 1 |
| 2 | 1 | 0 | 1 | 91m4oiXfs | 2 |
| 3 | 1 | 1 | 0 | cF5KPzl02 | 3 |
| 4 | 0 | 1 | 1 | VOVNq9JTH | 4 |
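
The `Any` element type seen in the conversion comes from mixing integer and string columns in one matrix. As a rough sketch (the dataframes `num` and `mixed` are made up for illustration), the element type of `Matrix(df)` depends on what the column types can be promoted to:

```julia
using DataFrames

num = DataFrame(x = [1, 2], y = [0.5, 1.5])     # Int64 and Float64 columns
mixed = DataFrame(x = [1, 2], s = ["a", "b"])   # Int64 and String columns

println(eltype(Matrix(num)))     # Float64: the integers are promoted
println(eltype(Matrix(mixed)))   # Any: no common concrete type exists

# You can also request an element type explicitly.
m = Matrix{Float64}(num)
println(size(m))                 # (2, 2)
```

A matrix with element type `Any` is usually slower to work with, so it can be worth converting only the numeric columns.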


Finally, if you need to, you can initialize a non-empty dataframe with garbage values. You will need to specify the desired column types and optionally specify the column names and number of rows.

In [15]:
df_garbage = DataFrame([Int64, Int64, Int64, String, Int64], [:B, :C, :A, :D, :E], 4)

| Row | B | C | A | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 139997191614944 | 139997253354704 | 139997093954256 | #undef | 139997255412576 |
| 2 | 139997184114560 | 139997093294272 | 139997094024368 | #undef | 139997255407376 |
| 3 | 139997184114560 | 139997093294336 | 139997094416088 | #undef | 139997093328880 |
| 4 | 1 | 139997184449008 | 2 | #undef | 0 |


Regardless of how you create it, the **DataFrame** type represents a data table as a series of vectors, each corresponding to a column or variable.
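
One way to see the "series of vectors" structure is to iterate over the columns. This is a minimal sketch on a small made-up dataframe; `nrow`, `ncol`, and `eachcol` are DataFrames functions, the rest is Base Julia.

```julia
using DataFrames

df = DataFrame(A = [0, 1, 0, 1], E = 1:4)

println(size(df))                                   # (4, 2)
println(nrow(df), " rows, ", ncol(df), " columns")

# Each column is an ordinary vector with one entry per row.
for col in eachcol(df)
    println(typeof(col), " with ", length(col), " elements")
end
```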

### Selecting columns:

In [16]:
# create a dataframe

df = DataFrame(A = [0, 1, 0, 1], B = [0, 1, 1, 0], C = [0, 0, 1, 1], D = [randstring(9) for j in 1:4], E = 1:4);

You can select individual columns in a few different ways:
- `df.col`
- `df."col"`
- `df[!,:col]`
- `df[!, col_idx]`


In [17]:
df

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 1 | 1 | 0 | qjEDuIH9h | 2 |
| 3 | 0 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 1 | 0 | 1 | SHd4BTv0z | 4 |


Below we access columns <i>A</i> and <i>D</i> using the dot notation. Note that this does __not__ return a copy of the column data, so if you modify __df.A__ you modify the dataframe as well.

In [18]:
df.A

4-element Array{Int64,1}:
 0
 1
 0
 1

In [19]:
typeof(df.A)

Array{Int64,1}

In [20]:
typeof(df.D)

Array{String,1}

You can also use string notation; again, this does __not__ return a copy:

In [21]:
df."A"

4-element Array{Int64,1}:
 0
 1
 0
 1

You can refer to columns using bracket notation, giving the column name as a symbol such as `:B` (DataFrames releases that support `df."col"` also accept the string `"B"` here). The `!` in the row position means to grab all rows without copying.

Please note that `df[!, :col]` does **not** make a copy of the column, so modifying its elements changes the dataframe itself. If you want to work with a copy, use `df[:, :col]`.

In [22]:
df[!, :B][2] = 2 # modifies df itself, because no copy is made

2

In [23]:
a = df[!, :B]

4-element Array{Int64,1}:
 0
 2
 1
 0

In [24]:
df

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 1 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 0 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 1 | 0 | 1 | SHd4BTv0z | 4 |
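
The copy versus no-copy behaviour of `!` and `:` can be verified with the identity operator `===`. This is a small sketch on a made-up two-column dataframe; the names `shared` and `copied` are illustrative.

```julia
using DataFrames

df = DataFrame(A = [0, 1, 0, 1], B = [0, 1, 1, 0])

shared = df[!, :B]   # the column vector stored in df (no copy)
copied = df[:, :B]   # a fresh copy of the column

println(shared === df.B)   # true
println(copied === df.B)   # false

copied[1] = 100            # df is unchanged
shared[2] = 2              # df.B[2] is now 2
println(df.B)              # [0, 2, 1, 0]
```

Working on a copy is the safer default when you only want to inspect or transform values without touching the original data.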


You can also use the column index to refer to a specific column. Here we get the third column. Note that indexing in Julia starts at 1.

In [25]:
df[!, 3] #third way using a column index

4-element Array{Int64,1}:
 0
 0
 1
 1

You can retrieve multiple columns by listing them out by symbol or index. In this case the returned object will be a dataframe.

In [26]:
df[!, [:A, :D]] # get columns A and D

| Row | A | D |
|---|---|---|
| | Int64 | String |
| 1 | 0 | CnxKGI2pr |
| 2 | 1 | qjEDuIH9h |
| 3 | 0 | P2RzhHpln |
| 4 | 1 | SHd4BTv0z |


In [27]:
df[!, 2:5]  #get columns two through five

| Row | B | C | D | E |
|---|---|---|---|---|
| | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 0 | 1 | SHd4BTv0z | 4 |


An alternative method for selecting columns in a dataframe is to use the `select` function.

In [28]:
select(df, :A) #select column A

| Row | A |
|---|---|
| | Int64 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [29]:
select(df, [:A, :D]) # select columns A and D

| Row | A | D |
|---|---|---|
| | Int64 | String |
| 1 | 0 | CnxKGI2pr |
| 2 | 1 | qjEDuIH9h |
| 3 | 0 | P2RzhHpln |
| 4 | 1 | SHd4BTv0z |


In [30]:
select(df, 2:5) #select columns 2 through 5

| Row | B | C | D | E |
|---|---|---|---|---|
| | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 0 | 1 | SHd4BTv0z | 4 |


In [31]:
select(df, Not(:A)) #select all columns except column A

| Row | B | C | D | E |
|---|---|---|---|---|
| | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 0 | 1 | SHd4BTv0z | 4 |


The `select` function returns a new dataframe where the columns are copies of the columns from the original dataframe.
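
A quick way to confirm that `select` copies by default is to modify the result and check the original. The sketch below uses a made-up dataframe; the `copycols = false` call is included on the assumption that your DataFrames version supports that keyword of `select`, in which case the selected column is shared rather than copied.

```julia
using DataFrames

df = DataFrame(A = [0, 1, 0, 1], B = [0, 1, 1, 0])

sel = select(df, :A)       # columns are copied by default
sel.A[1] = 42
println(df.A[1])           # 0: the original is untouched

# Assumption: the copycols keyword is available in your version.
shared = select(df, :A, copycols = false)
println(shared.A === df.A) # true: the column is shared, not copied
```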

To get the names and types of the columns you can use the `names` and `eltype` functions.

In [32]:
names(df)

5-element Array{String,1}:
 "A"
 "B"
 "C"
 "D"
 "E"

In [33]:
eltype.(eachcol(df))

5-element Array{DataType,1}:
 Int64 
 Int64 
 Int64 
 String
 Int64 

In [34]:
?eachcol

search: eachcol eachslice eachmatch



```
eachcol(A::AbstractVecOrMat)
```

Create a generator that iterates over the second dimension of matrix `A`, returning the columns as views.

See also [`eachrow`](@ref) and [`eachslice`](@ref).

!!! compat "Julia 1.1"
    This function requires at least Julia 1.1.


---

```
eachcol(df::AbstractDataFrame)
```

Return a `DataFrameColumns` that is an `AbstractVector` that allows iterating an `AbstractDataFrame` column by column. Additionally it is allowed to index `DataFrameColumns` using column names.

# Examples

```jldoctest
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 11    │
│ 2   │ 2     │ 12    │
│ 3   │ 3     │ 13    │
│ 4   │ 4     │ 14    │

julia> collect(eachcol(df))
2-element Array{AbstractArray{T,1} where T,1}:
 [1, 2, 3, 4]
 [11, 12, 13, 14]

julia> map(eachcol(df)) do col
           maximum(col) - minimum(col)
       end
2-element Array{Int64,1}:
 3
 3

julia> sum.(eachcol(df))
2-element Array{Int64,1}:
 10
 50
```


You can append a row using `push!`, providing the row values in a tuple.

In [35]:
push!(df, (1, 1, 1, randstring(7), 5))

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 1 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 0 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 1 | 0 | 1 | SHd4BTv0z | 4 |
| 5 | 1 | 1 | 1 | 41G6s08 | 5 |
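
Besides a plain tuple, `push!` also accepts a named tuple, which makes it explicit which value goes into which column. A minimal sketch on a made-up two-column dataframe:

```julia
using DataFrames

df = DataFrame(A = [0, 1], D = ["x", "y"])

push!(df, (1, "z"))            # positional tuple, in column order
push!(df, (A = 0, D = "w"))    # named tuple, matched to columns by name

println(nrow(df))              # 4
println(df.D)                  # ["x", "y", "z", "w"]
```

Note that `push!` modifies `df` in place, as the trailing `!` in its name suggests.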


### Reading and writing data:

Most likely you will not be manually creating dataframes as above but rather loading data from external files.

You can read and write data to a variety of file formats.

If you want to save your dataframe to a CSV file you can use the `CSV.write` function in the **CSV.jl** package:

In [36]:
using CSV

The first argument is the desired name of the CSV file and the second is the name of the dataframe:

In [37]:
CSV.write("mydf.csv", df);

If you want to load the CSV file use `CSV.read`:

In [38]:
df = CSV.read("mydf.csv")


| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 1 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 0 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 1 | 0 | 1 | SHd4BTv0z | 4 |
| 5 | 1 | 1 | 1 | 41G6s08 | 5 |


In [39]:
typeof(df)

DataFrame

Note that, depending on the CSV.jl version, the column type of a dataframe read from a CSV file can differ from that of a dataframe created manually. Here the column comes back as a plain `Array{Int64,1}`:

In [40]:
typeof(df.A)

Array{Int64,1}
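
A write/read round trip can be sketched as follows. The temporary path from `tempname()` and the small dataframe are illustrative, and the file is read with `DataFrame(CSV.File(path))`, a form that should work across CSV.jl versions. The exact column types you get back may vary with the CSV.jl version, so the sketch compares values rather than types.

```julia
using CSV, DataFrames

df = DataFrame(A = [0, 1], D = ["x", "y"])

path = tempname() * ".csv"        # temporary file, so nothing is overwritten
CSV.write(path, df)

df2 = DataFrame(CSV.File(path))   # parse the file, then build a DataFrame

println(df2.A == df.A)            # true
println(df2.D == df.D)            # true
println(typeof(df2.A))            # column type depends on the CSV.jl version
```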

You can also save the dataframe using a different delimiter:

In [41]:
CSV.write("mydf.tsv", df, delim='\t');

In [42]:
df = CSV.read("mydf.tsv", delim='\t');


In [43]:
df

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | String | Int64 |
| 1 | 0 | 0 | 0 | CnxKGI2pr | 1 |
| 2 | 1 | 2 | 0 | qjEDuIH9h | 2 |
| 3 | 0 | 1 | 1 | P2RzhHpln | 3 |
| 4 | 1 | 0 | 1 | SHd4BTv0z | 4 |
| 5 | 1 | 1 | 1 | 41G6s08 | 5 |
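
To experiment with delimiters without creating files, you can read from and write to an in-memory `IOBuffer`. This is a sketch with made-up data; it assumes `CSV.File` and `CSV.write` accept an `IO` object, which is how CSV.jl is commonly used.

```julia
using CSV, DataFrames

tsv = "A\tB\n1\tx\n2\ty\n"                              # tab-separated text
df = DataFrame(CSV.File(IOBuffer(tsv), delim = '\t'))
println(size(df))                                       # (2, 2)

io = IOBuffer()
CSV.write(io, df, delim = ';')                          # write with semicolons
out = String(take!(io))
print(out)
```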


The CSV.jl package has a number of useful features. Let us look at a few related to reading in CSV data (there are many more than we'll cover here).

You can indicate where the header row starts in the file. By default, data will be read in starting on the next row. Let's look at our example file:

In [44]:
;cat readinexample.csv

# This is a test file
# Header starts on line 3
A,B,C,D,E
0,1,1,Ux0pu5ELc,3
1,0,1,F7ZVLlfJJ,4
1,1,1,FKLTflu,5
99,1,0,HUfdsDOOas,6
0,NA,NA,PUhgjmjef,7
1,99,NA,Ytf4OFtr,8
0,0,1,hU4df56sf,9


Here the header row is on the third line so we can specify that in our CSV.read() command using its `header` keyword argument.

In [45]:
CSV.read("readinexample.csv", header=3)

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | String | String | String | Int64 |
| 1 | 0 | 1.0 | 1.0 | Ux0pu5ELc | 3 |
| 2 | 1 | 0.0 | 1.0 | F7ZVLlfJJ | 4 |
| 3 | 1 | 1.0 | 1.0 | FKLTflu | 5 |
| 4 | 99 | 1.0 | 0.0 | HUfdsDOOas | 6 |
| 5 | 0 |  |  | PUhgjmjef | 7 |
| 6 | 1 | 99.0 |  | Ytf4OFtr | 8 |
| 7 | 0 | 0.0 | 1.0 | hU4df56sf | 9 |


Note that if your file has no header you can simply set `header=false`.
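
Both header options can be tried on in-memory text. The sketch below uses a shortened, made-up version of the example file; `String.(names(...))` keeps the comparison independent of whether `names` returns strings or symbols.

```julia
using CSV, DataFrames

text = """
# This is a test file
# Header starts on line 3
A,B
0,1
1,0
"""

# The header is on line 3; data is read from the following line.
df = DataFrame(CSV.File(IOBuffer(text), header = 3))
println(String.(names(df)))         # ["A", "B"]

# With no header row, CSV.jl generates the column names itself.
df_nohdr = DataFrame(CSV.File(IOBuffer("0,1\n1,0\n"), header = false))
println(String.(names(df_nohdr)))   # ["Column1", "Column2"]
```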

You can also read in data starting at a specified row in the file. There are two ways to do this: one is the `datarow` keyword argument and another is the `skipto` keyword argument. As the two outputs below show, both give the same result here: each is passed the row number at which to start reading in data.

Here we indicate the header is on row 3 and the data we want to read starts on row 6.

In [46]:
CSV.read("readinexample.csv", header=3, datarow=6)

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | String | String | String | Int64 |
| 1 | 1 | 1.0 | 1.0 | FKLTflu | 5 |
| 2 | 99 | 1.0 | 0.0 | HUfdsDOOas | 6 |
| 3 | 0 |  |  | PUhgjmjef | 7 |
| 4 | 1 | 99.0 |  | Ytf4OFtr | 8 |
| 5 | 0 | 0.0 | 1.0 | hU4df56sf | 9 |


In [47]:
CSV.read("readinexample.csv", header=3, skipto=6)

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64 | String | String | String | Int64 |
| 1 | 1 | 1.0 | 1.0 | FKLTflu | 5 |
| 2 | 99 | 1.0 | 0.0 | HUfdsDOOas | 6 |
| 3 | 0 |  |  | PUhgjmjef | 7 |
| 4 | 1 | 99.0 |  | Ytf4OFtr | 8 |
| 5 | 0 | 0.0 | 1.0 | hU4df56sf | 9 |
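
The row arithmetic is easy to check on a tiny in-memory file. In this sketch (made-up data) the header is on line 3 and `skipto = 5` starts reading at line 5, so the first data row on line 4 is skipped. Only `skipto` is used, since it appears to be the keyword that newer CSV.jl releases keep.

```julia
using CSV, DataFrames

text = """
# note
# header on line 3
A,B
1,2
3,4
5,6
"""

# Line 4 (1,2) is skipped; reading starts at line 5.
df = DataFrame(CSV.File(IOBuffer(text), header = 3, skipto = 5))
println(df.A)   # [3, 5]
println(df.B)   # [4, 6]
```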


If certain values should be treated as `missing` you can indicate that with the `missingstrings` keyword argument. In our file, let's assume that the values 99 and NA should be treated as missing when the data is read in:

In [48]:
CSV.read("readinexample.csv", header=3, missingstrings=["99", "NA"])

| Row | A | B | C | D | E |
|---|---|---|---|---|---|
| | Int64? | Int64? | Int64? | String | Int64 |
| 1 | 0 | 1 | 1 | Ux0pu5ELc | 3 |
| 2 | 1 | 0 | 1 | F7ZVLlfJJ | 4 |
| 3 | 1 | 1 | 1 | FKLTflu | 5 |
| 4 | missing | 1 | 0 | HUfdsDOOas | 6 |
| 5 | 0 | missing | missing | PUhgjmjef | 7 |
| 6 | 1 | missing | missing | Ytf4OFtr | 8 |
| 7 | 0 | 0 | 1 | hU4df56sf | 9 |
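
If the data is already loaded, a sentinel value can also be turned into `missing` after the fact. This is an alternative to the read-time keyword, not a replacement for it: a minimal sketch using Base Julia's `replace` on a made-up dataframe where 99 marks a missing value.

```julia
using DataFrames

df = DataFrame(A = [0, 99, 1], B = [1, 1, 99])

# replace returns a new vector whose element type also allows missing.
df.A = replace(df.A, 99 => missing)
df.B = replace(df.B, 99 => missing)

println(count(ismissing, df.A))     # 1
println(sum(skipmissing(df.A)))     # 1: missing values are skipped
println(eltype(df.B))               # element type now includes Missing
```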


The last thing we'll cover is selecting specific columns, or dropping columns, when reading in data. Suppose we only want to read in the columns _A_, _B_, and _D_. You can specify this using the column name or column index in the `select` keyword argument.

In [49]:
CSV.read("readinexample.csv", header=3, select=["A", "B", "D"])

| Row | A | B | D |
|---|---|---|---|
| | Int64 | String | String |
| 1 | 0 | 1.0 | Ux0pu5ELc |
| 2 | 1 | 0.0 | F7ZVLlfJJ |
| 3 | 1 | 1.0 | FKLTflu |
| 4 | 99 | 1.0 | HUfdsDOOas |
| 5 | 0 |  | PUhgjmjef |
| 6 | 1 | 99.0 | Ytf4OFtr |
| 7 | 0 | 0.0 | hU4df56sf |


And if you wanted to drop columns _C_ and _D_ use `drop`:

In [50]:
CSV.read("readinexample.csv", header=3, drop=["C", "D"])

You can also use the column index number. Here we drop the third and fifth columns of our csv file, which are _C_ and _E_:

In [51]:
CSV.read("readinexample.csv", header=3, drop=[3,5])

| Row | A | B | D |
|---|---|---|---|
| | Int64 | String | String |
| 1 | 0 | 1.0 | Ux0pu5ELc |
| 2 | 1 | 0.0 | F7ZVLlfJJ |
| 3 | 1 | 1.0 | FKLTflu |
| 4 | 99 | 1.0 | HUfdsDOOas |
| 5 | 0 |  | PUhgjmjef |
| 6 | 1 | 99.0 | Ytf4OFtr |
| 7 | 0 | 0.0 | hU4df56sf |
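
Selecting and dropping at read time can be compared side by side on in-memory text. In this sketch (made-up three-column data), keeping columns _A_ and _C_ by name and dropping the second column by index should produce the same result.

```julia
using CSV, DataFrames

text = "A,B,C\n1,2,3\n4,5,6\n"

kept = DataFrame(CSV.File(IOBuffer(text), select = ["A", "C"]))  # keep by name
dropped = DataFrame(CSV.File(IOBuffer(text), drop = [2]))        # drop by index

println(String.(names(kept)))      # ["A", "C"]
println(String.(names(dropped)))   # ["A", "C"]
```

Filtering columns while reading avoids parsing data you do not need, which can matter for wide files.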


There are other Julia packages for reading other file formats (these are just a select few):
* ReadStat.jl: Stata, SAS, and SPSS data files.
* Parquet.jl: Parquet files.
* JSON.jl, JSON2.jl, JSON3.jl: JSON files.


In this lesson we covered:
* What the Julia DataFrames package can be used for.
* What the DataFrame type is.
* The basics of Julia dataframes.
* Simple I/O using dataframes and the CSV.jl package.