Data Science Fundamentals: Julia

In [1]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, which will be familiar from R.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of the data in your `DataFrame` (check out the help of `describe` for information on how to customize the statistics shown).

In [5]:
describe(x)

Row,variable,mean,min,median,max,nunique,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Union…,Union…,Type
1,A,1.5,1,1.5,2,,,Int64
2,B,1.0,1.0,1.0,1.0,,1.0,"Union{Missing, Float64}"
3,C,,a,,b,2.0,,String
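
As a minimal sketch of that customization, the statistics to compute can be passed as symbols after the data frame. The data frame name `demo` and the variable `stats` are illustrative, not part of the notebook:

```julia
using DataFrames

# Illustrative data with the same shape as `x` above.
demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Request only the mean and the number of missing values per column.
stats = describe(demo, :mean, :nmissing)
```

The result is itself a `DataFrame` with one row per column of `demo`, so it can be indexed like any other data frame.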


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

Row,variable,mean,min,median,max,nunique,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Nothing,Union…,Type
1,A,1.5,1.0,1.5,2.0,,,Int64
2,B,1.0,1.0,1.0,1.0,,1.0,"Union{Missing, Float64}"


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Array{String,1}:
 "A"
 "B"
 "C"

Use `propertynames` to get a vector of `Symbol`s:

In [8]:
propertynames(x)

3-element Array{Symbol,1}:
 :A
 :B
 :C

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [9]:
eltype.(eachcol(x))

3-element Array{Type,1}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [10]:
y = DataFrame(rand(1:10, 1000, 10))

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,7,7,8,2,6,1,9,3,1,1
2,4,4,2,5,3,7,10,8,6,7
3,8,6,7,3,9,10,5,2,10,10
4,6,2,8,6,3,8,1,3,6,5
5,2,5,9,10,10,1,9,4,3,6
6,3,2,1,6,7,6,10,8,4,4
7,10,10,6,9,1,9,5,2,4,9
8,5,1,10,5,3,2,1,5,10,6
9,7,3,2,1,10,8,7,1,9,2
10,6,10,7,5,8,5,4,2,5,8


and then we can use `first` to peek into its first few rows

In [11]:
first(y, 5)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,7,7,8,2,6,1,9,3,1,1
2,4,4,2,5,3,7,10,8,6,7
3,8,6,7,3,9,10,5,2,10,10
4,6,2,8,6,3,8,1,3,6,5
5,2,5,9,10,10,1,9,4,3,6


and `last` to see its bottom rows.

In [12]:
last(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,3,9,6,8,8,7,7,4,9,6
2,4,5,5,8,4,10,3,4,7,4
3,9,9,10,9,5,2,2,10,5,8


Calling `first` or `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`:

In [13]:
first(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,7,7,8,2,6,1,9,3,1,1


In [14]:
last(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,9,9,10,9,5,2,2,10,5,8


### Displaying large data frames

Create a wide and tall data frame:

In [15]:
df = DataFrame(rand(100, 100))

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.78421,0.632682,0.717387,0.114581,0.0527668,0.838625,0.586866,0.43078
2,0.76552,0.48835,0.378482,0.483952,0.513253,0.32416,0.327797,0.565872
3,0.887606,0.644904,0.0977502,0.149482,0.223725,0.498404,0.179926,0.904395
4,0.224705,0.794643,0.722068,0.964643,0.119267,0.589702,0.650384,0.057006
5,0.753703,0.763321,0.612657,0.291588,0.62151,0.499765,0.705866,0.108964
6,0.781352,0.902342,0.639493,0.0391614,0.187419,0.822137,0.821152,0.925857
7,0.254295,0.583424,0.334097,0.972415,0.374985,0.739663,0.501554,0.428565
8,0.411225,0.121568,0.0804997,0.129359,0.295333,0.203935,0.988299,0.252741
9,0.920259,0.916218,0.250618,0.489701,0.99901,0.456845,0.604728,0.0454408
10,0.780548,0.429435,0.486449,0.256069,0.626564,0.31027,0.709269,0.859462


We can see that 92 of its columns were not printed, and that only its first 30 rows are shown. You can change this behavior by setting the values of `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [16]:
ENV["LINES"] = 10

10

In [17]:
ENV["COLUMNS"] = 200

200

In [18]:
df

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.78421,0.632682,0.717387,0.114581,0.0527668,0.838625,0.586866,0.43078,0.464678,0.893563,0.401211,0.396356,0.611418,0.453025,0.679236,0.439772,0.907162,0.745251,0.82576
2,0.76552,0.48835,0.378482,0.483952,0.513253,0.32416,0.327797,0.565872,0.465197,0.792672,0.892776,0.934555,0.968703,0.322439,0.445629,0.958035,0.529923,0.520289,0.791401
3,0.887606,0.644904,0.0977502,0.149482,0.223725,0.498404,0.179926,0.904395,0.695331,0.818037,0.164099,0.22953,0.633211,0.899343,0.817648,0.459941,0.371385,0.53241,0.517085
4,0.224705,0.794643,0.722068,0.964643,0.119267,0.589702,0.650384,0.057006,0.780412,0.516881,0.811718,0.742405,0.953653,0.0245915,0.711977,0.00274542,0.584805,0.235295,0.61247
5,0.753703,0.763321,0.612657,0.291588,0.62151,0.499765,0.705866,0.108964,0.950511,0.264285,0.426226,0.162441,0.466255,0.832862,0.252411,0.85291,0.886041,0.794545,0.129267
6,0.781352,0.902342,0.639493,0.0391614,0.187419,0.822137,0.821152,0.925857,0.703031,0.920979,0.563051,0.0284439,0.356737,0.506513,0.630543,0.103031,0.438346,0.414213,0.852658
7,0.254295,0.583424,0.334097,0.972415,0.374985,0.739663,0.501554,0.428565,0.308771,0.454022,0.305895,0.8881,0.373925,0.736172,0.34105,0.412045,0.416606,0.554392,0.948777
8,0.411225,0.121568,0.0804997,0.129359,0.295333,0.203935,0.988299,0.252741,0.826215,0.813836,0.240699,0.0727684,0.0432211,0.235843,0.975927,0.817867,0.959214,0.328505,0.108311
9,0.920259,0.916218,0.250618,0.489701,0.99901,0.456845,0.604728,0.0454408,0.904758,0.0835009,0.784217,0.960441,0.886849,0.289049,0.000246728,0.328499,0.591128,0.0917397,0.453637
10,0.780548,0.429435,0.486449,0.256069,0.626564,0.31027,0.709269,0.859462,0.981628,0.606973,0.824016,0.950194,0.560552,0.0101158,0.80106,0.0696165,0.483464,0.807616,0.901232
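
For a one-off display that ignores the terminal size, a possible alternative to changing `ENV` is the `allrows` and `allcols` keyword arguments of `show`. The sketch below uses a small stand-in data frame named `wide` (an illustrative name) rather than the 100×100 one above:

```julia
using DataFrames

# Small stand-in for a large data frame.
wide = DataFrame(a = 1:3, b = 4:6)

# Print every row and every column regardless of the display size.
show(stdout, wide; allrows = true, allcols = true)
```

Unlike setting `ENV["LINES"]` or `ENV["COLUMNS"]`, this affects only the single call.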


### Most elementary get and set operations

Given the `DataFrame` `x` we created earlier, here are several ways to grab one of its columns as a `Vector`.

In [19]:
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [20]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [21]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [22]:
x[:, 1] # note that this creates a copy

2-element Array{Int64,1}:
 1
 2

In [23]:
x[:, 1] === x[:, 1]

false
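
The practical consequence of copying versus not copying can be sketched as follows. The names `demo`, `alias`, and `snapshot` are illustrative and not part of the notebook:

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [3.0, 4.0])

alias = demo[!, :A]      # the vector stored in the data frame
snapshot = demo[:, :A]   # a fresh copy of that vector

alias[1] = 100           # writes through to `demo`
snapshot[2] = -1         # changes only the copy

demo.A                   # [100, 2]
```

Only the write made through `alias` is visible in `demo`; the write to `snapshot` is not.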

To grab one row as a `DataFrame`, we can index as follows.

In [24]:
x[1:1, :]

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


In [25]:
x[1, :] # this produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
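
A `DataFrameRow` is a view into its parent, which the following sketch illustrates. The names `demo`, `row`, and `nt` are illustrative:

```julia
using DataFrames

demo = DataFrame(A = [1, 2], C = ["a", "b"])

row = demo[1, :]   # a DataFrameRow pointing at row 1 of `demo`

row.A, row[:C]     # fields can be read by property or by index
row.A = 10         # assignment writes through to the parent data frame

nt = copy(row)     # a detached NamedTuple holding the row's current values
demo.A             # [10, 2]
```

Calling `copy` on the row is one way to obtain a value that no longer depends on `demo`.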


We can grab a single cell or element with the same syntax that is used to grab an element of an array.

In [26]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [27]:
x[1:2, 1:2]

Row,A,B
,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns:

In [28]:
x[Not(1), r"A"]

Row,A
,Int64
1,2


In [29]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


In [30]:
x[:, Not(1)] # : means that the columns will get copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b
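
The two results print identically, so the difference between `!` and `:` is easiest to see with `===`. A minimal sketch, using the illustrative names `demo`, `shared`, and `copied`:

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [1.0, 2.0], C = ["a", "b"])

shared = demo[!, Not(:A)]   # new data frame that reuses the column vectors
copied = demo[:, Not(:A)]   # new data frame with copied column vectors

shared.B === demo.B         # true
copied.B === demo.B         # false
```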


Assignment of a scalar to a data frame can be done in ranges using broadcasting:

In [31]:
x[1:2, 1:2] .= 1
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


A vector whose length equals the number of assigned rows can also be assigned using broadcasting:

In [32]:
x[1:2, 1:2] .= [1,2]
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


Another data frame of matching size and column names can be assigned as well, again using broadcasting:

In [33]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Row,A,B,C
,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b
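
Broadcasting assignment can also target a single column, where the row selector decides whether the stored vector is reused or replaced. This is a sketch under the assumption that the installed DataFrames.jl version follows the indexing rules linked at the end of this notebook; the names `demo` and `old_A` are illustrative:

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [1.0, 2.0])
old_A = demo.A

demo[:, :A] .= 7         # in place: the existing vector is overwritten
demo[!, :B] .= "n/a"     # replacement: a freshly allocated vector is stored

demo.A === old_A         # true, the same vector is still in use
eltype(demo.B)           # String, because the column was replaced
```

With `:` the element type of the column cannot change, while with `!` it can.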


**Caution**

With `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt the data in `df` if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, where `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df`. Modifying the data frame obtained via `df[!, cols]` might therefore break the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns, so they are safe to use (and should generally be preferred except in performance- or memory-critical use cases).
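
To make the risk concrete, here is a minimal sketch of sorting a column obtained each way. The names `demo`, `safe`, and `unsafe` are illustrative:

```julia
using DataFrames

demo = DataFrame(id = [3, 1, 2], label = ["c", "a", "b"])

safe = demo[:, :id]     # a copy
sort!(safe)             # `demo` is untouched
demo.id                 # still [3, 1, 2]

unsafe = demo[!, :id]   # the vector stored in `demo`
sort!(unsafe)           # reorders `id` only
demo.id, demo.label     # ([1, 2, 3], ["c", "a", "b"])
```

After the second `sort!` the data frame still has consistent column lengths, so no error is raised, but each `id` is now paired with the wrong `label`.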

Here are examples of how `All` and `Between` can be used to select columns of a data frame.

In [34]:
x = DataFrame(rand(4, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.9304,0.175895,0.773552,0.369851,0.2446
2,0.0523155,0.0234722,0.590701,0.930948,0.101109
3,0.472186,0.492733,0.583881,0.189044,0.452569
4,0.295229,0.793781,0.0980745,0.556772,0.298303


In [35]:
x[:, Between(:x2, :x4)]

Row,x2,x3,x4
,Float64,Float64,Float64
1,0.175895,0.773552,0.369851
2,0.0234722,0.590701,0.930948
3,0.492733,0.583881,0.189044
4,0.793781,0.0980745,0.556772


In [36]:
x[:, All("x1", Between("x2", "x4"))]

Row,x1,x2,x3,x4
,Float64,Float64,Float64,Float64
1,0.9304,0.175895,0.773552,0.369851
2,0.0523155,0.0234722,0.590701,0.930948
3,0.472186,0.492733,0.583881,0.189044
4,0.295229,0.793781,0.0980745,0.556772


### Views

You can create a view of a `DataFrame`, which is more efficient than creating a materialized selection. Here are the possible return types.

In [37]:
@view x[1:2, 1]

2-element view(::Array{Float64,1}, 1:2) with eltype Float64:
 0.9304004265404249
 0.05231547935896752

In [38]:
@view x[1,1]

0-dimensional view(::Array{Float64,1}, 1) with eltype Float64:
0.9304004265404249

In [39]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Row,x1,x2
,Float64,Float64
1,0.9304,0.175895


In [40]:
@view x[1:2, 1:2] # a SubDataFrame

Row,x1,x2
,Float64,Float64
1,0.9304,0.175895
2,0.0523155,0.0234722
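
Because a view does not copy data, writes are visible on both sides. A minimal sketch with the illustrative names `demo` and `sub`:

```julia
using DataFrames

demo = DataFrame(x1 = [0.1, 0.2, 0.3], x2 = [1.0, 2.0, 3.0])

sub = @view demo[1:2, :]   # a SubDataFrame over the first two rows
sub[1, :x1] = 9.9          # writes into `demo`
demo[2, :x2] = -1.0        # is visible through `sub`

demo.x1[1], sub.x2[2], parent(sub) === demo   # (9.9, -1.0, true)
```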


### Adding new columns to a data frame

In [41]:
df = DataFrame()

Using `setproperty!`:

In [42]:
x = [1, 2, 3]
df.a = x
df

Row,a
,Int64
1,1
2,2
3,3


In [43]:
df.a === x # no copy is performed

true

Using `setindex!`:

In [44]:
df[!, :b] = x
df[:, :c] = x
df

Row,a,b,c
,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [45]:
df.b === x # no copy

true

In [46]:
df.c === x # copy

false

In [47]:
df[!, :d] .= x
df[:, :e] .= x
df

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [48]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)

Note that in our data frame columns `:a` and `:b` store the vector `x` itself (not a copy):

In [49]:
df.a === df.b === x

true

This can lead to silent errors. For example this code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [50]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we used `pop!` on column `:a`.

Such mistakes sometimes happen. Because of this, DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [51]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [52]:
collect(pairs(eachcol(df)))

5-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
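
One way to avoid this kind of aliasing is to give each column its own vector when adding it. The sketch below uses `copy` explicitly; the names `src` and `owned` are illustrative, and resizing columns by hand is shown only to contrast with the corrupted case above, not as a recommended practice:

```julia
using DataFrames

src = [1, 2, 3]
owned = DataFrame()
owned.a = copy(src)   # each column owns a separate vector
owned.b = copy(src)

for (n, c) in pairs(eachcol(owned))
    println("$n: ", pop!(c))   # prints 3 for both columns
end

size(owned), length(src)       # ((2, 2), 3)
```

Because no vector is shared, every column shrinks by exactly one element and `src` is left unchanged.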

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.