# Basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [1]:
using DataFrames

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as well as `nrow` and `ncol`, which are named after the corresponding R functions.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of data in your `DataFrame` (check out the help of `describe` for information on how to customize shown statistics).

In [5]:
describe(x)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,Type
1,A,1.5,1,1.5,2,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
3,C,,a,,b,0,String
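
As a minimal sketch of the customization mentioned above, the statistics to report can be passed to `describe` as positional arguments. The data frame below mirrors `x`; the statistic name `first_value` is an arbitrary name chosen for illustration.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Built-in statistics are requested by name, as symbols.
s = describe(x, :min, :max, :nmissing)

# A custom statistic is passed as a `function => name` pair;
# `first_value` is a made-up column name for this example.
n = describe(x, first => :first_value)
```

Each call returns an ordinary `DataFrame` with a `variable` column followed by one column per requested statistic.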


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Int64,Type
1,A,1.5,1.0,1.5,2.0,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of columns with a given `eltype`:

In [8]:
names(x, String)

1-element Vector{String}:
 "C"
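
The type passed to `names` is matched against each column's `eltype` with a subtype check, so columns that allow `missing` need a matching `Union`. The sketch below illustrates this on the same data as `x`, and also shows that `names` accepts the usual column selectors; the variable names are only for illustration.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Column B has eltype Union{Missing, Float64}, which is not a subtype of Real,
# so only A matches here.
real_cols = names(x, Real)

# Widening the type to allow missing values matches both A and B.
numeric_cols = names(x, Union{Missing, Real})

# Column selectors also work: everything except A, and names matching a regex.
not_a = names(x, Not(:A))
a_or_b = names(x, r"^[AB]$")
```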

Use `propertynames` to get a vector of `Symbol`s:

In [9]:
propertynames(x)

3-element Vector{Symbol}:
 :A
 :B
 :C

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [10]:
eltype.(eachcol(x))

3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [11]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,1,9,2,10,3,10,9,10
2,5,1,7,9,8,2,10,1,4,1
3,9,10,5,3,9,1,3,5,3,1
4,2,4,1,2,10,5,2,5,4,5
5,5,5,5,10,8,6,6,8,9,8
6,7,5,6,5,9,8,7,3,3,8
7,6,6,4,9,6,8,10,1,8,9
8,6,3,7,5,2,8,4,2,7,5
9,10,3,1,8,7,7,1,2,9,1
10,5,10,2,9,4,10,6,6,10,8


and then we can use `first` to peek into its first few rows

In [12]:
first(y, 5)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,1,9,2,10,3,10,9,10
2,5,1,7,9,8,2,10,1,4,1
3,9,10,5,3,9,1,3,5,3,1
4,2,4,1,2,10,5,2,5,4,5
5,5,5,5,10,8,6,6,8,9,8


and `last` to see its bottom rows.

In [13]:
last(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,6,5,3,9,7,10,4,4,1,1
2,8,1,1,4,5,1,3,5,1,4
3,8,8,4,3,7,5,8,8,1,8


Using `first` and `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`

In [14]:
first(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,1,9,2,10,3,10,9,10


In [15]:
last(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,8,8,4,3,7,5,8,8,1,8
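
A `DataFrameRow` is a view into its parent data frame, not a copy. The following sketch, on a small made-up data frame, shows the common ways to read values from it and that assigning to it changes the parent.

```julia
using DataFrames

y = DataFrame(x1 = [1, 2, 3], x2 = [10, 20, 30])

r = first(y)                      # DataFrameRow pointing at row 1 of y
vals = (r.x1, r[:x2], r["x2"])    # property, Symbol, and string access

nt = NamedTuple(r)                # independent snapshot of the row

r.x1 = 100                        # writes through to y
```

Converting to a `NamedTuple` is a simple way to keep the values of a row without holding a reference to the data frame.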


## Displaying large data frames

Create a wide and tall data frame:

In [16]:
df = DataFrame(rand(100, 100), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.895515,0.122478,0.43021,0.947462,0.177689,0.940036,0.0800526,0.0605464
2,0.79084,0.0437097,0.883725,0.653532,0.801702,0.00766898,0.898272,0.352463
3,0.617988,0.387487,0.318054,0.764076,0.506029,0.505076,0.858486,0.020337
4,0.33011,0.160413,0.0273744,0.277932,0.660373,0.302,0.236495,0.424567
5,0.137219,0.751608,0.359242,0.614104,0.363926,0.958595,0.776145,0.757605
6,0.883833,0.446591,0.843669,0.0150616,0.168174,0.259399,0.892036,0.724112
7,0.945243,0.968535,0.0571247,0.43819,0.232547,0.229044,0.831233,0.612208
8,0.972199,0.440778,0.594305,0.508575,0.56609,0.680669,0.157559,0.656047
9,0.641235,0.0841282,0.922782,0.226138,0.177105,0.33052,0.867605,0.945299
10,0.614527,0.680108,0.220532,0.648881,0.35948,0.85442,0.651811,0.321926


We can see that 92 of its columns were not printed, and that only its first 30 rows are shown. You can change this behavior by setting `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [17]:
ENV["LINES"] = 10

10

In [18]:
ENV["COLUMNS"] = 200

200

In [19]:
df

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.895515,0.122478,0.43021,0.947462,0.177689,0.940036,0.0800526,0.0605464,0.144485,0.771371,0.87796,0.505235,0.286084,0.587084,0.635324,0.0321169,0.961375,0.911213,0.0926078
2,0.79084,0.0437097,0.883725,0.653532,0.801702,0.00766898,0.898272,0.352463,0.680297,0.0731915,0.981606,0.639526,0.241439,0.694869,0.736835,0.725853,0.260728,0.726286,0.660242
3,0.617988,0.387487,0.318054,0.764076,0.506029,0.505076,0.858486,0.020337,0.202723,0.935914,0.460463,0.112408,0.573712,0.850772,0.261814,0.569291,0.162676,0.170768,0.280298
4,0.33011,0.160413,0.0273744,0.277932,0.660373,0.302,0.236495,0.424567,0.475889,0.293931,0.235654,0.692486,0.348565,0.40103,0.177299,0.25912,0.248149,0.523958,0.290321
5,0.137219,0.751608,0.359242,0.614104,0.363926,0.958595,0.776145,0.757605,0.900448,0.555793,0.33023,0.14108,0.524951,0.342711,0.776811,0.632302,0.233622,0.063797,0.891215
6,0.883833,0.446591,0.843669,0.0150616,0.168174,0.259399,0.892036,0.724112,0.918434,0.143452,0.645022,0.276158,0.627481,0.511584,0.131376,0.763089,0.655206,0.805891,0.892849
7,0.945243,0.968535,0.0571247,0.43819,0.232547,0.229044,0.831233,0.612208,0.206948,0.290748,0.815872,0.270091,0.0599951,0.724633,0.866094,0.921528,0.485681,0.099164,0.635257
8,0.972199,0.440778,0.594305,0.508575,0.56609,0.680669,0.157559,0.656047,0.246124,0.758434,0.150791,0.0187106,0.23321,0.521247,0.706315,0.110001,0.941937,0.321936,0.362289
9,0.641235,0.0841282,0.922782,0.226138,0.177105,0.33052,0.867605,0.945299,0.646767,0.426581,0.0123345,0.287967,0.3017,0.180796,0.420178,0.64986,0.30148,0.935868,0.0858095
10,0.614527,0.680108,0.220532,0.648881,0.35948,0.85442,0.651811,0.321926,0.613538,0.875944,0.786885,0.742021,0.945884,0.290851,0.970736,0.714771,0.946009,0.67625,0.917979
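
Besides the environment variables, the amount of output can also be controlled per call through the `allrows` and `allcols` keyword arguments of `show`. The sketch below writes to an `IOBuffer` only so that the printed text can be inspected; the variable names are illustrative.

```julia
using DataFrames

df = DataFrame(rand(100, 100), :auto)

# Print every column but still crop the rows; `allrows = true` would do the
# same for rows. The output goes to a buffer here only so it can be inspected.
io = IOBuffer()
show(io, df; allcols = true, allrows = false)
text = String(take!(io))

# In interactive use one would simply call: show(df, allcols = true)
```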


## Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [20]:
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [21]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [22]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [23]:
x[:, 1] # note that this creates a copy

2-element Vector{Int64}:
 1
 2

In [24]:
x[:, 1] === x[:, 1]

false

To grab one row as a `DataFrame`, we can index as follows.

In [25]:
x[1:1, :]

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


In [26]:
x[1, :] # this produces a DataFrameRow which is treated as 1-dimensional object similar to a NamedTuple

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


We can grab a single cell with the same syntax that is used to grab an element of an array.

In [27]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [28]:
x[1:2, 1:2]

Row,A,B
,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns

In [29]:
x[Not(1), r"A"]

Row,A
,Int64
1,2


In [30]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


In [31]:
x[:, Not(1)] # : means that the columns will get copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


Assignment of a scalar to a data frame can be done in ranges using broadcasting:

In [32]:
x[1:2, 1:2] .= 1
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


A vector whose length equals the number of assigned rows can be assigned using broadcasting:

In [33]:
x[1:2, 1:2] .= [1,2]
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


Another data frame of matching size and column names can be assigned as well, again using broadcasting:

In [34]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Row,A,B,C
,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b
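
The broadcasting assignments above write into the existing columns, so the assigned values must be convertible to each column's element type. A minimal sketch of the difference between `:` and `!` as the row selector in a broadcasting assignment, on a small made-up data frame:

```julia
using DataFrames

df = DataFrame(A = [1, 2], B = [1.0, 2.0])

# `:` writes into the existing column, so the element type stays Int64.
df[:, :A] .= 10
type_after_colon = eltype(df.A)

# A value that does not fit the existing column type is rejected.
inplace_failed = try
    df[:, :A] .= 1.5
    false
catch
    true
end

# `!` replaces the column with a freshly allocated one, so the type can change.
df[!, :A] .= 1.5
type_after_bang = eltype(df.A)
```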


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt the data in `df` if you resize, sort, or otherwise mutate the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, where `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df`. Modifying the data frame obtained via `df[!, cols]` can therefore break the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
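
The difference between the two access styles can be checked directly with `===`. The sketch below uses a small made-up data frame; all variable names are illustrative.

```julia
using DataFrames

df = DataFrame(A = [1, 2], B = [3, 4])

alias = df[!, :A]    # the stored vector itself
copied = df[:, :A]   # an independent copy

alias[1] = 100       # visible in df
copied[2] = -1       # not visible in df

sub_shared = df[!, [:A]]   # new data frame sharing df's column
sub_copied = df[:, [:A]]   # new data frame with a copied column
```

After running this, `df.A` is `[100, 2]`: only the write through `alias` reached the data frame.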

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [35]:
x = DataFrame(rand(4, 5), :auto)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.059842,0.59467,0.115021,0.75207,0.424398
2,0.264698,0.0942013,0.269276,0.242939,0.949513
3,0.423581,0.63315,0.34089,0.286639,0.181045
4,0.884411,0.350776,0.456152,0.657085,0.779614


In [36]:
x[:, Between(:x2, :x4)]

Row,x2,x3,x4
,Float64,Float64,Float64
1,0.59467,0.115021,0.75207
2,0.0942013,0.269276,0.242939
3,0.63315,0.34089,0.286639
4,0.350776,0.456152,0.657085


In [37]:
x[:, Cols("x1", Between("x2", "x4"))]

Row,x1,x2,x3,x4
,Float64,Float64,Float64,Float64
1,0.059842,0.59467,0.115021,0.75207
2,0.264698,0.0942013,0.269276,0.242939
3,0.423581,0.63315,0.34089,0.286639
4,0.884411,0.350776,0.456152,0.657085
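
A few more combinations of these selectors, as a hedged sketch on a data frame shaped like the one above. `names` is used here only to make the selected columns easy to see.

```julia
using DataFrames

x = DataFrame(rand(4, 5), :auto)

# Cols takes the union of its arguments, keeping first-seen order.
picked = names(x, Cols(r"x[12]", :x5))

# Putting `:` last appends all remaining columns, which moves x5 to the front.
reordered = names(x[:, Cols(:x5, :)])

# Selectors can be negated with Not.
outside = names(x, Not(Between(:x2, :x4)))
```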


## Views

You can create a view of a `DataFrame` with `@view`, which is more efficient than creating a materialized selection because no data is copied. The type of the returned object depends on how you index:

In [38]:
@view x[1:2, 1]

2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.05984198400800622
 0.2646975736612095

In [39]:
@view x[1,1]

0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.05984198400800622

In [40]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Row,x1,x2
,Float64,Float64
1,0.059842,0.59467


In [41]:
@view x[1:2, 1:2] # a SubDataFrame

Row,x1,x2
,Float64,Float64
1,0.059842,0.59467
2,0.264698,0.0942013
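
Because a view shares data with its parent, writing to it changes the parent. The sketch below, on a small made-up data frame, shows this and how `copy` turns a `SubDataFrame` back into an independent `DataFrame`.

```julia
using DataFrames

x = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

sdf = @view x[1:2, :]      # SubDataFrame over the first two rows
sdf[1, :a] = 100           # writes through to x

independent = copy(sdf)    # materialize the view as a new DataFrame
independent[1, :a] = -1    # x is unaffected
```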


## Adding new columns to a data frame

In [42]:
df = DataFrame()

using `setproperty!`

In [43]:
x = [1, 2, 3]
df.a = x
df

Row,a
,Int64
1,1
2,2
3,3


In [44]:
df.a === x # no copy is performed

true

using `setindex!`

In [45]:
df[!, :b] = x
df[:, :c] = x
df

Row,a,b,c
,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [46]:
df.b === x # no copy

true

In [47]:
df.c === x # copy

false

In [48]:
df[!, :d] .= x
df[:, :e] .= x
df

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [49]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)

Note that in our data frame the columns `:a` and `:b` store the vector `x` itself (not a copy):

In [50]:
df.a === df.b === x

true

This can lead to silent errors. For example this code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [51]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we called `pop!` on column `:a`.

Such mistakes sometimes happen. Because of this DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [52]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [53]:
collect(pairs(eachcol(df)))

5-element Vector{Pair{Symbol, AbstractVector}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
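
One way to avoid this kind of aliasing is to make sure every column owns its storage. A minimal sketch, assuming the same source vector `x` as above (the data frame name `safe` is illustrative):

```julia
using DataFrames

x = [1, 2, 3]

# The keyword constructor copies input vectors by default (copycols = true),
# so each column gets its own storage.
safe = DataFrame(a = x, b = x)

# An explicit copy does the same for property assignment.
safe.c = copy(x)

# Resizing the source vector now leaves the data frame untouched.
push!(x, 4)
```

Here `safe` still has three rows after `push!`, because none of its columns is the vector `x`.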

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

## Comparisons

In [54]:
using DataFrames

In [55]:
df = DataFrame(rand(2,3), :auto)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.910762,0.354667,0.630757
2,0.79466,0.13567,0.0295027


In [56]:
df2 = copy(df)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.910762,0.354667,0.630757
2,0.79466,0.13567,0.0295027


In [57]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison:

In [58]:
df3 = df2 .+ eps()

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.910762,0.354667,0.630757
2,0.79466,0.13567,0.0295027


In [59]:
df == df3

false

In [60]:
isapprox(df, df3)

true

In [61]:
isapprox(df, df3, atol = eps()/2)

false
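
The same behavior can be reproduced with fixed numbers, which makes the role of the tolerances easier to see. This is a sketch with made-up values; as with `isapprox` in Julia Base, supplying a positive `atol` switches the default relative tolerance to zero.

```julia
using DataFrames

df = DataFrame(a = [1.0, 2.0])
shifted = df .+ 1e-10

default_tol = isapprox(df, shifted)            # default relative tolerance, about sqrt(eps())
operator_form = df ≈ shifted                   # `≈` is the operator form of isapprox
strict = isapprox(df, shifted, atol = 1e-12)   # absolute tolerance smaller than the shift
```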

`missing` values are handled as in Julia Base:

In [62]:
df = DataFrame(a=missing)

Row,a
,Missing
1,missing


In [63]:
df == df

missing

In [64]:
df === df

true

In [65]:
isequal(df, df)

true
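
The three possible outcomes of `==` follow the three-valued logic of `missing`. A small sketch with made-up data frames shows when each occurs and how to turn the result into a plain `Bool`:

```julia
using DataFrames

a = DataFrame(v = [1, missing])
b = DataFrame(v = [1, missing])
c = DataFrame(v = [2, missing])

unknown = a == b                    # missing: the missing entries cannot be compared
different = a == c                  # false: 1 != 2 already settles the answer
same = isequal(a, b)                # true: isequal treats missing as equal to missing
as_bool = coalesce(a == b, false)   # replace an unknown result with false
```

Use `isequal` when you need a definite `true` or `false`, for example in an `if` condition.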