# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), July 16, 2019**

In [1]:
using DataFrames, Random
Random.seed!(1);

## Manipulating rows of DataFrame

### Selecting rows

In [2]:
df = DataFrame(rand(4, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.236033,0.488613,0.251662,0.424718,0.251379
2,0.346517,0.210968,0.986666,0.773223,0.0203749
3,0.312707,0.951916,0.555751,0.28119,0.287702
4,0.00790928,0.999905,0.437108,0.209472,0.859512


using `:` as row selector will copy columns

In [3]:
df[:, :]

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.236033,0.488613,0.251662,0.424718,0.251379
2,0.346517,0.210968,0.986666,0.773223,0.0203749
3,0.312707,0.951916,0.555751,0.28119,0.287702
4,0.00790928,0.999905,0.437108,0.209472,0.859512


this is the same as

In [4]:
copy(df)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.236033,0.488613,0.251662,0.424718,0.251379
2,0.346517,0.210968,0.986666,0.773223,0.0203749
3,0.312707,0.951916,0.555751,0.28119,0.287702
4,0.00790928,0.999905,0.437108,0.209472,0.859512


you can get a subset of rows of a data frame without copying using `view` to get a `SubDataFrame` 

In [5]:
sdf = view(df, 1:3, 1:3)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.236033,0.488613,0.251662
2,0.346517,0.210968,0.986666
3,0.312707,0.951916,0.555751


the view keeps a reference to its parent and to the row and column indices it selects

In [6]:
parent(sdf), parentindices(sdf)

(4×5 DataFrame
│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.236033   │ 0.488613 │ 0.251662 │ 0.424718 │ 0.251379  │
│ 2   │ 0.346517   │ 0.210968 │ 0.986666 │ 0.773223 │ 0.0203749 │
│ 3   │ 0.312707   │ 0.951916 │ 0.555751 │ 0.28119  │ 0.287702  │
│ 4   │ 0.00790928 │ 0.999905 │ 0.437108 │ 0.209472 │ 0.859512  │, (1:3, 1:3))
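
The difference between a copy and a view can be checked directly. The following is a minimal sketch on a small made-up data frame (the column names `a` and `b` and their values are illustrative, not taken from the notebook): writing to the copy leaves the parent alone, while writing through the view changes it.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

copied = df[:, :]          # new column vectors, independent of df
viewed = view(df, 1:3, :)  # SubDataFrame that refers back to df

copied[1, :a] = 100        # changes only the copy
viewed[1, :b] = -1         # writes through to the parent

df[1, :a]                  # 1: the parent did not see the change to the copy
df[1, :b]                  # -1: the parent sees the change made via the view
parent(viewed) === df      # true
```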

selecting a single row returns a `DataFrameRow` object which is also a view

In [7]:
dfr = df[3, :]

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
3,0.312707,0.951916,0.555751,0.28119,0.287702


In [8]:
parent(dfr), parentindices(dfr)

(4×5 DataFrame
│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.236033   │ 0.488613 │ 0.251662 │ 0.424718 │ 0.251379  │
│ 2   │ 0.346517   │ 0.210968 │ 0.986666 │ 0.773223 │ 0.0203749 │
│ 3   │ 0.312707   │ 0.951916 │ 0.555751 │ 0.28119  │ 0.287702  │
│ 4   │ 0.00790928 │ 0.999905 │ 0.437108 │ 0.209472 │ 0.859512  │, (3, Base.OneTo(5)))

let us add a column to a data frame by broadcasting a scalar into it

In [9]:
df[!, :Z] .= 1

4-element Array{Int64,1}:
 1
 1
 1
 1

In [10]:
df

Row,x1,x2,x3,x4,x5,Z
,Float64,Float64,Float64,Float64,Float64,Int64
1,0.236033,0.488613,0.251662,0.424718,0.251379,1
2,0.346517,0.210968,0.986666,0.773223,0.0203749,1
3,0.312707,0.951916,0.555751,0.28119,0.287702,1
4,0.00790928,0.999905,0.437108,0.209472,0.859512,1


Earlier we used `:` for column selection when creating the view `dfr`.
A view (`SubDataFrame` or `DataFrameRow`) created this way shows all columns of the parent, including columns added after the view was created.

In [11]:
dfr

Row,x1,x2,x3,x4,x5,Z
,Float64,Float64,Float64,Float64,Float64,Int64
3,0.312707,0.951916,0.555751,0.28119,0.287702,1


In [12]:
parent(dfr), parentindices(dfr)

(4×6 DataFrame
│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │ Z     │
│     │ Float64    │ Float64  │ Float64  │ Float64  │ Float64   │ Int64 │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┼───────┤
│ 1   │ 0.236033   │ 0.488613 │ 0.251662 │ 0.424718 │ 0.251379  │ 1     │
│ 2   │ 0.346517   │ 0.210968 │ 0.986666 │ 0.773223 │ 0.0203749 │ 1     │
│ 3   │ 0.312707   │ 0.951916 │ 0.555751 │ 0.28119  │ 0.287702  │ 1     │
│ 4   │ 0.00790928 │ 0.999905 │ 0.437108 │ 0.209472 │ 0.859512  │ 1     │, (3, Base.OneTo(6)))
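
To make the role of the `:` column selector concrete, here is a small sketch comparing two views of the same parent. The names `all_cols` and `fixed_cols` and the data are made up for illustration, and the behaviour shown assumes a DataFrames.jl version at least as recent as the one used in this notebook.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
df = DataFrame(a = 1:3, b = 4:6)

all_cols   = view(df, 1:2, :)    # columns selected with `:`
fixed_cols = view(df, 1:2, 1:2)  # columns selected explicitly

df[!, :c] .= 0                   # add a column to the parent

size(all_cols, 2)                # 3: the new column is visible in this view
size(fixed_cols, 2)              # 2: still only the original selection
```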

### Reordering rows

We create a random data frame (and hope that column `x.x` is not sorted, which is quite likely with 12 rows)

In [13]:
x = DataFrame(id=1:12, x = rand(12), y = [zeros(6); ones(6)])

Row,id,x,y
,Int64,Float64,Float64
1,1,0.0769509,0.0
2,2,0.640396,0.0
3,3,0.873544,0.0
4,4,0.278582,0.0
5,5,0.751313,0.0
6,6,0.644883,0.0
7,7,0.0778264,1.0
8,8,0.848185,1.0
9,9,0.0856352,1.0
10,10,0.553206,1.0


check if a DataFrame or a subset of its columns is sorted

In [14]:
issorted(x), issorted(x, :x)

(true, false)

we sort x in place

In [15]:
sort!(x, :x)

Row,id,x,y
,Int64,Float64,Float64
1,1,0.0769509,0.0
2,7,0.0778264,1.0
3,9,0.0856352,1.0
4,12,0.185821,1.0
5,4,0.278582,0.0
6,11,0.46335,1.0
7,10,0.553206,1.0
8,2,0.640396,0.0
9,6,0.644883,0.0
10,5,0.751313,0.0


`sort` without `!` leaves `x` unchanged and returns a new DataFrame

In [16]:
y = sort(x, :id)

Row,id,x,y
,Int64,Float64,Float64
1,1,0.0769509,0.0
2,2,0.640396,0.0
3,3,0.873544,0.0
4,4,0.278582,0.0
5,5,0.751313,0.0
6,6,0.644883,0.0
7,7,0.0778264,1.0
8,8,0.848185,1.0
9,9,0.0856352,1.0
10,10,0.553206,1.0


here we sort by two columns, first is decreasing, second is increasing

In [17]:
sort(x, (:y, :x), rev=(true, false))

Row,id,x,y
,Int64,Float64,Float64
1,7,0.0778264,1.0
2,9,0.0856352,1.0
3,12,0.185821,1.0
4,11,0.46335,1.0
5,10,0.553206,1.0
6,8,0.848185,1.0
7,1,0.0769509,0.0
8,4,0.278582,0.0
9,2,0.640396,0.0
10,6,0.644883,0.0


In [18]:
sort(x, (order(:y, rev=true), :x)) # the same as above

Row,id,x,y
,Int64,Float64,Float64
1,7,0.0778264,1.0
2,9,0.0856352,1.0
3,12,0.185821,1.0
4,11,0.46335,1.0
5,10,0.553206,1.0
6,8,0.848185,1.0
7,1,0.0769509,0.0
8,4,0.278582,0.0
9,2,0.640396,0.0
10,6,0.644883,0.0


now a fancier sort: `y` decreasing, then `x` ordered by its negated value (so also decreasing)

In [19]:
sort(x, (order(:y, rev=true), order(:x, by=v->-v)))

Row,id,x,y
,Int64,Float64,Float64
1,8,0.848185,1.0
2,10,0.553206,1.0
3,11,0.46335,1.0
4,12,0.185821,1.0
5,9,0.0856352,1.0
6,7,0.0778264,1.0
7,3,0.873544,0.0
8,5,0.751313,0.0
9,6,0.644883,0.0
10,2,0.640396,0.0
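
The effect of `order` is easier to follow on a frame small enough to sort by hand. The sketch below uses made-up data (columns `id`, `g`, `x` are illustrative) and passes the sort columns as a vector, which is the form recent DataFrames.jl releases document; treat that choice as an assumption relative to the tuple form used in this notebook.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
df = DataFrame(id = 1:4, g = [0, 1, 0, 1], x = [0.3, 0.1, 0.4, 0.2])

# g decreasing, then x increasing within each value of g
s1 = sort(df, [order(:g, rev=true), :x])
s1.id   # [2, 4, 1, 3]

# g decreasing, then x decreasing
s2 = sort(df, [order(:g, rev=true), order(:x, rev=true)])
s2.id   # [4, 2, 3, 1]

# sorting x by its negated value gives the same order as rev=true
s3 = sort(df, [order(:g, rev=true), order(:x, by = v -> -v)])
s3 == s2   # true
```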


this is how you can reorder rows (here randomly); note that `shuffle(1:10)` picks only the first 10 of the 12 rows

In [20]:
x[shuffle(1:10), :]

Row,id,x,y
,Int64,Float64,Float64
1,12,0.185821,1.0
2,11,0.46335,1.0
3,6,0.644883,0.0
4,2,0.640396,0.0
5,1,0.0769509,0.0
6,10,0.553206,1.0
7,7,0.0778264,1.0
8,4,0.278582,0.0
9,9,0.0856352,1.0
10,5,0.751313,0.0
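
Reordering by an index vector is a general pattern: any permutation of the row numbers can be used as the row selector. A minimal sketch, on made-up data, that shuffles every row and then restores the original order:

```julia
using DataFrames, Random

# Illustrative data; not taken from the notebook.
df = DataFrame(id = 1:5, x = [0.5, 0.1, 0.4, 0.2, 0.3])

perm = shuffle(1:nrow(df))   # a random permutation of all row numbers
shuffled = df[perm, :]       # new data frame with rows in that order

sort(shuffled, :id) == df    # true: sorting by id undoes the shuffle
```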


it is also easy to swap rows using broadcasted assignment

In [21]:
sort!(x, :id)
x[[1,10],:] .= x[[10,1],:]
x

Row,id,x,y
,Int64,Float64,Float64
1,10,0.553206,1.0
2,2,0.640396,0.0
3,3,0.873544,0.0
4,4,0.278582,0.0
5,5,0.751313,0.0
6,6,0.644883,0.0
7,7,0.0778264,1.0
8,8,0.848185,1.0
9,9,0.0856352,1.0
10,1,0.0769509,0.0


In [22]:
# x[1,:], x[10,:] = x[10,:], x[1,:] # this is currently impossible, but will be allowed in the future
# x

### Merging/adding rows

In [23]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443


merge by rows: the data frames must have the same column names; this is equivalent to `vcat`

In [24]:
[x; x]

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443


you can efficiently `vcat` a vector of `DataFrame`s using `reduce`

In [25]:
reduce(vcat, [x, x, x])

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,0.440897,0.953803,0.0135403,0.596537,0.548635
8,0.404673,0.0951856,0.303399,0.638935,0.262992
9,0.736787,0.519675,0.702557,0.872347,0.526443


create `y` with the same columns as `x`, but in reverse order

In [26]:
y = x[:, reverse(names(x))]

Row,x5,x4,x3,x2,x1
,Float64,Float64,Float64,Float64,Float64
1,0.548635,0.596537,0.0135403,0.953803,0.440897
2,0.262992,0.638935,0.303399,0.0951856,0.404673
3,0.526443,0.872347,0.702557,0.519675,0.736787


`vcat` is still possible as it does column name matching

In [27]:
vcat(x, y)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443


but column names must still match

In [28]:
vcat(x, y[:, 1:3])

ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

unless you pass `:intersect`, `:union` or specific column names as keyword argument `cols`

In [29]:
vcat(x, y[:, 1:3], cols=:intersect)

Row,x3,x4,x5
,Float64,Float64,Float64
1,0.0135403,0.596537,0.548635
2,0.303399,0.638935,0.262992
3,0.702557,0.872347,0.526443
4,0.0135403,0.596537,0.548635
5,0.303399,0.638935,0.262992
6,0.702557,0.872347,0.526443


In [30]:
vcat(x, y[:, 1:3], cols=:union)

Row,x1,x2,x3,x4,x5
,Float64⍰,Float64⍰,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,missing,missing,0.0135403,0.596537,0.548635
5,missing,missing,0.303399,0.638935,0.262992
6,missing,missing,0.702557,0.872347,0.526443


In [31]:
vcat(x, y[:, 1:3], cols=[:x1, :x5])

Row,x1,x5
,Float64⍰,Float64
1,0.440897,0.548635
2,0.404673,0.262992
3,0.736787,0.526443
4,missing,0.548635
5,missing,0.262992
6,missing,0.526443
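
The two symbolic `cols` options can be compared side by side on a pair of tiny frames. This is a sketch with made-up data (the frames `a` and `b` and their columns `k` and `v` are illustrative): `:union` keeps every column and fills the gaps with `missing`, while `:intersect` keeps only the shared columns.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
a = DataFrame(k = [1, 2], v = ["x", "y"])
b = DataFrame(k = [3])                  # has no column v

u = vcat(a, b, cols = :union)           # columns k and v, 3 rows
ismissing(u[3, :v])                     # true: b had no value for v

i = vcat(a, b, cols = :intersect)       # only column k, 3 rows
size(i)                                 # (3, 1)
```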


`append!` modifies `x` in place

In [32]:
append!(x, x)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443


here column names must match exactly

In [33]:
append!(x, y)

ErrorException: Column names do not match

standard `repeat` function works on rows; also `inner` and `outer` keyword arguments are accepted

In [34]:
repeat(x, 2)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,0.440897,0.953803,0.0135403,0.596537,0.548635
8,0.404673,0.0951856,0.303399,0.638935,0.262992
9,0.736787,0.519675,0.702557,0.872347,0.526443
10,0.440897,0.953803,0.0135403,0.596537,0.548635


`push!` adds one row to `x` at the end; one must pass a correct number of values

In [35]:
push!(x, 1:5)
x

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,1.0,2.0,3.0,4.0,5.0


also works with dictionaries

In [36]:
push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15))
x

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,1.0,2.0,3.0,4.0,5.0
8,11.0,12.0,13.0,14.0,15.0


and `NamedTuples` via name matching

In [37]:
push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,1.0,2.0,3.0,4.0,5.0
8,11.0,12.0,13.0,14.0,15.0
9,1.0,2.0,3.0,4.0,5.0


and `DataFrameRow` also via name matching

In [38]:
push!(x, x[1, :])

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.440897,0.953803,0.0135403,0.596537,0.548635
2,0.404673,0.0951856,0.303399,0.638935,0.262992
3,0.736787,0.519675,0.702557,0.872347,0.526443
4,0.440897,0.953803,0.0135403,0.596537,0.548635
5,0.404673,0.0951856,0.303399,0.638935,0.262992
6,0.736787,0.519675,0.702557,0.872347,0.526443
7,1.0,2.0,3.0,4.0,5.0
8,11.0,12.0,13.0,14.0,15.0
9,1.0,2.0,3.0,4.0,5.0
10,0.440897,0.953803,0.0135403,0.596537,0.548635
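
The `push!` variants shown above can be collected in one short sketch. It starts from an empty, typed data frame (the column names `a` and `b` are made up for illustration) and adds one row per input kind; the tuple is matched by position, the other three by name.

```julia
using DataFrames

# Illustrative, initially empty frame with typed columns.
df = DataFrame(a = Int[], b = String[])

push!(df, (1, "one"))                   # Tuple: matched by position
push!(df, Dict(:a => 2, :b => "two"))   # Dict: matched by name
push!(df, (b = "three", a = 3))         # NamedTuple: matched by name, any order
push!(df, df[1, :])                     # DataFrameRow: matched by name

df.a   # [1, 2, 3, 1]
df.b   # ["one", "two", "three", "one"]
```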


### Subsetting/removing rows

In [39]:
x = DataFrame(id=1:10, val='a':'j')

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'
3,3,'c'
4,4,'d'
5,5,'e'
6,6,'f'
7,7,'g'
8,8,'h'
9,9,'i'
10,10,'j'


by using indexing

In [40]:
x[1:2, :]

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'


a single row selection creates a `DataFrameRow`

In [41]:
x[1, :]

Row,id,val
,Int64,Char
1,1,'a'


but this is a `DataFrame`

In [42]:
x[1:1, :]

Row,id,val
,Int64,Char
1,1,'a'


the same rows as `x[1:2, :]`, but as a view

In [43]:
view(x, 1:2, :)

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'


this view selects all rows and columns 1 and 2

In [44]:
view(x, :, 1:2)

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'
3,3,'c'
4,4,'d'
5,5,'e'
6,6,'f'
7,7,'g'
8,8,'h'
9,9,'i'
10,10,'j'


indexing by `Bool`; an exact length match is required

In [45]:
x[repeat([true, false], 5), :]

Row,id,val
,Int64,Char
1,1,'a'
2,3,'c'
3,5,'e'
4,7,'g'
5,9,'i'


alternatively we can also create a view

In [46]:
view(x, repeat([true, false], 5), :)

Row,id,val
,Int64,Char
1,1,'a'
2,3,'c'
3,5,'e'
4,7,'g'
5,9,'i'


we can delete one row in place

In [47]:
deleterows!(x, 7)

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'
3,3,'c'
4,4,'d'
5,5,'e'
6,6,'f'
7,8,'h'
8,9,'i'
9,10,'j'


or a collection of rows, also in place

In [48]:
deleterows!(x, 6:7)

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'
3,3,'c'
4,4,'d'
5,5,'e'
6,9,'i'
7,10,'j'


you can also create a new `DataFrame` when deleting rows using `Not` indexing

In [49]:
x[Not(1:2), :]

Row,id,val
,Int64,Char
1,3,'c'
2,4,'d'
3,5,'e'
4,9,'i'
5,10,'j'


In [50]:
x

Row,id,val
,Int64,Char
1,1,'a'
2,2,'b'
3,3,'c'
4,4,'d'
5,5,'e'
6,9,'i'
7,10,'j'
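
Both non-mutating ways of dropping rows, `Not` and a `Bool` mask, can be sketched on a small made-up frame (the names `kept`, `mask` and `big` are illustrative). In each case the original data frame keeps all of its rows.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
df = DataFrame(id = 1:5, val = ['a', 'b', 'c', 'd', 'e'])

kept = df[Not(2:3), :]   # new data frame without rows 2 and 3
kept.id                  # [1, 4, 5]

mask = df.id .> 3        # Bool vector with one entry per row
big = df[mask, :]        # new data frame with the rows where mask is true
big.id                   # [4, 5]

nrow(df)                 # 5: df itself is unchanged
```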


now we move to row filtering

In [51]:
x = DataFrame([1:4, 2:5, 3:6])

Row,x1,x2,x3
,Int64,Int64,Int64
1,1,2,3
2,2,3,4
3,3,4,5
4,4,5,6


create a new `DataFrame` where filtering function operates on `DataFrameRow`

In [52]:
filter(r -> r.x1 > 2.5, x)

Row,x1,x2,x3
,Int64,Int64,Int64
1,3,4,5
2,4,5,6


in place modification of `x`, an example with `do`-block syntax

In [53]:
filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end

Row,x1,x2,x3
,Int64,Int64,Int64
1,1,2,3
2,3,4,5


### Deduplicating

In [54]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x.C = 1:4
x

Row,A,B,C
,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


keep the first occurrence of each unique row, comparing only the given columns

In [55]:
unique(x, [1,2])

Row,A,B,C
,Int64,String,Int64
1,1,x,1
2,2,y,2


now we look at whole rows

In [56]:
unique(x)

Row,A,B,C
,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


get indicators of non-unique rows

In [57]:
nonunique(x, :A)

4-element Array{Bool,1}:
 false
 false
  true
  true

modify `x` in place

In [58]:
unique!(x, :B)

Row,A,B,C
,Int64,String,Int64
1,1,x,1
2,2,y,2


### Extracting one row from a `DataFrame` into standard collections

In [59]:
x = DataFrame(x=[1,missing,2], y=["a", "b", missing], z=[true,false,true])

Row,x,y,z
,Int64⍰,String⍰,Bool
1,1,a,true
2,missing,b,false
3,2,missing,true


In [60]:
cols = [:y, :z]

2-element Array{Symbol,1}:
 :y
 :z

you can use a conversion to a `Vector`

In [61]:
Vector(x[1, cols])

2-element Array{Any,1}:
     "a"
 true   

now you will get a vector of vectors

In [62]:
[Vector(x[i, cols]) for i in axes(x, 1)]

3-element Array{Array{Any,1},1}:
 ["a", true]    
 ["b", false]   
 [missing, true]

it is easy to convert a `DataFrameRow` into a `NamedTuple`

In [63]:
copy(x[1, cols])

(y = "a", z = true)

or a `Tuple`

In [64]:
Tuple(x[1, cols])

("a", true)
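
The same conversions can be applied to every row by iterating with `eachrow`. The sketch below uses a made-up two-row frame (the names `tuples` and `named` are illustrative) and relies on `copy` of a `DataFrameRow` returning a `NamedTuple`, as shown above.

```julia
using DataFrames

# Illustrative data; not taken from the notebook.
df = DataFrame(y = ["a", "b"], z = [true, false])

tuples = [Tuple(r) for r in eachrow(df)]   # [("a", true), ("b", false)]
named  = [copy(r) for r in eachrow(df)]    # vector of NamedTuples

named[2].y                                 # "b"
Vector(df[1, :])                           # Any["a", true]
```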