# Possible pitfalls

In [1]:
using DataFrames

## Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

 Row │ x1        x2        x3        x4        x5
     │ Float64   Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────
   1 │ 0.632706  0.828699  0.652296  0.470422  0.121272
   2 │ 0.328729  0.378696  0.289502  0.11514   0.15551
   3 │ 0.305744  0.694864  0.384864  0.867069  0.535597


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x, y=y) # columns are also copied when using keyword argument syntax

 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
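
The practical consequence of sharing columns is that later mutation of the source vector shows up in the data frame. The following is a minimal sketch of that effect; the names `v`, `owned`, and `shared` are illustrative and not part of the original notebook.

```julia
using DataFrames

v = [1, 2, 3]
owned = DataFrame(a=v)                    # default: the column is a copy of v
shared = DataFrame(a=v, copycols=false)   # the column is v itself

v[1] = 100                                # mutate the source vector

owned_first = owned.a[1]     # unaffected by the mutation
shared_first = shared.a[1]   # reflects the mutation
is_alias = shared.a === v    # the data frame stores the very same object
```

Whether sharing is desirable depends on the use case: it saves memory and time, but every holder of `v` can change the data frame behind your back.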

## Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      3
   3 │     1      5

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      4
   3 │     2      6


In [17]:
x[1:3, 1] = [2,2,2]
g # the grouping is wrong now: g is only a view of x and was not updated

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     2      3
   3 │     1      5

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      4
   3 │     2      6


In [18]:
s = view(x, 5:6, :)

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6


In [19]:
delete!(x, 3:6)

 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     2      2


In [20]:
s # error: the view still refers to rows 5:6, which no longer exist in x

BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
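
One way to stay out of trouble is sketched below, under the assumption that an extra copy is affordable: either modify an independent copy, or rebuild the grouping after changing the parent, and take a row slice instead of a `view` when the parent's rows may be deleted later. The names `parent_df`, `modified`, and the helper variables are illustrative. `deleteat!` assumes DataFrames.jl 1.3 or later (the notebook's `delete!` is the older name).

```julia
using DataFrames

parent_df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Option 1: modify an independent copy, so groups built on parent_df stay valid.
g = groupby(parent_df, :id)
modified = copy(parent_df)
modified[1:3, :id] = [2, 2, 2]
groups_still_valid = all(length(unique(sdf.id)) == 1 for sdf in g)

# Option 2: if the parent must change, rebuild the grouping afterwards.
parent_df[1:3, :id] = [2, 2, 2]
g = groupby(parent_df, :id)
consistent_after_rebuild = all(length(unique(sdf.id)) == 1 for sdf in g)
ngroups_after_rebuild = length(g)

# A row slice is an independent copy, unlike `view`,
# so it survives row deletion in the parent.
s = parent_df[5:6, :]
deleteat!(parent_df, 3:6)   # assumption: DataFrames.jl 1.3+
```

After the deletion, `s` still holds the two rows it was created with, because it no longer depends on `parent_df`.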

## Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

 Row │ a      b      c      d      e
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      1      1      1      1
   2 │     2      2      2      2      2
   3 │     3      3      3      3      3


 Row │ a      b      c      d      e
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │   100    100      1      1      1
   2 │     2      2      2      2      2
   3 │     3      3      3      3      3
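
The same distinction matters when a column is passed to a mutating function. The sketch below is illustrative only (the variable names are not from the original notebook); it uses `sort!` to show that mutating `df[:, :a]` leaves the data frame untouched, while mutating `df.a` changes it in place.

```julia
using DataFrames

df = DataFrame(a=[3, 1, 2])

sort!(df[:, :a])                      # sorts a temporary copy
unchanged = df.a == [3, 1, 2]         # the stored column keeps its order

sort!(df.a)                           # sorts the stored column in place
changed = df.a == [1, 2, 3]

same_object = df.a === df[!, :a]      # both return the stored vector
fresh_copy = df.a !== df[:, :a]       # `:` allocates a new vector
```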


## When iterating rows of a data frame

- use `eachrow` to avoid compilation cost (wide tables),
- but use `Tables.namedtupleiterator` for fast execution (tall tables).

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

 Row │ x1    x2    x3       x4       x5    x6    x7     x8    x9     x10   x11      x12   ⋯
     │ Char  Bool  Float64  Float64  Char  Char  Int64  Bool  Int64  Bool  Float64  Bool  ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ a        0      1.0      1.0  a     a         1     0      1     0      1.0     0  ⋯
   2 │ b        1      2.0      2.0  b     b         2     1      2     1      2.0     1  ⋯


In [23]:
@time collect(eachrow(df1))

  0.054688 seconds (86.08 k allocations: 4.747 MiB, 99.92% compilation time)


2-element Vector{DataFrameRow}:
 DataFrameRow
 Row │ x1    x2     x3       x4       x5    x6    x7     x8     x9     x10    ⋯
     │ Char  Bool   Float64  Float64  Char  Char  Int64  Bool   Int64  Bool   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ a     false      1.0      1.0  a     a         1  false      1  false  ⋯
                                                             890 columns omitted
 DataFrameRow
 Row │ x1    x2    x3       x4       x5    x6    x7     x8    x9     x10   x11 ⋯
     │ Char  Bool  Float64  Float64  Char  Char  Int64  ⋯

In [24]:
@time collect(Tables.namedtupleiterator(df1));

 12.036778 seconds (2.53 M allocations: 160.368 MiB, 0.69% gc time, 99.92% compilation time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case (about 12 seconds, versus about 0.05 seconds for `eachrow`), and it would get much worse if the table were wider. That is why this tip is included in the pitfalls notebook.

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

 Row │ x1        x2        x3        x4          x5        x6          x7         x8         ⋯
     │ Float64   Float64   Float64   Float64     Float64   Float64     Float64    Float64    ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.645921  0.491887  0.389992  0.228953    0.654484  0.948857    0.581677   0.0193141  ⋯
   2 │ 0.118609  0.368878  0.980045  0.587752    0.909204  0.907201    0.900109   0.59192    ⋯
   3 │ 0.169184  0.344385  0.668348  0.480096    0.890114  0.0845076   0.0166872  0.321769   ⋯
   4 │ 0.241078  0.916109  0.439923  0.876351    0.436529  0.465807    0.830802   0.109801   ⋯
   5 │ 0.778605  0.365662  0.544145  0.188905    0.191452  0.30328     0.586022   0.127979   ⋯
   6 │ 0.911988  0.767918  0.936312  0.730067    0.831546  0.841071    0.835001   0.0632923  ⋯
   7 │ 0.668706  0.55897   0.663811  0.182541    0.95669   0.78937     0.258726   0.938542   ⋯
   8 │ 0.935811  0.335218  0.729323  0.79415     0.411703  0.487489    0.0656464  0.426007   ⋯
   9 │ 0.442173  0.310678  0.213951  0.00452081  0.352958  0.872389    0.458232   0.0962693  ⋯
  10 │ 0.726673  0.773645  0.953159  0.624539    0.264796  0.00210201  0.653499   0.0157098  ⋯
   ⋮


In [26]:
@time map(sum, eachrow(df2))

  4.332456 seconds (60.19 M allocations: 1.061 GiB, 12.51% gc time, 2.66% compilation time)


1000000-element Vector{Float64}:
 5.232411779775741
 6.027783651517928
 4.009419962874653
 5.03823050700658
 4.413149060902737
 7.128282987803032
 6.098800005138575
 5.383892749790318
 4.06348101185808
 5.174425492450083
 5.457695281098244
 5.309960192944014
 6.382438499807212
 ⋮
 5.501632296667598
 6.318108405565564
 5.625675913657931
 4.3682926922091765
 5.108101619554987
 4.310174474396364
 4.752829000387278
 5.129459228916927
 4.9650116172529035
 5.316763267689454
 3.6612234918781033
 4.924316432677755

In [27]:
@time map(sum, eachrow(df2))

  3.996981 seconds (59.99 M allocations: 1.050 GiB, 5.63% gc time)



In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.766770 seconds (500.50 k allocations: 35.528 MiB, 90.22% compilation time)


1000000-element Vector{Float64}:
 5.232411779775741
 6.027783651517928
 4.009419962874653
 5.03823050700658
 4.413149060902737
 7.128282987803032
 6.098800005138575
 5.383892749790318
 4.06348101185808
 5.174425492450083
 5.457695281098244
 5.309960192944014
 6.382438499807212
 ⋮
 5.501632296667598
 6.318108405565564
 5.625675913657931
 4.3682926922091765
 5.108101619554987
 4.310174474396364
 4.752829000387278
 5.129459228916927
 4.9650116172529035
 5.316763267689454
 3.6612234918781033
 4.924316432677755

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.073676 seconds (13 allocations: 7.630 MiB)



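As you can see, this time it is much faster to iterate a type-stable container: once compiled, `Tables.namedtupleiterator` takes about 0.07 seconds, versus about 4 seconds for `eachrow`.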
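
The difference comes from what each iterator yields. The following small sketch (the data frame and variable names are illustrative, not from the original notebook) shows that `Tables.namedtupleiterator` produces rows whose concrete type encodes the column types, while both approaches give the same values.

```julia
using DataFrames

df = DataFrame(a=1:3, b=[1.5, 2.5, 3.5])

rows_dfr = collect(eachrow(df))                    # DataFrameRow objects
rows_nt = collect(Tables.namedtupleiterator(df))   # NamedTuple objects

# The NamedTuple element type is concrete and carries the column types,
# so the compiler can specialize code that processes each row.
nt_type_is_concrete = isconcretetype(eltype(rows_nt))

# Both iterators yield the same values.
sums_dfr = map(sum, eachrow(df))
sums_nt = map(sum, Tables.namedtupleiterator(df))
```

The flip side is that a new `NamedTuple` type has to be compiled for every distinct set of column names and types, which is what makes this approach expensive for very wide tables.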

Still, you might want to use the `select` syntax, which is optimized for such reductions:

In [30]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum # this includes compilation time

  1.447752 seconds (2.20 M allocations: 136.225 MiB, 2.35% gc time, 99.26% compilation time)


1000000-element Vector{Float64}:
 5.232411779775741
 6.027783651517928
 4.009419962874653
 5.03823050700658
 4.413149060902737
 7.128282987803032
 6.098800005138575
 5.383892749790318
 4.06348101185808
 5.174425492450083
 5.457695281098244
 5.309960192944014
 6.382438499807212
 ⋮
 5.501632296667598
 6.318108405565564
 5.625675913657931
 4.3682926922091765
 5.108101619554987
 4.310174474396364
 4.752829000387278
 5.129459228916927
 4.9650116172529035
 5.316763267689454
 3.6612234918781033
 4.924316432677755

In [31]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.009120 seconds (153 allocations: 7.637 MiB)

