Data Science Fundamentals: Julia

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.858673,0.79071,0.0230825,0.555414,0.24509
2,0.993829,0.177617,0.125274,0.548583,0.549459
3,0.732227,0.592261,0.216254,0.287079,0.285782


In [3]:
y = convert(DataFrame, x)
x === y # no copying performed

true

In [4]:
y = copy(x)
x === y # not the same object

false

In [5]:
y = DataFrame(x)
x === y

false

In [6]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [7]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # columns are also copied when creating a data frame from keyword arguments

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [8]:
y === df.y # different object

false

In [9]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})

In [10]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it. In particular, `DataFrame!` is a shorthand for a non-copying constructor.

In [11]:
df = DataFrame!(x=x,y=y)

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [12]:
y === df.y # now it is the same

true

In [13]:
select(df, :y)[!, 1] === y # not the same

false

In [14]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
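
The practical consequence of skipping the copy is that the data frame and the source vector share memory, so a later in-place change to one is visible in the other. The following is a minimal sketch, assuming a DataFrames.jl version whose keyword-argument constructor accepts `copycols=false`; the names `src`, `copied`, and `aliased` are illustrative only.

```julia
using DataFrames

src = [1, 2, 3]

# The default constructor copies the column; copycols=false reuses the vector.
copied  = DataFrame(y=src)
aliased = DataFrame(y=src, copycols=false)

src[1] = 100          # mutate the original vector in place

copied.y[1]           # unchanged: the data frame owns its own copy
aliased.y[1]          # changed: the data frame shares memory with src
aliased.y === src     # the column and src are the same object
```

Copying by default is the safer choice; the non-copying form is mainly useful when the source vectors are large and will not be modified afterwards.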

### Do not modify the parent of `GroupedDataFrame` or `view`

In [15]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

Row,id,x
,Int64,Int64
1,1,1
2,1,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6


In [16]:
x[1:3, 1] = [2,2,2]
g # wrong now: g is only a view of x, so its groups no longer match the values in :id

Row,id,x
,Int64,Int64
1,2,1
2,2,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6


In [17]:
s = view(x, 5:6, :)

Row,id,x
,Int64,Int64
1,1,5
2,2,6


In [18]:
delete!(x, 3:6)

Row,id,x
,Int64,Int64
1,2,1
2,2,2


In [19]:
s # error: the view refers to rows 5:6, which no longer exist in x

BoundsError: BoundsError: attempt to access 2-element Array{Int64,1} at index [5:6]
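
One way to stay clear of both problems is to build the `GroupedDataFrame` or view on a copy of the data, or to rebuild the grouping after the parent changes. The sketch below is one possible pattern rather than the only one; the names `orig`, `snapshot`, and `g_new` are illustrative.

```julia
using DataFrames

orig = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Option 1: group and view a copy, so later edits to `orig` cannot invalidate them.
snapshot = copy(orig)
g = groupby(snapshot, :id)
s = view(snapshot, 5:6, :)

orig[1:3, :id] = [2, 2, 2]   # modify only the original data frame

# The groups and the view still reflect the data they were built from.
g[1].id == [1, 1, 1]
s.x == [5, 6]

# Option 2: after changing the parent, rebuild the grouping from scratch.
g_new = groupby(orig, :id)
```

Option 1 costs one extra copy of the data; option 2 costs a second grouping pass. Which one is cheaper depends on the size of the table and how often the parent changes.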

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [20]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,100,100,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [21]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:2000])

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
,Float64,Char,Int64,Int64,Bool,Int64,Float64,Int64,Int64,Bool,Float64
1,1.0,'a',1,1,0,1,1.0,1,1,0,1.0
2,2.0,'b',2,2,1,2,2.0,2,2,1,2.0


In [22]:
@time collect(eachrow(df1))

  0.041201 seconds (103.51 k allocations: 5.411 MiB)


2-element Array{DataFrameRow{DataFrame,DataFrames.Index},1}:
 DataFrameRow. Omitted printing of 1992 columns
│ Row │ x1      │ x2   │ x3    │ x4    │ x5   │ x6    │ x7      │ x8    │
│     │ Float64 │ Char │ Int64 │ Int64 │ Bool │ Int64 │ Float64 │ Int64 │
├─────┼─────────┼──────┼───────┼───────┼──────┼───────┼─────────┼───────┤
│ 1   │ 1.0     │ 'a'  │ 1     │ 1     │ 0    │ 1     │ 1.0     │ 1     │
 DataFrameRow. Omitted printing of 1992 columns
│ Row │ x1      │ x2   │ x3    │ x4    │ x5   │ x6    │ x7      │ x8    │
│     │ Float64 │ Char │ Int64 │ Int64 │ Bool │ Int64 │ Float64 │ Int64 │
├─────┼─────────┼──────┼───────┼───────┼──────┼───────┼─────────┼───────┤
│ 2   │ 2.0     │ 'b'  │ 2     │ 2     │ 1    │ 2     │ 2.0     │ 2     │

In [23]:
@time collect(Tables.namedtupleiterator(df1));

 63.361998 seconds (12.07 M allocations: 620.814 MiB, 0.59% gc time)


As you can see, the time needed to compile `Tables.namedtupleiterator` is prohibitive in this case: about 63 seconds, versus about 0.04 seconds for `eachrow` (which is why this tip is included in the pitfalls notebook).

The table below is tall:

In [24]:
df2 = DataFrame(rand(10^6, 10))

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.568491,0.268587,0.256403,0.741511,0.648123,0.930351,0.167457,0.980754
2,0.744475,0.831044,0.894722,0.299662,0.334919,0.880855,0.754838,0.203738
3,0.69306,0.0402436,0.319557,0.4143,0.160553,0.386284,0.485616,0.247916
4,0.221967,0.230616,0.0299796,0.698827,0.29674,0.55879,0.524201,0.932496
5,0.42018,0.162836,0.927171,0.952488,0.411027,0.542914,0.64362,0.392806
6,0.885983,0.755675,0.386897,0.783268,0.199874,0.0355547,0.228639,0.900181
7,0.299833,0.744919,0.76264,0.889228,0.576511,0.729001,0.926209,0.770737
8,0.254902,0.300898,0.796695,0.940978,0.871815,0.240346,0.445238,0.0807118
9,0.0242557,0.554034,0.839943,0.880411,0.248174,0.408633,0.164396,0.487239
10,0.709502,0.940366,0.923532,0.398793,0.0742642,0.517558,0.161878,0.0899359


In [25]:
@time map(sum, eachrow(df2))

  3.377965 seconds (60.21 M allocations: 1.076 GiB, 8.23% gc time)


1000000-element Array{Float64,1}:
 6.150909489011941
 6.600268969671474
 3.2350845965910624
 4.647609536200212
 5.008661858757231
 5.656899063394897
 6.161721733792737
 4.5534753308054485
 4.992776849520935
 4.43861698146911
 6.020512596774206
 6.095746782467509
 6.09541396595931
 ⋮
 5.939789381052082
 6.026865872953317
 4.62841333969052
 3.7862861966163646
 5.484604803884082
 5.565837311085242
 5.7172172532414836
 4.639425144162218
 3.9993603502668087
 6.956557139246081
 7.0906812949726366
 5.412900773072941

In [26]:
@time map(sum, eachrow(df2))

  3.096062 seconds (59.99 M allocations: 1.065 GiB, 2.68% gc time)


1000000-element Array{Float64,1}:
 6.150909489011941
 6.600268969671474
 3.2350845965910624
 4.647609536200212
 5.008661858757231
 5.656899063394897
 6.161721733792737
 4.5534753308054485
 4.992776849520935
 4.43861698146911
 6.020512596774206
 6.095746782467509
 6.09541396595931
 ⋮
 5.939789381052082
 6.026865872953317
 4.62841333969052
 3.7862861966163646
 5.484604803884082
 5.565837311085242
 5.7172172532414836
 4.639425144162218
 3.9993603502668087
 6.956557139246081
 7.0906812949726366
 5.412900773072941

In [27]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.187365 seconds (477.06 k allocations: 31.998 MiB)


1000000-element Array{Float64,1}:
 6.150909489011941
 6.600268969671474
 3.2350845965910624
 4.647609536200212
 5.008661858757231
 5.656899063394897
 6.161721733792737
 4.5534753308054485
 4.992776849520935
 4.43861698146911
 6.020512596774206
 6.095746782467509
 6.09541396595931
 ⋮
 5.939789381052082
 6.026865872953317
 4.62841333969052
 3.7862861966163646
 5.484604803884082
 5.565837311085242
 5.7172172532414836
 4.639425144162218
 3.9993603502668087
 6.956557139246081
 7.0906812949726366
 5.412900773072941

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.017245 seconds (23 allocations: 7.631 MiB)


1000000-element Array{Float64,1}:
 6.150909489011941
 6.600268969671474
 3.2350845965910624
 4.647609536200212
 5.008661858757231
 5.656899063394897
 6.161721733792737
 4.5534753308054485
 4.992776849520935
 4.43861698146911
 6.020512596774206
 6.095746782467509
 6.09541396595931
 ⋮
 5.939789381052082
 6.026865872953317
 4.62841333969052
 3.7862861966163646
 5.484604803884082
 5.565837311085242
 5.7172172532414836
 4.639425144162218
 3.9993603502668087
 6.956557139246081
 7.0906812949726366
 5.412900773072941

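As you can see, this time it is much faster to iterate over a type-stable container: once compiled, `Tables.namedtupleiterator` takes about 0.017 seconds, versus about 3.1 seconds for `eachrow`.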
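
Both iteration styles compute the same values and differ only in cost, which can be checked on a small table before choosing one. This is a minimal sketch; it assumes the Tables.jl module is reachable as `DataFrames.Tables` (the cells above reach it simply as `Tables`), and the names `df_small`, `sums_rows`, and `sums_nt` are illustrative.

```julia
using DataFrames

df_small = DataFrame(a=rand(1000), b=rand(1000), c=rand(1000))

# Row sums via DataFrameRow objects: flexible, but not type stable.
sums_rows = map(sum, eachrow(df_small))

# Row sums via NamedTuples: type stable, compiled once per column schema.
# Assumption: DataFrames.Tables is the Tables.jl module loaded by DataFrames.jl.
sums_nt = map(sum, DataFrames.Tables.namedtupleiterator(df_small))

sums_rows ≈ sums_nt   # both approaches produce the same row sums
```

To compare timings on your own data, wrap each `map` call in `@time` and run it twice, so that the second measurement excludes compilation.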