Data Science Fundamentals: Julia

In [1]:
using DataFrames
using BenchmarkTools

## Performance tips

### Access by column number is faster than by name

In [2]:
x = DataFrame(rand(5, 1000))
@btime $x[!, 500];
@btime $x.x500;

  2.837 ns (0 allocations: 0 bytes)
  11.796 ns (0 allocations: 0 bytes)
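Both forms return the same stored column; the name-based form is presumably slower because the name first has to be resolved to a position. A minimal sketch on a small, hypothetical data frame (the column names `a` and `b` are made up for illustration) checks that the two accesses are interchangeable, so the positional form can be substituted in a hot loop:

```julia
using DataFrames

# Hypothetical small data frame; the column names are illustrative only.
df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

by_position = df[!, 2]  # second column, accessed by number (no copy)
by_name = df.b          # same column, accessed by name (no copy)

# Both expressions return the very same vector object.
same_object = by_position === by_name
```

The difference is a few nanoseconds per access, so it only matters when a column is fetched many times inside a tight loop.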


### When working with a `DataFrame`, use barrier functions or type annotations

In [3]:
using Random
function f_bad() # this function will be slow
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y, z = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_bad();
# if you run @code_warntype f_bad(), you will notice
# that Julia does not know the column types of the `DataFrame`


  70.038 ms (5999014 allocations: 122.06 MiB)


In [4]:
# solution 1 is to use a barrier function (this should be possible in almost any code)
function f_inner(y,z)
   p = 0.0
   for i in 1:length(y)
       p += y[i]*z[i]
   end
   p
end

function f_barrier() # extract the work to an inner function
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    f_inner(x[!, 1], x[!, 2])
end

using LinearAlgebra
function f_inbuilt() # or use a built-in function if possible
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    dot(x[!, 1], x[!, 2])
end

@btime f_barrier();
@btime f_inbuilt();

  5.747 ms (36 allocations: 30.52 MiB)
  5.471 ms (36 allocations: 30.52 MiB)
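Why the barrier helps can be checked without DataFrames. The sketch below uses a `Vector{Any}` as a hypothetical stand-in for the column storage (its type, like that of a `DataFrame`, says nothing about the element types of the columns) and asks the compiler what it can infer via `Base.return_types`. The names `cols`, `sumprod_inner`, and `sumprod_outer` are made up for illustration.

```julia
# Hypothetical stand-in for a DataFrame: the container's type does not
# reveal the element types of the columns it holds.
cols = Any[rand(1000), rand(1000)]

# Inner (barrier) function: compiled separately for the concrete argument types.
function sumprod_inner(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    return p
end

# Outer function: extracts the untyped columns, then dispatches once.
sumprod_outer(c) = sumprod_inner(c[1], c[2])

# What the compiler can infer for each function:
# the outer call cannot be narrowed to Float64, the inner one can.
outer_type = first(Base.return_types(sumprod_outer, (Vector{Any},)))
inner_type = first(Base.return_types(sumprod_inner, (Vector{Float64}, Vector{Float64})))

result = sumprod_outer(cols)
```

The loop itself runs inside `sumprod_inner`, where the types are known, so the cost of the one dynamic dispatch in `sumprod_outer` is paid only once per call rather than once per iteration.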


In [5]:
# solution 2 is to annotate the types of the extracted columns
# it is simpler, but there are cases in which you will not know these types
# This example assumes that you have DataFrames master at least from August 31, 2018
function f_typed()
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y::Vector{Float64}, z::Vector{Float64} = x[!, 1], x[!, 2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_typed();

  5.892 ms (36 allocations: 30.52 MiB)


In general, for tall and narrow tables it is often useful to convert the data with `Tables.rowtable`, `Tables.columntable` or `Tables.namedtupleiterator` and do the intermediate processing on the result in a type-stable way.
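As a minimal sketch of this idea, assuming the `Tables` package is available in the environment alongside `DataFrames`: `Tables.columntable` returns a `NamedTuple` of column vectors, whose type carries the column names and element types. The data frame and the helper `sumprod` below are hypothetical and only serve as illustration.

```julia
using DataFrames, Tables

# Hypothetical two-column table.
df = DataFrame(a = rand(1000), b = rand(1000))

# A NamedTuple of column vectors: names and element types are part of its
# type, so a function receiving it can be specialized for them.
ct = Tables.columntable(df)

# Hypothetical helper that works on the NamedTuple instead of the DataFrame.
function sumprod(t)
    p = 0.0
    for i in eachindex(t.a, t.b)
        p += t.a[i] * t.b[i]
    end
    return p
end

total = sumprod(ct)
```

Because every column becomes part of the `NamedTuple` type, compilation cost grows with the number of columns, which is presumably why the advice is limited to narrow tables.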

### Consider delaying `DataFrame` creation

Also notice the difference in performance between `DataFrame` and `DataFrame!` (copying vs. non-copying data frame creation).

In [6]:
function f1()
    x = DataFrame!([Vector{Float64}(undef, 10^4) for i in 1:100]) # we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f1a()
    x = DataFrame([Vector{Float64}(undef, 10^4) for i in 1:100]) # we work with a DataFrame directly
    for c in 1:ncol(x)
        d = x[!, c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef, 100)
    for c in 1:length(x)
        d = Vector{Float64}(undef, 10^4)
        for r in 1:length(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame!(x) # we delay creating the DataFrame until the work is done
end

function f2a()
    x = Vector{Any}(undef, 100)
    for c in 1:length(x)
        d = Vector{Float64}(undef, 10^4)
        for r in 1:length(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x) # we delay creating the DataFrame until the work is done
end

@btime f1();
@btime f1a();
@btime f2();
@btime f2a();

  26.782 ms (1949729 allocations: 37.42 MiB)
  28.022 ms (1949929 allocations: 45.05 MiB)
  4.403 ms (830 allocations: 7.68 MiB)
  5.251 ms (1030 allocations: 15.32 MiB)
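The same pattern can be sketched with the `copycols` keyword, assuming a DataFrames version whose constructor accepts it; with `copycols = false` the constructor is expected to store the given vectors without copying them, which is what `DataFrame!` does above. The column names generated below are hypothetical.

```julia
using DataFrames

# Do the work on plain vectors first ...
cols = [rand(10^4) for _ in 1:100]   # Vector{Vector{Float64}}
colnames = Symbol.("x", 1:100)       # hypothetical column names x1, ..., x100

# ... and wrap them only at the end, without copying the columns.
df = DataFrame(cols, colnames; copycols = false)

# The data frame column and the original vector are the same object.
shares_memory = df.x1 === cols[1]
```

Non-copying construction saves memory and time, at the price that later changes to `cols[1]` are visible through `df.x1` and vice versa.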


### You can add rows to a `DataFrame` in place and it is fast

In [7]:
x = DataFrame(rand(10^6, 5))
y = DataFrame(transpose(1.0:5.0))
z = [1.0:5.0;]

@btime vcat($x, $y); # creates a new DataFrame - slow
@btime append!($x, $y); # in place - fast

x = DataFrame(rand(10^6, 5)) # reset to the same starting point
@btime push!($x, $z); # add a single row in place - fast

  3.715 ms (200 allocations: 38.16 MiB)
  939.913 ns (19 allocations: 864 bytes)
  362.171 ns (16 allocations: 256 bytes)
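The difference between the three calls is easier to see on a small example. The sketch below uses a hypothetical two-column data frame and shows that `vcat` leaves its arguments untouched, while `append!` and `push!` grow the existing data frame:

```julia
using DataFrames

# Hypothetical small data frames; the names a and b are illustrative only.
df = DataFrame(a = [1, 2], b = [3.0, 4.0])
extra = DataFrame(a = [5], b = [6.0])

combined = vcat(df, extra)     # allocates a new data frame; df is unchanged
rows_before = nrow(df)

append!(df, extra)             # grows df itself
push!(df, (7, 8.0))            # a single row given as a tuple
push!(df, (a = 9, b = 10.0))   # or as a NamedTuple matching the column names
```

Since `vcat` has to copy all existing rows, its cost grows with the size of the data frame, whereas the in-place calls only touch the new rows.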


### Allowing `missing` as well as `categorical` slows down computations

In [8]:
using StatsBase

function test(data) # uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

test(1:10)
test([randstring() for i in 1:10])
test(allowmissing(1:10))
test(allowmissing([randstring() for i in 1:10]))


Int64
 raw:
  3.749 ms (7 allocations: 7.63 MiB)
 categorical:
  16.859 ms (4 allocations: 608 bytes)
String
 raw:
  23.311 ms (4 allocations: 608 bytes)
 categorical:
  27.759 ms (4 allocations: 608 bytes)
Union{Missing, Int64}
 raw:
  12.039 ms (4 allocations: 624 bytes)
 categorical:
  16.830 ms (4 allocations: 608 bytes)
Union{Missing, String}
 raw:
  32.462 ms (4 allocations: 608 bytes)
 categorical:
  29.268 ms (4 allocations: 608 bytes)
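In the timings above, the raw `Union{Missing, Int64}` vector is noticeably slower than the plain `Int64` one. When a column is known to contain no `missing` values, its element type can be narrowed before heavy computations. The following is a minimal sketch using only Base Julia; with DataFrames loaded, `disallowmissing` serves the same purpose.

```julia
# A column that allows missing values but does not actually contain any.
x = Union{Missing, Int}[3, 1, 2, 3]

# Narrow the element type only when no missing value is present;
# otherwise the conversion would throw an error.
y = any(ismissing, x) ? x : convert(Vector{Int}, x)
```

The check `any(ismissing, x)` is a single pass over the data, which is cheap compared with repeated computations on the wider element type.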


### When aggregating, use a column selector and prefer a categorical or pooled array grouping variable

In [9]:
df = DataFrame(x=rand('a':'d', 10^7), y=1);

In [10]:
gdf = groupby(df, :x)

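| Row | x | y |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'a' | 1 |
| 2 | 'a' | 1 |
| 3 | 'a' | 1 |
| 4 | 'a' | 1 |
| 5 | 'a' | 1 |
| 6 | 'a' | 1 |
| 7 | 'a' | 1 |
| 8 | 'a' | 1 |
| 9 | 'a' | 1 |
| 10 | 'a' | 1 |

| Row | x | y |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'c' | 1 |
| 2 | 'c' | 1 |
| 3 | 'c' | 1 |
| 4 | 'c' | 1 |
| 5 | 'c' | 1 |
| 6 | 'c' | 1 |
| 7 | 'c' | 1 |
| 8 | 'c' | 1 |
| 9 | 'c' | 1 |
| 10 | 'c' | 1 |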


In [11]:
@btime combine(v -> sum(v.y), $gdf) # traditional syntax, slow

  49.725 ms (193 allocations: 76.31 MiB)


| Row | x | x1 |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'a' | 2500455 |
| 2 | 'b' | 2499250 |
| 3 | 'd' | 2499805 |
| 4 | 'c' | 2500490 |


In [12]:
@btime combine($gdf, :y=>sum) # use column selector

  12.342 ms (169 allocations: 12.05 KiB)


| Row | x | y_sum |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'a' | 2500455 |
| 2 | 'b' | 2499250 |
| 3 | 'd' | 2499805 |
| 4 | 'c' | 2500490 |
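The two call styles compute the same aggregate; they differ in what is handed to the function. A minimal sketch on a hypothetical four-row data frame, assuming a DataFrames version that provides `combine` as used above:

```julia
using DataFrames

# Hypothetical small data frame; the values are illustrative only.
df = DataFrame(x = ['a', 'b', 'a', 'b'], y = [1, 2, 3, 4])
gdf = groupby(df, :x, sort = true)   # sort = true fixes the group order

# Column selector form: only column :y is passed to `sum`, group by group.
by_selector = combine(gdf, :y => sum)

# Function form: each group is passed as a whole sub-data-frame.
by_function = combine(sdf -> sum(sdf.y), gdf)
```

With the selector form the library knows which column is needed and which function is applied, which gives it room to avoid building a sub-data-frame per group; the function form has to treat the anonymous function as a black box.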


In [13]:
categorical!(df, :x);

In [14]:
gdf = groupby(df, :x)

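| Row | x | y |
|-----|------|-------|
|     | Cat… | Int64 |
| 1 | 'a' | 1 |
| 2 | 'a' | 1 |
| 3 | 'a' | 1 |
| 4 | 'a' | 1 |
| 5 | 'a' | 1 |
| 6 | 'a' | 1 |
| 7 | 'a' | 1 |
| 8 | 'a' | 1 |
| 9 | 'a' | 1 |
| 10 | 'a' | 1 |

| Row | x | y |
|-----|------|-------|
|     | Cat… | Int64 |
| 1 | 'd' | 1 |
| 2 | 'd' | 1 |
| 3 | 'd' | 1 |
| 4 | 'd' | 1 |
| 5 | 'd' | 1 |
| 6 | 'd' | 1 |
| 7 | 'd' | 1 |
| 8 | 'd' | 1 |
| 9 | 'd' | 1 |
| 10 | 'd' | 1 |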


In [15]:
@btime combine($gdf, :y=>sum)

  12.343 ms (181 allocations: 12.92 KiB)


| Row | x | y_sum |
|-----|------|-------|
|     | Cat… | Int64 |
| 1 | 'a' | 2500455 |
| 2 | 'b' | 2499250 |
| 3 | 'c' | 2500490 |
| 4 | 'd' | 2499805 |


In [16]:
using PooledArrays

In [17]:
df.x = PooledArray{Char}(df.x)

10000000-element PooledArray{Char,UInt8,1,Array{UInt8,1}}:
 'a'
 'b'
 'a'
 'b'
 'd'
 'a'
 'd'
 'c'
 'a'
 'b'
 'd'
 'a'
 'a'
 ⋮
 'a'
 'd'
 'c'
 'b'
 'b'
 'b'
 'c'
 'c'
 'c'
 'a'
 'c'
 'c'

In [18]:
gdf = groupby(df, :x)

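| Row | x | y |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'a' | 1 |
| 2 | 'a' | 1 |
| 3 | 'a' | 1 |
| 4 | 'a' | 1 |
| 5 | 'a' | 1 |
| 6 | 'a' | 1 |
| 7 | 'a' | 1 |
| 8 | 'a' | 1 |
| 9 | 'a' | 1 |
| 10 | 'a' | 1 |

| Row | x | y |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'c' | 1 |
| 2 | 'c' | 1 |
| 3 | 'c' | 1 |
| 4 | 'c' | 1 |
| 5 | 'c' | 1 |
| 6 | 'c' | 1 |
| 7 | 'c' | 1 |
| 8 | 'c' | 1 |
| 9 | 'c' | 1 |
| 10 | 'c' | 1 |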


In [19]:
@btime combine($gdf, :y=>sum)

  12.381 ms (176 allocations: 12.63 KiB)


| Row | x | y_sum |
|-----|------|-------|
|     | Char | Int64 |
| 1 | 'a' | 2500455 |
| 2 | 'b' | 2499250 |
| 3 | 'd' | 2499805 |
| 4 | 'c' | 2500490 |


### Use views instead of materializing a new DataFrame

In [20]:
x = DataFrame(rand(100, 1000))

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |
| 2 | 0.631602 | 0.517963 | 0.777873 | 0.654429 | 0.74561 | 0.368433 | 0.732388 | 0.914784 |
| 3 | 0.435846 | 0.073114 | 0.0466154 | 0.548013 | 0.532386 | 0.317982 | 0.836461 | 0.259588 |
| 4 | 0.784843 | 0.057933 | 0.438015 | 0.451396 | 0.31707 | 0.862087 | 0.346002 | 0.0938355 |
| 5 | 0.191626 | 0.289158 | 0.901954 | 0.183431 | 0.080411 | 0.819305 | 0.289609 | 0.621425 |
| 6 | 0.593219 | 0.995807 | 0.198614 | 0.599726 | 0.300731 | 0.95686 | 0.442651 | 0.54594 |
| 7 | 0.852045 | 0.169453 | 0.325186 | 0.491535 | 0.700797 | 0.837159 | 0.594777 | 0.828016 |
| 8 | 0.844865 | 0.484975 | 0.0253496 | 0.820666 | 0.77177 | 0.235397 | 0.368437 | 0.131434 |
| 9 | 0.95415 | 0.253442 | 0.0529586 | 0.484742 | 0.550639 | 0.0281588 | 0.728503 | 0.464371 |
| 10 | 0.221597 | 0.153504 | 0.568389 | 0.427674 | 0.171532 | 0.552755 | 0.35014 | 0.976896 |


In [21]:
@btime $x[1:1, :]

  148.254 μs (1988 allocations: 193.39 KiB)


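| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |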


In [22]:
@btime $x[1, :]

  25.479 ns (1 allocation: 32 bytes)


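| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |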


In [23]:
@btime view($x, 1:1, :)

  19.624 ns (1 allocation: 48 bytes)


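| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |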


In [24]:
@btime $x[1:1, 1:20]

  4.788 μs (43 allocations: 6.28 KiB)


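| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |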


In [25]:
@btime $x[1, 1:20]

  24.640 ns (2 allocations: 80 bytes)


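| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |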


In [26]:
@btime view($x, 1:1, 1:20)

  24.554 ns (2 allocations: 96 bytes)


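| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.866276 | 0.295163 | 0.459568 | 0.340574 | 0.687971 | 0.976229 | 0.388116 | 0.669996 |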
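The speed of views comes with aliasing: a view refers to the parent's data instead of copying it. The following sketch, on a hypothetical three-row data frame, shows that a later change to the parent is visible through `view(df, ...)` and through a single-row selection `df[1, :]`, but not through the materialized copy `df[1:2, :]`:

```julia
using DataFrames

# Hypothetical small data frame; the names a and b are illustrative only.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

copied = df[1:2, :]         # materializes a new DataFrame
viewed = view(df, 1:2, :)   # SubDataFrame referring to rows of df
row = df[1, :]              # DataFrameRow, also a view into df

df[1, :a] = 100             # modify the parent after taking the views

copy_value = copied[1, :a]  # still the old value
view_value = viewed[1, :a]  # reflects the change in the parent
row_value = row.a           # reflects the change in the parent
```

A view is therefore the cheaper choice when the selection is only read or is meant to update the parent; a copy is the safer choice when the selection must stay independent of later changes.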
