# Julia is fast

Benchmarks are often used to compare languages. They can lead to long discussions, first about exactly what is being benchmarked and second about what explains the differences. These seemingly simple questions can get more complicated than you might at first imagine.

The purpose of this notebook is for you to see a simple benchmark for yourself. You can read the notebook to see what happened on the author's MacBook Pro with a 4-core Intel Core i7, or run it yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096-iap17/blob/master/lecture1/Boxes-and-registers.ipynb.)

# `sum`: An easy enough function to understand

Consider the **sum** function `sum(a)`, which computes

$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i.
$$

In [1]:
a = rand(10^7) # array of random numbers, uniform on [0,1)

10000000-element Array{Float64,1}:
 0.419706  
 0.411478  
 0.678047  
 0.495881  
 0.0345912 
 0.276511  
 0.749733  
 0.490188  
 0.391619  
 0.437043  
 0.997511  
 0.395294  
 0.0310659 
 ⋮         
 0.543147  
 0.394134  
 0.136932  
 0.852247  
 0.00696537
 0.880507  
 0.366073  
 0.5232    
 0.667962  
 0.948684  
 0.465594  
 0.225454  

In [2]:
sum(a) # expect roughly 10^7 * 0.5, since each entry has mean 0.5

4.999322568984508e6
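
How close to `10^7 * 0.5` should the result be? Entries that are uniform on [0,1) have mean 1/2 and variance 1/12, so the sum of `n` independent entries has mean `n/2` and standard deviation `sqrt(n/12)`. The following is a small sketch of that arithmetic, not part of the original notebook; the variable names are illustrative, and `observed` is simply the value printed above.

```julia
n = 10^7
expected_mean = n / 2            # each entry has mean 1/2
expected_std  = sqrt(n / 12)     # each entry has variance 1/12

observed = 4.999322568984508e6   # value of sum(a) printed above
z = (observed - expected_mean) / expected_std

println("expected standard deviation of the sum: ", expected_std)
println("observed deviation in standard deviations: ", z)
```

The observed sum lies less than one standard deviation from `n/2`, which is what one would expect from a random draw.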

# Benchmarking a few ways in a few languages

In [3]:
using BenchmarkTools  # Julia package for benchmarking

# 1. The C language (8.0 msecs)

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not take advantage of.

You can put C code in a Julia session, compile it, and run it.

In [4]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    size_t i;
    for (i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

using Libdl               # provides Libdl.dlext (needed on Julia 0.7 and later)

const Clib = tempname()   # generate a temporary file path for the shared library


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)
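
The `ccall` above takes, in order, the function to call (here a `(name, library)` tuple), the return type, a tuple of argument types, and then the argument values. As a minimal illustration of the same calling pattern that does not need `gcc`, the sketch below calls `strlen` from the C standard library; this example is not in the original notebook and assumes the C library is already loaded into the Julia process, as it normally is on Linux and macOS.

```julia
# Call the C standard library's strlen on a Julia string.
# Arguments: function name, return type, tuple of argument types, then the values.
# Note the trailing comma in (Cstring,): a one-element tuple needs it.
len = ccall(:strlen, Csize_t, (Cstring,), "hello")
println(len)
```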

In [5]:
c_sum(a)

4.999322568984473e6

In [6]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

true
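
The check uses `≈` rather than `==` because floating-point addition is not associative: two correct implementations that add the same numbers in a different order can differ in the last few digits, as the printed values of `sum(a)` and `c_sum(a)` above do. The sketch below illustrates this under stated assumptions; `sequential_sum` and `pairwise_sum` are hypothetical helpers written for this example, not functions from the notebook or from any library.

```julia
# Left-to-right accumulation, like the C loop above.
function sequential_sum(x)
    s = 0.0
    for v in x
        s += v
    end
    return s
end

# Hypothetical helper: sum short ranges directly, otherwise split the range
# in half and add the two partial sums.
function pairwise_sum(x, lo=1, hi=length(x))
    if hi - lo < 128
        s = 0.0
        for i in lo:hi
            s += x[i]
        end
        return s
    end
    mid = (lo + hi) ÷ 2
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
end

x = rand(10^6)
s1 = sequential_sum(x)
s2 = pairwise_sum(x)
println(s1 == s2)  # often false: the last digits can differ
println(s1 ≈ s2)   # true: the results agree to within rounding error
```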

We can now benchmark the C code directly from Julia:

In [7]:
c_bench = @benchmark c_sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.017 ms (0.00% GC)
  median time:      8.109 ms (0.00% GC)
  mean time:        8.149 ms (0.00% GC)
  maximum time:     8.609 ms (0.00% GC)
  --------------
  samples:          613
  evals/sample:     1

In [8]:
println("C: Fastest time was $(minimum(c_bench.times)/1e6) msecs.")

C: Fastest time was 8.017094 msecs.
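
Reporting the fastest time is a common choice because background activity on the machine tends only to add time to a run, so the minimum over many runs is usually the most reproducible number. The sketch below shows the idea using only Base's `@elapsed`; `fastest_time_ms` is a hypothetical helper for illustration and is far cruder than what BenchmarkTools does.

```julia
# Hypothetical helper (not part of BenchmarkTools): time f(x) several times
# with Base's @elapsed and keep the smallest measurement, in milliseconds.
function fastest_time_ms(f, x; samples=20)
    f(x)  # warm-up call so that compilation is not included in the timings
    times = [@elapsed(f(x)) for _ in 1:samples]
    return minimum(times) * 1e3
end

x = rand(10^6)
println("fastest of 20 runs: ", fastest_time_ms(sum, x), " msecs")
```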


# 2. Python's built-in `sum` (133 msecs)

In [9]:
# Julia interface to Python:
using PyCall

In [10]:
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):

apy_list = PyCall.array2py(a, 1, 1)

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [11]:
pysum(apy_list) # call it on the Python list that is benchmarked below

4.999322568984473e6

In [12]:
pysum(apy_list) ≈ sum(a)

true

In [13]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  512 bytes
  allocs estimate:  17
  --------------
  minimum time:     132.633 ms (0.00% GC)
  median time:      133.222 ms (0.00% GC)
  mean time:        133.238 ms (0.00% GC)
  maximum time:     133.829 ms (0.00% GC)
  --------------
  samples:          38
  evals/sample:     1

In [14]:
println("Python (built in): fastest time was $(minimum(py_list_bench.times)/1e6) msecs.")

Python (built in): fastest time was 132.632981 msecs.


# 3. Python: `numpy` (4.5 msec)  

## Takes advantage of hardware "SIMD", but only works when it works.

`numpy` is an optimized C library, callable from Python.

If it is not installed, install it from Julia as follows:

In [15]:
# using Conda 
# Conda.add("numpy")

In [16]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default

py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  memory estimate:  720 bytes
  allocs estimate:  22
  --------------
  minimum time:     4.549 ms (0.00% GC)
  median time:      4.670 ms (0.00% GC)
  mean time:        4.763 ms (0.00% GC)
  maximum time:     6.939 ms (0.00% GC)
  --------------
  samples:          1047
  evals/sample:     1

In [17]:
numpy_sum(apy_list) # numpy's sum also accepts the plain Python list

4.999322568984511e6

In [18]:
numpy_sum(apy_list) ≈ sum(a)

true

# 4. Python, hand-written (353 msec!)

In [19]:
# The PyCall package lets us define Python functions directly from Julia:

py"""
def mysum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""

# mysum_py is a reference to the Python mysum function
mysum_py = py"""mysum"""o

PyObject <function mysum at 0x7f71c0699cf8>

In [20]:
@benchmark $mysum_py($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  512 bytes
  allocs estimate:  17
  --------------
  minimum time:     353.346 ms (0.00% GC)
  median time:      355.955 ms (0.00% GC)
  mean time:        355.883 ms (0.00% GC)
  maximum time:     359.651 ms (0.00% GC)
  --------------
  samples:          15
  evals/sample:     1

In [21]:
mysum_py(apy_list)

4.999322568984473e6

In [22]:
mysum_py(apy_list) ≈ sum(a)

true

# 5. Julia (built-in) (4.4 msec) 

## Written directly in Julia, not in C! 

(and just as fast as numpy's optimized `sum()`)

In [23]:
@which sum(a)

In [24]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.367 ms (0.00% GC)
  median time:      4.396 ms (0.00% GC)
  mean time:        4.408 ms (0.00% GC)
  maximum time:     4.902 ms (0.00% GC)
  --------------
  samples:          1132
  evals/sample:     1

# 6. Julia (hand-written) (8.0 msec, same as hand-written C)

In [25]:
function mysum(A)
    s = 0.0  # for a type-generic version, use s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

mysum (generic function with 1 method)
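
The comment about `zero(eltype(A))` refers to starting the accumulator at a zero of the array's own element type instead of the `Float64` value `0.0`. A minimal sketch of that variant follows; the name `mysum_generic` is hypothetical and is not used elsewhere in this notebook.

```julia
# Hypothetical type-generic variant: the accumulator starts as a zero of the
# array's element type, so the result keeps that type.
function mysum_generic(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end

println(mysum_generic([1, 2, 3]))          # 6, an Int rather than 6.0
println(mysum_generic(Float32[1.5, 2.5]))  # 4.0, a Float32
```

For a `Float64` array such as `a`, this behaves the same as `mysum`.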

In [26]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.018 ms (0.00% GC)
  median time:      8.103 ms (0.00% GC)
  mean time:        8.156 ms (0.00% GC)
  maximum time:     8.789 ms (0.00% GC)
  --------------
  samples:          613
  evals/sample:     1
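
To compare the results so far at a glance, the sketch below lists the minimum times reported in the benchmark outputs above and expresses each as a multiple of the C time. The numbers are copied from this notebook's outputs on the author's machine; the variable names and labels are illustrative, and a rerun would give different values.

```julia
# Minimum times in msecs, copied from the benchmark outputs in this notebook.
times_ms = [
    "C (hand-written)"      => 8.017,
    "Python (built-in sum)" => 132.633,
    "Python (numpy)"        => 4.549,
    "Python (hand-written)" => 353.346,
    "Julia (built-in sum)"  => 4.367,
    "Julia (hand-written)"  => 8.018,
]

c_time = 8.017  # reference: the hand-written C loop

# Print the entries from fastest to slowest, with the ratio to the C time.
for (name, t) in sort(times_ms, by = last)
    println(rpad(name, 24), lpad(t, 9), " msecs   ", round(t / c_time, digits = 2), "x C")
end
```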