# Julia is fast

Benchmarks are often used to compare languages. They can lead to long discussions, first about exactly what is being benchmarked and second about what explains the differences. These seemingly simple questions can get more complicated than you might at first imagine.

The purpose of this notebook is for you to see a simple benchmark for yourself. You can either read the notebook and see what happened on the author's MacBook Pro with a 4-core Intel Core i7, or run the notebook yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096-iap17/blob/master/lecture1/Boxes-and-registers.ipynb.)

# `sum`: An easy enough function to understand

Consider the **sum** function `sum(a)`, which computes

$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i.
$$

In [1]:
a = rand(10^7) # array of random numbers, uniform on [0,1)

10000000-element Array{Float64,1}:
 0.553158 
 0.318    
 0.53733  
 0.700712 
 0.659499 
 0.183877 
 0.390121 
 0.0602211
 0.42338  
 0.127738 
 0.730677 
 0.328059 
 0.548411 
 ⋮        
 0.142799 
 0.160456 
 0.279071 
 0.192312 
 0.331961 
 0.063653 
 0.365396 
 0.619037 
 0.379909 
 0.131102 
 0.260203 
 0.48403  

In [2]:
sum(a) # we expect roughly 10^7 * 0.5, since each entry has mean 0.5

4.999162666237622e6
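As a rough check on that expectation (a standard calculation, not part of the original lecture): each entry is uniform on $[0,1)$, with mean $1/2$ and variance $1/12$, so for $n = 10^7$ independent entries

$$
\mathbb{E}[\mathrm{sum}(a)] = \frac{n}{2} = 5 \times 10^6, \qquad \sigma = \sqrt{\frac{n}{12}} \approx 913.
$$

The value printed above is about 837 below the mean, which is within one standard deviation.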

# Benchmarking a few ways in a few languages

In [3]:
using BenchmarkTools  # Julia package for benchmarking

INFO: Recompiling stale cache file /Users/dpsanders/.julia/lib/v0.5/Blosc.ji for module Blosc.
INFO: Recompiling stale cache file /Users/dpsanders/.julia/lib/v0.5/HDF5.ji for module HDF5.
INFO: Recompiling stale cache file /Users/dpsanders/.julia/lib/v0.5/JLD.ji for module JLD.

# 1. The C language (8.4 msec)

C is often considered the gold standard: hard on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not take advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it.

In [4]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed; on Julia 0.7 and later,
# run `using Libdl` first so that `Libdl.dlext` is defined):

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)
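The `ccall` in the cell above follows a fixed pattern: the function name (optionally paired with a library), the return type, a tuple of argument types, and then the arguments themselves. Here is a smaller sketch of the same pattern that needs no compiler. It assumes the C standard library's `strlen` is already visible in the Julia process, which is normally the case.

```julia
# Call strlen from the C standard library.
# Pattern: function name, return type, tuple of argument types, then the arguments.
n = ccall(:strlen, Csize_t, (Cstring,), "hello")
println(n)
```

The result comes back as a `Csize_t`, the same unsigned integer type that `c_sum` takes for its length argument.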

In [5]:
c_sum(a)

4.999162666236683e6

In [22]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

true
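Why `≈` rather than `==`? The two results above agree only to about 12 significant digits. Floating-point addition is not associative, so the order in which terms are added changes the last few digits. The C loop adds strictly left to right, while Julia's built-in `sum` (as far as I know) uses a pairwise scheme for large arrays. A tiny sketch of the effect, using nothing beyond Base:

```julia
# Floating-point addition is not associative: grouping changes the last digits.
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

println(left == right)   # exact comparison
println(left ≈ right)    # approximate comparison, the same as isapprox(left, right)
```

The first line prints `false` and the second prints `true`, which is why the checks in this notebook use `≈`.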

We can now benchmark the C code directly from Julia:

In [8]:
c_bench = @benchmark c_sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.366 ms (0.00% GC)
  median time:      8.518 ms (0.00% GC)
  mean time:        8.608 ms (0.00% GC)
  maximum time:     12.149 ms (0.00% GC)
  --------------
  samples:          581
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [9]:
println("C: Fastest time was $(minimum(c_bench.times)/1e6) msecs.")

C: Fastest time was 8.365547 msecs.
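The division by `1e6` is there because BenchmarkTools records times in nanoseconds. To make "fastest time" concrete, here is a rough sketch of the idea using only Base. The helper `min_time_ms` is a hypothetical name invented for this illustration, and it leaves out almost everything BenchmarkTools does (warm-up, tuning the number of evaluations, statistics).

```julia
# Hypothetical helper: run f(x) n times and return the fastest wall-clock time in msec.
function min_time_ms(f, x; n = 20)
    best = Inf
    for _ in 1:n
        t0 = time_ns()                            # nanoseconds before the call
        f(x)
        best = min(best, (time_ns() - t0) / 1e6)  # convert ns to msec, keep the smallest
    end
    return best
end

v = rand(10^6)
t = min_time_ms(sum, v)
println("sum: fastest of 20 runs was $t msecs.")
```

Taking the minimum rather than the mean is a common choice here, since other activity on the machine can only make a run slower, never faster.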


# 2. Python's built-in `sum` (69 msec)

In [11]:
# Julia interface to Python:
using PyCall

In [12]:
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):

apy_list = PyCall.array2py(a, 1, 1)

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [13]:
pysum(apy_list)

4.999162666236683e6

In [14]:
pysum(apy_list) ≈ sum(a)

true

In [15]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  672.00 bytes
  allocs estimate:  19
  --------------
  minimum time:     69.136 ms (0.00% GC)
  median time:      72.520 ms (0.00% GC)
  mean time:        74.051 ms (0.00% GC)
  maximum time:     85.883 ms (0.00% GC)
  --------------
  samples:          68
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [16]:
println("Python (built in): fastest time was $(minimum(py_list_bench.times)/1e6) msecs.")

Python (built in): fastest time was 69.135909 msecs.
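One likely reason for the gap is that a Python list holds references to individually boxed number objects, so every addition has to follow a pointer and check a type. This is my reading, not something the benchmark itself proves. Julia can mimic that layout with a `Vector{Any}`, which gives a rough feel for the difference without leaving Julia:

```julia
v = rand(10^5)                    # Vector{Float64}: raw 64-bit floats stored contiguously
boxed = convert(Vector{Any}, v)   # Vector{Any}: each element is a reference to a boxed value

println(eltype(v), " vs ", eltype(boxed))
println(sum(boxed) ≈ sum(v))      # same numbers, very different memory layout
```

Benchmarking `sum(boxed)` against `sum(v)` with `@benchmark` is a quick way to see how much the boxing costs on your own machine.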


# 3. Python: `numpy` (3.9 msec)  

## Takes advantage of hardware "SIMD", but only works when it works.

`numpy` is an optimized C library, callable from Python.

If it is not installed, install it from Julia as follows:

In [17]:
# using Conda 
# Conda.add("numpy")

In [18]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default

py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  memory estimate:  960.00 bytes
  allocs estimate:  25
  --------------
  minimum time:     3.945 ms (0.00% GC)
  median time:      4.317 ms (0.00% GC)
  mean time:        4.340 ms (0.00% GC)
  maximum time:     7.423 ms (0.00% GC)
  --------------
  samples:          1152
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [19]:
numpy_sum(apy_numpy) # check the result on the same NumPy array we benchmarked

4.999162666237619e6

In [20]:
numpy_sum(apy_numpy) ≈ sum(a)

true
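The SIMD remark in this section's subtitle can be illustrated in Julia itself. SIMD instructions add several numbers at once, which means the additions happen in a different order than in a plain loop. A compiler will not reorder floating-point additions on its own, because that can change the result. The `@simd` macro gives it permission. The function below is a minimal sketch (`mysum_simd` is a name made up here), not what NumPy does internally.

```julia
# Hand-written sum that allows the compiler to vectorize the loop.
function mysum_simd(A)
    s = zero(eltype(A))
    @simd for i in eachindex(A)   # @simd permits reordering of the additions
        @inbounds s += A[i]       # @inbounds skips bounds checks inside the loop
    end
    return s
end

v = rand(10^6)
println(mysum_simd(v) ≈ sum(v))
```

Because the summation order differs, the result is compared with `≈` rather than `==`.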

# 4. Python, hand-written (450 msec!)

In [23]:
# It currently takes a little bit of hackery to define a custom Python function
# in a Julia string and call it via PyCall, sorry:

syms = PyDict{AbstractString, PyObject}()
syms["syms"] = PyObject(Any[])

pyeval("""
def mysum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s

syms.insert(0, mysum)
""", PyAny, syms, PyCall.Py_file_input)

mysum_py = syms["syms"][1] # a reference to the Python mysum function

PyObject <function mysum at 0x32d45e230>

In [24]:
@benchmark $mysum_py($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  672.00 bytes
  allocs estimate:  19
  --------------
  minimum time:     450.494 ms (0.00% GC)
  median time:      459.039 ms (0.00% GC)
  mean time:        459.120 ms (0.00% GC)
  maximum time:     466.200 ms (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [25]:
mysum_py(apy_list)

4.999162666236683e6

In [26]:
mysum_py(apy_list) ≈ sum(a)

true

# 5. Julia (built-in) (3.7 msec) 

## Written directly in Julia, not in C!

In [27]:
@which sum(a)

In [28]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.750 ms (0.00% GC)
  median time:      3.885 ms (0.00% GC)
  mean time:        3.970 ms (0.00% GC)
  maximum time:     6.484 ms (0.00% GC)
  --------------
  samples:          1259
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
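To put the numbers so far side by side, here is a small sketch that copies the minimum times from the outputs above by hand and prints each one as a multiple of the C time. It assumes a recent Julia (1.x) for the `digits` keyword of `round`. The labels are just descriptive strings chosen for this illustration.

```julia
# Minimum times in msec, copied by hand from the benchmark outputs above.
times = Dict(
    "C"                   => 8.365547,
    "Python built-in"     => 69.135909,
    "Python numpy"        => 3.945,
    "Python hand-written" => 450.494,
    "Julia built-in"      => 3.750,
)

# Print each entry, fastest first, with its ratio to the C time.
for (name, t) in sort(collect(times), by = last)
    ratio = round(t / times["C"], digits = 2)
    println(rpad(name, 22), lpad(string(round(t, digits = 1)), 7), " msec   ", ratio, "x C")
end
```

On your own machine you would fill the dictionary from the benchmark objects instead, for example with `minimum(c_bench.times) / 1e6`.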

# 6. Julia (hand-written) (8.1 msec, about the same as hand-written C)

In [30]:
function mysum(A)
    s = 0.0  # for a type-generic version, use s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

mysum (generic function with 1 method)
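The comment about `zero(eltype(A))` deserves a quick illustration. Starting from `0.0` always accumulates in `Float64`, whatever the element type of the array. Starting from the zero of the element type keeps the result in that type. A minimal sketch, with `mysum_generic` as a name invented here:

```julia
function mysum_generic(A)
    s = zero(eltype(A))   # additive identity of the array's element type
    for a in A
        s += a
    end
    return s
end

println(mysum_generic([1, 2, 3]))                   # integer input gives an integer result
println(typeof(mysum_generic(Float32[1.5, 2.5])))   # Float32 input stays Float32
```

For the `Float64` array `a` used in this notebook the two versions do the same work, so the benchmark is unaffected.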

In [None]:
j_bench_hand = @benchmark mysum($a)