One Billion Row Challenge in Julia

This repository contains my attempts at Gunnar Morling's popular One Billion Row Challenge.

Though originally intended for the Java community, the challenge has since been tackled in many other languages. I am sure Julia can perform much better than what I have achieved so far.

How to run?

Dependencies are listed in the Project.toml file. To install them, execute the following code in the Julia REPL from the root of this repository.

using Pkg
Pkg.activate(pwd())
Pkg.instantiate()

The input file measurements.txt is not included here because of its large size. You can generate your own file with the Python script create_measurements.py, which is not part of this repository.

python3 create_measurements.py 1000000000
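
For a quick smoke test before generating the full file, a much smaller input can be written directly from Julia. This is only a sketch: the function name write_sample, the station list, and the temperature range are made up for illustration, and the only thing taken from the challenge is the row format station;temperature with one decimal place.

```julia
using Random

# Write `nrows` lines of the form "station;temperature" to `path`.
# The station names and the value range are illustrative assumptions.
function write_sample(path::AbstractString, nrows::Integer; seed::Integer = 42)
    rng = Random.MersenneTwister(seed)
    stations = ["Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"]
    open(path, "w") do io
        for _ in 1:nrows
            # Tenths of a degree keep exactly one decimal place, e.g. "Hamburg;12.3"
            temp = rand(rng, -999:999) / 10
            println(io, rand(rng, stations), ';', temp)
        end
    end
    return path
end

write_sample("sample_measurements.txt", 1_000)
```

The resulting file can be passed to the scripts in place of measurements.txt to check that everything runs before committing to the full benchmark.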

The benchmark can be started from the Julia REPL as shown below. It takes about 10-15 minutes to complete.

julia> ARGS = ["measurements.txt", "24"]
julia> include("execute_df_v11.jl")

1brc_notebook.jl is a Pluto notebook in which the different implementations were tested. Make sure to generate the input data file first, as described above.

Strategy

The following strategy has given me the best result so far:

1. Use memory mapping to read the file
2. Generate indexes that split the file into chunks (the number of chunks is based on user input)
3. Loop through the chunks and read each one into a DataFrame in parallel using CSV.read
4. Use groupby and combine on station to get the min, max, and mean of all temperatures
5. Vertically concatenate all DataFrames
6. Repeat step 4 to combine the data from all chunks
7. Format according to the challenge specifications and print the output
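
The steps above can be sketched in base Julia as follows. This is a minimal sketch and not the code in this repository: it swaps CSV.read and the DataFrames groupby/combine for a plain Dict, and the names Stats, chunk_ranges, process_chunk, merge_stats!, format_output, and run_sketch are hypothetical. It assumes rows of the form station;temperature terminated by a newline, and it rounds the mean with round(...; digits = 1), which may differ from the rounding rule in the challenge specification.

```julia
using Mmap

# Running aggregate for one station (hypothetical helper type).
mutable struct Stats
    min::Float64
    max::Float64
    sum::Float64
    n::Int
end

# Steps 1-2: split the mapped bytes into index ranges that each end on a newline.
function chunk_ranges(data::AbstractVector{UInt8}, nchunks::Integer)
    ranges = UnitRange{Int}[]
    total = length(data)
    step = max(1, cld(total, nchunks))
    start = 1
    while start <= total
        stop = min(start + step - 1, total)
        # Extend the chunk to the next newline so that no row is cut in half.
        while stop < total && data[stop] != UInt8('\n')
            stop += 1
        end
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Steps 3-4: aggregate min, max, sum, and count per station for one chunk.
# Copying the chunk into a String is simple but allocates; fine for a sketch.
function process_chunk(data::AbstractVector{UInt8}, r::UnitRange{Int})
    acc = Dict{String,Stats}()
    for line in eachsplit(String(data[r]), '\n')
        isempty(line) && continue
        name, value = split(line, ';')
        t = parse(Float64, value)
        s = get!(acc, String(name)) do
            Stats(t, t, 0.0, 0)
        end
        s.min = min(s.min, t)
        s.max = max(s.max, t)
        s.sum += t
        s.n += 1
    end
    return acc
end

# Steps 5-6: fold the per-chunk results into one table.
function merge_stats!(total::Dict{String,Stats}, part::Dict{String,Stats})
    for (name, s) in part
        t = get!(total, name) do
            Stats(s.min, s.max, 0.0, 0)
        end
        t.min = min(t.min, s.min)
        t.max = max(t.max, s.max)
        t.sum += s.sum
        t.n += s.n
    end
    return total
end

# Step 7: stations sorted by name, each printed as name=min/mean/max.
function format_output(total::Dict{String,Stats})
    parts = map(sort!(collect(keys(total)))) do name
        s = total[name]
        mean = round(s.sum / s.n; digits = 1)  # assumed rounding rule
        string(name, "=", s.min, "/", mean, "/", s.max)
    end
    return "{" * join(parts, ", ") * "}"
end

function run_sketch(path::AbstractString, nchunks::Integer)
    data = Mmap.mmap(path)  # Vector{UInt8} backed by the file
    tasks = [Threads.@spawn process_chunk(data, r) for r in chunk_ranges(data, nchunks)]
    total = Dict{String,Stats}()
    for t in tasks
        merge_stats!(total, fetch(t))
    end
    return format_output(total)
end

# Example call, assuming the input file exists:
# println(run_sketch("measurements.txt", 24))
```

Keeping sum and count per chunk, rather than a per-chunk mean, is what makes the final merge exact: averaging the chunk means would weight chunks with few rows for a station too heavily.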

Benchmark system (Ryzen 9 5900X, 32 GB RAM, NVMe SSD)

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 12 default, 0 interactive, 6 GC (on 24 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
  JULIA_PKG_USE_CLI_GIT = true

Best results

Using external dependencies (CSV.jl, DataFrames.jl)

julia> include("execute_df_v11.jl")
< printed output is omitted for clarity >
 Range (min … max):  89.459 s … 94.728 s  ┊ GC (min … max): 10.08% … 10.85%
 Time  (median):     90.178 s             ┊ GC (median):    10.54%
 Time  (mean ± σ):   90.765 s ±  1.567 s  ┊ GC (mean ± σ):  10.41% ±  0.41%

  █ ██ ██  █   █     █    █                               █  
  █▁██▁██▁▁█▁▁▁█▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  89.5 s         Histogram: frequency by time        94.7 s <

 Memory estimate: 92.03 GiB, allocs estimate: 1000828591.

Using only base Julia

julia> ARGS = ["measurements.txt", "384"]
julia> include("execute_base_v1_4.jl")
< printed output is omitted for clarity >
 Range (min … max):  71.958 s …   74.295 s  ┊ GC (min … max): 39.73% … 38.84%
 Time  (median):     72.886 s               ┊ GC (median):    39.44%
 Time  (mean ± σ):   72.889 s ± 705.485 ms  ┊ GC (mean ± σ):  39.44% ±  0.31%

  ▁    ▁     ▁   ▁      ▁ █▁                   ▁            ▁  
  █▁▁▁▁█▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  72 s            Histogram: frequency by time         74.3 s <

 Memory estimate: 157.38 GiB, allocs estimate: 2010613120.

julia> Threads.nthreads()
12
julia> ARGS = ["measurements.txt", "24"]
julia> include("execute_base_v1_5.jl")
< printed output is omitted for clarity >
 Range (min … max):  64.612 s …   66.248 s  ┊ GC (min … max): 35.02% … 34.99%
 Time  (median):     65.529 s               ┊ GC (median):    34.63%
 Time  (mean ± σ):   65.438 s ± 604.725 ms  ┊ GC (mean ± σ):  34.81% ±  0.41%

  █            █                 ▁   ▁      ▁    ▁      ▁   ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁█▁▁▁█ ▁
  64.6 s          Histogram: frequency by time         66.2 s <

 Memory estimate: 156.46 GiB, allocs estimate: 2000885392.

julia> Threads.nthreads()
24
julia> ARGS = ["measurements.txt", "24"]
julia> include("execute_base_v1_5.jl")
< printed output is omitted for clarity >
 Range (min … max):  59.267 s …   61.046 s  ┊ GC (min … max): 39.95% … 38.96%
 Time  (median):     60.404 s               ┊ GC (median):    39.65%
 Time  (mean ± σ):   60.364 s ± 553.757 ms  ┊ GC (mean ± σ):  39.85% ±  0.66%

  ▁                    ▁    ▁  ▁      ▁  █              ▁   █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁█▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
  59.3 s          Histogram: frequency by time           61 s <

 Memory estimate: 156.46 GiB, allocs estimate: 2000885452.
