Parallel File Processing With Erlang
This tiny project is about processing line-oriented records where the order of records does not matter. It provides a simple Erlang solution for distributing the workload across multiple CPU cores.
Copyright and License
Copyright (c) 2011, Tobias Rodaebel
This software is released under the Apache License, Version 2.0. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Building and Running
To compile the Erlang programs, just enter:
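Assuming the sources are named serial.erl and parallel.erl (as in the results table below), compiling them with erlc might look like this:

```shell
# Compile both programs to .beam files; erlc ships with Erlang/OTP.
erlc serial.erl parallel.erl
```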
You can now run them as follows:
$ erl -run serial start PATH
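Presumably the parallel variant is started the same way, with PATH standing for the data file to process:

```shell
# Hypothetical invocation of the parallel program, mirroring the serial one.
$ erl -run parallel start PATH
```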
The project includes a small Python script, mkdata.py, to generate test data. To generate 5*10^6 lines (~1.1 GB) of test data, enter the following command:
$ python mkdata.py 5000000 > test.txt
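The exact records mkdata.py emits are not shown here; a minimal sketch of such a generator, with a made-up record format, could look like this:

```python
#!/usr/bin/env python
# Hedged sketch of a test-data generator; the real mkdata.py and its
# record format may differ.
import random
import sys

def make_line(i):
    # Hypothetical record: a line number, a random word, a random value.
    word = "".join(random.choice("abcdefghij") for _ in range(16))
    return "%d %s %d" % (i, word, random.randint(0, 999999))

def main(count, out):
    # Write `count` newline-terminated records to `out`.
    for i in range(count):
        out.write(make_line(i) + "\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    main(int(sys.argv[1]), sys.stdout)
```

Run as, e.g., `python mkdata_sketch.py 1000 > small.txt` to produce a small file for quick experiments.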
These are the results (in seconds) of running the programs on different hardware with the same test data. For the first series, the disk cache was flushed before each run by rebooting the machine.
| Machine | Erlang R14B03 serial.erl | Erlang R14B03 parallel.erl | PyPy 1.5 serial.py |
|---------|--------------------------|----------------------------|--------------------|
- MBP = MacBook Pro 2.3 GHz Intel Core i7 / SSD
As of this writing, Erlang R14B03 seems to be relatively inefficient at plain file I/O. Buffering and parallel data processing help to achieve slightly better results, though. For anyone who wants to dive deeper into this matter, I recommend Jay Nelson's talk "Process-Striped Buffering with gen_stream" given at the Erlang Factory 2011.
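The buffered, fan-out approach described above can be sketched as follows. Module and function names here are illustrative, and the project's actual parallel.erl may be structured quite differently: the reader process opens the file with a large read-ahead buffer and round-robins lines to one worker per scheduler.

```erlang
%% parallel_sketch.erl -- hedged sketch of parallel line processing;
%% not the project's parallel.erl.
-module(parallel_sketch).
-export([start/1]).

start(Path) ->
    N = erlang:system_info(schedulers),
    Self = self(),
    Workers = [spawn_link(fun() -> worker(Self, 0) end) || _ <- lists:seq(1, N)],
    %% A large read-ahead buffer reduces the per-line I/O overhead.
    {ok, Dev} = file:open(Path, [read, raw, binary, {read_ahead, 1024 * 1024}]),
    dispatch(Dev, Workers, Workers),
    file:close(Dev),
    [W ! done || W <- Workers],
    collect(length(Workers), 0).

%% Hand each line to the next worker in round-robin order.
dispatch(Dev, [], All) ->
    dispatch(Dev, All, All);
dispatch(Dev, [W | Rest], All) ->
    case file:read_line(Dev) of
        {ok, Line} -> W ! {line, Line}, dispatch(Dev, Rest, All);
        eof -> ok
    end.

worker(Parent, Count) ->
    receive
        {line, _Line} ->
            %% Process the record here; this sketch only counts lines.
            worker(Parent, Count + 1);
        done ->
            Parent ! {count, Count}
    end.

%% Sum the per-worker line counts.
collect(0, Acc) -> Acc;
collect(N, Acc) ->
    receive {count, C} -> collect(N - 1, Acc + C) end.
```

Since the records are order-independent, no effort is spent reassembling results in input order; each worker simply reports its own tally.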
- See Tim Bray's Wide Finder Project.
- Jay Nelson on Process-Striped Buffering with gen_stream.