Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Apache log file compressor
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
__________ .__ .___ \____ /____ | | __| _/____ / // __ \| | / __ |\__ \ __/ /\ ___/| |__/ /_/ | / __ \_ /_______ \___ >____/\____ |(____ / \/ \/ \/ \/ A tool to turn plain text apache formatted logs into a binary format. Binary logs are around 15% the size of plain text, and compressed binary is around 50% smaller than compressed text. Usage: o) ./zelda.py access.log will output access.log.loz o) ./zelda.py access.log.loz will output access.log.loz.deloz o) Both modes accept a second argument to specify a different output file Example compression: o) The maximum-compression.com "fighter-planes.com" benchmark 20617071 fp.log # original, 20MB 3329095 fp.log.loz # loz, 3.5MB 708531 fp.log.lzma # compressed original, 700KB 450750 fp.log.loz.lzma # compressed loz, 450KB o) One of my own websites, around 5 million log entries: 1342318213 access.log # original, 1.3GB 200853665 access.log.loz # loz, 200MB 74790313 access.log.lzma # compressed original, 75MB 41398420 access.log.loz.lzma # compressed loz, 40MB Speed: o) The 1.3GB dataset is compressed in under 4 minutes on my Pentium-M 1.7GHz laptop; the entire development process of v0.1 of this program was faster than piping the file through lzma, once, on a 2.8GHz box :-) o) Decompression of that dataset takes under 2 minutes Accuracy: o) Timezone is assumed to be constant, though there is room within the format to allow a "change timezone" escape code, but this is currently unimplemented o) For files where the timezone doesn't change: 0b52bbe953337e9b9c6b80da707d2403 access.log 0b52bbe953337e9b9c6b80da707d2403 access.log.loz.deloz o) Corrupt lines (of which my test dataset has many) are escaped and stored as plain text