![](header.jpg)

# Comparing File Storage

Kevin J. Walchko, PhD

9 Aug 2017

---

OK, this is far from complete, but I was interested in comparing simple serialization formats like `pickle`, `msgpack`, and classic `json`, with and without `gzip` compression. Wait, what about XML? I hate parsing XML and it is old. If you really like ASCII, then use `json`.

In [6]:
import pickle
import gzip
import os
import simplejson as json
import msgpack

In [7]:
# create a bunch of data
a = {}
a['a'] = list(range(10000))
a['b'] = list(range(20000))
a['c'] = list(range(30000))
# a['d'] = bytearray(2000000) # json won't handle binary data
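
The commented-out line above is there because JSON has no binary type. A minimal sketch of what happens if you try anyway, assuming the standard library `json` module (used here in place of `simplejson`) and a small 16-byte stand-in for the 2 MB buffer:

```python
import json
import pickle

# Small stand-in for the commented-out 2 MB bytearray (size is an assumption)
payload = {"d": bytearray(16)}

# json has no binary type, so encoding a bytearray raises TypeError
try:
    json.dumps(payload)
    json_ok = True
except TypeError:
    json_ok = False

# pickle stores the bytearray and restores it with the same type
restored = pickle.loads(pickle.dumps(payload, pickle.HIGHEST_PROTOCOL))

print(json_ok)
print(type(restored["d"]).__name__, restored == payload)
```

If binary data has to go into JSON, it needs to be converted to text first (base64 is a common choice), which adds size and an extra encode/decode step.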

In [8]:
%%timeit
with open("data.pickle", 'wb') as f:
    f.write(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))

2.43 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%%timeit
with gzip.open("data.pickle.gz", 'wb') as f:
    f.write(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))

8.27 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]:
%%timeit
with open("data.json", 'w') as f:
    json.dump(a, f)

104 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [17]:
%%timeit
with gzip.open("data.json.gz", 'wb') as f:
    f.write(json.dumps(a).encode("utf-8"))

53.1 ms ± 6.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
%%timeit
with open("data.msgpack", 'wb') as f:
    f.write(msgpack.packb(a, use_bin_type=True))

3.93 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
print(f"Pickle: {os.path.getsize('data.pickle')//1024} kb")
print(f"Pickle gzip: {os.path.getsize('data.pickle.gz')//1024} kb")
print(f"Json: {os.path.getsize('data.json')//1024} kb")
print(f"Json gzip: {os.path.getsize('data.json.gz')//1024} kb")
print(f"Msgpack: {os.path.getsize('data.msgpack')//1024} kb")

Pickle: 175 kb
Pickle gzip: 103 kb
Json: 377 kb
Json gzip: 130 kb
Msgpack: 174 kb


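So the fastest is `pickle` and the smallest file is gzipped `pickle` (uncompressed, `pickle` and `msgpack` are nearly identical in size), while the largest and slowest is `json`. Note that the plain `json` test calls `json.dump` while the gzipped one calls `json.dumps`, so those two timings are not a like-for-like comparison. `msgpack` came close to `pickle`, but it does not distinguish `list` from `tuple`: both are packed as the same array type, so you have to pick one when unpacking. `pickle` handles almost any Python object, but it is Python-specific.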
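
To make the list/tuple point concrete, here is a small sketch using only the standard library. `json` has the same single array type that `msgpack` has, so it serves as a stand-in here. With `msgpack` itself, my understanding is that the `use_list` argument of `msgpack.unpackb` chooses list or tuple for every array at once. The `record` dictionary is made up for illustration.

```python
import json
import pickle

# Hypothetical record holding the same values once as a list, once as a tuple
record = {"as_list": [1, 2, 3], "as_tuple": (1, 2, 3)}

# pickle remembers the exact Python type of each container
via_pickle = pickle.loads(pickle.dumps(record, pickle.HIGHEST_PROTOCOL))

# json only has one array type, so the tuple comes back as a list
via_json = json.loads(json.dumps(record))

print(type(via_pickle["as_tuple"]).__name__)  # tuple
print(type(via_json["as_tuple"]).__name__)    # list
```

If the distinction matters to your code, either stick with `pickle` or convert back to tuples yourself after loading.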

| Protocol  | Size (kb) | Time (ms) | Cross Platform |
|-----------|-----------|-----------|----------------|
| json      | 377       | 104       | x              |
| json-gz   | 130       | 53.1      | x              |
| pickle    | 175       | 2.43      |                |
| pickle-gz | 103       | 8.27      |                |
| msgpack   | 174       | 3.93      | x              |

**Note:** Re-running the above will give slightly different times, depending on the machine and whatever else the OS is doing at the moment.
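
Outside a notebook, where `%%timeit` is not available, the run-to-run spread can be checked with the standard `timeit` module. This is only a sketch: it serialises in memory instead of writing a file, so the numbers will not match the table, and the loop and repeat counts are arbitrary choices.

```python
import pickle
import statistics
import timeit

# Same test data as in the cells above
a = {"a": list(range(10000)), "b": list(range(20000)), "c": list(range(30000))}

def dump_pickle():
    # Serialise in memory only, so disk speed does not affect the result
    return pickle.dumps(a, pickle.HIGHEST_PROTOCOL)

loops = 20  # calls per run (arbitrary)
runs = timeit.repeat(dump_pickle, repeat=5, number=loops)

# timeit reports total seconds per run; convert to milliseconds per call
per_loop_ms = [1000 * t / loops for t in runs]

mean_ms = statistics.mean(per_loop_ms)
std_ms = statistics.stdev(per_loop_ms)
print(f"{mean_ms:.2f} ms ± {std_ms:.2f} ms per loop")
```

Swapping `dump_pickle` for a function that calls `json.dumps` or `msgpack.packb` gives the same kind of measurement for the other formats.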