![](https://images.pexels.com/photos/373290/pexels-photo-373290.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260)

# Comparing File Storage

Kevin J. Walchko, PhD

9 Aug 2017

---

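OK, this is far from complete, but I was interested in comparing simple ways of storing Python data on disk: `pickle` (with and without `gzip`), `json`, and `msgpack`. Each cell below times one write of the same dictionary, and the last cell compares the resulting file sizes.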

In [36]:
import pickle
import gzip
import os
import simplejson as json
import msgpack

In [29]:
# create a bunch of data
a = {}
a['a'] = list(range(10000))
a['b'] = list(range(20000))
a['c'] = list(range(30000))
# a['d'] = bytearray(2000000) # json won't handle binary data
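
The commented-out line above notes that JSON cannot store binary data. A minimal sketch of that limitation follows, assuming the standard-library `json` module rather than `simplejson`, and a tiny `payload` dictionary (a hypothetical stand-in for `a['d']`):

```python
import json
import pickle

# Small stand-in for the commented-out a['d'] = bytearray(2000000)
payload = {"raw": bytearray(b"\x00\x01\x02\xff")}

# JSON has no binary type, so the encoder rejects bytearray objects.
try:
    json.dumps(payload)
    json_ok = True
except TypeError as err:
    json_ok = False
    print(f"json failed: {err}")

# pickle stores the bytearray and restores it as the same type.
restored = pickle.loads(pickle.dumps(payload, pickle.HIGHEST_PROTOCOL))
pickle_ok = restored == payload
print(f"pickle round-trip ok: {pickle_ok}")
```

This is why the binary entry is left out of the comparison: including it would make the JSON cell fail rather than just run slower.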

In [43]:
%%timeit
with open("data.pickle", 'wb') as f:
    f.write(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))

2.8 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
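
`%%timeit` is an IPython cell magic, so it only works inside a notebook or IPython session. Outside of one, a roughly equivalent measurement can be sketched with the standard-library `timeit` module. The helper name `write_pickle` and the temporary directory are assumptions made for this example, not part of the original notebook:

```python
import os
import pickle
import tempfile
import timeit

# Same test data as the notebook
data = {"a": list(range(10000)), "b": list(range(20000)), "c": list(range(30000))}


def write_pickle(path):
    """Write the test data to `path` with the highest pickle protocol."""
    with open(path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pickle")
    # 7 repeats of 10 calls each, mirroring the %%timeit settings reported above
    runs = timeit.repeat(lambda: write_pickle(path), repeat=7, number=10)

# Convert each run (total seconds for 10 calls) into milliseconds per call
per_call_ms = [1000 * t / 10 for t in runs]
print(f"best: {min(per_call_ms):.2f} ms, mean: {sum(per_call_ms) / len(per_call_ms):.2f} ms")
```

The absolute numbers depend on the machine and disk, so they will not necessarily match the figures recorded in this notebook.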


In [42]:
%%timeit
with gzip.open("data.pickle.gz", 'wb') as f:
    f.write(pickle.dumps(a, pickle.HIGHEST_PROTOCOL))

7.44 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [41]:
%%timeit
with open("data.json", 'w') as f:
    json.dump(a,f)

77.3 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [40]:
%%timeit
with open("data.msgpack", 'wb') as f:
    f.write(msgpack.packb(a, use_bin_type=True))

5.84 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [45]:
print(f"Pickle: {os.path.getsize('data.pickle')//1024} KB")
print(f"Pickle gzip: {os.path.getsize('data.pickle.gz')//1024} KB")
print(f"Json: {os.path.getsize('data.json')//1024} KB")
print(f"Msgpack: {os.path.getsize('data.msgpack')//1024} KB")

Pickle: 175 KB
Pickle gzip: 103 KB
Json: 377 KB
Msgpack: 174 KB
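
The cells above only measure writes. A quick read-back check, sketched below, confirms that each file restores the same data and reproduces the size comparison in one script. It assumes the standard-library `json` module in place of `simplejson`, leaves out `msgpack` to stay dependency-free, and writes to a temporary directory rather than the notebook's working directory:

```python
import gzip
import json
import os
import pickle
import tempfile

# Same test data as the notebook
data = {"a": list(range(10000)), "b": list(range(20000)), "c": list(range(30000))}

with tempfile.TemporaryDirectory() as tmp:
    paths = {
        "pickle": os.path.join(tmp, "data.pickle"),
        "pickle.gz": os.path.join(tmp, "data.pickle.gz"),
        "json": os.path.join(tmp, "data.json"),
    }

    # Write each format
    with open(paths["pickle"], "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    with gzip.open(paths["pickle.gz"], "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    with open(paths["json"], "w") as f:
        json.dump(data, f)

    # Read each format back
    with open(paths["pickle"], "rb") as f:
        from_pickle = pickle.load(f)
    with gzip.open(paths["pickle.gz"], "rb") as f:
        from_gzip = pickle.load(f)
    with open(paths["json"]) as f:
        from_json = json.load(f)

    # File sizes in bytes, collected before the directory is removed
    sizes = {name: os.path.getsize(path) for name, path in paths.items()}

for name, size in sizes.items():
    print(f"{name}: {size // 1024} KB")
print("all round-trips equal:", from_pickle == from_gzip == from_json == data)
```

JSON round-trips cleanly here only because the dictionary keys are strings and the values are lists of integers; other key or value types may come back changed or fail to serialize.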


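So for this data, plain `pickle` is the fastest to write, gzipped `pickle` is the smallest (about 40% smaller than plain `pickle`, at roughly 2.7 times the write time), and `json` is the worst on both counts (about 28 times slower and twice as large as plain `pickle`). `msgpack` came close to `pickle`: the file size is essentially the same and the write took about twice as long, although that timing was noisy. Its drawback is that it does not distinguish `list` from `tuple`: both are packed as the same array type, so when unpacking you have to pick one for everything, while `pickle` preserves both and handles nearly any Python object.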
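
The list/tuple point can be illustrated with a short sketch. The `record` dictionary is a made-up example, and the `msgpack` part is skipped if the package is not installed; `use_list` is the `msgpack.unpackb` option that selects whether arrays come back as lists (the default) or as tuples:

```python
import pickle

try:
    import msgpack  # third-party; already used by this notebook
except ImportError:
    msgpack = None

# Hypothetical record mixing a tuple and a list
record = {"point": (1, 2), "values": [3, 4]}

# pickle keeps the tuple a tuple and the list a list.
restored = pickle.loads(pickle.dumps(record, pickle.HIGHEST_PROTOCOL))
print("pickle:", type(restored["point"]).__name__, type(restored["values"]).__name__)

if msgpack is not None:
    packed = msgpack.packb(record, use_bin_type=True)

    # Default (use_list=True): every array is unpacked as a list.
    as_lists = msgpack.unpackb(packed, raw=False)
    # use_list=False: every array is unpacked as a tuple.
    as_tuples = msgpack.unpackb(packed, raw=False, use_list=False)

    print("msgpack default:", type(as_lists["point"]).__name__, type(as_lists["values"]).__name__)
    print("msgpack use_list=False:", type(as_tuples["point"]).__name__, type(as_tuples["values"]).__name__)
```

Either way the original mix of types is lost with `msgpack`, so code that depends on the distinction has to convert after loading.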