# Testing pickle protocol

In [1]:
import pickle
from datetime import datetime, date

In [2]:
class DataFrame(object):
    def __init__(self):
        # One header and one type name per column (each row has five columns).
        self.headers = ["header-{}".format(i) for i in range(5)]
        self.datatype = [i.__name__ for i in [str, int, float, datetime, date]]
        self.data = [(str(i), i, float(i), datetime(2001, 1, 1, 12, 34, 56), date(2001, 1, 1)) for i in range(20000)]

In [3]:
df = DataFrame()

In [4]:
len(df.data)

20000

Now that we have a 'dataframe', the next question is how fast we can serialize it.

In [5]:
%timeit pickle.dumps(df)

60.5 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [6]:
%timeit pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)

58.2 ms ± 771 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


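Going by these timings, the second statement with `protocol=pickle.HIGHEST_PROTOCOL` (the newest protocol) is only about 4% faster (58.2 ms vs. 60.5 ms), so on its own the choice of protocol makes little difference.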
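
Before comparing timings, it may help to check which protocol `pickle.dumps(df)` actually uses by default and how large the output is for each protocol. The following is a minimal sketch, not part of the original notebook: `rows` and `sizes` are illustrative names, and `rows` is a smaller stand-in (2,000 rows) for `df.data`.

```python
import pickle
from datetime import date, datetime

# Smaller stand-in for df.data (2,000 rows instead of 20,000) so this runs quickly.
rows = [
    (str(i), i, float(i), datetime(2001, 1, 1, 12, 34, 56), date(2001, 1, 1))
    for i in range(2000)
]

# The protocol used when no `protocol` argument is given, and the newest one available.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Serialized size for every protocol this interpreter supports.
sizes = {
    p: len(pickle.dumps(rows, protocol=p))
    for p in range(pickle.HIGHEST_PROTOCOL + 1)
}
for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size:>8} bytes")
```

Both constants depend on the Python version, so the gap between "default" and "highest" is not the same on every interpreter.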

In [7]:
import zlib

In [8]:
%timeit zlib.compress(pickle.dumps(df))

177 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [9]:
%timeit zlib.compress(pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL))

86.5 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


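In combination with `zlib`, both statements take longer than pickling alone, but the gap between them widens sharply: the statement with `protocol=pickle.HIGHEST_PROTOCOL` takes 86.5 ms, about half of the 177 ms needed with the default protocol. This seems counter-intuitive, because without compression the two protocols took almost the same time (58.2 ms vs. 60.5 ms).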
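
One way to investigate is to time the two stages separately, since the cost of `zlib.compress` depends on the size and content of the bytes it is given. The sketch below is an assumption about how to set up such a measurement, not the method used in the notebook: `rows` and `results` are illustrative names, `rows` is a smaller stand-in for `df.data`, and the standard `timeit` module replaces the `%timeit` magic so it runs outside IPython.

```python
import pickle
import timeit
import zlib
from datetime import date, datetime

# Smaller stand-in for df.data (2,000 rows instead of 20,000) so this runs quickly.
rows = [
    (str(i), i, float(i), datetime(2001, 1, 1, 12, 34, 56), date(2001, 1, 1))
    for i in range(2000)
]

RUNS = 5  # calls per timing sample

results = {}
# A set avoids measuring twice if default and highest protocol are the same.
for protocol in sorted({pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL}):
    payload = pickle.dumps(rows, protocol=protocol)
    compressed = zlib.compress(payload)

    # Best-of-three average time per call for each stage on its own.
    pickle_s = min(timeit.repeat(
        lambda: pickle.dumps(rows, protocol=protocol), number=RUNS, repeat=3)) / RUNS
    zlib_s = min(timeit.repeat(
        lambda: zlib.compress(payload), number=RUNS, repeat=3)) / RUNS

    results[protocol] = (len(payload), len(compressed), pickle_s, zlib_s)
    print(f"protocol {protocol}: raw={len(payload)} B, compressed={len(compressed)} B, "
          f"pickle={pickle_s * 1e3:.2f} ms, zlib={zlib_s * 1e3:.2f} ms")
```

If the `zlib` column differs between protocols while the `pickle` column does not, the difference in the combined timings comes from the compression step rather than from pickling itself.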

Q: What is going on?