In [1]:
import sys
print("Python Version:", sys.version, '\n')

Python Version: 3.8.3 (default, Jul  2 2020, 11:26:31) 
[Clang 10.0.0 ] 



# Pickle: Saving Objects for Later

Often in data science, we'll create some model or some version of our data and want to use it later. We have many options - we can save the coefficients, or save the data to csv, or...

Actually, we don't have that many options. 

One way to overcome that is to save the Python object to a file as a serialized object. That means we convert the entire object to a sequence of bytes, save those bytes to a file, and later unpack them back into an equivalent object.

This is done by a module called `pickle`. Let's see it in action.
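
Before touching any files, here is a minimal in-memory sketch of that round trip. It uses `pickle.dumps` and `pickle.loads`, which work on bytes instead of files; the `original` dictionary is a made-up sample, not the notebook's data.

```python
import pickle

# Hypothetical sample object, used only for illustration.
original = {"CA": [15, 2, 65], "IL": [20, 15, 62]}

# Serialize: the whole object becomes a single bytes value.
payload = pickle.dumps(original)

# Deserialize: rebuild an equal, but separate, object from those bytes.
restored = pickle.loads(payload)

print(type(payload))         # the serialized form is a bytes object
print(restored == original)  # contents are the same
print(restored is original)  # but it is a new object
```

The file-based `pickle.dump` and `pickle.load` used in the rest of this notebook do the same thing, writing the bytes to an open file instead of returning them.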

In [2]:
import pickle
import random

lots_of_noise = {
    'CA': [random.randint(0,65) for _ in range(100)],
    'IL': [random.randint(0,65) for _ in range(50)],
    'NY': [random.randint(0,65) for _ in range(90)],
    'WA': [random.randint(0,65) for _ in range(33)]
}

In [3]:
print(lots_of_noise)

{'CA': [15, 2, 65, 29, 6, 49, 14, 52, 54, 54, 56, 19, 59, 17, 38, 64, 57, 58, 1, 41, 61, 5, 49, 26, 35, 22, 32, 43, 47, 32, 54, 10, 16, 31, 40, 37, 53, 14, 49, 26, 17, 64, 7, 53, 50, 16, 57, 1, 45, 34, 4, 32, 29, 43, 37, 22, 63, 22, 31, 29, 10, 4, 17, 5, 8, 54, 18, 61, 47, 37, 39, 61, 55, 20, 36, 36, 2, 10, 47, 37, 10, 20, 33, 38, 22, 12, 0, 45, 6, 19, 59, 0, 54, 45, 42, 3, 48, 15, 6, 8], 'IL': [20, 15, 62, 15, 45, 51, 62, 53, 36, 17, 59, 49, 33, 2, 54, 38, 29, 64, 0, 12, 48, 2, 41, 42, 28, 14, 6, 57, 50, 46, 47, 63, 29, 10, 52, 6, 10, 37, 22, 5, 59, 39, 62, 27, 57, 19, 27, 5, 26, 42], 'NY': [32, 65, 14, 12, 28, 41, 56, 22, 38, 5, 19, 19, 20, 11, 35, 12, 40, 61, 36, 3, 28, 63, 53, 29, 18, 22, 7, 35, 43, 30, 56, 19, 43, 4, 40, 52, 56, 33, 35, 38, 9, 2, 28, 21, 59, 2, 53, 25, 53, 35, 8, 7, 44, 59, 13, 32, 17, 31, 5, 63, 35, 2, 12, 21, 54, 6, 3, 48, 14, 65, 35, 8, 29, 21, 50, 20, 28, 27, 15, 38, 34, 34, 54, 4, 15, 24, 35, 13, 14, 22], 'WA': [51, 36, 37, 8, 26, 64, 1, 48, 24, 60, 59, 40, 1

In [4]:
whos

Variable        Type      Data/Info
-----------------------------------
lots_of_noise   dict      n=4
pickle          module    <module 'pickle' from '/U<...>lib/python3.8/pickle.py'>
random          module    <module 'random' from '/U<...>lib/python3.8/random.py'>
sys             module    <module 'sys' (built-in)>


We can see in this `whos` command that the object `lots_of_noise` exists and is a `dict` with 4 keys. Nice. Now let's look at our file system and check for a file called `noise.pickle`. (In this run it already exists from an earlier session, so the next step will overwrite it.)

In [5]:
!ls

advanced_python_datatypes.ipynb       my_dataframe.pickle
complexity.md                         noise.pickle
deep_and_shallow_copy.ipynb           pickle_saving_objects_for_later.ipynb
deep_copy_demo                        readme.md


Okay, now we're ready to create a file and write the bytes to it. To do this with `pickle`, we use Python's built-in `open` function to create a file in write-binary (`wb`) mode. We'll then use `pickle.dump` to write an object into that file as a sequence of bytes.

In [6]:
with open('noise.pickle', 'wb') as to_write:
    pickle.dump(lots_of_noise, to_write)
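
As an alternative to eyeballing `!ls`, the check can be done from Python itself. The sketch below is an assumption-laden stand-in for the cell above: it writes a small hypothetical `sample` dictionary into a temporary directory (so it does not touch the notebook's own files) and uses `pathlib` to confirm the file appears.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the notebook's `lots_of_noise` dictionary.
sample = {"CA": [1, 2, 3], "WA": [4, 5]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "noise.pickle"
    existed_before = pickle_path.exists()

    # Same pattern as the notebook cell: open in 'wb' mode, then dump.
    with open(pickle_path, "wb") as to_write:
        pickle.dump(sample, to_write)

    exists_after = pickle_path.exists()
    size_in_bytes = pickle_path.stat().st_size

print(existed_before, exists_after, size_in_bytes > 0)
```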

In [7]:
!ls

advanced_python_datatypes.ipynb       my_dataframe.pickle
complexity.md                         noise.pickle
deep_and_shallow_copy.ipynb           pickle_saving_objects_for_later.ipynb
deep_copy_demo                        readme.md


Now let's delete `lots_of_noise` and prove to ourselves it doesn't exist in Python's memory anymore.

In [8]:
del lots_of_noise

In [9]:
whos

Variable   Type              Data/Info
--------------------------------------
pickle     module            <module 'pickle' from '/U<...>lib/python3.8/pickle.py'>
random     module            <module 'random' from '/U<...>lib/python3.8/random.py'>
sys        module            <module 'sys' (built-in)>
to_write   BufferedWriter    <_io.BufferedWriter name='noise.pickle'>


In [10]:
print(lots_of_noise)

NameError: name 'lots_of_noise' is not defined

Lovely. It's dead forever. Or is it?

Let's open that `noise.pickle` file in read-binary (`rb`) mode. Then we'll ask pickle to rebuild the object from the file with `pickle.load` and store it back in a variable.

In [11]:
with open('noise.pickle','rb') as read_file:
    new_noise = pickle.load(read_file)

In [12]:
print(new_noise)

{'CA': [15, 2, 65, 29, 6, 49, 14, 52, 54, 54, 56, 19, 59, 17, 38, 64, 57, 58, 1, 41, 61, 5, 49, 26, 35, 22, 32, 43, 47, 32, 54, 10, 16, 31, 40, 37, 53, 14, 49, 26, 17, 64, 7, 53, 50, 16, 57, 1, 45, 34, 4, 32, 29, 43, 37, 22, 63, 22, 31, 29, 10, 4, 17, 5, 8, 54, 18, 61, 47, 37, 39, 61, 55, 20, 36, 36, 2, 10, 47, 37, 10, 20, 33, 38, 22, 12, 0, 45, 6, 19, 59, 0, 54, 45, 42, 3, 48, 15, 6, 8], 'IL': [20, 15, 62, 15, 45, 51, 62, 53, 36, 17, 59, 49, 33, 2, 54, 38, 29, 64, 0, 12, 48, 2, 41, 42, 28, 14, 6, 57, 50, 46, 47, 63, 29, 10, 52, 6, 10, 37, 22, 5, 59, 39, 62, 27, 57, 19, 27, 5, 26, 42], 'NY': [32, 65, 14, 12, 28, 41, 56, 22, 38, 5, 19, 19, 20, 11, 35, 12, 40, 61, 36, 3, 28, 63, 53, 29, 18, 22, 7, 35, 43, 30, 56, 19, 43, 4, 40, 52, 56, 33, 35, 38, 9, 2, 28, 21, 59, 2, 53, 25, 53, 35, 8, 7, 44, 59, 13, 32, 17, 31, 5, 63, 35, 2, 12, 21, 54, 6, 3, 48, 14, 65, 35, 8, 29, 21, 50, 20, 28, 27, 15, 38, 34, 34, 54, 4, 15, 24, 35, 13, 14, 22], 'WA': [51, 36, 37, 8, 26, 64, 1, 48, 24, 60, 59, 40, 1

In [13]:
whos

Variable    Type              Data/Info
---------------------------------------
new_noise   dict              n=4
pickle      module            <module 'pickle' from '/U<...>lib/python3.8/pickle.py'>
random      module            <module 'random' from '/U<...>lib/python3.8/random.py'>
read_file   BufferedReader    <_io.BufferedReader name='noise.pickle'>
sys         module            <module 'sys' (built-in)>
to_write    BufferedWriter    <_io.BufferedWriter name='noise.pickle'>


Random noise lives! We retrieved the entire structure from file. Nice.
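
Printing both dictionaries and comparing by eye works, but the claim can also be checked directly. This is a small sketch, not the notebook's own code: it keeps the original in memory, round-trips a copy through a temporary file, and compares the two with `==`. The variable names and the two-state dictionary are illustrative.

```python
import pickle
import random
import tempfile
from pathlib import Path

# Smaller, illustrative version of the notebook's random dictionary.
noise = {
    state: [random.randint(0, 65) for _ in range(count)]
    for state, count in [("CA", 100), ("IL", 50)]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "noise.pickle"
    with open(path, "wb") as to_write:
        pickle.dump(noise, to_write)
    with open(path, "rb") as read_file:
        reloaded = pickle.load(read_file)

# Equal contents, key by key and value by value.
round_trip_ok = reloaded == noise
print(round_trip_ok)
```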

### Okay, but I don't use dictionaries... I use pandas.

In [14]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.uniform(-10,10, size=(100,4)), columns=['Yay','specific','column','names'])
df.head(5)

         Yay   specific     column      names
0   8.616097   4.188456  -1.824472   8.336657
1   9.998942   0.785795   6.109844  -1.930751
2   9.104448  -1.906847  -3.595214   -4.90624
3   3.596761   8.534413  -0.432684   5.181858
4  -9.740232   5.565501   7.609525  -6.150188


In [15]:
with open('my_dataframe.pickle', 'wb') as to_write:
    pickle.dump(df, to_write)

In [16]:
del df

df.head(5)

NameError: name 'df' is not defined

In [17]:
with open('my_dataframe.pickle','rb') as read_file:
    new_df = pickle.load(read_file)
    
new_df.head(5)

         Yay   specific     column      names
0   8.616097   4.188456  -1.824472   8.336657
1   9.998942   0.785795   6.109844  -1.930751
2   9.104448  -1.906847  -3.595214   -4.90624
3   3.596761   8.534413  -0.432684   5.181858
4  -9.740232   5.565501   7.609525  -6.150188
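
pandas also ships its own shortcuts for this: `DataFrame.to_pickle` and `pd.read_pickle`. The sketch below assumes pandas and NumPy are installed and writes to a temporary directory instead of the notebook folder; it should behave like the `open` plus `pickle.dump` / `pickle.load` cells above.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.uniform(-10, 10, size=(100, 4)),
    columns=["Yay", "specific", "column", "names"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_dataframe.pickle"
    df.to_pickle(path)             # save the DataFrame as a pickle file
    new_df = pd.read_pickle(path)  # load it back

# DataFrame.equals compares shape, labels, and values.
print(new_df.equals(df))
```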


Pickle is a great tool. One recommended way of using it is to make it an end point of every step in your process. Example:

* I got my data! Nice. Pickle it and stop your "getting the data" notebook.
* Load your data from pickle. Clean it. Save your clean data to a new pickle.
* Load your cleaned_data pickle. Do analysis and visualize it.

This can provide natural "pick-up-where-I-left-off-but-before-I-broke-my-data" points. It's a great way to control the flow of your data.
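
One way to sketch that checkpoint pattern in code is a small helper that loads a pickle if it exists and otherwise computes and saves the result. Everything here is hypothetical: `load_or_compute`, `get_raw_data`, `clean`, and the file names are illustrative placeholders, not part of `pickle` or of this notebook.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled result at `path`, computing and saving it if missing."""
    path = Path(path)
    if path.exists():
        with open(path, "rb") as read_file:
            return pickle.load(read_file)
    result = compute()
    with open(path, "wb") as to_write:
        pickle.dump(result, to_write)
    return result


def get_raw_data():
    # Placeholder for an expensive download or database query.
    return [" 3", "1 ", None, "2"]


def clean(raw):
    # Drop missing entries and convert the rest to sorted integers.
    return sorted(int(item) for item in raw if item is not None)


with tempfile.TemporaryDirectory() as checkpoint_dir:
    raw_path = Path(checkpoint_dir) / "raw_data.pickle"
    clean_path = Path(checkpoint_dir) / "cleaned_data.pickle"

    raw = load_or_compute(raw_path, get_raw_data)
    cleaned = load_or_compute(clean_path, lambda: clean(raw))

    # A second call reads the checkpoint instead of recomputing.
    cleaned_again = load_or_compute(clean_path, lambda: clean(["99"]))

print(cleaned, cleaned_again)
```

Because the second call finds `cleaned_data.pickle` on disk, its `compute` function is never run. Deleting a checkpoint file is how you force that stage to run again.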

#### Resources

https://docs.python.org/3.7/library/pickle.html

### More pickle testing since those files already existed: 

In [29]:
# Create data frame
cddf = pd.DataFrame(np.random.uniform(-10,10, size=(100,4)), columns=['These','puppies','are','columns'])
cddf.head(5)


       These    puppies        are    columns
0  -6.864736  -8.109308  -3.405733  -1.689295
1   8.289833   7.152339  -7.960152   9.282634
2  -7.538788   7.232463   9.824188   6.098662
3  -9.743255  -0.068658   5.455818  -6.661309
4   4.015507  -0.633315   5.254363   3.117177


In [30]:
# Create pickle file
with open('cd_dataframe.pickle', 'wb') as to_write:
    pickle.dump(cddf, to_write)

In [31]:
# delete new data frame
del cddf

In [32]:
# Check to see if df still exists
cddf.head(5)

NameError: name 'cddf' is not defined

In [34]:
with open('cd_dataframe.pickle','rb') as read_file:
    new_cddf = pickle.load(read_file)
    
new_cddf.head(5)

       These    puppies        are    columns
0  -6.864736  -8.109308  -3.405733  -1.689295
1   8.289833   7.152339  -7.960152   9.282634
2  -7.538788   7.232463   9.824188   6.098662
3  -9.743255  -0.068658   5.455818  -6.661309
4   4.015507  -0.633315   5.254363   3.117177


In [35]:
!ls

advanced_python_datatypes.ipynb       my_dataframe.pickle
cd_dataframe.pickle                   noise.pickle
complexity.md                         pickle_saving_objects_for_later.ipynb
deep_and_shallow_copy.ipynb           readme.md
deep_copy_demo


## Some more notes

### What can you do with pickle?
Pickling is useful when you need your data to persist. Your program's state can be saved to disk so you can continue working on it later. Pickled bytes can also be sent over a TCP (Transmission Control Protocol) or other socket connection, or stored in a database. Pickle is especially handy in machine learning, where you can save a trained model and use it to make new predictions later without retraining it.
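
As a rough illustration of the database case, the sketch below stores a pickled object as a BLOB in an in-memory SQLite database using the standard-library `sqlite3` module. The table name, column names, and the `state` dictionary are all made up for this example.

```python
import pickle
import sqlite3

# Hypothetical program state we want to keep for later.
state = {"step": 3, "weights": [0.25, -1.5, 4.0]}

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE saved_objects (name TEXT PRIMARY KEY, payload BLOB)"
)

# pickle.dumps returns bytes, which sqlite3 stores as a BLOB.
connection.execute(
    "INSERT INTO saved_objects (name, payload) VALUES (?, ?)",
    ("program_state", pickle.dumps(state)),
)

(payload,) = connection.execute(
    "SELECT payload FROM saved_objects WHERE name = ?", ("program_state",)
).fetchone()
restored_state = pickle.loads(payload)
connection.close()

print(restored_state == state)
```

The same bytes could be written to a socket instead. Since `pickle.loads` can run arbitrary code while rebuilding an object, this only makes sense between programs that trust each other.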

### What can be pickled? 
* Booleans
* Integers
* Floats
* Complex numbers
* Strings, `bytes`, and `bytearray` objects
* Tuples, lists, sets, and dictionaries that contain only picklable objects

All of the above can be pickled. So can functions and classes, provided they are defined at the top level of a module: pickle stores them by their qualified name rather than by value, so the module must be importable when the object is loaded.
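
A small sketch of what "top level of a module" buys you, using only standard-library objects (`math.sqrt` and `collections.OrderedDict` are chosen purely as convenient examples). Because functions and classes are pickled as a reference to their module and name, loading them hands back the very same object.

```python
import math
import pickle
from collections import OrderedDict

# Functions and classes are stored as "module + name", not as code.
restored_function = pickle.loads(pickle.dumps(math.sqrt))
restored_class = pickle.loads(pickle.dumps(OrderedDict))

print(restored_function is math.sqrt)  # same function object
print(restored_class is OrderedDict)   # same class object

# Instances of importable classes round-trip as new, equal objects.
ordered = OrderedDict(a=1, b=2)
restored_instance = pickle.loads(pickle.dumps(ordered))
print(restored_instance == ordered)
```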

Not everything can be pickled (easily), though: examples include generators, classes defined inside functions, and lambda functions. For lambda functions, you need an additional package such as `dill`. A `defaultdict` can be pickled only if its default factory can be: one built with a lambda fails, so use a built-in such as `int` or `list`, or a module-level function, as the factory.
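
These limits are easy to probe. The sketch below defines a hypothetical helper, `is_picklable`, that simply tries `pickle.dumps` and reports whether it worked; the exception types it catches are the ones the standard `pickle` module is expected to raise for these cases.

```python
import pickle
from collections import defaultdict


def is_picklable(obj):
    """Hypothetical helper: True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


results = {
    "lambda": is_picklable(lambda x: x + 1),
    "generator": is_picklable(n for n in range(3)),
    "defaultdict(lambda: 0)": is_picklable(defaultdict(lambda: 0)),
    "defaultdict(int)": is_picklable(defaultdict(int)),
}

for name, ok in results.items():
    print(f"{name}: {'picklable' if ok else 'not picklable'}")
```

Only the `defaultdict(int)` case is expected to succeed, because `int` can be found again by name when the pickle is loaded.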

## Further Reading
* [Python Wiki: Using Pickle](https://wiki.python.org/moin/UsingPickle)
* [Python Docs: Pickle](https://docs.python.org/3/library/pickle.html)
* [Real Python: Python Pickle Module](https://realpython.com/python-pickle-module/)
* [Data Camp: Pickle Python Tutorial](https://www.datacamp.com/community/tutorials/pickle-python-tutorial)