# Module 5: Pickle

### Introduction
There are many different file types used in the data world. Common examples are .csv, .json and .xml, which are easy to read and/or write and are used extensively across programming languages. However, sometimes you want to save a Python data object directly, such as a dictionary, list, tuple or even a fully trained machine learning algorithm created in Python. This is where the *pickle* module comes in.

This module gives you an in-depth look at how to best use the *pickle* module to save and load Python data objects. The outline is:
1. Pickle file basics
2. The Python *pickle* module
3. Speed comparison of pickle files and other ways of saving data objects
4. Saving and loading a fully trained machine learning algorithm

Enjoy!

In [None]:
# Import all the packages needed for sections 1 to 3
import os
import pickle
import numpy as np
import pandas as pd

# Section 1: Pickle files

Pickle is a module in the Python standard library that serializes Python object structures. Most Python objects can be pickled and saved to disk, although a few cannot (for example, open file handles and lambda functions). The format is specific to Python. Serializing means converting an object in memory into a byte stream that can be stored as a binary file on disk. When the file is loaded back into a Python program, the byte stream is de-serialized into a Python object again.
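The round trip described above can be sketched in memory, without touching the disk. This is a minimal illustration using `pickle.dumps` and `pickle.loads`; the `record` dictionary is a made-up example object, not part of the course data.

```python
import pickle

# A hypothetical nested object, used only for illustration.
record = {"name": "example", "scores": [1, 2, 3], "shape": (2, 3)}

# Serialize: Python object -> byte stream.
payload = pickle.dumps(record)
print(type(payload))       # <class 'bytes'>

# De-serialize: byte stream -> a new Python object with the same contents.
restored = pickle.loads(payload)
print(restored == record)  # True
print(restored is record)  # False, it is a copy and not the same object
```

`pickle.dump` and `pickle.load` do the same thing, but write to and read from an open binary file instead of a bytes object.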

Besides being able to store Python objects, pickle has another advantage: speed. Later in the module we'll focus a bit more on the speed of pickle, but let's look at an example below to get an idea.

First, we have to create an object to store. Then we'll compare how long pandas and pickle take to store this object.

In [None]:
# Create a Pandas DataFrame to store.

np.random.seed(0)  # Call the function; assigning to it would overwrite it without seeding.
df_size = 1_000_000

df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})

display(df)

Let's store the Pandas DataFrame as a .csv file using the pandas library.

In [None]:
%%time

# Save the Pandas DataFrame as a .csv file.
df.to_csv('data.csv', index=False)

Let's store the Pandas DataFrame as a pickle file using the pickle library.

In [None]:
%%time

# Save the Pandas DataFrame as a pickle file.
with open('dataframe.pickle', mode='wb') as file:
    pickle.dump(df, file) 

When you compare the two timings, there should be quite a difference: in this example, pickling the DataFrame is much faster than writing it to a .csv file. Now that we have seen an example, we can take a look at the pickle library.

In the folder you are currently working in, some pickled files have already been saved. Pickled files are saved as binary files. Let's print all the files in the current folder to see them.

##### ASSIGNMENT 1: print all pickle files in the directory of the notebook
*Hint: use the skills learned in the file-system operations module*

In [None]:
list_of_pickle_files = []

#### ADD YOUR CODE HERE ####

#### STOP ADDING CODE HERE ####

print(list_of_pickle_files)

### **Theory: which file extensions are used for pickle files?**
Three pickle files were found. They can be easily recognized by their *.pickle* extension. However, this is not the only extension used for pickle files.

For instance, the [Python 2 documentation](https://docs.python.org/2/library/pickle.html#example) uses *.pkl* as an extension.
Other examples online might use *.p* as an extension.
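The extension is only a naming convention: pickle does not look at the file name when reading or writing. A minimal sketch, assuming a made-up object `data` and a temporary directory so that no files are left behind:

```python
import os
import pickle
import tempfile

data = {"answer": 42}  # hypothetical example object

loaded = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for extension in (".pickle", ".pkl", ".p"):
        path = os.path.join(tmp_dir, "example" + extension)

        # Write the same object under each extension ...
        with open(path, mode="wb") as file:
            pickle.dump(data, file)

        # ... and read it back in exactly the same way.
        with open(path, mode="rb") as file:
            loaded[extension] = pickle.load(file)

print(loaded)
```

All three files load to an equal object, so picking one extension and using it consistently is mostly a matter of readability.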

##### ASSIGNMENT 2: open each pickle file with open() and print its contents with .read()

In [None]:
for file in list_of_pickle_files:
    #### ADD YOUR CODE HERE ####
    print('')

Most likely, you cannot make any sense of the file contents. This is because pickle files contain Python objects that are serialized into a byte stream. The printed content shows the raw bytes.
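To get a feeling for what such a byte stream contains, the sketch below pickles a small list in memory and inspects the result. It uses `pickletools`, a standard-library module that prints the pickle instructions in readable form; the list and the choice of protocol 4 are assumptions made for this illustration.

```python
import pickle
import pickletools

# Pickle a small example list with an explicitly chosen protocol.
payload = pickle.dumps([1, 2, 3], protocol=4)

# The raw bytes are not meant to be read by humans.
print(payload)

# The stream starts with a protocol marker followed by the protocol number.
print(payload[:2])  # b'\x80\x04'

# pickletools.dis prints the instructions stored in the stream, one per line.
pickletools.dis(payload)
```

The disassembly shows that a pickle is a small sequence of instructions for rebuilding the object, which is why the raw file contents look unreadable.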

When you are unsure what kind of data object you are working with, it is often a good idea to check its data type with 'type()'.

##### ASSIGNMENT 3: print the data type of each pickled file's contents with type()

In [None]:
for file in list_of_pickle_files:
    #### ADD YOUR CODE HERE ####
    print('')

### **Theory - why use pickle?**
Compared to other file types, it might be unclear at this point why we would want to use pickle at all. Binary files aren't easy to read or use in general. So wouldn't it be more convenient to save data in a more generic file type such as XML, JSON or CSV?

The answer is: it depends on what type of information you want to save and/or transfer. Pickle is meant as a convenient file type for transferring **Python objects** specifically between different systems, environments or pieces of Python code. It is a fast and convenient way to transport most Python objects, because they do not have to be converted to a text format first.
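The difference with a generic format can be illustrated with a small comparison against JSON. This is a sketch only; the `settings` dictionary is a hypothetical object chosen because it contains a tuple and a set, two types JSON has no direct equivalent for.

```python
import json
import pickle

# Hypothetical object containing a tuple and a set.
settings = {"size": (640, 480), "tags": {"a", "b"}}

# pickle restores the object with its original types intact.
restored = pickle.loads(pickle.dumps(settings))
print(restored == settings)     # True
print(type(restored["size"]))   # <class 'tuple'>

# JSON silently turns the tuple into a list ...
as_json = json.loads(json.dumps({"size": settings["size"]}))
print(as_json["size"])          # [640, 480]

# ... and cannot serialize the set at all.
try:
    json.dumps(settings)
except TypeError as error:
    print("JSON failed:", error)
```

So when the data only has to travel between Python programs, pickle saves you from writing conversion code in both directions.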