# File Pickling
***
According to the __[Python documentation on the pickle module](https://docs.python.org/3/library/pickle.html)__, *pickling* is the process whereby a Python object hierarchy is converted into a byte stream, and *unpickling* is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.

<div class="alert alert-block alert-danger">
    <b>WARNING:</b> The pickle module is not secure. Only unpickle data you trust.
</div>

Consider signing data with __[Keyed-Hashing for Message Authentication](https://docs.python.org/3/library/hmac.html#module-hmac)__, abbreviated as **hmac**, if you need to ensure that it has not been tampered with. HMAC provides a way to check the integrity of information transmitted over or stored in an unreliable medium. A *message authentication code (MAC)* is computed from the data and a secret key; whoever reads the data recomputes the code with the same key and rejects the data if the two codes differ.
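
A minimal sketch of how signing could look, assuming an HMAC-SHA256 digest is simply prepended to the pickled bytes. The names `SECRET_KEY`, `sign_pickle` and `load_signed_pickle` are hypothetical and used for illustration only; this is one possible layout, not a prescribed one.

```python
import hashlib
import hmac
import pickle

# Hypothetical secret for illustration only; in practice load it from a secure store
SECRET_KEY = b"replace-with-a-real-secret"


def sign_pickle(obj):
    """Return the pickled bytes of obj, prefixed with an HMAC-SHA256 digest."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def load_signed_pickle(blob):
    """Verify the digest before unpickling; raise ValueError on a mismatch."""
    digest_size = hashlib.sha256().digest_size  # 32 bytes for SHA-256
    digest, payload = blob[:digest_size], blob[digest_size:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest performs a constant-time comparison
    if not hmac.compare_digest(digest, expected):
        raise ValueError("Signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


signed = sign_pickle({"name": "Alice", "role": "Chef"})
restored = load_signed_pickle(signed)

# Flip one bit in the last byte to simulate tampering
tampered = signed[:-1] + bytes([signed[-1] ^ 0x01])
try:
    load_signed_pickle(tampered)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

The important design choice is that the signature is checked before `pickle.loads()` is ever called, so tampered bytes are never unpickled.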

## Why Use Object Serialization?

To understand the importance of serialization, we will demonstrate it with an example. We will work through the following steps:
1. Create a nested dictionary, a dictionary of dictionaries.
2. Write the dictionary data to a .txt file without **serialization**.
3. Load the .txt file.
4. Try accessing elements of the dictionary from the loaded .txt file.

In [3]:
# STEP 1: Creating the nested dictionary of domestic employees to XY
employees = {
    'employee_1' : {
        'name': 'Alice', 'age':32, 'role':'Chef'
    },
    'employee_2' : {
        'name': 'Liza', 'age':37, 'role':'Nanny'
    },
    'employee_3' : {
        'name': 'John', 'age':35, 'role':'Gardener'
    },
    'employee_4' : {
        'name': 'Bobby', 'age':28, 'role':'Security'
    },
    'employee_5' : {
        'name': 'Akello', 'age':29, 'role':'Teacher'
    },
    
}

employees

{'employee_1': {'name': 'Alice', 'age': 32, 'role': 'Chef'},
 'employee_2': {'name': 'Liza', 'age': 37, 'role': 'Nanny'},
 'employee_3': {'name': 'John', 'age': 35, 'role': 'Gardener'},
 'employee_4': {'name': 'Bobby', 'age': 28, 'role': 'Security'},
 'employee_5': {'name': 'Akello', 'age': 29, 'role': 'Teacher'}}

In [4]:
# Checking the object type
type(employees)

dict

In [5]:
# STEP 2: Writing the file into a .txt file without serialization
with open('employees.txt','w') as data:
    data.write(str(employees))

<div class="alert alert-block alert-info">
<b>NOTE:</b> The str() function has been used to convert the employees dictionary into text because the write() method of a text file can only write strings.
</div>

In [6]:
# STEP 3: Loading the employees.txt file
with open('employees.txt','r') as f:
    # Printing the content of the file
    for employee in f:
        print(employee)

{'employee_1': {'name': 'Alice', 'age': 32, 'role': 'Chef'}, 'employee_2': {'name': 'Liza', 'age': 37, 'role': 'Nanny'}, 'employee_3': {'name': 'John', 'age': 35, 'role': 'Gardener'}, 'employee_4': {'name': 'Bobby', 'age': 28, 'role': 'Security'}, 'employee_5': {'name': 'Akello', 'age': 29, 'role': 'Teacher'}}


In [7]:
print(f)

<_io.TextIOWrapper name='employees.txt' mode='r' encoding='cp1252'>


In [8]:
# Trying to access a dictionary from the main container dictionary, 
# i.e. accessing employee_1 dictionary from the employees dictionary
f['employee_1']

TypeError: '_io.TextIOWrapper' object is not subscriptable

<div class="alert alert-block alert-danger">
    <b>TypeError:</b> '_io.TextIOWrapper' object is not subscriptable:
</div>

This error is raised when we try to access an element of the dictionary through f. It occurs whenever we try to slice or index an object (data type) that does not support such operations. In this case, f is a file object, not a dictionary.

In [9]:
# Confirming the type
type(f)

_io.TextIOWrapper

**_io.TextIOWrapper** is the type of the file object returned by **open()** in text mode. It is a text stream used to read from or write to the file, not the data itself. File f cannot be accessed as a dictionary since it is not one, and even the content read from it comes back as a string, not as a dictionary.

In our case above, the nested dictionary was written out as a string, so it can only be read back as a string. This is where the importance of file pickling comes in! **How do we preserve the state of a file/object?**

<div class="alert alert-block alert-success">
<b>Importance of Serialization:</b> Serialization allows for preservation of objects in their original state without losing any information. In Python, we use the pickle module to serialize and deserialize objects. Note that the pickle format is specific to Python, so it is not meant to be loaded from other languages.
</div>
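
The difference can be reproduced in a few lines. The sketch below is illustrative only: it writes a small, hypothetical `record` dictionary to a temporary text file, shows that the content read back is a plain string, and then shows that a pickle round trip returns a real dictionary.

```python
import os
import pickle
import tempfile

record = {"employee_1": {"name": "Alice", "age": 32, "role": "Chef"}}

# Write the dictionary as text into a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.txt")  # hypothetical file name
    with open(path, "w") as out_file:
        out_file.write(str(record))
    with open(path, "r") as in_file:
        content = in_file.read()

content_type = type(content)  # str, not dict
first_char = content[0]       # indexing a str yields single characters

try:
    content["employee_1"]     # a str cannot be indexed by a dictionary key
    lookup_failed = False
except TypeError:
    lookup_failed = True

# A pickle round trip gives back a dictionary with the same contents
restored = pickle.loads(pickle.dumps(record))
role = restored["employee_1"]["role"]
```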

### Comparison of Pickle and JSON
The **comparisons** between the pickle protocol and JSON __[are](https://docs.python.org/3/library/pickle.html)__:

|Pickle|JSON|
|:--|:--|
|Binary serialization format|Text Serialization _(usually utf-8)_|
|Not human readable|Human readable|
|Python specific|Interoperable with other languages|
|Can represent an extremely large number of Python types, including custom classes|By default, can only represent a subset of Python built-in types and no custom classes|
|Deserializing untrusted pickle data creates an arbitrary code execution vulnerability|Deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability|
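
A small sketch contrasting the two formats on the same data. The `record` dictionary is a made-up example chosen because it contains a set and a tuple, two built-in types that JSON does not represent natively.

```python
import json
import pickle

# Hypothetical record mixing types that the two formats handle differently
record = {"name": "Alice", "skills": {"baking", "grilling"}, "shift": (8, 17)}

# pickle: binary output, and the set and tuple survive the round trip
pickled = pickle.dumps(record)
restored = pickle.loads(pickled)

# json: a set is not JSON serializable, so dumps() raises TypeError
try:
    json.dumps(record)
    json_failed = False
except TypeError:
    json_failed = True

# JSON works once the set is replaced by a list, but the tuple comes back as a list
json_text = json.dumps({"name": "Alice", "skills": ["baking"], "shift": (8, 17)})
json_restored = json.loads(json_text)
```

Here `pickled` is a bytes object while `json_text` is an ordinary, human-readable string.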


### Actual Serialization and Deserialization Operations
- To _serialize_ an object we call the **dump()** or **dumps()** functions. Function **dump()** writes the pickled data to an open binary file while **dumps()** returns it as a bytes object.
- To _deserialize_ we call the **load()** or **loads()** functions. Function **load()** reads pickled objects from a file while **loads()** deserializes them from bytes-like objects.
- To have more control over _serialization_ and _deserialization_ we create a **Pickler** or an **Unpickler** object.
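
The functions listed above can be sketched as follows. An in-memory `io.BytesIO` buffer stands in for a binary file here, which is an assumption made only to keep the example self-contained.

```python
import io
import pickle

cities = ["Nairobi", "Kisumu", "Mombasa"]

# dumps()/loads(): work with bytes objects in memory
as_bytes = pickle.dumps(cities)
from_bytes = pickle.loads(as_bytes)

# Pickler/Unpickler: write several objects to one binary stream
buffer = io.BytesIO()
pickler = pickle.Pickler(buffer)
pickler.dump(cities)
pickler.dump({"count": len(cities)})

# Rewind the stream and read the objects back in the order they were written
buffer.seek(0)
unpickler = pickle.Unpickler(buffer)
first = unpickler.load()
second = unpickler.load()
```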

In [10]:
# Importing the pickle module that is used in (un)pickling
import pickle

> **Example 1:** Here we shall serialize and deserialize a list to demonstrate that the unpickled object preserves its original state and can be manipulated as a list object

In [11]:
# Serializing a list
cities = ['Nairobi', 'Kisumu', 'Mombasa', 'Nakuru', 'Eldoret']

with open('cities.pkl', 'wb') as f: # We use 'wb' to write as binary
    pickle.dump(cities, f) # Serializing the list; the with block closes the file automatically

In [12]:
# Deserializing the list
with open('cities.pkl', 'rb') as f: # The 'rb' is used to read binary
    unpickled_cities = pickle.load(f)
    
# Accessing the first element of the list
unpickled_cities[0]

'Nairobi'

<div class="alert alert-block alert-success">
<b>Original state preserved: </b> The unpickled object can be manipulated as a list
</div>

> **Example 2:** Here we shall work with a pandas DataFrame and use it to demonstrate how using pickle can help improve performance

In [13]:
# Loading the required libraries
import time
import pandas as pd
import numpy as np

In [16]:
# Generating random data to be used to create our DataFrame

# Setting the random seed for reproducibility of this work
np.random.seed(345)

# Generating random data
data = {
    'weight' : np.random.randint(47, 113, size=100000),
    'height' : np.random.randint(120, 187, size=100000),
    'gender' : np.random.choice(['Male','Female'], size=100000)
}

# Creating the pandas DataFrame
df = pd.DataFrame(data)

# Viewing the 10 random values from the DataFrame
df.sample(10)

| |weight|height|gender|
|:--|:--|:--|:--|
|43726|59|136|Female|
|7791|93|160|Female|
|57576|103|145|Female|
|42120|76|156|Male|
|61861|85|144|Female|
|4605|60|175|Female|
|2958|74|175|Female|
|3914|95|168|Female|
|26051|82|125|Female|
|79619|90|138|Male|


In [17]:
# Saving the DataFrame to a .csv file and calculating the time it takes to save
# Start time
start = time.time()

# Saving to a .csv file
df.to_csv('radomly_generated_df.csv')

# End time
end = time.time()

# Time taken to save
time_taken_csv = end - start

In [18]:
# Saving the file as a pickle file and noting the time
start = time.time()

# Saving to a .pkl file
df.to_pickle('randomly_generated_df.pkl')

# End time
end = time.time()

# Time taken to save
time_taken_pkl = end - start

In [19]:
# Comparing the time taken to save the two files
print("Seconds taken to save .csv file ", time_taken_csv)
print("Seconds taken to save .pkl file ", time_taken_pkl)

Seconds taken to save .csv file  0.1512463092803955
Seconds taken to save .pkl file  0.017955541610717773


In [22]:
print("In this case saving a pickle file is ", time_taken_csv/time_taken_pkl, 
      "times faster than saving the file as a csv")

In this case saving a pickle file is  8.423377727025269 times faster than saving the file as a csv


In [24]:
# Time taken to read the files

# Loading the csv file
start1 = time.time()
df_csv = pd.read_csv('radomly_generated_df.csv')
end1 = time.time()
print('Time taken to load .csv file: ', end1-start1)

# Loading the pkl file
start2 = time.time()
df_pkl = pd.read_pickle('randomly_generated_df.pkl')
end2 = time.time()
print('Time taken to load .pkl file: ', end2 - start2)

Time taken to load .csv file:  0.03493165969848633
Time taken to load .pkl file:  0.00998234748840332


<div class="alert alert-block alert-success">
<b>Improved Performance: </b> In this single timed run, both saving and loading the pickle file took less time than the equivalent .csv operations
</div>

In [29]:
# Pickling and Unpickling our earlier employees dictionary and accessing its components

# Serializing the employees dictionary
with open('employees_dict.pkl', 'wb') as f:
    pickle.dump(employees, f)
    
# Deserializing the employees dictionary
with open('employees_dict.pkl', 'rb') as f:
    deserialized_dict = pickle.load(f)

# Printing the dictionary
print(deserialized_dict)

# Accessing the role of the first employee in the deserialized dictionary
print("")
print("The first employee works as a: ", deserialized_dict['employee_1']['role'])

{'employee_1': {'name': 'Alice', 'age': 32, 'role': 'Chef'}, 'employee_2': {'name': 'Liza', 'age': 37, 'role': 'Nanny'}, 'employee_3': {'name': 'John', 'age': 35, 'role': 'Gardener'}, 'employee_4': {'name': 'Bobby', 'age': 28, 'role': 'Security'}, 'employee_5': {'name': 'Akello', 'age': 29, 'role': 'Teacher'}}

The first employee works as a:  Chef


> We have further demonstrated that pickling preserves the state/properties of an object

#### Serializing a Machine Learning model using the Python pickle module

Training a machine learning model takes time. Therefore, there is a need to preserve the results of a training session in a way that permits their use and transfer to a different setup. The pickle module makes this possible by preserving the state of a model, which makes it available for use whenever the need arises.

In [30]:
# Loading the required libraries
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

In [31]:
# Generating regression data
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=1)

In [32]:
# Training the regression model
linear_model = LinearRegression()
linear_model.fit(X,y)

LinearRegression()

In [33]:
# Displaying the summary of the model
print("Intercept: ", linear_model.intercept_)
print("Coefficients: ", linear_model.coef_)
print("Score: ", linear_model.score(X, y))

Intercept:  -0.010109549594705669
Coefficients:  [44.18793068 98.97389468 58.17121618]
Score:  0.9999993081899219


In [34]:
# Saving the model as a pickle file
with open('linear_regression_model.pkl', 'wb') as f:
    pickle.dump(linear_model, f)

In [35]:
# Loading the saved model
with open('linear_regression_model.pkl', 'rb') as f:
    unserialized_linear_model = pickle.load(f)
    
# Confirming the model parameters
print("Intercept: ", unserialized_linear_model.intercept_)
print("Coefficients: ", unserialized_linear_model.coef_)
print("Score: ", unserialized_linear_model.score(X, y))

Intercept:  -0.010109549594705669
Coefficients:  [44.18793068 98.97389468 58.17121618]
Score:  0.9999993081899219


The model has been preserved and can be used to make predictions, trained further, or transferred to a different location.

***
#### Further Considerations
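- How to speed up the workflow by passing **pickle.HIGHEST_PROTOCOL** as the protocol argument
- The **cPickle** module: in Python 2 it was a separate module written in C, which made it faster than the pure-Python **pickle** module; in Python 3, **pickle** uses the C implementation automatically when it is available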
- Other serialization formats
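
As a starting point for the first consideration, here is a minimal sketch of passing a protocol to **dumps()**. The sample `data` list is made up, and whether a higher protocol is actually faster or smaller depends on the data, so measure on your own workload.

```python
import pickle

data = list(range(1000))

# The default protocol depends on the Python version in use
default_bytes = pickle.dumps(data)

# Explicitly request the newest protocol supported by this interpreter
highest_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 0 is the original text-based protocol, kept for backwards compatibility
oldest_bytes = pickle.dumps(data, protocol=0)

sizes = {
    "default": len(default_bytes),
    "highest": len(highest_bytes),
    "protocol_0": len(oldest_bytes),
}
print(sizes)

# Every protocol round-trips to the same object
same_result = pickle.loads(highest_bytes) == pickle.loads(oldest_bytes) == data
```

Files written with a newer protocol cannot be read by Python versions that predate it, which is the main trade-off to keep in mind.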