# Model persistence

This notebook reviews two ways of saving a trained model and loading it later, so that predictions can be made on new data without retraining.


## Motivation
Serialization is the process of converting a program entity into a stream of bytes that can be saved as a file.
There are two primary reasons why we might want to save (and later load) a trained model:
- Avoid redundant training. Models can take a long time to train, and data can take a long time to load and process.
- Deploy the model into an application, keeping the training code and the deployment code separate.

## Pickle
- Pickling is the process where a Python object is converted into a byte stream (usually not human readable).
- Unpickling is the reverse operation, where a byte stream is converted back into a working Python object.
- Pickling is the simplest way to store an object from a coding perspective.
- The `pickle` module is part of the Python standard library.
    - It can store most Python objects, not just Sklearn models (see the limitations below).
    
#### Features
- Store/load dictionaries and lists.
- Store/load the attributes of arbitrary data types (i.e. instances of classes).
- Do this recursively, so that if your object has attributes that are class instances themselves, they are saved just as easily.
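
The recursive behaviour described above can be sketched with two small classes. `Scaler` and `Pipeline` are hypothetical names made up for this illustration; they stand in for any object whose attributes include lists, dictionaries and other class instances.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Scaler:
    """Hypothetical helper class, used only for this illustration."""
    factor: float


@dataclass
class Pipeline:
    """Hypothetical container holding a list, a dict and a nested instance."""
    name: str
    scaler: Scaler                               # attribute that is itself a class instance
    history: list = field(default_factory=list)
    params: dict = field(default_factory=dict)


original = Pipeline("demo", Scaler(2.5), history=[1, 2, 3], params={"alpha": 0.1})

# Round trip through a byte stream; the nested Scaler is pickled along with its parent
payload = pickle.dumps(original)
restored = pickle.loads(payload)

print(restored.scaler.factor)    # attribute of the nested object
print(restored == original)      # same attribute values
print(restored is original)      # but a separate object
```

The restored object is a new instance with equal attribute values, not a reference to the original.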

#### Limitations
- Does not save the *code* of an object, only its attribute values. The class definition must be importable when the object is loaded.
- Cannot save file handles or connection sockets.
- Pickle is **version-dependent**. For example, if you saved a model with a certain version of Sklearn and then try to load it with a different one (e.g. after an update), the load may fail or the model may behave differently.
    - This is another motivation for using virtual environments, which can be containerized.
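
Two of these limitations can be seen in a short sketch. The first part shows that pickling an open file handle fails. The second part shows one possible guard against version mismatches: write a version string as a separate record before the object, so it can be checked before the object itself is unpickled. The helpers `dump_with_version` and `load_with_version` are hypothetical and not part of `pickle`; the version strings are placeholders for something like `sklearn.__version__`.

```python
import io
import os
import pickle

# 1) An open file handle cannot be pickled
with open(os.devnull, "w") as handle:
    try:
        pickle.dumps(handle)
        handle_error = None
    except TypeError as exc:
        handle_error = exc
print("Pickling a file handle failed:", handle_error)


# 2) Hypothetical helpers that store a version string ahead of the object
def dump_with_version(obj, version, file):
    pickle.dump(version, file)    # first record: the library version
    pickle.dump(obj, file)        # second record: the object itself


def load_with_version(file, expected_version):
    saved_version = pickle.load(file)    # reads only the first record
    if saved_version != expected_version:
        raise RuntimeError(
            f"Saved with version {saved_version}, running {expected_version}"
        )
    return pickle.load(file)             # reads the second record


# An in-memory buffer stands in for a file opened in binary mode
buffer = io.BytesIO()
dump_with_version({"coef": [0.5, -1.2]}, "1.4.2", buffer)
buffer.seek(0)
restored = load_with_version(buffer, "1.4.2")
print(restored)
```

Because the version is read first, a mismatch is reported before any attempt to rebuild the object.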

#### Saving procedure
```python
import pickle        # Built-in python module

# Create some object and manipulate it in some way (e.g. train the model)
myobj = SomeClass(...)
myobj = myobj.some_method(...)

# Save to a file using Pickle
with open('myfile.pickle', 'wb') as file_handle:
    pickle.dump(myobj, file_handle)
```

#### Loading procedure
```python
import pickle        # Built-in python module

# Load from a file using Pickle
with open('myfile.pickle', 'rb') as file_handle:
    myobj = pickle.load(file_handle)    # myobj will be an instance of SomeClass
```

#### Methods
The pickle module provides four main functions:
- `dump()`: serializes an object to an open file (or file-like object).
- `dumps()`: serializes an object to a `bytes` object.
- `load()`: deserializes an object from an open file (or file-like object).
- `loads()`: deserializes an object from a `bytes` object.
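
As a minimal sketch of how the four functions pair up, the snippet below round-trips the same dictionary twice: once through `dumps()`/`loads()` and once through `dump()`/`load()`. The `record` dictionary is made-up sample data, and an in-memory `io.BytesIO` buffer stands in for a file opened in binary mode.

```python
import io
import pickle

record = {"name": "iris", "classes": [0, 1, 2]}   # made-up sample data

# dumps()/loads(): serialize to and from an in-memory bytes object
payload = pickle.dumps(record)
print(type(payload).__name__)    # bytes
from_bytes = pickle.loads(payload)

# dump()/load(): serialize to and from a file-like object
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)                   # rewind before reading back
from_file = pickle.load(buffer)

print(from_bytes == record, from_file == record)
```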

### Example

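```python
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Note: `multi_class` is deprecated in recent scikit-learn releases (1.5+)
clf = LogisticRegression(multi_class="ovr")
clf.fit(X_train, y_train)    # Train on the training split only
```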

```python
from pathlib import Path

filedir = Path("saved_models")    # Directory in which to save the model
filedir.mkdir(parents=True, exist_ok=True)    # Create it if missing; do nothing if it already exists

filepath = filedir / "iris.pickle"    # File path for the saved model

# Save the model
with filepath.open('wb') as f:
    pickle.dump(clf, f)

# Load the model
with filepath.open('rb') as f:
    clf_loaded = pickle.load(f)

# Check that the loaded model has the same coefficients as the original
assert (clf.coef_ == clf_loaded.coef_).all()

# The loaded model predicts on held-out data without being retrained
assert (clf.predict(X_test) == clf_loaded.predict(X_test)).all()
```

## Joblib
Joblib is a third-party alternative to Pickle for serialization. Its main advantage over Pickle
is that it is faster and more efficient at saving objects that contain large `numpy` arrays.

<sub>*Note: According to the linked source, starting with Python 3.8 Pickle handles large `numpy` arrays at least as well as Joblib.
    If you have Python >=3.8, Pickle alone is usually sufficient. [Source](https://stackoverflow.com/a/12617603).*</sub>
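
A likely reason for the change in Python 3.8 is pickle protocol 5 (PEP 574), which lets large buffers be handed over "out of band" instead of being copied into the pickle stream. The sketch below illustrates only the mechanism, using standard-library objects: a `bytearray` stands in for a large array, and whether this beats Joblib on a given model is an assumption that should be measured rather than taken from this example.

```python
import pickle

big = bytearray(b"x" * 1_000_000)    # stand-in for a large array

# In-band: the data is copied into the pickle stream
in_band = pickle.dumps(pickle.PickleBuffer(big), protocol=5)

# Out-of-band: the stream holds only a marker; the buffer is collected separately
buffers = []
payload = pickle.dumps(
    pickle.PickleBuffer(big), protocol=5, buffer_callback=buffers.append
)

print(len(in_band) > len(big))    # the in-band stream contains the full data
print(len(payload) < 100)         # the out-of-band stream is only a few bytes

# Loading needs the same buffers, supplied in the same order
restored = pickle.loads(payload, buffers=buffers)
print(bytes(restored) == bytes(big))
```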
    
#### Saving procedure
```python
import joblib

# Create some object and manipulate it in some way (e.g. train the model)
myobj = SomeClass(...)
myobj = myobj.some_method(...)

# Save to a file using Joblib
joblib.dump(myobj, file_path)
```

#### Loading procedure
```python
import joblib

# Load from a file using Joblib
myobj = joblib.load(file_path)    # myobj will be an instance of SomeClass
```

### Example

```python
import joblib

jl_filedir = Path("saved_models_jl")    # Directory in which to save the model
jl_filedir.mkdir(parents=True, exist_ok=True)    # Create it if missing; do nothing if it already exists

jl_filepath = jl_filedir / 'clf.joblib'    # File path for the saved model

joblib.dump(clf, jl_filepath)    # Save the model

clf_jl_loaded = joblib.load(jl_filepath)    # Load the model

# Check that the loaded model has the same coefficients as the original
assert (clf.coef_ == clf_jl_loaded.coef_).all()

# The loaded model predicts on held-out data without being retrained
assert (clf.predict(X_test) == clf_jl_loaded.predict(X_test)).all()
```