<img src="./images/banner.png" width="800">

# Model Persistence

Model persistence is the process of saving a trained machine learning model to disk and loading it later to make predictions, without retraining. This capability matters most in production environments, where a model is typically trained once and then loaded many times.


Here are some key benefits of model persistence:
1. **Time-saving**: Training complex models can be time-consuming. Persistence allows you to save and reuse models without retraining.
2. **Reproducibility**: Saved models ensure consistent results across different sessions or environments.
3. **Deployment**: Persistent models can be easily deployed in production systems.


While there are several methods for model persistence, we'll focus primarily on joblib in this lecture. However, it's worth briefly mentioning other options:

1. **joblib**: Efficient for large NumPy arrays, supports memory mapping.
2. **pickle**: Python's built-in serialization module.
3. **ONNX**: Open Neural Network Exchange format, useful for cross-platform deployment.
4. **skops.io**: A more secure alternative to pickle-based formats.
5. **cloudpickle**: Useful for serializing custom Python code.


💡 **Pro Tip:** joblib is generally the recommended method for scikit-learn model persistence due to its efficiency and ease of use.


The typical workflow for model persistence involves three main steps:

1. Train the model
2. Save the trained model to disk
3. Load the model when needed for predictions


Here's a simple example using joblib:


In [3]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from joblib import dump, load

# 1. Train the model
X, y = load_diabetes(return_X_y=True)
regressor = Ridge()
regressor.fit(X, y)

# 2. Save the model
dump(regressor, 'ridge_model.joblib')

# 3. Load the model (typically in a different session or script)
loaded_model = load('ridge_model.joblib')

# Use the loaded model for predictions
loaded_model.predict(X[:5])

array([182.67335421,  90.99860656, 166.11347597, 156.03488009,
       133.65957541])

When working with model persistence, keep these points in mind:

1. **Compatibility**: Where possible, load the model with the same scikit-learn version that saved it. scikit-learn does not guarantee that a model saved with one version will load, or behave identically, in another.
2. **Dependencies**: All required libraries should be available in the environment where the model is loaded.
3. **Security**: Be cautious when loading models from untrusted sources, as malicious code could potentially be executed.


❗️ **Important Note:** joblib (like pickle) can execute arbitrary code upon loading. Only load models from trusted sources.


joblib is particularly well-suited for scikit-learn models because:

1. It's optimized for handling large NumPy arrays efficiently.
2. It supports compression, making saved files smaller.
3. It can memory-map the persisted data, so large arrays are read from disk on demand instead of being loaded fully into memory.
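
As a rough illustration of points 2 and 3, the sketch below saves a NumPy array that stands in for a model's fitted weights, once uncompressed and once compressed, and then memory-maps the uncompressed file. The array contents, file names, and compression level are arbitrary choices made for this example.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a large fitted attribute (e.g. model coefficients)
weights = np.zeros((1000, 100))

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "weights_raw.joblib")
    compressed_path = os.path.join(tmp_dir, "weights_compressed.joblib")

    joblib.dump(weights, raw_path)                     # no compression
    joblib.dump(weights, compressed_path, compress=3)  # compression level 3

    raw_size = os.path.getsize(raw_path)
    compressed_size = os.path.getsize(compressed_path)
    print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")

    # Memory mapping only applies to uncompressed files
    mapped = joblib.load(raw_path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)
    matches = bool(np.array_equal(mapped, weights))
    print("memory-mapped:", is_memmap, "| equal to original:", matches)

    del mapped  # release the mapping before the directory is removed
```

An all-zero array compresses unusually well, so the size difference here is larger than you should expect for a real model. Compression and memory mapping are alternatives: pick compression for storage size, or an uncompressed file when you want to memory-map it.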


Efficient model persistence is crucial for deploying machine learning models in real-world applications, where quick loading times and minimal memory usage are often required.


In the following sections, we'll look at pickle, which joblib builds on, and then at version control and model management practices that help keep saved models reproducible and secure.

**Table of contents**<a id='toc0_'></a>    
- [Saving Models with Pickle](#toc1_)    
  - [Basic Usage of Pickle for Model Persistence](#toc1_1_)    
  - [Best Practices](#toc1_2_)    
- [Version Control and Model Management](#toc2_)    
  - [Implementing Version Control for Models](#toc2_1_)    
  - [Best Practices for Model Version Control](#toc2_2_)    
  - [Model Registry](#toc2_3_)    
- [Summary](#toc3_)    


## <a id='toc1_'></a>[Saving Models with Pickle](#toc0_)

While joblib is the recommended method for scikit-learn model persistence, understanding pickle is valuable because joblib builds on it. Pickle is Python's built-in module for object serialization: "pickling" converts a Python object into a byte stream that can be saved to disk, and "unpickling" reconstructs the object from that stream.


### <a id='toc1_1_'></a>[Basic Usage of Pickle for Model Persistence](#toc0_)


Here's a simple example of how to use pickle to save and load a scikit-learn model:


In [4]:
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train the model
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

# Save the model
with open('random_forest_model.pkl', 'wb') as file:
    pickle.dump(clf, file)

# Load the model
with open('random_forest_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Use the loaded model
loaded_model.predict(X[:5])

array([0, 0, 0, 0, 0])

There are several advantages to using pickle:
1. **Native to Python**: No additional libraries required.
2. **Versatility**: Can serialize most Python objects, including custom classes.
3. **Compact representation**: Efficient for small to medium-sized objects.


Note that pickle supports different protocols for serialization:


In [5]:
import pickle

# Using the highest available protocol
with open('model_latest_protocol.pkl', 'wb') as file:
    pickle.dump(clf, file, protocol=pickle.HIGHEST_PROTOCOL)

# Using a specific protocol (e.g., protocol 4)
with open('model_protocol4.pkl', 'wb') as file:
    pickle.dump(clf, file, protocol=4)

💡 **Pro Tip:** Higher protocol versions are generally more efficient but may not be compatible with older Python versions.
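
To see what this trade-off looks like on your own interpreter, the following sketch serializes the same object with every available protocol and records the size of each result. The `payload` dictionary is a made-up stand-in for a model; actual sizes depend on the object and the Python version, so none are shown here.

```python
import pickle

# Hypothetical stand-in for a model: a name plus a list of float "weights"
payload = {"name": "demo", "weights": [0.1 * i for i in range(1000)]}

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(payload, protocol=protocol)
    # Every protocol must reconstruct an equal object
    assert pickle.loads(blob) == payload
    sizes[protocol] = len(blob)

print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)
for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```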



However, there are some limitations to using pickle:
1. **Security Risks**: Pickle can execute arbitrary code during unpickling. Only unpickle data from trusted sources.

2. **Version Compatibility**: Pickled objects may not be compatible across different versions of Python or the libraries used (like scikit-learn).

3. **Portability**: Pickled objects are not guaranteed to be portable across different platforms or architectures.


❗️ **Important Note:** Due to security concerns, never unpickle data from an untrusted or unauthenticated source.
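
The risk is easier to appreciate with a harmless demonstration. In the sketch below, a made-up `Payload` class uses `__reduce__` to tell pickle to call the built-in `sorted` when the data is loaded. The class name and arguments are illustrative only; an attacker could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical object that controls what happens at unpickling time."""

    def __reduce__(self):
        # pickle stores "call sorted([3, 1, 2])" instead of the object's state
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())

# Loading runs the stored call; no Payload instance is ever created
result = pickle.loads(data)
print(type(result).__name__, result)
```

The loaded value is whatever the callable returned, which shows that unpickling executes code chosen by whoever produced the file.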


### <a id='toc1_2_'></a>[Best Practices](#toc0_)


When using pickle, consider the following best practices:
1. **Prefer Protocol 5 for Large Objects**: On Python 3.8+, protocol 5 adds support for out-of-band data buffers, which can reduce memory copies when serializing large objects such as NumPy arrays.

2. **Error Handling**: Always use try-except blocks when unpickling to handle potential errors gracefully.

3. **Version Control**: Store information about the Python and library versions used when pickling the model.


In [6]:
import sys
import sklearn

model_info = {
    'model': clf,
    'python_version': sys.version,
    'sklearn_version': sklearn.__version__
}

with open('model_with_info.pkl', 'wb') as file:
    pickle.dump(model_info, file)

In [11]:
with open('model_with_info.pkl', 'rb') as file:
    model_info = pickle.load(file)

model_info

{'model': RandomForestClassifier(),
 'python_version': '3.10.12 (main, Jul  5 2023, 15:02:25) [Clang 14.0.6 ]',
 'sklearn_version': '1.5.2'}
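
The error-handling practice from the list above can be sketched as follows. The helper name `load_model_safely` and the file names are hypothetical; the exception types are the ones the `pickle` documentation lists as commonly raised during unpickling.

```python
import os
import pickle
import tempfile


def load_model_safely(path):
    """Return the unpickled object, or None if loading fails."""
    try:
        with open(path, "rb") as file:
            return pickle.load(file)
    except FileNotFoundError:
        print(f"Model file not found: {path}")
    except (pickle.UnpicklingError, EOFError) as exc:
        print(f"File is corrupt or not a pickle: {exc}")
    except (AttributeError, ImportError, IndexError) as exc:
        # Typically a class or module that existed at save time is missing
        print(f"Environment mismatch while unpickling: {exc}")
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.pkl")
    bad_path = os.path.join(tmp_dir, "bad.pkl")

    with open(good_path, "wb") as file:
        pickle.dump({"coef": [0.5, 1.5]}, file)
    with open(bad_path, "wb") as file:
        file.write(b"\x00not-a-pickle")  # deliberately invalid content

    good = load_model_safely(good_path)
    bad = load_model_safely(bad_path)
    missing = load_model_safely(os.path.join(tmp_dir, "missing.pkl"))
```

Error handling only makes failures easier to diagnose. It does not make an untrusted file safe to load.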

In conclusion, while pickle is a powerful and flexible serialization tool, joblib is generally preferred for scikit-learn models due to its optimizations for numerical data and additional features. However, understanding pickle is valuable for general Python object serialization and for insights into how joblib operates.

## <a id='toc2_'></a>[Version Control and Model Management](#toc0_)

Effective version control and model management are crucial aspects of maintaining a robust machine learning pipeline. These practices ensure reproducibility, traceability, and efficient collaboration in ML projects.


Version control for ML models involves tracking changes to model artifacts, datasets, and associated metadata over time.


There are several benefits to version control:

1. **Reproducibility**: Ability to recreate exact model versions.
2. **Traceability**: Track model lineage and evolution.
3. **Collaboration**: Enable team members to work on and share models effectively.
4. **Auditing**: Facilitate model audits and compliance checks.


When managing ML models, consider versioning the following components:

1. Model artifacts (saved model files)
2. Training data (or references to data versions)
3. Model hyperparameters
4. Code used for training and evaluation
5. Environment details (library versions, etc.)
6. Model performance metrics


### <a id='toc2_1_'></a>[Implementing Version Control for Models](#toc0_)


1. Using Git for Code and Small Artifacts


Git is excellent for versioning code and small files. Here's a basic workflow:


```bash
# Initialize a Git repository
git init

# Add model files and code
git add model.joblib train_script.py requirements.txt

# Commit changes
git commit -m "Add initial model version 1.0"

# Create a tag for the version
git tag -a "v1.0" -m "Model version 1.0"
```


💡 **Pro Tip:** Use `.gitignore` to exclude large data files or sensitive information from Git repositories.


2. Versioning Large Files with Git LFS


For larger model files, consider using Git Large File Storage (LFS):


```bash
# Install Git LFS
git lfs install

# Track large model files with LFS
git lfs track "*.joblib"

# Add and commit as usual
git add .gitattributes model.joblib
git commit -m "Add large model file using LFS"
```


3. Using MLflow for Comprehensive Model Management


MLflow is a platform for the complete machine learning lifecycle, including experimentation, reproducibility, and deployment.


In [13]:
%pip install mlflow

In [14]:
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

mlflow.set_experiment("iris_classification")

with mlflow.start_run():
    # Load and prepare data
    X, y = load_iris(return_X_y=True)

    # Train model
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)

    # Log parameters
    mlflow.log_param("n_estimators", 100)

    # Log model
    mlflow.sklearn.log_model(clf, "random_forest_model")

    # Log metrics (accuracy on the training data; no hold-out set is used here)
    accuracy = clf.score(X, y)
    mlflow.log_metric("accuracy", accuracy)

2024/11/10 00:20:18 INFO mlflow.tracking.fluent: Experiment with name 'iris_classification' does not exist. Creating a new experiment.


❗️ **Important Note:** MLflow stores runs and artifacts in a local `mlruns` directory by default. For team collaboration, configure a shared tracking server.


### <a id='toc2_2_'></a>[Best Practices for Model Version Control](#toc0_)


1. **Semantic Versioning**: Use semantic versioning (e.g., v1.2.3) for model releases.

2. **Metadata Tracking**: Include metadata with each model version:


In [15]:
import datetime
import platform

import joblib
import sklearn

# `clf` and `accuracy` come from the MLflow example above
model_metadata = {
    'version': '1.0.0',
    'trained_on': datetime.datetime.now().isoformat(),
    'framework_versions': {
        'sklearn': sklearn.__version__,
        'python': platform.python_version(),
    },
    'hyperparameters': clf.get_params(),
    'accuracy': accuracy,
}

joblib.dump((clf, model_metadata), 'model_v1.0.0.joblib')

['model_v1.0.0.joblib']

3. **Environment Management**: Use virtual environments and requirements files:

```bash
# Create a virtual environment
conda create -n myenv python=3.10

# Activate the environment
conda activate myenv

# Install dependencies from an existing requirements file
pip install -r requirements.txt

# Record the exact installed versions (overwrites requirements.txt)
pip freeze > requirements.txt
```


4. **Documentation**: Maintain clear documentation for each model version, including:
   - Purpose and use case
   - Training data description
   - Performance metrics
   - Known limitations or biases
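
The metadata and environment practices above pay off at load time, when the saved versions can be compared with the running environment. Below is a minimal sketch, assuming the metadata holds a simple name-to-version mapping. The function name and the version strings are hypothetical.

```python
import platform


def find_version_mismatches(saved_versions, current_versions):
    """Return {name: (saved, current)} for every version that differs."""
    mismatches = {}
    for name, saved in saved_versions.items():
        current = current_versions.get(name)
        if current != saved:
            mismatches[name] = (saved, current)
    return mismatches


# Hypothetical metadata as it might have been stored with a model
saved_versions = {"python": "3.10.12", "sklearn": "1.5.2"}

# In practice, read these from the running environment
# (e.g. sklearn.__version__); "1.6.0" is a made-up value for illustration
current_versions = {"python": platform.python_version(), "sklearn": "1.6.0"}

mismatches = find_version_mismatches(saved_versions, current_versions)
for name, (saved, current) in mismatches.items():
    print(f"Warning: {name} was {saved} at save time, now {current}")
```

Whether a mismatch should produce a warning or block loading altogether is a policy decision for your project.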



Proper version control and model management ensure that your ML projects are reproducible, traceable, and maintainable over time, which is crucial for both development and production environments.


### <a id='toc2_3_'></a>[Model Registry](#toc0_)


For production environments, consider implementing a model registry:

1. **Central Repository**: Store all production-ready models in a central location.
2. **Versioning**: Maintain multiple versions of each model.
3. **Metadata**: Store relevant metadata for each model version.
4. **Access Control**: Implement proper access controls and approval processes.
5. **Deployment Tracking**: Keep track of which model versions are deployed where.


Tools like MLflow, DVC (Data Version Control), or cloud-based solutions like AWS SageMaker Model Registry can help in setting up a robust model registry system.
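
To make the registry idea concrete, here is a toy file-based sketch covering the central store, versioning, metadata, and deployment tracking. It is not how MLflow, DVC, or SageMaker implement their registries. The class, its methods, the model name, and the accuracy values are all hypothetical, and access control is left out entirely.

```python
import json
import os
import tempfile


class SimpleModelRegistry:
    """Toy registry: one JSON index plus one artifact file per version."""

    def __init__(self, root):
        self.root = root
        self.index_path = os.path.join(root, "index.json")
        os.makedirs(root, exist_ok=True)
        if not os.path.exists(self.index_path):
            self._write_index({})

    def _read_index(self):
        with open(self.index_path) as file:
            return json.load(file)

    def _write_index(self, index):
        with open(self.index_path, "w") as file:
            json.dump(index, file, indent=2)

    def register(self, name, version, artifact_bytes, metadata):
        """Store an artifact and its metadata under name/version."""
        index = self._read_index()
        versions = index.setdefault(name, {})
        if version in versions:
            raise ValueError(f"{name} {version} is already registered")
        artifact_path = os.path.join(self.root, f"{name}-{version}.bin")
        with open(artifact_path, "wb") as file:
            file.write(artifact_bytes)
        versions[version] = {
            "path": artifact_path,
            "metadata": metadata,
            "deployed_to": [],
        }
        self._write_index(index)

    def mark_deployed(self, name, version, environment):
        """Record that a version is running in the given environment."""
        index = self._read_index()
        index[name][version]["deployed_to"].append(environment)
        self._write_index(index)

    def list_versions(self, name):
        """Return versions sorted numerically (assumes X.Y.Z integers)."""
        versions = self._read_index().get(name, {})
        return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


with tempfile.TemporaryDirectory() as tmp_dir:
    registry = SimpleModelRegistry(tmp_dir)
    # Placeholder bytes stand in for a serialized model file
    registry.register("iris_rf", "1.0.0", b"model-v1", {"accuracy": 0.95})
    registry.register("iris_rf", "1.10.0", b"model-v3", {"accuracy": 0.97})
    registry.register("iris_rf", "1.2.0", b"model-v2", {"accuracy": 0.96})
    registry.mark_deployed("iris_rf", "1.0.0", "staging")

    versions = registry.list_versions("iris_rf")
    deployed = registry._read_index()["iris_rf"]["1.0.0"]["deployed_to"]
    print(versions, deployed)
```

Sorting by numeric components rather than as plain strings keeps `1.10.0` after `1.2.0`, which matters once a model has more than nine minor releases.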


In conclusion, effective version control and model management are essential for maintaining a scalable and reliable machine learning pipeline. By implementing these practices, you ensure that your models are trackable, reproducible, and easier to manage throughout their lifecycle.

## <a id='toc3_'></a>[Summary](#toc0_)

This lecture has covered the essential aspects of model persistence in machine learning, with a focus on scikit-learn models. Here's a concise summary of the key points:

1. **Importance of Model Persistence**: Saving and loading models is crucial for deployment, reproducibility, and efficient workflows.

2. **Joblib as the Preferred Method**: For scikit-learn models, joblib is recommended due to its efficiency with NumPy arrays and additional features like compression and memory mapping.

3. **Pickle as an Alternative**: While less optimal for large models, pickle is a built-in Python solution for general object serialization.

4. **Security Considerations**: Always exercise caution when loading models from untrusted sources to avoid potential security risks.

5. **Version Control**: Implement proper version control practices for models, including metadata tracking and environment management.

6. **Best Practices**:
   - Use conventional file extensions (`.joblib` for joblib, `.pkl` for pickle)
   - Include model metadata when saving
   - Ensure consistent environments between saving and loading
   - Implement error handling when loading models

7. **Advanced Techniques**: Explore options like compression for storage efficiency and memory mapping for models that contain large arrays.


💡 **Pro Tip:** Always test your model persistence workflow, ensuring that loaded models perform identically to the original ones.
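
One way to act on this tip is a round-trip check: save the model, load it back, and compare predictions. The sketch below reuses the Ridge example from the start of the lecture and assumes saving and loading happen in the same environment. The temporary file location is an arbitrary choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
original = Ridge().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# Same process and library versions, so predictions should match exactly
predictions_match = bool(
    np.array_equal(original.predict(X), restored.predict(X))
)
params_match = original.get_params() == restored.get_params()
print("predictions match:", predictions_match, "| params match:", params_match)
```

When the loading environment differs from the training one, a tolerance-based comparison such as `np.allclose` may be more appropriate than exact equality.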


By mastering model persistence, you'll be better equipped to manage the lifecycle of your machine learning models, from development to deployment and maintenance.