## Week 10: Insecure Deserialization and ML Models


This topic maps to A08: Software and Data Integrity Failures in the OWASP Top 10 (2021). The vulnerability arises when an application deserializes unverified data and the deserialization step itself can be made to execute arbitrary code, compromising the integrity of the application.

The vulnerability lies in Python's built-in `pickle` module, which is often used to serialize (save) and deserialize (load) complex Python objects, including `scikit-learn` and custom models.

| Component | Description | Security Risk |
|:--- | :--- | :--- |
| Serialization (`pickle.dump`) | Converts a Python object into a byte stream | Low risk, as this is the saving process |
| Deserialization (`pickle.load`) | Reconstructs the Python object from the byte stream | CRITICAL RISK. The pickle protocol can call arbitrary Python callables during loading (for example, through an object's `__reduce__` or `__setstate__` method). If an attacker crafts a malicious byte stream (a "malicious pickle") and an application loads it, the attacker's code executes with the permissions of the application. When that byte stream arrives over a network or through a file upload, this is a Remote Code Execution (RCE) vulnerability |

For this reason, data from untrusted or unauthenticated sources should never be unpickled.
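
To make the loading risk concrete, here is a minimal and deliberately harmless sketch. The `Payload` class is hypothetical, and `eval("6 * 7")` stands in for whatever an attacker would really run (for example a shell command). It only illustrates that `pickle.loads` calls the callable named in the byte stream.

```python
import pickle


class Payload:
    """Hypothetical attacker-side class; only its pickled bytes reach the victim."""

    def __reduce__(self):
        # Instead of the object's state, the pickle stores the instruction
        # "call eval with this argument". The expression here is harmless;
        # a real attack would name something like os.system.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The loading side only sees bytes. Unpickling them runs the stored call.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The loaded value is the result of the injected call, not a `Payload` instance, which shows that the code ran as a side effect of loading.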

#### Secure Alternatives and Recommendations

The industry-standard recommendation is to avoid formats that allow arbitrary code execution during deserialization. For production-grade machine learning and data exchange, three main categories of safer formats are recommended:

| Alternative Format | Why It's Secure | Typical Use Case |
|:--- | :--- | :--- |
| JSON / YAML | Data-Only Exchange. These are human-readable text formats that hold only simple data types (strings, numbers, lists, dictionaries) and cannot directly serialize Python functions or classes. For YAML this holds only with a safe loader: PyYAML's `yaml.safe_load` is data-only, whereas `yaml.load` with an unsafe loader can construct arbitrary Python objects | Good for configuration files, simple data payloads, and small data structures |
| HDF5 | Binary Data Storage. This format focuses on efficiently storing the numerical arrays (weights, biases, vectors) that make up a model, rather than the entire Python object structure. Note that `joblib` is not a safe substitute: it is built on `pickle` and carries the same risk when loading untrusted files | Excellent for storing large numerical datasets and the core components of large models (e.g., Keras/TensorFlow models often save to HDF5) |
| ONNX / PMML | Model Standard Formats. These are interchange standards specifically designed to represent the computational graph (the structure and operations) of an ML model, independent of the original Python framework | Best Practice for Production. Since they strip out the framework-specific Python object logic, they drastically reduce the attack surface |
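
As a rough illustration of the data-only idea from the table, the sketch below stores the parameters of a small linear model as JSON and validates them on load. The function names (`save_params`, `load_params`, `predict`) and the `coef`/`intercept` layout are assumptions made for this example, not an API of any particular library.

```python
import json
import os
import tempfile


def save_params(path, coef, intercept):
    """Write only plain numbers; no Python objects or code are stored."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"coef": list(coef), "intercept": float(intercept)}, f)


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def load_params(path):
    """Parse the JSON file and check its shape before using it."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)  # yields only dicts, lists, strings, numbers, bools, None
    coef = raw.get("coef") if isinstance(raw, dict) else None
    intercept = raw.get("intercept") if isinstance(raw, dict) else None
    if not isinstance(coef, list) or not all(_is_number(c) for c in coef):
        raise ValueError("coef must be a list of numbers")
    if not _is_number(intercept):
        raise ValueError("intercept must be a number")
    return [float(c) for c in coef], float(intercept)


def predict(coef, intercept, features):
    """Linear model: dot(coef, features) + intercept."""
    return sum(c * x for c, x in zip(coef, features)) + intercept


path = os.path.join(tempfile.mkdtemp(), "model.json")
save_params(path, [0.5, -2.0], 1.0)
coef, intercept = load_params(path)
prediction = predict(coef, intercept, [4.0, 1.0])
print(prediction)
```

The model logic lives in reviewed application code, and the file contributes only numbers, so a tampered file can change predictions but cannot inject code through the parser.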



#### Security Recommendation for ML/Data Systems

When building systems that save and load artifacts, the following hierarchy of formats should be enforced:

1. For ML Models: Use ONNX (Open Neural Network Exchange) or a similar structured interchange format. If that is not possible, use a framework-specific format such as Keras HDF5, which prioritizes numerical data over arbitrary Python objects.

2. For Configuration/Small Data: Use JSON or YAML. These are simple, data-only formats: parsing them does not execute code, so an attacker would need a separate flaw in the application's own handling of the data to achieve RCE. With YAML, use a safe loader such as `yaml.safe_load`.

3. Authentication/Integrity Check (Mandatory): Even with safer formats, any loaded artifact (model or data) must be stored in an authenticated storage location (e.g., a Firebase or S3 bucket with strict access policies) and, ideally, be accompanied by a cryptographic hash (checksum). This hash must be verified before the artifact is loaded, to confirm that the artifact has not been tampered with.
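
The integrity check in item 3 can be sketched as follows. This is a minimal example using SHA-256 from the standard library. The helper name `read_verified`, the file name `model.onnx`, and the placeholder bytes are assumptions for illustration. In practice the expected hash would come from a trusted source kept separate from the artifact itself, since a hash stored next to a tampered file can be tampered with too.

```python
import hashlib
import hmac
import os
import tempfile


def read_verified(path, expected_sha256):
    """Return the file's bytes only if their SHA-256 matches the expected hex digest."""
    with open(path, "rb") as f:
        data = f.read()
    actual = hashlib.sha256(data).hexdigest()
    # Hash the exact bytes that will be parsed, then compare in constant time.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"integrity check failed for {path}")
    return data


# Demo with placeholder bytes standing in for a real model artifact.
artifact = os.path.join(tempfile.mkdtemp(), "model.onnx")
with open(artifact, "wb") as f:
    f.write(b"model bytes")

# Assumed to be published through a trusted channel at release time.
expected = hashlib.sha256(b"model bytes").hexdigest()

data = read_verified(artifact, expected)  # passes

# Simulate tampering after the hash was recorded.
with open(artifact, "ab") as f:
    f.write(b"tampered")

try:
    read_verified(artifact, expected)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print(tamper_detected)
```

Passing the verified bytes to the parser, rather than reopening the path, avoids a window in which the file could be swapped between the check and the load.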

Final Rule: Never load a pickled Python object from an untrusted source.