# Module 8 Complete Guide: Serialization and Object Copying

This notebook covers the full Module 8 curriculum (beginner + advanced). It combines
explanations, runnable examples, and practical checklists.

Use this notebook as the main learning path, and jump into the scripts for deeper practice:
- Beginner: `beginner_edition/*.py`
- Advanced: `advanced_edition/*.py`

## Table of contents

1. Setup and environment check
2. Part A (Beginner): Encapsulation and password hashing
3. Part A (Beginner): Pickle basics
4. Part A (Beginner): JSON serialization
5. Part A (Beginner): CSV serialization
6. Part A (Beginner): Shallow vs deep copy
7. Part A (Beginner): Practice and mini projects
8. Part B (Advanced): Dataclasses and validation
9. Part B (Advanced): Production pickle patterns
10. Part B (Advanced): Modern serialization (JSON Lines, MessagePack, Parquet)
11. Part B (Advanced): Copying performance and optimization
12. Part B (Advanced): Pydantic advanced validation
13. Summary and next steps

## 1. Setup and environment check

We keep most examples standard-library only. Advanced sections have optional dependencies.

In [1]:
import importlib.util

optional = ["pydantic", "pydantic_settings", "msgpack", "pandas", "pyarrow", "polars"]
availability = {name: importlib.util.find_spec(name) is not None for name in optional}

print("Optional dependencies:")
for name, ok in availability.items():
    print(f"  {name:17} -> {'available' if ok else 'missing'}")

Optional dependencies:
  pydantic          -> available
  pydantic_settings -> available
  msgpack           -> available
  pandas            -> available
  pyarrow           -> available
  polars            -> missing


# Part A: Beginner Edition

Focus: solid fundamentals and safe defaults.

Relevant scripts:
- `beginner_edition/01_oop_encapsulation_basics.py`
- `beginner_edition/02_pickle_basics.py`
- `beginner_edition/03_json_csv_basics.py`
- `beginner_edition/04_copying_basics.py`

## 2. Encapsulation and password hashing

Key ideas:
- Keep sensitive values non-public: a leading underscore by convention, double-underscore name mangling, or read-only properties.
- Never store raw passwords.
- Use PBKDF2 and a unique salt per password.
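
The difference between the underscore convention and name mangling is easy to miss, so here is a minimal sketch. The `Account` class and its attribute names are hypothetical and exist only for this illustration. Note that neither mechanism is real access control: both only make accidental access less likely.

```python
class Account:
    """Hypothetical class used only to illustrate attribute naming."""

    def __init__(self, owner: str, api_key: str):
        self.owner = owner        # public attribute
        self._note = "internal"   # single underscore: non-public by convention only
        self.__api_key = api_key  # double underscore: stored as _Account__api_key

    def key_suffix(self) -> str:
        # Inside the class body, self.__api_key resolves to the mangled name.
        return self.__api_key[-4:]


acct = Account("alice", "sk-demo-1234")
print(hasattr(acct, "__api_key"))          # the unmangled name does not exist
print(hasattr(acct, "_Account__api_key"))  # the mangled name does
print(acct.key_suffix())
```

The mangled attribute is still reachable from outside, which is why the list above pairs it with hashing rather than relying on it to protect a secret.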

In [3]:
import hashlib
import hmac
import os


In [None]:

class User:
    def __init__(self, email: str, password: str):
        self.email = email
        self._salt = os.urandom(16)
        self._password_hash = self._hash_password(password)

    def _hash_password(self, password: str) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode(), self._salt, 100_000)

    def verify_password(self, password: str) -> bool:
        # Re-hash the candidate with the stored salt, then compare in constant time
        candidate = self._hash_password(password)
        return hmac.compare_digest(candidate, self._password_hash)

    @property
    def password(self) -> str:
        return "********"


In [6]:

user = User("alice@example.com", "CorrectHorseBatteryStaple")


In [7]:

print("Masked password:", user.password)


Masked password: ********


In [8]:

print("Verify correct:", user.verify_password("CorrectHorseBatteryStaple"))


Verify correct: True


In [9]:

print("Verify wrong:", user.verify_password("wrong"))

Verify wrong: False


## 3. Pickle basics (Python-only persistence)

Pickle can serialize almost any Python object, but unpickling can execute arbitrary code, so it is unsafe for untrusted input.
Use it only for trusted data under your control.
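
To make the risk concrete, the sketch below shows that a pickle can tell the loader to call a function instead of rebuilding an object. The class name `NotWhatItSeems` is hypothetical, and the callable is deliberately harmless (`len`); an attacker would choose something else.

```python
import pickle


class NotWhatItSeems:
    """Hypothetical class whose pickle asks the loader to call len("payload")."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        return (len, ("payload",))


blob = pickle.dumps(NotWhatItSeems())
result = pickle.loads(blob)  # runs len("payload") rather than rebuilding the object
print(type(result).__name__, result)
```

The loader never sees the class at all: it only sees a reference to a callable and its arguments, which is why the source of the bytes matters more than their apparent content.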

In [13]:
import pickle
from dataclasses import dataclass
from io import BytesIO

@dataclass
class Task:
    title: str
    done: bool = False


In [14]:

# Save to bytes
buffer = BytesIO()
pickle.dump([Task("Learn pickle"), Task("Practice copying")], buffer, protocol=5)


In [15]:

# Load from bytes
buffer.seek(0)
loaded = pickle.load(buffer)
print("Loaded:", loaded)

Loaded: [Task(title='Learn pickle', done=False), Task(title='Practice copying', done=False)]


### Pickle safety notes

- Never unpickle data from untrusted sources.
- Prefer JSON for API input/output.
- Use explicit versioning for long-lived pickles.
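
If a pickle of uncertain origin has to be read anyway, one mitigation described in the Python `pickle` documentation is to subclass `pickle.Unpickler` and override `find_class` with an allowlist. The sketch below follows that pattern; the allowlist contents and the helper name `restricted_loads` are assumptions for illustration. This narrows the attack surface but is not a guarantee of safety.

```python
import builtins
import io
import pickle

# Assumption for this sketch: only these builtins may be referenced by a pickle.
ALLOWED_BUILTINS = {"range", "complex", "set", "frozenset"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle refers to a global (class or function).
        if module == "builtins" and name in ALLOWED_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and allowed builtins load normally.
ok = restricted_loads(pickle.dumps([1, 2, range(3)]))
print(ok)

# A pickle that references any other global is rejected.
try:
    restricted_loads(pickle.dumps(print))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print("Blocked:", exc)
```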

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import joblib
import pandas as pd

In [17]:

# 1. Train the model
data = load_iris()
model = RandomForestClassifier()
model.fit(data.data, data.target)


RandomForestClassifier parameters (as rendered by the notebook):

Parameter                 Value
n_estimators              100
criterion                 'gini'
max_depth                 None
min_samples_split         2
min_samples_leaf          1
min_weight_fraction_leaf  0.0
max_features              'sqrt'
max_leaf_nodes            None
min_impurity_decrease     0.0
bootstrap                 True


In [26]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [22]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [25]:
pd.DataFrame(data.data, columns=data.feature_names)  

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3


In [27]:

# 2. Save the model to a file
joblib.dump(model, 'rf_model.pkl')


['rf_model.pkl']

In [29]:

# 3. Load the model later
loaded_model = joblib.load('rf_model.pkl')


In [30]:

# Check: score the loaded model (on the training data, so 1.0 is expected here)
result = loaded_model.score(data.data, data.target)
print(f"Accuracy of the loaded model: {result}")

Accuracy of the loaded model: 1.0


### Custom state with `__getstate__` and `__setstate__`

Use these hooks to exclude sensitive fields or to support versioning.

In [4]:
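class SecureToken:
    def __init__(self, token: str):
        self.token = token
        self._cached = "expensive-derivation"

    def __getstate__(self):
        # Work on a copy so the live object keeps its real token
        state = self.__dict__.copy()
        # Replace the secret with a placeholder so it never reaches the pickle
        state["token"] = "***redacted***"
        return state

    def __setstate__(self, state):
        # Restore whatever was stored; the token comes back redacted
        self.__dict__.update(state)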

obj = SecureToken("secret")
print("Pickled state length:", len(pickle.dumps(obj)))

Pickled state length: 105
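
As a variation on the example above, the same hooks can drop a field entirely and rebuild it on load, which suits caches and other derived values. This is a minimal sketch; the `CachedReport` class and its fields are hypothetical.

```python
import pickle


class CachedReport:
    """Hypothetical class: `rows` is persisted, `_total` is a derived cache."""

    def __init__(self, rows: list[int]):
        self.rows = rows
        self._total = sum(rows)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_total"]  # leave the cache out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._total = sum(self.rows)  # recompute the cache after loading


report = CachedReport([1, 2, 3])
stored_state = report.__getstate__()
restored = pickle.loads(pickle.dumps(report))
print(stored_state)
print(restored.rows, restored._total)
```

Dropping the field keeps the pickle smaller and avoids persisting a value that could go stale; the cost is the recomputation in `__setstate__`.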


## 4. JSON serialization (portable)

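JSON is cross-language and parsing it does not execute code, but it only represents objects, arrays, strings, numbers, booleans, and null, so custom objects must be converted to dicts first.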

In [31]:
import json
from dataclasses import dataclass, asdict

@dataclass
class Address:
    city: str
    country: str

@dataclass
class Customer:
    name: str
    address: Address

customer = Customer("Dana", Address("Kyiv", "UA"))




In [32]:

# Serialize using asdict
json_str = json.dumps(asdict(customer), indent=2)
print(json_str)


{
  "name": "Dana",
  "address": {
    "city": "Kyiv",
    "country": "UA"
  }
}


In [33]:

# Restore manually
loaded = json.loads(json_str)
restored = Customer(loaded["name"], Address(**loaded["address"]))
print("Restored:", restored)

Restored: Customer(name='Dana', address=Address(city='Kyiv', country='UA'))
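
When an object contains values that JSON has no type for, such as dates, `json.dumps` accepts a `default` callable that is invoked for anything it cannot encode. The sketch below assumes a hypothetical `Invoice` dataclass and a helper named `to_jsonable`; the restore step still has to be written by hand.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from datetime import date


@dataclass
class Invoice:
    """Hypothetical record with a field JSON cannot encode directly."""
    number: str
    issued: date


def to_jsonable(obj):
    """Fallback for json.dumps: called only for objects it cannot encode."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


text = json.dumps(Invoice("INV-1", date(2026, 1, 23)), default=to_jsonable)
print(text)

# Restoring is explicit: JSON only gives back strings, so parse the date again.
raw = json.loads(text)
restored_invoice = Invoice(raw["number"], date.fromisoformat(raw["issued"]))
print(restored_invoice)
```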


In [35]:
path = 'beginner_edition/project2_config.json'


with open(path, 'r', encoding='utf-8') as file_object:
    # Deserialize the file content into a Python dictionary
    data = json.load(file_object)

In [36]:
data

{'name': 'bert_classifier',
 'learning_rate': 0.001,
 'batch_size': 32,
 'epochs': 10,
 'optimizer': 'adam',
 'validation_split': 0.2}

## 5. CSV serialization (tabular data)

CSV is best for flat, tabular data. Use `csv.DictWriter` for clarity.

In [38]:
import csv
from io import StringIO

rows = [
    {"name": "Alice", "score": 91},
    {"name": "Bob", "score": 84},
]


In [39]:

out = StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(rows)


In [40]:

csv_text = out.getvalue()
print(csv_text)


name,score
Alice,91
Bob,84



In [41]:

# Read it back
inp = StringIO(csv_text)
reader = csv.DictReader(inp)
loaded = [row for row in reader]
print("Loaded:", loaded)

Loaded: [{'name': 'Alice', 'score': '91'}, {'name': 'Bob', 'score': '84'}]
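
Notice that the scores came back as strings: CSV carries no type information. One simple way to restore types is a per-column converter table, sketched below. The `converters` mapping is an assumption of this example, not a feature of the `csv` module.

```python
import csv
from io import StringIO

csv_text = "name,score\nAlice,91\nBob,84\n"

# Hypothetical mapping from column name to the function that parses it.
converters = {"name": str, "score": int}

reader = csv.DictReader(StringIO(csv_text))
typed_rows = [
    {column: converters[column](value) for column, value in row.items()}
    for row in reader
]
print(typed_rows)
```

A missing column in `converters` raises `KeyError`, which is usually preferable to silently keeping an unexpected field as a string.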


### When to use JSON vs CSV

- JSON: nested objects, APIs, config files
- CSV: flat tables, spreadsheets, data exports

In [42]:
pd.read_csv('beginner_edition/users_export.csv')

   user_id                email    city  country zip_code                        created_at
0        1    alice@example.com    Lviv  Ukraine    79000  2026-01-23T15:02:40.795497+00:00
1        2      bob@example.com  Warsaw   Poland   00-001  2026-01-23T15:02:40.795503+00:00
2        3  charlie@example.com  Berlin  Germany    10115  2026-01-23T15:02:40.795505+00:00


In [51]:
pd.read_pickle('beginner_edition/addresses.pkl')

[Address(city='Kyiv', country='Ukraine'),
 Address(city='Lviv', country='Ukraine'),
 Address(city='Warsaw', country='Poland')]

In [None]:
pd.read_json('beginner_edition/project1', orient='index')

                                0
name              bert_classifier
learning_rate               0.001
batch_size                     32
epochs                         10
optimizer                    adam
validation_split              0.2


## 6. Shallow vs deep copy

Shallow copy duplicates the outer container but keeps inner references.
Deep copy duplicates the entire structure.

In [7]:
import copy

original = {"numbers": [1, 2, 3], "meta": {"owner": "A"}}
shallow = copy.copy(original)
deep = copy.deepcopy(original)

shallow["numbers"].append(4)
shallow["meta"]["owner"] = "B"

print("Original after shallow mutation:", original)
print("Deep copy remains:", deep)

Original after shallow mutation: {'numbers': [1, 2, 3, 4], 'meta': {'owner': 'B'}}
Deep copy remains: {'numbers': [1, 2, 3], 'meta': {'owner': 'A'}}
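
The same behavior can be checked directly with identity tests, which show exactly which objects are shared. A minimal sketch:

```python
import copy

original = {"numbers": [1, 2, 3]}
shallow = copy.copy(original)
deep = copy.deepcopy(original)

# The outer dict is a new object in both cases.
print(shallow is original)
print(deep is original)

# Only the shallow copy still points at the same inner list.
print(shallow["numbers"] is original["numbers"])
print(deep["numbers"] is original["numbers"])

# Adding a top-level key to the copy does not touch the original.
shallow["extra"] = True
print("extra" in original)
```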


### Common pitfall: mutable default arguments

In [8]:
def bad_accumulator(items=[]):
    items.append(1)
    return items

print(bad_accumulator())
print(bad_accumulator())


def good_accumulator(items=None):
    if items is None:
        items = []
    items.append(1)
    return items

print(good_accumulator())
print(good_accumulator())

[1]
[1, 1]
[1]
[1]
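
Dataclasses guard against the same pitfall: a bare mutable default such as `items: list = []` is rejected at class definition time, and `field(default_factory=...)` is the supported alternative. The `Basket` class below is a hypothetical example.

```python
from dataclasses import dataclass, field


@dataclass
class Basket:
    """Hypothetical class: each instance gets its own fresh list."""
    items: list[str] = field(default_factory=list)


first = Basket()
second = Basket()
first.items.append("apple")
print(first.items, second.items)

# A bare mutable default is refused when the class is created.
try:
    @dataclass
    class BrokenBasket:
        items: list = []
    definition_failed = False
except ValueError as exc:
    definition_failed = True
    print("Rejected:", exc)
```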


## 7. Practice and mini projects

Practice files:
- `beginner_edition/05_practice_tasks_beginner.py`
- `beginner_edition/06_mini_projects_beginner.py`

Suggested workflow:
1. Run the scripts end-to-end.
2. Modify inputs (add nested data, new fields).
3. Re-run and compare outputs.

# Part B: Advanced Edition

Focus: production patterns, validation, performance, and modern formats.

Relevant scripts:
- `advanced_edition/01_modern_encapsulation.py`
- `advanced_edition/02_pickle_production.py`
- `advanced_edition/03_modern_serialization.py`
- `advanced_edition/04_copying_performance.py`
- `advanced_edition/05_pydantic_dataclasses.py`

## 8. Dataclasses with validation (stdlib-first)

You can validate after initialization and add computed properties.

In [9]:
from dataclasses import dataclass

@dataclass
class DatabaseConfig:
    host: str
    port: int
    database: str
    pool_size: int = 10

    def __post_init__(self):
        if not (1 <= self.port <= 65535):
            raise ValueError("port out of range")
        if not self.database:
            raise ValueError("database name required")

    @property
    def dsn(self) -> str:
        return f"{self.host}:{self.port}/{self.database}"

cfg = DatabaseConfig(host="db.local", port=5432, database="app")
print("DSN:", cfg.dsn)

DSN: db.local:5432/app
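
It helps to see both what `__post_init__` catches and what it does not. The sketch below uses a hypothetical two-field `Endpoint` class: invalid input is rejected at construction, but a later assignment bypasses the check because `__post_init__` runs only once.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical two-field config used only for this illustration."""
    host: str
    port: int

    def __post_init__(self):
        if not (1 <= self.port <= 65535):
            raise ValueError(f"port out of range: {self.port}")


# Construction with a bad port fails.
try:
    Endpoint("db.local", 70000)
    rejected = False
except ValueError as exc:
    rejected = True
    print("Rejected:", exc)

# Assignment after construction is not validated.
ep = Endpoint("db.local", 5432)
ep.port = 0  # no error: __post_init__ is not re-run on assignment
print("Port after assignment:", ep.port)
```

If later mutation must also be prevented, `@dataclass(frozen=True)` or a validating property are the usual standard-library options.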


## 9. Production pickle patterns

Use explicit versioning and avoid loading untrusted data.

In [10]:
import pickle
from dataclasses import dataclass

@dataclass
class ModelState:
    version: int
    weights: list[float]

    def __getstate__(self):
        return {"version": self.version, "weights": self.weights}

    def __setstate__(self, state):
        if state.get("version") != 1:
            raise ValueError("Unsupported model version")
        self.version = state["version"]
        self.weights = state["weights"]

state = ModelState(version=1, weights=[0.1, 0.2, 0.3])
blob = pickle.dumps(state)
print("Serialized bytes:", len(blob))
print("Loaded:", pickle.loads(blob))

Serialized bytes: 99
Loaded: ModelState(version=1, weights=[0.1, 0.2, 0.3])
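
Rejecting unknown versions is the safe default, but long-lived pickles usually also need an upgrade path. The sketch below assumes a hypothetical `ModelStateV2` that adds a `bias` field and migrates version 1 state inside `__setstate__`. The old state is simulated by calling `__setstate__` directly, which is what pickle does with the stored dict on load.

```python
import pickle


class ModelStateV2:
    """Hypothetical successor to ModelState that adds a `bias` field."""

    def __init__(self, weights, bias=0.0):
        self.version = 2
        self.weights = weights
        self.bias = bias

    def __setstate__(self, state):
        version = state.get("version", 1)
        if version == 1:
            # Upgrade: version 1 had no bias, so fall back to a default.
            state = {**state, "version": 2, "bias": 0.0}
        elif version != 2:
            raise ValueError(f"Unsupported model version: {version}")
        self.__dict__.update(state)


# Simulate loading a state dict that was written by version 1.
old = ModelStateV2.__new__(ModelStateV2)
old.__setstate__({"version": 1, "weights": [0.1, 0.2]})
print(old.version, old.weights, old.bias)

# Current-version objects round-trip unchanged.
new = pickle.loads(pickle.dumps(ModelStateV2([0.3], bias=0.5)))
print(new.version, new.weights, new.bias)
```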


## 10. Modern serialization formats

- JSON Lines is great for streaming logs.
- MessagePack is a compact binary format with a JSON-like data model.
- Parquet is columnar and efficient for analytics (optional).

In [14]:
import json

# JSON Lines (one JSON object per line)
records = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)

loaded = [json.loads(line) for line in jsonl.splitlines()]
print("Loaded:", loaded)

{"id": 1, "value": 10}
{"id": 2, "value": 20}
Loaded: [{'id': 1, 'value': 10}, {'id': 2, 'value': 20}]
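
The streaming benefit of JSON Lines shows up when records are read one at a time instead of being joined into a single string. The generator below is a minimal sketch; the name `iter_jsonl` is an assumption, and a `StringIO` stands in for an open file.

```python
import json
from io import StringIO
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


stream = StringIO('{"id": 1, "value": 10}\n\n{"id": 2, "value": 20}\n')
total = sum(record["value"] for record in iter_jsonl(stream))
print(total)
```

Because each line is a complete JSON document, a writer can append records and a reader can stop early, neither of which works with a single large JSON array.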


In [15]:
# Optional: MessagePack
import importlib.util

if importlib.util.find_spec("msgpack"):
    import msgpack
    packed = msgpack.packb({"name": "Alice", "score": 99}, use_bin_type=True)
    unpacked = msgpack.unpackb(packed, raw=False)
    print("MessagePack bytes:", len(packed))
    print("Unpacked:", unpacked)
else:
    print("msgpack not installed")

MessagePack bytes: 19
Unpacked: {'name': 'Alice', 'score': 99}


In [16]:
# Optional: Parquet with pandas + pyarrow
import importlib.util

if importlib.util.find_spec("pandas") and importlib.util.find_spec("pyarrow"):
    import pandas as pd
    df = pd.DataFrame([{"id": 1, "score": 98}, {"id": 2, "score": 87}])
    path = "_tmp_scores.parquet"
    df.to_parquet(path, index=False)
    reloaded = pd.read_parquet(path)
    print(reloaded)
else:
    print("pandas/pyarrow not installed")

   id  score
0   1     98
1   2     87


## 11. Copying performance and optimization

Key idea: deep copy walks and duplicates every nested object, so its cost grows with the size of the structure. Prefer immutability or copy-on-write.

In [17]:
import copy
import time

nested = {"items": [list(range(1000)) for _ in range(200)]}

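# perf_counter is a monotonic, high-resolution clock intended for timing code
start = time.perf_counter()
for _ in range(50):
    copy.copy(nested)
shallow_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(5):
    copy.deepcopy(nested)
deep_time = time.perf_counter() - start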

print(f"Shallow copy (50x): {shallow_time:.4f}s")
print(f"Deep copy (5x):     {deep_time:.4f}s")

Shallow copy (50x): 0.0001s
Deep copy (5x):     0.2239s


### Optimization patterns

- Prefer immutable records for shared data.
- Use copy-on-write in dataframes where possible.
- Avoid deep copies in tight loops.
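
The first pattern can be sketched with a frozen dataclass and `dataclasses.replace`, which builds a modified record while sharing the unchanged fields. The `Settings` class is hypothetical, and the approach assumes the fields themselves are immutable (a tuple rather than a list).

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Hypothetical immutable record; tags is a tuple so it cannot be mutated."""
    name: str
    tags: tuple[str, ...]
    retries: int = 3


base = Settings("job", ("nightly",))
tuned = replace(base, retries=5)  # new record, no deep copy needed

print(base.retries, tuned.retries)
print(tuned.tags is base.tags)  # unchanged fields are shared, which is safe here

# In-place mutation is refused, so sharing cannot lead to surprises.
try:
    base.retries = 10
    frozen = False
except FrozenInstanceError:
    frozen = True
print("Frozen:", frozen)
```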

## 12. Pydantic advanced validation (optional)

If `pydantic` is installed, we can validate and normalize inputs at runtime.

In [18]:
import importlib.util

if importlib.util.find_spec("pydantic"):
    from pydantic import BaseModel, Field, field_validator

    class APIConfig(BaseModel):
        base_url: str = Field(...)
        api_key: str = Field(..., min_length=20)

        @field_validator("base_url")
        @classmethod
        def validate_url(cls, v: str) -> str:
            if not v.startswith(("http://", "https://")):
                raise ValueError("base_url must start with http:// or https://")
            return v

    cfg = APIConfig(base_url="https://api.example.com", api_key="sk-1234567890abcdefghij")
    print(cfg)
else:
    print("pydantic not installed")

base_url='https://api.example.com' api_key='sk-1234567890abcdefghij'


In [19]:
# Optional: Pydantic settings and env loading
import importlib.util

if importlib.util.find_spec("pydantic") and importlib.util.find_spec("pydantic_settings"):
    from pydantic_settings import BaseSettings, SettingsConfigDict

    class AppSettings(BaseSettings):
        api_key: str = "default"
        debug: bool = False
        workers: int = 4

        # SettingsConfigDict is the settings-aware config type that declares env_file
        model_config = SettingsConfigDict(env_file=".env")

    settings = AppSettings(api_key="sk-test", debug=True)
    print(settings)
else:
    print("pydantic/pydantic_settings not installed")

api_key='sk-test' debug=True workers=4


## 13. Summary and next steps

You now have both beginner and advanced coverage for Module 8.

Recommended next actions:
1. Run the beginner practice tasks and mini projects.
2. Run the advanced scripts and compare performance outputs.
3. Replace examples with your own data and re-run.

Entry points:
- Beginner: `beginner_edition/README_beginner.md`
- Advanced: `advanced_edition/README_advanced.md`