# DVC pipelines and model compression

In [None]:
#@title Environment setup
!rm -rf sample_data .config
!git config --global user.email "jane@doe.eu"
!git config --global user.name "Jane Doe"
!git config --global init.defaultBranch main
!pip install dvc dvclive --quiet
!pip install -U uv

## DVC & Git repositories setup

Initialize DVC & Git.

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!git init

In [None]:
!dvc init

In [None]:
!git commit -m "Initialization of DVC and Git"

## Adding the data to DVC

With [`dvc import-url`](https://dvc.org/doc/command-reference/import-url), download the following zip that we are going to use:

    https://github.com/shuuchuu/dataset-landscape/archive/refs/heads/main.zip

Use `data.zip` as output name.

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

Commit the changes to `git`.

### Solution

In [None]:
!dvc import-url https://github.com/shuuchuu/dataset-landscape/archive/refs/heads/main.zip data.zip

In [None]:
!git add .gitignore data.zip.dvc

In [None]:
!git commit -m "Add data"
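
Git only stores the small `data.zip.dvc` file, while DVC tracks the archive itself by content hash. The sketch below illustrates that idea under the assumption that an MD5 digest is what gets recorded; `file_md5` and `has_changed` are hypothetical helpers for illustration, not part of DVC's API.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_md5: str) -> bool:
    """Tell whether a file differs from a previously recorded digest."""
    return file_md5(path) != recorded_md5


# Demonstration on a throwaway file (not on the real data.zip).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"first version")
    recorded = file_md5(sample)
    sample.write_bytes(b"second version")
    print("changed:", has_changed(sample, recorded))
```

Comparing a recorded digest with the current one is enough to tell whether a tracked file needs to be refreshed.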

## Create a pipeline step to extract the contents of the zip archive

With `dvc stage add`, or by editing the `dvc.yaml` file directly, create a DVC pipeline step that extracts the files from `data.zip` (you can use `unzip` for this).

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!dvc stage add -n decompress -d data.zip -o dataset-landscape-main unzip data.zip

In [None]:
!dvc repro

In [None]:
!git add dvc.lock dvc.yaml .gitignore

In [None]:
!git commit -m "Add extraction step"
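
If `unzip` is not available on your machine, the extraction can also be done from Python. This is a minimal sketch using only the standard library; `extract_archive` is a hypothetical helper name, and the demo runs on a tiny archive created on the fly rather than on the real `data.zip`.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: Path, output_dir: Path) -> list[str]:
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()


# Demonstration on a small archive built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("demo-main/forest/0.txt", "placeholder")
    members = extract_archive(archive_path, Path(tmp) / "out")
    print(members)
```

The standard library also exposes this as a command, `python -m zipfile -e data.zip .`, which could stand in for `unzip data.zip` in the stage definition.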

## Python project setup

Set up a Python project (or a single Python file) so that you can run your source code easily.

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

### Solution

1. Run `uv init --package --name compression .` to create `pyproject.toml`
2. Create the source folder and an `__init__.py` in it
3. Add an entry point to `pyproject.toml`:
        [project.scripts]
        commandname = "folder.file:function"
4. Run `uv sync`
5. Run the command with `!uv run commandname`
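
The `folder.file:function` value reads as "import the module before the colon, then call the attribute after it". The sketch below is a simplified illustration of that lookup, not the actual launcher code generated at install time; `resolve_entry_point` is a hypothetical helper, and `json:dumps` is only a convenient standard-library target.

```python
import importlib
from typing import Callable


def resolve_entry_point(spec: str) -> Callable:
    """Resolve a "module.path:function" string to the callable it names."""
    module_name, _, attribute = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Resolve and call a standard-library function through its entry point string.
dumps = resolve_entry_point("json:dumps")
print(dumps({"project": "compression"}))
```

With the project above, an entry such as `compression = "compression:main"` would therefore point at the `main` function in `src/compression/__init__.py`.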

In [None]:
!uv init --package --name compression .

In [None]:
%%writefile src/compression/__init__.py
def main() -> None:
    print("Hello World")

In [None]:
!uv sync

In [None]:
!uv run compression

## Data preparation

To prepare the data before training and compressing the model, we are going to use the following function, which you can incorporate into your codebase as you see fit:

In [None]:
#@title Data preparation code
import pathlib

import cv2
import numpy
import sklearn.utils

CLASS_NAMES = ["buildings", "forest", "glacier", "mountain", "sea", "street"]
CLASS_INDICES = {name: index for index, name in enumerate(CLASS_NAMES)}


def get_images(
    dir_path: pathlib.Path, image_size: tuple[int, int], shuffle: bool = True
) -> tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]:
    """Load the images found in dir_path/<class name>/ with labels and paths."""
    images = []
    labels = []
    file_paths = []

    for subdir_path in dir_path.iterdir():
        label = CLASS_INDICES.get(subdir_path.name)
        # Skip stray files and directories that are not a known class.
        if label is None or not subdir_path.is_dir():
            continue

        for image_path in subdir_path.iterdir():
            image = cv2.imread(str(image_path))
            if image is None:  # cv2.imread returns None for unreadable files
                continue
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
            # cv2.resize takes (width, height) and returns a (height, width, 3)
            # array, so a square image_size avoids any shape mismatch.
            image = cv2.resize(image, image_size)
            image = image.astype("float32")
            images.append(image)
            labels.append(label)
            file_paths.append(image_path)
    images_array = numpy.array(images)
    labels_array = numpy.array(labels)
    file_paths_array = numpy.array(file_paths)

    if shuffle:
        # Shuffle the three arrays consistently so they stay aligned.
        images_array, labels_array, file_paths_array = sklearn.utils.shuffle(
            images_array, labels_array, file_paths_array
        )
    return images_array, labels_array, file_paths_array
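
Before loading thousands of images, it can help to check that the extracted folder has the layout `get_images` expects: one subdirectory per class name. The following sketch counts files per integer label without reading any image; `count_files_per_class` is a hypothetical helper, and the demo builds a throwaway directory tree rather than assuming the real dataset layout.

```python
import tempfile
from collections import Counter
from pathlib import Path

CLASS_NAMES = ["buildings", "forest", "glacier", "mountain", "sea", "street"]
CLASS_INDICES = {name: index for index, name in enumerate(CLASS_NAMES)}


def count_files_per_class(dir_path: Path) -> Counter:
    """Count the files in each known class subdirectory, keyed by label."""
    counts: Counter = Counter()
    for subdir_path in sorted(dir_path.iterdir()):
        label = CLASS_INDICES.get(subdir_path.name)
        # Ignore stray files and directories that are not a known class.
        if label is None or not subdir_path.is_dir():
            continue
        counts[label] += sum(1 for path in subdir_path.iterdir() if path.is_file())
    return counts


# Demonstration on a small fake tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for class_name, n_files in (("forest", 2), ("sea", 1)):
        (root / class_name).mkdir()
        for i in range(n_files):
            (root / class_name / f"{i}.jpg").touch()
    print(count_files_per_class(root))
```

An empty result usually means the path points one level too high or too low in the extracted archive.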


## Training a simple model

We are now going to train a simple (and historical) classification model: LeNet. You can incorporate the code below into your project as you see fit.

In [None]:
#@title Model definition code
import tensorflow as tf


def get_lenet(
    image_size: tuple[int, int], learning_rate: float = 1e-4
) -> tf.keras.models.Model:
    def conv(filters: int, padding: str) -> tf.keras.layers.Conv2D:
        return tf.keras.layers.Conv2D(
            filters=filters, kernel_size=5, padding=padding, activation="sigmoid"
        )

    def pooling() -> tf.keras.layers.MaxPooling2D:
        return tf.keras.layers.MaxPooling2D()

    def dense(units: int, activation: str = "sigmoid") -> tf.keras.layers.Dense:
        return tf.keras.layers.Dense(units, activation=activation)

    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(*image_size, 3)),  # accepted by Keras 2 and Keras 3
            conv(6, "same"),
            pooling(),
            conv(16, "valid"),
            pooling(),
            tf.keras.layers.Flatten(),
            dense(120),
            dense(84),
            dense(6, activation="softmax"),
        ],
        name="le_net",
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    return model
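
To get a feel for how the image size drives the model size (which matters for the compression step), the output shapes and parameter counts of this architecture can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming the layer settings shown above (5x5 kernels, 2x2 max pooling with default strides); `lenet_summary` is a hypothetical helper, not a Keras function.

```python
def lenet_summary(
    image_size: tuple[int, int], channels: int = 3
) -> list[tuple[str, tuple[int, ...], int]]:
    """Return (layer name, output shape, parameter count) for each layer."""
    dim_0, dim_1 = image_size
    rows: list[tuple[str, tuple[int, ...], int]] = []

    # 5x5 convolution, "same" padding: spatial size unchanged.
    rows.append(("conv_1", (dim_0, dim_1, 6), 5 * 5 * channels * 6 + 6))
    # 2x2 max pooling halves each spatial dimension (floor division).
    dim_0, dim_1 = dim_0 // 2, dim_1 // 2
    rows.append(("pool_1", (dim_0, dim_1, 6), 0))
    # 5x5 convolution, "valid" padding: each dimension shrinks by 4.
    dim_0, dim_1 = dim_0 - 4, dim_1 - 4
    rows.append(("conv_2", (dim_0, dim_1, 16), 5 * 5 * 6 * 16 + 16))
    dim_0, dim_1 = dim_0 // 2, dim_1 // 2
    rows.append(("pool_2", (dim_0, dim_1, 16), 0))

    units_in = dim_0 * dim_1 * 16
    rows.append(("flatten", (units_in,), 0))
    for name, units in (("dense_1", 120), ("dense_2", 84), ("dense_3", 6)):
        # Dense layer: one weight per input/output pair plus one bias per unit.
        rows.append((name, (units,), units_in * units + units))
        units_in = units
    return rows


for name, shape, params in lenet_summary((64, 64)):
    print(f"{name:8s} {str(shape):16s} {params:>8d}")
print("total:", sum(params for _, _, params in lenet_summary((64, 64))))
```

Almost all parameters sit in the first dense layer, so the flattened feature map size (and therefore the input image size) dominates the size of the saved model.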


In [None]:
#@title Model training code
from pathlib import Path

from ... import get_images  # To edit depending on your code organization
from ... import get_lenet  # To edit depending on your code organization


def train(
    data_dir: str,
    image_size: tuple[int, int],
    learning_rate: float,
) -> None:
    # The file paths returned by get_images are not needed for training.
    images, labels, _ = get_images(Path(data_dir), image_size)
    model = get_lenet(image_size, learning_rate)
    model.fit(images, labels, batch_size=128, epochs=3)
    model.save("landscape_classifier.keras")


### Pipeline step definition

You can now define a pipeline step to train a model.
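
A pipeline step runs a shell command, so the `train` function needs a command-line wrapper that the step can call. Here is one possible sketch based on `argparse`; the option names, the default values and the `train_kwargs` helper are all assumptions made for illustration, to adapt to your own code organization.

```python
import argparse
from typing import Optional


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parse the command-line options of a hypothetical training command."""
    parser = argparse.ArgumentParser(description="Train the landscape classifier.")
    parser.add_argument("data_dir", help="directory with one subfolder per class")
    parser.add_argument(
        "--image-size", type=int, nargs=2, default=(64, 64), metavar="N",
        help="two integers passed as image_size (hypothetical default: 64 64)",
    )
    parser.add_argument("--learning-rate", type=float, default=1e-4)
    return parser.parse_args(argv)


def train_kwargs(argv: Optional[list[str]] = None) -> dict:
    """Turn command-line options into keyword arguments for train()."""
    args = parse_args(argv)
    return {
        "data_dir": args.data_dir,
        "image_size": tuple(args.image_size),
        "learning_rate": args.learning_rate,
    }


# In the real entry point, this dictionary would be passed on: train(**kwargs)
print(train_kwargs(["some/data/dir", "--image-size", "32", "32"]))
```

A function that calls `train(**train_kwargs())` can then be registered under `[project.scripts]` and used as the command of the training step.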

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

In [None]:
!dvc repro

## Post-training model compression

You can follow this [guide](https://www.tensorflow.org/model_optimization/guide/quantization/post_training) to add a compression step after training. Check the model's performance after quantization.
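
To see why quantization can change the model's predictions, here is a toy illustration of symmetric 8-bit quantization on a handful of weights. It is a simplified sketch of the general idea, not the converter's actual implementation; `quantize_int8` and `dequantize` are hypothetical helpers.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0, 0.333]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Largest rounding error introduced by the round trip.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale, which is why the accuracy should be measured again after compression.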

In [None]:
!# Your command here, note the ! that prefixes bash commands in Colab

In [None]:
!dvc repro

## Solution

A complete solution is available at https://github.com/shuuchuu/compression