# Cats vs Dogs Image Dataset Preprocessing

In this notebook, we preprocess the images by running them through a pipeline that scales each image and converts it to a vector embedding using the `ClipEncoder` class from the `embetter` package.

The `ClipEncoder` converts each image to a vector of 512 values. This mirrors the way text or text phrases are encoded into vectors; the same approach is applied here to images.

The encoded image data will then be saved and that data will be used in subsequent notebooks to train scikit-learn machine learning models to perform the cat vs dog classification.

The dataset is the Kaggle `Cats-vs-Dogs` dataset, which can be found in this repo or on the Kaggle website at:

https://www.kaggle.com/datasets/shaunthesheep/microsoft-catsvsdogs-dataset

### CLIP (Contrastive Language–Image Pretraining) Background Information

**What is a CLIP encoder?**
CLIP stands for Contrastive Language–Image Pretraining. It’s a model made by OpenAI that can understand both images and text, and match them to each other. So, if you show CLIP a picture of a dog and the word “dog,” it will know they go together.

CLIP uses two main parts:

* An image encoder that looks at an image and turns it into numbers (called an embedding).
* A text encoder that does the same thing for text.

**What does it mean to "encode" an image?**
When we encode an image, we’re turning it into a list of numbers that represents the important stuff about the image—kind of like its fingerprint.

This list of numbers is called a vector or embedding. It's like a summary of the image that CLIP can use to compare it with other images or text.

Think of it like this:

* Imagine you take a picture of a cat.
* The CLIP image encoder looks at that picture and gives you a list of, say, 512 numbers.
* These numbers don't look like much to us, but to the model, they capture key features like shapes, colors, and what objects are in the image.

**What does the output represent?**
The output is a list of numbers (a vector), for example:

`[0.12, -0.58, 0.33, ..., 0.05]` (512 numbers in total)

No single number has a meaning that a human can read off on its own, but taken together they encode the features and patterns that help the computer know what’s in the image.

**For example:**

* Similar images (like two pictures of dogs) will have similar vectors.
* Different images (like a dog vs. a car) will have different vectors.
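
One common way to quantify "similar vectors" is cosine similarity. The sketch below is a minimal illustration that uses made-up 4-dimensional vectors rather than real 512-dimensional CLIP embeddings; the names `cosine_similarity`, `dog_a`, `dog_b`, and `car` are hypothetical and not part of this notebook's pipeline.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (invented values, not CLIP output).
dog_a = [0.9, 0.1, 0.3, 0.0]
dog_b = [0.8, 0.2, 0.4, 0.1]
car = [-0.2, 0.9, -0.5, 0.7]

sim_dogs = cosine_similarity(dog_a, dog_b)
sim_dog_car = cosine_similarity(dog_a, car)
print(f"dog vs dog: {sim_dogs:.3f}")
print(f"dog vs car: {sim_dog_car:.3f}")
```

A value close to 1 means the two vectors point in nearly the same direction, while values near 0 or below indicate unrelated content. With these toy values the two "dog" vectors score much higher than the "dog" and "car" pair.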

**Why is this useful?**
Because once an image is a vector:

* You can compare it with text vectors (like the word "cat" or "dog").
* You can search for images that are similar.
* You can do things like captioning, clustering, or even generating images based on text.

**Summary**

* A CLIP encoder turns images into numbers (vectors).
* These numbers summarize what’s in the image.
* They help computers understand and compare images and text, even if the computer has never seen the exact image before.


**Here’s how it works:**
When you pass an image to CLIP:

* The image is loaded (usually as pixels).
* It’s resized and preprocessed (to match what CLIP expects, such as 224×224 pixels).
* Then it's passed through the image encoder (like a modified ResNet or Vision Transformer).
* The encoder outputs a vector of numbers (the embedding) that represents only the visual content of the image.
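
To make the resize-and-preprocess step concrete, here is a rough sketch of the arithmetic involved. It assumes the preprocessing published with OpenAI's original CLIP models: scale the shorter side to 224 pixels, center-crop a 224×224 square, then normalize each RGB channel. In this notebook the `ClipEncoder` handles preprocessing internally and its exact steps (including rounding) may differ; the helper names `resized_size`, `center_crop_box`, and `normalize_pixel` are hypothetical.

```python
# Per-channel mean and standard deviation used by the original CLIP
# preprocessing (assumed here; the encoder applies its own internally).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
TARGET = 224


def resized_size(width, height, target=TARGET):
    """Scale so the shorter side equals `target`, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, target=TARGET):
    """Return (left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def normalize_pixel(rgb):
    """Map 0-255 RGB values to normalized floats, channel by channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


# Example: a hypothetical 500 x 375 photo.
new_w, new_h = resized_size(500, 375)
box = center_crop_box(new_w, new_h)
print((new_w, new_h), box)
print(normalize_pixel((128, 128, 128)))
```

The crop box follows the `(left, top, right, bottom)` convention used by common imaging libraries, so the same numbers could be passed to an image-cropping call.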


### Directory Structure

root_dir

    - data ( images from the kaggle dataset.  The directories have been renamed from the original download )
     
         - cats
     
         - dogs
    
    - holdout ( 20 samples of cats and dogs were removed from the original dataset to test after the model has been trained )

        - cats

        - dogs

    - models
        ( where trained models will be stored )
        
    - preprocessed_data
        ( where intermediate preprocessed files will be stored )
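
The layout above can be created with `pathlib`. The sketch below builds it under a temporary directory so that it is safe to run anywhere; in practice the root would point at the dataset location. The helper name `make_project_dirs` and the `SUBDIRS` list are hypothetical.

```python
import tempfile
from pathlib import Path

# Sub-directories from the layout described above.
SUBDIRS = [
    "data/cats",
    "data/dogs",
    "holdout/cats",
    "holdout/dogs",
    "models",
    "preprocessed_data",
]


def make_project_dirs(root):
    """Create the expected sub-directories under `root` if they are missing."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# Use a throwaway directory for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    root = make_project_dirs(tmp)
    created = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )
    print(created)
```

Because `mkdir` is called with `exist_ok=True`, running the helper against an existing layout leaves it unchanged.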
        

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder
from embetter.grab import ColumnGrabber
import pandas as pd
from pathlib import Path
import json

## Create Pandas DataFrame of filepaths and labels

The function below scans a directory under `root_dir` (here, `data`) and returns a pandas DataFrame with two columns:

* `filepath`: the path to each image file
* `target`: the label, taken from the sub-directory name (`cats` or `dogs`)

In [2]:
dirs = ['cats', 'dogs']

# root_dir = '/Users/patrickryan/Development/machinelearning/scikit-learn/cats-vs-dogs-with-scikit-learn'


root_dir = '/Volumes/TheVault/ml_datasets/kaggle/cats-vs-dogs'


In [3]:
def create_filepaths_df(dir_name: str) -> pd.DataFrame:
    """Build a DataFrame of image file paths and labels found under root_dir/dir_name."""
    data = []
    # Each sub-directory name ('cats' or 'dogs') doubles as the target label.
    for label in dirs:
        for file in Path(root_dir, dir_name, label).glob('*.jpg'):
            data.append({'filepath': file, 'target': label})
    return pd.DataFrame(data, columns=["filepath", "target"])


In [4]:
files_df = create_filepaths_df(dir_name='data')

In [5]:
files_df.head()

|  | filepath | target |
|---|---|---|
| 0 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 1 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 2 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 3 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 4 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |


In [6]:
files_df.tail()

|  | filepath | target |
|---|---|---|
| 24953 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs |
| 24954 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs |
| 24955 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs |
| 24956 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs |
| 24957 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs |


### Save `files_df` to the `preprocessed_data` directory



In [7]:
files_df.to_csv(f"{root_dir}/preprocessed_data/files_data.csv", index=False)

## Convert Image Files to Image Embeddings

In [8]:
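# Image embedding pipeline:
#   1. ColumnGrabber selects the "filepath" column from the DataFrame
#   2. ImageLoader opens each file as an RGB image
#   3. ClipEncoder converts each image into a CLIP embedding vector
image_embedding_pipeline = make_pipeline(
    ColumnGrabber("filepath"),
    ImageLoader(convert="RGB"),
    ClipEncoder(),
)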

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [9]:
df = pd.read_csv(f"{root_dir}/preprocessed_data/files_data.csv")

In [10]:
df.head()

|  | filepath | target |
|---|---|---|
| 0 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 1 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 2 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 3 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |
| 4 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats |


### Create the CLIP embedding for each image

In [11]:
%%time

X = image_embedding_pipeline.fit_transform(df)



CPU times: user 1min 10s, sys: 6.34 s, total: 1min 16s
Wall time: 1min 50s


In [12]:
X.shape

(24958, 512)

In [13]:
type(X)

numpy.ndarray

In [14]:
y = df['target']

In [15]:
type(y)

pandas.core.series.Series

In [16]:
X[0:1]

array([[-4.27153111e-02, -6.16295636e-02, -1.19786978e-01,
         3.06515515e-01,  2.66768724e-01, -1.00176424e-01,
        -1.45210430e-01, -9.61131752e-02,  8.40345800e-01,
        -4.60286072e-04,  4.09363389e-01, -1.17742501e-01,
         5.10108352e-01, -2.72414893e-01,  4.07326035e-02,
         1.02416284e-01,  9.01596487e-01, -5.85524812e-02,
         3.55524927e-01, -4.31637242e-02,  1.78646296e-01,
         3.30819309e-01,  6.35441959e-01, -3.41306478e-01,
        -2.18503237e-01,  1.85449734e-01,  2.55678535e-01,
        -1.89959839e-01, -1.61174178e-01, -1.36801600e-01,
        -2.57087767e-01,  2.99870789e-01,  1.54347807e-01,
         2.33616084e-02,  5.55954993e-01,  3.92032206e-01,
         2.89443940e-01, -9.27947909e-02, -5.74937873e-02,
         1.71408629e+00,  1.49413392e-01, -6.02799177e-01,
        -1.26615927e-01,  8.28924328e-02,  3.23446393e-01,
        -4.56805229e-02,  4.43084687e-01, -5.90242334e-02,
        -2.15918213e-01,  1.73231140e-01,  7.36892968e-0

In [17]:
# add embeddings to dataframe
df['image_embedding'] = [json.dumps(emb.tolist()) for emb in X]

In [18]:
df.head()

|  | filepath | target | image_embedding |
|---|---|---|---|
| 0 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats | [-0.04271531105041504, -0.06162956357002258, -... |
| 1 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats | [0.1041913852095604, 0.25702211260795593, 0.01... |
| 2 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats | [0.059033650904893875, -0.717756986618042, -0.... |
| 3 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats | [-0.24990229308605194, -0.1539546400308609, 0.... |
| 4 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | cats | [-0.20358122885227203, -0.2211882770061493, -0... |


In [19]:
df.tail()

|  | filepath | target | image_embedding |
|---|---|---|---|
| 24953 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs | [-0.2488526999950409, -0.345273494720459, -0.4... |
| 24954 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs | [-0.15033303201198578, -0.1186925545334816, 0.... |
| 24955 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs | [0.031125575304031372, 0.21204261481761932, -0... |
| 24956 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs | [-0.2592233717441559, 0.08375595510005951, 0.0... |
| 24957 | /Volumes/TheVault/ml_datasets/kaggle/cats-vs-d... | dogs | [-0.47171539068222046, -0.24606642127037048, -... |


### Save the new dataframe with the embeddings

In [20]:
df.to_csv(f"{root_dir}/preprocessed_data/image_embeddings.csv", index=False)

## Image Embeddings Dataset

The image embeddings dataset can now be loaded and used for training machine learning models.

The features will be the `image_embedding` column and the target will be the `target` column.
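
As a hedged sketch of how a later notebook might read these embeddings back, the example below decodes the JSON-encoded `image_embedding` strings into lists of floats and maps the `target` labels to integers. To stay self-contained it uses a tiny in-memory CSV with invented 3-value embeddings and the standard `csv` module; with pandas, the same decoding could be done with `df["image_embedding"].apply(json.loads)`. The `LABELS` mapping and the variable names are assumptions, not part of this notebook.

```python
import csv
import io
import json

# Tiny stand-in for image_embeddings.csv (invented paths and values).
rows = [
    {"filepath": "data/cats/1.jpg", "target": "cats",
     "image_embedding": json.dumps([0.1, -0.2, 0.3])},
    {"filepath": "data/dogs/1.jpg", "target": "dogs",
     "image_embedding": json.dumps([-0.4, 0.5, 0.6])},
]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["filepath", "target", "image_embedding"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)

LABELS = {"cats": 0, "dogs": 1}  # hypothetical integer encoding

X, y = [], []
for row in csv.DictReader(buffer):
    # Each cell holds a JSON string such as "[0.1, -0.2, 0.3]".
    X.append(json.loads(row["image_embedding"]))
    y.append(LABELS[row["target"]])

print(len(X), len(X[0]), y)
```

Storing each embedding as a JSON string keeps the CSV to one row per image, at the cost of a decoding step like this one before the vectors can be passed to a model.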