# Cats vs Dogs Image DataSet Preprocessing

In this notebook, we preprocess the images by running them through a pipeline that loads each image and converts it to a vector embedding using the `ClipEncoder` class from the `embetter` package.

The `ClipEncoder` converts each image to a vector of 512 values. This is the same approach used to encode text or phrases into vectors, applied here to images.

The encoded image data is then saved, and subsequent notebooks use it to train scikit-learn machine learning models to perform the cat vs dog classification.

The dataset is the Kaggle `Cats-vs-Dogs` dataset, which can be found in this repo or on the Kaggle website at:

https://www.kaggle.com/datasets/shaunthesheep/microsoft-catsvsdogs-dataset

In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder
from embetter.grab import ColumnGrabber
import pandas as pd
from pathlib import Path
import json

## Create Pandas DataFrame of filepaths and labels

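The function below scans the `cats` and `dogs` subdirectories of the `data` directory and returns a pandas DataFrame with two columns:

* `filepath`: the path to the image file
* `target`: the class name, taken from the subdirectory (`cats` or `dogs`)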

In [18]:
def create_filepaths_df() -> pd.DataFrame:
    """Collect the .jpg files under data/cats and data/dogs with their labels."""
    labels = ['cats', 'dogs']
    data = []
    for label in labels:
        # The subdirectory name doubles as the target label
        for file in Path(f'data/{label}').glob('*.jpg'):
            row_data = {
                'filepath': file,
                'target': label
            }
            data.append(row_data)
    files_df = pd.DataFrame(data, columns=["filepath", "target"])
    return files_df



In [19]:
files_df = create_filepaths_df()

In [20]:
files_df.head()

| | filepath | target |
|---|---|---|
| 0 | data/cats/cat.5077.jpg | cats |
| 1 | data/cats/cat.2718.jpg | cats |
| 2 | data/cats/cat.10151.jpg | cats |
| 3 | data/cats/cat.3406.jpg | cats |
| 4 | data/cats/cat.4369.jpg | cats |


### Save `files_df` to the `preprocessed_data` directory



In [21]:
files_df.to_csv("preprocessed_data/files_data.csv", index=False)

## Convert Image files to Image embeddings 

In [22]:
# Image Embedding Pipeline
image_embedding_pipeline = make_pipeline(
    ColumnGrabber("filepath"),
    ImageLoader(convert="RGB"),
    ClipEncoder(),
)

In [23]:
df = pd.read_csv("preprocessed_data/files_data.csv")

In [24]:
df.head()

| | filepath | target |
|---|---|---|
| 0 | data/cats/cat.5077.jpg | cats |
| 1 | data/cats/cat.2718.jpg | cats |
| 2 | data/cats/cat.10151.jpg | cats |
| 3 | data/cats/cat.3406.jpg | cats |
| 4 | data/cats/cat.4369.jpg | cats |


In [25]:
%%time

X = image_embedding_pipeline.fit_transform(df)

CPU times: user 1min 9s, sys: 6.41 s, total: 1min 15s
Wall time: 1min 49s


In [26]:
X.shape

(24980, 512)

In [27]:
type(X)

numpy.ndarray

In [28]:
y = df['target']

In [29]:
type(y)

pandas.core.series.Series

In [30]:
X[0:1]

array([[-2.75973976e-01, -3.60909134e-01, -8.04996341e-02,
        -4.34786528e-02, -2.33582407e-01, -1.86831743e-01,
         8.28629136e-02,  3.19521487e-01,  3.18694234e-01,
        -1.42109185e-01,  2.60429159e-02, -2.45073438e-01,
         7.03080356e-01, -1.63924083e-01,  3.00696284e-01,
         8.34534243e-02,  9.69187856e-01,  4.67789322e-02,
         3.80934566e-01,  5.19376755e-01, -4.75792468e-01,
        -5.41630946e-02,  6.56258702e-01, -3.17720890e-01,
        -6.65650249e-01, -6.35989308e-02,  6.21542454e-01,
         1.59050927e-01,  6.94697723e-03,  3.02137822e-01,
         1.80044323e-02,  1.26573900e-02,  1.32118696e-02,
         4.06707168e-01,  2.70707607e-01,  1.40250698e-01,
         3.37431788e-01, -1.39392778e-01,  1.37031125e-02,
         1.63533330e+00, -3.63601029e-01, -6.70272827e-01,
         3.04105747e-02,  3.65281664e-02, -6.36518449e-02,
        -1.40163314e+00, -9.81392711e-02, -2.84088627e-02,
         9.11035612e-02, -2.98568636e-01,  2.04224437e-0

In [31]:
# add embeddings to dataframe
df['image_embedding'] = [json.dumps(emb.tolist()) for emb in X]
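
Each embedding is stored as a JSON string so that it fits in a single CSV column. The sketch below checks that this encoding can be reversed without losing values. It uses a short stand-in vector instead of a real 512-value embedding, and it assumes the embeddings are `float32`; the names `embedding`, `encoded`, and `restored` are illustrative only.

```python
import json

import numpy as np

# Stand-in for one row of X (assumption: real rows hold 512 float32 values).
embedding = np.array([-0.27597398, -0.36090913, -0.08049963], dtype=np.float32)

# Serialize the same way the cell above does: array -> list -> JSON string.
encoded = json.dumps(embedding.tolist())

# Decode the JSON string back into a float32 array.
restored = np.array(json.loads(encoded), dtype=np.float32)

# True when every value survives the round trip unchanged.
round_trip_ok = np.array_equal(embedding, restored)
print(round_trip_ok)
```

The JSON text is longer than the binary data it represents, which is the cost of keeping everything in one human-readable CSV file.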

In [32]:
df.head()

| | filepath | target | image_embedding |
|---|---|---|---|
| 0 | data/cats/cat.5077.jpg | cats | [-0.27597397565841675, -0.3609091341495514, -0... |
| 1 | data/cats/cat.2718.jpg | cats | [-0.2487901747226715, 0.06151339039206505, 0.2... |
| 2 | data/cats/cat.10151.jpg | cats | [-0.20199576020240784, -0.08704289048910141, 0... |
| 3 | data/cats/cat.3406.jpg | cats | [-0.2417299449443817, -0.42382076382637024, -0... |
| 4 | data/cats/cat.4369.jpg | cats | [0.11845763772726059, 0.021671850234270096, 0.... |


In [33]:
df.tail()

| | filepath | target | image_embedding |
|---|---|---|---|
| 24975 | data/dogs/dog.9316.jpg | dogs | [-0.007655435241758823, 0.3041452467441559, 0.... |
| 24976 | data/dogs/dog.6025.jpg | dogs | [0.12140308320522308, 0.28480660915374756, 0.0... |
| 24977 | data/dogs/dog.8008.jpg | dogs | [0.2268548309803009, 0.021711133420467377, -0.... |
| 24978 | data/dogs/dog.1992.jpg | dogs | [0.0660947933793068, -0.3919999301433563, 0.12... |
| 24979 | data/dogs/dog.12412.jpg | dogs | [-0.18819163739681244, -0.33489400148391724, -... |


### Save the new dataframe with the embeddings

In [34]:
df.to_csv("preprocessed_data/image_embeddings.csv", index=False)

## Image Embeddings DataSet

The image embeddings dataset can now be loaded and used for training machine learning models.

The features will be the `image_embedding` column and the target will be the `target` column.
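
As a minimal sketch of how a later notebook might read the saved file back, the helper below parses the JSON strings in `image_embedding` into a 2D feature array and returns the labels alongside it. The function name `load_embeddings` and the tiny in-memory CSV (with made-up file names and 3-value vectors) are assumptions for illustration; in practice the source would be `preprocessed_data/image_embeddings.csv`.

```python
import io
import json

import numpy as np
import pandas as pd


def load_embeddings(csv_source):
    """Return (X, y) from a CSV with 'image_embedding' and 'target' columns."""
    df = pd.read_csv(csv_source)
    # Each cell holds a JSON list; decode every row and stack into one array.
    X = np.array([json.loads(s) for s in df["image_embedding"]], dtype=np.float32)
    y = df["target"].to_numpy()
    return X, y


# Hypothetical two-row file standing in for the real 24,980-row CSV.
demo_csv = io.StringIO(
    "filepath,target,image_embedding\n"
    'data/cats/cat.0.jpg,cats,"[0.1, 0.2, 0.3]"\n'
    'data/dogs/dog.0.jpg,dogs,"[0.4, 0.5, 0.6]"\n'
)

X_demo, y_demo = load_embeddings(demo_csv)
print(X_demo.shape, list(y_demo))
```

With the real file, `X` should have the same `(24980, 512)` shape seen earlier in this notebook, and `y` holds the `cats`/`dogs` labels as strings, which scikit-learn classifiers accept directly.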