***
<span style="font-size:32px; color:rgba(0, 0, 255, 0.5);">Day 2 - Embeddings & Vector Stores/Databases</span>

---

<span style="font-size:24px; color:rgba(0, 0, 0, 0.5);">Classifying embeddings with Keras and the Gemini API</span>

---
"Modern machine learning thrives on diverse data—images, text, audio, and more. This whitepaper explores the power of embeddings, which transform this heterogeneous data into a unified vector representation for seamless use in various applications.

**Why embeddings are important**

In essence, embeddings are numerical representations of real-world data such as text, speech, images, or videos. They are expressed as low-dimensional vectors where the geometric distance between two vectors in the vector space is a projection of the relationship between the two real-world objects that the vectors represent. In other words, they provide compact representations of data of different types while also allowing you to compare two different data objects and tell how similar or different they are on a numerical scale. For example, the word ‘computer’ has a similar meaning to the picture of a computer, as well as to the word ‘laptop’, but not to the word ‘car’. These low-dimensional numerical representations of real-world data significantly help efficient large-scale data processing and storage by acting as a means of lossy compression of the original data while retaining its important properties."

<b>Authors:</b><br>
Anant Nawalgaria and Xiaoqi Ren
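
To make the idea of "geometric distance as similarity" from the passage above concrete, here is a minimal sketch using cosine similarity. The three vectors are hand-picked, hypothetical 3-dimensional values chosen only for illustration; real embeddings from the API have many more dimensions and are not produced this way.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
computer = [0.9, 0.8, 0.1]
laptop = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(f"computer vs laptop: {cosine_similarity(computer, laptop):.3f}")
print(f"computer vs car:    {cosine_similarity(computer, car):.3f}")
```

With these toy values, the computer/laptop pair scores higher than the computer/car pair, which mirrors the comparison described in the quoted text.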

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Resources</span>

---
**Whitepaper**<br>
https://www.kaggle.com/whitepaper-embeddings-and-vector-stores

**Embedding and Vector Stores Podcast**<br>
https://www.youtube.com/watch?v=1CC39K76Nqs

**Embedding and Vector Databases Livestream**<br>
https://www.youtube.com/watch?v=kpRyiJUUFxY

**Get your API key from**<br>
https://aistudio.google.com/app/apikey

**Kaggle**<br>
https://www.kaggle.com/code/markishere/day-2-classifying-embeddings-with-keras

In [1]:
# %pip install google-generativeai

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Libraries</span>

---

In [18]:
import os, email, re, keras

import pandas as pd

import numpy as np

from dotenv import load_dotenv

from sklearn.datasets import fetch_20newsgroups

import google.generativeai as genai
from google.api_core import retry

from tqdm.rich import tqdm

from keras import layers

import warnings
from tqdm import TqdmExperimentalWarning

warnings.filterwarnings("ignore", category=TqdmExperimentalWarning)

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Initialize the API</span>

---

In [3]:
# Load API key from .env file
load_dotenv()
api_key = os.getenv("GAI_API_KEY")

# Set up the API key for the genai library
genai.configure(api_key=api_key)

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Dataset</span>

---
The 20 Newsgroups Text Dataset contains about 18,000 newsgroup posts on 20 topics, divided into training and test sets. The split between the training and test datasets is based on whether a message was posted before or after a specific date. For this tutorial, you will use sampled subsets of the training and test sets, and perform some processing using Pandas.

In [4]:
newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")

# View list of class names for dataset
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Here is an example of what a record from the training set looks like.

In [5]:
print(newsgroups_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







Start by preprocessing the data for this tutorial in a Pandas dataframe. To remove any sensitive information like names and email addresses, you will take only the subject and body of each message. This is an optional step that transforms the input data into more generic text, rather than email posts, so that it will work in other contexts.

In [6]:
def preprocess_newsgroup_row(data):
    # Extract only the subject and body
    msg = email.message_from_string(data)
    text = f"{msg['Subject']}\n\n{msg.get_payload()}"
    # Strip any remaining email addresses
    text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)
    # Truncate each entry to 5,000 characters
    text = text[:5000]

    return text


def preprocess_newsgroup_data(newsgroup_dataset):
    # Put data points into dataframe
    df = pd.DataFrame(
        {"Text": newsgroup_dataset.data, "Label": newsgroup_dataset.target}
    )
    # Clean up the text
    df["Text"] = df["Text"].apply(preprocess_newsgroup_row)
    # Match label to target name index
    df["Class Name"] = df["Label"].map(lambda l: newsgroup_dataset.target_names[l])

    return df
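
To see what the row-level preprocessing does, the sketch below applies the same steps to a single made-up message. The message text and the `example.com` address are invented for illustration; only the standard-library `email` and `re` modules are used.

```python
import email
import re

# A made-up message in the same header/body layout as the dataset records.
raw = (
    "From: someone@example.com (A. Poster)\n"
    "Subject: Test subject\n"
    "Organization: Example Org\n"
    "\n"
    "Contact me at someone@example.com for details.\n"
)

msg = email.message_from_string(raw)
# Keep only the subject and body; every other header is dropped.
text = f"{msg['Subject']}\n\n{msg.get_payload()}"
# Same pattern as preprocess_newsgroup_row: remove anything that looks like an address.
text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)

print(text)
```

The `From` and `Organization` headers disappear because only `Subject` and the payload are kept, and the address inside the body is removed by the regular expression.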

In [7]:
# Apply preprocessing function to training and test datasets
df_train = preprocess_newsgroup_data(newsgroups_train)
df_test = preprocess_newsgroup_data(newsgroups_test)

df_train.head()

Unnamed: 0,Text,Label,Class Name
0,WHAT car is this!?\n\n I was wondering if anyo...,7,rec.autos
1,SI Clock Poll - Final Call\n\nA fair number of...,4,comp.sys.mac.hardware
2,"PB questions...\n\nwell folks, my mac plus fin...",4,comp.sys.mac.hardware
3,Re: Weitek P9000 ?\n\nRobert J.C. Kyanko () wr...,1,comp.graphics
4,Re: Shuttle Launch Question\n\nFrom article <>...,14,sci.space


Next, you will sample some of the data by taking 100 data points per class from the training dataset and 25 per class from the test dataset, and dropping most of the categories to keep this tutorial quick to run. Choose the science categories to compare.

In [8]:
def sample_data(df, num_samples, classes_to_keep):
    # Sample rows, selecting num_samples of each Label.
    df = (
        df.groupby("Label")[df.columns]
        .apply(lambda x: x.sample(num_samples))
        .reset_index(drop=True)
    )

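    # Keep only the matching classes. Copy the result so the column assignments
    # below do not write to a slice of the sampled frame (avoids SettingWithCopyWarning).
    df = df[df["Class Name"].str.contains(classes_to_keep)].copy()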

    # We have fewer categories now, so re-calibrate the label encoding.
    df["Class Name"] = df["Class Name"].astype("category")
    df["Encoded Label"] = df["Class Name"].cat.codes

    return df
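
The label re-encoding step can be pictured with a pure-Python analogue. This is a sketch rather than what pandas does internally; it assumes that pandas orders inferred string categories lexically, so that `cat.codes` assigns 0 to the first name in sorted order.

```python
# Class names that survive the "sci" filter (names taken from the dataset's target_names).
kept = ["sci.space", "sci.crypt", "sci.med", "sci.electronics", "sci.crypt"]

# Sorted unique names get consecutive integer codes starting at 0.
categories = sorted(set(kept))
code_of = {name: code for code, name in enumerate(categories)}
encoded = [code_of[name] for name in kept]

print(categories)
print(encoded)
```

The original `Label` values (11 to 14 for the science groups) are replaced by a compact 0 to 3 range, which is the range the classifier's output layer is indexed by.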

In [9]:
TRAIN_NUM_SAMPLES = 100
TEST_NUM_SAMPLES = 25
CLASSES_TO_KEEP = "sci"  # Class name should contain 'sci' to keep science categories

df_train = sample_data(df_train, TRAIN_NUM_SAMPLES, CLASSES_TO_KEEP)
df_test = sample_data(df_test, TEST_NUM_SAMPLES, CLASSES_TO_KEEP)

In [10]:
df_train.value_counts("Class Name")

Class Name
sci.crypt          100
sci.electronics    100
sci.med            100
sci.space          100
Name: count, dtype: int64

In [11]:
df_test.value_counts("Class Name")

Class Name
sci.crypt          25
sci.electronics    25
sci.med            25
sci.space          25
Name: count, dtype: int64

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Create the embeddings</span>

---
In this section, you will generate embeddings for each piece of text using the Gemini API embeddings endpoint. To learn more about embeddings, visit the embeddings guide.

NOTE: Embeddings are computed one at a time, so large sample sizes can take a long time.

<span style="font-size:16px; color:rgba(0, 0, 0, 0.5);">Task Types</span>

<p>The <code>text-embedding-004</code> model supports a task type parameter that generates embeddings tailored for the specific task.</p>

<div style="text-align: left; display: inline-block;">
<table>
  <thead>
    <tr>
      <th style="text-align: left;">Task Type</th>
      <th style="text-align: left;">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>RETRIEVAL_QUERY</td>
      <td>Specifies the given text is a query in a search/retrieval setting.</td>
    </tr>
    <tr>
      <td>RETRIEVAL_DOCUMENT</td>
      <td>Specifies the given text is a document in a search/retrieval setting.</td>
    </tr>
    <tr>
      <td>SEMANTIC_SIMILARITY</td>
      <td>Specifies the given text will be used for Semantic Textual Similarity (STS).</td>
    </tr>
    <tr>
      <td>CLASSIFICATION</td>
      <td>Specifies that the embeddings will be used for classification.</td>
    </tr>
    <tr>
      <td>CLUSTERING</td>
      <td>Specifies that the embeddings will be used for clustering.</td>
    </tr>
    <tr>
      <td>FACT_VERIFICATION</td>
      <td>Specifies that the given text will be used for fact verification.</td>
    </tr>
  </tbody>
</table>
</div>

<p>For this example, you will be performing classification.</p>


In [12]:
tqdm.pandas()

@retry.Retry(timeout=300.0)
def embed_fn(text: str) -> list[float]:
    # You will be performing classification, so set task_type accordingly.
    response = genai.embed_content(
        model="models/text-embedding-004", content=text, task_type="classification"
    )

    return response["embedding"]


def create_embeddings(df):
    df["Embeddings"] = df["Text"].progress_apply(embed_fn)
    return df

This code is optimised for clarity, and is not particularly fast. It is left as an exercise for the reader to implement batch or parallel/asynchronous embedding generation. Running this step will take some time.
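
As a starting point for that exercise, here is a minimal batching sketch. The helper names `batched`, `embed_in_batches` and `fake_embed_batch` are hypothetical, and the default batch size of 100 is an assumption. In a real run, `embed_batch` would wrap a call to the embeddings endpoint with a list of texts; whether the SDK accepts a list in one call, and what batch-size limit applies, should be checked against the SDK documentation.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # assumed limit, not taken from the tutorial
) -> list[list[float]]:
    """Embed texts batch by batch, preserving the input order."""
    embeddings: list[list[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


# Stand-in embedder so the sketch runs offline: one fake 2-d vector per text.
def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    return [[float(len(t)), 0.0] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

Because the embedder is passed in as a function, the same loop can be exercised with a fake embedder first and then pointed at the real API once the batching behaves as expected.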

In [13]:
df_train = create_embeddings(df_train)
df_test = create_embeddings(df_test)

In [15]:
df_train.head()

Unnamed: 0,Text,Label,Class Name,Encoded Label,Embeddings
1100,"Re: Once tapped, your code is no good any more...",11,sci.crypt,0,"[-0.010045305, 0.014250235, -0.03582837, 0.035..."
1101,"Re: Would ""clipper"" make a good cover for othe...",11,sci.crypt,0,"[-0.009308786, 0.029653585, -0.04822619, 0.029..."
1102,Re: text of White House announcement and Q&As ...,11,sci.crypt,0,"[-0.032723654, 0.033324193, -0.036674835, -0.0..."
1103,"Re: Once tapped, your code is no good any more...",11,sci.crypt,0,"[0.010096876, 0.0068887677, -0.024978861, 0.04..."
1104,Re: Hard drive security for FBI targets\n\n (R...,11,sci.crypt,0,"[-0.010112229, 0.022346703, -0.046342988, 0.02..."


<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Build a classification model</span>

---
Here you will define a simple model that accepts the raw embedding data as input, has one hidden layer, and an output layer specifying the class probabilities. The prediction will correspond to the probability of a piece of text being a particular class of news.

When you run the model, Keras will take care of details like shuffling the data points, calculating metrics and other ML boilerplate.

In [16]:
def build_classification_model(input_size: int, num_classes: int) -> keras.Model:
    return keras.Sequential(
        [
            layers.Input([input_size], name="embedding_inputs"),
            layers.Dense(input_size, activation="relu", name="hidden"),
            layers.Dense(num_classes, activation="softmax", name="output_probs"),
        ]
    )
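
To show what this architecture computes, the following NumPy sketch runs one forward pass through a network of the same shape: a ReLU hidden layer followed by a softmax output. The sizes are toy values and the weights are random stand-ins, so the probabilities are meaningless; the point is only the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, num_classes = 8, 4  # toy sizes; the real input size is the embedding length

# Randomly initialised weights stand in for trained parameters.
w_hidden = rng.normal(size=(input_size, input_size))
b_hidden = np.zeros(input_size)
w_out = rng.normal(size=(input_size, num_classes))
b_out = np.zeros(num_classes)


def forward(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(x @ w_hidden + b_hidden, 0.0)  # dense layer + ReLU
    logits = hidden @ w_out + b_out  # dense layer before softmax
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1


batch = rng.normal(size=(2, input_size))  # a batch of two fake embeddings
probs = forward(batch)
print(probs.shape, probs.sum(axis=-1))
```

Each row of `probs` is a probability distribution over the classes, which is what the `output_probs` layer produces for each input embedding.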

In [17]:
# Derive the embedding size from observing the data. The embedding size can also be specified
# with the `output_dimensionality` parameter to `embed_content` if you need to reduce it.
embedding_size = len(df_train["Embeddings"].iloc[0])

classifier = build_classification_model(
    embedding_size, len(df_train["Class Name"].unique())
)
classifier.summary()

classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"],
)
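
The loss chosen here compares the predicted probabilities with integer labels. As a minimal sketch of the per-example calculation (Keras additionally averages over the batch), the loss is the negative log of the probability assigned to the true class. The probability vectors below are made up for illustration.

```python
import math


def sparse_categorical_crossentropy(probs: list[float], label: int) -> float:
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[label])


uniform = [0.25, 0.25, 0.25, 0.25]  # an untrained model guessing evenly over 4 classes
confident = [0.01, 0.01, 0.01, 0.97]  # most of the probability mass on class 3

print(f"{sparse_categorical_crossentropy(uniform, 3):.4f}")
print(f"{sparse_categorical_crossentropy(confident, 3):.4f}")
print(f"{sparse_categorical_crossentropy(confident, 0):.4f}")
```

An even guess over four classes gives a loss of ln 4 ≈ 1.386, which is close to the loss reported for the first epoch in the training log, before the model has learned much.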

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Train the model</span>

---
Finally, you can train your model. This code uses early stopping to exit the training loop once the training accuracy stops improving, so the number of epochs executed may be lower than the specified value.

In [19]:
NUM_EPOCHS = 20
BATCH_SIZE = 32

# Split the x and y components of the train and validation subsets.
y_train = df_train["Encoded Label"]
x_train = np.stack(df_train["Embeddings"])
y_val = df_test["Encoded Label"]
x_val = np.stack(df_test["Embeddings"])

# Specify that it's OK to stop early if accuracy stabilises.
early_stop = keras.callbacks.EarlyStopping(monitor="accuracy", patience=3)

# Train the model for the desired number of epochs.
history = classifier.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_val, y_val),
    callbacks=[early_stop],
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
)

Epoch 1/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.2586 - loss: 1.3697 - val_accuracy: 0.5200 - val_loss: 1.2926
Epoch 2/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8197 - loss: 1.2031 - val_accuracy: 0.7900 - val_loss: 1.1586
Epoch 3/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9162 - loss: 1.0298 - val_accuracy: 0.8000 - val_loss: 1.0211
Epoch 4/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.8255 - loss: 0.8392 - val_accuracy: 0.8000 - val_loss: 0.8901
Epoch 5/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9443 - loss: 0.6517 - val_accuracy: 0.8500 - val_loss: 0.7464
Epoch 6/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.9818 - loss: 0.4644 - val_accuracy: 0.8600 - val_loss: 0.6446
Epoch 7/20
13/13 ━━━━━━━━━
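
The patience-based stopping rule used by the training cell can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it assumes a metric where higher is better, ignores options such as a minimum delta, and uses a made-up accuracy curve.

```python
def epochs_run(metric_per_epoch: list[float], patience: int) -> int:
    """Return how many epochs run before a patience-based stop (higher metric is better)."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # training stops after this epoch
    return len(metric_per_epoch)  # never triggered: every epoch ran


# Hypothetical accuracy curve that plateaus after the fourth epoch.
accuracies = [0.30, 0.80, 0.92, 0.98, 0.98, 0.97, 0.98, 0.99]
print(epochs_run(accuracies, patience=3))
```

With this curve, the best value is reached at epoch 4 and the next three epochs do not beat it, so the run stops after epoch 7 and the later improvement at epoch 8 is never seen. A larger `patience` trades longer training for tolerance of such plateaus.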

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Evaluate model performance</span>

---
Use Keras Model.evaluate to calculate the loss and accuracy on the test dataset.

In [20]:
classifier.evaluate(x=x_val, y=y_val, return_dict=True)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8433 - loss: 0.4105


{'accuracy': 0.8399999737739563, 'loss': 0.4300147294998169}

To learn more about training models with Keras, including how to visualise the model training metrics, read <u>[Training & evaluation with built-in methods](https://www.tensorflow.org/guide/keras/training_with_built_in_methods)</u>.

<span style="font-size:18px; color:rgba(0, 0, 0, 0.5);">Try a custom prediction</span>

---
Now that you have a trained model with good evaluation metrics, you can try to make a prediction with new, hand-written data. Use the provided example or try your own data to see how the model performs.

In [21]:
# This example avoids any space-specific terminology to see if the model avoids
# biases towards specific jargon.
new_text = """
First-timer looking to get out of here.

Hi, I'm writing about my interest in travelling to the outer limits!

What kind of craft can I buy? What is easiest to access from this 3rd rock?

Let me know how to do that please.
"""
embedded = embed_fn(new_text)

In [22]:
# Remember that the model takes embeddings as input, and the input must be batched,
# so here they are passed as a list to provide a batch of 1.
inp = np.array([embedded])
[result] = classifier.predict(inp)

for idx, category in enumerate(df_test["Class Name"].cat.categories):
    print(f"{category}: {result[idx] * 100:0.2f}%")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
sci.crypt: 0.62%
sci.electronics: 1.50%
sci.med: 0.21%
sci.space: 97.68%
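
If you want a single predicted label rather than the full distribution, take the class with the largest probability. The sketch below copies the probabilities from the output above into a plain list; with the NumPy array returned by `classifier.predict`, an argmax over the array would serve the same purpose.

```python
# Probabilities copied from the prediction output above, written as fractions.
categories = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]
result = [0.0062, 0.0150, 0.0021, 0.9768]

# Index of the largest probability, then look up its class name.
best_idx = max(range(len(result)), key=result.__getitem__)
print(f"Predicted class: {categories[best_idx]} ({result[best_idx]:.2%})")
```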
