# `searchlite` API Embedding Model Demo Notebook v2.0 
This notebook contains code walking through how to use `searchlite` with an embedding model accessed via API.

This demo notebook uses the Google Gemini API, which requires the `google-genai` Python package. It also uses `python-dotenv` to load the API key from a .env file. Before running this notebook, **make sure you've pip installed both searchlite and the Python package for your API**. In this case, you would run:

```bash
pip install searchlite google-genai python-dotenv
```

In this notebook we'll load a sample text dataset with some metadata, split the dataframe into the text and its metadata, load both into `searchlite`, and run and display a semantic search.

First, import the dependencies needed to load the data. For this example, you'll only need pandas (to load the example data) and os (to build the file path to the example data).

In [1]:
import pandas as pd
import os

## Import and look at data

Next, define the path to the sample data, which lives in the data folder. Then use pandas to load the CSV file as a dataframe.

In [2]:
sample_df = pd.read_csv(
   os.path.join(os.getcwd(), '../data/synthetic_data.csv'),
   index_col=0
)

Let's take a look at the sample data below. It consists of 15 distinct pieces of text with corresponding id and category values. The topics differ widely, so you can test the semantic search with different queries to see whether the results make sense.

In [3]:
sample_df

|   | id | category | text |
|---|----|----------|------|
| 0 | 1 | Product Description | Experience unparalleled sound quality with the... |
| 1 | 2 | Movie Synopsis | In a world ravaged by climate change, a group ... |
| 2 | 3 | News Article | The city council approved the new public trans... |
| 3 | 4 | Recipe | Preheat the oven to 375°F. Mix flour, sugar, a... |
| 4 | 5 | Travel Guide | Discover the hidden gems of Kyoto, from tranqu... |
| 5 | 6 | Scientific Abstract | This study investigates the effects of micropl... |
| 6 | 7 | Book Review | An evocative tale of love and loss, 'The Silen... |
| 7 | 8 | Job Posting | Looking for a skilled software engineer profic... |
| 8 | 9 | User Manual | To reset your device, hold the power button fo... |
| 9 | 10 | Historical Event | The Berlin Wall, constructed in 1961, symboliz... |


Before initializing the `Document` class, you need to split the dataframe into the text you want to embed and its corresponding metadata (shown below). To do this, isolate the text column and use the .to_dict() method to convert the metadata columns into a list of dictionaries, with each entry corresponding to a row in the dataframe.

In [4]:
sample_texts = sample_df["text"]
sample_metadata = sample_df[["id", "category"]].to_dict(orient = "records")

In [5]:
sample_texts[0:3]

0    Experience unparalleled sound quality with the...
1    In a world ravaged by climate change, a group ...
2    The city council approved the new public trans...
Name: text, dtype: object

In [6]:
sample_metadata[0:3]

[{'id': 1, 'category': 'Product Description'},
 {'id': 2, 'category': 'Movie Synopsis'},
 {'id': 3, 'category': 'News Article'}]

## Use searchlite to embed text and run semantic search

With the text and metadata separated, the next step is to set up an embedder. The `Document` class stores both the text and the metadata as attributes, but before performing a search you must generate embeddings for the texts stored within the `Document` instance.


We'll be using the Google Gemini API for this demo. Before writing any code, **make sure you've pip installed the appropriate libraries to access your API**.

```bash
pip install google-genai
```

The `ApiEmbedder` instance will automatically check if your embed_func() is structured properly. 

To run an ApiEmbedder, you'll need to import `Document` and `ApiEmbedder` from `searchlite`.
        

In [7]:
from searchlite.document import Document
from searchlite.embedders.api import ApiEmbedder   

Before creating your document, you have to instantiate your `ApiEmbedder`. Unlike the other embedder classes, the `ApiEmbedder` class requires a bit more upfront work to integrate into searchlite. 

First, read through your API's documentation to see how to request embeddings for a single string and for a list of strings. Then write an embedding function that takes in a string or list of strings, calls your embedding API, and returns a numpy array of embeddings. **Your embedding function MUST return a numpy array for BOTH an individual string AND a list of strings**.

The `ApiEmbedder` instance will check that your embedding function adheres to these output guidelines and will raise an error if it does not.

Next, load your API key(s). Remember to store your API keys in a safe place and **never commit your .env files to GitHub!** To initialize the Google genai client, I import load_dotenv (to read environment variables from a .env file), genai (for the Gemini client), and types (to optimize the embeddings for semantic search), then load my environment variables and instantiate a client with my API key.

In [None]:
from dotenv import load_dotenv
from google import genai
from google.genai import types

In [9]:
load_dotenv()
client = genai.Client(api_key = os.getenv("GOOGLE_API_KEY"))
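
If `GOOGLE_API_KEY` is not set, `os.getenv` returns `None`, and the problem may only surface later as a less obvious error. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of `searchlite`, `python-dotenv`, or the Google SDK):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    `require_env` is a hypothetical helper written for this notebook,
    not a function provided by any of the libraries used here.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add it to your .env file or export it in your shell."
        )
    return value
```

You could then pass `require_env("GOOGLE_API_KEY")` as the `api_key` argument when creating the client, so a missing key fails immediately with a readable message.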

Now, you need to define your embedding function. This function will get passed into the `ApiEmbedder` instance and will allow your API embedding workflow to work with the `searchlite` workflow. **Your embedding function MUST return a numpy array for both an individual string and a list of strings**.

Here's an example of an incorrect embedding function, which returns a list instead of a numpy array, followed by the error it produces.

In [20]:
def wrong_embed_func(texts):
    return list(texts)

In [21]:
wrong_embedder = ApiEmbedder(client = "Google", embed_func = wrong_embed_func)

2025-07-16 19:38:28,052 - INFO - Validating embed_func...


TypeError: Provided embed_func is not valid: embed_func must return a numpy ndarray when embedding a string or list of strings

As you can see from the above output, the `ApiEmbedder` class will test your API embedder on a string and a list of strings before fully instantiating.
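
To make that check concrete, here is a rough sketch of what such a validation might look like. The names `check_embed_func` and `fake_embed_func` are hypothetical, and this is an assumption about the general idea rather than `searchlite`'s actual implementation, which may test more than the return type.

```python
import numpy as np


def check_embed_func(embed_func) -> None:
    """Call embed_func on a string and on a list; raise TypeError on a non-array result.

    Hypothetical stand-in for the kind of check ApiEmbedder runs;
    the library's real validation may differ.
    """
    for sample in ("test", ["test one", "test two"]):
        result = embed_func(sample)
        if not isinstance(result, np.ndarray):
            raise TypeError(
                "embed_func must return a numpy ndarray, "
                f"got {type(result).__name__} for input of type {type(sample).__name__}"
            )


def fake_embed_func(texts):
    """Offline stand-in that returns one 4-dimensional zero vector per input text."""
    n_texts = 1 if isinstance(texts, str) else len(texts)
    return np.zeros((n_texts, 4))


check_embed_func(fake_embed_func)  # passes silently
```

Running a check like this on your own function before handing it to `ApiEmbedder` lets you debug the return type without waiting on the library's validation message.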

Below is a working implementation of an embedding function for the Gemini API. Be careful to confirm that the output arrays have the right shape. If they do not, doc.query() will throw an error. Note that I had to call .reshape(1,-1) so that the embedding of a single string comes back as a 2-D array with one row. This function took around 5 minutes to write, mostly by pulling code directly from the [Gemini docs](https://ai.google.dev/gemini-api/docs/embeddings).

I import numpy to convert the Gemini outputs to numpy arrays and typing to add type hints.

In [None]:
from typing import Union, List
import numpy as np

def gemini_embed_func(texts: Union[List[str], str]) -> np.ndarray:
    if isinstance(texts, str):
        # Single string: the response holds exactly one embedding.
        embedding = client.models.embed_content(
            model="gemini-embedding-001",
            contents=texts,
            config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
        ).embeddings[0].values

        # Reshape the flat vector to (1, embedding_dim) so it is 2-D like the list case.
        return np.array(embedding).reshape(1, -1)
    else:
        # List of strings: the response holds one embedding per input text.
        embedding_list = [
            np.array(e.values) for e in client.models.embed_content(
                model="gemini-embedding-001",
                contents=texts,
                config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
            ).embeddings
        ]

        # Stack the vectors into an array of shape (n_texts, embedding_dim).
        return np.array(embedding_list)
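
The `.reshape(1, -1)` call matters because a single embedding arrives as a flat vector, whereas a batch of embeddings is naturally two-dimensional. The NumPy-only sketch below uses made-up values rather than real embeddings to show the difference in shape:

```python
import numpy as np

# Made-up 3-dimensional "embedding" for one string, as a flat list of floats.
single = [0.1, 0.2, 0.3]

flat = np.array(single)                # shape (3,): one axis, no row dimension
row = np.array(single).reshape(1, -1)  # shape (1, 3): one row, three columns

# Made-up embeddings for two strings: already one row per text.
batch = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])  # shape (2, 3)

print(flat.shape, row.shape, batch.shape)  # prints: (3,) (1, 3) (2, 3)
```

When both branches return an array of shape `(n_texts, embedding_dim)`, code that consumes the embeddings can treat a single query and a batch of texts the same way.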

Once you've defined your embedding function properly, you can proceed through the rest of the `searchlite` workflow as you would with any other embedder! First, instantiate your `ApiEmbedder` with the embed_func. 

In [11]:
embedder = ApiEmbedder(client = "Google Gemini [gemini-embedding-001]", embed_func = gemini_embed_func)

2025-07-16 19:36:09,980 - INFO - Validating embed_func...
2025-07-16 19:36:10,405 - INFO - embed_func validated


In [12]:
embedder

Api Embedder Object with client: Google Gemini [gemini-embedding-001].

Now, you can initialize the `Document` class. As shown below, both the text and metadata are saved as attributes. Before performing a search, you must generate embeddings for the texts stored within the `Document` instance.

Be sure to assign your instantiated embedder to the embedder attribute of your document. If you don't, `searchlite` will automatically assign the `SkTFIDFEmbedder` as the embedding model for the document.
            

In [13]:
doc = Document(texts = sample_texts, metadata = sample_metadata, embedder = embedder)

In [14]:
doc

Document instance with 15 texts. Metadata contains the following fields: id, category. Embeddings: Not Ready.
Embedder:Api Embedder Object with client: Google Gemini [gemini-embedding-001].

Run the .embed() method to generate embeddings for your texts with the `ApiEmbedder` you passed to the document. If you want to use a different source for your embedding model, check out the other example notebooks to see how to initialize an embedder and pass it to your `Document`.

In [15]:
doc.embed()

After generating your text embeddings, you can run semantic search on your text corpus with the .query() method. Your query will be embedded into a vector and compared against your text corpus using cosine similarity. By default, .query() returns the top 3 matches, but you can change this with the **top_k** parameter. As you can see from the cell below, .query() returns a list of dictionaries, each containing the metadata, text, and similarity score of an identified match.

In [16]:
res = doc.query(query_text = 'wireless earbuds with good battery life')
res

[{'id': 1,
  'category': 'Product Description',
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.',
  'similarity score': 0.8904592105832237},
 {'id': 12,
  'category': 'Health & Fitness',
  'text': 'Regular cardio workouts not only improve heart health but also boost mental clarity and reduce stress levels.',
  'similarity score': 0.7685993907097357},
 {'id': 14,
  'category': 'E-commerce FAQ',
  'text': 'Q: Does this jacket have waterproof capabilities? A: Yes, it is made with breathable waterproof fabric suitable for heavy rain.',
  'similarity score': 0.7622437181467848}]
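
To illustrate what ranking by cosine similarity involves, here is a minimal NumPy sketch. The function name `top_k_cosine` and the toy vectors are illustrative assumptions. This is not `searchlite`'s implementation, and it assumes no vector has zero length.

```python
import numpy as np


def top_k_cosine(query_vec, corpus_vecs, top_k=3):
    """Return (row_index, score) pairs for the top_k corpus rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float).reshape(-1)
    corpus = np.asarray(corpus_vecs, dtype=float)

    # Cosine similarity = dot product divided by the product of the vector lengths.
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

    # Sort ascending, reverse to descending, and keep the first top_k indices.
    best = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in best]


# Toy 2-dimensional vectors standing in for real embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_cosine([1.0, 0.0], corpus, top_k=2))
```

Here the query points in the same direction as the first corpus vector, so that row ranks first with a score of 1.0, followed by the diagonal vector at roughly 0.71.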

The `Document` class offers three styles for displaying your semantic search results in the terminal: f-string, pprint, and tabulate.

- "f-string" outputs a custom f-string (defined in document.py)

- "pprint" leverages the pprint package to display a list of dictionaries of the top k results

- "tabulate" leverages the tabulate library to display a table of the top k results.
    

In [17]:
doc.display_results(output_list_dicts = res, style = 'f-string')

Result 1:
    id: 1
    category: Product Description
    text: Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.
    similarity score: 0.8904592105832237

Result 2:
    id: 12
    category: Health & Fitness
    text: Regular cardio workouts not only improve heart health but also boost mental clarity and reduce stress levels.
    similarity score: 0.7685993907097357

Result 3:
    id: 14
    category: E-commerce FAQ
    text: Q: Does this jacket have waterproof capabilities? A: Yes, it is made with breathable waterproof fabric suitable for heavy rain.
    similarity score: 0.7622437181467848
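
For readers curious how output in this style can be built from the list of dictionaries, here is a small sketch. The function `format_results` is hypothetical and written only to mimic the layout above. The real formatting lives in document.py and may differ.

```python
def format_results(results):
    """Render a list of result dictionaries as numbered, indented blocks of text."""
    blocks = []
    for rank, result in enumerate(results, start=1):
        lines = [f"Result {rank}:"]
        # One indented "key: value" line per field, in the dictionary's own order.
        lines += [f"    {key}: {value}" for key, value in result.items()]
        blocks.append("\n".join(lines))
    # Separate consecutive results with a blank line.
    return "\n\n".join(blocks)


# Made-up results with the same keys that .query() returns.
demo = [
    {"id": 1, "category": "A", "text": "first text", "similarity score": 0.9},
    {"id": 2, "category": "B", "text": "second text", "similarity score": 0.5},
]
print(format_results(demo))
```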



In [18]:
doc.display_results(output_list_dicts = res, style = 'pprint')

[{'category': 'Product Description',
  'id': 1,
  'similarity score': 0.8904592105832237,
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless '
          'earbuds, featuring noise cancellation, 12-hour battery life, and an '
          'ergonomic design perfect for workouts.'},
 {'category': 'Health & Fitness',
  'id': 12,
  'similarity score': 0.7685993907097357,
  'text': 'Regular cardio workouts not only improve heart health but also '
          'boost mental clarity and reduce stress levels.'},
 {'category': 'E-commerce FAQ',
  'id': 14,
  'similarity score': 0.7622437181467848,
  'text': 'Q: Does this jacket have waterproof capabilities? A: Yes, it is '
          'made with breathable waterproof fabric suitable for heavy rain.'}]


In [19]:
doc.display_results(output_list_dicts = res, style = 'tabulate')

+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|   id | category            | text                                                                                                                                                                          |   similarity score |
|    1 | Product Description | Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts. |           0.890459 |
+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|   12 | Health & Fitness    | Regular cardio workouts not only improve heart health but