# `searchlite` API Embedding Model Demo Notebook v2.0 
This notebook contains code walking through how to use `searchlite` with an embedding model accessed via API.

This demo notebook uses the Google Gemini API and requires the `google` Python package. Before running this notebook, **make sure you've installed both `searchlite` and the Python package for your API**. In this case, you would run:

```bash 
pip install searchlite google
```

In this notebook, we'll load a sample text dataset with some metadata, split the dataframe into the text and its metadata, load both into `searchlite`, and run and display a semantic search.

First, import the dependencies needed to load the data. For this example, you'll only need `pandas` (to load the example data) and `os` (to define the file path to the example data).

In [None]:
import pandas as pd
import os

## Import and look at data

Next, define the path to the sample data, which lives in the data folder. After defining the path, use pandas to load the CSV file as a dataframe.

In [None]:
sample_df = pd.read_csv(
    os.path.join(os.getcwd(), "..", "data", "synthetic_data.csv"),
    index_col=0,  # use the first CSV column as the row index
)

Let's take a look at our sample data below. The data consists of 15 distinct pieces of text with corresponding id and category values. The topics are quite different from one another, so you can test the semantic search with different queries to see if the results make sense.

In [None]:
sample_df

Before initializing the `Document` class, you need to split the dataframe into the text you want to embed and its corresponding metadata (shown below). You can do this by isolating the text column and using the `.to_dict()` method to convert the metadata columns into a list of dictionaries, with each entry corresponding to a row in the dataframe.

In [None]:
sample_texts = sample_df["text"]
sample_metadata = sample_df[["id", "category"]].to_dict(orient = "records")

In [None]:
sample_texts[0:3]

In [None]:
sample_metadata[0:3]
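
To make the split concrete, here is a minimal sketch on a tiny stand-in dataframe. The three rows and their values are made up for illustration and are not taken from the sample CSV; only the column names (`id`, `category`, `text`) follow the cells above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the sample data (values are made up).
toy_df = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "category": ["audio", "kitchen", "travel"],
        "text": [
            "Wireless earbuds with a charging case.",
            "A cast-iron skillet for searing.",
            "A carry-on suitcase with four wheels.",
        ],
    }
)

# The text column becomes the corpus to embed.
toy_texts = toy_df["text"]

# orient="records" yields one dict per row, keyed by column name.
toy_metadata = toy_df[["id", "category"]].to_dict(orient="records")

print(toy_metadata[0])
```

Because both objects are built from the same dataframe, the text at position `i` and the dictionary at position `i` describe the same row.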

## Use searchlite to embed text and run semantic search

We'll be using the Google Gemini API for this demo. Before writing any code, **make sure you've installed the appropriate libraries to access your API**.

```bash
pip install google
```

The `ApiEmbedder` instance automatically checks whether your `embed_func()` is structured properly.

To use an `ApiEmbedder`, you'll need to import `Document` and `ApiEmbedder` from `searchlite`.
        

In [None]:
from searchlite.document import Document
from searchlite.embedders.api import ApiEmbedder   

Before creating your document, you have to instantiate your `ApiEmbedder`. Unlike the other embedder classes, the `ApiEmbedder` class requires a bit more upfront work to integrate into searchlite. 

First, read through your API's documentation to see how to extract embeddings for a single string and for a list of strings. Then write an embedding function that takes in a string or a list of strings, calls your embedding API, and returns a numpy array of embeddings. **Your embedding function MUST return a numpy array for BOTH an individual string AND a list of strings**.

The `ApiEmbedder` instance will check that your embedding function adheres to these output guidelines and will raise an error if it does not.
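
As a rough sketch of what such a function can look like, the example below wraps a stand-in backend so that both input types come back as a numpy array. Several things here are assumptions: `fake_api_embed` is a hypothetical placeholder for your provider's real batch-embedding call, and the output shapes (a 1-D array for one string, a 2-D array for a list) are a guess at what `ApiEmbedder` expects, so check them against the validation error if one is raised.

```python
import hashlib

import numpy as np


def fake_api_embed(texts, dim=8):
    """Hypothetical stand-in for a real embedding API call.

    Returns one list of floats per input string. Swap this out for
    the batch-embedding call documented by your provider.
    """
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map the first `dim` bytes of the hash to floats in [0, 1].
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


def embed_func(texts):
    """Embed one string or a list of strings; always return a numpy array."""
    single = isinstance(texts, str)
    # Normalize the input so the backend always receives a list.
    batch = [texts] if single else list(texts)
    embeddings = np.asarray(fake_api_embed(batch), dtype=float)
    # Assumed shapes: one string -> (dim,); n strings -> (n, dim).
    return embeddings[0] if single else embeddings


print(embed_func("hello").shape)
print(embed_func(["hello", "world"]).shape)
```

Normalizing the input to a list first means the API is called the same way in both cases, and only the final return shape differs.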

In [None]:
# embed_func: the embedding function you wrote for your API (keyword name assumed from the text above)
embedder = ApiEmbedder(embed_func = embed_func)

In [None]:
embedder

Now you can initialize the `Document` class. As shown below, both the text and metadata are saved as attributes. Before performing a search, you must generate embeddings for the texts stored within the `Document` instance.

Be sure to assign your instantiated embedder to the embedder attribute of your document. If you don't, `searchlite` will automatically assign the `SkTFIDFEmbedder` as the embedding model for the document.
            

In [None]:
doc = Document(texts = sample_texts, metadata = sample_metadata, embedder = embedder)

In [None]:
doc

Run the `.embed()` method to generate embeddings for your texts with the embedder you assigned to the `Document` (here, the `ApiEmbedder`). If you want to use a different source for your embedding model, check out the other example notebooks to see how to initialize an embedder and pass it to your `Document`.

In [None]:
doc.embed()

After generating your text embeddings, you can run semantic search on your text corpus with the `.query()` method. Your query is embedded into a vector and compared against your text corpus using cosine similarity. By default, `.query()` returns the top 3 matches, but you can change this with the `top_k` parameter. As you can see from the cell below, `.query()` returns a list of dictionaries, each containing the metadata and text of a matched entry.

In [None]:
res = doc.query(query_text = 'wireless earbuds with good battery life')
res          
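
To illustrate the ranking step described above, here is a minimal sketch of cosine-similarity top-k search with numpy. It is not `searchlite`'s implementation; the function name `top_k_cosine` and the toy 2-D vectors are made up for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus rows."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so sort the negated scores to get highest first.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D "embeddings" (made up for illustration).
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
matches = top_k_cosine([1.0, 0.1], corpus_vecs, top_k=2)
print(matches)
```

With this toy data the query points almost exactly along the first corpus vector, so row 0 ranks first and the diagonal vector in row 2 ranks second.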

The `Document` class has three options to nicely display the results of your semantic search in the terminal: f-string, pprint, and tabulate.

- "f-string" outputs a custom f-string (defined in document.py)

- "pprint" leverages the pprint package to display a list of dictionaries of the top k results

- "tabulate" leverages the tabulate library to display a table of the top k results.
    

In [None]:
doc.display_results(output_list_dict = res, style = 'f-string')

In [None]:
doc.display_results(output_list_dict = res, style = 'pprint')

In [None]:
doc.display_results(output_list_dict = res, style = 'tabulate')
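
For a rough sense of how the f-string and pprint styles differ, the sketch below formats a hypothetical result list with the standard library only. The result keys and values are assumed for illustration and may not match what `.query()` returns, and `format_result` is a made-up helper, not the f-string defined in document.py.

```python
from pprint import pprint

# Hypothetical results; the keys are assumed for illustration and
# may not match what .query() actually returns.
results = [
    {"id": 101, "category": "audio", "text": "Wireless earbuds with a charging case."},
    {"id": 205, "category": "audio", "text": "Over-ear headphones with noise cancelling."},
]


def format_result(rank, result):
    """Render one result as a single line using an f-string."""
    return f"{rank}. [{result['category']}] {result['text']} (id={result['id']})"


# f-string style: one formatted line per match.
lines = [format_result(rank, r) for rank, r in enumerate(results, start=1)]
print("\n".join(lines))

# pprint style: the raw list of dictionaries, pretty-printed.
pprint(results)
```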