# `searchlite` Sentence Transformers Demo Notebook v2.0 
This notebook walks through how to use `searchlite` with an embedding model from the `sentence-transformers` library. Before running it, **make sure you have installed the optional dependency** with pip:

```bash
pip install searchlite[sentence_transformers]
```

In this notebook, we'll load a sample text data set with some metadata, split the dataframe into the text and its metadata, load it into `searchlite`, and perform/display a semantic search.

First, import your dependencies to load your data. For this example, you'll only need pandas (for loading in our example data) and os (for defining the file path to our example data).

In [None]:
import pandas as pd
import os

## Import and look at data

Next, define the path to the sample data, which lives in the data folder. Then use pandas to load the CSV file as a DataFrame.

In [2]:
sample_df = pd.read_csv(
    os.path.join(
        os.getcwd(), "../data/synthetic_data.csv"), 
    index_col = 0)
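
The relative path above is resolved against the current working directory, so the cell only finds the file when the notebook is launched from its own folder. As a hedged alternative, here is a minimal sketch that builds the same location with `pathlib`. The `data_path` name is introduced here for illustration, and the folder layout is assumed to match the one described above.

```python
from pathlib import Path

# Assumed layout: the working directory sits next to a sibling "data" folder.
data_path = Path.cwd().parent / "data" / "synthetic_data.csv"

# Report a readable message early instead of waiting for a pandas traceback.
if not data_path.exists():
    print(f"Sample data not found at {data_path}")
```

`pd.read_csv(data_path, index_col=0)` accepts the resulting `Path` object directly.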

Let's take a look at our sample data below. The data consists of 15 distinct pieces of text with corresponding id and category values. The topics differ widely, so you can test the semantic search with different queries and check whether the results make sense.

In [3]:
sample_df

|   | id | category | text |
|---|---|---|---|
| 0 | 1 | Product Description | Experience unparalleled sound quality with the... |
| 1 | 2 | Movie Synopsis | In a world ravaged by climate change, a group ... |
| 2 | 3 | News Article | The city council approved the new public trans... |
| 3 | 4 | Recipe | Preheat the oven to 375°F. Mix flour, sugar, a... |
| 4 | 5 | Travel Guide | Discover the hidden gems of Kyoto, from tranqu... |
| 5 | 6 | Scientific Abstract | This study investigates the effects of micropl... |
| 6 | 7 | Book Review | An evocative tale of love and loss, 'The Silen... |
| 7 | 8 | Job Posting | Looking for a skilled software engineer profic... |
| 8 | 9 | User Manual | To reset your device, hold the power button fo... |
| 9 | 10 | Historical Event | The Berlin Wall, constructed in 1961, symboliz... |


Before initializing the `Document` class, you need to split the dataframe into the text you want to embed and its corresponding metadata (shown below). To do this, isolate the text column and use the `.to_dict()` method to convert the metadata columns into a list of dictionaries, with each entry corresponding to a row in the dataframe.

In [4]:
sample_texts = sample_df["text"]
sample_metadata = sample_df[["id", "category"]].to_dict(orient = "records")

In [5]:
sample_texts[0:3]

0    Experience unparalleled sound quality with the...
1    In a world ravaged by climate change, a group ...
2    The city council approved the new public trans...
Name: text, dtype: object

In [6]:
sample_metadata[0:3]

[{'id': 1, 'category': 'Product Description'},
 {'id': 2, 'category': 'Movie Synopsis'},
 {'id': 3, 'category': 'News Article'}]
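
To make the shape of both objects concrete, here is a small self-contained sketch of the same split on a two-row frame. The `toy_df` frame and its values are made up for illustration; only the column names follow the sample data.

```python
import pandas as pd

# Toy frame with the same columns as the sample data (values are illustrative).
toy_df = pd.DataFrame(
    {
        "id": [1, 2],
        "category": ["Recipe", "News Article"],
        "text": ["Mix flour and sugar.", "The council approved the plan."],
    }
)

toy_texts = toy_df["text"]  # pandas Series, one entry per row
toy_metadata = toy_df[["id", "category"]].to_dict(orient="records")  # list of dicts

# Entry i of the texts lines up with entry i of the metadata.
for text, meta in zip(toy_texts, toy_metadata):
    print(meta["id"], meta["category"], "->", text)
```

Both objects are derived from the same rows in the same order, which keeps each text paired with its metadata entry by position.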

## Use searchlite to embed text and run semantic search

Now, you can use `searchlite` to embed and query your text. This notebook specifically demos the embedding models from the `sentence-transformers` library. The first thing you need to do is import `Document` and `SentenceTransformerEmbedder` from `searchlite`.

In [17]:
from searchlite.document import Document
from searchlite.embedders.sentence_transformer import SentenceTransformerEmbedder

Before creating your document, you have to instantiate your `SentenceTransformerEmbedder`. You must pass the name of the embedding model you want to use via the `model_name` parameter. Upon instantiation, `searchlite` loads the model into an attribute of the class.

In [8]:
embedder = SentenceTransformerEmbedder(model_name = "all-MiniLM-L6-v2")


You can see below that the `SentenceTransformerEmbedder` class is a thin wrapper around the `SentenceTransformer` class.

In [9]:
embedder

This embedder is a SentenceTransformer instance in a wrapper.
Sentence Transformer __repr__: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
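
The printed modules describe a three-stage pipeline: a transformer produces one vector per token, the pooling layer averages them (`pooling_mode_mean_tokens` is `True`), and `Normalize()` scales the result to unit length. A minimal pure-Python sketch of the last two stages follows. The helper names and the 3-dimensional toy vectors are assumptions for illustration (the model above uses 384 dimensions), padding is ignored, and this is not the library's implementation.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n_tokens = len(token_vectors)
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_vectors)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Three toy "token" vectors standing in for transformer outputs.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [2.0, 0.0, 2.0]]
sentence_vector = l2_normalize(mean_pool(tokens))
print(sentence_vector)
```

Because the vectors are normalized to unit length, the cosine similarity between two embeddings equals their dot product.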

Now, you can initialize the `Document` class. As shown below, both the text and metadata are saved as attributes. Before performing search, you must generate embeddings for the texts stored within the `Document` instance. 

Be sure to pass your instantiated embedder to the `embedder` parameter of your document. If you don't, `searchlite` will automatically assign the `SkTFIDFEmbedder` as the embedding model for the document.

In [10]:
doc = Document(texts = sample_texts, metadata = sample_metadata, embedder = embedder)

In [11]:
doc

Document instance with 15 texts. Metadata contains the following fields: id, category. Embeddings: Not Ready.

Run the `.embed()` method to generate embeddings with the `SentenceTransformer` model. If you want to use a different source for your embedding model, check out the other example notebooks to see how to initialize different embedders and pass them to your document.

In [12]:
doc.embed()

After generating your text embeddings, you can run semantic search on your text corpus with the `.query()` method. Your query is embedded into a vector and compared against each text in the corpus using cosine similarity. By default, `.query()` returns the top 3 matches, but you can change this with the `top_k` parameter.
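
The ranking step can be pictured with a short pure-Python sketch: score every corpus vector against the query vector with cosine similarity, sort by score, and keep the best `top_k`. The function names and the toy 2-dimensional vectors are hypothetical, and the sketch does not reflect how `searchlite` implements `.query()` internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_top_k(query_vector, corpus_vectors, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(corpus_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy corpus and query; real embeddings have many more dimensions.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
for index, score in rank_top_k(query, corpus, top_k=2):
    print(index, round(score, 3))
```

The returned indices are what allow each score to be joined back to the matching text and metadata entry.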

As you can see from the cell below, .query() returns a list of dictionaries with each dictionary containing the metadata and text of the identified matches.

In [13]:
res = doc.query(query_text = "wireless earbuds with good battery life")
res

[{'id': 1,
  'category': 'Product Description',
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.',
  'similarity score': 0.68724173},
 {'id': 5,
  'category': 'Travel Guide',
  'text': 'Discover the hidden gems of Kyoto, from tranquil temples to bustling markets, and experience authentic Japanese culture like never before.',
  'similarity score': 0.09249082},
 {'id': 4,
  'category': 'Recipe',
  'text': 'Preheat the oven to 375°F. Mix flour, sugar, and eggs in a bowl, then fold in fresh blueberries. Bake for 25 minutes or until golden brown.',
  'similarity score': 0.074016266}]
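
In the output above, only the first match is a strong one; the second and third both score below 0.1. Because `.query()` returns plain dictionaries, weak matches can be dropped with a list comprehension. The sketch below uses a shortened copy of the results (the `text` field is omitted) and a cutoff of 0.3, which is an arbitrary value chosen for illustration, not a library default.

```python
# Shortened copy of the results shown above (text field omitted for brevity).
results = [
    {"id": 1, "category": "Product Description", "similarity score": 0.68724173},
    {"id": 5, "category": "Travel Guide", "similarity score": 0.09249082},
    {"id": 4, "category": "Recipe", "similarity score": 0.074016266},
]

MIN_SCORE = 0.3  # assumed cutoff; tune it for your own corpus and model

# Keep only the matches whose score clears the cutoff.
strong_matches = [r for r in results if r["similarity score"] >= MIN_SCORE]
print([r["id"] for r in strong_matches])
```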

The `Document` class has three options to nicely display the results of your semantic search in the terminal: f-string, pprint, and tabulate.

- "f-string" outputs a custom f-string (defined in `document.py`).

- "pprint" leverages the pprint package to display a list of dictionaries of the top k results.

- "tabulate" leverages the tabulate library to display a table of the top k results.
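
For orientation, the first two styles can be approximated with the standard library alone. This is a rough sketch of the idea, not the code in document.py; `format_results` is a hypothetical helper, and the sample result is shortened.

```python
from pprint import pformat

def format_results(results):
    """Render each result as an indented 'Result N:' block of key/value lines."""
    blocks = []
    for rank, result in enumerate(results, start=1):
        lines = [f"Result {rank}:"]
        lines.extend(f"    {key}: {value}" for key, value in result.items())
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

# Shortened stand-in for one entry of the query results.
sample = [{"id": 1, "category": "Product Description", "similarity score": 0.687}]

print(format_results(sample))  # keys appear in insertion order
print(pformat(sample))         # pprint sorts dictionary keys alphabetically by default
```

The key sorting explains why `category` precedes `id` in the pprint output, while the other two styles keep the original field order.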

In [14]:
doc.display_results(output_list_dicts = res, style = "f-string")

Result 1:
    id: 1
    category: Product Description
    text: Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.
    similarity score: 0.6872417330741882

Result 2:
    id: 5
    category: Travel Guide
    text: Discover the hidden gems of Kyoto, from tranquil temples to bustling markets, and experience authentic Japanese culture like never before.
    similarity score: 0.09249082207679749

Result 3:
    id: 4
    category: Recipe
    text: Preheat the oven to 375°F. Mix flour, sugar, and eggs in a bowl, then fold in fresh blueberries. Bake for 25 minutes or until golden brown.
    similarity score: 0.0740162655711174



In [15]:
doc.display_results(output_list_dicts = res, style = "pprint")

[{'category': 'Product Description',
  'id': 1,
  'similarity score': 0.68724173,
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless '
          'earbuds, featuring noise cancellation, 12-hour battery life, and an '
          'ergonomic design perfect for workouts.'},
 {'category': 'Travel Guide',
  'id': 5,
  'similarity score': 0.09249082,
  'text': 'Discover the hidden gems of Kyoto, from tranquil temples to '
          'bustling markets, and experience authentic Japanese culture like '
          'never before.'},
 {'category': 'Recipe',
  'id': 4,
  'similarity score': 0.074016266,
  'text': 'Preheat the oven to 375°F. Mix flour, sugar, and eggs in a bowl, '
          'then fold in fresh blueberries. Bake for 25 minutes or until golden '
          'brown.'}]


In [16]:
doc.display_results(output_list_dicts = res, style = "tabulate")

+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|   id | category            | text                                                                                                                                                                          |   similarity score |
|    1 | Product Description | Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts. |          0.687242  |
+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|    5 | Travel Guide        | Discover the hidden gems of Kyoto, from tranquil temples 