## `searchlite` Basic Demo Notebook v2

This notebook contains basic code walking through the first version of `searchlite`. We'll load a sample text data set with some metadata, split the dataframe into the text and its metadata, load it into `searchlite`, and perform and display a semantic search. 

First, import the dependencies. This simple example needs only `searchlite`, `pandas` (to load the example data), and `os` (to build the file path to the example data). The `searchlite` import appears later, just before the `Document` class is used.

In [1]:
import pandas as pd
import os

## Import and look at data

Next, define the path to the sample data. Here it lives in the `data` folder, one level above the current working directory. After defining the path, use pandas to load the CSV file as a dataframe.

In [2]:
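# Build the path piece by piece so os.path.join handles the separators.
sample_df = pd.read_csv(
    os.path.join(os.getcwd(), "..", "data", "synthetic_data.csv"),
    index_col=0,  # use the first CSV column as the dataframe index
)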

Let's take a look at our sample data below. The data consists of 15 distinct pieces of text with corresponding `id` and `category` values. Each text covers a quite different topic, so you can test the semantic search with different queries to see whether the results make sense.

In [3]:
sample_df

Unnamed: 0,id,category,text
0,1,Product Description,Experience unparalleled sound quality with the...
1,2,Movie Synopsis,"In a world ravaged by climate change, a group ..."
2,3,News Article,The city council approved the new public trans...
3,4,Recipe,"Preheat the oven to 375°F. Mix flour, sugar, a..."
4,5,Travel Guide,"Discover the hidden gems of Kyoto, from tranqu..."
5,6,Scientific Abstract,This study investigates the effects of micropl...
6,7,Book Review,"An evocative tale of love and loss, 'The Silen..."
7,8,Job Posting,Looking for a skilled software engineer profic...
8,9,User Manual,"To reset your device, hold the power button fo..."
9,10,Historical Event,"The Berlin Wall, constructed in 1961, symboliz..."


Before initializing the `Document` class, you need to split the dataframe into the text you want to embed and its corresponding metadata (shown below). To do this, isolate the text column, then use the `.to_dict()` method with `orient = "records"` to convert the metadata columns into a list of dictionaries, one per row of the dataframe.

In [4]:
sample_texts = sample_df["text"]
sample_metadata = sample_df[["id", "category"]].to_dict(orient = "records")

In [5]:
sample_texts[0:3]

0    Experience unparalleled sound quality with the...
1    In a world ravaged by climate change, a group ...
2    The city council approved the new public trans...
Name: text, dtype: object

In [6]:
sample_metadata[0:3]

[{'id': 1, 'category': 'Product Description'},
 {'id': 2, 'category': 'Movie Synopsis'},
 {'id': 3, 'category': 'News Article'}]
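
To make the target shapes concrete, here is a minimal sketch of the same split using only the standard library. The three rows are hypothetical stand-ins for the sample data, not the actual contents of `synthetic_data.csv`, and the names `raw_csv`, `texts`, and `metadata` are illustrative only.

```python
import csv
import io

# Hypothetical three-row sample that mirrors the columns shown above.
raw_csv = """id,category,text
1,Product Description,Wireless earbuds with long battery life.
2,Movie Synopsis,Survivors cross a flooded continent.
3,News Article,The council approved a new transit plan.
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# One list holds the text to embed; the other holds one metadata dict per row.
texts = [row["text"] for row in rows]
metadata = [{"id": int(row["id"]), "category": row["category"]} for row in rows]

# The two lists must stay aligned: metadata[i] describes texts[i].
assert len(texts) == len(metadata)
print(metadata[0])
```

Whichever way the split is done, the key property is that the two sequences have the same length and the same order, since each metadata dictionary is matched to its text by position.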

## Use searchlite to embed text and run semantic search

Now you can initialize the `Document` class. For the basic demo, you only need to import `Document` from `searchlite.document`. As shown below, both the text and metadata are saved as attributes. Before performing a search, you must generate embeddings for the texts stored in the `Document` instance.

Note that if you do not specify a model, `searchlite` automatically imports the `SkTFIDFEmbedder` class, which implements scikit-learn's `TfidfVectorizer`. When you initialize your `Document`, the `SkTFIDFEmbedder` is automatically fitted to your texts, which is why the output below reports `Embedder fitted: True` while the embeddings themselves are still `Not Ready`.

In [7]:
from searchlite.document import Document

In [8]:
doc = Document(texts = sample_texts, metadata = sample_metadata)

In [9]:
doc

Document instance with 15 texts. Metadata contains the following fields: id, category. Embeddings: Not Ready.
Embedder:TFIDFEmbedder object implemented using scikit-learn.
 Embedder fitted: True

Run the `.embed()` method to generate the TF-IDF embeddings with scikit-learn's `TfidfVectorizer`. To use a different embedding model, see the other example notebooks, which show how to initialize an embedder and pass it to your `Document`.

In [10]:
doc.embed()

After generating your text embeddings, you can run semantic search on your text corpus with the `.query()` method. Your query is embedded into a vector and compared against each text in the corpus using cosine similarity. By default, `.query()` returns the top 3 matches, but you can change this with the **top_k** parameter.

As you can see from the cell below, `.query()` returns a list of dictionaries, each containing the metadata, text, and similarity score of one match, ordered from most to least similar.

In [11]:
res = doc.query(query_text = "wireless earbuds with good battery life")
res

[{'id': 1,
  'category': 'Product Description',
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.',
  'similarity score': 0.4920494237945505},
 {'id': 11,
  'category': 'Customer Review',
  'text': 'The blender exceeded my expectations with its powerful motor and easy-to-clean design. Perfect for smoothies and soups!',
  'similarity score': 0.07414576593774012},
 {'id': 14,
  'category': 'E-commerce FAQ',
  'text': 'Q: Does this jacket have waterproof capabilities? A: Yes, it is made with breathable waterproof fabric suitable for heavy rain.',
  'similarity score': 0.0657987549804305}]
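
The ranking step described above can be sketched in plain Python. This is not `searchlite`'s implementation: the helpers `tokenize`, `tfidf_vectors`, `cosine`, and `rank` are hypothetical, the tokenizer is a bare whitespace split, and the toy corpus is invented for illustration. The IDF formula is the smoothed form that scikit-learn documents as its default, but the scores will not match the output above.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplification: lowercase and split on whitespace only.
    return text.lower().split()


def tfidf_vectors(texts):
    """Return (vectors, idf); each vector maps token -> TF-IDF weight."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    vectors = [
        {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(texts, query_text, top_k=3):
    """Embed the query with the corpus vocabulary and return the top_k matches."""
    vectors, idf = tfidf_vectors(texts)
    # Query tokens that never appeared in the corpus carry no weight.
    q_counts = Counter(tok for tok in tokenize(query_text) if tok in idf)
    q_vec = {tok: count * idf[tok] for tok, count in q_counts.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"text": texts[i], "similarity score": s} for s, i in scored[:top_k]]


# Hypothetical toy corpus, not the notebook's sample data.
corpus = [
    "wireless earbuds with long battery life",
    "a powerful blender for smoothies",
    "waterproof jacket for heavy rain",
]
for match in rank(corpus, "wireless earbuds battery", top_k=2):
    print(match)
```

Because TF-IDF vectors only carry weight for tokens shared between the query and a text, texts with little vocabulary overlap score near zero, which is consistent with the steep drop between the first and second results above.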

The `Document` class has three options to nicely display the results of your semantic search in the terminal: f-string, pprint, and tabulate.

- "f-string" outputs a custom f-string (defined in document.py)

- "pprint" leverages the pprint package to display a list of dictionaries of the top k results

- "tabulate" leverages the tabulate library to display a table of the top k results.

In [12]:
doc.display_results(output_list_dicts = res, style = "f-string")

Result 1:
    id: 1
    category: Product Description
    text: Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts.
    similarity score: 0.4920494237945505

Result 2:
    id: 11
    category: Customer Review
    text: The blender exceeded my expectations with its powerful motor and easy-to-clean design. Perfect for smoothies and soups!
    similarity score: 0.07414576593774012

Result 3:
    id: 14
    category: E-commerce FAQ
    text: Q: Does this jacket have waterproof capabilities? A: Yes, it is made with breathable waterproof fabric suitable for heavy rain.
    similarity score: 0.0657987549804305



In [13]:
doc.display_results(output_list_dicts = res, style = "pprint")

[{'category': 'Product Description',
  'id': 1,
  'similarity score': 0.4920494237945505,
  'text': 'Experience unparalleled sound quality with the EchoSphere wireless '
          'earbuds, featuring noise cancellation, 12-hour battery life, and an '
          'ergonomic design perfect for workouts.'},
 {'category': 'Customer Review',
  'id': 11,
  'similarity score': 0.07414576593774012,
  'text': 'The blender exceeded my expectations with its powerful motor and '
          'easy-to-clean design. Perfect for smoothies and soups!'},
 {'category': 'E-commerce FAQ',
  'id': 14,
  'similarity score': 0.0657987549804305,
  'text': 'Q: Does this jacket have waterproof capabilities? A: Yes, it is '
          'made with breathable waterproof fabric suitable for heavy rain.'}]
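
Note that the keys in the output above appear in alphabetical order (`category`, `id`, `similarity score`, `text`) rather than in the order `.query()` returned them. This matches the default behavior of Python's `pprint` module, which sorts dictionary keys. The sketch below shows that behavior in isolation, using a shortened copy of the first result; whether `display_results` exposes a way to change it is not covered here.

```python
from pprint import pformat

# Shortened copy of the first search result, for illustration only.
result = {
    "id": 1,
    "category": "Product Description",
    "text": "Experience unparalleled sound quality with the EchoSphere wireless earbuds.",
    "similarity score": 0.49,
}

# Default: dictionary keys are sorted alphabetically.
sorted_view = pformat([result])

# sort_dicts=False (Python 3.8+) keeps the insertion order instead.
ordered_view = pformat([result], sort_dicts=False)

print(sorted_view)
print(ordered_view)
```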


In [14]:
doc.display_results(output_list_dicts = res, style = "tabulate")

+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|   id | category            | text                                                                                                                                                                          |   similarity score |
|    1 | Product Description | Experience unparalleled sound quality with the EchoSphere wireless earbuds, featuring noise cancellation, 12-hour battery life, and an ergonomic design perfect for workouts. |          0.492049  |
+------+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|   11 | Customer Review     | The blender exceeded my expectations with its powerful mo