--------------------------
#### Semantic text search using embeddings
------------------------------
- Semantic Search Efficiency:
    - Each review is embedded once and the vector is stored alongside the text.
    - At query time only the search query is embedded; reviews are then ranked by how similar their stored embeddings are to the query embedding.
    - Matching is based on meaning rather than exact wording, so the query "spoilt" later in this notebook retrieves a review that says "spoiled".
    
- Low Cost:
    - The review embeddings are computed once and reused for every search.
    - Each search costs one embedding call for the query plus a similarity computation over the stored vectors.

In [12]:
import pandas as pd
import numpy as np

from ast import literal_eval

#### load the saved embeddings (food reviews)

In [13]:
datafile_path = r"D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\amazon_food_reviews_with_embeddings_2k.csv"

In [14]:
df = pd.read_csv(datafile_path)

In [15]:
df.sample(3)

Unnamed: 0.1,Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,ada_embedding
1105,358045,B0032CJPOK,A1KG2FZQW58S2F,4,"It's not Mother's milk, but it's great formula!","After going back to work full-time, I really s...","Title: It's not Mother's milk, but it's great ...",641,"[0.006337316706776619, 0.01719411462545395, -0..."
1186,374637,B0077HIJYS,A3HVA6BTVB1UYP,5,good for low carbers and diabetics,compared to the usual white bread bun you migh...,Title: good for low carbers and diabetics; Con...,192,"[-0.006478943862020969, -0.011089159175753593,..."
1741,246790,B001EO5YO8,A171CP9ZEUR8B9,4,The taste we were looking for,We originally used this product in one particu...,Title: The taste we were looking for; Content:...,59,"[0.008865384384989738, -0.01965092122554779, -..."


In [16]:
%%time
# convert string to array
df["embedding"] = df.ada_embedding.apply(literal_eval).apply(np.array)

Wall time: 13.5 s
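
The conversion above parses every string with `literal_eval`, a general parser for Python literals. Because the stored strings are plain lists of floats, they are also valid JSON, so `json.loads` is a possible drop-in that is usually faster. A minimal sketch on a made-up string (the values are illustrative, not taken from the dataset):

```python
import json
from ast import literal_eval

# Illustrative stand-in for one cell of the ada_embedding column
raw = "[0.006337316706776619, 0.01719411462545395, -0.0125]"

parsed_ast = literal_eval(raw)   # general Python-literal parser
parsed_json = json.loads(raw)    # JSON parser; accepts the same list-of-floats text

# Both parsers should produce the same list of floats
print(parsed_ast == parsed_json)
```

On the DataFrame this would read `df["embedding"] = df.ada_embedding.apply(json.loads).apply(np.array)`. Timing both versions on your own data is the only reliable way to confirm the speed-up.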


#### Search query

- Embed the query, compute the cosine similarity between the query embedding and each review embedding, and show the top_n best matches.

In [17]:
from openai import OpenAI
import json
import os

In [18]:
openai_api_key = os.environ.get('OPENAI_API_KEY')

if openai_api_key is not None:
    # Confirm the key is available without echoing the secret into the notebook output
    print("OPENAI_API_KEY found.")
else:
    print("OPENAI_API_KEY not set.")

OPENAI_API_KEY not set.


In [19]:
# OpenAI() reads OPENAI_API_KEY from the environment when api_key is not passed
client = OpenAI(
    #api_key = openai_api_key
)

In [20]:
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL       = "gpt-3.5-turbo"

In [21]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_similarity(query_embedding, document_embeddings):
    """
    Calculate cosine similarity between the query embedding and document embeddings.

    Parameters:
    - query_embedding: 2-D array of shape (1, dim) holding the search query embedding.
    - document_embeddings: 2-D array of shape (n_documents, dim), one row per document.

    Returns:
    - 1-D array of cosine similarities between the query and each document.
    """
    similarities = cosine_similarity(query_embedding, document_embeddings)
    return similarities.flatten()

In [22]:
# For a specific query and documents
query_embedding     = np.array([[0.1, 0.3, 0.5]])  

document_embeddings = np.array([
    [0.2, 0.4, 0.6],  
    [0.15, 0.35, 0.55],
   
])

In [23]:
query_embedding.reshape(-1, 1)

array([[0.1],
       [0.3],
       [0.5]])

In [24]:
get_cosine_similarity(query_embedding, document_embeddings)

array([0.99385869, 0.99808276])
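
The two scores can be checked by hand from the definition cos(q, d) = (q · d) / (‖q‖ ‖d‖). A small sketch in plain Python, using the same toy vectors (the helper name `cosine` is only for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.1, 0.3, 0.5]
documents = [[0.2, 0.4, 0.6], [0.15, 0.35, 0.55]]

scores = [cosine(query, doc) for doc in documents]
print([round(s, 6) for s in scores])
```

For the first document the dot product is 0.44 and the two lengths are √0.35 and √0.56, which gives roughly 0.9939, in line with the first value returned above.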

In [25]:
# embed the query and score every review against it
def get_sim_scores(df, query):
    
    response = client.embeddings.create(
                    model          = EMBEDDING_MODEL, 
                    input          = query, 
                    encoding_format= "float"
    )
    
    # obtain the embedding for the query
    query_embedding = np.array(response.data[0].embedding)[np.newaxis, :]
       
    # score each review; this adds (or overwrites) df["similarity"] in place
    df["similarity"] = df.embedding.apply(lambda x: get_cosine_similarity(query_embedding, x[np.newaxis, :]))
    
    return df

In [26]:
df_with_sim_scores = get_sim_scores(df, "delicious beans")

In [28]:
df_with_sim_scores_sorted = df_with_sim_scores.sort_values("similarity", ascending=False)

In [29]:
df_with_sim_scores_sorted.columns

Index(['Unnamed: 0', 'ProductId', 'UserId', 'Score', 'Summary', 'Text',
       'combined', 'n_tokens', 'ada_embedding', 'embedding', 'similarity'],
      dtype='object')

In [30]:
pd.set_option('max_colwidth', 300)

In [31]:
# Show the review text and its similarity score for the 5 best matches
df_with_sim_scores_sorted[['combined', 'similarity']].head(5)

Unnamed: 0,combined,similarity
968,"Title: Amazing; Content: Nothing makes me feel at home more than a pot of blue runner red beans on the stove!! These red beans are the best. I'm so glad I can buy them here on amazon, my first two years of living away from Homs I had to bring them back with me or have someone send some! Now I ha...",[0.8566469739430351]
1985,"Title: Good Buy; Content: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!",[0.8514052862675929]
1771,Title: Jamaican Blue beans; Content: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and before any oil appears on the bean itself (455F @ 17 minutes).,[0.8501681322246752]
1088,"Title: Delicious!; Content: I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning and she likes it very much.<br />Thank you Amazon for having it because now I ca...",[0.8479964881101125]
1927,Title: Fantastic Instant Refried beans; Content: Fantastic Instant Refried Beans have been a staple for my family now for nearly 20 years. All 7 of us love it and my grown kids are passing on the tradition.,[0.8468315944254932]


In [32]:
# embed the query, score every review against it, and return the top-n matches
def search_reviews(df, query, n=5):
    
    response = client.embeddings.create(
                    model          = EMBEDDING_MODEL, 
                    input          = query, 
                    encoding_format= "float"
    )
    
    # obtain the embedding for the query
    query_embedding = np.array(response.data[0].embedding)[np.newaxis, :]
       
    # score each review; this adds (or overwrites) df["similarity"] in place
    df["similarity"] = df.embedding.apply(lambda x: get_cosine_similarity(query_embedding, x[np.newaxis, :]))
    
    # sort the DataFrame that was passed in by similarity, best match first
    df_sorted = df.sort_values("similarity", ascending=False)
    
    # keep the review text and its score for the n best matches
    results_df = df_sorted[['combined', 'similarity']].head(n)
    
    return results_df

In [33]:
search_reviews(df, "delicious beans", n=3)

Unnamed: 0,combined,similarity
968,"Title: Amazing; Content: Nothing makes me feel at home more than a pot of blue runner red beans on the stove!! These red beans are the best. I'm so glad I can buy them here on amazon, my first two years of living away from Homs I had to bring them back with me or have someone send some! Now I ha...",[0.8566469739430351]
1985,"Title: Good Buy; Content: I liked the beans. They were vacuum sealed, plump and moist. Would recommend them for any use. I personally split and stuck them in some vodka to make vanilla extract. Yum!",[0.8514052862675929]
1771,Title: Jamaican Blue beans; Content: Excellent coffee bean for roasting. Our family just purchased another 5 pounds for more roasting. Plenty of flavor and mild on acidity when roasted to a dark brown bean and before any oil appears on the bean itself (455F @ 17 minutes).,[0.8501681322246752]
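
The `similarity` column holds one-element arrays (hence the brackets) because each row is scored with a separate `cosine_similarity` call. One possible alternative, sketched here on random stand-in vectors rather than the real embeddings, stacks all vectors into a matrix and scores them in one step, which yields plain floats. The function name `top_n_by_cosine` is hypothetical.

```python
import numpy as np

def top_n_by_cosine(doc_matrix, query_vec, n=3):
    """Return (indices, scores) of the n rows of doc_matrix most similar to query_vec."""
    # Normalise the rows and the query so that a dot product equals cosine similarity
    doc_unit = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit            # shape: (n_documents,)
    order = np.argsort(scores)[::-1][:n]      # positions sorted by descending score
    return order, scores[order]

rng = np.random.default_rng(0)
doc_matrix = rng.normal(size=(100, 8))                   # stand-in for the stacked review embeddings
query_vec = doc_matrix[42] + 0.01 * rng.normal(size=8)   # a query very close to document 42

idx, sims = top_n_by_cosine(doc_matrix, query_vec)
print(idx, sims)
```

With the notebook's data, `doc_matrix` could be built as `np.vstack(df["embedding"].tolist())`. The returned indices are positional, so they would be used with `df.iloc[idx]`.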


In [34]:
search_reviews(df, "whole wheat pasta", n=3)

Unnamed: 0,combined,similarity
523,"Title: not gnocchi; Content: The package says that this pasta is gnocchi, but it's actually just whole wheat pasta in the shape of shells.",[0.8684178280608065]
1038,"Title: Tasty and Quick Pasta; Content: Barilla Whole Grain Fusilli with Vegetable Marinara is tasty and has an excellent chunky vegetable marinara. I just wish there was more of it. If you aren't starving or on a diet, the 9oz serving is enough for lunch although you might want to add a piece ...",[0.8629793254056708]
1853,Title: sooo good; Content: tastes so good. Worth the money. My boyfriend hates wheat pasta and LOVES this. cooks fast tastes great.I love this brand and started buying more of their pastas. Bulk is best.,[0.8565417212273676]


In [35]:
search_reviews(df, "bad delivery", n=1)

Unnamed: 0,combined,similarity
1766,"Title: great product, poor delivery; Content: The coffee is excellent and I am a repeat buyer. Problem this time was with the UPS delivery. They left the box in front of my garage door in the middle of the driveway. Because of this odd delivery location, my wife ran over the box when she backe...",[0.8221468637407492]


In [36]:
search_reviews(df, "spoilt", n=1)

Unnamed: 0,combined,similarity
1554,"Title: Extremely dissapointed; Content: Hi,<br />I am very disappointed with the past shipment I received of the ONE coconut water. 3 of the boxes were leaking and the coconut water was spoiled.<br /><br />Thanks.<br /><br />Laks",[0.7801374658641043]


In [37]:
search_reviews(df, "pet food", n=2)

Unnamed: 0,combined,similarity
1818,Title: Good food; Content: The only dry food my queen cat will eat. Helps prevent hair balls. Good packaging. Arrives promptly. Recommended by a friend who sells pet food.,[0.8407334901075522]
951,"Title: Appetizing, but rich; Content: My older dog was happy to go right to this bowl of food, but it upset his stomach. The cats like it, though, so it's not a bust. Must be yummy...",[0.8330178002691789]


| **Method**                          | **Description**                                                                                           | **Advantages**                                | **Use Cases**                                                    |
|-------------------------------------|-----------------------------------------------------------------------------------------------------------|------------------------------------------------|------------------------------------------------------------------|
| **Cosine Similarity**                | Measures the cosine of the angle between two vectors.                                                     | Simple, effective, and widely used.            | General-purpose similarity search in embeddings.                 |
| **Dot Product (Inner Product)**      | Sums the element-wise products of two vectors; equals cosine similarity for unit-length vectors.          | Computationally efficient.                     | Tasks where embeddings are normalized.                           |
| **Euclidean Distance**               | Measures the straight-line distance between two vectors in space.                                         | Considers magnitude, simple to compute.        | Use cases where magnitude and scale matter.                      |
| **BM25 (Okapi BM25)**                | A ranking function based on term frequency and document length normalization.                             | Effective in traditional text retrieval.       | Initial retrieval in hybrid search systems.                      |
| **Approximate Nearest Neighbor (ANN)** | Techniques like LSH and HNSW for finding approximate nearest neighbors in large datasets.                 | Scales well to large datasets, fast retrieval. | Large-scale retrieval with trade-offs in accuracy.               |
| **Dual Encoder Models**              | Encodes queries and documents with separate encoders and compares the resulting embeddings.               | Effective for embedding-based retrieval.       | Initial retrieval or ranking tasks where embedding similarity is key. |
| **Cross-Encoder Models**             | Uses both query and document together to output a relevance score directly.                               | High accuracy in relevance scoring.            | Re-ranking top results for improved relevance.                   |
| **Attention Mechanisms**             | Dynamically weighs parts of the query and document embeddings to refine similarity scores.               | Handles complex retrieval tasks effectively.   | Multi-step reasoning and complex retrieval processes.            |
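
For unit-length vectors the first three rows of the table are closely related: the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 − 2·cos, so all three give the same ranking. OpenAI's documentation states that its embeddings are normalized to length 1, so this applies to the embeddings used here. A small sketch on random vectors (the data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(5, 4))
query = rng.normal(size=4)

# Normalise everything to unit length
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
sq_euclid = np.sum((docs - query) ** 2, axis=1)

same_value = np.allclose(dot, cosine)                  # dot product vs cosine
same_relation = np.allclose(sq_euclid, 2 - 2 * cosine) # squared distance vs cosine
same_ranking = np.array_equal(np.argsort(-dot), np.argsort(sq_euclid))
print(same_value, same_relation, same_ranking)
```

If the vectors are not normalized, the three measures can disagree, because the dot product and Euclidean distance also depend on vector length.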


#### Implement BM25

In [None]:
#pip install rank-bm25

In [38]:
from rank_bm25 import BM25Okapi
import numpy as np

In [39]:
# Function to preprocess and tokenize text
def tokenize(text):
    # Tokenization logic here, e.g., using simple whitespace split
    return text.lower().split()
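
Splitting on whitespace keeps punctuation attached to words, so a title such as "Delicious!;" yields the token `delicious!;`, which does not match a query token `delicious`. One possible refinement, shown as a sketch (the name `tokenize_words` and the regular expression are assumptions, not part of the original notebook), keeps only runs of letters, digits, and apostrophes:

```python
import re

def tokenize_words(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Title: Delicious!; Content: I enjoy this white beans seasoning"

print(sample.lower().split()[:3])   # ['title:', 'delicious!;', 'content:']
print(tokenize_words(sample)[:3])   # ['title', 'delicious', 'content']
```

Note that this pattern would also turn HTML remnants such as `<br />` into a `br` token, so stripping markup first may be worthwhile.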

In [40]:
# Prepare the corpus and query
def search_reviews_bm25(df, query, n=5):
    # Tokenize the corpus
    tokenized_corpus = [tokenize(doc) for doc in df['combined']]
    
    # Initialize BM25
    bm25 = BM25Okapi(tokenized_corpus)
    
    # Tokenize the query
    tokenized_query = tokenize(query)
    
    # Get BM25 scores
    scores = bm25.get_scores(tokenized_query)
    
    # Add scores to DataFrame
    df['bm25_score'] = scores
    
    # Sort by BM25 score and select top-n
    df_sorted = df.sort_values('bm25_score', ascending=False)
    
    # Selecting specific columns (e.g., 'combined', 'bm25_score') from the sorted DataFrame
    results_df = df_sorted[['combined', 'bm25_score']].head(n)
    
    return results_df

In [41]:
# Example usage
query = "delicious beans"

top_results = search_reviews_bm25(df, query)
top_results

Unnamed: 0,combined,bm25_score
1596,"Title: Best beans your money can buy; Content: These are, hands down, the best jelly beans on the market. There isn't a gross one in the bunch and each of them has an intense, delicious flavor. Though I hesitate to pick a favorite, I have to say that I love green apple, a rare flavor in drugst...",7.570373
1743,"Title: Panama green beans; Content: These beans have produced some enjoyable cups of coffee with a medium roast (dark brown beans with no oil on their surfaces) at 455F for 17 minutes. We have experimented, using the same roasted beans with different water sources. We have learned that the beans...",7.454137
968,"Title: Amazing; Content: Nothing makes me feel at home more than a pot of blue runner red beans on the stove!! These red beans are the best. I'm so glad I can buy them here on amazon, my first two years of living away from Homs I had to bring them back with me or have someone send some! Now I ha...",6.233504
1088,"Title: Delicious!; Content: I enjoy this white beans seasoning, it gives a rich flavor to the beans I just love it, my mother in law didn't know about this Zatarain's brand and now she is traying different seasoning and she likes it very much.<br />Thank you Amazon for having it because now I ca...",6.071359
489,Title: coffee beans; Content: Coffee beans did not seem fresh. No oil on them what so ever. I have tasted much better and fresher. Will not order again.,5.927993


#### Key Points:
- **Tokenization:** The `tokenize` function lowercases the text and splits it on whitespace into a list of tokens. You might use more sophisticated tokenization depending on your requirements.
- **BM25 Initialization:** `BM25Okapi` is initialized with the tokenized corpus.
- **Scoring:** `get_scores` returns one BM25 score per document in the corpus for the tokenized query; the scores are added to the DataFrame as `bm25_score`.
- **Sorting:** The DataFrame is sorted by `bm25_score` and the top-n rows are returned.

#### Explanation:
- **Lexical matching:** BM25 scores a review only on the query tokens it contains, so it rewards exact words such as "beans". This is why reviews about jelly beans and coffee beans rank above the red beans review that the embedding search placed first.
- **Tokenizer limits:** Whitespace splitting leaves punctuation attached, so the title "Delicious!;" becomes the token `delicious!;` and does not match the query token `delicious`.
- **Index reuse:** `search_reviews_bm25` rebuilds the `BM25Okapi` index on every call; for repeated queries it can be built once and reused.
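
To make the scoring step less of a black box, here is a minimal sketch of the Okapi BM25 formula in plain Python. It assumes k1 = 1.5, b = 0.75, and a smoothed IDF of ln(1 + (N − n + 0.5) / (n + 0.5)); `rank_bm25` may differ in such details, so the numbers are not expected to match `BM25Okapi` exactly. The function name `bm25_scores` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def bm25_scores(tokenized_corpus, query_tokens, k1=1.5, b=0.75):
    """Score every document in tokenized_corpus against query_tokens with Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "these beans are delicious".split(),
    "coffee beans did not seem fresh".split(),
    "the pasta cooks fast".split(),
]
scores = bm25_scores(corpus, "delicious beans".split())
print(scores)
```

The first document contains both query terms and scores highest, the second contains only "beans", and the third contains neither and scores zero.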


#### Implement Cross-Encoder Models

In [None]:
#pip install sentence-transformers

In [None]:
from sentence_transformers import CrossEncoder

In [None]:
# Load the pre-trained Cross-Encoder model
model_name = 'cross-encoder/ms-marco-TinyBERT-L-6'
model = CrossEncoder(model_name)

In [None]:
# Function to compute scores for query-document pairs
def score_pairs(query, documents):
    # Create input pairs
    pairs = [(query, doc) for doc in documents]
    
    # Compute scores
    scores = model.predict(pairs)
    
    return scores

In [None]:
# Function to search through the DataFrame using Cross-Encoder
def search_reviews_cross_encoder(df, query, n=5):
    # Extract documents
    documents = df['combined'].tolist()
    
    # Compute scores for query-document pairs
    scores = score_pairs(query, documents)
    
    # Add scores to DataFrame
    df['cross_encoder_score'] = scores
    
    # Sort by Cross-Encoder score and select top-n
    df_sorted = df.sort_values('cross_encoder_score', ascending=False)
    
    # Selecting specific columns (e.g., 'combined', 'cross_encoder_score') from the sorted DataFrame
    results_df = df_sorted[['combined', 'cross_encoder_score']].head(n)
    
    return results_df

In [None]:
%%time
# Example usage
# scores every review in df with the Cross-Encoder; can take more than an hour
query = "delicious beans"

top_results = search_reviews_cross_encoder(df, query)
top_results

#### Key Points:

- **Loading the Model:** 
  - `CrossEncoder` from `sentence-transformers` loads a pre-trained Cross-Encoder model. 
  - The `model_name` can be changed to any other suitable pre-trained Cross-Encoder model.

- **Scoring Function:** 
  - The `score_pairs` function builds one `(query, document)` pair per document and calls `model.predict` to get a relevance score for each pair.

- **Integration:** 
  - The `search_reviews_cross_encoder` function adds the scores to the DataFrame as `cross_encoder_score`, sorts by that column, and returns the top `n` rows.

#### Explanation:

- **Joint encoding:** 
  - A Cross-Encoder reads the query and a document together and outputs a relevance score directly, so document scores cannot be precomputed and stored the way embeddings can.

- **Cost:** 
  - Every query needs one model pass per document, which is why scoring all reviews is slow and why Cross-Encoders are typically used to re-rank a short list of candidates produced by a cheaper method.
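
Because scoring every review with a Cross-Encoder is slow, a common pattern is to let a cheap method pick a shortlist and re-rank only that shortlist. The sketch below shows the control flow with two stand-in scoring functions so that it runs without downloading a model. `retrieve_then_rerank`, `cheap_score`, and `expensive_score` are hypothetical names; in the notebook the expensive scorer would wrap `model.predict`.

```python
def retrieve_then_rerank(query, documents, cheap_score, expensive_score, shortlist=3, n=2):
    """Shortlist documents with cheap_score, then order the shortlist with expensive_score."""
    # Stage 1: rank all documents with the cheap scorer and keep the best few
    ranked = sorted(documents, key=lambda doc: cheap_score(query, doc), reverse=True)
    candidates = ranked[:shortlist]
    # Stage 2: run the expensive scorer only on the shortlist
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)
    return reranked[:n]

# Stand-in scorers (assumptions for illustration only)
def cheap_score(query, doc):
    """Count the query words that appear in the document."""
    doc_words = set(doc.lower().split())
    return sum(word in doc_words for word in query.lower().split())

calls = []
def expensive_score(query, doc):
    """Placeholder for a Cross-Encoder; records each document it is asked to score."""
    calls.append(doc)
    return cheap_score(query, doc) + ("delicious" in doc.lower())

docs = [
    "delicious red beans",
    "coffee beans not fresh",
    "whole wheat pasta",
    "great dog food",
    "beans were plump and delicious",
]
top = retrieve_then_rerank("delicious beans", docs, cheap_score, expensive_score)
print(top, len(calls))
```

Here the expensive scorer runs on 3 of the 5 documents. With the review data, the shortlist could come from the embedding search or from BM25, and its size trades speed against the chance of missing a relevant review.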
