<a href="https://colab.research.google.com/github/singhvis29/Hands_On_LLM_WR/blob/main/Ch8_Semantic_Search_and_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1
!pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

In [2]:
%%capture
!pip install langchain_google_genai

In [3]:
%%capture
!pip install chromadb

In [4]:
from google import genai
from google.genai import types

In [5]:
import pandas as pd
import numpy as np
from datetime import datetime

In [6]:
import getpass
import os

from google.colab import userdata


if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

In [7]:
# GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [8]:
from langchain_google_genai import ChatGoogleGenerativeAI

gemini_llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

In [9]:
client = genai.Client(api_key=GOOGLE_API_KEY)

In [10]:
for model in client.models.list():
  if 'embedContent' in model.supported_actions:
    print(model.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


## Dense Retrieval Example

### 1. Getting the text archive and chunking it

In [11]:
import cohere

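# Create and retrieve a Cohere API key from os.cohere.ai.
# Load it from Colab secrets instead of pasting it here, and never share it publicly.
# The secret name 'COHERE_API_KEY' is an assumption; use the name you stored it under.
api_key = userdata.get('COHERE_API_KEY')

co = cohere.Client(api_key)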

In [12]:
def gemini_embeddings(txt):
  result = client.models.embed_content(
          model="embedding-001",
          contents=txt)

  return result.embeddings

In [13]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]
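Splitting on every `.` works for this passage, but it would also cut inside decimals such as `2.8`. A minimal sketch of a slightly stricter rule, splitting only on a period followed by whitespace or the end of the text, is shown here. The helper name `split_sentences` and the sample string are hypothetical and not part of the notebook.

```python
import re

def split_sentences(text):
    """Split on a period that is followed by whitespace or the end of the text."""
    # The period in "2.8" is followed by a digit, so it does not trigger a split.
    parts = re.split(r"\.(?:\s+|$)", text)
    # Strip surrounding whitespace and drop empty pieces.
    return [p.strip() for p in parts if p.strip()]

sample = "The film runs 2.8 hours. It was shot on 35 mm film.\nIt premiered in 2014"
chunks = split_sentences(sample)
print(chunks)
```

Like `text.split('.')`, this drops the trailing periods, so the chunks can be passed to the embedding step in the same way.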

### 2. Embedding the Text Chunks

In [14]:
import numpy as np

## Gemini
# # Get the embeddings
# response = [gemini_embeddings(t) for t in texts]

# embeds = np.array([np.array(response[i][0].values) for i in range(len(response))])
# print(embeds.shape)

# Cohere
# Get the embeddings
response = co.embed(
  texts=texts,
  input_type="search_document",
).embeddings

embeds = np.array(response)
print(embeds.shape)

(15, 4096)


### 3. Building The Search Index

In [15]:
import faiss
import numpy as np

# FAISS expects a contiguous float32 array of shape (n_sentences, embedding_dim)
embeds = np.ascontiguousarray(embeds, dtype=np.float32)

dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
# Populate the index (the searches further down go through the Chroma collection instead)
index.add(embeds)
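`faiss.IndexFlatL2` does an exhaustive search: it compares the query against every stored vector and returns the closest ones by squared L2 distance. A rough pure-Python sketch of that idea on toy 2-D vectors is given below. The names `squared_l2` and `flat_l2_search` are illustrative, and this is not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k=3):
    """Return (index, squared distance) pairs for the k vectors closest to query."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
hits = flat_l2_search(toy_vectors, query=[1.0, 0.75], k=2)
print(hits)  # [(1, 0.0625), (3, 0.5625)]
```

The cost grows linearly with the number of stored vectors, which is fine for 15 sentences but is the reason approximate indexes exist for larger archives.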

In [16]:
import chromadb
chroma_client = chromadb.Client()


In [17]:
# chroma_client.delete_collection(name="interstellar")

In [18]:
import chromadb.utils.embedding_functions as embedding_functions
from datetime import datetime

cohere_ef  = embedding_functions.CohereEmbeddingFunction(api_key=api_key,  model_name="large")
# cohere_ef(input=["document1","document2"])

collection = chroma_client.create_collection(
    name="interstellar",
    embedding_function=cohere_ef,  # Pass the wrapper function
    metadata={
        "description": "interstellar wiki chroma collection",
        "created": str(datetime.now())
    }
)


In [19]:
collection.add(
    documents=texts,
    # embeddings=embeds.tolist(),
    ids=[str(i) for i in range(len(texts))]
)


### 4. Search the index

#### nearest-neighbor search (squared L2 distance)

In [20]:
results = collection.query(
    query_texts=["how precise was the science"], # Chroma will embed this for you
    n_results=3 # how many results to return
)
print(results)


{'ids': [['12', '4', '7']], 'embeddings': None, 'documents': [['It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics', 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar', 'Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None, None]], 'distances': [[10757.37109375, 11566.1328125, 11922.84375]]}


In [21]:
pd.DataFrame({'ids': list(results['ids'][0]), 'documents': list(results['documents'][0]), 'distances': list(results['distances'][0])})

|   | ids | documents | distances |
|---|---|---|---|
| 0 | 12 | It has also received praise from many astronom... | 10757.371094 |
| 1 | 4 | Caltech theoretical physicist and 2017 Nobel l... | 11566.132812 |
| 2 | 7 | Interstellar uses extensive practical and mini... | 11922.84375 |
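The distances above are in the thousands, which is outside the 0 to 2 range of cosine distance. They are consistent with squared L2 distance on unnormalized vectors, which is Chroma's documented default (`l2`). For unit-length vectors the two measures are tied together by ‖a − b‖² = 2 − 2·cos(a, b), so they rank neighbors identically. The sketch below checks that identity on toy vectors; all names in it are illustrative.

```python
import math

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale v to unit length."""
    n = norm(v)
    return [x / n for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

a = [3.0, 4.0]
b = [4.0, 3.0]

cos = cosine_similarity(a, b)                          # 24 / 25
raw_dist = squared_l2(a, b)                            # depends on vector length
unit_dist = squared_l2(normalize(a), normalize(b))     # depends only on the angle

print(cos, raw_dist, unit_dist, 2 - 2 * cos)
```

Chroma can be asked to use cosine distance when the collection is created; in the versions I am aware of this is done with `metadata={"hnsw:space": "cosine"}`, but treat that key as an assumption and check it against the installed version.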


In [32]:
def search(text):
  results = collection.query(
    query_texts=[text], # Chroma will embed this for you
    n_results=3 # how many results to return
  )
  return pd.DataFrame({'ids': list(results['ids'][0]), 'documents': list(results['documents'][0]), 'distances': list(results['distances'][0])})


#### keyword search

In [23]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string

def bm25_tokenizer(text):
    """Lowercase, split on whitespace, strip punctuation, and drop English stop words."""
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

In [24]:
from tqdm import tqdm

tokenized_corpus = []
for passage in tqdm(texts):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)

100%|██████████| 15/15 [00:00<00:00, 40933.35it/s]


In [25]:
def keyword_search(query, top_k=3, num_candidates=15):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

In [26]:
keyword_search(query = "how precise was the science")

Input question: how precise was the science
Top-3 lexical search (BM25) hits
	1.789	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine
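The scores above show the limits of exact token matching: after stop-word removal the query reduces to `precise` and `science`, so the sentence about "scientific accuracy" scores zero while the "science fiction" sentence ranks first. A simplified BM25 scorer makes this visible. This is a sketch using one common IDF variant and commonly used `k1` and `b` values; it is not guaranteed to reproduce the exact numbers from `rank_bm25`, and the toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    ["epic", "science", "fiction", "film"],
    ["praise", "scientific", "accuracy"],
    ["stars", "matthew", "mcconaughey"],
]
scores = bm25_scores(["precise", "science"], docs)
print(scores)
```

Only the first toy document gets a non-zero score, because `scientific` and `science` are different tokens to a lexical scorer.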


### Caveats of Dense Retrieval

In [33]:
query = "What is the mass of the moon?"
results = search(query)
results

|   | ids | documents | distances |
|---|---|---|---|
| 0 | 5 | Cinematographer Hoyte van Hoytema shot it on 3... | 12854.444336 |
| 1 | 10 | The film had a worldwide gross over $677 milli... | 13301.009766 |
| 2 | 12 | It has also received praise from many astronom... | 13331.999023 |
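Dense retrieval always returns its nearest neighbors, even when nothing in the archive answers the query, as with the moon question above. One possible heuristic is to discard hits beyond a maximum distance. The sketch below assumes a result dict shaped like the `collection.query` output earlier; the helper `filter_by_distance` and the cutoff of 12000 are hypothetical values chosen only to separate the two example queries, not a recommended setting.

```python
def filter_by_distance(results, max_distance):
    """Keep only the hits whose distance is at most max_distance."""
    kept = []
    # Each field is nested in an outer list with one entry per query.
    rows = zip(results["ids"][0], results["documents"][0], results["distances"][0])
    for doc_id, doc, dist in rows:
        if dist <= max_distance:
            kept.append({"id": doc_id, "document": doc, "distance": dist})
    return kept

# Ids and distances copied (rounded) from the two query outputs above.
on_topic = {
    "ids": [["12", "4", "7"]],
    "documents": [["It has also received praise...", "Caltech theoretical physicist...", "Interstellar uses extensive..."]],
    "distances": [[10757.37, 11566.13, 11922.84]],
}
off_topic = {
    "ids": [["5", "10", "12"]],
    "documents": [["Cinematographer Hoyte van Hoytema...", "The film had a worldwide gross...", "It has also received praise..."]],
    "distances": [[12854.44, 13301.01, 13332.00]],
}

kept_on = filter_by_distance(on_topic, max_distance=12000)
kept_off = filter_by_distance(off_topic, max_distance=12000)
print(len(kept_on), len(kept_off))  # 3 0
```

A fixed cutoff like this depends on the embedding model and distance metric, so it would need to be tuned on real queries.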


### Reranking Example

In [28]:
query = "how precise was the science"
results = co.rerank(query=query, documents=texts, top_n=3, return_documents=True)
results.results

[RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics'), index=12, relevance_score=0.15239799),
 RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'), index=10, relevance_score=0.050354082),
 RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan'), index=0, relevance_score=0.0350424)]

In [29]:
for idx, result in enumerate(results.results):
    print(idx, result.relevance_score , result.document.text)

0 0.15239799 It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics
1 0.050354082 The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014
2 0.0350424 Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan


In [30]:
def keyword_and_reranking_search(query, top_k=3, num_candidates=10):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -num_candidates)[-num_candidates:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print(f"Top-{top_k} lexical search (BM25) hits")
    for hit in bm25_hits[0:top_k]:
        print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

    # Add re-ranking: send the BM25 candidates to the reranker
    docs = [texts[hit['corpus_id']] for hit in bm25_hits]

    print(f"\nTop-{top_k} hits by rank-API ({len(bm25_hits)} BM25 hits re-ranked)")
    results = co.rerank(query=query, documents=docs, top_n=top_k, return_documents=True)
    for hit in results.results:
        print("\t{:.3f}\t{}".format(hit.relevance_score, hit.document.text.replace("\n", " ")))

In [31]:
keyword_and_reranking_search(query = "how precise was the science")

Input question: how precise was the science
Top-3 lexical search (BM25) hits
	1.789	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	1.373	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar
	0.000	Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects

Top-3 hits by rank-API (10 BM25 hits re-ranked)
	0.035	Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan
	0.032	It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine
	0.031	Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of In
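
In this run the best reranked score is 0.035, while reranking all 15 sentences earlier put the "scientific accuracy" sentence first at 0.152. That suggests the sentence was not among the 10 BM25 candidates (it shares no exact token with the query), so the reranker never saw it. One possible remedy is to shortlist the union of dense and keyword candidates before reranking. The sketch below assumes both first-stage searches return ranked lists of sentence ids; `merge_candidates` is a hypothetical helper.

```python
def merge_candidates(*ranked_id_lists, limit=None):
    """Interleave several ranked id lists, keeping the first occurrence of each id."""
    merged, seen = [], set()
    longest = max((len(ids) for ids in ranked_id_lists), default=0)
    for rank in range(longest):
        for ids in ranked_id_lists:
            if rank < len(ids) and ids[rank] not in seen:
                seen.add(ids[rank])
                merged.append(ids[rank])
    return merged if limit is None else merged[:limit]

dense_ids = [12, 4, 7]    # ids returned by the dense query earlier
keyword_ids = [0, 4, 1]   # sentence indices of the top BM25 hits earlier
shortlist = merge_candidates(dense_ids, keyword_ids)
print(shortlist)  # [12, 0, 4, 7, 1]
```

The merged ids can then be mapped back to `texts` and passed as the `documents` argument of `co.rerank`.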

## Retrieval Augmented Generation (RAG)