In [2]:
!pip install -U langchain langchain-community langchain-huggingface chromadb sentence-transformers



**Running the embeddings.py file**

In [3]:
import pandas as pd
from langchain_community.vectorstores import Chroma  # vector store that holds embeddings and metadata
from langchain.docstore.document import Document  # standard container for text plus metadata
from sentence_transformers import SentenceTransformer  # turns text into embeddings using pre-trained models like "all-mpnet-base-v2"
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Load the dataset
data = pd.read_csv(r"/content/qna_dataset.csv")

# Create a new column "Combined" that merges each question and its answer into a single text
data["Combined"] = data.apply(
    lambda row: f"question: {row['question']} answer: {row['answer']}", axis=1
)

# Wrap each combined text in a Document (rather than a plain string) so it carries metadata and works smoothly within the LangChain ecosystem
documents = [
    Document(page_content=row["Combined"], metadata={"question": row["question"], "answer": row["answer"]})
    for _, row in data.iterrows()
]

# Load the embedding model using LangChain's wrapper
embedding_function = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")

try:
    # Create the vector database using from_documents
    vector_db = Chroma.from_documents(
        documents,
        embedding=embedding_function,  # pass the LangChain embeddings wrapper, not a bare function
        persist_directory="db"
    )
    print('Vector store created successfully')
except Exception as e:
    print(f"error: {e}")

print("Saving DB to:", os.path.abspath("db"))




Vector store created successfully
Saving DB to: /content/db
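
Conceptually, the vector store answers a query by comparing the query's embedding with the stored embeddings and returning the closest texts. The cell below is a minimal sketch of that idea in plain Python. The three-dimensional vectors, the `toy_store` list, and the `top_k` helper are hypothetical stand-ins for illustration, not real `all-mpnet-base-v2` embeddings, and the distance metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_store = [
    ("question about discounts", [0.9, 0.1, 0.0]),
    ("question about delivery", [0.1, 0.9, 0.1]),
    ("question about returns", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.8, 0.0]  # points mostly in the "delivery" direction

results = top_k(toy_query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The point of the sketch is that a search always returns the nearest stored texts, whether or not any of them is actually relevant to the query.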


**Test**

In [5]:
from langchain.prompts import PromptTemplate
import pandas as pd

# Define the prompt template function
def get_prompt_template():
    template = """
You are a helpful e-commerce assistant. Use the following question-answer pairs (context) to help answer the user's question.

{context}

User's Question: {query}

Provide a clear and helpful answer based on the information above.
"""
    return PromptTemplate(input_variables=["context", "query"], template=template)

# Simulate user input
user_query = "How long does delivery take?"

# Use the vector database to retrieve relevant docs
retrieved_docs = vector_db.similarity_search(user_query)

# Build the context from retrieved documents
context = "\n".join([doc.page_content for doc in retrieved_docs])

# Get the prompt template and format it
prompt_template = get_prompt_template()
formatted_prompt = prompt_template.format(context=context, query=user_query)

# Print the formatted prompt
print(formatted_prompt)


You are a helpful e-commerce assistant. Use the following question-answer pairs (context) to help answer the user's question.

question: What is the discount on Condoms - Extra Time? answer: The discount on Condoms - Extra Time is 0.0%
question: What is the discount on For Boys - With Surprise? answer: The discount on For Boys - With Surprise is 0.0%
question: What is the discount on Rusk - Baby? answer: The discount on Rusk - Baby is 0.0%
question: What is the discount on Papad - Potato? answer: The discount on Papad - Potato is 0.0%

User's Question: How long does delivery take?

Provide a clear and helpful answer based on the information above.
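
In the output above, all four retrieved pairs are about discounts, so none of them helps with the delivery question. One possible safeguard is to drop weak matches before building the context. The cell below is a minimal sketch under stated assumptions: `build_context`, the `max_distance` threshold, and the distance values in `toy_hits` are all hypothetical. In the notebook, similar `(text, distance)` pairs could be derived from `vector_db.similarity_search_with_score(user_query)`, which returns `(Document, score)` tuples; with Chroma a lower score typically means a closer match, but a sensible threshold depends on the metric and the data.

```python
def build_context(scored_texts, max_distance=0.8):
    """Join only the texts whose distance is within the threshold.

    scored_texts: list of (text, distance) pairs, lower distance = closer.
    max_distance: hypothetical cut-off; tune it against real queries.
    """
    relevant = [text for text, distance in scored_texts if distance <= max_distance]
    if not relevant:
        return "No relevant question-answer pairs were found."
    return "\n".join(relevant)


# Texts taken from the output above; the distances are invented for illustration.
toy_hits = [
    ("question: What is the discount on Rusk - Baby? "
     "answer: The discount on Rusk - Baby is 0.0%", 1.35),
    ("question: What is the discount on Papad - Potato? "
     "answer: The discount on Papad - Potato is 0.0%", 1.41),
]

print(build_context(toy_hits))
```

With this kind of filter, the prompt can state that no relevant context was found instead of presenting unrelated pairs as if they answered the question.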

