<a href="https://colab.research.google.com/github/shreyachowdhuryjsk/rag-architechture-md-processing/blob/main/rag_md_reader_and_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
pip install langchain_community



In [5]:
import os
from langchain_community.document_loaders import TextLoader
from sentence_transformers import SentenceTransformer
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline
from langchain_text_splitters import MarkdownHeaderTextSplitter

**Step 1: Loading the File**

In [6]:
loader = TextLoader("/content/tennis_details.md")
text_doc = loader.load()
print(text_doc)
print(text_doc[0].metadata)
print(text_doc[0].page_content)

[Document(metadata={'source': '/content/tennis_details.md'}, page_content="# Tennis\n\n## Introduction\nTennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net into the opponent's court.\n\n## Basic Rules\n- A match can be played as best of three or five sets.\n- Each set consists of games, and each game consists of points.\n- Points are scored as **0 (Love), 15, 30, 40**, and then **game**.\n- A player must win a game by at least **two points**.\n- The ball must land within the designated court boundaries.\n\n## Scoring System\n```plaintext\n0 points  -> Love\n1 point   -> 15\n2 points  -> 30\n3 points  -> 40\n4 points  -> Game (if leading by 2)\nDeuce     -> 40-40 (must win two consecutive points to win the game)\nAdvantage -> If a player wins a point at deuce, they gain the advantage\n```\n\n## Famous Tournaments\n- **Grand Slam Events**:\n  - Australian Open\n  - French Open

**Step 2: Splitting the Document into Chunks**

In [7]:
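# Split on level-2 headings and store each heading's text under the "title" metadata key.
split_condition = [("##", "title")]
splitter = MarkdownHeaderTextSplitter(split_condition)
doc_splits = splitter.split_text(text_doc[0].page_content)
print(doc_splits)
# Keep only the text of each chunk; this is what gets embedded later.
text_chunks = [doc.page_content for doc in doc_splits]
print(text_chunks)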

[Document(metadata={}, page_content='# Tennis'), Document(metadata={'title': 'Introduction'}, page_content="Tennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net into the opponent's court."), Document(metadata={'title': 'Basic Rules'}, page_content='- A match can be played as best of three or five sets.\n- Each set consists of games, and each game consists of points.\n- Points are scored as **0 (Love), 15, 30, 40**, and then **game**.\n- A player must win a game by at least **two points**.\n- The ball must land within the designated court boundaries.'), Document(metadata={'title': 'Scoring System'}, page_content='```plaintext\n0 points  -> Love\n1 point   -> 15\n2 points  -> 30\n3 points  -> 40\n4 points  -> Game (if leading by 2)\nDeuce     -> 40-40 (must win two consecutive points to win the game)\nAdvantage -> If a player wins a point at deuce, they gain the advantage\n```'

In [8]:
len(text_chunks)

7
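
To make the splitting step less of a black box, here is a simplified, standard-library-only sketch of header-based splitting. It is not the `MarkdownHeaderTextSplitter` implementation; the function name `split_on_h2` and the sample text are assumptions made for illustration. It starts a new chunk at every `## ` heading, keeps the heading text as a `title`, and ignores headings that appear inside fenced code blocks.

```python
def split_on_h2(markdown_text):
    """Split markdown into chunks at every level-2 heading (illustrative only)."""
    chunks = []
    title, body, in_fence = None, [], False

    def flush():
        # Store the text collected so far, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"title": title, "page_content": text})

    for line in markdown_text.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        if line.startswith("## ") and not in_fence:
            flush()
            title, body = line[3:].strip(), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Tennis\n\n## Introduction\nTennis is a sport.\n\n## Basic Rules\n- Best of three or five sets."
for chunk in split_on_h2(sample):
    print(chunk)
```

As in the notebook output above, the text before the first `##` heading (here the `# Tennis` title) ends up as its own chunk without a title.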

**Step 3: Creating Embeddings**

In [9]:
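embedding_model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_chunk(chunk):
    """Embed one text chunk; returns an array of shape (1, embedding_dim)."""
    # normalize_embeddings=True scales the vector to unit length.
    return embedding_model.encode([chunk], normalize_embeddings=True)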



In [10]:
sample_embedding = embed_chunk(text_chunks[1]).tolist()[0]


In [11]:
print(sample_embedding)

[0.043128401041030884, 0.013731017708778381, 0.040937382727861404, -0.060120921581983566, -0.11004157364368439, 0.03762723505496979, 0.06258527934551239, 0.058430612087249756, 0.07231482118368149, 0.13938894867897034, -0.08466644585132599, 0.03008531779050827, -0.008123097009956837, 0.01305976789444685, 0.028446480631828308, -0.0328884981572628, 0.01718791201710701, -0.0006705721607431769, 0.03371882066130638, 0.03483973443508148, 0.006430466193705797, -0.06199755147099495, 0.02903241105377674, -0.1022266075015068, -0.023730630055069923, -0.0007567994180135429, -0.04043148085474968, 0.06892861425876617, -0.07998340576887131, 0.03472358360886574, -0.027686500921845436, 0.01230208296328783, -0.03259429708123207, 0.04529410973191261, -0.19071798026561737, 0.003529939102008939, -0.017625831067562103, 0.04938877746462822, -0.021366601809859276, 0.017831694334745407, 0.03800325468182564, -0.031719643622636795, 0.017327267676591873, 0.04867412894964218, 0.007676777429878712, 0.105700135231018

In [12]:
len(sample_embedding)

384
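
Because `normalize_embeddings=True` scales each vector to unit length, the cosine similarity of two embeddings reduces to a plain dot product. The sketch below illustrates that with small made-up vectors in pure Python; `l2_normalize` and `dot` are hypothetical helpers, and the numbers are not model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(a)          # unit-length version of [3, 4]
print(dot(a, a))  # a unit vector has a dot product of about 1.0 with itself
print(dot(a, b))  # equals the cosine similarity of the two original vectors
```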

In [13]:
pip install chromadb


In [14]:
# Step 4: Store embeddings in ChromaDB

vector_db = Chroma.from_texts(text_chunks, HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"), persist_directory="/tmp/chroma_db")




In [15]:
vector_db._collection.get(include=['embeddings','documents'])

{'ids': ['5c8abe7d-f2ea-4f78-a29a-e8fead2e0059',
  'a0da9bd0-ed4e-4fd5-b7ec-df85e4c6c90a',
  '79165dd8-649f-4e29-9d6e-713482fe2486',
  'fd1b6317-0bc1-4ede-8d1e-0881a5a38b61',
  '5561fac7-1ee3-4585-809c-b1dd2bf033a6',
  'a4489d0d-1afc-40c3-9af8-2937a7b9cf52',
  '8ebc5494-90ba-43c8-8f95-82fda1671445'],
 'embeddings': array([[ 0.0227568 ,  0.05737348,  0.06708645, ..., -0.09128203,
          0.03132669,  0.02229537],
        [ 0.04312836,  0.013731  ,  0.04093744, ..., -0.01512668,
         -0.00087324,  0.03250713],
        [ 0.03571488,  0.02733695, -0.01912353, ...,  0.03562167,
         -0.01786362,  0.02190564],
        ...,
        [ 0.01810161,  0.02553317,  0.02062308, ..., -0.07975577,
         -0.04529488,  0.0281473 ],
        [ 0.05494464,  0.04179734,  0.03473552, ..., -0.04557597,
          0.05043149,  0.04526827],
        [ 0.02283715,  0.03019896,  0.07116921, ..., -0.01202571,
         -0.01404738,  0.03577407]]),
 'documents': ['# Tennis',
  "Tennis is a popular sport p
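
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by ranking chunks by their distance to the query vector. The following sketch does that by brute force in pure Python. It is an illustration only and not how Chroma works internally; the class `TinyVectorStore`, its methods, and the two-dimensional vectors are all assumptions.

```python
class TinyVectorStore:
    """Hypothetical in-memory store: a list of (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    def search(self, query_vector, k=1):
        """Return the k closest texts with their squared L2 distance (lower is closer)."""
        scored = [
            (text, sum((q - v) ** 2 for q, v in zip(query_vector, vector)))
            for text, vector in self._items
        ]
        return sorted(scored, key=lambda pair: pair[1])[:k]


store = TinyVectorStore()
store.add("rules", [1.0, 0.0])
store.add("tournaments", [0.0, 1.0])
print(store.search([0.9, 0.1], k=1))  # the "rules" entry is the nearest one
```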

In [16]:
# Step 5: Set up an LLM
pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")


Device set to use cpu


In [17]:
# Step 6: Retrieval and Generation
def retrieve_and_generate(query, threshold=1):
    """Retrieve the closest chunk from the vector database and generate an answer from it."""
    # Chroma returns (Document, score) pairs; the score is a distance, so lower means closer.
    search_results = vector_db.similarity_search_with_score(query, k=1)

    print(search_results)

    # Decline to answer when nothing is retrieved or the best match is too far from the query.
    if not search_results or search_results[0][1] > threshold:
        return "I don't know the answer. There is no available context in vector DB."

    retrieved_context = search_results[0][0].page_content
    similarity_score = search_results[0][1]
    print(f"Similarity Score: {similarity_score}")
    print(f"Retrieved Context: {retrieved_context}")

    prompt = f"Answer the question using the given context\nContext: {retrieved_context}\nQuestion: {query}\nAnswer: "
    print(prompt)
    # By default, the text-generation pipeline returns the prompt followed by the completion.
    response = pipe(prompt, max_new_tokens=100)
    return response[0]["generated_text"]
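
The `threshold` is compared against a distance, so it helps to know what a given value means. Assuming the embeddings are unit length and the collection measures squared L2 distance (both are assumptions about this setup, not something the notebook verifies), the distance `d` and the cosine similarity are related by `d = 2 - 2 * cos`. The helper names below are hypothetical.

```python
def cosine_from_squared_l2(distance):
    """For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b), so cos = 1 - d / 2."""
    return 1.0 - distance / 2.0


def is_relevant(distance, threshold=1.0):
    """Mirror the notebook's check: accept a match only if it is close enough."""
    return distance <= threshold


print(cosine_from_squared_l2(1.0))  # under these assumptions, threshold 1 means cosine 0.5
print(is_relevant(0.28))            # True: well inside the threshold
print(is_relevant(1.4))             # False: too far from the query
```

Under these assumptions, lowering `threshold` makes the "I don't know" branch trigger more often, and raising it lets weaker matches through.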

In [18]:
question = "what is tennis"
response = retrieve_and_generate(question)
print(response)

[(Document(metadata={}, page_content="Tennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net into the opponent's court."), 0.2799421548843384)]
Similarity Score: 0.2799421548843384
Retrieved Context: Tennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net into the opponent's court.
Answer the question using the given context
Context: Tennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net into the opponent's court.
Question: what is tennis
Answer: 
Answer the question using the given context
Context: Tennis is a popular sport played between two players (singles) or two teams of two players each (doubles). The game involves using a racket to hit a ball over a net i
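
The printed response repeats the prompt before the model's answer, because the generated text includes the input. One way to keep only the answer is to cut the prompt off the front of the string. The helper `strip_prompt` and the sample strings below are illustrative assumptions, not output from the model.

```python
def strip_prompt(generated_text, prompt):
    """Return only the text that follows the prompt, if the prompt is echoed."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = "Context: Tennis is a racket sport.\nQuestion: what is tennis\nAnswer: "
generated = prompt + "Tennis is a racket sport."
print(strip_prompt(generated, prompt))
```

The `transformers` text-generation pipeline also accepts `return_full_text=False`, which has a similar effect without post-processing.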