# 📝 Document Chunking, Embedding, and Retrieval
This notebook demonstrates four chunking methods (fixed-size, sentence, schematic, code) built on **spaCy** and **regex**.
We build TF-IDF vectors and keep them in memory (no external databases).
Then we retrieve the most relevant chunks for an input query.

In [2]:
# 📦 Install required dependencies (IPython magics, so the cell runs inside the notebook)
# sentence-transformers is needed by the embedding cells further down
%pip install scikit-learn spacy sentence-transformers
!python -m spacy download en_core_web_sm

In [3]:
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# ✅ Load spaCy English model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    raise OSError("spaCy model 'en_core_web_sm' not found. Run: python3 -m spacy download en_core_web_sm")

# ✅ Fixed-size word chunks
def fixed_chunk(text, chunk_size=200):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# ✅ Sentence-based chunks (spaCy)
def sentence_chunk(text, max_sentences=3):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    return [
        " ".join(sentences[i:i+max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# ✅ Schematic chunks (split by headings/numbers)
def schematic_chunk(text):
    parts = re.split(r'(?m)^#{1,6}\s.*|^\d+\.\s', text)
    return [p.strip() for p in parts if p.strip()]

# ✅ Code chunks (split on ``` fenced blocks; the blocks themselves are dropped)
def code_chunk(text):
    parts = re.split(r'```.*?```', text, flags=re.S)
    return [p.strip() for p in parts if p.strip()]

In [4]:
# 📄 Example input text
text = """
1. User enters username and password.
System validates input.
Then the system sends an OTP.
The user enters the OTP.
System verifies and logs in.

```python
print("Hello World")
```
"""

print("🔹 Fixed chunk:", fixed_chunk(text, chunk_size=10))
print("🔹 Sentence chunk:", sentence_chunk(text, max_sentences=2))
print("🔹 Schematic chunk:", schematic_chunk(text))
print("🔹 Code chunk:", code_chunk(text))

🔹 Fixed chunk: ['1. User enters username and password. System validates input. Then', 'the system sends an OTP. The user enters the OTP.', 'System verifies and logs in. ```python print("Hello World") ```']
🔹 Sentence chunk: ['1. User enters username and password.', 'System validates input. Then the system sends an OTP.', 'The user enters the OTP. System verifies and logs in.', '```python\nprint("Hello World")\n```']
🔹 Schematic chunk: ['User enters username and password.\nSystem validates input.\nThen the system sends an OTP.\nThe user enters the OTP.\nSystem verifies and logs in.\n\n```python\nprint("Hello World")\n```']
🔹 Code chunk: ['1. User enters username and password.\nSystem validates input.\nThen the system sends an OTP.\nThe user enters the OTP.\nSystem verifies and logs in.']
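
As the output above shows, `code_chunk` returns only the prose around a fenced block and discards the block itself. If the code should be retrievable too, one option is to put a capturing group in the split pattern so the fenced blocks come back as their own chunks. The sketch below is a minimal illustration, not part of the original notebook; `code_aware_chunk` and the `("prose" | "code", text)` tuple layout are hypothetical.

```python
import re

FENCE = "`" * 3  # built in code so this example can itself sit inside a fenced block

def code_aware_chunk(text):
    """Split text into ("prose", ...) and ("code", ...) chunks, keeping fenced blocks."""
    # The capturing group makes re.split return the fenced blocks
    # as well as the prose around them.
    pattern = "(" + FENCE + ".*?" + FENCE + ")"
    chunks = []
    for part in re.split(pattern, text, flags=re.S):
        part = part.strip()
        if not part:
            continue
        kind = "code" if part.startswith(FENCE) else "prose"
        chunks.append((kind, part))
    return chunks

sample = "Step one.\n" + FENCE + "python\nprint('Hello World')\n" + FENCE + "\nStep two."
for kind, chunk in code_aware_chunk(sample):
    print(kind, repr(chunk))
```

Tagging each chunk with its kind leaves room to index prose and code differently later on.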


In [5]:
# ⚡ Build TF-IDF index
def build_tfidf_index(chunks):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(chunks)
    return vectorizer, tfidf_matrix

def retrieve(query, chunks, vectorizer, tfidf_matrix, top_k=3):
    query_vec = vectorizer.transform([query])
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = similarities.argsort()[::-1][:top_k]
    return [(chunks[i], similarities[i]) for i in top_indices]

In [6]:
# ✅ Choose chunking method
chunks = sentence_chunk(text, max_sentences=2)

# ✅ Build TF-IDF index
vectorizer, tfidf_matrix = build_tfidf_index(chunks)

# 🔍 Query
query = "How does the system login the user?"
results = retrieve(query, chunks, vectorizer, tfidf_matrix)

print("Query:", query)
for chunk, score in results:
    print(f"\n🔹 Score: {score:.4f}\n{chunk}")

Query: How does the system login the user?

🔹 Score: 0.6588
The user enters the OTP. System verifies and logs in.

🔹 Score: 0.4358
System validates input. Then the system sends an OTP.

🔹 Score: 0.1637
1. User enters username and password.
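
The scores above depend only on exact token overlap: the query shares `the`, `system` and `user` with the chunks, while `login` never matches `logs in`. To make the arithmetic visible, the sketch below recomputes the same cosine scores in plain Python. It assumes scikit-learn's `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation); `tokenize` and `tfidf_scores` are hypothetical helper names, not part of the notebook.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (mirrors TfidfVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_scores(query, chunks):
    """Cosine similarity between the query and each chunk under TF-IDF weighting."""
    docs = [Counter(tokenize(chunk)) for chunk in chunks]
    n = len(docs)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(counts):
        # Terms outside the fitted vocabulary are ignored; the vector is L2-normalised.
        vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    q = vectorize(Counter(tokenize(query)))
    return [sum(w * q.get(t, 0.0) for t, w in vectorize(doc).items()) for doc in docs]

fence = "`" * 3  # built in code so the sample can contain a fenced block
chunks = [
    "1. User enters username and password.",
    "System validates input. Then the system sends an OTP.",
    "The user enters the OTP. System verifies and logs in.",
    fence + 'python\nprint("Hello World")\n' + fence,
]
query = "How does the system login the user?"
for chunk, score in zip(chunks, tfidf_scores(query, chunks)):
    print(f"{score:.4f}  {chunk[:40]!r}")
```

Under those assumptions the printed values should agree with the scikit-learn scores shown above up to rounding, and the code-only chunk scores zero because it shares no token with the query.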


In [7]:
import re
import spacy
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ✅ Embedding model
from sentence_transformers import SentenceTransformer

# Load sentence transformer (local)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ✅ Load spaCy English model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    raise OSError("spaCy model 'en_core_web_sm' not found. Run: python3 -m spacy download en_core_web_sm")

# ---------- Chunking Methods ----------
def fixed_chunk(text, chunk_size=200):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

def sentence_chunk(text, max_sentences=3):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    return [" ".join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

def schematic_chunk(text):
    parts = re.split(r'(?m)^#{1,6}\s.*|^\d+\.\s', text)
    return [p.strip() for p in parts if p.strip()]

def code_chunk(text):
    parts = re.split(r'```.*?```', text, flags=re.S)
    return [p.strip() for p in parts if p.strip()]

# ---------- Build Vector Index ----------
def build_vector_index(chunks):
    # Embedding for each chunk
    embeddings = embedder.encode(chunks)
    return np.array(embeddings)

# ---------- TF-IDF Index ----------
def build_tfidf_index(chunks):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(chunks)
    return vectorizer, tfidf_matrix

# ---------- Retrieval ----------
def retrieve(query, chunks, embeddings, tfidf_vectorizer, tfidf_matrix, top_k=3):
    # Query embedding
    query_vec = embedder.encode([query])

    # Cosine similarity with embedding vectors
    sim_scores = cosine_similarity(query_vec, embeddings)[0]
    top_idx_embed = np.argsort(sim_scores)[::-1][:top_k]

    # TF-IDF similarity
    tfidf_query = tfidf_vectorizer.transform([query])
    tfidf_scores = cosine_similarity(tfidf_query, tfidf_matrix)[0]
    top_idx_tfidf = np.argsort(tfidf_scores)[::-1][:top_k]

    print("\n🔹 Embedding-based results:")
    for idx in top_idx_embed:
        print(f"- {chunks[idx][:80]}... (score: {sim_scores[idx]:.4f})")

    print("\n🔹 TF-IDF results:")
    for idx in top_idx_tfidf:
        print(f"- {chunks[idx][:80]}... (score: {tfidf_scores[idx]:.4f})")




In [8]:
text = """
1. User enters username and password.
2. System validates input.
3. System sends OTP.
4. User enters OTP.
5. System validates OTP and logs in.
"""

# Choose chunking method
chunks = sentence_chunk(text, max_sentences=2)

# Build indexes
embeddings = build_vector_index(chunks)
tfidf_vectorizer, tfidf_matrix = build_tfidf_index(chunks)

# Query
retrieve("How does the system validate user login?", chunks, embeddings, tfidf_vectorizer, tfidf_matrix)



🔹 Embedding-based results:
- System validates OTP and logs in.... (score: 0.6073)
- 1. User enters username and password.... (score: 0.5722)
- 2. System validates input.... (score: 0.4879)

🔹 TF-IDF results:
- User enters OTP. 5.... (score: 0.4692)
- 1. User enters username and password.... (score: 0.3122)
- 3. System sends OTP.
4.... (score: 0.3106)
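
The TF-IDF results above contain chunks such as `User enters OTP. 5.`: on this numbered input, sentence splitting has separated the list markers from the steps they belong to. For text that is purely a numbered list, splitting on the markers (the idea behind `schematic_chunk`) keeps each step whole. The sketch below is one hedged way to do that with a regex; `numbered_step_chunk` is a hypothetical helper, and it assumes every step fits on a single line.

```python
import re

def numbered_step_chunk(text):
    """Return one chunk per numbered line, with the leading 'N.' marker removed."""
    # (?m) lets ^ and $ match at line boundaries; the group captures the step text only.
    return [step.strip() for step in re.findall(r"(?m)^\s*\d+\.\s+(.+)$", text)]

text = """
1. User enters username and password.
2. System validates input.
3. System sends OTP.
4. User enters OTP.
5. System validates OTP and logs in.
"""

for step in numbered_step_chunk(text):
    print(step)
```

The resulting list can be passed to `build_vector_index` and `build_tfidf_index` in place of the `sentence_chunk` output.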


In [9]:
import re
import spacy
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load spaCy
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    raise OSError("Run: python3 -m spacy download en_core_web_sm")

# Embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# ---------- Different Chunking Methods ----------

# 1. Fixed-size word chunks
def fixed_chunk(text, chunk_size=200):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# 2. Sentence-based chunks (spaCy)
def sentence_chunk(text, max_sentences=3):
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    return [" ".join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

# 3. Schematic chunks (headings / numbered lists)
def schematic_chunk(text):
    parts = re.split(r'(?m)^#{1,6}\s.*|^\d+\.\s', text)
    return [p.strip() for p in parts if p.strip()]

# 4. Code-aware chunks (split on ``` fenced blocks; the blocks themselves are dropped)
def code_chunk(text):
    parts = re.split(r'```.*?```', text, flags=re.S)
    return [p.strip() for p in parts if p.strip()]

# 5. Semantic chunks (embedding-based, greedy merge)
def semantic_chunk(text, threshold=0.7):
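    """
    Greedily merges adjacent sentences into one chunk while the cosine
    similarity between each sentence and the previous one is >= threshold.
    """
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    if not sentences:  # empty or whitespace-only input: nothing to chunk
        return []
    embeddings = embedder.encode(sentences)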

    chunks = []
    current_chunk = [sentences[0]]
    prev_vec = embeddings[0]

    for i in range(1, len(sentences)):
        sim = cosine_similarity([prev_vec], [embeddings[i]])[0][0]
        if sim >= threshold:  # semantically close → same chunk
            current_chunk.append(sentences[i])
        else:  # new semantic group
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        prev_vec = embeddings[i]

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# ---------- Chunking Agent ----------
def chunking_agent(text, method="sentence", **kwargs):
    if method == "fixed":
        return fixed_chunk(text, **kwargs)
    elif method == "sentence":
        return sentence_chunk(text, **kwargs)
    elif method == "schematic":
        return schematic_chunk(text)
    elif method == "code":
        return code_chunk(text)
    elif method == "semantic":
        return semantic_chunk(text, **kwargs)
    else:
        raise ValueError(f"Unknown chunking method: {method}")

# ---------- Indexing ----------
def build_indexes(chunks):
    embeddings = embedder.encode(chunks)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(chunks)
    return embeddings, vectorizer, tfidf_matrix

# ---------- Retrieval ----------
def retrieve(query, chunks, embeddings, vectorizer, tfidf_matrix, top_k=3):
    query_vec = embedder.encode([query])
    sim_scores = cosine_similarity(query_vec, embeddings)[0]
    top_idx_embed = np.argsort(sim_scores)[::-1][:top_k]

    tfidf_query = vectorizer.transform([query])
    tfidf_scores = cosine_similarity(tfidf_query, tfidf_matrix)[0]
    top_idx_tfidf = np.argsort(tfidf_scores)[::-1][:top_k]

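    # Merge both rankings, embedding hits first; dict.fromkeys de-duplicates
    # while preserving order (a plain set would scramble the ranking).
    merged_idx = dict.fromkeys(top_idx_embed.tolist() + top_idx_tfidf.tolist())
    retrieved = [chunks[idx] for idx in merged_idx]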

    return retrieved


In [10]:
jira_text = """
As a user, when I try to reset my password,
the system should send an OTP to my registered email.
Currently, OTP is not delivered if the email contains a '+' symbol.
Steps:
1. Open login page
2. Click forgot password
3. Enter email
Expected: OTP should be delivered
Actual: Error displayed
"""

# Try different chunking styles
print("📌 Sentence Chunking:", chunking_agent(jira_text, method="sentence", max_sentences=2))
print("📌 Fixed Chunking:", chunking_agent(jira_text, method="fixed", chunk_size=10))
print("📌 Schematic Chunking:", chunking_agent(jira_text, method="schematic"))
print("📌 Code Chunking:", chunking_agent(jira_text, method="code"))
print("📌 Semantic Chunking:", chunking_agent(jira_text, method="semantic", threshold=0.75))


📌 Sentence Chunking: ["As a user, when I try to reset my password,\nthe system should send an OTP to my registered email. Currently, OTP is not delivered if the email contains a '+' symbol.", 'Steps:\n1. Open login page\n2.', 'Click forgot password\n3. Enter email\nExpected: OTP should be delivered\nActual:', 'Error displayed']
📌 Fixed Chunking: ['As a user, when I try to reset my password,', 'the system should send an OTP to my registered email.', 'Currently, OTP is not delivered if the email contains a', "'+' symbol. Steps: 1. Open login page 2. Click forgot", 'password 3. Enter email Expected: OTP should be delivered Actual:', 'Error displayed']
📌 Schematic Chunking: ["As a user, when I try to reset my password,\nthe system should send an OTP to my registered email.\nCurrently, OTP is not delivered if the email contains a '+' symbol.\nSteps:", 'Open login page', 'Click forgot password', 'Enter email\nExpected: OTP should be delivered\nActual: Error displayed']
📌 Code Chunking: ["As 
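
To see how the `threshold` in `semantic_chunk` decides where a chunk ends, the greedy merge can be run on hand-made vectors instead of real embeddings. The sketch below mirrors the loop in `semantic_chunk` using only the standard library. The 2-D vectors are made-up stand-ins for sentence embeddings, and `cosine` and `greedy_merge` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_merge(sentences, vectors, threshold=0.7):
    """Merge each sentence into the current chunk if it is close to the previous sentence."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    # Compare every sentence with the one immediately before it.
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) >= threshold:
            current.append(sentence)          # similar enough: same chunk
        else:
            chunks.append(" ".join(current))  # dissimilar: close the chunk, start a new one
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks

sentences = ["Open login page.", "Click forgot password.", "Error displayed."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]  # toy vectors, not real embeddings
print(greedy_merge(sentences, vectors))
```

Because each sentence is compared only with its immediate predecessor, a chunk can drift in topic over a long run of pairwise-similar sentences; raising the threshold gives smaller, tighter chunks.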

In [2]:
# Intent Understanding Prompt Template for JIRA
def get_intent_prompt(jira_heading, jira_description, jira_acceptance_criteria):
    return f"""
    You are an Intent Understanding Agent for JIRA tickets. 
    Your job is to deeply analyze the given JIRA and extract the intent behind it. 
    Focus on the relationship between the JIRA heading, description, and acceptance criteria.

    JIRA Input:
    - Heading: {jira_heading}
    - Description: {jira_description}
    - Acceptance Criteria: {jira_acceptance_criteria}

    Tasks:
    1. Summarize the **core business intent** of this JIRA in 2–3 sentences.
    2. Identify the **main functional requirements** based on the description and acceptance criteria.
    3. Explain how the **acceptance criteria relate** to the description (what part of the description they validate).
    4. List possible **edge cases** or missing points not explicitly covered in the acceptance criteria.
    5. Provide a **clear conceptual understanding** of the JIRA that could be used later for test case generation.
    """


In [3]:
def read_jira_from_txt(filepath):
    heading, description, acceptance = "", "", ""
    with open(filepath, "r", encoding="utf-8") as f:
        lines = f.readlines()
    
    section = None
    for line in lines:
        line = line.strip()
        if line.lower().startswith("heading:"):
            heading = line.split(":", 1)[1].strip()
            section = "heading"
        elif line.lower().startswith("description:"):
            description = line.split(":", 1)[1].strip()
            section = "description"
        elif line.lower().startswith("acceptance criteria:"):
            acceptance = line.split(":", 1)[1].strip()
            section = "acceptance"
        else:
            if section == "description":
                description += " " + line
            elif section == "acceptance":
                acceptance += " " + line
    
    return heading, description, acceptance
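
`read_jira_from_txt` expects each section to start with a `Heading:`, `Description:` or `Acceptance Criteria:` label, and treats the following unlabeled lines as continuations of the description or acceptance criteria. The sketch below shows that layout on an in-memory string. The sample ticket layout is an assumption inferred from the parser, and `parse_jira_text` is a hypothetical string-based variant that also skips blank lines, not the notebook's own function.

```python
def parse_jira_text(text):
    """Hypothetical string-based variant of read_jira_from_txt, for illustration only."""
    fields = {"heading": "", "description": "", "acceptance criteria": ""}
    section = None
    for line in text.splitlines():
        line = line.strip()
        label, sep, rest = line.partition(":")
        if sep and label.lower() in fields:
            # A labeled line starts a new section; text after the colon is its first content.
            section = label.lower()
            fields[section] = rest.strip()
        elif section in ("description", "acceptance criteria") and line:
            # Unlabeled, non-blank lines continue the current multi-line section.
            fields[section] = (fields[section] + " " + line).strip()
    return fields["heading"], fields["description"], fields["acceptance criteria"]

sample = """Heading: Login with OTP Verification
Description:
As a user, I want to log in with an OTP.
Acceptance Criteria:
1. System sends OTP after password validation.
2. OTP must be valid for 2 minutes.
"""

heading, description, acceptance = parse_jira_text(sample)
print(heading)
print(description)
print(acceptance)
```

As in the original function, continuation lines under the heading are ignored, so the heading is expected to fit on one line.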


In [4]:
jira_heading, jira_description, jira_acceptance_criteria = read_jira_from_txt("jira_ticket.txt")
intent_prompt = get_intent_prompt(jira_heading, jira_description, jira_acceptance_criteria)

print(intent_prompt)  # to verify



    You are an Intent Understanding Agent for JIRA tickets. 
    Your job is to deeply analyze the given JIRA and extract the intent behind it. 
    Focus on the relationship between the JIRA heading, description, and acceptance criteria.

    JIRA Input:
    - Heading: Login with OTP Verification
    - Description:  As a user, I want to log into the system using my registered mobile number. After entering the username and password, the system should send an OTP to the registered mobile. The user must enter the OTP within 2 minutes. If OTP verification fails, the login should be denied. 
    - Acceptance Criteria:  1. System sends OTP to the registered mobile number after successful username and password validation. 2. OTP must be valid for 2 minutes. 3. If OTP is incorrect or expired, user should not be logged in. 4. Successful OTP entry grants access to the system dashboard.

    Tasks:
    1. Summarize the **core business intent** of this JIRA in 2–3 sentences.
    2. Identify the 

In [5]:
# Build prompt from TXT file JIRA input
jira_heading, jira_description, jira_acceptance_criteria = read_jira_from_txt("jira_ticket.txt")
intent_prompt = get_intent_prompt(jira_heading, jira_description, jira_acceptance_criteria)

# Send to LLM
# Assumption: the OpenAI Python SDK (v1+); any client exposing chat.completions.create works here
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # Replace with Ollama/local model if needed
    messages=[
        {"role": "system", "content": "You are a helpful AI agent for JIRA analysis."},
        {"role": "user", "content": intent_prompt}
    ]
)

print("📌 JIRA Analysis:\n")
print(response.choices[0].message.content)
