# This notebook is all about chunking

Chunking is a technique used in natural language processing (NLP) to group words into meaningful phrases or "chunks." This helps in understanding the structure of sentences and extracting useful information. While the most common form of chunking is **noun phrase chunking** (identifying noun phrases like "the big dog"), there are other types of chunking as well. Below, I'll explain some of these types in simple terms, provide examples, and include sample Python code.
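
To make the idea of a noun phrase chunk concrete before the NLTK examples, here is a minimal, dependency-free sketch. It assumes the tokens have already been tagged by hand (the tags below are illustrative, not tagger output), and the helper name `np_chunks` is hypothetical. It applies the rule "optional determiner, any adjectives, one or more nouns" as a regular expression over the tag sequence, which is roughly the idea behind regex-based chunk grammars.

```python
import re

# Hand-tagged tokens (assumed tags; a real pipeline would get these from a POS tagger)
tagged = [("The", "DT"), ("big", "JJ"), ("dog", "NN"), ("chased", "VBD"),
          ("a", "DT"), ("cat", "NN"), (".", ".")]

def np_chunks(tagged_tokens):
    """Return noun phrases: optional determiner, any adjectives, one or more nouns."""
    # Encode the tag sequence as a string like "<DT><JJ><NN>..." so a regex can scan it
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<" characters
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return chunks

print(np_chunks(tagged))  # ['The big dog', 'a cat']
```

The verb "chased" and the final period fall outside every match, so they stay unchunked, just as unmatched tokens stay at the top level of an NLTK chunk tree.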

---

### 1. **Verb Phrase Chunking**
Verb phrase chunking identifies groups of words that form a verb phrase, such as "is running," "will eat," or "has been completed."

#### Example:
**Input Sentence:**  
"The cat is sleeping on the couch."

**Output:**  
- Verb Phrase: "is sleeping"

#### Code:
```python
import nltk
from nltk import word_tokenize, pos_tag
from nltk.chunk import RegexpParser

# Sample sentence
sentence = "The cat is sleeping on the couch."

# Tokenize and POS tagging
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

# Define a grammar for verb phrase chunking
grammar = r"""
    VP: {<MD>?<VB.*>+}  # Optional modal followed by one or more verbs, e.g. "is sleeping" or "will eat"
"""

# Create a parser and parse the sentence
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Print the result
print("Parsed Tree:")
print(tree)

# Extract verb phrases
verb_phrases = []
for subtree in tree.subtrees():
    if subtree.label() == "VP":
        verb_phrases.append(" ".join(word for word, tag in subtree.leaves()))

print("\nVerb Phrases:")
print(verb_phrases)
```

**Output:**
```
Parsed Tree:
(S
  The/DT
  cat/NN
  (VP is/VBZ sleeping/VBG)
  on/IN
  the/DT
  couch/NN
  ./.)

Verb Phrases:
['is sleeping']
```

---

### 2. **Prepositional Phrase Chunking**
Prepositional phrase chunking identifies phrases starting with prepositions, such as "on the table," "in the park," or "with a smile."

#### Example:
**Input Sentence:**  
"The book is on the table near the window."

**Output:**  
- Prepositional Phrase: "on the table"  
- Prepositional Phrase: "near the window"

#### Code:
```python
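# Sample sentence for this section (reuses the imports from the first example)
sentence = "The book is on the table near the window."
pos_tags = pos_tag(word_tokenize(sentence))

# Define a grammar for prepositional phrase chunking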
grammar = r"""
    PP: {<IN><DT>?<JJ>*<NN.*>+}  # Preposition followed by optional determiner, adjectives, and nouns
"""

# Create a parser and parse the sentence
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Print the result
print("Parsed Tree:")
print(tree)

# Extract prepositional phrases
prepositional_phrases = []
for subtree in tree.subtrees():
    if subtree.label() == "PP":
        prepositional_phrases.append(" ".join(word for word, tag in subtree.leaves()))

print("\nPrepositional Phrases:")
print(prepositional_phrases)
```

**Output:**
```
Parsed Tree:
(S
  The/DT
  book/NN
  is/VBZ
  (PP on/IN the/DT table/NN)
  (PP near/IN the/DT window/NN)
  ./.)

Prepositional Phrases:
['on the table', 'near the window']
```

---

### 3. **Adjective Phrase Chunking**
Adjective phrase chunking identifies phrases centered around adjectives, such as "very happy," "extremely tired," or "quite interesting."

#### Example:
**Input Sentence:**  
"The movie was extremely boring and very long."

**Output:**  
- Adjective Phrase: "extremely boring"  
- Adjective Phrase: "very long"

#### Code:
```python
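# Sample sentence for this section (reuses the imports from the first example)
sentence = "The movie was extremely boring and very long."
pos_tags = pos_tag(word_tokenize(sentence))

# Define a grammar for adjective phrase chunking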
grammar = r"""
    ADJP: {<RB.*>*<JJ>}  # Optional adverbs followed by an adjective
"""

# Create a parser and parse the sentence
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Print the result
print("Parsed Tree:")
print(tree)

# Extract adjective phrases
adjective_phrases = []
for subtree in tree.subtrees():
    if subtree.label() == "ADJP":
        adjective_phrases.append(" ".join(word for word, tag in subtree.leaves()))

print("\nAdjective Phrases:")
print(adjective_phrases)
```

**Output:**
```
Parsed Tree:
(S
  The/DT
  movie/NN
  was/VBD
  (ADJP extremely/RB boring/JJ)
  and/CC
  (ADJP very/RB long/JJ)
  ./.)

Adjective Phrases:
['extremely boring', 'very long']
```

---

### 4. **Sentence Chunking**
Sentence chunking (more commonly called sentence segmentation or sentence tokenization) divides a paragraph into individual sentences. This is useful for breaking large blocks of text into manageable units.

#### Example:
**Input Paragraph:**  
"The weather is nice. We went for a walk. It was fun."

**Output:**  
- Sentence 1: "The weather is nice."  
- Sentence 2: "We went for a walk."  
- Sentence 3: "It was fun."

#### Code:
```python
from nltk.tokenize import sent_tokenize

# Sample paragraph
paragraph = "The weather is nice. We went for a walk. It was fun."

# Split into sentences
sentences = sent_tokenize(paragraph)

print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
```

**Output:**
```
Sentences:
Sentence 1: The weather is nice.
Sentence 2: We went for a walk.
Sentence 3: It was fun.
```

---

### Summary Table of Chunking Types

| **Type of Chunking**       | **Description**                                | **Example Output**                  |
|-----------------------------|------------------------------------------------|--------------------------------------|
| Noun Phrase Chunking        | Groups nouns and related words                | "The big dog"                       |
| Verb Phrase Chunking        | Groups verbs and related words                | "is running"                        |
| Prepositional Phrase Chunking | Groups prepositions and related words        | "on the table"                      |
| Adjective Phrase Chunking   | Groups adjectives and related words           | "extremely boring"                  |
| Sentence Chunking           | Divides text into individual sentences        | "The weather is nice."              |

---

### Final Answer:
The **five types of chunking** covered above are:  
1. Noun Phrase Chunking  
2. Verb Phrase Chunking  
3. Prepositional Phrase Chunking  
4. Adjective Phrase Chunking  
5. Sentence Chunking  

Each type serves a specific purpose in analyzing and structuring text data.

Semantic chunking is a more advanced form of chunking that groups words or phrases by their **meaning** rather than only their grammatical structure. It goes beyond traditional syntactic chunking methods (like noun phrase or verb phrase chunking) and aims to capture the **semantic relationships** between words.

Let’s break it down step by step:

---

### **What is Semantic Chunking?**
Semantic chunking involves identifying meaningful units of text based on the **contextual meaning** of words or phrases. For example:
- Instead of just identifying "noun phrases" or "verb phrases," semantic chunking might group together words that represent a single concept, such as "machine learning model" or "climate change."

#### Example:
**Input Sentence:**  
"Artificial intelligence is revolutionizing healthcare by improving diagnostic accuracy."

**Output (Semantic Chunks):**
- "Artificial intelligence"
- "revolutionizing healthcare"
- "improving diagnostic accuracy"

Here, the chunks are not strictly grammatical but represent **meaningful concepts** in the sentence.

---

### **Why Use Semantic Chunking?**
Semantic chunking is particularly useful when you need to extract **high-level information** from text for tasks like:
1. **Information Retrieval**: Extract key concepts or topics from documents.
2. **Summarization**: Identify the most important ideas in a text.
3. **Question Answering**: Understand the context of a question and retrieve relevant answers.
4. **Topic Modeling**: Group related concepts together for analysis.
5. **Knowledge Graph Construction**: Build structured representations of relationships between entities.

For example, if you're building a search engine, semantic chunking can help identify phrases like "electric vehicles" or "renewable energy" as single units of meaning, making it easier to match queries to relevant documents.

---

### **When to Use Semantic Chunking vs. Other Methods?**

| **Method**                  | **When to Use**                                                                                   | **Example Task**                              |
|-----------------------------|---------------------------------------------------------------------------------------------------|-----------------------------------------------|
| **Syntactic Chunking**      | When you need to analyze grammatical structure (e.g., noun phrases, verb phrases).               | Parsing sentences for grammar-based analysis. |
| **Semantic Chunking**       | When you need to extract meaningful concepts or ideas from text.                                 | Summarizing articles, topic modeling.         |
| **Sentence Chunking**       | When you need to split text into individual sentences for processing.                           | Breaking paragraphs into manageable units.    |
| **Prepositional Phrase Chunking** | When you need to identify relationships introduced by prepositions (e.g., location, time).     | Extracting relationships like "in the park."  |

---

### **How Does Semantic Chunking Work?**
Semantic chunking typically relies on techniques like:
1. **Named Entity Recognition (NER)**: Identifies entities like people, organizations, locations, etc.
2. **Dependency Parsing**: Analyzes the grammatical relationships between words to understand how they relate semantically.
3. **Word Embeddings**: Uses static embeddings like Word2Vec or GloVe, or contextual models like BERT, to represent the meaning of words as vectors.
4. **Custom Rules or Machine Learning Models**: Combines linguistic rules or trained models to group words based on their semantic similarity.

---

### **Example of Semantic Chunking**
#### Input:
"The AI-powered chatbot improved customer satisfaction by providing instant responses."

#### Output (Semantic Chunks):
- "AI-powered chatbot"
- "customer satisfaction"
- "instant responses"

Here, the chunks represent meaningful concepts rather than strict grammatical structures.

---

### **Code Example for Semantic Chunking**
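Below is an example of how you might approximate semantic chunking using **spaCy**, a popular NLP library. Note that `doc.noun_chunks` returns syntactic noun phrases, so this is a rough proxy rather than true meaning-based grouping: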

```python
import spacy

# Load SpaCy's English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The AI-powered chatbot improved customer satisfaction by providing instant responses."

# Process the text
doc = nlp(text)

# Extract semantic chunks using noun chunks (SpaCy's built-in functionality)
semantic_chunks = [chunk.text for chunk in doc.noun_chunks]

print("Semantic Chunks:")
print(semantic_chunks)
```

#### Output:
```
Semantic Chunks:
['The AI-powered chatbot', 'customer satisfaction', 'instant responses']
```

---

### **Advanced Semantic Chunking with Dependency Parsing**
If you want to go deeper, you can use dependency parsing to identify relationships between words and group them semantically.

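```python
# Extract chunks from the dependency parse (reuses `doc` from the previous example)
semantic_chunks = []
for token in doc:
    if token.dep_ in ("nsubj", "dobj", "pobj"):  # Heads of subject and object phrases
        # Slice the Doc from the leftmost to the rightmost token of the subtree so the
        # original spacing is kept (joining tokens with spaces would split "AI-powered")
        span = doc[token.left_edge.i : token.right_edge.i + 1]
        semantic_chunks.append(span.text)

# Remove duplicates while preserving order
semantic_chunks = list(dict.fromkeys(semantic_chunks))

print("Advanced Semantic Chunks:")
print(semantic_chunks)
```

#### Output (the exact chunks depend on the model version and its parse):
```
Advanced Semantic Chunks:
['The AI-powered chatbot', 'customer satisfaction', 'instant responses']
```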

---

### **Comparison of Methods**

| **Chunking Type**            | **Focus**                                      | **Example Use Case**                          |
|-------------------------------|------------------------------------------------|-----------------------------------------------|
| **Noun Phrase Chunking**      | Grammatical structure (noun + modifiers)       | Extracting names of products or entities.     |
| **Verb Phrase Chunking**      | Action phrases (verbs + objects)               | Identifying actions in legal documents.       |
| **Prepositional Phrase Chunking** | Relationships (prepositions + objects)      | Extracting locations or times.                |
| **Semantic Chunking**         | Meaningful concepts (contextual understanding) | Summarizing articles or building knowledge graphs. |

---

### **Real-World Applications of Semantic Chunking**
1. **Search Engines**: Group related terms to improve query matching.
   - Example: A search for "best electric cars" should return results about "Tesla Model 3" and "electric vehicles."
2. **Chatbots**: Understand user intent by grouping related concepts.
   - Example: "Book a flight to New York" → Chunk: "flight to New York."
3. **Document Analysis**: Extract key topics from research papers.
   - Example: "Machine learning algorithms" → Chunk: "machine learning algorithms."
4. **Legal Document Processing**: Identify clauses, conditions, or obligations.
   - Example: "The party agrees to pay $1000" → Chunk: "pay $1000."

---

### **Final Answer**
Semantic chunking is used to group words or phrases based on their **meaning** rather than just their grammatical structure. It is particularly useful for tasks like summarization, topic modeling, and knowledge graph construction. While syntactic chunking focuses on grammar, semantic chunking focuses on **contextual relationships** between words. 

To summarize:
- **Use syntactic chunking** for grammatical analysis.
- **Use semantic chunking** for extracting meaningful concepts and relationships.

**Boxed Answer:**
Semantic chunking groups words/phrases based on meaning, and is ideal for summarization, topic modeling, and knowledge extraction.

### **Other Types of Chunking Methods**  
Chunking is a way of breaking down large text into smaller pieces. Depending on your use case (AI models, search, summarization, etc.), different chunking strategies can be used.  

---

## **1. Fixed-Length Chunking**  
Splits text into fixed-size chunks (e.g., 1000 characters or 200 tokens).  
📌 **Use Case:** When you don’t need to preserve sentence structure and just want equal-sized chunks.  

🔹 **Example (Character-Based Split)**  
```python
def fixed_length_chunking(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

text = "This is a sample document. It contains multiple sentences and paragraphs."
chunks = fixed_length_chunking(text, chunk_size=10)
print(chunks)  # ['This is a ', 'sample doc', 'ument. It ', 'contains m', ...]
```

🔹 **Example (Token-Based Split using `tiktoken`)**  
```python
import tiktoken

def token_based_chunking(text, chunk_size=200):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    return [tokens[i:i+chunk_size] for i in range(0, len(tokens), chunk_size)]

tokens_chunks = token_based_chunking(text, chunk_size=20)
print(tokens_chunks)  # Lists of token IDs, not strings; decode each with the same encoding's decode() to recover text
```

---

## **2. Sentence-Based Chunking**  
Splits text **at sentence boundaries** to avoid cutting sentences in half.  
📌 **Use Case:** When each chunk should contain only whole sentences, for example as input to summarization.  

🔹 **Example (Using `nltk`)**  
```python
from nltk.tokenize import sent_tokenize

def sentence_based_chunking(text, max_sentences=2):
    sentences = sent_tokenize(text)
    # Sentences keep their trailing punctuation, so join with a plain space
    return [" ".join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

text = "This is sentence one. This is sentence two. This is sentence three. This is sentence four."
chunks = sentence_based_chunking(text, max_sentences=2)
print(chunks)  # ['This is sentence one. This is sentence two.', 'This is sentence three. This is sentence four.']
```
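
Grouping a fixed number of sentences can still produce chunks of very different lengths. As a rough sketch of one alternative, the hypothetical helper below packs whole sentences into chunks up to a character budget. It uses a naive regex splitter instead of `sent_tokenize` so that it runs without NLTK data; the `max_chars` parameter and the splitting rule are assumptions for illustration only.

```python
import re

def pack_sentences(text, max_chars=50):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive splitter: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "This is sentence one. This is sentence two. This is sentence three. This is sentence four."
print(pack_sentences(text, max_chars=50))
# ['This is sentence one. This is sentence two.', 'This is sentence three. This is sentence four.']
```

A single sentence longer than `max_chars` is kept whole rather than cut, so the budget is a target, not a hard limit.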

---

## **3. Semantic Chunking (Using Embeddings)**  
Uses **semantic meaning** to split text into coherent sections instead of arbitrary cuts.  
📌 **Use Case:** When preserving meaning is critical, such as **retrieval-augmented generation (RAG)** for LLMs.  

🔹 **Example (Using LangChain's `SemanticChunker` with OpenAI Embeddings)**  
```python
# Requires the langchain-experimental and langchain-openai packages,
# plus an OPENAI_API_KEY environment variable for the embedding calls.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()
splitter = SemanticChunker(embedding_model)
chunks = splitter.split_text("This is a paragraph about AI. AI is transforming industries. Another topic is climate change.")
print(chunks)
```
🔹 **How It Works**  
- Uses **embeddings** (word meaning representations) to group sentences that are semantically related.  
- Avoids cutting related content in half.  
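
The grouping idea can be sketched without any embedding service. The example below is a toy illustration, not how any particular library implements it: `embed` is a hypothetical stand-in that uses bag-of-words counts instead of a real embedding model, and the `threshold` value is an arbitrary assumption. A new chunk starts whenever two adjacent sentences fall below the similarity threshold.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.3):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) >= threshold:
            chunks[-1].append(curr)   # similar enough: extend the current chunk
        else:
            chunks.append([curr])     # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "This is a paragraph about AI.",
    "AI is transforming industries.",
    "Another topic is climate change.",
]
print(semantic_chunks(sentences))
# ['This is a paragraph about AI. AI is transforming industries.', 'Another topic is climate change.']
```

Word counts are a crude proxy: common words such as "is" inflate the similarity, which is why real systems use learned embeddings and tune the breakpoint threshold on their own data.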

---

## **4. Overlapping Chunking (Sliding Window)**  
Splits text into chunks with **overlapping portions** for better context retention.  
📌 **Use Case:** When using AI models where **previous context matters**, like chat history or document search.  

🔹 **Example (Using `RecursiveCharacterTextSplitter`)**  
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = text_splitter.split_text("This is a long document. It contains important information that should not be lost.")
print(chunks)
```
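🔹 **How It Works**  
- Each chunk shares **up to 100 characters** (`chunk_overlap`) with the previous chunk.  
- The overlap carries context across chunk boundaries when the chunks are fed to AI models.  
- The sample text above is shorter than `chunk_size`, so it comes back as a single chunk; overlap only appears on longer inputs.  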
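
To see the overlap directly, here is a minimal character-level sliding window written in plain Python. It is a sketch of the general technique, not of how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as newlines). The function name and the small `chunk_size` and `overlap` values are assumptions chosen so the overlap is visible on a short string.

```python
def sliding_window_chunks(text, chunk_size=20, overlap=5):
    """Split text into character windows; each window repeats the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

text = "This is a long document. It contains important information."
chunks = sliding_window_chunks(text, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each printed chunk begins with the final five characters of the one before it, so dropping the first `overlap` characters of every chunk after the first reconstructs the original text.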

---

## **5. Paragraph-Based Chunking**  
Splits text **at paragraph boundaries** instead of arbitrary sizes.  
📌 **Use Case:** When working with structured text like **articles, books, or reports**.  

🔹 **Example (Splitting by Double Newline `\n\n`)**  
```python
def paragraph_chunking(text):
    return text.split("\n\n")  # Assumes paragraphs are separated by double newlines

text = "Paragraph 1.\n\nParagraph 2.\n\nParagraph 3."
chunks = paragraph_chunking(text)
print(chunks)  # ['Paragraph 1.', 'Paragraph 2.', 'Paragraph 3.']
```
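
Real documents often separate paragraphs with more than two newlines, or with blank lines that contain stray spaces. A slightly more forgiving variant, sketched here under the assumption that any whitespace-only line marks a paragraph break, splits with a regular expression and drops empty pieces. The name `paragraph_chunks` is hypothetical.

```python
import re

def paragraph_chunks(text):
    """Split on one or more blank (whitespace-only) lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]

text = "Paragraph 1.\n\n\nParagraph 2.\n \nParagraph 3.\n"
print(paragraph_chunks(text))  # ['Paragraph 1.', 'Paragraph 2.', 'Paragraph 3.']
```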

---

## **6. Topic-Based Chunking (Hierarchical Splitting)**  
Splits text based on **topics or section headings**.  
📌 **Use Case:** When working with structured documents like **research papers, books, or legal documents**.  

🔹 **Example (Using Regular Expressions to Detect Headings)**  
```python
import re

def topic_based_chunking(text):
    return re.split(r'\n(?=\s*\d+\.\s)', text)  # Splits before numbered headings like "2. Methods" and keeps the number

text = "1. Introduction\nThis is the intro.\n2. Methods\nThis is the methods section."
chunks = topic_based_chunking(text)
print(chunks)  # ['1. Introduction\nThis is the intro.', '2. Methods\nThis is the methods section.']
```

---

## **7. Hybrid Chunking**  
Combines **multiple chunking strategies** (e.g., paragraph + token-based).  
📌 **Use Case:** When you need a balance between **structure and token limits**.  

🔹 **Example (First Paragraph, then Token-Based Split)**  
```python
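def hybrid_chunking(text, token_limit=300):
    # Reuses token_based_chunking() from the fixed-length section above
    paragraphs = text.split("\n\n")
    token_chunks = []
    for para in paragraphs:
        # Split each paragraph separately so no chunk crosses a paragraph boundary
        token_chunks.extend(token_based_chunking(para, chunk_size=token_limit))
    return token_chunks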
```
🔹 **Why Use Hybrid?**  
- Ensures **natural structure** (paragraphs).  
- Respects **token constraints** for AI models.  
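
The same two-stage idea can be tried without a tokenizer dependency. The sketch below is an illustration only: it counts whitespace-separated words as a rough stand-in for model tokens (real token counts will differ), and the `hybrid_chunks` name and `max_words` parameter are assumptions.

```python
def hybrid_chunks(text, max_words=8):
    """Split on blank lines first, then break any oversized paragraph into word windows."""
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue  # skip empty paragraphs produced by extra blank lines
        # Paragraphs within the limit come out whole; longer ones are cut every max_words words
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

text = "Short paragraph.\n\nThis second paragraph is long enough that it has to be split into two pieces."
print(hybrid_chunks(text, max_words=8))
# ['Short paragraph.', 'This second paragraph is long enough that it', 'has to be split into two pieces.']
```

Because the inner split restarts at every paragraph, a chunk never mixes text from two paragraphs, at the cost of some chunks being much shorter than the limit.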

---

## **🔹 Summary Table**
| **Chunking Method** | **Use Case** | **Pros** | **Cons** |
|----------------|----------------|----------------|----------------|
| **Fixed-Length Chunking** | General purpose, simple splitting | Fast & easy | May break sentences mid-way |
| **Sentence-Based Chunking** | Summarization, NLP models | Preserves meaning | Uneven chunk sizes |
| **Semantic Chunking** | AI retrieval, LLMs | Smart splits | Requires embeddings model |
| **Overlapping Chunking** | LLM context retention | Better AI responses | Higher token cost |
| **Paragraph-Based Chunking** | Structured documents (articles, books) | Retains formatting | Large paragraph sizes |
| **Topic-Based Chunking** | Research papers, structured reports | Logical splits | Needs clean headings |
| **Hybrid Chunking** | AI applications with token limits | Best of both worlds | More complex |

---

## **💡 Which One Should You Use?**
- **For AI models (like GPT, Llama2)** → **Overlapping, Semantic, or Hybrid Chunking**  
- **For Summarization & NLP tasks** → **Sentence-Based or Paragraph-Based Chunking**  
- **For Token-Limited Models (OpenAI, Anthropic, etc.)** → **Fixed-Length or Token-Based Chunking**  
- **For Legal/Research Docs** → **Topic-Based or Paragraph-Based Chunking**  
