# Chunking Use Cases with Sentence, Sliding Window, and Markdown Snippet Chunkers

In this notebook, we'll explore how to chunk text using more complex strategies:
- **Sentence-Based Chunking**: Dividing text into sentences.
- **Sliding Window Chunking**: Dividing text into chunks based on a sliding window, with and without overlap.
- **Markdown Snippet Chunking**: Extracting fenced code snippets from markdown documents, along with each snippet's language tag.

In [26]:
# Importing necessary chunkers from the swarmauri library
from swarmauri.chunkers.concrete.SentenceChunker import SentenceChunker
from swarmauri.chunkers.concrete.SlidingWindowChunker import SlidingWindowChunker
from swarmauri.chunkers.concrete.MdSnippetChunker import MdSnippetChunker

## Sentence-Based Chunking

A `SentenceChunker` splits text into sentences at sentence-ending punctuation. This is useful when later steps, such as linguistic processing or document analysis, operate on one sentence at a time.

In [10]:
# SentenceChunker usage example
sentence_chunker = SentenceChunker()

# Chunking text based on sentence boundaries
text = 'A walk in the park is a nice start. After the walk, let us talk.'
chunks = sentence_chunker.chunk_text(text)
print(f"Sentence-based chunks: {chunks}")

Sentence-based chunks: ['A walk in the park is a nice start.', 'After the walk, let us talk.']
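
To make punctuation-based splitting concrete, here is a minimal sketch that uses only the standard library. It is not swarmauri's implementation: the helper name `split_sentences` and the regular expression are assumptions chosen for illustration, and they ignore hard cases such as abbreviations.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when whitespace follows (hypothetical helper)."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input itself is empty.
    return [part for part in parts if part]


sample = "A walk in the park is a nice start. After the walk, let us talk."
sentences = split_sentences(sample)
print(sentences)
```

On the sample text, this sketch gives the same two sentences as the `SentenceChunker` output above.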


## Sliding Window Chunking

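The `SlidingWindowChunker` divides text into fixed-size chunks that either overlap or sit end to end. Overlapping chunks help preserve context across chunk boundaries, which is useful when scanning long documents. The outputs in this section are consistent with a window measured in words and a default of 256 words, although the notebook never sets the window size explicitly.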

In [11]:
# SlidingWindowChunker usage without overlap
sliding_chunker = SlidingWindowChunker()

# Defining text to chunk: 512 space-separated words
unchunked_text = 'abcd ' * 512
chunks = sliding_chunker.chunk_text(unchunked_text)
print(f"Number of sliding window chunks (no overlap): {len(chunks)}")

Number of sliding window chunks (no overlap): 2


### Chunking Text with Overlap

When overlap is enabled, the window advances by `step_size` instead of by a full window, so consecutive chunks share content and the same text produces more chunks. Let's see the effect with `step_size=21`.

In [12]:
# SlidingWindowChunker usage with overlap
sliding_chunker_overlap = SlidingWindowChunker(overlap=True, step_size=21)

# Chunking with overlap
chunks_overlap = sliding_chunker_overlap.chunk_text(unchunked_text)
print(f"Number of sliding window chunks (with overlap): {len(chunks_overlap)}")

Number of sliding window chunks (with overlap): 13
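
The two counts above (2 and 13) are consistent with a window of 256 whitespace-separated words. The sketch below illustrates that behaviour under stated assumptions. It is not the library's code: `sliding_windows`, its parameters, and the 256-word default are hypothetical and inferred only from the outputs in this notebook.

```python
from typing import List, Optional


def sliding_windows(
    text: str, window_size: int = 256, step_size: Optional[int] = None
) -> List[str]:
    """Return full windows of `window_size` words, advancing `step_size` words each time."""
    words = text.split()
    # Without an explicit step, advance by a whole window (no overlap).
    step = window_size if step_size is None else step_size
    # Last index at which a complete window still fits.
    last_start = len(words) - window_size
    return [
        " ".join(words[start:start + window_size])
        for start in range(0, last_start + 1, step)
    ]


sample = "abcd " * 512
no_overlap = sliding_windows(sample)
with_overlap = sliding_windows(sample, step_size=21)
print(len(no_overlap), len(with_overlap))
```

In this sketch, the number of chunks is `(word_count - window_size) // step + 1`, which gives `256 // 256 + 1 = 2` without overlap and `256 // 21 + 1 = 13` with a step of 21. Text shorter than one window yields no chunks here; whether the library behaves the same way is not shown in this notebook.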


## Markdown Snippet Chunking

The `MdSnippetChunker` is specialized for markdown documents. Judging by the output later in this notebook, it extracts fenced code blocks and returns each one as a tuple holding any associated prose, the block's language tag, and the code itself.

In [23]:
# Initialize MdSnippetChunker and check its resource and type
md_chunker = MdSnippetChunker()
print(f"Resource: {md_chunker.resource}")
print(f"Type: {md_chunker.type}")

Resource: Chunker
Type: MdSnippetChunker


### Chunking a Markdown Document

Let's chunk a sample markdown text that mixes fenced code blocks with short passages of prose.

In [24]:
# Define a markdown document to chunk
markdown_text = """
```python

print('hello world')

```

Above is an example of some code.

```bash
echo 'test'
```

Here we have some text:

```md
# Hello
- list item
- list item
```
"""

### Chunk the markdown text using MdSnippetChunker

In [25]:
# Each chunk is a tuple; its middle element is the fence's language tag
chunks = md_chunker.chunk_text(markdown_text)
print(f"Markdown chunks: {chunks}")

Markdown chunks: [('', 'python', "print('hello world')"), ('', 'bash', "echo 'test'"), ('Above is an example of some code.', 'md', '# Hello\n- list item\n- list item')]
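
Each chunk in the output above is a three-element tuple. The sketch below shows one way to consume such tuples. It works on a literal copy of that output, so it runs without swarmauri installed. The names `comment`, `language`, and `code` are assumed labels for the three positions, inferred from the output rather than taken from the library's documentation.

```python
# Literal copy of the printed result, so the example is self-contained.
chunks = [
    ("", "python", "print('hello world')"),
    ("", "bash", "echo 'test'"),
    ("Above is an example of some code.", "md", "# Hello\n- list item\n- list item"),
]

# Index the snippets by language tag (assumes one snippet per language).
code_by_language = {language: code for _comment, language, code in chunks}

# Summarise each snippet: its language, line count, and attached text.
for comment, language, code in chunks:
    line_count = len(code.splitlines())
    label = comment or "(no comment)"
    print(f"{language}: {line_count} line(s), comment: {label}")
```

Keying by language is convenient for a small document like this one. If a document contained several snippets in the same language, later ones would overwrite earlier ones, so a list per language would be the safer structure.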


## Notebook Metadata

In [27]:
import os
import platform
import sys
from datetime import datetime

author_name = "Huzaifa Irshad " 
github_username = "irshadhuzaifa"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

notebook_file = "Notebook_02_Use_Cases.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except OSError as e:  # raised when the notebook file is missing or unreadable
    print(f"Could not retrieve last modified datetime: {e}")

print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")

Author: Huzaifa Irshad 
GitHub Username: irshadhuzaifa
Last Modified: 2024-10-17 10:50:40.937606
Platform: Windows 11
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
Swarmauri Version: 0.5.0
