<a href="https://colab.research.google.com/github/salmantec/AI-Agents-Crash-Course/blob/feat%2FDay-2/Day-2/Day_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Chunking and Intelligent Processing for Data

In [None]:
# In the first part of the course, we focus on data preparation – the process of properly preparing data before it can be used for AI agents.

# Small and Large Documents:

# Yesterday, we prepared data from a GitHub repo. For small sources, like FAQs, that is sufficient.
# The questions and answers are small enough that we can put them directly into the search engine.

# But large documents, like those in Evidently's docs, can cause problems when passed directly to an LLM. Take a look at this one, for example: https://github.com/evidentlyai/docs/blob/main/docs/library/descriptors.mdx

# Why We Need to Prepare Large Documents Before Using Them

# Large documents create several problems:

# Token limits: Most LLMs have maximum input token limits
# Cost: Longer prompts cost more money
# Performance: LLMs perform worse with very long contexts
# Relevance: Not all parts of a long document are relevant to a specific question

# So we need to split documents into smaller subdocuments. For AI applications like RAG (which we will discuss tomorrow), this process is referred to as "chunking."

# 'Chunking': breaking long documents into smaller, focused pieces that are easier (and cheaper) for AI to process

In [None]:
# Today's task:

# Explore 3 different chunking methods:
# - Simple sliding window: cut the text into overlapping, character-based chunks
# - Paragraph and section splits: use the document's natural structure
# - LLM-powered chunking: intelligent, semantic splits

# Note that for the last method, you will need an OpenAI account or an account from an alternative LLM provider such as Groq.

In [None]:
!pip install uv

Collecting uv
  Downloading uv-0.8.22-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.8.22-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.3 MB)
Installing collected packages: uv
Successfully installed uv-0.8.22


In [None]:
!uv pip install requests python-frontmatter

Using Python 3.12.11 environment at: /usr
Resolved 7 packages in 148ms
Prepared 1 package in 20ms
Installed 1 package in 1ms
 + python-frontmatter==1.1.0


In [None]:
# 1. Simple chunking

# Let's start with simple chunking. This will be sufficient for most cases.

In [None]:
# We can continue with the notebook from Day 1. We already downloaded the data from Evidently docs. We put them into the evidently_docs list.

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
  """
  Download and parse all Markdown files from a GitHub repository.

  Args:
    repo_owner: GitHub username or organization
    repo_name: Repository name

  Returns:
    List of dictionaries containing file content and metadata
  """
  prefix = 'https://codeload.github.com'
  url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
  resp = requests.get(url)

  if resp.status_code != 200:
    raise Exception(f"Failed to download repository {repo_owner}/{repo_name}: {resp.status_code}")

  repository_data = []

  # Open the downloaded archive in memory; the context manager closes it for us
  with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    for file_info in zf.infolist():
      filename = file_info.filename

      # Keep only Markdown files (.md and .mdx)
      if not filename.lower().endswith(('.md', '.mdx')):
        continue

      try:
        with zf.open(file_info) as f_in:
          content = f_in.read().decode('utf-8', errors='ignore')
        # Parse the frontmatter metadata and the body into one dictionary
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)
      except Exception as e:
        print(f"Error processing {filename}: {e}")
        continue

  return repository_data

In [None]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ Documents: {len(dtc_faq)}")
print(f"Evidently Docs: {len(evidently_docs)}")

FAQ Documents: 1217
Evidently Docs: 95


In [None]:
# Let's check how long the document at index 45 is:

len(evidently_docs[45]['content'])


# The content field is 21,712 characters long. The simplest thing we can do is cut it into pieces of equal length. For example, for a size of 2000 characters, we will have:

# Chunk 1: 0..2000
# Chunk 2: 2000..4000
# Chunk 3: 4000..6000

# And so on.

# However, this approach has disadvantages:

# Context loss: Important information might be split in the middle
# Incomplete sentences: Chunks might end mid-sentence
# Missing connections: Related information might end up in different chunks

# That's why, in practice, we usually make sure there's overlap between chunks. For size 2000 and overlap 1000, we will have:

# Chunk 1: 0..2000
# Chunk 2: 1000..3000
# Chunk 3: 2000..4000
# ...

# This is better for AI because:

# Continuity: Important information isn't lost at chunk boundaries
# Context preservation: Related sentences near a boundary stay together in at least one chunk
# Better search: Queries can match information even if it spans chunk boundaries


21712
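
In [None]:
# A minimal sketch of the boundary problem described above, using a made-up string rather than the real document. The marker phrase and its position are assumptions, chosen so that the phrase straddles the cut at character 2000:

```python
# Hypothetical text: a phrase placed so that it crosses the 2000-character boundary
phrase = "IMPORTANT FACT HERE"
text = "a" * 1990 + phrase + "b" * 3000

size = 2000

# Equal-length pieces without overlap (step == size)
plain = [text[i:i + size] for i in range(0, len(text), size)]

# Same size, but each chunk starts 1000 characters after the previous one
overlapping = [text[i:i + size] for i in range(0, len(text), 1000)]

found_plain = any(phrase in chunk for chunk in plain)
found_overlapping = any(phrase in chunk for chunk in overlapping)

print(found_plain)        # False: the phrase is cut in two at position 2000
print(found_overlapping)  # True: the chunk 1000..3000 contains the whole phrase
```

# Without overlap, no single chunk contains the phrase, so a search over chunks cannot match it. With overlap, one chunk does.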

In [None]:
# This approach is known as the "sliding window" method. This is how we implement it in Python:

def sliding_window(sequence, size, step):
  if size <= 0 or step <= 0:
    raise ValueError("Size and step must be positive")

  n = len(sequence)
  result = []
  for i in range(0, n, step):
    chunk = sequence[i:i+size]
    result.append({'start': i, 'chunk': chunk})
    if i + size >= n:
      break

  return result

# Let's apply it to document 45 with size 2000 and step 1000. This gives us 21 chunks:

# 0..2000
# 1000..3000
# ...
# 19000..21000
# 20000..21712
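
In [None]:
# A quick way to check these numbers without downloading anything. This sketch assumes a placeholder string with the same length as document 45 and repeats the function so the cell runs on its own:

```python
def sliding_window(sequence, size, step):
  # Same logic as the function above, repeated so this cell is self-contained
  if size <= 0 or step <= 0:
    raise ValueError("Size and step must be positive")

  n = len(sequence)
  result = []
  for i in range(0, n, step):
    result.append({'start': i, 'chunk': sequence[i:i + size]})
    if i + size >= n:
      break

  return result

# Stand-in for evidently_docs[45]['content']: same length, placeholder characters
text = "x" * 21712

chunks = sliding_window(text, size=2000, step=1000)
starts = [c['start'] for c in chunks]

print(len(chunks))                # 21
print(starts[:3], starts[-2:])    # [0, 1000, 2000] [19000, 20000]
print(len(chunks[-1]['chunk']))   # 1712: the final chunk is shorter than size
```

# The early break stops the loop as soon as a window reaches the end of the text, so we don't produce a trailing chunk that is fully contained in the previous one.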

In [None]:
# Let's process all the documents

evidently_chunks = []

for doc in evidently_docs:
  doc_copy = doc.copy()
  doc_content = doc_copy.pop('content')
  chunks = sliding_window(doc_content, 2000, 1000)
  for chunk in chunks:
    chunk.update(doc_copy)
  evidently_chunks.extend(chunks)


# Note that we use copy() and pop() operations:

# doc.copy() creates a shallow copy of the document dictionary
# doc_copy.pop('content') removes the 'content' key and returns its value
# This way, the remaining keys (the document metadata) are preserved and can be merged into each chunk.

# In total, we obtain 575 chunks from 95 documents:

len(evidently_chunks)
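
In [None]:
# A small sketch of what copy() and pop() do in the loop above. The document below is a toy example: its keys and values are assumptions, not real data from the repository:

```python
# Hypothetical document with two metadata fields and a short content field
doc = {'title': 'Example', 'filename': 'docs/example.mdx', 'content': 'abcdefghij'}

doc_copy = doc.copy()              # shallow copy, so the original is not modified
content = doc_copy.pop('content')  # remove 'content' from the copy and keep its value

# Build one chunk by hand and merge the metadata into it
chunk = {'start': 0, 'chunk': content[:4]}
chunk.update(doc_copy)

print('content' in doc)  # True: the original document still has its content
print(doc_copy)          # {'title': 'Example', 'filename': 'docs/example.mdx'}
print(chunk)             # {'start': 0, 'chunk': 'abcd', 'title': 'Example', 'filename': 'docs/example.mdx'}
```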

# We can play with the parameters by including more or less content. 2000 characters is usually good enough for RAG applications.

# There are some alternative approaches:

# - Token-based chunking: You first tokenize the content (turn it into a sequence of tokens, such as words or word pieces) and then do a sliding window over tokens
#   - Advantages: More precise control over LLM input size
#   - Disadvantages: Doesn't work well for documents with code
# - Paragraph splitting: Split by paragraphs
# - Section splitting: Split by sections
# - AI-powered splitting: Let AI split the text intelligently

# We won't cover token-based chunking here, as we're working with documents that contain code. But it's easy to implement - ask ChatGPT for help if you need it for text-only content.
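
In [None]:
# For reference, a rough sketch of the token-based variant. It assumes whitespace splitting as a crude stand-in for a real tokenizer (LLM tokenizers usually work on word pieces), and the function name sliding_window_tokens is made up for this example:

```python
def sliding_window_tokens(text, size, step):
  # size and step are counted in tokens, not characters
  if size <= 0 or step <= 0:
    raise ValueError("Size and step must be positive")

  # Assumption: split on whitespace instead of using an LLM tokenizer
  tokens = text.split()

  result = []
  for i in range(0, len(tokens), step):
    window = tokens[i:i + size]
    # Joining with single spaces discards the original line breaks and indentation
    result.append({'start': i, 'chunk': ' '.join(window)})
    if i + size >= len(tokens):
      break

  return result

sample = "one two three four five six seven"
token_chunks = sliding_window_tokens(sample, size=4, step=2)

for chunk in token_chunks:
  print(chunk)
# {'start': 0, 'chunk': 'one two three four'}
# {'start': 2, 'chunk': 'three four five six'}
# {'start': 4, 'chunk': 'five six seven'}
```

# Because the chunks are rebuilt by joining tokens, newlines and indentation are lost, which is one reason this approach is a poor fit for documents that contain code.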


In [None]:
# 2. Splitting by paragraphs and sections

In [30]:
# Splitting by paragraphs is relatively easy:

import re
text = evidently_docs[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())

# paragraphs[0:3]

# We use the \n\s*\n regex pattern for splitting:

# - \n matches a newline
# - \s* matches zero or more whitespace characters
# - \n matches another newline
# So \n\s*\n matches two newlines with optional whitespace between them

# This works well for literature, but not as well for technical documentation, where most paragraphs are very short.
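
In [None]:
# A minimal sketch of the pattern on a made-up sample, so the behavior is visible without the real document. Note the second separator, which contains a line with only spaces:

```python
import re

# Hypothetical sample: three paragraphs, one separator line contains spaces
sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n   \nThird paragraph.\n"

sample_paragraphs = re.split(r"\n\s*\n", sample.strip())

print(len(sample_paragraphs))  # 3
print(sample_paragraphs)
# ['First paragraph,\nstill first.', 'Second paragraph.', 'Third paragraph.']
```

# A single newline inside a paragraph is kept, while blank or whitespace-only lines act as separators.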


In [31]:
# Let's now look at section splitting. Here, we take advantage of the document's structure. Markdown documents have this structure:

# Heading 1
## Heading 2
### Heading 3

# What we can do is split by headers

# For that we will use regex too:

import re

def split_markdown_by_level(text, level=2):
  """
  Split markdown text by a specific header level.

  :param text: Markdown text as a string
  :param level: Header level to split on
  :return: List of sections as strings
  """
  # This regex matches markdown headers
  # For level 2, it matches lines starting with "## "

  header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
  pattern = re.compile(header_pattern, re.MULTILINE)

  # Split and keep the headers
  parts = pattern.split(text)

  sections = []
  for i in range(1, len(parts), 3):
    # We step by 3 because regex.split() with capturing groups returns:
    # [before_match, group1, group2, after_match, ...]
    # here group1 is "## ", group2 is the header text
    header = parts[i] + parts[i + 1] # "## " + "Title"
    header = header.strip()

    # Get the content after this header
    content = ""
    if i+2 < len(parts):
      content = parts[i+2].strip()

    if content:
      section = f'{header}\n\n{content}'
    else:
      section = header
    sections.append(section)

  return sections

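# Note: This code may not work perfectly if we split by level 1 headings and the document contains Python code with # comments, since those lines also match the pattern. In general, this is not a big problem for documentation.
# Also note that any text before the first matching header is not included in the result, so a document with no headers at the chosen level produces no sections.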
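
In [None]:
# To see why the loop steps by 3, here is a small sketch of what split() returns when the pattern has two capturing groups. The sample Markdown text is made up for illustration:

```python
import re

# Hypothetical Markdown: intro text, two level 2 sections, one nested level 3 header
md = "Intro text\n\n## Setup\nInstall it.\n\n### Details\nMore.\n\n## Usage\nRun it.\n"

# Same pattern that split_markdown_by_level builds for level=2
pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)
parts = pattern.split(md)

print(len(parts))  # 7: one leading element plus three elements per matched header
for part in parts:
  print(repr(part))
# 'Intro text\n\n'
# '## '
# 'Setup'
# '\nInstall it.\n\n### Details\nMore.\n\n'
# '## '
# 'Usage'
# '\nRun it.\n'
```

# The level 3 header does not match, so it stays inside the "Setup" section. The first element is the text before the first header, which a loop starting at index 1 never reads.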

In [32]:
# Now we iterate over all the docs to create the final result:

evidently_chunks = []

for doc in evidently_docs:
  doc_copy = doc.copy()
  doc_content = doc_copy.pop('content')
  sections = split_markdown_by_level(doc_content, level = 2)
  for section in sections:
    section_doc = doc_copy.copy()
    section_doc['section'] = section
    evidently_chunks.append(section_doc)

# As before, copy() creates a copy of the document metadata and pop('content') removes and returns the content. This way, each section gets the same metadata (for example, title and description) as the original document.
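
In [None]:
# A small sketch of why each section gets its own copy of the metadata. The metadata values and section texts here are hypothetical:

```python
sample_sections = ['## A\n\ntext a', '## B\n\ntext b']

# Without copy(): every list entry is the same dictionary object
metadata = {'title': 'Example', 'filename': 'example.mdx'}
shared = []
for section in sample_sections:
  metadata['section'] = section  # overwrites the value set in the previous iteration
  shared.append(metadata)

# With copy(): each section gets an independent dictionary
metadata = {'title': 'Example', 'filename': 'example.mdx'}
separate = []
for section in sample_sections:
  section_doc = metadata.copy()
  section_doc['section'] = section
  separate.append(section_doc)

print([d['section'] for d in shared])    # ['## B\n\ntext b', '## B\n\ntext b']
print([d['section'] for d in separate])  # ['## A\n\ntext a', '## B\n\ntext b']
```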

# This is a more intelligent way of splitting, but we can go even further and use LLMs for chunking.
