# Semantic Chunking with Generative Feedback Loops

This short tutorial illustrates how Generative Feedback Loops can help you with data ingestion and ETL workloads when developing with Weaviate. We show this particularly with the example of loading code repositories into Weaviate. We note the general concepts extend to applications such as PDFs, emails, and other media options, as well as more technical details such as RDBMS or Knowledge Graph to Vector Database integrations.

Here are some helpful additional resources for learning about Generative Feedback Loops and Weaviate:

- Learn more about Generative Feedback Loops with Bob van Luijt [here](https://www.youtube.com/watch?v=1RALju6ZJz0)!

- Generative Feedback Loops notebooks on [Weaviate Recipes](https://github.com/weaviate/recipes/tree/main/weaviate-features/generative-feedback-loops)!

- Weaviate Blog Post: [Generating Blog Posts with DSPy](https://weaviate.io/blog/hurricane-generative-feedback-loops)!

![Alt text](GFL.png)

Generative Feedback Loops (GFLs) describe the co-evolution of data and AI models throughout the lifecycle of an AI application. 

GFLs capture the continuous exchange of feedback between data and AI models, where AI models enhance datasets by updating or creating data objects, and where data shapes AI models by providing learning signals for specific domains or tasks.

## Chapters

### 1. DSPy Setup

Test connections to Weaviate, OpenAI, Anthropic, Gemini, Cohere, and Ollama.

### 2. Generative Feedback Loops with DSPy TypedPredictors

Write a short summary of the file and ingest into Weaviate by chunking the code files and summarizing the code chunks. 

### 3. Load Code Repositoy from Disk into Python Runtime Memory

Load the DSPy code repository from disk and into the the Python runtime memory.

### 4. Import to Weaviate!

Ingesting the DSPy Code Base into Weaviate with Generative Feedback Loops!

# 1. DSPy Setup

Import DSPy and connect it to popular LLMs.

In [1]:
import dspy
import openai

openai.api_key = "sk-foobar"
CLAUDE_API_KEY = "sk-foobar"
cohere_api_key = "foobar"

# Start Llama 3.1 on your laptop with Ollama using!
'''bash
ollama run llama3.1
'''

gpt4 = dspy.OpenAI(model="gpt-4o", max_tokens=4_000, model_type="chat")
claude_opus = dspy.Claude(model="claude-3-opus-20240229", api_key=CLAUDE_API_KEY)
command_r_plus = dspy.Cohere(model="command-r-plus", max_tokens=4000, api_key=cohere_api_key)
ollama_llama3 = dspy.OllamaLocal(model="llama3.1")

lms = [
    {"name": "GPT-4", "lm": gpt4},
    {"name": "Claude Opus", "lm": claude_opus},
    {"name": "Command R+", "lm": command_r_plus},
    {"name": "Llama 3.1", "lm": ollama_llama3}
]

dspy.settings.configure(lm=gpt4)

connection_prompt = "Please say something interesting about Database Systems intended to impress me with your intelligence."

print(f"\033[36mTesting the prompt:\n\033[91m\n{connection_prompt}\n")

for lm_dict in lms:
    lm, name = lm_dict["lm"], lm_dict["name"]
    with dspy.context(lm=lm):
        print(f"\033[92mResult for {name}\n")
        print(f"\033[0m{lm(connection_prompt)[0]}\n")

* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'smart_union' has been removed


[36mTesting the prompt:
[91m
Please say something interesting about Database Systems intended to impress me with your intelligence.

[92mResult for GPT-4

[0mCertainly! One of the most fascinating aspects of modern database systems is their ability to handle distributed transactions across multiple nodes in a network while ensuring ACID (Atomicity, Consistency, Isolation, Durability) properties. This is particularly impressive given the challenges posed by the CAP theorem, which states that a distributed data store can only provide two out of the following three guarantees simultaneously: Consistency, Availability, and Partition Tolerance.

To navigate this, advanced database systems like Google Spanner employ innovative techniques such as TrueTime, a globally synchronized clock, to achieve external consistency. This allows Spanner to provide strong consistency guarantees across geographically distributed data centers, which is a remarkable feat. TrueTime leverages GPS and atomic c

# 2. DSPy Chunking Program

We will now give the code files to a Large Language Model with instructions about the task of chunking in order to import data into Vector Databases.

By leveraging Structured Outputs, the LLM will also output a summary of each chunk. These descriptions help developers quickly understand what purpose blocks of code serve. This is especially helpful when dealing with very large and rapdily changing codebases.

In [2]:
from typing import List
from pydantic import BaseModel

class ChunkWithSummary(BaseModel):
    chunk: str
    summary: str
    
class Chunker(dspy.Signature):
    """Your task is to divide a long document into coherent, semantic chunks of text. Each chunk should represent a complete thought or topic, typically ranging from one to three paragraphs in length. Follow these guidelines:
1. Focus on semantic coherence: Each chunk should contain text that revolves around a single main idea or closely related ideas.
2. Respect natural breaks: Use paragraph boundaries as a guide, but don't be bound by them if a topic continues across paragraphs.
3. Maintain context: Ensure that each chunk can be understood independently without losing crucial context.
4. Aim for consistency: Try to keep chunks relatively similar in length, but prioritize semantic completeness over strict length adherence.
5. Handle transitions: When encountering transition sentences between topics, include them with the most relevant chunk.
6. Consider document structure: Pay attention to headings, subheadings, and other structural elements that might indicate topic changes.
7. Adjust for content type: Be flexible based on the document's nature (e.g., academic papers might have longer chunks than news articles).

Your output should be a List of strings containing these semantic chunks, each representing a distinct portion of the original document while preserving its overall flow and meaning.
    """
    
    long_document: str = dspy.InputField()
    # file_summary: str = dspy.OutputField()
    chunks_with_summaries: List[ChunkWithSummary] = dspy.OutputField()
    
    
chunker = dspy.TypedPredictor(Chunker)

# 3. Load Code Repository into Memory

Load the code files from a repo directory on disk and into memory.

**Note:** When the Weaviate `gfl` API is released, you will be able to skip this step and directly apply Generative Feedback Loops on your data in Weaviate. Learn more [here](https://weaviate.io/gen-feedback-loops)!

In [3]:
import os

def traverse_directory(root_dir):
    file_contents = []
    text_extensions = {'.txt', '.py', '.md', '.json', '.csv', '.xml', '.html', '.css', '.js'}

    for dirpath, dirnames, filenames in os.walk(root_dir):
        print(f'Current directory: {dirpath}')
        
        if dirnames:
            print(f'\tSubdirectories: {", ".join(dirnames)}')

        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            print(f'\tFile: {file_path}')
            
            if any(filename.endswith(ext) for ext in text_extensions):
                try:
                    with open(file_path, 'r') as file:
                        content = file.read()
                        file_contents.append(content) # ToDo, add filename to Weaviate Schema & Import
                        print(f"\033[92mContent successfully read from {file_path}.\n\033[0m")
                except Exception as e:
                    print(f"An error occurred while reading {file_path}: {e}")
            else:
                print(f"\033[93mSkipped non-text file: {file_path}\033[0m")
    
    return file_contents

In [4]:
print("\033[36mLoading code from dspy...\033[0m\n")
dspy_code = traverse_directory("./dspy/dspy")

print("\033[36mLoading code from dsp...\033[0m\n")
dsp_code = traverse_directory("./dspy/dsp")

print("\033[36mLoading docs from docs...\033[0m\n")
dspy_docs = traverse_directory("./dspy/docs")

[36mLoading code from dspy...[0m

Current directory: ./dspy/dspy
	Subdirectories: propose, experimental, signatures, datasets, utils, primitives, adapters, evaluate, predict, teleprompt, retrieve, functional
	File: ./dspy/dspy/__init__.py
[92mContent successfully read from ./dspy/dspy/__init__.py.
[0m
Current directory: ./dspy/dspy/propose
	File: ./dspy/dspy/propose/__init__.py
[92mContent successfully read from ./dspy/dspy/propose/__init__.py.
[0m
	File: ./dspy/dspy/propose/dataset_summary_generator.py
[92mContent successfully read from ./dspy/dspy/propose/dataset_summary_generator.py.
[0m
	File: ./dspy/dspy/propose/propose_base.py
[92mContent successfully read from ./dspy/dspy/propose/propose_base.py.
[0m
	File: ./dspy/dspy/propose/instruction_proposal.py
[92mContent successfully read from ./dspy/dspy/propose/instruction_proposal.py.
[0m
	File: ./dspy/dspy/propose/utils.py
[92mContent successfully read from ./dspy/dspy/propose/utils.py.
[0m
	File: ./dspy/dspy/propose/gr

In [5]:
print(f"There are \033[92m{len(dspy_code)}\033[0m files in dspy/dspy\n")
print(f"There are \033[92m{len(dsp_code)}\033[0m files in dspy/dsp\n")
print(f"There are \033[92m{len(dspy_docs)}\033[0m files in dspy/docs")

There are [92m105[0m files in dspy/dspy

There are [92m57[0m files in dspy/dsp

There are [92m103[0m files in dspy/docs


### Test with 1 File

Let's chunk and summarize the code in the DSPy's `WeaviateRM`

In [6]:
sample = dspy_code[92]
print(sample)

from typing import List, Optional, Union

import dspy
from dsp.utils import dotdict
from dspy.primitives.prediction import Prediction

try:
    import weaviate
except ImportError as err:
    raise ImportError(
        "The 'weaviate' extra is required to use WeaviateRM. Install it with `pip install dspy-ai[weaviate]`",
    ) from err


class WeaviateRM(dspy.Retrieve):
    """A retrieval module that uses Weaviate to return the top passages for a given query.

    Assumes that a Weaviate collection has been created and populated with the following payload:
        - content: The text of the passage

    Args:
        weaviate_collection_name (str): The name of the Weaviate collection.
        weaviate_client (WeaviateClient): An instance of the Weaviate client.
        k (int, optional): The default number of top passages to retrieve. Default to 3.

    Examples:
        Below is a code snippet that shows how to use Weaviate as the default retriever:
        ```python
        import weav

In [7]:
response = chunker(long_document=sample)

In [8]:
chunks_with_summaries = response.chunks_with_summaries

In [9]:
for i in range(5):
    print(f"\033[92m=== Chunk {i} ===\n\033[0m")
    print(f"{chunks_with_summaries[i].chunk}\n")
    print(f"\033[92m=== Summary of Chunk {i} ===\n\033[0m")
    print(f"{chunks_with_summaries[i].summary}\n")

[92m=== Chunk 0 ===
[0m
from typing import List, Optional, Union

import dspy
from dsp.utils import dotdict
from dspy.primitives.prediction import Prediction

try:
    import weaviate
except ImportError as err:
    raise ImportError(
        "The 'weaviate' extra is required to use WeaviateRM. Install it with `pip install dspy-ai[weaviate]`",
    ) from err

[92m=== Summary of Chunk 0 ===
[0m
This section includes the necessary imports for the module, including handling the optional import of the 'weaviate' library with an appropriate error message if it is not installed.

[92m=== Chunk 1 ===
[0m
class WeaviateRM(dspy.Retrieve):
    """A retrieval module that uses Weaviate to return the top passages for a given query.

    Assumes that a Weaviate collection has been created and populated with the following payload:
        - content: The text of the passage

    Args:
        weaviate_collection_name (str): The name of the Weaviate collection.
        weaviate_client (WeaviateCli

# 4. Chunk and Import!

In [10]:
import weaviate
weaviate_client = weaviate.connect_to_local()
weaviate_client.collections.delete_all()

In [11]:
import weaviate
import weaviate.classes.config as wvcc

weaviate_client = weaviate.connect_to_local()

code_collection = weaviate_client.collections.create(
    name="Code",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_voyageai(
        model="voyage-code-2"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="chunk_summary", data_type=wvcc.DataType.TEXT)
    ]
)

docs_collection = weaviate_client.collections.create(
    name="Docs",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(
        model="embed-english-v3.0"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="chunk_summary", data_type=wvcc.DataType.TEXT)
    ]
)

            Please make sure to close the connection using `client.close()`.


## DSPy's Typed Predictors

DSPy currently supports several strategies for achieving structured outputs with LLM systems.

Here are some thoughts from the Weaviate team on our developing understanding of structured outputs:
- StructuredRAG - [[repo]](https://github.com/weaviate/structured-rag/tree/main)
- OPRO JSON Mode [[notebook]](https://github.com/weaviate/structured-rag/blob/main/OPRO-Compiled-JSON-Mode.ipynb)
- Weaviate Podcast #88 with Jason Liu [[podcast]](https://www.youtube.com/watch?v=higlHgYDc5E)
- Weaviate Recipes [[notebook]](https://github.com/weaviate/recipes/blob/main/integrations/llm-frameworks/dspy/4.Structured-Outputs-with-DSPy.ipynb)

In [12]:
from weaviate.util import get_valid_uuid
from uuid import uuid4

failure_counter = 0

import time

start = time.time()
for file in dspy_code:
    try:
        response = chunker(long_document=file)
        # file_summary = response.file_summary # ToDo
        chunks_with_summaries = response.chunks_with_summaries
        for chunk_with_summary in chunks_with_summaries:
            uuid = get_valid_uuid(uuid4())
            code_collection.data.insert(
                properties={
                    "content": chunk_with_summary.chunk,
                    "summary": chunk_with_summary.summary
                },
                uuid=uuid
            )
    except:
        failure_counter += 1
        print(f"TypedPredictors failure {failure_counter}\n")
print(f"GFL ran in {time.time() - start} seconds.")



TypedPredictors failure 1

TypedPredictors failure 2

TypedPredictors failure 3

TypedPredictors failure 4

TypedPredictors failure 5

TypedPredictors failure 6

GFL ran in 208.73269414901733 seconds.


In [13]:
start = time.time()
for file in dspy_code:
    try:
        response = chunker(long_document=file)
        # file_summary = response.file_summary # ToDo
        chunks_with_summaries = response.chunks_with_summaries
        for chunk_with_summary in chunks_with_summaries:
            uuid = get_valid_uuid(uuid4())
            code_collection.data.insert(
                properties={
                    "content": chunk_with_summary.chunk,
                    "summary": chunk_with_summary.summary
                },
                uuid=uuid
            )
    except:
        failure_counter += 1
        print(f"TypedPredictors failure {failure_counter}\n")
print(f"GFL ran in {time.time() - start} seconds.")

TypedPredictors failure 7

TypedPredictors failure 8

TypedPredictors failure 9

TypedPredictors failure 10

TypedPredictors failure 11

TypedPredictors failure 12

GFL ran in 198.70638418197632 seconds.


In [14]:
start = time.time()
for file in dspy_docs:
    try:
        response = chunker(long_document=file)
        # file_summary = response.file_summary # ToDo
        chunks_with_summaries = response.chunks_with_summaries
        for chunk_with_summary in chunks_with_summaries:
            uuid = get_valid_uuid(uuid4())
            docs_collection.data.insert(
                properties={
                    "content": chunk_with_summary.chunk,
                    "summary": chunk_with_summary.summary
                },
                uuid=uuid
            )
    except:
        failure_counter += 1
        print(f"TypedPredictors failure {failure_counter}\n")
print(f"GFL ran in {time.time() - start} seconds.")

TypedPredictors failure 13

TypedPredictors failure 14

TypedPredictors failure 15

TypedPredictors failure 16

TypedPredictors failure 17



INFO:backoff:Backing off request(...) for 0.6s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


INFO:backoff:Backing off request(...) for 0.5s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 0.5 seconds after 2 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


INFO:backoff:Backing off request(...) for 1.9s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 1.9 seconds after 3 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


ERROR:backoff:Giving up request(...) after 4 tries (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


TypedPredictors failure 18



INFO:backoff:Backing off request(...) for 0.6s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


INFO:backoff:Backing off request(...) for 1.6s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 1.6 seconds after 2 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


INFO:backoff:Backing off request(...) for 1.4s (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


Backing off 1.4 seconds after 3 tries calling function <function GPT3.request at 0x1155bb130> with kwargs {}


ERROR:backoff:Giving up request(...) after 4 tries (openai.RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}})


TypedPredictors failure 19

GFL ran in 80.71587681770325 seconds.


# Check Data in Weaviate!

In [29]:
response = code_collection.aggregate.over_all(total_count=True)

print(f"{response.total_count} objects in the Weaviate \033[92m`Code`\033[0m collection.")

response = docs_collection.aggregate.over_all(total_count=True)

print(f"{response.total_count} objects in the Weaviate \033[92m`Docs`\033[0m collection.")

1242 objects in the Weaviate [92m`Code`[0m collection.
418 objects in the Weaviate [92m`Docs`[0m collection.


# Search through your Data in Weaviate!

In [23]:
response = code_collection.query.hybrid(query="WeaviateRM", limit=3)

for o in response.objects:
    print("=== Content ===\n")
    print(o.properties["content"])
    print("\n=== Summary ===\n")
    print(o.properties["summary"])
    print("\n")

=== Content ===

self._weaviate_collection_name = weaviate_collection_name
        self._weaviate_client = weaviate_client
        self._weaviate_collection_text_key = weaviate_collection_text_key

        # Check the type of weaviate_client (this is added to support v3 and v4)
        if hasattr(weaviate_client, "collections"):
            self._client_type = "WeaviateClient"
        elif hasattr(weaviate_client, "query"):
            self._client_type = "Client"
        else:
            raise ValueError("Unsupported Weaviate client type")

        super().__init__(k=k)

=== Summary ===

This section continues the constructor of the WeaviateRM class, initializing instance variables and checking the type of the Weaviate client to ensure compatibility with different versions.


=== Content ===

self._weaviate_collection_name = weaviate_collection_name
        self._weaviate_client = weaviate_client
        self._weaviate_collection_text_key = weaviate_collection_text_key

        # Che

# Connect with us!

I hope you found this example useful to see how Semantic Chunking with Generative Feedback Loops can help you prepare your data for Vector Database and LLM Applications!

Please reach out to us if you would like to discuss applications of Generative Feedback Loops in your project, and please feel free to open a pull request to add your GFL examples to Weaviate Recipes!