## Step 1: Chunking

In [1]:
from typing import List

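def split_into_chunks(doc_file: str) -> List[str]:
    """Split a text file into chunks on blank lines (paragraph boundaries)."""
    # Read as UTF-8 explicitly: the document contains non-ASCII quotes
    with open(doc_file, 'r', encoding='utf-8') as file:
        content = file.read()

    # str.split already returns a list, so no comprehension is needed
    return content.split("\n\n")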

chunks = split_into_chunks("doc.md")

# print out the first two chunks
print(len(chunks))
for i, chunk in enumerate(chunks[:2]):
    print(f"[{i}] {chunk}\n")

222
[0] Cloud Native Computing Foundation (“CNCF”) Charter

[1] The Linux Foundation
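
Splitting on blank lines alone can leave heading-only or empty chunks, which carry little meaning when retrieved on their own. The sketch below is one possible refinement, not part of the original notebook: `split_and_merge_headings` is a hypothetical helper that drops empty blocks and attaches each single-line Markdown heading to the paragraph that follows it. Note that it also strips leading indentation from each block.

```python
from typing import List

def split_and_merge_headings(content: str) -> List[str]:
    """Split on blank lines, drop empty blocks, and attach each single-line
    Markdown heading to the next paragraph (hypothetical helper)."""
    chunks: List[str] = []
    pending_heading = ""
    for block in content.split("\n\n"):
        block = block.strip()
        if not block:
            continue  # skip blocks that contain only whitespace
        if block.startswith("#") and "\n" not in block:
            # Heading-only block: hold it until the next paragraph arrives
            pending_heading = f"{pending_heading}\n{block}".strip()
            continue
        chunks.append(f"{pending_heading}\n{block}".strip())
        pending_heading = ""
    if pending_heading:
        chunks.append(pending_heading)  # trailing heading with no body
    return chunks

sample = "# Title\n\n\n\n## Section\n\nBody text.\n\nMore text."
print(split_and_merge_headings(sample))
```

Using this in place of `split_into_chunks` would change the chunk count, so the numbers printed in the rest of the notebook would no longer match.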



## Step 2: Embed chunks into vectors

(The initial run may be slow because the model needs to be downloaded.)

In [2]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def embed_chunk(chunk: str) -> List[float]:
    embedding = embedding_model.encode(chunk, normalize_embeddings=True)
    return embedding.tolist()

# Test embedding function
embedding = embed_chunk("test-content")
print(len(embedding))
print(embedding[:2]) # print first two dimensions of the embedding

# Actually embed all chunks
embeddings = [embed_chunk(chunk) for chunk in chunks]
print(f"Embedded {len(embeddings)} chunks.")
print(f"First embedding length: {len(embeddings[0])}")

384
[-0.06328241527080536, 0.05413154140114784]
Embedded 222 chunks.
First embedding length: 384
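
Because `normalize_embeddings=True` returns unit-length vectors, the cosine similarity of two embeddings equals their plain dot product. The following is a minimal sketch of that relationship using toy 2-dimensional vectors instead of real 384-dimensional embeddings; the helper names are illustrative and not part of the notebook.

```python
import math
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point rounding
print(cosine_similarity(a, b))
print(dot(a_unit, b_unit))
```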


## Step 3: Save the result into a vector DB

Each record requires the original chunk (a string), its embedding (a list of floats), and a unique ID.

In [3]:
import chromadb
from chromadb.config import Settings

chromadb_client = chromadb.EphemeralClient(
    Settings(allow_reset=True)
)
chromadb_collection = chromadb_client.get_or_create_collection(name="default")

def save_embeddings(chunks: List[str], embeddings: List[List[float]]) -> None:
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
        chromadb_collection.add(
            documents=[chunk],
            embeddings=[embedding],
            ids=[str(i)]
        )
    print(f"Saved {len(chunks)} chunks and embeddings to ChromaDB.")

save_embeddings(chunks, embeddings)

Saved 222 chunks and embeddings to ChromaDB.


In [11]:
chromadb_client.reset()  # Caution: deletes all data in the ChromaDB instance, including the collection saved in Step 3. Skip this cell unless you want to start over.

True

## Step 4: Retrieve relevant chunks based on a query

Embed the query with the same model, then ask Chroma for the 5 chunks whose vectors are closest to the query vector.

In [4]:
def retrieve(query: str, top_k: int) -> List[str]:
    query_embedding = embed_chunk(query)
    results = chromadb_collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results['documents'][0]

query = "What is the responsibility of CNCF Governing Board?"
retrieved_chunks = retrieve(query, 5)

for i, chunk in enumerate(retrieved_chunks):
    print(f"[{i}] {chunk}\n")

[0] -	(a) The CNCF Governing Board will be responsible for marketing and other business oversight and budget decisions for the CNCF. The Governing Board does not make technical decisions for the CNCF, other than working with the TOC to set the overall scope for the CNCF as described in the cloud native definition from Section 1.

[1] 	-	i. Appoint one (1) representative to the CNCF Governing Board.

[2] #### 2. Role of the CNCF.

[3] The CNCF will serve a role in the open source community responsible for:

[4] 		-	d. the CNCF Executive Director, or
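
Conceptually, the query above is a nearest-neighbor search over the stored vectors. The sketch below does that search exhaustively on toy data, assuming squared L2 distance (to my knowledge Chroma's default metric, though a collection can be configured otherwise). Chroma itself uses an index rather than a full scan, so this is an illustration of the idea and not its implementation. For unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so the ranking matches cosine similarity.

```python
from typing import List, Tuple

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_top_k(
    query_vec: List[float], vectors: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    distances = [(i, squared_l2(query_vec, v)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:k]

# Toy unit vectors standing in for real embeddings
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query_vec = [0.8, 0.6]
print(brute_force_top_k(query_vec, vectors, 2))
```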



## Step 5: Re-rank

Retrieval from the vector database is fast, but its ranking is only approximate.
A CrossEncoder scores each (query, chunk) pair jointly, which is slower but usually more accurate, so we use it to re-rank the retrieved chunks.
(The initial run may be slow because the model needs to be downloaded.)

In [6]:
from sentence_transformers import CrossEncoder

# Load the model once instead of on every call to rerank()
cross_encoder = CrossEncoder('BAAI/bge-reranker-base')

def rerank(query: str, retrieved_chunks: List[str], top_k: int) -> List[str]:
    pairs = [(query, chunk) for chunk in retrieved_chunks]
    scores = cross_encoder.predict(pairs)

    scored_chunks = list(zip(retrieved_chunks, scores))
    scored_chunks.sort(key=lambda x: x[1], reverse=True)

    return [chunk for chunk, _ in scored_chunks][:top_k]

reranked_chunks = rerank(query, retrieved_chunks, 3)

for i, chunk in enumerate(reranked_chunks):
    print(f"[{i}] {chunk}\n")

[0] -	(a) The CNCF Governing Board will be responsible for marketing and other business oversight and budget decisions for the CNCF. The Governing Board does not make technical decisions for the CNCF, other than working with the TOC to set the overall scope for the CNCF as described in the cloud native definition from Section 1.

[1] The CNCF will serve a role in the open source community responsible for:

[2] #### 2. Role of the CNCF.
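
The re-ranking step is independent of the model that produces the scores: score every (query, chunk) pair, sort by score in descending order, and keep the top results. The sketch below makes that explicit with a pluggable scoring function. `rerank_with` and `word_overlap` are hypothetical names, and the word-overlap scorer is only a crude stand-in for `cross_encoder.predict` so the example runs without downloading a model.

```python
from typing import Callable, List

def rerank_with(
    score: Callable[[str, str], float],
    query: str,
    candidates: List[str],
    top_k: int,
) -> List[str]:
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(chunk, score(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0

candidates = [
    "Appoint one representative to the board.",
    "The board is responsible for budget decisions.",
    "Role of the foundation.",
]
print(rerank_with(word_overlap, "What is the board responsible for?", candidates, 2))
```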



## Step 6: Send the relevant chunks and the user query to an LLM

Here we use `gemini-2.5-flash`.

In [7]:
import os
from dotenv import load_dotenv
from google import genai

load_dotenv()
google_client = genai.Client(api_key=os.getenv("API_KEY"))

def generate(query: str, chunks: List[str]) -> str:
    # Join outside the f-string: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12
    context = "\n\n".join(chunks)
    prompt = f"""You are a knowledge assistant. Please generate an accurate answer based on the user's question and the following passages.

User question: {query}

Relevant context:
{context}

Please answer based only on the content above and do not fabricate information."""

    print(f"{prompt}\n\n---\n")

    response = google_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text

answer = generate(query, reranked_chunks)
print("Final Answer:")
print(answer)

You are a knowledge assistant. Please generate an accurate answer based on the user's question and the following passages.

User question: What is the responsibility of CNCF Governing Board?

Relevant context:
-	(a) The CNCF Governing Board will be responsible for marketing and other business oversight and budget decisions for the CNCF. The Governing Board does not make technical decisions for the CNCF, other than working with the TOC to set the overall scope for the CNCF as described in the cloud native definition from Section 1.

The CNCF will serve a role in the open source community responsible for:

#### 2. Role of the CNCF.

Please answer based only on the content above and do not fabricate information.

---

Final Answer:
The CNCF Governing Board is responsible for marketing and other business oversight and budget decisions for the CNCF. It does not make technical decisions for the CNCF, other than working with the TOC to set the overall scope for the CNCF as described in the cl
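
The last three steps chain together as retrieve, re-rank, generate. The sketch below shows one way to wire them into a single call. `answer_question`, `fetch_k`, and `keep_k` are hypothetical names, and the stages are passed in as callables so the example runs with stand-ins instead of a model, a database, or an API key.

```python
from typing import Callable, List

def answer_question(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str], int], List[str]],
    generate: Callable[[str, List[str]], str],
    fetch_k: int = 5,
    keep_k: int = 3,
) -> str:
    """Fetch fetch_k candidates, keep the keep_k best, and generate an answer."""
    candidates = retrieve(query, fetch_k)
    best_chunks = rerank(query, candidates, keep_k)
    return generate(query, best_chunks)

# Stand-in stages that only mimic the signatures of the notebook's functions
demo_answer = answer_question(
    "example question",
    retrieve=lambda q, k: [f"chunk {i}" for i in range(k)],
    rerank=lambda q, chunks, k: chunks[:k],
    generate=lambda q, chunks: f"{len(chunks)} chunks used",
)
print(demo_answer)
```

In the notebook, the equivalent call would be `answer_question(query, retrieve, rerank, generate)`, which uses the same values of 5 and 3 as the cells above.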