<a target="_blank" href="https://colab.research.google.com/github/urcraft/llm_lecture_notebooks/blob/main/06_Gemini_API_RAG_InMemory_VectorDB.ipynb">   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

# Gemini API + In-Memory Vector DB (RAG)

## What you will build
- A minimal Retrieval-Augmented Generation (RAG) pipeline.
- Gemini embeddings for chunked documents.
- Chroma in-memory vector database for retrieval.
- Gemini answer generation grounded in retrieved context.

Expected runtime: 10-20 minutes
Expected cost: none, provided usage stays within the Gemini API free tier


## Online References Used
- Gemini embeddings API reference: https://ai.google.dev/api/embeddings
- Gemini text generation guide: https://ai.google.dev/gemini-api/docs/system-instructions
- Chroma in-memory client docs: https://docs.trychroma.com/docs/run-chroma/clients
- Chroma Python reference: https://docs.trychroma.com/reference/python
- Gemini cookbook repository: https://github.com/google-gemini/cookbook


In [1]:
# Force-install compatible versions for the Colab environment
%pip install -q -U google-genai chromadb \
    "pandas==2.2.2" \
    "requests==2.32.4" \
    "opentelemetry-sdk==1.22.0" \
    "opentelemetry-exporter-otlp-proto-http==1.22.0"



In [2]:
import os
import re
import textwrap
import requests
import pandas as pd
import chromadb

from google import genai


In [3]:
# Set your API key in environment variable GOOGLE_API_KEY before running.
# In Colab, you can also use a secret named GOOGLE_API_KEY.
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    try:
        from google.colab import userdata
        api_key = userdata.get("GOOGLE_API_KEY")
    except Exception:
        api_key = None

if not api_key:
    raise ValueError("Set GOOGLE_API_KEY (environment variable or Colab secret).")

client = genai.Client(api_key=api_key)

EMBED_MODEL = "gemini-embedding-001"
GEN_MODEL = "gemini-3-flash-preview"

print("Gemini client ready")
print("Embedding model:", EMBED_MODEL)
print("Generation model:", GEN_MODEL)


Gemini client ready
Embedding model: gemini-embedding-001
Generation model: gemini-3-flash-preview


In [4]:
# A small, thematically coherent corpus for RAG:
# public docs/README files about Gemini, vector DBs, and RAG tooling.
DOC_URLS = {
    "gemini_cookbook": "https://raw.githubusercontent.com/google-gemini/cookbook/main/README.md",
    "chroma_readme": "https://raw.githubusercontent.com/chroma-core/chroma/main/README.md",
    "langchain_readme": "https://raw.githubusercontent.com/langchain-ai/langchain/master/README.md",
    "llamaindex_readme": "https://raw.githubusercontent.com/run-llama/llama_index/main/README.md"
}

def fetch_text(url: str) -> str:
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    return r.text

raw_docs = {}
for name, url in DOC_URLS.items():
    raw_docs[name] = fetch_text(url)

print("Downloaded documents:", list(raw_docs.keys()))
for k, v in raw_docs.items():
    print(f"- {k}: {len(v):,} chars")


Downloaded documents: ['gemini_cookbook', 'chroma_readme', 'langchain_readme', 'llamaindex_readme']
- gemini_cookbook: 14,088 chars
- chroma_readme: 5,475 chars
- langchain_readme: 6,867 chars
- llamaindex_readme: 13,291 chars


In [5]:
def clean_text(text: str) -> str:
    text = text.replace("\r", "")
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

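def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150) -> list[str]:
    """Split cleaned text into fixed-size character windows that share `overlap` characters."""
    # Guard: with overlap >= chunk_size, `start` would never advance and the loop would not end.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    text = clean_text(text)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so adjacent chunks share context across the boundary.
        start = end - overlap
    return chunks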

records = []
for doc_id, text in raw_docs.items():
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"{doc_id}_chunk_{i}",
            "doc_id": doc_id,
            "chunk_index": i,
            "text": chunk,
            "source_url": DOC_URLS[doc_id],
        })

df_chunks = pd.DataFrame(records)
print("Total chunks:", len(df_chunks))
df_chunks.head(3)


Total chunks: 40


| | id | doc_id | chunk_index | text | source_url |
|---|---|---|---|---|---|
| 0 | gemini_cookbook_chunk_0 | gemini_cookbook | 0 | `# Welcome to the Gemini API Cookbook\n\nThis c...` | https://raw.githubusercontent.com/google-gemin... |
| 1 | gemini_cookbook_chunk_1 | gemini_cookbook | 1 | `ation).\n> \n> **🍌 Nano-Banana Pro**: Go banan...` | https://raw.githubusercontent.com/google-gemin... |
| 2 | gemini_cookbook_chunk_2 | gemini_cookbook | 2 | `Practical use cases demonstrating how to comb...` | https://raw.githubusercontent.com/google-gemin... |
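
The sliding window in `chunk_text` advances by `chunk_size - overlap` characters per chunk, so the chunk count can be predicted from the text length alone. The sketch below is a standalone illustration, not part of the pipeline: `expected_chunk_count` and `sliding_chunks` are hypothetical helper names, and `sliding_chunks` mirrors the windowing loop without the cleaning step. Applied to the raw character counts printed earlier, the formula gives 14 + 6 + 7 + 13 = 40, which agrees with the total above, assuming cleaning removes too few characters to change any count.

```python
import math


def expected_chunk_count(n_chars: int, chunk_size: int = 1200, overlap: int = 150) -> int:
    """Predict how many chunks a sliding window yields for a text of n_chars characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # how far the window start moves each iteration
    return 1 + math.ceil((n_chars - chunk_size) / step)


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Same windowing loop as the notebook's chunker, minus the cleaning step."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks


# Raw character counts copied from the download cell's output.
doc_lengths = {
    "gemini_cookbook": 14088,
    "chroma_readme": 5475,
    "langchain_readme": 6867,
    "llamaindex_readme": 13291,
}
predicted = {name: expected_chunk_count(n) for name, n in doc_lengths.items()}
print(predicted, "total:", sum(predicted.values()))
```

A smaller `overlap` lowers the chunk count (and the embedding cost) at the price of less shared context between neighbouring chunks.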


In [6]:
def extract_embeddings(resp):
    # Handles both single and batch embedding response shapes.
    if hasattr(resp, "embeddings") and resp.embeddings is not None:
        out = []
        for emb in resp.embeddings:
            if hasattr(emb, "values"):
                out.append(list(emb.values))
            elif isinstance(emb, dict) and "values" in emb:
                out.append(list(emb["values"]))
        return out

    if hasattr(resp, "embedding") and resp.embedding is not None:
        emb = resp.embedding
        if hasattr(emb, "values"):
            return [list(emb.values)]
        if isinstance(emb, dict) and "values" in emb:
            return [list(emb["values"])]

    raise ValueError("Unexpected embedding response format")

texts = df_chunks["text"].tolist()

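# RETRIEVAL_DOCUMENT marks these texts as the documents to be searched.
try:
    embed_resp = client.models.embed_content(
        model=EMBED_MODEL,
        contents=texts,
        config={"task_type": "RETRIEVAL_DOCUMENT"},
    )
except Exception as exc:
    # Fallback without task_type. This also retries after unrelated failures
    # (bad key, quota), so report the original error instead of hiding it.
    print("embed_content with task_type failed; retrying without it:", repr(exc))
    embed_resp = client.models.embed_content(
        model=EMBED_MODEL,
        contents=texts,
    )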

chunk_embeddings = extract_embeddings(embed_resp)

print("Embedding count:", len(chunk_embeddings))
print("Embedding dimension:", len(chunk_embeddings[0]))


Embedding count: 40
Embedding dimension: 3072
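
The cell above embeds all 40 chunks in a single request. For a larger corpus it may be necessary to split the texts into several requests; the following is a minimal sketch of that idea. The helper name `batched` and the batch size of 32 are assumptions for illustration, not documented limits, so check the embeddings API reference for the actual per-request cap.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy demonstration: 5 texts in batches of 2 gives batch sizes 2, 2, 1.
demo_batches = list(batched(["a", "b", "c", "d", "e"], 2))
print([len(b) for b in demo_batches])

# Hypothetical use with the notebook's objects (batch size 32 is an assumed value):
# chunk_embeddings = []
# for batch in batched(texts, 32):
#     resp = client.models.embed_content(
#         model=EMBED_MODEL,
#         contents=batch,
#         config={"task_type": "RETRIEVAL_DOCUMENT"},
#     )
#     chunk_embeddings.extend(extract_embeddings(resp))
```

Because the batches are consecutive slices, extending the result list in loop order keeps the embeddings aligned with the rows of `df_chunks`.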


In [7]:
# In-memory vector database (temporary).
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="gemini_rag_demo")

collection.add(
    ids=df_chunks["id"].tolist(),
    documents=df_chunks["text"].tolist(),
    embeddings=chunk_embeddings,
    metadatas=[
        {
            "doc_id": row.doc_id,
            "chunk_index": int(row.chunk_index),
            "source_url": row.source_url,
        }
        for row in df_chunks.itertuples(index=False)
    ],
)

print("Stored chunks in Chroma:", collection.count())


Stored chunks in Chroma: 40
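
Conceptually, the collection maps each ID to a vector and returns the stored vectors closest to a query vector. The toy sketch below does this by brute force. It is an illustration only: the names are hypothetical, the squared Euclidean distance is assumed here to correspond to Chroma's default `l2` space (check the Chroma docs), and Chroma itself searches an approximate index rather than scanning every vector.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_query(
    store: Dict[str, List[float]],
    query: Sequence[float],
    n_results: int,
) -> List[Tuple[str, float]]:
    """Return the n_results (id, distance) pairs closest to `query`, nearest first."""
    scored = [(item_id, squared_l2(vec, query)) for item_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Three toy 2-D "embeddings" keyed by ID.
toy_store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
nearest = brute_force_query(toy_store, [1.0, 0.1], n_results=2)
print(nearest)  # "a" is closest, then "c"
```

A linear scan like this is fine for 40 chunks; an index matters once the collection grows to many thousands of vectors.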


In [8]:
def retrieve(query: str, k: int = 8, n_candidates: int = 24, max_per_doc: int = 2):
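    # RETRIEVAL_QUERY marks this text as the search query, the counterpart of RETRIEVAL_DOCUMENT.
    try:
        q_resp = client.models.embed_content(
            model=EMBED_MODEL,
            contents=[query],
            config={"task_type": "RETRIEVAL_QUERY"},
        )
    except Exception as exc:
        # Fallback without task_type; report the original error so real failures stay visible.
        print("embed_content with task_type failed; retrying without it:", repr(exc))
        q_resp = client.models.embed_content(
            model=EMBED_MODEL,
            contents=[query],
        )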

    q_emb = extract_embeddings(q_resp)[0]

    res = collection.query(
        query_embeddings=[q_emb],
        n_results=n_candidates,
    )

    ranked = []
    for i in range(len(res["ids"][0])):
        ranked.append({
            "id": res["ids"][0][i],
            "document": res["documents"][0][i],
            "metadata": res["metadatas"][0][i],
            "distance": res["distances"][0][i],
        })

    # Keep top relevance while preventing one source from consuming all slots.
    selected = []
    per_doc = {}
    for r in ranked:
        doc_id = r["metadata"]["doc_id"]
        count = per_doc.get(doc_id, 0)
        if count >= max_per_doc:
            continue
        selected.append(r)
        per_doc[doc_id] = count + 1
        if len(selected) >= k:
            break

    # Fallback: if diversity cap is too strict, top up with best remaining chunks.
    if len(selected) < k:
        selected_ids = {x["id"] for x in selected}
        for r in ranked:
            if r["id"] in selected_ids:
                continue
            selected.append(r)
            if len(selected) >= k:
                break

    return selected


def answer_with_rag(question: str, k: int = 8):
    hits = retrieve(question, k=k)

    context_blocks = []
    citations = []
    for idx, h in enumerate(hits, start=1):
        md = h["metadata"]
        context_blocks.append(
            f"[Context {idx}] source={md['doc_id']} chunk={md['chunk_index']}\n{h['document']}"
        )
        citations.append((idx, md["source_url"]))

    context = "\n\n".join(context_blocks)

    prompt = f"""
You are answering using only the provided context.
If the answer is not supported by context, say: "I don't know based on the provided documents."
Keep the answer concise and include citation markers like [1], [2] tied to context blocks.

Question:
{question}

Context:
{context}
""".strip()

    resp = client.models.generate_content(
        model=GEN_MODEL,
        contents=prompt,
    )

    return {
        "question": question,
        "answer": resp.text,
        "hits": hits,
        "citations": citations,
    }

# Try a question.
result = answer_with_rag("How do Gemini cookbook and Chroma differ in purpose, and where does LangChain fit?")
print("Answer:\n")
print(result["answer"])

print("\nCitation map:")
for idx, url in result["citations"]:
    print(f"[{idx}] {url}")


Answer:

The Gemini cookbook and Chroma differ in their primary functions within the AI ecosystem:

*   **Gemini Cookbook** is a collection of guides, practical examples, and end-to-end demos focused on teaching users how to use Gemini API features, such as grounding, Batch API, and multimodal capabilities [1, 7, 8].
*   **Chroma** is a vector database used to store, embed, and query documents [3]. Its purpose is to enable "Chat your data" use cases by retrieving relevant document snippets based on natural language queries to provide context for an LLM [4].

**LangChain** serves as an orchestration framework and integration layer [5]. It fits by:
*   Providing a library of integrations that connect LLMs to vector stores like Chroma [4, 5].
*   Enabling model interoperability and rapid prototyping through a modular architecture [5].
*   Offering low-level agent orchestration (via LangGraph) to build complex backend agents for applications [6, 8].

Citation map:
[1] https://raw.githubuse
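
The per-document cap inside `retrieve` is easier to follow on toy data. The sketch below isolates that selection logic in a hypothetical helper, `cap_per_doc`, and runs it on a made-up ranked list in which one document dominates the top results. It assumes the input is already sorted by ascending distance, as the query results are.

```python
from typing import Dict, List


def cap_per_doc(ranked: List[Dict], k: int, max_per_doc: int) -> List[Dict]:
    """Pick up to k hits, at most max_per_doc per document, then top up if short."""
    selected: List[Dict] = []
    per_doc: Dict[str, int] = {}
    for r in ranked:
        if per_doc.get(r["doc_id"], 0) >= max_per_doc:
            continue  # this document already filled its quota
        selected.append(r)
        per_doc[r["doc_id"]] = per_doc.get(r["doc_id"], 0) + 1
        if len(selected) >= k:
            return selected
    # Top-up pass: the cap left free slots, so take the best skipped hits.
    chosen = {r["id"] for r in selected}
    for r in ranked:
        if len(selected) >= k:
            break
        if r["id"] not in chosen:
            selected.append(r)
    return selected


# Made-up ranking: document A holds four of the five best hits.
toy_ranked = [
    {"id": "a0", "doc_id": "A"},
    {"id": "a1", "doc_id": "A"},
    {"id": "a2", "doc_id": "A"},
    {"id": "b0", "doc_id": "B"},
    {"id": "a3", "doc_id": "A"},
]
picked = [r["id"] for r in cap_per_doc(toy_ranked, k=4, max_per_doc=2)]
print(picked)  # a2 is appended by the top-up pass, after b0
```

Note that topped-up hits land at the end of the list, so the result is no longer strictly ordered by distance; re-sort it if the prompt should present contexts from most to least similar.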

In [9]:
# Inspect retrieved chunks for debugging / teaching.
for i, h in enumerate(result["hits"], start=1):
    print("=" * 80)
    print(f"Hit {i} | distance={h['distance']:.4f} | metadata={h['metadata']}")
    print(textwrap.shorten(h["document"].replace("\n", " "), width=500, placeholder=" ..."))


Hit 1 | distance=0.4921 | metadata={'source_url': 'https://raw.githubusercontent.com/google-gemini/cookbook/main/README.md', 'chunk_index': 5, 'doc_id': 'gemini_cookbook'}
interactivity with Gemini. * **Recently Added Guides:** * [Grounding](./quickstarts/Grounding.ipynb) [![Colab](https://storage.googleapis.com/generativeai-downloads/images/colab_icon16.png)](https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Grounding.ipynb): Discover different ways to ground Gemini's answer using different tools, from Google Search to Youtube and URLs and the new [**Maps grounding**](https://colab.research.google.com/github/google- ...
Hit 2 | distance=0.4924 | metadata={'source_url': 'https://raw.githubusercontent.com/google-gemini/cookbook/main/README.md', 'chunk_index': 10, 'doc_id': 'gemini_cookbook'}
b) [![Colab](https://storage.googleapis.com/generativeai-downloads/images/colab_icon16.png)](https://colab.research.google.com/github/google-gemini/cookbook/blob/

## Exercises
1. Change `DOC_URLS` to your own domain corpus (policies, manuals, course notes).
2. Increase `k` in retrieval and compare answer quality vs. verbosity.
3. Add a reranking step (optional) before generation.
4. Add citation post-checking: verify each claim appears in retrieved chunks.
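
As a possible starting point for exercise 4, the sketch below extracts citation markers such as `[3]` or `[1, 7, 8]` from an answer and flags any index that does not correspond to a context block. It only checks that markers are in range, not that the cited chunk supports the claim, and the names `cited_indices` and `invalid_citations` as well as the sample answer are hypothetical.

```python
import re
from typing import List, Set

# Matches "[3]" and comma-separated groups such as "[1, 7, 8]".
MARKER_RE = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def cited_indices(answer: str) -> Set[int]:
    """Collect every context index mentioned in citation markers."""
    found: Set[int] = set()
    for group in MARKER_RE.findall(answer):
        found.update(int(part) for part in group.split(","))
    return found


def invalid_citations(answer: str, n_contexts: int) -> List[int]:
    """Return cited indices outside the valid range 1..n_contexts, sorted."""
    return sorted(i for i in cited_indices(answer) if not 1 <= i <= n_contexts)


# Made-up answer text for illustration; index 12 has no matching context block.
sample_answer = (
    "Chroma stores and queries documents [3]. "
    "The cookbook collects guides and demos [1, 7, 8]. See also [12]."
)
print(sorted(cited_indices(sample_answer)))
print(invalid_citations(sample_answer, n_contexts=8))
```

With the notebook's objects, the call would look like `invalid_citations(result["answer"], len(result["hits"]))`.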
