# Agentic AI Retrieval-Augmented Generation (RAG) on **Windows 11**
### 2 Agents (Researcher + Answerer) • Vector DB (**FAISS**) • Local LLM via **Ollama**

- Run an open model locally with **Ollama** (supports Windows, NVIDIA/AMD GPUs).
- Use **LangChain** integrations for Ollama chat and embeddings.
- Store and search document embeddings with **FAISS** (free, open-source).
> **Model tip (8GB VRAM)**: Use a 4-bit quantized **Llama 3.1 8B**. This guide pulls the `q4_0` build, which is ~4.7GB on disk and ~4.9GB once loaded.


## Requirements
- **Windows 11**; Ollama runs natively and serves its API at `http://localhost:11434`.
- **GPU optional** (NVIDIA/AMD). CPU-only works too.
- **Memory**: target **8GB VRAM** + **64GB RAM**; the `q4_0` build of Llama 3.1 8B used here is ~4.7GB on disk and ~4.9GB once loaded.
- **Vector DB**: **FAISS** via LangChain integration.
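
As a rough check on the memory figures above, the sketch below estimates weight and KV-cache sizes from first principles. The bit widths (`q4_0` at about 4.5 bits per weight, `q8_0` at about 8.5 bits per value, each counting one 16-bit scale per block of 32) and the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the published formats and model config, not from this guide. Real files come out a little larger because some tensors are stored at higher precision.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context_len: int, bits_per_value: float,
                n_layers: int = 32, n_kv_heads: int = 8, head_dim: int = 128) -> float:
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return context_len * values_per_token * bits_per_value / 8 / 1e9


# Assumed figures: 8.0e9 parameters and a 4096-token context.
print(f"weights (q4_0):  {weights_gb(8.0e9, 4.5):.2f} GB")
print(f"KV cache (f16):  {kv_cache_gb(4096, 16):.2f} GB")
print(f"KV cache (q8_0): {kv_cache_gb(4096, 8.5):.2f} GB")
```

Under these assumptions the weights dominate the budget, and an 8-bit KV cache takes roughly half the space of an f16 one at the same context length.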


## Step 0 — Install Ollama (Windows) & pull models
1. Install Ollama for Windows and set the environment variables for KV-cache quantization

    ```powershell
    winget install --id Ollama.Ollama
    ```

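    Set these environment variables to turn on flash attention and quantize the KV cache to 8-bit (`q8_0`). `setx` only affects processes started afterwards, so restart Ollama once they are set.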

    ```powershell
    setx OLLAMA_FLASH_ATTENTION 1
    setx OLLAMA_KV_CACHE_TYPE "q8_0"
    ```

    Or type `env` into the Windows search box and select "Edit the system environment variables".

2. Start Ollama. The API listens on `http://localhost:11434`.

    ```powershell
    ollama serve
    ```

    Or launch the Ollama desktop app, which starts the server for you. :)

3. Pull a quantized LLM that fits in 6GB to 8GB of VRAM:

    ```powershell
    ollama pull llama3.1:8b-instruct-q4_0
    ```

4. Pull an embedding model (`nomic-embed-text` or `embeddinggemma`); the rest of this guide uses `nomic-embed-text`:

    ```powershell
    ollama pull nomic-embed-text
    ```

    or

    ```powershell
    ollama pull embeddinggemma
    ```

5. Optional: relocate model storage by setting the **`OLLAMA_MODELS`** environment variable in Windows.
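
To confirm the variables from the steps above are visible to new processes, a minimal sketch like the following can be run from a fresh terminal or notebook kernel. `ollama_env_report` is a hypothetical helper name, and it only reports what the Python process sees, which may differ from what an already-running Ollama server picked up.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the steps above.
OLLAMA_VARS = ("OLLAMA_FLASH_ATTENTION", "OLLAMA_KV_CACHE_TYPE", "OLLAMA_MODELS")


def ollama_env_report(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return each Ollama-related variable and its value (None when unset)."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in OLLAMA_VARS}


for name, value in ollama_env_report().items():
    print(f"{name} = {value if value is not None else '(not set)'}")
```

If a variable shows as not set right after `setx`, open a new terminal (or restart the notebook kernel) and run the check again.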


## Step 1 — Python environment & dependencies
We use LangChain for ChatOllama + OllamaEmbeddings, and FAISS for vector search.


In [1]:
# Ensure your virtual environment is active, then install dependencies
!python -m pip install --quiet --upgrade pip
!pip install --quiet jupyter langchain langchain-community langchain-ollama langchain-text-splitters faiss-cpu langgraph python-dotenv ollama
# Optional: PyTorch (CUDA 12.8 wheels) is only needed for the GPU check in the next cell
!pip install torch --index-url https://download.pytorch.org/whl/cu128

Looking in indexes: https://download.pytorch.org/whl/cu128


In [2]:
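# Optional: check whether PyTorch sees a CUDA GPU ("cuda") or falls back to "cpu".
# Ollama detects the GPU on its own, so the rest of the notebook works either way.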

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

Using device: cuda
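
If an import fails later in the notebook, it can help to see which of the packages from the install cell actually landed in the active environment. The following is a small sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell above.
PACKAGES = ["langchain", "langchain-community", "langchain-ollama",
            "faiss-cpu", "langgraph", "ollama"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:22} {ver or 'NOT INSTALLED'}")
```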


## Step 2 — Verify Ollama
Ollama exposes a local API; clients connect to `http://localhost:11434`.


In [3]:

import json
from ollama import Client

client = Client(host='http://localhost:11434')
tags = client.list()

print(json.dumps(tags.model_dump(), indent=2, default=str)[:1000])



{
  "models": [
    {
      "model": "nomic-embed-text:latest",
      "modified_at": "2025-12-14 10:13:56.860727-07:00",
      "digest": "0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f",
      "size": 274302450,
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "nomic-bert",
        "families": [
          "nomic-bert"
        ],
        "parameter_size": "137M",
        "quantization_level": "F16"
      }
    },
    {
      "model": "llama3.1:8b-instruct-q4_0",
      "modified_at": "2025-12-14 08:03:50.144636-07:00",
      "digest": "42182419e9508c30c4b1fe55015f06b65f4ca4b9e28a744be55008d21998a093",
      "size": 4661230766,
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "8.0B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
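
Rather than reading the JSON by eye, the listing can be checked against the models this guide needs. The sketch below assumes the response shape shown above (a `models` list whose entries carry a `model` field); `missing_models` and `installed_model_names` are hypothetical helper names.

```python
REQUIRED = ("llama3.1:8b-instruct-q4_0", "nomic-embed-text")


def missing_models(installed, required=REQUIRED):
    """Return the required models absent from `installed`.

    A requirement without a tag (no ':') is treated as ':latest'.
    """
    have = set(installed)
    missing = []
    for name in required:
        wanted = name if ":" in name else name + ":latest"
        if wanted not in have:
            missing.append(name)
    return missing


def installed_model_names(client):
    """Read model names from an `ollama.Client`, using the `model` field shown above."""
    return [m.model for m in client.list().models]


# In the notebook: missing_models(installed_model_names(client))
print(missing_models(["nomic-embed-text:latest", "llama3.1:8b-instruct-q4_0"]))
```

An empty list means both models are present; anything else names what still needs an `ollama pull`.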


## Step 3 — Create a small corpus


In [4]:
from pathlib import Path

# Set DATA_PATH to a folder you can write to; change it to match your filesystem.
# Some locations raise permission errors, so the Downloads directory under your
# user profile is usually a safe choice.

DATA_PATH = (Path.home() / "Downloads" / "data").resolve()


In [5]:
import textwrap

# Ensure the folder exists
DATA_PATH.mkdir(parents=True, exist_ok=True)

# Define documents using *file names only*
docs = {
    "rag_overview.txt": "RAG couples an LLM with a retriever over external knowledge to ground responses.",
    "windows_gpu.txt": "On Windows 11, Ollama runs natively and can accelerate models on NVIDIA/AMD GPUs when available.",
    "faiss_intro.txt": "FAISS is a free/open-source library for efficient similarity search over dense embeddings.",
    "agents.txt": "Two agents: Researcher retrieves context; Answerer synthesizes final response.",
}

# Write each file into the specified path
for filename, content in docs.items():
    file_path = DATA_PATH / filename
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(textwrap.fill(content, width=100))

# Inspect what was written
files = list(DATA_PATH.iterdir())
print(len(files), [p.name for p in files])


4 ['agents.txt', 'faiss_intro.txt', 'rag_overview.txt', 'windows_gpu.txt']


## Step 4 — Build embeddings (Ollama) & index in FAISS
Use **OllamaEmbeddings** and **FAISS** vector store.
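
The cell below splits documents with `chunk_size=500` and `chunk_overlap=50`. To show what those two numbers mean, here is a deliberately simplified character-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, lines, and spaces), and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


long_text = "".join(str(i % 10) for i in range(1200))  # made-up 1200-character input
print([len(c) for c in sliding_chunks(long_text)])
print(len(sliding_chunks("A short document stays in one chunk.")))
```

Each corpus file in this guide is far shorter than 500 characters, so every file ends up as a single chunk.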


In [6]:
from pathlib import Path
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Collect the .txt files written in Step 3
paths = list(DATA_PATH.glob("*.txt"))

# Fallback: if DATA_PATH has no .txt files, look in a relative 'data' folder
if not paths:
    paths = list(Path("data").glob("*.txt"))

# Wrap each file in a Document, recording its path as the source
raw_docs = []
for p in paths:
    text = p.read_text(encoding="utf-8")
    raw_docs.append(Document(page_content=text, metadata={"source": str(p)}))

# Split, embed, and index
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(raw_docs)

emb = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
vectorstore = FAISS.from_documents(chunks, emb)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

print(f"Indexed {len(chunks)} chunks into FAISS.")



Indexed 4 chunks into FAISS.
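
Conceptually, the retriever embeds the question and returns the `k` stored vectors closest to it. Assuming the default flat index with Euclidean (L2) distance, the search amounts to the brute-force comparison sketched below in plain Python. The vectors are made-up 3-dimensional stand-ins and `top_k_l2` is a hypothetical name; FAISS does the same job with optimized native code.

```python
def top_k_l2(query, vectors, k=3):
    """Return (index, squared L2 distance) for the k vectors closest to `query`."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((idx, dist))
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embedding vectors are much longer.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(top_k_l2([1.0, 0.0, 0.0], toy_vectors, k=2))
```

With only four chunks in the index and `k=3`, the retriever in this guide returns most of the corpus for any question, which is fine for a smoke test but worth remembering when judging relevance.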


## Step 5 — Two-agent workflow (LangGraph)
Researcher → retrieves chunks; Answerer → generates grounded response with ChatOllama.


In [7]:
from typing import List, TypedDict

from langchain_ollama import ChatOllama
from langgraph.graph import END, START, StateGraph

class RAGState(TypedDict):
    question: str
    contexts: List[str]
    answer: str

llm = ChatOllama(model='llama3.1:8b-instruct-q4_0', base_url='http://localhost:11434', temperature=0)

def researcher_node(state: RAGState) -> dict:
    # Retrieve the top-k chunks from FAISS and return them as a partial state update
    docs = retriever.invoke(state['question'])
    return {'contexts': [d.page_content for d in docs]}

def answerer_node(state: RAGState) -> dict:
    # Generate the final answer, grounded in the retrieved contexts
    system = 'Use only the provided contexts. If insufficient, say you are not sure and suggest data to add.'
    ctx = '\n\n'.join(state['contexts'])
    prompt = [('system', system), ('human', 'Question: ' + state['question'] + '\nContexts:\n' + ctx + '\nAnswer:')]
    resp = llm.invoke(prompt)
    return {'answer': resp.content}

graph = StateGraph(RAGState)
graph.add_node('researcher', researcher_node)
graph.add_node('answerer', answerer_node)
graph.add_edge(START, 'researcher')
graph.add_edge('researcher', 'answerer')
graph.add_edge('answerer', END)  # mark the end of the run explicitly
app = graph.compile()
print('Two-agent workflow compiled.')


Two-agent workflow compiled.
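
To make the data flow concrete without LangGraph or a running model, the sketch below replays the same two steps over a plain dictionary. `run_pipeline`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins for the graph, the FAISS retriever, and ChatOllama; only the prompt layout is copied from the cell above.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(question: str,
                 retrieve: Callable[[str], List[str]],
                 generate: Callable[[str], str]) -> Dict[str, Any]:
    """Run researcher then answerer, merging each partial update into the state."""
    state: Dict[str, Any] = {"question": question, "contexts": [], "answer": ""}

    # Researcher: fetch contexts for the question.
    state.update({"contexts": retrieve(question)})

    # Answerer: build the grounded prompt and generate a reply.
    ctx = "\n\n".join(state["contexts"])
    prompt = "Question: " + question + "\nContexts:\n" + ctx + "\nAnswer:"
    state.update({"answer": generate(prompt)})
    return state


def fake_retrieve(question: str) -> List[str]:
    # Stand-in for the FAISS retriever.
    return ["RAG couples an LLM with a retriever over external knowledge."]


def fake_generate(prompt: str) -> str:
    # Stand-in for the LLM call.
    return f"(model reply to a {len(prompt)}-character prompt)"


print(run_pipeline("What is RAG?", fake_retrieve, fake_generate))
```

Swapping in stubs like these is also a cheap way to test the wiring before spending GPU time on real generations.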


## Step 6 — Test the pipeline


In [8]:
query = 'What is RAG and how are the two agents used here?'
final_state = app.invoke({'question': query, 'contexts': [], 'answer': ''})
print('--- Retrieved contexts ---')
for i, c in enumerate(final_state['contexts'], 1):
    print(f'[{i}]', c[:300], '...')
print('\n--- Final answer ---')
print(final_state['answer'])


--- Retrieved contexts ---
[1] Two agents: Researcher retrieves context; Answerer synthesizes final response. ...
[2] RAG couples an LLM with a retriever over external knowledge to ground responses. ...
[3] FAISS is a free/open-source library for efficient similarity search over dense embeddings. ...

--- Final answer ---
Based on the provided contexts, I'll try to answer your question.

RAG stands for Retrieval-Augmented Generator. It's a technique used in natural language processing (NLP) that couples a Large Language Model (LLM) with a retriever over external knowledge to ground responses. In other words, RAG uses an LLM to generate text and a retriever to fetch relevant information from external sources, such as databases or articles.

The two agents mentioned are:

1. Researcher: This agent retrieves context from external knowledge sources.
2. Answerer: This agent synthesizes the final response using the retrieved context and the Large Language Model (LLM).

In this setup, the Res
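
The preview loop in the test cell prints `...` after every context, including ones shorter than 300 characters. A small helper that only marks text it actually cut might look like this (`preview` is a hypothetical name, not part of the notebook):

```python
def preview(text: str, limit: int = 300) -> str:
    """Return `text` unchanged if it fits, else the first `limit` characters plus '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("Two agents: Researcher retrieves context; Answerer synthesizes final response."))
print(preview("x" * 400, limit=10))
```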

#### Tip: run `ollama ps` from PowerShell to see which models are loaded and whether they are running on the GPU.

In [9]:
!powershell -Command "ollama ps"


NAME                         ID              SIZE      PROCESSOR    CONTEXT    UNTIL              
llama3.1:8b-instruct-q4_0    42182419e950    4.9 GB    100% GPU     4096       4 minutes from now    
nomic-embed-text:latest      0a109f422b47    595 MB    100% GPU     8192       4 minutes from now    
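
If you want to act on this table in code, for example to warn when a model is not fully on the GPU, the text can be parsed with a few lines of Python. This sketch assumes columns are separated by two or more spaces, as in the output above; `parse_ollama_ps` is a hypothetical helper and the column layout may differ between Ollama versions.

```python
import re


def parse_ollama_ps(output: str):
    """Parse `ollama ps` text into a list of dicts keyed by the header row."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = re.split(r"\s{2,}", lines[0])
    return [dict(zip(header, re.split(r"\s{2,}", ln))) for ln in lines[1:]]


# Sample copied from the output above.
sample = """\
NAME                         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
llama3.1:8b-instruct-q4_0    42182419e950    4.9 GB    100% GPU     4096       4 minutes from now
nomic-embed-text:latest      0a109f422b47    595 MB    100% GPU     8192       4 minutes from now
"""
for row in parse_ollama_ps(sample):
    print(row["NAME"], "->", row["PROCESSOR"])
```

In the notebook, the input could come from `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout`.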
