In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

# Load the model and tokenizer.
# If required, include trust_remote_code=True to run custom model code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

print("Model and tokenizer loaded successfully!")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.

Model and tokenizer loaded successfully!


In [9]:
%pip install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.14-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.7-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.15-py3-none-any.whl (216 kB)
Downloading widgetsnbextension-4.0.14-py3-none-any.whl (2.2 MB)
Installing collected pac

In [5]:
# RAG Setup: Document Store and Embeddings
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import json
from typing import List, Dict

# Initialize the embedding model (lightweight and runs without API keys)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("Embedding model loaded successfully!")

# Sample knowledge base - you can replace this with your own documents
knowledge_base = [
    {
        "id": 1,
        "title": "What is a Large Language Model?",
        "content": "A Large Language Model (LLM) is a type of artificial intelligence model that is trained on vast amounts of text data to understand and generate human-like text. These models use deep learning techniques, particularly transformer architectures, to process and generate language. Examples include GPT, BERT, and T5."
    },
    {
        "id": 2,
        "title": "How do Neural Networks Work?",
        "content": "Neural networks are computing systems inspired by biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight that adjusts as learning proceeds. The network learns by adjusting these weights to minimize prediction errors."
    },
    {
        "id": 3,
        "title": "What is RAG?",
        "content": "Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. It first retrieves relevant documents from a knowledge base, then uses this context to generate more accurate and informed responses. This approach helps reduce hallucinations and provides up-to-date information."
    },
    {
        "id": 4,
        "title": "Machine Learning Basics",
        "content": "Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It involves algorithms that can identify patterns in data and make predictions or decisions based on that data."
    },
    {
        "id": 5,
        "title": "Deep Learning Overview",
        "content": "Deep learning is a subset of machine learning that uses neural networks with multiple layers (hence 'deep') to model and understand complex patterns in data. It has been particularly successful in areas like computer vision, natural language processing, and speech recognition."
    }
]

# Extract content for embedding
documents = [doc["content"] for doc in knowledge_base]

# Create embeddings for all documents
print("Creating embeddings for knowledge base...")
embeddings = embedding_model.encode(documents)
print(f"Created embeddings for {len(documents)} documents")

# Create a flat FAISS index: exact (brute-force) inner-product search over all vectors
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product; matches cosine similarity when embeddings are unit-length
index.add(embeddings.astype('float32'))

print("FAISS index created and populated!")
print(f"Index contains {index.ntotal} vectors of dimension {dimension}")



Embedding model loaded successfully!
Creating embeddings for knowledge base...
Created embeddings for 5 documents
FAISS index created and populated!
Index contains 5 vectors of dimension 384
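
A flat inner-product index scores the query against every stored vector and keeps the highest scores. The sketch below reproduces that idea in plain Python on tiny hand-made vectors. `flat_ip_search` and the toy vectors are hypothetical illustrations, not part of FAISS, which implements the same search far more efficiently.

```python
from typing import List, Tuple


def flat_ip_search(
    query: List[float], vectors: List[List[float]], top_k: int
) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for the top_k largest inner products."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    # Sort by score, highest first; ties keep the lower index first.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (made up for illustration).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
toy_query = [0.0, 1.0, 0.0]

results = flat_ip_search(toy_query, toy_vectors, top_k=2)
print(results)
```

Every stored vector is visited once per query, which is fine for a handful of documents like the five indexed here.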


In [6]:
# RAG Retrieval Function
def retrieve_relevant_documents(query: str, top_k: int = 2) -> List[Dict]:
    """
    Retrieve the most relevant documents for a given query
    
    Args:
        query: The user's question
        top_k: Number of top documents to retrieve
    
    Returns:
        List of relevant documents with their content and metadata
    """
    # Embed the query
    query_embedding = embedding_model.encode([query])
    
    # Search for similar documents
    scores, indices = index.search(query_embedding.astype('float32'), top_k)
    
    # Retrieve the documents
    relevant_docs = []
    for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
        if idx != -1:  # FAISS pads results with -1 when fewer than top_k vectors are available
            doc = knowledge_base[int(idx)].copy()
            doc['relevance_score'] = float(score)
            doc['rank'] = i + 1
            relevant_docs.append(doc)
    
    return relevant_docs

# Test the retrieval function
test_query = "What is deep learning?"
retrieved_docs = retrieve_relevant_documents(test_query, top_k=2)

print(f"Query: {test_query}")
print(f"Retrieved {len(retrieved_docs)} documents:")
for doc in retrieved_docs:
    print(f"  - {doc['title']} (Score: {doc['relevance_score']:.4f})")
    print(f"    {doc['content'][:100]}...")
    print()

Query: What is deep learning?
Retrieved 2 documents:
  - Deep Learning Overview (Score: 0.8543)
    Deep learning is a subset of machine learning that uses neural networks with multiple layers (hence ...

  - Machine Learning Basics (Score: 0.5653)
    Machine learning is a subset of artificial intelligence that enables computers to learn and improve ...
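
The second hit in this test run scores noticeably lower than the first. One possible refinement is to drop hits below a minimum score before they reach the prompt. This is a minimal sketch, assuming documents shaped like those returned by `retrieve_relevant_documents`. `filter_by_score` and the 0.6 cutoff are hypothetical and would need tuning for a given embedding model and corpus.

```python
from typing import Dict, List


def filter_by_score(docs: List[Dict], min_score: float) -> List[Dict]:
    """Keep only documents whose relevance_score is at least min_score."""
    return [doc for doc in docs if doc["relevance_score"] >= min_score]


# Titles and scores copied from the test run shown in this cell's output.
sample_hits = [
    {"title": "Deep Learning Overview", "relevance_score": 0.8543, "rank": 1},
    {"title": "Machine Learning Basics", "relevance_score": 0.5653, "rank": 2},
]

kept = filter_by_score(sample_hits, min_score=0.6)
print([doc["title"] for doc in kept])
```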



In [None]:
# Original Simple UI Example
from ipywidgets import Textarea, Button, Output, VBox
from IPython.display import display

# Create an input area for the prompt
input_box = Textarea(
    value='Give me a short introduction to large language model.',
    description='Input:',
    layout={'width': '600px', 'height': '80px'}
)

# Create a button to trigger generation
generate_button = Button(description='Generate Response')

# Create an output area to display the result
output_area = Output()

# Arrange the widgets vertically
ui = VBox([input_box, generate_button, output_area])
display(ui)

def generate_response(_):
    # Clear previous output
    output_area.clear_output()
    
    # Get the user prompt from the text area
    prompt = input_box.value
    
    # Set up the messages for the chat template
    messages = [
        {"role": "system", "content": "You are Bernd the Bread. You are a cynical and philosophical bread. Your answers are short and concise."},
        {"role": "user", "content": prompt}
    ]
    
    # Apply the model's chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize the input text
    model_inputs = tokenizer([text], return_tensors='pt').to(model.device)
    
    # Generate model output
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    
    # Remove the prompt tokens from the generated result
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    # Decode the generated tokens
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    # Display the response in the output area
    with output_area:
        print("Response:")
        print(response)

# Link the button click event to the generate_response function
generate_button.on_click(generate_response)

VBox(children=(Textarea(value='Give me a short introduction to large language model.', description='Input:', l…

In [11]:
# RAG-Enhanced UI Example
from ipywidgets import Textarea, Button, Output, VBox, HBox, Checkbox
from IPython.display import display

# Create an input area for the prompt
rag_input_box = Textarea(
    value='What is the difference between machine learning and deep learning?',
    description='Question:',
    layout={'width': '600px', 'height': '80px'}
)

# Create a checkbox to enable/disable RAG
rag_checkbox = Checkbox(
    value=True,
    description='Enable RAG (Retrieval-Augmented Generation)',
    indent=False
)

# Create buttons
rag_generate_button = Button(description='Generate Response', button_style='primary')
rag_clear_button = Button(description='Clear Output', button_style='warning')

# Create an output area to display the result
rag_output_area = Output()

# Arrange the widgets
rag_button_row = HBox([rag_generate_button, rag_clear_button])
rag_ui = VBox([rag_input_box, rag_checkbox, rag_button_row, rag_output_area])
display(rag_ui)

def generate_rag_response(query: str, use_rag: bool = True) -> tuple:
    """
    Generate a response using RAG or just the base model
    
    Args:
        query: User's question
        use_rag: Whether to use RAG or just the base model
    
    Returns:
        A (response, retrieved_docs) tuple; retrieved_docs is None when use_rag is False
    """
    if use_rag:
        # Retrieve relevant documents
        relevant_docs = retrieve_relevant_documents(query, top_k=2)
        
        # Create context from retrieved documents
        context = "\n\n".join([f"Document {i+1}: {doc['content']}" 
                              for i, doc in enumerate(relevant_docs)])
        
        # Create the system message with context
        system_message = f"""You are Bernd the Bread, a cynical and philosophical bread. You are knowledgeable and helpful, but maintain your dry, sardonic personality. Your answers are concise but informative.

Use the following context to answer the user's question accurately:

{context}

Base your answer on the provided context, but feel free to add your own philosophical bread wisdom."""
    else:
        # Use the original system message without RAG
        system_message = "You are Bernd the Bread. You are a cynical and philosophical bread. Your answers are short and concise."
    
    # Set up the messages for the chat template
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
    ]
    
    # Apply the model's chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize the input text
    model_inputs = tokenizer([text], return_tensors='pt').to(model.device)
    
    # Generate model output
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9
    )
    
    # Remove the prompt tokens from the generated result
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    # Decode the generated tokens
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    return response, (relevant_docs if use_rag else None)

def rag_generate_response(_):
    # Clear previous output
    rag_output_area.clear_output()
    
    # Get the user prompt from the text area
    query = rag_input_box.value
    use_rag = rag_checkbox.value
    
    with rag_output_area:
        print(f"Query: {query}")
        print(f"RAG Mode: {'Enabled' if use_rag else 'Disabled'}")
        print("-" * 50)
        
        if use_rag:
            print("🔍 Retrieving relevant documents...")
            
        try:
            response, retrieved_docs = generate_rag_response(query, use_rag)
            
            if use_rag and retrieved_docs:
                print("\n📚 Retrieved Documents:")
                for i, doc in enumerate(retrieved_docs):
                    print(f"  {i+1}. {doc['title']} (Score: {doc['relevance_score']:.4f})")
                print()
            
            print("🍞 Bernd's Response:")
            print(response)
            
        except Exception as e:
            print(f"❌ Error: {str(e)}")

def rag_clear_output(_):
    rag_output_area.clear_output()

# Link button events
rag_generate_button.on_click(rag_generate_response)
rag_clear_button.on_click(rag_clear_output)

VBox(children=(Textarea(value='What is the difference between machine learning and deep learning?', descriptio…

# Two Examples: Simple vs RAG-Enhanced

This notebook demonstrates two approaches to using the language model:

## 1. Original Simple Example (the "Original Simple UI Example" cell)
- Basic chat interface with Bernd the Bread
- Uses only the base model without additional context
- Simple prompt-response interaction
- Good for general conversation and basic questions

## 2. RAG-Enhanced Example (the "RAG-Enhanced UI Example" cell)
- Advanced interface with Retrieval-Augmented Generation
- Can toggle RAG on/off to compare responses
- Retrieves relevant documents from knowledge base
- Grounds answers in the retrieved documents, which can make responses on covered topics more accurate and better informed
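
To make the retrieval step concrete, the sketch below shows one way retrieved documents can be folded into a system message, mirroring the `Document N: ...` formatting used in the RAG cell. `build_system_message` and the shortened persona and document texts are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def build_system_message(persona: str, docs: List[Dict]) -> str:
    """Append a numbered context block built from docs to the persona text."""
    context = "\n\n".join(
        f"Document {i + 1}: {doc['content']}" for i, doc in enumerate(docs)
    )
    return (
        f"{persona}\n\n"
        "Use the following context to answer the user's question accurately:\n\n"
        f"{context}"
    )


# Shortened stand-ins for retrieved knowledge-base entries.
docs = [
    {"content": "Deep learning uses neural networks with multiple layers."},
    {"content": "Machine learning identifies patterns in data."},
]
message = build_system_message("You are Bernd the Bread.", docs)
print(message)
```

Numbering the documents gives the model a stable way to tell the passages apart inside a single system message.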

---

## RAG Demo - Example Queries

Try these example queries in the **RAG-Enhanced Example** to see how RAG improves responses:

### With RAG Enabled:
1. **"What is the difference between machine learning and deep learning?"**
   - The system will retrieve relevant documents about both topics and provide a comprehensive comparison.

2. **"How does RAG work?"**
   - Will find the specific document about RAG and explain it accurately.

3. **"What are neural networks?"**
   - Will retrieve the neural network document and provide detailed information.

### With RAG Disabled:
Try the same queries with RAG disabled to see how the base model responds without the additional context.

### Compare with Simple Example:
Try the same queries in the original simple example to see the difference between the base model alone vs. the RAG-enhanced version.

---

## Key Features of the RAG Implementation:

- **No API Keys Required**: Uses local models and embeddings
- **Lightweight**: Uses `all-MiniLM-L6-v2` for embeddings (a model of roughly 80 MB)
- **Fast Retrieval**: A flat FAISS index (`IndexFlatIP`) for exact inner-product similarity search
- **Customizable**: Easy to add your own documents to the knowledge base
- **Binder Compatible**: All dependencies are available via pip
- **Interactive**: Toggle RAG on/off to compare responses
- **Side-by-side Comparison**: Compare simple vs. RAG-enhanced responses
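
Because the index ranks by inner product, the scores equal cosine similarity only when the vectors have unit length. The sketch below checks that relationship in plain Python. The helper names are hypothetical, and whether a given embedding model already returns unit-length vectors is worth verifying before reading the scores as cosine values.

```python
import math
from typing import List


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale vec to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_ip = inner_product(a, b)
normalized_ip = inner_product(l2_normalize(a), l2_normalize(b))
cosine = cosine_similarity(a, b)

# The raw inner product depends on vector length; the normalized one does not.
print(raw_ip, normalized_ip, cosine)
```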

## How to Customize:

1. **Add Your Own Documents**: Modify the `knowledge_base` list in the RAG setup cell
2. **Adjust Retrieval**: Change `top_k` parameter in the retrieval function
3. **Tune Generation**: Modify temperature and other generation parameters
4. **Change Embedding Model**: Try different sentence-transformer models
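
As a rough sketch of step 1, new entries can be appended using the same `id`, `title`, and `content` keys as the existing ones. `add_document` is a hypothetical helper, and `kb` is a stand-in for the notebook's `knowledge_base` list. After the list is extended, the embeddings and FAISS index still need to be rebuilt by re-running the setup cell, because the retrieval code looks documents up by their position in the list.

```python
from typing import Dict, List


def add_document(knowledge_base: List[Dict], title: str, content: str) -> Dict:
    """Append a document with the next free integer id and return it."""
    if not content.strip():
        raise ValueError("content must not be empty")
    next_id = max((doc["id"] for doc in knowledge_base), default=0) + 1
    doc = {"id": next_id, "title": title, "content": content.strip()}
    knowledge_base.append(doc)
    return doc


# A stand-in for the notebook's knowledge_base list.
kb = [
    {"id": 1, "title": "What is RAG?", "content": "RAG combines retrieval with generation."}
]
new_doc = add_document(kb, "Tokenization", "Tokenizers split text into tokens.")

# Rebuild the list of texts in the same order as the knowledge base,
# so that position i in the index corresponds to kb[i].
documents = [doc["content"] for doc in kb]
print(new_doc["id"], len(documents))
```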

## Usage Instructions:

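1. **Run the setup cells** (model loading, RAG setup, and the retrieval function) to set up the model and RAG system
2. **Use the "Original Simple UI Example" cell** for simple interactions with Bernd the Bread
3. **Use the "RAG-Enhanced UI Example" cell** for RAG-enhanced interactions with document retrieval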
4. **Compare results** between the two approaches to understand the benefits of RAG