# Generation in Retrieval-Augmented Generation (RAG): Stuffing Documents

This notebook demonstrates the **generation phase** of a
Retrieval-Augmented Generation (RAG) pipeline using the
**document stuffing approach**.

In this approach, retrieved documents are injected directly
into the prompt context before generation.


## RAG Generation Overview

After retrieval, the generation step produces a final answer
by combining:

- The user question
- Retrieved document context
- A structured prompt
- A language model

This notebook uses **prompt stuffing**, where all retrieved
context is passed directly into the prompt.
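
As a minimal sketch of the idea, independent of LangChain, stuffing amounts to concatenating every retrieved chunk and placing the result in a single prompt. The template text, the `stuff_prompt` helper, and the sample chunks below are hypothetical stand-ins for illustration, not the objects this notebook builds later.

```python
# Hypothetical template and chunks, used only to illustrate prompt stuffing.
STUFFING_TEMPLATE = (
    "Answer the following question:\n{question}\n\n"
    "To answer the question, use only the following context:\n{context}\n"
)


def stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks and place them in one prompt."""
    context = "\n\n".join(chunks)
    return STUFFING_TEMPLATE.format(question=question, context=context)


chunks = [
    "Chunk A: text retrieved from the vector store.",
    "Chunk B: more text retrieved from the vector store.",
]
stuffed = stuff_prompt("What software do data scientists use?", chunks)
print(stuffed)
```

Every chunk reaches the model in one call, which is what makes the approach simple and also what ties it to the model's context window.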


In [None]:
import getpass
import os

from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


In [None]:
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

## Vector Store and Retriever

The vector store created during the indexing phase is loaded
from disk and wrapped as a retriever.

The retriever selects the top-k most relevant document chunks
for a given query.
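
To make "top-k most relevant" concrete, here is a brute-force sketch that ranks chunks by cosine similarity to the query embedding. It is an illustration only: the three-dimensional vectors and chunk ids are made up, and Chroma itself relies on its own index and configured distance metric rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Made-up embeddings; real ones have hundreds or thousands of dimensions.
chunk_vecs = {
    "chunk-python": [0.9, 0.1, 0.0],
    "chunk-sql": [0.7, 0.3, 0.1],
    "chunk-history": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```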


In [None]:
vectorstore = Chroma(
    persist_directory="./vectorstore/rag-practice",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

In [None]:
len(vectorstore.get()["documents"])  # Check the number of documents in the vector store

### Retriever Configuration

The retriever is configured to return the top-2
most relevant document chunks.


In [None]:
retriver = vectorstore.as_retriever(search_kwargs={"k": 2})

## Prompt Template for Document Stuffing

A prompt template is defined that:

- Receives the user question
- Injects retrieved context
- Restricts the model to using only the provided context
- Requires citation of source lectures
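
Because the template interpolates the context directly, a list of retrieved documents would be rendered through its default string representation. One possible refinement, sketched here under assumptions, is a helper that turns each chunk into plain text labeled with its lecture title, so the model can see which lecture to cite. The `Doc` class is a hypothetical stand-in exposing the `page_content` and `metadata` attributes of a LangChain `Document`, and the `"Lecture Title"` metadata key is an assumption about how the chunks were indexed.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, title_key="Lecture Title"):
    """Render each chunk as plain text, prefixed with its lecture title."""
    parts = []
    for doc in docs:
        # The metadata key is an assumption; fall back when it is missing.
        title = doc.metadata.get(title_key, "Unknown lecture")
        parts.append(f"Lecture: {title}\n{doc.page_content}")
    return "\n\n".join(parts)


docs = [
    Doc("Python and R are widely used.", {"Lecture Title": "Programming Languages"}),
    Doc("SQL is used to query databases.", {}),
]
context = format_docs(docs)
print(context)
```

In an LCEL chain, a plain function like this can be piped after the retriever (for example `retriver | format_docs`), since LCEL coerces callables into runnables.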


In [None]:
TEMPLATE = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}

At the end of the response, specify the names of the lectures this context is taken from in the format:
Resources: *Lecture Title*
where *Lecture Title* should be substituted with the titles of all source lectures.
'''

prompt_template = PromptTemplate.from_template(TEMPLATE)

## Language Model Initialization

The chat model is configured with `temperature=0` to make answers
as stable and reproducible as possible; identical outputs across
runs are not guaranteed.


In [None]:
chat = ChatOpenAI(
    model="gpt-5-nano",
    temperature=0,
    model_kwargs={
        "text": {"verbosity": "low"},
        "reasoning": {"effort": "medium"},
    },
)



## Preparing the RAG Generation Chain

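The chain is constructed using the LangChain Expression Language (LCEL).
At this stage it stops at the prompt template, so invoking it returns
the filled-in prompt rather than an answer. Its parts are: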

- The retriever supplies the context
- `RunnablePassthrough` forwards the user question
- The prompt template performs document stuffing
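
The data flow through these parts can be sketched in plain Python. This is a conceptual illustration, not how LangChain implements LCEL: `fake_retriever`, `passthrough`, `run_parallel`, and `fill_prompt` are hypothetical stand-ins for the retriever, `RunnablePassthrough`, the dictionary step, and the prompt template.

```python
def fake_retriever(question):
    """Hypothetical stand-in: a real retriever would query the vector store."""
    return ["chunk one", "chunk two"]


def passthrough(value):
    """Forward the input unchanged, as RunnablePassthrough does."""
    return value


def run_parallel(branches, value):
    """Feed the same input to every branch and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fill_prompt(inputs):
    """Substitute the collected values into a (shortened) template."""
    return "Question: {question}\nContext: {context}".format(**inputs)


def invoke(question):
    # Step 1: the dictionary step builds {'context': ..., 'question': ...}.
    inputs = run_parallel(
        {"context": fake_retriever, "question": passthrough}, question
    )
    # Step 2: the result is piped into the prompt template.
    return fill_prompt(inputs)


result = invoke("What software do data scientists use?")
print(result)
```

The single input string reaches both branches: one uses it to fetch context, the other forwards it untouched as the question.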


In [None]:
question = "What software do data scientists use?"

In [None]:
chain = {'context': retriver,
         'question': RunnablePassthrough()} | prompt_template

In [None]:
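# The chain ends at the prompt template here, so the result is the
# stuffed prompt, not a model answer. Invoke once and reuse the result.
stuffed_prompt = chain.invoke(question)
print("Stuffed Prompt:\n", stuffed_prompt.to_string())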

## Generating a Response

The full RAG generation chain adds:

- A chat model
- An output parser to extract plain text


In [None]:
chain = (
    {"context": retriver, "question": RunnablePassthrough()}
    | prompt_template
    | chat
    | StrOutputParser()
)

In [None]:

print("Generated Response:\n", chain.invoke(question))

## Summary

This notebook demonstrated the **generation phase** of a
Retrieval-Augmented Generation (RAG) pipeline using document stuffing:

- Loading a persisted vector store
- Retrieving relevant document chunks
- Injecting retrieved context into a structured prompt
- Generating a grounded answer using an LLM

Document stuffing is a simple and effective approach, but the
retrieved chunks, the question, and the instructions must together
fit within the model's context window.
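
One way to guard against overflowing the context window, sketched here under assumptions, is to keep chunks in ranked order until a token budget is reached. The four-characters-per-token estimate and the budget value are illustrative assumptions; a real pipeline would count tokens with the model's own tokenizer.

```python
def approx_tokens(text):
    """Rough estimate (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks, max_tokens):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


# Three made-up chunks of roughly 100 estimated tokens each.
chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = fit_to_budget(chunks, max_tokens=250)
print(len(kept))
```

Stopping at the first chunk that does not fit preserves the retriever's ranking; skipping it and trying smaller chunks further down the list is an alternative design.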
