In [None]:
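import os, sys, platform
from getpass import getpass

# 🔑 Provide your project key via the environment, or enter it when prompted
# (avoids hard-coding the secret in the notebook)
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY: ")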

# Verify
print("Python:", sys.version.split()[0], "| Platform:", platform.system())
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY not set"
print("OK: OPENAI_API_KEY detected")


Python: 3.11.9 | Platform: Windows
OK: OPENAI_API_KEY detected



# Demo 01 — Minimal RAG (Text file → FAISS → OpenAI LLM)

This version keeps your structure but swaps the heavy local HF models for OpenAI API calls (embeddings and chat) to minimize setup friction on Windows.  
Pipeline: **TextLoader → Chunking → OpenAI Embeddings (`text-embedding-3-small`) → FAISS → Retriever → Prompt → OpenAI Chat → Answer**.


# __Demo: RAG Implementation from scratch__


## What each stage does

- **Loader**: Reads a local `.txt` into `Document` objects. Use absolute paths to avoid `FileNotFoundError`.
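- **Splitter**: `RecursiveCharacterTextSplitter` cuts the text into chunks of up to 512 characters with a 64-character overlap, so context that straddles a boundary appears in both neighbors; the sizes balance recall against duplication.
- **Embeddings**: OpenAI `text-embedding-3-small` (1536-dim), called through the API, so no local model download is needed.
- **Vector store**: FAISS in-memory index; fast nearest-neighbor search over embedding vectors. LangChain's wrapper uses L2 distance by default, which ranks unit-length vectors the same way cosine similarity does.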
- **Retriever**: Top-4 chunks by similarity; tune `k` for precision/recall tradeoff.
- **Prompt**: Minimal instruction; expects `{question}` + formatted `{context}`.
- **LLM**: `gpt-4o-mini` via `langchain-openai`; set `OPENAI_API_KEY` (project key).
- **Chain**: LCEL composes retrieval, prompt fill, model call, and parsing into a single callable.


### Steps to be followed:

1. Install and import the dependencies
2. Load the document
3. Split the document into chunks
4. Generate embeddings for each chunk
5. Build the FAISS vector store and create a retriever
6. Design a prompt template for the language model
7. Configure the OpenAI chat model
8. Set up the generation pipeline and chain the components
9. Invoke the pipeline with a query


## Operational notes

- Put your corpus file next to the notebook or into `./data/`. Change `data_file` accordingly.
- For multi-file corpora, switch to `DirectoryLoader` or concatenate `documents` lists.
- To persist the FAISS index:
  ```python
  vectorstore.save_local("index/faiss_idx")
  vectorstore = FAISS.load_local("index/faiss_idx", embed, allow_dangerous_deserialization=True)
  ```
- To use the larger OpenAI embedding model instead of `text-embedding-3-small` (3072-dim vectors, higher $/token; rebuild the index after switching):
  ```python
  from langchain_openai import OpenAIEmbeddings
  embed = OpenAIEmbeddings(model="text-embedding-3-large")
  ```


# **Step 1: Install and import the dependencies**

In [2]:

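# Install once, e.g.:
#   pip install langchain-community langchain-text-splitters langchain-huggingface langchain-openai faiss-cpu pandas
# Imports (modernized)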
from pathlib import Path

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

import os


# **Step 2: Load the document**

Load the document that will be used as the knowledge source.

**Knowledge base**: The text document serves as the underlying knowledge base. Later, when a query is made, relevant parts of this document will be retrieved to augment the LLM's response.






In [3]:
from pathlib import Path
from langchain_community.document_loaders import TextLoader

# Absolute Windows path to your text file
data_file = Path(r"D:\Desktop\AMJ Group\Teaching\Class Materials\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\Demo\L5_RAG\state_of_union.txt")

assert data_file.exists(), f"File not found: {data_file}"

loader = TextLoader(str(data_file), encoding="utf-8")
documents = loader.load()

print(f"Loaded {len(documents)} document(s) from {data_file}")
print("Preview:", documents[0].page_content[:200])


Loaded 1 document(s) from D:\Desktop\AMJ Group\Teaching\Class Materials\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\Demo\L5_RAG\state_of_union.txt
Preview: Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. 


# **Step 3: Split the document into chunks**

Break down the large document into manageable pieces.

**Fine-Grained Retrieval**: Smaller chunks allow the retriever to more precisely locate the context relevant to the query, enhancing the generation step with focused context.

In [4]:

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
split_docs = splitter.split_documents(documents)
print("Chunks:", len(split_docs))


Chunks: 53
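To make the `chunk_size` / `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-width character chunking with overlap. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraph and line breaks, so its boundaries and chunk counts will differ. The helper name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-width character windows that overlap (illustrative helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample_text = "abcdefghij" * 100  # 1,000 characters of stand-in text
demo_chunks = sliding_chunks(sample_text, chunk_size=512, chunk_overlap=64)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

The last 64 characters of each chunk reappear at the start of the next one, which is what lets a sentence that straddles a boundary still be retrieved intact.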


# **Step 4: Generate embeddings for each chunk**

Convert text chunks into numerical vectors (embeddings) that capture semantic meaning.

**Semantic Search**: Embeddings allow the FAISS vector store to perform similarity searches, ensuring that the most relevant context is retrieved for any given query.

**Verification**: Printing the length of the embedding vector confirms the transformation was successful.

In [6]:
from langchain_openai import OpenAIEmbeddings

# OpenAI embeddings (project key already set in env)
embed = OpenAIEmbeddings(model="text-embedding-3-small")

# quick smoke test
vec = embed.embed_query("ping")
print("Embedding dim:", len(vec))



Embedding dim: 1536
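As a rough illustration of what "capturing semantic meaning" buys you, the sketch below computes cosine similarity between small hand-made vectors. The three-dimensional vectors are invented for the example and are not real model outputs; the helper name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumptions for illustration only)
toy_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way, different length
orthogonal = [0.0, 1.0, 0.0]       # shares no component with toy_vec

print(cosine_similarity(toy_vec, same_direction))  # close to 1: same direction
print(cosine_similarity(toy_vec, orthogonal))      # 0: unrelated directions
```

Real embeddings behave the same way in 1536 dimensions: texts with related meaning tend to point in similar directions, so their similarity score is higher.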


#### A quick look at the embeddings for each chunk

In [8]:
# Embed all chunks in one batched call instead of one request per chunk
embedded_chunks = embed.embed_documents([chunk.page_content for chunk in split_docs])

import pandas as pd
df_chunks = pd.DataFrame(embedded_chunks)
df_chunks.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
0,0.044372,-0.026949,0.027764,0.066343,0.011331,0.003405,-0.014267,0.040978,-0.036453,0.027718,...,-0.001178,9.2e-05,-0.014776,0.028307,0.015432,-0.001285,-0.001872,0.042539,-0.038036,-0.040503
1,-0.024246,-0.006789,0.006964,0.079522,0.005704,-0.045462,0.007739,0.044307,0.013999,0.006934,...,0.03596,0.002332,-0.013133,0.008972,-0.004363,0.019003,0.022009,0.00328,-0.04664,-0.010981
2,0.019339,0.006058,-0.01624,0.049528,0.019459,-0.022256,-0.001701,0.01442,0.008898,-0.008271,...,0.031009,0.044778,0.025174,0.01243,-0.022546,-0.020363,0.006046,-0.005477,-0.024342,-0.029008
3,-0.025534,0.001651,0.054586,0.079069,-0.014367,-0.008851,-0.023102,0.009645,-0.017935,0.038679,...,0.003821,0.000409,-0.010843,0.011985,0.004291,0.004245,0.017666,-0.003338,-0.01901,-0.035038
4,-0.022921,0.022447,0.05465,0.081217,0.031019,-0.011093,-0.029338,0.049015,-0.007577,0.013059,...,0.018031,0.001623,-0.033837,0.023939,0.011058,0.024247,-0.01046,-0.003528,-0.031966,-0.01506


In [9]:
import pandas as pd
df_chunks = pd.DataFrame(embedded_chunks)
df_chunks


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
0,0.044372,-0.026949,0.027764,0.066343,0.011331,0.003405,-0.014267,0.040978,-0.036453,0.027718,...,-0.001178,9.2e-05,-0.014776,0.028307,0.015432,-0.001285,-0.001872,0.042539,-0.038036,-0.040503
1,-0.024246,-0.006789,0.006964,0.079522,0.005704,-0.045462,0.007739,0.044307,0.013999,0.006934,...,0.03596,0.002332,-0.013133,0.008972,-0.004363,0.019003,0.022009,0.00328,-0.04664,-0.010981
2,0.019339,0.006058,-0.01624,0.049528,0.019459,-0.022256,-0.001701,0.01442,0.008898,-0.008271,...,0.031009,0.044778,0.025174,0.01243,-0.022546,-0.020363,0.006046,-0.005477,-0.024342,-0.029008
3,-0.025534,0.001651,0.054586,0.079069,-0.014367,-0.008851,-0.023102,0.009645,-0.017935,0.038679,...,0.003821,0.000409,-0.010843,0.011985,0.004291,0.004245,0.017666,-0.003338,-0.01901,-0.035038
4,-0.022921,0.022447,0.05465,0.081217,0.031019,-0.011093,-0.029338,0.049015,-0.007577,0.013059,...,0.018031,0.001623,-0.033837,0.023939,0.011058,0.024247,-0.01046,-0.003528,-0.031966,-0.01506
5,-0.016146,-0.022957,0.052154,0.044174,0.005369,-0.038506,0.013175,0.031521,-0.007203,0.03085,...,0.041141,0.010608,0.00916,0.024971,-0.019626,0.026773,0.00029,0.013548,-0.018221,-0.03699
6,-0.048051,-0.023623,0.06282,0.062109,0.029513,-0.007367,-0.021382,0.032453,-0.014022,-0.016594,...,-0.001085,-0.007266,-0.010081,-0.001569,-0.020991,-0.000679,-0.007633,-0.03058,-0.033899,-0.013714
7,-0.010576,-0.038167,0.017646,0.03263,-0.006506,-0.043229,-0.020557,0.049004,-0.020129,-0.014295,...,0.025025,0.002369,0.01723,0.032416,-0.001313,0.019488,0.014996,0.0033,-0.01073,-0.008152
8,-0.025751,-0.007941,0.030343,0.049923,0.009257,-0.022264,-0.034159,0.029317,0.015027,-0.005263,...,-0.003849,-0.00152,0.015461,0.047292,0.001934,0.036975,0.009586,-0.007046,-0.018961,-0.018935
9,-0.030142,0.001214,0.053524,0.077894,0.028394,-0.053386,0.014278,0.047914,0.000246,-0.004271,...,0.010823,-0.004432,0.013323,-0.00117,-0.013749,0.002995,0.016358,0.011754,-0.014496,0.009478


# **Step 5: Build the FAISS vector store and create a retriever**

Build an index (FAISS) for the document embeddings and create a retriever.

**Retrieval step**: The retriever is responsible for fetching the most relevant chunks from the document based on the query. These retrieved contexts will later be fed into the generation step to produce an informed answer.


In [11]:

# Build a FAISS vector store in-memory
vectorstore = FAISS.from_documents(split_docs, embed)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
print("Vector store ready.")



Vector store ready.
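Conceptually, the retriever embeds the query and returns the `k` nearest chunk vectors. The brute-force sketch below shows that idea on toy data, and also checks that ranking by L2 distance and ranking by cosine similarity agree when the vectors are unit length (which is the assumption here; OpenAI documents its embeddings as normalized). FAISS performs this search far more efficiently; all function names and vectors below are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_by_l2(query_vec, chunk_vecs, k=4):
    """Indices of the k chunks with the smallest L2 distance to the query."""
    order = sorted(range(len(chunk_vecs)), key=lambda i: l2_distance(query_vec, chunk_vecs[i]))
    return order[:k]


def top_k_by_cosine(query_vec, chunk_vecs, k=4):
    """Indices of the k chunks with the largest cosine similarity (unit vectors assumed)."""
    order = sorted(range(len(chunk_vecs)), key=lambda i: dot(query_vec, chunk_vecs[i]), reverse=True)
    return order[:k]


# Toy "chunk embeddings" and query, normalized to unit length
toy_chunks = [normalize(v) for v in ([1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1], [0.5, 0.5, 0.7])]
toy_query = normalize([1, 0.05, 0])

print(top_k_by_l2(toy_query, toy_chunks, k=3))
print(top_k_by_cosine(toy_query, toy_chunks, k=3))
```

Raising `k` returns more context at the cost of more prompt tokens and a higher chance of including loosely related chunks.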


#### Let's see if the retriever works

In [12]:
# Re-create the retriever with the same explicit top-k as above
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

In [13]:
# The way the retriever works

query = "What are the key points from the State Of The Union"
docs = retriever.invoke(query)  # get_relevant_documents() is deprecated in recent LangChain releases
for doc in docs:
    print(doc.page_content)



Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny.
Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges. 

And let’s pass the PRO Act when a majority of workers want to form a union—they shouldn’t be stopped.  

When we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.
So what are we waiting for? Let’s g

In [14]:
query2 = "How is the United States supporting Ukraine economically and militarily?"

In [15]:
docs = retriever.invoke(query2)  # get_relevant_documents() is deprecated in recent LangChain releases
for doc in docs:
    print(doc.page_content)

The Russian stock market has lost 40% of its value and trading remains suspended. Russia’s economy is reeling and Putin alone is to blame. 

Together with our allies we are providing support to the Ukrainians in their fight for freedom. Military assistance. Economic assistance. Humanitarian assistance. 

We are giving more than $1 Billion in direct assistance to Ukraine. 

And we will continue to aid the Ukrainian people as they defend their country and to help ease their suffering.
Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. 

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   

They keep moving.   

And the costs and the threats to America and the world keep rising.   

That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. 

The United States is a member along with 29 other nations.
Along with 

# **Step 6: Design a prompt template for the language model**
Establish a prompt that instructs the LLM on how to utilize the retrieved context to generate a concise answer.

**Guiding Generation**: The prompt template bridges retrieval and generation by ensuring the LLM uses the provided context (from the retriever) to answer the query accurately.

In [16]:
from langchain_core.prompts import ChatPromptTemplate

In [17]:
template="""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

In [18]:
prompt=ChatPromptTemplate.from_template(template)

In [19]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [20]:
output_parser=StrOutputParser()
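To see what the model actually receives, the sketch below fills the same template text with plain `str.format`. This only illustrates the string substitution; `ChatPromptTemplate` additionally wraps the result in a chat message, which is not reproduced here. The helper name `fill_prompt` is hypothetical, and the two sample chunks are sentences taken from the retriever output shown earlier.

```python
TEMPLATE = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""


def fill_prompt(question, chunks):
    """Join retrieved chunk texts and substitute them into the template."""
    context = "\n\n".join(chunks)  # blank line between chunks, as format_docs does later
    return TEMPLATE.format(question=question, context=context)


sample_chunks = [
    "We are giving more than $1 Billion in direct assistance to Ukraine.",
    "Military assistance. Economic assistance. Humanitarian assistance.",
]
filled = fill_prompt("How is the United States supporting Ukraine?", sample_chunks)
print(filled)
```

Printing the filled prompt like this is a cheap way to debug a RAG chain: if the answer is wrong, check first whether the right text made it into the context.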

# **Step 7: Configure the OpenAI chat model**

Configure an API-hosted chat model (`gpt-4o-mini`) for text generation. It replaces the local quantized model (Falcon3-1B-Base) used in the earlier version of this demo.

**Generation Step**: This model is responsible for generating the final answer. It takes the prompt (which includes the retrieved context) and produces a response, completing the RAG pipeline.

**Efficiency**: Calling a hosted model avoids downloading and quantizing weights locally, and `temperature=0.0` keeps answers as repeatable as the API allows.

In [22]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# prompt
template = """You are an assistant for question-answering tasks.
Use the retrieved context to answer. If you don't know, say you don't know.
Use 1–3 sentences.
Question: {question}
Context: {context}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# format retrieved docs -> text
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

print("OK: LLM ready.")


OK: LLM ready.


# **Step 8: Set up the generation pipeline and chain the components**

Build an end-to-end pipeline that seamlessly connects document retrieval with text generation.

**Integration**: The chain uses the retriever to fetch context, applies the prompt template to integrate the query with the retrieved context, and then passes the final prompt to the LLM for answer generation.

**Pipeline composition**: Using the pipe operator (|), the components are elegantly chained together to perform a complete RAG operation in one go.

In [23]:

# LLM: OpenAI Chat (project-scoped key required in OPENAI_API_KEY)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# prompt template already defined as `template` and wrapped as `prompt` elsewhere
output_parser = StrOutputParser()

# Helper: convert retrieved documents -> plain text
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)



In [24]:

# Compose RAG chain (LangChain Expression Language)
rag_chain = (
    {
        "context": retriever | RunnableLambda(format_docs),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

print("RAG chain ready.")



RAG chain ready.
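The pipe syntax works because each component overloads Python's `|` operator. The toy class below sketches only that idea with stand-in steps; it is not how LangChain's `Runnable` is implemented, which also handles dict coercion, batching, and streaming. `Step` and the `fake_*` names are hypothetical.

```python
class Step:
    """Wraps a one-argument function so steps can be chained with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # (self | other): run self first, then feed its output to other
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.fn(value)


# Stand-ins for retriever, prompt, and LLM (no network calls)
fake_retrieve = Step(lambda q: {"question": q, "context": "ctx for: " + q})
fake_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
fake_llm = Step(lambda p: f"[model saw {len(p.splitlines())} prompt lines]")

toy_chain = fake_retrieve | fake_prompt | fake_llm
print(toy_chain.invoke("What is RAG?"))
```

Because every stage exposes the same `invoke` interface, any stage can be swapped (a different retriever, prompt, or model) without touching the rest of the chain.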


# **Step 9: Invoke the pipeline with a query**

Execute the entire RAG pipeline with a sample query.

**Final output**: The pipeline retrieves relevant chunks from the document, forms a context-rich prompt, and the LLM generates a concise answer based on that context.

**End-to-end flow**: This step demonstrates the full cycle of RAG—retrieval and augmented generation—in action.

In [26]:
# Step 9: Invoke the pipeline with a query
question = "How is the United States supporting Ukraine economically and militarily?"
result = rag_chain.invoke(question)

print(result)



The United States is supporting Ukraine economically by providing over $1 billion in direct assistance, and militarily through military and humanitarian aid. Additionally, the U.S. is working with allies to enforce powerful economic sanctions against Russia to further support Ukraine in its fight for freedom.


# Conclusion

This RAG (retrieval-augmented generation) pipeline shows how to combine retrieval with generative AI to produce context-grounded answers. Loading and splitting the document, generating embeddings, building a FAISS vector store, and creating a retriever give you a way to locate the passages most relevant to a query. The prompt template then directs the language model to answer from that retrieved context, and an API-hosted chat model (`gpt-4o-mini`) at the end of the LCEL chain turns the filled prompt into a concise answer. Grounding the output in the source document makes answers easier to verify, and the same structure adapts to other domains by swapping the loader, the embedding model, or the LLM.