# RAG Project: Knowledge Retrieval from Local/Backend Data

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline to process locally stored articles (simulating a backend database) for accurate Q&A and summarization. The architecture uses **LangChain** for orchestration, **OpenAI** for embeddings and generation, and **FAISS** for vector storage.

## 🛠️ Setup Instructions
1. **Install Dependencies:** `pip install langchain openai pydantic faiss-cpu python-dotenv tiktoken` (`tiktoken` is needed by `OpenAIEmbeddings`).
2. **Data Folder:** Create a folder named `articles/` in this notebook's directory and place your `.txt` articles inside it.
3. **API Key:** Set your OpenAI API key in the code cell below.

In [None]:
import os
import pickle
import langchain
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.schema import Document
import time

In [None]:
# Set your OpenAI API key here.
os.environ['OPENAI_API_KEY'] = 'your OpenAI API key here'
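
Hard-coding the key in a notebook makes it easy to leak. Since `python-dotenv` is already in the install list, one option is to read the key from the environment, optionally populated from a `.env` file. The helper name `get_api_key` is hypothetical; this is a minimal sketch, not part of the original notebook:

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    try:
        # Optional: python-dotenv copies variables from a local .env file
        # into os.environ (variables that are already set are left alone).
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # Fall back to whatever is already in the environment.

    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or add it to a .env file.")
    return key
```

Calling `get_api_key()` before creating the LLM fails fast with a readable message instead of an authentication error later in the pipeline.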

In [None]:
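# Initialize the Large Language Model (LLM).
# temperature=0.9 favors varied wording; lower it (e.g. 0) for more deterministic, fact-focused answers.
llm = OpenAI(temperature=0.9, max_tokens=500)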

## (1) Data Ingestion: Fetching Articles from Simulated Backend (Local Files)

This function replaces the original URL loader (`UnstructuredURLLoader`): it loads every `.txt` article from the local `articles/` folder, which stands in for text pulled from your backend database.

In [None]:
def load_articles_from_local(folder_path="articles/"):
    """Loads all .txt files from a local folder, simulating a fetch from the backend."""
    try:
        # DirectoryLoader is used to read multiple files from the local file system.
        loader = DirectoryLoader(
            folder_path, 
            glob="**/*.txt", 
            loader_cls=TextLoader,
            show_progress=True
        )
        data = loader.load()
        print(f"Successfully loaded {len(data)} documents from the '{folder_path}' folder.")
        # The metadata['source'] stores the file path, acting as the 'Article ID' reference.
        return data
    except Exception as e:
        print(f"Error loading local files. Please ensure the '{folder_path}' folder exists and contains .txt files. Error: {e}")
        return []
        
# --- Execute Data Loading ---
data = load_articles_from_local()
print(f"Total documents loaded: {len(data)}")
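
To make the loader's job concrete, here is a standard-library sketch of roughly what the cell above produces: one record per `.txt` file, with the file path kept under `metadata['source']`. The function name `load_txt_files` and the plain-dict record shape are assumptions for illustration; LangChain returns `Document` objects instead.

```python
from pathlib import Path


def load_txt_files(folder_path="articles/"):
    """Return a list of {'page_content', 'metadata'} dicts, one per .txt file."""
    folder = Path(folder_path)
    if not folder.is_dir():
        return []  # Mirror the notebook's behaviour: empty list when nothing loads.

    documents = []
    # rglob also matches files in nested folders, like the "**/*.txt" glob above.
    for path in sorted(folder.rglob("*.txt")):
        documents.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return documents
```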

## (2) Document Preprocessing: Splitting Articles into Chunks

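This step splits the loaded articles into smaller, overlapping chunks, the unit of knowledge that gets embedded and retrieved. With the settings below, each chunk holds at most 1,000 characters and shares up to 200 characters with its neighbor, so context that straddles a chunk boundary is not lost and each chunk stays well within the LLM's context limit.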

In [None]:
# RecursiveCharacterTextSplitter tries paragraph, line, and word boundaries in turn,
# so chunks rarely break mid-word. Sizes are measured in characters by default.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

# Split the loaded documents into chunks
docs = text_splitter.split_documents(data)

print(f"Total chunks created: {len(docs)}")
if docs:
    print("\nExample Chunk Source (Article ID):", docs[0].metadata['source'])
    print("Example Chunk Content (First 200 chars):\n", docs[0].page_content[:200] + '...') 
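
The effect of `chunk_size` and `chunk_overlap` is easiest to see with a plain sliding window. The sketch below is a simplified stand-in, not the recursive algorithm LangChain uses: it cuts at fixed character offsets and ignores paragraph or word boundaries. The name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far the window advances each time.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text.
        start += step
    return chunks
```

With the notebook's settings, a 2,500-character article yields three chunks starting at offsets 0, 800, and 1,600.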

## (3) Embedding and Vector Storage (FAISS)

The chunks are converted to vector embeddings, which are then stored in FAISS (Facebook AI Similarity Search) for fast retrieval.

In [None]:
if not docs:
    raise ValueError("No documents were loaded or split. Cannot proceed with embedding.")
    
# Create the embeddings of the chunks
embeddings = OpenAIEmbeddings()

# Create FAISS vector index from documents and embeddings
print("Creating FAISS Vector Index... (This might take a moment)")
vectorindex_openai = FAISS.from_documents(docs, embeddings)
print("FAISS Vector Index creation complete.")
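
Conceptually, retrieval is a nearest-neighbor search: the query is embedded, and the chunks whose vectors lie closest to it are returned. The brute-force sketch below assumes Euclidean (L2) distance as the metric and uses made-up three-dimensional vectors; real embeddings have far more dimensions, and FAISS performs the same search much more efficiently. The names `top_k` and `toy_index` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index, k=2):
    """Return the ids of the k vectors closest to query_vec."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy "index": (chunk id, embedding) pairs with invented values.
toy_index = [
    ("chunk-a", [1.0, 0.0, 0.0]),
    ("chunk-b", [0.9, 0.1, 0.0]),
    ("chunk-c", [0.0, 1.0, 0.0]),
]

nearest = top_k([1.0, 0.05, 0.0], toy_index, k=2)
```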

In [None]:
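# Store the vector index on local disk for persistence (saved to 'vector_index.pkl').
# Note: pickling the index can fail on some LangChain versions; the vector store's own
# save_local() / load_local() methods are an alternative in that case.
file_path = "vector_index.pkl"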
with open(file_path, "wb") as f:
    pickle.dump(vectorindex_openai, f)
print(f"Vector Index saved to {file_path}")

In [None]:
# Load the vector index from local disk (for subsequent runs)
file_path = "vector_index.pkl"
if os.path.exists(file_path):
    with open(file_path, "rb") as f:
        vectorIndex = pickle.load(f)
    print("Vector Index loaded successfully from local file.")
else:
    print("Vector Index file not found. Please run the previous cells to create it.")
    vectorIndex = None
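
The save and load cells together form a build-once, reuse-later cache. A compact way to express that pattern is a single helper that loads the file when it exists and otherwise builds and saves the object. `load_or_build` and `build_fn` are hypothetical names, and the sketch assumes the object can be pickled. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle


def load_or_build(file_path, build_fn):
    """Load a pickled object if the file exists; otherwise build it and save it."""
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)

    obj = build_fn()  # Expensive step, e.g. embedding every chunk.
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

In this notebook, `build_fn` would wrap the `FAISS.from_documents(docs, embeddings)` call, so embeddings are only recomputed when the cache file is missing.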

## (4) Retrieval-Augmented Generation (RAG) and Q&A

The chain takes a user query, finds relevant chunks using the FAISS index, and passes both to the LLM to generate a final, sourced answer.

In [None]:
# Create the RetrievalQAWithSourcesChain (answers come back with their source files)
if vectorIndex is not None:
    chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vectorIndex.as_retriever())
    print("RetrievalQAWithSourcesChain initialized.")
else:
    print("Cannot initialize chain: Vector Index is missing.")
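
To illustrate the "passes both to the LLM" step, the sketch below assembles retrieved chunks and the question into a single prompt string. It is a simplified illustration: LangChain's actual prompts and its strategy for combining chunks differ, and `build_prompt` plus the plain-dict chunk shape are assumptions.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one prompt string."""
    context_blocks = [
        f"Content: {chunk['page_content']}\nSource: {chunk['metadata']['source']}"
        for chunk in chunks
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below, "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Keeping the source next to each chunk is what lets the model cite where an answer came from.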


In [None]:
# Set your question (Query) based on the articles you placed in the 'articles/' folder
query = "What are the key points of the latest technology policy article?"

print(f"Question: {query}\n")

# Set LangChain to debug mode to see the internal steps (optional)
langchain.debug = True

# Run the RAG Chain
if 'chain' in locals():
    # Note: with return_only_outputs=True the result dict holds only the outputs
    # ('answer' and 'sources'), not the input question.
    result = chain({"question": query}, return_only_outputs=True)
    
    # Print the final result in a clean format
    print("\n--- RAG RESULT ---")
    print("Answer:", result['answer'].strip())
    print("Sources:", result['sources'])
else:
    print("RAG Chain not ready. Please check the previous steps.")
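
The `sources` value comes back as a single string. If you want a list of article paths for display or logging, a small parser helps. This sketch assumes the string is comma- or newline-separated, which may not hold for every model response; `parse_sources` is a hypothetical helper.

```python
def parse_sources(sources):
    """Split a comma- or newline-separated sources string into a de-duplicated list."""
    unique = []
    for part in sources.replace("\n", ",").split(","):
        name = part.strip()
        if name and name not in unique:
            unique.append(name)  # Keep first-seen order.
    return unique
```

For example, `parse_sources(result['sources'])` turns the chain output into a list you can iterate over.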