# Technique 0: Setup & Baseline RAG

## 🎯 Welcome to Advanced RAG Techniques!

This notebook establishes the **foundation** for all advanced techniques you'll learn.

### What You'll Do:
1. Set up your environment
2. Load the MSME dataset
3. Build a basic RAG system
4. Establish baseline metrics
5. Test with sample queries

### Why This Matters:
Every advanced technique will be compared against this baseline. Understanding where we start helps you appreciate the improvements!

**Difficulty:** ⭐☆☆☆☆

## 📋 Prerequisites

Before starting, ensure you have:
- ✅ Python 3.8+
- ✅ Together AI API key
- ✅ Installed dependencies (see requirements.txt)
- ✅ `msme.csv` in this directory
- ✅ `.env` file with your API key
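
The checklist above can be verified before running any cells. The following is a minimal sketch, not part of the notebook's utilities: `preflight` is a hypothetical helper, and it only checks the Python version and the presence of the two files named in the list. It does not validate the API key or the installed dependencies.

```python
import sys
from pathlib import Path


def preflight(base_dir=".", min_python=(3, 8)):
    """Return a dict of prerequisite checks; True means the check passed.

    Hypothetical helper: file names come from the prerequisites list,
    everything else is an assumption for illustration.
    """
    base = Path(base_dir)
    return {
        "python_version": sys.version_info >= min_python,
        "msme.csv": (base / "msme.csv").is_file(),
        ".env": (base / ".env").is_file(),
    }


# Print one line per check so missing items are easy to spot.
checks = preflight()
for name, ok in checks.items():
    status = "OK" if ok else "MISSING"
    print(f"{status:8}{name}")
```

If either file shows as missing, fix that before moving on to Step 1.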

## Step 1: Import Libraries

We'll use our custom `utils_openai.py` module along with LangChain components.

In [None]:
# Import utilities
from utils_openai import (
    setup_openai_api,
    load_msme_data,
    create_embeddings,
    create_llm,
    create_vectorstore,
    get_baseline_prompt,
    print_retrieval_results,
    count_tokens_approximate
)

# LangChain components
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print("✅ All imports successful!")

## Step 2: Setup Together AI API

Load your API key from the `.env` file.

In [None]:
# Load API key
api_key = setup_openai_api()
print("✅ API key loaded successfully!")

## Step 3: Load MSME Dataset

Our knowledge base contains 14 documents about MSMEs in Nigeria:
- Business registration procedures
- Financing options and policies
- Government support programs
- Industry-specific guidance

In [None]:
# Load the MSME data
documents, metadatas, ids = load_msme_data("msme.csv")

print(f"\nDataset Overview:")
print(f"- Total documents: {len(documents)}")
print(f"- Sample title: {metadatas[0]['doc_title']}")
print(f"- Average doc length: {sum(len(d) for d in documents) // len(documents)} characters")
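
The loader returns three parallel lists: document texts, metadata dicts, and IDs. The implementation of `load_msme_data` is not shown here, so the sketch below is only an illustration of that shape. The column names `id` and `text` and the sample rows are assumptions; only the `doc_title` metadata key appears in the cell above.

```python
import csv
import io

# Placeholder data; the real msme.csv columns may differ.
SAMPLE_CSV = """id,doc_title,text
doc-1,Placeholder title one,Placeholder body text one.
doc-2,Placeholder title two,Placeholder body text two.
"""


def load_rows(file_obj, text_col="text", title_col="doc_title", id_col="id"):
    """Split CSV rows into parallel lists of documents, metadatas, and ids."""
    documents, metadatas, ids = [], [], []
    for row in csv.DictReader(file_obj):
        documents.append(row[text_col])
        metadatas.append({title_col: row[title_col]})
        ids.append(row[id_col])
    return documents, metadatas, ids


sample_documents, sample_metadatas, sample_ids = load_rows(io.StringIO(SAMPLE_CSV))
print(len(sample_documents), sample_metadatas[0]["doc_title"], sample_ids)
```

Keeping the three lists aligned by index is what lets the vector store attach the right title and ID to each embedded text.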

## Step 4: Initialize Models

We'll use:
- **Embeddings:** M2-BERT-80M-32K (32,768 token context)
- **LLM:** Llama 3.3 70B Turbo (fast and accurate)

In [None]:
# Create embeddings model
embeddings = create_embeddings(api_key)

# Create chat model
llm = create_llm(api_key, temperature=0)

print("\n✅ Models initialized!")

## Step 5: Create Vector Store

We'll use ChromaDB to store document embeddings for fast similarity search.

In [None]:
# Create vector store
vectorstore = create_vectorstore(
    documents=documents,
    metadatas=metadatas,
    ids=ids,
    embeddings=embeddings,
    collection_name="msme_baseline",
    persist_directory="./chroma_db_baseline"
)

print("✅ Vector store created and persisted!")
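
To build intuition for what "similarity search" means, here is a minimal pure-Python sketch of top-k retrieval over embedding vectors. It uses cosine similarity and a brute-force scan purely for illustration. The distance metric and index ChromaDB actually use depend on how the collection is configured, and the three-dimensional vectors below are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for document embeddings.
toy_vectors = {
    "registration": [0.9, 0.1, 0.0],
    "financing": [0.1, 0.9, 0.1],
    "support_programs": [0.2, 0.3, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

toy_results = top_k(toy_query, toy_vectors, k=2)
for doc_id, score in toy_results:
    print(f"{doc_id}: {score:.3f}")
```

A vector store does the same ranking, but with an index that avoids comparing the query against every stored vector.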

## Step 6: Create Retriever

The retriever will find the most relevant documents for a given query.

In [None]:
# Create retriever (retrieve top 5 documents)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Test retrieval
test_query = "Procedure and legal requirements for setting up a business in Nigeria"
retrieved_docs = retriever.invoke(test_query)

print(f"Retrieved {len(retrieved_docs)} documents for query:")
print(f"'{test_query}'")
print_retrieval_results(retrieved_docs, max_docs=2, max_chars=150)

In [None]:
print_retrieval_results(retrieved_docs, max_docs=5, max_chars=5000)

## Step 7: Build Baseline RAG Chain

This chain uses the **modern LCEL** (LangChain Expression Language) pattern rather than the deprecated `RetrievalQA` chain.

### The Pipeline:
1. Query comes in
2. Retriever finds relevant docs
3. Prompt combines docs + query
4. LLM generates answer
5. Output parser extracts text

In [None]:
# Get prompt template
prompt = get_baseline_prompt()

# Build the RAG chain (Modern LCEL pattern)
baseline_rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("✅ Baseline RAG chain created!")
print("\nChain structure:")
print("  Query → Retriever → Prompt → LLM → Answer")
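
Step 3 of the pipeline, where the prompt combines docs and query, can be sketched without LangChain. In the sketch below, `Doc` is a hypothetical stand-in for a retrieved document, `format_docs` is a commonly used helper pattern rather than something this notebook defines, and `TEMPLATE` is an invented template. The actual text returned by `get_baseline_prompt()` is not shown in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join document texts into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Invented template for illustration only.
TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {question}"


def build_prompt(docs, question):
    """Fill the template with the joined context and the user question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


sample_docs = [Doc("First document text."), Doc("Second document text.")]
prompt_text = build_prompt(sample_docs, "How do I register a business?")
print(prompt_text)
```

Joining only `page_content` keeps metadata and object representations out of the prompt, which can reduce the token count of the context.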

## Step 8: Test with Sample Queries

Let's test our baseline RAG system with various queries about MSMEs.

In [None]:
# Test Query 1
question1 = "Explain the procedure and legal requirements for setting up a business in Nigeria"

print(f"Question: {question1}\n")
answer1 = baseline_rag_chain.invoke(question1)
print(f"Answer: {answer1}")
print(f"\n{'='*80}\n")

In [None]:
# Test Query 2
question2 = "What are the financing options for small businesses in Nigeria?"

print(f"Question: {question2}\n")
answer2 = baseline_rag_chain.invoke(question2)
print(f"Answer: {answer2}")
print(f"\n{'='*80}\n")

In [None]:
# Test Query 3
question3 = "What is the Development Bank of Nigeria loan repayment plan?"

print(f"Question: {question3}\n")
answer3 = baseline_rag_chain.invoke(question3)
print(f"Answer: {answer3}")
print(f"\n{'='*80}\n")

## Step 9: Establish Baseline Metrics

These metrics will be our comparison point for all advanced techniques.

We'll measure:
- Number of documents retrieved
- Total tokens in context
- Approximate cost
- Answer quality (subjective)

In [None]:
# Calculate baseline metrics for question 1
retrieved_for_q1 = retriever.invoke(question1)
total_context = "\n\n".join([doc.page_content for doc in retrieved_for_q1])
token_count = count_tokens_approximate(total_context)

print("üìä BASELINE METRICS")
print("="*80)
print(f"Query: '{question1}'")
print(f"\nRetrieval:")
print(f"  - Documents retrieved: {len(retrieved_for_q1)}")
print(f"  - Total context tokens: ~{token_count}")
print(f"  - Average tokens per doc: ~{token_count // len(retrieved_for_q1)}")
print("="*80)
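
The cell above reports token counts but not the approximate cost listed among the metrics. A rough estimate could look like the sketch below. Both the 4-characters-per-token heuristic and the prices are assumptions: the implementation of `count_tokens_approximate` is not shown here, and the per-million-token prices are placeholders that must be replaced with your provider's current rates.

```python
def approx_tokens(text, chars_per_token=4):
    """Rough token estimate; assumes about 4 characters per token."""
    return len(text) // chars_per_token


def estimate_cost(input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Cost in the same currency unit as the supplied prices."""
    total = (input_tokens * price_in_per_million
             + output_tokens * price_out_per_million)
    return total / 1_000_000


# Placeholder inputs: a 20,000-character context and a 500-token answer.
sample_context = "x" * 20_000
sample_tokens = approx_tokens(sample_context)

# Placeholder prices, NOT real pricing.
sample_cost = estimate_cost(sample_tokens, 500,
                            price_in_per_million=1.0,
                            price_out_per_million=1.0)
print(f"~{sample_tokens} context tokens, estimated cost ~{sample_cost:.4f}")
```

Recording this number for the baseline makes it possible to compare cost, not only quality, when later techniques change how much context is sent.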

## ✅ Summary

**What you built:**
- ✅ Loaded MSME dataset (14 documents)
- ✅ Created embeddings with M2-BERT
- ✅ Built vector store with ChromaDB
- ✅ Implemented modern RAG chain with LCEL
- ✅ Established baseline metrics

**Current System:**
- Retrieves top 5 documents based on semantic similarity
- Uses ~3500-6000 tokens of context per query
- Works well for straightforward questions

**Limitations (what we'll improve):**
- ❌ Misses exact keyword matches
- ❌ Can't handle vague queries well
- ❌ Retrieves too much irrelevant context
- ❌ Fixed chunk size may break meaning
- ❌ No way to rerank or refine results

**Next:** Technique 1 - BM25 Hybrid Search will address keyword matching!

## 💪 Exercise: Explore the Baseline

**Task:**
1. Try 3 more queries of your own about Nigerian MSMEs
2. For each query:
   - Note how many retrieved docs seem relevant
   - Rate the answer quality (1-10)
   - Calculate approximate token usage
3. Identify one query where the system struggles

**Example queries to try:**
- "What are the tax benefits for small businesses?"
- "How long does business registration take?"
- "What is SMEDAN and what do they do?"
- "Can I get a loan for my tech startup?"

**Expected Outcome:**
You should find at least one query where:
- The system retrieves irrelevant documents, OR
- Misses documents with exact keyword matches, OR
- The answer is incomplete/vague

**Time:** 10 minutes

**Document your findings in the cell below:**

In [None]:
# Your Exercise Code Here

# Query 1:
my_query_1 = ""  # Add your query
# answer_1 = baseline_rag_chain.invoke(my_query_1)

# Query 2:
my_query_2 = ""  # Add your query
# answer_2 = baseline_rag_chain.invoke(my_query_2)

# Query 3:
my_query_3 = ""  # Add your query
# answer_3 = baseline_rag_chain.invoke(my_query_3)

# Document your findings:
print("My Findings:")
print("- Struggling query: [describe which query struggled]")
print("- Why it struggled: [explain the issue]")
print("- Potential solution: [which technique might help?]")
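
If you prefer structured notes over free text, the per-query observations from the task (relevant docs, quality rating, token usage) could be recorded as below. This is an optional sketch: `QueryFinding` and `weakest` are hypothetical names, and the two entries are placeholder values, not real results.

```python
from dataclasses import dataclass


@dataclass
class QueryFinding:
    """One row of exercise notes for a single query."""
    query: str
    relevant_docs: int   # how many retrieved docs looked relevant
    retrieved_docs: int  # how many docs were retrieved in total
    quality: int         # subjective answer rating, 1-10
    approx_tokens: int   # approximate context tokens

    @property
    def precision(self):
        """Share of retrieved docs judged relevant."""
        if self.retrieved_docs == 0:
            return 0.0
        return self.relevant_docs / self.retrieved_docs


def weakest(findings):
    """Pick the query with the lowest quality, breaking ties by precision."""
    return min(findings, key=lambda f: (f.quality, f.precision))


# Placeholder entries; replace with your own observations.
findings = [
    QueryFinding("example query A", relevant_docs=4, retrieved_docs=5,
                 quality=8, approx_tokens=4200),
    QueryFinding("example query B", relevant_docs=1, retrieved_docs=5,
                 quality=3, approx_tokens=5100),
]
worst = weakest(findings)
print(f"Struggling query: {worst.query} (precision {worst.precision:.0%})")
```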

**Next Steps:**
- ➡️ **Technique 1:** Contextual Compression Retrieval