# Day 1 - Exercise 3: LangChain Prompt and Parsing Setup

**Objective:** Build a structured prompting pipeline with data ingestion using LangChain components.

## Learning Objectives:

By the end of this exercise, you will be able to:

- **Create LangChain PromptTemplates** for Q&A tasks with parsers to extract structured outputs (JSON with "answer" and "confidence" fields)
- **Implement data ingestion** from CSV and web pages with chunking strategies (fixed-size, semantic) and metadata attachment
- **Build structured prompts** and data preprocessing pipelines for scalable, reusable LLM applications
- **Integrate components** into end-to-end pipelines for real-world scenarios

## Prerequisites:
- Completion of Day 1 - Exercises 1 & 2
- Basic understanding of Python and JSON
- Familiarity with prompt engineering concepts

## Training Structure (140 minutes total):
1. **LangChain Fundamentals** (15 min)
2. **PromptTemplate Basics** (20 min) 
3. **Output Parsing with Pydantic** (25 min)
4. **Data Ingestion Pipeline** (30 min)
5. **Chunking Strategies** (20 min)
6. **End-to-End Integration** (30 min)

## Setup and Installation

In [1]:
# Install required packages for the LangChain pipeline.
# %pip installs into the running kernel's environment, which avoids the
# "command not found: pip" failure that a bare shell call (!pip) can hit.
%pip install langchain langchain-core langchain-community langchain-openai
%pip install pydantic beautifulsoup4 requests pandas tiktoken
%pip install langchain-text-splitters


In [2]:
import os
import json
import pandas as pd
import requests
from typing import Dict, List, Any, Optional
from pydantic import BaseModel, Field

# LangChain imports
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser, PydanticOutputParser
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_community.document_loaders import CSVLoader, WebBaseLoader
from langchain_core.documents import Document

# Set up the OpenAI API key without hardcoding it in the notebook.
# Export OPENAI_API_KEY in your shell first, or enter it at the prompt below.
from getpass import getpass
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
print("✅ OpenAI API key configured successfully!")

# Initialize LangChain LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
print("✅ LangChain LLM initialized!")

print("✅ All imports successful - ready to build LangChain pipelines!")

✅ OpenAI API key configured successfully!
✅ LangChain LLM initialized!
✅ All imports successful - ready to build LangChain pipelines!


## Section 1: LangChain Fundamentals (15 minutes)

### What is LangChain?

LangChain is a framework for developing applications powered by language models. It provides:

- **Modular Components**: Reusable building blocks for LLM applications
- **Chain Composition**: Connect multiple components into workflows
- **Data Integration**: Easy connection to various data sources
- **Output Parsing**: Structured extraction from LLM responses

### Core Components We'll Use:

1. **PromptTemplate**: Structured prompt creation with variables
2. **OutputParser**: Extract structured data from LLM responses
3. **Document Loaders**: Ingest data from various sources
4. **Text Splitters**: Chunk large documents efficiently
5. **Chains**: Combine components into workflows
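
To make the idea of chain composition concrete before using the real components, here is a toy sketch of pipe-style composition in plain Python. The `Step` class and the three stages are hypothetical stand-ins invented for illustration; this is not how LangChain implements its components, only a picture of what `a | b` means conceptually.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a chainable component (hypothetical, not a LangChain class)."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Three illustrative stages: fill a prompt, fake a model call, parse the reply.
fill_prompt = Step(lambda inputs: f"Explain {inputs['topic']} in one sentence.")
fake_llm = Step(lambda prompt: f"ANSWER: response to '{prompt}'")
strip_prefix = Step(lambda text: text.removeprefix("ANSWER: "))

toy_chain = fill_prompt | fake_llm | strip_prefix
toy_result = toy_chain.invoke({"topic": "chunking"})
print(toy_result)
```

Each stage only needs to accept the previous stage's output, which is what makes the pieces swappable.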

### Quick LangChain Demo: Basic Chain

Let's start with a simple example to understand how LangChain components work together.

In [3]:
print("=" * 60)
print("LANGCHAIN FUNDAMENTALS - Basic Chain Demo")
print("=" * 60)

# Create a simple prompt template
basic_prompt = PromptTemplate(
    input_variables=["topic", "audience"],
    template="Explain {topic} to a {audience} in 2-3 sentences."
)

# Format the prompt
formatted_prompt = basic_prompt.format(
    topic="machine learning",
    audience="5-year-old child"
)

print(f"📝 Formatted Prompt:\n{formatted_prompt}\n")

# Create a simple chain: Prompt → LLM
chain = basic_prompt | llm

# Execute the chain
result = chain.invoke({
    "topic": "machine learning",
    "audience": "5-year-old child"
})

print(f"🤖 LLM Response:\n{result.content}\n")
print("✅ Basic LangChain chain executed successfully!")

LANGCHAIN FUNDAMENTALS - Basic Chain Demo
📝 Formatted Prompt:
Explain machine learning to a 5-year-old child in 2-3 sentences.

🤖 LLM Response:
Machine learning is like teaching a computer to learn from examples, just like how you learn to recognize animals by looking at pictures of them. If you show the computer lots of pictures of cats and dogs, it can learn to tell the difference between them, just like you can!

✅ Basic LangChain chain executed successfully!


### 🎯 Checkpoint 1: Understanding Check

**Question**: What are the main advantages of using LangChain over direct LLM API calls?

**Answer**: 
- Modular, reusable components
- Structured prompt management
- Built-in output parsing
- Easy data source integration
- Chain composition for complex workflows

## Section 2: PromptTemplate Fundamentals (20 minutes)

PromptTemplates are the foundation of structured prompting in LangChain. They allow you to create reusable, parameterized prompts that can be easily modified and tested.

### Basic PromptTemplate Creation

Learn how to create and use PromptTemplates with variable substitution and validation.
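
As a rough picture of what "variable substitution and validation" involves, the sketch below does the same two steps with the standard library only. The helper names `template_variables` and `format_checked` are made up for this example and are not part of LangChain; the real `PromptTemplate` has its own implementation.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names that appear in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def format_checked(template: str, **values: str) -> str:
    """Fill the template, raising a KeyError that lists every missing variable."""
    missing = template_variables(template) - set(values)
    if missing:
        raise KeyError(f"missing variables: {sorted(missing)}")
    return template.format(**values)


demo_template = "Context: {context}\nQuestion: {question}\nAnswer:"
print(sorted(template_variables(demo_template)))

try:
    # 'question' is deliberately left out to trigger the check.
    format_checked(demo_template, context="LangChain is a framework.")
except KeyError as err:
    print(f"Validation error: {err}")
```

Checking for missing variables before formatting means a bad call fails with a clear message instead of sending a half-filled prompt to the model.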

In [4]:
print("\n" + "=" * 60)
print("PROMPTTEMPLATE FUNDAMENTALS - Basic Creation")
print("=" * 60)

# Example 1: Simple Q&A Template
qa_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    Context: {context}
    
    Question: {question}
    
    Please provide a clear and concise answer based on the context provided.
    
    Answer:
    """
)

# Test the template
context = "LangChain is a framework for developing applications powered by language models."
question = "What is LangChain?"

formatted_qa = qa_template.format(context=context, question=question)
print(f"📝 Q&A Template Output:\n{formatted_qa}")

# Example 2: Template with validation
try:
    # This will raise an error - missing required variable
    invalid_format = qa_template.format(context=context)
except KeyError as e:
    print(f"\n❌ Template Validation Error: {e}")
    print("✅ LangChain validates required variables!")


PROMPTTEMPLATE FUNDAMENTALS - Basic Creation
📝 Q&A Template Output:

    Context: LangChain is a framework for developing applications powered by language models.
    
    Question: What is LangChain?
    
    Please provide a clear and concise answer based on the context provided.
    
    Answer:
    

❌ Template Validation Error: 'question'
✅ LangChain validates required variables!


### ChatPromptTemplate for Conversational AI

ChatPromptTemplate allows you to create structured conversations with system messages, human messages, and AI responses.
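
The structure of a templated conversation can be sketched without any library: a list of (role, template) pairs that are filled in one pass. `ChatMessage` and `render_messages` below are hypothetical names used only for this illustration, and the sketch makes no claim about LangChain's internals.

```python
from dataclasses import dataclass


@dataclass
class ChatMessage:
    """Minimal message record (hypothetical; LangChain has its own message classes)."""
    role: str
    content: str


def render_messages(templates: list[tuple[str, str]], **values: str) -> list[ChatMessage]:
    """Fill each (role, template) pair with the supplied values."""
    return [ChatMessage(role, text.format(**values)) for role, text in templates]


conversation_templates = [
    ("system", "You are a helpful AI assistant specializing in {domain}."),
    ("human", "Context: {context}"),
    ("human", "Question: {question}"),
]

rendered = render_messages(
    conversation_templates,
    domain="data science",
    context="Machine learning models require training data to learn patterns.",
    question="Why is training data important for ML models?",
)
for message in rendered:
    print(f"{message.role}: {message.content}")
```

Keeping the role separate from the text is what lets the same template list be reused across domains and questions.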

In [5]:
print("\n" + "=" * 60)
print("CHATPROMPTTEMPLATE - Conversational Structure")
print("=" * 60)

# Create a structured chat template.
# Use (role, template) tuples here: pre-built SystemMessage/HumanMessage objects
# are passed through as literal text, so their {placeholders} are never filled.
chat_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant specializing in {domain}. Always provide accurate, well-structured answers."),
    ("human", "Context: {context}"),
    ("human", "Question: {question}")
])

# Format the chat template
formatted_chat = chat_template.format_messages(
    domain="data science",
    context="Machine learning models require training data to learn patterns.",
    question="Why is training data important for ML models?"
)

print(f"💬 Chat Template Structure:")
for i, message in enumerate(formatted_chat):
    print(f"Message {i+1} ({type(message).__name__}): {message.content}")

# Execute the chat chain
chat_chain = chat_template | llm
chat_result = chat_chain.invoke({
    "domain": "data science",
    "context": "Machine learning models require training data to learn patterns.",
    "question": "Why is training data important for ML models?"
})

print(f"\n🤖 Chat Response:\n{chat_result.content}")


CHATPROMPTTEMPLATE - Conversational Structure
💬 Chat Template Structure:
Message 1 (SystemMessage): You are a helpful AI assistant specializing in data science. Always provide accurate, well-structured answers.
Message 2 (HumanMessage): Context: Machine learning models require training data to learn patterns.
Message 3 (HumanMessage): Question: Why is training data important for ML models?

🤖 Chat Response:
(model output varies from run to run)


### Template Composition and Reusability

Learn how to create modular, reusable prompt components that can be combined for different use cases.

In [6]:
print("\n" + "=" * 60)
print("TEMPLATE COMPOSITION - Modular Design")
print("=" * 60)

# Create reusable prompt components
system_instructions = {
    "analyst": "You are a data analyst. Provide insights based on data and evidence.",
    "teacher": "You are an educational instructor. Explain concepts clearly and provide examples.",
    "consultant": "You are a business consultant. Focus on practical, actionable recommendations."
}

output_formats = {
    "bullet_points": "Format your response as bullet points.",
    "numbered_list": "Format your response as a numbered list.",
    "paragraph": "Format your response as a well-structured paragraph."
}

# Compose templates dynamically
def create_custom_template(role, format_type):
    # The system text is already complete, so a SystemMessage object is fine.
    # The human turn contains {placeholders}, so it must be a ("human", template)
    # tuple; a HumanMessage object would reach the model unformatted.
    return ChatPromptTemplate.from_messages([
        SystemMessage(content=f"{system_instructions[role]} {output_formats[format_type]}"),
        ("human", "Topic: {topic}\nSpecific Question: {question}")
    ])

# Test different combinations
combinations = [
    ("teacher", "bullet_points"),
    ("consultant", "numbered_list")
]

for role, format_type in combinations:
    template = create_custom_template(role, format_type)
    chain = template | llm
    
    result = chain.invoke({
        "topic": "prompt engineering",
        "question": "What are the key benefits?"
    })
    
    print(f"\n🎭 {role.title()} + {format_type.replace('_', ' ').title()}:")
    print(f"{result.content[:200]}...")

print("\n✅ Template composition enables flexible, reusable prompts!")


TEMPLATE COMPOSITION - Modular Design

🎭 Teacher + Bullet Points:
(first 200 characters of the model's response; text varies from run to run)...

🎭 Consultant + Numbered List:
(first 200 characters of the model's response; text varies from run to run)...

✅ Template composition enables flexible, reusable prompts!


### 🎯 Checkpoint 2: Hands-On Exercise

**Task**: Create a PromptTemplate for product review analysis that includes:
- Product name and review text as variables
- Instructions to extract sentiment and key features mentioned
- Request for confidence score

**Test it** with a sample product review.

In [7]:
# Your solution here
# Create a product review analysis template

review_template = PromptTemplate(
    input_variables=["product_name", "review_text"],
    template="""
    # Your template here
    """
)

# Test with sample data
sample_product = "Wireless Bluetooth Headphones"
sample_review = "Great sound quality and comfortable fit. Battery life could be better."

# Implement and test your solution
print("Checkpoint 2 - Product Review Analysis:")
# Your code here

Checkpoint 2 - Product Review Analysis:


## Section 3: Output Parsing with Pydantic (25 minutes)

Output parsing is crucial for extracting structured data from LLM responses. We'll use Pydantic models to define schemas and ensure data validation.

### Defining Pydantic Models for Structured Output

Pydantic models provide type safety, validation, and clear data structures for LLM outputs.

In [8]:
print("\n" + "=" * 60)
print("PYDANTIC MODELS - Structured Output Definition")
print("=" * 60)

# Define Pydantic models for different use cases

class QAResponse(BaseModel):
    """Model for Q&A responses with confidence scoring"""
    answer: str = Field(description="The main answer to the question")
    confidence: float = Field(description="Confidence score between 0.0 and 1.0", ge=0.0, le=1.0)
    reasoning: Optional[str] = Field(description="Brief explanation of the reasoning", default=None)
    sources_needed: bool = Field(description="Whether additional sources would improve the answer")

class DocumentSummary(BaseModel):
    """Model for document summarization"""
    title: str = Field(description="Main topic or title of the document")
    key_points: List[str] = Field(description="List of 3-5 key points from the document")
    sentiment: str = Field(description="Overall sentiment: positive, negative, or neutral")
    word_count: int = Field(description="Approximate word count of original document")
    complexity_level: str = Field(description="Reading complexity: beginner, intermediate, or advanced")

class ProductAnalysis(BaseModel):
    """Model for product review analysis"""
    product_name: str = Field(description="Name of the product being reviewed")
    overall_sentiment: str = Field(description="positive, negative, or mixed")
    rating_prediction: float = Field(description="Predicted rating from 1.0 to 5.0", ge=1.0, le=5.0)
    pros: List[str] = Field(description="Positive aspects mentioned")
    cons: List[str] = Field(description="Negative aspects mentioned")
    recommendation: str = Field(description="Would you recommend this product? yes/no/maybe")

# Demonstrate model validation
print("📋 Defined Pydantic Models:")
# model_fields is the Pydantic v2 replacement for the deprecated __fields__
print(f"1. QAResponse: {list(QAResponse.model_fields.keys())}")
print(f"2. DocumentSummary: {list(DocumentSummary.model_fields.keys())}")
print(f"3. ProductAnalysis: {list(ProductAnalysis.model_fields.keys())}")

# Test model validation
try:
    # Valid data
    valid_qa = QAResponse(
        answer="LangChain is a framework for LLM applications",
        confidence=0.95,
        reasoning="Based on official documentation",
        sources_needed=False
    )
    print(f"\n✅ Valid QA Response: {valid_qa.answer} (confidence: {valid_qa.confidence})")
    
    # Invalid data - will raise validation error
    invalid_qa = QAResponse(
        answer="Test answer",
        confidence=1.5,  # Invalid: > 1.0
        sources_needed=False
    )
except Exception as e:
    print(f"\n❌ Validation Error: {e}")
    print("✅ Pydantic validates data constraints!")


PYDANTIC MODELS - Structured Output Definition
📋 Defined Pydantic Models:
1. QAResponse: ['answer', 'confidence', 'reasoning', 'sources_needed']
2. DocumentSummary: ['title', 'key_points', 'sentiment', 'word_count', 'complexity_level']
3. ProductAnalysis: ['product_name', 'overall_sentiment', 'rating_prediction', 'pros', 'cons', 'recommendation']

✅ Valid QA Response: LangChain is a framework for LLM applications (confidence: 0.95)

❌ Validation Error: 1 validation error for QAResponse
confidence
  Input should be less than or equal to 1 [type=less_than_equal, input_value=1.5, input_type=float]
    For further information visit https://errors.pydantic.dev/2.11/v/less_than_equal
✅ Pydantic validates data constraints!

### PydanticOutputParser Integration

Learn how to integrate Pydantic models with LangChain's output parsing system for automatic data extraction and validation.
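
Conceptually, the parse step has two parts: pull a JSON object out of the model's text, then check it against the expected schema. The following is a simplified standard-library stand-in for that idea, assuming only the "answer" and "confidence" fields; `ParsedAnswer` and `parse_llm_json` are hypothetical names, and this is not the actual `PydanticOutputParser` implementation.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class ParsedAnswer:
    """Plain-Python stand-in for the structured result (hypothetical name)."""
    answer: str
    confidence: float


def parse_llm_json(raw_text: str) -> ParsedAnswer:
    """Extract a JSON object from model text and validate the two fields."""
    # Models often add prose around the JSON; take the span from the
    # first '{' to the last '}' and try to decode that.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))

    answer = data["answer"]  # a missing key raises KeyError
    confidence = float(data["confidence"])
    if not isinstance(answer, str):
        raise ValueError("'answer' must be a string")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be between 0.0 and 1.0")
    return ParsedAnswer(answer=answer, confidence=confidence)


sample_output = (
    'Here is the result:\n'
    '{"answer": "A framework for LLM apps", "confidence": 0.9}\n'
    'Let me know if you need more.'
)
parsed = parse_llm_json(sample_output)
print(parsed)
```

A schema-driven parser does the same job with far less hand-written checking, which is the main reason to prefer it once the output has more than a couple of fields.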

In [19]:
print("\n" + "=" * 60)
print("PYDANTIC OUTPUT PARSER - Automatic Extraction")
print("=" * 60)

# Create parser for QA responses
qa_parser = PydanticOutputParser(pydantic_object=QAResponse)

# Create prompt template with parser instructions
qa_prompt_with_parser = PromptTemplate(
    template="""
    Answer the following question based on the provided context.
    
    Context: {context}
    Question: {question}
    
    {format_instructions}
    """,
    input_variables=["context", "question"],
    partial_variables={"format_instructions": qa_parser.get_format_instructions()}
)

# Show the format instructions
print(f"📋 Parser Format Instructions:\n{qa_parser.get_format_instructions()}\n")

# Create the complete chain: Prompt → LLM → Parser
qa_chain = qa_prompt_with_parser | llm | qa_parser

# Test the chain
test_context = """
LangChain is a framework for developing applications powered by language models. 
It provides components for prompt management, output parsing, data integration, 
and chain composition. LangChain supports multiple LLM providers and includes 
tools for building complex AI workflows.
"""

test_question = "What are the main components of LangChain?"

try:
    parsed_result = qa_chain.invoke({
        "context": test_context,
        "question": test_question
    })
    
    print(f"🎯 Parsed QA Result:")
    print(f"Answer: {parsed_result.answer}")
    print(f"Confidence: {parsed_result.confidence}")
    print(f"Reasoning: {parsed_result.reasoning}")
    print(f"Sources Needed: {parsed_result.sources_needed}")
    print(f"\n✅ Successfully parsed structured output!")
    
except Exception as e:
    print(f"❌ Parsing Error: {e}")
    print("💡 Tip: LLM output might not match expected format. Consider prompt refinement.")


PYDANTIC OUTPUT PARSER - Automatic Extraction
📋 Parser Format Instructions:
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Model for Q&A responses with confidence scoring", "properties": {"answer": {"description": "The main answer to the question", "title": "Answer", "type": "string"}, "confidence": {"description": "Confidence score between 0.0 and 1.0", "maximum": 1.0, "minimum": 0.0, "title": "Confidence", "type": "number"}, "reasoning": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "Brief explanation of the reasoning", "title": "R
```

### Error Handling and Retry Mechanisms

Real-world applications need robust error handling for parsing failures and retry logic for improved reliability.
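
The retry-then-fallback pattern can be factored into a small reusable helper. The sketch below is one possible shape for it, written against a simulated flaky call rather than a real LLM; `call_with_retries`, `flaky_parse`, and the optional `base_delay` backoff parameter are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: T,
    max_retries: int = 3,
    base_delay: float = 0.0,
) -> T:
    """Run `operation`, retrying on any exception; return `fallback` if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as err:  # broad on purpose for this sketch
            print(f"Attempt {attempt + 1}/{max_retries} failed: {err}")
            if attempt < max_retries - 1 and base_delay > 0:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * (2 ** attempt))
    return fallback


# Simulated flaky call: fails twice, then succeeds (stands in for an LLM + parser).
attempts = {"count": 0}


def flaky_parse() -> dict:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("malformed JSON")
    return {"answer": "ok", "confidence": 0.9}


retry_result = call_with_retries(flaky_parse, fallback={"answer": "", "confidence": 0.0})
print(retry_result)
```

Passing the fallback in as a value keeps the helper independent of any particular response model.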

In [20]:
print("\n" + "=" * 60)
print("ERROR HANDLING - Robust Parsing")
print("=" * 60)

def robust_qa_chain(context: str, question: str, max_retries: int = 3) -> QAResponse:
    """QA chain with error handling and retry logic"""
    
    for attempt in range(max_retries):
        try:
            print(f"🔄 Attempt {attempt + 1}/{max_retries}")
            
            # Execute the chain
            result = qa_chain.invoke({
                "context": context,
                "question": question
            })
            
            # Validate the result
            if isinstance(result, QAResponse):
                print(f"✅ Success on attempt {attempt + 1}")
                return result
            else:
                raise ValueError("Invalid response type")
                
        except Exception as e:
            print(f"❌ Attempt {attempt + 1} failed: {str(e)[:100]}...")
            
            if attempt == max_retries - 1:
                # Final attempt failed - return fallback response
                print("🔧 Returning fallback response")
                return QAResponse(
                    answer="Unable to process the question due to parsing errors.",
                    confidence=0.0,
                    reasoning="Parsing failed after multiple attempts",
                    sources_needed=True
                )
    
    # Reached only when max_retries <= 0, because the loop body never runs
    raise RuntimeError("robust_qa_chain requires max_retries >= 1")

# Test the robust chain
robust_result = robust_qa_chain(
    context="LangChain enables building LLM applications with modular components.",
    question="How does LangChain help developers?"
)

print(f"\n🛡️ Robust Result:")
print(f"Answer: {robust_result.answer}")
print(f"Confidence: {robust_result.confidence}")

# Try an edge case: empty context. Note that the model may still return valid
# JSON, in which case the fallback branch is not exercised.
print(f"\n🧪 Testing with problematic input:")
fallback_result = robust_qa_chain(
    context="",  # Empty context
    question="What is the meaning of life?",  # Philosophical question
    max_retries=2
)

print(f"\n🔧 Fallback Result:")
print(f"Answer: {fallback_result.answer}")
print(f"Confidence: {fallback_result.confidence}")


ERROR HANDLING - Robust Parsing
🔄 Attempt 1/3
✅ Success on attempt 1

🛡️ Robust Result:
Answer: LangChain helps developers by providing modular components that facilitate the building of LLM applications, allowing for easier integration and customization.
Confidence: 0.9

🧪 Testing with problematic input:
🔄 Attempt 1/2
✅ Success on attempt 1

🔧 Fallback Result:
Answer: The meaning of life is a philosophical question that has been explored by many cultures and thinkers, often interpreted as the pursuit of happiness, fulfillment, and understanding one's purpose.
Confidence: 0.8


### 🎯 Checkpoint 3: Structured Output Challenge

**Task**: Create a Pydantic model and parser for analyzing customer feedback that includes:
- Customer satisfaction score (1-10)
- Main complaint categories (list)
- Urgency level (low/medium/high)
- Recommended action

**Test it** with sample customer feedback.

In [22]:
# Your solution here
# Create CustomerFeedback Pydantic model and parser

class CustomerFeedback(BaseModel):
    """Model for customer feedback analysis"""
    # Your model definition here
    pass

# Create parser and prompt template
feedback_parser = PydanticOutputParser(pydantic_object=CustomerFeedback)

# Test with sample feedback
sample_feedback = "The product arrived late and was damaged. Very disappointed with the service. Need immediate replacement."

print("Checkpoint 3 - Customer Feedback Analysis:")
# Your implementation here

Checkpoint 3 - Customer Feedback Analysis:


## Section 4: Data Ingestion Pipeline (30 minutes)

Real-world LLM applications need to process data from various sources. We'll build pipelines to ingest data from CSV files and web pages, with proper metadata handling.

### CSV Data Ingestion with Metadata

Learn how to load CSV data and attach relevant metadata for better context in LLM processing.
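
To show the shape of the result before using a loader, here is a standard-library sketch that turns each CSV row into a small text document with metadata attached. `SimpleDocument` and `rows_to_documents` are hypothetical names for illustration, and the inline CSV text is made-up sample data; the sketch is not the `CSVLoader` implementation.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_text: str, source: str) -> list[SimpleDocument]:
    """Turn each CSV row into 'column: value' lines with source and row metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            SimpleDocument(content, {"source": source, "row": row_index, "source_type": "csv"})
        )
    return documents


csv_text = (
    "product_id,product_name,price\n"
    "P001,Wireless Headphones,99.99\n"
    "P002,Smart Watch,299.99\n"
)
simple_docs = rows_to_documents(csv_text, source="inline_example.csv")
print(simple_docs[0].page_content)
print(simple_docs[0].metadata)
```

Recording the source and row index on every document makes it possible to trace an answer back to the exact row it came from.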

In [23]:
print("\n" + "=" * 60)
print("CSV DATA INGESTION - Structured Data Loading")
print("=" * 60)

# Create sample CSV data for demonstration
sample_data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Wireless Headphones', 'Smart Watch', 'Laptop Stand', 'USB Cable', 'Phone Case'],
    'category': ['Electronics', 'Electronics', 'Accessories', 'Accessories', 'Accessories'],
    'price': [99.99, 299.99, 49.99, 19.99, 24.99],
    'rating': [4.5, 4.2, 4.8, 4.0, 3.9],
    'description': [
        'High-quality wireless headphones with noise cancellation',
        'Feature-rich smartwatch with health monitoring',
        'Adjustable laptop stand for ergonomic working',
        'Durable USB-C cable for fast charging',
        'Protective phone case with drop protection'
    ]
}

# Save to CSV file
df = pd.DataFrame(sample_data)
csv_file = 'sample_products.csv'
df.to_csv(csv_file, index=False)
print(f"📄 Created sample CSV: {csv_file}")
print(f"Data shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Load CSV using LangChain CSVLoader
csv_loader = CSVLoader(
    file_path=csv_file,
    csv_args={
        'delimiter': ',',
        'quotechar': '"',
    }
)

# Load documents
csv_documents = csv_loader.load()
print(f"\n📋 Loaded {len(csv_documents)} documents from CSV")

# Examine the first document
first_doc = csv_documents[0]
print(f"\n📄 First Document:")
print(f"Content: {first_doc.page_content}")
print(f"Metadata: {first_doc.metadata}")

# Enhance documents with custom metadata
def enhance_csv_documents(documents, source_info):
    """Add custom metadata to CSV documents"""
    enhanced_docs = []
    
    for i, doc in enumerate(documents):
        # Enhanced metadata
        enhanced_metadata = {
            **doc.metadata,
            'source_type': 'csv',
            'document_id': f"csv_doc_{i}",
            'total_documents': len(documents),
            'data_source': source_info['name'],
            'ingestion_timestamp': source_info['timestamp'],
            'content_type': 'structured_data'
        }
        
        # Create enhanced document
        enhanced_doc = Document(
            page_content=doc.page_content,
            metadata=enhanced_metadata
        )
        enhanced_docs.append(enhanced_doc)
    
    return enhanced_docs

# Enhance the documents
import datetime
source_info = {
    'name': 'Product Catalog Database',
    'timestamp': datetime.datetime.now().isoformat()
}

enhanced_csv_docs = enhance_csv_documents(csv_documents, source_info)

print(f"\n🔧 Enhanced Document Metadata:")
for key, value in enhanced_csv_docs[0].metadata.items():
    print(f"  {key}: {value}")

print(f"\n✅ CSV ingestion pipeline complete!")


CSV DATA INGESTION - Structured Data Loading
📄 Created sample CSV: sample_products.csv
Data shape: (5, 6)
Columns: ['product_id', 'product_name', 'category', 'price', 'rating', 'description']

📋 Loaded 5 documents from CSV

📄 First Document:
Content: product_id: P001
product_name: Wireless Headphones
category: Electronics
price: 99.99
rating: 4.5
description: High-quality wireless headphones with noise cancellation
Metadata: {'source': 'sample_products.csv', 'row': 0}

🔧 Enhanced Document Metadata:
  source: sample_products.csv
  row: 0
  source_type: csv
  document_id: csv_doc_0
  total_documents: 5
  data_source: Product Catalog Database
  ingestion_timestamp: 2025-09-18T12:00:31.889710
  content_type: structured_data

✅ CSV ingestion pipeline complete!


### Web Page Data Ingestion

Learn how to extract content from web pages and prepare it for LLM processing with proper metadata and content cleaning.
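
The core of content cleaning is keeping visible text while dropping script and style bodies. As a dependency-free illustration, the sketch below does this with Python's built-in `html.parser` instead of BeautifulSoup; the class name `TextAndHeadingExtractor` and the sample page are invented for this example.

```python
from html.parser import HTMLParser


class TextAndHeadingExtractor(HTMLParser):
    """Collect visible text and headings, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.headings: list[dict] = []
        self._skip_depth = 0
        self._open_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._open_heading = tag

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._open_heading:
            self._open_heading = None

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only runs and anything inside script/style.
        if not text or self._skip_depth:
            return
        self.text_parts.append(text)
        if self._open_heading:
            self.headings.append({"level": self._open_heading, "text": text})


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Intro</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
extractor = TextAndHeadingExtractor()
extractor.feed(page)
extractor.close()
print(extractor.headings)
print("\n".join(extractor.text_parts))
```

For messy real-world HTML a dedicated library is usually more forgiving, so treat this as a way to see the mechanics rather than a replacement.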

In [24]:
print("\n" + "=" * 60)
print("WEB PAGE DATA INGESTION - Content Extraction")
print("=" * 60)

# For demonstration, we create a small sample HTML page locally.
# In practice, you would use WebBaseLoader with real URLs.

sample_html_content = """
<html>
<head><title>LangChain Documentation</title></head>
<body>
<h1>Introduction to LangChain</h1>
<p>LangChain is a framework for developing applications powered by language models. 
It enables developers to build context-aware and reasoning applications.</p>

<h2>Key Features</h2>
<ul>
<li>Modular components for LLM applications</li>
<li>Chain composition for complex workflows</li>
<li>Integration with multiple data sources</li>
<li>Built-in output parsing and validation</li>
</ul>

<h2>Getting Started</h2>
<p>To get started with LangChain, install the package and explore the documentation. 
The framework supports various LLM providers and includes extensive examples.</p>
</body>
</html>
"""

# Save sample HTML
html_file = 'sample_webpage.html'
with open(html_file, 'w') as f:
    f.write(sample_html_content)

print(f"📄 Created sample HTML file: {html_file}")

# Custom web content processor
from bs4 import BeautifulSoup
import re

def process_web_content(html_content, url="local_file"):
    """Process HTML content and extract structured information"""
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Extract metadata
    title = soup.find('title')
    title_text = title.get_text().strip() if title else "No Title"
    
    # Extract headings
    headings = []
    for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        headings.append({
            'level': h.name,
            'text': h.get_text().strip()
        })
    
    # Extract clean text content
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Get text and clean it
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    clean_text = '\n'.join(chunk for chunk in chunks if chunk)
    
    # Create document with metadata
    metadata = {
        'source': url,
        'source_type': 'web_page',
        'title': title_text,
        'headings': headings,
        'content_length': len(clean_text),
        'heading_count': len(headings),
        'extraction_timestamp': datetime.datetime.now().isoformat()
    }
    
    return Document(page_content=clean_text, metadata=metadata)

# Process the sample HTML
with open(html_file, 'r') as f:
    html_content = f.read()

web_document = process_web_content(html_content, "sample_langchain_docs.html")

print(f"\n📄 Processed Web Document:")
print(f"Title: {web_document.metadata['title']}")
print(f"Content Length: {web_document.metadata['content_length']} characters")
print(f"Headings Found: {web_document.metadata['heading_count']}")

print(f"\n📋 Extracted Headings:")
for heading in web_document.metadata['headings']:
    print(f"  {heading['level'].upper()}: {heading['text']}")

print(f"\n📝 Content Preview:")
print(web_document.page_content[:300] + "...")

print(f"\n✅ Web content ingestion complete!")


WEB PAGE DATA INGESTION - Content Extraction
📄 Created sample HTML file: sample_webpage.html

📄 Processed Web Document:
Title: LangChain Documentation
Content Length: 550 characters
Headings Found: 3

📋 Extracted Headings:
  H1: Introduction to LangChain
  H2: Key Features
  H2: Getting Started

📝 Content Preview:
LangChain Documentation
Introduction to LangChain
LangChain is a framework for developing applications powered by language models.
It enables developers to build context-aware and reasoning applications.
Key Features
Modular components for LLM applications
Chain composition for complex workflows
Int...

✅ Web content ingestion complete!


### Unified Data Ingestion Pipeline

Create a unified pipeline that can handle multiple data sources and prepare them for downstream processing.

In [25]:
print("\n" + "=" * 60)
print("UNIFIED DATA INGESTION - Multi-Source Pipeline")
print("=" * 60)

class UnifiedDataIngestion:
    """Unified pipeline for ingesting data from multiple sources"""
    
    def __init__(self):
        self.documents = []
        self.source_stats = {}
    
    def ingest_csv(self, file_path: str, source_name: str) -> List[Document]:
        """Ingest data from CSV file"""
        try:
            loader = CSVLoader(file_path=file_path)
            docs = loader.load()
            
            # Enhance with metadata
            enhanced_docs = []
            for i, doc in enumerate(docs):
                enhanced_metadata = {
                    **doc.metadata,
                    'source_name': source_name,
                    'source_type': 'csv',
                    'document_index': i,
                    'ingestion_timestamp': datetime.datetime.now().isoformat()
                }
                enhanced_docs.append(Document(
                    page_content=doc.page_content,
                    metadata=enhanced_metadata
                ))
            
            self.documents.extend(enhanced_docs)
            self.source_stats[source_name] = {
                'type': 'csv',
                'document_count': len(enhanced_docs),
                'status': 'success'
            }
            
            return enhanced_docs
            
        except Exception as e:
            self.source_stats[source_name] = {
                'type': 'csv',
                'document_count': 0,
                'status': 'error',
                'error': str(e)
            }
            return []
    
    def ingest_web_content(self, html_content: str, source_name: str, url: Optional[str] = None) -> Optional[Document]:
        """Ingest content from web page"""
        try:
            doc = process_web_content(html_content, url or source_name)
            
            # Enhance metadata
            doc.metadata.update({
                'source_name': source_name,
                'ingestion_method': 'unified_pipeline'
            })
            
            self.documents.append(doc)
            self.source_stats[source_name] = {
                'type': 'web',
                'document_count': 1,
                'status': 'success',
                'content_length': len(doc.page_content)
            }
            
            return doc
            
        except Exception as e:
            self.source_stats[source_name] = {
                'type': 'web',
                'document_count': 0,
                'status': 'error',
                'error': str(e)
            }
            return None
    
    def get_summary(self) -> Dict[str, Any]:
        """Get ingestion summary statistics"""
        total_docs = len(self.documents)
        successful_sources = sum(1 for stats in self.source_stats.values() if stats['status'] == 'success')
        failed_sources = sum(1 for stats in self.source_stats.values() if stats['status'] == 'error')
        
        return {
            'total_documents': total_docs,
            'total_sources': len(self.source_stats),
            'successful_sources': successful_sources,
            'failed_sources': failed_sources,
            'source_details': self.source_stats
        }

# Test the unified pipeline
pipeline = UnifiedDataIngestion()

# Ingest CSV data
csv_docs = pipeline.ingest_csv('sample_products.csv', 'Product Catalog')
print(f"📄 Ingested {len(csv_docs)} documents from CSV")

# Ingest web content
with open('sample_webpage.html', 'r') as f:
    html_content = f.read()

web_doc = pipeline.ingest_web_content(html_content, 'LangChain Documentation', 'sample_docs.html')
if web_doc:
    print(f"🌐 Ingested web document: {web_doc.metadata['title']}")

# Get pipeline summary
summary = pipeline.get_summary()
print(f"\n📊 Ingestion Summary:")
print(f"Total Documents: {summary['total_documents']}")
print(f"Successful Sources: {summary['successful_sources']}/{summary['total_sources']}")

print(f"\n📋 Source Details:")
for source_name, stats in summary['source_details'].items():
    status_icon = "✅" if stats['status'] == 'success' else "❌"
    print(f"  {status_icon} {source_name}: {stats['document_count']} docs ({stats['type']})")

print(f"\n✅ Unified data ingestion pipeline complete!")


UNIFIED DATA INGESTION - Multi-Source Pipeline
📄 Ingested 5 documents from CSV
🌐 Ingested web document: LangChain Documentation

📊 Ingestion Summary:
Total Documents: 6
Successful Sources: 2/2

📋 Source Details:
  ✅ Product Catalog: 5 docs (csv)
  ✅ LangChain Documentation: 1 docs (web)

✅ Unified data ingestion pipeline complete!


### 🎯 Checkpoint 4: Data Ingestion Challenge

**Task**: Create a data ingestion pipeline that:
1. Loads customer review data from a CSV
2. Processes the data to extract key information
3. Adds metadata including sentiment analysis readiness
4. Prepares the data for Q&A processing

**Create sample data** and test your pipeline.

In [26]:
# Your solution here
# Create customer review ingestion pipeline

# Sample customer review data
review_data = {
    'review_id': ['R001', 'R002', 'R003'],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'customer_name': ['John D.', 'Sarah M.', 'Mike R.'],
    'rating': [5, 3, 4],
    'review_text': [
        'Excellent laptop with great performance and battery life.',
        'Mouse works okay but could be more ergonomic.',
        'Good keyboard with nice tactile feedback.'
    ]
}

print("Checkpoint 4 - Customer Review Ingestion:")
# Your implementation here

Checkpoint 4 - Customer Review Ingestion:


## Section 5: Chunking Strategies (20 minutes)

Large documents need to be split into smaller chunks for effective LLM processing. We'll explore different chunking strategies and their use cases.

### Fixed-Size Chunking

The simplest approach: split text into chunks of fixed character or token count with optional overlap.

In [27]:
print("\n" + "=" * 60)
print("FIXED-SIZE CHUNKING - Character-Based Splitting")
print("=" * 60)

# Create sample long text for chunking
long_text = """
LangChain is a framework for developing applications powered by language models. 
The framework enables developers to build context-aware and reasoning applications 
that can connect language models to other sources of data and interact with their environment.

The main value propositions of LangChain are: 1) Components: modular abstractions 
for the components necessary to work with language models, along with implementations 
for each abstraction. Components are modular and easy-to-use, whether you are using 
the rest of the LangChain framework or not. 2) Off-the-shelf chains: structured 
assemblies of components for accomplishing specific higher-level tasks.

Off-the-shelf chains make it easy to get started. For more complex applications 
and nuanced use-cases, components make it easy to customize existing chains or 
build new ones. The framework consists of several parts: LangChain Libraries, 
LangChain Templates, LangServe, and LangSmith.
"""

print(f"📄 Original Text Length: {len(long_text)} characters")

# Initialize character-based text splitter
# Note: it splits only at the separator, so a paragraph longer than chunk_size stays whole
char_splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separator="\n\n"
)

# Split the text
char_chunks = char_splitter.split_text(long_text)

print(f"\n✂️ Character-Based Chunking Results:")
print(f"Number of chunks: {len(char_chunks)}")

for i, chunk in enumerate(char_chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(f"{chunk[:100]}..." if len(chunk) > 100 else chunk)

# Inspect the boundary between the first two chunks for overlap
if len(char_chunks) > 1:
    print(f"\n🔗 Overlap Analysis:")
    chunk1_end = char_chunks[0][-50:]
    chunk2_start = char_chunks[1][:50]
    print(f"Chunk 1 end: ...{chunk1_end}")
    print(f"Chunk 2 start: {chunk2_start}...")

# Create documents with metadata
char_documents = char_splitter.create_documents(
    [long_text],
    metadatas=[{
        'source': 'langchain_overview',
        'chunking_method': 'character_based',
        'chunk_size': 200,
        'chunk_overlap': 50
    }]
)

print(f"\n📋 Document Metadata Example:")
print(f"Metadata: {char_documents[0].metadata}")
print(f"\n✅ Fixed-size chunking complete!")

Created a chunk of size 261, which is longer than the specified 200
Created a chunk of size 407, which is longer than the specified 200
Created a chunk of size 261, which is longer than the specified 200
Created a chunk of size 407, which is longer than the specified 200



FIXED-SIZE CHUNKING - Character-Based Splitting
📄 Original Text Length: 959 characters

✂️ Character-Based Chunking Results:
Number of chunks: 3

Chunk 1 (260 chars):
LangChain is a framework for developing applications powered by language models. 
The framework enab...

Chunk 2 (407 chars):
The main value propositions of LangChain are: 1) Components: modular abstractions 
for the component...

Chunk 3 (286 chars):
Off-the-shelf chains make it easy to get started. For more complex applications 
and nuanced use-cas...

🔗 Overlap Analysis:
Chunk 1 end: ...urces of data and interact with their environment.
Chunk 2 start: The main value propositions of LangChain are: 1) C...

📋 Document Metadata Example:
Metadata: {'source': 'langchain_overview', 'chunking_method': 'character_based', 'chunk_size': 200, 'chunk_overlap': 50}

✅ Fixed-size chunking complete!
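
The "Created a chunk of size ..." warnings in this output appear because `CharacterTextSplitter` cuts only at its `separator`: each paragraph here is longer than `chunk_size`, so it is kept whole, and no text is shared between neighbouring chunks. For contrast, the following is a minimal pure-Python sketch of a strict sliding window. The helper `fixed_size_chunks` is hypothetical and not part of LangChain; it ignores word and sentence boundaries entirely.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 50  # 500 characters of illustrative text
chunks = fixed_size_chunks(sample, chunk_size=200, chunk_overlap=50)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

A window like this guarantees the size limit and the overlap, at the cost of cutting through words and sentences, which is the trade-off the separator-based splitters try to avoid.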


### Semantic Chunking with RecursiveCharacterTextSplitter

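`RecursiveCharacterTextSplitter` tries a list of separators in order (paragraph breaks first, then line breaks, sentences, and words), so it tends to keep related content together. This is structure-aware splitting rather than meaning-based (embedding) chunking.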
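
To make the idea of trying separators in priority order concrete, here is a simplified sketch in plain Python. It is an illustration under stated assumptions, not LangChain's implementation: the function `recursive_split` is hypothetical, and it omits overlap, `keep_separator`, and regex separators.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)

    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long for this separator: try the finer ones.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


doc = "## Intro\nShort paragraph.\n\n## Details\n" + "word " * 30
result = recursive_split(doc, ["\n\n", "\n", " ", ""], chunk_size=60)
for chunk in result:
    print(len(chunk), repr(chunk))
```

The short first section survives as one chunk because the paragraph break is tried first; only the long second section is broken down with finer separators.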

In [28]:
print("\n" + "=" * 60)
print("SEMANTIC CHUNKING - Structure-Aware Splitting")
print("=" * 60)

# Create structured text with different separators
structured_text = """
# LangChain Framework Overview

## Introduction
LangChain is a framework for developing applications powered by language models. The framework enables developers to build context-aware and reasoning applications.

## Core Components

### Prompt Templates
Prompt templates provide a structured way to format inputs to language models. They support variable substitution and can be composed for complex scenarios.

### Output Parsers
Output parsers extract structured data from language model responses. They support various formats including JSON, XML, and custom schemas.

### Chains
Chains combine multiple components into workflows. They enable complex processing pipelines and can be nested for sophisticated applications.

## Data Integration

### Document Loaders
Document loaders provide interfaces to various data sources including files, databases, and web APIs. They handle format conversion and metadata extraction.

### Text Splitters
Text splitters break large documents into manageable chunks. They support different strategies including fixed-size and semantic splitting.

## Conclusion
LangChain provides a comprehensive toolkit for building LLM applications with proper abstractions and integrations.
"""

print(f"📄 Structured Text Length: {len(structured_text)} characters")

# Initialize recursive character text splitter
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""],  # Try these separators in order
    keep_separator=True
)

# Split the structured text
semantic_chunks = recursive_splitter.split_text(structured_text)

print(f"\n🧠 Semantic Chunking Results:")
print(f"Number of chunks: {len(semantic_chunks)}")

for i, chunk in enumerate(semantic_chunks):
    # Identify the content type
    content_type = "Unknown"
    if chunk.strip().startswith("#"):
        content_type = "Header"
    elif "###" in chunk:
        content_type = "Subsection"
    elif "##" in chunk:
        content_type = "Section"
    else:
        content_type = "Content"
    
    print(f"\nChunk {i+1} ({len(chunk)} chars) - {content_type}:")
    preview = chunk.strip()[:150].replace('\n', ' ')
    print(f"{preview}..." if len(chunk.strip()) > 150 else chunk.strip())

# Compare with simple character splitting
simple_splitter = CharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)
simple_chunks = simple_splitter.split_text(structured_text)

print(f"\n📊 Chunking Comparison:")
print(f"Semantic chunking: {len(semantic_chunks)} chunks")
print(f"Simple chunking: {len(simple_chunks)} chunks")

# Analyze chunk quality
def analyze_chunk_quality(chunks, method_name):
    """Score chunks with two rough heuristics: mid-chunk headers and unfinished sentences"""
    header_breaks = 0
    sentence_breaks = 0
    
    for chunk in chunks:
        # Count chunks that contain a header but do not start with one
        if '##' in chunk and not chunk.strip().startswith('#'):
            header_breaks += 1
        
        # Count chunks that do not end with sentence-final punctuation
        if not chunk.strip().endswith(('.', '!', '?')):
            sentence_breaks += 1
    
    return {
        'method': method_name,
        'total_chunks': len(chunks),
        'header_breaks': header_breaks,
        'sentence_breaks': sentence_breaks,
        # Can be negative when a chunk trips both checks; guard against an empty list
        'quality_score': 1 - (header_breaks + sentence_breaks) / len(chunks) if chunks else 0.0
    }

semantic_quality = analyze_chunk_quality(semantic_chunks, "Semantic")
simple_quality = analyze_chunk_quality(simple_chunks, "Simple")

print(f"\n📈 Quality Analysis:")
for quality in [semantic_quality, simple_quality]:
    print(f"{quality['method']} chunking:")
    print(f"  Quality Score: {quality['quality_score']:.2f}")
    print(f"  Header Breaks: {quality['header_breaks']}")
    print(f"  Sentence Breaks: {quality['sentence_breaks']}")

print(f"\n✅ Semantic chunking analysis complete!")


SEMANTIC CHUNKING - Structure-Aware Splitting
📄 Structured Text Length: 1218 characters

🧠 Semantic Chunking Results:
Number of chunks: 6

Chunk 1 (232 chars) - Header:
# LangChain Framework Overview  ## Introduction LangChain is a framework for developing applications powered by language models. The framework enables...

Chunk 2 (197 chars) - Header:
## Core Components  ### Prompt Templates Prompt templates provide a structured way to format inputs to language models. They support variable substitu...

Chunk 3 (158 chars) - Header:
### Output Parsers Output parsers extract structured data from language model responses. They support various formats including JSON, XML, and custom ...

Chunk 4 (173 chars) - Header:
### Chains Chains combine multiple components into workflows. They enable complex processing pipelines and can be nested for sophisticated application...

Chunk 5 (198 chars) - Header:
## Data Integration  ### Document Loaders Document loaders provide interfaces to var

### Advanced Chunking Strategies

Explore different chunking approaches for various document types and use cases.

In [29]:
print("\n" + "=" * 60)
print("ADVANCED CHUNKING STRATEGIES - Use Case Optimization")
print("=" * 60)

def create_adaptive_chunker(content_type: str, target_use_case: str):
    """Create chunker optimized for specific content and use case"""
    
    if content_type == "code":
        # For code, preserve function boundaries
        return RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=100,
            separators=["\n\nclass ", "\n\ndef ", "\n\n", "\n", " ", ""],
            keep_separator=True
        )
    
    elif content_type == "academic":
        # For academic papers, preserve paragraph structure
        return RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100,
            separators=["\n\n", "\n", ". ", " ", ""],
            keep_separator=True
        )
    
    elif content_type == "conversational":
        # For chat/dialogue, preserve conversation turns
        return RecursiveCharacterTextSplitter(
            chunk_size=300,
            chunk_overlap=50,
            separators=["\n\nUser:", "\n\nAssistant:", "\n\n", "\n", " ", ""],
            keep_separator=True
        )
    
    else:
        # Default general-purpose chunker
        return RecursiveCharacterTextSplitter(
            chunk_size=400,
            chunk_overlap=50
        )

# Test with different content types
test_contents = {
    "code": """
class DataProcessor:
    def __init__(self, config):
        self.config = config
        self.data = []
    
    def load_data(self, source):
        '''Load data from specified source'''
        if source.endswith('.csv'):
            return self._load_csv(source)
        elif source.endswith('.json'):
            return self._load_json(source)
        else:
            raise ValueError("Unsupported format")
    
    def _load_csv(self, file_path):
        import pandas as pd
        return pd.read_csv(file_path)
    
    def _load_json(self, file_path):
        import json
        with open(file_path, 'r') as f:
            return json.load(f)
    """,
    
    "academic": """
    Abstract: This paper presents a novel approach to natural language processing using transformer architectures. We demonstrate significant improvements in performance across multiple benchmarks.
    
    Introduction: Natural language processing has seen remarkable advances in recent years, particularly with the introduction of transformer-based models. These models have achieved state-of-the-art results on a wide range of tasks including machine translation, text summarization, and question answering.
    
    The key innovation of transformer models lies in their attention mechanism, which allows the model to focus on relevant parts of the input sequence when generating each output token. This approach has proven more effective than previous recurrent neural network architectures.
    
    Methodology: Our approach builds upon the standard transformer architecture by introducing several key modifications. First, we implement a novel attention pattern that reduces computational complexity while maintaining performance.
    """,
    
    "conversational": """
    User: Can you explain how LangChain works?
    
    Assistant: LangChain is a framework that helps developers build applications with language models. It provides modular components that can be combined to create complex workflows.
    
    User: What are the main components?
    
    Assistant: The main components include prompt templates for structuring inputs, output parsers for extracting structured data, chains for combining operations, and document loaders for data integration.
    
    User: How do I get started?
    
    Assistant: Start by installing LangChain, then create a simple prompt template and chain it with a language model. The documentation provides excellent examples to follow.
    """
}

# Test adaptive chunking
for content_type, content in test_contents.items():
    chunker = create_adaptive_chunker(content_type, "qa")
    chunks = chunker.split_text(content)
    
    print(f"\n📝 {content_type.title()} Content Chunking:")
    print(f"Original length: {len(content)} chars")
    print(f"Number of chunks: {len(chunks)}")
    print(f"Average chunk size: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
    
    # Show first chunk as example
    if chunks:
        preview = chunks[0][:100].replace('\n', ' ').strip()
        print(f"First chunk preview: {preview}...")

print(f"\n✅ Advanced chunking strategies demonstrated!")

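
The sample strings in `test_contents` are indented inside the dictionary literal, so a "blank" line between turns is really a newline, some spaces, and another newline. A separator such as `"\n\nUser:"` may therefore never match. The sketch below, using only the standard library, shows one way to normalize the text first; `split_turns` is a hypothetical helper, and the dialogue is shortened illustrative data.

```python
import re
import textwrap

# Indented sample text, written with explicit escapes so the
# whitespace-only lines are visible.
raw = (
    "\n"
    "    User: Can you explain how LangChain works?\n"
    "    \n"
    "    Assistant: It is a framework for building LLM applications.\n"
    "    \n"
    "    User: What are the main components?\n"
    "    "
)


def split_turns(text: str) -> list[str]:
    """Return one string per conversation turn, with indentation removed."""
    cleaned = textwrap.dedent(text).strip()
    # Split on blank lines that are followed by a speaker label.
    turns = re.split(r"\n\s*\n(?=(?:User|Assistant):)", cleaned)
    return [turn.strip() for turn in turns if turn.strip()]


print("\n\nUser:" in raw)                   # is the separator in the indented text?
print("\n\nUser:" in textwrap.dedent(raw))  # and after dedenting?
turns = split_turns(raw)
for turn in turns:
    print(repr(turn))
```

Passing `textwrap.dedent(content)` to a splitter gives turn-based or `def`-based separators a chance to match.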

### 🎯 Checkpoint 5: Chunking Strategy Selection

**Task**: Given different types of content, choose and implement the most appropriate chunking strategy:

1. **Product manual** (structured with sections and subsections)
2. **Customer reviews** (short, independent texts)
3. **Technical documentation** (code examples and explanations)

**Justify your choices** and demonstrate the chunking results.

In [None]:
# Your solution here
# Implement appropriate chunking strategies for different content types

sample_contents = {
    "product_manual": """
    # Smartphone User Manual
    
    ## Getting Started
    ### Unboxing
    Your package contains: smartphone, charger, USB cable, earphones, user manual.
    
    ### First Setup
    1. Insert SIM card
    2. Power on device
    3. Follow setup wizard
    
    ## Basic Operations
    ### Making Calls
    To make a call, open the phone app and dial the number.
    """,
    
    "customer_reviews": """
    Review 1: Great phone with excellent camera quality. Battery lasts all day.
    
    Review 2: Good value for money but screen could be brighter.
    
    Review 3: Fast performance and smooth interface. Highly recommended.
    """,
    
    "technical_docs": """
    # API Documentation
    
    ## Authentication
    Use API key in header:
    ```python
    headers = {'Authorization': 'Bearer YOUR_API_KEY'}
    response = requests.get(url, headers=headers)
    ```
    
    ## Error Handling
    Handle errors appropriately:
    ```python
    try:
        response = api_call()
    except APIError as e:
        print(f"Error: {e}")
    ```
    """
}

print("Checkpoint 5 - Chunking Strategy Selection:")
# Your implementation and justification here

## Section 6: End-to-End Integration (30 minutes)

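Now we'll combine all the components we've learned into a complete pipeline that exercises every stage on the sample data. Retrieval here uses simple keyword matching, so treat it as a teaching baseline rather than production-ready search.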

### Complete Q&A Pipeline Integration

Build a comprehensive pipeline that combines data ingestion, chunking, structured prompting, and output parsing.
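
The last stage, output parsing, turns the model's text into a validated object. The sketch below illustrates that idea with the standard library only. It assumes a response with `answer` and `confidence` fields and a confidence between 0 and 1; `ParsedAnswer` and `parse_answer` are hypothetical stand-ins, not the `QAResponse` model or `PydanticOutputParser` themselves.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedAnswer:
    answer: str
    confidence: float


def parse_answer(raw_text: str) -> ParsedAnswer:
    """Parse the outermost {...} span of raw_text and validate its fields."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw_text[start:end + 1])

    answer = data["answer"]  # raises KeyError if the field is missing
    confidence = float(data["confidence"])
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("'answer' must be a non-empty string")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be between 0 and 1")
    return ParsedAnswer(answer=answer, confidence=confidence)


# Illustrative model reply with some text around the JSON payload
reply = 'Here is the result:\n{"answer": "LangChain is an LLM framework.", "confidence": 0.9}'
parsed = parse_answer(reply)
print(parsed)
```

Rejecting malformed output with an exception is what makes a retry loop around the chain useful: a failed parse can trigger another attempt instead of passing bad data downstream.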

In [None]:
print("\n" + "=" * 60)
print("END-TO-END INTEGRATION - Complete Q&A Pipeline")
print("=" * 60)

class LangChainQAPipeline:
    """Complete Q&A pipeline with all components integrated"""
    
    def __init__(self, llm):
        self.llm = llm
        self.documents = []
        self.chunks = []
        
        # Initialize components
        self.qa_parser = PydanticOutputParser(pydantic_object=QAResponse)
        self.qa_prompt = PromptTemplate(
            template="""
            You are a helpful AI assistant. Answer the question based on the provided context.
            
            Context: {context}
            
            Question: {question}
            
            Provide a clear, accurate answer with a confidence score.
            
            {format_instructions}
            """,
            input_variables=["context", "question"],
            partial_variables={"format_instructions": self.qa_parser.get_format_instructions()}
        )
        
        # Create the chain
        self.qa_chain = self.qa_prompt | self.llm | self.qa_parser
        
        print("✅ Q&A Pipeline initialized")
    
    def ingest_data(self, data_sources: List[Dict[str, Any]]) -> None:
        """Ingest data from multiple sources"""
        ingestion_pipeline = UnifiedDataIngestion()
        
        for source in data_sources:
            if source['type'] == 'csv':
                docs = ingestion_pipeline.ingest_csv(source['path'], source['name'])
                self.documents.extend(docs)
            elif source['type'] == 'web':
                with open(source['path'], 'r') as f:
                    content = f.read()
                doc = ingestion_pipeline.ingest_web_content(content, source['name'])
                if doc:
                    self.documents.append(doc)
        
        print(f"📄 Ingested {len(self.documents)} documents")
    
    def chunk_documents(self, chunking_strategy: str = "adaptive") -> None:
        """Chunk documents using specified strategy"""
        if chunking_strategy == "adaptive":
            # Use different chunkers based on content type
            for doc in self.documents:
                content_type = doc.metadata.get('source_type', 'general')
                chunker = create_adaptive_chunker(content_type, "qa")
                
                chunk_texts = chunker.split_text(doc.page_content)
                
                for i, chunk_text in enumerate(chunk_texts):
                    chunk_metadata = {
                        **doc.metadata,
                        'chunk_index': i,
                        'chunk_count': len(chunk_texts),
                        'chunking_strategy': chunking_strategy
                    }
                    
                    chunk_doc = Document(
                        page_content=chunk_text,
                        metadata=chunk_metadata
                    )
                    self.chunks.append(chunk_doc)
        
        print(f"✂️ Created {len(self.chunks)} chunks using {chunking_strategy} strategy")
    
    def find_relevant_chunks(self, question: str, max_chunks: int = 3) -> List[Document]:
        """Find most relevant chunks for the question (simple keyword matching)"""
        question_words = set(question.lower().split())
        
        chunk_scores = []
        for chunk in self.chunks:
            chunk_words = set(chunk.page_content.lower().split())
            overlap = len(question_words.intersection(chunk_words))
            score = overlap / len(question_words) if question_words else 0
            # Keep only chunks that share at least one word with the question,
            # so answer_question can report when nothing relevant was found
            if score > 0:
                chunk_scores.append((chunk, score))
        
        # Sort by score and return top chunks
        chunk_scores.sort(key=lambda x: x[1], reverse=True)
        return [chunk for chunk, score in chunk_scores[:max_chunks]]
    
    def answer_question(self, question: str, max_retries: int = 3) -> QAResponse:
        """Answer question using the pipeline"""
        # Find relevant chunks
        relevant_chunks = self.find_relevant_chunks(question)
        
        if not relevant_chunks:
            return QAResponse(
                answer="No relevant information found in the knowledge base.",
                confidence=0.0,
                reasoning="No matching content found",
                sources_needed=True
            )
        
        # Combine relevant chunks as context
        context_parts = []
        for chunk in relevant_chunks:
            source_info = chunk.metadata.get('source_name', 'Unknown')
            context_parts.append(f"[Source: {source_info}] {chunk.page_content}")
        
        context = "\n\n".join(context_parts)
        
        # Try to get answer with retries
        for attempt in range(max_retries):
            try:
                result = self.qa_chain.invoke({
                    "context": context,
                    "question": question
                })
                
                if isinstance(result, QAResponse):
                    return result
                    
            except Exception as e:
                print(f"❌ Attempt {attempt + 1} failed: {str(e)[:100]}...")
                if attempt == max_retries - 1:
                    return QAResponse(
                        answer="Unable to process the question due to technical issues.",
                        confidence=0.0,
                        reasoning="Processing error after multiple attempts",
                        sources_needed=True
                    )
        
        # Fallback (should not reach here)
        return QAResponse(
            answer="Unexpected error occurred.",
            confidence=0.0,
            reasoning="Unknown error",
            sources_needed=True
        )
    
    def get_pipeline_stats(self) -> Dict[str, Any]:
        """Get pipeline statistics"""
        return {
            'total_documents': len(self.documents),
            'total_chunks': len(self.chunks),
            'avg_chunk_size': sum(len(c.page_content) for c in self.chunks) / len(self.chunks) if self.chunks else 0,
            'source_types': list(set(doc.metadata.get('source_type', 'unknown') for doc in self.documents))
        }

# Initialize the complete pipeline
qa_pipeline = LangChainQAPipeline(llm)

# Define data sources
data_sources = [
    {'type': 'csv', 'path': 'sample_products.csv', 'name': 'Product Catalog'},
    {'type': 'web', 'path': 'sample_webpage.html', 'name': 'LangChain Documentation'}
]

# Execute the pipeline
print(f"\n🔄 Executing complete pipeline...")
qa_pipeline.ingest_data(data_sources)
qa_pipeline.chunk_documents("adaptive")

# Get pipeline statistics
stats = qa_pipeline.get_pipeline_stats()
print(f"\n📊 Pipeline Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

print(f"\n✅ Complete pipeline ready for questions!")
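
The keyword matching in `find_relevant_chunks` compares whitespace-separated words, so punctuation attached to a word (for example "catalog?") prevents a match with "catalog". The following sketch shows the same overlap score with regex tokenization, under the assumption that plain strings stand in for `Document` objects. `tokenize` and `rank_chunks` are hypothetical helpers, and the chunk texts are illustrative.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_chunks(question: str, chunks: list[str], max_chunks: int = 3) -> list[tuple[str, float]]:
    """Rank chunks by the fraction of question words they contain."""
    question_words = tokenize(question)
    if not question_words:
        return []
    scored = []
    for chunk in chunks:
        score = len(question_words & tokenize(chunk)) / len(question_words)
        if score > 0:  # skip chunks that share no words with the question
            scored.append((chunk, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_chunks]


chunks = [
    "name: Laptop, price: 999, rating: 4.5",
    "LangChain is a framework for developing applications.",
    "Catalog of products: laptop, mouse, keyboard.",
]
ranked = rank_chunks("What products are in the catalog?", chunks)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

Word overlap is only a baseline: it cannot match synonyms or paraphrases, which is why retrieval in real systems usually relies on embeddings or a search index.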

### Pipeline Testing and Validation

Test the complete pipeline with various types of questions to validate its performance and reliability.

In [None]:
print("\n" + "=" * 60)
print("PIPELINE TESTING - Comprehensive Validation")
print("=" * 60)

# Test questions covering different scenarios
test_questions = [
    {
        'question': 'What products are available in the catalog?',
        'expected_source': 'Product Catalog',
        'category': 'factual_retrieval'
    },
    {
        'question': 'What is LangChain and what are its key features?',
        'expected_source': 'LangChain Documentation',
        'category': 'conceptual_explanation'
    },
    {
        'question': 'Which product has the highest rating?',
        'expected_source': 'Product Catalog',
        'category': 'analytical_query'
    },
    {
        'question': 'How do I get started with building LLM applications?',
        'expected_source': 'LangChain Documentation',
        'category': 'procedural_guidance'
    },
    {
        'question': 'What is the price of quantum computers?',
        'expected_source': None,
        'category': 'out_of_scope'
    }
]

# Test each question
test_results = []

for i, test_case in enumerate(test_questions):
    print(f"\n🧪 Test {i+1}: {test_case['category'].replace('_', ' ').title()}")
    print(f"Question: {test_case['question']}")
    
    # Get answer from pipeline
    try:
        answer = qa_pipeline.answer_question(test_case['question'])
        
        print(f"\n🤖 Answer: {answer.answer}")
        print(f"📊 Confidence: {answer.confidence:.2f}")
        if answer.reasoning:
            print(f"🧠 Reasoning: {answer.reasoning}")
        print(f"📚 Sources Needed: {answer.sources_needed}")
        
        # Evaluate the answer
        evaluation = {
            'question': test_case['question'],
            'category': test_case['category'],
            'answer_provided': bool(answer.answer and answer.answer != "No relevant information found in the knowledge base."),
            'confidence': answer.confidence,
            'appropriate_confidence': (
                answer.confidence > 0.7 if test_case['expected_source'] else answer.confidence < 0.3
            ),
            'sources_needed': answer.sources_needed
        }
        
        test_results.append(evaluation)
        
        # Quick evaluation feedback
        if test_case['expected_source'] and evaluation['answer_provided']:
            print(f"✅ Successfully answered question with relevant information")
        elif not test_case['expected_source'] and not evaluation['answer_provided']:
            print(f"✅ Correctly identified out-of-scope question")
        else:
            print(f"⚠️ Answer quality may need review")
            
    except Exception as e:
        print(f"❌ Error processing question: {e}")
        test_results.append({
            'question': test_case['question'],
            'category': test_case['category'],
            'error': str(e)
        })

# Analyze test results
print(f"\n" + "=" * 40)
print(f"TEST RESULTS ANALYSIS")
print(f"=" * 40)

successful_tests = sum(1 for r in test_results if r.get('answer_provided', False))
total_tests = len(test_results)
avg_confidence = sum(r.get('confidence', 0) for r in test_results) / total_tests

print(f"📊 Overall Performance:")
print(f"  Successful Answers: {successful_tests}/{total_tests} ({successful_tests/total_tests*100:.1f}%)")
print(f"  Average Confidence: {avg_confidence:.2f}")

# Category breakdown
categories = {}
for result in test_results:
    cat = result['category']
    if cat not in categories:
        categories[cat] = {'total': 0, 'successful': 0}
    categories[cat]['total'] += 1
    if result.get('answer_provided', False):
        categories[cat]['successful'] += 1

print(f"\n📋 Performance by Category:")
for category, stats in categories.items():
    success_rate = stats['successful'] / stats['total'] * 100
    print(f"  {category.replace('_', ' ').title()}: {stats['successful']}/{stats['total']} ({success_rate:.1f}%)")

print(f"\n‚úÖ Pipeline testing complete!")

### Performance Monitoring and Optimization

Add a monitor that logs each query's response time, confidence, and retrieved sources, then derives optimization suggestions from those logs.

In [None]:
print("\n" + "=" * 60)
print("PERFORMANCE MONITORING - Pipeline Optimization")
print("=" * 60)

import datetime  # needed for the timestamps written by log_query
import time
from typing import Any, Dict, List

class PipelineMonitor:
    """Monitor pipeline performance and provide optimization insights"""
    
    def __init__(self):
        self.query_logs = []
        self.performance_metrics = {
            'total_queries': 0,
            'successful_queries': 0,
            'avg_response_time': 0,
            'avg_confidence': 0,
            'chunk_utilization': {}
        }
    
    def log_query(self, question: str, answer: QAResponse, response_time: float, chunks_used: List[Document]):
        """Log a query and its results"""
        log_entry = {
            'timestamp': datetime.datetime.now().isoformat(),
            'question': question,
            'answer': answer.answer,
            'confidence': answer.confidence,
            'response_time': response_time,
            'chunks_used': len(chunks_used),
            'sources_needed': answer.sources_needed,
            'chunk_sources': [chunk.metadata.get('source_name', 'Unknown') for chunk in chunks_used]
        }
        
        self.query_logs.append(log_entry)
        self._update_metrics()
    
    def _update_metrics(self):
        """Update performance metrics"""
        if not self.query_logs:
            return
        
        self.performance_metrics['total_queries'] = len(self.query_logs)
        self.performance_metrics['successful_queries'] = sum(
            1 for log in self.query_logs if log['confidence'] > 0.5
        )
        self.performance_metrics['avg_response_time'] = sum(
            log['response_time'] for log in self.query_logs
        ) / len(self.query_logs)
        self.performance_metrics['avg_confidence'] = sum(
            log['confidence'] for log in self.query_logs
        ) / len(self.query_logs)
        
        # Track chunk utilization
        source_usage = {}
        for log in self.query_logs:
            for source in log['chunk_sources']:
                source_usage[source] = source_usage.get(source, 0) + 1
        
        self.performance_metrics['chunk_utilization'] = source_usage
    
    def get_performance_report(self) -> Dict[str, Any]:
        """Generate comprehensive performance report"""
        if not self.query_logs:
            return {'status': 'No queries logged yet'}
        
        # Calculate additional insights
        high_confidence_queries = [log for log in self.query_logs if log['confidence'] > 0.8]
        low_confidence_queries = [log for log in self.query_logs if log['confidence'] < 0.3]
        slow_queries = [log for log in self.query_logs if log['response_time'] > 5.0]
        
        return {
            'summary': self.performance_metrics,
            'success_rate': self.performance_metrics['successful_queries'] / self.performance_metrics['total_queries'],
            'high_confidence_rate': len(high_confidence_queries) / len(self.query_logs),
            'low_confidence_rate': len(low_confidence_queries) / len(self.query_logs),
            'slow_query_rate': len(slow_queries) / len(self.query_logs),
            'optimization_suggestions': self._generate_optimization_suggestions()
        }
    
    def _generate_optimization_suggestions(self) -> List[str]:
        """Generate optimization suggestions based on performance data"""
        suggestions = []
        
        if self.performance_metrics['avg_confidence'] < 0.6:
            suggestions.append("Consider improving chunk relevance scoring or adding more diverse data sources")
        
        if self.performance_metrics['avg_response_time'] > 3.0:
            suggestions.append("Response time is high - consider optimizing chunk retrieval or reducing chunk size")
        
        # Check for uneven source utilization
        utilization = self.performance_metrics['chunk_utilization']
        if utilization:
            max_usage = max(utilization.values())
            min_usage = min(utilization.values())
            if max_usage > min_usage * 3:
                suggestions.append("Uneven source utilization detected - some sources may be underutilized")
        
        sources_needed_rate = sum(1 for log in self.query_logs if log['sources_needed']) / len(self.query_logs)
        if sources_needed_rate > 0.3:
            suggestions.append("High rate of queries needing additional sources - consider expanding knowledge base")
        
        return suggestions if suggestions else ["Pipeline performance looks good!"]

# Enhanced pipeline with monitoring
class MonitoredQAPipeline(LangChainQAPipeline):
    """Q&A Pipeline with integrated performance monitoring"""
    
    def __init__(self, llm):
        super().__init__(llm)
        self.monitor = PipelineMonitor()
    
    def answer_question(self, question: str, max_retries: int = 3) -> QAResponse:
        """Answer question with performance monitoring"""
        # Time only the parent call so the measurement reflects what the user waits for
        start_time = time.perf_counter()
        answer = super().answer_question(question, max_retries)
        response_time = time.perf_counter() - start_time
        
        # Retrieve chunks separately for logging, outside the timed region.
        # Assumes find_relevant_chunks returns the same chunks the parent call used.
        relevant_chunks = self.find_relevant_chunks(question)
        
        # Log the query
        self.monitor.log_query(question, answer, response_time, relevant_chunks)
        
        return answer
    
    def get_performance_report(self):
        """Get performance report from monitor"""
        return self.monitor.get_performance_report()

# Test the monitored pipeline
print("🔧 Initializing monitored pipeline...")
monitored_pipeline = MonitoredQAPipeline(llm)
monitored_pipeline.ingest_data(data_sources)
monitored_pipeline.chunk_documents("adaptive")

# Run test queries with monitoring
test_queries = [
    "What products are available?",
    "What is LangChain?",
    "Which product has the best rating?",
    "How do I use prompt templates?",
    "What is the weather today?"  # Out of scope
]

print("\n🧪 Running monitored test queries...")
for i, query in enumerate(test_queries):
    print(f"\nQuery {i+1}: {query}")
    answer = monitored_pipeline.answer_question(query)
    print(f"Confidence: {answer.confidence:.2f}")

# Generate performance report
print("\n" + "=" * 40)
print("PERFORMANCE REPORT")
print("=" * 40)

report = monitored_pipeline.get_performance_report()

print("📊 Performance Summary:")
print(f"  Total Queries: {report['summary']['total_queries']}")
print(f"  Success Rate: {report['success_rate']:.1%}")
print(f"  High Confidence Rate: {report['high_confidence_rate']:.1%}")
print(f"  Average Response Time: {report['summary']['avg_response_time']:.2f}s")
print(f"  Average Confidence: {report['summary']['avg_confidence']:.2f}")

print("\n📋 Source Utilization:")
# Counts are per retrieved chunk, so one query can add several to the same source
for source, count in report['summary']['chunk_utilization'].items():
    print(f"  {source}: {count} chunks retrieved")

print("\n💡 Optimization Suggestions:")
for suggestion in report['optimization_suggestions']:
    print(f"  • {suggestion}")

print("\n✅ Performance monitoring complete!")
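
The report above summarizes latency with a mean, which a single slow query can skew. As a complementary view, the sketch below computes nearest-rank percentiles over logged response times. It assumes log entries shaped like the dictionaries in `PipelineMonitor.query_logs` (a `response_time` key, in seconds). The helper name `latency_percentiles` and the sample values are hypothetical and not part of the exercise code.

```python
import math
from typing import Dict, List, Sequence


def latency_percentiles(
    query_logs: List[dict],
    percentiles: Sequence[int] = (50, 95),
) -> Dict[str, float]:
    """Nearest-rank percentiles of 'response_time' across log entries."""
    times = sorted(log["response_time"] for log in query_logs)
    if not times:
        return {}

    result = {}
    for p in percentiles:
        # Nearest-rank method: smallest value with at least p% of samples at or below it
        rank = max(1, math.ceil(p / 100 * len(times)))
        result[f"p{p}"] = times[rank - 1]
    return result


# Hypothetical log entries, shaped like PipelineMonitor.query_logs
sample_logs = [
    {"response_time": 0.8},
    {"response_time": 1.2},
    {"response_time": 0.9},
    {"response_time": 6.5},  # one slow outlier
    {"response_time": 1.1},
]

stats = latency_percentiles(sample_logs)
print(stats)
```

With the monitored pipeline, this could be called as `latency_percentiles(monitored_pipeline.monitor.query_logs)`. In the sample data the outlier pulls the mean up to 2.1 s, while the median stays at 1.1 s.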

### 🎯 Final Challenge: Complete Pipeline Implementation

**Task**: Build a complete Q&A pipeline for a specific domain (choose one):

1. **Customer Support Bot** - Handle product inquiries and support requests
2. **Technical Documentation Assistant** - Answer questions about API usage and code examples
3. **Educational Content Helper** - Provide explanations and learning guidance

**Requirements**:
- Implement data ingestion from at least 2 sources
- Use appropriate chunking strategy for your domain
- Include structured output parsing with confidence scoring
- Add performance monitoring
- Test with domain-specific questions
- Provide optimization recommendations

In [None]:
# Your complete pipeline implementation here
# Choose your domain and implement all required components

print("Final Challenge - Complete Pipeline Implementation:")
print("Choose your domain and implement the full pipeline")

# Your implementation here
# 1. Define your domain and data sources
# 2. Create domain-specific Pydantic models
# 3. Implement data ingestion
# 4. Set up appropriate chunking
# 5. Create domain-specific prompts
# 6. Add monitoring and testing
# 7. Generate performance report

## Summary and Key Takeaways

### 🎓 Concepts Mastered:

1. **LangChain Fundamentals**: Understanding the framework architecture and core components
2. **PromptTemplate Design**: Creating reusable, parameterized prompts with validation
3. **Structured Output Parsing**: Using Pydantic models for reliable data extraction
4. **Data Ingestion Pipelines**: Processing CSV and web data with metadata enhancement
5. **Chunking Strategies**: Implementing fixed-size and semantic chunking approaches
6. **End-to-End Integration**: Combining all components into production-ready pipelines

### 🛠️ Technical Skills Developed:

- **Component Integration**: Chaining LangChain components effectively
- **Error Handling**: Implementing robust retry mechanisms and fallbacks
- **Performance Monitoring**: Tracking pipeline metrics and optimization
- **Data Processing**: Handling multiple data sources and formats
- **Quality Assurance**: Testing and validating pipeline outputs
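
As a reminder of what the retry mechanism looks like in isolation, here is a minimal sketch of retrying with exponential backoff. It is not the pipeline's actual `answer_question` implementation. The names `call_with_retries` and `flaky_llm_call` are hypothetical, and the `sleep` parameter is injectable only so the example runs without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on any exception with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:  # in real code, catch narrower exception types
            last_error = exc
            if attempt < max_retries - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Operation failed after {max_retries} attempts") from last_error


# Hypothetical flaky operation: fails twice, then succeeds
attempts = {"count": 0}


def flaky_llm_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("malformed model output")
    return '{"answer": "ok", "confidence": 0.9}'


delays = []  # record requested delays instead of actually sleeping
result = call_with_retries(flaky_llm_call, max_retries=3, sleep=delays.append)
print(result, delays)
```

Backing off between attempts gives a rate-limited or overloaded endpoint time to recover, and raising after the final attempt leaves the fallback decision to the caller.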

### 🚀 Best Practices Learned:

- **Modular Design**: Build reusable components that can be easily combined
- **Metadata Management**: Attach comprehensive metadata for better context
- **Adaptive Strategies**: Choose appropriate techniques based on content type
- **Monitoring Integration**: Build observability into your pipelines from the start
- **Graceful Degradation**: Handle errors and edge cases appropriately
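
Graceful degradation can be illustrated with a dependency-free stand-in for the Pydantic-based parsing used in this exercise. The sketch below assumes the model is asked for JSON with `answer`, `confidence`, and `sources_needed` fields, mirroring the `QAResponse` attributes used above. The function name `parse_answer_or_fallback` and the fallback wording are hypothetical.

```python
import json
from typing import Any, Dict


def parse_answer_or_fallback(raw_output: str) -> Dict[str, Any]:
    """Parse model output as JSON; degrade to a zero-confidence answer on failure."""
    fallback = {
        "answer": "I could not produce a reliable answer.",
        "confidence": 0.0,
        "sources_needed": True,
    }
    try:
        data = json.loads(raw_output)
        answer = str(data["answer"])
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, missing fields, or wrong types all degrade the same way
        return fallback

    # Clamp out-of-range confidence rather than rejecting the whole answer
    confidence = min(1.0, max(0.0, confidence))
    return {
        "answer": answer,
        "confidence": confidence,
        "sources_needed": bool(data.get("sources_needed", False)),
    }


good = parse_answer_or_fallback('{"answer": "Paris", "confidence": 1.4}')
bad = parse_answer_or_fallback("Sorry, I cannot answer in JSON.")
print(good)
print(bad)
```

Returning a low-confidence placeholder keeps downstream code, such as the monitor's confidence thresholds, working on every query instead of failing on the first malformed response.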

### 🔄 Next Steps:

- **Day 2**: Advanced RAG implementations and agent architectures
- **Production Deployment**: Scaling pipelines for real-world usage
- **Advanced Retrieval**: Vector databases and semantic search
- **Multi-Modal Integration**: Handling text, images, and other data types

### 💡 Key Insights:

- **Structured prompting** significantly improves output reliability
- **Proper chunking** is crucial for maintaining context and relevance
- **Monitoring and optimization** are essential for production systems
- **LangChain's modularity** enables rapid prototyping and iteration

You now have a solid foundation in LangChain prompting and output parsing, which the coming exercises build on for more advanced LLM applications!