# Comprehensive Python Notebook for Learning LangChain RAG Systems

## Executive Summary

This notebook is a guided introduction to Retrieval-Augmented Generation (RAG) with LangChain, written for developers with about two years of Python experience. It moves from core concepts to production-ready implementations and concentrates on LangChain-specific patterns rather than general Python.

## Prerequisites

Before running this notebook, install the required packages:

In [3]:
# Install required packages
!pip install langchain langchain-openai langchain-anthropic langchain-community
!pip install langchain-chroma langchain-pinecone langchain-text-splitters
!pip install faiss-cpu chromadb pinecone-client
!pip install beautifulsoup4 pypdf pymupdf
!pip install ragas datasets langsmith
!pip install fastapi uvicorn pydantic
!pip install tenacity python-multipart
!pip install jupyter ipywidgets
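
After installing, it can help to confirm which packages the notebook kernel can actually import. The sketch below is not part of the original notebook: `find_missing` and `REQUIRED_MODULES` are hypothetical names, and the list covers only some of the packages above. Note that the import name can differ from the pip name (for example `bs4` for beautifulsoup4 and `fitz` for pymupdf).

```python
from importlib.util import find_spec

# Top-level import names for some of the packages installed above.
# These are import names, not pip distribution names.
REQUIRED_MODULES = [
    "langchain", "langchain_openai", "langchain_community",
    "langchain_chroma", "faiss", "chromadb", "bs4", "pypdf", "fitz",
]

def find_missing(modules):
    """Return the module names the current interpreter cannot locate."""
    return [name for name in modules if find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing or "none")
```

An empty result suggests the installs landed in the same environment the kernel is running in; any names listed need to be installed there.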

Note: on Debian-based systems where the OS manages Python, pip may stop with `error: externally-managed-environment`. In that case, create a virtual environment with `python3 -m venv path/to/venv` (this requires `python3-full`), install the packages with `path/to/venv/bin/pip`, and run the notebook with `path/to/venv/bin/python`. See /usr/share/doc/python3.12/README.venv for more information.

# Part 1: LangChain RAG Architecture and Components

## 1.1 Document Loaders for Various Formats

In [None]:
import os
from langchain_community.document_loaders import (
    TextLoader, PyPDFLoader, WebBaseLoader, CSVLoader, 
    JSONLoader, DirectoryLoader, UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader, UnstructuredPowerPointLoader
)
from langchain_core.documents import Document
from pathlib import Path
import bs4

# Basic Text Loader
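def load_text_files(file_path: str):
    """Load a plain-text file as a list of Document objects"""
    # Pass the encoding explicitly; the platform default is not always UTF-8
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()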
    
    # Each document has page_content and metadata
    print(f"Content preview: {documents[0].page_content[:100]}")
    print(f"Metadata: {documents[0].metadata}")
    
    return documents

# PDF Loader with page-level splitting
def load_pdf_documents(pdf_path: str, use_pymupdf: bool = False):
    """Load PDF files page by page"""
    if use_pymupdf:
        # Alternative: PyMuPDFLoader for better performance (needs pymupdf)
        from langchain_community.document_loaders import PyMuPDFLoader
        loader = PyMuPDFLoader(pdf_path)
    else:
        # PyPDFLoader for basic needs
        loader = PyPDFLoader(pdf_path)
    
    # Load once with the selected loader instead of parsing the file twice
    return loader.load()
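
As the comment in `load_text_files` notes, each loaded document pairs `page_content` with a `metadata` dictionary. The sketch below mimics that shape with a plain dataclass so it runs without LangChain installed. `SimpleDocument` and `load_text_as_documents` are hypothetical stand-ins, not LangChain classes; the `source` key mirrors what `TextLoader` records in its metadata.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class SimpleDocument:
    """Stand-in for a loaded document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_text_as_documents(file_path):
    """Read one UTF-8 text file and wrap it in a single-item list."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": str(file_path)})]

# Write a small sample file, then load it the same way a loader would.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.txt"
    sample.write_text("RAG combines retrieval with generation.", encoding="utf-8")
    docs = load_text_as_documents(sample)

print(f"Content preview: {docs[0].page_content[:100]}")
print(f"Metadata keys: {sorted(docs[0].metadata)}")
```

Keeping the origin of each document in `metadata` is what later makes it possible to cite or filter retrieved chunks by source.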

In [None]:
# Web Content Loader with custom parsing
def load_web_content(urls: list):
    """Load web content with custom parsing"""
    loader = WebBaseLoader(
        web_paths=urls,
        bs_kwargs=dict(
            parse_only=bs4.SoupStrainer(
                class_=("article-content", "post-content", "main-content")
            )
        )
    )
    documents = loader.load()
    return documents
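
The `parse_only` filter above keeps only elements whose CSS class matches, which drops navigation and footer text before it reaches the index. To show the effect without a network call or BeautifulSoup, the sketch below swaps in the standard library's `html.parser`; it is an illustration of the idea rather than how `WebBaseLoader` works internally. `ClassFilterParser` and `extract_text` are hypothetical names, and the depth counter assumes well-formed HTML.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassFilterParser(HTMLParser):
    """Collect text found inside elements that carry one of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Start counting at a matching element, then track every nested tag.
        if self.depth > 0 or classes & self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, wanted_classes):
    """Return the text fragments inside elements with a wanted class."""
    parser = ClassFilterParser(wanted_classes)
    parser.feed(html)
    return parser.chunks

page = (
    '<html><body><nav>Menu</nav>'
    '<div class="article-content"><p>Hello <b>world</b></p><br></div>'
    '<footer>Bye</footer></body></html>'
)
print(extract_text(page, ("article-content", "post-content", "main-content")))
```

Only the text inside the `article-content` div survives; the `nav` and `footer` text is discarded.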

In [None]:
# Universal Document Processor
class UniversalDocumentProcessor:
    """Process multiple document formats with metadata extraction"""
    
    def __init__(self):
        self.supported_loaders = {
            '.pdf': PyPDFLoader,
            '.txt': TextLoader,
            '.csv': CSVLoader,
            '.md': UnstructuredMarkdownLoader,
            '.docx': UnstructuredWordDocumentLoader,
            '.pptx': UnstructuredPowerPointLoader,
        }
    
    def process_file(self, file_path: str, custom_metadata: dict | None = None):
        """Process single file with comprehensive metadata extraction"""
        file_path = Path(file_path)
        file_extension = file_path.suffix.lower()
        
        if file_extension not in self.supported_loaders:
            raise ValueError(f"Unsupported file type: {file_extension}")
        
        loader_class = self.supported_loaders[file_extension]
        loader = loader_class(str(file_path))
        documents = loader.load()
        
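        # Add file-level metadata (stat the file once, outside the loop)
        file_stat = file_path.stat()
        for doc in documents:
            doc.metadata.update({
                'file_name': file_path.name,
                'file_path': str(file_path),
                'file_extension': file_extension,
                'file_size': file_stat.st_size,
                # Caveat: st_ctime is creation time on Windows but the
                # last metadata-change time on Unix
                'created_at': file_stat.st_ctime,
                'modified_at': file_stat.st_mtime,
            })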
            
            if custom_metadata:
                doc.metadata.update(custom_metadata)
            
            # Add content analysis metadata
            doc.metadata.update(self._analyze_content(doc.page_content))
        
        return documents
    
    def _analyze_content(self, content: str) -> dict:
        """Analyze document content for additional metadata"""
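        words = content.split()
        # Drop empty fragments so a trailing period does not add a phantom sentence
        sentences = [s for s in content.split('.') if s.strip()]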
        
        return {
            'word_count': len(words),
            'sentence_count': len(sentences),
            'character_count': len(content),
            'avg_words_per_sentence': len(words) / max(len(sentences), 1),
            'has_code': any(keyword in content.lower() for keyword in ['def ', 'class ', 'import ']),
            'has_urls': 'http' in content.lower(),
        }

# Test the document processor
processor = UniversalDocumentProcessor()
print("Document processor initialized successfully!")
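
The file-level metadata step in `process_file` can be tried on its own, without any loader. The sketch below is a standalone version under a few assumptions: `file_metadata` is a hypothetical helper, and storing the modification time as an ISO 8601 UTC string (instead of the raw float used above) is a design choice added here for readability, not something the original class does.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def file_metadata(path):
    """Collect file-system metadata with a single stat() call."""
    path = Path(path)
    stat = path.stat()
    return {
        "file_name": path.name,
        "file_extension": path.suffix.lower(),
        "file_size": stat.st_size,
        # Convert the epoch float to a timezone-aware ISO 8601 string
        "modified_at": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

# Create a throwaway file so the example does not depend on local paths.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Notes.TXT"
    sample.write_text("hello", encoding="utf-8")
    meta = file_metadata(sample)

print(meta)
```

Lower-casing the extension matches how `process_file` looks up the loader, so `Notes.TXT` and `notes.txt` are treated the same way.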