This directory contains comprehensive examples demonstrating how to use the LangChain ScrapeGraph tools for web scraping and data extraction.
Before running these examples, make sure you have:
- API Key: Set your ScrapeGraph AI API key as an environment variable:
export SGAI_API_KEY="your-api-key-here"
- Dependencies: Install the required packages:
pip install langchain-scrapegraph scrapegraph-py
For the agent example, you'll also need:
pip install langchain-openai langchain python-dotenv
Purpose: Demonstrates how to integrate ScrapeGraph tools with a LangChain agent for conversational web scraping.
Features:
- Uses OpenAI's function calling capabilities
- Combines multiple tools (SmartScraper, GetCredits, SearchScraper)
- Provides conversational interface for web scraping tasks
- Includes verbose output to show agent reasoning
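The core wiring might look like the following sketch. It assumes the tool classes are exported from `langchain_scrapegraph.tools` and uses LangChain's generic tool-calling agent helpers; the actual script may differ in details such as model choice and prompt wording.

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_scrapegraph.tools import (  # assumed export names
    GetCreditsTool,
    SearchScraperTool,
    SmartScraperTool,
)

llm = ChatOpenAI(model="gpt-4o-mini")  # any tool-calling OpenAI model works
tools = [SmartScraperTool(), GetCreditsTool(), SearchScraperTool()]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful web-scraping assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),  # filled in by the agent at runtime
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)  # verbose shows agent reasoning

print(executor.invoke({"input": "How many credits do I have left?"})["output"])
```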
Usage:
python agent_example.py
Purpose: Check your remaining API credits.
Features:
- Simple API credit checking
- No parameters required
- Returns current credit balance
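A minimal credit check might look like this sketch (assuming the class is exported as `GetCreditsTool`):

```python
from langchain_scrapegraph.tools import GetCreditsTool  # assumed export name

tool = GetCreditsTool()  # picks up SGAI_API_KEY from the environment
print(tool.invoke({}))   # no parameters required
```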
Usage:
python get_credits_tool.py
Purpose: Convert website content to clean markdown format.
Features:
- Converts HTML to markdown
- Cleans and structures content
- Preserves formatting and links
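A minimal conversion sketch, assuming a `MarkdownifyTool` export and the website_url parameter documented below:

```python
from langchain_scrapegraph.tools import MarkdownifyTool  # assumed export name

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})
print(markdown)
```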
Usage:
python markdownify_tool.py
Purpose: Extract specific information from a single webpage using AI.
Features:
- Target specific websites
- Use natural language prompts
- Extract structured data
- Support for both URL and HTML content
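A minimal extraction sketch, assuming a `SmartScraperTool` export and the website_url/user_prompt parameters listed in the Tool Parameters section below:

```python
from langchain_scrapegraph.tools import SmartScraperTool  # assumed export name

tool = SmartScraperTool()
result = tool.invoke({
    "website_url": "https://example.com",
    "user_prompt": "Extract the page title and a one-sentence description",
})
print(result)
```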
Usage:
python smartscraper_tool.py
Purpose: Search the web and extract information based on a query.
Features:
- Web search capabilities
- AI-powered content extraction
- No specific URL required
- Returns relevant information from multiple sources
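A minimal search sketch (assuming a `SearchScraperTool` export; note that no URL is required):

```python
from langchain_scrapegraph.tools import SearchScraperTool  # assumed export name

tool = SearchScraperTool()
result = tool.invoke({
    "user_prompt": "What are the main features of the latest LangChain release?",
})
print(result)
```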
Usage:
python searchscraper_tool.py
Purpose: Crawl multiple pages of a website and extract comprehensive information.
Features:
- Multi-page crawling
- Configurable depth and page limits
- Domain restriction options
- Website caching for efficiency
- Extract information from multiple related pages
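A minimal crawl sketch; the `SmartCrawlerTool` name also appears in the Best Practices section, and the parameters follow the Tool Parameters section below:

```python
from langchain_scrapegraph.tools import SmartCrawlerTool  # assumed export name

tool = SmartCrawlerTool()
result = tool.invoke({
    "url": "https://example.com",
    "prompt": "Summarize the company's products and pricing",
    "cache_website": True,     # cache pages for efficiency
    "depth": 2,                # maximum crawling depth
    "max_pages": 5,            # maximum pages to crawl
    "same_domain_only": True,  # stay on the starting domain
})
print(result)
```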
Usage:
python smartcrawler_tool.py
All tools support structured output using Pydantic models. These examples show how to define schemas for consistent, typed responses.
Purpose: Extract product information with structured output.
Schema Features:
- Product name and description
- Feature lists with structured details
- Pricing information with multiple plans
- Reference URLs for verification
Key Schema Classes:
- Feature: Product feature details
- PricingPlan: Pricing tier information
- ProductInfo: Complete product information
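A sketch of how such schemas might be declared with Pydantic (field names here are illustrative; see the example file for the exact definitions):

```python
from typing import List
from pydantic import BaseModel, Field

class Feature(BaseModel):
    name: str = Field(description="Feature name")
    details: str = Field(description="What the feature does")

class PricingPlan(BaseModel):
    name: str = Field(description="Plan name, e.g. Free or Pro")
    price: str = Field(description="Price of the plan")

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Product description")
    features: List[Feature] = Field(description="List of product features")
    pricing_plans: List[PricingPlan] = Field(description="Available pricing tiers")
    reference_urls: List[str] = Field(description="URLs used for verification")
```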
Purpose: Extract website information with structured output.
Schema Features:
- Website title and description
- URL extraction from page
- Support for both URL and HTML input
Key Schema Classes:
- WebsiteInfo: Complete website information structure
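A possible shape for this schema (illustrative field names):

```python
from typing import List
from pydantic import BaseModel, Field

class WebsiteInfo(BaseModel):
    title: str = Field(description="Website title")
    description: str = Field(description="Short description of the site")
    urls: List[str] = Field(description="URLs extracted from the page")
```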
Purpose: Extract company information from multiple pages with structured output.
Schema Features:
- Company description
- Privacy policy content
- Terms of service content
- Multi-page content aggregation
Key Schema Classes:
- CompanyInfo: Company information structure
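A sketch combining such a schema with a crawl, assuming the schema is passed via the llm_output_schema parameter at invocation time (field and class names are illustrative):

```python
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartCrawlerTool  # assumed export name

class CompanyInfo(BaseModel):
    description: str = Field(description="What the company does")
    privacy_policy: str = Field(description="Privacy policy content")
    terms_of_service: str = Field(description="Terms of service content")

tool = SmartCrawlerTool()
result = tool.invoke({
    "url": "https://example.com",
    "prompt": "Extract the company description, privacy policy and terms of service",
    "llm_output_schema": CompanyInfo,  # assumed invoke-time parameter
    "max_pages": 5,
    "same_domain_only": True,
})
print(result)
```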
SmartScraperTool:
- website_url: Target website URL
- user_prompt: What information to extract
- website_html: (Optional) HTML content instead of URL
- llm_output_schema: (Optional) Pydantic model for structured output
SearchScraperTool:
- user_prompt: Search query and extraction instructions
- llm_output_schema: (Optional) Pydantic model for structured output
SmartCrawlerTool:
- url: Starting URL for crawling
- prompt: What information to extract
- cache_website: (Optional) Cache pages for efficiency
- depth: (Optional) Maximum crawling depth
- max_pages: (Optional) Maximum pages to crawl
- same_domain_only: (Optional) Restrict to same domain
- llm_output_schema: (Optional) Pydantic model for structured output
GetCreditsTool:
- No parameters required
MarkdownifyTool:
- website_url: Target website URL
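The website_html parameter suggests that raw HTML can be supplied in place of a URL; a hypothetical sketch:

```python
from langchain_scrapegraph.tools import SmartScraperTool  # assumed export name

html = "<html><body><h1>ACME Corp</h1><p>We build rockets.</p></body></html>"
tool = SmartScraperTool()
result = tool.invoke({
    "website_html": html,  # HTML content instead of a URL (assumed usage)
    "user_prompt": "What does this company do?",
})
print(result)
```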
- Error Handling: Always wrap tool calls in try/except blocks for production use (see the sketch after this list)
- Rate Limiting: Be mindful of API rate limits when making multiple requests
- Caching: Use website caching for SmartCrawlerTool when processing multiple pages
- Structured Output: Use Pydantic schemas for consistent, typed responses
- Logging: Enable logging to debug and monitor tool performance
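For example, a guarded call might look like the following sketch; in real code, narrow the except clause to the SDK's specific exception types:

```python
from langchain_scrapegraph.tools import SmartScraperTool  # assumed export name

tool = SmartScraperTool()
try:
    result = tool.invoke({
        "website_url": "https://example.com",
        "user_prompt": "Extract the page title",
    })
except Exception as exc:  # replace with the SDK's exception classes in production
    print(f"Scraping failed: {exc}")
    result = None
```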
- Authentication Issues: Ensure SGAI_API_KEY is properly set
- Import Errors: Install all required dependencies
- Timeout Issues: Increase timeout values for complex crawling operations
- Rate Limiting: Implement delays between requests if hitting rate limits