---
sidebar_label: ScrapingAnt
---

# ScrapingAnt
## Overview
[ScrapingAnt](https://scrapingant.com/) is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data as LLM-ready Markdown.

This integration uses only the Markdown extraction feature. If you need other ScrapingAnt features that are not yet implemented here, [reach out to us](mailto:support@scrapingant.com).

### Integration details

| Class                                                                                                                                                    | Package                                                                                        | Local | Serializable | JS support |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------|:-----:|:------------:|:----------:|
| [ScrapingAntLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.scrapingant.ScrapingAntLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) |   ❌   |      ❌       |     ❌      |

### Loader features
|      Source       | Document Lazy Loading | Async Support |
|:-----------------:|:---------------------:|:-------------:|
| ScrapingAntLoader |           ✅           |       ❌       |


## Setup

Install the ScrapingAnt Python SDK and the required LangChain packages using pip:
```shell
pip install scrapingant-client langchain langchain-community
```

In [9]:
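from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def get_langchain_docs(base_url="https://python.langchain.com/docs/", max_pages=50):
    """Crawl LangChain doc pages breadth-first and return the URLs fetched."""
    visited = set()
    crawled = []
    queue = deque([base_url])

    # Cap the crawl with max_pages: following every doc link without a
    # limit may not finish in a reasonable time
    while queue and len(visited) < max_pages:
        page_url = queue.popleft()
        if page_url in visited:
            continue
        visited.add(page_url)

        try:
            response = requests.get(page_url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Error crawling {page_url}: {e}")
            continue

        crawled.append(page_url)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments so anchors on the
            # same page are not queued as separate pages
            url, _ = urldefrag(urljoin(page_url, link["href"]))

            # Only follow LangChain doc URLs
            if urlparse(url).netloc == "python.langchain.com" and "/docs/" in url:
                if url not in visited:
                    queue.append(url)

    return crawled


# Usage
doc_urls = get_langchain_docs()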

## Instantiation

In [4]:
from langchain_community.document_loaders import ScrapingAntLoader

scrapingant_loader = ScrapingAntLoader(
    ["https://scrapingant.com/", "https://example.com/"],  # List of URLs to scrape
    api_key="<YOUR_SCRAPINGANT_TOKEN>",  # Get your API key from https://scrapingant.com/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)
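
Rather than pasting the key into the notebook, it can be read from the environment. The following is a minimal sketch, not something the loader does on its own: the helper `get_scrapingant_api_key` and the variable name `SCRAPINGANT_API_KEY` are assumptions made for this example.

```python
import os
from getpass import getpass


def get_scrapingant_api_key(env_var: str = "SCRAPINGANT_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is unset.

    Both this helper and the default variable name are hypothetical; the
    loader itself only receives the resulting string.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Prompt without echoing the key into the notebook output
        api_key = getpass(f"Enter your ScrapingAnt API key ({env_var}): ")
    return api_key
```

The returned value can then be passed as `api_key=get_scrapingant_api_key()` when constructing `ScrapingAntLoader`.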

`ScrapingAntLoader` also accepts a `scrape_config` dict for customizing the scrape request. Because the loader is built on the [ScrapingAnt Python SDK](https://github.com/ScrapingAnt/scrapingant-client-python), you can pass any of its [common arguments](https://github.com/ScrapingAnt/scrapingant-client-python/tree/master?tab=readme-ov-file#common-arguments) in `scrape_config`.

In [6]:
from langchain_community.document_loaders import ScrapingAntLoader

scrapingant_config = {
    "browser": True,  # Enable browser rendering with a cloud browser
    "proxy_type": "datacenter",  # Select a proxy type (datacenter or residential)
    "proxy_country": "us",  # Select a proxy location
}

scrapingant_additional_config_loader = ScrapingAntLoader(
    ["https://scrapingant.com/"],
    api_key="<YOUR_SCRAPINGANT_TOKEN>",  # Get your API key from https://scrapingant.com/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapingant_config,  # Pass the scrape_config object
)
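
Since `scrape_config` is a plain dict, a small helper can catch typos before any request is sent. The sketch below is illustrative only: `build_scrape_config` is a hypothetical helper, not part of LangChain or the ScrapingAnt SDK, and it covers just the three keys used in the cell above, with the two proxy types named in its comment.

```python
from typing import Any, Dict, Optional

# Proxy types mentioned in the example config above
ALLOWED_PROXY_TYPES = ("datacenter", "residential")


def build_scrape_config(
    browser: bool = True,
    proxy_type: str = "datacenter",
    proxy_country: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a scrape_config dict, rejecting unknown proxy types early."""
    if proxy_type not in ALLOWED_PROXY_TYPES:
        raise ValueError(
            f"proxy_type must be one of {ALLOWED_PROXY_TYPES}, got {proxy_type!r}"
        )
    config: Dict[str, Any] = {"browser": browser, "proxy_type": proxy_type}
    if proxy_country is not None:
        # Only include the country when one was requested
        config["proxy_country"] = proxy_country
    return config


residential_config = build_scrape_config(proxy_type="residential", proxy_country="us")
print(residential_config)
```

The resulting dict can be passed to the loader through `scrape_config=residential_config`, exactly like the hand-written dict.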

## Load

Use the `load` method to scrape the web pages and get the extracted markdown content.


In [7]:
# Load documents from URLs as markdown
documents = scrapingant_loader.load()

print(documents)

[Document(metadata={'url': 'https://scrapingant.com/'}, page_content='![](images/loader.svg)\n\n[![](images/ScrapingAnt-1.svg)](/) Features Pricing\n\nServices\n\n[Web Scraping API](/") [LLM-ready data extraction](/llm-ready-data-extraction)\n[AI data scraping](/ai-data-scraper) [Residential Proxy](/residential-proxies)\n[Datacenter Proxy](/datacenter-proxies)\n\n[Blog](https://scrapingant.com/blog/)\n\nDocumentation\n\n[Web Scraping API](https://docs.scrapingant.com)\n[Proxies](https://proxydocs.scrapingant.com)\n\nContact Us\n\n[Sign In](https://app.scrapingant.com/login)\n\n![](images/icon-menu.svg)\n\n![](images/Capterra-Rating.png)\n\n# Enterprise-Grade Scraping API.  \nAnt Sized Pricing.\n\n## Get the mission-critical speed, reliability, and features you need at a\nfraction of the cost!  \n\nGot Questions?  \n(get expert advice)\n\n[ Try Our Free Plan (10,000 API Credits) ](https://app.scrapingant.com/signup)\n\n![](images/lines-10-white.svg)![](images/lines-12-white.svg)\n\n### 
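
Each returned `Document` carries the Markdown in `page_content` and, as the output above shows, the source URL in `metadata['url']`. The following sketch gives a quick overview of what was scraped. `summarize_documents` is a hypothetical helper, and the stand-in object below only mimics those two attributes so the example runs without an API key.

```python
from types import SimpleNamespace


def summarize_documents(documents, preview_chars=80):
    """Return one small dict per document: URL, length, and a short preview."""
    summaries = []
    for doc in documents:
        text = doc.page_content
        summaries.append(
            {
                "url": doc.metadata.get("url", "<unknown>"),
                "chars": len(text),
                # Collapse whitespace so the preview fits on one line
                "preview": " ".join(text.split())[:preview_chars],
            }
        )
    return summaries


# Stand-in for a loaded Document (assumption: only these two attributes are used)
sample_documents = [
    SimpleNamespace(
        metadata={"url": "https://example.com/"},
        page_content="# Example Domain\n\nThis domain is for use in examples.",
    )
]

for summary in summarize_documents(sample_documents):
    print(summary["url"], summary["chars"], summary["preview"])
```

With real results, the same helper would be called as `summarize_documents(documents)`.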

## Lazy Load

Use the `lazy_load` method to scrape the web pages lazily and get the extracted Markdown one document at a time.

In [8]:
# Lazy load documents from URLs as markdown
lazy_documents = scrapingant_loader.lazy_load()

for document in lazy_documents:
    print(document)

page_content='![](images/loader.svg)

[![](images/ScrapingAnt-1.svg)](/) Features Pricing

Services

[Web Scraping API](/") [LLM-ready data extraction](/llm-ready-data-extraction)
[AI data scraping](/ai-data-scraper) [Residential Proxy](/residential-proxies)
[Datacenter Proxy](/datacenter-proxies)

[Blog](https://scrapingant.com/blog/)

Documentation

[Web Scraping API](https://docs.scrapingant.com)
[Proxies](https://proxydocs.scrapingant.com)

Contact Us

[Sign In](https://app.scrapingant.com/login)

![](images/icon-menu.svg)

![](images/Capterra-Rating.png)

# Enterprise-Grade Scraping API.  
Ant Sized Pricing.

## Get the mission-critical speed, reliability, and features you need at a
fraction of the cost!  

Got Questions?  
(get expert advice)

[ Try Our Free Plan (10,000 API Credits) ](https://app.scrapingant.com/signup)

![](images/lines-10-white.svg)![](images/lines-12-white.svg)

### Proudly scaling with us

![](images/_2cd6c6d09d261d19_281d72aa098ecca8.png)![](images/_bb8ca9c
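
Because `lazy_load` returns an iterator, processing can stop early without consuming the rest of it. The following sketch uses `itertools.islice` to show this. `fake_lazy_load` is a stand-in generator invented for this example so it runs without network access, and whether the real loader defers each request until iteration is an assumption here.

```python
from itertools import islice


def fake_lazy_load(urls, fetched):
    """Stand-in for a lazy loader: records each URL only when it is yielded."""
    for url in urls:
        fetched.append(url)
        yield {"url": url, "page_content": f"# Content of {url}"}


def take_first(documents, limit):
    """Consume at most `limit` items from an iterator of documents."""
    return list(islice(documents, limit))


fetched_urls = []
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

first_two = take_first(fake_lazy_load(urls, fetched_urls), 2)

# Only the URLs that were actually consumed have been "fetched"
print(len(first_two), fetched_urls)
```

With the real loader, the same helper would be called as `take_first(scrapingant_loader.lazy_load(), 2)`.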

## API reference

This loader is based on the [ScrapingAnt Python SDK](https://docs.scrapingant.com/python-client). For more configuration options, see the [common arguments](https://github.com/ScrapingAnt/scrapingant-client-python/tree/master?tab=readme-ov-file#common-arguments).