Test drive ready-to-go beta app
Webtap is a Python library that enables reliable, AI-driven web scraping. It leverages Large Language Models (LLMs) to orchestrate established scraping libraries, such as Apify, for efficient data extraction from the web.
Modern websites use measures like captchas and IP blocking to hinder automated data extraction. With frequent changes in their layout and content, these defenses can challenge AI-based scrapers, resulting in less reliable data collection.
Webtap combines Apify's specialized scraping libraries with Large Language Models (LLMs) for smart web navigation and dynamic content adaptation, streamlining data extraction.
The code below shows how to use Webtap to find up to 15 'history'-themed books in Italian using Apify Proxy.
tap_manager = TapManager()
tap = tap_manager.get_tap( "atg_epctex_gutenberg_scraper" ) # Project Gutenberg: a collection of 70,000 free ebooks
print(tap.get_retriever_and_run("Search for 'history', maximum 15 items, in Italian language, using Apify Proxy"))
# The above print statement will output real time scraped data from project gutenberg in the following format:
# {
# 'retriever_result': ApifyRetrieverResult(can_fulfill=True, explanation='...', retriever=...),
# 'data': '[{"author": "Albertazzi, Adolfo, 1865-1924", "title": "Novelle umoristiche by Adolfo Albertazzi", ...}]'
# }
Experience Webtap's beta version directly on webtap.ai with no installation neededβsimply sign up to begin scraping for free.
If you have built (or are planning to build) an Apify actor and want to enhance its capabilities with AI, Webtap is the perfect tool for you. By integrating Webtap into your actor, you can make it possible to access your actor's data using AI-based natural language queries. Follow instructions to create a new Apify Tap to learn how to do this.
Webtap has been developed and tested with Python 3.11 Make sure that your python version is >= 3.11
- Clone the repo
git clone https://github.com/webtap-ai/webtap.git
cd webtap
- It is recommended, though not mandatory, to create a virtual environment for your project.
python3.11 -m venv venv
source venv/bin/activate
- Setup openai key and Apify Key You must have an openai key and a Apify key in order to use the project. You can get an openai key by signing up at https://platform.openai.com/ and an Apify key by signing up at https://apify.com/ Set the following environment variables in your shell ( Add these to your .bashrc or .bash_profile to make them permanent):
export OPENAI_API_KEY="{your api key}"
export APIFY_API_TOKEN="{your api key}"
Optionally setup LangSmith (signup at https://langsmith.com) environment variables if you want to use LangSmith for tracing:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="{Your dev environment project}"
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY={your api_key}
- Install requirements
python3.11 setup.py install
After installation you can run an example using the following command:
python3.11 -m examples.apify_tap_example
To use Webtap in your project, you can install it directly from the Python Package Index (PyPI) using pip:
pip install webtap
After installing the library, follow the steps to set up the environment variables and run the examples as described in "Setup openai key and Apify Key section (2), (3) and (4)".
If you are managing your project's dependencies through a requirements.txt
file, you can add Webtap to it:
webtap
After adding the line to your requirements.txt
, you can install all your dependencies with:
pip install -r requirements.txt
Webtap supports out of the box a selection of the top 40 Apify actors, enabling immediate use. You can also define your own Tap following one of the three methods below:
The Tap Generator is a tool that automatically generates a new Apify Tap based on a JSON definition. It often works; when it doesn't work, it provides a good starting point for manual editing. To do so, you can use the following python code: (This feature is exclusive to the full GitHub repository and is not included in the PyPI package distribution.)
# make sure you Webtap project is setup(venv is activated, requirements are installed, environment variables are set and you are in the webtap directory)
python -m examples.tap_generator_example epctex/gutenberg-scraper
Check out the code in the examples/tap_generator_example.py
file to see how to use the Tap Generator and how to customize the generated Tap.
In order to see how to create a standard new ApifyTap (using json definition) see GUIDE.md
In order to see how to create a standard new ApifyTap (using json definition) see CUSTOM_GUIDE.md
The Webtap Python library is currently tailored to work with Apify. It readily supports a selection of the top 40 Apify actors, enabling immediate use. Additionally, the Universal Scraper is on offer as a versatile LLM-based tool capable of scraping any website, albeit with varying efficiency compared to a bespoke Apify actor.
- Utilize any of the 40 immediately available Apify actors for website data scraping.
- Employ the Universal Scraper for a generic LLM-based approach to scrape any website.
- Access your custom Apify actors using intuitive natural language queries.
- Craft a tap tailored to your specific scraping requirements with the Tap Generator, or do it manually for more control.
- Integration with the full suite of public Apify actors, totaling over 1500, is under consideration for future updates.
- We are exploring the possibility of broadening our scraping toolset to include additional libraries such as Scrapy.
- Enhancements to the Universal Scraper are continuously evaluated, aiming to enhance its efficiency and dependability.
This project would not have been possible without the following open-source libraries and services:
- Apify SDK for web crawling and automation.
- LangChain for integrating language models into applications.
- OpenAI for providing access to powerful AI models.
- LangSmith for their tracing and debugging tools.
- Pydantic for data validation and settings management using Python type annotations.
- tiktoken for handling TikTok tokens.
- apify-client for interacting with the Apify API.
- demjson for encoding, decoding, and linting JSON.
- black for formatting Python code.
- html2text for converting HTML into clean, easy-to-read plain ASCII text.
We are grateful to the developers of these libraries and services for their hard work and dedication!
This project was initially developed and is owned by Webtap Technologies LLC.
All contributions to this project are made on a voluntary basis and do not imply any legal responsibility for individual contributors, including but not limited to founding contributors. Legal responsibility for this project and its use rests solpely with Webtap Technologies LLC, as stated in the license.
Webtap Technologies LLC
30 N Gould ST STE R
Sheridan, WY 82801
EIN: 32-0763057