This project is a web scraper that uses the Jina AI Reader API to fetch and save content from web pages. It supports both single-page and multi-page scraping, with the ability to traverse links within allowed domains.
- Single-page and multi-page scraping
- Concurrent processing of URLs
- Rate limiting to respect API usage guidelines
- Extraction of page titles and links
- Saving content as Markdown files
- Configurable allowed domains for multi-page scraping
- Python 3.10+ (the pinned scipy==1.14.0 in requirements.txt does not support earlier versions)
- Jina AI Reader API key (obtain from https://jina.ai/reader/)
- Clone this repository:
    git clone https://github.com/yourusername/jina-reader-api-scraper.git
    cd jina-reader-api-scraper
- Install the required dependencies:
    pip install -r requirements.txt
- Create a .env file in the project root directory with the following content:
    BASE_URL=<Your Jina AI Reader API Base URL>
    API_KEY=<Your Jina AI Reader API Key>
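To illustrate how the two .env values might be used, here is a minimal sketch that builds (but does not send) a Reader request. It assumes the script has already loaded .env into the process environment (for example with python-dotenv's `load_dotenv()`), that `BASE_URL` is a prefix such as `https://r.jina.ai/` to which the target URL is appended, and that the key is sent as a bearer token. The function name `build_reader_request` is hypothetical and is not taken from jina.py.

```python
import os
import urllib.request


def build_reader_request(target_url: str) -> urllib.request.Request:
    """Build, without sending, a Reader request for target_url (hypothetical helper)."""
    base_url = os.environ["BASE_URL"]  # assumed to look like "https://r.jina.ai/"
    api_key = os.environ["API_KEY"]
    # Assumption: the target URL is appended to the base URL and the key is
    # passed as a bearer token in the Authorization header.
    return urllib.request.Request(
        base_url.rstrip("/") + "/" + target_url,
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

A missing variable raises `KeyError` immediately, which surfaces a misconfigured .env file before any request is made.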
The project requires the following Python packages (specified in requirements.txt):
certifi==2024.7.4
charset-normalizer==3.3.2
idna==3.7
joblib==1.4.2
numpy==2.0.1
python-dotenv==1.0.1
requests==2.32.3
scikit-learn==1.5.1
scipy==1.14.0
threadpoolctl==3.5.0
urllib3==2.2.2
You can install these requirements using the command:
pip install -r requirements.txt
Run the scraper using the following command:
python jina.py <url> <output_dir> [--allowed_domains domain1 domain2 ...] [--multi_page]
Arguments:
- url: The initial URL to scrape
- output_dir: Directory to save scraped content
- --allowed_domains (optional): List of allowed domains to scrape (default: domain of the initial URL)
- --multi_page (optional): Enable multi-page scraping (default: False)
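The argument list above could be parsed roughly as follows. This is a sketch using the standard `argparse` module, not necessarily how jina.py does it; the function name `parse_args` and the help strings are assumptions.

```python
import argparse
from urllib.parse import urlparse


def parse_args(argv=None):
    """Parse the documented CLI arguments (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Scrape pages via the Jina AI Reader API"
    )
    parser.add_argument("url", help="Initial URL to scrape")
    parser.add_argument("output_dir", help="Directory to save scraped content")
    parser.add_argument(
        "--allowed_domains",
        nargs="+",
        default=None,
        help="Domains the scraper may follow links into",
    )
    parser.add_argument(
        "--multi_page",
        action="store_true",
        help="Enable multi-page scraping",
    )
    args = parser.parse_args(argv)
    # Documented default: fall back to the domain of the initial URL.
    if args.allowed_domains is None:
        args.allowed_domains = [urlparse(args.url).netloc]
    return args
```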
Example:
python jina.py https://example.com/start-page output_folder --allowed_domains example.com blog.example.com --multi_page
You can adjust the following variables in the script to fine-tune the scraper's behavior:
- MAX_WORKERS: Maximum number of concurrent workers (default: 5)
- RATE_LIMIT: Time in seconds between requests (default: 1)
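One way these two settings can interact is sketched below: a thread pool bounds concurrency while a shared limiter spaces out request start times. This is an illustrative sketch, not the script's actual implementation; `RateLimiter`, `process_all`, and the `fetch` callable are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # documented default
RATE_LIMIT = 1   # documented default, seconds between requests


class RateLimiter:
    """Hands out start slots at least `interval` seconds apart across threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next slot under the lock, then sleep outside it so
        # other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.interval
        time.sleep(max(0.0, start - now))


def process_all(urls, fetch, interval=RATE_LIMIT, max_workers=MAX_WORKERS):
    """Call fetch(url) for each URL concurrently, honoring the rate limit."""
    limiter = RateLimiter(interval)

    def task(url):
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input URLs.
        return list(pool.map(task, urls))
```

With this design, raising MAX_WORKERS only helps when individual requests take longer than RATE_LIMIT, since request starts are still spaced by the limiter.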
The scraper saves each scraped page as a separate Markdown file in the specified output directory. The filename is based on the page title, sanitized to remove invalid characters.
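A title-to-filename step might look like the sketch below. The exact character set the script strips is not documented here, so the pattern, the 100-character cap, and the `untitled` fallback are all assumptions, as is the function name `title_to_filename`.

```python
import re


def title_to_filename(title: str, max_length: int = 100) -> str:
    """Turn a page title into a Markdown filename (hypothetical helper)."""
    # Replace characters that are invalid on common filesystems, plus
    # ASCII control characters, with underscores.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Collapse whitespace runs and trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Cap the length, then fall back to a fixed name for empty titles.
    cleaned = cleaned[:max_length].rstrip(" .")
    return (cleaned or "untitled") + ".md"
```

Two pages with the same title would map to the same filename under this scheme, so a real implementation may also need a de-duplication suffix.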
- The scraper respects the rate limit set in the RATE_LIMIT variable to avoid overwhelming the API.
- Multi-page scraping is limited to the domains specified in allowed_domains.
- You need a valid API key from Jina AI Reader to use this scraper.
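The domain restriction can be sketched as a filter over the links extracted from a page. This is an assumed implementation, with a hypothetical function name `filter_links`. It matches hostnames exactly, so a subdomain such as blog.example.com has to be listed explicitly, as in the usage example earlier.

```python
from urllib.parse import urljoin, urlparse


def filter_links(page_url, links, allowed_domains):
    """Resolve links against page_url and keep those on allowed domains."""
    allowed = {domain.lower() for domain in allowed_domains}
    kept = []
    for link in links:
        absolute = urljoin(page_url, link)  # resolves relative links
        parsed = urlparse(absolute)
        # parsed.hostname is lowercased, or None for schemes like mailto:.
        if parsed.scheme in ("http", "https") and parsed.hostname in allowed:
            kept.append(absolute)
    return kept
```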
The scraper prints error messages for failed requests or processing errors but continues with the next URL in the queue.
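The continue-on-error behavior described above can be sketched as a queue loop that logs a failure and moves on. The names `crawl` and `fetch` are hypothetical, and `fetch` is assumed to return a `(content, links)` pair or raise on failure. The real script may structure this differently.

```python
from collections import deque


def crawl(start_url, fetch):
    """Process a URL queue, skipping URLs whose fetch raises an exception."""
    queue = deque([start_url])
    seen = {start_url}
    saved, failed = [], []
    while queue:
        url = queue.popleft()
        try:
            content, links = fetch(url)
        except Exception as exc:
            # Report the problem and continue with the next URL in the queue.
            print(f"Error processing {url}: {exc}")
            failed.append(url)
            continue
        saved.append(url)
        for link in links:
            if link not in seen:  # avoid re-queuing pages already seen
                seen.add(link)
                queue.append(link)
    return saved, failed
```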
Contributions are welcome! Please feel free to submit a Pull Request.
[Specify your license here]