Skip to content

web-scraping-tools/email-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Email Scraper – Open-Source Email Extraction Tool for Websites

License: MIT TypeScript Node.js

Email Scraper is an open-source TypeScript library and CLI tool for extracting email addresses from webpages and entire websites. This email scraper helps developers automate email collection by providing ready-to-use modules that support both HTTP requests and headless browsers (Playwright/Puppeteer) for comprehensive web scraping.

Whether you need to scrape emails from a single contact page or crawl an entire website to collect email addresses, this tool provides a simple, reliable solution for email extraction and web scraping tasks.

Features

  • Email Extraction from Single Pages: Scrape email addresses from any webpage URL using HTTP requests or headless browsers
  • Website Crawling: Automatically crawl entire websites to extract emails from multiple pages with configurable depth and page limits
  • Multiple Scraping Methods: Choose between fast HTTP requests or full browser automation (Playwright/Puppeteer) for JavaScript-heavy sites
  • TypeScript Support: Fully typed API for better developer experience and IDE autocomplete
  • CLI Tool: Command-line interface for quick email scraping without writing code
  • Library API: Import and use in your own Node.js projects for custom email scraping workflows
  • Flexible Configuration: Customize timeout, crawl depth, page limits, and cross-domain crawling options
  • Email Normalization: Automatically normalizes and deduplicates extracted email addresses

Installation

Setup

  1. Clone this repository:
git clone https://github.com/web-scraping-apis/email-scraper
cd email-scraper
  1. Install dependencies:
npm install
  1. Build the project:
npm run build

Browser Support (Optional)

For scraping JavaScript-rendered pages, install one of the supported browsers:

Using Playwright:

npm install playwright
npx playwright install chromium

Using Puppeteer:

npm install puppeteer

Quick Start

CLI Usage – Scrape Emails from a Single Webpage

The email scraper CLI makes it easy to extract emails from any webpage:

# Using HTTP requests (fast, for static pages)
npm run dev page https://example.com/contact

# Using headless browser (for JavaScript-rendered pages)
npm run dev page https://example.com/contact --browser

# After building, use the compiled version:
npm run scrape page https://example.com/contact

CLI Options for Single Page Scraping:

  • -t, --timeout <ms> - Request timeout in milliseconds (default: 10000)
  • -b, --browser - Use headless browser (Playwright/Puppeteer) instead of HTTP requests
  • --wait-until <event> - Wait until event: load, domcontentloaded, or networkidle (default: load, browser mode only)

CLI Usage – Scrape Emails from an Entire Website

Crawl and extract emails from multiple pages on a website:

# Using HTTP requests (default)
npm run dev website https://example.com

# Using headless browser
npm run dev website https://example.com --browser

# After building:
npm run scrape website https://example.com

CLI Options for Website Crawling:

  • -d, --max-depth <number> - Maximum crawl depth (default: 3)
  • -p, --max-pages <number> - Maximum number of pages to crawl (default: 50)
  • --cross-domain - Allow crawling to different domains (default: false)
  • -b, --browser - Use headless browser (Playwright/Puppeteer) instead of HTTP requests
  • --wait-until <event> - Wait until event: load, domcontentloaded, or networkidle (default: load, browser mode only)

Library Usage

Import and use the email scraper in your own TypeScript/JavaScript projects:

Basic HTTP Email Scraping

Extract emails from a webpage using simple HTTP requests:

import { scrapeEmailsFromUrl } from 'email-scraper';

const emails = await scrapeEmailsFromUrl('https://example.com/contact');
console.log(Array.from(emails));
// Output: ['contact@example.com', 'info@example.com']

Email Scraping with Playwright

For JavaScript-rendered pages, use Playwright with the email scraper:

import { chromium } from 'playwright';
import { scrapeEmailsFromPage } from 'email-scraper';

const browser = await chromium.launch();
const page = await browser.newPage();
const emails = await scrapeEmailsFromPage(page, 'https://example.com/contact');
await browser.close();

console.log(Array.from(emails));

Email Scraping with Puppeteer

Alternatively, use Puppeteer for browser automation:

import puppeteer from 'puppeteer';
import { scrapeEmailsFromPage } from 'email-scraper';

const browser = await puppeteer.launch();
const page = await browser.newPage();
const emails = await scrapeEmailsFromPage(page, 'https://example.com/contact');
await browser.close();

Crawling an Entire Website for Emails

Use the website crawler to extract emails from multiple pages:

import { scrapeEmailsFromWebsite } from 'email-scraper';

const emails = await scrapeEmailsFromWebsite('https://example.com', {
  maxDepth: 3,
  maxPages: 50,
  sameDomainOnly: true,
  useBrowser: false, // Set to true to use a browser
  browser: browserInstance, // Optional: pass browser instance
});

console.log(`Found ${emails.size} unique emails`);

Using the EmailScraper Class

For a more object-oriented approach, use the EmailScraper class:

import { EmailScraper } from 'email-scraper';
import { chromium } from 'playwright';

// With browser
const browser = await chromium.launch();
const scraper = new EmailScraper(browser);
const emails = await scraper.scrapeFromWebsite('https://example.com');
await browser.close();

// Without browser (HTTP only)
const scraper = new EmailScraper();
const emails = await scraper.scrapeFromUrl('https://example.com/contact');

Note: After building (npm run build), you can import from ./dist/index instead of ./src/index when using the library locally.

API Reference

Functions

  • scrapeEmailsFromUrl(url: string, options?: HttpScraperOptions): Promise<Set<string>>

    • Scrapes emails from a single webpage using HTTP requests
    • Returns a Set of unique email addresses
  • scrapeEmailsFromPage(page: Page, url: string, options?: WebpageScraperOptions): Promise<Set<string>>

    • Scrapes emails from a webpage using a browser page instance
    • Works with Playwright or Puppeteer page objects
  • scrapeEmailsFromWebsite(url: string, options?: WebsiteCrawlerOptions): Promise<Set<string>>

    • Crawls an entire website and extracts emails from all visited pages
    • Supports configurable depth, page limits, and domain restrictions
  • extractEmails(text: string): Set<string>

    • Extracts email addresses from plain text using regex patterns
  • normalizeEmail(email: string): string

    • Normalizes email addresses to a standard format

Classes

  • EmailScraper - Main scraper class with convenience methods for both HTTP and browser-based scraping

Frequently Asked Questions (FAQ)

Q: How do I scrape emails from websites?

A: You can use the email scraper CLI or library. For a single page, use npm run dev page <url>. For an entire website, use npm run dev website <url>. The tool automatically extracts email addresses from HTML content and text.

Q: What's the difference between HTTP and browser scraping?

A: HTTP scraping is faster and works well for static HTML pages. Browser scraping (Playwright/Puppeteer) is necessary for JavaScript-rendered content where emails are loaded dynamically. Use the --browser flag to enable browser mode.

Q: How to scrape email from websites that require JavaScript?

A: Use the --browser option with the CLI, or use scrapeEmailsFromPage() with a Playwright or Puppeteer browser instance. This allows the email scraper to wait for JavaScript to execute before extracting emails.

Q: Can I scrape emails from multiple websites at once?

A: Yes, you can use the website crawler with --cross-domain flag, or write a script that calls scrapeEmailsFromUrl() or scrapeEmailsFromWebsite() multiple times with different URLs.

Q: How does the email scraper handle duplicate emails?

A: The scraper automatically deduplicates emails using a Set data structure. All functions return a Set<string> containing unique email addresses only.

Q: Is this email scraper legal to use?

A: This is a tool for extracting publicly available email addresses from websites. Always respect website terms of service, robots.txt files, and applicable laws (such as GDPR, CAN-SPAM) when scraping emails. Use responsibly and ethically.

Q: How do I install the email scraper as a library?

A: After cloning the repo and running npm install and npm run build, you can import from ./dist/index. For npm package distribution, the library would be installed via npm install email-scraper (when published).

Q: What email formats does the scraper support?

A: The email scraper recognizes standard email formats (user@domain.com) and handles various edge cases. It normalizes emails and validates them against common email patterns.

Q: Can I use this for lead generation?

A: Yes, this email scraper can be used for lead generation by extracting contact emails from websites. However, ensure compliance with anti-spam laws and best practices for email marketing.

Development

Building the Project

# Install dependencies
npm install

# Build TypeScript to JavaScript
npm run build

Running the CLI

# Development mode (uses ts-node, no build needed)
npm run dev page https://example.com
npm run dev website https://example.com

# Production mode (uses compiled JavaScript)
npm run scrape page https://example.com
npm run scrape website https://example.com

Project Structure

  • /src - TypeScript source code
    • /scrapers - Scraping implementations (HTTP, webpage, website crawler)
    • /utils - Utility functions (email extraction, browser factory)
    • /types - TypeScript type definitions
  • /dist - Compiled JavaScript output
  • cli.ts - Command-line interface implementation

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Keywords: email scraper, email extraction, web scraper, email harvesting, website crawler, email finder, contact scraper, lead generation tool, web scraping library, TypeScript scraper, Playwright scraper, Puppeteer scraper, email collection tool