Email Scraper – Open-Source Email Extraction Tool for Websites

Email Scraper is an open-source TypeScript library and CLI tool for extracting email addresses from webpages and entire websites. This email scraper helps developers automate email collection by providing ready-to-use modules that support both HTTP requests and headless browsers (Playwright/Puppeteer) for comprehensive web scraping.

Whether you need to scrape emails from a single contact page or crawl an entire website to collect email addresses, this tool provides a simple, reliable solution for email extraction and web scraping tasks.

Features

Email Extraction from Single Pages: Scrape email addresses from any webpage URL using HTTP requests or headless browsers
Website Crawling: Automatically crawl entire websites to extract emails from multiple pages with configurable depth and page limits
Multiple Scraping Methods: Choose between fast HTTP requests or full browser automation (Playwright/Puppeteer) for JavaScript-heavy sites
TypeScript Support: Fully typed API for better developer experience and IDE autocomplete
CLI Tool: Command-line interface for quick email scraping without writing code
Library API: Import and use in your own Node.js projects for custom email scraping workflows
Flexible Configuration: Customize timeout, crawl depth, page limits, and cross-domain crawling options
Email Normalization: Automatically normalizes and deduplicates extracted email addresses

Installation

Setup

Clone this repository:

git clone https://github.com/web-scraping-apis/email-scraper
cd email-scraper

Install dependencies:

npm install

Build the project:

npm run build

Browser Support (Optional)

For scraping JavaScript-rendered pages, install one of the supported browsers:

Using Playwright:

npm install playwright
npx playwright install chromium

Using Puppeteer:

npm install puppeteer

Quick Start

CLI Usage – Scrape Emails from a Single Webpage

The email scraper CLI makes it easy to extract emails from any webpage:

# Using HTTP requests (fast, for static pages)
npm run dev page https://example.com/contact

# Using headless browser (for JavaScript-rendered pages)
npm run dev page https://example.com/contact --browser

# After building, use the compiled version:
npm run scrape page https://example.com/contact

CLI Options for Single Page Scraping:

-t, --timeout <ms> - Request timeout in milliseconds (default: 10000)
-b, --browser - Use headless browser (Playwright/Puppeteer) instead of HTTP requests
--wait-until <event> - Wait until event: load, domcontentloaded, or networkidle (default: load, browser mode only)

CLI Usage – Scrape Emails from an Entire Website

Crawl and extract emails from multiple pages on a website:

# Using HTTP requests (default)
npm run dev website https://example.com

# Using headless browser
npm run dev website https://example.com --browser

# After building:
npm run scrape website https://example.com

CLI Options for Website Crawling:

-d, --max-depth <number> - Maximum crawl depth (default: 3)
-p, --max-pages <number> - Maximum number of pages to crawl (default: 50)
--cross-domain - Allow crawling to different domains (default: false)
-b, --browser - Use headless browser (Playwright/Puppeteer) instead of HTTP requests
--wait-until <event> - Wait until event: load, domcontentloaded, or networkidle (default: load, browser mode only)

Library Usage

Import and use the email scraper in your own TypeScript/JavaScript projects:

Basic HTTP Email Scraping

Extract emails from a webpage using simple HTTP requests:

import { scrapeEmailsFromUrl } from 'email-scraper';

const emails = await scrapeEmailsFromUrl('https://example.com/contact');
console.log(Array.from(emails));
// Output: ['contact@example.com', 'info@example.com']

Email Scraping with Playwright

For JavaScript-rendered pages, use Playwright with the email scraper:

import { chromium } from 'playwright';
import { scrapeEmailsFromPage } from 'email-scraper';

const browser = await chromium.launch();
const page = await browser.newPage();
const emails = await scrapeEmailsFromPage(page, 'https://example.com/contact');
await browser.close();

console.log(Array.from(emails));

Email Scraping with Puppeteer

Alternatively, use Puppeteer for browser automation:

import puppeteer from 'puppeteer';
import { scrapeEmailsFromPage } from 'email-scraper';

const browser = await puppeteer.launch();
const page = await browser.newPage();
const emails = await scrapeEmailsFromPage(page, 'https://example.com/contact');
await browser.close();

Crawling an Entire Website for Emails

Use the website crawler to extract emails from multiple pages:

import { scrapeEmailsFromWebsite } from 'email-scraper';

const emails = await scrapeEmailsFromWebsite('https://example.com', {
  maxDepth: 3,
  maxPages: 50,
  sameDomainOnly: true,
  useBrowser: false, // Set to true to use a browser
  browser: browserInstance, // Optional: pass browser instance
});

console.log(`Found ${emails.size} unique emails`);

Using the EmailScraper Class

For a more object-oriented approach, use the EmailScraper class:

import { EmailScraper } from 'email-scraper';
import { chromium } from 'playwright';

// With browser
const browser = await chromium.launch();
const scraper = new EmailScraper(browser);
const emails = await scraper.scrapeFromWebsite('https://example.com');
await browser.close();

// Without browser (HTTP only)
const scraper = new EmailScraper();
const emails = await scraper.scrapeFromUrl('https://example.com/contact');

Note: After building (npm run build), you can import from ./dist/index instead of ./src/index when using the library locally.

API Reference

Functions

scrapeEmailsFromUrl(url: string, options?: HttpScraperOptions): Promise<Set<string>>
- Scrapes emails from a single webpage using HTTP requests
- Returns a Set of unique email addresses
scrapeEmailsFromPage(page: Page, url: string, options?: WebpageScraperOptions): Promise<Set<string>>
- Scrapes emails from a webpage using a browser page instance
- Works with Playwright or Puppeteer page objects
scrapeEmailsFromWebsite(url: string, options?: WebsiteCrawlerOptions): Promise<Set<string>>
- Crawls an entire website and extracts emails from all visited pages
- Supports configurable depth, page limits, and domain restrictions
extractEmails(text: string): Set<string>
- Extracts email addresses from plain text using regex patterns
normalizeEmail(email: string): string
- Normalizes email addresses to a standard format

Classes

EmailScraper - Main scraper class with convenience methods for both HTTP and browser-based scraping

Frequently Asked Questions (FAQ)

Q: How do I scrape emails from websites?

A: You can use the email scraper CLI or library. For a single page, use npm run dev page <url>. For an entire website, use npm run dev website <url>. The tool automatically extracts email addresses from HTML content and text.

Q: What's the difference between HTTP and browser scraping?

A: HTTP scraping is faster and works well for static HTML pages. Browser scraping (Playwright/Puppeteer) is necessary for JavaScript-rendered content where emails are loaded dynamically. Use the --browser flag to enable browser mode.

Q: How to scrape email from websites that require JavaScript?

A: Use the --browser option with the CLI, or use scrapeEmailsFromPage() with a Playwright or Puppeteer browser instance. This allows the email scraper to wait for JavaScript to execute before extracting emails.

Q: Can I scrape emails from multiple websites at once?

A: Yes, you can use the website crawler with --cross-domain flag, or write a script that calls scrapeEmailsFromUrl() or scrapeEmailsFromWebsite() multiple times with different URLs.

Q: How does the email scraper handle duplicate emails?

A: The scraper automatically deduplicates emails using a Set data structure. All functions return a Set<string> containing unique email addresses only.

Q: Is this email scraper legal to use?

A: This is a tool for extracting publicly available email addresses from websites. Always respect website terms of service, robots.txt files, and applicable laws (such as GDPR, CAN-SPAM) when scraping emails. Use responsibly and ethically.

Q: How do I install the email scraper as a library?

A: After cloning the repo and running npm install and npm run build, you can import from ./dist/index. For npm package distribution, the library would be installed via npm install email-scraper (when published).

Q: What email formats does the scraper support?

A: The email scraper recognizes standard email formats (user@domain.com) and handles various edge cases. It normalizes emails and validates them against common email patterns.

Q: Can I use this for lead generation?

A: Yes, this email scraper can be used for lead generation by extracting contact emails from websites. However, ensure compliance with anti-spam laws and best practices for email marketing.

Development

Building the Project

# Install dependencies
npm install

# Build TypeScript to JavaScript
npm run build

Running the CLI

# Development mode (uses ts-node, no build needed)
npm run dev page https://example.com
npm run dev website https://example.com

# Production mode (uses compiled JavaScript)
npm run scrape page https://example.com
npm run scrape website https://example.com

Project Structure

/src - TypeScript source code
- /scrapers - Scraping implementations (HTTP, webpage, website crawler)
- /utils - Utility functions (email extraction, browser factory)
- /types - TypeScript type definitions
/dist - Compiled JavaScript output
cli.ts - Command-line interface implementation

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Keywords: email scraper, email extraction, web scraper, email harvesting, website crawler, email finder, contact scraper, lead generation tool, web scraping library, TypeScript scraper, Playwright scraper, Puppeteer scraper, email collection tool

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
.gitignore		.gitignore
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

web-scraping-tools/email-scraper

Folders and files

Latest commit

History

Repository files navigation