A user-friendly command-line tool for cloning websites using wget with an interactive menu interface.
- Interactive Menu: Easy-to-use CLI interface with colored output
- Domain Validation: Ensures valid domain format before proceeding
- Project Organization: Creates an organized folder structure in `projects/domain.com`
- Live Output: Shows real-time wget progress and output
- Post-Clone Options: View files, open folder, or clone another site
- Error Handling: Graceful error handling with helpful messages
- wget: The tool requires wget to be installed
  - macOS: `brew install wget`
  - Ubuntu/Debian: `sudo apt-get install wget`
  - CentOS/RHEL: `sudo yum install wget`
- Clone or download the script to your desired location
- Make it executable: `chmod +x clone-it.sh`
- (Optional) Run the installer for easier access: `./install.sh`
./clone-it.sh
clone-it # if installed via install.sh
- Menu Interface: The script presents a clean, colored menu asking "What are we cloning today?"
- Domain Input: Enter the domain you want to clone (e.g., `example.com`)
- Validation: The script validates the domain format
- Link Conversion Option: Choose whether to convert internal links to use the .html extension
- Directory Creation: Creates the `projects/domain.com/` folder structure
- Cloning Process: Runs the wget command with the following parameters:
wget --mirror -w 2 -p --html-extension --convert-links https://domain.com/
- Link Processing: If enabled, converts internal links to work with .html extensions
- Live Output: Shows the wget output in real-time within a bordered window
- Completion Menu: Offers options to:
- Clone another website
- Open the project folder
- View project contents
- Exit
- `--mirror`: Creates a complete mirror of the site
- `-w 2`: Waits 2 seconds between downloads (respectful crawling)
- `-p`: Downloads all page prerequisites (images, CSS, etc.)
- `--html-extension`: Adds .html extension to files
- `--convert-links`: Converts links for offline browsing
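As a rough sketch of how these flags fit together, the call could be wrapped in a small function that also reports a non-zero exit code. The function name `clone_site` and the `WGET_BIN` override are illustrative assumptions, not names taken from clone-it.sh.

```shell
# Hypothetical wrapper around the wget call described above.
# WGET_BIN is an assumed override so the wrapper can be dry-run
# (for example WGET_BIN=echo prints the arguments instead of downloading).
clone_site() {
  local domain="$1"
  local status=0

  "${WGET_BIN:-wget}" \
    --mirror \
    -w 2 \
    -p \
    --html-extension \
    --convert-links \
    "https://${domain}/" || status=$?

  if [ "$status" -ne 0 ]; then
    echo "wget exited with code $status" >&2
  fi
  return "$status"
}

# Usage (performs a real download):
#   clone_site example.com
```

Note that wget can return a non-zero code even when most of a mirror succeeded (for instance when a single linked page is missing), so a wrapper like this should report the code rather than discard the download.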
CLONE-IT/
├── clone-it.sh # Main cloning script
├── fix-links.sh # Link conversion utility
├── install.sh # Installation helper
├── README.md # This file
└── projects/ # Created when first used
    └── domain.com/ # Individual site folders
        └── domain.com/ # Actual site files
- Checks for wget installation before running
- Validates domain format
- Handles existing directories with user confirmation
- Reports wget exit codes and errors
- Graceful handling of user cancellations
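The first two checks might look something like the sketch below. The function names and the regular expression are assumptions for illustration; the pattern clone-it.sh actually uses to decide what counts as a valid domain may be stricter or looser.

```shell
# Hypothetical pre-flight helpers (names are illustrative).

# Succeeds only if wget is on the PATH.
require_wget() {
  command -v wget >/dev/null 2>&1 || {
    echo "wget not found: install it with your package manager" >&2
    return 1
  }
}

# Assumed approximation of "valid domain format": dot-separated labels of
# letters, digits, and hyphens, ending in a TLD of at least two letters.
# A scheme such as http:// or a trailing path is rejected.
is_valid_domain() {
  printf '%s\n' "$1" |
    grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,}$'
}

# Usage:
#   require_wget || exit 1
#   is_valid_domain "example.com" || echo "Invalid domain" >&2
```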
The script uses ANSI color codes for better user experience:
- 🔵 Blue: Process information
- 🟢 Green: Success messages
- 🟡 Yellow: Warnings and prompts
- 🔴 Red: Error messages
- 🔷 Cyan: Headers and menus
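A minimal sketch of such color helpers is shown below. The variable and function names are assumptions, and the exact escape codes used by the script may differ (for example bold variants).

```shell
# Illustrative ANSI color helpers; names are hypothetical.
BLUE='\033[0;34m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
CYAN='\033[0;36m'
NC='\033[0m' # reset to the terminal default

# %b expands the backslash escapes stored in the color variables.
info()    { printf '%b%s%b\n' "$BLUE" "$1" "$NC"; }
success() { printf '%b%s%b\n' "$GREEN" "$1" "$NC"; }
warn()    { printf '%b%s%b\n' "$YELLOW" "$1" "$NC"; }
error()   { printf '%b%s%b\n' "$RED" "$1" "$NC" >&2; }
header()  { printf '%b%s%b\n' "$CYAN" "$1" "$NC"; }

# Usage:
#   header "What are we cloning today?"
#   success "Clone complete"
```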
The tool includes a link conversion feature to fix a common issue with cloned websites:
The Problem: When wget clones a site with --html-extension, it adds .html to files that didn't originally have an extension. This breaks internal links: a page such as /office-visits is saved as /office-visits.html, but the HTML still links to the original path without the extension.
The Solution: When you choose "Y" for "Convert all internal links to .html?", the script will:
- Scan all HTML files in the cloned site
- Convert internal links from `/page-name` to `/page-name.html`
- Handle both absolute and relative links
- Preserve existing `.html` links unchanged
- Create backups when using the standalone fix-links utility
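The core of the rewrite can be sketched with `find` and `sed`. This is a simplified illustration, not the logic of clone-it.sh or fix-links.sh: the function name `convert_links` is hypothetical, and the pattern only covers double-quoted, root-relative `href` values with no dot, query string, or fragment.

```shell
# Hypothetical sketch of the link rewrite (simplified on purpose).
# For every .html file under the given directory:
#   href="/page-name"       -> href="/page-name.html"
#   href="/page-name.html"  -> unchanged (path already contains a dot)
#   href="/"                -> unchanged (nothing after the slash)
#   href="https://..."      -> unchanged (does not start with a slash)
# Each edited file keeps a backup next to it with a .bak suffix.
convert_links() {
  local site_dir="$1"
  find "$site_dir" -type f -name '*.html' -exec \
    sed -i.bak -E 's@href="(/[^"#?.]*[^"#?./])"@href="\1.html"@g' {} +
}

# Usage:
#   convert_links projects/example.com/example.com
```

Because already-converted links contain a dot, running the function a second time leaves them untouched, although it does overwrite the earlier `.bak` files.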
For sites already cloned without link conversion, use the separate utility:
./fix-links.sh # Interactive mode
./fix-links.sh example.com # Direct mode
- The script is respectful to servers, waiting 2 seconds between requests
- Large sites may take considerable time to clone completely
- Check robots.txt and site terms of service before cloning
- The cloned site will work offline with converted links
- Use link conversion if the original site used clean URLs without extensions
"wget not found": Install wget using your package manager
"Invalid domain": Ensure domain format like example.com (no http://)
"Permission denied": Make sure the script is executable (chmod +x)
Slow cloning: This is normal for large sites due to the respectful 2-second delay
Some sites sit behind bot-mitigation challenges (Vercel, Cloudflare, etc.) that
return HTTP 429 to wget because it can't execute JavaScript. For those sites,
use clone-playwright.js, which drives a real Chromium browser so the
challenge solves transparently.
npm install playwright
npx playwright install chromium
node clone-playwright.js <domain> [--max-pages=N] [--headful] [--delay=ms]
# Examples
node clone-playwright.js example.com
node clone-playwright.js example.com --max-pages=500
node clone-playwright.js example.com --headful # watch it run
- Launches headless Chromium with a real user-agent
- BFS-crawls same-origin links from the landing page (cap: `--max-pages`)
- Captures every successful response (CSS, JS, images, fonts, …) via the browser network layer and writes it to disk
- Saves the rendered post-JS HTML for each page
- Rewrites absolute URLs, root-relative paths, and internal links to relative local paths so the mirror works fully offline
projects/<domain>/<domain>/
├── index.html
├── about-us.html
├── _next/... # framework assets
└── ...
| Flag | Default | Description |
|---|---|---|
| `--max-pages=N` | 200 | Hard cap on pages crawled |
| `--headful` | off | Show the browser window (debugging) |
| `--delay=ms` | 500 | Pause after networkidle per page |
cd projects/<domain>/<domain>
python3 -m http.server 8000
# open http://localhost:8000/
- Lazy-loaded / scroll-triggered assets won't be captured unless you add a scroll step in the page loop.
- URLs with query strings get a hashed suffix in the filename to avoid collisions (notably common with Next.js RSC payloads `?_rsc=...`).
- Use `--headful` the first time on a new site if anything looks wrong.
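To illustrate why a hashed suffix avoids collisions, here is a small shell sketch of the idea. It is not how clone-playwright.js names files: the function `local_name`, the `path.hash` layout, and the use of `cksum` as the hash are all assumptions made for the example.

```shell
# Illustrative only: maps a URL path plus query string to a local filename.
# cksum (a CRC) stands in for whatever hash clone-playwright.js really uses.
local_name() {
  local path="$1" query="$2"

  # No query string: the path is used as-is.
  if [ -z "$query" ]; then
    printf '%s\n' "$path"
    return 0
  fi

  # Same path with different queries gets a different numeric suffix.
  local hash
  hash="$(printf '%s' "$query" | cksum | cut -d ' ' -f 1)"
  printf '%s.%s\n' "$path" "$hash"
}

# Usage:
#   local_name "about-us" "_rsc=abc"
#   local_name "about-us" "_rsc=xyz"
```

The two calls in the usage comment return different names, so two responses for the same path no longer overwrite each other on disk.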
Free to use and modify as needed.