webdevtodayjason/Clone-It

🌐 Clone-It - Website Cloning Tool

A user-friendly command-line tool for cloning websites using wget with an interactive menu interface.

Features

  • Interactive Menu: Easy-to-use CLI interface with colored output
  • Domain Validation: Ensures valid domain format before proceeding
  • Project Organization: Creates organized folder structure in projects/domain.com
  • Live Output: Shows real-time wget progress and output
  • Post-Clone Options: View files, open folder, or clone another site
  • Error Handling: Graceful error handling with helpful messages

Prerequisites

  • wget: The tool requires wget to be installed
    • macOS: brew install wget
    • Ubuntu/Debian: sudo apt-get install wget
    • CentOS/RHEL: sudo yum install wget
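
A preflight check along these lines could confirm that wget is available before any cloning starts. This is a minimal sketch, not the script's actual code; the function name require_cmd is hypothetical.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  # Report on stderr so the message is not mixed into normal output.
  printf 'Error: %s is not installed. See Prerequisites.\n' "$1" >&2
  return 1
}

# Example use at the top of a script:
#   require_cmd wget || exit 1
```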

Installation

  1. Clone or download the script to your desired location
  2. Make it executable: chmod +x clone-it.sh
  3. (Optional) Run the installer for easier access: ./install.sh

Usage

Direct execution:

./clone-it.sh

After installation:

clone-it

How It Works

  1. Menu Interface: The script presents a clean, colored menu asking "What are we cloning today?"
  2. Domain Input: Enter the domain you want to clone (e.g., example.com)
  3. Validation: The script validates the domain format
  4. Link Conversion Option: Choose whether to rewrite internal links so they point to the .html files
  5. Directory Creation: Creates projects/domain.com/ folder structure
  6. Cloning Process: Runs wget with the following parameters:
    wget --mirror -w 2 -p --html-extension --convert-links https://domain.com/
  7. Link Processing: If enabled, converts internal links to work with .html extensions
  8. Live Output: Shows the wget output in real-time within a bordered window
  9. Completion Menu: Offers options to:
    • Clone another website
    • Open the project folder
    • View project contents
    • Exit
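
Steps 2 to 5 above can be sketched roughly as follows. The helper names is_valid_domain and project_dir are hypothetical, and the validation pattern is an assumption; the actual script may accept or reject different inputs.

```shell
# Hypothetical helper: accept bare host names such as example.com and
# reject anything with a scheme, a path, or a malformed label.
is_valid_domain() {
  printf '%s\n' "$1" |
    grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}$'
}

# Hypothetical helper: the per-site folder described in step 5.
project_dir() {
  printf 'projects/%s\n' "$1"
}

domain="example.com"
if is_valid_domain "$domain"; then
  echo "Would clone https://$domain/ into $(project_dir "$domain")/"
else
  echo "Invalid domain: $domain" >&2
fi
```

Validating before creating any directory keeps a mistyped input such as http://example.com from producing an oddly named folder under projects/.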

wget Parameters Explained

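  • --mirror: Turns on recursion and timestamping with unlimited depth, producing a complete mirror of the site
  • -w 2: Waits 2 seconds between downloads (respectful crawling)
  • -p: Downloads all page prerequisites (images, CSS, etc.)
  • --html-extension: Appends .html to saved HTML files whose names lack it (newer wget releases call this option --adjust-extension)
  • --convert-links: Rewrites links in the downloaded files so they work for offline browsing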
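
For readers who prefer self-describing flags, the same command can be spelled with long options. The wget manual documents --mirror as currently equivalent to -r -N -l inf --no-remove-listing, and --adjust-extension as the newer name for --html-extension. The helper name wget_args is hypothetical; this is an illustration, not how clone-it.sh builds its command.

```shell
# Hypothetical helper: print the long-option form of the command's
# arguments, one per line, for the given domain.
wget_args() {
  printf '%s\n' \
    --recursive --timestamping --level=inf --no-remove-listing \
    --wait=2 --page-requisites --adjust-extension --convert-links \
    "https://$1/"
}

# Example use (safe to leave unquoted here because no argument
# contains whitespace):
#   wget $(wget_args example.com)
```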

Project Structure

CLONE-IT/
├── clone-it.sh        # Main cloning script
├── fix-links.sh       # Link conversion utility
├── install.sh         # Installation helper
├── README.md          # This file
└── projects/          # Created when first used
    └── domain.com/    # Individual site folders
        └── domain.com/  # Actual site files

Error Handling

  • Checks for wget installation before running
  • Validates domain format
  • Handles existing directories with user confirmation
  • Reports wget exit codes and errors
  • Handles user cancellations gracefully
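
Reporting wget exit codes could look something like the sketch below. The code-to-meaning mapping follows the exit status list in the wget manual; the function name describe_wget_exit is hypothetical and the script's real messages may differ.

```shell
# Hypothetical helper: translate a wget exit status into a short message.
describe_wget_exit() {
  case $1 in
    0) echo "No problems occurred" ;;
    1) echo "Generic error" ;;
    2) echo "Parse error (command-line options or .wgetrc)" ;;
    3) echo "File I/O error" ;;
    4) echo "Network failure" ;;
    5) echo "SSL verification failure" ;;
    6) echo "Username/password authentication failure" ;;
    7) echo "Protocol error" ;;
    8) echo "Server issued an error response" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}

# Example use after a wget run:
#   wget ... ; status=$?
#   [ "$status" -eq 0 ] || echo "wget: $(describe_wget_exit "$status")" >&2
```

Exit code 8 is worth treating as a warning rather than a fatal error during a mirror, since a single broken link on the source site is enough to trigger it.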

Colors and UI

The script uses ANSI color codes for better user experience:

  • 🔵 Blue: Process information
  • 🟢 Green: Success messages
  • 🟡 Yellow: Warnings and prompts
  • 🔴 Red: Error messages
  • 🔷 Cyan: Headers and menus
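
A small helper like the one below shows how such color-coded messages are typically produced with ANSI escape sequences. The function name say and the level names are hypothetical, and the exact color codes used by clone-it.sh are an assumption.

```shell
# Hypothetical helper: print a message in the color for its level.
say() {
  case $1 in
    info)   code='0;34' ;;  # blue: process information
    ok)     code='0;32' ;;  # green: success
    warn)   code='0;33' ;;  # yellow: warnings and prompts
    error)  code='0;31' ;;  # red: errors
    header) code='0;36' ;;  # cyan: headers and menus
    *)      code='0'    ;;  # anything else: terminal default
  esac
  shift
  # \033[<code>m switches the color on; \033[0m resets it.
  printf '\033[%sm%s\033[0m\n' "$code" "$*"
}

say header "What are we cloning today?"
say ok "Clone complete"
```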

Link Conversion Feature

The tool includes a link conversion feature to fix a common issue with cloned websites:

The Problem: When wget clones a site, it adds .html extensions to files that didn't originally have them. A page served at /office-visits is saved as office-visits.html, but the HTML can still link to the original path without the extension, so the internal link breaks.

The Solution: When you choose "Y" for "Convert all internal links to .html?", the script will:

  • Scan all HTML files in the cloned site
  • Convert internal links from /page-name to /page-name.html
  • Handle both absolute and relative links
  • Preserve existing .html links unchanged
  • Create backups when using the standalone fix-links utility
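
The core rewrite can be illustrated with a single sed expression. This is a simplified sketch, not the code in fix-links.sh: the function name add_html_ext is hypothetical, and it only handles double-quoted, root-relative href values with no dot, query string, or fragment.

```shell
# Hypothetical helper: read HTML on stdin and append .html to
# root-relative href targets such as /office-visits.
# Left unchanged: "/" itself, paths ending in "/", paths containing a
# dot (for example /about.html or /css/site.css), and any link with a
# query string or fragment.
add_html_ext() {
  sed -E 's|href="(/[^".?#]*[^"./?#])"|href="\1.html"|g'
}

# Example use on one file, keeping a backup first:
#   cp page.html page.html.bak && add_html_ext < page.html.bak > page.html
```

Skipping anything that already contains a dot is what keeps existing .html links and asset URLs untouched.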

Standalone Link Fixer

For sites already cloned without link conversion, use the separate utility:

./fix-links.sh              # Interactive mode
./fix-links.sh example.com  # Direct mode

Tips

  • The script waits 2 seconds between requests to avoid overloading servers
  • Large sites may take considerable time to clone completely
  • Check robots.txt and site terms of service before cloning
  • The cloned site will work offline with converted links
  • Use link conversion if the original site used clean URLs without extensions

Troubleshooting

  • "wget not found": Install wget using your package manager
  • "Invalid domain": Ensure domain format like example.com (no http://)
  • "Permission denied": Make sure the script is executable (chmod +x)
  • Slow cloning: This is normal for large sites due to the respectful 2-second delay

Playwright Edition (bypasses JS challenges)

Some sites sit behind bot-mitigation challenges (Vercel, Cloudflare, etc.) that return HTTP 429 to wget because it can't execute JavaScript. For those sites, use clone-playwright.js, which drives a real Chromium browser so the challenge is solved transparently.

Setup

npm install playwright
npx playwright install chromium

Usage

node clone-playwright.js <domain> [--max-pages=N] [--headful] [--delay=ms]

# Examples
node clone-playwright.js example.com
node clone-playwright.js example.com --max-pages=500
node clone-playwright.js example.com --headful   # watch it run

What it does

  1. Launches headless Chromium with a real user-agent
  2. BFS-crawls same-origin links from the landing page (cap: --max-pages)
  3. Captures every successful response (CSS, JS, images, fonts, …) via the browser network layer and writes it to disk
  4. Saves the rendered post-JS HTML for each page
  5. Rewrites absolute URLs, root-relative paths, and internal links to relative local paths so the mirror works fully offline

Output

projects/<domain>/<domain>/
├── index.html
├── about-us.html
├── _next/...        # framework assets
└── ...

Flags

Flag            Default   Description
--max-pages=N   200       Hard cap on pages crawled
--headful       off       Show the browser window (debugging)
--delay=ms      500       Pause after networkidle per page

Viewing the clone

cd projects/<domain>/<domain>
python3 -m http.server 8000
# open http://localhost:8000/

Caveats

  • Lazy-loaded / scroll-triggered assets won't be captured unless you add a scroll step in the page loop.
  • URLs with query strings get a hashed suffix in the filename to avoid collisions (common with Next.js RSC payloads such as ?_rsc=...).
  • Use --headful the first time on a new site if anything looks wrong.
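
The hashed-suffix idea from the second caveat can be illustrated in a few lines of shell. This is only a sketch of the general technique: clone-playwright.js is JavaScript, and its real hash function and filename layout are not documented here. The function name local_name and the use of cksum are assumptions.

```shell
# Hypothetical helper: map a URL path to a local file name, adding a
# checksum of the query string so that /page?a=1 and /page?a=2 do not
# overwrite each other.
local_name() {
  case $1 in
    *\?*)
      base=${1%%\?*}    # everything before the first "?"
      query=${1#*\?}    # everything after the first "?"
      hash=$(printf '%s' "$query" | cksum | cut -d' ' -f1)
      printf '%s.%s\n' "$base" "$hash"
      ;;
    *)
      # No query string: keep the path as-is.
      printf '%s\n' "$1"
      ;;
  esac
}

local_name '/about-us'
local_name '/page?_rsc=abc'
```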

License

Free to use and modify as needed.

About

CLI tool that grabs an entire website and clones it to HTML for site backups.
