Skip to content

anmolksachan/WayBackupFinder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 

Repository files navigation

image

WayBackup Finder

This Python script fetches URLs from the Wayback Machine and filters them based on specified file extensions. It also checks if archived snapshots are available for each URL and saves the filtered URLs to files.

Read more: Medium
Watch Tool in action: Medium

Features

  • Fetches URLs from the Wayback Machine using the CDX API.
  • Filters the fetched URLs by specific file extensions (e.g., .pdf, .zip).
  • Checks if a Wayback snapshot is available for each URL.
  • Saves the filtered URLs to text files.
  • Customizable file extensions to filter or use default extensions from extensions.txt.

Use Case: Finding Archived Backups

This tool can be especially useful for finding backups of websites or files that may no longer be available on the live site. If a resource (e.g., a PDF or image) was previously available on the website but has since been removed or is temporarily unavailable, there may still be an archived snapshot of it in the Wayback Machine.

By using this script, you can:

  • Identify URLs that may have once been accessible but no longer are.
  • Check for backups on the Wayback Machine that might not be available on the current live site.
  • Retrieve historical versions of files or content that have been deleted or moved.

The script attempts to retrieve URLs from the Wayback Machine. For each URL found, it checks if an archived snapshot is available. If a snapshot exists, the script provides a link to the backup.

Requirements

  • Python 3.x
  • The following Python packages:
    • requests
    • colorama
    • termcolor

You can install the required packages using the following command:

pip3 install requests colorama termcolor

How to Use

  1. Clone the repository or download the script.
  2. Ensure you have a file named extensions.txt in the same directory as the script, or specify custom file extensions.
  3. Run the script:
python wayBackupFinder.py
  1. When prompted, enter the target domain (e.g., example.com) and specify whether to use custom file extensions or load them from the extensions.txt file.
  2. The script will:
    • Fetch URLs from the Wayback Machine.
    • Filter the URLs by the provided file extensions.
    • Save the filtered URLs to separate files.
    • Check if archived snapshots are available for each URL.

Example

image

Input:

Enter the target domain (e.g., example.com): example.com
Would you like to use custom file extensions or load from extensions.txt? (custom/load): load

Output:

The script will print the progress and save the filtered URLs to files such as:

Filtered URLs for .pdf saved to: content/example.com/example.com_pdf_filtered_urls.txt
Found possible backup: https://web.archive.org/web/20200101000000/https://example.com/sample.pdf

File Extensions

You can specify custom file extensions to filter by, separated by commas, for example: .zip,.pdf,.jpg. If you choose to load extensions from extensions.txt, the script will use those.

File Structure

The script will create a folder called content and store the filtered URLs for each extension in subfolders named after the target domain.

License

This project is licensed under the MIT License - see the LICENSE file for details.