A simple Python script to download HTML content from a list of URLs and clean it, focusing on extracting the main textual content while removing clutter like navigation, ads, scripts, and styles.
This project uses `uv` for fast dependency management and provides an `install` script to simplify setup by creating a self-contained environment.
- Downloads HTML from a list of URLs provided in `urls.txt`.
- Attempts to identify and extract the main content block of the page.
- Removes common non-content elements (scripts, styles, nav, footer, ads, etc.).
- Filters tags and attributes to a predefined allow-list optimized for content.
- Resolves relative links within the cleaned content to absolute URLs.
- Normalizes whitespace and basic HTML structure.
- Adds basic metadata (Source URL, Retrieval Time) to the cleaned HTML.
- Uses `httpx` for HTTP requests and `BeautifulSoup4` with `lxml` for parsing.
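The cleanup and link-resolution steps above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the real script uses the `lxml` parser and a fuller allow-list, while this example uses the stdlib `html.parser` backend and a shortened clutter-tag list.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Shortened example list; the real script removes more clutter elements.
CLUTTER_TAGS = ["script", "style", "nav", "footer", "aside", "iframe"]

def clean_html(raw_html: str, base_url: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop non-content elements entirely.
    for tag in soup(CLUTTER_TAGS):
        tag.decompose()
    # Resolve relative links against the page's URL.
    for a in soup.find_all("a", href=True):
        a["href"] = urljoin(base_url, a["href"])
    return str(soup)

raw = '<div><script>track()</script><a href="/article1">Read more</a></div>'
cleaned = clean_html(raw, "https://example.com/")
print(cleaned)  # <div><a href="https://example.com/article1">Read more</a></div>
```

The same `urljoin` call also handles protocol-relative and already-absolute URLs correctly, which is why it is preferable to manual string concatenation here.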
- A Unix-like environment (Linux, macOS, WSL on Windows) to run the `install` shell script.
- `curl` (usually pre-installed) to download `uv` if it's not already present.
- Python 3.13 or newer (the `install` script will try to use `uv` to set this up, but having it available is recommended).
- An internet connection (for installing `uv` and Python dependencies).
This project uses a simple `install` script to handle setup, leveraging the fast `uv` package manager. This avoids polluting your global Python environment and makes setup easier.
What the install script does:
- Checks for / installs `uv`: downloads and installs the `uv` tool if it's not found in your `PATH`.
- Creates a virtual environment: uses `uv` to create a dedicated, isolated Python virtual environment (`.venv_url_cleaner`) specifically for this script, attempting to use Python 3.13+.
- Installs dependencies: uses `uv pip install` to quickly install the required Python packages (`httpx`, `beautifulsoup4`, `lxml`, etc. from `requirements.txt`) into the dedicated virtual environment.
- Creates an executable wrapper: generates a wrapper script named `url-cleaner`. This script automatically uses the Python interpreter and packages from the `.venv_url_cleaner` environment, so you don't need to activate it manually.
To install:
- Make the install script executable:

  ```bash
  chmod +x install
  ```

- Run the install script:

  ```bash
  ./install
  ```
Follow any prompts from the script. If successful, it will create the `.venv_url_cleaner` directory and the `url-cleaner` executable file.
- Edit `urls.txt`: add the URLs you want to download and clean, one URL per line. Lines starting with `#` are ignored.

  ```
  # Example URLs
  https://example.com/article1
  https://another-site.org/some/page.html
  ```

- Run the cleaner: execute the wrapper script created during installation:

  ```bash
  ./url-cleaner
  ```

- Find the output: the script will process each URL. Cleaned HTML files will be saved in the `_output/` directory. Filenames are generated based on the URL's domain and path, plus a hash segment to avoid collisions (e.g., `example_com_article1_...hash....html`).
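A filename scheme matching that description can be sketched like this. This is a hypothetical illustration; the actual script's sanitization rules and hash length may differ.

```python
import hashlib
import re
from urllib.parse import urlparse

def output_filename(url: str) -> str:
    parsed = urlparse(url)
    # Collapse the domain and path into a filesystem-safe stem.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", f"{parsed.netloc}{parsed.path}").strip("_")
    # A short hash segment keeps distinct URLs from colliding on the same stem.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
    return f"{stem}_{digest}.html"

name = output_filename("https://example.com/article1")
print(name)  # e.g. example_com_article1_<10 hex chars>.html
```

Hashing the full URL (rather than just the path) means two URLs that sanitize to the same stem, such as ones differing only in query string, still get distinct filenames.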
Using the `install` script with `uv` provides several benefits, especially when distributing Python command-line tools:
- Isolation: dependencies are installed in a dedicated virtual environment (`.venv_url_cleaner`), preventing conflicts with other Python projects or system packages.
- Simplicity for users: users don't need to manually create virtual environments or manage Python paths. They just run `./install` once, then use the simple `./url-cleaner` command.
- Reproducibility: the `requirements.txt` file ensures the correct versions of dependencies are installed via `uv`.
- Speed: `uv` is significantly faster than traditional `pip` and `venv` for creating environments and installing packages.
- No global installs: keeps the user's system clean.
The generated `url-cleaner` script is a "polyglot" script: it contains a shell command that finds and executes the correct Python interpreter within the hidden virtual environment, passing the rest of the script's content (the actual Python code) to it.
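The trick can be illustrated with a minimal polyglot of this shape (an illustrative sketch; the generated wrapper's exact contents may differ):

```python
#!/bin/sh
# Under a shell, the next line replaces the process with the venv's Python,
# re-running this same file. Under Python, the same line is just a sequence
# of adjacent string literals that concatenate into one discarded expression.
"exec" "$(dirname $0)/.venv_url_cleaner/bin/python3" "$0" "$@"

# --- everything below runs only under Python ---
import sys

interpreter = sys.executable
print("running under:", interpreter)
```

The key is that the shell line contains no characters that are invalid in Python: `sh` sees `exec <venv python> <this file> <args>`, while Python sees four string literals side by side, evaluates them as one string, and throws the result away before continuing with the real code.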
Copyright Aryan Ameri. This project is licensed under the MIT License - see the LICENSE file for details.