# Instruction on extract_html_text.py script

---
title: "Instruction on extract_html_text.py script"
author: rudakow.wadim@gmail.com
date: 2026-02-10
options:
  version: 1.0.0
  birth: 2026-02-10
---

This [script](/tools/scripts/extract_html_text.py) extracts readable plain text from HTML and MHTML files by stripping all markup, scripts, styles, SVG, and non-content elements.

It auto-detects MHTML (multipart MIME with quoted-printable encoding) and handles it transparently. It uses only the Python standard library (`html.parser`, `email`, `quopri`), so it needs no external dependencies.

## Synopsis

```bash
# Extract to stdout
extract_html_text.py INPUT_FILE

# Extract to file
extract_html_text.py INPUT_FILE --output OUTPUT_FILE
```

| Argument | Description | Default |
|----------|-------------|---------|
| `INPUT_FILE` | Path to the HTML or MHTML file to extract text from | Required |
| `--output` | Write output to file instead of stdout | stdout |

**Exit Codes:**
- `0` = Extraction successful
- `1` = File not found or read error
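
When the script is called from other Python code, the exit codes above can be checked through `subprocess`. The following is a minimal sketch, not part of the script itself: `run_extractor` is a hypothetical helper, and running the script through `sys.executable` (rather than `uv run`) is an assumption about your environment.

```python
import subprocess
import sys
from pathlib import Path

# Script location as linked in this document; adjust to your checkout (assumption).
SCRIPT = Path("tools/scripts/extract_html_text.py")


def run_extractor(input_file, output_file=None, script=SCRIPT):
    """Run the CLI once and return (exit_code, stdout, stderr).

    Hypothetical wrapper: it only builds the documented command line
    (INPUT_FILE plus optional --output) and reports what the process returned.
    """
    cmd = [sys.executable, str(script), str(input_file)]
    if output_file is not None:
        cmd += ["--output", str(output_file)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Example use (the input path is hypothetical):
#   code, text, err = run_extractor("page.html")
#   if code != 0:
#       print(f"extraction failed with exit code {code}: {err}", file=sys.stderr)
```

Returning the exit code instead of raising keeps the decision about how to treat a missing or unreadable file with the caller.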

## Extraction Logic

The script processes HTML using a SAX-style parser that:

1. **Auto-detects MHTML**: If the file starts with email-style headers (`From:`, `MIME-Version:`), the script extracts the `text/html` part and decodes its quoted-printable encoding.
2. **Discards non-content tags**: `<script>`, `<style>`, `<noscript>`, `<svg>` tags and all their nested content are stripped completely.
3. **Preserves text from all other elements**: Paragraph text, headings, list items, table cells, and inline elements are collected.
4. **Decodes HTML entities**: `&amp;` → `&`, `&lt;` → `<`, character references like `&#8212;` → `—`.
5. **Normalizes whitespace**: Collapses runs of 3+ blank lines into 2.
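
The five steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's actual code: the names `TextExtractor`, `is_mhtml`, `html_from_mhtml`, and `extract_text` are hypothetical, the UTF-8 fallback is an assumption, and "3+ blank lines into 2" is read here as "four or more consecutive newlines become three".

```python
import email
import re
from html.parser import HTMLParser

# Tags whose content is dropped entirely (taken from step 2 above).
SKIP_TAGS = {"script", "style", "noscript", "svg"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS (hypothetical class)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp;, &#8212;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def is_mhtml(raw: bytes) -> bool:
    """Guess MHTML from leading email-style headers (assumed heuristic)."""
    head = raw[:1024].lstrip().lower()
    return head.startswith((b"from:", b"mime-version:"))


def html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part, with its transfer encoding undone."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)  # decodes quoted-printable
            charset = part.get_content_charset() or "utf-8"  # fallback is an assumption
            return payload.decode(charset, errors="replace")
    return ""


def extract_text(raw: bytes) -> str:
    """Run detection, tag stripping, entity decoding, and whitespace cleanup."""
    if is_mhtml(raw):
        html = html_from_mhtml(raw)
    else:
        html = raw.decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Drop trailing spaces so whitespace-only lines count as blank.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # 3+ blank lines (4+ newlines) collapse to 2 blank lines (3 newlines).
    return re.sub(r"\n{4,}", "\n\n\n", text).strip()
```

A depth counter, rather than a boolean flag, keeps the skip state correct when skipped elements are nested, for example an `<svg>` inside another `<svg>`. Note that this sketch does not insert line breaks between block elements, so adjacent tags without whitespace between them yield concatenated text.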

## Examples

1. Extract text from an HTML file to stdout:

```bash
cd ../../../
echo '<html><body><p>Hello world</p><script>alert("hidden")</script></body></html>' > /tmp/test_extract.html
env -u VIRTUAL_ENV uv run tools/scripts/extract_html_text.py /tmp/test_extract.html
```

```text
Hello world
```


2. Extract to a file:

```bash
env -u VIRTUAL_ENV uv run tools/scripts/extract_html_text.py /tmp/test_extract.html --output /tmp/extracted.txt && cat /tmp/extracted.txt
```

```text
Hello world
```


## Test Suite

The [test suite](/tools/tests/test_extract_html_text.py) covers the full extraction contract:

| Test Class | Coverage |
|------------|----------|
| `TestExtractText` | Unit tests: tag stripping, entity decoding, Unicode, nested tags, empty input |
| `TestCLISuccessPath` | Integration: stdout output, file output, empty files, Unicode files |
| `TestCLIErrorPath` | Error handling: missing files, no arguments, directories |
| `TestIsMhtml` | MHTML detection: headers, plain HTML rejection, edge cases |
| `TestExtractHtmlFromMhtml` | MHTML parsing: HTML extraction, quoted-printable decoding, non-HTML filtering |
| `TestSvgAndBase64Stripping` | Noise removal: SVG path data, base64 image sources |
| `TestMainMhtml` | Integration: MHTML files with various extensions, output to file |

Run tests with:

```bash
uv run pytest tools/tests/test_extract_html_text.py -v
```

```bash
env -u VIRTUAL_ENV uv run pytest tools/tests/test_extract_html_text.py -q
```

```text
....................................                                     [100%]
36 passed in 0.12s
```


```bash
env -u VIRTUAL_ENV uv run pytest tools/tests/test_extract_html_text.py --cov=tools.scripts.extract_html_text --cov-report=term-missing
```

platform linux -- Python 3.13.5, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/commi/Yandex.Disk/it_working/projects/soviar-systems/ai_engineering_book
configfile: pyproject.toml
plugins: cov-7.0.0
collected 36 items

tools/tests/test_extract_html_text.py .................................. [ 94%]
..                                                                       [100%]

_______________ coverage: platform linux, python 3.13.5-final-0 ________________

Name                                 Stmts   Miss  Cover   Missing
------------------------------------------------------------------
to