# Instructions for the check_link_format.py Script

---

Owner: Vadim Rudakov, rudakow.wadim@gmail.com
Version: 0.1.0
Birth: 2026-01-24
Last Modified: 2026-01-24

---

## **1. Architectural Overview: The SVA Principle**

This [script](/tools/scripts/check_link_format.py) validates link format in Markdown files, ensuring **ipynb priority** when a Jupytext pair exists.

When a `.md` file has a paired `.ipynb` file, links should point to the `.ipynb` version because `myst.yml` only renders `.ipynb` files. Links to `.md` files cause downloads instead of opening as web pages.

:::{important} **Separation of Concerns**
This script is deliberately separate from `check_broken_links.py`:

* **check_broken_links.py** → validates link targets **exist**
* **check_link_format.py** → validates link **format** (ipynb priority)

This separation keeps each tool focused on a single responsibility.
:::

It adheres to the **Smallest Viable Architecture (SVA)** principle.

:::{hint} **SVA = right tool for the job**
SVA isn't about minimal *code* — it's about **minimal *cognitive and operational overhead***.

* **Zero Dependencies**: Uses only the Python standard library (`pathlib`, `re`, `sys`, `argparse`, `tempfile`), ensuring it runs on any system with Python installed.
* **Idempotency**: Runs in check-only mode: it reports problems and suggested fixes without modifying any file, so repeated runs on the same tree give the same result.
* **High Portability**: Designed for local-only validation, making it ideal for air-gapped or high-security environments.
:::

## **2. Key Capabilities & Logic**

### Core Validation Logic

The script checks every link in `.md` files and flags an error when both of the following hold:

1. The link points to a `.md` file (e.g., `[Guide](path/to/file.md)`)
2. A paired `.ipynb` file exists (e.g., `path/to/file.ipynb`)

**Example:**
```text
# In source.md

[Guide](path/to/file.md)     # ERROR if path/to/file.ipynb exists
[Guide](path/to/file.ipynb)  # OK - correct format
[Readme](valid.md)           # OK if valid.ipynb does NOT exist
```
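The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the helper name `has_paired_notebook` is hypothetical, and the sketch assumes a pair is simply a sibling file with the same stem and an `.ipynb` suffix.

```python
from pathlib import Path
import tempfile


def has_paired_notebook(md_target: Path) -> bool:
    """Return True when a .md target has a sibling .ipynb with the same stem."""
    return md_target.suffix == ".md" and md_target.with_suffix(".ipynb").exists()


# Demonstrate with a throwaway directory that contains one Jupytext pair.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "file.md").write_text("# paired\n")
    (root / "file.ipynb").write_text("{}")
    (root / "valid.md").write_text("# unpaired\n")

    paired_result = has_paired_notebook(root / "file.md")     # link would be flagged
    unpaired_result = has_paired_notebook(root / "valid.md")  # link would be accepted
```

A link to `file.md` would be reported because `file.ipynb` sits next to it, while a link to `valid.md` passes.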

### Link Types Handled

**A. Markdown Links**

Standard syntax: `[text](link)` or `![alt](image)`.

* **Regex**: `r"\[[^\]]*\]\(([^)]+)\)"`

**B. MyST Include Directives**

Used for file transclusion:

* **Syntax**: `` ```{include} path/to/file.md ``
* **Regex**: `` r"```\{include\}([^`\n]+)" ``
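To see what the two patterns capture, the sketch below applies them to sample lines. The regexes are the ones listed above; the function name `extract_links` and the whitespace trimming are assumptions made for illustration.

```python
import re

# Patterns copied from the documentation above.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(([^)]+)\)")
MYST_INCLUDE = re.compile(r"```\{include\}([^`\n]+)")


def extract_links(line: str) -> list[str]:
    """Collect link targets from one line, trimming surrounding whitespace."""
    targets = MARKDOWN_LINK.findall(line) + MYST_INCLUDE.findall(line)
    return [target.strip() for target in targets]


markdown_targets = extract_links("See [Guide](path/to/file.md) and ![alt](img.png).")
include_targets = extract_links("```{include} path/to/file.md")
```

Note that the include pattern captures the space after `{include}`, which is why the sketch strips each target before returning it.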

### What Gets Skipped

* **External URLs**: `https://...`, `http://...`
* **Internal Fragments**: `#section`
* **Non-.md Links**: `image.png`, `data.json`, etc.
* **Excluded Link Strings**: Configured in `paths.py`
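The skip rules can be expressed as a single predicate. The following is a hedged sketch: `should_skip` is a hypothetical name, exclusion is assumed to be an exact string match, and ignoring a trailing `#fragment` before testing the suffix is an assumption rather than documented behavior.

```python
def should_skip(link: str, excluded: frozenset[str] = frozenset()) -> bool:
    """Return True for links the format check has no reason to inspect."""
    if link.startswith(("http://", "https://")):  # external URL
        return True
    if link.startswith("#"):  # fragment within the same page
        return True
    if link in excluded:  # explicitly excluded link string (assumed exact match)
        return True
    path_part = link.split("#", 1)[0]  # assumption: drop a trailing #fragment
    return not path_part.endswith(".md")  # only .md targets are checked
```

For example, `should_skip("image.png")` is true, while `should_skip("guide.md")` is false and the link proceeds to the pairing check.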

### Path Resolution

* **Git Root Awareness**: Uses `git rev-parse --show-toplevel` to locate the repository root
* **Relative Paths**: Resolved relative to the directory of the source file
* **Root-Absolute Paths**: Links starting with `/` (e.g., `/docs/guide.md`) are resolved from the Git root directory
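The two resolution modes can be sketched as follows. The function `resolve_link` is hypothetical, and the Git root is passed in as a parameter here instead of being discovered through `git rev-parse`, so the example runs without a repository.

```python
from pathlib import Path


def resolve_link(link: str, source_file: Path, git_root: Path) -> Path:
    """Map a link string to a filesystem path."""
    if link.startswith("/"):
        # Link written from the repository root: anchor at the Git root.
        return git_root / link.lstrip("/")
    # Relative link: anchor at the directory that contains the source file.
    return source_file.parent / link


root = Path("/repo")
source = root / "docs" / "intro.md"
absolute_target = resolve_link("/docs/guide.md", source, root)
relative_target = resolve_link("guide.md", source, root)
```

Both calls end up at the same file, `docs/guide.md` under the repository root, which shows that the two link styles are interchangeable once resolved.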

## **3. Technical Architecture**

The script is organized into specialized classes:

* **`LinkFormatCLI`**: Main orchestrator. Handles argument parsing and execution flow.
* **`LinkExtractor`**: Scans file content line-by-line using regex to capture Markdown and MyST links.
* **`LinkFormatValidator`**: The core engine. Checks if `.md` links have paired `.ipynb` files.
* **`FileFinder`**: Handles recursive file traversal with exclusion logic.
* **`Reporter`**: Collects errors and handles exit codes.
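As an illustration of the file discovery responsibility, the sketch below walks a directory tree and applies directory and filename exclusions. It is not the `FileFinder` implementation: the function `find_files` and its signature are assumptions chosen for the example.

```python
from pathlib import Path
import tempfile


def find_files(root: Path, pattern: str,
               exclude_dirs: set[str], exclude_files: set[str]) -> list[Path]:
    """Recursively collect files matching pattern, honoring exclusions."""
    found = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part in exclude_dirs for part in parent_parts):
            continue  # file lives under an excluded directory
        if path.name in exclude_files:
            continue  # filename is explicitly excluded
        found.append(path)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / ".venv").mkdir()
    (root / "docs" / "guide.md").write_text("# guide\n")
    (root / ".venv" / "vendored.md").write_text("# ignored\n")
    (root / ".aider.chat.history.md").write_text("# ignored\n")
    names = [
        p.relative_to(root).as_posix()
        for p in find_files(root, "*.md", {".venv"}, {".aider.chat.history.md"})
    ]
```

Only `docs/guide.md` survives: the file under `.venv` and the excluded history file are both filtered out.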

## **4. Operational Guide**

### Configuration Reference

* **Primary Script**: `tools/scripts/check_link_format.py`
* **Exclusion Logic**: Managed via `tools/scripts/paths.py` (reuses `BROKEN_LINKS_EXCLUDE_*` constants)
* **Pre-commit Config**: `.pre-commit-config.yaml`
* **CI Config**: `.github/workflows/quality.yml`

### Command Line Interface

```bash
check_link_format.py [--paths PATH] [--pattern PATTERN] [options]
```

| Argument | Description | Default |
| --- | --- | --- |
| `--paths` | One or more directories or specific file paths to scan. | `.` (Current Dir) |
| `--pattern` | Glob pattern for files to scan. | `*.md` |
| `--exclude-dirs` | List of directory names to ignore. | `in_progress`, `pr`, `.venv` |
| `--exclude-files` | List of specific filenames to ignore. | `.aider.chat.history.md` |
| `--verbose` | Shows detailed logs of skipped URLs and valid links. | `False` |
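The argument table maps naturally onto `argparse`. The parser below is a sketch built from the table alone; the `nargs` choices and the `build_parser` name are assumptions, and the real script may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options and defaults mirror the table above."""
    parser = argparse.ArgumentParser(prog="check_link_format.py")
    parser.add_argument("--paths", nargs="+", default=["."],
                        help="Directories or files to scan.")
    parser.add_argument("--pattern", default="*.md",
                        help="Glob pattern for files to scan.")
    parser.add_argument("--exclude-dirs", nargs="*",
                        default=["in_progress", "pr", ".venv"],
                        help="Directory names to ignore.")
    parser.add_argument("--exclude-files", nargs="*",
                        default=[".aider.chat.history.md"],
                        help="Filenames to ignore.")
    parser.add_argument("--verbose", action="store_true",
                        help="Show skipped URLs and valid links.")
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--paths", "ai_system/", "--verbose"])
```

Options that are not passed keep their defaults, so `args.pattern` stays `*.md` in this call.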

### Manual Execution Commands

Run these from the repository root using `uv` for consistent environment resolution:

| Task | Command |
| --- | --- |
| **Full Repo Audit** | `uv run tools/scripts/check_link_format.py` |
| **Scan Specific Directory** | `uv run tools/scripts/check_link_format.py --paths ai_system/` |
| **Verbose Mode** | `uv run tools/scripts/check_link_format.py --verbose` |

### Examples

```bash
cd ../../../
```

1. Check all `*.md` files in the current directory and subdirectories:

```bash
env -u VIRTUAL_ENV uv run tools/scripts/check_link_format.py 2>&1 | tail -15
```

```text
Using Git root as project root: ai_engineering_book
Found 73 files in: ai_engineering_book/

❌ 5 Link format errors found:
LINK FORMAT ERROR: File '0_intro/ai_systems_grounding_in_computing_disciplines.md:43' links to '/ai_system/1_execution/optimization_fpga_asic_hardware_acceleration_for_ai.md' but paired .ipynb exists.
  Suggested fix: Change to '/ai_system/1_execution/optimization_fpga_asic_hardware_acceleration_for_ai.ipynb'
LINK FORMAT ERROR: File '0_intro/ai_systems_grounding_in_computing_disciplines.md:43' links to '/ai_system/1_execution/optimization_nvidia_gpu_cuda_nsight_and_systems_thinking.md' but paired .ipynb exists.
  Suggested fix: Change to '/ai_system/1_execution/optimization_nvidia_gpu_cuda_nsight_and_systems_thinking.ipynb'
LINK FORMAT ERROR: File 'misc/research/slm_from_scratch/01_foundational_neurons_and_backprop/01_foundations.md:109' links to '/ai_system/1_execution/algebra_gemm_engineering_standard.md' but paired .ipynb exists.
  Suggested fix: Change to '/ai_
```

2. Check a specific directory with verbose output:

```bash
env -u VIRTUAL_ENV uv run tools/scripts/check_link_format.py --paths tools/docs --verbose 2>&1 | head -20
```

3. Check a specific file:

```bash
env -u VIRTUAL_ENV uv run tools/scripts/check_link_format.py --paths README.md
```

```text
Using Git root as project root: ai_engineering_book
Found 1 file in: README.md

✅ All link formats are correct!
```


## **5. Validation Layers**

### Layer 1: Local Pre-commit Hook

The first line of defense runs automatically during the `git commit` process.

* **Scope**: All `.md` files are validated to ensure consistent link format across the repository.
* **Efficiency**: Fast execution ensures no significant delay in the developer's workflow.
* **Logic Tests**: Includes a meta-check (`test-check-link-format`) that triggers whenever the script itself or its tests change.

### Layer 2: GitHub Action (Continuous Integration)

The CI pipeline in `quality.yml` validates **ALL** `.md` files when any documentation changes.

* **Full Repository Scan**: The scan is not limited to the changed files, because a new Jupytext pair can invalidate links in files that were not touched.
* **Trigger Optimization**: Uses `tj-actions/changed-files` to detect whether documentation or script logic changed.
* **Environment Parity**: Runs every command through `uv`, matching the local invocation.
* **Failure Isolation**: Logic tests and format validation run as separate steps, each with its own trigger condition.

:::{tip} `quality.yml` Implementation:
```yaml
link-format:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install uv
      uses: astral-sh/setup-uv@v3
      with:
        enable-cache: true
    - name: Get changed files
      id: changed
      uses: tj-actions/changed-files@v45
      with:
        files_yaml: |
          logic:
            - tools/scripts/check_link_format.py
            - tools/tests/test_check_link_format.py
            - tools/scripts/paths.py
          docs:
            - "**/*.md"
        safe_output: true
    - name: Run Logic Tests
      if: steps.changed.outputs.logic_any_changed == 'true'
      run: uv run pytest tools/tests/test_check_link_format.py
    - name: Run Link Format Check on All Files
      if: steps.changed.outputs.docs_any_changed == 'true'
      run: uv run tools/scripts/check_link_format.py --verbose
```
:::

### Layer 3: Manual Checks

Used for deep repository audits or post-refactoring cleanup.

* **Full Scan**: Run the script without arguments from the repository root to scan every `.md` file.
* **Custom Patterns**: Use `--pattern`, `--exclude-dirs`, and `--exclude-files` to narrow or widen the scan.

## **6. Error Output Format**

When errors are found, the script outputs:

```text
LINK FORMAT ERROR: File 'docs/guide.md:15' links to 'intro.md' but paired .ipynb exists.
  Suggested fix: Change to 'intro.ipynb'
```

The error message includes:
* **Source file and line number**: Where the problematic link is located
* **Current link**: The `.md` link that should be changed
* **Suggested fix**: The correct `.ipynb` link to use
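The message layout can be reproduced with a small formatter. This is a sketch: `format_error` is a hypothetical helper, and the suggested fix is assumed to be the original link with its `.md` suffix swapped for `.ipynb`.

```python
def format_error(source: str, line_no: int, link: str) -> str:
    """Render one error in the two-line layout shown above."""
    # Assumption: the fix is the same link with .md replaced by .ipynb.
    suggestion = link.removesuffix(".md") + ".ipynb"
    return (
        f"LINK FORMAT ERROR: File '{source}:{line_no}' links to '{link}' "
        f"but paired .ipynb exists.\n"
        f"  Suggested fix: Change to '{suggestion}'"
    )


message = format_error("docs/guide.md", 15, "intro.md")
print(message)
```

Because the suffix swap works on the link string itself, root-absolute links such as `/docs/guide.md` keep their leading slash in the suggestion.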

## **7. Test Suite**

The script is accompanied by a comprehensive test suite (`test_check_link_format.py`) with 34 tests covering:

* **Link Extraction**: Verifies Markdown and MyST links are correctly identified
* **Format Validation**: Tests the core logic for detecting `.md` links with `.ipynb` pairs
* **File Discovery**: Tests recursive search and exclusion logic
* **CLI Integration**: End-to-end tests for command-line behavior
* **Edge Cases**: External URLs, fragments, excluded paths

### Running the Tests

```bash
# Run all tests
uv run pytest tools/tests/test_check_link_format.py

# Run with coverage
uv run pytest tools/tests/test_check_link_format.py --cov=tools.scripts.check_link_format --cov-report=term-missing
```

```bash
env -u VIRTUAL_ENV uv run pytest tools/tests/test_check_link_format.py -q
```

```text
..................................                                       [100%]
34 passed in 0.08s
```


```bash
env -u VIRTUAL_ENV uv run pytest tools/tests/test_check_link_format.py --cov=. --cov-report=term-missing -q
```

```text
..................................                                       [100%]
_______________ coverage: platform linux, python 3.13.11-final-0 _______________

Name                                    Stmts   Miss  Cover   Missing
---------------------------------------------------------------------
tools/scripts/check_link_format.py        212     30    86%   120-121, 128, 131, 147, 151-153, 213, 249-250, 272-274, 279, 329-331, 334-336, 340-342, 359-360, 365, 382-384
tools/scripts/paths.py                      7      1    86%   42
tools/tests/test_check_link_format.py     226      0   100%
---------------------------------------------------------------------
TOTAL
```