fix: Handle various HTML encodings in documentation parsing #58

jamesbrink · 2025-10-18T19:10:20Z

Summary

Fixes encoding errors that occur when parsing Home Manager and nix-darwin documentation served with non-UTF-8 encodings.

Problem

Users were encountering intermittent errors like:

Error (ERROR): Failed to fetch docs: unknown encoding: windows-1252

This happened because:

Documentation servers (or CDN edge nodes) sometimes serve HTML with different encoding declarations
Using resp.text forced Python's requests library to decode with the declared encoding, which could fail
Different users might hit different CDN nodes with varying configurations

Solution

Changed BeautifulSoup(resp.text, "html.parser") to BeautifulSoup(resp.content, "html.parser"):

resp.content provides raw bytes instead of pre-decoded text
BeautifulSoup has robust encoding detection that handles various encodings gracefully
This is the recommended approach for parsing HTML from unknown sources

Changes

Core Fix

mcp_nixos/server.py: Use resp.content instead of resp.text in parse_html_options()

Test Coverage

Added comprehensive encoding tests:

Windows-1252 encoding with special characters (café)
UTF-8 with BOM and Unicode characters (你好)
ISO-8859-1 encoding with accented characters (naïve, résumé)

Test Updates

Updated all existing tests to use mock_resp.content (bytes) instead of mock_resp.text (string):

tests/test_server.py - 9 tests updated + 3 new encoding tests
tests/test_edge_cases.py - 3 tests updated
tests/test_evals.py - 1 test updated
tests/test_flakes.py - 7 tests updated
tests/test_mcp_tools.py - 1 test updated
tests/test_plain_text_output.py - 5 tests updated

Testing

✅ All 344 tests passing
✅ Code formatted with ruff
✅ All linting checks pass
✅ Type checking passes (mypy)

Test the fix

# These commands should no longer fail with encoding errors
mcp-nixos home_manager_info "wayland.windowManager.hyprland.package"
mcp-nixos home_manager_info "wayland.windowManager.hyprland.systemd.enable"

Impact

This is a bug fix with no breaking changes. The fix makes the MCP server more robust when fetching documentation from various sources with different encoding configurations.

Fix encoding errors when parsing Home Manager and nix-darwin documentation that may be served with different encodings (windows-1252, UTF-8 with BOM, ISO-8859-1). Changes: - Use resp.content instead of resp.text in parse_html_options() - Let BeautifulSoup handle encoding detection automatically - Add test coverage for windows-1252, UTF-8 with BOM, and ISO-8859-1 - Update all existing tests to use mock_resp.content This resolves intermittent "unknown encoding: windows-1252" errors when fetching documentation from CDN edge servers with different configurations. Fixes errors like: - "Failed to fetch docs: unknown encoding: windows-1252"

claude · 2025-10-18T19:10:39Z

Claude finished @jamesbrink's task —— View job

Code Review Complete ✅

Reviewing PR #58: Handle various HTML encodings in documentation parsing

Todo List:

Review Summary

This is an excellent bug fix that addresses a real-world issue with robust engineering practices. The PR successfully resolves encoding errors while maintaining backward compatibility and adding comprehensive test coverage.

✅ Strengths

Core Fix Quality

Perfect solution: Changing from resp.text to resp.content in mcp_nixos/server.py:254 is the correct approach for handling unknown HTML encodings
Best practice: Leverages BeautifulSoup's robust encoding detection rather than relying on potentially incorrect HTTP headers
Minimal impact: Single-line change in the core function with maximum effectiveness

Test Coverage Excellence

Comprehensive encoding tests: Added 3 new tests covering Windows-1252, UTF-8 with BOM, and ISO-8859-1 encodings (tests/test_server.py:231-307)
Real-world scenarios: Tests include actual special characters (café, 你好, naïve, résumé) that would cause encoding failures
Consistent updates: All 26 existing tests properly updated from mock_resp.text to mock_resp.content across 6 test files

Code Quality

Clear documentation: Excellent comments explaining the encoding issue and solution (server.py:252-253)
Maintains patterns: All changes follow existing code style and testing patterns
No breaking changes: Backward compatible fix that doesn't affect the API

🔍 Technical Analysis

Security: ✅ Secure

No security concerns introduced
Using resp.content with BeautifulSoup is the secure approach for parsing untrusted HTML
No new attack vectors created

Performance: ✅ Neutral/Slightly Better

resp.content avoids unnecessary text decoding by requests library
BeautifulSoup handles encoding detection efficiently
No measurable performance impact expected

Reliability: ✅ Significantly Improved

Eliminates intermittent encoding failures from CDN/server variations
Makes the server more robust across different deployment environments
Graceful handling of various encoding declarations

🎯 Areas of Excellence

Problem Understanding: Clear identification of the root cause (CDN encoding variations)
Solution Elegance: Minimal change with maximum impact
Test Strategy: Proactive testing of multiple encoding scenarios
Documentation: Well-documented change with context
Compatibility: Zero breaking changes

📋 Verification Checklist

✅ Core fix correctly implemented (server.py:254)
✅ All existing tests updated consistently
✅ New encoding tests cover real-world scenarios
✅ Code style maintained (follows project patterns)
✅ No security vulnerabilities introduced
✅ Performance impact minimal/positive
✅ Error handling preserved
✅ Documentation updated appropriately

🚀 Recommendation

APPROVE - This PR is ready for merge. It demonstrates:

Strong understanding of the underlying issue
Excellent engineering judgment in solution selection
Comprehensive testing approach
Attention to detail in maintaining code quality

The fix directly addresses user pain points while following best practices. The author has done exceptional work in both implementation and testing.

codecov · 2025-10-18T20:02:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.26%. Comparing base (ec59088) to head (58679bb).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #58   +/-   ##
=======================================
  Coverage   93.26%   93.26%           
=======================================
  Files           2        2           
  Lines         950      950           
=======================================
  Hits          886      886           
  Misses         64       64

Flag	Coverage Δ
unittests	`93.26% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

jamesbrink self-assigned this Oct 18, 2025

jamesbrink merged commit bfc084d into main Oct 18, 2025
13 of 15 checks passed

jamesbrink deleted the fix/html-encoding-issue branch October 18, 2025 20:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Handle various HTML encodings in documentation parsing #58

fix: Handle various HTML encodings in documentation parsing #58

Uh oh!

jamesbrink commented Oct 18, 2025

Uh oh!

claude bot commented Oct 18, 2025 •

edited

Loading

Uh oh!

codecov bot commented Oct 18, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: Handle various HTML encodings in documentation parsing #58

fix: Handle various HTML encodings in documentation parsing #58

Uh oh!

Conversation

jamesbrink commented Oct 18, 2025

Summary

Problem

Solution

Changes

Core Fix

Test Coverage

Test Updates

Testing

Test the fix

Impact

Uh oh!

claude bot commented Oct 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

Todo List:

Review Summary

✅ Strengths

🔍 Technical Analysis

🎯 Areas of Excellence

📋 Verification Checklist

🚀 Recommendation

Uh oh!

codecov bot commented Oct 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Oct 18, 2025 •

edited

Loading

codecov bot commented Oct 18, 2025 •

edited

Loading