Skip to content

Conversation

@jamesbrink
Copy link
Member

Summary

Fixes encoding errors that occur when parsing Home Manager and nix-darwin documentation served with non-UTF-8 encodings.

Problem

Users were encountering intermittent errors like:

Error (ERROR): Failed to fetch docs: unknown encoding: windows-1252

This happened because:

  • Documentation servers (or CDN edge nodes) sometimes serve HTML with different encoding declarations
  • Using resp.text forced Python's requests library to decode with the declared encoding, which could fail
  • Different users might hit different CDN nodes with varying configurations

Solution

Changed BeautifulSoup(resp.text, "html.parser") to BeautifulSoup(resp.content, "html.parser"):

  • resp.content provides raw bytes instead of pre-decoded text
  • BeautifulSoup has robust encoding detection that handles various encodings gracefully
  • This is the recommended approach for parsing HTML from unknown sources

Changes

Core Fix

  • mcp_nixos/server.py: Use resp.content instead of resp.text in parse_html_options()

Test Coverage

Added comprehensive encoding tests:

  • Windows-1252 encoding with special characters (café)
  • UTF-8 with BOM and Unicode characters (你好)
  • ISO-8859-1 encoding with accented characters (naïve, résumé)

Test Updates

Updated all existing tests to use mock_resp.content (bytes) instead of mock_resp.text (string):

  • tests/test_server.py - 9 tests updated + 3 new encoding tests
  • tests/test_edge_cases.py - 3 tests updated
  • tests/test_evals.py - 1 test updated
  • tests/test_flakes.py - 7 tests updated
  • tests/test_mcp_tools.py - 1 test updated
  • tests/test_plain_text_output.py - 5 tests updated

Testing

✅ All 344 tests passing
✅ Code formatted with ruff
✅ All linting checks pass
✅ Type checking passes (mypy)

Test the fix

# These commands should no longer fail with encoding errors
mcp-nixos home_manager_info "wayland.windowManager.hyprland.package"
mcp-nixos home_manager_info "wayland.windowManager.hyprland.systemd.enable"

Impact

This is a bug fix with no breaking changes. The fix makes the MCP server more robust when fetching documentation from various sources with different encoding configurations.

Fix encoding errors when parsing Home Manager and nix-darwin documentation
that may be served with different encodings (windows-1252, UTF-8 with BOM,
ISO-8859-1).

Changes:
- Use resp.content instead of resp.text in parse_html_options()
- Let BeautifulSoup handle encoding detection automatically
- Add test coverage for windows-1252, UTF-8 with BOM, and ISO-8859-1
- Update all existing tests to use mock_resp.content

This resolves intermittent "unknown encoding: windows-1252" errors when
fetching documentation from CDN edge servers with different configurations.

Fixes errors like:
- "Failed to fetch docs: unknown encoding: windows-1252"
@claude
Copy link

claude bot commented Oct 18, 2025

Claude finished @jamesbrink's task —— View job


Code Review Complete ✅

Reviewing PR #58: Handle various HTML encodings in documentation parsing

Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Review core fix in mcp_nixos/server.py
  • Analyze new encoding tests
  • Review test updates across all files
  • Assess code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Review security implications
  • Assess test coverage
  • Provide comprehensive feedback

Review Summary

This is an excellent bug fix that addresses a real-world issue with robust engineering practices. The PR successfully resolves encoding errors while maintaining backward compatibility and adding comprehensive test coverage.

✅ Strengths

Core Fix Quality

  • Perfect solution: Changing from resp.text to resp.content in mcp_nixos/server.py:254 is the correct approach for handling unknown HTML encodings
  • Best practice: Leverages BeautifulSoup's robust encoding detection rather than relying on potentially incorrect HTTP headers
  • Minimal impact: Single-line change in the core function with maximum effectiveness

Test Coverage Excellence

  • Comprehensive encoding tests: Added 3 new tests covering Windows-1252, UTF-8 with BOM, and ISO-8859-1 encodings (tests/test_server.py:231-307)
  • Real-world scenarios: Tests include actual special characters (café, 你好, naïve, résumé) that would cause encoding failures
  • Consistent updates: All 26 existing tests properly updated from mock_resp.text to mock_resp.content across 6 test files

Code Quality

  • Clear documentation: Excellent comments explaining the encoding issue and solution (server.py:252-253)
  • Maintains patterns: All changes follow existing code style and testing patterns
  • No breaking changes: Backward compatible fix that doesn't affect the API

🔍 Technical Analysis

Security: ✅ Secure

  • No security concerns introduced
  • Using resp.content with BeautifulSoup is the secure approach for parsing untrusted HTML
  • No new attack vectors created

Performance: ✅ Neutral/Slightly Better

  • resp.content avoids unnecessary text decoding by requests library
  • BeautifulSoup handles encoding detection efficiently
  • No measurable performance impact expected

Reliability: ✅ Significantly Improved

  • Eliminates intermittent encoding failures from CDN/server variations
  • Makes the server more robust across different deployment environments
  • Graceful handling of various encoding declarations

🎯 Areas of Excellence

  1. Problem Understanding: Clear identification of the root cause (CDN encoding variations)
  2. Solution Elegance: Minimal change with maximum impact
  3. Test Strategy: Proactive testing of multiple encoding scenarios
  4. Documentation: Well-documented change with context
  5. Compatibility: Zero breaking changes

📋 Verification Checklist

  • ✅ Core fix correctly implemented (server.py:254)
  • ✅ All existing tests updated consistently
  • ✅ New encoding tests cover real-world scenarios
  • ✅ Code style maintained (follows project patterns)
  • ✅ No security vulnerabilities introduced
  • ✅ Performance impact minimal/positive
  • ✅ Error handling preserved
  • ✅ Documentation updated appropriately

🚀 Recommendation

APPROVE - This PR is ready for merge. It demonstrates:

  • Strong understanding of the underlying issue
  • Excellent engineering judgment in solution selection
  • Comprehensive testing approach
  • Attention to detail in maintaining code quality

The fix directly addresses user pain points while following best practices. The author has done exceptional work in both implementation and testing.


@jamesbrink jamesbrink self-assigned this Oct 18, 2025
@codecov
Copy link

codecov bot commented Oct 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.26%. Comparing base (ec59088) to head (58679bb).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #58   +/-   ##
=======================================
  Coverage   93.26%   93.26%           
=======================================
  Files           2        2           
  Lines         950      950           
=======================================
  Hits          886      886           
  Misses         64       64           
Flag Coverage Δ
unittests 93.26% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jamesbrink jamesbrink merged commit bfc084d into main Oct 18, 2025
13 of 15 checks passed
@jamesbrink jamesbrink deleted the fix/html-encoding-issue branch October 18, 2025 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants