Convert HTML Table (with no rowspan/colspan) to JSON using Python
✅ Smart Header Detection - Automatically detects headers in <thead>
or first row with <th>
elements
✅ Flexible Output Format - Returns list of dictionaries (when headers exist) or list of lists
✅ Robust HTML Parsing - Handles complex nested HTML content within cells
✅ Error Handling - Comprehensive input validation and error reporting
✅ Type Safety - Full type hints for better development experience
✅ Multiple Utility Functions - Various conversion methods for different use cases
pip install beautifulsoup4
from tabletojson import html_to_json
# Convert HTML table to JSON
html_content = "<table>...</table>"
json_result = html_to_json(html_content, indent=2)
print(json_result)
from tabletojson import (
html_to_json,
html_table_to_dict_list,
html_table_to_list_of_lists,
validate_html_table
)
# Validate table first
if validate_html_table(html_content):
# Force dictionary output (with auto-generated headers if needed)
dict_result = html_table_to_dict_list(html_content)
# Force list of lists output (ignores headers)
list_result = html_table_to_list_of_lists(html_content)
Input:
ID | Vendor | Product |
---|---|---|
1 | Intel | Processor |
2 | AMD | GPU |
3 | Gigabyte | Mainboard |
[
{
"product": "Processor",
"vendor": "Intel",
"id": "1"
},
{
"product": "GPU",
"vendor": "AMD",
"id": "2"
},
{
"product": "Mainboard",
"vendor": "Gigabyte",
"id": "3"
}
]
Input:
1 | Intel | Processor |
2 | AMD | GPU |
3 | Gigabyte | Mainboard |
Output:
[
[
"1",
"Intel",
"Processor"
],
[
"2",
"AMD",
"GPU"
],
[
"3",
"Gigabyte",
"Mainboard"
]
]
Input:
Output:
[
{
"date": "13 Jul 2022",
"transaction description": "SOME WORKPLACESalary",
"debit/cheque": "",
"credit/deposit": "$3,509.30",
"balance": "OD $1,725.53"
},
{
"date": "12 Jul 2022",
"transaction description": "ATM DEPOSITCARD 1605",
"debit/cheque": "",
"credit/deposit": "$400.00",
"balance": "OD $5,234.83"
},
{
"date": "11 Jul 2022",
"transaction description": "Another TransactionAnother Transaction",
"debit/cheque": "",
"credit/deposit": "$104.00",
"balance": "OD $5,634.83"
},
{
"date": "11 Jul 2022",
"transaction description": "MB TRANSFERTO XX-XXXX-XXXXXXX-51",
"debit/cheque": "$4.50",
"credit/deposit": "",
"balance": "OD $5,738.83"
}
]
- Fixed header detection logic that was incorrectly searching for
th
elements - Fixed data structure handling for consistent output format
- Fixed handling of tables with
tbody
elements - Fixed text extraction from nested HTML elements (preserves spaces)
- Type Safety: Added comprehensive type hints for better IDE support
- Input Validation: Robust error handling with descriptive error messages
- Multiple Output Formats: New utility functions for different use cases
- Smart Header Detection: Detects headers in
<thead>
or first row with<th>
elements - Empty Row Handling: Automatically skips empty rows
- Flexible Cell Handling: Handles mismatched column counts gracefully
- Modular Design: Split functionality into focused helper functions
- Better Error Handling: Comprehensive exception handling with context
- Documentation: Added detailed docstrings and usage examples
- Test Coverage: Enhanced test cases covering edge cases and error scenarios
html_to_json()
- Main conversion function (improved)html_table_to_dict_list()
- Force dictionary output with auto-generated headershtml_table_to_list_of_lists()
- Force list of lists outputvalidate_html_table()
- Validate HTML table structure
- Support for nested table
-
Support for buggy HTML table (ie.✅ Fixedtd
instead ofth
inthead
) -
Improve error handling✅ Added comprehensive error handling -
Add type hints✅ Added full type safety -
Better text extraction from nested elements✅ Improved HTML parsing - Support for rowspan/colspan attributes
- Add support for custom header mappings
- Add CSV export functionality