Skip to content

mdminhazulhaque/html-table-to-json

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

html-table-to-json

Convert HTML Table (with no rowspan/colspan) to JSON using Python

Features

Smart Header Detection - Automatically detects headers in <thead> or first row with <th> elements
Flexible Output Format - Returns list of dictionaries (when headers exist) or list of lists
Robust HTML Parsing - Handles complex nested HTML content within cells
Error Handling - Comprehensive input validation and error reporting
Type Safety - Full type hints for better development experience
Multiple Utility Functions - Various conversion methods for different use cases

Installation

pip install beautifulsoup4

Usage

Basic Usage

from tabletojson import html_to_json

# Convert HTML table to JSON
html_content = "<table>...</table>"
json_result = html_to_json(html_content, indent=2)
print(json_result)

Advanced Usage

from tabletojson import (
    html_to_json,
    html_table_to_dict_list,
    html_table_to_list_of_lists,
    validate_html_table
)

# Validate table first
if validate_html_table(html_content):
    # Force dictionary output (with auto-generated headers if needed)
    dict_result = html_table_to_dict_list(html_content)
    
    # Force list of lists output (ignores headers)
    list_result = html_table_to_list_of_lists(html_content)

Examples

Example 1: Table with Headers

Input:

ID Vendor Product
1 Intel Processor
2 AMD GPU
3 Gigabyte Mainboard

Output

[
    {
        "product": "Processor", 
        "vendor": "Intel", 
        "id": "1"
    }, 
    {
        "product": "GPU", 
        "vendor": "AMD", 
        "id": "2"
    }, 
    {
        "product": "Mainboard", 
        "vendor": "Gigabyte", 
        "id": "3"
    }
]

Example 2: Table without Headers

Input:

1 Intel Processor
2 AMD GPU
3 Gigabyte Mainboard

Output:

[
  [
    "1",
    "Intel",
    "Processor"
  ],
  [
    "2",
    "AMD",
    "GPU"
  ],
  [
    "3",
    "Gigabyte",
    "Mainboard"
  ]
]

Example 3: Complex Table with tbody and Nested Content

Input:

Date Transaction Description Debit/Cheque Credit/Deposit Balance
13 Jul 2022 SOME WORKPLACE
Salary
$3,509.30 OD $1,725.53
12 Jul 2022 ATM DEPOSIT
CARD 1605
$400.00 OD $5,234.83
11 Jul 2022 Another Transaction
Another Transaction
$104.00 OD $5,634.83
11 Jul 2022 MB TRANSFER
TO XX-XXXX-XXXXXXX-51
$4.50 OD $5,738.83

Output:

[
    {
        "date": "13 Jul 2022",
        "transaction description": "SOME WORKPLACESalary",
        "debit/cheque": "",
        "credit/deposit": "$3,509.30",
        "balance": "OD $1,725.53"
    },
    {
        "date": "12 Jul 2022",
        "transaction description": "ATM DEPOSITCARD 1605",
        "debit/cheque": "",
        "credit/deposit": "$400.00",
        "balance": "OD $5,234.83"
    },
    {
        "date": "11 Jul 2022",
        "transaction description": "Another TransactionAnother Transaction",
        "debit/cheque": "",
        "credit/deposit": "$104.00",
        "balance": "OD $5,634.83"
    },
    {
        "date": "11 Jul 2022",
        "transaction description": "MB TRANSFERTO XX-XXXX-XXXXXXX-51",
        "debit/cheque": "$4.50",
        "credit/deposit": "",
        "balance": "OD $5,738.83"
    }
]

Improvements Made

🐛 Bug Fixes

  • Fixed header detection logic that was incorrectly searching for th elements
  • Fixed data structure handling for consistent output format
  • Fixed handling of tables with tbody elements
  • Fixed text extraction from nested HTML elements (preserves spaces)

New Features

  • Type Safety: Added comprehensive type hints for better IDE support
  • Input Validation: Robust error handling with descriptive error messages
  • Multiple Output Formats: New utility functions for different use cases
  • Smart Header Detection: Detects headers in <thead> or first row with <th> elements
  • Empty Row Handling: Automatically skips empty rows
  • Flexible Cell Handling: Handles mismatched column counts gracefully

🚀 Performance & Code Quality

  • Modular Design: Split functionality into focused helper functions
  • Better Error Handling: Comprehensive exception handling with context
  • Documentation: Added detailed docstrings and usage examples
  • Test Coverage: Enhanced test cases covering edge cases and error scenarios

📋 New API Functions

  • html_to_json() - Main conversion function (improved)
  • html_table_to_dict_list() - Force dictionary output with auto-generated headers
  • html_table_to_list_of_lists() - Force list of lists output
  • validate_html_table() - Validate HTML table structure

TODO

  • Support for nested table
  • Support for buggy HTML table (ie. td instead of th in thead)Fixed
  • Improve error handlingAdded comprehensive error handling
  • Add type hintsAdded full type safety
  • Better text extraction from nested elementsImproved HTML parsing
  • Support for rowspan/colspan attributes
  • Add support for custom header mappings
  • Add CSV export functionality

About

🔢 Convert HTML Table to JSON using BeautifulSoup

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages