PyPDF2 doesn't handle tables well. When extracting text from PDFs that contain tables, PyPDF2 typically:

1. Loses the table structure completely
2. Extracts text in an unpredictable reading order
3. Fails to maintain column alignment
4. Often merges cells that should be separate

For PDFs with tables, you would be better off using one of these alternatives:

1. **Tabula-py**: A Python wrapper for Tabula, which is specifically designed for table extraction from PDFs
   ```
   pip install tabula-py
   ```

2. **Camelot**: Excellent for complex tables with merged cells and spanning columns
   ```
   pip install "camelot-py[cv]"
   ```

3. **PyMuPDF (fitz)**: Better overall text extraction with some table structure preservation
   ```
   pip install pymupdf
   ```

4. **pdf-table-extract**: Specialized for table extraction
   ```
   pip install pdf-table-extract
   ```

If tables are critical to your use case, I'd recommend Camelot or Tabula-py, since both are built for table extraction. Both return the extracted tables as pandas DataFrames, which makes further processing much easier.
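
As a rough illustration of the DataFrame workflow recommended above, the sketch below reads the tables on a page range with Camelot. The helper names `pages_arg` and `read_tables` are made up for this example, and the `lattice` flavor is an assumption: it targets tables drawn with ruling lines, while `stream` may suit whitespace-separated tables better.

```python
def pages_arg(start_page: int, end_page: int) -> str:
    """Build Camelot's 1-indexed page string, e.g. (46, 47) -> "46-47"."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    return str(start_page) if start_page == end_page else f"{start_page}-{end_page}"


def read_tables(pdf_path: str, start_page: int, end_page: int, flavor: str = "lattice"):
    """Return one pandas DataFrame per table Camelot detects (hypothetical helper)."""
    import camelot  # imported lazily so pages_arg works without Camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages_arg(start_page, end_page), flavor=flavor)
    # tables.n is the number of detected tables; each table exposes its data as .df
    return [tables[i].df for i in range(tables.n)]
```

Calling `read_tables(pdf_path, 46, 47)` would return a possibly empty list of DataFrames. An empty list means Camelot found no table on those pages with the chosen flavor.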

https://claude.ai/chat/e1daab8c-9373-4bc1-8eef-50e956d56135

In [1]:
!pip uninstall -y tabula-py


In [2]:
pip install pymupdf

Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install pdf-table-extract

Collecting pdf-table-extract
  Downloading pdf-table-extract-0.2.tar.gz (9.1 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: pdf-table-extract
  Building wheel for pdf-table-extract (setup.py): started
  Building wheel for pdf-table-extract (setup.py): finished with status 'done'
  Created wheel for pdf-table-extract: filename=pdf_table_extract-0.2-py3-none-any.whl size=12087 sha256=d1ce2784f7b88392a1a893620cbd9d77a1fe7a3accd5011a62311d1d57176c79
  Stored in directory: c:\users\p2p2l\appdata\local\pip\cache\wheels\46\97\20\b9e353ddb4094a2a13d8e88a87d7fbfb66560b18e3d5b2288f
Successfully built pdf-table-extract
Installing collected packages: pdf-table-extract
Successfully installed pdf-table-extract-0.2
Note: you may need to restart the kernel to use updated packages.


In [14]:
pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.11.6-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250327 (from pdfplumber)
  Downloading pdfminer_six-20250327-py3-none-any.whl.metadata (4.1 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Using cached pypdfium2-4.30.1-py3-none-win_amd64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.6-py3-none-any.whl (60 kB)
Downloading pdfminer_six-20250327-py3-none-any.whl (5.6 MB)
Using cached pypdfium2-4.30.1-py3-none-win_amd64.whl (3.0 MB)
Installing collected packages: pypdfium2, pdfminer.six, pdfplumber
Successfully installed pdfminer.six-20250327 pdfplumber-0.11.6 pypdfium2-4.30.1
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install pdfminer.six

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install camelot-py[cv]

Collecting camelot-py[cv]
  Downloading camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Collecting chardet>=5.1.0 (from camelot-py[cv])
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting pypdf<6.0,>=4.0 (from camelot-py[cv])
  Using cached pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Collecting tabulate>=0.9.0 (from camelot-py[cv])
  Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting opencv-python-headless>=4.7.0.68 (from camelot-py[cv])
  Using cached opencv_python_headless-4.11.0.86-cp37-abi3-win_amd64.whl.metadata (20 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
Using cached opencv_python_headless-4.11.0.86-cp37-abi3-win_amd64.whl (39.4 MB)
Using cached pypdf-5.4.0-py3-none-any.whl (302 kB)
Using cached tabulate-0.9.0-py3-none-any.whl (35 kB)
Downloading camelot_py-1.0.0-py3-none-any.whl (66 kB)
Installing collected packages: tabulate, pypdf, opencv-python-headless, chardet, camelot-py
Successfully installed camelot-py-1.



In [1]:
# import fitz  # PyMuPDF
# import pdfplumber
import pandas as pd
import os

In [2]:
file_path = r"C:\Users\p2p2l\projects\digital-duck\zinets\docs\research\simplified-characters\通用规范汉字表.pdf"

In [3]:
def pdf2text(file_path: str, page_range: tuple = None) -> str:
    """
    Extract text from a PDF file with better support for Chinese text using pdfminer.six.
    
    Args:
        file_path (str): Path to the PDF file
        page_range (tuple, optional): Tuple of (start_page, end_page) for extraction (inclusive).
                                     Page numbers are 1-indexed.
                                     None means all pages. Default is None.
        
    Returns:
        str: Extracted text from the PDF
    """
    try:
        import os
        from pdfminer.high_level import extract_pages, extract_text
        from pdfminer.layout import LAParams
        
        # Check if file exists
        if not os.path.exists(file_path):
            return f"Error: File not found at {file_path}"
        
        if page_range is None:
            # Extract all pages
            text = extract_text(file_path, laparams=LAParams(line_margin=0.5))
            return text
        else:
            if len(page_range) != 2:
                return "Error: page_range must be a tuple of (start_page, end_page)"
            
            # Convert 1-indexed input to 0-indexed for internal use
            start_page, end_page = page_range[0] - 1, page_range[1] - 1
            
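            # Reject non-positive pages and reversed ranges such as (47, 46)
            if start_page < 0 or end_page < start_page:
                return "Error: page_range must satisfy 1 <= start_page <= end_page"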
            
            # Extract specified page range
            text = ""
            for i, page_layout in enumerate(extract_pages(file_path, laparams=LAParams(line_margin=0.5))):
                if start_page <= i <= end_page:
                    page_text = ""
                    for element in page_layout:
                        if hasattr(element, "get_text"):
                            page_text += element.get_text()
                    
                    text += f"--- Page {i + 1} ---\n{page_text}\n\n"
                
                if i > end_page:
                    break
            
            return text
    
    except ImportError:
        return "Error: pdfminer.six library not installed. Install it using 'pip install pdfminer.six'."
    except Exception as e:
        return f"Error extracting text from PDF: {str(e)}"
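
The function above walks the pages itself and filters by index. A shorter variant, sketched here as an assumption rather than as a replacement, passes the page selection to pdfminer.six through the `page_numbers` argument of `extract_text`, which takes 0-indexed page numbers. The names `zero_indexed_pages` and `pdf2text_pages` are hypothetical. Unlike `pdf2text`, this version does not insert `--- Page N ---` separators and raises exceptions instead of returning error strings.

```python
from typing import List, Optional, Tuple


def zero_indexed_pages(page_range: Optional[Tuple[int, int]]) -> Optional[List[int]]:
    """Convert an inclusive 1-indexed (start, end) range to a 0-indexed page list."""
    if page_range is None:
        return None  # pdfminer treats None as "all pages"
    start_page, end_page = page_range
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    return list(range(start_page - 1, end_page))


def pdf2text_pages(file_path: str, page_range: Optional[Tuple[int, int]] = None) -> str:
    """Hypothetical variant of pdf2text that lets pdfminer select the pages."""
    from pdfminer.high_level import extract_text  # lazy import, as in pdf2text
    from pdfminer.layout import LAParams

    return extract_text(
        file_path,
        page_numbers=zero_indexed_pages(page_range),
        laparams=LAParams(line_margin=0.5),
    )
```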

In [4]:
p47 = pdf2text(file_path, page_range=(46,47))
p47

'--- Page 46 ---\n\n\n--- Page 47 ---\n\n\n'
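
Both pages came back empty, which means pdfminer.six found no text objects on them. The notebook does not establish why. One common cause, assumed here purely for illustration, is that the pages are scanned images with no text layer. The sketch below uses pdfplumber (installed in an earlier cell) to count characters and embedded images per page as a quick check of that assumption. `classify_page` and `inspect_pages` are hypothetical names, and the labels are a heuristic, not a definitive diagnosis.

```python
def classify_page(n_chars: int, n_images: int) -> str:
    """Label a page from its character and image counts (heuristic only)."""
    if n_chars > 0:
        return "text"
    if n_images > 0:
        return "image-only"
    return "empty"


def inspect_pages(pdf_path: str, start_page: int, end_page: int) -> dict:
    """Map 1-indexed page numbers in [start_page, end_page] to a classify_page label."""
    import pdfplumber  # lazy import: only needed when a real PDF is inspected

    report = {}
    with pdfplumber.open(pdf_path) as pdf:
        # pdf.pages is a 0-indexed list; page.page_number is 1-indexed
        for page in pdf.pages[start_page - 1:end_page]:
            report[page.page_number] = classify_page(len(page.chars), len(page.images))
    return report
```

Running `inspect_pages(file_path, 46, 47)` would show whether those pages carry any characters at all. If they turn out to be image-only, an OCR step would likely be needed before any text or table extraction can succeed.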