page.get_text() returns hexadecimal text for some characters #3197

Closed
brandenkmurray opened this issue Feb 22, 2024 · 3 comments
Labels: fix developed (release schedule to be determined), Fixed in next release, upstream bug (bug outside this package)

Comments

@brandenkmurray

brandenkmurray commented Feb 22, 2024

Description of the bug

get_text() returns the numbers in the Cash Flow table of this document as non-printable characters (shown as hexadecimal escapes such as \x8e) instead of digits. Copying and pasting from the page, or extracting with pdftotext, gives the correct text.

How to reproduce the bug

Ford Motor Company (F) Cash Flow - Yahoo Finance - Yahoo Finance.pdf

import fitz
import pdftotext
import pdfplumber

def print_comparison(fn, page):
    """Print the text of one page (0-based index) as extracted by three libraries."""
    # PyMuPDF
    pymupdf_doc = fitz.open(fn)

    # pdftotext (expects a binary file object)
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)

    # pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    # repr() makes non-printable characters visible as escape sequences
    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('Ford.Motor.Company.F.Cash.Flow.-.Yahoo.Finance.-.Yahoo.Finance.pdf', 1)
PyMuPDF:

'Related Tickers\nTTM\n12/31/2023\n12/31/2022\n12/31/2021\n12/31/2020\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d\n\x93,\x95\x92\x90,\x8d\x8d\x8d\n\x8e\x92,\x94\x95\x94,\x8d\x8d\x8d\n\x8f\x91,\x8f\x93\x96,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d\n-\x91,\x90\x91\x94,\x8d\x8d\x8d\n\x8f,\x94\x91\x92,\x8d\x8d\x8d\n-\x8e\x95,\x93\x8e\x92,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x95\x91,\x8d\x8d\x8d\n\x8f,\x92\x8e\x8e,\x8d\x8d\x8d\n-\x8f\x90,\x91\x96\x95,\x8d\x8d\x8d\n\x8f,\x90\x8e\x92,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x8e\x8e\x8d,\x8d\x8d\x8d\n\x8f\x92,\x90\x91\x8d,\x8d\x8d\x8d\n\x8f\x8d,\x94\x90\x94,\x8d\x8d\x8d\n\x8f\x92,\x96\x90\x92,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x95,\x8f\x90\x93,\x8d\x8d\x8d\n-\x93,\x95\x93\x93,\x8d\x8d\x8d\n-\x93,\x8f\x8f\x94,\x8d\x8d\x8d\n-\x92,\x94\x91\x8f,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x92\x8e,\x93\x92\x96,\x8d\x8d\x8d\n\x91\x92,\x91\x94\x8d,\x8d\x8d\x8d\n\x8f\x94,\x96\x8d\x8e,\x8d\x8d\x8d\n\x93\x92,\x96\x8d\x8d,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x8e,\x96\x93\x92,\x8d\x8d\x8d\n-\x91\x92,\x93\x92\x92,\x8d\x8d\x8d\n-\x92\x91,\x8e\x93\x91,\x8d\x8d\x8d\n-\x93\x8d,\x92\x8e\x91,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x90\x90\x92,\x8d\x8d\x8d\n-\x91\x95\x91,\x8d\x8d\x8d\n--\n--\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n\x93,\x93\x95\x8f,\x8d\x8d\x8d\n-\x8e\x90,\x8d\x8d\x8d\n\x96,\x92\x93\x8d,\x8d\x8d\x8d\n\x8e\x95,\x92\x8f\x94,\x8d\x8d\x8d\n \nYahoo Finance Plus Essential\naccess required.\nUnlock Access\nBreakdown\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n12/31/2020 - 6/1/1972\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\xa0\nRIVN\nRivian Automotive, Inc.\n15.39 -3.15%\n\xa0\nNIO\nNIO Inc.\n5.97 +0.17%\n\xa0\nSTLA\nStellantis N.V.\n25.63 
+0.91%\n\xa0\nLCID\nLucid Group, Inc.\n3.7000 +0.54%\n\xa0\nTSLA\nTesla, Inc.\n194.77 +0.52%\n\xa0\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\xa0\nXPEV\nXPeng Inc.\n9.08 +0.89%\n\xa0\nFSR\nFisker Inc.\n0.5579 -11.46%\n\xa0\nCopyright © 2024 Yahoo.\nAll rights reserved.\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\nEXPLORE MORE\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\nABOUT\nData Disclaimer\nHelp\nSu\x0cestions\nSitemap\n'

pdftotext:

'Breakdown\n\nOperating Cash\nFlow\nInvesting Cash\nFlow\nFinancing Cash\nFlow\nEnd Cash Position\nCapital Expenditure\nIssuance of Debt\nRepayment of Debt\nRepurchase of\nCapital Stock\nFree Cash Flow\n\nRelated Tickers\nGM\nGeneral Motors Compa…\n39.49 +1.23%\n\nCopyright © 2024 Yahoo.\nAll rights reserved.\n\nTTM\n\n12/31/2023\n\n12/31/2022\n\n12/31/2021\n\n12/31/2020\n\n14,918,000\n\n14,918,000\n\n6,853,000\n\n15,787,000\n\n24,269,000\n\n-17,628,000\n\n-17,628,000\n\n-4,347,000\n\n2,745,000\n\n-18,615,000\n\n2,584,000\n\n2,584,000\n\n2,511,000\n\n-23,498,000\n\n2,315,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,110,000\n-8,236,000\n51,659,000\n-41,965,000\n\n25,340,000\n-6,866,000\n45,470,000\n-45,655,000\n\n20,737,000\n-6,227,000\n27,901,000\n-54,164,000\n\n25,935,000\n-5,742,000\n65,900,000\n-60,514,000\n\n-335,000\n\n-335,000\n\n-484,000\n\n--\n\n--\n\n6,682,000\n\n6,682,000\n\n-13,000\n\n9,560,000\n\n18,527,000\n\nRIVN\nRivian Automotive, Inc.\n\nNIO\nNIO Inc.\n\nSTLA\nStellantis N.V.\n\nPOPULAR QUOTES\nTesla\nDAX Index\nKOSPI\nDow Jones\nS&P BSE SENSEX\nSPDR S&P 500 ETF Trust\n\nEXPLORE MORE\n\n15.39 -3.15%\n\n5.97 +0.17%\n\n25.63 +0.91%\n\nLCID\nLucid Group, Inc.\n\n3.7000 +0.54%\n\nCredit Score Management\nHousing Market\nActive vs. Passive Investing\nShort Selling\nToday’s Mortgage Rates\nHow Much Mortgage Can You Afford\n\nABOUT\n\nTSLA\nTesla, Inc.\n\n194.77 +0.52%\n\nData Disclaimer\nHelp\nSuggestions\nSitemap\n\n12/31/2020 - 6/1/1972\n\nYahoo Finance Plus Essential\naccess required.\nUnlock Access\n\nTM\nToyota Motor Corporati…\n227.09 +0.14%\n\nXPEV\nXPeng Inc.\n\n9.08 +0.89%\n\nFSR\nFisker Inc.\n\n0.5579 -11.46%\n\n\x0c'

pdfplumber:

'Breakdown TTM 12/31/2023 12/31/2022 12/31/2021 12/31/2020 12/31/2020 - 6/1/1972\nOperating Cash\n\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nYahoo Finance Plus Essential\nInvesting Cash access required.\n-\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nFlow\nUnlock Access\nFinancing Cash\n\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00\nFlow\nEnd Cash Position \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nCapital Expenditure -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00 -\x00,\x00\x00\x00,\x00\x00\x00\nIssuance of Debt \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRepayment of Debt -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00,\x00\x00\x00\nRepurchase of\n-\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -\x00\x00\x00,\x00\x00\x00 -- --\nCapital Stock\nFree Cash Flow \x00,\x00\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 -\x00\x00,\x00\x00\x00 \x00,\x00\x00\x00,\x00\x00\x00 \x00\x00,\x00\x00\x00,\x00\x00\x00\nRelated Tickers\nGM RIVN NIO STLA LCID TSLA TM XPEV FSR\nGeneral Motors Compa… Rivian Automotive, Inc. NIO Inc. Stellantis N.V. Lucid Group, Inc. Tesla, Inc. Toyota Motor Corporati… XPeng Inc. 
Fisker Inc.\n39.49 +1.23% 15.39 -3.15% 5.97 +0.17% 25.63 +0.91% 3.7000 +0.54% 194.77 +0.52% 227.09 +0.14% 9.08 +0.89% 0.5579 -11.46%\nPOPULAR QUOTES EXPLORE MORE ABOUT\nTesla Credit Score Management Data Disclaimer\nCopyright © 2024 Yahoo.\nDAX Index Housing Market Help\nAll rights reserved.\nKOSPI Active vs. Passive Investing Su\x00estions\nShort Selling Sitemap\nDow Jones\nToday’s Mortgage Rates\nS&P BSE SENSEX\nHow Much Mortgage Can You Afford\nSPDR S&P 500 ETF Trust'

Expected behavior (optional)

I expect the numbers in the table to be returned as normal text, as pdftotext does.

PyMuPDF version

1.23.25

Operating system

Linux

Python version

3.10
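
The escapes in the PyMuPDF output above (`\x8d` through `\x96`) all fall in the C1 control range, U+0080 to U+009F. A minimal sketch of how affected pages could be flagged automatically, assuming that any C1 control character in extracted text is a sign of this problem; the helper name `find_c1_controls` is made up for illustration and is not part of PyMuPDF:

```python
def find_c1_controls(text):
    """Return the distinct C1 control characters (U+0080-U+009F) in text, sorted."""
    return sorted({ch for ch in text if 0x80 <= ord(ch) <= 0x9F})


# First table value from the PyMuPDF output in this report.
sample = "\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d"

suspicious = find_c1_controls(sample)
if suspicious:
    print("control characters found:", [hex(ord(ch)) for ch in suspicious])
```

The non-breaking space `\xa0` that also appears in the output is U+00A0, just outside this range, so it is not flagged.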

brandenkmurray changed the title from "page.get_text() returns hexadecimal text for some character" to "page.get_text() returns hexadecimal text for some characters" on Feb 22, 2024
@julian-smith-artifex-com
Collaborator

Thanks for the report.

PyMuPDF built against the latest MuPDF master branch does not include these control characters in the text, so this looks like a MuPDF issue.

I'll ask the MuPDF people about what has changed on MuPDF master relative to PyMuPDF's default MuPDF-1.23.10.

@julian-smith-artifex-com
Collaborator

MuPDF master has support for ActualText, which fixes this problem. We expect MuPDF to move to a new 1.24.x release branch in the next few weeks, which will include ActualText support, so the problem should be fixed in PyMuPDF shortly afterwards.
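
For anyone who cannot upgrade yet, a possible stopgap for this particular PDF can be sketched as follows. Comparing the PyMuPDF and pdftotext outputs in the report (for example `\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d` against `14,918,000`) suggests that code point 0x8D + d stands for the digit d. That mapping is an inference from this one document, not something PyMuPDF or MuPDF documents, and other PDFs will very likely use different codes. The name `restore_digits` is hypothetical.

```python
# Assumption: in this one PDF, the character with code point 0x8D + d
# represents the digit d (inferred by comparing the two outputs in the report).
DIGIT_TABLE = str.maketrans({chr(0x8D + d): str(d) for d in range(10)})


def restore_digits(text):
    """Replace the document-specific control characters with ASCII digits."""
    return text.translate(DIGIT_TABLE)


# Two values copied from the PyMuPDF output in the report.
print(restore_digits("\x8e\x91,\x96\x8e\x95,\x8d\x8d\x8d"))
print(restore_digits("-\x8e\x94,\x93\x8f\x95,\x8d\x8d\x8d"))
```

This only papers over the symptom; the ActualText support described above is the proper fix.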

julian-smith-artifex-com added the labels "upstream bug" (bug outside this package) and "fix developed" (release schedule to be determined) on Feb 23, 2024
julian-smith-artifex-com added a commit that referenced this issue Feb 23, 2024
This is test for #3197. Fixed in MuPDF 1.24.
@julian-smith-artifex-com
Collaborator

Fixed in 1.24.0.
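
Code that depends on correct digits could guard on the installed version. A minimal sketch, assuming "1.24.0" refers to the PyMuPDF release (the commit messages above tie the fix to MuPDF 1.24); `version_tuple` and `has_fix` are illustrative names, not PyMuPDF functions:

```python
import re

# Version named in the comment above; assumed here to be the PyMuPDF version.
FIXED_IN = (1, 24, 0)


def version_tuple(version):
    """Parse the leading 'major.minor.patch' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """Return True if the given version is at or after the release with the fix."""
    return version_tuple(version) >= FIXED_IN


print(has_fix("1.23.25"))  # the version given in the report
print(has_fix("1.24.0"))
```

In a real environment the string would come from the installed package, for example `fitz.VersionBind` in the releases discussed here; that lookup is left out so the sketch runs without PyMuPDF installed.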
