[BUG] Failures when extracting text from pdf #554

Closed
ryankilroy opened this issue May 23, 2024 · 4 comments

Comments

@ryankilroy

Description

When I extract text from a PDF with certain embedded fonts, some runes are missing from the output. The first page extracts without errors (though it still has missing runes), but when I try to extract the fonts from later pages in the PDF, I get "Can't convert font object, invalid type" errors.

Expected Behavior

I expect to be able to extract usable text from the pdf

Actual Behavior

Extracting text from the pdf results in missing runes

Steps to reproduce the behavior:

  1. Construct a PdfReader from the attached pdf
  2. Get the first page from the reader and construct an extractor with it
  3. Output the result of ExtractText()

If you instead run pdftotext <file.pdf> - against it, the text is fully readable

Attachments

	pdfReader, _, err := model.NewPdfReaderFromFile(filePath, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to create pdf reader: %w", err)
	}
	pageCount := len(pdfReader.PageList)
	var pages []string
	for i := range pageCount {
		pageNum := i + 1
		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return nil, fmt.Errorf("failed to get page %d: %w", pageNum, err)
		}

		ex, err := extractor.New(page)
		if err != nil {
			return nil, fmt.Errorf("failed to create extractor: %w", err)
		}

		text, err := ex.ExtractText()
		if err != nil {
			return nil, fmt.Errorf("failed to extract text: %w", err)
		}
		fmt.Println("=== UNIPDF TEXT ===")
		fmt.Println(text)
	}

	pdfToTextArgs := []string{
		filePath, "-",
	}
	cmd := exec.Command("pdftotext", pdfToTextArgs...)

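	// Surface a pdftotext failure instead of silently printing empty output.
	b, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("failed to run pdftotext: %w", err)
	}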

	content := strings.Replace(string(b), `\n`, "\n", -1)
	fmt.Println("=== PDFTOTEXT TEXT===")
	fmt.Println(content)
=== UNIPDF TEXT ===
Convertibe hoder
...etc

=== PDFTOTEXT ===
Convertible holder
...etc

Sample PDF.pdf

Examples

There are more missing runes in other areas of the actual PDF, but I couldn't replicate them with the anonymized data. Here are some examples:

Issue date Aug   
Board approva date Aug   
Origina principa $ USD
Issue date
Board approval date

Aug. 4, 2023
Aug. 24, 2023

Value
Original principal

$1,000.00 (USD)

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

@kellemNegasi

kellemNegasi commented May 28, 2024

Hi @ryankilroy, thank you for reporting this issue. We were able to reproduce it using the sample code and sample file you provided, and we are currently investigating the cause. We will post an update as soon as we identify the source of the issue and a fix.

@kellemNegasi

kellemNegasi commented Jun 6, 2024

Hi @ryankilroy, after some investigation, we found that the issue is in the ToUnicode map provided in the document. It has an invalid code point for the character code that represents the missing letter (l). Other tools were able to extract the correct character because they fall back to the Replacement Text data provided as part of the marked content. Currently, our extractor doesn't implement this feature, which is why it took the invalid code point (which is, by the way, in the Private Use Area of Unicode) and extracted it as valid text. We plan to incorporate this feature in the future and will post an update on this ticket when it is released.
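Until such a fallback is available, one way a caller might notice this failure mode is to scan the extracted text for Private Use Area code points. The following is only a minimal sketch using the Go standard library: `privateUseRunes` is a hypothetical helper, not part of UniPDF, and `U+E06C` in the sample string is a made-up stand-in for whatever code point the faulty ToUnicode map actually produces.

```go
package main

import (
	"fmt"
	"unicode"
)

// privateUseRunes returns every rune in s that belongs to Unicode
// category Co (private use), in order of appearance.
// Hypothetical helper for illustration; it is not a UniPDF function.
func privateUseRunes(s string) []rune {
	var found []rune
	for _, r := range s {
		if unicode.Is(unicode.Co, r) {
			found = append(found, r)
		}
	}
	return found
}

func main() {
	// Assumption: U+E06C stands in for the invalid code point that the
	// document's ToUnicode map assigns to the letter "l".
	sample := "Convertib\uE06Ce ho\uE06Cder"
	for _, r := range privateUseRunes(sample) {
		fmt.Printf("suspicious rune %U\n", r)
	}
}
```

A page whose extracted text contains such runes could then be flagged for a second pass with another tool, such as the pdftotext comparison in the report above.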

Regarding your second issue: font extraction fails because there are no fonts on pages 3 and beyond (those pages are scanned). The error message is not informative enough to convey this, though. We will update this one too.

@kellemNegasi

Hi @ryankilroy, this issue is fixed in the new release (v3.60.0), which can be found here: https://github.com/unidoc/unipdf/releases/tag/v3.60.0. Closing this ticket as fixed.
