[BUG] Failures when extracting text from pdf #554

ryankilroy · 2024-05-23T00:30:58Z

Description

When I attempt to extract the text from a pdf with certain embedded fonts, some characters (runes) are missing from the result. The first page doesn't throw any errors (although its text still has missing runes), but when I attempt to extract the fonts from later pages in the pdf, I get "Can't convert font object, invalid type" errors.

Expected Behavior

I expect to be able to extract usable text from the pdf

Actual Behavior

Extracting text from the pdf results in missing runes

Steps to reproduce the behavior:

Construct a PdfReader from the attached pdf
Get the first page from the reader and construct an extractor with it
Output the result of ExtractText()

If you instead run pdftotext <file.pdf> - against it, the text is fully readable

Attachments

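	import (
		"fmt"
		"os/exec"
		"strings"

		"github.com/unidoc/unipdf/v3/extractor"
		"github.com/unidoc/unipdf/v3/model"
	)

	// extractPages is an illustrative wrapper name; the original snippet was a bare function body.
	func extractPages(filePath string) ([]string, error) {
		pdfReader, f, err := model.NewPdfReaderFromFile(filePath, nil)
		if err != nil {
			return nil, fmt.Errorf("failed to create pdf reader: %w", err)
		}
		defer f.Close()

		pageCount := len(pdfReader.PageList)
		var pages []string
		for i := range pageCount {
			pageNum := i + 1 // GetPage expects 1-based page numbers
			page, err := pdfReader.GetPage(pageNum)
			if err != nil {
				return nil, fmt.Errorf("failed to get page %d: %w", pageNum, err)
			}

			ex, err := extractor.New(page)
			if err != nil {
				return nil, fmt.Errorf("failed to create extractor: %w", err)
			}

			text, err := ex.ExtractText()
			if err != nil {
				return nil, fmt.Errorf("failed to extract text: %w", err)
			}
			fmt.Println("=== UNIPDF TEXT ===")
			fmt.Println(text)
			pages = append(pages, text)
		}

		// Run pdftotext on the same file for comparison; "-" writes to stdout.
		cmd := exec.Command("pdftotext", filePath, "-")
		b, err := cmd.Output()
		if err != nil {
			return nil, fmt.Errorf("failed to run pdftotext: %w", err)
		}

		content := strings.Replace(string(b), `\n`, "\n", -1)
		fmt.Println("=== PDFTOTEXT ===")
		fmt.Println(content)

		return pages, nil
	}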

=== UNIPDF TEXT ===
Convertibe hoder
...etc

=== PDFTOTEXT ===
Convertible holder
...etc
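
To pin down exactly which characters are being dropped, the two outputs can be compared programmatically. The following is a minimal sketch, not part of unipdf or pdftotext: `missingRunes` is a hypothetical helper that aligns the two strings with a longest-common-subsequence table and reports the runes that appear in the reference text but have no counterpart in the extracted text.

```go
package main

// missingRunes returns the runes of want that have no counterpart in got,
// based on a longest-common-subsequence (LCS) alignment of the two strings.
// The name and signature are illustrative only.
func missingRunes(got, want string) []rune {
	g, w := []rune(got), []rune(want)

	// lcs[i][j] holds the LCS length of g[i:] and w[j:].
	lcs := make([][]int, len(g)+1)
	for i := range lcs {
		lcs[i] = make([]int, len(w)+1)
	}
	for i := len(g) - 1; i >= 0; i-- {
		for j := len(w) - 1; j >= 0; j-- {
			switch {
			case g[i] == w[j]:
				lcs[i][j] = lcs[i+1][j+1] + 1
			case lcs[i+1][j] >= lcs[i][j+1]:
				lcs[i][j] = lcs[i+1][j]
			default:
				lcs[i][j] = lcs[i][j+1]
			}
		}
	}

	// Walk both strings along the alignment and collect unmatched runes of want.
	var missing []rune
	i, j := 0, 0
	for i < len(g) && j < len(w) {
		switch {
		case g[i] == w[j]:
			i++
			j++
		case lcs[i+1][j] >= lcs[i][j+1]:
			i++ // rune only in got (for example an unmapped code point); skip it
		default:
			missing = append(missing, w[j])
			j++
		}
	}
	// Anything left in want after got is exhausted is also missing.
	return append(missing, w[j:]...)
}
```

Calling `missingRunes("Convertibe hoder", "Convertible holder")` on the sample above reports the letter `l` twice, which matches what is visibly dropped. The table is quadratic in the text length, so this is meant for short diagnostic snippets rather than whole documents.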

Sample PDF.pdf

Examples

There are more missing runes in other areas of the actual pdf, but I couldn't replicate them with the anonymized data. Here are some examples:

Issue date Aug   
Board approva date Aug   
Origina principa $ USD

Issue date
Board approval date

Aug. 4, 2023
Aug. 24, 2023

Value
Original principal

$1,000.00 (USD)

github-actions · 2024-05-23T00:31:23Z

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

kellemNegasi · 2024-05-28T09:55:51Z

Hi @ryankilroy, thank you for reporting this issue. We were able to reproduce it using the sample code and sample file you provided, and we are currently investigating the cause. We will post an update as soon as we identify the source of the issue and a fix.

kellemNegasi · 2024-06-06T13:04:19Z

Hi @ryankilroy, after some investigation, we found that the issue is in the ToUnicode map provided in the document. It has an invalid code point for the character code that represents the missing letter (l). Other tools were able to extract the correct character because they fall back to the Replacement Text data provided as part of the marked content. Currently, our extractor doesn't implement this feature, which is why it took the invalid code point (which, incidentally, is in the Private Use Area of Unicode) and extracted it as valid text. We plan to incorporate this feature in the future and will provide an update on this ticket upon its release.
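
As an illustration of the fallback described above, here is a minimal sketch of the decision for a single marked-content span. It is not unipdf's implementation: `hasPrivateUse` and `resolveText` are hypothetical names, and the sketch assumes the caller already has both the ToUnicode-decoded string and the span's replacement text (the `/ActualText` entry in the PDF specification, empty when absent).

```go
package main

import (
	"strings"
	"unicode"
)

// hasPrivateUse reports whether s contains any rune in the Unicode
// private-use category (Co), such as U+E000 through U+F8FF.
func hasPrivateUse(s string) bool {
	return strings.IndexFunc(s, func(r rune) bool {
		return unicode.Is(unicode.Co, r)
	}) >= 0
}

// resolveText picks the text to emit for one marked-content span.
// decoded is the text produced via the ToUnicode map; actualText is the
// span's replacement text, or "" if the span has none. Both parameters
// are hypothetical inputs for this sketch.
func resolveText(decoded, actualText string) string {
	// Assumption: only fall back when the decoded text looks unusable.
	if actualText != "" && hasPrivateUse(decoded) {
		return actualText
	}
	return decoded
}
```

With these definitions, `resolveText("Convertib\uE000e", "Convertible")` yields `Convertible`, while a span with no replacement text is returned unchanged. Falling back only when a private-use code point is present is a conservative choice made for this sketch; other tools may prefer the replacement text whenever it exists.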

Regarding your second issue, i.e., font extraction: it fails because there are no fonts on pages 3 and beyond (those pages are scanned). However, the error message is not informative enough to convey this. We will update this one too.

kellemNegasi · 2024-06-28T12:18:55Z

Hi @ryankilroy, this issue is fixed in the new release (v3.60.0), which can be found here: https://github.com/unidoc/unipdf/releases/tag/v3.60.0. Closing this ticket as fixed.

kellemNegasi closed this as completed Jun 28, 2024
