AssertionError: Invalid interval #99

Closed
bdeonovic opened this issue Aug 18, 2021 · 8 comments
Comments

@bdeonovic

What does this error mean?

julia> pdPageExtractText(stdout, page)
ERROR: AssertionError: Invalid interval
Stacktrace:
  [1] Interval
    @ C:\Users\bdeon\.julia\packages\Rectangle\Imrhs\src\interval.jl:5 [inlined]
  [2] Interval
    @ C:\Users\bdeon\.julia\packages\Rectangle\Imrhs\src\interval.jl:20 [inlined]
  [3] on_cmap_command!(stm::IOBuffer, command::Symbol, params::Vector{CosInt}, cmap::PDFIO.PD.CMap)
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:365
  [4] read_cmap(stm::IOBuffer)
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:384
  [5] get_unicode_mapping(cmap_stm::PDFIO.Cos.ID{CosStream})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:153
  [6] get_unicode_mapping(doc::PDFIO.Cos.CosDocImpl, font::PDFIO.Cos.ID{CosDict})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:143
  [7] PDFont
    @ C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:411 [inlined]
  [8] get_pd_font!(doc::PDFIO.PD.PDDocImpl, cosfont::PDFIO.Cos.ID{CosDict})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDDocImpl.jl:112
  [9] get_font(page::PDFIO.PD.PDPageImpl, fontname::CosName)
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:313
 [10] evalContent!(pdo::PDPageElement{:Tf}, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:774
 [11] evalContent!
    @ C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
 [12] evalContent!(pdo::PDPageTextObject, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:719
 [13] evalContent!
    @ C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
 [14] pdPageEvalContent(page::PDFIO.PD.PDPageImpl, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:145
 [15] pdPageEvalContent
    @ C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:144 [inlined]
 [16] pdPageExtractText(io::Base.TTY, page::PDFIO.PD.PDPageImpl)
    @ PDFIO.PD C:\Users\bdeon\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:178
 [17] top-level scope
    @ REPL[10]:1
@sambitdash
Owner

sambitdash commented Aug 18, 2021

The font-encoding-to-Unicode mapping used in this PDF has a problem. Please share the file so I can investigate.

The encoding CMaps define ranges as lo:hi pairs. It seems that, for some reason, a range in your file's mapping has a high value that is lower than its low value, hence the assertion error.

https://github.com/sambitdash/Rectangle.jl/blob/54f36a07257b17b8bc8e1f4698aef20df90d632f/src/interval.jl#L5
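
To make the failing check concrete, here is a minimal Python sketch of an interval type that rejects `lo > hi`. The `Interval` class and the `try_interval` helper are hypothetical stand-ins written for this illustration, not the Rectangle.jl code, and whether the real check is strict or non-strict is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; illustrative stand-in, not Rectangle.jl's type."""
    lo: int
    hi: int

    def __post_init__(self):
        # Assumption: the bounds must be ordered, as the error message suggests.
        if self.lo > self.hi:
            raise AssertionError("Invalid interval")


def try_interval(lo, hi):
    """Return the interval, or the error message if the bounds are reversed."""
    try:
        return Interval(lo, hi)
    except AssertionError as err:
        return str(err)


print(try_interval(2, 5))  # a well-formed interval
print(try_interval(5, 2))  # reversed bounds are rejected
```

Any CMap range whose high bound sorts below its low bound would trip a check of this kind while the font is being loaded, before any text is extracted.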

@bdeonovic
Author

A few comments:

  1. I printed one page of the pdf (print to pdf on windows) which was causing the error so I could post it here as an example. However, when I tried to run the extract on this 1 page example the extract worked.
  2. The extract doesn't correctly extract the text. The first sentence should be:

U ovoj je knjizi riječ pretežito o hobitima i iz nje će čitatelj doznati štošta o njihovu
značaju i nešto malo o njihovoj povijesti.

but the extract function seems to have a problem with the accent marks. I get this:

U ovoj je knjizi rijee
zna
Crvene knjige o Zapadnoj pokrajini koji su ve objelodanjeni pod naslovom Hobit.

which doesn't have accent marks and skips a bunch of text.

Thoughts?

test.pdf

@sambitdash
Owner

I believe the file has font-encoding problems. If I open the file in Adobe Reader, then select and copy the text, I get exactly the text below, which is close to what you are observing. This happens when the fonts' ToUnicode CMaps are not carried over properly. Text extraction works on the same principle as copying and pasting text from a PDF file.

1 O hobitima
U ovoj je knjizi rije e
zna
Crvene knjige o Zapadnoj pokrajini koji su ve objelodanjeni pod naslovom Hobit.
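
As a rough illustration of why a missing ToUnicode mapping drops the accented letters, the Python sketch below builds a lookup table from `bfchar` and `bfrange` entries and decodes character codes with it. The function names are hypothetical. The entries are copied from the CMap fragment posted later in this thread, and silently dropping unmapped codes is an assumption about extractor behavior, not a description of PDFIO.

```python
def build_to_unicode(bfchars, bfranges):
    """Build a code -> character table from bfchar and bfrange entries."""
    table = {}
    for code, uni in bfchars:
        table[code] = chr(uni)
    for lo, hi, start in bfranges:
        # A bfrange maps consecutive codes to consecutive Unicode values.
        for offset, code in enumerate(range(lo, hi + 1)):
            table[code] = chr(start + offset)
    return table


def decode(codes, table, missing=""):
    """Translate character codes to text; unmapped codes become `missing`."""
    return "".join(table.get(code, missing) for code in codes)


# Entries taken from the CMap quoted later in this thread.
BFCHARS = [(0x00FF, 0x0111), (0x0108, 0x0110)]
BFRANGES = [(0x00FB, 0x00FC, 0x0106), (0x00FD, 0x00FE, 0x010C)]

table = build_to_unicode(BFCHARS, BFRANGES)
print(decode([0x00FE, 0x00FC], table))  # with the mapping loaded
print(repr(decode([0x00FE, 0x00FC], {})))  # mapping unavailable: nothing comes out
```

With the table loaded, the two codes decode to accented Croatian letters. With an empty table they vanish, which is consistent with the gaps in the pasted text above.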

@sambitdash
Owner

I will need to examine the original file and its CMap to understand why the mapping is not carried over properly. Please share it here if possible. If there are security concerns, you can mail me at: sambitdash at gmail

@bdeonovic
Author

email sent

@sambitdash
Owner

sambitdash commented Nov 15, 2022

@bdeonovic Sorry for my delay in looking into the file. The CMap in the PDF does not conform to the spec; see Figure 6 in the attached spec.

5014.CIDFont_Spec.pdf

That is why some readers behave differently. I will try to repair the CMap as a special case, but that is not the correct approach: codespace ranges are rectangular regions in the byte plane, not ranges of numbers, so each byte position has its own low and high bound.

/Registry (BKABIP+TT5+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /BKABIP+TT5+0 def
/CMapType 2 def
1 begincodespacerange <00fb> <0108> endcodespacerange

2 beginbfchar
<00ff> <0111>
<0108> <0110>
endbfchar
2 beginbfrange
<00fb> <00fc> <0106>
<00fd> <00fe> <010C>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

The fragment above is the CMap from your file. Per the CMap spec, the codespace range section should contain two ranges:

2 begincodespacerange 
   <00fb> <00ff>
   <0100> <0108> 
endcodespacerange
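
The "rectangle in the byte plane" reading can be sketched in Python as follows: the low and high bounds are compared one byte position at a time rather than as whole numbers. The function names are hypothetical and this is only an illustration of the rule, not the fix that went into PDFIO.

```python
def byte_intervals(lo_hex, hi_hex):
    """Pair up the bounds byte by byte: [(lo_byte, hi_byte), ...]."""
    lo = bytes.fromhex(lo_hex)
    hi = bytes.fromhex(hi_hex)
    if len(lo) != len(hi):
        raise ValueError("codespace bounds must have the same length")
    return list(zip(lo, hi))


def is_valid_codespace(lo_hex, hi_hex):
    """A range is well formed only if every byte position has lo <= hi."""
    return all(lo <= hi for lo, hi in byte_intervals(lo_hex, hi_hex))


def in_codespace(code_hex, lo_hex, hi_hex):
    """Check that each byte of the code lies within its own interval."""
    code = bytes.fromhex(code_hex)
    pairs = byte_intervals(lo_hex, hi_hex)
    return len(code) == len(pairs) and all(
        lo <= b <= hi for b, (lo, hi) in zip(code, pairs)
    )


# The range from the file: the second byte runs from 0xfb down to 0x08.
print(is_valid_codespace("00fb", "0108"))
# The two ranges proposed above are each well formed.
print(is_valid_codespace("00fb", "00ff"), is_valid_codespace("0100", "0108"))
```

Read as plain numbers, `<00fb> <0108>` looks ordered, but byte by byte the second position has a low bound of `0xfb` and a high bound of `0x08`, which is the reversed interval behind the assertion. Splitting the range at the first-byte boundary removes the problem.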

@sambitdash
Owner

> A few comments:
>
>   1. I printed one page of the pdf (print to pdf on windows) which was causing the error so I could post it here as an example. However, when I tried to run the extract on this 1 page example the extract worked.
>   2. The extract doesn't correctly extract the text. The first sentence should be:
>
> U ovoj je knjizi riječ pretežito o hobitima i iz nje će čitatelj doznati štošta o njihovu značaju i nešto malo o njihovoj povijesti.
>
> but the extract function seems to have a problem with the accent marks. I get this:
>
> U ovoj je knjizi rijee zna Crvene knjige o Zapadnoj pokrajini koji su ve objelodanjeni pod naslovom Hobit.
>
> which doesn't have accent marks and skips a bunch of text.
>
> Thoughts?
>
> test.pdf

On Page-6 of the document you shared, I get:

     U ovoj je knjizi riječ pretežito o hobitima i iz nje će čitatelj doznati štošta o njihovu 
     značaju i nešto malo o njihovoj povijesti. 

This is what you are expecting. I have introduced a workaround in the code, but the workaround is not behavior the spec prescribes.

@sambitdash
Owner

9ed161f fixes this now.
