Extracted text shows unicode character 65533 #365
Comments
Interesting questions.
This is a Python C function which converts "A single character, represented as a C int" into unicode. The integer
To your other question
Sorry, that's on me, I should've said I was on v1.16.1. Is this materially different from what I get by doing
Yup, that would be a nice convenience method I could use together with a (possibly custom) ToUnicode mapping. Alternatively, I don't mind getting my hands dirty and using _getXrefString, if I can use it to extract the text binary (I'm not intimately familiar with the inner workings of PDFs, although I've read parts of the Adobe PDF Manual). As for extracting the ToUnicode stuff, thanks for showing me how to use _getXrefString, that's very helpful. I'll have to play around with it for a bit, but it looks like a good starting point.
I have just experimented a little. You can always do this:

```python
for b in page.getText("rawdict")["blocks"]:
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                print(ord(char["c"]))
```

This will print the integer behind the unicode
I believe I have made that change to
Using
for a non-printable character, which corresponds to the integer 0xDC40 in this case.
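One possible explanation for a value like 0xDC40 (this is my guess, not confirmed anywhere in the thread): Python's PEP 383 "surrogateescape" error handler smuggles bytes that cannot be decoded through `str` by mapping byte `0xNN` to the lone surrogate `U+DC00 + 0xNN`, so byte `0x40` would surface as 0xDC40. A minimal demonstration of that scheme:

```python
# PEP 383 surrogateescape: an undecodable byte 0xNN becomes lone surrogate U+DC00+NN
data = b"A\xff B"                                   # 0xFF is invalid as a UTF-8 start byte
text = data.decode("utf-8", "surrogateescape")      # keeps the bad byte instead of raising
assert ord(text[1]) == 0xDCFF                       # 0xDC00 + 0xFF

# the mapping is reversible: encoding with the same handler restores the bytes
assert text.encode("utf-8", "surrogateescape") == data

# by the same arithmetic, 0xDC40 would correspond to an original byte 0x40
assert 0xDC40 - 0xDC00 == 0x40
```

Lone surrogates like these are not valid standalone code points, which is why such a character prints as non-printable garbage.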
I've had another look at this - I don't think doing
When running with 1.16.2, on the pdf I provided:
So the last 3 characters are all 65533. (All question mark characters are at that codepoint, really.) So you're right that
I see. I'll check whether there is a translation of
The bottom of this 40KB stream looks like this (line breaks are mine). The
So, your 3 characters reported as 65533 should be the last 3 characters of the last
But let's see if
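For reference, a minimal sketch (function and variable names are mine, not from the thread) of pulling char-code-to-unicode pairs out of the `beginbfchar` section of a ToUnicode CMap stream like the one shown above:

```python
import re

def parse_bfchar(cmap_text):
    """Collect <src> <dst> hex pairs from every beginbfchar ... endbfchar block."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # dst may hold one or more UTF-16BE code units; decode them all
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

sample = """
2 beginbfchar
<0041> <0041>
<0042> <0394>
endbfchar
"""
print(parse_bfchar(sample))  # {65: 'A', 66: 'Δ'}
```

A full parser would also need to handle `beginbfrange ... endbfrange` sections, which map whole runs of codes at once; this sketch covers only the per-character case.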
I'd expect the issue to come from there. To be fair, it's probably following the spec, as the values in the pdf map into some garbage unicode space.
Tragically so - I noticed that as I was digging around with _getXrefString. To be honest, I'll probably end up extracting the pixmap and using some OCR to get the text. It's way slower, but more robust. That being said, thanks for looking into the belly of the PDF.
That's correct - although now that I think about it, without the ToUnicode mapping, this information might be of limited value. Another related point is that RAWDICT returns a shorter form of the font basename (I don't know what the right terminology is). As an example, the font list for page 136:
The rawdict will contain font names like
Again, I think this is mostly academic at this point (going via the OCR route makes more sense in my case), but thought I should mention it.

Some tangential thoughts: I think one strategy to deal with these cases is:
It's an expensive operation (and not 100% accurate), but:
Anyway, thanks for the help with this - on my side I'll just take the easy way out and OCR the pages. Adding the suggested API (returning the original pdf codepoints) might be helpful to other people in the future, if you decide to add it.
Hmmm - no success. I'm getting this:

```python
{'origin': (311.81060791015625, 562.3582763671875), 'bbox': (311.81060791015625, 557.4832763671875, 315.7106018066406, 563.9832763671875), 'c': '�', 'code': 65533}
```

So the value 65533 of
Maybe a way to prevent exactly this copy / paste procedure: only the fontfile program knows how to do the translation.
Yes, some fonts are containers of subfonts, which can be referenced in that manner. Doing so helps limit the PDF file size - potentially a big space saver.
Noted - my experience with pymupdf is currently limited, but I'm sure I'll get more intimate with it over time.
Welcome on board anyway, hope you will enjoy the package.
@victor-ab -

```python
mat = fitz.Matrix(2, 2)  # potentially use to magnify / improve char image
for b in page.get_text("rawdict", flags=0)["blocks"]:  # flags value excludes any images on page
    for l in b["lines"]:
        for s in l["spans"]:
            for char in s["chars"]:
                if char["c"] == 65533:
                    pix = page.get_pixmap(matrix=mat, clip=char["bbox"])
                    # call some OCR magic with 'pix' to receive recovered unicode unc
                    char["c"] = unc
```

The open point is the 'OCR magic'! The rest is more of a no-brainer. With its v1.18.0, MuPDF has introduced integrated support of Tesseract.
So for the time being, a somewhat clumsy way out may be to check whether a page has at least one character code 65533. If yes, hand the respective page to an outside subprocess, which executes pre-installed Tesseract with it. Then extract the text of the returned OCR-ed page ...
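The per-page fallback described above boils down to simple routing logic; in this sketch `ocr_page` is a hypothetical stand-in for the external Tesseract call (e.g. render the page to a pixmap and run a subprocess), not a real PyMuPDF API:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, decimal 65533

def extract_pages(page_texts, ocr_page):
    """Keep normally extracted text; re-do any page containing U+FFFD via OCR."""
    out = []
    for i, text in enumerate(page_texts):
        if REPLACEMENT in text:
            out.append(ocr_page(i))  # hypothetical: render + Tesseract subprocess
        else:
            out.append(text)
    return out

pages = ["clean text", "bro\ufffden text"]
fixed = extract_pages(pages, lambda i: f"<ocr page {i}>")
print(fixed)  # ['clean text', '<ocr page 1>']
```

Only pages that actually contain the replacement character pay the OCR cost, which keeps the slow path rare on mostly-well-mapped documents.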
Hey, I am sorry, I deleted my comment here and moved it to discussion
no problem - saw it there.
This method is perfect for my needs, however I cannot save the OCR-ed character. The dictionary doesn't update.

```python
print(page_test.get_text("rawdict", flags=0)["blocks"][0]['lines'][0]['spans'][0]['chars'][1]['c'])
```
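A likely reason (my reading, not confirmed in the thread): `get_text` builds a brand-new dict on every call, so editing the returned dict cannot change what the *next* call returns - you have to keep and reuse your own copy. A pure-Python stand-in showing the behaviour:

```python
def get_text_rawdict():
    # stand-in for page.get_text("rawdict"): returns a fresh structure each call
    return {"blocks": [{"lines": [{"spans": [{"chars": [{"c": "\ufffd"}]}]}]}]}

d = get_text_rawdict()
d["blocks"][0]["lines"][0]["spans"][0]["chars"][0]["c"] = "A"  # repair the copy

# a second call returns a fresh dict - the edit is not visible there ...
assert get_text_rawdict()["blocks"][0]["lines"][0]["spans"][0]["chars"][0]["c"] == "\ufffd"
# ... but the copy you kept still holds the repaired character
assert d["blocks"][0]["lines"][0]["spans"][0]["chars"][0]["c"] == "A"
```

So the pattern is: call `get_text` once, patch the characters in that one dict, and do all further processing on it, rather than calling `get_text` again and expecting the patch to persist.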
@smokersan1 - a joke? |
I am having issues extracting text from the following pdf:
https://gofile.io/?c=TN6hln
Apologies for the large pdf, I didn't want to modify the document and cut it down in case it messed up some metadata.
Anyway, on pdf page 136 the text is extracted as '�' (though it renders fine).
All of those are "Replacement Characters" with code 65533.
From digging around, as far as I can tell, this is because of the Identity-H mapping in the PDF FontList:
From what I understand there are 2 possibilities here:
My questions would be:
Related discussion that ended with "won't fix": #87
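For context on the Identity-H encoding mentioned above (this is standard PDF behaviour, not specific to this file; the helper name is mine): under Identity-H, the bytes of a text string are read as consecutive 2-byte big-endian CIDs that index glyphs in the font directly. Without a ToUnicode CMap there is therefore no standard route from those CIDs back to unicode, which is why extractors fall back to U+FFFD:

```python
def identity_h_cids(raw):
    """Interpret a PDF string under Identity-H: consecutive 2-byte big-endian CIDs."""
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print(identity_h_cids(b"\x00\x41\x00\x42\x1f\xff"))  # [65, 66, 8191]
```

Note that a CID of 65 is only the letter 'A' if the font happens to be arranged that way; subset fonts routinely reorder glyphs, so the CID values alone carry no reliable textual meaning.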