Use pdfloc for figuring out highlight annotation position #4

rschroll opened this Issue Apr 27, 2012 · 9 comments



rschroll commented Apr 27, 2012

Right now, we're looking for the highlighted text in the page. But there's no guarantee that we will parse the text into the same string as the reader will, so this is a bit fragile. Moreover, the entire text of long highlights is not included -- only the first (200?) characters are. If we can work out what the pdfloc information is telling us, we can get around both of these problems, and probably speed up the creation of these annotations as well.

Here's the one source of info I've found so far.


rschroll commented Jun 1, 2012

This thread, specifically these posts, has some more details on the pdfloc format. (It's for an older version of the reader, but I'm hoping it's the same.) The consensus seems to be that the format is

#pdfloc(A, B, C, D, E, F, G, H)
  • A is a four hex-digit hash of the file, somehow.
  • B is the page number.
  • C is 8*(line number) + 9.
  • F is 0.
  • G is 0 for start markers and 1 for end markers.
  • H is 1.

(We're taking both the page and line number to be zero-indexed.) Somehow D and E encode positions in terms of characters. In my very limited testing, D seemed to be the character count while E was 0. I think the link in the previous comment suggests that the character count is D + 2*E. Note that this is not inconsistent with what I've seen. More testing is needed.
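A minimal Python sketch of the guess above, for anyone who wants to test it against their own annotations. It assumes the eight fields are comma-separated, that A is kept as a string, and that B through H are decimal integers; none of that is confirmed. The names `PdfLoc`, `parse_pdfloc`, `guess_line`, and `guess_char_count` are made up for illustration, and the formulas are only the working hypotheses stated here (C = 8*line + 9, character count = D + 2*E), not a verified decoding.

```python
import re
from typing import NamedTuple, Optional


class PdfLoc(NamedTuple):
    """The eight pdfloc fields, named after the letters used in this thread."""
    a: str  # four hex-digit value tied to the file (kept as text)
    b: int  # page number, taken to be zero-indexed
    c: int  # related to the line number
    d: int  # character-related, meaning uncertain
    e: int  # character-related, meaning uncertain
    f: int  # observed to be 0
    g: int  # observed to be 0 or 1
    h: int  # observed to be 1


_PDFLOC_RE = re.compile(r"#pdfloc\(([^)]*)\)")


def parse_pdfloc(text: str) -> PdfLoc:
    """Extract the first #pdfloc(...) marker from text (assumed layout)."""
    match = _PDFLOC_RE.search(text)
    if match is None:
        raise ValueError("no #pdfloc(...) marker found")
    parts = [p.strip() for p in match.group(1).split(",")]
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    return PdfLoc(parts[0], *(int(p) for p in parts[1:]))


def guess_line(loc: PdfLoc) -> Optional[int]:
    """Invert the hypothesis C = 8*line + 9; None if C does not fit it."""
    line, remainder = divmod(loc.c - 9, 8)
    return line if remainder == 0 and line >= 0 else None


def guess_char_count(loc: PdfLoc) -> int:
    """Hypothesis from the linked posts: character count = D + 2*E."""
    return loc.d + 2 * loc.e
```

For example, `parse_pdfloc("#pdfloc(b4b0, 3, 25, 7, 0, 0, 0, 1)")` would give page 3, and `guess_line` on it would return 2 under the hypothesis. Returning `None` when C does not fit the pattern makes it easy to spot files where the guess breaks down.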


rschroll commented Jun 16, 2012

I've written a little script to get the pdfloc information for all annotations of a given file. Get it here: https://gist.github.com/2942769


rschroll commented Nov 5, 2012

I've been playing with a few test PDFs, and the situation is only more muddled. The line number is given by C, but sometimes it only increases by 2/line, and the number of the first line may vary. G is either 0 or 1, but this isn't connected to start or end markers. Somehow D, E, and G give the character count, but there doesn't seem to be a consistent pattern there. For a series of identical lines with identical highlights, you can get a variety of values for these numbers.
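Since neither the step nor the starting value of C is fixed, one possible workaround is to estimate the step per page from the C values actually observed. The Python sketch below is only an illustration under two assumptions that are not established here: that C grows with the line number within a page, and that the step is constant within that page. The function names are hypothetical. The greatest common divisor of the gaps is the largest step consistent with the data, so the true step could still be a divisor of it.

```python
from functools import reduce
from math import gcd
from typing import Dict, Iterable, Optional


def estimate_stride(c_values: Iterable[int]) -> Optional[int]:
    """Largest per-line step of C consistent with the observed values.

    Returns None when fewer than two distinct values are available.
    """
    distinct = sorted(set(c_values))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    if not gaps:
        return None
    return reduce(gcd, gaps)


def relative_lines(c_values: Iterable[int]) -> Dict[int, int]:
    """Map each observed C to a line index relative to the smallest C seen."""
    values = list(c_values)
    stride = estimate_stride(values)
    if stride is None:
        # Zero or one distinct value: everything sits on the same relative line.
        return {c: 0 for c in set(values)}
    base = min(values)
    return {c: (c - base) // stride for c in set(values)}
```

With observed values 9, 17, and 33, this would estimate a step of 8 and place them on relative lines 0, 1, and 3. The result is relative to the first annotated line, not to the first line of the page, because the starting value of C is unknown.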

The only good news is that the font doesn't seem to matter; this really is character-based.

v01d commented Mar 3, 2013

Maybe this helps to give some insight?

I'm eager to see this solved. I quickly tried annotating and then getting the annotated PDF, and I can see it doesn't really work well right now. It would be very nice to be able to do this for my work.


rschroll commented Mar 5, 2013

Thanks for the link, but I'm afraid it's incomplete at best. I've seen C jump by 2, 4, and 8 per line, and G doesn't seem to be correlated to start or end.

If you'd like to help, and I certainly need some at this point, check out this script: https://gist.github.com/2942769. It'll give you a text version of all the pdfloc information, and you can start trying to figure out what's going on in your PDFs.

v01d commented Mar 23, 2013

Hi, I was reviewing a paper making a lot of notes and decided to give the script a try.
I've made some findings, but nothing conclusive or perfectly consistent. In this PDF (created by pdflatex), C = 2 * (line + 1). In general, D indicates the word (0-based) within the line, and E the character count (0-based) within the word. The remaining variable, G, seems to indicate some type of mode which I can't figure out (it does not indicate start or end, as suggested). This word/character finding is mostly valid when G=0. I say mostly because in some cases it didn't hold even with G=0. Also, even though on the reader the highlighting starts at, let's say, word 5, the starting marker may indicate word 4, with a character offset corresponding to all the letters of word 4. The ending marker sometimes even includes some letters of the following word, but on the e-reader and in the reader software this extra word is not highlighted.
I'm using the PRS-T2, btw.
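A rough Python sketch of the word/character reading described above, in case it helps others check it against their own files. It assumes that words are separated by whitespace (how the reader actually splits words is unknown) and that C = 2 * (line + 1), which was only observed in this one pdflatex PDF. The function names are hypothetical, and the whole thing applies, at best, to markers with G=0.

```python
import re
from typing import Optional


def line_from_c(c: int) -> Optional[int]:
    """Invert C = 2 * (line + 1); None if C does not fit that pattern."""
    half, remainder = divmod(c, 2)
    line = half - 1
    return line if remainder == 0 and line >= 0 else None


def char_offset(line_text: str, word_index: int, char_index: int) -> Optional[int]:
    """Offset into line_text for word D (0-based) and character E (0-based).

    Words are taken to be runs of non-whitespace, which is an assumption.
    char_index is not clamped to the word length, because a start marker
    may point just past the end of the previous word.
    """
    words = list(re.finditer(r"\S+", line_text))
    if not 0 <= word_index < len(words):
        return None
    return words[word_index].start() + char_index
```

For the line `the quick brown fox`, D=2 and E=1 would point at offset 11, the `r` in `brown`. A start marker that names the previous word with E equal to that word's length lands on the space right after it, which matches the off-by-one-word behaviour noted above.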

I'm a bit frustrated now because I can't get all of these annotations out of the device, and have them point at the corresponding part of the text.

Lidanha commented May 25, 2015


Has anyone made any progress with figuring out the pdfloc secret?


peci1 commented Aug 11, 2016

@Lidanha Yes, I did :) The docs are at https://github.com/peci1/PDFLocConverter , the Python library is still in alpha version =)

Lidanha commented Aug 19, 2016

Wow! Thanks, I will take a look!
