Right now, we're looking for the highlighted text in the page. But there's no guarantee that we will parse the text into the same string as the reader will, so this is a bit fragile. Moreover, the entire text of long highlights is not included -- only the first (200?) characters are. If we can work out what the pdfloc information is telling us, we can get around both of these problems, and probably speed up the creation of these annotations as well.
Here's the one source of info I've found so far.
This thread, specifically these posts, has some more information on the pdfloc information. (It's for an older version of the reader, but I'm hoping it's the same.) The consensus seems to be that the format is
#pdfloc(A, B, C, D, E, F, G, H)
(We're taking both the page and line number to be zero-indexed.) Somehow D and E encode positions in terms of characters. In my very limited testing, D seemed to be the character count while E was 0. I think the link in the previous comment suggests that the character count is D + 2*E. Note that this is not inconsistent with what I've seen. More testing is needed.
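To make experimenting with this guess easier, here is a minimal Python sketch that splits a marker into its eight fields and evaluates D + 2*E. The names `parse_pdfloc` and `char_count_guess` are made up for illustration, and reading D and E as decimal integers is an assumption (pass a different `base` if your values turn out to be hexadecimal). The formula itself is only the unconfirmed hypothesis from above.

```python
import re

# Matches the argument list of a '#pdfloc(A, B, C, D, E, F, G, H)' marker.
PDFLOC_RE = re.compile(r"#pdfloc\(([^)]*)\)")


def parse_pdfloc(marker):
    """Return the eight fields of a pdfloc marker as a dict of strings keyed 'A'..'H'."""
    match = PDFLOC_RE.search(marker)
    if match is None:
        raise ValueError("no #pdfloc(...) marker found in %r" % marker)
    values = [part.strip() for part in match.group(1).split(",")]
    if len(values) != 8:
        raise ValueError("expected 8 fields, got %d" % len(values))
    return dict(zip("ABCDEFGH", values))


def char_count_guess(fields, base=10):
    """Evaluate the hypothesised character count D + 2*E.

    Assumption: D and E are integers written in the given base.
    """
    d = int(fields["D"], base)
    e = int(fields["E"], base)
    return d + 2 * e
```

Running `char_count_guess(parse_pdfloc(marker))` over the markers of a test PDF and comparing against the known highlight positions should show quickly whether the guess holds.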
I've written a little script to get the pdfloc information for all annotations of a given file. Get it here: https://gist.github.com/2942769
I've been playing with a few test PDFs, and the situation is only more muddled. The line number is given by C, but sometimes it only increases by 2/line, and the number of the first line may vary. G is either 0 or 1, but this isn't connected to start or end markers. Somehow D, E, and G give the character count, but there doesn't seem to be a consistent pattern there. For a series of identical lines with identical highlights, you can get a variety of values for these numbers.
The only good news is that the font doesn't seem to matter; this really is character-based.
Maybe this helps to give some insight?
I'm eager to see this solved. I quickly tried annotating a PDF and exporting the annotated version, and I can see it doesn't really work well right now. It would be very nice to be able to do this for my work.
Thanks for the link, but I'm afraid it's incomplete at best. I've seen C jump by 2, 4, and 8 per line, and G doesn't seem to be correlated to start or end.
If you'd like to help, and I certainly need some at this point, check out this script: https://gist.github.com/2942769. It'll give you a text version of all the pdfloc information, and you can start trying to figure out what's going on in your PDFs.
Hi, I was reviewing a paper making a lot of notes and decided to give the script a try.
I've made some findings, but nothing conclusive or perfectly consistent. In this PDF (created by pdflatex), C = 2 * (line + 1). In general, D indicates the word (0-based) within the line, and E the character count (0-based) within the word. The remaining variable, G, seems to indicate some type of mode which I can't figure out (it does not indicate start or end, as was suggested). This word/character finding is mostly valid when G=0. I say mostly because in some cases it didn't hold even with G=0. Also, even though on the reader the highlighting starts at, let's say, word 5, the starting marker may indicate word 4 with a character offset corresponding to all the letters of word 4. The ending marker sometimes even includes some letters of the following word, but on the e-reader and in the reader software this extra word is not highlighted.
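For anyone who wants to test this reading against their own files, here is a rough Python sketch of it. It assumes C = 2 * (line + 1), D is the 0-based word index, E is the 0-based character offset within that word, and words are separated by whitespace. The function names are hypothetical, and since the pattern did not hold for every marker, treat the result as a guess to check rather than a decoder.

```python
import re


def line_from_c(c):
    """Invert the observed C = 2 * (line + 1) to get a 0-based line index."""
    if c < 2 or c % 2 != 0:
        raise ValueError("C=%d does not fit C = 2 * (line + 1)" % c)
    return c // 2 - 1


def locate(text_lines, c, d, e):
    """Return (line_index, column) for a marker under the word/character hypothesis.

    text_lines is the page text as a list of lines (however you extracted it).
    E is allowed to equal the word length, because a start marker was seen
    pointing just past the end of the preceding word.
    """
    line_index = line_from_c(c)
    # Start offsets of every whitespace-separated word on that line.
    word_starts = [m.start() for m in re.finditer(r"\S+", text_lines[line_index])]
    if d >= len(word_starts):
        raise ValueError("line %d has no word %d" % (line_index, d))
    return line_index, word_starts[d] + e


page = ["the quick brown fox", "jumps over the lazy dog"]
print(locate(page, c=4, d=2, e=1))  # second line, one character into "the"
```

Markers with G=1 are deliberately not handled, since nobody has worked out what that mode means yet.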
I'm using the PRS-T2, btw.
I'm a bit frustrated now because I can't get all of these annotations out of the device, and have them point at the corresponding part of the text.