Integration of pdf annotation extraction in Zotero (from zotfile) #1018

Open
jlegewie opened this issue May 30, 2016 · 10 comments


@jlegewie
Contributor

commented May 30, 2016

Hi,

I just wanted to open an issue for a discussion about this. Presumably this is not for 5.0 but if the Zotero devs are interested in this, maybe 5.1.

The idea is to integrate the annotation extraction feature from ZotFile into Zotero. I think it's an essential feature for reference managers and would be awesome for Zotero. It would, however, require some work from the Zotero devs because I won't be able to do everything in a pull request. The feature in ZotFile is pretty mature (e.g. unit tests for extraction quality exist, though without continuous integration). Still, some parts might not be up to Zotero standards, and it would require database and UI changes. But the biggest burden for Zotero core might be future maintenance and user support. You would ship Zotero with a modified version of pdf.js, which is a huge library under rapid development. Updating to the most recent version can be a pain (mainly because I am using a modified version of pdf.js for this feature). Just saying that there are drawbacks.

Proposed UI

[Screenshot: zotero-annotations]

Currently, the attachment info pane has no tabs and just shows the left tab (without the tab bar, of course). Instead, when the user selects an attachment, the pane could show three tabs: "Info" (same as before), "Annotations" with a list of extracted annotations, and "Outline" with the extracted outline.
The Annotations tab would be updated in the background so it would always show the current annotations in the file. The links allow the user to open the PDF attachment on the page with the highlighted text. The "Create separate note with annotations" button would create a child note with the extracted annotations. This is important because the Annotations tab itself cannot be edited. Creating a note would allow the user to work with and edit the annotations in a note by adding comments etc. ZotFile only creates such a note and doesn't show the always-updated list of annotations in a separate non-editable tab. Finally, the Outline tab shows the PDF outline. Clicking on a section in the outline would open the PDF on that page. Of course, this only works for PDFs with an outline.

@dstillman

Member

commented Jun 11, 2016

Thanks for starting this discussion. I've thought for a while that it could make sense to migrate some ZotFile features into Zotero core.

So, yeah, my assumption has always been that, if we did support extracted annotations, we'd want them to be virtual, which I think is basically what you're describing: they'd reflect what was currently in the PDF, not be a manual operation. (Not sure what pdf.js allows, but seems like it might even be possible for them to be writable from the note view, no?)

I'd imagine a different view for the Annotations tab, closer to the existing Notes tab. We've also discussed showing annotations as (virtual) notes under the item in the middle pane.

Re: outlines, that doesn't seem to me like something that belongs in Zotero. I think that makes more sense just being in the PDF viewer.

@jlegewie

Contributor Author

commented Jun 12, 2016

Just a couple of thoughts:

  • Agree about the virtual aspect, but I also think that an option to create a note from annotations is very important because it makes it possible to work with the extracted text (add your own thoughts etc.).
  • Not 100% sure what you mean by "writable", but I don't think that is possible.
  • A benefit of having the outline is that you can jump directly to specific sections in the PDF. Mendeley has a feature like that. But the downside is that most PDFs don't include an outline. It depends on the publisher, and many just don't do it.
@dstillman

Member

commented Jun 12, 2016

By writable I mean the ability to edit your own annotations (as opposed to your highlights) from the virtual entry. But I realize now that your screenshot actually shows a highlight, not an annotation — do you support extraction of both?

The outline part still doesn't make sense to me. The PDF viewer should let you do this, and even pdf.js supports it. Not sure why we would duplicate that functionality — a tab in Zotero is no different from a sidebar in the PDF viewer.

@jlegewie

Contributor Author

commented Jun 12, 2016

Yes, the screenshot shows highlighted text and not a text annotation (I think they are called "Text" and "Highlight" annotations in PDFs). ZotFile supports both (and underlined text as well). Text annotations are pretty easy. It's possible to extract them with the pdf.js API without rendering the actual PDF (so it's also very fast). Extracting highlighted text is much more complicated and not possible with the pdf.js API. ZotFile uses a modified version of pdf.js to extract highlighted and underlined text. It's also more computationally intensive because ZotFile has to render all pages with annotations in a hidden browser to check which text falls into the highlighted area (defined by annotation quadpoints in PDF terminology).
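
To make the quadpoints check described above concrete, here is a minimal sketch. It is not ZotFile's or pdf.js's actual code: the `Rect` and `TextItem` shapes, the function names, and the 50% overlap threshold are all assumptions made for illustration. It assumes the text positions have already been obtained by rendering the page, and that they use the same coordinate space as the annotation.

```typescript
// Hypothetical shapes for illustration only; these are NOT pdf.js types.
interface Rect { x0: number; y0: number; x1: number; y1: number; }
interface TextItem { text: string; box: Rect; }

// A QuadPoints array holds 8 numbers (four x/y corners) per highlighted
// quadrilateral. Corner order varies between PDF producers, so take the
// min/max of the corners instead of relying on a particular order.
function quadPointsToRects(quadPoints: number[]): Rect[] {
  const rects: Rect[] = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    const quad = quadPoints.slice(i, i + 8);
    const xs = quad.filter((_, k) => k % 2 === 0);
    const ys = quad.filter((_, k) => k % 2 === 1);
    rects.push({
      x0: Math.min(...xs), y0: Math.min(...ys),
      x1: Math.max(...xs), y1: Math.max(...ys),
    });
  }
  return rects;
}

// Fraction of the text item's box that lies inside the highlighted area.
function overlapFraction(item: Rect, area: Rect): number {
  const w = Math.min(item.x1, area.x1) - Math.max(item.x0, area.x0);
  const h = Math.min(item.y1, area.y1) - Math.max(item.y0, area.y0);
  if (w <= 0 || h <= 0) return 0;
  const itemArea = (item.x1 - item.x0) * (item.y1 - item.y0);
  return itemArea > 0 ? (w * h) / itemArea : 0;
}

// Keep the text items that mostly fall inside any highlighted quad.
// minOverlap = 0.5 is an arbitrary assumption, not a value from ZotFile.
function extractHighlightedText(
  items: TextItem[], quadPoints: number[], minOverlap = 0.5,
): string {
  const areas = quadPointsToRects(quadPoints);
  return items
    .filter(item => areas.some(a => overlapFraction(item.box, a) >= minOverlap))
    .map(item => item.text)
    .join(" ");
}
```

Using bounding boxes keeps the sketch short, but it would over-select text for rotated or skewed highlights; a real implementation would need a proper point-in-quadrilateral test.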

Re writable: I don't think pdf.js supports any modifications to existing annotations. The official statement is that "PDF.js won't support editing features, it's only a reader" (e.g. timvandermeij's comment here). It would be pretty nice for text annotations, though. Not sure how hard it would be to implement this.

@gracile-fr

Contributor

commented Jun 12, 2016

The outline part still doesn't make sense to me. The PDF viewer should let you do this, and even pdf.js supports it. Not sure why we would duplicate that functionality — a tab in Zotero is no different from a sidebar in the PDF viewer.
It's quite handy to have the outline of a document in Zotero, especially because it's an active outline (i.e. with direct links, as Joscha said). Also, the way it works with ZotFile at the moment, i.e. the fact that a note is created, makes it possible to open it in another window and resize it as needed.
@dstillman

Member

commented Jun 12, 2016

@gracile-fr: But why does it make sense to duplicate that functionality? There seems to be very little difference between clicking to an outline tab in Zotero and clicking on a section vs. opening a PDF, viewing the outline there, and clicking on the relevant section.

@dstillman

Member

commented Jun 12, 2016

The main difference, I suppose, is the ability to browse through PDFs in your library, leaving the outline view selected, and see the outline of each, without having to open each one. Not sure how much that matters for real-world usage, or if it's compelling enough to duplicate (implement, maintain, take up space in the UI with, etc.) functionality that already exists in every PDF viewer.

dstillman added a commit that referenced this issue May 5, 2018
Add zotero://open-pdf handler to open PDF at a given page
This is loosely based on the same functionality in ZotFile, but it tries
to do the right thing based on existing Zotero settings: either the new
PDF handler setting in the prefs or the system-default app. The latter
can only reliably be determined on Windows (and this uses ZotFile's
function to read that from the registry), but this tries to figure it
out on macOS and Linux too using the Mozilla handler service. (The
handler service only gets you an app name, not a path, so on Linux we
can try reading mimetypes.list and the like in case someone is using a
system-default okular or evince not in /usr/bin, but that's not yet
implemented.)

This uses the new 5.0 URL format, and a 'page' query parameter instead
of a path component:

zotero://open-pdf/library/items/[itemKey]?page=[page]
zotero://open-pdf/groups/[groupID]/items/[itemKey]?page=[page]

It also accepts ZotFile-style URLs, though, so if you uninstall ZotFile
you should still be able to open those links. ZotFile will need to
accept the new format for new links to work when ZotFile is installed,
since it will override this handler.

This functionality will be necessary for annotation extraction (#1018)
and for imported annotations from Mendeley (#1451).
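
As an illustration of the two URL shapes listed in the commit message, here is a small sketch of parsing and building them. It is not Zotero's handler code: the `OpenPdfTarget` type and both function names are hypothetical, the item key is assumed to be plain alphanumerics, and the ZotFile-style URLs mentioned above are deliberately not handled because their format is not given here.

```typescript
// Hypothetical types and helpers for illustration; not part of Zotero.
type Library = { kind: "user" } | { kind: "group"; groupID: number };

interface OpenPdfTarget {
  library: Library;
  itemKey: string;
  page: number | null; // null when no ?page= parameter is present
}

// Matches only the two new-style forms from the commit message:
//   zotero://open-pdf/library/items/[itemKey]?page=[page]
//   zotero://open-pdf/groups/[groupID]/items/[itemKey]?page=[page]
// Assumption: item keys are alphanumeric and page is a positive integer.
const OPEN_PDF_RE =
  /^zotero:\/\/open-pdf\/(?:library|groups\/(\d+))\/items\/([A-Za-z0-9]+)(?:\?page=(\d+))?$/;

function parseOpenPdfUrl(url: string): OpenPdfTarget | null {
  const m = OPEN_PDF_RE.exec(url);
  if (!m) return null;
  const groupID = m[1];
  const itemKey = m[2];
  const page = m[3];
  if (!itemKey) return null;
  const library: Library = groupID !== undefined
    ? { kind: "group", groupID: Number(groupID) }
    : { kind: "user" };
  return { library, itemKey, page: page !== undefined ? Number(page) : null };
}

function buildOpenPdfUrl(target: OpenPdfTarget): string {
  const prefix = target.library.kind === "group"
    ? `zotero://open-pdf/groups/${target.library.groupID}`
    : "zotero://open-pdf/library";
  const base = `${prefix}/items/${target.itemKey}`;
  return target.page !== null ? `${base}?page=${target.page}` : base;
}
```

Keeping the page in the query string rather than in the path, as the commit message describes, means the item part of the URL can be matched the same way whether or not a page is given.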
@sojusnik


commented Apr 11, 2019

What is the current status on this essential feature? Any progress recently?

@dstillman

Member

commented Apr 11, 2019

@sojusnik: No ETA, but we're working on this.

@StoltHD


commented Jul 14, 2019

It would be really nice if you could prioritize this, because the pdf.js version used in ZotFile is 5 years old and does not support the latest versions of the PDF standards...

And I find that the newer PDF version 1.7 is used more often now than the "older" versions...
It would be nice if we could keep on extracting annotations, selections and comments from newer PDFs in Zotero instead of searching for new software...

And maybe you could cooperate a little with the Tropy team, they need to implement the same functionality in that software...
