Use filename components in metadata retrieval #99

Open
dstillman opened this issue Apr 21, 2012 · 12 comments

@dstillman
Member

commented Apr 21, 2012

@rsnape

Contributor

commented Mar 19, 2013

I've been meandering through the issues list as I'm a coder and academic and use Zotero heaps so would love to give something back. This issue caught my eye - I think I could probably have a go at this. Just wanted to check as I've never coded for Zotero before (except CSL) - do I have to express an interest in addressing an issue or just do it and then issue the pull request "out of the blue"?

@aurimasv

Contributor

commented Mar 19, 2013

No one is working on this as far as I know, so you can just have a stab at it. If you get this to work, send a pull request and reference this issue. Make sure to be working off of the 4.0 branch.

You might find various clean* functions useful for this https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js#L275

@rsnape

Contributor

commented Mar 19, 2013

Wow - that was quick. I'll have a look at it then. Thanks for the tip off re clean* functions.

@ghost


commented Mar 22, 2013

Hey rsnape. I was also interested in writing something for the file name --> bibliographic reference look-up. I am new to large codebases; all I have done so far is numerical code for myself. Any suggestions as to where to start?

@rsnape

Contributor

commented Mar 26, 2013

Hi rajavenks. I haven't done any work on this yet. I think the first place to start is to get a development environment - there are some instructions on the zotero website here http://www.zotero.org/support/dev/getting_started - and then download the latest code.

I was planning to work out where other translation code was called and see whether any of that could be re-used / adapted. Also - see the tip from aurimasv above about clean* functions in utilities.js which should help with extracting the information from the filename.

I guess the hard bit about this enhancement is writing regular expressions that parse the filename and get all available information from it without getting too many "false positives" on words that look like they give information, but are actually irrelevant.

@aurimasv

Contributor

commented Mar 26, 2013

Here are my thoughts on this.

The best pieces of metadata to look for in the file name are DOI, ISBN, and maybe PMID. DOI and ISBN should be easy with the cleanDOI and cleanISBN functions. PMID is a bit more tricky since it's just a bare number of up to 8 digits, with no distinguishing prefix or check digit (so I would hold off on that).
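To illustrate why an ISBN is a safer target than a bare number like a PMID, here is a minimal Python sketch that looks for 10- or 13-digit candidates in a file name and keeps only those whose check digit validates. This is not Zotero's cleanISBN; the names `find_isbn`, `isbn10_ok`, and `isbn13_ok` are hypothetical, and the example file names are made up.

```python
import re
from typing import Optional


def isbn10_ok(digits: str) -> bool:
    """ISBN-10 checksum: weights 10..1, total divisible by 11 ('X' counts as 10)."""
    total = 0
    for i, ch in enumerate(digits):
        value = 10 if ch == "X" else int(ch)
        total += (10 - i) * value
    return total % 11 == 0


def isbn13_ok(digits: str) -> bool:
    """ISBN-13 checksum: weights alternate 1 and 3, total divisible by 10."""
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(digits))
    return total % 10 == 0


def find_isbn(filename: str) -> Optional[str]:
    """Return the first checksum-valid ISBN found in the file name, if any."""
    # Runs of 10 to 13 digits, optionally separated by hyphens or spaces;
    # the final character may be the ISBN-10 check character 'X'.
    pattern = r"(?<!\d)(?:\d[\- ]?){9,12}[\dXx](?!\d)"
    for match in re.finditer(pattern, filename):
        candidate = re.sub(r"[\- ]", "", match.group()).upper()
        if len(candidate) == 10 and isbn10_ok(candidate):
            return candidate
        if len(candidate) == 13 and "X" not in candidate and isbn13_ok(candidate):
            return candidate
    return None
```

The checksum is what keeps false positives down: an arbitrary 10-digit run in a file name will usually fail validation, whereas a PMID offers no comparable test.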

If we can't find those, automatic metadata retrieval will be more complicated. Currently it's not possible via Zotero, but there has been some discussion about using CrossRef (http://www.crossref.org/guestquery/) to find articles based on some combination of author, year, title.

Year is obviously easy to find. Just look for 4 digits that start with 19 or 20.

I would probably expect Author and Title to be separated either by the year, a dash, or a semicolon. Author can also be identified as preceding et al. Title may be missing altogether.
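The heuristics above (a 4-digit year starting with 19 or 20, author and title separated by the year, a dash, or a semicolon, and "et al." marking the author part) could be sketched roughly as follows. This is only an illustration in Python, not code from recognizePDF.js; the function name `split_filename` and the sample file names are assumptions.

```python
import re
from typing import Dict, Optional

# Four digits starting with 19 or 20, not embedded in a longer digit run.
YEAR_RE = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def split_filename(filename: str) -> Dict[str, Optional[str]]:
    """Guess author, year, and title from a file name (heuristic only)."""
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", filename)  # drop the extension
    year_match = YEAR_RE.search(stem)
    if year_match:
        # The year itself acts as the author/title separator.
        year = year_match.group(1)
        before, after = stem[:year_match.start()], stem[year_match.end():]
    else:
        # Otherwise fall back to a spaced dash or a semicolon.
        year = None
        parts = re.split(r"\s+-\s+|;", stem, maxsplit=1)
        before = parts[0]
        after = parts[1] if len(parts) > 1 else ""
    strip_chars = " -_;,.()[]"
    author = re.sub(r"\bet al\.?", "", before).strip(strip_chars)
    title = after.strip(strip_chars)
    return {"author": author or None, "year": year, "title": title or None}
```

For example, `split_filename("Smith et al. - 2010 - Deep sea vents.pdf")` yields author `Smith`, year `2010`, and title `Deep sea vents`, while a name with no title simply leaves that field empty.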

Given that we cannot currently look up metadata based on a title/year/author combination, we can either ignore that part or still extract the info and populate those fields in Zotero. It might actually be more helpful to ignore such limited metadata and leave the PDF (or whatever the attachment is) as a standalone attachment, since that would draw more attention to it.

So unless I'm overlooking something, this doesn't look too complicated right now.

BTW, almost all of the relevant code for this is here: https://github.com/zotero/zotero/blob/master/chrome/content/zotero/recognizePDF.js

@aurimasv

Contributor

commented Mar 26, 2013

Actually, I take back what I said about DOIs. As Dan mentions in the forum thread, DOIs are unlikely to be stored in the file name because they include slashes, which are not allowed in file names. They could appear in encoded form if some other software stored them in the file name, but I doubt that anyone would go to the trouble of encoding them by hand.

I'll work on the CrossRef search translator when I get a chance. It would be pretty exciting to get that working.

@rsnape

Contributor

commented Mar 28, 2013

Yes - it does look like nearly everything we would need is in recognizePDF already. Just a case of adding the filename to the text searched. @aurimasv - trying CrossRef with Author / year / title would be pretty cool as you say.

@ghost


commented Mar 29, 2013

Hi rsnape,

Thanks for the information. I am surprised and glad to know that you have nice documentation to ease in novices like me. I still have a lot to catch up on. To start with, I will try to solve this problem for myself. I have quite a few PDFs without DOIs but with names in the format "Year_JourName_FstAuth_LastAuth.pdf".

Thanks again
R
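For a fixed naming scheme like the "Year_JourName_FstAuth_LastAuth.pdf" format mentioned in this comment, a single anchored regular expression may be enough. The sketch below is a hypothetical Python illustration, assuming that no field contains an underscore; `FilenameFields`, `parse_filename`, and the sample file name are made up for the example.

```python
import re
from typing import NamedTuple, Optional


class FilenameFields(NamedTuple):
    year: str
    journal: str
    first_author: str
    last_author: str


# Year_JourName_FstAuth_LastAuth.pdf, with underscores only as separators.
PATTERN = re.compile(
    r"^((?:19|20)\d{2})_([^_]+)_([^_]+)_([^_]+)\.pdf$",
    re.IGNORECASE,
)


def parse_filename(filename: str) -> Optional[FilenameFields]:
    """Return the four fields, or None if the name does not fit the scheme."""
    match = PATTERN.match(filename)
    return FilenameFields(*match.groups()) if match else None
```

A name such as `2011_PhysRevB_Kumar_Lee.pdf` would parse into its four fields, and anything that deviates from the scheme returns `None` rather than a partial guess.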

@bwiernik

Contributor

commented Jun 12, 2018

Is this still necessary with the new Retrieve Metadata service in place?

@dstillman

Member Author

commented Jun 12, 2018

Probably not, but @mrtcode could comment on that. (If at all desirable, the logic would now go in https://github.com/zotero/recognizer-server, not here.)

@mrtcode

Contributor

commented Jun 12, 2018

Identifiers in filenames are rare but possible. Any other filename information would be too difficult to utilize. For DOIs, slashes are usually replaced with an at sign (@), e.g. 10.1002@pam.21831. Book PDFs can have an ISBN in the filename too, and the same goes for arXiv PDFs and their arXiv IDs.

The current recognizer-server tries to use identifiers or the title from the PDF file metadata or the PDF body. The probability of encountering a PDF for which extraction failed but which has an identifier in the filename is quite low, but it's probably still worth adding this feature.
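As a rough illustration of the filename conventions described in this comment, the Python sketch below turns a DOI whose slash was replaced by "@" back into a regular DOI, and picks out a new-style arXiv ID. It is not code from recognizer-server; the function name `identifier_from_filename` and both patterns are assumptions, and old-style arXiv IDs (which themselves contain a slash) are not handled.

```python
import re
from typing import Optional, Tuple

# DOI prefix, then '@' standing in for the slash, then the suffix.
DOI_AT_RE = re.compile(r"(?<![\d.])(10\.\d{4,9})@(\S+)")
# New-style arXiv ID (YYMM.NNNN or YYMM.NNNNN) with an optional version.
ARXIV_RE = re.compile(r"(?<![\d.])(\d{4}\.\d{4,5})(v\d+)?(?!\d)")


def identifier_from_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Return ("doi", value) or ("arxiv", value), or None if nothing matches."""
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)
    doi = DOI_AT_RE.search(stem)
    if doi:
        # Restore the slash that the file system would not allow.
        return ("doi", f"{doi.group(1)}/{doi.group(2)}")
    arxiv = ARXIV_RE.search(stem)
    if arxiv:
        return ("arxiv", arxiv.group(1))  # version suffix is dropped
    return None
```

With the example from the comment, `identifier_from_filename("10.1002@pam.21831.pdf")` gives `("doi", "10.1002/pam.21831")`. The DOI check runs first because its "10." prefix and "@" separator make it the less ambiguous of the two patterns.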
