Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Libraries Tasmania translator #2832

Merged
merged 8 commits into from
Jul 13, 2022
Merged

Conversation

wragge
Copy link
Contributor

@wragge wragge commented Jun 13, 2022

This is a translator for Libraries Tasmania. The catalogue uses SirsiDynix, but none of the existing translators worked and the system has been customised to include a range of important archival material. There's actually three catalogues rolled into one:

  • the Library catalogue for all the normal library stuff (but also including things like archived websites)
  • the Archives catalogue
  • the Name Index catalogue -- basically transcribed entries from certain archive collections

The translator captures data from all three sections. Many of the records in the Library section have MARC views, so I was able to use the existing MARC translator for those. Some elements are generated by JS, so they took a bit of tracking down. The digital files are rendered and linked to in a variety of different ways. I tried many different solutions, but this one seems to work ok in most cases. The main limit is that if a record links to a multi-image viewer, without a reference to a specific page, no images are saved. But if it's a single image viewer, or there's a link identifying the page, the image/file will be saved -- both images and PDFs. In some cases, such as the Convict Name index, you get multiple image attachments which is pretty cool.

@AbeJellinek
Copy link
Member

Thanks, this is awesome work! That said... I think the issue might just be that we don't have a (working?) translator that targets SIRSI Dynix. This site looks similar, as does this one and this one. None of them are detected by any translator on my end. Could you test out if this translator can target other Dynix sites? If so, let's just make it more general.

@wragge
Copy link
Contributor Author

wragge commented Jun 17, 2022

Good idea, but I think the Libraries Tasmania catalogue is too heavily customized for the translator to work with more generic SirsiDynix systems. Looking at the two examples you mentioned, there are some major differences:

  • The links on items in the search results don't include any item identifiers. To get a link to an item page I think you'd have to get an identifier from the 'Place hold' button (or somewhere else) and build a url. The links in Libraries Tas go to the item page.
  • The other systems don't seem to provide the MARC view that Libraries Tas includes in many of the bib records.
  • It looks like most of the field/class names are different in the item records, so you'd need another set of selectors anyway.

On top of that Libraries Tas adds permalinks, digitised items, and separate custom catalogues for archival material. None of which seem to be standard.

From a quick poke around, it looks like a translator could be created to work with the more generic SirsiDynix systems you've linked to, but I don't think there'd be much overlap with the Libraries Tas translator.

@wragge
Copy link
Contributor Author

wragge commented Jun 17, 2022

Ah looking closer it seems you can get the item identifiers from the onclick event on the titles, but it's still different to Libraries Tas which has the full item url in the href.

@wragge
Copy link
Contributor Author

wragge commented Jun 21, 2022

Is there anything else you'd like me to do?

Copy link
Member

@AbeJellinek AbeJellinek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments:

if (url.includes("results") && getSearchResults(doc, true)) {
return "multiple";
}
else if (catType == "tas" || url.includes("ARCHIVES_")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catType == might be a little sketchy - String#match returns an array, or null if there was no match. Instead, how about catType && catType[1] == "tas" (and similar for the check below)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

return "manuscript";
}
else {
var formats = doc.querySelectorAll("div.displayElementText.text-p.LOCAL_FORMAT");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these classes necessary or would just ".LOCAL_FORMAT" work? For stability if the site changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .LOCAL_FORMAT class is also used in the field label, but simplifying to .displayElementText.LOCAL_FORMAT seems ok.

Comment on lines 73 to 81
if (!items) {
return true;
}
var articles = [];
for (var i in items) {
articles.push(i);
}
ZU.processDocuments(articles, scrape);
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These days this can be simplified to

Suggested change
if (!items) {
return true;
}
var articles = [];
for (var i in items) {
articles.push(i);
}
ZU.processDocuments(articles, scrape);
return false;
if (!items) return;
ZU.processDocuments(Object.keys(items), scrape);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

}

function getFieldText(doc, label) {
let fields = doc.querySelectorAll("div.displayElementText.text-p." + label);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about the selectors here - it seems like we could at least drop the div tag selector, and maybe .displayElementText, since that seems like a more visual and less semantic class name and is liable to change. We could avoid false positives by running querySelectorAll() on a smaller scope instead of the full document, like the main metadata table element if there is a consistent one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I've kept .displayElementText to distinguish from the label, but removed the rest.


function getLinkLists(doc, label, idx) {
var values = [];
let links = doc.querySelectorAll("div." + label + " a[href*='" + idx + "']");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the div necessary here? (Sorry for being annoying about these!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed div

let physParts = physDesc.split(";");
if (physParts.length == 2 && format == "book") {
item.numPages = cleanText(physParts[0]);
item["physical desription"] = cleanText(physParts[1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected

var item = new Zotero.Item("manuscript");

// Types should be Agency, Series, or Item
let typeLabel = ZU.trim(doc.querySelector(".T245_DISPLAY_label").textContent).replace(/:$/, "");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stuff like this should use text(doc, ...) instead of querySelector(...).textContent for null-safety:

Suggested change
let typeLabel = ZU.trim(doc.querySelector(".T245_DISPLAY_label").textContent).replace(/:$/, "");
let typeLabel = ZU.trim(text(doc, ".T245_DISPLAY_label")).replace(/:$/, "");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does text() come from?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here: https://github.com/zotero/translate/blob/e08e318997b80685f03fbc4aae2cf6944fe46cfc/src/translation/translate.js#L1879-L1896

It's added to the translation sandbox for web and search translators. There's also innerText, attr (same idea but takes an attribute name as the last argument and returns the value of that attribute on the matched element, or the empty string), and now request (async HTTP).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in this instance and in a couple of other places.

];
// Add field values to Extras
for (let i = 0; i < fields.length; ++i) {
let label = doc.querySelector("div." + fields[i] + "_label");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selector nitpick!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

div removed

addDigitalFiles(doc, item);
}

// Zotero.debug(item);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"abstractNote": "Official minutes of meetings. | These records are part of the holdings of the Tasmanian Archives",
"callNumber": "AD940",
"libraryCatalog": "Libraries Tasmania",
"manuscriptType": "Series",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if these really belong in manuscriptType... Maybe a line in extra?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the type of archival entity being described -- series, item, or agency -- so I think manuscriptType is the best place for it. Having it there also makes it much easier to browse/filter different types in the Zotero interface. This is also consistent with other translators including National Archives of Australia and State Records Office of WA.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree, but to make sure -- is this expected to be part of citations, typically? manuscriptType aka CSL's genre is used a fair amount in citation styles, so this would show up regularly. If that makes sense, I think this is definitely the right field. If it doesn't, I'd reconsider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a couple of changes. I've kept manuscriptType to indicate the archival entity type ('series', 'item' etc). This is consistent with other translators (those noted above plus the UK National Archives) and is most likely to be used in citations. I've also lowercased these values for consistency.

I was also using manuscriptType for the type of index for names index records. I've changed these so the manuscriptType is now 'item', and the index title is saved in 'Extras'.

@AbeJellinek
Copy link
Member

And makes sense about the scope. We can do a different translator, or expand the existing SIRSI one, for other sites that this can't cover.

@wragge
Copy link
Contributor Author

wragge commented Jul 7, 2022

Hope it's right to go now, want to give it a shout out in a talk I'm giving tomorrow!

@AbeJellinek
Copy link
Member

Ack. I'm really sorry I didn't get to this in time. Looks good to go. I am curious why some of the publisher and label fields aren't getting split into place, publisher/label, and date, since the regex for that looks good, but we can work on that later.

@AbeJellinek AbeJellinek merged commit e3b6326 into zotero:master Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants