-
Notifications
You must be signed in to change notification settings - Fork 762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IMDb - Fixed creators, date, genre, running time, tags #2069
Conversation
Don't touch the 'translators' directory manually — just click the Save to Zotero button in Scaffold while viewing the translator, and then click Update Translators in the Advanced pane of the Zotero Connector preferences. |
IMDb.js
Outdated
|
||
function detectWeb(doc, url) { | ||
if (url.indexOf('/title/tt')>-1) { | ||
if (url.includes('/title/tt') > -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
includes()
just returns a boolean (here and below)
Thanks for working on this. While this works, going from embedded let json = JSON.parse(document.querySelector('script[type="application/ld+json"]').textContent);
item.date = json.datePublished; // "2018-11-08" |
You're right of course about the localization problem. I'm used to scraping webpages on the go and usually don't pay attention to this problems, but the point of making those translators is precisely to have some generality. |
IMDb.js
Outdated
var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]'); | ||
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]')); | ||
item.title = json.name;// note that json only has the original title | ||
item.date = "datePublished" in json ? json.datePublished : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right way to check if the field is in the json data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to just do item.date = json.datePublished;
. If it's undefined then it'll just be undefined.
They don't get cited, so it's not something we've ever worried about. We privilege actors as contributors because they're so prominent, but there are obviously hundreds of different roles people can have on a movie, and it wouldn't be possible to fully reproduce an IMDB entry in terms of creator types.
Can you give an example of this? |
IMDb.js
Outdated
item.title = json.name;// note that json only has the original title | ||
item.date = "datePublished" in json ? json.datePublished : ""; | ||
item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : ""; | ||
item.genre = "genre" in json ? json.genre : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for all of these. If it's empty it just won't be set. (If that's not the behavior you're seeing, let us know.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember now why I checked everywhere: when json.something is undefined, a .replace or .split will throw an error. So I will keep it where it is necessary, ie for duration and keywords, and make without on the other items
IMDb.js
Outdated
var creators = ZU.xpath(summary, './/span[@itemprop="'+role+'"]//span[@itemprop="name"]'); | ||
for (var i=0; i<creators.length; i++) { | ||
item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role])); | ||
if (!(role in json)) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For our purposes, there's basically no need for in
anywhere. So, e.g., you can just use if (!json.role) continue;
.
IMDb.js
Outdated
var tags = ZU.xpath(doc, '//div[@itemprop="keywords"]/a'); | ||
for (var i=0; i<tags.length; i++) { | ||
item.tags.push(tags[i].textContent); | ||
var tags = "keywords" in json ? json.keywords.split(",") : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type needs to be consistent. json.keywords ? json.keywords.split(",") : []
.
IMDb.js
Outdated
item.tags.push(tags[i].textContent); | ||
var tags = "keywords" in json ? json.keywords.split(",") : ""; | ||
for (let i = 0; i < tags.length; i++) { | ||
item.tags.push(tags[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this. Just assign an array directly to item.tags
above.
Ah, I guess the tests include examples: https://www.imdb.com/title/tt0089276/ The question of translated titles was raised in zotero/zotero-bits#50, though I don't think we came to a firm decision there. Assuming we add a new field for this, we'd probably want to add these to the Extra field as, e.g., "Translated Title: The Official Story" for "La historia oficial". |
So I put it in extra and when (if) a Translated Title field is added we can put it there, is that what you had in mind? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for all the work @Placardo ! I think we can go with the translated title in the extra and use the original title in the title field as you did here. However, we should be aware, that this is changing the current behavior, where it was exactly the opposite: the translated title was saved in the the title field and the original title was saved in the extra field. I guess that one can cite it either or the other way. We can see if there are any reactions in the forum after this changes going live.
I have some small comments and suggestions. Especially I would like to save some of the information like country or studio which where mentioned in the linked thread. Everything should be easy to implement. Let me know when there is anything not clear.
It would be good if we can finish this the next days, as there was another request in the forum.
IMDb.js
Outdated
item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role])); | ||
if (!json[role]) continue; | ||
var creators = json[role]; | ||
if (typeof creators.length == "undefined") item.creators.push(ZU.cleanAuthor(creators.name, creatorsMapping[role])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to check whether creators
is an array or not? Then I would suggest to use the Array.isArray
function. Maybe also add parentheses in the if case, because it is followed by an else case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was clearer since I use creators.length in the for loop just after and it needed to be defined there. Do you mean curly brackets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean curly brackets?
Yes
IMDb.js
Outdated
} | ||
|
||
var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]'); | ||
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is calling EM translator still needed? I think all information come from the json data at the moment. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does but I didn't know whether deleting it would be frowned upon (apparently it's the opposite).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, it is a good idea to start with (generic) EM translator which will already provide some/many information, then only the missing parts need to added. However, that is not the case here. And if we can spare an additional call of this translator it is a little speedup.
IMDb.js
Outdated
if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle; | ||
item.date = json.datePublished; | ||
item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : ""; | ||
item.genre = json.genre; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
item.genre = json.genre; | |
item.genre = json.genre ? json.genre.join(", ") : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a good idea as some films do not return an array when there is only one genre (see eg https://www.imdb.com/title/tt0115964/), and thus .join returns an error which would mean we need another condition to check that it is indeed an array. And it works well without it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the genre can be an array, we should use an explicit check for that. We would never want to rely on implicit array-to-string conversion (which also wouldn't include a space after the comma).
IMDb.js
Outdated
"genre": "Drama, History, War", | ||
"date": "1985-04-03", | ||
"abstractNote": "La historia oficial is a movie starring Norma Aleandro, Héctor Alterio, and Chunchuna Villafañe. During the final months of Argentinian Military Dictatorship in 1983, a high school teacher sets out to find out who the mother of her...", | ||
"extra": "Translated title: The Official Story", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add also the country information from doc.querySelectorAll('a[href*="title?country_of_origin"]')
in the extra with the label event-location
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is calling EM translator still needed? I think all information come from the json data at the moment. No?
If I start using elements from the page, do we need to leave it inside the EM translator then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No
else { | ||
for (var i = 0; i < creators.length; i++) { | ||
if (creators[i]["@type"] == "Person") item.creators.push(ZU.cleanAuthor(creators[i].name, creatorsMapping[role])); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add the company info as well? I suggest something like this
let companyNodes = doc.querySelectorAll('a[href*="/company/"]');
let companies = [];
for (let company of companyNodes) {
companies.push(company);
}
item.distributor = companies.join(', ');
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest here to use the field distributor also it is more like the production company. However, the linked thread here also seems to indicate that maybe the label for this field is not optimal. The list of all distributors for each country and each medium can contain almost 100 entries and that seems unreasonable to "remember" for a citation, e.g. https://www.imdb.com/title/tt0076759/companycredits?ref_=tt_dt_co . CC @adam3smith
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it seems more useful to store production companies (which are btw listed as creators in the JSON data) but this label of distributor seems unnecessarily confusing. One question, just to be sure: did you intentionally push the link or did you just forget to add company.textContent? When you say company info I thought you meant the company name but your example stores the link for the company
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right, I meant the name and not the link. Please add also textContent
.
I.e. for the example something like Lucasfilm, Twentieth Century Fox
trans.itemType = "film"; | ||
trans.doWeb(doc, url); | ||
}); | ||
} | ||
|
||
|
||
function addExtra(item, value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should reuse that function for several values in extra, i.e. don't delete it.
} | ||
var pageId = ZU.xpathText(doc, '//meta[@property="pageId"]/@content'); | ||
if (pageId) { | ||
addExtra(item, "IMDb ID: "+pageId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason to delete that information.
IMDb.js
Outdated
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]')); | ||
item.title = json.name;// note that json only has the original title | ||
var transTitle = ZU.trimInternal(ZU.xpathText(doc, "//div[@class='title_wrapper']/h1/text()")).slice(0, -2); | ||
if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a space after the if
.
@Placardo Let me know when this is ready from your side again for another round of review. |
@zuphilip Seems good to me if you're ok I'll separate into two commits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks perfect now! Thank you very much!
Yes, please separate into 2-3 commits please. Then I can merge this. |
🚀 some issues remain pending though regarding the status of the translated title as main title and the fact that production companies are listed as distributors |
Merged now 🚀 ! Thank you very much @Placardo 🙇♂️ ! |
Yeah,I am aware of the change w.r.t. this. However, I think it is better because the release date will be also clear and we save the translated title still in an extra field. But even if there is a convincing demand to switch this, wen can easily do that in the future.
Yeah, right as well. I will write a comment in the other thread and we will see how this will develop in the future. |
This fixes #1780. Edited scraper works well in Scaffold, then I put it in my zotero/translator folder and he refuses to use it on Chrome/Mozilla (console says : "Error: Could not establish connection. Receiving end does not exist."). Reverted back in my zotero/translator folder to the original script, and the problem persists on both browsers... Since it does not seem to come from my edits, I open the pull request if you know where it comes from please let me know