Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMDb - Fixed creators, date, genre, running time, tags #2069

Merged
merged 2 commits into from
Jan 7, 2020

Conversation

vcarret
Copy link
Contributor

@vcarret vcarret commented Nov 26, 2019

This fixes #1780. Edited scraper works well in Scaffold, then I put it in my zotero/translator folder and he refuses to use it on Chrome/Mozilla (console says : "Error: Could not establish connection. Receiving end does not exist."). Reverted back in my zotero/translator folder to the original script, and the problem persists on both browsers... Since it does not seem to come from my edits, I open the pull request if you know where it comes from please let me know

@dstillman
Copy link
Member

dstillman commented Dec 1, 2019

Edited scraper works well in Scaffold, then I put it in my zotero/translator folder

Don't touch the 'translators' directory manually — just click the Save to Zotero button in Scaffold while viewing the translator, and then click Update Translators in the Advanced pane of the Zotero Connector preferences.

IMDb.js Outdated

function detectWeb(doc, url) {
if (url.indexOf('/title/tt')>-1) {
if (url.includes('/title/tt') > -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

includes() just returns a boolean (here and below)

@dstillman
Copy link
Member

dstillman commented Dec 2, 2019

Thanks for working on this. While this works, going from embedded itemprop data to page scraping will make the translator much more fragile, and it also won't work if IMDB is localized for other countries. Fortunately it doesn't appear to be necessary, since the page includes embedded JSON-LD. While we don't support JSON-LD natively yet, you can still get the <script type="application/ld+json"> content, parse it as JSON, and then pull values such as datePublished and keywords out of there manually:

let json = JSON.parse(document.querySelector('script[type="application/ld+json"]').textContent);
item.date = json.datePublished; // "2018-11-08"

@vcarret
Copy link
Contributor Author

vcarret commented Dec 2, 2019

You're right of course about the localization problem. I'm used to scraping webpages on the go and usually don't pay attention to this problems, but the point of making those translators is precisely to have some generality.
Actors were originally included as "contributors", shouldn't there be a specific field for them? Or on the contrary should they be included at all? (I feel like this is a bit subjective). Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

IMDb.js Outdated
var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]');
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));
item.title = json.name;// note that json only has the original title
item.date = "datePublished" in json ? json.datePublished : "";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right way to check if the field is in the json data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to just do item.date = json.datePublished;. If it's undefined then it'll just be undefined.

@dstillman
Copy link
Member

Actors were originally included as "contributors", shouldn't there be a specific field for them?

They don't get cited, so it's not something we've ever worried about. We privilege actors as contributors because they're so prominent, but there are obviously hundreds of different roles people can have on a movie, and it wouldn't be possible to fully reproduce an IMDB entry in terms of creator types.

Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

Can you give an example of this?

IMDb.js Outdated
item.title = json.name;// note that json only has the original title
item.date = "datePublished" in json ? json.datePublished : "";
item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : "";
item.genre = "genre" in json ? json.genre : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for all of these. If it's empty it just won't be set. (If that's not the behavior you're seeing, let us know.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember now why I checked everywhere: when json.something is undefined, a .replace or .split will throw an error. So I will keep it where it is necessary, ie for duration and keywords, and make without on the other items

IMDb.js Outdated
var creators = ZU.xpath(summary, './/span[@itemprop="'+role+'"]//span[@itemprop="name"]');
for (var i=0; i<creators.length; i++) {
item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role]));
if (!(role in json)) continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For our purposes, there's basically no need for in anywhere. So, e.g., you can just use if (!json.role) continue;.

IMDb.js Outdated
var tags = ZU.xpath(doc, '//div[@itemprop="keywords"]/a');
for (var i=0; i<tags.length; i++) {
item.tags.push(tags[i].textContent);
var tags = "keywords" in json ? json.keywords.split(",") : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type needs to be consistent. json.keywords ? json.keywords.split(",") : [].

IMDb.js Outdated
item.tags.push(tags[i].textContent);
var tags = "keywords" in json ? json.keywords.split(",") : "";
for (let i = 0; i < tags.length; i++) {
item.tags.push(tags[i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this. Just assign an array directly to item.tags above.

@dstillman
Copy link
Member

Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

Can you give an example of this?

Ah, I guess the tests include examples:

https://www.imdb.com/title/tt0089276/
https://www.imdb.com/title/tt0060613/

The question of translated titles was raised in zotero/zotero-bits#50, though I don't think we came to a firm decision there. Assuming we add a new field for this, we'd probably want to add these to the Extra field as, e.g., "Translated Title: The Official Story" for "La historia oficial".

@vcarret
Copy link
Contributor Author

vcarret commented Dec 16, 2019

The question of translated titles was raised in zotero/zotero-bits#50, though I don't think we came to a firm decision there. Assuming we add a new field for this, we'd probably want to add these to the Extra field as, e.g., "Translated Title: The Official Story" for "La historia oficial".

So I put it in extra and when (if) a Translated Title field is added we can put it there, is that what you had in mind?

Copy link
Contributor

@zuphilip zuphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the work @Placardo ! I think we can go with the translated title in the extra and use the original title in the title field as you did here. However, we should be aware, that this is changing the current behavior, where it was exactly the opposite: the translated title was saved in the the title field and the original title was saved in the extra field. I guess that one can cite it either or the other way. We can see if there are any reactions in the forum after this changes going live.

I have some small comments and suggestions. Especially I would like to save some of the information like country or studio which where mentioned in the linked thread. Everything should be easy to implement. Let me know when there is anything not clear.

It would be good if we can finish this the next days, as there was another request in the forum.

IMDb.js Outdated
item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role]));
if (!json[role]) continue;
var creators = json[role];
if (typeof creators.length == "undefined") item.creators.push(ZU.cleanAuthor(creators.name, creatorsMapping[role]));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to check whether creators is an array or not? Then I would suggest to use the Array.isArray function. Maybe also add parentheses in the if case, because it is followed by an else case.

Copy link
Contributor Author

@vcarret vcarret Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was clearer since I use creators.length in the for loop just after and it needed to be defined there. Do you mean curly brackets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean curly brackets?

Yes

IMDb.js Outdated
}

var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]');
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is calling EM translator still needed? I think all information come from the json data at the moment. No?

Copy link
Contributor Author

@vcarret vcarret Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does but I didn't know whether deleting it would be frowned upon (apparently it's the opposite).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, it is a good idea to start with (generic) EM translator which will already provide some/many information, then only the missing parts need to added. However, that is not the case here. And if we can spare an additional call of this translator it is a little speedup.

IMDb.js Outdated
if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle;
item.date = json.datePublished;
item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : "";
item.genre = json.genre;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
item.genre = json.genre;
item.genre = json.genre ? json.genre.join(", ") : "";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a good idea as some films do not return an array when there is only one genre (see eg https://www.imdb.com/title/tt0115964/), and thus .join returns an error which would mean we need another condition to check that it is indeed an array. And it works well without it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the genre can be an array, we should use an explicit check for that. We would never want to rely on implicit array-to-string conversion (which also wouldn't include a space after the comma).

IMDb.js Outdated
"genre": "Drama, History, War",
"date": "1985-04-03",
"abstractNote": "La historia oficial is a movie starring Norma Aleandro, Héctor Alterio, and Chunchuna Villafañe. During the final months of Argentinian Military Dictatorship in 1983, a high school teacher sets out to find out who the mother of her...",
"extra": "Translated title: The Official Story",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add also the country information from doc.querySelectorAll('a[href*="title?country_of_origin"]') in the extra with the label event-location?

Copy link
Contributor Author

@vcarret vcarret Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is calling EM translator still needed? I think all information come from the json data at the moment. No?

If I start using elements from the page, do we need to leave it inside the EM translator then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

else {
for (var i = 0; i < creators.length; i++) {
if (creators[i]["@type"] == "Person") item.creators.push(ZU.cleanAuthor(creators[i].name, creatorsMapping[role]));
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the company info as well? I suggest something like this

let companyNodes = doc.querySelectorAll('a[href*="/company/"]');
let companies = [];
for (let company of companyNodes) {
   companies.push(company);
}
item.distributor = companies.join(', ');

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest here to use the field distributor also it is more like the production company. However, the linked thread here also seems to indicate that maybe the label for this field is not optimal. The list of all distributors for each country and each medium can contain almost 100 entries and that seems unreasonable to "remember" for a citation, e.g. https://www.imdb.com/title/tt0076759/companycredits?ref_=tt_dt_co . CC @adam3smith

Copy link
Contributor Author

@vcarret vcarret Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it seems more useful to store production companies (which are btw listed as creators in the JSON data) but this label of distributor seems unnecessarily confusing. One question, just to be sure: did you intentionally push the link or did you just forget to add company.textContent? When you say company info I thought you meant the company name but your example stores the link for the company

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are right, I meant the name and not the link. Please add also textContent.

I.e. for the example something like Lucasfilm, Twentieth Century Fox

trans.itemType = "film";
trans.doWeb(doc, url);
});
}


function addExtra(item, value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should reuse that function for several values in extra, i.e. don't delete it.

}
var pageId = ZU.xpathText(doc, '//meta[@property="pageId"]/@content');
if (pageId) {
addExtra(item, "IMDb ID: "+pageId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to delete that information.

IMDb.js Outdated
let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));
item.title = json.name;// note that json only has the original title
var transTitle = ZU.trimInternal(ZU.xpathText(doc, "//div[@class='title_wrapper']/h1/text()")).slice(0, -2);
if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a space after the if.

@zuphilip zuphilip added the Improvements Pull requests that are improving existing translators label Jan 6, 2020
@zuphilip
Copy link
Contributor

zuphilip commented Jan 6, 2020

@Placardo Let me know when this is ready from your side again for another round of review.

@vcarret
Copy link
Contributor Author

vcarret commented Jan 7, 2020

@zuphilip Seems good to me if you're ok I'll separate into two commits

Copy link
Contributor

@zuphilip zuphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks perfect now! Thank you very much!

@zuphilip
Copy link
Contributor

zuphilip commented Jan 7, 2020

Yes, please separate into 2-3 commits please. Then I can merge this.

@vcarret
Copy link
Contributor Author

vcarret commented Jan 7, 2020

🚀 some issues remain pending though regarding the status of the translated title as main title and the fact that production companies are listed as distributors

@zuphilip zuphilip merged commit fd5ef86 into zotero:master Jan 7, 2020
@zuphilip
Copy link
Contributor

zuphilip commented Jan 7, 2020

Merged now 🚀 ! Thank you very much @Placardo 🙇‍♂️ !

@zuphilip
Copy link
Contributor

zuphilip commented Jan 7, 2020

translated title as main title and ...

Yeah,I am aware of the change w.r.t. this. However, I think it is better because the release date will be also clear and we save the translated title still in an extra field. But even if there is a convincing demand to switch this, wen can easily do that in the future.

the fact that production companies are listed as distributors

Yeah, right as well. I will write a comment in the other thread and we will see how this will develop in the future.

@vcarret vcarret deleted the imdb branch January 7, 2020 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvements Pull requests that are improving existing translators
Development

Successfully merging this pull request may close these issues.

IMDb - pulling in incomplete data
3 participants