IMDb - Fixed creators, date, genre, running time, tags #2069

vcarret · 2019-11-26T19:46:42Z

This fixes #1780. Edited scraper works well in Scaffold, then I put it in my zotero/translator folder and he refuses to use it on Chrome/Mozilla (console says : "Error: Could not establish connection. Receiving end does not exist."). Reverted back in my zotero/translator folder to the original script, and the problem persists on both browsers... Since it does not seem to come from my edits, I open the pull request if you know where it comes from please let me know

dstillman · 2019-12-01T10:51:14Z

Edited scraper works well in Scaffold, then I put it in my zotero/translator folder

Don't touch the 'translators' directory manually — just click the Save to Zotero button in Scaffold while viewing the translator, and then click Update Translators in the Advanced pane of the Zotero Connector preferences.

dstillman · 2019-12-01T10:53:09Z

IMDb.js


 function detectWeb(doc, url) {
-	if (url.indexOf('/title/tt')>-1) {
+	if (url.includes('/title/tt') > -1) {


includes() just returns a boolean (here and below)

dstillman · 2019-12-02T07:08:51Z

Thanks for working on this. While this works, going from embedded itemprop data to page scraping will make the translator much more fragile, and it also won't work if IMDB is localized for other countries. Fortunately it doesn't appear to be necessary, since the page includes embedded JSON-LD. While we don't support JSON-LD natively yet, you can still get the <script type="application/ld+json"> content, parse it as JSON, and then pull values such as datePublished and keywords out of there manually:

let json = JSON.parse(document.querySelector('script[type="application/ld+json"]').textContent);
item.date = json.datePublished; // "2018-11-08"

vcarret · 2019-12-02T16:08:05Z

You're right of course about the localization problem. I'm used to scraping webpages on the go and usually don't pay attention to this problems, but the point of making those translators is precisely to have some generality.
Actors were originally included as "contributors", shouldn't there be a specific field for them? Or on the contrary should they be included at all? (I feel like this is a bit subjective). Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

vcarret · 2019-12-02T16:15:57Z

IMDb.js

-		var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]');
+		let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));
+		item.title = json.name;// note that json only has the original title
+		item.date = "datePublished" in json ? json.datePublished : "";


Is this the right way to check if the field is in the json data?

You should be able to just do item.date = json.datePublished;. If it's undefined then it'll just be undefined.

dstillman · 2019-12-16T08:11:58Z

Actors were originally included as "contributors", shouldn't there be a specific field for them?

They don't get cited, so it's not something we've ever worried about. We privilege actors as contributors because they're so prominent, but there are obviously hundreds of different roles people can have on a movie, and it wouldn't be possible to fully reproduce an IMDB entry in terms of creator types.

Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

Can you give an example of this?

dstillman · 2019-12-16T08:18:02Z

IMDb.js

+		item.title = json.name;// note that json only has the original title
+		item.date = "datePublished" in json ? json.datePublished : "";
+		item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : "";
+		item.genre = "genre" in json ? json.genre : "";


Same for all of these. If it's empty it just won't be set. (If that's not the behavior you're seeing, let us know.)

I remember now why I checked everywhere: when json.something is undefined, a .replace or .split will throw an error. So I will keep it where it is necessary, ie for duration and keywords, and make without on the other items

dstillman · 2019-12-16T08:20:11Z

IMDb.js

-			var creators = ZU.xpath(summary, './/span[@itemprop="'+role+'"]//span[@itemprop="name"]');
-			for (var i=0; i<creators.length; i++) {
-				item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role]));
+			if (!(role in json)) continue;


For our purposes, there's basically no need for in anywhere. So, e.g., you can just use if (!json.role) continue;.

dstillman · 2019-12-16T08:21:11Z

IMDb.js

-		var tags = ZU.xpath(doc, '//div[@itemprop="keywords"]/a');
-		for (var i=0; i<tags.length; i++) {
-			item.tags.push(tags[i].textContent);
+		var tags = "keywords" in json ? json.keywords.split(",") : "";


The return type needs to be consistent. json.keywords ? json.keywords.split(",") : [].

dstillman · 2019-12-16T08:22:19Z

IMDb.js

-			item.tags.push(tags[i].textContent);
+		var tags = "keywords" in json ? json.keywords.split(",") : "";
+		for (let i = 0; i < tags.length; i++) {
+			item.tags.push(tags[i]);


You don't need this. Just assign an array directly to item.tags above.

dstillman · 2019-12-16T08:34:04Z

Note also that the JSON data sometimes differs quite significantly from what appears in the webpage (eg there is only the original title. Should the translated title be retrieved?).

Can you give an example of this?

Ah, I guess the tests include examples:

https://www.imdb.com/title/tt0089276/
https://www.imdb.com/title/tt0060613/

The question of translated titles was raised in zotero/zotero-bits#50, though I don't think we came to a firm decision there. Assuming we add a new field for this, we'd probably want to add these to the Extra field as, e.g., "Translated Title: The Official Story" for "La historia oficial".

vcarret · 2019-12-16T14:47:01Z

The question of translated titles was raised in zotero/zotero-bits#50, though I don't think we came to a firm decision there. Assuming we add a new field for this, we'd probably want to add these to the Extra field as, e.g., "Translated Title: The Official Story" for "La historia oficial".

So I put it in extra and when (if) a Translated Title field is added we can put it there, is that what you had in mind?

zuphilip

Thank you for all the work @Placardo ! I think we can go with the translated title in the extra and use the original title in the title field as you did here. However, we should be aware, that this is changing the current behavior, where it was exactly the opposite: the translated title was saved in the the title field and the original title was saved in the extra field. I guess that one can cite it either or the other way. We can see if there are any reactions in the forum after this changes going live.

I have some small comments and suggestions. Especially I would like to save some of the information like country or studio which where mentioned in the linked thread. Everything should be easy to implement. Let me know when there is anything not clear.

It would be good if we can finish this the next days, as there was another request in the forum.

zuphilip · 2020-01-06T11:34:33Z

IMDb.js

-				item.creators.push(ZU.cleanAuthor(creators[i].textContent, creatorsMapping[role]));
+			if (!json[role]) continue;
+			var creators = json[role];
+			if (typeof creators.length == "undefined") item.creators.push(ZU.cleanAuthor(creators.name, creatorsMapping[role]));


You want to check whether creators is an array or not? Then I would suggest to use the Array.isArray function. Maybe also add parentheses in the if case, because it is followed by an else case.

I thought it was clearer since I use creators.length in the for loop just after and it needed to be defined there. Do you mean curly brackets?

Do you mean curly brackets?

Yes

zuphilip · 2020-01-06T11:39:13Z

IMDb.js

-		}
-
-		var summary = ZU.xpath(doc, '//div[contains(@class, "plot_summary_wrapper")]');
+		let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));


Is calling EM translator still needed? I think all information come from the json data at the moment. No?

It does but I didn't know whether deleting it would be frowned upon (apparently it's the opposite).

Usually, it is a good idea to start with (generic) EM translator which will already provide some/many information, then only the missing parts need to added. However, that is not the case here. And if we can spare an additional call of this translator it is a little speedup.

zuphilip · 2020-01-06T11:44:34Z

IMDb.js

+		if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle;
+		item.date = json.datePublished;
+		item.runningTime = "duration" in json ? json.duration.replace("PT", "").toLowerCase() : "";
+		item.genre = json.genre;


Suggested change

item.genre = json.genre;

item.genre = json.genre ? json.genre.join(", ") : "";

I'm not sure this is a good idea as some films do not return an array when there is only one genre (see eg https://www.imdb.com/title/tt0115964/), and thus .join returns an error which would mean we need another condition to check that it is indeed an array. And it works well without it

If the genre can be an array, we should use an explicit check for that. We would never want to rely on implicit array-to-string conversion (which also wouldn't include a space after the comma).

zuphilip · 2020-01-06T12:16:09Z

IMDb.js

-				"genre": "Drama, History, War",
+				"date": "1985-04-03",
+				"abstractNote": "La historia oficial is a movie starring Norma Aleandro, Héctor Alterio, and Chunchuna Villafañe. During the final months of Argentinian Military Dictatorship in 1983, a high school teacher sets out to find out who the mother of her...",
+				"extra": "Translated title: The Official Story",


Can we add also the country information from doc.querySelectorAll('a[href*="title?country_of_origin"]') in the extra with the label event-location?

Is calling EM translator still needed? I think all information come from the json data at the moment. No?

If I start using elements from the page, do we need to leave it inside the EM translator then?

zuphilip · 2020-01-06T12:32:11Z

IMDb.js

+			else {
+				for (var i = 0; i < creators.length; i++) {
+					if (creators[i]["@type"] == "Person") item.creators.push(ZU.cleanAuthor(creators[i].name, creatorsMapping[role]));
+				}
 			}
 		}


Can we add the company info as well? I suggest something like this

let companyNodes = doc.querySelectorAll('a[href*="/company/"]'); let companies = []; for (let company of companyNodes) { companies.push(company); } item.distributor = companies.join(', ');

I suggest here to use the field distributor also it is more like the production company. However, the linked thread here also seems to indicate that maybe the label for this field is not optimal. The list of all distributors for each country and each medium can contain almost 100 entries and that seems unreasonable to "remember" for a citation, e.g. https://www.imdb.com/title/tt0076759/companycredits?ref_=tt_dt_co . CC @adam3smith

Indeed it seems more useful to store production companies (which are btw listed as creators in the JSON data) but this label of distributor seems unnecessarily confusing. One question, just to be sure: did you intentionally push the link or did you just forget to add company.textContent? When you say company info I thought you meant the company name but your example stores the link for the company

Oh, you are right, I meant the name and not the link. Please add also textContent.

I.e. for the example something like Lucasfilm, Twentieth Century Fox

zuphilip · 2020-01-06T12:45:44Z

IMDb.js

 		trans.itemType = "film";
 		trans.doWeb(doc, url);
 	});
 }

-
-function addExtra(item, value) {


You should reuse that function for several values in extra, i.e. don't delete it.

zuphilip · 2020-01-06T12:47:26Z

IMDb.js

-		}
-		var pageId = ZU.xpathText(doc, '//meta[@property="pageId"]/@content');
-		if (pageId) {
-			addExtra(item, "IMDb ID: "+pageId);


No reason to delete that information.

zuphilip · 2020-01-06T12:50:33Z

IMDb.js

+		let json = JSON.parse(text(doc, 'script[type="application/ld+json"]'));
+		item.title = json.name;// note that json only has the original title
+		var transTitle = ZU.trimInternal(ZU.xpathText(doc, "//div[@class='title_wrapper']/h1/text()")).slice(0, -2);
+		if(transTitle && transTitle !== item.title) item.extra = "Translated title: " + transTitle;


Add a space after the if.

zuphilip · 2020-01-06T23:27:53Z

@Placardo Let me know when this is ready from your side again for another round of review.

vcarret · 2020-01-07T00:54:19Z

@zuphilip Seems good to me if you're ok I'll separate into two commits

zuphilip

This looks perfect now! Thank you very much!

zuphilip · 2020-01-07T16:23:49Z

Yes, please separate into 2-3 commits please. Then I can merge this.

… and country

vcarret · 2020-01-07T16:44:36Z

🚀 some issues remain pending though regarding the status of the translated title as main title and the fact that production companies are listed as distributors

zuphilip · 2020-01-07T17:52:01Z

Merged now 🚀 ! Thank you very much @Placardo 🙇‍♂️ !

zuphilip · 2020-01-07T17:52:19Z

translated title as main title and ...

Yeah,I am aware of the change w.r.t. this. However, I think it is better because the release date will be also clear and we save the translated title still in an extra field. But even if there is a convincing demand to switch this, wen can easily do that in the future.

the fact that production companies are listed as distributors

Yeah, right as well. I will write a comment in the other thread and we will see how this will develop in the future.

dstillman reviewed Dec 1, 2019

View reviewed changes

vcarret commented Dec 2, 2019

View reviewed changes

dstillman reviewed Dec 16, 2019

View reviewed changes

dstillman mentioned this pull request Dec 16, 2019

Additional fields for film (and tv series) zotero/zotero-bits#50

Open

zuphilip reviewed Jan 6, 2020

View reviewed changes

zuphilip added the Improvements Pull requests that are improving existing translators label Jan 6, 2020

zuphilip approved these changes Jan 7, 2020

View reviewed changes

vcarret added 2 commits January 7, 2020 17:37

New scrape function using JSON data + page info on production company…

4e40911

… and country

Linting + tests update

a257224

vcarret force-pushed the imdb branch from d9e7e34 to a257224 Compare January 7, 2020 16:40

zuphilip merged commit fd5ef86 into zotero:master Jan 7, 2020

vcarret deleted the imdb branch January 7, 2020 18:31

	item.genre = json.genre;
	item.genre = json.genre ? json.genre.join(", ") : "";

IMDb - Fixed creators, date, genre, running time, tags #2069

IMDb - Fixed creators, date, genre, running time, tags #2069

Conversation

vcarret commented Nov 26, 2019 • edited Loading

dstillman commented Dec 1, 2019 • edited Loading

Choose a reason for hiding this comment

dstillman commented Dec 2, 2019 • edited Loading

vcarret commented Dec 2, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dstillman commented Dec 16, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dstillman commented Dec 16, 2019

vcarret commented Dec 16, 2019

zuphilip left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcarret Jan 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcarret Jan 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcarret Jan 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcarret Jan 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zuphilip commented Jan 6, 2020

vcarret commented Jan 7, 2020 • edited Loading

zuphilip left a comment

Choose a reason for hiding this comment

zuphilip commented Jan 7, 2020

vcarret commented Jan 7, 2020

zuphilip commented Jan 7, 2020

zuphilip commented Jan 7, 2020

vcarret commented Nov 26, 2019 •

edited

Loading

dstillman commented Dec 1, 2019 •

edited

Loading

dstillman commented Dec 2, 2019 •

edited

Loading

vcarret commented Dec 2, 2019 •

edited

Loading

vcarret Jan 6, 2020 •

edited

Loading

vcarret Jan 6, 2020 •

edited

Loading

vcarret Jan 6, 2020 •

edited

Loading

vcarret Jan 6, 2020 •

edited

Loading

vcarret commented Jan 7, 2020 •

edited

Loading