add(Soopat.js): New translator for a patent site www.soopat.com #3007

Closed
wants to merge 1 commit into from

Conversation

l0o0
Contributor

@l0o0 l0o0 commented Mar 28, 2023

Soopat.com is a patent database based in China that is widely used by Chinese users. It collects patents from all around the world, not just from China, and allows users to search and download patent applications and authorization PDFs. Pro users can access more detailed data on this site.

The translator extracts metadata from HTML elements rather than using an API. The data in HTML is well-formatted and easy to read.

Please be aware that the pro version requires an account to access. I was given a private account to create this translator.

if (selectedItems) {
	var urls = Object.keys(selectedItems);
	Z.debug(urls[0]);
	await Promise.all(
Collaborator

This will initiate the requests asynchronously, which in practice means almost simultaneously. Some sites don't like this (because of the botty behaviour). Is this perhaps related to the Captcha issue?
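
To illustrate what "almost simultaneously" means, here is a minimal sketch. `fakeRequest()` and `fetchConcurrently()` are hypothetical names, and the timer only stands in for a real network call; this is not the translator's actual code. The log shows that every request has started before any of them finishes.

```javascript
// Hypothetical stand-in for a network request: logs when it starts and ends.
function fakeRequest(url, log) {
	log.push(`start ${url}`);
	return new Promise((resolve) => {
		setTimeout(() => {
			log.push(`end ${url}`);
			resolve(url);
		}, 10);
	});
}

// urls.map() calls fakeRequest() for every URL right away;
// Promise.all() only waits for the already-running requests to settle.
async function fetchConcurrently(urls, log) {
	return Promise.all(urls.map(url => fakeRequest(url, log)));
}
```

From the server's point of view, all requests arrive in one burst, which is the pattern that tends to look like bot traffic.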

Contributor Author

You are right: frequent requests trigger the Captcha for free users, so scraping many items from the search page is not recommended. How about fetching the pages one by one in a for loop?

if (selectedItems) {
	// Fetch and scrape the selected pages one at a time to avoid bursts of requests
	for (let url of Object.keys(selectedItems)) {
		let doc = await requestDocument(url);
		await scrape(doc, url, loginStatus);
	}
}

Member

Yeah, we almost always want to await each request in turn instead of using Promise.all() for this reason. (And if that's not enough, we can consider artificial delays.)
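
If artificial delays turn out to be necessary, one possible shape is sketched below. `delay()`, `processSequentially()`, `handleOne`, and the 1000 ms default are all hypothetical, and the sketch assumes `setTimeout` is available wherever the translator runs. `handleOne` stands in for the fetch-and-scrape step.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
function delay(ms) {
	return new Promise(resolve => setTimeout(resolve, ms));
}

// Handles URLs strictly one at a time, pausing between consecutive requests.
// `handleOne` is a hypothetical async callback for fetching and scraping one URL.
async function processSequentially(urls, handleOne, pauseMs = 1000) {
	const results = [];
	for (let i = 0; i < urls.length; i++) {
		if (i > 0) {
			// No pause is needed before the first request
			await delay(pauseMs);
		}
		results.push(await handleOne(urls[i]));
	}
	return results;
}
```

A translator could then call something like `await processSequentially(Object.keys(selectedItems), handleOne)`, with the pause tuned to whatever the site tolerates.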

Contributor Author

Soopat seems to have a strict anti-scraping policy for free users. Sometimes I have to enter a Captcha right after simply refreshing the page. I think there is little we can do on Zotero's side.

@l0o0 l0o0 closed this by deleting the head repository Nov 6, 2023