Introduce ignoreContentType for JsoupBrowser #63

Closed
parius opened this issue Apr 3, 2019 · 4 comments

parius commented Apr 3, 2019

Are there any plans to allow setting JsoupBrowser::ignoreContentType(true)?

Currently it fails to load non-HTML content (e.g. MIME type application/pdf) with org.jsoup.UnsupportedMimeTypeException, and I do not see how I could obtain the content of the link.

ruippeixotog (Owner) commented

Hi @parius! What exactly are you trying to do? Jsoup cannot read PDF files, so this restriction seems reasonable. Or is it the case that the website is sending a wrong Content-Type header?

parius (Author) commented Apr 4, 2019

While parsing a page, I would like to follow a link on it (e.g. https://de.indeed.com/rc/clk?jk=b30cb1d8cced1cdf), but I don't know in advance what it points to. If it is not HTML, I would like to get the location (the final, permanent link) and the content type, and be able to dump the content to disk.

With pure JSoup and ignoreContentType I could achieve this. I was thinking of extending JsoupBrowser, but it is not that open for this case.
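
For reference, the pure jsoup approach mentioned here might look roughly like the sketch below. It is only an illustration under assumptions: `LinkFetcher`, `Fetched`, `isHtml`, and the `target` path are hypothetical names that do not come from this thread, and lifting the body size limit with `maxBodySize(0)` is a choice made here so that large files are not truncated.

```scala
import java.nio.file.{Files, Path}
import org.jsoup.Jsoup

object LinkFetcher {
  // Hypothetical result type: where the link ended up, what it served,
  // and where the body was saved (only set for non-HTML content).
  final case class Fetched(finalUrl: String, contentType: Option[String], savedTo: Option[Path])

  // True when the Content-Type header denotes an HTML or XHTML page.
  def isHtml(contentType: Option[String]): Boolean =
    contentType.exists { ct =>
      val mime = ct.takeWhile(_ != ';').trim.toLowerCase
      mime == "text/html" || mime == "application/xhtml+xml"
    }

  def fetch(url: String, target: Path): Fetched = {
    val response = Jsoup
      .connect(url)
      .ignoreContentType(true) // do not throw UnsupportedMimeTypeException
      .followRedirects(true)   // jsoup's default, stated here for clarity
      .maxBodySize(0)          // 0 = no limit; assumption: whole file is wanted
      .execute()

    val contentType = Option(response.contentType()) // header may be missing (null)
    val finalUrl = response.url().toString           // URL after any redirects

    if (isHtml(contentType)) Fetched(finalUrl, contentType, None)
    else {
      Files.write(target, response.bodyAsBytes())    // dump the raw bytes to disk
      Fetched(finalUrl, contentType, Some(target))
    }
  }
}
```

A call such as `LinkFetcher.fetch(url, java.nio.file.Paths.get("download.bin"))` would then return the final location and content type, and write the body to disk when the response is not HTML.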

ruippeixotog (Owner) commented Apr 5, 2019

Got it 👍 You can extend JsoupBrowser for that, though:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import org.jsoup.Connection

val browser = new JsoupBrowser {
  override def requestSettings(conn: Connection) = conn.ignoreContentType(true)
}

See #60 for another example. Let me know if this helps.
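
A minimal sketch of how such a browser might be used, assuming the `requestSettings` override shown in this comment. The `LenientBrowser` object, the `describe` helper, and the 10-second timeout are hypothetical and only serve as illustration. Note that `get` still appears to parse the response body as HTML, so for binary content such as PDFs the result is probably only useful for its location.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import org.jsoup.Connection

object LenientBrowser {
  // Same override as in the comment, with an explicit return type
  // and one additional jsoup setting chained on.
  val browser: JsoupBrowser = new JsoupBrowser {
    override def requestSettings(conn: Connection): Connection =
      conn
        .ignoreContentType(true) // accept non-HTML responses instead of throwing
        .timeout(10000)          // 10 s; illustrative value, not from the thread
  }

  // Fetches a URL and returns the final location and the (possibly empty) title.
  def describe(url: String): (String, String) = {
    val doc = browser.get(url)
    (doc.location, doc.title)
  }
}
```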

parius (Author) commented Apr 7, 2019

Oh, so easy. Thanks, it works!

parius closed this as completed Apr 7, 2019