Introduce ignoreContentType for JsoupBrowser #63

parius · 2019-04-03T14:04:12Z

Are there any plans to allow setting JsoupBrowser::ignoreContentType(true)?

Currently it fails to load e.g. Mimetype=application/pdf with org.jsoup.UnsupportedMimeTypeException, and I do not see how I could obtain the content of the link.

ruippeixotog · 2019-04-03T20:23:56Z

Hi @parius! What exactly are you trying to do? Jsoup cannot read PDF files, so this restriction seems reasonable. Or is it the case that the website is sending a wrong Content-Type header?

parius · 2019-04-04T10:01:56Z

While parsing a page, I would like to follow a link on it (e.g. https://de.indeed.com/rc/clk?jk=b30cb1d8cced1cdf), but I don't know yet what is there. If it is not HTML, I would like to get the location (final permanent link) and the content type, and be able to dump the content to disk.

With pure Jsoup and ignoreContentType I could achieve this. I was thinking of extending JsoupBrowser, but it is not that open for this case.

ruippeixotog · 2019-04-05T18:43:58Z

Got it 👍 You can extend JsoupBrowser for that, though:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import org.jsoup.Connection

val browser = new JsoupBrowser {
  override def requestSettings(conn: Connection) = conn.ignoreContentType(true)
}

See #60 for another example. Let me know if this helps.
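
For the use case described earlier in the thread (final URL, content type, dumping a non-HTML body to disk), a minimal sketch with plain Jsoup could look like the following. It is not part of scala-scraper: the object `FetchAnything`, its methods `isHtml`, `fetch` and `dump`, and the output file name `download.bin` are hypothetical names chosen for illustration. Only standard Jsoup `Connection` calls are used.

```scala
import java.nio.file.{Files, Path, Paths}
import org.jsoup.{Connection, Jsoup}

// Hypothetical helper object, for illustration only.
object FetchAnything {

  // True when the Content-Type header names an HTML media type.
  // A missing header (null) is treated as non-HTML; that is an assumption.
  def isHtml(contentType: String): Boolean =
    Option(contentType).exists { ct =>
      val mime = ct.takeWhile(_ != ';').trim.toLowerCase
      mime == "text/html" || mime == "application/xhtml+xml"
    }

  // Executes the request without parsing the body, so non-HTML
  // responses do not raise UnsupportedMimeTypeException.
  def fetch(url: String): Connection.Response =
    Jsoup.connect(url)
      .ignoreContentType(true) // accept any MIME type
      .followRedirects(true)   // so response.url() is the final location
      .maxBodySize(0)          // 0 = no size limit on the body
      .execute()

  // Writes the raw response bytes to the given path.
  def dump(response: Connection.Response, target: Path): Path =
    Files.write(target, response.bodyAsBytes())

  def main(args: Array[String]): Unit = {
    val res = fetch("https://de.indeed.com/rc/clk?jk=b30cb1d8cced1cdf")
    println(s"Final URL:    ${res.url()}")
    println(s"Content type: ${res.contentType()}")
    if (!isHtml(res.contentType())) dump(res, Paths.get("download.bin"))
  }
}
```

The sketch goes through `Connection.execute()` rather than `JsoupBrowser.get` because the latter returns a parsed document, while the raw response is what exposes the final URL, the content type and the body bytes.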

parius · 2019-04-07T11:16:47Z

oh, so easy. Thanks, it is working!

parius closed this as completed Apr 7, 2019
