Introduce ignoreContentType for JsoupBrowser #63
Comments
Hi @parius! What exactly are you trying to do? Jsoup cannot read PDF files, so this restriction seems reasonable. Or is it the case that the website is sending a wrong Content-Type header?
While parsing a page, I would like to follow a link on it (e.g. https://de.indeed.com/rc/clk?jk=b30cb1d8cced1cdf), but I don't know yet what is there. If it is not HTML, I would like to get the location (final permanent link) and the content type, and be able to dump it to disk. With pure Jsoup and ignoreContentType I could achieve this. I was thinking of extending JsoupBrowser, but it is not that open for this case.
Got it 👍 You can extend JsoupBrowser:

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import org.jsoup.Connection

val browser = new JsoupBrowser {
  override def requestSettings(conn: Connection) = conn.ignoreContentType(true)
}

See #60 for another example. Let me know if this helps.
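
For the other half of the use case described above (final location, content type, and dumping the body to disk), here is a minimal sketch using pure Jsoup rather than JsoupBrowser. The helper name `fetchRaw` and its `target` parameter are hypothetical and not part of scala-scraper; the sketch assumes the server answers with ordinary HTTP redirects, and lifts Jsoup's default body size limit so that larger files are not truncated.

```scala
import java.nio.file.{Files, Paths}
import org.jsoup.Jsoup

// Hypothetical helper (not part of scala-scraper): fetches `url` with pure Jsoup,
// writes the raw body to `target`, and returns (final URL, content type).
def fetchRaw(url: String, target: String): (String, Option[String]) = {
  val response = Jsoup.connect(url)
    .ignoreContentType(true) // accept non-HTML responses such as application/pdf
    .followRedirects(true)   // Jsoup's default, stated here for clarity
    .maxBodySize(0)          // 0 = no size limit; the default cap may truncate large files
    .execute()

  val finalUrl = response.url().toString            // URL of the response after redirects
  val contentType = Option(response.contentType())  // None if the header is missing

  // Dump the raw bytes to disk without attempting to parse them as HTML.
  Files.write(Paths.get(target), response.bodyAsBytes())

  (finalUrl, contentType)
}
```

A call such as `fetchRaw(link, "download.bin")` would then let the caller decide, based on the returned content type, whether to parse the file as HTML or treat it as a binary download.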
Oh, so easy. Thanks, it is working!
Are there any plans to allow setting JsoupBrowser::ignoreContentType(true)?
Currently it fails to load content such as Mimetype=application/pdf with org.jsoup.UnsupportedMimeTypeException, and I do not see how I could obtain the content of the link.