Can't indicate the encoding for HtmlUnitBrowser #165

Closed
dev590t opened this issue Dec 13, 2020 · 2 comments

dev590t commented Dec 13, 2020

When I execute:

import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser._
import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser
val typedBrowser: HtmlUnitBrowser = HtmlUnitBrowser.typed()
val apec =  "https://www.apec.fr/candidat/recherche-emploi.html/emploi?niveauxExperience=101881&lieux=711&motsCles=scala&fonctions=101833&page=1"
typedBrowser.get(apec)

I get several exceptions. Because the output is very long, below is only an extract from the first exception. The complete log is at https://github.com/zhenleibb/scala-scraper/blob/feature/bug/log/log_1.txt.

Dec 13, 2020 8:33:41 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter error
SEVERE: error: message=[illegal character: ] sourceName=[https://cdn.cookielaw.org/consent/e7726f99-5fed-4e14-bb06-9cf27479add1.js] line=[1] lineSource=[�}y[ۺ������y!�8!akI���VJ[�B�ry�(���S/@X��;#ɶ�8=w˹�GcK�I3��4��5_�T�.G�I��KL���5��N���;3S���.9�� l<�*�B��Fh{nQy�J'�p@��ϢR�I����7=7 n��ę��>��6)�ޞ�@�M- E��h�w���v��6N�N_�.N�:�4���M�/�J]���Za����)gZ1�үU�V�����X�I��
                                                                                  R�ת,+PDʝ(h] lineOffset=[1]
Dec 13, 2020 8:33:41 PM com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine handleJavaScriptException
INFO: Caught script exception
======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.EvaluatorException]
com.gargoylesoftware.htmlunit.ScriptException: illegal character:  (https://cdn.cookielaw.org/consent/e7726f99-5fed-4e14-bb06-9cf27479add1.js#1)
	at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:885)
	at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:617)
..
	at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:398)
	at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:315)
	at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.exec(HtmlUnitBrowser.scala:52)
	at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.get(HtmlUnitBrowser.scala:57)
...
Caused by: net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: illegal character:  (https://cdn.cookielaw.org/consent/e7726f99-5fed-4e14-bb06-9cf27479add1.js#1)
	at com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.error(StrictErrorReporter.java:69)
	at net.sourceforge.htmlunit.corejs.javascript.Parser.addError(Parser.java:259)

Version

scala-scraper: 2.2.0

Possible cause

I suspect the problem comes from the encoding of the file.

Possible solution

After checking the source code of HtmlUnitBrowser, I see that it isn't possible to specify the encoding in the exec function, because newRequest is private. So we would need to make it public.

dev590t changed the title from "How indicate encoding for HtmlUnitBrowser" to "Can't indicate the encoding for HtmlUnitBrowser" on Dec 13, 2020

dev590t commented Dec 13, 2020

After downloading the problematic JS file https://cdn.cookielaw.org/consent/e7726f99-5fed-4e14-bb06-9cf27479add1.js, I checked its charset:

$ file -i https\ _cdn.cookielaw.org_consent_e7726f99-5fed-4e14-bb06-9cf27479add1.js 
https _cdn.cookielaw.org_consent_e7726f99-5fed-4e14-bb06-9cf27479add1.js: text/plain; charset=us-ascii

The default charset used by HtmlUnitBrowser is UTF-8. So I opened the JS file with both UTF-8 and US-ASCII encodings, and the text displays without problems in either case.

So the issue doesn't come from https://cdn.cookielaw.org/consent/e7726f99-5fed-4e14-bb06-9cf27479add1.js?
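
The same charset check can also be done from Scala instead of `file -i`. The following is only a minimal sketch using the JDK's strict charset decoders. The object name `CharsetCheck` and the method `decodesCleanly` are hypothetical names introduced for illustration and are not part of scala-scraper or HtmlUnit.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

// Hypothetical helper, not part of scala-scraper or HtmlUnit.
object CharsetCheck {
  // Returns true when `bytes` can be decoded with `charset` without any
  // malformed or unmappable input. The decoder is configured to report
  // errors instead of silently substituting replacement characters.
  def decodesCleanly(bytes: Array[Byte], charset: Charset): Boolean = {
    val decoder = charset.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      decoder.decode(ByteBuffer.wrap(bytes))
      true
    } catch {
      case _: CharacterCodingException => false
    }
  }
}
```

Assuming the downloaded file were saved under a local path of your choice, it could be checked with `CharsetCheck.decodesCleanly(java.nio.file.Files.readAllBytes(path), StandardCharsets.US_ASCII)`. A file that is pure US-ASCII also decodes cleanly as UTF-8, which matches the observation above that both encodings display the text.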


dev590t commented Dec 14, 2020

In the end, it isn't a bug. The exception messages come from the log output. I removed them with:

    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.http").setLevel(java.util.logging.Level.OFF);
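
One caveat with the calls above: `java.util.logging` may keep only weak references to loggers, so a level set on a logger that nobody references can be lost if that logger is garbage collected. A possible variant, sketched here under the assumption that the two logger names above are the ones to silence, keeps strong references to them. The object name `QuietHtmlUnit` is hypothetical.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical helper, not part of scala-scraper or HtmlUnit.
object QuietHtmlUnit {
  // Logger names taken from the snippet above.
  private val names = Seq("com.gargoylesoftware.htmlunit", "org.apache.http")

  // Holding the Logger instances in a val keeps strong references to them,
  // so the OFF level is not lost to garbage collection.
  val loggers: Seq[Logger] = names.map { name =>
    val logger = Logger.getLogger(name)
    logger.setLevel(Level.OFF)
    logger
  }
}
```

Scala objects are initialized lazily, so `QuietHtmlUnit.loggers` has to be referenced once, before the first `typedBrowser.get(...)` call, for the levels to take effect.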

dev590t closed this as completed on Dec 14, 2020