scala-scraper's implementation of HtmlUnit doesn't have .close() #35

Closed
piercelamb opened this issue Oct 25, 2016 · 7 comments

Comments

@piercelamb

I have an Akka actor that starts up every 3 hours to scrape some content using HtmlUnitBrowser (I have to use this because of JS execution). Everything works fine, except that memory usage jumps every time it starts and then stays constant at that new level, so eventually I run out of memory. I'm not 100% sure HtmlUnit is the issue, but they do have an FAQ entry about it specifically:

http://htmlunit.sourceforge.net/faq.html#MemoryLeak

As such, I'd like to test closing the browser every time after it's used in the Akka actor. However, I don't see a .close() method on HtmlUnitBrowser.

Please advise.

@ruippeixotog
Owner

Hi @piercelamb! Sure, this can be easily done. However, calling something like close() on the browser will most probably close all pages created with that browser up to that moment, not a specific one. Would that work for you?

A question about your problem: are you reusing the same HtmlUnitBrowser instance for the scheduled operation, or are you creating a new HtmlUnitBrowser each time the job runs?

@piercelamb
Author

@ruippeixotog I'm actually not too sure. I'm using Play Framework 2.5.9, so everything is dependency injected. In order to inject HtmlUnitBrowser I had to make this:

class HtmlUnitBrowserFactory extends HtmlUnitBrowser { new HtmlUnitBrowser }

Because it doesn't have a parameterless constructor. That then gets injected like this:

@Singleton
class YelpService @Inject() (browser: HtmlUnitBrowserFactory) ....

And passed to an Actor like this:

actorSystem.scheduler.schedule(0.seconds, 3.hour, yelpActor, Start(browser, mailActor, mailer, reviewsTable))

YelpService gets injected into one of my main controllers so it fires on startup. That is where I believe HtmlUnitBrowser would be created and I assume only once.

How would I access that .close() method?

@ruippeixotog
Owner

ruippeixotog commented Oct 27, 2016

Hmm, I'm not sure I understand one thing in your code: why do you add { new HtmlUnitBrowser } as the body of your class? You seem to be creating an extra instance inside the constructor of HtmlUnitBrowserFactory that is immediately discarded. Wouldn't class HtmlUnitBrowserFactory extends HtmlUnitBrowser work for your purpose of having a no-arg constructor?
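
To illustrate the point about the discarded instance, here is a minimal sketch. It does not use scala-scraper at all: Browser is a hypothetical stand-in for HtmlUnitBrowser that only counts how many times its constructor runs, and the factory class names and countFor are made up for this example.

```scala
object ConstructorDemo {
  // Counts constructor runs of the stand-in class below.
  var instances = 0

  // Hypothetical stand-in for HtmlUnitBrowser (an assumption, not the real class).
  class Browser { instances += 1 }

  // Mirrors the factory from the thread: a class body is constructor code,
  // so `new Browser` builds a second instance that is thrown away at once.
  class WastefulFactory extends Browser { new Browser }

  // Without a body, only the factory instance itself is constructed.
  class PlainFactory extends Browser

  // Resets the counter, runs `build`, and reports how many Browsers were created.
  def countFor(build: () => Any): Int = {
    instances = 0
    build()
    instances
  }
}
```

In this sketch, constructing WastefulFactory runs the Browser constructor twice (once for the superclass, once for the discarded instance), while PlainFactory runs it once. With a real browser, the extra instance would presumably hold its own resources until garbage collected.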

Either way, if you are using browser: HtmlUnitBrowserFactory as your browser, you would be able to call browser.close() directly inside your task after you finish scraping the page, closing all windows open at that moment. In your case that doesn't seem to be a problem, as you will surely be able to do everything you want with the scraped page before the next job runs (after 3 hours). If you really don't want to risk that, you can always change your factory to be:

class HtmlUnitBrowserFactory {
  def newBrowser() = new HtmlUnitBrowser
}

And create a new browser instance each time you run a job.
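
A browser-per-job setup combined with the proposed close could be sketched roughly as follows. This is only an illustration under stated assumptions: FakeBrowser is a hypothetical stand-in for a browser that exposes close() (which HtmlUnitBrowser does not have at this point in the thread), and FakeBrowserFactory and withBrowser are names invented for the example.

```scala
object BrowserLoan {
  // Hypothetical stand-in for a browser that holds resources until closed.
  final class FakeBrowser extends AutoCloseable {
    var closed = false

    // Pretends to fetch a page; refuses to work once the browser is closed.
    def get(url: String): String = {
      require(!closed, "browser already closed")
      s"<html>content of $url</html>"
    }

    override def close(): Unit = { closed = true }
  }

  // Same shape as the factory above: one fresh browser per call.
  class FakeBrowserFactory {
    def newBrowser(): FakeBrowser = new FakeBrowser
  }

  // Loan pattern: create a browser for one job and always close it afterwards,
  // even if the job throws.
  def withBrowser[A](factory: FakeBrowserFactory)(job: FakeBrowser => A): A = {
    val browser = factory.newBrowser()
    try job(browser)
    finally browser.close()
  }
}
```

Inside the scheduled task, the scraping would then run as something like withBrowser(factory) { browser => browser.get(url) }, so each run releases its browser before the next one starts three hours later.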

@piercelamb
Author

piercelamb commented Oct 28, 2016

@ruippeixotog Great point on the constructor. Major oversight.

My issue is that browser.close() does not compile, e.g.

value close is not a member of net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser
[error]               browser.close()
[error]                       ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

@ruippeixotog
Owner

Oh, I guess I wasn't clear. The method still doesn't exist, I'm suggesting implementing it as a possible solution to your use case :) I'll get to it this weekend.

@piercelamb
Author

@ruippeixotog awesome! I'll test it out as soon as you have it ready.

Thank you

@samikrc

samikrc commented Oct 29, 2016

@piercelamb Not really on topic, but I am curious to know why and how you are using Akka actors for screen scraping. Could you say a bit about the use case for Akka actors? Thanks.
