
Java null pointer error in fetch() #57

Closed
arelaxend opened this issue Dec 26, 2016 · 9 comments
@arelaxend

arelaxend commented Dec 26, 2016

Hi!

I am running into some errors. The program crashes after about 10 crawls with the errors below. Can you help me figure out why?

Best,

1st

2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8)
org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

2nd

2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:267)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
... 20 more

3rd

ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2
java.lang.NullPointerException
at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
java.util.NoSuchElementException: key not found: 6
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

...

2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1

@arelaxend arelaxend changed the title Certificate error ? Selenium error ? Dec 26, 2016
@arelaxend
Author

To solve the issue, declare public FetchedData fetch(Resource resource) with throws Exception, and add a try/catch around the fetch() call in public FetchedData next(). Cheers
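The suggested pattern can be sketched as follows. This is a minimal, self-contained illustration, not Sparkler's actual code: the class and method bodies here are hypothetical, modeled loosely on FetcherDefault's FetchIterator. The point is that fetch() declares throws Exception and next() catches it, so one bad URL cannot kill the whole Spark task.

```java
import java.util.Iterator;
import java.util.List;

public class FetchSketch {

    // Hypothetical stand-in for FetcherJBrowser.fetch(Resource):
    // declares "throws Exception" instead of crashing the caller.
    static String fetch(String url) throws Exception {
        if (url.startsWith("bad")) {
            throw new Exception("driver crashed for " + url);
        }
        return "content of " + url;
    }

    // Hypothetical stand-in for FetcherDefault.FetchIterator:
    // next() wraps fetch() in try/catch and keeps iterating on failure.
    static class FetchIterator implements Iterator<String> {
        private final Iterator<String> urls;

        FetchIterator(Iterator<String> urls) {
            this.urls = urls;
        }

        @Override
        public boolean hasNext() {
            return urls.hasNext();
        }

        @Override
        public String next() {
            String url = urls.next();
            try {
                return fetch(url);
            } catch (Exception e) {
                // Record the failure instead of propagating the
                // exception into the Spark executor thread.
                return "FETCH_ERROR " + url;
            }
        }
    }

    public static void main(String[] args) {
        Iterator<String> it =
            new FetchIterator(List.of("http://ok", "bad://x").iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```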

@karanjeets
Member

@arelaxend Thanks for reporting this issue and for the quick solution. I will investigate more before wrapping it in a try/catch.
Just a suggestion: if you know which URLs cause this issue and you don't want to crawl them anyway, add URL regex filter(s). It will save you some time. :)
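For reference, such a filter is typically a Nutch-style regex file. The file name and exact syntax below are assumptions (check the urlfilter plugin shipped with your Sparkler build); the patterns themselves are hypothetical placeholders:

```
# regex-urlfilter.txt (hypothetical): first matching rule wins.
# Skip hosts/paths that crash the JBrowser driver:
-^https?://broken\.example\.com/
# Skip obvious non-HTML resources:
-\.(gif|jpg|png|pdf|zip)$
# Accept everything else:
+.
```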

@thammegowda
Member

Thanks @arelaxend If you could submit a pull request with those fixes, it will be awesome 👍 🥇

@thammegowda
Member

@karanjeets
Instead of

Just a suggestion - If you know what URLs are causing this issue and you don't want to crawl them as well, place URL regex filter(s). It will save you some time. :)

I suggest we do this:

To solve the issue, just add a throw in public FetchedData fetch(Resource resource) throws Exception, and also add a try/catch in public FetchedData next() around the fetch() function.

@arelaxend
Author

arelaxend commented Dec 27, 2016

Hi, I got:
fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

Besides, you just have to update the fetch() function in the fetcher-jbrowser plugin. I chose to do that to stay consistent with what you have done in the fetch() function in the app folder. 👍 So it means changing the implementation of fetch() to the following code:

public FetchedData fetch(Resource resource) {
  LOG.info("JBrowser FETCHER {}", resource.getUrl());
  FetchedData fetchedData;
  /*
   * In this plugin we will work only on HTML data.
   * If the data is of any other type (image, pdf, etc.) the plugin will
   * return a client error so it can be fetched using the default Fetcher.
   */
  try {
    if (!isWebPage(resource.getUrl())) {
      LOG.debug("{} is not HTML. Falling back to default fetcher.",
          resource.getUrl());
      //This should be true for all URLS ending with 4 character file extension
      //return new FetchedData("".getBytes(), "application/html", ERROR_CODE) ;
      return super.fetch(resource);
    }
    long start = System.currentTimeMillis();

    // This will block for the page load and any
    // associated AJAX requests
    driver.get(resource.getUrl());

    int status = driver.getStatusCode();
    //content-type

    // Returns the page source in its current state, including
    // any DOM updates that occurred after page load
    String html = driver.getPageSource();

    //quitBrowserInstance(driver);

    LOG.debug("Time taken to load {} - {} ", resource.getUrl(),
        (System.currentTimeMillis() - start));

    if (!(status >= 200 && status < 300)) {
      // If not fetched through plugin successfully
      // Falling back to default fetcher
      LOG.info(
          "{} Failed to fetch the page. Falling back to default fetcher.",
          resource.getUrl());
      return super.fetch(resource);
    }
    fetchedData = new FetchedData(html.getBytes(), "application/html",
        status);
    resource.setStatus(ResourceStatus.FETCHED.toString());
    fetchedData.setResource(resource);
    return fetchedData;
  } catch (Exception e) {
    LOG.info(
        "{} Failed to fetch the page. Falling back to default fetcher.",
        resource.getUrl());
    return super.fetch(resource);
  }
}

@arelaxend
Author

Plus, the Selenium error happens when you build the app but the jar files for the plugins are not generated! 👍

@thammegowda
Copy link
Member

Thanks for the comment

Hi, I got:
fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

What command did you execute to get this error message?
I hope you are aware of how to raise a pull request without having write permissions. If not, refer to http://stackoverflow.com/a/14681796/1506477. Basically, you (1) fork this repo, (2) push your changes to your fork, and (3) raise a pull request from your fork to this repo.
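In command form, the fork-based workflow looks roughly like this (the repository URL and branch names are placeholders, substitute your own):

```
# 1. Fork sparkler on GitHub, then clone YOUR fork, not the upstream repo
git clone https://github.com/<your-user>/sparkler.git
cd sparkler

# 2. Create a topic branch and commit the fix there
git checkout -b fix-fetch-npe
git commit -am "Handle WebDriverException in FetcherJBrowser.fetch()"

# 3. Push the branch to your fork; "origin" points at the fork you cloned
git push origin fix-fetch-npe

# 4. On GitHub, open a pull request from fix-fetch-npe to this repo's master
```

The "Could not read from remote repository" error usually means you pushed to the upstream repo (where you have no write access) instead of your own fork.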

Plus, for the selenium error, it happens when you build the app, but the jar file for the plugins are not generated!

Ah! That explains the NullPointerException - the 3rd stack trace.

@arelaxend arelaxend reopened this Dec 27, 2016
@thammegowda
Member

Fixed in #61
Waiting for one more person to review before I merge

@arelaxend
Author

This is a good fix. I understand that you want to take away the pain of handling such errors in individual modules. 👍

@arelaxend arelaxend changed the title Selenium error ? Java null pointer error in fetch() Dec 27, 2016
buggtb added a commit that referenced this issue Jan 18, 2022
buggtb added a commit that referenced this issue Jan 22, 2022