Replace speed/newcrawler/war to speed/windows-64bit-jetty-jre/war??? #91

Open
whairg opened this issue Mar 22, 2020 · 15 comments

@whairg

whairg commented Mar 22, 2020

How do I do the step "Replace speed/newcrawler/war to speed/windows-64bit-jetty-jre/war"?
Which file do I need to modify?

1.Download NewCrawler war:

https://github.com/speed/newcrawler
2.Replace speed/newcrawler/war to speed/windows-64bit-jetty-jre/war

3.Run start.bat

4.http://127.0.0.1:8500/

@speed
Owner

speed commented Mar 23, 2020

1. Download these two archives:
https://github.com/speed/windows-64bit-jetty-jre/archive/master.zip and extract it as windows-64bit-jetty-jre
https://github.com/speed/newcrawler/archive/master.zip and extract it as newcrawler

2. Copy newcrawler/war over windows-64bit-jetty-jre/war, replacing it.

3. Double-click start.bat to run it.

4. Wait a moment, then open http://127.0.0.1:8500/ in a browser.

5. You need to register an account at newcrawler.com.
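
Step 2 above can also be scripted. The following is a minimal sketch, not part of NewCrawler: the function name `replace_war` is made up for illustration, and it assumes that "replace" means copying the contents of `newcrawler/war` over `windows-64bit-jetty-jre/war` and overwriting files with the same name.

```python
import shutil
from pathlib import Path


def replace_war(newcrawler_dir, jetty_dir):
    """Copy <newcrawler_dir>/war over <jetty_dir>/war, overwriting same-named files.

    Hypothetical helper for illustration; the directory names come from the
    steps in this thread, everything else is an assumption.
    """
    source = Path(newcrawler_dir) / "war"
    target = Path(jetty_dir) / "war"
    if not source.is_dir():
        raise FileNotFoundError(f"missing war directory: {source}")
    # dirs_exist_ok (Python 3.8+) merges into an existing target directory
    # and overwrites files that already exist there.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

If both archives were extracted side by side, calling `replace_war("newcrawler", "windows-64bit-jetty-jre")` from that parent directory would perform the copy. Files that exist only in the old `war` directory are left in place by this sketch.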

@whairg
Author

whairg commented Mar 23, 2020

HTTP ERROR: 503
Problem accessing /. Reason:

Service Unavailable

Powered by Jetty://
This is the error I get.

@whairg
Author

whairg commented Mar 23, 2020

image
This is what it shows at startup.

@speed
Owner

speed commented Mar 23, 2020

Could you also post a screenshot of the upper part of the exception?

@whairg
Author

whairg commented Mar 23, 2020

image
image
image
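Hello, this is all the output shown after clicking start.bat. The server currently runs Windows Server 2012 R2.
image
This is the error shown when opening http://127.0.0.1:8500/.
image
This is the Java version.
image
Compiling with javac works fine, so the Java environment is OK.
image
image
These are the files; they have all been copied over.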

@speed
Owner

speed commented Mar 23, 2020

The JRE bundled with NewCrawler is too old. Delete the following line from start.bat (I can see you have a JDK 1.8 environment installed):
set path="%~dp0jre\bin"
Then start it again.
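
For anyone who prefers to apply this edit with a script instead of a text editor, here is a rough sketch. The helper name `drop_bundled_jre_path` is hypothetical, and the pattern only assumes that start.bat contains the line quoted above; the rest of the file is passed through unchanged.

```python
import re

# Matches a whole line such as:  set path="%~dp0jre\bin"
# (case-insensitive, optional spaces and optional quotes).
_SET_PATH_TO_BUNDLED_JRE = re.compile(
    r'^\s*set\s+path\s*=\s*"?%~dp0jre\\bin"?\s*$', re.IGNORECASE
)


def drop_bundled_jre_path(bat_text):
    """Return the batch script text without the line that points PATH at the bundled JRE."""
    kept = [
        line
        for line in bat_text.splitlines(keepends=True)
        if not _SET_PATH_TO_BUNDLED_JRE.match(line)
    ]
    return "".join(kept)
```

Read start.bat into a string, pass it through `drop_bundled_jre_path`, and write the result back (keeping a backup copy first). With that line gone, the script falls back to whatever `java` is already on the system PATH, which is the point of the fix described above.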

@whairg
Author

whairg commented Mar 23, 2020

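Hello,

It opens now.
http://www.dianping.com/guangzhou/ch30/g141
This is the site I want to scrape, but when I enter it, this is what is displayed.
image
image
I also cannot select the fields to scrape the way the video shows.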

@speed
Owner

speed commented Mar 23, 2020

You have enabled Chrome plugin support, so you need to download
https://github.com/speed/newcrawler-plugin-urlfetch-chrome/archive/master.zip
and edit this plugin's configuration. The paths to the two files chromedriver.exe and ModHeader.crx must be correct.
image

@whairg
Author

whairg commented Mar 23, 2020

Hello,
image
image
Why can't the next page be fetched when I test this?

@whairg
Author

whairg commented Mar 23, 2020

It says to "set up the next-page link extraction rule".
How do I set up this next-page link extraction rule?

@whairg
Author

whairg commented Mar 23, 2020

image
Is this where the next-page extraction rule is entered? And what do http://${property3}?pageNo=${page(1,1,50)}&PARAM1=${3} and PARAM1=${3} mean?

@whairg
Author

whairg commented Mar 23, 2020

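image
One more question: how do I fix this garbled text?
Sorry, this is my first time using this, so I have a lot of questions. Thanks for your patience.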

@speed
Owner

speed commented Mar 23, 2020

Use a custom next-page CSS path:
div.page > a.next
image
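
To check outside NewCrawler whether the selector `div.page > a.next` matches anything in a given page source, a small standard-library sketch like the one below may help. It is not how NewCrawler evaluates CSS paths; `NextLinkFinder` and `find_next_page_url` are hypothetical names, and the matcher only handles this one selector shape (an `a` with class `next` that is a direct child of a `div` with class `page`).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements without a closing tag; they must not be pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NextLinkFinder(HTMLParser):
    """Collect href values of elements matching `div.page > a.next`."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, set of classes) for each currently open element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        parent = self.stack[-1] if self.stack else None
        is_next_anchor = tag == "a" and "next" in classes
        parent_is_page_div = (
            parent is not None and parent[0] == "div" and "page" in parent[1]
        )
        if is_next_anchor and parent_is_page_div and attr.get("href"):
            self.hrefs.append(attr["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def find_next_page_url(html, base_url):
    """Return the absolute URL of the first matching next-page link, or None."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.hrefs[0]) if finder.hrefs else None
```

If `find_next_page_url` returns `None` for the HTML the crawler actually received, the selector does not match that markup, and the rule would need to be adjusted to the real structure of the pagination block.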

@speed
Owner

speed commented Mar 23, 2020

So the page itself has no garbled text?

@whairg
Author

whairg commented Mar 23, 2020

Hello,

The page itself has no garbled text.

With the custom next-page CSS path
div.page > a.next, the test crawl still cannot collect the next page's information.
