Feature improvements #2

Closed
GoogleCodeExporter opened this issue Dec 3, 2015 · 53 comments

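@GoogleCodeExporter

1. Crawl all similarly patterned URLs across an entire site from a single entry URL (2012.05.26)
2. Allow collected data to be edited before it is published. Feature points: data editing, manual publishing (2012.05.26)
  Manual publishing completed: 2012.06.01
3. URL storage rules: add a "must contain" rule field (2012.05.27)
  Completed: 2012.06.01
4. Allow the crawler to filter out duplicate collected data once, at fetch time (2012.05.31)
  Completed: 2012.06.02
5. Do not save collected data, to reduce GAE datastore quota usage (2012.05.31)
  Completed: 2012.06.03
6. Make the crawl rate configurable. Controlling the crawler app's concurrency also reduces the number of instances started and lowers Frontend Instance Hours quota usage (2012.06.05)
  Completed: 2012.06.11
7. Allow specific rules to be ticked for a collection test (2012.05.31)
  Completed: 2012.06.11
8. Export collected URLs and collected data to Excel (2012.06.12)
9. Implement OCR (2012.06.12)
  Completed: 2012.06.12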
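Item 4 (filtering duplicates at fetch time) could look roughly like the sketch below. It assumes records are plain dicts and that two records count as duplicates when all of their fields are equal. `fingerprint` and `drop_duplicates` are hypothetical names for illustration, not functions from this project.

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, Set


def fingerprint(record: Dict[str, str]) -> str:
    """Return a stable hash of a record's fields (assumed duplicate key)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def drop_duplicates(records: Iterable[Dict[str, str]], seen: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield each record the first time its fingerprint is seen."""
    for record in records:
        key = fingerprint(record)
        if key in seen:
            continue  # already collected, skip it
        seen.add(key)
        yield record


# The same `seen` set is shared across batches so repeats are caught later too.
seen_keys: Set[str] = set()
batch_1 = list(drop_duplicates(
    [{"title": "A"}, {"title": "B"}, {"title": "A"}], seen_keys))
batch_2 = list(drop_duplicates(
    [{"title": "B"}, {"title": "C"}], seen_keys))
```

Storing only fingerprints rather than whole records keeps the filter small, which fits the goal in item 5 of not saving collected data.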

Original issue reported on code.google.com by app.l...@gmail.com on 21 Aug 2012 at 8:48

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

10. Support file uploads in the publishing feature
11. Support attachments in email notifications

Original comment by app.l...@gmail.com on 19 Sep 2012 at 2:08

@GoogleCodeExporter

12. Add a filter interface and a filter plugin to the collection rules

Original comment by app.l...@gmail.com on 20 Sep 2012 at 3:34

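@GoogleCodeExporter

13. Loop matching: when collection rules are not allowed to be empty and some of the rules for one record match an empty value, no further records can be matched. The logic needs to change so that matching continues with the next record from the last matched position.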
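A minimal sketch of the resumed matching described in item 13, assuming each collection rule is a regular expression with one capture group and that rules are applied in order from a moving cursor. `extract_records` and the sample rules are hypothetical, not the project's actual rule engine.

```python
import re
from typing import Dict, List, Pattern, Tuple

Rule = Tuple[str, Pattern[str]]


def extract_records(text: str, rules: List[Rule]) -> List[Dict[str, str]]:
    """Match records in a loop, skipping a record with an empty field
    and resuming from the last matched position instead of stopping."""
    records: List[Dict[str, str]] = []
    pos = 0
    while pos < len(text):
        record: Dict[str, str] = {}
        cursor = pos
        exhausted = False
        for name, pattern in rules:
            match = pattern.search(text, cursor)
            if match is None:
                exhausted = True  # no more candidates in the page
                break
            cursor = match.end()  # remember the last matched position
            value = match.group(1).strip()
            if not value:
                break  # empty field: drop this record, keep the cursor
            record[name] = value
        if exhausted or cursor == pos:
            break  # nothing left to match, or no progress was made
        if len(record) == len(rules):
            records.append(record)
        pos = cursor  # continue with the next record from here
    return records


rules = [
    ("title", re.compile(r"<h2>(.*?)</h2>")),
    ("price", re.compile(r"<span>(.*?)</span>")),
]
page = ("<h2>A</h2><span>1</span>"
        "<h2>B</h2><span></span>"
        "<h2>C</h2><span>3</span>")
found = extract_records(page, rules)
```

In this sample the second record has an empty price, so it is dropped, and the third record is still collected.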

Original comment by app.l...@gmail.com on 6 Nov 2012 at 9:27

@GoogleCodeExporter

14. Allow a pagination rule's index number to be the same as a collection rule's index number

Original comment by app.l...@gmail.com on 9 Nov 2012 at 9:01

@GoogleCodeExporter

15. Updating a newly added collection rule produces an error

Original comment by app.l...@gmail.com on 9 Nov 2012 at 9:10

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

16. Account management: language and time zone settings

Original comment by app.l...@gmail.com on 26 Jan 2013 at 3:25

@GoogleCodeExporter

17. When the value of a select is changed, the select on every page should be updated

Original comment by app.l...@gmail.com on 26 Jan 2013 at 3:27

@GoogleCodeExporter

18. Add "tag combination" to the collection rules

Original comment by app.l...@gmail.com on 15 Apr 2013 at 5:59

@GoogleCodeExporter

19. Redesign logging: (1) keep logs in memory (GAE); (2) store logs in the DB.
20. Allow the crawl rate of each site to be set from the front end.

Original comment by app.l...@gmail.com on 20 Jul 2013 at 8:57

@GoogleCodeExporter

21. Task queue statistics, collected URLs (daily statistics), collected data (daily statistics)
22. View logs from the front end

Original comment by app.l...@gmail.com on 10 Aug 2013 at 1:29

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

23. Change how the collection rule area is displayed when the data source is another tag.

Original comment by app.l...@gmail.com on 29 Nov 2013 at 6:59

@GoogleCodeExporter

24. Improve asynchronous loading of tabs
25. Improve cookie handling for nested collection

Original comment by app.l...@gmail.com on 18 Feb 2014 at 1:40

@GoogleCodeExporter

26. Allow viewing the run status of collection rules in scheduled tasks
27. When an XPath is read, allow it to be added directly to the collection rules

Original comment by app.l...@gmail.com on 13 Oct 2014 at 8:01

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

29. Add querying by storage date to the collected-data list page
30. Have the count statistics also track the type field

Original comment by app.l...@gmail.com on 13 Oct 2014 at 8:32

@GoogleCodeExporter

31. When a collection test matches no data, report which rule failed to match

Original comment by app.l...@gmail.com on 16 Oct 2014 at 8:36

@GoogleCodeExporter

32. Site Management > HTTP request configuration: the window cannot be opened

Original comment by app.l...@gmail.com on 17 Oct 2014 at 1:03

@GoogleCodeExporter

33. Layout problem when merging collection rule fields

Original comment by app.l...@gmail.com on 23 Oct 2014 at 3:27

@GoogleCodeExporter

34. JS dependency analysis fails

Original comment by app.l...@gmail.com on 23 Oct 2014 at 3:30

@GoogleCodeExporter

35. When a load fails with an error, close the loading mark

Original comment by app.l...@gmail.com on 23 Oct 2014 at 6:36

@GoogleCodeExporter

36. Wrong start index when querying on the data list page

Original comment by app.l...@gmail.com on 23 Oct 2014 at 8:18

@GoogleCodeExporter

37. Add execution logs for scheduled tasks
38. Add enqueue statistics and completion statistics to the "automatic data collection" scheduled task.

Original comment by app.l...@gmail.com on 6 Nov 2014 at 3:10

@GoogleCodeExporter

39. Change the site encoding option "auto-detect" so that the encoding is detected on every fetch

Original comment by app.l...@gmail.com on 6 Nov 2014 at 3:15

@GoogleCodeExporter

40. Change the class used by the XPATH extraction tool to avoid class conflicts

Original comment by truetrue...@gmail.com on 19 Nov 2014 at 3:46

@GoogleCodeExporter

41. Add outerHTML, innerHTML, and innerTEXT attributes to XPATH matching

Original comment by truetrue...@gmail.com on 25 Nov 2014 at 3:20

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

43. Add collection queue management, such as deleting, stopping, and running queues

Original comment by truetrue...@gmail.com on 28 Nov 2014 at 3:37

@GoogleCodeExporter

44. Refresh statistics data automatically

Original comment by truetrue...@gmail.com on 9 Dec 2014 at 6:57

@GoogleCodeExporter

45. Export to CSV

Original comment by truetrue...@gmail.com on 9 Dec 2014 at 7:09

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

47. Offer the collector as a service: open a collection API that supports both asynchronous and synchronous responses

Original comment by truetrue...@gmail.com on 16 Dec 2014 at 1:44

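@GoogleCodeExporter

48. Add a "maximum collection queue count" setting to Site Management; when it is empty or less than 1, no limit applies. When a scheduled task runs "automatic data collection", it checks the number of unfinished tasks for the current site and does not start this collection run if the limit is exceeded. This avoids starting too many tasks and exhausting system resources.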
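The check in item 48 could be sketched as follows. This assumes the limit is stored as an optional integer and reads "exceeded" literally, so a pending count equal to the limit still allows a run. `may_start_collection` is a hypothetical name.

```python
from typing import Optional


def may_start_collection(pending_tasks: int, max_queue: Optional[int]) -> bool:
    """Decide whether a scheduled collection run may start for a site."""
    # An empty setting or a value below 1 means "no limit".
    if max_queue is None or max_queue < 1:
        return True
    # Assumption: only a count strictly above the limit blocks the run.
    return pending_tasks <= max_queue


unlimited = may_start_collection(pending_tasks=500, max_queue=None)
at_limit = may_start_collection(pending_tasks=5, max_queue=5)
over_limit = may_start_collection(pending_tasks=6, max_queue=5)
```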

Original comment by truetrue...@gmail.com on 16 Dec 2014 at 7:00

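@GoogleCodeExporter

49. Improve the web side:
1. Improve response speed: CDN acceleration, multi-node synchronization (smart DNS acceleration)
2. Use a queuing mechanism for online GAE installation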

Original comment by truetrue...@gmail.com on 24 Dec 2014 at 7:49

@GoogleCodeExporter

50. Queue SYNC_FULL needs logic to handle CPU timeouts

Original comment by truetrue...@gmail.com on 14 Jan 2015 at 1:33

@GoogleCodeExporter

51. Adding URLs in batches: change
"Separate multiple URLs with '|$|'"
to
"Separate multiple URLs with a 'line break' or '|$|'"
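As a rough illustration of accepting both separators from item 51, assuming blank entries should be ignored and surrounding whitespace trimmed (`split_urls` is a hypothetical helper):

```python
import re
from typing import List

# Split on the literal '|$|' token or on a line break (LF or CRLF).
_SEPARATOR = re.compile(r"\|\$\||\r?\n")


def split_urls(raw: str) -> List[str]:
    """Split a batch input into individual URLs, dropping blank entries."""
    return [part.strip() for part in _SEPARATOR.split(raw) if part.strip()]


urls = split_urls(
    "http://a.example/1|$|http://a.example/2\n"
    "http://a.example/3\r\n"
)
```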

Original comment by truetrue...@gmail.com on 16 Jan 2015 at 8:14

@GoogleCodeExporter

52. Implement password recovery

Original comment by truetrue...@gmail.com on 19 Jan 2015 at 8:31

@GoogleCodeExporter

53. Global server selection for newcrawler.com

Original comment by truetrue...@gmail.com on 19 Jan 2015 at 8:33

@GoogleCodeExporter

54. Hide the data publishing rules by default and add a button to show them

Original comment by truetrue...@gmail.com on 19 Jan 2015 at 8:36

@GoogleCodeExporter

[deleted comment]

@GoogleCodeExporter

55. Quick start: add visual rule creation
56. Add a data query API that provides JSON and CSV formats.
57. Crawler pool configuration: implement load balancing

Original comment by app.l...@gmail.com on 12 Mar 2015 at 3:15

@GoogleCodeExporter

58. Show a loading image during asynchronous queries

Original comment by app.l...@gmail.com on 12 Mar 2015 at 3:26

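@GoogleCodeExporter

59. Allow a "trigger fetch exception" setting to be configured for each site
   After the page content is fetched, check whether it contains exception text (such as an anti-crawler CAPTCHA prompt). If it does, the system throws a fetch exception and, by default, retries the fetch once.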
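One way the behaviour in item 59 might be sketched, assuming the per-site setting is a list of marker strings and the fetcher is any callable that returns page text. `FetchException`, `fetch_checked`, and the `retries` parameter are hypothetical names.

```python
from typing import Callable, Optional, Sequence


class FetchException(Exception):
    """Raised when fetched content still contains a configured marker."""


def fetch_checked(url: str,
                  fetch: Callable[[str], str],
                  error_markers: Sequence[str],
                  retries: int = 1) -> str:
    """Fetch a page and retry (once by default) if it contains exception text."""
    hit: Optional[str] = None
    for _attempt in range(retries + 1):
        body = fetch(url)
        # Find the first configured marker present in the page, if any.
        hit = next((m for m in error_markers if m in body), None)
        if hit is None:
            return body
    raise FetchException(f"{url}: page still contains {hit!r}")


# Fake fetcher: the first response is a CAPTCHA page, the second is normal.
responses = iter(["Please enter the CAPTCHA", "<html>ok</html>"])
page = fetch_checked("http://a.example/", lambda u: next(responses), ["CAPTCHA"])
```

If the retry also returns a page with a marker, the sketch raises `FetchException` so the caller can record the failure.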

Original comment by truetrue...@gmail.com on 25 May 2015 at 8:57

@GoogleCodeExporter

60. Add custom crawl rates

Original comment by truetrue...@gmail.com on 26 May 2015 at 1:18

@GoogleCodeExporter

61. Verify that the locale in the cookie matches the language currently selected in the system

Original comment by truetrue...@gmail.com on 29 May 2015 at 1:45

@GoogleCodeExporter

62. Crawler statistics are not taking effect

Original comment by truetrue...@gmail.com on 29 May 2015 at 1:45

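@GoogleCodeExporter

63. Allow a default crawl rate to be configured for the crawler
64. Callback detection time. Description: the collector calls the crawler asynchronously. When the crawler fails to return a result for some reason, the URL has to be collected again. The callback detection time defines how long the crawler may go without returning before re-collection is triggered.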
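The callback detection time in item 64 can be illustrated with a small sketch. It assumes the collector records a dispatch timestamp per URL and periodically scans for entries older than the configured time. `overdue_urls` and the variable names are hypothetical.

```python
from typing import Dict, List


def overdue_urls(dispatched_at: Dict[str, float],
                 now: float,
                 callback_timeout: float) -> List[str]:
    """Return URLs whose crawler call has not returned within the timeout."""
    return [url for url, started in dispatched_at.items()
            if now - started >= callback_timeout]


# Dispatch times in seconds; in practice these would come from time.time().
in_flight = {
    "http://a.example/1": 100.0,
    "http://a.example/2": 160.0,
}
to_retry = overdue_urls(in_flight, now=200.0, callback_timeout=60.0)
```

URLs returned by the scan would be re-queued, and their entries removed once the crawler reports a result.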

Original comment by app.l...@gmail.com on 25 Jun 2015 at 3:48

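@GoogleCodeExporter

65. Compare versions after login and show a prominent notice when an update is needed
66. Log view: right-align length and change its unit to KB; widen lastmodified
67. Add password authentication for remote access to the crawler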

Original comment by app.l...@gmail.com on 25 Jun 2015 at 9:14

@GoogleCodeExporter

68. Link "Help" on the login page to the WIKI

Original comment by truetrue...@gmail.com on 29 Jun 2015 at 6:05

@GoogleCodeExporter

69. Add a terms of service page

Original comment by truetrue...@gmail.com on 29 Jun 2015 at 6:06

@speed closed this as completed Dec 3, 2015