
Crawler on the develop branch cannot terminate on its own #34

Closed
hitalex opened this issue Nov 20, 2014 · 15 comments

hitalex commented Nov 20, 2014

I was using the develop branch to crawl Weibo data and found that it can no longer terminate on its own. The final (partial) output is as follows:

start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 3,
'secs': 1.501857042312622}
Processing shutting down
Shutdown finished

The configuration is as follows:

job:
  db: kweibo
  mode: bundle # also can be url
  size: 50 # the destination (including bundle or url) size
  speed:
    max: 20 # to the cluster, -1 means no restrictions, if greater than 0, means webpages opened per minute
    single: -1 # max restrictions to a single instance
    adaptive: no
  instances: 1
  priorities: 3 # priorities queue count in mq
  copies: 1 # redundant size of objects in mq
  inc: yes
  shuffle: no # only work in bundle mode, means the urls in a bundle will shuffle before fetching
  clear: yes
  error:
    network:
      retries: 0 # 0 means no retry, -1 means keeping on trying
      span: 20 # seconds span to retry
      ignore: yes # only work under bundle mode, if True will ignore this url and move to the next after several tries, or move to the next bundle
    server: # like 404 or 500 error returned by server
      retries: 5
      span: 10
      ignore: no
  components:
    deduper:
      cls: cola.core.dedup.FileBloomFilterDeduper
(Below that there are some settings I added myself, which should be unrelated to this issue.)

qinxuye commented Nov 20, 2014

It looks like there are simply no users left to crawl. Are the friend pages being crawled?

hitalex commented Nov 20, 2014

I modified the program so that it does not crawl the friend pages, i.e. neither followees nor followers.

qinxuye commented Nov 20, 2014

The way cola crawls Weibo is this: first the initial users (starts) are pushed into a queue; for a user currently being crawled, his friends are pushed into the same queue; the next user to crawl is taken from that queue. So if the friend pages are not crawled, only the initial users will ever be crawled.
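For illustration only (this is not Cola's actual code; fetch_user and get_friends are hypothetical stand-ins for the real parsers), the queue-driven flow described above looks roughly like this:

from collections import deque

def crawl(starts, fetch_user, get_friends, crawl_friends=True):
    # Minimal sketch of the flow described above, under the stated assumptions.
    #   starts        -- initial user ids pushed into the queue
    #   fetch_user    -- hypothetical callable that crawls one user's weibo pages
    #   get_friends   -- hypothetical callable returning a user's friend ids
    #   crawl_friends -- when False (as in this issue), nothing new is enqueued,
    #                    so only the start users are ever crawled
    queue = deque(starts)
    seen = set(starts)
    while queue:
        uid = queue.popleft()              # next user to crawl comes from the queue
        fetch_user(uid)
        if crawl_friends:
            for friend in get_friends(uid):
                if friend not in seen:     # friends are pushed into the same queue
                    seen.add(friend)
                    queue.append(friend)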

ghost commented Nov 27, 2014

The system is very stable; it has crawled the posts of a few dozen users. Many thanks.

However, I don't know how to make cola keep crawling the new posts these users publish later on. Right now, once cola finishes crawling it just keeps printing "no budget left to process", the same situation the original poster ran into, and newly added posts are not crawled. Even shutting cola down and running it again makes no difference.

develop branch, single-machine mode, essentially all default settings.

hitalex commented Nov 28, 2014

@windch: successfully crawling a few dozen users does not really prove the system is stable, does it?

Besides, cola's logic is to crawl every post a user has published up to the current time; each user is treated as a bundle, and once that user has been crawled the bundle is finished. If you need to crawl newly published posts, you should write some logic of your own.

qinxuye commented Nov 28, 2014

@windch The develop branch should support incremental crawling; setting inc to yes in the config enables it.
The idea is that after a bundle finishes crawling it is pushed to the incremental queue, and that queue is allocated a certain time slice to run.
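To picture what the "start to process priority: 0/1/2/inc" lines in the logs correspond to, here is a rough sketch of cycling through per-priority time slices; the slice lengths and the budget_left / process_bundle names are invented for the sketch and are not Cola's actual API:

import time
from collections import deque

PRIORITIES = ('0', '1', '2', 'inc')
SLICE_SECS = {'0': 4, '1': 2, '2': 1, 'inc': 1}   # assumed: higher priority, bigger slice

def run_once(queues, budget_left, process_bundle):
    for prio in PRIORITIES:
        print('start to process priority:', prio)
        deadline = time.time() + SLICE_SECS[prio]
        while time.time() < deadline and queues[prio]:
            if not budget_left():
                # the message that repeats endlessly in the logs above
                print('no budget left to process')
                break
            bundle = queues[prio].popleft()
            process_bundle(bundle)
            if prio != 'inc':
                # a finished bundle is pushed to the incremental queue,
                # which later receives its own time slice
                queues['inc'].append(bundle)

if __name__ == '__main__':
    queues = {p: deque() for p in PRIORITIES}     # empty demo queues
    run_once(queues, budget_left=lambda: False, process_bundle=print)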

ghost commented Nov 28, 2014

From a small sample you can judge the whole, so I'd still say cola is very stable :)

It is the develop branch, with the default config where inc is yes. I ran it again and it still does not seem to crawl the latest posts.
There are 65 start uids.

$ python init.py
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
/opt/cola/cola/core/opener.py:108: UserWarning: gzip transfer encoding is experimental!
self.browser.set_handle_gzip(True)
start to process priority: 0
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
...
...
^CCatch interrupt signal, start to stop
Counters during running:
{'error_urls': 20,
'finishes': 65,
'pages': 7321,
'secs': 15064.990124702454}
Processing shutting down
Shutdown finished
Job id:8ZcGfAqHmzc finished, spend 175.00 seconds for running

qinxuye commented Nov 28, 2014

OK.

This may well be a bug. First check whether there are any files under /tmp/cola/worker/<job_id>/mq/inc and whether those files have content.

ghost commented Nov 28, 2014

There is no cola directory under /tmp. I am running python init.py on a single machine.

$ ll /tmp/cola/worker//mq/inc
ls: cannot access /tmp/cola/worker//mq/inc: No such file or directory
$ ll /tmp/cola
ls: cannot access /tmp/cola: No such file or directory

qinxuye commented Nov 28, 2014

You need to go into /tmp/cola first; the directory under worker is a job id, generated from the job name.

ghost commented Nov 28, 2014

Found it. Under /tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc there is a file named 9223372036854775807, 4194304 bytes in size.

qinxuye commented Nov 28, 2014

OK, then head that file and see whether there is any data in it.

ghost commented Nov 28, 2014

head 9223372036854775807

�^pccopy_reg
reconstructor
p1
(cweibo.bundle
WeiboUserBundle
p2
c__builtin
_
object
p3
NtRp4

tail 9223372036854775807

ag12
ag43
ag647
ag54
asbsbsg1253
g1255
sS'last_error_page_times'
p1259
I0
sb.
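The head/tail output above looks like protocol-0 pickle data for weibo.bundle.WeiboUserBundle objects. As a rough sanity check (assuming nothing about Cola's store layout beyond the raw bytes), one could count how many bundle records appear in the file without trying to unpickle it:

# Illustrative check only: counts occurrences of the bundle class name in the
# raw bytes of the inc store file reported above.
path = '/tmp/user/1000/cola/worker/8ZcGfAqHmzc/mq/inc/9223372036854775807'

with open(path, 'rb') as f:
    data = f.read()

print('file size:', len(data))
print('approx. WeiboUserBundle records:', data.count(b'WeiboUserBundle'))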

qinxuye commented Nov 28, 2014

Then there is data in it, and when the inc priority comes around it should be able to pull that data out. Please open a new issue describing this problem, and I will fix it soon.

ghost commented Nov 28, 2014

Thanks a lot!

qinxuye closed this as completed Nov 28, 2014