Crawler on the develop branch cannot terminate on its own #34
Comments
It looks like there were no more users left to crawl. Did you crawl the friends pages?
I modified the program so that it does not crawl friends pages, including followees and followers.
Cola's mechanism for crawling Weibo works like this: first, the initial users (starts) are pushed into a queue; while a user is being crawled, his friends are pushed into the same queue; the next user to crawl is taken from that queue. So if the friends pages are not crawled, only the initial users will ever be crawled.
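The queue-driven mechanism described above can be sketched roughly like this (a minimal illustration only; `fetch_weibo` and `fetch_friends` are hypothetical stand-ins for cola's real parsers, not actual cola APIs):

```python
from collections import deque

def crawl(starts, fetch_friends, fetch_weibo):
    """Breadth-first crawl over users, as described above.

    `fetch_friends` and `fetch_weibo` are illustrative callables,
    not part of cola's real interface.
    """
    queue = deque(starts)      # initial users (starts) are pushed first
    seen = set(starts)
    while queue:               # if fetch_friends yields nothing,
        user = queue.popleft() # only the starts are ever crawled
        fetch_weibo(user)
        for friend in fetch_friends(user):  # followees and followers
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
```

With friend crawling disabled (an empty `fetch_friends`), the queue drains after the starts, which matches the behaviour reported above.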
The system is very stable; it crawled Weibo posts from several dozen users. Many thanks. However, I don't know how to make cola keep crawling the new posts these users publish later. Right now, once cola finishes crawling, it repeatedly prints "no budget left to process", the same situation the original poster ran into, and it does not pick up newly published posts. Even if I shut cola down and run it again, the behavior is the same. This is the develop branch, standalone mode, essentially all default settings.
@windch: Successfully crawling a few dozen users hardly proves the system is stable, does it? Also, cola's logic is to crawl all posts a user has published up to the current time; each user is treated as a bundle, and once that user has been crawled, the bundle is complete. If you need to crawl newly published posts, you should write some logic of your own.
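One possible shape for that user-side logic is sketched below (purely illustrative; `get_statuses`, the status dict layout, and `last_seen` are assumptions for the sketch, not cola's API):

```python
def fetch_new_posts(users, get_statuses, last_seen):
    """Refetch the timeline of each already-finished user (bundle) and
    keep only statuses newer than the last id recorded for that user.

    `get_statuses` and `last_seen` are hypothetical; cola itself does
    not expose this interface.
    """
    new_posts = {}
    for user in users:
        statuses = get_statuses(user)
        new_posts[user] = [s for s in statuses
                           if s['id'] > last_seen.get(user, 0)]
        if new_posts[user]:
            # remember the newest id so the next pass skips old posts
            last_seen[user] = max(s['id'] for s in new_posts[user])
    return new_posts
```

Running such a pass periodically over the finished bundles would approximate the incremental behaviour the poster is asking for.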
@windch The develop branch should support incremental crawling; setting inc to yes in the configuration enables it.
One can judge the whole from a small sample, which is why I say cola is very stable :) It is the develop branch, with the default inc: yes. I ran it again, and it still does not seem to pick up the latest posts. $ python init.py
OK, this may be a bug. First check whether /tmp/cola/worker/&lt;job_id&gt;/mq/inc contains any files, and whether those files have content.
There is no cola directory under /tmp. I am running python init.py standalone. $ ll /tmp/cola/worker//mq/inc
You need to go into /tmp/cola first; the directory under worker is a job id, generated from the job name.
Found it. It is there.
OK, then head that file and see whether it has any data.
head 9223372036854775807�^pccopy_reg tail 9223372036854775807ag12 |
Then there should be data in it, so when the inc stage runs it should be able to read that data. For this problem, please open a new issue and describe it; I will fix it soon.
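Before opening the new issue, a quick sanity check on the inc file can be done like this (a minimal sketch; the on-disk record framing of the mq store is internal to cola, so this only verifies that the file exists and is non-empty — though the `ccopy_reg` bytes in the head output above do suggest pickled objects):

```python
import os

def inc_store_has_data(path):
    """Return True if the inc queue file exists and is non-empty.

    We deliberately do not try to unpickle the contents: the exact
    record layout is internal to cola's mq implementation.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```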
Thanks a lot!
I am crawling Weibo data with the develop branch and found that it can no longer terminate on its own. The final (partial) output is as follows:
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
start to process priority: 2
start to process priority: inc
start to process priority: 0
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 2
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: inc
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 0
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
no budget left to process
start to process priority: 1
no budget left to process
no budget left to process
no budget left to process
^CCatch interrupt signal, start to stop
Counters during running:
{'finishes': 1,
'pages': 3,
'secs': 1.501857042312622}
Processing shutting down
Shutdown finished
The configuration is as follows:
job:
  db: kweibo
  mode: bundle # also can be bundle
  size: 50 # the destination (including bundle or url) size
  speed:
    max: 20 # to the cluster, -1 means no restrictions, if greater than 0, means webpages opened per minute
    single: -1 # max restrictions to a single instance
    adaptive: no
  instances: 1
  priorities: 3 # priorities queue count in mq
  copies: 1 # redundant size of objects in mq
  inc: yes
  shuffle: no # only work in bundle mode, means the urls in a bundle will shuffle before fetching
  clear: yes
  error:
    network:
      retries: 0 # 0 means no retry, -1 means keeping on trying
      span: 20 # seconds span to retry
      ignore: yes # only work under bundle mode, if True will ignore this url and move to the next after several tries, or move to the next bundle
    server: # like 404 or 500 error returned by server
      retries: 5
      span: 10
      ignore: no
  components:
    deduper:
      cls: cola.core.dedup.FileBloomFilterDeduper
(There is some additional configuration of my own below this, but it should be unrelated to this issue.)