Question about the page URL patterns used by the Sina Weibo crawler #30

Closed
hitalex opened this issue Nov 18, 2014 · 3 comments

Comments

hitalex commented Nov 18, 2014

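I now plan to crawl the comments on each Weibo post, so I need to modify the existing code. I have a question: contrib/sina/__init__.py defines a set of URL patterns, for example this one for Weibo posts:
Url(r'http://weibo.com/aj/mblog/mbloglist.*', 'micro_blog', MicroBlogParser),
The crawler visits http://weibo.com/aj/mblog/mbloglist.* pages. How did you arrive at this page pattern?

The approach I am familiar with is to first visit a Weibo page, such as http://weibo.com/p/1006061774908135/home?from=page_100606&mod=TAB#place, then inspect the page structure and extract the content with bs4 or lxml.

I would appreciate any pointers!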
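
For readers following along, here is a minimal sketch of how a pattern like the one quoted above separates AJAX endpoints from ordinary page URLs. It assumes the pattern string is applied with Python's `re.match`; how cola itself dispatches URLs is not shown in this thread. The helper name `matches_mblog` and the query parameters in `ajax_url` are made up for illustration.

```python
import re

# Pattern string copied from the Url(...) entry quoted above.
MBLOG_PATTERN = r'http://weibo.com/aj/mblog/mbloglist.*'


def matches_mblog(url):
    """Return True if url matches the mbloglist pattern from its start."""
    return re.match(MBLOG_PATTERN, url) is not None


# Hypothetical AJAX URL; the query parameters are illustrative only.
ajax_url = 'http://weibo.com/aj/mblog/mbloglist?uid=123&page=1'
# The ordinary page URL mentioned in the question.
page_url = 'http://weibo.com/p/1006061774908135/home?from=page_100606&mod=TAB#place'

print(matches_mblog(ajax_url))
print(matches_mblog(page_url))
```

Under that assumption, the AJAX URL matches and the ordinary profile page URL does not, which is why inspecting the rendered page alone would not reveal the pattern.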

Owner

qinxuye commented Nov 18, 2014

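cola already supports crawling Weibo comments. The configuration file has a comment: no setting; change it to yes.

P.S. For the single-machine version, it is best to use the code on the develop branch.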

Author

hitalex commented Nov 18, 2014

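I may need to customize what gets crawled, for example the content of the Weibo posts themselves, so I would still appreciate it if you could tell me how you obtained patterns like http://weibo.com/aj/mblog/mbloglist.*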

Owner

qinxuye commented Nov 18, 2014

Analyzing the AJAX requests that the web page makes should be enough to find these URLs.
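
One possible way to act on this advice: load a Weibo page with the browser's developer tools open, copy the request URLs from the network panel, and reduce them to patterns. The sketch below (Python 3) does that reduction. The captured URLs are hypothetical samples, and both the helper name `derive_patterns` and the `/aj/` path filter are assumptions made for illustration, not part of cola.

```python
from urllib.parse import urlsplit


def derive_patterns(captured_urls, marker='/aj/'):
    """Turn captured request URLs into 'scheme://host/path.*' pattern strings.

    Only URLs whose path contains `marker` are kept (an assumption: the
    AJAX endpoints of interest share that path segment). Dots are left
    unescaped to mirror the style of the pattern quoted in this thread.
    """
    patterns = []
    for url in captured_urls:
        parts = urlsplit(url)
        if marker not in parts.path:
            continue
        # Drop the query string and fragment, then allow any suffix.
        pattern = '%s://%s%s.*' % (parts.scheme, parts.netloc, parts.path)
        if pattern not in patterns:
            patterns.append(pattern)
    return patterns


# Hypothetical URLs, as they might be copied from the network panel.
captured = [
    'http://weibo.com/aj/mblog/mbloglist?uid=123&page=1',
    'http://weibo.com/aj/mblog/mbloglist?uid=123&page=2',
    'http://weibo.com/p/1006061774908135/home?from=page_100606',
]

patterns = derive_patterns(captured)
print(patterns)
```

The two paginated requests collapse into a single pattern, and the ordinary page URL is dropped because its path lacks the `/aj/` segment.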

hitalex closed this as completed Nov 18, 2014