Custom Scrapy spiders

Preparation

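To install Scrapy on a Mac, follow the blog post "mac安装scrapy实践" (a walkthrough of installing Scrapy on a Mac). Why that post? Because the command from the official Scrapy site, pip install scrapy, did not work well for me.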

catchYou

Find people on Weibo by scanning a range of user IDs.

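Data is fetched through an API because Weibo places many restrictions on crawling; for example, content on weibo.com cannot be crawled. The spider therefore uses the API of the current Weibo mobile site on the cn domain: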

User page

https://m.weibo.cn/api/container/getIndex?type=uid&value={usr_id}

The response from this page contains two containerid values, which are used as inputs to the URLs of the two APIs below.
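As a rough illustration of this step, the sketch below builds the user page URL for each ID in a range and pulls the containerid values out of a response body. The function names are hypothetical, and the JSON layout (tabs under `data.tabsInfo.tabs`, each with `tab_type` and `containerid` keys) is an assumption that is not stated in this README; check it against a real response before relying on it.

```python
import json

# URL template copied from the user page API above.
USER_INDEX_URL = "https://m.weibo.cn/api/container/getIndex?type=uid&value={usr_id}"


def user_index_urls(first_id, last_id):
    """Yield the user page URL for every ID in the inclusive range."""
    for usr_id in range(first_id, last_id + 1):
        yield USER_INDEX_URL.format(usr_id=usr_id)


def extract_container_ids(body):
    """Return {tab_type: containerid} from a user page response body.

    ASSUMPTION: tabs live under data.tabsInfo.tabs and each tab has
    "tab_type" and "containerid" keys. This layout is not confirmed here.
    """
    payload = json.loads(body)
    tabs = payload.get("data", {}).get("tabsInfo", {}).get("tabs", [])
    if isinstance(tabs, dict):  # tolerate a mapping instead of a list
        tabs = list(tabs.values())
    return {
        tab["tab_type"]: tab["containerid"]
        for tab in tabs
        if isinstance(tab, dict) and "tab_type" in tab and "containerid" in tab
    }
```

In a spider, `user_index_urls` would feed the start requests and `extract_container_ids` would run in the callback on the response text.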

User information `profile` page

https://m.weibo.cn/api/container/getIndex?containerid={oid}&type=uid&value={uid}&page={page}

User's Weibo posts `tweets` page

https://m.weibo.cn/api/container/getIndex?containerid={oid}&type=uid&value={uid}&page={page}
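Both the `profile` and `tweets` pages share the same URL shape and differ only in the containerid passed in. A minimal sketch of a URL builder follows; `container_page_url` is a hypothetical helper name, and the values in the usage note are placeholders rather than real IDs.

```python
from urllib.parse import urlencode

API_BASE = "https://m.weibo.cn/api/container/getIndex"


def container_page_url(oid, uid, page=1):
    """Build the paged URL for a profile or tweets container.

    oid is the containerid taken from the user page response,
    uid is the user ID, and page is the 1-based page number.
    """
    params = {"containerid": oid, "type": "uid", "value": uid, "page": page}
    # urlencode keeps the insertion order of the dict and escapes values.
    return API_BASE + "?" + urlencode(params)
```

For example, `container_page_url("OID", 123, page=2)` yields the URL with `containerid=OID&type=uid&value=123&page=2` as its query string.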

Commands:

cd catchYou/catchYou/spiders

scrapy crawl catchYouSpider

This crawls each user's ID, Nickname, Profile ContainerID, and Tweets ContainerID.
The crawl rate is limited by a DELAY of 1s.
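If the 1s DELAY is applied through Scrapy's own throttling, the matching entry in the project's settings.py would look roughly like this. That the project sets it this way is an assumption; `DOWNLOAD_DELAY` itself is a standard Scrapy setting.

```python
# settings.py fragment (assumed location of the delay configuration)

# Wait about 1 second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1
```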

doubanImageBeta

A spider that reads Douban movie poster pages and downloads the images.

play4data

This is a somewhat larger spider.
It currently reads BUBS through the Douban Labs API and stores the data.
Storage uses PostgreSQL.
Scrapy 0.14 + PostgreSQL 9.3
