
设计一个网页爬虫 翻译 #28 (Design a Web Crawler - Translation)

Merged (2 commits, May 12, 2017)

Conversation

xunge0613

Translation done... This is my first translation... please review and advise, everyone. Many thanks!

@sqrthree

lsvih (Member) commented May 5, 2017

@sqrthree I'll take the proofreading.

linhe0x0 (Member) commented May 6, 2017

OK.

* Search analytics
* Personalized search results
* Page rank
* 搜素分析

Review comment (Member): Typo: 搜素 should be 搜索.

* 用户很快就能看到搜索结果
* 网页爬虫不应该陷入死循环
* 当爬虫路径包含环的时候,将会陷入死循环
* 抓取 100 万个链接

Review comment (Member): Should be 10 亿 (1 billion), not 100 万 (1 million).

* 抓取 100 万个链接
* 要定期重新抓取页面以确保新鲜度
* 平均每周重新抓取一次,网站越热门,那么重新抓取的频率越高
* 每月抓取 400 万个链接

Review comment (Member): Should be 40 亿 (4 billion), not 400 万 (4 million).

* 要定期重新抓取页面以确保新鲜度
* 平均每周重新抓取一次,网站越热门,那么重新抓取的频率越高
* 每月抓取 400 万个链接
* 每个页面的平均存储大小: 500 KB

Review comment (Member): Use a full-width colon with no following space:

每个页面的平均存储大小: 500 KB
->
每个页面的平均存储大小:500 KB


Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/).

用更传统的系统来练习 —— 不要使用现成的系统,比如: [solr](http://lucene.apache.org/solr/) 或者 [nutch](http://nutch.apache.org/)

Review comment (Member): Suggested rewording:

不要使用现成的系统,比如: solr 或者 nutch
->
不要使用 solr、nutch 之类的现成的系统。

* 1,600 write requests per second
* 40,000 search requests per second
* 每月存储 2 PB 页面
* 每月抓取 400 万个页面,每个页面 500 KB

Review comment (Member): Should be 40 亿 (4 billion), not 400 万.
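The corrected figures above can be sanity-checked with a quick back-of-the-envelope calculation. The inputs (4 billion pages per month, 500 KB average page size) come straight from the lines under review; everything else is arithmetic:

```python
# Sanity check for the estimates under review:
# 4 billion pages crawled per month, 500 KB average page size.
pages_per_month = 4_000_000_000
page_size_kb = 500

# Storage per month: KB -> PB (decimal units, 1 PB = 10**12 KB)
storage_pb = pages_per_month * page_size_kb / 10**12
print(storage_pb)  # 2.0, i.e. 2 PB of page content per month

# Write rate: pages per month spread over ~2.6 million seconds
seconds_per_month = 30 * 24 * 3600
writes_per_second = pages_per_month / seconds_per_month
print(round(writes_per_second))  # 1543, roughly the quoted 1,600 per second
```

This confirms why the 40 亿 (4 billion) figure matters: with 400 万 (4 million) pages the storage would be only 2 TB per month, three orders of magnitude off the quoted 2 PB.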


We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc.

假设我们有一个初始列表 `links_to_crawl`(待抓取链接),它最初基于网站整体的知名度来排序。当然如果这个假设不合理,我们可以使用知名门户网站作为种子链接来进行扩散,例如: [Yahoo](https://www.yahoo.com/)、[DMOZ](http://www.dmoz.org/),等等。

Review comment (Member): Suggested rewording:

我们可以使用知名门户网站作为种子链接来进行扩散,例如: Yahoo、DMOZ,等等。
->
我们可以使用 Yahoo、DMOZ 等知名门户网站作为种子链接来进行扩散。


We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).

我们可以将 `links_to_crawl` `crawled_links` 记录在键-值型 **NoSQL 数据库**。对于 `links_to_crawl` 中已排序的链接,我们可以使用 [Redis](https://redis.io/) 的有序集合来维护网页链接的排名。我们应当在[选择 SQL 还是 NoSQL 的问题上,讨论有关使用场景以及利弊](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。

Review comment (Member): 记录在...中 (the phrase should close with 中, i.e. "store ... *in* the database").
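The Redis sorted-set idea in the passage above can be sketched in-process. This is a minimal stand-in, not a real Redis client: it uses a heap to mimic ZADD / ZPOPMAX ordering, and the class name, URLs, and scores are illustrative, not from the original:

```python
import heapq

class LinksToCrawl:
    """In-memory stand-in for a Redis sorted set of ranked crawl links.

    The highest-priority link comes out first, like ZPOPMAX.
    """
    def __init__(self):
        self._heap = []  # max-heap simulated via negated scores

    def add(self, url, priority):
        # Redis equivalent: ZADD links_to_crawl <priority> <url>
        heapq.heappush(self._heap, (-priority, url))

    def pop_best(self):
        # Redis equivalent: ZPOPMAX links_to_crawl
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

queue = LinksToCrawl()
queue.add("https://www.yahoo.com/", 9.5)
queue.add("https://example.com/", 1.0)
print(queue.pop_best())  # ('https://www.yahoo.com/', 9.5)
```

Unlike a real Redis sorted set, this sketch does not update the score of an already-present member on `add`; it simply pushes a duplicate entry.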

* For smaller lists we could use something like `sort | unique`
* With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1
* 假设数据量较小,我们可以用类似于 `sort | unique` 的方法。(译注: 先排序,后去重)
* 假设有 100 万条数据,我们应该使用 **MapReduce** 来输出只出现 1 次的记录。

Review comment (Member): Should be 10 亿 (1 billion), not 100 万.
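Both deduplication approaches in the bullets above can be illustrated in a few lines. The sample link list is made up, and the `Counter`-based function only approximates the distributed MapReduce frequency count on a single machine:

```python
from collections import Counter

# Small data: sort then deduplicate, the in-memory analogue of `sort | unique`.
links = ["a.com", "b.com", "a.com", "c.com"]
print(sorted(set(links)))  # ['a.com', 'b.com', 'c.com']

# 1 billion links: MapReduce-style frequency count, emitting only entries
# whose frequency is 1 (links that appear exactly once).
def entries_with_frequency_one(links):
    counts = Counter(links)  # the "reduce" step: total count per link
    return [link for link, n in counts.items() if n == 1]

print(entries_with_frequency_one(links))  # ['b.com', 'c.com']
```

At the 10 亿 (1 billion) scale the counting step would be sharded across machines by link hash, which is exactly what the MapReduce bullet implies.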

lsvih (Member) commented May 6, 2017

Almost perfect =w=. The main issue is that "billion" was occasionally misread as "million".

lsvih (Member) commented May 6, 2017

Also, there is no need to add a space between full-width punctuation and English words; otherwise the text looks too sparse. See the related style reference.

Commit: 根据校对意见修改 (revised per proofreading feedback)
xunge0613 (Author)

@lsvih Thanks for the careful and thorough proofreading!

@sqrthree Revisions based on the proofreading feedback are complete.

xunge0613 changed the title from "Design a web crawler 翻译" to "设计一个网页爬虫 翻译" on May 9, 2017
linhe0x0 (Member) left a review:

LGTM

linhe0x0 merged commit 86321d9 into xitu:translation on May 12, 2017