设计一个网页爬虫翻译 #28

xunge0613 · 2017-05-05T03:49:29Z

翻译完毕…… 第一次翻译…… 请大佬们多多指教~
万谢！

@sqrthree

lsvih · 2017-05-05T14:51:18Z

@sqrthree 认领校对

linhe0x0 · 2017-05-06T05:09:39Z

好的

lsvih · 2017-05-06T06:52:37Z

solutions/system_design/web_crawler/README.md

-* Search analytics
-* Personalized search results
-* Page rank
+* 搜素分析


lsvih · 2017-05-06T06:54:00Z

solutions/system_design/web_crawler/README.md

+* 用户很快就能看到搜索结果
+* 网页爬虫不应该陷入死循环
+    * 当爬虫路径包含环的时候，将会陷入死循环
+* 抓取 100 万个链接


lsvih · 2017-05-06T06:54:38Z

solutions/system_design/web_crawler/README.md

+* 抓取 100 万个链接
+    * 要定期重新抓取页面以确保新鲜度
+    * 平均每周重新抓取一次，网站越热门，那么重新抓取的频率越高
+        * 每月抓取 400 万个链接


lsvih · 2017-05-06T06:55:15Z

solutions/system_design/web_crawler/README.md

+    * 要定期重新抓取页面以确保新鲜度
+    * 平均每周重新抓取一次，网站越热门，那么重新抓取的频率越高
+        * 每月抓取 400 万个链接
+    * 每个页面的平均存储大小： 500 KB 


每个页面的平均存储大小： 500 KB

-> 每个页面的平均存储大小：500 KB

lsvih · 2017-05-06T06:57:15Z

solutions/system_design/web_crawler/README.md


-Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/).
+用更传统的系统来练习 —— 不要使用现成的系统，比如： [solr](http://lucene.apache.org/solr/) 或者 [nutch](http://nutch.apache.org/)。 


不要使用现成的系统，比如： solr 或者 nutch。

->

不要使用 solr、nutch 之类的现成的系统。

lsvih · 2017-05-06T06:57:49Z

solutions/system_design/web_crawler/README.md

-* 1,600 write requests per second
-* 40,000 search requests per second
+* 每月存储 2 PB 页面
+    * 每月抓取 400 万个页面，每个页面 500 KB  


lsvih · 2017-05-06T06:59:44Z

solutions/system_design/web_crawler/README.md


-We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity.  If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc
+假设我们有一个初始列表 `links_to_crawl`（待抓取链接），它最初基于网站整体的知名度来排序。当然如果这个假设不合理，我们可以使用知名门户网站作为种子链接来进行扩散，例如： [Yahoo](https://www.yahoo.com/)、 [DMOZ](http://www.dmoz.org/)，等等。 


我们可以使用知名门户网站作为种子链接来进行扩散，例如： Yahoo、 DMOZ，等等。

->

我们可以使用 Yahoo、DMOZ 等知名门户网站作为种子链接来进行扩散。

lsvih · 2017-05-06T07:01:01Z

solutions/system_design/web_crawler/README.md


-We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**.  For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links.  We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
+我们可以将 `links_to_crawl` 和 `crawled_links` 记录在键-值型  **NoSQL 数据库**。对于 `crawled_links` 中已排序的链接，我们可以使用 [Redis](https://redis.io/) 的有序集合来维护网页链接的排名。我们应当在 [选择 SQL 还是 NoSQL 的问题上，讨论有关使用场景以及利弊 ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。 


记录在...中

lsvih · 2017-05-06T07:02:19Z

solutions/system_design/web_crawler/README.md

-* For smaller lists we could use something like `sort | unique`
-* With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1
+* 假设数据量较小，我们可以用类似于 `sort | unique` 的方法。（译注： 先排序，后去重）
+* 假设有 100 万条数据，我们应该使用 **MapReduce** 来输出只出现 1 次的记录。


lsvih · 2017-05-06T07:10:36Z

基本上很完美=w=，主要问题就是 billion 有时候看成 million 了

lsvih · 2017-05-06T07:12:37Z

另外全角符号与英文单词之间不用加空格了，不然看起来空的太多了

参考相关资料

根据校对意见修改

xunge0613 · 2017-05-06T12:45:29Z

@lsvih 感谢大佬细心认真的校对！

@sqrthree 根据校对意见修改完毕

linhe0x0

LGTM

first translate edition

138b71f

linhe0x0 mentioned this pull request May 5, 2017

Design a web crawler #21

Closed

linhe0x0 added the 等待校对 label May 5, 2017

linhe0x0 added the 正在校对 label May 6, 2017

lsvih reviewed May 6, 2017

View reviewed changes

根据校对意见修改

866b452

根据校对意见修改

xunge0613 changed the title ~~Design a web crawler 翻译~~ 设计一个网页爬虫翻译 May 9, 2017

linhe0x0 approved these changes May 12, 2017

View reviewed changes

linhe0x0 merged commit 86321d9 into xitu:translation May 12, 2017

linhe0x0 removed 正在校对等待校对 labels Jun 9, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

设计一个网页爬虫翻译 #28

设计一个网页爬虫翻译 #28

xunge0613 commented May 5, 2017

lsvih commented May 5, 2017

linhe0x0 commented May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih May 6, 2017

lsvih commented May 6, 2017

lsvih commented May 6, 2017

xunge0613 commented May 6, 2017

linhe0x0 left a comment


		Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/).
		用更传统的系统来练习 —— 不要使用现成的系统，比如： [solr](http://lucene.apache.org/solr/) 或者 [nutch](http://nutch.apache.org/)。


		We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc
		假设我们有一个初始列表 `links_to_crawl`（待抓取链接），它最初基于网站整体的知名度来排序。当然如果这个假设不合理，我们可以使用知名门户网站作为种子链接来进行扩散，例如： [Yahoo](https://www.yahoo.com/)、 [DMOZ](http://www.dmoz.org/)，等等。


		We could store `links_to_crawl` and `crawled_links` in a key-value NoSQL Database. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
		我们可以将 `links_to_crawl` 和 `crawled_links` 记录在键-值型 NoSQL 数据库。对于 `crawled_links` 中已排序的链接，我们可以使用 [Redis](https://redis.io/) 的有序集合来维护网页链接的排名。我们应当在 [选择 SQL 还是 NoSQL 的问题上，讨论有关使用场景以及利弊 ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。

设计一个网页爬虫 翻译 #28

设计一个网页爬虫 翻译 #28

Conversation

xunge0613 commented May 5, 2017

lsvih commented May 5, 2017

linhe0x0 commented May 6, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lsvih commented May 6, 2017

lsvih commented May 6, 2017

xunge0613 commented May 6, 2017

linhe0x0 left a comment

Choose a reason for hiding this comment

设计一个网页爬虫翻译 #28

设计一个网页爬虫翻译 #28