From a18bafad1ea70921d0d47cb87d04bc660115c374 Mon Sep 17 00:00:00 2001
From: wycm
Date: Tue, 2 Apr 2019 16:28:06 +0800
Subject: [PATCH 1/2] modify License

---
 License   | 237 +++++++++++------------------------------------------------
 README.md |  22 ++---
 2 files changed, 51 insertions(+), 208 deletions(-)

diff --git a/License b/License
index f34759e..bf4649d 100644
--- a/License
+++ b/License
@@ -1,191 +1,46 @@
-Apache License
-Version 2.0, January 2004
-http://www.apache.org/licenses/
-
-TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
-1. Definitions.
-
-"License" shall mean the terms and conditions for use, reproduction, and
-distribution as defined by Sections 1 through 9 of this document.
-
-"Licensor" shall mean the copyright owner or entity authorized by the copyright
-owner that is granting the License.
-
-"Legal Entity" shall mean the union of the acting entity and all other entities
-that control, are controlled by, or are under common control with that entity.
-For the purposes of this definition, "control" means (i) the power, direct or
-indirect, to cause the direction or management of such entity, whether by
-contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
-outstanding shares, or (iii) beneficial ownership of such entity.
-
-"You" (or "Your") shall mean an individual or Legal Entity exercising
-permissions granted by this License.
-
-"Source" form shall mean the preferred form for making modifications, including
-but not limited to software source code, documentation source, and configuration
-files.
-
-"Object" form shall mean any form resulting from mechanical transformation or
-translation of a Source form, including but not limited to compiled object code,
-generated documentation, and conversions to other media types.
-
-"Work" shall mean the work of authorship, whether in Source or Object form, made
-available under the License, as indicated by a copyright notice that is included
-in or attached to the work (an example is provided in the Appendix below).
-
-"Derivative Works" shall mean any work, whether in Source or Object form, that
-is based on (or derived from) the Work and for which the editorial revisions,
-annotations, elaborations, or other modifications represent, as a whole, an
-original work of authorship. For the purposes of this License, Derivative Works
-shall not include works that remain separable from, or merely link (or bind by
-name) to the interfaces of, the Work and Derivative Works thereof.
-
-"Contribution" shall mean any work of authorship, including the original version
-of the Work and any modifications or additions to that Work or Derivative Works
-thereof, that is intentionally submitted to Licensor for inclusion in the Work
-by the copyright owner or by an individual or Legal Entity authorized to submit
-on behalf of the copyright owner. For the purposes of this definition,
-"submitted" means any form of electronic, verbal, or written communication sent
-to the Licensor or its representatives, including but not limited to
-communication on electronic mailing lists, source code control systems, and
-issue tracking systems that are managed by, or on behalf of, the Licensor for
-the purpose of discussing and improving the Work, but excluding communication
-that is conspicuously marked or otherwise designated in writing by the copyright
-owner as "Not a Contribution."
-
-"Contributor" shall mean Licensor and any individual or Legal Entity on behalf
-of whom a Contribution has been received by Licensor and subsequently
-incorporated within the Work.
-
-2. Grant of Copyright License.
-
-Subject to the terms and conditions of this License, each Contributor hereby
-grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
-irrevocable copyright license to reproduce, prepare Derivative Works of,
-publicly display, publicly perform, sublicense, and distribute the Work and such
-Derivative Works in Source or Object form.
-
-3. Grant of Patent License.
-
-Subject to the terms and conditions of this License, each Contributor hereby
-grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
-irrevocable (except as stated in this section) patent license to make, have
-made, use, offer to sell, sell, import, and otherwise transfer the Work, where
-such license applies only to those patent claims licensable by such Contributor
-that are necessarily infringed by their Contribution(s) alone or by combination
-of their Contribution(s) with the Work to which such Contribution(s) was
-submitted. If You institute patent litigation against any entity (including a
-cross-claim or counterclaim in a lawsuit) alleging that the Work or a
-Contribution incorporated within the Work constitutes direct or contributory
-patent infringement, then any patent licenses granted to You under this License
-for that Work shall terminate as of the date such litigation is filed.
-
-4. Redistribution.
-
-You may reproduce and distribute copies of the Work or Derivative Works thereof
-in any medium, with or without modifications, and in Source or Object form,
-provided that You meet the following conditions:
-
-You must give any other recipients of the Work or Derivative Works a copy of
-this License; and
-You must cause any modified files to carry prominent notices stating that You
-changed the files; and
-You must retain, in the Source form of any Derivative Works that You distribute,
-all copyright, patent, trademark, and attribution notices from the Source form
-of the Work, excluding those notices that do not pertain to any part of the
-Derivative Works; and
-If the Work includes a "NOTICE" text file as part of its distribution, then any
-Derivative Works that You distribute must include a readable copy of the
-attribution notices contained within such NOTICE file, excluding those notices
-that do not pertain to any part of the Derivative Works, in at least one of the
-following places: within a NOTICE text file distributed as part of the
-Derivative Works; within the Source form or documentation, if provided along
-with the Derivative Works; or, within a display generated by the Derivative
-Works, if and wherever such third-party notices normally appear. The contents of
-the NOTICE file are for informational purposes only and do not modify the
-License. You may add Your own attribution notices within Derivative Works that
-You distribute, alongside or as an addendum to the NOTICE text from the Work,
-provided that such additional attribution notices cannot be construed as
-modifying the License.
-You may add Your own copyright statement to Your modifications and may provide
-additional or different license terms and conditions for use, reproduction, or
-distribution of Your modifications, or for any such Derivative Works as a whole,
-provided Your use, reproduction, and distribution of the Work otherwise complies
-with the conditions stated in this License.
-
-5. Submission of Contributions.
-
-Unless You explicitly state otherwise, any Contribution intentionally submitted
-for inclusion in the Work by You to the Licensor shall be under the terms and
-conditions of this License, without any additional terms or conditions.
-Notwithstanding the above, nothing herein shall supersede or modify the terms of
-any separate license agreement you may have executed with Licensor regarding
-such Contributions.
-
-6. Trademarks.
-
-This License does not grant permission to use the trade names, trademarks,
-service marks, or product names of the Licensor, except as required for
-reasonable and customary use in describing the origin of the Work and
-reproducing the content of the NOTICE file.
-
-7. Disclaimer of Warranty.
-
-Unless required by applicable law or agreed to in writing, Licensor provides the
-Work (and each Contributor provides its Contributions) on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
-including, without limitation, any warranties or conditions of TITLE,
-NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
-solely responsible for determining the appropriateness of using or
-redistributing the Work and assume any risks associated with Your exercise of
-permissions under this License.
-
-8. Limitation of Liability.
-
-In no event and under no legal theory, whether in tort (including negligence),
-contract, or otherwise, unless required by applicable law (such as deliberate
-and grossly negligent acts) or agreed to in writing, shall any Contributor be
-liable to You for damages, including any direct, indirect, special, incidental,
-or consequential damages of any character arising as a result of this License or
-out of the use or inability to use the Work (including but not limited to
-damages for loss of goodwill, work stoppage, computer failure or malfunction, or
-any and all other commercial damages or losses), even if such Contributor has
-been advised of the possibility of such damages.
-
-9. Accepting Warranty or Additional Liability.
-
-While redistributing the Work or Derivative Works thereof, You may choose to
-offer, and charge a fee for, acceptance of support, warranty, indemnity, or
-other liability obligations and/or rights consistent with this License. However,
-in accepting such obligations, You may act only on Your own behalf and on Your
-sole responsibility, not on behalf of any other Contributor, and only if You
-agree to indemnify, defend, and hold each Contributor harmless for any liability
-incurred by, or claims asserted against, such Contributor by reason of your
-accepting any such warranty or additional liability.
-
-END OF TERMS AND CONDITIONS
-
-APPENDIX: How to apply the Apache License to your work
-
-To apply the Apache License to your work, attach the following boilerplate
-notice, with the fields enclosed by brackets "{}" replaced with your own
-identifying information. (Don't include the brackets!) The text should be
-enclosed in the appropriate comment syntax for the file format. We also
-recommend that a file or class name and description of purpose be included on
-the same "printed page" as the copyright notice for easier identification within
-third-party archives.
-
-   Copyright 2017 wycm
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   You may obtain a copy of the License at
-
-     http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
\ No newline at end of file
+Copyright (c) 2019 wycm
+
+996 License Version 1.0 (Draft)
+
+Permission is hereby granted to any individual or legal entity
+obtaining a copy of this licensed work (including the source code,
+documentation and/or related items, hereinafter collectively referred
+to as the "licensed work"), free of charge, to deal with the licensed
+work for any purpose, including without limitation, the rights to use,
+reproduce, modify, prepare derivative works of, distribute, publish
+and sublicense the licensed work, subject to the following conditions:
+
+1. The individual or the legal entity must conspicuously display,
+without modification, this License and the notice on each redistributed
+or derivative copy of the Licensed Work.
+
+2. The individual or the legal entity must strictly comply with all
+applicable laws, regulations, rules and standards of the jurisdiction
+relating to labor and employment where the individual is physically
+located or where the individual was born or naturalized; or where the
+legal entity is registered or is operating (whichever is stricter). In
+case that the jurisdiction has no such laws, regulations, rules and
+standards or its laws, regulations, rules and standards are
+unenforceable, the individual or the legal entity are required to
+comply with Core International Labor Standards.
+
+3. The individual or the legal entity shall not induce or force its
+employee(s), whether full-time or part-time, or its independent
+contractor(s), in any methods, to agree in oral or written form, to
+directly or indirectly restrict, weaken or relinquish his or her
+rights or remedies under such laws, regulations, rules and standards
+relating to labor and employment as mentioned above, no matter whether
+such written or oral agreement are enforceable under the laws of the
+said jurisdiction, nor shall such individual or the legal entity
+limit, in any methods, the rights of its employee(s) or independent
+contractor(s) from reporting or complaining to the copyright holder or
+relevant authorities monitoring the compliance of the license about
+its violation(s) of the said license.
+
+THE LICENSED WORK IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE COPYRIGHT HOLDER BE LIABLE FOR ANY CLAIM,
+DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN ANY WAY CONNECTION WITH THE
+LICENSED WORK OR THE USE OR OTHER DEALINGS IN THE LICENSED WORK.
\ No newline at end of file
diff --git a/README.md b/README.md
index 9cf2f94..3d8db0c 100755
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 Zhihu Crawler
 ====
-zhihu-crawler is a hands-on crawler project written in Java. Its main function is to crawl the basic profiles of Zhihu users. If you find it useful, please give it a star.
+zhihu-crawler is a high-performance, distributed crawler project written in Java that supports a free HTTP proxy pool and horizontal scaling. Its main function is to crawl Zhihu data such as users, topics, questions, answers, and articles. If you find it useful, please give it a star.
 ## Crawl results
 * The chart below shows simple statistics from 1.17 million crawled Zhihu user records
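 ![](https://github.com/wycm/zhihu-crawler/blob/2.0/src/main/resources/img/zhihu-charts.png)
@@ -17,20 +17,6 @@ zhihu-crawler is a hands-on crawler project written in Java. Its main function is to crawl
 3. Set the log path (default: `/var/www/logs`) in [logback-spring.xml](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/resources/logback-spring.xml)
 4. Run with [ZhihuCrawlerApplication.java](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/java/com/github/wycm/zhihu/ZhihuCrawlerApplication.java )
 
-## API used
-* Address (URL): ```https://www.zhihu.com/api/v4/members/${userid}/followees```
-* Request method: GET
-* **Request parameters**
-
-| Parameter | Type | Required | Value | Description |
-| :------------ | :------------ | :------------ | :----- | :------------ |
-| include | String | Yes | ```data[*]answer_count,articles_count``` | Fields to return (this value can be changed to add more fields as needed; see the example URL below) |
-| offset | int | Yes | 0 | Offset (by adjusting this value you can fetch the profiles of ```all users followed``` by a given user) |
-| limit | int | Yes | 20 | Number of users returned (maximum 20; values above 20 have no effect) |
-
-* Example URL: ```https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data[*].educations,employments,answer_count,business,locations,articles_count,follower_count,gender,following_count,question_count,voteup_count,thanked_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20```
-* Response: JSON data containing the profiles of the followed users
-
 ## Features
 * Makes heavy use of HTTP proxies to get past the per-client request limit (note: all proxies used are free public ones found online; recent testing shows that some free-proxy sites have added anti-crawling measures, so far fewer free proxies are usable than before and crawling is much slower than it used to be).
 * Supports persistence (MongoDB).
@@ -81,6 +67,8 @@ DetailPageThreadPool downloads user detail pages and parses the basic user information
 * If you run into problems, please open an issue.
 * Code contributions are welcome.
 * Crawler discussion group: 633925314; feel free to join the conversation.
-* If you need the data (basic profile information for 1.17 million Zhihu users), just follow the official account: lwndso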
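-![](https://github.com/wycm/zhihu-crawler/blob/2.0/src/main/resources/img/wx.jpg)
+* If you need the data (basic profile information for 1.17 million Zhihu users; this data is for personal study and exchange only, and commercial or improper use is strictly prohibited), just follow the official account: lwndso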
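+![A programmer's day-to-day sharing, including but not limited to crawlers and Java back-end technology; follows welcome](https://raw.githubusercontent.com/wycm/md-image/master/2019-02-28/9.png)
+## Disclaimer
+* This project is for personal study and exchange only. Commercial or improper use is strictly prohibited.
\ No newline at end of file
From e48dc5edb52b2893d62c1698b5ee9f473719e2a7 Mon Sep 17 00:00:00 2001
From: wycm
Date: Tue, 2 Apr 2019 16:33:13 +0800
Subject: [PATCH 2/2] update

---
 README.md | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 3d8db0c..2a32f1d 100755
--- a/README.md
+++ b/README.md
@@ -17,6 +17,20 @@ zhihu-crawler is a high-performance, distributed crawler project written in Java that supports
 3. Set the log path (default: `/var/www/logs`) in [logback-spring.xml](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/resources/logback-spring.xml)
 4. Run with [ZhihuCrawlerApplication.java](https://github.com/wycm/zhihu-crawler/blob/3.0/zhihu/src/main/java/com/github/wycm/zhihu/ZhihuCrawlerApplication.java )
 
+## API used
+* Address (URL): ```https://www.zhihu.com/api/v4/members/${userid}/followees```
+* Request method: GET
+* **Request parameters**
+
+| Parameter | Type | Required | Value | Description |
+| :------------ | :------------ | :------------ | :----- | :------------ |
+| include | String | Yes | ```data[*]answer_count,articles_count``` | Fields to return (this value can be changed to add more fields as needed; see the example URL below) |
+| offset | int | Yes | 0 | Offset (by adjusting this value you can fetch the profiles of ```all users followed``` by a given user) |
+| limit | int | Yes | 20 | Number of users returned (maximum 20; values above 20 have no effect) |
+
+* Example URL: ```https://www.zhihu.com/api/v4/members/wo-yan-chen-mo/followees?include=data[*].educations,employments,answer_count,business,locations,articles_count,follower_count,gender,following_count,question_count,voteup_count,thanked_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20```
+* Response: JSON data containing the profiles of the followed users
+
 ## Features
 * Makes heavy use of HTTP proxies to get past the per-client request limit (note: all proxies used are free public ones found online; recent testing shows that some free-proxy sites have added anti-crawling measures, so far fewer free proxies are usable than before and crawling is much slower than it used to be).
 * Supports persistence (MongoDB).
@@ -61,14 +75,12 @@ DetailPageThreadPool downloads user detail pages and parses the basic user information
 * Add guest (no-login) crawling mode.
 * Add a proxy crawling module.
+## Disclaimer
+* This project is for personal study and exchange only. Commercial or improper use is strictly prohibited.
 ## Finally
-* If you want to crawl other data, such as questions and answers, you can customize the project on top of this code base.
 * If you run into problems, please open an issue.
 * Code contributions are welcome.
 * Crawler discussion group: 633925314; feel free to join the conversation.
-* If you need the data (basic profile information for 1.17 million Zhihu users; this data is for personal study and exchange only, and commercial or improper use is strictly prohibited), just follow the official account: lwndso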
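+* If you need the data, just follow the official account (basic profile information for 1.17 million Zhihu users; this data is for personal study and exchange only, and commercial or improper use is strictly prohibited): lwndso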
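 ![A programmer's day-to-day sharing, including but not limited to crawlers and Java back-end technology; follows welcome](https://raw.githubusercontent.com/wycm/md-image/master/2019-02-28/9.png)
-
-## Disclaimer
-* This project is for personal study and exchange only. Commercial or improper use is strictly prohibited.
\ No newline at end of file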
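
Reviewer note on the `followees` endpoint that PATCH 2/2 restores to README.md: the sketch below shows one way the `offset` and `limit` parameters could be combined to page through a user's followees. It is not code from this repository. The class `FolloweesUrlSketch` and its methods are hypothetical names, the `include` value uses the dotted `data[*].field` form from the README's example URL, and percent-encoding the `include` value is an assumption (the README shows it unencoded). The sketch only builds URLs and sends no requests.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of zhihu-crawler: builds paged followees URLs.
class FolloweesUrlSketch {
    static final String BASE = "https://www.zhihu.com/api/v4/members/";
    // Per the README table, at most 20 users are returned per request.
    static final int MAX_LIMIT = 20;

    // Builds one request URL; userId is assumed to be URL-safe already.
    static String followeesUrl(String userId, String include, int offset, int limit) {
        int effectiveLimit = Math.min(limit, MAX_LIMIT);
        // Assumption: the include expression is percent-encoded before sending.
        String encodedInclude = URLEncoder.encode(include, StandardCharsets.UTF_8);
        return BASE + userId + "/followees?include=" + encodedInclude
                + "&offset=" + offset + "&limit=" + effectiveLimit;
    }

    // Generates one URL per page by stepping offset in increments of MAX_LIMIT.
    static List<String> pageUrls(String userId, String include, int followeeCount) {
        List<String> urls = new ArrayList<>();
        for (int offset = 0; offset < followeeCount; offset += MAX_LIMIT) {
            urls.add(followeesUrl(userId, include, offset, MAX_LIMIT));
        }
        return urls;
    }

    public static void main(String[] args) {
        String include = "data[*].answer_count,articles_count";
        for (String url : pageUrls("wo-yan-chen-mo", include, 45)) {
            System.out.println(url);
        }
    }
}
```

For a user with 45 followees this produces three URLs, at offsets 0, 20, and 40. Capping `limit` inside `followeesUrl` mirrors the README's remark that values above 20 have no effect.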