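Some sites split a single chapter across multiple pages; the table of contents only yields the first page, and the other pages are reached through a page parameter #22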

Closed
Code-Cats opened this issue Aug 25, 2020 · 2 comments
Labels
enhancement New feature or request

Comments


Code-Cats commented Aug 25, 2020

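Example: first page of a chapter: https://www.ygbqtl.com/x…x/mFALm4TpdZvfi.html
At the bottom of the first page there is a next-page link, XPath='//*[@id="pt_next"]'
Second page: https://www.ygbqtl.com/x…x/mFALm4TpdZvfi.html?page=2
The example above comes from 七猫读书网 (Qimao Reading). This paginated reading layout is the same on desktop and mobile.
I experimented with Python requests + lxml and came up with two possible solutions:
1. From what I observed, the XPath of the next-page link is the same on every page, whether it points to another page within the chapter or from the last page of a chapter to the first page of the next chapter. So recursion or a loop can walk from the first page of the first chapter to the last page of the last chapter.
2. After fetching the chapter links from the table of contents, manually configure the paging parameter name and the maximum page number (value). For each chapter link, try every page number in that range, discard the pages that turn out not to exist based on the response, and merge the content belonging to the same chapter.
The ideas above may only work for one particular site. An advanced mode with user-defined scraping rules would probably be more general.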
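
To make solution 2 concrete, here is a minimal sketch, not the tool's actual implementation. The names `page_url` and `fetch_chapter`, the `example.com` URLs, and the `fetch` callable are all hypothetical. `fetch` is assumed to return the page text, or `None` when the page does not exist; how a missing page is detected (status code, empty body, a repeat of page 1) depends on the site and is left to the caller.

```python
from typing import Callable, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(chapter_url: str, page: int, param: str = "page") -> str:
    """Return chapter_url with the paging parameter set (page 1 is the bare URL)."""
    if page == 1:
        return chapter_url
    parts = urlsplit(chapter_url)
    # Drop any existing paging parameter, then append the requested one.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def fetch_chapter(
    chapter_url: str,
    fetch: Callable[[str], Optional[str]],  # hypothetical: returns None for a missing page
    param: str = "page",
    max_pages: int = 20,
) -> str:
    """Try pages 1..max_pages, stop at the first missing page, merge the rest."""
    pieces = []
    for page in range(1, max_pages + 1):
        text = fetch(page_url(chapter_url, page, param))
        if not text:
            break  # page does not exist, so the chapter ends here
        pieces.append(text)
    return "\n".join(pieces)
```

With requests, `fetch` could be a small wrapper that downloads the URL and extracts the chapter text, returning `None` when the response is not usable.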

Member

unclezs commented Aug 27, 2020

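For sites that paginate like this, it really is hard to make downloading generic, so downloads do not support pagination yet.
I have thought about adding some support, and it comes down to two approaches:
1. A user-defined next-page XPath
2. Survey a number of sites, merge the most common pagination rules, and add a "follow pagination" option. When it is checked, the tool automatically looks for a next-page element and follows it if one exists.

However, the software's main selling point is being generic and free of per-site rules, so I have not implemented this. For now, if a site paginates its chapters, switch to a different site.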
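
Both options above amount to following a next-page link. A rough sketch of that idea follows, under several assumptions: it uses the standard library `HTMLParser` and matches elements by `id` instead of a real XPath engine such as lxml, so it runs without dependencies. The names `NextLinkFinder`, `find_next_url`, `collect_pages`, the `fetch` callable, and the default `("pt_next",)` rule list are hypothetical. It also assumes that pages of the same chapter share one URL path and differ only in the query string, as in the `?page=2` example from this issue.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional
from urllib.parse import urljoin, urlsplit


class NextLinkFinder(HTMLParser):
    """Record the href of the first element whose id is in candidate_ids."""

    def __init__(self, candidate_ids: Iterable[str]) -> None:
        super().__init__()
        self.candidate_ids = set(candidate_ids)
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.href is None and attr.get("id") in self.candidate_ids:
            self.href = attr.get("href")


def find_next_url(html: str, base_url: str, candidate_ids: Iterable[str]) -> Optional[str]:
    """Return the absolute next-page URL, or None if no rule matches."""
    finder = NextLinkFinder(candidate_ids)
    finder.feed(html)
    return urljoin(base_url, finder.href) if finder.href else None


def collect_pages(
    start_url: str,
    fetch: Callable[[str], Optional[str]],  # hypothetical: returns HTML or None
    candidate_ids: Iterable[str] = ("pt_next",),  # user-defined or collected rules
    stay_in_chapter: bool = True,
    max_pages: int = 50,
) -> List[str]:
    """Follow next-page links from start_url and return the fetched pages in order."""
    candidate_ids = tuple(candidate_ids)
    chapter_path = urlsplit(start_url).path
    pages: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(pages) < max_pages:
        # Assumption: a different path means the link leads to the next chapter.
        if stay_in_chapter and urlsplit(url).path != chapter_path:
            break
        seen.add(url)
        html = fetch(url)
        if html is None:
            break
        pages.append(html)
        url = find_next_url(html, url, candidate_ids)
    return pages
```

The `seen` set and `max_pages` guard against link loops. Passing `stay_in_chapter=False` walks across chapter boundaries, which corresponds to traversing the whole book from the first page.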

@unclezs unclezs added the enhancement New feature or request label Aug 27, 2020
Member

unclezs commented Jun 22, 2021

Multi-page chapters are now supported. Upgrade to 5.0 to try it.

@unclezs unclezs closed this as completed Jun 22, 2021