[Bug]: Crawling web pages fails with error 404 #317

Closed
JinCheng666 opened this issue Mar 18, 2025 · 2 comments

JinCheng666 commented Mar 18, 2025

wiseflow version

0.3.9patch2

Expected Behavior

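Every page crawl fails with error 404.
Windows 10; cloned the repository with git, then installed the dependencies with pip inside a conda environment.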

LLM_API_KEY="sk-bce5bfcc..."
LLM_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/"
PRIMARY_MODEL="qwen-max-2025-01-25"
SECONDARY_MODEL="qwen2.5-72b-instruct"
VL_MODEL="qwen-vl-max-2025-01-25"
PB_API_AUTH="634298263@qq.com|AN_quan565656" 
ZHIPU_API_KEY="249e..."
PROJECT_DIR="./work_dir"
VERBOSE="true"
(wiseflow) PS D:\work\code\wiseflow\core> python .\windows_run.py
Starting PocketBase...
2025-03-18 18:45:49.854 | DEBUG    | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
2025-03-18 18:45:51.525 | INFO     | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - 634298263@qq.com
2025-03-18 18:45:51.527 | INFO     | __main__:schedule_task:19 - task execute loop 1
2025-03-18 18:45:51.535 | DEBUG    | general_process:main_process:53 - new task initializing...
2025-03-18 18:45:51.535 | DEBUG    | general_process:main_process:57 - focus_id: 2a4d0p1tqtt54qn, focus_point: 国家水网包含什么内容, explanation: , search_engine: False
[INIT].... → Crawl4AI 0.4.300_wiseflow_modified
2025-03-18 18:45:52.665 | DEBUG    | general_process:main_process:140 - process new url, still 0 urls in working list
[FETCH]... ↓ https://www.gov.cn/zhengce/202305/content_6876214.... | Status: True | Time: 8.74s
[SCRAPE].. ◆ Processed https://www.gov.cn/zhengce/202305/content_6876214.... | Time: 55ms
[COMPLETE] ● https://www.gov.cn/zhengce/202305/content_6876214.... | Status: True | Total: 8.82s
Error code: 404
Error code: 404
Error code: 404
Error code: 404
Error code: 404
Error code: 404
2025-03-18 18:46:02.824 | WARNING  | agents.get_info:get_author_and_publish_date:261 - failed to parse from llm output: 
Error code: 404
2025-03-18 18:46:03.070 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
2025-03-18 18:46:03.193 | DEBUG    | general_process:main_process:224 - task finished, focus_id: 2a4d0p1tqtt54qn
2025-03-18 18:46:03.194 | INFO     | __main__:schedule_task:33 - task execute loop finished, work after 3600 seconds


With search_engine enabled, the run fails as follows:

(wiseflow) PS D:\work\code\wiseflow\core> python .\windows_run.py
Starting PocketBase...
2025-03-18 19:00:16.390 | DEBUG    | utils.pb_api:__init__:12 - initializing pocketbase client: http://127.0.0.1:8090
2025-03-18 19:00:18.037 | INFO     | utils.pb_api:__init__:22 - pocketbase ready authenticated as admin - 634298263@qq.com
2025-03-18 19:00:18.038 | INFO     | __main__:schedule_task:19 - task execute loop 1
2025-03-18 19:00:18.047 | DEBUG    | general_process:main_process:53 - new task initializing...
2025-03-18 19:00:18.047 | DEBUG    | general_process:main_process:57 - focus_id: 2a4d0p1tqtt54qn, focus_point: 国家水网包含什么内容, explanation: , search_engine: True
2025-03-18 19:00:20.325 | INFO     | general_process:main_process:91 - query: 国家水网包含什么内容
search intent: SEARCH_ALL
keywords: 国家水网 内容
2025-03-18 19:00:20.326 | DEBUG    | general_process:main_process:100 - can not find publish time in the search result http://mp.weixin.qq.com/s?__biz=MjM5MzIzNjY0Mg==&mid=2651049912&idx=3&sn=76f5c62f05da0035dfc7a5541545a1f1, adding to working list
2025-03-18 19:00:20.326 | DEBUG    | general_process:main_process:100 - can not find publish time in the search result https://new.qq.com/rain/a/20230530A093N900, adding to working list
Error code: 404
2025-03-18 19:00:20.617 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
Error code: 404
2025-03-18 19:00:20.840 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
Error code: 404
2025-03-18 19:00:21.041 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
2025-03-18 19:00:21.041 | DEBUG    | general_process:main_process:100 - can not find publish time in the search result http://mp.weixin.qq.com/s?__biz=MzI4Njk1NDQ0MA==&mid=2247503852&idx=5&sn=ead937afe787dead179372029fb2b0b9, adding to working list
Error code: 404
2025-03-18 19:00:21.247 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
[INIT].... → Crawl4AI 0.4.300_wiseflow_modified
2025-03-18 19:00:22.431 | DEBUG    | general_process:main_process:140 - process new url, still 3 urls in working list
[CACHING]. ↓ https://www.gov.cn/zhengce/202305/content_6876214.... | Status: True | Time: 0.03s
[COMPLETE] ● https://www.gov.cn/zhengce/202305/content_6876214.... | Status: True | Total: 0.03s
Error code: 404
Error code: 404
Error code: 404
Error code: 404
Error code: 404
Error code: 404
2025-03-18 19:00:23.645 | WARNING  | agents.get_info:get_author_and_publish_date:261 - failed to parse from llm output: 
Error code: 404
2025-03-18 19:00:23.858 | WARNING  | agents.get_info:get_info:343 - model hallucination: [] 
contains no summary tag
2025-03-18 19:00:23.858 | DEBUG    | general_process:main_process:140 - process new url, still 2 urls in working list
[FETCH]... ↓ http://mp.weixin.qq.com/s?__biz=MzI4Njk1NDQ0MA==&m... | Status: True | Time: 5.66s
[SCRAPE].. ◆ Processed http://mp.weixin.qq.com/s?__biz=MzI4Njk1NDQ0MA==&m... | Time: 42ms
[COMPLETE] ● http://mp.weixin.qq.com/s?__biz=MzI4Njk1NDQ0MA==&m... | Status: True | Total: 5.72s
Traceback (most recent call last):
  File "D:\work\code\wiseflow\core\run_task.py", line 36, in <module>
    asyncio.run(schedule_task())
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\base_events.py", line 649, in run_until_complete
    return future.result()
  File "D:\work\code\wiseflow\core\run_task.py", line 32, in schedule_task
    await asyncio.gather(*jobs)
  File "D:\work\code\wiseflow\core\general_process.py", line 163, in main_process
    result = custom_scrapers[domain](result)
  File "D:\work\code\wiseflow\core\scrapers\mp_scraper.py", line 28, in mp_scraper
    raw_markdown = fetch_result.markdown['raw_markdown']
TypeError: string indices must be integers
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x000002922FB4C550>
Traceback (most recent call last):
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\base_subprocess.py", line 126, in __del__
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\base_subprocess.py", line 104, in close
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\proactor_events.py", line 109, in close
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\base_events.py", line 753, in call_soon
  File "C:\Users\liujincheng\.conda\envs\wiseflow\lib\asyncio\base_events.py", line 515, in _check_closed
RuntimeError: Event loop is closed
Error running run_task.py: Command '['C:\\Users\\liujincheng\\.conda\\envs\\wiseflow\\python.exe', 'D:\\work\\code\\wiseflow\\core\\run_task.py']' returned non-zero exit status 1.
(wiseflow) PS D:\work\code\wiseflow\core> 
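
A note on the traceback above: `TypeError: string indices must be integers` is what Python raises when a `str` is indexed with a string key, so in this run `fetch_result.markdown` was apparently a plain string rather than a mapping. The following is a minimal sketch of a defensive accessor, assuming the value may arrive as a string, a dict, or an object with a `raw_markdown` attribute. `get_raw_markdown` is a hypothetical helper for illustration, not part of wiseflow or Crawl4AI.

```python
def get_raw_markdown(markdown):
    """Return the raw markdown text from whichever shape the value has.

    Hypothetical helper: the three accepted shapes are assumptions
    based only on the traceback, not on the library's documented API.
    """
    # Plain string: the value already is the markdown text.
    if isinstance(markdown, str):
        return markdown
    # Mapping: the shape the failing line in mp_scraper.py expected.
    if isinstance(markdown, dict):
        return markdown.get("raw_markdown", "")
    # Anything else: fall back to an attribute lookup, else empty text.
    return getattr(markdown, "raw_markdown", "")
```

With such a helper, the failing line would read `raw_markdown = get_raw_markdown(fetch_result.markdown)`.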

Current Behavior

bug fix

Is this reproducible?

Yes

Inputs Causing the Bug

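I tried each of the following sites; all of them returned 404.
- url: https://www.gov.cn/zhengce/202305/content_6876214.htm
- focus_points: 国家水网包含什么内容 (what the national water network comprises)

- url: https://news.baidu.com/
- focus_points: 经济运行情况 (state of the economy)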

OS

Windows 10

Python version

3.10

Error logs & Screenshots (if applicable)

No response

@bigbrother666sh
Member

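Judging from the logs, the crawl itself succeeded; it is the call to the LLM for analysis that failed. Check the LLM-related settings in your .env file and make sure the model names are correct.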
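
A check like this can be partly automated before starting a run. The sketch below assumes the variable names shown in the .env excerpt of this issue (`LLM_API_KEY`, `LLM_API_BASE`, `PRIMARY_MODEL`); `check_llm_config` is a hypothetical helper, not part of wiseflow. The `/v1` test is only a heuristic, on the assumption that many OpenAI-compatible endpoints expose their API under a path ending in `/v1`; it flags a setting worth a second look and proves nothing by itself. It cannot tell whether a key is valid, which only a real request to the provider can confirm.

```python
from urllib.parse import urlparse

# Settings taken from the .env excerpt in this issue.
REQUIRED_KEYS = ("LLM_API_KEY", "LLM_API_BASE", "PRIMARY_MODEL")


def check_llm_config(env):
    """Return a list of human-readable problems found in the LLM settings."""
    problems = []
    # Every required setting must be present and non-blank.
    for key in REQUIRED_KEYS:
        if not env.get(key, "").strip():
            problems.append(f"{key} is missing or empty")

    base = env.get("LLM_API_BASE", "").strip()
    if base:
        parsed = urlparse(base)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("LLM_API_BASE is not a valid http(s) URL")
        elif not parsed.path.rstrip("/").endswith("/v1"):
            # Heuristic only (assumption): warn when the path lacks /v1.
            problems.append("LLM_API_BASE does not end in /v1; verify it against the provider's docs")
    return problems
```

For example, `check_llm_config(dict(os.environ))` (after `import os` and loading the .env file) returns an empty list when nothing looks suspicious.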

@JinCheng666
Author

Oh, it really was a problem with the key. Thanks.
