Improve HTML code element clipping #11281

Closed
3 tasks done
allrobot opened this issue May 6, 2024 · 24 comments

@allrobot
allrobot commented May 6, 2024

[Enhancement] When copying web page elements and converting them to Markdown, the HTML elements themselves should not be copied

Is there an existing issue for this?

  • I have searched the existing issues

Can the issue be reproduced with the default theme (daylight/midnight)?

  • I was able to reproduce the issue with the default theme

Could the issue be due to extensions?

  • I've ruled out the possibility that the extension is causing the problem.

Describe the problem

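When pasted into the issue's Markdown editor, the content comes out as just this:
ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.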

chrome_zkmfLKTCBT

Expected result

SiYuan_bn2akgeVdq
SiYuan pastes the span elements along with the text. Why can't it take just the element text? GitHub, as shown above, puts the span's text in xxx.

Screenshot or screen recording presentation

No response

Version environment

- Version: latest
- Operating System: Windows 10 Enterprise 21H2

Log file

NO

More information

No response

88250 changed the title from "[Enhancement] When copying web page elements and converting them to Markdown, the HTML elements themselves should not be copied" to "Improve HTML clipping" on May 7, 2024
88250 self-assigned this on May 7, 2024
88250 changed the title from "Improve HTML clipping" to "Improve HTML code element clipping" on May 7, 2024
88250 added this to the 3.0.13 milestone on May 7, 2024
88250 closed this as completed on May 7, 2024
@allrobot
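Author

allrobot commented May 10, 2024

3.0.13 still copies the web page elements. Screenshot:

SiYuan_UUIPx7IUGN

Copying only <span class="pre">capture_stdout=True,</span><span> </span><span class="pre">stderr=subprocess.STDOUT</span> to the clipboard works fine, but it fails as soon as there is text around the web page elements.

I don't know what you use to parse the clipboard's rich text. The clipboard content is always in HTML rich-text form. Screenshot of the steps: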

chrome_nTHWUiZK8W

import win32clipboard


def GetAvailableFormats():
    """Return a possibly empty list of formats available on the clipboard."""
    formats = []
    # Open before the try block so CloseClipboard only runs after a successful open.
    win32clipboard.OpenClipboard(0)
    try:
        cf = win32clipboard.EnumClipboardFormats(0)
        while cf != 0:
            formats.append(cf)
            cf = win32clipboard.EnumClipboardFormats(cf)
    finally:
        win32clipboard.CloseClipboard()

    return formats


# CF_HTML was 49433 on this machine; IDs of registered formats can differ between systems.
CF_HTML = win32clipboard.RegisterClipboardFormat("HTML Format")

if CF_HTML in GetAvailableFormats():
    win32clipboard.OpenClipboard(0)
    try:
        src = win32clipboard.GetClipboardData(CF_HTML)
    finally:
        # Always release the clipboard so other applications can use it.
        win32clipboard.CloseClipboard()
    src = src.decode("UTF-8")
    print(src)
else:
    print("The clipboard currently has no rich text content")

Full clipboard content output:

C:\ProgramData\anaconda3\envs\python311\python.exe C:\Users\Administrator\Personal_scripts\pythonProject\temp.py 
Version:0.9
StartHTML:0000000200
EndHTML:0000004248
StartFragment:0000000236
EndFragment:0000004212
SourceURL:https://trio.readthedocs.io/en/stable/reference-io.html#asynchronous-filesystem-i-o
<html>
<body>
<!--StartFragment--><span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;">如果要同时捕获 stdout 和 stderr,但要按打印顺序将它们混合在一起,请使用:<span> </span></span><code class="docutils literal notranslate" data-immersive-translate-walked="a926668a-d38b-4e1e-824c-535ff2a6205f" style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, &quot;Liberation Mono&quot;, &quot;Courier New&quot;, Courier, monospace; font-size: 12px; white-space: normal; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(34, 34, 34) !important; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><span class="pre" style="box-sizing: border-box;">capture_stdout=True,</span><span> </span><span class="pre" style="box-sizing: border-box;">stderr=subprocess.STDOUT</span></code><span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: 
none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;"><span> </span>。这会将子项的 stderr 引导到其 stdout,因此组合输出将在属性中<span> </span></span><code class="xref py py-obj docutils literal notranslate" data-immersive-translate-walked="a926668a-d38b-4e1e-824c-535ff2a6205f" style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, &quot;Liberation Mono&quot;, &quot;Courier New&quot;, Courier, monospace; font-size: 12px; white-space: normal; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(34, 34, 34) !important; overflow-x: auto; font-weight: 700; overflow-wrap: normal; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><span class="pre" style="box-sizing: border-box;">stdout</span></code><span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;"><span> </span>可用</span><!--EndFragment-->
</body>
</html> 

Process finished with exit code 0
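The header in the output above (StartFragment, EndFragment) gives byte offsets into the payload. As a minimal sketch of how a reader could use those offsets to pull out just the copied fragment, assuming a UTF-8 payload like the one printed above: `extract_fragment` and `build_cf_html` are hypothetical helper names introduced here for illustration, and `build_cf_html` exists only so the example runs without a Windows clipboard.

```python
import re

# Header fields that delimit the copied fragment (values are byte offsets).
FRAGMENT_FIELD = re.compile(rb"(StartFragment|EndFragment):(\d+)")


def extract_fragment(cf_html: bytes) -> str:
    """Slice the fragment out of a CF_HTML payload using its byte offsets."""
    # Only look at the header, i.e. everything before the first tag or comment.
    header = cf_html.split(b"<", 1)[0]
    offsets = {name: int(value) for name, value in FRAGMENT_FIELD.findall(header)}
    start, end = offsets[b"StartFragment"], offsets[b"EndFragment"]
    # Offsets count bytes, not characters, so slice first and decode afterwards.
    return cf_html[start:end].decode("utf-8")


def build_cf_html(fragment: str) -> bytes:
    """Build a minimal CF_HTML payload (hypothetical helper for this demo)."""
    prefix = b"<html>\r\n<body>\r\n<!--StartFragment-->"
    suffix = b"<!--EndFragment-->\r\n</body>\r\n</html>"
    body = fragment.encode("utf-8")
    template = (
        "Version:0.9\r\nStartHTML:{:010d}\r\nEndHTML:{:010d}\r\n"
        "StartFragment:{:010d}\r\nEndFragment:{:010d}\r\n"
    )
    # The fields are fixed width, so the header length does not depend on the values.
    start_html = len(template.format(0, 0, 0, 0))
    start_fragment = start_html + len(prefix)
    end_fragment = start_fragment + len(body)
    end_html = end_fragment + len(suffix)
    header = template.format(start_html, end_html, start_fragment, end_fragment)
    return header.encode("ascii") + prefix + body + suffix


payload = build_cf_html('<code><span class="pre">stdout</span></code> 可用')
print(extract_fragment(payload))
```

Slicing the bytes before decoding matters here because the fragment contains multi-byte characters, so character positions and byte positions differ.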

The table is inside this code:

<span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;">如果要同时捕获 stdout 和 stderr,但要按打印顺序将它们混合在一起,请使用:<span> </span></span><code class="docutils literal notranslate" data-immersive-translate-walked="a926668a-d38b-4e1e-824c-535ff2a6205f" style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, &quot;Liberation Mono&quot;, &quot;Courier New&quot;, Courier, monospace; font-size: 12px; white-space: normal; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(34, 34, 34) !important; overflow-x: auto; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><span class="pre" style="box-sizing: border-box;">capture_stdout=True,</span><span> </span><span class="pre" style="box-sizing: border-box;">stderr=subprocess.STDOUT</span></code><span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; 
word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;"><span> </span>。这会将子项的 stderr 引导到其 stdout,因此组合输出将在属性中<span> </span></span><code class="xref py py-obj docutils literal notranslate" data-immersive-translate-walked="a926668a-d38b-4e1e-824c-535ff2a6205f" style="box-sizing: border-box; font-family: SFMono-Regular, Menlo, Monaco, Consolas, &quot;Liberation Mono&quot;, &quot;Courier New&quot;, Courier, monospace; font-size: 12px; white-space: normal; max-width: 100%; background: rgb(255, 255, 255); border: 1px solid rgb(225, 228, 229); padding: 2px 5px; color: rgb(34, 34, 34) !important; overflow-x: auto; font-weight: 700; overflow-wrap: normal; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><span class="pre" style="box-sizing: border-box;">stdout</span></code><span style="color: rgb(64, 64, 64); font-family: Lato, proxima-nova, &quot;Helvetica Neue&quot;, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(252, 252, 252); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;"><span> </span>可用</span>

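Pasting the table code directly into GitHub's Markdown editor renders it as a table:
如果要同时捕获 stdout 和 stderr,但要按打印顺序将它们混合在一起,请使用: capture_stdout=True, stderr=subprocess.STDOUT 。这会将子项的 stderr 引导到其 stdout,因此组合输出将在属性中 stdout 可用
(In English: If you want to capture both stdout and stderr, but keep them mixed together in the order they were printed, use: capture_stdout=True, stderr=subprocess.STDOUT. This directs the child's stderr into its stdout, so the combined output will be available in the stdout attribute.)

Screenshot of the steps:

chrome_PDxkLlHwtQ

Please make good use of rich text; it matters a great deal when writing notes in Markdown.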

@88250
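Member

88250 commented May 10, 2024

In my tests, the web page DOM structure mentioned in the issue body no longer has a problem:

issue.webm

For other web pages, please open separate issues so they can be improved. Thanks.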

@allrobot
Author

For other web pages, please open separate issues so they can be improved. Thanks.

SiYuan_4K6ww92ITK

@88250
Member

88250 commented May 10, 2024

Got it. This case will be improved in the next version. Thanks.

@allrobot
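Author

Have you considered:

  1. Adding a plain-text mode and a live-preview mode for each block/page? (Treat each block as a separate edit box, similar to how editing a comment in an issue opens an edit box.)
  2. Letting each block display HTML elements, along with richer Markdown syntax support

GitHub's issue edit box has Write and Preview buttons: after pasting HTML elements, switching to Preview mode displays them immediately. This feature of GitHub's edit box is quite practical. Advantages:

  1. Compatible with HTML elements: any page can display rich HTML elements, and if you are not satisfied you can switch back to Write mode and edit the HTML text again
  2. More Markdown syntax support, such as quotes within quotes, color models, and so on

Other:
When creating a table, if the cursor is at the end of the table, pressing Enter several times does nothing, which is not very intuitive. Pressing Enter twice should exit the table and move to the next block (equivalent to Ctrl+Shift+A).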

@88250
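Member

88250 commented May 10, 2024

SiYuan's editor is not a Markdown editor, so a source-code editing mode is not under consideration.

@Vanessa219 Please consider having Enter on the last row of a table exit the table and create a new block.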

@allrobot
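Author

SiYuan's editor is not a Markdown editor, so a source-code editing mode is not under consideration.

A self-developed editor? Then will support for more Markdown syntax still be considered? 😬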

@88250
Member

88250 commented May 11, 2024 via email

@allrobot
Author

SiYuan_m5OfpFHpwt

Copying a GitHub reference link goes wrong; it cannot be converted to [subject:matches-attr(arg)](https://github.com/siyuan-note/siyuan/issues/11281)

Link: https://github.com/fang5566/uBlock/wiki/%E8%BF%87%E7%A8%8B%E5%BC%8F%E4%BF%AE%E9%A5%B0%E8%A7%84%E5%88%99

@88250
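Member

88250 commented May 12, 2024

@allrobot

image

The anchor text of this a tag is an svg. The way the editor handles it leaves the anchor text empty, so the element cannot be converted and rendered as a normal hyperlink:

image

The next version will be improved to simply discard hyperlink elements whose anchor text is empty.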
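The behavior described above (an anchor whose only child is an svg ends up with no anchor text and is discarded) can be sketched roughly as follows. This is only an illustration built on Python's standard `html.parser`, not SiYuan's actual implementation; the class name `MarkdownLinkConverter` and the function `html_to_markdown_links` are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownLinkConverter(HTMLParser):
    """Turn <a> elements into Markdown links, dropping those with no anchor text."""

    def __init__(self):
        super().__init__()
        self.parts = []        # finished output pieces
        self.href = None       # href of the <a> currently being read, if any
        self.anchor_text = []  # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href") or ""
            self.anchor_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.anchor_text).strip()
            if text:  # an anchor without text (e.g. only an svg icon) is dropped
                self.parts.append(f"[{text}]({self.href})")
            self.href = None

    def handle_data(self, data):
        target = self.anchor_text if self.href is not None else self.parts
        target.append(data)


def html_to_markdown_links(html: str) -> str:
    parser = MarkdownLinkConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<h2><a href="#x"><svg><path d="M0 0"/></svg></a>subject:matches-attr(arg)</h2> '
    '<a href="https://example.com">docs</a>'
)
print(html_to_markdown_links(sample))
# prints: subject:matches-attr(arg) [docs](https://example.com)
```

The heading text survives as plain text while the icon-only anchor disappears, which matches the outcome described for the next version.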

@allrobot
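Author

This is where rich-text HTML shows its advantage: pasted into GitHub's edit box, it is automatically displayed in a suitable format. SiYuan should add rich text/HTML support, so that whatever is cut or copied next time, SiYuan can read the HTML directly and convert it into a suitable format.

SiYuan_x3BnSQ8sH4

I have run into clipboard problems several times now. This one is solved, but what happens when I hit yet another one next time? 😂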

@88250
Member

88250 commented May 12, 2024 via email

@allrobot
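Author

HTML cannot be supported. We will keep fixing problems as they come up.

Rich-text clipboard content comes in all sorts of strange forms, so this is bound to need maintenance for a long time... I would not recommend maintaining it this way. If SiYuan is abandoned one day, many aspects like this will be hard to maintain.

A casual copy just now hit a new problem: the line breaks are missing:
SiYuan_8RND8H7cwI

Pasted into Notepad or GitHub's edit box, the content keeps its line breaks:
chrome_xjdLfIUcdd

One more problem:
When inline code is copied from HTML into SiYuan, and that inline code is then copied within SiYuan and pasted into a code snippet, the result becomes plain-text Markdown

SiYuan_2akbGjCbNT

Rich text that I copy from a web page can be pasted into a SiYuan code snippet as ordinary text...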

@88250
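Member

88250 commented May 13, 2024

There is no better alternative. Content copied and pasted from all kinds of HTML can only be adapted case by case. This processing step cannot be avoided, and it is one of the troublesome parts of developing a rich text editor.

The missing line breaks in the video are most likely because the page DOM has no line-breaking elements there (such as p/div or br). Could you share the URL so I can investigate?

When inline code is copied from HTML into SiYuan, and that inline code is then copied within SiYuan and pasted into a code snippet, the result becomes plain-text Markdown

Copying in SiYuan puts text (Markdown) and HTML on the clipboard. As I recall, both formats are there, and which one is used depends on the paste target. When pasting in SiYuan, the formatted version takes priority and plain text comes second, so inside a code block the text version is used and it carries the `. To avoid this, choose Copy Plain Text when copying, or Paste as Plain Text when pasting.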

@allrobot
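Author

allrobot commented May 13, 2024

Could you share the URL so I can investigate?

https://www.dataleadsfuture.com/combining-traditional-thread-based-code-and-asyncio-in-python/

Immersive Translate was enabled
image

Copying in SiYuan puts text (Markdown) and HTML on the clipboard. As I recall, both formats are there, and which one is used depends on the paste target. When pasting in SiYuan, the formatted version takes priority and plain text comes second, so inside a code block the text version is used and it carries the `. To avoid this, choose Copy Plain Text when copying, or Paste as Plain Text when pasting.

Can't you put rich text on the clipboard? The clipboard supports dozens of formats. Convert the Markdown text to HTML and place it on the clipboard as rich text; then copy and paste would no longer be a problem.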

@88250
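Member

88250 commented May 13, 2024

Can't you put rich text on the clipboard? The clipboard supports dozens of formats. Convert the Markdown text to HTML and place it on the clipboard as rich text; then copy and paste would no longer be a problem.

It should already be there, but it mainly depends on how the receiving software handles it when pasting.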

@88250
Member

88250 commented May 13, 2024

Immersive Translate was enabled

Then there are probably no line breaks; in that case they cannot be parsed out.

@allrobot
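Author

It should already be there, but it mainly depends on how the receiving software handles it when pasting.

I looked at the inline code test copied from SiYuan. Its rich text content is: <p id="20240514061411-uim5foo" updated="20240514061411"><span data-type="code">test</span>​</p>

Perhaps because SiYuan's editor sees id="20240514061411-uim5foo" updated="20240514061411", it moves the inline code into the code snippet as plain text?

I just tested: if the rich text content of test is <span style="font-family: Consolas, Monaco, monospace;">test</span>, that is, <p style="font-family: Consolas, Monaco, monospace;" id="20240514061411-uim5foo" updated="20240514061411"><span data-type="code" >test</span>​</p>, then pasting it into a code snippet no longer produces `test` with backticks. You can try it.

Then there are probably no line breaks; in that case they cannot be parsed out.

What do you mean? Immersive Translate embeds the translation's HTML elements into the page, doesn't it? I checked the source, and there is a <br> before the translated text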

@88250
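Member

88250 commented May 13, 2024

I tested copying an inline element and copying a block element and then pasting directly; both cases carry the `:

issue.webm

Using Paste as Plain Text leaves out the `:

issue.webm

What do you mean? Immersive Translate embeds the translation's HTML elements into the page, doesn't it? I checked the source, and there is a <br> before the translated text

When <font> is processed, any <br> inside it is replaced with a space: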

image
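For comparison with the behavior just described, here is a minimal sketch of text extraction that turns `<br>` into a newline instead of a space. It uses Python's standard `html.parser` and is not SiYuan's code; the names `TextWithBreaks`, `BLOCK_TAGS`, and `html_to_text` are hypothetical, and treating only p and div as block elements is a simplifying assumption.

```python
from html.parser import HTMLParser

# Elements whose end is treated as a line break (simplified for this sketch).
BLOCK_TAGS = {"p", "div"}


class TextWithBreaks(HTMLParser):
    """Collect text content, emitting a newline for <br> and after block elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br/> via the default handle_startendtag.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(html_to_text("<font>original text<br>translated text</font>")))
# prints: 'original text\ntranslated text'
```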

@allrobot
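Author

I tested copying an inline element and copying a block element and then pasting directly; both cases carry the `:

That is because, when copying, the text SiYuan writes to the clipboard is `XXX`, or test `XXX` test. For the CF_TEXT or CF_UNICODETEXT clipboard formats, the text should be XXX without the backticks.

Python reference code: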

from re import search

text = '''Version:0.9
StartHTML:0000000105
EndHTML:0000000272
StartFragment:0000000141
EndFragment:0000000236
<html>
<body>
<!--StartFragment--><p style="font-family: Consolas, Monaco, monospace;" id="20240514135730-37ivg0u" updated="20240514135730"><span data-type="code">loop</span></p>
<!--EndFragment-->
</body>
</html>'''

import bs4
import win32clipboard


def html_to_clipboard():
    """Put the HTML and a backtick-free plain text on the clipboard, then read both back."""
    CF_HTML = win32clipboard.RegisterClipboardFormat("HTML Format")

    win32clipboard.OpenClipboard(0)
    try:
        win32clipboard.EmptyClipboard()
        win32clipboard.SetClipboardData(CF_HTML, text.encode('utf-8'))
        unicode_text = 'loop'
        win32clipboard.SetClipboardText(unicode_text, win32clipboard.CF_UNICODETEXT)
        clipboard_data = win32clipboard.GetClipboardData(CF_HTML)
        print("Rich text:\n" + clipboard_data.decode('utf-8'))
        clipboard_data = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
        print("Plain text:\n" + clipboard_data)
    finally:
        # Always release the clipboard so other applications can use it.
        win32clipboard.CloseClipboard()


def get_clipboard():
    """Print the HTML fragment (parsed with lxml and pretty-printed) and the plain text."""
    CF_HTML = win32clipboard.RegisterClipboardFormat("HTML Format")

    win32clipboard.OpenClipboard(0)
    try:
        html = win32clipboard.GetClipboardData(CF_HTML).decode('utf-8')
        plain = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
    finally:
        win32clipboard.CloseClipboard()

    match = search('<!--StartFragment-->([\\s\\S]+)<!--EndFragment-->|<html>([\\s\\S]+)</html>', html)
    print("Rich text:\n" + bs4.BeautifulSoup(match.group(0), 'lxml').prettify())
    print("Plain text:\n" + plain)


html_to_clipboard()
# get_clipboard()

SiYuan_ke07WE7oQ7
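The point made in this comment, that the plain-text clipboard flavor should carry XXX rather than `XXX`, could be sketched as a small conversion step applied before the text is written. This is an assumption about one possible approach, not how SiYuan works; `markdown_to_plain_text` is a hypothetical name, and the pattern only covers single-backtick inline code.

```python
import re

# Matches a single-backtick inline code span such as `XXX`.
INLINE_CODE = re.compile(r"`([^`\n]+)`")


def markdown_to_plain_text(markdown: str) -> str:
    """Drop the backticks around inline code, keeping the code text itself."""
    return INLINE_CODE.sub(r"\1", markdown)


print(markdown_to_plain_text("test `XXX` test"))  # prints: test XXX test
```

The result is what would then be passed as the CF_UNICODETEXT text, while the HTML flavor keeps the inline code markup.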

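When <font> is processed, any <br> inside it is replaced with a space:

However, when I parse the clipboard HTML in PyCharm with the 'lxml' parser, the text returned by get_text() does contain line breaks. How about switching to a different parser?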

chrome_0hjrvM1P51

@88250
Member

88250 commented May 14, 2024 via email

@allrobot
Author

SiYuan_TBQOoBMOck

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
It is still the span element problem

@allrobot
Author

allrobot commented Jun 4, 2024

https://aria2.github.io/

SiYuan_Prq0n5kZQt

@allrobot
Author

allrobot commented Jun 9, 2024
