Xingma (shape-based) single characters change candidate order according to frequency #449

Closed
lld2001 opened this issue Jul 6, 2022 · 24 comments

Comments

@lld2001
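Contributor

lld2001 commented Jul 6, 2022

I had not upgraded for a while. After upgrading to the latest version, I found that single characters in xingma (shape-based, e.g. Wubi) schemes are reordered in the candidate list according to usage frequency:
[image]

Order in the dictionary:
wubi/g 一 与 王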

@tumashu
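Owner

tumashu commented Jul 6, 2022

Yes. The current rule is that the first character is always the first character in the dictionary, and the ones after it are adjusted dynamically by usage frequency. I am not sure whether this rule is reasonable.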

@lld2001
Contributor Author

lld2001 commented Jul 6, 2022

For xingma schemes a fixed character order is best; the order of words (multi-character entries) can be adjusted, I think.

@lld2001
Contributor Author

lld2001 commented Jul 6, 2022

Or could the sorting method be exposed, so that xingma can be handled as a special case?

@tumashu
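Owner

tumashu commented Jul 6, 2022

Xingma involves both characters and words, as well as the personal dictionary and the common dictionaries, which makes it quite confusing. Ideally we find a general algorithm; if not, I will split the relevant functions out so that you can override them.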

@tumashu
Owner

tumashu commented Jul 6, 2022

Or add an option to control it.

@lld2001
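Contributor Author

lld2001 commented Jul 6, 2022

Personally: if the candidates are all characters, follow the dictionary order. If characters and words are mixed, order characters by the dictionary and words by frequency. If they are all words, order by frequency. I recall this was discussed once before.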

@lld2001
Contributor Author

lld2001 commented Jul 6, 2022

Now that the current version is written with cl-lib, I cannot quite read it and no longer know how to hack it.

@tumashu
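Owner

tumashu commented Jul 6, 2022

If characters and words are mixed, order characters by the dictionary and words by frequency

Does that mean characters first and then words, or words first and then characters?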

@lld2001
Contributor Author

lld2001 commented Jul 6, 2022

Characters first, in standard dictionary order.
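
The requested rule (characters first in dictionary order, then words by frequency) can be illustrated with a small self-contained sketch. It does not use pyim at all: `my-xingma-order` and the `counts` alist are hypothetical names invented for illustration, and a "character" is assumed to be any entry of length 1.

```elisp
(require 'seq)

(defun my-xingma-order (entries counts)
  "Order ENTRIES (strings, in dictionary order) for a xingma scheme.
COUNTS is a hypothetical alist of (ENTRY . USAGE-COUNT).
Single characters keep their dictionary order and come first;
multi-character words follow, most frequently used first."
  (let* ((char-p (lambda (e) (= (length e) 1)))
         (chars (seq-filter char-p entries))
         (words (seq-remove char-p entries))
         (count-of (lambda (w) (alist-get w counts 0 nil #'equal))))
    (append chars
            ;; `sort' is stable, so words with equal counts keep
            ;; their dictionary order.
            (sort (copy-sequence words)
                  (lambda (a b)
                    (> (funcall count-of a) (funcall count-of b)))))))

;; Characters only: frequency is ignored.
(my-xingma-order '("一" "与" "王") '(("与" . 5)))
;; => ("一" "与" "王")
```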

@tumashu
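Owner

tumashu commented Jul 6, 2022

Now that the current version is written with cl-lib, I cannot quite read it and no longer know how to hack it.

Basically, you add code like the following to your configuration: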

(cl-defmethod pyim-candidates-create
  :extra "lld2001hack" (imobjs (scheme pyim-scheme-xingma))
  "Get candidates from IMOBJS according to SCHEME; used for xingma input methods such as Wubi and Cangjie."
  (let (result)
    (dolist (imobj imobjs)
      (let* ((codes (pyim-codes-create imobj scheme))
             (last-code (car (last codes)))
             (other-codes (remove last-code codes))
             output prefix)

        ;; Suppose wubi/aaaa -> 工 㠭; wubi/bbbb -> 子 子子孙孙; wubi/cccc -> 又 叕;
        ;; and the user types: aaaabbbbcccc

        ;; Then:
        ;; 1. codes       =>   ("wubi/aaaa" "wubi/bbbb" "wubi/cccc")
        ;; 2. last-code   =>   "wubi/cccc"
        ;; 3. other-codes =>   ("wubi/aaaa" "wubi/bbbb")
        ;; 4. prefix      =>   工子
        (when other-codes
          (setq prefix (mapconcat
                        (lambda (code)
                          (pyim-candidates-get-chief
                           scheme
                           (pyim-dcache-get code '(icode2word))
                           (pyim-dcache-get code '(code2word))))
                        other-codes "")))

        ;; 5. output => 工子又 工子叕
        (setq output
              (let* ((personal-words (pyim-dcache-get last-code '(icode2word)))
                     (personal-words (pyim-candidates--sort personal-words))
                     (common-words (pyim-dcache-get last-code '(code2word)))
                     (chief-word (pyim-candidates-get-chief scheme personal-words common-words))
                     (common-words (pyim-candidates--sort common-words))
                     (other-words (pyim-dcache-get last-code '(shortcode2word))))
                (mapcar (lambda (word)
                          (concat prefix word))
                        `(,chief-word
                          ,@personal-words
                          ,@common-words
                          ,@other-words))))
        (setq output (remove "" (or output (list prefix))))
        (setq result (append result output))))
    (when (car result)
      (delete-dups result))))

@lld2001
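Contributor Author

lld2001 commented Jul 6, 2022

How do I use this? Do I put it after (require 'pyim)?

I searched a bit and roughly understand how overriding works, but I do not know how to write this logic. Could you help me with it?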

tumashu added a commit that referenced this issue Jul 6, 2022
@tumashu
Owner

tumashu commented Jul 6, 2022

I tried adjusting it; you can try again.

(defun pyim-candidates--xingma-words (code)
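  "Search CODE according to the rules of the xingma scheme and return the matching entries.

The current rules for building the entry list are:
1. Characters from the common dictionaries come first.
2. Words from all dictionaries follow, reordered dynamically by frequency."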
  (let* ((common-words (pyim-dcache-get code '(code2word)))
         (common-chars (pyim-candidates--get-chars common-words))
         (personal-words (pyim-dcache-get code '(icode2word)))
         (other-words (pyim-dcache-get code '(shortcode2word)))
         (words-without-chars
          (pyim-candidates--sort
           (pyim-candidates--remove-chars
            (delete-dups
             `(,@personal-words
               ,@common-words
               ,@other-words))))))
    `(,@common-chars
      ,@words-without-chars)))

@lld2001
Contributor Author

lld2001 commented Jul 6, 2022

OK, thanks.

Also, I debugged a bit and found that the order has already changed by the time common-words is fetched.

(pyim-dcache-get "wubi/g" '(code2word))

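This code outputs:

("与" "一" "王")

My understanding is that code2word is the dictionary, and its order does not change by default? By default "一" should come first (it is a first-level short code).

After I deleted the dcache and restarted, the output was: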

("一" "与" "王")

@tumashu
Owner

tumashu commented Jul 7, 2022

Right, this order does not change; it follows whatever order the dictionary has, unless you have added multiple dictionaries.

@lld2001
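Contributor Author

lld2001 commented Jul 7, 2022

But on my side the order did change, and I do not know why. Could it be related to how I use it?

I have three machines that sync the personal dictionary with each other. I periodically export it to an external dict file with pyim-export-words-and-counts, and at Emacs startup I import it into each of the three dictionaries with pyim-import-words-and-counts. This now produces a large number of dated cache files: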

pyim-dhashcache-icode2word-backup-20220704084810
pyim-dhashcache-icode2word-backup-20220704113050
pyim-dhashcache-icode2word-backup-20220705081521
pyim-dhashcache-icode2word-backup-20220705082011
pyim-dhashcache-icode2word-backup-20220706091706
pyim-dhashcache-icode2word-backup-20220706144411
pyim-dhashcache-icode2word-backup-20220706171605
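
A sync workflow like the one described might look roughly like the sketch below. It is an assumption-laden illustration, not the poster's actual configuration: `my-pyim-sync-file` and the two wrapper functions are hypothetical, the file path is made up, export happens on exit here instead of on a timer, and both pyim functions are called with only a file argument because their optional arguments are not shown in this thread.

```elisp
;; Hypothetical path to a dict file shared between machines.
(defvar my-pyim-sync-file (expand-file-name "~/sync/pyim-personal.pyim"))

(defun my-pyim-export-personal-words ()
  "Export the personal dictionary to `my-pyim-sync-file' (sketch)."
  (when (fboundp 'pyim-export-words-and-counts)
    (pyim-export-words-and-counts my-pyim-sync-file)))

(defun my-pyim-import-personal-words ()
  "Import `my-pyim-sync-file' into the personal dictionary (sketch)."
  (when (and (fboundp 'pyim-import-words-and-counts)
             (file-readable-p my-pyim-sync-file))
    (pyim-import-words-and-counts my-pyim-sync-file)))

;; Export when Emacs exits; import once pyim has been loaded.
(add-hook 'kill-emacs-hook #'my-pyim-export-personal-words)
(with-eval-after-load 'pyim
  (my-pyim-import-personal-words))
```

Importing a file that was merged from several machines can change the size of the personal cache considerably in one step.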

@lld2001
Contributor Author

lld2001 commented Jul 7, 2022

The word order is now as expected.

pyim-dhashcache-icode2word-backup-20220704084810
pyim-dhashcache-icode2word-backup-20220704113050
pyim-dhashcache-icode2word-backup-20220705081521
pyim-dhashcache-icode2word-backup-20220705082011
pyim-dhashcache-icode2word-backup-20220706091706
pyim-dhashcache-icode2word-backup-20220706144411
pyim-dhashcache-icode2word-backup-20220706171605

Files like these still show up, and I do not know why.

@tumashu
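Owner

tumashu commented Jul 7, 2022

This is pyim's protection mechanism for the personal dictionary cache: if the size of the personal dictionary cache changes by more than a threshold, pyim makes a backup to prevent data loss caused by a corrupted cache.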

@lld2001
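Contributor Author

lld2001 commented Jul 7, 2022

I have three machines that sync the personal dictionary with each other. I periodically export it to an external dict file with pyim-export-words-and-counts, and at Emacs startup I import it into each of the three dictionaries with pyim-import-words-and-counts.

Would this cause large changes to the dictionary?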

@tumashu
Owner

tumashu commented Jul 7, 2022

I am not sure. Generally, when the dictionary's hash-table-count changes by more than 20%, a backup is made automatically.
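
As a rough illustration of a 20% rule of this kind, the check could be sketched as follows. `my-count-changed-enough-p` is a hypothetical helper, not pyim's own function, and it treats growth and shrinkage alike, which may differ from how pyim actually compares the counts.

```elisp
(defun my-count-changed-enough-p (old-count new-count &optional threshold)
  "Return non-nil if NEW-COUNT differs from OLD-COUNT by more than THRESHOLD.
THRESHOLD is a fraction of OLD-COUNT and defaults to 0.2 (20%).
An OLD-COUNT of zero never triggers, to avoid dividing by zero."
  (let ((threshold (or threshold 0.2)))
    (and (> old-count 0)
         (> (/ (abs (- new-count old-count)) (float old-count))
            threshold))))

;; Counts would come from `hash-table-count' on the old and new tables.
(my-count-changed-enough-p 100 130)   ; 30% change => t
(my-count-changed-enough-p 100 110)   ; 10% change => nil
```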

@lld2001
Contributor Author

lld2001 commented Jul 7, 2022

OK, thanks. I will delete them manually from time to time.
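
The manual cleanup could be scripted along these lines. This is a sketch under assumptions: `my-pyim-prune-backups` is a hypothetical helper, the file name pattern is taken from the listings above, and the directory holding the cache must be passed in by the caller.

```elisp
(defun my-pyim-prune-backups (dir keep)
  "Delete all but the KEEP newest icode2word backup files in DIR.
Backup names end in a YYYYMMDDHHMMSS timestamp, so sorting the
names in descending order puts the newest files first.
Return the list of deleted files."
  (let* ((files (directory-files
                 dir t
                 "\\`pyim-dhashcache-icode2word-backup-[0-9]+\\'"))
         (newest-first (sort files #'string>))
         (old (nthcdr keep newest-first)))
    (dolist (file old)
      (delete-file file))
    old))
```

A call might look like `(my-pyim-prune-backups pyim-dcache-directory 3)`, assuming `pyim-dcache-directory` is the variable that points at the cache directory in your pyim version.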

@lld2001 lld2001 closed this as completed Jul 7, 2022
@xuan-w
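
xuan-w commented May 12, 2023

If characters and words are mixed, order characters by the dictionary and words by frequency

Does that mean characters first and then words, or words first and then characters?

As you said, users of different xingma schemes really do have different needs. For example, as a Zhengma user, I want the input method to ignore the distinction between characters and words entirely and sort only by frequency or by dictionary order.
This is because Zhengma itself has a very low rate of duplicate codes, which leaves many code slots for words. For a Zhengma user, in 99.9% of cases typing nyll means the word "自己" rather than the single character "翺", because the latter is almost always used in the word "翱翔", and the code for "翱翔", nguy, is unique with no duplicates.

Since different users have different needs, could an option be provided, with at least "sort strictly by dictionary file order" as one of the choices?
Thanks!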

@tumashu
Owner

tumashu commented May 16, 2023

@xuan-w I think those with special needs should just advise the function below; that is more flexible than an option.

(defun pyim-candidates--xingma-words (code)
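  "Search the xingma CODE and return the matching entries.

The current rules for building the entry list are:
1. Characters from the common dictionaries come first.
2. Words from all dictionaries follow, reordered dynamically by frequency."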
  (let* ((common-words (pyim-dcache-get code '(code2word)))
         (common-chars (pyim-candidates--get-chars common-words))
         (personal-words (pyim-dcache-get code '(icode2word)))
         (other-words (pyim-dcache-get code '(shortcode2word)))
         (words-without-chars
          (pyim-candidates--sort
           (pyim-candidates--remove-chars
            (delete-dups
             `(,@personal-words
               ,@common-words
               ,@other-words))))))
    `(,@common-chars
      ,@words-without-chars)))
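
For the "strict dictionary order" request, an advice on this function might look like the sketch below. It assumes that `pyim-dcache-get` returns entries in dictionary order, as stated earlier in the thread; `my-pyim-xingma-words-dict-order` is a hypothetical name, and putting the common dictionary ahead of the personal one is a choice made here, not something pyim prescribes.

```elisp
(defun my-pyim-xingma-words-dict-order (code)
  "Return the entries for xingma CODE in dictionary order only.
No frequency sorting and no character/word split is applied."
  (delete-dups
   ;; The trailing nil makes `append' copy every list, so the
   ;; destructive `delete-dups' never touches the cached lists.
   (append (pyim-dcache-get code '(code2word))
           (pyim-dcache-get code '(icode2word))
           (pyim-dcache-get code '(shortcode2word))
           nil)))

(with-eval-after-load 'pyim
  (advice-add 'pyim-candidates--xingma-words
              :override #'my-pyim-xingma-words-dict-order))
```

The advice can be undone with `(advice-remove 'pyim-candidates--xingma-words #'my-pyim-xingma-words-dict-order)`. Because the advised function is internal (note the double dash in its name), it may be renamed or changed in later pyim versions.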

@sunlin7

sunlin7 commented Sep 7, 2023

@xuan-w I think those with special needs should just advise the function below; that is more flexible than an option.

(defun pyim-candidates--xingma-words (code)
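  "Search the xingma CODE and return the matching entries.

The current rules for building the entry list are:
1. Characters from the common dictionaries come first.
2. Words from all dictionaries follow, reordered dynamically by frequency."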
  (let* ((common-words (pyim-dcache-get code '(code2word)))
         (common-chars (pyim-candidates--get-chars common-words))
         (personal-words (pyim-dcache-get code '(icode2word)))
         (other-words (pyim-dcache-get code '(shortcode2word)))
         (words-without-chars
          (pyim-candidates--sort
           (pyim-candidates--remove-chars
            (delete-dups
             `(,@personal-words
               ,@common-words
               ,@other-words))))))
    `(,@common-chars
      ,@words-without-chars)))

@tumashu Could you put the snippet above into the documentation? 🙏
It took me a long time to find this thread 🥲

@tumashu
Owner

tumashu commented Sep 8, 2023

@xuan-w @sunlin7 I have added a configuration variable; you can give it a try.
