Shape-based (xingma) input: single characters are reordered by frequency in the candidate list #449

lld2001 · 2022-07-06T06:49:19Z

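I hadn't upgraded for a while. After upgrading to the latest version, I noticed that for shape-based schemes (Wubi), single-character candidates are now reordered by usage frequency:

Order in the dictionary: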
wubi/g 一与王

tumashu · 2022-07-06T08:04:51Z

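Yes. The current rule is that the first character is always the first character in the dictionary, and the ones after it are reordered dynamically by usage frequency. I'm not sure whether this rule is reasonable.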

lld2001 · 2022-07-06T08:13:58Z

For shape-based schemes, a fixed character order is best. I think the order of multi-character words can be adjusted.

lld2001 · 2022-07-06T08:22:34Z

Or could the sorting method be exposed, so that shape-based schemes can be handled specially?

tumashu · 2022-07-06T08:37:55Z

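Shape-based schemes involve both characters and words, plus the personal dictionary and the common dictionaries, which gets confusing. Ideally we would find one general algorithm. If we can't, I will split out the relevant functions so that you can override them.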

tumashu · 2022-07-06T08:38:30Z

Or add an option to control it.

lld2001 · 2022-07-06T08:47:01Z

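Speaking for myself: if the candidates are all single characters, use dictionary order. If characters and words are mixed, order the characters by the dictionary and the words by frequency. If they are all words, order by frequency. I recall this was discussed once before.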

lld2001 · 2022-07-06T08:50:16Z

Since the current version was rewritten with cl-lib, I can't really follow it and no longer know how to hack on it.

tumashu · 2022-07-06T08:50:29Z

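"If characters and words are mixed, order the characters by the dictionary and the words by frequency."

Does that mean characters first and then words, or words first and then characters?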

lld2001 · 2022-07-06T08:51:33Z

Characters first, in the standard dictionary order.
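The rule described in this exchange (single characters first in dictionary order, then words by frequency) can be sketched in plain Emacs Lisp. This is only an illustration under stated assumptions: `my-xingma-order` and the `counts` alist are hypothetical names, not part of pyim, and "character" is approximated as a string of length 1.

```elisp
(require 'seq)

(defun my-xingma-order (dict-entries counts)
  "Return DICT-ENTRIES with single characters first, then words by frequency.
DICT-ENTRIES is a list of strings in dictionary order.  COUNTS is a
hypothetical alist of (WORD . USE-COUNT); missing words count as 0."
  (let* ((char-p (lambda (s) (= (length s) 1)))
         ;; Characters keep their dictionary order.
         (chars (seq-filter char-p dict-entries))
         (words (seq-remove char-p dict-entries))
         (count-of (lambda (w) (or (cdr (assoc w counts)) 0))))
    (append chars
            ;; `sort' is stable, so words with equal counts keep
            ;; their dictionary order.
            (sort (copy-sequence words)
                  (lambda (a b)
                    (> (funcall count-of a) (funcall count-of b)))))))

;; Example: the character stays first, the more frequent word moves up.
(my-xingma-order '("子" "子子孙孙" "孩子") '(("孩子" . 5)))
;; => ("子" "孩子" "子子孙孙")
```

When every entry is a single character, the result is simply the dictionary order; when every entry is a word, it is a pure frequency sort.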

tumashu · 2022-07-06T08:59:32Z

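"Since the current version was rewritten with cl-lib, I can't really follow it and no longer know how to hack on it."

Basically, you add something like the following code to your configuration: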

(cl-defmethod pyim-candidates-create
  :extra "lld2001hack" (imobjs (scheme pyim-scheme-xingma))
  "Build candidates from IMOBJS according to SCHEME; used for shape-based input methods such as Wubi and Cangjie."
  (let (result)
    (dolist (imobj imobjs)
      (let* ((codes (pyim-codes-create imobj scheme))
             (last-code (car (last codes)))
             (other-codes (remove last-code codes))
             output prefix)

        ;; Suppose wubi/aaaa -> 工 㠭; wubi/bbbb -> 子 子子孙孙; wubi/cccc -> 又 叕,
        ;; and the user types: aaaabbbbcccc

        ;; Then:
        ;; 1. codes       =>   ("wubi/aaaa" "wubi/bbbb" "wubi/cccc")
        ;; 2. last-code   =>   "wubi/cccc"
        ;; 3. other-codes =>   ("wubi/aaaa" "wubi/bbbb")
        ;; 4. prefix      =>   工子
        (when other-codes
          (setq prefix (mapconcat
                        (lambda (code)
                          (pyim-candidates-get-chief
                           scheme
                           (pyim-dcache-get code '(icode2word))
                           (pyim-dcache-get code '(code2word))))
                        other-codes "")))

        ;; 5. output => 工子又 工子叕
        (setq output
              (let* ((personal-words (pyim-dcache-get last-code '(icode2word)))
                     (personal-words (pyim-candidates--sort personal-words))
                     (common-words (pyim-dcache-get last-code '(code2word)))
                     (chief-word (pyim-candidates-get-chief scheme personal-words common-words))
                     (common-words (pyim-candidates--sort common-words))
                     (other-words (pyim-dcache-get last-code '(shortcode2word))))
                (mapcar (lambda (word)
                          (concat prefix word))
                        `(,chief-word
                          ,@personal-words
                          ,@common-words
                          ,@other-words))))
        (setq output (remove "" (or output (list prefix))))
        (setq result (append result output))))
    (when (car result)
      (delete-dups result))))

lld2001 · 2022-07-06T09:04:18Z

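How do I use this? Do I put it after (require 'pyim)?

I searched around and roughly understand how the override works, but I don't know how to write this logic. Could you help me with it?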
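On the placement question, a minimal sketch: the method definition has to be evaluated after pyim has defined the generic function and the `pyim-scheme-xingma` class, so either put it after `(require 'pyim)` or wrap it in `with-eval-after-load`. The method name `"my-xingma-hack"` and the pass-through body are placeholders, and the argument list simply mirrors the snippet above; other pyim versions may use a different signature.

```elisp
(require 'cl-lib)

(with-eval-after-load 'pyim
  (cl-defmethod pyim-candidates-create
    :extra "my-xingma-hack" (imobjs (scheme pyim-scheme-xingma))
    "Hypothetical extra method for shape-based schemes."
    ;; Placeholder logic: fall through to the stock method and return
    ;; its result unchanged.  Replace this with your own ordering.
    (ignore imobjs scheme)
    (cl-call-next-method)))
```

Using `with-eval-after-load` keeps the configuration independent of where pyim is loaded in the init file.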

tumashu · 2022-07-06T21:35:02Z

I've tried adjusting it. Please give it another try.

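(defun pyim-candidates--xingma-words (code)
  "Look up CODE following the rules of shape-based schemes and return the matching entries.

The current rule for building the list is:
1. Single characters from the common dictionaries come first.
2. Words from all dictionaries follow, reordered dynamically by frequency."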
  (let* ((common-words (pyim-dcache-get code '(code2word)))
         (common-chars (pyim-candidates--get-chars common-words))
         (personal-words (pyim-dcache-get code '(icode2word)))
         (other-words (pyim-dcache-get code '(shortcode2word)))
         (words-without-chars
          (pyim-candidates--sort
           (pyim-candidates--remove-chars
            (delete-dups
             `(,@personal-words
               ,@common-words
               ,@other-words))))))
    `(,@common-chars
      ,@words-without-chars)))

lld2001 · 2022-07-06T22:40:47Z

OK, thanks.

I also did some debugging and found that the order has already changed by the time common-words is fetched.

(pyim-dcache-get "wubi/g" '(code2word))

This code returns:

("与" "一" "王")

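My understanding is that code2word is the dictionary and that its order does not change by default. Is that right? By default, "一" should come first (it is a level-1 short code).

After I deleted the dcache and restarted, the output became: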

("一" "与" "王")

tumashu · 2022-07-07T00:12:03Z

Right, this order does not change. It follows the dictionary file as-is, unless you have added multiple dictionaries.

lld2001 · 2022-07-07T00:37:14Z

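But on my machine the order did change, and I don't know why. Could it be related to how I use pyim?

I have three machines that sync their personal dictionaries with each other. I periodically export to an external dict file with pyim-export-words-and-counts, and when Emacs starts I import it back with pyim-import-words-and-counts into each of the three dictionaries. This now produces a large number of dated cache files: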

pyim-dhashcache-icode2word-backup-20220704084810
pyim-dhashcache-icode2word-backup-20220704113050
pyim-dhashcache-icode2word-backup-20220705081521
pyim-dhashcache-icode2word-backup-20220705082011
pyim-dhashcache-icode2word-backup-20220706091706
pyim-dhashcache-icode2word-backup-20220706144411
pyim-dhashcache-icode2word-backup-20220706171605

lld2001 · 2022-07-07T02:05:59Z

The word order is now what I expect.

pyim-dhashcache-icode2word-backup-20220704084810
pyim-dhashcache-icode2word-backup-20220704113050
pyim-dhashcache-icode2word-backup-20220705081521
pyim-dhashcache-icode2word-backup-20220705082011
pyim-dhashcache-icode2word-backup-20220706091706
pyim-dhashcache-icode2word-backup-20220706144411
pyim-dhashcache-icode2word-backup-20220706171605

Files like these still show up, though, and I don't know why.

tumashu · 2022-07-07T02:10:34Z

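This is pyim's protection mechanism for the personal dictionary cache. If the size of the personal dictionary cache changes by more than a threshold, pyim makes a backup to prevent data loss from a corrupted cache.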

lld2001 · 2022-07-07T02:12:28Z

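"I have three machines that sync their personal dictionaries with each other. I periodically export to an external dict file with pyim-export-words-and-counts, and when Emacs starts I import it back with pyim-import-words-and-counts into each of the three dictionaries."

Would this cause the dictionary to change a lot?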

tumashu · 2022-07-07T03:00:15Z

I don't know. In general, a backup is made automatically when the dictionary's hash-table-count changes by more than 20%.
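To make the 20% rule concrete, here is a rough sketch of such a check. It is not pyim's actual code: `my-cache-changed-too-much-p` is a hypothetical helper, and treating the threshold as a relative change in either direction is an assumption; pyim's real condition may differ.

```elisp
(defun my-cache-changed-too-much-p (old-table new-table &optional threshold)
  "Return non-nil if the entry count changed by more than THRESHOLD.
Compare `hash-table-count' of OLD-TABLE and NEW-TABLE as a fraction of
the old count.  THRESHOLD defaults to 0.2 (20%).  An empty OLD-TABLE
never triggers.  Hypothetical helper, not part of pyim."
  (let ((old (hash-table-count old-table))
        (new (hash-table-count new-table))
        (threshold (or threshold 0.2)))
    (and (> old 0)
         (> (/ (abs (- new old)) (float old)) threshold))))
```

Under this reading, importing a merged word list from several machines could plausibly push the count past the threshold and trigger a backup each time.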

lld2001 · 2022-07-07T03:02:32Z

OK, thanks. I'll just delete them manually from time to time.
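The periodic cleanup could also be scripted. The following is a sketch only: `my-pyim-delete-old-backups` is a hypothetical helper, the file-name pattern is taken from the listing earlier in this thread, and which directory holds the backups is for you to supply.

```elisp
(defun my-pyim-delete-old-backups (directory &optional keep)
  "Delete icode2word backup files in DIRECTORY, keeping the KEEP newest.
KEEP defaults to 2.  Return the list of deleted file names."
  (let* ((files (directory-files
                 directory t
                 "\\`pyim-dhashcache-icode2word-backup-[0-9]+\\'"))
         ;; The timestamp suffix has a fixed width, so string order is
         ;; chronological order and the newest files come last.
         (sorted (sort files #'string<))
         (old (butlast sorted (or keep 2))))
    (dolist (file old)
      (delete-file file))
    old))
```

A possible call is `(my-pyim-delete-old-backups pyim-dcache-directory)`, assuming `pyim-dcache-directory` points at the directory where these files appear on your system.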

xuan-w · 2023-05-12T18:44:00Z

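"If characters and words are mixed, order the characters by the dictionary and the words by frequency."
"Does that mean characters first and then words, or words first and then characters?"

As you said, users of different shape-based schemes do have different needs. For example, as a Zhengma user, I want the input method to ignore the character/word distinction entirely and sort only by frequency or by dictionary order.
Zhengma has a very low rate of code collisions, which leaves many code slots free for words. For a Zhengma user, typing nyll means the word "自己" 99.9% of the time rather than the single character "翺", because the latter is almost always used in the word "翱翔", whose code nguy is unique and has no collisions.

Since different users have different needs, could an option be provided that at least offers "strictly follow the dictionary file order" as one choice?
Thanks!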

tumashu · 2023-05-16T00:47:14Z

@xuan-w I think people with special needs are better off advising the function below directly. That is more flexible than an option.

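(defun pyim-candidates--xingma-words (code)
  "Look up the shape-based CODE and return the matching entries.

The current rule for building the list is:
1. Single characters from the common dictionaries come first.
2. Words from all dictionaries follow, reordered dynamically by frequency."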
  (let* ((common-words (pyim-dcache-get code '(code2word)))
         (common-chars (pyim-candidates--get-chars common-words))
         (personal-words (pyim-dcache-get code '(icode2word)))
         (other-words (pyim-dcache-get code '(shortcode2word)))
         (words-without-chars
          (pyim-candidates--sort
           (pyim-candidates--remove-chars
            (delete-dups
             `(,@personal-words
               ,@common-words
               ,@other-words))))))
    `(,@common-chars
      ,@words-without-chars)))
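As an illustration of the advice approach, the sketch below overrides the function so that entries follow the dictionary file order, with personal entries appended afterwards. `my-pyim-xingma-words-dict-order` is a hypothetical name; the only pyim calls used are the `pyim-dcache-get` forms already shown above. Because `pyim-candidates--xingma-words` is a private function, it may be renamed in other pyim versions.

```elisp
(defun my-pyim-xingma-words-dict-order (code)
  "Return entries for CODE in dictionary-file order.
Personal entries that are not in the common dictionaries are appended.
Hypothetical replacement for `pyim-candidates--xingma-words'."
  (delete-dups
   ;; The trailing nil makes `append' copy both lists, so the
   ;; destructive `delete-dups' never touches pyim's cached lists.
   (append (pyim-dcache-get code '(code2word))
           (pyim-dcache-get code '(icode2word))
           nil)))

(with-eval-after-load 'pyim
  (advice-add 'pyim-candidates--xingma-words
              :override #'my-pyim-xingma-words-dict-order))
```

The advice can be removed again with `(advice-remove 'pyim-candidates--xingma-words #'my-pyim-xingma-words-dict-order)`.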

sunlin7 · 2023-09-07T16:26:51Z

"@xuan-w I think people with special needs are better off advising pyim-candidates--xingma-words directly. That is more flexible than an option."

@tumashu Could the snippet above be added to the documentation? 🙏
It took me ages to find this issue 🥲

tumashu · 2023-09-08T00:40:24Z

@xuan-w @sunlin7 I've added a customization variable. Please give it a try.
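Going by the commit message recorded for this issue, the variable is `pyim-candidates-xingma-words-function`. A hedged sketch of using it for strict dictionary order follows. It assumes the variable holds a function that is called with a single CODE argument and returns a list of entries, like `pyim-candidates--xingma-words`; check the variable's docstring in your pyim version before relying on this. `my-pyim-xingma-words-by-dict` is a hypothetical name.

```elisp
(defun my-pyim-xingma-words-by-dict (code)
  "Return the common-dictionary entries for CODE, in dictionary-file order.
Hypothetical value for `pyim-candidates-xingma-words-function'."
  ;; Copy the list so callers cannot modify pyim's cached value.
  (copy-sequence (pyim-dcache-get code '(code2word))))

;; Assumption: the variable is called as (funcall FUNCTION CODE).
(setq pyim-candidates-xingma-words-function
      #'my-pyim-xingma-words-by-dict)
```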

tumashu added a commit that referenced this issue Jul 6, 2022

Change the sorting rule for shape-based entries, #449

f260cc9

lld2001 closed this as completed Jul 7, 2022

tumashu added a commit that referenced this issue Sep 8, 2023

Add pyim-candidates-xingma-words-function custom, see #449

81459f2

tumashu mentioned this issue Sep 11, 2023

Fixed word-frequency configuration #477

Closed
