
Full-width/half-width character conversion #185

Closed
jctian98 opened this issue Jan 24, 2024 · 3 comments

Comments
@jctian98

Hi, thanks for open-sourcing this :)

When using the toolkit, I would like to skip the full-width/half-width character conversion and keep Chinese punctuation, but setting the relevant parameter does not seem to have that effect:

>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(full_to_half=False)
>>> normalizer.normalize("你好。")
'你好.'

Am I using it incorrectly? Thanks!

@xingchensong
Member

Sorry Jinchuan, I missed the email notification and only just saw this.

Yes, a parameter like full_to_half cannot be set on the pip-installed Normalizer, because changing it requires recompiling the FST, which the pip package cannot do at the moment (I'll look into a fix later).

The current workaround is described in README section 1.2 (Advanced Usage): https://github.com/wenet-e2e/WeTextProcessing?tab=readme-ov-file#12-advanced-usage

git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
pip install -r requirements.txt
# `overwrite_cache` will rebuild all rules according to
#   your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
#   After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
python -m tn --text "你好。" --overwrite_cache --full_to_half false

The commands above recompile the FST. The new FST files can be found under PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn; then point the Python package at them via cache_dir:

# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("你好。")
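For reference, with the FST rebuilt via --overwrite_cache --full_to_half false, the call above should now keep the full-width period; the output shown below is inferred from the thread rather than copied from the screenshot:
'你好。'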

Screenshot of the successful run:
[image]

@jctian98
Author

Thanks!!!

@xingchensong
Member

Hi, the latest 1.0.0 release adds:

  1. English TN support (much leaner than NeMo: FST size 76M -> 7M, graph build time 777s -> 41s)
  2. On-the-fly graph construction, used as follows:
from tn.chinese.normalizer import Normalizer
normalizer = Normalizer(full_to_half=False, overwrite_cache=True)
print(normalizer.normalize("你好。"))
normalizer = Normalizer(full_to_half=True, overwrite_cache=True)
print(normalizer.normalize("你好。"))
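For reference, a sketch of the expected behaviour of the two calls above, inferred from the earlier discussion in this thread rather than copied from the screenshot:

# full_to_half=False: full-width punctuation is kept      -> 你好。
# full_to_half=True:  converted to half-width punctuation -> 你好.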


details: https://github.com/wenet-e2e/WeTextProcessing/releases/tag/1.0.0
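For the English TN mentioned in item 1, a minimal usage sketch; the module path tn.english.normalizer is taken from the project README, and the constructor arguments are assumed to mirror the Chinese Normalizer:

# Minimal sketch of English TN (module path and parameters assumed, see note above)
from tn.english.normalizer import Normalizer as EnNormalizer

en_normalizer = EnNormalizer(overwrite_cache=True)  # rebuild rules on the fly, as with the Chinese Normalizer
print(en_normalizer.normalize("The meeting is at 3:30pm on Jan. 24th, 2024"))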
