Some emoji are missed during parsing #7

Closed
raawaa opened this issue Dec 1, 2021 · 2 comments · Fixed by #8

Comments

@raawaa
Contributor

raawaa commented Dec 1, 2021

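While trying to find "man shrugging 🤷", I noticed that this emoji is never parsed out of the HTML file.
After reading the code, I found that emoji_all_parser.py only keeps emoji whose string length is 1.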

if len(emoji) == 1:

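In fact, many emoji are made up of several Unicode code points, so len() returns a value greater than 1 for them. As a result, a lot of emoji are never written to the JSON data.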
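To make the point concrete, here is a minimal sketch (not code from emoji_all_parser.py) showing what len() reports for a few emoji. The variable names and the code_points helper are illustrative assumptions; the code points are written as escapes so the contents of each string are unambiguous.

```python
# Python's len() counts Unicode code points, not what the user sees as one emoji.
person_shrugging = "\U0001F937"                 # single code point
man_shrugging = "\U0001F937\u200D\u2642\uFE0F"  # base + ZWJ + male sign + variation selector
flag_china = "\U0001F1E8\U0001F1F3"             # two regional indicator symbols (C, N)


def code_points(text: str) -> list[str]:
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, emoji in [
    ("person shrugging", person_shrugging),
    ("man shrugging", man_shrugging),
    ("flag: China", flag_china),
]:
    print(label, len(emoji), code_points(emoji))
```

A filter of the form `if len(emoji) == 1:` would keep only the first of these three and drop the other two.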

@yuhangch
Owner

yuhangch commented Dec 2, 2021

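Thanks for the feedback. At the time I didn't know that some emoji have len() > 1, and I was also fooled by the terminal, which printed such an emoji as two separate emoji, so I crudely filtered them all out. Looking at it now, that was wrong.
[Screenshot: 微信截图_20211202094707]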

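As far as I can tell, the len() > 1 cases are concentrated in the gender-related emoji and the country flags 🚩, so I added some special handling: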

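  • For example, man shrugging 🤷‍♂️ / woman shrugging 🤷‍♀️: typing njrfssjm>>🤷‍♂️ is a bit cumbersome, so I added 耸肩 (shrug) as a keyword, and now ssjm>>🤷‍♀️ works.
  • Flags (although Windows still cannot display them): 🇨🇳 ['旗: 中国', '中华', '华夏', '国旗', '旗']. The part before the ":" is now stripped as well.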
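The flag handling described above could look roughly like the following. This is only a sketch under assumptions: the function name strip_label_prefix is hypothetical, and the actual fix merged for this issue may be implemented differently.

```python
def strip_label_prefix(name: str) -> str:
    """Hypothetical helper: drop a category prefix such as '旗: ' from an emoji name.

    Handles both the ASCII colon and the full-width colon, and returns the
    name unchanged when there is no colon at all.
    """
    for separator in (":", "\uFF1A"):  # ASCII ':' and full-width colon
        if separator in name:
            # Keep only the text after the first separator.
            return name.split(separator, 1)[1].strip()
    return name


keywords = ["旗: 中国", "中华", "华夏", "国旗", "旗"]
cleaned = [strip_label_prefix(keyword) for keyword in keywords]
print(cleaned)
```

With this, a user can search by the country name alone instead of having to type the "旗" prefix first.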

@raawaa
Contributor Author

raawaa commented Dec 2, 2021

Thanks, it keeps getting better to use.
