a vector of Mandarin Chinese strings into pinyin #5

Closed
caimiao0714 opened this Issue Dec 16, 2018 · 2 comments

caimiao0714 commented Dec 16, 2018

The py() function does not seem to work when I pass it a vector of Chinese strings. For example:

> library('pinyin')
> mypy = pydic(method = 'toneless')
> py(c("我", "一定", "是个", "天才"),  dic = mypy)
[1] "wo"

Sometimes I have several columns in a data.frame that need to be converted to pinyin.
I wrote a small function that makes this work; it depends on purrr::map and the magrittr pipe (both loaded via tidyverse).

> testd = data.frame(stringsAsFactors=FALSE,
          x1 = c('我', '一定', '是个', '天才'),
          x2 = c('我', '确', '是个', '天才'))
> print(testd)
    x1   x2
1   我   我
2 一定   确
3 是个 是个
4 天才 天才
> require(tidyverse)
> conv_py = function(data, var_name){
+   for(i in var_name){
+     data[[i]] = map(data[[i]], function(x){py(x, dic = mypy)}) %>%
+       gsub("_", "", .) %>%
+       unlist()
+   }
+   return(data)
+ }

> conv_py(testd, c("x1", "x2"))
       x1      x2
1      wo      wo
2  yiding     que
3  shigan  shigan
4 tiancai tiancai

But there seems to be an obvious bug here: "是个" has been parsed as "shigan", which cannot be correct.

In summary:

  • Consider adding conv_py() or a similar function to your updated package. I find converting a vector of Chinese strings to be a very common need in data manipulation.
  • Fix the obvious "是个" → "shigan" bug, which is probably not your fault; I guess it comes from a problem in the dictionary.
pzhaonet commented Dec 17, 2018

Thanks for the feedback. For your comment 1: sapply() can do the vector work for you, or you can update to the newest version 1.1.5 of the 'pinyin' package, which can convert a string vector directly:
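For older versions of 'pinyin' (before 1.1.5), the sapply() route might look like this minimal sketch, assuming mypy has been built with pydic() as in the first comment:

```r
# Hypothetical sketch: apply py() element-wise over a character vector,
# for a py() that only handles a single string at a time.
words <- c("我", "一定", "是个", "天才")
sapply(words, function(w) py(w, dic = mypy))
```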

> devtools::install_github('pzhaonet/pinyin')
> require('pinyin')
> mypy = pydic(method = 'toneless', dic = 'pinyin2')
> py(c("我", "一定", "是个", "天才"), dic = mypy, sep = '')
       我      一定      是个      天才 
     "wo"  "yiding"   "shige" "tiancai" 
> testd = data.frame(stringsAsFactors = FALSE,
+                    x1 = c('我', '一定', '是个', '天才'),
+                    x2 = c('我', '确', '是个', '天才'))
> py(testd$x1, dic = mypy)
        我       一定       是个       天才 
      "wo"  "yi_ding"   "shi_ge" "tian_cai" 
> py(testd$x2, dic = mypy)
        我         确       是个       天才 
      "wo"      "que"   "shi_ge" "tian_cai"

For your comment 2: you can choose dic = 'pinyin2' as shown above. The problem was caused by the default dictionary 'pinyin', which is larger but includes some uncommon heteronym readings.
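To see the heteronym issue directly, one can build both dictionaries and compare them on the same string (a sketch; the exact readings depend on the installed dictionary data):

```r
dic1 <- pydic(method = 'toneless')                  # default 'pinyin' dictionary
dic2 <- pydic(method = 'toneless', dic = 'pinyin2') # smaller, common readings
py("是个", dic = dic1)  # may pick an uncommon reading such as "shi_gan"
py("是个", dic = dic2)  # the common reading "shi_ge"
```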

caimiao0714 commented Dec 21, 2018

Thanks Peng!
