rm_number does not remove numbers with comma decimal separator #26

Closed
markvanderloo opened this issue Nov 27, 2017 · 1 comment

@markvanderloo

qdapRegex::rm_number("hello 12,5 world")
[1] "hello 12,5 world"

According to the help file it should recognize this:

 ‘rm_number’ - Remove/replace/extract number from a string (works
     on numbers with commas, decimals and negatives).

Here's the sessionInfo

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qdapRegex_0.7.2

loaded via a namespace (and not attached):
[1] compiler_3.4.2  magrittr_1.5    tools_3.4.2     lubridate_1.6.0
[5] stringi_1.1.5   stringr_1.2.0  
> 
@trinker
Owner

trinker commented Nov 27, 2017

Hi Mark. Thanks for the issue. "Commas" in the help file refers to commas that separate digit groups (e.g., thousands, millions, billions), which is the U.S. convention. When I wrote qdapRegex I included a default U.S. dictionary, with room to grow by adding other locale-specific dictionaries through community support. In the README I have:

The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region specific regular expression dictionaries that contain the same fields as the regex_usa data set or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues

I see there are all sorts of ways decimal marks can be represented.
https://en.wikipedia.org/wiki/Decimal_mark

I would love it if you were willing to make a Netherlands-specific dictionary. We could blog/tweet about it and the community support, and hopefully get the ball rolling on other locale-specific dictionaries from the community. I'm guessing much of the Netherlands dictionary would be the same as the U.S. one I made (e.g., an IP address is universal), while other entries would require only minor tweaks.

For your problem, for example, we could take the current U.S. regex and swap the comma and period using the textclean package's swap function:

library(qdapRegex)
library(textclean)

## make netherlands pattern
textclean::swap(qdapRegex::grab('rm_number'), ',', '.')
## "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"


## make rm_number function for netherlands
rm_number2 <- rm_(pattern = textclean::swap(qdapRegex::grab('rm_number'), ',', '.'))


rm_number2("hello 12,5 world and another 1.234.567,89")
## [1] "hello world and another"
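
Note that the swapped pattern above leaves the thousands separator as an unescaped `.`, which matches any character. As a minimal sketch of the same idea in base R with the period escaped (the names `pattern_nl` and `rm_number_nl` are hypothetical and not part of qdapRegex; the whitespace cleanup is my own approximation of what `rm_` does, and `perl = TRUE` is assumed for the lookarounds):

```r
# Hypothetical helper, not part of qdapRegex: removes numbers written with
# a comma decimal mark and period thousands separators (e.g. 1.234.567,89).
# Same shape as the swapped pattern, but with the period escaped.
pattern_nl <- "(?<=^| )[-,]*\\d+(?:,\\d+)?(?= |,?$)|\\d+(?:\\.\\d{3})+(?:,\\d+)*"

rm_number_nl <- function(x) {
  # Drop every match, then collapse leftover runs of spaces and trim the ends.
  out <- gsub(pattern_nl, "", x, perl = TRUE)
  trimws(gsub(" +", " ", out))
}

rm_number_nl("hello 12,5 world and another 1.234.567,89")
## [1] "hello world and another"
```

With the period escaped, a string such as "1x234" is no longer treated as a grouped number.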

@trinker trinker closed this as completed Apr 27, 2022