rm_number does not remove numbers with comma decimal separator #26

markvanderloo · 2017-11-27T08:06:50Z

qdapRegex::rm_number("hello 12,5 world")
[1] "hello 12,5 world"

According to the help file it should recognize this:

 ‘rm_number’ - Remove/replace/extract number from a string (works
     on numbers with commas, decimals and negatives).

Here's the sessionInfo

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qdapRegex_0.7.2

loaded via a namespace (and not attached):
[1] compiler_3.4.2  magrittr_1.5    tools_3.4.2     lubridate_1.6.0
[5] stringi_1.1.5   stringr_1.2.0  
>

trinker · 2017-11-27T15:53:41Z

Hi Mark. Thanks for the issue. "Commas" in the help file refers to commas that separate digit groups (e.g., thousands, millions, billions), which is the U.S. convention. When I wrote qdapRegex I included a default U.S. dictionary, with room for growth through additional locale-specific dictionaries contributed by the community. In the README I have:

The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region specific regular expression dictionaries that contain the same fields as the regex_usa data set or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues

I see there are all sorts of ways decimal marks can be represented.
https://en.wikipedia.org/wiki/Decimal_mark
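
To make the U.S. convention concrete, here is a minimal base R sketch. The pattern `us_number` is an assumption written by hand for illustration (a period as decimal mark, commas between three-digit groups); the entry in the package's own dictionary may differ. `strip_numbers` is a hypothetical helper, not a qdapRegex function.

```r
# Assumed U.S.-style number pattern, hand-written for illustration only;
# the actual dictionary entry in qdapRegex may differ.
us_number <- "(?<=^| )[-.]*\\d+(?:\\.\\d+)?(?= |\\.?$)|\\d+(?:,\\d{3})+(?:\\.\\d+)*"

# Hypothetical helper: remove matches, then collapse and trim leftover
# whitespace so the results are easy to compare.
strip_numbers <- function(x, pattern) {
  trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE)))
}

strip_numbers("total 1,234,567.89 units", us_number)  # grouping commas: number removed
strip_numbers("hello 12.5 world", us_number)          # period decimal: number removed
strip_numbers("hello 12,5 world", us_number)          # comma decimal: left untouched
```

Under this assumed pattern, "12,5" is not recognized because a comma is only accepted when exactly three digits follow it, which is the behavior reported in the issue.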

I would love it if you were willing to make a Netherlands-specific dictionary. I/we could blog/tweet about it and the community support, and hopefully get the ball rolling on other locale-specific dictionaries from the community. I'm guessing a lot of the Netherlands dictionary would be the same as the U.S. one I made (e.g., an IP address is a universal thing), while other entries would require only minor tweaks.

So for example with your problem we could use the current regex for U.S. and just swap out the comma and period using the textclean package's swap function:

library(qdapRegex)
library(textclean)

## make netherlands pattern
textclean::swap(qdapRegex::grab('rm_number'), ',', '.')
## "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"


## make rm_number function for netherlands
rm_number2 <- rm_(pattern = textclean::swap(qdapRegex::grab('rm_number'), ',', '.'))


rm_number2("hello 12,5 world and another 1.234.567,89")
## [1] "hello world and another"
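
One caveat with the swapped pattern printed above: the grouping separator ends up as an unescaped `.`, which in a regular expression matches any character, not only a period. The base R sketch below compares that pattern with a hand-edited variant in which the period is escaped. The `escaped` pattern and the `strip_numbers` helper are assumptions made for illustration; neither is part of qdapRegex or textclean.

```r
# Pattern exactly as printed in the thread (unescaped "." before \\d{3}).
swapped <- "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"

# Hand-edited variant with the grouping period escaped (an assumption,
# not the package's dictionary entry).
escaped <- "(?<=^| )[-,]*\\d+(?:,\\d+)?(?= |,?$)|\\d+(?:\\.\\d{3})+(?:,\\d+)*"

# Hypothetical helper: remove matches, then collapse and trim whitespace.
strip_numbers <- function(x, pattern) {
  trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE)))
}

x <- "hello 12,5 world and another 1.234.567,89"
strip_numbers(x, swapped)  # both numbers removed
strip_numbers(x, escaped)  # both numbers removed

y <- "part 7x123 shipped"
strip_numbers(y, swapped)  # "7x123" is removed too, because "." matches "x"
strip_numbers(y, escaped)  # "7x123" is left intact
```

Both patterns handle the example from this issue the same way; they only differ on tokens such as "7x123", where a digit is followed by an arbitrary character and three more digits.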

trinker closed this as completed Apr 27, 2022
