New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"box" style - error 'vec' must be sorted non-decreasingly and not contain NAs” #41
Comments
|
Your reprex doesn't set the seed. |
|
Indeed. Sorry. Better this way? set.seed(101)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
# one of 1,906,884 possible partitions of this variable into 6 classes
# Error in findInterval(clI$var, clI$brks, all.inside = TRUE) :
# 'vec' must be sorted non-decreasingly and not contain NAs
set.seed(102)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
# one of 1,906,884 possible partitions of this variable into 6 classes
# Error in findInterval(clI$var, clI$brks, all.inside = TRUE) :
# 'vec' must be sorted non-decreasingly and not contain NAs
set.seed(103)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
# one of 1,906,884 possible partitions of this variable into 6 classes
# [-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
# 1 12 12
# [0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
# 30 12 12 1 |
|
Seems like b=classInt::classIntervals(x,style="box")
findInterval(b$var, b$brks, all.inside=TRUE)
# Error in findInterval(b$var, b$brks, all.inside = TRUE) :
# 'vec' must be sorted non-decreasingly and not contain NAs
findInterval(b$var, sort(b$brks), all.inside=TRUE)
# [1] 3 5 3 4 4 5 5 3 5 3 4 2 5 2 3 3 2 4 2 2 3 5 3 2 5 2 4 3 4 4 5 4 5 2 5 2 4 5
# [39] 2 4 4 5 2 3 2 4 5 3 3 2 |
|
See #38; @dieghernan , could you please look at the source (https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map)? The brief comment on fence handing seems to be about https://spatialanalysis.github.io/lab_tutorials/2_R_EDA_1.html#box-plot, using the The cause is that in these cases the iqr is quite restricted (I think), so the output breaks are: where the first two are reversed. If I increase the optional argument so I think that the method as written requires intervention from the user. This maybe could be trapped and |
|
Hi @rsbivand I am on it. In fact this is caused by the Lines 371 to 374 in 830abc9
On the example with set.seed(101)
x<-rnorm(50)
br <- classInt::classIntervals(x,style="box")
br$brks
#> [1] -2.61844134 -3.00000000 -0.71980148 -0.02713441 0.54595842 2.00000000
#> [7] 2.44459827
# Recreate logic of the function
var <- x
# Mock dots
dots <- list(empyt = NA)
# The function as is
iqr_mult <- ifelse(is.null(dots$iqr_mult), 1.5, dots$iqr_mult)
qtype <- ifelse(is.null(dots$type), 7, dots$type)
qv <- unname(quantile(var, type=qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
# initialize break points vector
bb <- vector(mode="numeric",length=7)
# logic for lower and upper fences
if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
bb[2] <- floor(qv[1])
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
bb[7] <- upfence
bb[6] <- ceiling(qv[5])
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
brks <- bb
# Bad breaks
brks[1:2]
#> [1] -2.618441 -3.000000
# Since
lofence
#> [1] -2.618441
# And obviously
qv[1]
#> [1] -2.319327
# But the error is
floor(qv[1])
#> [1] -3Created on 2023-02-17 with reprex v2.0.2 This is commented on the original source of the method: https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map but I have to think about that. Maybe @angela-li can help as well (see #18) |
|
The ggplot2 part with this data shows: |
|
I most probably miss the wider picture, but not sure why floor(qv[1]) rather than qv[1] (and similarly ceiling) are needed at all in: logic for lower and upper fencesif (lofence < qv[1]) { # no lower outliers } else { In the case of seed (101), if the floor is removed, i.e;: then we would get
|
|
Will In the documentation, it is suggested that the R code should perform like Geoda. Is this where Geoda defines these quantities: https://github.com/GeoDaCenter/geoda/blob/5109e6bbe0cd41d08fd15c4c2a14c6cf4ac601bb/Explore/CatClassification.cpp#L216 ? |
|
I have not access to a computer now @rsbivand but I think that I have an bug fix in mind, but I'll take a couple of days until I can test it. Glad if anyone wants to propose their own alternative on this |
|
I suppose the reason for flooring/ceiling is related to the left/right use or the breaks. Rather then ceiling and flooring, one way would be to add/deduce a small value, typically a value just below the minimun difference between any two consecutive values in the original variable. That requires a sorting though (which may be inefficient with long vectors?). An implementation could be around this, where one divides that min difference by 10 (cutting_dec) and adds/deduces it from the respective quantile: mybox<-function(var,iqr_mult=1.5,qtype=7,cutting_dec=10){ bb <- vector(mode="numeric",length=qtype) if (lofence < qv[1]){ # no lower outliers bb[3:5] <- qv[2:4] Results are then as follows:
|
|
Thanks @geocaruso I came to the same approach that you mentioned. The I was thinking on adding/substracting |
|
DELETED, See below |
|
Sorry, just a question on this since I think @geocaruso do you think my approach can yield to more accurate results with a smaller multiplier? Hopefully tomorrow I can have a closer look to all this issue |
|
I had some time, @geocaruso I modified your version of my proposed approach by reducing the multiplier (not need to be 10%, can be as small as desired) and correcting the signs on the lower fence implementation, they were reversed: mybox <- function(var, iqr_mult = 1.5, qtype = 7, cutting_dec = 10) {
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
bb <- vector(mode = "numeric", length = qtype)
if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
# bb[2] <- floor(qv[1]) #former
bb[2] <- qv[1] - min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
bb[7] <- upfence
# bb[6] <- ceiling(qv[5]) #former
bb[6] <- (qv[5]) + min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
brks <- bb
return(brks)
}
mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
bb <- vector(mode = "numeric", length = qtype)
if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
# bb[2] <- floor(qv[1]) #former
bb[2] <- qv[1] - 0.00001 * (qv[1] - lofence) # ALTERNATIVE 2
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
bb[7] <- upfence
# bb[6] <- ceiling(qv[5]) #former
bb[6] <- (qv[5]) + 0.00001 * (upfence - qv[5]) # ALTERNATIVE 2
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
brks <- bb
return(brks)
}
Testing examples: set.seed(101) # lower end problem
x <- rnorm(50)
quantile(x)
#> 0% 25% 50% 75% 100%
#> -2.31932737 -0.71980148 -0.02713441 0.54595842 1.42775554
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.618441,-2.319345) [-2.319345,-0.7198015) [-0.7198015,-0.02713441)
#> 0 13 12
#> [-0.02713441,0.5459584) [0.5459584,1.427773) [1.427773,2.444598]
#> 12 13 0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.618441,-2.31933) [-2.31933,-0.7198015) [-0.7198015,-0.02713441)
#> 0 13 12
#> [-0.02713441,0.5459584) [0.5459584,1.427766) [1.427766,2.444598]
#> 12 13 0
# I see here same results in terms of classification (i.e. no missclasification), mybox_dhh provides closer values to the min max
# due to to the 0.001 factor.
library(ggplot2)
ggplot() +
geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
ggtitle("Seed101")set.seed(102) # upper end problem
x <- rnorm(50)
quantile(x)
#> 0% 25% 50% 75% 100%
#> -1.6783112 -0.3608505 0.1443969 1.1722807 3.1143334
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.660547,-1.67847) [-1.67847,-0.3608505) [-0.3608505,0.1443969)
#> 0 13 12
#> [0.1443969,1.172281) [1.172281,3.114493) [3.114493,3.471977]
#> 12 13 0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.660547,-1.678321) [-1.678321,-0.3608505) [-0.3608505,0.1443969)
#> 0 13 12
#> [0.1443969,1.172281) [1.172281,3.114337) [3.114337,3.471977]
#> 12 13 0
# Same here, closer values to min max although I think that is irrelevante as long as it does not misclasify outliers
ggboxplot <- ggplot() +
geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
ggtitle("Seed102")
set.seed(103) # no problem
x <- rnorm(50)
quantile(x)
#> 0% 25% 50% 75% 100%
#> -2.14506499 -0.49890289 0.09865646 0.56598795 2.57973063
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
#> 1 12 12
#> [0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
#> 12 12 1
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
#> 1 12 12
#> [0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
#> 12 12 1
ggplot() +
geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
theme_bw() +
ggtitle("Seed103")I see some potential improvements on performance on # Performance
set.seed(104) # no problem
x <- rnorm(1000000)
mybox_t <- Sys.time()
mybox(x)
#> [1] -4.9904983832 -2.6980446395 -0.6746739374 0.0003594936 0.6742398640
#> [6] 2.6976105660 5.0789629606
Sys.time() - mybox_t
#> Time difference of 0.03704286 secs
mybox_dhh_t <- Sys.time()
mybox_dhh(x)
#> [1] -4.9904983832 -2.6980446395 -0.6746739374 0.0003594936 0.6742398640
#> [6] 2.6976105660 5.0789629606
Sys.time() - mybox_dhh_t
#> Time difference of 0.03578687 secsCreated on 2023-02-20 with reprex v2.0.2 |
|
I will try to summarise @rsbivand, @geocaruso @edzer :
# For cutting_dec = 10
+/- min(diff(sort(var))) / 10Or # For lower fence, I added abs to force correct direction of the offset
- 0.00001 * abs(lofence - qv[1] )
# For upper fence
+ 0.00001 * abs(upfence - qv[5])
outliers <- data$y < (stats[2] - coef * iqr) | data$y > (stats[4] + coef * iqr)But what we need is to came with the exact value (i.e. ggplot2 approach is equivalent to not apply Happy to hear your remarks/feedback on this |
|
A full reprex with both approaches and a comparison in terms of performance with # Define functions
mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
bb <- vector(mode = "numeric", length = qtype)
if (lofence < qv[1]) {
bb[1] <- lofence
# bb[2] <- floor(qv[1]) #former
bb[2] <- qv[1] - 0.00000001 * abs(lofence - qv[1])
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) {
bb[7] <- upfence
bb[6] <- (qv[5]) + 0.00000001 * abs(upfence - qv[5])
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
brks <- bb
brks
}
mybox <- function(var, iqr_mult = 1.5, qtype = 7, cutting_dec = 10) {
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
bb <- vector(mode = "numeric", length = qtype)
if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
# bb[2] <- floor(qv[1]) #former
bb[2] <- qv[1] - min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
bb[7] <- upfence
# bb[6] <- ceiling(qv[5]) #former
bb[6] <- (qv[5]) + min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
brks <- bb
brks
}
# Test results
set.seed(101) # lower end problem
x <- rnorm(50)
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.618441,-2.319328) [-2.319328,-0.7198015) [-0.7198015,-0.02713441)
#> 0 13 12
#> [-0.02713441,0.5459584) [0.5459584,1.427756) [1.427756,2.444598]
#> 12 13 0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.618441,-2.319327) [-2.319327,-0.7198015) [-0.7198015,-0.02713441)
#> 0 13 12
#> [-0.02713441,0.5459584) [0.5459584,1.427756) [1.427756,2.444598]
#> 12 13 0
set.seed(102) # upper end problem
x <- rnorm(50)
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.660547,-1.678313) [-1.678313,-0.3608505) [-0.3608505,0.1443969)
#> 0 13 12
#> [0.1443969,1.172281) [1.172281,3.114335) [3.114335,3.471977]
#> 12 13 0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.660547,-1.678311) [-1.678311,-0.3608505) [-0.3608505,0.1443969)
#> 0 13 12
#> [0.1443969,1.172281) [1.172281,3.114333) [3.114333,3.471977]
#> 12 13 0
set.seed(103) # no problem
x <- rnorm(50)
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
#> 1 12 12
#> [0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
#> 12 12 1
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#> one of 1,906,884 possible partitions of this variable into 6 classes
#> [-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
#> 1 12 12
#> [0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
#> 12 12 1
Performance# test micro benchmark
microb <- function(sample) {
microbenchmark::microbenchmark(
"mybox" = {
b <- mybox(sample)
},
"mybox_dhh" = {
b <- mybox_dhh(sample)
},
times = 5
)
}
# Init samples
set.seed(2389)
# Pareto distributions a=7 b=14
paretodist <- 7 / (1 - runif(5000000))^(1 / 14)
# Exponential dist
expdist <- rexp(5000000)
# Lognorm
lognormdist <- rlnorm(5000000)
# Weibull
weibulldist <- rweibull(5000000, 1, scale = 5)
# LogCauchy "super-heavy tail"
logcauchdist <- exp(rcauchy(5000000, 2, 4))
# Remove Inf
logcauchdist <- logcauchdist[logcauchdist < Inf]
# Normal dist
normdist <- rnorm(5000000)
# Left-tailed distr
leftnorm <-
sample(rep(normdist[normdist < mean(normdist)], 3), size = 5000000)
# Uniform distribution
unifdist <- runif(5000000)
microb(paretodist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 872.3832 879.3783 892.6899 884.0298 912.1657 915.4926 5 b
#> mybox_dhh 514.8215 523.3301 535.8768 524.2920 528.1830 588.7575 5 a
microb(expdist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 867.5758 893.5908 911.1049 897.3768 946.1543 950.8266 5 b
#> mybox_dhh 474.9472 506.7352 510.4360 523.3524 523.3616 523.7836 5 a
microb(lognormdist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 882.1914 892.7457 917.6065 925.7256 926.1274 961.2424 5 b
#> mybox_dhh 529.3752 530.4412 540.1720 535.0765 540.9861 564.9811 5 a
microb(weibulldist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 882.7893 899.4784 1108.1453 1038.8632 1248.6510 1470.9446 5 b
#> mybox_dhh 486.6262 600.9610 601.3876 625.4297 644.4559 649.4653 5 a
microb(logcauchdist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 790.1666 806.3588 847.7212 847.1650 896.2554 898.6602 5 b
#> mybox_dhh 432.7527 447.8909 468.9992 472.5833 473.5591 518.2099 5 a
microb(normdist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 226.2988 232.4805 242.8239 234.4963 252.7251 268.1189 5 a
#> mybox_dhh 470.9276 487.3639 492.1637 494.6327 503.5849 504.3094 5 b
microb(leftnorm)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> mybox 925.3866 931.4877 1034.676 969.6359 1167.3629 1179.507 5 b
#> mybox_dhh 488.9530 515.3051 552.802 531.9555 560.9445 666.852 5 a
microb(unifdist)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> mybox 1451.1785 1533.1986 1547.2717 1536.2723 1578.4240 1637.2853 5
#> mybox_dhh 439.7543 443.3136 467.8262 445.8411 491.6797 518.5423 5
#> cld
#> b
#> aCreated on 2023-02-21 with reprex v2.0.2 |
See the notes https://github.com/pysal/mapclassify/blob/34341a22db968b42bcc21b52d4bcf32a0a3e7098/mapclassify/classifiers.py#L1354, and the book section https://geographicdata.science/book/notebooks/05_choropleth.html#box-plot. Could this be a better variant? |
|
Hi @rsbivand I think this would be a breaking change. As of now (and also stated in the documentation)
As per the NOTE.
So the output can be either 5 or 6 classes. Also, I still see the issue (not tested yet) in case |
|
Hi @rsbivand and @dieghernan,
|
|
|
set.seed(101)
x<-rnorm(50)
boxplot(x, plot = FALSE)
#> $stats
#> [,1]
#> [1,] -2.31932737
#> [2,] -0.72437422
#> [3,] -0.02713441
#> [4,] 0.55246186
#> [5,] 1.42775554
#>
#> $n
#> [1] 50
#>
#> $conf
#> [,1]
#> [1,] -0.3124380
#> [2,] 0.2581692
#>
#> $out
#> numeric(0)
#>
#> $group
#> numeric(0)
#>
#> $names
#> [1] "1"
# See boxplot stats
boxplot.stats(x)
#> $stats
#> [1] -2.31932737 -0.72437422 -0.02713441 0.55246186 1.42775554
#>
#> $n
#> [1] 50
#>
#> $conf
#> [1] -0.3124380 0.2581692
#>
#> $out
#> numeric(0)
# Source
boxplot.stats
#> function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE)
#> {
#> if (coef < 0)
#> stop("'coef' must not be negative")
#> nna <- !is.na(x)
#> n <- sum(nna)
#> stats <- stats::fivenum(x, na.rm = TRUE)
#> iqr <- diff(stats[c(2, 4)])
#> if (coef == 0)
#> do.out <- FALSE
#> else {
#> out <- if (!is.na(iqr)) {
#> x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef *
#> iqr)
#> }
#> else !is.finite(x)
#> if (any(out[nna], na.rm = TRUE))
#> stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
#> }
#> conf <- if (do.conf)
#> stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
#> list(stats = stats, n = n, conf = conf, out = if (do.out) x[out &
#> nna] else numeric())
#> }
#> <bytecode: 0x0000020c8edb5080>
#> <environment: namespace:grDevices>
# Source of boxplot
boxplot.default
#> function (x, ..., range = 1.5, width = NULL, varwidth = FALSE,
#> notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"),
#> col = "lightgray", log = "", pars = list(boxwex = 0.8, staplewex = 0.5,
#> outwex = 0.5), ann = !add, horizontal = FALSE, add = FALSE,
#> at = NULL)
#> {
#> args <- list(x, ...)
#> namedargs <- if (!is.null(attributes(args)$names))
#> attributes(args)$names != ""
#> else rep_len(FALSE, length(args))
#> groups <- if (is.list(x))
#> x
#> else args[!namedargs]
#> if (0L == (n <- length(groups)))
#> stop("invalid first argument")
#> if (length(class(groups)))
#> groups <- unclass(groups)
#> if (!missing(names))
#> attr(groups, "names") <- names
#> else {
#> if (is.null(attr(groups, "names")))
#> attr(groups, "names") <- 1L:n
#> names <- attr(groups, "names")
#> }
#> cls <- lapply(groups, class)
#> cl <- NULL
#> if (all(vapply(groups, function(e) {
#> is.numeric(unclass(e)) && identical(names(attributes(e)),
#> "class")
#> }, NA)) && (length(unique(cls)) == 1L))
#> cl <- cls[[1L]]
#> for (i in 1L:n) groups[i] <- list(boxplot.stats(unclass(groups[[i]]),
#> range))
#> stats <- matrix(0, nrow = 5L, ncol = n)
#> conf <- matrix(0, nrow = 2L, ncol = n)
#> ng <- out <- group <- numeric(0L)
#> ct <- 1
#> for (i in groups) {
#> stats[, ct] <- i$stats
#> conf[, ct] <- i$conf
#> ng <- c(ng, i$n)
#> if ((lo <- length(i$out))) {
#> out <- c(out, i$out)
#> group <- c(group, rep.int(ct, lo))
#> }
#> ct <- ct + 1
#> }
#> if (length(cl) == 1L && cl != "numeric")
#> oldClass(stats) <- oldClass(conf) <- oldClass(out) <- cl
#> z <- list(stats = stats, n = ng, conf = conf, out = out,
#> group = group, names = names)
#> if (plot) {
#> if (is.null(pars$boxfill) && is.null(args$boxfill))
#> pars$boxfill <- col
#> do.call(bxp, c(list(z, notch = notch, width = width,
#> varwidth = varwidth, log = log, border = border,
#> pars = pars, outline = outline, horizontal = horizontal,
#> add = add, ann = ann, at = at), args[namedargs]),
#> quote = TRUE)
#> invisible(z)
#> }
#> else z
#> }
#> <bytecode: 0x0000020c8ed33138>
#> <environment: namespace:graphics>For debugging run: # If you need to debug
debug(boxplot.default)
boxplot(x)
Created on 2023-02-21 with reprex v2.0.2 |
|
Don't really know how to proceed. set.seed(101)
x<-rnorm(50) * 1.3
classInt::classIntervals(x, style = "pretty", n=20)
#> style: pretty
#> one of 6.32053e+13 possible partitions of this variable into 26 classes
#> [-3.2,-3) [-3,-2.8) [-2.8,-2.6) [-2.6,-2.4) [-2.4,-2.2) [-2.2,-2)
#> 1 0 2 0 0 1
#> [-2,-1.8) [-1.8,-1.6) [-1.6,-1.4) [-1.4,-1.2) [-1.2,-1) [-1,-0.8)
#> 4 0 1 0 3 3
#> [-0.8,-0.6) [-0.6,-0.4) [-0.4,-0.2) [-0.2,0) [0,0.2) [0.2,0.4)
#> 0 2 6 2 1 3
#> [0.4,0.6) [0.6,0.8) [0.8,1) [1,1.2) [1.2,1.4) [1.4,1.6)
#> 3 7 4 3 1 2
#> [1.6,1.8) [1.8,2]
#> 0 1Created on 2023-02-21 with reprex v2.0.2 The specificity of
Without these two features, So I would propose either fix it keeping the original features or deprecate it. If |
|
Thanks for the hint on how to read the boxplot function! (Learning everyday) Since it uses stats::fivenum (Tukey's hinges) it is not exactly quartiles then (which adds a further complexity with differences or not for odd or even number of records, should we use this computation). |
|
Additional: As an example in Geoda here: https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map, fig 27 shows a case where there are outliers and the bottom limit of that class is min(x) (0 in that case), while fig 28 shows an example where there are no lower outlier, then the category is indeed present with a -Inf to the bottom (not the calculated fence as in the py classifier). I would think this the way to go. |
|
Thanks @geocaruso (@rsbivand also mentioned Geoda but I missed that). With a couple of modification we can get the same results (Adding I agree this is the way to go, since is not breaking and consistent with the current implementation: url <- "https://geodacenter.github.io/data-and-lab/data/nyc.zip"
tmpzip <- tempfile(fileext = ".zip")
download.file(
url = url,
tmpzip
)
unzip(tmpzip, junkpaths = FALSE, exdir = tempdir())
shpfile <- list.files(tempdir(), pattern = "gpkg", full.names = TRUE)
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
library(ggplot2)
opar <- par(no.readonly = TRUE)
options(scipen = 3)
ny <- read_sf(shpfile)
v <- as.vector(ny$RENT2008)
mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
qv <- unname(quantile(var, type = qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr
bb <- vector(mode = "numeric", length = qtype)
if (lofence < qv[1]) {
# bb[1] <- lofence #former
# bb[2] <- floor(qv[1]) #former
bb[1] <- -Inf
bb[2] <- lofence
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) {
# former
# bb[7] <- upfence
# bb[6] <- ceiling(qv[5])
bb[7] <- Inf
bb[6] <- upfence
} else {
bb[6] <- upfence
# bb[7] <- qv[5] # former
bb[7] <- Inf
}
bb[3:5] <- qv[2:4]
brks <- bb
brks
}
br <- classInt::classIntervals(v, style = "fixed", fixedBreaks = mybox_dhh(v))
br
#> style: fixed
#> one of 20,349 possible partitions of this variable into 6 classes
#> [0,456.25) [456.25,1000) [1000,1100) [1100,1362.5)
#> 3 4 15 19
#> [1362.5,1906.25) [1906.25,Inf]
#> 8 6
# Fig 26 https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map
#
# Lower outlier (3) [0:456.250]
# < 25% (4) [456.250:1000]
# 25% - 50% (15) [1000:1100]
# 50% - 75% (19) [1100:1362.500]
# > 75% (8) [1362.500:1906.250]
# Upper outlier (6) [1906.250:Inf]
# Map
ny$cats <- classInt::classify_intervals(v, style = "fixed", fixedBreaks = mybox_dhh(v))
ggplot(ny) +
geom_sf(aes(fill = cats)) +
scale_fill_manual(values = hcl.colors(6, "RdBu", rev = TRUE), drop = FALSE) +
labs(fill = "Hinge= 1.5")# With a higher hinge
br2 <- classInt::classIntervals(v, style = "fixed", fixedBreaks = mybox_dhh(v, iqr_mult = 3))
br2
#> style: fixed
#> one of 20,349 possible partitions of this variable into 6 classes
#> [-Inf,-87.5) [-87.5,1000) [1000,1100) [1100,1362.5) [1362.5,2450)
#> 0 7 15 19 9
#> [2450,Inf]
#> 5
# Fig 27 https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map
#
# Lower outlier (0) [0:-87.5]
# < 25% (7) [-87.5:1000]
# 25% - 50% (15) [1000:1100]
# 50% - 75% (19) [1100:1362.500]
# > 75% (9) [1362.500:2450]
# Upper outlier (5) [2450:Inf]
ny$cats2 <- classInt::classify_intervals(v, style = "fixed", fixedBreaks = mybox_dhh(v, iqr_mult = 3))
ggplot(ny) +
geom_sf(aes(fill = cats2)) +
scale_fill_manual(values = hcl.colors(6, "RdBu", rev = TRUE), drop = FALSE) +
labs(fill = "Hinge= 3")Created on 2023-02-22 with reprex v2.0.2 |
|
Thanks @dieghernan. Two advantages: There is no longer any worry about adding substracting a sufficiently small number and it is directly consistent with Geoda. For purists maybe, should one add an (information) note that in absence of outliers, the upper (resp. lower) limit of the lower (resp. upper) outlier class is not the whisker tip of a standard boxplot (i.e. min(x) (resp. max(x)) but q0.25-IQRhinge (resp q0.75+IQRhinge), i.e. the limit for deciding whether a value is an outlier? |
|
|
|
We are actually now some way from a boxplot definition, where the tips of the whiskers should be pegged to specific observed values. Of course, Tukey was considering small data sets. I'll create a branch with the @geocaruso - are you considering attending ECTQG 2023 in Braga? Might a special session on map classification/visualization be sensible? |
|
Very good to me. And yes I'll attend ECTQG in Braga and a special session on map classification, especially with open tools and comparisons across would be fun. Let's suggest it. The call should have been sent today I believe |









Hello,
I just discovered the box style and I find it is a great idea for discretization (before mapping in my case), but I am puzzled by the small tests I made with it, which sometimes return "Error in findInterval(clI$var, clI$brks, all.inside = TRUE) : 'vec' must be sorted non-decreasingly and not contain NAs”
See below.
I suppose it is because I apply the function on x which is normal by construction and so there are groups with missing values (no outliers). But I understood that if there are no outliers, then the quartiles should be reported (4 or 5 groups instead of 6) In that sense, a boxplot I guess would always be reported even without outliers.
As you will see below it worked fine with the first two seeds of x but then not with the 3rd. The 3rd one then works if I add an outlier (see x2). The 2nd case though also has 2 empty groups but there is no error there.
I just wonder if throwing the error is the expected way out. When I will use it in batch (or loops) I can always get around with a “Try” catch but maybe the error is not a behavior you wanted either?
Thanks for the help (and the package!) and sorry if this is reporting something far too obvious!
Geoffrey
The text was updated successfully, but these errors were encountered: