Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"box" style - error 'vec' must be sorted non-decreasingly and not contain NAs” #41

Closed
geocaruso opened this issue Feb 17, 2023 · 31 comments

Comments

@geocaruso
Copy link

Hello,
I just discovered the box style and I find it is a great idea for discretization (before mapping in my case), but I am puzzled by the small tests I made with it, which sometimes return "Error in findInterval(clI$var, clI$brks, all.inside = TRUE) : 'vec' must be sorted non-decreasingly and not contain NAs”

See below.

I suppose it is because I apply the function on x which is normal by construction and so there are groups with missing values (no outliers). But I understood that if there are no outliers, then the quartiles should be reported (4 or 5 groups instead of 6) In that sense, a boxplot I guess would always be reported even without outliers.

As you will see below it worked fine with the first two seeds of x but then not with the 3rd. The 3rd one then works if I add an outlier (see x2). The 2nd case though also has 2 empty groups but there is no error there.

I just wonder if throwing the error is the expected way out. When I will use it in batch (or loops) I can always get around with a “Try” catch but maybe the error is not a behavior you wanted either?

Thanks for the help (and the package!) and sorry if this is reporting something far too obvious!
Geoffrey

packageVersion("classInt")
[1] ‘0.4.8’

x<-rnorm(50)
classInt::classIntervals(x,style="box")
style: box
one of 1,906,884 possible partitions of this variable into 6 classes
[-3.147138,-3) [-3,-0.9019681) [-0.9019681,-0.2114992)
0 13 12
[-0.2114992,0.594812) [0.594812,2.839982) [2.839982,3.407277]
12 12 1

x<-rnorm(50)
classInt::classIntervals(x,style="box")
style: box
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.201223,-2) [-2,-0.5390956) [-0.5390956,0.09125387)
0 13 12
[0.09125387,0.5689893) [0.5689893,2) [2,2.231117]
12 13 0

x<-rnorm(50)
classInt::classIntervals(x,style="box")
style: box
one of 1,906,884 possible partitions of this variable into 6 classes
Error in findInterval(clI$var, clI$brks, all.inside = TRUE) :
'vec' must be sorted non-decreasingly and not contain NAs

x2<-c(x,99)
classInt::classIntervals(x2,style="box")
style: box
one of 2,118,760 possible partitions of this variable into 6 classes
[-3.077989,-3) [-3,-0.8386009) [-0.8386009,-0.3035683)
0 13 12
[-0.3035683,0.6543248) [0.6543248,2.893713) [2.893713,99]
13 12 1

@edzer
Copy link
Member

edzer commented Feb 17, 2023

Your reprex doesn't set the seed.

@geocaruso
Copy link
Author

geocaruso commented Feb 17, 2023

Indeed. Sorry. Better this way?

set.seed(101)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
#  one of 1,906,884 possible partitions of this variable into 6 classes
# Error in findInterval(clI$var, clI$brks, all.inside = TRUE) : 
#  'vec' must be sorted non-decreasingly and not contain NAs
set.seed(102)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
#   one of 1,906,884 possible partitions of this variable into 6 classes
# Error in findInterval(clI$var, clI$brks, all.inside = TRUE) : 
#   'vec' must be sorted non-decreasingly and not contain NAs
set.seed(103)
x<-rnorm(50)
classInt::classIntervals(x,style="box")
# style: box
#  one of 1,906,884 possible partitions of this variable into 6 classes
#  [-2.145065,-2.096239)  [-2.096239,-0.4989029) [-0.4989029,0.09865646) 
#                      1                      12                      12 
#   [0.09865646,0.565988)     [0.565988,2.163324)     [2.163324,2.579731] 
#                   30                     12                      12                       1 

@edzer
Copy link
Member

edzer commented Feb 17, 2023

Seems like brks is unsorted in the call to findIntervals:

b=classInt::classIntervals(x,style="box")
findInterval(b$var, b$brks, all.inside=TRUE)
# Error in findInterval(b$var, b$brks, all.inside = TRUE) : 
#   'vec' must be sorted non-decreasingly and not contain NAs
findInterval(b$var, sort(b$brks), all.inside=TRUE)
#  [1] 3 5 3 4 4 5 5 3 5 3 4 2 5 2 3 3 2 4 2 2 3 5 3 2 5 2 4 3 4 4 5 4 5 2 5 2 4 5
# [39] 2 4 4 5 2 3 2 4 5 3 3 2

@rsbivand
Copy link
Member

rsbivand commented Feb 17, 2023

See #38; @dieghernan , could you please look at the source (https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map)? The brief comment on fence handing seems to be about https://spatialanalysis.github.io/lab_tutorials/2_R_EDA_1.html#box-plot, using the layer_data() of a ggplotobject. In?ggplot2::geom_boxplot`, "coef: Length of the whiskers as multiple of IQR. Defaults to 1.5"

The cause is that in these cases the iqr is quite restricted (I think), so the output breaks are:

Browse[2]> qv
[1] -2.31932737 -0.71980148 -0.02713441  0.54595842  1.42775554
Browse[2]> bb
[1] -2.61844134 -3.00000000 -0.71980148 -0.02713441  0.54595842  2.00000000
[7]  2.44459827

where the first two are reversed. If I increase the optional argument iqr_mult = 2 from default 1.5, I get:

style: box
  one of 1,906,884 possible partitions of this variable into 6 classes
          [-3.251321,-3)          [-3,-0.7198015) [-0.7198015,-0.02713441) 
                       0                       13                       12 
 [-0.02713441,0.5459584)            [0.5459584,2)             [2,3.077478] 
                      12                       13                        0 

so I think that the method as written requires intervention from the user. This maybe could be trapped and iqr_mult increased (with a warning?) if bb is not equal to sort(bb).

@dieghernan
Copy link
Contributor

Hi @rsbivand

I am on it. In fact this is caused by the floor() on here:

classInt/R/classInt.R

Lines 371 to 374 in 830abc9

if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
bb[2] <- floor(qv[1])
} else {

On the example with set.seed(101) it causes:

set.seed(101)
x<-rnorm(50)


br <- classInt::classIntervals(x,style="box")
br$brks
#> [1] -2.61844134 -3.00000000 -0.71980148 -0.02713441  0.54595842  2.00000000
#> [7]  2.44459827

# Recreate logic of the function
var <- x

# Mock dots
dots <- list(empyt = NA)

# The function as is
iqr_mult <- ifelse(is.null(dots$iqr_mult), 1.5, dots$iqr_mult)
qtype <- ifelse(is.null(dots$type), 7, dots$type)

qv <- unname(quantile(var, type=qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr


# initialize break points vector
bb <- vector(mode="numeric",length=7)

# logic for lower and upper fences
if (lofence < qv[1]) {  # no lower outliers
  bb[1] <- lofence
  bb[2] <- floor(qv[1])
} else {
  bb[2] <- lofence
  bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
  bb[7] <- upfence
  bb[6] <- ceiling(qv[5])

} else {
  bb[6] <- upfence
  bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]

brks <- bb
# Bad breaks
brks[1:2]
#> [1] -2.618441 -3.000000

# Since
lofence
#> [1] -2.618441

# And obviously
qv[1]
#> [1] -2.319327

# But the error is
floor(qv[1])
#> [1] -3

Created on 2023-02-17 with reprex v2.0.2

This is commented on the original source of the method: https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map but I have to think about that.

Maybe @angela-li can help as well (see #18)

@rsbivand
Copy link
Member

The ggplot2 part with this data shows:

library(ggplot2)
box.plt3 <- ggplot(data=data.frame(x=x), aes(x="",y=x)) + geom_boxplot()
layer_data(box.plt3)
            [,1]       
ymin        -2.319327  
lower       -0.7198015 
middle      -0.02713441
upper       0.5459584  
ymax        1.427756   
outliers    numeric,0  
notchupper  0.2556943  
notchlower  -0.3099631 
x           1          
flipped_aes FALSE      
PANEL       "1"        
group       1          
ymin_final  -2.319327  
ymax_final  1.427756   
xmin        0.625      
xmax        1.375      
xid         1          
newx        1          
new_width   0.75       
weight      1          
colour      "grey20"   
fill        "white"    
alpha       NA         
shape       19         
linetype    "solid"    
linewidth   0.5        

@geocaruso
Copy link
Author

I most probably miss the wider picture, but not sure why floor(qv[1]) rather than qv[1] (and similarly ceiling) are needed at all in:

logic for lower and upper fences

if (lofence < qv[1]) { # no lower outliers
bb[1] <- lofence
bb[2] <- floor(qv[1])
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]) { # no upper outliers
bb[7] <- upfence
bb[6] <- ceiling(qv[5])

} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]

In the case of seed (101), if the floor is removed, i.e;:
bb[1] <- lofence
bb[2] <- qv[1]

then we would get

bb
[1] -2.61844134 -2.31932737 -0.71980148 -0.02713441 0.54595842 1.42775554
[7] 2.44459827
which is a correct order leading to a first group being empty as expected (and I understand a desired outcome as of the box style implementation)

@rsbivand
Copy link
Member

rsbivand commented Feb 18, 2023

Will ceiling() cause the same problem at the upper end? This also depends on variable scaling, with default rnorm(, 0, 1) being harder than absolutely larger numbers, right?

In the documentation, it is suggested that the R code should perform like Geoda. Is this where Geoda defines these quantities: https://github.com/GeoDaCenter/geoda/blob/5109e6bbe0cd41d08fd15c4c2a14c6cf4ac601bb/Explore/CatClassification.cpp#L216 ?

@dieghernan
Copy link
Contributor

I have not access to a computer now @rsbivand but I think that ceiling() is also problematic (I think that the reprex with seed 102 triggers that).

I have an bug fix in mind, but I'll take a couple of days until I can test it. Glad if anyone wants to propose their own alternative on this

@geocaruso
Copy link
Author

I suppose the reason for flooring/ceiling is related to the left/right use or the breaks. Rather then ceiling and flooring, one way would be to add/deduce a small value, typically a value just below the minimun difference between any two consecutive values in the original variable. That requires a sorting though (which may be inefficient with long vectors?).

An implementation could be around this, where one divides that min difference by 10 (cutting_dec) and adds/deduces it from the respective quantile:

mybox<-function(var,iqr_mult=1.5,qtype=7,cutting_dec=10){
qv <- unname(quantile(var, type=qtype))
iqr <- iqr_mult * (qv[4] - qv[2])
upfence <- qv[4] + iqr
lofence <- qv[2] - iqr

bb <- vector(mode="numeric",length=qtype)

if (lofence < qv[1]){ # no lower outliers
bb[1] <- lofence
#bb[2] <- floor(qv[1]) ### FORMER ###
bb[2] <- qv[1] - min(diff(sort(var)))/cutting_dec ### NEW ###
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if (upfence > qv[5]){ # no upper outliers
bb[7] <- upfence
#bb[6] <- ceiling(qv[5]) ### FORMER ###
bb[6] <- (qv[5])- min(diff(sort(var)))/cutting_dec ### NEW ###
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}

bb[3:5] <- qv[2:4]
brks <- bb
return(brks)
}

Results are then as follows:

set.seed(101) #lower end problem
x=rnorm(50)
quantile(x)
0% 25% 50% 75% 100%
-2.31932737 -0.71980148 -0.02713441 0.54595842 1.42775554
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
style: fixed
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.618441,-2.319345) [-2.319345,-0.7198015) [-0.7198015,-0.02713441)
0 13 12
[-0.02713441,0.5459584) [0.5459584,1.427738) [1.427738,2.444598]
12 12 1

set.seed(102) #upper end problem
x=rnorm(50)
quantile(x)
0% 25% 50% 75% 100%
-1.6783112 -0.3608505 0.1443969 1.1722807 3.1143334
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
style: fixed
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.660547,-1.67847) [-1.67847,-0.3608505) [-0.3608505,0.1443969)
0 13 12
[0.1443969,1.172281) [1.172281,3.114174) [3.114174,3.471977]
12 12 1

set.seed(103) #no problem, results remain the same in this case
x=rnorm(50)
quantile(x)
0% 25% 50% 75% 100%
-2.14506499 -0.49890289 0.09865646 0.56598795 2.57973063
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
style: fixed
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
1 12 12
[0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
12 12 1
classInt::classIntervals(x,style="box")
style: box
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.145065,-2.096239) [-2.096239,-0.4989029) [-0.4989029,0.09865646)
1 12 12
[0.09865646,0.565988) [0.565988,2.163324) [2.163324,2.579731]
12 12 1

@dieghernan
Copy link
Contributor

dieghernan commented Feb 18, 2023

Thanks @geocaruso I came to the same approach that you mentioned. The ceiling()/floor() part was there just because it was in the original implementation showed on https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map but obviously it was not a bulletproof approach (my mistake).

I was thinking on adding/substracting 0.1*(lofence - qv[1] and 0.1 * (upfence - qv[5]) so the cut value is always between the q and the fences and avoiding the include left issue.

@geocaruso
Copy link
Author

Hello,
Well if am correct the two implementations are close but do not lead to the same outcome in terms of classification nor in terms of display of break values. Compared to Diego's approach (see mybox2() below), the first approach (mybox() below)
(i) is more conservative in terms of values and I think takes no risk of misclassification (no record in the outliers class if no outliers are defined, see below with Seed101 for a difference) but it requires a sort()
(ii) and in absence of outliers displays the tip of the whisker as the break limit with more precision (see ggplots below)
Yet, this 2nd point is not necessarily a desired outcome because in presence of outliers (see Seed 103), it is the fence, not the whisker's tip that is displayed.

I have added below the functions some graphical output for Seed101, 102 and 103 to compare with the ggplot outcome. Blue is the outcome of Diego (mybox2()) and red is the one I suggested (mybox())

mybox<-function(var,iqr_mult=1.5,qtype=7,cutting_dec=10){
  qv <- unname(quantile(var, type=qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr
  
  bb <- vector(mode="numeric",length=qtype)
  
  if (lofence < qv[1]){  # no lower outliers
    bb[1] <- lofence
    #bb[2] <- floor(qv[1]) #former
    bb[2] <- qv[1] - min(diff(sort(var)))/cutting_dec  #ALTERNATIVE 1
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }
  if (upfence > qv[5]){ # no upper outliers
    bb[7] <- upfence
    #bb[6] <- ceiling(qv[5]) #former
    bb[6] <- (qv[5])+ min(diff(sort(var)))/cutting_dec #ALTERNATIVE 1
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }
  bb[3:5] <- qv[2:4]
  brks <- bb
  return(brks)
}

  mybox2<-function(var,iqr_mult=1.5,qtype=7){
    qv <- unname(quantile(var, type=qtype))
    iqr <- iqr_mult * (qv[4] - qv[2])
    upfence <- qv[4] + iqr
    lofence <- qv[2] - iqr
    
    bb <- vector(mode="numeric",length=qtype)
    
    if (lofence < qv[1]){  # no lower outliers
      bb[1] <- lofence
      #bb[2] <- floor(qv[1]) #former
      bb[2] <- qv[1]-0.1*(lofence - qv[1])  #ALTERNATIVE 2
    } else {
      bb[2] <- lofence
      bb[1] <- qv[1]
    }
    if (upfence > qv[5]){ # no upper outliers
      bb[7] <- upfence
      #bb[6] <- ceiling(qv[5]) #former
      bb[6] <- (qv[5])+ 0.1 * (upfence - qv[5]) #ALTERNATIVE 2
    } else {
      bb[6] <- upfence
      bb[7] <- qv[5]
    }
  bb[3:5] <- qv[2:4]
  brks <- bb
  return(brks)
}

And the outcomes can be compared with the following 3 seeds:

set.seed(101) #lower end problem
x=rnorm(50)
quantile(x)
classInt::classIntervals(x,style="box")
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox2(x))
ggboxplot<- ggplot() +
  geom_boxplot(data=data.frame(x=x), aes(x="",y=x), fill="green", alpha=0.8) +
  geom_hline(data=data.frame(brks=mybox(x)), aes(yintercept=brks), col="red", alpha=0.5) +
  geom_hline(data=data.frame(brks=mybox2(x)), aes(yintercept=brks), col="blue", alpha=0.5)
ggboxplot+ theme_bw()+ ggtitle("Seed101")

image

Note also the difference of classification with 1 record being classed as a lower outlier in case of mybox2()

classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
style: fixed
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.618441,-2.319345) [-2.319345,-0.7198015) [-0.7198015,-0.02713441)
0 13 12
[-0.02713441,0.5459584) [0.5459584,1.427773) [1.427773,2.444598]
12 13 0
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox2(x))
style: fixed
one of 1,906,884 possible partitions of this variable into 6 classes
[-2.618441,-2.289416) [-2.289416,-0.7198015) [-0.7198015,-0.02713441)
1 12 12
[-0.02713441,0.5459584) [0.5459584,1.52944) [1.52944,2.444598]
12 13 0

set.seed(102) #upper end problem
x=rnorm(50)
quantile(x)
classInt::classIntervals(x,style="box")
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox2(x))

ggboxplot<- ggplot() +
  geom_boxplot(data=data.frame(x=x), aes(x="",y=x), fill="green", alpha=0.8) +
  geom_hline(data=data.frame(brks=mybox(x)), aes(yintercept=brks), col="red", alpha=0.5) +
  geom_hline(data=data.frame(brks=mybox2(x)), aes(yintercept=brks), col="blue", alpha=0.5)
ggboxplot+ theme_bw()+ ggtitle("Seed102")

image

set.seed(103) #no problem
x=rnorm(50)
quantile(x)
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox(x))
classInt::classIntervals(x,style = "fixed", fixedBreaks=mybox2(x))
classInt::classIntervals(x,style="box")

ggboxplot<- ggplot() +
     geom_boxplot(data=data.frame(x=x), aes(x="",y=x), fill="green", alpha=0.8) +
     geom_hline(data=data.frame(brks=mybox(x)), aes(yintercept=brks), col="red", alpha=0.5) +
     geom_hline(data=data.frame(brks=mybox2(x)), aes(yintercept=brks), col="blue", alpha=0.5)
ggboxplot+ theme_bw()+ ggtitle("Seed103")

image
With this latter image, you see the fences (quantiles +1.5 iqr) are reported in cases with outliers, thus questioning what should be reported as breaks in the no outliers' cases.

It actually triggers the wider question of whether one wants to always have 6 classes reported or sometimes 4 or 5.

(One may even think of adding the notches, ie. up to 8 classes then...but that is another story ;-) )

@dieghernan
Copy link
Contributor

dieghernan commented Feb 20, 2023

DELETED, See below

@dieghernan
Copy link
Contributor

dieghernan commented Feb 20, 2023

Sorry, just a question on this since I think min(diff(sort(var))) can present performance issues in very large samples (need to sort an compute all diffs).

@geocaruso do you think my approach can yield to more accurate results with a smaller multiplier?

Hopefully tomorrow I can have a closer look to all this issue

@dieghernan
Copy link
Contributor

I had some time, @geocaruso

I modified your version of my proposed approach by reducing the multiplier (not need to be 10%, can be as small as desired) and correcting the signs on the lower fence implementation, they were reversed:

mybox <- function(var, iqr_mult = 1.5, qtype = 7, cutting_dec = 10) {
  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr

  bb <- vector(mode = "numeric", length = qtype)

  if (lofence < qv[1]) { # no lower outliers
    bb[1] <- lofence
    # bb[2] <- floor(qv[1]) #former
    bb[2] <- qv[1] - min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }
  if (upfence > qv[5]) { # no upper outliers
    bb[7] <- upfence
    # bb[6] <- ceiling(qv[5]) #former
    bb[6] <- (qv[5]) + min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }
  bb[3:5] <- qv[2:4]
  brks <- bb
  return(brks)
}



mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr

  bb <- vector(mode = "numeric", length = qtype)

  if (lofence < qv[1]) { # no lower outliers
    bb[1] <- lofence
    # bb[2] <- floor(qv[1]) #former
    bb[2] <- qv[1] - 0.00001 * (qv[1] - lofence) # ALTERNATIVE 2
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }
  if (upfence > qv[5]) { # no upper outliers
    bb[7] <- upfence
    # bb[6] <- ceiling(qv[5]) #former
    bb[6] <- (qv[5]) + 0.00001 * (upfence - qv[5]) # ALTERNATIVE 2
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }
  bb[3:5] <- qv[2:4]
  brks <- bb
  return(brks)
}

Testing examples:

set.seed(101) # lower end problem
x <- rnorm(50)
quantile(x)
#>          0%         25%         50%         75%        100% 
#> -2.31932737 -0.71980148 -0.02713441  0.54595842  1.42775554
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>    [-2.618441,-2.319345)   [-2.319345,-0.7198015) [-0.7198015,-0.02713441) 
#>                        0                       13                       12 
#>  [-0.02713441,0.5459584)     [0.5459584,1.427773)      [1.427773,2.444598] 
#>                       12                       13                        0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>     [-2.618441,-2.31933)    [-2.31933,-0.7198015) [-0.7198015,-0.02713441) 
#>                        0                       13                       12 
#>  [-0.02713441,0.5459584)     [0.5459584,1.427766)      [1.427766,2.444598] 
#>                       12                       13                        0

# I see here same results in terms of classification (i.e. no missclasification), mybox_dhh provides closer values to the min max
# due to to the 0.001 factor.

library(ggplot2)

ggplot() +
  geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
  geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
  geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
  ggtitle("Seed101")

set.seed(102) # upper end problem
x <- rnorm(50)
quantile(x)
#>         0%        25%        50%        75%       100% 
#> -1.6783112 -0.3608505  0.1443969  1.1722807  3.1143334
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>   [-2.660547,-1.67847)  [-1.67847,-0.3608505) [-0.3608505,0.1443969) 
#>                      0                     13                     12 
#>   [0.1443969,1.172281)    [1.172281,3.114493)    [3.114493,3.471977] 
#>                     12                     13                      0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>  [-2.660547,-1.678321) [-1.678321,-0.3608505) [-0.3608505,0.1443969) 
#>                      0                     13                     12 
#>   [0.1443969,1.172281)    [1.172281,3.114337)    [3.114337,3.471977] 
#>                     12                     13                      0


# Same here, closer values to min max although I think that is irrelevante as long as it does not misclasify outliers

ggboxplot <- ggplot() +
  geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
  geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
  geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
  ggtitle("Seed102")


set.seed(103) # no problem
x <- rnorm(50)
quantile(x)
#>          0%         25%         50%         75%        100% 
#> -2.14506499 -0.49890289  0.09865646  0.56598795  2.57973063
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>   [-2.145065,-2.096239)  [-2.096239,-0.4989029) [-0.4989029,0.09865646) 
#>                       1                      12                      12 
#>   [0.09865646,0.565988)     [0.565988,2.163324)     [2.163324,2.579731] 
#>                      12                      12                       1
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>   [-2.145065,-2.096239)  [-2.096239,-0.4989029) [-0.4989029,0.09865646) 
#>                       1                      12                      12 
#>   [0.09865646,0.565988)     [0.565988,2.163324)     [2.163324,2.579731] 
#>                      12                      12                       1


ggplot() +
  geom_boxplot(data = data.frame(x = x), aes(x = "", y = x), fill = "green", alpha = 0.8) +
  geom_hline(data = data.frame(brks = mybox(x)), aes(yintercept = brks), col = "red", alpha = 0.5) +
  geom_hline(data = data.frame(brks = mybox_dhh(x)), aes(yintercept = brks), col = "blue", alpha = 0.5) +
  theme_bw() +
  ggtitle("Seed103")

I see some potential improvements on performance on mybox_dhh() although I would say there is not a massive difference:

# Performance

set.seed(104) # no problem
x <- rnorm(1000000)


mybox_t <- Sys.time()

mybox(x)
#> [1] -4.9904983832 -2.6980446395 -0.6746739374  0.0003594936  0.6742398640
#> [6]  2.6976105660  5.0789629606

Sys.time() - mybox_t
#> Time difference of 0.03704286 secs


mybox_dhh_t <- Sys.time()

mybox_dhh(x)
#> [1] -4.9904983832 -2.6980446395 -0.6746739374  0.0003594936  0.6742398640
#> [6]  2.6976105660  5.0789629606

Sys.time() - mybox_dhh_t
#> Time difference of 0.03578687 secs

Created on 2023-02-20 with reprex v2.0.2

@dieghernan
Copy link
Contributor

dieghernan commented Feb 21, 2023

I will try to summarise @rsbivand, @geocaruso @edzer :

  • The issue is related with the use of floor()/ceiling() functions when no outliers are detected. This was implemented original in the initial implementation of the algorithm, but it was tailored to the data used in the exercise. For distibutions where decimals are relevante (i.e. rnorm()) it yields to bad results (see "box" style - error 'vec' must be sorted non-decreasingly and not contain NAs” #41 (comment)):

    # Bad breaks
    brks[1:2]
    #> [1] -2.618441 -3.000000
    
    # Since
    lofence
    #> [1] -2.618441
    
    # And obviously
    qv[1]
    #> [1] -2.319327
    
    # But the error is
    floor(qv[1])
    #> [1] -3
    
  • Where there are no outliers, first and last group should be empty. This can be achieved by adding a small offset to breaks 2 and 5 (that correspond in this case to min and max value of the distribution) in order to avoid including any point due to the intervalClosure parameter (left or right).

  • Both @geocaruso and I came with two different approaches to compute that offset. Either

# For cutting_dec = 10
+/- min(diff(sort(var))) / 10

Or

# For lower fence, I added abs to force correct direction of the offset
- 0.00001 * abs(lofence - qv[1] )

# For upper fence
+ 0.00001 * abs(upfence - qv[5]) 
  • The size of the offset in both cases is arbitrary (can be adjusted with cutting_dec/ 0.00001). The desirable condition IMHO is that it should be as small as possible. This is due to break 2 and 5 should correspond to min/max, so since we are modifying this value I think the user should still be able to recognize those breaks as such min/max values.

  • Regarding both approaches, I think abs(*fence - qv[*]) is potentially more efficient (a simple algorithm rather than
    computing min(diff(sort(var))) since the latter needs to be performed over the entire sample. Also abs(*fence - qv[*]) is not affected for other factors, i.e. I am thinking in a corner case where min(diff(sort(var)))/10 may be still larger than *fence - qv[*], that would trigger again the issue. This would never happen with abs(*fence - qv[*]) since by definition it is a small portion of the difference of the fence and the corresponding value.

  • @rsbivand I additionaly checked the source code of ggplot2:.stat_boxplot() and I can't find a good fit on classInt since the outliers are detecting as :

outliers <- data$y < (stats[2] - coef * iqr) | data$y > (stats[4] + coef * iqr)

But what we need is to came with the exact value (i.e. ggplot2 approach is equivalent to not apply intervalClosure, that is not a feature in classInt).

Happy to hear your remarks/feedback on this

@dieghernan
Copy link
Contributor

A full reprex with both approaches and a comparison in terms of performance with library(microbenchmark):

# Define functions
mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr


  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr

  bb <- vector(mode = "numeric", length = qtype)

  if (lofence < qv[1]) {
    bb[1] <- lofence
    # bb[2] <- floor(qv[1]) #former
    bb[2] <- qv[1] - 0.00000001 * abs(lofence - qv[1])
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }


  if (upfence > qv[5]) {
    bb[7] <- upfence

    bb[6] <- (qv[5]) + 0.00000001 * abs(upfence - qv[5])
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }

  bb[3:5] <- qv[2:4]
  brks <- bb



  brks
}

mybox <- function(var, iqr_mult = 1.5, qtype = 7, cutting_dec = 10) {
  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr

  bb <- vector(mode = "numeric", length = qtype)

  if (lofence < qv[1]) { # no lower outliers
    bb[1] <- lofence
    # bb[2] <- floor(qv[1]) #former
    bb[2] <- qv[1] - min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }
  if (upfence > qv[5]) { # no upper outliers
    bb[7] <- upfence
    # bb[6] <- ceiling(qv[5]) #former
    bb[6] <- (qv[5]) + min(diff(sort(var))) / cutting_dec # ALTERNATIVE 1
  } else {
    bb[6] <- upfence
    bb[7] <- qv[5]
  }
  bb[3:5] <- qv[2:4]
  brks <- bb

  brks
}


# Test results
set.seed(101) # lower end problem
x <- rnorm(50)

classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>    [-2.618441,-2.319328)   [-2.319328,-0.7198015) [-0.7198015,-0.02713441) 
#>                        0                       13                       12 
#>  [-0.02713441,0.5459584)     [0.5459584,1.427756)      [1.427756,2.444598] 
#>                       12                       13                        0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>    [-2.618441,-2.319327)   [-2.319327,-0.7198015) [-0.7198015,-0.02713441) 
#>                        0                       13                       12 
#>  [-0.02713441,0.5459584)     [0.5459584,1.427756)      [1.427756,2.444598] 
#>                       12                       13                        0


set.seed(102) # upper end problem
x <- rnorm(50)

classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>  [-2.660547,-1.678313) [-1.678313,-0.3608505) [-0.3608505,0.1443969) 
#>                      0                     13                     12 
#>   [0.1443969,1.172281)    [1.172281,3.114335)    [3.114335,3.471977] 
#>                     12                     13                      0
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>  [-2.660547,-1.678311) [-1.678311,-0.3608505) [-0.3608505,0.1443969) 
#>                      0                     13                     12 
#>   [0.1443969,1.172281)    [1.172281,3.114333)    [3.114333,3.471977] 
#>                     12                     13                      0

set.seed(103) # no problem
x <- rnorm(50)


classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox(x, cutting_dec = 1000))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>   [-2.145065,-2.096239)  [-2.096239,-0.4989029) [-0.4989029,0.09865646) 
#>                       1                      12                      12 
#>   [0.09865646,0.565988)     [0.565988,2.163324)     [2.163324,2.579731] 
#>                      12                      12                       1
classInt::classIntervals(x, style = "fixed", fixedBreaks = mybox_dhh(x))
#> style: fixed
#>   one of 1,906,884 possible partitions of this variable into 6 classes
#>   [-2.145065,-2.096239)  [-2.096239,-0.4989029) [-0.4989029,0.09865646) 
#>                       1                      12                      12 
#>   [0.09865646,0.565988)     [0.565988,2.163324)     [2.163324,2.579731] 
#>                      12                      12                       1

Performance

# test micro benchmark

microb <- function(sample) {
  microbenchmark::microbenchmark(
    "mybox" = {
      b <- mybox(sample)
    },
    "mybox_dhh" = {
      b <- mybox_dhh(sample)
    },
    times = 5
  )
}



# Init samples
set.seed(2389)

# Pareto distributions a=7 b=14
paretodist <- 7 / (1 - runif(5000000))^(1 / 14)
# Exponential dist
expdist <- rexp(5000000)
# Lognorm
lognormdist <- rlnorm(5000000)
# Weibull
weibulldist <- rweibull(5000000, 1, scale = 5)
# LogCauchy "super-heavy tail"
logcauchdist <- exp(rcauchy(5000000, 2, 4))
# Remove Inf
logcauchdist <- logcauchdist[logcauchdist < Inf]

# Normal dist
normdist <- rnorm(5000000)
# Left-tailed distr
leftnorm <-
  sample(rep(normdist[normdist < mean(normdist)], 3), size = 5000000)

# Uniform distribution
unifdist <- runif(5000000)



microb(paretodist)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>      mybox 872.3832 879.3783 892.6899 884.0298 912.1657 915.4926     5   b
#>  mybox_dhh 514.8215 523.3301 535.8768 524.2920 528.1830 588.7575     5  a
microb(expdist)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>      mybox 867.5758 893.5908 911.1049 897.3768 946.1543 950.8266     5   b
#>  mybox_dhh 474.9472 506.7352 510.4360 523.3524 523.3616 523.7836     5  a
microb(lognormdist)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>      mybox 882.1914 892.7457 917.6065 925.7256 926.1274 961.2424     5   b
#>  mybox_dhh 529.3752 530.4412 540.1720 535.0765 540.9861 564.9811     5  a
microb(weibulldist)
#> Unit: milliseconds
#>       expr      min       lq      mean    median        uq       max neval cld
#>      mybox 882.7893 899.4784 1108.1453 1038.8632 1248.6510 1470.9446     5   b
#>  mybox_dhh 486.6262 600.9610  601.3876  625.4297  644.4559  649.4653     5  a
microb(logcauchdist)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>      mybox 790.1666 806.3588 847.7212 847.1650 896.2554 898.6602     5   b
#>  mybox_dhh 432.7527 447.8909 468.9992 472.5833 473.5591 518.2099     5  a
microb(normdist)
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval cld
#>      mybox 226.2988 232.4805 242.8239 234.4963 252.7251 268.1189     5  a 
#>  mybox_dhh 470.9276 487.3639 492.1637 494.6327 503.5849 504.3094     5   b
microb(leftnorm)
#> Unit: milliseconds
#>       expr      min       lq     mean   median        uq      max neval cld
#>      mybox 925.3866 931.4877 1034.676 969.6359 1167.3629 1179.507     5   b
#>  mybox_dhh 488.9530 515.3051  552.802 531.9555  560.9445  666.852     5  a
microb(unifdist)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq       max neval
#>      mybox 1451.1785 1533.1986 1547.2717 1536.2723 1578.4240 1637.2853     5
#>  mybox_dhh  439.7543  443.3136  467.8262  445.8411  491.6797  518.5423     5
#>  cld
#>    b
#>   a

Created on 2023-02-21 with reprex v2.0.2

@rsbivand
Copy link
Member

rsbivand commented Feb 21, 2023

See https://github.com/pysal/mapclassify/blob/34341a22db968b42bcc21b52d4bcf32a0a3e7098/mapclassify/classifiers.py#L1326:

> set.seed(101)
> x<-rnorm(50)
> br <- classInt::classIntervals(x,style="box")
> br$brks
[1] -2.61844134 -3.00000000 -0.71980148 -0.02713441  0.54595842  2.00000000
[7]  2.44459827
> library(reticulate)
> mc <- import("mapclassify")
> mc$BoxPlot(x)$bins
[1] -2.61844134 -0.71980148 -0.02713441  0.54595842  2.44459827

See the notes https://github.com/pysal/mapclassify/blob/34341a22db968b42bcc21b52d4bcf32a0a3e7098/mapclassify/classifiers.py#L1354, and the book section https://geographicdata.science/book/notebooks/05_choropleth.html#box-plot.

Could this be a better variant?

@dieghernan
Copy link
Contributor

dieghernan commented Feb 21, 2023

Hi @rsbivand

I think this would be a breaking change. As of now (and also stated in the documentation) box always return 7 breaks. From current classInt docs (bold added):

The "box" style generates 7 breaks (therefore 6 categories) based on a box-and-whisker plot. First and last categories include the data values considered as outliers, and the four remaining categories are defined by the percentiles 25, 50 and 75 of the data distribution

As per the NOTE.

If :math:q[2]+hinge*IQR > max(y) there will only be 5 classes and no high
outliers, otherwise, there will be 6 classes and at least one high
outlier.

So the output can be either 5 or 6 classes. Also, I still see the issue (not tested yet) in case classInt(var, style = "box", intervalClosure = "right") then the method would capture always an outlier on the first group, since our second break won't be offsetted and it would corresponf to min(var) .

@geocaruso
Copy link
Author

Hi @rsbivand and @dieghernan,
My view is that

  1. most users would'nt need 7 groups if there are no outliers. The first thing I would do if a group is empty is actully to remove the class (e.g. before mapping). In that sense I am not worried of the py classifier approach
  2. but that classes limit should reflect the whiskers ends in case there is no outlier.
    Although I don't know how to access to source of boxplot function in R, I think it must be
    max(min(x), quantile(x,0.25)-IQR(x)*1.5) for the bottom part.
    With seed101, it would be
    [1] -2.319327
    which is actually min(x)
    [1] -2.319327
    not quantile(x,0.25)-IQR(x)*1.5
    -2.618441
    as indicated by the py classifier.
    In that sense then I would prefer the approaches discussed above.

@rsbivand
Copy link
Member

grDevices::boxplot.stats gives the code; it uses stats::fivenum to calculate the values.

@dieghernan
Copy link
Contributor

set.seed(101)
x<-rnorm(50)


boxplot(x, plot = FALSE)
#> $stats
#>             [,1]
#> [1,] -2.31932737
#> [2,] -0.72437422
#> [3,] -0.02713441
#> [4,]  0.55246186
#> [5,]  1.42775554
#> 
#> $n
#> [1] 50
#> 
#> $conf
#>            [,1]
#> [1,] -0.3124380
#> [2,]  0.2581692
#> 
#> $out
#> numeric(0)
#> 
#> $group
#> numeric(0)
#> 
#> $names
#> [1] "1"
# See boxplot stats
boxplot.stats(x)
#> $stats
#> [1] -2.31932737 -0.72437422 -0.02713441  0.55246186  1.42775554
#> 
#> $n
#> [1] 50
#> 
#> $conf
#> [1] -0.3124380  0.2581692
#> 
#> $out
#> numeric(0)

# Source
boxplot.stats
#> function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) 
#> {
#>     if (coef < 0) 
#>         stop("'coef' must not be negative")
#>     nna <- !is.na(x)
#>     n <- sum(nna)
#>     stats <- stats::fivenum(x, na.rm = TRUE)
#>     iqr <- diff(stats[c(2, 4)])
#>     if (coef == 0) 
#>         do.out <- FALSE
#>     else {
#>         out <- if (!is.na(iqr)) {
#>             x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * 
#>                 iqr)
#>         }
#>         else !is.finite(x)
#>         if (any(out[nna], na.rm = TRUE)) 
#>             stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
#>     }
#>     conf <- if (do.conf) 
#>         stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
#>     list(stats = stats, n = n, conf = conf, out = if (do.out) x[out & 
#>         nna] else numeric())
#> }
#> <bytecode: 0x0000020c8edb5080>
#> <environment: namespace:grDevices>


# Source of boxplot

boxplot.default
#> function (x, ..., range = 1.5, width = NULL, varwidth = FALSE, 
#>     notch = FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), 
#>     col = "lightgray", log = "", pars = list(boxwex = 0.8, staplewex = 0.5, 
#>         outwex = 0.5), ann = !add, horizontal = FALSE, add = FALSE, 
#>     at = NULL) 
#> {
#>     args <- list(x, ...)
#>     namedargs <- if (!is.null(attributes(args)$names)) 
#>         attributes(args)$names != ""
#>     else rep_len(FALSE, length(args))
#>     groups <- if (is.list(x)) 
#>         x
#>     else args[!namedargs]
#>     if (0L == (n <- length(groups))) 
#>         stop("invalid first argument")
#>     if (length(class(groups))) 
#>         groups <- unclass(groups)
#>     if (!missing(names)) 
#>         attr(groups, "names") <- names
#>     else {
#>         if (is.null(attr(groups, "names"))) 
#>             attr(groups, "names") <- 1L:n
#>         names <- attr(groups, "names")
#>     }
#>     cls <- lapply(groups, class)
#>     cl <- NULL
#>     if (all(vapply(groups, function(e) {
#>         is.numeric(unclass(e)) && identical(names(attributes(e)), 
#>             "class")
#>     }, NA)) && (length(unique(cls)) == 1L)) 
#>         cl <- cls[[1L]]
#>     for (i in 1L:n) groups[i] <- list(boxplot.stats(unclass(groups[[i]]), 
#>         range))
#>     stats <- matrix(0, nrow = 5L, ncol = n)
#>     conf <- matrix(0, nrow = 2L, ncol = n)
#>     ng <- out <- group <- numeric(0L)
#>     ct <- 1
#>     for (i in groups) {
#>         stats[, ct] <- i$stats
#>         conf[, ct] <- i$conf
#>         ng <- c(ng, i$n)
#>         if ((lo <- length(i$out))) {
#>             out <- c(out, i$out)
#>             group <- c(group, rep.int(ct, lo))
#>         }
#>         ct <- ct + 1
#>     }
#>     if (length(cl) == 1L && cl != "numeric") 
#>         oldClass(stats) <- oldClass(conf) <- oldClass(out) <- cl
#>     z <- list(stats = stats, n = ng, conf = conf, out = out, 
#>         group = group, names = names)
#>     if (plot) {
#>         if (is.null(pars$boxfill) && is.null(args$boxfill)) 
#>             pars$boxfill <- col
#>         do.call(bxp, c(list(z, notch = notch, width = width, 
#>             varwidth = varwidth, log = log, border = border, 
#>             pars = pars, outline = outline, horizontal = horizontal, 
#>             add = add, ann = ann, at = at), args[namedargs]), 
#>             quote = TRUE)
#>         invisible(z)
#>     }
#>     else z
#> }
#> <bytecode: 0x0000020c8ed33138>
#> <environment: namespace:graphics>

For debugging run:

# If you need to debug

debug(boxplot.default)
boxplot(x)


Created on 2023-02-21 with reprex v2.0.2

@dieghernan
Copy link
Contributor

dieghernan commented Feb 21, 2023

Don't really know how to proceed. "box" was added after #18 that was a request to implement https://spatialanalysis.github.io/lab_tutorials/4_R_Mapping.html#box-map. Other styles already throw empty groups:

set.seed(101)
x<-rnorm(50) * 1.3


classInt::classIntervals(x, style = "pretty", n=20)
#> style: pretty
#>   one of 6.32053e+13 possible partitions of this variable into 26 classes
#>   [-3.2,-3)   [-3,-2.8) [-2.8,-2.6) [-2.6,-2.4) [-2.4,-2.2)   [-2.2,-2) 
#>           1           0           2           0           0           1 
#>   [-2,-1.8) [-1.8,-1.6) [-1.6,-1.4) [-1.4,-1.2)   [-1.2,-1)   [-1,-0.8) 
#>           4           0           1           0           3           3 
#> [-0.8,-0.6) [-0.6,-0.4) [-0.4,-0.2)    [-0.2,0)     [0,0.2)   [0.2,0.4) 
#>           0           2           6           2           1           3 
#>   [0.4,0.6)   [0.6,0.8)     [0.8,1)     [1,1.2)   [1.2,1.4)   [1.4,1.6) 
#>           3           7           4           3           1           2 
#>   [1.6,1.8)     [1.8,2] 
#>           0           1

Created on 2023-02-21 with reprex v2.0.2

The specificity of "box" is that:

  • Creates always 6 groups
  • First and last includes the outliers (if any), so empty groups may happen.

Without these two features, "box" and "quantile" are pretty much the same algorithm. What is true is that in the current CRAN release "box" has a bug.

So I would propose either fix it keeping the original features or deprecate it. If mapclassify approach is considered I think it might be implemented as a new style in order to avoid potential breaks on already existing codes.

@geocaruso
Copy link
Author

Thanks for the hint on how to read the boxplot function! (Learning everyday) Since it uses stats::fivenum (Tukey's hinges) it is not exactly quartiles then (which adds a further complexity with differences or not for odd or even number of records, should we use this computation).
I understand the issue raised by @dieghernan about the fact others expect a 6 groups outcome. Fine to leave it to post-treatment on my side. At the end what matters most is the correct classification. Which breaks one want to display is maybe secondary but needs a decision. And I admit I have zero experience on how best to go next step when issues are raised and potentially lead to important changes for others...

@geocaruso
Copy link
Author

Additional: As an example in Geoda here: https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map, fig 27 shows a case where there are outliers and the bottom limit of that class is min(x) (0 in that case), while fig 28 shows an example where there are no lower outlier, then the category is indeed present with a -Inf to the bottom (not the calculated fence as in the py classifier). I would think this the way to go.

@dieghernan
Copy link
Contributor

dieghernan commented Feb 22, 2023

Thanks @geocaruso (@rsbivand also mentioned Geoda but I missed that). With a couple of modification we can get the same results (Adding Inf/-Inf as the lower and upper break when needed).

I agree this is the way to go, since is not breaking and consistent with the current implementation:

url <- "https://geodacenter.github.io/data-and-lab/data/nyc.zip"
tmpzip <- tempfile(fileext = ".zip")

download.file(
  url = url,
  tmpzip
)

unzip(tmpzip, junkpaths = FALSE, exdir = tempdir())

shpfile <- list.files(tempdir(), pattern = "gpkg", full.names = TRUE)

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
library(ggplot2)

opar <- par(no.readonly = TRUE)
options(scipen = 3)

ny <- read_sf(shpfile)

v <- as.vector(ny$RENT2008)

mybox_dhh <- function(var, iqr_mult = 1.5, qtype = 7) {
  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr


  qv <- unname(quantile(var, type = qtype))
  iqr <- iqr_mult * (qv[4] - qv[2])
  upfence <- qv[4] + iqr
  lofence <- qv[2] - iqr

  bb <- vector(mode = "numeric", length = qtype)

  if (lofence < qv[1]) {
    # bb[1] <- lofence #former
    # bb[2] <- floor(qv[1]) #former
    bb[1] <- -Inf
    bb[2] <- lofence
  } else {
    bb[2] <- lofence
    bb[1] <- qv[1]
  }




  if (upfence > qv[5]) {
    # former
    # bb[7] <- upfence
    # bb[6] <- ceiling(qv[5])
    bb[7] <- Inf
    bb[6] <- upfence
  } else {
    bb[6] <- upfence
    # bb[7] <- qv[5] # former
    bb[7] <- Inf
  }

  bb[3:5] <- qv[2:4]
  brks <- bb



  brks
}

br <- classInt::classIntervals(v, style = "fixed", fixedBreaks = mybox_dhh(v))

br
#> style: fixed
#>   one of 20,349 possible partitions of this variable into 6 classes
#>       [0,456.25)    [456.25,1000)      [1000,1100)    [1100,1362.5) 
#>                3                4               15               19 
#> [1362.5,1906.25)    [1906.25,Inf] 
#>                8                6

# Fig 26 https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map
#
# Lower outlier (3) [0:456.250]
# < 25% (4) [456.250:1000]
# 25% - 50% (15) [1000:1100]
# 50% - 75% (19) [1100:1362.500]
# > 75% (8) [1362.500:1906.250]
# Upper outlier (6) [1906.250:Inf]


# Map
ny$cats <- classInt::classify_intervals(v, style = "fixed", fixedBreaks = mybox_dhh(v))

ggplot(ny) +
  geom_sf(aes(fill = cats)) +
  scale_fill_manual(values = hcl.colors(6, "RdBu", rev = TRUE), drop = FALSE) +
  labs(fill = "Hinge= 1.5")

# With a higher hinge


br2 <- classInt::classIntervals(v, style = "fixed", fixedBreaks = mybox_dhh(v, iqr_mult = 3))

br2
#> style: fixed
#>   one of 20,349 possible partitions of this variable into 6 classes
#>  [-Inf,-87.5)  [-87.5,1000)   [1000,1100) [1100,1362.5) [1362.5,2450) 
#>             0             7            15            19             9 
#>    [2450,Inf] 
#>             5
# Fig 27 https://geodacenter.github.io/workbook/3a_mapping/lab3a.html#box-map
#
# Lower outlier (0) [0:-87.5]
# < 25% (7) [-87.5:1000]
# 25% - 50% (15) [1000:1100]
# 50% - 75% (19) [1100:1362.500]
# > 75% (9) [1362.500:2450]
# Upper outlier (5) [2450:Inf]

ny$cats2 <- classInt::classify_intervals(v, style = "fixed", fixedBreaks = mybox_dhh(v, iqr_mult = 3))

ggplot(ny) +
  geom_sf(aes(fill = cats2)) +
  scale_fill_manual(values = hcl.colors(6, "RdBu", rev = TRUE), drop = FALSE) +
  labs(fill = "Hinge= 3")

Created on 2023-02-22 with reprex v2.0.2

@geocaruso
Copy link
Author

Thanks @dieghernan. Two advantages: There is no longer any worry about adding substracting a sufficiently small number and it is directly consistent with Geoda.

For purists maybe, should one add an (information) note that in absence of outliers, the upper (resp. lower) limit of the lower (resp. upper) outlier class is not the whisker tip of a standard boxplot (i.e. min(x) (resp. max(x)) but q0.25-IQRhinge (resp q0.75+IQRhinge), i.e. the limit for deciding whether a value is an outlier?

@rsbivand
Copy link
Member

fivenum() is the same as quantile(, type=2) and quantile(, type=5) in the set.seed(101) case. I think the python module returns 7 breaks (6 classes) by setting -Inf as the minimum. However, it omits empty classes, which classIntervals() does not do.

@rsbivand
Copy link
Member

We are actually now some way from a boxplot definition, where the tips of the whiskers should be pegged to specific observed values. Of course, Tukey was considering small data sets. I'll create a branch with the -Inf, +Inf insertions as the current floor() and ceiling() have clearly always been wrong. I'll ask for reviews from @geocaruso and @dieghernan and maintainers of downstream packages: mapsf, sf, stars and tmap, and run reverse dependency checks. Does that sound reasonable?

@geocaruso - are you considering attending ECTQG 2023 in Braga? Might a special session on map classification/visualization be sensible?

@geocaruso
Copy link
Author

Very good to me.

And yes I'll attend ECTQG in Braga and a special session on map classification, especially with open tools and comparisons across would be fun. Let's suggest it. The call should have been sent today I believe

rsbivand added a commit that referenced this issue Feb 22, 2023
@rsbivand rsbivand mentioned this issue Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants