robust irmi, bug with bcancer data #70

Closed
matthias-da opened this issue Oct 19, 2022 · 9 comments
@matthias-da
Collaborator

This is an unpleasant bug because it is very hard to debug: it does not occur on every run.

library(VIM)
data("bcancer")

for(i in 1:ncol(bcancer)){
  bcancer[sample(1:nrow(bcancer), 25), i] <- NA
}

ir <- irmi(bcancer[, 2:ncol(bcancer)])
# no error:
set.seed(123)
ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE, maxit = 3)
# error:
set.seed(1234)
ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE, maxit = 3)

What probably happens: because the individual variables take so few distinct values, irmi no longer interprets these variables as numeric.
If you add a tiny amount of noise, everything fits without errors:

for(i in 1:10){
  bcancer[, i] <- as.numeric(bcancer[, i]) + runif(nrow(bcancer), 0, 0.0001)
}

set.seed(1234)
ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE)
@matthias-da
Collaborator Author

I would not remove rlm, because this fallback is often used when lmrob does not work. The problem is also not specific to rlm: it affects rlm, lmrob, and any other method, because the algorithm has decided, for whatever reason, that the variable to impute is numeric when it is not.

library(VIM)
data("bcancer")

for(i in 1:ncol(bcancer)){
  bcancer[sample(1:nrow(bcancer), 25), i] <- NA
}

ir <- irmi(bcancer[, 2:ncol(bcancer)])
# no error:
set.seed(123)
ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE, maxit = 3)
# error:
set.seed(1234)
ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE, maxit = 3)

[1] "inner loop: 10"
[1] "binary"
[1] "bin"
[1] "formula used: class ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses"
[1] "it = 3 ,  Wert = 187"
[1] "eps 5"
[1] "test: TRUE"
[1] "not converged..."
[1] "187 < 5 = eps"
Error in lm.wfit(x, y, w, method = "qr") : incompatible dimensions

So either the printed output is wrong, or the algorithm chose the wrong regression method for a categorical variable.
rlm (or lmrob) should never be used for this variable.

I can't go into details before Christmas, but I can look at it more closely afterwards. What I propose is to not exclude rlm.

@matthias-da
Collaborator Author

matthias-da commented Dec 19, 2022

P.S. It happens in iteration 3; the first 2 iterations were fine.

> ir <- irmi(bcancer[, 2:ncol(bcancer)], robust = TRUE, maxit = 3, trace = TRUE)
Method for multinomial models:multinom

  clump_thickness uniformity_cellsize uniformity_cellshape adhesion epithelial_cellsize bare_nuclei
1               5                   1                    1        1                   2           1
2               5                   4                    4        5                   3          10
3               3                   1                    1        1                   2           2
4               6                   8                    7        1                   3           4
5               4                   1                    1        3                   2           1
6               8                  10                   10        8                   7          10
  chromatin normal_nucleoli mitoses class
1         3               1       1     0
2         3               2       1     0
3         3               1       1     0
4         3               7       1     1
5         3               1       1     0
6         9               7       3     1
Iteration1

[1] "inner loop: 1"
[1] "integer"
[1] "numeric"
[1] "formula used: clump_thickness ~ uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 2"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellsize ~ clump_thickness+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 3"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellshape ~ clump_thickness+uniformity_cellsize+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 4"
[1] "integer"
[1] "numeric"
[1] "formula used: adhesion ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 5"
[1] "integer"
[1] "numeric"
[1] "formula used: epithelial_cellsize ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 6"
[1] "numeric"
[1] "numeric"
[1] "formula used: bare_nuclei ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 7"
[1] "integer"
[1] "numeric"
[1] "formula used: chromatin ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+normal_nucleoli+mitoses+class"
[1] "inner loop: 8"
[1] "integer"
[1] "numeric"
[1] "formula used: normal_nucleoli ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+mitoses+class"
[1] "inner loop: 9"
[1] "integer"
[1] "numeric"
[1] "formula used: mitoses ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+class"
[1] "inner loop: 10"
[1] "binary"
[1] "bin"
[1] "formula used: class ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses"
[1] "it = 1 ,  Wert = 818"
[1] "eps 5"
[1] "test: TRUE"
Iteration2

[1] "inner loop: 1"
[1] "integer"
[1] "numeric"
[1] "formula used: clump_thickness ~ uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 2"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellsize ~ clump_thickness+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 3"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellshape ~ clump_thickness+uniformity_cellsize+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 4"
[1] "integer"
[1] "numeric"
[1] "formula used: adhesion ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 5"
[1] "integer"
[1] "numeric"
[1] "formula used: epithelial_cellsize ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 6"
[1] "numeric"
[1] "numeric"
[1] "formula used: bare_nuclei ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 7"
[1] "integer"
[1] "numeric"
[1] "formula used: chromatin ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+normal_nucleoli+mitoses+class"
[1] "inner loop: 8"
[1] "integer"
[1] "numeric"
[1] "formula used: normal_nucleoli ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+mitoses+class"
[1] "inner loop: 9"
[1] "integer"
[1] "numeric"
[1] "formula used: mitoses ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+class"
[1] "inner loop: 10"
[1] "binary"
[1] "bin"
[1] "formula used: class ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses"
[1] "it = 2 ,  Wert = 1004"
[1] "eps 5"
[1] "test: TRUE"
Iteration3

[1] "inner loop: 1"
[1] "integer"
[1] "numeric"
[1] "formula used: clump_thickness ~ uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 2"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellsize ~ clump_thickness+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 3"
[1] "integer"
[1] "numeric"
[1] "formula used: uniformity_cellshape ~ clump_thickness+uniformity_cellsize+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 4"
[1] "integer"
[1] "numeric"
[1] "formula used: adhesion ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 5"
[1] "integer"
[1] "numeric"
[1] "formula used: epithelial_cellsize ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+bare_nuclei+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 6"
[1] "numeric"
[1] "numeric"
[1] "formula used: bare_nuclei ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+chromatin+normal_nucleoli+mitoses+class"
[1] "inner loop: 7"
[1] "integer"
[1] "numeric"
[1] "formula used: chromatin ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+normal_nucleoli+mitoses+class"
[1] "inner loop: 8"
[1] "integer"
[1] "numeric"
[1] "formula used: normal_nucleoli ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+mitoses+class"
[1] "inner loop: 9"
[1] "integer"
[1] "numeric"
[1] "formula used: mitoses ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+class"
[1] "inner loop: 10"
[1] "binary"
[1] "bin"
[1] "formula used: class ~ clump_thickness+uniformity_cellsize+uniformity_cellshape+adhesion+epithelial_cellsize+bare_nuclei+chromatin+normal_nucleoli+mitoses"
[1] "it = 3 ,  Wert = 187"
[1] "eps 5"
[1] "test: TRUE"
[1] "not converged..."
[1] "187 < 5 = eps"
Error in lm.wfit(x, y, w, method = "qr") : incompatible dimensions

@alexkowa
Member

alexkowa commented Dec 19, 2022

The problem occurs in rlm, probably in the init step, where a sample of the data is fitted with LS regression.
Whether it fails depends on the sample: the same rlm call with the same input data can fail on one run and succeed on the next. The error message is misleading.

rlmTestData.zip

dat <- readRDS("rlmTestData.Rds")
MASS::rlm(y ~ clump_thickness + uniformity_cellsize + uniformity_cellshape + 
    adhesion + epithelial_cellsize + bare_nuclei + chromatin + 
    normal_nucleoli + class, data = dat, method = "MM")

Btw, we can support all the methods provided by rlm with lmrob and I would not think that rlm is more stable than lmrob.
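
To check whether a fit really fails only for some random draws, one option is to repeat the same call under different seeds and record which ones raise an error. The following is only a sketch: `fit_fails` is a hypothetical helper, and the built-in `stackloss` data stands in for the attached `rlmTestData.Rds`, so it illustrates the procedure rather than reproducing the failure reported here.

```r
library(MASS)  # provides rlm(); MASS ships with standard R installations

# Hypothetical helper: run the same fit under several RNG seeds and
# return a logical vector marking the seeds for which the fit errored.
fit_fails <- function(fit_fun, seeds) {
  vapply(seeds, function(s) {
    set.seed(s)
    tryCatch({
      fit_fun()
      FALSE
    }, error = function(e) TRUE)
  }, logical(1))
}

# Stand-in data (assumption): stackloss instead of rlmTestData.Rds
failed <- fit_fails(
  function() rlm(stack.loss ~ ., data = stackloss, method = "MM"),
  seeds = 1:20
)
table(failed)
```

With the data from the zip file, replacing the function body by the rlm call above should show whether the error is tied to particular seeds.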

@alexkowa
Member

The init actually happens in lqs, which rlm uses as its init method.

@matthias-da
Collaborator Author

matthias-da commented Dec 19, 2022

Or simply change the default of the function argument of irmi(): instead of robMethod = "MM", use robMethod = "lmrob".

This way, rlm is still inside as a fallback and lmrob is now used by default. The error will then no longer occur in the bcancer data.

That would actually be the fastest solution, wouldn't it? I have committed it this way.

@alexkowa
Member

Yes, but the parameter robMethod is quite confusing as it is now. We could improve it by removing rlm; then the parameter would actually state only the method to be used for the robust estimation and not the function, because robMethod = "lmrob" is actually doing an MM estimation.

@matthias-da
Collaborator Author

If I remember correctly, the aim was always to use lmrob first and rlm as a fallback, because rlm was (at least back then) more robust in terms of its implementation than lmrob. So there were (at least in the past) many situations where lmrob did not give a solution but rlm did. It seems that at some point we even changed the default to rlm.

To avoid risking more failures in other situations or with other data, I recommend just updating the documentation instead of kicking out rlm. We can write that, when the fallback to rlm is used, MM regression is also applied, but through the function rlm. I still think it is a good fallback (when setting force = TRUE).
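
The "lmrob first, rlm as fallback" idea can be sketched roughly as follows. This is not VIM's internal code: `robust_fit` is a hypothetical helper, the built-in `stackloss` data is used only for illustration, and the sketch assumes the robustbase package is installed.

```r
# Hypothetical helper, not VIM's internal code: try lmrob() first and
# fall back to rlm() with MM estimation if lmrob() raises an error.
robust_fit <- function(formula, data) {
  tryCatch(
    robustbase::lmrob(formula, data = data),
    error = function(e) {
      message("lmrob failed (", conditionMessage(e), "); falling back to rlm")
      MASS::rlm(formula, data = data, method = "MM")
    }
  )
}

# Illustration on stand-in data (assumption): stackloss from base R
fit <- robust_fit(stack.loss ~ ., data = stackloss)
class(fit)
```

Both paths perform an MM-type estimation, which matches the point that only the fitting function changes when the fallback is used, not the estimation method.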

@alexkowa
Member

yeah, ok. Let's do this.

@matthias-da
Collaborator Author

Ok. Thanks for your efforts! I will update the documentation to be more precise on this.
