length of the names vector is longer than the length of the m_odds column #1689

Closed
jankatins opened this issue Mar 7, 2016 · 6 comments

@jankatins jankatins commented Mar 7, 2016

[This is forwarded from https://github.com/IRkernel/repr/issues/30, where the issue has now been worked around; @karldw is the one who debugged it and has a better understanding of what's going on here.]

Quoting @karldw from that issue:


@JanSchulz, it seems like this is a deeper problem with the interesting tbl_df. For instance, I get the same error when I try print(as.data.frame(interesting)). It seems that the m_odds column has an attribute called names that is longer than the number of rows. Maybe ask the dplyr people?

Oddly, print(data.table::as.data.table(interesting)) works fine.

> attributes(interesting)
$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 58 59 60 61 62 63

$names
[1] "candidate" "m_odds"   

> attributes(interesting$candidate)
NULL
> attributes(interesting$m_odds)
$names
  [1] "1/2"   "9/4"   "16"    "28"    "50"    "50"    "80"    "125"   "150"  
 [10] "300"   "1/2"   "9/4"   "14"    "25"    "50"    "50"    "80"    "150"  
 [19] "300"   "300"   "300"   "8/15"  "7/4"   "18"    "28"    "50"    "50"   
 [28] "80"    "100"   "100"   "150"   "500"   "8/15"  "7/4"   "18"    "28"   
 [37] "50"    "50"    "80"    "100"   "100"   "150"   "500"   "8/15"  "19/10"
 [46] "16"    "25"    "40"    "80"    "100"   "200"   "300"   "1/2"   "2"    
 [55] "14"    "20"    "50"    "50"    "80"    "150"   "1/2"   "9/4"   "18"   
 [64] "25"    "22"    "40"    "80"    "80"    "200"   "200"   "325"   "1/2"  
 [73] "2"     "12"    "20"    "40"    "50"    "80"    "100"   "300"   "1/2"  
 [82] "2"     "17"    "22"    "40"    "50"    "100"   "750"   "8/15"  "9/4"  
 [91] "16"    "25"    "50"    "50"    "66"    "100"   "100"   "100"   "1/2"  
[100] "2"     "18"    "25"    "50"    "16"    "80"    "100"   "200"   "8/15" 
[109] "2"     "14"    "18"    "50"    "50"    "80"    "80"    "400"   "1/2"  
[118] "9/4"   "16"    "20"    "33"    "50"    "100"   "100"   "200"   "500"  
[127] "500"   "500"   "500"   "500"   "500"   "500"   "500"   "500"   "999"  
[136] "999"   "999"   "999"   "999"   "999"   "999"   "999"   "999"   "999"  
[145] "999"   "999"   "999"   "999"   "1000"  "1000"  "9999"  "9999"  "9999" 
[154] "9999"  "9999"  "9999"  "9999"  "9999"  "9999"  "9999"  "9999"  "9999" 
[163] "9999"  "9999"  "9999"  "9999"  "9999"  "9999"  "9999"  "1/2"   "2"    
[172] "17"    "22"    "40"    "50"    "100"   "750"   "8/15"  "2"     "16"   
[181] "25"    "40"    "40"    "80"    "100"   "300"   "300"   "1/2"   "2"    
[190] "17"    "22"    "40"    "50"    "100"   "750"   "1/2"   "2"     "18"   
[199] "25"    "50"    "50"    "80"    "150"   "250"   "750"   "8/15"  "3"    
[208] "18"    "33"    "56"    "80"    "89"    "151"   "256"   "503"   "949"  
[217] "949"   "949"   "949"   "949"   "949"   "398"   "541"   "949"   "949"  
[226] "949"   "949"   "949"   "949"   "949"   "379"   "949"   "949"   "949"  
[235] "949"   "949"   "949"   "949"   "949"   "949"   "949"   "949"   "8/15" 
[244] "3"     "89/5"  "33"    "53"    "81"    "90"    "154"   "523"   "969"  
[253] "969"   "969"   "969"   "969"   "969"   "969"   "969"   "969"   "969"  
[262] "969"   "969"   "969"   "969"   "969"   "736"   "969"   "969"   "969"  
[271] "969"   "969"   "969"   "969"   "969"   "969"   "8/15"  "5/2"   "91/5" 
[280] "21"    "54"    "79"    "89"    "50"    "545"   "545"   "545"   "545"  
[289] "545"   "545"   "545"   "545"   "297"   "990"   "545"   "545"   "545"  
[298] "495"   "990"   "545"   "495"   "198"   "545"   "495"   "495"   "545"

> length(attributes(interesting$m_odds)$names)
[1] 306

and


@JanSchulz, I don't have a 100% clear understanding here, but it looks like dplyr is assigning names to the m_odds variable, and the names vector is longer than the m_odds column. Note that it's not a problem with the row names of the interesting table; instead, the summarise command has actually assigned names to the values in the m_odds column. (I've never seen that before, but I'm sure Hadley had a good reason.)

To look at this more, I ran your code in a normal R session (avoiding Jupyter to aid debugging). After running your notebook's cells 1 through 11, I get similar errors when I try to convert to a matrix or data.frame.

> as.matrix(interesting)
Error in as.matrix.data.frame(interesting) : 
  'names' attribute [232] must be the same length as the vector [61]
> as.data.frame(interesting)
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  : 
  'names' attribute [232] must be the same length as the vector [61]
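
The diagnosis above (a column that carries its own names attribute) can be checked with a few lines of base R. This is only a minimal sketch, not code from the thread: named_columns() and strip_column_names() are hypothetical helper names, and the example data frame is made up. Its names have the correct length, because base R refuses to set an over-long names attribute directly.

```r
# Hypothetical helper: report which columns carry a names attribute.
named_columns <- function(df) {
  vapply(df, function(col) !is.null(names(col)), logical(1))
}

# Hypothetical helper: drop names from every atomic column that has them.
strip_column_names <- function(df) {
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.atomic(col) && !is.null(names(col))) {
      names(col) <- NULL
      df[[j]] <- col
    }
  }
  df
}

# Made-up data frame written in dput() form, so that the names on
# m_odds survive construction.
example <- structure(
  list(
    candidate = c("a", "b"),
    m_odds = setNames(c(0.5, 4 / 3), c("1/2", "4/3"))
  ),
  class = "data.frame",
  row.names = c(NA, -2L)
)

named_columns(example)   # flags m_odds only
cleaned <- strip_column_names(example)
named_columns(cleaned)   # no column is flagged any more
```

Running named_columns() on a suspicious table before calling as.matrix() or as.data.frame() is one way to confirm whether a named column is involved.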

The code to produce interesting can be found in:

http://nbviewer.jupyter.org/gist/JanSchulz/2fc1a468a65f5fd47d51
https://gist.github.com/JanSchulz/2fc1a468a65f5fd47d51

[Note: there is an error in the notebook: the problematic cell 11 has one %>% too many at the end of the first line. It was introduced after the cell was run but before I created the gist :-/]

Member

@krlmlr krlmlr commented Mar 7, 2016

Could you please try the development version? I didn't see the problem with 94efd2f.

Columns that have a .Names attribute (such as m_odds in your example) are really weird; I'm having trouble even creating such a data frame from R.

Member

@hadley hadley commented Mar 8, 2016

This mostly looks like a corrupted data frame to me.

Author

@jankatins jankatins commented Mar 9, 2016

I installed the current dplyr from GitHub (by running the snippet in the README) and could still reproduce this. This is the script I used:

library("rvest") 
library("dplyr") 
library("tidyr") 
math_eval <- function(exp){
  eval(parse(text=exp))
}

process_data_table <- . %>% html_node(css=".eventTable") %>% 
  html_table() %>% 
  slice(5:n()) %>% 
  slice(2:n()) %>%
  gather("bookie", "odds", -X1) %>%
  rename(candidate=X1) %>%
  filter(odds!="") %>%
  mutate(odds=sapply(odds, math_eval))

html_president <- read_html("http://www.oddschecker.com/politics/us-politics/us-presidential-election-2016/winner")
df_president <- html_president %>% process_data_table
interesting <- df_president %>% group_by(candidate) %>% summarise(m_odds=mean(odds)) 
as.matrix(interesting)
> as.matrix(interesting)
Error in as.matrix.data.frame(interesting) : 
  'names' attribute [188] must be the same length as the vector [11]

[R 3.2, Windows 7 64-bit; RStudio shows the dplyr version as 0.4.3.9001]

Author

@jankatins jankatins commented Mar 9, 2016

And this seems to be the shortest version I can produce, which reproduces it:

library("dplyr")
df = data.frame(a = c("1/2", "1", "4/3"), candidate=c("a", "a", "b"), stringsAsFactors = FALSE)
math_eval <- function(exp){
  eval(parse(text=exp))
}
df %>% mutate(b=sapply(a, math_eval)) %>%
  group_by(candidate) %>% 
  summarise(m_odds=mean(b)) %>% as.matrix()

Member

@krlmlr krlmlr commented Mar 10, 2016

Confirmed. df is still a data frame with a named column; mutate() should strip the names here.

df %>% mutate(b=sapply(a, math_eval)) %>% {names(.$b)}
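
For context on where the names come from: sapply() defaults to USE.NAMES = TRUE, so when its input is a character vector the input strings become the names of the result. The following base-R sketch reuses math_eval and the values from the reprex above; the variable names named, plain and also_plain are illustrative only, and the two workarounds are suggestions rather than the fix adopted in dplyr.

```r
math_eval <- function(exp) {
  eval(parse(text = exp))
}

a <- c("1/2", "1", "4/3")

# Default behaviour: the character input is reused as names of the result.
named <- sapply(a, math_eval)
names(named)

# Workaround 1: ask sapply() not to attach names.
plain <- sapply(a, math_eval, USE.NAMES = FALSE)
names(plain)

# Workaround 2: strip the names afterwards.
also_plain <- unname(sapply(a, math_eval))
names(also_plain)
```

Either workaround inside the mutate() call would keep the names out of the data frame in the first place, independent of how dplyr handles them.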

Member

@hadley hadley commented Apr 19, 2016

Here's a minimal reprex:

data_frame(g = c(1, 1, 2), x = 1:3) %>%
  mutate(x = setNames(x, c("a", "b", "c"))) %>% 
  group_by(g) %>% 
  summarise(y = mean(x)) %>% 
  str()

I think it's probably reasonable for mutate to always drop names:

data.frame(x = c(a = 1)) %>% str()

@romainfrancois could you take a look please?
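
The data.frame() call above shows the base-R precedent for dropping names: the constructor does not keep names on an atomic column. A small base-R sketch of that comparison (the variable names are illustrative only):

```r
named_x <- c(a = 1)
df <- data.frame(x = named_x)

names(named_x)  # the input vector is named
names(df$x)     # the column stored in the data frame is not
```

According to the data.frame() documentation, such names may be used as row names instead, so the information is moved rather than kept on the column.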

@hadley hadley added this to the future milestone May 26, 2016
@krlmlr krlmlr added this to the data frame 2 milestone Feb 21, 2017
@krlmlr krlmlr removed this from the future milestone Feb 21, 2017
@krlmlr krlmlr closed this in #2512 Mar 9, 2017
romainfrancois added a commit that referenced this issue Jun 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018