spread changes factor order #47

Closed
manuelreif opened this issue Dec 15, 2014 · 4 comments

@manuelreif

Hi, and thank you for this package!

A question came up while using tidyr version 0.2.0 from CRAN.
I want to reshape a long table into a wide one, which works fine, but the column ordering of the result is unexpected.
When using unite(), the new variable is of type character, which carries no information about category ordering, so after spread() the output columns are sorted alphabetically. Is there any way to keep the factor-ordering information, or to avoid unite() altogether (by passing more than one key to spread())? I fear this is currently not possible.
Would this be possible in future versions? It would be extremely helpful for creating tables automatically when a certain column order is given a priori.

# creating some variables
year <- rep(rep(c(2006, 2007), each = 4), times = 3)
f1 <- factor(rep(c("m", "w", "gesamt"), each = 8), levels = c("m", "w", "gesamt"))
f2 <- factor(rep(letters[1:4], 6), levels = letters[1:4])
val <- round(rnorm(24), 2)

# creating a data.frame
d1 <- data.frame(year = year, f1, f2, val)

d1
   year     f1 f2   val
1  2006      m  a -0.92
2  2006      m  b  0.93
3  2006      m  c  1.10
4  2006      m  d -1.04
5  2007      m  a  0.02
6  2007      m  b -0.22
7  2007      m  c  1.00
8  2007      m  d -0.50
9  2006      w  a  1.56
10 2006      w  b -0.52
11 2006      w  c -1.51
12 2006      w  d  0.50
13 2007      w  a -0.25
14 2007      w  b -0.56
15 2007      w  c -0.31
16 2007      w  d  0.50
17 2006 gesamt  a  0.74
18 2006 gesamt  b -1.90
19 2006 gesamt  c  0.44
20 2006 gesamt  d  0.46
21 2007 gesamt  a -0.91
22 2007 gesamt  b  1.20
23 2007 gesamt  c  0.03
24 2007 gesamt  d -0.41

# from long --> to wide
library(tidyr)
d1 %>% unite(univar, f1, f2) %>% spread(univar, val)

  year gesamt_a gesamt_b gesamt_c gesamt_d   m_a   m_b m_c   m_d   w_a   w_b   w_c w_d
1 2006     0.74     -1.9     0.44     0.46 -0.92  0.93 1.1 -1.04  1.56 -0.52 -1.51 0.5
2 2007    -0.91      1.2     0.03    -0.41  0.02 -0.22 1.0 -0.50 -0.25 -0.56 -0.31 0.5

Thank you!
Manuel
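
The alphabetical ordering described above can be reproduced in base R without tidyr. The following is a minimal sketch, not tidyr's implementation: it shows that pasting two factors gives a character vector whose default factor levels are alphabetical, and how the intended order could be re-imposed by building the level vector from the original factors. The names `key_chr`, `wanted`, and `key_fct` are illustrative only, and whether a given tidyr version honors the levels of a factor key in spread() is not verified here.

```r
# Two factors with a deliberate, non-alphabetical level order.
f1 <- factor(rep(c("m", "w", "gesamt"), each = 4), levels = c("m", "w", "gesamt"))
f2 <- factor(rep(letters[1:4], 3), levels = letters[1:4])

# Pasting the keys (similar in spirit to unite()) yields a plain character
# vector, so the level order of f1 and f2 is lost.
key_chr <- paste(f1, f2, sep = "_")

# Converting back to a factor without explicit levels sorts them alphabetically.
alphabetical <- levels(factor(key_chr))
print(alphabetical)

# Build the intended order from the original levels: expand.grid() varies its
# first argument fastest, so f2 cycles within each level of f1.
grid <- expand.grid(f2 = levels(f2), f1 = levels(f1))
wanted <- paste(grid$f1, grid$f2, sep = "_")

# A factor with explicit levels carries the intended order.
key_fct <- factor(key_chr, levels = wanted)
print(levels(key_fct))
```

Building `wanted` from the levels rather than from the order of the rows keeps the result independent of how the data happen to be sorted.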

@mrdwab

mrdwab commented Apr 1, 2015

This actually seems to also affect the "id" variables. See, for example, here: http://stackoverflow.com/q/29381069/1270695

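# stringsAsFactors=TRUE reproduces the factor columns shown below on R >= 4.0,
# where data.frame() no longer converts strings to factors by default
df = data.frame(name=c("B","B","A","A"),
                group=c("g1","g2","g1","g2"),
                V1=c(10,40,20,30),
                V2=c(6,3,1,7),
                stringsAsFactors=TRUE)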

gather(df, Var, Val, V1:V2) %>% 
  unite(VarG, Var, group) %>% 
  spread(VarG, Val)

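Note how the "name" values are ordered in the input versus the output: the input rows run B, B, A, A, while the output rows come back sorted by factor level (A, B).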

> str(.Last.value)
'data.frame':   2 obs. of  5 variables:
 $ name : Factor w/ 2 levels "A","B": 1 2
 $ V1_g1: num  20 10
 $ V1_g2: num  30 40
 $ V2_g1: num  1 6
 $ V2_g2: num  7 3
> str(df)
'data.frame':   4 obs. of  4 variables:
 $ name : Factor w/ 2 levels "A","B": 2 2 1 1
 $ group: Factor w/ 2 levels "g1","g2": 1 2 1 2
 $ V1   : num  10 40 20 30
 $ V2   : num  6 3 1 7

@dataRangler

I asked the SO question :-) I am a beginner, but when I looked at the spread.R source code, line 79 seems to be where the sorting starts. Is it necessary? I am new to R and GitHub and do not know how to test this yet.

# Add in missing values, if necessary
if (length(overall) < n) {
  overall <- match(seq_len(n), overall, nomatch = NA)
} else {
  overall <- order(overall)
}

@dataRangler

I've just found dplyr::summarise does sorting as well. Is this a design philosophy?

df %>%
  group_by(name) %>%
  summarise(n())
Source: local data frame [2 x 2]

  name n()
1    A   2
2    B   2

@hadley
Member

hadley commented Aug 24, 2015

Instead of unite() do:

d1 %>% mutate(univar = f1:f2, f1 = NULL, f2 = NULL) %>% spread(univar,val)
# OR
d1 %>% mutate(univar = interaction(f1, f2), f1 = NULL, f2 = NULL) %>% spread(univar,val)
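
One detail worth checking before choosing between the two variants is that they order their levels differently by default. The sketch below uses base R only and the same `f1` and `f2` definitions as the original post; the variable names `levels_colon`, `levels_default`, and `levels_lex` are illustrative. It assumes, as the suggestion above implies, that the column order after spread() follows the factor levels of the key.

```r
# Same factors as in the original post.
f1 <- factor(rep(c("m", "w", "gesamt"), each = 8), levels = c("m", "w", "gesamt"))
f2 <- factor(rep(letters[1:4], 6), levels = letters[1:4])

# `:` applied to two factors varies the second factor fastest:
# m:a, m:b, m:c, m:d, w:a, ...
levels_colon <- levels(f1:f2)
print(levels_colon)

# interaction() defaults to lex.order = FALSE, so the first factor varies
# fastest: m.a, w.a, gesamt.a, m.b, ...
levels_default <- levels(interaction(f1, f2))
print(levels_default)

# lex.order = TRUE matches the ordering of `:`; sep = "_" mimics the
# separator that unite() produced in the original output.
levels_lex <- levels(interaction(f1, f2, sep = "_", lex.order = TRUE))
print(levels_lex)
```

If the grouping shown in the original post (all `m` columns, then `w`, then `gesamt`) is the goal, `f1:f2` or `interaction(..., lex.order = TRUE)` gives that level order, while the default `interaction()` groups by `f2` first.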

@hadley hadley closed this as completed Aug 24, 2015