read.structure has problems with names that start with a number #160

Closed
romunov opened this issue Nov 28, 2016 · 5 comments
@romunov
Collaborator

romunov commented Nov 28, 2016

Elizabeth posted a question about a problem she had importing a structure file. Here is a reproducible example. Samples C_KH1059 and M_KH1834 should have a value of 1 in column 1401_25.33 (locus 1401_25), but they get NA instead.

library(adegenet)

cat(print("		1_25	8_54	1358_15	1363_12	1368_57	1369_41	1372_14	1373_9	1377_42	1378_53	1379_10	1382_37	1386_27	1398_46	1400_9	1401_25	1403_13	1404_17	1409_42	1416_48	1419_11	1421_14	1423_5	1424_74	1426_55	1429_46	1432_23	1435_30	1436_7	1438_9	1443_37
A_KH1584	A	1	4	4	1	1	3	2	4	4	2	3	3	2	4	1	3	1	1	2	3	1	4	4	3	2	2	3	4	4	4	2
A_KH1584	A	1	4	4	1	1	3	2	4	4	4	3	3	4	4	1	3	1	3	2	3	3	4	4	3	4	2	3	4	4	4	2
C_KH1059	C	0	4	4	1	1	3	2	4	4	2	1	3	2	4	1	3	1	3	2	3	3	2	4	3	2	2	3	2	4	4	2
C_KH1059	C	0	4	4	1	1	3	2	4	4	4	3	3	2	4	1	3	1	3	2	3	3	4	4	3	2	2	3	4	4	4	2
M_KH1834	M	0	2	2	1	1	3	2	4	4	2	3	3	2	4	1	3	1	1	2	3	3	4	4	3	2	2	3	2	4	4	2
M_KH1834	M	0	4	4	1	3	3	2	4	4	2	3	3	2	4	1	3	1	3	2	3	3	4	4	3	2	4	3	4	4	4	2
M_KH1837	M	1	4	4	1	1	3	2	4	4	0	3	3	2	2	1	3	1	1	2	3	3	4	4	3	4	2	3	4	4	4	2
M_KH1837	M	1	4	4	1	3	3	2	4	4	0	3	3	4	4	1	3	1	3	2	3	3	4	4	3	4	2	3	4	4	4	2"), 
    file = "elizabeth_starts_with_number.stru")

xy1 <- read.structure("elizabeth_starts_with_number.stru", NA.char="0",
                     n.ind = 4, n.loc = 31, onerowperind = FALSE,
                     col.lab = 1, col.pop = 2, row.marknames = 1,
                     sep = "\t", col.others = 0)

x1 <- tab(xy1)

# 1401_25.33 is NA but should be 1 for samples C_KH1059 and M_KH1834
x1[, grepl("1401_25", colnames(x1)), drop = FALSE]

#          1401_25.33
# A_KH1584          1
# C_KH1059         NA
# M_KH1834         NA
# M_KH1837          1

If I rename the loci so that their names start with a letter, the values are read correctly.

cat(print("		X1_25	X8_54	X1358_15	X1363_12	X1368_57	X1369_41	X1372_14	X1373_9	X1377_42	X1378_53	X1379_10	X1382_37	X1386_27	X1398_46	X1400_9	X1401_25	X1403_13	X1404_17	X1409_42	X1416_48	X1419_11	X1421_14	X1423_5	X1424_74	X1426_55	X1429_46	X1432_23	X1435_30	X1436_7	X1438_9	X1443_37
A_KH1584	A	1	4	4	1	1	3	2	4	4	2	3	3	2	4	1	3	1	1	2	3	1	4	4	3	2	2	3	4	4	4	2
A_KH1584	A	1	4	4	1	1	3	2	4	4	4	3	3	4	4	1	3	1	3	2	3	3	4	4	3	4	2	3	4	4	4	2
C_KH1059	C	0	4	4	1	1	3	2	4	4	2	1	3	2	4	1	3	1	3	2	3	3	2	4	3	2	2	3	2	4	4	2
C_KH1059	C	0	4	4	1	1	3	2	4	4	4	3	3	2	4	1	3	1	3	2	3	3	4	4	3	2	2	3	4	4	4	2
M_KH1834	M	0	2	2	1	1	3	2	4	4	2	3	3	2	4	1	3	1	1	2	3	3	4	4	3	2	2	3	2	4	4	2
M_KH1834	M	0	4	4	1	3	3	2	4	4	2	3	3	2	4	1	3	1	3	2	3	3	4	4	3	2	4	3	4	4	4	2
M_KH1837	M	1	4	4	1	1	3	2	4	4	0	3	3	2	2	1	3	1	1	2	3	3	4	4	3	4	2	3	4	4	4	2
M_KH1837	M	1	4	4	1	3	3	2	4	4	0	3	3	4	4	1	3	1	3	2	3	3	4	4	3	4	2	3	4	4	4	2"), 
    file = "elizabeth_starts_with_letter.stru")

xy2 <- read.structure("elizabeth_starts_with_letter.stru", NA.char="0",
                     n.ind = 4, n.loc = 31, onerowperind = FALSE,
                     col.lab = 1, col.pop = 2, row.marknames = 1,
                     sep = "\t", col.others = 0)

x2 <- tab(xy2)
x2[, grepl("1401_25", colnames(x2)), drop = FALSE]

#          X1401_25.33
# A_KH1584           1
# C_KH1059           1
# M_KH1834           1
# M_KH1837           1

unlink("elizabeth_starts_with_letter.stru")
unlink("elizabeth_starts_with_number.stru")

@romunov changed the title from "read.structure has problems with names that can be coerced to numeric" to "read.structure has problems with names that start with a number" on Nov 28, 2016

@thibautjombart
Owner

OK, thanks for looking into this. I am struggling to catch up with things, so I won't have time to look into it, though the fix is probably an easy one. :-/

@romunov
Collaborator Author

romunov commented Dec 2, 2016

I'll have a look ASAP.

@thibautjombart
Owner

You rock!

@romunov
Collaborator Author

romunov commented Dec 4, 2016

The problem was in df2genind, at line 322: because of partial matching, loci other than the intended one also became NA candidates. For instance, the pattern for locus 1_25 also matches 1401_25. I made the matching more explicit by putting ^ in front of the pattern, so that it has to match from the start of the string. Adding a letter to the beginning of the locus names made things work because X1_25 is no longer a substring of X1401_25. I also added a test case for this.
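
For anyone landing here later, a minimal sketch of the partial-matching effect described above. This is not the actual code from df2genind; the column names in `cols` and the variable names are made up for illustration, and only base R `grep()` and `paste0()` are used.

```r
# Hypothetical allele-table column names in "locus.allele" form
# (illustrative only, not taken from adegenet internals).
cols <- c("1_25.1", "8_54.4", "1401_25.33")

# Locus whose columns should be flagged as NA candidates.
locus <- "1_25"

# Unanchored pattern: "1_25" also occurs at the end of "1401_25",
# so the column of the unrelated locus is picked up as well.
unanchored <- grep(locus, cols, value = TRUE)

# Anchored pattern: "^" forces the match to begin at the start of the name,
# so only columns belonging to locus 1_25 are returned.
anchored <- grep(paste0("^", locus), cols, value = TRUE)

print(unanchored)
print(anchored)
```

With the unanchored pattern, the 1401_25 column is selected along with the 1_25 column, which is consistent with samples that are missing at 1_25 ending up NA at 1401_25. Note that `^` alone would still let `1_25` match a locus named, say, `1_250`; also anchoring on the separator that follows the locus name would be stricter, but that goes beyond what the fix described here does.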

@romunov closed this as completed in 865db3d on Dec 4, 2016

@thibautjombart
Owner

This is great, I especially like the added test, thanks for this!
