extract_form_esummary matrix cannot be cleanly written to csv #65

Closed
gadepallivs opened this issue Sep 16, 2015 · 5 comments

Comments

@gadepallivs

Hi David,
Below is the example. I do not understand why the text in title, fulljournalname, and pubtype extends into a second column.

library(rentrez)

PM.ID <- c("26287849", "25979833", "25667274", "25430497", "24968756", "24846037", "24296758", "24281417", "24128713", "24055406", "23489023")
p.data <- entrez_summary(db = "pubmed", id = PM.ID)
pubrecord.table <- extract_from_esummary(
  esummaries = p.data,
  elements = c("uid", "title", "fulljournalname", "pubtype", "volume", "issue",
               "pages", "lastauthor", "pmcrefcount", "issn", "pubdate"))
is(pubrecord.table) #  "matrix"         "array"          "structure"      "vector"         "vectorORfactor"
pubrecord.table <- t(pubrecord.table) # transpose so that each record becomes a row
write.csv(pubrecord.table, file = "test12.csv")
@dwinter changed the title from "extract_form_esummary results matrix of irrelevant column data" to "extract_form_esummary matrix cannot be cleanly written to csv" on Sep 17, 2015
@dwinter
Member

dwinter commented Sep 17, 2015

This is not really a problem with rentrez, just a property of NCBI records and R objects.

In this case, the pubtype field is variably-sized:

sapply(pubrecord.table[4,], length)
26287849 25979833 25667274 25430497 24968756 24846037 24296758 24281417 
       2        2        1        2        1        3        1        2 
24128713 24055406 23489023 
       1        2        2 
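
The same length check can be run over every field at once rather than only row 4. The following is a minimal sketch, assuming the untransposed layout (fields in rows, records in columns). It uses an invented stand-in matrix so that it runs offline; `recs`, `tab`, `len` and `ragged` are illustrative names, not part of rentrez.

```r
# Invented stand-in for esummary records (not real PubMed data).
recs <- list(
  "111" = list(uid = "111", title = "First paper",
               pubtype = c("Journal Article", "Review")),
  "222" = list(uid = "222", title = "Second paper",
               pubtype = "Journal Article"),
  "333" = list(uid = "333", title = "Third paper",
               pubtype = character(0))
)

# Build a list-matrix with sapply: one row per field, one column per record.
tab <- sapply(recs, function(r) r[c("uid", "title", "pubtype")])

# Length of every cell, reshaped back into the matrix layout.
len <- matrix(vapply(tab, length, integer(1)),
              nrow = nrow(tab), dimnames = dimnames(tab))

# Fields where at least one record does not hold exactly one value.
ragged <- rownames(len)[rowSums(len != 1L) > 0]
ragged
```

Any field listed in `ragged` is a candidate for collapsing before the matrix is written out.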

When you try to write the matrix, R represents the vectors the way you'd type them in (c(..., ...)), which adds commas that break the csv format.

In this case, you can collapse the vectors:

pubrecord.table[4,] <- sapply(pubrecord.table[4,], paste, collapse=" & ")

and unlist each matrix row to allow them to be written out

f <- tempfile()
write.csv( apply(pubrecord.table, 1, unlist), f)
re_read <- read.csv(f)
re_read$pmcrefcount
 [1]  0  1  3  2  1 26 10  4  3  2 21

@dwinter closed this as completed on Sep 17, 2015
@gadepallivs
Author

Hi david,
The solution above works for some PMID queries, but for others I still get an error. Depending on the PMID, the variable-length fields turn up in Title, Journal name, pubtype, or something else. I thought that simply removing the row number would fix the issue, but I get an error when trying to write the table in R Shiny:
pubrecord.table[,] <- sapply(pubrecord.table[,], paste, collapse=" & ")

Error in apply(pubrecord.reference, 1, unlist) : dim(X) must have a positive length
P.S. Why was extract_from_esummary designed to return a matrix? The data it extracts is a mix of character and numeric vectors, so a data frame would seem ideal for storing it, whereas a matrix is expected to hold data of a single type.
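
One possible way to extend the row-4 collapse to every field is sketched below. It is an assumption about what would help here, not something confirmed in this thread: `collapse_cells` is a hypothetical helper (not a rentrez function), and the input is invented mock data shaped like the untransposed matrix (fields in rows, records in columns). Empty cells become NA instead of being dropped, so the dimensions survive.

```r
# Invented stand-in for esummary records (not real PubMed data).
recs <- list(
  "111" = list(uid = "111", title = "Paper, with a comma",
               pubtype = c("Journal Article", "Review"), pmcrefcount = 3L),
  "222" = list(uid = "222", title = "Another paper",
               pubtype = "Journal Article", pmcrefcount = 0L),
  "333" = list(uid = "333", title = "Third paper",
               pubtype = character(0), pmcrefcount = 10L)
)
tab <- sapply(recs, function(r) r[c("uid", "title", "pubtype", "pmcrefcount")])

# Hypothetical helper: collapse each cell of a list-matrix to one string,
# then return a data.frame with one row per record.
collapse_cells <- function(m, sep = " & ") {
  stopifnot(is.matrix(m))
  flat <- vapply(m, function(cell) {
    if (length(cell) == 0L) NA_character_ else paste(cell, collapse = sep)
  }, character(1), USE.NAMES = FALSE)
  dim(flat) <- dim(m)            # vapply drops the matrix shape; restore it
  dimnames(flat) <- dimnames(m)
  as.data.frame(t(flat), stringsAsFactors = FALSE)
}

df <- collapse_cells(tab)

# Round trip through a csv file.
f <- tempfile(fileext = ".csv")
write.csv(df, f)
back <- read.csv(f, row.names = 1, stringsAsFactors = FALSE)
```

Because every cell is a single string by the time `write.csv` sees it, commas inside a value are quoted rather than splitting the column.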

@dwinter
Member

dwinter commented Oct 12, 2015

I'm not sure what you are trying to do in the example, but it seems like it's hitting empty fields?

extract_from_esummary is really a wrapper around sapply. It doesn't return data.frames because I think most users don't expect data.frame columns to contain vectors, like this:

df <- as.data.frame(t(pubrecord.table))
df$pubtype
$`26287849`
[1] "Journal Article"   "Multicenter Study"

$`25979833`
[1] "Journal Article"             "Randomized Controlled Trial"

$`25667274`
[1] "Journal Article"
.
.
.

Structured data like that would seem to fit a list better than a data.frame, and you can get that by setting simplify=FALSE.
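
As a rough illustration of the list shape that simplify=FALSE produces, here is a sketch that uses plain sapply on invented records, since the function is described above as a wrapper around sapply. `recs`, `as_list` and `n_types` are illustrative names, and the rentrez call in the final comment is untested here because it needs network access.

```r
# Invented stand-in for esummary records (not real PubMed data).
recs <- list(
  "111" = list(uid = "111", pubtype = c("Journal Article", "Review")),
  "222" = list(uid = "222", pubtype = "Journal Article")
)
fields <- c("uid", "pubtype")

# simplify = FALSE stops sapply from building a matrix: the result is a
# plain named list with one entry per record.
as_list <- sapply(recs, function(r) r[fields], simplify = FALSE)

# Each record's pubtype is still an ordinary character vector.
as_list[["111"]]$pubtype

# Number of publication types per record.
n_types <- vapply(as_list, function(r) length(r$pubtype), integer(1))

# With rentrez itself, the corresponding call should look like:
# extract_from_esummary(p.data, c("uid", "pubtype"), simplify = FALSE)
```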

@gadepallivs
Author

*Edited: found the issue*
Hi david, I noted the issue was with empty abstract fields for some entries.

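library(rentrez); library(XML)

PM.ID <- c("26391251", "26372702", "26372699", "26371045", "26338018", "26317919",
           "26315966", "26301800", "26301799", "26258891")
# The original call used pubmed.search$ids, which is not defined in this snippet
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
# One entry per article; NA where the article has no Abstract child
abstracts <- xpathApply(fetch.pubmed, "//PubmedArticle//Article",
                        function(x) xmlValue(xmlChildren(x)$Abstract))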

This results in NA for PMIDs whose abstracts are empty. But when the table is rendered with R Shiny, it fails to display: it just shows "Processing" and no table ever appears. I need to learn more about it.
This is not related to rentrez package.
Thank you
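
The NA entries described above can be normalized before the data reaches a table. The thread does not establish whether they are what causes the "Processing" symptom, so treat this as a sketch of the clean-up step only. The `abstracts` list is an invented stand-in for the xpathApply() result, and the "(no abstract)" placeholder is an arbitrary choice.

```r
# Invented stand-in for the list returned by xpathApply() (not real abstracts).
abstracts <- list("First abstract text.", NA_character_, "Third abstract text.")

# Flatten to a character vector, substituting a placeholder for missing entries.
abstract_chr <- vapply(abstracts, function(a) {
  if (length(a) == 0L || is.na(a[1])) "(no abstract)" else as.character(a[1])
}, character(1))

# A one-column data.frame with no missing values.
abstract_tbl <- data.frame(abstract = abstract_chr, stringsAsFactors = FALSE)
```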

@dwinter
Member

dwinter commented Oct 13, 2015

OK, good luck getting to the bottom of the shiny problem :)
