Hi Sean,
I have always had issues with the GEOquery package when running behind an annoying university proxy, which I "bypass" using cntlm.
Leaving the technical details aside, the issue is that the FTP directory listing that getAndParseGSEMatrices fetches gets turned into an HTML page somewhere between nih.gov and the R session (probably by the proxy or by cntlm), which causes getGEO to throw the following typical error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 6 elements
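For example, the error shows up with any ordinary getGEO() call from behind the proxy; the accession here is just an arbitrary illustration:
library(GEOquery)
# behind the proxy, the FTP listing comes back as HTML and read.table()/scan() chokes on it
gse <- getGEO("GSE2553", GSEMatrix = TRUE)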
I have a fix for this (see below); it would be really nice if it could be applied to GEOquery, so that I can work with GEO datasets :D
The idea is that, instead of always scanning for a plain-text table, we first check whether the result is an HTML document and, if so, parse the href strings that point to the matrix files to download.
Below I provide two alternative ways of processing the HTML content: a one-liner using the stringr package, and a slightly longer version if you don't want to depend on stringr.
The message should probably be dropped, or only shown in verbose mode.
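A minimal sketch of what I mean, assuming some verbose flag or argument is available to test against (verbose here is hypothetical; the exact mechanism is up to you):
# sketch only: `verbose` is a hypothetical flag, not an existing GEOquery argument
if (verbose) {
    message("# Processing HTML result page (behind a proxy?) ... ", appendLF = FALSE)
}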
Please let me know whether you think this fix could be applied any time soon.
Thank you.
Best,
Renaud
########### in getAndParseGSEMatrices
gdsurl <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/"
a <- getURL(sprintf(gdsurl, stub, GEO)) # getURL() from the RCurl package
if( grepl("^<HTML", a) ){ # the listing came back as HTML (rewritten by a proxy)
    message("# Processing HTML result page (behind a proxy?) ... ", appendLF = FALSE)
    # one-liner using stringr: pull the file names out of the href attributes,
    # as a one-column matrix (the file name is the second capture group)
    b <- as.matrix(stringr::str_match_all(a, "(href|HREF)=\\s*[\"']/[^\"']+/([^/]+)[\"']")[[1]][, 3L])
    # or, alternatively, the same extraction without depending on stringr (pick one of the two):
    sa <- gsub('HREF', 'href', a, fixed = TRUE) # normalise case so we only have to split on 'href'
    sa <- strsplit(sa, 'href', fixed = TRUE)[[1L]]
    pattern <- "^=\\s*[\"']/[^\"']+/([^/]+)[\"'].*"
    b <- as.matrix(gsub(pattern, "\\1", sa[grepl(pattern, sa)]))
    message('OK')
}else{ # standard processing of the plain-text FTP listing
    tmpcon <- textConnection(a, "r")
    b <- read.table(tmpcon)
    close(tmpcon)
}
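For what it's worth, a quick standalone check of the stringr extraction on a toy HTML listing of the kind the proxy hands back (the markup and file name are made up for the example):
# toy HTML listing standing in for what the proxy returns instead of the FTP table
a <- paste0('<HTML><BODY>',
            '<A HREF="/geo/series/GSE2nnn/GSE2553/matrix/GSE2553_series_matrix.txt.gz">file</A>',
            '</BODY></HTML>')
b <- as.matrix(stringr::str_match_all(a, "(href|HREF)=\\s*[\"']/[^\"']+/([^/]+)[\"']")[[1]][, 3L])
b
#      [,1]
# [1,] "GSE2553_series_matrix.txt.gz"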