
FTP urls translated to HTML behind firewall #1

Closed
seandavi opened this issue Nov 1, 2013 · 0 comments
seandavi commented Nov 1, 2013

Hi Sean,

I have always experienced issues with the GEOquery package when running behind an annoying university proxy, which I "bypass" using cntlm.
Leaving the technical details aside, the issue is that the GEO URL fetched by getAndParseGSEMatrices gets translated into an HTML page somewhere between nih.gov and the R session (probably by the proxy or cntlm), which causes getGEO to throw the following typical error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 6 elements

For example:

getURL('ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12008/matrix/')
[1] "\r\n<meta http-equiv=\"Content-Type\" content=\"text-html; charset=UTF-8\">\r\n\r\n<TITLE>FTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov. </TITLE>\r\n\r\n\r\nFTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov.\r\n\r\n\r\n                          <DIR> <A HREF=\"..\">..\r\n09/18/13 09:09AM [GMT] 719,285 <A HREF=\"/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL4134_series_matrix.txt.gz\">GSE12008-GPL4134_series_matrix.txt.gz\r\n09/18/13 09:09AM [GMT] 557,250 <A HREF=\"/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL6244_series_matrix.txt.gz\">GSE12008-GPL6244_series_matrix.txt.gz\r\n\r\n\r\n\r\n"

I have a fix for this (see below), which it would be really nice to have applied to GEOquery, so that I can work with GEO datasets :D

The idea is that instead of always scanning for a plain-text table, we first check whether the result is an HTML document and, if so, parse the href strings that point to the matrix files to download.
Below I provide two alternative ways of processing the HTML content: a one-liner using the stringr package, or a few more lines if you don't want to depend on stringr.
The message should probably be removed, or disabled when not in verbose mode.
Please let me know whether you think this fix can be applied any time soon.
Thank you.

Bests,
Renaud

########### in getAndParseGSEMatrices

gdsurl <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/"
a <- getURL(sprintf(gdsurl, stub, GEO))
if (grepl("^<HTML", a)) { # proxy returned an HTML listing: parse HREFs
    message("# Processing HTML result page (behind a proxy?) ... ", appendLF = FALSE)
    # one-liner using stringr: group 1 captures the file name after the last '/'
    b <- stringr::str_match_all(a, "(?i)href=\\s*[\"']/[^\"']+/([^/]+)[\"']")[[1]][, 2, drop = FALSE]

    # or, alternatively, without depending on stringr:
    sa <- gsub("HREF", "href", a, fixed = TRUE) # normalise case
    sa <- strsplit(sa, "href", fixed = TRUE)[[1L]]
    pattern <- "^=\\s*[\"']/[^\"']+/([^/]+)[\"'].*"
    b <- as.matrix(gsub(pattern, "\\1", sa[grepl(pattern, sa)]))
    #

    message("OK")
} else { # standard processing of the plain-text listing
    tmpcon <- textConnection(a, "r")
    b <- read.table(tmpcon)
    close(tmpcon)
}
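For reference, the HREF-extraction idea can be demonstrated standalone in base R (a minimal sketch using only regmatches/gregexpr, with a hypothetical `html` string mimicking the proxy's listing above; it is not part of GEOquery itself):

```r
# Hypothetical sample of the HTML directory listing returned by the proxy
html <- paste0(
  '<HTML><TITLE>FTP directory</TITLE>',
  '<A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL4134_series_matrix.txt.gz">x</A>',
  '<A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL6244_series_matrix.txt.gz">y</A>'
)

# Case-insensitive match of href="..."; group 1 is the file name after the last '/'
pattern <- "(?i)href=\\s*[\"']/[^\"']+/([^/]+)[\"']"
m <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
files <- sub(pattern, "\\1", m, perl = TRUE)
print(files)
```

This should print the two series-matrix file names, which is exactly what `b` holds in the patch above.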
