CTGOV2 API call does not filter by study sponsor #32

Closed
frederikziebell opened this issue Oct 21, 2023 · 7 comments

frederikziebell commented Oct 21, 2023

Consider the following example. With the old API (CTGOV), the sponsor filter is respected, whereas with the new one (CTGOV2), it is ignored and all studies would be downloaded.

library("tibble")
library("ctrdata")

q1 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV")

ctrLoadQueryIntoDb(
  queryterm = q1,
  only.count = TRUE
)$n
# 5639

q2 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV2")

ctrLoadQueryIntoDb(
  queryterm = q2,
  only.count = TRUE
)$n
# 470145
rfhb self-assigned this Oct 22, 2023
Owner

rfhb commented Oct 22, 2023

Thanks for reporting, nice catch! Fix now available:

  • Corrected the translation of some fields from browser URL to API call for CTGOV2, including sponsor and location; added tests for the translation of all parameters
  • Please try 0447b0a with devtools::install_github("rfhb/ctrdata")
  • The fix will be included in the next release within the next few days

As an aside, in case you did not know and have no particular need for the tibble, this also works:

library("ctrdata")
ctrLoadQueryIntoDb(
  queryterm = "spons=NameOfSponsor",
  register = "CTGOV2",
  only.count = TRUE
)$n

@frederikziebell
Author

Thanks, also for pointing out the shorter syntax; it's working now. For some companies, however, I see differences in the number of returned results between the two registers:

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV",
  only.count = TRUE
)$n
# 2352

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV2",
  only.count = TRUE
)$n
# 2347

But I don't know if that's because the new API accesses the data differently from the CTGOV database, or if it's an issue with ctrdata.

Owner

rfhb commented Oct 24, 2023

Thanks. You get the same numbers when opening these search queries in the browser, as shown below. I have no explanation for this and can only speculate that the two backends use different matching processes. Try modifying the sponsor name in the browser to see the different expansions offered.

ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV")
ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV2")

Nevertheless, it is straightforward to generate a list of the set difference, as follows:

library("ctrdata")
# Load the results of both registers into the same collection
dbc <- nodbi::src_sqlite(collection = "temp")
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)
ctgov2Trials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV2", con = dbc)
# Get lead sponsor and title for all loaded trials
trialsSet <- dbGetFieldsIntoDf(c("sponsors.lead_sponsor.agency", "brief_title"), con = dbc)
# Keep the trials that CTGOV returned but CTGOV2 did not
trialsSet[trialsSet[["_id"]] %in% setdiff(ctgovTrials[["success"]], ctgov2Trials[["success"]]), ]

which returns

# A tibble: 5 × 3
  `_id`       sponsors.lead_sponsor.agency brief_title                                                
  <chr>       <chr>                        <chr>                                                      
1 NCT02135354 Wim Janssens                 Azithromycin for Acute Exacerbations Requiring Hospitaliza…
2 NCT02205242 Wim Janssens                 BACE Trial Substudy 1 - PROactive Substudy                 
3 NCT02205255 Wim Janssens                 BACE Trial Substudy 2 - FarmEc Substudy                    
4 NCT02332122 Wim Janssens                 Detection of Aspergillus Fumigatus and Sensitization in CO…
5 NCT05008081 Wim Janssens                 The CATALINA Study  

There you have it: possibly CTGOV uses a partial string match and CTGOV2 matches differently; see e.g. https://clinicaltrials.gov/data-about-studies/search-areas#SponsorSearch
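
The effect can be illustrated with a minimal base R sketch. It assumes, purely for illustration, that one backend matches the search term as a substring and the other matches whole words only; the actual matching rules of CTGOV and CTGOV2 are not confirmed here. Apart from "Wim Janssens", which appears in the output above, the sponsor names are illustrative.

```r
# Illustrative sponsor names; only "Wim Janssens" is taken from the output above
sponsors <- c("Janssen Research & Development, LLC", "Wim Janssens", "Pfizer")

# Assumption: substring match, so "Janssen" is found inside "Janssens"
substring_match <- grepl("Janssen", sponsors, fixed = TRUE)

# Assumption: whole-word match, so "Janssens" counts as a different word
word_match <- grepl("\\bJanssen\\b", sponsors, perl = TRUE)

# Names matched only under the substring rule
sponsors[substring_match & !word_match]
# [1] "Wim Janssens"
```

Under these assumptions, the substring rule returns one additional sponsor, which is the kind of difference seen between the two counts.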

@frederikziebell
Author

Thanks for the clarification. Btw, I get an error with the latest devel build and your example:

dbc <- nodbi::src_sqlite(collection = "temp")
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)

gives

Not overruling register label CTGOV
* Found search query from CTGOV: spons=Janssen
Checking helper binaries: . . . done
Warning: Database not persisting* Checking trials in CTGOV classic...
Retrieved overview, records of 2352 trial(s) are to be downloaded (estimate: 19 MB)
(1/3) Downloading trial file...
Error in handle_setopt(h, ...) : Unknown option: multiplex

The call to ctrLoadQueryIntoDb() with only.count = TRUE works, so I guess the issue concerns multiplexed downloading.

Should I open a separate issue for that?

Owner

rfhb commented Oct 25, 2023

Thanks. Could you please update the R package curl? Version 5.1.0 does not trigger this error; I will specify this as a requirement.
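
A quick way to check the installed curl version before retrying is sketched below. The threshold 5.1.0 is simply the version reported above as working; whether earlier versions also work is not stated, and the variable names are hypothetical.

```r
# Version reported above as not triggering the error (assumed threshold)
min_curl <- "5.1.0"

# TRUE only if curl is installed and at least at the assumed version
curl_ok <- requireNamespace("curl", quietly = TRUE) &&
  utils::packageVersion("curl") >= min_curl

if (!curl_ok) {
  message("Please update curl, e.g. with install.packages(\"curl\")")
}
```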


machado-t commented Nov 4, 2023

Somewhat unrelated, but I'll leave it here for future reference.
I was getting this error with CTGOV2:
* Checking trials using CTGOV API 2.0.0.-test...Warning: Error in curl::curl_fetch_memory: Timeout was reached: [www.clinicaltrials.gov] Resolving timed out after 10011 milliseconds
... which was apparently also solved by updating curl.

Edit: Actually unrelated to curl update. Not sure why, but I'm getting this sometimes.

Owner

rfhb commented Nov 4, 2023

Indeed completely unrelated to ctrdata, possibly a network or server issue.

rfhb closed this as completed Nov 21, 2023