Converting retrieve_data() results to a data frame (tibble) #100

Open
jfy133 opened this issue Nov 29, 2018 · 7 comments

@jfy133

jfy133 commented Nov 29, 2018

First, I want to say thank you for this package. I'm working on some metagenomic data with lots of 'unusual' taxa, and finding a good (accessible) database to get a quick summary of their characteristics has been surprisingly difficult.

This package saved me a lot of headaches trying to 'manually' parse the API search results myself.

I have neither a bug nor feature request, rather just some info which might be useful for others.

You can use a sequence of tidyverse tools to convert the results from the BacDiveR::retrieve_data() function to a clean(ish) table format using the following code:

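## load the tidyverse packages used below (%>%, bind_rows, gather, separate, distinct)
library(dplyr)
library(tidyr)

## get some search results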
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")

## convert list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%
  distinct()

#>Warning message:
#>Expected 4 pieces. Missing pieces filled with `NA` in 144 rows [72, 73, 74, 75, 76, 77, 156, 157, 158, 159, 160, 161, 250, 251, 252, 253, 254, 255, 461, 462, ...]. 

## print final table
data_bacdive_tib

#># A tibble: 18,555 x 6
#>   bacdive_id section       subsection      field              key   value           
#>   <chr>      <chr>         <chr>           <chr>              <chr> <chr>           
#> 1 2654       taxonomy_name strains_tax_PNU species_epithet    NA    mortiferum      
#> 2 2654       taxonomy_name strains_tax_PNU subspecies_epithet NA    NA              
#> 3 2654       taxonomy_name strains_tax_PNU is_type_strain     NA    FALSE           
#> 4 2654       taxonomy_name strains_tax_PNU domain             NA    Bacteria        
#> 5 2654       taxonomy_name strains_tax_PNU phylum             NA    Fusobacteria    
#> 6 2654       taxonomy_name strains_tax_PNU class              NA    Fusobacteriia   
#> 7 2654       taxonomy_name strains_tax_PNU ordo               NA    NA              
#> 8 2654       taxonomy_name strains_tax_PNU family             NA    Fusobacteriaceae
#> 9 2654       taxonomy_name strains_tax_PNU status_fam         NA    NA              
#>10 2654       taxonomy_name strains_tax_PNU genus              NA    Fusobacterium   

As far as I can see from the table for the search above, the only issue is that the references field is not correctly formatted: its entries are placed in the subsection column rather than the field column (hence the 'NA' warning), because in the original results references is a data frame rather than a list.

This worked for me using BacDiveR_0.7.0
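
A toy illustration of why the references rows come out one piece short: unlist() builds the dotted names from the nesting path, and a data frame sitting directly under a section contributes only its column names (plus a running index), which is one level fewer than the nested lists. The `toy` object and its values are made up for illustration and only mimic the shape seen in the output above; this is not actual BacDiveR output.

```r
# Hypothetical stand-in for a single entry of the retrieve_data() result.
toy <- list(
  "2654" = list(
    taxonomy_name = list(
      strains_tax_PNU = list(genus = "Fusobacterium",
                             species_epithet = "mortiferum")
    ),
    # a data frame directly under the section, as with "references"
    references = data.frame(
      ID_reference = c(626L, 20215L),
      reference = c("ref A", "ref B"),
      stringsAsFactors = FALSE
    )
  )
)

# flatten to a named character vector, as in the pipe above
flat <- unlist(toy)

# number of dot-separated pieces in each flattened name
n_pieces <- lengths(strsplit(names(flat), ".", fixed = TRUE))
print(data.frame(name = names(flat), n_pieces = n_pieces))
```

In this toy case the taxonomy names split into four pieces and the references names into three, so separate() has to pad the last column with NA for the latter.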

@katrinleinweber
Collaborator

Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.

@katrinleinweber
Collaborator

If that NA problem in the reference dataframe (and possibly others) can be solved, that is.

Is your grouping into = c("bacdive_id", "section", "subsection", "field", "key") very specific to your application or data analysis? Or do you consider it general?

@jfy133
Author

jfy133 commented Nov 30, 2018

I think the reference metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment so I didn't invest time in solving it.

The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for separate() to spread over, but I never needed more than 4 metadata columns (after the bacdive_id).

I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"); I'm not familiar with the rest of the database, so I don't know whether any other metadata can appear.

But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).

Edit: the only issue with converting to a tibble using the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one would maybe have to switch away from tidyverse functions (and convert to a tibble after unnesting and separating).
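
For what it's worth, a tidyverse-free variant of the name-splitting step might look roughly like the sketch below. It is untimed, so no speed-up is claimed. `flat` is a hand-made stand-in for the output of unlist(), and `split_names()` is a hypothetical helper, not part of BacDiveR.

```r
# Hand-made stand-in for unlist(data_bacdive_raw); names and values are illustrative.
flat <- c(
  "2654.taxonomy_name.strains_tax_PNU.genus" = "Fusobacterium",
  "2654.references.ID_reference1" = "626"
)

# Hypothetical helper: split dotted names into a fixed set of columns (base R only).
split_names <- function(x, into) {
  parts <- strsplit(names(x), ".", fixed = TRUE)
  # p[seq_along(into)] pads short names with NA and drops any extra pieces
  mat <- t(vapply(parts, function(p) p[seq_along(into)],
                  character(length(into))))
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- into
  out$value <- unname(x)
  out
}

tab <- split_names(flat, c("bacdive_id", "section", "subsection", "field"))
print(tab)
```

Like separate(), this pads short names on the right, so the references rows still end up with NA in the field column.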

@katrinleinweber
Collaborator

Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above %>%-line example ;-)

Looking into these NAs, I find that for example the ID_reference field appears in several nesting "depths":

> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame':	1 obs. of  2 variables:
 $ history     : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
 $ ID_reference: int 626

> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame':	3 obs. of  2 variables:
 $ ID_reference: int  626 20215 20218
 $ reference   : chr  "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__

[Image: screen shot 2018-11-30 at 21 15 14]

This causes a "left-/up-ward shift/creep" of the NAs in the tibble:

[Image: screen shot 2018-11-30 at 21 13 21]

Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?

@jfy133
Author

jfy133 commented Dec 1, 2018

Indeed - for an average search the server is still the slowest step, taking longer than the 'table-isation' itself.

Yes, screenshot 2 is exactly what I mean.

I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the

separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%

could be conditional, e.g. if the second piece of the unlisted string in grouped_category matches "references" (like in lines 15-17), it could be separated across just c("bacdive_id", "section", "field").

This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.

@jfy133
Author

jfy133 commented Dec 1, 2018

I just realised the 'key' column is left over from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus the following should have the correct columns and also include the condition for correcting the references lines:

## load the tidyverse packages used below
library(dplyr)
library(tidyr)

## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")

## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))

## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))

#># A tibble: 144 x 5
#>   bacdive_id section    subsection   field value                                                                                                                           
#>   <chr>      <chr>      <chr>        <chr> <chr>                                                                                                                           
#> 1 2654       references ID_referenc… NA    626                                                                                                                             
#> 2 2654       references ID_referenc… NA    20215                                                                                                                           
#> 3 2654       references ID_referenc… NA    20218                                                                                                                           
#> 4 2654       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295               
#> 5 2654       references reference2   NA    "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654       references reference3   NA    Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758       references ID_referenc… NA    9019                                                                                                                            
#> 8 5758       references ID_referenc… NA    20215                                                                                                                           
#> 9 5758       references ID_referenc… NA    20218                                                                                                                           
#>10 5758       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699   
#> # ... with 134 more rows


## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>% 
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))

## to show ID_references now correctly not in subsection

data_bacdive_tib_fixed %>% filter(is.na(field))

#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>

data_bacdive_tib_fixed %>% filter(is.na(subsection))

## shows ID_references now correctly in field
#># A tibble: 144 x 5
#>   bacdive_id section    subsection field      value                                                                                                                        
#>   <chr>      <chr>      <chr>      <chr>      <chr>                                                                                                                        
#> 1 2654       references NA         ID_refere… 626                                                                                                                          
#> 2 2654       references NA         ID_refere… 20215                                                                                                                        
#> 3 2654       references NA         ID_refere… 20218                                                                                                                        
#> 4 2654       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295            
#> 5 2654       references NA         reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654       references NA         reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758       references NA         ID_refere… 9019                                                                                                                         
#> 8 5758       references NA         ID_refere… 20215                                                                                                                        
#> 9 5758       references NA         ID_refere… 20218                                                                                                                        
#>10 5758       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699           
#># ... with 134 more rows

Apologies for the confusion. I should've put this caveat in my original message: written after dealing with a teething baby all day, may not make 100% sense.

@katrinleinweber pinned this issue Dec 15, 2018
@katrinleinweber
Collaborator

Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.
