This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

Parsing occurrence text files in DwC archive #25

Closed
damianooldoni opened this issue Dec 13, 2018 · 11 comments

@damianooldoni

After a year of working with GBIF data in R and repeatedly running into problems importing occurrence text files correctly, I ended up writing a gist collecting most of the column types I had problems with: type_GBIF_occurrence_fields.R. I discussed with colleagues whether it would be useful to put it in our project package. But, as suggested here trias-project/trias#25 (comment), why not pitch it to the authors of finch? 👍
The typical issue when opening such files is that some DwC fields (columns) are NA for thousands of rows before the first real value appears. This causes parsing failures, because R assigns type logical to these fields (columns). My first solution was to increase the value of the guess_max parameter, but this is unfeasible for big files, and it is just a workaround anyway.
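
A minimal sketch of the alternative to raising guess_max: declaring the column type up front, so that nothing has to be guessed from the first rows. It assumes the file is read with readr; the toy file and its two columns are made up for illustration and are not taken from the gist.

```r
library(readr)

# Hypothetical toy file: eventRemarks is empty for the first 2000 rows
# and only gets a real value on the very last row.
tmp <- tempfile(fileext = ".txt")
n_empty <- 2000
writeLines(
  c(
    "occurrenceID\teventRemarks",
    paste0("occ-", seq_len(n_empty), "\t"),
    paste0("occ-", n_empty + 1, "\tcollected at night")
  ),
  tmp
)

# Declare the types explicitly instead of letting readr guess them
occ <- read_tsv(
  tmp,
  col_types = cols(
    occurrenceID = col_character(),
    eventRemarks = col_character()
  )
)

class(occ$eventRemarks)
```

With explicit col_types the result does not depend on how many leading rows are empty, which is what makes it more robust than tuning guess_max per file.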

@peterdesmet

Indeed! Basically, can finch read Darwin Core Archives with expected types (date, character, integer) for each Darwin Core term? @niconoe have you implemented something like that in python-dwca-reader?

@pieterprovoost

pieterprovoost commented Dec 13, 2018

Yes, you can pass colClasses to data.table::fread:

# Default: the type of footprintWKT is guessed when reading
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE)$data$occurrence.txt
class(occurrence$footprintWKT)

# With colClasses: footprintWKT is read as character
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE, colClasses = c(footprintWKT = "character"))$data$occurrence.txt
class(occurrence$footprintWKT)

@damianooldoni
Author

Thanks for the nice example. My question actually goes a step further: what about saving the right column types in the finch repo and using them as defaults in finch::dwca_read()?
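
One way such stored defaults could work, as a rough sketch only: a named vector maps Darwin Core terms to R classes, and a reader applies it to whichever of those columns the file actually contains. The names dwc_default_types and read_occurrence are hypothetical, the type choices are illustrative, and base R's read.delim is used so the sketch has no dependencies. None of this is finch's actual code.

```r
# Hypothetical default types for a few Darwin Core terms (illustration only)
dwc_default_types <- c(
  occurrenceID = "character",
  eventDate = "character",
  individualCount = "integer",
  decimalLatitude = "numeric",
  decimalLongitude = "numeric",
  footprintWKT = "character"
)

# Hypothetical helper: read a tab-delimited occurrence file and apply the
# defaults only to the columns that are present in its header
read_occurrence <- function(path, types = dwc_default_types) {
  header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]
  utils::read.delim(
    path,
    colClasses = types[intersect(names(types), header)],
    quote = "",
    na.strings = c("", "NA"),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Toy file: footprintWKT is empty on the first data row
tmp <- tempfile(fileext = ".txt")
writeLines(
  c(
    "occurrenceID\tindividualCount\tfootprintWKT",
    "occ-1\t3\t",
    "occ-2\t\tPOINT (3.2 51.3)"
  ),
  tmp
)
occ <- read_occurrence(tmp)
sapply(occ, class)
```

Restricting the defaults to the columns found in the header avoids warnings about terms that a given archive does not use.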

@niconoe

niconoe commented Dec 13, 2018

@peterdesmet : we currently don't manage types in python-dwca-reader, everything is just assumed to be a string and the conversions are left to the data user. That seemed the simplest sensible approach at the time.

I do see the added value for users however, so I just opened a new issue (BelgianBiodiversityPlatform/python-dwca-reader#76) with some considerations about it, so I won't forget to think about it :)

@sckott
Contributor

sckott commented Dec 13, 2018

thanks for this @damianooldoni

definitely seems reasonable to include the column types within this package and use them - do you want to make a PR?

@stijnvanhoey

stijnvanhoey commented Dec 14, 2018

Having a default option would indeed be a good idea when using the package for data analysis. Still, as I also mentioned in the python-dwca-reader repo, I would keep interpretation of the dtypes an option and not the default. Having all columns as strings/characters on input can be beneficial when, for example, you want to validate the input and have full control over the dtype properties (e.g. our work with whip).
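
A small sketch of that strings-first workflow in base R, under the assumption that validation happens before any conversion. The toy file and the validation rule (individualCount must be a plain non-negative integer) are invented for illustration; this is not how whip works.

```r
# Toy file with one invalid and one missing individualCount
tmp <- tempfile(fileext = ".txt")
writeLines(
  c(
    "occurrenceID\tindividualCount",
    "occ-1\t3",
    "occ-2\tthree",
    "occ-3\t"
  ),
  tmp
)

# Read every column as character so nothing is coerced on input
raw <- utils::read.delim(
  tmp,
  colClasses = "character",
  quote = "",
  na.strings = "",
  check.names = FALSE
)

# Validate first: flag non-missing values that are not plain integers
bad <- !is.na(raw$individualCount) & !grepl("^[0-9]+$", raw$individualCount)
raw$occurrenceID[bad]

# Convert only the values that passed validation
counts <- rep(NA_integer_, nrow(raw))
counts[!bad] <- as.integer(raw$individualCount[!bad])
counts
```

Because the raw strings are kept, the offending record can be reported as it appeared in the file rather than silently turning into NA.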

@sckott
Contributor

sckott commented Dec 14, 2018

i can see the advantage of that @stijnvanhoey to have all strings - we could have a parameter to toggle this, where you get all strings or apply the types above
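
A sketch of what such a toggle could compute, assuming the reader accepts a named colClasses vector. The helper col_classes_for, its typed argument, and the two-entry type table are all hypothetical and only illustrate the idea.

```r
# Hypothetical type table: names are Darwin Core terms, values are R classes
dwc_types <- c(individualCount = "integer", decimalLatitude = "numeric")

# Hypothetical helper: build the colClasses value for a given header.
# typed = FALSE gives "character" for every column; typed = TRUE applies the
# table and falls back to "character" for terms it does not list.
col_classes_for <- function(header, typed = TRUE, types = dwc_types) {
  classes <- setNames(rep("character", length(header)), header)
  if (typed) {
    known <- intersect(header, names(types))
    classes[known] <- types[known]
  }
  classes
}

header <- c("occurrenceID", "individualCount", "decimalLatitude")
col_classes_for(header, typed = FALSE)
col_classes_for(header, typed = TRUE)
```

Which value the toggle should default to is exactly the open question in this thread, so the default here is arbitrary.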

@sckott
Contributor

sckott commented Apr 23, 2019

> definitely seems reasonable to include the column types within this package and use them - do you want to make a PR?

@damianooldoni curious about my above question ^^

sckott added this to the v0.3 milestone Apr 23, 2019

@damianooldoni
Author

Yes, @sckott. I think it's a good idea. I am still trying to find some free time to finish my other PR. But yes, this should be done as well, as I find it very important to avoid frustration and errors while importing occurrence files. I will ping you very soon.

@sckott
Contributor

sckott commented Apr 24, 2019

okay, thanks

sckott removed this from the v0.3 milestone Apr 24, 2019

@maelle
Contributor

maelle commented Sep 9, 2022

This package is going to be archived.

maelle closed this as completed Sep 9, 2022