Parsing occurrence text files in DwC archive #25

damianooldoni · 2018-12-13T09:20:53Z

After a year of working with GBIF data in R and repeatedly running into problems importing occurrence text files correctly, I ended up writing a gist collecting most of the column types I had problems with: type_GBIF_occurrence_fields.R. I discussed with colleagues whether it would be useful to put it in our project package. But, as suggested here trias-project/trias#25 (comment), why not pitch it to the authors of finch? 👍
The typical issue when opening such files is that some DwC fields (columns) are NA for thousands of rows before the first real value appears. This causes parsing failures because R assigns type logical to these fields (columns). My first solution was to increase the value of the guess_max parameter, but this is unfeasible for big files, and it is only a workaround anyway.
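
To make the failure mode concrete, here is a minimal sketch. It uses base R's read.delim with nrows to mimic a reader that guesses types from the first rows only, so it stays dependency-free; it is not readr's actual guessing algorithm, and the file contents are made up.

```r
# Made-up, tab-delimited stand-in for an occurrence.txt file.
# footprintWKT is empty until the last row.
lines <- c(
  "occurrenceID\tindividualCount\tfootprintWKT",
  "occ-1\t1\t",
  "occ-2\t2\t",
  "occ-3\t5\tPOINT (3.2 51.3)"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Mimic a reader that only inspects the first rows:
# the still-empty column is typed logical.
first_rows <- read.delim(path, nrows = 2, na.strings = "", quote = "",
                         stringsAsFactors = FALSE)
class(first_rows$footprintWKT)

# With every row available, the same column is typed character.
all_rows <- read.delim(path, na.strings = "", quote = "",
                       stringsAsFactors = FALSE)
class(all_rows$footprintWKT)
```

The larger the file, the more rows a guessing reader has to inspect before it sees a real value, which is why raising the guess limit does not scale.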

peterdesmet · 2018-12-13T09:28:59Z

Indeed! Basically, can finch read Darwin Core Archives with expected types (date, character, integer) for each Darwin Core term? @niconoe have you implemented something like that in python-dwca-reader?

pieterprovoost · 2018-12-13T09:47:21Z

Yes, you can pass colClasses to data.table::fread:

# Default read: the type of footprintWKT is guessed
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE)$data$occurrence.txt
class(occurrence$footprintWKT)

# Same read, forcing footprintWKT to character via colClasses
occurrence <- finch::dwca_read("http://ipt.vliz.be/eurobis/archive.do?r=manuela_uy&v=1.0", read = TRUE, colClasses = c(footprintWKT = "character"))$data$occurrence.txt
class(occurrence$footprintWKT)

damianooldoni · 2018-12-13T10:04:01Z

Thanks for the nice example. My question actually goes a step further: what about saving the right column types in the finch repo and using them as defaults in finch::dwca_read()?
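
A rough sketch of what such stored defaults could look like, assuming a named vector that maps Darwin Core terms to R classes. The name dwc_col_classes and the choice of classes are hypothetical and only cover a handful of terms; the example uses base R's read.delim rather than finch or data.table so it runs without extra packages.

```r
# Hypothetical, partial lookup of Darwin Core terms to R classes.
# eventDate stays character here because it may hold ranges or partial dates.
dwc_col_classes <- c(
  occurrenceID = "character",
  eventDate = "character",
  individualCount = "integer",
  decimalLatitude = "numeric",
  decimalLongitude = "numeric",
  footprintWKT = "character"
)

# Made-up occurrence file.
lines <- c(
  paste("occurrenceID", "eventDate", "individualCount",
        "decimalLatitude", "decimalLongitude", "footprintWKT", sep = "\t"),
  "occ-1\t2018-01-01\t\t51.3\t3.2\t",
  "occ-2\t2018-01-02\t4\t51.4\t3.3\tPOINT (3.3 51.4)"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Only pass the terms present in this file; other columns are still guessed.
header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]
classes <- dwc_col_classes[intersect(names(dwc_col_classes), header)]

occurrence <- read.delim(path, colClasses = classes, na.strings = "",
                         quote = "", stringsAsFactors = FALSE)
str(occurrence)
```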

niconoe · 2018-12-13T10:32:24Z

@peterdesmet : we currently don't manage types in python-dwca-reader, everything is just assumed to be a string and the conversions are left to the data user. That seemed the simplest sensible approach at the time.

I do see the added value for users, however, so I just opened a new issue (BelgianBiodiversityPlatform/python-dwca-reader#76) with some considerations about it, so I won't forget to think about it :)

sckott · 2018-12-13T18:54:51Z

thanks for this @damianooldoni

definitely seems reasonable to include the column types within this package and use them - do you want to make a PR?

stijnvanhoey · 2018-12-14T07:48:59Z

Having a default option would indeed be a good idea when using the package for data analysis. Still, as I also mentioned in the python-dwca-reader repo, I would keep interpretation of the dtypes an option and not the default. Having all columns as strings/characters on input can be beneficial when, for example, you want to validate the input and have full control over the dtype properties (e.g. our work with whip).
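
As an illustration of why string input helps with validation, here is a small sketch on made-up data. It is not how whip works; the rule (individualCount must be a plain non-negative integer) is an assumption chosen for the example.

```r
# Made-up values as they would arrive when every column is read as character.
occurrence <- data.frame(
  occurrenceID = c("occ-1", "occ-2", "occ-3"),
  individualCount = c("3", "many", ""),
  stringsAsFactors = FALSE
)

# Flag non-empty values that are not plain non-negative integers.
is_empty <- occurrence$individualCount == ""
is_valid <- grepl("^[0-9]+$", occurrence$individualCount)
invalid <- occurrence[!is_empty & !is_valid, ]
invalid$occurrenceID

# Convert only after validation; invalid and empty values become NA.
counts <- ifelse(is_valid, occurrence$individualCount, NA_character_)
occurrence$individualCount <- as.integer(counts)
```

Had the column been coerced to integer on input, the original value "many" would already be lost and could not be reported back to the data publisher.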

sckott · 2018-12-14T17:41:07Z

i can see the advantage of that @stijnvanhoey to have all strings - we could have a parameter to toggle this, where you get all strings or apply the types above
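
A minimal sketch of such a toggle, assuming a hypothetical helper read_occurrence() with a typed argument. Neither the function nor the argument exists in finch, and the type lookup is deliberately tiny; base R's read.delim stands in for the actual reader.

```r
# Hypothetical helper, not part of finch: `typed` toggles between
# all-character columns and an illustrative set of Darwin Core classes.
read_occurrence <- function(path, typed = FALSE) {
  dwc_col_classes <- c(individualCount = "integer",
                       decimalLatitude = "numeric")
  header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]

  # Start with every column as character.
  classes <- rep("character", length(header))
  names(classes) <- header

  # Optionally override the terms we have a known class for.
  if (typed) {
    known <- intersect(names(dwc_col_classes), header)
    classes[known] <- dwc_col_classes[known]
  }
  read.delim(path, colClasses = classes, na.strings = "", quote = "",
             stringsAsFactors = FALSE)
}

# Made-up occurrence file.
path <- tempfile(fileext = ".txt")
writeLines(c("occurrenceID\tindividualCount\tdecimalLatitude",
             "occ-1\t2\t51.3"), path)

as_strings <- read_occurrence(path)
as_typed <- read_occurrence(path, typed = TRUE)
sapply(as_strings, class)
sapply(as_typed, class)
```

Defaulting typed to FALSE here follows the suggestion to keep type interpretation optional; flipping the default would be a one-word change.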

sckott · 2019-04-23T17:31:38Z

> definitely seems reasonable to include the column types within this package and use them - do you want to make a PR?

@damianooldoni curious about my above question ^^

damianooldoni · 2019-04-24T06:08:32Z

yes, @sckott. I think it's a good idea. I was still trying to find some free time to finish my other PR. But yes, this should be done as well, as I find it very important for avoiding frustration and errors when importing occurrence files. I'll ping you very soon.

sckott · 2019-04-24T19:01:16Z

okay, thanks

maelle · 2022-09-09T09:05:27Z

This package is going to be archived.

niconoe mentioned this issue Dec 13, 2018

Assign column types (instead of considering everything is a string) BelgianBiodiversityPlatform/python-dwca-reader#76

Open

damianooldoni mentioned this issue Dec 13, 2018

Specify type columns occurrence data beforehand trias-project/trias#25

Open

sckott added this to the v0.3 milestone Apr 23, 2019

sckott removed this from the v0.3 milestone Apr 24, 2019

maelle closed this as completed Sep 9, 2022
