st_write loses data when writing to ESRI Shapefile #464

Closed
rungec opened this issue Aug 18, 2017 · 8 comments

Comments

@rungec

rungec commented Aug 18, 2017

When I write a shapefile using
st_write(shp1, "myshp.shp", driver="ESRI Shapefile")
and the column names in the attribute table are too long for ESRI, the output .dbf shortens these column names BUT ALSO deletes any data in those columns. I have tested this with integer and numeric data: columns write fine if the column names are < 10 characters, but end up blank if the column names are > 10 characters.

@rsbivand
Member

rsbivand commented Aug 18, 2017

The field name length restriction is a known feature of shapefiles, and dates way back (10 is generous, MS-DOS had 8 as maximum in file names). Migrate to GPKG and other more modern formats, or manually shorten and disambiguate field names before writing, for example using base::abbreviate().
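
As a rough sketch of the manual route, assuming made-up column names (`long_names` below is hypothetical, not taken from the report), field names could be shortened with base::abbreviate() and checked before writing:

```r
# Hypothetical field names; two of them exceed the 10-character limit
long_names <- c("population_density", "population_total", "id")

# Replace dots with underscores so the names survive the DBF conversion
clean_names <- gsub(".", "_", long_names, fixed = TRUE)

# strict = TRUE caps every result at `minlength` characters,
# at the cost of no longer guaranteeing unique abbreviations
short_names <- abbreviate(clean_names, minlength = 10,
                          strict = TRUE, named = FALSE)

# So check length and uniqueness explicitly before assigning the new names
names_ok <- all(nchar(short_names) <= 10) && !anyDuplicated(short_names)
```

If `names_ok` is FALSE, the remaining clashes would need to be resolved by hand before the names are assigned to the object.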

@dmi3kno

dmi3kno commented Aug 24, 2017

Can we check and abbreviate column names before attempting st_write with the "ESRI Shapefile" driver? The driver will do it anyway, causing data loss. We could warn users that the field names have been abbreviated to comply with the ESRI driver's limitations.
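
A minimal sketch of such a check, assuming a hypothetical helper `warn_long_fields()` that is not part of sf and a limit of 10 bytes per field name:

```r
# Hypothetical helper, not part of sf: warn about field names that the
# "ESRI Shapefile" driver would have to shorten
warn_long_fields <- function(field_names, max_len = 10) {
  too_long <- field_names[nchar(field_names, type = "bytes") > max_len]
  if (length(too_long) > 0) {
    warning("Field names longer than ", max_len, " bytes: ",
            paste(too_long, collapse = ", "), call. = FALSE)
  }
  # Return the offending names invisibly so callers can act on them
  invisible(too_long)
}
```

It could be called on the attribute column names just before writing; what to do with the offending names (abbreviate, stop, or only warn) is left open here.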

@obrl-soil

Sp uses base::abbreviate to automatically handle this issue, and I see no reason why sf can't do the same thing. If not, there needs to be a clear warning on st_write that long column names = data loss when the output format is shp.

@rsbivand
Member

Not sp, rgdal::writeOGR(). Shapefiles were the only option then; maptools::writeSpatial() did this through foreign::write.dbf(), whose helpfile says:

 Dots in column names are replaced by underlines in the DBF file,
 and names are truncated to 11 characters.

Why help people to (ab)use shapefiles when we want them to migrate?

@obrl-soil

Ah, so it is. Well, I'd love to migrate, but I'm stuck in a very ESRI-centric workplace with change-averse colleagues, so moving on is a bit of a fraught process. Without broader institutional support, I'm just That Coworker. The other issue for now is gpkg's slow disk write speed, which can be very inconvenient.

@tim-salabim
Member

tim-salabim commented Aug 25, 2017

I unfortunately second @obrl-soil's comment, from a non-academic setting. Internally I could push gpkg through (as I already do for small data sets), but the disk write speed is a big inconvenience for large data sets. That doesn't mean I am in favour of trimming names automatically, btw.

@rsbivand
Member

rsbivand commented Aug 25, 2017

This is related to this thread? The foreign approach is OGR/shapelib - to truncate, risking non-unique field names. In rgdal::writeOGR(), base::abbreviate() is used and . replaced by _ when the "ESRI Shapefile" driver is being used.
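
To illustrate the non-uniqueness risk, here is a small sketch with made-up field names. It uses plain substr() truncation to 10 characters as a stand-in and does not reproduce what OGR/shapelib does internally:

```r
# Hypothetical field names sharing the same first 10 characters
field_names <- c("population_density", "population_total")

# Plain truncation to 10 characters
truncated <- substr(field_names, 1, 10)

# TRUE when two or more truncated names collide
collides <- anyDuplicated(truncated) > 0
```

Both names collapse to "population", so `collides` is TRUE and the two fields can no longer be told apart by name.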

Does anyone know how encoding affects the length constraint - it is bytes, not characters, isn't it?

On UTF-8:

> nchar("Fjærland")
[1] 8
> length(charToRaw("Fjærland"))
[1] 9
> abbreviate("Fjærland")
Fjærland 
  "Fjær" 
Warning message:
In abbreviate("Fjærland") : abbreviate used with non-ASCII chars
 If a input element contains non-ASCII characters, the
 corresponding value will be in UTF-8 and marked as such (see
 ‘Encoding’).
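
If the limit does turn out to be counted in bytes (an assumption, since the question above is left open), a name can pass a character-based check and still be too long. A small sketch with a made-up field name:

```r
# Hypothetical field name; \u00e6 is the letter ae-ligature, two bytes in UTF-8
field <- "Fj\u00e6rland_1"

n_chars <- nchar(field, type = "chars")  # counts characters
n_bytes <- nchar(field, type = "bytes")  # counts bytes of the UTF-8 encoding

# Fits a 10-character limit but not a 10-byte limit
fits_chars <- n_chars <= 10
fits_bytes <- n_bytes <= 10
```

Here `fits_chars` is TRUE while `fits_bytes` is FALSE, so a pre-write check based on nchar() defaults alone could miss such names.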

@tim-salabim
Member

@rsbivand yes, that is what I see too. I really need to find some time to investigate...

6 participants