yutannihilation
Contributor

On Windows in a CJK locale, data sources whose paths contain multibyte characters cannot be read or written by sf. For example:

library(sf)

# copy nc.shp to a path which has multibyte characters
dsn_dir <- file.path(tempdir(), "データ")
dir.create(dsn_dir)
nc <- read_sf(system.file("shape/nc.shp", package="sf"))
write_sf(nc, dsn_dir, layer = "nc", driver="ESRI Shapefile")

# verify
list.files(dsn_dir)
#> [1] "nc.dbf" "nc.prj" "nc.shp" "nc.shx"

st_layers(dsn_dir)
#> Cannot open data source C:\Users\user1\AppData\Local\Temp\Rtmp21SLMr\データ
#> Error in CPL_get_layers(dsn, options, do_count) : Open failed.

st_read() and st_write() also fail in many cases (I'm not sure why write_sf() works in the example above...).

I'm not familiar with GDAL, but it seems that GDAL accepts UTF-8 characters only; it succeeds if we call CPL_get_layers() directly with a UTF-8 argument like this:

sf:::CPL_get_layers(enc2utf8(dsn_dir), character(0), FALSE)
#> Reading layer `nc' from data source `C:\Users\user1\AppData\Local\Temp\Rtmp21SLMr\繝・・繧ソ' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs

But, if we call st_layers(), it fails.

st_layers(enc2utf8(dsn_dir))
#> Cannot open data source C:\Users\user1\AppData\Local\Temp\Rtmp21SLMr\データ
#> Error in CPL_get_layers(dsn, options, do_count) : Open failed.

This is because normalizePath() doesn't keep the encoding.

dsn <- enc2utf8("データ/nc.shp")
Encoding(dsn)
#> [1] "UTF-8"

dsn_normalized <- normalizePath(dsn)
Encoding(dsn_normalized)
#> [1] "unknown"

So, we need to convert the strings to UTF-8 after calling normalizePath(). Fortunately, sub(), basename() and file_path_sans_ext(), which are used to extract layer names from dsn, all keep the encoding.
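The conversion described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `normalize_dsn_utf8()` is a hypothetical helper name, and `mustWork = FALSE` is an assumption made so that the sketch also runs for paths that do not exist yet.

```r
# Hypothetical helper (not sf's actual code): normalize a data source path,
# then convert the result to UTF-8 before it is handed to GDAL.
normalize_dsn_utf8 <- function(dsn) {
  # mustWork = FALSE (an assumption for this sketch): neither fail nor warn
  # when the path does not exist yet, e.g. a file that is about to be written
  normalized <- normalizePath(dsn, mustWork = FALSE)
  # enc2utf8() converts from the native encoding to UTF-8 and marks
  # non-ASCII strings accordingly; pure-ASCII strings are returned as-is
  enc2utf8(normalized)
}

# Usage: "\u30c7\u30fc\u30bf" is the directory name from the example above,
# written with escapes so the script itself stays ASCII.
dsn <- normalize_dsn_utf8(file.path(tempdir(), "\u30c7\u30fc\u30bf", "nc.shp"))
Encoding(dsn)

# The layer name can then be derived with functions that keep the encoding.
layer <- tools::file_path_sans_ext(basename(dsn))
```

On Windows in a CJK locale, `Encoding(dsn)` should then report "UTF-8" instead of "unknown"; on platforms whose native encoding is already UTF-8 the conversion only adds the mark.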

@yutannihilation
Contributor Author

yutannihilation commented Aug 27, 2017

Another approach may be to use Rf_translateCharUTF8() on the C++ side. For example, around here:

poDS = (GDALDataset *) GDALOpenEx(datasource[0], GDAL_OF_VECTOR | GDAL_OF_READONLY, NULL,

	poDS = (GDALDataset *) GDALOpenEx(Rf_translateCharUTF8( datasource[0] ), GDAL_OF_VECTOR | GDAL_OF_READONLY, NULL, 

@edzer edzer merged commit 68a88c1 into r-spatial:master Aug 27, 2017
@yutannihilation yutannihilation deleted the encoding branch August 27, 2017 11:37
@yutannihilation
Contributor Author

Thanks for merging!

@edzer
Member

edzer commented Aug 27, 2017

Thanks for bringing this up, looks good. I can only test on a UTF-8 platform, and on Windows through AppVeyor, so we need to see whether it fixes the problem or creates new ones on other platforms/encodings.

@yutannihilation
Contributor Author

Yes, considering various platforms/encodings is difficult... I will keep an eye on the GitHub issues of this repo 👀
