encoding problem with get_eurostat_dic #55

rocian · 2016-07-05T03:39:09Z

There is an issue in get_eurostat_dic on systems whose encoding is not "Windows-1252", at least on the Linux systems where I tried it. I was able to resolve it by overriding get_eurostat_dic and setting the proper encoding (UTF-8).

jhuovari · 2016-07-18T12:49:22Z

Now dic <- get_eurostat_dic("na_item") gives a warning:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  invalid input found on input connection 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=dic%2Fen%2Fna_item.dic'

It is probably not UTF-8. These encoding issues are tricky.

antagomir · 2016-07-18T13:00:00Z

Tricky indeed. Not sure if there is a universal solution.

jhuovari · 2016-07-18T13:03:04Z

Maybe read it as "Windows-1252" (or whatever it really is) and then convert to UTF-8?
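A minimal sketch of that idea, not the package's actual code: it assumes the source bytes really are Windows-1252 (named "CP1252" here, an alias that most iconv builds accept) and uses a hand-built byte vector instead of the real dictionary file.

```r
# Hypothetical sample: "Households" followed by byte 0x92, which is the
# right single quotation mark in Windows-1252.
raw_bytes <- as.raw(c(0x48, 0x6f, 0x75, 0x73, 0x65, 0x68, 0x6f, 0x6c, 0x64, 0x73, 0x92))
x <- rawToChar(raw_bytes)

# Declare the source encoding explicitly and convert to UTF-8.
y <- iconv(x, from = "CP1252", to = "UTF-8")

# Code point of the last character; 8217 is U+2019, the right single quote.
last_cp <- utf8ToInt(substring(y, nchar(y)))
print(last_cp)
```

This only works if the encoding is known in advance; if the file were in some other code page, the conversion would silently produce different characters.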

antagomir · 2016-07-18T13:08:01Z

If the encoding is always the same, or can be recognized automatically, then this will work. We can try at least.

jhuovari · 2016-08-05T11:43:56Z

Could you try it now, @rocian? I changed it to fileEncoding = "". It works for me now.

However, with get_eurostat_dic("na_item")[257, 2] I get: "HouseholdsÂ’ actual pension contributions". But the same text also seems to appear on the Eurostat web page: it is visible if you hover over that item in Data Explorer.

rocian · 2016-08-05T23:08:02Z

@jhuovari
It seems to be related to the locale (and to the Unicode glyphs of the font in use). My locale is UTF-8.

With fileEncoding="Windows-1252"
get_eurostat_dic("na_item")[257, 2] produces "HouseholdsÂ’ actual pension contributions"

With fileEncoding="UTF-8", or fileEncoding=""
get_eurostat_dic("na_item")[257, 2] produces "Households\u0092 actual pension contributions"

The last one is what I see in the tooltip in Data Explorer. So even though we see different things, each of us gets from get_eurostat_dic the same result that we see in Data Explorer.

jhuovari · 2016-08-08T06:21:08Z

Good to know that you get something with fileEncoding="". For me get_eurostat_dic("na_item") with fileEncoding="UTF-8" gives a warning and only 248 observations. There should be 513 observations.

I think this affects only a few dictionaries, but a better solution would be great.
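One possible explanation for the warning, offered as a guess and not as a diagnosis: a lone byte such as 0x92 is not a valid UTF-8 sequence, so a reader that insists on UTF-8 can stop at the first such byte. The sketch below only checks validity on a made-up two-byte string; it does not touch the real dictionary file.

```r
# Hypothetical sample: the letter "s" followed by byte 0x92.
suspect <- rawToChar(as.raw(c(0x73, 0x92)))
clean   <- "s'"

# validUTF8() reports whether each element is a valid UTF-8 byte sequence.
is_valid <- validUTF8(c(suspect, clean))
print(is_valid)
```

If a check like this returned FALSE for lines of the downloaded file, that would support reading it with a single-byte encoding instead of UTF-8.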

pbiecek · 2016-09-09T22:09:47Z

After changing to read_tsv I get a proper table (on OS X)

> get_eurostat_dic("na_item")
# A tibble: 513 x 2
   code_name
       <chr>
1       B1GQ

I do not have access to a Windows machine, but maybe someone can check whether this error is still present?

antagomir · 2016-09-09T22:54:30Z

@muuankarski ping

jhuovari · 2016-09-11T18:40:18Z

On Windows the table is OK, but get_eurostat_dic("na_item")[257, 2] gives:

"Households<U+0092> actual pension contributions"

From Stack Overflow: "U+0092 is a never-used control character. It is almost always the result of misdecoding a single right quote ’ in a Windows code page 1252 file as ISO-8859-1." I think that's the case here too. Not a big issue, but still...
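The misdecoding described in that quote can be reproduced with a single byte. This is only an illustration, assuming the local iconv knows the names "ISO-8859-1" and "CP1252":

```r
# Byte 0x92 on its own, with no encoding declared yet.
b <- rawToChar(as.raw(0x92))

# Decoding the same byte under two different assumptions.
as_latin1 <- iconv(b, from = "ISO-8859-1", to = "UTF-8")
as_cp1252 <- iconv(b, from = "CP1252", to = "UTF-8")

cp_latin1 <- sprintf("U+%04X", utf8ToInt(as_latin1))  # "U+0092", a control character
cp_cp1252 <- sprintf("U+%04X", utf8ToInt(as_cp1252))  # "U+2019", the right single quote
print(c(cp_latin1, cp_cp1252))
```

The byte is identical in both cases; only the assumed source encoding differs.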

pbiecek · 2016-09-11T20:11:23Z

In dc13905 there is a dirty hack that replaces U+0092 with '.
It's quite strange, since some ' are read properly, so I'm not 100% sure whether this hack is needed.
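For readers who want to apply a similar workaround themselves, here is a sketch of such a replacement. It is not necessarily what dc13905 does; the sample string is copied from the output reported earlier in this thread.

```r
# Label as reported above, containing the stray U+0092 control character.
label <- "Households\u0092 actual pension contributions"

# Replace every U+0092 with a plain ASCII apostrophe.
# fixed = TRUE treats the pattern as a literal string, not a regex.
fixed_label <- gsub("\u0092", "'", label, fixed = TRUE)
print(fixed_label)
```

Replacing with the typographic quote "\u2019" would be closer to what the source data presumably intended, at the cost of non-ASCII output.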

jhuovari · 2016-09-13T06:45:23Z

The misdecoding probably happens on the Eurostat side, so I think this is solved on our side.

antagomir added a commit that referenced this issue Jul 5, 2016

Polished man pages. Also solved #55

17fbaa6

antagomir added a commit that referenced this issue Jul 5, 2016

Now solves #55

84b18b3

antagomir closed this as completed Jul 5, 2016

jhuovari reopened this Jul 18, 2016

jhuovari pushed a commit that referenced this issue Aug 5, 2016

An attempt to solve #55

c8a2aa1

jhuovari pushed a commit that referenced this issue Aug 5, 2016

An attempt to solve #55

a56d35d

antagomir assigned muuankarski Sep 11, 2016

antagomir added this to the Submit to CRAN milestone Sep 11, 2016

antagomir added the bug label Sep 11, 2016

pbiecek added a commit that referenced this issue Sep 11, 2016

#55 hack over U+0092 in eurostat dict

dc13905

jhuovari closed this as completed Sep 13, 2016
