encoding problem with get_eurostat_dic #55

Closed
rocian opened this issue Jul 5, 2016 · 12 comments

@rocian

rocian commented Jul 5, 2016

There is an issue in get_eurostat_dic on systems whose encoding is not "Windows-1252", at least on the Linux systems where I tried it. I was able to resolve it by overriding get_eurostat_dic and setting the proper encoding (UTF-8).

antagomir added a commit that referenced this issue Jul 5, 2016
antagomir added a commit that referenced this issue Jul 5, 2016
@jhuovari

Now dic <- get_eurostat_dic("na_item") gives a warning:

Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  invalid input found on input connection 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=dic%2Fen%2Fna_item.dic'

It is probably not UTF-8. These encoding issues are tricky.

@jhuovari jhuovari reopened this Jul 18, 2016
@antagomir
Member

Tricky indeed. Not sure if there is a universal solution.

@jhuovari

Maybe read it as "Windows-1252" (or whatever it really is) and then convert it to UTF-8?
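A minimal sketch of that idea in base R, assuming the source bytes really are Windows-1252 (the thread does not confirm this). The sample bytes are made up for illustration; only the phrase comes from the dictionary entry discussed in this issue.

```r
# Bytes as they might arrive from a Windows-1252 encoded file:
# "Households" + 0x92 (right single quote in Windows-1252) + " actual"
raw_bytes <- c(charToRaw("Households"), as.raw(0x92), charToRaw(" actual"))
x <- rawToChar(raw_bytes)

# Decode from the assumed source encoding and re-encode as UTF-8
y <- iconv(x, from = "CP1252", to = "UTF-8")

# Code point of the 11th character (the quote mark)
utf8ToInt(y)[11]
# [1] 8217   (U+2019, RIGHT SINGLE QUOTATION MARK)
```

For files, the same effect can be had by passing fileEncoding = "CP1252" to read.table(), which re-encodes the input on the fly. Whether that matches what the package does is not stated here.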

@antagomir
Member

If the encoding is always the same, or can be recognized automatically, then this will work. We can try at least.

jhuovari pushed a commit that referenced this issue Aug 5, 2016
@jhuovari

jhuovari commented Aug 5, 2016

@rocian, could you try it now? I changed it to fileEncoding = "". It works for me now.

However, with get_eurostat_dic("na_item")[257, 2] I get: "HouseholdsÂ’ actual pension contributions". The same text seems to appear on the Eurostat web page as well; it is visible if you hover over that item on Data Explorer.

jhuovari pushed a commit that referenced this issue Aug 5, 2016
@rocian
Author

rocian commented Aug 5, 2016

@jhuovari
It seems related to the locale (and to the Unicode glyphs available in the font being used). My locale is UTF-8.

With fileEncoding="Windows-1252"
get_eurostat_dic("na_item")[257, 2] produces "HouseholdsÂ’ actual pension contributions"

With fileEncoding="UTF-8", or fileEncoding=""
get_eurostat_dic("na_item")[257, 2] produces "Households\u0092 actual pension contributions"

The last one is what I see in the tooltip on Data Explorer. However, even if we see different things, we both obtain from get_eurostat_dic the same result seen on Data Explorer.
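One possible explanation for the two outputs, sketched in base R: if the entry contains the two bytes C2 92 (the UTF-8 encoding of U+0092), then decoding them as UTF-8 gives the single control character, while decoding them as Windows-1252 gives "Â" followed by "’". That the served file contains exactly these bytes is an assumption; the thread is consistent with it but does not confirm it.

```r
# Assumed bytes: the UTF-8 encoding of the control character U+0092
utf8_bytes <- as.raw(c(0xC2, 0x92))

# Interpretation 1: treat the bytes as UTF-8
as_utf8 <- rawToChar(utf8_bytes)
Encoding(as_utf8) <- "UTF-8"

# Interpretation 2: treat each byte as a Windows-1252 character
as_cp1252 <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

utf8ToInt(as_utf8)    # 146        -> U+0092
utf8ToInt(as_cp1252)  # 194 8217   -> U+00C2 "Â" and U+2019 "’"
```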

@jhuovari

jhuovari commented Aug 8, 2016

Good to know that you get something with fileEncoding="". For me get_eurostat_dic("na_item") with fileEncoding="UTF-8" gives a warning and only 248 observations. There should be 513 observations.

I think this affects only a few dictionaries, but a better solution would be great.
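To find which entries trip a UTF-8 reader, something like the following could help. It is only a sketch: the two sample lines are invented (the code "X" is hypothetical), and it assumes the lines have been read as raw bytes without re-encoding. validUTF8() is available in base R from version 3.3.0.

```r
# Two tab-separated lines: the first is plain ASCII, the second
# contains a lone 0x92 byte, which is not valid UTF-8
lines <- c(
  "B1GQ\tGross domestic product",
  rawToChar(c(charToRaw("X\tHouseholds"), as.raw(0x92),
              charToRaw(" contributions")))
)

# TRUE where the bytes form valid UTF-8, FALSE otherwise
ok <- validUTF8(lines)
ok

# Indices of the lines that a strict UTF-8 reader would reject
which(!ok)
```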

@pbiecek
Member

pbiecek commented Sep 9, 2016

After changing to read_tsv I get a proper table (on OS X):

> get_eurostat_dic("na_item")
# A tibble: 513 x 2
   code_name
       <chr>
1       B1GQ

I do not have access to a Windows machine, but maybe someone can check whether this error is still present?

@antagomir
Member

@muuankarski ping

@jhuovari

On Windows the table is OK, but get_eurostat_dic("na_item")[257, 2] gives:

"Households<U+0092> actual pension contributions"

From Stack Overflow: "U+0092 is a never-used control character. It is almost always the result of misdecoding a single right quote ’ in a Windows code page 1252 file as ISO-8859-1." I think that is also the case here. Not a big issue, but still...
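The misdecoding described in that quote can be reproduced in base R. This sketch decodes the single byte 0x92 once as ISO-8859-1 and once as Windows-1252. The name "ISO-8859-1" is used rather than "latin1" on purpose, since R may treat "latin1" as CP1252 on Windows.

```r
# The single byte in question
b <- rawToChar(as.raw(0x92))

# Decoded as ISO-8859-1 it becomes a C1 control character
as_iso <- iconv(b, from = "ISO-8859-1", to = "UTF-8")

# Decoded as Windows-1252 it becomes the right single quote
as_cp1252 <- iconv(b, from = "CP1252", to = "UTF-8")

sprintf("U+%04X", utf8ToInt(as_iso))     # "U+0092"
sprintf("U+%04X", utf8ToInt(as_cp1252))  # "U+2019"
```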

pbiecek added a commit that referenced this issue Sep 11, 2016
@pbiecek
Member

pbiecek commented Sep 11, 2016

In dc13905 there is a dirty hack that replaces U+0092 with '.
It's quite strange, since some ' are read properly, so I'm not 100% sure whether this hack is needed.
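The code of that commit is not shown in this thread, so the following is only a sketch of what such a replacement could look like in base R, not the actual change.

```r
# A label as returned when U+0092 survives the import
x <- "Households\u0092 actual pension contributions"

# Replace the stray control character with a plain ASCII apostrophe;
# fixed = TRUE matches the character literally instead of as a regex
fixed <- gsub("\u0092", "'", x, fixed = TRUE)

fixed
# [1] "Households' actual pension contributions"
```

Apostrophes that were already read correctly contain no U+0092, so they pass through unchanged.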

@jhuovari

The misdecoding probably happens on Eurostat's side, so I think this is solved on our side.
