datasets.get_rdataset string decode error on python 3 #1055

josef-pkt · 2013-08-20T15:03:30Z

Trying to run Skipper's example from issue #805 on Python 3.3
fails with a UnicodeDecodeError.

I guess we don't have a test for this.
I ran several examples on python 3 before the release, but none that used get_rdataset

import statsmodels.api as sm
from statsmodels.formula.api import ols

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
# .... snip rest

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-15-6ce721e61c62> in <module>()
      3 from statsmodels.formula.api import ols
      4 
----> 5 dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
      6 df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
      7 

E:\Josef\programs\WinPython-64bit-3.3.2.2\python-3.3.2.amd64\lib\site-packages\statsmodels\datasets\utils.py in get_rdataset(dataname, package, cache)
    250 
    251     title = _get_dataset_meta(dataname, package, cache)
--> 252     doc, _ = _get_data(docs_base_url, dataname, cache, "rst")
    253 
    254     return Dataset(data=data, __doc__=doc.read(), package=package, title=title,

E:\Josef\programs\WinPython-64bit-3.3.2.2\python-3.3.2.amd64\lib\site-packages\statsmodels\datasets\utils.py in _get_data(base_url, dataname, cache, extension)
    185     #Python 3, don't think there will be any unicode in r datasets
    186     if sys.version[0] == '3':  # pragma: no cover
--> 187         data = data.decode('ascii', errors='strict')
    188     return StringIO(data), from_cache
    189 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)


jseabold · 2013-08-20T15:43:27Z

Yeah, the error is intended because I didn't know how to properly do the decoding on Python 3 and wasn't necessarily expecting unicode from an R dataset.
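For illustration, a minimal sketch of the decoding step. It assumes the downloaded payload is UTF-8, which the traceback does not confirm, although 0xc3 is a typical lead byte of a two-byte UTF-8 sequence such as "é". The helper name `decode_payload` and the sample string are hypothetical and not part of statsmodels.

```python
from io import StringIO


def decode_payload(raw, encoding="utf-8"):
    """Hypothetical helper: return a text buffer from downloaded data.

    Assumes the payload is UTF-8 encoded; this is an assumption, not
    something the R dataset sources guarantee.
    """
    if isinstance(raw, bytes):
        raw = raw.decode(encoding, errors="strict")
    return StringIO(raw)


# Sample bytes containing a non-ASCII character (0xc3 0xa9 is "é" in UTF-8).
raw = "André-Michel Guerry".encode("utf-8")

try:
    # This mirrors the failing call shown in the traceback.
    raw.decode("ascii", errors="strict")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

text = decode_payload(raw).read()
print(ascii_ok, text)
```

Keeping `errors="strict"` still surfaces genuinely malformed input, while no longer rejecting every non-ASCII character.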

josef-pkt · 2013-08-20T16:04:34Z

We might be able to postpone this until the py2 py3 common code base conversion, where we have to decide and figure out what to do with strings.
But we need examples in the documentation that don't fail on this.

jseabold · 2013-08-20T17:53:02Z

What is the expectation if something was cached using 2.x: should we be able to read it from 3.x? That's the only holdup for me right now. I'd rather not support it, but it should be doable and we probably should support this.

jseabold · 2013-08-20T17:58:16Z

I could also mangle the cached name to include the Python version, so that Python 3 will never find a Python 2 cached item.
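A rough sketch of what such name mangling could look like. The function names, the URL flattening scheme, and the example URL are all hypothetical, chosen only to illustrate the idea, and do not reflect how statsmodels actually names its cache files.

```python
import os
import sys


def versioned_cache_name(url, extension="csv"):
    """Hypothetical helper: cache file name tagged with the Python major version."""
    # Flatten the URL into a single token that is safe to use as a file name.
    stem = url.split("://")[-1].replace("/", ",")
    return "%s-py%d.%s" % (stem, sys.version_info[0], extension)


def versioned_cache_path(cache_dir, url, extension="csv"):
    """Hypothetical helper: full path of the version-tagged cache file."""
    return os.path.join(cache_dir, versioned_cache_name(url, extension))


name = versioned_cache_name("https://example.org/csv/HistData/Guerry")
print(name)
```

With a tag like this, an interpreter running a different major version looks for a different file name and re-downloads instead of reading an incompatible cache entry.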

jseabold · 2013-08-20T17:58:59Z

This is the implicit expectation now actually. It was just never actually checked for.

jseabold · 2013-08-20T18:17:33Z

Never mind, it's an easy fix.

josef-pkt · 2013-08-20T18:20:58Z

Is this related to caching? I'm using a new Python 3.3 installation with statsmodels installed. Do we share cache directories across Python versions and different statsmodels folders?

But in either case, we are just caching the csv file. Isn't that the same across python versions?

jseabold · 2013-08-20T18:23:17Z

No, it's not related to caching but fixing it brought up a caching bug for me since I had this file saved from 2.7. Basically just a bytes vs. str issue.
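One way to sidestep this kind of bytes vs. str mismatch, sketched under assumptions: always store the cache file as raw bytes and decode only after reading. The helper names, the file name, and the UTF-8 encoding are hypothetical, and this is not necessarily what the actual fix does.

```python
import os
import tempfile


def write_cache(path, payload, encoding="utf-8"):
    """Hypothetical helper: always store the cached payload as bytes."""
    if isinstance(payload, str):
        payload = payload.encode(encoding)
    # Binary mode avoids any implicit, platform-dependent text encoding.
    with open(path, "wb") as handle:
        handle.write(payload)


def read_cache(path, encoding="utf-8"):
    """Hypothetical helper: read bytes back and decode them explicitly."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


with tempfile.TemporaryDirectory() as cache_dir:
    cache_file = os.path.join(cache_dir, "Guerry.rst")
    write_cache(cache_file, "André-Michel Guerry")
    restored = read_cache(cache_file)

print(restored)
```

Because the on-disk format is plain bytes, a file written by one interpreter version can be read by another, provided both agree on the encoding.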

vincentarelbundock · 2013-08-20T18:32:37Z

So there's nothing I should do to help on the RDatasets side?

jseabold · 2013-08-20T18:38:58Z

No you're fine. Just lazy coding catching up to me.

josef-pkt · 2014-06-06T13:55:38Z

missing cross-link to PR #1057

jseabold closed this as completed in 1cdeda5 Aug 20, 2013

jseabold added a commit to jseabold/statsmodels that referenced this issue Aug 20, 2013

COMPAT: Python 3 fix. Closes statsmodels#1055.

bdf6d52

josef-pkt mentioned this issue Jun 6, 2014

UnicodeDecodeError raised by get_rdataset("Guerry", "HistData") #1745

Closed

josef-pkt mentioned this issue Jun 6, 2014

TST: missing unicode tests ? #1746

Closed

PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014

COMPAT: Python 3 fix. Closes statsmodels#1055.

b8ee37e
