Trying to run Skipper's example from issue #805 on Python 3.3
fails with a UnicodeDecodeError.
I guess we don't have a test for this.
I ran several examples on Python 3 before the release, but none that used get_rdataset.
import statsmodels.api as sm
from statsmodels.formula.api import ols
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
# .... snip rest
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-15-6ce721e61c62> in <module>()
3 from statsmodels.formula.api import ols
----> 5 dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
6 df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
E:\Josef\programs\WinPython-64bit-22.214.171.124\python-3.3.2.amd64\lib\site-packages\statsmodels\datasets\utils.py in get_rdataset(dataname, package, cache)
251 title = _get_dataset_meta(dataname, package, cache)
--> 252 doc, _ = _get_data(docs_base_url, dataname, cache, "rst")
254 return Dataset(data=data, __doc__=doc.read(), package=package, title=title,
E:\Josef\programs\WinPython-64bit-126.96.36.199\python-3.3.2.amd64\lib\site-packages\statsmodels\datasets\utils.py in _get_data(base_url, dataname, cache, extension)
185 #Python 3, don't think there will be any unicode in r datasets
186 if sys.version == '3': # pragma: no cover
--> 187 data = data.decode('ascii', errors='strict')
188 return StringIO(data), from_cache
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)
Yeah, the error is intended because I didn't know how to properly do the decoding on Python 3 and wasn't necessarily expecting unicode from an R dataset.
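For illustration, here is a minimal sketch of why the strict ASCII decode fails. It assumes the downloaded bytes are UTF-8 encoded, which is consistent with the 0xc3 byte in the traceback but is not confirmed in this thread. The sample string is made up and is not taken from the Guerry documentation.

```python
# Hypothetical sample payload: "é" encodes to the two bytes 0xc3 0xa9 in UTF-8.
raw = "Département: Ain".encode("utf-8")

# A strict ASCII decode rejects any byte >= 0x80, as in the traceback above.
try:
    raw.decode("ascii", errors="strict")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the assumed encoding recovers the original text.
text = raw.decode("utf-8")
```

Whether UTF-8 is the right choice for every R dataset is an open question here; the sketch only shows the failure mode.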
We might be able to postpone this until the py2/py3 common code base conversion, where we have to decide and figure out what to do with strings.
But we need examples in the documentation that don't fail on this.
What are the expectations if something was cached using 2.x: should we be able to read it from 3.x? That's the only holdup for me right now. I'd rather not support it, but it should be doable and we probably should support this.
I could also mangle the cached name to include the Python version, so that Python 3 will never find a Python 2 cached item.
This is the implicit expectation now actually. It was just never actually checked for.
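The name-mangling idea mentioned above could look roughly like the sketch below. The function name `versioned_cache_name`, the character substitutions, and the `.zip` suffix are all hypothetical and are not taken from statsmodels; the only point is that the Python major version becomes part of the file name.

```python
import sys


def versioned_cache_name(url, py_major=None):
    """Build a cache file name that differs per Python major version.

    Hypothetical helper for illustration only.
    """
    if py_major is None:
        py_major = sys.version_info[0]
    # Replace characters that are awkward in file names (illustrative choice).
    safe = url.replace("://", "_").replace("/", "_")
    return "{0}-py{1}.zip".format(safe, py_major)


# The same URL maps to different cache entries under Python 2 and Python 3.
name_py2 = versioned_cache_name("http://example.org/Guerry.csv", py_major=2)
name_py3 = versioned_cache_name("http://example.org/Guerry.csv", py_major=3)
```

The trade-off is that each Python version downloads its own copy, in exchange for never having to read a cache entry written by the other version.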
Never mind, it's an easy fix.
Is this related to caching? I'm using a new Python 3.3 installation with statsmodels installed. Do we share cache directories across Python versions and different statsmodels folders?
But in either case, we are just caching the csv file. Isn't that the same across python versions?
No, it's not related to caching but fixing it brought up a caching bug for me since I had this file saved from 2.7. Basically just a bytes vs. str issue.
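One way to sidestep this kind of bytes vs. str mismatch is to always store and read the cache as raw bytes and decode in a single place. The sketch below assumes that approach and a UTF-8 payload. The helper names `write_cache` and `read_cache` are hypothetical and do not reflect the actual fix in statsmodels.

```python
import os
import tempfile
from io import StringIO


def write_cache(path, data):
    """Store the downloaded payload as raw bytes (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(data)


def read_cache(path, encoding="utf-8"):
    """Read raw bytes back and decode once, returning a text buffer."""
    with open(path, "rb") as f:
        raw = f.read()
    return StringIO(raw.decode(encoding))


# Round trip with a made-up payload containing a non-ASCII character.
cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "example.csv")
write_cache(cache_path, "dept,name\n1,Ardèche\n".encode("utf-8"))
first_line = read_cache(cache_path).readline()
```

Because the file on disk is always bytes, it does not matter which Python version wrote it.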
So there's nothing I should do to help on the RDatasets side?
No, you're fine. Just lazy coding catching up to me.
COMPAT: Python 3 fix. Closes #1055.
missing cross-link to PR #1057