formula example notebook broken #1326

Closed
andreas-h opened this Issue Jan 24, 2014 · 31 comments

@andreas-h

When I try to follow the example_formulas example notebook, line number 6

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

triggers the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)

I'm using statsmodels 0.5.0 on python 3.3.2.

@josef-pkt
Member

Thanks for the report and pointing out an example case, see #1324


Bytes or Unicode - that's the question

@josef-pkt
Member

I guess the meta csv is UTF-8 and should be decoded that way instead of as ASCII.

Also, there are some encoding inconsistencies: opening the cache uses a different decoding.
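A minimal reproduction of the failure mode, using a literal taken from the traceback below (the RST docs are UTF-8, so "é" arrives as the two bytes b"\xc3\xa9", which a strict ASCII decode rejects):

```python
# Bytes as they come off the wire; "é" is the UTF-8 pair b"\xc3\xa9".
raw = b"Compte g\xc3\xa9n\xc3\xa9ral"

try:
    raw.decode("ascii", errors="strict")   # what utils.py currently does
except UnicodeDecodeError as exc:
    print(exc)                             # byte 0xc3 is outside ASCII

print(raw.decode("utf-8"))                 # decodes cleanly to "Compte général"
```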

running on python 3.3.0

with cache:

>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
  File "C:\Programs\Python33\lib\site-packages\statsmodels-0.5.0-py3.3-win32.egg\statsmodels\datasets\utils.py", line 245, in get_rdataset
    title = _get_dataset_meta(dataname, package, cache)
  File "C:\Programs\Python33\lib\site-packages\statsmodels-0.5.0-py3.3-win32.egg\statsmodels\datasets\utils.py", line 193, in _get_dataset_meta
    data = data.decode('ascii', errors='strict')
AttributeError: 'str' object has no attribute 'decode'

without cache it's the same as Andreas's

>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
  File "C:\Programs\Python33\lib\site-packages\statsmodels-0.5.0-py3.3-win32.egg\statsmodels\datasets\utils.py", line 246, in get_rdataset
    doc, _ = _get_data(docs_base_url, dataname, cache, "rst")
  File "C:\Programs\Python33\lib\site-packages\statsmodels-0.5.0-py3.3-win32.egg\statsmodels\datasets\utils.py", line 181, in _get_data
    data = data.decode('ascii', errors='strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)
@jseabold
Member
Python 3.3.2+ (default, Oct  9 2013, 14:50:09) 
[GCC 4.8.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

What gives?

@josef-pkt
Member

I only have statsmodels 0.5.0 in python 3.3.0 right now (and no spyder for it, so I don't know how to use pdb easily).

Trying things out, I get different exceptions, or some cases work, depending on which way and from where the url file is obtained.

Here is one case I tried to reconstruct from the traceback:

>>> url
'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/HistData/rst/Guerry.rst'
>>> cache
'C:\\Users\\josef\\statsmodels_data'
>>> import statsmodels.datasets.utils as du
>>> datam, tmp = du._urlopen_cached(url, cache)
>>> type(datam)
<class 'bytes'>
>>> ms = datam.decode('ascii')
Traceback (most recent call last):
  File "<pyshell#80>", line 1, in <module>
    ms = datam.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)
>>> datam[1250:1300]
b' against persons. Source: A2 (Compte g\xc3\xa9n\xc3\xa9ral,\n  '

I also looked at the python 3 source of statsmodels, and I don't see anything that 2to3 might have changed.

@josef-pkt
Member

I'm using an installed statsmodels 0.5 on python 3.3 (and I don't have spyder for this to conveniently use pdb)

Trying different versions, I get different exceptions, or in some cases it works.

Following up on one exception, this should show one "broken" case with non-ascii data:

>>> url
'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/HistData/rst/Guerry.rst'
>>> cache
'C:\\Users\\josef\\statsmodels_data'
>>> import statsmodels.datasets.utils as du
>>> datam, tmp = du._urlopen_cached(url, cache)
>>> type(datam)
<class 'bytes'>
>>> ms = datam.decode('ascii')
Traceback (most recent call last):
  File "<pyshell#80>", line 1, in <module>
    ms = datam.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)
>>> datam[1250:1300]
b' against persons. Source: A2 (Compte g\xc3\xa9n\xc3\xa9ral,\n  '

I was trying to work through the cache and download code, but there are too many possibilities for me to understand, for example whether things have been cached from py 2 or from py 3.
I also don't see whether the system encoding is used at some point when writing to disk; on Windows that wouldn't be utf-8, at least not on python 2, I guess.

@josef-pkt
Member

Now I've written it twice; I thought I hadn't saved/submitted the first version of the comment.
Firefox browser cache problems and switching tabs?

@josef-pkt
Member

my "impression" is that there are two "bugs" (just a rough guess):

1. we are decoding the metadata with ascii even if it contains unicode.
2. we are pickle dumping the decoded data, which might have the system encoding on python 2.

@andreas-h

Python2:

$ python
Python 2.7.6 |Anaconda 1.8.0 (64-bit)| (default, Nov 11 2013, 10:47:18) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
>>>

Python3:

$ python
Python 3.3.2 |Anaconda 1.8.0 (64-bit)| (default, Aug  5 2013, 15:04:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 253, in get_rdataset
    doc, _ = _get_data(docs_base_url, dataname, cache, "rst")
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 187, in _get_data
    data = data.decode('ascii', errors='strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1288: ordinal not in range(128)
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 248, in get_rdataset
    data, from_cache = _get_data(data_base_url, dataname, cache)
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 187, in _get_data
    data = data.decode('ascii', errors='strict')
AttributeError: 'str' object has no attribute 'decode'
>>>
@josef-pkt
Member

BTW: is there a reason why you didn't use gzip to write an archive file?
It doesn't really matter for internal use, but I tried to open one of the cache ".zip" files and it is an invalid format, I guess because of missing archive metadata.

@josef-pkt
Member

and the 3rd bug candidate is that we try to decode cached data twice.

@jseabold
Member

There's no reason I didn't use gzip. AFAICT, zlib and gzip use the same underlying compression (DEFLATE), but gzip provides file-like handles, which I don't use.
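The "invalid format" observation can be reproduced with the standard library, under the assumption that the cache stores raw zlib streams (toy payload, not the real cache contents): a zlib stream lacks the gzip magic number and header, so archive tools reject the file, while gzip output opens normally.

```python
import gzip
import zlib

payload = b"cached rdataset contents"

# zlib.compress produces a raw zlib stream (RFC 1950), not a gzip
# archive (RFC 1952): no gzip magic bytes at the start.
blob = zlib.compress(payload)
print(blob[:2] == b"\x1f\x8b")    # False

# gzip.compress adds the gzip header, so the result is a valid archive.
gz = gzip.compress(payload)
print(gz[:2] == b"\x1f\x8b")      # True
assert gzip.decompress(gz) == payload
```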

I don't see why a system encoding should matter for pickle.

Decoding the metadata as ascii is a bug. I don't know why it doesn't show up for me, or why the metadata isn't always decoded as utf-8 like the data.

@jseabold jseabold added a commit to jseabold/statsmodels that referenced this issue Jan 25, 2014
@jseabold jseabold BUG: Decode metadata to utf-8. Closes #1326. e1aae64
@andreas-h

Only partial success after manually changing ascii to utf-8 in datasets/utils.py:

Python 3.3.2 |Anaconda 1.8.0 (64-bit)| (default, Aug  5 2013, 15:04:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 248, in get_rdataset
    data, from_cache = _get_data(data_base_url, dataname, cache)
  File "/raid/home2/hilboll/SCIATRAN/no2prof/pyenv/lib/python3.3/site-packages/statsmodels/datasets/utils.py", line 187, in _get_data
    data = data.decode('utf-8', errors='strict')
AttributeError: 'str' object has no attribute 'decode'
>>> 
@jseabold
Member

Thanks. Can you also post the result of

import sys
sys.getdefaultencoding()
@andreas-h
$ python
Python 3.3.2 |Anaconda 1.8.0 (64-bit)| (default, Aug  5 2013, 15:04:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> 
@jseabold
Member

Thanks. Very odd. Same code on my machine returns bytes. On your machine returns str. Maybe the cached dataset is stale on your machine? It would have to be many months old though. Is that possible?

Can you try the latest commit in #1329. That should hopefully do it, though I still wonder if I'm treating the symptom and not the cause of why you have a different pickled object.

@andreas-h

Very strange indeed. After deleting the ~/statsmodels_data directory (which was created yesterday), everything works fine:

$ python 
Python 3.3.2 |Anaconda 1.8.0 (64-bit)| (default, Aug  5 2013, 15:04:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import statsmodels.api as sm
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=False)
>>> dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
>>> 

Thanks for looking into this!

@jseabold
Member

You might want to try the second call twice to be sure. These calls are equivalent if the cached dataset doesn't exist yet, except that the second writes something to disk. Doing it again will actually read from disk.
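A toy model of the two code paths described above: the first call downloads and writes the cache, a later call reads it back from disk. The file name and helper are hypothetical, not the real statsmodels cache layout.

```python
import os
import tempfile
import zlib

def fetch(payload, cache_dir):
    """Return payload, writing or reading a compressed cache file."""
    path = os.path.join(cache_dir, "guerry.zip")
    if os.path.exists(path):              # cache hit: read from disk
        with open(path, "rb") as fh:
            return zlib.decompress(fh.read())
    with open(path, "wb") as fh:          # cache miss: write to disk
        fh.write(zlib.compress(payload))
    return payload

with tempfile.TemporaryDirectory() as d:
    first = fetch(b"data", d)             # writes the cache
    second = fetch(b"data", d)            # exercises the read branch
    assert first == second == b"data"
```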

@andreas-h

You might want to try the second call twice to be sure. These calls are
equivalent if the cached dataset doesn't exist yet.

No problem any more. Works.

@jseabold
Member

Great thanks. I'll wait for Josef to comment before merging.

@jseabold
Member

To be clearer, it's unclear whether we need the second commit. It's an equivalent way of writing the same thing, though perhaps it's more explicit and preferred.

I'm wondering if the stale cached dataset came from having written the cached dataset in Python 2. Though, this shouldn't matter. This is what the last commits in that file were for and I tested this.

@josef-pkt
Member

I get the same success: after I make the utf-8 change and delete the cache, everything works in python 3.3.
If I then use py 2.7 to load the data with cache=False (which overwrites the cache), the next call in python 3.3 with cache=True raises the same second error, 'str' object has no attribute 'decode', again.

(It's not completely clean testing: I'm using statsmodels 0.5 with only the utf-8 changes in python 3.3, and statsmodels master, also with the utf-8 change, in python 2.7.)

@josef-pkt
Member

Windows

python 3.3

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> 

python 2.7

>>> import sys
>>> sys.getdefaultencoding()
'cp1252'
@jseabold
Member

How about with the explicit bytes-call commit? That should take care of it, I think.

@josef-pkt
Member

To reiterate my suspicion: I think we need to encode the data to bytes (utf-8) before pickling.
But I need to verify what the bytes in the "zip" file actually are.

I didn't see the bytes changeset. I will try it before following up on the zip encoding.
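A sketch of this suggestion, with hypothetical helper names (not statsmodels API): always encode to utf-8 bytes before pickling and decode again after unpickling, so the cached object has the same type no matter which Python version wrote it.

```python
import pickle

def dump_cached(text):
    """Pickle utf-8 bytes so the cache never stores a native str."""
    return pickle.dumps(text.encode("utf-8"))

def load_cached(blob):
    """Unpickle and decode at the boundary, back to text."""
    return pickle.loads(blob).decode("utf-8")

doc = "Compte g\u00e9n\u00e9ral"
assert load_cached(dump_cached(doc)) == doc
```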

@josef-pkt
Member

(aside: I figured out that firefox doesn't reload if the url has a link to a comment, but reloads the page if the url is just the issue.)

@josef-pkt
Member

I just saw that there is an additional change already in master compared to 0.5.0.
1cdeda5
adding the missing .encode('utf-8') seems to have fixed it already, even without the explicit bytes.

I can verify in a py 3.3 virtualenv with statsmodels master after dinner

@jseabold
Member

Yes this was the bug I fixed already. As I was saying, the bytes and encode should be equivalent. I was wondering how in the world the encode returned a bytes sometimes and a str sometimes...
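A quick check of the claimed equivalence: on Python 3, bytes(s, "utf-8") and s.encode("utf-8") produce identical bytes objects.

```python
s = "Compte g\u00e9n\u00e9ral"
# Both spellings yield the same UTF-8 byte sequence.
assert bytes(s, "utf-8") == s.encode("utf-8") == b"Compte g\xc3\xa9n\xc3\xa9ral"
```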

@josef-pkt
Member

Sorry about the confusion. I wasn't really set up for python 3 statsmodels work and reused my system python 3.3, which doesn't have statsmodels master and which I had open for the numpy mailing list bytes/string examples.

No more exception, now for checking the content

print(dta.__doc__) looks good in python 3.3, but has incorrectly decoded unicode in python 2.7
Source: A1 (Compte général)

@jseabold jseabold closed this in 322bef1 Jan 26, 2014
@josef-pkt
Member

By trial and error, I get the correct printout with py 2.6 in IDLE (an old statsmodels '0.5.0.dev-c1fb529') and with python 2.7 in Spyder on close-to-current master (a branch with the edited utf-8 fix, plus a few unrelated extra changes):

print unicode(dta.__doc__.decode('utf-8'))

I have no idea what it means yet.

@jseabold
Member

It looks fine to me on Python 2.7. Are you sure it's not a terminal encoding issue?

@jseabold
Member

Though it is a string. I guess it should probably be decoded to utf-8, but it should work. E.g., with my terminal encoding, things like this work

>>> print 'Compte g\xc3\xa9n\xc3\xa9ral'
Compte général

But maybe it should be

>>> 'Compte g\xc3\xa9n\xc3\xa9ral'.decode('utf-8')
u'Compte g\xe9n\xe9ral'

I don't usually see unicode docstrings, but I guess this is a special case?

@PierreBdR PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this issue Sep 2, 2014
@jseabold jseabold BUG: Decode metadata to utf-8. Closes #1326. d24178b