Fix for ARFF reader regression (#10232) #10233

Merged: 6 commits, May 31, 2019


vnmabus
Contributor

@vnmabus vnmabus commented May 29, 2019

  • Quoted nominal values are now properly unquoted.
  • A regression test has been added.

@rgommers added the backport-candidate, maintenance, SciPEP and scipy.io labels and removed the SciPEP label May 29, 2019
@rgommers
Member

Thanks @vnmabus! @sebp can you confirm that this solves your issue?

@vnmabus
Contributor Author

vnmabus commented May 29, 2019

For some reason the test fails on some versions and passes on others. I am not sure why.

@sebp
Contributor

sebp commented May 29, 2019

I can confirm that it does solve the issue for me. Strictly speaking, stripping quotes leads to different values being returned compared to versions prior to 1.3, but most of the time it would make sense to do this.

I noticed that your code accounts for situations where the first or last character is a quote, but not both. I strongly suggest adding a test case for this too.

@sebp
Contributor

sebp commented May 29, 2019

I found a slightly esoteric setup that still results in an exception:

@attribute age numeric
@attribute smoker {'  yes', 'no  '}
@data
18,'no  '
24,'  yes'
44,'no  '
56,'no  '
89,'  yes'
11,'no  '

This input also results in ValueError: no value not in (' yes', 'no ').

How about using csv.reader to parse the list of attributes too?
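
As a rough illustration of the csv.reader suggestion, the sketch below parses both the attribute value list and a data row with the same reader settings, so quoted values keep their inner spaces in both places while spaces between elements are dropped. The helper name `split_quoted` and the explicit `quotechar` and `skipinitialspace` settings are assumptions made for this example; they are not taken from the PR's actual code.

```python
import csv


def split_quoted(line):
    """Split one comma-separated line, honoring single-quoted fields.

    Hypothetical helper for illustration only.
    """
    # quotechar="'" makes the reader strip the surrounding single quotes;
    # skipinitialspace=True drops spaces that follow a delimiter, but
    # spaces inside the quotes are kept.
    reader = csv.reader([line], quotechar="'", skipinitialspace=True)
    return next(reader)


# The text between the braces of the @attribute line.
values = tuple(split_quoted("'  yes', 'no  '"))

# The first data row from the example above.
row = split_quoted("18,'no  '")

print(values)
print(row)
print(row[1] in values)
```

Because the same function handles both the declaration and the data, the membership check compares like with like.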

* Add tests to check that spaces between quotes are preserved, but
spaces between csv elements are not.
@sebp
Contributor

sebp commented May 30, 2019

Thanks for the changes @vnmabus, everything works as expected on my end.

@rgommers
Member

The one CI failure on Azure is real. A bit puzzling though, 'no' == 'no' failing ....

________________ ERROR at setup of TestQuotedNominal.test_data ________________
[gw0] win32 -- Python 3.5.4 C:\hostedtoolcache\windows\Python\3.5.4\x64\python.exe

self = <scipy.io.arff.tests.test_arffread.TestQuotedNominal object at 0x0000022EAB6C76A0>

    def setup_method(self):
>       self.data, self.meta = loadarff(test_quoted_nominal)

self       = <scipy.io.arff.tests.test_arffread.TestQuotedNominal object at 0x0000022EAB6C76A0>

C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\tests\test_arffread.py:342: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\arffread.py:738: in loadarff
    return _loadarff(ofile)
C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\arffread.py:803: in _loadarff
    a = list(generator(ofile))
C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\arffread.py:801: in generator
    yield tuple([attr[i].parse_data(row[i]) for i in elems])
C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\arffread.py:801: in <listcomp>
    yield tuple([attr[i].parse_data(row[i]) for i in elems])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <scipy.io.arff.arffread.NominalAttribute object at 0x0000022EB4DA74E0>
data_str = "'no'"

    def parse_data(self, data_str):
        """
        Parse a value of this type.
        """
        if data_str in self.values:
            return data_str
        elif data_str == '?':
            return data_str
        else:
            raise ValueError("%s value not in %s" % (str(data_str),
>                                                    str(self.values)))
E           ValueError: 'no' value not in ('yes', 'no')

data_str   = "'no'"
self       = <scipy.io.arff.arffread.NominalAttribute object at 0x0000022EB4DA74E0>

C:\hostedtoolcache\windows\Python\3.5.4\x64\lib\site-packages\scipy\io\arff\arffread.py:165: ValueError

@vnmabus
Contributor Author

vnmabus commented May 30, 2019

So, in some environments, the data values do not have their quotes stripped but the attribute values do, even though the same function reads both. Any suggestions on which difference between the affected versions is the important one, and how I can reproduce this error locally? 😅
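
The mismatch can be reproduced in isolation with a simplified stand-in for the `parse_data` method shown in the traceback. `NominalAttributeSketch` is a hypothetical name, not SciPy's class; it only mirrors the membership check, assuming the declared values were unquoted while the data value was not.

```python
class NominalAttributeSketch:
    """Simplified stand-in for the nominal attribute in the traceback."""

    def __init__(self, values):
        self.values = tuple(values)

    def parse_data(self, data_str):
        # Accept declared values and the ARFF missing-value marker '?'.
        if data_str in self.values or data_str == '?':
            return data_str
        raise ValueError("%s value not in %s" % (data_str, self.values))


# Declared values had their quotes stripped ...
attr = NominalAttributeSketch(["yes", "no"])

# ... so an unquoted data value is accepted,
accepted = attr.parse_data("no")

# ... while a data value that still carries its quotes is rejected.
try:
    attr.parse_data("'no'")
    rejected = False
except ValueError as exc:
    rejected = True
    print(exc)
```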

@vnmabus
Contributor Author

vnmabus commented May 30, 2019

OK, it seems that some Python versions do not have this patch applied:
python/cpython@2411292
How should I proceed now? Should I provide a workaround?

@rgommers
Member

Nice catch. That patch looks pretty recent, so quite a few users won't have it. If a workaround isn't too hard, then yes, that sounds good.

@rgommers
Member

Nice, that fixed it!

One more minor CI issue:

39.88s$ pycodestyle scipy benchmarks/benchmarks
scipy/io/arff/arffread.py:458:112: E502 the backslash is redundant between brackets
1       E502 the backslash is redundant between brackets
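
For reference, E502 fires when a line-continuation backslash appears inside parentheses, brackets, or braces, where Python already continues the line implicitly. The snippet below is a made-up example modeled on the error message from the traceback; it is not the actual line 458 of arffread.py.

```python
# Flagged form (E502), shown as a comment:
#     message = ("%s value not in %s" % ("no", \
#                                        str(("yes", "no"))))

# Fixed form: inside the parentheses the backslash is simply dropped.
message = ("%s value not in %s" % ("no",
                                   str(("yes", "no"))))
print(message)
```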

@rgommers rgommers merged commit d8f5509 into scipy:maintenance/1.3.x May 31, 2019
@rgommers
Member

All green now, merged. Thanks @vnmabus for the quick fix, and @sebp for reporting!

@rgommers rgommers added this to the 1.3.1 milestone May 31, 2019
@rgommers rgommers changed the title from "Correct issue #10232." to "Fix for ARFF reader regression (#10232)" May 31, 2019
@rgommers rgommers modified the milestones: 1.3.1, 1.4.0 May 31, 2019
@vnmabus
Contributor Author

vnmabus commented May 31, 2019

Sorry for bothering you @rgommers, but I am not sure whether these commits should also be pushed to master. I opened this PR against the maintenance branch, and I am realizing now that maybe that was not right.

@rgommers
Member

Oh thanks for pointing that out, I totally missed that. I'll forward-port them.

@rgommers rgommers modified the milestones: 1.4.0, 1.3.1 May 31, 2019
@rgommers rgommers removed the backport-candidate label May 31, 2019
@pv
Member

pv commented May 31, 2019

@vnmabus: For future reference, as a general rule everything should go to master first.
