Soundex Appears Broken? #14

Open
quantology opened this issue Nov 5, 2017 · 12 comments

@quantology

Using the test case, in python 3.5:

import fuzzy

phrase = 'FancyFree'
print(repr(fuzzy.Soundex(4)(phrase)))

yields: ''

Occasionally, instead of yielding an empty string, it raises a Unicode error. dmeta and nysiis are working fine in this install, so I don't believe it is an install error.

@pw717

pw717 commented Nov 13, 2017

Hi, same for me on python 2.7, please see example below.
Thank you in advance for your help.

| => python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

I had to roll back to the previous version, 1.1:

| => python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')
'T2300000'

@jaraco
Contributor

jaraco commented Nov 14, 2017

In this job, you can see the tests I added in fa184ba now failing. Annoyingly, they pass when I run the same tests on my mac. So there are apparently some issues with Cython or maybe with the compiler on Linux. I welcome someone to dive deeper and find a solution.

@jaraco
Contributor

jaraco commented Nov 14, 2017

As you can see, little changed with fuzzy.pyx from 1.1 to 1.2, and it changed slightly from 1.2 to 1.2.2.

@pw717

pw717 commented Nov 16, 2017

Hi, thank you very much for your answer.

mac

On Mac, as you mentioned (OS X Sierra 10.12.6), it's not OK either: it doesn't raise any error, but the return value appears to be wrong:

with 1.2.2
python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u'T23'

We should have this instead:
with 1.1
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'

It may be worth noting that the newer versions return a unicode object rather than a str as before.

Linux

On Linux (Debian 8.2 jessie, with both versions 1.2 and 1.2.2), this may interest you:

with 1.1
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
'T2300000'

with 1.2.2
| => python
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
>>> import fuzzy
>>> fuzzy.Soundex(8)('Test')
u''

Also, see below. Sorry for the repetition, but this may help: on the third attempt, the return value is still wrong but no error is raised!

>>> sdx = fuzzy.Soundex(8)
>>> sdx('Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 0: ordinal not in range(128)

>>> sdx('Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

>>> sdx('Test')
u''

>>> sdx('Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

>>> sdx('Test')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

and so on...

@yaakov2

yaakov2 commented Dec 3, 2017

I tested the sample code from the documentation with versions 1.0, 1.1, 1.2, 1.2.1 and 1.2.2 on a GoogleCloud Ubuntu 16.04 instance:

import fuzzy
soundex = fuzzy.Soundex(4)
print soundex('fuzzy')
print 'should be: F200'

Versions 1.0 and 1.1 produce the expected results 'F200'. Versions 1.2 onward produce empty strings.

@yaakov2

yaakov2 commented Dec 3, 2017

It seems to me that two weeks ago, version 1.2.2 used to work for us --- but then something changed and the results are wrong. Also, the results are sporadic --- we get different error messages in different runs of the program. For the time being, we go back to version 1.1 -- but it is not clear whether this solves the problem.

I would think that a basic test for the Soundex function should not be marked as "expected to fail": If the test doesn't produce the correct answer, then there is some problem that needs to be corrected (and people should see that when they decide whether to use the package or not).

@pw717

pw717 commented Apr 24, 2018

Hi, I'm porting my project to Python 3 and it seems that Soundex in this library doesn't work as it should, as @metaperture reported earlier.

Please see examples below:

python
Python 3.6.5 (default, Mar 30 2018, 06:42:10)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy; sdx = fuzzy.Soundex(8)
>>> sdx('fuzzy')
"F2x('fuzzy')\n"
>>> sdx('Jéroboam')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/fuzzy.pyx", line 207, in fuzzy.Soundex.__call__
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)

The test string "fuzzy" does not give the expected result, and a string containing an accented character raises an exception.

Thank you in advance for your help.

Regards,

Philippe
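
Since the second traceback above is an ASCII encoding failure, one possible workaround is to fold accented letters to their ASCII base letters before calling Soundex. The sketch below uses only the standard library. `to_ascii` is a hypothetical helper name, not part of fuzzy, and whether the folded string then produces a correct code in an affected build has not been verified here.

```python
import unicodedata


def to_ascii(text):
    """Fold accented letters to ASCII base letters (hypothetical helper).

    NFKD splits 'é' into 'e' plus a combining accent; the accent is then
    dropped by the ASCII encode. Characters with no ASCII decomposition
    are dropped entirely, so this is lossy.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Jéroboam"))  # Jeroboam
```

The folded string could then be passed to `fuzzy.Soundex(8)` in place of the original. Because letters without an ASCII decomposition are discarded, this suits Western European names better than arbitrary Unicode input.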

@morvan-s

morvan-s commented Jul 2, 2018

I've personally given up and used another implementation (https://pypi.org/project/soundex/)

@jaraco
Contributor

jaraco commented Jul 2, 2018

As I'm not the original author, I have little insight into the project and can give little guidance on what this library should be doing. It's good to know about the soundex library, as we can use it as a guide for what may or may not be correct.

Thinking about @pw717's comment above, it seems to me that on the mac, the behavior on 1.2 is more correct than that of 1.1, especially considering that the trailing zeros seem like padding, but also because the soundex lib doesn't render them either:

fuzzy master $ rwt soundex
Collecting soundex
  Downloading https://files.pythonhosted.org/packages/f8/8f/37b9711595d007e82f70ae6f41b6ab6a1fda406a8321ccfc458fb5023b5f/soundex-1.1.3.tar.gz
Collecting silpa_common>=0.3 (from soundex)
  Downloading https://files.pythonhosted.org/packages/8d/55/452f5103cb7071d188a818d9e2f12c19c4c8a12124a28aaa212eb6716a4d/silpa_common-0.3.tar.gz
Building wheels for collected packages: soundex, silpa-common
  Running setup.py bdist_wheel for soundex ... done
  Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/b5/bb/e6/9a4b6be56c40aa707509bddaf6d414187461ded9db7a25a41a
  Running setup.py bdist_wheel for silpa-common ... done
  Stored in directory: /Users/jaraco/Library/Caches/pip/wheels/16/4f/ba/604a82bf904740f1a1d3ad88029c0df5c638bd8825a3cb972d
Successfully built soundex silpa-common
Installing collected packages: silpa-common, soundex
Successfully installed silpa-common-0.3 soundex-1.1.3
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 26 2018, 23:26:24)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundex
>>> ob = soundex.Soundex()
>>> ob.soundex('Test')
'T23'
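
For comparison, here is a minimal pure-Python sketch of the classic American Soundex rules, which makes the padding question concrete. It is not the Cython code in fuzzy and is not taken from the soundex package; the function name, the `size` argument and the `pad` flag are assumptions made for illustration.

```python
# Consonant groups of American Soundex; vowels, h, w and y carry no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word, size=4, pad=True):
    """Return a Soundex code of at most `size` characters (illustrative)."""
    # Keep ASCII letters only; anything else is ignored in this sketch.
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit unless it repeats the previous letter's code.
        if code and code != prev:
            result += code
        # h and w do not separate two letters that share a code.
        if ch not in "hw":
            prev = code
        if len(result) >= size:
            break
    result = result[:size]
    return result.ljust(size, "0") if pad else result


print(soundex("Test", 8))             # T2300000
print(soundex("Test", 8, pad=False))  # T23
print(soundex("fuzzy", 4))            # F200
```

Under these rules, 'T2300000' and 'T23' are the same code with and without zero padding, so the difference between 1.1 and 1.2 on the Mac comes down to whether the result should be padded to the requested length.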

@jaraco
Contributor

jaraco commented Jul 2, 2018

There are several issues at play here. Let's set aside for the moment the issue that non-ascii characters are not yet supported (as the encoding for strings is declared to be ascii). I'll file that as a separate issue for clarity.

Excluding that issue, the tests pass on macOS.

What we need is someone to spend some time to understand the Cython code and dig into the details on a system where the tests are failing and devise a fix.

@supriyo-biswas

Soundex also has other bugs:

>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('hello')
'H4'
>>> soundex('hi')
"Houndex('hi')\n"

This is on Python 3.7.0 (macOS 10.14) with Fuzzy 1.2.2
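
One hypothesis that would fit the symptoms reported in this thread (empty strings, decode errors on a different byte each time, and results that trail off into the text of the input line) is that the result buffer is being read past its intended end, for example because it is not NUL-terminated. This is a guess, not a confirmed diagnosis of fuzzy.pyx. The pure-Python sketch below only simulates how a C string is read and decoded as ASCII; `decode_c_string` and the buffer contents are made up for illustration.

```python
def decode_c_string(buffer):
    """Read bytes up to the first NUL, as a C string would be, then decode."""
    end = buffer.find(b"\x00")
    raw = buffer if end < 0 else buffer[:end]
    # Any byte >= 0x80 makes the ASCII decode raise UnicodeDecodeError.
    return raw.decode("ascii")


# Hypothetical buffer contents chosen to mirror outputs quoted in the thread.
samples = {
    "properly terminated": b"T23\x00leftover",
    "first byte is NUL": b"\x00\x00\x00\x00",
    "terminator missing": b"Houndex('hi')\n\x00",
    "garbage high byte": b"\x80\x01\x02\x00",
}

for label, buf in samples.items():
    try:
        print(label, "->", repr(decode_c_string(buf)))
    except UnicodeDecodeError as exc:
        print(label, "-> UnicodeDecodeError:", exc)
```

If the hypothesis holds, the result would depend on whatever happens to be in memory after the code that was written, which would also explain why the behavior differs between platforms and between runs.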

@CognitiveClouds-Prasad

I am getting weird errors. Sometimes I get blank strings with a carriage return. Sometimes I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/fuzzy.pyx", line 230, in fuzzy.Soundex.__call__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 1: ordinal not in range(128)
