Different result when giving the same text #31

Closed
Stophface opened this issue Mar 25, 2015 · 6 comments

Comments

@Stophface

I have a database from which I read. I want to identify the language in a specific cell, defined by column.

I read from my database like this:

import sqlite3
import langid

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute("SELECT tags FROM sometable")
for row in selecter:  # iterate through all the rows in the db
    # print(type(row))  # tuple
    rf = str(row)
    # print(type(rf))  # str
    lan = langid.classify("{}".format(rf))

Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.

So, now comes the weird part.
I wanted to double check some results manually. So I have these words:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"

When I run the language identification on the database, it writes Portuguese into the database.
But when I run it like this:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)

Well, that returns French. Apart from the fact that the text is neither French nor Portuguese, why does it return different results?!

@saffsd
Owner

saffsd commented Mar 25, 2015

Hi Stophface,

Thanks for reporting the issue. The algorithm used by langid.py is entirely deterministic, so the only way to get two different outputs is to provide it with two different inputs. An encoding issue would be my first thought: is it possible that your database is returning text that is not UTF-8 encoded? Another possibility is some oddity in the whitespace characters. In any case, what I think is happening is that the string returned by your database looks the same as the one you entered manually, but is actually different when you compare them at the byte level.
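
A minimal sketch of how such a comparison could be done in Python 3, assuming the two strings are already available as `str` values. The helper names `first_difference` and `describe` and the sample strings are hypothetical and not part of langid.py; the no-break space only illustrates one kind of invisible difference.

```python
import unicodedata


def first_difference(a, b):
    """Return (index, char_a, char_b) at the first mismatch, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One string is a prefix of the other; report the first extra character.
        i = min(len(a), len(b))
        return (i, a[i], None) if len(a) > len(b) else (i, None, b[i])
    return None


def describe(ch):
    """Format a character as code point plus Unicode name."""
    if ch is None:
        return "<end of string>"
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))


typed = "noir blanc"             # hypothetical manually typed text
from_db = "noir\u00a0blanc"      # hypothetical database value with a no-break space

diff = first_difference(typed, from_db)
if diff is not None:
    index, ch_typed, ch_db = diff
    print(index, describe(ch_typed), "vs", describe(ch_db))
    # prints: 4 U+0020 SPACE vs U+00A0 NO-BREAK SPACE
```

If the function returns None for the database value and the typed value, the inputs are identical and the difference has to come from somewhere else in the script.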

As an aside, what do you expect to be the correct output for that input?

Cheers,
Marco


@Stophface
Author

Hey,
thanks for your fast reply.
When I create the database, the field that is populated with the words I showed here is declared as TEXT.

TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
https://www.sqlite.org/datatype3.html

I insert the data into the database with a "prepared statement":
https://en.wikipedia.org/wiki/Prepared_statement
http://stackoverflow.com/questions/3727688/what-does-a-question-mark-represent-in-sql-queries

The values returned when I read from the database are Python tuples. I then convert them to a string, as you can see.
According to this SO post http://stackoverflow.com/questions/4182603/python-how-to-convert-a-string-to-utf-8 Python is ASCII.
Do you recommend converting it to UTF-8?

@saffsd
Owner

saffsd commented Mar 25, 2015

Given the text you showed, there should be no difference between ASCII and UTF-8 (all ASCII is valid UTF-8 by design). Are you on Python 2 or Python 3? The best thing to do would be to look at the output of your database as a sequence of bytes. In Python 2 this would be something like print(map(ord, rf)).
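
The ASCII/UTF-8 point can be checked directly. This is a small Python 3 sketch; the helper name `same_bytes_in_ascii_and_utf8` and the accented sample are hypothetical and only serve as illustration.

```python
def same_bytes_in_ascii_and_utf8(text):
    """True when text is pure ASCII, in which case both encodings give identical bytes."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character is outside ASCII, so the encodings cannot agree.
        return False


tags = "shadow party people bw music mer white"  # excerpt of the tags from the report
print(same_bytes_in_ascii_and_utf8(tags))         # True

accented = "caf\u00e9"                            # hypothetical non-ASCII sample
print(same_bytes_in_ascii_and_utf8(accented))     # False
print(accented.encode("utf-8"))                   # b'caf\xc3\xa9'
print(accented.encode("latin-1"))                 # b'caf\xe9'
```

For pure ASCII text the encoding therefore cannot explain a different result; it only starts to matter once non-ASCII characters appear.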

However, looking at your code more closely, I notice that you do rf = str(row). This produces the string representation of the tuple row, which includes the quotes and parentheses. Is this your intention? Or did you intend this to be rf = ' '.join(row)?
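
To make the difference concrete, here is a sketch using an in-memory SQLite database. The table and column names follow the snippet in the report; the in-memory database and the inserted sample row are assumptions for illustration.

```python
import sqlite3

# Throwaway in-memory database standing in for somedb.db (an assumption).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sometable (tags TEXT)")
connection.execute("INSERT INTO sometable (tags) VALUES (?)", ("noir blanc plage",))

for row in connection.execute("SELECT tags FROM sometable"):
    as_repr = str(row)   # representation of the whole tuple
    as_text = row[0]     # the actual column value
    print(repr(as_repr))  # "('noir blanc plage',)"
    print(repr(as_text))  # 'noir blanc plage'

connection.close()
```

With a single selected column, row[0] and ' '.join(row) give the same text, whereas str(row) feeds the classifier extra parentheses, quotes, and a comma.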


@Stophface
Author

There will be different text: Farsi, Pashto, Arabic. Basically all spoken languages might be in the variable I pass to langid.
I am not a programmer, as you might have recognized already. Is there a difference between ASCII and UTF-8 when passing text to langid in a variable?

My intention is to pass text to langid that is as clean as possible.
So rf = ''.join(row) is the better thing to do. I had that before, but I started editing my code and it got lost. Thanks for mentioning it.
However, passing it to langid with join(row) does not do the trick.
I am working in Python 3.3. Could you specify what you mean by "looking at it byte by byte"?
print(map(ord, rf)) is for Python 2. I cannot find the syntax for Python 3 since I do not know exactly what to look for.

That's my output when looking at it byte by byte:

b = rf.encode('utf-8')
print (b)

b'shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote'
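
Encoding to UTF-8 as above is one Python 3 way to inspect the text. A sketch of a closer equivalent to the Python 2 map(ord, ...) suggestion follows; the sample string with a no-break space is hypothetical and chosen only to show what a hidden character looks like.

```python
rf = "noir\u00a0blanc"  # hypothetical sample containing a no-break space

code_points = [ord(ch) for ch in rf]   # one integer per character
utf8_bytes = list(rf.encode("utf-8"))  # one integer per UTF-8 byte

print(code_points)  # [110, 111, 105, 114, 160, 98, 108, 97, 110, 99]
print(utf8_bytes)   # [110, 111, 105, 114, 194, 160, 98, 108, 97, 110, 99]
print(ascii(rf))    # 'noir\xa0blanc'
```

In Python 3, map returns a lazy iterator, so the list comprehension (or list(map(ord, rf))) is needed to see the values. The built-in ascii() escapes every non-ASCII character, which makes unusual whitespace easy to spot.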

Thanks for providing such a great tool!

@Stophface
Author

Solved.
There was a problem further down in my script...

Ah, and I expected the language to be identified as French :) All good now!

If you're interested: I am using your library on the Flickr API :)

@saffsd
Owner

saffsd commented Aug 12, 2015

Closing as @Stophface indicated issue is solved.

saffsd closed this as completed Aug 12, 2015