Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename str_to_unicode and unicode_to_str functions #778

Closed
kmike opened this issue Jul 2, 2014 · 13 comments
Closed

rename str_to_unicode and unicode_to_str functions #778

kmike opened this issue Jul 2, 2014 · 13 comments
Milestone

Comments

@kmike
Copy link
Member

@kmike kmike commented Jul 2, 2014

I think that unicode_to_str and str_to_unicode names are misleading: unicode_to_str can accept input which is not unicode, and 'str' is ambiguous in Python 2.x / 3.x; str_to_unicode can accept input which is not str.

What about renaming them to text_to_bytes and text_to_unicode? "Text" will mean "either a bytestring or a unicode string".

Django uses force_bytes / force_text names for similar utilities; the difference is that they acept any objects, not just bytes/unicode as input.

"Text" means "unicode" in Django names. This makes sense because not all bytestrings are a representation of text. I think that for Scrapy bytestrings acceptable by text_to_bytes and text_to_unicode functions are representation of text, so "either a bytestring or a unicode string" meaning is OK in Scrapy.

@felixonmars
Copy link
Contributor

@felixonmars felixonmars commented Jul 15, 2014

I think text_to_bytes and text_to_unicode still look ambiguous if used together with six.text_type and six.string_types, but it's better than current approach.

@kmike
Copy link
Member Author

@kmike kmike commented Jul 21, 2014

@felixonmars do you have an idea for better names?

@felixonmars
Copy link
Contributor

@felixonmars felixonmars commented Jul 21, 2014

Hmm, I don't think six provide a better name, but since we already have to live with it, we can use the same naming scheme: string_to_binary and string_to_text?

Well, doesn't sound that good to me already :P

@kmike
Copy link
Member Author

@kmike kmike commented Jul 21, 2014

Hmm, yes, "string" looks better than "text" that has another meaning in six. Other options:

  • string_to_unicode and string_to_bytes;
  • bytes_or_unicode_to_unicode and bytes_or_unicode_to_bytes. Actually I like these super-long names because they don't left reader wondering what is function doing, and that's important because it is easy to get bytes/unicode handling wrong.
@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Jul 21, 2014

But it also accept None, so it should be none_or_bytes_or_unicode_to_unicode/none_or_bytes_or_to_bytes.
I think to_unicode and to_bytes is ok. If you use a not valid input you will get an exception, if the input worth adding the support somebody will create a ticket and we can work on it.

@kmike
Copy link
Member Author

@kmike kmike commented Jul 21, 2014

@nramirezuy these functions don't accept None.

I think that's good they are not "give me unicode from everything" functions and that bytes and unicode are the only supported input types.

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Jul 22, 2014

Oh, was the encoding.

@kmike Also Django force_bytes/force_text have some encoding guessing, right?

@kmike
Copy link
Member Author

@kmike kmike commented Jul 22, 2014

@klangner
Copy link
Contributor

@klangner klangner commented Feb 10, 2015

If I understand it correctly in python3:
unicode_to_str accepts Str and Binary as input and output is always Binary
str_to_unicode accepts binary and Str and output is always Str

I'm not sure if in all places this will be correct behaviour in py3. It would be good to rename this functions (force_bytes/force_text sounds good to me) so they will make sense in py3. And probably it will be necessary to add another functions which will return Str in py3.

Also the current tests are misleading in py3.
For example:

self.assertEqual(unicode_to_str(u'\xa3 49'), '\xc2\xa3 49')
assert isinstance(text, str)
Expected type is string in py3, but should be binary

What should be done with those tests in py3? (tests/test_utils_python.py)

@kmike
Copy link
Member Author

@kmike kmike commented Feb 10, 2015

If I understand it correctly in python3:
unicode_to_str accepts Str and Binary as input and output is always Binary
str_to_unicode accepts binary and Str and output is always Str

That's correct. To be precise,

  • Str == unicode in Python 2.x;
  • Str == str in Python 3.x;
  • Binary == str == bytes in Python 2.x;
  • Binary == bytes in Python 3.x.

I'm not sure if in all places this will be correct behaviour in py3. It would be good to rename this functions (force_bytes/force_text sounds good to me) so they will make sense in py3. And probably it will be necessary to add another functions which will return Str in py3.

I don't quite understand this, could you please explain it?

What should be done with those tests in py3? (tests/test_utils_python.py)

The test should be fixed. bytes is an alias for str in Python 2.x; test should be written like this:

self.assertEqual(unicode_to_str(u'\xa3 49'), b'\xc2\xa3 49')  # note b prefix
assert isinstance(text, bytes)  # this is a name for Binary compatible with both 2.x and 3.x
@klangner
Copy link
Contributor

@klangner klangner commented Feb 10, 2015

Thanks for the answers.

I'm not sure if in all places this will be correct behaviour in py3. It would be good to rename this
functions (force_bytes/force_text sounds good to me) so they will make sense in py3. And probably
it will be necessary to add another functions which will return Str in py3.

I don't quite understand this, could you please explain it?

I wanted to say that it would be good to rename this functions. I don't know if in all places where this functions are called it would be ok to have bytes as output type. If not, maybe it will be necessary to add another pair of functions which will return str instead of bytes

@curita
Copy link
Member

@curita curita commented Feb 11, 2015

I wanted to say that it would be good to rename this functions. I don't know if in all places where this functions are called it would be ok to have bytes as output type. If not, maybe it will be necessary to add another pair of functions which will return str instead of bytes.

Not sure if I understood that myself. Talking about str is maybe confusing since it has different meanings depending on Python's version.

Strings come in two (and only two) "flavors", they are unicode objects, or they are bytes objects (this distinction was made clearer on Python 3). str either refers to bytes on Python 2 (It's an alias for it as @kmike said) or unicode on Python 3 (the actual unicode type was replaced for str on Python 3).

We should focus on making this distinction explicit to make code Python 2 and Python 3 compatible since they use different default types for strings, and that means to stop using str alias, have "u" and "b" prefixes on strings, and use six.text_type and bytes accordingly while checking string types.

@curita curita added this to the Scrapy 1.0 milestone Feb 11, 2015
@curita curita modified the milestones: Scrapy 1.1, Scrapy 1.0 Mar 13, 2015
@kmike kmike closed this in 61cd27e Jul 23, 2015
@kmike
Copy link
Member Author

@kmike kmike commented Jul 23, 2015

We (@dangra, @eliasdorneles and me) decided to go with to_bytes and to_unicode names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants