Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cppyy.gbl.std.string representation changed in ROOT master #15153

Closed
1 task done
Nowakus opened this issue Apr 5, 2024 · 8 comments · Fixed by #15484
Closed
1 task done

cppyy.gbl.std.string representation changed in ROOT master #15153

Nowakus opened this issue Apr 5, 2024 · 8 comments · Fixed by #15484

Comments

@Nowakus
Copy link
Contributor

Nowakus commented Apr 5, 2024

Check duplicate issues.

  • Checked for duplicates

Description

repr() of cppyy.gbl.std.string objects now comes with a "b" prefix

Reproducer

import cppyy
s = cppyy.gbl.std.string("XXXX")
s
b'XXXX'
str(s)
'XXXX'
repr(s)
"b'XXXX'"

ROOT version

ROOT 6.31/01 (master)

Installation method

dev3

Operating system

Alma9

Additional context

Is this a bug or intentional?

@Nowakus Nowakus added the bug label Apr 5, 2024
@Nowakus
Copy link
Contributor Author

Nowakus commented Apr 5, 2024

CC @scott-snyder @elmsheus

@guitargeek guitargeek self-assigned this Apr 5, 2024
@wlav
Copy link
Contributor

wlav commented Apr 5, 2024

Intentional.

An std::string has a single-char array representation underlying it, which is a Python bytes object, not a str, which is unicode. The point of repr() is to be able to provide a str object from which the original object can be reconstructed. That, therefore, has to be a bytes representation. The point of str() is to provide something that looks pretty and informational, which in practice is more likely to be a decoded str, hence that choice.

Specifically, b/c unicode isn't properly implemented in any C++ standard string class, it's common practice to place an encoded unicode string into std::string. Regardless, even if it contains unencoded ASCII, whenever a C++ std::string is represented as a Python str, it needs decoding. However, no information is available on the encoding scheme. It's typically safe to assume an UTF-8 encoding b/c ASCII is a subset, but there's no guarantee. Hence the current implementation tries to avoid decoding as much as possible and only does so when absolutely necessary (e.g. when doing string comparisons or, as in this example, when calling str()).

The implementation of std::string has been Pythonized so that it can be used as a drop-in replacement wherever a str object is expected, with the exception of when you do anything type specific, such as repr(), type(), and isinstance(), where it will represent as an std::string, which it is.

For CERN, none of this is likely to matter as all European languages are encodable UTF-8 AFAIK. The most common language, in my experience, where this assumption fails is Chinese. (Edit: I read up on it and turns out that all languages are encodable in UTF-8, it's just that that encoding is designed to be space-efficient for Latin languages, but is rather inefficient for Asian languages, hence the common choice of different encodings there.)

@guitargeek
Copy link
Contributor

Thanks for chiming in here @wlav!

@Nowakus, are you happy with this answer? If yes, my suggestion is that I also mention this in the 6.32 release notes so nobody will wonder again if this is intentional or not, and then I close this issue. You agree with this?

@guitargeek guitargeek added this to the 6.32/00 milestone Apr 6, 2024
@Nowakus
Copy link
Contributor Author

Nowakus commented Apr 8, 2024

Yes, thank you @wlav for a very detailed explanation. I think it may be good to mention this in release notes and maybe even include a link to the explanation here.

@wlav
Copy link
Contributor

wlav commented Apr 8, 2024

Aside, another difference is that an std::string is editable in-place, whereas a Python str is immutable. It's probably not common, but some use cases may rely on this and would otherwise be impossible if std::string was automatically converted to str everywhere (as besides str being immutable, it involves a copy). This includes passing an std::string return from some function as a non-const reference argument into another function.

@andresailer
Copy link
Contributor

What is somewhat confusing here (ROOT master from Saturday) is that one sees b'mystring', but this is not a bytes instance.

source /cvmfs/sft-nightlies.cern.ch/lcg/nightlies/dev3/Sat/ROOT/HEAD/x86_64-el9-gcc13-dbg/ROOT-env.sh
python -c "import cppyy; mystr = cppyy.gbl.std.string('mystring'); print(repr(mystr)); print(isinstance(mystr, bytes))"

Gives

b'mystring'
False

So at least some printout messages in existing code are confusing me. As you, @wlav , write as well, repr usually returns something that let's one construct the object again. So something like cppyy.gbl.std.string('mystring') is maybe more consistent with the actual type of the object? Or the type and address in angle brackets, so maybe <cppyy.gbl.std string object at 0xf22d8f0>?
I saw a print out of b'something' and happily coded my isinstance(.., bytes) and nothing changed.

@wlav
Copy link
Contributor

wlav commented Apr 15, 2024

Similar goes for e.g. np.int64 and Python's builtin int: they represent the same with repr(), but are different types. The contents of repr() do not contain any originating type info, but are, however, sufficient to reconstitute either one. As for isinstance(), I'm afraid that's never going to be a solid solution for a language that relies on duck typing (e.g., similarly, both isinstance(12345, np.int64) and isinstance(np.int64(12345), int) are False). Instead, std::string is made to be a drop-in replacement for str (and bytes for that matter as a concession;str and bytes not being drop-in replacements). I'd be interested to know for what use case this would fail and isinstance being the only way out (I/O maybe?).

guitargeek added a commit to guitargeek/root that referenced this issue May 11, 2024
guitargeek added a commit to guitargeek/root that referenced this issue May 11, 2024
guitargeek added a commit to guitargeek/root that referenced this issue May 11, 2024
guitargeek added a commit that referenced this issue May 13, 2024
guitargeek added a commit to guitargeek/root that referenced this issue May 13, 2024
guitargeek added a commit that referenced this issue May 13, 2024
silverweed pushed a commit to silverweed/root that referenced this issue May 14, 2024
@guitargeek guitargeek added this to Issues in Fixed in: not applicable via automation May 16, 2024
@guitargeek
Copy link
Contributor

Fixed in "not applicable" because this was a clarification request and not a bug.

@root-project root-project deleted a comment from github-actions bot May 16, 2024
@root-project root-project deleted a comment from github-actions bot May 16, 2024
@root-project root-project deleted a comment from github-actions bot May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

4 participants