-
Notifications
You must be signed in to change notification settings - Fork 238
Improve Python 3 Support #2
Comments
Can someone point to the part of the code where this change would have to be done? I'm not terribly proficient at C++ but can take a look and maybe a first pass |
@nichochar I'm not completely familiar with the code, but it seems that the Unicode support must be implemented somewhere around the pystring code: https://github.com/uber/pyflame/blob/master/src/pystring.cc#L38 |
I updated the top-level comment on this issue to provide more background, if anyone wants to take a stab. |
I have got a first (very ugly) version of this to work. I will clean it up over the next couple of days, but then I would certainly like to get some feedback. I have made some assertions right now that may or may not be allowed, and it will probably not work for early versions of Python 3 (<3.3) either. I also ran into the issue where FlameGraph does not really handle UTF-8 correctly either, so that will have to be fixed before people can actually use it. EDIT: |
I have pushed a commit which I would like some feedback on. Besides the bunch of TODOs and asserts, The most important part I think is the discrepancy in the signature of Python 3's StringData() function. It currently returns a std::string, because it does all the heavy lifting itself. I think I have to do it this way, because the address of the second PtracePeekBytes depends on the result of some processing on the first PtracePeekBytes. I could move this processing and PtracePeekBytes stuff to the FollowFrame() function, but that would only make everything uglier and harder to understand. EDIT: |
@jackvreeken Sorry for not seeing this earlier. Your commit looks pretty good to me! Note that some stuff in As for testing, I think it's sufficient to create copies of the test Python scripts in the I don't know how to test every file encoding (perhaps you do), but if we just confirmed that non-ASCII worked that would be good enough for me. |
Alright, when #81 gets merged I will rebase on top and fix anything that has changed. For testing I would not just want to test unicode filenames, but also unicode function names (just in case). What do you mean by every file encoding? Do you mean Latin-1, UCS-1, UCS-2 and UCS-4, or other exotic types like BIG5 and SHIFT-JIS? The latter two are handled by Python itself, so we only have to deal the "standard" set. I think Python just chooses the smallest one that fits to store strings, so I could come up with some filenames/function names that don't fit in UCS-2 but do fit in UCS-4. I would have to check that though. Our output is always UTF-8 by the way, as that is what FlameGraph expects. |
That change is landed now, and I released it as 1.4.0. If we can fix this issue, I will tag/release 1.5.0. Good call on testing Unicode function names -- that's easier to test anyway (although I agree we should still have some Unicode file name tests). For the encoding, what I remember when I first wrote the code for Python 3 string handling, is that Python internally uses one of four possible representations (e.g. ASCII, UCS-{1,2,4}, etc.) based on what bytes are actually in the string. Ideally we would have file/function names that test all possible internal representations (or at least, all of the ones Pyflame supports). I know the ASCII heuristic is just to use the ASCII internal representation if the string is representable as ASCII, for multi-byte encodings I'm not sure how it decides which multi-byte representation it wants to use. For the very large characters, I know the newest emoji are large (I believe the terminology is that they're in the astral plane?). For instance, https://unicode-table.com/en/2603/ is three bytes in UTF-8, and https://unicode-table.com/en/1F356/ is four bytes in UTF-8. |
This is fixed by #92 , contributed by @jackvreeken . I've tagged and released v1.5.0 which includes this change! |
Currently Pyflame works with Python 3 only as long as the file names are ASCII. If the file names include Unicode characters, Pyflame will fail to decode them.
Some Background
If you download the source code for Python, you can find the implementation of strings (actually unicode objects) in
Objects/unicodeobject.c
. There are a few other files, but this is the main one.A unicode object in Python actually has one of three representations; it can be any of the following:
PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject
Which one is actually used depends on what codepoints are actually in the string. If the string just contains ASCII characters a compact
PyASCIIObject
representation will be used. This representation is just like a Python 2 string: it has a size, and a pointer to bytes. Currently Pyflame assumes that string objects are really of typePyASCIIObject
.A Solution
A fix for this would include:
You should be pretty familiar with the internals of
unicodeobject.c
before attempting to implement a fix here.As of this writing, the code here should probably go into
src/frob.cc
which is where the existing string handling code is. Although depending on the complexity of the solution, it may make sense to break out into another file.The text was updated successfully, but these errors were encountered: