
This page documents various differences between IronPython and CPython. Since IronPython is under active development, any of the differences described here may change or disappear in the future.

Environment Variables

  • IRONPYTHONSTARTUP is used instead of PYTHONSTARTUP

  • IRONPYTHONPATH is used instead of PYTHONPATH
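A script that has to inspect these variables under either interpreter can choose the name at run time. The following is a minimal sketch, assuming that `platform.python_implementation()` returns `'IronPython'` under IronPython, as the standard library documentation states; the helper names `env_var_name` and `search_path_entries` are hypothetical.

```python
import os
import platform


def env_var_name(suffix):
    """Return the interpreter-specific variable name, e.g. PYTHONPATH or IRONPYTHONPATH."""
    is_ironpython = platform.python_implementation() == "IronPython"
    prefix = "IRONPYTHON" if is_ironpython else "PYTHON"
    return prefix + suffix


def search_path_entries():
    """Split the module search path variable into its entries (empty list if unset)."""
    raw = os.environ.get(env_var_name("PATH"), "")
    return [entry for entry in raw.split(os.pathsep) if entry]


print(env_var_name("STARTUP"))
print(search_path_entries())
```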

COM Interaction

  • Interaction with COM objects is handled by the CLR rather than by a Python library binding to the native COM DLLs.

Integers

The long type is no longer exposed in the built-ins; however, it still shows up when querying the type of a large integer.

CPython

>>> type(1<<64)
<class 'int'>

IronPython

>>> type(1<<64)
<class 'long'>

However, in order to reduce compatibility issues, using isinstance to test if a value is an integer will work as it does in CPython:

CPython and IronPython

>>> isinstance(1<<64, int)
True
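In practice this means that portable code should test for integers with isinstance rather than by comparing type objects. A small sketch (the helper name `is_integer` is hypothetical):

```python
def is_integer(value):
    """Return True for any integer, regardless of its magnitude."""
    return isinstance(value, int)


small, big = 1, 1 << 64

# Comparing type(small) with type(big) is fragile, because IronPython may
# report a different type for large values. isinstance() is consistent.
print(is_integer(small), is_integer(big), is_integer(1.0))  # True True False
```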

Strings

  • str objects are represented in UTF-16 (like all .NET strings) rather than UTF-32 used by CPython.

This has a few visible consequences if characters outside of the Basic Multilingual Plane (BMP) are used (that is, characters with Unicode code points above U+FFFF). The examples below illustrate the differences.

Let's take a Unicode character U+1F70B, '🜋'. In CPython, it is represented by a single character:

CPython

>>> len('\U0001f70b')
1
>>> str('\U0001f70b')
'🜋'

In IronPython, it is represented by a pair of surrogate characters U+D83D and U+DF0B:

IronPython

>>> len('\U0001f70b')
2
>>> str('\U0001f70b')
'\ud83d\udf0b'
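The two surrogates follow directly from the UTF-16 encoding rules. The sketch below computes them in two ways: by encoding the string to UTF-16 and by applying the surrogate arithmetic to the code point. The helper names are hypothetical.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


units = utf16_code_units("\U0001f70b")
print([hex(u) for u in units])                    # ['0xd83d', '0xdf0b']
print([hex(u) for u in surrogate_pair(0x1F70B)])  # ['0xd83d', '0xdf0b']
print(len(units))                                 # 2, the length IronPython reports
```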

In both cases, however, a string containing such a character is printed correctly, since print transcodes the string from its internal representation to whichever encoding is used by the console or file (usually UTF-8):

CPython and IronPython

>>> print('\U0001f70b')
🜋

Any surrogate pair in IronPython strings represents one logical character. CPython, however, sees a surrogate pair as two invalid characters.

IronPython

>>> '\ud83d\udf0b'
'\ud83d\udf0b'
>>> print('\ud83d\udf0b')
🜋
>>> '\ud83d\udf0b'.encode('utf-8')
b'\xf0\x9f\x9c\x8b'
>>> '\U0001f70b'.encode('utf-8')
b'\xf0\x9f\x9c\x8b'

CPython

>>> '\ud83d\udf0b'
'\ud83d\udf0b'
>>> print('\ud83d\udf0b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> '\ud83d\udf0b'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

CPython requires the 'surrogatepass' error handler to let those pairs through. Note, however, that they are still treated as two separate characters, each encoded on its own. IronPython encodes the pair as if it were one character.

CPython

>>> '\ud83d\udf0b'.encode('utf-8','surrogatepass')
b'\xed\xa0\xbd\xed\xbc\x8b'
>>> '\U0001f70b'.encode('utf-8')
b'\xf0\x9f\x9c\x8b'
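If a CPython program receives text in which a non-BMP character arrives as two surrogates, one way to recombine the pair into a single character is a round trip through UTF-16 with 'surrogatepass'. This is only a sketch of that technique; the helper name `combine_surrogate_pairs` is hypothetical.

```python
def combine_surrogate_pairs(text):
    """Recombine valid surrogate pairs into single code points.

    The strict UTF-16 decoding step raises UnicodeDecodeError if text
    contains a surrogate that is not part of a valid pair.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


combined = combine_surrogate_pairs("\ud83d\udf0b")
print(len(combined))             # 1 on CPython
print(combined.encode("utf-8"))  # b'\xf0\x9f\x9c\x8b'
```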

The 'surrogatepass' error handler is still needed in IronPython to handle surrogate characters that do not form a valid surrogate pair:

IronPython

>>> print('\ud83d\udf0b')
🜋
>>> print('\ud83d\udf0b'[::-1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'cp65001' codec can't encode character '\udf0b' in position 0: Unable to translate Unicode character \\uDF0B at index 0 to specified code page.
>>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass'))
b'\xed\xbc\x8b\xed\xa0\xbd'

CPython

>>> print('\ud83d\udf0b')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> print('\ud83d\udf0b'[::-1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> print('\ud83d\udf0b'[::-1].encode('utf-8','surrogatepass'))
b'\xed\xbc\x8b\xed\xa0\xbd'

Interaction with the Operating System

  • Environment variables are decoded using the 'replace' error handler, rather than the 'surrogateescape' error handler used by CPython.

This is how .NET libraries handle encoding errors in the system. The difference is only visible on POSIX systems that have environment variables defined using a different encoding than the encoding used by the system (Windows environment variables are always in UTF-16, so no conversion takes place when they are accessed as Python str objects).

Assume that a Linux system is configured to use UTF-8. Under bash:

$ python -c 'f=open("test.sh","w",encoding="latin-1");print("NAME=\"André\"",file=f)'
$ source test.sh
$ export NAME

This creates an environment variable that is encoded using Latin-1 rather than the system encoding. CPython will escape the invalid byte 0xe9 (letter 'é' in Latin-1) as a lone surrogate U+DCE9, which is still an invalid Unicode character.

CPython

>>> import os
>>> os.environ["NAME"]
'Andr\udce9'
>>> print(os.environ["NAME"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 4: surrogates not allowed

IronPython will replace the invalid byte with U+FFFD, the Unicode replacement character, which is a valid and printable character.

IronPython

>>> import os
>>> os.environ["NAME"]
'Andr�'
>>> print(os.environ["NAME"])
Andr�
>>> hex(ord(os.environ["NAME"][-1]))
'0xfffd'

CPython's representation is not printable, but it can be safely encoded back to the original form using 'surrogateescape' (the default when dealing with the OS environment):

CPython

>>> os.environ["PATH"] = os.environ["PATH"] + ":/home/" + os.environ["NAME"] + "/bin"
>>> import posix
>>> posix.environ[b"PATH"]
b'/bin:/usr/bin:/usr/local/bin:/home/Andr\xe9/bin'
>>> os.environ["NAME"].encode("utf-8","surrogateescape")
b'Andr\xe9'

IronPython's representation is printable, but the original byte value is lost:

IronPython

>>> os.environ["NAME"].encode("utf-8","surrogateescape")
b'Andr\xef\xbf\xbd'
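The trade-off between the two error handlers can be reproduced without touching the environment by decoding the raw bytes directly. The sketch below simulates both strategies on the same byte string; it does not read os.environ.

```python
raw = b"Andr\xe9"  # "André" in Latin-1; the byte 0xE9 is not valid UTF-8 here

escaped = raw.decode("utf-8", "surrogateescape")  # the strategy CPython uses
replaced = raw.decode("utf-8", "replace")         # the strategy IronPython uses

print(ascii(escaped))   # 'Andr\udce9'
print(ascii(replaced))  # 'Andr\ufffd'

# Only the escaped form carries enough information to rebuild the bytes.
print(escaped.encode("utf-8", "surrogateescape") == raw)  # True
print(replaced.encode("utf-8"))                           # b'Andr\xef\xbf\xbd'
```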

Codecs

  • Some single-byte codecs may have unused positions in their codepage. CPython and IronPython (and .NET) differ in how they handle such cases.

A simple example is the Windows-1252 encoding. According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The Unicode "best fit" mapping documents this behavior. CPython will treat those bytes as invalid, while IronPython will map them to the "best fit" Unicode character:

CPython

>>> b'\x81'.decode('windows-1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
>>> b'\x81'.decode('windows-1252','surrogateescape')
'\udc81'

IronPython

>>> b'\x81'.decode('windows-1252')
'\x81'
>>> b'\x81'.decode('windows-1252','surrogateescape')
'\x81'

The same difference in behavior can be observed during encoding:

CPython

>>> '\x81'.encode('windows-1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/envs/py34/lib/python3.4/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to <undefined>

IronPython

>>> '\x81'.encode('windows-1252')
b'\x81'
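Code that must decode such data identically under both implementations can make CPython imitate the "best fit" mapping with a custom error handler. The following is a sketch under that assumption; the handler name `c1-fallback` and the helper `decode_cp1252_best_fit` are made up for this example and are not part of either implementation.

```python
import codecs


def _c1_fallback(exc):
    """Map undecodable bytes to the code points with the same numeric value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


# Register the handler under a name chosen for this example.
codecs.register_error("c1-fallback", _c1_fallback)


def decode_cp1252_best_fit(data):
    """Decode Windows-1252, mapping unused positions to C1 control codes."""
    return data.decode("windows-1252", "c1-fallback")


print(ascii(decode_cp1252_best_fit(b"\x81")))      # '\x81'
print(ascii(decode_cp1252_best_fit(b"\x80 100")))  # '\u20ac 100'
```

Bytes that Windows-1252 does define, such as 0x80 for the euro sign, are decoded normally; the handler is only consulted for the unused positions.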

  • When using the UTF-7 encoding, IronPython (and .NET) always terminates blocks encoded in modified Base64 with a '-', while CPython omits the '-' where allowed.

The UTF-7 standard leaves encoders some freedom of implementation. One such option is how to end a sequence encoded in modified Base64. In principle, + marks the start of the sequence, and - is the terminator. However, the terminating - may be omitted if the next character unambiguously does not belong to the encoded Base64 block. CPython chooses to drop the terminating - in such cases, while IronPython will always terminate Base64-encoded blocks with a -:

CPython

>>> 'abc:~~:zyz'.encode('utf-7')
b'abc:+AH4Afg:zyz'

IronPython

>>> 'abc:~~:zyz'.encode('utf-7')
b'abc:+AH4Afg-:zyz'

Note that both forms are fully interchangeable; IronPython will correctly decode what CPython encoded and vice versa.
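Because the encoded bytes differ, tests and comparisons that need to work across both implementations are safer when they compare the decoded text. A short sketch using the two byte strings from the examples above:

```python
without_terminator = b"abc:+AH4Afg:zyz"  # form produced by CPython
with_terminator = b"abc:+AH4Afg-:zyz"    # form produced by IronPython

# The byte strings differ, so compare decoded text rather than raw bytes.
print(without_terminator == with_terminator)  # False

same_text = without_terminator.decode("utf-7") == with_terminator.decode("utf-7")
print(same_text)                              # True
print(with_terminator.decode("utf-7"))        # abc:~~:zyz
```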

Source File Encoding

  • Widechar Unicode encodings are supported as source file encoding, in addition to standard Python encodings.

The default source file encoding is UTF-8. This also applies to bytestrings used within the program (processed by compile, eval, or exec). The source file encoding can be explicitly specified, and possibly changed, in one of two ways:

  1. By declaring the encoding in a Python comment in one of the first two lines — in accordance with PEP-263.
  2. By a byte-order-mark (BOM) — only for Unicode encodings.

CPython recognizes only UTF-8 BOM. IronPython recognizes BOM in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
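To illustrate what recognizing these BOMs involves, here is a sketch of BOM detection built on the constants in the standard codecs module. It is not IronPython's actual implementation, and the helper name `detect_bom` is hypothetical.

```python
import codecs

# Longer BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


source = codecs.BOM_UTF16_LE + "x = 1\n".encode("utf-16-le")
print(detect_bom(source))       # ('utf-16-le', 2)
print(detect_bom(b"x = 1\n"))   # (None, 0)
```

The ordering of the table matters: testing for UTF-16LE before UTF-32LE would misidentify every UTF-32LE file.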

If both the BOM and PEP-263 methods are used in the same file, they should specify the same encoding. If the PEP-263 encoding does not match the BOM, then:

  • In case of UTF-8 BOM, an error will be reported (by both CPython and IronPython).
  • In case of other BOMs, the encoding specified in the PEP-263 comment is silently ignored.