This is a super-technical edge case, but it recently bit me as part of a project I'm working on, and I have a fix ready.
Using an SDWriter that's connected to a StringIO raises a UnicodeDecodeError if the output SD data happens to include a multi-byte UTF-8 character on the boundary between two "blocks" (where I'm defining a block to be a set of bytes of length equal to the buffer size used in python_streambuf.h).
Here's a reproducer. The file utf.sdf:
some padding that is just the ri
RDKit 2D
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 12 12 0 0 0
M V30 BEGIN ATOM
M V30 1 C -1.08313 -0.7909 0 0
M V30 2 O -1.08393 -1.7909 0 0
M V30 3 O -1.94873 -0.2901 0 0
M V30 4 Cl -1.08153 1.2091 0 0
M V30 5 Cl 0.648067 -1.7921 0 0
M V30 6 N 0.651267 2.2079 0 0
M V30 7 C -0.216733 -0.2915 0 0
M V30 8 C -0.215933 0.7085 0 0
M V30 9 C 0.648867 -0.7921 0 0
M V30 10 C 0.650467 1.2079 0 0
M V30 11 C 1.51527 -0.2929 0 0
M V30 12 C 1.51607 0.7071 0 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 1 1 1 2
M V30 2 2 1 3
M V30 3 1 1 7
M V30 4 1 4 8
M V30 5 1 5 9
M V30 6 1 6 10
M V30 7 2 7 8
M V30 8 1 7 9
M V30 9 1 8 10
M V30 10 2 9 11
M V30 11 2 10 12
M V30 12 1 11 12
M V30 END BOND
M V30 END CTAB
M END
> <cas.rn> (1)
paddingpad
> <cas.index.name> (1)
paddingpaddingpaddingpaddingpadding
> <molecular.formula> (1)
C7H5Cl2NO2
> <molecular.weight> (1)
206.03
> <boiling.point.predicted> (1)
123.4±56.7 °C paddingpaddingp
$$$$
And the exception:
(py37_rdkit) glandrum@Badger:/scratch/RDKit_git/build_dbg$ python foo.py
Traceback (most recent call last):
File "foo.py", line 19, in <module>
print(convert_to_molblock(m,True,True))
File "foo.py", line 11, in convert_to_molblock
writer.write(mol)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1023: unexpected end of data
In this case the problem is the ° symbol, which is encoded in UTF-8 as the two bytes 0xc2 0xb0. The first byte, 0xc2, lands at the 1024th byte of the output (index 1023), so it is the last byte of a block. When the streambuf tries to convert that block to a Python string, Python complains because it can't find the second byte.
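The failure mode can be shown without RDKit at all. The sketch below is plain Python and only illustrates the mechanism; it is not the code in python_streambuf.h and not the actual fix. `BLOCK_SIZE` is assumed to be 1024 to match the position in the traceback, and the helper names are made up for the example. Decoding each block on its own fails when a character is split, while an incremental decoder holds the dangling byte until the next block arrives.

```python
import codecs

BLOCK_SIZE = 1024  # assumption: chosen to match "position 1023" in the traceback

# Build a payload whose 1024th byte (index 1023) is the first byte of "°".
payload = ("x" * (BLOCK_SIZE - 1) + "° C").encode("utf-8")
assert payload[BLOCK_SIZE - 1] == 0xC2 and payload[BLOCK_SIZE] == 0xB0


def blocks(data, size):
    """Yield consecutive fixed-size slices of data (hypothetical helper)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def decode_naively(data, size=BLOCK_SIZE):
    """Decode every block independently; raises if a character is split."""
    return "".join(block.decode("utf-8") for block in blocks(data, size))


def decode_incrementally(data, size=BLOCK_SIZE):
    """Decode block by block, carrying incomplete trailing bytes forward."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(block) for block in blocks(data, size)]
    parts.append(decoder.decode(b"", final=True))  # flush; raises on leftovers
    return "".join(parts)


naive_error = None
try:
    decode_naively(payload)
except UnicodeDecodeError as exc:
    naive_error = exc
    print(exc)

print(decode_incrementally(payload)[-3:])
```

The first call fails on the lone 0xc2 at the end of the first block, and the incremental version round-trips the text. Any fix has to do something equivalent: either avoid decoding a block that ends mid-character, or keep the incomplete bytes around until the rest arrive.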
* Fixes #3553
* add another test
* Apply suggestions from code review
Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>
* add an additional test for that
Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>