Error writing SDF data containing UTF-8 to a StringIO object #3553

greglandrum · 2020-11-06T15:28:03Z

Configuration:

RDKit Version: 2020.09.1
Operating system: linux
Python version (if relevant): 3.7

Description:

This is a super-technical edge case, but it recently bit me as part of a project I'm working on and I have a fix ready.

Using an SDWriter that's connected to a StringIO raises a UnicodeDecodeError if the output SD data happens to include a multi-byte UTF-8 character that straddles the boundary between two "blocks" (where I'm defining a block to be a set of bytes of length equal to the buffer size used in python_streambuf.h).

Here's a reproducible example.

The file foo.py:

from io import StringIO

from rdkit import Chem

def convert_to_molblock(mol, v3000, kekulize):
    sio = StringIO()
    writer = Chem.SDWriter(sio)
    writer.SetKekulize(kekulize)
    writer.SetForceV3000(v3000)
    writer.write(mol)
    writer.flush()
    try:
        return sio.getvalue()
    finally:
        writer.close()

m = next(Chem.SDMolSupplier('./utf.sdf'))
print(convert_to_molblock(m,True,True))

and the file utf.sdf:

some padding that is just the ri
     RDKit          2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 12 12 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.08313 -0.7909 0 0
M  V30 2 O -1.08393 -1.7909 0 0
M  V30 3 O -1.94873 -0.2901 0 0
M  V30 4 Cl -1.08153 1.2091 0 0
M  V30 5 Cl 0.648067 -1.7921 0 0
M  V30 6 N 0.651267 2.2079 0 0
M  V30 7 C -0.216733 -0.2915 0 0
M  V30 8 C -0.215933 0.7085 0 0
M  V30 9 C 0.648867 -0.7921 0 0
M  V30 10 C 0.650467 1.2079 0 0
M  V30 11 C 1.51527 -0.2929 0 0
M  V30 12 C 1.51607 0.7071 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 1 3
M  V30 3 1 1 7
M  V30 4 1 4 8
M  V30 5 1 5 9
M  V30 6 1 6 10
M  V30 7 2 7 8
M  V30 8 1 7 9
M  V30 9 1 8 10
M  V30 10 2 9 11
M  V30 11 2 10 12
M  V30 12 1 11 12
M  V30 END BOND
M  V30 END CTAB
M  END
>  <cas.rn>  (1) 
paddingpad

>  <cas.index.name>  (1) 
paddingpaddingpaddingpaddingpadding

>  <molecular.formula>  (1) 
C7H5Cl2NO2

>  <molecular.weight>  (1) 
206.03

>  <boiling.point.predicted>  (1) 
123.4±56.7 °C    paddingpaddingp

$$$$

And the exception:

(py37_rdkit) glandrum@Badger:/scratch/RDKit_git/build_dbg$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 19, in <module>
    print(convert_to_molblock(m,True,True))
  File "foo.py", line 11, in convert_to_molblock
    writer.write(mol)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1023: unexpected end of data

In this case the problem is the ° symbol, which is encoded in UTF-8 as the two bytes 0xc2 0xb0. The 0xc2 falls in the 1024th position (index 1023), at the very end of a block, and the 0xb0 falls at the start of the next one. When the streambuf tries to convert just that first block to a Python string, Python complains because it can't find the second byte of the character.
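
The same failure can be reproduced in pure Python, without RDKit. This is only a minimal sketch: the `BUFFER_SIZE` of 1024 is an assumption inferred from the error position in the traceback, and the payload is made-up padding rather than the SD data above. The sketch also shows one general way to decode block by block safely, using the standard library's incremental decoder. That is not necessarily how the fix in python_streambuf.h works.

```python
import codecs

# Assumed block size, inferred from "position 1023" in the traceback.
BUFFER_SIZE = 1024

# 1023 ASCII bytes followed by "°" (0xc2 0xb0), so that 0xc2 is the
# 1024th byte and 0xb0 is the first byte of the next block.
data = ("x" * (BUFFER_SIZE - 1) + "°C").encode("utf-8")
first_block, second_block = data[:BUFFER_SIZE], data[BUFFER_SIZE:]

# Decoding the first block on its own fails, because the block ends
# in the middle of a two-byte character.
try:
    first_block.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
print(error_message)

# An incremental decoder keeps the dangling 0xc2 in its internal state
# and emits the character once the next block supplies 0xb0.
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(first_block) + decoder.decode(second_block, final=True)
print(text[-2:])
```

The first `print` should report the same byte (0xc2) and position (1023) as the traceback above, while the incremental decoder recovers the full string, including the trailing "°C".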


* Fixes #3553
* add another test
* Apply suggestions from code review
  Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>
* add an additional test for that
  Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>

greglandrum added the bug label Nov 6, 2020

greglandrum added this to the 2020_09_2 milestone Nov 6, 2020

greglandrum added a commit to greglandrum/rdkit that referenced this issue Nov 6, 2020

Fixes rdkit#3553

3d5b0b9

greglandrum linked a pull request Nov 6, 2020 that will close this issue

Fixes #3553 #3554

Merged

bp-kelley closed this as completed in #3554 Nov 9, 2020

j-wags mentioned this issue Jan 25, 2021

RDKitToolkitWrapper from_file_obj sometimes prints a non-fatal error when reading from an SDF with stereochemistry openforcefield/openff-toolkit#814

Open
