Error writing SDF data containing UTF-8 to a StringIO object #3553

Closed
greglandrum opened this issue Nov 6, 2020 · 0 comments · Fixed by #3554

@greglandrum

Configuration:

  • RDKit Version: 2020.09.1
  • Operating system: linux
  • Python version (if relevant): 3.7

Description:

This is a super-technical edge case, but it recently bit me as part of a project I'm working on and I have a fix ready.

Using an SDWriter that's connected to a StringIO raises a UnicodeDecodeError if the output SD data happens to include a multi-byte UTF-8 character that straddles the boundary between two "blocks" (where I'm defining a block to be a set of bytes of length equal to the buffer size used in python_streambuf.h).

Here's a reproducible example.

The file foo.py:

from io import StringIO

from rdkit import Chem

def convert_to_molblock(mol, v3000, kekulize):
    sio = StringIO()
    writer = Chem.SDWriter(sio)
    writer.SetKekulize(kekulize)
    writer.SetForceV3000(v3000)
    writer.write(mol)
    writer.flush()
    try:
        return sio.getvalue()
    finally:
        writer.close()

m = next(Chem.SDMolSupplier('./utf.sdf'))
print(convert_to_molblock(m,True,True))

and the file utf.sdf:

some padding that is just the ri
     RDKit          2D

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 12 12 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -1.08313 -0.7909 0 0
M  V30 2 O -1.08393 -1.7909 0 0
M  V30 3 O -1.94873 -0.2901 0 0
M  V30 4 Cl -1.08153 1.2091 0 0
M  V30 5 Cl 0.648067 -1.7921 0 0
M  V30 6 N 0.651267 2.2079 0 0
M  V30 7 C -0.216733 -0.2915 0 0
M  V30 8 C -0.215933 0.7085 0 0
M  V30 9 C 0.648867 -0.7921 0 0
M  V30 10 C 0.650467 1.2079 0 0
M  V30 11 C 1.51527 -0.2929 0 0
M  V30 12 C 1.51607 0.7071 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 2 2 1 3
M  V30 3 1 1 7
M  V30 4 1 4 8
M  V30 5 1 5 9
M  V30 6 1 6 10
M  V30 7 2 7 8
M  V30 8 1 7 9
M  V30 9 1 8 10
M  V30 10 2 9 11
M  V30 11 2 10 12
M  V30 12 1 11 12
M  V30 END BOND
M  V30 END CTAB
M  END
>  <cas.rn>  (1) 
paddingpad

>  <cas.index.name>  (1) 
paddingpaddingpaddingpaddingpadding

>  <molecular.formula>  (1) 
C7H5Cl2NO2

>  <molecular.weight>  (1) 
206.03

>  <boiling.point.predicted>  (1) 
123.4±56.7 °C    paddingpaddingp

$$$$

And the exception:

(py37_rdkit) glandrum@Badger:/scratch/RDKit_git/build_dbg$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 19, in <module>
    print(convert_to_molblock(m,True,True))
  File "foo.py", line 11, in convert_to_molblock
    writer.write(mol)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 1023: unexpected end of data

In this case the problem is the ° symbol, which UTF-8 encodes as the two bytes 0xc2 0xb0. The 0xc2 falls in the 1024th position, so the block ends between the two bytes. When the streambuf tries to convert that block to a Python string, Python complains because it can't find the second byte.
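
The failure can be reproduced without RDKit. This is a minimal pure-Python sketch, not the code in python_streambuf.h and not the fix from #3554. The block size of 1024 is an assumption inferred from the position in the traceback, and `decode_naive` and `decode_incremental` are hypothetical helper names used only for illustration. The first helper decodes each block on its own and fails the same way. The second uses the standard library's incremental decoder, which holds back an incomplete trailing byte sequence until the next block arrives.

```python
import codecs

# Assumed block size, inferred from "position 1023" in the traceback above.
BLOCK = 1024

# 1023 ASCII bytes followed by "°" (0xc2 0xb0): 0xc2 is the 1024th byte.
data = ("x" * 1023 + "°C").encode("utf-8")


def decode_naive(raw, block=BLOCK):
    """Decode each block independently; fails if a character is split."""
    return "".join(
        raw[i:i + block].decode("utf-8") for i in range(0, len(raw), block)
    )


def decode_incremental(raw, block=BLOCK):
    """Decode block by block, carrying incomplete sequences forward."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(raw[i:i + block]) for i in range(0, len(raw), block)]
    # final=True makes the decoder raise if truncated bytes are still pending.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    try:
        decode_naive(data)
    except UnicodeDecodeError as exc:
        print("naive decoding failed:", exc)
    print("incremental decoding ends with:", decode_incremental(data)[-2:])
```

Any approach that avoids handing Python a block ending in the middle of a multi-byte sequence would work equally well; the incremental decoder is simply the shortest way to show the idea.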

@greglandrum greglandrum added the bug label Nov 6, 2020
@greglandrum greglandrum added this to the 2020_09_2 milestone Nov 6, 2020
greglandrum added a commit to greglandrum/rdkit that referenced this issue Nov 6, 2020
@greglandrum greglandrum linked a pull request Nov 6, 2020 that will close this issue
bp-kelley pushed a commit that referenced this issue Nov 9, 2020
* Fixes #3553

* add another test

* Apply suggestions from code review

Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>

* add an additional test for that

Co-authored-by: Paolo Tosco <paolo.tosco.mail@gmail.com>
greglandrum added a commit that referenced this issue Nov 24, 2020