Double-quote characters in regex patterns create invalid dataclasses #592

Closed
ScottKelly opened this issue Oct 1, 2021 · 1 comment · Fixed by #593
Labels
bug Something isn't working


@ScottKelly

Unexpected Behavior

If an XSD schema contains a pattern restriction with a double-quote character, similar to this:

	<xs:simpleType name="mytype">
		<xs:restriction base="xs:token">
			<xs:minLength value="1"/>
			<xs:maxLength value="256"/>
			<xs:pattern value="([^\\ \? &gt; &lt; \* / &quot; : |]{1,256})"/>
		</xs:restriction>
	</xs:simpleType>

xsdata generates a field whose metadata dict contains the pattern with an unescaped double-quote character, like this:

    mytype: Optional[str] = field(
        default=None,
        metadata={
            "type": "Attribute",
            "sa": Column(String),
            "required": True,
            "min_length": 1,
            "max_length": 256,
            "pattern": r"([^\\ \? > < \* / " : |]{1,256})",
        }
    )

The unescaped quote terminates the regex string early, so the generated module fails with a syntax error.
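To illustrate the failure, here is a minimal sketch that feeds the offending line to Python's built-in compile(). The helper name is_valid_python and the reduction of the generated field to a single assignment are assumptions made for this example; they are not part of xsdata.

```python
# The generated "pattern" entry, reduced to a single assignment for illustration.
broken_source = r'''pattern = r"([^\\ \? > < \* / " : |]{1,256})"'''


def is_valid_python(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError:
        return False
    return True


# The bare double-quote closes the raw string early, so compilation fails.
print(is_valid_python(broken_source))
```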

Expected Behavior

The generator should produce this instead:

    mytype: Optional[str] = field(
        default=None,
        metadata={
            "type": "Attribute",
            "sa": Column(String),
            "required": True,
            "min_length": 1,
            "max_length": 256,
            "pattern": r"([^\\ \? > < \* / \" : |]{1,256})",
        }
    )

which has the double-quote escaped in the regex pattern.
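As a quick check of why the escaped form is safe, the sketch below uses the pattern literal from the expected output. In a raw string the backslash before the quote stays in the value, and Python's re module reads an escaped quote as a literal double-quote, so the character class still excludes it. The sample inputs are made up for illustration.

```python
import re

# The pattern literal from the expected output.
pattern = r"([^\\ \? > < \* / \" : |]{1,256})"

# The raw string keeps the backslash, so the value contains the two characters \" .
has_escaped_quote = '\\"' in pattern

compiled = re.compile(pattern)

# A value without any excluded character matches in full.
plain_ok = compiled.fullmatch("report-2021.txt") is not None

# A value containing a double-quote is rejected by the character class.
quoted_rejected = compiled.fullmatch('say"hi"') is None

print(has_escaped_quote, plain_ok, quoted_rejected)
```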

The double-quote in the XML regex pattern is valid. The problem is that the dataclass file is generated with double-quotes around strings, but there is no logic to escape double-quotes inside regex patterns. Double-quotes are properly escaped for all other strings; only patterns are affected, because of the special-case if for patterns located here:

https://github.com/tefra/xsdata/blob/master/xsdata/formats/dataclass/filters.py#L433

Proposed Change

Reusing the existing behavior of text.escape_string wouldn't be valid for regexes, since we don't want to escape other characters. We only need to escape double-quotes in the pattern, and we shouldn't escape double-quotes that are already escaped.

I can create a PR where I add a simple regex sub like this:

        if key == "pattern":
            value = re.sub(r'([^\\])\"', r'\1\\"', data)
            return f'r"{value}"'

so we only escape double-quotes if they aren't already escaped.
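One thing worth checking before settling on the substitution: because it needs a non-backslash character in front of the quote, it appears to skip a quote at the very start of the pattern and the second of two adjacent quotes. The sketch below compares it with a negative-lookbehind variant. Both function names are hypothetical, and the lookbehind version is only a possible alternative, not necessarily what was merged.

```python
import re


def escape_quotes_proposed(data: str) -> str:
    # The substitution from this issue: it consumes the character before the quote.
    return re.sub(r'([^\\])\"', r'\1\\"', data)


def escape_quotes_lookbehind(data: str) -> str:
    # Hypothetical variant: the lookbehind checks the previous character
    # without consuming it, so leading and adjacent quotes are also escaped.
    return re.sub(r'(?<!\\)"', r'\\"', data)


for sample in ['a"b', r'a\"b', '"a', 'a""']:
    print(repr(sample), repr(escape_quotes_proposed(sample)), repr(escape_quotes_lookbehind(sample)))
```

Neither version handles a quote preceded by an escaped backslash (a pattern containing two backslashes followed by a double-quote); that case would need to count the preceding backslashes.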

Let me know if you like that solution and I'll create a PR.

@tefra
Owner

tefra commented Oct 4, 2021

Thanks for reporting, @ScottKelly. Please go ahead, I will accept your solution.

Down the road I want to add some sort of translation between xsd patterns and python regexes but that's not something I am looking forward to...

@tefra tefra added the bug Something isn't working label Oct 4, 2021
@tefra tefra closed this as completed in #593 Oct 9, 2021