How to validate big xml? #1351

Closed
iLLysion opened this issue Sep 14, 2015 · 7 comments

Comments

@iLLysion

Hello.
I need to validate an XML file against an XSD. When the file is not too big, I parse it into a document and validate that, and it works fine.
The problem comes when the file is big. Then I validate the file directly instead of a parsed document, but this kind of validation doesn't detect errors such as an unclosed double quote on an attribute, and I don't know how many other cases it might miss.

@flavorjones
Member

Hi, @iLLysion,

Can you provide some examples of what you're seeing? I'm not sure I understand what you're asking, and code will be much clearer.

@iLLysion
Author

Here is a code sample: http://pastebin.com/5VckXdZm
In the second case I have a problem with validation.

@flavorjones
Member

Hi @iLLysion,

I still don't understand what error you're seeing, or how the behavior of Nokogiri differs from what you expect. You'll need to help me understand in order to receive help.

@iLLysion
Author

I have a simple XML file like this: http://pastebin.com/69c50iS8
When I remove one double quote from the version attribute, like this:
<catalog version="123 xmlns="http://google.com"
I get no errors using the second variant of the validation.

@dsounded

Same for me. I can't find a fast way to validate large files.
Any suggestions?

@flavorjones
Member

@iLLysion I'm not trying to be obtuse, but you still haven't provided a complete working piece of code demonstrating what you're seeing, which makes it extremely difficult to help you.

Here, I've put together such a complete example, which I'll comment on below.

#! /usr/bin/env ruby

require 'nokogiri'
require 'tempfile'
require 'pp'

xsd_contents = <<EOXSD
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:annotation>
    <xsd:documentation xml:lang="en">
     Purchase order schema for Example.com.
     Copyright 2000 Example.com. All rights reserved.
    </xsd:documentation>
  </xsd:annotation>

  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

  <xsd:element name="comment" type="xsd:string"/>

  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items"  type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>

  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name"   type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city"   type="xsd:string"/>
      <xsd:element name="state"  type="xsd:string"/>
      <xsd:element name="zip"    type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN"
                   fixed="US"/>
  </xsd:complexType>

  <xsd:complexType name="Items">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="productName" type="xsd:string"/>
            <xsd:element name="quantity">
              <xsd:simpleType>
                <xsd:restriction base="xsd:positiveInteger">
                  <xsd:maxExclusive value="100"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <xsd:element name="USPrice"  type="xsd:decimal"/>
            <xsd:element ref="comment"   minOccurs="0"/>
            <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="partNum" type="SKU" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Stock Keeping Unit, a code for identifying products -->
  <xsd:simpleType name="SKU">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\\d{3}-[A-Z]{2}"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
EOXSD

valid_xml_contents = <<EOVALIDXML
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
   <shipTo country="US">
      <name>Alice Smith</name>
      <street>123 Maple Street</street>
      <city>Mill Valley</city>
      <state>CA</state>
      <zip>90952</zip>
   </shipTo>
   <billTo country="US">
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <city>Old Town</city>
      <state>PA</state>
      <zip>95819</zip>
   </billTo>
   <comment>Hurry, my lawn is going wild!</comment>
   <items>
      <item partNum="872-AA">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
         <comment>Confirm this is electric</comment>
      </item>
      <item partNum="926-AA">
         <productName>Baby Monitor</productName>
         <quantity>1</quantity>
         <USPrice>39.98</USPrice>
         <shipDate>1999-05-21</shipDate>
      </item>
   </items>
</purchaseOrder>
EOVALIDXML

# missing <city></city>
invalid_xml_contents = <<EOINVALIDXML
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
   <shipTo country="US">
      <name>Alice Smith</name>
      <street>123 Maple Street</street>
      <state>CA</state>
      <zip>90952</zip>
   </shipTo>
   <billTo country="US">
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <state>PA</state>
      <zip>95819</zip>
   </billTo>
   <comment>Hurry, my lawn is going wild!</comment>
   <items>
      <item partNum="872-AA">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
         <comment>Confirm this is electric</comment>
      </item>
      <item partNum="926-AA">
         <productName>Baby Monitor</productName>
         <quantity>1</quantity>
         <USPrice>39.98</USPrice>
         <shipDate>1999-05-21</shipDate>
      </item>
   </items>
</purchaseOrder>
EOINVALIDXML

# missing closing quote on the country attribute of shipTo (and, as above, missing <city></city>)
malformed_xml_contents = <<EOMALFORMEDXML
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
   <shipTo country="US>
      <name>Alice Smith</name>
      <street>123 Maple Street</street>
      <state>CA</state>
      <zip>90952</zip>
   </shipTo>
   <billTo country="US">
      <name>Robert Smith</name>
      <street>8 Oak Avenue</street>
      <state>PA</state>
      <zip>95819</zip>
   </billTo>
   <comment>Hurry, my lawn is going wild!</comment>
   <items>
      <item partNum="872-AA">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
         <comment>Confirm this is electric</comment>
      </item>
      <item partNum="926-AA">
         <productName>Baby Monitor</productName>
         <quantity>1</quantity>
         <USPrice>39.98</USPrice>
         <shipDate>1999-05-21</shipDate>
      </item>
   </items>
</purchaseOrder>
EOMALFORMEDXML

# validate valid document
xsd = Nokogiri::XML::Schema.new xsd_contents
valid_xml = Nokogiri::XML valid_xml_contents
errors = xsd.validate valid_xml
raise unless errors.length == 0

# validate invalid document
xsd = Nokogiri::XML::Schema.new xsd_contents
invalid_xml = Nokogiri::XML invalid_xml_contents
errors = xsd.validate invalid_xml
pp errors
# => [#<Nokogiri::XML::SyntaxError: Element 'state': This element is not expected. Expected is ( city ).>,
#     #<Nokogiri::XML::SyntaxError: Element 'state': This element is not expected. Expected is ( city ).>]

# validate malformed invalid document
xsd = Nokogiri::XML::Schema.new xsd_contents
malformed_xml = Nokogiri::XML malformed_xml_contents
# NOTE that here, nokogiri fixes malformed markup to be:
#    <shipTo country="US&gt;       "/><name>Alice Smith</name>
errors = xsd.validate malformed_xml
pp errors
# => [#<Nokogiri::XML::SyntaxError: Element 'shipTo', attribute 'country': 'US>' is not a valid value of the atomic type 'xs:NMTOKEN'.>,
#     #<Nokogiri::XML::SyntaxError: Element 'shipTo': Missing child element(s). Expected is ( name ).>,
#     #<Nokogiri::XML::SyntaxError: Element 'name': This element is not expected. Expected is ( billTo ).>]

# validate valid file
xsd = Nokogiri::XML::Schema.new xsd_contents
valid_xml_file = Tempfile.new "valid"
valid_xml_file.write valid_xml_contents
valid_xml_file.close

errors = xsd.validate valid_xml_file.path
raise unless errors.length == 0

# validate invalid file
xsd = Nokogiri::XML::Schema.new xsd_contents
invalid_xml_file = Tempfile.new "invalid"
invalid_xml_file.write invalid_xml_contents
invalid_xml_file.close

errors = xsd.validate invalid_xml_file.path
pp errors
# => [#<Nokogiri::XML::SyntaxError: Element 'state': This element is not expected. Expected is ( city ).>,
#     #<Nokogiri::XML::SyntaxError: Element 'state': This element is not expected. Expected is ( city ).>]


# validate malformed invalid file
xsd = Nokogiri::XML::Schema.new xsd_contents
malformed_xml_file = Tempfile.new "malformed"
malformed_xml_file.write malformed_xml_contents
malformed_xml_file.close

# note that the malformed xml file is still malformed
errors = xsd.validate malformed_xml_file.path
pp errors
# => [#<Nokogiri::XML::SyntaxError: Element 'purchaseOrder': Character content other than whitespace is not allowed because the content type is 'element-only'.>,
#     #<Nokogiri::XML::SyntaxError: Element 'purchaseOrder': Character content other than whitespace is not allowed because the content type is 'element-only'.>,
#     #<Nokogiri::XML::SyntaxError: Element 'purchaseOrder': Character content other than whitespace is not allowed because the content type is 'element-only'.>,
#     #<Nokogiri::XML::SyntaxError: Element 'purchaseOrder': Character content other than whitespace is not allowed because the content type is 'element-only'.>]

You're conflating "validity" with "well-formedness". The two things are not the same.

This example clearly shows errors when attempting to validate a malformed document; but they're not the same errors I get if I allow Nokogiri to fix the broken markup (making it well-formed) first.

If you are seeing something different, then you need to provide me with a complete example of working code including markup and schema.

If you pass in a malformed document to Schema#validate you're going to get inconsistent results. Don't use it unless you're sure the document is well-formed.

flavorjones added a commit that referenced this issue Sep 15, 2015
to make sure we're testing validation of invalid files.

Related to #1351.

@iLLysion
Author

Thank you. You've answered my question.
